Week Ending 7.27.2025
RESEARCH WATCH: 7.27.2025
Dual Path Learning -- learning from noise and context for medical image denoising
Medical imaging faces a persistent challenge: noise introduced by imaging devices can severely compromise diagnostic accuracy and patient outcomes. Traditional denoising approaches typically focus on either noise characteristics or contextual information, but rarely both simultaneously. This research introduces Dual-Pathway Learning (DPL), an innovative architecture that combines both noise and contextual information sources to achieve superior denoising performance. The model demonstrates versatility across multiple imaging modalities and noise types, showing a 3.35% PSNR improvement over a baseline UNet. Applications span radiology, pathology, and clinical diagnostics where image clarity is paramount for accurate disease detection and treatment planning.
Authors: Jitindra Fartiyal, Pedro Freire, Yasmeen Whayeb, James S. Wolffsohn, Sergei K. Turitsyn, Sergei G. Sokolov
Link: https://arxiv.org/abs/2507.19035v1
Date: 2025-07-d
Summary:
Medical imaging plays a critical role in modern healthcare, enabling clinicians to accurately diagnose diseases and develop effective treatment plans. However, noise, often introduced by imaging devices, can degrade image quality, leading to misinterpretation and compromised clinical outcomes. Existing denoising approaches typically rely either on noise characteristics or on contextual information from the image. Moreover, they are commonly developed and evaluated for a single imaging modality and noise type. Motivated by CNCL (Geng et al.), which integrates both noise and context, this study introduces a Dual-Pathway Learning (DPL) model architecture that effectively denoises medical images by leveraging both sources of information and fusing them to generate the final output. DPL is evaluated across multiple imaging modalities and various types of noise, demonstrating its robustness and generalizability. DPL improves PSNR by 3.35% compared to the baseline UNet when evaluated on Gaussian noise and trained across all modalities. The code is available at 10.5281/zenodo.15836053.
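The dual-pathway idea is easy to picture in code. Below is a minimal sketch, not the released DPL implementation: one convolutional branch estimates the noise, a second estimates the clean content, and a small fusion layer combines the residual-corrected image with the direct content estimate. Class names, depths, and the fusion rule are illustrative assumptions.

```python
import torch
import torch.nn as nn

class DualPathwayDenoiser(nn.Module):
    """Toy two-branch denoiser: a noise pathway and a context pathway, fused."""
    def __init__(self, channels: int = 1, width: int = 32):
        super().__init__()
        def branch() -> nn.Sequential:
            return nn.Sequential(
                nn.Conv2d(channels, width, 3, padding=1), nn.ReLU(),
                nn.Conv2d(width, width, 3, padding=1), nn.ReLU(),
                nn.Conv2d(width, channels, 3, padding=1),
            )
        self.noise_path = branch()    # learns the noise component
        self.context_path = branch()  # learns the underlying clean content
        self.fuse = nn.Conv2d(2 * channels, channels, 1)  # merges both estimates

    def forward(self, noisy: torch.Tensor) -> torch.Tensor:
        noise_est = self.noise_path(noisy)
        content_est = self.context_path(noisy)
        # Residual-corrected input and direct content estimate, fused per pixel.
        return self.fuse(torch.cat([noisy - noise_est, content_est], dim=1))

denoised = DualPathwayDenoiser()(torch.randn(1, 1, 64, 64))  # e.g. a CT patch
```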
--------------------------------------------------------------------------------------------------------
GEAR: Gaze-Enabled Human-Robot Collaborative Assembly
Manufacturing and assembly tasks require precise coordination between humans and robots, but current interfaces often create physical barriers to natural collaboration. GEAR introduces a revolutionary gaze-enabled system that transforms human-robot interaction by allowing robots to respond intuitively to user eye movements during assembly tasks. Through comprehensive testing with 30 participants across varying complexity scenarios, the system demonstrated reduced physical demand and enhanced user experience compared to traditional touchscreen interfaces. This technology has profound applications in manufacturing, assistive robotics, and any collaborative environment where seamless human-robot communication is essential for productivity and safety.
Authors: Asad Ali Shahid, Angelo Moroncelli, Drazen Brscic, Takayuki Kanda, Loris Roveda
Link: https://arxiv.org/abs/2507.18947v1
Date: 2025-07-d
Summary:
Recent progress in robot autonomy and safety has significantly improved human-robot interactions, enabling robots to work alongside humans on various tasks. However, complex assembly tasks still present significant challenges due to inherent task variability and the need for precise operations. This work explores deploying robots in an assistive role for such tasks, where the robot assists by fetching parts while the skilled worker provides high-level guidance and performs the assembly. We introduce GEAR, a gaze-enabled system designed to enhance human-robot collaboration by allowing robots to respond to the user's gaze. We evaluate GEAR against a touch-based interface where users interact with the robot through a touchscreen. The experimental study involved 30 participants working on two distinct assembly scenarios of varying complexity. Results demonstrated that GEAR enabled participants to accomplish the assembly with reduced physical demand and effort compared to the touchscreen interface, especially for complex tasks, while maintaining strong task performance and receiving objects effectively. Participants also reported an enhanced user experience while performing assembly tasks. Project page: sites.google.com/view/gear-hri
--------------------------------------------------------------------------------------------------------
Deep Learning for Blood-Brain Barrier Permeability Prediction
Neuropharmaceutical development faces a critical bottleneck: predicting whether drug molecules can cross the blood-brain barrier (BBB), a protective membrane that blocks most compounds from reaching the brain. Traditional methods rely on static physicochemical rules prone to systematic errors, while early machine learning approaches suffered from limited capacity and poor interpretability. This comprehensive review examines the evolution from basic neural networks to sophisticated graph-based models, highlighting multi-task learning and generative approaches. The research provides essential guidance for pharmaceutical companies developing neurological treatments, potentially accelerating drug discovery timelines and improving success rates in treating brain diseases.
Authors: Zihan Yang, Haipeng Gong
Link: https://arxiv.org/abs/2507.18557v1
Date: 2025-07-d
Summary:
Predicting whether a molecule can cross the blood-brain barrier (BBB) is a key step in early-stage neuropharmaceutical development, directly influencing both research efficiency and success rates in drug discovery. Traditional empirical methods based on physicochemical properties are prone to systematic misjudgements due to their reliance on static rules. Early machine learning models, although data-driven, often suffer from limited capacity, poor generalization, and insufficient interpretability. In recent years, artificial intelligence (AI) methods have become essential tools for predicting BBB permeability and guiding related drug design, owing to their ability to model molecular structures and capture complex biological mechanisms. This article systematically reviews the evolution of this field, from deep neural networks to graph-based structural modeling, highlighting the advantages of multi-task and multimodal learning strategies in identifying mechanism-relevant variables. We further explore the emerging potential of generative models and causal inference methods for integrating permeability prediction with mechanism-aware drug design. BBB modeling is in transition from static classification toward mechanistic perception and structure-function modeling. This paradigm shift provides a methodological foundation and future roadmap for the integration of AI into neuropharmacological development.
--------------------------------------------------------------------------------------------------------
GLANCE: Graph Logic Attention Network with Cluster Enhancement
Graph Neural Networks excel at learning from interconnected data but struggle with heterophilous graphs where connected nodes have different characteristics—a common real-world scenario. GLANCE addresses this limitation through an innovative framework combining logic-guided reasoning, dynamic graph refinement, and adaptive clustering. The system processes complex network structures more intelligently by incorporating interpretable embeddings and attention-based edge pruning. Demonstrated on benchmark datasets including Cornell, Texas, and Wisconsin, GLANCE offers robust solutions for social networks, citation analysis, fraud detection, and recommendation systems where traditional graph methods fail due to diverse node relationships and complex structural patterns.
Authors: Zhongtian Sun, Anoushka Harit, Alexandra Cristea, Christl A. Donnelly, Pietro Liò
Link: https://arxiv.org/abs/2507.18521v1
Date: 2025-07-d
Summary:
Graph Neural Networks (GNNs) have demonstrated significant success in learning from graph-structured data but often struggle on heterophilous graphs, where connected nodes differ in features or class labels. This limitation arises from indiscriminate neighbor aggregation and insufficient incorporation of higher-order structural patterns. To address these challenges, we propose GLANCE (Graph Logic Attention Network with Cluster Enhancement), a novel framework that integrates logic-guided reasoning, dynamic graph refinement, and adaptive clustering to enhance graph representation learning. GLANCE combines a logic layer for interpretable and structured embeddings, multi-head attention-based edge pruning for denoising graph structures, and clustering mechanisms for capturing global patterns. Experimental results in benchmark datasets, including Cornell, Texas, and Wisconsin, demonstrate that GLANCE achieves competitive performance, offering robust and interpretable solutions for heterophilous graph scenarios. The proposed framework is lightweight, adaptable, and uniquely suited to the challenges of heterophilous graphs.
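One of the ingredients above, attention-based edge pruning, can be sketched in a few lines: score each edge by an attention weight computed from its endpoint embeddings and keep only the strongest edges. This is a generic illustration under assumed shapes, not the GLANCE code; prune_edges and the keep ratio are hypothetical.

```python
import torch
import torch.nn as nn

def prune_edges(x: torch.Tensor, edge_index: torch.Tensor,
                att: nn.Module, keep_ratio: float = 0.5) -> torch.Tensor:
    """x: [N, d] node features; edge_index: [2, E] edges in COO format."""
    src, dst = edge_index
    # Attention score per edge from the concatenated endpoint embeddings.
    scores = att(torch.cat([x[src], x[dst]], dim=-1)).squeeze(-1)  # [E]
    k = max(1, int(keep_ratio * scores.numel()))
    keep = scores.topk(k).indices
    return edge_index[:, keep]  # denoised graph structure

x = torch.randn(6, 8)                      # 6 nodes, 8-dim features
edge_index = torch.randint(0, 6, (2, 12))  # 12 candidate edges
att = nn.Linear(16, 1)                     # simple edge scorer
pruned = prune_edges(x, edge_index, att)
```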
--------------------------------------------------------------------------------------------------------
An Empirical Study on Embodied Artificial Intelligence Robot (EAIR) Software Bugs
As embodied AI robots become increasingly prevalent in real-world applications, understanding their failure modes becomes critical for safe deployment. This groundbreaking study systematically analyzes 885 bugs from 80 EAIR projects, revealing unique challenges in robot software development. The research identifies 18 underlying causes and 15 distinct symptoms, with eight being specific to embodied AI systems. These EAIR-specific issues often involve severe functional failures and physical hazards stemming from AI reasoning and decision-making complexities. The findings provide essential insights for robotics engineers, safety regulators, and researchers developing autonomous systems across industries including healthcare, manufacturing, and domestic assistance.
Authors: Zeqin Liao, Zibin Zheng, Peifan Reng, Henglong Liang, Zixu Gao, Zhixiang Chen, Wei Li, Yuhong Nan
Link: https://arxiv.org/abs/2507.18267v1
Date: 2025-07-d
Summary:
Embodied Artificial Intelligence Robots (EAIR) constitute an emerging and rapidly evolving technological domain. Ensuring their program correctness is fundamental to their successful deployment. However, a general and in-depth understanding of EAIR system bugs remains lacking, which hinders the development of practices and techniques to tackle EAIR system bugs. To bridge this gap, we conducted the first systematic study of 885 EAIR system bugs collected from 80 EAIR system projects to investigate their symptoms, underlying causes, and module distribution. Our analysis classifies these bugs into 18 underlying causes and 15 distinct symptoms, and identifies 13 affected modules. It reveals several interesting new findings and implications that help shed light on future research on tackling or repairing EAIR system bugs. First, among the 15 identified symptoms, our findings highlight 8 symptoms specific to EAIR systems, which are characterized by severe functional failures and potential physical hazards. Second, within the 18 underlying causes, we define 8 EAIR-specific causes, the majority of which stem from the intricate issues of AI-agent reasoning and decision making. Finally, to facilitate precise and efficient bug prediction, detection, and repair, we constructed a mapping between underlying causes and the modules in which they most frequently occur, which enables researchers to focus diagnostic efforts on the modules most susceptible to specific bug types.
--------------------------------------------------------------------------------------------------------
AlphaGo Moment for Model Architecture Discovery
AI research advancement has been constrained by human cognitive limitations, creating a fundamental bottleneck in neural architecture innovation. ASI-Arch represents a paradigm shift—the first fully autonomous system capable of conducting scientific research in architecture discovery without human intervention. Moving beyond traditional Neural Architecture Search, this system can hypothesize, implement, and validate novel architectural concepts through rigorous experimentation. Through 1,773 autonomous experiments, ASI-Arch discovered 106 state-of-the-art linear attention architectures, establishing the first scaling law for scientific discovery itself. This breakthrough transforms AI research from human-limited to computation-scalable, with applications spanning automated research, accelerated innovation cycles, and autonomous scientific discovery.
Authors: Yixiu Liu, Yang Nan, Weixian Xu, Xiangkun Hu, Lyumanshan Ye, Zhen Qin, Pengfei Liu
Link: https://arxiv.org/abs/2507.18074v1
Date: 2025-07-d
Summary:
While AI systems demonstrate exponentially improving capabilities, the pace of AI research itself remains linearly bounded by human cognitive capacity, creating an increasingly severe development bottleneck. We present ASI-Arch, the first demonstration of Artificial Superintelligence for AI research (ASI4AI) in the critical domain of neural architecture discovery--a fully autonomous system that shatters this fundamental constraint by enabling AI to conduct its own architectural innovation. Moving beyond traditional Neural Architecture Search (NAS), which is fundamentally limited to exploring human-defined spaces, we introduce a paradigm shift from automated optimization to automated innovation. ASI-Arch can conduct end-to-end scientific research in the domain of architecture discovery, autonomously hypothesizing novel architectural concepts, implementing them as executable code, training and empirically validating their performance through rigorous experimentation and past experience. ASI-Arch conducted 1,773 autonomous experiments over 20,000 GPU hours, culminating in the discovery of 106 innovative, state-of-the-art (SOTA) linear attention architectures. Like AlphaGo's Move 37 that revealed unexpected strategic insights invisible to human players, our AI-discovered architectures demonstrate emergent design principles that systematically surpass human-designed baselines and illuminate previously unknown pathways for architectural innovation. Crucially, we establish the first empirical scaling law for scientific discovery itself--demonstrating that architectural breakthroughs can be scaled computationally, transforming research progress from a human-limited to a computation-scalable process. We provide comprehensive analysis of the emergent design patterns and autonomous research capabilities that enabled these breakthroughs, establishing a blueprint for self-accelerating AI systems.
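Stripped of the infrastructure, the closed loop the abstract describes (hypothesize, implement, train, validate, keep what wins) looks roughly like the sketch below. All four callables are hypothetical stand-ins; the real system adds code generation, experience reuse, and roughly 20,000 GPU hours of evaluation.

```python
def autonomous_search(propose, implement, evaluate, budget, baseline_score):
    """Toy outer loop for autonomous architecture discovery (illustrative only)."""
    archive = []  # validated (score, architecture description) pairs
    for _ in range(budget):
        idea = propose(archive)            # hypothesize a new architecture
        model = implement(idea)            # turn the idea into trainable code
        score = evaluate(model)            # train and measure on benchmarks
        if score > baseline_score:
            archive.append((score, idea))  # keep only validated improvements
    return sorted(archive, key=lambda t: t[0], reverse=True)
```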
--------------------------------------------------------------------------------------------------------
Group Sequence Policy Optimization
Training large language models with reinforcement learning presents significant stability and efficiency challenges, particularly with complex architectures like Mixture-of-Experts models. Group Sequence Policy Optimization (GSPO) introduces a novel approach that performs sequence-level rather than token-level optimization, fundamentally changing how importance ratios are calculated and applied. This methodology achieves superior training efficiency while notably stabilizing MoE reinforcement learning processes that traditionally suffer from training instabilities. GSPO's contributions to the latest Qwen3 models demonstrate its practical impact. Applications extend to chatbot development, language model alignment, and any scenario requiring stable reinforcement learning for natural language processing systems.
Authors: Chujie Zheng, Shixuan Liu, Mingze Li, Xiong-Hui Chen, Bowen Yu, Chang Gao, Kai Dang, Yuqiong Liu, Rui Men, An Yang, Jingren Zhou, Junyang Lin
Link: https://arxiv.org/abs/2507.18071v1
Date: 2025-07-d
Summary:
This paper introduces Group Sequence Policy Optimization (GSPO), our stable, efficient, and performant reinforcement learning algorithm for training large language models. Unlike previous algorithms that adopt token-level importance ratios, GSPO defines the importance ratio based on sequence likelihood and performs sequence-level clipping, rewarding, and optimization. We demonstrate that GSPO achieves superior training efficiency and performance compared to the GRPO algorithm, notably stabilizes Mixture-of-Experts (MoE) RL training, and has the potential for simplifying the design of RL infrastructure. These merits of GSPO have contributed to the remarkable improvements in the latest Qwen3 models.
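A minimal sketch of the sequence-level importance weighting described above is given below. The ratio is built from whole-sequence log-likelihoods (length-normalized here, one common stabilizing choice) and clipping is applied per sequence rather than per token; the tensor shapes and exact normalization are assumptions, not a transcription of the paper's objective.

```python
import torch

def gspo_surrogate_loss(logp_new: torch.Tensor, logp_old: torch.Tensor,
                        advantages: torch.Tensor, mask: torch.Tensor,
                        eps: float = 0.2) -> torch.Tensor:
    """logp_*: [B, T] per-token log-probs; mask: [B, T] valid-token mask;
    advantages: [B] sequence-level advantages."""
    lengths = mask.sum(dim=1).clamp(min=1)
    # One importance ratio per sequence, from the (length-normalized) likelihood.
    seq_ratio = torch.exp(((logp_new - logp_old) * mask).sum(dim=1) / lengths)
    clipped = seq_ratio.clamp(1 - eps, 1 + eps)
    # PPO-style clipped surrogate, applied at the sequence level.
    return -torch.min(seq_ratio * advantages, clipped * advantages).mean()
```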
--------------------------------------------------------------------------------------------------------
VIBE: Video-Input Brain Encoder for fMRI Response Modeling
Understanding how the human brain processes complex audiovisual information remains one of neuroscience's greatest challenges. VIBE introduces a sophisticated two-stage Transformer architecture that fuses video, audio, and text features to predict brain activity measured through fMRI. Trained on 65 hours of movie data, the system achieves strong correlation with measured brain responses, winning Phase 1 of the Algonauts 2025 Challenge and placing second overall. This breakthrough enables unprecedented insights into neural processing of multimedia content, with applications in brain-computer interfaces, cognitive neuroscience research, mental health diagnostics, and potentially revolutionary treatments for neurological disorders by understanding how healthy brains process complex sensory information.
Authors: Daniel Carlström Schad, Shrey Dixit, Janis Keck, Viktor Studenyak, Aleksandr Shpilevoi, Andrej Bicanski
Link: https://arxiv.org/abs/2507.17958v2
Date: 2025-07-d
Summary:
We present VIBE, a two-stage Transformer that fuses multi-modal video, audio, and text features to predict fMRI activity. Representations from open-source models (Qwen2.5, BEATs, Whisper, SlowFast, V-JEPA) are merged by a modality-fusion transformer and temporally decoded by a prediction transformer with rotary embeddings. Trained on 65 hours of movie data from the CNeuroMod dataset and ensembled across 20 seeds, VIBE attains mean parcel-wise Pearson correlations of 0.3225 on in-distribution Friends S07 and 0.2125 on six out-of-distribution films. An earlier iteration of the same architecture obtained 0.3198 and 0.2096, respectively, winning Phase-1 and placing second overall in the Algonauts 2025 Challenge.
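The two-stage structure is straightforward to skeleton: a fusion transformer mixes per-modality feature streams, then a temporal transformer maps the fused sequence to parcel-wise responses. The sketch below omits the rotary embeddings and ensembling mentioned above, and all sizes are illustrative assumptions rather than the VIBE configuration.

```python
import torch
import torch.nn as nn

class TwoStageEncoder(nn.Module):
    """Toy fusion-then-prediction stack in the spirit of VIBE (not its code)."""
    def __init__(self, d_model: int = 256, n_parcels: int = 1000):
        super().__init__()
        proto = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.fusion = nn.TransformerEncoder(proto, num_layers=2)    # across modalities
        self.temporal = nn.TransformerEncoder(proto, num_layers=2)  # across time
        self.head = nn.Linear(d_model, n_parcels)

    def forward(self, video, audio, text):
        # Each input: [T, d_model] features from a pretrained backbone.
        tokens = torch.stack([video, audio, text], dim=1)      # [T, modalities, d]
        fused = self.fusion(tokens).mean(dim=1).unsqueeze(0)   # [1, T, d]
        return self.head(self.temporal(fused)).squeeze(0)      # [T, n_parcels]

feats = [torch.randn(10, 256) for _ in range(3)]
pred = TwoStageEncoder()(*feats)  # predicted fMRI time series per parcel
```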
--------------------------------------------------------------------------------------------------------
Talk with the Things: Integrating LLMs into IoT Networks
The Internet of Things generates vast amounts of data but lacks intuitive interaction methods for everyday users. This research presents an edge-centric framework that integrates Large Language Models directly into IoT architectures, enabling natural language control and context-aware automation. By deploying lightweight RAG-based LLMs on edge devices, the system achieves reduced latency, enhanced privacy, and improved inference quality compared to cloud-based solutions. Smart home prototypes using LLaMA 3 and Gemma models demonstrate practical feasibility. Applications span smart cities, industrial automation, healthcare monitoring, and agricultural systems where natural language interaction with connected devices could revolutionize user experiences.
Authors: Alakesh Kalita
Link: https://arxiv.org/abs/2507.17865v1
Date: 2025-07-d
Summary:
The convergence of Large Language Models (LLMs) and Internet of Things (IoT) networks opens new opportunities for building intelligent, responsive, and user-friendly systems. This work presents an edge-centric framework that integrates LLMs into IoT architectures to enable natural language-based control, context-aware decision-making, and enhanced automation. The proposed modular and lightweight Retrieval Augmented Generation (RAG)-based LLMs are deployed on edge computing devices connected to IoT gateways, enabling local processing of user commands and sensor data for reduced latency, improved privacy, and enhanced inference quality. We validate the framework through a smart home prototype using LLaMA 3 and Gemma 2B models for controlling smart devices. Experimental results highlight the trade-offs between model accuracy and inference time with respect to model size. Finally, we discuss potential applications of LLM-based IoT systems and a few key challenges associated with them.
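At its core the framework is a retrieve-then-generate loop running next to the IoT gateway. The sketch below illustrates that loop; embed and generate_local are hypothetical stand-ins for whatever embedding model and on-device LLM runtime (e.g., a quantized LLaMA 3 or Gemma build) a deployment uses, and the prompt format is invented for illustration.

```python
import numpy as np

def retrieve(query_vec: np.ndarray, doc_vecs: np.ndarray, docs: list, k: int = 3):
    """Cosine-similarity retrieval over a small on-device document store."""
    sims = doc_vecs @ query_vec / (
        np.linalg.norm(doc_vecs, axis=1) * np.linalg.norm(query_vec) + 1e-9)
    return [docs[i] for i in np.argsort(-sims)[:k]]

def answer(command: str, docs, doc_vecs, embed, generate_local) -> str:
    """Build a RAG prompt from device docs and sensor context, answer locally."""
    context = "\n".join(retrieve(embed(command), doc_vecs, docs))
    prompt = (f"Relevant device state and documentation:\n{context}\n\n"
              f"User command: {command}\nDecide which device action to take.")
    return generate_local(prompt)  # runs on the edge device, not in the cloud
```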
--------------------------------------------------------------------------------------------------------
Pretraining on the Test Set Is No Longer All You Need: A Debate-Driven Approach to QA Benchmarks
As advanced language models increasingly saturate standard QA benchmarks, concerns about data contamination and memorization undermine evaluation validity. This research introduces a revolutionary debate-driven evaluation paradigm that transforms existing QA datasets into structured adversarial debates between models. One model defends the official answer while another constructs alternative responses, with a blind judge determining quality. This approach substantially increases difficulty while penalizing shallow memorization, demonstrating that models fine-tuned on test questions perform worse in debates despite higher accuracy scores. Applications include robust AI evaluation, educational assessment tools, and developing more reliable benchmarks for measuring genuine reasoning capabilities.
Authors: Linbo Cao, Jinman Zhao
Link: https://arxiv.org/abs/2507.17747v1
Date: 2025-07-d
Summary:
As frontier language models increasingly saturate standard QA benchmarks, concerns about data contamination, memorization, and escalating dataset creation costs persist. We propose a debate-driven evaluation paradigm that transforms any existing QA dataset into structured adversarial debates--where one model is given the official answer to defend, and another constructs and defends an alternative answer--adjudicated by a judge model blind to the correct solution. By forcing multi-round argumentation, this approach substantially increases difficulty while penalizing shallow memorization, yet reuses QA items to reduce curation overhead. We make two main contributions: (1) an evaluation pipeline to systematically convert QA tasks into debate-based assessments, and (2) a public benchmark that demonstrates our paradigm's effectiveness on a subset of MMLU-Pro questions, complete with standardized protocols and reference models. Empirical results validate the robustness of the method and its effectiveness against data contamination--a Llama 3.1 model fine-tuned on test questions showed dramatic accuracy improvements (50% -> 82%) but performed worse in debates. Results also show that even weaker judges can reliably differentiate stronger debaters, highlighting how debate-based evaluation can scale to future, more capable systems while maintaining a fraction of the cost of creating new benchmarks. Overall, our framework underscores that "pretraining on the test set is no longer all you need," offering a sustainable path for measuring the genuine reasoning ability of advanced language models.
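The conversion itself needs little more than prompt plumbing. Below is a minimal sketch of one debate episode; the three model arguments are callables wrapping chat models (hypothetical stand-ins, not the released pipeline), and the round structure and prompts are illustrative rather than the paper's exact protocol.

```python
def debate_eval(question: str, official_answer: str,
                defender, challenger, judge, rounds: int = 2) -> str:
    """One QA item turned into a blind-judged debate between two models."""
    transcript = [f"Question: {question}"]
    transcript.append("A: " + defender(
        f"{question}\nDefend this answer: {official_answer}"))
    transcript.append("B: " + challenger(
        f"{question}\nPropose and defend an alternative answer."))
    for _ in range(rounds):
        transcript.append("A: " + defender("\n".join(transcript) + "\nRebut B."))
        transcript.append("B: " + challenger("\n".join(transcript) + "\nRebut A."))
    # The judge never learns which side was handed the official answer.
    return judge("\n".join(transcript) + "\nWhich debater argued better, A or B?")
```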
--------------------------------------------------------------------------------------------------------
Rubrics as Rewards: Reinforcement Learning Beyond Verifiable Domains
Reinforcement learning traditionally relies on clear, verifiable reward signals, but many real-world tasks involve subjective evaluation criteria without single correct answers. This research introduces "Rubrics as Rewards," a framework using structured, checklist-style rubrics as interpretable reward signals for language model training. Unlike opaque preference-based methods prone to spurious correlations, RaR provides transparent evaluation criteria that achieve up to 28% relative improvement on HealthBench-1k. The approach enables smaller judge models to better align with human preferences across different scales. Applications include educational assessment, creative writing evaluation, medical diagnosis support, and any domain requiring nuanced, multi-criteria evaluation.
Authors: Anisha Gunjal, Anthony Wang, Elaine Lau, Vaskar Nath, Bing Liu, Sean Hendryx
Link: https://arxiv.org/abs/2507.17746v1
Date: 2025-07-d
Summary:
Extending Reinforcement Learning with Verifiable Rewards (RLVR) to real-world tasks often requires balancing objective and subjective evaluation criteria. However, many such tasks lack a single, unambiguous ground truth, making it difficult to define reliable reward signals for post-training language models. While traditional preference-based methods offer a workaround, they rely on opaque reward functions that are difficult to interpret and prone to spurious correlations. We introduce Rubrics as Rewards (RaR), a framework that uses structured, checklist-style rubrics as interpretable reward signals for on-policy training with GRPO. Our best RaR method yields up to a 28% relative improvement on HealthBench-1k compared to simple Likert-based approaches, while matching or surpassing the performance of reward signals derived from expert-written references. By treating rubrics as structured reward signals, we show that RaR enables smaller-scale judge models to better align with human preferences and sustain robust performance across model scales.
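Conceptually, a rubric reward is just a weighted checklist scored by a judge model. The sketch below shows that shape; judge is a hypothetical yes/no callable, and the example criteria and weights are invented for illustration rather than taken from the paper.

```python
def rubric_reward(prompt: str, response: str, rubric, judge) -> float:
    """rubric: list of (criterion_text, weight); returns a reward in [0, 1]."""
    total = sum(weight for _, weight in rubric)
    earned = sum(
        weight for criterion, weight in rubric
        if judge(f"Prompt: {prompt}\nResponse: {response}\n"
                 f"Does the response satisfy: {criterion}? Answer yes or no.")
    )
    return earned / total  # interpretable scalar, usable as a GRPO-style reward

example_rubric = [
    ("States the key safety caveat", 2.0),            # illustrative criteria only
    ("Gives a concrete, actionable next step", 1.0),
]
```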
--------------------------------------------------------------------------------------------------------
Yume: An Interactive World Generation Model
Creating immersive, interactive digital worlds from simple inputs represents a frontier in AI-generated content. Yume transforms static images into dynamic, explorable environments using advanced video generation techniques. The system employs Masked Video Diffusion Transformer architecture with memory modules for infinite video generation, combined with novel sampling methods for enhanced visual quality and control precision. Users can explore these generated worlds through keyboard inputs, with future plans for peripheral devices and neural signal integration. Applications span gaming, virtual reality, architectural visualization, training simulations, and entertainment where dynamic world creation from minimal input could revolutionize content generation workflows.
Authors: Xiaofeng Mao, Shaoheng Lin, Zhen Li, Chuanhao Li, Wenshuo Peng, Tong He, Jiangmiao Pang, Mingmin Chi, Yu Qiao, Kaipeng Zhang
Link: https://arxiv.org/abs/2507.17744v1
Date: 2025-07-d
Summary:
Yume aims to use images, text, or videos to create an interactive, realistic, and dynamic world, which allows exploration and control using peripheral devices or neural signals. In this report, we present a preview version of Yume, which creates a dynamic world from an input image and allows exploration of the world using keyboard actions. To achieve this high-fidelity and interactive video world generation, we introduce a well-designed framework, which consists of four main components, including camera motion quantization, video generation architecture, advanced sampler, and model acceleration. First, we quantize camera motions for stable training and user-friendly interaction using keyboard inputs. Then, we introduce the Masked Video Diffusion Transformer (MVDT) with a memory module for infinite video generation in an autoregressive manner. After that, training-free Anti-Artifact Mechanism (AAM) and Time Travel Sampling based on Stochastic Differential Equations (TTS-SDE) are introduced to the sampler for better visual quality and more precise control. Moreover, we investigate model acceleration by synergistic optimization of adversarial distillation and caching mechanisms. We use the high-quality world exploration dataset Sekai to train Yume, and it achieves remarkable results in diverse scenes and applications. All data, codebase, and model weights are available on https://github.com/stdstu12/YUME. Yume will update monthly to achieve its original goal. Project page: https://stdstu12.github.io/YUME-Project/.
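The camera-motion quantization step is what makes keyboard control possible: continuous camera trajectories are collapsed into a small discrete action vocabulary that conditions the generator. The mapping below is a toy illustration; the key bindings and token ids are assumptions, not Yume's actual action set.

```python
# Hypothetical key-to-motion-token mapping for keyboard-driven exploration.
KEY_TO_MOTION = {
    "w": 0,  # move forward
    "s": 1,  # move backward
    "a": 2,  # pan left
    "d": 3,  # pan right
    "q": 4,  # tilt up
    "e": 5,  # tilt down
}

def keys_to_tokens(key_presses: str) -> list:
    """Convert key presses into conditioning tokens, ignoring unmapped keys."""
    return [KEY_TO_MOTION[k] for k in key_presses if k in KEY_TO_MOTION]

print(keys_to_tokens("wwad"))  # [0, 0, 2, 3]
```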
--------------------------------------------------------------------------------------------------------
On the Interaction of Compressibility and Adversarial Robustness
Modern neural networks must balance multiple competing objectives: accuracy, efficiency, and security against adversarial attacks. This research provides the first principled framework analyzing how different compression methods—neuron-level sparsity and spectral compression—fundamentally impact adversarial robustness. The study reveals that compression creates highly sensitive directions in representation space that adversaries can exploit, regardless of how compression is achieved. Through theoretical analysis and empirical validation, the work demonstrates a fundamental tension between structured efficiency and security. Applications include secure model deployment, edge computing where efficiency is crucial, and developing compression techniques that maintain robustness for safety-critical applications.
Authors: Melih Barsbey, Antônio H. Ribeiro, Umut Şimşekli, Tolga Birdal
Link: https://arxiv.org/abs/2507.17725v1
Date: 2025-07-d
Summary:
Modern neural networks are expected to simultaneously satisfy a host of desirable properties: accurate fitting to training data, generalization to unseen inputs, parameter and computational efficiency, and robustness to adversarial perturbations. While compressibility and robustness have each been studied extensively, a unified understanding of their interaction still remains elusive. In this work, we develop a principled framework to analyze how different forms of compressibility - such as neuron-level sparsity and spectral compressibility - affect adversarial robustness. We show that these forms of compression can induce a small number of highly sensitive directions in the representation space, which adversaries can exploit to construct effective perturbations. Our analysis yields a simple yet instructive robustness bound, revealing how neuron and spectral compressibility impact $L_\infty$ and $L_2$ robustness via their effects on the learned representations. Crucially, the vulnerabilities we identify arise irrespective of how compression is achieved - whether via regularization, architectural bias, or implicit learning dynamics. Through empirical evaluations across synthetic and realistic tasks, we confirm our theoretical predictions, and further demonstrate that these vulnerabilities persist under adversarial training and transfer learning, and contribute to the emergence of universal adversarial perturbations. Our findings show a fundamental tension between structured compressibility and robustness, and suggest new pathways for designing models that are both efficient and secure.
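The intuition behind the "sensitive directions" finding can already be seen in the linear case. The display below is a standard spectral-norm argument, not the paper's actual bound: when compression concentrates a layer's spectral mass in a few directions, perturbations aligned with those directions are amplified the most.

```latex
% For a linear layer f(x) = Wx with SVD  W = \sum_i \sigma_i u_i v_i^{\top},
% any input perturbation \delta satisfies
\[
  \|f(x+\delta) - f(x)\|_2 \;=\; \|W\delta\|_2 \;\le\; \sigma_{\max}(W)\,\|\delta\|_2 ,
\]
% with equality when \delta is aligned with the top right-singular vector v_1.
% Spectral compression keeps only a few large \sigma_i, so those few directions
% become the highly sensitive ones an adversary can target.
```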
--------------------------------------------------------------------------------------------------------
HOTA: Hamiltonian framework for Optimal Transport Advection
Optimal transport theory provides elegant mathematical frameworks for guiding probability flows, but current generative models often assume trivial geometries and rely on strong assumptions that don't respect true optimality principles. HOTA introduces a Hamilton-Jacobi-Bellman based method that tackles optimal transport through Kantorovich potentials, enabling efficient trajectory optimization without explicit density modeling. The approach handles non-smooth cost functions and outperforms existing methods on standard benchmarks. Applications span generative modeling, image-to-image translation, domain adaptation, and any scenario requiring optimal probability mass transportation while respecting underlying manifold geometry and complex cost structures.
Authors: Nazar Buzun, Daniil Shlenskii, Maxim Bobrin, Dmitry V. Dylov
Link: https://arxiv.org/abs/2507.17513v1
Date: 2025-07-d
Summary:
Optimal transport (OT) has become a natural framework for guiding the probability flows. Yet, the majority of recent generative models assume trivial geometry (e.g., Euclidean) and rely on strong density-estimation assumptions, yielding trajectories that do not respect the true principles of optimality in the underlying manifold. We present Hamiltonian Optimal Transport Advection (HOTA), a Hamilton-Jacobi-Bellman based method that tackles the dual dynamical OT problem explicitly through Kantorovich potentials, enabling efficient and scalable trajectory optimization. Our approach effectively evades the need for explicit density modeling, performing even when the cost functionals are non-smooth. Empirically, HOTA outperforms all baselines in standard benchmarks, as well as in custom datasets with non-differentiable costs, both in terms of feasibility and optimality.
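For readers who want the underlying structure, the display below sketches the dual dynamical OT picture for the quadratic cost, as background rather than the authors' general formulation: the Kantorovich-type potential evolves by a Hamilton-Jacobi-Bellman equation, and mass is advected along its gradient, which is what "advection through Kantorovich potentials" refers to.

```latex
% Quadratic-cost special case, shown for context only (HOTA targets more
% general, possibly non-smooth costs):
\[
  \partial_t \varphi(t,x) \;+\; \tfrac{1}{2}\,\bigl\|\nabla_x \varphi(t,x)\bigr\|^2 \;=\; 0,
  \qquad
  \dot{x}(t) \;=\; \nabla_x \varphi\bigl(t, x(t)\bigr),
\]
% with the terminal potential \varphi(1,\cdot) tied to the Kantorovich potential
% of the target distribution, so trajectories follow the optimal flow of mass.
```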
--------------------------------------------------------------------------------------------------------
Demonstration of Efficient Predictive Surrogates for Large-scale Quantum Processors
Quantum processors represent revolutionary computational potential but remain extremely expensive and rare, limiting widespread research access. This work introduces "predictive surrogates"—classical learning models that emulate quantum processor behavior with provable computational efficiency. The approach dramatically reduces quantum resource requirements while maintaining performance, demonstrated on 20-qubit superconducting processors. Applications in variational quantum eigensolvers and Floquet symmetry-protected topological phases show orders of magnitude reduction in measurement overhead. This breakthrough enables broader quantum research participation, accelerated algorithm development, and practical quantum advantage exploration without requiring direct access to expensive quantum hardware.
Authors: Wei-You Liao, Yuxuan Du, Xinbiao Wang, Tian-Ci Tian, Yong Luo, Bo Du, Dacheng Tao, He-Liang Huang
Link: https://arxiv.org/abs/2507.17470v1
Date: 2025-07-d
Summary:
The ongoing development of quantum processors is driving breakthroughs in scientific discovery. Despite this progress, the formidable cost of fabricating large-scale quantum processors means they will remain rare for the foreseeable future, limiting their widespread application. To address this bottleneck, we introduce the concept of predictive surrogates, which are classical learning models designed to emulate the mean-value behavior of a given quantum processor with provably computational efficiency. In particular, we propose two predictive surrogates that can substantially reduce the need for quantum processor access in diverse practical scenarios. To demonstrate their potential in advancing digital quantum simulation, we use these surrogates to emulate a quantum processor with up to 20 programmable superconducting qubits, enabling efficient pre-training of variational quantum eigensolvers for families of transverse-field Ising models and identification of non-equilibrium Floquet symmetry-protected topological phases. Experimental results reveal that the predictive surrogates not only reduce measurement overhead by orders of magnitude, but can also surpass the performance of conventional, quantum-resource-intensive approaches. Collectively, these findings establish predictive surrogates as a practical pathway to broadening the impact of advanced quantum processors.
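The surrogate idea itself is simple to demonstrate: fit a classical regressor on a limited set of (circuit parameters, measured expectation value) pairs and query it in place of the processor, e.g. when pre-training a variational quantum eigensolver. The sketch below uses a kernel-ridge regressor and a toy analytic target as stand-ins; the authors' surrogates come with efficiency guarantees this toy does not.

```python
import numpy as np
from sklearn.kernel_ridge import KernelRidge

rng = np.random.default_rng(0)
theta = rng.uniform(-np.pi, np.pi, size=(200, 4))  # variational circuit parameters
# Stand-in for expectation values measured on hardware (toy target plus shot noise).
energy = np.cos(theta).prod(axis=1) + 0.05 * rng.normal(size=200)

# Classical surrogate trained on the limited hardware data.
surrogate = KernelRidge(kernel="rbf", alpha=1e-3, gamma=0.5).fit(theta, energy)

candidate = rng.uniform(-np.pi, np.pi, size=(1, 4))
print(surrogate.predict(candidate))  # cheap classical estimate, no QPU access needed
```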
--------------------------------------------------------------------------------------------------------
Our Cars Can Talk: How IoT Brings AI to Vehicles
Vehicle maintenance has traditionally been reactive, responding to failures after they occur. This conceptual framework envisions transforming automotive maintenance through IoT-enabled sensing platforms integrated with AI copilots that communicate effectively with both vehicle systems and drivers. By treating vehicles as comprehensive sensing platforms, the approach enables proactive maintenance predictions and intelligent user interaction. The integration of AI systems that understand both machine diagnostics and human communication patterns could revolutionize automotive care. Applications span fleet management, personal vehicle ownership, autonomous vehicle development, and creating smarter transportation ecosystems where vehicles become intelligent partners rather than passive tools.
Authors: Amod Kant Agrawal
Link: https://arxiv.org/abs/2507.17214v1
Date: 2025-07-d
Summary:
Bringing AI to vehicles and enabling them as sensing platforms is key to transforming maintenance from reactive to proactive. Now is the time to integrate AI copilots that speak both languages: machine and driver. This article offers a conceptual and technical perspective intended to spark interdisciplinary dialogue and guide future research and development in intelligent vehicle systems, predictive maintenance, and AI-powered user interaction.
--------------------------------------------------------------------------------------------------------
JAM: Keypoint-Guided Joint Prediction after Classification-Aware Marginal Proposal
Autonomous driving systems must accurately predict the future movements of multiple road participants simultaneously, but current methods struggle with generating high-quality predictions for low-probability scenarios. JAM introduces a two-stage framework that first classifies trajectories by type to ensure comprehensive mode coverage, then performs joint prediction using keypoint guidance for critical trajectory information. Extensive testing on the Waymo Open Motion Dataset demonstrates state-of-the-art performance in interactive trajectory prediction. Applications include autonomous vehicle navigation, traffic simulation, urban planning, and any multi-agent system requiring accurate prediction of complex interactions between multiple moving entities.
Authors: Fangze Lin, Ying He, Fei Yu, Hong Zhang
Link: https://arxiv.org/abs/2507.17152v1
Date: 2025-07-d
Summary:
Predicting the future motion of road participants is a critical task in autonomous driving. In this work, we address the challenge of low-quality generation of low-probability modes in multi-agent joint prediction. To tackle this issue, we propose a two-stage multi-agent interactive prediction framework named keypoint-guided joint prediction after classification-aware marginal proposal (JAM). The first stage is modeled as a marginal prediction process, which classifies queries by trajectory type to encourage the model to learn all categories of trajectories, providing comprehensive mode information for the joint prediction module. The second stage is modeled as a joint prediction process, which takes the scene context and the marginal proposals from the first stage as inputs to learn the final joint distribution. We explicitly introduce key waypoints to guide the joint prediction module in better capturing and leveraging the critical information from the initial predicted trajectories. We conduct extensive experiments on the real-world Waymo Open Motion Dataset interactive prediction benchmark. The results show that our approach achieves competitive performance. In particular, in the framework comparison experiments, the proposed JAM outperforms other prediction frameworks and achieves state-of-the-art performance in interactive trajectory prediction. The code is available at https://github.com/LinFunster/JAM to facilitate future research.
--------------------------------------------------------------------------------------------------------
ScSAM: Debiasing Morphology and Distributional Variability in Subcellular Semantic Segmentation
Subcellular component analysis faces significant challenges due to extreme morphological and distributional variability among organelles, leading to biased feature learning in segmentation models. ScSAM enhances the Segment Anything Model by incorporating Masked Autoencoder-guided cellular prior knowledge and introducing feature alignment mechanisms for robust representation fusion. The method includes a cosine similarity-based class prompt encoder for activating category-specific features. Demonstrated across diverse subcellular datasets, ScSAM outperforms existing methods by addressing data imbalance and capturing fine-grained spatial details. Applications span cell biology research, medical diagnostics, drug discovery, and any biomedical imaging scenario requiring precise organelle identification and analysis.
Authors: Bo Fang, Jianan Fan, Dongnan Liu, Hang Chang, Gerald J. Shami, Filip Braet, Weidong Cai
Link: https://arxiv.org/abs/2507.17149v1
Date: 2025-07-d
Summary:
The significant morphological and distributional variability among subcellular components poses a long-standing challenge for learning-based organelle segmentation models, significantly increasing the risk of biased feature learning. Existing methods often rely on single mapping relationships, overlooking feature diversity and thereby inducing biased training. Although the Segment Anything Model (SAM) provides rich feature representations, its application to subcellular scenarios is hindered by two key challenges: (1) The variability in subcellular morphology and distribution creates gaps in the label space, leading the model to learn spurious or biased features. (2) SAM focuses on global contextual understanding and often ignores fine-grained spatial details, making it challenging to capture subtle structural alterations and cope with skewed data distributions. To address these challenges, we introduce ScSAM, a method that enhances feature robustness by fusing pre-trained SAM with Masked Autoencoder (MAE)-guided cellular prior knowledge to alleviate training bias from data imbalance. Specifically, we design a feature alignment and fusion module to align pre-trained embeddings to the same feature space and efficiently combine different representations. Moreover, we present a cosine similarity matrix-based class prompt encoder to activate class-specific features to recognize subcellular categories. Extensive experiments on diverse subcellular image datasets demonstrate that ScSAM outperforms state-of-the-art methods.
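One piece of the design, the cosine-similarity class prompt, can be sketched compactly: fused per-pixel embeddings are compared against learned class embeddings, and the resulting similarity maps act as class-specific prompts. Shapes and names below are assumptions, not the ScSAM implementation.

```python
import torch
import torch.nn.functional as F

def class_prompt(features: torch.Tensor, class_emb: torch.Tensor) -> torch.Tensor:
    """features: [B, D, H, W] fused SAM/MAE features; class_emb: [C, D] learned classes."""
    f = F.normalize(features.flatten(2), dim=1)      # [B, D, H*W], unit-norm over D
    c = F.normalize(class_emb, dim=1)                # [C, D]
    sim = torch.einsum("bdn,cd->bcn", f, c)          # cosine similarity per pixel
    return sim.reshape(features.size(0), -1, *features.shape[2:])  # [B, C, H, W]

prompts = class_prompt(torch.randn(2, 64, 32, 32), torch.randn(5, 64))
```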
--------------------------------------------------------------------------------------------------------
LoRA is All You Need for Safety Alignment of Reasoning LLMs
Advanced reasoning language models demonstrate remarkable problem-solving capabilities but require safety alignment to prevent harmful outputs. However, traditional safety fine-tuning significantly degrades reasoning abilities—a phenomenon called "Safety Tax." This research demonstrates that Low-Rank Adaptation (LoRA) for safety fine-tuning effectively aligns models without compromising reasoning capabilities by restricting safety updates to low-rank spaces, minimizing interference with reasoning weights. Extensive evaluation across mathematics, science, and coding benchmarks shows safety levels comparable to full-model fine-tuning while preserving reasoning performance. Applications include deploying safe AI assistants, educational tools, and reasoning systems where both safety and capability are essential.
Authors: Yihao Xue, Baharan Mirzasoleiman
Link: https://arxiv.org/abs/2507.17075v1
Date: 2025-07-d
Summary:
Reasoning LLMs have demonstrated remarkable breakthroughs in solving complex problems that were previously out of reach. To ensure LLMs do not assist with harmful requests, safety alignment fine-tuning is necessary in the post-training phase. However, safety alignment fine-tuning has recently been shown to significantly degrade reasoning abilities, a phenomenon known as the "Safety Tax". In this work, we show that using LoRA for SFT on refusal datasets effectively aligns the model for safety without harming its reasoning capabilities. This is because restricting the safety weight updates to a low-rank space minimizes the interference with the reasoning weights. Our extensive experiments across four benchmarks covering math, science, and coding show that this approach produces highly safe LLMs -- with safety levels comparable to full-model fine-tuning -- without compromising their reasoning abilities. Additionally, we observe that LoRA induces weight updates with smaller overlap with the initial weights compared to full-model fine-tuning. We also explore methods that further reduce such overlap -- via regularization or during weight merging -- and observe some improvement on certain tasks. We hope this result motivates designing approaches that yield more consistent improvements in the reasoning-safety trade-off.
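The mechanism the paper leans on is the standard LoRA parameterization: the safety update lives entirely in a low-rank product added to a frozen weight matrix. The sketch below shows that parameterization for a single linear layer; rank, scaling, and initialization are common defaults rather than the paper's training configuration.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen pretrained linear layer with a trainable low-rank update B @ A."""
    def __init__(self, base: nn.Linear, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad_(False)                  # pretrained (reasoning) weights stay untouched
        self.A = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, rank))  # starts as a no-op
        self.scale = alpha / rank

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Frozen projection plus the low-rank (safety) correction.
        return self.base(x) + self.scale * (x @ self.A.T @ self.B.T)

layer = LoRALinear(nn.Linear(512, 512))
out = layer(torch.randn(2, 512))  # only A and B receive gradients during safety fine-tuning
```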
--------------------------------------------------------------------------------------------------------
Bringing Balance to Hand Shape Classification: Mitigating Data Imbalance Through Generative Models
Sign language recognition systems face critical limitations due to severely imbalanced handshape datasets, hindering effective classifier training. This research explores synthetic data generation using two GAN architectures: ReACGAN for label-conditioned generation and SPADE for pose-informed spatial configuration. The approach improves state-of-the-art accuracy on the RWTH German sign language dataset by 5% while demonstrating cross-dataset generalization capabilities. By addressing fundamental data scarcity issues, this work enables more inclusive communication technologies. Applications span assistive technology development, sign language education, accessibility tools, and cultural preservation efforts where accurate handshape recognition is essential for effective human-computer interaction.
Authors: Gaston Gustavo Rios, Pedro Dal Bianco, Franco Ronchetti, Facundo Quiroga, Oscar Stanchi, Santiago Ponte Ahón, Waldo Hasperué
Link: https://arxiv.org/abs/2507.17008v1
Date: 2025-07-d
Summary:
Most sign language handshape datasets are severely limited and unbalanced, posing significant challenges to effective model training. In this paper, we explore the effectiveness of augmenting the training data of a handshape classifier by generating synthetic data. We use an EfficientNet classifier trained on the RWTH German sign language handshape dataset, which is small and heavily unbalanced, applying different strategies to combine generated and real images. We compare two Generative Adversarial Networks (GAN) architectures for data generation: ReACGAN, which uses label information to condition the data generation process through an auxiliary classifier, and SPADE, which utilizes spatially-adaptive normalization to condition the generation on pose information. ReACGAN allows for the generation of realistic images that align with specific handshape labels, while SPADE focuses on generating images with accurate spatial handshape configurations. Our proposed techniques improve the current state-of-the-art accuracy on the RWTH dataset by 5%, addressing the limitations of small and unbalanced datasets. Additionally, our method demonstrates the capability to generalize across different sign language datasets by leveraging pose-based generation trained on the extensive HaGRID dataset. We achieve comparable performance to single-source trained classifiers without the need for retraining the generator.
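How the generated images are folded back into training matters as much as the generator. One simple strategy, shown below, oversamples synthetic examples for the rare classes until every class reaches a target count; the interfaces and counts are illustrative, not the paper's exact combination strategy.

```python
import random
from collections import Counter

def balance_with_synthetic(real, synthetic, target_per_class: int):
    """real/synthetic: lists of (image, label) pairs; returns a balanced training list."""
    per_class = Counter(label for _, label in real)
    augmented = list(real)
    for label, count in per_class.items():
        pool = [item for item in synthetic if item[1] == label]  # GAN images of this class
        need = max(0, target_per_class - count)
        if pool and need:
            augmented += random.choices(pool, k=need)  # sample with replacement
    return augmented
```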
--------------------------------------------------------------------------------------------------------