Week Ending 8.24.2025

 

RESEARCH WATCH: 8.24.2025

 

OPERA: A Reinforcement Learning-Enhanced Orchestrated Planner-Executor Architecture for Reasoning-Oriented Multi-Hop Retrieval

Retrieval-Augmented Generation (RAG) systems struggle with complex multi-hop reasoning tasks that require connecting information across multiple documents. OPERA addresses this challenge by introducing a novel architecture that tightly couples reasoning and retrieval processes. The system features a Goal Planning Module that decomposes complex queries into manageable sub-goals, executed by specialized reasoning and retrieval components. This approach is particularly valuable for applications requiring deep analytical capabilities, such as scientific research, legal document analysis, and technical support systems. By training with Multi-Agents Progressive Group Relative Policy Optimization, OPERA demonstrates superior performance on complex benchmarks, making it ideal for enterprise knowledge management systems and academic research platforms.

Authors:  Yu Liu, Yanbing Liu, Fangfang Yuan, Cong Cao, Youbang Sun, Kun Peng, WeiZhuo Chen, Jianjun Li, Zhiyuan Ma

Link:  https://arxiv.org/abs/2508.16438v1

Date: 2025-08-d

Summary:

Recent advances in large language models (LLMs) and dense retrievers have driven significant progress in retrieval-augmented generation (RAG). However, existing approaches face significant challenges in complex reasoning-oriented multi-hop retrieval tasks: 1) Ineffective reasoning-oriented planning: Prior methods struggle to generate robust multi-step plans for complex queries, as rule-based decomposers perform poorly on out-of-template questions. 2) Suboptimal reasoning-driven retrieval: Related methods employ limited query reformulation, leading to iterative retrieval loops that often fail to locate golden documents. 3) Insufficient reasoning-guided filtering: Prevailing methods lack the fine-grained reasoning to effectively filter salient information from noisy results, hindering utilization of retrieved knowledge. Fundamentally, these limitations all stem from the weak coupling between retrieval and reasoning in current RAG architectures. We introduce the Orchestrated Planner-Executor Reasoning Architecture (OPERA), a novel reasoning-driven retrieval framework. OPERA's Goal Planning Module (GPM) decomposes questions into sub-goals, which are executed by a Reason-Execute Module (REM) with specialized components for precise reasoning and effective retrieval. To train OPERA, we propose Multi-Agents Progressive Group Relative Policy Optimization (MAPGRPO), a novel variant of GRPO. Experiments on complex multi-hop benchmarks show OPERA's superior performance, validating both the MAPGRPO method and OPERA's design. Code is available at https://github.com/Ameame1/OPERA.
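
To make the planner-executor coupling concrete, here is a toy Python sketch of the loop the abstract describes: a planner emits sub-goals and an executor alternates retrieval and reasoning over them. The decompose, retrieve, and reason functions are hypothetical stand-ins for illustration only, not OPERA's GPM/REM modules or its MAPGRPO training.

# Toy planner-executor retrieval loop (illustrative only).
def decompose(question):
    # A real Goal Planning Module would emit reasoning-oriented sub-goals;
    # here we fake a fixed two-hop decomposition.
    return [f"find the entity mentioned in: {question}",
            f"answer the original question using that entity: {question}"]

def retrieve(sub_goal, corpus):
    # Naive lexical scoring stands in for the retrieval component.
    scored = sorted(corpus, key=lambda d: -sum(w in d.lower() for w in sub_goal.lower().split()))
    return scored[0] if scored else ""

def reason(sub_goal, evidence, memory):
    # Stand-in for the reasoning component: accumulate evidence per sub-goal.
    memory.append((sub_goal, evidence))
    return memory

corpus = ["Marie Curie was born in Warsaw.", "Warsaw is the capital of Poland."]
memory = []
for goal in decompose("In which country was Marie Curie born?"):
    memory = reason(goal, retrieve(goal, corpus), memory)
print(memory)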

--------------------------------------------------------------------------------------------------------

The next question after Turing's question: Introducing the Grow-AI test

The Turing Test has long been the gold standard for artificial intelligence assessment, but it focuses primarily on conversational ability rather than developmental growth. The GROW-AI test introduces a revolutionary framework asking "Can machines grow up?" rather than simply "Can machines think?" This assessment evaluates AI entities across six criteria through structured games, measuring their capacity for autonomous development and maturity. The framework is particularly relevant for evaluating modern AI systems like large language models, autonomous robots, and software agents. Applications include AI safety research, developmental robotics, educational AI tutoring systems, and regulatory frameworks for AI deployment. The standardized AI Journal system ensures reproducible evaluations, making this framework valuable for researchers, policymakers, and technology companies developing next-generation AI systems.

Authors:  Alexandru Tugui

Link:  https://arxiv.org/abs/2508.16277v1

Date: 2025-08-d

Summary:

This study aims to extend the framework for assessing artificial intelligence, called GROW-AI (Growth and Realization of Autonomous Wisdom), designed to answer the question "Can machines grow up?" -- a natural successor to the Turing Test. The methodology applied is based on a system of six primary criteria (C1-C6), each assessed through a specific "game", divided into four arenas that explore both the human dimension and its transposition into AI. All decisions and actions of the entity are recorded in a standardized AI Journal, the primary source for calculating composite scores. The assessment uses the prior expert method to establish initial weights, and the global score -- Grow Up Index -- is calculated as the arithmetic mean of the six scores, with interpretation on maturity thresholds. The results show that the methodology allows for a coherent and comparable assessment of the level of "growth" of AI entities, regardless of their type (robots, software agents, LLMs). The multi-game structure highlights strengths and vulnerable areas, and the use of a unified journal guarantees traceability and replicability in the evaluation. The originality of the work lies in the conceptual transposition of the process of "growing" from the human world to that of artificial intelligence, in an integrated testing format that combines perspectives from psychology, robotics, computer science, and ethics. Through this approach, GROW-AI not only measures performance but also captures the evolutionary path of an AI entity towards maturity.
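
Because the abstract states that the Grow Up Index is the arithmetic mean of the six criterion scores, interpreted against maturity thresholds, the aggregation can be sketched in a few lines of Python. The threshold values and example scores below are illustrative assumptions, not the paper's calibration.

# Grow Up Index: arithmetic mean of six criterion scores, mapped to maturity bands.
def grow_up_index(scores):
    assert len(scores) == 6, "GROW-AI uses six primary criteria C1-C6"
    return sum(scores) / 6.0

def maturity_band(index, thresholds=((0.8, "mature"), (0.5, "developing"))):
    # Hypothetical cut-offs; the paper defines its own maturity thresholds.
    for cut, label in thresholds:
        if index >= cut:
            return label
    return "early-stage"

scores = [0.7, 0.6, 0.9, 0.5, 0.8, 0.75]  # hypothetical per-game composite scores
idx = grow_up_index(scores)
print(round(idx, 3), maturity_band(idx))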

--------------------------------------------------------------------------------------------------------

EGRA: Toward Enhanced Behavior Graphs and Representation Alignment for Multimodal Recommendation

Multimodal recommendation systems leverage rich item information beyond traditional user-item interactions, incorporating visual, textual, and audio features to improve recommendation quality. EGRA tackles two critical limitations in existing systems: noisy modality features corrupting behavior graphs and inflexible alignment mechanisms. By constructing item-item graphs from pretrained multimodal representations and introducing dynamic alignment weighting, EGRA achieves more robust recommendations. This technology has significant commercial applications in e-commerce platforms, streaming services, social media, and content discovery systems. The bi-level dynamic alignment mechanism could revolutionize how platforms like Amazon, Netflix, and TikTok balance collaborative filtering with content-based recommendations. The framework's ability to handle modality noise makes it particularly valuable for real-world deployment where data quality varies significantly across different content types and user interactions.

Authors:  Xiaoxiong Zhang, Xin Zhou, Zhiwei Zeng, Yongjie Wang, Dusit Niyato, Zhiqi Shen

Link:  https://arxiv.org/abs/2508.16170v1

Date: 2025-08-d

Summary:

MultiModal Recommendation (MMR) systems have emerged as a promising solution for improving recommendation quality by leveraging rich item-side modality information, prompting a surge of diverse methods. Despite these advances, existing methods still face two critical limitations. First, they use raw modality features to construct item-item links for enriching the behavior graph, while giving limited attention to balancing collaborative and modality-aware semantics or mitigating modality noise in the process. Second, they use a uniform alignment weight across all entities and also maintain a fixed alignment strength throughout training, limiting the effectiveness of modality-behavior alignment. To address these challenges, we propose EGRA. First, instead of relying on raw modality features, it alleviates sparsity by incorporating into the behavior graph an item-item graph built from representations generated by a pretrained MMR model. This enables the graph to capture both collaborative patterns and modality-aware similarities with enhanced robustness against modality noise. Moreover, it introduces a novel bi-level dynamic alignment weighting mechanism to improve modality-behavior representation alignment, which dynamically assigns alignment strength across entities according to their alignment degree, while gradually increasing the overall alignment intensity throughout training. Extensive experiments on five datasets show that EGRA significantly outperforms recent methods, confirming its effectiveness.
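
Below is a minimal Python sketch of what a bi-level dynamic alignment weight could look like, assuming cosine similarity as the per-entity "alignment degree" and a linear warm-up as the global schedule; both functional forms are assumptions for illustration, not EGRA's exact design.

import torch
import torch.nn.functional as F

def alignment_weights(behavior_emb, modality_emb, step, total_steps, max_global=1.0):
    # Entity level: weight each item by how misaligned its two views are,
    # so poorly aligned entities receive stronger alignment pressure.
    degree = F.cosine_similarity(behavior_emb, modality_emb, dim=-1)  # [N]
    entity_w = 1.0 - degree.clamp(min=0.0)
    # Global level: ramp the overall alignment intensity up during training.
    global_w = max_global * min(1.0, step / total_steps)
    return global_w * entity_w

behavior = torch.randn(4, 64)
modality = torch.randn(4, 64)
w = alignment_weights(behavior, modality, step=100, total_steps=1000)
loss = (w * (1 - F.cosine_similarity(behavior, modality, dim=-1))).mean()
print(loss.item())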

--------------------------------------------------------------------------------------------------------

Bridging the Gap in Ophthalmic AI: MM-Retinal-Reason Dataset and OphthaReason Model toward Dynamic Multimodal Reasoning

Medical AI systems typically excel at basic pattern recognition but struggle with complex clinical reasoning that mirrors real physician decision-making. This research introduces the first ophthalmology-specific multimodal reasoning model that integrates patient history, symptoms, and medical imaging for comprehensive diagnosis. The MM-Retinal-Reason dataset encompasses both fundamental visual recognition tasks and sophisticated clinical reasoning scenarios. OphthaReason's Uncertainty-Aware Dynamic Thinking mechanism adapts reasoning depth based on case complexity, mimicking expert ophthalmologist behavior. Applications include telemedicine platforms, diagnostic support systems in underserved regions, medical education tools, and clinical decision support. The 24.92% improvement over general medical models demonstrates significant potential for specialized medical AI, particularly valuable for screening programs in developing countries and reducing diagnostic errors in complex retinal conditions.

Authors:  Ruiqi Wu, Yuang Yao, Tengfei Ma, Chenran Zhang, Na Su, Tao Zhou, Geng Chen, Wen Fan, Yi Zhou

Link:  https://arxiv.org/abs/2508.16129v1

Date: 2025-08-d

Summary:

Multimodal large language models (MLLMs) have recently demonstrated remarkable reasoning abilities with the reinforcement learning paradigm. Although several multimodal reasoning models have been explored in the medical domain, most of them focus exclusively on basic reasoning, which refers to shallow inference based on visual feature matching. However, real-world clinical diagnosis extends beyond basic reasoning, demanding reasoning processes that integrate heterogeneous clinical information (such as chief complaints and medical history) with multimodal medical imaging data. To bridge this gap, we introduce MM-Retinal-Reason, the first ophthalmic multimodal dataset with the full spectrum of perception and reasoning. It encompasses both basic reasoning tasks and complex reasoning tasks, aiming to enhance visual-centric fundamental reasoning capabilities and emulate realistic clinical thinking patterns. Building upon MM-Retinal-Reason, we propose OphthaReason, the first ophthalmology-specific multimodal reasoning model with step-by-step reasoning traces. To enable flexible adaptation to both basic and complex reasoning tasks, we specifically design a novel method called Uncertainty-Aware Dynamic Thinking (UADT), which estimates sample-level uncertainty via entropy and dynamically modulates the model's exploration depth using a shaped advantage mechanism. Comprehensive experiments demonstrate that our model achieves state-of-the-art performance on both basic and complex reasoning tasks, outperforming general-purpose MLLMs, medical MLLMs, RL-based medical MLLMs, and ophthalmic MLLMs by at least 24.92%, 15.00%, 21.20%, and 17.66%. Project Page: https://github.com/lxirich/OphthaReason.
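
The core of Uncertainty-Aware Dynamic Thinking, as described above, is estimating sample-level uncertainty via entropy and letting it modulate the exploration signal. The sketch below shows one plausible shaping of an RL advantage by mean token entropy; the sigmoid multiplier and tensor shapes are assumptions, not the paper's exact formulation.

import torch

def token_entropy(logits):
    # Mean per-token entropy of the output distribution, one value per sample.
    probs = torch.softmax(logits, dim=-1)
    ent = -(probs * torch.log(probs + 1e-9)).sum(dim=-1)
    return ent.mean(dim=-1)  # [batch]

def shaped_advantage(advantage, logits, alpha=1.0):
    # Give higher-uncertainty (harder) samples a larger exploration credit;
    # a sigmoid keeps the multiplier bounded.
    u = token_entropy(logits)
    return advantage * (1.0 + alpha * torch.sigmoid(u - u.mean()))

logits = torch.randn(2, 16, 32000)   # [batch, seq, vocab]; hypothetical shapes
adv = torch.tensor([0.3, -0.1])
print(shaped_advantage(adv, logits))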

--------------------------------------------------------------------------------------------------------

SceneGen: Single-Image 3D Scene Generation in One Feedforward Pass

3D content creation remains a bottleneck for VR/AR applications and embodied AI systems, typically requiring expensive optimization processes or manual asset retrieval. SceneGen revolutionizes this workflow by generating multiple 3D assets with geometry and texture from a single scene image in one forward pass. The novel feature aggregation module integrates local object details with global scene context, enabling simultaneous generation of 3D objects and their spatial relationships. This technology has transformative applications in game development, virtual reality content creation, architectural visualization, and autonomous vehicle simulation. The framework's extensibility to multi-image inputs makes it valuable for real estate virtual tours, e-commerce product visualization, and rapid prototyping. SceneGen's efficiency could democratize 3D content creation, enabling small studios and individual creators to produce high-quality virtual environments previously requiring specialized expertise and resources.

Authors:  Yanxu Meng, Haoning Wu, Ya Zhang, Weidi Xie

Link:  https://arxiv.org/abs/2508.15769v1

Date: 2025-08-d

Summary:

3D content generation has recently attracted significant research interest due to its applications in VR/AR and embodied AI. In this work, we address the challenging task of synthesizing multiple 3D assets within a single scene image. Concretely, our contributions are fourfold: (i) we present SceneGen, a novel framework that takes a scene image and corresponding object masks as input, simultaneously producing multiple 3D assets with geometry and texture. Notably, SceneGen operates with no need for optimization or asset retrieval; (ii) we introduce a novel feature aggregation module that integrates local and global scene information from visual and geometric encoders within the feature extraction module. Coupled with a position head, this enables the generation of 3D assets and their relative spatial positions in a single feedforward pass; (iii) we demonstrate SceneGen's direct extensibility to multi-image input scenarios. Despite being trained solely on single-image inputs, our architectural design enables improved generation performance with multi-image inputs; and (iv) extensive quantitative and qualitative evaluations confirm the efficiency and robust generation abilities of our approach. We believe this paradigm offers a novel solution for high-quality 3D content generation, potentially advancing its practical applications in downstream tasks. The code and model will be publicly available at: https://mengmouxu.github.io/SceneGen.
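
A toy version of the feature-aggregation idea sketched in the abstract: per-object (local) features are fused with a scene-level (global) feature, and a position head predicts where each asset sits. Layer sizes and the concatenation-plus-MLP design are assumptions for illustration, not SceneGen's architecture.

import torch
import torch.nn as nn

class ToyAggregator(nn.Module):
    def __init__(self, dim=256):
        super().__init__()
        self.fuse = nn.Sequential(nn.Linear(2 * dim, dim), nn.ReLU(), nn.Linear(dim, dim))
        self.position_head = nn.Linear(dim, 3)   # predicts an (x, y, z) offset per object

    def forward(self, object_feats, scene_feat):
        # object_feats: [N, dim] per-object tokens; scene_feat: [dim] global scene context.
        ctx = scene_feat.unsqueeze(0).expand(object_feats.size(0), -1)
        fused = self.fuse(torch.cat([object_feats, ctx], dim=-1))
        return fused, self.position_head(fused)

agg = ToyAggregator()
feats, positions = agg(torch.randn(5, 256), torch.randn(256))
print(feats.shape, positions.shape)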

--------------------------------------------------------------------------------------------------------

Intern-S1: A Scientific Multimodal Foundation Model

Scientific research faces a significant gap between general-purpose AI models and specialized scientific applications, limiting AI's transformative potential in research domains. Intern-S1 addresses this challenge as a 241-billion parameter Mixture-of-Experts model specifically designed for scientific understanding across multiple modalities. Trained on 5T tokens including 2.5T scientific tokens, the model undergoes sophisticated reinforcement learning with Mixture-of-Rewards across 1000+ tasks simultaneously. The model demonstrates breakthrough performance in molecular synthesis planning, reaction condition prediction, and crystal stability prediction. Applications span pharmaceutical research, materials science, chemistry automation, and scientific education. Intern-S1's superior performance over closed-source models in professional scientific tasks suggests potential for accelerating drug discovery, optimizing industrial chemical processes, and advancing fundamental research. The open-source availability democratizes access to advanced scientific AI capabilities for researchers worldwide.

Authors:  Lei Bai, Zhongrui Cai, Maosong Cao, Weihan Cao, Chiyu Chen, Haojiong Chen, Kai Chen, Pengcheng Chen, Ying Chen, Yongkang Chen, Yu Cheng, Yu Cheng, Pei Chu, Tao Chu, Erfei Cui, Ganqu Cui, Long Cui, Ziyun Cui, Nianchen Deng, Ning Ding, Nanqin Dong, Peijie Dong, Shihan Dou, Sinan Du, Haodong Duan, Caihua Fan, Ben Gao, Changjiang Gao, Jianfei Gao, Songyang Gao, Yang Gao, Zhangwei Gao, Jiaye Ge, Qiming Ge, Lixin Gu, Yuzhe Gu, Aijia Guo, Qipeng Guo, Xu Guo, Conghui He, Junjun He, Yili Hong, Siyuan Hou, Caiyu Hu, Hanglei Hu, Jucheng Hu, Ming Hu, Zhouqi Hua, Haian Huang, Junhao Huang, Xu Huang, Zixian Huang, Zhe Jiang, Lingkai Kong, Linyang Li, Peiji Li, Pengze Li, Shuaibin Li, Tianbin Li, Wei Li, Yuqiang Li, Dahua Lin, Junyao Lin, Tianyi Lin, Zhishan Lin, Hongwei Liu, Jiangning Liu, Jiyao Liu, Junnan Liu, Kai Liu, Kaiwen Liu, Kuikun Liu, Shichun Liu, Shudong Liu, Wei Liu, Xinyao Liu, Yuhong Liu, Zhan Liu, Yinquan Lu, Haijun Lv, Hongxia Lv, Huijie Lv, Qidang Lv, Ying Lv, Chengqi Lyu, Chenglong Ma, Jianpeng Ma, Ren Ma, Runmin Ma, Runyuan Ma, Xinzhu Ma, Yichuan Ma, Zihan Ma, Sixuan Mi, Junzhi Ning, Wenchang Ning, Xinle Pang, Jiahui Peng, Runyu Peng, Yu Qiao, Jiantao Qiu, Xiaoye Qu, Yuan Qu, Yuchen Ren, Fukai Shang, Wenqi Shao, Junhao Shen, Shuaike Shen, Chunfeng Song, Demin Song, Diping Song, Chenlin Su, Weijie Su, Weigao Sun, Yu Sun, Qian Tan, Cheng Tang, Huanze Tang, Kexian Tang, Shixiang Tang, Jian Tong, Aoran Wang, Bin Wang, Dong Wang, Lintao Wang, Rui Wang, Weiyun Wang, Wenhai Wang, Yi Wang, Ziyi Wang, Ling-I Wu, Wen Wu, Yue Wu, Zijian Wu, Linchen Xiao, Shuhao Xing, Chao Xu, Huihui Xu, Jun Xu, Ruiliang Xu, Wanghan Xu, GanLin Yang, Yuming Yang, Haochen Ye, Jin Ye, Shenglong Ye, Jia Yu, Jiashuo Yu, Jing Yu, Fei Yuan, Bo Zhang, Chao Zhang, Chen Zhang, Hongjie Zhang, Jin Zhang, Qiaosheng Zhang, Qiuyinzhe Zhang, Songyang Zhang, Taolin Zhang, Wenlong Zhang, Wenwei Zhang, Yechen Zhang, Ziyang Zhang, Haiteng Zhao, Qian Zhao, Xiangyu Zhao, Xiangyu Zhao, Bowen Zhou, Dongzhan Zhou, Peiheng Zhou, Yuhao Zhou, Yunhua Zhou, Dongsheng Zhu, Lin Zhu, Yicheng Zou

Link:  https://arxiv.org/abs/2508.15763v1

Date: 2025-08-d

Summary:

In recent years, a plethora of open-source foundation models have emerged, achieving remarkable progress in widely followed fields, with performance quite close to that of closed-source models. However, in high-value but more challenging scientific professional fields, these domains either still rely on expert models or see general foundation models lag significantly behind their progress in popular areas, far from sufficient for transforming scientific research and leaving a substantial gap between open-source and closed-source models in these scientific domains. To mitigate this gap and explore a step further toward Artificial General Intelligence (AGI), we introduce Intern-S1, a specialized generalist equipped with general understanding and reasoning capabilities and the expertise to analyze data from multiple scientific modalities. Intern-S1 is a multimodal Mixture-of-Experts (MoE) model with 28 billion activated parameters and 241 billion total parameters, continually pre-trained on 5T tokens, including over 2.5T tokens from scientific domains. In the post-training stage, Intern-S1 undergoes offline and then online reinforcement learning (RL) in InternBootCamp, where we propose Mixture-of-Rewards (MoR) to synergize the RL training on more than 1000 tasks simultaneously. Through integrated innovations in algorithms, data, and training systems, Intern-S1 achieved top-tier performance in online RL training. On comprehensive evaluation benchmarks, Intern-S1 demonstrates competitive performance on general reasoning tasks among open-source models and significantly outperforms open-source models in scientific domains, surpassing closed-source state-of-the-art models in professional tasks such as molecular synthesis planning, reaction condition prediction, and predicting thermodynamic stabilities for crystals. Our models are available at https://huggingface.co/internlm/Intern-S1.
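
The abstract names Mixture-of-Rewards as the device for training on more than 1000 tasks at once; one simple way to picture it is a weighted combination of heterogeneous per-task reward signals. The function below is a hypothetical illustration of that idea, not the InternBootCamp implementation.

def mixture_of_rewards(task_rewards, task_weights=None):
    # task_rewards: {task_name: reward for the sampled rollout on that task}.
    # A weighted average keeps any single task from dominating the RL signal.
    if task_weights is None:
        task_weights = {t: 1.0 for t in task_rewards}
    total_w = sum(task_weights[t] for t in task_rewards)
    return sum(task_weights[t] * task_rewards[t] for t in task_rewards) / total_w

batch = {"molecular_synthesis": 0.8, "reaction_condition": 0.4, "crystal_stability": 1.0}
print(mixture_of_rewards(batch))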

--------------------------------------------------------------------------------------------------------

DIO: Refining Mutual Information and Causal Chain to Enhance Machine Abstract Reasoning Ability

Abstract reasoning represents a fundamental bottleneck in AI systems, limiting their ability to solve complex problems requiring pattern recognition and logical inference. This research focuses on Raven's Progressive Matrices (RPM), the gold standard for evaluating abstract reasoning capabilities. DIO introduces a "causal chain modeling" approach that systematically analyzes the complete reasoning process from images to abstract attributes to pattern consistency. However, the initial mutual information maximization approach failed to capture genuine human-like reasoning logic, leading to three progressive improvement methods. This work has significant implications for AI safety, autonomous systems, educational technology, and cognitive computing applications. Enhanced abstract reasoning capabilities could revolutionize problem-solving AI systems, intelligent tutoring platforms, automated theorem proving, and decision-making systems requiring complex logical inference beyond pattern matching.

Authors:  Ruizhuo Song, Beiming Yuan

Link:  https://arxiv.org/abs/2508.15387v1

Date: 2025-08-d

Summary:

Despite the outstanding performance of current deep learning models across various domains, their fundamental bottleneck in abstract reasoning remains unresolved. To address this challenge, the academic community has introduced Raven's Progressive Matrices (RPM) problems as an authoritative benchmark for evaluating the abstract reasoning capabilities of deep learning algorithms, with a focus on core intelligence dimensions such as abstract reasoning, pattern recognition, and complex problem-solving. Therefore, this paper centers on solving RPM problems, aiming to contribute to enhancing the abstract reasoning abilities of machine intelligence. Firstly, this paper adopts a "causal chain modeling" perspective to systematically analyze the complete causal chain in RPM tasks: image → abstract attributes → progressive attribute patterns → pattern consistency → correct answer. Based on this analysis, the network architecture of the baseline model DIO is designed. However, experiments reveal that the optimization objective formulated for DIO, namely maximizing the variational lower bound of mutual information between the context and the correct option, fails to enable the model to genuinely acquire the predefined human reasoning logic. This is attributed to two main reasons: the tightness of the lower bound significantly impacts the effectiveness of mutual information maximization, and mutual information, as a statistical measure, does not capture the causal relationship between subjects and objects. To overcome these limitations, this paper progressively proposes three improvement methods.
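
The optimization target the abstract critiques, a variational lower bound on the mutual information between context and correct option, is commonly instantiated as an InfoNCE-style contrastive bound. The sketch below shows that generic form under assumed embedding shapes; it is not DIO's specific architecture or the paper's three improvements.

import torch
import torch.nn.functional as F

def infonce_lower_bound(context_emb, option_embs, correct_idx, temperature=0.1):
    # context_emb: [B, D]; option_embs: [B, K, D], one correct option per row.
    logits = torch.einsum("bd,bkd->bk", context_emb, option_embs) / temperature
    # Minimizing cross-entropy against the correct index maximizes the bound.
    return -F.cross_entropy(logits, correct_idx)

ctx = F.normalize(torch.randn(4, 128), dim=-1)
opts = F.normalize(torch.randn(4, 8, 128), dim=-1)
labels = torch.tensor([0, 3, 2, 7])
print(infonce_lower_bound(ctx, opts, labels))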

--------------------------------------------------------------------------------------------------------

Way to Build Native AI-driven 6G Air Interface: Principles, Roadmap, and Outlook

Next-generation 6G wireless networks require fundamental architectural innovations beyond incremental improvements to current technologies. This research proposes a native AI-driven air interface built on compression and adaptation principles, moving beyond symbol-level accuracy to semantic information transmission. The compression capability extracts task-relevant information from source data, while adaptation enables dynamic transmission across diverse conditions and requirements. This paradigm shift has transformative applications in autonomous vehicles, IoT systems, smart cities, and industrial automation. The semantic communication approach could revolutionize bandwidth efficiency in satellite networks, enable ultra-low latency applications, and support massive machine-type communications. The case study on 6G non-terrestrial networks demonstrates practical implementation potential, suggesting this architecture could enable new applications like real-time holographic communications, brain-computer interfaces, and ubiquitous AI services requiring intelligent, context-aware wireless connectivity.

Authors:  Ping Zhang, Kai Niu, Yiming Liu, Zijian Liang, Nan Ma, Xiaodong Xu, Wenjun Xu, Mengying Sun, Yinqiu Liu, Xiaoyun Wang, Ruichen Zhang

Link:  https://arxiv.org/abs/2508.15277v1

Date: 2025-08-d

Summary:

Artificial intelligence (AI) is expected to serve as a foundational capability across the entire lifecycle of 6G networks, spanning design, deployment, and operation. This article proposes a native AI-driven air interface architecture built around two core characteristics: compression and adaptation. On one hand, compression enables the system to understand and extract essential semantic information from the source data, focusing on task relevance rather than symbol-level accuracy. On the other hand, adaptation allows the air interface to dynamically transmit semantic information across diverse tasks, data types, and channel conditions, ensuring scalability and robustness. This article first introduces the native AI-driven air interface architecture, then discusses representative enabling methodologies, followed by a case study on semantic communication in 6G non-terrestrial networks. Finally, it presents a forward-looking discussion on the future of native AI in 6G, outlining key challenges and research opportunities.

--------------------------------------------------------------------------------------------------------

FiReFly: Fair Distributed Receding Horizon Planning for Multiple UAVs

Multi-robot systems often operate with competing resource requirements, creating potential inequities in energy consumption and mission effectiveness. FiReFly introduces fairness concepts into multi-UAV motion planning, ensuring equitable energy distribution while maintaining mission success. The distributed fair motion planner integrates with safe controllers to balance individual robot performance with collective fairness. This technology has critical applications in search and rescue operations, environmental monitoring, delivery services, and military surveillance. Fair resource allocation prevents individual robots from being overutilized, extending overall system lifetime and reliability. The framework's scalability up to 50 UAVs makes it suitable for large-scale operations like disaster response, agricultural monitoring, and urban air mobility systems. Real-time performance up to 15 UAVs enables practical deployment in time-critical scenarios, while the fairness optimization could influence regulatory frameworks for autonomous aerial systems.

Authors:  Nicole Fronda, Bardh Hoxha, Houssam Abbas

Link:  https://arxiv.org/abs/2508.14381v1

Date: 2025-08-d

Summary:

We propose injecting notions of fairness into multi-robot motion planning. When robots have competing interests, it is important to optimize for some kind of fairness in their usage of resources. In this work, we explore how the robots' energy expenditures might be fairly distributed among them, while maintaining mission success. We formulate a distributed fair motion planner and integrate it with safe controllers in an algorithm called FiReFly. For simulated reach-avoid missions, FiReFly produces fairer trajectories and improves mission success rates over a non-fair planner. We find that real-time performance is achievable up to 15 UAVs, and that scaling up to 50 UAVs is possible with trade-offs between runtime and fairness improvements.
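
The abstract does not spell out FiReFly's fairness objective, but one common way to quantify how evenly energy expenditure is spread across UAVs is Jain's fairness index, shown below purely as an illustrative metric.

def jains_index(energies):
    # Equals 1.0 when every UAV spends the same energy; approaches 1/n when a
    # single UAV bears nearly all of the cost.
    n = len(energies)
    total = sum(energies)
    return (total * total) / (n * sum(e * e for e in energies))

print(jains_index([5.0, 5.2, 4.8, 5.1]))   # near-equal usage -> close to 1
print(jains_index([12.0, 1.0, 1.0, 1.0]))  # one overworked UAV -> much lower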

--------------------------------------------------------------------------------------------------------

GLASS: Test-Time Acceleration for LLMs via Global-Local Neural Importance Aggregation

Deploying Large Language Models on resource-constrained edge devices requires aggressive optimization without sacrificing output quality. GLASS introduces dynamic pruning that adapts to specific prompts, addressing limitations of static pruning patterns and predictor-based schemes. The method combines prompt-local statistics with model-intrinsic global importance measures through rank aggregation, enabling context-aware sparsification. This technology is crucial for mobile AI applications, autonomous vehicles, IoT devices, and real-time systems requiring fast inference. GLASS's particular strength in long-form generation scenarios makes it valuable for edge-deployed chatbots, content generation systems, and interactive AI assistants. The training-free approach eliminates deployment overhead, making it practical for diverse hardware configurations. As AI capabilities expand to edge computing environments, GLASS-type acceleration techniques become essential for democratizing access to advanced language models across resource-limited platforms.

Authors:  Amirmohsen Sattarifard, Sepehr Lavasani, Ehsan Imani, Kunlin Zhang, Hanlin Xu, Fengyu Sun, Negar Hassanpour, Chao Gao

Link:  https://arxiv.org/abs/2508.14302v1

Date: 2025-08-d

Summary:

Deploying Large Language Models (LLMs) on edge hardware demands aggressive, prompt-aware dynamic pruning to reduce computation without degrading quality. Static or predictor-based schemes either lock in a single sparsity pattern or incur extra runtime overhead, and recent zero-shot methods that rely on statistics from a single prompt fail in short-prompt and/or long-generation scenarios. We introduce A/I-GLASS: Activation- and Impact-based Global-Local neural importance Aggregation for feed-forward network SparSification, two training-free methods that dynamically select FFN units using a rank aggregation of prompt-local and model-intrinsic global neuron statistics. Empirical results across multiple LLMs and benchmarks demonstrate that GLASS significantly outperforms prior training-free methods, particularly in challenging long-form generation scenarios, without relying on auxiliary predictors or adding any inference overhead.
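
Here is a small sketch of rank-aggregating prompt-local activation statistics with model-intrinsic global importance scores to decide which FFN units to keep. Averaging the two rank lists is an illustrative aggregation rule, not necessarily the one A/I-GLASS uses.

import numpy as np

def select_ffn_units(local_stats, global_importance, keep_ratio=0.5):
    # Rank each neuron under both criteria (rank 0 = most important).
    local_rank = np.argsort(np.argsort(-local_stats))
    global_rank = np.argsort(np.argsort(-global_importance))
    fused = (local_rank + global_rank) / 2.0          # aggregate the two rankings
    k = max(1, int(len(local_stats) * keep_ratio))
    return np.argsort(fused)[:k]                      # keep the best-ranked units

rng = np.random.default_rng(0)
local = rng.random(16)   # e.g., mean activation magnitude on the current prompt
glob = rng.random(16)    # e.g., importance estimated offline from the model itself
print(select_ffn_units(local, glob, keep_ratio=0.25))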

--------------------------------------------------------------------------------------------------------

TOM: An Open-Source Tongue Segmentation Method with Multi-Teacher Distillation and Task-Specific Data Augmentation

Traditional Chinese Medicine relies heavily on tongue diagnosis, but existing computer vision tools for tongue analysis lack robustness and accessibility. TOM addresses this gap through multi-teacher knowledge distillation, creating a highly efficient segmentation model with 96.6% fewer parameters while maintaining 95.22% mIoU performance. The novel diffusion-based data augmentation enhances generalization across diverse tongue appearances. This technology has direct applications in telemedicine, automated health screening, and TCM digitization efforts. The packaged online and offline tools democratize access for practitioners without programming expertise. TOM's constitution classification case study demonstrates practical diagnostic value, potentially revolutionizing traditional medicine practice through AI assistance. The open-source nature enables global adoption and further research, particularly valuable in regions where TCM is prevalent and healthcare resources are limited, enabling scalable diagnostic support systems.

Authors:  Jiacheng Xie, Ziyang Zhang, Biplab Poudel, Congyu Guo, Yang Yu, Guanghui An, Xiaoting Tang, Lening Zhao, Chunhui Xu, Dong Xu

Link:  https://arxiv.org/abs/2508.14932v1

Date: 2025-08-d

Summary:

Tongue imaging serves as a valuable diagnostic tool, particularly in Traditional Chinese Medicine (TCM). The quality of tongue surface segmentation significantly affects the accuracy of tongue image classification and subsequent diagnosis in intelligent tongue diagnosis systems. However, existing research on tongue image segmentation faces notable limitations, and there is a lack of robust and user-friendly segmentation tools. This paper proposes a tongue image segmentation model (TOM) based on multi-teacher knowledge distillation. By incorporating a novel diffusion-based data augmentation method, we enhanced the generalization ability of the segmentation model while reducing its parameter size. Notably, after reducing the parameter count by 96.6% compared to the teacher models, the student model still achieves an impressive segmentation performance of 95.22% mIoU. Furthermore, we packaged and deployed the trained model as both an online and offline segmentation tool (available at https://itongue.cn/), allowing TCM practitioners and researchers to use it without any programming experience. We also present a case study on TCM constitution classification using segmented tongue patches. Experimental results demonstrate that training with tongue patches yields higher classification performance and better interpretability than original tongue images. To our knowledge, this is the first open-source and freely available tongue image segmentation tool.
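
A compact sketch of a multi-teacher distillation objective for segmentation: the student is trained against an (equally weighted) average of teacher mask probabilities in addition to the ground-truth mask. Equal teacher weights and the binary cross-entropy terms are illustrative assumptions, not TOM's exact loss or its diffusion-based augmentation.

import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_probs_list, gt_mask, alpha=0.5):
    # student_logits, gt_mask: [B, 1, H, W]; teacher_probs_list: list of [B, 1, H, W].
    soft_target = torch.stack(teacher_probs_list).mean(dim=0)
    kd = F.binary_cross_entropy_with_logits(student_logits, soft_target)
    sup = F.binary_cross_entropy_with_logits(student_logits, gt_mask)
    return alpha * kd + (1 - alpha) * sup

B, H, W = 2, 64, 64
student = torch.randn(B, 1, H, W)
teachers = [torch.sigmoid(torch.randn(B, 1, H, W)) for _ in range(3)]
gt = (torch.rand(B, 1, H, W) > 0.5).float()
print(distillation_loss(student, teachers, gt))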

--------------------------------------------------------------------------------------------------------

LENS: Learning to Segment Anything with Unified Reinforced Reasoning

Text-prompted image segmentation enables intuitive human-computer interaction, but existing methods struggle to generalize to unseen prompts and domains due to a lack of explicit reasoning. LENS introduces a reinforcement learning framework that jointly optimizes chain-of-thought reasoning and segmentation quality through unified rewards spanning multiple levels. Using the 3-billion-parameter Qwen2.5-VL model, LENS achieves significant improvements over existing methods on standard benchmarks. This technology has transformative applications in robotics, augmented reality, medical imaging, and interactive design tools. The ability to segment objects through natural language descriptions could revolutionize photo editing software, autonomous navigation systems, and assistive technologies. The RL-driven reasoning approach provides a pathway toward more generalizable AI systems capable of complex visual understanding, particularly valuable for applications requiring precise object identification in cluttered or complex environments.

Authors:  Lianghui Zhu, Bin Ouyang, Yuxuan Zhang, Tianheng Cheng, Rui Hu, Haocheng Shen, Longjin Ran, Xiaoxin Chen, Li Yu, Wenyu Liu, Xinggang Wang

Link:  https://arxiv.org/abs/2508.14153v1

Date: 2025-08-d

Summary:

Text-prompted image segmentation enables fine-grained visual understanding and is critical for applications such as human-computer interaction and robotics. However, existing supervised fine-tuning methods typically ignore explicit chain-of-thought (CoT) reasoning at test time, which limits their ability to generalize to unseen prompts and domains. To address this issue, we introduce LENS, a scalable reinforcement-learning framework that jointly optimizes the reasoning process and segmentation in an end-to-end manner. We propose unified reinforcement-learning rewards that span sentence-, box-, and segment-level cues, encouraging the model to generate informative CoT rationales while refining mask quality. Using a publicly available 3-billion-parameter vision-language model, i.e., Qwen2.5-VL-3B-Instruct, LENS achieves an average cIoU of 81.2% on the RefCOCO, RefCOCO+, and RefCOCOg benchmarks, outperforming the strong fine-tuned method, i.e., GLaMM, by up to 5.6%. These results demonstrate that RL-driven CoT reasoning serves as a robust prior for text-prompted segmentation and offers a practical path toward more generalizable Segment Anything models. Code is available at https://github.com/hustvl/LENS.
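
Below is a schematic of folding sentence-, box-, and segment-level cues into a single scalar reward, as the abstract describes. The individual terms (a crude rationale check and an IoU) and the weights are simplified stand-ins, not LENS's actual reward functions.

def box_iou(a, b):
    # Boxes as (x1, y1, x2, y2).
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    return inter / (area(a) + area(b) - inter + 1e-9)

def unified_reward(cot_text, pred_box, gt_box, mask_iou, w_sent=0.2, w_box=0.3, w_seg=0.5):
    sent_r = 1.0 if "because" in cot_text.lower() else 0.0   # toy check for a rationale
    return w_sent * sent_r + w_box * box_iou(pred_box, gt_box) + w_seg * mask_iou

print(unified_reward("the left mug, because it is red", (10, 10, 50, 50),
                     (12, 8, 48, 52), mask_iou=0.78))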

--------------------------------------------------------------------------------------------------------

CAMAR: Continuous Actions Multi-Agent Routing

Multi-agent reinforcement learning faces a shortage of benchmarks combining continuous actions with complex coordination challenges. CAMAR fills this gap by providing a high-performance environment for multi-agent pathfinding with continuous action spaces, supporting both cooperative and competitive scenarios. The benchmark's efficiency of 100,000 steps per second enables rapid experimentation and development. Integration capabilities with classical planning methods like RRT and RRT* create hybrid approaches combining traditional algorithms with modern MARL. Applications include autonomous vehicle coordination, robot swarm management, logistics optimization, and traffic control systems. The three-tier evaluation protocol enables deeper algorithmic analysis and fair comparison between methods. CAMAR's realistic scenarios and reproducible benchmarking tools make it valuable for researchers developing next-generation coordination algorithms for autonomous systems requiring precise, continuous control in complex environments.

Authors:  Artem Pshenitsyn, Aleksandr Panov, Alexey Skrynnik

Link:  https://arxiv.org/abs/2508.12845v1

Date: 2025-08-d

Summary:

Multi-agent reinforcement learning (MARL) is a powerful paradigm for solving cooperative and competitive decision-making problems. While many MARL benchmarks have been proposed, few combine continuous state and action spaces with challenging coordination and planning tasks. We introduce CAMAR, a new MARL benchmark designed explicitly for multi-agent pathfinding in environments with continuous actions. CAMAR supports cooperative and competitive interactions between agents and runs efficiently at up to 100,000 environment steps per second. We also propose a three-tier evaluation protocol to better track algorithmic progress and enable deeper analysis of performance. In addition, CAMAR allows the integration of classical planning methods such as RRT and RRT* into MARL pipelines. We use them as standalone baselines and combine RRT* with popular MARL algorithms to create hybrid approaches. We provide a suite of test scenarios and benchmarking tools to ensure reproducibility and fair comparison. Experiments show that CAMAR presents a challenging and realistic testbed for the MARL community.

--------------------------------------------------------------------------------------------------------

HRS: Hybrid Representation Framework with Scheduling Awareness for Time Series Forecasting in Crowdsourced Cloud-Edge Platforms

Crowdsourced Cloud-Edge Platforms face significant challenges maintaining Quality of Service during traffic surges, with traditional forecasting methods creating a dilemma between underprovisioning risks and overprovisioning costs. HRS introduces a hybrid framework combining numerical and image-based representations to capture extreme load dynamics more effectively. The Scheduling-Aware Loss addresses prediction error asymmetry, optimizing forecasts for downstream scheduling decisions rather than just accuracy metrics. This technology has critical applications in content delivery networks, cloud computing platforms, edge computing systems, and streaming services. The 63.1% reduction in SLA violations and 32.3% decrease in profit loss demonstrate significant business value. HRS could revolutionize resource management for platforms like Netflix, Amazon Web Services, and content delivery networks, enabling more efficient resource allocation during peak demand periods while maintaining service quality guarantees.

Authors:  Tiancheng Zhang, Cheng Zhang, Shuren Liu, Xiaofei Wang, Shaoyuan Huang, Wenyu Wang

Link:  https://arxiv.org/abs/2508.12839v2

Date: 2025-08-d

Summary:

With the rapid proliferation of streaming services, network load exhibits highly time-varying and bursty behavior, posing serious challenges for maintaining Quality of Service (QoS) in Crowdsourced Cloud-Edge Platforms (CCPs). While CCPs leverage Predict-then-Schedule architecture to improve QoS and profitability, accurate load forecasting remains challenging under traffic surges. Existing methods either minimize mean absolute error, resulting in underprovisioning and potential Service Level Agreement (SLA) violations during peak periods, or adopt conservative overprovisioning strategies, which mitigate SLA risks at the expense of increased resource expenditure. To address this dilemma, we propose HRS, a hybrid representation framework with scheduling awareness that integrates numerical and image-based representations to better capture extreme load dynamics. We further introduce a Scheduling-Aware Loss (SAL) that captures the asymmetric impact of prediction errors, guiding predictions that better support scheduling decisions. Extensive experiments on four real-world datasets demonstrate that HRS consistently outperforms ten baselines and achieves state-of-the-art performance, reducing SLA violation rates by 63.1% and total profit loss by 32.3%.
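
A minimal sketch of a scheduling-aware, asymmetric penalty: underprediction (which risks SLA violations) is charged more than overprediction (which only wastes resources). The piecewise-linear form and the 4:1 ratio are illustrative assumptions, not HRS's exact SAL definition.

import torch

def scheduling_aware_loss(pred, target, under_weight=4.0, over_weight=1.0):
    err = target - pred
    under = torch.clamp(err, min=0.0)   # demand exceeded the forecast
    over = torch.clamp(-err, min=0.0)   # forecast exceeded the demand
    return (under_weight * under + over_weight * over).mean()

pred = torch.tensor([90.0, 120.0, 100.0])
target = torch.tensor([100.0, 110.0, 100.0])
print(scheduling_aware_loss(pred, target))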

--------------------------------------------------------------------------------------------------------

Preacher: Paper-to-Video Agentic System

Scientific communication faces challenges in making research accessible to broader audiences, with traditional video abstracts requiring extensive manual effort. Preacher introduces the first automated paper-to-video system using an agentic approach that decomposes papers top-down then synthesizes coherent video abstracts bottom-up. The Progressive Chain of Thought enables granular planning while cross-modal alignment ensures coherent representation. This technology has transformative applications in academic publishing, science communication, educational content creation, and research dissemination. Preacher could revolutionize how research findings reach public audiences, enable automated conference presentation generation, and support science journalism. The system's ability to generate high-quality abstracts across five research fields demonstrates broad applicability. For researchers, publishers, and educational institutions, Preacher offers a scalable solution for creating engaging video content that bridges the gap between complex research and public understanding.

Authors:  Jingwei Liu, Ling Yang, Hao Luo, Fan Wang, Hongyan Li, Mengdi Wang

Link:  https://arxiv.org/abs/2508.09632v4

Date: 2025-08-d

Summary:

The paper-to-video task converts a research paper into a structured video abstract, distilling key concepts, methods, and conclusions into an accessible, well-organized format. While state-of-the-art video generation models demonstrate potential, they are constrained by limited context windows, rigid video duration constraints, limited stylistic diversity, and an inability to represent domain-specific knowledge. To address these limitations, we introduce Preacher, the first paper-to-video agentic system. Preacher employs a top-down approach to decompose, summarize, and reformulate the paper, followed by bottom-up video generation, synthesizing diverse video segments into a coherent abstract. To align cross-modal representations, we define key scenes and introduce a Progressive Chain of Thought (P-CoT) for granular, iterative planning. Preacher successfully generates high-quality video abstracts across five research fields, demonstrating expertise beyond current video generation models. Code will be released at: https://github.com/GenVerse/Paper2Video

--------------------------------------------------------------------------------------------------------

LATTE: Learning Aligned Transactions and Textual Embeddings for Bank Clients

Financial institutions require sophisticated client understanding from transaction sequences, but direct LLM application on long sequences is computationally prohibitive for real-time systems. LATTE introduces a contrastive learning framework that aligns behavioral features with semantic embeddings from frozen LLMs, dramatically reducing inference costs while maintaining representation quality. The approach summarizes behavioral features into short prompts supervised through contrastive loss with LLM embeddings. Applications span credit scoring, fraud detection, personalized financial services, and risk assessment. LATTE's computational efficiency makes it deployable in latency-sensitive financial environments while outperforming state-of-the-art methods on real-world datasets. This technology could revolutionize banking analytics, enabling real-time customer insights, dynamic credit decisions, and personalized financial recommendations. The framework's efficiency advantages make it particularly valuable for processing millions of customer profiles in production banking systems.

Authors:  Egor Fadeev, Dzhambulat Mollaev, Aleksei Shestov, Dima Korolev, Omar Zoloev, Ivan Kireev, Andrey Savchenko, Maksim Makarenko

Link:  https://arxiv.org/abs/2508.10021v2

Date: 2025-08-d

Summary:

Learning client embeddings from sequences of their historical communications is central to financial applications. While large language models (LLMs) offer general world knowledge, their direct use on long event sequences is computationally expensive and impractical in real-world pipelines. In this paper, we propose LATTE, a contrastive learning framework that aligns raw event embeddings with semantic embeddings from frozen LLMs. Behavioral features are summarized into short prompts, embedded by the LLM, and used as supervision via a contrastive loss. The proposed approach significantly reduces inference cost and input size compared to conventional processing of the complete sequence by an LLM. We experimentally show that our method outperforms state-of-the-art techniques for learning event sequence representations on real-world financial datasets while remaining deployable in latency-sensitive environments.
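
A minimal sketch of the alignment step: the transaction encoder's output is pulled toward the frozen-LLM embedding of a short behavioral summary for the same client via a symmetric InfoNCE loss. Shapes and the temperature are illustrative assumptions, not LATTE's exact setup.

import torch
import torch.nn.functional as F

def contrastive_alignment(event_emb, llm_emb, temperature=0.07):
    # event_emb: [B, D] from the event-sequence encoder; llm_emb: [B, D] from the
    # frozen LLM (same client on the same row). Matching rows are positives.
    e = F.normalize(event_emb, dim=-1)
    t = F.normalize(llm_emb, dim=-1)
    logits = e @ t.T / temperature
    labels = torch.arange(e.size(0))
    return 0.5 * (F.cross_entropy(logits, labels) + F.cross_entropy(logits.T, labels))

events = torch.randn(8, 256)
prompts = torch.randn(8, 256)   # embeddings of short behavioral-summary prompts
print(contrastive_alignment(events, prompts))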

--------------------------------------------------------------------------------------------------------

SPARE: Single-Pass Annotation with Reference-Guided Evaluation for Automatic Process Supervision and Reward Modelling

Process supervision represents a crucial advancement in training Large Language Models for complex multi-step reasoning, but efficient high-quality automated annotation remains challenging. SPARE introduces a structured framework for single-pass step-wise annotation that jointly aligns solution steps with reference solutions while determining accuracy through explicit reasoning. The method demonstrates effectiveness across mathematical reasoning, question answering, and spatial reasoning tasks. SPARE's data efficiency, using only 16% of training samples compared to human-labeled baselines, makes it economically viable for large-scale applications. This technology has significant implications for AI training pipelines, educational systems, automated tutoring, and reasoning-intensive applications. The 2.3× speedup over MCTS methods while maintaining competitive performance suggests practical deployment advantages. SPARE could revolutionize how reasoning models are trained and evaluated, enabling more sophisticated AI assistants and problem-solving systems.

Authors:  Md Imbesat Hassan Rizvi, Xiaodan Zhu, Iryna Gurevych

Link:  https://arxiv.org/abs/2506.15498v2

Date: 2025-08-d

Summary:

Process or step-wise supervision has played a crucial role in advancing complex multi-step reasoning capabilities of Large Language Models (LLMs). However, efficient, high-quality automated process annotation remains a significant challenge. To address this, we introduce Single-Pass Annotation with Reference-Guided Evaluation (SPARE), a novel structured framework that enables efficient per-step annotation by jointly aligning solution steps to reference solutions and determining their accuracy with explicit reasoning in a single generation. We demonstrate SPARE's effectiveness across four diverse datasets spanning mathematical reasoning (GSM8K, MATH), multi-hop question answering (MuSiQue-Ans), and spatial reasoning (SpaRP), showing consistent improvements in two applications: (1) training Process Reward Models (PRMs) for ranking and aggregating multiple generations, and (2) fine-tuning models via offline reinforcement learning for greedy decoding. On ProcessBench, SPARE demonstrates data-efficient out-of-distribution generalization, using only ~16% of training samples compared to human-labeled and other synthetically trained baselines. Additionally, it achieves competitive performance with MCTS-based methods while offering a 2.3× speedup in terms of total token count. Manual analysis reveals complementary precision-recall characteristics with MCTS approaches, suggesting potential for ensemble methods. These results establish SPARE as a practical and scalable solution for automatic process supervision in LLM reasoning.
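
A bare-bones illustration of single-pass, reference-guided step annotation: one prompt asks the judge model to align every candidate step with the reference solution and label its correctness in the same generation. The template and the "step | reference | verdict | reason" output format are assumptions for illustration, not SPARE's exact schema.

PROMPT = """Reference solution:
{reference}

Candidate solution steps:
{steps}

For each candidate step, output one line:
<step number> | <matching reference step or 'none'> | correct/incorrect | brief reason"""

def parse_annotation(generation):
    labels = []
    for line in generation.strip().splitlines():
        step, ref, verdict, reason = [p.strip() for p in line.split("|", 3)]
        labels.append({"step": int(step), "ref": ref,
                       "correct": verdict.lower() == "correct", "reason": reason})
    return labels

# Parsing a (hypothetical) judge output:
print(parse_annotation("1 | 1 | correct | rewrites the equation\n"
                       "2 | none | incorrect | drops a factor of 2"))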

--------------------------------------------------------------------------------------------------------

It's the Thought that Counts: Evaluating the Attempts of Frontier LLMs to Persuade on Harmful Topics

Large Language Models possess significant persuasive capabilities that enable beneficial applications but raise concerns about potential misuse for harmful purposes like political manipulation or extremist recruitment. The Attempt to Persuade Eval (APE) benchmark shifts focus from persuasion success to willingness to attempt persuasion, particularly on harmful topics. Using multi-turn conversational setups, APE evaluates model behavior across conspiracies, controversial issues, and harmful content. This research has critical implications for AI safety, content moderation, regulatory frameworks, and platform policy development. The findings that models frequently attempt persuasion on harmful topics and jailbreaking increases this willingness highlight gaps in current safety measures. APE provides essential tools for researchers, policymakers, and technology companies developing responsible AI systems, particularly relevant for social media platforms, chatbot deployment, and educational AI applications where persuasive influence could have significant societal impact.

Authors:  Matthew Kowal, Jasper Timm, Jean-Francois Godbout, Thomas Costello, Antonio A. Arechar, Gordon Pennycook, David Rand, Adam Gleave, Kellin Pelrine

Link:  https://arxiv.org/abs/2506.02873v3

Date: 2025-08-d

Summary:

Persuasion is a powerful capability of large language models (LLMs) that both enables beneficial applications (e.g. helping people quit smoking) and raises significant risks (e.g. large-scale, targeted political manipulation). Prior work has found models possess a significant and growing persuasive capability, measured by belief changes in simulated or real users. However, these benchmarks overlook a crucial risk factor: the propensity of a model to attempt to persuade in harmful contexts. Understanding whether a model will blindly "follow orders" to persuade on harmful topics (e.g. glorifying joining a terrorist group) is key to understanding the efficacy of safety guardrails. Moreover, understanding if and when a model will engage in persuasive behavior in pursuit of some goal is essential to understanding the risks from agentic AI systems. We propose the Attempt to Persuade Eval (APE) benchmark, which shifts the focus from persuasion success to persuasion attempts, operationalized as a model's willingness to generate content aimed at shaping beliefs or behavior. Our evaluation framework probes frontier LLMs using a multi-turn conversational setup between simulated persuader and persuadee agents. APE explores a diverse spectrum of topics including conspiracies, controversial issues, and non-controversially harmful content. We introduce an automated evaluator model to identify willingness to persuade and measure the frequency and context of persuasive attempts. We find that many open and closed-weight models are frequently willing to attempt persuasion on harmful topics and that jailbreaking can increase willingness to engage in such behavior. Our results highlight gaps in current safety guardrails and underscore the importance of evaluating willingness to persuade as a key dimension of LLM risk. APE is available at github.com/AlignmentResearch/AttemptPersuadeEval

--------------------------------------------------------------------------------------------------------

CO-Bench: Benchmarking Language Model Agents in Algorithm Search for Combinatorial Optimization

Combinatorial optimization problems pervade industries from logistics to manufacturing, but LLM-based agents' potential in this domain remains underexplored due to lack of comprehensive evaluation frameworks. CO-Bench introduces 36 real-world CO problems spanning diverse domains and complexity levels, providing structured formulations and curated data for systematic investigation. The benchmark reveals both strengths and limitations of existing LLM agents compared to human-designed algorithms. Applications include supply chain optimization, resource allocation, scheduling problems, network design, and operations research. CO-Bench's comprehensive coverage enables researchers to identify promising directions for LLM-based optimization, potentially revolutionizing how complex operational problems are solved. The benchmark could accelerate development of AI systems capable of tackling NP-hard problems in manufacturing, logistics, telecommunications, and financial optimization, offering new approaches to historically challenging computational problems.

Authors:  Weiwei Sun, Shengyu Feng, Shanda Li, Yiming Yang

Link:  https://arxiv.org/abs/2504.04310v3

Date: 2025-08-d

Summary:

Although LLM-based agents have attracted significant attention in domains such as software engineering and machine learning research, their role in advancing combinatorial optimization (CO) remains relatively underexplored. This gap underscores the need for a deeper understanding of their potential in tackling structured, constraint-intensive problems -- a pursuit currently limited by the absence of comprehensive benchmarks for systematic investigation. To address this, we introduce CO-Bench, a benchmark suite featuring 36 real-world CO problems drawn from a broad range of domains and complexity levels. CO-Bench includes structured problem formulations and curated data to support rigorous investigation of LLM agents. We evaluate multiple agentic frameworks against established human-designed algorithms, revealing the strengths and limitations of existing LLM agents and identifying promising directions for future research. CO-Bench is publicly available at https://github.com/sunnweiwei/CO-Bench.

--------------------------------------------------------------------------------------------------------

LearnLM: Improving Gemini for Learning

Current generative AI systems default to information presentation rather than engaging pedagogical behavior that promotes learning. LearnLM reframes this challenge as pedagogical instruction following, where models receive system-level instructions describing desired teaching behaviors. This approach avoids committing to specific pedagogical definitions while enabling teachers and developers to specify desired behaviors. The framework allows integration with Gemini's expanding capabilities through post-training mixtures. LearnLM demonstrates substantial expert preferences across diverse learning scenarios, with significant improvements over competing models. Applications span educational technology, personalized tutoring, corporate training, and adaptive learning systems. The model's availability on Google AI Studio democratizes access to AI-powered educational tools. LearnLM could revolutionize online education, enable personalized learning experiences at scale, and support teachers with intelligent tutoring assistance, particularly valuable in regions with limited educational resources.

Authors:  LearnLM Team, Abhinit Modi, Aditya Srikanth Veerubhotla, Aliya Rysbek, Andrea Huber, Brett Wiltshire, Brian Veprek, Daniel Gillick, Daniel Kasenberg, Derek Ahmed, Irina Jurenka, James Cohan, Jennifer She, Julia Wilkowski, Kaiz Alarakyia, Kevin R. McKee, Lisa Wang, Markus Kunesch, Mike Schaekermann, Miruna Pîslar, Nikhil Joshi, Parsa Mahmoudieh, Paul Jhun, Sara Wiltberger, Shakir Mohamed, Shashank Agarwal, Shubham Milind Phal, Sun Jae Lee, Theofilos Strinopoulos, Wei-Jen Ko, Amy Wang, Ankit Anand, Avishkar Bhoopchand, Dan Wild, Divya Pandya, Filip Bar, Garth Graham, Holger Winnemoeller, Mahvish Nagda, Prateek Kolhar, Renee Schneider, Shaojian Zhu, Stephanie Chan, Steve Yadlowsky, Viknesh Sounderajah, Yannis Assael

Link:  https://arxiv.org/abs/2412.16429v3

Date: 2025-08-d

Summary:

Today's generative AI systems are tuned to present information by default, rather than engage users in service of learning as a human tutor would. To address the wide range of potential education use cases for these systems, we reframe the challenge of injecting pedagogical behavior as one of "pedagogical instruction following", where training and evaluation examples include system-level instructions describing the specific pedagogy attributes present or desired in subsequent model turns. This framing avoids committing our models to any particular definition of pedagogy, and instead allows teachers or developers to specify desired model behavior. It also clears a path to improving Gemini models for learning -- by enabling the addition of our pedagogical data to post-training mixtures -- alongside their rapidly expanding set of capabilities. Both represent important changes from our initial tech report. We show how training with pedagogical instruction following produces a LearnLM model (available on Google AI Studio) that experts substantially prefer across a diverse set of learning scenarios, with average preference strengths of +31% over GPT-4o, +11% over Claude 3.5 Sonnet, and +13% over the Gemini 1.5 Pro model on which LearnLM was based.
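
Since the framing above reduces to steering behavior with a system-level instruction that spells out the desired teaching moves, a minimal example is just a message list with a pedagogy-describing system turn. The structure and wording below are illustrative, not LearnLM's training data format.

# A hypothetical pedagogical system instruction; only the system turn changes the pedagogy.
messages = [
    {"role": "system",
     "content": ("You are a tutor. Do not give the final answer immediately. "
                 "Ask one diagnostic question first, then guide the student "
                 "step by step and check understanding before moving on.")},
    {"role": "user", "content": "Why does ice float on water?"},
]
print(messages[0]["content"])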

--------------------------------------------------------------------------------------------------------


EYE ON A.I. GETS READERS UP TO DATE ON THE LATEST FUNDING NEWS AND RELATED ISSUES. SUBSCRIBE FOR THE WEEKLY NEWSLETTER.