Week Ending 8.3.2025

 

RESEARCH WATCH: 8.3.2025

 

Explainable AI for Automated User-specific Feedback in Surgical Skill Acquisition

Surgical training has traditionally relied on limited expert supervision, creating bottlenecks in skill development due to faculty availability and subjective assessment variability. This research introduces an innovative solution using explainable AI to provide automated, personalized feedback for surgical trainees. The system analyzes video recordings of surgical procedures, extracting skill-related metrics and comparing performance against expert benchmarks. By offering objective, quantitative feedback through understandable proxies, this approach could revolutionize surgical education. Potential applications include autonomous training modules for medical schools, standardized competency assessment tools, and continuous professional development platforms. The technology could significantly improve training efficiency, reduce dependency on expert availability, and ensure consistent skill evaluation across different institutions, ultimately enhancing patient safety through better-trained surgeons.

Authors:  Catalina Gomez, Lalithkumar Seenivasan, Xinrui Zou, Jeewoo Yoon, Sirui Chu, Ariel Leong, Patrick Kramer, Yu-Chun Ku, Jose L. Porras, Alejandro Martin-Gomez, Masaru Ishii, Mathias Unberath

Link:  https://arxiv.org/abs/2508.02593v1

Date: 2025-08-d

Summary:

Traditional surgical skill acquisition relies heavily on expert feedback, yet direct access is limited by faculty availability and variability in subjective assessments. While trainees can practice independently, the lack of personalized, objective, and quantitative feedback reduces the effectiveness of self-directed learning. Recent advances in computer vision and machine learning have enabled automated surgical skill assessment, demonstrating the feasibility of automatic competency evaluation. However, it is unclear whether such Artificial Intelligence (AI)-driven feedback can contribute to skill acquisition. Here, we examine the effectiveness of explainable AI (XAI)-generated feedback in surgical training through a human-AI study. We create a simulation-based training framework that utilizes XAI to analyze videos and extract surgical skill proxies related to primitive actions. Our intervention provides automated, user-specific feedback by comparing trainee performance to expert benchmarks and highlighting deviations from optimal execution through understandable proxies for actionable guidance. In a prospective user study with medical students, we compare the impact of XAI-guided feedback against traditional video-based coaching on task outcomes, cognitive load, and trainees' perceptions of AI-assisted learning. Results showed improved cognitive load and confidence post-intervention. While no differences emerged between the two feedback types in reducing performance gaps or practice adjustments, trends in the XAI group revealed desirable effects where participants more closely mimicked expert practice. This work encourages the study of explainable AI in surgical education and the development of data-driven, adaptive feedback mechanisms that could transform learning experiences and competency assessment.
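
To make the feedback mechanism concrete, the sketch below compares a trainee's skill proxies for a few primitive actions against expert benchmark ranges and reports deviations as actionable notes. The proxy names, benchmark values, and tolerances are illustrative assumptions, not the study's actual metrics.

```python
# Minimal sketch of proxy-based feedback: compare a trainee's skill proxies for
# each primitive action against expert benchmark ranges and report deviations.
# The proxies, benchmark values, and tolerances are illustrative assumptions.
expert_benchmarks = {                    # proxy -> (expert mean, tolerance)
    "needle_insertion_time_s": (4.0, 1.5),
    "instrument_path_length_cm": (12.0, 3.0),
    "retraction_force_peak_n": (1.2, 0.4),
}
trainee_proxies = {
    "needle_insertion_time_s": 7.2,
    "instrument_path_length_cm": 13.1,
    "retraction_force_peak_n": 2.1,
}

def feedback(trainee: dict, benchmarks: dict) -> list[str]:
    notes = []
    for proxy, value in trainee.items():
        mean, tol = benchmarks[proxy]
        if abs(value - mean) > tol:
            direction = "above" if value > mean else "below"
            notes.append(f"{proxy}: {value:.1f} is {direction} the expert range "
                         f"({mean - tol:.1f}-{mean + tol:.1f}); adjust this action")
    return notes

for line in feedback(trainee_proxies, expert_benchmarks):
    print(line)
```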

--------------------------------------------------------------------------------------------------------

CAMA: Enhancing Mathematical Reasoning in Large Language Models with Causal Knowledge

Mathematical reasoning remains a significant challenge for large language models, primarily due to complex structural dependencies inherent in mathematical problems. CAMA addresses this limitation through a novel two-stage causal framework that explicitly maps mathematical knowledge structures. The system constructs Mathematical Causal Graphs that encode solution strategies and their dependencies, then dynamically extracts relevant subgraphs to guide reasoning for new problems. This approach transforms abstract mathematical relationships into actionable computational structures. Applications span educational technology, automated tutoring systems, mathematical problem-solving assistants, and research tools. The framework could enhance STEM education platforms, support mathematicians in complex proofs, improve AI-assisted mathematical research, and create more reliable quantitative analysis tools for finance, engineering, and scientific computing. CAMA represents a significant step toward AI systems capable of sophisticated mathematical reasoning.

Authors:  Lei Zan, Keli Zhang, Ruichu Cai, Lujia Pan

Link:  https://arxiv.org/abs/2508.02583v1

Date: 2025-08-d

Summary:

Large Language Models (LLMs) have demonstrated strong performance across a wide range of tasks, yet they still struggle with complex mathematical reasoning, a challenge fundamentally rooted in deep structural dependencies. To address this challenge, we propose CAusal MAthematician (CAMA), a two-stage causal framework that equips LLMs with explicit, reusable mathematical structure. In the learning stage, CAMA first constructs the Mathematical Causal Graph (MCG), a high-level representation of solution strategies, by combining LLM priors with causal discovery algorithms applied to a corpus of question-solution pairs. The resulting MCG encodes essential knowledge points and their causal dependencies. To better align the graph with downstream reasoning tasks, CAMA further refines the MCG through iterative feedback derived from a selected subset of the question-solution pairs. In the reasoning stage, given a new question, CAMA dynamically extracts a task-relevant subgraph from the MCG, conditioned on both the question content and the LLM's intermediate reasoning trace. This subgraph, which encodes the most pertinent knowledge points and their causal dependencies, is then injected back into the LLM to guide its reasoning process. Empirical results on real-world datasets show that CAMA significantly improves LLM performance on challenging mathematical problems. Furthermore, our experiments demonstrate that structured guidance consistently outperforms unstructured alternatives, and that incorporating asymmetric causal relationships yields greater improvements than using symmetric associations alone.
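
The reasoning stage is easiest to picture as a graph operation: knowledge points and their causal dependencies are stored offline, and a question-relevant subgraph is pulled out at inference time and injected into the prompt. Below is a minimal Python sketch of that retrieval step using networkx; the graph contents, keyword-overlap relevance score, and extract_relevant_subgraph helper are illustrative assumptions, not the authors' construction.

```python
# Minimal sketch of a CAMA-style "reasoning stage": given a causal graph of
# knowledge points, extract a small subgraph relevant to a new question.
# Graph contents and the keyword-overlap relevance score are illustrative
# assumptions, not the paper's actual construction.
import networkx as nx

# Offline "learning stage" output: nodes are knowledge points, directed edges
# are causal dependencies between solution steps.
mcg = nx.DiGraph()
mcg.add_edges_from([
    ("define variables", "set up equation"),
    ("set up equation", "solve linear system"),
    ("factorization", "solve quadratic"),
    ("solve linear system", "check solution"),
    ("solve quadratic", "check solution"),
])

def relevance(node: str, question: str) -> float:
    """Toy relevance: fraction of the node's words appearing in the question."""
    words = node.split()
    return sum(w in question.lower() for w in words) / len(words)

def extract_relevant_subgraph(graph: nx.DiGraph, question: str, k: int = 3) -> nx.DiGraph:
    """Keep the k most relevant knowledge points plus their causal ancestors."""
    scored = sorted(graph.nodes, key=lambda n: relevance(n, question), reverse=True)
    seeds = scored[:k]
    keep = set(seeds)
    for s in seeds:                      # pull in causal dependencies feeding each seed
        keep |= nx.ancestors(graph, s)
    return graph.subgraph(keep).copy()

question = "Solve the quadratic equation x^2 - 5x + 6 = 0 and check the solution."
sub = extract_relevant_subgraph(mcg, question)
print(sorted(sub.edges()))  # injected back into the LLM prompt as structured guidance
```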

--------------------------------------------------------------------------------------------------------

Automatic Identification of Machine Learning-Specific Code Smells

As machine learning adoption accelerates across industries, code quality becomes increasingly critical for reliable ML systems. This research addresses the gap in ML-specific code analysis by developing MLpylint, a static analysis tool designed to identify problematic patterns unique to ML codebases. Unlike traditional code smell detection, this tool understands ML-specific issues such as data leakage, improper model validation, and inefficient tensor operations. The tool was validated on 160 open-source ML projects and evaluated by 15 ML professionals. Applications include automated code review systems for ML teams, continuous integration pipelines for ML projects, educational tools for teaching ML best practices, and quality assurance frameworks for production ML systems. This technology could significantly improve ML system reliability, reduce debugging time, enhance team productivity, and establish industry standards for ML code quality, ultimately leading to more robust AI applications.

Authors:  Peter Hamfelt, Ricardo Britto, Lincoln Rocha, Camilo Almendra

Link:  https://arxiv.org/abs/2508.02541v1

Date: 2025-08-d

Summary:

Machine learning (ML) has rapidly grown in popularity, becoming vital to many industries. Currently, research on code smells in ML applications lacks tools and studies that address the identification and validity of ML-specific code smells. This work investigates suitable methods and tools to design and develop a static code analysis tool (MLpylint) based on code smell criteria. This research employed the Design Science Methodology. In the problem identification phase, a literature review was conducted to identify ML-specific code smells. In solution design, a secondary literature review and consultations with experts were performed to select methods and tools for implementing the tool. We evaluated the tool on data from 160 open-source ML applications sourced from GitHub. We also conducted a static validation through an expert survey involving 15 ML professionals. The results indicate the effectiveness and usefulness of MLpylint. We aim to extend our current approach by investigating ways to introduce MLpylint seamlessly into development workflows, fostering a more productive and innovative developer environment.
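
Static detection of ML-specific smells typically walks the abstract syntax tree and flags known anti-patterns. The snippet below is a minimal, hypothetical illustration of that approach using Python's ast module; the single rule shown (scaling the full dataset before the train/test split, a common data-leakage smell) and the warning code are assumed examples, not actual MLpylint checkers.

```python
# Minimal, hypothetical sketch of an AST-based checker for one ML code smell:
# scaling the full dataset before train/test splitting (a data-leakage pattern).
# This illustrates the general approach, not an actual MLpylint rule.
import ast

SMELL_SOURCE = """
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)          # fitted on all rows...
X_train, X_test, y_train, y_test = train_test_split(X_scaled, y)
"""

class LeakageChecker(ast.NodeVisitor):
    def __init__(self):
        self.fit_transform_line = None
        self.split_line = None

    def visit_Call(self, node: ast.Call):
        name = getattr(node.func, "attr", getattr(node.func, "id", ""))
        if name == "fit_transform" and self.fit_transform_line is None:
            self.fit_transform_line = node.lineno
        if name == "train_test_split":
            self.split_line = node.lineno
        self.generic_visit(node)

checker = LeakageChecker()
checker.visit(ast.parse(SMELL_SOURCE))
if (checker.fit_transform_line is not None and checker.split_line is not None
        and checker.fit_transform_line < checker.split_line):
    # "W9001" is a made-up warning code for illustration.
    print(f"W9001: possible data leakage, fit_transform on line "
          f"{checker.fit_transform_line} precedes train_test_split on line "
          f"{checker.split_line}")
```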

--------------------------------------------------------------------------------------------------------

Is Uncertainty Quantification a Viable Alternative to Learned Deferral?

In critical AI applications like healthcare, knowing when AI systems should defer decisions to human experts is paramount for patient safety. This research investigates two approaches: learned deferral methods that optimize trade-offs between autonomous prediction and human consultation, and uncertainty quantification methods that estimate model confidence. Using a large ophthalmology dataset for glaucoma detection, the study evaluates both approaches' robustness to out-of-distribution data. The findings suggest uncertainty quantification may be more reliable for clinical deployment due to better generalization capabilities. Applications include medical diagnostic systems, autonomous vehicle safety protocols, financial risk assessment tools, and quality control in manufacturing. This research could inform the design of safer AI systems across high-stakes domains, establishing frameworks for human-AI collaboration that maintain performance while ensuring safety through appropriate expert intervention when AI confidence is insufficient.

Authors:  Anna M. Wundram, Christian F. Baumgartner

Link:  https://arxiv.org/abs/2508.02319v1

Date: 2025-08-d

Summary:

Artificial Intelligence (AI) holds the potential to dramatically improve patient care. However, it is not infallible, necessitating human-AI collaboration to ensure safe implementation. One aspect of AI safety is the models' ability to defer decisions to a human expert when they are likely to misclassify autonomously. Recent research has focused on methods that learn to defer by optimising a surrogate loss function that finds the optimal trade-off between predicting a class label or deferring. However, during clinical translation, models often face challenges such as data shift. Uncertainty quantification methods aim to estimate a model's confidence in its predictions. However, they may also be used as a deferral strategy that does not rely on learning from a specific training distribution. We hypothesise that models developed to quantify uncertainty are more robust to out-of-distribution (OOD) input than learned deferral models that have been trained in a supervised fashion. To investigate this hypothesis, we constructed an extensive evaluation study on a large ophthalmology dataset, examining both learned deferral models and established uncertainty quantification methods, assessing their performance in- and out-of-distribution. Specifically, we evaluate their ability to accurately classify glaucoma from fundus images while deferring cases with a high likelihood of error. We find that uncertainty quantification methods may be a promising choice for AI deferral.
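
The uncertainty-based deferral strategy examined here is conceptually simple: score each case by the model's predictive uncertainty and send the most uncertain fraction to a human expert. A minimal sketch using predictive entropy over Monte Carlo samples is shown below; the toy classifier, sample count, and deferral budget are assumptions for illustration only.

```python
# Minimal sketch of uncertainty-based deferral: score each case by predictive
# entropy over stochastic forward passes and defer the most uncertain fraction
# to a human expert. The toy model, sample count, and budget are assumptions.
import numpy as np

rng = np.random.default_rng(0)

def mc_predict(x, n_samples: int = 20) -> np.ndarray:
    """Stand-in for MC-dropout passes of a glaucoma classifier.
    Returns an array of shape (n_samples,) of predicted P(glaucoma)."""
    base = 1.0 / (1.0 + np.exp(-x))                # toy logistic "model"
    return np.clip(base + rng.normal(0, 0.1, n_samples), 1e-6, 1 - 1e-6)

def predictive_entropy(probs: np.ndarray) -> float:
    p = probs.mean()                               # mean predictive probability
    return float(-(p * np.log(p) + (1 - p) * np.log(1 - p)))

logits = rng.normal(0, 2, size=200)                # 200 toy fundus-image cases
entropies = np.array([predictive_entropy(mc_predict(x)) for x in logits])

defer_budget = 0.2                                 # send 20% of cases to a human
threshold = np.quantile(entropies, 1 - defer_budget)
deferred = entropies >= threshold
print(f"deferred {deferred.sum()} of {len(logits)} cases to the expert")
```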

--------------------------------------------------------------------------------------------------------

VeOmni: Scaling Any Modality Model Training with Model-Centric Distributed Recipe Zoo

Training omni-modal large language models capable of processing text, images, audio, and video simultaneously presents enormous computational challenges. VeOmni addresses these scaling issues through a modular training framework that decouples communication from computation, enabling efficient 3D parallelism. The system achieved impressive benchmarks: 2,800 tokens/sec/GPU throughput for a 30B parameter model and scaling to 160K context lengths on 128 GPUs. This framework supports seamless integration of new modalities with minimal code changes. Applications span multimodal AI assistants, content creation platforms, educational tools that process diverse media types, accessibility technologies, and research platforms for multimodal AI development. VeOmni could accelerate development of next-generation AI systems capable of understanding and generating across all media types, enabling more natural human-computer interaction and comprehensive AI applications that mirror human multimodal communication abilities.

Authors:  Qianli Ma, Yaowei Zheng, Zhelun Shi, Zhongkai Zhao, Bin Jia, Ziyue Huang, Zhiqi Lin, Youjie Li, Jiacheng Yang, Yanghua Peng, Zhi Zhang, Xin Liu

Link:  https://arxiv.org/abs/2508.02317v1

Date: 2025-08-d

Summary:

Recent advances in large language models (LLMs) have driven impressive progress in omni-modal understanding and generation. However, training omni-modal LLMs remains a significant challenge due to the heterogeneous model architectures required to process diverse modalities, necessitating sophisticated system design for efficient large-scale training. Existing frameworks typically entangle model definition with parallel logic, incurring limited scalability and substantial engineering overhead for end-to-end omni-modal training. We present VeOmni, a modular and efficient training framework to accelerate the development of omni-modal LLMs. VeOmni introduces model-centric distributed recipes that decouple communication from computation, enabling efficient 3D parallelism for omni-modal LLMs. VeOmni also features a flexible configuration interface supporting seamless integration of new modalities with minimal code change. Using VeOmni, an omni-modal mixture-of-experts (MoE) model with 30B parameters can be trained with over 2,800 tokens/sec/GPU throughput and scaled to 160K context lengths via 3D parallelism on 128 GPUs, showcasing its superior efficiency and scalability for training large omni-modal LLMs.

--------------------------------------------------------------------------------------------------------

Large AI Models for Wireless Physical Layer

Wireless communication systems are undergoing transformation through large AI model integration at the physical layer. This comprehensive review examines how Large AI Models (LAMs) address limitations of conventional AI approaches in wireless communications through superior generalization, multitask processing, and multimodal capabilities. The research categorizes applications into two strategies: leveraging pre-trained LAMs and developing native LAMs specifically for physical layer tasks. Both approaches demonstrate significant performance improvements across diverse wireless scenarios. Applications include next-generation 5G/6G networks, adaptive spectrum management, intelligent beamforming, network optimization, and edge computing solutions. This technology could revolutionize wireless infrastructure by enabling self-optimizing networks, improving spectral efficiency, reducing interference, and supporting emerging applications like massive IoT deployments. The integration of LAMs in wireless systems promises more robust, adaptive, and efficient communication networks capable of handling increasing data demands and complexity.

Authors:  Jiajia Guo, Yiming Cui, Shi Jin, Jun Zhang

Link:  https://arxiv.org/abs/2508.02314v1

Date: 2025-08-d

Summary:

Large artificial intelligence models (LAMs) are transforming wireless physical layer technologies through their robust generalization, multitask processing, and multimodal capabilities. This article reviews recent advancements in LAM applications for physical layer communications, addressing limitations of conventional AI-based approaches. LAM applications are classified into two strategies: leveraging pre-trained LAMs and developing native LAMs designed specifically for physical layer tasks. The motivations and key frameworks of these approaches are comprehensively examined through multiple use cases. Both strategies significantly improve performance and adaptability across diverse wireless scenarios. Future research directions, including efficient architectures, interpretability, standardized datasets, and collaboration between large and small models, are proposed to advance LAM-based physical layer solutions for next-generation communication systems.

--------------------------------------------------------------------------------------------------------

A Survey on Data Security in Large Language Models

Large Language Models have revolutionized natural language processing but face critical data security vulnerabilities due to their dependence on massive, often uncurated training datasets. This comprehensive survey examines security risks including toxic output generation, hallucinations, prompt injection attacks, and data poisoning. The research reviews current defense strategies such as adversarial training, reinforcement learning from human feedback, and data augmentation techniques. Additionally, it analyzes datasets used for security assessment across different domains. Applications include secure AI development frameworks, content moderation systems, enterprise AI governance platforms, and regulatory compliance tools. This work could inform the development of safer AI systems, establish security standards for LLM deployment, guide policymaking for AI regulation, and create tools for assessing and mitigating AI risks. As LLMs integrate into critical systems, this research provides essential guidance for maintaining user trust and system reliability.

Authors:  Kang Chen, Xiuze Zhou, Yuanguo Lin, Jinhe Su, Yuanhui Yu, Li Shen, Fan Lin

Link:  https://arxiv.org/abs/2508.02312v1

Date: 2025-08-d

Summary:

Large Language Models (LLMs), now a foundation in advancing natural language processing, power applications such as text generation, machine translation, and conversational systems. Despite their transformative potential, these models inherently rely on massive amounts of training data, often collected from diverse and uncurated sources, which exposes them to serious data security risks. Harmful or malicious data can compromise model behavior, leading to issues such as toxic output, hallucinations, and vulnerabilities to threats such as prompt injection or data poisoning. As LLMs continue to be integrated into critical real-world systems, understanding and addressing these data-centric security risks is imperative to safeguard user trust and system reliability. This survey offers a comprehensive overview of the main data security risks facing LLMs and reviews current defense strategies, including adversarial training, RLHF, and data augmentation. Additionally, we categorize and analyze relevant datasets used for assessing robustness and security across different domains, providing guidance for future research. Finally, we highlight key research directions that focus on secure model updates, explainability-driven defenses, and effective governance frameworks, aiming to promote the safe and responsible development of LLM technology. This work aims to inform researchers, practitioners, and policymakers, driving progress toward data security in LLMs.

--------------------------------------------------------------------------------------------------------

CAPO: Towards Enhancing LLM Reasoning through Verifiable Generative Credit Assignment

Current reinforcement learning approaches for improving LLM reasoning assign uniform rewards to entire responses, hindering precise identification of which reasoning steps contribute to success or failure. CAPO introduces a novel solution using generative process reward models to provide step-wise critique and token-level credit assignment. The method leverages off-the-shelf LLMs to generate detailed feedback for each reasoning step, enabling more granular reward distribution. Voting mechanisms enhance accuracy and robustness across multiple critiques. Applications include advanced tutoring systems that provide specific feedback on student reasoning, automated code review tools that identify problematic logic steps, scientific reasoning assistants that verify hypothesis formation, and mathematical problem-solving platforms that guide students through complex proofs. CAPO could significantly improve AI reasoning capabilities across domains requiring multi-step logical thinking, leading to more reliable AI assistants, better educational tools, and enhanced decision-support systems that can explain their reasoning processes clearly.

Authors:  Guofu Xie, Yunsheng Shi, Hongtao Tian, Ting Yao, Xiao Zhang

Link:  https://arxiv.org/abs/2508.02298v1

Date: 2025-08-d

Summary:

Reinforcement Learning with Verifiable Rewards (RLVR) has improved the reasoning abilities of Large Language Models (LLMs) by using rule-based binary feedback, helping to mitigate reward hacking. However, current RLVR methods typically treat whole responses as single actions, assigning the same reward to every token. This coarse-grained feedback hampers precise credit assignment, making it hard for models to identify which reasoning steps lead to success or failure, and often results in suboptimal policies and inefficient learning. Methods like PPO provide credit assignment through value estimation, but often yield inaccurate and unverifiable signals due to limited sampling. On the other hand, methods using Process Reward Models can provide step-by-step judgments for each reasoning step, but they require high-quality process supervision labels and are time-consuming when applied in online reinforcement learning (RL). To overcome these limitations, we introduce a simple but efficient method, Credit Assignment Policy Optimization (CAPO). Given a reasoning response rollout from the policy model, CAPO directly leverages an off-the-shelf, general-purpose LLM as a Generative Process Reward Model (LLM-as-GenPRM) to generate all step-wise critiques in a single pass, thereby providing verifiable token-level rewards to refine the tokens that were originally assigned identical rule-based rewards. This enables more fine-grained credit assignment in an effective way. Furthermore, to enhance the accuracy and robustness of CAPO, we employ voting mechanisms that scale with the number of generated critiques. Extensive experiments on different backbones, such as Llama and Qwen models of various sizes, show that CAPO consistently outperforms supervised learning-based and RL-based fine-tuning methods across six challenging mathematical benchmarks and three out-of-domain benchmarks.
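
The core mechanism can be sketched in a few lines: ask a general-purpose LLM to critique each reasoning step, repeat the critique a few times, vote on each step's verdict, and spread the resulting step reward over that step's tokens. In the minimal sketch below, the critique_step stub and the +1/-1 reward values are placeholders for the actual GenPRM prompt and reward scheme.

```python
# Minimal sketch of generative credit assignment: sample several step-wise
# critiques from an off-the-shelf LLM, majority-vote each step's verdict, and
# spread the step reward over that step's tokens. The critique_step() stub and
# the +1/-1 reward values are illustrative assumptions.
import random
from collections import Counter

random.seed(0)

def critique_step(question: str, steps: list[str], idx: int) -> bool:
    """Stub for an LLM-as-GenPRM call: returns True if step idx looks correct."""
    return random.random() > 0.3          # placeholder for a real model call

def step_rewards(question: str, steps: list[str], n_votes: int = 5) -> list[float]:
    rewards = []
    for i in range(len(steps)):
        votes = Counter(critique_step(question, steps, i) for _ in range(n_votes))
        verdict = votes.most_common(1)[0][0]          # majority vote over critiques
        rewards.append(1.0 if verdict else -1.0)
    return rewards

def token_rewards(steps: list[str], rewards: list[float]) -> list[float]:
    """Assign each token the reward of the step it belongs to."""
    out = []
    for step, r in zip(steps, rewards):
        out.extend([r] * len(step.split()))
    return out

steps = ["Let x be the unknown.", "Then 2x + 3 = 11, so x = 4.", "Check: 2*4 + 3 = 11."]
r = step_rewards("Solve 2x + 3 = 11", steps)
print(r, token_rewards(steps, r)[:8])
```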

--------------------------------------------------------------------------------------------------------

Forecasting When to Forecast: Accelerating Diffusion Models with Confidence-Gated Taylor

Diffusion Transformers excel at visual generation but suffer from slow inference speeds, limiting their practical deployment. This research introduces a novel acceleration method using Taylor expansion to predict future features from cached past representations, combined with confidence-gating mechanisms. Unlike previous approaches requiring fine-grained feature storage, this method operates at the block level, reducing memory overhead. The confidence-gating system dynamically decides between Taylor prediction and full computation based on prediction reliability. Results show impressive acceleration: 3.17x on FLUX, 2.36x on DiT, and 4.14x on Wan Video with minimal quality loss. Applications include real-time content generation for gaming, interactive design tools, video editing software, AR/VR applications, and mobile content creation apps. This technology could democratize high-quality visual generation by making it accessible on resource-constrained devices, enabling new creative applications and improving user experience in multimedia platforms through faster, more responsive generation capabilities.

Authors:  Xiaoliu Guan, Lielin Jiang, Hanqi Chen, Xu Zhang, Jiaxing Yan, Guanzhong Wang, Yi Liu, Zetao Zhang, Yu Wu

Link:  https://arxiv.org/abs/2508.02240v1

Date: 2025-08-d

Summary:

Diffusion Transformers (DiTs) have demonstrated remarkable performance in visual generation tasks. However, their low inference speed limits their deployment in low-resource applications. Recent training-free approaches exploit the redundancy of features across timesteps by caching and reusing past representations to accelerate inference. Building on this idea, TaylorSeer instead uses cached features to predict future ones via Taylor expansion. However, its module-level prediction across all transformer blocks (e.g., attention or feedforward modules) requires storing fine-grained intermediate features, leading to notable memory and computation overhead. Moreover, it adopts a fixed caching schedule without considering the varying accuracy of predictions across timesteps, which can lead to degraded outputs when prediction fails. To address these limitations, we propose a novel approach to better leverage Taylor-based acceleration. First, we shift the Taylor prediction target from the module level to the last block level, significantly reducing the number of cached features. Furthermore, observing strong sequential dependencies among Transformer blocks, we propose to use the error between the Taylor-estimated and actual outputs of the first block as an indicator of prediction reliability. If the error is small, we trust the Taylor prediction for the last block; otherwise, we fall back to full computation, thereby enabling a dynamic caching mechanism. Empirical results show that our method achieves a better balance between speed and quality, achieving a 3.17x acceleration on FLUX, 2.36x on DiT, and 4.14x on Wan Video with negligible quality drop. The project page is available at https://cg-taylor-acce.github.io/CG-Taylor/.
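
The acceleration reduces to a small amount of per-timestep bookkeeping: cache recent outputs, extrapolate the next ones with a first-order (finite-difference) Taylor step, and only trust the extrapolation when the same prediction on the first block matches that block's actual output closely. The toy blocks and error tolerance in the sketch below are illustrative assumptions.

```python
# Minimal sketch of confidence-gated Taylor caching: extrapolate the next
# timestep's features from cached ones (first-order finite difference) and fall
# back to full computation when the same extrapolation fails on the first
# block. The toy blocks and the error tolerance are illustrative assumptions.
import numpy as np

def taylor_predict(prev, prev2):
    """First-order extrapolation: x_t ~ x_{t-1} + (x_{t-1} - x_{t-2})."""
    return prev + (prev - prev2)

def first_block(t):   # stand-ins for the transformer's first / last blocks
    return np.array([np.sin(0.1 * t), np.cos(0.1 * t)])

def last_block(t):
    return np.array([np.sin(0.1 * t + 1.0), 0.05 * t])

first_hist, last_hist, skipped = [], [], 0
for t in range(20):                       # 20 denoising timesteps
    f = first_block(t)                    # first block is always computed
    use_taylor = False
    if len(first_hist) >= 2:
        f_pred = taylor_predict(first_hist[-1], first_hist[-2])
        err = np.linalg.norm(f_pred - f) / (np.linalg.norm(f) + 1e-8)
        use_taylor = err < 0.05           # gate: trust Taylor only when error is small
    first_hist.append(f)

    if use_taylor:
        out = taylor_predict(last_hist[-1], last_hist[-2])   # cheap prediction
        skipped += 1
    else:
        out = last_block(t)                                   # full computation
    last_hist.append(out)

print(f"skipped full computation on {skipped} of 20 steps")
```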

--------------------------------------------------------------------------------------------------------

Balancing Information Accuracy and Response Timeliness in Networked LLMs

The computational demands of large language models present significant deployment challenges, driving interest in networked systems using smaller, specialized models. This research investigates a distributed architecture where topic-specialized LLMs collaborate to answer categorical queries. Users submit binary questions routed to relevant LLM clusters, with responses aggregated for improved accuracy. The study formulates optimization problems balancing information accuracy against response time, demonstrating that aggregated responses consistently outperform individual models. Applications include distributed customer support systems, collaborative research platforms, real-time information networks, edge computing deployments, and multi-expert consultation systems. This approach could enable more efficient AI services by reducing individual model computational requirements while maintaining high performance. The framework supports scalable AI deployment in resource-constrained environments, cost-effective enterprise AI solutions, and specialized domain applications where expertise distribution improves both efficiency and accuracy compared to monolithic large models.

Authors:  Yigit Turkmen, Baturalp Buyukates, Melih Bastopcu

Link:  https://arxiv.org/abs/2508.02209v1

Date: 2025-08-d

Summary:

Recent advancements in Large Language Models (LLMs) have transformed many fields including scientific discovery, content generation, biomedical text mining, and educational technology. However, the substantial requirements for training data, computational resources, and energy consumption pose significant challenges for their practical deployment. A promising alternative is to leverage smaller, specialized language models and aggregate their outputs to improve overall response quality. In this work, we investigate a networked LLM system composed of multiple users, a central task processor, and clusters of topic-specialized LLMs. Each user submits categorical binary (true/false) queries, which are routed by the task processor to a selected cluster of m LLMs. After gathering individual responses, the processor returns a final aggregated answer to the user. We characterize both the information accuracy and response timeliness in this setting, and formulate a joint optimization problem to balance these two competing objectives. Our extensive simulations demonstrate that the aggregated responses consistently achieve higher accuracy than those of individual LLMs. Notably, this improvement is more significant when the participating LLMs exhibit similar standalone performance.
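
The accuracy side of the trade-off is the probability that a majority of the m queried models answers a true/false question correctly, while the timeliness side grows with m. The sketch below computes both under simplifying assumptions (i.i.d. per-model accuracy and a linear latency model) that are not the paper's system model.

```python
# Minimal sketch of the accuracy/timeliness trade-off for aggregating m
# specialized LLMs on a binary query: majority-vote accuracy (i.i.d. models)
# versus a toy latency that grows with m. Per-model accuracy and the latency
# model are simplifying assumptions.
from math import comb

def majority_accuracy(p: float, m: int) -> float:
    """P(majority of m i.i.d. models with accuracy p is correct), m odd."""
    return sum(comb(m, k) * p**k * (1 - p)**(m - k) for k in range((m // 2) + 1, m + 1))

def response_time(m: int, per_model_latency: float = 0.8) -> float:
    return per_model_latency * m          # toy serial-processing latency

for m in (1, 3, 5, 7):
    acc = majority_accuracy(0.7, m)
    print(f"m={m}: accuracy={acc:.3f}, latency={response_time(m):.1f}s")
```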

--------------------------------------------------------------------------------------------------------

Large-Scale Model Enabled Semantic Communication Based on Robust Knowledge Distillation

Semantic communication systems aim to transmit meaning rather than raw data, promising more efficient information exchange. This research proposes RKD-SC, a framework using robust knowledge distillation to create channel-noise-robust semantic communication powered by large-scale models. The approach addresses two key challenges: designing optimal compact architectures and maintaining robustness against channel noise. The framework includes knowledge distillation-based architecture search and a novel two-stage robust knowledge distillation algorithm. A channel-aware transformer provides adaptive channel coding under diverse conditions. Applications include next-generation wireless networks with intelligent data compression, IoT systems with limited bandwidth, satellite communications, military communications requiring robust transmission, and mobile networks seeking efficiency improvements. This technology could revolutionize telecommunications by dramatically reducing bandwidth requirements while maintaining information fidelity, enabling new applications in remote areas, supporting massive IoT deployments, and improving communication efficiency in challenging environments where traditional systems struggle.

Authors:  Kuiyuan Ding, Caili Guo, Yang Yang, Zhongtian Du, Walid Saad

Link:  https://arxiv.org/abs/2508.02148v1

Date: 2025-08-d

Summary:

Large-scale models (LSMs) can be an effective framework for semantic representation and understanding, thereby providing a suitable tool for designing semantic communication (SC) systems. However, their direct deployment is often hindered by high computational complexity and resource requirements. In this paper, a novel robust knowledge distillation based semantic communication (RKD-SC) framework is proposed to enable efficient and channel-noise-robust LSM-powered SC. The framework addresses two key challenges: determining optimal compact model architectures and effectively transferring knowledge while maintaining robustness against channel noise. First, a knowledge distillation-based lightweight differentiable architecture search (KDL-DARTS) algorithm is proposed. This algorithm integrates knowledge distillation loss and a complexity penalty into the neural architecture search process to identify high-performance, lightweight semantic encoder architectures. Second, a novel two-stage robust knowledge distillation (RKD) algorithm is developed to transfer semantic capabilities from an LSM (teacher) to a compact encoder (student) and subsequently enhance system robustness. To further improve resilience to channel impairments, a channel-aware transformer (CAT) block is introduced as the channel codec, trained under diverse channel conditions with variable-length outputs. Extensive simulations on image classification tasks demonstrate that the RKD-SC framework significantly reduces model parameters while preserving a high degree of the teacher model's performance and exhibiting superior robustness compared to existing methods.
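
The robust-distillation idea can be sketched compactly: a compact student encoder is trained to match a frozen large teacher's features while additive Gaussian channel noise is applied during training. The architectures, SNR level, and loss in the PyTorch sketch below are illustrative assumptions, not the RKD-SC algorithm itself.

```python
# Minimal sketch of robust distillation for semantic communication: train a
# compact student encoder to match a frozen teacher's features, with additive
# Gaussian channel noise applied during training for robustness. The toy
# architectures, noise level, and loss are illustrative assumptions.
import torch
import torch.nn as nn

torch.manual_seed(0)
teacher = nn.Sequential(nn.Linear(128, 256), nn.ReLU(), nn.Linear(256, 64)).eval()
student = nn.Sequential(nn.Linear(128, 64), nn.ReLU(), nn.Linear(64, 64))
for p in teacher.parameters():
    p.requires_grad_(False)

def awgn(z: torch.Tensor, snr_db: float = 10.0) -> torch.Tensor:
    """Additive white Gaussian noise at a given SNR (toy channel model)."""
    power = z.pow(2).mean()
    noise_power = power / (10 ** (snr_db / 10))
    return z + torch.randn_like(z) * noise_power.sqrt()

opt = torch.optim.Adam(student.parameters(), lr=1e-3)
for step in range(200):
    x = torch.randn(32, 128)                       # toy input batch
    with torch.no_grad():
        target = teacher(x)                        # teacher semantic features
    pred = awgn(student(x))                        # student features after the channel
    loss = nn.functional.mse_loss(pred, target)    # distillation loss
    opt.zero_grad(); loss.backward(); opt.step()

print(f"final distillation loss: {loss.item():.4f}")
```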

--------------------------------------------------------------------------------------------------------

The Complexity of Extreme Climate Events on New Zealand's Kiwifruit Industry

Climate change intensifies extreme weather events, creating unprecedented challenges for agricultural industries worldwide. This research focuses on New Zealand's kiwifruit farming, examining impacts of frost, drought, extreme rainfall, and heatwaves on harvest yields. Using Isolation Forest anomaly detection, the study analyzes climate history and recorded extreme events alongside yield data. Results reveal considerable variability in how different extreme events affect yields, highlighting discrepancies between climatic extremes and individual farm outcomes. The research identifies limitations in current anomaly detection approaches and emphasizes the need for integrating farm management strategies with climate adaptation practices. Applications include agricultural risk assessment tools, crop insurance models, climate adaptation strategies for farmers, precision agriculture systems, and policy frameworks for agricultural resilience. This work could inform sustainable farming practices, support agricultural decision-making systems, guide crop diversification strategies, and help develop early warning systems for extreme weather impacts on agriculture.

Authors:  Boyuan Zheng, Victor W. Chu, Zhidong Li, Evan Webster, Ashley Rootsey

Link:  https://arxiv.org/abs/2508.02130v1

Date: 2025-08-d

Summary:

Climate change has intensified the frequency and severity of extreme weather events, presenting unprecedented challenges to the agricultural industry worldwide. In this investigation, we focus on kiwifruit farming in New Zealand. We propose to examine the impacts of climate-induced extreme events, specifically frost, drought, extreme rainfall, and heatwave, on kiwifruit harvest yields. These four events were selected due to their significant impacts on crop productivity and their prevalence as recorded by climate monitoring institutions in the country. We employed Isolation Forest, an unsupervised anomaly detection method, to analyse climate history and recorded extreme events, alongside kiwifruit yields. Our analysis reveals considerable variability in how different types of extreme event affect kiwifruit yields, underscoring notable discrepancies between climatic extremes and individual farms' yield outcomes. Additionally, our study highlights critical limitations of current anomaly detection approaches, particularly in accurately identifying events such as frost. These findings emphasise the need for integrating supplementary features, such as farm management strategies, with climate adaptation practices. Our further investigation will employ ensemble methods that consolidate nearby farms' yield data and regional climate station features to reduce variance, thereby enhancing the accuracy and reliability of extreme event detection and the formulation of response strategies.
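
The anomaly-detection step can be reproduced in outline with scikit-learn's IsolationForest: fit on daily climate features and flag anomalous days as candidate extreme events. The synthetic data, feature set, and contamination rate below are illustrative, not the study's.

```python
# Minimal sketch of the anomaly-detection step: fit an Isolation Forest on
# daily climate features and flag anomalous days as candidate extreme events.
# The synthetic data, feature set, and contamination rate are illustrative.
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(42)
n_days = 365
# toy features per day: min temperature (C), rainfall (mm), max temperature (C)
climate = np.column_stack([
    rng.normal(8, 4, n_days),      # overnight minimum temperature
    rng.gamma(2.0, 5.0, n_days),   # daily rainfall
    rng.normal(22, 5, n_days),     # daytime maximum temperature
])
climate[10, 0] = -6.0              # injected frost event
climate[200, 1] = 180.0            # injected extreme-rainfall event

model = IsolationForest(contamination=0.02, random_state=0).fit(climate)
labels = model.predict(climate)    # -1 marks anomalies
print("candidate extreme-event days:", np.where(labels == -1)[0])
```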

--------------------------------------------------------------------------------------------------------

A Survey on AgentOps: Categorization, Challenges, and Future Directions

LLM-based agent systems offer significant advantages in flexibility and interpretability over traditional systems, driving widespread adoption across industries. However, like conventional systems, these agents frequently encounter anomalies leading to instability and insecurity. This comprehensive survey establishes the first systematic framework for agent system operations, dubbed AgentOps. The research categorizes anomalies into intra-agent and inter-agent types and defines four key operational stages: monitoring, anomaly detection, root cause analysis, and resolution. Applications include autonomous vehicle fleet management, financial trading systems, healthcare automation, smart city infrastructure, industrial process control, and customer service platforms. This framework could establish industry standards for agent system reliability, inform development of monitoring tools for AI systems, guide regulatory approaches for autonomous systems, create training programs for AI system operators, and support the development of self-healing agent architectures that can automatically detect and resolve operational issues.

Authors:  Zexin Wang, Jingjing Li, Quan Zhou, Haotian Si, Yuanhao Liu, Jianhui Li, Gaogang Xie, Fei Sun, Dan Pei, Changhua Pei

Link:  https://arxiv.org/abs/2508.02121v1

Date: 2025-08-d

Summary:

As the reasoning capabilities of Large Language Models (LLMs) continue to advance, LLM-based agent systems offer advantages in flexibility and interpretability over traditional systems, garnering increasing attention. However, despite the widespread research interest and industrial application of agent systems, these systems, like their traditional counterparts, frequently encounter anomalies. These anomalies lead to instability and insecurity, hindering their further development. Therefore, a comprehensive and systematic approach to the operation and maintenance of agent systems is urgently needed. Unfortunately, current research on the operations of agent systems is sparse. To address this gap, we have undertaken a survey on agent system operations with the aim of establishing a clear framework for the field, defining the challenges, and facilitating further development. Specifically, this paper begins by systematically defining anomalies within agent systems, categorizing them into intra-agent anomalies and inter-agent anomalies. Next, we introduce a novel and comprehensive operational framework for agent systems, dubbed Agent System Operations (AgentOps). We provide detailed definitions and explanations of its four key stages: monitoring, anomaly detection, root cause analysis, and resolution.

--------------------------------------------------------------------------------------------------------

Don't Overthink It: A Survey of Efficient R1-style Large Reasoning Models

Large Reasoning Models like DeepSeek R1 demonstrate exceptional performance in complex tasks through enhanced logical deduction and decision-making capabilities. However, these models often exhibit "overthinking" behavior, constructing excessively long reasoning chains with redundant steps, reducing efficiency and potentially affecting accuracy. This survey systematically reviews efficient reasoning methods, categorizing approaches into single-model optimization and model collaboration strategies. The research addresses the critical balance between reasoning depth and computational efficiency. Applications include automated theorem proving, scientific research assistance, legal document analysis, financial risk assessment, strategic planning tools, and educational platforms requiring step-by-step reasoning. This work could inform development of more efficient AI reasoning systems, reduce computational costs in reasoning-intensive applications, improve real-time decision-making capabilities, and guide the design of practical reasoning systems that maintain high performance while optimizing resource utilization across various domains requiring complex logical processing.

Authors:  Linan Yue, Yichao Du, Yizhi Wang, Weibo Gao, Fangzhou Yao, Li Wang, Ye Liu, Ziyu Xu, Qi Liu, Shimin Di, Min-Ling Zhang

Link:  https://arxiv.org/abs/2508.02120v1

Date: 2025-08-d

Summary:

Recently, Large Reasoning Models (LRMs) have gradually become a research hotspot due to their outstanding performance in handling complex tasks. Among them, DeepSeek R1 has garnered significant attention for its exceptional performance and open-source nature, driving advancements in the research of R1-style LRMs. Unlike traditional Large Language Models (LLMs), these models enhance logical deduction and decision-making capabilities during reasoning by incorporating mechanisms such as long chain-of-thought and self-reflection through reinforcement learning. However, with the widespread application of these models, the problem of overthinking has gradually emerged. Specifically, when generating answers, these models often construct excessively long reasoning chains with redundant or repetitive steps, which leads to reduced reasoning efficiency and may affect the accuracy of the final answer. To this end, various efficient reasoning methods have been proposed, aiming to reduce the length of reasoning paths without compromising model performance and reasoning capability. By systematically reviewing current research advancements in efficient reasoning methods, we categorize existing works into two main directions through the lens of single-model optimization versus model collaboration: (1) Efficient Reasoning with Single Model, which focuses on improving the reasoning efficiency of individual models; and (2) Efficient Reasoning with Model Collaboration, which explores optimizing reasoning paths through collaboration among multiple models. In addition, we maintain a public GitHub repository that tracks the latest progress in efficient reasoning methods.

--------------------------------------------------------------------------------------------------------

SE-Agent: Self-Evolution Trajectory Optimization in Multi-Step Reasoning with LLM-Based Agents

LLM-based agents show impressive capabilities in complex reasoning and tool use through multi-step environmental interactions, but their problem-solving trajectories remain underexploited. SE-Agent addresses this through a self-evolution framework that iteratively optimizes reasoning processes using revision, recombination, and refinement operations. Unlike approaches such as Monte Carlo Tree Search that ignore trajectory interdependence, SE-Agent expands search spaces beyond local optima and leverages cross-trajectory inspiration for efficient performance enhancement. The system achieved up to 55% relative improvement and state-of-the-art performance on SWE-bench Verified for resolving real-world GitHub issues. Applications include automated software debugging, code generation systems, problem-solving assistants, research automation tools, and complex task planning systems. This technology could significantly improve AI agent capabilities in software development, enable more sophisticated automated programming tools, enhance debugging efficiency, and support the development of autonomous systems capable of learning from their own problem-solving experiences.

Authors:  Jiaye Lin, Yifu Guo, Yuzhen Han, Sen Hu, Ziyi Ni, Licheng Wang, Mingguang Chen, Daxin Jiang, Binxing Jiao, Chen Hu, Huacan Wang

Link:  https://arxiv.org/abs/2508.02085v1

Date: 2025-08-d

Summary:

Large Language Model (LLM)-based agents have recently shown impressive capabilities in complex reasoning and tool use via multi-step interactions with their environments. While these agents have the potential to tackle complicated tasks, their problem-solving process, i.e., agents' interaction trajectory leading to task completion, remains underexploited. These trajectories contain rich feedback that can navigate agents toward the right directions for solving problems correctly. Although prevailing approaches, such as Monte Carlo Tree Search (MCTS), can effectively balance exploration and exploitation, they ignore the interdependence among various trajectories and lack the diversity of search spaces, which leads to redundant reasoning and suboptimal outcomes. To address these challenges, we propose SE-Agent, a Self-Evolution framework that enables Agents to optimize their reasoning processes iteratively. Our approach revisits and enhances former pilot trajectories through three key operations: revision, recombination, and refinement. This evolutionary mechanism enables two critical advantages: (1) it expands the search space beyond local optima by intelligently exploring diverse solution paths guided by previous trajectories, and (2) it leverages cross-trajectory inspiration to efficiently enhance performance while mitigating the impact of suboptimal reasoning paths. Through these mechanisms, SE-Agent achieves continuous self-evolution that incrementally improves reasoning quality. We evaluate SE-Agent on SWE-bench Verified to resolve real-world GitHub issues. Experimental results across five strong LLMs show that integrating SE-Agent delivers up to 55% relative improvement, achieving state-of-the-art performance among all open-source agents on SWE-bench Verified. Our code and demonstration materials are publicly available at https://github.com/wanghuacan/SE-Agent.
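
The self-evolution loop can be sketched as an evolutionary search over trajectories: keep a pool of candidates, apply revision, recombination, and refinement operators, and retain the highest-scoring ones. The operators and scorer below are stubs meant to show the control flow, not SE-Agent's implementation.

```python
# Toy sketch of a self-evolution loop over agent trajectories: keep a pool of
# candidate trajectories, apply revision / recombination / refinement operators,
# and retain the highest-scoring ones. The operators and scorer are stubs meant
# to show the control flow, not SE-Agent's actual implementation.
import random

random.seed(0)
ACTIONS = ["read file", "edit file", "run tests", "inspect stack trace", "apply patch"]

def score(traj: list[str]) -> float:
    # stub evaluator: reward trajectories that test before finishing with a patch
    return traj.count("run tests") + (2.0 if traj[-1] == "apply patch" else 0.0)

def revise(traj):                       # resample one step
    t = traj[:]; t[random.randrange(len(t))] = random.choice(ACTIONS); return t

def recombine(a, b):                    # splice a prefix of one with a suffix of another
    cut = random.randrange(1, max(2, min(len(a), len(b))))
    return a[:cut] + b[cut:]

def refine(traj):                       # drop an immediately repeated step
    return [s for i, s in enumerate(traj) if i == 0 or s != traj[i - 1]]

pool = [[random.choice(ACTIONS) for _ in range(5)] for _ in range(6)]
for _ in range(10):                     # evolution iterations
    children = [revise(random.choice(pool)),
                recombine(*random.sample(pool, 2)),
                refine(random.choice(pool))]
    pool = sorted(pool + children, key=score, reverse=True)[:6]

print("best trajectory:", pool[0], "score:", score(pool[0]))
```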

--------------------------------------------------------------------------------------------------------

Everyone Contributes! Incentivizing Strategic Cooperation in Multi-LLM Systems via Sequential Public Goods Games

Coordinating multiple large language models to solve complex tasks collaboratively presents a fundamental trade-off between computational costs and collective performance. This research introduces MAC-SPGG, a game-theoretically grounded reinforcement learning framework that systematically incentivizes cooperation in multi-LLM ensembles. The sequential protocol eliminates free-riding behavior while reducing communication overhead compared to traditional approaches. Agents move sequentially, observing predecessors' outputs and updating beliefs to condition their contributions. The framework proves the existence and uniqueness of equilibrium solutions under realistic parameters. Applications include distributed AI reasoning systems, collaborative problem-solving platforms, multi-agent research tools, consensus-building systems, and complex decision-making frameworks. This technology could enable more effective AI collaboration in scenarios requiring diverse expertise, improve collective intelligence systems, support distributed computing environments, and create frameworks for fair resource allocation in multi-agent AI systems where individual contributions must be balanced with collective benefits.

Authors:  Yunhao Liang, Yuan Qu, Jingyuan Yang, Shaochong Lin, Zuo-Jun Max Shen

Link:  https://arxiv.org/abs/2508.02076v1

Date: 2025-08-d

Summary:

Coordinating multiple large language models (LLMs) to solve complex tasks collaboratively poses a fundamental trade-off between computation costs and collective performance relative to individual models. We introduce a novel, game-theoretically grounded reinforcement learning (RL) framework, the Multi-Agent Cooperation Sequential Public Goods Game (MAC-SPGG), to systematically incentivize cooperation in multi-LLM ensembles. In MAC-SPGG, LLM agents move in sequence, observing predecessors' outputs and updating beliefs to condition their own contributions. By redesigning the public-goods reward, effortful contributions become the unique Subgame Perfect Nash Equilibrium (SPNE), which eliminates free-riding under traditional SPGG or PGG. Its sequential protocol replaces costly round-based information exchanges with a streamlined decision flow, cutting communication overhead while retaining strategic depth. We prove the existence and uniqueness of the SPNE under realistic parameters, and empirically show that MAC-SPGG-trained ensembles outperform single-agent baselines, chain-of-thought prompting, and other cooperative methods, even achieving comparable performance to large-scale models across reasoning, math, code generation, and NLP tasks. Our results highlight the power of structured, incentive-aligned MAC-SPGG cooperation for scalable and robust multi-agent language generation.
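
A toy version of the sequential public goods setup helps illustrate why sequencing and reward design matter: each agent sees its predecessors' total contribution and chooses to contribute effort or free-ride. The payoff numbers below are illustrative assumptions under which contributing is each agent's best response; they are not the paper's MAC-SPGG reward design or its SPNE analysis.

```python
# Toy sequential public-goods sketch: each LLM agent, seeing predecessors'
# contributions, chooses to contribute effort or free-ride; under a reward that
# pays contributors a larger share of the collective gain minus an effort cost,
# contributing is each agent's best response. The payoff values are
# illustrative assumptions, not the paper's reward design.
N_AGENTS = 4
EFFORT_COST = 0.3
GAIN_PER_CONTRIBUTION = 1.0    # collective answer quality added by one contributor

def payoff(contributes: bool, predecessors_total: int) -> float:
    total = predecessors_total + (1 if contributes else 0)
    collective_gain = GAIN_PER_CONTRIBUTION * total
    share = collective_gain if contributes else 0.2 * collective_gain  # free-riders get less
    return share - (EFFORT_COST if contributes else 0.0)

total = 0
for agent in range(N_AGENTS):
    contribute = payoff(True, total) >= payoff(False, total)   # myopic best response
    total += int(contribute)
    print(f"agent {agent}: contribute={contribute}")
print(f"total contributions: {total} / {N_AGENTS}")
```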

--------------------------------------------------------------------------------------------------------

Risk Identification Based on Similar Case Retrieval Enhancement

Construction site safety management faces challenges in automated risk and hazard identification. Existing large language model approaches either struggle with complex hazard features through image-text matching or suffer from high training costs and poor generalization through instruction fine-tuning. This research proposes a hazard identification method using similar case retrieval enhancement, integrating external knowledge and retrieved case contexts via prompt fine-tuning. The approach includes retrieval library, image similarity retrieval, and large model retrieval enhancement modules, enabling efficient recognition without training. Experimental results on real construction data show significant improvements, with GLM-4V's recognition accuracy increasing to 50%, representing a 35.49% boost. Applications include automated construction safety monitoring, real-time hazard detection systems, safety compliance verification tools, accident prevention platforms, and safety training applications. This technology could significantly improve construction site safety, reduce workplace accidents, automate safety inspections, and provide consistent hazard identification across different construction environments.

Authors:  Jiawei Li, Chengye Yang, Yaochen Zhang, Weilin Sun, Lei Meng, Xiangxu Meng

Link:  https://arxiv.org/abs/2508.02073v1

Date: 2025-08-d

Summary:

The goal of construction site risk and hazard identification is to enhance safety management through automation. Existing research based on large language models falls into two categories: image-text matching for collaborative reasoning, which struggles with complex hazard features, and instruction fine-tuning or dialogue guidance using professional datasets, which suffers from high training costs and poor generalization. To address this, we propose a hazard identification method using similar case retrieval enhancement. By integrating external knowledge and retrieved case contexts via prompt fine-tuning, we mitigate misjudgments caused by limited domain knowledge and weak feature associations. Our method includes three modules: retrieval library, image similarity retrieval, and large model retrieval enhancement, enabling efficient recognition without training. Experiments on real construction data show significant improvements. For instance, GLM-4V's recognition accuracy increased to 50%, a 35.49% boost. The method enhances accuracy, context understanding, and stability, offering new theoretical and technical support for hazard detection.
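
The training-free pipeline amounts to embedding the query image, retrieving the most similar labeled cases from the library, and prepending their hazard descriptions to the prompt sent to the multimodal model. The embedding stub, case library, and prompt wording in the sketch below are illustrative assumptions.

```python
# Minimal sketch of retrieval-enhanced hazard identification: embed the query
# site image, retrieve the k most similar labeled cases by cosine similarity,
# and build a prompt that injects their hazard descriptions before asking the
# multimodal model. Embedding stub, case library, and prompt wording are
# illustrative assumptions.
import numpy as np

def embed_image(image_id: str) -> np.ndarray:
    """Stub for an image encoder (e.g. a CLIP-style embedding)."""
    rng = np.random.default_rng(abs(hash(image_id)) % (2**32))
    v = rng.normal(size=64)
    return v / np.linalg.norm(v)

case_library = {                       # retrieval library: image -> hazard note
    "case_001.jpg": "worker near unguarded edge without fall protection",
    "case_002.jpg": "scaffold missing toe boards and guardrails",
    "case_003.jpg": "exposed rebar without protective caps",
}
case_vecs = {k: embed_image(k) for k in case_library}

def retrieve(query_img: str, k: int = 2) -> list[str]:
    q = embed_image(query_img)
    scored = sorted(case_vecs, key=lambda c: float(q @ case_vecs[c]), reverse=True)
    return scored[:k]

def build_prompt(query_img: str) -> str:
    cases = retrieve(query_img)
    context = "\n".join(f"- similar case: {case_library[c]}" for c in cases)
    return (f"Known hazards from similar sites:\n{context}\n"
            f"Identify the hazards visible in {query_img}.")

print(build_prompt("site_today.jpg"))   # this prompt plus the image goes to the multimodal model
```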

--------------------------------------------------------------------------------------------------------

GPU in the Blind Spot: Overlooked Security Risks in Transportation

Graphics processing units are becoming essential components in intelligent transportation systems for video-based and AI applications, providing high-throughput computing for sensor fusion and roadside analytics. However, GPUs represent a critical security blind spot, vulnerable to cyber and hardware attacks including unauthorized crypto mining. This research presents a case study demonstrating the impact of stealthy crypto miners on critical AI workloads using a YOLOv8-based video processing pipeline. Results show 50% frame rate drops and 90% power usage increases from background mining activities. The study develops lightweight classifiers achieving high accuracy in detecting GPU misuse through telemetry features. Applications include autonomous vehicle security systems, smart traffic infrastructure monitoring, edge device protection, fleet management security, and critical infrastructure safeguarding. This work could establish GPU security protocols for transportation systems, inform cybersecurity frameworks for connected vehicles, guide security standards for edge computing in transportation, and develop monitoring systems for detecting compromised GPU resources.

Authors:  Sefatun-Noor Puspa, Mashrur Chowdhury

Link:  https://arxiv.org/abs/2508.01995v1

Date: 2025-08-d

Summary:

Graphics processing units (GPUs) are becoming an essential part of the intelligent transportation system (ITS) for enabling video-based and artificial intelligence (AI) based applications. GPUs provide high-throughput and energy-efficient computing for tasks like sensor fusion and roadside video analytics. However, these GPUs are one of the most unmonitored components in terms of security. This makes them vulnerable to cyber and hardware attacks, including unauthorized crypto mining. This paper highlights GPU security as a critical blind spot in transportation cybersecurity. To support this concern, it also presents a case study showing the impact of stealthy unauthorized crypto miners on critical AI workloads, along with a detection strategy. We used a YOLOv8-based video processing pipeline running on an RTX 2060 GPU for the case study. A multi-streaming application was executed while a T-Rex crypto miner ran in the background. We monitored how the miner degraded GPU performance by reducing the frame rate and increasing power consumption, which could be a serious concern for GPUs operating in autonomous vehicles or battery-powered edge devices. We observed measurable impacts using GPU telemetry (nvidia-smi) and Nsight Compute profiling: frame rate dropped by 50% and power usage increased by up to 90%. To detect such misuse, we trained lightweight classifiers using extracted telemetry features. All models achieved high accuracy, precision, recall, and F1-score. This paper raises urgent awareness about GPU observability gaps in ITS and offers a replicable framework for detecting GPU misuse through on-device telemetry.
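
Detection comes down to collecting a handful of telemetry features (utilization, power, memory, clocks) and training a small classifier to separate normal workloads from mining-contaminated ones. The synthetic feature distributions and model choice below are assumptions, not the paper's exact setup.

```python
# Minimal sketch of telemetry-based GPU misuse detection: build a small
# feature table (of the kind nvidia-smi reports) and train a lightweight
# classifier to separate normal inference from inference plus background mining.
# The synthetic feature distributions and model choice are assumptions.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(7)
n = 500
# features: [gpu_util %, power W, mem_used MB, sm_clock MHz]
normal = np.column_stack([rng.normal(55, 8, n), rng.normal(110, 10, n),
                          rng.normal(3200, 300, n), rng.normal(1500, 80, n)])
mining = np.column_stack([rng.normal(97, 2, n), rng.normal(180, 12, n),
                          rng.normal(5200, 350, n), rng.normal(1750, 60, n)])
X = np.vstack([normal, mining])
y = np.array([0] * n + [1] * n)          # 1 = miner running in background

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)
clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
print(f"held-out accuracy: {clf.score(X_te, y_te):.3f}")
```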

--------------------------------------------------------------------------------------------------------

Accelerating LLM Reasoning via Early Rejection with Partial Reward Modeling

Large Language Models increasingly tackle complex reasoning tasks in mathematics, logic, and multi-step question answering. Process Reward Models improve reasoning quality by rewarding intermediate steps but introduce substantial computational overhead when generating multiple solution candidates. This research investigates using PRMs mid-generation for early rejection of suboptimal candidates before completion. The key hypothesis proposes that PRMs are also Partial Reward Models, meaning partial reasoning step scores predict final output quality. Theoretical proof shows rejection risk decreases exponentially with generation length, while empirical validation demonstrates strong correlation between partial and final rewards. Results show up to 1.4×-9× reduction in inference FLOPs without performance degradation on math reasoning benchmarks. Applications include automated tutoring systems, mathematical problem-solving platforms, theorem proving assistants, scientific reasoning tools, and educational assessment systems. This technology could significantly reduce computational costs in reasoning-intensive applications while maintaining high performance, enabling more accessible AI reasoning capabilities.

Authors:  Seyyed Saeid Cheshmi, Azal Ahmad Khan, Xinran Wang, Zirui Liu, Ali Anwar

Link:  https://arxiv.org/abs/2508.01969v1

Date: 2025-08-d

Summary:

Large Language Models (LLMs) are increasingly relied upon for solving complex reasoning tasks in domains such as mathematics, logic, and multi-step question answering. A growing line of work seeks to improve reasoning quality by scaling inference-time compute, particularly through Process Reward Models (PRMs), used to reward the reasoning at intermediate steps. While effective, these methods introduce substantial computational overhead, especially when generating large numbers of solutions in parallel. In this paper, we investigate whether PRMs can be used mid-generation to provide early signals that enable the rejection of suboptimal candidates before full generation is complete. We introduce the hypothesis that PRMs are also Partial Reward Models, meaning that the scores they assign to partially completed reasoning steps are predictive of final output quality. This allows for principled early rejection based on intermediate token-level signals. We support this hypothesis both theoretically, by proving that the risk of discarding optimal beams decreases exponentially with generation length, and empirically, by demonstrating a strong correlation between partial and final rewards across multiple reward models. On math reasoning benchmarks, our method achieves up to 1.4x-9x reduction in inference FLOPs without degrading final performance. These results suggest that early rejection is a powerful mechanism for improving the compute-efficiency of reasoning in LLMs.
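
Operationally, the idea is to score each partially generated candidate with the reward model after every step and prune the weakest before spending more compute on them. The sketch below shows a best-of-n loop with early rejection; the step generator, scoring stub, and pruning fraction are illustrative assumptions.

```python
# Minimal sketch of early rejection: keep n candidate solutions, score each
# partial reasoning trace with a (stubbed) process reward model after every
# step, and drop the weakest candidates before generating further steps.
# The scoring stub, step generator, and pruning fraction are assumptions.
import random

random.seed(1)

def generate_step(candidate: list[str]) -> str:
    return f"step {len(candidate) + 1}"           # stub for LLM decoding

def partial_reward(candidate: list[str]) -> float:
    return random.random()                         # stub for a PRM score on a partial trace

def best_of_n_with_early_rejection(n=8, max_steps=4, keep_frac=0.5):
    candidates = [[] for _ in range(n)]
    for _ in range(max_steps):
        for c in candidates:
            c.append(generate_step(c))            # extend every surviving candidate
        scored = sorted(candidates, key=partial_reward, reverse=True)
        keep = max(1, int(len(scored) * keep_frac))
        candidates = scored[:keep]                # early rejection of weak partial traces
    return max(candidates, key=partial_reward)

print(best_of_n_with_early_rejection())
```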

--------------------------------------------------------------------------------------------------------

ACT-Tensor: Tensor Completion Framework for Financial Dataset Imputation

Missing data in financial panels presents critical obstacles, undermining asset-pricing models and reducing investment strategy effectiveness. Financial datasets are inherently multi-dimensional, spanning firms, time, and variables, adding complexity to imputation tasks. Conventional methods fail by flattening multidimensional structures and struggling with heterogeneous missingness patterns or extreme sparsity. ACT-Tensor introduces an Adaptive, Cluster-based Temporal smoothing framework specifically tailored for severely missing multi-dimensional financial data. The framework incorporates cluster-based completion capturing cross-sectional heterogeneity and temporal smoothing removing noise while preserving fundamental trends. Extensive experiments demonstrate consistent outperformance of state-of-the-art benchmarks across missing data regimes. Financial utility evaluation shows reduced pricing errors and significantly improved risk-adjusted portfolio returns. Applications include investment strategy optimization, risk management systems, financial forecasting platforms, asset pricing models, and market analysis tools. This technology could transform financial data analysis, enabling more robust investment decisions and improved market understanding despite incomplete datasets.

Authors:  Junyi Mo, Jiayu Li, Duo Zhang, Elynn Chen

Link:  https://arxiv.org/abs/2508.01861v1

Date: 2025-08-d

Summary:

Missing data in financial panels presents a critical obstacle, undermining asset-pricing models and reducing the effectiveness of investment strategies. Such panels are often inherently multi-dimensional, spanning firms, time, and financial variables, which adds complexity to the imputation task. Conventional imputation methods often fail by flattening the data's multidimensional structure, struggling with heterogeneous missingness patterns, or overfitting in the face of extreme data sparsity. To address these limitations, we introduce an Adaptive, Cluster-based Temporal smoothing tensor completion framework (ACT-Tensor) tailored for severely and heterogeneously missing multi-dimensional financial data panels. ACT-Tensor incorporates two key innovations: a cluster-based completion module that captures cross-sectional heterogeneity by learning group-specific latent structures; and a temporal smoothing module that proactively removes short-lived noise while preserving slow-moving fundamental trends. Extensive experiments show that ACT-Tensor consistently outperforms state-of-the-art benchmarks in terms of imputation accuracy across a range of missing data regimes, including extreme sparsity scenarios. To assess its practical financial utility, we evaluate the imputed data with an asset-pricing pipeline tailored for tensor-structured financial data. Results show that ACT-Tensor not only reduces pricing errors but also significantly improves risk-adjusted returns of the constructed portfolio. These findings confirm that our method delivers highly accurate and informative imputations, offering substantial value for financial decision-making.
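
The two ingredients can be illustrated on a toy firm-by-month panel: fill each gap with its cluster's per-month mean, then apply temporal smoothing to damp short-lived noise. The clustering rule, fill rule, and smoothing window below are illustrative assumptions, not the paper's tensor-completion algorithm.

```python
# Minimal sketch of the two ACT-Tensor ideas on a toy firm x month panel:
# fill missing entries with a cluster-specific per-month mean, then apply
# temporal smoothing. The clustering rule, fill rule, and window are
# illustrative assumptions, not the paper's tensor-completion algorithm.
import numpy as np

rng = np.random.default_rng(3)
n_firms, n_months = 12, 24
panel = rng.normal(0, 1, (n_firms, n_months)).cumsum(axis=1)   # toy characteristic
mask = rng.random((n_firms, n_months)) < 0.3                   # 30% missing
data = np.where(mask, np.nan, panel)

# 1) crude "clustering": split firms by their observed mean level
firm_means = np.nanmean(data, axis=1)
clusters = (firm_means > np.median(firm_means)).astype(int)

# 2) cluster-based completion: fill each gap with its cluster's per-month mean
filled = data.copy()
for c in (0, 1):
    rows = clusters == c
    col_means = np.nanmean(data[rows], axis=0)
    col_means = np.where(np.isnan(col_means), np.nanmean(data[rows]), col_means)
    idx = np.where(np.isnan(filled) & rows[:, None])
    filled[idx] = col_means[idx[1]]

# 3) temporal smoothing: centered 3-month moving average to damp short-lived noise
kernel = np.ones(3) / 3
smoothed = np.vstack([np.convolve(row, kernel, mode="same") for row in filled])
print("remaining NaNs:", np.isnan(smoothed).sum())
```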

--------------------------------------------------------------------------------------------------------


EYE ON A.I. GETS READERS UP TO DATE ON THE LATEST FUNDING NEWS AND RELATED ISSUES. SUBSCRIBE FOR THE WEEKLY NEWSLETTER.