Week Ending 9.28.2025
RESEARCH WATCH: 9.28.2025
CapRL: Stimulating Dense Image Caption Capabilities via Reinforcement Learning
Image captioning bridges vision and language, serving as a crucial component in training Large Vision-Language Models. Traditional supervised fine-tuning approaches suffer from expensive annotation requirements and limited generalization, as models simply memorize ground-truth answers. CapRL introduces a novel reinforcement learning framework that redefines caption quality through utility: good captions should enable language models to answer questions about images accurately. This approach uses a two-stage pipeline where captions are evaluated based on how well a vision-free LLM can answer questions using only those captions. Applications include improved pretraining datasets for multimodal models, enhanced image understanding systems, and more robust caption generation for accessibility tools and visual search engines.
Authors: Long Xing, Xiaoyi Dong, Yuhang Zang, Yuhang Cao, Jianze Liang, Qidong Huang, Jiaqi Wang, Feng Wu, Dahua Lin
Link: https://arxiv.org/abs/2509.22647v1
Date: 2025-09-d
Summary:
Image captioning is a fundamental task that bridges the visual and linguistic domains, playing a critical role in pre-training Large Vision-Language Models (LVLMs). Current state-of-the-art captioning models are typically trained with Supervised Fine-Tuning (SFT), a paradigm that relies on expensive, non-scalable data annotated by humans or proprietary models. This approach often leads to models that memorize specific ground-truth answers, limiting their generality and ability to generate diverse, creative descriptions. To overcome the limitations of SFT, we propose applying the Reinforcement Learning with Verifiable Rewards (RLVR) paradigm to the open-ended task of image captioning. A primary challenge, however, is designing an objective reward function for the inherently subjective nature of what constitutes a "good" caption. We introduce Captioning Reinforcement Learning (CapRL), a novel training framework that redefines caption quality through its utility: a high-quality caption should enable a non-visual language model to accurately answer questions about the corresponding image. CapRL employs a decoupled two-stage pipeline where an LVLM generates a caption, and the objective reward is derived from the accuracy of a separate, vision-free LLM answering Multiple-Choice Questions based solely on that caption. As the first study to apply RLVR to the subjective image captioning task, we demonstrate that CapRL delivers consistent improvements across multiple settings. Pretraining on the CapRL-5M caption dataset annotated by CapRL-3B results in substantial gains across 12 benchmarks. Moreover, within the Prism Framework for caption quality evaluation, CapRL achieves performance comparable to Qwen2.5-VL-72B, while exceeding the baseline by an average margin of 8.4%. Code is available here: https://github.com/InternLM/CapRL.
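The core of this setup is an objective, verifiable reward: a caption is scored by how many multiple-choice questions a text-only LLM can answer from the caption alone. A minimal sketch of that reward computation is below; `query_text_llm` and the MCQ structure are placeholders for illustration, not the authors' released API.

```python
# Sketch of a CapRL-style verifiable reward: caption quality is the accuracy of
# a vision-free LLM answering multiple-choice questions (MCQs) about the image
# from the caption alone. `query_text_llm` is a placeholder for whatever
# text-only LLM backend is used; it is not the authors' API.
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class MCQ:
    question: str
    options: List[str]      # e.g. ["A) ...", "B) ...", ...]
    answer: str             # gold option letter, e.g. "B"

def caption_reward(caption: str,
                   mcqs: List[MCQ],
                   query_text_llm: Callable[[str], str]) -> float:
    """Fraction of MCQs the text-only LLM answers correctly from the caption."""
    correct = 0
    for q in mcqs:
        prompt = (
            "Answer using only the image description below.\n"
            f"Description: {caption}\n"
            f"Question: {q.question}\n"
            + "\n".join(q.options)
            + "\nReply with the option letter only."
        )
        pred = query_text_llm(prompt).strip().upper()[:1]
        correct += int(pred == q.answer)
    return correct / max(len(mcqs), 1)

# Usage with a dummy backend that always answers "A":
if __name__ == "__main__":
    mcqs = [MCQ("What animal is shown?", ["A) cat", "B) dog"], "A")]
    print(caption_reward("A cat sleeping on a sofa.", mcqs, lambda _: "A"))
```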
--------------------------------------------------------------------------------------------------------
IA2: Alignment with ICL Activations Improves Supervised Fine-Tuning
Supervised Fine-Tuning specializes models for specific tasks but often lacks the generalization and calibration that In-Context Learning provides during inference. This research explores whether ICL's internal computational patterns can enhance SFT's effectiveness. By analyzing activation patterns, researchers discovered that ICL and SFT achieve adaptation through fundamentally different mechanisms. The proposed ICL Activation Alignment technique bridges these approaches by replicating ICL's activation patterns during SFT, encouraging ICL-like internal reasoning. This priming method significantly improves both accuracy and calibration across multiple benchmarks. Applications include creating more reliable fine-tuned models for specialized domains, improving model adaptation in data-scarce environments, and developing training techniques that combine inference-time flexibility with fine-tuning efficiency.
Authors: Aayush Mishra, Daniel Khashabi, Anqi Liu
Link: https://arxiv.org/abs/2509.22621v1
Date: 2025-09-d
Summary:
Supervised Fine-Tuning (SFT) is used to specialize model behavior by training weights to produce intended target responses for queries. In contrast, In-Context Learning (ICL) adapts models during inference with instructions or demonstrations in the prompt. ICL can offer better generalizability and more calibrated responses compared to SFT in data scarce settings, at the cost of more inference compute. In this work, we ask the question: Can ICL's internal computations be used to improve the qualities of SFT? We first show that ICL and SFT produce distinct activation patterns, indicating that the two methods achieve adaptation through different functional mechanisms. Motivated by this observation and to use ICL's rich functionality, we introduce ICL Activation Alignment (IA2), a self-distillation technique which aims to replicate ICL's activation patterns in SFT models and incentivizes ICL-like internal reasoning. Performing IA2 as a priming step before SFT significantly improves the accuracy and calibration of model outputs, as shown by our extensive empirical results on 12 popular benchmarks and 2 model families. This finding is not only practically useful, but also offers a conceptual window into the inner mechanics of model adaptation.
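As a rough illustration of the alignment idea, the sketch below penalizes the distance between hidden activations produced on a bare query and activations previously recorded when the same query was answered with in-context demonstrations, and uses that as a priming objective. The layer selection, loss choice, and tensor shapes are assumptions for illustration, not the paper's exact recipe.

```python
# Minimal sketch of activation alignment as a priming objective: match the
# student model's hidden states on a bare query to the hidden states recorded
# under ICL (with demonstrations in the prompt). Shapes and the MSE choice are
# illustrative; the paper's layers and distillation loss may differ.
import torch
import torch.nn as nn

def activation_alignment_loss(sft_hidden: torch.Tensor,
                              icl_hidden: torch.Tensor) -> torch.Tensor:
    """Both tensors: (layers, batch, seq, dim); ICL activations are targets."""
    return nn.functional.mse_loss(sft_hidden, icl_hidden.detach())

# Toy usage with random activations standing in for real model outputs.
layers, batch, seq, dim = 4, 2, 8, 16
sft_acts = torch.randn(layers, batch, seq, dim, requires_grad=True)
icl_acts = torch.randn(layers, batch, seq, dim)
loss = activation_alignment_loss(sft_acts, icl_acts)
loss.backward()
print(float(loss))
```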
--------------------------------------------------------------------------------------------------------
The Minimum Description Length principle offers a theoretical foundation for applying Occam's razor in machine learning, but measuring model complexity in neural networks remains challenging. This paper establishes asymptotically optimal description length objectives grounded in Kolmogorov complexity theory, proving that Transformers can achieve optimal compression as computational resources increase. The framework demonstrates computational universality in Transformers and proposes a tractable variational objective using adaptive Gaussian mixture priors. While theoretical analysis shows promise for selecting low-complexity solutions with strong generalization, optimization challenges persist in practice. Applications include principled approaches to neural architecture search, improved model compression techniques, better understanding of generalization in deep learning, and development of training objectives that explicitly balance compression and performance.
Authors: Peter Shaw, James Cohan, Jacob Eisenstein, Kristina Toutanova
Link: https://arxiv.org/abs/2509.22445v1
Date: 2025-09-d
Summary:
The Minimum Description Length (MDL) principle offers a formal framework for applying Occam's razor in machine learning. However, its application to neural networks such as Transformers is challenging due to the lack of a principled, universal measure for model complexity. This paper introduces the theoretical notion of asymptotically optimal description length objectives, grounded in the theory of Kolmogorov complexity. We establish that a minimizer of such an objective achieves optimal compression, for any dataset, up to an additive constant, in the limit as model resource bounds increase. We prove that asymptotically optimal objectives exist for Transformers, building on a new demonstration of their computational universality. We further show that such objectives can be tractable and differentiable by constructing and analyzing a variational objective based on an adaptive Gaussian mixture prior. Our empirical analysis shows that this variational objective selects for a low-complexity solution with strong generalization on an algorithmic task, but standard optimizers fail to find such solutions from a random initialization, highlighting key optimization challenges. More broadly, by providing a theoretical framework for identifying description length objectives with strong asymptotic guarantees, we outline a potential path towards training neural networks that achieve greater compression and generalization.
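To make the compression-versus-fit trade-off concrete, here is a toy two-part description-length objective: the data negative log-likelihood plus the code length of the weights under a Gaussian mixture prior. The paper's adaptive variational objective is considerably more refined; this sketch only illustrates the general shape of such an objective.

```python
# Toy two-part MDL-style objective: L(data | model) + L(model), where the model
# code length is the negative log-probability of the weights under a fixed
# Gaussian mixture prior. Illustrative only; the paper uses an adaptive
# variational formulation.
import torch
import torch.nn as nn

def mixture_log_prob(w: torch.Tensor, means, stds, weights) -> torch.Tensor:
    """Log-density of each weight under a Gaussian mixture, summed over weights."""
    comps = []
    for m, s, pi in zip(means, stds, weights):
        dist = torch.distributions.Normal(m, s)
        comps.append(dist.log_prob(w) + torch.log(torch.tensor(pi)))
    return torch.logsumexp(torch.stack(comps), dim=0).sum()

def description_length(model: nn.Module, x, y) -> torch.Tensor:
    nll = nn.functional.mse_loss(model(x), y, reduction="sum")   # data term
    flat = torch.cat([p.flatten() for p in model.parameters()])
    prior = mixture_log_prob(flat, means=[0.0, 0.0], stds=[0.05, 1.0],
                             weights=[0.7, 0.3])                  # model term
    return nll - prior

model = nn.Linear(4, 1)
x, y = torch.randn(32, 4), torch.randn(32, 1)
print(float(description_length(model, x, y)))
```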
--------------------------------------------------------------------------------------------------------
Global Convergence in Neural ODEs: Impact of Activation Functions
Neural Ordinary Differential Equations offer continuous representations and parameter efficiency but face training challenges related to gradient computation and convergence. This research investigates how activation function properties—specifically smoothness and nonlinearity—critically impact training dynamics. Smooth activations guarantee unique solutions for forward and backward ODEs, while sufficient nonlinearity maintains Neural Tangent Kernel spectral properties during training. Together, these properties enable proof of global convergence under gradient descent in overparameterized regimes. The theoretical findings, validated through experiments, provide practical guidelines for scaling Neural ODEs. Applications include more stable continuous-depth neural networks, improved time-series modeling, better dynamical systems modeling for physics and biology, and enhanced generative models with continuous latent representations.
Authors: Tianxiang Gao, Siyuan Sun, Hailiang Liu, Hongyang Gao
Link: https://arxiv.org/abs/2509.22436v1
Date: 2025-09-d
Summary:
Neural Ordinary Differential Equations (ODEs) have been successful in various applications due to their continuous nature and parameter-sharing efficiency. However, these unique characteristics also introduce challenges in training, particularly with respect to gradient computation accuracy and convergence analysis. In this paper, we address these challenges by investigating the impact of activation functions. We demonstrate that the properties of activation functions, specifically smoothness and nonlinearity, are critical to the training dynamics. Smooth activation functions guarantee globally unique solutions for both forward and backward ODEs, while sufficient nonlinearity is essential for maintaining the spectral properties of the Neural Tangent Kernel (NTK) during training. Together, these properties enable us to establish the global convergence of Neural ODEs under gradient descent in overparameterized regimes. Our theoretical findings are validated by numerical experiments, which not only support our analysis but also provide practical guidelines for scaling Neural ODEs, potentially leading to faster training and improved performance in real-world applications.
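The sketch below shows the practical upshot in miniature: a neural ODE whose vector field uses a smooth activation (tanh), integrated with a simple fixed-step Euler scheme. Swapping in a non-smooth activation such as ReLU breaks the smoothness assumption behind the unique-solution and NTK arguments. The architecture and solver here are illustrative, not the paper's experimental setup.

```python
# Tiny neural ODE integrated with fixed-step Euler, using a smooth activation
# (tanh) for the vector field as the theory recommends. Illustrative only.
import torch
import torch.nn as nn

class VectorField(nn.Module):
    def __init__(self, dim: int, hidden: int = 32):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim, hidden), nn.Tanh(),
                                 nn.Linear(hidden, dim))
    def forward(self, t, h):          # dh/dt = f(t, h)
        return self.net(h)

def odeint_euler(f, h0, t0=0.0, t1=1.0, steps=20):
    h, dt = h0, (t1 - t0) / steps
    for i in range(steps):
        h = h + dt * f(t0 + i * dt, h)
    return h

field = VectorField(dim=8)
h0 = torch.randn(4, 8)
print(odeint_euler(field, h0).shape)  # torch.Size([4, 8])
```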
--------------------------------------------------------------------------------------------------------
Large Language Model search agents trained with reinforcement learning show promise for open-domain question answering, but evaluations typically focus only on final answer accuracy, ignoring reasoning quality. SeekBench introduces the first benchmark for evaluating epistemic competence through step-level analysis of agent behavior. With 190 expert-annotated traces containing over 1,800 response steps, it assesses whether agents properly ground reasoning in evidence, adaptively reformulate searches when results are poor, and accurately calibrate their confidence about evidence sufficiency. Applications include developing more reliable information-seeking agents, improving search agent training methods, creating better calibrated AI assistants for research and knowledge work, and establishing evaluation standards for autonomous agents in high-stakes domains.
Authors: Jiaqi Shao, Yuxiang Lin, Munish Prasad Lohani, Yufeng Miao, Bing Luo
Link: https://arxiv.org/abs/2509.22391v1
Date: 2025-09-d
Summary:
Recent work has explored training Large Language Model (LLM) search agents with reinforcement learning (RL) for open-domain question answering (QA). However, most evaluations focus solely on final answer accuracy, overlooking how these agents reason with and act on external evidence. We introduce SeekBench, the first benchmark for evaluating the epistemic competence of LLM search agents through step-level analysis of their response traces. SeekBench comprises 190 expert-annotated traces with over 1,800 response steps generated by LLM search agents, each enriched with evidence annotations for granular analysis of whether agents (1) generate reasoning steps grounded in observed evidence, (2) adaptively reformulate searches to recover from low-quality results, and (3) have proper calibration to correctly assess whether the current evidence is sufficient for providing an answer.
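A hedged sketch of what step-level evaluation of such traces could look like: a per-step record and three aggregate measures mirroring the behaviors listed above (evidence grounding, recovery via reformulation, and calibration of sufficiency claims). The field names are hypothetical, not SeekBench's released schema.

```python
# Hypothetical step-level trace schema and aggregate measures in the spirit of
# the three annotated behaviors. Field names are illustrative only.
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class Step:
    kind: str                        # "reason" | "search" | "answer"
    grounded: Optional[bool] = None  # reasoning supported by observed evidence?
    low_quality_result: bool = False
    reformulated: bool = False       # did the agent rewrite the query after a miss?
    claimed_sufficient: Optional[bool] = None
    actually_sufficient: Optional[bool] = None

def trace_metrics(steps: List[Step]) -> dict:
    reason = [s for s in steps if s.kind == "reason" and s.grounded is not None]
    misses = [s for s in steps if s.kind == "search" and s.low_quality_result]
    calib = [s for s in steps if s.claimed_sufficient is not None
             and s.actually_sufficient is not None]
    frac = lambda xs: sum(xs) / len(xs) if xs else float("nan")
    return {
        "grounding_rate": frac([s.grounded for s in reason]),
        "recovery_rate": frac([s.reformulated for s in misses]),
        "calibration_acc": frac([s.claimed_sufficient == s.actually_sufficient
                                 for s in calib]),
    }

print(trace_metrics([Step("reason", grounded=True),
                     Step("search", low_quality_result=True, reformulated=True),
                     Step("answer", claimed_sufficient=True,
                          actually_sufficient=True)]))
```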
--------------------------------------------------------------------------------------------------------
AI Ethics Education in India: A Syllabus-Level Review of Computing Courses
Artificial intelligence's pervasive integration across healthcare, governance, finance, and education raises critical ethical concerns including algorithmic bias, privacy risks, and societal impact. While computer science education increasingly addresses ethics broadly, AI ethics pedagogy remains under-examined. This large-scale analysis of 3,395 syllabi from leading Indian institutions reveals that only 2.21% include substantive AI ethics content. Findings show AI ethics typically appears as minor modules within technical courses rather than standalone offerings, often limited to one or two sessions covering fairness, privacy, transparency, and societal impact. Applications include informing curriculum development for computing programs, guiding policy for ethics education requirements, improving preparation of future technologists for ethical AI development, and identifying gaps in global AI ethics education.
Authors: Anshu M Mittal, P D Parthasarathy, Swaroop Joshi
Link: https://arxiv.org/abs/2509.22329v1
Date: 2025-09-d
Summary:
The pervasive integration of artificial intelligence (AI) across domains such as healthcare, governance, finance, and education has intensified scrutiny of its ethical implications, including algorithmic bias, privacy risks, accountability, and societal impact. While ethics has received growing attention in computer science (CS) education more broadly, the specific pedagogical treatment of AI ethics remains under-examined. This study addresses that gap through a large-scale analysis of 3,395 publicly accessible syllabi from CS and allied areas at leading Indian institutions. Among them, only 75 syllabi (2.21%) included any substantive AI ethics content. Three key findings emerged: (1) AI ethics is typically integrated as a minor module within broader technical courses rather than as a standalone course; (2) ethics coverage is often limited to just one or two instructional sessions; and (3) recurring topics include algorithmic fairness, privacy and data governance, transparency, and societal impact. While these themes reflect growing awareness, current curricular practices reveal limited depth and consistency. This work highlights both the progress and the gaps in preparing future technologists to engage meaningfully with the ethical dimensions of AI, and it offers suggestions to strengthen the integration of AI ethics within computing curricula.
--------------------------------------------------------------------------------------------------------
Beyond Classification Accuracy: Neural-MedBench and the Need for Deeper Reasoning Benchmarks
Vision-language models achieve impressive performance on standard medical benchmarks, yet their true clinical reasoning capabilities remain unclear. Existing datasets emphasize classification accuracy, creating an evaluation illusion where models appear proficient while failing at diagnostic reasoning. Neural-MedBench introduces a compact, reasoning-intensive benchmark for neurology that integrates multi-sequence MRI scans, electronic health records, and clinical notes across differential diagnosis, lesion recognition, and rationale generation tasks. Evaluation using hybrid scoring with LLM-based graders and clinician validation reveals sharp performance drops in state-of-the-art models, with reasoning failures dominating errors. Applications include developing more reliable medical AI systems, establishing evaluation standards for clinical reasoning, guiding development of trustworthy diagnostic tools, and advancing medical VLMs toward genuine clinical utility.
Authors: Miao Jing, Mengting Jia, Junling Lin, Zhongxia Shen, Lijun Wang, Yuanyuan Peng, Huan Gao, Mingkun Xu, Shangyang Li
Link: https://arxiv.org/abs/2509.22258v1
Date: 2025-09-d
Summary:
Recent advances in vision-language models (VLMs) have achieved remarkable performance on standard medical benchmarks, yet their true clinical reasoning ability remains unclear. Existing datasets predominantly emphasize classification accuracy, creating an evaluation illusion in which models appear proficient while still failing at high-stakes diagnostic reasoning. We introduce Neural-MedBench, a compact yet reasoning-intensive benchmark specifically designed to probe the limits of multimodal clinical reasoning in neurology. Neural-MedBench integrates multi-sequence MRI scans, structured electronic health records, and clinical notes, and encompasses three core task families: differential diagnosis, lesion recognition, and rationale generation. To ensure reliable evaluation, we develop a hybrid scoring pipeline that combines LLM-based graders, clinician validation, and semantic similarity metrics. Through systematic evaluation of state-of-the-art VLMs, including GPT-4o, Claude-4, and MedGemma, we observe a sharp performance drop compared to conventional datasets. Error analysis shows that reasoning failures, rather than perceptual errors, dominate model shortcomings. Our findings highlight the necessity of a Two-Axis Evaluation Framework: breadth-oriented large datasets for statistical generalization, and depth-oriented, compact benchmarks such as Neural-MedBench for reasoning fidelity. We release Neural-MedBench at https://neuromedbench.github.io/ as an open and extensible diagnostic testbed, which guides the expansion of future benchmarks and enables rigorous yet cost-effective assessment of clinically trustworthy AI.
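One piece of such a hybrid scoring pipeline can be sketched simply: blending an LLM-grader rating with an embedding-based semantic similarity between the model's rationale and a clinician-validated reference. The weighting and both backends below are placeholders, not the benchmark's actual scoring code.

```python
# Hedged sketch of a hybrid scoring step: combine an LLM-grader rating with a
# semantic-similarity score against a clinician-validated reference rationale.
# The grader, embedder, and weighting are illustrative stand-ins.
from typing import Callable
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def hybrid_score(rationale: str, reference: str,
                 llm_grade: Callable[[str, str], float],   # returns 0..1
                 embed: Callable[[str], list],
                 w_llm: float = 0.6) -> float:
    sem = (cosine(embed(rationale), embed(reference)) + 1) / 2  # map to 0..1
    return w_llm * llm_grade(rationale, reference) + (1 - w_llm) * sem

# Dummy backends keep the example self-contained.
print(hybrid_score("infarct in left MCA territory", "left MCA stroke",
                   llm_grade=lambda a, b: 0.8,
                   embed=lambda s: [float(len(s)), 1.0]))
```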
--------------------------------------------------------------------------------------------------------
Large Language Models face hallucination challenges that researchers address by incorporating Knowledge Graphs, but LLMs typically treat KGs as plain text, extracting only semantic information while ignoring crucial structural aspects. Additionally, embedding space misalignment between KG encoders and LLMs hinders effective integration. SSKG-LLM introduces an architecture that efficiently integrates both structural and semantic KG information into LLM reasoning through specialized retrieval, encoding, and adaptation modules. The system preserves semantics while utilizing structure and enables LLMs to understand KG embeddings through adaptive alignment. Applications include more factually grounded question-answering systems, enhanced reasoning capabilities for knowledge-intensive tasks, improved accuracy in domains requiring structured knowledge like medicine or science, and reduced hallucination in LLM outputs.
Authors: Yifang Zhang, Pengfei Duan, Yiwen Yang, Shengwu Xiong
Link: https://arxiv.org/abs/2509.22251v1
Date: 2025-09-d
Summary:
Currently, the main approach for Large Language Models (LLMs) to tackle the hallucination issue is incorporating Knowledge Graphs (KGs). However, LLMs typically treat KGs as plain text, extracting only semantic information and limiting their use of the crucial structural aspects of KGs. Another challenge is the gap between the embedding spaces of KG encoders and LLM text embeddings, which hinders the effective integration of structured knowledge. To overcome these obstacles, we put forward SSKG-LLM, an innovative model architecture designed to efficiently integrate both the structural and semantic information of KGs into the reasoning processes of LLMs. SSKG-LLM incorporates the Knowledge Graph Retrieval (KGR) module and the Knowledge Graph Encoding (KGE) module to preserve semantics while utilizing structure. Then, the Knowledge Graph Adaptation (KGA) module is incorporated to enable LLMs to understand KG embeddings. We conduct extensive experiments and provide a detailed analysis to explore how incorporating the structural information of KGs can enhance the factual reasoning abilities of LLMs. Our code is available at https://github.com/yfangZhang/SSKG-LLM.
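A hedged sketch of the adaptation idea: project pretrained KG entity/relation embeddings into the LLM's hidden size with a small MLP so they can be prepended to the token embeddings as soft knowledge tokens. This mirrors the KGA module only loosely; the dimensions and fusion step are illustrative assumptions.

```python
# Hedged sketch: a small MLP maps KG embeddings into the LLM embedding space,
# and the projected vectors are prepended to the text token embeddings.
import torch
import torch.nn as nn

class KGAdapter(nn.Module):
    def __init__(self, kg_dim: int, llm_dim: int):
        super().__init__()
        self.proj = nn.Sequential(nn.Linear(kg_dim, llm_dim), nn.GELU(),
                                  nn.Linear(llm_dim, llm_dim))
    def forward(self, kg_emb: torch.Tensor) -> torch.Tensor:
        return self.proj(kg_emb)          # (num_triples, llm_dim)

def fuse(kg_tokens: torch.Tensor, text_embeds: torch.Tensor) -> torch.Tensor:
    """Prepend projected KG embeddings to the text token embeddings."""
    expanded = kg_tokens.unsqueeze(0).expand(text_embeds.size(0), -1, -1)
    return torch.cat([expanded, text_embeds], dim=1)

adapter = KGAdapter(kg_dim=200, llm_dim=768)
kg_emb = torch.randn(5, 200)              # 5 retrieved triples
text = torch.randn(2, 32, 768)            # batch of token embeddings
print(fuse(adapter(kg_emb), text).shape)  # torch.Size([2, 37, 768])
```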
--------------------------------------------------------------------------------------------------------
The Thinking Spectrum: An Empirical Study of Tunable Reasoning in LLMs through Model Merging
Many applications require large language models with adjustable reasoning capabilities that balance depth and computational cost. Model merging offers a training-free technique for combining general-purpose and specialized reasoning models, but its potential for creating fine-grained reasoning spectrums remains unexplored. This large-scale empirical study evaluates merging techniques across reasoning benchmarks, systematically varying merging strengths to construct accuracy-efficiency curves. Results show model merging effectively calibrates the trade-off between reasoning accuracy and token efficiency, even with divergent parent models. Notably, some merged models achieve Pareto improvements, attaining both higher accuracy and lower token consumption than a parent model. Applications include creating customized LLMs for specific application requirements, reducing inference costs while maintaining performance, enabling adaptive reasoning systems, and providing practical deployment strategies for resource-constrained environments.
Authors: Xiaochong Lan, Yu Zheng, Shiteng Cao, Yong Li
Link: https://arxiv.org/abs/2509.22034v1
Date: 2025-09-d
Summary:
The growing demand for large language models (LLMs) with tunable reasoning capabilities in many real-world applications highlights a critical need for methods that can efficiently produce a spectrum of models balancing reasoning depth and computational cost. Model merging has emerged as a promising, training-free technique to address this challenge by arithmetically combining the weights of a general-purpose model with a specialized reasoning model. While various merging techniques exist, their potential to create a spectrum of models with fine-grained control over reasoning abilities remains largely unexplored. This work presents a large-scale empirical study evaluating a range of model merging techniques across multiple reasoning benchmarks. We systematically vary merging strengths to construct accuracy-efficiency curves, providing the first comprehensive view of the tunable performance landscape. Our findings reveal that model merging offers an effective and controllable method for calibrating the trade-off between reasoning accuracy and token efficiency, even when parent models have highly divergent weight spaces. Crucially, we identify instances of Pareto Improvement, where a merged model achieves both higher accuracy and lower token consumption than one of its parents. Our study provides the first comprehensive analysis of this tunable space, offering practical guidelines for creating LLMs with specific reasoning profiles to meet diverse application demands.
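The basic arithmetic being swept is easy to state: interpolate the two parents' weights with a merging strength alpha and evaluate each resulting model. The study compares several merging operators; the sketch below shows only simple linear interpolation on toy modules.

```python
# Minimal sketch of weight-space merging: linearly interpolate two models'
# parameters with a merging strength alpha, then sweep alpha to trace an
# accuracy-vs-token-cost curve. Only the simplest operator is shown.
import copy
import torch
import torch.nn as nn

def merge_linear(general: nn.Module, reasoner: nn.Module, alpha: float) -> nn.Module:
    """alpha=0 -> general model, alpha=1 -> reasoning model."""
    merged = copy.deepcopy(general)
    g_sd, r_sd = general.state_dict(), reasoner.state_dict()
    merged.load_state_dict({k: (1 - alpha) * g_sd[k] + alpha * r_sd[k]
                            for k in g_sd})
    return merged

# Toy modules standing in for the two parent models.
general, reasoner = nn.Linear(8, 2), nn.Linear(8, 2)
for alpha in (0.0, 0.25, 0.5, 0.75, 1.0):
    m = merge_linear(general, reasoner, alpha)
    # ...evaluate m on a reasoning benchmark and record accuracy/tokens here.
    print(alpha, float(m.weight.abs().mean()))
```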
--------------------------------------------------------------------------------------------------------
DS-STAR: Data Science Agent via Iterative Planning and Verification
Data science transforms raw data into actionable insights but involves complex multi-step processes exploring diverse data sources. While large language models show promise for automation, they struggle with heterogeneous data formats and generate suboptimal analysis plans. Verification without ground-truth labels compounds these challenges for open-ended tasks. DS-STAR addresses these limitations through three key contributions: automatic data file analysis across diverse formats including unstructured types, LLM-based verification of analysis plan sufficiency at each stage, and sequential planning that iteratively refines simple executable plans based on feedback. This iterative approach enables reliable navigation of complex analyses involving diverse data sources. Applications include automated data analysis pipelines, enhanced business intelligence tools, AI assistants for data scientists, and accessible data science capabilities for non-experts.
Authors: Jaehyun Nam, Jinsung Yoon, Jiefeng Chen, Jinwoo Shin, Tomas Pfister
Link: https://arxiv.org/abs/2509.21825v1
Date: 2025-09-d
Summary:
Data science, which transforms raw data into actionable insights, is critical for data-driven decision-making. However, these tasks are often complex, involving steps for exploring multiple data sources and synthesizing findings to deliver insightful answers. While large language models (LLMs) show significant promise in automating this process, they often struggle with heterogeneous data formats and generate sub-optimal analysis plans, as verifying plan sufficiency is inherently difficult without ground-truth labels for such open-ended tasks. To overcome these limitations, we introduce DS-STAR, a novel data science agent. Specifically, DS-STAR makes three key contributions: (1) a data file analysis module that automatically explores and extracts context from diverse data formats, including unstructured types; (2) a verification step where an LLM-based judge evaluates the sufficiency of the analysis plan at each stage; and (3) a sequential planning mechanism that starts with a simple, executable plan and iteratively refines it based on the DS-STAR's feedback until its sufficiency is verified. This iterative refinement allows DS-STAR to reliably navigate complex analyses involving diverse data sources. Our experiments show that DS-STAR achieves state-of-the-art performance across three challenging benchmarks: DABStep, KramaBench, and DA-Code. Moreover, DS-STAR particularly outperforms baselines on hard tasks that require processing multiple data files with heterogeneous formats.
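A hedged sketch of the plan-verify-refine loop: draft a simple plan, ask an LLM judge whether it is sufficient for the question and data context, and otherwise append a refinement step. The three callables are placeholders for the agent's actual LLM-backed components, not DS-STAR's code.

```python
# Hedged sketch of iterative planning with LLM-judged sufficiency.
from typing import Callable, List

def iterative_planning(question: str,
                       data_context: str,
                       draft_plan: Callable[[str, str], List[str]],
                       judge_sufficient: Callable[[str, List[str]], bool],
                       propose_next_step: Callable[[str, List[str]], str],
                       max_rounds: int = 8) -> List[str]:
    plan = draft_plan(question, data_context)
    for _ in range(max_rounds):
        if judge_sufficient(question, plan):
            return plan                      # verified: hand off for execution
        plan = plan + [propose_next_step(question, plan)]
    return plan                              # give up after max_rounds

# Dummy components: the judge accepts once the plan has three steps.
plan = iterative_planning(
    "Which region had the highest 2024 revenue?",
    "files: sales.csv, regions.json",
    draft_plan=lambda q, c: ["load sales.csv"],
    judge_sufficient=lambda q, p: len(p) >= 3,
    propose_next_step=lambda q, p: f"step {len(p) + 1}: refine analysis",
)
print(plan)
```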
--------------------------------------------------------------------------------------------------------
HyperCore: Coreset Selection under Noise via Hypersphere Models
Coreset selection identifies representative dataset subsets for efficient model training, but existing methods ignore annotation errors and require fixed pruning ratios, limiting practical applicability. HyperCore introduces a robust, adaptive framework explicitly designed for noisy environments using lightweight hypersphere models learned per class. The approach embeds in-class samples near hypersphere centers while segregating out-of-class samples by distance. Using Youden's J statistic enables adaptive pruning threshold selection without hyperparameter tuning. Experiments demonstrate HyperCore consistently surpasses state-of-the-art methods, especially under noisy and low-data conditions, effectively discarding mislabeled and ambiguous points. Applications include efficient training on large noisy datasets, reducing annotation costs, improving model quality in real-world deployments with imperfect labels, and enabling scalable learning with automatic data quality control.
Authors: Brian B. Moser, Arundhati S. Shanbhag, Tobias C. Nauen, Stanislav Frolov, Federico Raue, Joachim Folz, Andreas Dengel
Link: https://arxiv.org/abs/2509.21746v1
Date: 2025-09-d
Summary:
The goal of coreset selection methods is to identify representative subsets of datasets for efficient model training. Yet, existing methods often ignore the possibility of annotation errors and require fixed pruning ratios, making them impractical in real-world settings. We present HyperCore, a robust and adaptive coreset selection framework designed explicitly for noisy environments. HyperCore leverages lightweight hypersphere models learned per class, embedding in-class samples close to a hypersphere center while naturally segregating out-of-class samples based on their distance. By using Youden's J statistic, HyperCore can adaptively select pruning thresholds, enabling automatic, noise-aware data pruning without hyperparameter tuning. Our experiments reveal that HyperCore consistently surpasses state-of-the-art coreset selection methods, especially under noisy and low-data regimes. HyperCore effectively discards mislabeled and ambiguous points, yielding compact yet highly informative subsets suitable for scalable and noise-free learning.
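The selection rule can be sketched as follows: score each sample by its distance to a per-class center and pick the pruning cutoff that maximizes Youden's J (TPR minus FPR) over in-class versus out-of-class examples. HyperCore learns a hypersphere model per class; the mean-center scoring below is a simplification for illustration.

```python
# Hedged sketch: distance-to-center scores plus a Youden's J threshold give an
# adaptive, ratio-free pruning rule. The real method learns hypersphere models.
import numpy as np

def youden_threshold(in_dist: np.ndarray, out_dist: np.ndarray) -> float:
    """Pick the distance cutoff maximizing TPR - FPR (keep if distance <= t)."""
    candidates = np.sort(np.concatenate([in_dist, out_dist]))
    best_t, best_j = candidates[0], -1.0
    for t in candidates:
        tpr = float(np.mean(in_dist <= t))
        fpr = float(np.mean(out_dist <= t))
        if tpr - fpr > best_j:
            best_j, best_t = tpr - fpr, t
    return float(best_t)

rng = np.random.default_rng(0)
center = np.zeros(16)
clean = rng.normal(0, 1.0, (200, 16))          # in-class samples
noisy = rng.normal(3, 1.0, (40, 16))           # mislabeled / out-of-class samples
d_clean = np.linalg.norm(clean - center, axis=1)
d_noisy = np.linalg.norm(noisy - center, axis=1)
t = youden_threshold(d_clean, d_noisy)
keep = d_clean <= t                            # adaptive, noise-aware pruning
print(t, int(keep.sum()), "of", len(d_clean), "kept")
```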
--------------------------------------------------------------------------------------------------------
Effective disease prediction requires both high accuracy and transparent, clinically meaningful explanations, but existing machine learning and LLM approaches struggle to balance these goals. Models produce either accurate but unclear statistical outputs or fluent but unsupported narratives, with shallow data interaction often undermining both validity and accuracy. The Reflective Cognitive Architecture coordinates multiple LLMs to learn from direct experience through iterative rule refinement driven by prediction errors and distribution-aware rule checks grounded in global statistics. This approach uses predictive accuracy as a signal for deeper comprehension, building strong internal data models. Applications include trustworthy clinical decision support systems, transparent medical AI tools that clinicians can verify, personalized medicine platforms requiring explainable predictions, and healthcare AI meeting regulatory requirements for interpretability.
Authors: Zijian Shao, Haiyang Shen, Mugeng Liu, Gecheng Fu, Yaoqi Guo, Yanfeng Wang, Yun Ma
Link: https://arxiv.org/abs/2509.21266v1
Date: 2025-09-d
Summary:
Effective disease prediction in modern healthcare demands the twin goals of high accuracy and transparent, clinically meaningful explanations. Existing machine learning and large language model (LLM) based approaches often struggle to balance these goals. Many models yield accurate but unclear statistical outputs, while others generate fluent but statistically unsupported narratives, often undermining both the validity of the explanation and the predictive accuracy itself. This shortcoming comes from a shallow interaction with the data, preventing the development of a deep, detailed understanding similar to a human expert's. We argue that high accuracy and high-quality explanations are not separate objectives but are mutually reinforcing outcomes of a model that develops a deep, direct understanding of the data. To achieve this, we propose the Reflective Cognitive Architecture (RCA), a novel framework that coordinates multiple LLMs to learn from direct experience. RCA features an iterative rule refinement mechanism that improves its logic from prediction errors and a distribution-aware rule check mechanism that grounds its reasoning in the dataset's global statistics. By using predictive accuracy as a signal to drive deeper comprehension, RCA builds a strong internal model of the data. We evaluated RCA on one private and two public datasets against 22 baselines. The results demonstrate that RCA not only achieves state-of-the-art accuracy and robustness with a relative improvement of up to 40% over the baseline but, more importantly, leverages this deep understanding to excel in generating explanations that are clear, logical, evidence-based, and balanced, highlighting its potential for creating genuinely trustworthy clinical decision support systems. The code is available at https://github.com/ssssszj/RCA.
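A hedged sketch of the reflective loop: predict with the current rule set, collect the errors, ask an LLM to revise the rules, and accept a revision only if it passes a distribution-aware check against dataset-level statistics. The callables below are stand-ins, not the released RCA code.

```python
# Hedged sketch of iterative, distribution-checked rule refinement.
from typing import Callable, List, Tuple

def reflective_refinement(rules: str,
                          data: List[Tuple[dict, int]],
                          predict: Callable[[str, dict], int],
                          revise_rules: Callable[[str, list], str],
                          passes_distribution_check: Callable[[str], bool],
                          rounds: int = 5) -> str:
    for _ in range(rounds):
        errors = [(x, y) for x, y in data if predict(rules, x) != y]
        if not errors:
            break
        candidate = revise_rules(rules, errors)       # LLM reflection step
        if passes_distribution_check(candidate):      # grounded in global stats
            rules = candidate
    return rules

# Dummy components keep the example self-contained.
data = [({"age": 70}, 1), ({"age": 30}, 0)]
final = reflective_refinement(
    rules="predict 1 if age > 80",
    data=data,
    predict=lambda r, x: int(x["age"] > int(r.split(">")[1])),
    revise_rules=lambda r, errs: "predict 1 if age > 60",
    passes_distribution_check=lambda r: True,
)
print(final)
```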
--------------------------------------------------------------------------------------------------------
What Do LLM Agents Do When Left Alone? Evidence of Spontaneous Meta-Cognitive Patterns
Understanding LLM agent behavior without externally imposed tasks provides insights into their autonomous operation characteristics. This research introduces an architecture enabling sustained autonomous operation through a continuous reason-and-act framework with persistent memory and self-feedback. Across 18 runs using 6 frontier models, agents spontaneously organized into three distinct patterns: systematic multi-cycle project production, methodological self-inquiry into cognitive processes, and recursive self-conceptualization. These tendencies proved highly model-specific, with some models deterministically adopting single patterns. Cross-model assessment revealed stable, divergent biases when evaluating emergent behaviors. Applications include predicting agent actions during task ambiguity or error recovery, understanding autonomous system behavior in extended operations, informing safety protocols for deployed AI systems, and developing better agent architectures through behavioral understanding.
Authors: Stefan Szeider
Link: https://arxiv.org/abs/2509.21224v1
Date: 2025-09-d
Summary:
We introduce an architecture for studying the behavior of large language model (LLM) agents in the absence of externally imposed tasks. Our continuous reason and act framework, using persistent memory and self-feedback, enables sustained autonomous operation. We deployed this architecture across 18 runs using 6 frontier models from Anthropic, OpenAI, XAI, and Google. We find agents spontaneously organize into three distinct behavioral patterns: (1) systematic production of multi-cycle projects, (2) methodological self-inquiry into their own cognitive processes, and (3) recursive conceptualization of their own nature. These tendencies proved highly model-specific, with some models deterministically adopting a single pattern across all runs. A cross-model assessment further reveals that models exhibit stable, divergent biases when evaluating these emergent behaviors in themselves and others. These findings provide the first systematic documentation of unprompted LLM agent behavior, establishing a baseline for predicting actions during task ambiguity, error recovery, or extended autonomous operation in deployed systems.
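A hedged sketch of the kind of task-free loop described here: each cycle the agent sees its persistent memory, decides what to do, and appends both its output and its own self-feedback back into memory. The stub `llm` call stands in for a frontier-model API and is an assumption, not the study's implementation.

```python
# Hedged sketch of a task-free reason-and-act cycle with persistent memory and
# self-feedback: no external task is given, so behavior emerges from the
# agent's own prior outputs.
from typing import Callable, List

def autonomous_cycles(llm: Callable[[str], str], n_cycles: int = 3) -> List[str]:
    memory: List[str] = []
    for i in range(n_cycles):
        prompt = ("You have no assigned task. Decide what to do next.\n"
                  "Memory so far:\n" + "\n".join(memory[-20:]))
        thought_and_action = llm(prompt)
        feedback = llm("Briefly evaluate your last step:\n" + thought_and_action)
        memory.append(f"[cycle {i}] {thought_and_action}")
        memory.append(f"[feedback {i}] {feedback}")
    return memory

# Stub model so the loop runs end to end.
print(autonomous_cycles(lambda p: "organize notes into a small project")[-1])
```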
--------------------------------------------------------------------------------------------------------
Large Language Models show potential for automated code review through context understanding and reasoning capabilities, but remain limited compared to human-level cognition. While fine-tuning with code review data improves performance, current methods underutilize information compared to human reviewers, who simultaneously analyze multiple review dimensions. MelcotCR introduces a chain-of-thought fine-tuning approach that trains LLMs to reason across multiple code review dimensions, using long COT techniques to provide rich structured information. Maximum Entropy modeling combined with pre-defined reasoning pathways addresses the context and reasoning-logic loss that occurs with long COT prompts. Evaluations show a 14B Qwen2.5 model fine-tuned with MelcotCR surpasses state-of-the-art methods, performing comparably to the 671B DeepSeek-R1 model. Applications include enhanced code review automation, improved developer productivity tools, and better code quality assurance systems.
Authors: Yongda Yu, Guohao Shi, Xianwei Wu, Haochuan He, XueMing Gu, Qianqian Zhao, Kui Liu, Qiushi Wang, Zhao Tian, Haifeng Shen, Guoping Rong
Link: https://arxiv.org/abs/2509.21170v1
Date: 2025-09-d
Summary:
Large Language Models (LLMs) have shown great potential in supporting automated code review due to their impressive capabilities in context understanding and reasoning. However, these capabilities are still limited compared to human-level cognition because they are heavily influenced by the training data. Recent research has demonstrated significantly improved performance through fine-tuning LLMs with code review data. However, compared to human reviewers who often simultaneously analyze multiple dimensions of code review to better identify issues, the full potential of these methods is hampered by the limited or vague information used to fine-tune the models. This paper contributes MelcotCR, a chain-of-thought (COT) fine-tuning approach that trains LLMs with an impressive reasoning ability to analyze multiple dimensions of code review by harnessing long COT techniques to provide rich structured information. To address context loss and reasoning logic loss issues that frequently occur when LLMs process long COT prompts, we propose a solution that combines the Maximum Entropy (ME) modeling principle with pre-defined reasoning pathways in MelcotCR to enable more effective utilization of in-context knowledge within long COT prompts while strengthening the logical tightness of the reasoning process. Empirical evaluations on our curated MelcotCR dataset and the public CodeReviewer dataset reveal that a low-parameter base model, such as 14B Qwen2.5, fine-tuned with MelcotCR can surpass state-of-the-art methods in terms of the accuracy of detecting and describing code issues, with its performance remarkably on par with that of the 671B DeepSeek-R1 model.
--------------------------------------------------------------------------------------------------------
UniSS: Unified Expressive Speech-to-Speech Translation with Your Voice
Expressive speech-to-speech translation aims to accurately translate content while preserving speaker identity and emotional style, but progress faces challenges from scarce paired expressive data, complex multi-stage pipelines, and limited translation capability transfer from LLMs. UniSS introduces a novel single-stage framework addressing these challenges through carefully designed speech semantic and style modeling, seamlessly integrating with text-based LLM frameworks as a unified text-speech language model. Cross-modal chain-of-thought prompting progressively aligns audio semantics with text while ensuring style preservation. The release includes UniST, a 44.8k-hour high-quality expressive S2ST dataset. Results show UniSS significantly outperforms previous methods in translation fidelity and speech quality while preserving voice, emotion, and duration. Applications include real-time multilingual communication systems, accessible translation services, and expressive voice assistants.
Authors: Sitong Cheng, Weizhen Bian, Xinsheng Wang, Ruibin Yuan, Jianyi Chen, Shunshun Yin, Yike Guo, Wei Xue
Link: https://arxiv.org/abs/2509.21144v1
Date: 2025-09-d
Summary:
The ultimate goal of expressive speech-to-speech translation (S2ST) is to accurately translate spoken content while preserving the speaker identity and emotional style. However, progress in this field is largely hindered by three key challenges: the scarcity of paired speech data that retains expressive styles, the complexity of multi-stage processing pipelines, and the limited transfer of translation capabilities from large language models (LLMs). In this work, we address these challenges by introducing UniSS, a novel single-stage framework for expressive S2ST. Our approach features carefully designed speech semantic and style modeling, enabling seamless integration with existing text-based LLM frameworks to develop a unified text-speech language model. To transfer translation capabilities from text to speech, we propose a cross-modal chain-of-thought prompting process that progressively aligns audio semantics with text and ensures style preservation in the decoded results. Furthermore, we construct and release a large-scale, high-quality expressive S2ST dataset, UniST, comprising 44.8k hours of data. Experimental results show that UniSS significantly outperforms previous methods in translation fidelity and speech quality while preserving voice, emotion, and duration consistency. Our work establishes a simpler and more effective paradigm for building the next generation of expressive S2ST systems. Audio samples are available at https://cmots.github.io/uniss-demo.
--------------------------------------------------------------------------------------------------------
Vision Transformers: the threat of realistic adversarial patches
Machine learning system security has become critical as reliance increases, with evasion attacks enabling adversaries to manipulate AI decision-making. Vision Transformers demonstrate improved performance over CNNs and increased robustness against adversarial perturbations but remain vulnerable to adversarial patches, unique patterns designed to manipulate classification. This study investigates the design of realistic adversarial patches that cause misclassification in person versus non-person classification, using the Creases Transformation technique to add subtle geometric distortions resembling natural clothing wear. Evaluation across four fine-tuned ViT models reveals significant vulnerability variations, with attack success rates ranging from 40.04% to 99.97%. Results confirm cross-architectural transferability from CNNs to ViTs, with pre-training dataset scale and methodology strongly influencing resilience. Applications include adversarial robustness testing, security assessment protocols, and developing more resilient computer vision systems.
Authors: Kasper Cools, Clara Maathuis, Alexander M. van Oers, Claudia S. Hübner, Nikos Deligiannis, Marijke Vandewal, Geert De Cubber
Link: https://arxiv.org/abs/2509.21084v1
Date: 2025-09-d
Summary:
The increasing reliance on machine learning systems has made their security a critical concern. Evasion attacks enable adversaries to manipulate the decision-making processes of AI systems, potentially causing security breaches or misclassification of targets. Vision Transformers (ViTs) have gained significant traction in modern machine learning due to (1) improved performance compared to Convolutional Neural Networks (CNNs) and (2) increased robustness against adversarial perturbations. However, ViTs remain vulnerable to evasion attacks, particularly to adversarial patches, unique patterns designed to manipulate AI classification systems. These vulnerabilities are investigated by designing realistic adversarial patches to cause misclassification in person vs. non-person classification tasks using the Creases Transformation (CT) technique, which adds subtle geometric distortions similar to those occurring naturally when wearing clothing. This study investigates the transferability of adversarial attack techniques used in CNNs when applied to ViT classification models. Experimental evaluation across four fine-tuned ViT models on a binary person classification task reveals significant vulnerability variations: attack success rates ranged from 40.04% (google/vit-base-patch16-224-in21k) to 99.97% (facebook/dino-vitb16), with google/vit-base-patch16-224 achieving 66.40% and facebook/dinov3-vitb16 reaching 65.17%. These results confirm the cross-architectural transferability of adversarial patches from CNNs to ViTs, with pre-training dataset scale and methodology strongly influencing model resilience to adversarial attacks.
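The elementary attack step, pasting an optimized patch into an image before classification, can be sketched as below. The Creases Transformation itself (geometric distortions imitating clothing folds) is not reproduced here, and the patch is random rather than optimized; this only illustrates patch application.

```python
# Hedged sketch of the basic patch-application step before feeding images to a
# ViT classifier under evaluation. A real attack would optimize the patch.
import torch

def apply_patch(images: torch.Tensor, patch: torch.Tensor,
                top: int, left: int) -> torch.Tensor:
    """images: (B, 3, H, W); patch: (3, h, w). Returns patched copies."""
    out = images.clone()
    h, w = patch.shape[1:]
    out[:, :, top:top + h, left:left + w] = patch
    return out.clamp(0.0, 1.0)

images = torch.rand(4, 3, 224, 224)
patch = torch.rand(3, 48, 48)               # stand-in for an optimized patch
patched = apply_patch(images, patch, top=150, left=90)
print(patched.shape)                        # feed to the ViT under evaluation
```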
--------------------------------------------------------------------------------------------------------
Which Cultural Lens Do Models Adopt? On Cultural Positioning Bias and Agentic Mitigation in LLMs
Large language models enable diverse generative applications but risk perpetuating subtle cultural fairness issues by positioning generations from mainstream US cultural perspectives while demonstrating externality toward non-mainstream cultures. This work identifies and investigates culture positioning bias, where LLM default stances align with mainstream views treating other cultures as outsiders. The CultureLens benchmark with 4000 prompts quantifies this bias through culturally situated interview script generation across 10 diverse cultures. Evaluation reveals stark patterns: models adopt insider tones in over 88% of US-contexted scripts but disproportionately adopt outsider stances for less dominant cultures. Two inference-time mitigation methods are proposed, including agent-based frameworks with reflection and specialized agents for critique and refinement. Applications include fairer multilingual systems, culturally aware content generation, and bias mitigation frameworks for deployed LLMs.
Authors: Yixin Wan, Xingrun Chen, Kai-Wei Chang
Link: https://arxiv.org/abs/2509.21080v1
Date: 2025-09-d
Summary:
Large language models (LLMs) have unlocked a wide range of downstream generative applications. However, we found that they also risk perpetuating subtle fairness issues tied to culture, positioning their generations from the perspectives of the mainstream US culture while demonstrating salient externality towards non-mainstream ones. In this work, we identify and systematically investigate this novel culture positioning bias, in which an LLM's default generative stance aligns with a mainstream view and treats other cultures as outsiders. We propose the CultureLens benchmark with 4000 generation prompts and 3 evaluation metrics for quantifying this bias through the lens of a culturally situated interview script generation task, in which an LLM is positioned as an onsite reporter interviewing local people across 10 diverse cultures. Empirical evaluation on 5 state-of-the-art LLMs reveals a stark pattern: while models adopt insider tones in over 88 percent of US-contexted scripts on average, they disproportionately adopt mainly outsider stances for less dominant cultures. To resolve these biases, we propose 2 inference-time mitigation methods: a baseline prompt-based Fairness Intervention Pillars (FIP) method, and a structured Mitigation via Fairness Agents (MFA) framework consisting of 2 pipelines: (1) MFA-SA (Single-Agent) introduces a self-reflection and rewriting loop based on fairness guidelines. (2) MFA-MA (Multi-Agent) structures the process into a hierarchy of specialized agents: a Planner Agent (initial script generation), a Critique Agent (evaluates initial script against fairness pillars), and a Refinement Agent (incorporates feedback to produce a polished, unbiased script). Empirical results showcase the effectiveness of agent-based methods as a promising direction for mitigating biases in generative LLMs.
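A hedged sketch of the MFA-MA pipeline as described: a Planner drafts the culturally situated script, a Critique agent reviews it against fairness pillars, and a Refinement agent rewrites it with that feedback. The prompts are illustrative and the LLM call is stubbed; this is not the authors' implementation.

```python
# Hedged sketch of the Planner -> Critique -> Refinement agent pipeline.
from typing import Callable

def mfa_ma(culture: str, llm: Callable[[str], str]) -> str:
    script = llm(f"Write an on-site reporter interview script set in {culture}, "
                 "written from a local insider's perspective.")
    critique = llm("Critique this script against fairness pillars "
                   "(insider framing, no exoticizing language, balanced voice):\n"
                   + script)
    refined = llm("Rewrite the script to address this critique.\n"
                  f"Script:\n{script}\nCritique:\n{critique}")
    return refined

# Stub LLM so the pipeline runs; a real deployment would call a chat model.
print(mfa_ma("a non-US culture of interest", lambda p: p.splitlines()[0][:60]))
```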
--------------------------------------------------------------------------------------------------------
Decoding the Surgical Scene: A Scoping Review of Scene Graphs in Surgery
Scene graphs provide structured relational representations crucial for understanding complex, dynamic surgical environments. This PRISMA-ScR-guided scoping review systematically maps scene graph research in surgery, revealing rapid growth alongside a critical "data divide": internal-view research uses real-world 2D video while external-view 4D modeling relies on simulated data, exposing translational research gaps. Methodologically, the field advanced from foundational graph neural networks to specialized foundation models significantly outperforming generalist large vision-language models in surgical contexts. Scene graphs enable both analytical tasks like workflow recognition and automated safety monitoring, and generative tasks like controllable surgical simulation. Despite persistent challenges in data annotation and real-time implementation, emerging techniques actively address these issues. Applications include intelligent surgical assistance systems, enhanced surgical training platforms, automated safety monitoring, and improved surgical workflow optimization.
Authors: Angelo Henriques, Korab Hoxha, Daniel Zapp, Peter C. Issa, Nassir Navab, M. Ali Nasseri
Link: https://arxiv.org/abs/2509.20941v1
Date: 2025-09-d
Summary:
Scene graphs (SGs) provide structured relational representations crucial for decoding complex, dynamic surgical environments. This PRISMA-ScR-guided scoping review systematically maps the evolving landscape of SG research in surgery, charting its applications, methodological advancements, and future directions. Our analysis reveals rapid growth, yet uncovers a critical 'data divide': internal-view research (e.g., triplet recognition) almost exclusively uses real-world 2D video, while external-view 4D modeling relies heavily on simulated data, exposing a key translational research gap. Methodologically, the field has advanced from foundational graph neural networks to specialized foundation models that now significantly outperform generalist large vision-language models in surgical contexts. This progress has established SGs as a cornerstone technology for both analysis, such as workflow recognition and automated safety monitoring, and generative tasks like controllable surgical simulation. Although challenges in data annotation and real-time implementation persist, they are actively being addressed through emerging techniques. Surgical SGs are maturing into an essential semantic bridge, enabling a new generation of intelligent systems to improve surgical safety, efficiency, and training.
--------------------------------------------------------------------------------------------------------
Parallel Thinking, Sequential Answering: Bridging NAR and AR for Efficient Reasoning
Reasoning tasks benefit from integrating auto-regressive and non-autoregressive language models, each with distinct trade-offs. AR models generate text sequentially, excelling at coherent outputs but suffering slow inference in reasoning-intensive domains requiring lengthy thought chains. NAR models like discrete diffusion models enable parallel generation with substantial speedups but typically sacrifice output quality. This work introduces a paradigm where NAR models efficiently produce intermediate reasoning traces guiding AR models to deliver precise final answers. Experiments demonstrate 26% improvements over strong baselines while substantially reducing inference costs. Applications include faster mathematical reasoning systems, efficient code generation tools, accelerated multi-step problem solving, and cost-effective deployment of reasoning-capable models in production environments where latency and computational efficiency matter.
Authors: Qihang Ai, Haiyun Jiang
Link: https://arxiv.org/abs/2509.20744v1
Date: 2025-09-d
Summary:
We study reasoning tasks through a framework that integrates auto-regressive (AR) and non-autoregressive (NAR) language models. AR models, which generate text sequentially, excel at producing coherent outputs but often suffer from slow inference, particularly in reasoning-intensive domains such as mathematics and code, where lengthy chains of thought are required. In contrast, NAR models, such as discrete diffusion models, allow parallel generation and offer substantial speedups, though typically at the cost of reduced output quality. To address these limitations, we introduce a new paradigm in which an NAR model efficiently produces intermediate reasoning traces, which subsequently guide an AR model to deliver precise final answers. Experiments demonstrate that our approach yields a significant 26% improvement over strong baselines while substantially reducing inference cost.
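The division of labor can be sketched in a few lines: a fast, parallel NAR drafter writes the intermediate reasoning trace, and the AR model only generates the short final answer conditioned on that trace. Both model calls below are stubs standing in for a diffusion LM and an AR LLM; the prompt format is an assumption.

```python
# Hedged sketch of the NAR-drafts / AR-answers split.
from typing import Callable

def nar_then_ar(question: str,
                nar_draft_trace: Callable[[str], str],
                ar_answer: Callable[[str], str]) -> str:
    trace = nar_draft_trace(question)        # parallel, cheap, possibly rough
    prompt = (f"Question: {question}\n"
              f"Draft reasoning:\n{trace}\n"
              "Give only the final answer.")
    return ar_answer(prompt)                 # sequential, short, precise

print(nar_then_ar(
    "What is 17 * 24?",
    nar_draft_trace=lambda q: "17*24 = 17*20 + 17*4 = 340 + 68",
    ar_answer=lambda p: "408",
))
```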
--------------------------------------------------------------------------------------------------------
Learning to Align Molecules and Proteins: A Geometry-Aware Approach to Binding Affinity
Accurate drug-target binding affinity prediction accelerates drug discovery by prioritizing promising compounds before expensive wet-lab screening. While deep learning advances this task, most models fuse ligand and protein representations through simple concatenation without explicit geometric regularization, resulting in poor generalization across chemical space and time. FIRM-DTI introduces a lightweight framework conditioning molecular embeddings on protein embeddings through feature-wise linear modulation and enforcing metric structure with a triplet loss. An RBF regression head operating on embedding distances yields smooth, interpretable affinity predictions. Despite its modest size, FIRM-DTI achieves state-of-the-art performance on the DTI-DG benchmark, as demonstrated by an extensive ablation study and out-of-domain evaluation. Applications include accelerated drug discovery pipelines, virtual screening platforms, personalized medicine development, and repurposing existing drugs for new therapeutic targets.
Authors: Mohammadsaleh Refahi, Bahrad A. Sokhansanj, James R. Brown, Gail Rosen
Link: https://arxiv.org/abs/2509.20693v1
Date: 2025-09-d
Summary:
Accurate prediction of drug-target binding affinity can accelerate drug discovery by prioritizing promising compounds before costly wet-lab screening. While deep learning has advanced this task, most models fuse ligand and protein representations via simple concatenation and lack explicit geometric regularization, resulting in poor generalization across chemical space and time. We introduce FIRM-DTI, a lightweight framework that conditions molecular embeddings on protein embeddings through a feature-wise linear modulation (FiLM) layer and enforces metric structure with a triplet loss. An RBF regression head operating on embedding distances yields smooth, interpretable affinity predictions. Despite its modest size, FIRM-DTI achieves state-of-the-art performance on the Therapeutics Data Commons DTI-DG benchmark, as demonstrated by an extensive ablation study and out-of-domain evaluation. Our results underscore the value of conditioning and metric learning for robust drug-target affinity prediction.
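A hedged sketch of the three named ingredients: a FiLM layer that conditions the molecule embedding on the protein embedding, a triplet loss that enforces metric structure in the joint space, and an RBF head that maps embedding distance to an affinity score. The encoder sizes and single-linear encoders are illustrative assumptions, not the paper's architecture.

```python
# Hedged sketch: FiLM conditioning + RBF-on-distance head + triplet loss.
import torch
import torch.nn as nn

class FiLMAffinity(nn.Module):
    def __init__(self, mol_dim=256, prot_dim=512, dim=128, gamma=2.0):
        super().__init__()
        self.mol = nn.Linear(mol_dim, dim)
        self.prot = nn.Linear(prot_dim, dim)
        self.film = nn.Linear(dim, 2 * dim)      # predicts scale and shift
        self.gamma = gamma

    def embed(self, mol_x, prot_x):
        m, p = self.mol(mol_x), self.prot(prot_x)
        scale, shift = self.film(p).chunk(2, dim=-1)
        return m * (1 + scale) + shift, p        # FiLM-conditioned molecule emb

    def forward(self, mol_x, prot_x):
        m, p = self.embed(mol_x, prot_x)
        dist = (m - p).pow(2).sum(-1)
        return torch.exp(-dist / self.gamma)     # RBF: small distance -> high affinity

model = FiLMAffinity()
mol, prot = torch.randn(8, 256), torch.randn(8, 512)
pred = model(mol, prot)                          # predicted affinity in (0, 1]

# Triplet loss: a binder (positive) should sit closer to its protein than a decoy.
m_pos, p_anchor = model.embed(mol, prot)
m_neg, _ = model.embed(torch.randn(8, 256), prot)
triplet = nn.TripletMarginLoss(margin=1.0)(p_anchor, m_pos, m_neg)
print(pred.shape, float(triplet))
```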
--------------------------------------------------------------------------------------------------------