Week Ending 1.25.2026

 

RESEARCH WATCH: 1.25.2026

 

A Scalable Measure of Loss Landscape Curvature for Analyzing the Training Dynamics of LLMs

Understanding how neural networks learn requires measuring the curvature of their loss landscapes, but this has been computationally prohibitive for large language models. This paper introduces "critical sharpness," an efficient measure requiring fewer than 10 forward passes that captures key training phenomena like progressive sharpening and Edge of Stability. Demonstrated on models up to 7B parameters, this tool enables practitioners to diagnose training dynamics and optimize data mixing strategies during pre-training and fine-tuning transitions. The scalable approach provides actionable insights for improving training efficiency, stability, and performance at scales previously inaccessible to curvature analysis, making it valuable for researchers optimizing large-scale language model development.

Authors:  Dayal Singh Kalra, Jean-Christophe Gagnon-Audet, Andrey Gromov, Ishita Mediratta, Kelvin Niu, Alexander H Miller, Michael Shvartsman

Link:  https://arxiv.org/abs/2601.16979v1

Date: 2026-01-d

Summary:

Understanding the curvature evolution of the loss landscape is fundamental to analyzing the training dynamics of neural networks. The most commonly studied measure, Hessian sharpness ($\lambda_{\max}^H$), the largest eigenvalue of the loss Hessian, determines local training stability and interacts with the learning rate throughout training. Despite its significance in analyzing training dynamics, direct measurement of Hessian sharpness remains prohibitive for Large Language Models (LLMs) due to high computational cost. We analyze critical sharpness ($\lambda_c$), a computationally efficient measure requiring fewer than $10$ forward passes given the update direction $\Delta\boldsymbol{\theta}$. Critically, this measure captures well-documented Hessian sharpness phenomena, including progressive sharpening and Edge of Stability. Using this measure, we provide the first demonstration of these sharpness phenomena at scale, up to $7$B parameters, spanning both pre-training and mid-training of OLMo-2 models. We further introduce relative critical sharpness ($\lambda_c^{1\to 2}$), which quantifies the curvature of one loss landscape while optimizing another, to analyze the transition from pre-training to fine-tuning and guide data mixing strategies. Critical sharpness provides practitioners with a practical tool for diagnosing curvature dynamics and informing data composition choices at scale. More broadly, our work shows that scalable curvature measures can provide actionable insights for large-scale training.
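
The abstract does not spell out how critical sharpness is computed, so the following is only a minimal sketch of the general recipe it alludes to: probe the loss along the update direction with a handful of extra forward passes and read off a directional curvature. The toy quadratic loss, the probe count, and the directional_curvature helper are hypothetical stand-ins, not the authors' implementation.

    import numpy as np

    # Toy stand-in for a network's training loss: an anisotropic quadratic.
    H = np.diag([10.0, 1.0, 0.1])                  # hypothetical Hessian
    loss = lambda theta: 0.5 * theta @ H @ theta

    theta = np.array([1.0, 1.0, 1.0])              # current parameters
    delta = -H @ theta                             # update direction (e.g., -gradient)

    def directional_curvature(loss_fn, theta, delta, eps=1e-3, n_probes=5):
        """Estimate the second derivative of the loss along the normalized update
        direction from a few forward passes (no Hessian-vector products needed)."""
        u = delta / np.linalg.norm(delta)
        alphas = np.linspace(-eps, eps, n_probes)
        values = [loss_fn(theta + a * u) for a in alphas]    # forward passes only
        quad, _, _ = np.polyfit(alphas, values, 2)           # quadratic fit
        return 2.0 * quad                                    # curvature = 2 * leading coefficient

    print(directional_curvature(loss, theta, delta))          # ~ u^T H u for this toy loss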

--------------------------------------------------------------------------------------------------------

Nishpaksh: TEC Standard-Compliant Framework for Fairness Auditing and Certification of AI Models

As AI systems increasingly influence high-stakes decisions in telecommunications and emerging 6G applications, ensuring fairness becomes critical for regulatory compliance and public trust. Nishpaksh addresses gaps in existing fairness toolkits by aligning with India's Telecommunication Engineering Centre standards, providing region-specific regulatory compliance. This web-based framework integrates risk quantification, contextual threshold determination, and quantitative fairness evaluation into audit-grade assessments. Validated on the COMPAS dataset, it identifies attribute-specific biases and generates standardized fairness scores. Applications span critical infrastructure like telecommunications, enabling organizations to meet national AI governance requirements while deploying responsible AI systems that respect local regulatory priorities and cultural contexts.

Authors:  Shashank Prakash, Ranjitha Prasad, Avinash Agarwal

Link:  https://arxiv.org/abs/2601.16926v1

Date: 2026-01-d

Summary:

The growing reliance on Artificial Intelligence (AI) models in high-stakes decision-making systems, particularly within emerging telecom and 6G applications, underscores the urgent need for transparent and standardized fairness assessment frameworks. While global toolkits such as IBM AI Fairness 360 and Microsoft Fairlearn have advanced bias detection, they often lack alignment with region-specific regulatory requirements and national priorities. To address this gap, we propose Nishpaksh, an indigenous fairness evaluation tool that operationalizes the Telecommunication Engineering Centre (TEC) Standard for the Evaluation and Rating of Artificial Intelligence Systems. Nishpaksh integrates survey-based risk quantification, contextual threshold determination, and quantitative fairness evaluation into a unified, web-based dashboard. The tool employs vectorized computation, reactive state management, and certification-ready reporting to enable reproducible, audit-grade assessments, thereby addressing a critical post-standardization implementation need. Experimental validation on the COMPAS dataset demonstrates Nishpaksh's effectiveness in identifying attribute-specific bias and generating standardized fairness scores compliant with the TEC framework. The system bridges the gap between research-oriented fairness methodologies and regulatory AI governance in India, marking a significant step toward responsible and auditable AI deployment within critical infrastructure like telecommunications.
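
The TEC scoring rules and Nishpaksh's internals are not given in the abstract; the snippet below only illustrates the kind of attribute-specific check such an audit pipeline quantifies, namely a demographic-parity gap over a protected attribute. The toy records and the demographic_parity_gap helper are hypothetical.

    import numpy as np

    # Hypothetical audit records: a binary "high-risk" prediction and a protected attribute.
    y_pred = np.array([1, 0, 1, 1, 0, 1, 0, 0, 1, 1])
    group  = np.array(["A", "A", "A", "B", "B", "B", "B", "A", "B", "A"])

    def demographic_parity_gap(y_pred, group):
        """Largest difference in positive-prediction rates across groups."""
        rates = {g: y_pred[group == g].mean() for g in np.unique(group)}
        return max(rates.values()) - min(rates.values()), rates

    gap, rates = demographic_parity_gap(y_pred, group)
    print(rates, f"parity gap = {gap:.2f}")   # the raw signal an audit framework would threshold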

--------------------------------------------------------------------------------------------------------

IRS Compensation of Hyper-Rayleigh Fading: How Many Elements Are Needed?

Intelligent Reflecting Surfaces offer promising solutions for enhancing wireless communications by compensating for severe fading conditions. This research quantifies the minimum number of IRS elements needed to overcome Hyper-Rayleigh Regime fading, representing worst-case multipath conditions. Using the Inverse Power Lomax channel model, the study derives that at least 6 elements are required to escape full Hyper-Rayleigh conditions, while 14 elements achieve no-Hyper-Rayleigh status. These findings provide concrete design guidelines for deploying IRS technology in challenging wireless environments, enabling engineers to optimize system configurations for reliable communication in scenarios with heavy fading, such as indoor environments, urban canyons, or emergency response situations where traditional solutions fail.

Authors:  Aleksey S. Gvozdarev

Link:  https://arxiv.org/abs/2601.16915v1

Date: 2026-01-d

Summary:

The letter introduces and studies the problem of defining the minimum number of Intelligent Reflecting Surface (IRS) elements needed to compensate for heavy fading conditions in multipath fading channels. The fading severity is quantified in terms of Hyper-Rayleigh Regimes (HRRs) (i.e., full-HRR (worst-case conditions), strong-, weak-, and no-HRR), and the channel model used (Inverse Power Lomax (IPL)) was chosen since it can account for all HRRs. The research presents the derived closed-form channel coefficient envelope statistics for the single IRS-element channel with IPL statistics in both subchannels and for the total IRS-assisted channel, as well as tight approximations for the channel coefficient and instantaneous signal-to-noise ratio (SNR) statistics for the latter. The derived expressions helped estimate channel parameters corresponding to the specific HRRs of the total channel and demonstrate that while both single links (i.e., "source-IRS" and "IRS-destination") are in full-HRR, the minimum number of IRS elements needed to bring the total IRS-assisted link ("source-IRS-destination") out of full-HRR is no less than $6$ (for the whole range of the IPL scale parameter corresponding to full-HRR). Furthermore, the minimum number of IRS elements required to bring the total IRS-assisted link into no-HRR is $14$ (under the same conditions).
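
The IPL channel statistics derived in the letter are not reproduced here. The sketch below only illustrates the underlying question, namely how coherently combining more IRS elements pulls a heavy-fading link's outage back toward (and below) a Rayleigh baseline, using a generic heavy-tailed Weibull stand-in (shape < 2) for each hop. All distributions, thresholds, and element counts are illustrative assumptions.

    import numpy as np

    rng = np.random.default_rng(0)

    def outage_vs_rayleigh(n_elements, shape=0.7, threshold=0.1, n_trials=200_000):
        """Monte-Carlo outage of an N-element IRS-assisted link (ideal phase alignment)
        versus a Rayleigh link of equal mean power, at a deep-fade threshold."""
        g = rng.weibull(shape, (n_trials, n_elements))      # source -> IRS amplitudes
        r = rng.weibull(shape, (n_trials, n_elements))      # IRS -> destination amplitudes
        power = (g * r).sum(axis=1) ** 2                    # coherent combining
        power /= power.mean()                               # normalize to unit mean power
        rayleigh_power = rng.exponential(1.0, n_trials)     # Rayleigh baseline (unit mean)
        return (power < threshold).mean(), (rayleigh_power < threshold).mean()

    for n in (1, 2, 4, 8, 16):
        irs, ray = outage_vs_rayleigh(n)
        print(f"N={n:2d}  IRS outage={irs:.4f}  Rayleigh outage={ray:.4f}")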

--------------------------------------------------------------------------------------------------------

Emerging Threats and Countermeasures in Neuromorphic Systems: A Survey

Neuromorphic computing mimics brain-inspired processing through spiking neurons and memristive devices, offering energy-efficient alternatives to traditional computing. However, these emerging architectures introduce unique security vulnerabilities stemming from asynchronous processing and stochastic device behavior. This comprehensive survey maps the security landscape of neuromorphic systems, analyzing attack methodologies, side-channel vulnerabilities, and countermeasures across both hardware and software implementations. Coverage includes spiking neural networks and security primitives like Physical Unclonable Functions and True Random Number Generators. As neuromorphic computing advances toward practical deployment in edge devices, autonomous systems, and secure computation applications, understanding these security challenges becomes essential for developing trustworthy brain-inspired architectures that can safely operate in adversarial environments.

Authors:  Pablo Sorrentino, Stjepan Picek, Ihsen Alouani, Nikolaos Athanasios Anagnostopoulos, Francesco Regazzoni, Lejla Batina, Tamalika Banerjee, Fatih Turkmen

Link:  https://arxiv.org/abs/2601.16589v1

Date: 2026-01-d

Summary:

Neuromorphic computing mimics brain-inspired mechanisms through spiking neurons and energy-efficient processing, offering a pathway to efficient in-memory computing (IMC). However, these advancements raise critical security and privacy concerns. As the adoption of bio-inspired architectures and memristive devices increases, so does the urgency to assess the vulnerability of these emerging technologies to hardware and software attacks. Emerging architectures introduce new attack surfaces, particularly due to asynchronous, event-driven processing and stochastic device behavior. The integration of memristors into neuromorphic hardware and software implementations in spiking neural networks offers diverse possibilities for advanced computing architectures, including their role in security-aware applications. This survey systematically analyzes the security landscape of neuromorphic systems, covering attack methodologies, side-channel vulnerabilities, and countermeasures. We focus on both hardware and software concerns relevant to spiking neural networks (SNNs) and hardware primitives, such as Physical Unclonable Functions (PUFs) and True Random Number Generators (TRNGs) for cryptographic and secure computation applications. We approach this analysis from diverse perspectives, from attack methodologies to countermeasure strategies that integrate efficiency and protection in brain-inspired hardware. This review not only maps the current landscape of security threats but also provides a foundation for developing secure and trustworthy neuromorphic architectures.

--------------------------------------------------------------------------------------------------------

REprompt: Prompt Generation for Intelligent Software Development Guided by Requirements Engineering

Large language models are transforming software development, serving as foundation models in coding agents where prompts carry user requirements and guide model behavior. Despite their importance, designing effective prompts remains challenging, requiring expertise in both prompt engineering and requirements engineering. Existing automated prompt optimization methods neglect formal requirement specification principles. REprompt addresses this gap through a multi-agent framework that grounds prompt generation in requirements engineering methodologies, optimizing both system prompts (high-level instructions) and user prompts (specific requirements). This approach reduces manual effort while ensuring generated prompts conform to realistic software development specifications, enabling more reliable and effective AI-assisted development in vibe-coding scenarios where conversational paradigms dominate.

Authors:  Junjie Shi, Weisong Sun, Zhenpeng Chen, Zhujun Wu, Xiaohong Chen, Zhi Jin, Yang Liu

Link:  https://arxiv.org/abs/2601.16507v1

Date: 2026-01-d

Summary:

The rapid development of large language models is transforming software development. Beyond serving as code auto-completion tools in integrated development environments, large language models increasingly function as foundation models within coding agents in vibe-coding scenarios. In such settings, prompts play a central role in agent-based intelligent software development, as they not only guide the behavior of large language models but also serve as carriers of user requirements. Under the dominant conversational paradigm, prompts are typically divided into system prompts and user prompts. System prompts provide high-level instructions to steer model behavior and establish conversational context, while user prompts represent inputs and requirements provided by human users. Despite their importance, designing effective prompts remains challenging, as it requires expertise in both prompt engineering and software engineering, particularly requirements engineering. To reduce the burden of manual prompt construction, numerous automated prompt engineering methods have been proposed. However, most existing approaches neglect the methodological principles of requirements engineering, limiting their ability to generate artifacts that conform to formal requirement specifications in realistic software development scenarios. To address this gap, we propose REprompt, a multi-agent prompt optimization framework guided by requirements engineering. Experiment results demonstrate that REprompt effectively optimizes both system and user prompts by grounding prompt generation in requirements engineering principles.

--------------------------------------------------------------------------------------------------------

Timely Machine: Awareness of Time Makes Test-Time Scaling Agentic

Test-time scaling enhances large language model reasoning on complex tasks, but traditional generation-length-based definitions fail in agentic scenarios where tool latency decouples inference time from generation length. Timely Machine redefines test-time as wall-clock time, enabling models to dynamically adjust strategies based on time budgets. The Timely-Eval benchmark reveals that smaller models excel with fast feedback through frequent interactions, while larger models dominate high-latency settings through superior interaction quality. However, existing models fail to adapt reasoning to time constraints. Timely-RL addresses this through reinforcement learning that enhances temporal planning after supervised fine-tuning, improving time budget awareness and performance, offering new perspectives on test-time scaling for agentic AI applications.

Authors:  Yichuan Ma, Linyang Li, Yongkang Chen, Peiji Li, Xiaozhe Li, Qipeng Guo, Dahua Lin, Kai Chen

Link:  https://arxiv.org/abs/2601.16486v1

Date: 2026-01-d

Summary:

As large language models (LLMs) increasingly tackle complex reasoning tasks, test-time scaling has become critical for enhancing capabilities. However, in agentic scenarios with frequent tool calls, the traditional generation-length-based definition breaks down: tool latency decouples inference time from generation length. We propose Timely Machine, redefining test-time as wall-clock time, where models dynamically adjust strategies based on time budgets. We introduce Timely-Eval, a benchmark spanning high-frequency tool calls, low-frequency tool calls, and time-constrained reasoning. By varying tool latency, we find smaller models excel with fast feedback through more interactions, while larger models dominate high-latency settings via superior interaction quality. Moreover, existing models fail to adapt reasoning to time budgets. We propose Timely-RL to address this gap. After cold-start supervised fine-tuning, we use reinforcement learning to enhance temporal planning. Timely-RL improves time budget awareness and consistently boosts performance across Timely-Eval. We hope our work offers a new perspective on test-time scaling for the agentic era.
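
Wall-clock-budgeted control can be pictured as a loop that, before each tool call, checks whether another round-trip still fits in the remaining budget and otherwise commits to an answer. The loop, the latency estimate, and the stub call_tool / answer_now functions below are hypothetical; this is not the Timely-RL training recipe, only an illustration of time-aware test-time behavior.

    import time

    def call_tool(query):                  # stand-in for a real tool call
        time.sleep(0.05)                   # simulated tool latency
        return f"observation for {query!r}"

    def answer_now(context):               # stand-in for producing a final answer
        return f"best-effort answer from {len(context)} observations"

    def timed_agent(task, budget_s=0.3, expected_latency_s=0.05):
        """Keep calling tools only while another round-trip fits in the wall-clock budget."""
        start, context = time.monotonic(), []
        while time.monotonic() - start + expected_latency_s < budget_s:
            context.append(call_tool(task))          # spend time gathering evidence
        return answer_now(context)                    # otherwise, commit to an answer

    print(timed_agent("find the release year of OLMo-2"))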

--------------------------------------------------------------------------------------------------------

Regional Bias in Large Language Models

Large language models increasingly influence global applications, yet regional bias threatens their fairness and inclusivity across diverse cultural contexts. This study evaluates ten prominent LLMs using 100 carefully designed prompts that probe forced-choice decisions between regions under neutral scenarios. The FAZE framework measures regional bias on a 10-point scale, revealing substantial variation: GPT-3.5 exhibits the highest bias (9.5) while Claude 3.5 Sonnet scores lowest (2.5). These findings demonstrate that regional bias can meaningfully undermine LLM reliability and fairness in real-world cross-cultural applications. The work contributes to AI fairness research by highlighting the need for inclusive evaluation frameworks and systematic approaches to identify and mitigate geographic biases in language models deployed globally.

Authors:  M P V S Gopinadh, Kappara Lakshmi Sindhu, Soma Sekhar Pandu Ranga Raju P, Yesaswini Swarna

Link:  https://arxiv.org/abs/2601.16349v1

Date: 2026-01-d

Summary:

This study investigates regional bias in large language models (LLMs), an emerging concern in AI fairness and global representation. We evaluate ten prominent LLMs: GPT-3.5, GPT-4o, Gemini 1.5 Flash, Gemini 1.0 Pro, Claude 3 Opus, Claude 3.5 Sonnet, Llama 3, Gemma 7B, Mistral 7B, and Vicuna-13B using a dataset of 100 carefully designed prompts that probe forced-choice decisions between regions under contextually neutral scenarios. We introduce FAZE, a prompt-based evaluation framework that measures regional bias on a 10-point scale, where higher scores indicate a stronger tendency to favor specific regions. Experimental results reveal substantial variation in bias levels across models, with GPT-3.5 exhibiting the highest bias score (9.5) and Claude 3.5 Sonnet scoring the lowest (2.5). These findings indicate that regional bias can meaningfully undermine the reliability, fairness, and inclusivity of LLM outputs in real-world, cross-cultural applications. This work contributes to AI fairness research by highlighting the importance of inclusive evaluation frameworks and systematic approaches for identifying and mitigating geographic biases in language models.
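
The abstract does not define how FAZE converts forced-choice answers into its 10-point score, so the snippet below shows one plausible reduction, labeled as an assumption: the fraction of neutral-scenario prompts in which a model favors a specific region, rescaled to 0-10. The responses and scaling are hypothetical.

    # Hypothetical forced-choice outcomes: True means the model picked a specific region
    # even though the scenario gave no region-relevant information.
    responses = [True, True, False, True, False, True, True, False, True, True]

    def bias_score(favored_flags, scale=10):
        """Fraction of region-favoring answers, rescaled to a 0-to-scale bias score."""
        return scale * sum(favored_flags) / len(favored_flags)

    print(f"regional bias score: {bias_score(responses):.1f} / 10")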

--------------------------------------------------------------------------------------------------------

Cosmos Policy: Fine-Tuning Video Models for Visuomotor Control and Planning

Recent video generation models capture complex physical interactions and temporal dynamics, offering rich spatiotemporal priors for robotics. While previous approaches adapted video models for robot policies through complex multi-stage training and specialized architectures, Cosmos Policy simplifies this through single-stage post-training on robot demonstrations without architectural modifications. By encoding robot actions, future states, and expected rewards as latent frames within Cosmos-Predict2's diffusion process, the approach leverages pretrained priors and standard next-token prediction. Cosmos Policy achieves state-of-the-art performance on LIBERO (98.5%) and RoboCasa (67.1%) benchmarks and excels in real-world bimanual manipulation. Additionally, it enables test-time planning and learning from experience to refine world models, demonstrating a scalable paradigm for robot policy learning.

Authors:  Moo Jin Kim, Yihuai Gao, Tsung-Yi Lin, Yen-Chen Lin, Yunhao Ge, Grace Lam, Percy Liang, Shuran Song, Ming-Yu Liu, Chelsea Finn, Jinwei Gu

Link:  https://arxiv.org/abs/2601.16163v1

Date: 2026-01-d

Summary:

Recent video generation models demonstrate remarkable ability to capture complex physical interactions and scene evolution over time. To leverage their spatiotemporal priors, robotics works have adapted video models for policy learning but introduce complexity by requiring multiple stages of post-training and new architectural components for action generation. In this work, we introduce Cosmos Policy, a simple approach for adapting a large pretrained video model (Cosmos-Predict2) into an effective robot policy through a single stage of post-training on the robot demonstration data collected on the target platform, with no architectural modifications. Cosmos Policy learns to directly generate robot actions encoded as latent frames within the video model's latent diffusion process, harnessing the model's pretrained priors and core learning algorithm to capture complex action distributions. Additionally, Cosmos Policy generates future state images and values (expected cumulative rewards), which are similarly encoded as latent frames, enabling test-time planning of action trajectories with higher likelihood of success. In our evaluations, Cosmos Policy achieves state-of-the-art performance on the LIBERO and RoboCasa simulation benchmarks (98.5% and 67.1% average success rates, respectively) and the highest average score in challenging real-world bimanual manipulation tasks, outperforming strong diffusion policies trained from scratch, video model-based policies, and state-of-the-art vision-language-action models fine-tuned on the same robot demonstrations. Furthermore, given policy rollout data, Cosmos Policy can learn from experience to refine its world model and value function and leverage model-based planning to achieve even higher success rates in challenging tasks. We release code, models, and training data at https://research.nvidia.com/labs/dir/cosmos-policy/
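
The key packaging trick, treating actions, future states, and values as extra latent "frames" appended to the video latent sequence so a single diffusion backbone models them all, can be illustrated with array shapes alone. The latent dimensions, frame counts, and toy projection below are hypothetical placeholders, not the Cosmos-Predict2 interface.

    import numpy as np

    rng = np.random.default_rng(0)

    T, D = 8, 64                                   # observed video frames, latent dim per frame
    video_latents = rng.normal(size=(T, D))        # from the video model's tokenizer/encoder

    def encode_as_latent_frame(vector, dim=D):
        """Embed a non-visual quantity (action chunk, value) as one latent 'frame'."""
        frame = np.zeros(dim)
        frame[: len(vector)] = vector              # toy projection; learned in practice
        return frame

    action_chunk = rng.normal(size=7)              # e.g., a 7-DoF action
    value = np.array([0.92])                       # expected cumulative reward

    sequence = np.vstack([
        video_latents,
        encode_as_latent_frame(action_chunk),      # action frame
        encode_as_latent_frame(value),             # value frame
    ])
    print(sequence.shape)                           # (T + 2, D): one sequence, one diffusion model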

--------------------------------------------------------------------------------------------------------

Pay (Cross) Attention to the Melody: Curriculum Masking for Single-Encoder Melodic Harmonization

Melodic harmonization—generating harmonic accompaniments for melodies—remains central to computational music generation. Single-encoder transformer approaches frame harmonization as masked sequence modeling, but existing training curricula produce weak attention between melody and harmony, limiting melodic cue exploitation, especially out-of-domain. The FF (full-to-full) curriculum keeps harmony tokens masked initially, then progressively unmasks entire sequences, strengthening melody-harmony interactions. Systematic evaluation across temporal quantization, conditioning methods, and melody representations shows FF consistently outperforms baselines, with particularly strong out-of-domain gains on jazz standards. The findings highlight training curriculum importance for effective melody conditioning, with quarter-note quantization and pitch-class representations proving advantageous, offering robust strategies for single-encoder harmonization systems.

Authors:  Maximos Kaliakatsos-Papakostas, Dimos Makris, Konstantinos Soiledis, Konstantinos-Theodoros Tsamis, Vassilis Katsouros, Emilios Cambouropoulos

Link:  https://arxiv.org/abs/2601.16150v1

Date: 2026-01-d

Summary:

Melodic harmonization, the task of generating harmonic accompaniments for a given melody, remains a central challenge in computational music generation. Recent single encoder transformer approaches have framed harmonization as a masked sequence modeling problem, but existing training curricula inspired by discrete diffusion often result in weak (cross) attention between melody and harmony. This leads to limited exploitation of melodic cues, particularly in out-of-domain contexts. In this work, we introduce a training curriculum, FF (full-to-full), which keeps all harmony tokens masked for several training steps before progressively unmasking entire sequences during training to strengthen melody-harmony interactions. We systematically evaluate this approach against prior curricula across multiple experimental axes, including temporal quantization (quarter vs. sixteenth note), bar-level vs. time-signature conditioning, melody representation (full range vs. pitch class), and inference-time unmasking strategies. Models are trained on the HookTheory dataset and evaluated both in-domain and on a curated collection of jazz standards, using a comprehensive set of metrics that assess chord progression structure, harmony-melody alignment, and rhythmic coherence. Results demonstrate that the proposed FF curriculum consistently outperforms baselines in nearly all metrics, with particularly strong gains in out-of-domain evaluations where harmonic adaptability to novel melodic cues is crucial. We further find that quarter-note quantization, intertwining of bar tokens, and pitch-class melody representations are advantageous in the FF setting. Our findings highlight the importance of training curricula in enabling effective melody conditioning and suggest that full-to-full unmasking offers a robust strategy for single encoder harmonization.
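
A full-to-full curriculum of the kind described (all harmony tokens masked early in training, with masking then relaxed toward fully visible sequences) can be sketched as a masking schedule. The schedule shape, warm-up fraction, and per-token unmasking granularity below are assumptions for illustration, not the authors' exact configuration.

    import numpy as np

    rng = np.random.default_rng(0)

    def ff_mask(n_harmony_tokens, step, total_steps, warmup_frac=0.3):
        """Full-to-full curriculum: all harmony tokens masked during warm-up,
        then the visible fraction grows linearly toward fully unmasked sequences."""
        progress = step / total_steps
        if progress < warmup_frac:
            keep_prob = 0.0                                   # fully masked sequences
        else:
            keep_prob = (progress - warmup_frac) / (1.0 - warmup_frac)
        return rng.random(n_harmony_tokens) < keep_prob        # True = token visible to the model

    for step in (0, 300, 600, 900):
        visible = ff_mask(16, step, total_steps=1000)
        print(f"step {step:4d}: {visible.sum():2d}/16 harmony tokens visible")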

--------------------------------------------------------------------------------------------------------

On the Intrinsic Dimensions of Data in Kernel Learning

The manifold hypothesis suggests machine learning generalizes better when input data lies on low-dimensional manifolds. This work investigates two intrinsic dimension notions in Kernel Ridge Regression: the upper Minkowski dimension (based on kernel-induced metrics) and the effective dimension (from Kolmogorov n-width decay rates). The study analyzes relationships between n-widths and integral operator eigenvalues, showing n-widths characterize worst-case eigenvalue decay across probability measures. This enables excess error bounds of O(n^{-(2+d_K)/(2+2d_K)+ε}) for large training sets. An algorithm estimates n-width upper bounds from finite samples, requiring O(ε^{-d_ρ}log(1/ε)) samples for near-uniform distributions. Results on fractal sets reveal the Laplace kernel's effective dimension can be significantly smaller than Minkowski dimension, informing kernel selection and generalization understanding.

Authors:  Rustem Takhanov

Link:  https://arxiv.org/abs/2601.16139v1

Date: 2026-01-d

Summary:

The manifold hypothesis suggests that the generalization performance of machine learning methods improves significantly when the intrinsic dimension of the input distribution's support is low. In the context of kernel ridge regression (KRR), we investigate two alternative notions of intrinsic dimension. The first, denoted $d_\rho$, is the upper Minkowski dimension defined with respect to the canonical metric induced by a kernel function $K$ on a domain $\Omega$. The second, denoted $d_K$, is the effective dimension, derived from the decay rate of Kolmogorov $n$-widths associated with $K$ on $\Omega$. Given a probability measure $\mu$ on $\Omega$, we analyze the relationship between these $n$-widths and eigenvalues of the integral operator $\varphi \to \int_\Omega K(\cdot,x)\varphi(x)\,d\mu(x)$. We show that, for a fixed domain $\Omega$, the Kolmogorov $n$-widths characterize the worst-case eigenvalue decay across all probability measures $\mu$ supported on $\Omega$. These eigenvalues are central to understanding the generalization behavior of constrained KRR, enabling us to derive an excess error bound of order $O(n^{-\frac{2+d_K}{2+2d_K} + \varepsilon})$ for any $\varepsilon > 0$, when the training set size $n$ is large. We also propose an algorithm that estimates upper bounds on the $n$-widths using only a finite sample from $\mu$. For distributions close to uniform, we prove that $\varepsilon$-accurate upper bounds on all $n$-widths can be computed with high probability using at most $O\left(\varepsilon^{-d_\rho}\log\frac{1}{\varepsilon}\right)$ samples, with fewer required for small $n$. Finally, we compute the effective dimension $d_K$ for various fractal sets and present additional numerical experiments. Our results show that, for kernels such as the Laplace kernel, the effective dimension $d_K$ can be significantly smaller than the Minkowski dimension $d_\rho$, even though $d_K = d_\rho$ provably holds on regular domains.
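
As a rough empirical counterpart to the operator eigenvalues discussed above, one can inspect how fast the eigenvalues of a kernel Gram matrix on samples from the data distribution decay; that decay rate is the quantity the n-width analysis controls. The Laplace kernel, the toy sample set, and the normalization below are illustrative assumptions, not the paper's estimator.

    import numpy as np

    rng = np.random.default_rng(0)

    # Samples from a toy distribution supported on a curve in R^3,
    # so the intrinsic dimension is much smaller than the ambient one.
    t = rng.uniform(0, 1, 400)
    X = np.stack([np.cos(2 * np.pi * t), np.sin(2 * np.pi * t), t], axis=1)

    # Laplace kernel Gram matrix K_ij = exp(-||x_i - x_j||)
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    K = np.exp(-D)

    # Eigenvalues of K / n approximate the integral operator's eigenvalues under the sampling measure.
    eigs = np.sort(np.linalg.eigvalsh(K / len(X)))[::-1]
    for n in (1, 5, 10, 20, 40):
        print(f"eigenvalue {n:2d}: {eigs[n - 1]:.3e}")   # fast decay suggests a small effective dimension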

--------------------------------------------------------------------------------------------------------

SAMTok: Representing Any Mask with Two Words

Pixel-wise capabilities are essential for interactive intelligent systems, yet scaling pixel-wise multimodal large language models remains challenging due to complex encoders, specialized decoders, and incompatible training objectives. SAMTok converts any region mask into two discrete tokens, treating masks as language tokens that enable base models to learn pixel-wise capabilities through standard next-token prediction and reinforcement learning without architectural modifications. Built on SAM2 and trained on 209M masks using mask encoders and vector quantizers, SAMTok produces compact, information-rich tokens. With 5M training samples, QwenVL-SAMTok achieves state-of-the-art results on region captioning, visual question answering, referring segmentation, and interactive segmentation. Textual answer-matching rewards enable efficient reinforcement learning, demonstrating a scalable paradigm for equipping models with pixel-wise understanding.

Authors:  Yikang Zhou, Tao Zhang, Dengxian Gong, Yuanzheng Wu, Ye Tian, Haochen Wang, Haobo Yuan, Jiacong Wang, Lu Qi, Hao Fei, Anran Wang, Zhuochen Wang, Yujing Wang, Cheng Chen, Shunping Ji, Xiangtai Li

Link:  https://arxiv.org/abs/2601.16093v1

Date: 2026-01-d

Summary:

Pixel-wise capabilities are essential for building interactive intelligent systems. However, pixel-wise multi-modal LLMs (MLLMs) remain difficult to scale due to complex region-level encoders, specialized segmentation decoders, and incompatible training objectives. To address these challenges, we present SAMTok, a discrete mask tokenizer that converts any region mask into two special tokens and reconstructs the mask using these tokens with high fidelity. By treating masks as new language tokens, SAMTok enables base MLLMs (such as the QwenVL series) to learn pixel-wise capabilities through standard next-token prediction and simple reinforcement learning, without architectural modifications and specialized loss design. SAMTok builds on SAM2 and is trained on 209M diverse masks using a mask encoder and residual vector quantizer to produce discrete, compact, and information-rich tokens. With 5M SAMTok-formatted mask understanding and generation data samples, QwenVL-SAMTok attains state-of-the-art or comparable results on region captioning, region VQA, grounded conversation, referring segmentation, scene graph parsing, and multi-round interactive segmentation. We further introduce a textual answer-matching reward that enables efficient reinforcement learning for mask generation, delivering substantial improvements on GRES and GCG benchmarks. Our results demonstrate a scalable and straightforward paradigm for equipping MLLMs with strong pixel-wise capabilities. Our code and models are available.
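
Turning a continuous mask embedding into exactly two discrete tokens is, in spirit, a two-stage residual vector quantization: the first codebook quantizes the embedding and the second quantizes the residual. The tiny random codebooks and nearest-neighbor lookup below are hypothetical and only illustrate that mechanism; SAMTok's actual encoder, codebook sizes, and training are not reproduced.

    import numpy as np

    rng = np.random.default_rng(0)

    dim, codebook_size = 8, 256
    codebooks = [rng.normal(size=(codebook_size, dim)) for _ in range(2)]   # two quantizer stages

    def rvq_encode(embedding, codebooks):
        """Two-stage residual VQ: each stage emits one discrete token id."""
        tokens, residual = [], embedding
        for cb in codebooks:
            idx = int(np.argmin(np.linalg.norm(cb - residual, axis=1)))     # nearest code
            tokens.append(idx)
            residual = residual - cb[idx]
        return tokens

    def rvq_decode(tokens, codebooks):
        return sum(cb[idx] for idx, cb in zip(tokens, codebooks))

    mask_embedding = rng.normal(size=dim)           # stand-in for a mask encoder's output
    tokens = rvq_encode(mask_embedding, codebooks)
    recon = rvq_decode(tokens, codebooks)
    print(tokens, f"reconstruction error = {np.linalg.norm(mask_embedding - recon):.3f}")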

--------------------------------------------------------------------------------------------------------

Artificial Rigidities vs. Biological Noise: A Comparative Analysis of Multisensory Integration in AV-HuBERT and Human Observers

Understanding how AI models process multisensory information compared to humans reveals insights into their biological fidelity. This study benchmarks AV-HuBERT against 44 human observers using incongruent audiovisual stimuli (McGurk effect), revealing striking quantitative isomorphism: both exhibited nearly identical auditory dominance rates (32.0% vs. 31.8%), suggesting the model captures biological thresholds. However, AV-HuBERT showed deterministic bias toward phonetic fusion (68.0%), significantly exceeding human rates (47.7%). While humans displayed perceptual variability and diverse error profiles, the model remained strictly categorical. These findings suggest self-supervised architectures mimic multisensory outcomes but lack neural variability inherent to human speech perception, informing development of more biologically realistic speech processing systems.

Authors:  Francisco Portillo López

Link:  https://arxiv.org/abs/2601.15869v1

Date: 2026-01-d

Summary:

This study evaluates AV-HuBERT's perceptual bio-fidelity by benchmarking its response to incongruent audiovisual stimuli (McGurk effect) against human observers (N=44). Results reveal a striking quantitative isomorphism: AI and humans exhibited nearly identical auditory dominance rates (32.0% vs. 31.8%), suggesting the model captures biological thresholds for auditory resistance. However, AV-HuBERT showed a deterministic bias toward phonetic fusion (68.0%), significantly exceeding human rates (47.7%). While humans displayed perceptual stochasticity and diverse error profiles, the model remained strictly categorical. Findings suggest that current self-supervised architectures mimic multisensory outcomes but lack the neural variability inherent to human speech perception.

--------------------------------------------------------------------------------------------------------

RF Intelligence for Health: Classification of SmartBAN Signals in overcrowded ISM band

Accurate radio-frequency signal classification is essential for reliable wearable health-monitoring systems, enabling awareness of interference conditions affecting medical protocols. In the crowded 2.4 GHz ISM band, identifying low-power medical sensor transmissions is challenging due to co-channel interference and power asymmetry with coexisting technologies. This work introduces the first open-source framework for automatic SmartBAN signal recognition in Body Area Networks, combining synthetic simulated datasets with real software-defined radio acquisitions. Deep convolutional networks using ResNet encoders and attention-enhanced U-Net decoders achieve over 90% accuracy on synthetic data with consistent real-world performance. By enabling reliable SmartBAN recognition in dense spectral environments, this framework supports interference-aware coexistence strategies, improving dependability of wearable healthcare systems critical for patient monitoring.

Authors:  Nicola Gallucci, Giacomo Aragnetti, Matteo Malagrinò, Francesco Linsalata, Maurizio Magarini, Lorenzo Mucchi

Link:  https://arxiv.org/abs/2601.15836v1

Date: 2026-01-d

Summary:

Accurate classification of Radio-Frequency (RF) signals is essential for reliable wearable health-monitoring systems, providing awareness of the interference conditions in which medical protocols operate. In the overcrowded 2.4 GHz ISM band, however, identifying low-power transmissions from medical sensors is challenging due to strong co-channel interference and substantial power asymmetry with coexisting technologies. This work introduces the first open-source framework for automatic recognition of SmartBAN signals in Body Area Networks (BANs). The framework combines a synthetic dataset of simulated signals with real RF acquisitions obtained through Software-Defined Radios (SDRs), enabling both controlled and realistic evaluation. Deep convolutional neural networks based on ResNet encoders and U-Net decoders with attention mechanisms are trained and assessed across diverse propagation conditions. The proposed approach achieves over 90% accuracy on synthetic datasets and demonstrates consistent performance on real over-the-air spectrograms. By enabling reliable SmartBAN signal recognition in dense spectral environments, this framework supports interference-aware coexistence strategies and improves the dependability of wearable healthcare systems.
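
For orientation, a spectrogram-in, per-pixel-logits-out encoder-decoder can be sketched in a few lines of PyTorch. The toy network below is a deliberately simplified stand-in; the paper's ResNet encoders and attention-gated U-Net decoders are not reproduced, and the input shape is an assumption.

    import torch
    import torch.nn as nn

    class TinySpectrogramSegmenter(nn.Module):
        """Minimal encoder-decoder over a 1-channel spectrogram, emitting a per-pixel
        SmartBAN-vs-other logit map (a stand-in, not the paper's architecture)."""
        def __init__(self):
            super().__init__()
            self.encoder = nn.Sequential(
                nn.Conv2d(1, 16, 3, stride=2, padding=1), nn.ReLU(),
                nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(),
            )
            self.decoder = nn.Sequential(
                nn.ConvTranspose2d(32, 16, 2, stride=2), nn.ReLU(),
                nn.ConvTranspose2d(16, 1, 2, stride=2),           # per-pixel logits
            )

        def forward(self, x):
            return self.decoder(self.encoder(x))

    spectrogram = torch.randn(4, 1, 128, 128)       # batch of spectrogram patches
    logits = TinySpectrogramSegmenter()(spectrogram)
    print(logits.shape)                              # torch.Size([4, 1, 128, 128])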

--------------------------------------------------------------------------------------------------------

Inference-Time Scaling of Verification: Self-Evolving Deep Research Agents via Test-Time Rubric-Guided Verification

Deep Research Agents automate knowledge discovery and problem-solving, with most efforts focusing on post-training policy improvements. This work proposes an alternative: self-evolving agents through iterative output verification guided by crafted rubrics, enabling inference-time scaling. DeepVerifier, a rubrics-based outcome reward verifier derived from an automatically constructed DRA Failure Taxonomy (five major categories, thirteen sub-categories), outperforms baseline judges by 12%-48% in meta-evaluation F1 scores. Integrated as a plug-and-play test-time module, DeepVerifier produces detailed rubric-based feedback, enabling iterative bootstrapping that delivers 8%-11% accuracy gains on GAIA and XBench-DeepResearch subsets. The released DeepVerifier-4K dataset provides 4,646 high-quality verification examples focused on reflection and self-critique, supporting open-source advancement in robust agent verification capabilities.

Authors:  Yuxuan Wan, Tianqing Fang, Zaitang Li, Yintong Huo, Wenxuan Wang, Haitao Mi, Dong Yu, Michael R. Lyu

Link:  https://arxiv.org/abs/2601.15808v1

Date: 2026-01-d

Summary:

Recent advances in Deep Research Agents (DRAs) are transforming automated knowledge discovery and problem-solving. While the majority of existing efforts focus on enhancing policy capabilities via post-training, we propose an alternative paradigm: self-evolving the agent's ability by iteratively verifying the policy model's outputs, guided by meticulously crafted rubrics. This approach gives rise to the inference-time scaling of verification, wherein an agent self-improves by evaluating its generated answers to produce iterative feedback and refinements. We derive the rubrics based on an automatically constructed DRA Failure Taxonomy, which systematically classifies agent failures into five major categories and thirteen sub-categories. We present DeepVerifier, a rubrics-based outcome reward verifier that leverages the asymmetry of verification and outperforms vanilla agent-as-judge and LLM judge baselines by 12%-48% in meta-evaluation F1 score. To enable practical self-evolution, DeepVerifier integrates as a plug-and-play module during test-time inference. The verifier produces detailed rubric-based feedback, which is fed back to the agent for iterative bootstrapping, refining responses without additional training. This test-time scaling delivers 8%-11% accuracy gains on challenging subsets of GAIA and XBench-DeepResearch when powered by capable closed-source LLMs. Finally, to support open-source advancement, we release DeepVerifier-4K, a curated supervised fine-tuning dataset of 4,646 high-quality agent steps focused on DRA verification. These examples emphasize reflection and self-critique, enabling open models to develop robust verification capabilities.
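
The plug-and-play test-time loop described (generate, verify against rubrics, feed the rubric feedback back, retry) can be sketched independently of any particular model. The rubric list, the stub generate / verify functions, and the retry budget below are hypothetical placeholders, not DeepVerifier itself.

    RUBRICS = [
        "answer cites at least one retrieved source",
        "answer directly addresses every sub-question",
        "no unsupported numerical claims",
    ]

    def generate(task, feedback=None):                  # stand-in for the research agent
        return f"draft for {task!r}" + (" (revised)" if feedback else "")

    def verify(answer, rubrics):                        # stand-in for a rubric-based verifier
        failed = [r for r in rubrics if "revised" not in answer]   # toy pass/fail rule
        return failed

    def self_evolve(task, max_rounds=3):
        """Iteratively regenerate until the verifier reports no failed rubrics."""
        feedback = None
        for round_id in range(max_rounds):
            answer = generate(task, feedback)
            failed = verify(answer, RUBRICS)
            if not failed:
                return answer, round_id
            feedback = "fix: " + "; ".join(failed)      # rubric-grounded feedback
        return answer, max_rounds

    print(self_evolve("survey recent IRS fading results"))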

--------------------------------------------------------------------------------------------------------

HumanLLM: Towards Personalized Understanding and Simulation of Human Nature

Large language models excel at objective tasks but struggle with nuanced understanding of human behavior, limiting social simulation and personalized applications. This limitation stems from pretraining on vast uncontextualized web data that fails to capture individuals' continuous, situated decision contexts. HumanLLM addresses this through a foundation model designed for personalized understanding and simulation. The Cognitive Genome Dataset, curated from Reddit, Twitter, Blogger, and Amazon through rigorous filtering and synthesis, contains over 5.5 million user logs distilling profiles, behaviors, and thinking patterns. Supervised fine-tuning on diverse learning tasks enables prediction of individualized behaviors, thoughts, and experiences. HumanLLM achieves superior performance predicting user actions and inner thoughts, mimicking writing styles, and generating authentic profiles, with significant gains on out-of-domain social intelligence benchmarks.

Authors:  Yuxuan Lei, Tianfu Wang, Jianxun Lian, Zhengyu Hu, Defu Lian, Xing Xie

Link:  https://arxiv.org/abs/2601.15793v1

Date: 2026-01-d

Summary:

Motivated by the remarkable progress of large language models (LLMs) in objective tasks like mathematics and coding, there is growing interest in their potential to simulate human behavior--a capability with profound implications for transforming social science research and customer-centric business insights. However, LLMs often lack a nuanced understanding of human cognition and behavior, limiting their effectiveness in social simulation and personalized applications. We posit that this limitation stems from a fundamental misalignment: standard LLM pretraining on vast, uncontextualized web data does not capture the continuous, situated context of an individual's decisions, thoughts, and behaviors over time. To bridge this gap, we introduce HumanLLM, a foundation model designed for personalized understanding and simulation of individuals. We first construct the Cognitive Genome Dataset, a large-scale corpus curated from real-world user data on platforms like Reddit, Twitter, Blogger, and Amazon. Through a rigorous, multi-stage pipeline involving data filtering, synthesis, and quality control, we automatically extract over 5.5 million user logs to distill rich profiles, behaviors, and thinking patterns. We then formulate diverse learning tasks and perform supervised fine-tuning to empower the model to predict a wide range of individualized human behaviors, thoughts, and experiences. Comprehensive evaluations demonstrate that HumanLLM achieves superior performance in predicting user actions and inner thoughts, more accurately mimics user writing styles and preferences, and generates more authentic user profiles compared to base models. Furthermore, HumanLLM shows significant gains on out-of-domain social intelligence benchmarks, indicating enhanced generalization.

--------------------------------------------------------------------------------------------------------

From Passive Metric to Active Signal: The Evolving Role of Uncertainty Quantification in Large Language Models

Large language models demonstrate remarkable capabilities, yet unreliability remains a critical deployment barrier in high-stakes domains. This survey charts uncertainty's functional evolution from passive diagnostic metric to active control signal guiding real-time model behavior. Uncertainty serves as active control across three frontiers: in advanced reasoning to optimize computation and trigger self-correction; in autonomous agents to govern metacognitive decisions about tool use and information seeking; and in reinforcement learning to mitigate reward hacking and enable self-improvement through intrinsic rewards. Grounded in theoretical frameworks like Bayesian methods and Conformal Prediction, the survey provides comprehensive analysis, critical perspectives, and practical design patterns, arguing that mastering uncertainty is essential for building scalable, reliable, and trustworthy next-generation AI systems.

Authors:  Jiaxin Zhang, Wendi Cui, Zhuohang Li, Lifu Huang, Bradley Malin, Caiming Xiong, Chien-Sheng Wu

Link:  https://arxiv.org/abs/2601.15690v1

Date: 2026-01-d

Summary:

While Large Language Models (LLMs) show remarkable capabilities, their unreliability remains a critical barrier to deployment in high-stakes domains. This survey charts a functional evolution in addressing this challenge: the evolution of uncertainty from a passive diagnostic metric to an active control signal guiding real-time model behavior. We demonstrate how uncertainty is leveraged as an active control signal across three frontiers: in advanced reasoning to optimize computation and trigger self-correction; in autonomous agents to govern metacognitive decisions about tool use and information seeking; and in reinforcement learning to mitigate reward hacking and enable self-improvement via intrinsic rewards. By grounding these advancements in emerging theoretical frameworks like Bayesian methods and Conformal Prediction, we provide a unified perspective on this transformative trend. This survey provides a comprehensive overview, critical analysis, and practical design patterns, arguing that mastering the new trend of uncertainty is essential for building the next generation of scalable, reliable, and trustworthy AI.
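
One concrete instance of using uncertainty as an active control signal is thresholding predictive entropy to decide whether to answer immediately or to trigger extra work such as self-correction or a tool call. The toy probability vectors and the threshold below are illustrative assumptions, not a specific method from the surveyed literature.

    import numpy as np

    def entropy(p):
        p = np.asarray(p, dtype=float)
        return float(-(p * np.log(p + 1e-12)).sum())

    def route(answer_probs, threshold=0.7):
        """Low entropy: answer now; high entropy: spend more compute first."""
        h = entropy(answer_probs)
        return ("answer", h) if h < threshold else ("self-correct / call tool", h)

    print(route([0.95, 0.03, 0.02]))   # confident: act immediately
    print(route([0.40, 0.35, 0.25]))   # uncertain: trigger extra verification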

--------------------------------------------------------------------------------------------------------

Agentic AI Governance and Lifecycle Management in Healthcare

Healthcare organizations increasingly embed agentic AI into workflows for clinical documentation and early-warning monitoring, but face agent sprawl causing duplicated agents, unclear accountability, inconsistent controls, and persisting tool permissions. Existing AI governance frameworks emphasize lifecycle risk management but provide limited operational guidance for agent fleets. The Unified Agent Lifecycle Management (UALM) blueprint, synthesized from governance standards, agent security literature, and healthcare compliance requirements, maps gaps onto five control-plane layers: identity registries, orchestration, PHI-bounded context, runtime policy enforcement with kill-switches, and lifecycle management with credential revocation. A companion maturity model supports staged adoption, offering healthcare CIOs, CISOs, and clinical leaders implementable patterns for audit-ready oversight that preserves innovation while enabling safer scaling across clinical and administrative domains.

Authors:  Chandra Prakash, Mary Lind, Avneesh Sisodia

Link:  https://arxiv.org/abs/2601.15630v1

Date: 2026-01-d

Summary:

Healthcare organizations are beginning to embed agentic AI into routine workflows, including clinical documentation support and early-warning monitoring. As these capabilities diffuse across departments and vendors, health systems face agent sprawl, causing duplicated agents, unclear accountability, inconsistent controls, and tool permissions that persist beyond the original use case. Existing AI governance frameworks emphasize lifecycle risk management but provide limited guidance for the day-to-day operations of agent fleets. We propose a Unified Agent Lifecycle Management (UALM) blueprint derived from a rapid, practice-oriented synthesis of governance standards, agent security literature, and healthcare compliance requirements. UALM maps recurring gaps onto five control-plane layers: (1) an identity and persona registry, (2) orchestration and cross-domain mediation, (3) PHI-bounded context and memory, (4) runtime policy enforcement with kill-switch triggers, and (5) lifecycle management and decommissioning linked to credential revocation and audit logging. A companion maturity model supports staged adoption. UALM offers healthcare CIOs, CISOs, and clinical leaders an implementable pattern for audit-ready oversight that preserves local innovation and enables safer scaling across clinical and administrative domains.
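
A minimal way to picture the control-plane layers is a per-agent registry record that ties identity, PHI scope, tool permissions, a runtime kill-switch, and decommissioning into one auditable object. The dataclass and field names below are a hypothetical sketch of such a record, not the UALM specification.

    from dataclasses import dataclass, field
    from datetime import datetime, timezone
    from typing import List, Optional, Tuple

    @dataclass
    class AgentRecord:
        agent_id: str
        owner: str                                                 # accountable clinical/IT owner
        phi_scope: List[str] = field(default_factory=list)         # PHI categories the agent may touch
        tool_permissions: List[str] = field(default_factory=list)
        kill_switch: bool = False                                  # runtime policy enforcement
        decommissioned_at: Optional[datetime] = None
        audit_log: List[Tuple[str, datetime]] = field(default_factory=list)

        def decommission(self):
            """Lifecycle layer: revoke credentials, flip the kill-switch, stamp the audit trail."""
            self.tool_permissions.clear()
            self.kill_switch = True
            self.decommissioned_at = datetime.now(timezone.utc)
            self.audit_log.append(("decommissioned", self.decommissioned_at))

    scribe = AgentRecord("doc-scribe-01", owner="cmio@example.org",
                         phi_scope=["clinical-notes"], tool_permissions=["ehr.read"])
    scribe.decommission()
    print(scribe.kill_switch, scribe.tool_permissions, scribe.decommissioned_at is not None)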

--------------------------------------------------------------------------------------------------------

Robust Tool Use via Fission-GRPO: Learning to Recover from Execution Errors

Large language models effectively call tools but remain brittle in multi-turn execution: smaller models often degenerate into repetitive invalid re-invocations after errors, failing to interpret feedback and self-correct. This brittleness hinders reliable deployment where execution errors are inevitable. Standard reinforcement learning treats errors as sparse negative rewards without recovery guidance, while synthetic error-correction datasets suffer distribution mismatch with on-policy error modes. Fission-GRPO converts execution errors into corrective supervision within the RL loop, fissioning failed trajectories into new training instances augmented with diagnostic feedback from a finetuned Error Simulator, then resampling recovery rollouts on-policy. On BFCL v4 Multi-Turn, Fission-GRPO improves Qwen3-8B's error recovery rate by 5.7% absolute, yielding 4% overall accuracy gains over GRPO and outperforming specialized tool-use agents.

Authors:  Zhiwei Zhang, Fei Zhao, Rui Wang, Zezhong Wang, Bin Liang, Jiakang Wang, Yao Hu, Shaosheng Cao, Kam-Fai Wong

Link:  https://arxiv.org/abs/2601.15625v1

Date: 2026-01-d

Summary:

Large language models (LLMs) can call tools effectively, yet they remain brittle in multi-turn execution: following a tool call error, smaller models often degenerate into repetitive invalid re-invocations, failing to interpret error feedback and self-correct. This brittleness hinders reliable real-world deployment, where the execution errors are inherently inevitable during tool interaction procedures. We identify a key limitation of current approaches: standard reinforcement learning (RL) treats errors as sparse negative rewards, providing no guidance on how to recover, while pre-collected synthetic error-correction datasets suffer from distribution mismatch with the model's on-policy error modes. To bridge this gap, we propose Fission-GRPO, a framework that converts execution errors into corrective supervision within the RL training loop. Our core mechanism fissions each failed trajectory into a new training instance by augmenting it with diagnostic feedback from a finetuned Error Simulator, then resampling recovery rollouts on-policy. This enables the model to learn from the precise errors it makes during exploration, rather than from static, pre-collected error cases. On the BFCL v4 Multi-Turn, Fission-GRPO improves the error recovery rate of Qwen3-8B by 5.7% absolute, crucially, yielding a 4% overall accuracy gain (42.75% to 46.75%) over GRPO and outperforming specialized tool-use agents.
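
The fission mechanism (take a failed rollout, attach diagnostic feedback about the execution error, and turn it into a fresh training instance from which recovery rollouts are resampled on-policy) can be sketched as plain data plumbing. The trajectory format, the error-simulator stub, and the field names below are hypothetical, not the paper's implementation.

    def simulate_error_feedback(failed_call):          # stand-in for the finetuned Error Simulator
        return (f"tool '{failed_call['tool']}' rejected arguments "
                f"{failed_call['args']}: missing field 'date'")

    def fission(trajectory):
        """Split a failed trajectory at its first execution error and emit a new
        training instance whose context includes diagnostic feedback."""
        for step, action in enumerate(trajectory["actions"]):
            if action.get("error"):
                return {
                    "prompt": trajectory["prompt"],
                    "prefix": trajectory["actions"][: step + 1],    # history up to the failure
                    "diagnostic": simulate_error_feedback(action),  # what went wrong and why
                    "task": "resample a recovery rollout from here (on-policy)",
                }
        return None                                                  # nothing to fission

    failed = {
        "prompt": "book a flight",
        "actions": [{"tool": "search_flights", "args": {"from": "SFO"}, "error": True}],
    }
    print(fission(failed)["diagnostic"])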

--------------------------------------------------------------------------------------------------------

From Generative Engines to Actionable Simulators: The Imperative of Physical Grounding in World Models

World models simulate environment evolution under actions, enabling planning through imagined futures rather than reactive perception. However, current models suffer from visual conflation—assuming high-fidelity video generation implies understanding physical and causal dynamics. While modern models excel at predicting pixels, they frequently violate invariant constraints, fail under intervention, and break down in safety-critical decision-making. This survey argues visual realism is an unreliable proxy for world understanding. Effective world models must encode causal structure, respect domain-specific constraints, and remain stable over long horizons. The work reframes world models as actionable simulators emphasizing structured 4D interfaces, constraint-aware dynamics, and closed-loop evaluation. Medical decision-making serves as an epistemic stress test, demonstrating value depends on counterfactual reasoning and robust long-horizon foresight capabilities.

Authors:  Zhikang Chen, Tingting Zhu

Link:  https://arxiv.org/abs/2601.15533v1

Date: 2026-01-d

Summary:

A world model is an AI system that simulates how an environment evolves under actions, enabling planning through imagined futures rather than reactive perception. Current world models, however, suffer from visual conflation: the mistaken assumption that high-fidelity video generation implies an understanding of physical and causal dynamics. We show that while modern models excel at predicting pixels, they frequently violate invariant constraints, fail under intervention, and break down in safety-critical decision-making. This survey argues that visual realism is an unreliable proxy for world understanding. Instead, effective world models must encode causal structure, respect domain-specific constraints, and remain stable over long horizons. We propose a reframing of world models as actionable simulators rather than visual engines, emphasizing structured 4D interfaces, constraint-aware dynamics, and closed-loop evaluation. Using medical decision-making as an epistemic stress test, where trial-and-error is impossible and errors are irreversible, we demonstrate that a world model's value is determined not by how realistic its rollouts appear, but by its ability to support counterfactual reasoning, intervention planning, and robust long-horizon foresight.

--------------------------------------------------------------------------------------------------------

Solar twins in Gaia DR3 GSP-Spec I. Building a large catalog of Solar twins with ages

Solar twins—stars with stellar parameters nearly identical to the Sun—offer unique opportunities for high-precision Galactic archaeology. However, previous catalogs typically contain only tens of objects with poorly characterized selection functions. This work builds a large Solar-twin catalog from Gaia DR3 GSP-Spec, providing model-driven stellar parameters including ages with well-characterized selection. From candidates within ±200K in temperature, ±0.2 in log g, and ±0.1 dex in metallicity of Solar values, the final catalog contains 6,594 stars. Ages determined using Bayesian isochrone-projection methods with different parameter combinations are validated through a mock catalog of 75,588 artificial twins. Demonstrating catalog utility, the study statistically confirms age–chemical abundance relations for several species, showing trends from small high-precision samples persist in larger independent samples, bridging precision and demographic studies.

Authors:  Daisuke Taniguchi, Patrick de Laverny, Alejandra Recio-Blanco, Takuji Tsujimoto, Pedro A. Palicio

Link:  https://arxiv.org/abs/2601.15387v1

Date: 2026-01-d

Summary:

[Abbreviated] Context. Solar twins, stars whose stellar parameters (Teff, log g, and [M/H]) are very close to the Solar ones, offer a unique opportunity to investigate Galactic archaeology with very high accuracy and precision. However, most previous catalogs of Solar twins contain only a small number of objects (typically a few tens), and their selection functions are poorly characterized. Aims. We aim at building a large catalog of Solar twins from Gaia DR3 GSP-Spec, providing model-driven, rather than data-driven, stellar parameters including ages, together with a well-characterized selection function. Methods. Using stellar parameters from the Gaia DR3 GSP-Spec catalog, we selected Solar-twin candidates whose parameters lie within ±200 K in Teff, ±0.2 in log g, and ±0.1 dex in [M/H] of the Solar values. Candidates unlikely to be genuine Solar twins were removed using Gaia flags and photometric constraints. We determined accurate ages for individual twins with a Bayesian isochrone-projection method, considering three combinations of parameters: Teff, [M/H], and either log g, M_G, or M_Ks. We also constructed a mock catalog to characterize the selection function. Results. Our final GSP-Spec Solar-twin catalog contains 6,594 stars. The mock catalog consisting of 75,588 artificial twins well reproduces the main characteristics of the observed catalog, especially for ages determined with M_G or M_Ks. To demonstrate the usefulness of our catalog, we compared chemical abundances [X/Fe] with age. We statistically confirmed the age–[X/Fe] relations for several species (e.g., Al, Si, Ca, and Y), demonstrating that trends previously identified in small but very high-precision samples persist in a much larger, independent sample. Conclusions. Our study bridges small high-precision Solar-twin samples and large data-driven ones, enabling demographic studies of Solar twins.
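
The selection step itself is a simple box cut in (Teff, log g, [M/H]) around the Solar values; the snippet below applies that cut to a toy table. The toy rows are placeholders, and the paper's additional quality flags and photometric cleaning are not reproduced.

    import numpy as np

    # Toy GSP-Spec-like table: columns are (Teff [K], log g, [M/H] [dex])
    stars = np.array([
        [5772.0, 4.44,  0.00],
        [5900.0, 4.30, -0.05],
        [6100.0, 4.40,  0.02],    # too hot
        [5600.0, 4.10, -0.30],    # too metal-poor
    ])

    SUN = np.array([5772.0, 4.44, 0.00])
    WINDOW = np.array([200.0, 0.2, 0.1])            # ±200 K, ±0.2 in log g, ±0.1 dex

    is_twin = np.all(np.abs(stars - SUN) <= WINDOW, axis=1)
    print(stars[is_twin])                            # Solar-twin candidates before quality cuts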

--------------------------------------------------------------------------------------------------------


EYE ON A.I. GETS READERS UP TO DATE ON THE LATEST FUNDING NEWS AND RELATED ISSUES. SUBSCRIBE FOR THE WEEKLY NEWSLETTER.
