Week Ending 2.1.2026
RESEARCH WATCH: 2.1.2026
Regularisation in neural networks: a survey and empirical analysis of approaches
Neural networks excel at many tasks but often struggle with generalizing to new, unseen data. Regularisation techniques aim to improve this generalization ability, though practitioners often assume any regularisation will help. This comprehensive survey reviews modern regularisation methods across four categories: data-based, architecture, training, and loss function strategies. The empirical analysis reveals that regularisation effectiveness is highly dataset-dependent—for instance, regularisation terms benefited numeric datasets while batch normalisation improved image classification. These findings are crucial for practitioners selecting appropriate regularisation techniques for their specific applications, from computer vision to tabular data analysis.
Authors: Christiaan P. Opperman, Anna S. Bosman, Katherine M. Malan
Link: https://arxiv.org/abs/2601.23131v1
Date: 2026-01-d
Summary:
Despite huge successes on a wide range of tasks, neural networks are known to sometimes struggle to generalise to unseen data. Many approaches have been proposed over the years to promote the generalisation ability of neural networks, collectively known as regularisation techniques. These are used as common practice under the assumption that any regularisation added to the pipeline would result in a performance improvement. In this study, we investigate whether this assumption holds in practice. First, we provide a broad review of regularisation techniques, including modern theories such as double descent. We propose a taxonomy of methods under four broad categories, namely: (1) data-based strategies, (2) architecture strategies, (3) training strategies, and (4) loss function strategies. Notably, we highlight the contradictions and correspondences between the approaches in these broad classes. Further, we perform an empirical comparison of the various regularisation techniques on classification tasks for ten numerical and image datasets applied to the multi-layer perceptron and convolutional neural network architectures. Results show that the efficacy of regularisation is dataset-dependent. For example, the use of a regularisation term only improved performance on numeric datasets, whereas batch normalisation improved performance on image datasets only. Generalisation is crucial to machine learning; thus, understanding the effects of applying regularisation techniques, and considering the connections between them is essential to the appropriate use of these methods in practice.
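As a rough sketch of the kind of head-to-head comparison the survey performs (the layer widths, learning rate, weight-decay coefficient, and synthetic data below are hypothetical, not the paper's protocol), one can train the same MLP with no regularisation, with an L2 regularisation term via weight decay, and with batch normalisation, then compare held-out accuracy:

```python
# Minimal sketch: compare an unregularised MLP, an L2 regularisation term
# (loss-function strategy, via weight decay), and batch normalisation
# (architecture strategy). Hyperparameters and data are illustrative only.
import torch
import torch.nn as nn

def make_mlp(in_dim, n_classes, use_batchnorm=False):
    layers = [nn.Linear(in_dim, 64)]
    if use_batchnorm:
        layers.append(nn.BatchNorm1d(64))
    layers += [nn.ReLU(), nn.Linear(64, n_classes)]
    return nn.Sequential(*layers)

def train(model, X, y, weight_decay=0.0, epochs=200):
    # weight_decay > 0 adds an L2 penalty on the weights to the loss.
    opt = torch.optim.SGD(model.parameters(), lr=1e-2, weight_decay=weight_decay)
    loss_fn = nn.CrossEntropyLoss()
    for _ in range(epochs):
        opt.zero_grad()
        loss_fn(model(X), y).backward()
        opt.step()
    return model

# Hypothetical numeric dataset: 256 samples, 20 features, 3 classes.
X, y = torch.randn(256, 20), torch.randint(0, 3, (256,))
baseline = train(make_mlp(20, 3), X, y)
l2_model = train(make_mlp(20, 3), X, y, weight_decay=1e-3)
bn_model = train(make_mlp(20, 3, use_batchnorm=True), X, y)
```

Which of these variants wins on held-out data is exactly the dataset-dependent question the empirical analysis answers.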
--------------------------------------------------------------------------------------------------------
The Hot Mess of AI: How Does Misalignment Scale With Model Intelligence and Task Complexity?
As AI systems tackle increasingly complex and consequential tasks, understanding their failure modes becomes critical for safety. This research investigates whether advanced AI fails through systematic goal misalignment or incoherent, nonsensical behavior. Using bias-variance decomposition, the study introduces "incoherence" as a metric measuring failures stemming from randomness rather than systematic bias. Results show that longer reasoning chains and, paradoxically, more capable models often produce more incoherent failures. This suggests future AI systems may cause unpredictable industrial accidents rather than pursuing consistently misaligned goals, highlighting the need for alignment research targeting reward hacking and goal misspecification in advanced AI deployment.
Authors: Alexander Hägele, Aryo Pradipta Gema, Henry Sleight, Ethan Perez, Jascha Sohl-Dickstein
Link: https://arxiv.org/abs/2601.23045v1
Date: 2026-01-d
Summary:
As AI becomes more capable, we entrust it with more general and consequential tasks. The risks from failure grow more severe with increasing task scope. It is therefore important to understand how extremely capable AI models will fail: Will they fail by systematically pursuing goals we do not intend? Or will they fail by being a hot mess, and taking nonsensical actions that do not further any goal? We operationalize this question using a bias-variance decomposition of the errors made by AI models: An AI's "incoherence" on a task is measured over test-time randomness as the fraction of its error that stems from variance rather than bias in task outcome. Across all tasks and frontier models we measure, the longer models spend reasoning and taking actions, the more incoherent their failures become. Incoherence changes with model scale in a way that is experiment dependent. However, in several settings, larger, more capable models are more incoherent than smaller models. Consequently, scale alone seems unlikely to eliminate incoherence. Instead, as more capable AIs pursue harder tasks, requiring more sequential action and thought, our results predict failures to be accompanied by more incoherent behavior. This suggests a future where AIs sometimes cause industrial accidents (due to unpredictable misbehavior), but are less likely to exhibit consistent pursuit of a misaligned goal. This increases the relative importance of alignment research targeting reward hacking or goal misspecification.
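For the simplest case of a scalar task outcome under squared error, the incoherence measure can be sketched as the variance share of the bias-variance decomposition over repeated sampled attempts (the paper's operationalisation for general task outcomes may differ; the numbers below are illustrative):

```python
# Toy sketch of an incoherence-style metric: decompose a model's error on a
# task into bias and variance over repeated sampled attempts, then report the
# fraction of error attributable to variance (randomness) rather than bias.
import numpy as np

def incoherence(outcomes, target):
    outcomes = np.asarray(outcomes, dtype=float)
    bias_sq = (outcomes.mean() - target) ** 2   # systematic error
    variance = outcomes.var()                    # error from test-time randomness
    total = bias_sq + variance
    return variance / total if total > 0 else 0.0

# Hypothetical: 100 sampled attempts at a task whose correct outcome is 1.0.
rng = np.random.default_rng(0)
attempts = rng.normal(loc=0.8, scale=0.5, size=100)  # both biased and noisy
print(f"incoherence = {incoherence(attempts, target=1.0):.2f}")
```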
--------------------------------------------------------------------------------------------------------
Industrial optical quality control relies on supervised machine learning, but faces a fundamental challenge: defective parts are rare, creating highly imbalanced datasets that harm model performance. Traditional solutions like specialized loss functions or basic data augmentation have limitations. This work explores generative AI—specifically Stable Diffusion and CycleGAN—as an alternative for dataset expansion. Tested on thermal imaging of combine harvester components for defect detection, Stable Diffusion-based dataset expansion achieved a 4.6% improvement in segmentation performance, reaching 84.6% Mean IoU. This approach offers manufacturing industries a powerful tool for improving automated quality control systems where collecting sufficient defective samples is impractical or expensive.
Authors: Dennis Sprute, Hanna Senke, Holger Flatt
Link: https://arxiv.org/abs/2601.22961v1
Date: 2026-01-d
Summary:
Supervised machine learning algorithms play a crucial role in optical quality control within industrial production. These approaches require representative datasets for effective model training. However, while non-defective components are frequent, defective parts are rare in production, resulting in highly imbalanced datasets that adversely impact model performance. Existing strategies to address this challenge, such as specialized loss functions or traditional data augmentation techniques, have limitations, including the need for careful hyperparameter tuning or the alteration of only simple image features. Therefore, this work explores the potential of generative artificial intelligence (GenAI) as an alternative method for expanding limited datasets and enhancing supervised machine learning performance. Specifically, we investigate Stable Diffusion and CycleGAN as image generation models, focusing on the segmentation of combine harvester components in thermal images for subsequent defect detection. Our results demonstrate that dataset expansion using Stable Diffusion yields the most significant improvement, enhancing segmentation performance by 4.6 %, resulting in a Mean Intersection over Union (Mean IoU) of 84.6 %.
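For reference, the Mean Intersection over Union behind the 84.6% figure is the standard per-class IoU averaged over classes; a generic sketch (not the authors' evaluation code):

```python
# Sketch of the Mean IoU metric used to score segmentation quality.
import numpy as np

def mean_iou(pred, target, num_classes):
    """Per-class intersection-over-union, averaged over classes present in either mask."""
    ious = []
    for c in range(num_classes):
        pred_c, target_c = (pred == c), (target == c)
        union = np.logical_or(pred_c, target_c).sum()
        if union == 0:
            continue  # class absent from both masks
        inter = np.logical_and(pred_c, target_c).sum()
        ious.append(inter / union)
    return float(np.mean(ious))

# Hypothetical 4x4 masks with classes {0: background, 1: component}.
pred   = np.array([[0,0,1,1],[0,1,1,1],[0,0,1,0],[0,0,0,0]])
target = np.array([[0,0,1,1],[0,0,1,1],[0,0,1,1],[0,0,0,0]])
print(f"Mean IoU = {mean_iou(pred, target, num_classes=2):.2f}")
```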
--------------------------------------------------------------------------------------------------------
Toward Fully Autonomous Driving: AI, Challenges, Opportunities, and Needs
Autonomous driving promises transformative benefits but faces significant challenges from the unpredictable, ever-changing real world. While AI has demonstrated superior performance over classical approaches in handling complexity, its use raises critical questions about safety and transferability across different environments. This paper analyzes the current state of automated driving, identifies limitations of existing systems, and explores how advancing AI capabilities could enable true autonomy. The authors examine various challenges in the context of prospective technological developments, providing a roadmap for research needs. Applications span from consumer vehicles to commercial transportation, robotics, and any domain requiring robust decision-making in dynamic, uncertain environments.
Authors: Lars Ullrich, Michael Buchholz, Klaus Dietmayer, Knut Graichen
Link: https://arxiv.org/abs/2601.22927v1
Date: 2026-01-d
Summary:
Automated driving (AD) is promising, but the transition to fully autonomous driving is, among other things, subject to the real, ever-changing open world and the resulting challenges. However, research in the field of AD demonstrates the ability of artificial intelligence (AI) to outperform classical approaches, handle higher complexities, and reach a new level of autonomy. At the same time, the use of AI raises further questions of safety and transferability. To identify the challenges and opportunities arising from AI concerning autonomous driving functionalities, we have analyzed the current state of AD, outlined limitations, and identified foreseeable technological possibilities. Thereby, various further challenges are examined in the context of prospective developments. In this way, this article reconsiders fully autonomous driving with respect to advancements in the field of AI and carves out the respective needs and resulting research questions.
--------------------------------------------------------------------------------------------------------
Fine-tuned large language models can be weaponized to covertly encode secrets into their outputs through steganographic channels—a serious security threat. Previous demonstrations used easily detectable encoding schemes with 100% recoverability. This research introduces low-recoverability steganography using embedding-space-derived mappings, making hidden messages harder to detect. While exact secret recovery rates increased substantially (up to 123%), payload recoverability decreased, creating more sophisticated attacks. However, the study also proposes defenses: linear probes trained on model activations can detect steganographic behavior with up to 33% higher accuracy than in clean models. These findings are critical for AI security, model auditing, and developing defenses against malicious fine-tuning attacks.
Authors: Charles Westphal, Keivan Navaie, Fernando E. Rosas
Link: https://arxiv.org/abs/2601.22818v1
Date: 2026-01-d
Summary:
Fine-tuned LLMs can covertly encode prompt secrets into outputs via steganographic channels. Prior work demonstrated this threat but relied on trivially recoverable encodings. We formalize payload recoverability via classifier accuracy and show previous schemes achieve 100% recoverability. In response, we introduce low-recoverability steganography, replacing arbitrary mappings with embedding-space-derived ones. For Llama-8B (LoRA) and Ministral-8B (LoRA) trained on TrojanStego prompts, exact secret recovery rises from 17→30% (+78%) and 24→43% (+80%) respectively, while on Llama-70B (LoRA) trained on Wiki prompts, it climbs from 9→19% (+123%), all while reducing payload recoverability. We then discuss detection. We argue that detecting fine-tuning-based steganographic attacks requires approaches beyond traditional steganalysis. Standard approaches measure distributional shift, which is an expected side-effect of fine-tuning. Instead, we propose a mechanistic interpretability approach: linear probes trained on later-layer activations detect the secret with up to 33% higher accuracy in fine-tuned models compared to base models, even for low-recoverability schemes. This suggests that malicious fine-tuning leaves actionable internal signatures amenable to interpretability-based defenses.
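A sketch of the probing defence, assuming activations have already been extracted from the suspect model; here they are simulated with a planted linear signal, so every dimension and coefficient is hypothetical:

```python
# Sketch of the probing idea: train a linear classifier on later-layer
# activations to detect whether a hidden secret bit is being encoded.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n, d = 2000, 256                       # hypothetical: 2000 prompts, 256-dim activations
secret_bit = rng.integers(0, 2, n)     # the payload bit the attack tries to encode

# Simulated activations: the fine-tuned model leaks a weak linear signal of the bit.
direction = rng.normal(size=d)
acts = rng.normal(size=(n, d)) + 0.3 * secret_bit[:, None] * direction

X_tr, X_te, y_tr, y_te = train_test_split(acts, secret_bit, test_size=0.3, random_state=0)
probe = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
print(f"probe accuracy on held-out activations: {probe.score(X_te, y_te):.2f}")
```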
--------------------------------------------------------------------------------------------------------
Beyond Abstract Compliance: Operationalising trust in AI as a moral relationship
Current AI trustworthiness frameworks, like the EU's approach, treat trust as a designable property evaluated through technical criteria, ignoring how trust is subjectively experienced, culturally embedded, and inherently relational. Drawing on African communitarian philosophies and relational ethics, this paper proposes expanded principles framing trust as a dynamic, temporal relationship requiring transparency and mutual respect. The authors advocate for involving communities throughout the AI lifecycle to build incremental trust through meaningful relationships. Two use-cases—healthcare and education—illustrate how trust-enabling principles based on African relational ethics can be operationalized. This approach promises more equitable, context-sensitive AI systems, particularly valuable for deploying AI in diverse cultural contexts and underserved communities.
Authors: Lameck Mbangula Amugongo, Tutaleni Asino, Nicola J Bidwell
Link: https://arxiv.org/abs/2601.22769v1
Date: 2026-01-d
Summary:
Dominant approaches, e.g. the EU's "Trustworthy AI framework", treat trust as a property that can be designed for, evaluated, and governed according to normative and technical criteria. They do not address how trust is subjectively cultivated and experienced, culturally embedded, and inherently relational. This paper proposes some expanded principles for trust in AI that can be incorporated into common development methods and frame trust as a dynamic, temporal relationship, which involves transparency and mutual respect. We draw on relational ethics and, in particular, African communitarian philosophies, to foreground the nuances of inclusive, participatory processes and long-term relationships with communities. Involving communities throughout the AI lifecycle can foster meaningful relationships with AI design and development teams that incrementally build trust and promote more equitable and context-sensitive AI systems. We illustrate how trust-enabling principles based on African relational ethics can be operationalised, using two use-cases for AI: healthcare and education.
--------------------------------------------------------------------------------------------------------
Test-Time Mixture of World Models for Embodied Agents in Dynamic Environments
Language model-based embodied agents increasingly operate in real-world environments but struggle with adaptability in dynamic settings where accurate world models are crucial. This research extends the Mixture-of-Experts paradigm beyond its conventional limitations—where routing remains fixed after deployment—by introducing Test-time Mixture of World Models (TMoW). This framework updates routing functions during inference through multi-granular prototype-based routing, test-time refinement, and distilled mixture-based augmentation. Evaluated on VirtualHome, ALFWorld, and RLBench benchmarks, TMoW demonstrates strong zero-shot adaptation and few-shot expansion capabilities. Applications include robotics, smart homes, assistive technologies, and any domain requiring agents to continuously adapt to evolving environments without extensive retraining.
Authors: Jinwoo Jang, Minjong Yoo, Sihyung Yoon, Honguk Woo
Link: https://arxiv.org/abs/2601.22647v1
Date: 2026-01-d
Summary:
Language model (LM)-based embodied agents are increasingly deployed in real-world settings. Yet, their adaptability remains limited in dynamic environments, where constructing accurate and flexible world models is crucial for effective reasoning and decision-making. To address this challenge, we extend the Mixture-of-Experts (MoE) paradigm to embodied agents. While conventional MoE architectures modularize knowledge into expert components with pre-trained routing, they remain rigid once deployed, making them less effective for adapting to unseen domains in dynamic environments. We therefore propose Test-time Mixture of World Models (TMoW), a framework that enhances adaptability to unseen and evolving domains. TMoW updates its routing function over world models at test time, unlike conventional MoE where the function remains fixed, enabling agents to recombine existing models and integrate new ones for continual adaptation. It achieves this through (i) multi-granular prototype-based routing, which adapts mixtures across object- to scene-level similarities, (ii) test-time refinement that aligns unseen domain features with prototypes during inference, and (iii) distilled mixture-based augmentation, which efficiently constructs new models from few-shot data and existing prototypes. We evaluate TMoW on VirtualHome, ALFWorld, and RLBench benchmarks, demonstrating strong performance in both zero-shot adaptation and few-shot expansion scenarios, and showing that it enables embodied agents to operate effectively in dynamic environments.
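The core routing idea can be sketched as a softmax over similarities between the current observation embedding and per-world-model prototypes; TMoW's multi-granular (object- to scene-level) routing, test-time refinement, and distillation go well beyond this toy version, and all dimensions below are made up:

```python
# Minimal sketch of prototype-based routing over world models: mixture weights
# from a softmax over cosine similarities between the observation embedding
# and one prototype per world model.
import numpy as np

def routing_weights(obs_embedding, prototypes, temperature=0.1):
    obs = obs_embedding / np.linalg.norm(obs_embedding)
    protos = prototypes / np.linalg.norm(prototypes, axis=1, keepdims=True)
    logits = (protos @ obs) / temperature        # cosine similarity per world model
    logits -= logits.max()                       # numerical stability
    w = np.exp(logits)
    return w / w.sum()

rng = np.random.default_rng(0)
prototypes = rng.normal(size=(4, 32))            # 4 hypothetical world-model prototypes
obs = prototypes[2] + 0.1 * rng.normal(size=32)  # observation close to prototype 2
print(routing_weights(obs, prototypes).round(3)) # weight concentrates on model 2
```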
--------------------------------------------------------------------------------------------------------
Beyond Medical Chatbots: Meddollina and the Rise of Continuous Clinical Intelligence
Medical AI systems appear clinically fluent in benchmarks but exhibit behaviors incompatible with real-world deployment: premature diagnostic closure, unjustified certainty, and instability across multi-step decisions. The authors argue these failures stem from treating medicine as next-token prediction rather than responsibility-bound reasoning under uncertainty. They introduce Clinical Contextual Intelligence (CCI) and Meddollina, a governance-first system that constrains inference before language generation. Evaluated across 16,412+ medical queries, Meddollina demonstrates calibrated uncertainty, conservative reasoning, and reduced speculative completion compared to baselines. This work challenges the assumption that scaling alone will produce deployable medical AI, proposing instead that progress requires measuring clinician-aligned behavior under uncertainty—critical for patient safety.
Authors: Vaibhav Ram S. V. N. S, Swetanshu Agrawal, Samudra Banerjee, Abdul Muhsin
Link: https://arxiv.org/abs/2601.22645v1
Date: 2026-01-d
Summary:
Generative medical AI now appears fluent and knowledgeable enough to resemble clinical intelligence, encouraging the belief that scaling will make it safe. But clinical reasoning is not text generation. It is a responsibility-bound process under ambiguity, incomplete evidence, and longitudinal context. Even as benchmark scores rise, generation-centric systems still show behaviours incompatible with clinical deployment: premature closure, unjustified certainty, intent drift, and instability across multi-step decisions. We argue these are structural consequences of treating medicine as next-token prediction. We formalise Clinical Contextual Intelligence (CCI) as a distinct capability class required for real-world clinical use, defined by persistent context awareness, intent preservation, bounded inference, and principled deferral when evidence is insufficient. We introduce Meddollina, a governance-first clinical intelligence system designed to constrain inference before language realisation, prioritising clinical appropriateness over generative completeness. Meddollina acts as a continuous intelligence layer supporting clinical workflows while preserving clinician authority. We evaluate Meddollina using a behaviour-first regime across 16,412+ heterogeneous medical queries, benchmarking against general-purpose models, medical-tuned models, and retrieval-augmented systems. Meddollina exhibits a distinct behavioural profile: calibrated uncertainty, conservative reasoning under underspecification, stable longitudinal constraint adherence, and reduced speculative completion relative to generation-centric baselines. These results suggest deployable medical AI will not emerge from scaling alone, motivating a shift toward Continuous Clinical Intelligence, where progress is measured by clinician-aligned behaviour under uncertainty rather than fluency-driven completion.
--------------------------------------------------------------------------------------------------------
Statistical Estimation of Adversarial Risk in Large Language Models under Best-of-N Sampling
Large language models are typically evaluated under single-shot adversarial prompting, severely underestimating real-world risk where attackers can exploit parallel sampling to repeatedly probe until eliciting harmful responses. This research introduces SABER (Scaling-Aware Best-of-N Estimation of Risk), modeling jailbreak vulnerability under repeated sampling using Beta distributions and deriving analytic scaling laws. With only 100 samples, SABER predicts attack success rates at 1000 attempts with 86.2% less error than baselines. Results reveal that models appearing robust under standard evaluation can experience rapid nonlinear risk amplification under parallel adversarial pressure. This methodology enables realistic, low-cost safety assessment crucial for deploying LLMs in production environments.
Authors: Mingqian Feng, Xiaodong Liu, Weiwei Yang, Chenliang Xu, Christopher White, Jianfeng Gao
Link: https://arxiv.org/abs/2601.22636v1
Date: 2026-01-d
Summary:
Large Language Models (LLMs) are typically evaluated for safety under single-shot or low-budget adversarial prompting, which underestimates real-world risk. In practice, attackers can exploit large-scale parallel sampling to repeatedly probe a model until a harmful response is produced. While recent work shows that attack success increases with repeated sampling, principled methods for predicting large-scale adversarial risk remain limited. We propose a scaling-aware Best-of-N estimation of risk, SABER, for modeling jailbreak vulnerability under Best-of-N sampling. We model sample-level success probabilities using a Beta distribution, the conjugate prior of the Bernoulli distribution, and derive an analytic scaling law that enables reliable extrapolation of large-N attack success rates from small-budget measurements. Using only n=100 samples, our anchored estimator predicts ASR@1000 with a mean absolute error of 1.66, compared to 12.04 for the baseline, which is an 86.2% reduction in estimation error. Our results reveal heterogeneous risk scaling profiles and show that models appearing robust under standard evaluation can experience rapid nonlinear risk amplification under parallel adversarial pressure. This work provides a low-cost, scalable methodology for realistic LLM safety assessment. We will release our code and evaluation scripts upon publication to facilitate future research.
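The Beta-Bernoulli modelling admits a closed form for the best-of-N attack success rate: if a prompt's per-sample jailbreak probability is p ~ Beta(alpha, beta), then ASR@N = 1 - E[(1-p)^N] = 1 - B(alpha, beta+N)/B(alpha, beta). A sketch of the extrapolation step only; the anchored fitting of alpha and beta from small-budget measurements is omitted and the parameter values below are hypothetical:

```python
# Closed-form best-of-N attack success rate under a Beta prior on the
# per-sample jailbreak probability p.
from math import lgamma, exp

def log_beta(a, b):
    return lgamma(a) + lgamma(b) - lgamma(a + b)

def asr_at_n(alpha, beta, n):
    # ASR@N = 1 - E[(1-p)^N] = 1 - B(alpha, beta + n) / B(alpha, beta)
    return 1.0 - exp(log_beta(alpha, beta + n) - log_beta(alpha, beta))

# Hypothetical fitted parameters for one prompt distribution.
alpha, beta = 0.05, 4.0
for n in (1, 10, 100, 1000):
    print(f"ASR@{n:>4} = {asr_at_n(alpha, beta, n):.3f}")
```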
--------------------------------------------------------------------------------------------------------
EUGens: Efficient, Unified, and General Dense Layers
Fully-connected feedforward layers create computational and parameter bottlenecks in neural networks, limiting real-time applications and deployment in resource-constrained environments. EUGens (Efficient, Unified, and General dense layers) leverage random features to approximate standard layers while incorporating input norm dependence, unifying existing efficient extensions and reducing complexity from quadratic to linear time. They enable the first unbiased algorithms approximating layers with arbitrary polynomial activations. Integrated into Transformers and MLPs, EUGens deliver up to 27% faster inference and 30% better memory efficiency across image classification, language model pre-training, and 3D scene reconstruction. Applications span mobile devices, edge computing, large-scale language models, and any scenario requiring efficient neural network deployment.
Authors: Sang Min Kim, Byeongchan Kim, Arijit Sehanobish, Somnath Basu Roy Chowdhury, Rahul Kidambi, Dongseok Shim, Avinava Dubey, Snigdha Chaturvedi, Min-hwan Oh, Krzysztof Choromanski
Link: https://arxiv.org/abs/2601.22563v1
Date: 2026-01-d
Summary:
Efficient neural networks are essential for scaling machine learning models to real-time applications and resource-constrained environments. Fully-connected feedforward layers (FFLs) introduce computation and parameter count bottlenecks within neural network architectures. To address this challenge, in this work, we propose a new class of dense layers that generalize standard fully-connected feedforward layers, Efficient, Unified and General dense layers (EUGens). EUGens leverage random features to approximate standard FFLs and go beyond them by incorporating a direct dependence on the input norms in their computations. The proposed layers unify existing efficient FFL extensions and improve efficiency by reducing inference complexity from quadratic to linear time. They also lead to the first unbiased algorithms approximating FFLs with arbitrary polynomial activation functions. Furthermore, EUGens reduce the parameter count and computational overhead while preserving the expressive power and adaptability of FFLs. We also present a layer-wise knowledge transfer technique that bypasses backpropagation, enabling efficient adaptation of EUGens to pre-trained models. Empirically, we observe that integrating EUGens into Transformers and MLPs yields substantial improvements in inference speed (up to 27%) and memory efficiency (up to 30%) across a range of tasks, including image classification, language model pre-training, and 3D scene reconstruction. Overall, our results highlight the potential of EUGens for the scalable deployment of large-scale neural networks in real-world scenarios.
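As a toy illustration of the random-feature principle only (not EUGens' actual construction, which additionally handles input norms and arbitrary polynomial activations): for Gaussian features w ~ N(0, I), the expectation of (x·w)(w·v) equals x·v, so averaging over m random features gives an unbiased estimate of a dense layer's linear response at cost linear in m:

```python
# Toy sketch: unbiased random-feature estimate of one output of a linear layer.
import numpy as np

def rf_estimate(x, v, m, rng):
    omegas = rng.normal(size=(m, x.shape[0]))       # m Gaussian random features
    return np.mean((omegas @ x) * (omegas @ v))     # expectation equals x @ v

rng = np.random.default_rng(0)
d, m = 512, 64
x, v = rng.normal(size=d), rng.normal(size=d)       # input and one weight row

estimates = [rf_estimate(x, v, m, rng) for _ in range(200)]
print(f"exact = {x @ v:.2f}, mean of 200 estimates = {np.mean(estimates):.2f}")
```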
--------------------------------------------------------------------------------------------------------
Recoverability Has a Law: The ERR Measure for Tool-Augmented Agents
Language model agents often recover from failed tool executions, yet this capability lacked formal explanation. This research presents a predictive theory showing recoverability follows a measurable law. The Expected Recovery Regret (ERR) framework quantifies deviation from optimal recovery policies, deriving a first-order relationship with the empirically observable Efficiency Score. This produces a falsifiable quantitative law validated across five benchmarks spanning controlled perturbations, diagnostic reasoning, and real-world APIs. Predicted regret matched observed values within δ≤0.05 across model scales and perturbation types. These findings reveal recoverability as a governed property of interaction dynamics rather than an architectural artifact, providing theoretical foundations for building robust tool-using agents in production environments.
Authors: Sri Vatsa Vuddanti, Satwik Kumar Chittiprolu
Link: https://arxiv.org/abs/2601.22352v1
Date: 2026-01-d
Summary:
Language model agents often appear capable of self-recovery after failing tool call executions, yet this behavior lacks a formal explanation. We present a predictive theory that resolves this gap by showing that recoverability follows a measurable law. To elaborate, we formalize recoverability through Expected Recovery Regret (ERR), which quantifies the deviation of a recovery policy from the optimal one under stochastic execution noise, and derive a first-order relationship between ERR and an empirically observable quantity, the Efficiency Score (ES). This yields a falsifiable first-order quantitative law of recovery dynamics in tool-using agents. We empirically validate the law across five tool-use benchmarks spanning controlled perturbations, diagnostic reasoning, and real-world APIs. Across model scales, perturbation regimes, and recovery horizons, predicted regret under the ERR-ES law closely matched observed post-failure regret measured from Monte Carlo rollouts, within δ ≤ 0.05. Our results reveal that recoverability is not an artifact of model scale or architecture, but a governed property of interaction dynamics, providing a theoretical foundation for execution-level robustness in language agents.
--------------------------------------------------------------------------------------------------------
Exploring Reasoning Reward Model for Agents
Agentic reinforcement learning enables complex reasoning and tool use but typically relies on sparse outcome-based rewards that fail to differentiate intermediate reasoning quality. Agent Reasoning Reward Model (Agent-RRM) provides structured feedback including explicit reasoning traces, focused critiques highlighting flaws, and overall performance scores. Three integration strategies were investigated: text-augmented refinement, reward-augmented guidance, and unified feedback integration. The unified approach (Reagent-U) achieved substantial improvements, reaching 43.7% on GAIA and 46.2% on WebWalkerQA across 12 benchmarks. This work demonstrates that process-level feedback significantly improves agentic learning, with applications in autonomous task completion, web navigation, tool use, and any domain requiring multi-step reasoning.
Authors: Kaixuan Fan, Kaituo Feng, Manyuan Zhang, Tianshuo Peng, Zhixun Li, Yilei Jiang, Shuang Chen, Peng Pei, Xunliang Cai, Xiangyu Yue
Link: https://arxiv.org/abs/2601.22154v1
Date: 2026-01-d
Summary:
Agentic Reinforcement Learning (Agentic RL) has achieved notable success in enabling agents to perform complex reasoning and tool use. However, most methods still rely on sparse outcome-based rewards for training. Such feedback fails to differentiate intermediate reasoning quality, leading to suboptimal training results. In this paper, we introduce Agent Reasoning Reward Model (Agent-RRM), a multi-faceted reward model that produces structured feedback for agentic trajectories, including (1) an explicit reasoning trace, (2) a focused critique that provides refinement guidance by highlighting reasoning flaws, and (3) an overall score that evaluates process performance. Leveraging these signals, we systematically investigate three integration strategies: Reagent-C (text-augmented refinement), Reagent-R (reward-augmented guidance), and Reagent-U (unified feedback integration). Extensive evaluations across 12 diverse benchmarks demonstrate that Reagent-U yields substantial performance leaps, achieving 43.7% on GAIA and 46.2% on WebWalkerQA, validating the effectiveness of our reasoning reward model and training schemes. Code, models, and datasets are all released to facilitate future research.
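In the spirit of the reward-augmented variant (Reagent-R), the process-level score can be blended with the sparse outcome reward. The blending weight and the stand-in scores below are hypothetical, and the paper's actual integration strategies and reward model are learned rather than hand-set:

```python
# Hedged sketch of reward-augmented guidance: blend the sparse outcome reward
# with a reward model's overall process score for a trajectory.
def shaped_reward(outcome_correct: bool, rrm_score: float, alpha: float = 0.5) -> float:
    """outcome_correct: did the trajectory solve the task;
    rrm_score: reward model's overall process score in [0, 1]."""
    outcome_reward = 1.0 if outcome_correct else 0.0
    return (1 - alpha) * outcome_reward + alpha * rrm_score

# A failed trajectory with sound intermediate reasoning still earns partial
# credit, unlike a purely outcome-based reward.
print(shaped_reward(outcome_correct=False, rrm_score=0.7))  # 0.35
print(shaped_reward(outcome_correct=True,  rrm_score=0.2))  # 0.6
```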
--------------------------------------------------------------------------------------------------------
Alpha Discovery via Grammar-Guided Learning and Search
Discovering formulaic alpha factors—predictive signals for trading—remains central to quantitative finance, yet existing methods ignore syntactic and semantic constraints, relying on exhaustive unstructured search. AlphaCFG introduces a grammar-based framework using context-free grammars to define syntactically valid, financially interpretable, computationally efficient factors. Alpha discovery is formulated as a tree-structured linguistic Markov decision process solved via grammar-aware Monte Carlo Tree Search with syntax-sensitive networks. Experiments on Chinese and U.S. stock markets demonstrate superior search efficiency and trading profitability versus baselines. Beyond trading strategies, AlphaCFG provides a general framework for symbolic factor discovery applicable to asset pricing, portfolio construction, and broader quantitative finance applications.
Authors: Han Yang, Dong Hao, Zhuohan Wang, Qi Shi, Xingtong Li
Link: https://arxiv.org/abs/2601.22119v1
Date: 2026-01-d
Summary:
Automatically discovering formulaic alpha factors is a central problem in quantitative finance. Existing methods often ignore syntactic and semantic constraints, relying on exhaustive search over unstructured and unbounded spaces. We present AlphaCFG, a grammar-based framework for defining and discovering alpha factors that are syntactically valid, financially interpretable, and computationally efficient. AlphaCFG uses an alpha-oriented context-free grammar to define a tree-structured, size-controlled search space, and formulates alpha discovery as a tree-structured linguistic Markov decision process, which is then solved using a grammar-aware Monte Carlo Tree Search guided by syntax-sensitive value and policy networks. Experiments on Chinese and U.S. stock market datasets show that AlphaCFG outperforms state-of-the-art baselines in both search efficiency and trading profitability. Beyond trading strategies, AlphaCFG serves as a general framework for symbolic factor discovery and refinement across quantitative finance, including asset pricing and portfolio construction.
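A miniature alpha-oriented context-free grammar conveys the core guarantee: anything sampled from it is a syntactically valid factor over market fields. The production rules, operators, and depth control below are illustrative stand-ins for AlphaCFG's grammar, and the MCTS search with syntax-sensitive networks is omitted:

```python
# Toy alpha-factor grammar plus a depth-bounded random sampler.
import random

GRAMMAR = {
    "EXPR": [["UNARY", "(", "EXPR", ")"],
             ["(", "EXPR", "BINOP", "EXPR", ")"],
             ["TS_OP", "(", "FIELD", ",", "WINDOW", ")"],
             ["FIELD"]],
    "UNARY": [["abs"], ["rank"]],
    "BINOP": [["+"], ["-"], ["*"], ["/"]],
    "TS_OP": [["ts_mean"], ["ts_std"], ["delta"]],
    "FIELD": [["close"], ["open"], ["volume"], ["vwap"]],
    "WINDOW": [["5"], ["10"], ["20"]],
}

def sample(symbol="EXPR", depth=0, max_depth=4):
    if symbol not in GRAMMAR:
        return symbol                          # terminal token
    rules = GRAMMAR[symbol]
    # Past the depth budget, fall back to the last (shortest) production.
    rule = random.choice(rules if depth < max_depth else rules[-1:])
    return "".join(sample(s, depth + 1, max_depth) for s in rule)

random.seed(0)
for _ in range(3):
    print(sample())   # e.g. ts_mean(close,10)
```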
--------------------------------------------------------------------------------------------------------
Defining Operational Conditions for Safety-Critical AI-Based Systems from Data
Safety-critical AI systems require defining the Operational Design Domain (ODD)—environmental conditions under which systems operate safely—yet traditional approaches rely on early-stage expert knowledge when data may be incomplete. This paper presents a novel Safety-by-Design method for a posteriori ODD definition from collected data using multi-dimensional kernel-based representation. Validated through Monte Carlo methods and a real-world aviation collision-avoidance use case, the approach demonstrates that data-driven ODDs can equal the underlying hidden ODD of training data. This methodology enables certification of AI-based systems in aviation, autonomous vehicles, medical devices, and industrial automation where defining operational boundaries from existing data is crucial for regulatory approval.
Authors: Johann Christensen, Elena Hoemann, Frank Köster, Sven Hallerbach
Link: https://arxiv.org/abs/2601.22118v1
Date: 2026-01-d
Summary:
Artificial Intelligence (AI) has been on the rise in many domains, including numerous safety-critical applications. However, for complex systems found in the real world, or when data already exist, defining the underlying environmental conditions is extremely challenging. This often results in an incomplete description of the environment in which the AI-based system must operate. Nevertheless, this description, called the Operational Design Domain (ODD), is required in many domains for the certification of AI-based systems. Traditionally, the ODD is created in the early stages of the development process, drawing on sophisticated expert knowledge and related standards. This paper presents a novel Safety-by-Design method to a posteriori define the ODD from previously collected data using a multi-dimensional kernel-based representation. This approach is validated through both Monte Carlo methods and a real-world aviation use case for a future safety-critical collision-avoidance system. Moreover, by defining under what conditions two ODDs are equal, the paper shows that the data-driven ODD can equal the original, underlying hidden ODD of the data. Utilizing the novel, Safe-by-Design kernel-based ODD enables future certification of data-driven, safety-critical AI-based systems.
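The a posteriori idea can be sketched with an off-the-shelf multi-dimensional kernel density estimate: fit it to logged operating conditions and treat a density threshold as the ODD boundary. The two condition variables, their distributions, and the 95% coverage threshold below are hypothetical, and the paper's kernel-based representation and ODD-equality criterion are more elaborate:

```python
# Sketch of a data-driven ODD: KDE over logged conditions, thresholded density.
import numpy as np
from scipy.stats import gaussian_kde

rng = np.random.default_rng(0)
# Hypothetical logged conditions: altitude (ft) and closing speed (kt) at encounter.
data = np.column_stack([rng.normal(3000, 500, 5000), rng.normal(250, 40, 5000)])

kde = gaussian_kde(data.T)                        # expects shape (dims, n_samples)
threshold = np.quantile(kde(data.T), 0.05)        # keep 95% of observed conditions

def in_odd(condition):
    """True if the queried operating condition lies inside the data-driven ODD."""
    return kde(np.asarray(condition, dtype=float).reshape(2, 1))[0] >= threshold

print(in_odd([3100, 240]))   # near the bulk of the data -> True
print(in_odd([9000, 600]))   # far outside observed conditions -> False
```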
--------------------------------------------------------------------------------------------------------
Heterogeneous Computing: The Key to Powering the Future of AI Agent Inference
AI agent inference creates an "inference-heavy datacenter future" exposing bottlenecks beyond compute—particularly memory capacity, bandwidth, and high-speed interconnects. This paper introduces Operational Intensity (OI) and Capacity Footprint (CF) metrics jointly explaining regimes missed by classic roofline analysis, including memory capacity walls. Across agentic workflows (chat, coding, web use), OI/CF shift dramatically, with long-context KV cache making decode highly memory-bound. These observations motivate disaggregated serving, specialized prefill/decode accelerators, broader networking, and decoupled compute-memory via optical I/O. The framework suggests agent-hardware co-design and high-bandwidth memory disaggregation as foundations for sustaining efficiency in large-scale agentic AI inference across datacenters and edge deployment.
Authors: Yiren Zhao, Junyi Liu
Link: https://arxiv.org/abs/2601.22001v1
Date: 2026-01-d
Summary:
AI agent inference is driving an inference-heavy datacenter future and exposes bottlenecks beyond compute - especially memory capacity, memory bandwidth and high-speed interconnect. We introduce two metrics - Operational Intensity (OI) and Capacity Footprint (CF) - that jointly explain regimes the classic roofline analysis misses, including the memory capacity wall. Across agentic workflows (chat, coding, web use, computer use) and base model choices (GQA/MLA, MoE, quantization), OI/CF can shift dramatically, with long-context KV cache making decode highly memory-bound. These observations motivate disaggregated serving and system-level heterogeneity: specialized prefill and decode accelerators, broader scale-up networking, and decoupled compute-memory enabled by optical I/O. We further hypothesize agent-hardware co-design, multiple inference accelerators within one system, and high-bandwidth, large-capacity memory disaggregation as foundations for adaptation to evolving OI/CF. Together, these directions chart a path to sustain efficiency and capability for large-scale agentic AI inference.
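A back-of-envelope sketch of the two metrics for a single decode step of a dense transformer; the model shape, context length, and FLOP/byte accounting below are illustrative assumptions rather than the paper's workload measurements:

```python
# Rough OI/CF estimate for one decode token of a hypothetical 8B-parameter
# model with grouped-query attention, fp16 weights and KV cache.
PARAMS     = 8e9
LAYERS     = 32
KV_HEADS   = 8
HEAD_DIM   = 128
BYTES_FP16 = 2
CONTEXT    = 64_000          # long agentic context
BATCH      = 1

weight_bytes = PARAMS * BYTES_FP16
# KV cache: 2 (K and V) * layers * kv_heads * head_dim * context * bytes * batch
kv_bytes = 2 * LAYERS * KV_HEADS * HEAD_DIM * CONTEXT * BYTES_FP16 * BATCH

# Decode step: ~2 FLOPs per parameter per token, plus attention over the cache.
flops = 2 * PARAMS + 2 * (kv_bytes / BYTES_FP16)
bytes_moved = weight_bytes + kv_bytes            # weights and cache both streamed

oi = flops / bytes_moved                         # Operational Intensity (FLOP/byte)
cf = (weight_bytes + kv_bytes) / 1e9             # Capacity Footprint (GB resident)

print(f"Operational Intensity ~ {oi:.2f} FLOP/byte (decode is memory-bound)")
print(f"Capacity Footprint    ~ {cf:.1f} GB")
```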
--------------------------------------------------------------------------------------------------------
Liquid Interfaces: A Dynamic Ontology for the Interoperability of Autonomous Systems
Contemporary software architectures struggle to support autonomous agents with adaptive, probabilistic reasoning, while integration relies on static interfaces and deterministic contracts. Liquid Interfaces introduces a paradigm where interfaces are ephemeral relational events emerging through runtime intention articulation and semantic negotiation rather than persistent artifacts. The Liquid Interface Protocol (LIP) governs intention-driven interaction, negotiated execution, and enforced ephemerality under semantic uncertainty. A reference architecture demonstrates practical feasibility. This approach provides foundations for adaptive coordination in multi-agent systems, applicable to distributed AI systems, IoT ecosystems, robotic swarms, and any domain requiring flexible, context-dependent interactions between autonomous entities without rigid pre-defined protocols.
Authors: Dhiogo de Sá, Carlos Schmiedel, Carlos Pereira Lopes
Link: https://arxiv.org/abs/2601.21993v1
Date: 2026-01-d
Summary:
Contemporary software architectures struggle to support autonomous agents whose reasoning is adaptive, probabilistic, and context-dependent, while system integration remains dominated by static interfaces and deterministic contracts. This paper introduces Liquid Interfaces, a coordination paradigm in which interfaces are not persistent technical artifacts, but ephemeral relational events that emerge through intention articulation and semantic negotiation at runtime. We formalize this model and present the Liquid Interface Protocol (LIP), which governs intention-driven interaction, negotiated execution, and enforced ephemerality under semantic uncertainty. We further discuss the governance implications of this approach and describe a reference architecture that demonstrates practical feasibility. Liquid Interfaces provide a principled foundation for adaptive coordination in agent-based systems.
--------------------------------------------------------------------------------------------------------
MoE-ACT: Improving Surgical Imitation Learning Policies through Supervised Mixture-of-Experts
Imitation learning in surgical robotics faces unique challenges: data scarcity, constrained workspaces, and exceptional safety requirements. This research presents a supervised Mixture-of-Experts architecture for phase-structured surgical tasks, demonstrating that Action Chunking Transformer policies can learn complex long-horizon manipulation from under 150 demonstrations using only stereo endoscopic images. Evaluated on bowel grasping and retraction—where robot assistants interpret surgeon cues, grasp deformable tissue, and perform sustained retraction—the MoE architecture significantly outperformed generalist Vision-Language-Action models and standard baselines. Notably, policies generalized to unseen viewpoints and transferred zero-shot to ex vivo porcine tissue, with preliminary in vivo results suggesting pathways toward clinical deployment in minimally invasive surgery.
Authors: Lorenzo Mazza, Ariel Rodriguez, Rayan Younis, Martin Lelis, Ortrun Hellig, Chenpan Li, Sebastian Bodenstedt, Martin Wagner, Stefanie Speidel
Link: https://arxiv.org/abs/2601.21971v1
Date: 2026-01-d
Summary:
Imitation learning has achieved remarkable success in robotic manipulation, yet its application to surgical robotics remains challenging due to data scarcity, constrained workspaces, and the need for an exceptional level of safety and predictability. We present a supervised Mixture-of-Experts (MoE) architecture designed for phase-structured surgical manipulation tasks, which can be added on top of any autonomous policy. Unlike prior surgical robot learning approaches that rely on multi-camera setups or thousands of demonstrations, we show that a lightweight action decoder policy like Action Chunking Transformer (ACT) can learn complex, long-horizon manipulation from less than 150 demonstrations using solely stereo endoscopic images, when equipped with our architecture. We evaluate our approach on the collaborative surgical task of bowel grasping and retraction, where a robot assistant interprets visual cues from a human surgeon, executes targeted grasping on deformable tissue, and performs sustained retraction. We benchmark our method against state-of-the-art Vision-Language-Action (VLA) models and the standard ACT baseline. Our results show that generalist VLAs fail to acquire the task entirely, even under standard in-distribution conditions. Furthermore, while standard ACT achieves moderate success in-distribution, adopting a supervised MoE architecture significantly boosts its performance, yielding higher success rates in-distribution and demonstrating superior robustness in out-of-distribution scenarios, including novel grasp locations, reduced illumination, and partial occlusions. Notably, it generalizes to unseen testing viewpoints and also transfers zero-shot to ex vivo porcine tissue without additional training, offering a promising pathway toward in vivo deployment. To support this, we present qualitative preliminary results of policy roll-outs during in vivo porcine surgery.
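The supervised-MoE idea can be sketched as a phase classifier (the gate, trained on phase labels) routing each observation to a phase-specific expert decoder. In MoE-ACT the experts are ACT policies over stereo endoscopic images; the dimensions, phase count, and linear experts below are hypothetical stand-ins:

```python
# Minimal sketch of a supervised Mixture-of-Experts head over phase-specific
# action decoders.
import torch
import torch.nn as nn

class SupervisedMoEHead(nn.Module):
    def __init__(self, feat_dim=256, action_dim=7, n_phases=3):
        super().__init__()
        self.gate = nn.Linear(feat_dim, n_phases)      # trained with phase labels
        self.experts = nn.ModuleList(
            [nn.Linear(feat_dim, action_dim) for _ in range(n_phases)])

    def forward(self, feats, phase_labels=None):
        logits = self.gate(feats)
        # During training the gate gets a cross-entropy loss against phase labels
        # (not shown); at test time each sample routes to its argmax-phase expert.
        phases = phase_labels if phase_labels is not None else logits.argmax(dim=-1)
        actions = torch.stack(
            [self.experts[int(p)](f) for p, f in zip(phases, feats)])
        return actions, logits

head = SupervisedMoEHead()
feats = torch.randn(4, 256)          # stand-in for stereo-image encoder features
actions, phase_logits = head(feats)
print(actions.shape)                 # torch.Size([4, 7])
```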
--------------------------------------------------------------------------------------------------------
Recent studies analyzing POSS1-E photographic plates claimed evidence for artificial objects near Earth based on alleged deficits in terrestrial shadow, linear feature clusters, and correlations with nuclear tests and UAP sightings. This critical evaluation reexamines these claims using previously published datasets. Analyses find no deficit in terrestrial shadow, reveal that one-third of reported linear cluster features were catalog stars, and show correlations with nuclear tests become insignificant after proper normalization—largely determined by telescope observation schedules. The study uncovers dataset inconsistencies, unvalidated data containing artifacts and defects, and spatial distributions suggesting scanning artifacts rather than optical transients. This rigorous scrutiny exemplifies scientific reproducibility standards critical for extraordinary claims.
Authors: Wesley Andrés Watters, Laura Dominé, Sarah Little, Cameron Pratt, Kevin H. Knuth
Link: https://arxiv.org/abs/2601.21946v1
Date: 2026-01-d
Summary:
Recent studies by B. Villarroel and colleagues have assembled and analyzed datasets of unidentified features measured from digital scans of photographic plates captured by the first-epoch Palomar Observatory Sky Survey (POSS1) in the pre-Sputnik era. These studies have called attention to (i) a purported deficit of features within Earth's shadow; (ii) the sporadic presence of linear clusters; and (iii) a positive correlation between the timing of feature observations and nuclear tests as well as Unidentified Aerial Phenomena (UAP) sighting reports. These observations were cited as evidence that some fraction of the unidentified features represent glinting artificial objects near Earth. We have examined these claims using two related, previously published datasets. When analyzing the most vetted of these, we do not observe the reported deficit in the terrestrial shadow. We determine that a third of the features in the reported linear clusters were not confidently distinguished from catalog stars. We find that the reported correlation between the timing of feature observations and nuclear tests becomes insignificant after properly normalizing by the number of observation days, and is almost completely determined by the observation schedule of the Palomar telescope. We uncover important inconsistencies in the definitions of the datasets used in these studies, as well as the use of unvalidated datasets containing catalog stars, scan artifacts, and plate defects. It has not been shown that any of the features in these datasets represent optical transients. We examine the spatial distribution of the plate-derived features, finding an overall gradual increase in number density toward the corners and edges of plates, as well as examples of (i) empty north-south strips that span multiple plates; (ii) clusters and voids having geometric shapes; and (iii) amorphous clusters.
--------------------------------------------------------------------------------------------------------
Self-Compression of Chain-of-Thought via Multi-Agent Reinforcement Learning
Large Reasoning Models suffer from inference overhead due to redundant reasoning, bottlenecking deployment and degrading user experience. Existing reinforcement learning solutions using simple length penalties struggle to balance brevity with accuracy, potentially compromising critical reasoning logic. Self-Compression via MARL (SCMA) addresses this through specialized agents: a Segmentation Agent decomposing reasoning into logical chunks and a Scoring Agent quantifying chunk importance. These collaboratively define importance-weighted length penalties during training, incentivizing a Reasoning Agent to preserve essential logic without inference overhead. Empirical evaluations show 11.1%-39.0% response length reduction while boosting accuracy 4.33%-10.02%, with ablations validating that multi-agent synergy yields more powerful models than vanilla RL approaches.
Authors: Yiqun Chen, Jinyuan Feng, Wei Yang, Meizhi Zhong, Zhengliang Shi, Rui Li, Xiaochi Wei, Yan Gao, Yi Wu, Yao Hu, Zhiqiang Pu, Jiaxin Mao
Link: https://arxiv.org/abs/2601.21919v1
Date: 2026-01-d
Summary:
The inference overhead induced by redundant reasoning undermines the interactive experience and severely bottlenecks the deployment of Large Reasoning Models. Existing reinforcement learning (RL)-based solutions tackle this problem by coupling a length penalty with outcome-based rewards. This simplistic reward weighting struggles to reconcile brevity with accuracy, as enforcing brevity may compromise critical reasoning logic. In this work, we address this limitation by proposing a multi-agent RL framework that selectively penalizes redundant chunks, while preserving essential reasoning logic. Our framework, Self-Compression via MARL (SCMA), instantiates redundancy detection and evaluation through two specialized agents: a Segmentation Agent for decomposing the reasoning process into logical chunks, and a Scoring Agent for quantifying the significance of each chunk. The Segmentation and Scoring agents collaboratively define an importance-weighted length penalty during training, incentivizing a Reasoning Agent to prioritize essential logic without introducing inference overhead during deployment. Empirical evaluations across model scales demonstrate that SCMA reduces response length by 11.1% to 39.0% while boosting accuracy by 4.33% to 10.02%. Furthermore, ablation studies and qualitative analysis validate that the synergistic optimization within the MARL framework fosters emergent behaviors, yielding more powerful LRMs compared to vanilla RL paradigms.
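A sketch of an importance-weighted length penalty in the spirit of SCMA: the Segmentation Agent's chunks and the Scoring Agent's importances determine how much each token costs. The penalty coefficient and the hard-coded chunk scores below are hypothetical stand-ins for the learned agents:

```python
# Importance-weighted length penalty: low-importance (redundant) chunks are
# penalised, essential chunks are nearly free.
def scma_style_reward(correct, chunk_lengths, importances, lam=0.001):
    """chunk_lengths: token counts per logical chunk;
    importances: Scoring Agent outputs in [0, 1], one per chunk."""
    outcome = 1.0 if correct else 0.0
    penalty = lam * sum(length * (1.0 - imp)
                        for length, imp in zip(chunk_lengths, importances))
    return outcome - penalty

# A correct answer with two essential chunks and one long, redundant digression.
print(scma_style_reward(True, chunk_lengths=[120, 80, 400],
                        importances=[0.9, 0.8, 0.1]))   # 1.0 - 0.388 = 0.612
```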
--------------------------------------------------------------------------------------------------------
Current LLM post-training optimizes complete reasoning trajectories through supervised fine-tuning and outcome-based reinforcement learning, but this problem-centric approach doesn't align with human cognition, which decomposes problem-solving into acquiring abstract strategies then adapting them to specific instances. Chain-of-Meta-Thought (CoMT) focuses supervised learning on abstract reasoning patterns enabling generalizable strategy acquisition, while Confidence-Calibrated Reinforcement Learning (CCRL) optimizes task adaptation via confidence-aware rewards on intermediate steps, preventing error cascades. Experiments across four models and eight benchmarks show 2.19% in-distribution and 4.63% out-of-distribution improvements over standard methods, while reducing training time 65-70% and token consumption 50%, demonstrating that cognitive alignment yields superior generalization and efficiency.
Authors: Shaojie Wang, Liang Zhang
Link: https://arxiv.org/abs/2601.21909v1
Date: 2026-01-d
Summary:
Current LLM post-training methods optimize complete reasoning trajectories through Supervised Fine-Tuning (SFT) followed by outcome-based Reinforcement Learning (RL). While effective, a closer examination reveals a fundamental gap: this approach does not align with how humans actually solve problems. Human cognition naturally decomposes problem-solving into two distinct stages: first acquiring abstract strategies (i.e., meta-knowledge) that generalize across problems, then adapting them to specific instances. In contrast, by treating complete trajectories as basic units, current methods are inherently problem-centric, entangling abstract strategies with problem-specific execution. To address this misalignment, we propose a cognitively-inspired framework that explicitly mirrors the two-stage human cognitive process. Specifically, Chain-of-Meta-Thought (CoMT) focuses supervised learning on abstract reasoning patterns without specific executions, enabling acquisition of generalizable strategies. Confidence-Calibrated Reinforcement Learning (CCRL) then optimizes task adaptation via confidence-aware rewards on intermediate steps, preventing overconfident errors from cascading and improving execution reliability. Experiments across four models and eight benchmarks show 2.19% and 4.63% improvements in-distribution and out-of-distribution respectively over standard methods, while reducing training time by 65-70% and token consumption by 50%, demonstrating that aligning post-training with human cognitive principles yields not only superior generalization but also enhanced training efficiency.
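A hedged sketch of a confidence-aware step reward in the spirit of CCRL: correct intermediate steps earn reward scaled by stated confidence, while confidently wrong steps are penalised hardest, discouraging overconfident errors from cascading. The functional form and blending weight below are illustrative assumptions, not the paper's exact reward:

```python
# Confidence-aware step rewards blended with the final outcome reward.
def step_reward(step_correct: bool, confidence: float) -> float:
    """confidence in [0, 1], e.g. derived from token probabilities."""
    return confidence if step_correct else -confidence

def trajectory_reward(steps, final_correct, beta=0.3):
    """steps: list of (step_correct, confidence) pairs for intermediate steps."""
    process = sum(step_reward(c, p) for c, p in steps) / max(len(steps), 1)
    outcome = 1.0 if final_correct else 0.0
    return (1 - beta) * outcome + beta * process

# Three intermediate steps: two confident-correct, one overconfident error.
print(trajectory_reward([(True, 0.9), (True, 0.8), (False, 0.95)],
                        final_correct=True))   # 0.7 + 0.3 * 0.25 = 0.775
```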
--------------------------------------------------------------------------------------------------------