Week Ending 1.11.2026
RESEARCH WATCH: 1.11.2026
CHDP: Cooperative Hybrid Diffusion Policies for Reinforcement Learning in Parameterized Action Space
Hybrid action spaces combining discrete choices with continuous parameters are common in robotics and game AI, yet optimizing them efficiently remains challenging. CHDP addresses this through a cooperative framework where two diffusion-based agents handle discrete and continuous actions respectively, with the continuous policy conditioned on discrete representations. A sequential update scheme prevents conflicts during training, while a codebook embeds high-dimensional discrete spaces into manageable latent representations. This approach could revolutionize robot manipulation tasks requiring both tool selection and precise movement control, or enable more sophisticated game AI that combines strategic decisions with fine-tuned execution parameters.
Authors: Bingyi Liu, Jinbo He, Haiyong Shi, Enshu Wang, Weizhen Han, Jingxiang Hao, Peixi Wang, Zhuangzhuang Zhang
Link: https://arxiv.org/abs/2601.05675v1
Date: 2026-01-d
Summary:
Hybrid action space, which combines discrete choices and continuous parameters, is prevalent in domains such as robot control and game AI. However, efficiently modeling and optimizing hybrid discrete-continuous action space remains a fundamental challenge, mainly due to limited policy expressiveness and poor scalability in high-dimensional settings. To address this challenge, we view the hybrid action space problem as a fully cooperative game and propose a Cooperative Hybrid Diffusion Policies (CHDP) framework to solve it. CHDP employs two cooperative agents that leverage a discrete and a continuous diffusion policy, respectively. The continuous policy is conditioned on the discrete action's representation, explicitly modeling the dependency between them. This cooperative design allows the diffusion policies to leverage their expressiveness to capture complex distributions in their respective action spaces. To mitigate the update conflicts arising from simultaneous policy updates in this cooperative setting, we employ a sequential update scheme that fosters co-adaptation. Moreover, to improve scalability when learning in high-dimensional discrete action space, we construct a codebook that embeds the action space into a low-dimensional latent space. This mapping enables the discrete policy to learn in a compact, structured space. Finally, we design a Q-function-based guidance mechanism to align the codebook's embeddings with the discrete policy's representation during training. On challenging hybrid action benchmarks, CHDP outperforms the state-of-the-art method by up to 19.3% in success rate.
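The cooperative structure is straightforward to sketch. Below is a minimal, hypothetical PyTorch illustration of three ingredients named in the abstract: a codebook embedding for discrete actions, a continuous policy conditioned on the chosen discrete representation, and a sequential (alternating) update scheme. The diffusion samplers are replaced with plain MLP heads for brevity and the critic signal is a stand-in, so this shows the wiring, not the authors' method.

```python
# Minimal sketch of CHDP-style cooperative wiring (not the authors' code).
import torch
import torch.nn as nn

OBS, N_DISCRETE, LATENT, ACT = 8, 16, 4, 2  # hypothetical sizes

codebook = nn.Embedding(N_DISCRETE, LATENT)            # embeds discrete actions
discrete_policy = nn.Sequential(nn.Linear(OBS, 64), nn.ReLU(),
                                nn.Linear(64, N_DISCRETE))
continuous_policy = nn.Sequential(nn.Linear(OBS + LATENT, 64), nn.ReLU(),
                                  nn.Linear(64, ACT))

opt_d = torch.optim.Adam(list(discrete_policy.parameters()) +
                         list(codebook.parameters()), lr=1e-3)
opt_c = torch.optim.Adam(continuous_policy.parameters(), lr=1e-3)

obs = torch.randn(32, OBS)

# Act: sample a discrete action, embed it, condition the continuous policy on it.
logits = discrete_policy(obs)
k = torch.distributions.Categorical(logits=logits).sample()
z = codebook(k)                                        # discrete representation
u = continuous_policy(torch.cat([obs, z], dim=-1))     # conditioned on z

# Sequential update scheme: update the discrete side while the continuous
# side is held fixed, then alternate, avoiding simultaneous-update conflicts.
fake_q = -(u ** 2).sum(-1).detach()                    # stand-in critic signal
logp = logits.log_softmax(-1).gather(1, k[:, None]).squeeze(1)
loss_d = -(logp * fake_q).mean()
opt_d.zero_grad(); loss_d.backward(); opt_d.step()

u = continuous_policy(torch.cat([obs, codebook(k).detach()], dim=-1))
loss_c = (u ** 2).sum(-1).mean()                       # placeholder objective
opt_c.zero_grad(); loss_c.backward(); opt_c.step()
```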
--------------------------------------------------------------------------------------------------------
Advancing credit mobility through stakeholder-informed AI design and adoption
Students transferring from two-year to four-year colleges often lose credits due to labor-intensive manual course articulation processes. This study develops AI methods to automatically suggest course equivalencies, working with the SUNY system to ensure practical adoption. Through stakeholder surveys and improved alignment techniques, the researchers achieved 5.5-fold accuracy improvements and 61% faculty adoption rates, projecting a 12-fold increase in recognized credit transfer opportunities. This work demonstrates how stakeholder-informed AI design can address educational equity issues, potentially saving students significant time and money while increasing bachelor's degree completion rates across higher education systems nationwide.
Authors: Yerin Kwak, Siddharth Adelkar, Zachary A. Pardos
Link: https://arxiv.org/abs/2601.05666v1
Date: 2026-01-d
Summary:
Transferring from a 2-year to a 4-year college is crucial for socioeconomic mobility, yet students often face challenges ensuring their credits are fully recognized, leading to delays in their academic progress and unexpected costs. Determining whether courses at different institutions are equivalent (i.e., articulation) is essential for successful credit transfer, as it minimizes unused credits and increases the likelihood of bachelor's degree completion. However, establishing articulation agreements remains time- and resource-intensive, as all candidate articulations are reviewed manually. Although recent efforts have explored the use of artificial intelligence to support this work, its use in articulation practice remains limited. Given these challenges and the need for scalable support, this study applies artificial intelligence to suggest articulations between institutions in collaboration with the State University of New York system, one of the largest systems of higher education in the US. To develop our methodology, we first surveyed articulation staff and faculty to assess adoption rates of baseline algorithmic recommendations and gather feedback on perceptions and concerns about these recommendations. Building on these insights, we developed a supervised alignment method that addresses superficial matching and institutional biases in catalog descriptions, achieving a 5.5-fold improvement in accuracy over previous methods. Based on articulation predictions of this method and a 61% average surveyed adoption rate among faculty and staff, these findings project a 12-fold increase in valid credit mobility opportunities that would otherwise remain unrealized. This study suggests that stakeholder-informed design of AI in higher education administration can expand student credit mobility and help reshape current institutional decision-making in course articulation.
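As a toy illustration of the retrieval step behind articulation suggestion, the sketch below ranks candidate four-year courses for each two-year course by catalog-description similarity. TF-IDF cosine similarity stands in for the paper's supervised alignment method (which explicitly corrects for superficial matching and institutional bias in descriptions), and the catalog entries are invented.

```python
# Baseline articulation-suggestion sketch (not the paper's supervised method).
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

two_year = {  # hypothetical catalog entries
    "MAT 101": "Introductory calculus: limits, derivatives, integrals.",
    "CSC 110": "Programming fundamentals in Python with basic data structures.",
}
four_year = {
    "MATH 1A": "Differential and integral calculus of one variable.",
    "CS 100":  "Introduction to programming and problem solving in Python.",
    "HIS 210": "Survey of modern European history.",
}

vec = TfidfVectorizer(stop_words="english")
X = vec.fit_transform(list(two_year.values()) + list(four_year.values()))
src, dst = X[: len(two_year)], X[len(two_year):]

sim = cosine_similarity(src, dst)
for i, course in enumerate(two_year):
    j = sim[i].argmax()  # best-matching four-year course
    print(f"{course} -> {list(four_year)[j]} (similarity {sim[i, j]:.2f})")
```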
--------------------------------------------------------------------------------------------------------
The Patent Trial and Appeal Board adjudicates thousands of appeals annually, requiring both technical understanding and structured legal reasoning. PILOT-Bench establishes the first systematic benchmark for evaluating LLMs on patent-domain legal reasoning, aligning PTAB decisions with USPTO data across three classification tasks following legal IRAC structure. Results reveal substantial gaps between commercial models (>0.75 F1) and open-source alternatives (0.56 F1), highlighting reasoning capacity differences. This benchmark enables systematic evaluation of AI systems for patent law applications, potentially supporting patent examiners, legal practitioners, and inventors in navigating complex appeal processes and improving consistency in patent adjudication decisions.
Authors: Yehoon Jang, Chaewon Lee, Hyun-seok Min, Sungchul Choi
Link: https://arxiv.org/abs/2601.04758v1
Date: 2026-01-d
Summary:
The Patent Trial and Appeal Board (PTAB) of the USPTO adjudicates thousands of ex parte appeals each year, requiring the integration of technical understanding and legal reasoning. While large language models (LLMs) are increasingly applied in patent and legal practice, their use has remained limited to lightweight tasks, with no established means of systematically evaluating their capacity for structured legal reasoning in the patent domain. In this work, we introduce PILOT-Bench, the first PTAB-centric benchmark that aligns PTAB decisions with USPTO patent data at the case level and formalizes three IRAC-aligned classification tasks: Issue Type, Board Authorities, and Subdecision. We evaluate a diverse set of closed-source (commercial) and open-source LLMs and conduct analyses across multiple perspectives, including input-variation settings, model families, and error tendencies. Notably, on the Issue Type task, closed-source models consistently exceed 0.75 in Micro-F1 score, whereas the strongest open-source model (Qwen-8B) achieves performance around 0.56, highlighting a substantial gap in reasoning capabilities. PILOT-Bench establishes a foundation for the systematic evaluation of patent-domain legal reasoning and points toward future directions for improving LLMs through dataset design and model alignment. All data, code, and benchmark resources are available at https://github.com/TeamLab/pilot-bench.
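Micro-F1, the metric quoted above, is simple to reproduce. A minimal scoring sketch for one of the IRAC-aligned classification tasks might look like the following; the label names are invented placeholders, and the benchmark's real data lives in the linked repository.

```python
# Scoring sketch for a PILOT-Bench-style classification task.
# Labels below are hypothetical stand-ins, not the benchmark's label set.
from sklearn.metrics import f1_score

gold = ["obviousness", "novelty", "obviousness", "eligibility"]
pred = ["obviousness", "obviousness", "obviousness", "eligibility"]

print("Micro-F1:", f1_score(gold, pred, average="micro"))  # 0.75 here
```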
--------------------------------------------------------------------------------------------------------
Cognitive Infrastructure: A Unified DCIM Framework for AI Data Centers
Managing AI data centers requires integrating diverse infrastructure components while optimizing for performance, sustainability, and automation. DCIM 3.0 presents a unified framework combining semantic reasoning through knowledge graphs, predictive thermal analytics, autonomous orchestration, and the Unified Device Connectivity Protocol. This architecture addresses critical challenges in managing GPU-intensive computing environments where thermal management and energy efficiency are paramount. Applications include optimizing cooling systems in real-time, predicting infrastructure failures before they occur, and enabling digital twin simulations for capacity planning. This framework could significantly reduce operational costs and environmental impact while improving reliability for organizations deploying large-scale AI training infrastructure.
Authors: Krishna Chaitanya Sunkara
Link: https://arxiv.org/abs/2601.04750v1
Date: 2026-01-d
Summary:
This work presents DCIM 3.0, a unified framework integrating semantic reasoning, predictive analytics, autonomous orchestration, and unified connectivity for next-generation AI data center management. The framework addresses critical challenges in infrastructure automation, sustainability, and digital-twin design through knowledge graph-based intelligence, thermal modeling, and the Unified Device Connectivity Protocol (UDCP).
Keywords: Data Center Infrastructure Management, DCIM, AI Data Centers, Knowledge Graphs, Digital Twin, Thermal Management, Infrastructure Automation, Sustainability, GPU Computing, Data Center
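A toy illustration of knowledge-graph-based DCIM reasoning: devices and cooling relations as a small graph, traversed to escalate a thermal anomaly. The device names, relations, and threshold are hypothetical and are not part of DCIM 3.0 or UDCP.

```python
# Toy knowledge-graph sketch of DCIM-style semantic reasoning (illustrative).
import networkx as nx

g = nx.DiGraph()
g.add_edge("rack-12", "crah-3", relation="cooled_by")
g.add_edge("gpu-node-7", "rack-12", relation="mounted_in")
g.nodes["gpu-node-7"]["inlet_temp_c"] = 41.0  # hypothetical telemetry

# Semantic query: which cooling unit serves a device running hot?
for node, data in g.nodes(data=True):
    if data.get("inlet_temp_c", 0) > 35:  # toy threshold
        rack = next(v for _, v, d in g.out_edges(node, data=True)
                    if d["relation"] == "mounted_in")
        crah = next(v for _, v, d in g.out_edges(rack, data=True)
                    if d["relation"] == "cooled_by")
        print(f"{node} runs hot; escalate to cooling unit {crah}")
```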
--------------------------------------------------------------------------------------------------------
When Single-Agent with Skills Replace Multi-Agent Systems and When They Fail
Multi-agent AI systems achieve complex reasoning through specialized agents but incur substantial computational costs. This work investigates whether single agents with skill libraries can achieve similar modularity more efficiently, finding that skill-based approaches reduce token usage and latency while maintaining accuracy. However, skill selection exhibits bounded capacity similar to human cognition, maintaining stability until a critical library size triggers sharp degradation. Semantic confusability among skills drives this limitation more than raw quantity. Applications include more efficient reasoning systems for enterprises seeking cost reduction, while the cognitively grounded framework informs the design of scalable AI architectures through hierarchical skill-organization principles borrowed from human decision-making.
Authors: Xiaoxiao Li
Link: https://arxiv.org/abs/2601.04748v1
Date: 2026-01-d
Summary:
Multi-agent AI systems have proven effective for complex reasoning. These systems are composed of specialized agents that collaborate through explicit communication, but they incur substantial computational overhead. A natural question arises: can we achieve similar modularity benefits with a single agent that selects from a library of skills? We explore this question by viewing skills as internalized agent behaviors. From this perspective, a multi-agent system can be compiled into an equivalent single-agent system, trading inter-agent communication for skill selection. Our preliminary experiments suggest this approach can substantially reduce token usage and latency while maintaining competitive accuracy on reasoning benchmarks. However, this efficiency raises a deeper question that has received little attention: how does skill selection scale as libraries grow?
Drawing on principles from cognitive science, we propose that LLM skill selection exhibits bounded capacity analogous to human decision-making. We investigate the scaling behavior of skill selection and observe a striking pattern. Rather than degrading gradually, selection accuracy remains stable up to a critical library size, then drops sharply, indicating a phase transition reminiscent of capacity limits in human cognition. Furthermore, we find evidence that semantic confusability among similar skills, rather than library size alone, plays a central role in this degradation. This perspective suggests that hierarchical organization, which has long helped humans manage complex choices, may similarly benefit AI systems. Our initial results with hierarchical routing support this hypothesis. This work opens new questions about the fundamental limits of semantic-based skill selection in LLMs and offers a cognitive-grounded framework and practical guidelines for designing scalable skill-based agents.
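The scaling question is easy to probe with a toy simulation. The sketch below models skill selection as nearest-neighbor matching against a library of random skill embeddings, with semantic confusability represented by query noise; accuracy stays high for small libraries and degrades as the library grows. The numbers are illustrative and the degradation here is gradual, so this is a probe of the question, not a reproduction of the paper's phase transition.

```python
# Toy simulation of skill-selection scaling (not the paper's experiment).
import numpy as np

rng = np.random.default_rng(0)
dim, noise = 32, 0.35  # higher noise ~ more confusable skill descriptions

for n_skills in (8, 32, 128, 512):
    skills = rng.normal(size=(n_skills, dim))
    skills /= np.linalg.norm(skills, axis=1, keepdims=True)  # unit embeddings
    target = rng.integers(n_skills, size=2000)               # intended skills
    query = skills[target] + noise * rng.normal(size=(2000, dim))
    picked = (query @ skills.T).argmax(axis=1)               # nearest neighbor
    print(f"library={n_skills:4d}  accuracy={np.mean(picked == target):.2f}")
```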
--------------------------------------------------------------------------------------------------------
Fast Mining and Dynamic Time-to-Event Prediction over Multi-sensor Data Streams
Predicting machine failures from real-time sensor data requires adapting to evolving patterns in dynamic industrial environments. TimeCast addresses this through a framework that identifies distinct time-evolving stages, learns individual models for each, and adapts predictions based on pattern shifts. By capturing time-varying interdependencies between multiple sensors while scaling linearly with input size, TimeCast enables online updates on streaming data. Applications include predictive maintenance for manufacturing equipment, reducing unexpected downtime and maintenance costs, energy infrastructure monitoring, and any industrial setting where early failure detection from sensor arrays could prevent costly disruptions or safety incidents.
Authors: Kota Nakamura, Koki Kawabata, Yasuko Matsubara, Yasushi Sakurai
Link: https://arxiv.org/abs/2601.04741v1
Date: 2026-01-d
Summary:
Given real-time sensor data streams obtained from machines, how can we continuously predict when a machine failure will occur? This work aims to continuously forecast the timing of future events by analyzing multi-sensor data streams. A key characteristic of real-world data streams is their dynamic nature, where the underlying patterns evolve over time. To address this, we present TimeCast, a dynamic prediction framework designed to adapt to these changes and provide accurate, real-time predictions of future event time. Our proposed method has the following properties: (a) Dynamic: it identifies distinct time-evolving patterns (i.e., stages) and learns individual models for each, enabling adaptive predictions based on pattern shifts; (b) Practical: it finds meaningful stages that capture time-varying interdependencies between multiple sensors and improve prediction performance; (c) Scalable: our algorithm scales linearly with the input size and enables online model updates on data streams. Extensive experiments on real datasets demonstrate that TimeCast provides higher prediction accuracy than state-of-the-art methods while detecting dynamic changes in data streams and greatly reducing computational time.
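Two of the properties above, stage identification and per-stage modeling, can be illustrated with a crude sketch: detect a mean shift in a synthetic sensor stream, then fit one linear degradation model per stage and extrapolate the failure time. This is a simplification for intuition, not TimeCast's algorithm; the stream, threshold, and change-point heuristic are all invented.

```python
# Crude stage-wise time-to-event sketch (not TimeCast itself).
import numpy as np

rng = np.random.default_rng(1)
t = np.arange(200.0)
# Synthetic sensor: slow drift, then a faster degradation stage after t=120.
x = np.where(t < 120, 0.1 * t, 12 + 0.5 * (t - 120)) + rng.normal(0, 1, 200)

# Crude change-point: the largest lag-w jump of a rolling mean.
w = 10
roll = np.convolve(x, np.ones(w) / w, mode="valid")
cp = int(np.argmax(np.abs(roll[w:] - roll[:-w]))) + w
stages = [(0, cp), (cp, len(t))]

threshold = 60.0  # sensor level treated as "failure" (hypothetical)
for lo, hi in stages:
    slope, intercept = np.polyfit(t[lo:hi], x[lo:hi], 1)  # per-stage model
    eta = (threshold - intercept) / slope if slope > 0 else float("inf")
    print(f"stage [{lo},{hi}): slope={slope:.2f}, predicted failure at t~{eta:.0f}")
```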
--------------------------------------------------------------------------------------------------------
Computational Compliance for AI Regulation: Blueprint for a New Research Domain
As AI regulations emerge globally, traditional compliance methods cannot scale to meet enforcement demands. This work argues for computational compliance: algorithms running throughout AI system lifecycles that automatically steer toward regulatory adherence. The authors specify design goals for such algorithms and propose benchmark datasets for measuring performance. This blueprint defines a new research domain addressing the growing gap between regulatory requirements and compliance capabilities. Applications include automated documentation generation for regulatory submissions, real-time monitoring systems that flag potential violations during model development, and governance tools helping organizations navigate complex multi-jurisdictional AI regulations while maintaining innovation velocity.
Authors: Bill Marino, Nicholas D. Lane
Link: https://arxiv.org/abs/2601.04474v1
Date: 2026-01-d
Summary:
The era of AI regulation (AIR) is upon us. But AI systems, we argue, will not be able to comply with these regulations at the necessary speed and scale by continuing to rely on traditional, analogue methods of compliance. Instead, we posit that compliance with these regulations will only realistically be achieved computationally: that is, with algorithms that run across the life cycle of an AI system, automatically steering it toward AIR compliance in the face of dynamic conditions. Yet despite their (we would argue) inevitability, the research community has yet to specify exactly how these algorithms for computational AIR compliance should behave - or how we should benchmark their performance. To fill these gaps, we specify a set of design goals for such algorithms. In addition, we specify a benchmark dataset that can be used to quantitatively measure whether individual algorithms satisfy these design goals. By delivering this blueprint, we hope to give shape to an important but uncrystallized new domain of research - and, in doing so, incite necessary investment in it.
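To make the idea concrete, here is one hypothetical shape such an algorithm could take: declarative checks bound to lifecycle stages, evaluated against the system's current state to emit a machine-readable report. The rule ids, stages, and predicates are all invented; the paper specifies design goals and a benchmark, not an implementation like this.

```python
# Hypothetical sketch of lifecycle compliance checks (illustrative only).
from dataclasses import dataclass
from typing import Callable

@dataclass
class Check:
    rule_id: str                       # e.g., an article of some AI regulation
    stage: str                         # "training", "evaluation", "deployment"
    predicate: Callable[[dict], bool]  # passes if the state is compliant

checks = [
    Check("AIR-doc-1", "training", lambda s: "data_sheet" in s),
    Check("AIR-eval-2", "evaluation", lambda s: s.get("bias_audit", False)),
]

def run_stage(stage: str, state: dict) -> list[str]:
    """Return the rule ids violated at this lifecycle stage."""
    return [c.rule_id for c in checks
            if c.stage == stage and not c.predicate(state)]

print(run_stage("training", {"data_sheet": "v1"}))     # []
print(run_stage("evaluation", {"bias_audit": False}))  # ['AIR-eval-2']
```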
--------------------------------------------------------------------------------------------------------
Cochlear implant surgery requires mastoidectomy—removing temporal bone to access the cochlea—but predicting this region from preoperative imaging remains challenging without ground-truth labels. This work proposes a hybrid self-supervised and weakly-supervised learning framework predicting mastoidectomy regions directly from preoperative CT scans, achieving 0.72 mean Dice score. The method enables constructing 3D postmastoidectomy surfaces before surgery, improving surgical planning and reducing risks. Applications include personalized surgical simulation for training, patient-specific risk assessment identifying anatomical challenges, optimized surgical approach planning, and potentially extending to other surgical procedures requiring preoperative prediction of tissue removal or modification regions.
Authors: Yike Zhang, Eduardo Davalos, Dingjie Su, Ange Lou, Jack Noble
Link: https://arxiv.org/abs/2601.04405v2
Date: 2026-01-d
Summary:
Cochlear Implant (CI) surgery treats severe hearing loss by inserting an electrode array into the cochlea to stimulate the auditory nerve. An important step in this procedure is mastoidectomy, which removes part of the mastoid region of the temporal bone to provide surgical access. Accurate mastoidectomy shape prediction from preoperative imaging improves pre-surgical planning, reduces risks, and enhances surgical outcomes. Despite its importance, deep-learning-based studies on this topic remain limited due to the challenges of acquiring ground-truth labels. We address this gap by investigating self-supervised and weakly-supervised learning models to predict the mastoidectomy region without human annotations. We propose a hybrid self-supervised and weakly-supervised learning framework to predict the mastoidectomy region directly from preoperative CT scans, where the mastoid remains intact. Our hybrid method achieves a mean Dice score of 0.72 when predicting the complex and boundary-less mastoidectomy shape, surpassing state-of-the-art approaches and demonstrating strong performance. The method provides groundwork for constructing 3D postmastoidectomy surfaces directly from the corresponding preoperative CT scans. To our knowledge, this is the first work to integrate self-supervised and weakly-supervised learning for mastoidectomy shape prediction, offering a robust and efficient solution for CI surgical planning while leveraging a 3D T-distribution loss in weakly-supervised medical imaging.
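The Dice score reported above is a standard overlap metric and is easy to compute. A minimal sketch, using random volumes as stand-ins for the predicted and reference mastoidectomy masks:

```python
# Dice score between a predicted and a reference binary mask, the metric
# behind the paper's reported 0.72 (arrays here are random stand-ins).
import numpy as np

def dice(pred: np.ndarray, ref: np.ndarray, eps: float = 1e-8) -> float:
    pred, ref = pred.astype(bool), ref.astype(bool)
    return 2.0 * np.logical_and(pred, ref).sum() / (pred.sum() + ref.sum() + eps)

rng = np.random.default_rng(0)
pred = rng.random((64, 64, 64)) > 0.5
ref = rng.random((64, 64, 64)) > 0.5
print(f"Dice: {dice(pred, ref):.3f}")  # ~0.5 for independent random masks
```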
--------------------------------------------------------------------------------------------------------
XAI-LAW: A Logic Programming Tool for Modeling, Explaining, and Learning Legal Decisions
This tool uses Answer Set Programming to model Italian Criminal Code articles and semi-automatically learn legal rules from prior judicial decisions. The system supports legal experts by providing reasoning frameworks and possible legal outcomes while handling contradictions that arise during encoding. Its automatic explainability features clarify judicial decision logic through stable model "supportedness," and integrated inductive logic programming generalizes rules from case examples. Applications include judicial decision support systems, legal education tools demonstrating logical reasoning in criminal law, consistency checking across verdicts, and potentially extending to other legal domains or jurisdictions requiring structured reasoning with explainable outcomes.
Authors: Agostino Dovier, Talissa Dreossi, Andrea Formisano, Benedetta Strizzolo
Link: https://arxiv.org/abs/2601.03844v1
Date: 2026-01-d
Summary:
We propose an approach to model articles of the Italian Criminal Code (ICC), using Answer Set Programming (ASP), and to semi-automatically learn legal rules from examples based on prior judicial decisions. The developed tool is intended to support legal experts during the criminal trial phase by providing reasoning and possible legal outcomes. The methodology involves analyzing and encoding articles of the ICC in ASP, including "crimes against the person" and property offenses. The resulting model is validated on a set of previous verdicts and refined as necessary. During the encoding process, contradictions may arise; these are properly handled by the system, which also generates possible decisions for new cases and provides explanations through a tool that leverages the "supportedness" of stable models. The automatic explainability offered by the tool can also be used to clarify the logic behind judicial decisions, making the decision-making process more interpretable. Furthermore, the tool integrates an inductive logic programming system for ASP, which is employed to generalize legal rules from case examples.
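For readers unfamiliar with ASP, the sketch below encodes a single invented theft-style rule and solves it with the clingo Python API (assuming the clingo package is installed). The rule is purely illustrative and is not taken from the authors' ICC encoding; atoms supported in the stable model are exactly the kind of "supportedness" the tool's explanations build on.

```python
# Toy ASP sketch in the spirit of XAI-LAW (requires the `clingo` package;
# the rule below is invented for illustration, not the authors' encoding).
import clingo

program = """
% Theft: taking another's movable property without the owner's consent.
theft(P) :- takes(P, Thing), owner(O, Thing), P != O, not consents(O, P).

takes(mario, bike).
owner(anna, bike).
"""

ctl = clingo.Control()
ctl.add("base", [], program)
ctl.ground([("base", [])])
ctl.solve(on_model=lambda m: print("Stable model:", m))  # includes theft(mario)
```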
--------------------------------------------------------------------------------------------------------
Systems Explaining Systems: A Framework for Intelligence and Consciousness
This paper proposes that intelligence and consciousness emerge from relational structure rather than prediction mechanisms. Intelligence is defined as forming and integrating causal connections through context enrichment, while consciousness emerges when recursive architectures allow higher-order systems to interpret lower-order patterns, creating dynamically stabilized meta-states. This framework reframes predictive processing as emergent from contextual interpretation and suggests recursive multi-system architectures may enable more human-like AI. Applications include designing next-generation AI architectures with improved self-awareness, developing systems capable of metacognitive reasoning, and informing debates about machine consciousness while providing practical guidelines for building AI systems exhibiting sophisticated cognitive capabilities.
Authors: Sean Niklas Semmler
Link: https://arxiv.org/abs/2601.04269v1
Date: 2026-01-d
Summary:
This paper proposes a conceptual framework in which intelligence and consciousness emerge from relational structure rather than from prediction or domain-specific mechanisms. Intelligence is defined as the capacity to form and integrate causal connections between signals, actions, and internal states. Through context enrichment, systems interpret incoming information using learned relational structure that provides essential context in an efficient representation that the raw input itself does not contain, enabling efficient processing under metabolic constraints.
Building on this foundation, we introduce the systems-explaining-systems principle, where consciousness emerges when recursive architectures allow higher-order systems to learn and interpret the relational patterns of lower-order systems across time. These interpretations are integrated into a dynamically stabilized meta-state and fed back through context enrichment, transforming internal models from representations of the external world into models of the system's own cognitive processes.
The framework reframes predictive processing as an emergent consequence of contextual interpretation rather than explicit forecasting and suggests that recursive multi-system architectures may be necessary for more human-like artificial intelligence.
--------------------------------------------------------------------------------------------------------
An Algorithmic Framework for Systematic Literature Reviews: A Case Study for Financial Narratives
Systematic literature reviews are time-consuming and subjective. This framework automates SLRs using NLP techniques, clustering algorithms, and interpretability tools, applying it to financial narratives research. The case study reveals that while progress exists in modeling how structured accounts of economic events influence markets, the field remains fragmented, often reducing narratives to sentiment analysis or topic modeling without unified theoretical frameworks. Applications include accelerating academic research synthesis across disciplines, identifying research gaps systematically, improving reproducibility in literature reviews, and specifically for finance, developing more sophisticated models of how narratives drive market dynamics and asset pricing.
Authors: Gabin Taibi, Joerg Osterrieder
Link: https://arxiv.org/abs/2601.03794v1
Date: 2026-01-d
Summary:
This paper introduces an algorithmic framework for conducting systematic literature reviews (SLRs), designed to improve efficiency, reproducibility, and selection quality assessment in the literature review process. The proposed method integrates Natural Language Processing (NLP) techniques, clustering algorithms, and interpretability tools to automate and structure the selection and analysis of academic publications. The framework is applied to a case study focused on financial narratives, an emerging area in financial economics that examines how structured accounts of economic events, formed by the convergence of individual interpretations, influence market dynamics and asset prices. Drawing from the Scopus database of peer-reviewed literature, the review highlights research efforts to model financial narratives using various NLP techniques. Results reveal that while advances have been made, the conceptualization of financial narratives remains fragmented, often reduced to sentiment analysis, topic modeling, or their combination, without a unified theoretical framework. The findings underscore the value of more rigorous and dynamic narrative modeling approaches and demonstrate the effectiveness of the proposed algorithmic SLR methodology.
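The clustering stage of such a pipeline can be sketched in a few lines: embed abstracts with TF-IDF, cluster them, and inspect each cluster's top terms to label research themes. The abstracts below are placeholders, and the paper's full pipeline also covers selection quality assessment and interpretability steps.

```python
# Sketch of the clustering stage of an algorithmic SLR (abstracts invented).
import numpy as np
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

abstracts = [
    "Narrative sentiment and stock returns in equity markets.",
    "Topic models of central bank communication and policy narratives.",
    "Sentiment analysis of earnings calls and analyst reactions.",
    "Latent topics in financial news and market volatility.",
]

vec = TfidfVectorizer(stop_words="english")
X = vec.fit_transform(abstracts)
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

terms = np.array(vec.get_feature_names_out())
for c in range(2):  # label each cluster by its highest-weight terms
    top = terms[km.cluster_centers_[c].argsort()[::-1][:3]]
    print(f"cluster {c}:", ", ".join(top))
```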
--------------------------------------------------------------------------------------------------------
Membox: Weaving Topic Continuity into Long-Range Memory for LLM Agents
Current LLM memory systems fragment dialogues into isolated utterances, damaging narrative flow and biasing retrieval toward lexical similarity rather than preserving topic continuity. Membox introduces a hierarchical memory architecture with a Topic Loom that groups consecutive same-topic turns into coherent "memory boxes," then links these via a Trace Weaver into long-range event timelines. Achieving a 68% F1 improvement on temporal reasoning while using fewer context tokens, Membox demonstrates a superior efficiency-effectiveness balance. Applications include more coherent conversational AI for customer service, personal assistant systems maintaining context across extended interactions, and therapy or coaching applications where preserving conversational threads enhances user experience.
Authors: Dehao Tao, Guoliang Ma, Yongfeng Huang, Minghu Jiang
Link: https://arxiv.org/abs/2601.03785v1
Date: 2026-01-d
Summary:
Human-agent dialogues often exhibit topic continuity, a stable thematic frame that evolves through temporally adjacent exchanges, yet most large language model (LLM) agent memory systems fail to preserve it. Existing designs follow a fragmentation-compensation paradigm: they first break dialogue streams into isolated utterances for storage, then attempt to restore coherence via embedding-based retrieval. This process irreversibly damages narrative and causal flow, while biasing retrieval towards lexical similarity. We introduce Membox, a hierarchical memory architecture centered on a Topic Loom that continuously monitors dialogue in a sliding-window fashion, grouping consecutive same-topic turns into coherent "memory boxes" at storage time. Sealed boxes are then linked by a Trace Weaver into long-range event-timeline traces, recovering macro-topic recurrences across discontinuities. Experiments on LoCoMo demonstrate that Membox achieves up to 68% F1 improvement on temporal reasoning tasks, outperforming competitive baselines (e.g., Mem0, A-MEM). Notably, Membox attains these gains while using only a fraction of the context tokens required by existing methods, highlighting a superior balance between efficiency and effectiveness. By explicitly modeling topic continuity, Membox offers a cognitively motivated mechanism for enhancing both coherence and efficiency in LLM agents.
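The storage-time grouping idea can be sketched simply: keep appending turns to the open "memory box" while they stay on topic, and seal the box at a topic shift. The sketch below uses TF-IDF cosine similarity and a hand-picked threshold as a stand-in topic signal; Membox's actual Topic Loom and Trace Weaver are more sophisticated.

```python
# Simplified "topic loom" sketch (not Membox's implementation): group
# consecutive same-topic turns into memory boxes at storage time.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

turns = [  # invented dialogue turns
    "Let's plan the trip to Kyoto in April.",
    "Kyoto hotels get booked early, reserve one now.",
    "Separately, my laptop battery drains too fast.",
    "Try lowering the screen brightness on the laptop.",
]

vec = TfidfVectorizer().fit(turns)
boxes, current = [], [turns[0]]
for turn in turns[1:]:
    sim = cosine_similarity(vec.transform([" ".join(current)]),
                            vec.transform([turn]))[0, 0]
    if sim > 0.05:           # same topic: keep weaving into the open box
        current.append(turn)
    else:                    # topic shift: seal the box, open a new one
        boxes.append(current)
        current = [turn]
boxes.append(current)
for i, box in enumerate(boxes):
    print(f"box {i}: {box}")
```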
--------------------------------------------------------------------------------------------------------
LLMs generate increasingly sophisticated multilingual text, raising concerns about personalized disinformation. This study examines personalization capabilities across 10 languages, testing 1,080 prompt combinations across 16 models to evaluate both misuse potential and beneficial applications. Results show personalization quality varies by target (demographic groups versus social media platforms) and language, with platform-targeted personalization affecting detectability more significantly, especially in English. Applications include developing detection systems for automated disinformation campaigns, understanding cross-linguistic variation in AI text generation quality, and informing content moderation strategies while recognizing legitimate uses of personalization in marketing, education, and communication.
Authors: Dominik Macko
Link: https://arxiv.org/abs/2601.03752v1
Date: 2026-01-d
Summary:
The capabilities of large language models to generate coherent multilingual text have improved continuously in recent years, raising concerns about their potential misuse. Previous research has shown that they can be misused for generation of personalized disinformation in multiple languages. It has also been observed that personalization negatively affects detectability of machine-generated texts; however, this has been studied in the English language only. In this work, we examine this phenomenon across 10 languages, focusing not only on potential misuse of personalization capabilities but also on the potential benefits they offer. Overall, we cover 1,080 combinations of various personalization aspects in the prompts, for which the texts are generated by 16 distinct language models (17,280 texts in total). Our results indicate that there are differences in personalization quality of the generated texts when targeting demographic groups and when targeting social-media platforms across languages. Personalization towards platforms affects detectability of the generated texts to a greater extent, especially in English, where personalization quality is the highest.
--------------------------------------------------------------------------------------------------------
Can AI Chatbots Provide Coaching in Engineering? Beyond Information Processing Toward Mastery
Engineering education faces dual disruption: eroding apprenticeship models and emerging AI coaching partners. This mixed-methods study (75 students, 7 faculty) examines whether AI chatbots foster mastery or merely deliver information. Findings reveal acceptance for technical problem-solving (M=3.84) but skepticism toward moral and contextual judgment, with faculty expressing stronger risk concerns (M=4.71 vs 4.14) and 64-71% of participants demanding strict confidentiality. The proposed multiplex coaching framework integrates human wisdom through expert-in-the-loop models, preserving apprenticeship depth while leveraging AI scalability. Applications include supplemental technical tutoring systems, freeing human mentors for judgment-intensive guidance, and hybrid educational platforms balancing accessibility with wisdom.
Authors: Junaid Qadir, Muhammad Adil Attique, Saleha Shoaib, Syed Ibrahim Ghaznavi
Link: https://arxiv.org/abs/2601.03693v1
Date: 2026-01-d
Summary:
Engineering education faces a double disruption: traditional apprenticeship models that cultivated judgment and tacit skill are eroding, just as generative AI emerges as an informal coaching partner. This convergence rekindles long-standing questions in the philosophy of AI and cognition about the limits of computation, the nature of embodied rationality, and the distinction between information processing and wisdom. Building on this rich intellectual tradition, this paper examines whether AI chatbots can provide coaching that fosters mastery rather than merely delivering information. We synthesize critical perspectives from decades of scholarship on expertise, tacit knowledge, and human-machine interaction, situating them within the context of contemporary AI-driven education. Empirically, we report findings from a mixed-methods study (N = 75 students, N = 7 faculty) exploring the use of a coaching chatbot in engineering education. Results reveal a consistent boundary: participants accept AI for technical problem solving (convergent tasks; M = 3.84 on a 1-5 Likert scale) but remain skeptical of its capacity for moral, emotional, and contextual judgment (divergent tasks). Faculty express stronger concerns over risk (M = 4.71 vs. M = 4.14, p = 0.003), and privacy emerges as a key requirement, with 64-71 percent of participants demanding strict confidentiality. Our findings suggest that while generative AI can democratize access to cognitive and procedural support, it cannot replicate the embodied, value-laden dimensions of human mentorship. We propose a multiplex coaching framework that integrates human wisdom within expert-in-the-loop models, preserving the depth of apprenticeship while leveraging AI scalability to enrich the next generation of engineering education.
--------------------------------------------------------------------------------------------------------
Towards Compositional Generalization of LLMs via Skill Taxonomy Guided Data Synthesis
LLMs struggle with compositional generalization due to long-tailed distributions of complex skill combinations. STEPS addresses this through Skill Taxonomy guided Entropy-based Post-training data Synthesis, organizing skills into interpretable hierarchical taxonomies using structural information theory. The framework formulates data synthesis as constrained information maximization, selecting skill combinations that maximize marginal structural information while preserving coherence. Outperforming existing baselines on instruction-following and agent-based evaluations, STEPS demonstrates improved compositional generalization. Applications include more capable AI assistants handling complex multi-step tasks, improved agent systems for automation workflows, and training data curation strategies for organizations developing specialized LLMs requiring robust generalization.
Authors: Yifan Wei, Li Du, Xiaoyan Yu, Yang Feng, Angsheng Li
Link: https://arxiv.org/abs/2601.03676v1
Date: 2026-01-d
Summary:
Large Language Models (LLMs) and agent-based systems often struggle with compositional generalization due to a data bottleneck in which complex skill combinations follow a long-tailed, power-law distribution, limiting both instruction-following performance and generalization in agent-centric tasks. To address this challenge, we propose STEPS, a Skill Taxonomy guided Entropy-based Post-training data Synthesis framework for generating compositionally challenging data. STEPS explicitly targets compositional generalization by uncovering latent relationships among skills and organizing them into an interpretable, hierarchical skill taxonomy using structural information theory. Building on this taxonomy, we formulate data synthesis as a constrained information maximization problem, selecting skill combinations that maximize marginal structural information within the hierarchy while preserving semantic coherence. Experiments on challenging instruction-following benchmarks show that STEPS outperforms existing data synthesis baselines, while also yielding improved compositional generalization in downstream agent-based evaluations.
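The constrained-maximization idea can be illustrated with a toy greedy loop: among candidate skill combinations, repeatedly pick the one adding the most not-yet-covered skill pairs, a crude stand-in for marginal structural information in the taxonomy. The skill names and budget are invented, and this ignores STEPS' coherence constraint and structural-entropy machinery.

```python
# Toy greedy selection in the spirit of STEPS (simplified: marginal gain is
# newly covered skill pairs, a stand-in for marginal structural information).
from itertools import combinations

skills = ["parse_json", "call_api", "summarize", "translate", "plan_steps"]
candidates = [frozenset(c) for k in (2, 3) for c in combinations(skills, k)]

def pairs(combo):
    """Skill pairs exercised by a combination."""
    return set(combinations(sorted(combo), 2))

selected, covered = [], set()
while candidates and len(selected) < 4:          # synthesis budget of 4
    best = max(candidates, key=lambda c: len(pairs(c) - covered))
    if not pairs(best) - covered:                # no marginal gain left
        break
    selected.append(sorted(best))
    covered |= pairs(best)
    candidates.remove(best)

print(selected)  # combinations chosen for maximal marginal pair coverage
```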
--------------------------------------------------------------------------------------------------------
Physics-Informed Neural Networks integrate physical laws with data but lack comprehensive uncertainty quantification for prognostics applications. This heteroscedastic Bayesian PINN framework jointly models epistemic and aleatoric uncertainty, providing full predictive posteriors for spatiotemporal insulation material aging. Validated on transformer insulation with finite-element models and solar power plant data, the approach improves accuracy and calibration over deterministic and dropout-based alternatives. Applications include risk-aware transformer asset management, preventive maintenance scheduling balancing cost against failure risk, power grid reliability planning, and extending to other critical infrastructure where uncertainty-aware degradation prediction informs asset lifecycle decisions and replacement strategies.
Authors: Ibai Ramirez, Jokin Alcibar, Joel Pino, Mikel Sanz, Jose I. Aizpurua
Link: https://arxiv.org/abs/2601.03673v1
Date: 2026-01-d
Summary:
Physics-Informed Neural Networks (PINNs) provide a framework for integrating physical laws with data. However, their application to Prognostics and Health Management (PHM) remains constrained by limited uncertainty quantification (UQ) capabilities. Most existing PINN-based prognostics approaches are deterministic or account only for epistemic uncertainty, limiting their suitability for risk-aware decision-making. This work introduces a heteroscedastic Bayesian Physics-Informed Neural Network (B-PINN) framework that jointly models epistemic and aleatoric uncertainty, yielding full predictive posteriors for spatiotemporal insulation material ageing estimation. The approach integrates Bayesian Neural Networks (BNNs) with physics-based residual enforcement and prior distributions, enabling probabilistic inference within a physics-informed learning architecture. The framework is evaluated on a transformer insulation ageing application, validated with a finite-element thermal model and field measurements from a solar power plant, and benchmarked against deterministic PINNs, dropout-based PINNs (d-PINNs), and alternative B-PINN variants. Results show that the proposed B-PINN provides improved predictive accuracy and better-calibrated uncertainty estimates than competing approaches. A systematic sensitivity study further analyzes the impact of boundary-condition, initial-condition, and residual sampling strategies on accuracy, calibration, and generalization. Overall, the findings highlight the potential of Bayesian physics-informed learning to support uncertainty-aware prognostics and informed decision-making in transformer asset management.
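The heteroscedastic ingredient is the easiest part to sketch: the network predicts both a mean and a log-variance, trained with a Gaussian negative log-likelihood plus a physics residual. The sketch below omits the Bayesian treatment of weights entirely and uses a generic first-order ODE (dy/dt = -y) as the physics term, not the paper's insulation ageing model.

```python
# Heteroscedastic PINN loss sketch (Bayesian weight inference omitted;
# the ODE and data are generic stand-ins, not the paper's ageing model).
import torch
import torch.nn as nn

net = nn.Sequential(nn.Linear(1, 64), nn.Tanh(), nn.Linear(64, 2))  # (mu, log_var)

t_data = torch.rand(128, 1)
y_data = torch.exp(-t_data) + 0.05 * torch.randn_like(t_data)  # noisy observations
t_phys = torch.rand(256, 1, requires_grad=True)                # collocation points

def losses():
    mu, log_var = net(t_data).chunk(2, dim=1)
    # Heteroscedastic Gaussian NLL: the network predicts its own noise level.
    nll = 0.5 * (log_var + (y_data - mu) ** 2 / log_var.exp()).mean()
    mu_p, _ = net(t_phys).chunk(2, dim=1)
    dmu = torch.autograd.grad(mu_p.sum(), t_phys, create_graph=True)[0]
    residual = ((dmu + mu_p) ** 2).mean()  # enforce dy/dt = -y (placeholder)
    return nll, residual

opt = torch.optim.Adam(net.parameters(), lr=1e-3)
for _ in range(200):
    nll, res = losses()
    loss = nll + res
    opt.zero_grad(); loss.backward(); opt.step()
print(f"nll={nll.item():.3f}  residual={res.item():.4f}")
```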
--------------------------------------------------------------------------------------------------------
Towards a Mechanistic Understanding of Propositional Logical Reasoning in Large Language Models
Understanding LLM internal reasoning mechanisms remains fundamental to AI interpretability. This work analyzes Qwen3 models on PropLogic-MI, revealing coherent computational architecture comprising Staged Computation (layer-wise phases), Information Transmission (boundary token aggregation), Fact Retrospection (persistent source fact access), and Specialized Attention Heads. These mechanisms generalize across scales, rule types, and reasoning depths, providing evidence that LLMs employ structured computational strategies. Applications include improving model architectures through mechanistic insights, developing more efficient reasoning systems, debugging logical failures by understanding information flow, and informing interpretability research seeking to understand and enhance LLM reasoning capabilities systematically.
Authors: Danchun Chen, Qiyao Yan, Liangming Pan
Link: https://arxiv.org/abs/2601.04260v1
Date: 2026-01-d
Summary:
Understanding how Large Language Models (LLMs) perform logical reasoning internally remains a fundamental challenge. While prior mechanistic studies focus on identifying task-specific circuits, they leave open the question of what computational strategies LLMs employ for propositional reasoning. We address this gap through comprehensive analysis of Qwen3 (8B and 14B) on PropLogic-MI, a controlled dataset spanning 11 propositional logic rule categories across one-hop and two-hop reasoning. Rather than asking "which components are necessary," we ask "how does the model organize computation?" Our analysis reveals a coherent computational architecture comprising four interlocking mechanisms: Staged Computation (layer-wise processing phases), Information Transmission (information flow aggregation at boundary tokens), Fact Retrospection (persistent re-access of source facts), and Specialized Attention Heads (functionally distinct head types). These mechanisms generalize across model scales, rule types, and reasoning depths, providing mechanistic evidence that LLMs employ structured computational strategies for logical reasoning.
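Probing attention aggregation at boundary tokens, one of the four mechanisms, can be sketched with standard tooling. The snippet below uses gpt2 as a lightweight stand-in for the Qwen3 models studied and measures how much the final token attends to sentence-boundary positions; the probe design is a rough illustration, not the paper's methodology.

```python
# Sketch of measuring attention mass at boundary tokens (gpt2 as a light
# stand-in for Qwen3; the boundary definition here is a crude heuristic).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

text = "If it rains, the ground is wet. It rains. Therefore, the ground is wet."
inputs = tok(text, return_tensors="pt")
with torch.no_grad():
    out = model(**inputs, output_attentions=True)

# out.attentions: one [batch, heads, seq, seq] tensor per layer.
period_id = tok(".").input_ids[0]
boundary = (inputs.input_ids[0] == period_id).nonzero().squeeze(-1)
for layer, att in enumerate(out.attentions):
    mass = att[0, :, -1, boundary].sum(-1).mean()  # last token -> boundaries
    print(f"layer {layer:2d}: attention mass on sentence boundaries = {mass:.3f}")
```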
--------------------------------------------------------------------------------------------------------
ReEfBench: Quantifying the Reasoning Efficiency of LLMs
Test-time scaling enables complex reasoning, but current evaluations cannot distinguish genuine reasoning from verbosity. ReEfBench proposes a neuro-symbolic framework for non-intrusive, process-centric reasoning evaluation, identifying four behavioral prototypes and diagnosing failure modes. Analysis reveals that extended token generation is not a prerequisite for deep reasoning, while mixing long and short Chain-of-Thought data risks premature saturation. Distillation into smaller models captures behavioral length but fails to replicate logical efficacy due to capacity limits. Applications include optimizing inference costs by identifying efficient reasoning patterns, improving training strategies that avoid saturation, guiding model selection balancing performance against computational requirements, and informing distillation techniques that preserve reasoning quality.
Authors: Zhizhang Fu, Yuancheng Gu, Chenkai Hu, Hanmeng Liu, Yue Zhang
Link: https://arxiv.org/abs/2601.03550v1
Date: 2026-01-d
Summary:
Test-time scaling has enabled Large Language Models (LLMs) to tackle complex reasoning, yet the limitations of current Chain-of-Thought (CoT) evaluation obscure whether performance gains stem from genuine reasoning or mere verbosity. To address this, (1) we propose a novel neuro-symbolic framework for the non-intrusive, comprehensive, process-centric evaluation of reasoning. (2) Through this lens, we identify four distinct behavioral prototypes and diagnose their failure modes. (3) We examine the impact of inference mode, training strategy, and model scale. Our analysis reveals that extended token generation is not a prerequisite for deep reasoning. Furthermore, we reveal critical constraints: mixing long and short CoT data in training risks premature saturation and collapse, while distillation into smaller models captures behavioral length but fails to replicate logical efficacy due to intrinsic capacity limits.
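As a back-of-the-envelope illustration of separating reasoning quality from verbosity, one can normalize accuracy by chain-of-thought length and label crude behavioral prototypes. The numbers and the prototype rules below are invented; ReEfBench's actual neuro-symbolic, process-centric evaluation is far richer than this.

```python
# Toy efficiency view (not ReEfBench's framework): accuracy vs. CoT length.
runs = [  # (model, mean CoT tokens, accuracy) - illustrative numbers only
    ("model-A", 350, 0.82),
    ("model-B", 2100, 0.84),
    ("model-C", 400, 0.61),
]

for name, tokens, acc in runs:
    eff = acc / (tokens / 1000)  # accuracy per 1k reasoning tokens
    prototype = ("efficient" if acc > 0.7 and eff > 1.0
                 else "verbose" if acc > 0.7 else "shallow")
    print(f"{name}: acc={acc:.2f}, tokens={tokens}, eff={eff:.2f} -> {prototype}")
```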
--------------------------------------------------------------------------------------------------------
Autonomous navigation in complex environments remains central to robotics. This bio-inspired model employs parallel place field layers at multiple spatial scales with replay-based rewards and dynamic scale fusion, mimicking mammalian hippocampal spatial representation. Simulations show improved path efficiency and accelerated learning versus single-scale baselines. Applications include warehouse robotics requiring efficient navigation in dynamic environments, autonomous vehicles mapping and navigating urban areas at multiple resolutions, drone delivery systems balancing broad area coverage with precise landing, and any robotics application where multiscale spatial understanding enhances adaptability and learning speed in partially observable environments.
Authors: Bekarys Dukenbaev, Andrew Gerstenslager, Alexander Johnson, Ali A. Minai
Link: https://arxiv.org/abs/2601.03520v1
Date: 2026-01-d
Summary:
Autonomous navigation in complex and partially observable environments remains a central challenge in robotics. Several bio-inspired models of mapping and navigation based on place cells in the mammalian hippocampus have been proposed. This paper introduces a new robust model that employs parallel layers of place fields at multiple spatial scales, a replay-based reward mechanism, and dynamic scale fusion. Simulations show that the model improves path efficiency and accelerates learning compared to single-scale baselines, highlighting the value of multiscale spatial representations for adaptive robot navigation.
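The multiscale idea can be sketched with two layers of Gaussian place fields at different widths whose population-vector estimates are fused by total layer activity, a crude stand-in for the model's dynamic scale fusion. All field counts, positions, and widths are invented, and the replay-based reward mechanism is omitted.

```python
# Toy multiscale place-field readout (illustrative, not the paper's model).
import numpy as np

rng = np.random.default_rng(0)
centers = {0.5: rng.uniform(0, 10, (50, 2)),   # fine-scale field centers
           2.0: rng.uniform(0, 10, (12, 2))}   # coarse-scale field centers

def activations(pos, c, sigma):
    d2 = ((c - pos) ** 2).sum(axis=1)
    return np.exp(-d2 / (2 * sigma ** 2))      # Gaussian place fields

pos = np.array([3.0, 7.0])                     # true position
estimate, weight_sum = np.zeros(2), 0.0
for sigma, c in centers.items():
    a = activations(pos, c, sigma)
    estimate += a @ c                          # activity-weighted field centers
    weight_sum += a.sum()                      # fuse layers by total activity
print("decoded position:", estimate / weight_sum)
```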
--------------------------------------------------------------------------------------------------------
Deploy-Master: Automating the Deployment of 50,000+ Agent-Ready Scientific Tools in One Day
Authors: Yi Wang, Zhenting Huang, Zhaohan Ding, Ruoxue Liao, Yuan Huang, Xinzijian Liu, Jiajun Xie, Siheng Chen, Linfeng Zhang
Link: https://arxiv.org/abs/2601.03513v1
Date: 2026-01-d
Summary:
Open-source scientific software is abundant, yet most tools remain difficult to compile, configure, and reuse, sustaining a small-workshop mode of scientific computing. This deployment bottleneck limits reproducibility, large-scale evaluation, and the practical integration of scientific tools into modern AI-for-Science (AI4S) and agentic workflows.
We present Deploy-Master, a one-stop agentic workflow for large-scale tool discovery, build specification inference, execution-based validation, and publication. Guided by a taxonomy spanning 90+ scientific and engineering domains, our discovery stage starts from a recall-oriented pool of over 500,000 public repositories and progressively filters it to 52,550 executable tool candidates under license- and quality-aware criteria. Deploy-Master transforms heterogeneous open-source repositories into runnable, containerized capabilities grounded in execution rather than documentation claims. In a single day, we performed 52,550 build attempts and constructed reproducible runtime environments for 50,112 scientific tools. Each successful tool is validated by a minimal executable command and registered in SciencePedia for search and reuse, enabling direct human use and optional agent-based invocation.
Beyond delivering runnable tools, we report a deployment trace at the scale of 50,000 tools, characterizing throughput, cost profiles, failure surfaces, and specification uncertainty that become visible only at scale. These results explain why scientific software remains difficult to operationalize and motivate shared, observable execution substrates as a foundation for scalable AI4S and agentic science.
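The execution-grounded validation loop can be sketched as build-then-smoke-test. The snippet below uses the Docker CLI; the repository path, image tag, and minimal command are placeholders, and Deploy-Master's real pipeline adds discovery, build-spec inference, and SciencePedia registration around this core.

```python
# Hypothetical build-and-validate loop (placeholders throughout; not
# Deploy-Master's pipeline, just the execution-grounding idea).
import subprocess

def build_and_validate(repo_path: str, image: str, smoke_cmd: list[str]) -> bool:
    """Containerize a tool, then confirm a minimal command actually runs."""
    build = subprocess.run(["docker", "build", "-t", image, repo_path],
                           capture_output=True)
    if build.returncode != 0:
        return False                  # build failure surface
    try:
        run = subprocess.run(["docker", "run", "--rm", image, *smoke_cmd],
                             capture_output=True, timeout=120)
    except subprocess.TimeoutExpired:
        return False                  # a hung smoke test counts as failure
    return run.returncode == 0        # grounded in execution, not docs

ok = build_and_validate("./some-tool", "tools/some-tool:latest",
                        ["some-tool", "--version"])
print("register in catalog" if ok else "record failure")
```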
--------------------------------------------------------------------------------------------------------