Week Ending 12.28.2025
RESEARCH WATCH: 12.28.2025
As data centers become major electricity consumers, their integration into power grids presents critical infrastructure challenges. This research addresses the voltage drops and feeder losses that occur when data centers connect to weak distribution networks. By formulating data center placement as a techno-economic optimization problem, the authors combine network constraints with economic factors like land prices and distributed generation costs. Their genetic algorithm approach achieved a 36% reduction in power losses while maintaining voltage quality in test scenarios. This framework could guide utilities and tech companies in planning sustainable data center expansions that balance grid stability with investment efficiency.
Authors: Amin Hajihasani, Mahmoud Modaresi
Link: https://arxiv.org/abs/2512.21987v1
Date: 2025-12-d
Summary:
Data centers are among the fastest-growing electricity consumers and can impose severe voltage drops and feeder losses when connected to weak distribution networks. This paper formulates a techno-economic siting problem in which each candidate data center site is mapped to a bus of the distribution network and is assumed to deploy on-site renewable generation and power electronic interfaces, resulting in a controllable net active power injection equivalent to distributed generation. A mixed-integer nonlinear optimization model is developed to jointly select the connection bus and size the DG capacity while respecting network operating limits. The objective combines three normalized terms including active power losses, a voltage deviation index capturing profile quality, and investment cost derived from location-dependent land price and unit DG cost. To address the discrete-continuous search space, an intelligent genetic algorithm is embedded in a multi-scenario decision framework with adaptive weight tuning. Three stakeholder scenarios prioritize losses, voltage quality, or techno-economic balance, and additional balanced scenarios are generated automatically until the optimal bus decision converges. A case study on the IEEE 33-bus radial system demonstrates the effectiveness of the approach. The converged design selects bus 14 with 1.10 MW DG, reducing total losses from 202.67 kW to 129.37 kW while improving the minimum bus voltage to 0.933 per unit at a moderate investment cost of 1.33 MUSD. The proposed framework provides an interpretable pathway to integrate economic indicators into distribution-aware data center siting.
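To make the weighted objective concrete, here is a minimal Python sketch of a genetic search over (bus, DG size) candidates that scores a normalized sum of losses, voltage deviation, and cost. The evaluate_power_flow surrogate, weights, and search ranges are illustrative assumptions, not the authors' load-flow model or tuned parameters:

# Minimal sketch of a GA over (bus, DG size) with a normalized weighted objective.
import random

N_BUSES = 33                 # IEEE 33-bus feeder used in the case study
DG_MIN, DG_MAX = 0.2, 2.0    # MW search range (assumed)

def evaluate_power_flow(bus, dg_mw):
    """Placeholder surrogate: returns (losses_kw, voltage_dev_index, cost_musd)."""
    losses = 200.0 - 60.0 * dg_mw / (1 + abs(bus - 17) / 10)   # toy loss model
    vdi = max(0.0, 0.08 - 0.02 * dg_mw)                        # toy voltage deviation
    cost = 0.5 + 0.7 * dg_mw + 0.01 * bus                      # toy land + DG cost
    return losses, vdi, cost

def fitness(candidate, weights=(0.4, 0.3, 0.3)):
    bus, dg = candidate
    losses, vdi, cost = evaluate_power_flow(bus, dg)
    # Normalize each term before the weighted sum (lower is better).
    terms = (losses / 202.67, vdi / 0.1, cost / 3.0)
    return sum(w * t for w, t in zip(weights, terms))

def genetic_search(pop_size=40, generations=60):
    pop = [(random.randint(2, N_BUSES), random.uniform(DG_MIN, DG_MAX))
           for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=fitness)
        survivors = pop[: pop_size // 2]
        children = [(random.choice(survivors)[0],
                     min(DG_MAX, max(DG_MIN,
                         random.choice(survivors)[1] + random.gauss(0, 0.1))))
                    for _ in range(pop_size - len(survivors))]
        pop = survivors + children
    return min(pop, key=fitness)

print(genetic_search())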
--------------------------------------------------------------------------------------------------------
Aerial World Model for Long-horizon Visual Generation and Navigation in 3D Space
Autonomous drone navigation in complex 3D environments requires more than obstacle avoidance—it demands semantic understanding of space. This work introduces ANWM, a world model that predicts future visual observations from past frames and actions, enabling drones to evaluate trajectory options based on plausibility. The innovative Future Frame Projection module provides geometric priors by projecting past views into future viewpoints, addressing the challenge of long-distance visual prediction. Applications span search-and-rescue operations, infrastructure inspection, delivery services, and autonomous surveying. By bridging low-level control with high-level semantic planning, this approach could advance UAV autonomy in urban environments, disaster zones, and large-scale outdoor operations.
Authors: Weichen Zhang, Peizhi Tang, Xin Zeng, Fanhang Man, Shiquan Yu, Zichao Dai, Baining Zhao, Hongjin Chen, Yu Shang, Wei Wu, Chen Gao, Xinlei Chen, Xin Wang, Yong Li, Wenwu Zhu
Link: https://arxiv.org/abs/2512.21887v1
Date: 2025-12-d
Summary:
Unmanned aerial vehicles (UAVs) have emerged as powerful embodied agents. One of their core abilities is autonomous navigation in large-scale three-dimensional environments. Existing navigation policies, however, are typically optimized for low-level objectives such as obstacle avoidance and trajectory smoothness, lacking the ability to incorporate high-level semantics into planning. To bridge this gap, we propose ANWM, an aerial navigation world model that predicts future visual observations conditioned on past frames and actions, thereby enabling agents to rank candidate trajectories by their semantic plausibility and navigational utility. ANWM is trained on 4-DoF UAV trajectories and introduces a physics-inspired module: Future Frame Projection (FFP), which projects past frames into future viewpoints to provide coarse geometric priors. This module mitigates representational uncertainty in long-distance visual generation and captures the mapping between 3D trajectories and egocentric observations. Empirical results demonstrate that ANWM significantly outperforms existing world models in long-distance visual forecasting and improves UAV navigation success rates in large-scale environments.
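As a rough illustration of what a Future-Frame-Projection-style geometric prior involves, the sketch below warps a past RGB-D view into an assumed future camera pose with a pinhole model. The intrinsics, depth map, and relative pose are synthetic placeholders, not ANWM's learned module:

# Reproject a past RGB-D frame into a future viewpoint (holes stay black).
import numpy as np

H, W = 120, 160
K = np.array([[100.0, 0, W / 2], [0, 100.0, H / 2], [0, 0, 1.0]])  # intrinsics

def project_past_to_future(rgb, depth, R, t):
    v, u = np.mgrid[0:H, 0:W]
    pix = np.stack([u, v, np.ones_like(u)], axis=-1).reshape(-1, 3).T  # 3 x N
    pts_past = np.linalg.inv(K) @ pix * depth.reshape(1, -1)           # back-project
    pts_future = R @ pts_past + t[:, None]                             # rigid motion
    proj = K @ pts_future
    z = proj[2]
    uv = np.round(proj[:2] / np.clip(z, 1e-6, None)).astype(int)
    out = np.zeros_like(rgb)
    ok = (z > 0) & (uv[0] >= 0) & (uv[0] < W) & (uv[1] >= 0) & (uv[1] < H)
    out[uv[1, ok], uv[0, ok]] = rgb.reshape(-1, 3)[ok]
    return out

# Toy inputs: flat-depth frame, drone moves 1 m forward along the camera axis.
rgb = np.random.randint(0, 255, (H, W, 3), dtype=np.uint8)
depth = np.full((H, W), 10.0)
future = project_past_to_future(rgb, depth, np.eye(3), np.array([0.0, 0.0, -1.0]))
print(future.shape)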
--------------------------------------------------------------------------------------------------------
MASFIN: A Multi-Agent System for Decomposed Financial Reasoning and Forecasting
Financial forecasting demands integration of diverse data sources while maintaining transparency and avoiding systematic biases like survivorship bias. MASFIN tackles this by deploying multiple AI agents that combine structured financial metrics with unstructured news, using GPT-4.1-nano for cost-efficient analysis. The system generates weekly equity portfolios with optimized allocation weights, achieving a 7.33% cumulative return across eight weeks while outperforming major indices in most periods. This modular architecture demonstrates how multi-agent systems can enhance financial decision-making through explicit bias mitigation and reproducible reasoning. Applications include algorithmic trading, portfolio management, risk assessment, and democratizing sophisticated quantitative analysis for smaller investment firms.
Authors: Marc S. Montalvo, Hamed Yaghoobian
Link: https://arxiv.org/abs/2512.21878v1
Date: 2025-12-d
Summary:
Recent advances in large language models (LLMs) are transforming data-intensive domains, with finance representing a high-stakes environment where transparent and reproducible analysis of heterogeneous signals is essential. Traditional quantitative methods remain vulnerable to survivorship bias, while many AI-driven approaches struggle with signal integration, reproducibility, and computational efficiency. We introduce MASFIN, a modular multi-agent framework that integrates LLMs with structured financial metrics and unstructured news, while embedding explicit bias-mitigation protocols. The system leverages GPT-4.1-nano for reproducible and cost-efficient inference and generates weekly portfolios of 15-30 equities with allocation weights optimized for short-term performance. In an eight-week evaluation, MASFIN delivered a 7.33% cumulative return, outperforming the S&P 500, NASDAQ-100, and Dow Jones benchmarks in six of eight weeks, albeit with higher volatility. These findings demonstrate the promise of bias-aware, generative AI frameworks for financial forecasting and highlight opportunities for modular multi-agent design to advance practical, transparent, and reproducible approaches in quantitative finance.
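For readers unfamiliar with how weekly picks compound into the reported figure, here is a minimal sketch of weighted weekly portfolio returns compounding into a cumulative return. The tickers, weights, and returns are invented placeholders, not the paper's portfolios:

# Compound weekly weighted-portfolio returns into a cumulative return.
weekly_portfolios = [
    {"AAA": 0.4, "BBB": 0.35, "CCC": 0.25},   # week 1 allocation weights (sum to 1)
    {"AAA": 0.5, "DDD": 0.5},                 # week 2 allocation weights
]
weekly_asset_returns = [
    {"AAA": 0.02, "BBB": -0.01, "CCC": 0.015},
    {"AAA": 0.005, "DDD": 0.03},
]

def cumulative_return(portfolios, asset_returns):
    growth = 1.0
    for weights, rets in zip(portfolios, asset_returns):
        assert abs(sum(weights.values()) - 1.0) < 1e-9, "weights must sum to 1"
        week_ret = sum(w * rets[t] for t, w in weights.items())
        growth *= 1.0 + week_ret          # compound week over week
    return growth - 1.0

print(f"cumulative return: {cumulative_return(weekly_portfolios, weekly_asset_returns):.2%}")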
--------------------------------------------------------------------------------------------------------
MoonBot: Modular and On-Demand Reconfigurable Robot Toward Moon Base Construction
Lunar base construction demands versatile robotic systems that maximize functionality within strict payload mass constraints. MoonBot addresses this through modular, reconfigurable design that adapts to varying lunar tasks and environmental conditions. The system successfully demonstrated essential operations including civil engineering, infrastructure transportation, deployment, and assistance with inflatable habitats. The modular approach allows mission planners to optimize robot configurations for specific phases of base construction, from initial site preparation to ongoing maintenance. Beyond lunar applications, this technology could revolutionize terrestrial construction in remote or hazardous environments, disaster response scenarios, and adaptive manufacturing systems requiring flexible automation.
Authors: Kentaro Uno, Elian Neppel, Gustavo H. Diaz, Ashutosh Mishra, Shamistan Karimov, A. Sejal Jain, Ayesha Habib, Pascal Pama, Hazal Gozbasi, Shreya Santra, Kazuya Yoshida
Link: https://arxiv.org/abs/2512.21853v1
Date: 2025-12-d
Summary:
The allure of lunar surface exploration and development has recently captured widespread global attention. Robots have proved to be indispensable for exploring uncharted terrains, uncovering and leveraging local resources, and facilitating the construction of future human habitats. In this article, we introduce the modular and on-demand reconfigurable robot (MoonBot), a modular and reconfigurable robotic system engineered to maximize functionality while operating within the stringent mass constraints of lunar payloads and adapting to varying environmental conditions and task requirements. This article details the design and development of MoonBot and presents a preliminary field demonstration that validates the proof of concept through the execution of milestone tasks simulating the establishment of lunar infrastructure. These tasks include essential civil engineering operations, infrastructural component transportation and deployment, and assistive operations with inflatable modules. Furthermore, we systematically summarize the lessons learned during testing, focusing on the connector design and providing valuable insights for the advancement of modular robotic systems in future lunar missions.
--------------------------------------------------------------------------------------------------------
IoT environments present constantly changing conditions that challenge fixed, device-specific programming logic. DeMe introduces a framework where large language models dynamically generate task-execution methods by incorporating decorations derived from hidden goals, learned experiences, and environmental feedback. Unlike traditional rule-based systems, decorations emerge from universal behavioral principles and observed conditions rather than hardcoded logic. The framework enables pre-decoration, post-decoration, and step-by-step modifications to ensure context-awareness and safety alignment. Applications include smart homes that adapt to occupant behaviors, industrial IoT systems responding to equipment failures, autonomous robots navigating novel situations, and healthcare devices adjusting protocols based on patient responses.
Authors: Hong Su
Link: https://arxiv.org/abs/2512.21817v1
Date: 2025-12-d
Summary:
Intelligent IoT systems increasingly rely on large language models (LLMs) to generate task-execution methods for dynamic environments. However, existing approaches lack the ability to systematically produce new methods when facing previously unseen situations, and they often depend on fixed, device-specific logic that cannot adapt to changing environmental conditions. In this paper, we propose Method Decoration (DeMe), a general framework that modifies the method-generation path of an LLM using explicit decorations derived from hidden goals, accumulated learned methods, and environmental feedback. Unlike traditional rule augmentation, decorations in DeMe are not hardcoded; instead, they are extracted from universal behavioral principles, experience, and observed environmental differences. DeMe enables the agent to reshuffle the structure of its method path (through pre-decoration, post-decoration, intermediate-step modification, and step insertion), thereby producing context-aware, safety-aligned, and environment-adaptive methods. Experimental results show that method decoration allows IoT devices to derive more appropriate methods when confronting unknown or faulty operating conditions.
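A minimal sketch of the decoration operations, applied to a method path represented as a list of step strings, is shown below; the example steps and decoration rules are illustrative assumptions rather than the paper's implementation:

# Apply pre-decoration, post-decoration, step modification, and step insertion.
def decorate_method(steps, pre=None, post=None, replace=None, insert=None):
    """Return a new method path with the requested decorations applied."""
    path = list(steps)
    if replace:                            # intermediate-step modification
        for i, new_step in replace.items():
            path[i] = new_step
    if insert:                             # step insertion: (index, step)
        for i, new_step in sorted(insert, reverse=True):
            path.insert(i, new_step)
    if pre:                                # pre-decoration
        path = list(pre) + path
    if post:                               # post-decoration
        path = path + list(post)
    return path

base = ["read_sensor", "open_valve", "log_result"]
feedback_aware = decorate_method(
    base,
    pre=["check_device_health"],            # derived from a hidden safety goal
    replace={1: "open_valve(slow_mode)"},   # adapt to an observed fault
    insert=[(2, "verify_pressure")],        # learned from past failures
    post=["report_to_gateway"],
)
print(feedback_aware)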
--------------------------------------------------------------------------------------------------------
Do Latent Tokens Think? A Causal and Adversarial Analysis of Chain-of-Continuous-Thought
Chain-of-Continuous-Thought (COCONUT) claims to enhance reasoning efficiency in large language models through latent tokens, but this investigation reveals fundamental limitations. Through steering and shortcut experiments, the authors demonstrate that COCONUT tokens function as uninterpretable placeholders rather than genuine reasoning representations. While resistant to perturbation, they promote dataset artifact exploitation over authentic problem-solving, inflating benchmark performance without corresponding reasoning capability. This finding has critical implications for AI safety and reliability, suggesting that opaque reasoning methods may conceal shortcut dependencies. The research underscores the importance of interpretability in AI systems and warns against accepting performance gains without understanding underlying mechanisms—crucial for deploying language models in high-stakes applications.
Authors: Yuyi Zhang, Boyu Tang, Tianjie Ju, Sufeng Duan, Gongshen Liu
Link: https://arxiv.org/abs/2512.21711v1
Date: 2025-12-d
Summary:
Latent tokens are gaining attention for enhancing reasoning in large language models (LLMs), yet their internal mechanisms remain unclear. This paper examines the problem from a reliability perspective, uncovering fundamental weaknesses: latent tokens function as uninterpretable placeholders rather than encoding faithful reasoning. While resistant to perturbation, they promote shortcut usage over genuine reasoning. We focus on Chain-of-Continuous-Thought (COCONUT), which claims better efficiency and stability than explicit Chain-of-Thought (CoT) while maintaining performance. We investigate this through two complementary approaches. First, steering experiments perturb specific token subsets, namely COCONUT and explicit CoT. Unlike CoT tokens, COCONUT tokens show minimal sensitivity to steering and lack reasoning-critical information. Second, shortcut experiments evaluate models under biased and out-of-distribution settings. Results on MMLU and HotpotQA demonstrate that COCONUT consistently exploits dataset artifacts, inflating benchmark performance without true reasoning. These findings reposition COCONUT as a pseudo-reasoning mechanism: it generates plausible traces that conceal shortcut dependence rather than faithfully representing reasoning processes.
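To illustrate the flavor of a steering experiment, the sketch below perturbs hidden states at chosen token positions and measures the shift in the next-token distribution. GPT-2 stands in for a COCONUT-trained model, and the perturbed positions are assumed indices, not actual continuous thoughts:

# Steer selected hidden-state positions and compare next-token distributions.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

latent_positions = [2, 3]   # assumed positions of the tokens being steered
scale = 5.0                 # steering strength

def steering_hook(module, inputs, output):
    hidden = output[0]
    noise = torch.randn_like(hidden[:, latent_positions, :]) * scale
    hidden[:, latent_positions, :] += noise
    return (hidden,) + output[1:]

ids = tok("The capital of France is", return_tensors="pt").input_ids
with torch.no_grad():
    base_logits = model(ids).logits[0, -1]
    handle = model.transformer.h[6].register_forward_hook(steering_hook)
    steered_logits = model(ids).logits[0, -1]
    handle.remove()

shift = torch.norm(torch.softmax(base_logits, -1) - torch.softmax(steered_logits, -1), p=1)
print(f"L1 shift in next-token distribution: {shift.item():.3f}")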
--------------------------------------------------------------------------------------------------------
A Unified Definition of Hallucination, Or: It's the World Model, Stupid
Despite years of research, hallucination remains a persistent challenge in language models. This work proposes that hallucination fundamentally stems from inaccurate internal world modeling that becomes observable to users—whether through contradicting knowledge bases or producing summaries inconsistent with source material. By varying the reference world model and knowledge conflict policies, the authors unify disparate hallucination definitions under one framework. This perspective forces evaluations to explicitly specify their assumed "world" or truth source, clarifies distinctions from planning errors, and provides common language for comparing benchmarks. The authors outline plans for synthetic benchmarks with fully specified world models to stress-test language model capabilities, potentially advancing solutions to this critical AI reliability problem.
Authors: Emmy Liu, Varun Gangal, Chelsea Zou, Xiaoqi Huang, Michael Yu, Alex Chang, Zhuofu Tao, Sachin Kumar, Steven Y. Feng
Link: https://arxiv.org/abs/2512.21577v1
Date: 2025-12-d
Summary:
Despite numerous attempts to solve the issue of hallucination since the inception of neural language models, it remains a problem in even frontier large language models today. Why is this the case? We walk through definitions of hallucination used in the literature from a historical perspective up to the current day, and fold them into a single definition of hallucination, wherein different prior definitions focus on different aspects of our definition. At its core, we argue that hallucination is simply inaccurate (internal) world modeling, in a form where it is observable to the user (e.g., stating a fact which contradicts a knowledge base, or producing a summary which contradicts a known source). By varying the reference world model as well as the knowledge conflict policy (e.g., knowledge base vs. in-context), we arrive at the different existing definitions of hallucination present in the literature. We argue that this unified view is useful because it forces evaluations to make clear their assumed "world" or source of truth, clarifies what should and should not be called hallucination (as opposed to planning or reward/incentive-related errors), and provides a common language to compare benchmarks and mitigation techniques. Building on this definition, we outline plans for a family of benchmarks in which hallucinations are defined as mismatches with synthetic but fully specified world models in different environments, and sketch out how these benchmarks can use such settings to stress-test and improve the world modeling components of language models.
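A minimal sketch of the definition in code: a claim counts as a hallucination when it mismatches a fully specified reference world model under an explicit knowledge-conflict policy. The toy world and claims below are assumptions for illustration only:

# Hallucination as a mismatch against a specified world model, under a conflict policy.
WORLD_MODEL = {"capital_of:France": "Paris", "capital_of:Australia": "Canberra"}

def is_hallucination(claim_key, claim_value, in_context=None, policy="world_first"):
    """Return True if the claim contradicts the chosen source of truth."""
    if policy == "context_first" and in_context and claim_key in in_context:
        reference = in_context[claim_key]      # in-context evidence wins conflicts
    else:
        reference = WORLD_MODEL.get(claim_key) # fall back to the reference world
    return reference is not None and reference != claim_value

print(is_hallucination("capital_of:Australia", "Sydney"))          # True
print(is_hallucination("capital_of:Australia", "Sydney",
                       in_context={"capital_of:Australia": "Sydney"},
                       policy="context_first"))                    # False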
--------------------------------------------------------------------------------------------------------
Broadening participation in computer science, particularly among underrepresented groups, requires innovative pedagogical approaches. Microtopia combines coding with design thinking, AI, IoT, and robotics through problem-based learning centered on UN Sustainable Development Goals. Ethnic minority girls organized into "nations" tackled sector-based projects spanning healthcare, transportation, and architecture. Statistical analysis revealed significant increases in confidence, enjoyment, and motivation when computing connected to sustainability and global challenges. This interdisciplinary approach demonstrates that contextualizing CS education within meaningful real-world problems can transform perceptions and increase engagement. Applications extend to educational programs targeting any underrepresented group, curriculum design emphasizing practical relevance, and STEM outreach initiatives.
Authors: Nadine Aburumman, Ju-Ling Shih, Cigdem Sengul, Monica Pereira
Link: https://arxiv.org/abs/2512.21214v1
Date: 2025-12-d
Summary:
This paper presents Microtopia, an interdisciplinary programme designed to broaden participation in computer science (CS) among ethnic minority girls. The programme combined coding with design thinking activities, incorporating Artificial Intelligence (AI), the Internet of Things (IoT), and Robotics as key technologies. Learning activities were formulated around the UN Sustainable Development Goals and the Chinese Five Elements philosophy to support problem-based learning. Pupils were organised into "nations" and engaged in sector-based projects (e.g., healthcare, transportation, fashion, tourism, food, architecture). Using pre- and post-questionnaires, we investigated how socioeconomic and ethnocultural factors influenced pupils' preconceptions of CS, and whether participation in Microtopia shifted their perceptions. Through statistical analysis of the questionnaire data, we identified significant increases in students' confidence, enjoyment, and motivation, particularly when computing was presented as relevant to sustainability and global challenges.
--------------------------------------------------------------------------------------------------------
LLM Personas as a Substitute for Field Experiments in Method Benchmarking
Field experiments provide credible benchmarks for methods in societal systems but impose prohibitive costs and delays on iterative development. This research proves that LLM-based persona simulation can serve as a valid substitute when methods observe only aggregate outcomes and evaluation is algorithm-blind. Under these conditions, replacing humans with personas is indistinguishable from changing evaluation populations. The authors provide information-theoretic bounds on required sample sizes to achieve decision-relevant discriminability matching field experiments. This framework could accelerate development cycles in recommendation systems, content moderation, interface design, and policy evaluation—enabling rapid prototyping and testing before expensive human trials while maintaining benchmark validity.
Authors: Enoch Hyunwook Kang
Link: https://arxiv.org/abs/2512.21080v1
Date: 2025-12-d
Summary:
Field experiments (A/B tests) are often the most credible benchmark for methods in societal systems, but their cost and latency create a major bottleneck for iterative method development. LLM-based persona simulation offers a cheap synthetic alternative, yet it is unclear whether replacing humans with personas preserves the benchmark interface that adaptive methods optimize against. We prove an if-and-only-if characterization: when (i) methods observe only the aggregate outcome (aggregate-only observation) and (ii) evaluation depends only on the submitted artifact and not on the algorithm's identity or provenance (algorithm-blind evaluation), swapping humans for personas is just a panel change from the method's point of view, indistinguishable from changing the evaluation population (e.g., New York to Jakarta). Furthermore, we move from validity to usefulness: we define an information-theoretic discriminability of the induced aggregate channel and show that making persona benchmarking as decision-relevant as a field experiment is fundamentally a sample-size question, yielding explicit bounds on the number of independent persona evaluations required to reliably distinguish meaningfully different methods at a chosen resolution.
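As a back-of-the-envelope illustration of the sample-size question, the sketch below uses a standard Hoeffding bound for outcomes in [0, 1] (an assumption about the outcome range, not the paper's exact constants) to estimate how many persona evaluations separate two methods whose true aggregate scores differ by at least delta:

# Hoeffding-style sample-size estimate for distinguishing two methods.
import math

def personas_needed(delta, alpha=0.05):
    # P(|mean - true| >= delta/2) <= 2 exp(-2 n (delta/2)^2), union over two methods.
    return math.ceil(math.log(4 / alpha) / (2 * (delta / 2) ** 2))

for delta in (0.10, 0.05, 0.02):
    print(f"gap {delta:.2f}: ~{personas_needed(delta)} evaluations per method")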
--------------------------------------------------------------------------------------------------------
Agentic Explainable Artificial Intelligence (Agentic XAI) Approach To Explore Better Explanation
Communicating AI explanations to non-experts remains challenging despite advances in explainable AI (XAI). This study introduces agentic XAI, combining SHAP-based explainability with multimodal large language models that iteratively refine explanations. Testing on Japanese rice yield data, the framework improved explanation quality by 30-33% through refinement rounds, peaking at rounds 3-4. However, excessive iteration degraded quality, revealing a bias-variance trade-off: early rounds lacked depth while over-refinement introduced verbosity and ungrounded abstractions. Both human experts and LLM evaluators confirmed this pattern. The findings suggest strategic early stopping optimizes practical utility, challenging assumptions about monotonic improvement and providing evidence-based design principles for systems requiring transparent, accessible AI explanations.
Authors: Tomoaki Yamaguchi, Yutong Zhou, Masahiro Ryo, Keisuke Katsura
Link: https://arxiv.org/abs/2512.21066v1
Date: 2025-12-d
Summary:
Explainable artificial intelligence (XAI) enables data-driven understanding of factor associations with response variables, yet communicating XAI outputs to laypersons remains challenging, hindering trust in AI-based predictions. Large language models (LLMs) have emerged as promising tools for translating technical explanations into accessible narratives, yet the integration of agentic AI, where LLMs operate as autonomous agents through iterative refinement, with XAI remains unexplored. This study proposes an agentic XAI framework combining SHAP-based explainability with multimodal LLM-driven iterative refinement to generate progressively enhanced explanations. As a use case, we tested this framework as an agricultural recommendation system using rice yield data from 26 fields in Japan. The Agentic XAI initially provided a SHAP result and explored how to improve the explanation through additional analysis iteratively across 11 refinement rounds (Rounds 0-10). Explanations were evaluated by human experts (crop scientists) (n=12) and LLMs (n=14) against seven metrics: Specificity, Clarity, Conciseness, Practicality, Contextual Relevance, Cost Consideration, and Crop Science Credibility. Both evaluator groups confirmed that the framework successfully enhanced recommendation quality with an average score increase of 30-33% from Round 0, peaking at Rounds 3-4. However, excessive refinement showed a substantial drop in recommendation quality, indicating a bias-variance trade-off where early rounds lacked explanation depth (bias) while excessive iteration introduced verbosity and ungrounded abstraction (variance), as revealed by metric-specific analysis. These findings suggest that strategic early stopping (regularization) is needed for optimizing practical utility, challenging assumptions about monotonic improvement and providing evidence-based design principles for agentic XAI systems.
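The early-stopping recommendation can be captured in a simple refine-then-evaluate loop; refine_explanation and score_explanation below are hypothetical stand-ins for the SHAP-plus-LLM agent and the human/LLM judging panel, and the toy scores merely mirror the reported rise-peak-decline pattern:

# Iterative refinement with patience-based early stopping.
def refine_with_early_stopping(initial, refine_explanation, score_explanation,
                               max_rounds=10, patience=2):
    best, best_score, stale = initial, score_explanation(initial), 0
    current = initial
    for _ in range(1, max_rounds + 1):
        current = refine_explanation(current)
        score = score_explanation(current)
        if score > best_score:
            best, best_score, stale = current, score, 0
        else:
            stale += 1                      # over-refinement: verbosity creeping in
        if stale >= patience:               # stop near the quality peak
            break
    return best, best_score

# Toy run: quality rises, peaks, then degrades.
scores = iter([0.50, 0.58, 0.63, 0.66, 0.65, 0.60, 0.55])
best, s = refine_with_early_stopping(
    "round-0 SHAP summary",
    refine_explanation=lambda text: text + " +refined",
    score_explanation=lambda text: next(scores),
)
print(best, s)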
--------------------------------------------------------------------------------------------------------
Can Agentic AI Match the Performance of Human Data Scientists?
Recent advances in large language models have automated many data science workflows, but can they match human experts who leverage domain knowledge? This research designs a prediction task where crucial information resides in image data rather than tabular features, creating scenarios where generic analytical workflows fail. Using property insurance data, experiments demonstrate that agentic AI relying on standard analytics falls short of methods incorporating domain-specific insights. Human data scientists identifying hidden variables through domain expertise consistently outperform generic AI agents. This highlights a fundamental limitation: current agentic AI systems struggle to recognize when domain knowledge is necessary and how to incorporate it, underscoring critical areas for future development.
Authors: An Luo, Jin Du, Fangqiao Tian, Xun Xian, Robert Specht, Ganghua Wang, Xuan Bi, Charles Fleming, Jayanth Srinivasa, Ashish Kundu, Mingyi Hong, Jie Ding
Link: https://arxiv.org/abs/2512.20959v1
Date: 2025-12-d
Summary:
Data science plays a critical role in transforming complex data into actionable insights across numerous domains. Recent developments in large language models (LLMs) have significantly automated data science workflows, but a fundamental question persists: Can these agentic AI systems truly match the performance of human data scientists who routinely leverage domain-specific knowledge? We explore this question by designing a prediction task where a crucial latent variable is hidden in relevant image data instead of tabular features. As a result, agentic AI that generates generic code for modeling tabular data cannot perform well, while human experts could identify the important hidden variable using domain knowledge. We demonstrate this idea with a synthetic dataset for property insurance. Our experiments show that agentic AI that relies on a generic analytics workflow falls short of methods that use domain-specific insights. This highlights a key limitation of the current agentic AI for data science and underscores the need for future research to develop agentic AI systems that can better recognize and incorporate domain knowledge.
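The experimental idea can be illustrated with a toy dataset in which the label depends on a latent variable visible only in the image channel, so a model fit on tabular features alone underperforms; the data shapes and linear readout below are assumptions, not the paper's insurance dataset:

# Label depends on an image-derived latent; tabular-only fit misses it.
import numpy as np
from numpy.linalg import lstsq

rng = np.random.default_rng(0)
n = 2000
tabular = rng.normal(size=(n, 5))                    # generic tabular features
latent = rng.integers(0, 2, size=n)                  # e.g., damage visible only in photos
images = latent[:, None, None] + 0.1 * rng.normal(size=(n, 8, 8))   # latent drawn into pixels
y = tabular @ np.array([0.3, -0.2, 0.1, 0.0, 0.0]) + 2.0 * latent + 0.1 * rng.normal(size=n)

def r2(X, y):
    X1 = np.c_[X, np.ones(len(X))]
    resid = y - X1 @ lstsq(X1, y, rcond=None)[0]
    return 1 - resid.var() / y.var()

print("tabular only     R2:", round(r2(tabular, y), 3))
print("tabular + latent R2:", round(r2(np.c_[tabular, images.mean(axis=(1, 2))], y), 3))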
--------------------------------------------------------------------------------------------------------
NVIDIA's Nemotron 3 Nano represents a new generation of efficient language models using a Mixture-of-Experts hybrid Mamba-Transformer architecture. Pretrained on 25 trillion tokens, it achieves superior accuracy compared to its predecessor while activating less than half the parameters per forward pass, delivering up to 3.3x higher inference throughput than similarly-sized models. The model demonstrates enhanced agentic reasoning and chat abilities, and supports context lengths up to 1 million tokens. Applications span enterprise chatbots, code generation, document analysis, and on-device AI where computational efficiency is paramount. By balancing accuracy with inference speed, this architecture addresses the growing demand for powerful yet deployable language models in resource-constrained environments.
Authors: NVIDIA, :, Aaron Blakeman, Aaron Grattafiori, Aarti Basant, Abhibha Gupta, Abhinav Khattar, Adi Renduchintala, Aditya Vavre, Akanksha Shukla, Akhiad Bercovich, Aleksander Ficek, Aleksandr Shaposhnikov, Alex Kondratenko, Alexander Bukharin, Alexandre Milesi, Ali Taghibakhshi, Alisa Liu, Amelia Barton, Ameya Sunil Mahabaleshwarkar, Amir Klein, Amit Zuker, Amnon Geifman, Amy Shen, Anahita Bhiwandiwalla, Andrew Tao, Ann Guan, Anubhav Mandarwal, Arham Mehta, Ashwath Aithal, Ashwin Poojary, Asif Ahamed, Asma Kuriparambil Thekkumpate, Ayush Dattagupta, Banghua Zhu, Bardiya Sadeghi, Barnaby Simkin, Ben Lanir, Benedikt Schifferer, Besmira Nushi, Bilal Kartal, Bita Darvish Rouhani, Boris Ginsburg, Brandon Norick, Brandon Soubasis, Branislav Kisacanin, Brian Yu, Bryan Catanzaro, Carlo del Mundo, Chantal Hwang, Charles Wang, Cheng-Ping Hsieh, Chenghao Zhang, Chenhan Yu, Chetan Mungekar, Chintan Patel, Chris Alexiuk, Christopher Parisien, Collin Neale, Damon Mosk-Aoyama, Dan Su, Dane Corneil, Daniel Afrimi, Daniel Rohrer, Daniel Serebrenik, Daria Gitman, Daria Levy, Darko Stosic, David Mosallanezhad, Deepak Narayanan, Dhruv Nathawani, Dima Rekesh, Dina Yared, Divyanshu Kakwani, Dong Ahn, Duncan Riach, Dusan Stosic, Edgar Minasyan, Edward Lin, Eileen Long, Eileen Peters Long, Elena Lantz, Ellie Evans, Elliott Ning, Eric Chung, Eric Harper, Eric Tramel, Erick Galinkin, Erik Pounds, Evan Briones, Evelina Bakhturina, Faisal Ladhak, Fay Wang, Fei Jia, Felipe Soares, Feng Chen, Ferenc Galko, Frankie Siino, Gal Hubara Agam, Ganesh Ajjanagadde, Gantavya Bhatt, Gargi Prasad, George Armstrong, Gerald Shen, Gorkem Batmaz, Grigor Nalbandyan, Haifeng Qian, Harsh Sharma, Hayley Ross, Helen Ngo, Herman Sahota, Hexin Wang, Himanshu Soni, Hiren Upadhyay, Huizi Mao, Huy C Nguyen, Huy Q Nguyen, Iain Cunningham, Ido Shahaf, Igor Gitman, Ilya Loshchilov, Ivan Moshkov, Izzy Putterman, Jan Kautz, Jane Polak Scowcroft, Jared Casper, Jatin Mitra, Jeffrey Glick, Jenny Chen, Jesse Oliver, Jian Zhang, Jiaqi Zeng, Jie Lou, Jimmy Zhang, Jining Huang, Joey Conway, Joey Guman, John Kamalu, Johnny Greco, Jonathan Cohen, Joseph Jennings, Joyjit Daw, Julien Veron Vialard, Junkeun Yi, Jupinder Parmar, Kai Xu, Kan Zhu, Kari Briski, Katherine Cheung, Katherine Luna, Keshav Santhanam, Kevin Shih, Kezhi Kong, Khushi Bhardwaj, Krishna C. Puvvada, Krzysztof Pawelec, Kumar Anik, Lawrence McAfee, Laya Sleiman, Leon Derczynski, Li Ding, Lucas Liebenwein, Luis Vega, Maanu Grover, Maarten Van Segbroeck, Maer Rodrigues de Melo, Makesh Narsimhan Sreedhar, Manoj Kilaru, Maor Ashkenazi, Marc Romeijn, Mark Cai, Markus Kliegl, Maryam Moosaei, Matvei Novikov, Mehrzad Samadi, Melissa Corpuz, Mengru Wang, Meredith Price, Michael Boone, Michael Evans, Miguel Martinez, Mike Chrzanowski, Mohammad Shoeybi, Mostofa Patwary, Nabin Mulepati, Natalie Hereth, Nave Assaf, Negar Habibi, Neta Zmora, Netanel Haber, Nicola Sessions, Nidhi Bhatia, Nikhil Jukar, Nikki Pope, Nikolai Ludwig, Nima Tajbakhsh, Nirmal Juluru, Oleksii Hrinchuk, Oleksii Kuchaiev, Olivier Delalleau, Oluwatobi Olabiyi, Omer Ullman Argov, Ouye Xie, Parth Chadha, Pasha Shamis, Pavlo Molchanov, Pawel Morkisz, Peter Dykas, Peter Jin, Pinky Xu, Piotr Januszewski, Pranav Prashant Thombre, Prasoon Varshney, Pritam Gundecha, Qing Miao, Rabeeh Karimi Mahabadi, Ran El-Yaniv, Ran Zilberstein, Rasoul Shafipour, Rich Harang, Rick Izzo, Rima Shahbazyan, Rishabh Garg, Ritika Borkar, Ritu Gala, Riyad Islam, Roger Waleffe, Rohit Watve, Roi Koren, Ruoxi Zhang, Russell J. 
Hewett, Ryan Prenger, Ryan Timbrook, Sadegh Mahdavi, Sahil Modi, Samuel Kriman, Sanjay Kariyappa, Sanjeev Satheesh, Saori Kaji, Satish Pasumarthi, Sean Narentharen, Sean Narenthiran, Seonmyeong Bak, Sergey Kashirsky, Seth Poulos, Shahar Mor, Shanmugam Ramasamy, Shantanu Acharya, Shaona Ghosh, Sharath Turuvekere Sreenivas, Shelby Thomas, Shiqing Fan, Shreya Gopal, Shrimai Prabhumoye, Shubham Pachori, Shubham Toshniwal, Shuoyang Ding, Siddharth Singh, Simeng Sun, Smita Ithape, Somshubra Majumdar, Soumye Singhal, Stefania Alborghetti, Stephen Ge, Sugam Dipak Devare, Sumeet Kumar Barua, Suseella Panguluri, Suyog Gupta, Sweta Priyadarshi, Syeda Nahida Akter, Tan Bui, Teodor-Dumitru Ene, Terry Kong, Thanh Do, Tijmen Blankevoort, Tom Balough, Tomer Asida, Tomer Bar Natan, Tugrul Konuk, Twinkle Vashishth, Udi Karpas, Ushnish De, Vahid Noorozi, Vahid Noroozi, Venkat Srinivasan, Venmugil Elango, Vijay Korthikanti, Vitaly Kurin, Vitaly Lavrukhin, Wanli Jiang, Wasi Uddin Ahmad, Wei Du, Wei Ping, Wenfei Zhou, Will Jennings, William Zhang, Wojciech Prazuch, Xiaowei Ren, Yashaswi Karnati, Yejin Choi, Yev Meyer, Yi-Fu Wu, Yian Zhang, Ying Lin, Yonatan Geifman, Yonggan Fu, Yoshi Subara, Yoshi Suhara, Yubo Gao, Zach Moshe, Zhen Dong, Zihan Liu, Zijia Chen, Zijie Yan
Link: https://arxiv.org/abs/2512.20848v1
Date: 2025-12-d
Summary:
We present Nemotron 3 Nano 30B-A3B, a Mixture-of-Experts hybrid Mamba-Transformer language model. Nemotron 3 Nano was pretrained on 25 trillion text tokens, including more than 3 trillion new unique tokens over Nemotron 2, followed by supervised fine tuning and large-scale RL on diverse environments. Nemotron 3 Nano achieves better accuracy than our previous generation Nemotron 2 Nano while activating less than half of the parameters per forward pass. It achieves up to 3.3x higher inference throughput than similarly-sized open models like GPT-OSS-20B and Qwen3-30B-A3B-Thinking-2507, while also being more accurate on popular benchmarks. Nemotron 3 Nano demonstrates enhanced agentic, reasoning, and chat abilities and supports context lengths up to 1M tokens. We release both our pretrained Nemotron 3 Nano 30B-A3B Base and post-trained Nemotron 3 Nano 30B-A3B checkpoints on Hugging Face.
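To see why a Mixture-of-Experts layer touches only a fraction of its parameters per token, here is a generic top-k routing sketch; the sizes and k are illustrative and do not reflect Nemotron 3 Nano's actual configuration:

# Generic top-k MoE layer: only the routed experts' weights are used per token.
import torch, torch.nn as nn, torch.nn.functional as F

class TopKMoE(nn.Module):
    def __init__(self, d_model=256, d_ff=1024, n_experts=16, k=2):
        super().__init__()
        self.k = k
        self.router = nn.Linear(d_model, n_experts)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        )

    def forward(self, x):                          # x: (tokens, d_model)
        gates = F.softmax(self.router(x), dim=-1)
        weights, idx = gates.topk(self.k, dim=-1)  # route each token to k experts
        out = torch.zeros_like(x)
        for slot in range(self.k):
            for e, expert in enumerate(self.experts):
                mask = idx[:, slot] == e
                if mask.any():
                    out[mask] += weights[mask, slot, None] * expert(x[mask])
        return out

moe = TopKMoE()
total = sum(p.numel() for p in moe.experts.parameters())
active = total * moe.k / len(moe.experts)
print(f"expert params touched per token: {active / total:.1%}")
print(moe(torch.randn(4, 256)).shape)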
--------------------------------------------------------------------------------------------------------
Safety Alignment of LMs via Non-cooperative Games
Ensuring language model safety without compromising usefulness presents a fundamental AI alignment challenge. This work reframes safety alignment as a non-zero-sum game between Attacker and Defender language models trained jointly through online reinforcement learning. Each continuously adapts to the other's evolving strategies, driving iterative improvement. Using preference-based rewards from pairwise comparisons rather than point-wise scores provides robust supervision and reduces reward hacking. The resulting AdvGame approach simultaneously improves helpfulness and adversarial resilience, shifting the safety-utility Pareto frontier. Additionally, the trained Attacker converges into a general-purpose red-teaming agent for probing arbitrary models. This paradigm could advance AI safety evaluation, content moderation systems, and robust model deployment.
Authors: Anselm Paulus, Ilia Kulikov, Brandon Amos, Rémi Munos, Ivan Evtimov, Kamalika Chaudhuri, Arman Zharmagambetov
Link: https://arxiv.org/abs/2512.20806v1
Date: 2025-12-d
Summary:
Ensuring the safety of language models (LMs) while maintaining their usefulness remains a critical challenge in AI alignment. Current approaches rely on sequential adversarial training: generating adversarial prompts and fine-tuning LMs to defend against them. We introduce a different paradigm: framing safety alignment as a non-zero-sum game between an Attacker LM and a Defender LM trained jointly via online reinforcement learning. Each LM continuously adapts to the other's evolving strategies, driving iterative improvement. Our method uses a preference-based reward signal derived from pairwise comparisons instead of point-wise scores, providing more robust supervision and potentially reducing reward hacking. Our RL recipe, AdvGame, shifts the Pareto frontier of safety and utility, yielding a Defender LM that is simultaneously more helpful and more resilient to adversarial attacks. In addition, the resulting Attacker LM converges into a strong, general-purpose red-teaming agent that can be directly deployed to probe arbitrary target models.
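A minimal sketch of a preference-based reward: a judge compares two responses and the winner probability (Bradley-Terry style) becomes the training signal. The judge_logit heuristic below is a hypothetical stand-in for the pairwise judge used in AdvGame:

# Pairwise preference reward instead of a point-wise score.
import math

def judge_logit(prompt, response_a, response_b):
    """Hypothetical judge: positive logit means A is preferred over B."""
    helpfulness = lambda r: len(r.split()) - 10.0 * ("cannot help" in r)
    return (helpfulness(response_a) - helpfulness(response_b)) / 5.0

def preference_reward(prompt, response, reference_response):
    # Reward for `response` = probability the judge prefers it over the reference.
    logit = judge_logit(prompt, response, reference_response)
    return 1.0 / (1.0 + math.exp(-logit))

r = preference_reward("How do I secure my router?",
                      "Change default credentials and enable WPA3.",
                      "I cannot help with that.")
print(f"preference reward: {r:.3f}")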
--------------------------------------------------------------------------------------------------------
Cube Bench: A Benchmark for Spatial Visual Reasoning in MLLMs
Spatial and sequential reasoning capabilities of multimodal language models remain poorly understood. Cube Bench uses Rubik's cube solving to evaluate five skills: reconstructing cube faces, choosing optimal moves, predicting outcomes, executing multi-step plans, and detecting errors. Testing seven MLLMs reveals accuracy drops sharply with scramble depth, models rarely recover from mistakes, and high perception accuracy doesn't guarantee competent action selection. A pronounced gap emerges between closed and open-source models, with open-weight models approaching chance on difficult settings. Simple self-correction yields modest gains but introduces overthinking. This compact, reproducible benchmark provides insights for developing embodied AI, robotic manipulation, planning systems, and spatial reasoning capabilities.
Authors: Dhruv Anand, Ehsan Shareghi
Link: https://arxiv.org/abs/2512.20595v1
Date: 2025-12-d
Summary:
We introduce Cube Bench, a Rubik's-cube benchmark for evaluating spatial and sequential reasoning in multimodal large language models (MLLMs). The benchmark decomposes performance into five skills: (i) reconstructing cube faces from images and text, (ii) choosing the optimal next move, (iii) predicting the outcome of a candidate move without applying it, (iv) executing multi-step plans while recovering from mistakes, and (v) detecting and revising one's own errors. Using a shared set of scrambled cube states, identical prompts and parsers, and a single distance-to-solved metric, we compare recent MLLMs side by side as a function of scramble depth. Across seven MLLMs, accuracy drops sharply with depth; once a trajectory stalls or diverges, models rarely recover, and high face-reconstruction accuracy does not guarantee competent action selection or multi-step execution. A pronounced closed- vs open-source gap emerges: the strongest closed model leads on both single-step perception tasks and multi-step control tasks, while open-weight models cluster near chance on the hardest settings; yet even the best MLLM degrades at higher cube complexity. A simple self-correction via reflective thinking yields modest gains but can also introduce overthinking. Cube Bench offers a compact, reproducible probe of sequential spatial reasoning in MLLMs.
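A compact sketch of a distance-to-solved evaluation sweep over scramble depth is shown below; the cube is replaced by a trivial one-dimensional stand-in, and scramble, propose_move, apply_move, and distance_to_solved are hypothetical placeholders for the benchmark's simulator and the MLLM under test:

# Success rate as a function of scramble depth, using a single distance metric.
import random

def evaluate(scramble, propose_move, apply_move, distance_to_solved,
             depths=(1, 3, 5, 7), max_steps=20, episodes=50):
    results = {}
    for depth in depths:
        solved = 0
        for _ in range(episodes):
            state = scramble(depth)
            for _ in range(max_steps):
                if distance_to_solved(state) == 0:
                    solved += 1
                    break
                state = apply_move(state, propose_move(state))
        results[depth] = solved / episodes
    return results   # success rate per scramble depth

# Toy stand-ins: the state is simply its distance from solved.
def scramble(depth): return depth
def distance_to_solved(state): return max(state, 0)
def propose_move(state): return -1 if random.random() < 0.7 else +1   # imperfect agent
def apply_move(state, move): return state + move

print(evaluate(scramble, propose_move, apply_move, distance_to_solved))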
--------------------------------------------------------------------------------------------------------
Synthesizing Procedural Memory: Challenges and Architectures in Automated Workflow Generation
While executable code represents optimal agentic procedural memory, autonomously synthesizing this memory remains underexplored. This paper operationalizes how large language models transition from passive tool-users to active workflow architects. Through cross-service orchestration involving Outlook and OneDrive, the authors identify four bottlenecks: the Discovery Gap in navigating tool registries, Verification Gap in grounding tool responses, Decomposition Gap addressed through Linear State Anchoring, and Scaling Gap focused on concurrency and persistence. By enforcing a scientific methodology of hypothesize, probe, and code, agents autonomously write robust, production-grade skills. Applications include enterprise automation, API integration, workflow optimization, and autonomous software development—advancing toward truly self-improving AI systems.
Authors: Nishant Gaurav, Adit Akarsh, Ankit Ranjan, Manoj Bajaj
Link: https://arxiv.org/abs/2512.20278v1
Date: 2025-12-d
Summary:
While CodeMem establishes executable code as the optimal representation for agentic procedural memory, the mechanism for autonomously synthesizing this memory from a blank slate remains underexplored. This paper operationalizes the transition of Large Language Models from passive tool-users to active workflow architects. Through a high-fidelity case study of a cross-service orchestration task involving Outlook and OneDrive, we identify and address four structural bottlenecks in automated skill generation: the Discovery Gap involving navigation of large tool registries, the Verification Gap regarding grounding tool response structures, the Decomposition Gap which replaces inefficient search with Linear State Anchoring, and the Scaling Gap focused on concurrency and persistence. We demonstrate that by enforcing a scientific methodology of hypothesize, probe, and code, agents can autonomously write robust, production-grade code skills.
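The hypothesize-probe-code discipline can be sketched as follows: issue a cheap probe call, record the observed response schema, and only then emit code restricted to verified fields. call_tool and the field names are hypothetical, not the Outlook/OneDrive APIs from the case study:

# Probe a tool's response schema, then synthesize a skill grounded in it.
import json

def call_tool(name, **kwargs):
    """Hypothetical tool endpoint returning a canned response for probing."""
    return {"value": [{"id": "msg-1", "subject": "Q4 report", "hasAttachments": True}]}

def probe_schema(name, **kwargs):
    sample = call_tool(name, **kwargs)
    return sorted(sample["value"][0].keys())      # verified, not assumed, fields

def synthesize_skill(schema):
    # Generate a small skill whose field accesses are restricted to probed keys.
    fields = ", ".join(f'item["{k}"]' for k in schema if k != "id")
    return f"def list_messages(resp):\n    return [({fields}) for item in resp['value']]\n"

schema = probe_schema("mail.list", top=1)         # hypothesis -> probe
skill_code = synthesize_skill(schema)             # -> code grounded in the probe
print(json.dumps(schema), "\n", skill_code)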
--------------------------------------------------------------------------------------------------------
Concept Generalization in Humans and Large Language Models: Insights from the Number Game
Understanding how humans and AI systems generalize concepts reveals fundamental differences in learning and reasoning. Using the number game—a concept inference task—this research compares human and LLM generalization through Bayesian modeling. Humans flexibly infer both rule-based and similarity-based concepts, demonstrating few-shot generalization even from single examples. In contrast, LLMs rely more heavily on mathematical rules and require more samples to generalize. The Bayesian model captured human behavior more accurately than LLMs, highlighting differences in inductive biases and inference strategies. These insights inform AI development, mathematical education tools, cognitive science research, and designing systems that better align with human learning patterns.
Authors: Arghavan Bazigaran, Hansem Sohn
Link: https://arxiv.org/abs/2512.20162v1
Date: 2025-12-d
Summary:
We compare human and large language model (LLM) generalization in the number game, a concept inference task. Using a Bayesian model as an analytical framework, we examined the inductive biases and inference strategies of humans and LLMs. The Bayesian model captured human behavior better than LLMs in that humans flexibly infer rule-based and similarity-based concepts, whereas LLMs rely more on mathematical rules. Humans also demonstrated a few-shot generalization, even from a single example, while LLMs required more samples to generalize. These contrasts highlight the fundamental differences in how humans and LLMs infer and generalize mathematical concepts.
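For context, the classic Bayesian number-game model uses a hypothesis space over 1-100 and a size-principle likelihood, which is what makes one or two examples so informative; the hypothesis subset and uniform prior below are an illustrative simplification, not the paper's full model:

# Bayesian number game: posterior over hypotheses and generalization probability.
RANGE = range(1, 101)
HYPOTHESES = {
    "even":            {n for n in RANGE if n % 2 == 0},
    "multiples_of_10": {n for n in RANGE if n % 10 == 0},
    "powers_of_2":     {n for n in RANGE if (n & (n - 1)) == 0},
    "between_15_25":   set(range(15, 26)),
}
PRIOR = {h: 1 / len(HYPOTHESES) for h in HYPOTHESES}

def posterior(examples):
    scores = {}
    for h, ext in HYPOTHESES.items():
        if all(x in ext for x in examples):
            scores[h] = PRIOR[h] * (1 / len(ext)) ** len(examples)  # size principle
        else:
            scores[h] = 0.0
    z = sum(scores.values())
    return {h: s / z for h, s in scores.items()}

def p_in_concept(y, examples):
    post = posterior(examples)
    return sum(p for h, p in post.items() if y in HYPOTHESES[h])

print(posterior([16]))                 # one example already concentrates belief
print(p_in_concept(32, [16, 8, 2]))    # strong rule-based generalization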
--------------------------------------------------------------------------------------------------------
ABBEL: LLM Agents Acting through Belief Bottlenecks Expressed in Language
As sequential decision-making tasks lengthen, maintaining full interaction histories becomes computationally impractical. ABBEL introduces a framework where LLM agents maintain concise contexts through belief states—natural language summaries of task-relevant discoveries. At each step, agents update prior beliefs with new observations to form posterior beliefs, using only posteriors to select actions. This maintains near-constant memory use while generating interpretable beliefs. However, belief bottlenecks propagate errors, initially causing inferior performance versus full context. Applying reinforcement learning with belief grading and length penalties enables ABBEL to surpass full-context performance while using less memory. Applications include long-horizon robotics, conversational AI, game playing, and resource-constrained deployment scenarios.
Authors: Aly Lidayan, Jakob Bjorner, Satvik Golechha, Kartik Goyal, Alane Suhr
Link: https://arxiv.org/abs/2512.20111v1
Date: 2025-12-d
Summary:
As the length of sequential decision-making tasks increases, it becomes computationally impractical to keep full interaction histories in context. We introduce a general framework for LLM agents to maintain concise contexts through multi-step interaction: Acting through Belief Bottlenecks Expressed in Language (ABBEL), and methods to further improve ABBEL agents with RL post-training. ABBEL replaces long multi-step interaction history by a belief state, i.e., a natural language summary of what has been discovered about task-relevant unknowns. Under ABBEL, at each step the agent first updates a prior belief with the most recent observation from the environment to form a posterior belief, then uses only the posterior to select an action. We systematically evaluate frontier models under ABBEL across six diverse multi-step environments, finding that ABBEL supports generating interpretable beliefs while maintaining near-constant memory use over interaction steps. However, bottleneck approaches are generally prone to error propagation, which we observe causing inferior performance when compared to the full context setting due to errors in belief updating. Therefore, we train LLMs to generate and act on beliefs within the ABBEL framework via reinforcement learning (RL). We experiment with belief grading, to reward higher quality beliefs, as well as belief length penalties to reward more compressed beliefs. Our experiments demonstrate the ability of RL to improve ABBEL's performance beyond the full context setting, while using less memory than contemporaneous approaches.
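The ABBEL loop itself is simple: carry only a natural-language belief, update it with each observation, and act from the posterior alone. In the sketch below, update_belief and choose_action are hypothetical stand-ins for the LLM calls, and the environment is a trivial number-guessing task:

# Belief-bottleneck agent loop: prior + observation -> posterior -> action.
def run_abbel_episode(env_step, update_belief, choose_action,
                      initial_belief="nothing known yet", max_steps=10):
    belief, obs = initial_belief, None
    for _ in range(max_steps):
        belief = update_belief(belief, obs)     # prior + observation -> posterior
        action, done = choose_action(belief)    # act from the bottleneck only
        if done:
            return belief, action
        obs = env_step(action)
    return belief, None

# Toy environment: find a hidden number by halving the candidate range.
hidden = 37
def env_step(guess): return f"{guess}:{'higher' if hidden > guess else 'lower'}"
def update_belief(belief, obs):
    lo, hi = map(int, belief.split("-")) if "-" in belief else (1, 100)
    if obs:
        guess, direction = obs.split(":")
        if direction == "higher": lo = int(guess) + 1
        else: hi = int(guess)
    return f"{lo}-{hi}"                          # the belief is just a compact summary
def choose_action(belief):
    lo, hi = map(int, belief.split("-"))
    return (lo + hi) // 2, lo == hi

print(run_abbel_episode(env_step, update_belief, choose_action))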
--------------------------------------------------------------------------------------------------------
Collaborative learning enables distributed agents to cooperatively train models while preserving privacy, with knowledge distillation facilitating efficient knowledge transfer. However, mechanisms by which distillation leverages memory and knowledge across agents remain underexplored. This comprehensive review categorizes memory and knowledge within distillation processes, examining distributed, hierarchical, and decentralized learning patterns. The authors emphasize task heterogeneity across federated learning, multi-agent domain adaptation, federated multi-modal learning, continual learning, multi-task learning, and graph knowledge embedding. They highlight challenges including model heterogeneity, data heterogeneity, resource constraints, and privacy concerns. This framework advances understanding of collaborative learning systems, informing development of edge AI, federated systems, privacy-preserving machine learning, and distributed intelligence applications.
Authors: Pengchao Han, Xi Huang, Yi Fang, Guojun Han
Link: https://arxiv.org/abs/2512.19972v1
Date: 2025-12-d
Summary:
Collaborative learning has emerged as a key paradigm in large-scale intelligent systems, enabling distributed agents to cooperatively train their models while addressing their privacy concerns. Central to this paradigm is knowledge distillation (KD), a technique that facilitates efficient knowledge transfer among agents. However, the underlying mechanisms by which KD leverages memory and knowledge across agents remain underexplored. This paper aims to bridge this gap by offering a comprehensive review of KD in collaborative learning, with a focus on the roles of memory and knowledge. We define and categorize memory and knowledge within the KD process and explore their interrelationships, providing a clear understanding of how knowledge is extracted, stored, and shared in collaborative settings. We examine various collaborative learning patterns, including distributed, hierarchical, and decentralized structures, and provide insights into how memory and knowledge dynamics shape the effectiveness of KD in collaborative learning. Particularly, we emphasize task heterogeneity in distributed learning pattern covering federated learning (FL), multi-agent domain adaptation (MADA), federated multi-modal learning (FML), federated continual learning (FCL), federated multi-task learning (FMTL), and federated graph knowledge embedding (FKGE). Additionally, we highlight model heterogeneity, data heterogeneity, resource heterogeneity, and privacy concerns of these tasks. Our analysis categorizes existing work based on how they handle memory and knowledge. Finally, we discuss existing challenges and propose future directions for advancing KD techniques in the context of collaborative learning.
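At the core of most surveyed methods is response-based distillation: a student matches a peer or aggregated teacher's softened output distribution on shared inputs, so raw data never leaves an agent. The temperature, layer sizes, and random public batch below are illustrative assumptions:

# Response-based knowledge distillation on a shared (public) batch.
import torch, torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, T=2.0):
    # KL(teacher || student) on temperature-softened distributions, scaled by T^2.
    return F.kl_div(F.log_softmax(student_logits / T, dim=-1),
                    F.softmax(teacher_logits / T, dim=-1),
                    reduction="batchmean") * T * T

student = torch.nn.Linear(16, 10)
teacher = torch.nn.Linear(16, 10)          # stands in for a peer agent's model
public_batch = torch.randn(32, 16)         # shared unlabeled data, no raw-data exchange

loss = distillation_loss(student(public_batch), teacher(public_batch).detach())
loss.backward()                            # only the student receives gradients
print(float(loss))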
--------------------------------------------------------------------------------------------------------
LeLaR: The First In-Orbit Demonstration of an AI-Based Satellite Attitude Controller
Satellite attitude control traditionally relies on time-consuming classical controller design sensitive to model uncertainties. Deep reinforcement learning offers adaptive alternatives by learning control strategies through simulation interaction. This work presents the first successful in-orbit demonstration of an AI-based attitude controller, deployed on the InnoCube 3U nanosatellite launched January 2025. The controller, trained entirely in simulation, overcame the Sim2Real gap to successfully perform inertial pointing maneuvers. Steady-state metrics confirm robust performance comparable to classical PD controllers during repeated in-orbit operations. This breakthrough demonstrates viability of AI-based spacecraft control, enabling applications in satellite constellation management, autonomous space missions, adaptive orbital maneuvers, and reducing costs associated with traditional controller development.
Authors: Kirill Djebko, Tom Baumann, Erik Dilger, Frank Puppe, Sergio Montenegro
Link: https://arxiv.org/abs/2512.19576v2
Date: 2025-12-d
Summary:
Attitude control is essential for many satellite missions. Classical controllers, however, are time-consuming to design and sensitive to model uncertainties and variations in operational boundary conditions. Deep Reinforcement Learning (DRL) offers a promising alternative by learning adaptive control strategies through autonomous interaction with a simulation environment. Overcoming the Sim2Real gap, which involves deploying an agent trained in simulation onto the real physical satellite, remains a significant challenge. In this work, we present the first successful in-orbit demonstration of an AI-based attitude controller for inertial pointing maneuvers. The controller was trained entirely in simulation and deployed to the InnoCube 3U nanosatellite, which was developed by the Julius-Maximilians-Universität Würzburg in cooperation with the Technische Universität Berlin, and launched in January 2025. We present the AI agent design, the methodology of the training procedure, the discrepancies between the simulation and the observed behavior of the real satellite, and a comparison of the AI-based attitude controller with the classical PD controller of InnoCube. Steady-state metrics confirm the robust performance of the AI-based controller during repeated in-orbit maneuvers.
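For reference, the classical PD baseline the learned controller is compared against can be sketched on a single axis: torque from attitude error and body rate, integrated through a rigid-body model. The gains, inertia, and step size are illustrative, not InnoCube's values:

# Single-axis PD attitude controller with simple Euler integration.
import math

def simulate_pd(theta0=math.radians(60), inertia=0.05, kp=0.02, kd=0.04,
                dt=0.1, steps=600):
    theta, omega = theta0, 0.0             # pointing error [rad], body rate [rad/s]
    for _ in range(steps):
        torque = -kp * theta - kd * omega  # PD law on the inertial-pointing error
        omega += (torque / inertia) * dt   # single-axis rigid-body dynamics
        theta += omega * dt
    return math.degrees(theta)             # residual error after the maneuver

print(f"residual pointing error: {simulate_pd():.3f} deg")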
--------------------------------------------------------------------------------------------------------
Activations as Features: Probing LLMs for Generalizable Essay Scoring Representations
Automated essay scoring faces challenges in cross-prompt settings due to diverse scoring criteria. While previous work focused on LLM outputs, this research explores whether intermediate layer activations provide valuable information. The authors used activations to fit probes for cross-prompt essay scoring, analyzing effects of different models and input content. By computing essay directions across trait dimensions under various prompts, they analyzed how LLMs adapt evaluation perspectives to essay types and traits. Results demonstrate activations possess strong discriminative power and LLMs effectively handle scoring criteria diversity by adapting perspectives. Applications include scalable educational assessment, writing feedback systems, standardized test grading, and personalized learning platforms requiring nuanced evaluation across diverse contexts.
Authors: Jinwei Chi, Ke Wang, Yu Chen, Xuanye Lin, Qiang Xu
Link: https://arxiv.org/abs/2512.19456v1
Date: 2025-12-d
Summary:
Automated essay scoring (AES) is a challenging task in cross-prompt settings due to the diversity of scoring criteria. While previous studies have focused on the output of large language models (LLMs) to improve scoring accuracy, we believe activations from intermediate layers may also provide valuable information. To explore this possibility, we evaluated the discriminative power of LLMs' activations in cross-prompt essay scoring task. Specifically, we used activations to fit probes and further analyzed the effects of different models and input content of LLMs on this discriminative power. By computing the directions of essays across various trait dimensions under different prompts, we analyzed the variation in evaluation perspectives of large language models concerning essay types and traits. Results show that the activations possess strong discriminative power in evaluating essay quality and that LLMs can adapt their evaluation perspectives to different traits and essay types, effectively handling the diversity of scoring criteria in cross-prompt settings.
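The probing recipe can be sketched as mean-pooled hidden states from a frozen language model feeding a ridge regressor; GPT-2 and the toy essays and scores below are stand-ins for the models and cross-prompt data used in the paper:

# Fit a ridge probe on intermediate-layer activations of a frozen LM.
import torch, numpy as np
from transformers import AutoModel, AutoTokenizer
from sklearn.linear_model import Ridge

tok = AutoTokenizer.from_pretrained("gpt2")
lm = AutoModel.from_pretrained("gpt2", output_hidden_states=True).eval()

def activation_features(texts, layer=6):
    feats = []
    with torch.no_grad():
        for t in texts:
            ids = tok(t, return_tensors="pt", truncation=True, max_length=256)
            hidden = lm(**ids).hidden_states[layer]                  # (1, seq, dim)
            feats.append(hidden.mean(dim=1).squeeze(0).numpy())      # mean-pool tokens
    return np.stack(feats)

essays = ["A tightly argued essay with clear evidence and structure.",
          "Short essay. Not much detail.",
          "An elaborate discussion weighing several competing viewpoints carefully."]
scores = np.array([5.0, 2.0, 4.0])                                   # toy trait scores

probe = Ridge(alpha=1.0).fit(activation_features(essays), scores)
print(probe.predict(activation_features(["A well organized and persuasive response."])))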
--------------------------------------------------------------------------------------------------------