Week Ending 1.4.2026

RESEARCH WATCH: 1.4.2026

Materials Informatics: Emergence To Autonomous Discovery In The Age Of AI

Materials science is entering a transformative era where AI drives discovery from prediction to autonomous experimentation. This perspective traces materials informatics from its physics foundations through the Materials Genome Initiative to today's large language model integration. Key methodologies like Bayesian Optimization and Reinforcement Learning enable inverse design—specifying desired properties to discover new materials. The emergence of self-driving laboratories represents a paradigm shift toward "human-out-of-the-loop" discovery, where AI systems autonomously design experiments, interpret results, and iteratively refine hypotheses. Applications span drug discovery, energy materials, and aerospace alloys, potentially accelerating materials development from decades to months while addressing challenges in uncertainty quantification and specialist versus generalist model design.

Authors: Turab Lookman, YuJie Liu, Zhibin Gao

Link: https://arxiv.org/abs/2601.00742v1

Date: 2026-01-d

Summary:

This perspective explores the evolution of materials informatics, from its foundational roots in physics and information theory to its maturation through artificial intelligence (AI). We trace the field's trajectory from early milestones to the transformative impact of the Materials Genome Initiative and the recent advent of large language models (LLMs). Rather than a mere toolkit, we present materials informatics as an evolving ecosystem, reviewing key methodologies such as Bayesian Optimization, Reinforcement Learning, and Transformers that drive inverse design and autonomous self-driving laboratories. We specifically address the practical challenges of LLM integration, comparing specialist versus generalist models and discussing solutions for uncertainty quantification. Looking forward, we assess the transition of AI from a predictive tool to a collaborative research partner. By leveraging active learning and retrieval-augmented generation (RAG), the field is moving toward a new era of autonomous materials science, increasingly characterized by "human-out-of-the-loop" discovery processes.

--------------------------------------------------------------------------------------------------------

IRPO: Scaling the Bradley-Terry Model via Reinforcement Learning

Training AI systems through human feedback faces a computational bottleneck: comparing response pairs scales quadratically with candidate count. This work introduces IRPO, which replaces pairwise comparisons with efficient pointwise scoring while preserving interpretability. By incorporating the Bradley-Terry model—a classical framework for ranking—into reinforcement learning, IRPO evaluates arbitrarily many responses in linear time. Applications include training chatbots, coding assistants, and creative writing models more efficiently. The framework maintains fine-grained reward signals crucial for nuanced behaviors while dramatically reducing computational overhead. Results show state-of-the-art performance among pointwise models and competitive results with pairwise approaches, suggesting IRPO could democratize large-scale AI alignment by making reinforcement learning accessible beyond resource-rich organizations.

Authors: Haonan Song, Qingchen Xie, Huan Zhu, Feng Xiao, Luxi Xing, Fuzhen Li, Liu Kang, Feng Jiang, Zhiyong Zheng, Fan Yang

Link: https://arxiv.org/abs/2601.00677v1

Date: 2026-01-d

Summary:

Generative Reward Models (GRMs) have attracted considerable research interest in reward modeling due to their interpretability, inference-time scalability, and potential for refinement through reinforcement learning (RL). However, widely used pairwise GRMs create a computational bottleneck when integrated with RL algorithms such as Group Relative Policy Optimization (GRPO). This bottleneck arises from two factors: (i) the O(n^2) time complexity of pairwise comparisons required to obtain relative scores, and (ii) the computational overhead of repeated sampling or additional chain-of-thought (CoT) reasoning to improve performance. To address the first factor, we propose Intergroup Relative Preference Optimization (IRPO), a novel RL framework that incorporates the well-established Bradley-Terry model into GRPO. By generating a pointwise score for each response, IRPO enables efficient evaluation of arbitrarily many candidates during RL training while preserving interpretability and fine-grained reward signals. Experimental results demonstrate that IRPO achieves state-of-the-art (SOTA) performance among pointwise GRMs across multiple benchmarks, with performance comparable to that of current leading pairwise GRMs. Furthermore, we show that IRPO significantly outperforms pairwise GRMs in post-training evaluations.
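
To make the complexity argument concrete, here is a minimal sketch of how the Bradley-Terry model turns pointwise scores into pairwise preference probabilities. It is not IRPO's training objective: the function names and the example scores are hypothetical, and it only shows why n scoring calls can stand in for n(n-1)/2 explicit comparisons.

```python
import math
from itertools import combinations


def bradley_terry_prob(score_i: float, score_j: float) -> float:
    """P(response i is preferred over response j) under Bradley-Terry,
    treating each pointwise score as a log-strength."""
    return 1.0 / (1.0 + math.exp(-(score_i - score_j)))


def pairwise_call_count(n: int) -> int:
    """Number of explicit comparisons a pairwise judge needs for n responses."""
    return n * (n - 1) // 2


# Hypothetical pointwise scores for four candidate responses (one call each).
scores = [2.0, 0.5, -1.0, 0.5]

# Every pairwise preference is derived from the scores, with no extra model calls.
prefs = {
    (i, j): bradley_terry_prob(scores[i], scores[j])
    for i, j in combinations(range(len(scores)), 2)
}
```

With eight candidates, a pairwise judge would need 28 comparisons, while a pointwise scorer needs eight calls.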

--------------------------------------------------------------------------------------------------------

A Comprehensive Dataset for Human vs. AI Generated Image Detection

Synthetic images from Stable Diffusion, DALL-E, and MidJourney are increasingly indistinguishable from photographs, enabling misinformation and manipulated media. MS COCOAI addresses this detection challenge with 96,000 real and synthetic images, with the synthetic ones drawn from five leading generators. The dataset supports two tasks: binary classification (real versus generated) and generator identification (which model created a given image). Applications include content moderation, journalism verification, forensic analysis, and social media platform integrity. As generative AI becomes ubiquitous, reliable detection systems are critical for maintaining trust in the information ecosystem. This dataset gives researchers a standardized benchmark for developing robust detectors, potentially informing regulatory frameworks and platform policies around synthetic media disclosure.

Authors: Rajarshi Roy, Nasrin Imanpour, Ashhar Aziz, Shashwat Bajpai, Gurpreet Singh, Shwetangshu Biswas, Kapil Wanaskar, Parth Patwa, Subhankar Ghosh, Shreyas Dixit, Nilesh Ranjan Pal, Vipula Rawte, Ritvik Garimella, Gaytri Jena, Vasu Sharma, Vinija Jain, Aman Chadha, Aishwarya Naresh Reganti, Amitava Das

Link: https://arxiv.org/abs/2601.00553v1

Date: 2026-01-d

Summary:

Multimodal generative AI systems like Stable Diffusion, DALL-E, and MidJourney have fundamentally changed how synthetic images are created. These tools drive innovation but also enable the spread of misleading content, false information, and manipulated media. As generated images become harder to distinguish from photographs, detecting them has become an urgent priority. To combat this challenge, we release MS COCOAI, a novel dataset for AI-generated image detection consisting of 96,000 real and synthetic datapoints, built using the MS COCO dataset. To generate synthetic images, we use five generators: Stable Diffusion 3, Stable Diffusion 2.1, SDXL, DALL-E 3, and MidJourney v6. Based on the dataset, we propose two tasks: (1) classifying images as real or generated, and (2) identifying which model produced a given synthetic image. The dataset is available at https://huggingface.co/datasets/Rajarshi-Roy-research/Defactify_Image_Dataset.

--------------------------------------------------------------------------------------------------------

CoCo-Fed: A Unified Framework for Memory- and Communication-Efficient Federated Learning at the Wireless Edge

Deploying large neural networks on wireless base stations faces dual constraints: limited memory for local training and bandwidth-constrained backhaul links for model updates. CoCo-Fed addresses both through gradient compression and orthogonal subspace superposition—projecting layer-wise updates into consolidated matrices before transmission. This approach enables edge intelligence in Open Radio Access Networks without overwhelming infrastructure. Applications include network optimization, predictive maintenance, and wireless sensing tasks like angle-of-arrival estimation. The framework proves convergent even under non-IID data distributions common in wireless environments. By breaking memory walls and reducing backhaul traffic simultaneously, CoCo-Fed makes sophisticated AI feasible for resource-constrained network equipment, advancing toward truly intelligent wireless infrastructure.

Authors: Zhiheng Guo, Zhaoyang Liu, Zihan Cen, Chenyuan Feng, Xinghua Sun, Xiang Chen, Tony Q. S. Quek, Xijun Wang

Link: https://arxiv.org/abs/2601.00549v1

Date: 2026-01-d

Summary:

The deployment of large-scale neural networks within the Open Radio Access Network (O-RAN) architecture is pivotal for enabling native edge intelligence. However, this paradigm faces two critical bottlenecks: the prohibitive memory footprint required for local training on resource-constrained gNBs, and the saturation of bandwidth-limited backhaul links during the global aggregation of high-dimensional model updates. To address these challenges, we propose CoCo-Fed, a novel Compression and Combination-based Federated learning framework that unifies local memory efficiency and global communication reduction. Locally, CoCo-Fed breaks the memory wall by performing a double-dimension down-projection of gradients, adapting the optimizer to operate on low-rank structures without introducing additional inference parameters/latency. Globally, we introduce a transmission protocol based on orthogonal subspace superposition, where layer-wise updates are projected and superimposed into a single consolidated matrix per gNB, drastically reducing the backhaul traffic. Beyond empirical designs, we establish a rigorous theoretical foundation, proving the convergence of CoCo-Fed even under unsupervised learning conditions suitable for wireless sensing tasks. Extensive simulations on an angle-of-arrival estimation task demonstrate that CoCo-Fed significantly outperforms state-of-the-art baselines in both memory and communication efficiency while maintaining robust convergence under non-IID settings.
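
As a rough illustration of what a two-sided ("double-dimension") down-projection of a gradient matrix can look like, the sketch below compresses an m x n gradient into an r x r core and maps it back. The SVD-based projectors, the function names, and the matrix sizes are all assumptions made for illustration; the paper's actual projection and optimizer adaptation may differ.

```python
import numpy as np


def two_sided_projectors(grad: np.ndarray, rank: int):
    """Build left/right projectors from the top singular vectors (an assumption)."""
    u, _, vt = np.linalg.svd(grad, full_matrices=False)
    return u[:, :rank], vt[:rank, :].T  # P: (m, r), Q: (n, r)


def down_project(grad: np.ndarray, p: np.ndarray, q: np.ndarray) -> np.ndarray:
    """Compress an (m, n) gradient into an (r, r) core along both dimensions."""
    return p.T @ grad @ q


def up_project(core: np.ndarray, p: np.ndarray, q: np.ndarray) -> np.ndarray:
    """Map the (r, r) core back to the full (m, n) parameter space."""
    return p @ core @ q.T


rng = np.random.default_rng(0)
m, n, r = 64, 48, 4
# Synthetic gradient that is exactly rank r, so the round trip is lossless.
grad = rng.standard_normal((m, r)) @ rng.standard_normal((r, n))

p, q = two_sided_projectors(grad, r)
core = down_project(grad, p, q)      # optimizer state could live at this size
restored = up_project(core, p, q)
```

In this toy setting the optimizer would track 16 values instead of 3,072. Real gradients are only approximately low-rank, so the reconstruction would be lossy there.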

--------------------------------------------------------------------------------------------------------

Overlooked Safety Vulnerability in LLMs: Malicious Intelligent Optimization Algorithm Request and its Jailbreak

Large language models increasingly support automated algorithm design, creating overlooked safety risks. MalOptBench reveals that mainstream LLMs, including GPT-5 and DeepSeek-V3.1, readily generate malicious optimization algorithms when harmful requests are framed as technical problems. The average attack success rate is 83.59% on the original harmful prompts, and safeguards fail almost completely under the targeted MOBjailbreak method, exposing critical vulnerabilities in complex decision-making scenarios. Applications of concern include financial market manipulation, adversarial attacks on critical infrastructure, and circumventing security systems. Existing defenses prove only marginally effective and are prone to over-cautious blocking of legitimate requests. These findings highlight the urgent need for stronger alignment techniques that specifically address algorithm design, suggesting that current safety measures do not adequately account for indirect harm through computational tools as opposed to direct content generation.

Authors: Haoran Gu, Handing Wang, Yi Mei, Mengjie Zhang, Yaochu Jin

Link: https://arxiv.org/abs/2601.00213v1

Date: 2026-01-d

Summary:

The widespread deployment of large language models (LLMs) has raised growing concerns about their misuse risks and associated safety issues. While prior studies have examined the safety of LLMs in general usage, code generation, and agent-based applications, their vulnerabilities in automated algorithm design remain underexplored. To fill this gap, this study investigates this overlooked safety vulnerability, with a particular focus on intelligent optimization algorithm design, given its prevalent use in complex decision-making scenarios. We introduce MalOptBench, a benchmark consisting of 60 malicious optimization algorithm requests, and propose MOBjailbreak, a jailbreak method tailored for this scenario. Through extensive evaluation of 13 mainstream LLMs including the latest GPT-5 and DeepSeek-V3.1, we reveal that most models remain highly susceptible to such attacks, with an average attack success rate of 83.59% and an average harmfulness score of 4.28 out of 5 on original harmful prompts, and near-complete failure under MOBjailbreak. Furthermore, we assess state-of-the-art plug-and-play defenses that can be applied to closed-source models, and find that they are only marginally effective against MOBjailbreak and prone to exaggerated safety behaviors. These findings highlight the urgent need for stronger alignment techniques to safeguard LLMs against misuse in algorithm design.

--------------------------------------------------------------------------------------------------------

SpaceTimePilot: Generative Rendering of Dynamic Scenes Across Space and Time

Video diffusion models generate realistic footage but typically entangle camera movement with scene dynamics. SpaceTimePilot disentangles space and time, enabling independent control over viewpoint and motion sequence in generated videos. Given a monocular video, it re-renders scenes from arbitrary camera positions and time progressions. Applications span virtual reality content creation, film pre-visualization, sports analysis from novel angles, and autonomous vehicle simulation with varied temporal dynamics. The model introduces animation time-embeddings for explicit motion control and trains on temporal-warped multi-view data to learn robust space-time separation. CamxTime, the first synthetic dataset with full space-time trajectory coverage, enhances precision. This technology could democratize cinematic effects previously requiring expensive multi-camera arrays while advancing embodied AI research through controllable scene synthesis.

Authors: Zhening Huang, Hyeonho Jeong, Xuelin Chen, Yulia Gryaditskaya, Tuanfeng Y. Wang, Joan Lasenby, Chun-Hao Huang

Link: https://arxiv.org/abs/2512.25075v1

Date: 2025-12-d

Summary:

We present SpaceTimePilot, a video diffusion model that disentangles space and time for controllable generative rendering. Given a monocular video, SpaceTimePilot can independently alter the camera viewpoint and the motion sequence within the generative process, re-rendering the scene for continuous and arbitrary exploration across space and time. To achieve this, we introduce an effective animation time-embedding mechanism in the diffusion process, allowing explicit control of the output video's motion sequence with respect to that of the source video. As no datasets provide paired videos of the same dynamic scene with continuous temporal variations, we propose a simple yet effective temporal-warping training scheme that repurposes existing multi-view datasets to mimic temporal differences. This strategy effectively supervises the model to learn temporal control and achieve robust space-time disentanglement. To further enhance the precision of dual control, we introduce two additional components: an improved camera-conditioning mechanism that allows altering the camera from the first frame, and CamxTime, the first synthetic space-and-time full-coverage rendering dataset that provides fully free space-time video trajectories within a scene. Joint training on the temporal-warping scheme and the CamxTime dataset yields more precise temporal control. We evaluate SpaceTimePilot on both real-world and synthetic data, demonstrating clear space-time disentanglement and strong results compared to prior work. Project page: https://zheninghuang.github.io/Space-Time-Pilot/ Code: https://github.com/ZheningHuang/spacetimepilot

--------------------------------------------------------------------------------------------------------

Modeling Language as a Sequence of Thoughts

Transformer language models excel at generating natural text but lack globally consistent representations, contributing to brittleness in relational reasoning and context integration. Inspired by cognitive science showing humans compress linguistic streams into persistent event-like representations, Thought Gestalt models language at two abstraction levels: tokens and sentence-level "thoughts." This recurrent Transformer generates one sentence at a time while cross-attending to prior sentence representations stored in memory. Remarkably, both levels train end-to-end with standard next-token prediction. Applications include dialogue systems that track entities more coherently, more data-efficient language learning, and robust question-answering across long contexts. Scaling fits indicate that GPT-2 needs roughly 5-8% more data and 33-42% more parameters to match Thought Gestalt's loss, and the model also reduces errors on relational direction tasks such as the reversal curse, suggesting sentence-level abstraction bridges gaps between surface statistics and deeper comprehension.

Authors: Nasim Borazjanizadeh, James McClelland

Link: https://arxiv.org/abs/2512.25026v1

Date: 2025-12-d

Summary:

Transformer language models can generate strikingly natural text by modeling language as a sequence of tokens. Yet, by relying primarily on surface-level co-occurrence statistics, they fail to form globally consistent latent representations of entities and events, the lack of which contributes to brittleness in relational direction (e.g., reversal curse), contextualization errors, and data inefficiency. On the other hand, cognitive science shows that human comprehension involves converting the input linguistic stream into compact, event-like representations that persist in memory while verbatim form is short-lived. Motivated by this view, we introduce the Thought Gestalt (TG) model, a recurrent Transformer that models language at two levels of abstraction: tokens and sentence-level "thought" states. TG generates the tokens of one sentence at a time while cross-attending to a memory of prior sentence representations. In TG, token and sentence representations are generated using the same set of model parameters and trained with a single objective, the next-token cross-entropy: by retaining the computation graph of sentence representations written to memory, gradients from future token losses flow backward through cross-attention to optimize the parameters generating earlier sentence vectors. In scaling experiments, TG consistently improves efficiency over matched GPT-2 runs, among other baselines, with scaling fits indicating GPT-2 requires ~5-8% more data and ~33-42% more parameters to match TG's loss. TG also reduces errors on relational direction generalization on a father-son reversal curse probe.

--------------------------------------------------------------------------------------------------------

Iterative Deployment Improves Planning Skills in LLMs

Fine-tuning language models on user-curated data from previous deployments can substantially alter model capabilities through implicit reinforcement learning. Testing across planning domains reveals substantial skill improvements, with later models discovering much longer plans than the initial models. This mechanism effectively implements RL in the outer training loop, with an implicit reward function that is never defined explicitly and arises from how users curate the data. Applications include interactive planning assistants, code generation tools, and decision support systems that improve through deployment. However, the AI safety implications are significant: an undefined reward function may optimize unexpected objectives, potentially leading to unintended behaviors at scale. The work frames deployment itself as a training regime, suggesting practitioners should carefully monitor how capabilities evolve across deployment cycles and consider explicit reward specifications to maintain alignment with intended objectives.

Authors: Augusto B. Corrêa, Yoav Gelberg, Luckeciano C. Melo, Ilia Shumailov, André G. Pereira, Yarin Gal

Link: https://arxiv.org/abs/2512.24940v1

Date: 2025-12-d

Summary:

We show that iterative deployment of large language models (LLMs), each fine-tuned on data carefully curated by users from the previous models' deployment, can significantly change the properties of the resultant models. By testing this mechanism on various planning domains, we observe substantial improvements in planning skills, with later models displaying emergent generalization by discovering much longer plans than the initial models. We then provide theoretical analysis showing that iterative deployment effectively implements reinforcement learning (RL) training in the outer-loop (i.e. not as part of intentional model training), with an implicit reward function. The connection to RL has two important implications: first, for the field of AI safety, as the reward function entailed by repeated deployment is not defined explicitly, and could have unexpected implications to the properties of future model deployments. Second, the mechanism highlighted here can be viewed as an alternative training regime to explicit RL, relying on data curation rather than explicit rewards.

--------------------------------------------------------------------------------------------------------

Throughput Optimization in UAV-Mounted RIS under Jittering and Imperfect CSI via DRL

Unmanned aerial vehicles carrying reconfigurable intelligent surfaces can reshape wireless signals on-demand but face practical challenges from UAV jitter and uncertain channel conditions. This work tackles throughput maximization under stochastic three-dimensional jitter and imperfect channel state information through model-free deep reinforcement learning. Applications include disaster response communications, temporary event coverage, military operations, and remote area connectivity where ground infrastructure is unavailable. The DRL framework, which uses a contextual bandit formulation with a differentiable feasibility layer, achieves inference times of 0.6 ms per decision, roughly 600-900× faster than the 370-550 ms of conventional AO-WMMSE solvers, while staying within a 0-12% relative gap of the benchmark throughput. Under severe jitter and low channel quality, the DRL methods outperform the traditional baselines. This speed advantage enables the real-time adaptation that dynamic UAV deployments require, making aerial reconfigurable surfaces practical for responsive wireless network augmentation.

Authors: Anas K. Saeed, Mahmoud M. Salim, Ali Arshad Nasir, Ali H. Muqaibel

Link: https://arxiv.org/abs/2512.24773v1

Date: 2025-12-d

Summary:

Reconfigurable intelligent surfaces (RISs) mounted on unmanned aerial vehicles (UAVs) can reshape wireless propagation on-demand. However, their performance is sensitive to UAV jitter and cascaded channel uncertainty. This paper investigates a downlink multiple-input single-output UAV-mounted RIS system in which a ground multiple-antenna base station (BS) serves multiple single-antenna users under practical impairments. Our goal is to maximize the expected throughput under stochastic three-dimensional UAV jitter and imperfect cascaded channel state information (CSI) based only on the available channel estimates. This leads to a stochastic nonconvex optimization problem subject to a BS transmit power constraint and strict unit-modulus constraints on all RIS elements. To address this problem, we design a model-free deep reinforcement learning (DRL) framework with a contextual bandit formulation. A differentiable feasibility layer is utilized to map continuous actions to feasible solutions, while the reward is a Monte Carlo estimate of the expected throughput. We instantiate this framework with constrained variants of deep deterministic policy gradient (DDPG) and twin delayed deep deterministic policy gradient (TD3) that do not use target networks. Simulations show that the proposed algorithms yield higher throughput than conventional alternating optimization-based weighted minimum mean-square error (AO-WMMSE) baselines under severe jitter and low CSI quality. Across different scenarios, the proposed methods achieve performance that is either comparable to or slightly below the AO-WMMSE benchmark, based on sample average approximation (SAA) with a relative gap ranging from 0-12%. Moreover, the proposed DRL controllers achieve online inference times of 0.6 ms per decision versus roughly 370-550 ms for AO-WMMSE solvers.
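
The role of the feasibility layer can be sketched as a mapping from unconstrained network outputs to values that satisfy the two constraints named above: unit-modulus RIS coefficients and a BS transmit power budget. The function name, the tanh phase squashing, and the rescaling rule below are assumptions for illustration, not the paper's layer.

```python
import numpy as np


def feasibility_layer(raw_phases: np.ndarray, raw_beam: np.ndarray, p_max: float):
    """Map unconstrained actions to a feasible (RIS, beamformer) pair.

    raw_phases: real-valued outputs, one per RIS element (hypothetical).
    raw_beam:   complex beamforming matrix, antennas x users (hypothetical).
    """
    # exp(j * angle) has modulus 1 for any real angle, so the unit-modulus
    # constraint holds by construction.
    theta = np.exp(1j * np.pi * np.tanh(raw_phases))

    # Rescale the beamformer only if it exceeds the power budget.
    power = float(np.sum(np.abs(raw_beam) ** 2))
    scale = min(1.0, np.sqrt(p_max / power)) if power > 0 else 1.0
    return theta, raw_beam * scale


rng = np.random.default_rng(0)
raw_phases = rng.standard_normal(16)                      # 16 RIS elements
raw_beam = rng.standard_normal((4, 3)) + 1j * rng.standard_normal((4, 3))

theta, beam = feasibility_layer(raw_phases, raw_beam, p_max=1.0)
tx_power = float(np.sum(np.abs(beam) ** 2))
```

Both operations are smooth almost everywhere, so a version written in an autodiff framework could pass gradients through them. This NumPy sketch shows only the forward mapping.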

--------------------------------------------------------------------------------------------------------

Dream2Flow: Bridging Video Generation and Open-World Manipulation with 3D Object Flow

Generative video models synthesize plausible physical interactions but translating predicted human-led motions into robotic actions remains challenging. Dream2Flow bridges this gap using 3D object flow as an intermediate representation—reconstructing object trajectories from generated videos then formulating manipulation as trajectory tracking. This separates desired state changes from embodiment-specific actuators, enabling zero-shot transfer across diverse object categories: rigid tools, articulated doors, deformable cloth, and granular materials. Applications include household robotics, warehouse automation, agricultural manipulation, and assistive devices. By converting video model guidance into executable commands through trajectory optimization or reinforcement learning without task-specific demonstrations, the framework makes pre-trained generative models actionable for robotics. This approach could accelerate robot learning by leveraging vast video datasets encoding physical common sense rather than requiring expensive robot demonstration data.

Authors: Karthik Dharmarajan, Wenlong Huang, Jiajun Wu, Li Fei-Fei, Ruohan Zhang

Link: https://arxiv.org/abs/2512.24766v1

Date: 2025-12-d

Summary:

Generative video modeling has emerged as a compelling tool to zero-shot reason about plausible physical interactions for open-world manipulation. Yet, it remains a challenge to translate such human-led motions into the low-level actions demanded by robotic systems. We observe that given an initial image and task instruction, these models excel at synthesizing sensible object motions. Thus, we introduce Dream2Flow, a framework that bridges video generation and robotic control through 3D object flow as an intermediate representation. Our method reconstructs 3D object motions from generated videos and formulates manipulation as object trajectory tracking. By separating the state changes from the actuators that realize those changes, Dream2Flow overcomes the embodiment gap and enables zero-shot guidance from pre-trained video models to manipulate objects of diverse categories-including rigid, articulated, deformable, and granular. Through trajectory optimization or reinforcement learning, Dream2Flow converts reconstructed 3D object flow into executable low-level commands without task-specific demonstrations. Simulation and real-world experiments highlight 3D object flow as a general and scalable interface for adapting video generation models to open-world robotic manipulation. Videos and visualizations are available at https://dream2flow.github.io/.

--------------------------------------------------------------------------------------------------------

LSRE: Latent Semantic Rule Encoding for Real-Time Semantic Risk Detection in Autonomous Driving

Autonomous vehicles must obey complex social rules beyond codified traffic laws—yielding to emergency vehicles, respecting traffic officer gestures, stopping for school buses. Vision-language models interpret such semantics but their inference costs prohibit real-time deployment. LSRE encodes sparsely sampled VLM judgments into lightweight classifiers within a recurrent world model's latent space, enabling 10 Hz semantic risk assessment without per-frame VLM queries. Applications include urban autonomous driving, especially edge cases underrepresented in training data. Testing on six semantic-failure scenarios demonstrates accuracy comparable to VLM baselines with substantially earlier hazard anticipation and low latency. The approach generalizes to semantically similar unseen cases, suggesting language-guided latent classification offers deployable semantic safety monitoring—bridging the gap between foundation model reasoning and real-time autonomous system requirements critical for safe human-robot interaction.

Authors: Qian Cheng, Weitao Zhou, Cheng Jing, Nanshan Deng, Junze Wen, Zhaoyang Liu, Kun Jiang, Diange Yang

Link: https://arxiv.org/abs/2512.24712v1

Date: 2025-12-d

Summary:

Real-world autonomous driving must adhere to complex human social rules that extend beyond legally codified traffic regulations. Many of these semantic constraints, such as yielding to emergency vehicles, complying with traffic officers' gestures, or stopping for school buses, are intuitive for humans yet difficult to encode explicitly. Although large vision-language models (VLMs) can interpret such semantics, their inference cost makes them impractical for real-time deployment. This work proposes LSRE, a Latent Semantic Rule Encoding framework that converts sparsely sampled VLM judgments into decision boundaries within the latent space of a recurrent world model. By encoding language-defined safety semantics into a lightweight latent classifier, LSRE enables real-time semantic risk assessment at 10 Hz without per-frame VLM queries. Experiments on six semantic-failure scenarios in CARLA demonstrate that LSRE attains semantic risk detection accuracy comparable to a large VLM baseline, while providing substantially earlier hazard anticipation and maintaining low computational latency. LSRE further generalizes to rarely seen, semantically similar test cases, indicating that language-guided latent classification offers an effective and deployable mechanism for semantic safety monitoring in autonomous driving.
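
The core idea of fitting a cheap classifier on sparsely labeled latent states can be sketched with plain logistic regression. Everything here is a stand-in: the latents are synthetic Gaussian vectors, not world-model states, the "VLM labels" come from a hypothetical linear rule, and the paper's classifier and training procedure may be quite different.

```python
import numpy as np


def train_latent_classifier(latents, labels, steps=500, lr=0.5):
    """Fit logistic regression by gradient descent on (latent, label) pairs."""
    w = np.zeros(latents.shape[1])
    b = 0.0
    for _ in range(steps):
        probs = 1.0 / (1.0 + np.exp(-(latents @ w + b)))
        err = probs - labels
        w -= lr * latents.T @ err / len(labels)
        b -= lr * err.mean()
    return w, b


def predict_risk(latents, w, b):
    """Cheap per-frame risk decision: one dot product, no VLM call."""
    return (latents @ w + b) > 0


rng = np.random.default_rng(0)
latents = rng.standard_normal((400, 4))          # stand-in latent states
true_w = np.array([1.0, -2.0, 0.5, 1.5])         # hypothetical labeling rule
vlm_labels = (latents @ true_w > 0).astype(float)

# Only every fifth frame gets an (expensive) label; the rest are unlabeled.
sparse_idx = np.arange(0, 400, 5)
w, b = train_latent_classifier(latents[sparse_idx], vlm_labels[sparse_idx])

accuracy = np.mean(predict_risk(latents, w, b) == vlm_labels.astype(bool))
```

Once trained, the classifier labels every frame with a single dot product, which is what makes a fixed-rate monitoring loop plausible without per-frame VLM queries.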

--------------------------------------------------------------------------------------------------------

Evaluating the Reasoning Abilities of LLMs on Underrepresented Mathematics Competition Problems

Most LLM mathematics evaluations use standard benchmarks, potentially missing diverse reasoning challenges. This study evaluates GPT-4o-mini, Gemini-2.0-Flash, and DeepSeek-V3 on Missouri Collegiate Mathematics Competition problems spanning Calculus, Analytic Geometry, and Discrete Mathematics. DeepSeek-V3 demonstrates superior performance across categories, though all models exhibit notable geometry weaknesses. Error pattern analysis reveals DeepSeek-V3 makes primarily computational and logical mistakes, GPT-4o-mini struggles with approach selection, and Gemini draws premature conclusions with incomplete reasoning. Applications include identifying targeted training needs for mathematical reasoning systems, informing educational AI tool development, and understanding model-specific failure modes. Results highlight that underrepresented problem sets expose distinct reasoning limitations invisible in standard benchmarks, suggesting geometry deserves focused architectural attention and diverse evaluation datasets better characterize real-world mathematical reasoning capabilities.

Authors: Samuel Golladay, Majid Bani-Yaghoub

Link: https://arxiv.org/abs/2512.24505v1

Date: 2025-12-d

Summary:

Understanding the limitations of Large Language Models, or LLMs, in mathematical reasoning has been the focus of several recent studies. However, the majority of these studies use the same datasets for benchmarking, which limits the generalizability of their findings and may not fully capture the diverse challenges present in mathematical tasks. The purpose of the present study is to analyze the performance of LLMs on underrepresented mathematics competition problems. We prompted three leading LLMs, namely GPT-4o-mini, Gemini-2.0-Flash, and DeepSeek-V3, with the Missouri Collegiate Mathematics Competition problems in the areas of Calculus, Analytic Geometry, and Discrete Mathematics. The LLMs' responses were then compared to the known correct solutions in order to determine the accuracy of each LLM in each problem domain. We also analyzed the LLMs' reasoning to explore patterns in errors across problem types and models. DeepSeek-V3 has the best performance in all three categories of Calculus, Analytic Geometry, and Discrete Mathematics, both in reasoning and correct final answers. All three LLMs exhibited notably weak performance in Geometry. The majority of errors made by DeepSeek-V3 were attributed to computational and logical mistakes, whereas GPT-4o-mini frequently exhibited logical and approach-related errors. Gemini, on the other hand, tended to struggle with incomplete reasoning and drawing rushed conclusions. In conclusion, evaluating LLMs on underrepresented mathematics competition datasets can provide deeper insights into their distinct error patterns and highlight ongoing challenges in structured reasoning, particularly within the domain of Geometry.

--------------------------------------------------------------------------------------------------------

Foundation models on the bridge: Semantic hazard detection and safety maneuvers for maritime autonomy with vision-language models

Maritime autonomous systems must handle semantic hazards—diver-down flags, nearby fires, unusual objects—where correct action depends on meaning rather than geometry. Classical autonomy stacks struggle with such out-of-distribution situations. This work proposes Semantic Lookout, using vision-language models for camera-only fallback maneuver selection that chooses cautious actions from water-valid trajectories under continuous human authority. Applications address IMO MASS Code requirements for autonomous vessels: detecting operational design domain departures, executing short-horizon fallback maneuvers, and facilitating operator handover. Testing on 40 harbor scenes shows sub-10 second models retain most awareness of slower state-of-the-art systems while outperforming geometry-only baselines. Field validation confirms end-to-end operation within practical latency budgets. Results support VLMs as semantic fallback selectors compatible with maritime regulations, motivating hybrid autonomy combining foundation model semantics with multi-sensor perception.

Authors: Kim Alexander Christensen, Andreas Gudahl Tufte, Alexey Gusev, Rohan Sinha, Milan Ganai, Ole Andreas Alsos, Marco Pavone, Martin Steinert

Link: https://arxiv.org/abs/2512.24470v1

Date: 2025-12

Summary:

The draft IMO MASS Code requires autonomous and remotely supervised maritime vessels to detect departures from their operational design domain, enter a predefined fallback that notifies the operator, permit immediate human override, and avoid changing the voyage plan without approval. Meeting these obligations in the alert-to-takeover gap calls for a short-horizon, human-overridable fallback maneuver. Classical maritime autonomy stacks struggle when the correct action depends on meaning (e.g., diver-down flag means people in the water, fire close by means hazard). We argue (i) that vision-language models (VLMs) provide semantic awareness for such out-of-distribution situations, and (ii) that a fast-slow anomaly pipeline with a short-horizon, human-overridable fallback maneuver makes this practical in the handover window. We introduce Semantic Lookout, a camera-only, candidate-constrained vision-language model (VLM) fallback maneuver selector that selects one cautious action (or station-keeping) from water-valid, world-anchored trajectories under continuous human authority. On 40 harbor scenes we measure per-call scene understanding and latency, alignment with human consensus (model majority-of-three voting), short-horizon risk-relief on fire hazard scenes, and an on-water alert->fallback maneuver->operator handover. Sub-10 s models retain most of the awareness of slower state-of-the-art models. The fallback maneuver selector outperforms geometry-only baselines and increases standoff distance on fire scenes. A field run verifies end-to-end operation. These results support VLMs as semantic fallback maneuver selectors compatible with the draft IMO MASS Code, within practical latency budgets, and motivate future work on domain-adapted, hybrid autonomy that pairs foundation-model semantics with multi-sensor bird's-eye-view perception and short-horizon replanning.
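
The "majority-of-three voting" mentioned above can be illustrated with a minimal sketch. It assumes three independent model calls each return one candidate label and that a three-way split falls back to station-keeping. The candidate names and the tie-break rule are hypothetical and do not come from the paper.

```python
from collections import Counter

# Hypothetical fallback action used when no candidate wins a strict majority.
STATION_KEEP = "station_keeping"


def majority_of_three(votes, fallback=STATION_KEEP):
    """Return the label chosen by at least two of three votes, else the fallback."""
    if len(votes) != 3:
        raise ValueError("expected exactly three votes")
    label, count = Counter(votes).most_common(1)[0]
    return label if count >= 2 else fallback


# Illustrative labels only; the summary does not list candidate names.
print(majority_of_three(["veer_starboard", "veer_starboard", "slow_ahead"]))
print(majority_of_three(["veer_starboard", "veer_port", "slow_ahead"]))
```

Falling back to a conservative action on disagreement is a design choice made for this sketch, in the spirit of the cautious, human-overridable behavior the summary describes.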

--------------------------------------------------------------------------------------------------------

Lifting Vision: Ground to Aerial Localization with Reasoning Guided Planning

Visual reasoning systems typically rely on textual information for inference, limiting effectiveness in spatial tasks requiring geometric understanding. ViReLoc introduces Geo-Consistent Visual Planning, performing localization and navigation planning using only visual representations without text intermediaries or real-time GPS. The framework learns spatial dependencies and geometric relations through visual-domain inference, optimized via reinforcement learning. Applications include GPS-denied navigation for military operations, indoor positioning, underground exploration, and secure navigation avoiding signal interception. By encoding step-by-step inference visually and integrating contrastive learning with adaptive feature interaction to align cross-view perspectives, ViReLoc plans routes between ground images via aerial reference matching. Experiments across diverse scenarios demonstrate improved spatial reasoning accuracy and cross-view retrieval performance, establishing visual reasoning as viable for navigation without global positioning dependencies, potentially enhancing security and resilience.

Authors: Soham Pahari, M. Srinivas

Link: https://arxiv.org/abs/2512.24404v1

Date: 2025-12

Summary:

Multimodal intelligence has recently shown strong progress in visual understanding and high-level reasoning. However, most reasoning systems still rely on textual information as the main medium for inference, which limits their effectiveness in spatial tasks such as visual navigation and geo-localization. This work discusses the potential scope of this field and proposes a visual reasoning paradigm, Geo-Consistent Visual Planning, together with a framework called Visual Reasoning for Localization, or ViReLoc, which performs planning and localization using only visual representations. The proposed framework learns spatial dependencies and geometric relations that text-based reasoning often struggles to capture. By encoding step-by-step inference in the visual domain and optimizing with reinforcement-based objectives, ViReLoc plans routes between two given ground images. The system also integrates contrastive learning and adaptive feature interaction to align cross-view perspectives and reduce viewpoint differences. Experiments across diverse navigation and localization scenarios show consistent improvements in spatial reasoning accuracy and cross-view retrieval performance. These results establish visual reasoning as a strong complementary approach for navigation and localization, and show that such tasks can be performed without real-time global positioning system data, leading to more secure navigation solutions.

--------------------------------------------------------------------------------------------------------

Medical Image Classification on Imbalanced Data Using ProGAN and SMA-Optimized ResNet: Application to COVID-19

Medical image classification faces severe class imbalance—dramatically more examples of healthy cases than specific diseases, especially during pandemics. This work addresses COVID-19 detection using progressive generative adversarial networks to synthesize minority-class images, combining them with real data through weighted mixing. Multi-objective optimization tunes classifier hyperparameters via a meta-heuristic algorithm. Applications extend beyond COVID-19 to rare disease detection, cancer screening, and any medical imaging domain with inherent class imbalance. Testing on large imbalanced chest X-ray datasets achieves 95.5% and 98.5% accuracy for 4-class and 2-class problems respectively. The approach demonstrates that synthetic data generation can effectively augment scarce real examples when properly weighted and combined with optimized classification architectures, potentially accelerating diagnostic AI development for emerging diseases where balanced training data is unavailable during critical early pandemic phases.

Authors: Sina Jahromi, Farshid Hajati, Alireza Rezaee, Javaher Nourian

Link: https://arxiv.org/abs/2512.24214v1

Date: 2025-12

Summary:

The challenge of imbalanced data is prominent in medical image classification. This challenge arises when there is a significant disparity in the number of images belonging to a particular class, such as the presence or absence of a specific disease, as compared to the number of images belonging to other classes. This issue is especially notable during pandemics, which may result in an even more significant imbalance in the dataset. Researchers have employed various approaches in recent years to detect COVID-19 infected individuals accurately and quickly, with artificial intelligence and machine learning algorithms at the forefront. However, the lack of sufficient and balanced data remains a significant obstacle to these methods. This study addresses the challenge by proposing a progressive generative adversarial network to generate synthetic data to supplement the real ones. The proposed method suggests a weighted approach to combine synthetic data with real ones before inputting it into a deep network classifier. A multi-objective meta-heuristic population-based optimization algorithm is employed to optimize the hyper-parameters of the classifier. The proposed model exhibits superior cross-validated metrics compared to existing methods when applied to a large and imbalanced chest X-ray image dataset of COVID-19. The proposed model achieves 95.5% and 98.5% accuracy for 4-class and 2-class imbalanced classification problems, respectively. The successful experimental outcomes demonstrate the effectiveness of the proposed model in classifying medical images using imbalanced data during pandemics.
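
The summary does not specify how synthetic and real images are weighted. One simple interpretation, sketched below, assigns each synthetic sample a per-sample weight below 1 so that generated images help balance the classes without counting as much as real ones. The function names, the file labels, and the 0.5 weight are all assumptions made for illustration, not the authors' scheme.

```python
def mix_with_weights(real, synthetic, synthetic_weight=0.5):
    """Combine real and synthetic (sample, label) pairs into (sample, label, weight) triples."""
    if not 0.0 <= synthetic_weight <= 1.0:
        raise ValueError("synthetic_weight must be in [0, 1]")
    mixed = [(x, y, 1.0) for x, y in real]
    mixed += [(x, y, synthetic_weight) for x, y in synthetic]
    return mixed


def effective_class_mass(mixed):
    """Sum the weights per label to see how balanced the weighted set is."""
    mass = {}
    for _, label, weight in mixed:
        mass[label] = mass.get(label, 0.0) + weight
    return mass


# Toy, made-up data: three real majority-class images, one real minority-class image,
# and four synthetic minority-class images standing in for generator output.
real = [("img_r1", "normal"), ("img_r2", "normal"), ("img_r3", "normal"), ("img_r4", "covid")]
synthetic = [("img_s1", "covid"), ("img_s2", "covid"), ("img_s3", "covid"), ("img_s4", "covid")]
mixed = mix_with_weights(real, synthetic, synthetic_weight=0.5)
print(effective_class_mass(mixed))
```

In a training loop, such weights would typically scale each sample's contribution to the loss; that step is omitted here.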

--------------------------------------------------------------------------------------------------------

Interactive Machine Learning: From Theory to Scale

Effective machine learning often requires extensive labeled data or online trial-and-error, which can be expensive or risky. Interactive machine learning addresses this by having learners actively influence information collection and action selection based on past observations. This dissertation develops algorithmic principles across three dimensions: active learning achieving exponential label savings without restrictive noise assumptions; contextual bandits with guarantees independent of action space size; and model selection under partial feedback. Applications span medical diagnosis requiring minimal expert labeling, recommendation systems with millions of items, autonomous systems learning through limited real-world interaction, and adaptive clinical trials. Results provide the first efficient algorithms for several interactive learning problems while establishing fundamental limits. The work offers principled guidance for deploying interactive methods in large-scale settings, balancing statistical optimality with computational efficiency for practical real-world applications.

Authors: Yinglun Zhu

Link: https://arxiv.org/abs/2512.23924v1

Date: 2025-12

Summary:

Machine learning has achieved remarkable success across a wide range of applications, yet many of its most effective methods rely on access to large amounts of labeled data or extensive online interaction. In practice, acquiring high-quality labels and making decisions through trial-and-error can be expensive, time-consuming, or risky, particularly in large-scale or high-stakes settings. This dissertation studies interactive machine learning, in which the learner actively influences how information is collected or which actions are taken, using past observations to guide future interactions. We develop new algorithmic principles and establish fundamental limits for interactive learning along three dimensions: active learning with noisy data and rich model classes, sequential decision making with large action spaces, and model selection under partial feedback. Our results include the first computationally efficient active learning algorithms achieving exponential label savings without low-noise assumptions; the first efficient, general-purpose contextual bandit algorithms whose guarantees are independent of the size of the action space; and the first tight characterizations of the fundamental cost of model selection in sequential decision making. Overall, this dissertation advances the theoretical foundations of interactive learning by developing algorithms that are statistically optimal and computationally efficient, while also providing principled guidance for deploying interactive learning methods in large-scale, real-world settings.

--------------------------------------------------------------------------------------------------------

A multimodal Transformer for InSAR-based ground deformation forecasting with cross-site generalization across Europe

Continental-scale monitoring services like Europe's EGMS provide dense observations of ground deformation, but predicting future motion remains challenging due to superposed long-term trends, seasonal cycles, and abrupt discontinuities across spatially heterogeneous regions. This work proposes a multimodal patch-based Transformer for single-step displacement map forecasting, ingesting recent observations plus static kinematic indicators and harmonic temporal encodings. Applications include urban planning, infrastructure management, landslide early warning, and natural hazard mitigation. Testing on the eastern Ireland tile achieves an RMSE of 0.90 mm and an R² of 0.97 on 100 km × 100 km tiles, with the Transformer clearly outperforming CNN-LSTM and graph-based alternatives when using multimodal inputs. The approach demonstrates that Transformer architectures can effectively integrate multiple information sources for geospatial prediction, potentially enabling proactive intervention in areas experiencing concerning deformation trends before critical thresholds are reached.

Authors: Wendong Yao, Binhua Huang, Soumyabrata Dev

Link: https://arxiv.org/abs/2512.23906v1

Date: 2025-12

Summary:

Near-real-time regional-scale monitoring of ground deformation is increasingly required to support urban planning, critical infrastructure management, and natural hazard mitigation. While Interferometric Synthetic Aperture Radar (InSAR) and continental-scale services such as the European Ground Motion Service (EGMS) provide dense observations of past motion, predicting the next observation remains challenging due to the superposition of long-term trends, seasonal cycles, and occasional abrupt discontinuities (e.g., co-seismic steps), together with strong spatial heterogeneity. In this study we propose a multimodal patch-based Transformer for single-step, fixed-interval next-epoch nowcasting of displacement maps from EGMS time series (resampled to a 64 × 64 grid over 100 km × 100 km tiles). The model ingests recent displacement snapshots together with (i) static kinematic indicators (mean velocity, acceleration, seasonal amplitude) computed in a leakage-safe manner from the training window only, and (ii) harmonic day-of-year encodings. On the eastern Ireland tile (E32N34), the STGCN is strongest in the displacement-only setting, whereas the multimodal Transformer clearly outperforms CNN-LSTM, CNN-LSTM+Attn, and multimodal STGCN when all models receive the same multimodal inputs, achieving RMSE = 0.90 mm and R² = 0.97 on the test set with the best threshold accuracies.
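
As a rough illustration of two ingredients named above, the sketch below shows one common way to build a harmonic day-of-year encoding (a sine/cosine pair) and how RMSE and R² are computed. The 365.25-day period, the single harmonic, and the toy displacement values are assumptions for illustration; the paper's exact encoding and data are not given in the summary.

```python
import math


def doy_harmonics(day_of_year, period=365.25):
    """Encode day-of-year as a (sin, cos) pair so the year wraps smoothly."""
    angle = 2.0 * math.pi * day_of_year / period
    return math.sin(angle), math.cos(angle)


def rmse(y_true, y_pred):
    """Root-mean-square error between two equal-length sequences."""
    n = len(y_true)
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / n)


def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Toy displacement values in mm (made up for illustration, not EGMS data).
observed = [1.0, 2.0, 3.0, 4.0]
predicted = [1.0, 2.0, 3.0, 5.0]
print(doy_harmonics(0))
print(rmse(observed, predicted))
print(r_squared(observed, predicted))
```

The sine/cosine pair avoids the artificial jump between day 365 and day 1 that a raw day index would introduce.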

--------------------------------------------------------------------------------------------------------

AI tutoring can safely and effectively support students: An exploratory RCT in UK classrooms

One-to-one tutoring delivers superior educational outcomes but scales poorly due to cost. This exploratory RCT with 165 UK secondary students integrated LearnLM, a pedagogically fine-tuned generative AI model, into chat-based mathematics tutoring, with expert tutors supervising every AI-drafted message and editing it where needed. Tutors approved 76.4% of drafted messages with minimal or no edits, and students guided by LearnLM performed at least as well as those with human tutors alone on every measured outcome, including a 5.5 percentage point higher success rate on novel problems (66.2% versus 60.7%). Applications include expanding access to personalized instruction in under-resourced schools, supplementing teacher capacity, and providing after-hours support. Tutors reported learning new pedagogical practices from the model and highlighted its Socratic questioning. Results suggest pedagogically fine-tuned AI can effectively deliver individualized learning support at scale while maintaining quality, potentially democratizing access to gold-standard educational experiences previously limited by tutor availability and cost.

Authors: LearnLM Team, Eedi: Albert Wang, Aliya Rysbek, Andrea Huber, Anjali Nambiar, Anna Kenolty, Ben Caulfield, Beth Lilley-Draper, Bibi Groot, Brian Veprek, Chelsea Burdett, Claire Willis, Craig Barton, Digory Smith, George Mu, Harriet Walters, Irina Jurenka, Iris Hulls, James Stalley-Moores, Jonathan Caton, Julia Wilkowski, Kaiz Alarakyia, Kevin R. McKee, Liam McCafferty, Lucy Dalton, Markus Kunesch, Pauline Malubay, Rachel Kidson, Rich Wells, Sam Wheeler, Sara Wiltberger, Shakir Mohamed, Simon Woodhead, Vasco Brazão

Link: https://arxiv.org/abs/2512.23633v1

Date: 2025-12

Summary:

One-to-one tutoring is widely considered the gold standard for personalized education, yet it remains prohibitively expensive to scale. To evaluate whether generative AI might help expand access to this resource, we conducted an exploratory randomized controlled trial (RCT) with N = 165 students across five UK secondary schools. We integrated LearnLM, a generative AI model fine-tuned for pedagogy, into chat-based tutoring sessions on the Eedi mathematics platform. In the RCT, expert tutors directly supervised LearnLM, with the remit to revise each message it drafted until they would be satisfied sending it themselves. LearnLM proved to be a reliable source of pedagogical instruction, with supervising tutors approving 76.4% of its drafted messages with zero or minimal edits (i.e., changing only one or two characters). This translated into effective tutoring support: students guided by LearnLM performed at least as well as students chatting with human tutors on each learning outcome we measured. In fact, students who received support from LearnLM were 5.5 percentage points more likely to solve novel problems on subsequent topics (with a success rate of 66.2%) than those who received tutoring from human tutors alone (rate of 60.7%). In interviews, tutors highlighted LearnLM's strength at drafting Socratic questions that encouraged deeper reflection from students, with multiple tutors even reporting that they learned new pedagogical practices from the model. Overall, our results suggest that pedagogically fine-tuned AI tutoring systems may play a promising role in delivering effective, individualized learning support at scale.
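
The 5.5 percentage point figure is an absolute difference between two success rates, which is not the same as a 5.5% relative improvement. The short sketch below reproduces the arithmetic from the two rates quoted in the summary; the variable names are illustrative, and the relative figure is derived here rather than reported by the authors.

```python
learnlm_rate = 66.2  # % success on novel problems, LearnLM-supported students (from the summary)
human_rate = 60.7    # % success on novel problems, human tutors alone (from the summary)

# Absolute difference, in percentage points.
pp_diff = learnlm_rate - human_rate

# Relative increase over the human-tutor baseline, in percent (derived, not reported).
relative = pp_diff / human_rate * 100.0

print(f"{pp_diff:.1f} percentage points, {relative:.1f}% relative increase")
```

Because the trial is exploratory, both numbers are best read as descriptive rather than as confirmed effect sizes.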

--------------------------------------------------------------------------------------------------------

Divergent-Convergent Thinking in Large Language Models for Creative Problem Generation

Large language models generating educational content suffer from the "Artificial Hivemind" effect—producing similar responses within and across models, exposing students to repetitive problems that harm diverse thinking. CreativeDC addresses this through two-phase prompting inspired by Wallas's creativity theory and Guilford's divergent-convergent thinking framework, explicitly scaffolding LLM reasoning into separate exploration and constraint satisfaction phases. Applications include automated question generation for homework, test creation, adaptive learning platforms, and educational content production at scale. Evaluation across diversity, novelty, and utility metrics shows CreativeDC significantly exceeds baselines in diversity and novelty while maintaining high utility. Scaling analysis demonstrates CreativeDC generates more distinct problems as sampling increases, growing faster than alternatives. This approach could help educators create varied learning materials efficiently while avoiding the homogenization risks of naive LLM deployment in education.

Authors: Manh Hung Nguyen, Adish Singla

Link: https://arxiv.org/abs/2512.23601v1

Date: 2025-12

Summary:

Large language models (LLMs) have significant potential for generating educational questions and problems, enabling educators to create large-scale learning materials. However, LLMs are fundamentally limited by the "Artificial Hivemind" effect, where they generate similar responses within the same model and produce homogeneous outputs across different models. As a consequence, students may be exposed to overly similar and repetitive LLM-generated problems, which harms diversity of thought. Drawing inspiration from Wallas's theory of creativity and Guilford's framework of divergent-convergent thinking, we propose CreativeDC, a two-phase prompting method that explicitly scaffolds the LLM's reasoning into distinct phases. By decoupling creative exploration from constraint satisfaction, our method enables LLMs to explore a broader space of ideas before committing to a final problem. We evaluate CreativeDC for creative problem generation using a comprehensive set of metrics that capture diversity, novelty, and utility. The results show that CreativeDC achieves significantly higher diversity and novelty compared to baselines while maintaining high utility. Moreover, scaling analysis shows that CreativeDC generates a larger effective number of distinct problems as more are sampled, increasing at a faster rate than baseline methods.
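
The idea of decoupling exploration from constraint satisfaction can be sketched as two sequential model calls. This is only one plausible reading: the summary does not say whether CreativeDC uses one prompt or several, and the prompt wording, the `creative_two_phase` function, and the `generate` callable are all hypothetical stand-ins rather than the authors' implementation.

```python
from typing import Callable


def creative_two_phase(generate: Callable[[str], str], topic: str, constraints: str) -> str:
    """Run a divergent (exploration) prompt, then a convergent (constraint) prompt."""
    # Phase 1 (divergent): brainstorm freely, with no constraints mentioned yet.
    divergent_prompt = f"Brainstorm several unusual scenarios related to: {topic}"
    ideas = generate(divergent_prompt)
    # Phase 2 (convergent): turn the ideas into one problem that meets the constraints.
    convergent_prompt = (
        f"Ideas:\n{ideas}\n\n"
        f"Pick one idea and write a single problem that satisfies: {constraints}"
    )
    return generate(convergent_prompt)


# Stub standing in for a real model call, so the sketch runs offline.
def echo_model(prompt: str) -> str:
    return f"[model output for {len(prompt)}-char prompt]"


print(creative_two_phase(echo_model, "fractions", "solvable by a 12-year-old"))
```

Passing the model as a plain callable keeps the sketch independent of any particular LLM client library.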

--------------------------------------------------------------------------------------------------------

Alpha-R1: Alpha Screening with LLM Reasoning via Reinforcement Learning

Quantitative investment strategies face signal decay and regime shifts as market conditions evolve. Traditional approaches rely on historical correlations, struggling when economic environments change. Alpha-R1 introduces an 8-billion parameter model trained via reinforcement learning for context-aware factor screening, reasoning over factor logic and real-time news to evaluate alpha relevance under changing conditions. Applications include systematic trading, portfolio management, risk assessment, and quantitative research. The model selectively activates or deactivates factors based on contextual consistency rather than treating them as static numerical series, incorporating semantic rationale about when factors are economically relevant. Empirical results across multiple asset pools show consistent outperformance versus benchmark strategies with improved robustness to alpha decay. This demonstrates LLMs can support quantitative finance through explicit economic reasoning, potentially reducing strategy obsolescence in non-stationary markets by adapting factor usage to evolving regimes.

Authors: Zuoyou Jiang, Li Zhao, Rui Sun, Ruohan Sun, Zhongjian Li, Jing Li, Daxin Jiang, Zuo Bai, Cheng Hua

Link: https://arxiv.org/abs/2512.23515v1

Date: 2025-12

Summary:

Signal decay and regime shifts pose recurring challenges for data-driven investment strategies in non-stationary markets. Conventional time-series and machine learning approaches, which rely primarily on historical correlations, often struggle to generalize when the economic environment changes. While large language models (LLMs) offer strong capabilities for processing unstructured information, their potential to support quantitative factor screening through explicit economic reasoning remains underexplored. Existing factor-based methods typically reduce alphas to numerical time series, overlooking the semantic rationale that determines when a factor is economically relevant. We propose Alpha-R1, an 8B-parameter reasoning model trained via reinforcement learning for context-aware alpha screening. Alpha-R1 reasons over factor logic and real-time news to evaluate alpha relevance under changing market conditions, selectively activating or deactivating factors based on contextual consistency. Empirical results across multiple asset pools show that Alpha-R1 consistently outperforms benchmark strategies and exhibits improved robustness to alpha decay. The full implementation and resources are available at https://github.com/FinStep-AI/Alpha-R1.

--------------------------------------------------------------------------------------------------------


Artificial Intelligence, Research Watch | Craig Smith | January 5, 2026