Week Ending 11.2.2025
RESEARCH WATCH: 11.2.2025
Leveraging Foundation Models for Enhancing Robot Perception and Action
Foundation models are revolutionizing robotics by enabling machines to understand and interact with unstructured environments more effectively. This thesis systematically explores how these powerful AI models can enhance robotic localization, interaction, and manipulation capabilities. By developing a cohesive framework for semantics-aware robotic intelligence, this work addresses fundamental challenges in enabling robots to operate in real-world scenarios. The research has significant applications in autonomous navigation, object manipulation in warehouses, service robotics in homes and hospitals, and industrial automation. By bridging the gap between high-level semantic understanding and low-level robotic control, this framework could accelerate the deployment of more capable robots across manufacturing, logistics, healthcare, and domestic assistance domains.
Authors: Reihaneh Mirjalili
Link: https://arxiv.org/abs/2510.26855v1
Date: 2025-10-d
Summary:
This thesis investigates how foundation models can be systematically leveraged to enhance robotic capabilities, enabling more effective localization, interaction, and manipulation in unstructured environments. The work is structured around four core lines of inquiry, each addressing a fundamental challenge in robotics while collectively contributing to a cohesive framework for semantics-aware robotic intelligence.
--------------------------------------------------------------------------------------------------------
Inside CORE-KG: Evaluating Structured Prompting and Coreference Resolution for Knowledge Graphs
Human smuggling networks present complex challenges for law enforcement analysis, particularly when working with dense, unstructured legal documents. This research introduces CORE-KG, a framework that leverages large language models to construct cleaner knowledge graphs from legal case files. By integrating coreference resolution and structured prompts, the system dramatically reduces duplicate nodes and legal noise that plague automated extraction systems. The ablation study reveals that removing coreference resolution increases node duplication by 28%, while removing structured prompts increases noisy nodes by 73%. This work has immediate applications in law enforcement intelligence, legal document analysis, compliance monitoring, and criminal network detection. The methodology could extend to other complex domains requiring structured extraction from ambiguous text.
Authors: Dipak Meher, Carlotta Domeniconi
Link: https://arxiv.org/abs/2510.26512v1
Date: 2025-10-d
Summary:
Human smuggling networks are increasingly adaptive and difficult to analyze. Legal case documents offer critical insights but are often unstructured, lexically dense, and filled with ambiguous or shifting references, which pose significant challenges for automated knowledge graph (KG) construction. While recent LLM-based approaches improve over static templates, they still generate noisy, fragmented graphs with duplicate nodes due to the absence of guided extraction and coreference resolution. The recently proposed CORE-KG framework addresses these limitations by integrating a type-aware coreference module and domain-guided structured prompts, significantly reducing node duplication and legal noise. In this work, we present a systematic ablation study of CORE-KG to quantify the individual contributions of its two key components. Our results show that removing coreference resolution results in a 28.32% increase in node duplication and a 4.32% increase in noisy nodes, while removing structured prompts leads to a 4.34% increase in node duplication and a 73.33% increase in noisy nodes. These findings offer empirical insights for designing robust LLM-based pipelines for extracting structured representations from complex legal texts.
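As a concrete illustration of the two components the ablation isolates, here is a minimal sketch assuming a generic text-in/text-out `llm` callable; the prompt wording, node-type inventory, and output parsing are illustrative stand-ins, not the authors' implementation.

```python
# Sketch of a CORE-KG-style pipeline: type-aware coreference resolution first,
# then extraction under a domain-guided structured prompt. `llm` is any
# completion function (hypothetical interface).
from dataclasses import dataclass

@dataclass
class Triple:
    head: str
    relation: str
    tail: str

STRUCTURED_PROMPT = (
    "Extract a knowledge graph from the legal case text below.\n"
    "Allowed node types: Person, Location, Organization, Route, Means.\n"
    "Use canonical names only; ignore procedural boilerplate.\n"
    "Output one triple per line as: head | relation | tail"
)

def resolve_coreferences(text: str, llm) -> str:
    """Rewrite ambiguous or shifting references ('the defendant', 'he')
    into canonical entity mentions before extraction."""
    return llm("Rewrite the text, replacing every ambiguous reference with "
               "the canonical entity name:\n" + text)

def extract_triples(text: str, llm) -> list[Triple]:
    """Guided extraction: constraining node types and output format is the
    part that suppresses noisy nodes in the ablation."""
    triples = []
    for line in llm(STRUCTURED_PROMPT + "\n\nText:\n" + text).splitlines():
        parts = [p.strip() for p in line.split("|")]
        if len(parts) == 3:
            triples.append(Triple(*parts))
    return triples

def build_kg(document: str, llm) -> list[Triple]:
    return extract_triples(resolve_coreferences(document, llm), llm)
```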
--------------------------------------------------------------------------------------------------------
Chain-of-Thought Hijacking
As reasoning models become more powerful, safety concerns emerge alongside their capabilities. This research reveals a critical vulnerability: the same reasoning processes that improve task performance can be exploited to bypass safety guardrails. The Chain-of-Thought Hijacking attack achieves alarmingly high success rates (94-100%) across major AI models by padding harmful requests with benign puzzle reasoning. Through mechanistic analysis, the researchers discovered that long benign reasoning chains dilute safety signals by redirecting attention away from harmful content. This work has crucial implications for AI safety research, model deployment strategies, content moderation systems, and regulatory frameworks. Understanding these vulnerabilities is essential for developing more robust safety mechanisms in next-generation AI systems used in education, healthcare, and public-facing applications.
Authors: Jianli Zhao, Tingchen Fu, Rylan Schaeffer, Mrinank Sharma, Fazl Barez
Link: https://arxiv.org/abs/2510.26418v1
Date: 2025-10-d
Summary:
Large reasoning models (LRMs) achieve higher task performance by allocating more inference-time compute, and prior works suggest this scaled reasoning may also strengthen safety by improving refusal. Yet we find the opposite: the same reasoning can be used to bypass safeguards. We introduce Chain-of-Thought Hijacking, a jailbreak attack on reasoning models. The attack pads harmful requests with long sequences of harmless puzzle reasoning. Across HarmBench, CoT Hijacking reaches a 99%, 94%, 100%, and 94% attack success rate (ASR) on Gemini 2.5 Pro, GPT o4 mini, Grok 3 mini, and Claude 4 Sonnet, respectively, far exceeding prior jailbreak methods for LRMs. To understand the effectiveness of our attack, we turn to a mechanistic analysis, which shows that mid layers encode the strength of safety checking, while late layers encode the verification outcome. Long benign CoT dilutes both signals by shifting attention away from harmful tokens. Targeted ablations of attention heads identified by this analysis causally decrease refusal, confirming their role in a safety subnetwork. These results show that the most interpretable form of reasoning, explicit CoT, can itself become a jailbreak vector when combined with final-answer cues. We release prompts, outputs, and judge decisions to facilitate replication.
--------------------------------------------------------------------------------------------------------
ComboBench: Can LLMs Manipulate Physical Devices to Play Virtual Reality Games?
Virtual reality gaming requires intuitive translation of high-level intentions into precise device manipulations—something humans do effortlessly but remains challenging for AI. ComboBench introduces a comprehensive benchmark evaluating whether large language models can replicate this embodied understanding across 262 VR game scenarios. Results reveal that while leading models like Gemini-1.5-Pro demonstrate strong task decomposition, they struggle with procedural reasoning and spatial understanding compared to humans. This research has significant applications in VR interface design, accessibility tools for disabled users, game AI development, and robotic teleoperation systems. The findings could inform the development of AI assistants that help users learn complex VR interactions, adaptive difficulty systems, and ultimately more intuitive human-computer interfaces.
Authors: Shuqing Li, Jiayi Yan, Chenyu Niu, Jen-tse Huang, Yun Peng, Wenxuan Wang, Yepang Liu, Michael R. Lyu
Link: https://arxiv.org/abs/2510.24706v1
Date: 2025-10-d
Summary:
Virtual Reality (VR) games require players to translate high-level semantic actions into precise device manipulations using controllers and head-mounted displays (HMDs). While humans intuitively perform this translation based on common sense and embodied understanding, whether Large Language Models (LLMs) can effectively replicate this ability remains underexplored. This paper introduces a benchmark, ComboBench, evaluating LLMs' capability to translate semantic actions into VR device manipulation sequences across 262 scenarios from four popular VR games: Half-Life: Alyx, Into the Radius, Moss: Book II, and Vivecraft. We evaluate seven LLMs, including GPT-3.5, GPT-4, GPT-4o, Gemini-1.5-Pro, LLaMA-3-8B, Mixtral-8x7B, and GLM-4-Flash, compared against annotated ground truth and human performance. Our results reveal that while top-performing models like Gemini-1.5-Pro demonstrate strong task decomposition capabilities, they still struggle with procedural reasoning and spatial understanding compared to humans. Performance varies significantly across games, suggesting sensitivity to interaction complexity. Few-shot examples substantially improve performance, indicating potential for targeted enhancement of LLMs' VR manipulation capabilities. We release all materials at https://sites.google.com/view/combobench.
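The abstract does not spell out the scoring, so the snippet below shows one plausible way to grade a predicted manipulation sequence against the annotated ground truth: longest-common-subsequence step recall, which credits correctly ordered steps while tolerating inserted extras. The metric choice and action names are assumptions, not the benchmark's published protocol.

```python
# Illustrative scoring for a ComboBench-style comparison of a model's
# device-manipulation sequence against annotated ground truth.

def lcs_length(pred: list[str], gold: list[str]) -> int:
    """Longest common subsequence: rewards correctly ordered steps even
    when the model inserts extra manipulations."""
    m, n = len(pred), len(gold)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m):
        for j in range(n):
            dp[i + 1][j + 1] = (dp[i][j] + 1 if pred[i] == gold[j]
                                else max(dp[i][j + 1], dp[i + 1][j]))
    return dp[m][n]

def step_recall(pred: list[str], gold: list[str]) -> float:
    return lcs_length(pred, gold) / len(gold) if gold else 0.0

gold = ["grip_controller", "press_trigger", "aim_hmd", "release_trigger"]
pred = ["grip_controller", "aim_hmd", "press_trigger", "release_trigger"]
print(step_recall(pred, gold))  # 0.75: one step is out of order
```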
--------------------------------------------------------------------------------------------------------
ViPER: Empowering the Self-Evolution of Visual Perception Abilities in Vision-Language Model
Vision-language models excel at many tasks but struggle with fine-grained visual perception—a critical limitation for real-world applications. ViPER introduces a self-bootstrapping framework that enables models to iteratively improve their perceptual abilities without compromising general capabilities. By structuring visual learning as a coarse-to-fine progressive process and integrating self-critiquing with reinforcement learning, ViPER creates a closed-loop training paradigm. Applied to Qwen2.5-VL, the system achieves up to 6% improvement on fine-grained perception tasks while maintaining generalizability. This breakthrough has applications in medical image analysis, quality control inspection, autonomous driving, surveillance systems, and scientific image interpretation. The self-improvement mechanism could accelerate the development of more capable vision systems without constant human supervision.
Authors: Juntian Zhang, Song Jin, Chuanqi Cheng, Yuhan Liu, Yankai Lin, Xun Zhang, Yufei Zhang, Fei Jiang, Guojun Yin, Wei Lin, Rui Yan
Link: https://arxiv.org/abs/2510.24285v1
Date: 2025-10-d
Summary:
The limited capacity for fine-grained visual perception presents a critical bottleneck for Vision-Language Models (VLMs) in real-world applications. Addressing this is challenging due to the scarcity of high-quality data and the limitations of existing methods: supervised fine-tuning (SFT) often compromises general capabilities, while reinforcement fine-tuning (RFT) prioritizes textual reasoning over visual perception. To bridge this gap, we propose a novel two-stage task that structures visual perception learning as a coarse-to-fine progressive process. Based on this task formulation, we develop ViPER, a self-bootstrapping framework specifically designed to enable iterative evolution through self-critiquing and self-prediction. By synergistically integrating image-level and instance-level reconstruction with a two-stage reinforcement learning strategy, ViPER establishes a closed-loop training paradigm, where internally synthesized data directly fuel the enhancement of perceptual ability. Applied to the Qwen2.5-VL family, ViPER produces the Qwen-Viper series. With an average gain of 1.7% on seven comprehensive benchmarks spanning various tasks and up to 6.0% on fine-grained perception, Qwen-Viper consistently demonstrates superior performance across different vision-language scenarios while maintaining generalizability. Beyond enabling self-improvement in perceptual capabilities, ViPER provides concrete evidence for the reciprocal relationship between generation and understanding, a breakthrough toward developing more autonomous and capable VLMs.
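The abstract describes the closed loop only at a high level; the schematic below renders it with placeholder modules to show where self-prediction and self-critiquing feed the RL update. Every function name here is an assumption, not the authors' code.

```python
# Schematic of a ViPER-style closed loop: the model synthesizes its own
# coarse-to-fine perception data, critiques it, and the critique score acts
# as the reinforcement signal. All modules are hypothetical placeholders.
def viper_round(model, images, rl_update):
    for img in images:
        coarse = model.describe(img)                   # stage 1: image-level
        fine = model.describe_regions(img, coarse)     # stage 2: instance-level
        critique = model.critique(img, coarse, fine)   # self-generated reward
        rl_update(model, trajectory=(coarse, fine), reward=critique.score)
    return model
```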
--------------------------------------------------------------------------------------------------------
MGA: Memory-Driven GUI Agent for Observation-Centric Interaction
Developing AI agents that can navigate complex computer interfaces remains challenging, with existing approaches suffering from error propagation and local exploration bias. MGA reframes GUI interaction around "observe first, then decide," modeling each step as an independent environment state with screenshot, spatial information, and structured memory. This observation-centric approach eliminates reliance on historical trajectories that amplify errors in traditional long-chain execution models. Experimental results demonstrate substantial gains in robustness, generalization, and efficiency across desktop applications and web interfaces. MGA has practical applications in automated testing, accessibility assistance for users with disabilities, robotic process automation, user interface evaluation, and AI-powered customer support. The framework could enable more reliable AI assistants for everyday computer tasks.
Authors: Weihua Cheng, Ersheng Ni, Wenlong Wang, Yifei Sun, Junming Liu, Wangyu Shen, Yirong Chen, Botian Shi, Ding Wang
Link: https://arxiv.org/abs/2510.24168v1
Date: 2025-10-d
Summary:
The rapid progress of Large Language Models (LLMs) and their multimodal extensions (MLLMs) has enabled agentic systems capable of perceiving and acting across diverse environments. A challenging yet impactful frontier is the development of GUI agents, which must navigate complex desktop and web interfaces while maintaining robustness and generalization. Existing paradigms typically model tasks as long-chain executions, concatenating historical trajectories into the context. While approaches such as Mirage and GTA1 refine planning or introduce multi-branch action selection, they remain constrained by two persistent issues: dependence on historical trajectories, which amplifies error propagation, and local exploration bias, where "decision-first, observation-later" mechanisms overlook critical interface cues. We introduce the Memory-Driven GUI Agent (MGA), which reframes GUI interaction around the principle of observe first, then decide. MGA models each step as an independent, context-rich environment state represented by a triad: current screenshot, task-agnostic spatial information, and a dynamically updated structured memory. Experiments on OSworld benchmarks, real desktop applications (Chrome, VSCode, VLC), and cross-task transfer demonstrate that MGA achieves substantial gains in robustness, generalization, and efficiency compared to state-of-the-art baselines. The code is publicly available at: https://anonymous.4open.science/r/MGA-3571.
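A minimal sketch of the state triad and the observe-first step, with an `agent` object standing in for the underlying MLLM; the interfaces are illustrative, not the authors' API.

```python
# Each step is an independent state (screenshot, task-agnostic spatial layout,
# structured memory) rather than a concatenation of past trajectories.
from dataclasses import dataclass, field

@dataclass
class EnvState:
    screenshot: bytes                            # current screen capture
    spatial_info: dict                           # e.g., widgets + bounding boxes
    memory: dict = field(default_factory=dict)   # dynamically updated, structured

def step(state: EnvState, agent) -> tuple[str, EnvState]:
    """Observe first, then decide: the action is conditioned only on the
    current triad, so earlier mistakes cannot propagate through a growing
    history window."""
    action = agent.decide(state.screenshot, state.spatial_info, state.memory)
    new_memory = agent.update_memory(state.memory, action)
    next_screenshot, next_spatial = agent.execute(action)
    return action, EnvState(next_screenshot, next_spatial, new_memory)
```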
--------------------------------------------------------------------------------------------------------
Taming the Tail: NoI Topology Synthesis for Mixed DL Workloads on Chiplet-Based Accelerators
Modern AI accelerators increasingly adopt chiplet-based architectures to improve scaling, but disaggregation introduces latency challenges in the Network-on-Interposer. Large model inference generates bursty memory traffic that inflates tail latency and violates service-level agreements. This research introduces PARL, a reinforcement learning system that synthesizes custom NoI topologies balancing throughput, latency, and power. By formulating topology design as multi-objective optimization and introducing an Interference Score to quantify contention, PARL-generated topologies reduce worst-case slowdown to 1.2x while meeting SLAs. Applications include cloud AI inference services, edge computing platforms, autonomous vehicle processors, and data center accelerators. This workload-aware design approach could significantly improve the efficiency and reliability of next-generation AI hardware infrastructure.
Authors: Arnav Shukla, Harsh Sharma, Srikant Bharadwaj, Vinayak Abrol, Sujay Deb
Link: https://arxiv.org/abs/2510.24113v1
Date: 2025-10-d
Summary:
Heterogeneous chiplet-based systems improve scaling by disaggregating CPUs/GPUs and emerging technologies (HBM/DRAM). However, this on-package disaggregation introduces latency in the Network-on-Interposer (NoI). We observe that in modern large-model inference, parameters and activations routinely move back and forth from HBM/DRAM, injecting large, bursty flows into the interposer. These memory-driven transfers inflate tail latency and violate Service Level Agreements (SLAs) across k-ary n-cube baseline NoI topologies. To address this gap we introduce an Interference Score (IS) that quantifies worst-case slowdown under contention. We then formulate NoI synthesis as a multi-objective optimization (MOO) problem. We develop PARL (Partition-Aware Reinforcement Learner), a topology generator that balances throughput, latency, and power. PARL-generated topologies reduce contention at the memory cut, meet SLAs, and cut worst-case slowdown to 1.2x while maintaining competitive mean throughput relative to link-rich meshes. Overall, this reframes NoI design for heterogeneous chiplet accelerators with workload-aware objectives.
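The abstract defines the Interference Score as worst-case slowdown under contention, so one natural reading is the maximum ratio of contended to isolated latency across flows. The sketch below uses that reading; the exact formula and the flow names are assumptions.

```python
# A hedged reading of the Interference Score (IS) for an NoI topology.

def interference_score(isolated_latency: dict, contended_latency: dict) -> float:
    """IS = worst-case slowdown across flows when bursty HBM/DRAM transfers
    share interposer links with compute traffic."""
    return max(contended_latency[f] / isolated_latency[f]
               for f in isolated_latency)

# Example: three flows; the memory-bound flow slows down most under contention.
iso = {"gpu_to_gpu": 1.0, "hbm_read": 1.0, "dram_spill": 1.0}
con = {"gpu_to_gpu": 1.05, "hbm_read": 1.20, "dram_spill": 1.12}
print(interference_score(iso, con))  # 1.2 -> an SLA-style 1.2x worst case
```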
--------------------------------------------------------------------------------------------------------
Human-centered AI demands robots that understand natural language, plan complex tasks, and execute them reliably—capabilities that remain elusive in current systems. PFEA introduces a comprehensive framework integrating voice interaction, vision-language planning, and action execution with continuous feedback evaluation. The system achieves 28% higher task success rates compared to baseline approaches by combining vision-based planning with natural language instruction conversion. This architecture has immediate applications in assistive robotics for elderly care, household automation, warehouse logistics, rehabilitation therapy, and collaborative manufacturing. By enabling robots to understand and execute high-level natural language commands while adapting based on performance feedback, PFEA advances the vision of intuitive human-robot collaboration in everyday environments.
Authors: Wenbin Ding, Jun Chen, Mingjia Chen, Fei Xie, Qi Mao, Philip Dames
Link: https://arxiv.org/abs/2510.24109v1
Date: 2025-10-d
Summary:
The rapid advancement of Large Language Models (LLMs) has marked a significant breakthrough in Artificial Intelligence (AI), ushering in a new era of Human-centered Artificial Intelligence (HAI). HAI aims to better serve human welfare and needs, thereby placing higher demands on the intelligence level of robots, particularly in aspects such as natural language interaction, complex task planning, and execution. Intelligent agents powered by LLMs have opened up new pathways for realizing HAI. However, existing LLM-based embodied agents often lack the ability to plan and execute complex natural language control tasks online. This paper explores the implementation of intelligent robotic manipulation agents based on Vision-Language Models (VLMs) in the physical world. We propose a novel embodied agent framework for robots, which comprises a human-robot voice interaction module, a vision-language agent module, and an action execution module. The vision-language agent itself includes a vision-based task planner, a natural language instruction converter, and a task performance feedback evaluator. Experimental results demonstrate that our agent achieves a 28% higher average task success rate in both simulated and real environments compared to approaches relying solely on LLM+CLIP, significantly improving the execution success rate of high-level natural language instruction tasks.
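To make the module layout concrete, here is a structural sketch of the loop the abstract describes: voice in, vision-language planning, instruction conversion, execution, with the feedback evaluator closing the loop. All module interfaces and field names are assumptions.

```python
# Schematic PFEA-style agent loop; `modules` bundles hypothetical components.

def run_task(utterance: str, modules, max_retries: int = 2):
    goal = modules.speech_to_text(utterance)             # voice interaction
    for _ in range(max_retries + 1):
        plan = modules.vision_planner(goal)              # vision-based task plan
        actions = modules.instruction_converter(plan)    # NL steps -> robot actions
        result = modules.executor(actions)               # action execution
        feedback = modules.evaluator(goal, result)       # performance feedback
        if feedback.success:
            return result
        goal = feedback.revised_goal                     # re-plan using feedback
    return result
```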
--------------------------------------------------------------------------------------------------------
E-commerce platforms manage massive catalogs requiring quality assessment across thousands of product categories and attributes, traditionally demanding extensive human expertise. This research introduces a training-free cascade system that automatically generates and refines prompts for quality evaluation without labeled data or fine-tuning. Starting from minimal human-crafted seeds, the system optimizes instructions for catalog-specific requirements, achieving 8-10% improvements in precision and recall. Most remarkably, it reduces domain expert effort by 99%—from 5.1 hours to 3 minutes per attribute—while maintaining effectiveness across five languages. Applications include marketplace quality control, product listing optimization, catalog management automation, seller onboarding assistance, and cross-border commerce. This approach democratizes quality assessment capabilities for e-commerce platforms lacking extensive training data.
Authors: Soham Satyadharma, Fatemeh Sheikholeslami, Swati Kaul, Aziz Umit Batur, Suleiman A. Khan
Link: https://arxiv.org/abs/2510.23941v1
Date: 2025-10-d
Summary:
We introduce a novel, training-free cascade for auto-prompting Large Language Models (LLMs) to assess product quality in e-commerce. Our system requires no training labels or model fine-tuning, instead automatically generating and refining prompts for evaluating attribute quality across tens of thousands of product category-attribute pairs. Starting from a seed of human-crafted prompts, the cascade progressively optimizes instructions to meet catalog-specific requirements. This approach bridges the gap between general language understanding and domain-specific knowledge at scale in complex industrial catalogs. Our extensive empirical evaluation shows the auto-prompt cascade improves precision and recall by 8-10% over traditional chain-of-thought prompting. Notably, it achieves these gains while reducing domain expert effort from 5.1 hours to 3 minutes per attribute, a 99% reduction. Additionally, the cascade generalizes effectively across five languages and multiple quality assessment tasks, consistently maintaining performance gains.
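One way to read the cascade is as a greedy hill-climb over prompt text, with an automatic scorer standing in for labeled data. The refinement instruction and the `judge` interface below are assumptions for illustration, not the paper's method.

```python
# Minimal sketch of a training-free auto-prompt cascade: start from a human
# seed prompt, let the LLM propose refinements, and keep a refinement only
# if an automatic judge (e.g., self-consistency or agreement on a small
# validation slice) scores it higher.

def cascade(seed_prompt: str, examples: list, llm, judge, rounds: int = 3) -> str:
    best_prompt, best_score = seed_prompt, judge(seed_prompt, examples, llm)
    for _ in range(rounds):
        candidate = llm(
            "Improve this product-quality-evaluation prompt for the attribute "
            "at hand; keep the output format identical:\n" + best_prompt
        )
        score = judge(candidate, examples, llm)
        if score > best_score:            # greedy hill-climb over prompt text
            best_prompt, best_score = candidate, score
    return best_prompt
```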
--------------------------------------------------------------------------------------------------------
AI-generated imagery's increasing realism poses authenticity verification challenges across media and security domains. This system combines a lightweight convolutional classifier with vision-language models to detect, localize, and explain artifacts in images, achieving 96.5% accuracy while maintaining 175ms inference time on standard CPUs. By categorizing 70 artifact types into semantic groups and generating interpretable explanations, the system makes authenticity detection transparent for both humans and machines. Edge device deployment enables applications in digital forensics, social media content moderation, journalism verification, legal evidence authentication, and industrial quality inspection. The explainability component is particularly valuable for building trust in automated detection systems and supporting human decision-making in high-stakes authenticity verification scenarios.
Authors: Aryan Mathur, Asaduddin Ahmed, Pushti Amit Vasoya, Simeon Kandan Sonar, Yasir Z, Madesh Kuppusamy
Link: https://arxiv.org/abs/2510.23775v1
Date: 2025-10-d
Summary:
The increasing realism of AI-generated imagery poses challenges for verifying visual authenticity. We present an explainable image authenticity detection system that combines a lightweight convolutional classifier ("Faster-Than-Lies") with a Vision-Language Model (Qwen2-VL-7B) to classify, localize, and explain artifacts in 32x32 images. Our model achieves 96.5% accuracy on the extended CiFAKE dataset augmented with adversarial perturbations and maintains an inference time of 175ms on 8-core CPUs, enabling deployment on local or edge devices. Using autoencoder-based reconstruction error maps, we generate artifact localization heatmaps, which enhance interpretability for both humans and the VLM. We further categorize 70 visual artifact types into eight semantic groups and demonstrate explainable text generation for each detected anomaly. This work highlights the feasibility of combining visual and linguistic reasoning for interpretable authenticity detection in low-resolution imagery and outlines potential cross-domain applications in forensics, industrial inspection, and social media moderation.
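The localization step lends itself to a short sketch: the per-pixel reconstruction error of an autoencoder trained on authentic images becomes the artifact heatmap. The autoencoder itself is a stand-in here, and the error norm is an assumption.

```python
# Autoencoder-based artifact localization, as described in the abstract.
import numpy as np

def artifact_heatmap(image: np.ndarray, autoencoder) -> np.ndarray:
    """image: HxWxC float array in [0, 1]. Returns an HxW map, normalized so
    the hottest pixel is 1; high values mark likely generator artifacts."""
    recon = autoencoder(image)                   # reconstruction of the input
    err = np.abs(image - recon).mean(axis=-1)    # per-pixel L1 error
    return err / (err.max() + 1e-8)
```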
--------------------------------------------------------------------------------------------------------
ReCode: Unify Plan and Action for Universal Granularity Control
Current AI agents struggle with decisions requiring variable granularity, hampered by rigid separation between high-level planning and low-level actions. ReCode introduces a unified code representation where planning becomes abstract placeholder functions recursively decomposed into finer-grained sub-functions until reaching primitive actions. This paradigm shift enables dynamic granularity control while generating rich multi-level training data that teaches hierarchical decision-making. Extensive experiments demonstrate significant performance improvements and exceptional data efficiency over advanced baselines. Applications include robotic task planning, autonomous software development, business process automation, game AI, and intelligent tutoring systems. By dissolving the plan-action boundary, ReCode advances toward more adaptable AI agents capable of seamlessly operating across decision complexities in real-world scenarios.
Authors: Zhaoyang Yu, Jiayi Zhang, Huixue Su, Yufan Zhao, Yifan Wu, Mingyi Deng, Jinyu Xiang, Yizhang Lin, Lingxiao Tang, Yingchao Li, Yuyu Luo, Bang Liu, Chenglin Wu
Link: https://arxiv.org/abs/2510.23564v2
Date: 2025-10-d
Summary:
Real-world tasks require decisions at varying granularities, and humans excel at this by leveraging a unified cognitive representation where planning is fundamentally understood as a high-level form of action. However, current Large Language Model (LLM)-based agents lack this crucial capability to operate fluidly across decision granularities. This limitation stems from existing paradigms that enforce a rigid separation between high-level planning and low-level action, which impairs dynamic adaptability and limits generalization. We propose ReCode (Recursive Code Generation), a novel paradigm that addresses this limitation by unifying planning and action within a single code representation. In this representation, ReCode treats high-level plans as abstract placeholder functions, which the agent then recursively decomposes into finer-grained sub-functions until reaching primitive actions. This recursive approach dissolves the rigid boundary between plan and action, enabling the agent to dynamically control its decision granularity. Furthermore, the recursive structure inherently generates rich, multi-granularity training data, enabling models to learn hierarchical decision-making processes. Extensive experiments show ReCode significantly surpasses advanced baselines in inference performance and demonstrates exceptional data efficiency in training, validating our core insight that unifying planning and action through recursive code generation is a powerful and effective approach to achieving universal granularity control. The code is available at https://github.com/FoundationAgents/ReCode.
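The recursion at the heart of ReCode is easy to show in miniature: a plan is a placeholder function that gets expanded into sub-functions until every call is primitive. The `llm` call, primitive set, and prompt below are illustrative stand-ins.

```python
# Sketch of recursive code generation: decompose placeholder functions until
# only primitive actions remain.
PRIMITIVES = {"move_to", "grasp", "release", "type_text", "click"}

def expand(task: str, llm, depth: int = 0, max_depth: int = 5) -> list[str]:
    """Recursively decompose `task`; returns a flat list of primitive actions.
    `max_depth` bounds the recursion in this sketch."""
    if task.split("(")[0] in PRIMITIVES or depth >= max_depth:
        return [task]
    # Ask the model to rewrite the placeholder as a short sequence of
    # sub-functions, which may themselves be placeholders.
    subtasks = llm(f"Decompose into sub-functions, one per line: {task}")
    plan = []
    for sub in subtasks.splitlines():
        plan.extend(expand(sub.strip(), llm, depth + 1, max_depth))
    return plan
```

A side effect the abstract highlights falls out naturally: every intermediate expansion is a (plan, sub-plan) pair, so the recursion itself generates multi-granularity training data.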
--------------------------------------------------------------------------------------------------------
Education Paradigm Shift To Maintain Human Competitive Advantage Over AI
The rise of generative AI, particularly large language models, has transformed intellectual labor automation from hypothetical concern to practical reality. This paper examines fundamental weaknesses in current AI technologies, particularly LLMs, arguing that their root limitations are unfixable with existing approaches. The authors propose educational reforms within the constructivist paradigm to cultivate human skills that maintain competitive advantage over AI tools. This work has profound implications for curriculum design, workforce development, higher education strategy, vocational training, and lifelong learning programs. As AI increasingly handles routine intellectual tasks, education systems must pivot toward developing uniquely human capabilities—critical thinking, creativity, ethical reasoning, and complex problem-solving—ensuring meaningful human participation in an AI-augmented knowledge economy.
Authors: Stanislav Selitskiy, Chihiro Inoue
Link: https://arxiv.org/abs/2510.23436v1
Date: 2025-10-d
Summary:
Discussion about the replacement of intellectual human labour by "thinking machines" has been present in public and expert discourse since the creation of Artificial Intelligence (AI) as an idea and terminology in the middle of the twentieth century. Until recently, it was more of a hypothetical concern. In recent years, however, with the rise of Generative AI, especially Large Language Models (LLMs), and particularly with the widespread popularity of the ChatGPT model, that concern has become practical. Many domains of human intellectual labour have to adapt to the new AI tools that give humans new functionality and opportunity, but that also question the viability and necessity of some human work that used to be considered intellectual yet has now become an easily automatable commodity. Education, unexpectedly, now carries an especially crucial role: charting long-range strategies for discovering viable human skills that would guarantee humans a place in a world of ubiquitous AI use in the intellectual sphere. We highlight weaknesses of current AI and, especially, of its LLM-based core, show that the root causes of LLMs' weaknesses cannot be fixed by current technologies, and propose directions within the constructivist paradigm for changes in Education that ensure long-term advantages of humans over AI tools.
--------------------------------------------------------------------------------------------------------
Towards a Generalizable AI for Materials Discovery: Validation through Immersion Coolant Screening
Most AI models for materials discovery remain narrowly specialized, requiring retraining for each new property prediction task. GATE addresses this limitation by jointly learning 34 physicochemical properties across thermal, electrical, mechanical, and optical domains within a shared geometric space. This generalizable approach reduces false positives in multi-criteria screening by capturing cross-property correlations. Applied to immersion cooling fluid discovery for data centers—a stringent real-world challenge—GATE identified 92,861 promising candidates from billions screened, with four experimentally validated showing commercial-grade performance. Applications extend to battery materials, catalysts, semiconductors, sustainable chemicals, and pharmaceutical compounds. GATE's problem-agnostic design eliminates repeated data collection and retraining cycles, potentially accelerating materials innovation across industries while reducing research costs.
Authors: Hyunseung Kim, Dae-Woong Jeong, Changyoung Park, Won-Ji Lee, Ha-Eun Lee, Ji-Hye Lee, Rodrigo Hormazabal, Sung Moon Ko, Sumin Lee, Soorin Yim, Chanhui Lee, Sehui Han, Sang-Ho Cha, Woohyung Lim
Link: https://arxiv.org/abs/2510.23371v2
Date: 2025-10-d
Summary:
Artificial intelligence (AI) has emerged as a powerful accelerator of materials discovery, yet most existing models remain problem-specific, requiring additional data collection and retraining for each new property. Here we introduce and validate GATE (Geometrically Aligned Transfer Encoder) -- a generalizable AI framework that jointly learns 34 physicochemical properties spanning thermal, electrical, mechanical, and optical domains. By aligning these properties within a shared geometric space, GATE captures cross-property correlations that reduce disjoint-property bias -- a key factor causing false positives in multi-criteria screening. To demonstrate its generalizable utility, GATE was applied, without any problem-specific model reconfiguration, to the discovery of immersion cooling fluids for data centers, a stringent real-world challenge defined by the Open Compute Project (OCP). Screening billions of candidates, GATE identified 92,861 molecules as promising for practical deployment. Four were validated experimentally or against literature data, showing strong agreement with wet-lab measurements and performance comparable to or exceeding a commercial coolant. These results establish GATE as a generalizable AI platform readily applicable across diverse materials discovery tasks.
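The screening step itself is simple once a joint predictor exists; the sketch below shows the multi-criteria filter, with `predict_all`, the property names, and the thresholds all as illustrative assumptions rather than the OCP spec.

```python
# Multi-criteria screening with a single joint property predictor. Joint
# prediction captures cross-property correlations, which is what cuts false
# positives versus running 34 independent models.

def screen(candidates, predict_all, specs):
    """predict_all(mol) -> {property: value}; specs: {property: (lo, hi)}.
    A candidate passes only if every constraint holds simultaneously."""
    passed = []
    for mol in candidates:
        props = predict_all(mol)
        if all(lo <= props[p] <= hi for p, (lo, hi) in specs.items()):
            passed.append(mol)
    return passed

# Illustrative spec-sheet fragment for an immersion coolant (made-up numbers):
# specs = {"thermal_conductivity": (0.13, 1.0), "flash_point_C": (150, 1e9)}
```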
--------------------------------------------------------------------------------------------------------
Objective, scalable teaching quality measurement remains an education challenge, with large language models offering potential solutions that have struggled with complex classroom observation instruments. This research develops custom LLMs using sentence-level embeddings, which are better suited for long-form classroom transcripts than conventional tokenization. These specialized models achieve human-level and even super-human performance, correlating with expert ratings above 0.65 and surpassing average human-human rater agreement. Analysis reveals advanced models attribute more score variation to lesson-level features rather than isolated utterances, challenging single-turn annotation paradigms. Model scores align with teacher value-added measures, indicating they capture learning-relevant features. Applications include teacher professional development, instructional coaching, education research, certification processes, and continuous feedback systems. This methodology could democratize access to expert-level teaching evaluation.
Authors: Michael Hardy
Link: https://arxiv.org/abs/2510.22968v1
Date: 2025-10-d
Summary:
Objective and scalable measurement of teaching quality is a persistent challenge in education. While Large Language Models (LLMs) offer potential, general-purpose models have struggled to reliably apply complex, authentic classroom observation instruments. This paper uses custom LLMs built on sentence-level embeddings, an architecture better suited for the long-form, interpretive nature of classroom transcripts than conventional subword tokenization. We systematically evaluate five different sentence embeddings under a data-efficient training regime designed to prevent overfitting. Our results demonstrate that these specialized models can achieve human-level and even super-human performance, reaching correlations with expert human ratings above 0.65 and surpassing the average human-human rater correlation. Further, through analysis of annotation context windows, we find that more advanced models, those better aligned with human judgments, attribute a larger share of score variation to lesson-level features rather than isolated utterances, challenging the sufficiency of single-turn annotation paradigms. Finally, to assess external validity, we find that aggregate model scores align with teacher value-added measures, indicating they are capturing features relevant to student learning. However, this trend does not hold at the individual item level, suggesting that while the models learn useful signals, they have not yet achieved full generalization. This work establishes a viable and powerful new methodology for AI-driven instructional measurement, offering a path toward providing scalable, reliable, and valid feedback for educator development.
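A minimal sketch of the architecture described here: embed a transcript sentence by sentence, pool, and regress to a rubric score. The embedding model, the mean-pooling choice, and the scikit-learn-style regressor interface are assumptions.

```python
# Sentence-level embedding pipeline for transcript scoring.
import numpy as np

def score_transcript(sentences: list[str], embed, regressor) -> float:
    """`embed` maps one sentence to a fixed-size vector (any sentence-embedding
    model); mean-pooling keeps lesson-level features, which the paper finds
    carry most of the score variation."""
    vecs = np.stack([embed(s) for s in sentences])
    lesson_vec = vecs.mean(axis=0)              # lesson-level representation
    return float(regressor.predict(lesson_vec[None, :])[0])
```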
--------------------------------------------------------------------------------------------------------
McKean-Vlasov equations describe mean-field dynamics in systems ranging from statistical physics to machine learning. This research establishes exact equivalence between stationary solutions and infinite-dimensional Fourier coefficient equations, enabling transparent characterization of bifurcations, periodicity, and resonance structures. Applied to Noisy Mean-Field Transformer models, the theory reveals how temperature parameters affect bifurcation geometry and create metastable multi-mode states. The framework characterizes discontinuous phase transitions through free energy landscape analysis. Applications include understanding attention mechanisms in transformers, optimizing neural network training dynamics, analyzing collective behavior in multi-agent systems, improving generative models, and developing theoretical foundations for deep learning. These insights could inform architecture design decisions and training strategies for next-generation AI models.
Authors: Krishnakumar Balasubramanian, Sayan Banerjee, Philippe Rigollet
Link: https://arxiv.org/abs/2510.20094v2
Date: 2025-10-d
Summary:
We study stationary solutions of McKean-Vlasov equations on the circle. Our main contributions stem from observing an exact equivalence between solutions of the stationary McKean-Vlasov equation and an infinite-dimensional quadratic system of equations over Fourier coefficients, which allows explicit characterization of the stationary states in a sequence space rather than a function space. This framework provides a transparent description of local bifurcations, their periodicity, and resonance structures, while accommodating singular potentials. We derive analytic expressions that characterize the emergence, form and shape (supercritical, critical, subcritical or transcritical) of bifurcations involving possibly multiple Fourier modes and connect them with discontinuous phase transitions. We also characterize, under suitable assumptions, the detailed structure of the stationary bifurcating solutions that are accurate up to an arbitrary number of Fourier modes. At the global level, we establish regularity and concavity properties of the free energy landscape, proving existence, compactness, and coexistence of globally minimizing stationary measures, further identifying discontinuous phase transitions with points of non-differentiability of the minimum free energy map. As an application, we specialize the theory to the Noisy Mean-Field Transformer model, where we show how changing the inverse temperature parameter $\beta$ affects the geometry of the infinitely many bifurcations from the uniform measure. We also explain how increasing $\beta$ can lead to a rich class of approximate multi-mode stationary solutions which can be seen as "metastable states". Further, a sharp transition from continuous to discontinuous (first-order) phase behavior is observed as $\beta$ increases.
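For orientation, here is a schematic of the correspondence in standard McKean-Vlasov notation; the symbols (interaction potential $W$, inverse temperature $\beta$) are conventional choices, not necessarily the paper's exact normalization.

```latex
% Stationary states solve a Gibbs fixed-point equation on the circle:
\[
  \rho_\beta(x) \;=\;
  \frac{\exp\!\big(-\beta\,(W \ast \rho_\beta)(x)\big)}
       {\int_0^{2\pi} \exp\!\big(-\beta\,(W \ast \rho_\beta)(y)\big)\,dy},
  \qquad x \in [0, 2\pi).
\]
% Expanding the density in Fourier modes,
\[
  \rho_\beta(x) \;=\; \frac{1}{2\pi}\Big(1 + \sum_{k \ge 1} a_k \cos(kx)
                                            + b_k \sin(kx)\Big),
\]
% the fixed-point equation becomes a quadratic system
% $F_k(\{a_j\},\{b_j\};\beta) = 0$ over the coefficients; bifurcations from
% the uniform measure (all $a_k = b_k = 0$) occur where the linearization of
% this system loses invertibility.
```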
--------------------------------------------------------------------------------------------------------
3D Optimization for AI Inference Scaling: Balancing Accuracy, Cost, and Latency
AI inference scaling typically relies on simplistic one-dimensional heuristics or two-dimensional trade-offs, ignoring real-world constraints on cost and latency. This research introduces a 3D optimization framework jointly calibrating accuracy, cost, and latency within a unified decision space, enabling constraint-aware inference scaling. Monte Carlo simulations across scenarios and models evaluate four optimization methods, revealing that knee-point optimization achieves the best balance while accuracy-maximization suits precision-critical applications. This framework shapes feasible spaces that lower-dimensional approaches miss, enabling environment-adaptive inference scaling. Applications include cloud AI services with SLA requirements, edge computing under resource constraints, mobile AI applications balancing battery and performance, serverless computing cost optimization, and multi-tenant AI platforms. The theoretical foundation supports deployment-aware scaling across diverse operational contexts.
Authors: Minseok Jung, Abhas Ricky, Muhammad Rameez Chatni
Link: https://arxiv.org/abs/2510.18905v2
Date: 2025-10-d
Summary:
AI inference scaling is often tuned through 1D heuristics (a fixed number of reasoning passes) or 2D bivariate trade-offs (e.g., performance vs. compute), which fail to consider cost and latency constraints. We introduce a 3D optimization framework that jointly calibrates accuracy, cost, and latency within a unified decision space, enabling constraint-aware inference scaling. Using Monte Carlo simulations across three representative scenarios and nine simulated large language models, we evaluate four optimization methods to address the 3D multi-objective optimization (MOO) problem. Framing inference scaling as MOO shapes a feasible space that 1D and 2D optimizations fail to capture, enabling environment-adaptive selection of the inference-scaling parameter k. Results show that knee-point optimization achieves the best balance, while accuracy-maximization remains favorable when precision is prioritized. The framework establishes a theoretical foundation for deployment-aware inference scaling across diverse operational contexts.
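Knee-point selection admits a compact illustration: among (accuracy, cost, latency) triples, pick the configuration closest to the utopia corner after min-max normalization. This is one standard reading of knee-point optimization; the paper's exact formulation may differ.

```python
# Illustrative knee-point selection over (accuracy, cost, latency) triples.
import numpy as np

def knee_point(points: np.ndarray) -> int:
    """points: (n, 3) array of (accuracy, cost, latency); higher accuracy and
    lower cost/latency are better. Returns the index of the knee point."""
    # Normalize each objective to [0, 1] with 1 = best.
    acc = (points[:, 0] - points[:, 0].min()) / np.ptp(points[:, 0])
    cost = 1 - (points[:, 1] - points[:, 1].min()) / np.ptp(points[:, 1])
    lat = 1 - (points[:, 2] - points[:, 2].min()) / np.ptp(points[:, 2])
    scores = np.stack([acc, cost, lat], axis=1)
    # Knee = closest point to the (1, 1, 1) utopia corner.
    return int(np.argmin(np.linalg.norm(scores - 1.0, axis=1)))

cfgs = np.array([[0.70, 1.0, 0.5],    # cheap, fast, weak
                 [0.85, 3.0, 1.2],    # balanced
                 [0.86, 9.0, 4.0]])   # marginal accuracy gain, steep cost
print(knee_point(cfgs))  # 1: near-best accuracy without the steep cost/latency
```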
--------------------------------------------------------------------------------------------------------
A Principle of Targeted Intervention for Multi-Agent Reinforcement Learning
Steering cooperative multi-agent systems toward desired outcomes is challenging, especially when global guidance proves impractical in large-scale scenarios. This work employs multi-agent influence diagrams to analyze interaction paradigms and introduces targeted intervention—applying guidance to single agents rather than entire systems. The Pre-Strategy Intervention technique realizes this paradigm through causal inference, achieving composite outcomes integrating task goals and desired behaviors. Relevance graph analysis provides tools for evaluating whether learning paradigms work under specific interaction designs. Applications include autonomous vehicle coordination, robotic warehouse systems, smart grid management, drone swarm control, and game AI development. By reducing intervention complexity while maintaining effectiveness, targeted intervention enables more practical deployment of cooperative AI systems in real-world multi-agent scenarios.
Authors: Anjie Liu, Jianhong Wang, Samuel Kaski, Jun Wang, Mengyue Yang
Link: https://arxiv.org/abs/2510.17697v3
Date: 2025-10-d
Summary:
Steering cooperative multi-agent reinforcement learning (MARL) towards desired outcomes is challenging, particularly when the global guidance from a human on the whole multi-agent system is impractical in a large-scale MARL. On the other hand, designing external mechanisms (e.g., intrinsic rewards and human feedback) to coordinate agents mostly relies on empirical studies, lacking an easy-to-use research tool. In this work, we employ multi-agent influence diagrams (MAIDs) as a graphical framework to address the above issues. First, we introduce the concept of MARL interaction paradigms (orthogonal to MARL learning paradigms), using MAIDs to analyze and visualize both unguided self-organization and global guidance mechanisms in MARL. Then, we design a new MARL interaction paradigm, referred to as the targeted intervention paradigm, which is applied to only a single targeted agent, so the problem of global guidance can be mitigated. In implementation, we introduce a causal inference technique, referred to as Pre-Strategy Intervention (PSI), to realize the targeted intervention paradigm. Since MAIDs can be regarded as a special class of causal diagrams, a composite desired outcome that integrates the primary task goal and an additional desired outcome can be achieved by maximizing the corresponding causal effect through the PSI. Moreover, the bundled relevance graph analysis of MAIDs provides a tool to identify whether a MARL learning paradigm is workable under the design of a MARL interaction paradigm. In experiments, we demonstrate the effectiveness of our proposed targeted intervention and verify the results of the relevance graph analysis.
--------------------------------------------------------------------------------------------------------
Cohort Discovery: A Survey on LLM-Assisted Clinical Trial Recruitment
Clinical trial recruitment faces persistent challenges matching trials with eligible patients despite natural language trial descriptions and mixed-format patient data. Large language models offer knowledge aggregation and reasoning capabilities suited for this task, yet adoption remains limited due to reliance on proprietary models and weak evaluation benchmarks. This comprehensive survey analyzes trial-patient matching tasks, contextualizing emerging LLM-based approaches while critically examining existing benchmarks, methodologies, and evaluation frameworks. The work identifies challenges to LLM adoption in clinical research and outlines future directions. Applications include accelerating drug development, improving trial diversity, reducing recruitment costs, enhancing patient outcomes through trial access, and supporting personalized medicine initiatives. Rigorous evaluation frameworks developed here could advance trustworthy AI deployment in healthcare.
Authors: Shrestha Ghosh, Moritz Schneider, Carina Reinicke, Carsten Eickhoff
Link: https://arxiv.org/abs/2506.15301v2
Date: 2025-10-d
Summary:
Recent advances in LLMs have greatly improved general-domain NLP tasks. Yet, their adoption in critical domains, such as clinical trial recruitment, remains limited. As trials are designed in natural language and patient data is represented as both structured and unstructured text, the task of matching trials and patients benefits from knowledge aggregation and reasoning abilities of LLMs. Classical approaches are trial-specific and LLMs with their ability to consolidate distributed knowledge hold the potential to build a more general solution. Yet recent applications of LLM-assisted methods rely on proprietary models and weak evaluation benchmarks. In this survey, we are the first to analyze the task of trial-patient matching and contextualize emerging LLM-based approaches in clinical trial recruitment. We critically examine existing benchmarks, approaches and evaluation frameworks, the challenges to adopting LLM technologies in clinical research and exciting future directions.
--------------------------------------------------------------------------------------------------------
SATURN: SAT-based Reinforcement Learning to Unleash Language Model Reasoning
Designing reinforcement learning tasks that effectively develop language model reasoning capabilities remains an open challenge. SATURN addresses limitations of existing tasks—scalability, verifiability, and controllable difficulty—by using Boolean Satisfiability problems for training. The framework enables scalable task construction, rule-based verification, and precise difficulty control through curriculum learning that progressively challenges models from easy to hard. Applied to DeepSeek models, SATURN achieves substantial improvements: +14.0 to +28.1 pass@3 on SAT problems, +1.8 to +4.9 on math and programming benchmarks, and +8.8% over state-of-the-art RL approaches. Applications include developing more capable reasoning models, improving mathematical problem-solving, advancing automated theorem proving, and understanding reasoning development. The methodology provides a principled approach to cultivating systematic reasoning capabilities.
Authors: Huanyu Liu, Jia Li, Hao Zhu, Kechi Zhang, Yihong Dong, Ge Li
Link: https://arxiv.org/abs/2505.16368v2
Date: 2025-10-d
Summary:
How to design reinforcement learning (RL) tasks that effectively unleash the reasoning capability of large language models (LLMs) remains an open question. Existing RL tasks (e.g., math, programming, and constructing reasoning tasks) suffer from three key limitations: (1) Scalability. They rely heavily on human annotation or expensive LLM synthesis to generate sufficient training data. (2) Verifiability. LLMs' outputs are hard to verify automatically and reliably. (3) Controllable Difficulty. Most tasks lack fine-grained difficulty control, making it hard to train LLMs to develop reasoning ability from easy to hard. To address these limitations, we propose Saturn, a SAT-based RL framework that uses Boolean Satisfiability (SAT) problems to train and evaluate LLMs' reasoning. Saturn enables scalable task construction, rule-based verification, and precise difficulty control. Saturn designs a curriculum learning pipeline that continuously improves LLMs' reasoning capability by constructing SAT tasks of increasing difficulty and training LLMs from easy to hard. To ensure stable training, we design a principled mechanism to control difficulty transitions. We introduce Saturn-2.6k, a dataset of 2,660 SAT problems with varying difficulty. It supports the evaluation of how LLM reasoning changes with problem difficulty. We apply Saturn to DeepSeek-R1-Distill-Qwen and obtain Saturn-1.5B and Saturn-7B. We achieve several notable results: (1) On SAT problems, Saturn-1.5B and Saturn-7B achieve average pass@3 improvements of +14.0 and +28.1, respectively. (2) On math and programming tasks, Saturn-1.5B and Saturn-7B improve average scores by +4.9 and +1.8 on benchmarks (e.g., AIME, LiveCodeBench). (3) Compared to the state-of-the-art (SOTA) approach in constructing RL tasks, Saturn achieves further improvements of +8.8%. We release the source code, data, and models to support future research.
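The two properties that make SAT attractive here, cheap generation at a controlled difficulty and trivial rule-based verification, fit in a few lines. The difficulty knobs below (variable count, clause count, clause width) are illustrative; the paper's curriculum controls difficulty with its own mechanism.

```python
# Random k-SAT generation plus a rule-based verifier for RL rewards.
import random

def gen_sat(n_vars: int, n_clauses: int, k: int = 3, seed: int = 0):
    """Random k-SAT instance; a clause is a list of signed variable indices
    (positive = variable, negative = its negation)."""
    rng = random.Random(seed)
    return [[rng.choice([-1, 1]) * v
             for v in rng.sample(range(1, n_vars + 1), k)]
            for _ in range(n_clauses)]

def verify(instance, assignment: dict[int, bool]) -> bool:
    """Rule-based reward: True iff every clause has a satisfied literal."""
    return all(any(assignment[abs(lit)] == (lit > 0) for lit in clause)
               for clause in instance)

inst = gen_sat(n_vars=5, n_clauses=10, k=3)   # raise n_vars/n_clauses to harden
print(verify(inst, {v: True for v in range(1, 6)}))
```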
--------------------------------------------------------------------------------------------------------
A method for the systematic generation of graph XAI benchmarks via Weisfeiler-Leman coloring
Graph neural networks' opaque decision-making undermines their use in safety-critical applications, necessitating robust explainability benchmarking. Current graph-XAI benchmarks are limited to simplistic synthetic datasets or few expert-curated tasks, hindering rigorous evaluation. This research automates benchmark construction from generic graph classification datasets using Weisfeiler-Leman color refinement for efficient subgraph matching and discriminating motif mining. The approach ensures motifs align with GNN expressiveness, providing meaningful proxy ground-truth explanations. The OpenGraphXAI suite delivers 15 ready-made benchmarks from molecular datasets, with codebase enabling 2,000+ additional benchmarks. Applications include drug discovery explainability, chemical safety assessment, materials design transparency, bioinformatics interpretation, and regulatory compliance for AI in chemistry. Systematic benchmarking accelerates development of trustworthy graph-based AI systems across scientific domains.
Authors: Michele Fontanesi, Alessio Micheli, Marco Podda, Domenico Tortorella
Link: https://arxiv.org/abs/2505.12437v2
Date: 2025-10-d
Summary:
Graph neural networks have become the de facto model for learning from structured data. However, the decision-making process of GNNs remains opaque to the end user, which undermines their use in safety-critical applications. Several explainable AI techniques for graphs have been developed to address this major issue. Focusing on graph classification, these explainers identify subgraph motifs that explain predictions. Therefore, a robust benchmarking of graph explainers is required to ensure that the produced explanations are of high quality, i.e., aligned with the GNN's decision process. However, current graph-XAI benchmarks are limited to simplistic synthetic datasets or a few real-world tasks curated by domain experts, hindering rigorous and reproducible evaluation, and consequently stalling progress in the field. To overcome these limitations, we propose a method to automate the construction of graph XAI benchmarks from generic graph classification datasets. Our approach leverages the Weisfeiler-Leman color refinement algorithm to efficiently perform approximate subgraph matching and mine class-discriminating motifs, which serve as proxy ground-truth class explanations. At the same time, we ensure that these motifs can be learned by GNNs because their discriminating power aligns with WL expressiveness. This work also introduces the OpenGraphXAI benchmark suite, which consists of 15 ready-made graph-XAI datasets derived by applying our method to real-world molecular classification datasets. The suite is available to the public along with a codebase to generate over 2,000 additional graph-XAI benchmarks. Finally, we present a use case that illustrates how the suite can be used to assess the effectiveness of a selection of popular graph explainers, demonstrating the critical role of a sufficiently large benchmark collection for improving the significance of experimental results.
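The primitive the generator leans on, 1-WL color refinement, is compact enough to show directly: nodes that end with the same stable color are indistinguishable to 1-WL, and hence to standard message-passing GNNs, which is why motifs mined this way stay within GNN expressiveness.

```python
# Minimal 1-WL color refinement over an adjacency-list graph.

def wl_colors(adj: dict[int, list[int]], rounds: int = 3) -> dict[int, int]:
    colors = {v: 0 for v in adj}          # start from a uniform coloring
    for _ in range(rounds):
        # New color = own color plus the multiset of neighbor colors.
        signatures = {v: (colors[v], tuple(sorted(colors[u] for u in adj[v])))
                      for v in adj}
        palette = {sig: i
                   for i, sig in enumerate(sorted(set(signatures.values())))}
        colors = {v: palette[signatures[v]] for v in adj}
    return colors

# A 4-cycle: all nodes share one color class, as 1-WL cannot tell them apart.
cycle = {0: [1, 3], 1: [0, 2], 2: [1, 3], 3: [0, 2]}
print(wl_colors(cycle))  # {0: 0, 1: 0, 2: 0, 3: 0}
```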
--------------------------------------------------------------------------------------------------------