Week Ending 10.12.2025

RESEARCH WATCH: 10.12.2025

Time Series Foundation Models: Benchmarking Challenges and Requirements

Time Series Foundation Models represent a breakthrough in forecasting by enabling zero-shot predictions across domains without specialized training. However, their evaluation faces critical challenges reminiscent of those in LLM assessment. This research exposes fundamental issues in current benchmarking practices, including dataset representativeness, information leakage from overlapping training data, and the memorization of global patterns from events like pandemics. The authors advocate for rigorous evaluation methodologies using truly out-of-sample future data. Applications span finance, energy forecasting, and supply chain optimization, but realizing their potential requires addressing these integrity concerns to ensure models generalize rather than merely memorize historical patterns.

Authors: Marcel Meyer, Sascha Kaltenpoth, Kevin Zalipski, Oliver Müller

Link: https://arxiv.org/abs/2510.13654v1

Date: 2025-10-d

Summary:

Time Series Foundation Models (TSFMs) represent a new paradigm for time series forecasting, offering zero-shot forecasting capabilities without the need for domain-specific pre-training or fine-tuning. However, as with Large Language Models (LLMs), evaluating TSFMs is tricky: as training sets grow ever more extensive, it becomes increasingly challenging to ensure the integrity of benchmarking data. Our investigation of existing TSFM evaluation highlights multiple challenges, ranging from the representativeness of the benchmark datasets and the lack of spatiotemporal evaluation to risks of information leakage due to overlapping and obscure datasets and the memorization of global patterns caused by external shocks like economic crises or pandemics. Our findings reveal widespread confusion regarding data partitions, risking inflated performance estimates and incorrect transfer of global knowledge to local time series. We argue for the development of robust evaluation methodologies to prevent pitfalls already observed in LLM and classical time series benchmarking, and call upon the research community to design new, principled approaches, such as evaluations on truly out-of-sample future data, to safeguard the integrity of TSFM assessment.
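
The idea of evaluating on truly out-of-sample future data can be illustrated with a minimal sketch. It assumes a hypothetical training cutoff date and keeps only observations dated strictly after it; the function name, the cutoff, and the toy series are illustrative assumptions and do not come from the paper.

```python
from datetime import date

def future_only(series, training_cutoff):
    """Keep only observations dated strictly after the model's training cutoff.

    `series` is a list of (date, value) pairs. `training_cutoff` is an assumed,
    hypothetical date after which no data could have entered the training set.
    """
    return [(d, v) for d, v in series if d > training_cutoff]

# Toy series: one observation per month in 2025 (illustrative values only).
series = [(date(2025, m, 1), float(m)) for m in range(1, 13)]
cutoff = date(2025, 9, 30)  # hypothetical cutoff, not taken from the paper

eval_set = future_only(series, cutoff)
print([d.isoformat() for d, _ in eval_set])
```

A date filter like this only guards against direct overlap; it does not by itself address the other concerns the summary raises, such as benchmark representativeness.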

--------------------------------------------------------------------------------------------------------

The Role of Computing Resources in Publishing Foundation Model Research

Foundation model research increasingly demands substantial computational infrastructure, raising questions about equitable access to AI advancement. This comprehensive study analyzed over 6,500 papers and surveyed 229 researchers to understand how GPU availability correlates with scientific output. Findings reveal that computing resources correlate with national funding and citations, though not strongly with institutional type or research domain. The research highlights a critical barrier to entry for under-resourced scientists, potentially limiting diversity in AI innovation. Applications include informing policy decisions on research funding, designing shared computing infrastructure, and developing strategies to democratize foundation model research, ultimately ensuring broader participation in shaping AI's future.

Authors: Yuexing Hao, Yue Huang, Haoran Zhang, Chenyang Zhao, Zhenwen Liang, Paul Pu Liang, Yue Zhao, Lichao Sun, Saleh Kalantari, Xiangliang Zhang, Marzyeh Ghassemi

Link: https://arxiv.org/abs/2510.13621v1

Date: 2025-10-d

Summary:

Cutting-edge research in Artificial Intelligence (AI) requires considerable resources, including Graphics Processing Units (GPUs), data, and human resources. In this paper, we evaluate the relationship between these resources and the scientific advancement of foundation models (FM). We reviewed 6517 FM papers published between 2022 and 2024, and surveyed 229 first authors about the impact of computing resources on scientific output. We find that increased computing is correlated with national funding allocations and citations, but we do not observe strong correlations with research environment (academic or industrial), domain, or study methodology. We advise that individuals and institutions focus on creating shared and affordable computing opportunities to lower the entry barrier for under-resourced researchers. These steps can help expand participation in FM research, foster diversity of ideas and contributors, and sustain innovation and progress in AI. The data will be available at: https://mit-calc.csail.mit.edu/

--------------------------------------------------------------------------------------------------------

Message Passing on the Edge: Towards Scalable and Expressive GNNs

Graph Neural Networks face a fundamental trade-off between expressiveness and computational efficiency. This work introduces EB-1WL and EB-GNN, an architecture inspired by classic triangle-counting algorithms that explicitly incorporates triangular structures during message passing. The approach achieves significantly greater expressiveness than standard 1-WL tests while maintaining near-linear computational complexity—a rare combination in GNN research. Complete logical characterization through first-order logic provides theoretical grounding, while empirical results demonstrate competitive performance with task-specialized GNNs at substantially lower computational cost. Applications span molecular property prediction, social network analysis, knowledge graph reasoning, and recommendation systems where both accuracy and efficiency are critical for deployment at scale.

Authors: Pablo Barceló, Fabian Jogl, Alexander Kozachinskiy, Matthias Lanzinger, Stefan Neumann, Cristóbal Rojas

Link: https://arxiv.org/abs/2510.13615v1

Date: 2025-10-d

Summary:

We propose EB-1WL, an edge-based color-refinement test, and a corresponding GNN architecture, EB-GNN. Our architecture is inspired by a classic triangle counting algorithm by Chiba and Nishizeki, and explicitly uses triangles during message passing. We achieve the following results: (1) EB-1WL is significantly more expressive than 1-WL. Further, we provide a complete logical characterization of EB-1WL based on first-order logic, and matching distinguishability results based on homomorphism counting. (2) In an important distinction from previous proposals for more expressive GNN architectures, EB-1WL and EB-GNN require near-linear time and memory on practical graph learning tasks. (3) Empirically, we show that EB-GNN is a highly-efficient general-purpose architecture: It substantially outperforms simple MPNNs, and remains competitive with task-specialized GNNs while being significantly more computationally efficient.
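
To make the triangle-based idea concrete, here is a minimal sketch of degree-ordered triangle listing, in the spirit of classic edge-iterating algorithms such as the one by Chiba and Nishizeki. It is a simplified illustration and not the paper's EB-1WL or EB-GNN; the function name and the toy graph are assumptions.

```python
from collections import defaultdict

def list_triangles(edges):
    """Enumerate each triangle of an undirected simple graph exactly once."""
    adj = defaultdict(set)
    for u, v in edges:
        if u != v:  # ignore self-loops
            adj[u].add(v)
            adj[v].add(u)
    # Rank vertices by (degree, label) so every edge points from lower to higher rank.
    rank = {v: i for i, v in enumerate(sorted(adj, key=lambda v: (len(adj[v]), v)))}
    out = {v: {w for w in adj[v] if rank[w] > rank[v]} for v in adj}
    triangles = []
    for u in out:
        for v in out[u]:
            # A common higher-ranked neighbor of u and v closes a triangle.
            for w in out[u] & out[v]:
                triangles.append(tuple(sorted((u, v, w))))
    return sorted(triangles)

# Toy graph (hypothetical): two triangles sharing vertex 2.
edges = [(0, 1), (1, 2), (0, 2), (2, 3), (3, 4), (2, 4)]
print(list_triangles(edges))
```

Orienting edges by rank means each triangle is discovered only from its lowest-ranked vertex, which avoids duplicate reports.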

--------------------------------------------------------------------------------------------------------

NOSA: Native and Offloadable Sparse Attention

Long-context processing in Large Language Models creates severe memory bottlenecks during inference, limiting batch sizes and throughput. While trainable sparse attention reduces memory accesses, the KV cache remains large, constraining GPU capacity. This research reveals that sparse attention naturally exhibits locality in token selection, enabling efficient CPU-GPU offloading. NOSA introduces explicit locality constraints by decomposing token selection into query-aware and query-agnostic components, reducing data transfer overhead while preserving training-time attention patterns. Demonstrated on a 1B-parameter model, NOSA achieves 2.3× throughput improvement over baseline sparse attention with minimal performance loss. Applications include large-scale document analysis, conversational AI with extended memory, and batch inference services where maximizing throughput directly impacts cost-effectiveness.

Authors: Yuxiang Huang, Chaojun Xiao, Xu Han, Zhiyuan Liu

Link: https://arxiv.org/abs/2510.13602v1

Date: 2025-10-d

Summary:

Trainable sparse attention has emerged as a promising solution to address the decoding efficiency bottleneck of LLMs in long-context processing, significantly saving memory accesses while minimally impacting task performance. However, existing sparse attention methods leave a crucial limitation unresolved: the size of the key-value (KV) cache remains unreduced, which constrains on-GPU batch sizes and throttles decoding throughput, especially in large-scale batched inference. In this paper, we show that trainable sparse attention naturally exhibits strong locality in token selection across adjacent decoding steps, thereby enabling KV cache offloading without altering the underlying attention computation. However, the inherent locality remains insufficient to achieve efficient offloading, as the transfer of selected KV pairs between the CPU and GPU continues to dominate the overall decoding cost. Building on this insight, we present NOSA, a trainable sparse attention framework designed to natively support KV cache offloading. NOSA introduces explicit locality constraints by decomposing token selection into query-aware and query-agnostic components, thereby reducing KV transfers while preserving the same attention computation as used during training. We pretrain a 1B-parameter model with NOSA and conduct extensive benchmarks, showing that it preserves near-lossless performance while achieving up to a 2.3x improvement in decoding throughput compared with the vanilla trainable sparse attention baseline (InfLLM-V2).
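
The locality argument can be illustrated with a small sketch that counts how many selected KV entries would need a CPU-to-GPU transfer at each decoding step. It assumes, hypothetically, that the GPU keeps exactly the previous step's selection resident; the function name, the caching policy, and the toy selections are illustrative and are not NOSA's actual mechanism.

```python
def transfers_per_step(selections):
    """Count KV entries that must be fetched from CPU at each decoding step.

    Assumes (hypothetically) that the GPU keeps exactly the previous step's
    selected entries, so only newly selected entries need a transfer.
    """
    resident = set()
    fetched = []
    for selected in selections:
        selected = set(selected)
        fetched.append(len(selected - resident))
        resident = selected
    return fetched

# Toy token selections for four adjacent decoding steps (illustrative only).
selections = [[0, 1, 2, 3], [1, 2, 3, 4], [2, 3, 4, 5], [2, 3, 4, 9]]
print(transfers_per_step(selections))
```

When consecutive selections overlap heavily, few entries need to be fetched per step, which is the property that makes offloading attractive.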

--------------------------------------------------------------------------------------------------------

Subject Roles in the EU AI Act: Mapping and Regulatory Implications

The EU AI Act establishes the world's first comprehensive AI regulatory framework through a complex ecosystem of interconnected stakeholders. This analysis systematically maps six categories of actors—providers, deployers, authorized representatives, importers, distributors, and product manufacturers—examining how obligations cascade through AI supply chains. The research reveals critical transformation mechanisms where entities can assume different roles based on control relationships, ensuring accountability follows responsibility. By analyzing 113 articles, 180 recitals, and 13 annexes, the work illuminates how risk-based obligations scale with AI system capabilities and deployment contexts. Applications include compliance guidance for AI developers, supply chain risk assessment, and policy frameworks for other jurisdictions considering AI regulation while balancing innovation with fundamental rights protection.

Authors: Nicola Fabiano

Link: https://arxiv.org/abs/2510.13591v1

Date: 2025-10-d

Summary:

The European Union's Artificial Intelligence Act (Regulation (EU) 2024/1689) establishes the world's first comprehensive regulatory framework for AI systems through a sophisticated ecosystem of interconnected subjects defined in Article 3. This paper provides a structured examination of the six main categories of actors - providers, deployers, authorized representatives, importers, distributors, and product manufacturers - collectively referred to as "operators" within the regulation. Through examination of these Article 3 definitions and their elaboration across the regulation's 113 articles, 180 recitals, and 13 annexes, we map the complete governance structure and analyze how the AI Act regulates these subjects. Our analysis reveals critical transformation mechanisms whereby subjects can assume different roles under specific conditions, particularly through Article 25 provisions ensuring accountability follows control. We identify how obligations cascade through the supply chain via mandatory information flows and cooperation requirements, creating a distributed yet coordinated governance system. The findings demonstrate how the regulation balances innovation with the protection of fundamental rights through risk-based obligations that scale with the capabilities and deployment contexts of AI systems, providing essential guidance for stakeholders implementing the AI Act's requirements.

--------------------------------------------------------------------------------------------------------

Tandem Training for Language Models

As language models grow more capable, their reasoning processes may become incomprehensible to humans and weaker AI systems, threatening interpretability and oversight. This research formalizes intelligibility through "handoff robustness"—solutions remain valid when control randomly transfers to weaker models. Tandem training implements this by intermittently sampling rollout tokens from frozen weak models during reinforcement learning, incentivizing strong models to produce reasoning that weaker partners can continue. Experiments on GSM8K mathematical reasoning demonstrate models successfully abandon jargon and adapt language while maintaining accuracy. Applications include human-AI collaboration systems requiring transparent reasoning, multi-agent coordination, AI safety through auditable decision-making, and hierarchical AI architectures where powerful systems must remain comprehensible to oversight mechanisms.

Authors: Robert West, Ashton Anderson, Ece Kamar, Eric Horvitz

Link: https://arxiv.org/abs/2510.13551v1

Date: 2025-10-d

Summary:

As language models continue to rapidly improve, we can expect their actions and reasoning to become difficult or impossible for weaker agents and humans to follow, undermining interpretability and oversight. With an eye on long-term futures, we pursue methods that encourage models to produce solutions that remain intelligible to weaker collaborators. We formalize intelligibility as handoff robustness: a strong model's solution is intelligible to a weaker model if randomly handing off control to the weaker model along the solution path does not cause failure. Building on this criterion, we introduce tandem training for language models, a reinforcement learning (RL) paradigm in which rollout tokens are intermittently and randomly sampled from a frozen weak model rather than the strong model being trained. Because rollouts succeed only when the strong model's actions and reasoning process can be continued by the weak model (that is, when the two can co-construct a successful solution), optimizing standard RL objectives with tandem training implicitly incentivizes both correctness and intelligibility. In the GSM8K math reasoning task, tandem training reliably teaches models to abandon jargon and adapt their language to weaker partners while keeping task accuracy high. Our results demonstrate a promising route to building AI systems that remain auditable by weaker agents, with implications for human-AI collaboration and multi-agent communication.
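
A tandem rollout can be sketched as follows. The per-token switching rule, the `p_weak` parameter, and the toy stand-in policies are all assumptions of this sketch; the summary only states that the handoff to the frozen weak model is intermittent and random.

```python
import random

def tandem_rollout(strong_step, weak_step, prompt, max_tokens, p_weak, rng):
    """Generate tokens, drawing each one from the weak model with probability p_weak.

    Per-token switching is an assumption of this sketch, not a detail
    confirmed by the summary.
    """
    tokens = list(prompt)
    sources = []
    for _ in range(max_tokens):
        use_weak = rng.random() < p_weak
        step = weak_step if use_weak else strong_step
        tokens.append(step(tokens))
        sources.append("weak" if use_weak else "strong")
    return tokens, sources

# Toy stand-ins for the two policies (hypothetical, not real language models).
strong = lambda toks: f"s{len(toks)}"
weak = lambda toks: f"w{len(toks)}"

tokens, sources = tandem_rollout(strong, weak, ["<q>"], 8, 0.3, random.Random(0))
print(tokens)
print(sources)
```

In an RL setting, the reward would be computed on the jointly produced rollout, so the strong policy is only rewarded when the weak partner can continue its reasoning.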

--------------------------------------------------------------------------------------------------------

Smart UX-design for Rescue Operations Wearable - A Knowledge Graph Informed Visualization Approach for Information Retrieval in Emergency Situations

Emergency medical response demands rapid access to treatment information under high-stress conditions. This research develops a knowledge graph-informed UX approach for wearable devices that provide AI-driven treatment recommendations to health professionals during rescue operations. The design addresses unique requirements of knowledge graph-based solutions while meeting health professionals' direct needs in chaotic emergency environments. By combining situation detection, knowledge graph representation, and contextual recommendation, the system aims to enhance first-aid quality when seconds matter. Applications include disaster response, battlefield medicine, rural healthcare where specialists are unavailable, paramedic training systems, and integration with telemedicine platforms, ultimately improving patient outcomes through intelligent, context-aware information delivery during critical moments.

Authors: Mubaris Nadeem, Johannes Zenkert, Christian Weber, Madjid Fathi, Muhammad Hamza

Link: https://arxiv.org/abs/2510.13539v1

Date: 2025-10-d

Summary:

This paper presents a knowledge graph-informed smart UX-design approach to support information retrieval on a wearable that provides treatment recommendations to health professionals during emergency situations. It describes requirements that are unique to knowledge graph-based solutions, as well as the direct requirements of health professionals. The resulting implementation is provided for the project, whose main goal is to improve first-aid rescue operations by supporting artificial intelligence in situation detection and knowledge graph representation via contextual recommendations for treatment assistance.

--------------------------------------------------------------------------------------------------------

K-Merge: Online Continual Merging of Adapters for On-device Large Language Models

Mobile devices face strict storage constraints when supporting diverse LLM capabilities through Low-Rank Adapters. As users incrementally request new tasks—new languages, problem types, or domains—devices must incorporate additional LoRAs without forgetting previously supported functionality. This work addresses online continual merging: efficiently selecting and fusing LoRAs under storage budgets while preserving multi-task performance. The proposed data-free, computationally efficient strategy enables practical on-device deployment where adapters arrive sequentially. Real-world experiments demonstrate superiority over alternative approaches while respecting hardware limitations. Applications include personalized mobile assistants that evolve with user needs, multilingual support that grows organically, specialized professional tools on edge devices, and privacy-preserving on-device AI that avoids cloud dependencies while maintaining broad capabilities.

Authors: Donald Shenaj, Ondrej Bohdal, Taha Ceritli, Mete Ozay, Pietro Zanuttigh, Umberto Michieli

Link: https://arxiv.org/abs/2510.13537v1

Date: 2025-10-d

Summary:

On-device deployment of Large Language Models (LLMs) frequently leverages Low-Rank Adapters (LoRAs) to support diverse downstream tasks under tight resource constraints. To address the limited storage capacity of mobile devices, recent works have explored model merging techniques to fuse multiple LoRAs into a single one. In practice, however, LoRAs are often delivered incrementally, as users request support for new tasks (e.g., novel problem types or languages). This scenario introduces a new challenge: on-device online continual merging, where the objective is to incorporate new LoRAs while preserving the performance on previously supported tasks. In this paper, we propose a data-free and computationally efficient strategy for selecting and merging LoRAs when a new one becomes available, assuming the device can store only a limited number of adapters. Extensive experiments across real-world tasks demonstrate the superiority of our approach compared to alternative strategies while adhering to the storage budget and compute limitations of on-device settings.

--------------------------------------------------------------------------------------------------------

Offline and Online KL-Regularized RLHF under Differential Privacy

Aligning large language models with human preferences through RLHF raises privacy concerns when preference data contains sensitive information. This research provides the first theoretical analysis of KL-regularized RLHF—a standard alignment objective—under local differential privacy constraints. For offline settings, the work derives optimal suboptimality gaps under single-policy concentrability with matching lower bounds. For online settings, it establishes logarithmic regret bounds, representing the first analysis of online KL-regularized RLHF even without privacy considerations. The framework enables privacy-preserving alignment where user preferences remain protected. Applications include medical AI assistants trained on sensitive patient feedback, personalized content moderation respecting user privacy, corporate AI systems handling proprietary information, and democratized model alignment allowing broader participation without privacy risks.

Authors: Yulian Wu, Rushil Thareja, Praneeth Vepakomma, Francesco Orabona

Link: https://arxiv.org/abs/2510.13512v1

Date: 2025-10-d

Summary:

In this paper, we study the offline and online settings of reinforcement learning from human feedback (RLHF) with KL-regularization -- a widely used objective function in large language model alignment -- under the $\epsilon$ local differential privacy ($\epsilon$-LDP) model on the label of the human preference. In the offline setting, we design an algorithm based on the principle of pessimism and derive a new suboptimality gap of $\tilde{O}(1/[(e^\epsilon-1)^2 n])$ on the KL-regularized objective under single-policy concentrability, where $n$ is the sample size. We also prove its optimality by providing a matching lower bound. In the online setting, we are the first to theoretically investigate the problem of KL-regularized RLHF with LDP. We design an optimism-based algorithm and derive a logarithmic regret bound of $O(d_{\mathcal{F}}\log (N_{\mathcal{F}}\cdot T) /(e^\epsilon-1)^2 )$, where $T$ is the total number of time steps, $N_{\mathcal{F}}$ is the cardinality of the reward function space $\mathcal{F}$, and $d_{\mathcal{F}}$ is a variant of the eluder dimension for RLHF. As a by-product of our analysis, our results also imply the first analysis for online KL-regularized RLHF without privacy. We implement our algorithm in the offline setting to verify our theoretical results and release our open source code at: https://github.com/rushil-thareja/PPKL-RLHF-Official.
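
For intuition about label-level local differential privacy, the sketch below uses randomized response, a standard mechanism that satisfies $\epsilon$-LDP for a binary label. The summary does not say which mechanism the paper uses, so treat this as an assumption; the function names are illustrative.

```python
import math
import random

def randomized_response(label, epsilon, rng):
    """Report a binary label truthfully with probability e^eps / (1 + e^eps)."""
    p_keep = math.exp(epsilon) / (1.0 + math.exp(epsilon))
    return label if rng.random() < p_keep else 1 - label

def debiased_mean(noisy_labels, epsilon):
    """Unbiased estimate of the true mean label from privatized labels."""
    p_keep = math.exp(epsilon) / (1.0 + math.exp(epsilon))
    noisy_mean = sum(noisy_labels) / len(noisy_labels)
    # E[noisy] = (2p - 1) * mu + (1 - p), so invert that affine map.
    return (noisy_mean - (1.0 - p_keep)) / (2.0 * p_keep - 1.0)

rng = random.Random(0)
true_labels = [1 if rng.random() < 0.7 else 0 for _ in range(20000)]
noisy = [randomized_response(y, 1.0, rng) for y in true_labels]
print(round(debiased_mean(noisy, 1.0), 3))
```

The debiasing step divides by $2p - 1 = (e^\epsilon - 1)/(e^\epsilon + 1)$, which may give some intuition for why factors of $(e^\epsilon - 1)^2$ appear in the bounds above.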

--------------------------------------------------------------------------------------------------------

Semantic Communication Enabled Holographic Video Processing and Transmission

Holographic video promises revolutionary immersive experiences but demands unprecedented bandwidth and processing capabilities. Traditional bit-level transmission proves insufficient for real-time holographic communication. This work proposes semantic communication paradigms that transmit meaning rather than raw data, dramatically reducing requirements. The architecture incorporates semantic sampling, joint semantic-channel coding, and semantic-aware transmission optimized for holographic content. By focusing on perceptually relevant information, the approach enables practical deployment of holographic communication systems. Applications include next-generation telepresence for remote collaboration, immersive entertainment and gaming, surgical training and remote medical procedures, architectural visualization, and distributed holographic displays for education—all requiring efficient transmission of massive holographic data while maintaining perceptual quality for compelling user experiences.

Authors: Jingkai Ying, Zhiyuan Qi, Yulong Feng, Zhijin Qin, Zhu Han, Rahim Tafazolli, Yonina C. Eldar

Link: https://arxiv.org/abs/2510.13408v1

Date: 2025-10-d

Summary:

Holographic video communication is considered a paradigm shift in visual communications, becoming increasingly popular for its ability to offer immersive experiences. This article provides an overview of holographic video communication and outlines the requirements of a holographic video communication system. Particularly, following a brief review of semantic communication, an architecture for a semantic-enabled holographic video communication system is presented. Key technologies, including semantic sampling, joint semantic-channel coding, and semantic-aware transmission, are designed based on the proposed architecture. Two related use cases are presented to demonstrate the performance gain of the proposed methods. Finally, potential research topics are discussed to pave the way for the realization of semantic-enabled holographic video communications.

--------------------------------------------------------------------------------------------------------

From Minimal Existence to Human Definition: The CES-IMU-HSG Theoretical Framework

This philosophical-mathematical framework constructs a theory of existence and intelligence from the minimal axiom "Cogito, ergo sum." The Intermediate Meta-Universe registers axiomatic dependencies between heterogeneous theories, while the Hierarchical State Grid provides categorical construction through state-depth, mapping-hierarchy, and temporal axes. Applied to biology, the framework models neural systems as neuron-function field complexes, integrating multiple physiological universes through categorical fiberization. Human cognition emerges as temporal compositions of inter-universal algorithms. Critically, the work introduces "internal CES" for machines—self-grounding logic through operational factuality. Applications span AI ontology providing foundations for autonomous machine reasoning, formal verification of AI systems, human-AI interface design, cognitive architecture development, and philosophical frameworks bridging engineering implementation with questions of machine consciousness and self-definition.

Authors: Kei Itoh

Link: https://arxiv.org/abs/2510.13400v1

Date: 2025-10-d

Summary:

This study presents an inter-universal mathematical-logical framework constructed upon the minimal axiom Cogito, ergo sum (CES), integrating the Intermediate Meta-Universe (IMU) and the Hierarchical State Grid (HSG). The CES defines existence as a reflexive correspondence ('to be' and 'to be sayable') and positions any formal system, including ZFC or HoTT, as an attachable extension atop this minimal structure. The IMU functions as a registry of axiomatic dependencies that connect heterogeneous theories, employing the Institution-theoretic framework to ensure coherent inter-theoretical linkages. The HSG concretizes these ideas through categorical construction, defined by three orthogonal axes: the state-depth axis, the mapping-hierarchy axis, and the temporal axis incorporating the principle of 'no future reference.' Through these, the identity of 'definition = state' is formally established as a categorical property. Extending this structure to biological systems, the neural system is implemented as a 0-3D complex of neuron-function fields on the HSG, while its categorical extensions via fiberization over the material base enable the parallel integration of multiple physiological universes (neural, endocrine, learning, genetic, and input/output systems) into a coherent adjoint ensemble. Within this framework, human behavior and cognition emerge as temporal compositions of inter-universal algorithms constrained by the material base. Finally, by contrasting human cognition, which relies on external CES, with machine existence, this study introduces the concept of internal CES, wherein a machine grounds its own logic upon the factuality of its operation. This internal self-axiomatization establishes a continuous bridge between philosophical ontology and engineering implementation, providing a new foundation for the autonomous and self-defining existence of artificial intelligence.

--------------------------------------------------------------------------------------------------------

MADREC: A Multi-Aspect Driven LLM Agent for Explainable and Adaptive Recommendation

Traditional recommender systems and simple LLM-based approaches fail to capture preference complexity and provide convincing explanations. MADRec introduces an autonomous LLM agent that constructs user and item profiles through unsupervised multi-aspect extraction from reviews, enabling direct recommendation, sequential prediction, and explanation generation. Aspect-category-based summarization creates structured, high-density profiles, while Re-Ranking optimizes input quality. The Self-Feedback mechanism dynamically adjusts inference when ground-truth items are missing from outputs. Human evaluation confirms generated explanations are persuasive and trustworthy. Applications include e-commerce platforms requiring transparent recommendations, content discovery services, personalized marketing with justifications, conversational shopping assistants, review summarization systems, and domains where recommendation explainability directly impacts user trust and adoption rates.

Authors: Jiin Park, Misuk Kim

Link: https://arxiv.org/abs/2510.13371v1

Date: 2025-10-d

Summary:

Recent attempts to integrate large language models (LLMs) into recommender systems have gained momentum, but most remain limited to simple text generation or static prompt-based inference, failing to capture the complexity of user preferences and real-world interactions. This study proposes the Multi-Aspect Driven LLM Agent MADRec, an autonomous LLM-based recommender that constructs user and item profiles by unsupervised extraction of multi-aspect information from reviews and performs direct recommendation, sequential recommendation, and explanation generation. MADRec generates structured profiles via aspect-category-based summarization and applies Re-Ranking to construct high-density inputs. When the ground-truth item is missing from the output, the Self-Feedback mechanism dynamically adjusts the inference criteria. Experiments across multiple domains show that MADRec outperforms traditional and LLM-based baselines in both precision and explainability, with human evaluation further confirming the persuasiveness of the generated explanations.

--------------------------------------------------------------------------------------------------------

Document Intelligence in the Era of Large Language Models: A Survey

Document AI has transformed from encoder-decoder architectures to decoder-only LLM paradigms, revolutionizing document understanding and generation. This comprehensive survey traces DAI evolution, examining how LLMs enable unprecedented capabilities in processing structured and unstructured documents. The analysis covers multimodal integration (text, tables, images), multilingual support, and retrieval-augmented approaches while identifying key challenges. Future directions include agent-based document processing and specialized foundation models. Applications span automated contract analysis, financial document processing, medical record extraction, legal discovery, academic paper understanding, form automation, multilingual document translation, and enterprise knowledge management—domains where accurate extraction and generation from complex document layouts with diverse modalities remain critical business and research challenges.

Authors: Weishi Wang, Hengchang Hu, Zhijie Zhang, Zhaochen Li, Hongxin Shao, Daniel Dahlmeier

Link: https://arxiv.org/abs/2510.13366v1

Date: 2025-10-d

Summary:

Document AI (DAI) has emerged as a vital application area and has been significantly transformed by the advent of large language models (LLMs). While earlier approaches relied on encoder-decoder architectures, decoder-only LLMs have revolutionized DAI, bringing remarkable advancements in understanding and generation. This survey provides a comprehensive overview of DAI's evolution, highlighting current research attempts and future prospects of LLMs in this field. We explore key advancements and challenges in multimodal, multilingual, and retrieval-augmented DAI, while also suggesting future research directions, including agent-based approaches and document-specific foundation models. This paper aims to provide a structured analysis of the state-of-the-art in DAI and its implications for both academic and practical applications.

--------------------------------------------------------------------------------------------------------

Language as a Label: Zero-Shot Multimodal Classification of Everyday Postures under Data Scarcity

Vision-Language Models enable zero-shot classification by aligning images and text, promising solutions for data-scarce scenarios. However, prompt design's influence on distinguishing visually similar categories remains poorly understood. This study investigates how prompt specificity affects classification of sitting, standing, and walking/running postures using modern VLMs (OpenCLIP, MetaCLIP 2, SigLip) on 285 COCO-derived images. Findings reveal "prompt overfitting": top-performing models achieve best results with simplest prompts, while descriptive detail degrades performance significantly. Conversely, lower-performing models benefit from body-cue-based descriptive prompts for ambiguous classes. Applications include activity recognition for elderly care monitoring, surveillance systems, sports analytics, human-computer interaction, assistive technologies, and any computer vision task requiring robust classification under limited training data.

Authors: MingZe Tang, Jubal Chandy Jacob

Link: https://arxiv.org/abs/2510.13364v1

Date: 2025-10-d

Summary:

Recent Vision-Language Models (VLMs) enable zero-shot classification by aligning images and text in a shared space, a promising approach for data-scarce conditions. However, the influence of prompt design on recognizing visually similar categories, such as human postures, is not well understood. This study investigates how prompt specificity affects the zero-shot classification of sitting, standing, and walking/running on a small, 285-image COCO-derived dataset. A suite of modern VLMs, including OpenCLIP, MetaCLIP 2, and SigLip, was evaluated using a three-tiered prompt design that systematically increases linguistic detail. Our findings reveal a compelling, counter-intuitive trend: for the highest-performing models (MetaCLIP 2 and OpenCLIP), the simplest, most basic prompts consistently achieve the best results. Adding descriptive detail significantly degrades performance: for instance, MetaCLIP 2's multi-class accuracy drops from 68.8% to 55.1%, a phenomenon we term "prompt overfitting". Conversely, the lower-performing SigLip model shows improved classification on ambiguous classes when given more descriptive, body-cue-based prompts.
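
To make the "shared space" idea concrete, here is a minimal sketch of zero-shot classification by cosine similarity. The three-dimensional vectors and the prompt wordings are hypothetical stand-ins chosen for illustration; a real pipeline would obtain both embeddings from a VLM's image and text encoders, and nothing here reproduces the paper's prompts or models.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def zero_shot_classify(image_embedding, prompt_embeddings):
    """Pick the prompt whose embedding is closest to the image embedding."""
    scores = {
        prompt: cosine(image_embedding, emb)
        for prompt, emb in prompt_embeddings.items()
    }
    best = max(scores, key=scores.get)
    return best, scores

# Hypothetical embeddings (assumption): in practice these come from the
# text encoder (one per class prompt) and the image encoder of a VLM.
PROMPT_EMBEDDINGS = {
    "a person sitting": [1.0, 0.1, 0.0],
    "a person standing": [0.1, 1.0, 0.2],
    "a person walking or running": [0.0, 0.3, 1.0],
}
IMAGE_EMBEDDING = [0.9, 0.2, 0.1]

label, scores = zero_shot_classify(IMAGE_EMBEDDING, PROMPT_EMBEDDINGS)
print(label)
```

Because the class is decided entirely by which prompt embedding lies nearest, rewording a prompt moves its embedding and can change the prediction, which is why prompt specificity matters in this setting.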

--------------------------------------------------------------------------------------------------------

Thompson Sampling via Fine-Tuning of LLMs

Bayesian optimization in large discrete spaces faces computational bottlenecks when maximizing acquisition functions without gradients. ToSFiT eliminates this limitation by directly parameterizing reward probability through Thompson Sampling, leveraging prior knowledge in prompt-conditioned LLMs and incrementally adapting them toward posteriors. Theoretical analysis derives novel regret bounds matching standard Thompson Sampling guarantees, revealing the critical importance of careful posterior adaptation. Empirical validation spans diverse tasks: FAQ response refinement, thermally stable protein discovery, and quantum circuit design. Online fine-tuning significantly improves sample efficiency with negligible computational overhead. Applications include molecular design where evaluations are expensive, prompt engineering optimization, hardware design exploration, automated experimentation in labs, and any optimization domain with unstructured discrete search spaces and costly evaluations.

Authors: Nicolas Menet, Aleksandar Terzić, Andreas Krause, Abbas Rahimi

Link: https://arxiv.org/abs/2510.13328v1

Date: 2025-10-d

Summary:

Bayesian optimization in large unstructured discrete spaces is often hindered by the computational cost of maximizing acquisition functions due to the absence of gradients. We propose a scalable alternative based on Thompson sampling that eliminates the need for acquisition function maximization by directly parameterizing the probability that a candidate yields the maximum reward. Our approach, Thompson Sampling via Fine-Tuning (ToSFiT), leverages the prior knowledge embedded in prompt-conditioned large language models and incrementally adapts them toward the posterior. Theoretically, we derive a novel regret bound for a variational formulation of Thompson Sampling that matches the strong guarantees of its standard counterpart. Our analysis reveals the critical role of careful adaptation to the posterior probability of maximality, a principle that underpins our ToSFiT algorithm. Empirically, we validate our method on three diverse tasks: FAQ response refinement, thermally stable protein search, and quantum circuit design. We demonstrate that online fine-tuning significantly improves sample efficiency, with negligible impact on computational efficiency.
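
For readers unfamiliar with the baseline, the following is a minimal sketch of classical Thompson sampling on a small Bernoulli bandit. It is not ToSFiT: the Beta-Bernoulli model, the three candidate success rates, and the function name are assumptions made for illustration. Note that each round takes an explicit argmax over all candidates, which is the step that becomes infeasible in large unstructured discrete spaces.

```python
import random

def thompson_sampling(true_rates, rounds, seed=0):
    """Classical Beta-Bernoulli Thompson sampling over a few candidates.

    true_rates holds the hidden success probability of each candidate
    (hypothetical values, used only to simulate rewards).
    Returns how many times each candidate was selected.
    """
    rng = random.Random(seed)
    n = len(true_rates)
    successes = [1] * n  # Beta(1, 1) prior: alpha parameter per candidate
    failures = [1] * n   # Beta(1, 1) prior: beta parameter per candidate
    pulls = [0] * n
    for _ in range(rounds):
        # Draw one plausible success rate per candidate from its posterior.
        samples = [rng.betavariate(successes[i], failures[i]) for i in range(n)]
        # Select the candidate whose sampled rate is highest.
        choice = max(range(n), key=samples.__getitem__)
        # Observe a simulated reward and update that candidate's posterior.
        if rng.random() < true_rates[choice]:
            successes[choice] += 1
        else:
            failures[choice] += 1
        pulls[choice] += 1
    return pulls

pulls = thompson_sampling([0.2, 0.5, 0.8], rounds=2000)
print(pulls)
```

In this toy setting the candidates are few enough to enumerate. According to the abstract, ToSFiT avoids that enumeration by having a fine-tuned LLM directly represent the probability that a candidate is the maximizer.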

--------------------------------------------------------------------------------------------------------

SAJA: A State-Action Joint Attack Framework on Multi-Agent Deep Reinforcement Learning

Multi-Agent Deep Reinforcement Learning enables sophisticated cooperative and competitive behaviors but remains vulnerable to adversarial perturbations. Existing attacks target states or actions independently, missing synergistic exploitation potential. SAJA introduces a joint attack framework with two phases: multi-step gradient ascent using actor and critic networks crafts adversarial states, then critic-guided gradient ascent on perturbed states generates final adversarial actions. A heuristic regularizer measuring perturbation distance enhances critic guidance effectiveness. Evaluation in the Multi-Agent Particle Environment demonstrates that SAJA outperforms single-attack methods and is harder to detect, while existing defenses prove insufficient. Applications include robustness testing for autonomous vehicle fleets, multi-robot systems security assessment, game theory simulations, adversarial training for resilient MADRL, and informing defense mechanism development for safety-critical multi-agent deployments.

Authors: Weiqi Guo, Guanjun Liu, Ziyuan Zhou

Link: https://arxiv.org/abs/2510.13262v1

Date: 2025-10-d

Summary:

Multi-Agent Deep Reinforcement Learning (MADRL) has shown potential for cooperative and competitive tasks such as autonomous driving and strategic gaming. However, models trained by MADRL are vulnerable to adversarial perturbations on states and actions. Therefore, it is essential to investigate the robustness of MADRL models from an attack perspective. Existing studies focus on either state-only attacks or action-only attacks, but do not consider how to combine them effectively. Simply combining state and action perturbations, such as randomly perturbing states and actions, does not exploit their potential synergistic effects. In this paper, we propose the State-Action Joint Attack (SAJA) framework, which has good synergistic effects. SAJA consists of two important phases: (1) in the state attack phase, a multi-step gradient ascent method utilizes both the actor network and the critic network to compute an adversarial state, and (2) in the action attack phase, based on the perturbed state, a second gradient ascent uses the critic network to craft the final adversarial action. Additionally, a heuristic regularizer measuring the distance between the perturbed actions and the original clean ones is added to the loss function to enhance the effectiveness of the critic's guidance. We evaluate SAJA in the Multi-Agent Particle Environment (MPE), demonstrating that (1) it outperforms and is more stealthy than state-only or action-only attacks, and (2) existing state or action defense methods cannot defend against its attacks.

--------------------------------------------------------------------------------------------------------

Paper Copilot: Tracking the Evolution of Peer Review in AI Conferences

AI conference growth strains peer review systems, causing reviewer overload, expertise mismatches, inconsistent standards, and superficial evaluations. While organizers introduce interventions, these changes often create confusion, leaving acceptance processes and practice evolution opaque. Paper Copilot creates durable digital archives of peer reviews across computer science venues, providing an open dataset enabling large-scale peer review research. The system includes infrastructure for continuous archiving and empirical analysis of ICLR reviews across multiple years. By documenting review evolution, Paper Copilot supports reproducible research on peer review dynamics. Applications include identifying review quality trends, diagnosing failure modes in conference systems, informing evidence-based policy improvements, studying reviewer behavior patterns, and developing metrics for fair, transparent peer review that maintains scientific integrity under growth pressures.

Authors: Jing Yang, Qiyao Wei, Jiaxin Pei

Link: https://arxiv.org/abs/2510.13201v1

Date: 2025-10-d

Summary:

The rapid growth of AI conferences is straining an already fragile peer-review system, leading to heavy reviewer workloads, expertise mismatches, inconsistent evaluation standards, superficial or templated reviews, and limited accountability under compressed timelines. In response, conference organizers have introduced new policies and interventions to preserve review standards. Yet these ad-hoc changes often create further concerns and confusion about the review process, leaving how papers are ultimately accepted - and how practices evolve across years - largely opaque. We present Paper Copilot, a system that creates durable digital archives of peer reviews across a wide range of computer-science venues, an open dataset that enables researchers to study peer review at scale, and a large-scale empirical analysis of ICLR reviews spanning multiple years. By releasing both the infrastructure and the dataset, Paper Copilot supports reproducible research on the evolution of peer review. We hope these resources help the community track changes, diagnose failure modes, and inform evidence-based improvements toward a more robust, transparent, and reliable peer-review system.

--------------------------------------------------------------------------------------------------------

Program of Thoughts for Financial Reasoning: Leveraging Dynamic In-Context Examples and Generative Retrieval

Despite Large Language Model advances, numerical reasoning—particularly in finance—remains challenging. While chain-of-thought, tree-of-thought, and program-of-thought prompting guide reasoning, LLMs lag behind state-of-the-art on financial datasets like FinQA and ConvFinQA. FINDER introduces a two-step framework: generative retrieval extracts relevant facts from unstructured text and tables, followed by context-aware Program of Thought prompting with dynamic in-context example selection. FINDER achieves new state-of-the-art performance with 5.98% and 4.05% execution accuracy improvements on FinQA and ConvFinQA respectively. Applications include automated financial report analysis, investment decision support, regulatory compliance checking, earnings call interpretation, risk assessment from financial documents, and intelligent financial advisory systems requiring accurate numerical reasoning over complex, multi-modal financial information.

Authors: Subhendu Khatuya, Shashwat Naidu, Pawan Goyal, Niloy Ganguly

Link: https://arxiv.org/abs/2510.13157v1

Date: 2025-10-d

Summary:

Despite continuous advancements in the capabilities of large language models (LLMs), numerical reasoning remains a challenging area. Techniques like chain-of-thought prompting, tree-of-thought prompting, and program-of-thought prompting guide LLMs through intermediate reasoning steps. Although in-context learning with few-shot prompting has improved performance, LLMs still lag behind state-of-the-art models on financial numerical reasoning datasets such as FinQA and ConvFinQA. In this work, we introduce FINDER, a novel two-step framework, to enhance LLMs' capabilities in financial numerical reasoning. The first step utilizes a generative retriever to extract relevant facts from unstructured data, including both text and tables. This is followed by context-aware Program of Thought prompting with dynamic selection of in-context examples. Our model FINDER achieves a new state-of-the-art performance on both the FinQA and ConvFinQA datasets, surpassing previous benchmarks with execution accuracy improvements of 5.98% and 4.05%, respectively.
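
The core idea of Program of Thought prompting is that the model writes a small program and an interpreter, not the model, does the arithmetic. The sketch below illustrates only that execution step under stated assumptions: the facts, the generated program text, and the `run_program` helper are all hypothetical, and nothing here reflects FINDER's actual retriever, prompts, or program format.

```python
import ast
import operator

# Arithmetic operators the toy interpreter is willing to execute.
_OPS = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.truediv,
}

def _eval(node, env):
    """Evaluate a restricted arithmetic expression against known variables."""
    if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
        return node.value
    if isinstance(node, ast.Name):
        return env[node.id]
    if isinstance(node, ast.BinOp) and type(node.op) in _OPS:
        return _OPS[type(node.op)](_eval(node.left, env), _eval(node.right, env))
    if isinstance(node, ast.UnaryOp) and isinstance(node.op, ast.USub):
        return -_eval(node.operand, env)
    raise ValueError("unsupported expression in generated program")

def run_program(program, facts):
    """Run a program made only of 'name = arithmetic' lines; return 'answer'."""
    env = dict(facts)
    for stmt in ast.parse(program).body:
        is_simple_assign = (
            isinstance(stmt, ast.Assign)
            and len(stmt.targets) == 1
            and isinstance(stmt.targets[0], ast.Name)
        )
        if not is_simple_assign:
            raise ValueError("only simple assignments are allowed")
        env[stmt.targets[0].id] = _eval(stmt.value, env)
    return env["answer"]

# Hypothetical retrieved facts and a hypothetical model-generated program.
FACTS = {"revenue_2018": 100.0, "revenue_2019": 125.0}
PROGRAM = (
    "change = revenue_2019 - revenue_2018\n"
    "answer = change / revenue_2018 * 100\n"
)

result = run_program(PROGRAM, FACTS)
print(result)
```

Parsing with `ast` and allowing only assignments and basic arithmetic is one conservative design choice for running model-generated code; it rejects anything else, such as imports or function calls.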

--------------------------------------------------------------------------------------------------------

On the Reasoning Abilities of Masked Diffusion Language Models

Masked diffusion models for text offer efficient parallel generation alternatives to autoregressive language models, but their computational capabilities remain underexplored. This work characterizes what reasoning problems MDMs can provably solve and their efficiency by connecting them to chain-of-thought and padded looped transformers in finite-precision settings. The analysis proves MDMs and polynomially-padded PLTs are equivalent, and MDMs can solve all CoT-augmented transformer problems. Crucially, MDMs inherently solve certain problem classes—including regular languages—more efficiently than CoT transformers through parallel generation advantages. Applications include faster inference for structured reasoning tasks, efficient language processing in resource-constrained environments, theoretical foundations for next-generation language models, and hybrid architectures combining autoregressive and diffusion approaches optimized for specific reasoning requirements.

Authors: Anej Svete, Ashish Sabharwal

Link: https://arxiv.org/abs/2510.13117v1

Date: 2025-10-d

Summary:

Masked diffusion models (MDMs) for text offer a compelling alternative to traditional autoregressive language models. Parallel generation makes them efficient, but their computational capabilities and the limitations inherent to their parallelism remain largely unexplored. To this end, we characterize what types of reasoning problems MDMs can provably solve and how efficiently. We do this by connecting MDMs to the well-understood reasoning frameworks of chain of thought (CoT) and padded looped transformers (PLTs) in the finite-precision log-width setting: We show that MDMs and polynomially-padded PLTs are, in fact, equivalent in this setting, and that MDMs can solve all problems that CoT-augmented transformers can. Moreover, we showcase classes of problems (including regular languages) for which MDMs are inherently more efficient than CoT transformers, where parallel generation allows for substantially faster reasoning.

--------------------------------------------------------------------------------------------------------

ESI: Epistemic Uncertainty Quantification via Semantic-preserving Intervention for Large Language Models

Uncertainty quantification improves model reliability, but quantifying Large Language Model uncertainty remains challenging. This work establishes a connection between LLM uncertainty and invariance under semantic-preserving interventions from a causal perspective. The proposed grey-box method measures output variation before and after semantic-preserving intervention, with theoretical justification showing it effectively estimates epistemic uncertainty. Extensive experiments across LLMs and question-answering datasets demonstrate effectiveness and computational efficiency. Applications include calibrated confidence scores for AI assistants preventing overconfident errors, selective prediction systems that defer uncertain queries to humans, active learning identifying informative training examples, model monitoring detecting distribution shift, and trustworthy AI deployment in high-stakes domains like healthcare and legal advice where understanding model uncertainty is crucial for safe operation.

Authors: Mingda Li, Xinyu Li, Weinan Zhang, Longxuan Ma

Link: https://arxiv.org/abs/2510.13103v1

Date: 2025-10-d

Summary:

Uncertainty Quantification (UQ) is a promising approach to improve model reliability, yet quantifying the uncertainty of Large Language Models (LLMs) is non-trivial. In this work, we establish a connection between the uncertainty of LLMs and their invariance under semantic-preserving intervention from a causal perspective. Building on this foundation, we propose a novel grey-box uncertainty quantification method that measures the variation in model outputs before and after the semantic-preserving intervention. Through theoretical justification, we show that our method provides an effective estimate of epistemic uncertainty. Our extensive experiments, conducted across various LLMs and a variety of question-answering (QA) datasets, demonstrate that our method excels not only in terms of effectiveness but also in computational efficiency.
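
As a rough illustration of "variation in model outputs before and after the intervention", the sketch below compares a model's answer distribution on the original question with its distributions on meaning-preserving rewrites. The use of total variation distance, the averaging over rewrites, and the example probabilities are all assumptions made for this sketch; the abstract does not specify the paper's actual measure.

```python
def total_variation(p, q):
    """Total variation distance between two answer distributions (dicts)."""
    answers = set(p) | set(q)
    return 0.5 * sum(abs(p.get(a, 0.0) - q.get(a, 0.0)) for a in answers)

def intervention_uncertainty(original, intervened):
    """Average shift in the answer distribution across rewritten prompts.

    original: answer -> probability for the unmodified question.
    intervened: list of such dicts, one per semantic-preserving rewrite.
    A score near 0 means the outputs are invariant; near 1 means unstable.
    """
    if not intervened:
        raise ValueError("need at least one intervened distribution")
    return sum(total_variation(original, q) for q in intervened) / len(intervened)

# Hypothetical output probabilities (assumption), e.g. read from a model's
# token-level probabilities in a grey-box setting.
stable = intervention_uncertainty(
    {"Paris": 0.9, "Lyon": 0.1},
    [{"Paris": 0.85, "Lyon": 0.15}, {"Paris": 0.95, "Lyon": 0.05}],
)
unstable = intervention_uncertainty(
    {"Paris": 0.9, "Lyon": 0.1},
    [{"Paris": 0.2, "Lyon": 0.8}],
)
print(stable < unstable)
```

The intuition matches the abstract's framing: if rewording a question without changing its meaning shifts the answer distribution substantially, the model's answer is less trustworthy.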

--------------------------------------------------------------------------------------------------------

Artificial Intelligence, Research Watch | Craig Smith | October 16, 2025