Week Ending 10.26.2025

 

RESEARCH WATCH: 10.26.2025

 

A Knowledge-Graph Translation Layer for Mission-Aware Multi-Agent Path Planning in Spatiotemporal Dynamics

This paper addresses the gap between high-level mission goals and low-level path planner inputs for autonomous agents in dynamic environments. It introduces a Knowledge Graph (KG) framework that acts as an intelligent translation layer, utilizing a two-plane architecture to convert declarative mission facts into per-agent, mission-aware "worldviews" and physics-aware rules. This approach effectively decouples mission semantics from the core planner, allowing complex, coordinated paths to be modified simply by changing facts in the KG. A case study with Autonomous Underwater Vehicles (AUVs) demonstrates its ability to produce distinct, high-performing, and explainable autonomous systems. Potential applications lie in coordinating fleets of robots for complex tasks like search and rescue, surveillance, or logistics in unstructured, changing environments.

Authors:  Edward Holmberg, Elias Ioup, Mahdi Abdelguerfi

Link:  https://arxiv.org/abs/2510.21695v1

Date: 2025-10-d

Summary:

The coordination of autonomous agents in dynamic environments is hampered by the semantic gap between high-level mission objectives and low-level planner inputs. To address this, we introduce a framework centered on a Knowledge Graph (KG) that functions as an intelligent translation layer. The KG's two-plane architecture compiles declarative facts into per-agent, mission-aware "worldviews" and physics-aware traversal rules, decoupling mission semantics from a domain-agnostic planner. This allows complex, coordinated paths to be modified simply by changing facts in the KG. A case study involving Autonomous Underwater Vehicles (AUVs) in the Gulf of Mexico visually demonstrates the end-to-end process and quantitatively proves that different declarative policies produce distinct, high-performing outcomes. This work establishes the KG not merely as a data repository, but as a powerful, stateful orchestrator for creating adaptive and explainable autonomous systems.
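
To make the translation-layer idea concrete, here is a minimal Python sketch of compiling declarative facts into a per-agent "worldview" for a generic planner. The facts, rules, and cost representation are illustrative assumptions, not the paper's actual KG schema:

    # Hypothetical mission facts; the paper's KG would hold richer triples.
    MISSION_FACTS = {
        ("auv_1", "role"): "survey",
        ("auv_2", "role"): "relay",
        ("zone_A", "status"): "restricted",
    }

    def compile_worldview(agent, facts, base_costs):
        """Compile declarative facts into agent-specific edge costs."""
        costs = dict(base_costs)
        for (subject, predicate), value in facts.items():
            # Example mission rule: restricted zones are impassable for survey agents.
            if (predicate == "status" and value == "restricted"
                    and facts.get((agent, "role")) == "survey"):
                costs = {edge: float("inf") if subject in edge else cost
                         for edge, cost in costs.items()}
        return costs

    base = {("start", "zone_A"): 1.0, ("start", "zone_B"): 2.0}
    print(compile_worldview("auv_1", MISSION_FACTS, base))

Changing a single fact (e.g., lifting the restriction on zone_A) changes the compiled worldview, and hence the planned path, without touching the planner itself.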

--------------------------------------------------------------------------------------------------------

On Thin Ice: Towards Explainable Conservation Monitoring via Attribution and Perturbations

This research tackles the lack of trust in black-box neural networks that hinders their adoption in ecological conservation, specifically for monitoring wildlife. It uses computer vision to detect pinnipeds (harbor seals) in aerial imagery and applies post-hoc explanation methods like HiResCAM and LIME to audit model predictions. By assessing explanations for localization fidelity, faithfulness, and diagnostic utility, the work provides evidence for true positives by showing attention on seal torsos rather than background. Crucially, this method uncovers systematic failure modes, such as confusion between seals and black ice. Potential applications include creating auditable, decision-supporting tools for conservationists that can reliably track and monitor endangered species.

Authors:  Jiayi Zhou, Günel Aghakishiyeva, Saagar Arya, Julian Dale, James David Poling, Holly R. Houliston, Jamie N. Womble, Gregory D. Larsen, David W. Johnston, Brinnae Bent

Link:  https://arxiv.org/abs/2510.21689v1

Date: 2025-10-d

Summary:

Computer vision can accelerate ecological research and conservation monitoring, yet adoption in ecology lags in part because of a lack of trust in black-box neural-network-based models. We seek to address this challenge by applying post-hoc explanations to provide evidence for predictions and document limitations that are important to field deployment. Using aerial imagery from Glacier Bay National Park, we train a Faster R-CNN to detect pinnipeds (harbor seals) and generate explanations via gradient-based class activation mapping (HiResCAM, LayerCAM), local interpretable model-agnostic explanations (LIME), and perturbation-based explanations. We assess explanations along three axes relevant to field use: (i) localization fidelity: whether high-attribution regions coincide with the animal rather than background context; (ii) faithfulness: whether deletion/insertion tests produce changes in detector confidence; and (iii) diagnostic utility: whether explanations reveal systematic failure modes. Explanations concentrate on seal torsos and contours rather than surrounding ice/rock, and removal of the seals reduces detection confidence, providing model-evidence for true positives. The analysis also uncovers recurrent error sources, including confusion between seals and black ice and rocks. We translate these findings into actionable next steps for model development, including more targeted data curation and augmentation. By pairing object detection with post-hoc explainability, we can move beyond "black-box" predictions toward auditable, decision-supporting tools for conservation monitoring.
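
The deletion/insertion faithfulness test lends itself to a short sketch. This is a generic version under assumed interfaces (score_fn returns detector confidence for an image; attribution is a per-pixel map), not the authors' exact protocol:

    import numpy as np

    def deletion_test(image, attribution, score_fn, fractions=(0.1, 0.2, 0.5)):
        """Mask the highest-attribution pixels and record the confidence drop."""
        baseline = score_fn(image)
        order = np.argsort(attribution.ravel())[::-1]  # most important pixels first
        drops = {}
        for frac in fractions:
            flat = image.copy().reshape(-1, image.shape[-1])
            k = int(frac * order.size)
            flat[order[:k]] = 0  # zero out the top-k attributed pixels
            drops[frac] = baseline - score_fn(flat.reshape(image.shape))
        return drops

A faithful explanation yields large confidence drops at small deletion fractions; here, removing seal torsos should matter far more than removing background ice.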

--------------------------------------------------------------------------------------------------------

A Multimodal Benchmark for Framing of Oil & Gas Advertising and Potential Greenwashing Detection

To combat potential "greenwashing" and understand strategic communication at scale, this work introduces a novel, multimodal benchmark dataset for analyzing oil and gas advertising. The dataset features expert-annotated video ads from platforms like Facebook and YouTube, providing labels for 13 distinct framing types across various companies and countries. Unlike prior text-only resources, this benchmark is specifically designed to evaluate vision-language models (VLMs) in their ability to detect subtle, often visual, framing shifts. Initial experiments reveal significant room for improvement, particularly in identifying green innovation framing. Potential applications include developing automated systems to monitor and flag deceptive advertising practices for regulators, media watchdogs, and consumers.

Authors:  Gaku Morio, Harri Rowlands, Dominik Stammbach, Christopher D. Manning, Peter Henderson

Link:  https://arxiv.org/abs/2510.21679v1

Date: 2025-10-d

Summary:

Companies spend large amounts of money on public relations campaigns to project a positive brand image. However, sometimes there is a mismatch between what they say and what they do. Oil & gas companies, for example, are accused of "greenwashing" with imagery of climate-friendly initiatives. Understanding the framing, and changes in framing, at scale can help better understand the goals and nature of public relations campaigns. To address this, we introduce a benchmark dataset of expert-annotated video ads obtained from Facebook and YouTube. The dataset provides annotations for 13 framing types for more than 50 companies or advocacy groups across 20 countries. Our dataset is especially designed for the evaluation of vision-language models (VLMs), distinguishing it from past text-only framing datasets. Baseline experiments show some promising results, while leaving room for improvement for future work: GPT-4.1 can detect environmental messages with 79% F1 score, while our best model only achieves 46% F1 score on identifying framing around green innovation. We also identify challenges that VLMs must address, such as implicit framing, handling videos of various lengths, or implicit cultural backgrounds. Our dataset contributes to research in multimodal analysis of strategic communication in the energy sector.

--------------------------------------------------------------------------------------------------------

CMOMgen: Complex Multi-Ontology Alignment via Pattern-Guided In-Context Learning

Constructing comprehensive knowledge graphs often requires integrating data from multiple, related ontologies, which is a challenge due to the semantic limitations of simple equivalence mappings. This paper introduces CMOMgen, the first end-to-end strategy for Complex Multi-Ontology Matching (CMOM), which aligns a source entity to a composite logical expression of multiple target entities. It leverages Retrieval-Augmented Generation (RAG) to select relevant classes and provide reference mappings as examples, thereby enhancing In-Context Learning. Evaluated on biomedical tasks, CMOMgen outperforms baselines in generating complete and semantically sound mappings. Potential applications are vital in fields like biomedicine for integrating diverse datasets, accelerating drug discovery, and building richer, more context-aware knowledge representation layers.

Authors:  Marta Contreiras Silva, Daniel Faria, Catia Pesquita

Link:  https://arxiv.org/abs/2510.21656v1

Date: 2025-10-d

Summary:

Constructing comprehensive knowledge graphs requires the use of multiple ontologies in order to fully contextualize data into a domain. Ontology matching finds equivalences between concepts, interconnecting ontologies and creating a cohesive semantic layer. While the simple pairwise state of the art is well established, simple equivalence mappings cannot provide full semantic integration of related but disjoint ontologies. Complex multi-ontology matching (CMOM) aligns one source entity to composite logical expressions of multiple target entities, establishing more nuanced equivalences and provenance along the ontological hierarchy. We present CMOMgen, the first end-to-end CMOM strategy that generates complete and semantically sound mappings, without establishing any restrictions on the number of target ontologies or entities. Retrieval-Augmented Generation selects relevant classes to compose the mapping and filters matching reference mappings to serve as examples, enhancing In-Context Learning. The strategy was evaluated in three biomedical tasks with partial reference alignments. CMOMgen outperforms baselines in class selection, demonstrating the impact of having a dedicated strategy. Our strategy also achieves a minimum of 63% in F1-score, outperforming all baselines and ablated versions in two out of three tasks and placing second in the third. Furthermore, a manual evaluation of non-reference mappings showed that 46% of the mappings achieve the maximum score, further substantiating its ability to construct semantically sound mappings.
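
A rough sketch of the retrieval step, under assumed interfaces (embed maps a label to a vector; reference_mappings are (source, expression) pairs); the paper's actual retrieval and prompt design may differ:

    import numpy as np

    def retrieve_candidates(source_label, target_classes, embed, k=10):
        """Rank target-ontology classes by embedding similarity to the source entity."""
        q = embed(source_label)
        return sorted(target_classes,
                      key=lambda c: -float(np.dot(q, embed(c))))[:k]

    def build_prompt(source_label, candidates, reference_mappings):
        """Assemble an in-context-learning prompt: reference examples, then the query."""
        examples = "\n".join(f"{src} == {expr}" for src, expr in reference_mappings)
        return (f"Reference mappings:\n{examples}\n\n"
                f"Candidate target classes: {', '.join(candidates)}\n"
                f"Compose a logical expression equivalent to: {source_label}")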

--------------------------------------------------------------------------------------------------------

Group Inertial Poser: Multi-Person Pose and Global Translation from Sparse Inertial Sensors and Ultra-Wideband Ranging

This research addresses the limitations of purely IMU-based motion tracking, which struggles with accurate global translation and relative positioning in multi-person scenarios. Group Inertial Poser introduces a novel fusion approach that combines sparse Inertial Measurement Units (IMUs) with absolute distance measurements from Ultra-Wideband (UWB) ranging. By feeding these fused observations into structured state-space models and employing a two-step optimization, the method robustly estimates both full-body poses and global trajectories for multiple individuals. This technology, accompanied by the new GIP-DB dataset, promises highly accurate and robust multi-human motion capture "in the wild," with applications in virtual reality, animation, fitness tracking, and robotics.

Authors:  Ying Xue, Jiaxi Jiang, Rayan Armani, Dominik Hollidt, Yi-Chi Liao, Christian Holz

Link:  https://arxiv.org/abs/2510.21654v1

Date: 2025-10-d

Summary:

Tracking human full-body motion using sparse wearable inertial measurement units (IMUs) overcomes the limitations of occlusion and instrumentation of the environment inherent in vision-based approaches. However, purely IMU-based tracking compromises translation estimates and accurate relative positioning between individuals, as inertial cues are inherently self-referential and provide no direct spatial reference for others. In this paper, we present a novel approach for robustly estimating body poses and global translation for multiple individuals by leveraging the distances between sparse wearable sensors - both on each individual and across multiple individuals. Our method Group Inertial Poser estimates these absolute distances between pairs of sensors from ultra-wideband ranging (UWB) and fuses them with inertial observations as input into structured state-space models to integrate temporal motion patterns for precise 3D pose estimation. Our novel two-step optimization further leverages the estimated distances for accurately tracking people's global trajectories through the world. We also introduce GIP-DB, the first IMU+UWB dataset for two-person tracking, which comprises 200 minutes of motion recordings from 14 participants. In our evaluation, Group Inertial Poser outperforms previous state-of-the-art methods in accuracy and robustness across synthetic and real-world data, showing the promise of IMU+UWB-based multi-human motion capture in the wild. Code, models, dataset: https://github.com/eth-siplab/GroupInertialPoser

--------------------------------------------------------------------------------------------------------

AstaBench: Rigorous Benchmarking of AI Agents with a Scientific Research Suite

Rigorous evaluation is crucial for advancing AI agents that aim to automate tasks in scientific research, from literature review to data analysis. AstaBench is introduced as a comprehensive suite to holistically measure an agent's ability to perform the entire scientific discovery process. It comprises 2400+ problems and includes the first scientific research environment with production-grade search tools for controlled, reproducible evaluation. The benchmark comes with nine science-optimized Asta agent classes and baselines. Evaluation of 57 agents shows that despite progress, AI is still far from solving the challenge of science research assistance. Potential applications include standardizing the development and deployment of next-generation AI agents for academic and industrial R&D.

Authors:  Jonathan Bragg, Mike D'Arcy, Nishant Balepur, Dan Bareket, Bhavana Dalvi, Sergey Feldman, Dany Haddad, Jena D. Hwang, Peter Jansen, Varsha Kishore, Bodhisattwa Prasad Majumder, Aakanksha Naik, Sigal Rahamimov, Kyle Richardson, Amanpreet Singh, Harshit Surana, Aryeh Tiktinsky, Rosni Vasu, Guy Wiener, Chloe Anastasiades, Stefan Candra, Jason Dunkelberger, Dan Emery, Rob Evans, Malachi Hamada, Regan Huff, Rodney Kinney, Matt Latzke, Jaron Lochner, Ruben Lozano-Aguilera, Cecile Nguyen, Smita Rao, Amber Tanaka, Brooke Vlahos, Peter Clark, Doug Downey, Yoav Goldberg, Ashish Sabharwal, Daniel S. Weld

Link:  https://arxiv.org/abs/2510.21652v1

Date: 2025-10-d

Summary:

AI agents hold the potential to revolutionize scientific productivity by automating literature reviews, replicating experiments, analyzing data, and even proposing new directions of inquiry; indeed, there are now many such agents, ranging from general-purpose "deep research" systems to specialized science-specific agents, such as AI Scientist and AIGS. Rigorous evaluation of these agents is critical for progress. Yet existing benchmarks fall short on several fronts: they (1) fail to provide holistic, product-informed measures of real-world use cases such as science research; (2) lack reproducible agent tools necessary for a controlled comparison of core agentic capabilities; (3) do not account for confounding variables such as model cost and tool access; (4) do not provide standardized interfaces for quick agent prototyping and evaluation; and (5) lack comprehensive baseline agents necessary to identify true advances. In response, we define principles and tooling for more rigorously benchmarking agents. Using these, we present AstaBench, a suite that provides the first holistic measure of agentic ability to perform scientific research, comprising 2400+ problems spanning the entire scientific discovery process and multiple scientific domains, and including many problems inspired by actual user requests to deployed Asta agents. Our suite comes with the first scientific research environment with production-grade search tools that enable controlled, reproducible evaluation, better accounting for confounders. Alongside, we provide a comprehensive suite of nine science-optimized classes of Asta agents and numerous baselines. Our extensive evaluation of 57 agents across 22 agent classes reveals several interesting findings, most importantly that despite meaningful progress on certain individual aspects, AI remains far from solving the challenge of science research assistance.

--------------------------------------------------------------------------------------------------------

A Dynamic Knowledge Distillation Method Based on the Gompertz Curve

Conventional knowledge distillation methods often fail to account for the evolving learning speed of a smaller "student" model, leading to suboptimal transfer from a larger "teacher." This paper proposes Gompertz-CNN, a dynamic distillation framework that uses the Gompertz growth curve to modulate the weight of the distillation loss. This stage-aware strategy reflects the student's learning progression: slow start, rapid middle, and late saturation. By also incorporating Wasserstein distance and gradient matching, the multi-loss objective significantly improves transfer. Extensive experiments show consistent accuracy gains on image classification tasks. Potential applications include creating highly efficient, small models for edge computing and mobile devices without sacrificing the performance gained from large, complex models.

Authors:  Han Yang, Guangjun Qin

Link:  https://arxiv.org/abs/2510.21649v1

Date: 2025-10-d

Summary:

This paper introduces a novel dynamic knowledge distillation framework, Gompertz-CNN, which integrates the Gompertz growth model into the training process to address the limitations of traditional knowledge distillation. Conventional methods often fail to capture the evolving cognitive capacity of student models, leading to suboptimal knowledge transfer. To overcome this, we propose a stage-aware distillation strategy that dynamically adjusts the weight of distillation loss based on the Gompertz curve, reflecting the student's learning progression: slow initial growth, rapid mid-phase improvement, and late-stage saturation. Our framework incorporates Wasserstein distance to measure feature-level discrepancies and gradient matching to align backward propagation behaviors between teacher and student models. These components are unified under a multi-loss objective, where the Gompertz curve modulates the influence of distillation losses over time. Extensive experiments on CIFAR-10 and CIFAR-100 using various teacher-student architectures (e.g., ResNet50 and MobileNet_v2) demonstrate that Gompertz-CNN consistently outperforms traditional distillation methods, achieving up to 8% and 4% accuracy gains on CIFAR-10 and CIFAR-100, respectively.
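
The core scheduling idea is compact enough to sketch in Python. The constants below are illustrative assumptions; the paper's parameterization and loss composition may differ:

    import numpy as np

    def gompertz_weight(epoch, total_epochs, a=1.0, b=4.0, c=5.0):
        """Distillation weight a*exp(-b*exp(-c*t)) over normalized progress t in [0, 1]:
        slow initial growth, rapid mid-phase increase, late-stage saturation."""
        t = epoch / total_epochs
        return a * np.exp(-b * np.exp(-c * t))

    def total_loss(task_loss, distill_loss, epoch, total_epochs):
        """Blend task and distillation losses, shifting toward distillation over time."""
        w = gompertz_weight(epoch, total_epochs)
        return (1 - w) * task_loss + w * distill_loss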

--------------------------------------------------------------------------------------------------------

DEEDEE: Fast and Scalable Out-of-Distribution Dynamics Detection

Reliable deployment of Reinforcement Learning (RL) agents in safety-critical settings requires robust detection of shifts in the environment, known as Out-of-Distribution (OOD) dynamics. The paper introduces DEEDEE, a minimal, two-statistic detector that significantly simplifies complex, representation-heavy pipelines. DEEDEE uses only the episodewise mean and an RBF kernel similarity to a training summary, capturing complementary global and local deviations. This simplicity results in a 600-fold reduction in compute while maintaining or exceeding the accuracy of contemporary detectors. The core finding is that diverse anomalies imprint on RL trajectories through a small set of low-order statistics. Potential applications include real-time, lightweight OOD detection for safe and scalable deployment of RL in autonomous vehicles, industrial control, and robotics.

Authors:  Tala Aljaafari, Varun Kanade, Philip Torr, Christian Schroeder de Witt

Link:  https://arxiv.org/abs/2510.21638v1

Date: 2025-10-d

Summary:

Deploying reinforcement learning (RL) in safety-critical settings is constrained by brittleness under distribution shift. We study out-of-distribution (OOD) detection for RL time series and introduce DEEDEE, a two-statistic detector that revisits representation-heavy pipelines with a minimal alternative. DEEDEE uses only an episodewise mean and an RBF kernel similarity to a training summary, capturing complementary global and local deviations. Despite its simplicity, DEEDEE matches or surpasses contemporary detectors across standard RL OOD suites, delivering a 600-fold reduction in compute (FLOPs / wall-time) and an average 5% absolute accuracy gain over strong baselines. Conceptually, our results indicate that diverse anomaly types often imprint on RL trajectories through a small set of low-order statistics, suggesting a compact foundation for OOD detection in complex environments.
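
Because the detector reduces to two statistics, a faithful sketch fits in a few lines. The summary vector and threshold calibration below are assumptions about the setup, not the paper's exact recipe:

    import numpy as np

    def rbf_similarity(x, y, gamma=1.0):
        return np.exp(-gamma * np.sum((x - y) ** 2))

    def deedee_statistics(episode, train_summary, gamma=1.0):
        """episode: (T, d) array of per-step features.
        train_summary: e.g., the mean of in-distribution episode means."""
        ep_mean = episode.mean(axis=0)                       # global statistic
        sim = rbf_similarity(ep_mean, train_summary, gamma)  # local deviation
        return ep_mean, sim

An episode is flagged OOD when its similarity falls below a threshold calibrated on held-out in-distribution episodes (e.g., a low percentile of training similarities).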

--------------------------------------------------------------------------------------------------------

Few-Shot Knowledge Distillation of LLMs With Counterfactual Explanations

Knowledge distillation for Large Language Models (LLMs) often requires vast amounts of data, which is a major constraint in few-shot, task-aware scenarios. The paper introduces Counterfactual-explanation-infused Distillation (CoD), a novel strategy to efficiently transfer knowledge using minimal data. CoD systematically incorporates Counterfactual Explanations (CFEs)—inputs that minimally perturb the data to flip the teacher model's prediction—to precisely map the teacher's decision boundary. The theoretical and geometric analysis confirms that CFEs act as informative knowledge probes. CoD demonstrably outperforms standard distillation in few-shot regimes, achieving better results with half the original samples. Potential applications involve rapidly and cost-effectively deploying specialized, smaller LLMs in resource-constrained or proprietary data environments.

Authors:  Faisal Hamman, Pasan Dissanayake, Yanjun Fu, Sanghamitra Dutta

Link:  https://arxiv.org/abs/2510.21631v1

Date: 2025-10-d

Summary:

Knowledge distillation is a promising approach to transfer capabilities from complex teacher models to smaller, resource-efficient student models that can be deployed easily, particularly in task-aware scenarios. However, existing methods of task-aware distillation typically require substantial quantities of data which may be unavailable or expensive to obtain in many practical scenarios. In this paper, we address this challenge by introducing a novel strategy called Counterfactual-explanation-infused Distillation (CoD) for few-shot task-aware knowledge distillation by systematically infusing counterfactual explanations. Counterfactual explanations (CFEs) refer to inputs that can flip the output prediction of the teacher model with minimum perturbation. Our strategy CoD leverages these CFEs to precisely map the teacher's decision boundary with significantly fewer samples. We provide theoretical guarantees for motivating the role of CFEs in distillation, from both statistical and geometric perspectives. We mathematically show that CFEs can improve parameter estimation by providing more informative examples near the teacher's decision boundary. We also derive geometric insights on how CFEs effectively act as knowledge probes, helping the student mimic the teacher's decision boundaries more effectively than standard data. We perform experiments across various datasets and LLMs to show that CoD outperforms standard distillation approaches in few-shot regimes (as low as 8-512 samples). Notably, CoD only uses half of the original samples used by the baselines, paired with their corresponding CFEs, and still improves performance.
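
The data-construction step can be sketched as follows; cfe_generator stands in for whatever CFE method is used, and the pairing is a plausible reading of the setup rather than the paper's exact procedure:

    def cod_distillation_set(samples, teacher, cfe_generator):
        """Pair each original sample with a counterfactual that flips the
        teacher's prediction under minimal perturbation."""
        dataset = []
        for x in samples:
            dataset.append((x, teacher(x)))        # soft label near the original
            x_cf = cfe_generator(x, teacher)       # minimally perturbed, label flips
            dataset.append((x_cf, teacher(x_cf)))  # probes the decision boundary
        return dataset

The student is then trained on this boundary-dense set, which is why half the original sample budget, paired with CFEs, can outperform the full budget alone.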

--------------------------------------------------------------------------------------------------------

The Universal Landscape of Human Reasoning

Understanding the dynamic accumulation and transformation of information during human reasoning is a fundamental challenge. This paper introduces Information Flow Tracking (IF-Track), a method that leverages Large Language Models (LLMs) as a probabilistic encoder to quantify information entropy and gain at each reasoning step. IF-Track successfully models the universal landscape of human reasoning behaviors within a single metric space, capturing essential features, identifying error patterns, and characterizing individual differences. The method is used to reconcile competing psychological theories, such as single- versus dual-process models. Potential applications include creating a quantitative bridge between cognitive theory and measurement, offering mechanistic insights into reasoning architecture, and potentially improving the design of AI systems.

Authors:  Qiguang Chen, Jinhao Liu, Libo Qin, Yimeng Zhang, Yihao Liang, Shangxu Ren, Chengyu Luan, Dengyun Peng, Hanjing Li, Jiannan Guan, Zheng Yan, Jiaqi Wang, Mengkang Hu, Yantao Du, Zhi Chen, Xie Chen, Wanxiang Che

Link:  https://arxiv.org/abs/2510.21623v1

Date: 2025-10-d

Summary:

Understanding how information is dynamically accumulated and transformed in human reasoning has long challenged cognitive psychology, philosophy, and artificial intelligence. Existing accounts, from classical logic to probabilistic models, illuminate aspects of output or individual modelling, but do not offer a unified, quantitative description of general human reasoning dynamics. To solve this, we introduce Information Flow Tracking (IF-Track), which uses large language models (LLMs) as a probabilistic encoder to quantify information entropy and gain at each reasoning step. Through fine-grained analyses across diverse tasks, our method is the first to successfully model the universal landscape of human reasoning behaviors within a single metric space. We show that IF-Track captures essential reasoning features, identifies systematic error patterns, and characterizes individual differences. Applying IF-Track to advanced psychological theory, we reconcile single- versus dual-process theories, and examine the alignment of artificial and human cognition and how LLMs are reshaping the human reasoning process. This approach establishes a quantitative bridge between theory and measurement, offering mechanistic insights into the architecture of reasoning.
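
A minimal sketch of the per-step measurement, assuming access to the encoder LLM's token-level distributions (the paper's exact estimator may differ):

    import numpy as np

    def step_entropy(token_distributions):
        """Mean Shannon entropy over a step's next-token distributions."""
        return float(np.mean([-np.sum(p * np.log(p + 1e-12))
                              for p in token_distributions]))

    def information_gain(step_entropies):
        """Information gain per reasoning step = reduction in entropy."""
        return [step_entropies[i] - step_entropies[i + 1]
                for i in range(len(step_entropies) - 1)]

Each reasoning trace then becomes a trajectory in (entropy, gain) space, which is what allows traces from different tasks and reasoners to be compared in a single metric space.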

--------------------------------------------------------------------------------------------------------

DeepAgent: A General Reasoning Agent with Scalable Toolsets

Real-world problem-solving for Large Reasoning Models (LRMs) necessitates the use of external tools and long-horizon interactions, which current agent frameworks often handle poorly. DeepAgent is introduced as an end-to-end deep reasoning agent that integrates autonomous thinking, tool discovery, and action execution. It addresses context length challenges with an autonomous memory folding mechanism that compresses interaction history into structured episodic, working, and tool memories. A novel reinforcement learning strategy, ToolPO, teaches general-purpose tool use efficiently. DeepAgent consistently outperforms baselines on general tool-use and downstream tasks. Potential applications involve creating more capable, general agents for complex tasks like advanced web automation, multi-step problem solving, and long-term interactive assistance.

Authors:  Xiaoxi Li, Wenxiang Jiao, Jiarui Jin, Guanting Dong, Jiajie Jin, Yinuo Wang, Hao Wang, Yutao Zhu, Ji-Rong Wen, Yuan Lu, Zhicheng Dou

Link:  https://arxiv.org/abs/2510.21618v1

Date: 2025-10-d

Summary:

Large reasoning models have demonstrated strong problem-solving abilities, yet real-world tasks often require external tools and long-horizon interactions. Existing agent frameworks typically follow predefined workflows, which limit autonomous and global task completion. In this paper, we introduce DeepAgent, an end-to-end deep reasoning agent that performs autonomous thinking, tool discovery, and action execution within a single, coherent reasoning process. To address the challenges of long-horizon interactions, particularly the context length explosion from multiple tool calls and the accumulation of interaction history, we introduce an autonomous memory folding mechanism that compresses past interactions into structured episodic, working, and tool memories, reducing error accumulation while preserving critical information. To teach general-purpose tool use efficiently and stably, we develop an end-to-end reinforcement learning strategy, namely ToolPO, that leverages LLM-simulated APIs and applies tool-call advantage attribution to assign fine-grained credit to the tool invocation tokens. Extensive experiments on eight benchmarks, including general tool-use tasks (ToolBench, API-Bank, TMDB, Spotify, ToolHop) and downstream applications (ALFWorld, WebShop, GAIA, HLE), demonstrate that DeepAgent consistently outperforms baselines across both labeled-tool and open-set tool retrieval scenarios. This work takes a step toward more general and capable agents for real-world applications. The code and demo are available at https://github.com/RUC-NLPIR/DeepAgent.
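
A toy sketch of memory folding; the three-way split and the summarize helper are assumptions for illustration, not DeepAgent's actual implementation:

    def fold_memory(history, summarize):
        """Compress a raw interaction transcript into structured memories."""
        episodic = summarize([h for h in history if h["type"] == "observation"])
        working = summarize(history[-5:])  # recent, task-relevant context
        tools = sorted({h["tool"] for h in history if h.get("tool")})
        return {"episodic": episodic, "working": working, "tools": tools}

The folded memories replace the raw transcript in the agent's context window, bounding context growth across long-horizon tool-use episodes while retaining critical information.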

--------------------------------------------------------------------------------------------------------

Huxley-Gödel Machine: Human-Level Coding Agent Development by an Approximation of the Optimal Self-Improving Machine

This research addresses the Metaproductivity-Performance Mismatch in self-improving coding agents, where current performance doesn't reliably indicate future improvement potential. Inspired by the theoretical Gödel Machine, the paper introduces the Huxley-Gödel Machine (HGM), which estimates Clade Metaproductivity (CMP)—the aggregated performance of an agent's descendants—to guide its search for self-modifications. By prioritizing changes with higher future potential, HGM outperforms prior self-improving methods on coding benchmarks like SWE-bench, requiring less wall-clock time. Optimized by HGM, a GPT-5-mini agent achieved human-level performance on SWE-bench Lite. Potential applications lie in developing highly efficient, truly self-improving AI that can rapidly evolve its own codebase to achieve human-expert-level software engineering.

Authors:  Wenyi Wang, Piotr Piękos, Li Nanbo, Firas Laakom, Yimeng Chen, Mateusz Ostaszewski, Mingchen Zhuge, Jürgen Schmidhuber

Link:  https://arxiv.org/abs/2510.21614v1

Date: 2025-10-d

Summary:

Recent studies operationalize self-improvement through coding agents that edit their own codebases. They grow a tree of self-modifications through expansion strategies that favor higher software engineering benchmark performance, assuming that this implies more promising subsequent self-modifications. However, we identify a mismatch between the agent's self-improvement potential (metaproductivity) and its coding benchmark performance, namely the Metaproductivity-Performance Mismatch. Inspired by Huxley's concept of clade, we propose a metric (CMP) that aggregates the benchmark performances of the descendants of an agent as an indicator of its potential for self-improvement. We show that, in our self-improving coding agent development setting, access to the true CMP is sufficient to simulate how the Gödel Machine would behave under certain assumptions. We introduce the Huxley-Gödel Machine (HGM), which, by estimating CMP and using it as guidance, searches the tree of self-modifications. On SWE-bench Verified and Polyglot, HGM outperforms prior self-improving coding agent development methods while using less wall-clock time. Last but not least, HGM demonstrates strong transfer to other coding datasets and large language models. The agent optimized by HGM on SWE-bench Verified with GPT-5-mini and evaluated on SWE-bench Lite with GPT-5 achieves human-level performance, matching the best officially checked results of human-engineered coding agents. Our code is available at https://github.com/metauto-ai/HGM.
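
The guidance signal is easy to illustrate. Here CMP is aggregated as a simple mean over descendant scores; the paper's estimator is more sophisticated, so treat this as a sketch of the idea only:

    from dataclasses import dataclass, field

    @dataclass
    class AgentNode:
        benchmark_score: float
        children: list = field(default_factory=list)

    def clade_metaproductivity(node):
        """Aggregate benchmark scores over all descendants of a node."""
        scores, stack = [], list(node.children)
        while stack:
            n = stack.pop()
            scores.append(n.benchmark_score)
            stack.extend(n.children)
        return sum(scores) / len(scores) if scores else 0.0

    def select_for_expansion(nodes):
        """Expand the self-modification with the highest estimated CMP,
        not the one with the highest own benchmark score."""
        return max(nodes, key=clade_metaproductivity)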

--------------------------------------------------------------------------------------------------------

Generative Correlation Manifolds: Generating Synthetic Data with Preserved Higher-Order Correlations

Synthetic data generation is crucial for privacy and model robustness, but existing methods often fail to preserve the higher-order correlation structure of real-world data. This limitation yields superficially realistic data that breaks down in sophisticated modeling tasks. The paper introduces Generative Correlation Manifolds (GCM), a computationally efficient method that uses Cholesky decomposition of a target correlation matrix. GCM is mathematically proven to preserve the entire correlation structure, from simple pairwise relationships to complex, multi-variable interactions. Potential applications include privacy-preserving data sharing for sensitive datasets (e.g., in finance or healthcare), robust training of machine learning models, and complex data simulation where multi-variable dependencies are critical.

Authors:  Jens E. d'Hondt, Wieger R. Punter, Odysseas Papapetrou

Link:  https://arxiv.org/abs/2510.21610v1

Date: 2025-10-d

Summary:

The increasing need for data privacy and the demand for robust machine learning models have fueled the development of synthetic data generation techniques. However, current methods often succeed in replicating simple summary statistics but fail to preserve both the pairwise and higher-order correlation structure of the data that define the complex, multi-variable interactions inherent in real-world systems. This limitation can lead to synthetic data that is superficially realistic but fails when used for sophisticated modeling tasks. In this white paper, we introduce Generative Correlation Manifolds (GCM), a computationally efficient method for generating synthetic data. The technique uses Cholesky decomposition of a target correlation matrix to produce datasets that, by mathematical proof, preserve the entire correlation structure -- from simple pairwise relationships to higher-order interactions -- of the source dataset. We argue that this method provides a new approach to synthetic data generation with potential applications in privacy-preserving data sharing, robust model training, and simulation.
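
The pairwise core of the construction is worth seeing: mixing independent standard normals through the Cholesky factor of the target correlation matrix reproduces that matrix in expectation. This sketch shows only the pairwise mechanism; the paper's contribution includes the proof that the construction also preserves higher-order structure:

    import numpy as np

    def generate_gcm(target_corr, n_samples, seed=0):
        """Sample data whose correlation matrix approaches target_corr,
        since Cov(Z @ L.T) = L @ I @ L.T = target_corr."""
        rng = np.random.default_rng(seed)
        L = np.linalg.cholesky(target_corr)
        Z = rng.standard_normal((n_samples, target_corr.shape[0]))
        return Z @ L.T

    R = np.array([[1.0, 0.6, 0.3],
                  [0.6, 1.0, 0.5],
                  [0.3, 0.5, 1.0]])
    X = generate_gcm(R, 100_000)
    print(np.round(np.corrcoef(X, rowvar=False), 2))  # approximately R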

--------------------------------------------------------------------------------------------------------

Doc-Researcher: A Unified System for Multimodal Document Parsing and Deep Research

Current deep research systems for Large Language Models (LLMs) are limited to textual web data, neglecting the vast knowledge in multimodal documents (containing figures, tables, etc.). Doc-Researcher is a unified system that bridges this gap with three components: deep multimodal parsing that preserves visual semantics, systematic retrieval architecture supporting text, vision, and hybrid queries, and iterative multi-agent workflows for evidence synthesis. To enable rigorous testing, the new M4DocBench benchmark is introduced for multimodal, multi-hop, and multi-turn research. Doc-Researcher achieves a 3.4x accuracy improvement over baselines. Potential applications include revolutionizing research and analysis in fields that rely heavily on complex documents, such as legal, scientific, and engineering domains.

Authors:  Kuicai Dong, Shurui Huang, Fangda Ye, Wei Han, Zhi Zhang, Dexun Li, Wenjun Li, Qu Yang, Gang Wang, Yichao Wang, Chen Zhang, Yong Liu

Link:  https://arxiv.org/abs/2510.21603v1

Date: 2025-10-d

Summary:

Deep Research systems have revolutionized how LLMs solve complex questions through iterative reasoning and evidence gathering. However, current systems remain fundamentally constrained to textual web data, overlooking the vast knowledge embedded in multimodal documents. Processing such documents demands sophisticated parsing to preserve visual semantics (figures, tables, charts, and equations), intelligent chunking to maintain structural coherence, and adaptive retrieval across modalities, which are capabilities absent in existing systems. In response, we present Doc-Researcher, a unified system that bridges this gap through three integrated components: (i) deep multimodal parsing that preserves layout structure and visual semantics while creating multi-granular representations from chunk to document level, (ii) systematic retrieval architecture supporting text-only, vision-only, and hybrid paradigms with dynamic granularity selection, and (iii) iterative multi-agent workflows that decompose complex queries, progressively accumulate evidence, and synthesize comprehensive answers across documents and modalities. To enable rigorous evaluation, we introduce M4DocBench, the first benchmark for Multi-modal, Multi-hop, Multi-document, and Multi-turn deep research. Featuring 158 expert-annotated questions with complete evidence chains across 304 documents, M4DocBench tests capabilities that existing benchmarks cannot assess. Experiments demonstrate that Doc-Researcher achieves 50.6% accuracy, 3.4x better than state-of-the-art baselines, validating that effective document research requires not just better retrieval, but fundamentally deep parsing that preserves multimodal integrity and supports iterative research. Our work establishes a new paradigm for conducting deep research on multimodal document collections.

--------------------------------------------------------------------------------------------------------


EYE ON A.I. GETS READERS UP TO DATE ON THE LATEST FUNDING NEWS AND RELATED ISSUES. SUBSCRIBE FOR THE WEEKLY NEWSLETTER.
