Week Ending 9.7.2025
RESEARCH WATCH: 9.7.2025
WinT3R: Window-Based Streaming Reconstruction with Camera Token Pool
WinT3R presents a new approach to real-time 3D reconstruction, addressing the common trade-off between quality and speed. The model uses a sliding window mechanism to enhance geometric predictions and a camera token pool to improve the reliability of camera pose estimation. This design allows it to create precise 3D maps and track camera movement on the fly without a large computational burden. This makes WinT3R a powerful tool for applications requiring high-quality, real-time spatial awareness, such as augmented reality, robotics, and drone navigation, where instant and accurate environmental mapping is critical for safe and effective operation.
Authors: Zizun Li, Jianjun Zhou, Yifan Wang, Haoyu Guo, Wenzheng Chang, Yang Zhou, Haoyi Zhu, Junyi Chen, Chunhua Shen, Tong He
Link: https://arxiv.org/abs/2509.05296v1
Date: 2025-09
Summary:
We present WinT3R, a feed-forward reconstruction model capable of online prediction of precise camera poses and high-quality point maps. Previous methods suffer from a trade-off between reconstruction quality and real-time performance. To address this, we first introduce a sliding window mechanism that ensures sufficient information exchange among frames within the window, thereby improving the quality of geometric predictions without large computation. In addition, we leverage a compact representation of cameras and maintain a global camera token pool, which enhances the reliability of camera pose estimation without sacrificing efficiency. These designs enable WinT3R to achieve state-of-the-art performance in terms of online reconstruction quality, camera pose estimation, and reconstruction speed, as validated by extensive experiments on diverse datasets. Code and model are publicly available at https://github.com/LiZizun/WinT3R.
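The interplay between a sliding window and a global camera token pool can be illustrated with a small Python sketch. This is not the WinT3R implementation: the window size, the stride, and the names `sliding_windows`, `CameraTokenPool`, and `run_stream` are assumptions made for illustration, and the "tokens" are plain strings rather than learned features.

```python
from collections import deque
from typing import Deque, Iterable, Iterator, List, Tuple


def sliding_windows(frames: Iterable[str], size: int = 4, stride: int = 2) -> Iterator[List[str]]:
    """Yield overlapping windows of the most recent frames as they arrive."""
    window: Deque[str] = deque(maxlen=size)
    since_emit = 0
    for frame in frames:
        window.append(frame)
        since_emit += 1
        if len(window) == size and since_emit >= stride:
            since_emit = 0
            yield list(window)


class CameraTokenPool:
    """Hypothetical global pool holding one compact token per frame seen so far."""

    def __init__(self) -> None:
        self.tokens: List[str] = []
        self._seen = set()

    def add(self, frame: str) -> None:
        if frame not in self._seen:  # overlapping windows revisit frames
            self._seen.add(frame)
            self.tokens.append(f"cam({frame})")


def run_stream(frames: Iterable[str]) -> List[Tuple[List[str], int]]:
    """Return (window, number of pooled tokens visible to that window)."""
    pool = CameraTokenPool()
    log = []
    for window in sliding_windows(frames):
        visible = len(pool.tokens)  # global context from earlier windows
        # A real model would predict poses and point maps here, using the
        # frames in the window plus the pooled camera tokens.
        for frame in window:
            pool.add(frame)
        log.append((window, visible))
    return log
```

With eight frames, the sketch processes three overlapping windows, and each later window sees a larger pool of camera tokens while the per-window frame count stays fixed.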
--------------------------------------------------------------------------------------------------------
This paper introduces a novel method to understand how large language models (LLMs) acquire linguistic abilities during training. Instead of just using benchmarks to see what a model can do, the authors use sparse crosscoders to track the evolution of specific features, like grammar or syntax, across different model checkpoints. This allows researchers to pinpoint when and how a model learns a particular concept. The approach offers a promising way to make LLM development more interpretable and fine-grained, with potential applications in improving training efficiency and debugging models by identifying when and why certain capabilities emerge or fail to develop.
Authors: Deniz Bayazit, Aaron Mueller, Antoine Bosselut
Link: https://arxiv.org/abs/2509.05291v1
Date: 2025-09
Summary:
Large language models (LLMs) learn non-trivial abstractions during pretraining, like detecting irregular plural noun subjects. However, it is not well understood when and how specific linguistic abilities emerge as traditional evaluation methods such as benchmarking fail to reveal how models acquire concepts and capabilities. To bridge this gap and better understand model training at the concept level, we use sparse crosscoders to discover and align features across model checkpoints. Using this approach, we track the evolution of linguistic features during pretraining. We train crosscoders between open-sourced checkpoint triplets with significant performance and representation shifts, and introduce a novel metric, Relative Indirect Effects (RelIE), to trace training stages at which individual features become causally important for task performance. We show that crosscoders can detect feature emergence, maintenance, and discontinuation during pretraining. Our approach is architecture-agnostic and scalable, offering a promising path toward more interpretable and fine-grained analysis of representation learning throughout pretraining.
--------------------------------------------------------------------------------------------------------
SpikingBrain Technical Report: Spiking Brain-inspired Large Models
SpikingBrain introduces a new family of large models inspired by the human brain, aiming to solve the efficiency issues of traditional Transformer-based LLMs, such as quadratic computational scaling and linear memory growth. By using linear and hybrid-linear attention architectures with adaptive spiking neurons, SpikingBrain significantly improves long-context processing. This research demonstrates the feasibility of developing large models on non-NVIDIA hardware, offering a new path for efficient, scalable, and low-power AI. The models' event-driven spiking behavior and high sparsity enable substantial speedups and reduced energy consumption, with applications in developing more sustainable and accessible AI systems.
Authors: Yuqi Pan, Yupeng Feng, Jinghao Zhuang, Siyu Ding, Zehao Liu, Bohan Sun, Yuhong Chou, Han Xu, Xuerui Qiu, Anlin Deng, Anjie Hu, Peng Zhou, Man Yao, Jibin Wu, Jian Yang, Guoliang Sun, Bo Xu, Guoqi Li
Link: https://arxiv.org/abs/2509.05276v1
Date: 2025-09
Summary:
Mainstream Transformer-based large language models face major efficiency bottlenecks: training computation scales quadratically with sequence length, and inference memory grows linearly, limiting long-context processing. Building large models on non-NVIDIA platforms also poses challenges for stable and efficient training. To address this, we introduce SpikingBrain, a family of brain-inspired models designed for efficient long-context training and inference. SpikingBrain leverages the MetaX GPU cluster and focuses on three aspects: (1) Model Architecture: linear and hybrid-linear attention architectures with adaptive spiking neurons; (2) Algorithmic Optimizations: an efficient, conversion-based training pipeline and a dedicated spike coding framework; (3) System Engineering: customized training frameworks, operator libraries, and parallelism strategies tailored to MetaX hardware. Using these techniques, we develop two models: SpikingBrain-7B, a linear LLM, and SpikingBrain-76B, a hybrid-linear MoE LLM. These models demonstrate the feasibility of large-scale LLM development on non-NVIDIA platforms. SpikingBrain achieves performance comparable to open-source Transformer baselines while using only about 150B tokens for continual pre-training. Our models significantly improve long-sequence training efficiency and deliver inference with (partially) constant memory and event-driven spiking behavior. For example, SpikingBrain-7B attains over 100x speedup in Time to First Token for 4M-token sequences. Training remains stable for weeks on hundreds of MetaX C550 GPUs, with the 7B model reaching a Model FLOPs Utilization of 23.4 percent. The proposed spiking scheme achieves 69.15 percent sparsity, enabling low-power operation. Overall, this work demonstrates the potential of brain-inspired mechanisms to drive the next generation of efficient and scalable large model design.
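The scaling contrast in the summary (quadratic attention cost and linearly growing inference memory versus a constant-size state) can be made concrete with back-of-envelope arithmetic. The layer count, head count, head dimension, and 2-byte values below are hypothetical and are not SpikingBrain's configuration; the linear-attention state is modeled in the generic way, as one `head_dim x head_dim` matrix per head and layer.

```python
def attention_score_ops(seq_len: int) -> int:
    """Pairwise token interactions in full softmax attention (per head, per layer)."""
    return seq_len * seq_len


def kv_cache_bytes(seq_len: int, n_layers: int = 32, n_heads: int = 32,
                   head_dim: int = 128, bytes_per_value: int = 2) -> int:
    """Key/value cache of a standard Transformer: grows linearly with seq_len."""
    # 2 tensors (keys and values) per layer, each seq_len x n_heads x head_dim
    return 2 * n_layers * seq_len * n_heads * head_dim * bytes_per_value


def linear_state_bytes(n_layers: int = 32, n_heads: int = 32,
                       head_dim: int = 128, bytes_per_value: int = 2) -> int:
    """Recurrent state of generic linear attention: independent of seq_len."""
    return n_layers * n_heads * head_dim * head_dim * bytes_per_value


if __name__ == "__main__":
    for n in (4_000, 4_000_000):
        print(n, attention_score_ops(n), kv_cache_bytes(n), linear_state_bytes())
```

Under these assumed dimensions the cache costs 524,288 bytes per token, so it keeps growing with context length, while the linear-attention state stays at 32 MiB regardless of sequence length.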
--------------------------------------------------------------------------------------------------------
LatticeWorld is a framework that uses multimodal large language models to streamline the creation of complex, interactive 3D virtual environments. By combining a lightweight LLM (LLaMA-2-7B) with an industry-grade rendering engine such as Unreal Engine 5, the system generates dynamic, large-scale worlds from text descriptions and visual instructions. The authors report over a 90x increase in industrial production efficiency compared with traditional manual modeling. The framework has potential applications in embodied AI, autonomous driving simulation, and the entertainment industry, where rapid generation of realistic and dynamic virtual worlds supports training, testing, and creative content production.
Authors: Yinglin Duan, Zhengxia Zou, Tongwei Gu, Wei Jia, Zhan Zhao, Luyi Xu, Xinzhu Liu, Hao Jiang, Kang Chen, Shuang Qiu
Link: https://arxiv.org/abs/2509.05263v1
Date: 2025-09
Summary:
Recent research has been increasingly focusing on developing 3D world models that simulate complex real-world scenarios. World models have found broad applications across various domains, including embodied AI, autonomous driving, entertainment, etc. A more realistic simulation with accurate physics will effectively narrow the sim-to-real gap and allow us to gather rich information about the real world conveniently. While traditional manual modeling has enabled the creation of virtual 3D scenes, modern approaches have leveraged advanced machine learning algorithms for 3D world generation, with most recent advances focusing on generative methods that can create virtual worlds based on user instructions. This work explores such a research direction by proposing LatticeWorld, a simple yet effective 3D world generation framework that streamlines the industrial production pipeline of 3D environments. LatticeWorld leverages lightweight LLMs (LLaMA-2-7B) alongside the industry-grade rendering engine (e.g., Unreal Engine 5) to generate a dynamic environment. Our proposed framework accepts textual descriptions and visual instructions as multimodal inputs and creates large-scale 3D interactive worlds with dynamic agents, featuring competitive multi-agent interaction, high-fidelity physics simulation, and real-time rendering. We conduct comprehensive experiments to evaluate LatticeWorld, showing that it achieves superior accuracy in scene layout generation and visual fidelity. Moreover, LatticeWorld achieves over a 90x increase in industrial production efficiency while maintaining high creative quality compared with traditional manual production methods. Our demo video is available at https://youtu.be/8VWZXpERR18
--------------------------------------------------------------------------------------------------------
Scaling Performance of Large Language Model Pretraining
This paper addresses the scarcity of public information regarding the practical challenges and scaling performance of pre-training large language models. The authors demystify the complexities of distributed training, managing massive datasets, and scaling data parallelism to fully utilize GPU capacity. The work provides valuable insights and practical recommendations for tuning training performance in large-scale pipelines. The findings are highly relevant for AI researchers, engineers, and companies investing in supercomputing infrastructure, as they can help to optimize training pipelines, improve efficiency, and reduce the immense computational costs associated with developing the next generation of LLMs.
Authors: Alexander Interrante-Grant, Carla Varela-Rosa, Suhaas Narayan, Chris Connelly, Albert Reuther
Link: https://arxiv.org/abs/2509.05258v1
Date: 2025-09
Summary:
Large language models (LLMs) show best-in-class performance across a wide range of natural language processing applications. Training these models is an extremely computationally expensive task; frontier Artificial Intelligence (AI) research companies are investing billions of dollars into supercomputing infrastructure to train progressively larger models on increasingly massive datasets. Unfortunately, information about the scaling performance and training considerations of these large training pipelines is scarce in public literature. Working with large-scale datasets and models can be complex and practical recommendations are scarce in the public literature for tuning training performance when scaling up large language models. In this paper, we aim to demystify the large language model pretraining pipeline somewhat - in particular with respect to distributed training, managing large datasets across hundreds of nodes, and scaling up data parallelism with an emphasis on fully leveraging available GPU compute capacity.
--------------------------------------------------------------------------------------------------------
COGITAO: A Visual Reasoning Framework To Study Compositionality & Generalization
COGITAO is a new data generation framework and benchmark designed to rigorously test a machine learning model's ability to compose learned concepts and apply them to new situations. Inspired by the ARC-AGI problem set, it creates rule-based visual tasks with a vast number of unique rules and adjustable difficulty. By providing baseline experiments with state-of-the-art vision models, the paper highlights a persistent limitation: models struggle to generalize to new combinations of familiar elements, despite strong performance on individual tasks. This open-sourced framework is a crucial tool for research into generalization and compositionality, with applications in creating more robust and flexible AI systems for visual tasks.
Authors: Yassine Taoudi-Benchekroun, Klim Troyan, Pascal Sager, Stefan Gerber, Lukas Tuggener, Benjamin Grewe
Link: https://arxiv.org/abs/2509.05249v1
Date: 2025-09
Summary:
The ability to compose learned concepts and apply them in novel settings is key to human intelligence, but remains a persistent limitation in state-of-the-art machine learning models. To address this issue, we introduce COGITAO, a modular and extensible data generation framework and benchmark designed to systematically study compositionality and generalization in visual domains. Drawing inspiration from ARC-AGI's problem-setting, COGITAO constructs rule-based tasks which apply a set of transformations to objects in grid-like environments. It supports composition, at adjustable depth, over a set of 28 interoperable transformations, along with extensive control over grid parametrization and object properties. This flexibility enables the creation of millions of unique task rules -- surpassing concurrent datasets by several orders of magnitude -- across a wide range of difficulties, while allowing virtually unlimited sample generation per rule. We provide baseline experiments using state-of-the-art vision models, highlighting their consistent failures to generalize to novel combinations of familiar elements, despite strong in-domain performance. COGITAO is fully open-sourced, including all code and datasets, to support continued research in this field.
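Composing grid transformations at an adjustable depth can be sketched in a few lines of Python. The two transformations below (`rotate_cw`, `flip_horizontal`) and the helper `compose` are illustrative assumptions; they are not taken from COGITAO's set of 28 transformations or from its code.

```python
from functools import reduce
from typing import Callable, List

Grid = List[List[int]]
Transform = Callable[[Grid], Grid]


def rotate_cw(grid: Grid) -> Grid:
    """Rotate the grid 90 degrees clockwise."""
    return [list(row) for row in zip(*grid[::-1])]


def flip_horizontal(grid: Grid) -> Grid:
    """Mirror the grid left to right."""
    return [row[::-1] for row in grid]


def compose(*transforms: Transform) -> Transform:
    """Build one task rule by applying the transforms in order (depth = len(transforms))."""
    return lambda grid: reduce(lambda acc, t: t(acc), transforms, grid)


if __name__ == "__main__":
    rule = compose(rotate_cw, flip_horizontal)  # a depth-2 rule
    print(rule([[1, 2], [3, 4]]))
```

A compositional generalization test in this style would train a model on each transformation separately and then evaluate it on composed rules it has not seen.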
--------------------------------------------------------------------------------------------------------
Uncertain but Useful: Leveraging CNN Variability into Data Augmentation
This research explores the numerical variability of deep learning models during training, particularly in the context of neuroimaging. It demonstrates that while training introduces instability, this variability can be harnessed as a resource. By creating numerical ensembles from controlled perturbations, the authors show that these ensembles can be repurposed as a data augmentation strategy. As a proof of concept, they demonstrate improved performance in a brain age regression task. This finding re-frames model training variability from a reproducibility concern into a valuable tool for improving robustness and developing new applications, such as enhancing medical imaging analysis and diagnostics.
Authors: Inés Gonzalez-Pepe, Vinuyan Sivakolunthu, Yohan Chatelain, Tristan Glatard
Link: https://arxiv.org/abs/2509.05238v1
Date: 2025-09
Summary:
Deep learning (DL) is rapidly advancing neuroimaging by achieving state-of-the-art performance with reduced computation times. Yet the numerical stability of DL models -- particularly during training -- remains underexplored. While inference with DL is relatively stable, training introduces additional variability primarily through iterative stochastic optimization. We investigate this training-time variability using FastSurfer, a CNN-based whole-brain segmentation pipeline. Controlled perturbations are introduced via floating point perturbations and random seeds. We find that: (i) FastSurfer exhibits higher variability compared to that of a traditional neuroimaging pipeline, suggesting that DL inherits and is particularly susceptible to sources of instability present in its predecessors; (ii) ensembles generated with perturbations achieve performance similar to an unperturbed baseline; and (iii) variability effectively produces ensembles of numerical model families that can be repurposed for downstream applications. As a proof of concept, we demonstrate that numerical ensembles can be used as a data augmentation strategy for brain age regression. These findings position training-time variability not only as a reproducibility concern but also as a resource that can be harnessed to improve robustness and enable new applications in neuroimaging.
--------------------------------------------------------------------------------------------------------
The CURE framework is designed to mitigate conceptual shortcuts (spurious, concept-driven correlations) that impair the robustness and fairness of pre-trained language models. The method uses a content extractor and a reversal network to disentangle task-relevant information from harmful biases. A controllable debiasing module then uses contrastive learning to adjust the influence of residual conceptual cues, diminishing harmful biases while preserving useful information. The framework is lightweight and flexible: the authors report absolute F1 improvements of +10 points on IMDB and +2 points on Yelp with minimal computational overhead. The approach could support more reliable, fair, and trustworthy language understanding systems for tasks ranging from content moderation to sentiment analysis.
Authors: Aysenur Kocak, Shuo Yang, Bardh Prenkaj, Gjergji Kasneci
Link: https://arxiv.org/abs/2509.05230v1
Date: 2025-09
Summary:
Pre-trained language models have achieved remarkable success across diverse applications but remain susceptible to spurious, concept-driven correlations that impair robustness and fairness. In this work, we introduce CURE, a novel and lightweight framework that systematically disentangles and suppresses conceptual shortcuts while preserving essential content information. Our method first extracts concept-irrelevant representations via a dedicated content extractor reinforced by a reversal network, ensuring minimal loss of task-relevant information. A subsequent controllable debiasing module employs contrastive learning to finely adjust the influence of residual conceptual cues, enabling the model to either diminish harmful biases or harness beneficial correlations as appropriate for the target task. Evaluated on the IMDB and Yelp datasets using three pre-trained architectures, CURE achieves an absolute improvement of +10 points in F1 score on IMDB and +2 points on Yelp, while introducing minimal computational overhead. Our approach establishes a flexible, unsupervised blueprint for combating conceptual biases, paving the way for more reliable and fair language understanding systems.
--------------------------------------------------------------------------------------------------------
RapidGNN is a distributed training framework for large-scale Graph Neural Networks (GNNs) that targets the communication overhead and energy cost of training on highly connected datasets. The framework uses deterministic sampling-based scheduling to enable efficient cache construction and prefetching of remote features. The authors report 2.46x to 3.00x higher end-to-end training throughput on average than baseline methods, remote feature fetches cut by over 9.70x to 15.39x, and energy-efficiency gains of 44% on CPU and 32% on GPU. RapidGNN could benefit GNN training for applications such as social network analysis, fraud detection, and drug discovery.
Authors: Arefin Niam, Tevfik Kosar, M S Q Zulkar Nine
Link: https://arxiv.org/abs/2509.05207v1
Date: 2025-09
Summary:
Graph Neural Networks (GNNs) have become popular across a diverse set of tasks in exploring structural relationships between entities. However, due to the highly connected structure of the datasets, distributed training of GNNs on large-scale graphs poses significant challenges. Traditional sampling-based approaches mitigate the computational loads, yet the communication overhead remains a challenge. This paper presents RapidGNN, a distributed GNN training framework with deterministic sampling-based scheduling to enable efficient cache construction and prefetching of remote features. Evaluation on benchmark graph datasets demonstrates RapidGNN's effectiveness across different scales and topologies. RapidGNN improves end-to-end training throughput by 2.46x to 3.00x on average over baseline methods across the benchmark datasets, while cutting remote feature fetches by over 9.70x to 15.39x. RapidGNN further demonstrates near-linear scalability with an increasing number of computing units efficiently. Furthermore, it achieves increased energy efficiency over the baseline methods for both CPU and GPU by 44% and 32%, respectively.
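Why deterministic sampling helps prefetching can be shown with a toy Python sketch: when the sampler is seeded, the full batch schedule is known before training starts, so the remote features each batch needs can be planned and fetched ahead of use. This is not RapidGNN's code. The function names, the unbounded cache, and the simplification of sampling node IDs directly (rather than sampling graph neighborhoods) are all assumptions.

```python
import random
from typing import List, Set, Tuple


def sample_batches(node_ids: List[int], batch_size: int,
                   num_batches: int, seed: int) -> List[List[int]]:
    """Seeded sampler: the same seed always yields the same batch schedule."""
    rng = random.Random(seed)
    return [rng.sample(node_ids, batch_size) for _ in range(num_batches)]


def plan_prefetch(batches: List[List[int]], local_nodes: Set[int]) -> List[Set[int]]:
    """For each batch, list the nodes whose features live on another worker."""
    return [set(batch) - local_nodes for batch in batches]


def run_epoch(node_ids: List[int], local_nodes: Set[int], batch_size: int,
              num_batches: int, seed: int) -> Tuple[int, int]:
    """Return (on-demand remote misses, number of bulk prefetch requests)."""
    schedule = sample_batches(node_ids, batch_size, num_batches, seed)
    plan = plan_prefetch(schedule, local_nodes)
    cache: Set[int] = set()  # toy cache with no eviction
    misses = 0
    prefetches = 0
    for step, batch in enumerate(schedule):
        to_fetch = plan[step] - cache  # skip features that are already cached
        if to_fetch:
            cache |= to_fetch
            prefetches += 1  # one bulk request instead of per-node fetches
        # The training step would run here; count features still missing.
        misses += len(set(batch) - local_nodes - cache)
    return misses, prefetches
```

Because the plan is derived from the same seeded schedule the training loop replays, every remote feature is already cached when its batch runs, and the number of remote requests is bounded by the number of batches.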
--------------------------------------------------------------------------------------------------------
Enhancing 3D Point Cloud Classification with ModelNet-R and Point-SkipNet
This paper introduces a refined dataset, ModelNet-R, to address the limitations of the widely used ModelNet40 for 3D point cloud classification. It also proposes Point-SkipNet, a lightweight graph-based neural network that achieves high accuracy with fewer parameters. The research highlights the critical importance of dataset quality in optimizing model efficiency and performance. By providing a more reliable benchmark and an efficient network architecture, this work has significant applications in improving the accuracy and speed of 3D point cloud classification for fields like autonomous driving, robotics, and augmented reality, where precise object recognition in real time is essential.
Authors: Mohammad Saeid, Amir Salarpour, Pedram MohajerAnsari
Link: https://arxiv.org/abs/2509.05198v1
Date: 2025-09
Summary:
The classification of 3D point clouds is crucial for applications such as autonomous driving, robotics, and augmented reality. However, the commonly used ModelNet40 dataset suffers from limitations such as inconsistent labeling, 2D data, size mismatches, and inadequate class differentiation, which hinder model performance. This paper introduces ModelNet-R, a meticulously refined version of ModelNet40 designed to address these issues and serve as a more reliable benchmark. Additionally, this paper proposes Point-SkipNet, a lightweight graph-based neural network that leverages efficient sampling, neighborhood grouping, and skip connections to achieve high classification accuracy with reduced computational overhead. Extensive experiments demonstrate that models trained in ModelNet-R exhibit significant performance improvements. Notably, Point-SkipNet achieves state-of-the-art accuracy on ModelNet-R with a substantially lower parameter count compared to contemporary models. This research highlights the crucial role of dataset quality in optimizing model efficiency for 3D point cloud classification. For more details, see the code at: https://github.com/m-saeid/ModeNetR_PointSkipNet.
--------------------------------------------------------------------------------------------------------
AI Agents for Web Testing: A Case Study in the Wild
This paper introduces WebProber, an AI agent-based framework for automated web testing that goes beyond traditional methods. Using large language models and AI agents, WebProber simulates human-like interactions to explore websites, identify usability issues, and generate human-readable reports. The case study on academic websites revealed numerous usability issues that traditional tools missed, highlighting the potential of this approach. This research points to a promising future for AI-driven web testing, where systems can more effectively capture complex user behaviors and deliver higher-quality, user-centered web experiences by autonomously identifying and reporting bugs and usability flaws.
Authors: Naimeng Ye, Xiao Yu, Ruize Xu, Tianyi Peng, Zhou Yu
Link: https://arxiv.org/abs/2509.05197v1
Date: 2025-09
Summary:
Automated web testing plays a critical role in ensuring high-quality user experiences and delivering business value. Traditional approaches primarily focus on code coverage and load testing, but often fall short of capturing complex user behaviors, leaving many usability issues undetected. The emergence of large language models (LLMs) and AI agents opens new possibilities for web testing by enabling human-like interaction with websites and a general awareness of common usability problems. In this work, we present WebProber, a prototype AI agent-based web testing framework. Given a URL, WebProber autonomously explores the website, simulating real user interactions, identifying bugs and usability issues, and producing a human-readable report. We evaluate WebProber through a case study of 120 academic personal websites, where it uncovered 29 usability issues--many of which were missed by traditional tools. Our findings highlight agent-based testing as a promising direction while outlining directions for developing next-generation, user-centered testing frameworks.
--------------------------------------------------------------------------------------------------------
Accuracy-Constrained CNN Pruning for Efficient and Reliable EEG-Based Seizure Detection
This study presents a lightweight one-dimensional CNN with structured pruning to improve the efficiency and reliability of EEG-based seizure detection. Removing 50% of the convolutional kernels, ranked by their importance to model predictions, cuts weights and memory by 50% while the macro-F1 score improves slightly from 0.8686 to 0.8707. This suggests that the baseline network contained substantial redundancy. The approach offers a promising way to develop more efficient and reliable models for resource-limited settings, with direct applications in portable and affordable medical devices for real-time seizure detection and other biomedical signal analysis.
Authors: Mounvik K, N Harshit
Link: https://arxiv.org/abs/2509.05190v1
Date: 2025-09
Summary:
Deep learning models, especially convolutional neural networks (CNNs), have shown considerable promise for biomedical signals such as EEG-based seizure detection. However, these models come with challenges, primarily due to their size and compute requirements in environments where real-time detection is required or resources are limited. In this study, we present a lightweight one-dimensional CNN model with structured pruning to improve efficiency and reliability. The model was trained with mild early stopping to address possible overfitting, achieving an accuracy of 92.78% and a macro-F1 score of 0.8686. Structured pruning of the baseline CNN involved removing 50% of the convolutional kernels based on their importance to model predictions. Surprisingly, after pruning the weights and memory by 50%, the new network was still able to maintain predictive capabilities, while modestly increasing precision to 92.87% and improving the macro-F1 score to 0.8707. Overall, we present a convincing case that structured pruning removes redundancy, improves generalization, and, in combination with mild early stopping, offers a promising way forward to improve seizure detection efficiency and reliability, which is especially relevant for resource-limited settings.
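Structured pruning of convolutional kernels can be sketched as follows. The summary does not state how kernel importance was measured, so this example substitutes the L1 norm of each filter's weights, a common proxy; that criterion, the function names, and the toy weights are assumptions and not the authors' method.

```python
from typing import List, Tuple

# One 1D-conv filter: weights indexed as [in_channel][kernel_position]
Filter = List[List[float]]


def l1_norm(conv_filter: Filter) -> float:
    """Assumed importance score: sum of absolute weights in the filter."""
    return sum(abs(w) for channel in conv_filter for w in channel)


def prune_filters(filters: List[Filter], keep_ratio: float = 0.5) -> Tuple[List[Filter], List[int]]:
    """Keep the highest-scoring filters; return them with their original indices."""
    n_keep = max(1, round(len(filters) * keep_ratio))
    ranked = sorted(range(len(filters)), key=lambda i: l1_norm(filters[i]), reverse=True)
    keep = sorted(ranked[:n_keep])  # preserve the original filter order
    return [filters[i] for i in keep], keep


if __name__ == "__main__":
    layer = [
        [[0.1, -0.1]],   # L1 norm 0.2
        [[2.0, 1.0]],    # L1 norm 3.0
        [[0.0, 0.05]],   # L1 norm 0.05
        [[-3.0, 0.5]],   # L1 norm 3.5
    ]
    pruned, kept = prune_filters(layer, keep_ratio=0.5)
    print(kept, pruned)
```

In a full network, removing output filters from one layer also requires removing the matching input channels of the next layer, which is why structured pruning reduces both weights and memory.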
--------------------------------------------------------------------------------------------------------
Exploring Situated Stabilities of a Rhythm Generation System through Variational Cross-Examination
This paper uses a postphenomenological framework to analyze GrooveTransformer, a real-time rhythm generation system, and its unexpected versatility. Through Variational Cross-Examination (VCE), the authors show how the system adapted to three distinct artistic contexts: a drum accompanist, a rhythmic voltage sequencer, and a rhythm driver for a harmonic system. The research identifies key factors contributing to this "multistability," including system invariants and interdisciplinary collaboration. This work serves as a valuable case study for the design of digital musical instruments (DMIs), providing a method to understand how technologies mediate and are shaped by their users and contexts, with applications in creative AI and human-computer interaction.
Authors: Błażej Kotowski, Nicholas Evans, Behzad Haki, Frederic Font, Sergi Jordà
Link: https://arxiv.org/abs/2509.05145v1
Date: 2025-09
Summary:
This paper investigates GrooveTransformer, a real-time rhythm generation system, through the postphenomenological framework of Variational Cross-Examination (VCE). By reflecting on its deployment across three distinct artistic contexts, we identify three stabilities: an autonomous drum accompaniment generator, a rhythmic control voltage sequencer in Eurorack format, and a rhythm driver for a harmonic accompaniment system. The versatility of its applications was not an explicit goal from the outset of the project. Thus, we ask: how did this multistability emerge? Through VCE, we identify three key contributors to its emergence: the affordances of system invariants, the interdisciplinary collaboration, and the situated nature of its development. We conclude by reflecting on the viability of VCE as a descriptive and analytical method for Digital Musical Instrument (DMI) design, emphasizing its value in uncovering how technologies mediate, co-shape, and are co-shaped by users and contexts.
--------------------------------------------------------------------------------------------------------
Evaluation and Comparison Semantics for ODRL
This paper addresses the need for a comprehensive formal semantics for the Open Digital Rights Language (ODRL), which is a key standard for governing the use of digital resources. The authors propose a simple, intuitive formal semantics based on query answering that is aligned with the latest language specification. Building on this, they define and study the problem of comparing two policies to detect if one is more restrictive or permissive. This research is crucial for ensuring the accurate and consistent implementation of digital rights policies in data sharing scenarios, with applications in digital content management, intellectual property enforcement, and privacy-preserving data exchange.
Authors: Jaime Osvaldo Salas, Paolo Pareti, Semih Yumuşak, Soulmaz Gheisari, Luis-Daniel Ibáñez, George Konstantinidis
Link: https://arxiv.org/abs/2509.05139v1
Date: 2025-09
Summary:
We consider the problem of evaluating and comparing computational policies in the Open Digital Rights Language (ODRL), which has become the de facto standard for governing the access and usage of digital resources. Although preliminary progress has been made on the formal specification of the language's features, a comprehensive formal semantics of ODRL is still missing. In this paper, we provide a simple and intuitive formal semantics for ODRL that is based on query answering. Our semantics refines previous formalisations, and is aligned with the latest published specification of the language (2.2). Building on our evaluation semantics, and motivated by data sharing scenarios, we also define and study the problem of comparing two policies, detecting equivalent, more restrictive or more permissive policies.
--------------------------------------------------------------------------------------------------------
GenAI-based test case generation and execution in SDV platform
This paper introduces an approach that uses generative AI, including Large Language Models and Vision-Language Models, to automate the generation of test cases for software-defined vehicles (SDVs). The methodology translates natural language requirements and system diagrams into structured test cases, and leverages the Vehicle Signal Specification to ensure compatibility across different automotive subsystems. The system, demonstrated on a Child Presence Detection use case, significantly reduces the manual effort required for test specification. Despite some need for human intervention, this framework showcases a promising path toward a more efficient and automated testing pipeline for the complex software that powers modern vehicles.
Authors: Denesa Zyberaj, Lukasz Mazur, Nenad Petrovic, Pankhuri Verma, Pascal Hirmer, Dirk Slama, Xiangwei Cheng, Alois Knoll
Link: https://arxiv.org/abs/2509.05112v1
Date: 2025-09-d
Summary:
This paper introduces a GenAI-driven approach for automated test case generation, leveraging Large Language Models and Vision-Language Models to translate natural language requirements and system diagrams into structured Gherkin test cases. The methodology integrates Vehicle Signal Specification modeling to standardize vehicle signal definitions, improve compatibility across automotive subsystems, and streamline integration with third-party testing tools. Generated test cases are executed within the digital.auto playground, an open and vendor-neutral environment designed to facilitate rapid validation of software-defined vehicle functionalities. We evaluate our approach using the Child Presence Detection System use case, demonstrating substantial reductions in manual test specification effort and rapid execution of generated tests. Despite significant automation, the generation of test cases and test scripts still requires manual intervention due to current limitations in the GenAI pipeline and constraints of the digital.auto platform.
--------------------------------------------------------------------------------------------------------
ICR: Iterative Clarification and Rewriting for Conversational Search
This paper introduces ICR (Iterative Clarification and Rewriting), a novel framework for conversational query rewriting that addresses the challenge of handling multiple vague expressions. Unlike traditional end-to-end approaches, ICR operates iteratively, with the model alternating between generating clarification questions for the user and rewriting the query. This method allows the system to progressively refine the query by resolving one ambiguity at a time, leading to more accurate search results. The approach has applications in developing more effective conversational search systems and virtual assistants that can better understand and respond to complex or ambiguous user requests.
Authors: Zhiyu Cao, Peifeng Li, Qiaoming Zhu
Link: https://arxiv.org/abs/2509.05100v1
Date: 2025-09-d
Summary:
Most previous work on Conversational Query Rewriting employs an end-to-end rewriting paradigm. However, this approach is hindered by the issue of multiple fuzzy expressions within the query, which complicates the simultaneous identification and rewriting of multiple positions. To address this issue, we propose a novel framework ICR (Iterative Clarification and Rewriting), an iterative rewriting scheme that pivots on clarification questions. Within this framework, the model alternates between generating clarification questions and rewritten queries. The experimental results show that our ICR can continuously improve retrieval performance in the clarification-rewriting iterative process, thereby achieving state-of-the-art performance on two popular datasets.
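The alternation between clarification questions and rewritten queries can be pictured as a simple loop. The sketch below is an illustration only, not the authors' implementation: the three callbacks stand in for model calls, who answers the clarification question is left open, and the toy lookup table `REFERENTS` is a hypothetical substitute for real ambiguity detection.

```python
from typing import Callable, List, Optional, Tuple

def iterative_clarify_and_rewrite(
    query: str,
    ask: Callable[[str], Optional[str]],
    answer: Callable[[str], str],
    rewrite: Callable[[str, str, str], str],
    max_rounds: int = 5,
) -> Tuple[str, List[Tuple[str, str, str]]]:
    """Resolve one ambiguity per round until none is left or the budget runs out."""
    history = []
    for _ in range(max_rounds):
        question = ask(query)            # clarification step
        if question is None:             # nothing vague remains
            break
        reply = answer(question)
        query = rewrite(query, question, reply)  # rewriting step
        history.append((question, reply, query))
    return query, history

# Toy stand-ins for the model: a vague token plays the role of the question.
REFERENTS = {"it": "the Eiffel Tower", "there": "to Paris"}

def toy_ask(query: str) -> Optional[str]:
    for token in query.split():
        if token in REFERENTS:
            return token
    return None

def toy_answer(question: str) -> str:
    return REFERENTS[question]

def toy_rewrite(query: str, question: str, reply: str) -> str:
    return " ".join(reply if t == question else t for t in query.split())

final_query, history = iterative_clarify_and_rewrite(
    "when was it built and how do I get there", toy_ask, toy_answer, toy_rewrite
)
print(final_query)
```

Each round touches a single vague expression, which is the property the summary contrasts with end-to-end rewriting of several positions at once.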
--------------------------------------------------------------------------------------------------------
ProToM: Promoting Prosocial Behaviour via Theory of Mind-Informed Feedback
ProToM is an AI system designed to promote prosocial behavior—actions that benefit others—in multi-agent environments. Using Bayesian inverse planning, ProToM infers the goals of individual agents and provides targeted, context-sensitive feedback to encourage cooperation. The system is evaluated in two multi-agent environments, where it outperforms state-of-the-art baselines by achieving higher success rates and shorter task completion times. By leveraging a Theory of Mind framework to reason about other agents' intentions, ProToM offers a blueprint for creating more collaborative and effective AI agents, with applications in coordinating autonomous systems in complex social or industrial settings.
Authors: Matteo Bortoletto, Yichao Zhou, Lance Ying, Tianmin Shu, Andreas Bulling
Link: https://arxiv.org/abs/2509.05091v1
Date: 2025-09-d
Summary:
While humans are inherently social creatures, the challenge of identifying when and how to assist and collaborate with others - particularly when pursuing independent goals - can hinder cooperation. To address this challenge, we aim to develop an AI system that provides useful feedback to promote prosocial behaviour - actions that benefit others, even when not directly aligned with one's own goals. We introduce ProToM, a Theory of Mind-informed facilitator that promotes prosocial actions in multi-agent systems by providing targeted, context-sensitive feedback to individual agents. ProToM first infers agents' goals using Bayesian inverse planning, then selects feedback to communicate by maximising expected utility, conditioned on the inferred goal distribution. We evaluate our approach against baselines in two multi-agent environments: Doors, Keys, and Gems, as well as Overcooked. Our results suggest that state-of-the-art large language and reasoning models fall short of communicating feedback that is both contextually grounded and well-timed - leading to higher communication overhead and task speedup. In contrast, ProToM provides targeted and helpful feedback, achieves a higher success rate and shorter task completion times, and is consistently preferred by human users.
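The two steps named in the summary, goal inference followed by feedback selection under expected utility, can be sketched as follows. This is a toy illustration under stated assumptions, not ProToM itself: the Boltzmann-rational action likelihood, the goal names, the action costs, the feedback messages, and the utility table are all hypothetical.

```python
import math

def goal_posterior(prior, action_costs, observed_actions, beta=1.0):
    """Posterior over goals given observed actions.

    Assumption: the agent is Boltzmann-rational, so under goal g it picks
    action a with probability proportional to exp(-beta * cost[g][a]).
    """
    log_post = {}
    for goal, p in prior.items():
        costs = action_costs[goal]
        log_z = math.log(sum(math.exp(-beta * c) for c in costs.values()))
        log_lik = sum(-beta * costs[a] - log_z for a in observed_actions)
        log_post[goal] = math.log(p) + log_lik
    # Normalize in log space for numerical stability.
    m = max(log_post.values())
    unnorm = {g: math.exp(lp - m) for g, lp in log_post.items()}
    total = sum(unnorm.values())
    return {g: v / total for g, v in unnorm.items()}

def select_feedback(posterior, utility):
    """Pick the feedback message with the highest expected utility."""
    def expected(feedback):
        return sum(posterior[g] * utility[feedback][g] for g in posterior)
    return max(utility, key=expected)

# Hypothetical two-goal example.
prior = {"red_gem": 0.5, "blue_gem": 0.5}
action_costs = {
    "red_gem": {"left": 0.0, "right": 2.0},
    "blue_gem": {"left": 2.0, "right": 0.0},
}
utility = {
    "open_red_door": {"red_gem": 1.0, "blue_gem": -0.5},
    "open_blue_door": {"red_gem": -0.5, "blue_gem": 1.0},
    "say_nothing": {"red_gem": 0.0, "blue_gem": 0.0},
}

posterior = goal_posterior(prior, action_costs, ["left", "left"])
choice = select_feedback(posterior, utility)
print(posterior, choice)
```

Including a "say nothing" option with zero utility means feedback is only sent when the inferred goal distribution is confident enough to make a message worthwhile, which is one simple way to limit communication overhead.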
--------------------------------------------------------------------------------------------------------
Finding your MUSE: Mining Unexpected Solutions Engine
This paper introduces a methodology for constructing Functional Concept Graphs (FCGs) to help innovators overcome cognitive fixation and discover novel solutions. FCGs are interconnected representations of functional elements, with explicit abstraction relations, that support abstraction, problem reframing, and analogical inspiration. The authors present MUSE, an algorithm that leverages these graphs to generate creative inspirations for a given problem. They also release an FCG computed from 500K patents as a resource for further research. The work is relevant to innovation, design, and problem-solving, where it offers a structured way to explore the solution space beyond the ideas an innovator is already fixated on.
Authors: Nir Sweed, Hanit Hakim, Ben Wolfson, Hila Lifshitz, Dafna Shahaf
Link: https://arxiv.org/abs/2509.05072v1
Date: 2025-09-d
Summary:
Innovators often exhibit cognitive fixation on existing solutions or nascent ideas, hindering the exploration of novel alternatives. This paper introduces a methodology for constructing Functional Concept Graphs (FCGs), interconnected representations of functional elements that support abstraction, problem reframing, and analogical inspiration. Our approach yields large-scale, high-quality FCGs with explicit abstraction relations, overcoming limitations of prior work. We further present MUSE, an algorithm leveraging FCGs to generate creative inspirations for a given problem. We demonstrate our method by computing an FCG on 500K patents, which we release for further research.
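One way to picture how abstraction relations can support analogical inspiration is to climb from a problem to a more abstract function and then look at other functions under the same abstraction. The sketch below is a hypothetical illustration of that idea only; it is not the MUSE algorithm, and the graph contents and the `inspirations` helper are invented for the example.

```python
# Toy abstraction edges: specific function -> more abstract function.
ABSTRACTS_TO = {
    "remove ice from windshield": "remove deposit from surface",
    "remove barnacles from hull": "remove deposit from surface",
    "remove plaque from teeth": "remove deposit from surface",
}

# Toy known solutions attached to specific functions.
SOLUTIONS = {
    "remove ice from windshield": ["heated glass"],
    "remove barnacles from hull": ["ultrasonic vibration"],
    "remove plaque from teeth": ["abrasive paste"],
}

def inspirations(problem):
    """Return (analogous function, solution) pairs found via a shared abstraction."""
    parent = ABSTRACTS_TO.get(problem)
    if parent is None:
        return []
    siblings = sorted(
        f for f, p in ABSTRACTS_TO.items() if p == parent and f != problem
    )
    return [(s, sol) for s in siblings for sol in SOLUTIONS.get(s, [])]

ideas = inspirations("remove ice from windshield")
for function, solution in ideas:
    print(f"{function}: {solution}")
```

The solutions returned come from other domains by construction, which is the sense in which abstraction helps step away from the solutions already associated with the original problem.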
--------------------------------------------------------------------------------------------------------
Systematic Review and Meta-analysis of AI-driven MRI Motion Artifact Detection and Correction
This paper provides a systematic review and meta-analysis of AI-driven methods for detecting and correcting motion artifacts in MRI. It highlights the potential of deep learning, particularly generative models, to reduce artifacts and improve image quality. It also identifies key challenges: limited generalizability, reliance on paired training data, and the risk of visual distortions. The review concludes by calling for comprehensive public datasets, standardized reporting protocols for artifact levels, and more adaptable deep learning techniques. The findings are relevant to clinical practice, where such methods could improve diagnostic accuracy, reduce healthcare costs, and improve patient care outcomes.
Authors: Mojtaba Safari, Zach Eidex, Richard L. J. Qiu, Matthew Goette, Tonghe Wang, Xiaofeng Yang
Link: https://arxiv.org/abs/2509.05071v1
Date: 2025-09-d
Summary:
Background: To systematically review and perform a meta-analysis of artificial intelligence (AI)-driven methods for detecting and correcting magnetic resonance imaging (MRI) motion artifacts, assessing current developments, effectiveness, challenges, and future research directions. Methods: A comprehensive systematic review and meta-analysis were conducted, focusing on deep learning (DL) approaches, particularly generative models, for the detection and correction of MRI motion artifacts. Quantitative data were extracted regarding utilized datasets, DL architectures, and performance metrics. Results: DL, particularly generative models, show promise for reducing motion artifacts and improving image quality; however, limited generalizability, reliance on paired training data, and risk of visual distortions remain key challenges that motivate standardized datasets and reporting. Conclusions: AI-driven methods, particularly DL generative models, show significant potential for improving MRI image quality by effectively addressing motion artifacts. However, critical challenges must be addressed, including the need for comprehensive public datasets, standardized reporting protocols for artifact levels, and more advanced, adaptable DL techniques to reduce reliance on extensive paired datasets. Addressing these aspects could substantially enhance MRI diagnostic accuracy, reduce healthcare costs, and improve patient care outcomes.
--------------------------------------------------------------------------------------------------------
ToM-SSI: Evaluating Theory of Mind in Situated Social Interactions
This paper introduces a new benchmark, ToM-SSI, to evaluate the Theory of Mind (ToM) capabilities of foundation models in rich, situated social interactions. Unlike existing benchmarks that rely on text-only or dyadic interactions, ToM-SSI is multimodal and includes group interactions of up to four agents that communicate and move in situated environments. This design allows for the study of mixed cooperative-obstructive settings and parallel reasoning about multiple agents' mental states. The evaluation reveals that current models' performance is still severely limited, especially on these new tasks, highlighting a significant gap in their ability to reason about social dynamics. The benchmark is intended to support future research on AI with a more sophisticated understanding of social cognition.
Authors: Matteo Bortoletto, Constantin Ruhdorfer, Andreas Bulling
Link: https://arxiv.org/abs/2509.05066v1
Date: 2025-09-d
Summary:
Most existing Theory of Mind (ToM) benchmarks for foundation models rely on variations of the Sally-Anne test, offering only a very limited perspective on ToM and neglecting the complexity of human social interactions. To address this gap, we propose ToM-SSI: a new benchmark specifically designed to test ToM capabilities in environments rich with social interactions and spatial dynamics. While current ToM benchmarks are limited to text-only or dyadic interactions, ToM-SSI is multimodal and includes group interactions of up to four agents that communicate and move in situated environments. This unique design allows us to study, for the first time, mixed cooperative-obstructive settings and reasoning about multiple agents' mental state in parallel, thus capturing a wider range of social cognition than existing benchmarks. Our evaluations reveal that the current models' performance is still severely limited, especially in these new tasks, highlighting critical gaps for future research.
--------------------------------------------------------------------------------------------------------