Week Ending 8.31.2025

 

RESEARCH WATCH: 8.31.2025

 

The Demon is in Ambiguity: Revisiting Situation Recognition with Single Positive Multi-Label Learning

Computer vision systems must understand complex real-world scenes, but current approaches oversimplify verb classification by assuming only one action occurs per image. This research addresses the fundamental challenge that multiple events can simultaneously describe the same visual scene, creating inherent ambiguity in interpretation. The proposed single positive multi-label learning framework and Graph Enhanced Verb Multilayer Perceptron offer significant improvements for autonomous systems, robotics, and surveillance applications where nuanced scene understanding is critical. This work has practical implications for self-driving cars interpreting traffic scenarios, security systems analyzing complex interactions, and assistive technologies helping visually impaired users understand their environment with greater accuracy and contextual depth.

Authors:  Yiming Lin, Yuchen Niu, Shang Wang, Kaizhu Huang, Qiufeng Wang, Xiao-Bo Jin

Link:  https://arxiv.org/abs/2508.21816v1

Date: 2025-08-d

Summary:

Situation recognition (SR) is a fundamental task in computer vision that aims to extract structured semantic summaries from images by identifying key events and their associated entities. Specifically, given an input image, the model must first classify the main visual events (verb classification), then identify the participating entities and their semantic roles (semantic role labeling), and finally localize these entities in the image (semantic role localization). Existing methods treat verb classification as a single-label problem, but we show through a comprehensive analysis that this formulation fails to address the inherent ambiguity in visual event recognition, as multiple verb categories may reasonably describe the same image. This paper makes three key contributions: First, we reveal through empirical analysis that verb classification is inherently a multi-label problem due to the ubiquitous semantic overlap between verb categories. Second, given the impracticality of fully annotating large-scale datasets with multiple labels, we propose to reformulate verb classification as a single positive multi-label learning (SPMLL) problem - a novel perspective in SR research. Third, we design a comprehensive multi-label evaluation benchmark for SR, carefully constructed to fairly evaluate model performance in a multi-label setting. To address the challenges of SPMLL, we further develop the Graph Enhanced Verb Multilayer Perceptron (GE-VerbMLP), which combines graph neural networks to capture label correlations and adversarial training to optimize decision boundaries. Extensive experiments on real-world datasets show that our approach achieves more than 3% MAP improvement while remaining competitive on traditional top-1 and top-5 accuracy metrics.
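
For readers who want to experiment, a minimal sketch of the single-positive setting follows. It shows the common "assumed negative" baseline loss that SPMLL methods improve on, not the paper's GE-VerbMLP; the verb-vocabulary size and the down-weighting factor are illustrative assumptions.

import torch
import torch.nn.functional as F

def spmll_assumed_negative_loss(logits, positive_idx):
    """Baseline SPMLL loss: treat the single observed verb as positive
    and (naively) all unobserved verbs as negative.

    logits: (batch, num_verbs) raw scores
    positive_idx: (batch,) index of the one annotated verb per image
    """
    rows = torch.arange(logits.size(0))
    targets = torch.zeros_like(logits)
    targets[rows, positive_idx] = 1.0
    # Down-weight the assumed negatives, since some are actually
    # unannotated positives (the ambiguity the paper targets).
    weights = torch.full_like(logits, 0.1)
    weights[rows, positive_idx] = 1.0
    return F.binary_cross_entropy_with_logits(logits, targets, weight=weights)

# Example: 4 images, an imSitu-sized verb vocabulary of 504 classes
logits = torch.randn(4, 504)
pos = torch.tensor([3, 17, 255, 42])
print(spmll_assumed_negative_loss(logits, pos))

GE-VerbMLP additionally propagates label correlations with a graph network and shapes decision boundaries with adversarial training, which this sketch omits.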

--------------------------------------------------------------------------------------------------------

Achieving Hilbert-Schmidt Independence Under Rényi Differential Privacy for Fair and Private Data Generation

Healthcare and financial institutions face mounting pressure to share data responsibly while protecting individual privacy and preventing algorithmic bias. This research introduces FLIP, a transformer-based approach that generates synthetic tabular data satisfying both differential privacy constraints and fairness requirements without knowing specific downstream tasks. The framework addresses critical challenges in medical research, where patient data must remain confidential while enabling meaningful analysis across demographic groups. Applications span electronic health records, clinical trials, financial lending datasets, and government statistics. By ensuring statistical independence between protected attributes and generated data while maintaining utility, this work enables organizations to comply with GDPR, HIPAA, and AI fairness regulations while fostering innovation and research collaboration.

Authors:  Tobias Hyrup, Emmanouil Panagiotou, Arjun Roy, Arthur Zimek, Eirini Ntoutsi, Peter Schneider-Kamp

Link:  https://arxiv.org/abs/2508.21815v1

Date: 2025-08-d

Summary:

As privacy regulations such as the GDPR and HIPAA and responsibility frameworks for artificial intelligence such as the AI Act gain traction, the ethical and responsible use of real-world data faces increasing constraints. Synthetic data generation has emerged as a promising solution to risk-aware data sharing and model development, particularly for tabular datasets that are foundational to sensitive domains such as healthcare. To address both privacy and fairness concerns in this setting, we propose FLIP (Fair Latent Intervention under Privacy guarantees), a transformer-based variational autoencoder augmented with latent diffusion to generate heterogeneous tabular data. Unlike the typical setup in fairness-aware data generation, we assume a task-agnostic setup, not reliant on a fixed, defined downstream task, thus offering broader applicability. To ensure privacy, FLIP employs Rényi differential privacy (RDP) constraints during training and addresses fairness in the input space with RDP-compatible balanced sampling that accounts for group-specific noise levels across multiple sampling rates. In the latent space, we promote fairness by aligning neuron activation patterns across protected groups using Centered Kernel Alignment (CKA), a similarity measure extending the Hilbert-Schmidt Independence Criterion (HSIC). This alignment encourages statistical independence between latent representations and the protected feature. Empirical results demonstrate that FLIP provides significant fairness improvements, both task-agnostically and across diverse downstream tasks, under differential privacy constraints.
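
The CKA similarity at the heart of FLIP's latent alignment is easy to compute. Below is the standard linear-CKA formula in NumPy as a point of reference; FLIP's exact kernel choice and how the penalty enters training are not specified here, so treat this as a sketch.

import numpy as np

def linear_cka(X, Y):
    """Linear Centered Kernel Alignment between two activation matrices.

    X: (n_samples, d1) activations for one batch
    Y: (n_samples, d2) activations for a matched batch
    Returns a similarity in [0, 1]; 1 means identical up to rotation/scale.
    """
    X = X - X.mean(axis=0)   # center features
    Y = Y - Y.mean(axis=0)
    hsic_xy = np.linalg.norm(X.T @ Y, "fro") ** 2
    norm_x = np.linalg.norm(X.T @ X, "fro")
    norm_y = np.linalg.norm(Y.T @ Y, "fro")
    return hsic_xy / (norm_x * norm_y)

# Toy check: identical activations give CKA = 1
Z = np.random.randn(128, 32)
print(linear_cka(Z, Z))                          # ~1.0
print(linear_cka(Z, np.random.randn(128, 32)))   # much smaller

A fairness regularizer could then be 1 - linear_cka(...) between activations of matched batches from each protected group, pushing the latent code toward group-independence.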

--------------------------------------------------------------------------------------------------------

Automated Clinical Problem Detection from SOAP Notes using a Collaborative Multi-Agent LLM Architecture

Clinical documentation analysis remains a bottleneck in healthcare, with physicians spending countless hours interpreting complex patient notes. This research introduces a collaborative multi-agent system that mimics clinical consultation teams to identify medical problems from SOAP note sections. The approach addresses the critical need for robust clinical decision support tools that can handle high-stakes medical scenarios where single-model approaches may fail. Applications include emergency department triage, clinical coding automation, quality assurance systems, and physician training platforms. The system's ability to surface conflicting evidence through structured debate processes makes it valuable for second opinion systems, medical education, and reducing diagnostic errors. This technology could significantly improve healthcare efficiency while maintaining the collaborative reasoning that characterizes effective medical practice.

Authors:  Yeawon Lee, Xiaoyang Wang, Christopher C. Yang

Link:  https://arxiv.org/abs/2508.21803v1

Date: 2025-08-d

Summary:

Accurate interpretation of clinical narratives is critical for patient care, but the complexity of these notes makes automation challenging. While Large Language Models (LLMs) show promise, single-model approaches can lack the robustness required for high-stakes clinical tasks. We introduce a collaborative multi-agent system (MAS) that models a clinical consultation team to address this gap. The system is tasked with identifying clinical problems by analyzing only the Subjective (S) and Objective (O) sections of SOAP notes, simulating the diagnostic reasoning process of synthesizing raw data into an assessment. A Manager agent orchestrates a dynamically assigned team of specialist agents who engage in a hierarchical, iterative debate to reach a consensus. We evaluated our MAS against a single-agent baseline on a curated dataset of 420 MIMIC-III notes. The dynamic multi-agent configuration demonstrated consistently improved performance in identifying congestive heart failure, acute kidney injury, and sepsis. Qualitative analysis of the agent debates reveals that this structure effectively surfaces and weighs conflicting evidence, though it can occasionally be susceptible to groupthink. By modeling a clinical team's reasoning process, our system offers a promising path toward more accurate, robust, and interpretable clinical decision support tools.
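
The manager-orchestrated debate is straightforward to prototype. The sketch below shows the control flow only; call_llm is a hypothetical stand-in for any chat-completion client, and the prompts, team format, and consensus protocol are illustrative assumptions rather than the paper's.

# Hypothetical sketch of the manager/specialist debate pattern described
# above; not the authors' implementation.
def call_llm(system: str, prompt: str) -> str:
    raise NotImplementedError("plug in your LLM client here")

def identify_problems(subjective, objective, specialties, max_rounds=3):
    notes = f"Subjective:\n{subjective}\n\nObjective:\n{objective}"
    # Manager dynamically assembles the specialist team
    team = [s.strip() for s in call_llm(
        "You are a clinical team manager.",
        f"Pick specialists (comma-separated) from {specialties} for:\n{notes}"
    ).split(",")]
    opinions = {}
    for _ in range(max_rounds):  # hierarchical, iterative debate
        for spec in team:
            others = {s: o for s, o in opinions.items() if s != spec}
            opinions[spec] = call_llm(
                f"You are a {spec} specialist.",
                f"{notes}\n\nColleagues' current views: {others}\n"
                "List the clinical problems you identify, with evidence.")
        verdict = call_llm(
            "You are the clinical team manager.",
            f"Opinions: {opinions}\nIf consensus is reached, answer "
            "'CONSENSUS: <problem list>'; otherwise answer 'CONTINUE'.")
        if verdict.startswith("CONSENSUS"):
            return verdict
    return call_llm("You are the clinical team manager.",
                    f"Force a final problem list from: {opinions}")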

--------------------------------------------------------------------------------------------------------

Tree-Guided Diffusion Planner

Robotic control and autonomous systems require sophisticated planning capabilities to navigate complex, multi-objective environments. This research presents a zero-shot planning framework that addresses limitations of gradient-based approaches when dealing with non-convex objectives and multi-reward structures. The tree search methodology enables robots to balance exploration and exploitation without task-specific training, making it highly adaptable to diverse scenarios. Applications span autonomous navigation in warehouses, robotic manipulation in manufacturing, drone path planning in dynamic environments, and game AI development. The framework's ability to handle maze navigation, robotic arm control, and multi-goal exploration demonstrates versatility for real-world deployment. This technology could revolutionize how robots adapt to new environments and tasks, reducing the need for extensive retraining while improving performance in complex, uncertain conditions.

Authors:  Hyeonseong Jeon, Cheolhong Min, Jaesik Park

Link:  https://arxiv.org/abs/2508.21800v1

Date: 2025-08-d

Summary:

Planning with pretrained diffusion models has emerged as a promising approach for solving test-time guided control problems. However, standard gradient guidance typically performs optimally under convex and differentiable reward landscapes, showing substantially reduced effectiveness in real-world scenarios involving non-convex objectives, non-differentiable constraints, and multi-reward structures. Furthermore, recent supervised planning approaches require task-specific training or value estimators, which limits test-time flexibility and zero-shot generalization. We propose a Tree-guided Diffusion Planner (TDP), a zero-shot test-time planning framework that balances exploration and exploitation through structured trajectory generation. We frame test-time planning as a tree search problem using a bi-level sampling process: (1) diverse parent trajectories are produced via training-free particle guidance to encourage broad exploration, and (2) sub-trajectories are refined through fast conditional denoising guided by task objectives. TDP addresses the limitations of gradient guidance by exploring diverse trajectory regions and harnessing gradient information across this expanded solution space using only pretrained models and test-time reward signals. We evaluate TDP on three diverse tasks: maze gold-picking, robot arm block manipulation, and AntMaze multi-goal exploration. TDP consistently outperforms state-of-the-art approaches on all tasks. The project page can be found at: tree-diffusion-planner.github.io.
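
A structural sketch of the bi-level sampling loop is below. The diffusion sampler and conditional denoiser are replaced with toy stubs, so this only illustrates the explore-then-exploit tree shape under stated assumptions, not the authors' implementation.

import numpy as np

HORIZON, STATE_DIM = 64, 4  # toy sizes

def sample_parents(n):
    """Stub for particle-guided sampling from a pretrained diffusion
    model: returns n diverse parent trajectories (exploration)."""
    return [np.random.randn(HORIZON, STATE_DIM) for _ in range(n)]

def refine(traj, reward_fn, n_children):
    """Stub for fast conditional denoising around a parent trajectory
    (exploitation): perturb and keep the best reward-guided child."""
    children = [traj + 0.1 * np.random.randn(*traj.shape)
                for _ in range(n_children)]
    return max(children, key=reward_fn)

def tdp_plan(reward_fn, n_parents=8, n_children=16):
    parents = sample_parents(n_parents)          # level 1: explore broadly
    refined = [refine(p, reward_fn, n_children)  # level 2: exploit locally
               for p in parents]
    return max(refined, key=reward_fn)

# Toy multi-goal reward: end near either of two goals (non-convex)
goals = [np.ones(STATE_DIM), -np.ones(STATE_DIM)]
reward = lambda t: max(-np.linalg.norm(t[-1] - g) for g in goals)
print(tdp_plan(reward).shape)  # (64, 4)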

--------------------------------------------------------------------------------------------------------

DynaMark: A Reinforcement Learning Framework for Dynamic Watermarking in Industrial Machine Tool Controllers

Industry 4.0's interconnected manufacturing systems face increasing cybersecurity threats, particularly replay attacks that manipulate sensor data to compromise production processes. This research addresses critical vulnerabilities in machine tool controllers through adaptive watermarking that learns optimal defense strategies in real-time. The reinforcement learning approach dynamically adjusts security measures based on system behavior and threat detection, offering superior protection compared to static methods. Applications include smart factories, CNC machining centers, industrial IoT networks, and critical infrastructure protection. The framework's 70% reduction in watermark energy while maintaining detection performance demonstrates practical viability for resource-constrained manufacturing environments. This technology enables manufacturers to embrace connectivity benefits while protecting against sophisticated cyber attacks, ensuring production continuity and quality in increasingly networked industrial ecosystems.

Authors:  Navid Aftabi, Abhishek Hanchate, Satish Bukkapatnam, Dan Li

Link:  https://arxiv.org/abs/2508.21797v1

Date: 2025-08-d

Summary:

Industry 4.0's highly networked Machine Tool Controllers (MTCs) are prime targets for replay attacks that use outdated sensor data to manipulate actuators. Dynamic watermarking can reveal such tampering, but current schemes assume linear-Gaussian dynamics and use constant watermark statistics, making them vulnerable to the time-varying, partly proprietary behavior of MTCs. We close this gap with DynaMark, a reinforcement learning framework that models dynamic watermarking as a Markov decision process (MDP). It learns an adaptive policy online that adjusts the covariance of a zero-mean Gaussian watermark using available measurements and detector feedback, without needing system knowledge. DynaMark maximizes a reward function that dynamically balances control performance, energy consumption, and detection confidence. We develop a Bayesian belief updating mechanism for real-time detection confidence in linear systems. This approach, independent of specific system assumptions, underpins the MDP for systems with linear dynamics. On a Siemens Sinumerik 828D controller digital twin, DynaMark achieves a 70% reduction in watermark energy while preserving the nominal trajectory, compared to constant-variance baselines. It also maintains an average detection delay equivalent to one sampling interval. A physical stepper-motor testbed validates these findings, rapidly triggering alarms with less control-performance decline, exceeding existing benchmarks.
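
The trade-off the agent optimizes can be pictured as below. The weights and functional form are illustrative assumptions; only the three competing terms (control performance, watermark energy, detection confidence) come from the paper.

import numpy as np

def dynamark_style_reward(tracking_err, wm_cov, detect_conf,
                          alpha=1.0, beta=0.5, gamma=2.0):
    """Sketch of the kind of reward DynaMark is described as balancing.
    alpha/beta/gamma and the exact terms are assumptions."""
    control_term = -alpha * tracking_err    # stay on the nominal trajectory
    energy_term = -beta * np.trace(wm_cov)  # watermark energy cost
    detect_term = gamma * detect_conf       # replay-attack detectability
    return control_term + energy_term + detect_term

# One step: the policy picks a covariance, the watermark is drawn from it
cov = 0.01 * np.eye(2)
watermark = np.random.multivariate_normal(np.zeros(2), cov)
print("watermark sample:", watermark)
print("reward:", dynamark_style_reward(tracking_err=0.02,
                                       wm_cov=cov, detect_conf=0.9))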

--------------------------------------------------------------------------------------------------------

TMUAD: Enhancing Logical Capabilities in Unified Anomaly Detection Models with a Text Memory Bank

Industrial quality control and medical diagnostics require systems capable of detecting both structural defects and logical inconsistencies in complex data. Traditional anomaly detection methods excel at identifying visual irregularities but struggle with logical relationships between objects. This research introduces a three-memory framework combining textual descriptions with visual features to capture semantic anomalies that purely visual systems miss. Applications span manufacturing quality assurance, medical imaging diagnosis, security surveillance, and autonomous vehicle perception. The unified approach to structural and logical anomaly detection makes it valuable for pharmaceutical packaging inspection, medical scan analysis, and industrial process monitoring. This technology addresses the growing need for AI systems that understand not just what they see, but whether it makes logical sense in context.

Authors:  Jiawei Liu, Jiahe Hou, Wei Wang, Jinsong Du, Yang Cong, Huijie Fan

Link:  https://arxiv.org/abs/2508.21795v1

Date: 2025-08-d

Summary:

Anomaly detection, which aims to identify anomalies deviating from normal patterns, is challenging due to the limited amount of normal data available. Unlike most existing unified methods that rely on carefully designed image feature extractors and memory banks to capture logical relationships between objects, we introduce a text memory bank to enhance the detection of logical anomalies. Specifically, we propose a Three-Memory framework for Unified structural and logical Anomaly Detection (TMUAD). First, we build a class-level text memory bank for logical anomaly detection by the proposed logic-aware text extractor, which can capture rich logical descriptions of objects from input images. Second, we construct an object-level image memory bank that preserves complete object contours by extracting features from segmented objects. Third, we employ visual encoders to extract patch-level image features for constructing a patch-level memory bank for structural anomaly detection. These three complementary memory banks are used to retrieve and compare normal images that are most similar to the query image, compute anomaly scores at multiple levels, and fuse them into a final anomaly score. By unifying structural and logical anomaly detection through collaborative memory banks, TMUAD achieves state-of-the-art performance across seven publicly available datasets involving industrial and medical domains. The model and code are available at https://github.com/SIA-IDE/TMUAD.
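
The three-bank retrieve-compare-fuse step can be sketched in a few lines. The feature extractors, distance metric, and fusion weights here are assumptions; the sketch only conveys how three complementary scores combine into one final anomaly score.

import numpy as np

def bank_score(query_feat, memory_bank):
    """Anomaly score from one memory bank: distance to the nearest
    normal exemplar (smaller = more normal)."""
    dists = np.linalg.norm(memory_bank - query_feat, axis=1)
    return dists.min()

def tmuad_style_score(text_feat, object_feat, patch_feat, banks,
                      weights=(1.0, 1.0, 1.0)):
    """Illustrative fusion of class-level text, object-level, and
    patch-level scores; the real model's fusion rule may differ."""
    scores = [bank_score(text_feat, banks["text"]),
              bank_score(object_feat, banks["object"]),
              bank_score(patch_feat, banks["patch"])]
    return float(np.dot(weights, scores))

banks = {k: np.random.randn(100, 16) for k in ("text", "object", "patch")}
query = [np.random.randn(16) for _ in range(3)]
print(tmuad_style_score(*query, banks))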

--------------------------------------------------------------------------------------------------------

MoE-Health: A Mixture of Experts Framework for Robust Multimodal Healthcare Prediction

Healthcare institutions generate vast amounts of heterogeneous data across electronic health records, clinical notes, and medical images, yet effectively combining these modalities for clinical prediction remains challenging. This research addresses the critical problem of incomplete or varying data availability across patients and healthcare systems. The Mixture of Experts framework dynamically adapts to available data modalities, making it practical for real-world clinical deployment where complete datasets are rare. Applications include mortality prediction, readmission risk assessment, length of stay estimation, and personalized treatment planning. The framework's robustness to missing modalities makes it valuable for resource-limited healthcare settings, telemedicine platforms, and integrated health information systems. This technology enables more accurate clinical predictions while accommodating the data heterogeneity inherent in modern healthcare environments.

Authors:  Xiaoyang Wang, Christopher C. Yang

Link:  https://arxiv.org/abs/2508.21793v1

Date: 2025-08-d

Summary:

Healthcare systems generate diverse multimodal data, including Electronic Health Records (EHR), clinical notes, and medical images. Effectively leveraging this data for clinical prediction is challenging, particularly as real-world samples often present with varied or incomplete modalities. Existing approaches typically require complete modality data or rely on manual selection strategies, limiting their applicability in real-world clinical settings where data availability varies across patients and institutions. To address these limitations, we propose MoE-Health, a novel Mixture of Experts framework designed for robust multimodal fusion in healthcare prediction. The MoE-Health architecture is specifically designed to handle samples with differing modalities and improve performance on critical clinical tasks. By leveraging specialized expert networks and a dynamic gating mechanism, our approach dynamically selects and combines relevant experts based on available data modalities, enabling flexible adaptation to varying data availability scenarios. We evaluate MoE-Health on the MIMIC-IV dataset across three critical clinical prediction tasks: in-hospital mortality, prolonged length of stay, and hospital readmission. Experimental results demonstrate that MoE-Health achieves superior performance compared to existing multimodal fusion methods while maintaining robustness across different modality availability patterns. The framework effectively integrates multimodal information, offering improved predictive performance and robustness in handling heterogeneous and incomplete healthcare data, making it particularly suitable for deployment in diverse healthcare environments.
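
A minimal version of modality-aware expert gating is sketched below in PyTorch. The dimensions, task head, and gating-on-a-presence-mask design are illustrative assumptions; the point is that softmax weights are computed only over the modalities actually present.

import torch
import torch.nn as nn

class MoEHealthSketch(nn.Module):
    """Illustrative mixture-of-experts fusion that tolerates missing
    modalities by masking the gate; a sketch, not the paper's model."""
    def __init__(self, dims=None, hidden=32):
        super().__init__()
        dims = dims or {"ehr": 64, "note": 128, "image": 256}
        self.order = list(dims)
        self.experts = nn.ModuleDict(
            {m: nn.Sequential(nn.Linear(d, hidden), nn.ReLU())
             for m, d in dims.items()})
        self.gate = nn.Linear(len(dims), len(dims))  # gates on presence mask
        self.head = nn.Linear(hidden, 1)

    def forward(self, inputs):
        # inputs: dict mapping modality name -> tensor, or None if missing
        mask = torch.tensor([[float(inputs[m] is not None)
                              for m in self.order]])
        logits = self.gate(mask).masked_fill(mask == 0, float("-inf"))
        w = torch.softmax(logits, dim=-1)  # weights only over present inputs
        fused = sum(w[0, i] * self.experts[m](inputs[m])
                    for i, m in enumerate(self.order)
                    if inputs[m] is not None)
        return torch.sigmoid(self.head(fused))  # e.g. mortality risk

model = MoEHealthSketch()
sample = {"ehr": torch.randn(1, 64), "note": None, "image": torch.randn(1, 256)}
print(model(sample))  # prediction despite the missing clinical-note modality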

--------------------------------------------------------------------------------------------------------

Going over Fine Web with a Fine-Tooth Comb: Technical Report of Indexing Fine Web for Problematic Content Search and Retrieval

Large language models trained on web-scale datasets inherit biases and harmful content from their training data, yet analyzing these massive corpora for problematic material has been computationally prohibitive. This research provides the first comprehensive framework for real-time analysis of LLM training datasets, enabling rapid identification of harmful content across terabytes of text. The ElasticSearch-based pipeline achieves millisecond query performance on 1.5TB datasets, making large-scale content auditing feasible. Applications include AI safety research, content moderation systems, dataset curation for model training, and regulatory compliance tools. This technology enables researchers and organizations to build more responsible AI systems by understanding and mitigating harmful content in training data, supporting efforts to develop safer, more ethical language models for public deployment.

Authors:  Inés Altemir Marinas, Anastasiia Kucherenko, Andrei Kucharavy

Link:  https://arxiv.org/abs/2508.21788v1

Date: 2025-08-d

Summary:

Large language models (LLMs) rely heavily on web-scale datasets like Common Crawl, which provides over 80% of training data for some modern models. However, the indiscriminate nature of web crawling raises challenges in data quality, safety, and ethics. Despite the critical importance of training data quality, prior research on harmful content has been limited to small samples due to computational constraints. This project presents a framework for indexing and analyzing LLM training datasets using an ElasticSearch-based pipeline. We apply it to SwissAI's FineWeb-2 corpus (1.5TB, four languages), achieving fast query performance: most searches complete in milliseconds, and all under 2 seconds. Our work demonstrates real-time dataset analysis, offering practical tools for safer, more accountable AI systems.
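
The pipeline pattern is familiar if you have used the official elasticsearch-py client; a hedged sketch follows. The index name, document fields, and corpus reader are assumptions, not details from the report.

from elasticsearch import Elasticsearch, helpers

def corpus_iterator():
    # Stand-in for streaming documents out of FineWeb-2 shards
    yield ("example web page text ...", "en")

es = Elasticsearch("http://localhost:9200")

# Bulk-index the corpus once; this is the expensive offline step
actions = ({"_index": "fineweb2", "_source": {"text": t, "lang": lang}}
           for t, lang in corpus_iterator())
helpers.bulk(es, actions)

# Millisecond-scale keyword search once the index is built
resp = es.search(index="fineweb2",
                 query={"match": {"text": "term of interest"}},
                 size=10)
for hit in resp["hits"]["hits"]:
    print(hit["_score"], hit["_source"]["text"][:80])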

--------------------------------------------------------------------------------------------------------

PiCSAR: Probabilistic Confidence Selection And Ranking

Large language models and reasoning systems generate multiple candidate solutions, but selecting the best response without ground truth remains challenging. This research introduces a training-free method that evaluates response quality using joint log-likelihood of reasoning chains and final answers. The approach addresses a fundamental bottleneck in AI systems where generating good candidates is easier than identifying the best one. Applications span mathematical problem solving, code generation, scientific reasoning, and educational tutoring systems. The method's substantial improvements across diverse benchmarks make it valuable for enhancing AI assistants, automated grading systems, and decision support tools. This technology enables more reliable AI systems by improving their ability to self-evaluate and select high-quality outputs, reducing the need for human oversight in automated reasoning tasks.

Authors:  Joshua Ong Jun Leang, Zheng Zhao, Aryo Pradipta Gema, Sohee Yang, Wai-Chung Kwan, Xuanli He, Wenda Li, Pasquale Minervini, Eleonora Giunchiglia, Shay B. Cohen

Link:  https://arxiv.org/abs/2508.21787v1

Date: 2025-08-d

Summary:

Best-of-n sampling improves the accuracy of large language models (LLMs) and large reasoning models (LRMs) by generating multiple candidate solutions and selecting the one with the highest reward. The key challenge for reasoning tasks is designing a scoring function that can identify correct reasoning chains without access to ground-truth answers. We propose Probabilistic Confidence Selection And Ranking (PiCSAR): a simple, training-free method that scores each candidate generation using the joint log-likelihood of the reasoning and final answer. The joint log-likelihood naturally decomposes into reasoning confidence and answer confidence. PiCSAR achieves substantial gains across diverse benchmarks (+10.18 on MATH500, +9.81 on AIME2025), outperforming baselines while using at least 2x fewer samples in 16 out of 20 comparisons. Our analysis reveals that correct reasoning chains exhibit significantly higher reasoning and answer confidence, justifying the effectiveness of PiCSAR.
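
Because PiCSAR only needs token log-probabilities, it can be reproduced with any open causal LM. The sketch below uses GPT-2 via Hugging Face transformers purely as a stand-in scoring model; the prompt formatting and tokenizer-boundary handling are simplifying assumptions.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

def logprob(prefix, continuation):
    """log p(continuation | prefix) under the scoring model.
    Note: assumes the prefix/continuation token boundary is clean
    (BPE merges at the seam can shift it by one token)."""
    ids = tok(prefix + continuation, return_tensors="pt").input_ids
    n_prefix = tok(prefix, return_tensors="pt").input_ids.shape[1]
    with torch.no_grad():
        logits = model(ids).logits
    lp = torch.log_softmax(logits[:, :-1], dim=-1)
    token_lp = lp.gather(2, ids[:, 1:].unsqueeze(-1)).squeeze(-1)
    return token_lp[0, n_prefix - 1:].sum().item()

def picsar_score(question, reasoning, answer):
    # joint log-likelihood = reasoning confidence + answer confidence
    return logprob(question, reasoning) + logprob(question + reasoning, answer)

candidates = [("step-by-step chain A. ", "42"), ("shorter chain B. ", "41")]
best = max(candidates, key=lambda c: picsar_score("Q: toy problem\n", *c))
print(best)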

--------------------------------------------------------------------------------------------------------

Benchmarking GPT-5 in Radiation Oncology: Measurable Gains, but Persistent Need for Expert Oversight

Artificial intelligence in oncology holds promise for improving treatment planning and clinical decision-making, but rigorous evaluation in specialized medical domains remains limited. This research provides comprehensive benchmarking of GPT-5's capabilities in radiation oncology, revealing both significant improvements over previous models and persistent limitations requiring expert oversight. The study addresses critical needs for AI-assisted clinical decision support in cancer treatment, where precision is paramount. Applications include treatment planning assistance, medical education, clinical guidelines development, and second opinion systems. The work's identification of specific error patterns in complex scenarios provides valuable insights for safe AI integration in oncology practice. This research supports the responsible deployment of AI in cancer care while highlighting the continued importance of human expertise in clinical decision-making.

Authors:  Ugur Dinc, Jibak Sarkar, Philipp Schubert, Sabine Semrau, Thomas Weissmann, Andre Karius, Johann Brand, Bernd-Niklas Axer, Ahmed Gomaa, Pluvio Stephan, Ishita Sheth, Sogand Beirami, Annette Schwarz, Udo Gaipl, Benjamin Frey, Christoph Bert, Stefanie Corradini, Rainer Fietkau, Florian Putz

Link:  https://arxiv.org/abs/2508.21777v1

Date: 2025-08-d

Summary:

Introduction: Large language models (LLMs) have shown great potential in clinical decision support. GPT-5 is a novel LLM system that has been specifically marketed towards oncology use.

Methods: Performance was assessed using two complementary benchmarks: (i) the ACR Radiation Oncology In-Training Examination (TXIT, 2021), comprising 300 multiple-choice items, and (ii) a curated set of 60 authentic radiation oncology vignettes representing diverse disease sites and treatment indications. For the vignette evaluation, GPT-5 was instructed to generate concise therapeutic plans. Four board-certified radiation oncologists rated correctness, comprehensiveness, and hallucinations. Inter-rater reliability was quantified using Fleiss' kappa.

Results: On the TXIT benchmark, GPT-5 achieved a mean accuracy of 92.8%, outperforming GPT-4 (78.8%) and GPT-3.5 (62.1%). Domain-specific gains were most pronounced in Dose and Diagnosis. In the vignette evaluation, GPT-5's treatment recommendations were rated highly for correctness (mean 3.24/4, 95% CI: 3.11-3.38) and comprehensiveness (3.59/4, 95% CI: 3.49-3.69). Hallucinations were rare, with no case reaching majority consensus for their presence. Inter-rater agreement was low (Fleiss' kappa 0.083 for correctness), reflecting inherent variability in clinical judgment. Errors clustered in complex scenarios requiring precise trial knowledge or detailed clinical adaptation.

Discussion: GPT-5 clearly outperformed prior model variants on the radiation oncology multiple-choice benchmark. Although GPT-5 exhibited favorable performance in generating real-world radiation oncology treatment recommendations, correctness ratings indicate room for further improvement. While hallucinations were infrequent, the presence of substantive errors underscores that GPT-5-generated recommendations require rigorous expert oversight before clinical implementation.
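
For readers unfamiliar with the agreement statistic reported here, Fleiss' kappa is a one-liner with statsmodels; the rating matrix below is toy data, not the study's.

import numpy as np
from statsmodels.stats.inter_rater import aggregate_raters, fleiss_kappa

# 60 vignettes rated by 4 raters on a 1-4 correctness scale (toy data)
ratings = np.random.randint(1, 5, size=(60, 4))   # (subjects, raters)
table, _ = aggregate_raters(ratings)              # counts per category
print(fleiss_kappa(table))                        # near 0 for random raters

A kappa near zero, as in the study, means agreement is barely above what chance would produce, even if the mean ratings themselves are high.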

--------------------------------------------------------------------------------------------------------

Unsupervised Video Continual Learning via Non-Parametric Deep Embedded Clustering

Video understanding systems must continuously adapt to new content categories without forgetting previously learned information, but most approaches require labeled data and explicit task boundaries. This research addresses the challenging scenario of learning from streaming video data without supervision or task knowledge. The non-parametric clustering approach using Kernel Density Estimation enables systems to discover and adapt to new video categories automatically. Applications include video surveillance systems, content recommendation platforms, autonomous vehicle perception, and robotics applications where continuous adaptation to new environments is crucial. The framework's ability to learn from unlabeled video streams makes it valuable for real-world deployments where manual annotation is impractical. This technology enables more flexible video understanding systems that can continuously evolve and adapt to changing visual environments.

Authors:  Nattapong Kurpukdee, Adrian G. Bors

Link:  https://arxiv.org/abs/2508.21773v1

Date: 2025-08-d

Summary:

We propose a realistic scenario for unsupervised video learning in which neither task boundaries nor labels are provided when learning a succession of tasks. We also provide a non-parametric learning solution for the under-explored problem of unsupervised video continual learning. Videos represent complex and rich spatio-temporal information and are widely used in many applications, but they have not been sufficiently explored in unsupervised continual learning. Prior studies have focused only on supervised continual learning, relying on knowledge of labels and task boundaries, even though labeled data is costly and impractical to obtain. To address this gap, we study unsupervised video continual learning (uVCL). uVCL raises additional challenges due to the extra computational and memory requirements of processing videos compared to images. We introduce a general benchmark experimental protocol for uVCL by considering the learning of unstructured video data categories during each task. We propose to use Kernel Density Estimation (KDE) of deep embedded video features, extracted by unsupervised video transformer networks, as a non-parametric probabilistic representation of the data. We introduce a novelty detection criterion for incoming new task data that dynamically enables the expansion of memory clusters, aiming to capture new knowledge when learning a succession of tasks. We leverage transfer learning from previous tasks as an initial state for knowledge transfer to the current learning task. We found that the proposed methodology substantially enhances the performance of the model when successively learning many tasks. We perform in-depth evaluations on three standard video action recognition datasets, UCF101, HMDB51, and Something-Something V2, without using any labels or class boundaries.
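
The KDE-based novelty-and-expansion loop can be sketched with scikit-learn. The feature dimensionality, bandwidth, and novelty threshold below are assumptions; the mechanism (grow a new memory cluster when no existing KDE explains the incoming batch) follows the summary.

import numpy as np
from sklearn.neighbors import KernelDensity

class KDEMemory:
    """Grow one KDE 'cluster' per discovered category; no labels needed."""
    def __init__(self, bandwidth=0.5, novelty_threshold=-25.0):
        self.clusters, self.bw, self.thr = [], bandwidth, novelty_threshold

    def score(self, feats):
        # Highest mean log-density of the batch under any existing cluster
        if not self.clusters:
            return -np.inf
        return max(kde.score_samples(feats).mean() for kde in self.clusters)

    def observe(self, feats):
        # Novelty criterion: if nothing in memory explains the batch,
        # expand memory with a new cluster fitted to it
        if self.score(feats) < self.thr:
            self.clusters.append(KernelDensity(bandwidth=self.bw).fit(feats))

mem = KDEMemory()
for task in range(3):  # stream of unlabeled video-feature batches
    feats = np.random.randn(200, 16) + 5 * task  # shifting "categories"
    mem.observe(feats)
print(len(mem.clusters))  # 3 clusters grown without labels or boundaries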

--------------------------------------------------------------------------------------------------------

UItron: Foundational GUI Agent with Advanced Perception and Planning

Graphical user interface automation represents a crucial step toward artificial general intelligence, enabling AI systems to interact with software applications as humans do. This research addresses fundamental challenges in GUI automation through systematic data engineering and interactive infrastructure development. The framework's success with Chinese mobile applications fills a critical gap in cross-cultural AI capabilities. Applications span automated testing, accessibility tools for disabled users, robotic process automation, and AI assistants for complex software workflows. The curriculum reinforcement learning approach enables sophisticated reasoning and exploration in dynamic interface environments. This technology could revolutionize human-computer interaction by enabling AI systems to operate any software application, potentially transforming productivity tools, accessibility solutions, and automated customer service systems across diverse cultural and linguistic contexts.

Authors:  Zhixiong Zeng, Jing Huang, Liming Zheng, Wenkang Han, Yufeng Zhong, Lei Chen, Longrong Yang, Yingjie Chu, Yuzhi He, Lin Ma

Link:  https://arxiv.org/abs/2508.21767v1

Date: 2025-08-d

Summary:

GUI agents aim to enable automated operations on mobile/PC devices, an important step toward achieving artificial general intelligence. The rapid advancement of VLMs accelerates the development of GUI agents, owing to their powerful capabilities in visual understanding and task planning. However, building a GUI agent remains challenging due to the scarcity of operation trajectories, the limited availability of interactive infrastructure, and the limited initial capabilities of foundation models. In this work, we introduce UItron, an open-source foundational model for automatic GUI agents, featuring advanced GUI perception, grounding, and planning capabilities. UItron highlights the necessity of systematic data engineering and interactive infrastructure as foundational components for advancing GUI agent development. It not only systematically studies a series of data-engineering strategies to enhance training, but also establishes an interactive environment connecting both mobile and PC devices. In training, UItron adopts supervised finetuning over perception and planning tasks in various GUI scenarios, and then develops a curriculum reinforcement learning framework to enable complex reasoning and exploration in online environments. As a result, UItron achieves superior performance on benchmarks of GUI perception, grounding, and planning. In particular, UItron highlights interaction proficiency with top-tier Chinese mobile apps, as we identified a general lack of Chinese capability even in state-of-the-art solutions. To this end, we manually collected over one million steps of operation trajectories across the top 100 most popular apps, and built offline and online agent evaluation environments. Experimental results demonstrate that UItron achieves significant progress in Chinese app scenarios, propelling GUI agents one step closer to real-world application.

--------------------------------------------------------------------------------------------------------

Reasoning-Intensive Regression

Traditional language models excel at classification tasks but struggle with numerical reasoning from text, particularly in specialized domains with limited training data. This research identifies reasoning-intensive regression as a distinct challenge requiring deeper textual analysis to deduce numerical properties. The MENTAT framework combines prompt optimization with ensemble learning to address scenarios like rubric-based scoring and domain-specific retrieval. Applications include automated essay grading, financial document analysis, medical report scoring, and academic peer review systems. The approach's success in achieving substantial improvements over baselines makes it valuable for educational assessment platforms, regulatory compliance tools, and quality evaluation systems. This technology enables more sophisticated automated evaluation of textual content where numerical judgments require complex reasoning rather than simple pattern matching.

Authors:  Diane Tchuindjo, Omar Khattab

Link:  https://arxiv.org/abs/2508.21762v1

Date: 2025-08-d

Summary:

AI researchers and practitioners increasingly apply large language models (LLMs) to what we call reasoning-intensive regression (RiR), i.e. deducing subtle numerical properties from text. Unlike standard language regression tasks, e.g. for sentiment or similarity, RiR often appears instead in ad-hoc problems like rubric-based scoring or domain-specific retrieval, where much deeper analysis of text is required while only limited task-specific training data and computation are available. We cast three realistic problems as RiR tasks to establish an initial benchmark, and use that to test our hypothesis that prompting frozen LLMs and finetuning Transformer encoders via gradient descent will both often struggle in RiR. We then propose MENTAT, a simple and lightweight method that combines batch-reflective prompt optimization with neural ensemble learning. MENTAT achieves up to 65% improvement over both baselines, though substantial room remains for future advances in RiR.
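
As a rough illustration of the ensemble half of MENTAT, the sketch below trains several small regressors on LLM-derived features and averages their predictions. The feature extractor is stubbed out, and nothing here reproduces the batch-reflective prompt optimization itself.

import numpy as np
from sklearn.neural_network import MLPRegressor

def llm_features(texts):
    """Stand-in for features produced by an optimized prompt, e.g. the
    LLM's preliminary score plus rubric sub-scores per text."""
    rng = np.random.default_rng(0)
    return rng.normal(size=(len(texts), 8))

texts = [f"essay {i}" for i in range(100)]
y = np.random.rand(100)          # toy numeric targets (e.g. rubric scores)
X = llm_features(texts)

# Small neural ensemble: differently-seeded heads, averaged at inference
ensemble = [MLPRegressor(hidden_layer_sizes=(16,), max_iter=2000,
                         random_state=s).fit(X[:80], y[:80])
            for s in range(5)]
pred = np.mean([m.predict(X[80:]) for m in ensemble], axis=0)
print(pred[:3])  # ensemble-averaged numeric scores

Averaging several lightweight heads is a cheap way to stabilize regression when task-specific training data is scarce, which is the regime RiR targets.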

--------------------------------------------------------------------------------------------------------

Orientability of Causal Relations in Time Series using Summary Causal Graphs and Faithful Distributions

Understanding causal relationships in complex temporal systems is fundamental to making informed decisions across scientific and business domains. This research provides theoretical foundations for causal discovery when experts can provide high-level structural knowledge but detailed causal mechanisms remain unknown. The framework enables orientation of micro-level causal relationships using macro-level expert knowledge, even with cycles and bidirected edges. Applications span climate science, epidemiology, economics, and neuroscience where temporal causal relationships are crucial but difficult to establish empirically. The work's theoretical guarantees for edge orientation provide confidence for causal inference in complex systems. This technology supports evidence-based decision-making in scenarios where controlled experiments are impossible but observational time series data and expert knowledge are available, improving causal understanding in critical domains.

Authors:  Timothée Loranchet, Charles K. Assaad

Link:  https://arxiv.org/abs/2508.21742v1

Date: 2025-08-d

Summary:

Understanding causal relations between temporal variables is a central challenge in time series analysis, particularly when the full causal structure is unknown. Even when the full causal structure cannot be fully specified, experts often succeed in providing a high-level abstraction of the causal graph, known as a summary causal graph (SCG), which captures the main causal relations between different time series while abstracting away micro-level details. In this work, we present conditions that guarantee the orientability of micro-level edges between temporal variables given the background knowledge encoded in a summary causal graph and assuming access to a faithful and causally sufficient distribution with respect to the true unknown graph. Our results provide theoretical guarantees for edge orientation at the micro-level, even in the presence of cycles or bidirected edges at the macro-level. These findings offer practical guidance for leveraging SCGs to inform causal discovery in complex temporal systems and highlight the value of incorporating expert knowledge to improve causal inference from observational time series data.

--------------------------------------------------------------------------------------------------------

Neural Network Acceleration on MPSoC board: Integrating SLAC's SNL, Rogue Software and Auto-SNL

High-energy physics experiments generate data at unprecedented rates, requiring real-time machine learning inference to manage massive data streams effectively. This research addresses the critical challenge of deploying adaptive neural networks on FPGA hardware for real-time data reduction at particle accelerators. The SNL framework's dynamic weight updates without hardware resynthesis enables adaptive learning in experimental environments. Applications extend beyond physics to medical imaging, autonomous vehicles, and industrial automation requiring ultra-low latency inference. The Auto-SNL Python extension democratizes FPGA deployment for researchers without specialized hardware knowledge. This technology enables real-time AI inference in resource-constrained, high-throughput environments where traditional computing approaches fail, supporting scientific discovery and industrial applications requiring immediate decision-making from streaming sensor data.

Authors:  Hamza Ezzaoui Rahali, Abhilasha Dave, Larry Ruckman, Mohammad Mehdi Rahimifar, Audrey C. Therrien, James J. Russel, Ryan T. Herbst

Link:  https://arxiv.org/abs/2508.21739v1

Date: 2025-08-d

Summary:

The LCLS-II Free Electron Laser (FEL) will generate X-ray pulses for beamline experiments at rates of up to 1 MHz, with detectors producing data throughputs exceeding 1 TB/s. Managing such massive data streams presents significant challenges, as transmission and storage infrastructures become prohibitively expensive. Machine learning (ML) offers a promising solution for real-time data reduction, but conventional implementations introduce excessive latency, making them unsuitable for high-speed experimental environments. To address these challenges, SLAC developed the SLAC Neural Network Library (SNL), a specialized framework designed to deploy real-time ML inference models on Field-Programmable Gate Arrays (FPGAs). SNL's key feature is the ability to dynamically update model weights without requiring FPGA resynthesis, enhancing flexibility for adaptive learning applications. To further enhance usability and accessibility, we introduce Auto-SNL, a Python extension that streamlines the process of converting Python-based neural network models into SNL-compatible high-level synthesis code. This paper presents a benchmark comparison against hls4ml, the current state-of-the-art tool, across multiple neural network architectures, fixed-point precisions, and synthesis configurations targeting a Xilinx ZCU102 FPGA. The results show that SNL achieves competitive or superior latency in most tested architectures, while in some cases also offering FPGA resource savings. This demonstrates SNL's versatility, opening new opportunities for researchers and academics in fields such as high-energy physics, medical imaging, robotics, and many more.

--------------------------------------------------------------------------------------------------------

Developer Insights into Designing AI-Based Computer Perception Tools

AI-based computer perception tools are reshaping clinical practice by automatically analyzing behavioral and physiological data, but their successful integration depends on careful design balancing technical capabilities with user needs. This research reveals how developers navigate complex ethical and practical considerations when creating tools that influence clinical decision-making. The findings highlight the importance of explainability, workflow integration, appropriate customization, and responsible innovation in clinical AI systems. Applications span mental health assessment, neurological disorder detection, telemedicine platforms, and clinical research tools. The study's insights into developer perspectives provide valuable guidance for creating trustworthy, acceptable AI tools in healthcare settings. This research supports the responsible development of clinical AI by emphasizing developers' roles as ethical stewards and the need for interdisciplinary collaboration in creating tools that advance medical knowledge while maintaining user trust.

Authors:  Maya Guhan, Meghan E. Hurley, Eric A. Storch, John Herrington, Casey Zampella, Julia Parish-Morris, Gabriel Lázaro-Muñoz, Kristin Kostick-Quenet

Link:  https://arxiv.org/abs/2508.21733v1

Date: 2025-08-d

Summary:

Artificial intelligence (AI)-based computer perception (CP) technologies use mobile sensors to collect behavioral and physiological data for clinical decision-making. These tools can reshape how clinical knowledge is generated and interpreted. However, effective integration of these tools into clinical workflows depends on how developers balance clinical utility with user acceptability and trustworthiness. Our study presents findings from 20 in-depth interviews with developers of AI-based CP tools. Interviews were transcribed, and inductive thematic analysis was performed to identify four key design priorities: 1) account for context and ensure explainability for both patients and clinicians; 2) align tools with existing clinical workflows; 3) customize appropriately for relevant stakeholders to ensure usability and acceptability; and 4) push the boundaries of innovation while aligning with established paradigms. Our findings highlight that developers view themselves not merely as technical architects but also as ethical stewards, designing tools that are both acceptable to users and epistemically responsible (prioritizing objectivity and pushing clinical knowledge forward). We offer the following suggestions to help achieve this balance: documenting how design choices around customization are made, defining limits for customization choices, transparently conveying information about outputs, and investing in user training. Achieving these goals will require interdisciplinary collaboration between developers, clinicians, and ethicists.

--------------------------------------------------------------------------------------------------------

CAD2DMD-SET: Synthetic Generation Tool of Digital Measurement Device CAD Model Datasets for fine-tuning Large Vision-Language Models

Digital measurement device recognition in real-world conditions poses significant challenges for vision-language models, particularly in augmented reality and industrial applications where accuracy is critical. This research addresses the gap between laboratory performance and practical deployment through synthetic data generation using 3D CAD models and advanced rendering techniques. The tool enables creation of diverse, labeled datasets for training robust measurement reading capabilities. Applications span industrial automation, quality control, augmented reality maintenance systems, and accessibility tools for visually impaired users. The substantial performance improvements demonstrated across state-of-the-art models make this approach valuable for any application requiring precise reading of digital displays under challenging conditions. This technology enables more reliable AI systems for industrial and consumer applications where measurement accuracy directly impacts safety and performance.

Authors:  João Valente, Atabak Dehban, Rodrigo Ventura

Link:  https://arxiv.org/abs/2508.21732v1

Date: 2025-08-d

Summary:

Recent advancements in Large Vision-Language Models (LVLMs) have demonstrated impressive capabilities across various multimodal tasks. They continue, however, to struggle with seemingly trivial scenarios such as reading values from Digital Measurement Devices (DMDs), particularly in real-world conditions involving clutter, occlusions, extreme viewpoints, and motion blur, all common in head-mounted cameras and Augmented Reality (AR) applications. Motivated by these limitations, this work introduces CAD2DMD-SET, a synthetic data generation tool designed to support visual question answering (VQA) tasks involving DMDs. By leveraging 3D CAD models, advanced rendering, and high-fidelity image composition, our tool produces diverse, VQA-labelled synthetic DMD datasets suitable for fine-tuning LVLMs. Additionally, we present DMDBench, a curated validation set of 1,000 annotated real-world images designed to evaluate model performance under practical constraints. Benchmarking three state-of-the-art LVLMs using Average Normalised Levenshtein Similarity (ANLS) and fine-tuning LoRA adapters for these models on CAD2DMD-SET's generated dataset yielded substantial improvements, with InternVL showing a score increase of 200% without degrading on other tasks. This demonstrates that the CAD2DMD-SET training dataset substantially improves the robustness and performance of LVLMs under the previously stated challenging conditions. The CAD2DMD-SET tool is expected to be released as open source once the final version of this manuscript is prepared, allowing the community to add different measurement devices and generate their own datasets.
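
The ANLS metric used for benchmarking is simple to compute from scratch; below is the standard definition with the conventional 0.5 similarity cutoff. The example strings are invented DMD readings.

def levenshtein(a, b):
    """Classic dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,          # deletion
                           cur[j - 1] + 1,       # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

def anls(preds, golds, tau=0.5):
    """Average Normalised Levenshtein Similarity: per pair, score
    1 - NL if the normalised distance NL is below tau, else 0."""
    scores = []
    for p, g in zip(preds, golds):
        nl = levenshtein(p.lower(), g.lower()) / max(len(p), len(g), 1)
        scores.append(1 - nl if nl < tau else 0.0)
    return sum(scores) / len(scores)

print(anls(["12.47 V", "3.1 A"], ["12.47 V", "8.1 A"]))  # 0.9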

--------------------------------------------------------------------------------------------------------

Freeze and Conquer: Reusable Ansatz for Solving the Traveling Salesman Problem

Quantum computing approaches to optimization problems face scalability challenges and require extensive circuit optimization for each problem instance. This research introduces an efficient strategy for variational quantum algorithms that separates circuit structure optimization from parameter tuning, enabling rapid adaptation to new problem instances. The freeze-and-reuse approach makes quantum optimization more practical for near-term quantum hardware. Applications include logistics optimization, manufacturing scheduling, network routing, and resource allocation problems across various industries. The demonstrated success rates for moderate problem sizes and robust generalization ability make this approach valuable for quantum advantage demonstrations in optimization. This technology advances practical quantum computing by reducing the computational overhead of quantum optimization algorithms, bringing quantum solutions closer to real-world deployment in logistics and operations research.

Authors:  Fabrizio Fagiolo, Nicolo' Vescera

Link:  https://arxiv.org/abs/2508.21730v1

Date: 2025-08-d

Summary:

In this paper we present a variational algorithm for the Traveling Salesman Problem (TSP) that combines (i) a compact encoding of permutations, which reduces the qubit requirement, and (ii) an optimize-freeze-reuse strategy, in which the circuit topology ("Ansatz") is first optimized on a training instance by Simulated Annealing (SA), then "frozen" and reused on novel instances, limited to a rapid re-optimization of only the circuit parameters. This pipeline eliminates costly structural search at test time, making the procedure immediately implementable on NISQ hardware.

On a set of 40 randomly generated symmetric instances spanning 4-7 cities, the resulting Ansatz achieves an average optimal-tour sampling probability of 100% for 4-city cases, 90% for 5-city cases, and 80% for 6-city cases. With 7 cities the success rate drops markedly to an average of ~20%, revealing the onset of scalability limitations of the proposed method.

The results show robust generalization ability for moderate problem sizes and indicate how freezing the Ansatz can dramatically reduce time-to-solution without degrading solution quality. The paper also discusses scalability limitations, the impact of "warm-start" parameter initialization, and prospects for extension to more complex problems, such as Vehicle Routing and Job-Shop Scheduling.
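
The optimize-freeze-reuse idea is independent of any quantum SDK, so it can be illustrated with a generic annealer. In the sketch below, the cost function is an arbitrary stand-in for the TSP expectation value of the ansatz, and all hyperparameters are assumptions.

import numpy as np

rng = np.random.default_rng(1)

def energy(topology, params, instance):
    """Stub cost: in the paper this would be the TSP expectation value
    of the ansatz; here just a smooth function of the same inputs."""
    return np.sum(instance * np.cos(params + topology))

def anneal(instance, topology, params, optimize_topology,
           steps=2000, T=1.0):
    e = energy(topology, params, instance)
    for t in range(steps):
        cand_p = params + 0.1 * rng.normal(size=params.shape)
        cand_t = topology.copy()
        if optimize_topology and rng.random() < 0.3:
            cand_t[rng.integers(len(cand_t))] ^= 1  # flip one gate slot
        ce = energy(cand_t, cand_p, instance)
        temp = T * (1 - t / steps) + 1e-9           # linear cooling
        if ce < e or rng.random() < np.exp((e - ce) / temp):
            topology, params, e = cand_t, cand_p, ce
    return topology, params, e

train = rng.normal(size=8)
topo, theta, _ = anneal(train, rng.integers(0, 2, 8), rng.normal(size=8),
                        optimize_topology=True)   # slow: structure + params
new_instance = rng.normal(size=8)
_, theta2, e2 = anneal(new_instance, topo, theta,  # fast: frozen topology,
                       optimize_topology=False)    # re-tune params only
print(e2)

The second call is much cheaper because the discrete structural moves are disabled; that asymmetry is exactly what the paper exploits across TSP instances.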

--------------------------------------------------------------------------------------------------------

OptMark: Robust Multi-bit Diffusion Watermarking via Inference Time Optimization

Digital content protection and user tracking in AI-generated imagery face significant challenges as generative models become more sophisticated and accessible. This research addresses the critical need for robust watermarking that survives various attacks while enabling large-scale user identification. The multi-bit watermarking approach with strategic temporal embedding provides comprehensive protection against both image transformations and generative attacks. Applications include copyright protection for digital artists, content authenticity verification, user tracking for platform accountability, and preventing misuse of AI-generated content. The method's invisibility and robustness make it suitable for commercial deployment in content creation platforms, news media, and social networks. This technology supports responsible AI deployment by enabling content traceability and ownership verification in an era of increasingly sophisticated synthetic media generation.

Authors:  Jiazheng Xing, Hai Ci, Hongbin Xu, Hangjie Yuan, Yong Liu, Mike Zheng Shou

Link:  https://arxiv.org/abs/2508.21727v1

Date: 2025-08-d

Summary:

Watermarking diffusion-generated images is crucial for copyright protection and user tracking. However, current diffusion watermarking methods face significant limitations: zero-bit watermarking systems lack the capacity for large-scale user tracking, while multi-bit methods are highly sensitive to certain image transformations or generative attacks, resulting in a lack of comprehensive robustness. In this paper, we propose OptMark, an optimization-based approach that embeds a robust multi-bit watermark into the intermediate latents of the diffusion denoising process. OptMark strategically inserts a structural watermark early to resist generative attacks and a detail watermark late to withstand image transformations, with tailored regularization terms to preserve image quality and ensure imperceptibility. To address the challenge of memory consumption growing linearly with the number of denoising steps during optimization, OptMark incorporates adjoint gradient methods, reducing memory usage from O(N) to O(1). Experimental results demonstrate that OptMark achieves invisible multi-bit watermarking while ensuring robust resilience against valuemetric transformations, geometric transformations, editing, and regeneration attacks.

--------------------------------------------------------------------------------------------------------

PosterForest: Hierarchical Multi-Agent Collaboration for Scientific Poster Generation

Scientific communication requires effective visual presentation of complex research, but creating professional posters demands significant time and design expertise that many researchers lack. This research addresses the challenge of automated poster generation while preserving hierarchical document structure and semantic relationships between textual and visual elements. The multi-agent collaboration approach mirrors human design processes through specialized agents for content and layout planning. Applications include academic conferences, research presentations, educational materials, and scientific outreach programs. The framework's ability to optimize logical consistency, content fidelity, and visual coherence makes it valuable for researchers, educators, and science communicators. This technology democratizes professional scientific communication by enabling researchers to focus on content while ensuring effective visual presentation, potentially improving knowledge dissemination and research impact across academic disciplines.

Authors:  Jiho Choi, Seojeong Park, Seongjong Song, Hyunjung Shim

Link:  https://arxiv.org/abs/2508.21720v1

Date: 2025-08-d

Summary:

We present PosterForest, a novel training-free framework for automated scientific poster generation. Unlike prior approaches, which largely neglect the hierarchical structure of scientific documents and the semantic integration of textual and visual elements, our method addresses both challenges directly. We introduce the Poster Tree, a hierarchical intermediate representation that jointly encodes document structure and visual-textual relationships at multiple levels. Our framework employs a multi-agent collaboration strategy, where agents specializing in content summarization and layout planning iteratively coordinate and provide mutual feedback. This approach enables the joint optimization of logical consistency, content fidelity, and visual coherence. Extensive experiments on multiple academic domains show that our method outperforms existing baselines in both qualitative and quantitative evaluations. The resulting posters achieve quality closest to expert-designed ground truth and deliver superior information preservation, structural clarity, and user preference.

--------------------------------------------------------------------------------------------------------
