Week Ending 7.20.2025
RESEARCH WATCH: 7.20.2025
Lessons from the TREC Plain Language Adaptation of Biomedical Abstracts (PLABA) track
Medical literature remains largely inaccessible to patients and caregivers due to complex terminology and technical language. This research addresses the critical need for translating biomedical abstracts into plain language that general audiences can understand. The PLABA track represents a pioneering effort to systematically evaluate automated approaches for medical text simplification using language models. Applications include patient education materials, health literacy improvement, informed consent documents, and bridging the communication gap between healthcare providers and patients. The work's emphasis on rigorous evaluation methodologies and expert assessment establishes important benchmarks for future medical text adaptation systems, potentially revolutionizing how medical knowledge is disseminated to the public.
Authors: Brian Ondov, William Xia, Kush Attal, Ishita Unde, Jerry He, Hoa Dang, Ian Soboroff, Dina Demner-Fushman
Link: https://arxiv.org/abs/2507.14096v1
Date: 2025-07-d
Summary:
Objective: Recent advances in language models have shown potential to adapt professional-facing biomedical literature to plain language, making it accessible to patients and caregivers. However, their unpredictability, combined with the high potential for harm in this domain, means rigorous evaluation is necessary. Our goals with this track were to stimulate research and to provide high-quality evaluation of the most promising systems. Methods: We hosted the Plain Language Adaptation of Biomedical Abstracts (PLABA) track at the 2023 and 2024 Text Retrieval Conferences. Tasks included complete, sentence-level rewriting of abstracts (Task 1) as well as identifying and replacing difficult terms (Task 2). For automatic evaluation of Task 1, we developed a four-fold set of professionally-written references. Submissions for both Tasks 1 and 2 received extensive manual evaluation from biomedical experts. Results: Twelve teams spanning twelve countries participated in the track, with models from multilayer perceptrons to large pretrained transformers. In manual judgments of Task 1, top-performing models rivaled human levels of factual accuracy and completeness, but not simplicity or brevity. Automatic, reference-based metrics generally did not correlate well with manual judgments. In Task 2, systems struggled with identifying difficult terms and classifying how to replace them. When generating replacements, however, LLM-based systems did well in manually judged accuracy, completeness, and simplicity, though not in brevity. Conclusion: The PLABA track showed promise for using Large Language Models to adapt biomedical literature for the general public, while also highlighting their deficiencies and the need for improved automatic benchmarking tools.
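The poor correlation between reference-based metrics and expert judgments is easy to test once systems are scored both ways. A minimal sketch, assuming per-system automatic scores (e.g., SARI) and mean expert ratings have already been collected; all values below are illustrative, not the track's data:

    # Rank correlation between an automatic metric and expert judgments.
    # Scores are illustrative placeholders, not PLABA results.
    from scipy.stats import spearmanr

    sari_scores = [41.2, 38.7, 44.1, 36.5, 40.0]   # automatic, reference-based
    expert_simplicity = [3.1, 3.8, 2.9, 3.5, 3.2]  # mean manual rating, 1-5 scale

    rho, p = spearmanr(sari_scores, expert_simplicity)
    print(f"Spearman rho={rho:.2f} (p={p:.3f})")   # low |rho|: metric disagrees with experts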
--------------------------------------------------------------------------------------------------------
Seed-X: Building Strong Multilingual Translation LLM with 7B Parameters
Global communication increasingly demands high-quality translation across diverse language pairs, yet most translation systems either lack multilingual capabilities or require massive computational resources. Seed-X addresses this challenge by developing a compact 7-billion parameter model that rivals much larger systems like GPT-4o across 28 languages. The model's chain-of-thought reasoning approach and reinforcement learning enhancement represent significant advances in translation methodology. Applications span international business communication, diplomatic relations, educational content localization, cross-cultural research, and making information accessible across language barriers. By open-sourcing their approach, the researchers enable smaller organizations and developing nations to deploy sophisticated translation capabilities without requiring massive infrastructure investments, democratizing access to multilingual AI technology.
Authors: Shanbo Cheng, Yu Bao, Qian Cao, Luyang Huang, Liyan Kang, Zhicheng Liu, Yu Lu, Wenhao Zhu, Zhichao Huang, Tao Li, Sitong Liu, Ningxin Peng, Shuaijie She, Lu Xu, Nuo Xu, Sen Yang, Runsheng Yu, Yiming Yu, Liehao Zou, Hang Li, Lu Lu, Yuxuan Wang, Yonghui Wu
Link: https://arxiv.org/abs/2507.13618v1
Date: 2025-07-d
Summary:
Multilingual translation is a challenging task for large language models (LLMs), which must handle intricate language patterns and avoid the stilted output that often arises in automated translation. In this paper, we introduce Seed-X, a family of open-source LLMs comprising instruct and reasoning models that push the limits of translation capability at a 7B parameter size. The base model is pre-trained on a diverse, high-quality dataset encompassing both monolingual and bilingual content across 28 languages, harnessing the full potential of multilingual data. The instruct model is then finetuned to translate with Chain-of-Thought (CoT) reasoning and further enhanced through reinforcement learning (RL) to achieve better generalization across diverse language pairs. Seed-X achieves performance comparable to leading closed-source models, including Gemini-2.5 and GPT-4o, across 28 languages, and significantly outperforms larger open-source models in both automatic metrics and human evaluations. We share best practices from our optimization process and make the parameters publicly available for advancing translation research and applications.
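Because the weights are released, inference should follow the standard Hugging Face pattern; a sketch below, where the repository id and prompt wording are assumptions rather than documented Seed-X usage:

    # Hypothetical inference sketch for an open translation LLM. The repository
    # id and prompt wording are assumptions, not confirmed Seed-X conventions.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    model_id = "ByteDance-Seed/Seed-X-Instruct-7B"  # assumed Hugging Face repo id
    tok = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

    prompt = "Translate the following English sentence into German:\nThe weather is lovely today."
    inputs = tok(prompt, return_tensors="pt").to(model.device)
    out = model.generate(**inputs, max_new_tokens=64)
    print(tok.decode(out[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))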
--------------------------------------------------------------------------------------------------------
PHASE: Passive Human Activity Simulation Evaluation
Cybersecurity training environments like cyber ranges and honeypots require realistic human behavior to effectively prepare defenders for actual threats, yet no reliable method exists to verify behavioral authenticity. PHASE introduces a machine learning framework that can distinguish genuine human network activity from synthetic user personas with over 90% accuracy using only passive network monitoring. This breakthrough addresses a fundamental limitation in cybersecurity simulation fidelity. Applications include validating training scenarios for cybersecurity professionals, improving honeypot effectiveness for threat detection, enhancing sandbox analysis environments, and developing more realistic user behavior models for security research. The system's passive nature ensures it doesn't disrupt simulation environments while providing quantitative assessment of behavioral realism, enabling more effective cybersecurity training programs.
Authors: Steven Lamp, Jason D. Hiser, Anh Nguyen-Tuong, Jack W. Davidson
Link: https://arxiv.org/abs/2507.13505v1
Date: 2025-07-d
Summary:
Cybersecurity simulation environments, such as cyber ranges, honeypots, and sandboxes, require realistic human behavior to be effective, yet no quantitative method exists to assess the behavioral fidelity of synthetic user personas. This paper presents PHASE (Passive Human Activity Simulation Evaluation), a machine learning framework that analyzes Zeek connection logs and distinguishes human from non-human activity with over 90% accuracy. PHASE operates entirely passively, relying on standard network monitoring without any user-side instrumentation or visible signs of surveillance. All network activity used for machine learning is collected via a Zeek network appliance to avoid introducing unnecessary network traffic or artifacts that could disrupt the fidelity of the simulation environment. The paper also proposes a novel labeling approach that utilizes local DNS records to classify network traffic, thereby enabling machine learning analysis. Furthermore, we apply SHAP (SHapley Additive exPlanations) analysis to uncover temporal and behavioral signatures indicative of genuine human users. In a case study, we evaluate a synthetic user persona and identify distinct non-human patterns that undermine behavioral realism. Based on these insights, we develop a revised behavioral configuration that significantly improves the human-likeness of synthetic activity, yielding a more realistic and effective synthetic user persona.
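The classification step itself is standard supervised learning once Zeek flows are aggregated into features. A minimal sketch, assuming a pre-built table of per-window features; the feature names, CSV file, and labeling column are illustrative, not PHASE's exact pipeline:

    # Human vs. non-human classification from Zeek-style flow features.
    # Feature names and the CSV are illustrative; PHASE's pipeline differs.
    import pandas as pd
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.metrics import accuracy_score
    from sklearn.model_selection import train_test_split

    df = pd.read_csv("conn_features.csv")  # assumed: one row per flow window
    X = df[["mean_interarrival_s", "bytes_out", "bytes_in", "n_dns_lookups"]]
    y = df["is_human"]                     # assumed label derived from local DNS records

    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, stratify=y, random_state=0)
    clf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_tr, y_tr)
    print("accuracy:", accuracy_score(y_te, clf.predict(X_te)))  # paper reports >90%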
--------------------------------------------------------------------------------------------------------
Revisiting Reliability in the Reasoning-based Pose Estimation Benchmark
Human pose estimation using multimodal large language models has emerged as an important capability, but the field lacks reliable evaluation standards. This research identifies critical flaws in the widely-used RPE benchmark, including mismatched image indices, redundant data, and ambiguous descriptions that undermine consistent evaluation. The authors' meticulous refinement and open-source release of corrected annotations addresses these reproducibility issues. Applications include robotics navigation, human-computer interaction, sports analysis, healthcare monitoring, and augmented reality systems that require accurate human pose understanding. By establishing more reliable evaluation standards, this work enables fair comparison of different pose estimation approaches and accelerates progress in developing AI systems that can accurately interpret and respond to human body language and positioning.
Authors: Junsu Kim, Naeun Kim, Jaeho Lee, Incheol Park, Dongyoon Han, Seungryul Baek
Link: https://arxiv.org/abs/2507.13314v1
Date: 2025-07-d
Summary:
The reasoning-based pose estimation (RPE) benchmark has emerged as a widely adopted evaluation standard for pose-aware multimodal large language models (MLLMs). Despite its significance, we identified critical reproducibility and benchmark-quality issues that hinder fair and consistent quantitative evaluations. Most notably, the benchmark utilizes different image indices from those of the original 3DPW dataset, forcing researchers into tedious and error-prone manual matching processes to obtain accurate ground-truth (GT) annotations for quantitative metrics (e.g., MPJPE, PA-MPJPE). Furthermore, our analysis reveals several inherent benchmark-quality limitations, including significant image redundancy, scenario imbalance, overly simplistic poses, and ambiguous textual descriptions, collectively undermining reliable evaluations across diverse scenarios. To alleviate manual effort and enhance reproducibility, we carefully refined the GT annotations through meticulous visual matching and publicly release these refined annotations as an open-source resource, thereby promoting consistent quantitative evaluations and facilitating future advancements in human pose-aware multimodal reasoning.
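For reference, MPJPE is the mean Euclidean distance between predicted and ground-truth joints, and PA-MPJPE applies a Procrustes alignment (rotation, scale, translation) before measuring. A minimal numpy sketch of both standard metrics:

    # MPJPE and PA-MPJPE over (J, 3) joint arrays; standard definitions.
    import numpy as np

    def mpjpe(pred, gt):
        return np.linalg.norm(pred - gt, axis=-1).mean()

    def pa_mpjpe(pred, gt):
        P, G = pred - pred.mean(0), gt - gt.mean(0)  # remove translation
        U, S, Vt = np.linalg.svd(P.T @ G)
        R = U @ Vt                                   # optimal rotation for P @ R
        if np.linalg.det(R) < 0:                     # disallow reflections
            Vt[-1] *= -1
            S[-1] *= -1
            R = U @ Vt
        s = S.sum() / (P ** 2).sum()                 # optimal isotropic scale
        return mpjpe(s * P @ R, G)

    pred, gt = np.random.rand(17, 3), np.random.rand(17, 3)
    print(mpjpe(pred, gt), pa_mpjpe(pred, gt))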
--------------------------------------------------------------------------------------------------------
Towards Formal Verification of LLM-Generated Code from Natural Language Prompts
Large language models increasingly generate code from natural language descriptions, but users struggle to detect errors in AI-generated programs, creating potential safety and reliability risks. This research proposes Astrogator, a system that provides formal correctness guarantees for LLM-generated code through automated verification. Using Ansible as a testbed, the system achieves 83% verification of correct code and 92% identification of incorrect code. Applications include infrastructure automation, cloud deployment scripts, configuration management, and any domain where code correctness is critical. The approach could enable natural language programming for non-programmers while ensuring reliability, potentially transforming how software is developed by making programming accessible to domain experts who lack coding expertise but understand their requirements clearly.
Authors: Aaron Councilman, David Fu, Aryan Gupta, Chengxiao Wang, David Grove, Yu-Xiong Wang, Vikram Adve
Link: https://arxiv.org/abs/2507.13290v1
Date: 2025-07-d
Summary:
In the past few years, LLMs have emerged as a tool that can aid programmers by taking natural language descriptions and generating code from them. However, LLMs often generate incorrect code that users must fix, and the literature suggests users often struggle to detect these errors. In this work we seek to offer formal guarantees of correctness to LLM-generated code; such guarantees could improve the experience of using AI Code Assistants and potentially enable natural language programming for users with little or no programming knowledge. To address this challenge we propose to incorporate a formal query language that can represent a user's intent in a formally defined but natural language-like manner that a user can confirm matches their intent. Then, using such a query, we propose to verify the LLM-generated code to ensure it matches the user's intent. We implement these ideas in our system, Astrogator, for the Ansible programming language, which includes such a formal query language, a calculus for representing the behavior of Ansible programs, and a symbolic interpreter which is used for the verification. On a benchmark suite of 21 code-generation tasks, our verifier is able to verify correct code in 83% of cases and identify incorrect code in 92%.
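The core idea, checking generated code against a formal statement of intent, can be illustrated with a toy checker for a single Ansible task; this is a drastic simplification of Astrogator's query language and symbolic interpreter, and the intent structure here is hypothetical:

    # Toy intent check for a generated Ansible task; a drastic simplification
    # of Astrogator's formal query language and symbolic interpreter. The
    # intent structure is hypothetical.
    import yaml

    intent = {"module": "ansible.builtin.package", "name": "nginx", "state": "present"}

    generated = yaml.safe_load("""
    - name: Install nginx
      ansible.builtin.package:
        name: nginx
        state: present
    """)

    def satisfies(task: dict, intent: dict) -> bool:
        args = task.get(intent["module"])
        return (args is not None
                and args.get("name") == intent["name"]
                and args.get("state") == intent["state"])

    print(all(satisfies(t, intent) for t in generated))  # True -> matches intent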
--------------------------------------------------------------------------------------------------------
Automating Steering for Safe Multimodal Large Language Models
Multimodal large language models face new safety challenges when processing combined text and visual inputs, particularly against adversarial attacks designed to elicit harmful responses. AutoSteer introduces a novel inference-time intervention system that can detect and prevent unsafe outputs without requiring model fine-tuning. The system uses adaptive safety probing and selective intervention to maintain general capabilities while blocking harmful content. Applications include content moderation platforms, educational AI systems, customer service chatbots, and any deployment where MLLMs interact with diverse user-generated content. This modular approach enables safer deployment of multimodal AI in consumer applications, social media platforms, and professional environments where maintaining both functionality and safety is crucial for user trust and regulatory compliance.
Authors: Lyucheng Wu, Mengru Wang, Ziwen Xu, Tri Cao, Nay Oo, Bryan Hooi, Shumin Deng
Link: https://arxiv.org/abs/2507.13255v1
Date: 2025-07-d
Summary:
Recent progress in Multimodal Large Language Models (MLLMs) has unlocked powerful cross-modal reasoning abilities, but also raised new safety concerns, particularly when faced with adversarial multimodal inputs. To improve the safety of MLLMs during inference, we introduce AutoSteer, a modular and adaptive inference-time intervention technique that requires no fine-tuning of the underlying model. AutoSteer incorporates three core components: (1) a novel Safety Awareness Score (SAS) that automatically identifies the most safety-relevant distinctions among the model's internal layers; (2) an adaptive safety prober trained to estimate the likelihood of toxic outputs from intermediate representations; and (3) a lightweight Refusal Head that selectively intervenes to modulate generation when safety risks are detected. Experiments on LLaVA-OV and Chameleon across diverse safety-critical benchmarks demonstrate that AutoSteer significantly reduces the Attack Success Rate (ASR) for textual, visual, and cross-modal threats, while maintaining general abilities. These findings position AutoSteer as a practical, interpretable, and effective framework for safer deployment of multimodal AI systems.
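The intervention pattern, a lightweight probe on intermediate activations gating generation, can be sketched in a few lines of PyTorch; the probe design, layer index, and threshold below are assumptions, and AutoSteer's SAS-based layer selection and Refusal Head are more involved:

    # Sketch of inference-time steering: a probe scores an intermediate hidden
    # state and, above a threshold, generation is redirected to a refusal.
    # Layer index, threshold, and probe design are illustrative, not AutoSteer's.
    import torch
    import torch.nn as nn

    class SafetyProber(nn.Module):
        def __init__(self, hidden_dim: int):
            super().__init__()
            self.probe = nn.Linear(hidden_dim, 1)

        def forward(self, h: torch.Tensor) -> torch.Tensor:
            return torch.sigmoid(self.probe(h.mean(dim=1)))  # pooled risk score in (0, 1)

    @torch.no_grad()
    def guarded_generate(model, tokenizer, prompt, prober, layer=-8, threshold=0.5):
        inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
        hidden = model(**inputs, output_hidden_states=True).hidden_states[layer]
        if prober(hidden).item() > threshold:
            return "I can't help with that request."  # refusal path
        out = model.generate(**inputs, max_new_tokens=128)
        return tokenizer.decode(out[0], skip_special_tokens=True)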
--------------------------------------------------------------------------------------------------------
Inverse Reinforcement Learning Meets Large Language Model Post-Training: Basics, Advances, and Opportunities
Large language model alignment represents one of the most critical challenges in developing reliable AI systems, as misaligned models can exhibit unpredictable or harmful behaviors. This comprehensive review examines how inverse reinforcement learning techniques can improve LLM alignment by learning appropriate reward models from human feedback. The work bridges reinforcement learning theory with practical LLM training methodologies. Applications include developing more helpful AI assistants, creating controllable content generation systems, improving conversational AI safety, and building specialized domain models that adhere to professional standards. The insights from sparse-reward RL environments could inform training more capable reasoning systems and AI agents that maintain alignment even in complex, long-horizon tasks where direct human oversight becomes impractical.
Authors: Hao Sun, Mihaela van der Schaar
Link: https://arxiv.org/abs/2507.13158v1
Date: 2025-07-d
Summary:
In the era of Large Language Models (LLMs), alignment has emerged as a fundamental yet challenging problem in the pursuit of more reliable, controllable, and capable machine intelligence. The recent success of reasoning models and conversational AI systems has underscored the critical role of reinforcement learning (RL) in enhancing these systems, driving increased research interest at the intersection of RL and LLM alignment. This paper provides a comprehensive review of recent advances in LLM alignment through the lens of inverse reinforcement learning (IRL), emphasizing the distinctions between RL techniques employed in LLM alignment and those in conventional RL tasks. In particular, we highlight the necessity of constructing neural reward models from human data and discuss the formal and practical implications of this paradigm shift. We begin by introducing fundamental concepts in RL to provide a foundation for readers unfamiliar with the field. We then examine recent advances in this research agenda, discussing key challenges and opportunities in conducting IRL for LLM alignment. Beyond methodological considerations, we explore practical aspects, including datasets, benchmarks, evaluation metrics, infrastructure, and computationally efficient training and inference techniques. Finally, we draw insights from the literature on sparse-reward RL to identify open questions and potential research directions. By synthesizing findings from diverse studies, we aim to provide a structured and critical overview of the field, highlight unresolved challenges, and outline promising future directions for improving LLM alignment through RL and IRL techniques.
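The central construction the review highlights, learning a neural reward model from human preference data, is typically implemented with a Bradley-Terry pairwise loss; a minimal PyTorch sketch of this standard building block:

    # Bradley-Terry reward-model loss: the preferred response should score
    # higher than the rejected one. A standard RLHF/IRL building block.
    import torch
    import torch.nn.functional as F

    def preference_loss(r_chosen: torch.Tensor, r_rejected: torch.Tensor) -> torch.Tensor:
        return -F.logsigmoid(r_chosen - r_rejected).mean()  # -log sigma(r_c - r_r)

    r_c = torch.tensor([1.3, 0.2, 0.9])   # reward scores for preferred responses
    r_r = torch.tensor([0.4, 0.5, -0.1])  # reward scores for rejected responses
    print(preference_loss(r_c, r_r))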
--------------------------------------------------------------------------------------------------------
FORTRESS: Function-composition Optimized Real-Time Resilient Structural Segmentation
Infrastructure monitoring requires automated detection of structural defects, but existing systems struggle to balance accuracy with the computational efficiency needed for real-time deployment. FORTRESS introduces a novel architecture combining depthwise separable convolutions with Kolmogorov-Arnold Networks, achieving 91% parameter reduction while improving segmentation performance. The system delivers superior accuracy with 3x faster inference speed. Applications include bridge inspection systems, building maintenance programs, autonomous inspection drones, predictive maintenance for critical infrastructure, and emergency response systems that need rapid damage assessment. This technology could revolutionize infrastructure management by enabling continuous, automated monitoring of structural health, potentially preventing catastrophic failures through early defect detection while reducing inspection costs and improving public safety through proactive maintenance strategies.
Authors: Christina Thrainer, Md Meftahul Ferdaus, Mahdi Abdelguerfi, Christian Guetl, Steven Sloan, Kendall N. Niles, Ken Pathak
Link: https://arxiv.org/abs/2507.12675v1
Date: 2025-07-d
Summary:
Automated structural defect segmentation in civil infrastructure faces a critical challenge: achieving high accuracy while maintaining computational efficiency for real-time deployment. This paper presents FORTRESS (Function-composition Optimized Real-Time Resilient Structural Segmentation), a new architecture that balances accuracy and speed by combining depthwise separable convolutions with adaptive Kolmogorov-Arnold Network integration. FORTRESS incorporates three key innovations: a systematic depthwise separable convolution framework achieving a 3.6x parameter reduction per layer, adaptive TiKAN integration that selectively applies function composition transformations only when computationally beneficial, and multi-scale attention fusion combining spatial, channel, and KAN-enhanced features across decoder levels. The architecture achieves remarkable efficiency gains with 91% parameter reduction (31M to 2.9M), 91% computational complexity reduction (13.7 to 1.17 GFLOPs), and 3x inference speed improvement while delivering superior segmentation performance. Evaluation on benchmark infrastructure datasets demonstrates state-of-the-art results with an F1-score of 0.771 and a mean IoU of 0.677, significantly outperforming existing methods including U-Net, SA-UNet, and U-KAN. The dual optimization strategy proves essential for optimal performance, establishing FORTRESS as a robust solution for practical structural defect segmentation in resource-constrained environments where both accuracy and computational efficiency are paramount. Comprehensive architectural specifications are provided in the Supplemental Material. Source code is available at https://github.com/faeyelab/fortress-paper-code.
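Much of the parameter reduction comes from depthwise separable convolutions, which factor a standard convolution into a per-channel spatial filter plus a 1x1 pointwise mix. A quick PyTorch check of the savings; the channel sizes here are illustrative, so the exact ratio differs from the paper's per-layer 3.6x:

    # Parameter count: standard conv vs. depthwise separable equivalent.
    import torch.nn as nn

    def n_params(m: nn.Module) -> int:
        return sum(p.numel() for p in m.parameters())

    c_in, c_out, k = 64, 128, 3
    standard = nn.Conv2d(c_in, c_out, k, padding=1)
    separable = nn.Sequential(
        nn.Conv2d(c_in, c_in, k, padding=1, groups=c_in),  # depthwise: one filter per channel
        nn.Conv2d(c_in, c_out, 1),                         # pointwise: 1x1 channel mixing
    )
    print(n_params(standard), n_params(separable),
          round(n_params(standard) / n_params(separable), 1))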
--------------------------------------------------------------------------------------------------------
Transforming Football Data into Object-centric Event Logs with Spatial Context Information
Traditional sports analytics focuses on individual events, but understanding team dynamics requires analyzing complex interactions between multiple players simultaneously. This research introduces a framework for transforming football data into object-centric event logs that capture spatial relationships and multi-object interactions. The approach enables more sophisticated analysis of team coordination, tactical patterns, and strategic decision-making. Applications include tactical analysis for coaching staff, player performance evaluation, scouting systems, broadcast enhancement with tactical insights, and fan engagement through deeper game understanding. The framework could extend to other team sports and complex multi-agent scenarios, potentially revolutionizing sports science by providing coaches and analysts with tools to understand emergent team behaviors and optimize tactical approaches based on spatial dynamics.
Authors: Vito Chan, Lennart Ebert, Paul-Julius Hillmann, Christoffer Rubensson, Stephan A. Fahrenkrog-Petersen, Jan Mendling
Link: https://arxiv.org/abs/2507.12504v1
Date: 2025-07-d
Summary:
Object-centric event logs expand the conventional single-case notion of event logs by considering multiple objects, allowing for the analysis of more complex and realistic process behavior. However, the number of real-world object-centric event logs remains limited, and further studies are needed to test their usefulness. The increasing availability of data from team sports can facilitate object-centric process mining, leveraging both real-world data and suitable use cases. In this paper, we present a framework for transforming football (soccer) data into an object-centric event log, further enhanced with a spatial dimension. We demonstrate the effectiveness of our framework by generating object-centric event logs based on real-world football data and discuss the results for varying process representations. With our paper, we provide the first example of object-centric event logs in football analytics. Future work should consider variant analysis and filtering techniques to better handle variability.
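Unlike a single-case event, an object-centric event references several objects at once. A minimal sketch of one pass event enriched with spatial context; the field names follow the general OCEL style but are illustrative, not the paper's exact schema:

    # One object-centric event for a pass, relating several objects and
    # carrying spatial attributes. Field names are illustrative, in the
    # spirit of OCEL, not the paper's exact schema.
    event = {
        "ocel:id": "e42",
        "ocel:activity": "pass",
        "ocel:timestamp": "2024-08-17T15:03:21Z",
        "ocel:omap": ["player_10", "player_7", "ball", "match_001"],  # related objects
        "ocel:vmap": {
            "origin_xy": (34.2, 18.7),  # pitch coordinates of the passer
            "target_xy": (41.0, 25.3),  # pitch coordinates of the receiver
            "success": True,
        },
    }
    print(event["ocel:activity"], event["ocel:omap"])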
--------------------------------------------------------------------------------------------------------
Formal Verification of Neural Certificates Done Dynamically
Neural certificates provide mathematical proofs of safety in cyber-physical systems, but traditional verification methods face scalability challenges that limit their practical deployment. This research introduces a lightweight runtime monitoring framework that performs real-time verification during system operation rather than exhaustive pre-deployment analysis. The approach enables timely detection of safety violations with minimal computational overhead. Applications include autonomous vehicle safety systems, industrial control systems, medical device monitoring, aerospace applications, and any safety-critical system where neural networks control physical processes. This dynamic verification approach could enable broader adoption of AI in safety-critical domains by providing continuous safety assurance without the computational burden of static verification, potentially accelerating deployment of AI-controlled systems in high-stakes environments.
Authors: Thomas A. Henzinger, Konstantin Kueffner, Emily Yu
Link: https://arxiv.org/abs/2507.11987v1
Date: 2025-07-d
Summary:
Neural certificates have emerged as a powerful tool in cyber-physical systems control, providing witnesses of correctness. These certificates, such as barrier functions, often learned alongside control policies, once verified, serve as mathematical proofs of system safety. However, traditional formal verification of their defining conditions typically faces scalability challenges due to exhaustive state-space exploration. To address this challenge, we propose a lightweight runtime monitoring framework that integrates real-time verification and does not require access to the underlying control policy. Our monitor observes the system during deployment and performs on-the-fly verification of the certificate over a lookahead region to ensure safety within a finite prediction horizon. We instantiate this framework for ReLU-based control barrier functions and demonstrate its practical effectiveness in a case study. Our approach enables timely detection of safety violations and incorrect certificates with minimal overhead, providing an effective but lightweight alternative to the static verification of the certificates.
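The monitor's core loop is easy to state: predict the system forward over a finite lookahead and check the certificate conditions at each step. A sketch for a control barrier function h, where the dynamics, horizon, and decay factor are illustrative placeholders rather than the paper's instantiation:

    # Runtime monitor: check certificate conditions over a finite lookahead.
    # Dynamics, horizon, and decay factor gamma are illustrative placeholders.
    import numpy as np

    def monitor(x0, h, step, gamma=0.9, horizon=20):
        """True if the barrier certificate holds along the predicted trajectory."""
        x = np.asarray(x0, dtype=float)
        h_prev = h(x)
        for _ in range(horizon):
            if h_prev < 0:               # left the certified safe set
                return False
            x = step(x)                  # one-step system prediction
            h_next = h(x)
            if h_next < gamma * h_prev:  # decrease condition violated
                return False
            h_prev = h_next
        return True

    # Toy check: h(x) = 1 - |x|^2 with contracting dynamics stays certified.
    print(monitor([0.5, -0.2], h=lambda x: 1 - x @ x, step=lambda x: 0.95 * x))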
--------------------------------------------------------------------------------------------------------
Online Training and Pruning of Deep Reinforcement Learning Networks
Deep reinforcement learning often requires large neural networks for optimal performance, but their computational and memory demands limit deployment in resource-constrained environments. This research introduces XiNet, a method for simultaneously training and pruning RL networks using variational Bernoulli distributions to automatically identify and remove unnecessary network components. The approach achieves significant compression with minimal performance loss. Applications include mobile robotics, edge computing devices, autonomous drones with limited computational power, real-time control systems, and embedded AI applications. This technology could democratize deployment of sophisticated RL agents in scenarios where computational resources are limited, enabling intelligent behavior in IoT devices, wearable technology, and distributed sensor networks where traditional large-scale neural networks would be impractical.
Authors: Valentin Frank Ingmar Guenter, Athanasios Sideris
Link: https://arxiv.org/abs/2507.11975v1
Date: 2025-07-d
Summary:
Scaling deep neural networks (NN) of reinforcement learning (RL) algorithms has been shown to enhance performance when feature extraction networks are used, but the gained performance comes at the significant expense of increased computational and memory complexity. Neural network pruning methods have successfully addressed this challenge in supervised learning. However, their application to RL is underexplored. We propose an approach to integrate simultaneous training and pruning within advanced RL methods, in particular RL algorithms enhanced by the Online Feature Extractor Network (OFENet). Our networks (XiNet) are trained to solve stochastic optimization problems over the RL networks' weights and the parameters of variational Bernoulli distributions for 0/1 random variables (RVs) $\xi$ scaling each unit in the networks. The stochastic problem formulation induces regularization terms that promote convergence of the variational parameters to 0 when a unit contributes little to the performance. In this case, the corresponding structure is rendered permanently inactive and pruned from its network. We propose a cost-aware, sparsity-promoting regularization scheme, tailored to the DenseNet architecture of OFENets, expressing the parameter complexity of the involved networks in terms of the parameters of the RVs in these networks. Then, by matching this cost with the regularization terms, the many hyperparameters associated with them are automatically selected, effectively combining the RL objectives and network compression. We evaluate our method on continuous control benchmarks (MuJoCo) and the Soft Actor-Critic RL agent, demonstrating that OFENets can be pruned considerably with minimal loss in performance. Furthermore, our results confirm that pruning large networks during training produces more efficient and higher-performing RL agents than training smaller networks from scratch.
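The gating idea, Bernoulli variables scaling each unit with variational parameters pushed toward zero by regularization, can be sketched with learnable gate logits; this omits OFENet, the cost-aware weighting, and the paper's exact stochastic formulation:

    # Sketch of variational Bernoulli gating for unit pruning. Omits OFENet,
    # the cost-aware regularization weights, and the paper's exact estimator.
    import torch
    import torch.nn as nn

    class GatedLinear(nn.Module):
        def __init__(self, d_in: int, d_out: int):
            super().__init__()
            self.lin = nn.Linear(d_in, d_out)
            self.logit = nn.Parameter(torch.zeros(d_out))  # variational Bernoulli parameters

        def forward(self, x: torch.Tensor) -> torch.Tensor:
            p = torch.sigmoid(self.logit)
            xi = torch.bernoulli(p) if self.training else (p > 0.5).float()
            xi = xi + p - p.detach()  # straight-through estimator for the gates
            return self.lin(x) * xi

        def expected_active(self) -> torch.Tensor:
            return torch.sigmoid(self.logit).sum()

    layer = GatedLinear(256, 128)
    sparsity_penalty = 1e-3 * layer.expected_active()  # drives some p -> 0; those units get pruned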
--------------------------------------------------------------------------------------------------------
FOUNDER: Grounding Foundation Models in World Models for Open-Ended Embodied Decision Making
Current AI agents struggle with generalizing across diverse embodied tasks without explicit reward signals, limiting their adaptability in real-world scenarios. FOUNDER addresses this by combining the semantic understanding of foundation models with the predictive capabilities of world models, enabling reward-free learning in complex environments. The system learns goal-conditioned policies through imagination and uses temporal distance as an informative reward signal. Applications include household robotics, autonomous exploration systems, general-purpose manipulation tasks, and adaptive manufacturing systems. This approach could enable truly general-purpose robotic agents capable of understanding tasks described in natural language or demonstrated through video, potentially revolutionizing robotics by creating systems that can adapt to new tasks without extensive retraining or reward engineering.
Authors: Yucen Wang, Rui Yu, Shenghua Wan, Le Gan, De-Chuan Zhan
Link: https://arxiv.org/abs/2507.12496v1
Date: 2025-07-d
Summary:
Foundation Models (FMs) and World Models (WMs) offer complementary strengths in task generalization at different levels. In this work, we propose FOUNDER, a framework that integrates the generalizable knowledge embedded in FMs with the dynamic modeling capabilities of WMs to enable open-ended task solving in embodied environments in a reward-free manner. We learn a mapping function that grounds FM representations in the WM state space, effectively inferring the agent's physical states in the world simulator from external observations. This mapping enables the learning of a goal-conditioned policy through imagination during behavior learning, with the mapped task serving as the goal state. Our method leverages the predicted temporal distance to the goal state as an informative reward signal. FOUNDER demonstrates superior performance on various multi-task offline visual control benchmarks, excelling in capturing the deep-level semantics of tasks specified by text or videos, particularly in scenarios involving complex observations or domain gaps where prior methods struggle. The consistency of our learned reward function with the ground-truth reward is also empirically validated. Our project website is https://sites.google.com/view/founder-rl.
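The reward construction is the most transferable piece: once a network predicts the temporal distance between two latent world-model states, its negation serves as the reward. A schematic PyTorch sketch with an assumed predictor architecture and latent sizes:

    # Schematic: reward = negative predicted temporal distance to the goal.
    # The predictor architecture and latent sizes are assumptions.
    import torch
    import torch.nn as nn

    class TemporalDistance(nn.Module):
        """Predicts how many steps separate two latent world-model states."""
        def __init__(self, d: int):
            super().__init__()
            self.net = nn.Sequential(nn.Linear(2 * d, 256), nn.ReLU(), nn.Linear(256, 1))

        def forward(self, z: torch.Tensor, z_goal: torch.Tensor) -> torch.Tensor:
            return self.net(torch.cat([z, z_goal], dim=-1)).squeeze(-1)

    dist = TemporalDistance(64)
    z, z_goal = torch.randn(8, 64), torch.randn(8, 64)
    reward = -dist(z, z_goal)  # closer in predicted time -> larger reward
    print(reward.shape)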
--------------------------------------------------------------------------------------------------------
A Roadmap for Climate-Relevant Robotics Research
Climate change represents humanity's greatest challenge, yet the robotics community's potential contributions remain underexplored and fragmented. This comprehensive roadmap identifies high-impact applications where robotics can address climate priorities across energy systems, construction, agriculture, transportation, and environmental monitoring. The work emphasizes applying not just physical robots but the broader robotics toolkit including planning, perception, and control algorithms. Applications span renewable energy optimization, precision agriculture for sustainable farming, building efficiency retrofits, autonomous low-emission transportation, and large-scale environmental sensing. This roadmap could catalyze coordinated research efforts and inspire roboticists to direct their expertise toward urgent climate solutions, potentially accelerating development of technologies essential for achieving global climate goals while creating new research directions that benefit both robotics advancement and environmental sustainability.
Authors: Alan Papalia, Charles Dawson, Laurentiu L. Anton, Norhan Magdy Bayomi, Bianca Champenois, Jung-Hoon Cho, Levi Cai, Joseph DelPreto, Kristen Edwards, Bilha-Catherine Githinji, Cameron Hickert, Vindula Jayawardana, Matthew Kramer, Shreyaa Raghavan, David Russell, Shide Salimi, Jingnan Shi, Soumya Sudhakar, Yanwei Wang, Shouyi Wang, Luca Carlone, Vijay Kumar, Daniela Rus, John E. Fernandez, Cathy Wu, George Kantor, Derek Young, Hanumant Singh
Link: https://arxiv.org/abs/2507.11623v2
Date: 2025-07-d
Summary:
Climate change is one of the defining challenges of the 21st century, and many in the robotics community are looking for ways to contribute. This paper presents a roadmap for climate-relevant robotics research, identifying high-impact opportunities for collaboration between roboticists and experts across climate domains such as energy, the built environment, transportation, industry, land use, and Earth sciences. These applications include problems such as energy systems optimization, construction, precision agriculture, building envelope retrofits, autonomous trucking, and large-scale environmental monitoring. Critically, we include opportunities to apply not only physical robots but also the broader robotics toolkit - including planning, perception, control, and estimation algorithms - to climate-relevant problems. A central goal of this roadmap is to inspire new research directions and collaboration by highlighting specific, actionable problems at the intersection of robotics and climate. This work represents a collaboration between robotics researchers and domain experts in various climate disciplines, and it serves as an invitation to the robotics community to bring their expertise to bear on urgent climate priorities.
--------------------------------------------------------------------------------------------------------
Scaling the memory wall using mixed-precision -- HPG-MxP on an exascale machine
Scientific computing applications often face memory bandwidth limitations that prevent them from fully utilizing modern high-performance computing systems, despite advances in processing power. This research demonstrates practical performance gains from mixed-precision algorithms on exascale systems, achieving 1.6x speedup using combined double- and single-precision arithmetic. The work addresses the critical gap between theoretical mixed-precision benefits and real-world scientific application performance. Applications include climate modeling, computational fluid dynamics, materials science simulations, and large-scale physics simulations that require massive computational resources. This approach could enable more complex scientific simulations within existing computational budgets, potentially accelerating scientific discovery by making previously intractable problems computationally feasible while maximizing utilization of expensive supercomputing infrastructure for breakthrough research in physics, chemistry, and engineering.
Authors: Aditya Kashi, Nicholson Koukpaizan, Hao Lu, Michael Matheson, Sarp Oral, Feiyi Wang
Link: https://arxiv.org/abs/2507.11512v1
Date: 2025-07-d
Summary:
Mixed-precision algorithms have been proposed as a way for scientific computing to benefit from some of the gains seen for artificial intelligence (AI) on recent high performance computing (HPC) platforms. A few applications dominated by dense matrix operations have seen substantial speedups by utilizing low precision formats such as FP16. However, a majority of scientific simulation applications are memory bandwidth limited. Beyond preliminary studies, the practical gain from using mixed-precision algorithms on a given HPC system is largely unclear. The High Performance GMRES Mixed Precision (HPG-MxP) benchmark has been proposed to measure the useful performance of an HPC system on sparse matrix-based mixed-precision applications. In this work, we present a highly optimized implementation of the HPG-MxP benchmark for an exascale system and describe our algorithm enhancements. We show for the first time a speedup of 1.6x using a combination of double- and single-precision arithmetic on modern GPU-based supercomputers.
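The general pattern behind such benchmarks, doing the bulk of the work in single precision while keeping the outer correction in double, can be shown with mixed-precision iterative refinement on a sparse system; the CG inner solver and test matrix below are illustrative, whereas HPG-MxP centers on GMRES:

    # Mixed-precision iterative refinement: inner solves in float32, residual
    # and correction accumulation in float64. Illustrative, not HPG-MxP itself.
    import numpy as np
    import scipy.sparse as sp
    import scipy.sparse.linalg as spla

    n = 2000
    A = sp.diags([-1.0, 4.0, -1.0], [-1, 0, 1], shape=(n, n), format="csr")  # SPD test matrix
    b = np.random.default_rng(0).standard_normal(n)
    A32 = A.astype(np.float32)

    x = np.zeros(n)
    for it in range(10):
        r = b - A @ x                              # residual in double precision
        if np.linalg.norm(r) < 1e-10 * np.linalg.norm(b):
            break
        d, _ = spla.cg(A32, r.astype(np.float32))  # cheap single-precision inner solve
        x += d.astype(np.float64)                  # accumulate correction in double
    print("relative residual:", np.linalg.norm(b - A @ x) / np.linalg.norm(b))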
--------------------------------------------------------------------------------------------------------
Chain of Thought Monitorability: A New and Fragile Opportunity for AI Safety
As AI systems become more capable, ensuring their safety becomes increasingly critical, yet traditional oversight methods may prove insufficient for advanced AI systems. Chain of thought monitoring represents a promising approach for AI safety by observing AI reasoning processes in human-interpretable language, enabling detection of potentially harmful intentions before they manifest as actions. This transparency could provide unprecedented insight into AI decision-making processes. Applications include monitoring AI systems in high-stakes environments, ensuring alignment in autonomous agents, detecting deceptive behavior, and maintaining human oversight of AI reasoning. This approach could be crucial for safely deploying advanced AI systems in critical applications like healthcare, finance, and governance, though its fragility underscores the need for careful consideration in AI development to preserve this monitoring capability.
Authors: Tomek Korbak, Mikita Balesni, Elizabeth Barnes, Yoshua Bengio, Joe Benton, Joseph Bloom, Mark Chen, Alan Cooney, Allan Dafoe, Anca Dragan, Scott Emmons, Owain Evans, David Farhi, Ryan Greenblatt, Dan Hendrycks, Marius Hobbhahn, Evan Hubinger, Geoffrey Irving, Erik Jenner, Daniel Kokotajlo, Victoria Krakovna, Shane Legg, David Lindner, David Luan, Aleksander Mądry, Julian Michael, Neel Nanda, Dave Orr, Jakub Pachocki, Ethan Perez, Mary Phuong, Fabien Roger, Joshua Saxe, Buck Shlegeris, Martín Soto, Eric Steinberger, Jasmine Wang, Wojciech Zaremba, Bowen Baker, Rohin Shah, Vlad Mikulik
Link: https://arxiv.org/abs/2507.11473v1
Date: 2025-07-d
Summary:
AI systems that "think" in human language offer a unique opportunity for AI safety: we can monitor their chains of thought (CoT) for the intent to misbehave. Like all other known AI oversight methods, CoT monitoring is imperfect and allows some misbehavior to go unnoticed. Nevertheless, it shows promise and we recommend further research into CoT monitorability and investment in CoT monitoring alongside existing safety methods. Because CoT monitorability may be fragile, we recommend that frontier model developers consider the impact of development decisions on CoT monitorability.
--------------------------------------------------------------------------------------------------------
Opus: A Prompt Intention Framework for Complex Workflow Generation
Large language models struggle with generating complex workflows directly from user queries, often producing illogical or incomplete results as query complexity increases. The Opus framework introduces an intermediate intention capture layer that extracts workflow signals, interprets them into structured intentions, and generates workflows based on these intentions. This approach enables more reliable and scalable workflow generation. Applications include business process automation, software development pipelines, data analysis workflows, and task management systems that need to translate high-level goals into executable procedures. This framework could revolutionize how organizations automate complex processes by enabling non-technical users to specify complex workflows in natural language, potentially transforming business process management and making sophisticated automation accessible to domain experts without programming expertise.
Authors: Théo Fagnoni, Mahsun Altin, Chia En Chung, Phillip Kingston, Alan Tuning, Dana O. Mohamed, Inès Adnani
Link: https://arxiv.org/abs/2507.11288v1
Date: 2025-07-d
Summary:
This paper introduces the Opus Prompt Intention Framework, designed to improve complex Workflow Generation with instruction-tuned Large Language Models (LLMs). We propose an intermediate Intention Capture layer between user queries and Workflow Generation, implementing the Opus Workflow Intention Framework, which consists of extracting Workflow Signals from user queries, interpreting them into structured Workflow Intention objects, and generating Workflows based on these Intentions. Our results show that this layer enables LLMs to produce logical and meaningful outputs that scale reliably as query complexity increases. On a synthetic benchmark of 1,000 multi-intent query-Workflow(s) pairs, applying the Opus Prompt Intention Framework to Workflow Generation yields consistent improvements in semantic Workflow similarity metrics. In this paper, we introduce the Opus Prompt Intention Framework by applying the concepts of Workflow Signal and Workflow Intention to LLM-driven Workflow Generation. We present a reproducible, customizable LLM-based Intention Capture system to extract Workflow Signals and Workflow Intentions from user queries. Finally, we provide empirical evidence that the proposed system significantly improves Workflow Generation quality compared to direct generation from user queries, particularly in cases of Mixed Intention Elicitation.
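The layering reads naturally as a data flow: query, then extracted signals, then structured intentions, then workflows. A schematic sketch with hypothetical types and stand-in functions at each point where the framework would call an LLM:

    # Schematic intention-capture layer between user query and workflow
    # generation. Types and stand-in functions are hypothetical; the real
    # framework performs each step with an instruction-tuned LLM.
    from dataclasses import dataclass

    @dataclass
    class WorkflowIntention:
        goal: str

    def extract_signals(query: str) -> list[str]:
        return [s.strip() for s in query.split(" and ")]     # stand-in for LLM extraction

    def interpret(signals: list[str]) -> list[WorkflowIntention]:
        return [WorkflowIntention(goal=s) for s in signals]  # stand-in for LLM interpretation

    def generate_workflow(intention: WorkflowIntention) -> dict:
        return {"steps": [f"step for: {intention.goal}"]}    # stand-in for LLM generation

    query = "ingest the sales CSV and compute monthly totals and email the report"
    print([generate_workflow(i) for i in interpret(extract_signals(query))])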
--------------------------------------------------------------------------------------------------------
Assessing Color Vision Test in Large Vision-language Models
Color perception represents a fundamental aspect of visual understanding, yet the color vision capabilities of large vision-language models remain poorly understood despite their widespread adoption. This research defines systematic color vision testing for these models and constructs comprehensive evaluation datasets covering multiple difficulty levels and test categories. Understanding these limitations is crucial for applications requiring accurate color recognition. Applications include medical diagnosis assistance, quality control in manufacturing, artistic and design applications, accessibility tools for color-blind users, and scientific applications requiring precise color analysis. This work could inform development of more visually capable AI systems and identify scenarios where current models may fail, ensuring appropriate deployment of vision-language models in color-critical applications while guiding improvements in visual perception capabilities.
Authors: Hongfei Ye, Bin Chen, Wenxi Liu, Yu Zhang, Zhao Li, Dandan Ni, Hongyang Chen
Link: https://arxiv.org/abs/2507.11153v1
Date: 2025-07-d
Summary:
With the widespread adoption of large vision-language models, the capacity for color vision in these models is crucial. However, the color vision abilities of large vision-language models have not yet been thoroughly explored. To address this gap, we define a color vision testing task for large vision-language models and construct a dataset (an anonymous GitHub repository showing some of the data: https://anonymous.4open.science/r/color-vision-test-dataset-3BCD) that covers multiple categories of test questions and tasks of varying difficulty levels. Furthermore, we analyze the types of errors made by large vision-language models and propose fine-tuning strategies to enhance their performance in color vision tests.
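A test item for such a benchmark can be as simple as a uniform color patch plus a naming question; a sketch of generating one item, illustrative rather than the paper's dataset construction:

    # Generate one simple color-vision test item: a solid patch plus a
    # naming question. Illustrative; the paper's dataset spans more
    # categories and difficulty levels.
    from PIL import Image

    rgb = (178, 34, 34)  # a dark red
    Image.new("RGB", (224, 224), rgb).save("item_001.png")
    item = {
        "image": "item_001.png",
        "question": "What is the dominant color of this image?",
        "answer": "red",
    }
    print(item)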
--------------------------------------------------------------------------------------------------------
Indoor localization using Wi-Fi signals faces significant challenges from environmental noise and device heterogeneity, limiting accuracy across different mobile devices. GATE introduces a novel graph neural network approach that models spatial relationships between locations while addressing the limitations of conventional deep learning methods in handling non-Euclidean RSS noise patterns. The system achieves substantially lower localization errors across diverse indoor environments and heterogeneous devices. Applications include indoor navigation systems, location-based services, emergency response in buildings, asset tracking, and smart building management. This technology could enable precise indoor positioning for autonomous robots, enhance accessibility navigation for visually impaired users, and improve efficiency of warehouse operations through accurate location tracking of personnel and equipment in complex indoor environments.
Authors: Danish Gufran, Sudeep Pasricha
Link: https://arxiv.org/abs/2507.11053v1
Date: 2025-07-d
Summary:
Accurate indoor localization is crucial for enabling spatial context in smart environments and navigation systems. Wi-Fi Received Signal Strength (RSS) fingerprinting is a widely used indoor localization approach due to its compatibility with mobile embedded devices. Deep Learning (DL) models improve accuracy in localization tasks by learning RSS variations across locations, but they assume fingerprint vectors exist in a Euclidean space, failing to incorporate spatial relationships and the non-uniform distribution of real-world RSS noise. This results in poor generalization across heterogeneous mobile devices, where variations in hardware and signal processing distort RSS readings. Graph Neural Networks (GNNs) can improve upon conventional DL models by encoding indoor locations as nodes and modeling their spatial and signal relationships as edges. However, GNNs struggle with non-Euclidean noise distributions and suffer from the GNN blind spot problem, leading to degraded accuracy in environments with dense access points (APs). To address these challenges, we propose GATE, a novel framework that constructs an adaptive graph representation of fingerprint vectors while preserving an indoor state-space topology, modeling the non-Euclidean structure of RSS noise to mitigate environmental noise and address device heterogeneity. GATE introduces 1) a novel Attention Hyperspace Vector (AHV) for enhanced message passing, 2) a novel Multi-Dimensional Hyperspace Vector (MDHV) to mitigate the GNN blind spot, and 3) a new Real-Time Edge Construction (RTEC) approach for dynamic graph adaptation. Extensive real-world evaluations across multiple indoor spaces with varying path lengths, AP densities, and heterogeneous devices demonstrate that GATE achieves 1.6x to 4.72x lower mean localization errors and 1.85x to 4.57x lower worst-case errors compared to state-of-the-art indoor localization frameworks.
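The graph-construction step, locations as nodes with edges weighted by fingerprint similarity, can be sketched in numpy; the fixed similarity cutoff below is illustrative, whereas GATE's RTEC builds edges dynamically:

    # Location graph from RSS fingerprints: nodes are reference points, edges
    # connect points with similar fingerprints. The fixed cutoff is
    # illustrative; GATE's RTEC constructs edges dynamically instead.
    import numpy as np

    rng = np.random.default_rng(1)
    fingerprints = rng.uniform(-90, -30, size=(50, 12))  # 50 locations x 12 APs (dBm)

    d = np.linalg.norm(fingerprints[:, None] - fingerprints[None, :], axis=-1)
    adj = (d < np.percentile(d, 10)) & ~np.eye(50, dtype=bool)  # keep closest ~10% of pairs
    print("edges:", int(adj.sum()) // 2)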
--------------------------------------------------------------------------------------------------------
Lessons Learned from Evaluation of LLM based Multi-agents in Safer Therapy Recommendation
Managing multimorbidity patients with multiple chronic conditions requires careful coordination to avoid treatment conflicts and adverse drug interactions, presenting significant challenges for healthcare providers. This research investigates using LLM-based multi-agent systems to simulate multidisciplinary team decision-making for safer therapy recommendations. The study reveals that current single-agent systems perform as well as multi-agent approaches while highlighting limitations in recommendation completeness. Applications include clinical decision support systems, medication management platforms, chronic disease management, and training systems for healthcare professionals. This work could improve patient safety by providing automated screening for treatment conflicts and supporting healthcare providers in complex therapeutic decisions, particularly valuable in resource-constrained healthcare settings where specialist consultations may be limited or unavailable.
Authors: Yicong Wu, Ting Chen, Irit Hochberg, Zhoujian Sun, Ruth Edry, Zhengxing Huang, Mor Peleg
Link: https://arxiv.org/abs/2507.10911v1
Date: 2025-07-d
Summary:
Therapy recommendation for chronic patients with multimorbidity is challenging due to risks of treatment conflicts. Existing decision support systems face scalability limitations. Inspired by the way general practitioners (GPs) manage multimorbidity patients, occasionally convening multidisciplinary team (MDT) collaboration, this study investigated the feasibility and value of using a Large Language Model (LLM)-based multi-agent system (MAS) for safer therapy recommendations. We designed a single agent and a MAS framework simulating MDT decision-making by enabling discussion among LLM agents to resolve medical conflicts. The systems were evaluated on therapy planning tasks for multimorbidity patients using benchmark cases. We compared MAS performance with single-agent approaches and real-world benchmarks. An important contribution of our study is the definition of evaluation metrics that go beyond technical precision and recall, allowing inspection of the clinical goals met and the medication burden of the proposed advice against a gold-standard benchmark. Our results show that with current LLMs, a single-agent GP performs as well as MDTs. The best-scoring models provide correct recommendations that address all clinical goals, yet the advice is incomplete. Some models also recommend unnecessary medications, resulting in unnecessary conflicts between medications and conditions or drug-drug interactions.
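The MDT simulation reduces to a discussion loop over role-conditioned agents. A schematic sketch in which ask_llm is a hypothetical stand-in for a chat-model call, not the paper's implementation:

    # Schematic MDT discussion loop among role-conditioned agents. ask_llm is
    # a hypothetical stand-in for a chat-model call, not the paper's system.
    ROLES = ["general practitioner", "cardiologist", "nephrologist", "pharmacist"]

    def ask_llm(role: str, case: str, transcript: list[str]) -> str:
        return f"[{role}'s assessment, given the case and prior discussion]"  # placeholder

    def mdt_discussion(case: str, rounds: int = 2) -> list[str]:
        transcript: list[str] = []
        for _ in range(rounds):
            for role in ROLES:
                transcript.append(f"{role}: {ask_llm(role, case, transcript)}")
        return transcript  # a final agent would synthesize the therapy plan

    print(mdt_discussion("78-year-old with hypertension, CKD stage 3, and atrial fibrillation"))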
--------------------------------------------------------------------------------------------------------
The integration of artificial intelligence in journalism represents a transformative shift in how news is created, distributed, and consumed, yet comprehensive understanding of this intersection remains fragmented. This systematic review analyzes global research trends from 2010-2025, revealing sharp increases in AI journalism research post-2020, with focus areas including automation, misinformation detection, and ethical governance. The analysis reveals cautious optimism tempered by concerns about bias and accountability. Applications include automated content generation, fact-checking systems, personalized news delivery, misinformation detection, and newsroom workflow optimization. This research could guide responsible AI adoption in journalism, inform policy decisions about AI in media, and highlight the need for inclusive research perspectives from underrepresented regions to ensure equitable development of AI journalism technologies globally.
Authors: Mohammad Al Masum Molla, Md Manjurul Ahsan
Link: https://arxiv.org/abs/2507.10891v1
Date: 2025-07-d
Summary:
Artificial Intelligence (AI) is reshaping journalistic practices across the globe, offering new opportunities while raising ethical, professional, and societal concerns. This study presents a comprehensive systematic review of published articles on AI in journalism from 2010 to 2025. Following the Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) 2020 guidelines, a total of 72 peer-reviewed articles were selected from Scopus and Web of Science databases. The analysis combines bibliometric mapping and qualitative thematic synthesis to identify dominant trends, technologies, geographical distributions, and ethical debates. Additionally, sentiment analysis was performed on article abstracts using the Valence Aware Dictionary and sEntiment Reasoner (VADER) algorithm to capture evaluative tones across the literature. The findings show a sharp increase in research activity after 2020, with prominent focus areas including automation, misinformation, and ethical governance. While most studies reflect cautious optimism, concerns over bias, transparency, and accountability remain persistent. The review also highlights regional disparities in scholarly contributions, with limited representation from the Global South. By integrating quantitative and qualitative insights, this study offers a multi-dimensional understanding of how AI is transforming journalism and proposes future research directions for inclusive and responsible innovation.
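The sentiment step is off-the-shelf: VADER assigns each abstract a compound score in [-1, 1]. A minimal sketch with the vaderSentiment package and illustrative texts:

    # VADER compound sentiment per abstract, mirroring the review's sentiment
    # step. Example texts are illustrative.
    from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

    analyzer = SentimentIntensityAnalyzer()
    abstracts = [
        "AI tools improve newsroom efficiency and expand fact-checking capacity.",
        "Automated news raises serious concerns about bias and accountability.",
    ]
    for text in abstracts:
        print(analyzer.polarity_scores(text)["compound"], "-", text[:48])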
--------------------------------------------------------------------------------------------------------