Week Ending 6.15.2025
RESEARCH WATCH: 6.15.2025
Autonomous vehicle safety relies heavily on predicting the behavior of surrounding vehicles, particularly during high-risk maneuvers like lane changes. This research bridges the critical gap between theoretical advances and practical deployment by implementing a real-world lane change prediction system using Knowledge Graph Embeddings and Bayesian inference. The system successfully anticipates lane changes 3-4 seconds in advance, enabling proactive braking responses. Applications include enhanced autonomous driving safety systems, advanced driver assistance technologies, and traffic management systems that can preemptively adjust to prevent accidents and optimize traffic flow.
Authors: M. Manzour, Catherine M. Elias, Omar M. Shehata, R. Izquierdo, M. A. Sotelo
Link: https://arxiv.org/abs/2506.11925v1
Date: 2025-06-13
Summary:
Research on lane change prediction has gained considerable momentum in recent years. However, most research is confined to simulation or results obtained from datasets, leaving a gap between algorithmic advances and on-road deployment. This work closes that gap by demonstrating, on real hardware, a lane-change prediction system based on Knowledge Graph Embeddings (KGEs) and Bayesian inference. Moreover, the ego vehicle employs a longitudinal braking action to ensure the safety of both itself and the surrounding vehicles. Our architecture consists of two modules: (i) a perception module that senses the environment, derives input numerical features, converts them into linguistic categories, and communicates them to the prediction module; and (ii) a pretrained prediction module that executes a KGE and Bayesian inference model to anticipate the target vehicle's maneuver and transforms the prediction into a longitudinal braking action. Real-world hardware experimental validation demonstrates that our prediction system anticipates the target vehicle's lane change three to four seconds in advance, providing the ego vehicle sufficient time to react and allowing the target vehicle to make the lane change safely.
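A minimal sketch of the two-module flow described above, in Python; the feature thresholds, linguistic categories, KGE plausibility scores, prior, and braking threshold below are illustrative assumptions, not values from the paper:

```python
# Perception -> (KGE + Bayes) prediction -> braking, with toy numbers throughout.
from dataclasses import dataclass

@dataclass
class TargetVehicleState:
    lateral_velocity: float   # m/s, toward the ego lane
    ttc: float                # time-to-collision with its leading vehicle, s

def perceive(state: TargetVehicleState) -> dict:
    """Perception module: numerical features -> linguistic categories (assumed thresholds)."""
    lat = ("high" if state.lateral_velocity > 0.4
           else "medium" if state.lateral_velocity > 0.15 else "low")
    ttc = "short" if state.ttc < 3.0 else "long"
    return {"lateral_velocity": lat, "ttc": ttc}

def predict_lane_change(categories: dict, kge_scores: dict) -> float:
    """Prediction module: KGE plausibility feeds a Bayesian update (illustrative numbers)."""
    prior = 0.1                                   # assumed prior probability of a lane change
    p_obs_given_lc = kge_scores.get((categories["lateral_velocity"], categories["ttc"]), 0.2)
    p_obs_given_no_lc = 0.05
    evidence = p_obs_given_lc * prior + p_obs_given_no_lc * (1 - prior)
    return p_obs_given_lc * prior / evidence

def braking_command(p_lane_change: float) -> float:
    """Map the predicted probability to a longitudinal deceleration (m/s^2, toy rule)."""
    return -2.5 if p_lane_change > 0.6 else 0.0

scores = {("high", "short"): 0.9}                 # toy KGE plausibility scores
state = TargetVehicleState(lateral_velocity=0.5, ttc=2.0)
p = predict_lane_change(perceive(state), scores)
print(p, braking_command(p))
```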
--------------------------------------------------------------------------------------------------------
DaMO: A Data-Efficient Multimodal Orchestrator for Temporal Reasoning with Video LLMs
Video understanding remains a significant challenge for AI systems, particularly when requiring precise temporal reasoning about specific moments. DaMO addresses this limitation through a novel hierarchical dual-stream architecture that efficiently processes both visual and audio information while maintaining computational efficiency. The system's four-stage progressive training paradigm enables sophisticated video-language understanding with reduced data requirements. Key applications include video surveillance systems requiring temporal event detection, educational platforms needing precise video moment analysis, and content moderation tools that must identify specific problematic segments within longer video content.
Authors: Bo-Cheng Chiu, Jen-Jee Chen, Yu-Chee Tseng, Feng-Chi Chen
Link: https://arxiv.org/abs/2506.11558v1
Date: 2025-06-13
Summary:
Large Language Models (LLMs) have recently been extended to the video domain, enabling sophisticated video-language understanding. However, existing Video LLMs often exhibit limitations in fine-grained temporal reasoning, restricting their ability to precisely attribute responses to specific video moments, especially under constrained supervision. We introduce DaMO, a data-efficient Video LLM explicitly designed for accurate temporal reasoning and multimodal understanding. At its core, the proposed Temporal-aware Fuseformer employs a hierarchical dual-stream architecture that progressively captures temporal dynamics within each modality and effectively fuses complementary visual and audio information. To further enhance computational efficiency, DaMO integrates a global residual that reduces spatial redundancy while preserving essential semantic details. We train DaMO via a structured four-stage progressive training paradigm, incrementally equipping the model with multimodal alignment, semantic grounding, and temporal reasoning capabilities. This work also contributes multiple datasets augmented from existing ones with GPT-generated temporally grounded QA pairs for tasks requiring temporal supervision. Comprehensive experiments on temporal grounding and video QA benchmarks demonstrate that DaMO consistently surpasses prior methods, particularly in tasks demanding precise temporal alignment and reasoning. Our work establishes a promising direction for data-efficient video-language modeling.
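A rough sketch of the staged-training idea (freeze everything except the modules trained in the current stage); the module names, stage contents, and toy dimensions are assumptions standing in for DaMO's actual components:

```python
# Four-stage progressive training on a toy model: each stage unfreezes a different subset.
import torch
import torch.nn as nn

class ToyVideoLLM(nn.Module):
    def __init__(self):
        super().__init__()
        self.visual_encoder = nn.Linear(512, 256)   # stands in for the dual-stream encoders
        self.fuseformer = nn.Linear(256, 256)       # stands in for the Temporal-aware Fuseformer
        self.projector = nn.Linear(256, 128)        # maps fused features into the LLM space
        self.llm_head = nn.Linear(128, 100)         # stands in for the (adapter-tuned) LLM

STAGES = [  # roughly: alignment -> semantic grounding -> temporal reasoning -> instruction tuning
    ["projector"],
    ["projector", "fuseformer"],
    ["fuseformer", "llm_head"],
    ["llm_head"],
]

model = ToyVideoLLM()
for trainable in STAGES:
    for name, module in model.named_children():
        for p in module.parameters():
            p.requires_grad = name in trainable     # freeze everything not in this stage
    model.zero_grad(set_to_none=True)
    opt = torch.optim.AdamW([p for p in model.parameters() if p.requires_grad], lr=1e-4)
    x = torch.randn(4, 512)                         # toy batch
    loss = model.llm_head(model.projector(model.fuseformer(model.visual_encoder(x)))).mean()
    loss.backward()
    opt.step()
```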
--------------------------------------------------------------------------------------------------------
Demographic bias in facial recognition systems poses serious societal risks, yet existing metrics often fail to detect subtle disparities, particularly in edge cases. The Comprehensive Equity Index introduces a novel approach by separately analyzing genuine and impostor score distributions while focusing on tail probabilities where biases commonly manifest. This methodology provides superior sensitivity for detecting nuanced biases that traditional metrics miss. Applications span law enforcement facial recognition auditing, border security system evaluation, workplace access control fairness assessment, and regulatory compliance testing for any AI system requiring demographic bias evaluation and mitigation.
Authors: Imanol Solano, Julian Fierrez, Aythami Morales, Alejandro Peña, Ruben Tolosana, Francisco Zamora-Martinez, Javier San Agustin
Link: https://arxiv.org/abs/2506.10564v1
Date: 2025-06-12
Summary:
Demographic bias in high-performance face recognition (FR) systems often eludes detection by existing metrics, especially with respect to subtle disparities in the tails of the score distribution. We introduce the Comprehensive Equity Index (CEI), a novel metric designed to address this limitation. CEI uniquely analyzes genuine and impostor score distributions separately, enabling a configurable focus on tail probabilities while also considering overall distribution shapes. Our extensive experiments (evaluating state-of-the-art FR systems, intentionally biased models, and diverse datasets) confirm CEI's superior ability to detect nuanced biases where previous methods fall short. Furthermore, we present CEI^A, an automated version of the metric that enhances objectivity and simplifies practical application. CEI provides a robust and sensitive tool for operational FR fairness assessment. The proposed methods have been developed particularly for bias evaluation in face biometrics but, in general, they are applicable for comparing statistical distributions in any problem where one is interested in analyzing the distribution tails.
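The general shape of a tail-focused fairness comparison can be sketched as below; this is not the exact CEI formula, and the tail fraction, weighting, and toy score distributions are assumptions:

```python
# Compare genuine and impostor score distributions between two demographic groups,
# emphasizing the distribution tails where subtle biases tend to appear.
import numpy as np

def tail_gap(scores_a, scores_b, tail=0.05, lower=True):
    """Gap between the mean of the extreme `tail` fraction of two score distributions."""
    q = tail if lower else 1 - tail
    cut_a, cut_b = np.quantile(scores_a, q), np.quantile(scores_b, q)
    tail_a = scores_a[scores_a <= cut_a] if lower else scores_a[scores_a >= cut_a]
    tail_b = scores_b[scores_b <= cut_b] if lower else scores_b[scores_b >= cut_b]
    return abs(tail_a.mean() - tail_b.mean())

def equity_index(genuine, impostor, group_a, group_b, tail=0.05, w_tail=0.8):
    """Combine tail gaps (low genuine tail, high impostor tail) with overall mean gaps."""
    g_tail = tail_gap(genuine[group_a], genuine[group_b], tail, lower=True)
    i_tail = tail_gap(impostor[group_a], impostor[group_b], tail, lower=False)
    g_body = abs(genuine[group_a].mean() - genuine[group_b].mean())
    i_body = abs(impostor[group_a].mean() - impostor[group_b].mean())
    return w_tail * (g_tail + i_tail) + (1 - w_tail) * (g_body + i_body)

rng = np.random.default_rng(0)
genuine = {"A": rng.normal(0.80, 0.05, 10000), "B": rng.normal(0.78, 0.08, 10000)}
impostor = {"A": rng.normal(0.20, 0.05, 10000), "B": rng.normal(0.22, 0.07, 10000)}
print(equity_index(genuine, impostor, "A", "B"))
```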
--------------------------------------------------------------------------------------------------------
Unsupervised Elicitation of Language Models
As language models develop superhuman capabilities, obtaining quality human supervision becomes increasingly difficult or impossible. Internal Coherence Maximization represents a breakthrough approach that enables models to improve through self-generated labels without external supervision. The method matches golden supervision performance while significantly outperforming crowdsourced human labels, particularly valuable for domains where human expertise is limited. Applications include training advanced AI assistants in specialized fields, developing models for emerging domains lacking expert annotators, scientific research acceleration, and creating more capable AI systems that can self-improve beyond current human knowledge boundaries.
Authors: Jiaxin Wen, Zachary Ankner, Arushi Somani, Peter Hase, Samuel Marks, Jacob Goldman-Wetzler, Linda Petrini, Henry Sleight, Collin Burns, He He, Shi Feng, Ethan Perez, Jan Leike
Link: https://arxiv.org/abs/2506.10139v1
Date: 2025-06-11
Summary:
To steer pretrained language models for downstream tasks, today's post-training paradigm relies on humans to specify desired behaviors. However, for models with superhuman capabilities, it is difficult or impossible to get high-quality human supervision. To address this challenge, we introduce a new unsupervised algorithm, Internal Coherence Maximization (ICM), to fine-tune pretrained language models on their own generated labels, without external supervision. On GSM8k-verification, TruthfulQA, and Alpaca reward modeling tasks, our method matches the performance of training on golden supervision and outperforms training on crowdsourced human supervision. On tasks where LMs' capabilities are strongly superhuman, our method can elicit those capabilities significantly better than training on human labels. Finally, we show that our method can improve the training of frontier LMs: we use our method to train an unsupervised reward model and use reinforcement learning to train a Claude 3.5 Haiku-based assistant. Both the reward model and the assistant outperform their human-supervised counterparts.
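A very rough sketch of the underlying idea, searching for a labeling of the model's own outputs that maximizes an internal coherence score, is shown below; the scoring functions, the greedy search (the paper's search procedure is more sophisticated), and the toy data are assumptions:

```python
# Greedy label search: flip one label at a time and keep flips that raise a coherence score
# built from LM-based mutual predictability minus a logical-consistency penalty.
import random

def coherence(labels, examples, lm_score, consistency_penalty):
    mutual = sum(lm_score(examples, labels, i) for i in range(len(labels)))
    return mutual - consistency_penalty(examples, labels)

def icm_search(examples, lm_score, consistency_penalty, iters=1000, seed=0):
    rng = random.Random(seed)
    labels = [rng.choice([0, 1]) for _ in examples]     # random initial labeling
    best = coherence(labels, examples, lm_score, consistency_penalty)
    for _ in range(iters):
        i = rng.randrange(len(labels))
        labels[i] ^= 1                                  # propose flipping one label
        score = coherence(labels, examples, lm_score, consistency_penalty)
        if score >= best:
            best = score                                # keep improving flips
        else:
            labels[i] ^= 1                              # revert otherwise
    return labels

# Toy stand-ins: a "coherent" label equals the parity of the example value.
examples = list(range(20))
lm_score = lambda ex, lab, i: 1.0 if lab[i] == ex[i] % 2 else 0.0
consistency_penalty = lambda ex, lab: 0.0
print(icm_search(examples, lm_score, consistency_penalty))
```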
--------------------------------------------------------------------------------------------------------
V-JEPA 2: Self-Supervised Video Models Enable Understanding, Prediction and Planning
Understanding and acting in the physical world represents a fundamental AI challenge requiring integration of perception, prediction, and planning capabilities. V-JEPA 2 combines massive internet-scale video data with minimal robot interaction data to create models capable of real-world planning and manipulation. The system achieves state-of-the-art performance across multiple video understanding tasks while enabling zero-shot robotic deployment for object manipulation. Applications include warehouse automation, household robotics, manufacturing systems requiring flexible object handling, and any robotic application where extensive task-specific training data is unavailable or impractical to collect.
Authors: Mido Assran, Adrien Bardes, David Fan, Quentin Garrido, Russell Howes, Mojtaba Komeili, Matthew Muckley, Ammar Rizvi, Claire Roberts, Koustuv Sinha, Artem Zholus, Sergio Arnaud, Abha Gejji, Ada Martin, Francois Robert Hogan, Daniel Dugas, Piotr Bojanowski, Vasil Khalidov, Patrick Labatut, Francisco Massa, Marc Szafraniec, Kapil Krishnakumar, Yong Li, Xiaodong Ma, Sarath Chandar, Franziska Meier, Yann LeCun, Michael Rabbat, Nicolas Ballas
Link: https://arxiv.org/abs/2506.09985v1
Date: 2025-06-11
Summary:
A major challenge for modern AI is to learn to understand the world and learn to act largely by observation. This paper explores a self-supervised approach that combines internet-scale video data with a small amount of interaction data (robot trajectories), to develop models capable of understanding, predicting, and planning in the physical world. We first pre-train an action-free joint-embedding-predictive architecture, V-JEPA 2, on a video and image dataset comprising over 1 million hours of internet video. V-JEPA 2 achieves strong performance on motion understanding (77.3 top-1 accuracy on Something-Something v2) and state-of-the-art performance on human action anticipation (39.7 recall-at-5 on Epic-Kitchens-100) surpassing previous task-specific models. Additionally, after aligning V-JEPA 2 with a large language model, we demonstrate state-of-the-art performance on multiple video question-answering tasks at the 8 billion parameter scale (e.g., 84.0 on PerceptionTest, 76.9 on TempCompass). Finally, we show how self-supervised learning can be applied to robotic planning tasks by post-training a latent action-conditioned world model, V-JEPA 2-AC, using less than 62 hours of unlabeled robot videos from the Droid dataset. We deploy V-JEPA 2-AC zero-shot on Franka arms in two different labs and enable picking and placing of objects using planning with image goals. Notably, this is achieved without collecting any data from the robots in these environments, and without any task-specific training or reward. This work demonstrates how self-supervised learning from web-scale data and a small amount of robot interaction data can yield a world model capable of planning in the physical world.
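A hedged sketch of planning toward an image goal with a latent, action-conditioned world model follows; the toy encoder and predictor, the random-shooting planner, and all dimensions are stand-ins rather than the V-JEPA 2-AC implementation:

```python
# Model-predictive control in latent space: encode the goal image, roll candidate action
# sequences through a latent dynamics model, and execute the first action of the best one.
import torch
import torch.nn as nn

encoder = nn.Linear(64, 32)                      # stands in for the video encoder
predictor = nn.Linear(32 + 4, 32)                # latent dynamics: (state, action) -> next state

def plan(obs, goal_img, horizon=5, samples=256):
    with torch.no_grad():
        z0 = encoder(obs)
        z_goal = encoder(goal_img)
        actions = torch.randn(samples, horizon, 4)        # candidate action sequences
        z = z0.expand(samples, -1)
        for t in range(horizon):
            z = predictor(torch.cat([z, actions[:, t]], dim=-1))
        cost = ((z - z_goal) ** 2).sum(-1)                 # distance to goal in latent space
        return actions[cost.argmin(), 0]                   # execute only the first action

obs, goal = torch.randn(64), torch.randn(64)
print(plan(obs, goal))
```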
--------------------------------------------------------------------------------------------------------
How Do People Revise Inconsistent Beliefs? Examining Belief Revision in Humans with User Studies
Understanding human belief revision patterns is crucial for developing AI systems that can effectively model and align with human reasoning processes. This research reveals that people consistently prefer explanation-based belief revisions over minimal changes suggested by classical theory, even when these revisions seem suboptimal from a theoretical perspective. The findings have significant implications for AI development, particularly systems designed to interact with humans or model human decision-making. Applications include personalized recommendation systems, educational AI tutors, negotiation support systems, and any human-AI collaboration where understanding and predicting human reasoning patterns is essential for effective interaction.
Authors: Stylianos Loukas Vasileiou, Antonio Rago, Maria Vanina Martinez, William Yeoh
Link: https://arxiv.org/abs/2506.09977v1
Date: 2025-06-11
Summary:
Understanding how humans revise their beliefs in light of new information is crucial for developing AI systems which can effectively model, and thus align with, human reasoning. While theoretical belief revision frameworks rely on a set of principles that establish how these operations are performed, empirical evidence from cognitive psychology suggests that people may follow different patterns when presented with conflicting information. In this paper, we present three comprehensive user studies showing that people consistently prefer explanation-based revisions, i.e., revisions guided by explanations, which result in changes to their belief systems that are not necessarily captured by classical belief change theory. Our experiments systematically investigate how people revise their beliefs with explanations for inconsistencies, whether they are provided with them or left to formulate them themselves, demonstrating a robust preference for what may seem non-minimal revisions across different types of scenarios. These findings have implications for AI systems designed to model human reasoning or interact with humans, suggesting that such systems should accommodate explanation-based, potentially non-minimal belief revision operators to better align with human cognitive processes.
--------------------------------------------------------------------------------------------------------
CausalVQA: A Physically Grounded Causal Reasoning Benchmark for Video Models
Current video understanding models excel at surface-level perception but struggle with deeper causal reasoning about physical world interactions. CausalVQA introduces a comprehensive benchmark focusing on five question types: counterfactual, hypothetical, anticipation, planning, and descriptive reasoning. The dataset reveals significant gaps between current AI capabilities and human performance, particularly in anticipation and hypothetical scenarios. Applications include robotics systems requiring physical world prediction, autonomous vehicle decision-making, safety system design, educational AI for physics and engineering, and any application requiring AI systems to understand cause-and-effect relationships in real-world scenarios.
Authors: Aaron Foss, Chloe Evans, Sasha Mitts, Koustuv Sinha, Ammar Rizvi, Justine T. Kao
Link: https://arxiv.org/abs/2506.09943v1
Date: 2025-06-11
Summary:
We introduce CausalVQA, a benchmark dataset for video question answering (VQA) composed of question-answer pairs that probe models' understanding of causality in the physical world. Existing VQA benchmarks either tend to focus on surface perceptual understanding of real-world videos, or on narrow physical reasoning questions created using simulation environments. CausalVQA fills an important gap by presenting challenging questions that are grounded in real-world scenarios, while focusing on models' ability to predict the likely outcomes of different actions and events through five question types: counterfactual, hypothetical, anticipation, planning and descriptive. We designed quality control mechanisms that prevent models from exploiting trivial shortcuts, requiring models to base their answers on deep visual understanding instead of linguistic cues. We find that current frontier multimodal models fall substantially below human performance on the benchmark, especially on anticipation and hypothetical questions. This highlights a challenge for current systems to leverage spatial-temporal reasoning, understanding of physical principles, and comprehension of possible alternatives to make accurate predictions in real-world settings.
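A small sketch of the kind of per-question-type evaluation breakdown such a benchmark supports; the field names and exact-match scoring are assumptions, not the official evaluation harness:

```python
# Per-question-type accuracy over the five CausalVQA categories.
from collections import defaultdict

QUESTION_TYPES = ["counterfactual", "hypothetical", "anticipation", "planning", "descriptive"]

def evaluate(examples, model_answer):
    correct, total = defaultdict(int), defaultdict(int)
    for ex in examples:
        total[ex["type"]] += 1
        if model_answer(ex["video"], ex["question"]) == ex["answer"]:
            correct[ex["type"]] += 1
    return {t: correct[t] / total[t] for t in QUESTION_TYPES if total[t]}

examples = [  # toy items with assumed fields
    {"video": "clip1.mp4", "type": "anticipation", "question": "What happens next?", "answer": "falls"},
    {"video": "clip2.mp4", "type": "descriptive", "question": "What is being held?", "answer": "cup"},
]
print(evaluate(examples, lambda video, question: "cup"))
```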
--------------------------------------------------------------------------------------------------------
ReasonMed: A 370K Multi-Agent Generated Dataset for Advancing Medical Reasoning
Medical question answering demands sophisticated reasoning capabilities that combine domain knowledge with logical inference, yet current language models underperform in this critical area. ReasonMed addresses this challenge through a massive dataset of 370,000 high-quality medical reasoning examples, created using multi-agent verification and refinement processes. The resulting ReasonMed-7B model sets new benchmarks for medical AI, even outperforming much larger models. Applications include clinical decision support systems, medical education platforms, telemedicine consultation assistance, medical research acceleration, and healthcare AI systems requiring sophisticated reasoning about complex medical scenarios and patient care decisions.
Authors: Yu Sun, Xingyu Qian, Weiwen Xu, Hao Zhang, Chenghao Xiao, Long Li, Yu Rong, Wenbing Huang, Qifeng Bai, Tingyang Xu
Link: https://arxiv.org/abs/2506.09513v1
Date: 2025-06-11
Summary:
Though reasoning-based large language models (LLMs) have excelled in mathematics and programming, their capabilities in knowledge-intensive medical question answering remain underexplored. To address this, we introduce ReasonMed, the largest medical reasoning dataset, comprising 370k high-quality examples distilled from 1.7 million initial reasoning paths generated by various LLMs. ReasonMed is constructed through a multi-agent verification and refinement process, where we design an Error Refiner to enhance the reasoning paths by identifying and correcting error-prone steps flagged by a verifier. Leveraging ReasonMed, we systematically investigate best practices for training medical reasoning models and find that combining detailed Chain-of-Thought (CoT) reasoning with concise answer summaries yields the most effective fine-tuning strategy. Based on this strategy, we train ReasonMed-7B, which sets a new benchmark for sub-10B models, outperforming the prior best by 4.17% and even exceeding LLaMA3.1-70B on PubMedQA by 4.60%.
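A hedged sketch of a verify-and-refine loop of the sort described (generators propose reasoning paths, a verifier flags error-prone steps, an Error Refiner revises them); the agent interfaces and acceptance rule are assumptions, with toy stand-ins so the sketch runs:

```python
# Multi-agent data construction: generate -> verify -> refine, keeping only verified paths.
def build_example(question, answer, generators, verifier, error_refiner, max_rounds=2):
    candidates = [g(question) for g in generators]            # multiple LLMs propose CoT paths
    kept = []
    for path in candidates:
        verdict = verifier(question, answer, path)            # flags error-prone steps
        for _ in range(max_rounds):
            if verdict["ok"]:
                break
            path = error_refiner(question, path, verdict["flagged_steps"])
            verdict = verifier(question, answer, path)
        if verdict["ok"]:
            kept.append({"question": question,
                         "cot": path,                          # detailed chain of thought
                         "summary": path.splitlines()[-1]})    # concise answer summary
    return kept

# Toy stand-ins for the LLM agents.
gen = lambda q: f"Step 1: consider {q}\nStep 2: recall the relevant guideline\nAnswer: B"
verifier = lambda q, a, p: {"ok": p.endswith(a), "flagged_steps": [2]}
refiner = lambda q, p, steps: p.rsplit("\n", 1)[0] + "\nAnswer: B"
print(build_example("Which drug is first-line?", "Answer: B", [gen], verifier, refiner))
```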
--------------------------------------------------------------------------------------------------------
Multi-Agent Language Models: Advancing Cooperation, Coordination, and Adaptation
The development of theory of mind capabilities in language models represents a crucial step toward effective human-AI collaboration and multi-agent coordination. This research investigates whether LLMs can understand and reason about others' intentions through cooperative multi-agent reinforcement learning frameworks. The work aims to create AI agents capable of seamless collaboration with both artificial and human partners. Applications include collaborative robotics teams, human-AI workplace partnerships, educational group learning systems, negotiation and mediation platforms, and any scenario requiring AI systems to understand, predict, and appropriately respond to the intentions and goals of their human or artificial collaborators.
Authors: Arjun Vaithilingam Sudhakar
Link: https://arxiv.org/abs/2506.09331v1
Date: 2025-06-11
Summary:
Modern Large Language Models (LLMs) exhibit impressive zero-shot and few-shot generalization capabilities across complex natural language tasks, enabling their widespread use as virtual assistants for diverse applications such as translation and summarization. Despite being trained solely on large corpora of text without explicit supervision on author intent, LLMs appear to infer the underlying meaning of textual interactions. This raises a fundamental question: can LLMs model and reason about the intentions of others, i.e., do they possess a form of theory of mind? Understanding others' intentions is crucial for effective collaboration, which underpins human societal success and is essential for cooperative interactions among multiple agents, including humans and autonomous systems. In this work, we investigate theory of mind in LLMs through the lens of cooperative multi-agent reinforcement learning (MARL), where agents learn to collaborate via repeated interactions, mirroring human social reasoning. Our approach aims to enhance artificial agents' ability to adapt and cooperate with both artificial and human partners. By leveraging LLM-based agents capable of natural language interaction, we move towards creating hybrid human-AI systems that can foster seamless collaboration, with broad implications for the future of human-artificial interaction.
--------------------------------------------------------------------------------------------------------
Improved LLM Agents for Financial Document Question Answering
Financial document analysis requires sophisticated numerical reasoning across complex tabular and textual data, presenting unique challenges for language models. This research develops improved critic and calculator agents that outperform existing approaches while providing safer, more reliable financial analysis. The work addresses critical limitations of traditional methods when oracle labels are unavailable, making the approach more practical for real-world deployment. Applications include automated financial report analysis, investment decision support, regulatory compliance checking, audit assistance, loan application processing, and any financial technology requiring accurate numerical reasoning over complex document structures.
Authors: Nelvin Tan, Zian Seng, Liang Zhang, Yu-Ching Shih, Dong Yang, Amol Salunkhe
Link: https://arxiv.org/abs/2506.08726v1
Date: 2025-06-10
Summary:
Large language models (LLMs) have shown impressive capabilities on numerous natural language processing tasks. However, LLMs still struggle with numerical question answering for financial documents that include tabular and textual data. Recent works have shown the effectiveness of critic agents (i.e., self-correction) for this task given oracle labels. Building upon this framework, this paper examines the effectiveness of the traditional critic agent when oracle labels are not available and shows, through experiments, that this critic agent's performance deteriorates in this scenario. With this in mind, we present an improved critic agent, along with a calculator agent, which together outperform the previous state-of-the-art approach (program-of-thought) and are safer. Furthermore, we investigate how our agents interact with each other, and how this interaction affects their performance.
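A hedged sketch of a critic-plus-calculator arrangement for numerical QA without oracle labels; the prompts, stopping rule, and expression format are assumptions, not the paper's agents:

```python
# The LLM proposes an arithmetic expression, a restricted calculator evaluates it, and a
# critic prompt (with no oracle answer available) decides whether to accept or revise.
def answer_with_agents(question, document, llm, max_revisions=2):
    draft = llm(f"Given:\n{document}\nQuestion: {question}\nWrite the arithmetic expression.")
    for _ in range(max_revisions):
        value = safe_eval(draft)                              # calculator agent, not the LLM
        critique = llm(f"Question: {question}\nExpression: {draft}\nResult: {value}\n"
                       "Reply OK if consistent with the document, else suggest a fix.")
        if critique.strip().startswith("OK"):                 # critic has no oracle label here
            return value
        draft = critique
    return safe_eval(draft)

def safe_eval(expr):
    """Evaluate a bare arithmetic expression; reject anything else."""
    allowed = set("0123456789.+-*/() ")
    if not expr or set(expr) - allowed:
        raise ValueError(f"not a plain arithmetic expression: {expr!r}")
    return eval(expr, {"__builtins__": {}}, {})

# Toy LLM stand-in: proposes an expression, then approves it on critique.
fake_llm = lambda prompt: "(120 - 80) / 80" if "arithmetic" in prompt else "OK"
print(answer_with_agents("What is revenue growth?", "Revenue: 80 -> 120", fake_llm))
```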
--------------------------------------------------------------------------------------------------------
Efficient Generation of Diverse Cooperative Agents with World Models
Zero-Shot Coordination requires training agents that can effectively collaborate with previously unseen partners, but current methods are computationally expensive and sample inefficient. XPM-WM introduces world model-based trajectory simulation to dramatically improve training efficiency while maintaining the diversity of collaborative conventions necessary for robust coordination. This approach enables scalable generation of cooperative agent populations without training each agent from scratch. Applications include multi-robot coordination systems, autonomous vehicle fleets, distributed computing networks, collaborative AI assistants, and any multi-agent system requiring effective coordination without prior interaction or explicit communication protocols.
Authors: Yi Loo, Akshunn Trivedi, Malika Meghjani
Link: https://arxiv.org/abs/2506.07450v1
Date: 2025-06-09
Summary:
A major bottleneck in the training process for Zero-Shot Coordination (ZSC) agents is the generation of partner agents that are diverse in collaborative conventions. Current Cross-play Minimization (XPM) methods for population generation can be very computationally expensive and sample inefficient, as the training objective requires sampling multiple types of trajectories. Each partner agent in the population is also trained from scratch, despite all of the partners in the population learning policies for the same coordination task. In this work, we propose that simulated trajectories from the dynamics model of an environment can drastically speed up the training process for XPM methods. We introduce XPM-WM, a framework for generating simulated trajectories for XPM via a learned World Model (WM). We show that XPM with simulated trajectories removes the need to sample multiple types of trajectories. In addition, we show our proposed method can effectively generate partners with diverse conventions that match the performance of previous methods in terms of self-play (SP) population training reward, as well as training partners for ZSC agents. Our method is thus significantly more sample-efficient and scalable to a larger number of partners.
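A minimal sketch of the core efficiency idea, collecting trajectories from a learned world model instead of the real environment; the toy world model and policy below are stand-ins, and the cross-play-minimization objective itself is left out:

```python
# Imagined rollouts: trajectories are produced entirely inside a learned dynamics model,
# so generating diverse partners no longer requires fresh environment samples.
import random

class LearnedWorldModel:
    """Stand-in for a world model trained on real interaction data."""
    def reset(self):
        return 0                                                  # initial latent state
    def step(self, state, action):
        return state + 1, 1.0 if action == state % 2 else 0.0, state + 1 >= 8

def imagined_rollout(policy, wm):
    """Collect one trajectory without any simulator calls."""
    state, traj, done = wm.reset(), [], False
    while not done:
        action = policy(state)
        next_state, reward, done = wm.step(state, action)
        traj.append((state, action, reward))
        state = next_state
    return traj

# Trajectories like these would then feed the usual cross-play-minimization updates that
# produce partners with diverse conventions.
policy = lambda s: random.Random(s).choice([0, 1])
print(sum(r for _, _, r in imagined_rollout(policy, LearnedWorldModel())))
```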
--------------------------------------------------------------------------------------------------------
Autoregressive Semantic Visual Reconstruction Helps VLMs Understand Better
Current vision-language models primarily focus on text generation while underutilizing visual information, leading to incomplete understanding and missed visual details. Autoregressive Semantic Visual Reconstruction enables joint learning across visual and textual modalities within a unified framework, improving multimodal comprehension by 5% across benchmarks. The key insight involves reconstructing semantic rather than raw visual representations. Applications include enhanced image captioning systems, visual question answering platforms, medical image analysis, autonomous vehicle perception, educational content creation, and any application requiring deep integration of visual understanding with language generation capabilities.
Authors: Dianyi Wang, Wei Song, Yikun Wang, Siyuan Wang, Kaicheng Yu, Zhongyu Wei, Jiaqi Wang
Link: https://arxiv.org/abs/2506.09040v1
Date: 2025-06-10
Summary:
Typical large vision-language models (LVLMs) apply autoregressive supervision solely to textual sequences, without fully incorporating the visual modality into the learning process. This results in three key limitations: (1) an inability to utilize images without accompanying captions, (2) the risk that captions omit critical visual details, and (3) the challenge that certain vision-centric content cannot be adequately conveyed through text. As a result, current LVLMs often prioritize vision-to-language alignment while potentially overlooking fine-grained visual information. While some prior works have explored autoregressive image generation, effectively leveraging autoregressive visual supervision to enhance image understanding remains an open challenge. In this paper, we introduce Autoregressive Semantic Visual Reconstruction (ASVR), which enables joint learning of visual and textual modalities within a unified autoregressive framework. We show that autoregressively reconstructing the raw visual appearance of images does not enhance and may even impair multimodal understanding. In contrast, autoregressively reconstructing the semantic representation of images consistently improves comprehension. Notably, we find that even when models are given continuous image features as input, they can effectively reconstruct discrete semantic tokens, resulting in stable and consistent improvements across a wide range of multimodal understanding benchmarks. Our approach delivers significant performance gains across varying data scales (556k-2M) and types of LLM backbones. Specifically, ASVR improves LLaVA-1.5 by 5% in average scores across 14 multimodal benchmarks. The code is available at https://github.com/AlenjandroWang/ASVR.
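A hedged sketch of the joint objective: standard next-token supervision on text plus autoregressive prediction of discrete semantic visual tokens; the shapes, vocabulary sizes, and loss weighting are assumptions:

```python
# Joint autoregressive loss over text tokens and discrete semantic visual tokens.
import torch
import torch.nn.functional as F

def asvr_loss(text_logits, text_targets, visual_logits, visual_token_targets, w_visual=1.0):
    """text_logits: (B, Lt, V_text); visual_logits: (B, Lv, V_visual)."""
    text_loss = F.cross_entropy(text_logits.flatten(0, 1), text_targets.flatten())
    # Targets are discrete semantic tokens (e.g., from a semantic tokenizer), not raw pixels:
    # the abstract reports that reconstructing raw appearance does not help, while semantics does.
    visual_loss = F.cross_entropy(visual_logits.flatten(0, 1), visual_token_targets.flatten())
    return text_loss + w_visual * visual_loss

B, Lt, Lv, Vt, Vv = 2, 16, 32, 1000, 512
loss = asvr_loss(torch.randn(B, Lt, Vt), torch.randint(0, Vt, (B, Lt)),
                 torch.randn(B, Lv, Vv), torch.randint(0, Vv, (B, Lv)))
print(loss)
```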
--------------------------------------------------------------------------------------------------------
What Users Value and Critique: Large-Scale Analysis of User Feedback on AI-Powered Mobile Apps
As AI features proliferate across mobile applications, understanding user perceptions and satisfaction drivers becomes crucial for successful deployment and adoption. This comprehensive analysis of 894,000 AI-specific reviews across 292 apps reveals consistent patterns: users value productivity, reliability, and personalization while criticizing technical failures, pricing, and language limitations. The multi-stage analysis pipeline enables high-precision feedback processing at unprecedented scale. Applications include product development guidance for AI features, user experience optimization, market research for AI applications, app store optimization strategies, and strategic planning for any organization developing consumer-facing AI technologies.
Authors: Vinaik Chhetri, Krishna Upadhyay, A. B. Siddique, Umar Farooq
Link: https://arxiv.org/abs/2506.10785v1
Date: 2025-06-12
Summary:
Artificial Intelligence (AI)-powered features have rapidly proliferated across mobile apps in various domains, including productivity, education, entertainment, and creativity. However, how users perceive, evaluate, and critique these AI features remains largely unexplored, primarily due to the overwhelming volume of user feedback. In this work, we present the first comprehensive, large-scale study of user feedback on AI-powered mobile apps, leveraging a curated dataset of 292 AI-driven apps across 14 categories with 894K AI-specific reviews from Google Play. We develop and validate a multi-stage analysis pipeline that begins with a human-labeled benchmark and systematically evaluates large language models (LLMs) and prompting strategies. Each stage, including review classification, aspect-sentiment extraction, and clustering, is validated for accuracy and consistency. Our pipeline enables scalable, high-precision analysis of user feedback, extracting over one million aspect-sentiment pairs clustered into 18 positive and 15 negative user topics. Our analysis reveals that users consistently focus on a narrow set of themes: positive comments emphasize productivity, reliability, and personalized assistance, while negative feedback highlights technical failures (e.g., scanning and recognition), pricing concerns, and limitations in language support. Our pipeline surfaces both satisfaction with one feature and frustration with another within the same review. These fine-grained, co-occurring sentiments are often missed by traditional approaches that treat positive and negative feedback in isolation or rely on coarse-grained analysis. As a result, our approach provides a more faithful reflection of real-world user experiences with AI-powered apps. Category-aware analysis further uncovers both universal drivers of satisfaction and domain-specific frustrations.
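The pipeline's three-stage structure (classification, aspect-sentiment extraction, clustering) can be sketched as below; the stage functions are toy stand-ins for the LLM-based, benchmark-validated components described in the paper:

```python
# Stage 1: keep AI-related reviews; Stage 2: extract (aspect, sentiment) pairs;
# Stage 3: embed and cluster positive and negative aspects into topics.
def analyze_reviews(reviews, classify, extract_aspects, embed, cluster):
    ai_reviews = [r for r in reviews if classify(r) == "ai-related"]           # stage 1
    pairs = [p for r in ai_reviews for p in extract_aspects(r)]                # stage 2
    pos = [a for a, s in pairs if s == "positive"]
    neg = [a for a, s in pairs if s == "negative"]
    return cluster([embed(a) for a in pos]), cluster([embed(a) for a in neg])  # stage 3

# Toy stand-ins so the sketch runs end to end.
classify = lambda r: "ai-related" if "AI" in r else "other"
extract_aspects = lambda r: ([("summarization quality", "positive")] if "summar" in r
                             else [("price", "negative")])
embed = lambda a: [len(a)]
cluster = lambda vecs: {"n_items": len(vecs)}
print(analyze_reviews(["The AI summarizer is great", "AI plan is too expensive"],
                      classify, extract_aspects, embed, cluster))
```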
--------------------------------------------------------------------------------------------------------
SALT: A Lightweight Model Adaptation Method for Closed Split Computing Environments
Split computing environments, where models are distributed between edge devices and cloud servers, often involve proprietary systems that prevent traditional adaptation methods. SALT introduces a practical solution through client-side adapters that refine latent features without accessing or modifying the original models. This approach enables personalized inference while maintaining communication efficiency and respecting system constraints. Applications include personalized mobile AI services, edge computing optimization, privacy-preserving model adaptation, telecommunications network optimization, and any distributed AI system where model components are proprietary or communication bandwidth is limited.
Authors: Yuya Okada, Takayuki Nishio
Link: https://arxiv.org/abs/2506.07355v1
Date: 2025-06-09
Summary:
We propose SALT (Split-Adaptive Lightweight Tuning), a lightweight model adaptation framework for Split Computing under closed constraints, where the head and tail networks are proprietary and inaccessible to users. In such closed environments, conventional adaptation methods are infeasible since they require access to model parameters or architectures. SALT addresses this challenge by introducing a compact, trainable adapter on the client side to refine latent features from the head network, enabling user-specific adaptation without modifying the original models or increasing communication overhead. We evaluate SALT on user-specific classification tasks with CIFAR-10 and CIFAR-100, demonstrating improved accuracy with lower training latency compared to fine-tuning methods. Furthermore, SALT facilitates model adaptation for robust inference over lossy networks, a common challenge in edge-cloud environments. With minimal deployment overhead, SALT offers a practical solution for personalized inference in edge AI systems under strict system constraints.
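A hedged sketch of the setup: a small trainable adapter on the client refines latent features between a frozen head and a frozen tail; the sizes, the residual form of the adapter, and the toy task are assumptions:

```python
# Only the adapter is trained; the proprietary head and tail stay frozen, and the latent
# feature size (and thus communication cost) is unchanged.
import torch
import torch.nn as nn

head = nn.Sequential(nn.Linear(32, 16)).requires_grad_(False)   # proprietary edge-side head
tail = nn.Sequential(nn.Linear(16, 10)).requires_grad_(False)   # proprietary server-side tail

adapter = nn.Sequential(                         # compact trainable module on the client
    nn.Linear(16, 8), nn.ReLU(), nn.Linear(8, 16)
)

opt = torch.optim.Adam(adapter.parameters(), lr=1e-3)
x, y = torch.randn(64, 32), torch.randint(0, 10, (64,))
for _ in range(10):
    with torch.no_grad():
        z = head(x)                              # latent features leaving the head, same size
    logits = tail(z + adapter(z))                # residual refinement of the latent features
    loss = nn.functional.cross_entropy(logits, y)
    opt.zero_grad()
    loss.backward()
    opt.step()
print(loss.item())
```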
--------------------------------------------------------------------------------------------------------
Robustly Improving LLM Fairness in Realistic Settings via Interpretability
Large language models increasingly make high-stakes hiring decisions, yet existing bias mitigation methods fail when realistic contextual details are introduced. This research reveals that simple anti-bias prompts become ineffective in real-world scenarios, while internal bias mitigation through activation editing achieves robust bias reduction across all tested conditions. The approach generalizes across leading commercial and open-source models. Applications include hiring and recruitment platforms, performance evaluation systems, loan approval processes, academic admissions support, and any high-stakes decision-making system where demographic fairness is crucial for ethical and legal compliance.
Authors: Adam Karvonen, Samuel Marks
Link: https://arxiv.org/abs/2506.10922v1
Date: 2025-06-12
Summary:
Large language models (LLMs) are increasingly deployed in high-stakes hiring applications, making decisions that directly impact people's careers and livelihoods. While prior studies suggest simple anti-bias prompts can eliminate demographic biases in controlled evaluations, we find these mitigations fail when realistic contextual details are introduced. We address these failures through internal bias mitigation: by identifying and neutralizing sensitive attribute directions within model activations, we achieve robust bias reduction across all tested scenarios. Across leading commercial (GPT-4o, Claude 4 Sonnet, Gemini 2.5 Flash) and open-source models (Gemma-2 27B, Gemma-3, Mistral-24B), we find that adding realistic context such as company names, culture descriptions from public careers pages, and selective hiring constraints (e.g., "only accept candidates in the top 10%") induces significant racial and gender biases (up to 12% differences in interview rates). When these biases emerge, they consistently favor Black over White candidates and female over male candidates across all tested models and scenarios. Moreover, models can infer demographics and become biased from subtle cues like college affiliations, with these biases remaining invisible even when inspecting the model's chain-of-thought reasoning. To address these limitations, our internal bias mitigation identifies race- and gender-correlated directions and applies affine concept editing at inference time. Despite using directions from a simple synthetic dataset, the intervention generalizes robustly, consistently reducing bias to very low levels (typically under 1%, always below 2.5%) while largely maintaining model performance. Our findings suggest that practitioners deploying LLMs for hiring should adopt more realistic evaluation methodologies and consider internal mitigation strategies for equitable outcomes.
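A simplified sketch of the internal mitigation idea, estimating a demographic-correlated direction from contrasting activations and removing that component at inference; the paper applies affine concept editing inside specific model layers, while the mean-difference direction and projection below are illustrative assumptions:

```python
# Difference-of-means direction estimation plus an edit that zeroes the activation
# component along that direction.
import torch

def demographic_direction(acts_group_a, acts_group_b):
    """Unit direction separating activations for contrasting demographic cues."""
    d = acts_group_a.mean(0) - acts_group_b.mean(0)
    return d / d.norm()

def edit(activation, direction, target=0.0):
    """Set the component of `activation` along `direction` to `target`."""
    coeff = activation @ direction
    return activation - (coeff - target).unsqueeze(-1) * direction

# Toy activations: group A carries a small offset along a hidden "bias" axis.
torch.manual_seed(0)
base = torch.randn(100, 64)
offset = torch.zeros(64)
offset[3] = 1.0
acts_a, acts_b = base + 0.5 * offset, base.clone()
d = demographic_direction(acts_a, acts_b)
edited = edit(acts_a, d)
print((acts_a @ d).mean().item(), (edited @ d).mean().item())   # component before/after editing
```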
--------------------------------------------------------------------------------------------------------
Outside Knowledge Conversational Video (OKCV) Dataset – Dialoguing over Videos
Video-based dialogue systems must integrate visual understanding with external knowledge while maintaining conversational context, representing a significant challenge for current AI systems. The OKCV dataset provides 2,017 videos with extensive human-annotated dialogues requiring both visual grounding and external knowledge integration. Current models show substantial performance gaps compared to humans, highlighting the complexity of this task. Applications include educational video platforms with interactive Q&A, customer support systems for visual products, medical consultation platforms using diagnostic videos, and any conversational AI system requiring sophisticated integration of visual content with external knowledge sources.
Authors: Benjamin Reichman, Constantin Patsch, Jack Truxal, Atishay Jain, Larry Heck
Link: https://arxiv.org/abs/2506.09953v1
Date: 2025-06-11
Summary:
In outside knowledge visual question answering (OK-VQA), the model must identify relevant visual information within an image and incorporate external knowledge to accurately respond to a question. Extending this task to a visually grounded dialogue setting based on videos, a conversational model must both recognize pertinent visual details over time and answer questions where the required information is not necessarily present in the visual information. Moreover, the context of the overall conversation must be considered for the subsequent dialogue. To explore this task, we introduce a dataset comprised of 2,017 videos with 5,986 human-annotated dialogues consisting of 40,954 interleaved dialogue turns. While the dialogue context is visually grounded in specific video segments, the questions further require external knowledge that is not visually present. Thus, the model not only has to identify relevant video parts but also leverage external knowledge to converse within the dialogue. We further provide several baselines evaluated on our dataset and show future challenges associated with this task. The dataset is made publicly available here: https://github.com/c-patsch/OKCV.
--------------------------------------------------------------------------------------------------------
Understanding Human-AI Trust in Education
As AI chatbots become prevalent in educational settings, understanding how students develop trust toward these anthropomorphic systems becomes crucial for effective implementation. This research reveals that students develop a distinct form of human-AI trust that differs from both interpersonal and technology trust models, with both human-like and system-like trust factors influencing student engagement and learning outcomes. Applications include educational chatbot design, online learning platform development, AI tutoring system optimization, student support service enhancement, and any educational technology requiring appropriate trust calibration for effective learning outcomes and student engagement.
Authors: Griffin Pitts, Sanaz Motamedi
Link: https://arxiv.org/abs/2506.09160v2
Date: 2025-06-12
Summary:
As AI chatbots become increasingly integrated in education, students are turning to these systems for guidance, feedback, and information. However, the anthropomorphic characteristics of these chatbots create ambiguity regarding whether students develop trust toward them as they would a human peer or instructor, based in interpersonal trust, or as they would any other piece of technology, based in technology trust. This ambiguity presents theoretical challenges, as interpersonal trust models may inappropriately ascribe human intentionality and morality to AI, while technology trust models were developed for non-social technologies, leaving their applicability to anthropomorphic systems unclear. To address this gap, we investigate how human-like and system-like trusting beliefs comparatively influence students' perceived enjoyment, trusting intention, behavioral intention to use, and perceived usefulness of an AI chatbot - factors associated with students' engagement and learning outcomes. Through partial least squares structural equation modeling, we found that human-like and system-like trust significantly influenced student perceptions, with varied effects. Human-like trust more strongly predicted trusting intention, while system-like trust better predicted behavioral intention and perceived usefulness. Both had similar effects on perceived enjoyment. Given the partial explanatory power of each type of trust, we propose that students develop a distinct form of trust with AI chatbots (human-AI trust) that differs from human-human and human-technology models of trust. Our findings highlight the need for new theoretical frameworks specific to human-AI trust and offer practical insights for fostering appropriately calibrated trust, which is critical for the effective adoption and pedagogical impact of AI in education.
--------------------------------------------------------------------------------------------------------
Video content analysis requires sophisticated spatiotemporal reasoning that current vision-language models struggle to achieve, particularly for detailed frame-by-frame understanding. Video-CoT introduces 192,000 fine-grained question-answer pairs with 23,000 Chain-of-Thought annotations to address this limitation. The comprehensive benchmark reveals significant performance gaps in current models, highlighting the complexity of spatiotemporal video understanding. Applications include video surveillance systems, sports analysis platforms, medical procedure evaluation, educational video analysis, content moderation systems, and any application requiring detailed understanding of temporal relationships and spatial dynamics within video content.
Authors: Shuyi Zhang, Xiaoshuai Hao, Yingbo Tang, Lingfeng Zhang, Pengwei Wang, Zhongyuan Wang, Hongxuan Ma, Shanghang Zhang
Link: https://arxiv.org/abs/2506.08817v3
Date: 2025-06-12
Summary:
Video content comprehension is essential for various applications, ranging from video analysis to interactive systems. Despite advancements in large-scale vision-language models (VLMs), these models often struggle to capture the nuanced, spatiotemporal details essential for thorough video analysis. To address this gap, we introduce Video-CoT, a groundbreaking dataset designed to enhance spatiotemporal understanding using Chain-of-Thought (CoT) methodologies. Video-CoT contains 192,000 fine-grained spatiotemporal question-answer pairs and 23,000 high-quality CoT-annotated samples, providing a solid foundation for evaluating spatiotemporal understanding in video comprehension. Additionally, we provide a comprehensive benchmark for assessing these tasks, with each task featuring 750 images and tailored evaluation metrics. Our extensive experiments reveal that current VLMs face significant challenges in achieving satisfactory performance, highlighting the difficulties of effective spatiotemporal understanding. Overall, the Video-CoT dataset and benchmark open new avenues for research in multimedia understanding and support future innovations in intelligent systems requiring advanced video analysis capabilities. By making these resources publicly available, we aim to encourage further exploration in this critical area. Project website: https://video-cot.github.io/.
--------------------------------------------------------------------------------------------------------
StepProof: Step-by-step verification of natural language mathematical proofs
Interactive theorem provers provide powerful formal verification capabilities but lack natural language interfaces, limiting accessibility for mathematicians and students. StepProof bridges this gap through granular, sentence-level verification that breaks complete proofs into verifiable subproofs. This approach significantly improves proof success rates while enabling more intuitive interaction with formal verification systems. Applications include mathematical education platforms, automated proof checking systems, research collaboration tools, peer review assistance, and any mathematical or logical reasoning system requiring verification of natural language arguments with formal rigor.
Authors: Xiaolin Hu, Qinghua Zhou, Bogdan Grechuk, Ivan Y. Tyukin
Link: https://arxiv.org/abs/2506.10558v1
Date: 2025-06-12
Summary:
Interactive theorem provers (ITPs) are powerful tools for the formal verification of mathematical proofs down to the axiom level. However, their lack of a natural language interface remains a significant limitation. Recent advancements in large language models (LLMs) have enhanced the understanding of natural language inputs, paving the way for autoformalization - the process of translating natural language proofs into formal proofs that can be verified. Despite these advancements, existing autoformalization approaches are limited to verifying complete proofs and lack the capability for finer, sentence-level verification. To address this gap, we propose StepProof, a novel autoformalization method designed for granular, step-by-step verification. StepProof breaks down complete proofs into multiple verifiable subproofs, enabling sentence-level verification. Experimental results demonstrate that StepProof significantly improves proof success rates and efficiency compared to traditional methods. Additionally, we found that minor manual adjustments to the natural language proofs, tailoring them for step-level verification, further enhanced StepProof's performance in autoformalization.
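The step-by-step pattern can be sketched as below: split the natural-language proof into sentences, autoformalize each, and verify the resulting subproofs individually; the autoformalize and check_with_prover functions are placeholders for an LLM and an interactive theorem prover, not real APIs:

```python
# Sentence-level verification loop: each sentence becomes its own verifiable subproof.
import re

def verify_stepwise(nl_proof, autoformalize, check_with_prover):
    steps = [s.strip() for s in re.split(r"(?<=[.!?])\s+", nl_proof) if s.strip()]
    results = []
    for i, step in enumerate(steps, 1):
        formal = autoformalize(step, context=steps[:i - 1])   # formalize one sentence at a time
        ok = check_with_prover(formal)                        # verify just this subproof
        results.append({"step": i, "sentence": step, "verified": ok})
    return results

# Toy stand-ins so the sketch runs end to end.
autoformalize = lambda s, context: f"theorem step : True := trivial  -- from: {s}"
check_with_prover = lambda formal: "True" in formal
proof = "Assume n is even. Then n = 2k for some k. Hence n^2 = 4k^2 is even."
for r in verify_stepwise(proof, autoformalize, check_with_prover):
    print(r)
```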
--------------------------------------------------------------------------------------------------------
Peer support expands mental health care access through community-based assistance, yet digital platform design for such support remains under-examined, particularly in Asian contexts. This Singapore-based study reveals the complex emotional labor and sociocultural dimensions shaping peer support practices across online, offline, and hybrid environments. The research provides design directions for culturally responsive digital tools that enhance rather than replace relational care. Applications include mental health platform design, community support system development, AI-augmented peer counseling, cultural adaptation of mental health technologies, and any digital health intervention requiring culturally sensitive design for effective peer-to-peer support systems.
Authors: Kellie Yu Hui Sim, Kenny Tsu Wei Choo
Link: https://arxiv.org/abs/2506.09362v1
Date: 2025-06-11
Summary:
Peer support plays a vital role in expanding access to mental health care by providing empathetic, community-based support outside formal clinical systems. As digital platforms increasingly mediate such support, the design and impact of these technologies remain under-examined, particularly in Asian contexts. This paper presents findings from an interview study with 20 peer supporters in Singapore, who operate across diverse online, offline, and hybrid environments. Through a thematic analysis, we unpack how participants start, conduct, and sustain peer support, highlighting their motivations, emotional labour, and the sociocultural dimensions shaping their practices. Building on this grounded understanding, we surface design directions for culturally responsive digital tools that scaffold rather than supplant relational care. Drawing insights from qualitative accounts, we offer a situated perspective on how AI might responsibly augment peer support. This research contributes to human-centred computing by articulating the lived realities of peer supporters and proposing design implications for trustworthy and context-sensitive AI in mental health.
--------------------------------------------------------------------------------------------------------