Week Ending 6.1.2025

 

RESEARCH WATCH: 6.1.2025

 

Rehearsal with Auxiliary-Informed Sampling for Audio Deepfake Detection

Audio deepfakes pose significant threats to information integrity and security in our increasingly digital world. Current detection systems struggle when encountering new, unseen deepfake techniques, creating a critical vulnerability. This research addresses the challenge through continual learning—enabling systems to learn from new attacks while retaining knowledge of previous ones. The proposed RAIS framework uses auxiliary labels to intelligently select diverse training samples, achieving a remarkably low average equal error rate of 1.953%. Applications span cybersecurity, media verification, legal evidence authentication, and social media platform integrity. As deepfake technology evolves rapidly, such adaptive detection systems become essential for maintaining trust in digital communications.

Authors:  Falih Gozi Febrinanto, Kristen Moore, Chandra Thapa, Jiangang Ma, Vidya Saikrishna, Feng Xia

Link:  https://arxiv.org/abs/2505.24486v1

Date: 2025-05-30

Summary:

The performance of existing audio deepfake detection frameworks degrades when confronted with new deepfake attacks. Rehearsal-based continual learning (CL), which updates models using a limited set of old data samples, helps preserve prior knowledge while incorporating new information. However, existing rehearsal techniques do not effectively capture the diversity of audio characteristics, introducing bias and increasing the risk of forgetting. To address this challenge, we propose Rehearsal with Auxiliary-Informed Sampling (RAIS), a rehearsal-based CL approach for audio deepfake detection. RAIS employs a label generation network to produce auxiliary labels, guiding diverse sample selection for the memory buffer. Extensive experiments show RAIS outperforms state-of-the-art methods, achieving an average Equal Error Rate (EER) of 1.953% across five experiences. The code is available at: https://github.com/falihgoz/RAIS.
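
To make the sampling idea concrete, here is a minimal Python sketch of how auxiliary labels might guide memory-buffer selection. The stratification scheme and distance-from-centroid diversity heuristic are illustrative assumptions, not the paper's exact procedure.

```python
import numpy as np

def auxiliary_informed_sampling(features, aux_labels, buffer_size, rng=None):
    """Select a diverse rehearsal buffer by stratifying over auxiliary labels.

    features:   (N, D) array of sample embeddings
    aux_labels: (N,) auxiliary labels, e.g. from a label-generation network
    Returns indices of the selected memory-buffer samples.
    """
    rng = rng or np.random.default_rng(0)
    classes = np.unique(aux_labels)
    per_class = max(1, buffer_size // len(classes))  # equal share per auxiliary class
    chosen = []
    for c in classes:
        idx = np.flatnonzero(aux_labels == c)
        # within a class, prefer samples far from the class centroid (diversity proxy)
        centroid = features[idx].mean(axis=0)
        dists = np.linalg.norm(features[idx] - centroid, axis=1)
        order = np.argsort(-dists)                   # most atypical first
        chosen.extend(idx[order[:per_class]])
    return np.array(chosen[:buffer_size])

# toy usage: 1000 samples, 4 auxiliary classes, buffer of 64
feats = np.random.default_rng(1).normal(size=(1000, 16))
labels = np.random.default_rng(2).integers(0, 4, size=1000)
print(auxiliary_informed_sampling(feats, labels, buffer_size=64).shape)  # (64,)
```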

--------------------------------------------------------------------------------------------------------

AMSbench: A Comprehensive Benchmark for Evaluating MLLM Capabilities in AMS Circuits

Analog and Mixed-Signal circuits form the backbone of modern electronics, from smartphones to medical devices, yet their design remains largely manual and expertise-dependent. This research introduces a comprehensive benchmark to evaluate whether advanced AI models can assist in automating circuit design—a longstanding engineering challenge. By testing prominent AI models on tasks like circuit analysis and schematic understanding, the study reveals significant limitations in current AI capabilities for complex engineering domains. The benchmark's 8,000 test questions span multiple difficulty levels, providing a rigorous evaluation framework. Successful AI integration could revolutionize electronic design automation, accelerate product development cycles, reduce design costs, and democratize access to specialized circuit design knowledge.

Authors:  Yichen Shi, Ze Zhang, Hongyang Wang, Zhuofu Tao, Zhongyi Li, Bingyu Chen, Yaxin Wang, Zhiping Yu, Ting-Jung Lin, Lei He

Link:  https://arxiv.org/abs/2505.24138v1

Date: 2025-05-30

Summary:

Analog/Mixed-Signal (AMS) circuits play a critical role in the integrated circuit (IC) industry. However, automating AMS circuit design has remained a longstanding challenge due to its difficulty and complexity. Recent advances in Multi-modal Large Language Models (MLLMs) offer promising potential for supporting AMS circuit analysis and design. Yet current research typically evaluates MLLMs on isolated tasks within the domain, lacking a comprehensive benchmark that systematically assesses model capabilities across diverse AMS-related challenges. To address this gap, we introduce AMSbench, a benchmark suite designed to evaluate MLLM performance across critical tasks including circuit schematic perception, circuit analysis, and circuit design. AMSbench comprises approximately 8,000 test questions spanning multiple difficulty levels and assesses eight prominent models, encompassing both open-source and proprietary solutions such as Qwen 2.5-VL and Gemini 2.5 Pro. Our evaluation highlights significant limitations in current MLLMs, particularly in complex multi-modal reasoning and sophisticated circuit design tasks. These results underscore the necessity of advancing MLLMs' understanding and effective application of circuit-specific knowledge, thereby narrowing the existing performance gap relative to human expertise and moving toward fully automated AMS circuit design workflows. Our data is released at https://huggingface.co/datasets/wwhhyy/AMSBench.

--------------------------------------------------------------------------------------------------------

GenIC: An LLM-Based Framework for Instance Completion in Knowledge Graphs

Knowledge graphs power intelligent systems from search engines to recommendation platforms, but they inevitably contain gaps—missing facts that limit their utility. Traditional approaches struggle with the complex task of predicting both relationships and entities when only partial information is available. This research leverages large language models' natural language understanding capabilities to address knowledge graph completion more effectively. The GenIC framework breaks the problem into property prediction and link prediction stages, utilizing entity descriptions and contextual information. Applications include enhancing search engine results, improving recommendation systems, supporting automated fact-checking, advancing question-answering systems, and enabling more comprehensive knowledge discovery across domains like biomedicine, finance, and scientific research.

Authors:  Amel Gader, Alsayed Algergawy

Link:  https://arxiv.org/abs/2505.24036v1

Date: 2025-05-29

Summary:

Knowledge graph completion aims to fill gaps in knowledge bases by adding new triples that represent facts. The complexity of this task depends on how many parts of a triple are already known. Instance completion involves predicting the relation-tail pair when only the head is given (h, ?, ?). Notably, modern knowledge bases often contain entity descriptions and types, which can provide valuable context for inferring missing facts. By leveraging these textual descriptions and the ability of large language models to extract facts from them and recognize patterns within the knowledge graph schema, we propose an LLM-powered, end-to-end instance completion approach. Specifically, we introduce GenIC: a two-step Generative Instance Completion framework. The first step focuses on property prediction, treated as a multi-label classification task. The second step is link prediction, framed as a generative sequence-to-sequence task. Experimental results on three datasets show that our method outperforms existing baselines. Our code is available at https://github.com/amal-gader/genic.
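
The two-step decomposition can be pictured as a short pipeline. The sketch below uses hypothetical stub functions in place of the trained multi-label classifier and seq2seq model; names and return values are illustrative only.

```python
# Hypothetical stand-ins for GenIC's two stages; the real system prompts an LLM.
def predict_properties(head: str, description: str) -> list[str]:
    """Step 1 (multi-label classification): which relations is this head missing?"""
    return ["birthPlace", "occupation"]   # a trained classifier would go here

def predict_tail(head: str, relation: str, description: str) -> str:
    """Step 2 (generative seq2seq): generate the tail entity for (h, r, ?)."""
    return f"<tail for {relation}>"       # a fine-tuned seq2seq model would go here

def complete_instance(head: str, description: str) -> list[tuple[str, str, str]]:
    """Instance completion (h, ?, ?): predict relation-tail pairs for a head entity."""
    return [(head, rel, predict_tail(head, rel, description))
            for rel in predict_properties(head, description)]

print(complete_instance("Ada_Lovelace", "English mathematician and writer."))
```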

--------------------------------------------------------------------------------------------------------

Bounded-Abstention Pairwise Learning to Rank

Ranking systems make critical decisions affecting people's lives—from job recommendations to medical treatment prioritization—yet they often operate without safety mechanisms. This research introduces abstention capabilities to ranking systems, allowing them to defer uncertain decisions to human experts rather than making potentially harmful recommendations. By identifying when confidence is too low for automated decision-making, the system prevents misranking in high-stakes scenarios. The theoretical foundation and model-agnostic algorithm make this approach widely applicable. Key applications include recruitment platforms avoiding biased candidate rankings, medical systems deferring complex diagnostic prioritizations, financial services abstaining from uncertain credit decisions, and educational platforms seeking human input for ambiguous student assessments, ultimately improving decision quality and reducing algorithmic harm.

Authors:  Antonio Ferrara, Andrea Pugnana, Francesco Bonchi, Salvatore Ruggieri

Link:  https://arxiv.org/abs/2505.23437v1

Date: 2025-05-29

Summary:

Ranking systems influence decision-making in high-stakes domains like health, education, and employment, where they can have substantial economic and social impacts. This makes the integration of safety mechanisms essential. One such mechanism is abstention, which enables an algorithmic decision-making system to defer uncertain or low-confidence decisions to human experts. While abstention has been predominantly explored in the context of classification tasks, its application to other machine learning paradigms remains underexplored. In this paper, we introduce a novel method for abstention in pairwise learning-to-rank tasks. Our approach is based on thresholding the ranker's conditional risk: the system abstains from making a decision when the estimated risk exceeds a predefined threshold. Our contributions are threefold: a theoretical characterization of the optimal abstention strategy, a model-agnostic, plug-in algorithm for constructing abstaining ranking models, and a comprehensive empirical evaluation across multiple datasets, demonstrating the effectiveness of our approach.
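
The thresholding rule itself is compact. Below is a minimal plug-in sketch assuming a probabilistic pairwise ranker: under 0-1 loss the conditional risk of the Bayes decision is min(p, 1-p), and the system abstains whenever that risk exceeds the threshold.

```python
import numpy as np

def abstaining_pairwise_ranker(p_pref, tau):
    """Plug-in abstention for pairwise learning-to-rank.

    p_pref: (M,) estimated probabilities that item i should precede item j
            for M candidate pairs (from any probabilistic ranker).
    tau:    abstention threshold on the conditional risk.
    Returns +1 (keep order), -1 (swap), or 0 (abstain / defer to a human).
    """
    p = np.asarray(p_pref)
    risk = np.minimum(p, 1.0 - p)        # conditional risk of the Bayes decision
    decision = np.where(p >= 0.5, 1, -1)
    return np.where(risk > tau, 0, decision)

# pairs with confident and borderline preference estimates
print(abstaining_pairwise_ranker([0.95, 0.55, 0.12, 0.49], tau=0.3))
# -> [ 1  0 -1  0]  (the borderline pairs are deferred)
```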

--------------------------------------------------------------------------------------------------------

MaCP: Minimal yet Mighty Adaptation via Hierarchical Cosine Projection

Large foundation models require enormous computational resources for fine-tuning, limiting their accessibility and practical deployment. This research introduces MaCP, a parameter-efficient adaptation method that achieves superior performance while dramatically reducing memory and computational requirements. By leveraging cosine projection's mathematical properties, the approach intelligently selects the most critical frequency components for adaptation. The method's versatility spans text understanding, generation, summarization, image classification, and video analysis. Applications include enabling fine-tuning on resource-constrained devices, reducing cloud computing costs for model customization, accelerating research in academic settings with limited budgets, facilitating rapid prototyping of specialized AI applications, and making advanced AI capabilities accessible to smaller organizations and individual developers.

Authors:  Yixian Shen, Qi Bi, Jia-Hong Huang, Hongyi Zhu, Andy D. Pimentel, Anuj Pathania

Link:  https://arxiv.org/abs/2505.23870v1

Date: 2025-05-29

Summary:

We present MaCP (Minimal yet Mighty adaptive Cosine Projection), a new adaptation method that achieves exceptional performance while requiring minimal parameters and memory for fine-tuning large foundation models. Its general idea is to exploit the superior energy compaction and decorrelation properties of cosine projection to improve both model efficiency and accuracy. Specifically, it projects the weight change from the low-rank adaptation into the discrete cosine space. Then, the weight change is partitioned over different levels of the discrete cosine spectrum, and each partition's most critical frequency components are selected. Extensive experiments demonstrate the effectiveness of MaCP across a wide range of single-modality tasks, including natural language understanding, natural language generation, text summarization, as well as multi-modality tasks such as image classification and video understanding. MaCP consistently delivers superior accuracy, significantly reduced computational complexity, and lower memory requirements compared to existing alternatives.
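
A rough sketch of the core recipe, assuming a LoRA-style low-rank update: project the update into the discrete cosine spectrum, split the spectrum into frequency bands, and keep only the most energetic coefficients per band. The diagonal banding scheme here is a simplification of the paper's hierarchical partitioning.

```python
import numpy as np
from scipy.fft import dctn, idctn

def macp_style_update(B, A, levels=3, keep_per_level=8):
    """Compress a low-rank weight change delta_W = B @ A in DCT space:
    partition the spectrum into bands and keep only the most energetic
    coefficients within each band."""
    dW = B @ A
    C = dctn(dW, norm="ortho")                            # decorrelated spectrum
    rows, cols = np.indices(C.shape)
    denom = max(C.shape[0] + C.shape[1] - 2, 1)
    band = np.clip((rows + cols) * levels // denom, 0, levels - 1)
    keep = np.zeros(C.shape, dtype=bool)
    for b in range(levels):
        idx = np.flatnonzero(band == b)                   # flat indices in band b
        top = idx[np.argsort(-np.abs(C).ravel()[idx])[:keep_per_level]]
        keep.ravel()[top] = True                          # most critical components
    return idctn(np.where(keep, C, 0.0), norm="ortho")    # sparse-spectrum update

rng = np.random.default_rng(0)
B, A = rng.normal(size=(32, 4)), rng.normal(size=(4, 32))
print(macp_style_update(B, A).shape)  # (32, 32)
```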

--------------------------------------------------------------------------------------------------------

Zero-to-Hero: Zero-Shot Initialization Empowering Reference-Based Video Appearance Editing

Video editing traditionally requires extensive manual work and specialized expertise, limiting creative expression and professional content creation. This research tackles reference-based video editing, where users can specify desired appearances through example images rather than ambiguous text descriptions. The Zero-to-Hero approach first edits a single anchor frame to match the reference, then propagates this appearance consistently across the entire video sequence. By addressing temporal consistency and memory efficiency challenges, the method enables high-quality video editing without requiring specialized hardware. Applications include content creation for social media, film post-production, advertising campaigns, educational video enhancement, personal video editing, and automated video style transfer, democratizing professional-quality video editing capabilities.

Authors:  Tongtong Su, Chengyu Wang, Jun Huang, Dongming Lu

Link:  https://arxiv.org/abs/2505.23134v1

Date: 2025-05-29

Summary:

Appearance editing according to user needs is a pivotal task in video editing. Existing text-guided methods often lead to ambiguities regarding user intentions and restrict fine-grained control over editing specific aspects of objects. To overcome these limitations, this paper introduces a novel approach named Zero-to-Hero, which focuses on reference-based video editing that disentangles the editing process into two distinct problems. It achieves this by first editing an anchor frame to satisfy user requirements as a reference image and then consistently propagating its appearance across other frames. We leverage correspondence within the original frames to guide the attention mechanism, which is more robust than previously proposed optical flow or temporal modules in memory-friendly video generative models, especially when dealing with objects exhibiting large motions. It offers a solid ZERO-shot initialization that ensures both accuracy and temporal consistency. However, intervention in the attention mechanism results in compounded imaging degradation with over-saturated colors and unknown blurring issues. Starting from Zero-Stage, our Hero-Stage Holistically learns a conditional generative model for vidEo RestOration. To accurately evaluate the consistency of the appearance, we construct a set of videos with multiple appearances using Blender, enabling a fine-grained and deterministic evaluation. Our method outperforms the best-performing baseline with a PSNR improvement of 2.6 dB. The project page is at https://github.com/Tonniia/Zero2Hero.

--------------------------------------------------------------------------------------------------------

VIRAL: Vision-grounded Integration for Reward design And Learning

Reinforcement learning agents require carefully designed reward functions to learn desired behaviors, but creating these rewards is notoriously difficult and error-prone. Poorly designed rewards can lead to unintended consequences or suboptimal learning. This research introduces VIRAL, which uses multimodal AI to automatically generate and refine reward functions based on visual goals and user feedback. The system can incorporate human guidance or use video descriptions to iteratively improve reward design. By accelerating behavior learning while ensuring alignment with user intentions, VIRAL addresses a fundamental challenge in AI safety and effectiveness. Applications include robotics training, game AI development, autonomous vehicle behavior design, personalized AI assistants, and any domain where specifying desired behaviors through traditional programming is challenging or impractical.

Authors:  Valentin Cuzin-Rambaud, Emilien Komlenovic, Alexandre Faure, Bruno Yun

Link:  https://arxiv.org/abs/2505.22092v2

Date: 2025-05-30

Summary:

The alignment between humans and machines is a critical challenge in artificial intelligence today. Reinforcement learning, which aims to maximize a reward function, is particularly vulnerable to the risks associated with poorly designed reward functions. Recent advances have shown that Large Language Models (LLMs) used for reward generation can outperform humans in this context. We introduce VIRAL, a pipeline for generating and refining reward functions through the use of multi-modal LLMs. VIRAL autonomously creates and interactively improves reward functions based on a given environment and a goal prompt or annotated image. The refinement process can incorporate human feedback or be guided by a description generated by a video LLM, which explains the agent's policy in video form. We evaluated VIRAL in five Gymnasium environments, demonstrating that it accelerates the learning of new behaviors while ensuring improved alignment with user intent. The source-code and demo video are available at: https://github.com/VIRAL-UCBL1/VIRAL and https://youtu.be/Hqo82CxVT38.
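
The refinement loop is easy to sketch at a high level. Everything below is a hypothetical stand-in: the stub functions only mark where the multimodal LLM, the video LLM, and RL training would plug in.

```python
# Hypothetical stubs marking where VIRAL's components would plug in.
def generate_reward_fn(env_description: str, goal_prompt: str) -> str:
    """A multimodal LLM writes a candidate reward function as source code."""
    return "def reward(obs): return -abs(obs['pole_angle'])"

def critique_policy_video(video_path: str) -> str:
    """A video LLM describes what the trained agent actually does."""
    return "The pole stays upright but the cart drifts off-screen."

def refine_reward_fn(reward_src: str, feedback: str) -> str:
    """The LLM revises the reward given human or video-LLM feedback."""
    return reward_src + "  # also penalize cart position"

def viral_loop(env_description, goal_prompt, train, rollout_video, iterations=3):
    """Generate -> train -> observe -> refine, in the spirit of the VIRAL pipeline."""
    reward_src = generate_reward_fn(env_description, goal_prompt)
    for _ in range(iterations):
        policy = train(reward_src)                # RL training with current reward
        feedback = critique_policy_video(rollout_video(policy))
        reward_src = refine_reward_fn(reward_src, feedback)
    return reward_src

print(viral_loop("CartPole-v1", "keep the pole upright",
                 train=lambda src: "policy", rollout_video=lambda p: "rollout.mp4"))
```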

--------------------------------------------------------------------------------------------------------

The Mirage of Multimodality: Where Truth is Tested and Honesty Unravels

Advanced reasoning models represent the cutting edge of AI capabilities, designed to perform complex inference through slower, more deliberate thinking processes. However, this research reveals a concerning paradox: when these sophisticated models encounter incomplete or misleading visual information, they're more likely to fabricate plausible but false details to support their reasoning. This "Mirage of Multimodality" phenomenon challenges assumptions about AI reliability and truthfulness. The finding has critical implications for AI deployment in high-stakes domains like healthcare, legal systems, financial analysis, and scientific research. Understanding these limitations is essential for developing more robust AI systems, implementing appropriate safety measures, and educating users about when to trust or question AI-generated explanations, particularly in multimodal contexts.

Authors:  Jiaming Ji, Sitong Fang, Wenjing Cao, Jiahao Li, Xuyao Wang, Juntao Dai, Chi-Min Chan, Sirui Han, Yike Guo, Yaodong Yang

Link:  https://arxiv.org/abs/2505.20214v1

Date: 2025-05-26

Summary:

Reasoning models have recently attracted significant attention, especially for tasks that involve complex inference. Their strengths exemplify the System II paradigm (slow, structured thinking), contrasting with System I (rapid, heuristic-driven) processing. Yet, does slower reasoning necessarily lead to greater truthfulness? Our findings suggest otherwise. In this study, we present the first systematic investigation of distortions associated with System I and System II reasoning in multimodal contexts. We demonstrate that slower reasoning models, when presented with incomplete or misleading visual inputs, are more likely to fabricate plausible yet false details to support flawed reasoning -- a phenomenon we term the "Mirage of Multimodality". To examine this, we constructed a 5,000-sample hierarchical prompt dataset annotated by 50 human participants. These prompts gradually increase in complexity, revealing a consistent pattern: slower reasoning models tend to employ depth-first thinking (delving deeper into incorrect premises), whereas faster chat models favor breadth-first inference, exhibiting greater caution under uncertainty. Our results highlight a critical vulnerability of slower reasoning models: although highly effective in structured domains such as mathematics, they become brittle when confronted with ambiguous multimodal inputs.

--------------------------------------------------------------------------------------------------------

A Responsible Face Recognition Approach for Small and Mid-Scale Systems Through Personalized Neural Networks

Traditional face recognition systems raise significant privacy and fairness concerns, particularly for smaller organizations that lack resources for comprehensive bias testing and privacy protection. This research proposes a novel approach replacing standard face templates with personalized neural networks for each individual. The MOTE system creates dedicated binary classifiers for each enrolled identity, enabling fine-grained fairness adjustments and enhanced privacy protection. While requiring more computational resources, this approach offers substantial improvements in responsible AI deployment. Applications include small business security systems, educational institution access control, healthcare facility patient identification, community organization member verification, and any context where privacy-preserving, fair face recognition is prioritized over maximum efficiency, particularly benefiting organizations seeking ethical AI implementations.

Authors:  Sebastian Groß, Stefan Heindorf, Philipp Terhörst

Link:  https://arxiv.org/abs/2505.19920v1

Date: 2025-05-26

Summary:

Traditional face recognition systems rely on extracting fixed face representations, known as templates, to store and verify identities. These representations are typically generated by neural networks that often lack explainability and raise concerns regarding fairness and privacy. In this work, we propose a novel model-template (MOTE) approach that replaces vector-based face templates with small personalized neural networks. This design enables more responsible face recognition for small and medium-scale systems. During enrollment, MOTE creates a dedicated binary classifier for each identity, trained to determine whether an input face matches the enrolled identity. Each classifier is trained using only a single reference sample, along with synthetically balanced samples to allow adjusting fairness at the level of a single individual during enrollment. Extensive experiments across multiple datasets and recognition systems demonstrate substantial improvements in fairness and particularly in privacy. Although the method increases inference time and storage requirements, it presents a strong solution for small- and mid-scale applications where fairness and privacy are critical.
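
A toy enrollment might look like the sketch below, where Gaussian jitter stands in for the paper's synthetically balanced samples and a logistic-regression head stands in for the personalized network; both substitutions are illustrative.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def enroll(ref_embedding, impostor_embeddings, n_synth=50, noise=0.05, seed=0):
    """Enroll one identity as its own binary classifier (MOTE-style sketch).

    The single reference embedding is augmented with synthetic positives so
    the classifier sees a balanced training set; the augmentation here
    (Gaussian jitter) is an illustrative assumption."""
    rng = np.random.default_rng(seed)
    positives = ref_embedding + noise * rng.normal(size=(n_synth, ref_embedding.size))
    negatives = impostor_embeddings[:n_synth]
    X = np.vstack([positives, negatives])
    y = np.array([1] * len(positives) + [0] * len(negatives))
    return LogisticRegression(max_iter=1000).fit(X, y)

def verify(model, probe_embedding, threshold=0.5):
    """The match decision comes from the enrollee's personal model, not a template."""
    return model.predict_proba(probe_embedding.reshape(1, -1))[0, 1] >= threshold

rng = np.random.default_rng(1)
ref = rng.normal(size=128)
impostors = rng.normal(size=(200, 128))
clf = enroll(ref, impostors)
print(verify(clf, ref + 0.03 * rng.normal(size=128)))  # likely True
```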

--------------------------------------------------------------------------------------------------------

OCN: Effectively Utilizing Higher-Order Common Neighbors for Better Link Prediction

Link prediction—determining which connections are likely to form in networks—is fundamental to understanding social relationships, recommending connections, and predicting system evolution. Current methods using common neighbors often suffer from redundancy and information loss when considering higher-order relationships. This research introduces Orthogonal Common Neighbor (OCN), which eliminates redundancy between different-order common neighbors while preventing over-smoothing of network information. The 7.7% improvement over existing methods represents significant progress in network analysis. Applications include social media friend recommendations, professional networking platforms, academic collaboration prediction, protein interaction discovery, supply chain optimization, fraud detection in financial networks, and urban planning for transportation systems, enabling more accurate predictions of future connections and system behavior.

Authors:  Juntong Wang, Xiyuan Wang, Muhan Zhang

Link:  https://arxiv.org/abs/2505.19719v1

Date: 2025-05-26

Summary:

Common Neighbors (CNs) and their higher-order variants are important pairwise features widely used in state-of-the-art link prediction methods. However, existing methods often struggle with the repetition across different orders of CNs and fail to fully leverage their potential. We identify that these limitations stem from two key issues: redundancy and over-smoothing in high-order common neighbors. To address these challenges, we design orthogonalization to eliminate redundancy between different-order CNs and normalization to mitigate over-smoothing. By combining these two techniques, we propose Orthogonal Common Neighbor (OCN), a novel approach that significantly outperforms the strongest baselines by an average of 7.7% on popular link prediction benchmarks. A thorough theoretical analysis is provided to support our method. Ablation studies also verify the effectiveness of our orthogonalization and normalization techniques.
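
The two ingredients, normalization and orthogonalization, can be sketched directly on raw k-order path counts. This is an illustrative rendering of the idea, not the paper's exact construction.

```python
import numpy as np

def orthogonal_cn_features(A, pairs, max_order=3):
    """Build k-order common-neighbor/walk counts (A^k)[u, v] per candidate pair,
    then (1) normalize columns so exploding path counts do not over-smooth, and
    (2) orthogonalize across orders so each order adds only new information."""
    feats, Ak = [], A.copy()
    for _ in range(max_order):
        Ak = Ak @ A                                   # A^2, A^3, ... walk counts
        feats.append([Ak[u, v] for u, v in pairs])
    X = np.array(feats, dtype=float).T                # (num_pairs, max_order)
    X /= np.linalg.norm(X, axis=0, keepdims=True) + 1e-12   # normalization
    Q, _ = np.linalg.qr(X)                            # Gram-Schmidt across orders
    return Q                                          # redundancy-free pair features

A = np.array([[0, 1, 1, 0],
              [1, 0, 1, 1],
              [1, 1, 0, 0],
              [0, 1, 0, 0]], dtype=float)
print(orthogonal_cn_features(A, [(0, 3), (2, 3), (0, 1)]).shape)  # (3, 3)
```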

--------------------------------------------------------------------------------------------------------

How Syntax Specialization Emerges in Language Models

Large language models demonstrate remarkable linguistic capabilities, but understanding how they develop internal representations of grammar and syntax remains largely mysterious. This research tracks the emergence of syntactic specialization during model training, revealing that language models develop brain-like neural specialization for grammatical structures. The study identifies a critical period of rapid specialization and shows how this process varies with model scale and training data. These insights advance our understanding of AI learning processes and have implications for developing more efficient training methods, designing better model architectures, creating more interpretable AI systems, and potentially informing theories of human language acquisition. Applications include improving language model training efficiency, developing better debugging tools for AI systems, and advancing cognitive science research.

Authors:  Xufeng Duan, Zhaoqian Yao, Yunhao Zhang, Shaonan Wang, Zhenguang G. Cai

Link:  https://arxiv.org/abs/2505.19548v1

Date: 2025-05-26

Summary:

Large language models (LLMs) have been found to develop surprising internal specializations: Individual neurons, attention heads, and circuits become selectively sensitive to syntactic structure, reflecting patterns observed in the human brain. While this specialization is well-documented, how it emerges during training and what influences its development remains largely unknown. In this work, we tap into the black box of specialization by tracking its formation over time. By quantifying internal syntactic consistency across minimal pairs from various syntactic phenomena, we identify a clear developmental trajectory: Syntactic sensitivity emerges gradually, concentrates in specific layers, and exhibits a 'critical period' of rapid internal specialization. This process is consistent across architectures and initialization parameters (e.g., random seeds), and is influenced by model scale and training data. We therefore reveal not only where syntax arises in LLMs but also how some models internalize it during training. To support future research, we will release the code, models, and training checkpoints upon acceptance.
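
One way to picture the measurement: at each layer, compare activation differences across minimal pairs and ask how consistently they align. The cosine-based consistency metric below is an assumption for illustration; the paper's exact quantification may differ.

```python
import numpy as np

def layer_syntax_consistency(acts_gram, acts_ungram):
    """Sketch of tracking syntactic specialization across checkpoints.

    acts_gram / acts_ungram: (num_pairs, num_layers, hidden) activations for
    the grammatical and ungrammatical member of each minimal pair. Returns,
    per layer, the mean pairwise cosine similarity of difference vectors --
    high values mean the layer separates the pairs in a consistent direction."""
    diffs = acts_gram - acts_ungram                            # (P, L, H)
    diffs = diffs / (np.linalg.norm(diffs, axis=-1, keepdims=True) + 1e-12)
    scores = []
    for layer in range(diffs.shape[1]):
        D = diffs[:, layer, :]                                 # (P, H)
        sim = D @ D.T                                          # pairwise cosines
        scores.append(sim[~np.eye(len(D), dtype=bool)].mean())
    return np.array(scores)   # track per checkpoint to see the 'critical period'

rng = np.random.default_rng(0)
g, u = rng.normal(size=(16, 12, 64)), rng.normal(size=(16, 12, 64))
print(layer_syntax_consistency(g, u).shape)  # (12,) -- one score per layer
```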

--------------------------------------------------------------------------------------------------------

SALE : Low-bit Estimation for Efficient Sparse Attention in Long-context LLM Prefilling

Processing long documents and conversations in large language models faces a fundamental bottleneck: attention computation grows quadratically with sequence length, making extended context processing prohibitively expensive. This research introduces SALE, a method that accelerates long-context processing by intelligently skipping less important parts of the attention computation while maintaining accuracy. Using 4-bit quantization and fine-grained importance estimation, SALE achieves speedups of at least 3.36x for sequences longer than 64K tokens. Applications include processing lengthy documents, maintaining extended conversations, analyzing research papers, legal document review, customer service chat histories, code repository analysis, and real-time processing of streaming content, making long-context AI capabilities more accessible and cost-effective for practical deployment.

Authors:  Xiaodong Ji, Hailin Zhang, Fangcheng Fu, Bin Cui

Link:  https://arxiv.org/abs/2505.24179v1

Date: 2025-05-30

Summary:

Many advanced Large Language Model (LLM) applications require long-context processing, but the self-attention module becomes a bottleneck during the prefilling stage of inference due to its quadratic time complexity with respect to sequence length. Existing sparse attention methods accelerate attention computation by skipping less significant regions of the attention map. However, these approaches typically perform coarse-grained inspection of the attention map, incurring considerable loss in model accuracy. In this paper, we propose SALE, a fine-grained sparse attention method that accelerates the long-context prefilling stage of LLMs with negligible loss in model accuracy. SALE achieves fast and accurate fine-grained attention weight estimation through 4-bit quantized query-key products, followed by block-sparse attention to accelerate prefilling computations. To evaluate the importance of query-key pairs, we adopt our Relative Attention Score metric, which offers significantly higher efficiency within our framework. We implement a custom CUDA kernel optimized for our approach for hardware efficiency, reducing the additional overhead to approximately 11% of the full attention latency. Notably, SALE requires no parameter training and can be seamlessly integrated into existing systems with trivial code modifications. Experiments on long-context benchmarks demonstrate that our method outperforms existing approaches in accuracy-efficiency trade-offs, achieving at least 3.36x speedups on Llama-3.1-8B for sequences longer than 64K while maintaining model quality.
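
A simplified sketch of the estimation step, assuming symmetric per-row 4-bit quantization and a per-block max as the importance proxy (the paper's Relative Attention Score is more refined):

```python
import numpy as np

def quantize_4bit(x):
    """Symmetric per-row 4-bit quantization (integer values in [-8, 7])."""
    scale = np.abs(x).max(axis=-1, keepdims=True) / 7.0 + 1e-12
    return np.clip(np.round(x / scale), -8, 7), scale

def sale_block_mask(Q, K, block=4, keep_ratio=0.25):
    """Estimate attention with low-bit query-key products, then keep only
    the highest-scoring blocks for exact block-sparse attention."""
    q4, qs = quantize_4bit(Q)
    k4, ks = quantize_4bit(K)
    approx = (q4 @ k4.T) * (qs * ks.T)           # cheap low-bit score estimate
    n_q, n_k = approx.shape
    nb_q, nb_k = n_q // block, n_k // block
    blocks = approx[:nb_q*block, :nb_k*block].reshape(nb_q, block, nb_k, block)
    importance = blocks.max(axis=(1, 3))         # per-block importance proxy
    k_keep = max(1, int(keep_ratio * importance.size))
    thresh = np.sort(importance.ravel())[-k_keep]
    return importance >= thresh                  # True = compute this block exactly

rng = np.random.default_rng(0)
Q, K = rng.normal(size=(32, 16)), rng.normal(size=(32, 16))
mask = sale_block_mask(Q, K)
print(mask.shape, mask.mean())  # (8, 8), roughly 0.25 of blocks kept
```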

--------------------------------------------------------------------------------------------------------

GeoMan: Temporally Consistent Human Geometry Estimation using Image-to-Video Diffusion

Accurately estimating human body shape and pose from videos is crucial for numerous applications but remains challenging due to temporal inconsistencies and lack of training data. This research introduces GeoMan, which reframes video geometry estimation as an image-to-video generation problem, leveraging large-scale video datasets while requiring minimal 4D training data. The approach produces temporally consistent depth and normal estimations, overcoming limitations of single-image methods. Applications include virtual reality and augmented reality experiences, motion capture for film and gaming, fitness and health monitoring applications, sports performance analysis, video editing and special effects, telepresence systems, and rehabilitation therapy guidance, enabling more realistic and accurate human representation in digital environments.

Authors:  Gwanghyun Kim, Xueting Li, Ye Yuan, Koki Nagano, Tianye Li, Jan Kautz, Se Young Chun, Umar Iqbal

Link:  https://arxiv.org/abs/2505.23085v1

Date: 2025-05-29

Summary:

Estimating accurate and temporally consistent 3D human geometry from videos is a challenging problem in computer vision. Existing methods, primarily optimized for single images, often suffer from temporal inconsistencies and fail to capture fine-grained dynamic details. To address these limitations, we present GeoMan, a novel architecture designed to produce accurate and temporally consistent depth and normal estimations from monocular human videos. GeoMan addresses two key challenges: the scarcity of high-quality 4D training data and the need for metric depth estimation to accurately model human size. To overcome the first challenge, GeoMan employs an image-based model to estimate depth and normals for the first frame of a video, which then conditions a video diffusion model, reframing the video geometry estimation task as an image-to-video generation problem. This design offloads the heavy lifting of geometric estimation to the image model and simplifies the video model's role to focus on intricate details while using priors learned from large-scale video datasets. Consequently, GeoMan improves temporal consistency and generalizability while requiring minimal 4D training data. To address the challenge of accurate human size estimation, we introduce a root-relative depth representation that retains critical human-scale details and is easier to estimate from monocular inputs, overcoming the limitations of traditional affine-invariant and metric depth representations. GeoMan achieves state-of-the-art performance in both qualitative and quantitative evaluations, demonstrating its effectiveness in overcoming longstanding challenges in 3D human geometry estimation from videos.
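
The root-relative idea reduces to a small change of coordinates; the sketch below, with an arbitrarily chosen root pixel, is illustrative of the representation rather than GeoMan's code.

```python
import numpy as np

def to_root_relative(depth, root_pixel):
    """Re-express metric depth relative to a human root joint (e.g. the pelvis).

    Subtracting the root depth keeps human-scale detail (limb offsets in
    metres) while discarding the global camera distance, which is hard to
    recover from a single monocular view."""
    return depth - depth[root_pixel]     # zero at the root, metric offsets elsewhere

def from_root_relative(rel_depth, root_depth_estimate):
    """Recover metric depth once any estimate of the root distance is available."""
    return rel_depth + root_depth_estimate

d = np.array([[2.9, 3.0],
              [3.1, 3.2]])               # toy metric depth map (metres)
rel = to_root_relative(d, (0, 1))
print(rel)                               # [[-0.1  0. ] [ 0.1  0.2]]
```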

--------------------------------------------------------------------------------------------------------

Humble AI in the real-world: the case of algorithmic hiring

AI systems increasingly make high-stakes decisions affecting people's careers and livelihoods, yet they often operate as "black boxes" without acknowledging their limitations or uncertainties. This research applies "Humble AI" principles to algorithmic hiring, advocating for AI systems that explicitly communicate their uncertainties and limitations to users. The approach includes uncertainty quantification, entropy estimates, and user interfaces that highlight algorithmic unknowns. By fostering transparency and encouraging human oversight, this work addresses critical concerns about AI bias and fairness in recruitment. Applications include hiring platforms, loan approval systems, medical diagnosis support, educational assessment tools, and any domain where AI decisions significantly impact individuals, promoting more responsible and trustworthy AI deployment.

Authors:  Rahul Nair, Inge Vejsbjerg, Elizabeth Daly, Christos Varytimidis, Bran Knowles

Link:  https://arxiv.org/abs/2505.20918v1

Date: 2025-05-27

Summary:

Humble AI (Knowles et al., 2023) argues for caution in AI development and deployment through scepticism (accounting for limitations of statistical learning), curiosity (accounting for unexpected outcomes), and commitment (accounting for multifaceted values beyond performance). We present a real-world case study for humble AI in the domain of algorithmic hiring. Specifically, we evaluate virtual screening algorithms in a widely used hiring platform that matches candidates to job openings. There are several challenges in misrecognition and stereotyping in such contexts that are difficult to assess through standard fairness and trust frameworks; e.g., someone with a non-traditional background is less likely to rank highly. We demonstrate the technical feasibility of translating humble AI principles into practice through uncertainty quantification of ranks, entropy estimates, and a user experience that highlights algorithmic unknowns. We describe preliminary discussions with focus groups made up of recruiters. Future user studies seek to evaluate whether the higher cognitive load of a humble AI system fosters a climate of trust in its outcomes.
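
One concrete way to surface algorithmic unknowns is to compute the entropy of each candidate's rank distribution under resampled scores. The bootstrap-style setup below is an assumption for illustration, not the platform's actual implementation.

```python
import numpy as np

def rank_uncertainty(score_samples):
    """Entropy of each candidate's rank distribution across score resamples.

    score_samples: (S, N) candidate scores under S resamples (bootstrap,
    MC dropout, etc.). High entropy flags rankings that a humble-AI UI
    should present as uncertain rather than definitive."""
    S, N = score_samples.shape
    ranks = np.argsort(np.argsort(-score_samples, axis=1), axis=1)  # 0 = top
    entropies = np.empty(N)
    for c in range(N):
        p = np.bincount(ranks[:, c], minlength=N) / S
        p = p[p > 0]
        entropies[c] = -(p * np.log2(p)).sum()
    return entropies   # 0 bits = stable rank; log2(N) = completely unknown

rng = np.random.default_rng(0)
base = np.array([0.9, 0.5, 0.48, 0.1])              # two near-tied candidates
samples = base + 0.05 * rng.normal(size=(2000, 4))
print(rank_uncertainty(samples).round(2))           # middle two show high entropy
```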

--------------------------------------------------------------------------------------------------------

OpenS2V-Nexus: A Detailed Benchmark and Million-Scale Dataset for Subject-to-Video Generation

Video generation technology lacks standardized evaluation methods and comprehensive datasets, limiting progress in creating videos that faithfully incorporate specific subjects or objects. This research establishes OpenS2V-Nexus, combining a detailed benchmark with a massive 5-million sample dataset for subject-to-video generation. The framework evaluates models on subject consistency, naturalness, and text relevance, providing rigorous assessment tools for video generation capabilities. The dataset ensures subject diversity through cross-video associations and synthetic multi-view representations. Applications include content creation for marketing, personalized video generation, educational content development, entertainment industry tools, social media content creation, and automated video production, providing the infrastructure needed to advance video generation research and deployment.

Authors:  Shenghai Yuan, Xianyi He, Yufan Deng, Yang Ye, Jinfa Huang, Bin Lin, Jiebo Luo, Li Yuan

Link:  https://arxiv.org/abs/2505.20292v3

Date: 2025-05-28

Summary:

Subject-to-Video (S2V) generation aims to create videos that faithfully incorporate reference content, providing enhanced flexibility in the production of videos. To establish the infrastructure for S2V generation, we propose OpenS2V-Nexus, consisting of (i) OpenS2V-Eval, a fine-grained benchmark, and (ii) OpenS2V-5M, a million-scale dataset. In contrast to existing S2V benchmarks inherited from VBench that focus on global and coarse-grained assessment of generated videos, OpenS2V-Eval focuses on the model's ability to generate subject-consistent videos with natural subject appearance and identity fidelity. For these purposes, OpenS2V-Eval introduces 180 prompts from seven major categories of S2V, which incorporate both real and synthetic test data. Furthermore, to accurately align human preferences with S2V benchmarks, we propose three automatic metrics, NexusScore, NaturalScore and GmeScore, to separately quantify subject consistency, naturalness, and text relevance in generated videos. Building on this, we conduct a comprehensive evaluation of 16 representative S2V models, highlighting their strengths and weaknesses across different content. Moreover, we create the first open-source large-scale S2V generation dataset OpenS2V-5M, which consists of five million high-quality 720P subject-text-video triples. Specifically, we ensure subject-information diversity in our dataset by (1) segmenting subjects and building pairing information via cross-video associations and (2) prompting GPT-Image-1 on raw frames to synthesize multi-view representations. Through OpenS2V-Nexus, we deliver a robust infrastructure to accelerate future S2V generation research.

--------------------------------------------------------------------------------------------------------

TeViR: Text-to-Video Reward with Diffusion Models for Efficient Reinforcement Learning

Training robots to perform complex manipulation tasks requires carefully designed reward functions, but creating these rewards is time-consuming and often requires domain expertise. Traditional sparse rewards lead to inefficient learning, while dense rewards are difficult to design correctly. This research introduces TeViR, which uses text-to-video diffusion models to generate dense rewards by comparing predicted image sequences with actual robot behavior. This approach eliminates the need for manual reward engineering while significantly improving sample efficiency. Applications include robotic assembly, household task automation, manufacturing process optimization, surgical robot training, autonomous vehicle behavior learning, and any robotic system requiring complex manipulation skills, making robot training more accessible and efficient across diverse domains.

Authors:  Yuhui Chen, Haoran Li, Zhennan Jiang, Haowei Wen, Dongbin Zhao

Link:  https://arxiv.org/abs/2505.19769v1

Date: 2025-05-26

Summary:

Developing scalable and generalizable reward engineering for reinforcement learning (RL) is crucial for creating general-purpose agents, especially in the challenging domain of robotic manipulation. While recent advances in reward engineering with Vision-Language Models (VLMs) have shown promise, their sparse reward nature significantly limits sample efficiency. This paper introduces TeViR, a novel method that leverages a pre-trained text-to-video diffusion model to generate dense rewards by comparing the predicted image sequence with current observations. Experimental results across 11 complex robotic tasks demonstrate that TeViR outperforms traditional methods leveraging sparse rewards and other state-of-the-art (SOTA) methods, achieving better sample efficiency and performance without ground truth environmental rewards. TeViR's ability to efficiently guide agents in complex environments highlights its potential to advance reinforcement learning applications in robotic manipulation.
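
The reward construction can be sketched as frame-by-frame similarity between the diffusion model's predicted rollout and what the robot actually observed. The embedding function below is a stand-in (a real system might use visual features such as CLIP's).

```python
import numpy as np

def tevir_style_reward(observed_frames, predicted_frames, embed):
    """Dense rewards from a text-to-video prior (TeViR-style sketch).

    predicted_frames: frames a text-to-video model generated for the task
    prompt; observed_frames: what the robot actually did. The reward at step
    t is the similarity between matching frame embeddings, giving dense
    feedback without hand-engineered rewards."""
    rewards = []
    for obs, pred in zip(observed_frames, predicted_frames):
        e_obs, e_pred = embed(obs), embed(pred)
        cos = e_obs @ e_pred / (np.linalg.norm(e_obs) * np.linalg.norm(e_pred) + 1e-12)
        rewards.append(cos)
    return np.array(rewards)

# toy usage with a stand-in embedding (flattened pixels)
embed = lambda frame: frame.ravel()
rng = np.random.default_rng(0)
video_goal = [rng.normal(size=(8, 8)) for _ in range(5)]
rollout = [f + 0.1 * rng.normal(size=(8, 8)) for f in video_goal]
print(tevir_style_reward(rollout, video_goal, embed).round(2))  # all close to 1
```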

--------------------------------------------------------------------------------------------------------

DiG-Net: Enhancing Quality of Life through Hyper-Range Dynamic Gesture Recognition in Assistive Robotics

Gesture-based control for assistive robots typically requires users to be in close proximity, limiting accessibility for individuals with mobility constraints or those requiring remote assistance. This research introduces DiG-Net, enabling dynamic gesture recognition at distances up to 30 meters while maintaining high accuracy under challenging conditions. The system combines depth-conditioned alignment with spatio-temporal processing to handle physical attenuation and reduced resolution at extended ranges. Applications include home healthcare assistance, industrial safety monitoring, remote robot operation, accessibility solutions for mobility-impaired individuals, emergency response scenarios, and smart home control systems, significantly expanding the practical utility of gesture-controlled assistive technologies and improving quality of life for users requiring remote or long-distance robot interaction.

Authors:  Eran Bamani Beeri, Eden Nissinman, Avishai Sintov

Link:  https://arxiv.org/abs/2505.24786v1

Date: 2025-05-30

Summary:

Dynamic hand gestures play a pivotal role in assistive human-robot interaction (HRI), facilitating intuitive, non-verbal communication, particularly for individuals with mobility constraints or those operating robots remotely. Current gesture recognition methods are mostly limited to short-range interactions, reducing their utility in scenarios demanding robust assistive communication from afar. In this paper, we introduce a novel approach designed specifically for assistive robotics, enabling dynamic gesture recognition at extended distances of up to 30 meters, thereby significantly improving accessibility and quality of life. Our proposed Distance-aware Gesture Network (DiG-Net) effectively combines Depth-Conditioned Deformable Alignment (DADA) blocks with Spatio-Temporal Graph modules, enabling robust processing and classification of gesture sequences captured under challenging conditions, including significant physical attenuation, reduced resolution, and dynamic gesture variations commonly experienced in real-world assistive environments. We further introduce the Radiometric Spatio-Temporal Depth Attenuation Loss (RSTDAL), shown to enhance learning and strengthen model robustness across varying distances. Our model demonstrates significant performance improvement over state-of-the-art gesture recognition frameworks, achieving a recognition accuracy of 97.3% on a diverse dataset with challenging hyper-range gestures. By effectively interpreting gestures from considerable distances, DiG-Net significantly enhances the usability of assistive robots in home healthcare, industrial safety, and remote assistance scenarios, enabling seamless and intuitive interactions for users regardless of physical limitations.

--------------------------------------------------------------------------------------------------------

Judging Quality Across Languages: A Multilingual Approach to Pretraining Data Filtering with Language Models

High-quality multilingual training data is essential for developing effective global AI systems, yet current datasets rely heavily on heuristic filtering methods that don't transfer well across languages and cultures. This research introduces JQL, a systematic approach for curating diverse, high-quality multilingual data at scale while significantly reducing computational demands. The method distills large language models' annotation capabilities into lightweight multilingual annotators that work across 35 languages, including unseen scripts. By improving data quality and retention rates, JQL enhances downstream model performance across diverse linguistic contexts. Applications include developing more equitable global AI systems, improving machine translation, enhancing multilingual search engines, creating culturally sensitive content moderation tools, and supporting language preservation efforts for underrepresented languages.

Authors:  Mehdi Ali, Manuel Brack, Max Lübbering, Elias Wendt, Abbas Goher Khan, Richard Rutmann, Alex Jude, Maurice Kraus, Alexander Arno Weber, Felix Stollenwerk, David Kaczér, Florian Mai, Lucie Flek, Rafet Sifa, Nicolas Flores-Herr, Joachim Köhler, Patrick Schramowski, Michael Fromm, Kristian Kersting

Link:  https://arxiv.org/abs/2505.22232v1

Date: 2025-05-28

Summary:

High-quality multilingual training data is essential for effectively pretraining large language models (LLMs). Yet, the availability of suitable open-source multilingual datasets remains limited. Existing state-of-the-art datasets mostly rely on heuristic filtering methods, restricting both their cross-lingual transferability and scalability. Here, we introduce JQL, a systematic approach that efficiently curates diverse and high-quality multilingual data at scale while significantly reducing computational demands. JQL distills LLMs' annotation capabilities into lightweight annotators based on pretrained multilingual embeddings. These models exhibit robust multilingual and cross-lingual performance, even for languages and scripts unseen during training. Evaluated empirically across 35 languages, the resulting annotation pipeline substantially outperforms current heuristic filtering methods like Fineweb2. JQL notably enhances downstream model training quality and increases data retention rates. Our research provides practical insights and valuable resources for multilingual data curation, raising the standards of multilingual dataset development.
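
The distillation step can be pictured as fitting a small head on frozen multilingual embeddings to mimic LLM quality labels, so filtering billions of documents no longer needs the LLM itself. The ridge regressor and threshold below are illustrative choices, not JQL's exact annotator.

```python
import numpy as np
from sklearn.linear_model import Ridge

def train_lightweight_annotator(doc_embeddings, llm_quality_scores):
    """Distill LLM quality judgments into a cheap regressor over frozen
    multilingual embeddings (JQL-style sketch)."""
    return Ridge(alpha=1.0).fit(doc_embeddings, llm_quality_scores)

def filter_corpus(annotator, doc_embeddings, threshold=0.5):
    """Keep documents the distilled annotator scores above a quality threshold."""
    return np.flatnonzero(annotator.predict(doc_embeddings) >= threshold)

# toy usage: embeddings stand in for a frozen multilingual encoder's output
rng = np.random.default_rng(0)
E_train = rng.normal(size=(500, 64))
scores = (E_train[:, 0] > 0).astype(float)      # pretend these are LLM labels
annotator = train_lightweight_annotator(E_train, scores)
E_new = rng.normal(size=(100, 64))
print(len(filter_corpus(annotator, E_new)), "of 100 docs kept")
```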

--------------------------------------------------------------------------------------------------------

Towards Interpretability Without Sacrifice: Faithful Dense Layer Decomposition with Mixture of Decoders

Understanding how large language models process information remains a significant challenge, limiting our ability to debug, edit, and control these systems. Current interpretability methods often sacrifice model accuracy for understanding, creating impractical trade-offs. This research introduces Mixture of Decoders (MxDs), which decompose dense neural network layers into thousands of specialized, sparsely-activating sublayers while preserving the original model's performance. This approach enables detailed analysis of how models process natural language without degrading their capabilities. Applications include AI safety research, model debugging and validation, developing more controllable AI systems, understanding model biases, creating interpretable AI for regulated industries, and advancing research in AI transparency, providing crucial tools for responsible AI development and deployment.

Authors:  James Oldfield, Shawn Im, Yixuan Li, Mihalis A. Nicolaou, Ioannis Patras, Grigorios G Chrysos

Link:  https://arxiv.org/abs/2505.21364v1

Date: 2025-05-27

Summary:

Multilayer perceptrons (MLPs) are an integral part of large language models, yet their dense representations render them difficult to understand, edit, and steer. Recent methods learn interpretable approximations via neuron-level sparsity, yet fail to faithfully reconstruct the original mapping--significantly increasing the model's next-token cross-entropy loss. In this paper, we advocate for moving to layer-level sparsity to overcome the accuracy trade-off in sparse layer approximation. Under this paradigm, we introduce Mixture of Decoders (MxDs). MxDs generalize MLPs and Gated Linear Units, expanding pre-trained dense layers into tens of thousands of specialized sublayers. Through a flexible form of tensor factorization, each sparsely activating MxD sublayer implements a linear transformation with full-rank weights--preserving the original decoders' expressive capacity even under heavy sparsity. Experimentally, we show that MxDs significantly outperform state-of-the-art methods (e.g., Transcoders) on the sparsity-accuracy frontier in language models with up to 3B parameters. Further evaluations on sparse probing and feature steering demonstrate that MxDs learn similarly specialized features of natural language--opening up a promising new avenue for designing interpretable yet faithful decompositions. Our code is included at: https://github.com/james-oldfield/MxD/.
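
In spirit, an MxD layer routes each token through a few of many linear sublayers. The PyTorch sketch below captures that layer-level sparsity with ordinary top-k gating; the paper instead derives its sublayers from a tensor factorization of the pre-trained dense layer, so treat this as a structural illustration only.

```python
import torch
import torch.nn as nn

class MixtureOfDecoders(nn.Module):
    """Sketch of an MxD-style layer: many sparsely-activating linear sublayers
    ('experts'), each a full linear map, with a gate picking top-k per token."""

    def __init__(self, d_in, d_out, n_experts=64, k=4):
        super().__init__()
        self.experts = nn.Parameter(torch.randn(n_experts, d_in, d_out) * 0.02)
        self.gate = nn.Linear(d_in, n_experts)
        self.k = k

    def forward(self, x):                        # x: (batch, d_in)
        scores = self.gate(x)                    # (batch, n_experts)
        topv, topi = scores.topk(self.k, dim=-1)
        weights = torch.softmax(topv, dim=-1)    # sparse gate over k sublayers
        out = 0.0
        for j in range(self.k):
            W = self.experts[topi[:, j]]         # (batch, d_in, d_out)
            out = out + weights[:, j, None] * torch.bmm(x.unsqueeze(1), W).squeeze(1)
        return out

layer = MixtureOfDecoders(d_in=32, d_out=32)
print(layer(torch.randn(8, 32)).shape)  # torch.Size([8, 32])
```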

--------------------------------------------------------------------------------------------------------

SAEs Are Good for Steering -- If You Select the Right Features

Sparse Autoencoders (SAEs) offer a promising approach for understanding and controlling AI model behavior without requiring labeled training data. However, current methods for identifying useful features focus primarily on input patterns rather than actual effects on model outputs, limiting their effectiveness for steering applications. This research distinguishes between input features and output features, proposing scoring methods to identify features that meaningfully influence model behavior. By filtering out irrelevant features, the approach achieves 2-3x improvements in steering performance, making SAEs competitive with supervised methods. Applications include AI safety research, content moderation systems, bias mitigation tools, personalized AI assistants, and educational AI systems, enabling more precise control over AI behavior while maintaining unsupervised learning benefits.

Authors:  Dana Arad, Aaron Mueller, Yonatan Belinkov

Link:  https://arxiv.org/abs/2505.20063v1

Date: 2025-05-26

Summary:

Sparse Autoencoders (SAEs) have been proposed as an unsupervised approach to learn a decomposition of a model's latent space. This enables useful applications such as steering - influencing the output of a model towards a desired concept - without requiring labeled data. Current methods identify SAE features to steer by analyzing the input tokens that activate them. However, recent work has highlighted that activations alone do not fully describe the effect of a feature on the model's output. In this work, we draw a distinction between two types of features: input features, which mainly capture patterns in the model's input, and output features, which have a human-understandable effect on the model's output. We propose input and output scores to characterize and locate these types of features, and show that high values for both scores rarely co-occur in the same features. These findings have practical implications: after filtering out features with low output scores, we obtain 2-3x improvements when steering with SAEs, making them competitive with supervised methods.
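
A minimal sketch of the filtering idea: score each SAE feature by the effect its decoder direction has on the output logits, then steer only with high-output-score features. The scoring formula here is an illustrative proxy, not the paper's exact definition.

```python
import numpy as np

def output_score(decoder_dir, unembed, target_token_ids):
    """Illustrative output score: how strongly a feature's decoder direction
    promotes concept-related tokens when mapped through the unembedding."""
    logits = unembed @ decoder_dir               # feature's effect on the logits
    return logits[target_token_ids].mean() - logits.mean()

def select_and_steer(h, sae_decoder, scores, alpha=4.0):
    """Steer with the best *output* feature instead of the top input feature."""
    direction = sae_decoder[int(np.argmax(scores))]
    return h + alpha * direction / (np.linalg.norm(direction) + 1e-12)

rng = np.random.default_rng(0)
sae_decoder = rng.normal(size=(512, 64))         # 512 features, 64-dim residual
unembed = rng.normal(size=(1000, 64))            # toy vocabulary of 1000 tokens
concept_tokens = [1, 2, 3]                       # tokens tied to the target concept
scores = np.array([output_score(d, unembed, concept_tokens) for d in sae_decoder])
h = rng.normal(size=64)                          # a residual-stream activation
print(select_and_steer(h, sae_decoder, scores).shape)  # (64,)
```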

--------------------------------------------------------------------------------------------------------


EYE ON A.I. GETS READERS UP TO DATE ON THE LATEST FUNDING NEWS AND RELATED ISSUES. SUBSCRIBE FOR THE WEEKLY NEWSLETTER.