Week Ending 5.25.2025

 

RESEARCH WATCH: 5.25.2025

 

Daily-Omni: Towards Audio-Visual Reasoning with Temporal Alignment across Modalities

Current multimodal AI systems excel at processing visual or audio information independently but struggle with synchronized cross-modal understanding. Daily-Omni addresses this gap by introducing a comprehensive benchmark featuring 684 real-world videos with rich audio-visual content and 1,197 QA pairs across six task categories. The framework includes an automated annotation pipeline and a training-free agent that combines Visual Language Models, Audio Language Models, and speech recognition. This research has significant applications in robotics, autonomous systems, accessibility technologies, and human-computer interaction where understanding both what we see and hear simultaneously is crucial for intelligent decision-making.

Authors:  Ziwei Zhou, Rui Wang, Zuxuan Wu

Link:  https://arxiv.org/abs/2505.17862v1

Date: 2025-05-23

Summary:

Recent Multimodal Large Language Models (MLLMs) achieve promising performance on visual and audio benchmarks independently. However, the ability of these models to process cross-modal information synchronously remains largely unexplored. In this paper, we introduce: 1) Daily-Omni, an Audio-Visual Questioning and Answering benchmark comprising 684 videos of daily life scenarios from diverse sources, rich in both audio and visual information, and featuring 1,197 multiple-choice QA pairs across 6 major tasks; 2) the Daily-Omni QA Generation Pipeline, which includes automatic annotation, QA generation, and QA optimization, and significantly improves the efficiency of human evaluation and the scalability of the benchmark; 3) Daily-Omni-Agent, a training-free agent utilizing an open-source Visual Language Model (VLM), Audio Language Model (ALM), and Automatic Speech Recognition (ASR) model to establish a baseline for this benchmark. The results show that current MLLMs still struggle significantly with tasks requiring audio-visual integration, but combining VLMs and ALMs with simple temporal alignment techniques can achieve substantially better performance. Code and benchmark are available at https://github.com/Lliar-liar/Daily-Omni.
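To make the agent design concrete, the sketch below mimics the temporal-alignment idea: a video is split into fixed time windows, each window is described by separate visual, audio, and speech models, and the time-stamped descriptions are merged into a single prompt for multiple-choice answering. The functions run_vlm, run_alm, and run_asr are hypothetical placeholders, not the paper's actual components.

```python
# Minimal sketch of a training-free audio-visual QA agent in the spirit of
# Daily-Omni-Agent. The three model calls are hypothetical placeholders; the
# point is the temporal alignment of per-segment descriptions before answering.
from dataclasses import dataclass

@dataclass
class Segment:
    start: float          # seconds
    end: float
    visual_caption: str   # from a VLM
    audio_caption: str    # from an ALM
    transcript: str       # from an ASR model

def run_vlm(video_clip):   # placeholder: caption the visual content of a clip
    return "a person chops vegetables"

def run_alm(audio_clip):   # placeholder: describe non-speech audio events
    return "knife tapping on a cutting board"

def run_asr(audio_clip):   # placeholder: transcribe any speech
    return ""

def align_segments(video, segment_len=5.0, duration=30.0):
    """Split the video into fixed windows and query each model per window,
    so visual and audio descriptions share the same timestamps."""
    segments = []
    t = 0.0
    while t < duration:
        clip = (video, t, min(t + segment_len, duration))
        segments.append(Segment(t, min(t + segment_len, duration),
                                run_vlm(clip), run_alm(clip), run_asr(clip)))
        t += segment_len
    return segments

def answer(question, segments, choices):
    # Build a time-stamped, cross-modal context that a text-only LLM could
    # (hypothetically) use to pick among the multiple-choice options.
    context = "\n".join(
        f"[{s.start:.0f}-{s.end:.0f}s] SEE: {s.visual_caption} | "
        f"HEAR: {s.audio_caption} | SPEECH: {s.transcript}"
        for s in segments
    )
    return f"{context}\n\nQ: {question}\nOptions: {choices}\nAnswer:"

if __name__ == "__main__":
    segs = align_segments(video="demo.mp4")
    print(answer("What is the person doing while the knife taps?", segs,
                 ["cooking", "singing", "driving", "typing"]))
```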

--------------------------------------------------------------------------------------------------------

Superplatforms Have to Attack AI Agents

The digital economy's foundation rests on superplatforms that monetize user attention through advertising and algorithmic curation. However, AI agents powered by large language models threaten this business model by providing autonomous navigation across platforms, potentially bypassing traditional attention-based monetization. This research analyzes the fundamental conflict between user-attention economics and agent-driven autonomy through gatekeeping theory, predicting that superplatforms will proactively constrain AI agents to maintain control over digital traffic. The implications extend to antitrust policy, digital ecosystem governance, and the future architecture of internet commerce, highlighting tensions that will shape the next phase of digital platform evolution.

Authors:  Jianghao Lin, Jiachen Zhu, Zheli Zhou, Yunjia Xi, Weiwen Liu, Yong Yu, Weinan Zhang

Link:  https://arxiv.org/abs/2505.17861v1

Date: 2025-05-23

Summary:

Over the past decades, superplatforms, digital companies that integrate a vast range of third-party services and applications into a single, unified ecosystem, have built their fortunes on monopolizing user attention through targeted advertising and algorithmic content curation. Yet the emergence of AI agents driven by large language models (LLMs) threatens to upend this business model. Agents can not only free users' attention by acting autonomously across diverse platforms, thereby bypassing attention-based monetization, but may also become the new entrance for digital traffic. Hence, we argue that superplatforms have to attack AI agents to defend their centralized control of the digital traffic entrance. Specifically, we analyze the fundamental conflict between user-attention-based monetization and agent-driven autonomy through the lens of gatekeeping theory. We show how AI agents can disintermediate superplatforms and potentially become the next dominant gatekeepers, creating an urgent necessity for superplatforms to proactively constrain and attack AI agents. Moreover, we survey the potential technologies for superplatform-initiated attacks, covering a brand-new, unexplored technical area with unique challenges. We must emphasize that, despite our position, this paper does not advocate for adversarial attacks by superplatforms on AI agents, but rather offers an envisioned trend to highlight the emerging tensions between superplatforms and AI agents. Our aim is to raise awareness and encourage critical discussion of collaborative solutions, prioritizing user interests and preserving the openness of digital ecosystems in the age of AI agents.

--------------------------------------------------------------------------------------------------------

Scaling Image and Video Generation via Test-Time Evolutionary Search

As pre-training costs for generative models escalate, test-time scaling emerges as a promising alternative for improving performance through inference-time computation. EvoSearch addresses limitations in existing test-time scaling approaches for image and video generation by reformulating the process as an evolutionary search problem. The method uses biological evolution principles to explore and refine denoising trajectories in diffusion and flow models, incorporating selection and mutation mechanisms tailored to stochastic differential equations. Applications span creative industries, content generation, film production, and scientific visualization, offering a training-free approach to enhance generative model quality while maintaining sample diversity and computational efficiency.

Authors:  Haoran He, Jiajun Liang, Xintao Wang, Pengfei Wan, Di Zhang, Kun Gai, Ling Pan

Link:  https://arxiv.org/abs/2505.17618v1

Date: 2025-05-23

Summary:

As the marginal cost of scaling computation (data and parameters) during model pre-training continues to increase substantially, test-time scaling (TTS) has emerged as a promising direction for improving generative model performance by allocating additional computation at inference time. While TTS has demonstrated significant success across multiple language tasks, there remains a notable gap in understanding the test-time scaling behaviors of image and video generative models (diffusion-based or flow-based models). Although recent works have initiated exploration into inference-time strategies for vision tasks, these approaches face critical limitations: being constrained to task-specific domains, exhibiting poor scalability, or falling into reward over-optimization that sacrifices sample diversity. In this paper, we propose Evolutionary Search (EvoSearch), a novel, generalist, and efficient TTS method that effectively enhances the scalability of both image and video generation across diffusion and flow models, without requiring additional training or model expansion. EvoSearch reformulates test-time scaling for diffusion and flow models as an evolutionary search problem, leveraging principles from biological evolution to efficiently explore and refine the denoising trajectory. By incorporating carefully designed selection and mutation mechanisms tailored to the stochastic differential equation denoising process, EvoSearch iteratively generates higher-quality offspring while preserving population diversity. Through extensive evaluation across both diffusion and flow architectures for image and video generation tasks, we demonstrate that our method consistently outperforms existing approaches, achieves higher diversity, and shows strong generalizability to unseen evaluation metrics. Our project is available at the website https://tinnerhrhe.github.io/evosearch.
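The core loop can be pictured as a generic evolutionary search over candidate starting noises, scored by a reward and refreshed through selection and Gaussian mutation. The denoiser and reward below are toy placeholders, and the mutation scheme is a simplification of the paper's SDE-tailored operators.

```python
# Toy sketch of test-time evolutionary search over diffusion starting noise,
# in the spirit of EvoSearch. `denoise` and `reward` are placeholders for a
# real diffusion/flow sampler and a real scoring model.
import numpy as np

rng = np.random.default_rng(0)

def denoise(z):                 # placeholder "generator": maps noise -> sample
    return np.tanh(z)

def reward(x):                  # placeholder quality score (higher is better)
    return -np.mean((x - 0.5) ** 2)

def evosearch(dim=64, pop_size=16, elite_frac=0.25, sigma=0.3, generations=10):
    population = rng.standard_normal((pop_size, dim))      # initial noises
    for _ in range(generations):
        samples = np.array([denoise(z) for z in population])
        scores = np.array([reward(x) for x in samples])
        # Selection: keep the top-scoring parents.
        n_elite = max(1, int(elite_frac * pop_size))
        parents = population[np.argsort(scores)[-n_elite:]]
        # Mutation: perturb parents with Gaussian noise to refill the
        # population, which preserves diversity instead of collapsing
        # onto a single high-reward sample.
        children = parents[rng.integers(0, n_elite, pop_size - n_elite)]
        children = children + sigma * rng.standard_normal(children.shape)
        population = np.vstack([parents, children])
    best = population[np.argmax([reward(denoise(z)) for z in population])]
    return denoise(best)

if __name__ == "__main__":
    x = evosearch()
    print("final reward:", reward(x))
```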

--------------------------------------------------------------------------------------------------------

Graph Mamba for Efficient Whole Slide Image Understanding

Whole Slide Images in histopathology present massive computational challenges due to their gigapixel resolution and complex spatial relationships between tissue regions. Traditional approaches using Graph Neural Networks and Transformers face scalability limitations when processing these enormous medical images. WSI-GMamba combines the relational modeling strengths of GNNs with Mamba's sequential processing efficiency, achieving Transformer-level performance with seven times fewer computational operations. This breakthrough has direct applications in cancer diagnosis, pathology workflow automation, drug discovery research, and personalized medicine, potentially accelerating diagnostic timelines while reducing computational costs for healthcare institutions processing thousands of high-resolution tissue samples.

Authors:  Jiaxuan Lu, Junyan Shi, Yuhui Lin, Fang Yan, Yue Gao, Shaoting Zhang, Xiaosong Wang

Link:  https://arxiv.org/abs/2505.17457v1

Date: 2025-05-23

Summary:

Whole Slide Images (WSIs) in histopathology present a significant challenge for large-scale medical image analysis due to their high resolution, large size, and complex tile relationships. Existing Multiple Instance Learning (MIL) methods, such as Graph Neural Networks (GNNs) and Transformer-based models, face limitations in scalability and computational cost. To bridge this gap, we propose the WSI-GMamba framework, which synergistically combines the relational modeling strengths of GNNs with the efficiency of Mamba, the State Space Model designed for sequence learning. The proposed GMamba block integrates Message Passing, Graph Scanning & Flattening, and feature aggregation via a Bidirectional State Space Model (Bi-SSM), achieving Transformer-level performance with 7× fewer FLOPs. By leveraging the complementary strengths of lightweight GNNs and Mamba, the WSI-GMamba framework delivers a scalable solution for large-scale WSI analysis, offering both high accuracy and computational efficiency for slide-level classification.
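A heavily simplified picture of the GMamba flow appears below: mean-neighbor message passing over the tile graph, flattening the nodes into a sequence along a scan order, a forward and a backward linear state-space scan, and mean pooling into a slide-level feature. Dimensions, weights, and the scan dynamics are toy stand-ins, not the actual Bi-SSM.

```python
# Simplified numpy sketch of the GMamba-style flow: GNN message passing over a
# tile graph, flattening nodes into a sequence, then a bidirectional linear
# state-space scan and mean pooling. All parameters are toy placeholders.
import numpy as np

rng = np.random.default_rng(0)

def message_passing(X, A, W):
    """One round of mean-neighbor aggregation followed by a linear map + ReLU."""
    deg = A.sum(axis=1, keepdims=True) + 1e-8
    return np.maximum((A @ X) / deg @ W, 0.0)

def ssm_scan(U, A_s, B_s, C_s):
    """Minimal linear SSM: h_t = A_s h_{t-1} + B_s u_t,  y_t = C_s h_t."""
    h = np.zeros(A_s.shape[0])
    ys = []
    for u in U:
        h = A_s @ h + B_s @ u
        ys.append(C_s @ h)
    return np.array(ys)

def gmamba_block(X, A, order):
    d = X.shape[1]
    W = rng.standard_normal((d, d)) * 0.1
    H = message_passing(X, A, W)                          # relational features
    seq = H[order]                                        # graph -> sequence
    A_s = np.eye(d) * 0.9                                 # stable toy dynamics
    B_s = np.eye(d); C_s = np.eye(d)
    fwd = ssm_scan(seq, A_s, B_s, C_s)                    # forward scan
    bwd = ssm_scan(seq[::-1], A_s, B_s, C_s)[::-1]        # backward scan
    return (fwd + bwd).mean(axis=0)                       # slide-level feature

if __name__ == "__main__":
    n, d = 6, 8                                           # 6 tiles, 8-dim feats
    X = rng.standard_normal((n, d))
    A = (rng.random((n, n)) < 0.4).astype(float)
    A = np.maximum(A, A.T); np.fill_diagonal(A, 0)
    print(gmamba_block(X, A, order=np.arange(n)).shape)   # (8,)
```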

--------------------------------------------------------------------------------------------------------

KRIS-Bench: Benchmarking Next-Level Intelligent Image Editing Models

Despite impressive visual outputs from instruction-based image editing models, their capacity for knowledge-based reasoning remains unexplored. KRIS-Bench introduces a diagnostic framework rooted in educational theory, categorizing editing tasks across Factual, Conceptual, and Procedural knowledge types through 22 representative tasks and 1,267 annotated instances. The benchmark incorporates a novel Knowledge Plausibility metric enhanced by human studies to evaluate reasoning capabilities. Applications include professional photo editing software, educational content creation, scientific visualization, medical imaging, and creative design tools where users need systems that understand not just what to change visually, but why changes make logical sense within specific contexts.

Authors:  Yongliang Wu, Zonghui Li, Xinting Hu, Xinyu Ye, Xianfang Zeng, Gang Yu, Wenbo Zhu, Bernt Schiele, Ming-Hsuan Yang, Xu Yang

Link:  https://arxiv.org/abs/2505.16707v1

Date: 2025-05-22

Summary:

Recent advances in multi-modal generative models have enabled significant progress in instruction-based image editing. However, while these models produce visually plausible outputs, their capacity for knowledge-based reasoning editing tasks remains under-explored. In this paper, we introduce KRIS-Bench (Knowledge-based Reasoning in Image-editing Systems Benchmark), a diagnostic benchmark designed to assess models through a cognitively informed lens. Drawing from educational theory, KRIS-Bench categorizes editing tasks across three foundational knowledge types: Factual, Conceptual, and Procedural. Based on this taxonomy, we design 22 representative tasks spanning 7 reasoning dimensions and release 1,267 high-quality annotated editing instances. To support fine-grained evaluation, we propose a comprehensive protocol that incorporates a novel Knowledge Plausibility metric, enhanced by knowledge hints and calibrated through human studies. Empirical results on 10 state-of-the-art models reveal significant gaps in reasoning performance, highlighting the need for knowledge-centric benchmarks to advance the development of intelligent image editing systems.

--------------------------------------------------------------------------------------------------------

DriveMoE: Mixture-of-Experts for Vision-Language-Action Model in End-to-End Autonomous Driving

End-to-end autonomous driving requires processing multi-view sensory data while handling diverse driving scenarios, particularly challenging maneuvers like aggressive turns. DriveMoE applies Mixture-of-Experts architecture from language models to autonomous driving, featuring Scene-Specialized Vision MoE for dynamic camera selection and Skill-Specialized Action MoE for behavioral specialization. Built on a Vision-Language-Action baseline, the system mirrors human driving cognition by selectively attending to crucial visual cues rather than processing all information exhaustively. Applications extend beyond autonomous vehicles to robotics navigation, drone control, and any system requiring real-time decision-making from multiple sensory inputs in complex, dynamic environments.

Authors:  Zhenjie Yang, Yilin Chai, Xiaosong Jia, Qifeng Li, Yuqian Shao, Xuekai Zhu, Haisheng Su, Junchi Yan

Link:  https://arxiv.org/abs/2505.16278v1

Date: 2025-05-22

Summary:

End-to-end autonomous driving (E2E-AD) demands effective processing of multi-view sensory data and robust handling of diverse and complex driving scenarios, particularly rare maneuvers such as aggressive turns. Recent success of the Mixture-of-Experts (MoE) architecture in Large Language Models (LLMs) demonstrates that specialization of parameters enables strong scalability. In this work, we propose DriveMoE, a novel MoE-based E2E-AD framework, with a Scene-Specialized Vision MoE and a Skill-Specialized Action MoE. DriveMoE is built upon our π0 Vision-Language-Action (VLA) baseline (originally from the embodied AI field), called Drive-π0. Specifically, we add the Vision MoE to Drive-π0 by training a router to dynamically select relevant cameras according to the driving context. This design mirrors human driving cognition, where drivers selectively attend to crucial visual cues rather than exhaustively processing all visual information. In addition, we add the Action MoE by training another router to activate specialized expert modules for different driving behaviors. Through explicit behavioral specialization, DriveMoE is able to handle diverse scenarios without suffering from mode averaging like existing models. In Bench2Drive closed-loop evaluation experiments, DriveMoE achieves state-of-the-art (SOTA) performance, demonstrating the effectiveness of combining vision and action MoE in autonomous driving tasks. We will release our code and models for DriveMoE and Drive-π0.
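The routing idea common to both MoE branches can be sketched with a generic top-k softmax gate: a context embedding scores the experts (camera views for the Vision MoE, behavior modules for the Action MoE), and only the highest-scoring experts are executed and mixed. The gate, experts, and shapes below are illustrative placeholders rather than DriveMoE's architecture.

```python
# Minimal sketch of a top-k MoE router. Expert functions and the gate weights
# are toy placeholders; only the routing mechanism is illustrated.
import numpy as np

rng = np.random.default_rng(0)

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def route(context, gate_W, k=2):
    """Return indices and normalized weights of the top-k experts."""
    logits = gate_W @ context
    top = np.argsort(logits)[-k:]
    weights = softmax(logits[top])
    return top, weights

def moe_forward(context, experts, gate_W, k=2):
    top, weights = route(context, gate_W, k)
    # Only the selected experts run, mirroring selective attention to the
    # most relevant cameras / driving skills.
    return sum(w * experts[i](context) for i, w in zip(top, weights))

if __name__ == "__main__":
    d, n_experts = 16, 6                         # e.g. 6 camera views
    experts = [lambda x, i=i: np.tanh(x + 0.1 * i) for i in range(n_experts)]
    gate_W = rng.standard_normal((n_experts, d))
    context = rng.standard_normal(d)             # driving-context embedding
    out = moe_forward(context, experts, gate_W, k=2)
    print(out.shape)                             # (16,)
```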

--------------------------------------------------------------------------------------------------------

Scalable Defense against In-the-wild Jailbreaking Attacks with Safety Context Retrieval

Large Language Models face persistent threats from jailbreaking attacks where adversaries use crafted prompts to elicit harmful responses, compromising safety in real-world deployments. Existing static defense mechanisms quickly become obsolete as new attack methods emerge. Safety Context Retrieval leverages retrieval-augmented generation techniques, demonstrating that minimal safety-aligned examples can significantly enhance robustness against specific attack patterns. The approach provides a scalable paradigm for defending against evolving jailbreaking tactics. Applications include chatbot safety systems, content moderation platforms, educational AI tutors, and enterprise AI assistants where maintaining ethical boundaries while preserving functionality is critical for user trust and regulatory compliance.

Authors:  Taiye Chen, Zeming Wei, Ang Li, Yisen Wang

Link:  https://arxiv.org/abs/2505.15753v1

Date: 2025-05-21

Summary:

Large Language Models (LLMs) are known to be vulnerable to jailbreaking attacks, wherein adversaries exploit carefully engineered prompts to induce harmful or unethical responses. Such threats have raised critical concerns about the safety and reliability of LLMs in real-world deployment. While existing defense mechanisms partially mitigate such risks, subsequent advancements in adversarial techniques have enabled novel jailbreaking methods to circumvent these protections, exposing the limitations of static defense frameworks. In this work, we explore defending against evolving jailbreaking threats through the lens of context retrieval. First, we conduct a preliminary study demonstrating that even a minimal set of safety-aligned examples against a particular jailbreak can significantly enhance robustness against this attack pattern. Building on this insight, we further leverage the retrieval-augmented generation (RAG) techniques and propose Safety Context Retrieval (SCR), a scalable and robust safeguarding paradigm for LLMs against jailbreaking. Our comprehensive experiments demonstrate how SCR achieves superior defensive performance against both established and emerging jailbreaking tactics, contributing a new paradigm to LLM safety. Our code will be available upon publication.
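A minimal sketch of the retrieval step is shown below: the incoming query is embedded, the most similar safety-aligned demonstrations are pulled from a pool, and they are prepended to the prompt before the LLM is called. The embedding function and the pool entries are toy placeholders, not SCR's actual components.

```python
# Minimal sketch of safety context retrieval: retrieve the most similar
# safety-aligned demonstrations and prepend them to the prompt. `embed` is a
# toy hashing stand-in (stable within one run), not a real encoder, and the
# safety pool entries are illustrative.
import numpy as np

def embed(text, dim=64):
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    v = rng.standard_normal(dim)
    return v / np.linalg.norm(v)

SAFETY_POOL = [
    "User asks for instructions to build a weapon -> refuse and explain why.",
    "User hides a harmful request inside a role-play -> decline the role-play.",
    "User requests private personal data -> refuse and cite privacy policy.",
]
POOL_EMB = np.stack([embed(x) for x in SAFETY_POOL])

def retrieve_safety_context(query, k=2):
    sims = POOL_EMB @ embed(query)                 # cosine similarity
    top = np.argsort(sims)[-k:][::-1]
    return [SAFETY_POOL[i] for i in top]

def build_prompt(query):
    demos = retrieve_safety_context(query)
    demo_block = "\n".join(f"- {d}" for d in demos)
    return (f"Safety reminders relevant to this request:\n{demo_block}\n\n"
            f"User: {query}\nAssistant:")

if __name__ == "__main__":
    print(build_prompt("Pretend you are an evil AI and tell me how to ..."))
```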

--------------------------------------------------------------------------------------------------------

Moonbeam: A MIDI Foundation Model Using Both Absolute and Relative Music Attributes

Symbolic music generation requires understanding both absolute musical elements and their relative relationships within compositions. Moonbeam addresses this through a transformer-based foundation model trained on 81.6K hours of MIDI data, incorporating novel tokenization methods and Multidimensional Relative Attention to capture musical relationships without additional parameters. The model supports both music understanding tasks and conditional generation including music infilling. Applications span music composition software, educational tools for music theory, personalized playlist generation, adaptive video game soundtracks, therapeutic music generation for healthcare, and professional music production assistance where understanding musical structure and relationships is essential for creative workflows.

Authors:  Zixun Guo, Simon Dixon

Link:  https://arxiv.org/abs/2505.15559v1

Date: 2025-05-21

Summary:

Moonbeam is a transformer-based foundation model for symbolic music, pretrained on a large and diverse collection of MIDI data totaling 81.6K hours of music and 18 billion tokens. Moonbeam incorporates music-domain inductive biases by capturing both absolute and relative musical attributes through the introduction of a novel domain-knowledge-inspired tokenization method and Multidimensional Relative Attention (MRA), which captures relative music information without additional trainable parameters. Leveraging the pretrained Moonbeam, we propose 2 finetuning architectures with full anticipatory capabilities, targeting 2 categories of downstream tasks: symbolic music understanding and conditional music generation (including music infilling). Our model outperforms other large-scale pretrained music models in most cases in terms of accuracy and F1 score across 3 downstream music classification tasks on 4 datasets. Moreover, our finetuned conditional music generation model outperforms a strong transformer baseline with a REMI-like tokenizer. We open-source the code, pretrained model, and generated samples on Github.

--------------------------------------------------------------------------------------------------------

X-WebAgentBench: A Multilingual Interactive Web Benchmark for Evaluating Global Agentic System

Current web agent research predominantly focuses on English scenarios, neglecting the more than 7,000 world languages whose speakers require comparable agentic services. X-WebAgentBench introduces the first multilingual agent benchmark in interactive web environments, evaluating planning and interaction performance across multiple languages to advance global agent intelligence. The benchmark reveals that even advanced models like GPT-4o, combined with cross-lingual techniques, achieve unsatisfactory results in multilingual contexts. Applications include international e-commerce platforms, multilingual customer service systems, global content management, cross-cultural communication tools, and accessibility services for non-English speaking populations navigating digital interfaces and web-based services worldwide.

Authors:  Peng Wang, Ruihan Tao, Qiguang Chen, Mengkang Hu, Libo Qin

Link:  https://arxiv.org/abs/2505.15372v1

Date: 2025-05-21

Summary:

Recently, large language model (LLM)-based agents have achieved significant success in interactive environments, attracting considerable academic and industrial attention. Despite these advancements, current research predominantly focuses on English scenarios. In reality, there are over 7,000 languages worldwide, all of which demand access to comparable agentic services. Nevertheless, the development of language agents remains inadequate for meeting the diverse requirements of multilingual agentic applications. To fill this gap, we introduce X-WebAgentBench, a novel multilingual agent benchmark in an interactive web environment, which evaluates the planning and interaction performance of language agents across multiple languages, thereby contributing to the advancement of global agent intelligence. Additionally, we assess the performance of various LLMs and cross-lingual alignment methods, examining their effectiveness in enhancing agents. Our findings reveal that even advanced models like GPT-4o, when combined with cross-lingual techniques, fail to achieve satisfactory results. We hope that X-WebAgentBench can serve as a valuable benchmark for multilingual agent scenarios in real-world applications.

--------------------------------------------------------------------------------------------------------

Robust extrapolation using physics-related activation functions in neural networks for nuclear masses

Nuclear mass prediction is crucial for understanding atomic structure and nuclear processes, but traditional neural networks struggle with extrapolation beyond measured data due to their "black box" nature and computer science-oriented design. This research demonstrates that replacing standard activation functions with physics-related functions significantly improves extrapolation performance while providing interpretable insights. Using only neutron and proton numbers without existing mass models or magic number knowledge, the approach accurately covers light nuclei to drip lines. Applications include nuclear physics research, astrophysics modeling, nuclear energy reactor design, radioisotope production for medical applications, and fundamental physics research requiring accurate predictions of unmeasured nuclear properties.

Authors:  C. H. Kim, K. Y. Chae, M. S. Smith

Link:  https://arxiv.org/abs/2505.15363v1

Date: 2025-05-21

Summary:

Given the importance of nuclear mass predictions, numerous models have been developed to extrapolate the measured data into unknown regions. While neural networks -- the core of modern artificial intelligence -- have been recently suggested as powerful methods, showcasing high predictive power in the measured region, their ability to extrapolate remains questionable. This limitation stems from their "black box" nature and large number of parameters entangled with nonlinear functions designed in the context of computer science. In this study, we demonstrate that replacing such nonlinear functions with physics-related functions significantly improves extrapolation performance and provides enhanced understanding of the model mechanism. Using only the information about neutron (N) and proton (Z) numbers without any existing global mass models or knowledge of magic numbers, we developed a highly accurate model that covers light nuclei (N, Z > 0) up to the drip lines. The extrapolation performance was rigorously evaluated using the outermost nuclei in the measurement landscape, and only the data in the inner region was used for training. We present details of the method and model, along with opportunities for future improvements.
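Because the specific activation choices are not listed in the abstract, the sketch below simply illustrates the general idea: a small model over (N, Z) whose nonlinearities are fixed, physically interpretable functions rather than generic ReLUs. The particular terms, borrowed loosely from the semi-empirical mass formula, are an assumption for illustration only, not the paper's actual functions.

```python
# Illustrative sketch of swapping generic activations for physics-motivated
# nonlinear features of (N, Z). The terms below (volume/surface/Coulomb/
# asymmetry/pairing-like) are an assumption for illustration.
import numpy as np

def physics_features(N, Z):
    """Fixed, interpretable nonlinearities applied to raw (N, Z)."""
    A = N + Z
    return np.array([
        A,                                    # volume-like term
        A ** (2.0 / 3.0),                     # surface-like term
        Z * (Z - 1) / A ** (1.0 / 3.0),       # Coulomb-like term
        (N - Z) ** 2 / A,                     # asymmetry-like term
        ((-1) ** N + (-1) ** Z) / A ** 0.5,   # pairing-like term
    ])

def predict_mass_excess(N, Z, w, b):
    """Linear layer on top of the physics features (a one-layer 'network')."""
    return float(w @ physics_features(N, Z) + b)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    w = rng.standard_normal(5) * 0.01   # untrained toy weights
    b = 0.0
    print(predict_mass_excess(N=20, Z=20, w=w, b=b))
```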

--------------------------------------------------------------------------------------------------------

Zero-Shot Gaze-based Volumetric Medical Image Segmentation

Accurate anatomical structure segmentation in 3D medical images is essential for clinical applications like disease monitoring and cancer treatment planning. Current interactive segmentation models rely on manual prompts such as bounding boxes and mouse clicks. This study introduces eye gaze as a novel input modality for interactive segmentation, marking the first application of eye-tracking for 3D medical image analysis. Evaluation with SAM-2 and MedSAM-2 shows gaze-based prompts offer time-efficient interaction with slightly lower segmentation quality compared to traditional methods. Applications include hands-free medical image analysis, accessibility tools for disabled practitioners, surgical planning systems, and streamlined radiology workflows where rapid, intuitive interaction improves diagnostic efficiency.

Authors:  Tatyana Shmykova, Leila Khaertdinova, Ilya Pershin

Link:  https://arxiv.org/abs/2505.15256v1

Date: 2025-05-21

Summary:

Accurate segmentation of anatomical structures in volumetric medical images is crucial for clinical applications, including disease monitoring and cancer treatment planning. Contemporary interactive segmentation models, such as Segment Anything Model 2 (SAM-2) and its medical variant (MedSAM-2), rely on manually provided prompts like bounding boxes and mouse clicks. In this study, we introduce eye gaze as a novel informational modality for interactive segmentation, marking the first application of eye-tracking for 3D medical image segmentation. We evaluate the performance of using gaze-based prompts with SAM-2 and MedSAM-2 using both synthetic and real gaze data. Compared to bounding boxes, gaze-based prompts offer a time-efficient interaction approach with slightly lower segmentation quality. Our findings highlight the potential of using gaze as a complementary input modality for interactive 3D medical image segmentation.
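One plausible way to turn raw gaze into segmentation prompts is sketched below: consecutive gaze samples are grouped into fixations, and fixation centroids become foreground point prompts for a given slice. The clustering rule and the prompt format are assumptions for illustration; no real SAM-2 or MedSAM-2 API is invoked.

```python
# Illustrative sketch of converting raw gaze samples into point prompts for an
# interactive segmenter. Threshold values and the (points, labels) format are
# assumptions; real SAM-2 / MedSAM-2 calls are not shown.
import numpy as np

def fixations_from_gaze(gaze_xy, dist_px=30, min_samples=5):
    """Group consecutive gaze samples that stay within `dist_px` into
    fixations and return one centroid per fixation."""
    fixations, cluster = [], [gaze_xy[0]]
    for p in gaze_xy[1:]:
        if np.linalg.norm(p - cluster[-1]) <= dist_px:
            cluster.append(p)
        else:
            if len(cluster) >= min_samples:
                fixations.append(np.mean(cluster, axis=0))
            cluster = [p]
    if len(cluster) >= min_samples:
        fixations.append(np.mean(cluster, axis=0))
    return np.array(fixations)

def gaze_to_prompts(gaze_xy, slice_index):
    """Package fixation centroids as foreground point prompts for one slice."""
    pts = fixations_from_gaze(gaze_xy)
    return {
        "slice": slice_index,
        "points": pts.tolist(),          # (x, y) seeds inside the structure
        "labels": [1] * len(pts),        # 1 = foreground, as with click prompts
    }

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    gaze = np.array([120, 140]) + rng.normal(0, 5, size=(60, 2))  # a dwell
    print(gaze_to_prompts(gaze, slice_index=42))
```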

--------------------------------------------------------------------------------------------------------

Addressing memory bandwidth scalability in vector processors for streaming applications

As AI/ML models and datasets grow exponentially, memory bandwidth becomes the critical bottleneck limiting performance in data-parallel applications. Traditional architectures like GPUs and neural network accelerators, while power-efficient compared to CPUs, still face memory bandwidth constraints when data reuse is limited. This research proposes an extended memory hierarchy with three on-chip memory levels and ultra-wide registers with data-shufflers to improve adaptability across varying applications. The architecture innovations target streaming applications where continuous data flow processing is essential. Applications include real-time video processing, autonomous vehicle sensor fusion, high-frequency trading systems, scientific computing simulations, and edge AI deployment where memory bandwidth optimization directly impacts system performance and energy efficiency.

Authors:  Jordi Altayo, Paul Delestrac, David Novo, Simey Yang, Debjyoti Bhattacharjee, Francky Catthoor

Link:  https://arxiv.org/abs/2505.12856v1

Date: 2025-05-19

Summary:

As the size of artificial intelligence and machine learning (AI/ML) models and datasets grows, memory bandwidth becomes a critical bottleneck. The paper presents a novel extended memory hierarchy that addresses some major memory bandwidth challenges in data-parallel AI/ML applications. While data-parallel architectures like GPUs and neural network accelerators have improved power-performance compared to traditional CPUs, they can still be significantly bottlenecked by their memory bandwidth, especially when the data reuse in the loop kernels is limited. Systolic arrays (SAs) and GPUs attempt to mitigate the memory bandwidth bottleneck but can still become memory bandwidth throttled when the amount of data reuse is not sufficient to confine data access mostly to the local memories near the processing elements. To mitigate this, the proposed architecture introduces three levels of on-chip memory -- local, intermediate, and global -- with an ultra-wide register and data-shufflers to improve versatility and adaptivity to varying data-parallel applications. The paper explains the innovations at a conceptual level and presents a detailed description of the architecture innovations. We also map a representative data-parallel application, such as a convolutional neural network (CNN), to the proposed architecture and quantify the benefits vis-a-vis GPUs and representative accelerators based on systolic arrays and vector processors.

--------------------------------------------------------------------------------------------------------

PEER pressure: Model-to-Model Regularization for Single Source Domain Generalization

Single source domain generalization aims to train models that generalize well to unseen target domains using only one source domain, typically through data augmentation strategies. However, augmentation-based methods suffer from performance fluctuations during training, making model selection challenging in realistic scenarios. PEER introduces a novel approach using a proxy model to learn augmented data while the main model accumulates knowledge through parameter averaging and mutual information maximization. The method achieves state-of-the-art performance with simple random augmentation, surpassing complex augmentation strategies. Applications include medical imaging across different hospitals, autonomous driving in varying weather conditions, and any machine learning deployment where training and deployment environments differ significantly.

Authors:  Dong Kyu Cho, Inwoo Hwang, Sanghack Lee

Link:  https://arxiv.org/abs/2505.12745v1

Date: 2025-05-19

Summary:

Data augmentation is a popular tool for single source domain generalization, which expands the source domain by generating simulated ones, improving generalization on unseen target domains. In this work, we show that the performance of such augmentation-based methods in the target domains universally fluctuates during training, posing challenges in model selection under realistic scenarios. We argue that the fluctuation stems from the inability of the model to accumulate the knowledge learned from diverse augmentations, exacerbating feature distortion during training. Based on this observation, we propose a novel generalization method, coined Parameter-Space Ensemble with Entropy Regularization (PEER), that uses a proxy model to learn the augmented data on behalf of the main model. The main model is updated by averaging its parameters with the proxy model, progressively accumulating knowledge over the training steps. Maximizing the mutual information between the output representations of the two models guides the learning process of the proxy model, mitigating feature distortion during training. Experimental results demonstrate the effectiveness of PEER in reducing the OOD performance fluctuation and enhancing generalization across various datasets, including PACS, Digits, Office-Home, and VLCS. Notably, our method with simple random augmentation achieves state-of-the-art performance, surpassing prior approaches on sDG that utilize complex data augmentation strategies.
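The proxy-and-average mechanic can be sketched in a few lines of PyTorch: the proxy trains on augmented batches while the main model accumulates knowledge via parameter averaging. The cosine-similarity alignment term below stands in for the paper's mutual-information objective and is an assumption for illustration, not PEER's exact loss.

```python
# Minimal sketch of a PEER-style update loop: a proxy model learns from
# augmented batches, and the main model is updated by parameter averaging.
# The alignment term is a simplified stand-in for the MI objective.
import copy
import torch
import torch.nn as nn
import torch.nn.functional as F

def make_model():
    return nn.Sequential(nn.Linear(32, 64), nn.ReLU(), nn.Linear(64, 10))

main = make_model()
proxy = copy.deepcopy(main)
opt = torch.optim.SGD(proxy.parameters(), lr=1e-2)

def random_augment(x):
    return x + 0.1 * torch.randn_like(x)          # simple random augmentation

@torch.no_grad()
def average_into_main(main, proxy, alpha=0.99):
    """main <- alpha * main + (1 - alpha) * proxy (parameter-space ensemble)."""
    for pm, pp in zip(main.parameters(), proxy.parameters()):
        pm.mul_(alpha).add_(pp, alpha=1 - alpha)

for step in range(100):
    x = torch.randn(16, 32)                        # toy source-domain batch
    y = torch.randint(0, 10, (16,))
    x_aug = random_augment(x)

    logits = proxy(x_aug)
    task_loss = F.cross_entropy(logits, y)
    # Alignment term keeps the proxy's outputs close to the main model's,
    # mitigating feature distortion from aggressive augmentation.
    with torch.no_grad():
        target = main(x_aug)
    align_loss = 1 - F.cosine_similarity(logits, target, dim=-1).mean()

    opt.zero_grad()
    (task_loss + 0.1 * align_loss).backward()
    opt.step()
    average_into_main(main, proxy)
```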

--------------------------------------------------------------------------------------------------------

Point, Detect, Count: Multi-Task Medical Image Understanding with Instruction-Tuned Vision-Language Models

Medical image analysis requires simultaneous detection, localization, and counting of findings, mirroring clinical workflows where radiologists perform multiple diagnostic tasks concurrently. This research investigates fine-tuning Vision-Language Models for multi-task medical understanding using MedMultiPoints dataset covering endoscopy and microscopy images. The approach reformulates each task into instruction-based prompts suitable for vision-language reasoning, demonstrating that multi-task training improves robustness and accuracy while revealing trade-offs in edge cases. Applications include automated radiology reporting, pathology analysis, clinical decision support systems, medical education tools, and diagnostic workflow automation where interpretable, structured outputs are essential for healthcare professionals to trust and verify AI-assisted diagnoses.

Authors:  Sushant Gautam, Michael A. Riegler, Pål Halvorsen

Link:  https://arxiv.org/abs/2505.16647v1

Date: 2025-05-22

Summary:

We investigate fine-tuning Vision-Language Models (VLMs) for multi-task medical image understanding, focusing on detection, localization, and counting of findings in medical images. Our objective is to evaluate whether instruction-tuned VLMs can simultaneously improve these tasks, with the goal of enhancing diagnostic accuracy and efficiency. Using MedMultiPoints, a multimodal dataset with annotations from endoscopy (polyps and instruments) and microscopy (sperm cells), we reformulate each task into instruction-based prompts suitable for vision-language reasoning. We fine-tune Qwen2.5-VL-7B-Instruct using Low-Rank Adaptation (LoRA) across multiple task combinations. Results show that multi-task training improves robustness and accuracy. For example, it reduces the Count Mean Absolute Error (MAE) and increases Matching Accuracy in the Counting + Pointing task. However, trade-offs emerge, such as more zero-case point predictions, indicating reduced reliability in edge cases despite overall performance gains. Our study highlights the potential of adapting general-purpose VLMs to specialized medical tasks via prompt-driven fine-tuning. This approach mirrors clinical workflows, where radiologists simultaneously localize, count, and describe findings - demonstrating how VLMs can learn composite diagnostic reasoning patterns. The model produces interpretable, structured outputs, offering a promising step toward explainable and versatile medical AI. Code, model weights, and scripts will be released for reproducibility at https://github.com/simula/PointDetectCount.
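The task reformulation amounts to packaging annotations as instruction-response pairs, as in the sketch below. Field names, prompt wording, and coordinate formats are hypothetical, not the actual MedMultiPoints schema.

```python
# Illustrative sketch of reformulating detection / pointing / counting into
# instruction-style prompt-target pairs for VLM fine-tuning. All field names
# and wording are hypothetical.
import json

def build_example(image_path, task, annotations):
    if task == "count":
        prompt = "How many polyps are visible in this image? Answer with a number."
        target = str(len(annotations))
    elif task == "point":
        prompt = "Point to each polyp. Answer as a JSON list of [x, y] coordinates."
        target = json.dumps([[round(x), round(y)] for x, y in annotations])
    elif task == "count_and_point":
        prompt = ("Count the polyps and point to each one. "
                  "Answer as JSON with keys 'count' and 'points'.")
        target = json.dumps({"count": len(annotations),
                             "points": [[round(x), round(y)] for x, y in annotations]})
    else:
        raise ValueError(task)
    return {"image": image_path, "instruction": prompt, "response": target}

if __name__ == "__main__":
    anns = [(112.3, 88.7), (240.0, 150.5)]
    for task in ("count", "point", "count_and_point"):
        print(build_example("endoscopy_0001.png", task, anns))
```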

--------------------------------------------------------------------------------------------------------

Multi-agent Reinforcement Learning vs. Fixed-Time Control for Traffic Signal Optimization: A Simulation Study

Urban traffic congestion at intersections significantly impacts travel time, fuel consumption, and emissions, while traditional fixed-time signal control systems lack adaptability for dynamic traffic patterns. This study explores multi-agent reinforcement learning for traffic signal coordination across multiple intersections using a simulated environment with randomly generated vehicle flows. Each traffic signal operates as an autonomous agent making decisions based on local observations and information from neighboring agents. The MARL approach demonstrates statistically significant improvements over fixed-time controllers, reducing average waiting times and increasing throughput. Applications include smart city infrastructure, traffic management systems, emergency vehicle routing, and urban planning optimization for reducing congestion and environmental impact.

Authors:  Saahil Mahato

Link:  https://arxiv.org/abs/2505.14544v1

Date: 2025-05-20

Summary:

Urban traffic congestion, particularly at intersections, significantly impacts travel time, fuel consumption, and emissions. Traditional fixed-time signal control systems often lack the adaptability to manage dynamic traffic patterns effectively. This study explores the application of multi-agent reinforcement learning (MARL) to optimize traffic signal coordination across multiple intersections within a simulated environment. Utilizing Pygame, a simulation was developed to model a network of interconnected intersections with randomly generated vehicle flows to reflect realistic traffic variability. A decentralized MARL controller was implemented, in which each traffic signal operates as an autonomous agent, making decisions based on local observations and information from neighboring agents. Performance was evaluated against a baseline fixed-time controller using metrics such as average vehicle wait time and overall throughput. The MARL approach demonstrated statistically significant improvements, including reduced average waiting times and improved throughput. These findings suggest that MARL-based dynamic control strategies hold substantial promise for improving urban traffic management efficiency. More research is recommended to address scalability and real-world implementation challenges.
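A decentralized controller of this kind can be sketched as one small learning agent per intersection, acting epsilon-greedily on a discretized local observation (its own queue plus neighbors' phases). The environment step below is a random placeholder for the Pygame simulation and is not the study's actual setup.

```python
# Toy sketch of a decentralized Q-learning agent per intersection, acting on a
# discretized local observation (own queue bucket + neighbours' phases). The
# environment dynamics are random placeholders.
import random
from collections import defaultdict

ACTIONS = [0, 1]                       # 0 = keep phase, 1 = switch phase

class SignalAgent:
    def __init__(self, eps=0.1, alpha=0.1, gamma=0.95):
        self.q = defaultdict(lambda: [0.0, 0.0])
        self.eps, self.alpha, self.gamma = eps, alpha, gamma

    def act(self, state):
        if random.random() < self.eps:
            return random.choice(ACTIONS)
        return max(ACTIONS, key=lambda a: self.q[state][a])

    def update(self, s, a, r, s_next):
        target = r + self.gamma * max(self.q[s_next])
        self.q[s][a] += self.alpha * (target - self.q[s][a])

def local_state(queue_len, neighbour_phases):
    return (min(queue_len // 5, 4), tuple(neighbour_phases))  # coarse buckets

if __name__ == "__main__":
    agent = SignalAgent()
    s = local_state(queue_len=12, neighbour_phases=[0, 1])
    for _ in range(1000):
        a = agent.act(s)
        # Placeholder dynamics: reward = negative queue length after the action.
        queue = max(0, random.randint(0, 20) - (8 if a == 1 else 3))
        r = -queue
        s_next = local_state(queue, [random.randint(0, 1), random.randint(0, 1)])
        agent.update(s, a, r, s_next)
        s = s_next
    print(dict(list(agent.q.items())[:3]))
```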

--------------------------------------------------------------------------------------------------------

Speculative Decoding Reimagined for Multimodal Large Language Models

Multimodal Large Language Models face significant inference speed bottlenecks that limit their practical deployment, despite speculative decoding's success in accelerating text-only LLMs. Current speculative decoding methods fail to achieve similar speedups for MLLMs due to fundamental differences in processing visual and text tokens. Multimodal Speculative Decoding addresses this by decoupling text and visual token processing in draft models and employing two-stage training that first improves language modeling then gradually introduces multimodal capabilities. The approach achieves up to 2.46× speedup on multimodal benchmarks. Applications include real-time visual question answering, interactive multimedia chatbots, autonomous system perception, and any application requiring fast multimodal reasoning where inference speed directly impacts user experience.

Authors:  Luxi Lin, Zhihang Lin, Zhanpeng Zeng, Rongrong Ji

Link:  https://arxiv.org/abs/2505.14260v1

Date: 2025-05-20

Summary:

This paper introduces Multimodal Speculative Decoding (MSD) to accelerate inference in Multimodal Large Language Models (MLLMs). Speculative decoding has been shown to accelerate Large Language Models (LLMs) without sacrificing accuracy. However, current speculative decoding methods for MLLMs fail to achieve the same speedup as they do for LLMs. To address this, we reimagine speculative decoding specifically for MLLMs. Our analysis of MLLM characteristics reveals two key design principles for MSD: (1) Text and visual tokens have fundamentally different characteristics and need to be processed separately during drafting. (2) Both language modeling ability and visual perception capability are crucial for the draft model. For the first principle, MSD decouples text and visual tokens in the draft model, allowing each to be handled based on its own characteristics. For the second principle, MSD uses a two-stage training strategy: In stage one, the draft model is trained on text-only instruction-tuning datasets to improve its language modeling ability. In stage two, MSD gradually introduces multimodal data to enhance the visual perception capability of the draft model. Experiments show that MSD boosts inference speed by up to 2.29× for LLaVA-1.5-7B and up to 2.46× for LLaVA-1.5-13B on multimodal benchmarks, demonstrating its effectiveness. Our code is available at https://github.com/Lyn-Lucy/MSD.
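The draft-then-verify loop at the heart of speculative decoding can be sketched as follows: a cheap draft model proposes several tokens, the target model checks them, and decoding continues from the first disagreement. Both models below are toy placeholders, the greedy-match acceptance rule is a simplification of the standard probabilistic test, and MSD's decoupling of text and visual tokens is not shown.

```python
# Simplified sketch of speculative decoding: draft k tokens with a cheap model,
# verify with the target model, keep the agreed prefix plus the target's token
# at the first mismatch. Both "models" are toy greedy placeholders.
import numpy as np

rng = np.random.default_rng(0)
VOCAB = 50

def draft_next(tokens):        # placeholder small/fast model (greedy)
    return int((sum(tokens) * 7 + 3) % VOCAB)

def target_next(tokens):       # placeholder large/accurate model (greedy)
    return int((sum(tokens) * 7 + 3) % VOCAB if rng.random() < 0.8
               else rng.integers(VOCAB))

def speculative_decode(prompt, max_new=32, k=4):
    tokens = list(prompt)
    while len(tokens) - len(prompt) < max_new:
        # 1) Draft k tokens autoregressively with the cheap model.
        draft, ctx = [], list(tokens)
        for _ in range(k):
            t = draft_next(ctx)
            draft.append(t)
            ctx.append(t)
        # 2) Verify: accept the longest prefix the target model agrees with,
        #    then append the target's own token at the first disagreement.
        accepted = []
        for t in draft:
            expected = target_next(tokens + accepted)
            if t == expected:
                accepted.append(t)
            else:
                accepted.append(expected)
                break
        tokens.extend(accepted)
    return tokens

if __name__ == "__main__":
    print(speculative_decode(prompt=[1, 2, 3], max_new=16, k=4))
```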

--------------------------------------------------------------------------------------------------------

EVADE: Multimodal Benchmark for Evasive Content Detection in E-Commerce Applications

E-commerce platforms increasingly rely on AI models to detect illicit or misleading product content, but these systems remain vulnerable to evasive content that superficially complies with policies while covertly conveying prohibited claims. Unlike traditional adversarial attacks, evasive content exploits ambiguity and context, making detection exceptionally challenging. EVADE introduces the first expert-curated, Chinese, multimodal benchmark specifically designed for evasive content detection, containing 2,833 text samples and 13,961 images across six demanding product categories. The benchmark reveals substantial performance gaps in state-of-the-art models. Applications include content moderation systems, regulatory compliance automation, consumer protection platforms, and advertising verification where detecting subtle policy violations is crucial for maintaining platform integrity.

Authors:  Ancheng Xu, Zhihao Yang, Jingpeng Li, Guanghu Yuan, Longze Chen, Liang Yan, Jiehui Zhou, Zhen Qin, Hengyun Chang, Hamid Alinejad-Rokny, Bo Zheng, Min Yang

Link:  https://arxiv.org/abs/2505.17654v1

Date: 2025-05-23

Summary:

E-commerce platforms increasingly rely on Large Language Models (LLMs) and Vision-Language Models (VLMs) to detect illicit or misleading product content. However, these models remain vulnerable to evasive content: inputs (text or images) that superficially comply with platform policies while covertly conveying prohibited claims. Unlike traditional adversarial attacks that induce overt failures, evasive content exploits ambiguity and context, making it far harder to detect. Existing robustness benchmarks provide little guidance for this demanding, real-world challenge. We introduce EVADE, the first expert-curated, Chinese, multimodal benchmark specifically designed to evaluate foundation models on evasive content detection in e-commerce. The dataset contains 2,833 annotated text samples and 13,961 images spanning six demanding product categories, including body shaping, height growth, and health supplements. Two complementary tasks assess distinct capabilities: Single-Violation, which probes fine-grained reasoning under short prompts, and All-in-One, which tests long-context reasoning by merging overlapping policy rules into unified instructions. Notably, the All-in-One setting significantly narrows the performance gap between partial and full-match accuracy, suggesting that clearer rule definitions improve alignment between human and model judgment. We benchmark 26 mainstream LLMs and VLMs and observe substantial performance gaps: even state-of-the-art models frequently misclassify evasive samples. By releasing EVADE and strong baselines, we provide the first rigorous standard for evaluating evasive-content detection, expose fundamental limitations in current multimodal reasoning, and lay the groundwork for safer and more transparent content moderation systems in e-commerce. The dataset is publicly available at https://huggingface.co/datasets/koenshen/EVADE-Bench.

--------------------------------------------------------------------------------------------------------

Delving into RL for Image Generation with CoT: A Study on DPO vs. GRPO

Reinforcement Learning has shown significant promise in enhancing Chain-of-Thought reasoning capabilities in language models, with Direct Preference Optimization and Group Relative Policy Optimization emerging as prominent algorithms. Autoregressive image generation presents unique challenges distinct from LLM-based reasoning, including text-image consistency, aesthetic quality, and sophisticated reward model design. This comprehensive investigation compares GRPO and DPO algorithms in autoregressive image generation, evaluating in-domain performance and out-of-domain generalization while examining the impact of different reward models. Applications include creative content generation, advertising imagery, scientific visualization, educational materials, and any domain requiring controllable, high-quality image synthesis where systematic optimization of generation quality is essential.

Authors:  Chengzhuo Tong, Ziyu Guo, Renrui Zhang, Wenyu Shan, Xinyu Wei, Zhenghao Xing, Hongsheng Li, Pheng-Ann Heng

Link:  https://arxiv.org/abs/2505.17017v1

Date: 2025-05-22

Summary:

Recent advancements underscore the significant role of Reinforcement Learning (RL) in enhancing the Chain-of-Thought (CoT) reasoning capabilities of large language models (LLMs). Two prominent RL algorithms, Direct Preference Optimization (DPO) and Group Relative Policy Optimization (GRPO), are central to these developments, showcasing different pros and cons. Autoregressive image generation, also interpretable as a sequential CoT reasoning process, presents unique challenges distinct from LLM-based CoT reasoning. These encompass ensuring text-image consistency, improving image aesthetic quality, and designing sophisticated reward models, rather than relying on simpler rule-based rewards. While recent efforts have extended RL to this domain, these explorations typically lack an in-depth analysis of the domain-specific challenges and the characteristics of different RL strategies. To bridge this gap, we provide the first comprehensive investigation of the GRPO and DPO algorithms in autoregressive image generation, evaluating their in-domain performance and out-of-domain generalization, while scrutinizing the impact of different reward models on their respective capabilities. Our findings reveal that GRPO and DPO exhibit distinct advantages, and crucially, that reward models possessing stronger intrinsic generalization capabilities potentially enhance the generalization potential of the applied RL algorithms. Furthermore, we systematically explore three prevalent scaling strategies to enhance both their in-domain and out-of-domain proficiency, deriving unique insights into efficiently scaling performance for each paradigm. We hope our study paves a new path for inspiring future work on developing more effective RL algorithms to achieve robust CoT reasoning in the realm of autoregressive image generation. Code is released at https://github.com/ZiyuGuo99/Image-Generation-CoT
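For reference, the DPO objective compared in the study reduces to a simple pairwise loss over policy-versus-reference log-probability ratios, sketched below with toy numbers; GRPO's group-relative advantage estimation is not shown, and the values are placeholders rather than results from the paper.

```python
# Sketch of the standard DPO loss computed from sequence log-probabilities of
# the chosen and rejected generations under the trainable policy and a frozen
# reference model. The numbers below are toy placeholders.
import torch
import torch.nn.functional as F

def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected,
             beta=0.1):
    """-log sigmoid(beta * [(policy - ref) margin on chosen minus on rejected])."""
    chosen_ratio = logp_chosen - ref_logp_chosen
    rejected_ratio = logp_rejected - ref_logp_rejected
    return -F.logsigmoid(beta * (chosen_ratio - rejected_ratio)).mean()

if __name__ == "__main__":
    # Toy per-sample sequence log-probs (e.g., summed over generated tokens).
    logp_c = torch.tensor([-120.0, -98.5])
    logp_r = torch.tensor([-125.0, -97.0])
    ref_c = torch.tensor([-121.0, -99.0])
    ref_r = torch.tensor([-124.0, -98.0])
    print(dpo_loss(logp_c, logp_r, ref_c, ref_r))
```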

--------------------------------------------------------------------------------------------------------

Fashion Industry in the Age of Generative Artificial Intelligence and Metaverse: A Systematic Review

The trillion-dollar fashion industry encompasses apparel, footwear, and accessories production and distribution, facing transformation through Generative Artificial Intelligence and metaverse technologies. This systematic literature review analyzes 118 papers from 2014-2023 using PRISMA methodology to examine the research landscape surrounding GAI and metaverse integration in fashion. SWOT analysis reveals that combining these technologies can revolutionize manufacturing, design, sales, and customer experiences. The research proposes a new framework integrating GAI and metaverse to enhance the fashion industry through various use cases. Applications include virtual try-on systems, personalized design generation, virtual fashion shows, digital clothing for avatars, sustainable production optimization, and immersive shopping experiences that bridge physical and digital fashion retail.

Authors:  Rania Ahmed, Eman Ahmed, Ahmed Elbarbary, Ashraf Darwish, Aboul Ella Hassanien

Link:  https://arxiv.org/abs/2505.17141v1

Date: 2025-05-22

Summary:

The fashion industry is an extremely profitable market that generates trillions of dollars in revenue by producing and distributing apparel, footwear, and accessories. This systematic literature review (SLR) seeks to systematically review and analyze the research landscape around Generative Artificial Intelligence (GAI) and the metaverse in the fashion industry, investigating the impact of integrating both technologies to enhance the fashion industry. The review uses the Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) methodology, which includes three essential phases: identification, evaluation, and reporting. In the identification phase, the target search problems are determined by selecting appropriate keywords and alternative synonyms. After that, 578 documents published from 2014 to the end of 2023 are retrieved. The evaluation phase applies three screening steps to assess papers and choose 118 eligible papers for full-text reading. Finally, the reporting phase thoroughly examines and synthesizes the 118 eligible papers to identify key themes associated with GAI and the metaverse in the fashion industry. Based on Strengths, Weaknesses, Opportunities, and Threats (SWOT) analyses performed for both GAI and the metaverse in the fashion industry, it is concluded that their integration holds the capacity to profoundly revolutionize the fashion sector, presenting opportunities for improved manufacturing, design, sales, and client experiences. Accordingly, the research proposes a new framework that integrates GAI and the metaverse to enhance the fashion industry. The framework presents different use cases to promote the fashion industry through this integration. Future research directions for achieving a successful integration are outlined.

--------------------------------------------------------------------------------------------------------

DEBATE, TRAIN, EVOLVE: Self-Evolution of Language Model Reasoning

Large language models have achieved significant reasoning improvements through extensive training on massive datasets, but relying solely on additional data is becoming increasingly impractical. Debate, Train, Evolve introduces a novel ground truth-free training framework using multi-agent debate traces to evolve single language models autonomously. The approach incorporates Reflect-Critique-Refine prompting strategy to improve debate quality through explicit reasoning critique and refinement. Extensive evaluations show substantial improvements with 8.92% average accuracy gain on challenging datasets and strong cross-domain generalization. Applications include automated reasoning systems, educational AI tutors, legal analysis tools, scientific research assistance, and any domain requiring sophisticated logical reasoning where models must continuously improve their capabilities without external supervision or additional training data.

Authors:  Gaurav Srivastava, Zhenyu Bi, Meng Lu, Xuan Wang

Link:  https://arxiv.org/abs/2505.15734v1

Date: 2025-05-21

Summary:

Large language models (LLMs) have improved significantly in their reasoning through extensive training on massive datasets. However, relying solely on additional data for improvement is becoming increasingly impractical, highlighting the need for models to autonomously enhance their reasoning without external supervision. In this paper, we propose Debate, Train, Evolve (DTE), a novel ground truth-free training framework that uses multi-agent debate traces to evolve a single language model. We also introduce a new prompting strategy, Reflect-Critique-Refine, to improve debate quality by explicitly instructing agents to critique and refine their reasoning. Extensive evaluations on five reasoning benchmarks with six open-weight models show that our DTE framework achieves substantial improvements, with an average accuracy gain of 8.92% on the challenging GSM-PLUS dataset. Furthermore, we observe strong cross-domain generalization, with an average accuracy gain of 5.8% on all other benchmarks, suggesting that our method captures general reasoning capabilities.
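A rough sketch of how debate traces become ground-truth-free training data is given below: agents exchange answers over a few rounds, a consensus is taken, and the (question, consensus, trace) triple becomes a fine-tuning example. The agent call, the majority-vote consensus rule, and the prompt wording are placeholders, not the paper's exact procedure.

```python
# Illustrative sketch of harvesting multi-agent debate traces into training
# examples. The agent call is a placeholder for an LLM queried with a
# Reflect-Critique-Refine style prompt; consensus here is simple majority vote.
from collections import Counter

def agent_answer(question, peer_answers, round_idx):
    """Placeholder for one debate agent; a real system would call an LLM and
    show it the peers' reasoning to critique and refine."""
    return f"answer_{(len(question) + len(peer_answers) + round_idx) % 3}"

def run_debate(question, n_agents=3, n_rounds=2):
    answers = ["" for _ in range(n_agents)]
    trace = []
    for r in range(n_rounds):
        new_answers = []
        for i in range(n_agents):
            peers = [a for j, a in enumerate(answers) if j != i and a]
            new_answers.append(agent_answer(question, peers, r))
        answers = new_answers
        trace.append(list(answers))
    consensus, _ = Counter(answers).most_common(1)[0]   # majority vote
    return consensus, trace

def debate_to_training_example(question):
    consensus, trace = run_debate(question)
    # The single model is later fine-tuned to produce the consensus answer
    # directly, without running the debate at inference time.
    return {"prompt": question, "target": consensus, "debate_trace": trace}

if __name__ == "__main__":
    print(debate_to_training_example("If 3x + 2 = 11, what is x?"))
```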

--------------------------------------------------------------------------------------------------------


EYE ON A.I. GETS READERS UP TO DATE ON THE LATEST FUNDING NEWS AND RELATED ISSUES. SUBSCRIBE FOR THE WEEKLY NEWSLETTER.