Week Ending 6.8.2025
RESEARCH WATCH: 6.8.2025
Vision-Language Models (VLMs) represent a breakthrough in AI's ability to understand both visual and textual information simultaneously, but their computational demands have limited deployment in resource-constrained environments. This research addresses a critical bottleneck in real-world applications like autonomous vehicles and robotics, where processing power and latency constraints are paramount. The proposed LiteVLM pipeline tackles three key optimization areas: intelligent patch selection to filter irrelevant visual data, token reduction to minimize language processing overhead, and speculative decoding for faster response generation. With demonstrated 2.5x speed improvements on automotive platforms, this work opens pathways for deploying sophisticated AI reasoning capabilities in edge devices, mobile robotics, and embedded systems where traditional cloud-based processing isn't feasible.
Authors: Jin Huang, Yuchao Jin, Le An, Josh Park
Link: https://arxiv.org/abs/2506.07416v1
Date: 2025-06-09
Summary:
This paper introduces an efficient Vision-Language Model (VLM) pipeline specifically optimized for deployment on embedded devices, such as those used in robotics and autonomous driving. The pipeline significantly reduces the computational overhead by jointly leveraging patch selection to filter irrelevant camera views, a token selection module to reduce input sequence length for the LLM, and speculative decoding to accelerate token generation. Evaluated on the NVIDIA DRIVE Thor platform for an autonomous driving application, our pipeline achieves a $2.5\times$ end-to-end latency reduction without compromising task accuracy. The speed-up further increases to $3.2\times$ when applying FP8 post-training quantization. These results demonstrate our pipeline as a viable solution for enabling real-time VLM deployment in resource-constrained environments.
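The three stages compose naturally in code. Below is a minimal sketch under stated assumptions: the scoring functions, draft model, and target model are placeholders (LiteVLM's actual selectors are not described in this summary), and the verification loop calls the target once per token for clarity, whereas real speculative decoding verifies all draft tokens in a single batched target forward pass, which is where the speed-up comes from.

```python
# Hypothetical sketch of a patch-select -> token-select -> speculative-decode pipeline.
# All models and scoring functions are placeholders, not LiteVLM's components.
import numpy as np

def select_patches(patches, scores, keep_ratio=0.5):
    """Keep only the highest-scoring image patches (e.g., views relevant to the task)."""
    k = max(1, int(len(patches) * keep_ratio))
    idx = np.argsort(scores)[-k:]
    return [patches[i] for i in sorted(idx)]

def select_tokens(tokens, importance, max_len=256):
    """Truncate the LLM input sequence to the most important visual/text tokens."""
    idx = np.argsort(importance)[-max_len:]
    return [tokens[i] for i in sorted(idx)]

def speculative_decode(target_step, draft_step, prefix, n_new=32, draft_k=4):
    """Greedy speculative decoding sketch: the draft proposes draft_k tokens and the
    target keeps the longest matching prefix plus one corrected token."""
    out = list(prefix)
    while len(out) - len(prefix) < n_new:
        ctx, proposal = list(out), []
        for _ in range(draft_k):
            t = draft_step(ctx)
            proposal.append(t)
            ctx.append(t)
        for t in proposal:
            expected = target_step(out)   # real systems verify the whole draft in one pass
            if t == expected:
                out.append(t)
            else:
                out.append(expected)      # reject the remainder of the draft
                break
    return out
```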
--------------------------------------------------------------------------------------------------------
Improving LLM Reasoning through Interpretable Role-Playing Steering
Role-playing has emerged as a powerful technique for enhancing Large Language Model performance, but traditional prompt-based approaches lack consistency and interpretability. This research introduces a novel framework that moves beyond surface-level prompting to directly manipulate the internal neural representations associated with role-playing behavior. By identifying and steering specific model features, researchers can achieve more reliable and controllable performance improvements. The implications extend far beyond academic benchmarks—this technology could enable more consistent AI assistants, specialized professional AI tools, and educational applications where maintaining specific expertise personas is crucial. The interpretability aspect is particularly valuable for safety-critical applications, allowing developers to understand and control how role information influences model behavior, potentially leading to more trustworthy AI systems in domains requiring specialized knowledge.
Authors: Anyi Wang, Dong Shu, Yifan Wang, Yunpu Ma, Mengnan Du
Link: https://arxiv.org/abs/2506.07335v1
Date: 2025-06-09
Summary:
Role-playing has emerged as an effective technique for enhancing the reasoning capabilities of large language models (LLMs). However, existing methods primarily rely on prompt engineering, which often lacks stability and interpretability. In this paper, we introduce Sparse Autoencoder Role-Playing Steering (SRPS), a novel framework that identifies and manipulates internal model features associated with role-playing behavior. Our approach extracts latent representations from role-play prompts, selects the most relevant features based on activation patterns, and constructs a steering vector that can be injected into the model's residual stream with controllable intensity. Our method enables fine-grained control over role-specific behavior and offers insights into how role information influences internal model activations. Extensive experiments across various reasoning benchmarks and model sizes demonstrate consistent performance gains. Notably, in the zero-shot chain-of-thought (CoT) setting, the accuracy of Llama3.1-8B on CSQA improves from 31.86% to 39.80%, while Gemma2-9B on SVAMP increases from 37.50% to 45.10%. These results highlight the potential of SRPS to enhance reasoning ability in LLMs, providing better interpretability and stability compared to traditional prompt-based role-playing.
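Residual-stream steering of this kind is typically implemented with a forward hook that adds a fixed vector, scaled by an intensity coefficient, to one layer's hidden states. A minimal PyTorch sketch under that assumption follows; the SAE feature selection that would produce the steering vector, and the layer/model names, are not shown and are assumptions.

```python
# Hypothetical sketch: inject a precomputed steering vector into a transformer layer's
# residual stream with controllable intensity alpha. The vector would come from SAE
# features selected on role-play prompts (not reproduced here).
import torch

def add_steering_hook(layer: torch.nn.Module, steering_vec: torch.Tensor, alpha: float):
    """Register a forward hook that adds alpha * steering_vec to the layer output."""
    def hook(module, inputs, output):
        # Some transformer blocks return tuples; steer only the hidden-state tensor.
        hidden = output[0] if isinstance(output, tuple) else output
        steered = hidden + alpha * steering_vec.to(hidden.dtype).to(hidden.device)
        return ((steered,) + output[1:]) if isinstance(output, tuple) else steered
    return layer.register_forward_hook(hook)

# Usage sketch (layer index and attribute path are placeholders):
# handle = add_steering_hook(model.model.layers[15], role_vec, alpha=4.0)
# ... run generation with the role-steered model ...
# handle.remove()
```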
--------------------------------------------------------------------------------------------------------
Efficient Text-Attributed Graph Learning through Selective Annotation and Graph Alignment
Text-Attributed Graphs (TAGs) represent complex real-world networks where nodes contain rich textual information—think social media profiles, research paper citations, or product descriptions in recommendation systems. Traditional approaches require extensive manual annotation of every node, making them prohibitively expensive for large-scale applications. This research introduces GAGA, a revolutionary approach that achieves comparable performance while annotating only 1% of the data. The framework's efficiency stems from intelligent selection of representative nodes and sophisticated alignment techniques that propagate knowledge across the network. Applications span social network analysis, academic paper classification, e-commerce recommendation systems, and knowledge graph construction. The dramatic reduction in annotation requirements makes previously impossible large-scale projects feasible, potentially accelerating research in network science, social media analysis, and automated content categorization across industries.
Authors: Huanyi Xie, Lijie Hu, Lu Yu, Tianhao Huang, Longfei Li, Meng Li, Jun Zhou, Huan Wang, Di Wang
Link: https://arxiv.org/abs/2506.07168v1
Date: 2025-06-08
Summary:
In the realm of Text-attributed Graphs (TAGs), traditional graph neural networks (GNNs) often fall short due to the complex textual information associated with each node. Recent methods have improved node representations by leveraging large language models (LLMs) to enhance node text features, but these approaches typically require extensive annotations or fine-tuning across all nodes, which is both time-consuming and costly. To overcome these challenges, we introduce GAGA, an efficient framework for TAG representation learning. GAGA reduces annotation time and cost by focusing on annotating only representative nodes and edges. It constructs an annotation graph that captures the topological relationships among these annotations. Furthermore, GAGA employs a two-level alignment module to effectively integrate the annotation graph with the TAG, aligning their underlying structures. Experiments show that GAGA achieves classification accuracies on par with or surpassing state-of-the-art methods while requiring only 1% of the data to be annotated, demonstrating its high efficiency.
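One way to realize the "annotate only representative nodes" idea is to cluster node text embeddings and send only the medoid of each cluster to the annotator; this is a speculative sketch of that selection step only, since GAGA's actual selection rule, annotation graph construction, and two-level alignment are not specified in the summary.

```python
# Hypothetical sketch: pick a small annotation budget of representative TAG nodes by
# clustering their text embeddings and choosing the medoid of each cluster.
import numpy as np
from sklearn.cluster import KMeans

def pick_representative_nodes(node_embeddings: np.ndarray, budget: int) -> list[int]:
    km = KMeans(n_clusters=budget, n_init=10, random_state=0).fit(node_embeddings)
    chosen = []
    for c in range(budget):
        members = np.where(km.labels_ == c)[0]
        dists = np.linalg.norm(node_embeddings[members] - km.cluster_centers_[c], axis=1)
        chosen.append(int(members[np.argmin(dists)]))   # medoid of each cluster
    return chosen

# With a 1% budget on a 100k-node graph, only ~1,000 nodes ever reach the annotator.
```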
--------------------------------------------------------------------------------------------------------
What makes Reasoning Models Different? Follow the Reasoning Leader for Efficient Decoding
Large Reasoning Models achieve impressive performance by generating extensive chains of thought, but this verbosity creates significant computational overhead and can lead to unnecessary detail that slows inference. This research provides crucial insights into how reasoning models behave differently from standard language models, revealing two key phenomena: persistent divergence patterns and localized alignment variations within sentences. The proposed FoReaL-Decoding method leverages these insights to create a collaborative approach where a powerful model guides initial reasoning steps while a smaller model completes the details. With 30-50% computational savings and up to 40% reduction in response length while maintaining 86-100% performance, this work addresses a critical efficiency bottleneck. Applications include cost-effective deployment of reasoning AI in educational tools, mathematical problem-solving systems, and automated code generation where both accuracy and speed matter.
Authors: Ming Li, Zhengyuan Yang, Xiyao Wang, Dianqi Li, Kevin Lin, Tianyi Zhou, Lijuan Wang
Link: https://arxiv.org/abs/2506.06998v1
Date: 2025-06-08
Summary:
Large reasoning models (LRMs) achieve strong reasoning performance by emitting long chains of thought. Yet, these verbose traces slow down inference and often drift into unnecessary detail, known as the overthinking phenomenon. To better understand LRMs' behavior, we systematically analyze the token-level misalignment between reasoning and non-reasoning models. While it is expected that their primary difference lies in the stylistic "thinking cues", LRMs uniquely exhibit two pivotal, previously under-explored phenomena: a Global Misalignment Rebound, where their divergence from non-reasoning models persists or even grows as response length increases, and more critically, a Local Misalignment Diminish, where the misalignment concentrates at the "thinking cues" each sentence starts with but rapidly declines in the remainder of the sentence. Motivated by the Local Misalignment Diminish, we propose FoReaL-Decoding, a collaborative fast-slow thinking decoding method for cost-quality trade-off. In FoReaL-Decoding, a Leading model leads the first few tokens for each sentence, and then a weaker draft model completes the following tokens to the end of each sentence. FoReaL-Decoding adopts a stochastic gate to smoothly interpolate between the small and the large model. On four popular math-reasoning benchmarks (AIME24, GPQA-Diamond, MATH500, AMC23), FoReaL-Decoding reduces theoretical FLOPs by 30 to 50% and trims CoT length by up to 40%, while preserving 86 to 100% of model performance. These results establish FoReaL-Decoding as a simple, plug-and-play route to controllable cost-quality trade-offs in reasoning-centric tasks.
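The sentence-level fast-slow handoff is straightforward to sketch. The snippet below assumes greedy single-token step functions for both models and a fixed lead length and gate probability; the actual FoReaL-Decoding schedule and gate parameterization may differ.

```python
# Hypothetical sketch of sentence-level fast-slow decoding: the strong "leading" model
# writes the first lead_len tokens of each sentence, then a stochastic gate decides
# whether the cheap draft model finishes that sentence.
import random

SENT_END = {".", "!", "?", "\n"}

def foreal_decode(lead_step, draft_step, prompt_tokens, lead_len=4, p_draft=0.8,
                  max_tokens=256):
    out = list(prompt_tokens)
    pos = 0                                          # position within the current sentence
    use_draft = random.random() < p_draft            # gate drawn once per sentence
    for _ in range(max_tokens):
        step = lead_step if (pos < lead_len or not use_draft) else draft_step
        tok = step(out)
        out.append(tok)
        pos += 1
        if tok in SENT_END:                          # sentence finished: reset and re-gate
            pos = 0
            use_draft = random.random() < p_draft
    return out
```

Lowering p_draft shifts the cost-quality trade-off toward the strong model; raising it saves more compute.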
--------------------------------------------------------------------------------------------------------
The integration of satellite networks and high-altitude platforms with terrestrial 5G/6G infrastructure represents the future of global connectivity, but these systems face unique energy and computational constraints. This research tackles the complex optimization challenge of dynamically splitting network functions between centralized and distributed units in space-based networks. Using Deep Q-Network reinforcement learning, the system intelligently adapts to changing traffic patterns, network conditions, and power limitations—critical factors for satellite and aerial platform operations. Applications include global internet coverage through satellite constellations, emergency communications in disaster zones, rural connectivity solutions, and military communications. The energy efficiency focus is particularly crucial for space-based platforms where power is extremely limited. This work contributes to making ubiquitous global connectivity economically and technically feasible.
Authors: S. M. Mahdi Shahabi, Xiaonan Deng, Ahmad Qidan, Taisir Elgorashi, Jaafar Elmirghani
Link: https://arxiv.org/abs/2506.06876v1
Date: 2025-06-07
Summary:
This paper explores the integration of Open Radio Access Network (O-RAN) principles with non-terrestrial networks (NTN) and investigates the optimization of the functional split between Centralized Units (CU) and Distributed Units (DU) to improve energy efficiency in dynamic network environments. Given the inherent constraints of NTN platforms, such as Low Earth Orbit (LEO) satellites and high-altitude platform stations (HAPS), we propose a reinforcement learning-based framework utilizing Deep Q-Network (DQN) to intelligently determine the optimal RAN functional split. The proposed approach dynamically adapts to real-time fluctuations in traffic demand, network conditions, and power limitations, ensuring efficient resource allocation and enhanced system performance. The numerical results demonstrate that the proposed policy effectively adapts to network traffic flow by selecting an efficient network disaggregation strategy and corresponding functional split option based on data rate and latency requirements.
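At its core, the agent maps a network state (traffic load, link quality, available power) to a discrete functional-split action. The sketch below uses an illustrative state vector, split options, and reward; none of these are the paper's exact definitions.

```python
# Hypothetical sketch: epsilon-greedy DQN action selection over CU/DU functional splits.
# State features, split options, and the reward comment are placeholders.
import random
import torch
import torch.nn as nn

SPLIT_OPTIONS = ["opt2", "opt6", "opt7.2", "opt8"]   # illustrative O-RAN split options

class QNet(nn.Module):
    def __init__(self, state_dim=3, n_actions=len(SPLIT_OPTIONS)):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(state_dim, 64), nn.ReLU(),
                                 nn.Linear(64, n_actions))
    def forward(self, s):
        return self.net(s)

def select_split(qnet, state, eps=0.1):
    """Epsilon-greedy choice of functional split for the current network state."""
    if random.random() < eps:
        return random.randrange(len(SPLIT_OPTIONS))
    with torch.no_grad():
        return int(qnet(torch.tensor(state, dtype=torch.float32)).argmax())

# The reward would combine energy efficiency with rate/latency constraint satisfaction,
# e.g. r = throughput / power - penalty * latency_violation (illustrative only).
```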
--------------------------------------------------------------------------------------------------------
SAFE: Finding Sparse and Flat Minima to Improve Pruning
Neural network pruning—removing unnecessary connections to create smaller, faster models—typically suffers from performance degradation that's difficult to recover. This research draws inspiration from robust optimization theory to address a fundamental challenge: finding network configurations that are both sparse (few connections) and flat (stable to perturbations). The SAFE framework formulates pruning as a constrained optimization problem that explicitly encourages flatness, leading to more robust compressed models. Applications span mobile AI deployment, edge computing, IoT devices, and embedded systems where model size and memory constraints are critical. The demonstrated resilience to noisy data makes this particularly valuable for real-world deployment scenarios. With growing demand for efficient AI in smartphones, autonomous vehicles, and industrial sensors, this work provides a principled approach to creating reliable, compact models without sacrificing performance or robustness.
Authors: Dongyeop Lee, Kwanhee Lee, Jinseok Chung, Namhoon Lee
Link: https://arxiv.org/abs/2506.06866v1
Date: 2025-06-07
Summary:
Sparsifying neural networks often suffers from seemingly inevitable performance degradation, and it remains challenging to restore the original performance despite much recent progress. Motivated by recent studies in robust optimization, we aim to tackle this problem by finding subnetworks that are both sparse and flat at the same time. Specifically, we formulate pruning as a sparsity-constrained optimization problem where flatness is encouraged as an objective. We solve it explicitly via an augmented Lagrangian dual approach and extend it further by proposing a generalized projection operation, resulting in novel pruning methods called SAFE and its extension, SAFE$^+$. Extensive evaluations on standard image classification and language modeling tasks reveal that SAFE consistently yields sparse networks with improved generalization performance, which compares competitively to well-established baselines. In addition, SAFE demonstrates resilience to noisy data, making it well-suited for real-world conditions.
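The "sparse and flat" objective can be illustrated with a toy update that pairs a sharpness-aware (SAM-style) perturbed gradient step with a hard projection onto the sparsity constraint. This is only the simplified core idea: SAFE itself solves the constrained problem with an augmented Lagrangian and a generalized projection, not the plain magnitude thresholding shown here.

```python
# Toy sketch: flatness-seeking gradient step followed by projection onto a sparsity
# constraint (keep the top-k entries by magnitude). Assumes `params` is a single
# tensor with requires_grad=True and `loss_fn` maps parameters to a scalar loss.
import torch

def sparse_flat_step(params, loss_fn, lr=0.1, rho=0.05, keep_ratio=0.1):
    # 1) ascend to the local worst case within an L2 ball of radius rho (flatness)
    loss = loss_fn(params)
    (g,) = torch.autograd.grad(loss, params)
    eps = rho * g / (g.norm() + 1e-12)
    (g_sam,) = torch.autograd.grad(loss_fn(params + eps), params)
    # 2) descend using the perturbed (sharpness-aware) gradient
    new_params = (params - lr * g_sam).detach()
    # 3) project onto the sparsity constraint: zero all but the largest-magnitude entries
    k = max(1, int(keep_ratio * new_params.numel()))
    thresh = new_params.abs().flatten().kthvalue(new_params.numel() - k + 1).values
    new_params = torch.where(new_params.abs() >= thresh, new_params,
                             torch.zeros_like(new_params))
    return new_params.requires_grad_(True)
```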
--------------------------------------------------------------------------------------------------------
Curriculum Reinforcement Learning from Easy to Hard Tasks Improves LLM Reasoning
Large Language Models struggle with complex reasoning tasks when trained directly on difficult problems, similar to how humans learn better through progressive difficulty. This research introduces E2H Reasoner, a curriculum-based reinforcement learning approach that gradually increases task difficulty, allowing models to build reasoning skills incrementally. The theoretical framework provides convergence guarantees and demonstrates that structured curriculum learning requires fewer training samples than direct approaches. Applications include mathematical education systems, coding assistants, logical reasoning tools, and problem-solving AI. The method's particular effectiveness for smaller models (1.5B-3B parameters) is significant for democratizing advanced AI capabilities, making sophisticated reasoning accessible in resource-constrained environments. This work addresses the critical challenge of making reasoning AI more sample-efficient and reliable, potentially accelerating the development of educational AI tutors and automated reasoning systems across various domains.
Authors: Shubham Parashar, Shurui Gui, Xiner Li, Hongyi Ling, Sushil Vemuri, Blake Olson, Eric Li, Yu Zhang, James Caverlee, Dileep Kalathil, Shuiwang Ji
Link: https://arxiv.org/abs/2506.06632v1
Date: 2025-06-07
Summary:
We aim to improve the reasoning capabilities of language models via reinforcement learning (RL). Recent RL post-trained models like DeepSeek-R1 have demonstrated reasoning abilities on mathematical and coding tasks. However, prior studies suggest that using RL alone to improve reasoning on inherently difficult tasks is less effective. Here, we draw inspiration from curriculum learning and propose to schedule tasks from easy to hard (E2H), allowing LLMs to build reasoning skills gradually. Our method is termed E2H Reasoner. Empirically, we observe that, although easy tasks are important initially, fading them out through appropriate scheduling is essential in preventing overfitting. Theoretically, we establish convergence guarantees for E2H Reasoner within an approximate policy iteration framework. We derive finite-sample complexity bounds and show that when tasks are appropriately decomposed and conditioned, learning through curriculum stages requires fewer total samples than direct learning. Experiments across multiple domains show that E2H Reasoner significantly improves the reasoning ability of small LLMs (1.5B to 3B), which otherwise struggle when trained with vanilla RL alone, highlighting the effectiveness of our method.
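The scheduling idea is simple to express: sample training tasks from difficulty buckets whose weights drift from easy toward hard as training progresses, so easy tasks bootstrap learning early and are faded out later. The schedule below is an illustrative assumption, not E2H Reasoner's exact curriculum.

```python
# Hypothetical sketch of an easy-to-hard curriculum sampler with fading.
import random

def curriculum_sample(buckets, step, total_steps):
    """buckets: list of task lists ordered from easiest to hardest."""
    progress = min(1.0, step / total_steps)
    n = len(buckets)
    # weight bucket i by how close its normalized difficulty is to current progress
    weights = [max(1e-3, 1.0 - abs(i / max(n - 1, 1) - progress)) for i in range(n)]
    bucket = random.choices(buckets, weights=weights, k=1)[0]
    return random.choice(bucket)

# Early in training (progress ~ 0) easy buckets dominate; late (progress ~ 1) hard ones do,
# and the easy tasks are effectively faded out, which the paper finds prevents overfitting.
```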
--------------------------------------------------------------------------------------------------------
WhisQ: Cross-Modal Representation Learning for Text-to-Music MOS Prediction
Evaluating AI-generated music requires assessing both musical quality and alignment with text prompts—a complex challenge as music generation technology advances. WhisQ addresses this dual evaluation problem through sophisticated cross-modal learning that understands both audio and text representations. The framework employs temporal audio encoding and text understanding with specialized attention mechanisms for fine-grained alignment assessment. Applications include music production tools, content creation platforms, automated music quality assessment, and copyright evaluation systems. With the rapid growth of AI music generation in entertainment, advertising, and personal content creation, reliable evaluation metrics become crucial for industry adoption. The substantial improvements in correlation scores demonstrate practical value for music streaming platforms, AI music generators, and creative software companies seeking to automatically assess generated content quality and prompt adherence.
Authors: Jakaria Islam Emon, Kazi Tamanna Alam, Md. Abu Salek
Link: https://arxiv.org/abs/2506.05899v1
Date: 2025-06-06
Summary:
Mean Opinion Score (MOS) prediction for text-to-music systems requires evaluating both overall musical quality and text prompt alignment. This paper introduces WhisQ, a multimodal architecture that addresses this dual-assessment challenge through sequence-level co-attention and optimal transport regularization. WhisQ employs the Whisper Base pretrained model for temporal audio encoding and Qwen 3, a 0.6B Small Language Model (SLM), for text encoding, with both maintaining sequence structure for fine-grained cross-modal modeling. The architecture features specialized prediction pathways: overall music quality (OMQ) is predicted from pooled audio embeddings, while text alignment (TA) leverages bidirectional sequence co-attention between audio and text. A Sinkhorn optimal transport loss further enforces semantic alignment in the shared embedding space. On the MusicEval Track-1 dataset, WhisQ achieves substantial improvements over the baseline: 7% improvement in Spearman correlation for OMQ and 14% for TA. Ablation studies reveal that optimal transport regularization provides the largest performance gain (10% SRCC improvement), demonstrating the importance of explicit cross-modal alignment for text-to-music evaluation.
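Since the ablation credits the Sinkhorn optimal transport regularizer with the largest gain, a minimal version of that loss is worth sketching. The cosine cost, uniform marginals, and temperature below are assumptions; WhisQ's exact cost and weighting are not specified in the summary.

```python
# Minimal sketch of an entropic (Sinkhorn) optimal-transport alignment loss between
# audio-token and text-token embeddings, computed in log space for stability.
import math
import torch
import torch.nn.functional as F

def sinkhorn_ot_loss(audio_emb, text_emb, eps=0.1, n_iters=50):
    """audio_emb: (Na, d), text_emb: (Nt, d); returns the OT alignment cost."""
    a = F.normalize(audio_emb, dim=-1)
    t = F.normalize(text_emb, dim=-1)
    cost = 1.0 - a @ t.T                                   # (Na, Nt) cosine distance
    log_mu = torch.full((cost.size(0),), -math.log(cost.size(0)))   # uniform marginals
    log_nu = torch.full((cost.size(1),), -math.log(cost.size(1)))
    f = torch.zeros_like(log_mu)
    g = torch.zeros_like(log_nu)
    for _ in range(n_iters):                               # Sinkhorn iterations
        f = eps * (log_mu - torch.logsumexp((g - cost) / eps, dim=1))
        g = eps * (log_nu - torch.logsumexp((f.unsqueeze(1) - cost) / eps, dim=0))
    plan = torch.exp((f.unsqueeze(1) + g.unsqueeze(0) - cost) / eps)
    return (plan * cost).sum()                             # transport cost as the penalty
```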
--------------------------------------------------------------------------------------------------------
Heartcare Suite: Multi-dimensional Understanding of ECG with Raw Multi-lead Signal Modeling
Electrocardiogram interpretation represents a critical healthcare challenge where AI can significantly impact patient outcomes, but existing approaches often struggle with the complexity of multi-lead ECG signals. Heartcare Suite introduces a comprehensive framework combining a large-scale dataset, systematic benchmarks, and an innovative tokenization approach that converts raw ECG signals into semantically meaningful discrete tokens. The bidirectional diffusion mechanism and vector quantization enable sophisticated signal understanding across disease diagnosis, rhythm analysis, and morphology interpretation. Applications span telemedicine, emergency medicine, cardiac monitoring devices, and automated diagnostic systems. With cardiovascular disease being a leading cause of mortality worldwide, this technology could democratize expert-level ECG interpretation, particularly valuable in underserved regions lacking specialized cardiologists. The framework's generalization capabilities suggest potential for improving diagnostic accuracy and reducing healthcare disparities globally.
Authors: Yihan Xie, Sijing Li, Tianwei Lin, Zhuonan Wang, Chenglin Yang, Yu Zhong, Wenqiao Zhang, Haoyuan Li, Hao Jiang, Fengda Zhang, Qishan Chen, Jun Xiao, Yueting Zhuang, Beng Chin Ooi
Link: https://arxiv.org/abs/2506.05831v2
Date: 2025-06-09
Summary:
We present Heartcare Suite, a multimodal comprehensive framework for fine-grained electrocardiogram (ECG) understanding. It comprises three key components: (i) Heartcare-220K, a high-quality, structured, and comprehensive multimodal ECG dataset covering essential tasks such as disease diagnosis, waveform morphology analysis, and rhythm interpretation; (ii) Heartcare-Bench, a systematic and multi-dimensional benchmark designed to evaluate diagnostic intelligence and guide the optimization of Medical Multimodal Large Language Models (Med-MLLMs) in ECG scenarios; and (iii) HeartcareGPT with a tailored tokenizer, Bidirectional ECG Abstract Tokenization (Beat), which compresses raw multi-lead signals into semantically rich discrete tokens via dual-level vector quantization and a query-guided bidirectional diffusion mechanism. Built upon Heartcare-220K, HeartcareGPT achieves strong generalization and SoTA performance across multiple clinically meaningful tasks. Extensive experiments demonstrate that Heartcare Suite is highly effective in advancing ECG-specific multimodal understanding and evaluation. Our project is available at https://github.com/DCDmllm/Heartcare-Suite.
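The signal-to-token step can be illustrated with plain vector quantization: slice the raw multi-lead signal into windows and map each window to its nearest codebook entry. This is only a toy illustration of the idea; Beat's dual-level vector quantization and query-guided bidirectional diffusion are substantially richer, and the window size and codebook here are assumptions.

```python
# Toy sketch: raw multi-lead ECG -> discrete tokens via nearest-codebook quantization.
import numpy as np

def ecg_to_tokens(signal: np.ndarray, codebook: np.ndarray, window: int = 250) -> list[int]:
    """signal: (n_leads, n_samples); codebook: (K, n_leads * window)."""
    n_leads, n_samples = signal.shape
    tokens = []
    for start in range(0, n_samples - window + 1, window):
        patch = signal[:, start:start + window].reshape(-1)   # flatten one window
        dists = np.linalg.norm(codebook - patch, axis=1)
        tokens.append(int(np.argmin(dists)))                  # nearest code index
    return tokens

# e.g. a 12-lead, 10 s recording at 500 Hz with window=250 yields 20 tokens in total.
```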
--------------------------------------------------------------------------------------------------------
DriveAction: A Benchmark for Exploring Human-like Driving Decisions in VLA Models
Autonomous driving requires understanding not just what to perceive, but how to act based on vision and language inputs, yet existing benchmarks lack comprehensive action-level evaluation aligned with human decision-making. DriveAction addresses this gap with a comprehensive benchmark derived from real-world driving data, providing high-level discrete actions from actual user operations rather than synthetic scenarios. The action-rooted evaluation framework explicitly connects vision, language, and action components, enabling precise identification of model weaknesses. Applications include autonomous vehicle development, driver assistance systems, simulation environments, and human-AI collaboration in transportation. The benchmark's demonstration that models require both vision and language guidance for optimal performance provides crucial insights for developing more reliable autonomous systems. This work contributes to safer, more human-like autonomous driving by establishing rigorous evaluation standards that reflect real-world complexity and decision-making patterns.
Authors: Yuhan Hao, Zhengning Li, Lei Sun, Weilong Wang, Naixin Yi, Sheng Song, Caihong Qin, Mofan Zhou, Yifei Zhan, Peng Jia, Xianpeng Lang
Link: https://arxiv.org/abs/2506.05667v1
Date: 2025-06-06
Summary:
Vision-Language-Action (VLA) models have advanced autonomous driving, but existing benchmarks still lack scenario diversity, reliable action-level annotation, and evaluation protocols aligned with human preferences. To address these limitations, we introduce DriveAction, the first action-driven benchmark specifically designed for VLA models, comprising 16,185 QA pairs generated from 2,610 driving scenarios. DriveAction leverages real-world driving data proactively collected by users of production-level autonomous vehicles to ensure broad and representative scenario coverage, offers high-level discrete action labels collected directly from users' actual driving operations, and implements an action-rooted tree-structured evaluation framework that explicitly links vision, language, and action tasks, supporting both comprehensive and task-specific assessment. Our experiments demonstrate that state-of-the-art vision-language models (VLMs) require both vision and language guidance for accurate action prediction: on average, accuracy drops by 3.3% without vision input, by 4.1% without language input, and by 8.0% without either. Our evaluation supports precise identification of model bottlenecks with robust and consistent results, thus providing new insights and a rigorous foundation for advancing human-like decisions in autonomous driving.
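The reported drops correspond to scoring the same model under masked inputs. The sketch below shows that ablation loop with placeholder model and field names; DriveAction's actual schema and evaluation protocol are not reproduced here.

```python
# Hypothetical sketch of the input-ablation evaluation behind the reported accuracy
# drops: evaluate with full inputs, vision masked, and language masked.
def ablation_accuracy(model, qa_pairs):
    settings = {"full": (True, True), "no_vision": (False, True), "no_language": (True, False)}
    scores = {}
    for name, (use_vision, use_language) in settings.items():
        correct = 0
        for item in qa_pairs:
            image = item["image"] if use_vision else None          # placeholder field names
            instruction = item["question"] if use_language else ""
            pred = model.predict_action(image=image, text=instruction)
            correct += int(pred == item["gold_action"])
        scores[name] = correct / len(qa_pairs)
    return scores
```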
--------------------------------------------------------------------------------------------------------
Hallucinate, Ground, Repeat: A Framework for Generalized Visual Relationship Detection
Understanding relationships between objects is fundamental to visual intelligence, yet most systems are limited to predefined relationship categories, constraining their real-world applicability. This research introduces an innovative framework that leverages large language models to hypothesize plausible relationships, then iteratively grounds these predictions in visual evidence. The expectation-maximization inspired approach enables learning beyond annotated data, generalizing to unseen relationship types. Applications include robotics navigation, assistive technologies for visually impaired individuals, automated image captioning, and scene understanding systems. The ability to detect novel relationships without extensive annotation makes this particularly valuable for emerging domains and specialized applications. This work addresses a fundamental limitation in computer vision, moving toward more flexible and generalizable visual understanding systems that can adapt to new environments and relationship types without requiring exhaustive manual labeling.
Authors: Shanmukha Vellamcheti, Sanjoy Kundu, Sathyanarayanan N. Aakur
Link: https://arxiv.org/abs/2506.05651v1
Date: 2025-06-06
Summary:
Understanding relationships between objects is central to visual intelligence, with applications in embodied AI, assistive systems, and scene understanding. Yet, most visual relationship detection (VRD) models rely on a fixed predicate set, limiting their generalization to novel interactions. A key challenge is the inability to visually ground semantically plausible, but unannotated, relationships hypothesized from external knowledge. This work introduces an iterative visual grounding framework that leverages large language models (LLMs) as structured relational priors. Inspired by expectation-maximization (EM), our method alternates between generating candidate scene graphs from detected objects using an LLM (expectation) and training a visual model to align these hypotheses with perceptual evidence (maximization). This process bootstraps relational understanding beyond annotated data and enables generalization to unseen predicates. Additionally, we introduce a new benchmark for open-world VRD on Visual Genome with 21 held-out predicates and evaluate under three settings: seen, unseen, and mixed. Our model outperforms LLM-only, few-shot, and debiased baselines, achieving mean recall (mR@50) of 15.9, 13.1, and 11.7 on predicate classification on these three sets. These results highlight the promise of grounded LLM priors for scalable open-world visual understanding.
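One round of the EM-style loop can be sketched as follows: the LLM hypothesizes candidate relations for detected object pairs (E-step), and the visual model keeps and trains on the hypotheses it can actually ground (M-step). All interfaces and the grounding threshold are placeholders for illustration.

```python
# Hypothetical sketch of one expectation-maximization round for open-world VRD.
def em_grounding_round(detections, llm_propose, visual_model, threshold=0.5):
    # E-step: the LLM proposes plausible (subject, predicate, object) triplets
    candidates = []
    for subj, obj in [(a, b) for a in detections for b in detections if a is not b]:
        for predicate in llm_propose(subj["label"], obj["label"]):
            candidates.append((subj, predicate, obj))
    # M-step: keep triplets the visual model can ground and use them as pseudo-labels
    pseudo_labels = [(s, p, o) for (s, p, o) in candidates
                     if visual_model.score(s["box"], p, o["box"]) > threshold]
    visual_model.train_on(pseudo_labels)   # bootstraps beyond the annotated predicate set
    return pseudo_labels
```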
--------------------------------------------------------------------------------------------------------
MMTU: A Massive Multi-Task Table Understanding and Reasoning Benchmark
Tables are ubiquitous in professional environments—spreadsheets, databases, reports—yet comprehensive evaluation of AI's table-handling capabilities remains limited, focusing mainly on simple question-answering rather than complex professional tasks. MMTU introduces a comprehensive benchmark with over 30,000 questions across 25 real-world table tasks that mirror the complexity faced by data engineers, analysts, and database administrators. The benchmark reveals that even frontier reasoning models achieve only 60% performance, highlighting significant room for improvement. Applications span business intelligence, automated data analysis, spreadsheet assistants, database management tools, and computational notebooks. As organizations increasingly rely on AI for data processing and analysis, this benchmark provides crucial insights for developing more capable table-understanding systems. The comprehensive evaluation framework could accelerate progress in automated data analysis tools, potentially transforming how professionals interact with structured data.
Authors: Junjie Xing, Yeye He, Mengyu Zhou, Haoyu Dong, Shi Han, Lingjiao Chen, Dongmei Zhang, Surajit Chaudhuri, H. V. Jagadish
Link: https://arxiv.org/abs/2506.05587v1
Date: 2025-06-05
Summary:
Tables and table-based use cases play a crucial role in many important real-world applications, such as spreadsheets, databases, and computational notebooks, which traditionally require expert-level users like data engineers, data analysts, and database administrators to operate. Although LLMs have shown remarkable progress in working with tables (e.g., in spreadsheet and database copilot scenarios), comprehensive benchmarking of such capabilities remains limited. In contrast to an extensive and growing list of NLP benchmarks, evaluations of table-related tasks are scarce, and narrowly focus on tasks like NL-to-SQL and Table-QA, overlooking the broader spectrum of real-world tasks that professional users face. This gap limits our understanding and model progress in this important area. In this work, we introduce MMTU, a large-scale benchmark with over 30K questions across 25 real-world table tasks, designed to comprehensively evaluate models' ability to understand, reason, and manipulate real tables at the expert level. These tasks are drawn from decades' worth of computer science research on tabular data, with a focus on complex table tasks faced by professional users. We show that MMTU requires a combination of skills -- including table understanding, reasoning, and coding -- that remain challenging for today's frontier models, where even frontier reasoning models like OpenAI o4-mini and DeepSeek R1 score only around 60%, suggesting significant room for improvement. We highlight key findings in our evaluation using MMTU and hope that this benchmark drives further advances in understanding and developing foundation models for structured data processing and analysis. Our code and data are available at https://github.com/MMTU-Benchmark/MMTU and https://huggingface.co/datasets/MMTU-benchmark/MMTU.
--------------------------------------------------------------------------------------------------------
Beyond the Buzz: A Pragmatic Take on Inference Disaggregation
As AI model inference scales to multi-node deployments, disaggregation—splitting inference into distinct phases—promises improved throughput and interactivity, but practical deployment remains challenging due to system complexity. This research provides the first systematic large-scale study, evaluating hundreds of thousands of design points across diverse workloads and hardware configurations. The findings reveal that disaggregation is most effective for specific traffic patterns and larger models, with dynamic rate matching and elastic scaling being critical success factors. Applications include cloud AI services, large-scale model serving, enterprise AI deployments, and high-throughput inference systems. The practical insights address real deployment challenges faced by companies scaling AI services, providing actionable guidance for system architects and engineers. This work contributes to making large-scale AI deployment more efficient and cost-effective, crucial for the widespread adoption of advanced AI capabilities.
Authors: Tiyasa Mitra, Ritika Borkar, Nidhi Bhatia, Ramon Matas, Shivam Raj, Dheevatsa Mudigere, Ritchie Zhao, Maximilian Golub, Arpan Dutta, Sailaja Madduri, Dharmesh Jani, Brian Pharris, Bita Darvish Rouhani
Link: https://arxiv.org/abs/2506.05508v1
Date: 2025-06-05
Summary:
As inference scales to multi-node deployments, disaggregation - splitting inference into distinct phases - offers a promising path to improving the throughput-interactivity Pareto frontier. Despite growing enthusiasm and a surge of open-source efforts, practical deployment of disaggregated serving remains limited due to the complexity of the optimization search space and system-level coordination. In this paper, we present the first systematic study of disaggregated inference at scale, evaluating hundreds of thousands of design points across diverse workloads and hardware configurations. We find that disaggregation is most effective for prefill-heavy traffic patterns and larger models. Our results highlight the critical role of dynamic rate matching and elastic scaling in achieving Pareto-optimal performance. Our findings offer actionable insights for efficient disaggregated deployments to navigate the trade-off between system throughput and interactivity.
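The "dynamic rate matching" finding boils down to sizing the prefill and decode pools so that the rate at which prefill completes requests matches the rate at which decode can absorb them. A back-of-the-envelope sketch with illustrative (not measured) numbers:

```python
# Illustrative rate-matching calculation for disaggregated prefill/decode pools.
def decode_gpus_needed(prefill_gpus, prefill_reqs_per_s_per_gpu, decode_reqs_per_s_per_gpu):
    prefill_rate = prefill_gpus * prefill_reqs_per_s_per_gpu   # requests entering decode
    return prefill_rate / decode_reqs_per_s_per_gpu

# e.g. 4 prefill GPUs at 3 req/s each feed 12 req/s; if one decode GPU sustains 2 req/s,
# the decode pool needs 6 GPUs. Elastic scaling re-balances this ratio as traffic shifts
# between prefill-heavy and decode-heavy regimes.
print(decode_gpus_needed(4, 3.0, 2.0))   # -> 6.0
```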
--------------------------------------------------------------------------------------------------------
The rapid evolution of Large Language Models has dramatically improved their ability to solve complex algorithmic problems, with recent models achieving performance comparable to top-performing students on university-level algorithm exams. This comprehensive evaluation reveals both strengths in multi-step reasoning and remaining challenges with graph-based problems, while demonstrating robust multilingual capabilities. The research extends beyond evaluation to explore practical applications in educational content generation, offering instructors powerful tools for creating high-quality feedback and editorial materials. Applications include automated tutoring systems, algorithm education platforms, programming contest preparation, and academic content generation. The demonstrated progression from struggling to mastering complex algorithmic challenges within months illustrates the rapid advancement of AI capabilities. This work contributes to understanding AI's potential in computer science education and suggests promising directions for AI-assisted learning and instruction.
Authors: Adrian Marius Dumitran, Theodor-Pierre Moroianu, Vasile Paul Alexe
Link: https://arxiv.org/abs/2506.04965v1
Date: 2025-06-05
Summary:
This paper presents a comprehensive evaluation of the performance of state-of-the-art Large Language Models (LLMs) on challenging university-level algorithms exams. By testing multiple models on both a Romanian exam and its high-quality English translation, we analyze LLMs' problem-solving capabilities, consistency, and multilingual performance. Our empirical study reveals that the most recent models not only achieve scores comparable to top-performing students but also demonstrate robust reasoning skills on complex, multi-step algorithmic challenges, even though difficulties remain with graph-based tasks. Building on these findings, we explore the potential of LLMs to support educational environments through the generation of high-quality editorial content, offering instructors a powerful tool to enhance student feedback. The insights and best practices discussed herein pave the way for further integration of generative AI in advanced algorithm education.
--------------------------------------------------------------------------------------------------------
Robustness Evaluation for Video Models with Reinforcement Learning
Video classification models face unique robustness challenges compared to image-based systems due to temporal complexity and increased computational demands, making traditional adversarial evaluation approaches insufficient. This research introduces a multi-agent reinforcement learning framework that cooperatively identifies spatial and temporal vulnerabilities while maintaining visual imperceptibility and temporal coherence. The approach outperforms existing methods in efficiency and effectiveness across popular video recognition models and datasets. Applications include security systems, autonomous vehicle perception, surveillance technologies, and content moderation platforms. With video AI becoming increasingly prevalent in safety-critical applications, robust evaluation becomes essential for deployment confidence. The framework's ability to generate custom distortion types makes it particularly valuable for domain-specific robustness testing. This work contributes to safer deployment of video AI systems by providing more realistic and comprehensive evaluation methods.
Authors: Ashwin Ramesh Babu, Sajad Mousavi, Vineet Gundecha, Sahand Ghorbanpour, Avisek Naug, Antonio Guillen, Ricardo Luna Gutierrez, Soumyendu Sarkar
Link: https://arxiv.org/abs/2506.05431v1
Date: 2025-06-05
Summary:
Evaluating the robustness of video classification models is very challenging, especially when compared to image-based models. The added temporal dimension brings a significant increase in complexity and computational cost. One of the key challenges is to keep the perturbations to a minimum to induce misclassification. In this work, we propose a multi-agent reinforcement learning approach (spatial and temporal) that cooperatively learns to identify the given video's sensitive spatial and temporal regions. The agents consider temporal coherence in generating fine perturbations, leading to a more effective and visually imperceptible attack. Our method outperforms state-of-the-art solutions on the Lp metric and the average number of queries. Our method enables custom distortion types, making the robustness evaluation more relevant to the use case. We extensively evaluate four popular models for video action recognition on two popular datasets, HMDB-51 and UCF-101.
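A single attack step of the cooperative idea can be sketched as two agents choosing a frame and a spatial region, applying a small local perturbation, and receiving a reward that trades misclassification against perturbation size. The agent interfaces, patch size, and reward weights below are placeholders for the paper's multi-agent RL formulation.

```python
# Simplified sketch of one query of a cooperative spatial/temporal perturbation attack.
import numpy as np

def attack_step(video, temporal_agent, spatial_agent, model, true_label, eps=2.0):
    """video: (T, H, W, C) float array in [0, 255]."""
    t = temporal_agent.act(video)                        # which frame to perturb
    y, x = spatial_agent.act(video[t])                   # which region of that frame
    perturbed = video.copy()
    patch = perturbed[t, y:y + 8, x:x + 8]
    patch += np.random.uniform(-eps, eps, patch.shape)   # small, local perturbation
    perturbed[t, y:y + 8, x:x + 8] = np.clip(patch, 0, 255)
    pred = model.predict(perturbed)                       # one query to the video model
    lp_cost = np.linalg.norm(perturbed - video)
    reward = float(pred != true_label) - 1e-3 * lp_cost   # succeed while staying small
    return perturbed, reward
```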
--------------------------------------------------------------------------------------------------------
Fake news detection faces the dual challenge of rapidly evolving misinformation tactics and the computational constraints of real-time detection systems. Traditional approaches struggle with adaptation to new patterns, while large language models, despite strong capabilities, lack current knowledge and appropriate examples. This research introduces a collaborative framework where large and small models complement each other through multi-round learning, with lifelong knowledge editing and replay-based continual learning preventing catastrophic forgetting. Applications include social media platforms, news verification systems, content moderation tools, and information integrity services. With misinformation becoming increasingly sophisticated and widespread, automated detection systems that can continuously adapt without complete retraining become crucial for maintaining information ecosystem health. The framework's ability to leverage both generalization power and classification expertise addresses practical deployment needs for social media companies and news organizations.
Authors: Ziyi Zhou, Xiaoming Zhang, Litian Zhang, Yibo Zhang, Zhenyu Guan, Chaozhuo Li, Philip S. Yu
Link: https://arxiv.org/abs/2506.04739v1
Date: 2025-06-05
Summary:
The widespread dissemination of fake news on social media has significantly impacted society, resulting in serious consequences. Conventional deep learning methodologies employing small language models (SLMs) suffer from extensive supervised training requirements and difficulties adapting to evolving news environments due to data scarcity and distribution shifts. Large language models (LLMs), despite robust zero-shot capabilities, fall short in accurately detecting fake news owing to outdated knowledge and the absence of suitable demonstrations. In this paper, we propose a novel Continuous Collaborative Emergent Fake News Detection (C$^2$EFND) framework to address these challenges. The C$^2$EFND framework strategically leverages both LLMs' generalization power and SLMs' classification expertise via a multi-round collaborative learning framework. We further introduce a lifelong knowledge editing module based on a Mixture-of-Experts architecture to incrementally update LLMs and a replay-based continual learning method to ensure SLMs retain prior knowledge without full retraining. Extensive experiments on Pheme and Twitter16 datasets demonstrate that C$^2$EFND significantly outperforms existing methods, effectively improving detection accuracy and adaptability in continuous emergent fake news scenarios.
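The replay mechanism that keeps the small classifier from forgetting older misinformation patterns can be sketched as a rehearsal buffer mixed into each update on newly emerging news. The buffer policy, ratio, and trainer interface below are illustrative assumptions, not C$^2$EFND's exact design.

```python
# Hypothetical sketch of replay-based continual learning for the SLM fake-news classifier.
import random

class ReplayTrainer:
    def __init__(self, classifier, buffer_size=5000, replay_ratio=0.3):
        self.classifier = classifier
        self.buffer, self.buffer_size, self.replay_ratio = [], buffer_size, replay_ratio

    def update(self, new_batch):
        k = int(len(new_batch) * self.replay_ratio)
        replayed = random.sample(self.buffer, min(k, len(self.buffer)))
        self.classifier.train_on(new_batch + replayed)    # learn new patterns, rehearse old
        self.buffer.extend(new_batch)
        if len(self.buffer) > self.buffer_size:           # keep the buffer bounded
            self.buffer = random.sample(self.buffer, self.buffer_size)
```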
--------------------------------------------------------------------------------------------------------
Using In-Context Learning for Automatic Defect Labelling of Display Manufacturing Data
Manufacturing quality control requires precise defect detection, but manual labeling of defects is time-consuming and expensive, creating bottlenecks in industrial inspection systems. This research introduces an AI-assisted auto-labeling system that leverages in-context learning with enhanced SegGPT architecture and scribble-based annotation mechanisms. The approach achieves significant performance improvements while maintaining 60% auto-labeling coverage, with models trained on auto-labeled data matching human-labeled performance. Applications include semiconductor manufacturing, display production, automotive quality control, and general industrial inspection systems. The practical validation on industrial datasets demonstrates real-world applicability, addressing a common pain point in manufacturing AI deployment. This work contributes to more efficient quality control processes, potentially reducing costs while maintaining high standards, crucial for industries where defect detection directly impacts product reliability and safety.
Authors: Babar Hussain, Qiang Liu, Gang Chen, Bihai She, Dahai Yu
Link: https://arxiv.org/abs/2506.04717v1
Date: 2025-06-05
Summary:
This paper presents an AI-assisted auto-labeling system for display panel defect detection that leverages in-context learning capabilities. We adopt and enhance the SegGPT architecture with several domain-specific training techniques and introduce a scribble-based annotation mechanism to streamline the labeling process. Our two-stage training approach, validated on industrial display panel datasets, demonstrates significant improvements over the baseline model, achieving an average IoU increase of 0.22 and a 14% improvement in recall across multiple product types, while maintaining approximately 60% auto-labeling coverage. Experimental results show that models trained on our auto-labeled data match the performance of those trained on human-labeled data, offering a practical solution for reducing manual annotation efforts in industrial inspection systems.
--------------------------------------------------------------------------------------------------------
UNO: Unlearning via Orthogonalization in Generative models
As generative AI becomes more powerful and widespread, the ability to selectively remove specific data influences becomes critical for privacy, legal compliance, and content correction without expensive retraining. This research introduces fast unlearning algorithms based on loss gradient orthogonalization that achieve four key objectives: effective forgetting, quality preservation, desired data retention, and computational efficiency. The approach demonstrates orders of magnitude faster unlearning compared to existing methods while maintaining model fidelity. Applications include privacy-compliant AI systems, content moderation, copyright compliance, and personalized AI services. With increasing regulatory focus on data rights and AI accountability, efficient unlearning becomes essential for commercial AI deployment. The dramatic speed improvements make previously impractical unlearning scenarios feasible, potentially enabling more responsive and compliant AI systems that can adapt to changing requirements without complete retraining.
Authors: Pinak Mandal, Georg A. Gottwald
Link: https://arxiv.org/abs/2506.04712v1
Date: 2025-06-05
Summary:
As generative models become increasingly powerful and pervasive, the ability to unlearn specific data, whether due to privacy concerns, legal requirements, or the correction of harmful content, has become increasingly important. Unlike in conventional training, where data are accumulated and knowledge is reinforced, unlearning aims to selectively remove the influence of particular data points without costly retraining from scratch. To be effective and reliable, such algorithms need to achieve (i) forgetting of the undesired data, (ii) preservation of the quality of the generation, (iii) preservation of the influence of the desired training data on the model parameters, and (iv) small number of training steps. We propose fast unlearning algorithms based on loss gradient orthogonalization. We show that our algorithms are able to forget data while maintaining the fidelity of the original model. Using MNIST and CelebA data, we demonstrate that our algorithms achieve orders of magnitude faster unlearning times than their predecessors, such as gradient surgery.
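The orthogonalization idea resembles gradient-surgery-style projection: remove from the "forget" gradient its component along the "retain" gradient before updating, so that forgetting the target data disturbs the retained data as little as possible. The sketch below shows that projection for a single parameter tensor; UNO's exact update rule may differ.

```python
# Sketch of one unlearning step via loss-gradient orthogonalization. Assumes `params`
# is a single tensor with requires_grad=True and the two loss functions return scalars.
import torch

def orthogonalized_unlearning_step(params, forget_loss_fn, retain_loss_fn, lr=1e-3):
    g_forget = torch.autograd.grad(forget_loss_fn(params), params)[0]
    g_retain = torch.autograd.grad(retain_loss_fn(params), params)[0]
    # project out the retain direction from the forget gradient
    coef = (g_forget * g_retain).sum() / (g_retain.norm() ** 2 + 1e-12)
    g_orth = g_forget - coef * g_retain
    # ascend on the forget loss only along directions orthogonal to the retain gradient
    return (params + lr * g_orth).detach().requires_grad_(True)
```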
--------------------------------------------------------------------------------------------------------
Analog dual classifier via a time-modulated neuromorphic metasurface
Neuromorphic computing through physical systems offers potential advantages in speed and energy efficiency over traditional digital approaches, but existing implementations are typically limited to single-task realizations. This research introduces a breakthrough neuromorphic metasurface that performs dual classification tasks simultaneously through intelligent frequency multiplexing and temporal modulation. The system conducts analog computation through guided wave scattering, with tunable phases serving as trainable parameters. Applications include edge computing devices, ultra-low-power AI systems, specialized signal processing hardware, and embedded intelligence in mechanical systems. The ability to perform parallel classification tasks addresses a major limitation in physical computing systems, potentially enabling more efficient specialized AI hardware. This work contributes to the development of alternative computing paradigms that could offer significant advantages in specific applications where traditional digital processing faces limitations.
Authors: M. Mousa, M. Moghaddaszadeh, M. Nouh
Link: https://arxiv.org/abs/2506.04629v1
Date: 2025-06-05
Summary:
A neuromorphic metasurface embodies mechanical intelligence by realizing physical neural architectures. It exploits guided wave scattering to conduct computations in an analog manner. Through multiple tuned waveguides, the neuromorphic system recognizes the features of an input signal and self-identifies its classification label. The computational input is introduced to the system through mechanical excitations at one edge, generating elastic waves which traverse multiple layers of resonant metasurfaces. These metasurfaces possess a tunable phase akin to trainable parameters in deep learning algorithms. While early efforts have been promising, the well-established constraints on wave propagation in finite media limit such systems to single-task realizations. In this work, we devise a dual classifier neuromorphic metasurface and demonstrate its effectiveness in carrying out two completely independent classification problems in parallel, thus addressing a major bottleneck in physical computing systems. Parallelization is achieved through smart multiplexing of the carrier computational frequency, enabled by prescribed temporal modulations of the embedded waveguides. The presented theory and results pave the way for new paradigms in wave-based computing systems which have been elusive thus far.
--------------------------------------------------------------------------------------------------------
Pseudo-Simulation for Autonomous Driving
Autonomous vehicle evaluation faces a fundamental trilemma: real-world testing is unsafe and unreproducible, closed-loop simulation lacks realism or is computationally expensive, and open-loop evaluation misses compounding errors. This research introduces pseudo-simulation, a novel paradigm that operates on real datasets while augmenting them with synthetic observations generated using 3D Gaussian Splatting. The approach approximates potential future states through diverse synthetic observations with proximity-based weighting schemes. Applications include autonomous vehicle development, safety validation systems, driving policy evaluation, and regulatory testing frameworks. The method's superior correlation with closed-loop simulations compared to existing open-loop approaches addresses a critical evaluation gap in autonomous driving. This work contributes to safer and more reliable autonomous vehicle development by providing more realistic evaluation methods that balance computational efficiency with accuracy, potentially accelerating the deployment of safe autonomous driving technology.
Authors: Wei Cao, Marcel Hallgarten, Tianyu Li, Daniel Dauner, Xunjiang Gu, Caojun Wang, Yakov Miron, Marco Aiello, Hongyang Li, Igor Gilitschenski, Boris Ivanovic, Marco Pavone, Andreas Geiger, Kashyap Chitta
Link: https://arxiv.org/abs/2506.04218v1
Date: 2025-06-04
Summary:
Existing evaluation paradigms for Autonomous Vehicles (AVs) face critical limitations. Real-world evaluation is often challenging due to safety concerns and a lack of reproducibility, whereas closed-loop simulation can face insufficient realism or high computational costs. Open-loop evaluation, while being efficient and data-driven, relies on metrics that generally overlook compounding errors. In this paper, we propose pseudo-simulation, a novel paradigm that addresses these limitations. Pseudo-simulation operates on real datasets, similar to open-loop evaluation, but augments them with synthetic observations generated prior to evaluation using 3D Gaussian Splatting. Our key idea is to approximate potential future states the AV might encounter by generating a diverse set of observations that vary in position, heading, and speed. Our method then assigns a higher importance to synthetic observations that best match the AV's likely behavior using a novel proximity-based weighting scheme. This enables evaluating error recovery and the mitigation of causal confusion, as in closed-loop benchmarks, without requiring sequential interactive simulation. We show that pseudo-simulation is better correlated with closed-loop simulations ($R^2=0.8$) than the best existing open-loop approach ($R^2=0.7$). We also establish a public leaderboard for the community to benchmark new methodologies with pseudo-simulation. Our code is available at https://github.com/autonomousvision/navsim.
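The proximity-based weighting can be illustrated with a softmax over distances between each synthetic observation's pose and the AV's likely next state. The distance measure and temperature below are assumptions; the paper's exact weighting scheme is not reproduced.

```python
# Hypothetical sketch of proximity-based weighting over pre-rendered synthetic states.
import numpy as np

def proximity_weights(synthetic_states: np.ndarray, likely_state: np.ndarray, tau=1.0):
    """synthetic_states: (N, d) poses, e.g. (x, y, heading, speed); likely_state: (d,)."""
    d = np.linalg.norm(synthetic_states - likely_state, axis=1)
    w = np.exp(-d / tau)                  # closer to the likely behavior -> higher weight
    return w / w.sum()

# The final score aggregates per-observation metrics with these weights:
# score = sum_i weights[i] * metric(policy_output_on_observation_i)
```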
--------------------------------------------------------------------------------------------------------