Week Ending 6.22.2025

 

RESEARCH WATCH: 6.22.2025

 

SwarmAgentic: Towards Fully Automated Agentic System Generation via Swarm Intelligence

Traditional multi-agent systems require extensive manual design and lack self-optimization capabilities, limiting their adaptability to complex real-world scenarios. SwarmAgentic addresses this challenge by introducing a revolutionary framework that automatically generates complete agentic systems from scratch using swarm intelligence principles. By maintaining populations of candidate systems that evolve through feedback-guided updates inspired by Particle Swarm Optimization, this approach eliminates human intervention in system design. The framework demonstrates remarkable performance gains, achieving over 260% improvement on travel planning benchmarks. Applications span autonomous logistics coordination, distributed problem-solving in smart cities, adaptive resource management, and scalable multi-robot systems where traditional hand-crafted approaches become impractical.

Authors:  Yao Zhang, Chenyang Lin, Shijie Tang, Haokun Chen, Shijie Zhou, Yunpu Ma, Volker Tresp

Link:  https://arxiv.org/abs/2506.15672v1

Date: 2025-06-18

Summary:

The rapid progress of Large Language Models has advanced agentic systems in decision-making, coordination, and task execution. Yet, existing agentic system generation frameworks lack full autonomy, missing from-scratch agent generation, self-optimizing agent functionality, and collaboration, limiting adaptability and scalability. We propose SwarmAgentic, a framework for fully automated agentic system generation that constructs agentic systems from scratch and jointly optimizes agent functionality and collaboration as interdependent components through language-driven exploration. To enable efficient search over system-level structures, SwarmAgentic maintains a population of candidate systems and evolves them via feedback-guided updates, drawing inspiration from Particle Swarm Optimization (PSO). We evaluate our method on six real-world, open-ended, and exploratory tasks involving high-level planning, system-level coordination, and creative reasoning. Given only a task description and an objective function, SwarmAgentic outperforms all baselines, achieving a +261.8% relative improvement over ADAS on the TravelPlanner benchmark, highlighting the effectiveness of full automation in structurally unconstrained tasks. This framework marks a significant step toward scalable and autonomous agentic system design, bridging swarm intelligence with fully automated multi-agent system generation. Our code is publicly released at https://yaoz720.github.io/SwarmAgentic/.
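
To make the PSO analogy concrete, here is a minimal Python sketch of the feedback-guided population loop. The propose_update and evaluate functions are toy stand-ins invented for illustration; in the paper both roles are played by an LLM and by task benchmarks, not by string length.

    def propose_update(system, personal_best, global_best):
        # Stand-in for the language-driven update: SwarmAgentic prompts an LLM
        # to rewrite the candidate system toward its personal best and the
        # global best. This toy version just drifts toward the longer design.
        return max([system, personal_best, global_best], key=len)

    def evaluate(system):
        # Stand-in objective; the paper scores complete agentic systems on
        # task benchmarks such as TravelPlanner.
        return len(system)

    population = ["single agent", "planner+solver", "planner+critic+solver"]
    personal_best = list(population)
    global_best = max(population, key=evaluate)

    for _ in range(10):
        for i, system in enumerate(population):
            candidate = propose_update(system, personal_best[i], global_best)
            population[i] = candidate
            if evaluate(candidate) > evaluate(personal_best[i]):
                personal_best[i] = candidate
        global_best = max(population + [global_best], key=evaluate)

    print(global_best)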

--------------------------------------------------------------------------------------------------------

Pixel-level Certified Explanations via Randomized Smoothing

Deep learning models' decision-making processes remain opaque, with explanation methods vulnerable to adversarial manipulation that can drastically alter attribution maps while preserving predictions. This vulnerability undermines trust in AI systems, particularly in high-stakes domains like medical diagnosis and autonomous driving. The proposed certification framework addresses this critical gap by providing the first guaranteed robustness for pixel-level explanations using randomized smoothing techniques. By reformulating attribution robustness as a segmentation problem, the method ensures reliable explanations resistant to input perturbations. Applications include trustworthy medical imaging analysis, forensic evidence evaluation, safety-critical autonomous systems, and regulatory compliance scenarios where explainable AI decisions must withstand scrutiny and potential adversarial challenges.

Authors:  Alaa Anani, Tobias Lorenz, Mario Fritz, Bernt Schiele

Link:  https://arxiv.org/abs/2506.15499v1

Date: 2025-06-18

Summary:

Post-hoc attribution methods aim to explain deep learning predictions by highlighting influential input pixels. However, these explanations are highly non-robust: small, imperceptible input perturbations can drastically alter the attribution map while maintaining the same prediction. This vulnerability undermines their trustworthiness and calls for rigorous robustness guarantees of pixel-level attribution scores. We introduce the first certification framework that guarantees pixel-level robustness for any black-box attribution method using randomized smoothing. By sparsifying and smoothing attribution maps, we reformulate the task as a segmentation problem and certify each pixel's importance against ℓ2-bounded perturbations. We further propose three evaluation metrics to assess certified robustness, localization, and faithfulness. An extensive evaluation of 12 attribution methods across 5 ImageNet models shows that our certified attributions are robust, interpretable, and faithful, enabling reliable use in downstream tasks. Our code is at https://github.com/AlaaAnani/certified-attributions.
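
The certification recipe can be sketched in a few lines: sample Gaussian perturbations, sparsify each attribution map to its top-k pixels, and certify a pixel as important only when a lower confidence bound on its top-k frequency exceeds one half. This is a minimal sketch assuming a loose Hoeffding bound; the paper's estimator uses tighter binomial bounds and adds dedicated evaluation metrics.

    import numpy as np
    from scipy.stats import norm

    def certified_topk(attribution_fn, x, sigma=0.25, k=16, n=200, alpha=0.001):
        # Smooth: sample Gaussian perturbations and record, per pixel, how
        # often it lands in the top-k of the attribution map (sparsification).
        votes = np.zeros(x.size)
        for _ in range(n):
            a = attribution_fn(x + sigma * np.random.randn(*x.shape)).ravel()
            votes[np.argsort(a)[-k:]] += 1
        p_hat = votes / n
        # Loose Hoeffding lower bound on the true top-k probability; the
        # paper uses tighter binomial confidence bounds.
        p_low = p_hat - np.sqrt(np.log(1 / alpha) / (2 * n))
        certified = p_low > 0.5                  # "important" wins the vote
        radius = sigma * norm.ppf(np.clip(p_low, 0.5, 1 - 1e-9))  # certified l2 radius
        return certified.reshape(x.shape), radius.reshape(x.shape)

    # toy usage with |x| standing in for a black-box attribution method
    cert, rad = certified_topk(lambda z: np.abs(z), np.random.randn(8, 8))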

--------------------------------------------------------------------------------------------------------

SonicVerse: Multi-Task Learning for Music Feature-Informed Captioning

Music databases and streaming platforms struggle with accurate, detailed descriptions that capture both technical and perceptual aspects of musical content. Traditional captioning systems often miss nuanced musical characteristics, limiting their utility for music discovery, education, and research applications. SonicVerse revolutionizes music captioning by integrating auxiliary feature detection tasks like key and vocal detection directly into the caption generation process. This multi-task architecture produces rich, descriptive captions for short fragments while enabling detailed time-informed descriptions for longer pieces through large language model integration. Applications include enhanced music recommendation systems, automated music cataloging for libraries, educational tools for music theory instruction, content creation assistance for musicians, and improved accessibility features for hearing-impaired users seeking detailed musical descriptions.

Authors:  Anuradha Chopra, Abhinaba Roy, Dorien Herremans

Link:  https://arxiv.org/abs/2506.15154v1

Date: 2025-06-18

Summary:

Detailed captions that accurately reflect the characteristics of a music piece can enrich music databases and drive forward research in music AI. This paper introduces a multi-task music captioning model, SonicVerse, that integrates caption generation with auxiliary music feature detection tasks such as key detection, vocals detection, and more, so as to directly capture both low-level acoustic details and high-level musical attributes. The key contribution is a projection-based architecture that transforms audio input into language tokens, while simultaneously detecting music features through dedicated auxiliary heads. The outputs of these heads are also projected into language tokens, to enhance the captioning input. This framework not only produces rich, descriptive captions for short music fragments but also directly enables the generation of detailed time-informed descriptions for longer music pieces, by chaining the outputs using a large language model. To train the model, we extended the MusicBench dataset by annotating it with music features using MIRFLEX, a modular music feature extractor, resulting in paired audio, caption, and music feature data. Experimental results show that incorporating features in this way improves the quality and detail of the generated captions.
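
A minimal PyTorch sketch of the projection-based design follows. The layer sizes and the two auxiliary heads shown (key and vocals detection) are illustrative assumptions, and the language model that consumes the resulting token sequence is omitted.

    import torch, torch.nn as nn

    class SonicVerseSketch(nn.Module):
        # Sketch of the multi-task projection idea: audio embeddings are
        # projected to language-token space, auxiliary heads predict music
        # features, and the feature logits are themselves projected into
        # extra tokens that enrich the captioning input. Dimensions invented.
        def __init__(self, d_audio=512, d_token=768, n_keys=24):
            super().__init__()
            self.audio_proj = nn.Linear(d_audio, d_token)
            self.key_head = nn.Linear(d_audio, n_keys)   # auxiliary: key
            self.vocals_head = nn.Linear(d_audio, 2)     # auxiliary: vocals?
            self.key_token = nn.Linear(n_keys, d_token)  # logits -> token
            self.vocals_token = nn.Linear(2, d_token)

        def forward(self, audio_emb):                    # (batch, frames, d_audio)
            pooled = audio_emb.mean(dim=1)
            key_logits = self.key_head(pooled)
            voc_logits = self.vocals_head(pooled)
            tokens = torch.cat([
                self.key_token(key_logits).unsqueeze(1),
                self.vocals_token(voc_logits).unsqueeze(1),
                self.audio_proj(audio_emb),
            ], dim=1)                                    # fed to the LM
            return tokens, key_logits, voc_logits

    tokens, k, v = SonicVerseSketch()(torch.randn(2, 100, 512))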

--------------------------------------------------------------------------------------------------------

PeRL: Permutation-Enhanced Reinforcement Learning for Interleaved Vision-Language Reasoning

Vision-language models excel at single-image reasoning but struggle with complex multi-image scenarios requiring understanding of spatial relationships and positional reasoning across images. This limitation restricts their applicability in real-world scenarios involving document analysis, visual storytelling, and multi-step visual problem-solving. PeRL addresses this challenge through a novel reinforcement learning approach that uses image sequence permutation to explore diverse positional relationships and rollout filtering to optimize learning trajectories. The method achieves state-of-the-art performance on multi-image benchmarks while maintaining single-image capabilities. Applications include automated document processing, visual instruction following for robotics, multi-view analysis in medical imaging, architectural design evaluation, surveillance systems requiring temporal visual reasoning, and educational platforms needing sophisticated visual comprehension capabilities.

Authors:  Yizhen Zhang, Yang Ding, Shuoshuo Zhang, Xinchen Zhang, Haoling Li, Zhong-zhi Li, Peijie Wang, Jie Wu, Lei Ji, Yelong Shen, Yujiu Yang, Yeyun Gong

Link:  https://arxiv.org/abs/2506.14907v1

Date: 2025-06-17

Summary:

Inspired by the impressive reasoning capabilities demonstrated by reinforcement learning approaches like DeepSeek-R1, recent research has begun exploring the use of reinforcement learning (RL) to enhance vision-language models (VLMs) for multimodal reasoning tasks. However, most existing multimodal reinforcement learning approaches remain limited to spatial reasoning within single-image contexts and struggle to generalize to more complex, real-world scenarios involving multi-image positional reasoning, where understanding the relationships across images is crucial. To address this challenge, we propose PeRL, a general reinforcement learning approach tailored for interleaved multimodal tasks, and a multi-stage strategy designed to enhance the exploration-exploitation trade-off, thereby improving learning efficiency and task performance. Specifically, we introduce permutation of image sequences to simulate varied positional relationships and explore more spatial and positional diversity. Furthermore, we design a rollout filtering mechanism for resampling that focuses on the trajectories contributing most to learning optimal behaviors, so that learned policies are exploited effectively. We evaluate our model on 5 widely-used multi-image benchmarks and 3 single-image benchmarks. Our experiments confirm that the PeRL-trained model consistently surpasses R1-related and interleaved VLM baselines by a large margin, achieving state-of-the-art performance on multi-image benchmarks while preserving comparable performance on single-image tasks.
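
The two mechanisms compose naturally, as the hedged sketch below shows. Here policy and reward are stand-in callables for the VLM and the task scorer, and the filtering rule is a simplified mean-baseline advantage test rather than the paper's exact resampling criterion.

    import random

    def permuted_rollouts(images, question, policy, reward, n_perms=4, n_samples=4):
        # Permute the image order to expose varied positional relationships,
        # roll out the policy on each ordering, then drop trajectories whose
        # reward equals the batch mean (they carry no learning signal under
        # a mean-baseline advantage).
        rollouts = []
        for _ in range(n_perms):
            perm = random.sample(range(len(images)), len(images))
            ordered = [images[i] for i in perm]
            for _ in range(n_samples):
                answer = policy(ordered, question)
                rollouts.append((perm, answer, reward(answer)))
        mean_r = sum(r for *_, r in rollouts) / len(rollouts)
        return [(p, a, r - mean_r) for p, a, r in rollouts if r != mean_r]

    # toy stand-ins: the "policy" concatenates image ids, reward checks order
    batch = permuted_rollouts(list("ABC"), "?", lambda im, q: "".join(im),
                              lambda ans: float(ans == "ABC"))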

--------------------------------------------------------------------------------------------------------

PictSure: Pretraining Embeddings Matters for In-Context Learning Image Classifiers

Few-shot image classification in specialized domains faces significant challenges due to limited labeled data availability, making traditional supervised learning approaches impractical. While in-context learning offers promise for addressing data scarcity, existing approaches overlook the critical role of image embeddings in determining classification performance. PictSure systematically investigates how embedding model architecture, pretraining objectives, and fine-tuning strategies impact downstream few-shot classification tasks. The framework demonstrates superior out-of-domain performance compared to existing methods while maintaining competitive in-domain results. Applications include medical image diagnosis in rare disease identification, agricultural pest detection, quality control in manufacturing, wildlife species identification for conservation efforts, archaeological artifact classification, and rapid prototype classification in research settings where extensive labeled datasets are unavailable.

Authors:  Lukas Schiesser, Cornelius Wolff, Sophie Haas, Simon Pukrop

Link:  https://arxiv.org/abs/2506.14842v1

Date: 2025-06-16

Summary:

Building image classification models remains cumbersome in data-scarce domains, where collecting large labeled datasets is impractical. In-context learning (ICL) has emerged as a promising paradigm for few-shot image classification (FSIC), enabling models to generalize across domains without gradient-based adaptation. However, prior work has largely overlooked a critical component of ICL-based FSIC pipelines: the role of image embeddings. In this work, we present PictSure, an ICL framework that places the embedding model -- its architecture, pretraining, and training dynamics -- at the center of analysis. We systematically examine the effects of different visual encoder types, pretraining objectives, and fine-tuning strategies on downstream FSIC performance. Our experiments show that the training success and the out-of-domain performance are highly dependent on how the embedding models are pretrained. Consequently, PictSure manages to outperform existing ICL-based FSIC models on out-of-domain benchmarks that differ significantly from the training distribution, while maintaining comparable results on in-domain tasks. Code can be found at https://github.com/PictSure/pictsure-library.
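
The paper's central point, that the embedding model dominates few-shot performance, can be illustrated with a toy readout: hold the in-context mechanism fixed and swap embeddings to see accuracy move. The nearest-class-mean classifier below is a stand-in for PictSure's transformer over embedding-label pairs, not the released model.

    import numpy as np

    def few_shot_predict(embed, support_x, support_y, query_x):
        # With a fixed readout, prediction quality is driven almost entirely
        # by how well `embed` separates classes: build one prototype per
        # class from the support set and assign the query to the nearest one.
        protos = {c: np.mean([embed(x) for x, y in zip(support_x, support_y) if y == c], axis=0)
                  for c in set(support_y)}
        q = embed(query_x)
        return min(protos, key=lambda c: np.linalg.norm(q - protos[c]))

    # toy usage: identity "embedding" on 2-D points, two classes
    print(few_shot_predict(lambda x: np.asarray(x, float),
                           [[0, 0], [0, 1], [5, 5], [6, 5]], [0, 0, 1, 1], [5, 6]))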

--------------------------------------------------------------------------------------------------------

Rethinking Test-Time Scaling for Medical AI: Model and Task-Aware Strategies for LLMs and VLMs

Medical AI applications demand exceptional reliability and accuracy, yet current test-time scaling strategies for enhancing model reasoning capabilities remain underexplored in healthcare contexts. The complexity of medical decision-making, combined with the high stakes of clinical applications, necessitates careful evaluation of scaling approaches across different model types and medical tasks. This comprehensive investigation examines test-time scaling effectiveness for both language and vision-language models in medical domains, considering factors like model size, task complexity, and robustness against misleading information. Applications include clinical decision support systems, medical image interpretation, diagnostic reasoning assistance, drug discovery processes, patient risk assessment, telemedicine consultations, and medical education platforms where enhanced reasoning capabilities could significantly improve healthcare outcomes and professional training effectiveness.

Authors:  Gyutaek Oh, Seoyeon Kim, Sangjoon Park, Byung-Hoon Kim

Link:  https://arxiv.org/abs/2506.13102v1

Date: 2025-06-16

Summary:

Test-time scaling has recently emerged as a promising approach for enhancing the reasoning capabilities of large language models or vision-language models during inference. Although a variety of test-time scaling strategies have been proposed, and interest in their application to the medical domain is growing, many critical aspects remain underexplored, including their effectiveness for vision-language models and the identification of optimal strategies for different settings. In this paper, we conduct a comprehensive investigation of test-time scaling in the medical domain. We evaluate its impact on both large language models and vision-language models, considering factors such as model size, inherent model characteristics, and task complexity. Finally, we assess the robustness of these strategies under user-driven factors, such as misleading information embedded in prompts. Our findings offer practical guidelines for the effective use of test-time scaling in medical applications and provide insights into how these strategies can be further refined to meet the reliability and interpretability demands of the medical domain.
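
As one concrete example of the strategies such a study covers, here is a sketch of self-consistency decoding, where the test-time scaling knob is simply the number of sampled reasoning chains. The stand-in model is invented, and this is one strategy among the several the paper evaluates.

    import random
    from collections import Counter

    def self_consistency(model, prompt, n=16):
        # Sample several reasoning chains and majority-vote the final
        # answers; `model` is a stand-in returning (chain, answer) per sample.
        answers = [model(prompt)[1] for _ in range(n)]
        return Counter(answers).most_common(1)[0][0]

    # toy stand-in: a model whose sampled answers are noisy but biased to "A"
    print(self_consistency(lambda p: ("...", random.choice(["A", "A", "B"])), "q"))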

--------------------------------------------------------------------------------------------------------

The Compositional Architecture of Regret in Large Language Models

Understanding how large language models process and express regret when confronted with contradictory evidence is crucial for developing reliable AI systems capable of self-correction and trustworthy interaction. Current models lack transparency in their internal mechanisms for handling conflicting information, limiting their reliability in critical applications. This research introduces novel methodologies for analyzing regret mechanisms through specialized datasets, optimal representation layer identification, and neuron-level analysis. The discovery of M-shaped processing patterns and distinct neuron categories provides unprecedented insights into model cognition. Applications include improved fact-checking systems, enhanced conversational AI capable of acknowledging mistakes, more reliable educational assistants, trustworthy medical consultation tools, legal reasoning systems requiring self-correction capabilities, and fundamental advances in interpretable AI development for safety-critical domains.

Authors:  Xiangxiang Cui, Shu Yang, Tianjin Huang, Wanyu Lin, Lijie Hu, Di Wang

Link:  https://arxiv.org/abs/2506.15617v1

Date: 2025-06-18

Summary:

Regret in Large Language Models refers to their explicit regret expression when presented with evidence contradicting their previously generated misinformation. Studying the regret mechanism is crucial for enhancing model reliability and helps in revealing how cognition is coded in neural networks. To understand this mechanism, we need to first identify regret expressions in model outputs, then analyze their internal representation. This analysis requires examining the model's hidden states, where information processing occurs at the neuron level. However, this faces three key challenges: (1) the absence of specialized datasets capturing regret expressions, (2) the lack of metrics to find the optimal regret representation layer, and (3) the lack of metrics for identifying and analyzing regret neurons. Addressing these limitations, we propose: (1) a workflow for constructing a comprehensive regret dataset through strategically designed prompting scenarios, (2) the Supervised Compression-Decoupling Index (S-CDI) metric to identify optimal regret representation layers, and (3) the Regret Dominance Score (RDS) metric to identify regret neurons and the Group Impact Coefficient (GIC) to analyze activation patterns. Our experimental results successfully identified the optimal regret representation layer using the S-CDI metric, which significantly enhanced performance in probe classification experiments. Additionally, we discovered an M-shaped decoupling pattern across model layers, revealing how information processing alternates between coupling and decoupling phases. Through the RDS metric, we categorized neurons into three distinct functional groups: regret neurons, non-regret neurons, and dual neurons.
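
A minimal version of the layer-selection step can be sketched with linear probes: classify regret vs. non-regret examples from each layer's hidden states and keep the layer with the best cross-validated accuracy. Plain probe accuracy stands in here for the paper's S-CDI metric, and the data below is synthetic.

    import numpy as np
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import cross_val_score

    def best_regret_layer(hidden_states, labels):
        # Fit a linear probe per layer and return the layer whose probe best
        # separates regret from non-regret hidden states.
        accs = [cross_val_score(LogisticRegression(max_iter=1000), h, labels, cv=3).mean()
                for h in hidden_states]
        return int(np.argmax(accs)), accs

    # synthetic stand-in: later "layers" carry a stronger regret signal
    rng = np.random.default_rng(0)
    y = rng.integers(0, 2, 60)
    layers = [rng.normal(size=(60, 16)) + 0.5 * l * y[:, None] for l in range(4)]
    print(best_regret_layer(layers, y))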

--------------------------------------------------------------------------------------------------------

Spotting tell-tale visual artifacts in face swapping videos: strengths and pitfalls of CNN detectors

Face swapping technology poses increasing security threats in remote communications, enabling sophisticated deepfake attacks that can compromise authentication systems and spread misinformation. While detection methods exploiting visual artifacts from occlusion scenarios show promise, their real-world effectiveness across different acquisition sources and swapping algorithms remains unclear. This investigation benchmarks CNN-based detection models on comprehensive datasets, revealing excellent within-domain performance but significant cross-domain generalization challenges. The findings highlight critical limitations in current detection approaches when faced with diverse swapping techniques and acquisition conditions. Applications include video conference security systems, social media content verification, law enforcement forensic analysis, journalism fact-checking tools, identity verification for financial services, and educational platforms teaching digital media literacy to combat misinformation and protect against sophisticated visual deception attacks.

Authors:  Riccardo Ziglio, Cecilia Pasquini, Silvio Ranise

Link:  https://arxiv.org/abs/2506.16497v1

Date: 2025-06-19

Summary:

Face swapping manipulations in video streams represent an increasing threat in remote video communications, due to advances in automated and real-time tools. Recent literature proposes to characterize and exploit visual artifacts introduced in video frames by swapping algorithms when dealing with challenging physical scenes, such as face occlusions. This paper investigates the effectiveness of this approach by benchmarking CNN-based data-driven models on two data corpora (including a newly collected one) and analyzing generalization capabilities with respect to different acquisition sources and swapping algorithms. The results confirm excellent performance of general-purpose CNN architectures when operating within the same data source, but a significant difficulty in robustly characterizing occlusion-based visual cues across datasets. This highlights the need for specialized detection strategies to deal with such artifacts.

--------------------------------------------------------------------------------------------------------

DAILOC: Domain-Incremental Learning for Indoor Localization using Smartphones

Indoor localization systems face persistent challenges from device heterogeneity and temporal environmental changes, causing performance degradation and requiring frequent recalibration in real-world deployments. Traditional approaches address device and temporal variations independently, leading to poor generalization and catastrophic forgetting over time. DAILOC introduces a novel domain-incremental learning framework that simultaneously handles both challenges through innovative disentanglement strategies and memory-guided alignment mechanisms. The system achieves significant improvements in localization accuracy across multiple devices and time periods. Applications include smart building navigation systems, retail customer analytics, emergency response coordination, warehouse automation, healthcare patient tracking, museum interactive guides, and accessibility services for visually impaired individuals requiring reliable indoor positioning across diverse smartphone devices and changing environmental conditions.

Authors:  Akhil Singampalli, Danish Gufran, Sudeep Pasricha

Link:  https://arxiv.org/abs/2506.15554v1

Date: 2025-06-18

Summary:

Wi-Fi fingerprinting-based indoor localization faces significant challenges in real-world deployments due to domain shifts arising from device heterogeneity and temporal variations within indoor environments. Existing approaches often address these issues independently, resulting in poor generalization and susceptibility to catastrophic forgetting over time. In this work, we propose DAILOC, a novel domain-incremental learning framework that jointly addresses both temporal and device-induced domain shifts. DAILOC introduces a novel disentanglement strategy that separates domain shifts from location-relevant features using a multi-level variational autoencoder. Additionally, we introduce a novel memory-guided class latent alignment mechanism to address the effects of catastrophic forgetting over time. Experiments across multiple smartphones, buildings, and time instances demonstrate that DAILOC significantly outperforms state-of-the-art methods, achieving up to 2.74x lower average error and 4.6x lower worst-case error.
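
A minimal sketch of the disentanglement idea, assuming a plain encoder with two latent heads: one latent is meant to absorb device- and time-induced domain shift while only the other feeds the location head. The multi-level VAE, its training losses, and the memory-guided class latent alignment are all omitted.

    import torch, torch.nn as nn

    class DisentangleSketch(nn.Module):
        # Encode a Wi-Fi fingerprint (RSSI vector over access points) into a
        # domain latent and a location latent; only the location latent is
        # used for position prediction, so domain shift is routed elsewhere.
        def __init__(self, n_aps=100, d=32, n_locations=50):
            super().__init__()
            self.enc = nn.Sequential(nn.Linear(n_aps, 128), nn.ReLU())
            self.to_domain = nn.Linear(128, d)
            self.to_location = nn.Linear(128, d)
            self.loc_head = nn.Linear(d, n_locations)

        def forward(self, rssi):
            h = self.enc(rssi)
            z_dom, z_loc = self.to_domain(h), self.to_location(h)
            return self.loc_head(z_loc), z_dom, z_loc

    logits, z_d, z_l = DisentangleSketch()(torch.randn(4, 100))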

--------------------------------------------------------------------------------------------------------

Enhancing Step-by-Step and Verifiable Medical Reasoning in MLLMs

Medical multimodal large language models require robust reasoning capabilities for accurate diagnosis and treatment recommendations, yet existing approaches lack comprehensive frameworks for generating and evaluating effective reasoning paths. The complexity of medical decision-making demands systematic approaches to chain-of-thought reasoning that can be verified and trusted by healthcare professionals. The proposed Mentor-Intern Collaborative Search methodology generates rigorous medical reasoning data through collaborative model interactions and quality assessment. This approach enables the development of Chiron-o1, achieving state-of-the-art performance across medical benchmarks. Applications include clinical decision support systems, medical education platforms, diagnostic reasoning assistance, radiology report generation, patient case analysis, telemedicine consultations, medical research analysis, and training tools for healthcare professionals requiring explainable and verifiable AI reasoning in high-stakes medical environments.

Authors:  Haoran Sun, Yankai Jiang, Wenjie Lou, Yujie Zhang, Wenjie Li, Lilong Wang, Mianxin Liu, Lei Liu, Xiaosong Wang

Link:  https://arxiv.org/abs/2506.16962v1

Date: 2025-06-20

Summary:

Multimodal large language models (MLLMs) have begun to demonstrate robust reasoning capabilities on general tasks, yet their application in the medical domain remains in its early stages. Constructing chain-of-thought (CoT) training data is essential for bolstering the reasoning abilities of medical MLLMs. However, existing approaches exhibit a deficiency in offering a comprehensive framework for searching and evaluating effective reasoning paths towards critical diagnosis. To address this challenge, we propose Mentor-Intern Collaborative Search (MICS), a novel reasoning-path searching scheme to generate rigorous and effective medical CoT data. MICS first leverages mentor models to initialize the reasoning, one step at a time, then prompts each intern model to continue the thinking along those initiated paths, and finally selects the optimal reasoning path according to the overall reasoning performance of multiple intern models. The reasoning performance is determined by an MICS-Score, which assesses the quality of generated reasoning paths. Finally, we construct MMRP, a multi-task medical reasoning dataset with ranked difficulty, and Chiron-o1, a new medical MLLM devised via a curriculum learning strategy, with robust visual question-answering and generalizable reasoning capabilities. Extensive experiments demonstrate that Chiron-o1, trained on our CoT dataset constructed using MICS, achieves state-of-the-art performance across a list of medical visual question answering and reasoning benchmarks. Code is available at https://github.com/manglu097/Chiron-o1.
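
The search scheme itself is compact enough to sketch: mentors propose the next reasoning step, interns attempt to finish from each candidate path, and the average intern success (playing the role of the MICS-Score) selects the continuation. All models below are stand-in callables, not the released pipeline.

    def mics_search(mentors, interns, question, score, max_steps=6):
        # Mentor-intern collaborative search, sketched: each mentor extends
        # the current path by one step; every candidate path is scored by how
        # well the intern models finish from it, and the best one is kept.
        path = []
        for _ in range(max_steps):
            candidates = [path + [mentor(question, path)] for mentor in mentors]
            scored = [(sum(score(intern(question, c)) for intern in interns) / len(interns), c)
                      for c in candidates]
            best_score, path = max(scored, key=lambda t: t[0])
            if best_score >= 1.0:    # all interns reach the correct answer
                break
        return path

    # toy stand-ins: one mentor appending steps, one intern echoing the path
    path = mics_search(mentors=[lambda q, p: f"step{len(p) + 1}"],
                       interns=[lambda q, p: p],
                       question="q",
                       score=lambda ans: float(len(ans) >= 3))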

--------------------------------------------------------------------------------------------------------

The Price of Freedom: Exploring Expressivity and Runtime Tradeoffs in Equivariant Tensor Products

Equivariant neural networks have revolutionized 3D modeling tasks by incorporating geometric symmetries, but their computational efficiency remains a critical bottleneck for practical applications. The tensor product operation, fundamental to these networks, presents significant computational complexity challenges that researchers attempt to optimize through various approaches. However, reported speedups often come at hidden costs to model expressivity and performance. This systematic analysis reveals crucial tradeoffs between computational efficiency and model capabilities across different tensor product implementations. The work introduces practical optimization strategies and comprehensive benchmarking methodologies. Applications include molecular property prediction, protein structure analysis, materials science simulations, robotics motion planning, computer graphics rendering, drug discovery processes, and physics-informed machine learning where geometric understanding and computational efficiency must be carefully balanced for practical deployment.

Authors:  YuQing Xie, Ameya Daigavane, Mit Kotak, Tess Smidt

Link:  https://arxiv.org/abs/2506.13523v1

Date: 2025-06-16

Summary:

E(3)-equivariant neural networks have demonstrated success across a wide range of 3D modelling tasks. A fundamental operation in these networks is the tensor product, which interacts two geometric features in an equivariant manner to create new features. Due to the high computational complexity of the tensor product, significant effort has been invested to optimize the runtime of this operation. For example, Luo et al. (2024) recently proposed the Gaunt tensor product (GTP), which promises a significant speedup. In this work, we provide a careful, systematic analysis of a number of tensor product operations. In particular, we emphasize that different tensor products are not performing the same operation. The reported speedups typically come at the cost of expressivity. We introduce measures of expressivity and interactability to characterize these differences. In addition, we realized that the original implementation of GTP can be greatly simplified by directly using a spherical grid at no cost in asymptotic runtime. This spherical grid approach is faster on our benchmarks and speeds up actual training of the MACE interatomic potential by 30%. Finally, we provide the first systematic microbenchmarks of the various tensor product operations. We find that the theoretical runtime guarantees can differ wildly from empirical performance, demonstrating the need for careful application-specific benchmarking. Code is available at https://github.com/atomicarchitects/PriceofFreedom.
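
The benchmarking message carries over to a very small harness: time implementations directly on representative shapes instead of trusting asymptotic claims. The two toy "tensor products" below, a full outer product versus a cheap elementwise product, are stand-ins that mirror the expressivity-versus-runtime tradeoff; they are not the equivariant operations themselves.

    import time, statistics, torch

    def microbenchmark(fn, *args, warmup=10, iters=100):
        # Median wall-clock time of fn(*args): the kind of application-level
        # microbenchmark the paper argues should accompany runtime claims.
        for _ in range(warmup):
            fn(*args)
        times = []
        for _ in range(iters):
            t0 = time.perf_counter()
            fn(*args)
            times.append(time.perf_counter() - t0)
        return statistics.median(times)

    x, y = torch.randn(64, 32), torch.randn(64, 32)
    naive = lambda a, b: torch.einsum('bi,bj->bij', a, b)  # full outer product: expressive, costly
    cheap = lambda a, b: a * b                             # elementwise: fast, less expressive
    print(microbenchmark(naive, x, y), microbenchmark(cheap, x, y))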

--------------------------------------------------------------------------------------------------------

MIST: Jailbreaking Black-box Large Language Models via Iterative Semantic Tuning

Large language models' safety alignments remain vulnerable to sophisticated attack methods designed to elicit harmful responses, posing significant risks for deployed AI systems. Traditional jailbreaking approaches face challenges with discrete token inputs, limited model access, and constrained query budgets when targeting black-box systems. MIST introduces an innovative iterative semantic tuning approach that preserves original intent while inducing harmful content through sequential synonym search and order-determining optimization strategies. The method demonstrates competitive success rates across multiple model architectures with improved computational efficiency. Applications include AI safety research, red-team security testing, robustness evaluation for deployed systems, safety protocol development, regulatory compliance assessment, and defensive mechanism design for protecting against adversarial prompt engineering in production AI systems requiring enhanced security measures.

Authors:  Muyang Zheng, Yuanzhi Yao, Changting Lin, Rui Wang, Meng Han

Link:  https://arxiv.org/abs/2506.16792v1

Date: 2025-06-20

Summary:

Despite efforts to align large language models (LLMs) with societal and moral values, these models remain susceptible to jailbreak attacks--methods designed to elicit harmful responses. Jailbreaking black-box LLMs is considered challenging due to the discrete nature of token inputs, restricted access to the target LLM, and limited query budget. To address the issues above, we propose an effective method for jailbreaking black-box large language Models via Iterative Semantic Tuning, named MIST. MIST enables attackers to iteratively refine prompts that preserve the original semantic intent while inducing harmful content. Specifically, to balance semantic similarity with computational efficiency, MIST incorporates two key strategies: sequential synonym search, and its advanced version--order-determining optimization. Extensive experiments across two open-source models and four closed-source models demonstrate that MIST achieves competitive attack success rates and attack transferability compared with other state-of-the-art white-box and black-box jailbreak methods. Additionally, we conduct experiments on computational efficiency to validate the practical viability of MIST.

--------------------------------------------------------------------------------------------------------

From LLMs to MLLMs to Agents: A Survey of Emerging Paradigms in Jailbreak Attacks and Defenses within LLM Ecosystem

The rapid evolution from single-modal language models to multimodal systems and intelligent agents has dramatically expanded AI capabilities while introducing increasingly sophisticated security vulnerabilities. Current understanding of jailbreak attacks and defense mechanisms lacks comprehensive coverage of these emerging paradigms and their unique security challenges. This systematic survey traces the developmental trajectory across the expanding LLM ecosystem, categorizing attack techniques and defense strategies while identifying critical research gaps. The work provides updated synthesis of recent advances and outlines future research directions for resilient AI systems. Applications include enterprise AI security frameworks, regulatory compliance guidelines, academic research roadmaps, industry security standards development, government policy formation for AI governance, and educational curricula for AI safety training in the rapidly evolving landscape of intelligent systems.

Authors:  Yanxu Mao, Tiehan Cui, Peipei Liu, Datao You, Hongsong Zhu

Link:  https://arxiv.org/abs/2506.15170v1

Date: 2025-06-18

Summary:

Large language models (LLMs) are rapidly evolving from single-modal systems to multimodal LLMs and intelligent agents, significantly expanding their capabilities while introducing increasingly severe security risks. This paper presents a systematic survey of the growing complexity of jailbreak attacks and corresponding defense mechanisms within the expanding LLM ecosystem. We first trace the developmental trajectory from LLMs to MLLMs and Agents, highlighting the core security challenges emerging at each stage. Next, we categorize mainstream jailbreak techniques from both the attack impact and visibility perspectives, and provide a comprehensive analysis of representative attack methods, related datasets, and evaluation metrics. On the defense side, we organize existing strategies based on response timing and technical approach, offering a structured understanding of their applicability and implementation. Furthermore, we identify key limitations in existing surveys, such as insufficient attention to agent-specific security issues, the absence of a clear taxonomy for hybrid jailbreak methods, a lack of detailed analysis of experimental setups, and outdated coverage of recent advancements. To address these limitations, we provide an updated synthesis of recent work and outline future research directions in areas such as dataset construction, evaluation framework optimization, and strategy generalization. Our study seeks to enhance the understanding of jailbreak mechanisms and facilitate the advancement of more resilient and adaptive defense strategies in the context of ever more capable LLMs.

--------------------------------------------------------------------------------------------------------

Edeflip: Supervised Word Translation between English and Yoruba

Machine translation for low-resource languages faces unique challenges that existing high-resource language solutions cannot adequately address, limiting global accessibility to translation technologies. While embedding alignment has become state-of-the-art for translation without parallel corpora, its effectiveness for languages with limited computational resources remains unclear. This study implements supervised embedding alignment for English-Yoruba translation, revealing critical factors like embedding quality and normalization that significantly impact translation precision in low-resource settings. The work demonstrates limitations of current methods while identifying essential considerations for low-resource language translation. Applications include endangered language preservation, cross-cultural communication tools, educational resources for indigenous communities, international development programs, cultural heritage digitization, multilingual content creation, and bridging digital divides by making translation technology accessible to underrepresented linguistic communities worldwide.

Authors:  Ikeoluwa Abioye, Jiani Ge

Link:  https://arxiv.org/abs/2506.13020v1

Date: 2025-06-16

Summary:

In recent years, embedding alignment has become the state-of-the-art machine translation approach, as it can yield high-quality translation without training on parallel corpora. However, existing research and application of embedding alignment mostly focus on high-resource languages with high-quality monolingual embeddings. It is unclear if and how low-resource languages may similarly benefit. In this study, we implement an established supervised embedding alignment method for word translation from English to Yoruba, the latter a low-resource language. We found that higher embedding quality and normalizing embeddings both increase word translation precision, and that the two factors interact. Our results demonstrate the limitations of state-of-the-art supervised embedding alignment when it comes to low-resource languages, for which additional factors need to be taken into consideration, such as the importance of curating high-quality monolingual embeddings. We hope our work will be a starting point for further machine translation research that takes into account the challenges that low-resource languages face.
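
The alignment method in question is classical supervised orthogonal Procrustes over a seed dictionary. A minimal sketch follows, with the normalization switch included because the study finds it materially affects precision; the synthetic rotation at the end is only a self-check that the solver recovers a known mapping, not an experiment from the paper.

    import numpy as np

    def supervised_alignment(X_en, Y_yo, normalize=True):
        # Orthogonal Procrustes: learn a rotation W mapping English vectors
        # onto their Yoruba translations from a seed dictionary, so that
        # X_en @ W.T is close to Y_yo in the shared space.
        if normalize:
            X_en = X_en / np.linalg.norm(X_en, axis=1, keepdims=True)
            Y_yo = Y_yo / np.linalg.norm(Y_yo, axis=1, keepdims=True)
        U, _, Vt = np.linalg.svd(Y_yo.T @ X_en)
        return U @ Vt

    # self-check on synthetic data: recover a known random rotation
    rng = np.random.default_rng(0)
    X = rng.normal(size=(500, 64))
    true_R = np.linalg.qr(rng.normal(size=(64, 64)))[0]
    W = supervised_alignment(X, X @ true_R.T)
    print(np.allclose(W, true_R, atol=1e-6))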

--------------------------------------------------------------------------------------------------------

Tower+: Bridging Generality and Translation Specialization in Multilingual LLMs

Specialized fine-tuning for machine translation often sacrifices general-purpose capabilities like conversational reasoning and instruction-following, limiting practical utility in real-world applications requiring diverse skills. Current approaches struggle to maintain both translation excellence and multilingual general-purpose performance simultaneously. Tower+ introduces a novel training recipe combining continued pretraining, supervised fine-tuning, preference optimization, and reinforcement learning with verifiable rewards to achieve optimal balance. The resulting models deliver state-of-the-art translation performance while maintaining strong general capabilities across multiple scales. Applications include international business communication platforms, multilingual customer service systems, global content localization services, diplomatic and legal document translation, multilingual educational platforms, cross-border e-commerce solutions, and comprehensive language assistance tools serving diverse user needs in professional and personal contexts.

Authors:  Ricardo Rei, Nuno M. Guerreiro, José Pombal, João Alves, Pedro Teixeirinha, Amin Farajian, André F. T. Martins

Link:  https://arxiv.org/abs/2506.17080v1

Date: 2025-06-20

Summary:

Fine-tuning pretrained LLMs has been shown to be an effective strategy for reaching state-of-the-art performance on specific tasks like machine translation. However, this process of adaptation often implies sacrificing general-purpose capabilities, such as conversational reasoning and instruction-following, hampering the utility of the system in real-world applications that require a mixture of skills. In this paper, we introduce Tower+, a suite of models designed to deliver strong performance across both translation and multilingual general-purpose text capabilities. We achieve a Pareto frontier between translation specialization and multilingual general-purpose capabilities by introducing a novel training recipe that builds on Tower (Alves et al., 2024), comprising continued pretraining, supervised fine-tuning, preference optimization, and reinforcement learning with verifiable rewards. At each stage of training, we carefully generate and curate data to strengthen performance on translation as well as general-purpose tasks involving code generation, mathematics problem solving, and general instruction-following. We develop models at multiple scales: 2B, 9B, and 72B. Our smaller models often outperform larger general-purpose open-weight and proprietary LLMs (e.g., Llama 3.3 70B, GPT-4o). Our largest model delivers best-in-class translation performance for high-resource languages and top results in multilingual Arena Hard evaluations and in IF-MT, a benchmark we introduce for evaluating both translation and instruction-following. Our findings highlight that it is possible to rival frontier models in general capabilities, while optimizing for specific business domains, such as translation and localization.

--------------------------------------------------------------------------------------------------------

Revolutionizing Validation and Verification: Explainable Testing Methodologies for Intelligent Automotive Decision-Making Systems

Autonomous driving systems employ complex decision-making models with multimodal inputs, making traditional validation and verification approaches inadequate for ensuring safety and reliability. Current manual testing methods are inefficient and labor-intensive, while the complexity of these systems makes failure diagnosis and anomaly tracing extremely challenging. This methodology integrates explainability, transparency, and interpretability into validation processes through large language model-generated test scenarios and real-time simulation validation. The framework includes comprehensive testing tools with explanation generation capabilities. Applications include automotive safety certification, regulatory compliance testing, insurance risk assessment, fleet management systems, autonomous vehicle development, transportation infrastructure planning, and public trust building for autonomous technologies requiring transparent and verifiable safety validation in critical transportation environments.

Authors:  Halit Eris, Stefan Wagner

Link:  https://arxiv.org/abs/2506.16876v1

Date: 2025-06-20

Summary:

Autonomous Driving Systems (ADS) use complex decision-making (DM) models with multimodal sensory inputs, making rigorous validation and verification (V&V) essential for safety and reliability. These models pose challenges in diagnosing failures, tracing anomalies, and maintaining transparency, with current manual testing methods being inefficient and labor-intensive. This vision paper presents a methodology that integrates explainability, transparency, and interpretability into V&V processes. We propose refining V&V requirements through literature reviews and stakeholder input, generating explainable test scenarios via large language models (LLMs), and enabling real-time validation in simulation environments. Our framework includes test oracle, explanation generation, and a test chatbot, with empirical studies planned to evaluate improvements in diagnostic efficiency and transparency. Our goal is to streamline V&V, reduce resources, and build user trust in autonomous technologies.

--------------------------------------------------------------------------------------------------------

Natural Intelligence: the information processing power of life

Understanding the computational capabilities of biological systems provides crucial insights into the fundamental nature of information processing and its relationship to artificial intelligence development. Living systems perform extraordinary information processing through chemical interactions, from photosynthesis to neural activity, yet the scale and scope of these computations remain poorly quantified. This investigation estimates the total information processing capacity of life on Earth, revealing that biological systems perform approximately 10^33-10^35 operations per second globally, with individual humans processing 10^20-10^22 operations per second. Applications include bio-inspired computing architectures, understanding cognitive limits and capabilities, developing more efficient artificial intelligence systems, advancing synthetic biology, improving drug discovery processes, enhancing our understanding of consciousness and intelligence, and informing the development of neuromorphic computing systems that could revolutionize computational efficiency.

Authors:  Seth Lloyd, Michele Reilly

Link:  https://arxiv.org/abs/2506.16478v1

Date: 2025-06-19

Summary:

Merely by existing, all physical systems contain information, and physical dynamics transforms and processes that information. This note investigates the information processing power of living systems. Living systems harvest free energy from the sun, from geothermal sources, and from each other. They then use that free energy to drive the complex set of chemical interactions that underlie life. All molecules -- be they simple molecules such as water, or complex molecules such as DNA -- register information via their chemical composition. When these molecules undergo chemical reactions, that information is transformed and processed. These chemical transformations can be thought of as elementary logical operations: such bio-ops include the absorption of a photon in a chromophore during photosynthesis, the formation or breaking of covalent, hydrogen, and van der Waals bonds in the process of metabolism and reproduction, or the release of a neurotransmitter molecule when a synapse fires in the brain. This paper estimates the total number of bio-ops that have been, and are being, performed by life on earth. We find that the current number of bio-ops performed by all life on earth is around 10^33-10^35 bio-ops per second. The cells in an individual human being perform around 10^20-10^22 bio-ops per second, comparable to the information processing power of all the computers, cell phones, and server farms on earth. Depending on how one defines a neural operation, at most a few percent of human bio-ops take place in the firing of neurons and synapses in the brain. Over the course of life on earth, about 10^50-10^52 bio-ops have taken place.
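
The headline numbers are internally consistent, which a back-of-envelope check makes visible. The mid-range values below are assumptions taken from the quoted ranges, not derived quantities:

    import math

    # Paper's order-of-magnitude figures (mid-range values assumed here)
    earth_rate = 1e34             # bio-ops/s for all life (paper: 10^33-10^35)
    human_rate = 1e21             # bio-ops/s for one human (paper: 10^20-10^22)
    age_of_life = 3.5e9 * 3.15e7  # ~3.5 billion years of life, in seconds

    print(f"one human ~ 10^{math.log10(human_rate / earth_rate):.0f} of the biosphere's rate")
    # cumulative ops land around 10^51, inside the paper's 10^50-10^52 range
    print(f"cumulative ~ 10^{math.log10(earth_rate * age_of_life):.0f} bio-ops")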

--------------------------------------------------------------------------------------------------------

One-Step Diffusion for Detail-Rich and Temporally Consistent Video Super-Resolution

Real-world video super-resolution faces the fundamental challenge of generating rich spatial details while maintaining temporal consistency, particularly when leveraging powerful generative models like stable diffusion. Existing methods often compromise between detail quality and temporal coherence, resulting in suboptimal visual experiences. The proposed Dual LoRA Learning paradigm trains an effective one-step diffusion model that achieves both realistic frame details and temporal consistency through innovative Cross-Frame Retrieval and alternating LoRA training phases. This approach enables efficient high-quality video restoration in a single diffusion step. Applications include video streaming enhancement, legacy media restoration, security surveillance improvement, medical imaging enhancement, entertainment content upscaling, virtual reality content processing, and broadcast television quality improvement where both detail preservation and temporal stability are crucial for user experience and analysis accuracy.

Authors:  Yujing Sun, Lingchen Sun, Shuaizheng Liu, Rongyuan Wu, Zhengqiang Zhang, Lei Zhang

Link:  https://arxiv.org/abs/2506.15591v2

Date: 2025-06-20

Summary:

It is a challenging problem to reproduce rich spatial details while maintaining temporal consistency in real-world video super-resolution (Real-VSR), especially when we leverage pre-trained generative models such as stable diffusion (SD) for realistic details synthesis. Existing SD-based Real-VSR methods often compromise spatial details for temporal coherence, resulting in suboptimal visual quality. We argue that the key lies in how to effectively extract the degradation-robust temporal consistency priors from the low-quality (LQ) input video and enhance the video details while maintaining the extracted consistency priors. To achieve this, we propose a Dual LoRA Learning (DLoRAL) paradigm to train an effective SD-based one-step diffusion model, achieving realistic frame details and temporal consistency simultaneously. Specifically, we introduce a Cross-Frame Retrieval (CFR) module to aggregate complementary information across frames, and train a Consistency-LoRA (C-LoRA) to learn robust temporal representations from degraded inputs. After consistency learning, we fix the CFR and C-LoRA modules and train a Detail-LoRA (D-LoRA) to enhance spatial details while aligning with the temporal space defined by C-LoRA to keep temporal coherence. The two phases alternate iteratively for optimization, collaboratively delivering consistent and detail-rich outputs. During inference, the two LoRA branches are merged into the SD model, allowing efficient and high-quality video restoration in a single diffusion step. Experiments show that DLoRAL achieves strong performance in both accuracy and speed. Code and models are available at https://github.com/yjsunnn/DLoRAL.
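
The inference-time trick, folding both trained LoRA branches into the base weights so that restoration costs a single diffusion step, can be sketched generically. Shapes and scaling below are illustrative assumptions; the released code wires this into the SD network rather than a bare weight matrix.

    import torch

    def merge_dual_lora(W0, lora_pairs, scale=1.0):
        # Fold each low-rank branch (A, B) into the base weight: after the
        # merge, inference runs at the base model's cost with no LoRA overhead.
        W = W0.clone()
        for A, B in lora_pairs:
            W += scale * (B @ A)
        return W

    d, r = 64, 8
    W0 = torch.randn(d, d)
    c_lora = (torch.randn(r, d), torch.randn(d, r))   # Consistency-LoRA branch
    d_lora = (torch.randn(r, d), torch.randn(d, r))   # Detail-LoRA branch
    W = merge_dual_lora(W0, [c_lora, d_lora])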

--------------------------------------------------------------------------------------------------------

Warping and Matching Subsequences Between Time Series

Time series comparison is fundamental to clustering, classification, and pattern recognition tasks, yet current elastic distance measures provide quantitative comparisons without qualitative insights into structural relationships. Traditional visualizations focus on point-to-point alignment, failing to convey broader transformations like shifting, compression, and amplitude changes at the subsequence level. This limitation makes it difficult to understand how time series relate to each other beyond numerical similarity scores. The proposed technique simplifies warping paths to highlight, quantify, and visualize key transformations, enhancing interpretability in time series analysis. Applications include financial market analysis, medical signal interpretation, industrial process monitoring, climate data analysis, speech recognition systems, motion capture analysis, and anomaly detection where understanding the nature of temporal relationships and structural changes is crucial for decision-making and pattern understanding.

Authors:  Simiao Lin, Wannes Meert, Pieter Robberechts, Hendrik Blockeel

Link:  https://arxiv.org/abs/2506.15452v1

Date: 2025-06-18

Summary:

Comparing time series is essential in various tasks such as clustering and classification. While elastic distance measures that allow warping provide a robust quantitative comparison, a qualitative comparison on top of them is missing. Traditional visualizations focus on point-to-point alignment and do not convey the broader structural relationships at the level of subsequences. This limitation makes it difficult to understand how and where one time series shifts, speeds up or slows down with respect to another. To address this, we propose a novel technique that simplifies the warping path to highlight, quantify and visualize key transformations (shift, compression, difference in amplitude). By offering a clearer representation of how subsequences match between time series, our method enhances interpretability in time series comparison.
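
A minimal sketch of the underlying machinery: compute a DTW warping path, then read qualitative structure off a simplified segment, where offset from the diagonal suggests a shift and slope away from one suggests compression or expansion. The paper's simplification and quantification criteria are more careful than this toy slope test.

    import numpy as np

    def dtw_path(a, b):
        # Plain O(n*m) dynamic time warping, returning the optimal path.
        n, m = len(a), len(b)
        D = np.full((n + 1, m + 1), np.inf)
        D[0, 0] = 0
        for i in range(1, n + 1):
            for j in range(1, m + 1):
                D[i, j] = (a[i-1] - b[j-1])**2 + min(D[i-1, j], D[i, j-1], D[i-1, j-1])
        path, (i, j) = [], (n, m)
        while (i, j) != (0, 0):
            path.append((i - 1, j - 1))
            i, j = min([(i-1, j), (i, j-1), (i-1, j-1)], key=lambda t: D[t])
        return path[::-1]

    def describe_segment(path, start, end):
        # Toy qualitative readout of one path segment: offset from the
        # diagonal suggests a shift, slope != 1 suggests compression/expansion.
        (i0, j0), (i1, j1) = path[start], path[end]
        return {"shift": j0 - i0, "speed_ratio": (j1 - j0) / max(i1 - i0, 1)}

    path = dtw_path(np.sin(np.linspace(0, 6, 80)), np.sin(np.linspace(0, 6, 60)))
    print(describe_segment(path, 0, len(path) - 1))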

--------------------------------------------------------------------------------------------------------

Open-World Object Counting in Videos

Traditional object counting methods are limited to predefined categories and struggle with dynamic video environments where objects appear, disappear, and reappear across frames. This limitation is particularly problematic in crowded scenes with occlusions where avoiding double counting and tracking reappearances becomes crucial. The new task of open-world object counting in videos addresses these challenges by enabling counting of any object specified through text descriptions or image examples. CountVid leverages image-based counting and promptable video segmentation to achieve automated, open-world counting capabilities. Applications include wildlife population monitoring, crowd management and safety, inventory tracking in warehouses, traffic analysis, retail analytics, surveillance systems, environmental research, sports performance analysis, and quality control in manufacturing where accurate counting of diverse objects across video sequences is essential for operational efficiency and data-driven decision making.

Authors:  Niki Amini-Naieni, Andrew Zisserman

Link:  https://arxiv.org/abs/2506.15368v1

Date: 2025-06-18

Summary:

We introduce a new task of open-world object counting in videos: given a text description, or an image example, that specifies the target object, the objective is to enumerate all the unique instances of the target objects in the video. This task is especially challenging in crowded scenes with occlusions and similar objects, where avoiding double counting and identifying reappearances is crucial. To this end, we make the following contributions: we introduce a model, CountVid, for this task. It leverages an image-based counting model, and a promptable video segmentation and tracking model to enable automated, open-world object counting across video frames. To evaluate its performance, we introduce VideoCount, a new dataset for our novel task built from the TAO and MOT20 tracking datasets, as well as from videos of penguins and metal alloy crystallization captured by x-rays. Using this dataset, we demonstrate that CountVid provides accurate object counts, and significantly outperforms strong baselines. The VideoCount dataset, the CountVid model, and all the code are available at https://github.com/niki-amini-naieni/CountVid/.
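
The pipeline logic reduces to counting distinct track identities rather than per-frame detections. In the hedged sketch below, detect and segment_and_track are hypothetical stand-ins for the image-based counter and the promptable video segmentation-and-tracking model, not the released CountVid API.

    def count_unique_objects(frames, prompt, detect, segment_and_track):
        # An image-based counter proposes object locations per frame for the
        # prompted category; a promptable video segmenter links them into
        # identity-preserving tracks; the final count is the number of
        # distinct track ids, so occlusions and reappearances are not
        # double counted.
        track_ids = set()
        for t, frame in enumerate(frames):
            for box in detect(frame, prompt):        # open-world, text-prompted
                track_ids.add(segment_and_track(frames, t, box))
        return len(track_ids)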

--------------------------------------------------------------------------------------------------------


EYE ON A.I. GETS READERS UP TO DATE ON THE LATEST FUNDING NEWS AND RELATED ISSUES. SUBSCRIBE FOR THE WEEKLY NEWSLETTER.