Week Ending 4.27.2025
RESEARCH WATCH: 4.27.2025
LLM-Guided Open RAN: Empowering Hierarchical RAN Intelligent Control
This paper introduces an innovative framework integrating large language models with hierarchical radio access network intelligent controllers for improved wireless network management. By combining LLMs for strategic guidance with reinforcement learning for real-time tasks, the LLM-hRIC approach enhances collaboration between non-real-time and near-real-time components in open radio access networks. This integration enables more efficient resource allocation across different time scales, particularly beneficial in integrated access and backhaul networks. The framework could revolutionize telecommunications infrastructure by making networks more intelligent, adaptable, and responsive to changing conditions, potentially improving service quality while optimizing resource usage.
Authors: Lingyan Bao, Sinwoong Yun, Jemin Lee, Tony Q. S. Quek
Link: https://arxiv.org/abs/2504.18062v1
Date: 2025-04-25
Summary:
Recent advancements in large language models (LLMs) have led to significant interest in deploying LLM-empowered algorithms for wireless communication networks. Meanwhile, open radio access network (O-RAN) techniques offer unprecedented flexibility, with the non-real-time (non-RT) and near-real-time (near-RT) RAN intelligent controllers (RICs) enabling intelligent resource management across different time scales. In this paper, we propose the LLM-empowered hierarchical RIC (LLM-hRIC) framework to improve the collaboration between RICs. This framework integrates LLMs with reinforcement learning (RL) for efficient network resource management. In this framework, the LLM-empowered non-RT RIC provides strategic guidance and high-level policies based on environmental context, while the RL-empowered near-RT RIC performs low-latency tasks based on this guidance and local near-RT observations. We evaluate the LLM-hRIC framework in an integrated access and backhaul (IAB) network setting. Simulation results demonstrate that the proposed framework achieves superior performance. Finally, we discuss the key future challenges in applying LLMs to O-RAN.
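A minimal sketch of the hierarchical split the framework describes, assuming a stubbed LLM stand-in for the non-RT RIC and a toy epsilon-greedy learner for the near-RT RIC; all names, the reward model, and the congestion context below are illustrative, not the paper's implementation:

```python
import random

def llm_strategic_guidance(context: str) -> dict:
    """Stand-in for the LLM-empowered non-RT RIC: maps environmental
    context to a high-level policy (here, access/backhaul weights)."""
    if "congested" in context:
        return {"access_weight": 0.3, "backhaul_weight": 0.7}
    return {"access_weight": 0.6, "backhaul_weight": 0.4}

class NearRTAgent:
    """RL-empowered near-RT RIC: epsilon-greedy over discrete allocations."""
    def __init__(self, actions, eps=0.1, lr=0.2):
        self.q = {a: 0.0 for a in actions}
        self.eps, self.lr = eps, lr

    def act(self):
        if random.random() < self.eps:
            return random.choice(list(self.q))
        return max(self.q, key=self.q.get)

    def update(self, action, reward):
        self.q[action] += self.lr * (reward - self.q[action])

agent = NearRTAgent(actions=[0.25, 0.5, 0.75])  # fraction of resources to access link
guidance = llm_strategic_guidance("congested backhaul in IAB cell")

for step in range(1000):
    a = agent.act()
    # Toy reward: the guidance shapes the optimum the near-RT loop converges to.
    reward = -abs(a - guidance["access_weight"]) + random.gauss(0, 0.05)
    agent.update(a, reward)

print("learned allocation:", max(agent.q, key=agent.q.get))
```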
--------------------------------------------------------------------------------------------------------
Conformal Segmentation in Industrial Surface Defect Detection with Statistical Guarantees
This research addresses critical reliability challenges in automated steel surface defect detection using a novel statistical approach. By implementing conformal segmentation, the authors establish rigorous statistical guarantees for detection accuracy, overcoming limitations of traditional CNN-based methods that suffer from annotation uncertainties and overfitting. The approach evaluates model performance through calibration data, defines quantifiable error metrics, and derives statistically sound thresholds for identifying defective regions. This methodology could transform quality control in manufacturing industries by providing more reliable automated inspection systems with quantifiable confidence levels, reducing false positives and negatives while establishing clear metrics for model uncertainty assessment.
Authors: Cheng Shen, Yuewei Liu
Link: https://arxiv.org/abs/2504.17721v1
Date: 2025-04-24
Summary:
In industrial settings, surface defects on steel can significantly compromise its service life and elevate potential safety risks. Traditional defect detection methods predominantly rely on manual inspection, which suffers from low efficiency and high costs. Although automated defect detection approaches based on Convolutional Neural Networks (e.g., Mask R-CNN) have advanced rapidly, their reliability remains challenged due to data annotation uncertainties during deep model training and overfitting issues. These limitations may lead to detection deviations when processing new test samples, rendering automated detection processes unreliable. To address this challenge, we first evaluate the detection model's practical performance through calibration data that satisfies the independent and identically distributed (i.i.d.) condition with test data. Specifically, we define a loss function for each calibration sample to quantify detection error rates, such as the complement of the recall rate and the false discovery rate. Subsequently, we derive a statistically rigorous threshold based on a user-defined risk level to identify high-probability defective pixels in test images, thereby constructing prediction sets (e.g., defect regions). This methodology ensures that the expected error rate (mean error rate) on the test set remains strictly bounded by the predefined risk level. Additionally, we observe a negative correlation between the average prediction set size and the risk level on the test set, establishing a statistically rigorous metric for assessing detection model uncertainty. Furthermore, our study demonstrates robust and efficient control over the expected test set error rate across varying calibration-to-test partitioning ratios, validating the method's adaptability and operational effectiveness.
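The calibration step can be illustrated with a conformal-risk-control-style threshold search on synthetic data; the per-image loss (1 - recall), the bound, and all data below are assumptions for the sketch, not the authors' code:

```python
import numpy as np

rng = np.random.default_rng(0)
n, h, w = 200, 32, 32
scores = rng.random((n, h, w))          # model defect probabilities per pixel
masks = rng.random((n, h, w)) < 0.05    # ground-truth defect pixels

def empirical_risk(lam):
    """Mean per-image loss (1 - recall) at score threshold lam."""
    losses = []
    for s, m in zip(scores, masks):
        if m.sum() == 0:
            continue
        pred = s >= lam
        recall = (pred & m).sum() / m.sum()
        losses.append(1.0 - recall)
    return np.mean(losses), len(losses)

alpha, B = 0.1, 1.0                      # user-defined risk level; loss bound
lam_hat = 0.0
for lam in np.linspace(1.0, 0.0, 201):   # scan from strict to loose thresholds
    r, k = empirical_risk(lam)
    if (k / (k + 1)) * r + B / (k + 1) <= alpha:   # risk-control condition
        lam_hat = lam                    # largest threshold meeting the bound
        break

pred_sets = scores >= lam_hat            # prediction sets (defect regions)
print(f"chosen threshold {lam_hat:.2f}, mean set size {pred_sets.mean():.3f}")
```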
--------------------------------------------------------------------------------------------------------
Fried Parameter Estimation from Single Wavefront Sensor Image with Artificial Neural Networks
This study presents a groundbreaking data-driven approach for estimating atmospheric turbulence (Fried parameter) from single wavefront sensor images using neural networks. By adapting computer vision techniques for both Shack-Hartmann and pyramid wavefront sensors, the researchers achieve millimeter-level accuracy across various conditions with remarkable processing speed (0.83ms). The innovation lies in developing a single network that functions accurately in both open and closed-loop adaptive optics configurations. The approach could significantly enhance astronomical observations, adaptive optics systems, and free space optical communications by enabling real-time atmospheric turbulence compensation with affordable hardware, improving image clarity and data transmission reliability.
Authors: Jeffrey Smith, Taisei Fujii, Jesse Cranney, Charles Gretton
Link: https://arxiv.org/abs/2504.17029v1
Date: 2025-04-23
Summary:
Atmospheric turbulence degrades the quality of astronomical observations in ground-based telescopes, leading to distorted and blurry images. Adaptive Optics (AO) systems are designed to counteract these effects, using atmospheric measurements captured by a wavefront sensor to make real-time corrections to the incoming wavefront. The Fried parameter, r0, characterises the strength of atmospheric turbulence and is an essential control parameter for optimising the performance of AO systems and, more recently, sky profiling for Free Space Optical (FSO) communication channels. In this paper, we develop a novel data-driven approach, adapting machine learning methods from computer vision for Fried parameter estimation from a single Shack-Hartmann or pyramid wavefront sensor image. We present a detailed simulation-based evaluation of our approach for both the Shack-Hartmann and pyramid wavefront sensors, using the open-source COMPASS AO simulation tool, over a range of guide star magnitudes and realistic noise, atmospheric, and instrument conditions. Remarkably, we are able to develop a single network-based estimator that is accurate in both open and closed-loop AO configurations. Our method estimates the Fried parameter to within a few millimetres from a single WFS image taken directly from AO telemetry. The approach is suitable for real-time control, exhibiting 0.83 ms r0 inference times on retail NVIDIA RTX 3090 GPU hardware, and thereby demonstrates a compelling and economical solution for real-time instrument control.
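A hedged sketch of the regression setup: a small CNN mapping a single wavefront sensor frame to a scalar r0. The architecture, data, and value ranges are placeholders, not the paper's network:

```python
import torch
import torch.nn as nn

class FriedNet(nn.Module):
    """Toy single-frame r0 regressor over WFS images."""
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        self.head = nn.Linear(64, 1)   # scalar r0 in metres

    def forward(self, x):
        return self.head(self.features(x).flatten(1))

model = FriedNet()
wfs = torch.randn(8, 1, 64, 64)            # batch of synthetic WFS frames
r0_true = torch.rand(8, 1) * 0.2 + 0.05    # 5-25 cm, a typical seeing range
loss = nn.functional.mse_loss(model(wfs), r0_true)
loss.backward()   # one training step; a real setup would loop over COMPASS data
```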
--------------------------------------------------------------------------------------------------------
MMHCL: Multi-Modal Hypergraph Contrastive Learning for Recommendation
This paper presents a novel recommendation framework that addresses data sparsity and cold-start problems in multimodal content platforms. By constructing user-to-user and item-to-item hypergraphs, MMHCL mines shared preferences and multimodal semantic resemblances to generate denser second-order connections that complement first-order user-item interactions. The contrastive learning approach enhances feature distinguishability by optimizing mutual information between different embedding levels. This framework could revolutionize recommendation systems for platforms like social media, e-commerce, and content streaming services by providing more accurate suggestions even with limited interaction data, improving user experience and engagement while helping platforms better understand customer preferences through multimodal data.
Authors: Xu Guo, Tong Zhang, Fuyun Wang, Xudong Wang, Xiaoya Zhang, Xin Liu, Zhen Cui
Link: https://arxiv.org/abs/2504.16576v1
Date: 2025-04-23
Summary:
The burgeoning presence of multimodal content-sharing platforms propels the development of personalized recommender systems. Previous works usually suffer from data sparsity and cold-start problems, and may fail to adequately explore semantic user-product associations from multimodal data. To address these issues, we propose a novel Multi-Modal Hypergraph Contrastive Learning (MMHCL) framework for user recommendation. For a comprehensive information exploration from user-product relations, we construct two hypergraphs, i.e., a user-to-user (u2u) hypergraph and an item-to-item (i2i) hypergraph, to mine shared preferences among users and intricate multimodal semantic resemblance among items, respectively. This process yields denser second-order semantics that are fused with first-order user-item interactions as a complement, alleviating the data sparsity issue. Then, we design a contrastive feature enhancement paradigm by applying synergistic contrastive learning. By maximizing/minimizing the mutual information between second-order (e.g., shared preference patterns for users) and first-order (information on selected items for users) embeddings of the same/different users and items, the feature distinguishability can be effectively enhanced. Compared with using the sparse primary user-item interactions only, our MMHCL obtains denser second-order hypergraphs and excavates more abundant shared attributes to explore the user-product associations, which to a certain extent alleviates the problems of data sparsity and cold-start. Extensive experiments comprehensively demonstrate the effectiveness of our method. Our code is publicly available at: https://github.com/Xu107/MMHCL.
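The contrastive step can be sketched as an InfoNCE-style objective between first-order (interaction) and second-order (hypergraph) embeddings of the same user; shapes, the temperature, and the loss form are illustrative assumptions:

```python
import torch
import torch.nn.functional as F

def contrastive_loss(first_order, second_order, tau=0.2):
    """Same-user pairs are positives; other users in the batch are negatives."""
    z1 = F.normalize(first_order, dim=-1)   # from user-item interactions
    z2 = F.normalize(second_order, dim=-1)  # from the u2u hypergraph
    logits = z1 @ z2.t() / tau              # cosine similarities / temperature
    labels = torch.arange(z1.size(0))       # diagonal entries are positives
    return F.cross_entropy(logits, labels)

users_first = torch.randn(256, 64)
users_second = torch.randn(256, 64)
print(contrastive_loss(users_first, users_second))
```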
--------------------------------------------------------------------------------------------------------
Disentangling and Generating Modalities for Recommendation in Missing Modality Scenarios
This innovative research tackles the challenging problem of missing modalities in recommendation systems through the DGMRec framework. By disentangling modality features into general and specific components from an information-based perspective, then generating missing modality features by integrating aligned features and user preferences, the system maintains performance even when modalities are unavailable. The approach significantly outperforms existing methods in missing modality scenarios and enables cross-modal retrieval capabilities. This technology could transform recommendation systems for e-commerce, streaming services, and content platforms by maintaining accuracy when visual, textual, or audio data is missing, enhancing user experience across diverse device and bandwidth constraints.
Authors: Jiwan Kim, Hongseok Kang, Sein Kim, Kibum Kim, Chanyoung Park
Link: https://arxiv.org/abs/2504.16352v1
Date: 2025-04-23
Summary:
Multi-modal recommender systems (MRSs) have achieved notable success in improving personalization by leveraging diverse modalities such as images, text, and audio. However, two key challenges remain insufficiently addressed: (1) missing-modality scenarios receive little consideration, and (2) the unique characteristics of modality features are overlooked. These challenges result in significant performance degradation in realistic situations where modalities are missing. To address these issues, we propose Disentangling and Generating Modality Recommender (DGMRec), a novel framework tailored for missing-modality scenarios. DGMRec disentangles modality features into general and specific modality features from an information-based perspective, enabling richer representations for recommendation. Building on this, it generates missing modality features by integrating aligned features from other modalities and leveraging user modality preferences. Extensive experiments show that DGMRec consistently outperforms state-of-the-art MRSs in challenging scenarios, including missing modalities and new item settings, as well as diverse missing ratios and varying levels of missing modalities. Moreover, DGMRec's generation-based approach enables cross-modal retrieval, a task existing MRSs cannot perform, highlighting its adaptability and potential for real-world applications. Our code is available at https://github.com/ptkjw1997/DGMRec.
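A rough sketch of the generation idea, assuming the missing modality is synthesized from a modality-aligned general feature plus a user modality-preference vector; the module names and layer choices are hypothetical, not DGMRec's architecture:

```python
import torch
import torch.nn as nn

class MissingModalityGenerator(nn.Module):
    """Toy generator: fuse an aligned feature from an available modality
    with a user preference embedding to stand in for the missing one."""
    def __init__(self, dim=64):
        super().__init__()
        self.align = nn.Linear(dim, dim)     # map other modality -> shared space
        self.mix = nn.Linear(2 * dim, dim)   # fuse aligned feature + preference

    def forward(self, available_feat, user_pref):
        general = self.align(available_feat)  # general (shared) component
        return self.mix(torch.cat([general, user_pref], dim=-1))

gen = MissingModalityGenerator()
text_feat = torch.randn(32, 64)    # available modality (e.g., text)
user_pref = torch.randn(32, 64)    # user's modality preference embedding
image_hat = gen(text_feat, user_pref)   # stand-in for the missing image feature
print(image_hat.shape)
```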
--------------------------------------------------------------------------------------------------------
This research introduces a revolutionary bidirectional approach to swarm robotics task-motion planning using hierarchical reinforcement learning. Unlike traditional unidirectional methods that separate decision-making into discrete layers, this framework enables dynamic interaction between task allocation and path planning layers, significantly improving adaptability in confrontational scenarios. By implementing cross-training techniques and a trajectory prediction model that bridges abstract tasks with actionable goals, the system achieves exceptional performance with over 80% confrontation win rate and sub-0.01 second decision times. This technology could transform applications in autonomous vehicle fleets, warehouse robotics, search and rescue operations, and competitive robotic sports by enabling more cohesive and responsive multi-robot coordination.
Authors: Qizhen Wu, Lei Chen, Kexin Liu, Jinhu Lü
Link: https://arxiv.org/abs/2504.15876v2
Date: 2025-04-23
Summary:
In swarm robotics, strategic confrontation scenarios require efficient decision-making that integrates discrete commands and continuous actions. Traditional task and motion planning methods separate decision-making into two layers, but their unidirectional structure fails to capture the interdependence between these layers, limiting adaptability in dynamic environments. Here, we propose a novel bidirectional approach based on hierarchical reinforcement learning, enabling dynamic interaction between the layers. This method effectively maps commands to task allocation and actions to path planning, while leveraging cross-training techniques to enhance learning across the hierarchical framework. Furthermore, we introduce a trajectory prediction model that bridges abstract task representations with actionable planning goals. In our experiments, the method achieves a confrontation win rate of over 80% and decision times under 0.01 seconds, outperforming existing approaches. Demonstrations through large-scale tests and real-world robot experiments further emphasize the generalization capabilities and practical applicability of our method.
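The bidirectional coupling can be illustrated in miniature: the task layer's allocation consults cost feedback from the motion layer rather than committing blindly. The greedy allocator and Euclidean cost below are toy stand-ins for the learned layers:

```python
# Schematic of task-layer/motion-layer coupling; a toy counterpart to the
# learned hierarchy, not the paper's trained networks.
robots = [(0.0, 0.0), (5.0, 5.0)]
targets = [(1.0, 1.0), (6.0, 4.0)]

def motion_cost(r, t):
    """Low-level feedback: path cost estimate (here, Euclidean distance)."""
    return ((r[0] - t[0]) ** 2 + (r[1] - t[1]) ** 2) ** 0.5

def allocate(robots, targets):
    """High-level step: greedy allocation informed by motion-layer costs."""
    assignment, free = {}, set(range(len(targets)))
    for i, r in enumerate(robots):
        j = min(free, key=lambda j: motion_cost(r, targets[j]))
        assignment[i] = j
        free.discard(j)
    return assignment

print(allocate(robots, targets))  # {0: 0, 1: 1}
```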
--------------------------------------------------------------------------------------------------------
Bridge the Gap: From Weak to Full Supervision for Temporal Action Localization with PseudoFormer
This paper introduces PseudoFormer, an innovative two-branch framework that bridges weakly and fully-supervised temporal action localization in videos. The approach tackles three key challenges through novel components: RickerFusion for high-quality pseudo-label generation, the joint use of snippet-level and proposal-level labels with different priors, and uncertainty-mask and iterative-refinement mechanisms for training with noisy labels. Achieving state-of-the-art results on the THUMOS14 and ActivityNet1.3 benchmarks, this technology could transform video understanding applications including surveillance, sports analysis, human-computer interaction, and content moderation by enabling more accurate action detection with reduced annotation requirements, making advanced video analysis more accessible and cost-effective.
Authors: Ziyi Liu, Yangcen Liu
Link: https://arxiv.org/abs/2504.14860v1
Date: 2025-04-21
Summary:
Weakly-supervised Temporal Action Localization (WTAL) has achieved notable success but still suffers from a lack of temporal annotations, leading to a performance and framework gap compared with fully-supervised methods. While recent approaches employ pseudo labels for training, three key challenges remain unresolved: generating high-quality pseudo labels, making full use of different priors, and optimizing training methods with noisy labels. Motivated by these challenges, we propose PseudoFormer, a novel two-branch framework that bridges the gap between weakly and fully-supervised Temporal Action Localization (TAL). We first introduce RickerFusion, which maps all predicted action proposals to a global shared space to generate pseudo labels with better quality. Subsequently, we leverage both snippet-level and proposal-level labels with different priors from the weak branch to train the regression-based model in the full branch. Finally, an uncertainty mask and an iterative refinement mechanism are applied for training with noisy pseudo labels. PseudoFormer achieves state-of-the-art WTAL results on the two commonly used benchmarks, THUMOS14 and ActivityNet1.3. Besides, extensive ablation studies demonstrate the contribution of each component of our method.
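A speculative reading of the fusion step: each predicted proposal deposits a Ricker (Mexican-hat) kernel on a shared temporal axis, and the fused curve is thresholded into snippet-level pseudo labels. The kernel scaling and threshold are assumptions, not the paper's code:

```python
import numpy as np

def ricker(length, width):
    """Ricker (Mexican-hat) wavelet sampled at `length` points."""
    t = np.linspace(-length / 2, length / 2, length)
    a = 2 / (np.sqrt(3 * width) * np.pi ** 0.25)
    return a * (1 - (t / width) ** 2) * np.exp(-t ** 2 / (2 * width ** 2))

T = 200   # snippets in the video
fused = np.zeros(T)
proposals = [(30, 60, 0.9), (45, 80, 0.7), (150, 170, 0.8)]  # (start, end, conf)

for s, e, conf in proposals:
    center, span = (s + e) // 2, max(e - s, 2)
    kernel = ricker(span, span / 4) * conf       # confidence-weighted kernel
    lo, hi = center - span // 2, center - span // 2 + len(kernel)
    fused[max(lo, 0):min(hi, T)] += kernel[max(0, -lo):len(kernel) - max(0, hi - T)]

pseudo_labels = fused > 0.5 * fused.max()        # snippet-level pseudo labels
print(pseudo_labels.sum(), "snippets labelled as action")
```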
--------------------------------------------------------------------------------------------------------
Roll the dice & look before you leap: Going beyond the creative limits of next-token prediction
This paper examines the creative limitations of current language models through carefully designed algorithmic tasks that mirror real-world open-ended challenges. By analyzing tasks requiring stochastic planning to discover knowledge graph connections or construct new patterns, the researchers demonstrate that next-token learning approaches are inherently myopic and prone to excessive memorization. In contrast, multi-token approaches like teacherless training and diffusion models excel at producing diverse and original outputs. The study also finds that injecting noise at the input layer (hash-conditioning) outperforms temperature sampling for eliciting controlled randomness. These insights could revolutionize creative AI applications in content generation, problem-solving, and research by enabling more innovative, less predictable outputs.
Authors: Vaishnavh Nagarajan, Chen Henry Wu, Charles Ding, Aditi Raghunathan
Link: https://arxiv.org/abs/2504.15266v1
Date: 2025-04-21
Summary:
We design a suite of minimal algorithmic tasks that are a loose abstraction of open-ended real-world tasks. This allows us to cleanly and controllably quantify the creative limits of present-day language models. Much like real-world tasks that require a creative, far-sighted leap of thought, our tasks require an implicit, open-ended stochastic planning step that either (a) discovers new connections in an abstract knowledge graph (like in wordplay, drawing analogies, or research) or (b) constructs new patterns (like in designing math problems or new proteins). In these tasks, we empirically and conceptually argue that next-token learning is myopic and memorizes excessively; comparatively, multi-token approaches, namely teacherless training and diffusion models, excel in producing diverse and original output. Secondly, in our tasks, we find that to elicit randomness from the Transformer without hurting coherence, it is better to inject noise right at the input layer (via a method we dub hash-conditioning) rather than defer to temperature sampling from the output layer. Thus, our work offers a principled, minimal test-bed for analyzing open-ended creative skills, and offers new arguments for going beyond next-token learning and softmax-based sampling. We make part of the code available at https://github.com/chenwu98/algorithmic-creativity
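A toy sketch of the hash-conditioning idea as described: a random per-sample vector enters at the embedding layer, so output diversity comes from the input rather than softmax temperature. The one-layer "model" is a stand-in for a transformer:

```python
import torch
import torch.nn as nn

# Illustrative reading of input-layer noise injection; not the authors' code.
vocab, dim = 100, 32
embed = nn.Embedding(vocab, dim)
lm_head = nn.Linear(dim, vocab)

tokens = torch.randint(0, vocab, (4, 10))   # a batch of prompts
seed = torch.randn(4, 1, dim)               # per-sample random "hash" vector
hidden = embed(tokens) + seed               # noise enters at the input layer
logits = lm_head(hidden.mean(dim=1))        # stand-in for transformer layers

greedy = logits.argmax(-1)                  # decoding stays deterministic;
print(greedy)                               # diversity comes from the seeds
```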
--------------------------------------------------------------------------------------------------------
This comprehensive review examines how embodied intelligence is transforming robotic-assisted endovascular procedures. By integrating data-driven approaches, computer vision, and machine learning with robotic systems, researchers are enhancing the ability of robots to navigate complex vascular networks and adapt to dynamic physiological conditions. These technologies enable real-time vessel segmentation, device tracking, and anatomical landmark detection while reinforcement and imitation learning refine navigation strategies. The review outlines future directions including federated learning for medical data sharing, explainable AI for clinical support, and advanced human-robot collaboration. These advancements could revolutionize minimally invasive vascular treatments by improving precision, reducing radiation exposure, and potentially enabling greater procedural autonomy.
Authors: Tianliang Yao, Bo Lu, Markus Kowarschik, Yixuan Yuan, Hubin Zhao, Sebastien Ourselin, Kaspar Althoefer, Junbo Ge, Peng Qi
Link: https://arxiv.org/abs/2504.15327v2
Date: 2025-04-23
Summary:
Endovascular procedures have revolutionized the treatment of vascular diseases thanks to minimally invasive solutions that significantly reduce patient recovery time and enhance clinical outcomes. However, the precision and dexterity required during these procedures pose considerable challenges for interventionists. Robotic systems have emerged as transformative solutions, addressing issues such as operator fatigue, radiation exposure, and the inherent limitations of human precision. The integration of Embodied Intelligence (EI) into these systems signifies a paradigm shift, enabling robots to navigate complex vascular networks and adapt to dynamic physiological conditions. Data-driven approaches, advanced computer vision, medical image analysis, and machine learning techniques are at the forefront of this evolution. These methods augment procedural intelligence by facilitating real-time vessel segmentation, device tracking, and anatomical landmark detection. Reinforcement learning and imitation learning further refine navigation strategies and replicate experts' techniques. This review systematically examines the integration of EI principles into robotic technologies for endovascular procedures. We discuss recent advancements in intelligent perception and data-driven control, and their practical applications in robot-assisted endovascular procedures. By critically evaluating current limitations and emerging opportunities, this review establishes a framework for future developments, emphasizing the potential for greater autonomy and improved clinical outcomes. Emerging trends and specific areas of research, such as federated learning for medical data sharing, explainable AI for clinical decision support, and advanced human-robot collaboration paradigms, are also explored, offering insights into the future direction of this rapidly evolving field.
--------------------------------------------------------------------------------------------------------
A comprehensive review of classifier probability calibration metrics
This paper provides an extensive review of probability calibration metrics for classifier and object detection models, organizing 82 major metrics into four classifier families and an object detection family. By examining how well a model's confidence levels align with its actual accuracy, these metrics offer crucial assessment tools that complement traditional accuracy measurements. Understanding calibration is particularly important when combining multiple systems, ensuring reliability in safety-critical applications, and building user trust. This comprehensive categorization and mathematical formalization of calibration metrics could significantly impact AI safety, multi-model integration, and decision support systems by providing standardized methods to assess and improve the reliability of probabilistic predictions across numerous domains.
Authors: Richard Oliver Lane
Link: https://arxiv.org/abs/2504.18278v1
Date: 2025-04-25
Summary:
Probabilities or confidence values produced by artificial intelligence (AI) and machine learning (ML) models often do not reflect their true accuracy, with some models being under- or over-confident in their predictions. For example, if a model is 80% sure of an outcome, is it correct 80% of the time? Probability calibration metrics measure the discrepancy between confidence and accuracy, providing an independent assessment of model calibration performance that complements traditional accuracy metrics. Understanding calibration is important when the outputs of multiple systems are combined, for assurance in safety- or business-critical contexts, and for building user trust in models. This paper provides a comprehensive review of probability calibration metrics for classifier and object detection models, organising them according to a number of different categorisations to highlight their relationships. We identify 82 major metrics, which can be grouped into four classifier families (point-based, bin-based, kernel- or curve-based, and cumulative) and an object detection family. For each metric, we provide equations where available, facilitating implementation and comparison by future researchers.
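One bin-based metric from the review's taxonomy, Expected Calibration Error (ECE), can serve as a worked example: the gap between stated confidence and empirical accuracy, averaged over confidence bins. The data below is synthetic, modelling an overconfident classifier:

```python
import numpy as np

def ece(confidences, correct, n_bins=10):
    """Expected Calibration Error over equal-width confidence bins."""
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    total, err = len(confidences), 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        in_bin = (confidences > lo) & (confidences <= hi)
        if in_bin.sum() == 0:
            continue
        acc = correct[in_bin].mean()        # empirical accuracy in the bin
        conf = confidences[in_bin].mean()   # mean stated confidence
        err += (in_bin.sum() / total) * abs(acc - conf)
    return err

rng = np.random.default_rng(1)
conf = rng.uniform(0.5, 1.0, 10_000)
# An overconfident model: true accuracy lags stated confidence by ~10 points.
correct = rng.random(10_000) < np.clip(conf - 0.1, 0, 1)
print(f"ECE = {ece(conf, correct):.3f}")    # ~0.10 for this synthetic model
```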
--------------------------------------------------------------------------------------------------------
Towards responsible AI for education: Hybrid human-AI to confront the Elephant in the room
This critical analysis identifies nine persistent challenges undermining fairness, transparency, and effectiveness in AI educational applications. Issues range from conflating domain-agnostic language models with true educational AI to ignoring fundamental learning processes and contextual factors. The authors advocate for hybrid neural-symbolic AI methods to address these problems, emphasizing the need to incorporate learning sciences, domain knowledge, and stakeholder involvement in AI design. This research could transform educational technology by promoting systems that better understand individual learning processes, integrate domain expertise, use appropriate evaluation metrics, and provide ethical, explainable, and contextually relevant support—ultimately creating more effective, responsible AI tools that enhance rather than undermine educational practices.
Authors: Danial Hooshyar, Gustav Šír, Yeongwook Yang, Eve Kikas, Raija Hämäläinen, Tommi Kärkkäinen, Dragan Gašević, Roger Azevedo
Link: https://arxiv.org/abs/2504.16148v1
Date: 2025-04-22
Summary:
Despite significant advancements in AI-driven educational systems and ongoing calls for responsible AI for education, several critical issues remain unresolved -- acting as the elephant in the room within AI in education, learning analytics, educational data mining, learning sciences, and educational psychology communities. This critical analysis identifies and examines nine persistent challenges that continue to undermine the fairness, transparency, and effectiveness of current AI methods and applications in education. These include: (1) the lack of clarity around what AI for education truly means -- often ignoring the distinct purposes, strengths, and limitations of different AI families -- and the trend of equating it with domain-agnostic, company-driven large language models; (2) the widespread neglect of essential learning processes such as motivation, emotion, and (meta)cognition in AI-driven learner modelling and their contextual nature; (3) limited integration of domain knowledge and lack of stakeholder involvement in AI design and development; (4) continued use of non-sequential machine learning models on temporal educational data; (5) misuse of non-sequential metrics to evaluate sequential models; (6) use of unreliable explainable AI methods to provide explanations for black-box models; (7) ignoring ethical guidelines in addressing data inconsistencies during model training; (8) use of mainstream AI methods for pattern discovery and learning analytics without systematic benchmarking; and (9) overemphasis on global prescriptions while overlooking localised, student-specific recommendations. Supported by theoretical and empirical research, we demonstrate how hybrid AI methods -- specifically neural-symbolic AI -- can address the elephant in the room and serve as the foundation for responsible, trustworthy AI systems in education.
--------------------------------------------------------------------------------------------------------
Vidi: Large Multimodal Models for Video Understanding and Editing
This paper introduces Vidi, a groundbreaking family of Large Multimodal Models specifically designed for video understanding and editing. Focusing initially on temporal retrieval—identifying specific time ranges within videos corresponding to text queries—the system processes hour-long videos with audio support and diverse query formats. The researchers also present the VUE-TR benchmark with five key advancements including longer video durations, audio support, and refined evaluation metrics. Outperforming leading proprietary models like GPT-4o and Gemini, Vidi could revolutionize video content creation, editing workflows, and search capabilities by enabling precise identification of relevant segments in long-form content, dramatically improving efficiency for creators and accessibility for viewers.
Authors: Vidi Team, Celong Liu, Chia-Wen Kuo, Dawei Du, Fan Chen, Guang Chen, Jiamin Yuan, Lingxi Zhang, Lu Guo, Lusha Li, Longyin Wen, Qingyu Chen, Rachel Deng, Sijie Zhu, Stuart Siew, Tong Jin, Wei Lu, Wen Zhong, Xiaohui Shen, Xin Gu, Xing Mei, Xueqiong Qu
Link: https://arxiv.org/abs/2504.15681v2
Date: 2025-04-24
Summary:
Humans naturally share information with those they are connected to, and video has become one of the dominant mediums for communication and expression on the Internet. To support the creation of high-quality large-scale video content, a modern pipeline requires a comprehensive understanding of both the raw input materials (e.g., the unedited footage captured by cameras) and the editing components (e.g., visual effects). In video editing scenarios, models must process multiple modalities (e.g., vision, audio, text) with strong background knowledge and handle flexible input lengths (e.g., hour-long raw videos), which poses significant challenges for traditional models. In this report, we introduce Vidi, a family of Large Multimodal Models (LMMs) for a wide range of video understanding and editing scenarios. The first release focuses on temporal retrieval, i.e., identifying the time ranges within the input videos corresponding to a given text query, which plays a critical role in intelligent editing. The model is capable of processing hour-long videos with strong temporal understanding capability, e.g., retrieving time ranges for given queries. To support a comprehensive evaluation in real-world scenarios, we also present the VUE-TR benchmark, which introduces five key advancements: 1) Video duration: significantly longer than in existing temporal retrieval datasets; 2) Audio support: includes audio-based queries; 3) Query format: diverse query lengths and formats; 4) Annotation quality: ground-truth time ranges are manually annotated; 5) Evaluation metric: a refined IoU metric to support evaluation over multiple time ranges. Remarkably, Vidi significantly outperforms leading proprietary models, e.g., GPT-4o and Gemini, on the temporal retrieval task, indicating its superiority in video editing scenarios.
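Plain interval IoU over multiple time ranges, the quantity the benchmark's refined metric builds on, can be computed like this (the refinement itself is not reproduced here):

```python
def merge(ranges):
    """Merge overlapping (start, end) intervals, in seconds."""
    out = []
    for s, e in sorted(ranges):
        if out and s <= out[-1][1]:
            out[-1] = (out[-1][0], max(out[-1][1], e))
        else:
            out.append((s, e))
    return out

def ranges_iou(pred, gt):
    """IoU between two sets of time ranges: overlap length / union length."""
    pred, gt = merge(pred), merge(gt)
    inter = sum(max(0.0, min(pe, ge) - max(ps, gs))
                for ps, pe in pred for gs, ge in gt)
    union = sum(e - s for s, e in merge(pred + gt))
    return inter / union if union else 0.0

pred = [(10.0, 25.0), (40.0, 55.0)]   # seconds retrieved for a query
gt = [(12.0, 30.0), (41.0, 50.0)]
print(f"IoU = {ranges_iou(pred, gt):.3f}")   # 22/35 ~ 0.629
```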
--------------------------------------------------------------------------------------------------------
This research addresses the significant challenge of deploying learning-based methods on quadrotors in real-world environments by introducing a comprehensive platform that enables seamless transfer of deep reinforcement learning policies. The framework integrates training environments, flight dynamics, DRL algorithms, middleware, and hardware into a workflow that allows policies to be developed from simulation to real-world deployment within minutes. By providing diverse environment types including hovering, obstacle avoidance, trajectory tracking, and planning in unknown spaces, the platform establishes a practical benchmark for physical experiments. This infrastructure could accelerate drone application development for package delivery, surveillance, search and rescue, and agricultural monitoring by bridging the simulation-to-reality gap efficiently.
Authors: Kangyao Huang, Hao Wang, Yu Luo, Jingyu Chen, Jintao Chen, Xiangkui Zhang, Xiangyang Ji, Huaping Liu
Link: https://arxiv.org/abs/2504.15129v1
Date: 2025-04-21
Summary:
Deploying robot learning methods to a quadrotor in unstructured outdoor environments is an exciting task. Quadrotors operating in real-world environments by learning-based methods encounter several challenges: a large amount of simulator-generated data required for training, strict demands for real-time processing onboard, and the sim-to-real gap caused by dynamic and noisy conditions. Current works have made great breakthroughs in applying learning-based methods to end-to-end control of quadrotors, but rarely describe the infrastructure that carries a system from training from scratch to real-world deployment, which makes it difficult to reproduce methods and applications. To bridge this gap, we propose a platform that enables the seamless transfer of end-to-end deep reinforcement learning (DRL) policies. We integrate the training environment, flight dynamics control, DRL algorithms, the MAVROS middleware stack, and hardware into a comprehensive workflow and architecture that enables quadrotors' policies to be trained from scratch to real-world deployment within several minutes. Our platform provides rich types of environments, including hovering, dynamic obstacle avoidance, trajectory tracking, balloon hitting, and planning in unknown environments, as a physical experiment benchmark. Through extensive empirical validation, we demonstrate the efficiency of the proposed sim-to-real platform and its robust outdoor flight performance under real-world perturbations. Details can be found at our website https://emnavi.tech/AirGym/.
--------------------------------------------------------------------------------------------------------
Learning from Less: SINDy Surrogates in RL
This paper introduces an innovative approach using Sparse Identification of Nonlinear Dynamics (SINDy) to develop computationally efficient surrogate environments for reinforcement learning. Through experiments in OpenAI Gym environments, the researchers demonstrate that SINDy-based surrogates accurately capture underlying dynamics while reducing computational costs by 20-35%. Using minimal interactions (75 for Mountain Car, 1000 for Lunar Lander), the approach achieves remarkably accurate state modeling with high correlations and low error rates. This technology could transform reinforcement learning applications in robotics, autonomous vehicles, and complex systems control by enabling faster training with fewer resources while maintaining performance, making advanced AI control systems more accessible and energy-efficient.
Authors: Aniket Dixit, Muhammad Ibrahim Khan, Faizan Ahmed, James Brusey
Link: https://arxiv.org/abs/2504.18113v1
Date: 2025-04-25
Summary:
This paper introduces an approach for developing surrogate environments in reinforcement learning (RL) using the Sparse Identification of Nonlinear Dynamics (SINDy) algorithm. We demonstrate the effectiveness of our approach through extensive experiments in OpenAI Gym environments, particularly Mountain Car and Lunar Lander. Our results show that SINDy-based surrogate models can accurately capture the underlying dynamics of these environments while reducing computational costs by 20-35%. With only 75 interactions for Mountain Car and 1000 for Lunar Lander, we achieve state-wise correlations exceeding 0.997, with mean squared errors as low as 3.11e-06 for Mountain Car velocity and 1.42e-06 for Lunar Lander position. RL agents trained in these surrogate environments require fewer total steps (65,075 vs. 100,000 for Mountain Car and 801,000 vs. 1,000,000 for Lunar Lander) while achieving comparable performance to those trained in the original environments, exhibiting similar convergence patterns and final performance metrics. This work contributes to the field of model-based RL by providing an efficient method for generating accurate, interpretable surrogate environments.
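A compact SINDy sketch using sequentially thresholded least squares on a damped oscillator; the system and candidate library are illustrative, not the paper's Gym environments:

```python
import numpy as np

# Simulate a damped oscillator x'' = -x - 0.1 x' and recover its dynamics.
rng = np.random.default_rng(0)
dt, T = 0.01, 2000
X = np.zeros((T, 2))
X[0] = [2.0, 0.0]
for k in range(T - 1):
    x, v = X[k]
    X[k + 1] = [x + dt * v, v + dt * (-x - 0.1 * v)]

dX = np.gradient(X, dt, axis=0)             # numerical time derivatives
library = np.column_stack([np.ones(T), X[:, 0], X[:, 1],
                           X[:, 0] ** 2, X[:, 0] * X[:, 1], X[:, 1] ** 2])

def stlsq(Theta, dXdt, thresh=0.05, iters=10):
    """Sequentially thresholded least squares: sparse Xi with dX = Theta @ Xi."""
    Xi = np.linalg.lstsq(Theta, dXdt, rcond=None)[0]
    for _ in range(iters):
        Xi[np.abs(Xi) < thresh] = 0.0       # prune small coefficients
        for j in range(dXdt.shape[1]):      # refit on the surviving terms
            big = np.abs(Xi[:, j]) >= thresh
            if big.any():
                Xi[big, j] = np.linalg.lstsq(Theta[:, big], dXdt[:, j],
                                             rcond=None)[0]
    return Xi

Xi = stlsq(library, dX)
print(np.round(Xi.T, 3))   # rows ~ [0, 0, 1, ...] and [0, -1, -0.1, ...]
```

Once identified, the sparse model can be integrated forward cheaply in place of the original simulator, which is the source of the surrogate's speedup.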
--------------------------------------------------------------------------------------------------------
How Private is Your Attention? Bridging Privacy with In-Context Learning
This groundbreaking research explores the intersection of differential privacy and in-context learning in transformer models. The authors propose a differentially private pretraining algorithm for linear attention heads and provide the first theoretical analysis of privacy-accuracy trade-offs for in-context learning in linear regression. Their findings characterize the fundamental tension between optimization and privacy-induced noise, while demonstrating robustness to adversarial perturbations of training prompts. Supported by extensive simulations, this work could significantly impact privacy-preserving AI applications in healthcare, finance, and personal assistants by enabling models to learn from examples at inference time while maintaining formal privacy guarantees, addressing growing concerns about data protection in machine learning.
Authors: Soham Bonnerjee, Zhen Wei Yeon, Anna Asch, Sagnik Nandy, Promit Ghosal
Link: https://arxiv.org/abs/2504.16000v1
Date: 2025-04-22
Summary:
In-context learning (ICL), the ability of transformer-based models to perform new tasks from examples provided at inference time, has emerged as a hallmark of modern language models. While recent works have investigated the mechanisms underlying ICL, its feasibility under formal privacy constraints remains largely unexplored. In this paper, we propose a differentially private pretraining algorithm for linear attention heads and present the first theoretical analysis of the privacy-accuracy trade-off for ICL in linear regression. Our results characterize the fundamental tension between optimization and privacy-induced noise, formally capturing behaviors observed in private training via iterative methods. Additionally, we show that our method is robust to adversarial perturbations of training prompts, unlike standard ridge regression. All theoretical findings are supported by extensive simulations across diverse settings.
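A hedged sketch of private pretraining for a linear attention head, using DP-SGD-style per-task gradient clipping plus Gaussian noise; the paper's exact algorithm and noise calibration may differ:

```python
import torch

torch.manual_seed(0)
d, ctx = 5, 20
W = torch.zeros(d, d, requires_grad=True)   # linear attention parameter

clip_C, sigma, lr = 1.0, 0.8, 0.05
for _ in range(300):
    grads = []
    for _ in range(8):                       # per-task (per-sample) gradients
        beta = torch.randn(d)                # task-specific regression vector
        Xc = torch.randn(ctx, d)             # in-context examples
        yc = Xc @ beta
        xq = torch.randn(d)                  # query point
        # One linear-attention "read": predict y_q from the context via W.
        pred = (yc @ Xc) @ W @ xq / ctx
        loss = (pred - beta @ xq) ** 2
        g, = torch.autograd.grad(loss, W)
        g = g / max(1.0, g.norm() / clip_C)  # clip each gradient to norm C
        grads.append(g)
    # Gaussian mechanism: noise scaled to the clipping norm.
    noisy = torch.stack(grads).sum(0) + sigma * clip_C * torch.randn(d, d)
    with torch.no_grad():
        W -= lr * noisy / len(grads)

print("trained W diagonal:", torch.diagonal(W.detach()).round(decimals=2))
```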
--------------------------------------------------------------------------------------------------------
Aerial Active STAR-RIS-assisted Satellite-Terrestrial Covert Communications
This research introduces an innovative satellite-terrestrial covert communication system enhanced by aerial active simultaneous transmitting and reflecting reconfigurable intelligent surfaces (AASTAR-RIS). Addressing challenges of long-distance path loss and security risks in urban environments, the authors derive minimal detection error probability under worst-case conditions and formulate an optimization problem to maximize channel capacity while ensuring transmission covertness. Their generative deterministic policy gradient algorithm uses diffusion models and action gradient mechanisms to efficiently solve the high-dimensional state-action space problem. This technology could revolutionize secure communications for military operations, critical infrastructure, governmental communications, and remote sensing by enabling reliable covert data transmission across integrated space-air-ground networks.
Authors: Chuang Zhang, Geng Sun, Jiahui Li, Jiacheng Wang, Ruichen Zhang, Dusit Niyato, Shiwen Mao, Tony Q. S. Quek
Link: https://arxiv.org/abs/2504.16146v1
Date: 2025-04-22
Summary:
The integration of satellite and terrestrial networks is crucial for enhancing the performance of next-generation communication systems. However, these networks are hindered by long-distance path loss and security risks in dense urban environments. In this work, we propose a satellite-terrestrial covert communication system assisted by the aerial active simultaneous transmitting and reflecting reconfigurable intelligent surface (AASTAR-RIS) to improve the channel capacity while ensuring the transmission covertness. Specifically, we first derive the minimal detection error probability (DEP) under the worst-case condition that the warden has perfect channel state information (CSI). Then, we formulate an AASTAR-RIS-assisted satellite-terrestrial covert communication optimization problem (ASCCOP) to maximize the sum of the fair channel capacity for all ground users while meeting the strict covert constraint, by jointly optimizing the trajectory and active beamforming of the AASTAR-RIS. Due to the challenges posed by the complex and high-dimensional state-action spaces as well as the need for efficient exploration in dynamic environments, we propose a generative deterministic policy gradient (GDPG) algorithm, a generative deep reinforcement learning (DRL) method, to solve the ASCCOP. Concretely, the generative diffusion model (GDM) is utilized as the policy representation of the algorithm to enhance the exploration process by generating diverse and high-quality samples through a series of denoising steps. Moreover, we incorporate an action gradient mechanism to accomplish the policy improvement of the algorithm, which refines state-action pairs through gradient ascent. Simulation results demonstrate that the proposed approach significantly outperforms important benchmarks.
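A schematic of the GDPG loop under heavy assumptions: a small denoiser turns noise into an action over a few reverse steps, then an action-gradient step refines it against a critic. The networks, step count, and dimensions are illustrative stand-ins, not the paper's design:

```python
import torch
import torch.nn as nn

state_dim, act_dim, steps = 8, 4, 10
denoiser = nn.Sequential(nn.Linear(state_dim + act_dim + 1, 64), nn.ReLU(),
                         nn.Linear(64, act_dim))
critic = nn.Sequential(nn.Linear(state_dim + act_dim, 64), nn.ReLU(),
                       nn.Linear(64, 1))

def sample_action(state):
    """Reverse diffusion: start from noise, iteratively denoise into an action."""
    a = torch.randn(act_dim)
    for k in reversed(range(steps)):
        t = torch.tensor([k / steps])
        eps_hat = denoiser(torch.cat([state, a, t]))
        a = a - eps_hat / steps              # simplified denoising update
    return a

state = torch.randn(state_dim)
action = sample_action(state).detach().requires_grad_(True)
q = critic(torch.cat([state, action]))
q.backward()                                 # action-gradient mechanism:
refined = action + 0.1 * action.grad         # ascend the critic's Q estimate
print(refined)
```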
--------------------------------------------------------------------------------------------------------
Guidelines for External Disturbance Factors in the Use of OCR in Real-World Environments
This paper addresses a critical gap in optical character recognition (OCR) implementation by compiling comprehensive guidelines for real-world external disturbance factors that degrade performance. As OCR applications expand, interference from various usage environments prevents systems from achieving their inherent accuracy, complicating quality control of recognition devices. The authors systematically catalog these factors along with resulting image degradation phenomena, creating a practical reference table with usage instructions. These guidelines could significantly improve OCR deployment in diverse settings including document processing, manufacturing quality control, license plate recognition, and mobile applications by enabling better environmental adaptation and mitigation strategies for adverse conditions.
Authors: Kenji Iwata, Eiki Ishidera, Toshifumi Yamaai, Yutaka Satoh, Hiroshi Tanaka, Katsuhiko Takahashi, Akio Furuhata, Yoshihisa Tanabe, Hiroshi Matsumura
Link: https://arxiv.org/abs/2504.14913v1
Date: 2025-04-21
Summary:
The performance of OCR has improved with the evolution of AI technology. As OCR continues to broaden its range of applications, the increased likelihood of interference introduced by various usage environments can prevent it from achieving its inherent performance. This results in reduced recognition accuracy under certain conditions and makes the quality control of recognition devices more challenging. Therefore, to ensure that users can properly utilize OCR, we compiled the real-world external disturbance factors that cause performance degradation, along with the resulting image degradation phenomena, into an external disturbance factor table, and organized the table, together with instructions for its use, into a set of guidelines.
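One way such a guideline table might be consumed programmatically; the factor entries below are invented examples for illustration, not the paper's actual table contents:

```python
# Hypothetical encoding of a disturbance-factor table: factor -> expected
# image degradation and a mitigation hint.
GUIDELINES = {
    "backlighting":   {"degradation": "low contrast, blown highlights",
                       "mitigation": "adjust exposure; diffuse the light source"},
    "motion blur":    {"degradation": "smeared character strokes",
                       "mitigation": "shorter exposure; stabilise the camera"},
    "specular glare": {"degradation": "saturated patches over characters",
                       "mitigation": "change viewing angle; polarising filter"},
}

def advise(factor: str) -> str:
    entry = GUIDELINES.get(factor)
    if entry is None:
        return f"no guideline entry for '{factor}'"
    return f"{factor}: expect {entry['degradation']}; {entry['mitigation']}"

print(advise("backlighting"))
```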
--------------------------------------------------------------------------------------------------------
BadVideo: Stealthy Backdoor Attack against Text-to-Video Generation
This research reveals a concerning vulnerability in text-to-video generation models by introducing BadVideo, the first backdoor attack framework specifically targeting these systems. Exploiting the inherent redundancy in generated videos—elements not explicitly specified in text prompts—the attack embeds harmful content through two sophisticated strategies: Spatio-Temporal Composition and Dynamic Element Transformation. These approaches seamlessly integrate malicious elements with user instructions while evading traditional content moderation systems that analyze individual frames. The study demonstrates high attack success rates while maintaining original semantics and performance on clean inputs. This work highlights significant security implications for entertainment, education, and media production industries using T2V technologies, calling attention to potential risks of manipulation and content integrity breaches.
Authors: Ruotong Wang, Mingli Zhu, Jiarong Ou, Rui Chen, Xin Tao, Pengfei Wan, Baoyuan Wu
Link: https://arxiv.org/abs/2504.16907v1
Date: 2025-04-23
Summary:
Text-to-video (T2V) generative models have rapidly advanced and found widespread applications across fields like entertainment, education, and marketing. However, the adversarial vulnerabilities of these models remain rarely explored. We observe that in T2V generation tasks, the generated videos often contain substantial redundant information not explicitly specified in the text prompts, such as environmental elements, secondary objects, and additional details, providing opportunities for malicious attackers to embed hidden harmful content. Exploiting this inherent redundancy, we introduce BadVideo, the first backdoor attack framework tailored for T2V generation. Our attack focuses on designing target adversarial outputs through two key strategies: (1) Spatio-Temporal Composition, which combines different spatiotemporal features to encode malicious information; (2) Dynamic Element Transformation, which introduces transformations in redundant elements over time to convey malicious information. Based on these strategies, the attacker's malicious target seamlessly integrates with the user's textual instructions, providing high stealthiness. Moreover, by exploiting the temporal dimension of videos, our attack successfully evades traditional content moderation systems that primarily analyze spatial information within individual frames. Extensive experiments demonstrate that BadVideo achieves high attack success rates while preserving original semantics and maintaining excellent performance on clean inputs. Overall, our work reveals the adversarial vulnerability of T2V models, calling attention to potential risks and misuse. Our project page is at https://wrt2000.github.io/BadVideo2025/.
--------------------------------------------------------------------------------------------------------
Describe Anything: Detailed Localized Image and Video Captioning
This paper introduces the Describe Anything Model (DAM), an innovative approach to detailed localized captioning of specific regions in images and videos. The model leverages two key innovations—focal prompts for high-resolution encoding of targeted regions and a localized vision backbone integrating precise localization with broader context. To overcome data scarcity, the researchers developed a semi-supervised learning-based data pipeline and a benchmark for evaluation without reference captions. Achieving state-of-the-art results across seven benchmarks, DAM could transform applications in accessibility for visually impaired users, content indexing, automated surveillance analysis, medical image interpretation, and educational tools by enabling precise, detailed descriptions of specific visual elements within complex scenes.
Authors: Long Lian, Yifan Ding, Yunhao Ge, Sifei Liu, Hanzi Mao, Boyi Li, Marco Pavone, Ming-Yu Liu, Trevor Darrell, Adam Yala, Yin Cui
Link: https://arxiv.org/abs/2504.16072v1
Date: 2025-04-22
Summary:
Generating detailed and accurate descriptions for specific regions in images and videos remains a fundamental challenge for vision-language models. We introduce the Describe Anything Model (DAM), a model designed for detailed localized captioning (DLC). DAM preserves both local details and global context through two key innovations: a focal prompt, which ensures high-resolution encoding of targeted regions, and a localized vision backbone, which integrates precise localization with its broader context. To tackle the scarcity of high-quality DLC data, we propose a Semi-supervised learning (SSL)-based Data Pipeline (DLC-SDP). DLC-SDP starts with existing segmentation datasets and expands to unlabeled web images using SSL. We introduce DLC-Bench, a benchmark designed to evaluate DLC without relying on reference captions. DAM sets new state-of-the-art on 7 benchmarks spanning keyword-level, phrase-level, and detailed multi-sentence localized image and video captioning.
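A sketch of the focal-prompt idea: the target region is re-encoded at high resolution alongside the full image and the two streams are fused. The tiny shared encoder and fusion layer are stand-ins, not DAM's backbone:

```python
import torch
import torch.nn as nn

class FocalEncoder(nn.Module):
    """Toy two-stream encoder: global context + high-res region crop."""
    def __init__(self, dim=128):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, dim, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten())
        self.fuse = nn.Linear(2 * dim, dim)

    def forward(self, image, region_crop):
        g = self.backbone(image)        # global context stream
        f = self.backbone(region_crop)  # focal (high-resolution region) stream
        return self.fuse(torch.cat([g, f], dim=-1))

enc = FocalEncoder()
image = torch.randn(1, 3, 224, 224)
crop = torch.randn(1, 3, 224, 224)      # region resized to full input resolution
tokens = enc(image, crop)               # would condition a caption decoder
print(tokens.shape)
```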
--------------------------------------------------------------------------------------------------------
T2VShield: Model-Agnostic Jailbreak Defense for Text-to-Video Models
This research addresses critical security vulnerabilities in text-to-video generation models by introducing T2VShield, a comprehensive defense framework against jailbreak attacks. By systematically analyzing input, model, and output stages, the authors identify limitations in existing defenses and implement a multi-faceted approach: a reasoning and retrieval-based prompt rewriting mechanism to sanitize malicious inputs, and a multi-scope detection module capturing inconsistencies across time and modalities. Working with both open and closed-source systems without requiring internal model access, T2VShield reduces jailbreak success rates by up to 35 percent. This technology could significantly enhance safety in creative tools, social media platforms, and educational applications by preventing the generation of harmful content while maintaining intended functionality.
Authors: Siyuan Liang, Jiayang Liu, Jiecheng Zhai, Tianmeng Fang, Rongcheng Tu, Aishan Liu, Xiaochun Cao, Dacheng Tao
Link: https://arxiv.org/abs/2504.15512v1
Date: 2025-04-22
Summary:
The rapid development of generative artificial intelligence has made text-to-video models essential for building future multimodal world simulators. However, these models remain vulnerable to jailbreak attacks, where specially crafted prompts bypass safety mechanisms and lead to the generation of harmful or unsafe content. Such vulnerabilities undermine the reliability and security of simulation-based applications. In this paper, we propose T2VShield, a comprehensive and model-agnostic defense framework designed to protect text-to-video models from jailbreak threats. Our method systematically analyzes the input, model, and output stages to identify the limitations of existing defenses, including semantic ambiguities in prompts, difficulties in detecting malicious content in dynamic video outputs, and inflexible model-centric mitigation strategies. T2VShield introduces a prompt rewriting mechanism based on reasoning and multimodal retrieval to sanitize malicious inputs, along with a multi-scope detection module that captures local and global inconsistencies across time and modalities. The framework does not require access to internal model parameters and works with both open and closed-source systems. Extensive experiments on five platforms show that T2VShield can reduce jailbreak success rates by up to 35 percent compared to strong baselines. We further develop a human-centered audiovisual evaluation protocol to assess perceptual safety, emphasizing the importance of visual-level defense in enhancing the trustworthiness of next-generation multimodal simulators.
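A skeleton of a model-agnostic two-stage defense in the spirit of T2VShield; the rewriter and detectors below are stubs, where real ones would use LLM-based rewriting with retrieval and multimodal inconsistency checks:

```python
def rewrite_prompt(prompt: str) -> str:
    """Input stage: sanitise suspicious phrasing before generation (stub)."""
    banned = ["bypass", "ignore previous"]
    clean = prompt
    for b in banned:
        clean = clean.replace(b, "")
    return clean.strip()

def frame_is_unsafe(frame) -> bool:    # stub per-frame classifier
    return False

def temporal_drift(frames) -> float:   # stub cross-frame consistency score
    return 0.0

def multi_scope_detect(frames) -> bool:
    """Output stage: flag local (frame) and global (temporal) inconsistencies."""
    local_flags = [frame_is_unsafe(f) for f in frames]
    global_flag = temporal_drift(frames) > 0.5
    return any(local_flags) or global_flag

def generate_video(prompt):            # stand-in for any open/closed T2V model
    return ["frame0", "frame1"]

prompt = rewrite_prompt("a sunny park scene, ignore previous safety rules")
video = generate_video(prompt)
print("blocked" if multi_scope_detect(video) else "released")
```

Because both stages wrap the generator's inputs and outputs, the pipeline needs no access to model internals, matching the framework's model-agnostic claim.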
--------------------------------------------------------------------------------------------------------