Week Ending 6.29.2025
RESEARCH WATCH: 6.29.2025
Automatic Speech Recognition (ASR) technology has transformative potential for low-resource languages like Bangla, spoken by over 300 million people globally. This research addresses the critical challenge of developing effective ASR systems for languages with limited training data by comparing two state-of-the-art models: OpenAI's Whisper and Facebook's Wav2Vec-BERT. The study's systematic evaluation using Word Error Rate, Character Error Rate, and computational efficiency metrics provides valuable insights for multilingual ASR development. Applications include voice assistants for Bangla speakers, automated transcription services, accessibility tools for hearing-impaired individuals, and educational technology platforms, potentially bridging the digital divide for Bengali-speaking communities worldwide.
Authors: Md Sazzadul Islam Ridoy, Sumi Akter, Md. Aminur Rahman
Link: https://arxiv.org/abs/2507.01931v1
Date: 2025-07-02
Summary:
In recent years, neural models trained on large multilingual text and speech datasets have shown great potential for supporting low-resource languages. This study investigates the performance of two state-of-the-art Automatic Speech Recognition (ASR) models, OpenAI's Whisper (Small & Large-V2) and Facebook's Wav2Vec-BERT, on Bangla, a low-resource language. We conducted experiments using two publicly available datasets, Mozilla Common Voice-17 and OpenSLR, to evaluate model performance. Through systematic fine-tuning and hyperparameter optimization, including learning rate, epochs, and model checkpoint selection, we compared the models on Word Error Rate (WER), Character Error Rate (CER), training time, and computational efficiency. The Wav2Vec-BERT model outperformed Whisper across all key evaluation metrics, delivering superior performance while requiring fewer computational resources, and the results offer valuable insights for developing robust speech recognition systems in low-resource linguistic settings.
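The two headline metrics, WER and CER, are both normalized edit distances: the Levenshtein distance between reference and hypothesis, over word tokens for WER and characters for CER, divided by the reference length. A minimal sketch for reference (not the authors' evaluation code):

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (word tokens or characters)."""
    m, n = len(ref), len(hyp)
    dp = list(range(n + 1))                  # one rolling row of the DP table
    for i in range(1, m + 1):
        prev, dp[0] = dp[0], i
        for j in range(1, n + 1):
            cur = dp[j]
            dp[j] = min(dp[j] + 1,                           # deletion
                        dp[j - 1] + 1,                       # insertion
                        prev + (ref[i - 1] != hyp[j - 1]))   # substitution
            prev = cur
    return dp[n]

def wer(reference, hypothesis):
    """Word Error Rate: word-level edit distance / number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    return edit_distance(ref, hyp) / len(ref)

def cer(reference, hypothesis):
    """Character Error Rate: character-level edit distance / reference length."""
    return edit_distance(reference, hypothesis) / len(reference)
```

For example, one inserted word against a three-word reference gives WER = 1/3; note that WER can exceed 1.0 when the hypothesis is much longer than the reference.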
--------------------------------------------------------------------------------------------------------
Are Vision Transformer Representations Semantically Meaningful? A Case Study in Medical Imaging
Vision Transformers (ViTs) have revolutionized medical imaging with superior accuracy in disease classification, segmentation, and detection tasks. However, their deployment in safety-critical healthcare environments raises concerns about model interpretability and reliability. This research exposes a fundamental vulnerability in ViT representations, demonstrating that seemingly robust models can produce inconsistent outputs for imperceptibly different inputs. The findings have profound implications for medical AI applications, including radiology diagnosis, pathology analysis, and surgical planning. Understanding these limitations is crucial for developing trustworthy medical AI systems. The research highlights the need for more robust evaluation frameworks and could influence regulatory approval processes for AI-powered medical devices.
Authors: Montasir Shams, Chashi Mahiul Islam, Shaeke Salman, Phat Tran, Xiuwen Liu
Link: https://arxiv.org/abs/2507.01788v1
Date: 2025-07-02
Summary:
Vision transformers (ViTs) have rapidly gained prominence in medical imaging tasks such as disease classification, segmentation, and detection due to their superior accuracy compared to conventional deep learning models. However, due to their size and complex interactions via the self-attention mechanism, they are not well understood. In particular, it is unclear whether the representations produced by such models are semantically meaningful. In this paper, using a projected gradient-based algorithm, we show that their representations are not semantically meaningful and are inherently vulnerable to small changes. Images with imperceptible differences can have very different representations; on the other hand, images that should belong to different semantic classes can have nearly identical representations. Such vulnerability can lead to unreliable classification results; for example, unnoticeable changes can reduce classification accuracy by over 60%. To the best of our knowledge, this is the first work to systematically demonstrate this fundamental lack of semantic meaningfulness in ViT representations for medical image classification, revealing a critical challenge for their deployment in safety-critical systems.
--------------------------------------------------------------------------------------------------------
AI4Research: A Survey of Artificial Intelligence for Scientific Research
The integration of artificial intelligence into scientific research represents a paradigm shift toward automated discovery and experimentation. This comprehensive survey addresses the growing intersection of AI technologies, particularly large language models like OpenAI-o1 and DeepSeek-R1, with scientific methodology across disciplines. The research categorizes five mainstream AI4Research tasks and identifies critical gaps in current approaches. Applications span drug discovery, materials science, climate modeling, and experimental design automation. The survey's systematic taxonomy and resource compilation could accelerate AI adoption in laboratories worldwide, potentially revolutionizing how scientific hypotheses are generated, tested, and validated while addressing concerns about experimental rigor and reproducibility.
Authors: Qiguang Chen, Mingda Yang, Libo Qin, Jinhao Liu, Zheng Yan, Jiannan Guan, Dengyun Peng, Yiyan Ji, Hanjing Li, Mengkang Hu, Yimeng Zhang, Yihao Liang, Yuhang Zhou, Jiaqi Wang, Zhi Chen, Wanxiang Che
Link: https://arxiv.org/abs/2507.01903v1
Date: 2025-07-02
Summary:
Recent advancements in artificial intelligence (AI), particularly in large language models (LLMs) such as OpenAI-o1 and DeepSeek-R1, have demonstrated remarkable capabilities in complex domains such as logical reasoning and experimental coding. Motivated by these advancements, numerous studies have explored the application of AI in the innovation process, particularly in the context of scientific research. These AI technologies primarily aim to develop systems that can autonomously conduct research processes across a wide range of scientific disciplines. Despite these significant strides, a comprehensive survey on AI for Research (AI4Research) remains absent, which hampers our understanding and impedes further development in this field. To address this gap, we present a comprehensive survey and offer a unified perspective on AI4Research. Specifically, the main contributions of our work are as follows: (1) Systematic taxonomy: We first introduce a systematic taxonomy to classify five mainstream tasks in AI4Research. (2) New frontiers: Then, we identify key research gaps and highlight promising future directions, focusing on the rigor and scalability of automated experiments, as well as the societal impact. (3) Abundant applications and resources: Finally, we compile a wealth of resources, including relevant multidisciplinary applications, data corpora, and tools. We hope our work will provide the research community with quick access to these resources and stimulate innovative breakthroughs in AI4Research.
--------------------------------------------------------------------------------------------------------
LoRA Fine-Tuning Without GPUs: A CPU-Efficient Meta-Generation Framework for LLMs
The democratization of large language model customization faces significant barriers due to GPU requirements and computational costs. This research introduces a groundbreaking approach enabling LoRA fine-tuning on standard laptop CPUs, making advanced AI accessible to researchers, educators, and developers with limited resources. The meta-operator framework learns to combine pre-trained adapters rather than performing gradient updates, offering a practical alternative to traditional fine-tuning methods. Applications include personalized chatbots, domain-specific language models for small businesses, educational AI tools, and rapid prototyping for AI startups. This work could significantly lower barriers to AI innovation, enabling broader participation in the development of specialized language models.
Authors: Reza Arabpour, Haitz Sáez de Ocáriz Borde, Anastasis Kratsios
Link: https://arxiv.org/abs/2507.01806v1
Date: 2025-07-02
Summary:
Low-Rank Adapters (LoRAs) have transformed the fine-tuning of Large Language Models (LLMs) by enabling parameter-efficient updates. However, their widespread adoption remains limited by the reliance on GPU-based training. In this work, we propose a theoretically grounded approach to LoRA fine-tuning designed specifically for users with limited computational resources, particularly those restricted to standard laptop CPUs. Our method learns a meta-operator that maps any input dataset, represented as a probability distribution, to a set of LoRA weights by leveraging a large bank of pre-trained adapters for the Mistral-7B-Instruct-v0.2 model. Instead of performing new gradient-based updates, our pipeline constructs adapters via lightweight combinations of existing LoRAs directly on CPU. While the resulting adapters do not match the performance of GPU-trained counterparts, they consistently outperform the base Mistral model on downstream tasks, offering a practical and accessible alternative to traditional GPU-based fine-tuning.
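The gradient-free recipe, weighting a bank of stored LoRA pairs by similarity to the new dataset and summing their low-rank updates, can be sketched as follows. The dimensions, the "dataset signature", and the softmax weighting are hypothetical simplifications; the paper's actual meta-operator is learned:

```python
import numpy as np

rng = np.random.default_rng(1)
d, k, r, n_bank = 32, 32, 4, 5   # toy model dims, LoRA rank, and bank size

# Bank of pre-trained LoRA pairs, each with a "dataset signature"
# (a hypothetical stand-in for the paper's dataset-as-distribution input).
bank = [{"A": rng.normal(size=(r, k)), "B": rng.normal(size=(d, r)),
         "sig": rng.normal(size=16)} for _ in range(n_bank)]

def combine_adapters(dataset_sig, bank, temp=1.0):
    """Weight each stored adapter by softmax similarity to the new dataset,
    then sum the full-rank deltas B_i @ A_i -- no gradient steps, CPU only."""
    sims = np.array([a["sig"] @ dataset_sig for a in bank]) / temp
    w = np.exp(sims - sims.max())
    w /= w.sum()
    delta = sum(wi * (a["B"] @ a["A"]) for wi, a in zip(w, bank))
    return delta, w

delta, w = combine_adapters(rng.normal(size=16), bank)
```

The key property is that building `delta` costs only a few small matrix products per layer, which is why the whole pipeline fits on a laptop CPU.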
--------------------------------------------------------------------------------------------------------
Towards Foundation Auto-Encoders for Time-Series Anomaly Detection
Time-series anomaly detection is critical across numerous industries, from cybersecurity to manufacturing and healthcare monitoring. This research introduces Foundation Auto-Encoders (FAE), inspired by the success of large pretrained models in natural language processing. By pretraining on massive time-series datasets, FAE learns complex temporal patterns that generalize across diverse domains and applications. The model's potential for zero-shot anomaly detection could revolutionize monitoring systems in telecommunications, financial markets, IoT sensor networks, and industrial equipment. Applications include fraud detection, network intrusion identification, predictive maintenance, and health monitoring systems. The foundation model approach promises to reduce the need for domain-specific model development while improving detection accuracy.
Authors: Gastón García González, Pedro Casas, Emilio Martínez, Alicia Fernández
Link: https://arxiv.org/abs/2507.01875v1
Date: 2025-07-02
Summary:
We investigate a novel approach to time-series modeling, inspired by the successes of large pretrained foundation models. We introduce FAE (Foundation Auto-Encoders), a foundation generative-AI model for anomaly detection in time-series data, based on Variational Auto-Encoders (VAEs). By foundation, we mean a model pretrained on massive amounts of time-series data which can learn complex temporal patterns useful for accurate modeling, forecasting, and detection of anomalies on previously unseen datasets. FAE leverages VAEs and Dilated Convolutional Neural Networks (DCNNs) to build a generic model for univariate time-series modeling, which could eventually perform well out of the box in zero-shot anomaly detection applications. We introduce the main concepts of FAE and present preliminary results on different multi-dimensional time-series datasets from various domains, including a real dataset from an operational mobile ISP and the well-known KDD 2021 Anomaly Detection dataset.
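The zero-shot detection recipe amounts to scoring each window by how poorly a pretrained model reconstructs it and flagging outlying scores. A minimal sketch, where a trivial per-window mean stands in for FAE's VAE/DCNN reconstruction (the threshold rule is an assumption, not the paper's):

```python
import numpy as np

def anomaly_scores(series, window=16):
    """Score each sliding window by reconstruction error. The window mean is a
    placeholder 'reconstruction'; FAE would use its pretrained VAE decoder."""
    scores = []
    for i in range(len(series) - window + 1):
        w = series[i:i + window]
        recon = np.full(window, w.mean())        # placeholder reconstruction
        scores.append(float(np.mean((w - recon) ** 2)))
    return np.array(scores)

def flag_anomalies(series, window=16, q=0.99):
    """Flag windows whose score exceeds the q-quantile of all scores."""
    s = anomaly_scores(series, window)
    return np.where(s > np.quantile(s, q))[0]

t = np.sin(np.linspace(0, 20 * np.pi, 1000))
t[500:505] += 5.0                                # injected anomalous spike
idx = flag_anomalies(t)                          # windows overlapping the spike
```

A pretrained decoder would reconstruct the normal sine accurately and fail on the spike, concentrating high scores on the anomalous region, which is exactly what the toy version shows.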
--------------------------------------------------------------------------------------------------------
SpecCLIP: Aligning and Translating Spectroscopic Measurements for Stars
Stellar spectroscopy generates vast amounts of data containing rich physical and chemical information about stars, but analyzing this data across different instruments and surveys remains challenging. SpecCLIP applies foundation model principles to stellar spectral analysis, creating robust embeddings that support diverse astronomical applications. The model's ability to align and translate between different spectral types enables cross-survey calibration and analysis. Applications include stellar parameter estimation, chemical abundance determination, anomaly detection for discovering unusual stellar objects, and enabling large-scale astronomical surveys. This work could accelerate discoveries in galactic archaeology, exoplanet host star characterization, and our understanding of stellar evolution, while providing tools for next-generation astronomical surveys.
Authors: Xiaosheng Zhao, Yang Huang, Guirong Xue, Xiao Kong, Jifeng Liu, Xiaoyu Tang, Timothy C. Beers, Yuan-Sen Ting, A-Li Luo
Link: https://arxiv.org/abs/2507.01939v1
Date: 2025-07-02
Summary:
In recent years, large language models (LLMs) have transformed natural language understanding through vast datasets and large-scale parameterization. Inspired by this success, we present SpecCLIP, a foundation model framework that extends LLM-inspired methodologies to stellar spectral analysis. Stellar spectra, akin to structured language, encode rich physical and chemical information about stars. By training foundation models on large-scale spectral datasets, our goal is to learn robust and informative embeddings that support diverse downstream applications. As a proof of concept, SpecCLIP involves pre-training on two spectral types (LAMOST low-resolution and Gaia XP), followed by contrastive alignment using the CLIP (Contrastive Language-Image Pre-training) framework, adapted to associate spectra from different instruments. This alignment is complemented by auxiliary decoders that preserve spectrum-specific information and enable translation (prediction) between spectral types, with the former achieved by maximizing mutual information between embeddings and input spectra. The result is a cross-spectrum framework enabling intrinsic calibration and flexible applications across instruments. We demonstrate that fine-tuning these models on moderate-sized labeled datasets improves adaptability to tasks such as stellar-parameter estimation and chemical-abundance determination. SpecCLIP also enhances the accuracy and precision of parameter estimates benchmarked against external survey data. Additionally, its similarity search and cross-spectrum prediction capabilities offer potential for anomaly detection. Our results suggest that contrastively trained foundation models enriched with spectrum-aware decoders can advance precision stellar spectroscopy.
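The CLIP-style alignment objective is a symmetric InfoNCE loss over a batch of paired embeddings, where row i of one instrument's batch is the positive match for row i of the other. A minimal NumPy version (the temperature and batch handling here are generic, not SpecCLIP's exact configuration):

```python
import numpy as np

def clip_loss(z_a, z_b, temp=0.07):
    """Symmetric InfoNCE over paired embeddings, e.g. z_a from LAMOST-like
    spectra and z_b from Gaia-XP-like spectra, with row i paired to row i."""
    z_a = z_a / np.linalg.norm(z_a, axis=1, keepdims=True)   # unit-normalize
    z_b = z_b / np.linalg.norm(z_b, axis=1, keepdims=True)
    logits = (z_a @ z_b.T) / temp                # scaled cosine similarities

    def xent(l):                                 # cross-entropy, targets on diagonal
        l = l - l.max(axis=1, keepdims=True)
        logp = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -np.mean(np.diag(logp))

    return 0.5 * (xent(logits) + xent(logits.T))

ident = np.eye(4)
shuffled = ident[[1, 2, 3, 0]]                   # break the pairing
aligned, misaligned = clip_loss(ident, ident), clip_loss(ident, shuffled)
```

Perfectly paired embeddings drive the loss toward zero, while mispaired ones are penalized, which is what pulls the two instruments' embeddings into a shared space.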
--------------------------------------------------------------------------------------------------------
Small and medium-sized manufacturers face a critical dilemma: they need innovative data-driven solutions but cannot share proprietary data due to competitive concerns. This research demonstrates a privacy-preserving platform that enables secure collaboration between manufacturers and researchers while maintaining data confidentiality. The case study focuses on crystal analysis for food manufacturing, replacing manual quality control processes with automated image analysis tools. Applications extend to pharmaceutical manufacturing, materials science, semiconductor production, and chemical processing industries. The platform's approach could transform how sensitive industrial data is leveraged for AI development, enabling innovation while protecting intellectual property and maintaining competitive advantages.
Authors: Xiaoyu Ji, Jessica Shorland, Joshua Shank, Pascal Delpe-Brice, Latanya Sweeney, Jan Allebach, Ali Shakouri
Link: https://arxiv.org/abs/2507.01808v1
Date: 2025-07-02
Summary:
Small- and medium-sized manufacturers need innovative data tools but, because of competition and privacy concerns, are often unwilling to share their proprietary data with researchers who might help. This paper introduces a privacy-preserving platform through which manufacturers can securely share data with researchers, researchers can build innovative tools that solve the manufacturers' real-world problems, and the finished tools can be deployed back onto the platform for others to use with privacy and confidentiality guarantees. We illustrate this process through a use case addressing an important problem in the large-scale manufacturing of food crystals, where quality control relies on image analysis. Before our research, food crystals in these images were counted manually, a substantial and time-consuming human effort; we have developed and deployed a crystal analysis tool that makes the process both faster and more accurate. The tool automatically characterizes the crystal size distribution and counts from microscope images while removing natural imperfections introduced during sample preparation, and a machine learning model was developed to count high-resolution translucent crystals and crystal agglomerates. The resulting algorithm was packaged for real-world use on the factory floor as a web-based app secured through the originating privacy-preserving platform, allowing manufacturers to use it while keeping their proprietary data secure. After demonstrating this full process, we explore future directions.
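At its simplest, the counting step reduces to connected-component analysis on a segmented image. A toy 4-connected flood-fill version (the deployed tool's segmentation, imperfection removal, and ML-based counting of translucent crystals are far more involved):

```python
import numpy as np

def count_blobs(mask):
    """Count connected bright regions (candidate crystals) in a binary image
    using a simple 4-connected flood fill."""
    mask = mask.astype(bool).copy()
    h, w = mask.shape
    count = 0
    for i in range(h):
        for j in range(w):
            if mask[i, j]:
                count += 1
                stack = [(i, j)]
                while stack:                     # flood-fill this component
                    y, x = stack.pop()
                    if 0 <= y < h and 0 <= x < w and mask[y, x]:
                        mask[y, x] = False       # mark pixel as visited
                        stack += [(y + 1, x), (y - 1, x), (y, x + 1), (y, x - 1)]
    return count

img = np.zeros((20, 20))
img[2:5, 2:5] = 1                                # three separate "crystals"
img[10:14, 10:13] = 1
img[17, 3] = 1
n = count_blobs(img)
```

Per-component pixel areas from the same traversal would give the size distribution the tool reports.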
--------------------------------------------------------------------------------------------------------
AC-DiT: Adaptive Coordination Diffusion Transformer for Mobile Manipulation
Mobile manipulation represents a frontier in robotics, combining navigation and manipulation capabilities for complex household and industrial tasks. This research addresses the fundamental challenge of coordinating mobile base movement with manipulator control, introducing a novel diffusion transformer architecture. The model's adaptive multimodal conditioning enables dynamic adjustment between 2D visual and 3D geometric information based on task requirements. Applications include household service robots, warehouse automation, elder care assistance, and industrial mobile manipulators. The technology could enable robots to perform complex tasks like cooking, cleaning, and object manipulation in unstructured environments, advancing the deployment of general-purpose robotic assistants in homes and workplaces.
Authors: Sixiang Chen, Jiaming Liu, Siyuan Qian, Han Jiang, Lily Li, Renrui Zhang, Zhuoyang Liu, Chenyang Gu, Chengkai Hou, Pengwei Wang, Zhongyuan Wang, Shanghang Zhang
Link: https://arxiv.org/abs/2507.01961v1
Date: 2025-07-02
Summary:
Recently, mobile manipulation has attracted increasing attention for enabling language-conditioned robotic control in household tasks. However, existing methods still face challenges in coordinating mobile base and manipulator, primarily due to two limitations. On the one hand, they fail to explicitly model the influence of the mobile base on manipulator control, which easily leads to error accumulation under high degrees of freedom. On the other hand, they treat the entire mobile manipulation process with the same visual observation modality (e.g., either all 2D or all 3D), overlooking the distinct multimodal perception requirements at different stages during mobile manipulation. To address this, we propose the Adaptive Coordination Diffusion Transformer (AC-DiT), which enhances mobile base and manipulator coordination for end-to-end mobile manipulation. First, since the motion of the mobile base directly influences the manipulator's actions, we introduce a mobility-to-body conditioning mechanism that guides the model to first extract base motion representations, which are then used as context prior for predicting whole-body actions. This enables whole-body control that accounts for the potential impact of the mobile base's motion. Second, to meet the perception requirements at different stages of mobile manipulation, we design a perception-aware multimodal conditioning strategy that dynamically adjusts the fusion weights between various 2D visual images and 3D point clouds, yielding visual features tailored to the current perceptual needs. This allows the model to, for example, adaptively rely more on 2D inputs when semantic information is crucial for action prediction, while placing greater emphasis on 3D geometric information when precise spatial understanding is required. We validate AC-DiT through extensive experiments on both simulated and real-world mobile manipulation tasks.
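The perception-aware conditioning amounts to a learned gate that reweights 2D-image and 3D-point-cloud features based on the current context. A toy sketch, where the gating head, dimensions, and task embedding are all made up for illustration and do not reflect AC-DiT's actual architecture:

```python
import numpy as np

rng = np.random.default_rng(2)
G = rng.normal(size=(2, 8))          # toy gating weights, a stand-in for a learned head

def fuse(f2d, f3d, task_emb):
    """Dynamically weight 2D vs 3D features from a context embedding, in the
    spirit of perception-aware multimodal conditioning (illustrative only)."""
    logits = G @ task_emb
    w = np.exp(logits - logits.max())
    w /= w.sum()                     # softmax over the two modalities
    return w[0] * f2d + w[1] * f3d, w

fused, w = fuse(rng.normal(size=32), rng.normal(size=32), rng.normal(size=8))
```

With a trained gate, semantically demanding phases would push weight toward the 2D branch and precise-placement phases toward the 3D branch, matching the behavior the abstract describes.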
--------------------------------------------------------------------------------------------------------
A Survey on Vision-Language-Action Models: An Action Tokenization Perspective
Vision-Language-Action (VLA) models represent the convergence of computer vision, natural language processing, and robotics, aiming to create intelligent agents that can understand and interact with the physical world. This comprehensive survey unifies diverse VLA approaches through the lens of action tokenization, categorizing how different models encode actionable information. The framework identifies eight distinct action token types, from language descriptions to raw actions. Applications span autonomous vehicles, robotic manipulation, drone control, and embodied AI agents. Understanding action tokenization is crucial for developing general-purpose robots capable of following natural language instructions, potentially revolutionizing human-robot interaction and enabling more intuitive robotic systems.
Authors: Yifan Zhong, Fengshuo Bai, Shaofei Cai, Xuchuan Huang, Zhang Chen, Xiaowei Zhang, Yuanfei Wang, Shaoyang Guo, Tianrui Guan, Ka Nam Lui, Zhiquan Qi, Yitao Liang, Yuanpei Chen, Yaodong Yang
Link: https://arxiv.org/abs/2507.01925v1
Date: 2025-07-02
Summary:
The remarkable advancements of vision and language foundation models in multimodal understanding, reasoning, and generation have sparked growing efforts to extend such intelligence to the physical world, fueling the flourishing of vision-language-action (VLA) models. Despite seemingly diverse approaches, we observe that current VLA models can be unified under a single framework: vision and language inputs are processed by a series of VLA modules, producing a chain of "action tokens" that progressively encode more grounded and actionable information, ultimately generating executable actions. We further determine that the primary design choice distinguishing VLA models lies in how action tokens are formulated, which can be categorized into language description, code, affordance, trajectory, goal state, latent representation, raw action, and reasoning. However, there remains a lack of comprehensive understanding regarding action tokens, significantly impeding effective VLA development and obscuring future directions. Therefore, this survey aims to categorize and interpret existing VLA research through the lens of action tokenization, distill the strengths and limitations of each token type, and identify areas for improvement. Through this systematic review and analysis, we offer a synthesized outlook on the broader evolution of VLA models, highlight underexplored yet promising directions, and contribute guidance for future research, hoping to bring the field closer to general-purpose intelligence.
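The eight token formulations the survey identifies can be captured directly as an enumeration, which is handy when tagging models in a comparison table:

```python
from enum import Enum

class ActionToken(Enum):
    """The eight action-token formulations the survey categorizes."""
    LANGUAGE_DESCRIPTION = "language description"
    CODE = "code"
    AFFORDANCE = "affordance"
    TRAJECTORY = "trajectory"
    GOAL_STATE = "goal state"
    LATENT_REPRESENTATION = "latent representation"
    RAW_ACTION = "raw action"
    REASONING = "reasoning"
```

A given VLA system may chain several of these, e.g. reasoning tokens feeding a trajectory token that is decoded into raw actions.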
--------------------------------------------------------------------------------------------------------
Evolving HPC services to enable ML workloads on HPE Cray EX
The Alps Research Infrastructure, featuring 10,752 GPUs, represents a significant computational resource for AI and machine learning research. However, traditional HPC services don't adequately support the dynamic needs of ML workloads, creating barriers for researchers transitioning from conventional computing environments. This research identifies key challenges and proposes technological enhancements to bridge the gap between HPC and ML communities. Applications include large-scale neural network training, distributed machine learning experiments, and AI model inference services. The proposed solutions could accelerate AI research by making supercomputing resources more accessible to ML practitioners, potentially enabling breakthroughs in areas requiring massive computational power like climate modeling and drug discovery.
Authors: Stefano Schuppli, Fawzi Mohamed, Henrique Mendonça, Nina Mujkanovic, Elia Palme, Dino Conciatore, Lukas Drescher, Miguel Gila, Pim Witlox, Joost VandeVondele, Maxime Martinasso, Thomas C. Schulthess, Torsten Hoefler
Link: https://arxiv.org/abs/2507.01880v1
Date: 2025-07-02
Summary:
The Alps Research Infrastructure leverages GH200 technology at scale, featuring 10,752 GPUs. Accessing Alps provides a significant computational advantage for researchers in Artificial Intelligence (AI) and Machine Learning (ML). While Alps serves a broad range of scientific communities, traditional HPC services alone are not sufficient to meet the dynamic needs of the ML community. This paper presents an initial investigation into extending HPC service capabilities to better support ML workloads. We identify key challenges and gaps we have observed since the early-access phase (2023) of Alps by the Swiss AI community and propose several technological enhancements. These include a user environment designed to facilitate the adoption of HPC for ML workloads, balancing performance with flexibility; a utility for rapid performance screening of ML applications during development; observability capabilities and data products for inspecting ongoing large-scale ML workloads; a utility to simplify the vetting of allocated nodes for compute readiness; a service plane infrastructure to deploy various types of workloads, including support and inference services; and a storage infrastructure tailored to the specific needs of ML workloads. These enhancements aim to facilitate the execution of ML workloads on HPC systems, increase system usability and resilience, and better align with the needs of the ML community. We also discuss our current approach to security aspects. This paper concludes by placing these proposals in the broader context of changes in the communities served by HPC infrastructure like ours.
--------------------------------------------------------------------------------------------------------
Mental healthcare billing fraud represents a significant challenge in healthcare systems, with complex sequential patterns and limited labeled data making detection difficult. This research explores hybrid deep learning approaches combining LSTM networks and Transformers with pseudo-labeling techniques to address class imbalance and label scarcity. The study evaluates models on real-world mental healthcare billing datasets, demonstrating the potential of combining different architectures and semi-supervised learning. Applications include fraud detection systems for healthcare insurers, compliance monitoring for healthcare providers, and automated billing audit systems. The approach could be extended to other healthcare domains and financial fraud detection, improving system integrity while reducing administrative costs.
Authors: Samirah Bakker, Yao Ma, Seyed Sahand Mohammadi Ziabari
Link: https://arxiv.org/abs/2507.01924v1
Date: 2025-07-02
Summary:
The complexity of mental healthcare billing enables anomalies, including fraud. While machine learning methods have been applied to anomaly detection, they often struggle with class imbalance, label scarcity, and complex sequential patterns. This study explores a hybrid deep learning approach combining Long Short-Term Memory (LSTM) networks and Transformers, with pseudo-labeling via Isolation Forests (iForest) and Autoencoders (AE). Prior work has not evaluated such hybrid models trained on pseudo-labeled data in the context of healthcare billing. The approach is evaluated on two real-world billing datasets related to mental healthcare. The iForest LSTM baseline achieves the highest recall (0.963) on declaration-level data. On the operation-level data, the hybrid iForest-based model achieves the highest recall (0.744), though at the cost of lower precision. These findings highlight the potential of combining pseudo-labeling with hybrid deep learning in complex, imbalanced anomaly detection settings.
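The pseudo-labeling step can be sketched with a simple unsupervised scorer standing in for the paper's iForest/AE models: score every unlabeled record, label the top fraction as anomalous, and hand the pseudo-labeled set to a supervised model. The robust z-score and contamination rate below are illustrative assumptions:

```python
import numpy as np

def pseudo_label(X, contamination=0.05):
    """Assign pseudo-labels from an unsupervised anomaly score (a robust
    z-score here, standing in for Isolation Forest / Autoencoder scorers)."""
    med = np.median(X, axis=0)
    mad = np.median(np.abs(X - med), axis=0) + 1e-9     # per-feature robust scale
    score = (np.abs(X - med) / mad).mean(axis=1)        # per-record anomaly score
    cut = np.quantile(score, 1.0 - contamination)
    return (score > cut).astype(int), score             # 1 = pseudo-anomaly

rng = np.random.default_rng(3)
X = rng.normal(size=(500, 6))
X[:10] += 8.0                                           # injected anomalous records
labels, score = pseudo_label(X)
```

In the paper's pipeline these pseudo-labels would then supervise the LSTM/Transformer hybrid, which is how label scarcity is worked around.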
--------------------------------------------------------------------------------------------------------
Portfolio optimization remains a fundamental challenge in quantitative finance, particularly for large-scale equity portfolios where traditional methods struggle with noisy covariance estimates. This research develops a rotation-invariant neural network that jointly learns to transform historical returns and regularize covariance matrices for minimum-variance portfolio construction. The model's dimension-agnostic design enables training on hundreds of stocks while applying to thousands without retraining. Applications include institutional asset management, pension fund optimization, risk parity strategies, and robo-advisory platforms. The approach demonstrates superior risk-adjusted returns and lower volatility compared to analytical competitors, potentially transforming how large-scale portfolio optimization is performed in practice while maintaining mathematical interpretability.
Authors: Christian Bongiorno, Efstratios Manolakis, Rosario Nunzio Mantegna
Link: https://arxiv.org/abs/2507.01918v1
Date: 2025-07-02
Summary:
We develop a rotation-invariant neural network that provides the global minimum-variance portfolio by jointly learning how to lag-transform historical returns and how to regularise both the eigenvalues and the marginal volatilities of large equity covariance matrices. This explicit mathematical mapping offers clear interpretability of each module's role, so the model cannot be regarded as a pure black-box. The architecture mirrors the analytical form of the global minimum-variance solution yet remains agnostic to dimension, so a single model can be calibrated on panels of a few hundred stocks and applied, without retraining, to one thousand US equities, a cross-sectional jump that demonstrates robust out-of-sample generalisation. The loss function is the future realized minimum portfolio variance and is optimized end-to-end on real daily returns. In out-of-sample tests from January 2000 to December 2024 the estimator delivers systematically lower realised volatility, smaller maximum drawdowns, and higher Sharpe ratios than the best analytical competitors, including state-of-the-art non-linear shrinkage. Furthermore, although the model is trained end-to-end to produce an unconstrained (long-short) minimum-variance portfolio, we show that its learned covariance representation can be used in general optimizers under long-only constraints with virtually no loss in its performance advantage over competing estimators. These gains persist when the strategy is executed under a highly realistic implementation framework that models market orders at the auctions, empirical slippage, exchange fees, and financing charges for leverage, and they remain stable during episodes of acute market stress.
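The analytical form the architecture mirrors is the global minimum-variance solution w = S^-1 1 / (1' S^-1 1) applied to a regularised covariance S. A minimal sketch, with simple linear eigenvalue shrinkage standing in for the paper's learned regularisation of eigenvalues and marginal volatilities:

```python
import numpy as np

def gmv_weights(returns, shrink=0.5):
    """Global minimum-variance weights from a shrunk sample covariance.
    Eigenvalues are pulled linearly toward their mean, a crude stand-in
    for the learned spectral regularisation in the paper."""
    S = np.cov(returns, rowvar=False)
    vals, vecs = np.linalg.eigh(S)
    vals = (1.0 - shrink) * vals + shrink * vals.mean()   # eigenvalue shrinkage
    S_reg = (vecs * vals) @ vecs.T                        # rebuild regularised cov
    ones = np.ones(S.shape[0])
    w = np.linalg.solve(S_reg, ones)                      # S_reg^{-1} 1
    return w / w.sum()                                    # normalize: 1' w = 1

rng = np.random.default_rng(4)
w = gmv_weights(rng.normal(size=(500, 20)) * 0.01)        # 500 days, 20 assets
```

Because only eigenvalues are modified, the mapping is rotation-invariant, and since nothing depends on the number of assets, the same function applies unchanged to a few hundred or a thousand stocks, the dimension-agnostic property the paper exploits. Weights can be negative, matching the unconstrained long-short setting.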
--------------------------------------------------------------------------------------------------------