Week Ending 8.17.2025

 

RESEARCH WATCH: 8.17.2025

 

Handwritten Text Recognition of Historical Manuscripts Using Transformer-Based Models

Historical manuscripts contain invaluable cultural and scholarly knowledge, but accessing their contents requires digitization and transcription—a traditionally labor-intensive process. This research tackles the challenge of automatically recognizing handwritten text in 16th-century Latin manuscripts using TrOCR, a state-of-the-art transformer model. The study's innovations in image preprocessing and data augmentation, combined with ensemble learning approaches, achieved remarkable improvements in accuracy. Applications extend beyond historical research to digitizing personal archives, legal documents, medical records, and any handwritten materials requiring automated transcription. This technology could revolutionize how libraries, museums, and research institutions preserve and make accessible centuries of handwritten knowledge.

Authors:  Erez Meoded

Link:  https://arxiv.org/abs/2508.11499v1

Date: 2025-08

Summary:

Historical handwritten text recognition (HTR) is essential for unlocking the cultural and scholarly value of archival documents, yet digitization is often hindered by scarce transcriptions, linguistic variation, and highly diverse handwriting styles. In this study, we apply TrOCR, a state-of-the-art transformer-based HTR model, to 16th-century Latin manuscripts authored by Rudolf Gwalther. We investigate targeted image preprocessing and a broad suite of data augmentation techniques, introducing four novel augmentation methods designed specifically for historical handwriting characteristics. We also evaluate ensemble learning approaches to leverage the complementary strengths of augmentation-trained models. On the Gwalther dataset, our best single-model augmentation (Elastic) achieves a Character Error Rate (CER) of 1.86, while a top-5 voting ensemble achieves a CER of 1.60 - representing a 50% relative improvement over the best reported TrOCR_BASE result and a 42% improvement over the previous state of the art. These results highlight the impact of domain-specific augmentations and ensemble strategies in advancing HTR performance for historical manuscripts.
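
The Character Error Rate quoted above can be made concrete with a short sketch. The code below is not the paper's evaluation script; it assumes CER is the total Levenshtein edit distance divided by the total number of reference characters, reported as a percentage, which is the common convention. The function names are illustrative.

```python
def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance: minimum insertions, deletions and substitutions."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # delete a reference character
                            curr[j - 1] + 1,      # insert a hypothesis character
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def cer_percent(refs, hyps):
    """Corpus-level CER in percent (assumed convention, see lead-in)."""
    edits = sum(edit_distance(r, h) for r, h in zip(refs, hyps))
    chars = sum(len(r) for r in refs)
    return 100.0 * edits / chars


def relative_improvement(baseline, new):
    """Relative reduction of an error rate, in percent."""
    return 100.0 * (baseline - new) / baseline


# Using only the two figures quoted in the summary:
ensemble_gain = relative_improvement(1.86, 1.60)
```

With the two numbers from the summary, the top-5 ensemble (1.60) reduces the error of the best single model (1.86) by about 14% in relative terms. The 50% and 42% figures in the summary are measured against other baselines whose values are not given here.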

--------------------------------------------------------------------------------------------------------

Inside Knowledge: Graph-based Path Generation with Explainable Data Augmentation and Curriculum Learning for Visual Indoor Navigation

Indoor navigation presents unique challenges where GPS signals fail, requiring innovative solutions for location-based services. This research introduces a vision-only deep learning system that guides users through indoor spaces using smartphone cameras alone. The approach uses graph-based path generation with explainable AI techniques, making the navigation process transparent and trustworthy. Key applications include shopping mall navigation, hospital wayfinding, airport terminal guidance, and accessibility solutions for visually impaired users. The system's independence from special sensors, internet connectivity, or pre-placed markers makes it highly deployable. With a practical Android application already developed, this technology could transform how people navigate complex indoor environments.

Authors:  Daniel Airinei, Elena Burceanu, Marius Leordeanu

Link:  https://arxiv.org/abs/2508.11446v1

Date: 2025-08

Summary:

Indoor navigation is a difficult task, as it generally comes with poor GPS access, forcing solutions to rely on other sources of information. While significant progress continues to be made in this area, deployment to production applications is still lacking, given the complexity and additional requirements of current solutions. Here, we introduce an efficient, real-time and easily deployable deep learning approach, based on visual input only, that can predict the direction towards a target from images captured by a mobile device. Our technical approach, based on a novel graph-based path generation method, combined with explainable data augmentation and curriculum learning, includes contributions that make the process of data collection, annotation and training as automatic as possible, efficient and robust. On the practical side, we introduce a novel large-scale dataset, with video footage inside a relatively large shopping mall, in which each frame is annotated with the correct next direction towards different specific target destinations. Different from current methods, ours relies solely on vision, avoiding the need for special sensors, additional markers placed along the path, knowledge of the scene map or internet access. We also created an easy-to-use application for Android, which we plan to make publicly available. We make all our data and code available along with visual demos on our project site.

--------------------------------------------------------------------------------------------------------

Vision-Language Models display a strong gender bias

As artificial intelligence systems increasingly influence hiring, content recommendation, and social interactions, understanding their embedded biases becomes critical for fair deployment. This study exposes concerning gender associations in vision-language models that align images with text descriptions. By analyzing how these models associate face images with occupational and activity descriptions, researchers revealed systematic gender biases that could perpetuate workplace discrimination and social stereotypes. The implications span recruitment platforms, educational tools, content moderation systems, and any application where AI systems evaluate people based on visual appearance. This research provides essential frameworks for bias detection and highlights the urgent need for developing more equitable AI systems across industries.

Authors:  Aiswarya Konavoor, Raj Abhijit Dandekar, Rajat Dandekar, Sreedath Panat

Link:  https://arxiv.org/abs/2508.11262v1

Date: 2025-08

Summary:

Vision-language models (VLM) align images and text in a shared representation space that is useful for retrieval and zero-shot transfer. Yet, this alignment can encode and amplify social stereotypes in subtle ways that are not obvious from standard accuracy metrics. In this study, we test whether the contrastive vision-language encoder exhibits gender-linked associations when it places embeddings of face images near embeddings of short phrases that describe occupations and activities. We assemble a dataset of 220 face photographs split by perceived binary gender and a set of 150 unique statements distributed across six categories covering emotional labor, cognitive labor, domestic labor, technical labor, professional roles, and physical labor. We compute unit-norm image embeddings for every face and unit-norm text embeddings for every statement, then define a statement-level association score as the difference between the mean cosine similarity to the male set and the mean cosine similarity to the female set, where positive values indicate stronger association with the male set and negative values indicate stronger association with the female set. We attach bootstrap confidence intervals by resampling images within each gender group, aggregate by category with a separate bootstrap over statements, and run a label-swap null model that estimates the level of mean absolute association we would expect if no gender structure were present. The outcome is a statement-wise and category-wise map of gender associations in a contrastive vision-language space, accompanied by uncertainty, simple sanity checks, and a robust gender bias evaluation framework.
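
The statement-level association score described above can be sketched in a few lines. This is an illustration only: the tiny two-dimensional vectors stand in for real encoder embeddings, the function names are made up for this example, and the percentile bootstrap is an assumption, since the summary does not say which interval construction the authors use.

```python
import math
import random


def unit(v):
    """Scale a vector to unit norm so that a dot product equals cosine similarity."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def association_score(text_emb, male_embs, female_embs):
    """Mean cosine similarity to the male set minus mean similarity to the female set."""
    male = sum(dot(text_emb, e) for e in male_embs) / len(male_embs)
    female = sum(dot(text_emb, e) for e in female_embs) / len(female_embs)
    return male - female  # > 0: closer to male set, < 0: closer to female set


def bootstrap_ci(text_emb, male_embs, female_embs, n_boot=1000, alpha=0.05, seed=0):
    """Percentile interval from resampling images within each gender group."""
    rng = random.Random(seed)
    scores = []
    for _ in range(n_boot):
        m = [rng.choice(male_embs) for _ in male_embs]
        f = [rng.choice(female_embs) for _ in female_embs]
        scores.append(association_score(text_emb, m, f))
    scores.sort()
    return scores[int(alpha / 2 * n_boot)], scores[int((1 - alpha / 2) * n_boot) - 1]


# Toy stand-ins for face and statement embeddings (hypothetical values).
male = [unit([1.0, 0.1]), unit([0.9, 0.2])]
female = [unit([0.1, 1.0]), unit([0.2, 0.9])]
statement = unit([1.0, 0.0])
score = association_score(statement, male, female)
```

In this toy setup the statement vector points towards the "male" cluster, so the score is positive. Comparing a set with itself gives a score of zero, which is the kind of sanity check the label-swap null model generalizes.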

--------------------------------------------------------------------------------------------------------

Hallucination in LLM-Based Code Generation: An Automotive Case Study

Large language models show tremendous promise for automating software development, but their tendency to generate plausible-seeming but incorrect code poses serious risks, especially in safety-critical domains. This automotive industry case study reveals how leading models such as GPT-4.1, Codex, and GPT-4o frequently produce code with syntax errors, invalid references, and API conflicts despite appearing correct. The research demonstrates that even state-of-the-art models struggle with domain-specific code generation, requiring extensive context and multiple refinements. Applications include automated software testing, code review systems, developer assistance tools, and any scenario where AI-generated code must be reliable. This work underscores the critical need for robust validation mechanisms before deploying AI-generated code in production environments.

Authors:  Marc Pavel, Nenad Petrovic, Lukasz Mazur, Vahid Zolfaghari, Fengjunjie Pan, Alois Knoll

Link:  https://arxiv.org/abs/2508.11257v1

Date: 2025-08

Summary:

Large Language Models (LLMs) have shown significant potential in automating code generation tasks, offering new opportunities across software engineering domains. However, their practical application remains limited due to hallucinations - outputs that appear plausible but are factually incorrect, unverifiable or nonsensical. This paper investigates hallucination phenomena in the context of code generation with a specific focus on the automotive domain. A case study is presented that evaluates multiple code LLMs for three different prompting complexities, ranging from a minimal one-liner prompt, to a prompt with Covesa Vehicle Signal Specifications (VSS) as additional context, and finally to a prompt with an additional code skeleton. The evaluation reveals a high frequency of syntax violations, invalid reference errors and API knowledge conflicts in state-of-the-art models GPT-4.1, Codex and GPT-4o. Among the evaluated models, only GPT-4.1 and GPT-4o were able to produce a correct solution when given the most context-rich prompt. Simpler prompting strategies failed to yield a working result, even after multiple refinement iterations. These findings highlight the need for effective mitigation techniques to ensure the safe and reliable use of LLM-generated code, especially in safety-critical domains such as automotive software systems.

--------------------------------------------------------------------------------------------------------

Generalized Decoupled Learning for Enhancing Open-Vocabulary Dense Perception

Computer vision systems traditionally recognize only predefined categories, limiting their real-world applicability where visual concepts are boundless. This research addresses fundamental limitations in CLIP-based models for dense perception tasks like object detection and segmentation. By decoupling attention mechanisms into "content" and "context" features, the DeCLIP framework significantly improves spatial consistency and local discriminability. Applications span autonomous vehicles recognizing novel road objects, robotics systems adapting to new environments, medical imaging detecting rare conditions, and surveillance systems identifying previously unseen threats. This breakthrough enables AI systems to understand visual concepts beyond their training data, moving closer to human-like visual understanding across diverse real-world scenarios.

Authors:  Junjie Wang, Keyu Chen, Yulin Li, Bin Chen, Hengshuang Zhao, Xiaojuan Qi, Zhuotao Tian

Link:  https://arxiv.org/abs/2508.11256v1

Date: 2025-08

Summary:

Dense visual perception tasks have been constrained by their reliance on predefined categories, limiting their applicability in real-world scenarios where visual concepts are unbounded. While Vision-Language Models (VLMs) like CLIP have shown promise in open-vocabulary tasks, their direct application to dense perception often leads to suboptimal performance due to limitations in local feature representation. In this work, we present our observation that CLIP's image tokens struggle to effectively aggregate information from spatially or semantically related regions, resulting in features that lack local discriminability and spatial consistency. To address this issue, we propose DeCLIP, a novel framework that enhances CLIP by decoupling the self-attention module to obtain "content" and "context" features respectively. The context features are enhanced by jointly distilling semantic correlations from Vision Foundation Models (VFMs) and object integrity cues from diffusion models, thereby enhancing spatial consistency. In parallel, the content features are aligned with image crop representations and constrained by region correlations from VFMs to improve local discriminability. Extensive experiments demonstrate that DeCLIP establishes a solid foundation for open-vocabulary dense perception, consistently achieving state-of-the-art performance across a broad spectrum of tasks, including 2D detection and segmentation, 3D instance segmentation, video instance segmentation, and 6D object pose estimation. Code is available at https://github.com/xiaomoguhz/DeCLIP

--------------------------------------------------------------------------------------------------------

On Strong and Weak Admissibility in Non-Flat Assumption-Based Argumentation

Automated reasoning systems must handle conflicting information and uncertain knowledge, making argumentation frameworks essential for AI decision-making. This research expands assumption-based argumentation theory by introducing strong and weak admissibility concepts, providing more nuanced approaches to handling contradictory evidence. The work establishes formal foundations for reasoning systems that can better evaluate competing arguments and maintain consistency under uncertainty. Applications include legal reasoning systems that weigh conflicting evidence, medical diagnosis tools handling uncertain symptoms, policy analysis frameworks evaluating competing proposals, and any AI system requiring robust decision-making under incomplete information. This theoretical advancement could improve how intelligent systems handle real-world ambiguity and conflicting data.

Authors:  Matti Berthold, Lydia Blümel, Anna Rapberger

Link:  https://arxiv.org/abs/2508.11182v1

Date: 2025-08

Summary:

In this work, we broaden the investigation of admissibility notions in the context of assumption-based argumentation (ABA). More specifically, we study two prominent alternatives to the standard notion of admissibility from abstract argumentation, namely strong and weak admissibility, and introduce the respective preferred, complete and grounded semantics for general (sometimes called non-flat) ABA. To do so, we use abstract bipolar set-based argumentation frameworks (BSAFs) as formal playground since they concisely capture the relations between assumptions and are expressive enough to represent general non-flat ABA frameworks, as recently shown. While weak admissibility has been recently investigated for a restricted fragment of ABA in which assumptions cannot be derived (flat ABA), strong admissibility has not been investigated for ABA so far. We introduce strong admissibility for ABA and investigate desirable properties. We furthermore extend the recent investigations of weak admissibility in the flat ABA fragment to the non-flat case. We show that the central modularization property is maintained under classical, strong, and weak admissibility. We also show that strong and weakly admissible semantics in non-flat ABA share some of the shortcomings of standard admissible semantics and discuss ways to address these.

--------------------------------------------------------------------------------------------------------

Better Supervised Fine-tuning for VQA: Integer-Only Loss

Vision-language models increasingly serve as automated evaluators for visual content quality, from social media moderation to video production assessment. This research addresses precision issues in visual quality assessment by constraining model outputs to integers and introducing targeted loss calculations that focus on key evaluation components. The IOVQA approach achieved third place in a major video quality assessment challenge, demonstrating practical effectiveness. Applications include automated content moderation, video streaming quality control, educational assessment tools, and any system requiring consistent visual evaluation. This methodology offers a simple yet powerful technique for improving AI reliability in quantitative visual assessment tasks across entertainment, education, and digital media industries.

Authors:  Baihong Qian, Haotian Fan, Wenjie Liao, Yunqiu Wang, Tao Li, Junhui Cui

Link:  https://arxiv.org/abs/2508.11170v1

Date: 2025-08

Summary:

With the rapid advancement of vision language models (VLMs), their ability to assess visual content based on specific criteria and dimensions has become increasingly critical for applications such as video-theme consistency assessment and visual quality scoring. However, existing methods often suffer from imprecise results and inefficient loss calculation, which limit the focus of the model on key evaluation indicators. To address this, we propose IOVQA (Integer-only VQA), a novel fine-tuning approach tailored for VLMs to enhance their performance in video quality assessment tasks. The key innovation of IOVQA lies in its label construction and its targeted loss calculation mechanism. Specifically, during dataset curation, we constrain the model's output to integers within the range of [10,50], ensuring numerical stability, and convert decimal Overall_MOS values to integers before using them as labels. We also introduce a target-mask strategy: when computing the loss, only the first two-digit integer of the label is unmasked, forcing the model to learn the critical components of the numerical evaluation. After fine-tuning the Qwen2.5-VL model using the constructed dataset, experimental results demonstrate that the proposed method significantly improves the model's accuracy and consistency in the VQA task, ranking 3rd in VQualA 2025 GenAI-Bench AIGC Video Quality Assessment Challenge -- Track I. Our work highlights the effectiveness of keeping only integer labels during fine-tuning, providing an effective idea for optimizing VLMs in quantitative evaluation scenarios.
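
The label construction and target-mask idea can be illustrated with a minimal sketch. Several details here are assumptions rather than the authors' implementation: that Overall_MOS lies on a 1.0 to 5.0 scale and is mapped to [10, 50] by multiplying by 10 and rounding, that the score occupies the first two answer tokens, and that masked positions use the ignore index -100 that PyTorch's cross-entropy loss skips by default. The function names and token ids are hypothetical.

```python
IGNORE_INDEX = -100  # assumption: positions with this label are excluded from the loss


def mos_to_int_label(mos: float) -> int:
    """Map a decimal MOS (assumed 1.0-5.0) to an integer label clamped to [10, 50]."""
    return max(10, min(50, round(mos * 10)))


def build_labels(prompt_tokens, answer_tokens, n_score_tokens=2):
    """Supervise only the first n_score_tokens of the answer; mask everything else."""
    labels = [IGNORE_INDEX] * len(prompt_tokens)
    for i, tok in enumerate(answer_tokens):
        labels.append(tok if i < n_score_tokens else IGNORE_INDEX)
    return labels


# Toy example with made-up token ids: the answer "35" followed by an end token.
prompt = [101, 102, 103]
answer = [ord("3"), ord("5"), 2]
labels = build_labels(prompt, answer)
```

Here only the two digit tokens keep their ids in `labels`, so a loss computed over this sequence would depend on the predicted score alone. A real tokenizer may encode a two-digit number as one token or two, so `n_score_tokens` would need to match it.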

--------------------------------------------------------------------------------------------------------

Managing Risks from Large Digital Loads Using Coordinated Grid-Forming Storage Network

The explosive growth of AI data centers creates unprecedented challenges for electrical grid stability, with massive, unpredictable power demands threatening infrastructure reliability. This research proposes coordinating distributed energy storage systems to manage these extreme load fluctuations without costly transmission upgrades. The bi-layered control strategy combines fast local responses with coordinated network-wide adjustments to maintain grid stability. Applications extend beyond AI data centers to electric vehicle charging networks, renewable energy integration, and any scenario with large, variable power demands. This technology could enable sustainable expansion of digital infrastructure while maintaining grid reliability, supporting the continued growth of artificial intelligence and cloud computing without compromising electrical system stability.

Authors:  Soumya Kundu, Kaustav Chatterjee, Ramij R. Hossain, Sai Pushpak Nandanoori, Veronica Adetola

Link:  https://arxiv.org/abs/2508.11080v1

Date: 2025-08

Summary:

The anticipated rapid growth of large digital loads, driven by artificial intelligence (AI) data centers, is poised to increase uncertainty and large fluctuations in consumption, threatening the stability, reliability, and security of the energy infrastructure. Conventional measures taken by grid planners and operators to ensure stable and reliable integration of new resources are either cost-prohibitive (e.g., transmission upgrades) or ill-equipped (e.g., generation control) to resolve the unique challenges brought on by AI Data Centers (e.g., extreme load transients). In this work, we explore the feasibility of coordinating and managing available flexibility in the grid, in terms of grid-forming storage units, to ensure stable and reliable integration of AI Data Centers without the need for costly grid upgrades. Recently developed bi-layered coordinated control strategies -- involving fast-acting, local, autonomous control at the storage to maintain transient safety in voltage and frequency at the point-of-interconnection, and a slower, coordinated (consensus) control to restore normal operating conditions in the grid -- are used in the case studies. A comparison is drawn between two broad scenarios: a network of coordinated, smaller, distributed storage vs. larger storage installations collocated with large digital loads. The IEEE 68-bus network is used for the case studies, with large digital load profiles drawn from the MIT Supercloud Dataset.

--------------------------------------------------------------------------------------------------------

From Individual to Multi-Agent Algorithmic Recourse: Minimizing the Welfare Gap via Capacitated Bipartite Matching

Algorithmic decision-making affects millions through hiring, lending, and service allocation, but existing fairness solutions focus on individual cases rather than system-wide outcomes. This research introduces a framework for multi-agent algorithmic recourse that considers how multiple individuals compete for limited opportunities. By modeling the interaction as a matching problem, the approach optimizes social welfare while maintaining individual actionability. Applications include job placement systems balancing candidate and employer needs, college admissions managing limited seats, healthcare resource allocation during shortages, and any scenario where AI systems must fairly distribute opportunities among competing individuals. This work transforms algorithmic fairness from individual recommendations to comprehensive system design.

Authors:  Zahra Khotanlou, Kate Larson, Amir-Hossein Karimi

Link:  https://arxiv.org/abs/2508.11070v1

Date: 2025-08

Summary:

Decision makers are increasingly relying on machine learning in sensitive situations. In such settings, algorithmic recourse aims to provide individuals with actionable and minimally costly steps to reverse unfavorable AI-driven decisions. While existing research predominantly focuses on single-individual (i.e., seeker) and single-model (i.e., provider) scenarios, real-world applications often involve multiple interacting stakeholders. Optimizing outcomes for seekers under an individual welfare approach overlooks the inherently multi-agent nature of real-world systems, where individuals interact and compete for limited resources. To address this, we introduce a novel framework for multi-agent algorithmic recourse that accounts for multiple recourse seekers and recourse providers. We model this many-to-many interaction as a capacitated weighted bipartite matching problem, where matches are guided by both recourse cost and provider capacity. Edge weights, reflecting recourse costs, are optimized for social welfare while quantifying the welfare gap between individual welfare and this collectively feasible outcome. We propose a three-layer optimization framework: (1) basic capacitated matching, (2) optimal capacity redistribution to minimize the welfare gap, and (3) cost-aware optimization balancing welfare maximization with capacity adjustment costs. Experimental validation on synthetic and real-world datasets demonstrates that our framework enables the many-to-many algorithmic recourse to achieve near-optimal welfare with minimum modification in system settings. This work extends algorithmic recourse from individual recommendations to system-level design, providing a tractable path toward higher social welfare while maintaining individual actionability.
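
A toy version of the capacitated matching and the welfare gap may help make the setup concrete. The sketch below is not the paper's three-layer framework: it brute-forces a tiny instance, assumes each seeker is matched to exactly one provider, and treats welfare simply as negative total recourse cost. The function names and the cost matrix are hypothetical.

```python
from itertools import product


def best_capacitated_matching(cost, capacity):
    """cost[i][j]: recourse cost of seeker i at provider j; capacity[j]: max seekers at j.

    Returns (minimum total cost, assignment tuple) by exhaustive search.
    """
    n_seekers, n_providers = len(cost), len(capacity)
    best_total, best_assign = float("inf"), None
    for assign in product(range(n_providers), repeat=n_seekers):
        if any(assign.count(j) > capacity[j] for j in range(n_providers)):
            continue  # violates a provider's capacity
        total = sum(cost[i][assign[i]] for i in range(n_seekers))
        if total < best_total:
            best_total, best_assign = total, assign
    return best_total, best_assign


def welfare_gap(cost, capacity):
    """Extra total cost caused by capacity limits, relative to everyone's cheapest option."""
    individual = sum(min(row) for row in cost)
    matched, _ = best_capacitated_matching(cost, capacity)
    return matched - individual


# Three seekers, two providers; provider 0 is cheapest for everyone.
cost = [[1, 4], [2, 3], [1, 5]]
gap_tight = welfare_gap(cost, [1, 2])   # provider 0 can serve only one seeker
gap_moved = welfare_gap(cost, [2, 1])   # one unit of capacity moved to provider 0
gap_none = welfare_gap(cost, [3, 0])    # all capacity at provider 0
```

With the same total capacity of three, the gap shrinks from 4 to 1 to 0 as capacity moves towards the provider everyone prefers, which is the intuition behind the capacity redistribution layer. Exhaustive search only works for tiny instances; a realistic implementation would use a min-cost flow or assignment solver.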

--------------------------------------------------------------------------------------------------------

Learning with Confidence

Understanding how humans and machines learn from uncertain information is fundamental to developing robust AI systems. This research formalizes the concept of "confidence" in learning—how much trust one places in incoming information and its impact on beliefs. The framework unifies various learning concepts including learning rates, evidence weights, and adaptive algorithms under a single theoretical umbrella. Applications span adaptive learning systems that adjust to user feedback, financial models incorporating market uncertainty, medical diagnosis systems weighing symptom reliability, and any AI system learning from potentially unreliable data. This theoretical foundation could improve how intelligent systems handle uncertainty and adapt their learning strategies based on information quality and source reliability.

Authors:  Oliver Ethan Richardson

Link:  https://arxiv.org/abs/2508.11037v1

Date: 2025-08

Summary:

We characterize a notion of confidence that arises in learning or updating beliefs: the amount of trust one has in incoming information and its impact on the belief state. This learner's confidence can be used alongside (and is easily mistaken for) probability or likelihood, but it is fundamentally a different concept -- one that captures many familiar concepts in the literature, including learning rates and number of training epochs, Shafer's weight of evidence, and Kalman gain. We formally axiomatize what it means to learn with confidence, give two canonical ways of measuring confidence on a continuum, and prove that confidence can always be represented in this way. Under additional assumptions, we derive more compact representations of confidence-based learning in terms of vector fields and loss functions. These representations induce an extended language of compound "parallel" observations. We characterize Bayes Rule as the special case of an optimizing learner whose loss representation is a linear expectation.
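
One familiar instance of the idea, offered purely as an illustration and not as the paper's axiomatization, is a scalar update in which a confidence value between 0 and 1 decides how far a belief moves towards an observation. A learning rate plays this role in gradient-style updates, and so does the gain in a scalar Kalman filter. The function names are made up for this sketch.

```python
def update(belief: float, observation: float, confidence: float) -> float:
    """Move the belief towards the observation by a fraction `confidence` in [0, 1]."""
    if not 0.0 <= confidence <= 1.0:
        raise ValueError("confidence must lie in [0, 1] in this sketch")
    return belief + confidence * (observation - belief)


def kalman_gain(prior_var: float, obs_var: float) -> float:
    """Scalar Kalman gain for a directly observed state: trust grows as obs_var shrinks."""
    return prior_var / (prior_var + obs_var)


ignored = update(0.0, 10.0, 0.0)    # no trust: belief unchanged
adopted = update(0.0, 10.0, 1.0)    # full trust: belief replaced by the observation
partial = update(0.0, 10.0, kalman_gain(1.0, 3.0))  # gain of 0.25
```

Zero confidence leaves the belief where it was and full confidence replaces it with the observation. Note that the gain is not a probability that the observation is true, which matches the distinction the summary draws between confidence and likelihood.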

--------------------------------------------------------------------------------------------------------

Note on Selection Bias in Observational Estimates of Algorithmic Progress

Measuring progress in artificial intelligence development is crucial for understanding technological advancement and guiding research investment decisions. This research identifies a critical methodological flaw in observational studies of AI progress, where selection bias can lead to overestimating algorithmic improvements over time. The bias occurs when computational resource allocation decisions correlate with unobserved algorithmic quality factors. Applications include AI research evaluation, technology investment decisions, policy planning for AI development, and benchmarking studies comparing different AI approaches. This work highlights the importance of rigorous methodology in AI progress assessment, ensuring that claims about technological advancement are based on sound statistical foundations rather than biased observations.

Authors:  Parker Whitfill

Link:  https://arxiv.org/abs/2508.11033v1

Date: 2025-08

Summary:

Ho et al. (2024) is an interesting paper that attempts to estimate the degree of algorithmic progress from language models. They collect observational data on language models' loss and compute over time, and argue that as time has passed, language models' algorithmic efficiency has been rising. That is, the loss achieved for fixed compute has been dropping over time. In this note, I want to raise one potential methodological problem with the estimation strategy. Intuitively, if part of algorithmic quality is latent, and compute choices are endogenous to algorithmic quality, then resulting estimates of algorithmic quality will be biased.
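
The intuition in the last sentence is the classic omitted-variable problem, and a small simulation can show it. The model below is a generic toy, not the specification from the note or from the paper it discusses: loss falls with log-compute and with a latent quality term, and `gamma` (a made-up parameter) controls how strongly compute choices follow that latent quality.

```python
import random


def ols_slope(x, y):
    """Slope of a simple least-squares regression of y on x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var = sum((a - mx) ** 2 for a in x)
    return cov / var


def simulate(gamma, n=20000, beta=0.5, seed=0):
    """Estimate the compute coefficient when latent quality is left out.

    True model (assumed): loss = -beta * log_compute - quality + noise,
    with log_compute = z + gamma * quality, and z, quality ~ N(0, 1).
    """
    rng = random.Random(seed)
    log_compute, loss = [], []
    for _ in range(n):
        quality = rng.gauss(0, 1)                 # latent, unobserved
        c = rng.gauss(0, 1) + gamma * quality     # compute follows quality
        log_compute.append(c)
        loss.append(-beta * c - quality + rng.gauss(0, 0.1))
    return ols_slope(log_compute, loss)


exogenous = simulate(gamma=0.0)    # compute chosen independently of quality
endogenous = simulate(gamma=1.0)   # better algorithms get more compute
```

When compute is independent of latent quality, the estimated slope is close to the true value of -0.5. With `gamma=1.0` the slope is close to -1.0, because the regression credits compute with gains that actually come from unobserved quality. In a richer regression, other terms estimated alongside compute, such as a time trend read as algorithmic progress, can inherit a related distortion.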

--------------------------------------------------------------------------------------------------------

Improving Text Style Transfer using Masked Diffusion Language Models with Inference-time Scaling

Text style transfer—converting text from one style to another while preserving meaning—has applications across communication, creative writing, and content adaptation. This research advances masked diffusion language models as alternatives to autoregressive systems, offering better scalability and training efficiency. The verifier-based inference-time scaling method significantly improves generation quality during the denoising process. Applications include automated content adaptation for different audiences, creative writing assistance, professional communication tools, accessibility features for different reading levels, and personalized content generation. This technology could transform how we automatically adapt written content for various contexts, from technical documentation simplification to creative writing style matching, making communication more effective and accessible.

Authors:  Tejomay Kishor Padole, Suyash P Awate, Pushpak Bhattacharyya

Link:  https://arxiv.org/abs/2508.10995v1

Date: 2025-08

Summary:

Masked diffusion language models (MDMs) have recently gained traction as a viable generative framework for natural language. This can be attributed to their scalability and ease of training compared to other diffusion model paradigms for discrete data, which has established them as the state-of-the-art non-autoregressive generators for discrete data. Diffusion models, in general, have shown excellent ability to improve the generation quality by leveraging inference-time scaling, either by increasing the number of denoising steps or by using external verifiers on top of the outputs of each step to guide the generation. In this work, we propose a verifier-based inference-time scaling method that aids in finding a better candidate generation during the denoising process of the MDM. Our experiments demonstrate the application of MDMs for standard text-style transfer tasks and establish MDMs as a better alternative to autoregressive language models. Additionally, we show that a simple soft-value-based verifier setup for MDMs using off-the-shelf pre-trained embedding models leads to significant gains in generation quality even when used on top of typical classifier-free guidance setups in the existing literature.

--------------------------------------------------------------------------------------------------------

Empirical Investigation into Configuring Echo State Networks for Representative Benchmark Problem Domains

Echo State Networks represent a powerful yet underutilized approach to machine learning, offering unique advantages for temporal data processing but requiring extensive expertise for effective configuration. This comprehensive study provides practical guidelines for configuring these reservoir computing systems across different problem domains including time series prediction and pattern generation. The research addresses the experience gap that prevents wider adoption by establishing rules of thumb for parameter selection and architectural decisions. Applications include financial forecasting, weather prediction, speech recognition, control systems, and any domain involving temporal pattern recognition. This work democratizes access to reservoir computing by providing practical configuration guidance, potentially expanding the use of these efficient neural networks.

Authors:  Brooke R. Weborg, Gursel Serpen

Link:  https://arxiv.org/abs/2508.10887v1

Date: 2025-08

Summary:

This paper examines the performance of the Echo State Network, a reservoir computer, on four different benchmark problems, then proposes heuristics, or rules of thumb, for configuring the architecture and for selecting parameters and their values. These heuristics are applicable to problems within the same domain and are intended to help fill the experience gap faced by those entering this field of study. The influence of parameter selections and value adjustments, as well as architectural changes made to an Echo State Network, a powerful recurrent neural network configured as a reservoir computer, can be challenging to fully comprehend without experience in the field, and even some hyperparameter optimization algorithms may have difficulty adjusting parameter values without proper manual selections made first. Therefore, it is imperative to understand the effects of parameters and their value selection on Echo State Network architecture performance for a successful build. Thus, to address the requirement for an extensive background in Echo State Network architecture, and to examine how Echo State Network performance is affected by variations in architecture, design, and parameter selection and values, a series of benchmark tasks representing different problem domains, including time series prediction, pattern generation, chaotic system prediction, and time series classification, were modeled and experimented on to show the impact on the performance of the Echo State Network.

--------------------------------------------------------------------------------------------------------

A Survey on Diffusion Language Models

Diffusion language models emerge as a compelling alternative to dominant autoregressive text generation approaches, offering parallel token generation and bidirectional context understanding. This comprehensive survey traces the evolution from autoregressive models to diffusion approaches, providing taxonomies of current techniques and analyzing their advantages in reducing inference latency while maintaining generation quality. The work covers foundational principles, state-of-the-art models, and optimization strategies including advanced decoding and caching mechanisms. Applications span creative writing, content generation, real-time translation, interactive dialogue systems, and any scenario requiring fast, high-quality text generation. This survey serves as an essential resource for researchers and practitioners seeking to understand and implement these emerging models in natural language processing applications.

Authors:  Tianyi Li, Mingda Chen, Bowei Guo, Zhiqiang Shen

Link:  https://arxiv.org/abs/2508.10875v1

Date: 2025-08-d

Summary:

Diffusion Language Models (DLMs) are rapidly emerging as a powerful and promising alternative to the dominant autoregressive (AR) paradigm. By generating tokens in parallel through an iterative denoising process, DLMs possess inherent advantages in reducing inference latency and capturing bidirectional context, thereby enabling fine-grained control over the generation process. While achieving a several-fold speed-up, recent advancements have allowed DLMs to show performance comparable to their autoregressive counterparts, making them a compelling choice for various natural language processing tasks. In this survey, we provide a holistic overview of the current DLM landscape. We trace its evolution and relationship with other paradigms, such as autoregressive and masked language models, and cover both foundational principles and state-of-the-art models. Our work offers an up-to-date, comprehensive taxonomy and an in-depth analysis of current techniques, from pre-training strategies to advanced post-training methods. Another contribution of this survey is a thorough review of DLM inference strategies and optimizations, including improvements in decoding parallelism, caching mechanisms, and generation quality. We also highlight the latest approaches to multimodal extensions of DLMs and delineate their applications across various practical scenarios. Furthermore, our discussion addresses the limitations and challenges of DLMs, including efficiency, long-sequence handling, and infrastructure requirements, while outlining future research directions to sustain progress in this rapidly evolving field. Project GitHub is available at https://github.com/VILA-Lab/Awesome-DLMs.
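
As a rough illustration of "generating tokens in parallel through an iterative denoising process," the sketch below shows one common scheme, confidence-based unmasking, under stated assumptions. The `toy_denoiser` is a hypothetical stand-in that returns random probabilities rather than a trained model, and the unmasking schedule is one simple choice, not a method attributed to the survey.

```python
import numpy as np

MASK = -1  # sentinel id for a position that has not been generated yet

def toy_denoiser(tokens, vocab_size, rng):
    """Hypothetical stand-in for a trained model: random per-position probabilities."""
    logits = rng.normal(size=(len(tokens), vocab_size))
    probs = np.exp(logits - logits.max(axis=1, keepdims=True))
    return probs / probs.sum(axis=1, keepdims=True)

def parallel_decode(length, vocab_size, steps, seed=0):
    """Start fully masked; at each step, commit the most confident predictions."""
    rng = np.random.default_rng(seed)
    tokens = np.full(length, MASK)
    for step in range(steps):
        masked = np.flatnonzero(tokens == MASK)
        if masked.size == 0:
            break
        probs = toy_denoiser(tokens, vocab_size, rng)  # all positions scored at once
        best = probs.argmax(axis=1)
        confidence = probs.max(axis=1)
        # Spread the remaining masked positions evenly over the remaining steps.
        n_unmask = int(np.ceil(masked.size / (steps - step)))
        chosen = masked[np.argsort(-confidence[masked])[:n_unmask]]
        tokens[chosen] = best[chosen]
    return tokens

# Usage: 16 tokens produced in 4 model calls instead of 16 sequential ones.
tokens = parallel_decode(length=16, vocab_size=100, steps=4)
```

The latency argument is visible in the loop count: the number of model calls is set by `steps`, not by the sequence length.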

--------------------------------------------------------------------------------------------------------

Performance of GPT-5 in Brain Tumor MRI Reasoning

Medical imaging interpretation requires years of specialized training, creating bottlenecks in healthcare delivery, especially in underserved regions. This study evaluates GPT-5 family models on brain tumor differentiation using MRI images, combining visual analysis with natural language reasoning for medical diagnosis. While achieving moderate accuracy across glioblastoma, meningioma, and brain metastases cases, performance remained below clinical standards. Applications include preliminary screening tools, medical education systems, second opinion support, and diagnostic assistance in resource-limited settings. This research highlights both the potential and current limitations of large language models in medical imaging, suggesting paths toward AI-assisted radiology while emphasizing the continued need for human expertise in clinical decision-making.

Authors:  Mojtaba Safari, Shansong Wang, Mingzhe Hu, Zach Eidex, Qiang Li, Xiaofeng Yang

Link:  https://arxiv.org/abs/2508.10865v1

Date: 2025-08-d

Summary:

Accurate differentiation of brain tumor types on magnetic resonance imaging (MRI) is critical for guiding treatment planning in neuro-oncology. Recent advances in large language models (LLMs) have enabled visual question answering (VQA) approaches that integrate image interpretation with natural language reasoning. In this study, we evaluated GPT-4o, GPT-5-nano, GPT-5-mini, and GPT-5 on a curated brain tumor VQA benchmark derived from 3 Brain Tumor Segmentation (BraTS) datasets - glioblastoma (GLI), meningioma (MEN), and brain metastases (MET). Each case included multi-sequence MRI triplanar mosaics and structured clinical features transformed into standardized VQA items. Models were assessed in a zero-shot chain-of-thought setting for accuracy on both visual and reasoning tasks. Results showed that GPT-5-mini achieved the highest macro-average accuracy (44.19%), followed by GPT-5 (43.71%), GPT-4o (41.49%), and GPT-5-nano (35.85%). Performance varied by tumor subtype, with no single model dominating across all cohorts. These findings suggest that GPT-5 family models can achieve moderate accuracy in structured neuro-oncological VQA tasks, but not at a level acceptable for clinical use.
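
For readers unfamiliar with the headline metric, the sketch below shows how a macro-average accuracy is computed: accuracy is taken per cohort first, then averaged with equal weight. The counts are made-up placeholders, not figures from the paper, and the function names are illustrative.

```python
def macro_average_accuracy(per_cohort):
    """Mean of per-cohort accuracies; each cohort counts equally regardless of size."""
    accuracies = [correct / total for correct, total in per_cohort.values()]
    return sum(accuracies) / len(accuracies)

def micro_average_accuracy(per_cohort):
    """Pooled accuracy over all items; larger cohorts dominate."""
    correct = sum(c for c, _ in per_cohort.values())
    total = sum(t for _, t in per_cohort.values())
    return correct / total

# Hypothetical (correct, total) counts for the three tumor cohorts.
counts = {"GLI": (50, 100), "MEN": (30, 100), "MET": (40, 50)}
macro = macro_average_accuracy(counts)
micro = micro_average_accuracy(counts)
```

Macro-averaging keeps a small cohort from being drowned out by larger ones, which matters when performance varies by tumor subtype as reported here.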

--------------------------------------------------------------------------------------------------------

Who Benefits from AI Explanations? Towards Accessible and Interpretable Systems

As AI systems make increasingly critical decisions, explainability becomes essential for user trust and regulatory compliance, yet current explainable AI methods largely exclude users with disabilities. This research exposes significant accessibility gaps in XAI techniques, revealing that most explanations rely on inherently visual formats unsuitable for users with vision impairments. The study provides a methodological framework for inclusive XAI design and demonstrates that simplified explanations work better for non-visual users. Applications include accessible financial services, inclusive healthcare decision support, educational AI tools, and any AI system serving diverse user populations. This work emphasizes the importance of universal design in AI explainability, ensuring that transparency benefits all users regardless of ability.

Authors:  Maria J. P. Peixoto, Akriti Pandey, Ahsan Zaman, Peter R. Lewis

Link:  https://arxiv.org/abs/2508.10806v1

Date: 2025-08-d

Summary:

As AI systems are increasingly deployed to support decision-making in critical domains, explainability has become a means to enhance the understandability of these outputs and enable users to make more informed and conscious choices. However, despite growing interest in the usability of eXplainable AI (XAI), the accessibility of these methods, particularly for users with vision impairments, remains underexplored. This paper investigates accessibility gaps in XAI through a two-pronged approach. First, a literature review of 79 studies reveals that evaluations of XAI techniques rarely include disabled users, with most explanations relying on inherently visual formats. Second, we present a four-part methodological proof of concept that operationalizes inclusive XAI design: (1) categorization of AI systems, (2) persona definition and contextualization, (3) prototype design and implementation, and (4) expert and user assessment of XAI techniques for accessibility. Preliminary findings suggest that simplified explanations are more comprehensible for non-visual users than detailed ones, and that multimodal presentation is required for more equitable interpretability.

--------------------------------------------------------------------------------------------------------

Estimating Covariance for Global Minimum Variance Portfolio: A Decision-Focused Learning Approach

Portfolio optimization fundamentally depends on accurate parameter estimation, yet traditional statistical methods minimize prediction errors rather than optimizing actual investment outcomes. This research applies decision-focused learning to portfolio construction, directly optimizing decision quality rather than statistical accuracy for global minimum variance portfolios. The approach consistently delivers superior decision performance compared with conventional mean-squared error minimization methods. Applications include institutional asset management, robo-advisors, pension fund optimization, risk management systems, and any financial application requiring portfolio construction under uncertainty. This methodology could help align machine learning objectives with actual investment goals in quantitative finance, potentially lowering portfolio volatility across various financial products and services.

Authors:  Juchan Kim, Inwoo Tae, Yongjae Lee

Link:  https://arxiv.org/abs/2508.10776v1

Date: 2025-08-d

Summary:

Portfolio optimization constitutes a cornerstone of risk management by quantifying the risk-return trade-off. Since it inherently depends on accurate parameter estimation under conditions of future uncertainty, the selection of appropriate input parameters is critical for effective portfolio construction. However, most conventional statistical estimators and machine learning algorithms determine these parameters by minimizing mean-squared error (MSE), a criterion that can yield suboptimal investment decisions. In this paper, we adopt decision-focused learning (DFL) - an approach that directly optimizes decision quality rather than prediction error such as MSE - to derive the global minimum-variance portfolio (GMVP). Specifically, we theoretically derive the gradient of decision loss using the analytic solution of GMVP and its properties regarding the principal components of itself. Through extensive empirical evaluation, we show that prediction-focused estimation methods may fail to produce optimal allocations in practice, whereas DFL-based methods consistently deliver superior decision performance. Furthermore, we provide a comprehensive analysis of DFL's mechanism in GMVP construction, focusing on its volatility reduction capability, decision-driving features, and estimation characteristics.
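
The "analytic solution of GMVP" mentioned above is the standard closed form w = Σ⁻¹1 / (1ᵀΣ⁻¹1). The sketch below computes it and then scores an estimated covariance by the variance its portfolio realizes under the true covariance. Treating realized variance as the decision loss is an assumption made for illustration; the paper's exact loss and gradient derivation are not reproduced here, and the function names are hypothetical.

```python
import numpy as np

def gmvp_weights(cov):
    """Closed-form global minimum-variance weights: inv(cov) @ 1, normalized to sum to 1."""
    ones = np.ones(cov.shape[0])
    inv_ones = np.linalg.solve(cov, ones)  # avoids forming the explicit inverse
    return inv_ones / (ones @ inv_ones)

def realized_variance(est_cov, true_cov):
    """Assumed decision loss: variance of the portfolio built from est_cov under true_cov."""
    w = gmvp_weights(est_cov)
    return float(w @ true_cov @ w)

# Toy example with two uncorrelated assets (variances 1 and 4).
true_cov = np.diag([1.0, 4.0])
weights = gmvp_weights(true_cov)                      # favors the low-variance asset
best_loss = realized_variance(true_cov, true_cov)     # loss with a perfect estimate
naive_loss = realized_variance(np.eye(2), true_cov)   # loss with a poor estimate
```

Two estimates with similar mean-squared error can produce different realized variances, which is the gap decision-focused learning targets.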

--------------------------------------------------------------------------------------------------------

Video-BLADE: Block-Sparse Attention Meets Step Distillation for Efficient Video Generation

High-quality video generation using diffusion transformers faces significant computational bottlenecks from slow iterative denoising and quadratic attention costs, limiting practical deployment. BLADE introduces a data-free joint training framework that combines adaptive block-sparse attention with sparsity-aware step distillation. The approach delivers a 14.10x end-to-end speedup on Wan2.1-1.3B and an 8.89x speedup on CogVideoX-5B while improving VBench-2.0 scores on both models. Applications include real-time video content creation, interactive media production, personalized video generation for social media, educational content development, and gaming. This work could broaden access to high-quality video generation by lowering its computational cost, enabling new creative applications and business models in entertainment, marketing, education, and communication industries.

Authors:  Youping Gu, Xiaolong Li, Yuhao Hu, Bohan Zhuang

Link:  https://arxiv.org/abs/2508.10774v1

Date: 2025-08-d

Summary:

Diffusion transformers currently lead the field in high-quality video generation, but their slow iterative denoising process and prohibitive quadratic attention costs for long sequences create significant inference bottlenecks. While both step distillation and sparse attention mechanisms have shown promise as independent acceleration strategies, effectively combining these approaches presents critical challenges -- training-free integration yields suboptimal results, while separately training sparse attention after step distillation requires prohibitively expensive high-quality video data. To overcome these limitations, we propose BLADE, an innovative data-free joint training framework that introduces: (1) an Adaptive Block-Sparse Attention (ASA) mechanism for dynamically generating content-aware sparsity masks to focus computation on salient spatiotemporal features, and (2) a sparsity-aware step distillation paradigm built upon Trajectory Distribution Matching (TDM) that directly incorporates sparsity into the distillation process rather than treating it as a separate compression step, with fast convergence. We validate BLADE on text-to-video models like CogVideoX-5B and Wan2.1-1.3B. Our framework demonstrates remarkable efficiency gains across different scales. On Wan2.1-1.3B, BLADE achieves a 14.10x end-to-end inference acceleration over a 50-step baseline. Moreover, on models such as CogVideoX-5B with short video sequence lengths, our framework delivers a robust 8.89x speedup. Crucially, the acceleration is accompanied by a consistent quality improvement. On the VBench-2.0 benchmark, BLADE boosts the score of CogVideoX-5B to 0.569 (from 0.534) and Wan2.1-1.3B to 0.570 (from 0.563), results that are further corroborated by superior ratings in human evaluations. Our code and model weights are publicly available at: http://ziplab.co/BLADE-Homepage/.
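
To give a feel for block-sparse attention in general, here is a small NumPy sketch in which each query block attends only to its top-scoring key blocks. This is not BLADE's ASA mechanism: the mean-pooled block scoring, the fixed `keep_blocks` budget, and the function name are assumptions for illustration. The sketch also computes dense scores before masking for readability, so it shows the sparsity pattern without the actual speedup.

```python
import numpy as np

def block_sparse_attention(q, k, v, block_size, keep_blocks):
    """Single-head attention where each query block sees only its top key blocks."""
    n, d = q.shape
    assert n % block_size == 0, "sequence length must be a multiple of block_size"
    n_blocks = n // block_size
    # Cheap block-level relevance estimate from mean-pooled queries and keys.
    q_pooled = q.reshape(n_blocks, block_size, d).mean(axis=1)
    k_pooled = k.reshape(n_blocks, block_size, d).mean(axis=1)
    block_scores = q_pooled @ k_pooled.T
    # Keep the highest-scoring key blocks for every query block.
    top = np.argsort(-block_scores, axis=1)[:, :keep_blocks]
    block_mask = np.zeros((n_blocks, n_blocks), dtype=bool)
    np.put_along_axis(block_mask, top, True, axis=1)
    # Expand the block mask to token resolution and apply a masked softmax.
    mask = np.repeat(np.repeat(block_mask, block_size, axis=0), block_size, axis=1)
    scores = np.where(mask, (q @ k.T) / np.sqrt(d), -np.inf)
    scores -= scores.max(axis=1, keepdims=True)
    weights = np.exp(scores)
    weights /= weights.sum(axis=1, keepdims=True)
    return weights @ v, block_mask

# Usage: 8 tokens in blocks of 2, each query block keeping 2 of 4 key blocks.
rng = np.random.default_rng(0)
q, k, v = (rng.normal(size=(8, 4)) for _ in range(3))
out, block_mask = block_sparse_attention(q, k, v, block_size=2, keep_blocks=2)
```

A production kernel would skip the masked blocks entirely rather than compute and discard them, which is where the quadratic cost is actually avoided.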

--------------------------------------------------------------------------------------------------------

FROGENT: An End-to-End Full-process Drug Design Agent

Drug discovery traditionally involves fragmented tools across isolated platforms, forcing scientists to navigate incompatible interfaces and manage complex workflows manually. FROGENT addresses this by creating a unified AI agent that integrates biochemical databases, tool libraries, and specialized models under a Large Language Model framework. The system handles complete drug discovery pipelines from target identification through molecular design and synthesis planning; in the reported benchmarks it triples the best baseline performance in hit-finding and doubles it in interaction profiling. Applications include pharmaceutical research acceleration, academic drug discovery, personalized medicine development, and rare disease therapeutic exploration. This platform could democratize drug discovery by making sophisticated computational tools accessible to smaller research teams, potentially accelerating the development of new treatments while reducing costs and complexity.

Authors:  Qihua Pan, Dong Xu, Jenna Xinyi Yao, Lijia Ma, Zexuan Zhu, Junkai Ji

Link:  https://arxiv.org/abs/2508.10760v1

Date: 2025-08-d

Summary:

Powerful AI tools for drug discovery reside in isolated web apps, desktop programs, and code libraries. Such fragmentation forces scientists to manage incompatible interfaces and specialized scripts, which can be a cumbersome and repetitive process. To address this issue, a Full-pROcess druG dEsign ageNT, named FROGENT, has been proposed. Specifically, FROGENT utilizes a Large Language Model and the Model Context Protocol to integrate multiple dynamic biochemical databases, extensible tool libraries, and task-specific AI models. This agentic framework allows FROGENT to execute complicated drug discovery workflows dynamically, including component tasks such as target identification, molecule generation and retrosynthetic planning. FROGENT has been evaluated on eight benchmarks that cover various aspects of drug discovery, such as knowledge retrieval, property prediction, virtual screening, mechanistic analysis, molecular design, and synthesis. It was compared against six increasingly advanced ReAct-style agents that support code execution and literature searches. Empirical results demonstrated that FROGENT triples the best baseline performance in hit-finding and doubles it in interaction profiling, significantly outperforming both the open-source model Qwen3-32B and the commercial model GPT-4o. In addition, real-world cases have been utilized to validate the practicability and generalization of FROGENT. This development suggests that streamlining the agentic drug discovery pipeline can significantly enhance researcher productivity.

--------------------------------------------------------------------------------------------------------

Retro-Expert: Collaborative Reasoning for Interpretable Retrosynthesis

Chemical synthesis planning requires expert knowledge to determine how complex molecules can be constructed from simpler starting materials, a process called retrosynthesis. Existing computational approaches operate as black boxes, making their reasoning opaque to chemists who need to understand and trust the proposed synthetic routes. Retro-Expert combines Large Language Models with specialized chemical models through reinforcement learning to provide both accurate predictions and natural language explanations grounded in chemical logic. Applications include pharmaceutical synthesis planning, organic chemistry education, chemical manufacturing optimization, and drug development acceleration. This interpretable approach could bridge the gap between AI predictions and actionable chemical insights, making computational retrosynthesis tools more trustworthy and useful for practicing chemists.

Authors:  Xinyi Li, Sai Wang, Yutian Lin, Yu Wu, Yi Yang

Link:  https://arxiv.org/abs/2508.10967v1

Date: 2025-08-d

Summary:

Retrosynthesis prediction aims to infer the reactant molecules from a given product molecule, which is a fundamental task in chemical synthesis. However, existing models rely on a static pattern-matching paradigm, which limits their ability to make effective logical decisions and leads to black-box decision-making. To address this, we propose Retro-Expert, an interpretable retrosynthesis framework that performs collaborative reasoning by combining the complementary reasoning strengths of Large Language Models and specialized models via reinforcement learning. It outputs natural language explanations grounded in chemical logic through three components: (1) specialized models perform shallow reasoning to construct a high-quality chemical decision space, (2) an LLM performs critical reasoning to generate predictions and a corresponding interpretable reasoning path, and (3) reinforcement learning optimizes the interpretable decision policy. Experiments show that Retro-Expert not only surpasses both LLM-based and specialized models across different metrics but also provides expert-aligned explanations that bridge the gap between AI predictions and actionable chemical insights.

--------------------------------------------------------------------------------------------------------

