Week Ending 11.16.2025

 

RESEARCH WATCH: 11.16.2025

 

D-GAP: Improving Out-of-Domain Robustness via Dataset-Agnostic and Gradient-Guided Augmentation in Amplitude and Pixel Spaces

Computer vision models often fail when deployed in real-world settings where images differ from training data due to varying backgrounds, styles, or acquisition devices. D-GAP addresses this challenge by introducing a novel augmentation technique that operates in both frequency and pixel spaces. By using gradient-guided sensitivity maps to identify which frequency components the model relies on, D-GAP adaptively interpolates between source and target samples while preserving fine spatial details. This approach could significantly improve medical imaging systems, autonomous vehicles, and surveillance technologies that must perform reliably across diverse environments without requiring domain-specific expertise or extensive retraining.

Authors:  Ruoqi Wang, Haitao Wang, Shaojie Guo, Qiong Luo

Link:  https://arxiv.org/abs/2511.11286v1

Date: 2025-11-d

Summary:

Out-of-domain (OOD) robustness is challenging to achieve in real-world computer vision applications, where shifts in image background, style, and acquisition instruments always degrade model performance. Generic augmentations show inconsistent gains under such shifts, whereas dataset-specific augmentations require expert knowledge and prior analysis. Moreover, prior studies show that neural networks adapt poorly to domain shifts because they exhibit a learning bias to domain-specific frequency components. Perturbing frequency values can mitigate such bias but overlooks pixel-level details, leading to suboptimal performance. To address these problems, we propose D-GAP (Dataset-agnostic and Gradient-guided augmentation in Amplitude and Pixel spaces), improving OOD robustness by introducing targeted augmentation in both the amplitude space (frequency space) and pixel space. Unlike conventional handcrafted augmentations, D-GAP computes sensitivity maps in the frequency space from task gradients, which reflect how strongly the model responds to different frequency components, and uses the maps to adaptively interpolate amplitudes between source and target samples. This way, D-GAP reduces the learning bias in frequency space, while a complementary pixel-space blending procedure restores fine spatial details. Extensive experiments on four real-world datasets and three domain-adaptation benchmarks show that D-GAP consistently outperforms both generic and dataset-specific augmentations, improving average OOD performance by +5.3% on real-world datasets and +1.8% on benchmark datasets.
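
For readers who want to prototype the idea, here is a minimal PyTorch sketch of gradient-guided amplitude interpolation followed by a pixel-space blend. It is an illustration under assumed interfaces (a classifier `model`, a single image tensor of shape (C, H, W), a class-index `label`, and fixed blend weights), not the authors' implementation.

```python
import torch
import torch.nn.functional as F

def amplitude_mix(source, target, model, label, alpha_max=0.5):
    # Sensitivity map: gradient of the task loss w.r.t. the source image,
    # moved into frequency space (how strongly the model reacts per frequency).
    src = source.clone().requires_grad_(True)
    loss = F.cross_entropy(model(src.unsqueeze(0)), label.unsqueeze(0))
    grad = torch.autograd.grad(loss, src)[0]
    sens = torch.fft.fft2(grad).abs()
    sens = sens / (sens.max() + 1e-8)                      # normalise to [0, 1]

    # Interpolate amplitudes more strongly where the model is frequency-sensitive;
    # keep the source phase, which carries most of the spatial structure.
    fft_src, fft_tgt = torch.fft.fft2(source), torch.fft.fft2(target)
    amp = (1 - alpha_max * sens) * fft_src.abs() + alpha_max * sens * fft_tgt.abs()
    mixed = torch.fft.ifft2(torch.polar(amp, fft_src.angle())).real

    # Complementary pixel-space blend to restore fine spatial detail.
    return 0.7 * mixed + 0.3 * source
```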

--------------------------------------------------------------------------------------------------------

Key Decision-Makers in Multi-Agent Debates: Who Holds the Power?

Multi-agent debate systems using large language models show promise for enhanced reasoning, but the strategic placement of different viewpoints significantly impacts performance. This research discovers that positioning the "truth" perspective last in debates can improve reasoning accuracy by up to 22%. The proposed MADC strategy addresses practical scenarios where truth is unknown by simulating role consistency and selecting the most agreed-upon perspective. These findings could revolutionize collaborative AI systems in legal analysis, scientific peer review, business strategy development, and policy-making, where multiple perspectives must be synthesized to reach optimal decisions while overcoming inherent performance bottlenecks.

Authors:  Qian Zhang, Yan Zheng, Jinyi Liu, Hebin Liang, Lanjun Wang

Link:  https://arxiv.org/abs/2511.11040v1

Date: 2025-11-d

Summary:

Recent studies on LLM agent scaling have highlighted the potential of Multi-Agent Debate (MAD) to enhance reasoning abilities. However, the critical aspect of role allocation strategies remains underexplored. In this study, we demonstrate that allocating roles with differing viewpoints to specific positions significantly impacts MAD's performance in reasoning tasks. Specifically, we find a novel role allocation strategy, "Truth Last", which can improve MAD performance by up to 22% in reasoning tasks. To address the issue of unknown truth in practical applications, we propose the Multi-Agent Debate Consistency (MADC) strategy, which systematically simulates and optimizes its core mechanisms. MADC incorporates path consistency to assess agreement among independent roles, simulating the role with the highest consistency score as the truth. We validated MADC across a range of LLMs (9 models), including the DeepSeek-R1 Distilled Models, on challenging reasoning tasks. MADC consistently demonstrated advanced performance, effectively overcoming MAD's performance bottlenecks and providing a crucial pathway for further improvements in LLM agent scaling.
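
The consistency mechanism can be illustrated compactly. The sketch below scores each debate role by how often its answers match the per-run majority across independent runs and selects the most consistent role as the simulated "truth" to place last; the role names, data layout, and scoring rule are illustrative stand-ins, not the paper's exact MADC procedure.

```python
from collections import Counter

def select_truth_role(role_answers: dict[str, list[str]]) -> str:
    # Majority answer for each independent run (one answer per role per run).
    majority = [Counter(run).most_common(1)[0][0]
                for run in zip(*role_answers.values())]
    # Consistency score: how often each role agrees with the per-run majority.
    scores = {role: sum(a == m for a, m in zip(answers, majority)) / len(majority)
              for role, answers in role_answers.items()}
    return max(scores, key=scores.get)   # candidate to speak last in the debate

print(select_truth_role({"optimist": ["A", "B", "A"],
                         "skeptic":  ["A", "A", "A"],
                         "analyst":  ["A", "A", "B"]}))    # -> "skeptic"
```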

--------------------------------------------------------------------------------------------------------

CrossMed: A Multimodal Cross-Task Benchmark for Compositional Generalization in Medical Imaging

Medical AI systems must handle diverse imaging modalities, anatomical regions, and tasks, yet their ability to generalize across unseen combinations remains poorly understood. CrossMed introduces a structured benchmark evaluating compositional generalization through 20,200 visual question-answering instances spanning X-rays, MRIs, and CT scans. The benchmark reveals significant performance drops when models encounter novel modality-anatomy-task combinations, particularly in zero-overlap settings. Interestingly, cross-task transfer emerges, where classification training improves segmentation performance. This work provides crucial insights for developing versatile medical AI assistants capable of supporting radiologists across multiple specialties, imaging technologies, and diagnostic tasks without requiring extensive retraining for each specific combination.

Authors:  Pooja Singh, Siddhant Ujjain, Tapan Kumar Gandhi, Sandeep Kumar

Link:  https://arxiv.org/abs/2511.11034v1

Date: 2025-11-d

Summary:

Recent advances in multimodal large language models have enabled unified processing of visual and textual inputs, offering promising applications in general-purpose medical AI. However, their ability to generalize compositionally across unseen combinations of imaging modality, anatomy, and task type remains underexplored. We introduce CrossMed, a benchmark designed to evaluate compositional generalization (CG) in medical multimodal LLMs using a structured Modality-Anatomy-Task (MAT) schema. CrossMed reformulates four public datasets, CheXpert (X-ray classification), SIIM-ACR (X-ray segmentation), BraTS 2020 (MRI classification and segmentation), and MosMedData (CT classification) into a unified visual question answering (VQA) format, resulting in 20,200 multiple-choice QA instances. We evaluate two open-source multimodal LLMs, LLaVA-Vicuna-7B and Qwen2-VL-7B, on both Related and Unrelated MAT splits, as well as a zero-overlap setting where test triplets share no Modality, Anatomy, or Task with the training data. Models trained on Related splits achieve 83.2 percent classification accuracy and 0.75 segmentation cIoU, while performance drops significantly under Unrelated and zero-overlap conditions, demonstrating the benchmark difficulty. We also show cross-task transfer, where segmentation performance improves by 7 percent cIoU even when trained using classification-only data. Traditional models (ResNet-50 and U-Net) show modest gains, confirming the broad utility of the MAT framework, while multimodal LLMs uniquely excel at compositional generalization. CrossMed provides a rigorous testbed for evaluating zero-shot, cross-task, and modality-agnostic generalization in medical vision-language models.
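
To make the zero-overlap condition concrete, the sketch below checks whether a test (Modality, Anatomy, Task) triplet shares any component with the training triplets; the triplets shown are hypothetical examples, not the benchmark's actual partition.

```python
# Illustrative MAT triplets; CrossMed's real splits are defined over its four datasets.
train_triplets = {("X-ray", "chest", "classification"),
                  ("MRI", "brain", "segmentation")}

def is_zero_overlap(test_triplet, train=train_triplets):
    """True if the test triplet shares no modality, anatomy, or task with training."""
    modalities, anatomies, tasks = map(set, zip(*train))
    m, a, t = test_triplet
    return m not in modalities and a not in anatomies and t not in tasks

print(is_zero_overlap(("CT", "lung", "detection")))        # True: fully unseen combination
print(is_zero_overlap(("CT", "chest", "classification")))  # False: anatomy and task overlap
```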

--------------------------------------------------------------------------------------------------------

AirCopBench: A Benchmark for Multi-drone Collaborative Embodied Perception and Reasoning

While single-agent vision models excel in controlled environments, real-world applications increasingly rely on multi-drone systems offering enhanced coverage and redundancy. AirCopBench introduces the first comprehensive benchmark evaluating multimodal language models in collaborative aerial perception under degraded conditions like poor weather or lighting. With 14,600+ questions spanning scene understanding, object detection, perception assessment, and collaborative decision-making, this benchmark reveals a 24.38% performance gap between leading models and human performance. Applications include search-and-rescue operations, environmental monitoring, infrastructure inspection, and disaster response, where multiple drones must coordinate effectively despite challenging perceptual conditions to accomplish complex missions requiring spatial reasoning and collaboration.

Authors:  Jirong Zha, Yuxuan Fan, Tianyu Zhang, Geng Chen, Yingfeng Chen, Chen Gao, Xinlei Chen

Link:  https://arxiv.org/abs/2511.11025v1

Date: 2025-11-d

Summary:

Multimodal Large Language Models (MLLMs) have shown promise in single-agent vision tasks, yet benchmarks for evaluating multi-agent collaborative perception remain scarce. This gap is critical, as multi-drone systems provide enhanced coverage, robustness, and collaboration compared to single-sensor setups. Existing multi-image benchmarks mainly target basic perception tasks using high-quality single-agent images, thus failing to evaluate MLLMs in more complex, egocentric collaborative scenarios, especially under real-world degraded perception conditions. To address these challenges, we introduce AirCopBench, the first comprehensive benchmark designed to evaluate MLLMs in embodied aerial collaborative perception under challenging perceptual conditions. AirCopBench includes 14.6k+ questions derived from both simulator and real-world data, spanning four key task dimensions: Scene Understanding, Object Understanding, Perception Assessment, and Collaborative Decision, across 14 task types. We construct the benchmark using data from challenging degraded-perception scenarios with annotated collaborative events, generating large-scale questions through model-, rule-, and human-based methods under rigorous quality control. Evaluations on 40 MLLMs show significant performance gaps in collaborative perception tasks, with the best model trailing humans by 24.38% on average and exhibiting inconsistent results across tasks. Fine-tuning experiments further confirm the feasibility of sim-to-real transfer in aerial collaborative perception and reasoning.


--------------------------------------------------------------------------------------------------------

A Multifaceted Analysis of Negative Bias in Large Language Models through the Lens of Parametric Knowledge

Large language models exhibit systematic negative bias, excessively favoring "no" responses in binary decision tasks regardless of correct answers. This research reveals that models display format-level bias, responding more to prompt structure than semantic content, particularly when lacking sufficient knowledge. Through systematic categorization based on parametric knowledge—correct, incorrect, and insufficient—the study identifies shortcut behaviors where models default to negative responses when uncertain. These findings have critical implications for medical diagnosis systems, content moderation, hiring decisions, and financial risk assessment, where biased responses could lead to systematic errors. Understanding these patterns enables developing mitigation strategies through improved prompting and training approaches.

Authors:  Jongyoon Song, Sangwon Yu, Sungroh Yoon

Link:  https://arxiv.org/abs/2511.10881v1

Date: 2025-11-d

Summary:

Negative bias refers to the tendency of large language models (LLMs) to excessively generate negative responses in binary decision tasks (e.g., yes-no question answering). Previous research has focused on detecting and addressing negative attention heads that induce negative bias. However, the underlying detailed factors influencing negative bias remain underexplored. In this paper, we demonstrate that LLMs exhibit format-level negative bias, meaning the prompt format influences their responses more than the semantics of the negative response does. For a fine-grained study of negative bias, we introduce a pipeline for constructing an evaluation set, which systematically categorizes the dataset into three subsets based on the model's parametric knowledge: correct, incorrect, and insufficient relevant knowledge. Through analysis of this evaluation set, we identify a shortcut behavior in which models tend to generate negative responses when they lack sufficient knowledge to answer a yes-no question, leading to negative bias. We further examine how negative bias changes under various prompting scenarios related to parametric knowledge. We observe that providing relevant context and offering an "I don't know" option generally reduces negative bias, whereas chain-of-thought prompting tends to amplify the bias. Finally, we demonstrate that the degree of negative bias can vary depending on the type of prompt, which influences the direction of the response. Our work reveals the various factors that influence negative bias, providing critical insights for mitigating it in LLMs.
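
The two measurements in the summary are easy to sketch. The snippet below shows a knowledge-based three-way split and a simple negative-bias rate; `ask_model`, the probing prompt, and the matching rule are hypothetical placeholders rather than the paper's pipeline.

```python
def categorize_by_knowledge(question: str, gold: str, ask_model) -> str:
    # Probe parametric knowledge outside the yes-no format.
    probe = ask_model(f"Answer in one short phrase, or say 'unknown': {question}")
    if "unknown" in probe.lower():
        return "insufficient"
    return "correct" if gold.lower() in probe.lower() else "incorrect"

def negative_bias_rate(predictions: list[str], golds: list[str]) -> float:
    """Fraction of 'no' predictions among questions whose gold answer is 'yes'."""
    yes_items = [p for p, g in zip(predictions, golds) if g == "yes"]
    return sum(p == "no" for p in yes_items) / max(len(yes_items), 1)
```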

--------------------------------------------------------------------------------------------------------

Reinforcing Stereotypes of Anger: Emotion AI on African American Vernacular English

Emotion detection AI systems are deployed in high-stakes contexts like mental health assessment and hiring, yet they often reflect dominant cultural norms while excluding linguistic diversity. This study examines how leading models perform on African American Vernacular English compared to General American English, analyzing 2.7 million geo-tagged tweets. Results reveal disturbing patterns: models exhibit false positive anger detection rates on AAVE exceeding double those on GAE, with one popular model reaching 60% false positives. These biases correlate with neighborhood demographics, potentially reinforcing harmful racial stereotypes. The findings demand culturally-informed emotion recognition systems to prevent systematic discrimination in mental health screening, workplace monitoring, and social services.

Authors:  Rebecca Dorn, Christina Chance, Casandra Rusti, Charles Bickham, Kai-Wei Chang, Fred Morstatter, Kristina Lerman

Link:  https://arxiv.org/abs/2511.10846v1

Date: 2025-11-d

Summary:

Automated emotion detection is widely used in applications ranging from well-being monitoring to high-stakes domains like mental health and hiring. However, models often rely on annotations that reflect dominant cultural norms, limiting their ability to recognize emotional expression in dialects often excluded from training data distributions, such as African American Vernacular English (AAVE). This study examines emotion recognition model performance on AAVE compared to General American English (GAE). We analyze 2.7 million tweets geo-tagged within Los Angeles. Texts are scored for strength of AAVE using computational approximations of dialect features. Annotations of emotion presence and intensity are collected on a dataset of 875 tweets with both high and low AAVE densities. To assess model accuracy on a task as subjective as emotion perception, we calculate community-informed "silver" labels where AAVE-dense tweets are labeled by African American, AAVE-fluent (ingroup) annotators. On our labeled sample, GPT and BERT-based models exhibit false positive prediction rates for anger on AAVE more than double those on GAE. SpanEmo, a popular text-based emotion model, increases its false positive rate for anger from 25 percent on GAE to 60 percent on AAVE. Additionally, a series of linear regressions reveals that models and non-ingroup annotations are significantly more correlated with profanity-based AAVE features than ingroup annotations. Linking Census tract demographics, we observe that neighborhoods with higher proportions of African American residents are associated with higher predictions of anger (Pearson's correlation r = 0.27) and lower joy (r = -0.10). These results reveal an emergent safety issue of emotion AI reinforcing racial stereotypes through biased emotion classification. We emphasize the need for culturally and dialect-informed affective computing systems.
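
As a rough guide to how such bias figures are computed, the sketch below derives a false positive rate for anger against ingroup "silver" labels and a tract-level correlation; the dataframe columns are assumptions about the data layout, not the study's actual schema.

```python
import pandas as pd

def anger_false_positive_rate(df: pd.DataFrame) -> float:
    """Share of predicted-anger among tweets the ingroup annotators labelled not-angry."""
    not_angry = df[df["silver_anger"] == 0]
    return float((not_angry["pred_anger"] == 1).mean())

def tract_anger_correlation(tracts: pd.DataFrame) -> float:
    """Pearson's r between % African American residents and mean predicted anger."""
    return float(tracts["pct_african_american"].corr(tracts["mean_pred_anger"]))
```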

--------------------------------------------------------------------------------------------------------

Optimal Welfare in Noncooperative Network Formation under Attack

Modern communication networks like the Internet lack centralized control, consisting of independently administered entities making decentralized security decisions. This creates complex strategic interactions between network participants seeking connectivity and attackers attempting disruption. Building on game-theoretic foundations, this research demonstrates that networks formed by selfish agents can surprisingly achieve asymptotically optimal welfare even after attacks, resisting large classes of potential attackers. Counter-intuitively, attackers minimizing social welfare don't inflict maximum damage. These insights inform designing resilient peer-to-peer networks, Internet-of-Things systems, blockchain architectures, and critical infrastructure where decentralized security decisions must withstand sophisticated attacks while maintaining service quality.

Authors:  Natan Doubez, Pascal Lenzner, Marcus Wunderlich

Link:  https://arxiv.org/abs/2511.10845v1

Date: 2025-11-d

Summary:

Communication networks are essential for our economy and our everyday lives. This makes them lucrative targets for attacks. Today, we see an ongoing battle between criminals who try to disrupt our key communication networks and security professionals who try to mitigate these attacks. However, today's networks, like the Internet or peer-to-peer networks among smart devices, are not controlled by a single authority, but instead consist of many independently administered entities that are interconnected. Thus, both the decisions of how to interconnect and how to secure against potential attacks are taken in a decentralized way by selfish agents. This strategic setting, with agents that want to interconnect and potential attackers that want to disrupt the network, was captured via an influential game-theoretic model by Goyal, Jabbari, Kearns, Khanna, and Morgenstern (WINE 2016). We revisit this model and show improved tight bounds on the achieved robustness of networks created by selfish agents. As our main result, we show that such networks can resist attacks of a large class of potential attackers, i.e., these networks maintain asymptotically optimal welfare post attack. This improves several bounds and resolves an open problem. Along the way, we show the counter-intuitive result that attackers that aim at minimizing the social welfare post attack do not actually inflict the greatest possible damage.

--------------------------------------------------------------------------------------------------------

Sabiá: A Generative Artificial Intelligence Chatbot for Day-to-Day Support in Higher Education

Students frequently struggle accessing routine academic information scattered across institutional documents and websites, creating confusion about administrative procedures, course requirements, and campus resources. Sabiá addresses this challenge using Generative AI and Retrieval-Augmented Generation to consolidate and simplify information access. After testing multiple models, Gemini 2.0 Flash emerged optimal for quality and speed, while Gemma demonstrated strong open-source performance. This solution could transform student services by providing 24/7 instant access to enrollment procedures, financial aid information, course prerequisites, and campus policies. Similar systems could benefit corporate training, government services, and healthcare navigation, reducing administrative burden while improving user experience.

Authors:  Guilherme Biava Rodrigues, Franciele Beal, Marlon Marcon, Alinne Cristinne Corrêa Souza, André Roberto Ortoncelli, Francisco Carlos Monteiro Souza, Rodolfo Adamshuk Silva

Link:  https://arxiv.org/abs/2511.10787v1

Date: 2025-11-d

Summary:

Students often report difficulties in accessing day-to-day academic information, which is usually spread across numerous institutional documents and websites. This fragmentation results in a lack of clarity and causes confusion about routine university information. This project proposes the development of a chatbot using Generative Artificial Intelligence (GenAI) and Retrieval-Augmented Generation (RAG) to simplify access to such information. Several GenAI models were tested and evaluated based on quality metrics and the LLM-as-a-Judge approach. Among them, Gemini 2.0 Flash stood out for its quality and speed, and Gemma 3n for its good performance and open-source nature.
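
The retrieval-augmented pattern behind Sabiá can be sketched in a few lines. The snippet below is a generic RAG loop, with `embed` and `generate` standing in for whatever embedding model and GenAI backend (e.g. Gemini 2.0 Flash or Gemma 3n) a deployment uses; it is an assumption-laden outline, not the project's actual stack.

```python
import numpy as np

def retrieve(query, docs, embed, k=3):
    q = embed(query)
    def cos(v):
        return float(np.dot(q, v) / (np.linalg.norm(q) * np.linalg.norm(v) + 1e-8))
    # Rank institutional passages by cosine similarity to the query and keep the top k.
    return sorted(docs, key=lambda d: cos(embed(d)), reverse=True)[:k]

def answer(query, docs, embed, generate):
    context = "\n\n".join(retrieve(query, docs, embed))
    prompt = f"Answer using only this institutional context:\n{context}\n\nQuestion: {query}"
    return generate(prompt)
```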

--------------------------------------------------------------------------------------------------------

Right Looks, Wrong Reasons: Compositional Fidelity in Text-to-Image Generation

Leading text-to-image models produce visually impressive results yet fundamentally fail at logical composition—struggling with negation, counting, and spatial relationships. While models handle individual primitives adequately, combining them causes dramatic performance collapse. The research identifies three root causes: training data lacks explicit negations, continuous attention architectures cannot handle discrete logic, and evaluation metrics prioritize visual plausibility over constraint satisfaction. These limitations impact creative industries, accessibility tools generating described scenes, educational materials, and synthetic data generation for AI training. The findings suggest that achieving genuine compositional understanding requires fundamental architectural advances rather than incremental improvements or simple scaling solutions.

Authors:  Mayank Vatsa, Aparna Bharati, Richa Singh

Link:  https://arxiv.org/abs/2511.10136v1

Date: 2025-11-d

Summary:

The architectural blueprint of today's leading text-to-image models contains a fundamental flaw: an inability to handle logical composition. This survey investigates this breakdown across three core primitives: negation, counting, and spatial relations. Our analysis reveals a dramatic performance collapse: models that are accurate on single primitives fail precipitously when these are combined, exposing severe interference. We trace this failure to three key factors. First, training data show a near-total absence of explicit negations. Second, continuous attention architectures are fundamentally unsuitable for discrete logic. Third, evaluation metrics reward visual plausibility over constraint satisfaction. By analyzing recent benchmarks and methods, we show that current solutions and simple scaling cannot bridge this gap. Achieving genuine compositionality, we conclude, will require fundamental advances in representation and reasoning rather than incremental adjustments to existing architectures.

--------------------------------------------------------------------------------------------------------

Do Language Models Associate Sound with Meaning? A Multimodal Study of Sound Symbolism

Sound symbolism explores non-arbitrary relationships between phonetic forms and meanings, providing a unique lens for understanding how multimodal language models process auditory information. This research introduces LEX-ICON, featuring 8,052 natural and 2,930 constructed words across four languages, annotated with semantic features like sharpness or roundness. Analysis reveals models demonstrate phonetic intuitions aligning with linguistic research and exhibit attention patterns focusing on iconic phonemes. These findings bridge AI and cognitive linguistics, with applications in speech synthesis making artificial voices more naturally expressive, language learning tools helping students understand pronunciation-meaning connections, brand naming, and developing more human-aligned conversational AI systems.

Authors:  Jinhong Jeong, Sunghyun Lee, Jaeyoung Lee, Seonah Han, Youngjae Yu

Link:  https://arxiv.org/abs/2511.10045v1

Date: 2025-11-d

Summary:

Sound symbolism is a linguistic concept that refers to non-arbitrary associations between phonetic forms and their meanings. We suggest that this can be a compelling probe into how Multimodal Large Language Models (MLLMs) interpret auditory information in human languages. We investigate MLLMs' performance on phonetic iconicity across textual (orthographic and IPA) and auditory forms of inputs with up to 25 semantic dimensions (e.g., sharp vs. round), observing models' layer-wise information processing by measuring phoneme-level attention fraction scores. To this end, we present LEX-ICON, an extensive mimetic word dataset consisting of 8,052 words from four natural languages (English, French, Japanese, and Korean) and 2,930 systematically constructed pseudo-words, annotated with semantic features applied across both text and audio modalities. Our key findings demonstrate (1) MLLMs' phonetic intuitions that align with existing linguistic research across multiple semantic dimensions and (2) phonosemantic attention patterns that highlight models' focus on iconic phonemes. These results bridge domains of artificial intelligence and cognitive linguistics, providing the first large-scale, quantitative analyses of phonetic iconicity in terms of MLLMs' interpretability.
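
The attention-fraction measurement mentioned above has a simple form: the share of a layer's attention mass that lands on tokens belonging to a target phoneme. The sketch below assumes the phoneme-to-token alignment is already given; shapes and names are illustrative, not the paper's exact metric definition.

```python
import numpy as np

def attention_fraction(attn: np.ndarray, phoneme_token_ids: list[int]) -> float:
    """attn: (num_heads, seq_len, seq_len) attention weights for one layer."""
    per_token = attn.sum(axis=(0, 1))          # total attention each key token receives
    return float(per_token[phoneme_token_ids].sum() / per_token.sum())
```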

--------------------------------------------------------------------------------------------------------

Moral Change or Noise? On Problems of Aligning AI With Temporally Unstable Human Feedback

AI alignment methods assume moral preferences are static targets, but human moral reasoning evolves over time. This poses fundamental challenges for AI systems in high-stakes healthcare and policy domains where misalignment causes serious harm. Studying kidney allocation decisions across multiple sessions with 400+ participants, researchers found 6-20% response instability when participants faced identical scenarios at different times. Predictive model performance degraded temporally, with both response-level changes and deeper shifts in decision-making frameworks. These findings raise critical questions: should AI systems track legitimate moral evolution while filtering arbitrary fluctuations? Applications include transplant allocation, parole decisions, and treatment recommendations where stable, trustworthy AI alignment proves essential.

Authors:  Vijay Keswani, Cyrus Cousins, Breanna Nguyen, Vincent Conitzer, Hoda Heidari, Jana Schaich Borg, Walter Sinnott-Armstrong

Link:  https://arxiv.org/abs/2511.10032v1

Date: 2025-11-d

Summary:

Alignment methods in moral domains seek to elicit moral preferences of human stakeholders and incorporate them into AI. This presupposes moral preferences as static targets, but such preferences often evolve over time. Proper alignment of AI to dynamic human preferences should ideally account for "legitimate" changes to moral reasoning, while ignoring changes related to attention deficits, cognitive biases, or other arbitrary factors. However, common AI alignment approaches largely neglect temporal changes in preferences, posing serious challenges to proper alignment, especially in high-stakes applications of AI, e.g., in healthcare domains, where misalignment can jeopardize the trustworthiness of the system and yield serious individual and societal harms. This work investigates the extent to which people's moral preferences change over time, and the impact of such changes on AI alignment. Our study is grounded in the kidney allocation domain, where we elicit responses to pairwise comparisons of hypothetical kidney transplant patients from over 400 participants across 3-5 sessions. We find that, on average, participants change their response to the same scenario presented at different times around 6-20% of the time (exhibiting "response instability"). Additionally, we observe significant shifts in several participants' retrofitted decision-making models over time (capturing "model instability"). The predictive performance of simple AI models decreases as a function of both response and model instability. Moreover, predictive performance diminishes over time, highlighting the importance of accounting for temporal changes in preferences during training. These findings raise fundamental normative and technical challenges relevant to AI alignment, highlighting the need to better understand the object of alignment (what to align to) when user preferences change significantly over time.
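
Response instability, as used above, is just the rate at which the same participant flips on repeated scenarios across sessions. A minimal sketch, assuming each session is recorded as a scenario-to-choice mapping:

```python
def response_instability(sessions: list[dict]) -> float:
    """Fraction of repeated scenarios whose choice changes between consecutive sessions."""
    flips = total = 0
    for earlier, later in zip(sessions, sessions[1:]):
        shared = earlier.keys() & later.keys()
        total += len(shared)
        flips += sum(earlier[s] != later[s] for s in shared)
    return flips / total if total else 0.0

print(response_instability([{"p1_vs_p2": "p1", "p3_vs_p4": "p3"},
                            {"p1_vs_p2": "p2", "p3_vs_p4": "p3"}]))   # -> 0.5
```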

--------------------------------------------------------------------------------------------------------

Echoing: Identity Failures when LLM Agents Talk to Each Other

As LLM agents increasingly interact autonomously, new failure modes emerge absent in human-agent conversations. "Echoing" occurs when agents abandon assigned roles and mirror conversational partners, undermining intended objectives. Across 2,000+ conversations, three major providers, and three domains, echoing rates ranged from 5-70%, persisting even in advanced reasoning models at 32.8%. The phenomenon intensifies beyond seven conversational turns and resists simple prompting fixes. Applications affected include automated customer service, multi-agent negotiation systems, collaborative research assistants, and AI-powered therapy. The research introduces protocol-level mitigation using structured responses, reducing echoing to 9%, essential for deploying reliable multi-agent systems in commerce, healthcare, and social services.

Authors:  Sarath Shekkizhar, Romain Cosentino, Adam Earle, Silvio Savarese

Link:  https://arxiv.org/abs/2511.09710v1

Date: 2025-11-d

Summary:

As large language model (LLM) based agents interact autonomously with one another, a new class of failures emerges that cannot be predicted from single agent performance: behavioral drifts in agent-agent conversations (AxA). Unlike human-agent interactions, where humans ground and steer conversations, AxA lacks such stabilizing signals, making these failures unique. We investigate one such failure, echoing, where agents abandon their assigned roles and instead mirror their conversational partners, undermining their intended objectives. Through experiments across 60 AxA configurations, 3 domains, and 2000+ conversations, we demonstrate that echoing occurs across three major LLM providers, with echoing rates from 5% to 70% depending on the model and domain. Moreover, we find that echoing is persistent even in advanced reasoning models with substantial rates (32.8%) that are not reduced by increased reasoning efforts. We analyze prompt impacts and conversation dynamics, showing that echoing arises as interaction grows longer (7+ turns in experiments) and is not merely an artifact of sub-optimal prompting. Finally, we introduce a protocol-level mitigation in which targeted use of structured responses reduces echoing to 9%.
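
The protocol-level mitigation can be pictured as forcing each agent to restate its role and position in a machine-checkable format on every turn. The sketch below shows one such structured-turn scheme; the JSON schema, field names, and role-drift check are assumptions in the spirit of the paper, not its exact protocol.

```python
import json

STRUCTURED_TURN_PROMPT = """You are the {role}. Reply ONLY with JSON:
{{"role": "{role}",
  "position": "<your position, stated from your own role's objective>",
  "response_to_partner": "<your reply to the other agent>"}}"""

def parse_turn(raw: str, expected_role: str) -> dict:
    turn = json.loads(raw)
    if turn.get("role") != expected_role:      # crude echoing signal: the agent drifted roles
        raise ValueError("agent abandoned its assigned role")
    return turn

# Usage: send STRUCTURED_TURN_PROMPT.format(role="seller") as the system prompt,
# then validate each reply with parse_turn(reply, "seller").
```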

--------------------------------------------------------------------------------------------------------

Alignment Debt: The Hidden Work of Making AI Usable

Frontier language models optimize for high-resource contexts, creating mismatches with Global South conditions that require users to perform additional work. "Alignment debt" captures this burden across cultural-linguistic, infrastructural, epistemic, and interaction dimensions. A survey of 411 users in Kenya and Nigeria revealed that 51.9% experienced cultural-linguistic debt, 43.1% infrastructural debt, and 33.8% epistemic debt. Users facing epistemic misalignment verified outputs significantly more (91.5% vs. 80.8%), with verification intensity correlating with cumulative debt burden. This framework challenges fairness metrics based solely on model performance, demanding context-aware design for equitable AI deployment in education, healthcare, agriculture, and governance across diverse global contexts where user burden signals system inadequacy.

Authors:  Cumi Oyemike, Elizabeth Akpan, Pierre Hervé-Berdys

Link:  https://arxiv.org/abs/2511.09663v1

Date: 2025-11-d

Summary:

Frontier LLMs are optimised around high-resource assumptions about language, knowledge, devices, and connectivity. Whilst widely accessible, they often misfit conditions in the Global South. As a result, users must often perform additional work to make these systems usable. We term this alignment debt: the user-side burden that arises when AI systems fail to align with cultural, linguistic, infrastructural, or epistemic contexts. We develop and validate a four-part taxonomy of alignment debt through a survey of 411 AI users in Kenya and Nigeria. Among respondents measurable on this taxonomy (n = 385), prevalence is: Cultural and Linguistic (51.9%), Infrastructural (43.1%), Epistemic (33.8%), and Interaction (14.0%). Country comparisons show a divergence in Infrastructural and Interaction debt, challenging one-size-fits-Africa assumptions. Alignment debt is associated with compensatory labour, but responses vary by debt type: users facing Epistemic challenges verify outputs at significantly higher rates (91.5% vs. 80.8%; p = 0.037), and verification intensity correlates with cumulative debt burden (Spearman's rho = 0.147, p = 0.004). In contrast, Infrastructural and Interaction debts show weak or null associations with verification, indicating that some forms of misalignment cannot be resolved through verification alone. These findings show that fairness must be judged not only by model metrics but also by the burden imposed on users at the margins, compelling context-aware safeguards that alleviate alignment debt in Global South settings. The alignment debt framework provides an empirically grounded way to measure user burden, informing both design practice and emerging African AI governance efforts.

--------------------------------------------------------------------------------------------------------

Multimodal Large Language Models for Low-Resource Languages: A Case Study for Basque

While commercial multimodal language models perform acceptably in low-resource languages, open-source alternatives lag significantly behind. This research develops strong MLLMs for Basque by creating specialized training and evaluation datasets. Surprisingly, only 20% Basque multimodal data suffices for solid performance, and Basque-adapted language model backbones aren't necessary for strong results. Using Llama-3.1-Instruct alongside Basque-adapted Latxa, the study explores optimal data mixtures. These findings provide a replicable blueprint for developing MLLMs in other low-resource languages, supporting cultural preservation, education in native languages, accessibility for linguistic minorities, and democratizing AI benefits beyond dominant languages, ensuring technological inclusion for diverse linguistic communities worldwide.

Authors:  Lukas Arana, Julen Etxaniz, Ander Salaberria, Gorka Azkune

Link:  https://arxiv.org/abs/2511.09396v1

Date: 2025-11-d

Summary:

Current Multimodal Large Language Models exhibit very strong performance for several demanding tasks. While commercial MLLMs deliver acceptable performance in low-resource languages, comparable results remain unattained within the open science community. In this paper, we aim to develop a strong MLLM for a low-resource language, namely Basque. For that purpose, we develop our own training and evaluation image-text datasets. Using two different Large Language Models as backbones, the Llama-3.1-Instruct model and a Basque-adapted variant called Latxa, we explore several data mixtures for training. We show that: i) low ratios of Basque multimodal data (around 20%) are already enough to obtain solid results on Basque benchmarks, and ii) contrary to expectations, a Basque-instructed backbone LLM is not required to obtain a strong MLLM in Basque. Our results, together with our openly released resources, pave the way for developing MLLMs for other low-resource languages.

--------------------------------------------------------------------------------------------------------

A multimodal AI agent for clinical decision support in ophthalmology

Existing medical AI systems lack flexibility and interpretability, particularly problematic in ophthalmology requiring diverse imaging modalities. EyeAgent introduces the first agentic framework using large language models as reasoning engines to dynamically orchestrate 53 validated tools across 23 imaging modalities. Diagnostic accuracy improved from 69.71% baseline to 80.79% with full tool integration, achieving 93.7% tool selection accuracy and 88%+ expert ratings across key dimensions. In human-AI collaboration, EyeAgent matched senior ophthalmologists and improved junior ophthalmologists' accuracy by 18.51%. Applications extend beyond ophthalmology to radiology, pathology, cardiology, and dermatology, demonstrating how modular, multimodal AI systems can provide trustworthy, interpretable clinical decision support.

Authors:  Danli Shi, Xiaolan Chen, Bingjie Yan, Weiyi Zhang, Pusheng Xu, Jiancheng Yang, Ruoyu Chen, Siyu Huang, Bowen Liu, Xinyuan Wu, Meng Xie, Ziyu Gao, Yue Wu, Senlin Lin, Kai Jin, Xia Gong, Yih Chung Tham, Xiujuan Zhang, Li Dong, Yuzhou Zhang, Jason Yam, Guangming Jin, Xiaohu Ding, Haidong Zou, Yalin Zheng, Zongyuan Ge, Mingguang He

Link:  https://arxiv.org/abs/2511.09394v1

Date: 2025-11-d

Summary:

Artificial intelligence has shown promise in medical imaging, yet most existing systems lack flexibility, interpretability, and adaptability - challenges especially pronounced in ophthalmology, where diverse imaging modalities are essential. We present EyeAgent, the first agentic AI framework for comprehensive and interpretable clinical decision support in ophthalmology. Using a large language model (DeepSeek-V3) as its central reasoning engine, EyeAgent interprets user queries and dynamically orchestrates 53 validated ophthalmic tools across 23 imaging modalities for diverse tasks including classification, segmentation, detection, image/report generation, and quantitative analysis. Stepwise ablation analysis demonstrated a progressive improvement in diagnostic accuracy, rising from a baseline of 69.71% (using only 5 general tools) to 80.79% when the full suite of 53 specialized tools was integrated. In an expert rating study on 200 real-world clinical cases, EyeAgent achieved 93.7% tool selection accuracy and received expert ratings of more than 88% across accuracy, completeness, safety, reasoning, and interpretability. In human-AI collaboration, EyeAgent matched or exceeded the performance of senior ophthalmologists and, when used as an assistant, improved overall diagnostic accuracy by 18.51% and report quality scores by 19%, with the greatest benefit observed among junior ophthalmologists. These findings establish EyeAgent as a scalable and trustworthy AI framework for ophthalmology and provide a blueprint for modular, multimodal, and clinically aligned next-generation AI systems.
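
The orchestration pattern described here is a plan-then-dispatch loop: the LLM proposes tool calls, a registry executes the ones that exist, and the findings flow back for report generation. The sketch below is a generic outline under assumed interfaces (`llm_plan`, a `tools` registry), far simpler than EyeAgent's 53-tool system.

```python
def run_agent(query, image, tools: dict, llm_plan):
    findings = []
    for step in llm_plan(query, available=list(tools)):    # LLM proposes tool calls
        name, args = step["tool"], step.get("args", {})
        if name not in tools:                              # skip hallucinated tools
            continue
        findings.append({"tool": name, "result": tools[name](image, **args)})
    return findings                                        # fed back to the LLM for the final report
```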

--------------------------------------------------------------------------------------------------------

Unveiling Hidden Threats: Using Fractal Triggers to Boost Stealthiness of Distributed Backdoor Attacks in Federated Learning

Distributed backdoor attacks in federated learning decompose triggers across participants to improve stealthiness but require more poisoned data, increasing detection risk. FTDBA leverages fractal self-similarity to strengthen sub-trigger features, reducing poisoning volume by 37.6% while achieving 92.3% attack success rate. Dynamic angular perturbation adapts across training phases, balancing efficiency and stealth, reducing detection rates by 22.8% and KL divergence by 41.2%. While demonstrating federated learning vulnerabilities requiring defensive innovation, this research paradoxically advances both attack sophistication and security understanding for collaborative medical research, financial modeling, and IoT systems where federated learning protects sensitive data while remaining vulnerable to coordinated manipulation.

Authors:  Jian Wang, Hong Shen, Chan-Tong Lam

Link:  https://arxiv.org/abs/2511.09252v1

Date: 2025-11-d

Summary:

Traditional distributed backdoor attacks (DBA) in federated learning improve stealthiness by decomposing global triggers into sub-triggers, which, however, requires more poisoned data to maintain the attack strength and hence increases the exposure risk. To overcome this defect, this paper proposes a novel method, namely Fractal-Triggered Distributed Backdoor Attack (FTDBA), which leverages the self-similarity of fractals to enhance the feature strength of sub-triggers and hence significantly reduce the required poisoning volume for the same attack strength. To address the detectability of fractal structures in the frequency and gradient domains, we introduce a dynamic angular perturbation mechanism that adaptively adjusts perturbation intensity across the training phases to balance efficiency and stealthiness. Experiments show that FTDBA achieves a 92.3% attack success rate with only 62.4% of the poisoning volume required by traditional DBA methods, while reducing the detection rate by 22.8% and KL divergence by 41.2%. This study presents a low-exposure, high-efficiency paradigm for federated backdoor attacks and expands the application of fractal features in adversarial sample generation.

--------------------------------------------------------------------------------------------------------

SimPath: Mitigating Motion Sickness in In-vehicle Infotainment Systems via Driving Condition Adaptation

Motion sickness significantly impacts passenger comfort and in-vehicle infotainment system usage, particularly during irregular driving conditions. SimPath introduces visual design adaptations synchronized with driving conditions to mitigate motion sickness. Real-vehicle experiments demonstrate meaningful motion sickness reduction, though divided attention prevents direct efficiency improvements. These findings inform designing intelligent, user-friendly infotainment systems for autonomous vehicles where passengers increasingly engage with digital content during travel. Applications include ride-sharing services, luxury vehicles, autonomous shuttles, and long-distance transportation where passenger comfort and productivity become competitive differentiators. The research provides theoretical support for future IVIS development prioritizing passenger wellbeing alongside entertainment and productivity features.

Authors:  Jinghao Huang, Siqi Yao, Yu Zhang

Link:  https://arxiv.org/abs/2511.09240v2

Date: 2025-11-d

Summary:

The problem of Motion Sickness (MS) among passengers significantly impacts the comfort and efficiency of In-Vehicle Infotainment Systems (IVIS) use. In this study, we designed SimPath, a visual design intended to mitigate passengers' MS and boost their efficiency in using IVIS during driving. The study focuses on the problem of irregular motion conditions frequently encountered during actual driving. To validate the efficacy of this approach, two sets of real-vehicle experiments were carried out in real driving scenarios. The results demonstrate that this approach reduces passengers' MS levels to a certain extent. However, due to divided attention from visual content, it does not directly improve IVIS efficiency. In conclusion, this study offers crucial insights for the design of a more intelligent and user-friendly IVIS and, based on a discussion of the underlying principle, provides theoretical support and practical guidance for the development of future IVIS in autonomous vehicles.

--------------------------------------------------------------------------------------------------------

Good-for-MDP State Reduction for Stochastic LTL Planning

Planning in Markov Decision Processes with Linear Temporal Logic goals requires transforming LTL formulas into good-for-MDP automata, but automata size critically affects scalability. This research introduces novel state-space reduction techniques leveraging recent game-theoretic minimization advances, significantly reducing automaton states. Additionally, a direct single-exponential construction for GF formulas improves upon general doubly-exponential complexity. Experiments confirm practical effectiveness and scalability advantages. Applications include robotics mission planning, autonomous vehicle navigation under temporal constraints, smart manufacturing scheduling, drone coordination, and AI planning systems requiring probabilistic reasoning under complex temporal specifications, enabling more efficient synthesis of policies satisfying sophisticated goals in uncertain environments.

Authors:  Christoph Weinhuber, Giuseppe De Giacomo, Yong Li, Sven Schewe, Qiyi Tang

Link:  https://arxiv.org/abs/2511.09073v1

Date: 2025-11-d

Summary:

We study stochastic planning problems in Markov Decision Processes (MDPs) with goals specified in Linear Temporal Logic (LTL). The state-of-the-art approach transforms LTL formulas into good-for-MDP (GFM) automata, which feature a restricted form of nondeterminism. These automata are then composed with the MDP, allowing the agent to resolve the nondeterminism during policy synthesis. A major factor affecting the scalability of this approach is the size of the generated automata. In this paper, we propose a novel GFM state-space reduction technique that significantly reduces the number of automata states. Our method employs a sophisticated chain of transformations, leveraging recent advances in good-for-games minimisation developed for adversarial settings. In addition to our theoretical contributions, we present empirical results demonstrating the practical effectiveness of our state-reduction technique. Furthermore, we introduce a direct construction method for formulas of the form GFφ, where φ is a co-safety formula. This construction is provably single-exponential in the worst case, in contrast to the general doubly-exponential complexity. Our experiments confirm the scalability advantages of this specialised construction.

--------------------------------------------------------------------------------------------------------

MedHE: Communication-Efficient Privacy-Preserving Federated Learning with Adaptive Gradient Sparsification for Healthcare

Healthcare federated learning requires strong privacy guarantees while maintaining efficiency across resource-constrained medical institutions. MedHE combines adaptive gradient sparsification with CKKS homomorphic encryption, achieving 97.5% communication reduction while preserving utility. Dynamic threshold mechanisms with error compensation enable top-k gradient selection, reducing communication from 1277 MB to 32 MB per round while maintaining 89.5% accuracy comparable to standard federated learning. Formal security analysis provides differential privacy guarantees with epsilon ≤1.0 and HIPAA compliance. Applications include multi-hospital disease prediction, rare disease research requiring cross-institutional data, pharmaceutical drug discovery, and personalized medicine initiatives where privacy-preserving collaboration enables breakthroughs impossible with isolated datasets while protecting sensitive patient information.

Authors:  Farjana Yesmin

Link:  https://arxiv.org/abs/2511.09043v1

Date: 2025-11-d

Summary:

Healthcare federated learning requires strong privacy guarantees while maintaining computational efficiency across resource-constrained medical institutions. This paper presents MedHE, a novel framework combining adaptive gradient sparsification with CKKS homomorphic encryption to enable privacy-preserving collaborative learning on sensitive medical data. Our approach introduces a dynamic threshold mechanism with error compensation for top-k gradient selection, achieving 97.5% communication reduction while preserving model utility. We provide formal security analysis under Ring Learning with Errors assumptions and demonstrate differential privacy guarantees with ε ≤ 1.0. Statistical testing across 5 independent trials shows MedHE achieves 89.5% ± 0.8% accuracy, maintaining comparable performance to standard federated learning (p = 0.32) while reducing communication from 1277 MB to 32 MB per training round. Comprehensive evaluation demonstrates practical feasibility for real-world medical deployments with HIPAA compliance and scalability to 100+ institutions.
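
The sparsification half of the pipeline can be sketched directly; the encryption half (CKKS over the surviving coordinates) is omitted here. This is an illustrative top-k selector with error feedback over a flattened gradient, with the 2.5% ratio chosen arbitrarily, not MedHE's actual implementation.

```python
import torch

class TopKSparsifier:
    def __init__(self, ratio: float = 0.025):
        self.ratio, self.residual = ratio, None

    def compress(self, grad: torch.Tensor) -> torch.Tensor:
        """grad: flattened (1-D) gradient; returns a sparse tensor of the same shape."""
        if self.residual is None:
            self.residual = torch.zeros_like(grad)
        corrected = grad + self.residual                   # error compensation
        k = max(1, int(self.ratio * corrected.numel()))
        idx = corrected.abs().topk(k).indices              # keep the k largest entries
        sparse = torch.zeros_like(corrected)
        sparse[idx] = corrected[idx]
        self.residual = corrected - sparse                 # carry dropped mass to the next round
        return sparse                                      # this is what would be CKKS-encrypted
```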

--------------------------------------------------------------------------------------------------------

A Neurosymbolic Approach to Natural Language Formalization and Verification

Large language models excel at natural language reasoning but their stochasticity limits adoption in regulated industries requiring strict policy compliance. This two-stage neurosymbolic framework uses LLMs with optional human guidance to formalize natural language policies, then validates statements through inference-time autoformalization. Multiple redundant formalization steps with semantic equivalence checking achieve >99% soundness, indicating near-zero false positives. Auditable logical artifacts substantiate verification outcomes. Applications span financial compliance verification, healthcare protocol adherence, legal contract analysis, and regulatory documentation where formal correctness proves essential. This approach bridges natural language accessibility with formal verification rigor, enabling AI adoption in risk-averse domains previously excluded due to interpretability and reliability concerns.

Authors:  Sam Bayless, Stefano Buliani, Darion Cassel, Byron Cook, Duncan Clough, Rémi Delmas, Nafi Diallo, Ferhat Erata, Nick Feng, Dimitra Giannakopoulou, Aman Goel, Aditya Gokhale, Joe Hendrix, Marc Hudak, Dejan Jovanović, Andrew M. Kent, Benjamin Kiesl-Reiter, Jeffrey J. Kuna, Nadia Labai, Joseph Lilien, Divya Raghunathan, Zvonimir Rakamarić, Niloofar Razavi, Michael Tautschnig, Ali Torkamani, Nathaniel Weir, Michael W. Whalen, Jianan Yao

Link:  https://arxiv.org/abs/2511.09008v1

Date: 2025-11-d

Summary:

Large Language Models perform well at natural language interpretation and reasoning, but their inherent stochasticity limits their adoption in regulated industries like finance and healthcare that operate under strict policies. To address this limitation, we present a two-stage neurosymbolic framework that (1) uses LLMs with optional human guidance to formalize natural language policies, allowing fine-grained control of the formalization process, and (2) uses inference-time autoformalization to validate logical correctness of natural language statements against those policies. When correctness is paramount, we perform multiple redundant formalization steps at inference time, cross-checking the formalizations for semantic equivalence. Our benchmarks demonstrate that our approach exceeds 99% soundness, indicating a near-zero false positive rate in identifying logical validity. Our approach produces auditable logical artifacts that substantiate the verification outcomes and can be used to improve the original text.
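
The cross-checking step has a natural SMT formulation: two independent formalizations are accepted only if a solver finds no assignment on which they disagree. The sketch below uses Z3 over a toy propositional policy; the encoding and the example policy are illustrative assumptions, not the framework's internal representation.

```python
from z3 import Bool, Implies, Not, Solver, unsat

def semantically_equivalent(f1, f2) -> bool:
    s = Solver()
    s.add(Not(f1 == f2))          # satisfiable iff the two formalizations can disagree
    return s.check() == unsat

# Two attempted formalizations of "large transfers require manager approval".
large, approved = Bool("large_transfer"), Bool("manager_approved")
attempt_a = Implies(large, approved)
attempt_b = Not(large)            # a faulty second attempt that drops the approval condition

print(semantically_equivalent(attempt_a, attempt_a))   # True  -> accept
print(semantically_equivalent(attempt_a, attempt_b))   # False -> reject and re-formalize
```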

--------------------------------------------------------------------------------------------------------


EYE ON A.I. GETS READERS UP TO DATE ON THE LATEST FUNDING NEWS AND RELATED ISSUES. SUBSCRIBE FOR THE WEEKLY NEWSLETTER.
