Week Ending 11.9.2025
RESEARCH WATCH: 11.9.2025
Perceptually Aligning Representations of Music via Noise-Augmented Autoencoders
This research tackles a fundamental challenge in music AI: creating representations that mirror human perception. By training autoencoders to reconstruct audio from noised encodings while incorporating perceptual losses, the authors achieve hierarchical structures where perceptually important information occupies coarser representation levels. This approach shows promise for music generation systems, particularly in latent diffusion models that need to capture subtle musical features. Applications span music recommendation algorithms, automated composition tools, and brain-computer interfaces for music therapy. The method's ability to predict EEG responses to music listening could revolutionize how we understand music cognition and personalize audio experiences for listeners with different perceptual profiles.
Authors: Mathias Rose Bjare, Giorgia Cantisani, Marco Pasini, Stefan Lattner, Gerhard Widmer
Link: https://arxiv.org/abs/2511.05350v1
Date: 2025-11-d
Summary:
We argue that training autoencoders to reconstruct inputs from noised versions of their encodings, when combined with perceptual losses, yields encodings that are structured according to a perceptual hierarchy. We demonstrate the emergence of this hierarchical structure by showing that, after training an audio autoencoder in this manner, perceptually salient information is captured in coarser representation structures than with conventional training. Furthermore, we show that such perceptual hierarchies improve latent diffusion decoding in the context of estimating surprisal in music pitches and predicting EEG-brain responses to music listening. Pretrained weights are available on github.com/CPJKU/pa-audioic.
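The training objective described above can be illustrated with a minimal, framework-free sketch: encode, corrupt the code with noise, decode, and score with a perceptual loss. Here `encode`, `decode`, and `perceptual_loss` are placeholder callables standing in for the paper's neural networks, and the L1 loss is only a stand-in for a true perceptual metric.

```python
import random

def noise_latent(z, sigma, rng):
    """Corrupt every latent dimension with Gaussian noise of scale sigma."""
    return [zi + rng.gauss(0.0, sigma) for zi in z]

def denoising_objective(x, encode, decode, perceptual_loss, sigma, rng):
    """One evaluation of the noised-encoding objective: encode the input,
    corrupt the code, decode, and score the reconstruction with a
    perceptual (rather than plain sample-wise) loss."""
    z = encode(x)
    x_hat = decode(noise_latent(z, sigma, rng))
    return perceptual_loss(x, x_hat)

# Toy check with identity encoder/decoder and an L1 stand-in loss.
rng = random.Random(0)
x = [0.5, -1.0, 2.0]
loss = denoising_objective(
    x, encode=lambda v: v, decode=lambda v: v,
    perceptual_loss=lambda a, b: sum(abs(ai - bi) for ai, bi in zip(a, b)),
    sigma=0.1, rng=rng)
```

Minimizing this objective pressures the encoder to place the perceptually important information where it best survives the noise, which is what yields the coarse-to-fine hierarchy the paper reports.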
--------------------------------------------------------------------------------------------------------
Autonomous generation of different courses of action in mechanized combat operations
Military decision-making during combat requires rapid evaluation of countless tactical alternatives under extreme uncertainty. This paper presents an AI system that generates and evaluates thousands of potential courses of action for mechanized battalions in real-time. By considering force composition, terrain, opponent capabilities, and field manual guidelines, the system provides commanders with optimized tactical recommendations as battlefield conditions evolve. Applications extend beyond military contexts to crisis management, emergency response coordination, and strategic planning in dynamic environments. The methodology's ability to continuously adapt recommendations based on changing conditions makes it valuable for autonomous vehicle coordination, disaster response logistics, and complex resource allocation problems requiring sequential decision-making under uncertainty.
Authors: Johan Schubert, Patrik Hansen, Pontus Hörling, Ronnie Johansson
Link: https://arxiv.org/abs/2511.05182v1
Date: 2025-11-d
Summary:
In this paper, we propose a methodology designed to support decision-making during the execution phase of military ground combat operations, with a focus on one's own actions. This methodology generates and evaluates recommendations for various courses of action for a mechanized battalion, commencing with an initial set assessed by their anticipated outcomes. It systematically produces thousands of individual action alternatives, followed by evaluations aimed at identifying alternative courses of action with superior outcomes. These alternatives are appraised in light of the opponent's status and actions, considering unit composition, force ratios, types of offense and defense, and anticipated advance rates. Battle outcomes and advance rates are evaluated using field manuals. The processes of generation and evaluation run concurrently, yielding a variety of alternative courses of action. This approach facilitates the management of new course generation based on previously evaluated actions. As the combat unfolds and conditions evolve, revised courses of action are formulated for the decision-maker within a sequential decision-making framework.
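The generate-and-evaluate process can be reduced to a serial core, sketched below under heavy simplifying assumptions: a course of action is just a parameter vector, and a toy objective stands in for the field-manual-based outcome model.

```python
import random

def generate_variants(coa, rng, n=10):
    """Produce n perturbed courses of action from a parent plan.
    A COA here is just a vector of tactical parameters (hypothetical)."""
    return [[p + rng.uniform(-1, 1) for p in coa] for _ in range(n)]

def evaluate(coa):
    """Stand-in for the field-manual-based outcome model: score a plan.
    Toy objective with its optimum at parameter value 3.0."""
    return -sum((p - 3.0) ** 2 for p in coa)

def search(initial, rounds=20, rng=None):
    """Serial core of the concurrent generate/evaluate loop: keep the
    best plan found so far and spawn new variants from it."""
    rng = rng or random.Random(0)
    best = initial
    for _ in range(rounds):
        for cand in generate_variants(best, rng):
            if evaluate(cand) > evaluate(best):
                best = cand
    return best

best = search([0.0, 0.0])
```

In the paper's setting this loop runs continuously as conditions change, so the "initial" plan on each call is the current best from previously evaluated actions.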
--------------------------------------------------------------------------------------------------------
Software architecture documentation frequently becomes outdated or non-existent, forcing developers to understand complex systems directly from source code—a cognitively demanding and time-consuming process. This research combines traditional reverse engineering with LLMs to semi-automatically generate Software Architecture Descriptions (SADs) from codebases. The approach extracts component diagrams, identifies architecturally significant elements through prompt engineering, and generates behavioral state machines using few-shot learning. Applications include accelerating developer onboarding, facilitating legacy system modernization, and maintaining architectural clarity throughout software lifecycles. This methodology could revolutionize documentation practices in software engineering, particularly for large-scale systems, and enable better communication between technical and non-technical stakeholders.
Authors: Ahmad Hatahet, Christoph Knieke, Andreas Rausch
Link: https://arxiv.org/abs/2511.05165v1
Date: 2025-11-d
Summary:
Software Architecture Descriptions (SADs) are essential for managing the inherent complexity of modern software systems. They enable high-level architectural reasoning, guide design decisions, and facilitate effective communication among diverse stakeholders. However, in practice, SADs are often missing, outdated, or poorly aligned with the system's actual implementation. Consequently, developers are compelled to derive architectural insights directly from source code, a time-intensive process that increases cognitive load, slows new developer onboarding, and contributes to the gradual degradation of clarity over the system's lifetime. To address these issues, we propose a semi-automated generation of SADs from source code by integrating reverse engineering (RE) techniques with a Large Language Model (LLM). Our approach recovers both static and behavioral architectural views by extracting a comprehensive component diagram, filtering architecturally significant elements (core components) via prompt engineering, and generating state machine diagrams to model component behavior based on underlying code logic with few-shot prompting. The resulting view representations offer a scalable and maintainable alternative to traditional manual architectural documentation. This methodology, demonstrated using C++ examples, highlights the capability of LLMs to: 1) abstract the component diagram, thereby reducing the reliance on human expert involvement, and 2) accurately represent complex software behaviors, especially when enriched with domain-specific knowledge through few-shot prompting. These findings suggest a viable path toward significantly reducing manual effort while enhancing system understanding and long-term maintainability.
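The pipeline's shape can be sketched with a stub standing in for the LLM. All names, prompts, and the stub's behavior below are hypothetical; the paper works on real C++ codebases with engineered prompt templates and few-shot examples.

```python
def build_sad(component_graph, llm):
    """Sketch of a semi-automated SAD pipeline (names hypothetical):
    1) reverse engineering yields a full component graph,
    2) an LLM filters architecturally significant (core) components,
    3) the LLM drafts a state machine per core component (few-shot)."""
    prompt = ("From this component list, keep only architecturally "
              f"significant ones: {sorted(component_graph)}")
    core = llm(prompt)
    behaviors = {c: llm(f"Few-shot examples omitted. Describe the state "
                        f"machine of component {c}.") for c in core}
    return {"components": core, "behaviors": behaviors}

# Stub LLM: pretends everything except utility components is core.
def stub_llm(prompt):
    if prompt.startswith("From this component list"):
        return ["Scheduler", "Planner"]
    return "idle -> running -> done"

graph = {"Scheduler": ["Planner"], "Planner": [], "Logger": []}
sad = build_sad(graph, stub_llm)
```

The key design point is the division of labor: deterministic reverse engineering supplies ground-truth structure, while the LLM handles abstraction and behavior description, the two steps that previously required a human expert.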
--------------------------------------------------------------------------------------------------------
Monocular depth estimation—inferring 3D scene structure from single camera images—underpins autonomous vehicles, robotics, and augmented reality applications. However, these systems often fail when deployed in environments different from training conditions. PITTA introduces a novel test-time adaptation framework that doesn't require camera pose information, addressing a major practical limitation. By employing instance-aware masking to distinguish dynamic objects from static backgrounds and incorporating edge extraction techniques, PITTA achieves superior performance on diverse driving datasets. Applications include robust autonomous driving systems, drone navigation in unknown environments, mobile robot perception, and AR/VR applications requiring real-time depth estimation across varying conditions without extensive retraining.
Authors: Mingyu Sung, Hyeonmin Choe, Il-Min Kim, Sangseok Yun, Jae Mo Kang
Link: https://arxiv.org/abs/2511.05055v1
Date: 2025-11-d
Summary:
Monocular depth estimation (MDE), inferring pixel-level depths from single RGB images captured by a monocular camera, plays a pivotal role in a variety of AI applications demanding a three-dimensional (3D) understanding of the scene. In real-world scenarios, MDE models often need to be deployed in environments with different conditions from those for training. Test-time (domain) adaptation (TTA) is a compelling and practical approach to address the issue. Although there have been notable advancements in TTA for MDE, particularly in a self-supervised manner, existing methods remain ineffective when applied to diverse and dynamic environments. To address this challenge, we propose a novel and high-performing TTA framework for MDE, named PITTA. Our approach incorporates two key innovative strategies: (i) a pose-agnostic TTA paradigm for MDE and (ii) instance-aware image masking. Specifically, PITTA enables highly effective TTA on a pretrained MDE network in a pose-agnostic manner without resorting to any camera pose information. Besides, our instance-aware masking strategy extracts instance-wise masks for dynamic objects (e.g., vehicles, pedestrians, etc.) from a segmentation mask produced by a pretrained panoptic segmentation network, by removing static objects including background components. To further boost performance, we also present a simple yet effective edge extraction methodology for the input image (i.e., a single monocular image) and depth map. Extensive experimental evaluations on the DrivingStereo and Waymo datasets with varying environmental conditions demonstrate that our proposed framework, PITTA, surpasses existing state-of-the-art techniques with remarkable performance improvements in MDE during TTA.
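Taken in isolation, the instance-aware masking step amounts to keeping only pixels that belong to dynamic "thing" classes in a panoptic segmentation. A toy sketch, where the class list and the per-pixel instance-id format are illustrative assumptions:

```python
DYNAMIC = {"car", "pedestrian"}  # hypothetical dynamic "thing" classes

def instance_mask(panoptic, labels):
    """Binary mask keeping only dynamic instances, dropping static
    objects and background (sketch of the masking idea). `panoptic`
    holds per-pixel instance ids; `labels` maps id -> class name."""
    return [[1 if labels.get(pid) in DYNAMIC else 0 for pid in row]
            for row in panoptic]

seg = [[0, 1, 1],
       [0, 2, 2]]
labels = {0: "road", 1: "car", 2: "building"}
mask = instance_mask(seg, labels)
```

In PITTA this mask lets the self-supervised adaptation loss treat moving objects differently from the static background, which is what breaks the dependence on camera pose.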
--------------------------------------------------------------------------------------------------------
Enhancing Public Speaking Skills in Engineering Students Through AI
Engineering students frequently struggle with public speaking despite its critical importance for professional success. Traditional feedback methods are time-intensive and difficult to scale. This research develops a multi-modal AI system that evaluates verbal communication (pitch, pacing, intonation), non-verbal cues (facial expressions, gestures, posture), and introduces "expressive coherence"—ensuring alignment between speech and body language. Among tested LLMs, Gemini Pro showed strongest agreement with human evaluators. Applications extend beyond engineering education to corporate training, political speech preparation, and therapeutic interventions for communication disorders. The scalability enables unlimited practice opportunities, potentially transforming how professionals develop presentation skills and helping individuals with social anxiety build confidence.
Authors: Amol Harsh, Brainerd Prince, Siddharth Siddharth, Deepan Raj Prabakar Muthirayan, Kabir S Bhalla, Esraaj Sarkar Gupta, Siddharth Sahu
Link: https://arxiv.org/abs/2511.04995v1
Date: 2025-11-d
Summary:
This research-to-practice full paper was inspired by the persistent challenge in effective communication among engineering students. Public speaking is a necessary skill for future engineers as they have to communicate technical knowledge with diverse stakeholders. While universities offer courses or workshops, they are unable to offer sustained and personalized training to students. Providing comprehensive feedback on both verbal and non-verbal aspects of public speaking is time-intensive, making consistent and individualized assessment impractical. This study integrates research on verbal and non-verbal cues in public speaking to develop an AI-driven assessment model for engineering students. Our approach combines speech analysis, computer vision, and sentiment detection into a multi-modal AI system that provides assessment and feedback. The model evaluates (1) verbal communication (pitch, loudness, pacing, intonation), (2) non-verbal communication (facial expressions, gestures, posture), and (3) expressive coherence, a novel integration ensuring alignment between speech and body language. Unlike previous systems that assess these aspects separately, our model fuses multiple modalities to deliver personalized, scalable feedback. Preliminary testing demonstrated that our AI-generated feedback was moderately aligned with expert evaluations. Among the state-of-the-art AI models evaluated, all of which were Large Language Models (LLMs), including Gemini and OpenAI models, Gemini Pro emerged as the best-performing, showing the strongest agreement with human annotators. By eliminating reliance on human evaluators, this AI-driven public speaking trainer enables repeated practice, helping students naturally align their speech with body language and emotion, crucial for impactful and professional communication.
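A late-fusion sketch of the three assessment channels. The weights and the coherence formula below are illustrative assumptions, not the paper's published model; the point is only that coherence rewards agreement between vocal and facial sentiment rather than scoring each channel alone.

```python
def expressive_coherence(speech_valence, face_valence):
    """Toy coherence score: 1 when vocal and facial sentiment agree,
    0 when maximally opposed (valences assumed in [-1, 1])."""
    return 1.0 - abs(speech_valence - face_valence) / 2.0

def fused_score(verbal, nonverbal, speech_val, face_val,
                weights=(0.4, 0.4, 0.2)):
    """Late fusion of the three channels: verbal quality, non-verbal
    quality, and expressive coherence (weights are illustrative)."""
    w_v, w_n, w_c = weights
    return (w_v * verbal + w_n * nonverbal
            + w_c * expressive_coherence(speech_val, face_val))

# A speaker whose words and face agree outscores one whose do not,
# even with identical per-channel quality.
aligned = fused_score(0.8, 0.8, speech_val=0.6, face_val=0.6)
mismatched = fused_score(0.8, 0.8, speech_val=0.6, face_val=-0.6)
```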
--------------------------------------------------------------------------------------------------------
NuFast-Earth: Efficient Atmospheric, Solar, and Supernova Neutrino Propagation Through the Earth
Understanding neutrino oscillations through Earth's varying matter density is crucial for next-generation physics experiments like DUNE, HyperK, IceCube, and KM3NeT. As these facilities come online, computational demands for atmospheric and solar neutrino analyses will increase dramatically. NuFast-Earth provides an efficient algorithm for calculating oscillation probabilities through Earth's complex density profiles. By intelligently caching calculations and only recomputing necessary components, the code achieves significant speed improvements. Applications span fundamental physics research, supernova detection, and atmospheric neutrino studies. The software's flexibility in handling arbitrary Earth models enables precise predictions for experiments studying neutrino properties, potentially revealing new physics beyond the Standard Model and improving our understanding of stellar processes.
Authors: Peter B. Denton, Stephen J. Parke
Link: https://arxiv.org/abs/2511.04735v1
Date: 2025-11-d
Summary:
Algorithms for computing neutrino oscillation probabilities in sharply varying matter potentials such as the Earth are becoming increasingly important. As the next generation of experiments, DUNE and HyperK as well as the IceCube upgrade and KM3NeT, come online, the computational cost for atmospheric and solar neutrinos will continue to increase. To address these issues, we expand upon our previous algorithm for long-baseline calculations to efficiently handle probabilities through the Earth for atmospheric, nighttime solar, and supernova neutrinos. The algorithm is fast, flexible, and accurate. It can handle arbitrary Earth models with two different schemes for varying density profiles. We also provide a C++ implementation of the code, called NuFast-Earth, along with a detailed user manual. The code intelligently keeps track of repeated calculations and only recalculates what is needed on each successive call, which can provide significant speed-ups.
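The caching idea can be illustrated with Python's `functools.lru_cache` and a toy one-layer phase (not the real three-flavor algebra): because a radially symmetric Earth model traverses the same (density, thickness) layers on the way in and out, the per-layer amplitude is computed once and reused.

```python
import math
from functools import lru_cache

@lru_cache(maxsize=None)
def layer_amplitude(density, length, energy):
    """Stand-in for the expensive per-layer computation (a toy
    two-flavor phase, not NuFast-Earth's actual algebra); caching it
    mirrors how repeated work is reused across successive calls."""
    phase = 1.27 * (1.0 + 0.1 * density) * length / energy  # toy formula
    return complex(math.cos(phase), -math.sin(phase))

def probability(layers, energy):
    """Multiply per-layer amplitudes along the path; repeated
    (density, length) pairs, common in symmetric Earth models, hit the
    cache instead of being recomputed."""
    amp = complex(1.0, 0.0)
    for density, length in layers:
        amp *= layer_amplitude(density, length, energy)
    return abs(amp) ** 2

earth_path = [(3.3, 100.0), (5.5, 200.0), (3.3, 100.0)]  # symmetric layers
p = probability(earth_path, energy=1.0)
hits = layer_amplitude.cache_info().hits  # the repeated layer is reused
```

NuFast-Earth's bookkeeping is finer-grained than a memoization wrapper (it tracks which inputs changed between successive calls), but the payoff is the same: only what changed gets recomputed.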
--------------------------------------------------------------------------------------------------------
Cambrian-S: Towards Spatial Supersensing in Video
Current vision AI systems operate reactively, processing information without true spatial understanding or predictive capabilities. This research introduces "spatial supersensing"—a paradigm encompassing semantic perception, continuous memory, implicit 3D reasoning, and predictive world modeling. The authors present VSI-SUPER, a benchmark testing long-horizon spatial recall and continual counting that resists brute-force context expansion. Cambrian-S, trained on curated video data, shows substantial improvements yet reveals that scale alone is insufficient. Applications include autonomous robotics requiring spatial awareness, surveillance systems maintaining long-term scene understanding, and augmented reality needing predictive environment modeling. The proof-of-concept predictive sensing approach, using surprise-driven memory organization, suggests paths toward AI systems that genuinely understand and anticipate spatial environments.
Authors: Shusheng Yang, Jihan Yang, Pinzhi Huang, Ellis Brown, Zihao Yang, Yue Yu, Shengbang Tong, Zihan Zheng, Yifan Xu, Muhan Wang, Daohan Lu, Rob Fergus, Yann LeCun, Li Fei-Fei, Saining Xie
Link: https://arxiv.org/abs/2511.04670v1
Date: 2025-11-d
Summary:
We argue that progress in true multimodal intelligence calls for a shift from reactive, task-driven systems and brute-force long context towards a broader paradigm of supersensing. We frame spatial supersensing as four stages beyond linguistic-only understanding: semantic perception (naming what is seen), streaming event cognition (maintaining memory across continuous experiences), implicit 3D spatial cognition (inferring the world behind pixels), and predictive world modeling (creating internal models that filter and organize information). Current benchmarks largely test only the early stages, offering narrow coverage of spatial cognition and rarely challenging models in ways that require true world modeling. To drive progress in spatial supersensing, we present VSI-SUPER, a two-part benchmark: VSR (long-horizon visual spatial recall) and VSC (continual visual spatial counting). These tasks require arbitrarily long video inputs yet are resistant to brute-force context expansion. We then test data scaling limits by curating VSI-590K and training Cambrian-S, achieving +30% absolute improvement on VSI-Bench without sacrificing general capabilities. Yet performance on VSI-SUPER remains limited, indicating that scale alone is insufficient for spatial supersensing. We propose predictive sensing as a path forward, presenting a proof-of-concept in which a self-supervised next-latent-frame predictor leverages surprise (prediction error) to drive memory and event segmentation. On VSI-SUPER, this approach substantially outperforms leading proprietary baselines, showing that spatial supersensing requires models that not only see but also anticipate, select, and organize experience.
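The surprise-driven segmentation idea in the proof-of-concept can be sketched in a few lines: a predictor guesses the next latent frame, and a prediction error above threshold marks an event boundary. Here 1-D "latents" and an identity predictor stand in for the self-supervised next-latent-frame model.

```python
def segment_by_surprise(frames, predict, threshold=1.0):
    """Surprise-driven event segmentation (sketch): when the prediction
    error ("surprise") on the next frame exceeds a threshold, close the
    current event and start a new one."""
    events, current = [], [frames[0]]
    for prev, cur in zip(frames, frames[1:]):
        error = abs(predict(prev) - cur)
        if error > threshold:      # surprising: record an event boundary
            events.append(current)
            current = []
        current.append(cur)
    events.append(current)
    return events

# 1-D "latents": a persistence predictor is surprised only by the jump,
# so the stream splits into exactly two events.
frames = [0.0, 0.1, 0.2, 5.0, 5.1, 5.2]
events = segment_by_surprise(frames, predict=lambda z: z, threshold=1.0)
```

The same signal that segments events can gate what gets written to long-term memory, which is how the paper's model keeps arbitrarily long videos tractable without brute-force context expansion.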
--------------------------------------------------------------------------------------------------------
Are language models aware of the road not taken? Token-level uncertainty and hidden state dynamics
Language models generate text token-by-token, where each choice can lead to vastly different reasoning paths. This research investigates whether models internally represent alternative paths they didn't take. By analyzing hidden activations, the authors find correlations between token-level uncertainty and model steerability—activation interventions work best when multiple paths remain available. Furthermore, hidden states can predict future outcome distributions, suggesting models implicitly encode possibility spaces. Applications include improving model reliability through uncertainty quantification, developing better human-AI collaboration tools, enhancing interpretability for high-stakes decisions, and creating controllable generation systems. Understanding these internal representations could enable more trustworthy AI assistants that explicitly acknowledge alternatives and uncertainty.
Authors: Amir Zur, Atticus Geiger, Ekdeep Singh Lubana, Eric Bigelow
Link: https://arxiv.org/abs/2511.04527v1
Date: 2025-11-d
Summary:
When a language model generates text, the selection of individual tokens might lead it down very different reasoning paths, making uncertainty difficult to quantify. In this work, we consider whether reasoning language models represent the alternate paths that they could take during generation. To test this hypothesis, we use hidden activations to control and predict a language model's uncertainty during chain-of-thought reasoning. In our experiments, we find a clear correlation between how uncertain a model is at different tokens, and how easily the model can be steered by controlling its activations. This suggests that activation interventions are most effective when there are alternate paths available to the model -- in other words, when it has not yet committed to a particular final answer. We also find that hidden activations can predict a model's future outcome distribution, demonstrating that models implicitly represent the space of possible paths.
--------------------------------------------------------------------------------------------------------
A Lightweight Framework for Integrated Sensing and Communications with RIS
Future 6G networks will integrate sensing and communication capabilities, with Reconfigurable Intelligent Surfaces (RIS) enhancing both functions. Existing optimization methods using semidefinite relaxation suffer from high computational complexity, while iterative algorithms produce suboptimal, initialization-dependent solutions. This research presents a lightweight framework providing closed-form solutions that explicitly balance communication-sensing trade-offs and distribute beam gains across multiple targets. By partitioning RIS configuration into communication-optimized and sensing-perturbation components, the approach achieves performance comparable to complex methods with dramatically reduced computational costs. Applications include smart cities with integrated surveillance and connectivity, industrial IoT monitoring, vehicular networks combining radar and communication, and efficient spectrum utilization in resource-constrained environments.
Authors: Chu Li, Kevin Weinberger, Aydin Sezgin
Link: https://arxiv.org/abs/2511.04448v1
Date: 2025-11-d
Summary:
Reconfigurable Intelligent Surfaces (RIS) have been recognized as a promising technology to enhance both communication and sensing performance in integrated sensing and communication (ISAC) systems for future 6G networks. However, existing RIS optimization methods for improving ISAC performance are mainly based on semidefinite relaxation (SDR) or iterative algorithms. The former suffers from high computational complexity and limited scalability, especially when the number of RIS elements becomes large, while the latter yields suboptimal solutions whose performance depends on initialization. In this work, we introduce a lightweight RIS phase design framework that provides a closed-form solution and explicitly accounts for the trade-off between communication and sensing, as well as proportional beam gain distribution toward multiple sensing targets. The key idea is to partition the RIS configuration into two parts: the first part is designed to maximize the communication performance, while the second introduces small perturbations to generate multiple beams for multi-target sensing. Simulation results validate the effectiveness of the proposed approach and demonstrate that it achieves performance comparable to SDR but with significantly lower computational complexity.
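The partitioning idea can be illustrated with a toy uniform linear array: one phase profile steers the full aperture at the communication user, and a small sinusoidal perturbation (an illustrative stand-in for the paper's closed-form design, not its actual formula) diverts a little gain toward a sensing target while leaving the communication beam largely intact.

```python
import cmath
import math

def array_gain(phases, theta, d=0.5):
    """Normalized far-field gain of an N-element array toward angle
    theta (element spacing d in wavelengths, isotropic elements)."""
    amp = sum(cmath.exp(1j * (p + 2 * math.pi * d * i * math.sin(theta)))
              for i, p in enumerate(phases))
    return abs(amp) / len(phases)

N, theta_c, theta_s = 32, 0.0, 0.5  # communication and sensing angles
# Part 1: phases that steer the full aperture at the communication user.
comm = [-2 * math.pi * 0.5 * i * math.sin(theta_c) for i in range(N)]
# Part 2: a small sinusoidal perturbation diverting energy to the target.
eps = 0.3
pert = [p + eps * math.sin(2 * math.pi * 0.5 * i * math.sin(theta_s))
        for i, p in enumerate(comm)]

comm_gain = array_gain(pert, theta_c)   # gain retained at the user
sense_gain = array_gain(pert, theta_s)  # new gain at the sensing target
```

The perturbation amplitude `eps` plays the role of the communication-sensing trade-off knob: larger values buy more sensing gain at the cost of communication gain.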
--------------------------------------------------------------------------------------------------------
Monitoring expanding marine infrastructure—offshore wind farms, oil platforms, artificial islands—requires robust detection systems. However, comprehensive training datasets are scarce, particularly for underrepresented platform types. This study investigates using synthetic Sentinel-1 satellite imagery alongside real data to train YOLOv10 object detection models. The approach achieved 85% F1 score, improving to 90% with synthetic data, and successfully detected 3,529 platforms across unseen regions. Applications include environmental monitoring, maritime security, illegal fishing prevention, and infrastructure management. The demonstrated geographic transferability suggests potential for global-scale automated monitoring. This methodology addresses common remote sensing challenges around dataset imbalance and could extend to detecting other marine structures, ships, or coastal changes.
Authors: Robin Spanier, Thorsten Hoeser, Claudia Kuenzer
Link: https://arxiv.org/abs/2511.04304v1
Date: 2025-11-d
Summary:
The recent and ongoing expansion of marine infrastructure, including offshore wind farms, oil and gas platforms, artificial islands, and aquaculture facilities, highlights the need for effective monitoring systems. The development of robust models for offshore infrastructure detection relies on comprehensive, balanced datasets, but falls short when samples are scarce, particularly for underrepresented object classes, shapes, and sizes. By training deep learning-based YOLOv10 object detection models with a combination of synthetic and real Sentinel-1 satellite imagery acquired in the fourth quarter of 2023 from four regions (Caspian Sea, South China Sea, Gulf of Guinea, and Coast of Brazil), this study investigates the use of synthetic training data to enhance model performance. We evaluated this approach by applying the model to detect offshore platforms in three unseen regions (Gulf of Mexico, North Sea, Persian Gulf) and thereby assess geographic transferability. This region-holdout evaluation demonstrated that the model generalises beyond the training areas. In total, 3,529 offshore platforms were detected, including 411 in the North Sea, 1,519 in the Gulf of Mexico, and 1,593 in the Persian Gulf. The model achieved an F1 score of 0.85, which improved to 0.90 upon incorporating synthetic data. We analysed how synthetic data enhances the representation of unbalanced classes and overall model performance, taking a first step toward globally transferable detection of offshore infrastructure. This study underscores the importance of balanced datasets and highlights synthetic data generation as an effective strategy to address common challenges in remote sensing, demonstrating the potential of deep learning for scalable, global offshore infrastructure monitoring.
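One plausible reading of the balancing strategy, sketched with a toy sampling policy (the study's exact augmentation policy is not specified here): underrepresented classes are topped up with synthetic chips until every class matches the largest real class.

```python
import random

def balance_with_synthetic(real, synthetic, rng):
    """Top up underrepresented classes with synthetic samples until each
    class matches the largest real class (illustrative policy). Samples
    are (image, label) pairs; synthetic samples are drawn with
    replacement."""
    counts = {}
    for _, label in real:
        counts[label] = counts.get(label, 0) + 1
    target = max(counts.values())
    out = list(real)
    for label, n in counts.items():
        pool = [s for s in synthetic if s[1] == label]
        if pool and n < target:
            out += rng.choices(pool, k=target - n)
    return out

real = [("chip", "wind")] * 50 + [("chip", "oil")] * 10
synth = [("synthetic_chip", "oil")] * 5
data = balance_with_synthetic(real, synth, random.Random(0))
```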
--------------------------------------------------------------------------------------------------------
Examining Student and AI Generated Personalized Analogies in Introductory Physics
Analogies help students grasp abstract physics concepts by connecting them to familiar experiences. Research shows self-generated analogies improve learning, but facilitating this in large courses is challenging. This study analyzes 800 student responses explaining the Morse potential curve, comparing spontaneous and explicitly-prompted student analogies with AI-generated ones. Results show students rarely spontaneously use analogies but generate diverse contexts when prompted. Unlike consistent AI responses, students demonstrated varied reasoning influenced by disciplinary knowledge and personal attributes. Applications include developing AI-powered tutoring systems, improving physics pedagogy through personalized learning, identifying knowledge gaps through analogical reasoning patterns, and understanding how generative AI can complement human learning processes while maintaining student agency.
Authors: Amogh Sirnoorkar, Winter Allen, Syed Furqan Abbas Hashmi, N. Sanjay Rebello
Link: https://arxiv.org/abs/2511.04290v1
Date: 2025-11-d
Summary:
Comparing abstract concepts (such as electric circuits) with familiar ideas (plumbing systems) through analogies is central to the practice and communication of physics. Contemporary research suggests that self-generated analogies facilitate students' learning better than taught ones. "Spontaneous" and "self-generated" analogies represent the two ways through which students construct personalized analogies. However, facilitating them, particularly in large-enrollment courses, remains a challenge, and recent developments in generative artificial intelligence (AI) promise potential to address this issue. In this qualitative study, we analyze around 800 student responses to explore the extent to which students spontaneously leverage analogies while explaining the Morse potential curve in a language suitable for second graders, and self-generate analogies in their preferred everyday contexts. We also compare the student-generated spontaneous analogies with AI-generated ones prompted by students. Lastly, we explore the themes associated with students' perceived ease and difficulty in generating analogies across both cases. Results highlight that, unlike AI responses, student-generated spontaneous explanations seldom employ analogies. However, when explicitly asked to explain the behavior of the curve in terms of their everyday contexts, students employ diverse analogical contexts. A combination of disciplinary knowledge, agency to generate customized explanations, and personal attributes tends to influence students' perceived ease in generating explanations across the two cases. Implications of these results for the potential of AI to facilitate students' personalized analogical reasoning, and for the role of analogies in making students notice gaps in their understanding, are discussed.
--------------------------------------------------------------------------------------------------------
As LLMs increasingly assist in software engineering tasks, understanding their alignment with human values becomes critical. This study evaluates 23 LLMs across four tasks: selecting key responsible AI values, rating importance contextually, resolving value trade-offs, and prioritizing requirements embodying those values. Results show LLMs align more closely with AI practitioners than general populations, emphasizing fairness, privacy, and transparency. However, inconsistencies emerge between stated values and requirement prioritization, revealing gaps between claimed and applied behavior. Applications include improving requirements engineering processes, developing value-aware AI assistants, creating benchmarks for ethical AI alignment, and informing policy around AI deployment in decision-critical contexts requiring human oversight.
Authors: Asma Yamani, Malak Baslyman, Moataz Ahmed
Link: https://arxiv.org/abs/2511.04157v1
Date: 2025-11-d
Summary:
Large Language Models (LLMs) are increasingly employed in software engineering tasks such as requirements elicitation, design, and evaluation, raising critical questions regarding their alignment with human judgments on responsible AI values. This study investigates how closely LLMs' value preferences align with those of two human groups: a US-representative sample and AI practitioners. We evaluate 23 LLMs across four tasks: (T1) selecting key responsible AI values, (T2) rating their importance in specific contexts, (T3) resolving trade-offs between competing values, and (T4) prioritizing software requirements that embody those values. The results show that LLMs generally align more closely with AI practitioners than with the US-representative sample, emphasizing fairness, privacy, transparency, safety, and accountability. However, inconsistencies appear between the values that LLMs claim to uphold (Tasks 1-3) and the way they prioritize requirements (Task 4), revealing gaps in faithfulness between stated and applied behavior. These findings highlight the practical risk of relying on LLMs in requirements engineering without human oversight and motivate the need for systematic approaches to benchmark, interpret, and monitor value alignment in AI-assisted software development.
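Agreement between an LLM's value priorities and a human group's can be scored with a rank correlation. A minimal Kendall-tau sketch on hypothetical rankings (the measurement choice and the example orderings are illustrative, not the paper's data):

```python
def kendall_tau(rank_a, rank_b):
    """Kendall rank correlation between two orderings of the same items:
    +1 for identical order, -1 for reversed order."""
    items = list(rank_a)
    pos_a = {v: i for i, v in enumerate(rank_a)}
    pos_b = {v: i for i, v in enumerate(rank_b)}
    concordant = discordant = 0
    for i in range(len(items)):
        for j in range(i + 1, len(items)):
            x, y = items[i], items[j]
            s = (pos_a[x] - pos_a[y]) * (pos_b[x] - pos_b[y])
            concordant += s > 0
            discordant += s < 0
    n = len(items)
    return (concordant - discordant) / (n * (n - 1) / 2)

# Hypothetical orderings: the LLM and practitioners disagree only on
# the relative priority of privacy vs. transparency.
llm_rank = ["fairness", "privacy", "transparency", "safety"]
practitioner_rank = ["fairness", "transparency", "privacy", "safety"]
tau = kendall_tau(llm_rank, practitioner_rank)
```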
--------------------------------------------------------------------------------------------------------
Interpreting Multi-Attribute Confounding through Numerical Attributes in Large Language Models
While LLMs exhibit numerical reasoning errors, the underlying representational mechanisms remain unclear. This research investigates how LLMs internally integrate multiple numerical attributes and how irrelevant context perturbs these representations. Using linear probing and partial correlation analysis, the study reveals that LLMs encode real-world numerical correlations but systematically amplify them. Irrelevant context induces consistent magnitude representation shifts with model-size-dependent downstream effects. Applications include improving LLM reliability in quantitative reasoning tasks, developing fairer decision-making systems, enhancing financial and medical AI applications requiring numerical accuracy, and creating representation-aware control mechanisms. Understanding these vulnerabilities is crucial for deploying LLMs in high-stakes domains where numerical precision matters.
Authors: Hirohane Takagi, Gouki Minegishi, Shota Kizawa, Issey Sukeda, Hitomi Yanaka
Link: https://arxiv.org/abs/2511.04053v1
Date: 2025-11-d
Summary:
Although behavioral studies have documented numerical reasoning errors in large language models (LLMs), the underlying representational mechanisms remain unclear. We hypothesize that numerical attributes occupy shared latent subspaces and investigate two questions: (1) How do LLMs internally integrate multiple numerical attributes of a single entity? (2) How does irrelevant numerical context perturb these representations and their downstream outputs? To address these questions, we combine linear probing with partial correlation analysis and prompt-based vulnerability tests across models of varying sizes. Our results show that LLMs encode real-world numerical correlations but tend to systematically amplify them. Moreover, irrelevant context induces consistent shifts in magnitude representations, with downstream effects that vary by model size. These findings reveal a vulnerability in LLM decision-making and lay the groundwork for fairer, representation-aware control under multi-attribute entanglement.
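Linear probing means fitting a linear readout from hidden states to an attribute. A synthetic stand-in (the paper probes actual LLM hidden states; the embedding here is invented for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical setup: 500 entities with two correlated numerical attributes
# (say, height and weight) embedded linearly into a 64-d "hidden state" plus
# noise -- a stand-in for an LLM's internal representation of an entity.
n, d = 500, 64
height = rng.normal(170, 10, n)
weight = 0.9 * (height - 170) + rng.normal(75, 5, n)   # correlated with height
W = rng.normal(size=(2, d))
H = np.column_stack([height, weight]) @ W + rng.normal(scale=0.5, size=(n, d))

# Linear probe: least-squares readout of each attribute from the hidden state.
X = np.column_stack([H, np.ones(n)])                   # add a bias column
coef, *_ = np.linalg.lstsq(X, np.column_stack([height, weight]), rcond=None)
pred = X @ coef

for name, true, hat in [("height", height, pred[:, 0]), ("weight", weight, pred[:, 1])]:
    print(f"probe r({name}) = {np.corrcoef(true, hat)[0, 1]:.3f}")
```

High probe correlations show the attributes are linearly decodable; the paper's partial correlation analysis then asks how much of one attribute's encoding is explained by the other, which is where the amplification shows up.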
--------------------------------------------------------------------------------------------------------
Collaborative Agents for Automated Program Repair in Ruby
Automated Program Repair using LLMs has advanced rapidly but remains computationally expensive and focused on few languages. Ruby, despite widespread use in web development, has received little APR attention. RAMP introduces a lightweight framework formulating repair as feedback-driven iteration using collaborative agents that generate tests, reflect on errors, and refine fixes. Unlike approaches requiring large repair databases or costly fine-tuning, RAMP operates through lightweight prompting and test-driven feedback, achieving 67% pass@1 on Ruby benchmarks while converging within five iterations. Applications include IDE-integrated debugging assistants, automated code review tools, developer productivity enhancement, and extending LLM-based programming support to underrepresented languages.
Authors: Nikta Akbarpour, Mahdieh Sadat Benis, Fatemeh Hendijani Fard, Ali Ouni, Mohamed Aymen Saied
Link: https://arxiv.org/abs/2511.03925v1
Date: 2025-11-d
Summary:
Automated Program Repair (APR) has advanced rapidly with Large Language Models (LLMs), but most existing methods remain computationally expensive and focused on a small set of languages. Ruby, despite its widespread use in web development and the persistent challenges faced by its developers, has received little attention in APR research. In this paper, we introduce RAMP, a novel lightweight framework that formulates program repair as a feedback-driven, iterative process for Ruby. RAMP employs a team of collaborative agents that generate targeted tests, reflect on errors, and refine candidate fixes until a correct solution is found. Unlike prior approaches, RAMP is designed to avoid reliance on large multilingual repair databases or costly fine-tuning, instead operating directly on Ruby through lightweight prompting and test-driven feedback. Evaluation on the XCodeEval benchmark shows that RAMP achieves a pass@1 of 67% on Ruby, outperforming prior approaches. RAMP converges quickly within five iterations, and ablation studies confirm that test generation and self-reflection are key drivers of its performance. Further analysis shows that RAMP is particularly effective at repairing wrong answers, compilation errors, and runtime errors. Our approach provides new insights into multi-agent repair strategies and establishes a foundation for extending LLM-based debugging tools to under-studied languages.
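The generate-tests / attempt-fix / reflect loop can be sketched in a few lines. The agent functions below are stubs standing in for LLM calls; the names (`generate_tests`, `propose_fix`, `reflect`) are illustrative, not RAMP's actual API:

```python
# Minimal sketch of a feedback-driven repair loop in the spirit of RAMP,
# with stub agents and a toy "add two integers" repair task.

def generate_tests(problem):
    # A test-generation agent would derive targeted tests from the problem.
    return [lambda fix: fix(2, 3) == 5, lambda fix: fix(-1, 1) == 0]

def propose_fix(problem, feedback):
    # A repair agent would prompt an LLM with the code plus feedback; here we
    # only "fix" a buggy add() once feedback reports failing tests.
    if feedback is None:
        return lambda a, b: a - b          # initial buggy candidate
    return lambda a, b: a + b              # refined candidate

def reflect(failures):
    # A reflection agent summarizes why tests failed for the next attempt.
    return f"{len(failures)} test(s) failed; revisit the operator used."

def repair(problem, max_iters=5):
    tests, feedback = generate_tests(problem), None
    for i in range(1, max_iters + 1):
        candidate = propose_fix(problem, feedback)
        failures = [t for t in tests if not t(candidate)]
        if not failures:
            return candidate, i            # all tests pass: repair found
        feedback = reflect(failures)
    return None, max_iters

fix, iters = repair("add two integers")
print(f"repaired in {iters} iteration(s)")
```

The cap of five iterations mirrors the convergence behavior reported in the paper; in practice each stub would be a separate prompted agent sharing the test results as feedback.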
--------------------------------------------------------------------------------------------------------
Levers of Power in the Field of AI
AI governance requires understanding how decision-makers across academia, government, business, and civil society navigate power dynamics. This study examines how individuals exercise "levers of power"—social mechanisms shaping institutional responses to technological change. Through personalized questionnaires based on neo-institutionalist frameworks, researchers created twelve fictional personas representing high-level decision-makers from North America and Europe. These personas illustrate intersections between personal agency, organizational logics, and institutional infrastructures in AI governance. Applications include informing policy development, improving cross-sector collaboration, understanding institutional change dynamics, and empowering civil society engagement. The framework helps policymakers and advocates identify actionable strategies for influencing AI governance within their institutional contexts.
Authors: Tammy Mackenzie, Sukriti Punj, Natalie Perez, Sreyoshi Bhaduri, Branislav Radeljic
Link: https://arxiv.org/abs/2511.03859v1
Date: 2025-11-d
Summary:
This paper examines how decision makers in academia, government, business, and civil society navigate questions of power in implementations of artificial intelligence. The study explores how individuals experience and exercise levers of power, which are presented as social mechanisms that shape institutional responses to technological change. The study reports on responses to personalized questionnaires designed to gather insight into a decision maker's institutional purview, based on an institutional governance framework developed from the work of Neo-institutionalists. Findings present the anonymized, real responses and circumstances of respondents in the form of twelve fictional personas of high-level decision makers from North America and Europe. These personas illustrate how personal agency, organizational logics, and institutional infrastructures may intersect in the governance of AI. The decision makers' responses to the questionnaires then inform a discussion of the field-level personal power of decision makers, methods of fostering institutional stability in times of change, and methods of influencing institutional change in the field of AI. The final section of the discussion presents a table of the dynamics of the levers of power in the field of AI for change makers and five testable hypotheses for institutional and social movement researchers. In summary, this study provides insight into the means for policymakers within institutions and their counterparts in civil society to personally engage with AI governance.
--------------------------------------------------------------------------------------------------------
Noise Injection: Improving Out-of-Distribution Generalization for Limited Size Datasets
Deep learning models often fail when deployed on data from different sources than training data—a critical problem for medical AI applications like COVID-19 detection from chest X-rays. Models exploit source-specific artifacts (shortcuts) rather than genuine biomarkers, limiting generalization. This study investigates fundamental noise injection techniques (Gaussian, Speckle, Poisson, Salt and Pepper) during training, demonstrating significant reduction in performance gaps between in-distribution and out-of-distribution evaluation—from 0.10-0.20 to 0.01-0.06 across metrics. Applications include improving medical diagnosis systems' robustness, enhancing model deployment across different hospitals or imaging devices, and developing more generalizable AI for resource-constrained healthcare settings where diverse data sources are common.
Authors: Duong Mai, Lawrence Hall
Link: https://arxiv.org/abs/2511.03855v1
Date: 2025-11-d
Summary:
Deep learned (DL) models for image recognition have been shown to fail to generalize to data from different devices, populations, etc. COVID-19 detection from Chest X-rays (CXRs), in particular, has been shown to fail to generalize to out-of-distribution (OOD) data from new clinical sources not covered in the training set. This occurs because models learn to exploit shortcuts - source-specific artifacts that do not translate to new distributions - rather than reasonable biomarkers to maximize performance on in-distribution (ID) data. To render the models more robust to distribution shifts, our study investigates the use of fundamental noise injection techniques (Gaussian, Speckle, Poisson, and Salt and Pepper) during training. Our empirical results demonstrate that these techniques can significantly reduce the performance gap between ID and OOD evaluation from 0.10-0.20 to 0.01-0.06, based on results averaged over ten random seeds across key metrics such as AUC, F1, accuracy, recall and specificity. Our source code is publicly available at https://github.com/Duongmai127/Noisy-ood
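The four noise families named in the paper are standard image corruptions. A sketch of each transform for images in [0, 1]; the parameter values are illustrative, not the paper's settings:

```python
import numpy as np

rng = np.random.default_rng(0)

def gaussian(img, sigma=0.05):
    # Additive zero-mean Gaussian noise.
    return np.clip(img + rng.normal(0.0, sigma, img.shape), 0.0, 1.0)

def speckle(img, sigma=0.05):
    # Multiplicative noise: each pixel scaled by (1 + Gaussian).
    return np.clip(img * (1.0 + rng.normal(0.0, sigma, img.shape)), 0.0, 1.0)

def salt_and_pepper(img, amount=0.02):
    # A random fraction of pixels forced to black (pepper) or white (salt).
    out = img.copy()
    mask = rng.random(img.shape)
    out[mask < amount / 2] = 0.0
    out[mask > 1 - amount / 2] = 1.0
    return out

def poisson(img, scale=255.0):
    # Shot noise: treat scaled intensities as Poisson rates.
    return np.clip(rng.poisson(img * scale) / scale, 0.0, 1.0)

# During training, one transform would be applied to each batch, e.g.:
x = rng.random((32, 64, 64))               # a stand-in batch of CXR crops
noisy = gaussian(x)
print(noisy.shape)
```

Injecting such noise at train time degrades the fine-grained, source-specific artifacts a model could shortcut on, which is the mechanism behind the narrowed ID-to-OOD gap.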
--------------------------------------------------------------------------------------------------------
Pulsar Timing Array (PTA) data analysis faces challenges from high-dimensional parameter spaces with strong interdependencies among astrophysical, noise, and nuisance parameters. These issues exemplify broader problems in decision science where model over-parameterization and prior sensitivity compromise computational tractability and reliability. This research introduces Normalizing Flows (NFs) to decorrelate hierarchical prior parameters from astrophysical parameters, combined with i-nessai, a flow-guided nested sampler. Applications extend beyond astronomy to any hierarchical Bayesian inference problem: financial modeling with complex dependencies, climate modeling with interconnected parameters, epidemiological modeling, and computational biology. The methodology improves both statistical robustness and computational efficiency for complex parameter estimation tasks.
Authors: Luigi D'amico, Eleonora Villa, Fatima Modica Bittordo, Aldo Barca, Francesco Alì, Massimo Meneghetti, Luca Naso
Link: https://arxiv.org/abs/2511.03667v1
Date: 2025-11-d
Summary:
Complex inference tasks, such as those encountered in Pulsar Timing Array (PTA) data analysis, rely on Bayesian frameworks. The high-dimensional parameter space and the strong interdependencies among astrophysical, pulsar noise, and nuisance parameters introduce significant challenges for efficient learning and robust inference. These challenges are emblematic of broader issues in decision science, where model over-parameterization and prior sensitivity can compromise both computational tractability and the reliability of the results. We address these issues in the framework of hierarchical Bayesian modeling by introducing a reparameterization strategy. Our approach employs Normalizing Flows (NFs) to decorrelate the parameters governing hierarchical priors from those of astrophysical interest. The use of NF-based mappings provides both the flexibility to realize the reparametrization and the tractability to preserve proper probability densities. We further adopt i-nessai, a flow-guided nested sampler, to accelerate exploration of complex posteriors. This unified use of NFs improves statistical robustness and computational efficiency, providing a principled methodology for addressing hierarchical Bayesian inference in PTA analysis.
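The core move is a bijective reparameterization that decorrelates hierarchical-prior parameters from the parameters of interest while keeping densities tractable via the Jacobian. The paper uses learned Normalizing Flows; the sketch below uses the degenerate case of a fixed linear bijection (Cholesky whitening) on toy correlated samples, which already shows the mechanism:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in: a hierarchical-prior parameter and an astrophysical parameter
# with strong correlation (values are invented for illustration).
cov = np.array([[1.0, 0.8],
                [0.8, 1.0]])
theta = rng.multivariate_normal([0.0, 0.0], cov, size=5000)

# Linear "flow": z = L^{-1} theta, with L the Cholesky factor of cov.
L = np.linalg.cholesky(cov)
z = theta @ np.linalg.inv(L).T

# For a bijection, densities transform with the log-|det Jacobian|; for a
# linear map it is a constant, so the transformed density stays proper.
log_det_J = -np.log(np.abs(np.diag(L))).sum()

print("corr before:", round(float(np.corrcoef(theta.T)[0, 1]), 2))
print("corr after: ", round(float(np.corrcoef(z.T)[0, 1]), 2))
print("log|det J|: ", round(float(log_det_J), 3))
```

A learned NF plays the same role for non-Gaussian, nonlinear dependence, and a flow-guided nested sampler such as i-nessai can then explore the decorrelated space far more efficiently.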
--------------------------------------------------------------------------------------------------------
Watermarking Large Language Models in Europe: Interpreting the AI Act in Light of Technology
The EU AI Act requires LLM providers to mark outputs with methods that are "sufficiently reliable, interoperable, effective and robust," but rapidly evolving watermarking techniques make translating these standards into measurable evaluations challenging. This paper provides a taxonomy categorizing watermarking methods by lifecycle stage (before, during, or after training) and proposes concrete interpretations of EU requirements mapped to state-of-the-art evaluations. Since interoperability remains largely untheorized, three normative dimensions are proposed for assessment. Comparison reveals no current method satisfies all four standards. Applications include informing regulatory compliance, guiding watermarking research priorities, supporting policy development, and establishing frameworks for trustworthy AI deployment in Europe.
Authors: Thomas Souverain
Link: https://arxiv.org/abs/2511.03641v1
Date: 2025-11-d
Summary:
To foster trustworthy Artificial Intelligence (AI) within the European Union, the AI Act requires providers to mark and detect the outputs of their general-purpose models. Article 50 and Recital 133 call for marking methods that are "sufficiently reliable, interoperable, effective and robust". Yet, the rapidly evolving and heterogeneous landscape of watermarks for Large Language Models (LLMs) makes it difficult to determine how these four standards can be translated into concrete and measurable evaluations. Our paper addresses this challenge, anchoring the normativity of European requirements in the multiplicity of watermarking techniques. Introducing clear and distinct concepts on LLM watermarking, our contribution is threefold. (1) Watermarking Categorisation: We propose an accessible taxonomy of watermarking methods according to the stage of the LLM lifecycle at which they are applied - before, during, or after training, and during next-token distribution or sampling. (2) Watermarking Evaluation: We interpret the EU AI Act's requirements by mapping each criterion with state-of-the-art evaluations on robustness and detectability of the watermark, and of quality of the LLM. Since interoperability remains largely untheorised in LLM watermarking research, we propose three normative dimensions to frame its assessment. (3) Watermarking Comparison: We compare current watermarking methods for LLMs against the operationalised European criteria and show that no approach yet satisfies all four standards. Encouraged by emerging empirical tests, we recommend further research into watermarking directly embedded within the low-level architecture of LLMs.
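To ground the "during next-token distribution or sampling" stage of the taxonomy, here is a toy sketch of a well-known green-list watermark: the previous token seeds a pseudorandom partition of the vocabulary, "green" tokens get a logit boost, and detection counts green hits. The vocabulary, parameters, and scale are invented for illustration:

```python
import hashlib, math, random

VOCAB = list(range(1000))                  # toy vocabulary of 1000 token ids
GAMMA, DELTA = 0.5, 4.0                    # green fraction, logit boost

def green_list(prev_token):
    # Seed the partition deterministically from the previous token.
    seed = int(hashlib.sha256(str(prev_token).encode()).hexdigest(), 16)
    return set(random.Random(seed).sample(VOCAB, int(GAMMA * len(VOCAB))))

def sample_token(prev_token, logits, rng):
    # Boost green logits, then sample from the softmax of the boosted logits.
    greens = green_list(prev_token)
    boosted = [l + (DELTA if t in greens else 0.0) for t, l in enumerate(logits)]
    m = max(boosted)
    weights = [math.exp(b - m) for b in boosted]
    return rng.choices(VOCAB, weights=weights)[0]

def detect(tokens):
    # z-score of the green-token count against the unwatermarked expectation.
    hits = sum(t in green_list(p) for p, t in zip(tokens, tokens[1:]))
    n = len(tokens) - 1
    return (hits - GAMMA * n) / math.sqrt(n * GAMMA * (1 - GAMMA))

rng = random.Random(0)
text = [0]
for _ in range(200):
    text.append(sample_token(text[-1], [0.0] * len(VOCAB), rng))
print(f"detection z-score: {detect(text):.1f}")
```

The EU criteria then map onto measurable questions about exactly such schemes: detectability (the z-score), robustness (does it survive paraphrasing?), effectiveness (quality loss from the boost), and interoperability (can another provider's detector read it?).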
--------------------------------------------------------------------------------------------------------
LiveTradeBench: Seeking Real-World Alpha with Large Language Models
Current LLM benchmarks test isolated reasoning in static settings, lacking real-world dynamics and uncertainty. LiveTradeBench introduces live trading evaluation where agents make sequential portfolio allocation decisions using streaming market prices and news across U.S. stocks and Polymarket prediction markets. Unlike backtesting, this eliminates information leakage and captures real-time uncertainty. Evaluating 21 LLMs over 50 days reveals that high benchmark scores don't guarantee superior trading outcomes, models display distinct risk-appetite profiles, and some effectively adapt to live signals. Applications include developing AI financial advisors, testing decision-making under uncertainty, improving sequential reasoning evaluation, and understanding LLM consistency in dynamic environments beyond static problem-solving.
Authors: Haofei Yu, Fenghai Li, Jiaxuan You
Link: https://arxiv.org/abs/2511.03628v1
Date: 2025-11-d
Summary:
Large language models (LLMs) achieve strong performance across benchmarks--from knowledge quizzes and math reasoning to web-agent tasks--but these tests occur in static settings, lacking real dynamics and uncertainty. Consequently, they evaluate isolated reasoning or problem-solving rather than decision-making under uncertainty. To address this, we introduce LiveTradeBench, a live trading environment for evaluating LLM agents in realistic and evolving markets. LiveTradeBench follows three design principles: (i) Live data streaming of market prices and news, eliminating dependence on offline backtesting and preventing information leakage while capturing real-time uncertainty; (ii) a portfolio-management abstraction that extends control from single-asset actions to multi-asset allocation, integrating risk management and cross-asset reasoning; and (iii) multi-market evaluation across structurally distinct environments--U.S. stocks and Polymarket prediction markets--differing in volatility, liquidity, and information flow. At each step, an agent observes prices, news, and its portfolio, then outputs percentage allocations that balance risk and return. Using LiveTradeBench, we run 50-day live evaluations of 21 LLMs across families. Results show that (1) high LMArena scores do not imply superior trading outcomes; (2) models display distinct portfolio styles reflecting risk appetite and reasoning dynamics; and (3) some LLMs effectively leverage live signals to adapt decisions. These findings expose a gap between static evaluation and real-world competence, motivating benchmarks that test sequential decision making and consistency under live uncertainty.
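At each step the portfolio-management abstraction reduces to normalizing the agent's percentage allocations and marking the portfolio to market. A minimal sketch; the asset names and return numbers are invented:

```python
# One step of a multi-asset allocation loop: the agent emits percentage
# allocations, which are normalized to weights and grown by realized returns.

def step(portfolio_value, allocations, returns):
    total = sum(allocations.values())
    weights = {a: w / total for a, w in allocations.items()}   # sum to 1
    return sum(portfolio_value * w * (1 + returns[a])
               for a, w in weights.items())

value = 10_000.0
# Hypothetical agent output spanning stocks, a prediction market, and cash.
agent_output = {"AAPL": 40, "NVDA": 30, "POLYMARKET:CPI<3%": 10, "CASH": 20}
realized = {"AAPL": 0.012, "NVDA": -0.025, "POLYMARKET:CPI<3%": 0.08, "CASH": 0.0}

value = step(value, agent_output, realized)
print(f"portfolio value after one step: ${value:,.2f}")
```

Running this loop against live prices and news, rather than a fixed return table, is what distinguishes the benchmark from backtesting: the agent cannot have seen the next step's returns at decision time.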
--------------------------------------------------------------------------------------------------------
Security and Privacy Management of IoT Using Quantum Computing
Quantum computing threatens classical cryptography securing IoT communications. Algorithms like Shor's and Grover's will render RSA, ECC, and potentially AES vulnerable, necessitating quantum-resilient alternatives. This comprehensive examination explores Post-Quantum Cryptographic (PQC) families—lattice-based, code-based, hash-based, and multivariate approaches—analyzing deployment feasibility in resource-constrained IoT devices. Quantum-based methods like Quantum Key Distribution (QKD) and Quantum Random Number Generators (QRNGs) offer physics-based security guarantees. Applications span securing smart cities, industrial IoT, healthcare devices, and critical infrastructure against future quantum attacks. The chapter emphasizes privacy management, regulatory compliance, and standardization needs, providing frameworks for building cryptographically robust IoT ecosystems in the post-quantum era.
Authors: Jaydip Sen
Link: https://arxiv.org/abs/2511.03538v1
Date: 2025-11-d
Summary:
The convergence of the Internet of Things (IoT) and quantum computing is redefining the security paradigm of interconnected digital systems. Classical cryptographic algorithms such as RSA, Elliptic Curve Cryptography (ECC), and Advanced Encryption Standard (AES) have long provided the foundation for securing IoT communication. However, the emergence of quantum algorithms such as Shor's and Grover's threatens to render these techniques vulnerable, necessitating the development of quantum-resilient alternatives. This chapter examines the implications of quantum computing for IoT security and explores strategies for building cryptographically robust systems in the post-quantum era. It presents an overview of Post-Quantum Cryptographic (PQC) families, including lattice-based, code-based, hash-based, and multivariate approaches, analyzing their potential for deployment in resource-constrained IoT environments. In addition, quantum-based methods such as Quantum Key Distribution (QKD) and Quantum Random Number Generators (QRNGs) are discussed for their ability to enhance confidentiality and privacy through physics-based security guarantees. The chapter also highlights issues of privacy management, regulatory compliance, and standardization, emphasizing the need for collaborative efforts across academia, industry, and governance. Overall, it provides a comprehensive perspective on securing IoT ecosystems against quantum threats and ensuring resilience in the next generation of intelligent networks.
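Of the PQC families surveyed, hash-based schemes are the simplest to illustrate: a Lamport one-time signature rests only on the preimage resistance of a hash, with no algebraic structure for Shor's algorithm to exploit. A sketch (each key signs exactly one message; deployed schemes such as XMSS or SPHINCS+ build many-time signatures on this idea):

```python
import hashlib, secrets

H = lambda b: hashlib.sha256(b).digest()
N = 256                                    # bits in the message digest

def keygen():
    # Secret key: a random preimage pair per digest bit; public key: their hashes.
    sk = [(secrets.token_bytes(32), secrets.token_bytes(32)) for _ in range(N)]
    pk = [(H(a), H(b)) for a, b in sk]
    return sk, pk

def bits(msg):
    digest = int.from_bytes(H(msg), "big")
    return [(digest >> i) & 1 for i in range(N)]

def sign(sk, msg):
    # Reveal, for each digest bit, the preimage matching that bit's value.
    return [pair[bit] for pair, bit in zip(sk, bits(msg))]

def verify(pk, msg, sig):
    return all(H(s) == pair[bit] for s, pair, bit in zip(sig, pk, bits(msg)))

sk, pk = keygen()
msg = b"firmware v1.2 update for sensor node"
sig = sign(sk, msg)
print("valid:  ", verify(pk, msg, sig))
print("tampered:", verify(pk, b"tampered firmware", sig))
```

The trade-off visible even in this toy: keys and signatures are kilobytes rather than bytes, which is exactly the deployment tension the chapter raises for resource-constrained IoT devices.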
--------------------------------------------------------------------------------------------------------