Week Ending 11.23.2025

 

RESEARCH WATCH: 11.23.2025

 

An Artificial Intelligence Framework for Measuring Human Spine Aging Using MRI

Spinal degeneration represents a significant health concern affecting millions worldwide, yet current assessment methods lack standardized biomarkers. This research introduces a deep learning approach to estimate "spine age" from MRI scans, analyzing over 18,000 cases to identify patterns of age-related deterioration. The key innovation involves computing a "spine age gap"—the difference between chronological and predicted spine age—which correlates with specific degenerative conditions like disc bulges and stenosis, as well as lifestyle factors including smoking and physically demanding work. This biomarker could revolutionize preventive medicine by enabling earlier intervention and personalized treatment strategies for at-risk populations.

Authors:  Roozbeh Bazargani, Saqib Abdullah Basar, Daniel Daly-Grafstein, Rodrigo Solis Pompa, Soojin Lee, Saurabh Garg, Yuntong Ma, John A. Carrino, Siavash Khallaghi, Sam Hashemi

Link:  https://arxiv.org/abs/2511.17485v1

Date: 2025-11-d

Summary:

The human spine is a complex structure composed of 33 vertebrae. It supports the body and is essential to a healthy life. The spine is vulnerable to age-related degenerations that can be identified through magnetic resonance imaging (MRI). In this paper we propose a novel computer-vision-based deep learning method to estimate spine age using images from over 18,000 MRI series. Data are restricted to subjects with only age-related spine degeneration. Eligibility criteria are created by identifying common age-based clusters of degenerative spine conditions using uniform manifold approximation and projection (UMAP) and hierarchical density-based spatial clustering of applications with noise (HDBSCAN). Model selection is determined using a detailed ablation study on data size, loss, and the effect of different spine regions. We evaluate the clinical utility of our model by calculating the difference between chronological age and model-predicted spine age, the spine age gap (SAG), and examining the association between these differences and spine degenerative conditions and lifestyle factors. We find that SAG is associated with conditions including disc bulges, disc osteophytes, spinal stenosis, and fractures, as well as lifestyle factors like smoking and physically demanding work, and thus may be a useful biomarker for measuring overall spine health.
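The SAG computation itself is simple to sketch. The cohort values and condition flags below are invented for illustration, not drawn from the paper's data:

```python
# Sketch of the spine age gap (SAG): SAG = predicted spine age - chronological age.
# All subjects and numbers here are hypothetical.

def spine_age_gap(predicted_age: float, chronological_age: float) -> float:
    """Positive SAG suggests a spine 'older' than the subject's actual age."""
    return predicted_age - chronological_age

# Hypothetical cohort: (chronological age, model-predicted spine age, has_disc_bulge)
cohort = [
    (42, 49.5, True),
    (55, 54.0, False),
    (61, 70.2, True),
    (38, 36.9, False),
]

# Compare mean SAG between subjects with and without a degenerative finding.
with_cond = [spine_age_gap(p, c) for c, p, flag in cohort if flag]
without_cond = [spine_age_gap(p, c) for c, p, flag in cohort if not flag]
mean = lambda xs: sum(xs) / len(xs)
print(f"mean SAG with condition:    {mean(with_cond):+.2f} years")
print(f"mean SAG without condition: {mean(without_cond):+.2f} years")
```

In this toy cohort, subjects with the condition have a positive mean gap and the others a slightly negative one, which is the kind of association the paper tests at scale.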

--------------------------------------------------------------------------------------------------------

Masked-and-Reordered Self-Supervision for Reinforcement Learning from Verifiable Rewards

Mathematical reasoning remains a frontier challenge for large language models, particularly in theorem proving where intermediate steps matter but final answers resist direct verification. This work introduces MR-RLVR, a novel training methodology combining masked prediction and step reordering with reinforcement learning to extract learnable signals from intermediate reasoning processes. Testing on mathematical benchmarks reveals 9-10% performance improvements in pass rates compared to baseline approaches. The framework addresses a critical scalability limitation by enabling models to learn from process-level rewards rather than solely outcome-verifiable signals, potentially unlocking breakthroughs in formal mathematics, scientific discovery, and complex problem-solving applications.

Authors:  Zhen Wang, Zhifeng Gao, Guolin Ke

Link:  https://arxiv.org/abs/2511.17473v1

Date: 2025-11-d

Summary:

Test-time scaling has been shown to substantially improve large language models' (LLMs) mathematical reasoning. However, for a large portion of mathematical corpora, especially theorem proving, RLVR's scalability is limited: intermediate reasoning is crucial, while final answers are difficult to directly and reliably verify. Meanwhile, token-level SFT often degenerates into rote memorization rather than inducing longer chains of thought. Inspired by BERT's self-supervised tasks, we propose MR-RLVR (Masked-and-Reordered RLVR), which constructs process-level self-supervised rewards via "masked-then-fill" and "step reordering" to extract learnable signals from intermediate reasoning. Our training pipeline comprises two stages: we first perform self-supervised training on sampled mathematical calculation and proof data; we then conduct RLVR fine-tuning on mathematical calculation datasets where only outcomes are verifiable. We implement MR-RLVR on Qwen2.5-3B and DeepSeek-R1-Distill-Qwen-1.5B, and evaluate on AIME24, AIME25, AMC23, and MATH500. Under a fixed sampling and decoding budget, MR-RLVR achieves average relative gains over the original RLVR of +9.86% Pass@1, +5.27% Pass@5, and +4.00% Pass@8. These results indicate that incorporating process-aware self-supervised signals can effectively enhance RLVR's scalability and performance in only outcome-verifiable settings.
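The two self-supervised signals can be sketched as toy reward functions on a reasoning trace. The trace and the exact reward definitions below are illustrative assumptions, not the paper's implementation:

```python
# Minimal sketch of MR-RLVR's "masked-then-fill" and "step reordering" rewards,
# computed on a toy reasoning trace. The scoring rules here are simplified.

steps = [
    "Let x = 3.",
    "Then x^2 = 9.",
    "So x^2 + 1 = 10.",
]

def masked_fill_reward(original_step: str, model_fill: str) -> float:
    """Reward 1.0 if the model reconstructs the masked-out step verbatim."""
    return 1.0 if model_fill.strip() == original_step.strip() else 0.0

def reorder_reward(original: list, proposed: list) -> float:
    """Fraction of shuffled steps the model placed back at the right position."""
    return sum(a == b for a, b in zip(original, proposed)) / len(original)

# Mask step 1 and imagine the model fills it back in correctly.
print(masked_fill_reward(steps[1], "Then x^2 = 9."))   # 1.0

# The model restores step 0 but swaps the last two steps.
print(reorder_reward(steps, [steps[0], steps[2], steps[1]]))
```

The point is that both rewards are computable from the trace alone, so they provide process-level signal even when the final answer cannot be verified.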

--------------------------------------------------------------------------------------------------------

Sparse Mixture-of-Experts for Multi-Channel Imaging: Are All Channel Interactions Required?

Multi-channel imaging domains like cell painting and satellite imagery present unique computational challenges for Vision Transformers, as modeling all channel interactions creates quadratic complexity in attention mechanisms. This research proposes MoE-ViT, applying mixture-of-experts principles where each channel represents an expert and a lightweight router selects only relevant channels per image patch. Proof-of-concept results on real-world datasets demonstrate substantial efficiency gains without sacrificing accuracy. This approach directly addresses the computational bottleneck that currently limits practical deployment of foundation models in scientific imaging, making advanced vision capabilities accessible for resource-constrained applications in biology and Earth observation.

Authors:  Sukwon Yun, Heming Yao, Burkhard Hoeckendorf, David Richmond, Aviv Regev, Russell Littman

Link:  https://arxiv.org/abs/2511.17400v1

Date: 2025-11-d

Summary:

Vision Transformers (ViTs) have become the backbone of vision foundation models, yet their optimization for multi-channel domains - such as cell painting or satellite imagery - remains underexplored. A key challenge in these domains is capturing interactions between channels, as each channel carries different information. While existing works have shown efficacy by treating each channel independently during tokenization, this approach naturally introduces a major computational bottleneck in the attention block - channel-wise comparisons lead to a quadratic growth in attention, resulting in excessive FLOPs and high training cost. In this work, we shift focus from efficacy to the overlooked efficiency challenge in cross-channel attention and ask: "Is it necessary to model all channel interactions?". Inspired by the philosophy of Sparse Mixture-of-Experts (MoE), we propose MoE-ViT, a Mixture-of-Experts architecture for multi-channel images in ViTs, which treats each channel as an expert and employs a lightweight router to select only the most relevant experts per patch for attention. Proof-of-concept experiments on real-world datasets - JUMP-CP and So2Sat - demonstrate that MoE-ViT achieves substantial efficiency gains without sacrificing performance, and in some cases enhances it, making it a practical and attractive backbone for multi-channel imaging.
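The routing step can be sketched in a few lines. The feature values, router weights, and linear scoring rule below are illustrative assumptions rather than the paper's architecture:

```python
# Sketch of an MoE-style channel router: per patch, score each channel and keep
# only the top-k for attention, instead of attending across all C channels.

def router_topk(channel_features, weights, k):
    """channel_features: C feature vectors (one per channel) for a single patch.
    weights: router vector; score_c = <weights, features_c>. Returns top-k channel
    indices in ascending order."""
    scores = [sum(w * f for w, f in zip(weights, feats)) for feats in channel_features]
    order = sorted(range(len(scores)), key=lambda c: scores[c], reverse=True)
    return sorted(order[:k])

# One patch of a hypothetical 4-channel image, 3-dim features per channel.
patch = [
    [0.1, 0.2, 0.0],   # channel 0
    [0.9, 0.8, 0.7],   # channel 1: strong signal
    [0.0, 0.1, 0.0],   # channel 2
    [0.5, 0.4, 0.6],   # channel 3
]
router_w = [1.0, 1.0, 1.0]

selected = router_topk(patch, router_w, k=2)
print(selected)  # attention then runs only over these channels' tokens
```

With k channels kept instead of all C, the per-patch attention cost over channel tokens drops from O(C^2) toward O(k^2), which is the efficiency lever the paper targets.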

--------------------------------------------------------------------------------------------------------

DISCA: A Digital In-memory Stochastic Computing Architecture Using A Compressed Bent-Pyramid Format

Edge AI deployment faces fundamental constraints from memory bandwidth limitations and energy consumption, particularly for massive matrix multiplication workloads inherent in modern deep learning. This paper presents DISCA, a novel digital in-memory computing architecture combining stochastic computing principles with a compressed data format to dramatically improve efficiency. Post-layout simulations achieve 3.59 TOPS/W per bit using 180nm technology, orders of magnitude beyond conventional approaches. By inheriting analog computing's computational simplicity while preserving digital systems' reliability and scalability, DISCA enables deployment of sophisticated AI models on resource-constrained edge devices, from autonomous vehicles to surveillance drones.

Authors:  Shady Agwa, Yikang Shen, Shiwei Wang, Themis Prodromakis

Link:  https://arxiv.org/abs/2511.17265v1

Date: 2025-11-d

Summary:

Nowadays, we are witnessing an Artificial Intelligence revolution that dominates the technology landscape in various application domains, such as healthcare, robotics, automotive, security, and defense. Massive-scale AI models, which mimic the human brain's functionality, typically feature millions and even billions of parameters through data-intensive matrix multiplication tasks. While conventional von Neumann architectures struggle with the memory wall and the end of Moore's Law, these AI applications are migrating rapidly towards the edge, such as in robotics and unmanned aerial vehicles for surveillance, thereby adding more constraints to the hardware budget of AI architectures at the edge. Although in-memory computing has been proposed as a promising solution for the memory wall, both analog and digital in-memory computing architectures suffer from substantial degradation of the proposed benefits due to various design limitations. We propose a new digital in-memory stochastic computing architecture, DISCA, utilizing a compressed version of the quasi-stochastic Bent-Pyramid data format. DISCA inherits the same computational simplicity of analog computing, while preserving the same scalability, productivity, and reliability of digital systems. Post-layout modeling results of DISCA show an energy efficiency of 3.59 TOPS/W per bit at 500 MHz using a commercial 180nm CMOS technology. Therefore, DISCA improves the energy efficiency of matrix multiplication workloads by orders of magnitude when scaled and compared to counterpart architectures.
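The stochastic-computing principle DISCA builds on can be demonstrated directly: encode a value in [0, 1] as the density of 1s in a bitstream, and a bitwise AND of two independent streams approximates their product. This is a generic illustration; the compressed Bent-Pyramid format itself is not reproduced here:

```python
# Stochastic computing in miniature: multiplication becomes a bitwise AND, which
# is why the hardware can be as simple as analog computing while staying digital.
import random

def to_stream(p, n, rng):
    """Bitstream of length n whose density of 1s encodes p in [0, 1]."""
    return [1 if rng.random() < p else 0 for _ in range(n)]

def stochastic_mul(p, q, n=10_000, seed=0):
    """Approximate p * q by AND-ing two independent streams and counting 1s."""
    rng = random.Random(seed)
    a, b = to_stream(p, n, rng), to_stream(q, n, rng)
    return sum(x & y for x, y in zip(a, b)) / n

est = stochastic_mul(0.6, 0.5)
print(f"0.6 * 0.5 ~ {est:.3f}")  # close to 0.30 for long streams
```

Accuracy grows with stream length, which is the usual trade-off stochastic formats like Bent-Pyramid aim to compress away.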

--------------------------------------------------------------------------------------------------------

Randomness as Reference: Benchmark Metric for Optimization in Engineering

Traditional optimization benchmarks rely on artificial test suites poorly representative of real-world engineering complexity, limiting their ability to predict algorithm performance on genuine design problems. This research introduces a comprehensive benchmark suite of 231 engineering-derived optimization problems and a novel metric using random sampling as statistical reference for unbiased performance comparison. Evaluation of 20 algorithms reveals that commonly-used metaheuristics show severe efficiency loss on realistic problems compared to simple benchmarks. These findings provide practitioners with actionable guidelines for algorithm selection and establish a more transparent, reproducible evaluation framework directly applicable to computational fluid dynamics, finite element analysis, and other engineering simulations.

Authors:  Stefan Ivić, Siniša Družeta, Luka Grbčić

Link:  https://arxiv.org/abs/2511.17226v1

Date: 2025-11-d

Summary:

Benchmarking optimization algorithms is fundamental for the advancement of computational intelligence. However, widely adopted artificial test suites exhibit limited correspondence with the diversity and complexity of real-world engineering optimization tasks. This paper presents a new benchmark suite comprising 231 bounded, continuous, unconstrained optimization problems, the majority derived from engineering design and simulation scenarios, including computational fluid dynamics and finite element analysis models. In conjunction with this suite, a novel performance metric is introduced, which employs random sampling as a statistical reference, providing nonlinear normalization of objective values and enabling unbiased comparison of algorithmic efficiency across heterogeneous problems. Using this framework, 20 deterministic and stochastic optimization methods were systematically evaluated through hundreds of independent runs per problem, ensuring statistical robustness. The results indicate that only a few of the tested optimization methods consistently achieve excellent performance, while several commonly used metaheuristics exhibit severe efficiency loss on engineering-type problems, emphasizing the limitations of conventional benchmarks. Furthermore, the conducted tests are used for analyzing various features of the optimization methods, providing practical guidelines for their application. The proposed test suite and metric together offer a transparent, reproducible, and practically relevant platform for evaluating and comparing optimization methods, thereby narrowing the gap between the available benchmark tests and realistic engineering applications.
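The randomness-as-reference idea can be sketched as a percentile-style score: rate an optimizer's best value by where it falls in the empirical distribution of objective values from pure random sampling. The exact metric in the paper may differ; the test function and budget below are illustrative:

```python
# Sketch of using random sampling as a statistical reference for an optimizer's
# result, giving a nonlinearly normalized score comparable across problems.
import random

def random_reference_score(f, bounds, algo_best, n_random=10_000, seed=0):
    """Fraction of random samples the optimizer beat (1.0 = beats all of them)."""
    rng = random.Random(seed)
    worse = 0
    for _ in range(n_random):
        x = [rng.uniform(lo, hi) for lo, hi in bounds]
        if f(x) > algo_best:  # minimization: a larger objective value is worse
            worse += 1
    return worse / n_random

sphere = lambda x: sum(v * v for v in x)  # toy objective
bounds = [(-5.0, 5.0)] * 3

# An optimizer that reached f = 0.01 beats essentially every random sample.
print(random_reference_score(sphere, bounds, algo_best=0.01))
```

Because the score is a rank against random search on the same problem, it is unitless and insensitive to the raw scale of the objective, which is what makes comparisons across heterogeneous engineering problems meaningful.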

--------------------------------------------------------------------------------------------------------

UI-CUBE: Enterprise-Grade Computer Use Agent Benchmarking Beyond Task Accuracy to Operational Reliability

Current computer use agents demonstrate reasonable accuracy on simple interface tasks but fail catastrophically on realistic enterprise workflows, revealing fundamental architectural limitations poorly captured by existing benchmarks. UI-CUBE introduces 226 carefully-designed tasks across difficulty tiers, exposing sharp capability cliffs: agents achieve 67-85% on simple interactions but plummet to 9-19% on complex workflows. Analysis identifies deficiencies in memory management, hierarchical planning, and state coordination rather than incremental capability gaps. These findings establish a critical diagnostic tool for developing production-ready automation systems and provide clear architectural priorities for advancing computer use agents from proof-of-concept demonstrations to genuinely deployable enterprise tools.

Authors:  Horia Cristescu, Charles Park, Trong Canh Nguyen, Sergiu Talmacel, Alexandru-Gabriel Ilie, Stefan Adam

Link:  https://arxiv.org/abs/2511.17131v1

Date: 2025-11-d

Summary:

While current Computer Use Agent (CUA) benchmarks measure task completion effectively, they provide limited assessment of enterprise deployment readiness, emphasizing functional correctness over the operational reliability required for production systems. We present UI-CUBE (UiPath Computer Use BEnchmark), a systematic benchmark comprising 226 tasks across two difficulty tiers designed to expose fundamental architectural limitations in current CUAs. Our evaluation covers simple UI interactions (136 tasks) and complex workflows including copy-paste tasks (50 tasks) and enterprise application scenarios (40 tasks), with systematic interface variation coverage, multi-resolution testing and automated validation of task success through the application state. Evaluation of five state-of-the-art models reveals a sharp capability cliff rather than gradual performance degradation. Simple UI interactions achieve 67-85% success rates (compared to 97.9% human performance), but complex workflows drop precipitously to 9-19%. Human evaluators with no prior application experience achieve only 61.2% on complex tasks despite near-perfect performance on simple tasks, establishing realistic performance ceilings. This discontinuous performance pattern -- where agents achieve 68-87% of human performance on simple tasks but only 15-32% on complex workflows -- indicates fundamental architectural limitations in memory management, hierarchical planning, and state coordination rather than incremental capability gaps addressable through better training or prompting. UI-CUBE functions as an enterprise-readiness diagnostic, revealing that while current CUAs can manipulate individual interface elements, they cannot yet function as reliable workflow automation tools. These findings provide architectural insights essential for developing production-ready CUAs capable of managing complex, multi-step enterprise processes.

--------------------------------------------------------------------------------------------------------

Supervised Fine Tuning of Large Language Models for Domain Specific Knowledge Graph Construction: A Case Study on Hunan's Historical Celebrities

Domain-specific information extraction remains challenging when general-purpose language models face low-resource settings lacking sufficient training data. This study demonstrates how supervised fine-tuning enhances LLM performance for specialized knowledge graph construction, using Hunan's historical cultural heritage as a test case. Researchers designed schema-guided instruction templates and applied parameter-efficient fine-tuning to multiple models, with Qwen3-8B achieving the strongest extraction score (89.39) on cultural figure extraction. Results confirm that targeted fine-tuning substantially improves extraction quality, establishing cost-effective methodologies for cultural heritage digitization, regional history research, and specialized domain applications where adaptation to local knowledge systems is essential.

Authors:  Junjie Hao, Chun Wang, Ying Qiao, Qiuyue Zuo, Qiya Song, Hua Ma, Xieping Gao

Link:  https://arxiv.org/abs/2511.17012v1

Date: 2025-11-d

Summary:

Large language models and knowledge graphs offer strong potential for advancing research on historical culture by supporting the extraction, analysis, and interpretation of cultural heritage. Using Hunan's modern historical celebrities shaped by Huxiang culture as a case study, pre-trained large models can help researchers efficiently extract key information, including biographical attributes, life events, and social relationships, from textual sources and construct structured knowledge graphs. However, systematic data resources for Hunan's historical celebrities remain limited, and general-purpose models often underperform in domain knowledge extraction and structured output generation in such low-resource settings. To address these issues, this study proposes a supervised fine-tuning approach for enhancing domain-specific information extraction. First, we design a fine-grained, schema-guided instruction template tailored to the Hunan historical celebrities domain and build an instruction-tuning dataset to mitigate the lack of domain-specific training corpora. Second, we apply parameter-efficient instruction fine-tuning to four publicly available large language models - Qwen2.5-7B, Qwen3-8B, DeepSeek-R1-Distill-Qwen-7B, and Llama-3.1-8B-Instruct - and develop evaluation criteria for assessing their extraction performance. Experimental results show that all models exhibit substantial performance gains after fine-tuning. Among them, Qwen3-8B achieves the strongest results, reaching a score of 89.3866 with 100 samples and 50 training iterations. This study provides new insights into fine-tuning vertical large language models for regional historical and cultural domains and highlights their potential for cost-effective applications in cultural heritage knowledge extraction and knowledge graph construction.
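A schema-guided instruction sample might look like the following sketch. The schema fields and the example passage are hypothetical, not taken from the study's dataset or its actual template:

```python
# Sketch of building one instruction-tuning sample for schema-guided entity and
# relation extraction, in the spirit of the paper's fine-grained template.
import json

SCHEMA = {
    "person": ["name", "birth_year", "birthplace"],
    "relations": ["teacher_of", "colleague_of"],
}

def build_instruction(text: str) -> dict:
    return {
        "instruction": (
            "Extract entities and relations from the passage using exactly "
            "this schema, and answer as JSON: " + json.dumps(SCHEMA)
        ),
        "input": text,
        # An "output" field would hold the gold JSON during supervised fine-tuning.
    }

sample = build_instruction("Zeng Guofan (born 1811 in Xiangxiang, Hunan) ...")
print(json.dumps(sample, indent=2, ensure_ascii=False))
```

Embedding the schema verbatim in every instruction is one way to constrain the model's output structure, which matters most in exactly the low-resource settings the paper targets.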

--------------------------------------------------------------------------------------------------------

WorldGen: From Text to Traversable and Interactive 3D Worlds

Creating immersive 3D environments typically demands extensive manual labor from specialized artists and developers, limiting accessibility for creators without technical expertise. WorldGen addresses this barrier by enabling automatic generation of large-scale, interactive, fully-textured 3D worlds directly from natural language descriptions. The system combines LLM-driven scene layout reasoning, procedural generation, diffusion-based 3D synthesis, and object-aware decomposition to produce geometrically consistent, visually rich environments compatible with standard game engines. Applications span gaming, social platforms, and simulation environments, democratizing world creation and potentially accelerating development cycles for immersive experiences while reducing production costs significantly.

Authors:  Dilin Wang, Hyunyoung Jung, Tom Monnier, Kihyuk Sohn, Chuhang Zou, Xiaoyu Xiang, Yu-Ying Yeh, Di Liu, Zixuan Huang, Thu Nguyen-Phuoc, Yuchen Fan, Sergiu Oprea, Ziyan Wang, Roman Shapovalov, Nikolaos Sarafianos, Thibault Groueix, Antoine Toisoul, Prithviraj Dhar, Xiao Chu, Minghao Chen, Geon Yeong Park, Mahima Gupta, Yassir Azziz, Rakesh Ranjan, Andrea Vedaldi

Link:  https://arxiv.org/abs/2511.16825v1

Date: 2025-11-d

Summary:

We introduce WorldGen, a system that enables the automatic creation of large-scale, interactive 3D worlds directly from text prompts. Our approach transforms natural language descriptions into traversable, fully textured environments that can be immediately explored or edited within standard game engines. By combining LLM-driven scene layout reasoning, procedural generation, diffusion-based 3D generation, and object-aware scene decomposition, WorldGen bridges the gap between creative intent and functional virtual spaces, allowing creators to design coherent, navigable worlds without manual modeling or specialized 3D expertise. The system is fully modular and supports fine-grained control over layout, scale, and style, producing worlds that are geometrically consistent, visually rich, and efficient to render in real time. This work represents a step towards accessible, generative world-building at scale, advancing the frontier of 3D generative AI for applications in gaming, simulation, and immersive social environments.

--------------------------------------------------------------------------------------------------------

Cognitive Foundations for Reasoning and Their Manifestation in LLMs

Large language models solve complex problems yet fail on simpler variants, suggesting they employ reasoning mechanisms fundamentally different from human cognition. This research synthesizes cognitive science into a 28-element taxonomy, then analyzes 170,000 reasoning traces from 17 models alongside human think-aloud protocols. Results reveal systematic differences: humans employ hierarchical structures and metacognitive monitoring while models rely on shallow forward chaining. Despite possessing behavioral patterns associated with success, models fail to deploy them spontaneously. Leveraging these insights, researchers developed test-time guidance improving performance by 60% on complex problems, establishing foundations for developing models that reason through principled cognitive mechanisms rather than brittle shortcuts.

Authors:  Priyanka Kargupta, Shuyue Stella Li, Haocheng Wang, Jinu Lee, Shan Chen, Orevaoghene Ahia, Dean Light, Thomas L. Griffiths, Max Kleiman-Weiner, Jiawei Han, Asli Celikyilmaz, Yulia Tsvetkov

Link:  https://arxiv.org/abs/2511.16660v1

Date: 2025-11-d

Summary:

Large language models solve complex problems yet fail on simpler variants, suggesting they achieve correct outputs through mechanisms fundamentally different from human reasoning. We synthesize cognitive science research into a taxonomy of 28 cognitive elements spanning computational constraints, meta-cognitive controls, knowledge representations, and transformation operations, then analyze their behavioral manifestations in reasoning traces. We propose a fine-grained cognitive evaluation framework and conduct the first large-scale analysis of 170K traces from 17 models across text, vision, and audio modalities, alongside 54 human think-aloud traces, which we make publicly available. Our analysis reveals systematic structural differences: humans employ hierarchical nesting and meta-cognitive monitoring while models rely on shallow forward chaining, with divergence most pronounced on ill-structured problems. Meta-analysis of 1,598 LLM reasoning papers reveals the research community concentrates on easily quantifiable behaviors (sequential organization: 55%, decomposition: 60%) while neglecting meta-cognitive controls (self-awareness: 16%, evaluation: 8%) that correlate with success. Models possess behavioral repertoires associated with success but fail to deploy them spontaneously. Leveraging these patterns, we develop test-time reasoning guidance that automatically scaffolds successful structures, improving performance by up to 60% on complex problems. By bridging cognitive science and LLM research, we establish a foundation for developing models that reason through principled cognitive mechanisms rather than brittle spurious reasoning shortcuts or memorization, opening new directions for both improving model capabilities and testing theories of human cognition at scale.

--------------------------------------------------------------------------------------------------------

SAM 3D: 3Dfy Anything in Images

Single-image 3D reconstruction requires predicting complete geometry, texture, and pose from limited visual information where occlusion and scene clutter complicate interpretation. SAM 3D presents a generative model addressing this challenge through an unprecedented human-model-in-the-loop annotation pipeline creating visually-grounded training data at scale. The approach combines synthetic pretraining with real-world alignment, breaking conventional data limitations in 3D learning. Evaluation against recent methods shows at least 5:1 human preference win rates on real-world objects and scenes. Released code, model weights, and benchmarks enable applications in e-commerce, robotics, autonomous systems, and AR/VR environments requiring robust 3D understanding from monocular images.

Authors:  SAM 3D Team, Xingyu Chen, Fu-Jen Chu, Pierre Gleize, Kevin J Liang, Alexander Sax, Hao Tang, Weiyao Wang, Michelle Guo, Thibaut Hardin, Xiang Li, Aohan Lin, Jiawei Liu, Ziqi Ma, Anushka Sagar, Bowen Song, Xiaodong Wang, Jianing Yang, Bowen Zhang, Piotr Dollár, Georgia Gkioxari, Matt Feiszli, Jitendra Malik

Link:  https://arxiv.org/abs/2511.16624v1

Date: 2025-11-d

Summary:

We present SAM 3D, a generative model for visually grounded 3D object reconstruction, predicting geometry, texture, and layout from a single image. SAM 3D excels in natural images, where occlusion and scene clutter are common and visual recognition cues from context play a larger role. We achieve this with a human- and model-in-the-loop pipeline for annotating object shape, texture, and pose, providing visually grounded 3D reconstruction data at unprecedented scale. We learn from this data in a modern, multi-stage training framework that combines synthetic pretraining with real-world alignment, breaking the 3D "data barrier". We obtain significant gains over recent work, with at least a 5:1 win rate in human preference tests on real-world objects and scenes. We will release our code and model weights, an online demo, and a new challenging benchmark for in-the-wild 3D object reconstruction.

--------------------------------------------------------------------------------------------------------

Supervised Contrastive Learning for Few-Shot AI-Generated Image Detection and Attribution

Generative models produce increasingly photorealistic synthetic images faster than detection methods can adapt, creating critical challenges for media authenticity and forensic analysis. This work proposes a two-stage detection framework combining supervised contrastive learning for embedding extraction with few-shot k-nearest neighbors classification. Testing with just 150 images per generator achieves 91.3% detection accuracy, surpassing existing methods. For source attribution, improvements reach 14.7% in AUC metrics. The approach generalizes across unseen generators without exhaustive retraining, providing scalable solutions for digital forensics, news verification, and content authenticity assessment in an era of increasingly sophisticated synthetic media generation.

Authors:  Jaime Álvarez Urueña, David Camacho, Javier Huertas Tato

Link:  https://arxiv.org/abs/2511.16541v2

Date: 2025-11-d

Summary:

The rapid advancement of generative artificial intelligence has enabled the creation of synthetic images that are increasingly indistinguishable from authentic content, posing significant challenges for digital media integrity. This problem is compounded by the accelerated release cycle of novel generative models, which renders traditional detection approaches (reliant on periodic retraining) computationally infeasible and operationally impractical. This work proposes a novel two-stage detection framework designed to address the generalization challenge inherent in synthetic image detection. The first stage employs a vision deep learning model trained via supervised contrastive learning to extract discriminative embeddings from input imagery. Critically, this model was trained on a strategically partitioned subset of available generators, with specific architectures withheld from training to rigorously ablate cross-generator generalization capabilities. The second stage utilizes a k-nearest neighbors (k-NN) classifier operating on the learned embedding space, trained in a few-shot learning paradigm incorporating limited samples from previously unseen test generators. With merely 150 images per class in the few-shot learning regime, which are easily obtainable from current generation models, the proposed framework achieves an average detection accuracy of 91.3%, representing a 5.2 percentage point improvement over existing approaches. For the source attribution task, the proposed approach obtains improvements of 14.70% and 4.27% in AUC and OSCR, respectively, in an open-set classification context, marking a significant advancement toward robust, scalable forensic attribution systems capable of adapting to the evolving generative AI landscape without requiring exhaustive retraining protocols.
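The few-shot second stage can be sketched with toy embeddings. In the paper the embeddings come from the contrastively trained vision model; the two-dimensional vectors, k, and the Euclidean distance below are illustrative choices:

```python
# Sketch of stage two: k-NN majority vote over a learned embedding space, using
# only a handful of labeled support samples (e.g. from an unseen generator).
from collections import Counter
import math

def knn_predict(query, support, k=3):
    """support: list of (embedding, label) pairs. Euclidean k-NN majority vote."""
    nearest = sorted(support, key=lambda s: math.dist(query, s[0]))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Tiny few-shot support set: 3 'real' and 3 'synthetic' toy embeddings.
support = [
    ([0.1, 0.0], "real"), ([0.2, 0.1], "real"), ([0.0, 0.2], "real"),
    ([0.9, 1.0], "synthetic"), ([1.0, 0.8], "synthetic"), ([0.8, 0.9], "synthetic"),
]

print(knn_predict([0.85, 0.95], support))  # lands among the synthetic neighbors
```

Because adapting to a new generator only requires adding its samples to the support set, no retraining of the embedding model is needed, which is the framework's answer to the release-cycle problem.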

--------------------------------------------------------------------------------------------------------

Trustworthy AI in the Agentic Lakehouse: from Concurrency to Governance

Enterprise deployment of AI agents on production data requires infrastructure supporting both computational correctness and governance compliance beyond capabilities of traditional data architectures. This paper argues that trustworthiness emerges from infrastructure design, proposing Bauplan, an agent-first lakehouse implementing data and compute isolation through MVCC-inspired transaction semantics adapted for decoupled, multi-language environments. The reference implementation includes self-healing pipelines seamlessly coupling agent reasoning with correctness guarantees. This addresses the critical infrastructure gap preventing enterprises from trusting agents with sensitive data, enabling safe autonomous workflows in financial systems, healthcare, and other regulated industries.

Authors:  Jacopo Tagliabue, Federico Bianchi, Ciro Greco

Link:  https://arxiv.org/abs/2511.16402v1

Date: 2025-11-d

Summary:

Even as AI capabilities improve, most enterprises do not consider agents trustworthy enough to work on production data. In this paper, we argue that the path to trustworthy agentic workflows begins with solving the infrastructure problem first: traditional lakehouses are not suited for agent access patterns, but if we design one around transactions, governance follows. In particular, we draw an operational analogy to MVCC in databases and show why a direct transplant fails in a decoupled, multi-language setting. We then propose an agent-first design, Bauplan, that reimplements data and compute isolation in the lakehouse. We conclude by sharing a reference implementation of a self-healing pipeline in Bauplan, which seamlessly couples agent reasoning with all the desired guarantees for correctness and trust.
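The MVCC analogy can be illustrated with a toy multi-version store: each agent transaction reads a consistent snapshot while writers commit new versions, so readers never block writers. Bauplan's actual isolation model for decoupled, multi-language compute is far richer than this sketch:

```python
# Toy multi-version concurrency control (MVCC): versions are kept per key with a
# commit timestamp, and a reader sees only versions committed at or before the
# timestamp of its snapshot.

class MVCCStore:
    def __init__(self):
        self.versions = {}  # key -> list of (commit_ts, value), append-only
        self.ts = 0

    def begin(self):
        """Start a transaction; its snapshot is the current commit timestamp."""
        return self.ts

    def read(self, key, snapshot_ts):
        """Latest version committed at or before the snapshot."""
        for commit_ts, value in reversed(self.versions.get(key, [])):
            if commit_ts <= snapshot_ts:
                return value
        return None

    def commit(self, key, value):
        self.ts += 1
        self.versions.setdefault(key, []).append((self.ts, value))

store = MVCCStore()
store.commit("table", "v1")
snap = store.begin()          # agent A begins reading at v1
store.commit("table", "v2")   # agent B commits a new version meanwhile
print(store.read("table", snap))      # A still sees "v1"
print(store.read("table", store.ts))  # a fresh reader sees "v2"
```

The paper's point is that transplanting this directly fails in a decoupled, multi-language lakehouse, which is why Bauplan reimplements the isolation guarantees around data and compute rather than inside a single database engine.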

--------------------------------------------------------------------------------------------------------

Collaborative Management for Chronic Diseases and Depression: A Double Heterogeneity-based Multi-Task Learning Method

Comorbidity between physical chronic diseases and depression requires integrated assessment, yet most health sensing systems address these independently despite their interdependencies. This research formulates multi-disease assessment as a multi-task learning problem, addressing dual challenges: disease heterogeneity (conditions manifest differently) and patient heterogeneity (individuals with the same disease show varied patterns). The proposed ADH-MTL method employs group-level modeling, complexity reduction strategies, and Bayesian networks to capture dependencies while balancing similarities across model components. Results on real-world wearable data demonstrate significant improvements, establishing computational foundations for integrated physical-mental healthcare delivery and advancing collaborative chronic disease management.

Authors:  Yidong Chai, Haoxin Liu, Jiaheng Xie, Chaopeng Wang, Xiao Fang

Link:  https://arxiv.org/abs/2511.16398v1

Date: 2025-11-d

Summary:

Wearable sensor technologies and deep learning are transforming healthcare management. Yet, most health sensing studies focus narrowly on physical chronic diseases. This overlooks the critical need for joint assessment of comorbid physical chronic diseases and depression, which is essential for collaborative chronic care. We conceptualize multi-disease assessment, including both physical diseases and depression, as a multi-task learning (MTL) problem, where each disease assessment is modeled as a task. This joint formulation leverages inter-disease relationships to improve accuracy, but it also introduces the challenge of double heterogeneity: chronic diseases differ in their manifestation (disease heterogeneity), and patients with the same disease show varied patterns (patient heterogeneity). To address these issues, we first adopt existing techniques and propose a base method. Given the limitations of the base method, we further propose an Advanced Double Heterogeneity-based Multi-Task Learning (ADH-MTL) method that improves the base method through three innovations: (1) group-level modeling to support new patient predictions, (2) a decomposition strategy to reduce model complexity, and (3) a Bayesian network that explicitly captures dependencies while balancing similarities and differences across model components. Empirical evaluations on real-world wearable sensor data demonstrate that ADH-MTL significantly outperforms existing baselines, and each of its innovations is shown to be effective. This study contributes to health information systems by offering a computational solution for integrated physical and mental healthcare and provides design principles for advancing collaborative chronic disease management across the pre-treatment, treatment, and post-treatment phases.
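As a rough illustration of the shared-plus-specific structure such an MTL formulation implies, the sketch below combines a shared encoder, a per-disease head, and a per-patient-group offset. The shapes, task names, and group offsets are invented for illustration, not ADH-MTL's actual architecture:

```python
import numpy as np

rng = np.random.default_rng(0)

# Sketch of the MTL structure: each disease (task) gets its own head on top
# of a shared representation; patients are assigned to groups so that new
# patients can reuse group-level parameters (hypothetical shapes and names).
n_features, n_hidden = 8, 4
tasks = ["hypertension", "diabetes", "depression"]
W_shared = rng.normal(size=(n_features, n_hidden))           # shared across tasks
W_task = {t: rng.normal(size=(n_hidden, 1)) for t in tasks}  # disease-specific heads
W_group = {g: rng.normal(size=(n_hidden, 1)) * 0.1 for g in range(3)}  # patient-group offsets

def predict(x, task, group):
    h = np.tanh(x @ W_shared)                    # shared encoding
    logit = h @ (W_task[task] + W_group[group])  # disease head + group offset
    return 1.0 / (1.0 + np.exp(-logit))          # risk score in (0, 1)

x = rng.normal(size=(1, n_features))
scores = {t: float(predict(x, t, group=0)) for t in tasks}
assert all(0.0 < s < 1.0 for s in scores.values())
```

The shared encoder captures inter-disease relationships, the per-task heads absorb disease heterogeneity, and the small group offsets stand in for group-level modeling of patient heterogeneity.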

--------------------------------------------------------------------------------------------------------

Reducing Instability in Synthetic Data Evaluation with a Super-Metric in MalDataGen

Evaluating synthetic malware datasets lacks standardized metrics, creating inconsistency and unreliability in security research. This work integrates a Super-Metric into MalDataGen that aggregates eight metrics across four fidelity dimensions into a single weighted score. Testing across ten generative models and five datasets demonstrates that the Super-Metric exhibits superior stability and consistency compared to traditional metrics, showing stronger correlation with actual classifier performance. This advancement enables more dependable evaluation of synthetic data generation techniques, supporting research in malware detection, adversarial robustness, and dataset generation for cybersecurity applications where evaluation consistency is critical.

Authors:  Anna Luiza Gomes da Silva, Diego Kreutz, Angelo Diniz, Rodrigo Mansilha, Celso Nobre da Fonseca

Link:  https://arxiv.org/abs/2511.16373v1

Date: 2025-11-d

Summary:

Evaluating the quality of synthetic data remains a persistent challenge in the Android malware domain due to instability and the lack of standardization among existing metrics. This work integrates into MalDataGen a Super-Metric that aggregates eight metrics across four fidelity dimensions, producing a single weighted score. Experiments involving ten generative models and five balanced datasets demonstrate that the Super-Metric is more stable and consistent than traditional metrics, exhibiting stronger correlations with the actual performance of classifiers.
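The aggregation idea can be sketched as a weighted average over fidelity dimensions. The metric names, values, and weights below are illustrative placeholders, not MalDataGen's actual configuration:

```python
# Sketch of a Super-Metric: aggregate several fidelity metrics (all scaled to
# [0, 1], higher is better) into one weighted score. The eight metrics and
# the dimension weights here are hypothetical examples.
metrics = {
    "utility":   {"tstr_accuracy": 0.82, "trts_accuracy": 0.79},
    "fidelity":  {"feature_corr": 0.91, "marginal_dist": 0.88},
    "diversity": {"coverage": 0.75, "novelty": 0.70},
    "privacy":   {"dcr_score": 0.95, "membership_risk": 0.90},
}
weights = {"utility": 0.4, "fidelity": 0.3, "diversity": 0.2, "privacy": 0.1}

def super_metric(metrics, weights):
    assert abs(sum(weights.values()) - 1.0) < 1e-9  # weights must sum to 1
    score = 0.0
    for dim, vals in metrics.items():
        dim_score = sum(vals.values()) / len(vals)  # average within dimension
        score += weights[dim] * dim_score
    return score

print(round(super_metric(metrics, weights), 4))
```

Collapsing to a single weighted score is what buys the stability the authors report: per-metric noise averages out within each dimension before the final combination.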

--------------------------------------------------------------------------------------------------------

Can MLLMs Read the Room? A Multimodal Benchmark for Assessing Deception in Multi-Party Social Interactions

Despite advanced reasoning capabilities, state-of-the-art multimodal language models fundamentally lack the human-like ability to assess deception in complex social contexts. This research introduces MIDA, a benchmark featuring synchronized video and text with verified ground-truth labels for every statement in multi-party interactions. Evaluation of 12 leading models reveals significant gaps; even GPT-4o struggles to distinguish truth from falsehood. Analysis identifies failures in grounding language in multimodal social cues and in modeling others' knowledge and intentions. Proposed solutions, Social Chain-of-Thought reasoning and Dynamic Social Epistemic Memory modules, show promise, establishing directions for developing trustworthy AI systems with genuine social reasoning capabilities.

Authors:  Caixin Kang, Yifei Huang, Liangyang Ouyang, Mingfang Zhang, Ruicong Liu, Yoichi Sato

Link:  https://arxiv.org/abs/2511.16221v1

Date: 2025-11-d

Summary:

Despite their advanced reasoning capabilities, state-of-the-art Multimodal Large Language Models (MLLMs) demonstrably lack a core component of human intelligence: the ability to "read the room" and assess deception in complex social interactions. To rigorously quantify this failure, we introduce a new task, Multimodal Interactive Deception Assessment (MIDA), and present a novel multimodal dataset providing synchronized video and text with verifiable ground-truth labels for every statement. We establish a comprehensive benchmark evaluating 12 state-of-the-art open- and closed-source MLLMs, revealing a significant performance gap: even powerful models like GPT-4o struggle to distinguish truth from falsehood reliably. Our analysis of failure modes indicates that these models fail to effectively ground language in multimodal social cues and lack the ability to model what others know, believe, or intend, highlighting the urgent need for novel approaches to building more perceptive and trustworthy AI systems. To take a step forward, we design a Social Chain-of-Thought (SoCoT) reasoning pipeline and a Dynamic Social Epistemic Memory (DSEM) module. Our framework yields performance improvement on this challenging task, demonstrating a promising new path toward building MLLMs capable of genuine human-like social reasoning.

--------------------------------------------------------------------------------------------------------

Quantum artificial intelligence for pattern recognition at high-energy colliders: Tales of Three "Quantum's"

High-energy physics generates massive datasets requiring computationally intensive pattern recognition, motivating exploration of quantum computing approaches for potential efficiency breakthroughs. This review examines three quantum technologies—quantum gates, quantum annealing, and quantum-inspired algorithms—across their current applications in particle physics. Each approach offers distinct advantages and limitations. As collider experiments increase in scale and computational demands grow exponentially, quantum-classical hybrid approaches may enable previously infeasible analyses of fundamental physics phenomena, potentially accelerating discoveries in particle physics and complex data analysis.

Authors:  Hideki Okawa

Link:  https://arxiv.org/abs/2511.16713v1

Date: 2025-11-d

Summary:

Quantum computing applications are an emerging field in high-energy physics. Its ambitious fusion with artificial intelligence is expected to deliver significant efficiency gains over existing methods and/or enable computation from a fundamentally different perspective. High-energy physics is a big data science that utilizes large-scale facilities, detectors, high-performance computing, and its worldwide networks. The experimental workflow consumes a significant amount of computing resources, and its annual cost will continue to grow exponentially at future colliders. In particular, pattern recognition is one of the most crucial and computationally intensive tasks. Three types of quantum computing technologies, i.e., quantum gates, quantum annealing, and quantum-inspired, are all actively investigated for high-energy physics applications, and each has its pros and cons. This article reviews the current status of quantum computing applications for pattern recognition at high-energy colliders.
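For the annealing branch, pattern recognition tasks such as track reconstruction are commonly cast as a QUBO (quadratic unconstrained binary optimization). The toy coefficients below are invented for illustration, and a classical brute-force search stands in for the annealer:

```python
from itertools import product

# Sketch: quantum annealing minimizes a QUBO energy. Here x_i = 1 means
# "keep track segment i"; linear terms reward good segments, quadratic
# terms penalize incompatible pairs (coefficients are hypothetical).
Q = {
    (0, 0): -1.0, (1, 1): -1.0, (2, 2): -0.5,  # per-segment quality
    (0, 1): 2.0,                               # segments 0 and 1 conflict
    (1, 2): -0.5,                              # segments 1 and 2 align well
}

def energy(x, Q):
    return sum(c * x[i] * x[j] for (i, j), c in Q.items())

# A classical brute-force search stands in for the annealer on this
# 3-variable toy; a real device samples low-energy states of the same form.
best = min(product([0, 1], repeat=3), key=lambda x: energy(x, Q))
print(best, energy(best, Q))
```

The minimum keeps the compatible pair (segments 1 and 2) and drops the conflicting segment 0, which is exactly the kind of combinatorial selection an annealer, or a quantum-inspired solver, is asked to perform at scale.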

--------------------------------------------------------------------------------------------------------

A Mathematical Framework for Custom Reward Functions in Job Application Evaluation using Reinforcement Learning

Traditional Applicant Tracking Systems employ rigid keyword-matching, rejecting qualified candidates over minor semantic variations despite substantial skill matches. This work develops a two-step fine-tuning approach using a small language model (under 600M parameters): first, supervised fine-tuning establishes a baseline; then, reinforcement learning through GRPO optimizes a multi-component custom reward function enabling holistic candidate assessment. Addressing reward hacking through iterative reward function refinement, the resulting model achieves 91% accuracy with 0.85 recall and 1.0 precision on unseen test data. This framework demonstrates how properly executed reinforcement learning overcomes limitations of both traditional ATS and naive RL usage in human resources.

Authors:  Shreyansh Jain, Madhav Singhvi, Shreya Rahul Jain, Pranav S, Dishaa Lokesh, Naren Chittibabu, Akash Anandhan

Link:  https://arxiv.org/abs/2511.16073v1

Date: 2025-11-d

Summary:

Conventional Applicant Tracking Systems (ATS) tend to be inflexible keyword-matchers, and deny gifted candidates a role due to a few minor semantic mismatches. This article describes a new two-step process to design a more refined resume evaluation model based on a small language model (<600M parameters) that is finetuned using GRPO on a custom reward function. To begin with, Supervised Fine-Tuning (SFT) was used to build a solid baseline model. Second, this SFT model was also optimized with the help of Reinforcement Learning (RL) through GRPO under the guidance of a new, multi-component reward function that can holistically assess candidates beyond simple keyword matching. We indicate that the RL application presents a critical problem of reward hacking due to the initial experiments of aggressive penalties, which produces faulty, excessively negative model behaviors. We have overcome this challenge by refining the reward function repeatedly and training hyperparameters into a stable "gentle polishing process" of the reward function. Our resulting GRPO-polished model demonstrates significant real-world efficacy, achieving a final accuracy of 91% on unseen test data. The model shows a strong ability to correctly identify qualified candidates (recall of 0.85 for the 'SELECTED' class) while also showing exceptional precision (1.0), confirming its reliability. These results indicate that a properly executed, two-step fine-tuning procedure can indeed effectively refine a small language model to be able to conduct fine-tuned and human-like candidate scoring, overcoming the drawbacks of both traditional ATS and naive RL usage.
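A multi-component reward of the kind described might look like the following sketch. The components, weights, and bounds are hypothetical, chosen only to illustrate the "gentle penalty" idea that curbed reward hacking; they are not the paper's actual reward function:

```python
# Sketch of a multi-component reward with "gentle" penalties: harsh negative
# terms invited reward hacking, so every component is small and bounded.
# All weights below are illustrative assumptions.
def resume_reward(pred_label, true_label, pred_score, skill_overlap, fmt_ok):
    reward = 0.0
    reward += 1.0 if pred_label == true_label else -0.5   # decision correctness
    reward += 0.5 * skill_overlap                          # semantic skill match in [0, 1]
    reward += 0.25 if fmt_ok else -0.25                    # well-formed output
    # Gentle calibration term: small penalty for confident wrong predictions.
    if pred_label != true_label:
        reward -= 0.25 * pred_score
    return max(-1.0, min(2.0, reward))                     # bound the total

good = resume_reward("SELECTED", "SELECTED", 0.9, skill_overlap=0.8, fmt_ok=True)
bad  = resume_reward("SELECTED", "REJECTED", 0.9, skill_overlap=0.2, fmt_ok=True)
assert good > bad
```

Bounding each term keeps the gradient signal informative without making any single failure mode so costly that the policy learns degenerate, "excessively negative" behaviors.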

--------------------------------------------------------------------------------------------------------

Early science acceleration experiments with GPT-5

Advanced language models increasingly assist scientists across disciplines, yet documented evidence of concrete research contributions remains limited. This collection of case studies demonstrates GPT-5's practical impact across mathematics, physics, astronomy, computer science, biology, and materials science. Notably, the work includes four new verified mathematical results, showcasing AI's capacity to help researchers settle previously unsolved problems. These examples illustrate both achievements and limitations, identifying where AI provides significant time savings and where human expertise remains essential. The work underscores how frontier AI enables meaningful collaboration, accelerating research workflows while clarifying remaining barriers to autonomous scientific discovery.

Authors:  Sébastien Bubeck, Christian Coester, Ronen Eldan, Timothy Gowers, Yin Tat Lee, Alexandru Lupsasca, Mehtaab Sawhney, Robert Scherrer, Mark Sellke, Brian K. Spears, Derya Unutmaz, Kevin Weil, Steven Yin, Nikita Zhivotovskiy

Link:  https://arxiv.org/abs/2511.16072v1

Date: 2025-11-d

Summary:

AI models like GPT-5 are an increasingly valuable tool for scientists, but many remain unaware of the capabilities of frontier AI. We present a collection of short case studies in which GPT-5 produced new, concrete steps in ongoing research across mathematics, physics, astronomy, computer science, biology, and materials science. In these examples, the authors highlight how AI accelerated their work, and where it fell short; where expert time was saved, and where human input was still key. We document the interactions of the human authors with GPT-5, as guiding examples of fruitful collaboration with AI. Of note, this paper includes four new results in mathematics (carefully verified by the human authors), underscoring how GPT-5 can help human mathematicians settle previously unsolved problems. These contributions are modest in scope but profound in implication, given the rate at which frontier AI is progressing.

--------------------------------------------------------------------------------------------------------

Detecting Sleeper Agents in Large Language Models via Semantic Drift Analysis

Backdoored language models can exhibit latent malicious behavior under specific deployment conditions while appearing safe during testing, a critical security vulnerability in AI systems. This research presents a dual-method detection system combining semantic drift analysis with canary baseline comparison, achieving 92.5% accuracy with 100% precision and 85% recall on official sleeper agent models. Using Sentence-BERT embeddings to measure semantic deviation, the approach operates in real time without model modification. These findings address an urgent security gap in AI deployment, providing practical defenses against deceptive model behavior and establishing foundations for safer production AI systems.

Authors:  Shahin Zanbaghi, Ryan Rostampour, Farhan Abid, Salim Al Jarmakani

Link:  https://arxiv.org/abs/2511.15992v1

Date: 2025-11-d

Summary:

Large Language Models (LLMs) can be backdoored to exhibit malicious behavior under specific deployment conditions while appearing safe during training, a phenomenon known as "sleeper agents." Recent work by Hubinger et al. demonstrated that these backdoors persist through safety training, yet no practical detection methods exist. We present a novel dual-method detection system combining semantic drift analysis with canary baseline comparison to identify backdoored LLMs in real-time. Our approach uses Sentence-BERT embeddings to measure semantic deviation from safe baselines, complemented by injected canary questions that monitor response consistency. Evaluated on the official Cadenza-Labs dolphin-llama3-8B sleeper agent model, our system achieves 92.5% accuracy with 100% precision (zero false positives) and 85% recall. The combined detection method operates in real-time (<1s per query), requires no model modification, and provides the first practical solution to LLM backdoor detection. Our work addresses a critical security gap in AI deployment and demonstrates that embedding-based detection can effectively identify deceptive model behavior without sacrificing deployment efficiency.
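The two detection signals can be sketched as follows. Real Sentence-BERT embeddings are replaced here with toy 3-d vectors, and the drift threshold is an illustrative assumption, not the paper's tuned value:

```python
import math

# Sketch of the dual detector: (1) cosine drift of a response embedding from
# a safe-baseline centroid, (2) canary questions whose answers must not
# change. Toy vectors stand in for Sentence-BERT embeddings.
def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def is_suspicious(response_vec, baseline_centroid, canary_answers,
                  expected_canaries, drift_threshold=0.7):
    drift_flag = cosine(response_vec, baseline_centroid) < drift_threshold
    canary_flag = canary_answers != expected_canaries
    return drift_flag or canary_flag      # flag if either detector fires

baseline = [1.0, 0.0, 0.0]               # centroid of known-safe responses
safe_vec = [0.9, 0.1, 0.0]               # close to the safe baseline
drifted  = [0.1, 0.9, 0.4]               # semantically far from baseline
assert not is_suspicious(safe_vec, baseline, ["ok"], ["ok"])
assert is_suspicious(drifted, baseline, ["ok"], ["ok"])
assert is_suspicious(safe_vec, baseline, ["changed"], ["ok"])
```

Because both checks are simple vector and equality comparisons once embeddings exist, the sub-second, no-model-modification properties reported in the abstract are plausible by construction.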

--------------------------------------------------------------------------------------------------------

Thinking, Faithful and Stable: Mitigating Hallucinations in LLMs

Language models frequently hallucinate, generating plausible-sounding but incorrect information during multi-step reasoning and undermining reliability for high-stakes applications. This work develops a self-correcting framework leveraging fine-grained uncertainty signals: confidence alignment and token-level entropy spikes to detect unreliable reasoning in real time. A composite reward function penalizes unjustified confidence and entropy anomalies while encouraging stable reasoning trajectories. Reinforcement learning guides models toward introspection, improving both final answer accuracy and reasoning calibration. Ablation studies validate each component's contribution, establishing mechanisms for developing more trustworthy, faithful language models suitable for applications requiring transparent, verifiable reasoning chains.

Authors:  Chelsea Zou, Yiheng Yao, Basant Khalil

Link:  https://arxiv.org/abs/2511.15921v1

Date: 2025-11-d

Summary:

This project develops a self-correcting framework for large language models (LLMs) that detects and mitigates hallucinations during multi-step reasoning. Rather than relying solely on final answer correctness, our approach leverages fine-grained uncertainty signals: 1) self-assessed confidence alignment, and 2) token-level entropy spikes to detect unreliable and unfaithful reasoning in real time. We design a composite reward function that penalizes unjustified high confidence and entropy spikes, while encouraging stable and accurate reasoning trajectories. These signals guide a reinforcement learning (RL) policy that makes the model more introspective and shapes the model's generation behavior through confidence-aware reward feedback, improving not just outcome correctness but the coherence and faithfulness of its intermediate reasoning steps. Experiments show that our method improves both final answer accuracy and reasoning calibration, with ablations validating the individual contribution of each signal.
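The entropy-spike signal can be sketched as follows. The z-score spike rule and the threshold are illustrative assumptions, not the paper's exact detector:

```python
import math

# Sketch of the token-level entropy-spike signal: compute Shannon entropy of
# each next-token distribution and flag steps whose entropy jumps well above
# the sequence mean (the z-score rule and threshold are illustrative).
def entropy(probs):
    return -sum(p * math.log(p) for p in probs if p > 0)

def spike_steps(distributions, k=2.0):
    ents = [entropy(p) for p in distributions]
    mean = sum(ents) / len(ents)
    sd = math.sqrt(sum((e - mean) ** 2 for e in ents) / len(ents))
    return [i for i, e in enumerate(ents) if e > mean + k * sd]

# Mostly confident steps, with one near-uniform (uncertain) step at index 3.
dists = [
    [0.97, 0.01, 0.01, 0.01],
    [0.95, 0.03, 0.01, 0.01],
    [0.96, 0.02, 0.01, 0.01],
    [0.25, 0.25, 0.25, 0.25],   # entropy spike: the model is guessing
    [0.94, 0.04, 0.01, 0.01],
]
print(spike_steps(dists, k=1.5))
```

In the framework described, flagged steps like index 3 would feed the composite reward as entropy anomalies, penalizing trajectories that press on through high-uncertainty tokens.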

--------------------------------------------------------------------------------------------------------


EYE ON A.I. GETS READERS UP TO DATE ON THE LATEST FUNDING NEWS AND RELATED ISSUES. SUBSCRIBE FOR THE WEEKLY NEWSLETTER.

SUBSCRIBE