Week Ending 10.5.2025

 

RESEARCH WATCH: 10.5.2025

 

Reward Models are Metrics in a Trench Coat

As reinforcement learning transforms how we fine-tune large language models, reward models have emerged as critical components for assessing output quality. This paper bridges two historically separate fields—reward modeling and evaluation metrics—revealing how they face identical challenges like spurious correlations and reward hacking. The authors demonstrate that traditional metrics often outperform reward models on specific tasks, suggesting we've been reinventing the wheel. Applications include improving RLHF pipelines, developing robust evaluation frameworks, and creating better preference learning systems that avoid common pitfalls by leveraging decades of metrics research.

Authors:  Sebastian Gehrmann

Link:  https://arxiv.org/abs/2510.03231v1

Date: 2025-10-d

Summary:

The emergence of reinforcement learning in post-training of large language models has sparked significant interest in reward models. Reward models assess the quality of sampled model outputs to generate training signals. This task is also performed by evaluation metrics that monitor the performance of an AI model. We find that the two research areas are mostly separate, leading to redundant terminology and repeated pitfalls. Common challenges include susceptibility to spurious correlations, impact on downstream reward hacking, methods to improve data quality, and approaches to meta-evaluation. Our position paper argues that a closer collaboration between the fields can help overcome these issues. To that end, we show how metrics outperform reward models on specific tasks and provide an extensive survey of the two areas. Grounded in this survey, we point to multiple research topics in which closer alignment can improve reward models and metrics in areas such as preference elicitation methods, avoidance of spurious correlations and reward hacking, and calibration-aware meta-evaluation.
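
The paper's framing, that reward models and evaluation metrics play the same role of scoring sampled outputs, can be made concrete with a minimal sketch. The snippet below is not from the paper; it shows a toy lexical-overlap metric and a placeholder learned reward model exposing the same scalar-scoring interface that an RLHF loop consumes. All names and the overlap heuristic are illustrative assumptions.

```python
from typing import Callable

# In RLHF-style post-training, a "reward" is any scalar scorer of a sampled output.
RewardFn = Callable[[str, str], float]

def overlap_metric(response: str, reference: str) -> float:
    """Toy lexical-overlap metric (a stand-in for ROUGE/BLEU-style evaluation metrics)."""
    ref = set(reference.lower().split())
    out = set(response.lower().split())
    return len(ref & out) / len(out) if out else 0.0

class LearnedRewardModel:
    """Placeholder for a preference-trained scorer of (prompt, response) pairs."""
    def score(self, prompt: str, response: str) -> float:
        return 0.0  # a real implementation would run a fine-tuned transformer head

def make_metric_reward(reference: str) -> RewardFn:
    # An evaluation metric slots into the same interface as a learned reward model.
    return lambda prompt, response: overlap_metric(response, reference)

def make_model_reward(rm: LearnedRewardModel) -> RewardFn:
    return lambda prompt, response: rm.score(prompt, response)

# Either RewardFn can now feed the policy-gradient update in an RLHF loop unchanged.
```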

--------------------------------------------------------------------------------------------------------

Self-Anchor: Large Language Model Reasoning via Step-by-step Attention Alignment

Complex reasoning remains a frontier challenge for LLMs, with longer chains causing critical steps to get buried in context. Self-Anchor addresses this through structured attention steering—decomposing reasoning into plans and automatically aligning model focus to relevant inference steps. This lightweight prompting method notably bridges the performance gap between standard and specialized reasoning models without retraining. Applications span mathematical problem-solving, multi-step planning tasks, code generation with complex logic, and any domain requiring sustained logical coherence across extended reasoning chains where maintaining context is crucial.

Authors:  Hongxiang Zhang, Yuan Tian, Tianyi Zhang

Link:  https://arxiv.org/abs/2510.03223v1

Date: 2025-10-d

Summary:

To solve complex reasoning tasks for Large Language Models (LLMs), prompting-based methods offer a lightweight alternative to fine-tuning and reinforcement learning. However, as reasoning chains extend, critical intermediate steps and the original prompt will be buried in the context, receiving insufficient attention and leading to errors. In this paper, we propose Self-Anchor, a novel pipeline that leverages the inherent structure of reasoning to steer LLM attention. Self-Anchor decomposes reasoning trajectories into structured plans and automatically aligns the model's attention to the most relevant inference steps, allowing the model to maintain focus throughout generation. Our experiment shows that Self-Anchor outperforms SOTA prompting methods across six benchmarks. Notably, Self-Anchor significantly reduces the performance gap between "non-reasoning" models and specialized reasoning models, with the potential to enable most LLMs to tackle complex reasoning tasks without retraining.
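
The abstract does not spell out how the attention alignment is implemented, so the sketch below only illustrates the plan-then-anchor idea at the prompting level: decompose the problem into a numbered plan, then restate the current plan step at each generation call so it stays salient as the chain grows. It is a hypothetical scaffold, not the authors' attention-steering mechanism, and the `generate` wrapper is assumed.

```python
def generate(prompt: str) -> str:
    """Hypothetical wrapper around an LLM completion call."""
    raise NotImplementedError

def self_anchored_reasoning(problem: str) -> str:
    # 1. Ask the model for a structured plan (the anchor for later steps).
    plan_text = generate(f"Break this problem into a short numbered plan:\n{problem}")
    plan = [line for line in plan_text.splitlines() if line.strip()]

    # 2. Execute steps one at a time, restating the current plan item so it is
    #    not buried deep in the context as the reasoning chain grows.
    transcript = []
    for i, step in enumerate(plan, start=1):
        prompt = (
            f"Problem: {problem}\n"
            f"Current plan step ({i}/{len(plan)}): {step}\n"
            f"Work so far:\n" + "\n".join(transcript) + "\n"
            "Carry out only this step."
        )
        transcript.append(generate(prompt))

    # 3. Final answer conditioned on the completed steps.
    return generate(f"Problem: {problem}\nSteps:\n" + "\n".join(transcript) + "\nFinal answer:")
```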

--------------------------------------------------------------------------------------------------------

CoDA: Agentic Systems for Collaborative Data Visualization

Data scientists spend countless hours manually crafting visualizations despite advances in AI automation. CoDA reimagines this challenge through multi-agent collaboration, where specialized LLM agents handle metadata analysis, task planning, code generation, and quality refinement. Unlike single-agent approaches that fail on complex, multi-file datasets, CoDA's workflow, built on metadata-focused analysis and iterative refinement, outperforms competitive baselines by up to 41.5%. Applications include automated dashboard generation, exploratory data analysis, business intelligence reporting, and scientific visualization where natural language queries need translation into sophisticated, error-free visualizations across complex data structures.

Authors:  Zichen Chen, Jiefeng Chen, Sercan Ö. Arik, Misha Sra, Tomas Pfister, Jinsung Yoon

Link:  https://arxiv.org/abs/2510.03194v1

Date: 2025-10-d

Summary:

Deep research has revolutionized data analysis, yet data scientists still devote substantial time to manually crafting visualizations, highlighting the need for robust automation from natural language queries. However, current systems struggle with complex datasets containing multiple files and iterative refinement. Existing approaches, including simple single- or multi-agent systems, often oversimplify the task, focusing on initial query parsing while failing to robustly manage data complexity, code errors, or final visualization quality. In this paper, we reframe this challenge as a collaborative multi-agent problem. We introduce CoDA, a multi-agent system that employs specialized LLM agents for metadata analysis, task planning, code generation, and self-reflection. We formalize this pipeline, demonstrating how metadata-focused analysis bypasses token limits and quality-driven refinement ensures robustness. Extensive evaluations show CoDA achieves substantial gains in the overall score, outperforming competitive baselines by up to 41.5%. This work demonstrates that the future of visualization automation lies not in isolated code generation but in integrated, collaborative agentic workflows.
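
The pipeline described above (metadata analysis, task planning, code generation, self-reflection) can be pictured as a simple agent loop. This is a schematic sketch under assumptions, not CoDA's released implementation; each `Agent.run` stands in for a role-prompted LLM call, and the stopping check is illustrative.

```python
from dataclasses import dataclass, field

@dataclass
class Agent:
    role: str
    def run(self, task: str, context: str = "") -> str:
        """Placeholder for a role-prompted LLM call."""
        return f"[{self.role} output for: {task}]"  # hypothetical

@dataclass
class VisualizationPipeline:
    metadata_agent: Agent = field(default_factory=lambda: Agent("metadata analysis"))
    planner: Agent = field(default_factory=lambda: Agent("task planning"))
    coder: Agent = field(default_factory=lambda: Agent("code generation"))
    reviewer: Agent = field(default_factory=lambda: Agent("self-reflection"))

    def visualize(self, query: str, file_paths: list[str], max_rounds: int = 3) -> str:
        # Metadata-focused analysis keeps large raw files out of the context window.
        metadata = self.metadata_agent.run(f"Summarize schemas of {file_paths}")
        plan = self.planner.run(query, context=metadata)
        code = self.coder.run(plan, context=metadata)
        for _ in range(max_rounds):
            critique = self.reviewer.run("Check for errors and chart quality", context=code)
            if "OK" in critique:  # quality-driven stopping criterion (illustrative)
                break
            code = self.coder.run(f"Revise per critique: {critique}", context=metadata)
        return code
```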

--------------------------------------------------------------------------------------------------------

Simulation to Rules: A Dual-VLM Framework for Formal Visual Planning

Visual planning requires both perceptual understanding and logical reasoning—capabilities split between Vision Language Models and PDDL planners. VLMFP bridges this gap using dual VLMs: SimVLM simulates action consequences while GenVLM generates and refines PDDL files through comparison. This eliminates reliance on human-defined domain files or constant environment access. Applications include robotic task planning, automated game playing, visual puzzle solving, and any scenario requiring formal planning from visual inputs. The framework's ability to generalize across different appearances and rules makes it valuable for autonomous systems operating in diverse visual environments.

Authors:  Yilun Hao, Yongchao Chen, Chuchu Fan, Yang Zhang

Link:  https://arxiv.org/abs/2510.03182v1

Date: 2025-10-d

Summary:

Vision Language Models (VLMs) show strong potential for visual planning but struggle with precise spatial and long-horizon reasoning. In contrast, Planning Domain Definition Language (PDDL) planners excel at long-horizon formal planning, but cannot interpret visual inputs. Recent works combine these complementary advantages by enabling VLMs to turn visual planning problems into PDDL files for formal planning. However, while VLMs can generate PDDL problem files satisfactorily, they struggle to accurately generate the PDDL domain files, which describe all the planning rules. As a result, prior methods rely on human experts to predefine domain files or on constant environment access for refinement. We propose VLMFP, a Dual-VLM-guided framework that can autonomously generate both PDDL problem and domain files for formal visual planning. VLMFP introduces two VLMs to ensure reliable PDDL file generation: a SimVLM that simulates action consequences based on input rule descriptions, and a GenVLM that generates and iteratively refines PDDL files by comparing the PDDL and SimVLM execution results. VLMFP unleashes multiple levels of generalizability: the same generated PDDL domain file works for all the different instances under the same problem, and VLMs generalize to different problems with varied appearances and rules. We evaluate VLMFP with 6 grid-world domains and test its generalization to unseen instances, appearances, and game rules. On average, SimVLM accurately describes 95.5% and 82.6% of scenarios, simulates 85.5% and 87.8% of action sequences, and correctly judges goal reaching in 82.4% and 85.6% of cases for seen and unseen appearances, respectively. With the guidance of SimVLM, VLMFP generates PDDL files that yield valid plans for 70.0% and 54.1% of unseen instances in seen and unseen appearances, respectively. Project page: https://sites.google.com/view/vlmfp.
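
The dual-VLM interplay, where SimVLM predicts action outcomes from the rule description and GenVLM revises its PDDL drafts until a symbolic planner's trace agrees with the simulation, can be sketched as a refinement loop. This is one reading of the abstract with hypothetical helper functions, not the project's code.

```python
def sim_vlm_predict(rules: str, state: str, action: str) -> str:
    """Hypothetical SimVLM-style rollout of one action from a textual rule description."""
    raise NotImplementedError

def gen_vlm_draft(rules: str, image_desc: str, feedback: str = "") -> tuple[str, str]:
    """Hypothetical GenVLM-style generation of (domain.pddl, problem.pddl) text."""
    raise NotImplementedError

def plan_and_execute(domain: str, problem: str) -> list[tuple[str, str]]:
    """Hypothetical call to a classical PDDL planner, returning (action, resulting_state) pairs."""
    raise NotImplementedError

def refine_pddl(rules: str, image_desc: str, init_state: str, max_iters: int = 5) -> tuple[str, str]:
    feedback = ""
    for _ in range(max_iters):
        domain, problem = gen_vlm_draft(rules, image_desc, feedback)
        trace = plan_and_execute(domain, problem)
        # Cross-check the planner's trace against SimVLM's predicted consequences.
        state, mismatches = init_state, []
        for action, planner_state in trace:
            state = sim_vlm_predict(rules, state, action)
            if state != planner_state:
                mismatches.append((action, state, planner_state))
        if not mismatches:
            return domain, problem  # PDDL files consistent with simulated dynamics
        feedback = f"Planner disagrees with simulation at: {mismatches}"
    return domain, problem
```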

--------------------------------------------------------------------------------------------------------

CHORD: Customizing Hybrid-precision On-device Model for Sequential Recommendation with Device-cloud Collaboration

Deploying recommendation models on mobile devices promises real-time personalization but faces severe resource constraints. CHORD tackles this through personalized quantization—using cloud-based hypernetworks to identify user-specific critical parameters, then applying mixed-precision quantization on-device. This achieves model compression without sacrificing personalization accuracy or requiring expensive local retraining. Applications include mobile shopping recommendations, content feeds, music streaming suggestions, and any sequential recommendation scenario where privacy concerns or latency requirements demand on-device inference while maintaining personalized experiences across heterogeneous device capabilities.

Authors:  Tianqi Liu, Kairui Fu, Shengyu Zhang, Wenyan Fan, Zhaocheng Du, Jieming Zhu, Fan Wu, Fei Wu

Link:  https://arxiv.org/abs/2510.03038v1

Date: 2025-10-d

Summary:

With the advancement of mobile device capabilities, deploying reranking models directly on devices has become feasible, enabling real-time contextual recommendations. When migrating models from cloud to devices, resource heterogeneity inevitably necessitates model compression. Recent quantization methods show promise for efficient deployment, yet they overlook device-specific user interests, resulting in compromised recommendation accuracy. While on-device finetuning captures personalized user preference, it imposes additional computational burden through local retraining. To address these challenges, we propose a framework for Customizing Hybrid-precision On-device model for sequential Recommendation with Device-cloud collaboration (CHORD), leveraging channel-wise mixed-precision quantization to simultaneously achieve personalization and resource-adaptive deployment. CHORD distributes randomly initialized models across heterogeneous devices and identifies user-specific critical parameters through auxiliary hypernetwork modules on the cloud. Our parameter sensitivity analysis operates across multiple granularities (layer, filter, and element levels), enabling precise mapping from user profiles to quantization strategy. Through on-device mixed-precision quantization, CHORD delivers dynamic model adaptation and accelerated inference without backpropagation, eliminating costly retraining cycles. We minimize communication overhead by encoding quantization strategies using only 2 bits per channel instead of 32-bit weights. Experiments on three real-world datasets with two popular backbones (SASRec and Caser) demonstrate the accuracy, efficiency, and adaptivity of CHORD.
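
One concrete detail in the abstract is the communication trick: rather than shipping 32-bit weights, the cloud transmits a per-channel quantization strategy encoded in 2 bits. A minimal sketch of that encoding is below, assuming four candidate bit-widths per channel; the sensitivity scores that CHORD's hypernetwork would produce are mocked with random values.

```python
import numpy as np

# Assume four candidate precisions, so each channel's choice fits in 2 bits.
BITWIDTH_CODES = {0: 2, 1: 4, 2: 8, 3: 16}

def choose_strategy(channel_sensitivity: np.ndarray) -> np.ndarray:
    """Map per-channel sensitivity (e.g., from a cloud-side hypernetwork) to 2-bit codes."""
    # Higher sensitivity -> keep more bits. The quartile split is purely illustrative.
    quartiles = np.quantile(channel_sensitivity, [0.25, 0.5, 0.75])
    return np.digitize(channel_sensitivity, quartiles).astype(np.uint8)  # values 0..3

def pack_codes(codes: np.ndarray) -> bytes:
    """Pack four 2-bit codes per byte for transmission to the device."""
    padded = np.pad(codes, (0, (-len(codes)) % 4))
    grouped = padded.reshape(-1, 4)
    packed = grouped[:, 0] | (grouped[:, 1] << 2) | (grouped[:, 2] << 4) | (grouped[:, 3] << 6)
    return packed.astype(np.uint8).tobytes()

sensitivity = np.random.rand(512)      # mocked per-channel sensitivity scores
codes = choose_strategy(sensitivity)   # 2 bits of strategy per channel
payload = pack_codes(codes)            # 512 channels of strategy fit in 128 bytes
```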

--------------------------------------------------------------------------------------------------------

Investigating The Smells of LLM Generated Code

While LLMs revolutionize code generation, quality concerns persist beyond functional correctness. This study systematically evaluates code smells—indicators of poor design and maintenance issues—in LLM-generated Java code. Finding 63% more smells than human-written code, with degradation worsening for complex tasks, the research reveals critical quality gaps. Applications include improving code generation models, developing quality-aware prompting strategies, creating automated code review tools for AI-generated code, and establishing coding standards for LLM-assisted development. These insights are crucial for organizations adopting AI coding assistants while maintaining software quality standards.

Authors:  Debalina Ghosh Paul, Hong Zhu, Ian Bayley

Link:  https://arxiv.org/abs/2510.03029v1

Date: 2025-10-d

Summary:

Context: Large Language Models (LLMs) are increasingly being used to generate program code. Much research has been reported on the functional correctness of generated code, but there is far less on code quality.

Objectives: In this study, we propose a scenario-based method of evaluating the quality of LLM-generated code to identify the weakest scenarios in which the quality of LLM-generated code should be improved.

Methods: The method measures code smells, an important indicator of code quality, and compares them with a baseline formed from reference solutions of professionally written code. The test dataset is divided into various subsets according to the topics of the code and the complexity of the coding tasks to represent different scenarios of using LLMs for code generation. We also present an automated test system for this purpose and report experiments with the Java programs generated in response to prompts given to four state-of-the-art LLMs: Gemini Pro, ChatGPT, Codex, and Falcon.

Results: We find that LLM-generated code has a higher incidence of code smells compared to reference solutions. Falcon performed the least badly, with a smell increase of 42.28%, followed by Gemini Pro (62.07%), ChatGPT (65.05%) and finally Codex (84.97%). The average smell increase across all LLMs was 63.34%, comprising 73.35% for implementation smells and 21.42% for design smells. We also found that the increase in code smells is greater for more complex coding tasks and for more advanced topics, such as those involving object-oriented concepts.

Conclusion: In terms of code smells, LLMs' performance across coding task complexities and topics is highly correlated with the quality of human-written code in the corresponding scenarios. However, the quality of LLM-generated code is noticeably poorer than that of human-written code.

--------------------------------------------------------------------------------------------------------

AI Generated Child Sexual Abuse Material -- What's the Harm?

The emergence of AI-generated CSAM presents unprecedented challenges for child protection and law enforcement. This paper systematically examines the multifaceted harms: revictimization of survivors, grooming facilitation, normalization of exploitation, and potential pathways to offending. The authors challenge claims of "victimless" synthetic content, demonstrating how AI CSAM undermines protective factors and desensitizes users. Applications include informing policy frameworks, developing detection technologies, establishing legal precedents, and creating intervention strategies. This critical analysis provides essential grounding for researchers, policymakers, and tech companies grappling with emerging AI safety challenges in child protection.

Authors:  Caoilte Ó Ciardha, John Buckley, Rebecca S. Portnoff

Link:  https://arxiv.org/abs/2510.02978v1

Date: 2025-10-d

Summary:

The development of generative artificial intelligence (AI) tools capable of producing wholly or partially synthetic child sexual abuse material (AI CSAM) presents profound challenges for child protection, law enforcement, and societal responses to child exploitation. While some argue that the harmfulness of AI CSAM differs fundamentally from other CSAM due to a perceived absence of direct victimization, this perspective fails to account for the range of risks associated with its production and consumption. AI has been implicated in the creation of synthetic CSAM of children who have not previously been abused, the revictimization of known survivors of abuse, the facilitation of grooming, coercion and sexual extortion, and the normalization of child sexual exploitation. Additionally, AI CSAM may serve as a new or enhanced pathway into offending by lowering barriers to engagement, desensitizing users to progressively extreme content, and undermining protective factors for individuals with a sexual interest in children. This paper provides a primer on some key technologies, critically examines the harms associated with AI CSAM, and cautions against claims that it may function as a harm reduction tool, emphasizing how some appeals to harmlessness obscure its real risks and may contribute to inertia in ecosystem responses.

--------------------------------------------------------------------------------------------------------

Assessment Twins: A Protocol for AI-Vulnerable Summative Assessment

Generative AI threatens the validity of traditional academic assessment, demanding fundamental redesign rather than surveillance-based solutions. The assessment twins approach pairs complementary evaluation formats—such as essays with oral defenses—that address identical learning outcomes through different modes of evidence. This triangulation preserves the benefits of established assessment formats while resisting AI circumvention without invasive monitoring. Applications span higher education course design, professional certification programs, skills-based training evaluation, and any context requiring authentic assessment of human capabilities. The framework offers practical implementation guidance for educators navigating the AI era while maintaining pedagogical integrity and supporting genuine learning outcomes.

Authors:  Jasper Roe, Mike Perkins, Louie Giray

Link:  https://arxiv.org/abs/2510.02929v1

Date: 2025-10-d

Summary:

Generative Artificial Intelligence (GenAI) is reshaping higher education and raising pressing concerns about the integrity and validity of higher education assessment. While assessment redesign is increasingly seen as a necessity, there is a relative lack of literature detailing what such redesign may entail. In this paper, we introduce assessment twins as an accessible approach for redesigning assessment tasks to enhance validity. We use Messick's unified validity framework to systematically map the ways in which GenAI threatens content, structural, consequential, generalisability, and external validity. Following this, we define assessment twins as two deliberately linked components that address the same learning outcomes through different modes of evidence, scheduled closely together to allow for cross-verification and assurance of learning. We argue that the twin approach helps mitigate validity threats by triangulating evidence across complementary formats, such as pairing essays with oral defences, group discussions, or practical demonstrations. We highlight several advantages: preservation of established assessment formats, reduction of reliance on surveillance technologies, and flexible use across cohort sizes. To guide implementation, we propose a design process: identifying vulnerabilities, aligning outcomes, selecting complementary tasks, and developing interdependent marking schemes. We also acknowledge the challenges, including resource intensity, equity concerns, and the need for empirical validation. Nonetheless, we contend that assessment twins represent a validity-focused response to GenAI that prioritises pedagogy while supporting meaningful student learning outcomes.

--------------------------------------------------------------------------------------------------------

DMark: Order-Agnostic Watermarking for Diffusion Large Language Models

Diffusion LLMs offer faster generation than autoregressive models but break traditional watermarking methods due to non-sequential token generation. DMark introduces three strategies—predictive, bidirectional, and combined watermarking—specifically designed for arbitrary-order decoding. Achieving 92-99% detection rates while maintaining quality, it establishes watermarking feasibility for non-autoregressive models. Applications include content authentication, copyright protection, misinformation tracking, and regulatory compliance for AI-generated text. As diffusion models gain adoption for their speed advantages, DMark provides essential infrastructure for maintaining accountability and traceability in generated content across various deployment scenarios.

Authors:  Linyu Wu, Linhao Zhong, Wenjie Qu, Yuexin Li, Yue Liu, Shengfang Zhai, Chunhua Shen, Jiaheng Zhang

Link:  https://arxiv.org/abs/2510.02902v1

Date: 2025-10-d

Summary:

Diffusion large language models (dLLMs) offer faster generation than autoregressive models while maintaining comparable quality, but existing watermarking methods fail on them due to their non-sequential decoding. Unlike autoregressive models that generate tokens left-to-right, dLLMs can finalize tokens in arbitrary order, breaking the causal design underlying traditional watermarks. We present DMark, the first watermarking framework designed specifically for dLLMs. DMark introduces three complementary strategies to restore watermark detectability: predictive watermarking uses model-predicted tokens when actual context is unavailable; bidirectional watermarking exploits both forward and backward dependencies unique to diffusion decoding; and predictive-bidirectional watermarking combines both approaches to maximize detection strength. Experiments across multiple dLLMs show that DMark achieves 92.0-99.5% detection rates at 1% false positive rate while maintaining text quality, compared to only 49.6-71.2% for naive adaptations of existing methods. DMark also demonstrates robustness against text manipulations, establishing that effective watermarking is feasible for non-autoregressive language models.
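
Standard green-list watermarks seed their vocabulary partition on the previously generated token, which may not be finalized yet under arbitrary-order diffusion decoding. The sketch below illustrates the predictive idea in a simplified form, falling back to the model's predicted left-neighbor token when the real one is unavailable; the hashing and partitioning choices are assumptions, not DMark's exact scheme.

```python
import hashlib
import random

def green_list(seed_token: int, vocab_size: int, gamma: float = 0.5) -> set[int]:
    """Deterministically partition the vocabulary using a hash of the context token."""
    rng = random.Random(int(hashlib.sha256(str(seed_token).encode()).hexdigest(), 16))
    vocab = list(range(vocab_size))
    rng.shuffle(vocab)
    return set(vocab[: int(gamma * vocab_size)])

def watermarked_choice(logits: dict[int, float], left_token: int | None,
                       predicted_left: int, vocab_size: int, delta: float = 2.0) -> int:
    # Predictive watermarking (illustrative): if the true left neighbor has not been
    # finalized by the diffusion decoder, seed the green list with the model's
    # prediction for it instead.
    seed = left_token if left_token is not None else predicted_left
    greens = green_list(seed, vocab_size)
    biased = {tok: score + (delta if tok in greens else 0.0) for tok, score in logits.items()}
    return max(biased, key=biased.get)
```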

--------------------------------------------------------------------------------------------------------

Global Convergence of Policy Gradient for Entropy Regularized Linear-Quadratic Control with multiplicative noise

Reinforcement learning for continuous control faces challenges in unknown environments with multiplicative noise. This paper proves global convergence for Regularized Policy Gradient in entropy-regularized Linear Quadratic control, despite non-convexity. The novel Sample-Based RPG operates without system knowledge while maintaining theoretical guarantees. Applications include robotic control, autonomous vehicle navigation, financial portfolio optimization, and industrial process control where system dynamics are uncertain. The entropy regularization accelerates convergence while balancing exploration-exploitation tradeoffs, making it practical for real-world control systems requiring both stability guarantees and adaptability to unknown parameters.

Authors:  Gabriel Diaz, Lucky Li, Wenhao Zhang

Link:  https://arxiv.org/abs/2510.02896v1

Date: 2025-10-d

Summary:

Reinforcement Learning (RL) has emerged as a powerful framework for sequential decision-making in dynamic environments, particularly when system parameters are unknown. This paper investigates RL-based control for entropy-regularized Linear Quadratic control (LQC) problems with multiplicative noise over an infinite time horizon. First, we adapt the Regularized Policy Gradient (RPG) algorithm to stochastic optimal control settings, proving that despite the non-convexity of the problem, RPG converges globally under conditions of gradient domination and near-smoothness. Second, based on a zero-order optimization approach, we introduce a novel model-free RL algorithm: Sample-Based Regularized Policy Gradient (SB-RPG). SB-RPG operates without knowledge of system parameters yet still retains strong theoretical guarantees of global convergence. Our model leverages entropy regularization to accelerate convergence and address the exploration-versus-exploitation trade-off inherent in RL. Numerical simulations validate the theoretical results and demonstrate the efficacy of SB-RPG in environments with unknown parameters.
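
SB-RPG is described as a sample-based, zero-order method: it estimates the policy gradient from perturbed rollouts rather than from known system matrices. A generic two-point zeroth-order estimator of that flavor is sketched below; the cost function, perturbation scale, and step size are placeholders rather than the paper's exact algorithm.

```python
import numpy as np

def zeroth_order_step(K: np.ndarray, cost_fn, sigma: float = 0.05,
                      lr: float = 1e-3, num_samples: int = 32) -> np.ndarray:
    """One model-free policy-gradient step on a feedback gain K using two-point estimates.

    cost_fn(K) should return the (entropy-regularized) closed-loop cost estimated from
    rollouts; no knowledge of the system matrices is required.
    """
    grad = np.zeros_like(K)
    for _ in range(num_samples):
        U = np.random.randn(*K.shape)                 # random perturbation direction
        delta = cost_fn(K + sigma * U) - cost_fn(K - sigma * U)
        grad += (delta / (2.0 * sigma)) * U           # two-point finite-difference estimate
    grad /= num_samples
    return K - lr * grad                              # gradient descent on the estimated cost
```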

--------------------------------------------------------------------------------------------------------

A Computational Framework for Interpretable Text-Based Personality Assessment from Social Media

Automated personality assessment from social media offers insights into individual differences at scale. This thesis presents PANDORA—a 17-million comment Reddit dataset integrating MBTI and Big Five models—and SIMPA, a framework matching user statements to validated questionnaire items. This approach maintains psychological validity while enabling scalable assessment. Applications include mental health screening, personalized recommendation systems, human resources analytics, and social science research. The framework's interpretability and model-agnostic design extend beyond personality to any domain requiring complex psychological assessment from natural language, bridging computational methods with established psychometric principles.

Authors:  Matej Gjurković

Link:  https://arxiv.org/abs/2510.02811v1

Date: 2025-10-d

Summary:

Personality refers to individual differences in behavior, thinking, and feeling. With the growing availability of digital footprints, especially from social media, automated methods for personality assessment have become increasingly important. Natural language processing (NLP) enables the analysis of unstructured text data to identify personality indicators. However, two main challenges remain central to this thesis: the scarcity of large, personality-labeled datasets and the disconnect between personality psychology and NLP, which restricts model validity and interpretability. To address these challenges, this thesis presents two datasets -- MBTI9k and PANDORA -- collected from Reddit, a platform known for user anonymity and diverse discussions. The PANDORA dataset contains 17 million comments from over 10,000 users and integrates the MBTI and Big Five personality models with demographic information, overcoming limitations in data size, quality, and label coverage. Experiments on these datasets show that demographic variables influence model validity. In response, the SIMPA (Statement-to-Item Matching Personality Assessment) framework was developed - a computational framework for interpretable personality assessment that matches user-generated statements with validated questionnaire items. By using machine learning and semantic similarity, SIMPA delivers personality assessments comparable to human evaluations while maintaining high interpretability and efficiency. Although focused on personality assessment, SIMPA's versatility extends beyond this domain. Its model-agnostic design, layered cue detection, and scalability make it suitable for various research and practical applications involving complex label taxonomies and variable cue associations with target concepts.
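
SIMPA's central step is matching user-generated statements to validated questionnaire items by semantic similarity. A stripped-down version of that matching is sketched below; `embed` is a placeholder for any sentence-embedding model, and the listed items are illustrative stand-ins rather than actual inventory items.

```python
import numpy as np

def embed(texts: list[str]) -> np.ndarray:
    """Placeholder for a sentence-embedding model returning one vector per text."""
    raise NotImplementedError

QUESTIONNAIRE_ITEMS = [  # illustrative stand-ins for validated inventory items
    "I am the life of the party.",
    "I get stressed out easily.",
    "I am always prepared.",
]

def match_statements(user_comments: list[str], top_k: int = 3) -> list[tuple[str, str, float]]:
    """Return (comment, item, cosine similarity) for the strongest comment-item matches."""
    c_vecs = embed(user_comments)
    i_vecs = embed(QUESTIONNAIRE_ITEMS)
    c_vecs = c_vecs / np.linalg.norm(c_vecs, axis=1, keepdims=True)
    i_vecs = i_vecs / np.linalg.norm(i_vecs, axis=1, keepdims=True)
    sims = c_vecs @ i_vecs.T  # cosine similarity matrix: comments x items
    pairs = [(user_comments[c], QUESTIONNAIRE_ITEMS[i], float(sims[c, i]))
             for c in range(sims.shape[0]) for i in range(sims.shape[1])]
    return sorted(pairs, key=lambda p: p[2], reverse=True)[:top_k]
```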

--------------------------------------------------------------------------------------------------------

Prototyping Digital Social Spaces through Metaphor-Driven Design: Translating Spatial Concepts into an Interactive Social Simulation

Current social media platforms prioritize engagement over diverse social experiences. This paper introduces metaphor-driven design, where users express social expectations through metaphors that translate into platform features and LLM-agent simulations. Participants' interactions revealed how metaphors capture distinct dynamics like intimacy and participation patterns. Applications include social platform prototyping, community design tools, virtual world creation, and exploring alternative social architectures beyond current paradigms. This approach democratizes social space design, enabling stakeholders to envision and test new interaction models without technical expertise, potentially reshaping how we conceptualize and build online communities.

Authors:  Yoojin Hong, Martina Di Paola, Braahmi Padmakumar, Hwi Joon Lee, Mahnoor Shafiq, Joseph Seering

Link:  https://arxiv.org/abs/2510.02759v1

Date: 2025-10-d

Summary:

Social media platforms are central to communication, yet their designs remain narrowly focused on engagement and scale. While researchers have proposed alternative visions for online spaces, these ideas are difficult to prototype within platform constraints. In this paper, we introduce a metaphor-driven system to help users imagine and explore new social media environments. The system translates users' metaphors into structured sets of platform features and generates interactive simulations populated with LLM-driven agents. To evaluate this approach, we conducted a study where participants created and interacted with simulated social media spaces. Our findings show that metaphors allow users to express distinct social expectations, and that perceived authenticity of the simulation depended on how well it captured dynamics like intimacy, participation, and temporal engagement. We conclude by discussing how metaphor-driven simulation can be a powerful design tool for prototyping alternative social architectures and expanding the design space for future social platforms.

--------------------------------------------------------------------------------------------------------

Time-To-Inconsistency: A Survival Analysis of Large Language Model Robustness to Adversarial Attacks

Understanding LLM degradation in extended conversations is crucial for reliable deployment. This pioneering survival analysis of 36,951 conversation turns reveals that abrupt semantic drift catastrophically increases failure risk, while gradual drift is protective. The framework employs Cox, AFT, and Random Forest models to characterize temporal dynamics. Applications include chatbot reliability engineering, customer service system design, therapeutic conversation agents, and any extended dialogue system requiring sustained coherence. These insights enable development of resilient conversational agents that maintain consistency across long interactions, challenging assumptions about semantic stability requirements in conversational AI.

Authors:  Yubo Li, Ramayya Krishnan, Rema Padman

Link:  https://arxiv.org/abs/2510.02712v1

Date: 2025-10-d

Summary:

Large Language Models (LLMs) have revolutionized conversational AI, yet their robustness in extended multi-turn dialogues remains poorly understood. Existing evaluation frameworks focus on static benchmarks and single-turn assessments, failing to capture the temporal dynamics of conversational degradation that characterize real-world interactions. In this work, we present the first comprehensive survival analysis of conversational AI robustness, analyzing 36,951 conversation turns across 9 state-of-the-art LLMs to model failure as a time-to-event process. Our survival modeling framework, employing Cox proportional hazards, Accelerated Failure Time, and Random Survival Forest approaches, reveals extraordinary temporal dynamics. We find that abrupt, prompt-to-prompt (P2P) semantic drift is catastrophic, dramatically increasing the hazard of conversational failure. In stark contrast, gradual, cumulative drift is highly protective, vastly reducing the failure hazard and enabling significantly longer dialogues. AFT models with interactions demonstrate superior performance, achieving excellent discrimination and exceptional calibration. These findings establish survival analysis as a powerful paradigm for evaluating LLM robustness, offer concrete insights for designing resilient conversational agents, and challenge prevailing assumptions about the necessity of semantic consistency in conversational AI systems.
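
Treating conversational failure as a time-to-event outcome with drift measures as covariates maps directly onto standard survival tooling. A minimal sketch using the lifelines Cox proportional-hazards implementation is below; the column names and synthetic data are assumptions for illustration, not the study's dataset.

```python
import numpy as np
import pandas as pd
from lifelines import CoxPHFitter

rng = np.random.default_rng(0)
n = 500
# Synthetic stand-in: each row is one conversation, with drift covariates and
# the turn at which it failed (or was censored without failing).
df = pd.DataFrame({
    "turns_survived": rng.integers(1, 40, n),   # time-to-event, in conversation turns
    "failed": rng.integers(0, 2, n),            # 1 = inconsistency observed, 0 = censored
    "abrupt_p2p_drift": rng.random(n),          # prompt-to-prompt semantic drift
    "cumulative_drift": rng.random(n),          # gradual, accumulated drift
})

cph = CoxPHFitter()
cph.fit(df, duration_col="turns_survived", event_col="failed")
cph.print_summary()  # a positive coefficient means the covariate raises the failure hazard
```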

--------------------------------------------------------------------------------------------------------

RAMAC: Multimodal Risk-Aware Offline Reinforcement Learning and the Role of Behavior Regularization

Safety-critical domains demand offline RL that delivers high returns without catastrophic risks. RAMAC combines expressive generative actors (diffusion/flow-matching) with distributional critics, enabling risk-sensitive learning in complex multimodal scenarios. The framework achieves consistent CVaR improvements while maintaining returns on Stochastic-D4RL tasks. Applications include autonomous driving, healthcare treatment planning, financial trading, and robotics where online learning is infeasible and failure costs are high. By differentiating through generative paths with combined risk and behavioral cloning objectives, RAMAC enables deployment of expressive policies in risk-averse settings previously limited to conservative approaches.

Authors:  Kai Fukazawa, Kunal Mundada, Iman Soltani

Link:  https://arxiv.org/abs/2510.02695v1

Date: 2025-10-d

Summary:

In safety-critical domains where online data collection is infeasible, offline reinforcement learning (RL) offers an attractive alternative but only if policies deliver high returns without incurring catastrophic lower-tail risk. Prior work on risk-averse offline RL achieves safety at the cost of value conservatism and restricted policy classes, whereas expressive policies are only used in risk-neutral settings. Here, we address this gap by introducing the Risk-Aware Multimodal Actor-Critic (RAMAC) framework, which couples an expressive generative actor with a distributional critic. RAMAC differentiates a composite objective combining distributional risk and BC loss through the generative path, achieving risk-sensitive learning in complex multimodal scenarios. We instantiate RAMAC with diffusion and flow-matching actors and observe consistent gains in CVaR_0.1 while maintaining strong returns on most Stochastic-D4RL tasks. Code: https://github.com/KaiFukazawa/RAMAC.git
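
The actor objective described above combines a distributional risk term (CVaR computed from a distributional critic) with a behavior-cloning regularizer, differentiated through the generative actor's samples. A schematic PyTorch-style loss of that shape is sketched below; the critic interface, risk level, and weighting are illustrative assumptions, not RAMAC's exact formulation.

```python
import torch

def risk_aware_actor_loss(actor_actions: torch.Tensor,
                          dataset_actions: torch.Tensor,
                          critic_quantiles: torch.Tensor,
                          alpha: float = 0.1,
                          bc_weight: float = 1.0) -> torch.Tensor:
    """Composite objective: maximize CVaR_alpha of the return distribution plus a BC term.

    critic_quantiles: [batch, n_quantiles] return quantiles predicted by a distributional
    critic for the actor's actions; gradients flow back through the generative actor's samples.
    """
    n = critic_quantiles.shape[-1]
    k = max(1, int(alpha * n))
    # CVaR_alpha approximated as the mean of the worst alpha-fraction of return quantiles.
    worst, _ = torch.topk(critic_quantiles, k, dim=-1, largest=False)
    cvar = worst.mean()
    # Behavior-cloning term keeps the expressive actor close to the dataset's actions.
    bc_loss = ((actor_actions - dataset_actions) ** 2).mean()
    # Minimize: maximize risk-adjusted return while staying near the data distribution.
    return -cvar + bc_weight * bc_loss
```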

--------------------------------------------------------------------------------------------------------

MALF: A Multi-Agent LLM Framework for Intelligent Fuzzing of Industrial Control Protocols

Industrial control systems face increasing cyber threats through protocol vulnerabilities. MALF revolutionizes fuzzing by integrating LLMs with multi-agent coordination, using RAG for domain knowledge and QLoRA for protocol-aware generation. Achieving 88-92% test case pass rates and discovering three zero-day vulnerabilities in real power plant systems, it surpasses traditional methods. Applications include critical infrastructure security, SCADA system testing, IoT protocol validation, and industrial cybersecurity assessment. The framework's ability to understand protocol semantics while generating diverse, valid test cases addresses the unique challenges of ICS security testing.

Authors:  Bowei Ning, Xuejun Zong, Kan He

Link:  https://arxiv.org/abs/2510.02694v1

Date: 2025-10-d

Summary:

Industrial control systems (ICS) are vital to modern infrastructure but increasingly vulnerable to cybersecurity threats, particularly through weaknesses in their communication protocols. This paper presents MALF (Multi-Agent LLM Fuzzing Framework), an advanced fuzzing solution that integrates large language models (LLMs) with multi-agent coordination to identify vulnerabilities in industrial control protocols (ICPs). By leveraging Retrieval-Augmented Generation (RAG) for domain-specific knowledge and QLoRA fine-tuning for protocol-aware input generation, MALF enhances fuzz testing precision and adaptability. The multi-agent framework optimizes seed generation, mutation strategies, and feedback-driven refinement, leading to improved vulnerability discovery. Experiments on protocols like Modbus/TCP, S7Comm, and Ethernet/IP demonstrate that MALF surpasses traditional methods, achieving a test case pass rate (TCPR) of 88-92% and generating more exception triggers (ETN). MALF also maintains over 90% seed coverage and Shannon entropy values between 4.2 and 4.6 bits, ensuring diverse, protocol-compliant mutations. Deployed in a real-world Industrial Attack-Defense Range for power plants, MALF identified critical vulnerabilities, including three zero-day flaws, one confirmed and registered by CNVD. These results validate MALF's effectiveness in real-world fuzzing applications. This research highlights the transformative potential of multi-agent LLMs in ICS cybersecurity, offering a scalable, automated framework that sets a new standard for vulnerability discovery and strengthens critical infrastructure security against emerging threats.
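
One of the reported diagnostics is mutation diversity, measured as Shannon entropy in the 4.2-4.6 bit range over generated test cases. The helper below shows how such a byte-level entropy check might be computed for a batch of protocol mutations; it is a generic illustration with a toy Modbus/TCP-style seed, not MALF's internal metric.

```python
import math
from collections import Counter

def shannon_entropy(data: bytes) -> float:
    """Shannon entropy in bits per byte of a fuzzing payload."""
    if not data:
        return 0.0
    counts = Counter(data)
    total = len(data)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def batch_entropy(payloads: list[bytes]) -> float:
    """Average entropy across a batch of mutated test cases (a simple diversity proxy)."""
    return sum(shannon_entropy(p) for p in payloads) / len(payloads)

# Toy Modbus/TCP read request (transaction 1, unit 1, function 3) with a mutated tail.
seed = bytes.fromhex("000100000006010300000001")
mutants = [seed[:-2] + bytes([i % 256, (i * 7) % 256]) for i in range(64)]
print(f"mean payload entropy: {batch_entropy(mutants):.2f} bits/byte")
```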

--------------------------------------------------------------------------------------------------------

ARMs: Adaptive Red-Teaming Agent against Multimodal Models with Plug-and-Play Attacks

Vision-language models introduce novel safety vulnerabilities requiring comprehensive assessment beyond manual engineering. ARMs automatically optimizes diverse red-teaming strategies through reasoning-enhanced orchestration, introducing 11 novel attack patterns and integrating 17 algorithms via model context protocol. Achieving 52% higher success rates and exceeding 90% on Claude-4-Sonnet, it reveals emerging vulnerabilities. Applications include AI safety evaluation, regulatory compliance testing, model robustness assessment, and security audit automation. The framework's construction of ARMs-Bench with 30K instances across 51 risk categories provides actionable guidance for improving multimodal safety alignment against real-world threats.

Authors:  Zhaorun Chen, Xun Liu, Mintong Kang, Jiawei Zhang, Minzhou Pan, Shuang Yang, Bo Li

Link:  https://arxiv.org/abs/2510.02677v1

Date: 2025-10-d

Summary:

As vision-language models (VLMs) gain prominence, their multimodal interfaces also introduce new safety vulnerabilities, making the safety evaluation challenging and critical. Existing red-teaming efforts are either restricted to a narrow set of adversarial patterns or depend heavily on manual engineering, lacking scalable exploration of emerging real-world VLM vulnerabilities. To bridge this gap, we propose ARMs, an adaptive red-teaming agent that systematically conducts comprehensive risk assessments for VLMs. Given a target harmful behavior or risk definition, ARMs automatically optimizes diverse red-teaming strategies with reasoning-enhanced multi-step orchestration, to effectively elicit harmful outputs from target VLMs. We propose 11 novel multimodal attack strategies, covering diverse adversarial patterns of VLMs (e.g., reasoning hijacking, contextual cloaking), and integrate 17 red-teaming algorithms into ARMs via model context protocol (MCP). To balance the diversity and effectiveness of the attack, we design a layered memory with an epsilon-greedy attack exploration algorithm. Extensive experiments on instance- and policy-based benchmarks show that ARMs achieves SOTA attack success rates, exceeding baselines by an average of 52.1% and surpassing 90% on Claude-4-Sonnet. We show that the diversity of red-teaming instances generated by ARMs is significantly higher, revealing emerging vulnerabilities in VLMs. Leveraging ARMs, we construct ARMs-Bench, a large-scale multimodal safety dataset comprising over 30K red-teaming instances spanning 51 diverse risk categories, grounded in both real-world multimodal threats and regulatory risks. Safety fine-tuning with ARMs-Bench substantially improves the robustness of VLMs while preserving their general utility, providing actionable guidance to improve multimodal safety alignment against emerging threats.
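
The exploration component is described as epsilon-greedy selection over a pool of attack strategies backed by a memory of past outcomes. The sketch below isolates that selection rule; the bookkeeping, the epsilon value, and the flat (non-layered) memory are assumptions, and only the two strategy names mentioned in the abstract are used.

```python
import random
from collections import defaultdict

class EpsilonGreedyAttackSelector:
    """Epsilon-greedy choice over red-teaming strategies (illustrative, not ARMs' memory design)."""

    def __init__(self, strategies: list[str], epsilon: float = 0.2):
        self.strategies = strategies
        self.epsilon = epsilon
        self.stats = defaultdict(lambda: {"attempts": 0, "successes": 0})

    def _success_rate(self, strategy: str) -> float:
        s = self.stats[strategy]
        return s["successes"] / s["attempts"] if s["attempts"] else 0.0

    def select(self) -> str:
        if random.random() < self.epsilon:
            return random.choice(self.strategies)             # explore: try any strategy
        return max(self.strategies, key=self._success_rate)   # exploit: best observed so far

    def update(self, strategy: str, attack_succeeded: bool) -> None:
        self.stats[strategy]["attempts"] += 1
        self.stats[strategy]["successes"] += int(attack_succeeded)

# In practice the pool would cover all 11 attack strategies and 17 integrated algorithms.
selector = EpsilonGreedyAttackSelector(["reasoning hijacking", "contextual cloaking"])
```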

--------------------------------------------------------------------------------------------------------

To Compress or Not? Pushing the Frontier of Lossless GenAI Model Weights Compression with Exponent Concentration

Scaling GenAI models to hundreds of billions of parameters demands efficient compression without quality loss. This paper reveals an exponent concentration phenomenon in model weights, theoretically grounded in α-stable distributions from SGD. Establishing a compression limit near FP4.67, the authors propose ECF8 achieving 26.9% memory savings and 177% throughput acceleration with perfectly lossless computation. Applications include edge deployment, cloud inference optimization, model distribution, and enabling larger models on existing hardware. This principled approach to floating-point design opens new avenues for efficient AI deployment without compromising model capabilities.

Authors:  Zeyu Yang, Tianyi Zhang, Jianwen Xie, Chuan Li, Zhaozhuo Xu, Anshumali Shrivastava

Link:  https://arxiv.org/abs/2510.02676v1

Date: 2025-10-d

Summary:

The scaling of Generative AI (GenAI) models into the hundreds of billions of parameters makes low-precision computation indispensable for efficient deployment. We argue that the fundamental solution lies in developing low-precision floating-point formats, which inherently provide numerical stability, memory savings, and hardware efficiency without dequantization overhead. In this paper, we present a theoretical and empirical study of an exponent concentration phenomenon in GenAI weights: exponents consistently exhibit low entropy across architectures and modalities. We show that this arises naturally from α-stable distributions induced by stochastic gradient descent, and we prove tight bounds on the entropy of exponents. Our analysis establishes a theoretical compression limit near FP4.67, which motivates the design of a practical FP8 format. Building on these insights, we propose Exponent-Concentrated FP8 (ECF8), a lossless compression framework with entropy-aware encoding and GPU-optimized decoding. Experiments on LLMs and DiTs up to 671B parameters demonstrate up to 26.9% memory savings and 177.1% throughput acceleration, with perfectly lossless computations, i.e., no deviation in model outputs. Our results establish exponent concentration as a statistical law of trained models and open a principled path for lossless low-precision floating-point design in the FP8 era.
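
The exponent-concentration claim is easy to probe empirically: extract the IEEE-754 exponent field of each weight and measure its Shannon entropy; low entropy means the 8-bit exponent field is highly compressible. The snippet below does this for a synthetic Gaussian tensor standing in for real checkpoint weights; ECF8's actual entropy-aware encoding is not reproduced here.

```python
import numpy as np

def exponent_entropy(weights: np.ndarray) -> float:
    """Shannon entropy (bits) of the IEEE-754 float32 exponent field."""
    bits = weights.astype(np.float32).view(np.uint32)
    exponents = (bits >> 23) & 0xFF  # 8-bit biased exponent field
    _, counts = np.unique(exponents, return_counts=True)
    p = counts / counts.sum()
    return float(-(p * np.log2(p)).sum())

# Gaussian stand-in for trained weights; swap in real checkpoint tensors to
# measure the concentration the paper reports.
w = (0.02 * np.random.randn(1_000_000)).astype(np.float32)
print(f"exponent entropy: {exponent_entropy(w):.2f} bits (out of 8)")
```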

--------------------------------------------------------------------------------------------------------

HALO: Memory-Centric Heterogeneous Accelerator with 2.5D Integration for Low-Batch LLM Inference

Interactive LLM applications demand efficient low-batch, long-context inference—a regime poorly served by existing accelerators. HALO integrates HBM-based Compute-in-DRAM with on-chip analog Compute-in-Memory through 2.5D integration, adapting to distinct prefill/decode phase requirements. Achieving 18x speedup over attention-optimized mappings, it addresses the compute-memory dichotomy in LLM inference. Applications include chatbots, personal assistants, real-time translation, and interactive AI services where latency matters. The phase-aware mapping strategy maximizes hardware utilization across inference stages, enabling responsive AI interactions previously limited by memory bandwidth and compute constraints.

Authors:  Shubham Negi, Kaushik Roy

Link:  https://arxiv.org/abs/2510.02675v1

Date: 2025-10-d

Summary:

The rapid adoption of Large Language Models (LLMs) has driven a growing demand for efficient inference, particularly in latency-sensitive applications such as chatbots and personalized assistants. Unlike traditional deep neural networks, LLM inference proceeds in two distinct phases: the prefill phase, which processes the full input sequence in parallel, and the decode phase, which generates tokens sequentially. These phases exhibit highly diverse compute and memory requirements, which makes accelerator design particularly challenging. Prior works have primarily been optimized for high-batch inference or evaluated only short input context lengths, leaving the low-batch and long-context regime, which is critical for interactive applications, largely underexplored. We propose HALO, a heterogeneous, memory-centric accelerator designed for the unique challenges of the prefill and decode phases in low-batch LLM inference. HALO integrates HBM-based Compute-in-DRAM (CiD) with an on-chip analog Compute-in-Memory (CiM), co-packaged using 2.5D integration. To further improve hardware utilization, we introduce a phase-aware mapping strategy that adapts to the distinct demands of the prefill and decode phases. Compute-bound operations in the prefill phase are mapped to CiM to exploit its high-throughput matrix multiplication capability, while memory-bound operations in the decode phase are executed on CiD to benefit from reduced data movement within DRAM. Additionally, we present an analysis of the performance tradeoffs of LLMs under two architectural extremes, a fully CiD and a fully on-chip analog CiM design, to highlight the need for a heterogeneous design. We evaluate HALO on LLaMA-2 7B and Qwen3 8B models. Our experimental results show that LLMs mapped to HALO achieve up to an 18x geometric-mean speedup over AttAcc, an attention-optimized mapping, and 2.5x over CENT, a fully CiD-based mapping.
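
The phase-aware mapping reduces to a simple dispatch rule: compute-bound prefill work goes to the analog CiM tiles, memory-bound decode work to the HBM CiD. The toy dispatcher below captures only that routing logic under assumed labels; it is not a model of the hardware.

```python
from enum import Enum

class Phase(Enum):
    PREFILL = "prefill"  # full-sequence, compute-bound (large matrix multiplies)
    DECODE = "decode"    # token-by-token, memory-bound (KV-cache reads dominate)

class Backend(Enum):
    CIM = "on-chip analog Compute-in-Memory"
    CID = "HBM Compute-in-DRAM"

def map_operation(phase: Phase) -> Backend:
    """Phase-aware mapping: prefill exploits CiM throughput, decode exploits CiD data locality."""
    return Backend.CIM if phase is Phase.PREFILL else Backend.CID

for phase in Phase:
    print(f"{phase.value:>7} phase -> {map_operation(phase).value}")
```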

--------------------------------------------------------------------------------------------------------

When Researchers Say Mental Model/Theory of Mind of AI, What Are They Really Talking About?

Current discourse conflates AI's behavioral predictions with genuine mental states, missing crucial distinctions between simulation and experience. This position paper argues that LLMs achieving human-level ToM task performance demonstrates only behavioral mimicry, not authentic cognition. The author proposes shifting toward mutual ToM frameworks emphasizing interaction dynamics over isolated testing. Applications include human-AI collaboration design, AI evaluation methodologies, interface development, and trust calibration systems. This reframing has profound implications for how we design, test, and deploy AI systems in social contexts, moving beyond anthropomorphic assumptions toward interaction-centered understanding.

Authors:  Xiaoyun Yin, Elmira Zahmat Doost, Shiwen Zhou, Garima Arya Yadav, Jamie C. Gorman

Link:  https://arxiv.org/abs/2510.02660v1

Date: 2025-10-d

Summary:

When researchers claim AI systems possess ToM or mental models, they are fundamentally discussing behavioral predictions and bias corrections rather than genuine mental states. This position paper argues that the current discourse conflates sophisticated pattern matching with authentic cognition, missing a crucial distinction between simulation and experience. While recent studies show LLMs achieving human-level performance on ToM laboratory tasks, these results are based only on behavioral mimicry. More importantly, the entire testing paradigm may be flawed: rather than applying individual human cognitive tests to AI systems, we should assess cognition directly in the moment of human-AI interaction. I suggest shifting focus toward mutual ToM frameworks that acknowledge the simultaneous contributions of human cognition and AI algorithms, emphasizing the interaction dynamics, instead of testing AI in isolation.

--------------------------------------------------------------------------------------------------------

A Concept of Possibility for Real-World Events

Traditional possibility theory lacks practical grounding for real-world planning. This paper redefines possibility through prerequisites and constraints, computing event possibility from probabilities that prerequisites hold and constraints don't. Unlike Zadeh's abstract formulation, this version directly addresses planning feasibility. Applications include route planning, project management, resource allocation, and decision support systems where multiple plans compete. The framework captures how humans naturally reason about plan feasibility, potentially improving automated planning systems, risk assessment tools, and decision-making interfaces by aligning computational methods with intuitive human reasoning about what's possible versus merely probable.

Authors:  Daniel G. Schwartz

Link:  https://arxiv.org/abs/2510.02655v1

Date: 2025-10-d

Summary:

This paper offers a new concept of possibility as an alternative to the now-standard concept originally introduced by L.A. Zadeh in 1978. This new version was inspired by the original but, formally, has nothing in common with it other than that both adopt the Łukasiewicz multivalent interpretation of the logical connectives. Moreover, rather than seeking to provide a general notion of possibility, this paper focuses specifically on the possibility of a real-world event. An event is viewed as having prerequisites that enable its occurrence and constraints that may impede its occurrence, and the possibility of the event is computed as a function of the probabilities that the prerequisites hold and the constraints do not. This version of possibility might appropriately be applied to problems of planning. When there are multiple plans available for achieving a goal, this theory can be used to determine which plan is most possible, i.e., easiest or most feasible to complete. It is speculated that this model of reasoning correctly captures normal human reasoning about plans. The theory is elaborated and an illustrative example for vehicle route planning is provided. There is also a suggestion of potential future applications.
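
The construction described above, computing an event's possibility from the probabilities that its prerequisites hold and its constraints do not, combined through Łukasiewicz connectives, can be illustrated numerically. The aggregation below (Łukasiewicz conjunction over prerequisites and negated constraints) is one plausible reading of the abstract, not necessarily the paper's exact definition.

```python
def lukasiewicz_and(a: float, b: float) -> float:
    """Łukasiewicz conjunction (t-norm): max(0, a + b - 1)."""
    return max(0.0, a + b - 1.0)

def event_possibility(prereq_probs: list[float], constraint_probs: list[float]) -> float:
    """Aggregate P(prerequisites hold) and P(constraints do NOT hold) into a possibility score."""
    value = 1.0
    for p in prereq_probs:
        value = lukasiewicz_and(value, p)
    for c in constraint_probs:
        value = lukasiewicz_and(value, 1.0 - c)  # a constraint must fail to impede the event
    return value

# Route-planning example: two prerequisites (car available, driver available) and
# one constraint (road closure) for "arrive by noon via route A".
# Chain: max(0, 0.95 + 0.9 - 1) = 0.85, then max(0, 0.85 + 0.8 - 1) = 0.65.
print(event_possibility([0.95, 0.9], [0.2]))
```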

--------------------------------------------------------------------------------------------------------

