Week Ending 12.21.2025
RESEARCH WATCH: 12.21.2025
Adversarial Robustness of Vision in Open Foundation Models
As vision-language models become increasingly deployed in security-critical applications, understanding their vulnerabilities to adversarial attacks is paramount. This research examines how small, imperceptible modifications to images can deceive advanced AI systems like LLaVA and Llama 3.2 Vision, causing them to misidentify objects or provide incorrect answers. The findings reveal that even sophisticated open-weight models remain susceptible to visual perturbations, with implications for autonomous vehicles, medical imaging systems, and security applications. Interestingly, the study demonstrates that benchmark performance doesn't necessarily predict robustness, suggesting that architectural design and training approaches play crucial roles in developing more resilient AI systems for real-world deployment.
Authors: Jonathon Fox, William J Buchanan, Pavlos Papadopoulos
Link: https://arxiv.org/abs/2512.17902v1
Date: 2025-12-d
Summary:
As deep learning systems grow more complex, it becomes increasingly difficult to understand how AI models identify objects. An adversary can exploit this opacity by modifying an image with imperceptible perturbations that confuse the AI's recognition of an entity. This paper investigates the adversarial robustness of LLaVA-1.5-13B and Meta's Llama 3.2 Vision-8B-2. Both models are tested against untargeted PGD (Projected Gradient Descent) attacks on the visual input modality and evaluated empirically on a subset of the Visual Question Answering (VQA) v2 dataset. The results of these adversarial attacks are quantified using the standard VQA accuracy metric, and the accuracy degradation (accuracy drop) of LLaVA and Llama 3.2 Vision is compared. A key finding is that Llama 3.2 Vision, despite a lower baseline accuracy in this setup, exhibited a smaller drop in performance under attack than LLaVA, particularly at higher perturbation levels. Overall, the findings confirm that the vision modality represents a viable attack vector for degrading the performance of contemporary open-weight VLMs, including Meta's Llama 3.2 Vision. Furthermore, they highlight that adversarial robustness does not necessarily correlate with standard benchmark performance and may be influenced by underlying architectural and training factors.
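To make the attack setup concrete, here is a minimal sketch of an untargeted L-infinity PGD attack on the image modality. It assumes a differentiable model exposing a VQA loss; the `model.loss` interface and the hyperparameter values are illustrative stand-ins, not the paper's code.

```python
import torch

def pgd_attack(model, image, question, answer, eps=8/255, alpha=2/255, steps=10):
    """Untargeted L-inf PGD against the visual input of a VLM (illustrative)."""
    adv = image.clone().detach()
    for _ in range(steps):
        adv.requires_grad_(True)
        loss = model.loss(adv, question, answer)  # assumed differentiable VQA loss
        grad = torch.autograd.grad(loss, adv)[0]
        with torch.no_grad():
            adv = adv + alpha * grad.sign()               # ascend the loss (untargeted)
            adv = image + (adv - image).clamp(-eps, eps)  # project into the eps-ball
            adv = adv.clamp(0.0, 1.0)                     # remain a valid image
    return adv.detach()
```

Sweeping eps over increasing values yields the perturbation-level curves against which the accuracy drops above are measured.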
--------------------------------------------------------------------------------------------------------
With over a billion people now interacting with AI chatbots and virtual assistants, understanding how humanlike design affects user behavior is critical for responsible AI development. This groundbreaking cross-cultural study challenges the Western-centric assumption that making AI more humanlike universally increases trust and engagement. Through experiments across ten diverse nations, researchers found that design choices fostering trust in Brazil could undermine it in Japan, revealing how cultural context fundamentally shapes human-AI interaction. These findings have profound implications for global AI deployment, customer service applications, mental health chatbots, and educational tools, suggesting that one-size-fits-all design approaches may inadvertently harm user experience and trust in certain populations.
Authors: Robin Schimmelpfennig, Mark Díaz, Vinodkumar Prabhakaran, Aida Davani
Link: https://arxiv.org/abs/2512.17898v1
Date: 2025-12-d
Summary:
Over a billion users across the globe interact with AI systems engineered with increasing sophistication to mimic human traits. This shift has triggered urgent debate regarding anthropomorphism, the attribution of human characteristics to synthetic agents, and its potential to induce misplaced trust or emotional dependency. However, the causal link between more humanlike AI design and subsequent effects on engagement and trust has not been tested in realistic human-AI interactions with a global user pool. Prevailing safety frameworks continue to rely on theoretical assumptions derived from Western populations, overlooking the global diversity of AI users. Here, we address these gaps through two large-scale cross-national experiments (N=3,500) across 10 diverse nations, involving real-time and open-ended interactions with an AI system. We find that when evaluating an AI's human-likeness, users focus less on the theoretical aspects often cited in policy (e.g., sentience or consciousness) and more on applied, interactional cues such as conversation flow or understanding the user's perspective. We also experimentally demonstrate that humanlike design levers can causally increase anthropomorphism among users; however, we do not find that humanlike design universally increases behavioral measures of user engagement and trust, as previous theoretical work suggests. Instead, part of the connection between human-likeness and behavioral outcomes is fractured by culture: specific design choices that foster self-reported trust in AI systems in some populations (e.g., Brazil) may trigger the opposite result in others (e.g., Japan). Our findings challenge prevailing narratives of inherent risk in humanlike AI design. Instead, we identify a nuanced, culturally mediated landscape of human-AI interaction, which demands that we move beyond a one-size-fits-all approach in AI governance.
--------------------------------------------------------------------------------------------------------
Integrating Computational Methods and AI into Qualitative Studies of Aging and Later Life
Qualitative aging research traditionally relies on rich, contextual methods like interviews and ethnography, but analyzing large volumes of such data has been time-consuming and limited in scale. This work demonstrates how machine learning and natural language processing can augment rather than replace traditional qualitative approaches, enabling researchers to systematically analyze thousands of interviews while preserving depth and nuance. Drawing on studies of dementia care and nationally representative interviews, the research shows potential for streamlining workflows, scaling sample sizes, and generating novel multi-method insights. Applications extend to healthcare policy, elder care services, longitudinal studies of aging populations, and developing more responsive interventions for older adults and their caregivers.
Authors: Corey M. Abramson
Link: https://arxiv.org/abs/2512.17850v1
Date: 2025-12-d
Summary:
This chapter demonstrates how computational social science (CSS) tools are extending and expanding research on aging. The depth and context from traditionally qualitative methods such as participant observation, in-depth interviews, and historical documents are increasingly employed alongside scalable data management, computational text analysis, and open-science practices. Machine learning (ML) and natural language processing (NLP) provide resources to aggregate and systematically index large volumes of qualitative data, identify patterns, and maintain clear links to in-depth accounts. Drawing on case studies of projects that examine later life--including examples with original data from the DISCERN study (a team-based ethnography of life with dementia) and secondary analyses of the American Voices Project (a nationally representative interview study)--the chapter highlights both uses and challenges of bringing CSS tools into more meaningful dialogue with qualitative aging research. The chapter argues such work has potential for (1) streamlining and augmenting existing workflows, (2) scaling up samples and projects, and (3) generating multi-method approaches to address important questions in new ways, before turning to practices useful for individuals and teams seeking to understand current possibilities or refine their workflow processes. The chapter concludes that current developments are not without peril, but offer potential for new insights into aging and the life course by broadening--rather than replacing--the methodological foundations of qualitative research.
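For a flavor of the "aggregate and systematically index" step described above, the generic sketch below links a thematic query back to the interview excerpts that ground it. It uses scikit-learn and a toy corpus; it illustrates the technique, and is not code from the chapter.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Toy corpus standing in for de-identified interview excerpts.
excerpts = [
    "My daughter helps with meals since the diagnosis.",
    "The support group meets at the senior center on Tuesdays.",
    "Managing medications at home has become a full-time job.",
]
vectorizer = TfidfVectorizer(stop_words="english")
matrix = vectorizer.fit_transform(excerpts)

# A thematic query stays linked to the underlying in-depth accounts.
query = vectorizer.transform(["caregiving at home"])
ranking = cosine_similarity(query, matrix).ravel().argsort()[::-1]
for i in ranking:
    print(excerpts[i])  # excerpts ordered by relevance to the theme
```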
--------------------------------------------------------------------------------------------------------
Adapting powerful large language models to specific tasks typically requires substantial computational resources and direct parameter access—luxuries unavailable to many organizations facing API-only models and tight budgets. This research introduces an innovative approach that sidesteps these limitations by training small, specialized models to complement large models on particular data distributions, achieving performance comparable to traditional fine-tuning without touching the large model's parameters. The method proves especially valuable for startups, academic researchers, and organizations in developing regions with limited GPU access. Practical applications include custom customer service bots, domain-specific assistants for healthcare or legal services, and specialized tools for niche industries where general-purpose models underperform but full fine-tuning remains economically unfeasible.
Authors: Dong Chen, Zhengqing Hu, Shixing Zhao, Yibo Guo
Link: https://arxiv.org/abs/2512.17771v1
Date: 2025-12-d
Summary:
While the enormous parameter scale endows Large Models (LMs) with unparalleled performance, it also limits their adaptability across specific tasks. Parameter-Efficient Fine-Tuning (PEFT) has emerged as a critical approach for effectively adapting LMs to a diverse range of downstream tasks. However, existing PEFT methods face two primary challenges: (1) High resource cost. Although PEFT methods significantly reduce resource demands compared to full fine-tuning, they still require substantial time and memory, making them impractical in resource-constrained environments. (2) Parameter dependency. PEFT methods heavily rely on updating a subset of parameters associated with LMs to incorporate task-specific knowledge. Yet, due to increasing competition in the LM landscape, many companies have adopted closed-source policies for their leading models, offering access only via Application Programming Interfaces (APIs). Even then, the expense is often cost-prohibitive and difficult to sustain, as the fine-tuning process of LMs is extremely slow. Although small models perform far worse than LMs in general, they can achieve superior results on particular distributions while requiring only minimal resources. Motivated by this insight, we propose Easy Adaptation (EA), which designs Specific Small Models (SSMs) to complement the underfitted data distribution for LMs. Extensive experiments show that EA matches the performance of PEFT on diverse tasks without accessing LM parameters, and requires only minimal resources.
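The abstract does not spell out EA's exact combination mechanism, but the core idea reads naturally as routing: a small specialist model covers the distribution the large model underfits, while the LM stays behind an untouched API. The sketch below is our assumption for illustration; every name in it is hypothetical.

```python
def easy_adaptation_predict(x, large_model_api, small_specialist, in_specialty):
    """Route inputs to a Specific Small Model on its distribution, else the LM API."""
    if in_specialty(x):               # e.g., a lightweight domain classifier
        return small_specialist(x)    # cheap model fine-tuned on the target slice
    return large_model_api(x)         # LM parameters are never accessed
```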
--------------------------------------------------------------------------------------------------------
An Empirical Study of Sampling Hyperparameters in Diffusion-Based Super-Resolution
Super-resolution technology transforms low-quality images into high-resolution ones, with applications spanning satellite imagery, medical scans, and forensic analysis. This research provides crucial practical guidance for practitioners implementing diffusion-based super-resolution systems by systematically investigating which parameters most significantly impact output quality. The study reveals that conditioning step size matters far more than the number of diffusion steps, offering a clear optimization strategy that can dramatically improve results while reducing computational costs. These findings benefit photographers restoring old photos, security analysts enhancing surveillance footage, healthcare professionals improving diagnostic image quality, astronomers processing telescope data, and any application requiring high-quality image reconstruction from limited source material.
Authors: Yudhistira Arief Wibowo
Link: https://arxiv.org/abs/2512.17675v1
Date: 2025-12-d
Summary:
Diffusion models have shown strong potential for solving inverse problems such as single-image super-resolution, where a high-resolution image is recovered from a low-resolution observation using a pretrained unconditional prior. Conditioning methods, including Diffusion Posterior Sampling (DPS) and Manifold Constrained Gradient (MCG), can substantially improve reconstruction quality, but they introduce additional hyperparameters that require careful tuning. In this work, we conduct an empirical ablation study on FFHQ super-resolution to identify the dominant factors affecting performance when applying conditioning to pretrained diffusion models, and show that the conditioning step size has a significantly greater impact than the diffusion step count, with step sizes in the range of [2.0, 3.0] yielding the best overall performance in our experiments.
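As a rough illustration of the hyperparameter under study, a DPS-style conditioning step applies a data-consistency gradient, scaled by the conditioning step size, after each reverse-diffusion update. In this sketch, `denoise` and `downsample` are assumed stand-ins for the pretrained prior and the super-resolution degradation operator, not the paper's code.

```python
import torch

def dps_step(x_t, y_low, t, denoise, downsample, zeta=2.5):
    """One reverse-diffusion step with DPS-style conditioning (illustrative)."""
    x_t = x_t.detach().requires_grad_(True)
    x0_hat, x_prev = denoise(x_t, t)  # assumed: predicted clean image, next sample
    residual = torch.linalg.vector_norm(y_low - downsample(x0_hat))
    grad = torch.autograd.grad(residual, x_t)[0]
    # Per the ablation, this step size matters far more than the number of
    # diffusion steps; values of zeta in [2.0, 3.0] performed best.
    return (x_prev - zeta * grad).detach()
```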
--------------------------------------------------------------------------------------------------------
Behavioural Effects of Agentic Messaging: A Case Study on a Financial Service Application
Marketing personalization stands at a crossroads between maximizing engagement and respecting user boundaries, particularly in sensitive domains like financial services. This study evaluates whether AI agents capable of adaptive, individualized decision-making can improve customer retention without increasing opt-out rates during a high-stakes national tax filing period. Results showed agent-led messaging reduced unsubscribes by 21% while encouraging earlier tax filing behavior, demonstrating how intelligent systems can modulate communication intensity based on individual user responses. Applications extend beyond finance to healthcare reminders, educational notifications, e-commerce campaigns, and any context where maintaining long-term user relationships requires balancing helpful nudges with respect for user autonomy and attention.
Authors: Olivier Jeunen, Schaun Wheeler
Link: https://arxiv.org/abs/2512.17462v1
Date: 2025-12-d
Summary:
Marketing and product personalisation provide a prominent and visible use-case for the application of Information Retrieval methods across several business domains. Recently, agentic approaches to these problems have been gaining traction. This work evaluates the behavioural and retention effects of agentic personalisation on a financial service application's customer communication system during a 2025 national tax filing period. Through a two-month randomised controlled trial, we compare an agentic messaging approach against a business-as-usual (BAU) rule-based campaign system, focusing on two primary outcomes: unsubscribe behaviour and conversion timing. Empirical results show that agent-led messaging reduced unsubscribe events by 21% (±0.01) relative to BAU and increased early filing behaviour in the weeks preceding the national deadline. These findings demonstrate how adaptive, user-level decision-making systems can modulate engagement intensity whilst improving long-term retention indicators.
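For scale, the headline effect is a relative reduction in unsubscribe rate between arms. A back-of-the-envelope computation (with made-up counts, purely for illustration) shows what a 21% relative drop looks like:

```python
def relative_reduction(unsubs_bau, n_bau, unsubs_agent, n_agent):
    """Relative drop in unsubscribe rate of the agentic arm versus BAU."""
    rate_bau = unsubs_bau / n_bau
    rate_agent = unsubs_agent / n_agent
    return (rate_bau - rate_agent) / rate_bau

# Hypothetical counts: 2.0% unsubscribe rate under BAU vs 1.58% under the agent.
print(relative_reduction(1000, 50_000, 790, 50_000))  # -> 0.21, a 21% reduction
```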
--------------------------------------------------------------------------------------------------------
Dialectics for Artificial Intelligence
Can AI systems develop concepts through experience the way humans do, discovering that planetary classifications or species boundaries might need revision as new information emerges? This theoretical work proposes treating concepts not as fixed dictionary definitions but as evolving information structures defined through reversible relationships with an agent's total experience. Using algorithmic information theory, the framework formalizes how concepts naturally emerge, split, merge, and compete to explain new observations—mirroring human conceptual development from childhood learning to scientific paradigm shifts. Potential applications include building AI systems that genuinely understand context, developing more interpretable machine learning models, creating educational AI that adapts conceptual frameworks to student understanding, and advancing artificial general intelligence research.
Authors: Zhengmian Hu
Link: https://arxiv.org/abs/2512.17373v1
Date: 2025-12-d
Summary:
Can artificial intelligence discover, from raw experience and without human supervision, concepts that humans have discovered? One challenge is that human concepts themselves are fluid: conceptual boundaries can shift, split, and merge as inquiry progresses (e.g., Pluto is no longer considered a planet). To make progress, we need a definition of "concept" that is not merely a dictionary label, but a structure that can be revised, compared, and aligned across agents. We propose an algorithmic-information viewpoint that treats a concept as an information object defined only through its structural relation to an agent's total experience. The core constraint is determination: a set of parts forms a reversible consistency relation if any missing part is recoverable from the others (up to the standard logarithmic slack in Kolmogorov-style identities). This reversibility prevents "concepts" from floating free of experience and turns concept existence into a checkable structural claim. To judge whether a decomposition is natural, we define excess information, measuring the redundancy overhead introduced by splitting experience into multiple separately described parts. On top of these definitions, we formulate dialectics as an optimization dynamics: as new patches of information appear (or become contested), competing concepts bid to explain them via shorter conditional descriptions, driving systematic expansion, contraction, splitting, and merging. Finally, we formalize low-cost concept transmission and multi-agent alignment using small grounds/seeds that allow another agent to reconstruct the same concept under a shared protocol, making communication a concrete compute-bits trade-off.
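The abstract describes excess information informally. One natural formalization consistent with that description, offered here as our own reading and stated only up to the usual logarithmic slack of Kolmogorov-complexity identities, is the overhead of describing the parts separately rather than jointly:

```latex
% Redundancy overhead of a k-part decomposition of experience:
% separate description lengths minus the joint description length.
E(x_1,\dots,x_k) \;=\; \sum_{i=1}^{k} K(x_i) \;-\; K(x_1,\dots,x_k)
```

In this reading, a decomposition is "natural" when its excess information is small, i.e., splitting experience into parts costs little redundancy.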
--------------------------------------------------------------------------------------------------------
Conservative Bias in Multi-Teacher Learning: Why Agents Prefer Low-Reward Advisors
In collaborative robotics and human-AI training scenarios, understanding how artificial agents select among multiple teachers is crucial for designing effective learning systems. This surprising discovery reveals that when given choices, learning agents overwhelmingly favor conservative teachers offering consistent but modest rewards over those promising dramatically higher payoffs—selecting them 93% of the time. This preference for safety and predictability over optimality mirrors human decision-making patterns and has significant implications for robot training in manufacturing, autonomous vehicle development, healthcare robotics where safety is paramount, and any human-robot collaboration scenario. The findings suggest that building trust and ensuring consistent guidance may matter more than maximizing immediate performance gains in training autonomous systems.
Authors: Maher Mesto, Francisco Cruz
Link: https://arxiv.org/abs/2512.17180v1
Date: 2025-12-d
Summary:
Interactive reinforcement learning (IRL) has shown promise in enabling autonomous agents and robots to learn complex behaviours from human teachers, yet the dynamics of teacher selection remain poorly understood. This paper reveals an unexpected phenomenon in IRL: when given a choice between teachers with different reward structures, learning agents overwhelmingly prefer conservative, low-reward teachers (93.16% selection rate) over those offering 20x higher rewards. Through 1,250 experimental runs in navigation tasks with multiple expert teachers, we discovered: (1) Conservative bias dominates teacher selection: agents systematically choose the lowest-reward teacher, prioritising consistency over optimality; (2) Critical performance thresholds exist at teacher availability ρ ≥ 0.6 and accuracy ω ≥ 0.6, below which the framework fails catastrophically; (3) The framework achieves 159% improvement over baseline Q-learning under concept drift. These findings challenge fundamental assumptions about optimal teaching in RL and suggest potential implications for human-robot collaboration, where human preferences for safety and consistency may align with the observed agent selection behaviour, potentially informing training paradigms for safety-critical robotic applications.
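The selection dynamic being measured can be pictured with a simple value-tracking chooser. The sketch below is a generic epsilon-greedy teacher selector, not the paper's framework, which additionally models teacher availability (ρ) and accuracy (ω).

```python
import random
from collections import defaultdict

class TeacherSelector:
    """Epsilon-greedy choice among advisors by running value estimate."""
    def __init__(self, teachers, lr=0.1, eps=0.1):
        self.teachers, self.lr, self.eps = list(teachers), lr, eps
        self.q = defaultdict(float)  # running value estimate per teacher

    def pick(self):
        if random.random() < self.eps:
            return random.choice(self.teachers)  # occasional exploration
        return max(self.teachers, key=lambda t: self.q[t])

    def update(self, teacher, reward):
        # Exponential moving average of the reward observed when
        # following this teacher's advice.
        self.q[teacher] += self.lr * (reward - self.q[teacher])
```

The paper's finding is that, in richer settings than this sketch, such agents gravitate toward the consistent low-reward advisor rather than the nominally optimal one.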
--------------------------------------------------------------------------------------------------------
Indigenous language preservation faces unique challenges when meeting modern educational accountability standards, requiring assessments that honor cultural authenticity while maintaining psychometric rigor. This work demonstrates how AI can ethically support Hawaiian language assessment development through careful human oversight, community involvement, and culturally grounded frameworks. The system successfully identified systematic design issues in test items while preserving linguistic and cultural integrity, accelerating analysis without replacing human expertise. This model offers valuable guidance for other Indigenous communities developing native language programs, minority language education systems worldwide, specialized technical vocabulary assessments, and any context where cultural sensitivity and educational measurement must coexist, proving AI can amplify rather than replace human cultural authority.
Authors: Pōhai Kūkea-Shultz, Frank Brockmann
Link: https://arxiv.org/abs/2512.17140v1
Date: 2025-12-d
Summary:
This paper presents the design and evaluation of a community-based artificial intelligence (AI) workflow developed for the Kaiapuni Assessment of Educational Outcomes (KĀ'EO) program, the only native language assessment used for federal accountability in the United States. The project explored whether document-grounded language models could ethically and effectively augment human analysis of item performance while preserving the cultural and linguistic integrity of the Hawaiian language. Operating under the KĀ'EO AI Policy Framework, the workflow used NotebookLM for cross-document synthesis of psychometric data and Claude 3.5 Sonnet for developer-facing interpretation, with human oversight at every stage. Fifty-eight flagged items across Hawaiian Language Arts, Mathematics, and Science were reviewed during Round 2 of the AI Lab, producing six interpretive briefs that identified systemic design issues such as linguistic ambiguity, Depth-of-Knowledge (DOK) misalignment, and structural overload. The findings demonstrate that AI can serve as an ethically bounded amplifier of human expertise, accelerating analysis while simultaneously prioritizing fairness, human expertise, and cultural authority. This work offers a replicable model for responsible AI integration in Indigenous-language educational measurement.
--------------------------------------------------------------------------------------------------------
Knowledge Distillation with Structured Chain-of-Thought for Text-to-SQL
Enterprise database systems face a costly trilemma: powerful commercial AI models are expensive, security concerns prevent cloud-based solutions, and affordable small models produce unreliable queries. This research addresses the challenge by teaching smaller, deployable models to reason about database queries using formal execution plans as structured blueprints rather than ambiguous natural language explanations. The approach achieved an 8.1% performance improvement, primarily by reducing syntax errors through clearer logical guidance. Applications span business intelligence platforms, customer service chatbots querying databases, automated report generation systems, data analysis tools for non-technical users, and any enterprise context requiring accurate SQL generation without exposing sensitive data to external APIs or incurring ongoing per-query costs.
Authors: Khushboo Thaker, Yony Bresler
Link: https://arxiv.org/abs/2512.17053v1
Date: 2025-12-d
Summary:
Deploying accurate Text-to-SQL systems at the enterprise level faces a difficult trilemma involving cost, security and performance. Current solutions force enterprises to choose between expensive, proprietary Large Language Models (LLMs) and low-performing Small Language Models (SLMs). Efforts to improve SLMs often rely on distilling reasoning from large LLMs using unstructured Chain-of-Thought (CoT) traces, a process that remains inherently ambiguous. Instead, we hypothesize that a formal, structured reasoning representation provides a clearer, more reliable teaching signal, as the Text-to-SQL task requires explicit and precise logical steps. To evaluate this hypothesis, we propose Struct-SQL, a novel Knowledge Distillation (KD) framework that trains an SLM to emulate a powerful LLM. Accordingly, we adopt a query execution plan as a formal blueprint to derive this structured reasoning. Our SLM, distilled with structured CoT, achieves an absolute improvement of 8.1% over an unstructured CoT distillation baseline. A detailed error analysis reveals that a key factor in this gain is a marked reduction in syntactic errors. This demonstrates that teaching a model to reason using a structured logical blueprint is beneficial for reliable SQL generation in SLMs.
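The distillation signal can be pictured as training triples in which the teacher's reasoning is serialized as an execution-plan-like blueprint rather than free-form text. The field names and plan syntax below are illustrative assumptions, not the paper's schema.

```python
# One hypothetical structured-CoT distillation example for the student SLM.
example = {
    "question": "Which department has the most employees?",
    "structured_cot": [  # execution-plan-style reasoning blueprint
        "SCAN employees",
        "GROUP BY department_id -> COUNT(*) AS n",
        "JOIN departments ON department_id",
        "ORDER BY n DESC",
        "LIMIT 1",
    ],
    "sql": (
        "SELECT d.name FROM employees e "
        "JOIN departments d ON e.department_id = d.id "
        "GROUP BY d.name ORDER BY COUNT(*) DESC LIMIT 1"
    ),
}
```

The intuition is that each plan step constrains the surface SQL, which is consistent with the observed reduction in syntactic errors.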
--------------------------------------------------------------------------------------------------------
A Women's Health Benchmark for Large Language Models
Millions increasingly turn to AI chatbots for health information, yet their reliability in women's health—encompassing obstetrics, gynecology, oncology, and emergency medicine—remains critically understudied. This benchmark evaluation of 13 leading language models reveals approximately 60% failure rates across diverse clinical scenarios, with particularly dangerous gaps in recognizing urgent medical situations requiring immediate care. The findings expose significant risks in AI-provided health advice spanning patient self-assessment, clinical decision support, and policy interpretation. Urgent applications include improving AI health assistants, training medical education tools, developing safer symptom checkers, informing regulatory frameworks for AI health applications, and ensuring equitable healthcare access, particularly in underserved regions where AI may substitute for unavailable specialist consultation.
Authors: Victoria-Elisabeth Gruber, Razvan Marinescu, Diego Fajardo, Amin H. Nassar, Christopher Arkfeld, Alexandria Ludlow, Shama Patel, Mehrnoosh Samaei, Valerie Klug, Anna Huber, Marcel Gühner, Albert Botta i Orfila, Irene Lagoja, Kimya Tarr, Haleigh Larson, Mary Beth Howard
Link: https://arxiv.org/abs/2512.17028v1
Date: 2025-12-d
Summary:
As large language models (LLMs) become primary sources of health information for millions, their accuracy in women's health remains critically unexamined. We introduce the Women's Health Benchmark (WHB), the first benchmark evaluating LLM performance specifically in women's health. Our benchmark comprises 96 rigorously validated model stumps covering five medical specialties (obstetrics and gynecology, emergency medicine, primary care, oncology, and neurology), three query types (patient query, clinician query, and evidence/policy query), and eight error types (dosage/medication errors, missing critical information, outdated guidelines/treatment recommendations, incorrect treatment advice, incorrect factual information, missing/incorrect differential diagnosis, missed urgency, and inappropriate recommendations). We evaluated 13 state-of-the-art LLMs and revealed alarming gaps: current models show approximately 60% failure rates on the women's health benchmark, with performance varying dramatically across specialties and error types. Notably, models universally struggle with "missed urgency" indicators, while newer models like GPT-5 show significant improvements in avoiding inappropriate recommendations. Our findings underscore that AI chatbots are not yet fully capable of providing reliable advice in women's health.
--------------------------------------------------------------------------------------------------------
GenEval 2: Addressing Benchmark Drift in Text-to-Image Evaluation
As text-to-image AI systems rapidly advance, evaluation benchmarks face inevitable obsolescence—yesterday's challenging test becomes today's trivial task. This research documents how GenEval, a popular benchmark, has drifted 17.7% from human judgment as models improved, rendering it effectively saturated and unable to distinguish cutting-edge systems. The proposed solution introduces GenEval 2 with more complex compositional challenges and Soft-TIFA, a modular evaluation method less prone to drift. These tools are crucial for AI art platforms selecting models, researchers comparing approaches, companies deploying generative AI in advertising and design, content moderation systems evaluating synthetic media, and policymakers assessing AI capability development, ensuring evaluation keeps pace with technological advancement.
Authors: Amita Kamath, Kai-Wei Chang, Ranjay Krishna, Luke Zettlemoyer, Yushi Hu, Marjan Ghazvininejad
Link: https://arxiv.org/abs/2512.16853v1
Date: 2025-12-d
Summary:
Automating Text-to-Image (T2I) model evaluation is challenging; a judge model must be used to score correctness, and test prompts must be selected to be challenging for current T2I models but not the judge. We argue that satisfying these constraints can lead to benchmark drift over time, where the static benchmark judges fail to keep up with newer model capabilities. We show that benchmark drift is a significant problem for GenEval, one of the most popular T2I benchmarks. Although GenEval was well-aligned with human judgment at the time of its release, it has drifted far from human judgment over time -- resulting in an absolute error of as much as 17.7% for current models. This level of drift strongly suggests that GenEval has been saturated for some time, as we verify via a large-scale human study. To help fill this benchmarking gap, we introduce a new benchmark, GenEval 2, with improved coverage of primitive visual concepts and higher degrees of compositionality, which we show is more challenging for current models. We also introduce Soft-TIFA, an evaluation method for GenEval 2 that combines judgments for visual primitives, which we show is better aligned with human judgment and argue is less likely to drift from human alignment over time (as compared to more holistic judges such as VQAScore). Although we hope GenEval 2 will provide a strong benchmark for many years, avoiding benchmark drift is far from guaranteed and our work, more generally, highlights the importance of continual audits and improvement for T2I and related automated model evaluation benchmarks.
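The modular idea behind Soft-TIFA can be sketched as scoring each visual primitive separately with a judge and then combining the per-primitive scores; the mean aggregation below is our assumption for illustration, not necessarily the paper's exact combination rule.

```python
def soft_tifa_score(image, primitive_questions, vqa_judge):
    """Combine per-primitive judge scores into one prompt-level score.

    primitive_questions: e.g. ["is there a dog?", "is the dog red?"]
    vqa_judge(image, q): assumed to return P(answer is "yes") in [0, 1].
    """
    scores = [vqa_judge(image, q) for q in primitive_questions]
    return sum(scores) / len(scores)  # soft aggregate over visual primitives
```

Because each judge only answers a narrow primitive question, the judges stay within their competence even as generators improve, which is the argued hedge against drift.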
--------------------------------------------------------------------------------------------------------
Meta-RL Induces Exploration in Language Agents
Language model agents trained through reinforcement learning often struggle with tasks requiring systematic exploration and learning from failed attempts—critical capabilities for real-world applications. This work introduces LaMer, a meta-learning framework enabling AI agents to actively explore environments and adapt strategies based on trial-and-error without retraining, achieving 11-19% performance improvements across diverse tasks. The approach combines cross-episode learning for long-term optimization with in-context reflection for rapid adaptation. Applications include autonomous shopping assistants navigating unfamiliar e-commerce sites, game-playing AI tackling novel puzzles, robotic systems adapting to new environments, scientific discovery tools exploring hypothesis spaces, and customer service bots learning optimal interaction patterns, particularly in scenarios where exhaustive pre-training proves impractical.
Authors: Yulun Jiang, Liangze Jiang, Damien Teney, Michael Moor, Maria Brbic
Link: https://arxiv.org/abs/2512.16848v1
Date: 2025-12-d
Summary:
Reinforcement learning (RL) has enabled the training of large language model (LLM) agents to interact with the environment and to solve multi-turn long-horizon tasks. However, RL-trained agents often struggle in tasks that require active exploration and fail to efficiently adapt from trial-and-error experiences. In this paper, we present LaMer, a general Meta-RL framework that enables LLM agents to actively explore and learn from environment feedback at test time. LaMer consists of two key components: (i) a cross-episode training framework to encourage exploration and long-term reward optimization; and (ii) in-context policy adaptation via reflection, allowing the agent to adapt its policy from task feedback signals without gradient updates. Experiments across diverse environments show that LaMer significantly improves performance over RL baselines, with 11%, 14%, and 19% performance gains on Sokoban, MineSweeper and Webshop, respectively. Moreover, LaMer also demonstrates better generalization to more challenging or previously unseen tasks compared to the RL-trained agents. Overall, our results demonstrate that Meta-RL provides a principled approach to induce exploration in language agents, enabling more robust adaptation to novel environments through learned exploration strategies.
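At a high level, the test-time loop described above alternates acting, reflecting, and retrying without any gradient updates. In this sketch, `agent` and `reflect` are assumed LLM-backed callables and the environment API is a generic stand-in, not LaMer's actual interface.

```python
def test_time_loop(env, agent, reflect, max_episodes=5):
    """Cross-episode exploration with in-context reflection (illustrative)."""
    lessons = []                              # in-context memory across episodes
    for _ in range(max_episodes):
        obs, done, trajectory = env.reset(), False, []
        while not done:
            action = agent(obs, lessons)      # policy conditioned on past lessons
            obs, reward, done = env.step(action)
            trajectory.append((action, reward))
        lessons.append(reflect(trajectory))   # distill what worked and what failed
    return lessons
```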
--------------------------------------------------------------------------------------------------------
Cyber Humanism in Education: Reclaiming Agency through AI and Learning Sciences
As generative AI transforms educational workflows, concerns mount about cognitive offloading, epistemic automation, and teacher de-professionalization—raising fundamental questions about human agency in AI-mediated learning. This framework proposes "Cyber Humanism" positioning educators and students as algorithmic citizens co-authoring socio-technical learning infrastructures rather than passive technology consumers. Through reflexive competence, algorithmic citizenship, and dialogic design, the approach emphasizes human-AI collaboration that strengthens rather than supplants human expertise. Applications include redesigning teacher professional development, creating AI-literacy curricula, developing ethical AI integration policies for schools and universities, designing prompt-based learning environments, establishing new educator certifications, and shaping educational technology governance to ensure AI augments rather than replaces human pedagogical judgment and student agency.
Authors: Giovanni Adorni
Link: https://arxiv.org/abs/2512.16701v1
Date: 2025-12-d
Summary:
Generative Artificial Intelligence (GenAI) is rapidly reshaping how knowledge is produced and validated in education. Rather than adding another digital tool, large language models reconfigure reading, writing, and coding into hybrid human-AI workflows, raising concerns about epistemic automation, cognitive offloading, and the de-professionalisation of teachers. This paper proposes "Cyber Humanism in Education" as a framework for reclaiming human agency in this landscape. We conceptualise AI-enabled learning environments as socio-technical infrastructures co-authored by humans and machines, and position educators and learners as epistemic agents and algorithmic citizens who have both the right and the responsibility to shape these infrastructures. We articulate three pillars for cyber-humanist design--reflexive competence, algorithmic citizenship, and dialogic design--and relate them to major international digital and AI competence frameworks. We then present higher-education case studies that operationalise these ideas through prompt-based learning and a new Conversational AI Educator certification within the EPICT ecosystem. The findings show how such practices can strengthen epistemic agency while surfacing tensions around workload, equity, and governance, and outline implications for the future of AI-rich, human-centred education.
--------------------------------------------------------------------------------------------------------
The Universe Learning Itself: On the Evolution of Dynamics from the Big Bang to Machine Intelligence
Rather than viewing cosmology, biology, cognition, and artificial intelligence as separate domains, this ambitious theoretical work traces a continuous thread of evolving dynamics from the universe's origin through contemporary machine learning. The framework interprets cosmic structure formation, planetary geochemistry, biological evolution, brain development, and AI systems as successive regimes of increasingly complex dynamical systems connected by phase transitions and emergent attractors. By identifying recurring mathematical patterns—instability, bifurcation, multiscale coupling—the work offers a unified lens for understanding how the universe has progressively developed systems capable of modeling and perturbing their own futures. Applications include informing AI architecture design, guiding astrobiology research, developing unified theories across scientific disciplines, and philosophical frameworks for understanding consciousness and intelligence.
Authors: Pradeep Singh, Mudasani Rushikesh, Bezawada Sri Sai Anurag, Balasubramanian Raman
Link: https://arxiv.org/abs/2512.16515v1
Date: 2025-12-d
Summary:
We develop a unified, dynamical-systems narrative of the universe that traces a continuous chain of structure formation from the Big Bang to contemporary human societies and their artificial learning systems. Rather than treating cosmology, astrophysics, geophysics, biology, cognition, and machine intelligence as disjoint domains, we view each as successive regimes of dynamics on ever-richer state spaces, stitched together by phase transitions, symmetry-breaking events, and emergent attractors. Starting from inflationary field dynamics and the growth of primordial perturbations, we describe how gravitational instability sculpts the cosmic web, how dissipative collapse in baryonic matter yields stars and planets, and how planetary-scale geochemical cycles define long-lived nonequilibrium attractors. Within these attractors, we frame the origin of life as the emergence of self-maintaining reaction networks, evolutionary biology as flow on high-dimensional genotype-phenotype-environment manifolds, and brains as adaptive dynamical systems operating near critical surfaces. Human culture and technology--including modern machine learning and artificial intelligence--are then interpreted as symbolic and institutional dynamics that implement and refine engineered learning flows which recursively reshape their own phase space. Throughout, we emphasize recurring mathematical motifs--instability, bifurcation, multiscale coupling, and constrained flows on measure-zero subsets of the accessible state space. Our aim is not to present any new cosmological or biological model, but a cross-scale, theoretical perspective: a way of reading the universe's history as the evolution of dynamics itself, culminating (so far) in biological and artificial systems capable of modeling, predicting, and deliberately perturbing their own future trajectories.
--------------------------------------------------------------------------------------------------------
Probing Scientific General Intelligence of LLMs with Scientist-Aligned Workflows
Despite impressive capabilities, current AI systems lack genuine Scientific General Intelligence—the ability to autonomously conceive experiments, investigate phenomena, and reason across scientific domains like human researchers. This comprehensive evaluation framework grounds scientific AI assessment in the Practical Inquiry Model, testing systems across deep research, idea generation, experimental design, and reasoning tasks inspired by science's biggest questions. Results reveal significant gaps: low exact-match accuracy despite reasonable step-level reasoning, infeasible experimental proposals, and persistent multimodal reasoning challenges. Applications include developing AI research assistants for laboratories, drug discovery platforms, materials science exploration tools, hypothesis generation systems, automated literature review services, and educational tools that genuinely participate in scientific discovery rather than merely retrieving information.
Authors: Wanghan Xu, Yuhao Zhou, Yifan Zhou, Qinglong Cao, Shuo Li, Jia Bu, Bo Liu, Yixin Chen, Xuming He, Xiangyu Zhao, Xiang Zhuang, Fengxiang Wang, Zhiwang Zhou, Qiantai Feng, Wenxuan Huang, Jiaqi Wei, Hao Wu, Yuejin Yang, Guangshuai Wang, Sheng Xu, Ziyan Huang, Xinyao Liu, Jiyao Liu, Cheng Tang, Wei Li, Ying Chen, Junzhi Ning, Pengfei Jiang, Chenglong Ma, Ye Du, Changkai Ji, Huihui Xu, Ming Hu, Jiangbin Zheng, Xin Chen, Yucheng Wu, Feifei Jiang, Xi Chen, Xiangru Tang, Yuchen Fu, Yingzhou Lu, Yuanyuan Zhang, Lihao Sun, Chengbo Li, Jinzhe Ma, Wanhao Liu, Yating Liu, Kuo-Cheng Wu, Shengdu Chai, Yizhou Wang, Ouwen Zhangjin, Chen Tang, Shufei Zhang, Wenbo Cao, Junjie Ren, Taoyong Cui, Zhouheng Yao, Juntao Deng, Yijie Sun, Feng Liu, Wangxu Wei, Jingyi Xu, Zhangrui Li, Junchao Gong, Zijie Guo, Zhiyu Yao, Zaoyu Chen, Tianhao Peng, Fangchen Yu, Bo Zhang, Dongzhan Zhou, Shixiang Tang, Jiaheng Liu, Fenghua Ling, Yan Lu, Yuchen Ren, Ben Fei, Zhen Zhao, Xinyu Gu, Rui Su, Xiao-Ming Wu, Weikang Si, Yang Liu, Hao Chen, Xiangchao Yan, Xue Yang, Junchi Yan, Jiamin Wu, Qihao Zheng, Chenhui Li, Zhiqiang Gao, Hao Kong, Junjun He, Mao Su, Tianfan Fu, Peng Ye, Chunfeng Song, Nanqing Dong, Yuqiang Li, Huazhu Fu, Siqi Sun, Lijing Cheng, Jintai Lin, Wanli Ouyang, Bowen Zhou, Wenlong Zhang, Lei Bai
Link: https://arxiv.org/abs/2512.16969v1
Date: 2025-12-d
Summary:
Despite advances in scientific AI, a coherent framework for Scientific General Intelligence (SGI)--the ability to autonomously conceive, investigate, and reason across scientific domains--remains lacking. We present an operational SGI definition grounded in the Practical Inquiry Model (PIM: Deliberation, Conception, Action, Perception) and operationalize it via four scientist-aligned tasks: deep research, idea generation, dry/wet experiments, and experimental reasoning. SGI-Bench comprises over 1,000 expert-curated, cross-disciplinary samples inspired by Science's 125 Big Questions, enabling systematic evaluation of state-of-the-art LLMs. Results reveal gaps: low exact match (10--20%) in deep research despite step-level alignment; ideas lacking feasibility and detail; high code executability but low execution result accuracy in dry experiments; low sequence fidelity in wet protocols; and persistent multimodal comparative-reasoning challenges. We further introduce Test-Time Reinforcement Learning (TTRL), which optimizes retrieval-augmented novelty rewards at inference, enhancing hypothesis novelty without reference answers. Together, our PIM-grounded definition, workflow-centric benchmark, and empirical insights establish a foundation for AI systems that genuinely participate in scientific discovery.
--------------------------------------------------------------------------------------------------------
Towards AI-Supported Research: a Vision of the TIB AIssistant
Academic research increasingly drowns in information overload while generative AI promises unprecedented workflow augmentation—yet effectively integrating AI into scholarly practices remains challenging due to varying domain needs and unclear accuracy guarantees. This platform vision proposes a modular, domain-agnostic human-machine collaborative system supporting researchers throughout the entire research lifecycle, from ideation through writing. With shared data stores, flexible orchestration frameworks, and prompt/tool libraries, the system aims to assist literature analysis, methodology development, data analysis, and scholarly communication. Applications span accelerating systematic reviews, suggesting research directions, automating data preprocessing, generating draft sections, cross-disciplinary knowledge synthesis, and democratizing advanced research tools for under-resourced institutions, ultimately amplifying rather than replacing human scholarly expertise.
Authors: Sören Auer, Allard Oelen, Mohamad Yaser Jaradeh, Mutahira Khalid, Farhana Keya, Sasi Kiran Gaddipati, Jennifer D'Souza, Lorenz Schlüter, Amirreza Alasti, Gollam Rabby, Azanzi Jiomekong, Oliver Karras
Link: https://arxiv.org/abs/2512.16447v1
Date: 2025-12-d
Summary:
The rapid advancements in Generative AI and Large Language Models promise to transform the way research is conducted, potentially offering unprecedented opportunities to augment scholarly workflows. However, effectively integrating AI into research remains a challenge due to varying domain requirements, limited AI literacy, the complexity of coordinating tools and agents, and the unclear accuracy of Generative AI in research. We present the vision of the TIB AIssistant, a domain-agnostic human-machine collaborative platform designed to support researchers across disciplines in scientific discovery, with AI assistants supporting tasks across the research life cycle. The platform offers modular components - including prompt and tool libraries, a shared data store, and a flexible orchestration framework - that collectively facilitate ideation, literature analysis, methodology development, data analysis, and scholarly writing. We describe the conceptual framework, system architecture, and implementation of an early prototype that demonstrates the feasibility and potential impact of our approach.
--------------------------------------------------------------------------------------------------------
Using Gaussian Splats to Create High-Fidelity Facial Geometry and Texture
Creating photorealistic digital faces from photographs traditionally requires extensive capture setups and processing pipelines, limiting accessibility for filmmaking, game development, and virtual reality applications. This approach leverages Gaussian Splatting—a neural representation more explicit than NeRFs—to reconstruct accurate facial geometry and appearance from just 11 uncalibrated images rather than lengthy videos. By constraining Gaussians to underlying surfaces and transforming them into view-dependent neural textures, the method produces assets immediately usable in standard graphics pipelines without modifying existing rendering infrastructure. Applications include rapid digital character creation for films, personalized avatar generation for virtual meetings, facial animation in games, virtual try-on for cosmetics, forensic facial reconstruction, and accessible content creation tools democratizing high-fidelity digital human production.
Authors: Haodi He, Jihun Yu, Ronald Fedkiw
Link: https://arxiv.org/abs/2512.16397v1
Date: 2025-12-d
Summary:
We leverage increasingly popular three-dimensional neural representations in order to construct a unified and consistent explanation of a collection of uncalibrated images of the human face. Our approach utilizes Gaussian Splatting, since it is more explicit and thus more amenable to constraints than NeRFs. We leverage segmentation annotations to align the semantic regions of the face, facilitating the reconstruction of a neutral pose from only 11 images (as opposed to requiring a long video). We soft constrain the Gaussians to an underlying triangulated surface in order to provide a more structured Gaussian Splat reconstruction, which in turn informs subsequent perturbations to increase the accuracy of the underlying triangulated surface. The resulting triangulated surface can then be used in a standard graphics pipeline. In addition, and perhaps most impactful, we show how accurate geometry enables the Gaussian Splats to be transformed into texture space where they can be treated as a view-dependent neural texture. This allows one to use high visual fidelity Gaussian Splatting on any asset in a scene without the need to modify any other asset or any other aspect (geometry, lighting, renderer, etc.) of the graphics pipeline. We utilize a relightable Gaussian model to disentangle texture from lighting in order to obtain a delit high-resolution albedo texture that is also readily usable in a standard graphics pipeline. The flexibility of our system allows for training with disparate images, even with incompatible lighting, facilitating robust regularization. Finally, we demonstrate the efficacy of our approach by illustrating its use in a text-driven asset creation pipeline.
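The "soft constrain the Gaussians to an underlying triangulated surface" step can be pictured as a penalty on the distance from each Gaussian center to the surface. In the sketch below, `closest_point_on_mesh` is an assumed helper (a standard point-to-triangle query), not the paper's implementation.

```python
import torch

def surface_constraint_loss(centers, mesh, weight=1.0):
    """Penalize Gaussian centers (N, 3) for drifting off the triangulated surface."""
    projected = closest_point_on_mesh(centers, mesh)  # hypothetical (N, 3) helper
    return weight * ((centers - projected) ** 2).sum(dim=-1).mean()
```

Added to the usual splatting photometric loss, a term like this keeps the splats structured around the mesh, which in turn lets their corrections feed back into the underlying surface.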
--------------------------------------------------------------------------------------------------------
Occupational Tasks, Automation, and Economic Growth: A Modeling and Simulation Approach
The Fourth Industrial Revolution raises urgent questions about how AI and automation will reshape employment, productivity, and economic inequality across occupations. This analytical framework endogenously integrates knowledge accumulation with technological lock-in and knowledge generation burdens, explicitly modeling how automation and production interact with economic growth. Through high-throughput computational simulations, the work quantifies how structural parameters influence production output, wages, and labor's share of economic gains—crucially demonstrating that wages and labor shares can be independently influenced through distinct policy interventions. Applications include informing labor market policies, designing educational curricula for workforce adaptation, evaluating universal basic income proposals, guiding corporate automation strategies, and developing economic forecasts for technology-driven transformation.
Authors: Georgios A. Tritsaris
Link: https://arxiv.org/abs/2512.16261v1
Date: 2025-12-d
Summary:
The Fourth Industrial Revolution commonly refers to the accelerating technological transformation that has been taking place in the 21st century. Economic growth theories which treat the accumulation of knowledge and its effect on production endogenously remain relevant, yet they have been evolving to explain how the current wave of advancements in automation and artificial intelligence (AI) technology will affect productivity and different occupations. This work contributes to current economic discourse by developing an analytical task-based framework that endogenously integrates knowledge accumulation with frictions that describe technological lock-in and the burden of knowledge generation and validation. The interaction between production (or automation) and growth (or knowledge accumulation) is also described explicitly. To study how automation and AI shape economic outcomes, I rely on high-throughput calculations of the developed model. The effect of the model's structural parameters on key variables such as production output, wages, and labor shares of output is quantified, and possible intervention strategies are briefly discussed. An important result is that wages and labor shares are not directly linked; instead, they can be influenced independently through distinct policy levers.
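As a toy illustration of a task-based setup (a textbook-style sketch, not the paper's model), output can be written as a CES aggregate over a continuum of tasks, with tasks below an automation threshold performed by capital and the rest by labor:

```python
import numpy as np

def ces_output(threshold, K, L, sigma=0.5, n_tasks=1000):
    """CES aggregate over tasks; tasks below `threshold` are automated."""
    tasks = np.linspace(0.0, 1.0, n_tasks)
    automated = tasks < threshold
    k_per = K / max(automated.sum(), 1)     # capital spread over automated tasks
    l_per = L / max((~automated).sum(), 1)  # labor spread over the rest
    y = np.where(automated, k_per, l_per)   # per-task output (toy: linear in inputs)
    rho = (sigma - 1.0) / sigma
    return np.mean(y ** rho) ** (1.0 / rho)
```

Sweeping `threshold` in such a toy traces how deeper automation shifts output and the implied labor share, the kind of comparative statics the paper runs at scale with its richer endogenous-growth model.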
--------------------------------------------------------------------------------------------------------
Wireless network optimization requires simultaneously placing radio nodes and assigning users while balancing signal quality against load distribution—a complex problem with real-world deployment constraints. This algorithm introduces weighted K-harmonic means clustering with rigorous convergence guarantees under both deterministic and random initialization, directly mapping clustering weights to fractional user association based on signal strength. Unlike classical approaches, the method admits a natural wireless interpretation while achieving superior tradeoffs between minimum signal strength and fairness compared to existing baselines. Applications include 5G and future 6G network planning, optimizing WiFi access point placement in buildings, managing cellular network densification, coordinating satellite internet constellations, emergency communication network deployment, and any wireless infrastructure design requiring principled joint node placement and user association strategies.
Authors: Gourab Ghatak
Link: https://arxiv.org/abs/2512.16185v1
Date: 2025-12-d
Summary:
We propose the weighted K-harmonic means (WKHM) clustering algorithm, a regularized variant of K-harmonic means designed to ensure numerical stability while enabling soft assignments through inverse-distance weighting. Unlike classical K-means and constrained K-means, WKHM admits a direct interpretation in wireless networks: its weights are exactly equivalent to fractional user association based on received signal strength. We establish rigorous convergence guarantees under both deterministic and stochastic settings, addressing key technical challenges arising from non-convexity and random initialization. Specifically, we prove monotone descent to a local minimum under fixed initialization, convergence in probability under Binomial Point Process (BPP) initialization, and almost sure convergence under mild decay conditions. These results provide the first stochastic convergence guarantees for harmonic-mean-based clustering. Finally, through extensive simulations with diverse user distributions, we show that WKHM achieves a superior tradeoff between minimum signal strength and load fairness compared to classical and modern clustering baselines, making it a principled tool for joint radio node placement and user association in wireless networks.
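The algorithm's flavor is visible in a single K-harmonic-means center update, whose soft inverse-distance memberships are what the paper maps to fractional user association. The sketch below is a standard regularized KHM step; WKHM itself adds the exact weighting scheme and the convergence machinery described above.

```python
import numpy as np

def khm_center_update(X, C, p=3.5, reg=1e-9):
    """One K-harmonic-means center update with soft inverse-distance weights.

    X: (n, d) user locations; C: (k, d) radio-node locations.
    """
    d = np.linalg.norm(X[:, None, :] - C[None, :, :], axis=-1) + reg  # regularized
    m = d ** (-p - 2) / (d ** (-p - 2)).sum(axis=1, keepdims=True)    # memberships
    w = (d ** (-p - 2)).sum(axis=1) / (d ** (-p)).sum(axis=1) ** 2    # user weights
    mw = m * w[:, None]
    return (mw[:, :, None] * X[:, None, :]).sum(axis=0) / mw.sum(axis=0)[:, None]
```

Iterating this update from a random placement moves the radio nodes while the soft memberships simultaneously give each user's fractional association.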
--------------------------------------------------------------------------------------------------------