Week Ending 1.18.2026

 

RESEARCH WATCH: 1.18.2026

 

Capacity Constraints Make Admissions Processes Less Predictable

Machine learning systems increasingly predict outcomes in capacity-constrained admissions processes, from college acceptance to job hiring. However, these processes fundamentally differ from traditional ML paradigms because decisions depend on the entire applicant pool rather than individual merit alone. This research reveals how capacity constraints create inherent unpredictability, introducing concepts of "instability" and "variability" that measure how admissions decisions shift when applicant pools change. Using New York City high school admissions data, the authors demonstrate that ML performance degrades as applicant pools diverge from training data. These findings have critical implications for college admissions consulting, employment recruitment services, and fairness in algorithmic decision-making systems.

Authors:  Evan Dong, Nikhil Garg, Sarah Dean

Link:  https://arxiv.org/abs/2601.11513v1

Date: 2026-01-d

Summary:

Machine learning models are often used to make predictions about admissions process outcomes, such as for colleges or jobs. However, such decision processes differ substantially from the conventional machine learning paradigm. Because admissions decisions are capacity-constrained, whether a student is admitted depends on the other applicants who apply. We show how this dependence affects predictive performance even in otherwise ideal settings. Theoretically, we introduce two concepts that characterize the relationship between admission function properties, machine learning representation, and generalization to applicant pool distribution shifts: instability, which measures how many existing decisions can change when a single new applicant is introduced; and variability, which measures the number of unique students whose decisions can change. Empirically, we illustrate our theory on individual-level admissions data from the New York City high school matching system, showing that machine learning performance degrades as the applicant pool increasingly differs from the training data. Furthermore, there are larger performance drops for schools using decision rules that are more unstable and variable. Our work raises questions about the reliability of predicting individual admissions probabilities.
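
To make "instability" concrete, consider the simplest capacity-constrained rule: admit the top k applicants by score. The sketch below (our toy illustration, not the authors' code) counts how many existing decisions flip when one new applicant joins the pool; richer rules such as quotas, tie-breaking, or multi-school matching can flip many more, which is exactly what instability and variability quantify.

    import numpy as np

    def admit_top_k(scores, k):
        # Capacity-constrained rule: admit the k highest-scoring applicants.
        order = np.argsort(scores)[::-1]
        admitted = np.zeros(len(scores), dtype=bool)
        admitted[order[:k]] = True
        return admitted

    def decisions_changed(scores, k, new_score):
        # Re-run admissions with one extra applicant and count flipped
        # decisions among the original pool.
        before = admit_top_k(scores, k)
        after = admit_top_k(np.append(scores, new_score), k)[:-1]
        return int((before != after).sum())

    pool = np.random.default_rng(0).normal(size=1000)
    # A strong new applicant bumps exactly one admit under plain top-k;
    # the paper's point is that realistic decision rules are far less tame.
    print(decisions_changed(pool, k=100, new_score=3.0))  # -> 1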

--------------------------------------------------------------------------------------------------------

Health Facility Location in Ethiopia: Leveraging LLMs to Integrate Expert Knowledge into Algorithmic Planning

Ethiopia faces a critical challenge in upgrading rural health infrastructure with limited resources, requiring strategic decisions about which facilities to prioritize. Traditional optimization methods demand precise quantitative objectives, yet healthcare stakeholders often express priorities in qualitative, natural language terms. This research bridges this gap through the LEG framework, combining provable optimization algorithms with large language models to systematically integrate expert knowledge into facility placement decisions. Tested across three Ethiopian regions, the system balances population coverage guarantees with diverse stakeholder preferences. This approach has applications in global health planning, disaster response resource allocation, educational infrastructure development, and any domain requiring data-driven decisions that incorporate human expertise and community values.

Authors:  Yohai Trabelsi, Guojun Xiong, Fentabil Getnet, Stéphane Verguet, Milind Tambe

Link:  https://arxiv.org/abs/2601.11479v1

Date: 2026-01-d

Summary:

Ethiopia's Ministry of Health is upgrading health posts to improve access to essential services, particularly in rural areas. Limited resources, however, require careful prioritization of which facilities to upgrade to maximize population coverage while accounting for diverse expert and stakeholder preferences. In collaboration with the Ethiopian Public Health Institute and Ministry of Health, we propose a hybrid framework that systematically integrates expert knowledge with optimization techniques. Classical optimization methods provide theoretical guarantees but require explicit, quantitative objectives, whereas stakeholder criteria are often articulated in natural language and difficult to formalize. To bridge these domains, we develop the Large language model and Extended Greedy (LEG) framework. Our framework combines a provable approximation algorithm for population coverage optimization with LLM-driven iterative refinement that incorporates human-AI alignment to ensure solutions reflect expert qualitative guidance while preserving coverage guarantees. Experiments on real-world data from three Ethiopian regions demonstrate the framework's effectiveness and its potential to inform equitable, data-driven health system planning.
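
The provable half of LEG is a coverage-maximization routine. The classical greedy algorithm below (a generic sketch over an invented toy data format, not the LEG implementation) attains the familiar (1 - 1/e) approximation for this kind of objective; an LLM-driven loop could then re-rank near-optimal alternatives against stakeholders' qualitative guidance.

    def greedy_coverage(candidates, budget):
        # candidates: facility id -> set of population cells it would cover.
        # Repeatedly upgrade the facility with the largest marginal coverage.
        chosen, covered = [], set()
        remaining = dict(candidates)
        for _ in range(budget):
            if not remaining:
                break
            best = max(remaining, key=lambda f: len(remaining[f] - covered))
            if not remaining[best] - covered:
                break  # no facility adds new coverage
            chosen.append(best)
            covered |= remaining.pop(best)
        return chosen, covered

    posts = {"post_A": {1, 2, 3, 4}, "post_B": {3, 4, 5},
             "post_C": {6, 7}, "post_D": {1, 6}}
    print(greedy_coverage(posts, budget=2))
    # (['post_A', 'post_C'], {1, 2, 3, 4, 6, 7})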

--------------------------------------------------------------------------------------------------------

The unreasonable effectiveness of pattern matching

This research challenges fundamental assumptions about how large language models process language by demonstrating their remarkable ability to extract meaning from "Jabberwocky"—text where content words are replaced with nonsense. When presented with sentences like "He dwushed a ghanc zawk," LLMs successfully translate them to sensible alternatives like "He dragged a spare chair." This capability addresses ongoing debates about whether LLMs merely mimic language or possess deeper understanding. The findings suggest pattern-matching is not inferior to "real" intelligence but rather a crucial component of it. Applications extend to language learning systems, translation tools for fragmented texts, data recovery from corrupted documents, and understanding cognitive processes underlying human language comprehension.

Authors:  Gary Lupyan, Blaise Agüera y Arcas

Link:  https://arxiv.org/abs/2601.11432v1

Date: 2026-01-d

Summary:

We report on an astonishing ability of large language models (LLMs) to make sense of "Jabberwocky" language in which most or all content words have been randomly replaced by nonsense strings, e.g., translating "He dwushed a ghanc zawk" to "He dragged a spare chair". This result addresses ongoing controversies regarding how to best think of what LLMs are doing: are they a language mimic, a database, a blurry version of the Web? The ability of LLMs to recover meaning from structural patterns speaks to the unreasonable effectiveness of pattern-matching. Pattern-matching is not an alternative to "real" intelligence, but rather a key ingredient.

--------------------------------------------------------------------------------------------------------

Wetland mapping from sparse annotations with satellite image time series and temporal-aware segment anything model

Accurate wetland mapping is essential for ecosystem conservation, yet traditional approaches requiring dense pixel-level annotations are prohibitively expensive. Wetlands also exhibit dramatic seasonal and year-to-year changes that render single-date imagery inadequate. While foundation models like SAM show promise with sparse point labels, they fail to capture temporal dynamics, producing fragmented maps in heterogeneous environments. WetSAM addresses these challenges by integrating satellite time series with a dual-branch architecture that separates temporal patterns from spatial structure. Achieving 85.58% F1-score across eight global regions totaling approximately 40,000 km², this framework enables scalable, low-cost wetland monitoring. Applications include climate change impact assessment, biodiversity conservation planning, carbon sequestration monitoring, and regulatory compliance mapping.

Authors:  Shuai Yuan, Tianwu Lin, Shuang Chen, Yu Xia, Peng Qin, Xiangyu Liu, Xiaoqing Xu, Nan Xu, Hongsheng Zhang, Jie Wang, Peng Gong

Link:  https://arxiv.org/abs/2601.11400v1

Date: 2026-01-d

Summary:

Accurate wetland mapping is essential for ecosystem monitoring, yet dense pixel-level annotation is prohibitively expensive, and practical applications usually rely on sparse point labels, under which existing deep learning models perform poorly. Strong seasonal and inter-annual wetland dynamics further render single-date imagery inadequate and lead to significant mapping errors. Although foundation models such as SAM show promising generalization from point prompts, they are inherently designed for static images and fail to model temporal information, resulting in fragmented masks in heterogeneous wetlands. To overcome these limitations, we propose WetSAM, a SAM-based framework that integrates satellite image time series for wetland mapping from sparse point supervision through a dual-branch design: a temporally prompted branch extends SAM with hierarchical adapters and dynamic temporal aggregation to disentangle wetland characteristics from phenological variability; a spatial branch employs a temporally constrained region-growing strategy to generate reliable dense pseudo-labels; and a bidirectional consistency regularization jointly optimizes both branches. Extensive experiments across eight global regions of approximately 5,000 km² each demonstrate that WetSAM substantially outperforms state-of-the-art methods, achieving an average F1-score of 85.58% and delivering accurate, structurally consistent wetland segmentation with minimal labeling effort. These results highlight its strong generalization capability and potential for scalable, low-cost, high-resolution wetland mapping.

--------------------------------------------------------------------------------------------------------

SD-RAG: A Prompt-Injection-Resilient Framework for Selective Disclosure in Retrieval-Augmented Generation

Retrieval-Augmented Generation systems combine large language models with external knowledge retrieval, but current approaches risk exposing sensitive information directly to generation models. Existing safeguards relying on prompts instructing models to protect data are vulnerable to prompt injection attacks that override these constraints. SD-RAG fundamentally reimagines this architecture by decoupling security enforcement from generation, applying sanitization during retrieval rather than trusting the LLM to self-censor. The system introduces semantic mechanisms for human-readable security policies and graph-based data models supporting fine-grained access control. Demonstrating up to 58% improvement in privacy scores while resisting prompt injection attacks, SD-RAG has applications in healthcare information systems, legal document analysis, enterprise knowledge management, and any domain requiring controlled information disclosure.

Authors:  Aiman Al Masoud, Marco Arazzi, Antonino Nocera

Link:  https://arxiv.org/abs/2601.11199v1

Date: 2026-01-d

Summary:

Retrieval-Augmented Generation (RAG) has attracted significant attention due to its ability to combine the generative capabilities of Large Language Models (LLMs) with knowledge obtained through efficient retrieval mechanisms over large-scale data collections. Currently, the majority of existing approaches overlook the risks associated with exposing sensitive or access-controlled information directly to the generation model. Only a few approaches propose techniques to instruct the generative model to refrain from disclosing sensitive information; however, recent studies have also demonstrated that LLMs remain vulnerable to prompt injection attacks that can override intended behavioral constraints. For these reasons, we propose a novel approach to Selective Disclosure in Retrieval-Augmented Generation, called SD-RAG, which decouples the enforcement of security and privacy constraints from the generation process itself. Rather than relying on prompt-level safeguards, SD-RAG applies sanitization and disclosure controls during the retrieval phase, prior to augmenting the language model's input. Moreover, we introduce a semantic mechanism to allow the ingestion of human-readable dynamic security and privacy constraints together with an optimized graph-based data model that supports fine-grained, policy-aware retrieval. Our experimental evaluation demonstrates the superiority of SD-RAG over existing baseline approaches, achieving up to a 58% improvement in the privacy score, while also showing a strong resilience to prompt injection attacks targeting the generative model.
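
The architectural point, enforcement before the prompt is assembled, can be shown in a few lines. This is a deliberately crude regex-based sanitizer of our own invention (SD-RAG's real policies are semantic and graph-based), but it illustrates why a prompt-injected instruction cannot reach text that was redacted at retrieval time:

    import re

    # Hypothetical role-based policy; SD-RAG's actual policies are richer.
    POLICY = {"public": [r"\b\d{3}-\d{2}-\d{4}\b",   # SSN-like identifiers
                         r"(?i)salary:\s*\S+"]}

    def sanitize(chunks, role):
        # Runs during retrieval, before prompt assembly, so the generator
        # never sees the restricted spans at all.
        patterns = [re.compile(p) for p in POLICY.get(role, [])]
        out = []
        for chunk in chunks:
            for pat in patterns:
                chunk = pat.sub("[REDACTED]", chunk)
            out.append(chunk)
        return out

    retrieved = ["Jane Doe, 123-45-6789, salary: 90k, leads the project."]
    print(sanitize(retrieved, role="public"))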

--------------------------------------------------------------------------------------------------------

Spectral Characterization and Mitigation of Sequential Knowledge Editing Collapse

Large language models require periodic updates as knowledge evolves, but sequential editing often catastrophically degrades general capabilities. While heuristic constraints partially mitigate this, the fundamental mechanisms causing collapse remain poorly understood. This research provides spectral analysis revealing that general abilities correlate with dominant singular directions in pretrained weight matrices—directions progressively disrupted by repeated edits. The REVIVE framework addresses this by explicitly preserving these critical subspaces during updates, representing changes in spectral basis and filtering interfering components. Tested with up to 20,000 sequential edits across multiple models, REVIVE maintains both editing efficacy and general performance. Applications include maintaining AI assistants with current information, corporate knowledge base updates, personalized model adaptation, and any scenario requiring continual learning without catastrophic forgetting.

Authors:  Chi Zhang, Mengqi Zhang, Xiaotian Ye, Runxi Cheng, Zisheng Zhou, Ying Zhou, Pengjie Ren, Zhumin Chen

Link:  https://arxiv.org/abs/2601.11042v1

Date: 2026-01-d

Summary:

Sequential knowledge editing in large language models often causes catastrophic collapse of the model's general abilities, especially for parameter-modifying methods. Existing approaches mitigate this issue through heuristic constraints on parameter updates, yet the mechanisms underlying such degradation remain insufficiently understood. In this work, we present a spectral analysis of sequential knowledge editing and show that a model's general abilities are closely associated with dominant singular directions of pretrained weight matrices. These directions are highly sensitive to perturbations and are progressively disrupted by repeated edits, closely tracking the collapse in both editing efficacy and general performance. Building on this insight, we propose REVIVE, a plug-and-play framework that stabilizes sequential editing by explicitly preserving the dominant singular subspace. REVIVE represents parameter updates in the spectral basis of the original weights and filters components that would interfere with the protected region. Extensive experiments across multiple models and benchmarks show that REVIVE consistently improves editing efficacy while substantially preserving general abilities under long-horizon sequential editing, including extreme settings with up to 20,000 edits.
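
A minimal reading of the core mechanism, protecting the top-r singular subspace of a weight matrix from an edit, fits in a few lines of NumPy. This is our schematic, not the REVIVE implementation:

    import numpy as np

    def protect_dominant_subspace(W, dW, r):
        # Express the edit against W's spectral basis and remove the part
        # that maps the top-r right-singular directions onto the top-r
        # left-singular directions, i.e., the protected region.
        U, S, Vt = np.linalg.svd(W, full_matrices=False)
        Ur, Vr = U[:, :r], Vt[:r, :].T
        return dW - Ur @ (Ur.T @ dW @ Vr) @ Vr.T

    rng = np.random.default_rng(0)
    W, dW = rng.normal(size=(64, 64)), 0.1 * rng.normal(size=(64, 64))
    dW_safe = protect_dominant_subspace(W, dW, r=8)
    U, _, Vt = np.linalg.svd(W, full_matrices=False)
    # The filtered edit no longer touches the protected subspace:
    print(np.abs(U[:, :8].T @ dW_safe @ Vt[:8, :].T).max())  # ~1e-16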

--------------------------------------------------------------------------------------------------------

What Matters in Data Curation for Multimodal Reasoning? Insights from the DCVLR Challenge

Multimodal AI systems combining vision and language require massive datasets, but optimal curation strategies remain unclear. The NeurIPS 2025 DCVLR challenge isolated dataset selection by fixing model architecture and training protocols, enabling systematic study of curation principles. The winning approach revealed that difficulty-based example selection on aligned base datasets drives performance gains, while common assumptions proved incorrect: increasing dataset size primarily reduced variance rather than improving mean accuracy, and diversity heuristics plus synthetic augmentation provided no benefit or even degraded performance. These findings characterize a saturation-regime evaluation where alignment and difficulty matter most. Applications include efficient training of vision-language models, curriculum design for AI systems, educational content sequencing, and resource-constrained model development.

Authors:  Yosub Shin, Michael Buriek, Boris Sobolev, Pavel Bushuyeu, Vikas Kumar, Haoyang Xu, Samuel Watson, Igor Molybog

Link:  https://arxiv.org/abs/2601.10922v1

Date: 2026-01-d

Summary:

We study data curation for multimodal reasoning through the NeurIPS 2025 Data Curation for Vision-Language Reasoning (DCVLR) challenge, which isolates dataset selection by fixing the model and training protocol. Using a compact curated dataset derived primarily from Walton Multimodal Cold Start, our submission placed first in the challenge. Through post-competition ablations, we show that difficulty-based example selection on an aligned base dataset is the dominant driver of performance gains. Increasing dataset size does not reliably improve mean accuracy under the fixed training recipe, but mainly reduces run-to-run variance, while commonly used diversity and synthetic augmentation heuristics provide no additional benefit and often degrade performance. These results characterize DCVLR as a saturation-regime evaluation and highlight the central role of alignment and difficulty in data-efficient multimodal reasoning.
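
Difficulty-based selection, the ingredient the ablations single out, reduces to a simple recipe: sample the fixed base model several times per candidate and keep the examples it most often gets wrong. The sketch below is our illustration, with model_answer standing in for whatever inference call is available:

    import random

    def model_answer(question):
        # Stand-in for sampling the fixed base model; replace with a real call.
        return random.choice(["A", "B", "C", "D"])

    def select_by_difficulty(examples, n_keep, n_samples=8):
        scored = []
        for ex in examples:
            wrong = sum(model_answer(ex["question"]) != ex["answer"]
                        for _ in range(n_samples))
            rate = wrong / n_samples
            if rate < 1.0:  # never-solved items are often mislabeled; drop them
                scored.append((rate, ex))
        scored.sort(key=lambda pair: pair[0], reverse=True)
        return [ex for _, ex in scored[:n_keep]]

    pool = [{"question": f"q{i}", "answer": "A"} for i in range(100)]
    print(len(select_by_difficulty(pool, n_keep=10)))  # the 10 hardest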

--------------------------------------------------------------------------------------------------------

Digital Metabolism: Decoupling Logic from Facts via Regenerative Unlearning -- Towards a Pure Neural Logic Core

Large language models conflate reasoning capabilities with factual knowledge within shared parameters, creating a "memory wall" where computational resources simulate retrieval rather than performing inference—often causing hallucinations. This research proposes "digital metabolism," hypothesizing that targeted forgetting can distill pure neural logic. The Regenerative Logic-Core Protocol uses gradient reversal to make specific factual dependencies undecodable while preserving reasoning structures. Applied to Qwen2.5-0.5B, the approach achieves near-zero retention of targeted facts while exhibiting phase transitions suggesting "structural crystallization." The metabolized model spontaneously adopts chain-of-thought scaffolding, compensating for lost associative recall. This points toward modular architectures separating logic (Neural CPU) from facts (Symbolic RAM), with applications in trustworthy AI, knowledge-grounded reasoning systems, and efficient model deployment.

Authors:  Mengmeng Peng, Zhenyu Fang, He Sun

Link:  https://arxiv.org/abs/2601.10810v1

Date: 2026-01-d

Summary:

Large language models (LLMs) currently suffer from parameter entanglement, where general reasoning capabilities (logic) and specific factual knowledge (facts) exist in a superposition state within shared weights. This coupling leads to the "memory wall," where computational capacity is squandered on simulating retrieval, often resulting in hallucinations. In this paper, we propose "digital metabolism," a thermodynamic hypothesis suggesting that targeted forgetting is necessary for distilling a pure neural logic core. To validate this hypothesis, we introduce the Regenerative Logic-Core Protocol (RLCP), a dual-stream training framework that renders specific factual dependencies linearly undecodable via deep-layer gradient reversal. Applying RLCP to Qwen2.5-0.5B, we observe a distinct phase transition: the model achieves near-zero retention of targeted factual associations (Accuracy < 7%) while exhibiting changes consistent with an emergent "structural crystallization" effect. Empirical analysis on GSM8K reveals that the "metabolized" model spontaneously adopts chain-of-thought (CoT) scaffolding, which we interpret as compensating for the loss of direct associative recall (shifting from O(1) recall to O(N) reasoning). While the causal mechanism underlying this behavioral shift requires further investigation, our findings provide a dynamic weight-level counterpart to architectural innovations like DeepSeek's Engram, paving the way for modular "Neural CPU + Symbolic RAM" architectures.
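
Gradient reversal itself is a standard trick and easy to state precisely; how RLCP wires it into deep layers is beyond this sketch. A minimal PyTorch version (ours) behaves as the identity on the forward pass and flips gradients on the way back, so a probe trained to decode a fact pushes the representation away from encoding it:

    import torch

    class GradReverse(torch.autograd.Function):
        @staticmethod
        def forward(ctx, x, lam):
            ctx.lam = lam
            return x.view_as(x)  # identity forward

        @staticmethod
        def backward(ctx, grad_out):
            return -ctx.lam * grad_out, None  # flipped, scaled gradient

    x = torch.randn(4, 16, requires_grad=True)
    h = GradReverse.apply(x, 1.0)
    h.sum().backward()
    print(x.grad[0, :4])  # gradients arrive sign-flipped: a tensor of -1s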

--------------------------------------------------------------------------------------------------------

SatMap: Revisiting Satellite Maps as Prior for Online HD Map Construction

Autonomous vehicles require high-definition maps for safe navigation, but camera-based online map construction suffers from limited depth perception and occlusion issues. SatMap integrates satellite imagery with multi-view camera observations, leveraging bird's-eye-view satellite data as a global prior that provides lane-level semantics and texture information. This fusion approach effectively mitigates depth ambiguity and occlusion challenges inherent in ground-level camera systems. On the nuScenes dataset, SatMap achieves 34.8% improvement over camera-only baselines and 8.5% over camera-LiDAR fusion. The method particularly excels in long-range scenarios and adverse weather conditions. Applications extend beyond autonomous driving to drone navigation, augmented reality systems for urban environments, infrastructure monitoring, emergency response route planning, and any robotics application requiring accurate spatial understanding.

Authors:  Kanak Mazumder, Fabian B. Flohr

Link:  https://arxiv.org/abs/2601.10512v1

Date: 2026-01-d

Summary:

Online high-definition (HD) map construction is an essential part of a safe and robust end-to-end autonomous driving (AD) pipeline. Onboard camera-based approaches suffer from limited depth perception and degraded accuracy due to occlusion. In this work, we propose SatMap, an online vectorized HD map estimation method that integrates satellite maps with multi-view camera observations and directly predicts a vectorized HD map for downstream prediction and planning modules. Our method leverages lane-level semantics and texture from satellite imagery captured from a Bird's Eye View (BEV) perspective as a global prior, effectively mitigating depth ambiguity and occlusion. In our experiments on the nuScenes dataset, SatMap achieves 34.8% mAP performance improvement over the camera-only baseline and 8.5% mAP improvement over the camera-LiDAR fusion baseline. Moreover, we evaluate our model in long-range and adverse weather conditions to demonstrate the advantages of using a satellite prior map. Source code will be available at https://iv.ee.hm.edu/satmap/.
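
The reason a satellite tile makes such a convenient prior is geometric: after view transformation, the camera features and the satellite image live on the same BEV grid, so the simplest possible fusion is channel concatenation. A toy PyTorch sketch; the shapes and module design are our assumptions, not SatMap's:

    import torch
    import torch.nn as nn

    class BEVFusion(nn.Module):
        def __init__(self, cam_ch=256, sat_ch=64):
            super().__init__()
            self.mix = nn.Conv2d(cam_ch + sat_ch, cam_ch, kernel_size=1)

        def forward(self, cam_bev, sat_bev):
            # cam_bev: (B, 256, H, W) lifted from multi-view cameras
            # sat_bev: (B, 64, H, W) encoded satellite tile (global prior)
            return self.mix(torch.cat([cam_bev, sat_bev], dim=1))

    fused = BEVFusion()(torch.randn(1, 256, 200, 100),
                        torch.randn(1, 64, 200, 100))
    print(fused.shape)  # torch.Size([1, 256, 200, 100])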

--------------------------------------------------------------------------------------------------------

Unifying Speech Recognition, Synthesis and Conversion with Autoregressive Transformers

Traditional speech systems employ separate models for text-to-speech, automatic speech recognition, and voice conversion, creating fragmented pipelines that limit efficiency and cross-task generalization. General-Purpose Audio (GPA) unifies these tasks within a single large language model architecture operating on shared discrete audio tokens. Through instruction-driven task induction, one autoregressive model flexibly performs multiple speech tasks without architectural modifications. The design combines joint multi-task training with scalable inference achieving high throughput, including a lightweight 0.3B-parameter variant for edge deployment. This unified approach demonstrates competitive performance across diverse speech tasks while maintaining practical deployment viability. Applications include virtual assistants, accessibility tools for individuals with disabilities, language learning platforms, voice cloning, and content creation tools.

Authors:  Runyuan Cai, Yu Lin, Yiming Wang, Chunlin Fu, Xiaodong Zeng

Link:  https://arxiv.org/abs/2601.10770v1

Date: 2026-01-d

Summary:

Traditional speech systems typically rely on separate, task-specific models for text-to-speech (TTS), automatic speech recognition (ASR), and voice conversion (VC), resulting in fragmented pipelines that limit scalability, efficiency, and cross-task generalization. In this paper, we present General-Purpose Audio (GPA), a unified audio foundation model that integrates multiple core speech tasks within a single large language model (LLM) architecture. GPA operates on a shared discrete audio token space and supports instruction-driven task induction, enabling a single autoregressive model to flexibly perform TTS, ASR, and VC without architectural modifications. This unified design combines a fully autoregressive formulation over discrete speech tokens, joint multi-task training across speech domains, and a scalable inference pipeline that achieves high concurrency and throughput. The resulting model family supports efficient multi-scale deployment, including a lightweight 0.3B-parameter variant optimized for edge and resource-constrained environments. Together, these design choices demonstrate that a unified autoregressive architecture can achieve competitive performance across diverse speech tasks while remaining viable for low-latency, practical deployment.
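
Instruction-driven task induction over a shared token space means every task is just a differently prefixed sequence for the same autoregressive model. The token names below are our placeholders, not GPA's actual vocabulary:

    def build_sequence(task, text_tokens=(), audio_tokens=()):
        # One stream for all tasks: an instruction prefix selects the task,
        # and text/audio are tokens in a shared discrete vocabulary.
        seq = [f"<task:{task}>"]
        if text_tokens:
            seq += ["<text>"] + list(text_tokens) + ["</text>"]
        if audio_tokens:
            seq += ["<audio>"] + [f"<a{t}>" for t in audio_tokens] + ["</audio>"]
        return seq + ["<generate>"]

    # TTS conditions on text and generates audio tokens; ASR is the same
    # model with the roles reversed. No architectural change, only a prefix.
    print(build_sequence("tts", text_tokens=["hello", "world"]))
    print(build_sequence("asr", audio_tokens=[17, 203, 42]))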

--------------------------------------------------------------------------------------------------------

DanQing: An Up-to-Date Large-Scale Chinese Vision-Language Pre-training Dataset

Vision-language models like CLIP excel at tasks from cross-modal retrieval to image captioning, enabled by extensive English datasets such as COYO-700M and LAION-400M. However, Chinese vision-language development lags due to scarce high-quality data. DanQing addresses this gap with 100 million carefully curated Chinese image-text pairs from Common Crawl, primarily using 2024-2025 web data to capture evolving semantic trends. The rigorous selection pipeline ensures superior data quality compared to existing datasets. Continual pre-training experiments with SigLIP2 demonstrate consistent performance improvements across Chinese zero-shot classification, cross-modal retrieval, and multimodal evaluations. Released under Creative Commons CC-BY 4.0, DanQing enables Chinese e-commerce search, educational applications, cultural heritage digitization, social media analysis, and advancement of Chinese multimodal AI research.

Authors:  Hengyu Shen, Tiancheng Gu, Bin Qin, Lan Wu, Yuling Wu, Shuo Tan, Zelong Sun, Jun Wang, Nan Wu, Xiang An, Weidong Cai, Ziyong Feng, Kaicheng Yang

Link:  https://arxiv.org/abs/2601.10305v1

Date: 2026-01-d

Summary:

Vision-Language Pre-training (VLP) models demonstrate strong performance across various downstream tasks by learning from large-scale image-text pairs through contrastive pretraining. The release of extensive English image-text datasets (e.g., COYO-700M and LAION-400M) has enabled widespread adoption of models such as CLIP and SigLIP in tasks including cross-modal retrieval and image captioning. However, the advancement of Chinese vision-language pretraining has substantially lagged behind, due to the scarcity of high-quality Chinese image-text data. To address this gap, we develop a comprehensive pipeline for constructing a high-quality Chinese cross-modal dataset. As a result, we propose DanQing, which contains 100 million image-text pairs collected from Common Crawl. Different from existing datasets, DanQing is curated through a more rigorous selection process, yielding superior data quality. Moreover, DanQing is primarily built from 2024-2025 web data, enabling models to better capture evolving semantic trends and thus offering greater practical utility. We compare DanQing with existing datasets by continual pre-training of the SigLIP2 model. Experimental results show that DanQing consistently achieves superior performance across a range of Chinese downstream tasks, including zero-shot classification, cross-modal retrieval, and LMM-based evaluations. To facilitate further research in Chinese vision-language pre-training, we will open-source the DanQing dataset under the Creative Commons CC-BY 4.0 license.

--------------------------------------------------------------------------------------------------------

Towards Native Intelligence: 6G-LLM Trained with Reinforcement Learning from NDT Feedback

Next-generation 6G networks require intelligent orchestration that understands application requirements and communication system capabilities. Current rule-based approaches rely on modular, experience-driven optimization lacking adaptability. The 6G-LLM offers network native intelligence but faces limitations: dependence on scarce, meticulously curated training data and the inability to continually self-improve. The RLDTF framework addresses these limitations by leveraging network digital twins to generate reward signals based on orchestration outcomes, using reinforcement learning to guide optimal decision-making dynamically. A weighted token mechanism improves output accuracy. Experimental results demonstrate significant improvements over state-of-the-art baselines in orchestration accuracy and solution optimality. Applications include adaptive network management, quality-of-service optimization, resource allocation in cloud computing, smart city infrastructure, and autonomous network operations.

Authors:  Zhuoran Xiao, Tao Tao, Chenhui Ye, Yunbo Hu, Yijia Feng, Tianyu Jiao, Liyu Cai

Link:  https://arxiv.org/abs/2601.09992v1

Date: 2026-01-d

Summary:

Owing to its comprehensive understanding of upper-layer application requirements and the capabilities of practical communication systems, the 6G-LLM (6G domain large language model) offers a promising pathway toward realizing network native intelligence. Serving as the system orchestrator, the 6G-LLM drives a paradigm shift that fundamentally departs from existing rule-based approaches, which primarily rely on modular, experience-driven optimization. By contrast, the 6G-LLM substantially enhances network flexibility and adaptability. Nevertheless, current efforts to construct 6G-LLMs are constrained by their reliance on large-scale, meticulously curated, human-authored corpora, which are impractical to obtain in real-world scenarios. Moreover, purely offline-trained models lack the capacity for continual self-improvement, limiting their ability to adapt to the highly dynamic requirements of wireless communication environments. To overcome these limitations, we propose a novel training paradigm termed RLDTF (Reinforcement Learning from Digital Twin Feedback) for 6G-LLMs. This framework leverages network digital twins to generate reward signals based on orchestration outcomes, while employing reinforcement learning to guide the model toward optimal decision-making dynamically. Furthermore, we introduce a weighted token mechanism to improve output accuracy. Comprehensive experimental results demonstrate that our proposed framework significantly outperforms state-of-the-art baselines in orchestration accuracy and solution optimality.

--------------------------------------------------------------------------------------------------------

What Do LLM Agents Know About Their World? Task2Quiz: A Paradigm for Studying Environment Understanding

LLM agents demonstrate impressive performance in decision-making and tool-use tasks, yet their ability to generalize across varying environments remains inadequately examined. Current evaluations focus on trajectory-based metrics measuring task success without assessing whether agents possess transferable environmental understanding. Task-to-Quiz (T2Q) introduces a paradigm decoupling task execution from world-state comprehension. T2QBench provides 30 environments with 1,967 grounded question-answer pairs across difficulty levels. Experiments reveal task success poorly predicts environment understanding, and current memory mechanisms fail to help agents acquire grounded environmental models. Findings identify proactive exploration and fine-grained state representation as critical bottlenecks. Applications include developing more robust AI assistants, autonomous robots, game-playing agents, and understanding fundamental challenges in agent-based learning systems.

Authors:  Siyuan Liu, Hongbang Yuan, Xinze Li, Ziyue Zhu, Yixin Cao, Yu-Gang Jiang

Link:  https://arxiv.org/abs/2601.09503v1

Date: 2026-01-d

Summary:

Large language model (LLM) agents have demonstrated remarkable capabilities in complex decision-making and tool-use tasks, yet their ability to generalize across varying environments remains an under-examined concern. Current evaluation paradigms predominantly rely on trajectory-based metrics that measure task success, while failing to assess whether agents possess a grounded, transferable model of the environment. To address this gap, we propose Task-to-Quiz (T2Q), a deterministic and automated evaluation paradigm designed to decouple task execution from world-state understanding. We instantiate this paradigm in T2QBench, a suite comprising 30 environments and 1,967 grounded QA pairs across multiple difficulty levels. Our extensive experiments reveal that task success is often a poor proxy for environment understanding, and that current memory mechanisms cannot effectively help agents acquire a grounded model of the environment. These findings identify proactive exploration and fine-grained state representation as primary bottlenecks, offering a robust foundation for developing more generalizable autonomous agents.
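
The decoupling is easy to picture: after the agent acts, grade it not on task completion but on factual questions about the resulting world state, answered against the simulator's ground truth. A toy version, with state and question formats invented for illustration:

    def quiz_score(env_state, agent_answers):
        # Generate grounded QA pairs from the simulator's true state and
        # grade the agent's beliefs against them.
        questions = {f"where is the {obj}?": loc
                     for obj, loc in env_state.items()}
        correct = sum(agent_answers.get(q, "").lower() == a.lower()
                      for q, a in questions.items())
        return correct / len(questions)

    state = {"mug": "cabinet", "apple": "fridge", "key": "drawer"}
    answers = {"where is the mug?": "cabinet",
               "where is the apple?": "counter"}
    print(quiz_score(state, answers))  # ~0.33, even if the task "succeeded"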

--------------------------------------------------------------------------------------------------------

Bridging Semantic Understanding and Popularity Bias with LLMs

Recommender systems frequently exhibit popularity bias, favoring mainstream items at the expense of niche content, yet most debiasing methods treat this superficially through diversity enhancement or long-tail coverage. These approaches neglect deeper semantic understanding of bias's causal origins, limiting both debiasing effectiveness and recommendation accuracy. FairLRM addresses this gap by decomposing popularity bias into item-side and user-side components, using structured instruction-based prompts to enhance large language models' comprehension of global item distributions and individual preferences. Unlike traditional methods relying on surface features, FairLRM semantically interprets underlying bias mechanisms. Empirical evaluation demonstrates significant improvements in both fairness and accuracy. Applications include music and video streaming platforms, e-commerce product recommendations, news aggregation, content discovery systems, and creating more equitable digital marketplaces.

Authors:  Renqiang Luo, Dong Zhang, Yupeng Gao, Wen Shi, Mingliang Hou, Jiaying Liu, Zhe Wang, Shuo Yu

Link:  https://arxiv.org/abs/2601.09478v2

Date: 2026-01-d

Summary:

Semantic understanding of popularity bias is a crucial yet underexplored challenge in recommender systems, where popular items are often favored at the expense of niche content. Most existing debiasing methods treat the semantic understanding of popularity bias as a matter of diversity enhancement or long-tail coverage, neglecting the deeper semantic layer that embodies the causal origins of the bias itself. Consequently, such shallow interpretations limit both their debiasing effectiveness and recommendation accuracy. In this paper, we propose FairLRM, a novel framework that bridges the gap in the semantic understanding of popularity bias with Recommendation via Large Language Model (RecLLM). FairLRM decomposes popularity bias into item-side and user-side components, using structured instruction-based prompts to enhance the model's comprehension of both global item distributions and individual user preferences. Unlike traditional methods that rely on surface-level features such as "diversity" or "debiasing", FairLRM improves the model's ability to semantically interpret and address the underlying bias. Through empirical evaluation, we show that FairLRM significantly enhances both fairness and recommendation accuracy, providing a more semantically aware and trustworthy approach to enhance the semantic understanding of popularity bias. The implementation is available at https://github.com/LuoRenqiang/FairLRM.
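
The structured-prompt idea can be illustrated in a few lines: surface the item-side signal (global popularity) and the user-side signal (the user's own niche tendency) explicitly, so the model can reason about the bias instead of merely seeing IDs. Field names and wording below are ours, not the paper's:

    def build_prompt(user_history, candidates, popularity):
        lines = [
            "You are a recommender. Correct for popularity bias explicitly.",
            f"User history (item, popularity percentile): {user_history}",
            "Candidates:",
        ]
        for item in candidates:
            lines.append(f"- {item}: seen by {popularity[item]:.0%} of users")
        lines.append("Rank candidates by fit to this user's tastes, not by "
                     "global popularity, and justify briefly.")
        return "\n".join(lines)

    print(build_prompt(
        user_history=[("obscure_jazz_LP", 8), ("indie_film_OST", 12)],
        candidates=["chart_hit", "bedroom_pop_EP"],
        popularity={"chart_hit": 0.61, "bedroom_pop_EP": 0.03}))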

--------------------------------------------------------------------------------------------------------

Improving Symbolic Translation of Language Models for Logical Reasoning

Translating natural language into first-order logic enables verifiable, reliable reasoning when paired with external solvers. However, smaller language models struggle with this translation, producing formatting and translation errors that undermine system reliability. Existing self-iteration approaches depend heavily on underlying model capabilities. This research categorizes common errors and fine-tunes smaller models using data synthesized by large language models. Incremental inference divides the process into predicate generation and FOL translation stages, providing greater control and quality improvement. A verification module specifically targets predicate-arity errors. Comprehensive evaluation across three model families and four logical-reasoning datasets shows reduced error rates, increased predicate coverage, and improved reasoning performance. Applications include legal reasoning systems, automated theorem proving, educational tools for logic instruction, and making reliable symbolic reasoning accessible.

Authors:  Ramya Keerthy Thatikonda, Jiuzhou Han, Wray Buntine, Ehsan Shareghi

Link:  https://arxiv.org/abs/2601.09446v1

Date: 2026-01-d

Summary:

The use of formal language for deductive logical reasoning aligns well with language models (LMs), where translating natural language (NL) into first-order logic (FOL) and employing an external solver results in a verifiable and therefore reliable reasoning system. However, smaller LMs often struggle with this translation task, frequently producing incorrect symbolic outputs due to formatting and translation errors. Existing approaches typically rely on self-iteration to correct these errors, but such methods depend heavily on the capabilities of the underlying model. To address this, we first categorize common errors and fine-tune smaller LMs using data synthesized by large language models. The evaluation is performed using the defined error categories. We introduce incremental inference, which divides inference into two stages, predicate generation and FOL translation, providing greater control over model behavior and enhancing generation quality as measured by predicate metrics. This decomposition framework also enables the use of a verification module that targets predicate-arity errors to further improve performance. Our study evaluates three families of models across four logical-reasoning datasets. The comprehensive fine-tuning, incremental inference, and verification modules reduce error rates, increase predicate coverage, and improve reasoning performance for smaller LMs, moving us closer to developing reliable and accessible symbolic-reasoning systems.
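
Of the error categories, predicate-arity mismatches are the most mechanical to verify, which is why a dedicated module pays off. A toy checker under our own string conventions (the paper's formats may differ):

    import re

    def arity_errors(declared, formulas):
        # declared: predicate name -> expected arity, as produced by the
        # predicate-generation stage. Flags undeclared predicates and
        # arity mismatches in the FOL-translation stage's output.
        call = re.compile(r"([A-Z]\w*)\(([^()]*)\)")
        errors = []
        for f in formulas:
            for name, args in call.findall(f):
                n = len([a for a in args.split(",") if a.strip()])
                if name not in declared:
                    errors.append(f"{name}/{n}: undeclared, in '{f}'")
                elif declared[name] != n:
                    errors.append(f"{name}: got {n} args, expected "
                                  f"{declared[name]}, in '{f}'")
        return errors

    preds = {"Parent": 2, "Human": 1}
    fols = ["forall x (Human(x) -> Parent(x))", "Parent(alice, bob)"]
    print(arity_errors(preds, fols))  # flags Parent used with 1 argument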

--------------------------------------------------------------------------------------------------------

Ability Transfer and Recovery via Modularized Parameters Localization

Continual pre-training and fine-tuning improve large language models in specific domains or languages, but specialization often degrades other capabilities through catastrophic forgetting. This research investigates how abilities distribute within LLM parameters by analyzing module activations under domain and language-specific inputs. Findings reveal ability-related activations concentrate in remarkably small channel sets (typically under 5%) that are largely disentangled with good sufficiency and stability. ACT (Activation-Guided Channel-wise Ability Transfer) localizes ability-relevant channels via activation differences, selectively transferring only corresponding parameters followed by lightweight compatibility fine-tuning. Experiments on multilingual mathematical and scientific reasoning demonstrate successful recovery of forgotten abilities while preserving retained skills. Applications include efficient model merging, personalized language models, cross-lingual transfer, and mitigating catastrophic forgetting.

Authors:  Songyao Jin, Kun Zhou, Wenqi Li, Peng Wang, Biwei Huang

Link:  https://arxiv.org/abs/2601.09398v1

Date: 2026-01-d

Summary:

Large language models can be continually pre-trained or fine-tuned to improve performance in specific domains, languages, or skills, but this specialization often degrades other capabilities and may cause catastrophic forgetting. We investigate how abilities are distributed within LLM parameters by analyzing module activations under domain- and language-specific inputs for closely related models. Across layers and modules, we find that ability-related activations are highly concentrated in a small set of channels (typically <5%), and these channels are largely disentangled with good sufficiency and stability. Building on these observations, we propose ACT (Activation-Guided Channel-wise Ability Transfer), which localizes ability-relevant channels via activation differences and selectively transfers only the corresponding parameters, followed by lightweight fine-tuning for compatibility. Experiments on multilingual mathematical and scientific reasoning show that ACT can recover forgotten abilities while preserving retained skills. It can also merge multiple specialized models to integrate several abilities into a single model with minimal interference. Our code and data will be publicly released.
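
A schematic of the localization step, ours rather than the released code: rank channels by mean absolute activation difference between the base and specialized models on ability-probe inputs, then copy only the parameters feeding the top few percent.

    import numpy as np

    def localize_channels(acts_base, acts_spec, top_frac=0.05):
        # acts_*: (n_inputs, n_channels) activations from the same module
        # in the base and specialized models.
        diff = np.abs(acts_spec - acts_base).mean(axis=0)
        k = max(1, int(top_frac * diff.size))
        return np.argsort(diff)[::-1][:k]

    def transfer_rows(W_target, W_source, channels):
        # Copy only parameters feeding the localized channels; lightweight
        # compatibility fine-tuning would follow.
        W_new = W_target.copy()
        W_new[channels, :] = W_source[channels, :]
        return W_new

    rng = np.random.default_rng(1)
    base = rng.normal(size=(128, 512))
    spec = base.copy()
    spec[:, :20] += 2.0  # pretend 20 channels carry the specialized skill
    print(localize_channels(base, spec))  # the planted channels rank first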

--------------------------------------------------------------------------------------------------------

Navigating Ethical AI Challenges in the Industrial Sector: Balancing Innovation and Responsibility

AI integration into industrial sectors drives innovation while expanding ethical considerations, necessitating reevaluation of governing principles. AI-empowered industrial innovation inherently intersects with ethics as advancements introduce challenges related to transparency, accountability, and fairness. This chapter examines ethical aspects of AI in industrial use cases and associated factors including research practices and data sharing. It emphasizes embedding ethical principles into industrial AI systems, demonstrating how this fosters technological breakthroughs and stakeholder trust. The work provides actionable insights guiding industrial research and development toward futures where AI enables ethical, responsible progress and inclusive industrial ecosystems. Applications include manufacturing automation, supply chain management, quality control systems, predictive maintenance, industrial robotics, and establishing ethical frameworks for emerging industrial technologies.

Authors:  Ruomu Tan, Martin W Hoffmann

Link:  https://arxiv.org/abs/2601.09351v1

Date: 2026-01-d

Summary:

The integration of artificial intelligence (AI) into the industrial sector has not only driven innovation but also expanded the ethical landscape, necessitating a reevaluation of the principles governing the technology and its applications, as well as greater awareness in the research and development of industrial AI solutions. This chapter explores how AI-empowered industrial innovation inherently intersects with ethics, as advancements in AI introduce new challenges related to transparency, accountability, and fairness. We then examine the ethical aspects of several examples of AI in industrial use cases and associated factors such as ethical practices in the research and development process and data sharing. As ethical industrial AI solutions progress, we emphasize the importance of embedding ethical principles into industrial AI systems and their potential to inspire technological breakthroughs and foster trust among stakeholders. The chapter also offers actionable insights to guide industrial research and development toward a future where AI serves as an enabler of ethical and responsible industrial progress as well as a more inclusive industrial ecosystem.

--------------------------------------------------------------------------------------------------------

Policy-Based Reinforcement Learning with Action Masking for Dynamic Job Shop Scheduling under Uncertainty: Handling Random Arrivals and Machine Failures

Dynamic Job Shop Scheduling Problems under uncertainty—characterized by stochastic job arrivals and unexpected machine breakdowns—pose significant challenges for manufacturing efficiency. This framework employs Coloured Timed Petri Nets for environment representation and Maskable Proximal Policy Optimization for dynamic decision-making restricted to feasible actions. Job arrivals follow Gamma distributions capturing bursts and fluctuating workloads; machine failures use Weibull distributions representing age-dependent degradation. Two action-masking strategies are studied: non-gradient probability override and gradient-based invalid action penalization. Extensive experiments on dynamic benchmarks demonstrate consistent outperformance of traditional heuristic and rule-based approaches in makespan minimization. The combination of interpretable Petri-net models with adaptive reinforcement learning yields resilient, scalable, explainable frameworks. Applications include smart manufacturing, production planning, supply chain optimization, and real-time scheduling in dynamic industrial environments.

Authors:  Sofiene Lassoued, Stefan Lier, Andreas Schwung

Link:  https://arxiv.org/abs/2601.09293v1

Date: 2026-01-d

Summary:

We present a novel framework for solving Dynamic Job Shop Scheduling Problems under uncertainty, addressing the challenges introduced by stochastic job arrivals and unexpected machine breakdowns. Our approach follows a model-based paradigm, using Coloured Timed Petri Nets to represent the scheduling environment, and Maskable Proximal Policy Optimization to enable dynamic decision-making while restricting the agent to feasible actions at each decision point. To simulate realistic industrial conditions, dynamic job arrivals are modeled using a Gamma distribution, which captures complex temporal patterns such as bursts, clustering, and fluctuating workloads. Machine failures are modeled using a Weibull distribution to represent age-dependent degradation and wear-out dynamics. These stochastic models enable the framework to reflect real-world manufacturing scenarios better. In addition, we study two action-masking strategies: a non-gradient approach that overrides the probabilities of invalid actions, and a gradient-based approach that assigns negative gradients to invalid actions within the policy network. We conduct extensive experiments on dynamic JSSP benchmarks, demonstrating that our method consistently outperforms traditional heuristic and rule-based approaches in terms of makespan minimization. The results highlight the strength of combining interpretable Petri-net-based models with adaptive reinforcement learning policies, yielding a resilient, scalable, and explainable framework for real-time scheduling in dynamic and uncertain manufacturing environments.
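
The non-gradient masking variant is the easier of the two to state: set the logits of infeasible actions to minus infinity before the softmax, so the policy can only sample feasible operations (for example, jobs whose machine is currently up). A toy illustration, not the paper's MaskablePPO configuration:

    import numpy as np

    def masked_policy(logits, valid):
        # Override invalid actions' probabilities by sending their logits
        # to -inf; the softmax then assigns them exactly zero mass.
        masked = np.where(valid, logits, -np.inf)
        z = np.exp(masked - masked.max())
        return z / z.sum()

    logits = np.array([1.2, 0.3, -0.5, 2.0])
    valid = np.array([True, True, False, False])  # machine for 2-3 just failed
    print(masked_policy(logits, valid))  # mass only on feasible actions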

--------------------------------------------------------------------------------------------------------

KTCF: Actionable Recourse in Knowledge Tracing via Counterfactual Explanations for Education

Knowledge Tracing models student learning for adaptive education, offering superior performance and application potential. However, connecting these AI systems to practical educational interventions requires explainability. This work conceptualizes counterfactual explanations as the bridge from XAI to education, offering actionable recourse that is inherently causal, local, and understandable to non-expert stakeholders. KTCF generates counterfactual explanations accounting for knowledge concept relationships, with post-processing converting explanations into sequences of educational instructions. Experiments on large-scale educational datasets demonstrate 5.7% to 34% improvements over existing methods across metrics, with qualitative evaluation showing educational instructions reduce study burden. The work demonstrates counterfactuals' potential for responsible, practical AI in education. Applications include personalized learning platforms, intelligent tutoring systems, curriculum design, student intervention programs, and educational assessment tools.

Authors:  Woojin Kim, Changkwon Lee, Hyeoncheol Kim

Link:  https://arxiv.org/abs/2601.09156v1

Date: 2026-01-d

Summary:

Using Artificial Intelligence to improve teaching and learning brings greater adaptivity and scalability to education. Knowledge Tracing (KT) is widely recognized for the student modeling task due to its superior performance and application potential in education. To this end, we conceptualize and investigate counterfactual explanation as the connection from XAI for KT to education. Counterfactual explanations offer actionable recourse, are inherently causal and local, and are easy to understand for educational stakeholders, who are often non-experts. We propose KTCF, a counterfactual explanation generation method for KT that accounts for knowledge concept relationships, and a post-processing scheme that converts a counterfactual explanation into a sequence of educational instructions. We experiment on a large-scale educational dataset and show that our KTCF method achieves superior and robust performance over existing methods, with improvements ranging from 5.7% to 34% across metrics. Additionally, we provide a qualitative evaluation of our post-processing scheme, demonstrating that the resulting educational instructions help reduce study burden. We show that counterfactuals have the potential to advance the responsible and practical use of AI in education. Future work on XAI for KT may benefit from educationally grounded conceptualization and stakeholder-centered methods.
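
To show what "actionable recourse" means here, a greedy counterfactual toy (our construction, with a linear knowledge model standing in for a trained KT network): repeatedly recommend practicing the concept that most raises the predicted probability of answering the target question correctly, until the prediction flips.

    import numpy as np

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    def counterfactual_plan(mastery, weights, gain=0.5, threshold=0.5):
        # weights: concept relevance to the target question; practicing
        # concept c adds `gain` to mastery[c]. The plan is the recourse.
        m, plan = mastery.copy(), []
        while sigmoid(weights @ m) < threshold and len(plan) < 10:
            lifts = [sigmoid(weights @ (m + gain * np.eye(len(m))[c]))
                     for c in range(len(m))]
            best = int(np.argmax(lifts))
            m += gain * np.eye(len(m))[best]
            plan.append(best)
        return plan, sigmoid(weights @ m)

    weights = np.array([1.5, 0.5, 1.0])    # concept relevance to the question
    mastery = np.array([-0.8, 0.2, -0.4])  # current student state
    plan, p = counterfactual_plan(mastery, weights)
    print(plan, round(p, 2))  # [0, 0] 0.5: practice concept 0 twice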

--------------------------------------------------------------------------------------------------------

SkinFlow: Efficient Information Transmission for Open Dermatological Diagnosis via Dynamic Visual Encoding and Staged RL

General-purpose Large Vision-Language Models struggle in dermatology due to "diffuse attention"—inability to distinguish subtle pathological lesions from background. This research challenges assumptions that parameter scaling alone achieves medical precision. SkinFlow treats diagnosis as optimizing visual information transmission efficiency, using a Virtual-Width Dynamic Vision Encoder to "unfold" pathological manifolds without physical parameter expansion. Two-stage Reinforcement Learning sequentially aligns explicit medical descriptions and reconstructs implicit diagnostic textures within constrained semantic space. A clinically grounded evaluation protocol prioritizes diagnostic safety and hierarchical relevance over rigid label matching. The 7B model achieves state-of-the-art on Fitzpatrick17k: +12.06% Top-1 accuracy and +28.57% Top-6 accuracy over massive general-purpose models. Applications include telemedicine dermatology consultations, skin cancer screening, medical education, clinical decision support, and accessible dermatological care.

Authors:  Lijun Liu, Linwei Chen, Zhishou Zhang, Meng Tian, Hengfu Cui, Ruiyang Li, Zhaocheng Liu, Qiang Ju, Qianxi Li, Hong-Yu Zhou

Link:  https://arxiv.org/abs/2601.09136v1

Date: 2026-01-d

Summary:

General-purpose Large Vision-Language Models (LVLMs), despite their massive scale, often falter in dermatology due to "diffuse attention" - the inability to disentangle subtle pathological lesions from background noise. In this paper, we challenge the assumption that parameter scaling is the only path to medical precision. We introduce SkinFlow, a framework that treats diagnosis as an optimization of visual information transmission efficiency. Our approach utilizes a Virtual-Width Dynamic Vision Encoder (DVE) to "unfold" complex pathological manifolds without physical parameter expansion, coupled with a two-stage Reinforcement Learning strategy. This strategy sequentially aligns explicit medical descriptions (Stage I) and reconstructs implicit diagnostic textures (Stage II) within a constrained semantic space. Furthermore, we propose a clinically grounded evaluation protocol that prioritizes diagnostic safety and hierarchical relevance over rigid label matching. Empirical results are compelling: our 7B model establishes a new state-of-the-art on the Fitzpatrick17k benchmark, achieving a +12.06% gain in Top-1 accuracy and a +28.57% boost in Top-6 accuracy over the massive general-purpose models (e.g., Qwen3VL-235B and GPT-5.2). These findings demonstrate that optimizing geometric capacity and information flow yields superior diagnostic reasoning compared to raw parameter scaling.

--------------------------------------------------------------------------------------------------------

