Week Ending 9.21.2025
RESEARCH WATCH: 9.21.2025
Machine learning systems often struggle with generalization because they focus solely on task-relevant information, missing potentially useful knowledge for future applications. This research introduces the concept of latent learning—acquiring information not immediately relevant but valuable later—inspired by cognitive science principles. The authors demonstrate how episodic memory systems can address this limitation, showing applications from language modeling's reversal curse to navigation tasks. By incorporating oracle retrieval mechanisms, systems can flexibly reuse past experiences for better generalization. This work has significant implications for developing more data-efficient AI systems that mirror natural intelligence, potentially revolutionizing how we design learning algorithms across domains from robotics to natural language processing.
Authors: Andrew Kyle Lampinen, Martin Engelcke, Yuxuan Li, Arslan Chaudhry, James L. McClelland
Link: https://arxiv.org/abs/2509.16189v1
Date: 2025-09-d
Summary:
When do machine learning systems fail to generalize, and what mechanisms could improve their generalization? Here, we draw inspiration from cognitive science to argue that one weakness of machine learning systems is their failure to exhibit latent learning -- learning information that is not relevant to the task at hand, but that might be useful in a future task. We show how this perspective links failures ranging from the reversal curse in language modeling to new findings on agent-based navigation. We then highlight how cognitive science points to episodic memory as a potential part of the solution to these issues. Correspondingly, we show that a system with an oracle retrieval mechanism can use learning experiences more flexibly to generalize better across many of these challenges. We also identify some of the essential components for effectively using retrieval, including the importance of within-example in-context learning for acquiring the ability to use information across retrieved examples. In summary, our results illustrate one possible contributor to the relative data inefficiency of current machine learning systems compared to natural intelligence, and help to understand how retrieval methods can complement parametric learning to improve generalization.
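To make the retrieval idea concrete, here is a minimal Python sketch of an oracle-retrieval baseline in the spirit of the paper: at evaluation time, the oracle surfaces stored experiences that share entities with the query so the model can use them in context. The `Experience` structure, the entity-based lookup, and the `model` callable are illustrative assumptions, not the authors' actual interfaces.

```python
# Minimal sketch of an oracle-retrieval baseline: at evaluation time the
# "oracle" hands the model exactly the past experiences that contain the
# needed (but previously task-irrelevant) information, so the model can use
# them in context instead of relying on parametric weights alone.
# All names here are illustrative, not the paper's actual interfaces.

from dataclasses import dataclass

@dataclass
class Experience:
    text: str          # raw training example, e.g. "A is the parent of B"
    entities: set      # entities mentioned, used as the oracle's lookup key

def oracle_retrieve(query_entities, memory):
    """Return stored experiences sharing an entity with the query."""
    return [e for e in memory if e.entities & query_entities]

def answer_with_retrieval(model, question, query_entities, memory):
    # Prepend retrieved experiences so within-example in-context learning
    # can combine them with the question (cf. the reversal-curse setting).
    retrieved = oracle_retrieve(query_entities, memory)
    context = "\n".join(e.text for e in retrieved)
    return model(f"{context}\n\nQ: {question}\nA:")

# Usage idea: memory holds "B is the child of A"; asking "Who is A's child?"
# succeeds only because the relevant experience is retrieved into context.
```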
--------------------------------------------------------------------------------------------------------
BEFT: Bias-Efficient Fine-Tuning of Language Models
Parameter-efficient fine-tuning has become crucial for adapting large language models while maintaining computational efficiency. Among various techniques, fine-tuning only bias terms offers exceptional parameter efficiency, particularly in low-data scenarios. However, the relationship between different bias components (query, key, value projections) and performance remains unclear. This research introduces BEFT, a systematic approach for selecting optimal bias terms for fine-tuning. Through extensive evaluation across models ranging from 110M to 6.7B parameters, the method demonstrates superior performance on classification, multiple-choice, and generation tasks. This advancement could significantly reduce computational costs for model adaptation, making sophisticated AI more accessible to organizations with limited resources while maintaining competitive performance across diverse applications.
Authors: Baichuan Huang, Ananth Balashankar, Amir Aminifar
Link: https://arxiv.org/abs/2509.15974v1
Date: 2025-09-d
Summary:
Fine-tuning all-bias-terms stands out among various parameter-efficient fine-tuning (PEFT) techniques, owing to its out-of-the-box usability and competitive performance, especially in low-data regimes. Bias-only fine-tuning has the potential for unprecedented parameter efficiency. However, the link between fine-tuning different bias terms (i.e., bias terms in the query, key, or value projections) and downstream performance remains unclear. The existing approaches, e.g., based on the magnitude of bias change or empirical Fisher information, provide limited guidance for selecting the particular bias term for effective fine-tuning. In this paper, we propose an approach for selecting the bias term to be fine-tuned, forming the foundation of our bias-efficient fine-tuning (BEFT). We extensively evaluate our bias-efficient approach against other bias-selection approaches, across a wide range of large language models (LLMs) spanning encoder-only and decoder-only architectures from 110M to 6.7B parameters. Our results demonstrate the effectiveness and superiority of our bias-efficient approach on diverse downstream tasks, including classification, multiple-choice, and generation tasks.
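A hedged sketch of what bias-only fine-tuning looks like in practice: freeze every weight and unfreeze only the bias vectors of one chosen projection. The name-matching heuristic and the toy module below are assumptions for illustration; BEFT's actual bias-selection criterion is the paper's contribution and is not reproduced here.

```python
# Sketch of bias-only fine-tuning in PyTorch: freeze every parameter and
# unfreeze only the bias vectors of one chosen projection (here "value"),
# mirroring the idea of selecting which bias terms (query/key/value) to tune.
# Real checkpoints name their projections differently (e.g. "v_proj").

import torch
from torch import nn

def mark_bias_trainable(model: nn.Module, projection: str = "value") -> int:
    """Freeze all parameters except bias terms whose name mentions `projection`."""
    trainable = 0
    for name, param in model.named_parameters():
        param.requires_grad = False
        if name.endswith("bias") and projection in name:
            param.requires_grad = True
            trainable += param.numel()
    return trainable

# Example on a toy module with query/key/value projections.
class ToyAttention(nn.Module):
    def __init__(self, d=16):
        super().__init__()
        self.query = nn.Linear(d, d)
        self.key = nn.Linear(d, d)
        self.value = nn.Linear(d, d)

model = ToyAttention()
n = mark_bias_trainable(model, "value")
print(f"trainable bias parameters: {n}")   # 16 out of 816 total parameters
```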
--------------------------------------------------------------------------------------------------------
Structured Information for Improving Spatial Relationships in Text-to-Image Generation
Text-to-image generation has made remarkable progress, yet accurately capturing spatial relationships from natural language descriptions remains challenging. Current systems often struggle with complex spatial arrangements, limiting their practical applications. This work introduces a lightweight solution using tuple-based structured information to enhance spatial accuracy in generated images. By fine-tuning a language model to automatically convert prompts into structured representations, the approach seamlessly integrates into existing text-to-image pipelines. The method maintains overall image quality while significantly improving spatial relationships, with automatically generated tuples matching human-crafted quality. This advancement has broad applications in creative industries, architectural visualization, educational content creation, and any domain requiring precise spatial control in generated imagery, making AI art tools more reliable and practical.
Authors: Sander Schildermans, Chang Tian, Ying Jiao, Marie-Francine Moens
Link: https://arxiv.org/abs/2509.15962v1
Date: 2025-09-d
Summary:
Text-to-image (T2I) generation has advanced rapidly, yet faithfully capturing spatial relationships described in natural language prompts remains a major challenge. Prior efforts have addressed this issue through prompt optimization, spatially grounded generation, and semantic refinement. This work introduces a lightweight approach that augments prompts with tuple-based structured information, using a fine-tuned language model for automatic conversion and seamless integration into T2I pipelines. Experimental results demonstrate substantial improvements in spatial accuracy, without compromising overall image quality as measured by Inception Score. Furthermore, the automatically generated tuples exhibit quality comparable to human-crafted tuples. This structured information provides a practical and portable solution to enhance spatial relationships in T2I generation, addressing a key limitation of current large-scale generative systems.
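An illustrative sketch of tuple-based prompt augmentation: spatial (subject, relation, object) tuples are appended to the prompt before it enters the T2I pipeline. The paper fine-tunes a language model to perform the conversion automatically; the string-matching stub and serialization format below are stand-in assumptions.

```python
# Illustrative sketch: extract a (subject, relation, object) tuple from the
# prompt and append it as structured information before generation. The
# extraction here is a stub standing in for the fine-tuned language model,
# and the serialization format is an assumption, not the authors' schema.

def extract_spatial_tuples(prompt: str) -> list[tuple[str, str, str]]:
    """Stand-in for the fine-tuned LM mapping prompts to spatial tuples."""
    # e.g. "a cat to the left of a dog" -> [("cat", "left of", "dog")]
    if " to the left of " in prompt:
        subj, obj = prompt.split(" to the left of ", 1)
        return [(subj.replace("a ", "").strip(), "left of",
                 obj.replace("a ", "").strip())]
    return []

def augment_prompt(prompt: str) -> str:
    tuples = extract_spatial_tuples(prompt)
    structured = "; ".join(f"({s}, {r}, {o})" for s, r, o in tuples)
    return f"{prompt} | spatial: {structured}" if structured else prompt

print(augment_prompt("a cat to the left of a dog"))
# -> "a cat to the left of a dog | spatial: (cat, left of, dog)"
```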
--------------------------------------------------------------------------------------------------------
Monte Carlo Tree Diffusion with Multiple Experts for Protein Design
Protein design—generating amino acid sequences that fold into functional structures—represents one of biotechnology's most challenging problems. Traditional approaches combining language models with Monte Carlo Tree Search face limitations with long-range dependencies and vast search spaces. This research introduces MCTD-ME, which integrates masked diffusion models with tree search for multi-token planning and efficient exploration. The method employs multiple experts of varying capacities and uses biophysical-enhanced diffusion for rollouts, targeting low-confidence regions while preserving reliable residues. Superior performance on inverse folding benchmarks, especially for longer proteins, demonstrates significant advancement. Applications span drug discovery, enzyme engineering, therapeutic protein development, and synthetic biology, potentially accelerating the design of novel proteins for medical treatments, industrial processes, and sustainable biotechnology solutions.
Authors: Xuefeng Liu, Mingxuan Cao, Songhao Jiang, Xiao Luo, Xiaotian Duan, Mengdi Wang, Tobin R. Sosnick, Jinbo Xu, Rick Stevens
Link: https://arxiv.org/abs/2509.15796v1
Date: 2025-09-d
Summary:
The goal of protein design is to generate amino acid sequences that fold into functional structures with desired properties. Prior methods combining autoregressive language models with Monte Carlo Tree Search (MCTS) struggle with long-range dependencies and suffer from an impractically large search space. We propose MCTD-ME, Monte Carlo Tree Diffusion with Multiple Experts, which integrates masked diffusion models with tree search to enable multi-token planning and efficient exploration. Unlike autoregressive planners, MCTD-ME uses biophysical-fidelity-enhanced diffusion denoising as the rollout engine, jointly revising multiple positions and scaling to large sequence spaces. It further leverages experts of varying capacities to enrich exploration, guided by a pLDDT-based masking schedule that targets low-confidence regions while preserving reliable residues. We also propose a novel multi-expert selection rule (PH-UCT-ME) that extends predictive-entropy UCT to expert ensembles. On the inverse folding task (CAMEO and PDB benchmarks), MCTD-ME outperforms single-expert and unguided baselines in both sequence recovery (AAR) and structural similarity (scTM), with gains increasing for longer proteins and benefiting from multi-expert guidance. More generally, the framework is model-agnostic and applicable beyond inverse folding, including de novo protein engineering and multi-objective molecular generation.
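A rough sketch of what an entropy-aware, multi-expert UCT selection rule could look like. The exact PH-UCT-ME weighting is not given here, so the score below (exploitation plus an exploration bonus plus a predictive-entropy term, maximized over (child, expert) pairs) is an assumption made purely to illustrate the idea.

```python
# Rough sketch of multi-expert, entropy-aware UCT selection in the spirit of
# PH-UCT-ME: each (child, expert) pair is scored by exploitation plus an
# exploration bonus, with an extra term rewarding the expert's predictive
# entropy so uncertain regions get explored. The combination and constants
# are assumptions for illustration, not the paper's exact rule.

import math

def uct_me_score(q_value, child_visits, parent_visits,
                 predictive_entropy, c_explore=1.4, c_entropy=0.5):
    exploration = c_explore * math.sqrt(math.log(parent_visits + 1) /
                                        (child_visits + 1))
    return q_value + exploration + c_entropy * predictive_entropy

def select_child_and_expert(children, experts, parent_visits):
    """Pick the (child, expert) pair maximizing the UCT-ME score."""
    best, best_score = None, -math.inf
    for child in children:                      # candidate partial sequences
        for expert in experts:                  # diffusion experts of varying capacity
            score = uct_me_score(child["q"], child["visits"], parent_visits,
                                 expert["entropy"](child))
            if score > best_score:
                best, best_score = (child, expert), score
    return best
```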
--------------------------------------------------------------------------------------------------------
Ideal Registration? Segmentation is All You Need
Medical image registration—aligning images from different time points or modalities—traditionally relies on globally uniform smoothness constraints that inadequately handle anatomical motion's complex regional variations. This limitation particularly affects cardiac, abdominal, and lung imaging where different tissues move differently. SegReg introduces an innovative segmentation-driven approach that implements anatomically adaptive regularization by exploiting region-specific deformation patterns. The method decomposes images into coherent subregions, processes them through the same registration backbone, then integrates partial deformation fields globally. Achieving 98.23% Dice score on critical anatomies with ground-truth segmentation and 2-12% improvements with automatic segmentation, SegReg demonstrates that registration accuracy depends near-linearly on segmentation quality. Applications include surgical planning, treatment monitoring, disease progression tracking, and medical diagnostics across multiple imaging modalities.
Authors: Xiang Chen, Fengting Zhang, Qinghao Liu, Min Liu, Kun Wu, Yaonan Wang, Hang Zhang
Link: https://arxiv.org/abs/2509.15784v1
Date: 2025-09-d
Summary:
Deep learning has revolutionized image registration by its ability to handle diverse tasks while achieving significant speed advantages over conventional approaches. Current approaches, however, often employ globally uniform smoothness constraints that fail to accommodate the complex, regionally varying deformations characteristic of anatomical motion. To address this limitation, we propose SegReg, a Segmentation-driven Registration framework that implements anatomically adaptive regularization by exploiting region-specific deformation patterns. Our SegReg first decomposes input moving and fixed images into anatomically coherent subregions through segmentation. These localized domains are then processed by the same registration backbone to compute optimized partial deformation fields, which are subsequently integrated into a global deformation field. SegReg achieves near-perfect structural alignment (98.23% Dice on critical anatomies) using ground-truth segmentation, and outperforms existing methods by 2-12% across three clinical registration scenarios (cardiac, abdominal, and lung images) even with automatic segmentation. Our SegReg demonstrates a near-linear dependence of registration accuracy on segmentation quality, transforming the registration challenge into a segmentation problem. The source code will be released upon manuscript acceptance.
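A conceptual NumPy sketch of the decompose-register-merge pipeline: each segmented subregion is registered with the same backbone and the partial deformation fields are composed into a global field. The `register` callable and the mask-weighted merge are placeholders, not SegReg's actual integration scheme.

```python
# Conceptual sketch: segment both images into anatomical subregions, run the
# same registration backbone on each masked pair, and compose the partial
# deformation fields into one global field. `register` is a placeholder for
# any deformable registration backbone; the mask-weighted merge is an
# assumption about how fields are combined.

import numpy as np

def segreg(moving, fixed, masks, register):
    """masks: list of boolean arrays, one coherent subregion each."""
    h, w = fixed.shape
    global_field = np.zeros((h, w, 2))          # per-pixel (dy, dx) displacement
    weight = np.zeros((h, w))
    for mask in masks:
        # Restrict both images to the subregion and register them.
        partial_field = register(moving * mask, fixed * mask)   # (h, w, 2)
        global_field += partial_field * mask[..., None]
        weight += mask
    # Average where subregions overlap; untouched pixels stay zero.
    overlap = weight > 0
    global_field[overlap] /= weight[overlap][..., None]
    return global_field
```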
--------------------------------------------------------------------------------------------------------
Toward Efficient Influence Function: Dropout as a Compression Tool
Understanding how individual training data points affect machine learning model performance is crucial for transparency, debugging, and data selection. Influence functions provide this theoretical framework but face significant computational and memory challenges, especially for large-scale models where gradients match model size. This research introduces a novel approach leveraging dropout as a gradient compression mechanism to compute influence functions more efficiently. The method dramatically reduces both computational and memory overhead during influence computation and gradient compression processes. Through theoretical analysis and empirical validation, the approach preserves critical data influence components while enabling application to modern large-scale models. This advancement has implications for model interpretability, data valuation, privacy-preserving machine learning, and responsible AI development across industries requiring explainable and auditable AI systems.
Authors: Yuchen Zhang, Mohammad Mohammadi Amiri
Link: https://arxiv.org/abs/2509.15651v1
Date: 2025-09-d
Summary:
Assessing the impact of training data on machine learning models is crucial for understanding the behavior of the model, enhancing transparency, and selecting training data. The influence function provides a theoretical framework for quantifying the effect of training data points on a model's performance given specific test data. However, the computational and memory costs of influence functions present significant challenges, especially for large-scale models, even when using approximation methods, since the gradients involved in the computation are as large as the model itself. In this work, we introduce a novel approach that leverages dropout as a gradient compression mechanism to compute the influence function more efficiently. Our method significantly reduces computational and memory overhead, not only during the influence function computation but also in the gradient compression process. Through theoretical analysis and empirical validation, we demonstrate that our method preserves critical components of the data influence and enables its application to modern large-scale models.
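A simplified sketch of the compression idea: a shared dropout mask keeps only a fraction of gradient coordinates, shrinking what must be stored and compared. The identity-Hessian (gradient dot-product) influence used below is a common approximation standing in for the paper's estimator, and all names are illustrative.

```python
# Simplified sketch: apply a shared dropout mask to per-example gradients so
# only a subset of coordinates is stored and compared, reducing the memory
# footprint of influence computation. The Hessian is replaced by the identity
# (gradient dot-product influence), a common approximation rather than the
# paper's exact estimator.

import torch

def dropout_mask(num_params: int, keep_prob: float, seed: int = 0) -> torch.Tensor:
    gen = torch.Generator().manual_seed(seed)     # shared mask across examples
    return (torch.rand(num_params, generator=gen) < keep_prob).float()

def compressed_grad(grad: torch.Tensor, mask: torch.Tensor, keep_prob: float) -> torch.Tensor:
    # Scaling each side by 1/sqrt(keep_prob) keeps the masked dot product
    # unbiased in expectation for the full-gradient dot product.
    return grad * mask / keep_prob ** 0.5

def influence(train_grad: torch.Tensor, test_grad: torch.Tensor,
              mask: torch.Tensor, keep_prob: float = 0.1) -> float:
    # Identity-Hessian approximation: influence ~ <grad_train, grad_test>.
    g_tr = compressed_grad(train_grad, mask, keep_prob)
    g_te = compressed_grad(test_grad, mask, keep_prob)
    return torch.dot(g_tr, g_te).item()

# Usage: with keep_prob=0.1 only ~10% of each gradient needs to be stored.
mask = dropout_mask(num_params=1_000, keep_prob=0.1)
print(influence(torch.randn(1_000), torch.randn(1_000), mask))
```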
--------------------------------------------------------------------------------------------------------
The (Short-Term) Effects of Large Language Models on Unemployment and Earnings
The rapid adoption of Large Language Models since ChatGPT's late 2022 release has sparked debates about productivity gains versus job displacement concerns. This economic analysis examines LLMs' short-term labor market effects by comparing earnings and unemployment across occupations with varying exposure levels. Using Synthetic Difference in Differences methodology, the research reveals that highly exposed workers experienced earnings increases following ChatGPT's introduction, while unemployment rates remained stable. These findings suggest initial labor market adjustments operate primarily through earnings rather than worker displacement. The research has significant implications for policy makers, economists, and business leaders planning workforce transitions. Understanding these patterns helps inform education policy, retraining programs, and economic forecasting as AI technologies continue evolving and expanding across industries.
Authors: Danqing Chen, Carina Kane, Austin Kozlowski, Nadav Kunievsky, James A. Evans
Link: https://arxiv.org/abs/2509.15510v1
Date: 2025-09-d
Summary:
Large Language Models have spread rapidly since the release of ChatGPT in late 2022, accompanied by claims of major productivity gains but also concerns about job displacement. This paper examines the short-run labor market effects of LLM adoption by comparing earnings and unemployment across occupations with differing levels of exposure to these technologies. Using a Synthetic Difference in Differences approach, we estimate the impact of LLM exposure on earnings and unemployment. Our findings show that workers in highly exposed occupations experienced earnings increases following ChatGPT's introduction, while unemployment rates remained unchanged. These results suggest that initial labor market adjustments to LLMs operate primarily through earnings rather than worker reallocation.
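For reference, the general synthetic difference-in-differences estimator (Arkhangelsky et al.) takes the following form, with unit weights and time weights chosen to match pre-treatment trends; the paper's exact specification of occupations, periods, and exposure treatment may differ.

```latex
% General SDID estimator, shown only as a reminder of the estimation idea.
\begin{equation*}
\hat{\tau}^{\mathrm{sdid}}
  = \arg\min_{\tau,\,\mu,\,\alpha,\,\beta}
    \sum_{i=1}^{N}\sum_{t=1}^{T}
      \left(Y_{it} - \mu - \alpha_i - \beta_t - W_{it}\,\tau\right)^{2}
      \hat{\omega}_i\, \hat{\lambda}_t
\end{equation*}
```

Here Y_it is the outcome (earnings or unemployment) for occupation i in period t, W_it marks highly exposed occupations in post-ChatGPT periods, and the weights ω̂_i and λ̂_t balance unexposed occupations and pre-treatment periods against the treated cells.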
--------------------------------------------------------------------------------------------------------
The proliferation of Internet of Things devices in 6G wireless networks requires robust security mechanisms, particularly when integrating radio frequency and free space optical communications. This research addresses security challenges through multi-reconfigurable intelligent surface (RIS) aided systems that dynamically control propagation environments. The work models RF links with Rician fading and FSO links with Málaga turbulence including pointing errors, creating realistic propagation conditions. Analytical expressions for secrecy outage probability, average secrecy capacity, and effective secrecy throughput provide theoretical foundations. Results demonstrate that heterodyne detection mitigates pointing error effects, while multi-RIS structures achieve up to 47.67% improvement in secrecy outage probability. Applications span secure IoT deployments, military communications, financial data transmission, and any scenario requiring enhanced physical layer security in mixed RF-FSO networks.
Authors: Anika Tabassum Biva, Md. Ibrahim, A. S. M. Badrudduza, Imran Shafique Ansari
Link: https://arxiv.org/abs/2509.15411v1
Date: 2025-09-d
Summary:
Due to their ability to dynamically control the propagation environment, reconfigurable intelligent surfaces (RISs) offer a promising solution to address the challenges of 6G wireless communication, especially in the context of Internet of Things (IoT) networks. This paper investigates a mixed communication model with multi-RIS-aided radio frequency (RF)-free space optics (FSO) to enhance the performance of IoT applications in complex environments. An eavesdropper is assumed to be present, attempting to intercept confidential information transmitted over the RF link. All RF links are modeled using Rician fading, while the FSO link accounts for Málaga turbulence with pointing errors, capturing real-world propagation conditions. Closed-form analytical expressions are derived for the secrecy outage probability, average secrecy capacity, and effective secrecy throughput in terms of Meijer's G function. To gain further insight, high signal-to-noise approximations of these metrics are also presented. Numerical results highlight the importance of heterodyne detection in mitigating the adverse effects of pointing errors on the FSO link. Moreover, integrating a multi-RIS structure into the proposed model significantly increases secrecy performance, achieving up to a 47.67% improvement in SOP compared to conventional methods. Finally, the derived analytical results are validated through Monte Carlo simulations.
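As a companion to the closed-form analysis, a small Monte Carlo sketch shows how secrecy metrics of this kind are typically validated: draw channel gains for the legitimate and eavesdropper links, form the secrecy capacity, and estimate the secrecy outage probability. Rayleigh fading is used below purely for brevity; the paper's Rician and Málaga-with-pointing-errors models would replace the exponential draws.

```python
# Monte Carlo sketch of secrecy outage probability (SOP) estimation.
# Secrecy capacity: C_s = max(0, log2(1 + SNR_legit) - log2(1 + SNR_eve)),
# SOP = Pr(C_s < target secrecy rate). Rayleigh fading (exponential power
# gains) is used here for brevity, not the paper's channel models.

import numpy as np

rng = np.random.default_rng(0)

def secrecy_outage_probability(snr_legit_db, snr_eve_db, rate_s, n=1_000_000):
    snr_b = 10 ** (snr_legit_db / 10) * rng.exponential(1.0, n)   # legitimate link
    snr_e = 10 ** (snr_eve_db / 10) * rng.exponential(1.0, n)     # eavesdropper link
    c_s = np.maximum(np.log2(1 + snr_b) - np.log2(1 + snr_e), 0.0)
    return np.mean(c_s < rate_s)

print(secrecy_outage_probability(snr_legit_db=15, snr_eve_db=5, rate_s=1.0))
```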
--------------------------------------------------------------------------------------------------------
Recent Advancements in Microscopy Image Enhancement using Deep Learning: A Survey
Microscopy imaging is fundamental to biological research and materials science, but image quality often limits detailed analysis of cellular structures and microscopic materials. Deep learning has revolutionized microscopy image enhancement, offering unprecedented capabilities in resolution improvement, reconstruction, and noise reduction. This comprehensive survey examines the evolution of deep learning methods across three core domains: super-resolution for detail enhancement, reconstruction for recovering degraded images, and denoising for clarity improvement. The review explores current trends and practical applications, highlighting how these advancements enable better understanding of biological processes and material properties. Applications span medical diagnostics, drug discovery, materials research, quality control in manufacturing, and fundamental biological research. The survey provides researchers and practitioners with essential insights for selecting appropriate enhancement techniques for specific microscopy applications.
Authors: Debasish Dutta, Neeharika Sonowal, Risheraj Barauh, Deepjyoti Chetia, Sanjib Kr Kalita
Link: https://arxiv.org/abs/2509.15363v1
Date: 2025-09-d
Summary:
Microscopy image enhancement plays a pivotal role in understanding the details of biological cells and materials at microscopic scales. In recent years, there has been a significant rise in the advancement of microscopy image enhancement, specifically with the help of deep learning methods. This survey paper aims to provide a snapshot of this rapidly advancing field, focusing on its evolution, applications, challenges, and future directions. The core discussion centers on the key domains of microscopy image enhancement: super-resolution, reconstruction, and denoising, with each domain explored in terms of its current trends and the practical utility of deep learning.
--------------------------------------------------------------------------------------------------------
Internalizing Self-Consistency in Language Models: Multi-Agent Consensus Alignment
Language models exhibit frustrating inconsistencies, generating contradictory responses to identical prompts despite their sophisticated capabilities. While inference-time methods partially address this issue, they fail to tackle the core problem: models struggle to reliably select reasoning pathways leading to consistent outcomes. This research formalizes self-consistency as an intrinsic property of well-aligned reasoning models and introduces Multi-Agent Consensus Alignment (MACA), a reinforcement learning framework for post-training models. Through deliberative multi-agent exchanges, agents ground reasoning in peer arguments rather than simple aggregation, creating richer consensus signals. MACA enables self-teaching for more decisive, concise reasoning and better peer insight utilization. Results show substantial improvements across self-consistency (+27.6% on GSM8K), single-agent reasoning (+23.7% on MATH), and multi-agent decision-making (+42.7% on MathQA), with strong generalization to unseen benchmarks.
Authors: Ankur Samanta, Akshayaa Magesh, Youliang Yu, Runzhe Wu, Ayush Jain, Daniel Jiang, Boris Vidolov, Paul Sajda, Yonathan Efroni, Kaveh Hassani
Link: https://arxiv.org/abs/2509.15172v1
Date: 2025-09-d
Summary:
Language Models (LMs) are inconsistent reasoners, often generating contradictory responses to identical prompts. While inference-time methods can mitigate these inconsistencies, they fail to address the core problem: LMs struggle to reliably select reasoning pathways leading to consistent outcomes under exploratory sampling. To address this, we formalize self-consistency as an intrinsic property of well-aligned reasoning models and introduce Multi-Agent Consensus Alignment (MACA), a reinforcement learning framework that post-trains models to favor reasoning trajectories aligned with their internal consensus using majority/minority outcomes from multi-agent debate. These trajectories emerge from deliberative exchanges where agents ground reasoning in peer arguments, not just aggregation of independent attempts, creating richer consensus signals than single-round majority voting. MACA enables agents to teach themselves to be more decisive and concise, and better leverage peer insights in multi-agent settings without external supervision, driving substantial improvements across self-consistency (+27.6% on GSM8K), single-agent reasoning (+23.7% on MATH), sampling-based inference (+22.4% Pass@20 on MATH), and multi-agent ensemble decision-making (+42.7% on MathQA). These findings, coupled with strong generalization to unseen benchmarks (+16.3% on GPQA, +11.6% on CommonsenseQA), demonstrate robust self-alignment that more reliably unlocks latent reasoning potential of language models.
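A minimal sketch of turning multi-agent samples into a consensus-alignment signal: the majority answer labels positive trajectories and the minority labels negatives, which a preference or RL post-training step can then favor. The deliberative exchange and reward shaping are simplified away; only the majority/minority labeling idea is shown.

```python
# Minimal sketch of deriving majority/minority labels from multi-agent
# samples: trajectories agreeing with the internal consensus become positive
# examples, the rest negative, for a subsequent preference/RL update.
# This omits the deliberative exchange itself and any reward shaping.

from collections import Counter

def consensus_labels(samples):
    """samples: list of (reasoning_trace, final_answer) from parallel agents."""
    counts = Counter(answer for _, answer in samples)
    majority_answer, _ = counts.most_common(1)[0]
    return [(trace, answer, 1.0 if answer == majority_answer else -1.0)
            for trace, answer in samples]

# Usage: traces labeled +1 can serve as "chosen" and -1 as "rejected" examples
# for a preference-optimization step that internalizes self-consistency.
samples = [("trace A", "42"), ("trace B", "42"), ("trace C", "41")]
print(consensus_labels(samples))
```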
--------------------------------------------------------------------------------------------------------
Formal verification and mechanized mathematics represent the cutting edge of mathematical rigor, where complex mathematical constructions are verified by computer systems. This work presents the formalization of the multi-graded Proj construction in Lean4, a sophisticated algebraic geometry concept crucial for understanding projective varieties and schemes. The multi-graded Proj construction extends classical projective geometry to handle more complex grading structures, enabling deeper analysis of algebraic objects. By successfully formalizing this advanced mathematical concept, the research demonstrates the maturity of modern proof assistants and their capability to handle research-level mathematics. Applications extend beyond pure mathematics to cryptography, coding theory, and computer algebra systems. This work contributes to the growing movement toward mechanized verification of mathematical knowledge, ensuring correctness and enabling computer-assisted discovery in advanced mathematical research.
Authors: Arnaud Mayeux, Jujian Zhang
Link: https://arxiv.org/abs/2509.15116v1
Date: 2025-09-d
Summary:
We formalize the multi-graded Proj construction in Lean4, illustrating mechanized mathematics and formalization.
--------------------------------------------------------------------------------------------------------
Modeling Transformers as complex networks to analyze learning dynamics
Understanding how Large Language Models acquire complex capabilities during training remains a fundamental challenge in AI interpretability. This research applies Complex Network Theory to analyze transformer learning dynamics by representing models as directed, weighted graphs where nodes are computational components (attention heads and MLPs) and edges represent causal influence measured through intervention-based ablation. Tracking the Pythia-14M model across 143 training checkpoints on canonical induction tasks reveals distinct learning phases: exploration, consolidation, and refinement. The analysis identifies stable hierarchies of information spreader components and dynamic information gatherer components that reconfigure at key learning junctures. This component-level network perspective offers a powerful macroscopic lens for visualizing and understanding self-organizing principles driving functional circuit formation. Applications include model optimization, interpretability research, and developing more efficient training procedures for large language models.
Authors: Elisabetta Rocchetti
Link: https://arxiv.org/abs/2509.15269v1
Date: 2025-09-d
Summary:
The process by which Large Language Models (LLMs) acquire complex capabilities during training remains a key open question in mechanistic interpretability. This project investigates whether these learning dynamics can be characterized through the lens of Complex Network Theory (CNT). I introduce a novel methodology to represent a Transformer-based LLM as a directed, weighted graph where nodes are the model's computational components (attention heads and MLPs) and edges represent causal influence, measured via an intervention-based ablation technique. By tracking the evolution of this component-graph across 143 training checkpoints of the Pythia-14M model on a canonical induction task, I analyze a suite of graph-theoretic metrics. The results reveal that the network's structure evolves through distinct phases of exploration, consolidation, and refinement. Specifically, I identify the emergence of a stable hierarchy of information spreader components and a dynamic set of information gatherer components, whose roles reconfigure at key learning junctures. This work demonstrates that a component-level network perspective offers a powerful macroscopic lens for visualizing and understanding the self-organizing principles that drive the formation of functional circuits in LLMs.
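A sketch of the component-graph construction with networkx: nodes are attention heads and MLP blocks, and edge weights are the measured effect of ablating one component on another. The `ablation_effect` callable and the two example metrics are placeholders for the intervention technique and the fuller metric suite used in the paper.

```python
# Sketch of the component-graph idea: nodes are attention heads and MLP
# blocks, edge weights are the causal effect of ablating a source component
# on a downstream component's output. `ablation_effect` is a placeholder for
# the intervention-based measurement; the metrics below stand in for the
# paper's graph-theoretic suite tracked across training checkpoints.

import itertools
import networkx as nx

def build_component_graph(components, ablation_effect, threshold=0.01):
    g = nx.DiGraph()
    g.add_nodes_from(components)                 # e.g. "L2.H3", "L4.MLP"
    for src, dst in itertools.permutations(components, 2):
        w = ablation_effect(src, dst)            # |change in dst| when src is ablated
        if w > threshold:
            g.add_edge(src, dst, weight=w)
    return g

def checkpoint_metrics(graph):
    return {
        "mean_out_degree": sum(d for _, d in graph.out_degree()) / graph.number_of_nodes(),
        "clustering": nx.average_clustering(graph.to_undirected()),
    }

# Tracking such metrics over training checkpoints is what exposes the
# exploration / consolidation / refinement phases described above.
```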
--------------------------------------------------------------------------------------------------------
The proliferation of IoT devices with microphones performing on-device audio classification creates significant security vulnerabilities while operating under resource constraints. This research presents a comprehensive defense-in-depth architecture addressing these challenges through a security protocol treating edge devices, cellular networks, and cloud backends as separate trust domains. The system employs TPM-based remote attestation and mutually authenticated TLS 1.3 connections. During startup, each boot stage is measured into TPM PCRs, with devices only decrypting LUKS-sealed partitions after cloud verification and one-time unlock key release. This ensures tampered devices remain inert. Post-quantum resilience through Kyber and Dilithium hybridization, end-to-end encryption, signed AI models, and tamper-responsive sensors provide comprehensive protection. Applications span smart home security, industrial IoT monitoring, healthcare devices, and any audio-processing IoT deployment requiring robust security frameworks.
Authors: Sergio Benlloch-Lopez, Miquel Viel-Vazquez, Javier Naranjo-Alcazar, Jordi Grau-Haro, Pedro Zuccarello
Link: https://arxiv.org/abs/2509.14657v2
Date: 2025-09-d
Summary:
The rapid proliferation of IoT nodes equipped with microphones and capable of performing on-device audio classification exposes highly sensitive data while operating under tight resource constraints. To protect against this, we present a defence-in-depth architecture comprising a security protocol that treats the edge device, cellular network and cloud backend as three separate trust domains, linked by TPM-based remote attestation and mutually authenticated TLS 1.3. A STRIDE-driven threat model and attack-tree analysis guide the design. At startup, each boot stage is measured into TPM PCRs. The node can only decrypt its LUKS-sealed partitions after the cloud has verified a TPM quote and released a one-time unlock key. This ensures that rogue or tampered devices remain inert. Data in transit is protected by TLS 1.3 and hybridised with Kyber and Dilithium to provide post-quantum resilience. Meanwhile, end-to-end encryption and integrity hashes safeguard extracted audio features. Signed, rollback-protected AI models and tamper-responsive sensors harden firmware and hardware. Data at rest follows a 3-2-1 strategy comprising a solid-state drive sealed with LUKS, an offline cold archive encrypted with a hybrid post-quantum cipher and an encrypted cloud replica. Finally, we set out a plan for evaluating the physical and logical security of the proposed protocol.
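A toy simulation of the measured-boot logic that makes tampered devices inert: each boot stage is hashed into a PCR-style register, and the backend releases the one-time unlock key only when the reported value matches the expected golden measurement. A real deployment uses an actual TPM quote over TLS 1.3; this sketch only shows the chaining property.

```python
# Toy illustration of measured boot: each boot stage's hash is "extended"
# into a PCR-like register, and the cloud releases the one-time LUKS unlock
# key only if the final register matches the expected golden value. Real
# deployments use an actual TPM, a signed quote, and TLS 1.3.

import hashlib

def extend(pcr: bytes, measurement: bytes) -> bytes:
    # PCR extend: new = SHA-256(old || SHA-256(measurement))
    return hashlib.sha256(pcr + hashlib.sha256(measurement).digest()).digest()

def measure_boot(stages: list[bytes]) -> bytes:
    pcr = b"\x00" * 32
    for stage in stages:
        pcr = extend(pcr, stage)
    return pcr

GOLDEN = measure_boot([b"bootloader-v1", b"kernel-v1", b"app-v1"])

def release_unlock_key(reported_pcr: bytes) -> bytes | None:
    """Cloud-side check: only matching measurements get the LUKS unlock key."""
    return b"one-time-unlock-key" if reported_pcr == GOLDEN else None

# A tampered kernel changes the chained value, so no key is released:
print(release_unlock_key(measure_boot([b"bootloader-v1", b"kernel-evil", b"app-v1"])))
```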
--------------------------------------------------------------------------------------------------------
Dynamic energy systems and controls require sophisticated modeling frameworks for designing and testing supervisory and fault-tolerant strategies. Modelica, while widely used for equation-based modeling, demands specialized expertise making control module development labor-intensive. This research examines using Large Language Models to automate Control Description Language module generation in Building Modelica Library. The structured workflow combines standardized prompt scaffolds, library-aware grounding, automated OpenModelica compilation, and human-in-the-loop evaluation. Results show GPT-4o failed in zero-shot mode while Claude Sonnet 4 achieved full success for basic logic blocks with engineered prompts. Control modules reached 83% success rates, with failed outputs requiring medium-level human repair (1-8 hours). Despite limitations, the LLM-assisted workflow reduced development time from 10-20 hours to 4-6 hours per module (40-60% savings), demonstrating significant potential for accelerating engineering workflows.
Authors: Hanlong Wan, Xing Lu, Yan Chen, Karthik Devaprasad, Laura Hinkle
Link: https://arxiv.org/abs/2509.14623v1
Date: 2025-09-d
Summary:
Dynamic energy systems and controls require advanced modeling frameworks to design and test supervisory and fault-tolerant strategies. Modelica is a widely used equation-based language, but developing control modules is labor-intensive and requires specialized expertise. This paper examines the use of large language models (LLMs) to automate the generation of Control Description Language modules in the Building Modelica Library as a case study. We developed a structured workflow that combines standardized prompt scaffolds, library-aware grounding, automated compilation with OpenModelica, and human-in-the-loop evaluation. Experiments were carried out on four basic logic tasks (And, Or, Not, and Switch) and five control modules (chiller enable/disable, bypass valve control, cooling tower fan speed, plant requests, and relief damper control). The results showed that GPT-4o failed to produce executable Modelica code in zero-shot mode, while Claude Sonnet 4 achieved up to full success for basic logic blocks with carefully engineered prompts. For control modules, success rates reached 83 percent, and failed outputs required medium-level human repair (estimated one to eight hours). Retrieval-augmented generation often produced mismatches in module selection (for example, And retrieved as Or), while a deterministic hard-rule search strategy avoided these errors. Human evaluation also outperformed AI evaluation, since current LLMs cannot assess simulation results or validate behavioral correctness. Despite these limitations, the LLM-assisted workflow reduced the average development time from 10 to 20 hours down to 4 to 6 hours per module, corresponding to 40 to 60 percent time savings. These results highlight both the potential and current limitations of LLM-assisted Modelica generation, and point to future research in pre-simulation validation, stronger grounding, and closed-loop evaluation.
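A hedged sketch of the generate-compile-review loop described in the workflow: an LLM drafts a module from a prompt scaffold, the draft is checked with the OpenModelica compiler, and persistent failures are routed to a human reviewer. The `call_llm` callable is a placeholder, and the `omc` invocation shown is a bare load/flatten check rather than the paper's full evaluation harness.

```python
# Hedged sketch of the generate -> compile -> review loop: the LLM drafts a
# Modelica/CDL block, the draft is run through the OpenModelica compiler, and
# failures after a few attempts go to a human reviewer. The omc call below is
# a coarse check (load the file, look for errors), not the paper's harness.

import subprocess
import tempfile
from pathlib import Path

def compiles_with_omc(modelica_source: str) -> bool:
    with tempfile.TemporaryDirectory() as tmp:
        path = Path(tmp) / "Candidate.mo"
        path.write_text(modelica_source)
        result = subprocess.run(["omc", str(path)], capture_output=True, text=True)
        return result.returncode == 0 and "Error" not in result.stderr

def generate_cdl_module(task_prompt: str, call_llm, max_attempts: int = 3):
    source = ""
    for _ in range(max_attempts):
        source = call_llm(task_prompt)           # prompt scaffold + library grounding
        if compiles_with_omc(source):
            return source, "ok"
    return source, "needs_human_repair"          # the 1-8 h manual-fix path above
```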
--------------------------------------------------------------------------------------------------------
Do Vision-Language Models See Urban Scenes as People Do? An Urban Perception Benchmark
Understanding human perception of urban environments is crucial for informed city design and planning decisions. This research introduces a benchmark testing vision-language models on urban perception using 100 Montreal street images split between photographs and photorealistic synthetic scenes. Twelve participants from seven community groups provided 230 annotation forms across 30 dimensions mixing physical attributes and subjective impressions. Seven VLMs were evaluated in zero-shot settings with structured prompts and deterministic parsing. Results reveal stronger model alignment on visible, objective properties than subjective appraisals, with the top system (Claude Sonnet) achieving macro 0.31 and mean Jaccard 0.48 on multi-label items. Higher human agreement correlated with better model scores, while synthetic images slightly lowered performance. Applications include urban planning, architectural design, smart city development, and participatory design processes requiring understanding of human environmental perception.
Authors: Rashid Mushkani
Link: https://arxiv.org/abs/2509.14574v1
Date: 2025-09-d
Summary:
Understanding how people read city scenes can inform design and planning. We introduce a small benchmark for testing vision-language models (VLMs) on urban perception using 100 Montreal street images, evenly split between photographs and photorealistic synthetic scenes. Twelve participants from seven community groups supplied 230 annotation forms across 30 dimensions mixing physical attributes and subjective impressions. French responses were normalized to English. We evaluated seven VLMs in a zero-shot setup with a structured prompt and deterministic parser. We use accuracy for single-choice items and Jaccard overlap for multi-label items; human agreement uses Krippendorff's alpha and pairwise Jaccard. Results suggest stronger model alignment on visible, objective properties than subjective appraisals. The top system (claude-sonnet) reaches macro 0.31 and mean Jaccard 0.48 on multi-label items. Higher human agreement coincides with better model scores. Synthetic images slightly lower scores. We release the benchmark, prompts, and harness for reproducible, uncertainty-aware evaluation in participatory urban analysis.
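The two scoring rules mentioned above are standard and easy to state; the sketch below shows exact-match accuracy for single-choice items and Jaccard overlap for multi-label items. Parsing and aggregation details of the benchmark harness are not reproduced here.

```python
# Scoring sketch: exact-match accuracy for single-choice items and Jaccard
# overlap for multi-label items, as used to compare model outputs against
# community annotations.

def accuracy(preds: list[str], golds: list[str]) -> float:
    return sum(p == g for p, g in zip(preds, golds)) / len(golds)

def jaccard(pred: set[str], gold: set[str]) -> float:
    return len(pred & gold) / len(pred | gold) if pred | gold else 1.0

# e.g. model predicts {"trees", "benches"} while annotators marked
# {"trees", "benches", "bike lane"} -> Jaccard = 2/3.
print(jaccard({"trees", "benches"}, {"trees", "benches", "bike lane"}))
```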
--------------------------------------------------------------------------------------------------------
AToken: A Unified Tokenizer for Vision
Visual tokenization has traditionally required separate specialized systems for different modalities and tasks, limiting unified multimodal AI development. AToken addresses this fragmentation by providing the first unified visual tokenizer achieving both high-fidelity reconstruction and semantic understanding across images, videos, and 3D assets. The system encodes diverse visual inputs into a shared 4D latent space using a pure transformer architecture with 4D rotary position embeddings for arbitrary resolutions and temporal durations. An adversarial-free training objective combining perceptual and Gram matrix losses ensures stable training and state-of-the-art reconstruction quality. Through a progressive training curriculum, AToken supports both continuous and discrete latent tokens. Results show 0.21 rFID with 82.2% ImageNet accuracy for images and competitive video and 3D performance; the model enables both generation and understanding tasks across modalities, pointing toward next-generation unified multimodal AI systems.
Authors: Jiasen Lu, Liangchen Song, Mingze Xu, Byeongjoo Ahn, Yanjun Wang, Chen Chen, Afshin Dehghan, Yinfei Yang
Link: https://arxiv.org/abs/2509.14476v2
Date: 2025-09-d
Summary:
We present AToken, the first unified visual tokenizer that achieves both high-fidelity reconstruction and semantic understanding across images, videos, and 3D assets. Unlike existing tokenizers that specialize in either reconstruction or understanding for single modalities, AToken encodes these diverse visual inputs into a shared 4D latent space, unifying both tasks and modalities in a single framework. Specifically, we introduce a pure transformer architecture with 4D rotary position embeddings to process visual inputs of arbitrary resolutions and temporal durations. To ensure stable training, we introduce an adversarial-free training objective that combines perceptual and Gram matrix losses, achieving state-of-the-art reconstruction quality. By employing a progressive training curriculum, AToken gradually expands from single images to videos and 3D assets, and supports both continuous and discrete latent tokens. AToken achieves 0.21 rFID with 82.2% ImageNet accuracy for images, 3.01 rFVD with 40.2% MSRVTT retrieval for videos, and 28.28 PSNR with 90.9% classification accuracy for 3D. In downstream applications, AToken enables both visual generation tasks (e.g., image generation with continuous and discrete tokens, text-to-video generation, image-to-3D synthesis) and understanding tasks (e.g., multimodal LLMs), achieving competitive performance across all benchmarks. These results shed light on the next-generation multimodal AI systems built upon unified visual tokenization.
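To illustrate the adversarial-free objective, here is a sketch of the Gram-matrix term: feature maps from a frozen extractor are compared via their channel correlation matrices, capturing texture statistics without a GAN discriminator. Layer choice, weighting, and the companion perceptual term are assumptions left out of this sketch.

```python
# Sketch of the Gram-matrix component of an adversarial-free reconstruction
# objective: features from a frozen extractor are compared via their channel
# correlation (Gram) matrices, which captures texture statistics without a
# discriminator. Layer selection and loss weighting are not shown.

import torch

def gram_matrix(features: torch.Tensor) -> torch.Tensor:
    # features: (batch, channels, height, width)
    b, c, h, w = features.shape
    flat = features.reshape(b, c, h * w)
    return flat @ flat.transpose(1, 2) / (c * h * w)

def gram_loss(feat_recon: torch.Tensor, feat_target: torch.Tensor) -> torch.Tensor:
    return torch.mean((gram_matrix(feat_recon) - gram_matrix(feat_target)) ** 2)

# Usage: feats come from a fixed network (e.g. a VGG-style extractor) applied
# to the reconstruction and the original image at matching layers.
loss = gram_loss(torch.randn(2, 64, 32, 32), torch.randn(2, 64, 32, 32))
```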
--------------------------------------------------------------------------------------------------------
From Mimicry to True Intelligence (TI) -- A New Paradigm for Artificial General Intelligence
Current Artificial General Intelligence definitions focus on performance metrics rather than underlying cognitive mechanisms, providing no clear research roadmap and failing to define genuine intelligence qualitatively. This research proposes a paradigm shift from external mimicry to foundational cognitive architectures inspired by human brain structure. True Intelligence (TI) is defined through six core components: embodied sensory fusion, core directives, dynamic schemata creation, highly-interconnected multi-expert architecture, orchestration layer, and Interconnectedness (hypothesized to result in consciousness). The authors propose a practical five-level AGI taxonomy based on measurable components, providing clear developmental milestones. They argue that Level-5 AGI implementing all five measurable components becomes functionally equivalent to TI, with consciousness remaining a philosophical debate. This framework synthesizes insights from analytical psychology, schema theory, metacognition, and modern brain architectures, offering the first holistic, mechanism-based AGI definition with actionable research directions.
Authors: Meltem Subasioglu, Nevzat Subasioglu
Link: https://arxiv.org/abs/2509.14474v1
Date: 2025-09-d
Summary:
The debate around Artificial General Intelligence (AGI) remains open due to two fundamentally different goals: replicating human-like performance versus replicating human-like cognitive processes. We argue that current performance-based definitions are inadequate because they provide no clear, mechanism-focused roadmap for research, and they fail to properly define the qualitative nature of genuine intelligence. Drawing inspiration from the human brain, we propose a new paradigm that shifts the focus from external mimicry to the development of foundational cognitive architectures. We define True Intelligence (TI) as a system characterized by six core components: embodied sensory fusion, core directives, dynamic schemata creation, a highly-interconnected multi-expert architecture, an orchestration layer, and lastly, the unmeasurable quality of Interconnectedness, which we hypothesize results in consciousness and a subjective experience. We propose a practical, five-level taxonomy of AGI based on the number of the first five measurable components a system exhibits. This framework provides a clear path forward with developmental milestones that directly address the challenge of building genuinely intelligent systems. We contend that once a system achieves Level-5 AGI by implementing all five measurable components, the difference between it and TI remains as a purely philosophical debate. For practical purposes - and given theories indicate consciousness is an emergent byproduct of integrated, higher-order cognition - we conclude that a fifth-level AGI is functionally and practically equivalent to TI. This work synthesizes diverse insights from analytical psychology, schema theory, metacognition, modern brain architectures and latest works in AI to provide the first holistic, mechanism-based definition of AGI that offers a clear and actionable path for the research community.
--------------------------------------------------------------------------------------------------------
VCBench: Benchmarking LLMs in Venture Capital
Venture capital represents one of the most challenging prediction domains, where signals are sparse, outcomes uncertain, and even top investors achieve modest performance relative to difficulty. This research introduces VCBench, the first benchmark for predicting founder success in venture capital, providing 9,000 anonymized founder profiles standardized to preserve predictive features while preventing identity leakage. Adversarial tests show over 90% reduction in re-identification risk. Nine state-of-the-art LLMs were evaluated, with DeepSeek-V3 delivering six times baseline precision, GPT-4o achieving highest F0.5 score, and most models surpassing human benchmarks. The market index achieves 1.9% precision at inception, while Y Combinator outperforms by 1.7x and tier-1 firms by 2.9x. VCBench establishes a community-driven standard for reproducible, privacy-preserving AGI evaluation in early-stage venture forecasting, available publicly at vcbench.com for advancing investment decision-making research.
Authors: Rick Chen, Joseph Ternasky, Afriyie Samuel Kwesi, Ben Griffin, Aaron Ontoyin Yin, Zakari Salifu, Kelvin Amoaba, Xianling Mu, Fuat Alican, Yigit Ihlamur
Link: https://arxiv.org/abs/2509.14448v1
Date: 2025-09-d
Summary:
Benchmarks such as SWE-bench and ARC-AGI demonstrate how shared datasets accelerate progress toward artificial general intelligence (AGI). We introduce VCBench, the first benchmark for predicting founder success in venture capital (VC), a domain where signals are sparse, outcomes are uncertain, and even top investors perform modestly. At inception, the market index achieves a precision of 1.9%. Y Combinator outperforms the index by a factor of 1.7x, while tier-1 firms are 2.9x better. VCBench provides 9,000 anonymized founder profiles, standardized to preserve predictive features while resisting identity leakage, with adversarial tests showing more than 90% reduction in re-identification risk. We evaluate nine state-of-the-art large language models (LLMs). DeepSeek-V3 delivers over six times the baseline precision, GPT-4o achieves the highest F0.5, and most models surpass human benchmarks. Designed as a public and evolving resource available at vcbench.com, VCBench establishes a community-driven standard for reproducible and privacy-preserving evaluation of AGI in early-stage venture forecasting.
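For readers unfamiliar with the quoted metrics, a quick sketch of precision and the F0.5 score, which emphasizes precision over recall and suits a setting where false positives (funded failures) are costly. The example numbers are hypothetical, chosen only to mirror the roughly six-fold lift over the 1.9% baseline reported above.

```python
# Sketch of the metrics quoted above: precision against a sparse base rate,
# and the F-beta score with beta=0.5, which weights precision more heavily
# than recall.

def precision(tp: int, fp: int) -> float:
    return tp / (tp + fp) if tp + fp else 0.0

def f_beta(p: float, r: float, beta: float = 0.5) -> float:
    if p == 0 and r == 0:
        return 0.0
    return (1 + beta**2) * p * r / (beta**2 * p + r)

# Hypothetical example: flagging 100 founders with 12 true successes gives
# precision 0.12, about six times a 1.9% market baseline.
print(precision(12, 88), f_beta(0.12, 0.30))
```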
--------------------------------------------------------------------------------------------------------
GestOS: Advanced Hand Gesture Interpretation via Large Language Models to control Any Type of Robot
Traditional gesture-based control systems map hand movements to fixed commands or single-agent actions, limiting flexibility and scalability in multi-robot environments. GestOS introduces a gesture-based operating system for high-level control of heterogeneous robot teams that interprets gestures semantically rather than procedurally. The system combines lightweight visual perception with Large Language Model reasoning: hand poses are converted to structured textual descriptions, which LLMs use to infer intent and generate robot-specific commands. A robot selection module ensures gesture-triggered tasks match the most suitable agent in real-time based on capabilities, current state, and supported instruction sets. This architecture enables context-aware, adaptive control without requiring explicit user specification of targets or commands. GestOS advances gesture interaction from recognition to intelligent orchestration, supporting scalable, flexible, user-friendly collaboration with robotic systems in dynamic environments across manufacturing, service robotics, and research applications.
Authors: Artem Lykov, Oleg Kobzarev, Dzmitry Tsetserukou
Link: https://arxiv.org/abs/2509.14412v1
Date: 2025-09-d
Summary:
We present GestOS, a gesture-based operating system for high-level control of heterogeneous robot teams. Unlike prior systems that map gestures to fixed commands or single-agent actions, GestOS interprets hand gestures semantically and dynamically distributes tasks across multiple robots based on their capabilities, current state, and supported instruction sets. The system combines lightweight visual perception with large language model (LLM) reasoning: hand poses are converted into structured textual descriptions, which the LLM uses to infer intent and generate robot-specific commands. A robot selection module ensures that each gesture-triggered task is matched to the most suitable agent in real time. This architecture enables context-aware, adaptive control without requiring explicit user specification of targets or commands. By advancing gesture interaction from recognition to intelligent orchestration, GestOS supports scalable, flexible, and user-friendly collaboration with robotic systems in dynamic environments.
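An illustrative sketch of the gesture-to-command flow: a hand pose is serialized as text, an LLM infers the intended task, and a selection step matches the task to the most suitable robot. Every function and field name below is hypothetical; GestOS's actual perception stack and robot interfaces are not reproduced.

```python
# Illustrative gesture -> text -> LLM -> robot-command flow. All names and
# fields here are hypothetical placeholders, not the paper's interfaces.

def describe_pose(landmarks: dict) -> str:
    """Convert detected hand landmarks into a structured textual description."""
    return f"index finger extended, pointing {landmarks.get('direction', 'forward')}"

def select_robot(task: str, robots: list[dict]) -> dict | None:
    """Pick an idle robot whose capabilities cover the inferred task."""
    candidates = [r for r in robots if task in r["capabilities"] and r["idle"]]
    return max(candidates, key=lambda r: r["battery"]) if candidates else None

def handle_gesture(landmarks: dict, robots: list[dict], call_llm):
    description = describe_pose(landmarks)
    task = call_llm(f"Hand pose: {description}. What task is intended?")
    robot = select_robot(task, robots)
    if robot is None:
        return None
    return call_llm(f"Write a {robot['type']} command for task: {task}")
```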
--------------------------------------------------------------------------------------------------------
Deceptive Beauty: Evaluating the Impact of Beauty Filters on Deepfake and Morphing Attack Detection
Digital beautification through social media filters has become ubiquitous, raising concerns about facial image reliability and automated analysis effectiveness. This issue is particularly critical for digital manipulation detectors designed to distinguish between genuine and manipulated content, especially deepfakes and morphing attacks intended to deceive both humans and facial recognition systems. The research comprehensively analyzes whether beauty filters impact deepfake and morphing attack detector performance by evaluating multiple state-of-the-art detectors on benchmark datasets before and after applying various smoothing filters. Findings reveal significant performance degradation, highlighting vulnerabilities introduced by facial enhancements. This research underscores the urgent need for robust detection models resilient to such alterations, with implications for security systems, identity verification, media authentication, and social media platform integrity. Understanding these vulnerabilities is crucial for developing next-generation detection systems capable of handling the increasingly sophisticated landscape of digital image manipulation.
Authors: Sara Concas, Simone Maurizio La Cava, Andrea Panzino, Ester Masala, Giulia Orrù, Gian Luca Marcialis
Link: https://arxiv.org/abs/2509.14120v1
Date: 2025-09-d
Summary:
Digital beautification through social media filters has become increasingly popular, raising concerns about the reliability of facial images and videos and the effectiveness of automated face analysis. This issue is particularly critical for digital manipulation detectors, systems aiming at distinguishing between genuine and manipulated data, especially in cases involving deepfakes and morphing attacks designed to deceive humans and automated facial recognition. This study examines whether beauty filters impact the performance of deepfake and morphing attack detectors. We perform a comprehensive analysis, evaluating multiple state-of-the-art detectors on benchmark datasets before and after applying various smoothing filters. Our findings reveal performance degradation, highlighting vulnerabilities introduced by facial enhancements and underscoring the need for robust detection models resilient to such alterations.
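A minimal sketch of the evaluation protocol: score each image with a detector before and after a smoothing filter and compare detection accuracy. Gaussian blur stands in for the commercial beauty filters studied, and `detector` is any callable returning a manipulation probability.

```python
# Minimal evaluation sketch: measure how much detection accuracy drops once a
# smoothing "beauty" filter is applied. Gaussian blur is a stand-in for the
# filters studied; `detector` is any model returning a manipulation probability.

import numpy as np
from scipy.ndimage import gaussian_filter

def accuracy_drop(images, labels, detector, sigma=2.0, threshold=0.5):
    clean_preds, filtered_preds = [], []
    for img in images:                                   # img: H x W x C array
        smoothed = gaussian_filter(img, sigma=(sigma, sigma, 0))
        clean_preds.append(detector(img) >= threshold)
        filtered_preds.append(detector(smoothed) >= threshold)
    labels = np.asarray(labels, dtype=bool)
    acc = lambda preds: np.mean(np.asarray(preds) == labels)
    return acc(clean_preds) - acc(filtered_preds)        # positive = degradation
```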
--------------------------------------------------------------------------------------------------------