Week Ending 3.24.2024

 

RESEARCH WATCH: 3.24.2024

 

LATTE3D: Large-scale Amortized Text-To-Enhanced3D Synthesis

Fast text-to-3D generation has vast applications, from content creation to product design. However, existing methods struggle with detail and scalability. LATTE3D overcomes these limitations through a scalable architecture and 3D-aware priors, enabling highly detailed textured 3D mesh generation in just 400ms per prompt.

Authors:  Kevin Xie, Jonathan Lorraine, Tianshi Cao, Jun Gao, James Lucas, Antonio Torralba, Sanja Fidler, Xiaohui Zeng

Link:  https://arxiv.org/abs/2403.15385v1

Date: 2024-03-22

Summary:

Recent text-to-3D generation approaches produce impressive 3D results but require time-consuming optimization that can take up to an hour per prompt. Amortized methods like ATT3D optimize multiple prompts simultaneously to improve efficiency, enabling fast text-to-3D synthesis. However, they cannot capture high-frequency geometry and texture details and struggle to scale to large prompt sets, so they generalize poorly. We introduce LATTE3D, addressing these limitations to achieve fast, high-quality generation on a significantly larger prompt set. Key to our method is 1) building a scalable architecture and 2) leveraging 3D data during optimization through 3D-aware diffusion priors, shape regularization, and model initialization to achieve robustness to diverse and complex training prompts. LATTE3D amortizes both neural field and textured surface generation to produce highly detailed textured meshes in a single forward pass. LATTE3D generates 3D objects in 400ms, and can be further enhanced with fast test-time optimization.

--------------------------------------------------------------------------------------------------------

Dialogue Understandability: Why are we streaming movies with subtitles?

Subtitles are about more than just accessibility - understanding movie dialogue impacts the entire viewing experience. This paper formalizes the influential factors as "Dialogue Understandability" and maps them to quality of experience frameworks, paving the way for enhancement tools.

Authors:  Helard Becerra, Alessandro Ragano, Diptasree Debnath, Asad Ullah, Crisron Rudolf Lucas, Martin Walsh, Andrew Hines

Link:  https://arxiv.org/abs/2403.15336v1

Date: 2024-03-22

Summary:

Watching movies and TV shows with subtitles enabled is not simply down to audibility or speech intelligibility. A variety of evolving factors related to technological advances, cinema production and social behaviour challenge our perception and understanding. This study seeks to formalise and give context to these influential factors under a wider and novel term, Dialogue Understandability. We propose a working definition of Dialogue Understandability: a listener's capacity to follow the story without undue cognitive effort or concentration that impacts their Quality of Experience (QoE). The paper identifies, describes and categorises the factors that influence Dialogue Understandability, mapping them onto the QoE framework, the media streaming lifecycle, and the stakeholders involved. We then explore the measurement tools available in the literature and link them to the factors they could potentially be used for. The maturity and suitability of these tools are evaluated over a set of pilot experiments. Finally, we reflect on the gaps that still need to be filled, what we can and cannot measure, future subjective experiments, and new research trends that could help us fully characterise Dialogue Understandability.

--------------------------------------------------------------------------------------------------------

Reasoning-Enhanced Object-Centric Learning for Videos

Object detection and tracking are core computer vision tasks. This work proposes a novel reasoning module inspired by human intuitive physics to enhance object-centric video models, improving perception and tracking in complex scenes.

Authors:  Jian Li, Pu Ren, Yang Liu, Hao Sun

Link:  https://arxiv.org/abs/2403.15245v1

Date: 2024-03-22

Summary:

Object-centric learning aims to break down complex visual scenes into more manageable object representations, enhancing the understanding and reasoning abilities of machine learning systems toward the physical world. Recently, slot-based video models have demonstrated remarkable proficiency in segmenting and tracking objects, but they overlook the importance of an effective reasoning module. In the real world, reasoning and predictive abilities play a crucial role in human perception and object tracking; in particular, these abilities are closely related to human intuitive physics. Inspired by this, we designed a novel reasoning module called the Slot-based Time-Space Transformer with Memory buffer (STATM) to enhance the model's perception ability in complex scenes. The memory buffer primarily serves as storage for slot information from upstream modules, while the Slot-based Time-Space Transformer makes predictions through slot-based spatiotemporal attention computations and fusion. Our experimental results on various datasets show that STATM can significantly enhance the object-centric learning capabilities of slot-based video models.
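
For intuition, here is a minimal sketch of the idea described above: a memory buffer caches slot states from upstream modules, and attention is computed over time (against the buffer) and over space (across slots). Module names, dimensions, and the exact fusion scheme are illustrative assumptions, not the authors' implementation.

```python
# Hypothetical sketch of a slot memory buffer feeding spatiotemporal attention.
from collections import deque
import torch
import torch.nn as nn

class SlotMemoryTransformer(nn.Module):
    def __init__(self, dim=64, heads=4, buffer_len=6):
        super().__init__()
        self.buffer = deque(maxlen=buffer_len)                  # stores past slot states
        self.temporal_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.spatial_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.predictor = nn.Linear(dim, dim)

    def forward(self, slots):                                   # slots: (B, K, D)
        self.buffer.append(slots.detach())
        memory = torch.cat(list(self.buffer), dim=1)            # (B, T*K, D) buffered history
        t_out, _ = self.temporal_attn(slots, memory, memory)    # current slots query the past
        s_out, _ = self.spatial_attn(t_out, t_out, t_out)       # slots interact within the frame
        return self.predictor(s_out)                            # predicted next-step slots

model = SlotMemoryTransformer()
pred = model(torch.randn(2, 5, 64))                             # 2 clips, 5 slots, 64-dim
```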

--------------------------------------------------------------------------------------------------------

Solving a Real-World Package Delivery Routing Problem Using Quantum Annealers

Quantum computing offers potential for solving complex real-world routing problems, which remain challenging for classical approaches. This paper presents a quantum-classical hybrid solver considering realistic constraints like vehicle capacities and priority deliveries for efficient package delivery routing.

Authors:  Eneko Osaba, Esther Villar-Rodriguez, Antón Asla

Link:  https://arxiv.org/abs/2403.15114v1

Date: 2024-03-22

Summary:

Research focused on the conjunction between quantum computing and routing problems has been very prolific in recent years. Most of the works revolve around classical problems such as the Traveling Salesman Problem or the Vehicle Routing Problem. Even though working on these problems is valuable, it is also undeniable that their academically oriented nature falls short of real-world requirements. The main objective of this research is to present a solving method for realistic instances, avoiding problem relaxations or technical shortcuts. Instead, a quantum-classical hybrid solver has been developed, coined Q4RPD, that considers a set of real constraints such as a heterogeneous fleet of vehicles, priority deliveries, and capacities characterized by two values: the weight and dimensions of the packages. Q4RPD relies on D-Wave's Leap Constrained Quadratic Model Hybrid Solver. To demonstrate the application of Q4RPD, an experiment composed of six different instances has been conducted, which are intended to serve as illustrative examples.
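
A toy sketch of how such constraints can be expressed with D-Wave's Constrained Quadratic Model tooling, in the spirit of Q4RPD: the package data, the two-vehicle fleet, and the simple objective are made-up placeholders, and routing and priority deliveries are omitted.

```python
# Toy assignment of packages to a heterogeneous fleet with weight/dimension capacities.
import dimod

weights = [4, 7, 3]                  # package weights
volumes = [2, 5, 1]                  # package dimensions (volume proxy)
cap_w, cap_v = [10, 8], [6, 5]       # two heterogeneous vehicles

cqm = dimod.ConstrainedQuadraticModel()
x = {(p, v): dimod.Binary(f"x_{p}_{v}") for p in range(3) for v in range(2)}

for p in range(3):                   # each package goes on exactly one vehicle
    cqm.add_constraint(dimod.quicksum(x[p, v] for v in range(2)) == 1, label=f"assign_{p}")

for v in range(2):                   # respect both weight and dimension capacities
    cqm.add_constraint(dimod.quicksum(weights[p] * x[p, v] for p in range(3)) <= cap_w[v],
                       label=f"weight_{v}")
    cqm.add_constraint(dimod.quicksum(volumes[p] * x[p, v] for p in range(3)) <= cap_v[v],
                       label=f"volume_{v}")

# Toy objective: prefer loading the first vehicle.
cqm.set_objective(dimod.quicksum(weights[p] * x[p, 1] for p in range(3)))

# With Leap access, the hybrid solver would be invoked roughly as:
# from dwave.system import LeapHybridCQMSampler
# sampleset = LeapHybridCQMSampler().sample_cqm(cqm)
```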

--------------------------------------------------------------------------------------------------------

End-to-End Mineral Exploration with Artificial Intelligence and Ambient Noise Tomography

Integrating geophysics and AI promises to accelerate critical mineral discovery. This innovative workflow leverages ambient noise tomography's advantages with AI models fine-tuned on local data to delineate mineral deposits, crucial for the renewable energy transition.

Authors:  Jack Muir, Gerrit Olivier, Anthony Reid

Link:  https://arxiv.org/abs/2403.15095v1

Date: 2024-03-22

Summary:

This paper presents an innovative end-to-end workflow for mineral exploration, integrating ambient noise tomography (ANT) and artificial intelligence (AI) to enhance the discovery and delineation of mineral resources essential for the global transition to a low carbon economy. We focus on copper as a critical element, required in significant quantities for renewable energy solutions. We show the benefits of utilising ANT, characterised by its speed, scalability, depth penetration, resolution, and low environmental impact, alongside artificial intelligence (AI) techniques to refine a continent-scale prospectivity model at the deposit scale by fine-tuning our model on local high-resolution data. We show the promise of the method by first presenting a new data-driven AI prospectivity model for copper within Australia, which serves as our foundation model for further fine-tuning. We then focus on the Hillside IOCG deposit on the prospective Yorke Peninsula. We show that with relatively few local training samples (orebody intercepts), we can fine tune the foundation model to provide a good estimate of the Hillside orebody outline. Our methodology demonstrates how AI can augment geophysical data interpretation, providing a novel approach to mineral exploration with improved decision-making capabilities for targeting mineralization, thereby addressing the urgent need for increased mineral resource discovery.

--------------------------------------------------------------------------------------------------------

RAmBLA: A Framework for Evaluating the Reliability of LLMs as Assistants in the Biomedical Domain

As large language models support high-stakes biomedical applications, evaluating their reliability becomes paramount. RAmBLA provides a systematic framework to assess LLMs' robustness, recall, and lack of hallucinations for safe biomedical use cases.

Authors:  William James Bolton, Rafael Poyiadzi, Edward R. Morrell, Gabriela van Bergen Gonzalez Bueno, Lea Goetz

Link:  https://arxiv.org/abs/2403.14578v1

Date: 2024-03-21

Summary:

Large Language Models (LLMs) increasingly support applications in a wide range of domains, some with potential high societal impact such as biomedicine, yet their reliability in realistic use cases is under-researched. In this work we introduce the Reliability AssessMent for Biomedical LLM Assistants (RAmBLA) framework and evaluate whether four state-of-the-art foundation LLMs can serve as reliable assistants in the biomedical domain. We identify prompt robustness, high recall, and a lack of hallucinations as necessary criteria for this use case. We design shortform tasks and tasks requiring freeform LLM responses that mimic real-world user interactions. We evaluate LLM performance using semantic similarity with a ground-truth response, computed by an evaluator LLM.
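
A simplified sketch of the freeform-evaluation step: score an LLM answer by its semantic similarity to a ground-truth response. The paper uses an evaluator LLM for this; here a sentence-embedding model stands in as an assumed lightweight proxy, and the example strings are illustrative.

```python
# Assumed proxy: embedding-based cosine similarity instead of an evaluator LLM.
from sentence_transformers import SentenceTransformer, util

encoder = SentenceTransformer("all-MiniLM-L6-v2")

def similarity_score(llm_answer: str, reference: str) -> float:
    emb = encoder.encode([llm_answer, reference], convert_to_tensor=True)
    return float(util.cos_sim(emb[0], emb[1]))

print(similarity_score("Metformin lowers blood glucose.",
                       "Metformin reduces blood sugar levels."))
```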

--------------------------------------------------------------------------------------------------------

Object-Centric Domain Randomization for 3D Shape Reconstruction in the Wild

Data scarcity hinders real-world 3D shape reconstruction from 2D images. This work proposes synthesizing paired data via conditional generative models while preserving object shape, enhancing reconstruction models' performance.

Authors:  Junhyeong Cho, Kim Youwang, Hunmin Yang, Tae-Hyun Oh

Link:  https://arxiv.org/abs/2403.14539v1

Date: 2024-03-21

Summary:

One of the biggest challenges in single-view 3D shape reconstruction in the wild is the scarcity of <3D shape, 2D image>-paired data from real-world environments. Inspired by remarkable achievements via domain randomization, we propose ObjectDR which synthesizes such paired data via a random simulation of visual variations in object appearances and backgrounds. Our data synthesis framework exploits a conditional generative model (e.g., ControlNet) to generate images conforming to spatial conditions such as 2.5D sketches, which are obtainable through a rendering process of 3D shapes from object collections (e.g., Objaverse-XL). To simulate diverse variations while preserving object silhouettes embedded in spatial conditions, we also introduce a disentangled framework which leverages an initial object guidance. After synthesizing a wide range of data, we pre-train a model on them so that it learns to capture a domain-invariant geometry prior which is consistent across various domains. We validate its effectiveness by substantially improving 3D shape reconstruction models on a real-world benchmark. In a scale-up evaluation, our pre-training achieves 23.6% superior results compared with the pre-training on high-quality computer graphics renderings.
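
To make the data-synthesis step concrete, here is a hedged sketch of conditioning an off-the-shelf ControlNet on a depth-style rendering so the generated image preserves the object's silhouette. The checkpoints, the depth condition, and the prompt are illustrative choices and not necessarily those used by ObjectDR.

```python
# Sketch: spatially conditioned image synthesis with diffusers' ControlNet pipeline.
import torch
from PIL import Image
from diffusers import ControlNetModel, StableDiffusionControlNetPipeline

controlnet = ControlNetModel.from_pretrained(
    "lllyasviel/sd-controlnet-depth", torch_dtype=torch.float16)
pipe = StableDiffusionControlNetPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", controlnet=controlnet, torch_dtype=torch.float16
).to("cuda")

depth_map = Image.open("rendered_25d_sketch.png")   # rendered from a 3D asset (placeholder path)
image = pipe("a photo of a chair in a living room, realistic lighting",
             image=depth_map, num_inference_steps=30).images[0]
image.save("synthesized_view.png")                  # paired with the source 3D shape
```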

--------------------------------------------------------------------------------------------------------

Multi-criteria approach for selecting an explanation from the set of counterfactuals produced by an ensemble of explainers

Explaining AI decisions through counterfactuals involves optimizing conflicting quality measures. This ensemble approach selects a single compromise counterfactual balancing multiple criteria, aiding interpretability.

Authors:  Ignacy Stępka, Mateusz Lango, Jerzy Stefanowski

Link:  https://arxiv.org/abs/2403.13940v1

Date: 2024-03-20

Summary:

Counterfactuals are widely used to explain ML model predictions by providing alternative scenarios for obtaining more desired predictions. They can be generated by a variety of methods that optimize different, sometimes conflicting, quality measures and produce quite different solutions. However, choosing the most appropriate explanation method and one of the generated counterfactuals is not an easy task. Instead of forcing the user to test many different explanation methods and analyse their conflicting solutions, in this paper we propose a multi-stage ensemble approach that selects a single counterfactual based on multiple-criteria analysis. It offers a compromise solution that scores well on several popular quality measures. The approach exploits the dominance relation and the ideal-point decision aid method, which selects one counterfactual from the Pareto front. The conducted experiments demonstrate that the proposed approach generates fully actionable counterfactuals with attractive compromise values of the considered quality measures.
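
The selection step lends itself to a compact illustration: keep the non-dominated counterfactuals (the Pareto front) and return the one closest to the ideal point. The snippet below assumes all quality measures are to be minimized and uses made-up scores; it is a sketch of the idea, not the authors' code.

```python
# Dominance filtering plus ideal-point selection over counterfactual quality scores.
import numpy as np

def pareto_front(scores):
    """scores: (n, m) array, lower is better on every criterion."""
    n = scores.shape[0]
    keep = np.ones(n, dtype=bool)
    for i in range(n):
        for j in range(n):
            if i != j and np.all(scores[j] <= scores[i]) and np.any(scores[j] < scores[i]):
                keep[i] = False          # i is dominated by j
                break
    return np.where(keep)[0]

def ideal_point_choice(scores):
    front = pareto_front(scores)
    ideal = scores[front].min(axis=0)                    # best value per criterion
    dists = np.linalg.norm(scores[front] - ideal, axis=1)
    return front[np.argmin(dists)]                       # index of the compromise counterfactual

# Example: four counterfactuals scored on (proximity, sparsity, implausibility)
scores = np.array([[0.2, 3, 0.5], [0.4, 2, 0.4], [0.1, 5, 0.9], [0.3, 3, 0.45]])
print(ideal_point_choice(scores))
```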

--------------------------------------------------------------------------------------------------------

Towards Principled Representation Learning from Videos for Reinforcement Learning

Despite advances, a theoretical understanding of learning representations from videos for decision-making is lacking. This work provides insights into the sample complexity of approaches like contrastive learning and forward modeling under various noise settings.

Authors:  Dipendra Misra, Akanksha Saran, Tengyang Xie, Alex Lamb, John Langford

Link:  https://arxiv.org/abs/2403.13765v1

Date: 2024-03-20

Summary:

We study pre-training representations for decision-making using video data, which is abundantly available for tasks such as game agents and software testing. Even though significant empirical advances have been made on this problem, a theoretical understanding remains absent. We initiate the theoretical investigation into principled approaches for representation learning and focus on learning the latent state representations of the underlying MDP using video data. We study two types of settings: one where there is iid noise in the observation, and a more challenging setting where there is also the presence of exogenous noise, which is non-iid noise that is temporally correlated, such as the motion of people or cars in the background. We study three commonly used approaches: autoencoding, temporal contrastive learning, and forward modeling. We prove upper bounds for temporal contrastive learning and forward modeling in the presence of only iid noise. We show that these approaches can learn the latent state and use it to do efficient downstream RL with polynomial sample complexity. When exogenous noise is also present, we establish a lower bound result showing that the sample complexity of learning from video data can be exponentially worse than learning from action-labeled trajectory data. This partially explains why reinforcement learning with video pre-training is hard. We evaluate these representation learning methods in two visual domains, yielding results that are consistent with our theoretical findings.
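
As a concrete reference point for one of the three approaches studied, here is a minimal InfoNCE-style temporal contrastive objective: embeddings of temporally adjacent frames from the same video are treated as positives. The encoder, batch construction, and dimensions are assumptions for illustration.

```python
# Minimal temporal contrastive (InfoNCE) loss over paired frame embeddings.
import torch
import torch.nn.functional as F

def temporal_contrastive_loss(z_t, z_tp1, temperature=0.1):
    """z_t, z_tp1: (B, D) embeddings of frames at times t and t+1 from the same videos."""
    z_t = F.normalize(z_t, dim=-1)
    z_tp1 = F.normalize(z_tp1, dim=-1)
    logits = z_t @ z_tp1.T / temperature     # (B, B); the diagonal holds positive pairs
    labels = torch.arange(z_t.size(0))
    return F.cross_entropy(logits, labels)

loss = temporal_contrastive_loss(torch.randn(16, 32), torch.randn(16, 32))
```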

--------------------------------------------------------------------------------------------------------

PuzzleVQA: Diagnosing Multimodal Reasoning Challenges of Language Models with Abstract Visual Patterns

Evaluating multimodal reasoning in language models is challenging. PuzzleVQA introduces abstract visual patterns based on fundamental concepts like colors and shapes to diagnose reasoning bottlenecks and guide model improvements.

Authors:  Yew Ken Chia, Vernon Toh Yan Han, Deepanway Ghosal, Lidong Bing, Soujanya Poria

Link:  https://arxiv.org/abs/2403.13315v1

Date: 2024-03-20

Summary:

Large multimodal models extend the impressive capabilities of large language models by integrating multimodal understanding abilities. However, it is not clear how they can emulate the general intelligence and reasoning ability of humans. As recognizing patterns and abstracting concepts are key to general intelligence, we introduce PuzzleVQA, a collection of puzzles based on abstract patterns. With this dataset, we evaluate large multimodal models with abstract patterns based on fundamental concepts, including colors, numbers, sizes, and shapes. Through our experiments on state-of-the-art large multimodal models, we find that they are not able to generalize well to simple abstract patterns. Notably, even GPT-4V cannot solve more than half of the puzzles. To diagnose the reasoning challenges in large multimodal models, we progressively guide the models with our ground truth reasoning explanations for visual perception, inductive reasoning, and deductive reasoning. Our systematic analysis finds that the main bottlenecks of GPT-4V are weaker visual perception and inductive reasoning abilities. Through this work, we hope to shed light on the limitations of large multimodal models and how they can better emulate human cognitive processes in the future (Our data and code will be released publicly at https://github.com/declare-lab/LLM-PuzzleTest).

--------------------------------------------------------------------------------------------------------

Evolutionary Optimization of Model Merging Recipes

Model merging offers a cost-effective LLM development approach, but manually combining diverse models is limited. This evolutionary algorithm automates the process, discovering powerful cross-domain compositions like a Japanese math LLM.

Authors:  Takuya Akiba, Makoto Shing, Yujin Tang, Qi Sun, David Ha

Link:  https://arxiv.org/abs/2403.13187v1

Date: 2024-03-19

Summary:

We present a novel application of evolutionary algorithms to automate the creation of powerful foundation models. While model merging has emerged as a promising approach for LLM development due to its cost-effectiveness, it currently relies on human intuition and domain knowledge, limiting its potential. Here, we propose an evolutionary approach that overcomes this limitation by automatically discovering effective combinations of diverse open-source models, harnessing their collective intelligence without requiring extensive additional training data or compute. Our approach operates in both parameter space and data flow space, allowing for optimization beyond just the weights of the individual models. This approach even facilitates cross-domain merging, generating models like a Japanese LLM with Math reasoning capabilities. Surprisingly, our Japanese Math LLM achieved state-of-the-art performance on a variety of established Japanese LLM benchmarks, even surpassing models with significantly more parameters, despite not being explicitly trained for such tasks. Furthermore, a culturally-aware Japanese VLM generated through our approach demonstrates its effectiveness in describing Japanese culture-specific content, outperforming previous Japanese VLMs. This work not only contributes new state-of-the-art models back to the open-source community, but also introduces a new paradigm for automated model composition, paving the way for exploring alternative, efficient approaches to foundation model development.
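
A stripped-down sketch of parameter-space merging driven by a simple evolutionary loop: candidates are interpolation weights between two compatible checkpoints, scored by a task-specific `evaluate` function (an assumption here). The actual method also searches the data-flow space and uses a far richer recipe; the names and toy objective below are illustrative.

```python
# Evolving a single interpolation coefficient between two state_dicts.
import random
import torch

def merge(state_a, state_b, alpha):
    return {k: alpha * state_a[k] + (1 - alpha) * state_b[k] for k in state_a}

def evolve_merge(state_a, state_b, evaluate, pop=8, generations=10, sigma=0.1):
    population = [random.random() for _ in range(pop)]           # candidate alphas
    for _ in range(generations):
        ranked = sorted(population, reverse=True,
                        key=lambda a: evaluate(merge(state_a, state_b, a)))
        parents = ranked[: pop // 2]                             # keep the best half
        children = [min(1.0, max(0.0, a + random.gauss(0, sigma))) for a in parents]
        population = parents + children                          # next generation
    return max(population, key=lambda a: evaluate(merge(state_a, state_b, a)))

# Toy usage: two small layers and a dummy score to maximize.
a = torch.nn.Linear(4, 4).state_dict()
b = torch.nn.Linear(4, 4).state_dict()
best_alpha = evolve_merge(a, b, evaluate=lambda sd: -float(sd["weight"].abs().mean()))
```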

--------------------------------------------------------------------------------------------------------

What Does Evaluation of Explainable Artificial Intelligence Actually Tell Us? A Case for Compositional and Contextual Validation of XAI Building Blocks

While user studies evaluate explanation delivery, evaluating the underlying technical explanations is equally crucial. This modular framework systematically validates components such as explanation generation algorithms in view of their anticipated use cases.

Authors:  Kacper Sokol, Julia E. Vogt

Link:  https://arxiv.org/abs/2403.12730v1

Date: 2024-03-19

Summary:

Despite significant progress, evaluation of explainable artificial intelligence remains elusive and challenging. In this paper we propose a fine-grained validation framework that is not overly reliant on any one facet of these sociotechnical systems, and that recognises their inherent modular structure: technical building blocks, user-facing explanatory artefacts and social communication protocols. While we concur that user studies are invaluable in assessing the quality and effectiveness of explanation presentation and delivery strategies from the explainees' perspective in a particular deployment context, the underlying explanation generation mechanisms require a separate, predominantly algorithmic validation strategy that accounts for the technical and human-centred desiderata of their (numerical) outputs. Such a comprehensive sociotechnical utility-based evaluation framework would allow us to systematically reason about the properties and downstream influence of different building blocks from which explainable artificial intelligence systems are composed, accounting for a diverse range of their engineering and social aspects, in view of the anticipated use case.

--------------------------------------------------------------------------------------------------------

A New Intelligent Reflecting Surface-Aided Electromagnetic Stealth Strategy

Traditional stealth coatings face limitations like angle and frequency dependence. This work proposes using reconfigurable intelligent reflecting surfaces synergistically with coatings for enhanced radar stealth.

Authors:  Xue Xiong, Beixiong Zheng, A. Lee Swindlehurst, Jie Tang, Wen Wu

Link:  https://arxiv.org/abs/2403.12352v1

Date: 2024-03-19

Summary:

Electromagnetic wave absorbing material (EWAM) plays an essential role in manufacturing stealth aircraft, which can achieve electromagnetic stealth (ES) by reducing the strength of the signal reflected back to the radar system. However, the stealth performance is limited by the coating thickness, incident wave angles, and working frequencies. To tackle these limitations, we propose a new intelligent reflecting surface (IRS)-aided ES system where an IRS is deployed at the target to synergize with EWAM, effectively mitigating the echo signal and thus reducing the radar detection probability. Considering the monotonic relationship between the detection probability and the received signal-to-noise ratio (SNR) at the radar, we formulate an optimization problem that minimizes the SNR under the reflection constraint of each IRS element, and a semi-closed-form solution is derived by using Karush-Kuhn-Tucker (KKT) conditions. Simulation results validate the superiority of the proposed IRS-aided ES system compared to various benchmarks.

--------------------------------------------------------------------------------------------------------

Vid2Robot: End-to-end Video-conditioned Policy Learning with Cross-Attention Transformers

Robots typically rely on text instructions; this work instead explores learning directly from video demonstrations, decoding human intent into executable robot actions via cross-attention between the prompt video and the robot's current observations.

Authors:  Vidhi Jain, Maria Attarian, Nikhil J Joshi, Ayzaan Wahid, Danny Driess, Quan Vuong, Pannag R Sanketi, Pierre Sermanet, Stefan Welker, Christine Chan, Igor Gilitschenski, Yonatan Bisk, Debidatta Dwibedi

Link:  https://arxiv.org/abs/2403.12943v1

Date: 2024-03-19

Summary:

While large-scale robotic systems typically rely on textual instructions for tasks, this work explores a different approach: can robots infer the task directly from observing humans? This shift necessitates the robot's ability to decode human intent and translate it into executable actions within its physical constraints and environment. We introduce Vid2Robot, a novel end-to-end video-based learning framework for robots. Given a video demonstration of a manipulation task and current visual observations, Vid2Robot directly produces robot actions. This is achieved through a unified representation model trained on a large dataset of human video and robot trajectory. The model leverages cross-attention mechanisms to fuse prompt video features to the robot's current state and generate appropriate actions that mimic the observed task. To further improve policy performance, we propose auxiliary contrastive losses that enhance the alignment between human and robot video representations. We evaluate Vid2Robot on real-world robots, demonstrating a 20% improvement in performance compared to other video-conditioned policies when using human demonstration videos. Additionally, our model exhibits emergent capabilities, such as successfully transferring observed motions from one object to another, and long-horizon composition, thus showcasing its potential for real-world applications. Project website: vid2robot.github.io
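
The fusion step can be pictured with a small cross-attention module: tokens from the robot's current observation attend to prompt-video tokens, and the fused representation is decoded into an action. Shapes, the pooling choice, and the action dimensionality below are assumptions, not the Vid2Robot architecture.

```python
# Sketch of cross-attention fusing prompt-video features with the current robot state.
import torch
import torch.nn as nn

class VideoConditionedPolicy(nn.Module):
    def __init__(self, dim=128, heads=4, action_dim=7):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.action_head = nn.Linear(dim, action_dim)

    def forward(self, state_tokens, video_tokens):
        # state_tokens: (B, S, D) current observation; video_tokens: (B, T, D) prompt video
        fused, _ = self.cross_attn(query=state_tokens, key=video_tokens, value=video_tokens)
        return self.action_head(fused.mean(dim=1))               # (B, action_dim)

policy = VideoConditionedPolicy()
action = policy(torch.randn(1, 8, 128), torch.randn(1, 32, 128))
```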

--------------------------------------------------------------------------------------------------------

AnyV2V: A Plug-and-Play Framework For Any Video-to-Video Editing Tasks

Most prior video editing techniques were task-specific. AnyV2V offers a plug-and-play framework supporting diverse video editing tasks by integrating off-the-shelf image editors and video generators, enhancing versatility.

Authors:  Max Ku, Cong Wei, Weiming Ren, Harry Yang, Wenhu Chen

Link:  https://arxiv.org/abs/2403.14468v2

Date: 2024-03-22

Summary:

Video-to-video editing involves editing a source video along with additional control (such as text prompts, subjects, or styles) to generate a new video that aligns with the source video and the provided control. Traditional methods have been constrained to certain editing types, limiting their ability to meet the wide range of user demands. In this paper, we introduce AnyV2V, a novel training-free framework designed to simplify video editing into two primary steps: (1) employing an off-the-shelf image editing model (e.g., InstructPix2Pix, InstantID) to modify the first frame, and (2) utilizing an existing image-to-video generation model (e.g., I2VGen-XL) for DDIM inversion and feature injection. In the first stage, AnyV2V can plug in any existing image editing tool to support an extensive array of video editing tasks. Beyond traditional prompt-based editing methods, AnyV2V can also support novel video editing tasks, including reference-based style transfer, subject-driven editing, and identity manipulation, which were unattainable by previous methods. In the second stage, AnyV2V can plug in any existing image-to-video model to perform DDIM inversion and intermediate feature injection to maintain the appearance and motion consistency with the source video. On prompt-based editing, we show that AnyV2V outperforms the previous best approach by 35% on prompt alignment and 25% on human preference. On the three novel tasks, we show that AnyV2V also achieves a high success rate. We believe AnyV2V will continue to thrive due to its ability to seamlessly integrate fast-evolving image editing methods. Such compatibility can help AnyV2V increase its versatility to cater to diverse user demands.

--------------------------------------------------------------------------------------------------------

Subjective-Aligned Dataset and Metric for Text-to-Video Quality Assessment

Despite progress in text-to-video generation, quantifying output quality remains challenging. T2VQA-DB provides a large-scale dataset with subjective scores to develop and benchmark perceptual quality assessment metrics.

Authors:  Tengchuan Kou, Xiaohong Liu, Zicheng Zhang, Chunyi Li, Haoning Wu, Xiongkuo Min, Guangtao Zhai, Ning Liu

Link:  https://arxiv.org/abs/2403.11956v2

Date: 2024-03-19

Summary:

With the rapid development of generative models, Artificial Intelligence-Generated Content (AIGC) has increased exponentially in daily life. Among these, Text-to-Video (T2V) generation has received widespread attention. Though many T2V models have been released for generating videos of high perceptual quality, there is still a lack of methods to evaluate the quality of these videos quantitatively. To solve this issue, we establish the largest-scale Text-to-Video Quality Assessment DataBase (T2VQA-DB) to date. The dataset is composed of 10,000 videos generated by 9 different T2V models. We also conduct a subjective study to obtain each video's corresponding mean opinion score. Based on T2VQA-DB, we propose a novel transformer-based model for subjective-aligned Text-to-Video Quality Assessment (T2VQA). The model extracts features from text-video alignment and video fidelity perspectives, and then leverages the ability of a large language model to give the prediction score. Experimental results show that T2VQA outperforms existing T2V metrics and SOTA video quality assessment models. Quantitative analysis indicates that T2VQA is capable of giving subjectively aligned predictions, validating its effectiveness. The dataset and code will be released at https://github.com/QMME/T2VQA.
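
One ingredient the metric builds on, text-video alignment, can be approximated crudely by scoring sampled frames against the prompt with CLIP. This is only a simplified proxy for illustration; the checkpoint and frame handling are assumptions, and the full T2VQA model also models video fidelity and uses an LLM head.

```python
# Crude text-video alignment proxy: average per-frame CLIP similarity to the prompt.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def text_video_alignment(prompt, frame_paths):
    frames = [Image.open(p) for p in frame_paths]
    inputs = processor(text=[prompt], images=frames, return_tensors="pt", padding=True)
    with torch.no_grad():
        out = model(**inputs)
    return out.logits_per_image.mean().item()    # (num_frames, 1) similarities, averaged

score = text_video_alignment("a corgi running on the beach",
                             ["frame_000.png", "frame_008.png", "frame_016.png"])
```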

--------------------------------------------------------------------------------------------------------

Cartoon Hallucinations Detection: Pose-aware In Context Visual Learning

Visual hallucinations limit text-to-image models' applicability, especially for non-photorealistic styles. This work proposes using pose guidance with vision-language models for detecting critical defects in generated cartoon character images.

Authors:  Bumsoo Kim, Wonseop Shin, Kyuchul Lee, Sanghyun Seo

Link:  https://arxiv.org/abs/2403.15048v1

Date: 2024-03-22

Summary:

Large-scale Text-to-Image (TTI) models have become a common approach for generating training data in various generative fields. However, visual hallucinations, which contain perceptually critical defects, remain a concern, especially in non-photorealistic styles like cartoon characters. We propose a novel visual hallucination detection system for cartoon character images generated by TTI models. Our approach leverages pose-aware in-context visual learning (PA-ICVL) with Vision-Language Models (VLMs), utilizing both RGB images and pose information. By incorporating pose guidance from a fine-tuned pose estimator, we enable VLMs to make more accurate decisions. Experimental results demonstrate significant improvements in identifying visual hallucinations compared to baseline methods relying solely on RGB images. This research advances TTI models by mitigating visual hallucinations, expanding their potential in non-photorealistic domains.

--------------------------------------------------------------------------------------------------------

Scene-LLM: Extending Language Model for 3D Visual Understanding and Reasoning

Integrating language models' reasoning into 3D environments can enhance embodied agents' capabilities. Scene-LLM projects hybrid 3D scene representations into the language space for language grounding and interactive planning.

Authors:  Rao Fu, Jingyu Liu, Xilun Chen, Yixin Nie, Wenhan Xiong

Link:  https://arxiv.org/abs/2403.11401v1

Date: 2024-03-18

Summary:

This paper introduces Scene-LLM, a 3D-visual-language model that enhances embodied agents' abilities in interactive 3D indoor environments by integrating the reasoning strengths of Large Language Models (LLMs). Scene-LLM adopts a hybrid 3D visual feature representation that incorporates dense spatial information and supports scene state updates. The model employs a projection layer to efficiently project these features into the pre-trained textual embedding space, enabling effective interpretation of 3D visual information. Unique to our approach is the integration of both scene-level and ego-centric 3D information. This combination is pivotal for interactive planning, where scene-level data supports global planning and ego-centric data is important for localization. Notably, we use ego-centric 3D frame features for feature alignment, an efficient technique that enhances the model's ability to align features of small objects within the scene. Our experiments with Scene-LLM demonstrate its strong capabilities in dense captioning, question answering, and interactive planning. We believe Scene-LLM advances the field of 3D visual understanding and reasoning, offering new possibilities for sophisticated agent interactions in indoor settings.
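
The projection idea can be shown in a few lines: a small MLP maps 3D visual features into the LLM's token-embedding space so they can be consumed alongside text tokens. The dimensions and two-layer design are assumptions, not the Scene-LLM implementation.

```python
# Sketch: projecting hybrid 3D scene features into a language model's embedding space.
import torch
import torch.nn as nn

class SceneProjector(nn.Module):
    def __init__(self, visual_dim=256, llm_dim=4096):
        super().__init__()
        self.proj = nn.Sequential(nn.Linear(visual_dim, llm_dim), nn.GELU(),
                                  nn.Linear(llm_dim, llm_dim))

    def forward(self, scene_feats):          # (B, N, visual_dim) voxel/point features
        return self.proj(scene_feats)        # (B, N, llm_dim), prepended to text embeddings

scene_tokens = SceneProjector()(torch.randn(1, 512, 256))
```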

--------------------------------------------------------------------------------------------------------

A task of anomaly detection for a smart satellite Internet of things system

Detecting anomalies in sensor data is crucial for industrial safety but challenged by complex correlations, real-time requirements, and data scarcity. This unsupervised approach leverages GANs and self-attention for high-performance anomaly monitoring.

Authors:  Zilong Shao

Link:  https://arxiv.org/abs/2403.14738v1

Date: 2024-03-21

Summary:

When equipment is operating, real-time collection of environmental sensor data for anomaly detection is one of the key links in preventing industrial process accidents and network attacks and in ensuring system security. However, in environments with strict real-time requirements, anomaly detection for environmental sensors still faces the following difficulties: (1) the complex nonlinear correlations between environmental sensor variables lack effective expression methods, and the distribution of the data is difficult to capture; (2) complex machine learning models make it hard to meet real-time monitoring requirements, and the equipment cost is too high; (3) too little sample data leads to scarce labeled data for supervised learning. This paper proposes an unsupervised deep learning anomaly detection system. Based on a generative adversarial network and a self-attention mechanism, and considering the different feature information contained in local subsequences, it automatically learns the complex linear and nonlinear dependencies between environmental sensor variables and uses an anomaly score that combines reconstruction error and discrimination error. It can monitor anomalous points in real sensor data with high real-time performance and can run on a smart satellite Internet of Things system, making it suitable for real working environments. The proposed anomaly detection outperforms baseline methods in most cases and has good interpretability, and it can be used to prevent industrial accidents and cyber-attacks in environmental sensor monitoring.
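
The scoring rule described above, combining reconstruction error with discrimination error, can be sketched as follows. The tiny generator and discriminator are placeholders rather than the paper's GAN with self-attention, and the weighting is an assumed hyperparameter.

```python
# Combined anomaly score: lambda * reconstruction error + (1 - lambda) * discrimination error.
import torch
import torch.nn as nn

class Generator(nn.Module):                  # autoencoder-style reconstructor (placeholder)
    def __init__(self, dim=8):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim, 4), nn.ReLU(), nn.Linear(4, dim))
    def forward(self, x):
        return self.net(x)

class Discriminator(nn.Module):              # outputs probability that a window is "real"
    def __init__(self, dim=8):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim, 4), nn.ReLU(), nn.Linear(4, 1), nn.Sigmoid())
    def forward(self, x):
        return self.net(x)

def anomaly_score(x, G, D, lam=0.5):
    recon = torch.mean((x - G(x)) ** 2, dim=-1)      # reconstruction error per window
    disc = 1.0 - D(x).squeeze(-1)                    # discrimination error per window
    return lam * recon + (1 - lam) * disc

G, D = Generator(), Discriminator()
print(anomaly_score(torch.randn(3, 8), G, D))        # scores for 3 sensor windows
```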

--------------------------------------------------------------------------------------------------------

ERASE: Benchmarking Feature Selection Methods for Deep Recommender Systems

As recommender systems rely on more features, effective selection methods become critical yet lack systematic evaluation. ERASE provides a comprehensive benchmark across datasets and selection techniques to guide deployment.

Authors:  Pengyue Jia, Yejing Wang, Zhaocheng Du, Xiangyu Zhao, Yichao Wang, Bo Chen, Wanyu Wang, Huifeng Guo, Ruiming Tang

Link:  https://arxiv.org/abs/2403.12660v2

Date: 2024-03-20

Summary:

Deep Recommender Systems (DRS) are increasingly dependent on a large number of feature fields for more precise recommendations. Effective feature selection methods are consequently becoming critical for further enhancing accuracy and optimizing storage efficiency to align with deployment demands. This research area, particularly in the context of DRS, is nascent and faces three core challenges. Firstly, varying experimental setups across research papers often yield unfair comparisons, obscuring practical insights. Secondly, the existing literature's lack of detailed analysis of selection attributes, based on large-scale datasets and a thorough comparison among selection techniques and DRS backbones, restricts the generalizability of findings and impedes deployment on DRS. Lastly, research often focuses on comparing the peak performance achievable by feature selection methods, an approach that is typically computationally infeasible for identifying the optimal hyperparameters and that overlooks evaluating the robustness and stability of these methods. To bridge these gaps, this paper presents ERASE, a comprehensive bEnchmaRk for feAture SElection for DRS. ERASE comprises a thorough evaluation of eleven feature selection methods, covering both traditional and deep learning approaches, across four public datasets, private industrial datasets, and a real-world commercial platform, achieving significant enhancements. Our code is available online for ease of reproduction.

--------------------------------------------------------------------------------------------------------


EYE ON A.I. GETS READERS UP TO DATE ON THE LATEST FUNDING NEWS AND RELATED ISSUES. SUBSCRIBE FOR THE WEEKLY NEWSLETTER.