Eye On AI

Week Ending 4.7.2024

RESEARCH WATCH: 4.7.2024

Multi-modal perception for soft robotic interactions using generative models

The paper introduces a perception model that fuses data from multiple sensory modalities like touch and vision to build a comprehensive state representation for soft robots. This enables more robust control of highly deformable bodies in unstructured environments, with applications in fields like robotics and automation.

Authors:  Enrico Donato, Egidio Falotico, Thomas George Thuruthel

Link:  https://arxiv.org/abs/2404.04220v1

Date: 2024-04-05

Summary:

Perception is essential for the active interaction of physical agents with the external environment. The integration of multiple sensory modalities, such as touch and vision, enhances this perceptual process, creating a more comprehensive and robust understanding of the world. Such fusion is particularly useful for highly deformable bodies such as soft robots. Developing a compact, yet comprehensive state representation from multi-sensory inputs can pave the way for the development of complex control strategies. This paper introduces a perception model that harmonizes data from diverse modalities to build a holistic state representation and assimilate essential information. The model relies on the causality between sensory input and robotic actions, employing a generative model to efficiently compress fused information and predict the next observation. We present, for the first time, a study of how touch can be predicted from vision and proprioception in soft robots, demonstrate the importance of cross-modal generation, and explain why it is essential for soft robotic interaction in unstructured environments.
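
A minimal sketch of the core idea in PyTorch (all module names, dimensions, and the action-conditioning scheme below are illustrative assumptions, not the authors' architecture): encode vision and proprioception, condition on the commanded action, and train a decoder to predict the next touch reading.

```python
import torch
import torch.nn as nn

class CrossModalPredictor(nn.Module):
    """Toy model: predict touch at t+1 from vision + proprioception at t."""
    def __init__(self, vision_dim=128, proprio_dim=16, touch_dim=32,
                 latent_dim=64, action_dim=4):
        super().__init__()
        self.vision_enc = nn.Sequential(nn.Linear(vision_dim, latent_dim), nn.ReLU())
        self.proprio_enc = nn.Sequential(nn.Linear(proprio_dim, latent_dim), nn.ReLU())
        # Conditioning on the action reflects the causal link between
        # robot commands and the next sensory observation.
        self.fuse = nn.Linear(2 * latent_dim + action_dim, latent_dim)
        self.touch_dec = nn.Linear(latent_dim, touch_dim)

    def forward(self, vision, proprio, action):
        z = torch.cat([self.vision_enc(vision), self.proprio_enc(proprio), action], dim=-1)
        return self.touch_dec(torch.relu(self.fuse(z)))

model = CrossModalPredictor()
pred_touch = model(torch.randn(8, 128), torch.randn(8, 16), torch.randn(8, 4))
loss = nn.functional.mse_loss(pred_touch, torch.randn(8, 32))  # train vs. real touch
```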

--------------------------------------------------------------------------------------------------------

Demonstration Guided Multi-Objective Reinforcement Learning

The paper presents a demonstration-guided multi-objective reinforcement learning approach to address the challenges of training policies from scratch in multi-objective settings. This has potential applications in domains with conflicting objectives, such as robotics, resource management, and recommendation systems.

Authors:  Junlin Lu, Patrick Mannion, Karl Mason

Link:  https://arxiv.org/abs/2404.03997v1

Date: 2024-04-05

Summary:

Multi-objective reinforcement learning (MORL) is increasingly relevant due to its resemblance to real-world scenarios requiring trade-offs between multiple objectives. Because it must cater to diverse user preferences, MORL amplifies the challenges that traditional reinforcement learning already faces. To address the difficulty of training policies from scratch in MORL, we introduce demonstration-guided multi-objective reinforcement learning (DG-MORL). This novel approach utilizes prior demonstrations, aligns them with user preferences via corner weight support, and incorporates a self-evolving mechanism to refine suboptimal demonstrations. Our empirical studies demonstrate DG-MORL's superiority over existing MORL algorithms, establishing its robustness and efficacy, particularly under challenging conditions. We also provide an upper bound on the algorithm's sample complexity.
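
To illustrate the preference-alignment step in the simplest terms, here is a toy sketch assuming linear scalarization of multi-objective returns (the paper's corner-weight machinery is more sophisticated, and the demonstration values below are made up):

```python
import numpy as np

# Each demonstration has a vector-valued return, one entry per objective.
demo_returns = np.array([
    [10.0, 2.0],   # demo 0: strong on objective 1
    [4.0, 9.0],    # demo 1: strong on objective 2
    [6.0, 6.0],    # demo 2: balanced
])
user_pref = np.array([0.3, 0.7])      # preference weights over objectives

scores = demo_returns @ user_pref      # scalarized value of each demo
best = int(np.argmax(scores))          # demo best aligned with this user
print(f"guide the policy with demo {best}, score {scores[best]:.2f}")
```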

--------------------------------------------------------------------------------------------------------

MultiParaDetox: Extending Text Detoxification with Parallel Data to New Languages

The paper describes a method to extend parallel text detoxification corpora to multiple languages, enabling the development of state-of-the-art text detoxification models for safe communication in digital platforms across languages.

Authors:  Daryna Dementieva, Nikolay Babakov, Alexander Panchenko

Link:  https://arxiv.org/abs/2404.02037v1

Date: 2024-04-02

Summary:

Text detoxification is a textual style transfer (TST) task in which a text is paraphrased from a toxic surface form, e.g. featuring rude words, to the neutral register. Recently, text detoxification methods have found applications in various tasks, such as the detoxification of Large Language Models (LLMs) (Leong et al., 2023; He et al., 2024; Tang et al., 2023) and combating toxic speech in social networks (Deng et al., 2023; Mun et al., 2023; Agarwal et al., 2023). All these applications are extremely important for ensuring safe communication in the modern digital world. However, previous approaches to collecting parallel text detoxification corpora -- ParaDetox (Logacheva et al., 2022) and APPADIA (Atwell et al., 2022) -- were explored only in a monolingual setup. In this work, we extend the ParaDetox pipeline to multiple languages, presenting MultiParaDetox to automate parallel detoxification corpus collection for potentially any language. We then experiment with a range of text detoxification models -- from unsupervised baselines to LLMs and models fine-tuned on the presented parallel corpora -- showing the clear benefit of a parallel corpus for obtaining state-of-the-art text detoxification models for any language.
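
For readers who want to see the shape of the downstream task, here is a hedged inference sketch using Hugging Face transformers; the checkpoint name is a placeholder for whatever seq2seq detoxification model one fine-tunes on such a parallel corpus, not a real model ID:

```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

# Placeholder checkpoint: substitute a detox model fine-tuned on a
# ParaDetox-style parallel corpus for your language of interest.
MODEL = "your-org/detox-seq2seq"  # hypothetical ID, not a published model

tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForSeq2SeqLM.from_pretrained(MODEL)

toxic = "this is a rude, toxic sentence"
inputs = tokenizer(toxic, return_tensors="pt")
out = model.generate(**inputs, max_new_tokens=48, num_beams=4)
print(tokenizer.decode(out[0], skip_special_tokens=True))  # neutral paraphrase
```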

--------------------------------------------------------------------------------------------------------

Settling Time vs. Accuracy Tradeoffs for Clustering Big Data

The paper studies the theoretical and practical runtime limits of k-means and k-median clustering on large datasets, providing a blueprint for effective clustering regardless of data size, with applications in data analysis and machine learning.

Authors:  Andrew Draganov, David Saulpic, Chris Schwiegelshohn

Link:  https://arxiv.org/abs/2404.01936v1

Date: 2024-04-02

Summary:

We study the theoretical and practical runtime limits of k-means and k-median clustering on large datasets. Since effectively all clustering methods are slower than the time it takes to read the dataset, the fastest approach is to quickly compress the data and perform the clustering on the compressed representation. Unfortunately, there is no universal best choice for compressing the number of points - while random sampling runs in sublinear time and coresets provide theoretical guarantees, the former does not enforce accuracy while the latter is too slow as the numbers of points and clusters grow. Indeed, it has been conjectured that any sensitivity-based coreset construction requires super-linear time in the dataset size. We examine this relationship by first showing that there does exist an algorithm that obtains coresets via sensitivity sampling in effectively linear time - within log-factors of the time it takes to read the data. Any approach that significantly improves on this must then resort to practical heuristics, leading us to consider the spectrum of sampling strategies across both real and artificial datasets in the static and streaming settings. Through this, we show the conditions in which coresets are necessary for preserving cluster validity as well as the settings in which faster, cruder sampling strategies are sufficient. As a result, we provide a comprehensive theoretical and practical blueprint for effective clustering regardless of data size. Our code is publicly available and has scripts to recreate the experiments.
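
A simplified sensitivity-sampling sketch (illustrative only, not the paper's near-linear-time construction) shows the basic recipe: estimate each point's cost contribution with a cheap clustering, sample proportionally, and reweight so the compressed set stays unbiased.

```python
import numpy as np
from sklearn.cluster import MiniBatchKMeans

def sensitivity_coreset(X, k, m, seed=0):
    """Simplified sensitivity sampling for k-means."""
    rng = np.random.default_rng(seed)
    # 1. Cheap rough clustering to estimate each point's cost contribution.
    km = MiniBatchKMeans(n_clusters=k, random_state=seed, n_init=3).fit(X)
    cost = km.transform(X).min(axis=1) ** 2   # squared distance to nearest center
    # 2. Sensitivities: cost share plus a uniform term so cheap points keep mass.
    sens = cost / cost.sum() + 1.0 / len(X)
    p = sens / sens.sum()
    # 3. Sample m points with replacement; reweight to keep costs unbiased.
    idx = rng.choice(len(X), size=m, p=p)
    weights = 1.0 / (m * p[idx])
    return X[idx], weights

X = np.random.default_rng(1).normal(size=(10000, 8))
coreset, w = sensitivity_coreset(X, k=10, m=500)   # cluster these 500 points instead
```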

--------------------------------------------------------------------------------------------------------

Weakly-supervised Audio Separation via Bi-modal Semantic Similarity

The paper proposes a bi-modal separation framework that leverages language information to enhance unsupervised and supervised audio separation, benefiting applications in audio processing and source separation.

Authors:  Tanvir Mahmud, Saeed Amizadeh, Kazuhito Koishida, Diana Marculescu

Link:  https://arxiv.org/abs/2404.01740v1

Date: 2024-04-02

Summary:

Conditional sound separation in multi-source audio mixtures without access to single-source sound data during training is a long-standing challenge. Existing mix-and-separate methods suffer from a significant performance drop with multi-source training mixtures due to the lack of a supervision signal for single-source separation during training. However, in the case of language-conditional audio separation, we do have access to corresponding text descriptions for each audio mixture in our training data, which can be seen as (rough) representations of the audio samples in the language modality. To this end, we propose a generic bi-modal separation framework which can enhance existing unsupervised frameworks to separate single-source signals in a target modality (i.e., audio) using the easily separable corresponding signals in the conditioning modality (i.e., language), without access to single-source samples in the target modality during training. We empirically show that this is well within reach if we have access to a pretrained joint embedding model between the two modalities (i.e., CLAP). Furthermore, we incorporate our framework into two fundamental scenarios to enhance separation performance. First, we show that our proposed methodology significantly improves the performance of purely unsupervised baselines by reducing the distribution shift between training and test samples, achieving a 71% boost in Signal-to-Distortion Ratio (SDR) over the baseline and reaching 97.5% of the supervised learning performance. Second, we show that we can further improve the performance of supervised learning itself by 17% by augmenting it with our proposed weakly-supervised framework, which yields a powerful semi-supervised framework for audio separation.
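
The key ingredient is scoring separated audio candidates against the conditioning text with a pretrained joint embedding. A hedged sketch using one public CLAP checkpoint (not necessarily the authors' choice) might look like this:

```python
import torch
from transformers import ClapModel, ClapProcessor

model = ClapModel.from_pretrained("laion/clap-htsat-unfused")
processor = ClapProcessor.from_pretrained("laion/clap-htsat-unfused")

# Stand-ins for two candidate separated sources (1 s of 48 kHz audio each).
sources = [torch.randn(48000).numpy() for _ in range(2)]
text = ["a dog barking"]   # the conditioning description

audio_in = processor(audios=sources, sampling_rate=48000, return_tensors="pt")
text_in = processor(text=text, return_tensors="pt")
with torch.no_grad():
    a = model.get_audio_features(**audio_in)   # (2, dim)
    t = model.get_text_features(**text_in)     # (1, dim)

sim = torch.cosine_similarity(a, t)            # which source matches the text?
print(sim.argmax().item())                     # weak supervision signal
```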

--------------------------------------------------------------------------------------------------------

Stream of Search (SoS): Learning to Search in Language

The paper introduces Stream of Search, a unified language for representing search processes as flattened strings, and demonstrates its ability to improve language model performance on the challenging Countdown game, where input numbers must be combined arithmetically to reach a target, with potential applications in problem-solving and decision-making.

Authors:  Kanishk Gandhi, Denise Lee, Gabriel Grand, Muxin Liu, Winson Cheng, Archit Sharma, Noah D. Goodman

Link:  https://arxiv.org/abs/2404.03683v1

Date: 2024-04-01

Summary:

Language models are rarely shown fruitful mistakes while training. They then struggle to look beyond the next token, suffering from a snowballing of errors and struggling to predict the consequence of their actions several steps ahead. In this paper, we show how language models can be taught to search by representing the process of search in language, as a flattened string -- a stream of search (SoS). We propose a unified language for search that captures an array of different symbolic search strategies. We demonstrate our approach using the simple yet difficult game of Countdown, where the goal is to combine input numbers with arithmetic operations to reach a target number. We pretrain a transformer-based language model from scratch on a dataset of streams of search generated by heuristic solvers. We find that SoS pretraining increases search accuracy by 25% over models trained to predict only the optimal search trajectory. We further finetune this model with two policy improvement methods: Advantage-Induced Policy Alignment (APA) and Self-Taught Reasoner (STaR). The finetuned SoS models solve 36% of previously unsolved problems, including problems that cannot be solved by any of the heuristic solvers. Our results indicate that language models can learn to solve problems via search, self-improve to flexibly use different search strategies, and potentially discover new ones.
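
A toy serializer conveys the core idea: flatten a depth-first Countdown search, including dead ends and backtracking, into one string a language model can be trained on (the token format below is our own invention, not the paper's):

```python
import itertools

def countdown_stream(nums, target, trace=None):
    """Toy 'stream of search': serialize a DFS over Countdown states,
    keeping failed branches and backtracking in the flat trace."""
    if trace is None:
        trace = []
    trace.append(f"state:{sorted(nums)}")
    if target in nums:
        trace.append("goal")
        return True, trace
    for (i, a), (j, b) in itertools.combinations(enumerate(nums), 2):
        for op, val in (("+", a + b), ("*", a * b), ("-", abs(a - b))):
            trace.append(f"try {a}{op}{b}={val}")
            rest = [x for k, x in enumerate(nums) if k not in (i, j)]
            ok, trace = countdown_stream(rest + [val], target, trace)
            if ok:
                return True, trace
            trace.append("backtrack")
    return False, trace

solved, stream = countdown_stream([3, 5, 7], 26)
print(" ".join(stream))  # flattened trace, mistakes included, for LM training
```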

--------------------------------------------------------------------------------------------------------

Machine Learning Robustness: A Primer

The paper provides a comprehensive overview of the concept of robustness in machine learning, its importance for trustworthy AI systems, and various techniques for assessment and improvement, relevant across diverse ML applications.

Authors:  Houssem Ben Braiek, Foutse Khomh

Link:  https://arxiv.org/abs/2404.00897v1

Date: 2024-04-01

Summary:

This chapter explores the foundational concept of robustness in Machine Learning (ML) and its integral role in establishing trustworthiness in Artificial Intelligence (AI) systems. The discussion begins with a detailed definition of robustness, portraying it as the ability of ML models to maintain stable performance across varied and unexpected environmental conditions. ML robustness is dissected through several lenses: its complementarity with generalizability; its status as a requirement for trustworthy AI; its adversarial vs non-adversarial aspects; its quantitative metrics; and its indicators such as reproducibility and explainability. The chapter delves into the factors that impede robustness, such as data bias, model complexity, and the pitfalls of underspecified ML pipelines. It surveys key techniques for robustness assessment from a broad perspective, including adversarial attacks, encompassing both digital and physical realms. It covers non-adversarial data shifts and nuances of Deep Learning (DL) software testing methodologies. The discussion progresses to explore amelioration strategies for bolstering robustness, starting with data-centric approaches like debiasing and augmentation. Further examination includes a variety of model-centric methods such as transfer learning, adversarial training, and randomized smoothing. Lastly, post-training methods are discussed, including ensemble techniques, pruning, and model repairs, emerging as cost-effective strategies to make models more resilient against the unpredictable. This chapter underscores the ongoing challenges and limitations in estimating and achieving ML robustness by existing approaches. It offers insights and directions for future research on this crucial concept, as a prerequisite for trustworthy AI systems.
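
Among the model-centric methods surveyed, randomized smoothing is simple to sketch: classify many Gaussian-noised copies of the input and take a majority vote (simplified here, without the statistical certification step):

```python
import torch

def smoothed_predict(model, x, sigma=0.25, n=100):
    """Randomized smoothing: majority vote over Gaussian perturbations."""
    with torch.no_grad():
        noisy = x.unsqueeze(0) + sigma * torch.randn(n, *x.shape)
        votes = model(noisy).argmax(dim=1)          # class vote per noisy copy
    return torch.mode(votes).values.item()           # most frequent class wins

# Works with any classifier mapping (n, d) inputs to (n, classes) logits:
clf = torch.nn.Linear(10, 3)
print(smoothed_predict(clf, torch.randn(10)))
```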

--------------------------------------------------------------------------------------------------------

Generative AI in the Wild: Prospects, Challenges, and Strategies

The paper investigates user perceptions and usage strategies around generative AI in creative industries, offering insights to guide the design of future generative AI tools.

Authors:  Yuan Sun, Eunchae Jang, Fenglong Ma, Ting Wang

Link:  https://arxiv.org/abs/2404.04101v1

Date: 2024-04-03

Summary:

Propelled by their remarkable capabilities to generate novel and engaging content, Generative Artificial Intelligence (GenAI) technologies are disrupting traditional workflows in many industries. While prior research has examined GenAI from a techno-centric perspective, there is still a lack of understanding about how users perceive and utilize GenAI in real-world scenarios. To bridge this gap, we conducted semi-structured interviews with (N=18) GenAI users in creative industries, investigating the human-GenAI co-creation process within a holistic LUA (Learning, Using and Assessing) framework. Our study uncovered an intriguingly complex landscape. Prospects: GenAI greatly fosters co-creation between human expertise and GenAI capabilities, profoundly transforming creative workflows. Challenges: Meanwhile, users face substantial uncertainties and complexities arising from resource availability, tool usability, and regulatory compliance. Strategies: In response, users actively devise various strategies to overcome many of these challenges. Our study reveals key implications for the design of future GenAI tools.

--------------------------------------------------------------------------------------------------------

Transfer Learning from Whisper for Microscopic Intelligibility Prediction

The paper explores the use of transfer learning from the Whisper speech recognition model to improve microscopic intelligibility prediction, benefiting applications in speech processing and accessibility.

Authors:  Paul Best, Santiago Cuervo, Ricard Marxer

Link:  https://arxiv.org/abs/2404.01737v1

Date: 2024-04-02

Summary:

Macroscopic intelligibility models predict the expected human word-error-rate for a given speech-in-noise stimulus. In contrast, microscopic intelligibility models aim to make fine-grained predictions about listeners' perception, e.g. predicting phonetic or lexical responses. State-of-the-art macroscopic models use transfer learning from large-scale deep learning models for speech processing, whereas such methods have rarely been used for microscopic modeling. In this paper, we study the use of transfer learning from Whisper, a state-of-the-art deep learning model for automatic speech recognition, for microscopic intelligibility prediction at the level of lexical responses. Our method outperforms the considered baselines, even in a zero-shot setup, and yields a relative improvement of up to 66% when fine-tuned to predict listeners' responses. Our results showcase the promise of large-scale deep learning based methods for microscopic intelligibility prediction.
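
A hedged sketch of the transfer-learning setup: freeze Whisper's encoder as a feature extractor and attach a small head that scores candidate lexical responses (the pooling, head, and vocabulary size below are illustrative, not the paper's exact configuration).

```python
import torch
from transformers import WhisperProcessor, WhisperModel

processor = WhisperProcessor.from_pretrained("openai/whisper-base")
whisper = WhisperModel.from_pretrained("openai/whisper-base")

audio = torch.randn(16000).numpy()   # stand-in for 1 s of 16 kHz speech-in-noise
feats = processor(audio, sampling_rate=16000, return_tensors="pt").input_features
with torch.no_grad():
    enc = whisper.encoder(feats).last_hidden_state   # (1, frames, hidden)
pooled = enc.mean(dim=1)                              # simple mean pooling

head = torch.nn.Linear(whisper.config.d_model, 5000)  # candidate-word vocabulary
response_logits = head(pooled)                        # fine-tune on listener data
```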

--------------------------------------------------------------------------------------------------------

Insights from the Use of Previously Unseen Neural Architecture Search Datasets

The paper introduces new datasets and benchmarks for neural architecture search, aiming to encourage research on models that generalize beyond a limited set of datasets, with wide-ranging impacts on AI development.

Authors:  Rob Geada, David Towers, Matthew Forshaw, Amir Atapour-Abarghouei, A. Stephen McGough

Link:  https://arxiv.org/abs/2404.02189v1

Date: 2024-04-02

Summary:

The boundless space of possible neural networks for solving a given problem -- each with different performance -- means that a Deep Learning expert is required to identify the best one. This works against the hope of removing the need for experts. Neural Architecture Search (NAS) offers a solution by automatically identifying the best architecture. However, to date, NAS work has focused on a small set of datasets which we argue are not representative of real-world problems. We introduce eight new datasets created for a series of NAS Challenges: AddNIST, Language, MultNIST, CIFARTile, Gutenberg, Isabella, GeoClassing, and Chesseract. These datasets and challenges are developed to direct attention to issues in NAS development and to encourage authors to consider how their models will perform on datasets unknown to them at development time. We present experimentation using standard Deep Learning methods as well as the best results from challenge participants.

--------------------------------------------------------------------------------------------------------

Active Causal Learning for Decoding Chemical Complexities with Targeted Interventions

The paper presents an active causal learning approach for molecular design, leveraging strategic sampling and interventions to identify causal relationships and optimize desired properties, with applications in chemistry, materials science, and drug discovery.

Authors:  Zachary R. Fox, Ayana Ghosh

Link:  https://arxiv.org/abs/2404.04224v1

Date: 2024-04-05

Summary:

Predicting and enhancing inherent properties based on molecular structures is paramount to design tasks in medicine, materials science, and environmental management. Most current machine learning and deep learning approaches have become standard for predictions, but they face challenges when applied across different datasets due to their reliance on correlations between molecular representation and target properties. These approaches typically depend on large datasets to capture the diversity within the chemical space, facilitating a more accurate approximation, interpolation, or extrapolation of the chemical behavior of molecules. In our research, we introduce an active learning approach that discerns underlying cause-effect relationships through strategic sampling with the use of a graph loss function. This method identifies the smallest subset of the dataset capable of encoding the most information representative of a much larger chemical space. The identified causal relations are then leveraged to conduct systematic interventions, optimizing the design task within a chemical space that the models have not encountered previously. While our implementation focused on the QM9 quantum-chemical dataset for a specific design task (finding molecules with a large dipole moment), our active causal learning approach, driven by intelligent sampling and interventions, holds potential for broader applications in molecular and materials design and discovery.

--------------------------------------------------------------------------------------------------------

A noisy elephant in the room: Is your out-of-distribution detector robust to label noise?

The paper examines the robustness of out-of-distribution detectors to label noise in the underlying classification model, revealing an overlooked limitation that matters for real-world deployments of these systems.

Authors:  Galadrielle Humblot-Renaux, Sergio Escalera, Thomas B. Moeslund

Link:  https://arxiv.org/abs/2404.01775v1

Date: 2024-04-02

Summary:

The ability to detect unfamiliar or unexpected images is essential for safe deployment of computer vision systems. In the context of classification, the task of detecting images outside of a model's training domain is known as out-of-distribution (OOD) detection. While there has been a growing research interest in developing post-hoc OOD detection methods, there has been comparably little discussion around how these methods perform when the underlying classifier is not trained on a clean, carefully curated dataset. In this work, we take a closer look at 20 state-of-the-art OOD detection methods in the (more realistic) scenario where the labels used to train the underlying classifier are unreliable (e.g. crowd-sourced or web-scraped labels). Extensive experiments across different datasets, noise types & levels, architectures and checkpointing strategies provide insights into the effect of class label noise on OOD detection, and show that poor separation between incorrectly classified ID samples vs. OOD samples is an overlooked yet important limitation of existing methods. Code: https://github.com/glhr/ood-labelnoise
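
For context, the simplest post-hoc detector of the kind studied is the maximum softmax probability (MSP) baseline of Hendrycks & Gimpel; a minimal sketch (the threshold value is arbitrary):

```python
import torch
import torch.nn.functional as F

def msp_ood_score(logits):
    """Maximum softmax probability: the classic post-hoc OOD score.
    Lower confidence suggests the input may be out-of-distribution.
    The paper's point: scores like this inherit problems from a
    classifier trained on noisy labels."""
    return F.softmax(logits, dim=-1).max(dim=-1).values

clf = torch.nn.Linear(10, 3)              # stand-in for a trained classifier
scores = msp_ood_score(clf(torch.randn(4, 10)))
is_ood = scores < 0.7                     # threshold tuned on validation data
```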

--------------------------------------------------------------------------------------------------------

Procedural Fairness in Machine Learning

The paper defines and formalizes the concept of procedural fairness in machine learning, proposing methods to identify and improve procedural fairness alongside distributive fairness, with implications for ethical and responsible AI development.

Authors:  Ziming Wang, Changwu Huang, Xin Yao

Link:  https://arxiv.org/abs/2404.01877v1

Date: 2024-04-02

Summary:

Fairness in machine learning (ML) has received much attention. However, existing studies have mainly focused on the distributive fairness of ML models. The other dimension of fairness, i.e., procedural fairness, has been neglected. In this paper, we first define the procedural fairness of ML models, and then give formal definitions of individual and group procedural fairness. We propose a novel metric to evaluate the group procedural fairness of ML models, called GPF_FAE, which utilizes a widely used explainable artificial intelligence technique, namely feature attribution explanation (FAE), to capture the decision process of ML models. We validate the effectiveness of GPF_FAE on a synthetic dataset and eight real-world datasets. Our experiments reveal the relationship between procedural and distributive fairness of ML models. Based on our analysis, we propose a method for identifying the features that lead to procedural unfairness, and two methods to improve procedural fairness once unfair features have been identified. Our experimental results demonstrate that we can accurately identify the features that lead to procedural unfairness, and that both of our proposed methods significantly improve procedural fairness with only a slight impact on model performance, while also improving distributive fairness.
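
One hedged reading of the metric's spirit: compare mean feature-attribution vectors across groups, where a small gap suggests the model applies a similar decision process to both (the paper's GPF_FAE definition differs in detail; the data below is synthetic):

```python
import numpy as np

def group_attribution_gap(attributions, group):
    """Distance between mean feature-attribution vectors of two groups.
    Simplified stand-in for a group procedural fairness measure."""
    mean_a = attributions[group == 0].mean(axis=0)
    mean_b = attributions[group == 1].mean(axis=0)
    return np.linalg.norm(mean_a - mean_b)

attr = np.random.default_rng(0).normal(size=(200, 12))   # e.g. SHAP values
grp = np.random.default_rng(1).integers(0, 2, size=200)  # protected attribute
print(group_attribution_gap(attr, grp))                  # smaller = more similar
```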

--------------------------------------------------------------------------------------------------------

Exploring How Multiple Levels of GPT-Generated Programming Hints Support or Disappoint Novices

The paper investigates how different levels of programming hints generated by large language models can support or hinder novice programmers, informing the design of adaptive educational tools.

Authors:  Ruiwei Xiao, Xinying Hou, John Stamper

Link:  https://arxiv.org/abs/2404.02213v1

Date: 2024-04-02

Summary:

Recent studies have integrated large language models (LLMs) into diverse educational contexts, including providing adaptive programming hints, a type of feedback that focuses on helping students move forward during problem-solving. However, most existing LLM-based hint systems are limited to a single hint type. To investigate whether and how different levels of hints can support students' problem-solving and learning, we conducted a think-aloud study with 12 novices using the LLM Hint Factory, a system providing four levels of hints, from general natural-language guidance to concrete code assistance, varying in format and granularity. We discovered that high-level natural language hints alone can be unhelpful or even misleading, especially when addressing next-step or syntax-related help requests. Adding lower-level hints, like code examples with in-line comments, can better support students. The findings open up future work on customizing help responses across content, format, and granularity levels to accurately identify and meet students' learning needs.

--------------------------------------------------------------------------------------------------------

Defense without Forgetting: Continual Adversarial Defense with Anisotropic & Isotropic Pseudo Replay

The paper introduces a continual adversarial defense approach that can adapt to new attacks while maintaining robustness against previous ones, with applications in securing machine learning systems against evolving threats.

Authors:  Yuhang Zhou, Zhongyun Hua

Link:  https://arxiv.org/abs/2404.01828v1

Date: 2024-04-02

Summary:

Deep neural networks have demonstrated susceptibility to adversarial attacks. Adversarial defense techniques often focus on a one-shot setting to maintain robustness against a single attack. However, new attacks can emerge in sequence in real-world deployment scenarios. As a result, it is crucial for a defense model to constantly adapt to new attacks, but the adaptation process can lead to catastrophic forgetting of attacks defended against previously. In this paper, we discuss for the first time the concept of continual adversarial defense under a sequence of attacks, and propose a lifelong defense baseline called Anisotropic & Isotropic Replay (AIR), which offers three advantages: (1) Isotropic replay ensures model consistency in the neighborhood distribution of new data, indirectly aligning the output preference between old and new tasks. (2) Anisotropic replay enables the model to learn a compromise data manifold with fresh mixed semantics for further replay constraints and potential future attacks. (3) A straightforward regularizer mitigates the 'plasticity-stability' trade-off by aligning model output between new and old tasks. Experimental results demonstrate that AIR can approximate or even exceed the empirical performance upper bounds achieved by Joint Training.
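
The isotropic-replay intuition can be sketched as a consistency loss over a Gaussian neighborhood of each new sample (a simplification of AIR, which also includes anisotropic replay and an old/new-task regularizer):

```python
import torch
import torch.nn.functional as F

def isotropic_consistency_loss(model, x, sigma=0.05):
    """Encourage consistent outputs across an isotropic (Gaussian)
    neighborhood of each new training sample; a hedged sketch of the
    isotropic-replay idea, not AIR's full objective."""
    p_clean = F.log_softmax(model(x), dim=-1)
    p_noisy = F.softmax(model(x + sigma * torch.randn_like(x)), dim=-1)
    return F.kl_div(p_clean, p_noisy, reduction="batchmean")

clf = torch.nn.Linear(32, 10)                       # stand-in classifier
loss = isotropic_consistency_loss(clf, torch.randn(8, 32))
```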

--------------------------------------------------------------------------------------------------------

Benchmarking ChatGPT on Algorithmic Reasoning

The paper benchmarks the algorithmic reasoning capabilities of ChatGPT, a prominent large language model, raising new points in the discussion around learning algorithms with neural networks.

Authors:  Sean McLeish, Avi Schwarzschild, Tom Goldstein

Link:  https://arxiv.org/abs/2404.03441v1

Date: 2024-04-04

Summary:

We evaluate ChatGPT's ability to solve algorithm problems from the CLRS benchmark suite, which was designed for GNNs. The benchmark requires using a specified classical algorithm to solve a given problem. We find that ChatGPT outperforms specialist GNN models, successfully solving these problems with the help of Python. This raises new points in the discussion about learning algorithms with neural networks.

--------------------------------------------------------------------------------------------------------

Affective-NLI: Towards Accurate and Interpretable Personality Recognition in Conversation

The paper proposes Affective-NLI, a personality recognition approach that leverages affective information and semantic understanding, enabling accurate and interpretable results, with applications in personalized services and human-computer interaction.

Authors:  Zhiyuan Wen, Jiannong Cao, Yu Yang, Ruosong Yang, Shuaiqi Liu

Link:  https://arxiv.org/abs/2404.02589v1

Date: 2024-04-03

Summary:

Personality Recognition in Conversation (PRC) aims to identify the personality traits of speakers through textual dialogue content. It is essential for providing personalized services in various applications of Human-Computer Interaction (HCI), such as AI-based mental therapy and companion robots for the elderly. Most recent studies analyze dialog content for personality classification yet overlook two major concerns that hinder their performance. First, crucial implicit factors contained in conversation, such as emotions that reflect the speakers' personalities, are ignored. Second, focusing only on the input dialog content disregards the semantic understanding of personality itself, which reduces the interpretability of the results. In this paper, we propose Affective Natural Language Inference (Affective-NLI) for accurate and interpretable PRC. To utilize affectivity within dialog content for accurate personality recognition, we fine-tuned a pre-trained language model specifically for emotion recognition in conversations, facilitating real-time affective annotations for utterances. For interpretability of recognition results, we formulate personality recognition as an NLI problem by determining whether the textual description of personality labels is entailed by the dialog content. Extensive experiments on two daily conversation datasets suggest that Affective-NLI significantly outperforms state-of-the-art approaches (by 6%-7%). Additionally, our Flow experiment demonstrates that Affective-NLI can accurately recognize the speaker's personality in the early stages of conversations, surpassing state-of-the-art methods by 22%-34%.
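
The NLI formulation is easy to prototype with an off-the-shelf MNLI model (the paper fine-tunes its own models and injects affective annotations; the dialog snippet and label description below are invented for illustration):

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tok = AutoTokenizer.from_pretrained("roberta-large-mnli")
nli = AutoModelForSequenceClassification.from_pretrained("roberta-large-mnli")

# Premise: the dialog content. Hypothesis: a textual personality description.
premise = "A: I reorganized the whole plan again last night. B: You never stop."
hypothesis = "The speaker is a conscientious person."

inputs = tok(premise, hypothesis, return_tensors="pt")
with torch.no_grad():
    probs = nli(**inputs).logits.softmax(dim=-1)[0]
# roberta-large-mnli label order: 0=contradiction, 1=neutral, 2=entailment
print(f"entailment probability: {probs[2]:.3f}")
```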

--------------------------------------------------------------------------------------------------------

Release of Pre-Trained Models for the Japanese Language

The paper announces the release of pre-trained models for the Japanese language, including GPT, CLIP, Stable Diffusion, and HuBERT, advancing the democratization of AI in non-English-speaking communities.

Authors:  Kei Sawada, Tianyu Zhao, Makoto Shing, Kentaro Mitsui, Akio Kaga, Yukiya Hono, Toshiaki Wakatsuki, Koh Mitsuda

Link:  https://arxiv.org/abs/2404.01657v1

Date: 2024-04-02

Summary:

AI democratization aims to create a world in which the average person can utilize AI techniques. To achieve this goal, numerous research institutes have attempted to make their results accessible to the public. In particular, large pre-trained models trained on large-scale data have shown unprecedented potential, and their release has had a significant impact. However, most of the released models specialize in the English language, and thus, AI democratization in non-English-speaking communities is lagging significantly. To reduce this gap in AI access, we released Generative Pre-trained Transformer (GPT), Contrastive Language and Image Pre-training (CLIP), Stable Diffusion, and Hidden-unit Bidirectional Encoder Representations from Transformers (HuBERT) models pre-trained in Japanese. By providing these models, users can freely interface with AI that aligns with Japanese cultural values and preserves Japanese cultural identity, thus enhancing the democratization of AI. Additionally, experiments showed that pre-trained models specialized for Japanese can efficiently achieve high performance on Japanese tasks.
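
Usage is standard transformers fare; a sketch with one of the released GPT checkpoints (model ID as published on the rinna Hugging Face organization; verify the exact ID before use, and note the tokenizer requires sentencepiece):

```python
from transformers import AutoTokenizer, AutoModelForCausalLM

tok = AutoTokenizer.from_pretrained("rinna/japanese-gpt2-medium", use_fast=False)
model = AutoModelForCausalLM.from_pretrained("rinna/japanese-gpt2-medium")

# "AI democratization is ..." as a Japanese prompt for continuation.
inputs = tok("人工知能の民主化とは、", return_tensors="pt")
out = model.generate(**inputs, max_new_tokens=40, do_sample=True, top_p=0.9)
print(tok.decode(out[0], skip_special_tokens=True))
```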

--------------------------------------------------------------------------------------------------------

Entity-Centric Reinforcement Learning for Object Manipulation from Pixels

The paper introduces an entity-centric reinforcement learning approach for object manipulation from pixel observations, enabling agents to learn and generalize to tasks with complex dependencies between multiple objects, benefiting robotic applications.

Authors:  Dan Haramati, Tal Daniel, Aviv Tamar

Link:  https://arxiv.org/abs/2404.01220v1

Date: 2024-04-01

Summary:

Manipulating objects is a hallmark of human intelligence, and an important task in domains such as robotics. In principle, Reinforcement Learning (RL) offers a general approach to learn object manipulation. In practice, however, domains with more than a few objects are difficult for RL agents due to the curse of dimensionality, especially when learning from raw image observations. In this work we propose a structured approach for visual RL that is suitable for representing multiple objects and their interaction, and use it to learn goal-conditioned manipulation of several objects. Key to our method is the ability to handle goals with dependencies between the objects (e.g., moving objects in a certain order). We further relate our architecture to the generalization capability of the trained agent, based on a theoretical result for compositional generalization, and demonstrate agents that learn with 3 objects but generalize to similar tasks with over 10 objects. Videos and code are available on the project website: https://sites.google.com/view/entity-centric-rl
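
The permutation-invariant, per-entity processing that underpins this kind of compositional generalization can be sketched in a Deep Sets style (a toy illustration, not the authors' architecture):

```python
import torch
import torch.nn as nn

class EntityCentricPolicy(nn.Module):
    """Toy permutation-invariant policy over per-object entity vectors.
    The same per-entity encoder is shared across any number of slots,
    which is why such a policy can train on 3 objects and run on 10+."""
    def __init__(self, entity_dim=8, hidden=64, action_dim=4):
        super().__init__()
        self.phi = nn.Sequential(nn.Linear(entity_dim, hidden), nn.ReLU())
        self.rho = nn.Sequential(nn.Linear(hidden, hidden), nn.ReLU(),
                                 nn.Linear(hidden, action_dim))

    def forward(self, entities):                  # (batch, n_objects, entity_dim)
        pooled = self.phi(entities).mean(dim=1)   # invariant to object order
        return self.rho(pooled)

policy = EntityCentricPolicy()
print(policy(torch.randn(2, 3, 8)).shape)    # trained with 3 objects...
print(policy(torch.randn(2, 10, 8)).shape)   # ...runs unchanged with 10
```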

--------------------------------------------------------------------------------------------------------

Integrating AI in NDE: Techniques, Trends, and Further Directions

The paper provides a comprehensive survey of the integration of AI methods into Nondestructive Evaluation (NDE) techniques, such as magnetic, ultrasound, thermography, and optical inspection, driving innovation toward NDE 4.0.

Authors:  Eduardo Pérez, Cemil Emre Ardic, Ozan Çakıroğlu, Kevin Jacob, Sayako Kodera, Luca Pompa, Mohamad Rachid, Han Wang, Yiming Zhou, Cyril Zimmer, Florian Römer, Ahmad Osman

Link:  https://arxiv.org/abs/2404.03449v1

Date: 2024-04-04

Summary:

The digital transformation is fundamentally changing our industries, affecting the planning, execution, and monitoring of production processes in a wide range of application fields. With product line-ups becoming more and more versatile and diverse, the necessary inspection and monitoring sparks significant novel requirements on the corresponding Nondestructive Evaluation (NDE) systems. The establishment of increasingly powerful approaches to incorporate Artificial Intelligence (AI) may provide just the needed innovation to solve some of these challenges. In this paper we provide a comprehensive survey of the usage of AI methods in NDE in light of the recent innovations towards NDE 4.0. Since we cannot discuss each NDE modality in one paper, we limit our attention to magnetic methods, ultrasound, thermography, and optical inspection. In addition to reviewing recent AI developments in each field, we draw common connections by pointing out NDE-related tasks that share an underlying mathematical problem and categorizing the state of the art according to the corresponding sub-tasks. In so doing, interdisciplinary connections are drawn that provide a more complete overall picture.

--------------------------------------------------------------------------------------------------------

