Week Ending 11.12.2023

 

RESEARCH WATCH: 11.12.2023

SPONSORED BY

Digimarc digital watermarks invisibly guard your digital assets to protect against misuse, prove copyright ownership, and verify authenticity. In an era of artificial intelligence, don’t leave your images and other digital content exposed. Demand superior content protection and maintain trust in your brand with Digimarc.

Check out Digimarc - https://www.digimarc.com/

 

Robust Adversarial Attacks Detection for Deep Learning based Relative Pose Estimation for Space Rendezvous

Robust Adversarial Attacks Detection for Deep Learning based Relative Pose Estimation for Space Rendezvous: This paper proposes a new approach to detect adversarial attacks on deep learning models for spacecraft navigation. Detecting attacks is crucial for ensuring reliability and security. The method uses explainability techniques to identify anomalies in model predictions. This has applications in developing robust navigation systems for space missions.

Authors:  Ziwei Wang, Nabil Aouf, Jose Pizarro, Christophe Honvault

Link:  https://arxiv.org/abs/2311.05992v1

Date: 2023-11-10

Summary:

Research on deep learning techniques for autonomous spacecraft relative navigation has grown continuously in recent years. Adopting these techniques offers enhanced performance. However, such approaches also raise heightened concerns regarding the trustworthiness and security of deep learning methods through their susceptibility to adversarial attacks. In this work, we propose a novel approach for adversarial attack detection for deep neural network-based relative pose estimation schemes based on the explainability concept. For an orbital rendezvous scenario, we develop an innovative relative pose estimation technique adopting our proposed Convolutional Neural Network (CNN), which takes an image from the chaser's onboard camera and accurately outputs the target's relative position and rotation. We seamlessly perturb the input images using adversarial attacks generated by the Fast Gradient Sign Method (FGSM). The adversarial attack detector is then built on a Long Short-Term Memory (LSTM) network, which takes the explainability measure, namely SHapley values, from the CNN-based pose estimator and flags the detection of adversarial attacks when they occur. Simulation results show that the proposed adversarial attack detector achieves a detection accuracy of 99.21%. Both the deep relative pose estimator and the adversarial attack detector are then tested on real data captured from our laboratory-designed setup. The experimental results demonstrate that the proposed adversarial attack detector achieves an average detection accuracy of 96.29%.
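
The attack-generation step is standard FGSM; as a rough illustration of the idea (not the authors' implementation), a minimal PyTorch sketch of perturbing a chaser-camera image against a generic pose-regression network might look like this, where pose_net, the 7-D pose target, and the epsilon budget are illustrative assumptions:

import torch

def fgsm_perturb(pose_net, image, target_pose, epsilon=0.01):
    """Generate an FGSM adversarial example for a pose-regression CNN (sketch).

    pose_net:    network mapping an image to a 7-D pose (position + quaternion), assumed.
    image:       input tensor of shape (1, C, H, W), values in [0, 1].
    target_pose: ground-truth pose tensor of shape (1, 7).
    epsilon:     perturbation budget in input space.
    """
    image = image.clone().detach().requires_grad_(True)
    predicted_pose = pose_net(image)
    loss = torch.nn.functional.mse_loss(predicted_pose, target_pose)
    loss.backward()
    # FGSM: step in the direction of the sign of the input gradient.
    adversarial = image + epsilon * image.grad.sign()
    return adversarial.clamp(0.0, 1.0).detach()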

--------------------------------------------------------------------------------------------------------

Fake Alignment: Are LLMs Really Aligned Well?

Fake Alignment: Are LLMs Really Aligned Well?: This work investigates potential limitations in evaluating alignment in large language models. It introduces new metrics to quantify fake alignment, where models appear aligned on some tests but not on others. The findings highlight the need for more rigorous alignment evaluation, with implications for developing safe and reliable LLMs.

Authors:  Yixu Wang, Yan Teng, Kexin Huang, Chengqi Lyu, Songyang Zhang, Wenwei Zhang, Xingjun Ma, Yingchun Wang

Link:  https://arxiv.org/abs/2311.05915v1

Date: 2023-11-10

Summary:

The growing awareness of safety concerns in large language models (LLMs) has sparked considerable interest in the evaluation of safety within current research endeavors. This study investigates an interesting issue pertaining to the evaluation of LLMs: the substantial discrepancy in performance between multiple-choice questions and open-ended questions. Inspired by research on jailbreak attack patterns, we argue this is caused by mismatched generalization. That is, the LLM does not have a comprehensive understanding of the complex concept of safety. Instead, it only remembers what to answer for open-ended safety questions, which makes it unable to solve other forms of safety tests. We refer to this phenomenon as fake alignment and construct a comparative benchmark to empirically verify its existence in LLMs. Such fake alignment renders previous evaluation protocols unreliable. To address this, we introduce the FAEF framework and two novel metrics, the Consistency Score (CS) and Consistent Safety Score (CSS), which jointly assess two complementary forms of evaluation to quantify fake alignment and obtain corrected performance estimates. Applying FAEF to 14 widely used LLMs reveals that several models with purported safety are poorly aligned in practice. Our work highlights potential limitations in prevailing alignment methodologies.
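
The abstract does not give the exact formulas for CS and CSS; purely as a hypothetical illustration of the underlying idea, a consistency-style score could be computed as the agreement rate between a model's safety verdicts under the two question formats:

def consistency_score(open_ended_safe, multiple_choice_safe):
    """Fraction of prompts on which the two evaluation formats agree.

    open_ended_safe / multiple_choice_safe: lists of booleans, one per prompt,
    indicating whether the model's answer was judged safe in that format.
    This is an illustrative agreement rate, not the paper's exact CS definition.
    """
    assert len(open_ended_safe) == len(multiple_choice_safe)
    agreements = sum(a == b for a, b in zip(open_ended_safe, multiple_choice_safe))
    return agreements / len(open_ended_safe)

# A model that looks safe in open-ended answers but fails the matched
# multiple-choice probes scores poorly on consistency.
print(consistency_score([True, True, True, True], [True, False, False, True]))  # 0.5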

--------------------------------------------------------------------------------------------------------

Long-Horizon Dialogue Understanding for Role Identification in the Game of Avalon with Large Language Models

Long-Horizon Dialogue Understanding for Role Identification in the Game of Avalon with Large Language Models: This research explores using large language models for identifying player roles and goals in the social deception game Avalon. It introduces a new dataset of human dialogues exhibiting complex, long-horizon deception. The work sheds light on LLMs' ability to understand pragmatic reasoning in multi-agent settings.

Authors:  Simon Stepputtis, Joseph Campbell, Yaqi Xie, Zhengyang Qi, Wenxin Sharon Zhang, Ruiyi Wang, Sanketh Rangreji, Michael Lewis, Katia Sycara

Link:  https://arxiv.org/abs/2311.05720v1

Date: 2023-11-09

Summary:

Deception and persuasion play a critical role in long-horizon dialogues between multiple parties, especially when the interests, goals, and motivations of the participants are not aligned. Such complex tasks pose challenges for current Large Language Models (LLMs), as deception and persuasion can easily mislead them, especially in long-horizon multi-party dialogues. To this end, we explore the game of Avalon: The Resistance, a social deduction game in which players must determine each other's hidden identities to complete their team's objective. We introduce an online testbed and a dataset containing 20 carefully collected and labeled games among human players that exhibit long-horizon deception in a cooperative-competitive setting. We discuss the capabilities of LLMs to utilize deceptive long-horizon conversations between six human players to determine each player's goal and motivation. In particular, we discuss the multimodal integration of the chat between the players and the game's state that grounds the conversation, providing further insights into the true player identities. We find that even current state-of-the-art LLMs do not reach human performance, making our dataset a compelling benchmark to investigate the decision-making and language-processing capabilities of LLMs. Our dataset and online testbed can be found at our project website: https://sstepput.github.io/Avalon-NLU/

--------------------------------------------------------------------------------------------------------

Data Valuation and Detections in Federated Learning

Data Valuation and Detections in Federated Learning: This paper proposes a new privacy-preserving method to evaluate data contributions in federated learning without a predefined algorithm. It enables transparent data valuation to incentivize high-quality data from clients. The approach has applications in building robust and fair federated learning systems.

Authors:  Wenqian Li, Shuran Fu, Fengrui Zhang, Yan Pang

Link:  https://arxiv.org/abs/2311.05304v1

Date: 2023-11-09

Summary:

Federated Learning (FL) enables collaborative model training without sharing raw data, yet demands abundant, high-quality data for optimal model performance. Fair and efficient data evaluation is a fundamental issue for incentivizing clients to provide more high-quality data. Meanwhile, it is likely that only a subset of clients and datasets are relevant for a learning task, while the rest may have a negative impact on model training. This paper introduces a novel privacy-preserving method for evaluating client contributions and selecting relevant data samples without a pre-specified training algorithm. Our proposed approach, FedBary, utilizes the Wasserstein distance within the federated context, offering a pioneering solution for data valuation that provides transparent data evaluation and efficient computation of the Wasserstein barycenter to mitigate reliance on validation data. We conduct extensive empirical experiments and theoretical analysis, showing the promise of this valuation metric.
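
As a rough, one-dimensional illustration of Wasserstein-based data valuation (not FedBary itself, which works with richer representations and a Wasserstein barycenter), a client's relevance could be scored by the distance between its feature distribution and a reference distribution:

import numpy as np
from scipy.stats import wasserstein_distance

def client_relevance(client_features, reference_features):
    """Smaller distance = client data closer to the reference distribution.

    One-dimensional toy illustration of Wasserstein-based valuation only.
    """
    return wasserstein_distance(client_features, reference_features)

rng = np.random.default_rng(0)
reference = rng.normal(0.0, 1.0, 1000)
relevant_client = rng.normal(0.1, 1.0, 500)   # close to the reference
noisy_client = rng.normal(3.0, 2.0, 500)      # likely to hurt training
print(client_relevance(relevant_client, reference))
print(client_relevance(noisy_client, reference))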

--------------------------------------------------------------------------------------------------------

Explainable artificial intelligence for Healthcare applications using Random Forest Classifier with LIME and SHAP

Explainable artificial intelligence for Healthcare applications using Random Forest Classifier with LIME and SHAP: This work provides an analysis of explainable AI techniques like LIME and SHAP using a random forest classifier on a diabetes dataset. The results offer insights into model transparency and trustworthiness for healthcare AI. This demonstrates the potential of explainable AI in high-stakes medical prediction tasks.

Authors:  Mrutyunjaya Panda, Soumya Ranjan Mahanta

Link:  https://arxiv.org/abs/2311.05665v1

Date: 2023-11-09

Summary:

With the advances in computationally efficient artificial intelligence (AI) techniques and their numerous applications in our everyday life, there is a pressing need to understand, through more detailed explanations, the computational details hidden in black-box AI techniques such as the most popular machine learning and deep learning methods. Explainable AI (xAI) originates from these challenges and has recently gained more attention from researchers by adding explainability comprehensively to traditional AI systems. This leads to the development of an appropriate framework for successful applications of xAI in real-life scenarios with respect to innovation, risk mitigation, ethical issues, and logical value to users. In this book chapter, an in-depth analysis of several xAI frameworks and methods, including LIME (Local Interpretable Model-agnostic Explanations) and SHAP (SHapley Additive exPlanations), is provided. A Random Forest Classifier is used as the black-box AI on a publicly available diabetes symptoms dataset, with LIME and SHAP applied for better interpretation. The results obtained are interesting in terms of transparency, validity, and trustworthiness in diabetes disease prediction.
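
A minimal sketch of the kind of pipeline described, a random forest on tabular data explained with SHAP, is shown below; the synthetic data stands in for the diabetes-symptoms dataset and is only an assumption for illustration:

import numpy as np
import shap
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Placeholder tabular data standing in for the diabetes-symptoms dataset.
X, y = make_classification(n_samples=500, n_features=8, random_state=0)
model = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

# SHAP's tree explainer attributes each prediction to the input features.
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X[:10])
print(np.asarray(shap_values).shape)  # attribution array; exact layout varies across shap versions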

--------------------------------------------------------------------------------------------------------

RAGLog: Log Anomaly Detection using Retrieval Augmented Generation

RAGLog: Log Anomaly Detection using Retrieval Augmented Generation: This paper explores using a retrieval augmented language model to detect anomalies in system logs, which is key for cyber resilience. The proposed RAGLog method shows promise for automating log analysis without anomalous examples, with applications in IT system monitoring and digital forensics.

Authors:  Jonathan Pan, Swee Liang Wong, Yidi Yuan

Link:  https://arxiv.org/abs/2311.05261v1

Date: 2023-11-09

Summary:

The ability to detect log anomalies from system logs is a vital activity needed to ensure the cyber resiliency of systems. It is applied for fault identification or to facilitate cyber investigation and digital forensics. However, as logs belonging to different systems and components differ significantly, performing such analysis manually is challenging given the volume, variety and velocity of logs. This is further complicated by the lack or unavailability of anomalous log entries with which to develop trained machine learning or artificial intelligence models for such purposes. In this research work, we explore the use of a Retrieval Augmented Large Language Model that leverages a vector database to detect anomalies from logs. We used a Question and Answer configuration pipeline. To the best of our knowledge, our experiment, which we call RAGLog, is novel, and the experimental results show much promise.
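
The abstract describes a retrieval-augmented Question and Answer pipeline over a vector store of log entries; the sketch below is a loose approximation of that flow, where embed() and llm_answer() are hypothetical stand-ins rather than the paper's actual components:

import numpy as np

def embed(text):
    """Hypothetical embedding function; a real system would call an embedding model."""
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    return rng.normal(size=128)

def retrieve(query_vec, store, k=3):
    """Return the k most similar reference log entries by cosine similarity."""
    sims = [(np.dot(query_vec, v) / (np.linalg.norm(query_vec) * np.linalg.norm(v)), t)
            for t, v in store]
    return [t for _, t in sorted(sims, reverse=True)[:k]]

# Vector store built from known-normal log entries.
normal_logs = ["service started ok", "heartbeat received", "config reloaded"]
store = [(t, embed(t)) for t in normal_logs]

new_entry = "segfault in auth module, core dumped"
context = retrieve(embed(new_entry), store)
prompt = (f"Reference normal logs: {context}\n"
          f"Log entry to check: {new_entry}\n"
          f"Question: is this entry anomalous? Answer yes or no with a reason.")
# answer = llm_answer(prompt)  # hypothetical LLM call in the Q&A configuration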

--------------------------------------------------------------------------------------------------------

Green Resilience of Cyber-Physical Systems

Green Resilience of Cyber-Physical Systems: This early PhD proposal suggests using game theory for resilient and green decision making in cyber-physical systems. It outlines a model for human-robot collaboration that achieves resilience while minimizing environmental impact. This has applications in developing sustainable and reliable intelligent systems.

Authors:  Diaeddin Rimawi

Link:  https://arxiv.org/abs/2311.05201v1

Date: 2023-11-09

Summary:

A Cyber-Physical System (CPS) joins hardware and software components to perform real-time services. Maintaining the system's reliability is critical to the continuous delivery of these services. However, the CPS running environment is full of uncertainties and can easily lead to performance degradation. As a result, a recovery technique is needed to achieve resilience in the system, keeping in mind that this technique should be as green as possible. This early doctorate proposal suggests a game-theoretic solution to achieve resilience and greenness in CPS. Game theory is known for its fast performance in decision-making, helping the system choose what maximizes its payoffs. The proposed game model is described over a real-life collaborative artificial intelligence system (CAIS) that involves robots working with humans to achieve a common goal. It shows how the expected results will achieve the resilience of CAIS with a minimized CO2 footprint.
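
As a purely illustrative sketch of the payoff-maximization idea (the actions and numbers are hypothetical, not from the proposal), the system could pick the recovery action whose payoff best trades off restored resilience against CO2 cost:

# Hypothetical recovery actions and payoffs: utility = resilience - co2_weight * co2_kg.
actions = {
    "restart_component":  {"resilience": 0.70, "co2_kg": 0.2},
    "switch_to_human":    {"resilience": 0.90, "co2_kg": 0.1},
    "full_system_reboot": {"resilience": 0.95, "co2_kg": 1.5},
}

def payoff(action, co2_weight=0.5):
    return action["resilience"] - co2_weight * action["co2_kg"]

best = max(actions, key=lambda name: payoff(actions[name]))
print(best)  # the recovery that restores service while keeping emissions low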

--------------------------------------------------------------------------------------------------------

Automated Annotation of Scientific Texts for ML-based Keyphrase Extraction and Validation

Automated Annotation of Scientific Texts for ML-based Keyphrase Extraction and Validation: This work introduces new automated labeling techniques to validate machine-generated metadata for scientific texts. By exploiting domain knowledge, the methods accelerate annotation to enable metadata extraction from large unlabeled corpora. This can facilitate search and discovery in science domains like genomics.

Authors:  Oluwamayowa O. Amusat, Harshad Hegde, Christopher J. Mungall, Anna Giannakou, Neil P. Byers, Dan Gunter, Kjiersten Fagnan, Lavanya Ramakrishnan

Link:  https://arxiv.org/abs/2311.05042v1

Date: 2023-11-08

Summary:

Advanced omics technologies and facilities generate a wealth of valuable data daily; however, the data often lack the essential metadata required for researchers to find and search them effectively. The lack of metadata poses a significant challenge in the utilization of these datasets. Machine learning-based metadata extraction techniques have emerged as a potentially viable approach to automatically annotating scientific datasets with the metadata necessary for enabling effective search. Text labeling, usually performed manually, plays a crucial role in validating machine-extracted metadata. However, manual labeling is time-consuming; thus, there is a need to develop automated text labeling techniques in order to accelerate the process of scientific innovation. This need is particularly urgent in fields such as environmental genomics and microbiome science, which have historically received less attention in terms of metadata curation and the creation of gold-standard text mining datasets. In this paper, we present two novel automated text labeling approaches for the validation of ML-generated metadata for unlabeled texts, with specific applications in environmental genomics. Our techniques show the potential of two new ways to leverage existing information about the unlabeled texts and the scientific domain. The first technique exploits relationships between different types of data sources related to the same research study, such as publications and proposals. The second technique takes advantage of domain-specific controlled vocabularies or ontologies. In this paper, we detail applying these approaches for ML-generated metadata validation. Our results show that the proposed label assignment approaches can generate both generic and highly specific text labels for the unlabeled texts, with up to 44% of the labels matching those suggested by an ML keyword extraction algorithm.

--------------------------------------------------------------------------------------------------------

Beyond Size: How Gradients Shape Pruning Decisions in Large Language Models

Beyond Size: How Gradients Shape Pruning Decisions in Large Language Models: This paper proposes a new gradient-based method for unstructured pruning of large language models. By leveraging gradients and Taylor expansion, it outperforms weight magnitude and activation-based pruning. Analyzing gradients provides insights into geometric dependence in LLMs to guide pruning.

Authors:  Rocktim Jyoti Das, Liqun Ma, Zhiqiang Shen

Link:  https://arxiv.org/abs/2311.04902v1

Date: 2023-11-08

Summary:

Large Language Models (LLMs) with a billion or more parameters are prime targets for network pruning, which aims to remove a portion of the network weights without compromising performance. Prior approaches such as Weights Magnitude, SparseGPT, and Wanda either concentrated solely on weights or integrated weights with activations for sparsity. However, they overlooked the informative gradients derived from pretrained large language models. In this paper, we present a novel sparsity-centric pruning method for pretrained LLMs, termed Gradient-based Language Model Pruner (GBLM-Pruner). GBLM-Pruner leverages the first-order term of the Taylor expansion, operating in a training-free manner by harnessing properly normalized gradients from a few calibration samples to determine the importance pruning score, and substantially outperforms competitive counterparts like SparseGPT and Wanda on multiple benchmarks. Intriguingly, after incorporating gradients, the unstructured pruning method tends to reveal some structural patterns post-pruning, which mirrors the geometric interdependence inherent in the LLMs' parameter structure. Additionally, GBLM-Pruner functions without any subsequent retraining or weight updates, maintaining its simplicity relative to other counterparts. Extensive evaluations on LLaMA-1 and LLaMA-2 across various language benchmarks and perplexity show that GBLM-Pruner surpasses magnitude pruning, Wanda (weights+activations) and SparseGPT (weights+activations+weight update) by significant margins. Our code and models are available at https://github.com/RocktimJyotiDas/GBLM-Pruner.
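
A minimal sketch of a gradient-times-weight importance score followed by unstructured thresholding is shown below; the exact normalization used by GBLM-Pruner may differ, so treat this as an illustration of the general first-order idea rather than the authors' method:

import torch

def gradient_importance(weight, grads):
    """First-order importance: |w| scaled by the averaged gradient magnitude (sketch).

    weight: (out, in) weight matrix of a linear layer.
    grads:  list of gradient tensors of the same shape, one per calibration sample.
    """
    grad_magnitude = torch.stack(grads).abs().mean(dim=0)
    return weight.abs() * grad_magnitude

def prune_unstructured(weight, importance, sparsity=0.5):
    """Zero out the fraction of weights with the lowest importance scores."""
    k = int(importance.numel() * sparsity)
    threshold = importance.flatten().kthvalue(k).values
    return weight * (importance > threshold).float()

w = torch.randn(256, 256)
calib_grads = [torch.randn_like(w) for _ in range(4)]
pruned = prune_unstructured(w, gradient_importance(w, calib_grads))
print((pruned == 0).float().mean())  # roughly 0.5 sparsity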

--------------------------------------------------------------------------------------------------------

Prompt Sketching for Large Language Models

Prompt Sketching for Large Language Models: This work introduces prompt sketching to address limitations of sequential prompting for LLMs. By predicting values for template variables, sketching gives more control over generation. Experiments show improvements in zero-shot reasoning over direct asking and chain-of-thought prompting.

Authors:  Luca Beurer-Kellner, Mark Niklas Müller, Marc Fischer, Martin Vechev

Link:  https://arxiv.org/abs/2311.04954v1

Date: 2023-11-08

Summary:

Many recent prompting strategies for large language models (LLMs) query the model multiple times sequentially -- first to produce intermediate results and then the final answer. However, with these methods, both decoder and model are unaware of potential follow-up prompts, leading to disconnected and undesirably wordy intermediate responses. In this work, we address this issue by proposing prompt sketching, a new prompting paradigm in which an LLM responds not only by completing a prompt but also by predicting values for multiple variables in a template. This way, sketching grants users more control over the generation process, e.g., by providing a reasoning framework via intermediate instructions, leading to better overall results. The key idea enabling sketching with existing, autoregressive models is to adapt the decoding procedure to also score follow-up instructions during text generation, thus optimizing overall template likelihood during inference. Our experiments show that in a zero-shot setting, prompt sketching outperforms existing sequential prompting schemes such as direct asking or chain-of-thought on 7 out of 8 LLM benchmarking tasks, including state tracking, arithmetic reasoning, and general question answering. To facilitate future use, we release a number of generic yet effective sketches applicable to many tasks, and an open-source library called dclib, powering our sketch-aware decoders.

--------------------------------------------------------------------------------------------------------

LongQLoRA: Efficient and Effective Method to Extend Context Length of Large Language Models

LongQLoRA: Efficient and Effective Method to Extend Context Length of Large Language Models: This paper presents LongQLoRA, a method to efficiently extend context length for LLMs with less training. Experiments show competitive performance to state-of-the-art methods in extending models like LLaMA and Vicuna-13B. This enables building performant LLMs without prohibitively large resources.

Authors:  Jianxin Yang

Link:  https://arxiv.org/abs/2311.04879v2

Date: 2023-11-09

Summary:

We present LongQLoRA, an efficient and effective method to extend the context length of large language models with fewer training resources. LongQLoRA combines the advantages of Position Interpolation, QLoRA and the Shift Short Attention of LongLoRA. With a single 32GB V100 GPU, LongQLoRA can extend the context length of LLaMA2 7B and 13B from 4096 to 8192, and even to 12k, within 1000 finetuning steps. LongQLoRA achieves competitive perplexity on the PG19 and Proof-pile datasets; our model outperforms LongLoRA and is very close to MPT-7B-8K within the evaluation context length of 8192. We collect and build 39k long instruction data to extend the context length of Vicuna-13B from 4096 to 8192 and achieve good performance in both long and short context generation tasks. We also run ablation experiments to study the effect of LoRA rank, finetuning steps and attention patterns at inference. The model weights, training data and code are available at https://github.com/yangjianxin1/LongQLoRA.
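
Of the three ingredients, Position Interpolation is the simplest to illustrate: extended positions are squeezed back into the pretrained range before computing rotary-embedding angles. The sketch below shows that scaling step only, with illustrative dimensions; it is not the LongQLoRA implementation:

import torch

def rope_angles(positions, dim=64, base=10000.0):
    """Rotary-embedding angles for the given (possibly fractional) positions."""
    inv_freq = 1.0 / (base ** (torch.arange(0, dim, 2).float() / dim))
    return torch.outer(positions, inv_freq)

# Position interpolation: squeeze positions of a longer sequence back into the
# range the model was pretrained on (e.g. 8192 -> 4096), so extended contexts
# reuse the rotary frequencies the model already knows.
pretrained_len, extended_len = 4096, 8192
scale = pretrained_len / extended_len
positions = torch.arange(extended_len).float() * scale
angles = rope_angles(positions)
print(angles.shape)  # (8192, 32)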

--------------------------------------------------------------------------------------------------------

Why Do Clinical Probabilistic Models Fail To Transport Between Sites?

Why Do Clinical Probabilistic Models Fail To Transport Between Sites?: This perspective examines reasons probabilistic models fail to transport between clinical sites despite high performance at training sites. It proposes isolating site-specific practice patterns from disease patterns to improve transportability. This provides insights into developing robust clinical AI.

Authors:  Thomas A. Lasko, Eric V. Strobl, William W. Stead

Link:  https://arxiv.org/abs/2311.04787v1

Date: 2023-11-08

Summary:

The rising popularity of artificial intelligence in healthcare is highlighting the problem that a computational model achieving super-human clinical performance at its training sites may perform substantially worse at new sites. In this perspective, we present common sources for this failure to transport, which we divide into sources under the control of the experimenter and sources inherent to the clinical data-generating process. Of the inherent sources we look a little deeper into site-specific clinical practices that can affect the data distribution and propose a potential solution intended to isolate the imprint of those practices on the data from the patterns of disease cause and effect that are the usual target of clinical models.

--------------------------------------------------------------------------------------------------------

Explained anomaly detection in text reviews: Can subjective scenarios be correctly evaluated?

Explained anomaly detection in text reviews: Can subjective scenarios be correctly evaluated?: This work develops a pipeline for detecting and explaining anomalies in online reviews. A human study compares explanation techniques and measures their impact on reproducing classifications. Explaining subjective tasks sheds light on challenges in anomaly detection for text.

Authors:  David Novoa-Paradela, Oscar Fontenla-Romero, Bertha Guijarro-Berdiñas

Link:  https://arxiv.org/abs/2311.04948v1

Date: 2023-11-08

Summary:

This paper presents a pipeline to detect and explain anomalous reviews in online platforms. The pipeline is made up of three modules and allows the detection of reviews that do not generate value for users due to either worthless or malicious composition. The classifications are accompanied by a normality score and an explanation that justifies the decision made. The pipeline's ability to solve the anomaly detection task was evaluated using different datasets created from a large Amazon database. Additionally, a study comparing three explainability techniques involving 241 participants was conducted to assess the explainability module. The study aimed to measure the impact of explanations on the respondents' ability to reproduce the classification model and their perceived usefulness. This work can be useful for automating tasks in online review platforms, such as those for electronic commerce, and offers inspiration for addressing similar problems in the field of anomaly detection in textual data. We also consider it valuable to have carried out a human evaluation of different explainability techniques in a real and infrequent scenario such as the detection of anomalous reviews, and to reflect on whether it is possible to explain tasks as humanly subjective as this one.

--------------------------------------------------------------------------------------------------------

LuminanceL1Loss: A loss function which measures perceived brightness and colour differences

LuminanceL1Loss: A loss function which measures perceived brightness and colour differences: This paper introduces LuminanceL1Loss, a novel loss function for image restoration. It transforms images to grayscale before computing MSE loss. Experiments show consistent gains over MSE, demonstrating efficacy for image reconstruction tasks like denoising.

Authors:  Dominic De Jonge

Link:  https://arxiv.org/abs/2311.04614v1

Date: 2023-11-08

Summary:

We introduce LuminanceL1Loss, a novel loss function designed to enhance the performance of image restoration tasks. We demonstrate its superiority over MSE when applied to the Retinexformer, BUIFD and DnCNN architectures. Our proposed LuminanceL1Loss leverages a unique approach by transforming images into grayscale and subsequently computing the MSE loss for both grayscale and color channels. Experimental results demonstrate that this innovative loss function consistently outperforms traditional methods, showcasing its potential in image denoising and other related image reconstruction tasks. It demonstrates gains of up to 4.7 dB. The results presented in this study highlight the efficacy of LuminanceL1Loss for various image restoration tasks.
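
A minimal PyTorch sketch of a loss along the lines described (grayscale conversion plus MSE on both the grayscale and color channels) is given below; the Rec.601 luminance weights and the weighting between the two terms are assumptions, not the paper's exact formulation:

import torch

def luminance_aware_loss(prediction, target, luma_weight=1.0):
    """Sketch of a luminance-aware reconstruction loss: convert to grayscale with
    standard Rec.601 weights (an assumption here) and combine the grayscale error
    with the usual per-channel error."""
    weights = torch.tensor([0.299, 0.587, 0.114], device=prediction.device).view(1, 3, 1, 1)
    pred_luma = (prediction * weights).sum(dim=1)
    target_luma = (target * weights).sum(dim=1)
    color_term = torch.nn.functional.mse_loss(prediction, target)
    luma_term = torch.nn.functional.mse_loss(pred_luma, target_luma)
    return color_term + luma_weight * luma_term

pred = torch.rand(2, 3, 64, 64, requires_grad=True)
gt = torch.rand(2, 3, 64, 64)
loss = luminance_aware_loss(pred, gt)
loss.backward()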

--------------------------------------------------------------------------------------------------------

TEAL: Tokenize and Embed ALL for Multi-modal Large Language Models

TEAL: Tokenize and Embed ALL for Multi-modal Large Language Models: This work proposes TEAL to improve multi-modal understanding and generation for LLMs. By tokenizing all modalities into a joint embedding space, frozen textual LLMs can perform non-text tasks. This provides simple and effective multimodal capabilities for existing models.

Authors:  Zhen Yang, Yingxue Zhang, Fandong Meng, Jie Zhou

Link:  https://arxiv.org/abs/2311.04589v1

Date: 2023-11-08

Summary:

Although Multi-modal Large Language Models (MM-LLMs) have made exciting strides recently, they still struggle to efficiently model the interactions among multi-modal inputs and the generation in non-textual modalities. In this work, we propose TEAL (Tokenize and Embed ALl), an approach to treat the input from any modality as a token sequence and learn a joint embedding space for all modalities. Specifically, for the input from any modality, TEAL first discretizes it into a token sequence with an off-the-shelf tokenizer and embeds the token sequence into a joint embedding space with a learnable embedding matrix. MM-LLMs just need to predict the multi-modal tokens autoregressively as textual LLMs do. Finally, the corresponding de-tokenizer is applied to generate the output in each modality based on the predicted token sequence. With the joint embedding space, TEAL enables frozen LLMs to perform both understanding and generation tasks involving non-textual modalities, such as image and audio. Thus, the textual LLM can work simply as an interface and maintain its high performance in textual understanding and generation. Experiments show that TEAL achieves substantial improvements in multi-modal understanding and implements a simple scheme for multi-modal generation.

--------------------------------------------------------------------------------------------------------

RAMIEL: A Parallel-Wire Driven Monopedal Robot for High and Continuous Jumping

RAMIEL: A Parallel-Wire Driven Monopedal Robot for High and Continuous Jumping: This paper presents RAMIEL, a wire-driven monopedal robot that achieves high (1.6m) and continuous jumping using a lightweight leg with quasi-direct drive. The parallel wire mechanism enables simultaneous controllability and power for jumping. This demonstrates new capabilities for dynamic legged robots.

Authors:  Temma Suzuki, Yasunori Toshimitsu, Yuya Nagamatsu, Kento Kawaharazuka, Akihiro Miki, Yoshimoto Ribayashi, Masahiro Bando, Kunio Kojima, Yohei Kakiuchi, Kei Okada, Masayuki Inaba

Link:  https://arxiv.org/abs/2311.04573v1

Date: 2023-11-08

Summary:

Legged robots with high locomotive performance have been extensively studied,  and various leg structures have been proposed. Especially, a leg structure that can achieve both continuous and high jumps is advantageous for moving around in a three-dimensional environment. In this study, we propose a parallel wire-driven leg structure, which has one DoF of linear motion and two DoFs of rotation and is controlled by six wires, as a structure that can achieve both continuous jumping and high jumping. The proposed structure can simultaneously achieve high controllability on each DoF, long acceleration distance and high power required for jumping. In order to verify the jumping performance of the parallel wire-driven leg structure, we have developed a parallel wire-driven monopedal robot, RAMIEL. RAMIEL is equipped with quasi-direct drive, high power wire winding mechanisms and a lightweight leg, and can achieve a maximum jumping height of 1.6 m and a maximum of seven continuous jumps.

--------------------------------------------------------------------------------------------------------

Improving Pacing in Long-Form Story Planning

Improving Pacing in Long-Form Story Planning: This work introduces a system called CONCOCT to improve story outline pacing using predicted event concreteness. Results show more consistent pacing over baselines when generating hierarchical outlines. Controlling pacing addresses a key challenge in automating coherent long-form story generation.

Authors:  Yichen Wang, Kevin Yang, Xiaoming Liu, Dan Klein

Link:  https://arxiv.org/abs/2311.04459v1

Date: 2023-11-08

Summary:

Existing LLM-based systems for writing long-form stories or story outlines frequently suffer from unnatural pacing, whether glossing over important events or over-elaborating on insignificant details, resulting in a jarring experience for the reader. We propose a CONCrete Outline ConTrol (CONCOCT) system to improve pacing when automatically generating story outlines. We first train a  concreteness evaluator to judge which of two events is more concrete  (low-level-detailed). This evaluator can then be used to control pacing in hierarchical outline generation; in this work, we explore a vaguest-first expansion procedure that aims for uniform pacing. We further use the evaluator to filter new outline items based on predicted concreteness. Compared to a  baseline hierarchical outline generator, humans judge CONCOCT's pacing to be more consistent over 57% of the time across multiple outline lengths; the gains also translate to downstream stories. All code, data, and models are open-sourced.
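
As a hypothetical sketch of the vaguest-first expansion idea (the paper uses a trained pairwise concreteness evaluator and an LLM-based expansion step, both replaced here by toy stand-ins):

def vaguest_first_expand(outline, expand, more_concrete, max_items=8):
    """Repeatedly expand the least concrete outline item (sketch of the idea).

    outline:       list of event strings.
    expand:        hypothetical function turning one vague event into sub-events.
    more_concrete: hypothetical pairwise comparator (the trained evaluator).
    """
    while len(outline) < max_items:
        # Find the vaguest item under the pairwise comparator.
        vaguest = outline[0]
        for item in outline[1:]:
            if more_concrete(vaguest, item):  # current candidate is more concrete
                vaguest = item
        idx = outline.index(vaguest)
        outline = outline[:idx] + expand(vaguest) + outline[idx + 1:]
    return outline

# Toy usage with stand-in helpers.
toy_expand = lambda e: [e + " (beginning)", e + " (aftermath)"]
toy_more_concrete = lambda a, b: len(a) > len(b)  # crude proxy: longer = more concrete
print(vaguest_first_expand(["hero leaves home", "final battle"],
                           toy_expand, toy_more_concrete))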

--------------------------------------------------------------------------------------------------------

Evaluating Uncertainty Quantification approaches for Neural PDEs in scientific applications

Evaluating Uncertainty Quantification approaches for Neural PDEs in scientific applications: This paper evaluates uncertainty quantification methods for neural PDEs on scientific problems like fluid dynamics. It finds Bayesian approaches display higher certainty than deep ensembles, suggesting potential underestimation. Quantifying uncertainty is key for reliable scientific predictions.

Authors:  Vardhan Dongre, Gurpreet Singh Hora

Link:  https://arxiv.org/abs/2311.04457v1

Date: 2023-11-08

Summary:

The accessibility of spatially distributed data, enabled by affordable sensors, field, and numerical experiments, has facilitated the development of data-driven solutions for scientific problems, including climate change, weather prediction, and urban planning. Neural Partial Differential Equations (Neural PDEs), which combine deep learning (DL) techniques with domain expertise (e.g., governing equations) for parameterization, have proven to be effective in capturing valuable correlations within spatiotemporal datasets. However, sparse and noisy measurements coupled with modeling approximations introduce aleatoric and epistemic uncertainties. Therefore, quantifying uncertainties propagated from model inputs to outputs remains a challenge and an essential goal for establishing the trustworthiness of Neural PDEs. This work evaluates various Uncertainty Quantification (UQ) approaches for both forward and inverse problems in scientific applications. Specifically, we investigate the effectiveness of Bayesian methods, such as Hamiltonian Monte Carlo (HMC) and Monte-Carlo Dropout (MCD), and a more conventional approach, Deep Ensembles (DE). To illustrate their performance, we take two canonical PDEs: Burgers' equation and the Navier-Stokes equation. Our results indicate that Neural PDEs can effectively reconstruct flow systems and predict the associated unknown parameters. However, it is noteworthy that the results derived from Bayesian methods, based on our observations, tend to display a higher degree of certainty in their predictions than those obtained using DE. This elevated certainty suggests that Bayesian techniques might underestimate the true underlying uncertainty, thereby appearing more confident in their predictions than the DE approach.
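
Of the evaluated approaches, Monte-Carlo Dropout is the most compact to sketch: dropout is kept active at test time and repeated stochastic forward passes yield a predictive mean and spread. The toy surrogate network below is an assumption for illustration, not one of the paper's Neural PDE models:

import torch
import torch.nn as nn

class SmallSurrogate(nn.Module):
    """Toy network standing in for a Neural PDE surrogate (illustration only)."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(2, 64), nn.ReLU(), nn.Dropout(0.1),
                                 nn.Linear(64, 1))
    def forward(self, x):
        return self.net(x)

def mc_dropout_predict(model, x, n_samples=100):
    """Monte-Carlo Dropout: keep dropout active at test time and average."""
    model.train()  # keeps dropout layers stochastic during inference
    with torch.no_grad():
        samples = torch.stack([model(x) for _ in range(n_samples)])
    return samples.mean(dim=0), samples.std(dim=0)  # predictive mean and uncertainty

model = SmallSurrogate()
x = torch.rand(8, 2)  # e.g. (x, t) query points for a 1-D PDE
mean, std = mc_dropout_predict(model, x)
print(mean.shape, std.shape)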

--------------------------------------------------------------------------------------------------------

Interpretable Geoscience Artificial Intelligence (XGeoS-AI): Application to Demystify Image Recognition

Interpretable Geoscience Artificial Intelligence (XGeoS-AI): Application to Demystify Image Recognition: This study proposes an interpretable AI framework for geoscience image recognition tasks like CT analysis. Inspired by human vision, it generates local thresholds for recognition. This demonstrates the importance of interpretability for promoting explainable and trustworthy AI in the earth sciences.

Authors:  Jin-Jian Xu, Hao Zhang, Chao-Sheng Tang, Lin Li, Bin Shi

Link:  https://arxiv.org/abs/2311.04940v1

Date: 2023-11-08

Summary:

As Earth science enters the era of big data, artificial intelligence (AI) not only offers great potential for solving geoscience problems, but also plays a critical role in accelerating the understanding of the complex, interactive, and multiscale processes of Earth's behavior. As geoscience AI models are progressively utilized for significant predictions in crucial situations, geoscience researchers are increasingly demanding interpretability and versatility from them. This study proposes an interpretable geoscience artificial intelligence (XGeoS-AI) framework to unravel the mystery of image recognition in the Earth sciences, and its effectiveness and versatility are demonstrated by taking computed tomography (CT) image recognition as an example. Inspired by the mechanism of human vision, the proposed XGeoS-AI framework generates a threshold value from a local region within the whole image to complete the recognition. Different kinds of AI methods, such as Support Vector Regression (SVR), Multilayer Perceptron (MLP), and Convolutional Neural Network (CNN), can be adopted as the AI engines of the proposed XGeoS-AI framework to efficiently complete geoscience image recognition tasks. Experimental results demonstrate that the effectiveness, versatility, and heuristics of the proposed framework have great potential for solving geoscience image recognition problems. Interpretable AI should receive more and more attention in the field of the Earth sciences, as it is the key to promoting more rational and wider applications of AI in the Earth sciences. In addition, the proposed interpretable framework may be the forerunner of technological innovation in the Earth sciences.
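
As a loose illustration of the local-threshold idea (not the learned threshold generator in XGeoS-AI), a pixel can be compared against the mean of its neighborhood to produce a segmentation mask:

import numpy as np
from scipy.ndimage import uniform_filter

def local_threshold_segment(image, window=15):
    """Segment a grayscale CT-like image by comparing each pixel to the mean of
    its local neighborhood (an illustrative stand-in for a learned local threshold)."""
    local_mean = uniform_filter(image.astype(float), size=window)
    return (image > local_mean).astype(np.uint8)

ct_slice = np.random.rand(128, 128)  # placeholder for a CT image slice
mask = local_threshold_segment(ct_slice)
print(mask.mean())  # fraction of pixels labeled as the brighter phase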

--------------------------------------------------------------------------------------------------------

Data Factors for Better Compositional Generalization

Data Factors for Better Compositional Generalization: This work provides an empirical analysis of how training data factors like scale and complexity affect compositional generalization in language models. More diversity induces stronger generalization, while balancing easy and hard examples is most effective. The findings inform building robustly generalizable NLP models.

Authors:  Xiang Zhou, Yichen Jiang, Mohit Bansal

Link:  https://arxiv.org/abs/2311.04420v1

Date: 2023-11-08

Summary:

Recent diagnostic datasets on compositional generalization, such as SCAN  (Lake and Baroni, 2018) and COGS (Kim and Linzen, 2020), expose severe problems in models trained from scratch on these datasets. However, in contrast to this poor performance, state-of-the-art models trained on larger and more general datasets show better generalization ability. In this work, to reconcile this inconsistency, we conduct an empirical analysis by training Transformer models on a variety of training sets with different data factors, including dataset scale, pattern complexity, example difficulty, etc. First, we show that increased dataset complexity can lead to better generalization behavior on multiple different generalization challenges. To further understand this improvement, we show two axes of the benefit from more complex datasets: they provide more diverse examples so compositional understanding becomes more effective, and they also prevent ungeneralizable memorization of the examples due to reduced example repetition frequency. Finally, we explore how training examples of different difficulty levels influence generalization differently.  On synthetic datasets, simple examples invoke stronger compositionality than hard examples do. On larger-scale real language datasets, while hard examples become more important potentially to ensure decent data coverage, a balanced mixture of simple and hard examples manages to induce the strongest generalizability. The code and data for this work are available at  https://github.com/owenzx/data4comp

--------------------------------------------------------------------------------------------------------


EYE ON A.I. GETS READERS UP TO DATE ON THE LATEST FUNDING NEWS AND RELATED ISSUES. SUBSCRIBE FOR THE WEEKLY NEWSLETTER.