TAILOR Selected paper: July 2024

Every month, we acknowledge some valuable TAILOR papers, selected by TAILOR principal investigator Fredrik Heintz from among the papers published by scientists in our network.
The list of the most valuable papers gathers contributions from different TAILOR partners, each providing valuable insights on different topics related to Trustworthy AI.
Stay tuned for other valuable insights and groundbreaking research from our diverse community!

A General Framework for User-Guided Bayesian Optimization

C. Hvarfner, F. Hutter, and L. Nardi

12th International Conference on Learning Representations, ICLR 2024, 2024. [Online]. Available: https://www.scopus.com/inward/record.uri?eid=2-s2.0-85197024019&partnerID=40&md5=23f2d9b725ca38e104ac88ba10617705

Abstract: The optimization of expensive-to-evaluate black-box functions is prevalent in various scientific disciplines. Bayesian optimization is an automatic, general and sample-efficient method to solve these problems with minimal knowledge of the underlying function dynamics. However, the ability of Bayesian optimization to incorporate prior knowledge or beliefs about the function at hand in order to accelerate the optimization is limited, which reduces its appeal for knowledgeable practitioners with tight budgets. To allow domain experts to customize the optimization routine, we propose ColaBO, the first Bayesian-principled framework for incorporating prior beliefs beyond the typical kernel structure, such as the likely location of the optimizer or the optimal value. The generality of ColaBO makes it applicable across different Monte Carlo acquisition functions and types of user beliefs. We empirically demonstrate ColaBO’s ability to substantially accelerate optimization when the prior information is accurate, and to retain approximately default performance when it is misleading.
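The core idea of biasing the search with user beliefs can be illustrated with a toy sketch. The snippet below is a hypothetical illustration of prior-weighted acquisition, not the authors' ColaBO implementation: a standard acquisition function (here analytic Expected Improvement over a stand-in posterior) is multiplied by a user-supplied prior density over the optimizer's location, so candidate points the expert believes in are favored.

```python
import math

def expected_improvement(mu, sigma, best):
    """Analytic EI of a Gaussian posterior N(mu, sigma^2) over incumbent `best`."""
    z = (mu - best) / sigma
    cdf = 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))
    pdf = math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)
    return (mu - best) * cdf + sigma * pdf

grid = [i / 200 for i in range(201)]        # candidate points in [0, 1]
mu = [math.sin(6.0 * x) for x in grid]      # stand-in posterior mean
sigma = 0.3                                 # stand-in posterior std (constant)
best = max(mu)                              # incumbent value

ei = [expected_improvement(m, sigma, best) for m in mu]
# User prior: a Gaussian bump expressing the belief that the optimizer is near 0.8.
prior = [math.exp(-0.5 * ((x - 0.8) / 0.1) ** 2) for x in grid]
weighted = [e * p for e, p in zip(ei, prior)]

x_plain = grid[ei.index(max(ei))]              # vanilla EI's next query point
x_prior = grid[weighted.index(max(weighted))]  # prior-weighted query point
print(x_plain, x_prior)  # the prior pulls the query toward the believed region
```

With an accurate prior this concentrates evaluations where the optimizer actually lies; with a flat prior the weighted acquisition reduces to the standard one, which is the intuition behind retaining near-default performance under misleading beliefs.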

Tokenizer Choice For LLM Training: Negligible or Crucial?

M. Ali et al.

Findings of the Association for Computational Linguistics: NAACL 2024 – Findings, 2024, pp. 3907–3924. [Online]. Available: https://www.scopus.com/inward/record.uri?eid=2-s2.0-85197953847&partnerID=40&md5=55459095a8524e3874f833d87502d0c5

Abstract: The recent success of Large Language Models (LLMs) has been predominantly driven by curating the training dataset composition, scaling model architectures and dataset sizes, and advancements in pretraining objectives, leaving tokenizer influence as a blind spot. Shedding light on this underexplored area, we conduct a comprehensive study of the influence of tokenizer choice on LLM downstream performance by training 24 mono- and multilingual LLMs at a 2.6B parameter scale, ablating different tokenizer algorithms and parameterizations. Our studies highlight that tokenizer choice can significantly impact the model’s downstream performance and training costs. In particular, we find that the common tokenizer evaluation metrics, fertility and parity, are not always predictive of downstream performance, rendering these metrics a questionable proxy for it. Furthermore, we show that multilingual tokenizers trained on the five most frequent European languages require a threefold vocabulary size increase compared to English. While English-centric tokenizers have been applied to the training of multilingual LLMs in the past, we find that this approach results in severe downstream performance degradation and additional training costs of up to 68%, due to an inefficient tokenization vocabulary.
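The two intrinsic metrics named in the abstract are easy to compute. The sketch below uses their common definitions (fertility: average tokens per word; parity: the ratio of fertilities on parallel sentences in two languages); the greedy subword tokenizer and the tiny vocabulary are toy stand-ins, not the tokenizers evaluated in the paper.

```python
def toy_tokenize(text, vocab):
    """Greedy longest-match subword tokenization over a fixed vocabulary,
    falling back to single characters for unknown pieces."""
    tokens = []
    for word in text.split():
        i = 0
        while i < len(word):
            for j in range(len(word), i, -1):
                piece = word[i:j]
                if piece in vocab or j == i + 1:
                    tokens.append(piece)
                    i = j
                    break
    return tokens

def fertility(text, vocab):
    """Average number of tokens produced per whitespace-separated word."""
    return len(toy_tokenize(text, vocab)) / len(text.split())

vocab = {"token", "izer", "choice", "for", "train", "ing"}
en = "tokenizer choice for training"
de = "tokenizer wahl fuer training"   # toy "parallel" sentence

# Parity compares how evenly the tokenizer treats the two languages
# (1.0 = perfectly balanced; far from 1.0 = one language pays more tokens).
parity = fertility(en, vocab) / fertility(de, vocab)
print(fertility(en, vocab), parity)
```

An English-trained vocabulary fragments the non-English sentence into many short pieces, inflating its fertility; that inflation is the mechanism behind the extra training cost the paper measures.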

Towards efficient AutoML: a pipeline synthesis approach leveraging pre-trained transformers for multimodal data

A. Moharil, J. Vanschoren, P. Singh, and D. Tamburri

Machine Learning, vol. 113, no. 9, pp. 7011–7053, 2024, doi: 10.1007/s10994-024-06568-1

Abstract: This paper introduces an Automated Machine Learning (AutoML) framework specifically designed to efficiently synthesize end-to-end multimodal machine learning pipelines. Traditional reliance on the computationally demanding Neural Architecture Search is minimized through the strategic integration of pre-trained transformer models. This innovative approach enables the effective unification of diverse data modalities into high-dimensional embeddings, streamlining the pipeline development process. We leverage an advanced Bayesian Optimization strategy, informed by meta-learning, to facilitate the warm-starting of the pipeline synthesis, thereby enhancing computational efficiency. Our methodology demonstrates its potential to create advanced and custom multimodal pipelines within limited computational resources. Extensive testing across 23 varied multimodal datasets indicates the promise and utility of our framework in diverse scenarios. The results contribute to the ongoing efforts in the AutoML field, suggesting new possibilities for efficiently handling complex multimodal data. This research represents a step towards developing more efficient and versatile tools in multimodal machine learning pipeline development, acknowledging the collaborative and ever-evolving nature of this field.
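The pipeline the abstract describes can be caricatured in a few lines. Everything below is a hypothetical sketch, not the paper's implementation: stand-in "encoders" play the role of pre-trained transformers mapping each modality to a fixed-size embedding, the embeddings are concatenated into one representation, and the search is warm-started from the best-known configuration of the most similar previously seen dataset, a simple form of the meta-learning the authors describe.

```python
import math

def fake_text_encoder(text):     # stand-in for a pre-trained language transformer
    return [len(text) / 10.0, text.count(" ") / 5.0]

def fake_image_encoder(pixels):  # stand-in for a pre-trained vision model
    return [sum(pixels) / len(pixels), max(pixels) / 255.0]

def embed_multimodal(text, pixels):
    """Unify modalities by concatenating their encoder embeddings."""
    return fake_text_encoder(text) + fake_image_encoder(pixels)

def warm_start(embedding, meta_store):
    """Return the best-known config of the nearest meta-dataset (L2 distance),
    used as the starting point for Bayesian optimization."""
    def dist(e):
        return math.sqrt(sum((a - b) ** 2 for a, b in zip(embedding, e)))
    nearest = min(meta_store, key=lambda entry: dist(entry["embedding"]))
    return nearest["best_config"]

# Tiny illustrative meta-store of previously optimized datasets.
meta_store = [
    {"embedding": [0.5, 0.2, 100.0, 0.9], "best_config": {"lr": 1e-3, "layers": 2}},
    {"embedding": [3.0, 1.0, 20.0, 0.3], "best_config": {"lr": 1e-4, "layers": 4}},
]

emb = embed_multimodal("a short caption", [20, 30, 10])
config = warm_start(emb, meta_store)
print(config)  # initial configuration handed to the Bayesian optimizer
```

Starting the optimizer from a configuration that worked on a similar dataset, rather than from scratch, is what saves the search the expensive early exploration that Neural Architecture Search would otherwise spend.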