Exploring the (Lack of) Cultural Diversity in Multilingual Datasets for NLP

Lea Krause

PhD candidate at Vrije Universiteit Amsterdam

The project addresses the critical need for cultural diversity in the multilingual datasets used to train and evaluate language models and conversational agents. Current practice often involves translating English-centric content, which limits the cultural authenticity of these datasets and their applicability across regions. For example, a dataset that prominently features questions about American football may test knowledge that has little relevance in many parts of the world. The project conducts a comprehensive evaluation of existing datasets, focusing on linguistic and cultural representation, and develops methodologies for incorporating authentic cultural nuances. In doing so, it seeks to ensure that NLP systems can effectively serve and reflect the diverse global community, ultimately contributing to more inclusive and culturally aware AI technologies.

Keywords: Artificial Intelligence

Scientific area: Image Perception, Artificial Intelligence

Bio: Lea Krause is a PhD student at the Computational Linguistics and Text Mining Lab at the Vrije Universiteit Amsterdam, under the supervision of Piek Vossen. She holds a master’s degree in Speech and Language Processing from the University of Edinburgh. Currently, she is involved in the Make Robots Talk project and participates in the Explainability track of the Hybrid Intelligence Centre. Her research centres on contextualising conversational AI, informed by pragmatic theory, with a particular emphasis on perspectives and the accurate communication of data and model uncertainty.

Visiting period: April to June 2024 at the Oxford Internet Institute (OII), University of Oxford