URPP Language and Space (2013-2024)

Non-randomness in Morphological Diversity: A Computational Approach Based on Multilingual Corpora

Project Description


The project applies information theory, statistical modelling and machine learning to the study of language adaptation, using linguistic data extracted from multilingual corpora. In addition to the theoretical findings, the project will provide a data set of text samples from 100 languages, facilitating the future use of corpus-based computational methods in the scientific study of linguistic diversity and change.

Project members:

  • Olga Sozinova (PhD student)
  • Ximena Gutierrez-Vasques (PostDoc)
  • Christian Bentz (PostDoc, external collaborator)
  • Steven Moran (PostDoc, external collaborator)
  • Tanja Samardžić (PI)

Funding: SNF grant #176305, 2018-2022.

Outputs of the project

Computational methods to describe languages

Publication about our project in the EU Research magazine, aimed at a wide audience. Spring 2022. Link to the full issue

EU Research publication


TeDDi Sample: Text Data Diversity Sample for Language Comparison and Multilingual NLP

Steven Moran, Christian Bentz, Ximena Gutierrez-Vasques, Olga Sozinova and Tanja Samardzic. 2022. "TeDDi Sample: Text Data Diversity Sample for Language Comparison and Multilingual NLP". In Proceedings of The International Conference on Language Resources and Evaluation (LREC), Marseille, France. 20-25 June 2022.


Collecting the TeDDi Sample

The turning point of BPE merges

Ximena Gutierrez-Vasques, Christian Bentz, Olga Sozinova and Tanja Samardzic. 2021. "From characters to words: the turning point of BPE merges". European Chapter of the Association for Computational Linguistics, Long Papers.

Interpretability for morphological inflection

Tatyana Ruzsics, Olga Sozinova, Ximena Gutierrez-Vasques and Tanja Samardzic. 2021. "Interpretability for morphological inflection: from character-level predictions to subword-level rules". European Chapter of the Association for Computational Linguistics, Long Papers.

Further Information

T. Samardzic: Text-based measures of language similarity

Workshop: Corpus-based and Computational Approaches to Linguistic Variation.

27-28 April 2022, University of Helsinki.

X. Gutierrez: From characters to words -- the turning point of BPE merges

URPP Language and Space: Looking back and looking forward

26 January, 2021

O. Sozinova: Geometry of linguistic morphology

URPP Language and Space: Looking back and looking forward

26 January, 2021


IWMLC 2019

Measuring inflectional and derivational complexity

Olga Sozinova, Christian Bentz and Tanja Samardžić

Interactive Workshop on Measuring Language Complexity (IWMLC)

12-13 September 2019

Freiburg, Germany