The project applies information theory, statistical modelling and machine learning to the study of language adaptation, using linguistic data extracted from multilingual corpora. In addition to the theoretical findings, the project will provide a data set of text samples from 100 languages, facilitating the future use of corpus-based computational methods in scientific approaches to linguistic diversity and change.
Publication about our project in the EU Research magazine, aimed at a wide audience. Spring 2022. Link to the full issue
Steven Moran, Christian Bentz, Ximena Gutierrez-Vasques, Olga Sozinova and Tanja Samardzic. 2022. "TeDDi Sample: Text Data Diversity Sample for Language Comparison and Multilingual NLP". In Proceedings of The International Conference on Language Resources and Evaluation (LREC), Marseille, France. 20–25 June 2022.
Ximena Gutierrez-Vasques, Christian Bentz, Olga Sozinova and Tanja Samardzic. 2021. "From characters to words: the turning point of BPE merges". European Chapter of the Association for Computational Linguistics, Long Papers.
Tatyana Ruzsics, Olga Sozinova, Ximena Gutierrez-Vasques and Tanja Samardzic. 2021. "Interpretability for morphological inflection: from character-level predictions to subword-level rules". European Chapter of the Association for Computational Linguistics, Long Papers.