Tatyana Soldatova Ruzsics was a member of the URPP Language and Space from 2016 to 2021. She was a doctoral student under the supervision of Tanja Samardžić (co-supervision with Rico Sennrich) in the project Upstream Text Processing. Her research interests include deep learning methods for upstream NLP processing: writing normalization, lemmatization, morphological segmentation and morphological reinflection.
She successfully defended her doctorate on the topic Multi-level Modelling for Upstream Text Processing on April 29, 2021 (link to Zora entry).
The PhD project addresses the notion of morphological richness of languages in a large-scale morphological typological analysis using massively parallel corpora. Morphologically rich languages express multiple levels of information at the word level, so they are expected to show greater variation in word types and, in turn, lower frequencies of individual word types. Measures based on the distribution of word types can therefore differentiate between morphologically rich and morphologically poor languages. However, the distribution of word types is only a partial indicator, since it does not distinguish between morphological and lexical diversity. By contrast, a comparison based on word alignments, i.e. how many words in one language correspond to a word type in another language, is expected to distinguish between these two kinds of diversity. Given that word boundaries are an uncertain phenomenon, monolingual tests using different definitions of the word will be performed for a subset of languages.
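The two kinds of measure described above can be illustrated with a minimal sketch. It is not the project's actual method: the function names `type_token_ratio` and `alignment_fanout`, the toy token lists, and the invented fused word forms (`walko`, `walkas`, ...) are all assumptions made for illustration only. The first function computes a simple type-token ratio as one possible measure based on the distribution of word types; the second counts, for each word type in one language, how many distinct word types it is aligned to in another language.

```python
from collections import defaultdict


def type_token_ratio(tokens):
    """Number of distinct word types divided by the number of tokens."""
    if not tokens:
        raise ValueError("tokens must be non-empty")
    return len(set(tokens)) / len(tokens)


def alignment_fanout(aligned_pairs):
    """For each source word type, count the distinct target word types
    it is aligned to. `aligned_pairs` is an iterable of (source, target)."""
    targets = defaultdict(set)
    for source, target in aligned_pairs:
        targets[source].add(target)
    return {source: len(forms) for source, forms in targets.items()}


# Toy data, invented for illustration (not taken from any corpus).
# The "poor" text uses separate pronouns; the "rich" text fuses them
# into the verb, so every token is a distinct word type.
poor_tokens = "i walk you walk we walk they walk".split()
rich_tokens = "walko walkas walkamo walkan".split()

# Hypothetical word alignments between the two toy texts.
pairs = [
    ("i", "walko"), ("walk", "walko"),
    ("you", "walkas"), ("walk", "walkas"),
    ("we", "walkamo"), ("walk", "walkamo"),
    ("they", "walkan"), ("walk", "walkan"),
]

poor_ttr = type_token_ratio(poor_tokens)
rich_ttr = type_token_ratio(rich_tokens)
fanout = alignment_fanout(pairs)
```

In this toy setting the fused text has the higher type-token ratio, and the single type `walk` is aligned to four distinct forms, which is the sort of signal an alignment-based comparison is expected to pick up. Real parallel corpora would require tokenization and automatic word alignment, neither of which is shown here.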
The project will further focus on the spatial variation of the resulting morphological richness structure. Mapping the geographical distribution of measures of similarity and difference between languages is one of the objectives of contemporary typology. The proposed research will thus provide tools and materials for addressing language contact effects and for potential further investigations of language evolution. The use of corpora is a valuable contribution to this research field, since most current work is based on grammars.
The main research questions can therefore be expressed as:
Tanja Samardžić, Balthasar Bickel, Martin Volk