URPP Language and Space (2013-2024)

Language and Space Resources

Finding past language contact

Description Contact between speakers of two or more languages can leave traces in the linguistic record and reveal geographic areas of past human interaction. The sBayes algorithm finds these areas in cultural data.
Publication https://doi.org/10.1098/rsif.2020.1031
Code & Data

Python package and case study: https://github.com/NicoNeureiter/sBayes

 

Exploring correlations in genetic and cultural variation

Description In this study, peoples with similar genes are found to also share similar grammar, though not necessarily similar words or sounds, suggesting that grammar may serve as a cultural marker of population connections beyond recent contact or descent.
Publication https://www.science.org/doi/10.1126/sciadv.abd9223
Code & Data

Case study: https://github.com/derpetermann/music_languages_genes

Media

UZH: https://www.news.uzh.ch/en/articles/2021/Grammar.html

Video: https://youtu.be/bcE3-Xm9CIY

 

Finding contact in phylogenetic trees

Description Phylogenetic trees show how languages have diversified over time but ignore a central aspect of language evolution: contact. The contacTrees model infers both a phylogenetic tree and contact events, in which one language borrowed linguistic traits from another.
Publication https://doi.org/10.1057/s41599-022-01211-7
Code & Data

Beast 2 package and case study: https://github.com/NicoNeureiter/contacTrees

 

Revealing Swiss German dialect regions

Description In this paper, a Bayesian clustering method from evolutionary biology was applied to Swiss German dialect data, revealing five distinct morphosyntactic populations that align with traditional dialect regions and supporting a gradual dialect continuum in Swiss German.
Publication https://doi.org/10.1017/jlg.2021.12
Media Department of Geography, UZH https://www.geo.uzh.ch/en/news/papers/2022/2022-07-schweizerdeutsche-grammatik.html

 

Glottography - Mapping the Languages of the World

Description Glottography is a geodata platform for mapping the world’s languages. Glottography represents the geographic locations of languages as polygons, along with relevant metadata, including Glottocodes that uniquely identify each language.
Publication Not yet published
Code & Data Data and web mapping service: https://github.com/Glottography

 

Does phylogeography work?

Description This study shows that Bayesian phylogeography struggles to reconstruct human migrations, such as relocations due to conflict, but effectively captures gradual language expansions, which produce distinct phylogenetic patterns that are absent when speakers migrate.
Publication https://doi.org/10.1098/rsos.201079
Code & Data Case study: https://github.com/NicoNeureiter/drifting_into_nowhere/ and https://zenodo.org/records/4279082

 

Revealing the evolutionary history of diffusion processes

Description This study presents a Bayesian model to reconstruct the historical spread of phenomena with known distributions at two points in time, applied here to the spread of Indo-European languages in South America. The model infers possible evolutionary histories, offering a general approach for analysing diffusion processes from incomplete data.
Publication https://doi.org/10.4230/LIPIcs.GIScience.2023.71
Code & Data R and C++ package and case study, https://github.com/takuya-tkhs/sBread

 

Evidence for Britain and Ireland as a linguistic area

Description This study combines qualitative and quantitative methods to study linguistic area formation, showing that languages in Britain and Ireland exhibit significant linguistic similarity, regardless of ancestry, across space, time, and sociocultural settings.
Publication https://muse.jhu.edu/article/733280  

 

The first law of Geography

Description Why are near things more similar than distant ones? And why can this be a blessing and a nuisance? Three videos explain the first law of Geography.
Media

Video 1: https://youtu.be/6T1A4l0pcWE?si=b4_XLuANMNZV9bHX

Video 2: https://youtu.be/CptZdD78MV4?si=9j2lzFPMRMGSJP-8

Video 3: https://youtu.be/u-U-FWKWfWQ?si=kcR5uoeNkMtF8G0S

 

Detecting contact in linguistic phylogenetic trees

Description The contacTrees model is a Bayesian phylogenetic method that incorporates language contact, addressing the limitations of traditional phylogenetic methods that assume languages evolve independently. By accounting for horizontal transfer, it improves the accuracy of reconstructing language family trees and contact events, offering a more nuanced approach to studying language and cultural evolution.
Publication https://www.nature.com/articles/s41599-022-01211-7
Code & Data

Beast 2 package: https://github.com/NicoNeureiter/contacTrees

Case study: https://github.com/NicoNeureiter/contacTrees-IndoEuropean

Simulation study: https://github.com/NicoNeureiter/contacTrees-SimulationStudy

 

Text data and linguistic diversity

Description Today, over 7,000 languages are spoken worldwide, cataloged and partially described in linguistic resources like Glottolog and WALS. This study develops a quantifiable and reproducible method for describing languages using text data, offering new insights into mapping linguistic diversity.
Publications

https://aclanthology.org/2021.eacl-main.302/

https://aclanthology.org/2022.lrec-1.123/

https://aclanthology.org/2022.conll-1.18/

https://aclanthology.org/2023.cl-4.5/

https://aclanthology.org/2024.findings-naacl.213/

Code & Data

TeDDi tools https://github.com/MorphDiv/TeDDi_sample

Analysis of language spaces https://github.com/MorphDiv/transfer-lang

Information theory measures over BPE merges https://github.com/ximenina/theturningpoint

LangDive Python library

 

Subword text processing

Description This project enhances neural sequence-to-sequence models for NLP preprocessing tasks by incorporating structural signals from multiple text layers, improving performance in tasks like machine translation and speech recognition.
Publication

https://aclanthology.org/K17-1020/

Code & Data

Subword segmentation with synchronised decoding https://github.com/tatyana-ruzsics/uzh-corpuslab-morphological-segmentation

Interpretable reinflection https://github.com/tatyana-ruzsics/interpretable-inflection

 

Swiss German text processing

Description The ArchiMob corpus represents German varieties spoken on the territory of Switzerland. It is the first electronic resource containing long samples of transcribed text in Swiss German, intended to be used for studying spatial distribution of morphosyntactic features and for natural language processing.
Publication

https://link.springer.com/article/10.1007/s10579-019-09457-5

Code & Data

Code:

https://github.com/Christof93/archimob_tools

https://github.com/yunigma/Kaldi-for-ASR-of-Swiss-German

https://github.com/tannonk/two-headed-master

https://github.com/tatyana-ruzsics/uzh-corpuslab-normalization

https://github.com/tatyana-ruzsics/uzh-corpuslab-pos-normalization

Data:

https://doi.org/10.48656/496p-3w34

https://doi.org/10.48656/brdm-ht43

 

Text processing for Bosnian, Croatian, Montenegrin, Serbian (BCMS)

Empirical research methods (teaching materials)

Description The course, "Revisiting research training in linguistics: theory, logic, method," is part of a pilot program funded by Movetia and offered by the Universities of Zurich, Geneva, and Belgrade. It aims to enhance scientific and research skills, particularly for BA and MA students in linguistics and language-related fields, though it may also benefit a wider audience interested in these areas.
Open EdX

https://apps.elearn.mnf.uzh.ch/learning/course/course-v1:PHIL+Movetia101+2022/home

 

The essence of machine learning for linguists in tech

Description

This learning block is a guide to acquiring the core machine learning notions needed by students of language and linguists who plan to work with engineers and scientists, or by anyone with a similar background and interests.

Its intended use is supervised study, whereby a student learns actively under the supervision of a teacher.
Access link

https://upskillsproject.eu/project/machine_learning/

 

Workshop on regional markedness in text

Description A gentle introduction to corpus analysis. It covers which South Slavic corpora are available in the CLARIN.SI repository and how to find comparable corpora; how to explore corpora with the noSketchEngine and KonText concordancers; how to query them using CQL (Corpus Query Language) syntax; and how to analyse gender marking in each South Slavic corpus and interpret the results with respect to gender bias in society.
Materials https://github.com/clarinsi/workshop_reg_mark