Language and Space Resources

Finding past language contact
Exploring correlations in genetic and cultural variation
Finding contact in phylogenetic trees
Revealing Swiss German dialect regions
Glottography - Mapping the Languages of the World
Does phylogeography work?
Revealing the evolutionary history of diffusion processes
Evidence for Britain and Ireland as a linguistic area
The first law of Geography
Detecting contact in linguistic phylogenetic trees
Text data and linguistic diversity
Subword text processing
Swiss German text processing
Text processing for Bosnian, Croatian, Montenegrin, Serbian (BCMS)
Empirical research methods (teaching materials)
The essence of machine learning for linguists in tech
Workshop on regional markedness in text

Finding past language contact

Description	Contact between speakers of two or more languages can leave traces in the linguistic record and reveal geographic areas of past human interaction. The sBayes algorithm finds these areas in cultural data.
Publication	https://doi.org/10.1098/rsif.2020.1031
Code & Data	Python package and case study: https://github.com/NicoNeureiter/sBayes

Exploring correlations in genetic and cultural variation

Description	In this study, peoples with similar genes are found to also share similar grammar, though not necessarily similar words or sounds, suggesting that grammar may serve as a cultural marker of population connections beyond recent contact or descent.
Publication	https://www.science.org/doi/10.1126/sciadv.abd9223
Code & Data	Case study: https://github.com/derpetermann/music_languages_genes
Media	UZH: https://www.news.uzh.ch/en/articles/2021/Grammar.html Video: https://youtu.be/bcE3-Xm9CIY

Finding contact in phylogenetic trees

Description	Phylogenetic trees show how languages have diversified over time but ignore a central aspect of language evolution: contact. The contacTrees model can infer both a phylogenetic tree and contact events, in which one language borrowed linguistic traits from another.
Publication	https://doi.org/10.1057/s41599-022-01211-7
Code & Data	BEAST 2 package and case study: https://github.com/NicoNeureiter/contacTrees

Revealing Swiss German dialect regions

Description	In this paper, a Bayesian clustering method from evolutionary biology was applied to Swiss German dialect data, revealing five distinct morphosyntactic populations that align with traditional dialect regions and supporting a gradual dialect continuum in Swiss German.
Publication	https://doi.org/10.1017/jlg.2021.12
Media	Department of Geography, UZH: https://www.geo.uzh.ch/en/news/papers/2022/2022-07-schweizerdeutsche-grammatik.html

Glottography - Mapping the Languages of the World

Description	Glottography is a geodata platform for mapping the world’s languages. Glottography represents the geographic locations of languages as polygons, along with relevant metadata, including Glottocodes that uniquely identify each language.
Publication	Not yet published
Code & Data	Data and web mapping service: https://github.com/Glottography

Does phylogeography work?

Description	This study shows that Bayesian phylogeography struggles to reconstruct human migrations, such as relocations due to conflict, but effectively captures gradual language expansions, which produce distinct phylogenetic patterns that are absent when speakers migrate.
Publication	https://doi.org/10.1098/rsos.201079
Code & Data	Case study, https://github.com/NicoNeureiter/drifting_into_nowhere/ and https://zenodo.org/records/4279082

Revealing the evolutionary history of diffusion processes

Description	This study presents a Bayesian model to reconstruct the historical spread of phenomena with known distributions at two points in time, applied here to the spread of Indo-European languages in South America. The model infers possible evolutionary histories, offering a general approach for analysing diffusion processes from incomplete data.
Publication	https://doi.org/10.4230/LIPIcs.GIScience.2023.71
Code & Data	R and C++ package and case study, https://github.com/takuya-tkhs/sBread

Evidence for Britain and Ireland as a linguistic area

Description	This study combines qualitative and quantitative methods to study linguistic area formation, showing that languages in Britain and Ireland exhibit significant linguistic similarity, regardless of ancestry, across space, time, and sociocultural settings.
Publication	https://muse.jhu.edu/article/733280

The first law of Geography

Description

Why are near things more similar than distant ones? And why can this be a blessing and a nuisance? Three videos explain the first law of Geography.
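
The idea that near things are more similar than distant ones can be quantified as spatial autocorrelation. The following minimal Python sketch computes Moran's I, one standard measure of it. It is not taken from the videos; the function name `morans_i`, the binary distance-band weighting, and the toy data are assumptions made for illustration.

```python
import math


def morans_i(values, coords, bandwidth):
    """Moran's I with binary distance-band weights (hypothetical helper).

    Two distinct locations count as neighbours (weight 1) when their
    Euclidean distance is at most `bandwidth`; otherwise the weight is 0.
    """
    n = len(values)
    mean = sum(values) / n
    dev = [v - mean for v in values]

    cross = 0.0       # sum of w_ij * dev_i * dev_j over neighbouring pairs
    weight_sum = 0.0  # sum of all weights w_ij
    for i in range(n):
        for j in range(n):
            if i != j and math.dist(coords[i], coords[j]) <= bandwidth:
                cross += dev[i] * dev[j]
                weight_sum += 1.0

    variance_sum = sum(d * d for d in dev)
    if weight_sum == 0 or variance_sum == 0:
        raise ValueError("need at least one neighbour pair and non-constant values")
    return (n / weight_sum) * (cross / variance_sum)


# Four locations on a line, one unit apart (toy data).
coords = [(0, 0), (1, 0), (2, 0), (3, 0)]

smooth = morans_i([1, 2, 3, 4], coords, bandwidth=1.0)       # positive (about 0.33)
alternating = morans_i([1, 4, 1, 4], coords, bandwidth=1.0)  # negative (about -1.0)
print(smooth, alternating)
```

A positive value means neighbouring locations tend to have similar values (the "blessing": nearby observations help predict each other), while the same dependence violates the independence assumption of many standard statistical tests (the "nuisance").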

Media

Video 1: https://youtu.be/6T1A4l0pcWE?si=b4_XLuANMNZV9bHX

Video 2: https://youtu.be/CptZdD78MV4?si=9j2lzFPMRMGSJP-8

Video 3: https://youtu.be/u-U-FWKWfWQ?si=kcR5uoeNkMtF8G0S

Detecting contact in linguistic phylogenetic trees

Description

The contacTrees model is a Bayesian phylogenetic method that incorporates language contact, addressing the limitations of traditional phylogenetic methods that assume languages evolve independently. By accounting for horizontal transfer, it improves the accuracy of reconstructing language family trees and contact events, offering a more nuanced approach to studying language and cultural evolution.

Publication

https://www.nature.com/articles/s41599-022-01211-7

Code & Data

BEAST 2 package, https://github.com/NicoNeureiter/contacTrees

Case study, https://github.com/NicoNeureiter/contacTrees-IndoEuropean

Simulation study, https://github.com/NicoNeureiter/contacTrees-SimulationStudy

Text data and linguistic diversity

Description

Today, over 7,000 languages are spoken worldwide, cataloged and partially described in linguistic resources like Glottolog and WALS. This study develops a quantifiable and reproducible method for describing languages using text data, offering new insights into mapping linguistic diversity.
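
As a rough illustration of what describing a language from text data can look like, the Python sketch below computes two simple corpus statistics: the type-token ratio and the unigram word entropy. These measures, the function name `text_measures`, and the whitespace tokenisation are assumptions chosen for brevity; they are not claimed to be the method developed in the publications listed for this entry.

```python
import math
from collections import Counter


def text_measures(text):
    """Return (type-token ratio, unigram entropy in bits) for a text.

    Tokenisation is a plain lowercase whitespace split, which is a
    simplifying assumption and unsuitable for many writing systems.
    """
    tokens = text.lower().split()
    if not tokens:
        raise ValueError("text contains no tokens")

    counts = Counter(tokens)
    total = len(tokens)

    # Type-token ratio: distinct word forms divided by running words.
    ttr = len(counts) / total

    # Shannon entropy of the word-frequency distribution.
    entropy = -sum((c / total) * math.log2(c / total) for c in counts.values())
    return ttr, entropy


ttr, entropy = text_measures("the dog saw the cat and the cat saw the dog")
print(f"type-token ratio: {ttr:.2f}, unigram entropy: {entropy:.2f} bits")
```

Because both numbers are computed directly from raw text, the same procedure can be rerun on any corpus, which is what makes text-based descriptions quantifiable and reproducible.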

Publications

https://aclanthology.org/2021.eacl-main.302/

https://aclanthology.org/2022.lrec-1.123/

https://aclanthology.org/2022.conll-1.18/

https://aclanthology.org/2023.cl-4.5/

https://aclanthology.org/2024.findings-naacl.213/

Code & Data

TeDDi tools https://github.com/MorphDiv/TeDDi_sample

Analysis of language spaces https://github.com/MorphDiv/transfer-lang

Information theory measures over BPE merges https://github.com/ximenina/theturningpoint

LangDive Python library

Subword text processing

Description

This project enhances neural sequence-to-sequence models for NLP preprocessing tasks by incorporating structural signals from multiple text layers, improving performance in tasks like machine translation and speech recognition.

Publication

https://aclanthology.org/K17-1020/

Code & Data

Subword segmentation with synchronised decoding https://github.com/tatyana-ruzsics/uzh-corpuslab-morphological-segmentation

Interpretable reinflection https://github.com/tatyana-ruzsics/interpretable-inflection

Swiss German text processing

Description

The ArchiMob corpus represents German varieties spoken on the territory of Switzerland. It is the first electronic resource containing long samples of transcribed text in Swiss German, intended for studying the spatial distribution of morphosyntactic features and for natural language processing.

Publication

https://link.springer.com/article/10.1007/s10579-019-09457-5

Code & Data

Code:

https://github.com/Christof93/archimob_tools

https://github.com/yunigma/Kaldi-for-ASR-of-Swiss-German

https://github.com/tannonk/two-headed-master

https://github.com/tatyana-ruzsics/uzh-corpuslab-normalization

https://github.com/tatyana-ruzsics/uzh-corpuslab-pos-normalization

Data:

https://doi.org/10.48656/496p-3w34

https://doi.org/10.48656/brdm-ht43

Text processing for Bosnian, Croatian, Montenegrin, Serbian (BCMS)

Description

Resources created and shared through a transnational cooperation that began with the ReLDI institutional partnership (SNSF data portal: https://data.snf.ch/grants/grant/160501).

Publication

https://doi.org/10.1515/9783110767377-017

Code & Data

Code:

https://github.com/clarinsi/classla

https://github.com/clarinsi/tweetcat

Data:

http://hdl.handle.net/11356/1792

http://hdl.handle.net/11356/1793

http://hdl.handle.net/11356/1794

http://hdl.handle.net/11356/1843

http://hdl.handle.net/11356/1422

http://hdl.handle.net/11356/1708

Empirical research methods (teaching materials)

Description

The course, "Revisiting research training in linguistics: theory, logic, method," is part of a pilot program funded by Movetia and offered by the Universities of Zurich, Geneva, and Belgrade. It aims to enhance scientific and research skills, particularly for BA and MA students in linguistics and language-related fields, though it may also benefit a wider audience interested in these areas.

Open EdX

https://apps.elearn.mnf.uzh.ch/learning/course/course-v1:PHIL+Movetia101+2022/home

The essence of machine learning for linguists in tech

Description

This learning block is a guide to acquiring the core machine learning notions needed by students of language and linguists who plan to work with engineers and scientists, or by anyone with a similar background and interests.

Its intended use is supervised study, whereby a student learns actively under the supervision of a teacher.

Access link

https://upskillsproject.eu/project/machine_learning/

Workshop on regional markedness in text

Description

A gentle introduction to analysing corpora. It covers which South Slavic corpora are available in the CLARIN.SI repository and how to find comparable corpora; how to explore corpora with the noSketchEngine and KonText concordancers; how to query corpora using CQL (Corpus Query Language) syntax; and how to analyse gender marking in each South Slavic corpus and interpret the results with respect to gender bias in society.

Materials

https://github.com/clarinsi/workshop_reg_mark
