Detailed documentation of the corpus was published with the first release (download here (PDF, 317 KB)). What follows is an overview of the main features of Release 2 and of the corrections made since the previous release.
POS tagging. This task was carried out as follows:
Glitches and inconsistencies were corrected in two phases, as described in the following sections.
First correction phase
The first correction phase ran until October 2018. Some transcription mistakes and inconsistencies were corrected (for example, schsch → sch, qu → kw, sp → schp). Further corrections concerned the spelling of schwa (initially ä, corrected to e) in the files 1008, 1055, 1138, 1188, 1189 and 1205, and the change from ò to o in 1008.
Second correction phase
The second correction phase took place in the second half of 2019. Below is an overview of the various issues that were addressed in this phase.
A thorough analysis revealed some remaining errors in the transcribed forms. Most of them are strings that contain a parenthesis followed by the $ sign. The cause of the error can be explained as follows: in the .exb file that represents the transcription performed with the EXMARaLDA tool, parentheses are used by the transcriber as implicit annotation to delimit an unclear sequence (tag <unclear> in the .xml file), whereas the $ is used as hesitation marker after a sequence such as ää (tag <vocal> in the .xml file). The Python script used to convert the .exb format into .xml is unable to process an utterance in which a vocalized unit is contained in an unclear sequence. An example of such an occurrence is the erroneous output (aso_rot_ä$ in ID 1044, which results from the following .exb annotation:
<event start="T486" end="T487">si wüssed (aso rot ä$)
khulturbolschewismus hed s uf e ganz e wiite
khulturberiich erschtreggt oder so </event>
This type of error affects the structure of the .xml file. The actual output is as follows:
<vocal>
<desc xml:id="d1044-u535-w3">(aso_rot_ä$</desc>
</vocal>
However, the output should be as follows:
<unclear>
<w normalised="also" tag="ADV" xml:id="d1044-u535-w3">aso</w>
<w normalised="rot" tag="ADJD" xml:id="d1044-u535-w4">rot</w>
<vocal>
<desc xml:id="d1044-u535-w5">ä</desc>
</vocal>
</unclear>
Producing this output would require adding elements to the utterance and therefore renumbering its words: since the words w4 and w5 would be added, the following word would need the index 6 instead of the original 4. It was therefore decided to correct these instances manually by enclosing the existing <vocal> element in opening and closing <unclear> tags. This treats the entire unclear sequence as a single vocalized element and leaves the numbering intact. For the example above, the output is then as follows:
<unclear>
<vocal>
<desc xml:id="d1044-u535-w3">(aso_rot_ä$</desc>
</vocal>
</unclear>
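The corrections described above were made manually. Purely as an illustration, the same wrapping step can be sketched in Python. The function name wrap_unclear_vocals, the enclosing <u> element and the absence of an XML namespace are assumptions made for this sketch, not details taken from the corpus tooling.

```python
import xml.etree.ElementTree as ET


def wrap_unclear_vocals(root):
    """Wrap every <vocal> whose <desc> text looks like '(...$' in <unclear>.

    Hypothetical helper: it assumes un-namespaced tag names and returns
    the number of elements that were wrapped.
    """
    wrapped = 0
    # ElementTree has no parent pointers, so treat each element as a
    # potential parent and inspect its direct children.
    for parent in list(root.iter()):
        if parent.tag == "unclear":
            continue  # children of <unclear> are already wrapped
        for index, child in enumerate(list(parent)):
            if child.tag != "vocal":
                continue
            desc = child.find("desc")
            text = (desc.text or "") if desc is not None else ""
            if text.startswith("(") and text.endswith("$"):
                unclear = ET.Element("unclear")
                # Keep any trailing text or whitespace in its place.
                unclear.tail = child.tail
                child.tail = None
                parent.remove(child)
                unclear.append(child)
                parent.insert(index, unclear)
                wrapped += 1
    return wrapped


# The <vocal> element is the one quoted above; the <u> wrapper is assumed.
sample = (
    "<u>"
    '<vocal><desc xml:id="d1044-u535-w3">(aso_rot_ä$</desc></vocal>'
    "</u>"
)
root = ET.fromstring(sample)
count = wrap_unclear_vocals(root)
print(count, ET.tostring(root, encoding="unicode"))
```

Because the <desc> element and its xml:id are left untouched, the word numbering of the utterance does not change.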
The remaining errors are:
five occurrences of a question mark after a word (ID 1082_2, 1082_3, 1121 and 1147). These have been corrected by means of a JSON file, using the same procedure as the Archimob correction page.7
one occurrence of a w element that consists of a period sign (ID 1147), whereby the element is the only one in an utterance by a person with the ID otherPerson
Two empty elements that were originally hyphens, which were later removed: one in ID 1228, resulting from the expression fliegerbeobachtungs- und meldedienst, and one in ID 1248, where the element is the only one in an utterance by the interviewed person. Such empty elements are problematic for the KALDI tool, which cannot process them and crashes. Re-introducing the removed hyphens would not prevent KALDI from crashing. The empty elements have therefore been deleted, since this does not compromise the order of the sound files.
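As a rough illustration of the last point, empty w elements could be located and deleted with a short Python sketch such as the one below. The function name remove_empty_words, the <u> wrapper and the IDs in the sample are hypothetical and chosen only for this example. The sketch is not the procedure actually used for the release.

```python
import xml.etree.ElementTree as ET

# ElementTree exposes the xml:id attribute under the XML namespace URI.
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"


def remove_empty_words(root):
    """Delete <w> elements with no text and no children.

    Hypothetical helper: returns the xml:id values of the removed
    elements so that the deletions can be logged and reviewed.
    """
    removed = []
    for parent in list(root.iter()):
        for child in list(parent):
            is_empty = not (child.text or "").strip() and len(child) == 0
            if child.tag == "w" and is_empty:
                removed.append(child.get(XML_ID))
                # remove() also drops the element's tail text, which is
                # assumed here to be whitespace only.
                parent.remove(child)
    return removed


# Hypothetical utterance: the second <w> is empty.
sample = (
    "<u>"
    '<w xml:id="dX-u1-w1">und</w>'
    '<w xml:id="dX-u1-w2"></w>'
    '<w xml:id="dX-u1-w3">meldedienst</w>'
    "</u>"
)
root = ET.fromstring(sample)
removed_ids = remove_empty_words(root)
print(removed_ids)
```

Returning the removed IDs rather than deleting silently makes it possible to check that only the intended elements were affected.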
Some elements in the XML files had an empty normalised attribute. The missing values have been added manually by means of a JSON file.
In the entire file 1163.xml, the normalised attribute was “xxx”, due to a misalignment during the import of normalized forms into the XML file. This has been corrected.
For the purpose of anonymization, some sound segments have been deleted. However, entire blocks of segments in doc 1188 appear to have been deleted even though they do not contain instances of anonymization. Although the cause of this glitch could not be determined, the missing sound files have been reinstated.
In the file person_file.xml, part of the XML archive available on the Archimob website, the following errors have been found and corrected:
Line 61: <person xml:id=PRos sex=f>, the recording number is missing (1073)
Line 66: <person xml:id=SErh sex=m>, the recording number is missing (1075)
Line 115: <person xml:id=DHan1963 sex=m>, the recording number is incorrect (it should be 1163)
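The JSON-based correction of empty normalised attributes mentioned above can be sketched as follows. The layout of the JSON file is not described here, so the mapping from xml:id to normalised form, the function name fill_empty_normalised, and the IDs and tags in the sample are all assumptions made for illustration.

```python
import json
import xml.etree.ElementTree as ET

# ElementTree exposes the xml:id attribute under the XML namespace URI.
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"


def fill_empty_normalised(root, corrections):
    """Fill empty or absent normalised attributes on <w> elements.

    `corrections` is an assumed mapping {xml:id: normalised form}.
    Returns (filled, missing): the IDs that were corrected and the IDs
    that are still empty because no correction was supplied.
    """
    filled, missing = [], []
    for w in root.iter("w"):
        if w.get("normalised", "") != "":
            continue  # already has a normalised form
        word_id = w.get(XML_ID)
        if word_id in corrections:
            w.set("normalised", corrections[word_id])
            filled.append(word_id)
        else:
            missing.append(word_id)
    return filled, missing


# Hypothetical correction file content and hypothetical utterance.
corrections = json.loads('{"dX-u1-w2": "rot"}')
sample = (
    "<u>"
    '<w normalised="also" tag="ADV" xml:id="dX-u1-w1">aso</w>'
    '<w normalised="" tag="ADJD" xml:id="dX-u1-w2">rot</w>'
    '<w normalised="" tag="ADV" xml:id="dX-u1-w3">so</w>'
    "</u>"
)
root = ET.fromstring(sample)
filled, missing = fill_empty_normalised(root, corrections)
print(filled, missing)
```

Reporting the IDs that remain empty makes it easy to see which words still need a manual entry in the correction file.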
Archimob (archives de la mobilisation): http://www.archimob.ch
https://www.spur.uzh.ch/en/departments/research/textgroup/ArchiMob.html
Archimob correction page, accessible only through the UZH net: http://linguistik-web.uzh.ch:4000/correct_archimob