Language Records
This page contains all codices, minor texts, and Bible translations from the Old and Middle Hungarian period digitized and annotated within the projects. The list of the texts in the corpus and their abbreviations used in the corpus query tool is available here.
To display all special characters, the Junicode font package must be installed.
For each text, a short description of its source, the text processing steps carried out on it, and its number of tokens are available. Remarks on spelling, locus markers, bracketing, punctuation, etc. are also provided where the text deviates from the general rules given in the corpus description.
For each text, the original orthographic form is provided in plain text and in PDF format. If the original text material is already available on the web, we link to it instead of creating a PDF version. If there is a normalized version of the text, it is also provided here in plain text format. For each text, a TSV file containing every text processing level and the metadata is also available. Blank lines mark sentence boundaries, and the tab-separated columns contain the following pieces of information:
- locus markers (in the first n columns, where n can be parsed from the first line of the TSV file);
- word form in its original orthographic form;
- word form in its normalized form;
- interpretation;
- verbal prefixes detached from the verb;
- remark;
- lemma of word form in its normalized form;
- morphological analysis.
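The column layout above can be read with a few lines of Python. The following is only a minimal sketch under stated assumptions: how n is encoded in the first line of the file is not described here, so the hypothetical function `read_sentences` takes the number of locus columns as a parameter and expects the caller to have dealt with the first line already. The field names in `FIELDS` are illustrative labels chosen for this sketch, not names defined by the corpus.

```python
from typing import Dict, Iterable, Iterator, List

# Illustrative labels for the seven columns that follow the locus markers,
# in the order listed above. These names are assumptions of this sketch.
FIELDS = [
    "original",        # word form in its original orthographic form
    "normalized",      # word form in its normalized form
    "interpretation",
    "prefix",          # verbal prefix detached from the verb
    "remark",
    "lemma",
    "analysis",        # morphological analysis
]


def read_sentences(lines: Iterable[str], n_locus: int) -> Iterator[List[Dict[str, object]]]:
    """Yield sentences as lists of token dictionaries.

    `lines` are the token lines of a TSV file (any header line already
    handled by the caller); `n_locus` is the number of locus columns.
    """
    sentence: List[Dict[str, object]] = []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if not line.strip():
            # A blank line marks a sentence boundary.
            if sentence:
                yield sentence
                sentence = []
            continue
        cols = line.split("\t")
        values = cols[n_locus:]
        # Pad short rows so that every field is present (e.g. in files
        # without normalization or morphological annotation).
        values += [""] * (len(FIELDS) - len(values))
        token: Dict[str, object] = {"locus": cols[:n_locus]}
        token.update(zip(FIELDS, values))
        sentence.append(token)
    if sentence:
        # Emit the last sentence if the input does not end with a blank line.
        yield sentence
```

Taking `n_locus` as an argument keeps the sketch independent of the exact header format; a caller who knows how the first line encodes n can parse it and pass the value in.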
The original orthographic form of each text is available; however, only a smaller subcorpus has also been normalized and annotated morphologically.
Normalized versions of the following texts are available:
- Vienna Codex
- Birk Codex
- Bod Codex
- Czech Codex
- Festetics Codex
- Guary Codex
- Jókai Codex
- Jordánszky Codex (only the New Testament)
- Kazinczy Codex
- Booklet on the Dignity of the Apostles
- Miskolc Fragment
- Munich Codex
- the first part of the Székelyudvarhely Codex
- all miscellaneous minor texts, except for the Cisio
- all Middle Hungarian Bible translations
If there is morphological annotation, by default, it follows the rules written in the corpus description and in the list of morphological codes. The morphologically annotated version of some codices is also available in the CoNLL-U format applied within the Universal Dependencies and Morphology framework, in which word lines contain the annotation of a word in 10 fields separated by single tab characters, and blank lines mark sentence boundaries.
Word lines contain the following fields:
- ID: Word index, integer starting at 1 for each new sentence.
- FORM: Word form or punctuation symbol in its original orthographic form.
- LEMMA: Lemma or stem of word form in its normalized form.
- UPOSTAG: Universal part-of-speech tag following the Universal Dependencies and Morphology annotation scheme.
- XPOSTAG: The original morphological analysis.
- FEATS: List of morphological features from the Universal Dependencies and Morphology feature inventory.
- HEAD: Head of the current token following the Universal Dependencies and Morphology annotation scheme; currently empty.
- DEPREL: Dependency relation to the HEAD, following the Universal Dependencies and Morphology annotation scheme; currently empty.
- DEPS: List of secondary dependencies, following the Universal Dependencies and Morphology annotation scheme; currently empty.
- MISC: Any other annotation; currently empty.
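The ten-field layout described above can be read as follows. This is a minimal sketch, not the project's own tooling: the function names `read_conllu` and `parse_feats` are hypothetical, and the sketch assumes the usual CoNLL-U conventions that lines starting with `#` are comments, that an empty field is written as `_`, and that FEATS is a `|`-separated list of `Name=Value` pairs.

```python
from typing import Dict, Iterable, Iterator, List

# The ten CoNLL-U fields, in the order listed above.
CONLLU_FIELDS = [
    "ID", "FORM", "LEMMA", "UPOSTAG", "XPOSTAG",
    "FEATS", "HEAD", "DEPREL", "DEPS", "MISC",
]


def read_conllu(lines: Iterable[str]) -> Iterator[List[Dict[str, str]]]:
    """Yield sentences as lists of {field name: value} dictionaries."""
    sentence: List[Dict[str, str]] = []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if not line:
            # A blank line marks a sentence boundary.
            if sentence:
                yield sentence
                sentence = []
            continue
        if line.startswith("#"):
            # Assumed CoNLL-U convention: comment / sentence metadata line.
            continue
        cols = line.split("\t")
        if len(cols) != len(CONLLU_FIELDS):
            raise ValueError(f"expected 10 tab-separated fields, got {len(cols)}")
        sentence.append(dict(zip(CONLLU_FIELDS, cols)))
    if sentence:
        yield sentence


def parse_feats(feats: str) -> Dict[str, str]:
    """Split a FEATS value such as 'A=x|B=y' into a dictionary."""
    if not feats or feats == "_":
        return {}
    return dict(pair.split("=", 1) for pair in feats.split("|"))
```

Since HEAD, DEPREL, DEPS, and MISC are currently empty in these files, a consumer would typically use only the first six fields of each token.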
Morphologically annotated versions of the following texts are also available:
- Festetics Codex (also in CoNLL-U format)
- Guary Codex (also in CoNLL-U format)
- Jókai Codex (also in CoNLL-U format)
- Jordánszky Codex (only the New Testament)
- Booklet on the Dignity of the Apostles (also in CoNLL-U format)
- Munich Codex (also in CoNLL-U format)
Number of tokens with and without punctuation marks:
- original orthographic version: 3,224,515 with punctuation marks, 2,751,869 without;
- normalized subcorpus: 1,305,687 with punctuation marks, 1,049,019 without;
- morphologically annotated subcorpus: 285,070 with punctuation marks, 228,851 without.