CzeSL – a Learner Corpus of Czech

Available versions

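Sizes are given in thousands of tokens.

| Corpus | Non-native essays | Non-native theses | Ethnolect | Σ | Error tags | TH | Linguistic annotation (T0, T1, T2) | Metadata | Access | Year |
|---|---|---|---|---|---|---|---|---|---|---|
| CzeSL-plain | 1,315 | 732 | 428 | 2,475 | | | | | SD | 2012 |
| CzeSL-SGT | 1,147 | | | 1,147 | F | K | M, M | yes | SD | 2014 |
| CzeSL-man v0, a1 | 134 | | 192 | 326 | F+G | 2T | M, M | | SD | 2012 |
| CzeSL-man v0, a2 | 59 | | 149 | 208 | F+G | 2T | M, M | | S | 2012 |
| CzeSL-man v1 | 134 | | | 134 | F+G | T2 | M, M+S | yes | SD | 2016 |
| CzeSL-man v2 | 134 | | | 134 | F+G | 2T | M, M, M | yes | SD | 2020 |
| CzeSL-TH | 180 | | | 180 | | 2T | | yes | D | 2018 |
| CzeSL-MD | 12 | | | 12 | MD | T2 | | | D | 2018 |
| CzeSL-UD | 10 | | | 10 | | | M+S | | D | 2018 |
| CzeSL-GEC | ? | | ? | 20 | | 2T | | | D | 2017 |
| AKCES-GEC | 336 | | 168 | 504 | G | 2T | | | D | 2019 |
| CzeSL in TEITOK | 299 | | | 299 | F+I | 2T+ | M, M, M+S | yes | S | 2020 |
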
  • Tags: F – formal, G – grammar-based, MD – multi-dimensional, I – implicit
  • TH (target hypothesis): K – correction suggested by the proofing tool, 2T – successive corrections in the 2T scheme, T2 – correction at Tier 2, 2T+ – more than 2 successive corrections
  • Linguistic annotation: M – morphology (lemmas and morphosyntactic tags), S – syntax (structure and functions)
  • Access: S – searchable on-line, D – downloadable in full as a dataset
  • Year: year of the first release

CzeSL-plain

  • 12.4 thousand transcribed texts, 2 mil. words, 2.5 mil. tokens
  • Plain = without annotation, without metadata
  • Consists of 3 parts:
    • Texts written by foreign learners of Czech (ciz): 8,109 texts, 1,161 thousand tokens
    • Academic texts written by foreign students in Czech (kval): 174 texts, 732 thousand tokens
    • Texts written by Czech students with Romani background (rom), i.e. an ethnolect of Czech: 4,105 texts, 428 thousand tokens
  • Searchable from the Czech National Corpus site
  • Downloadable from the LINDAT/CLARIN data repository as two subcorpora:
    • AKCES 3 – includes texts produced by non-native students of Czech
    • AKCES 4 – includes texts produced by students growing up in socially excluded communities
  • Description in English and Czech

CzeSL-SGT

  • Czech as a Second Language with Spelling, Grammar and Tags
  • 8,617 transcribed texts, 111 thousand sentences, 1 mil. words, 1.1 mil. tokens
  • Extends the “foreign” (ciz) part of CzeSL-plain by texts collected in 2013
  • Original forms and automatic corrections are tagged, lemmatized and assigned error labels
  • Most texts have metadata attributes (30 items) about the author and the text
  • Searchable from the Czech National Corpus site; the metadata available from that site are in Czech
  • Downloadable as AKCES 5 (CzeSL-SGT) Release 2, with the metadata in English. The original release is still available as AKCES 5 (CzeSL-SGT). We recommend Release 2, which fixes a number of bugs, including a metadata issue: native speakers of Ukrainian were labelled as speakers of “another Indo-European language” rather than as speakers of a Slavic language. Release 2 is also a validated XML document, with all annotation represented as XML attributes.
  • Description in English and Czech

CzeSL-man

  • Includes texts collected, transcribed and manually annotated within the ESF project, see a description in English.
  • Annotation manual in Czech
  • Transcription manual in Czech
  • Appendix to Transcription manual in Czech
  • Transcription manual – summary in English; please note that formatting specifics concerning the markup of manuscripts in the transcription manual, its appendix and the summary are outdated.
CzeSL-man v0
  • Includes subsets of the ciz and rom parts of CzeSL-plain, i.e. the manually annotated Czech texts written by non-native learners and by speakers of the Romani ethnolect of Czech, a total of about 330 thousand tokens.
  • Texts of about 208 thousand tokens are annotated independently by two annotators.
  • Texts are searchable on-line via the SeLaQ tool, using the CzeSL-native format:
    • SeLaQ is a purpose-built corpus manager. See its menu for instructions.
    • Note that metadata and graphical display of links between annotated word tokens are not available in SeLaQ.
  • Downloadable from here in the feat format, with metadata
CzeSL-man v1
  • CzeSL-man v1 contains 645 texts (128 thousand tokens) from CzeSL-SGT, including 298 doubly annotated texts (59 thousand doubly annotated tokens).
  • Most texts are equipped with metadata about the author, the text and the annotation process.
  • CzeSL-man can be searched or downloaded. Although the searchable and the downloadable versions contain the same set of texts, they differ in how they represent the annotation.
  • CzeSL-man v1 downloadable:
    • This release is in the PML format, generated by the feat tool.
    • Each text with its annotation consists of several related files.
    • Some of the texts are independently annotated twice.
    • Also includes a flat version (files named *.vert); see CzeSL-man v2 below.
  • CzeSL-man v1 searchable:
    • Differs from both CzeSL-man v0 and CzeSL-man v1 downloadable in two respects:
      • There are no texts with alternative error annotation: each text is annotated by a single annotator.
      • The two-tier annotation scheme is radically modified to fit the token-based setup of the search tool.
    • Apart from that, the content and metadata are identical to CzeSL-man v1 downloadable and the search options to those of CzeSL-SGT.
    • The main feature in the annotation is the reversal of the source text and its annotation. The correction at Tier 2 is assumed to be the basis for the annotation. The tokens of this corpus represent the words at Tier 2. The original text is added as annotation of the Tier 2 tokens. Each token of the corrected text receives its corresponding Tier 0 form and a Tier 2 error label as attributes.
    • This annotation discards all Tier 1 corrections and error tags, and simplifies links between tokens at Tier 0 and Tier 2 that are not 1:1.
    • Tier 2 is parsed in a way similar to some other Czech corpora searchable in KonText, such as SYN2015.
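
The reversal described above can be pictured with a small Python sketch. It is an illustration only: the class and field names, the example sentence and the error label "agr" are assumptions made for this sketch, not the attribute names or tags used in the searchable corpus.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Tier2Token:
    """One corpus position in the token-based setup (hypothetical field names)."""
    form: str                   # corrected (Tier 2) word: the corpus token itself
    tier0_form: Optional[str]   # learner's original form, stored as an attribute
    error_label: Optional[str]  # error label, None where nothing was corrected


def source_text(tokens: List[Tier2Token]) -> str:
    """Rebuild the learner's original wording from the Tier 0 attributes."""
    return " ".join(t.tier0_form for t in tokens if t.tier0_form)


def corrections(tokens: List[Tier2Token]) -> List[Tuple[Optional[str], str, str]]:
    """List (original form, corrected form, label) for every labelled token."""
    return [(t.tier0_form, t.form, t.error_label) for t in tokens if t.error_label]


# Invented example: the corrected sentence is the token sequence,
# and the learner's form "hezký" survives only as an attribute.
SENTENCE = [
    Tier2Token("Viděl", "Viděl", None),
    Tier2Token("jsem", "jsem", None),
    Tier2Token("hezkou", "hezký", "agr"),
    Tier2Token("holku", "holku", None),
]

print(source_text(SENTENCE))
print(corrections(SENTENCE))
```

Because the corrected form is the token, queries in the search tool address Tier 2 words directly, and the source text has to be recovered from the attributes, as `source_text` does here.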
CzeSL-man v2
  • In this release the two-tier error annotation is represented as pairs of XML elements err and corr. An ill-formed portion of the source text is enclosed within the err structure, immediately followed by its correction, enclosed within the corr structure.
  • Apart from the error annotation, the content and metadata are the same as in CzeSL-man v1.
  • Linguistic annotation (tags and lemmas) is provided for all tokens at Tier 0 and Tier 2.
  • Downloadable from https://bitbucket.org/czesl/czesl-man/ (files named *.vert).
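
As a rough illustration of the err/corr pairing, the following Python sketch collects each ill-formed span together with the correction that follows it. It assumes a vertical layout with one token per line, the word form in the first tab-separated column, and structural tags on lines of their own. The sample lines and the column layout are assumptions made for this sketch; consult the actual *.vert files for the real attributes.

```python
def err_corr_pairs(vert_lines):
    """Collect (ill-formed tokens, corrected tokens) pairs from vertical-format lines."""
    pairs = []
    current = None        # "err", "corr", or None when outside both structures
    err, corr = [], []
    for line in vert_lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith("<err"):
            current, err = "err", []
        elif line.startswith("<corr"):
            current, corr = "corr", []
        elif line == "</err>":
            current = None
        elif line == "</corr>":
            pairs.append((err, corr))   # a corr structure closes the pair
            current = None
        elif line.startswith("<"):
            continue                    # other structural tags, e.g. <s> or <doc>
        elif current == "err":
            err.append(line.split("\t")[0])
        elif current == "corr":
            corr.append(line.split("\t")[0])
    return pairs


# Invented sample: word form and lemma separated by a tab (assumed columns).
VERT_SAMPLE = [
    "<s>",
    "Viděl\tvidět",
    "jsem\tbýt",
    "<err>",
    "hezký\thezký",
    "</err>",
    "<corr>",
    "hezkou\thezký",
    "</corr>",
    "holku\tholka",
    "</s>",
]

print(err_corr_pairs(VERT_SAMPLE))
```

The sketch treats every line starting with "<" as a structural tag, so a token that is itself "<" would need extra handling.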

CzeSL-TH

  • Includes a subset of CzeSL-SGT, hand-corrected, but not error-tagged, in 2017–2018, according to the 2T scheme.
  • The corpus includes about 1,300 texts (180 thousand tokens), selected from those that had not been manually error-annotated before (i.e. texts that are not part of CzeSL-man).
  • The selection was meant to make the manually annotated part of CzeSL more balanced in terms of L1 and CEFR level.
  • Downloadable from here in the feat format

CzeSL-MD

  • Includes a subset of texts from CzeSL-man, semi-automatically annotated by a multi-domain tagset with a focus on morphology, see the annotation manual (in Czech)
  • The dataset is available from https://bitbucket.org/czesl/czesl-md in the brat format
  • The texts can also be viewed and searched here using brat
  • To be extended to all CzeSL-man texts
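
For readers working with the downloaded dataset, here is a minimal Python sketch of reading text-bound annotations from a brat standoff (.ann) file. The layout assumed is brat's general one: ID, then label and character offsets, then the covered text, separated by tabs. The label "AGR" and the sample sentence are invented for this sketch and are not taken from the CzeSL-MD tagset.

```python
def parse_brat_ann(ann_text):
    """Return text-bound annotations (lines whose ID starts with 'T') as dicts."""
    spans = []
    for line in ann_text.splitlines():
        if not line.startswith("T"):
            continue  # skip relations, attributes, notes and other line types
        ann_id, label_and_offsets, covered = line.split("\t", 2)
        label, offsets = label_and_offsets.split(" ", 1)
        # Discontinuous spans are written as "start end;start end".
        fragments = [tuple(int(n) for n in frag.split()) for frag in offsets.split(";")]
        spans.append({"id": ann_id, "label": label,
                      "fragments": fragments, "text": covered})
    return spans


# Invented sample: one annotation over the word "hezký" (characters 11 to 16).
TEXT = "Viděl jsem hezký holku."
ANN = "T1\tAGR 11 16\thezký"

for span in parse_brat_ann(ANN):
    start, end = span["fragments"][0]
    print(span["label"], TEXT[start:end])
```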

CzeSL-UD

  • Texts from CzeSL-man with syntactic annotation according to the Universal Dependencies (UD) standard
  • Available from the LINDAT/CLARIN repository
  • CzeSL-man, CzeSL-MD and CzeSL-UD will eventually be merged into a single corpus with multiple types of annotation

CzeSL-GEC

  • CzeSL Grammatical Error Correction Dataset – manually annotated texts, converted to a format intended for NLP applications
  • Sentence pairs of original and corrected versions of Czech sentences collected from essays written by both non-native learners of Czech (CzeSL-man) and Czech pupils with Romani background
  • See also AKCES-GEC

AKCES-GEC

  • AKCES Grammatical Error Correction Dataset for Czech – a grammatical error correction corpus for Czech generated from a subset of AKCES. It contains train, dev and test files annotated in the M2 format. Unlike the CzeSL-GEC dataset, it provides separate edits together with their type annotations, and it has twice as many sentences.
  • See https://arxiv.org/pdf/1910.00353.pdf for a detailed description
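
To show what separate edits with type annotations look like in practice, the Python sketch below reads the M2 format as commonly used in grammatical error correction: an "S" line with the tokenized sentence, followed by "A" lines giving a token span, an edit type and a replacement. The sample sentence and the type label "agr" are invented for this sketch, and it assumes a single annotator per sentence.

```python
def parse_m2(text):
    """Parse M2 text into a list of (tokens, edits) per sentence."""
    sentences = []
    for block in text.strip().split("\n\n"):
        tokens, edits = [], []
        for line in block.splitlines():
            if line.startswith("S "):
                tokens = line[2:].split()
            elif line.startswith("A "):
                fields = line[2:].split("|||")
                start, end = (int(n) for n in fields[0].split())
                edits.append((start, end, fields[1], fields[2]))
        sentences.append((tokens, edits))
    return sentences


def apply_edits(tokens, edits):
    """Apply all edits to a copy of the tokens and return the corrected tokens."""
    corrected = list(tokens)
    # Work from right to left so that earlier offsets stay valid.
    for start, end, _edit_type, replacement in sorted(edits, reverse=True):
        if start < 0:
            continue  # "no edit" entries are conventionally written as -1 -1
        corrected[start:end] = replacement.split()
    return corrected


# Invented sample with one edit replacing token 2 ("hezký") by "hezkou".
M2_SAMPLE = (
    "S Viděl jsem hezký holku .\n"
    "A 2 3|||agr|||hezkou|||REQUIRED|||-NONE-|||0\n"
)

for tokens, edits in parse_m2(M2_SAMPLE):
    print(" ".join(apply_edits(tokens, edits)))
```

An empty replacement deletes the span, and a span of zero length (start equal to end) inserts the replacement at that position.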

CzeSL in TEITOK

  • Work in progress: will eventually include all available Czech texts written by non-native learners
  • See CzeSL in TEITOK at the ICTL site

Tools

  • Annotation editor feat, used for multi-level manual error annotation of CzeSL-man
  • Annotation editor brat, used for multi-domain error annotation of CzeSL-MD
  • Czech tagger and lemmatizer MorphoDiTa, used for morphological annotation of CzeSL-SGT
  • Spelling/grammar checker Korektor, used for automatic correction of CzeSL-SGT
  • Error identifier (see Jelínek, 2017), used for automatic identification of some types of errors in CzeSL-SGT
  • Multi-level concordancer SeLaQ, used for basic searching in CzeSL-man
  • Standard concordancer Manatee/KonText, used for searching in CzeSL-plain and CzeSL-SGT
  • General corpus tool TEITOK, currently used for building, editing and viewing learner corpora hosted by the Institute of Theoretical and Computational Linguistics (see Learner corpora at ICTL)

Bibliography

NEW:

Rosen, A., Hana, J., Hladká, B., Jelínek, T., Škodová, S., and Štindlová, B. (2020). Compiling and annotating a learner corpus for a morphologically rich language – CzeSL, a corpus of non-native Czech. Karolinum, Charles University Press, Praha. Print copy, e-book CU Digital Repository

Acknowledgement

This work was supported by the European Regional Development Fund project “Creativity and Adaptability as Conditions of the Success of Europe in an Interrelated World” (reg. no.: CZ.02.1.01/0.0/0.0/16_019/0000734).

