CzeSL – a Learner Corpus of Czech

Available versions

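Columns, in order: version | thousands of tokens (non-native essays, non-native theses, ethnolect, 𝚺) | annotation (error tags, TH, linguistic annotation at T0, T1, T2) | metadata | access | year. Each row lists only its non-empty cells.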
CzeSL-plain 1,315 732 428 2,475 SD 2012
CzeSL-SGT 1,147 1,147 F K M M yes SD 2014
CzeSL-man v0, a1 134 192 326 F+G 2T M M SD 2012
CzeSL-man v0, a2 59 149 208 F+G 2T M M S 2012
CzeSL-man v1 134 134 F+G T2 M M+S yes SD 2016
CzeSL-man v2 134 134 F+G 2T M M M yes SD 2020
CzeSL-TH 180 180 2T yes D 2018
CzeSL-MD 12 12 MD T2 D 2018
CzeSL-UD 10 10 M+S D 2018
CzeSL-GEC ? ? 20 2T D 2017
AKCES-GEC 336 168 504 G 2T D 2019
CzeSL in TEITOK 299 299 F+I 2T+ M M M+S yes S 2020
  • Tags: F – formal, G – grammar-based, MD – multi-dimensional, I – implicit
  • TH (target hypothesis): K – correction suggested by the proofing tool, 2T – successive corrections in the 2T scheme, T2 – correction at Tier 2, 2T+ – more than 2 successive corrections
  • Linguistic annotation: M – morphology (lemmas and morphosyntactic tags), S – syntax (structure and functions)
  • Access: S – searchable on-line, D – downloadable in full as a dataset
  • Year: year of the first release

CzeSL-plain

  • 12.4 thousand transcribed texts, 2 mil. words, 2.5 mil. tokens
  • Plain = without annotation, without metadata
  • Consists of 3 parts:
    • Texts written by foreign learners of Czech (ciz): 8,109 texts, 1,161 thousand tokens
    • Academic texts written by foreign students in Czech (kval): 174 texts, 732 thousand tokens
    • Texts written by Czech students with Romani background (rom), i.e. an ethnolect of Czech: 4,105 texts, 428 thousand tokens
  • Searchable from the Czech National Corpus site
  • Downloadable from the LINDAT-Clarin data repository as two subcorpora:
    • AKCES 3 – includes texts produced by non-native students of Czech
    • AKCES 4 – includes texts produced by students growing up in socially excluded communities
  • Description in English and Czech

CzeSL-SGT

  • Czech as a Second Language with Spelling, Grammar and Tags
  • 8,617 transcribed texts, 111 thousand sentences, 1 mil. words, 1.1 mil. tokens
  • Extends the “foreign” (ciz) part of CzeSL-plain by texts collected in 2013
  • Original forms and automatic corrections are tagged, lemmatized and assigned error labels
  • Most texts have metadata attributes (30 items) about the author and the text
  • Searchable from the Czech National Corpus site; the metadata available from this site are in Czech
  • Downloadable as AKCES 5 (CzeSL-SGT) Release 2 with the metadata in English. The original release is still available from AKCES 5 (CzeSL-SGT). We recommend Release 2, which fixes a number of bugs, including a metadata issue: native speakers of Ukrainian were labelled as speakers of “another Indo-European language” rather than as speakers of a Slavic language. Release 2 is also a validated XML document with all annotation represented as XML attributes.
  • Description in English and Czech

CzeSL-man

  • Includes texts collected, transcribed and manually annotated within the ESF project, see a description in English.
  • Annotation manual in Czech
  • Transcription manual in Czech
  • Appendix to Transcription manual in Czech
  • Transcription manual – summary in English; please note that formatting specifics concerning the markup of manuscripts in the transcription manual, its appendix and the summary are outdated.
CzeSL-man v0
  • Includes subsets of the ciz and rom parts of CzeSL-plain, i.e. the manually annotated Czech texts written by non-native learners and by speakers of the Roma ethnolect of Czech, a total of about 330 thousand tokens.
  • Texts totalling about 208 thousand tokens are annotated independently by two annotators.
  • Texts are searchable on-line via the SeLaQ tool, using the CzeSL-native format:
    • SeLaQ is a purpose-built corpus manager. See its menu for instructions.
    • Note that metadata and graphical display of links between annotated word tokens are not available in SeLaQ.
  • Downloadable from here in the feat format, with metadata
CzeSL-man v1
  • CzeSL-man v1 contains 645 texts (128 thousand tokens) from CzeSL-SGT, including 298 doubly annotated texts (59 thousand doubly annotated tokens).
  • Most texts are equipped with metadata about the author, the text and the annotation process.
  • CzeSL-man can be searched or downloaded. Although the set of texts is identical in the searchable and the downloadable versions, the two versions differ in how they represent the annotation.
  • CzeSL-man v1 downloadable:
    • This release is in the PML format, generated by the feat tool.
    • Each text with its annotation consists of several related files.
    • Some of the texts are independently annotated twice.
  • CzeSL-man v1 searchable:
    • Differs from both CzeSL-man v0 and CzeSL-man v1 downloadable in two respects:
      • There are no texts with alternative error annotation: each text is annotated by a single annotator
      • The two-tier annotation scheme is radically modified to fit the token-based setup of the search tool.
    • Apart from that, the content and metadata are identical to CzeSL-man v1 downloadable and the search options to those of CzeSL-SGT.
    • The main feature in the annotation is the reversal of the source text and its annotation. The correction at Tier 2 is assumed to be the basis for the annotation. The tokens of this corpus represent the words at Tier 2. The original text is added as annotation of the Tier 2 tokens. Each token of the corrected text receives its corresponding Tier 0 form and a Tier 2 error label as attributes.
    • This annotation discards any Tier 1 corrections and error tags, and simplifies links between tokens at Tier 0 and Tier 2 that are not 1:1.
    • Tier 2 is parsed in a way similar to some other Czech corpora searchable in KonText, such as SYN2015.
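
The reversed representation used in the searchable CzeSL-man v1 can be pictured with a small sketch. It is only an illustration: the class name, the attribute names, the example sentence and the error label are all hypothetical and are not taken from the corpus or from the search tool.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Tier2Token:
    """One token of the corrected (Tier 2) text, as in the searchable version.

    All attribute names here are hypothetical, chosen for illustration only.
    """
    word: str                   # corrected form at Tier 2 (the corpus token)
    t0_form: Optional[str]      # corresponding original form at Tier 0
    error_label: Optional[str]  # Tier 2 error label, None if nothing was corrected


# Made-up example: the learner wrote "Bydlim v Praze", corrected to "Bydlím v Praze".
SENTENCE: List[Tier2Token] = [
    Tier2Token("Bydlím", "Bydlim", "hypothetical-label"),
    Tier2Token("v", "v", None),
    Tier2Token("Praze", "Praze", None),
]


def corrected_tokens(tokens: List[Tier2Token]) -> List[Tier2Token]:
    """Return the tokens that carry an error label."""
    return [t for t in tokens if t.error_label is not None]


def original_text(tokens: List[Tier2Token]) -> str:
    """Rebuild the learner's original wording from the Tier 0 attributes."""
    return " ".join(t.t0_form for t in tokens if t.t0_form is not None)
```

In this layout a query matches the corrected word, and the learner's original form is reached through an attribute of that token.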
CzeSL-man v2
  • In this release the two-tier error annotation is represented as pairs of XML elements err and corr. An ill-formed portion of the source text is enclosed within the err structure, immediately followed by its correction, enclosed within the corr structure.
  • Apart from the error annotation, the content and metadata are the same as in CzeSL-man v1.
  • Linguistic annotation (tags and lemmas) is provided for all tokens at Tier 0 and Tier 2.
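
A minimal sketch of how the err/corr pairs might be read with Python's standard library. Only the element names err and corr come from the description above. The sample document, its root element and the helper name `error_pairs` are assumptions made for illustration, and the real files may nest the pairs or carry further attributes that this sketch ignores.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample: only the element names err and corr are taken from
# the release description; the rest of the structure is assumed.
SAMPLE = "<text>Já <err>bydlim</err><corr>bydlím</corr> v Praze .</text>"


def error_pairs(xml_string):
    """Collect (ill-formed, corrected) pairs from adjacent err/corr siblings."""
    root = ET.fromstring(xml_string)
    pairs = []
    for parent in root.iter():  # the root itself and every descendant
        children = list(parent)
        for current, following in zip(children, children[1:]):
            if current.tag == "err" and following.tag == "corr":
                ill_formed = "".join(current.itertext()).strip()
                corrected = "".join(following.itertext()).strip()
                pairs.append((ill_formed, corrected))
    return pairs


pairs = error_pairs(SAMPLE)
```

If an err element itself contains a nested err/corr pair, the outer pair's text will include the inner correction, so a real reader would need to decide how to flatten the tiers.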

CzeSL-TH

  • Includes a subset of CzeSL-SGT, hand-corrected, but not error-tagged, in 2017–2018, according to the 2T scheme.
  • The corpus includes about 1,300 texts (180 thousand tokens), selected from those that had not been manually error-annotated before (i.e. texts that are not part of CzeSL-man).
  • The selection was meant to make the manually annotated part of CzeSL more balanced in terms of L1 and CEFR level.
  • Downloadable from here in the feat format

CzeSL-MD

  • Includes a subset of texts from CzeSL-man, semi-automatically annotated by a multi-domain tagset with a focus on morphology, see the annotation manual (in Czech)
  • The dataset is available from https://bitbucket.org/czesl/czesl-md in the brat format
  • The texts can also be viewed and searched here using brat
  • To be extended to all CzeSL-man texts

CzeSL-UD

  • Texts from CzeSL-man with syntactic annotation according to the Universal Dependencies (UD) standard
  • Available from the LINDAT/CLARIN repository
  • CzeSL-man, CzeSL-MD and CzeSL-UD will eventually be merged into a single corpus with multiple types of annotation

CzeSL-GEC

  • CzeSL Grammatical Error Correction Dataset – manually annotated texts, converted to a format intended for NLP applications
  • Sentence pairs of original and corrected versions of Czech sentences collected from essays written by both non-native learners of Czech (CzeSL-man) and Czech pupils with Romani background
  • See also AKCES-GEC

AKCES-GEC

  • AKCES Grammatical Error Correction Dataset for Czech – a grammatical error correction corpus for Czech generated from a subset of AKCES. It contains train, dev and test files annotated in the M2 format. Unlike the CzeSL-GEC dataset, it lists the individual edits separately, each with its type annotation, and contains twice as many sentences.
  • See https://arxiv.org/pdf/1910.00353.pdf for a detailed description
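
As a rough illustration of working with M2 files, the sketch below parses the conventional M2 layout (an `S` line with the tokenized sentence, followed by `A` lines with token offsets, error type, correction and annotator id, separated by `|||`) and applies the edits. It assumes that AKCES-GEC follows this conventional layout. The sample sentence and the error type label are made up, and the function names are hypothetical.

```python
# Made-up sample in the conventional M2 layout; the error type is a placeholder.
SAMPLE_M2 = """S Já bydlim v Praze .
A 1 2|||hypothetical-type|||bydlím|||REQUIRED|||-NONE-|||0
"""


def parse_m2(text):
    """Parse M2 text into a list of {"tokens": [...], "edits": [...]} dicts."""
    sentences = []
    for block in text.strip().split("\n\n"):  # sentences are separated by blank lines
        tokens, edits = [], []
        for line in block.splitlines():
            if line.startswith("S "):
                tokens = line[2:].split()
            elif line.startswith("A "):
                fields = line[2:].split("|||")
                start, end = map(int, fields[0].split())
                edits.append({
                    "start": start,
                    "end": end,
                    "type": fields[1],
                    "correction": fields[2],
                    "annotator": int(fields[-1]),
                })
        sentences.append({"tokens": tokens, "edits": edits})
    return sentences


def apply_edits(tokens, edits, annotator=0):
    """Return the corrected token list according to one annotator's edits."""
    corrected = list(tokens)
    # Apply edits from right to left so that earlier offsets stay valid.
    for edit in sorted(edits, key=lambda e: (e["start"], e["end"]), reverse=True):
        if edit["annotator"] != annotator or edit["start"] < 0:
            continue  # other annotators, or "noop" edits with offsets -1 -1
        corrected[edit["start"]:edit["end"]] = edit["correction"].split()
    return corrected


sentences = parse_m2(SAMPLE_M2)
fixed = apply_edits(sentences[0]["tokens"], sentences[0]["edits"])
```

Keeping the edits as separate records, rather than storing only sentence pairs, is what makes per-type evaluation possible.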

CzeSL in TEITOK

  • Work in progress: will eventually include all available Czech texts written by non-native learners
  • See CzeSL in TEITOK at the ICTL site

Tools

  • Annotation editor feat, used for multi-level manual error annotation of CzeSL-man
  • Annotation editor brat, used for multi-domain error annotation of CzeSL-MD
  • Tagger and lemmatizer of Czech MorphoDiTa, used for morphological annotation of CzeSL-SGT
  • Spelling/grammar checker Korektor, used for automatic correction of CzeSL-SGT
  • Error identifier (see Jelínek, 2017), used for automatic identification of some types of errors in CzeSL-SGT
  • Multi-level concordancer SeLaQ, used for basic searching in CzeSL-man
  • Standard concordancer Manatee/KonText, used for searching in CzeSL-plain and CzeSL-SGT
  • General corpus tool TEITOK, currently used for building, editing and viewing learner corpora hosted by the Institute of Theoretical and Computational Linguistics (see Learner corpora at ICTL)

Bibliography

NEW:

Rosen, A., Hana, J., Hladká, B., Jelínek, T., Škodová, S., and Štindlová, B. (2020). Compiling and annotating a learner corpus for a morphologically rich language – CzeSL, a corpus of non-native Czech. Karolinum, Charles University Press, Praha. Print copy, e-book CU Digital Repository

