CzeSL – a Learner Corpus of Czech

Available versions

CzeSL-plain
  • Transcribed texts, 2 million words, without annotation, without metadata
  • Consists of 3 parts:
    • Texts written by foreign learners of Czech (ciz)
    • Academic texts written by foreign students in Czech (kval)
    • Texts written by Czech students with Romani background (rom)
  • Searchable from the Czech National Corpus site
  • Downloadable from the LINDAT/CLARIN data repository as two subcorpora:
    • AKCES 3 – includes texts produced by non-native students of Czech
    • AKCES 4 – includes texts produced by students growing up in socially excluded communities
  • Description in English and Czech
CzeSL-SGT
  • Czech as a Second Language with Spelling, Grammar and Tags
  • Transcribed texts, 1 million words
  • Extends the “foreign” (ciz) part of CzeSL-plain by texts collected in 2013
  • Original forms and automatic corrections are tagged, lemmatized and assigned error labels
  • Most texts have metadata attributes (30 items) about the author and the text
  • Searchable from the Czech National Corpus site; the metadata available from that site are in Czech
  • Downloadable as AKCES 5 (CzeSL-SGT) Release 2, with the metadata in English. The original release is still available from AKCES 5 (CzeSL-SGT). We recommend Release 2, which fixes a number of bugs, including a metadata issue: native speakers of Ukrainian were labelled as speakers of “another Indo-European language” rather than as speakers of a Slavic language. Release 2 is also a validated XML document with all annotation represented as XML attributes.
  • Description in English and Czech
CzeSL-man
  • Includes texts collected, transcribed and manually annotated within the ESF project; see a description in English.
  • Annotation manual in Czech
  • Transcription manual in Czech
  • Appendix to Transcription manual in Czech
  • Transcription manual – summary in English; please note that formatting specifics concerning the markup of manuscripts in the transcription manual, its appendix and the summary are outdated.
  • Texts searchable on-line via the SeLaQ tool, using the CzeSL-native format:
    • Includes all manually annotated texts, both the non-native Czech and the Romani ethnolect Czech parts.
    • SeLaQ is a purpose-built corpus manager. See its menu for instructions.
    • Note that metadata and graphical display of links between annotated word tokens are not available in SeLaQ.
  • Downloadable from here in the feat format, with metadata
  • Coming soon: CzeSL-man searchable from the Czech National Corpus site and downloadable from the LINDAT/CLARIN repository.
CzeSL-MD
  • Includes a subset of texts from CzeSL-man, semi-automatically annotated with a multi-domain tagset focused on morphology; see the annotation manual (in Czech)
  • The dataset is available from https://bitbucket.org/czesl/czesl-md in the brat format
  • The texts can also be viewed and searched here using brat
  • To be extended to all CzeSL-man texts
CzeSL in TEITOK
  • Work in progress: will eventually include all available Czech texts written by non-native learners
  • See CzeSL in TEITOK at the ICTL site
CzeSL-UD
  • Texts from CzeSL-man with syntactic annotation according to the Universal Dependencies (UD) standard
  • Available from the LINDAT/CLARIN repository
  • CzeSL-man, CzeSL-MD and CzeSL-UD will eventually be merged into a single corpus with multiple types of annotation
CzeSL-GEC
  • CzeSL Grammatical Error Correction Dataset – manually annotated texts, converted to a format intended for NLP applications
  • Sentence pairs of original and corrected versions of Czech sentences collected from essays written by both non-native learners of Czech (CzeSL-man) and Czech pupils with Romani background
  • See also AKCES-GEC
AKCES-GEC
  • AKCES Grammatical Error Correction Dataset for Czech – a grammatical error correction corpus for Czech generated from a subset of AKCES. It contains train, dev and test files annotated in M2 format. Note that, unlike the CzeSL-GEC dataset, this dataset contains separate edits together with their type annotations in M2 format, and it has twice as many sentences.
  • See https://arxiv.org/pdf/1910.00353.pdf for a detailed description
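
For readers new to the M2 format mentioned above, the following is a minimal Python sketch of reading M2-style annotations and applying the edits of one annotator. It assumes the common M2 layout: an "S" line with the tokenized sentence, followed by "A" lines whose fields are separated by "|||". The sample sentence, the error label HYPOTHETICAL-TYPE and the function names parse_m2 and apply_edits are invented for illustration and are not taken from AKCES-GEC.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Edit:
    start: int        # token offset where the edit begins
    end: int          # token offset where the edit ends (exclusive)
    error_type: str   # error label as given in the file
    correction: str   # replacement tokens, space-separated (may be empty)
    annotator: int    # annotator id (last field of the A line)


@dataclass
class Sentence:
    tokens: List[str]
    edits: List[Edit] = field(default_factory=list)


def parse_m2(text: str) -> List[Sentence]:
    """Parse M2-style text into sentences with their edits."""
    sentences: List[Sentence] = []
    current = None
    for line in text.splitlines():
        line = line.strip()
        if line.startswith("S "):
            current = Sentence(tokens=line[2:].split())
            sentences.append(current)
        elif line.startswith("A ") and current is not None:
            fields = line[2:].split("|||")
            start, end = map(int, fields[0].split())
            current.edits.append(
                Edit(start, end, fields[1], fields[2], int(fields[-1]))
            )
    return sentences


def apply_edits(sentence: Sentence, annotator: int = 0) -> str:
    """Apply one annotator's edits, right to left, so offsets stay valid.

    Simplification: "no-op" edits (start offset -1) are skipped, and
    overlapping edits or several insertions at one offset are not handled.
    """
    tokens = list(sentence.tokens)
    edits = [e for e in sentence.edits
             if e.annotator == annotator and e.start >= 0]
    for e in sorted(edits, key=lambda e: (e.start, e.end), reverse=True):
        tokens[e.start:e.end] = e.correction.split()
    return " ".join(tokens)


# Invented sample, NOT taken from the corpus; the error label is hypothetical.
SAMPLE = """\
S Já jsem byl v Praha .
A 4 5|||HYPOTHETICAL-TYPE|||Praze|||REQUIRED|||-NONE-|||0
"""

if __name__ == "__main__":
    for sent in parse_m2(SAMPLE):
        print(apply_edits(sent))  # → Já jsem byl v Praze .
```

Applying the edits from right to left keeps the token offsets of the remaining edits valid. The actual error labels and file names should be checked against the downloaded AKCES-GEC files before relying on this sketch.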

Tools

  • Annotation editor feat, used for multi-level manual error annotation of CzeSL-man
  • Annotation editor brat, used for multi-domain error annotation of CzeSL-MD
  • Czech tagger and lemmatizer MorphoDiTa, used for morphological annotation of CzeSL-SGT
  • Spelling/grammar checker Korektor, used for automatic correction of CzeSL-SGT
  • Error identifier (see Jelínek, 2017), used for automatic identification of some types of errors in CzeSL-SGT
  • Multi-level concordancer SeLaQ, used for basic searching in CzeSL-man
  • Standard concordancer Manatee/KonText, used for searching in CzeSL-plain and CzeSL-SGT
  • General corpus tool TEITOK, currently used for building, editing and viewing learner corpora hosted by the Institute of Theoretical and Computational Linguistics (see Learner corpora at ICTL)

Bibliography

