CzeSL – a Learner Corpus of Czech

Available versions

CzeSL-plain
  • Transcribed texts, 2 million words, without annotation or metadata
  • Consists of 3 parts:
    • Texts written by foreign learners of Czech (ciz)
    • Academic texts written by foreign students in Czech (kval)
    • Texts written by Czech students with Romani background (rom)
  • Searchable from the Czech National Corpus site
  • Downloadable from the LINDAT-Clarin data repository as two subcorpora:
    • AKCES 3 – includes texts produced by non-native students of Czech
    • AKCES 4 – includes texts produced by students growing up in socially excluded communities
  • Description in English and Czech
CzeSL-SGT
  • Czech as a Second Language with Spelling, Grammar and Tags
  • Transcribed texts, 1 million words
  • Extends the “foreign” (ciz) part of CzeSL-plain with texts collected in 2013
  • Original forms and automatic corrections are tagged, lemmatized and assigned error labels
  • Most texts have metadata attributes (30 items) about the author and the text
  • Searchable from the Czech National Corpus site; the metadata available from this site are in Czech
  • Downloadable as AKCES 5 (CzeSL-SGT) Release 2, with the metadata in English. The original release is still available as AKCES 5 (CzeSL-SGT). We recommend Release 2, which fixes a number of bugs, including a metadata issue in which native speakers of Ukrainian were labelled as speakers of “another Indo-European language” rather than as speakers of a Slavic language. Release 2 is also a validated XML document, with all annotation represented as XML attributes.
  • Description in English and Czech
CzeSL-man
  • Includes texts collected, transcribed and manually annotated within the ESF project; see the description in English.
  • Annotation manual in Czech
  • Transcription manual in Czech
  • Appendix to Transcription manual in Czech
  • Transcription manual – summary in English; please note that formatting specifics concerning the markup of manuscripts in the transcription manual, its appendix and the summary are outdated.
  • Texts searchable on-line via the SeLaQ tool, using the CzeSL-native format:
    • Includes all manually annotated texts, both the non-native Czech and the Romani ethnolect Czech parts.
    • SeLaQ is a purpose-built corpus manager. See its menu for instructions.
    • Note that metadata and graphical display of links between annotated word tokens are not available in SeLaQ.
  • Downloadable from here in the feat format, with metadata
  • Coming soon: CzeSL-man searchable from the Czech National Corpus site and downloadable from the LINDAT/CLARIN repository.
CzeSL-MD
CzeSL-UD
  • Texts from CzeSL-man with syntactic annotation according to the Universal Dependencies (UD) standard
  • Available from the LINDAT/CLARIN repository
  • CzeSL-man, CzeSL-MD and CzeSL-UD will eventually be merged into a single corpus with multiple types of annotation
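Universal Dependencies treebanks are conventionally exchanged in the CoNLL-U format, with ten tab-separated columns per token. The sketch below shows one way to read such a file and list dependency relations. It assumes that CzeSL-UD is distributed in CoNLL-U, which this page does not state; the sample sentence, its annotation and the function names are invented for illustration and are not taken from the corpus.

```python
from typing import Dict, List

# The ten CoNLL-U columns, in the order defined by the UD format.
CONLLU_FIELDS = ["id", "form", "lemma", "upos", "xpos",
                 "feats", "head", "deprel", "deps", "misc"]

Token = Dict[str, str]


def parse_conllu(text: str) -> List[List[Token]]:
    """Split CoNLL-U text into sentences, each a list of token dicts."""
    sentences: List[List[Token]] = []
    current: List[Token] = []
    for line in text.splitlines():
        if not line.strip():          # a blank line ends a sentence
            if current:
                sentences.append(current)
                current = []
            continue
        if line.startswith("#"):      # sentence-level comment or metadata
            continue
        cols = line.split("\t")
        if len(cols) != len(CONLLU_FIELDS):
            raise ValueError(f"expected 10 columns, got {len(cols)}: {line!r}")
        current.append(dict(zip(CONLLU_FIELDS, cols)))
    if current:
        sentences.append(current)
    return sentences


def dependency_pairs(sentence: List[Token]) -> List[tuple]:
    """Return (form, relation, head form) triples for ordinary tokens."""
    forms = {tok["id"]: tok["form"] for tok in sentence}
    pairs = []
    for tok in sentence:
        # Skip multiword-token ranges (e.g. "1-2") and empty nodes (e.g. "1.1").
        if "-" in tok["id"] or "." in tok["id"]:
            continue
        head = "ROOT" if tok["head"] == "0" else forms.get(tok["head"], "?")
        pairs.append((tok["form"], tok["deprel"], head))
    return pairs


# Invented sample, not taken from CzeSL-UD.
SAMPLE = "\n".join([
    "# text = Studuju češtinu .",
    "\t".join(["1", "Studuju", "studovat", "VERB", "_", "_", "0", "root", "_", "_"]),
    "\t".join(["2", "češtinu", "čeština", "NOUN", "_", "_", "1", "obj", "_", "_"]),
    "\t".join(["3", ".", ".", "PUNCT", "_", "_", "1", "punct", "_", "_"]),
    "",
])

if __name__ == "__main__":
    for sent in parse_conllu(SAMPLE):
        for form, rel, head in dependency_pairs(sent):
            print(f"{form} --{rel}--> {head}")
```

For real work, an established CoNLL-U reader is likely a better choice than this hand-written parser, which ignores sentence-level comments and does not interpret the FEATS or MISC columns.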
CzeSL-GEC
  • CzeSL Grammatical Error Correction Dataset – manually annotated texts, converted to a format intended for NLP applications
  • Sentence pairs of original and corrected versions of Czech sentences collected from essays written by both non-native learners of Czech (CzeSL-man) and Czech pupils with Romani background
  • See also AKCES-GEC
AKCES-GEC
  • AKCES-GEC Grammatical Error Correction Dataset for Czech – a grammatical error correction corpus for Czech generated from a subset of AKCES. It contains train, dev and test files annotated in the M2 format. Compared with the CzeSL-GEC dataset, it provides separate edits together with their type annotations in the M2 format, and it has twice as many sentences.
  • See https://arxiv.org/pdf/1910.00353.pdf for a detailed description
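In the M2 format as commonly used for grammatical error correction, each sentence starts with an `S` line of space-separated tokens, followed by `A` lines giving a token span, an error type, a correction and an annotator id, separated by `|||`. The sketch below reads such data and applies one annotator's edits to recover the corrected sentence. It assumes that the AKCES-GEC files follow this common layout; the sample sentence and the error-type labels are invented placeholders, not the dataset's actual tagset.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Edit:
    start: int          # index of the first token replaced
    end: int            # index one past the last token replaced
    error_type: str
    correction: str     # empty string means deletion
    annotator: int


@dataclass
class Sentence:
    tokens: List[str]
    edits: List[Edit] = field(default_factory=list)


def parse_m2(text: str) -> List[Sentence]:
    """Parse M2 text into sentences with their attached edits."""
    sentences: List[Sentence] = []
    for raw in text.splitlines():
        line = raw.strip()
        if line.startswith("S "):
            sentences.append(Sentence(tokens=line[2:].split()))
        elif line.startswith("A ") and sentences:
            fields = line[2:].split("|||")
            start, end = (int(n) for n in fields[0].split())
            correction = "" if fields[2] == "-NONE-" else fields[2]
            sentences[-1].edits.append(
                Edit(start, end, fields[1], correction, int(fields[-1]))
            )
    return sentences


def apply_edits(sentence: Sentence, annotator: int = 0) -> str:
    """Apply one annotator's edits left to right, tracking the index shift."""
    tokens = list(sentence.tokens)
    offset = 0
    for edit in sorted(sentence.edits, key=lambda e: (e.start, e.end)):
        # Skip other annotators and "no change" edits (negative span).
        if edit.annotator != annotator or edit.start < 0:
            continue
        replacement = edit.correction.split()
        tokens[edit.start + offset:edit.end + offset] = replacement
        offset += len(replacement) - (edit.end - edit.start)
    return " ".join(tokens)


# Invented sample; "typeA" and "typeB" are placeholder labels.
SAMPLE = """\
S Já jsem byl v škola .
A 3 4|||typeA|||ve|||REQUIRED|||-NONE-|||0
A 4 5|||typeB|||škole|||REQUIRED|||-NONE-|||0
"""

if __name__ == "__main__":
    for sent in parse_m2(SAMPLE):
        print(" ".join(sent.tokens))
        print(apply_edits(sent))
```

Keeping the edits as separate records, instead of storing only the corrected sentence, is what allows per-type error statistics and edit-level evaluation.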

Tools

  • Annotation editor feat, used for multi-level manual error annotation of CzeSL-man
  • Annotation editor brat, used for multi-domain error annotation of CzeSL-MD
  • Czech tagger and lemmatizer MorphoDiTa, used for morphological annotation of CzeSL-SGT
  • Spelling/grammar checker Korektor, used for automatic correction of CzeSL-SGT
  • Error identifier (see Jelínek, 2017), used for automatic identification of some types of errors in CzeSL-SGT
  • Multi-level concordancer SeLaQ, used for basic searching in CzeSL-man
  • Standard concordancer Manatee/KonText, used for searching in CzeSL-plain and CzeSL-SGT
  • General corpus tool TEITOK, currently used for manuscript transcription (http://utkl.ff.cuni.cz/teitok/czesl/), to be used for building, editing and viewing the CzeSL corpora

Bibliography

