{{ :czesl:logolink_op_vvv_hor_barva_eng.jpg?600 |}}

======= CzeSL – a Learner Corpus of Czech =======

  * //The Corpus of **Cze**ch as a **S**econd **L**anguage//
  * A part of the [[http://utkl.ff.cuni.cz/dokuwiki/doku.php?id=akces:akces|AKCES/CLAC project]] (//the Czech Language Acquisition Corpora//)
  * For the official site see [[http://akces.ff.cuni.cz|AKCES – Akviziční korpusy českého jazyka]] (in Czech)
  * An outdated site: [[http://www.c2j.cz|Investice do rozvoje vzdělávání v oboru čeština jako cizí jazyk]] (//Investments into Teaching Czech as a Foreign Language//)
  * Funded since 2009 from several projects: 
    * 2009–2012: European Social Funds (ESF) – //Innovative approach to teaching Czech as a second language//, no. CZ.1.07/2.2.00/07.0119
    * 2012–2016: Ministry of Education, Youth and Sports – //Czech National Corpus//, no. LM2011023
    * 2016–2018 (extended to mid-2020): Grant Agency of the Czech Republic – [[https://ufal.mff.cuni.cz/czesl|Non-native Czech from the Theoretical and Computational Perspective]], no. 16-10185S
    * 2018–2022: [[https://kreas.ff.cuni.cz/en/|KREAS]], Faculty of Arts, Charles University; Structural and Investment Funds of the European Union
  * Alternative address of this site: [[http://utkl.ff.cuni.cz/learncorp/]]


===== Available versions =====

| ^  Thousands of tokens in  ^^^^  Annotation  ^^^^^ Metadata ^  Access  ^  Year  ^
| ::: ^  non-native  ^^  ethnolect  ^  𝚺  ^  error  ^^  linguistic  ^^^:::^:::^:::^
| ::: ^  essays  ^  theses  ^ ::: ^ ::: ^  Tags  ^  TH  ^  T0  ^  T1  ^  T2  ^ ::: ^ ::: ^ ::: ^
^ CzeSL-plain |  1,315 |  732 |  428 |  2,475 |  --  |  --  |  --  |  --  |  --  |  --  |  SD  |  2012  |
^ CzeSL-SGT   |  1,147 |   -- |   -- |  1,147 |  F  |  K  |  M  |  --  |  M  |  yes  |  SD  |  2014  |
^ CzeSL-man v0, a1  |  134 |   -- |  192 |  326 |  F+G  |  2T  |  --  |  M  |  M  |  --  |  SD  |  2012  |
^ CzeSL-man v0, a2 |  59 |   -- |  149 |  208 |  F+G  |  2T  |  --  |  M  |  M  |  --  |  S  |  2012  |
^ CzeSL-man v1 |  134 |   -- |   -- |  134 |  F+G  |  T2  |  M  |  --  |  M+S  |  yes  |  SD  |  2016  |
^ CzeSL-man v2 |  134 |   -- |   -- |  134 |  F+G  |  2T  |  M  |  M  |  M  |  yes  |  SD  |  2020  |
^ CzeSL-TH |  180 |   -- |   -- |  180 |  --  |  2T  |  --  |  --  |  --  |  yes  |  D  |  2018  |
^ CzeSL-MD |  12 |  -- |  -- |  12 |  MD  |  T2  |  --  |  --  |  --  |  --  |  D  |  2018  |
^ CzeSL-UD |  10 |  -- |  -- |  10 |  --  |  --  |  M+S  |  --  |  --  |  --  |  D  |  2018  |
^ CzeSL-GEC |  ? |  ? |  -- |  108 |  --  |  2T  |  --  |  --  |  --  |  --  |  D  |  2017  |
^ AKCES-GEC |  336 |  -- |  168 |  504 |  G  |  2T  |  --  |  --  |  --  |  --  |  D  |  2019  |
^ CzeSL in TEITOK |  299 |  -- |  -- |  299 |  F+I  |  2T+  |  M  |  M  |  M+S  |  yes  |  S  |  2020  |

  * **Tags**: F -- formal, G -- grammar-based, MD -- multi-dimensional, I -- implicit
  * **TH** (target hypothesis): K -- correction suggested by the proofing tool, 2T -- successive corrections in the 2T scheme, T2 -- correction at Tier 2, 2T+ -- more than 2 successive corrections
  * **Linguistic annotation**: M -- morphology (lemmas and morphosyntactic tags), S -- syntax (structure and functions)
  * **Access**: S -- searchable on-line, D -- downloadable in full as a dataset
  * **Year**: year of the first release


=== CzeSL-plain ===

  * 12.4 thousand transcribed texts, 2 mil. words, 2.5 mil. tokens
  * Plain = without annotation, without metadata
  * Consists of 3 parts:
    * Texts written by foreign learners of Czech (ciz): 8,109 texts, 1,161 thousand tokens
    * Academic texts written by foreign students in Czech (kval): 174 texts, 732 thousand tokens
    * Texts written by Czech students with Romani background (rom), i.e. an ethnolect of Czech: 4,105 texts, 428 thousand tokens
  * Searchable from the [[https://kontext.korpus.cz/first_form?corpname=czesl-plain|Czech National Corpus]] site
  * Downloadable from the [[http://ufal.mff.cuni.cz/lindat/|LINDAT-Clarin]] data repository as two subcorpora:
    * [[http://hdl.handle.net/11858/00-097C-0000-000C-2112-B|AKCES 3]] – includes texts produced by non-native students of Czech
    * [[http://hdl.handle.net/11858/00-097C-0000-000C-2293-0|AKCES 4]] – includes texts produced by students growing up in socially excluded communities
  * Description in [[https://wiki.korpus.cz/doku.php/en:cnk:czesl-plain|English]] and [[https://wiki.korpus.cz/doku.php/cnk:czesl-plain|Czech]]

=== CzeSL-SGT ===

  * **Cze**ch as a **S**econd **L**anguage with **S**pelling, **G**rammar and **T**ags
  * 8,617 transcribed texts, 111 thousand sentences, 1 mil. words, 1.1 mil. tokens
  * Extends the “foreign” (ciz) part of CzeSL-plain by texts collected in 2013
  * Original forms and automatic corrections are tagged, lemmatized and assigned error labels
  * Most texts have metadata attributes (30 items) about the author and the text
  * Searchable from the [[https://kontext.korpus.cz/first_form?corpname=czesl-sgt|Czech National Corpus]] site; the metadata available from that site are in Czech
  * Downloadable as [[http://hdl.handle.net/11234/1-162|AKCES 5 (CzeSL-SGT) Release 2]] with the metadata in English. The original release is still available from [[http://hdl.handle.net/11858/00-097C-0000-0023-95B1-E|AKCES 5 (CzeSL-SGT)]]. We recommend Release 2, which fixes a number of bugs, including an issue in the metadata: native speakers of Ukrainian were labelled as speakers of “another Indo-European language” rather than as speakers of a Slavic language. Release 2 is also a validated XML document with all annotation represented as XML attributes.
  * Description in [[http://utkl.ff.cuni.cz/~rosen/public/2014-czesl-sgt-en.pdf|English]] and [[http://utkl.ff.cuni.cz/~rosen/public/2014-czesl-sgt-cs.pdf|Czech]]

=== CzeSL-man ===

  * Includes texts collected, transcribed and manually annotated within the ESF project, see a [[http://utkl.ff.cuni.cz/~rosen/public/2015-czesl-man-en.pdf|description in English]].
  * Annotation manual in [[http://utkl.ff.cuni.cz/~rosen/public/anotace.pdf|Czech]]
  * Transcription manual in [[http://utkl.ff.cuni.cz/~rosen/public/transkripce.pdf|Czech]]
  * Appendix to Transcription manual in [[http://utkl.ff.cuni.cz/~rosen/public/transkripce_doplnek.pdf|Czech]]
  * Transcription manual – summary in [[http://utkl.ff.cuni.cz/~rosen/public/transcription-reference.pdf|English]]; please note that formatting specifics concerning the markup of manuscripts in the transcription manual, its appendix and the summary are outdated.

== CzeSL-man v0 ==
  * Includes subsets of the ciz and rom parts of CzeSL-plain, i.e. the manually annotated Czech texts written by non-native learners and by speakers of the Roma ethnolect of Czech, a total of about 330 thousand tokens.
  * Texts amounting to about 208 thousand tokens are annotated independently by two annotators.
  * Texts are searchable on-line via the [[ http://utkl.ff.cuni.cz/czesl/selaq.html|SeLaQ]] tool, using the CzeSL-native format:
    * [[ http://utkl.ff.cuni.cz/czesl/selaq.html|SeLaQ]] is a purpose-built corpus manager. See its menu for instructions.
    * Note that metadata and graphical display of links between annotated word tokens are not available in SeLaQ.
  * Downloadable from [[https://bitbucket.org/czesl/czesl-man/|here]] in the [[https://bitbucket.org/czesl/feat/|feat]] format, with metadata

== CzeSL-man v1 ==

  * CzeSL-man v1 contains 645 texts (128 thousand tokens) from CzeSL-SGT, including 298 doubly annotated texts (59 thousand doubly annotated tokens).
  * Most texts are equipped with metadata about the author, the text and the annotation process.
  * CzeSL-man can be searched or downloaded. Although the searchable and the downloadable versions contain the same set of texts, they differ in how they represent the annotation.
  * **CzeSL-man v1 downloadable**:
    * Downloadable from https://bitbucket.org/czesl/czesl-man/
    * This release is in the PML format, generated by the feat tool.
    * Each text with its annotation consists of several related files.
    * Some of the texts are independently annotated twice.
    * Also includes a flat version (files named *.vert); see CzeSL-man v2 below.
  * **CzeSL-man v1 searchable**:
    * Searchable by KonText: https://kontext.korpus.cz/first_form?corpname=czesl-man
    * Differs from both CzeSL-man v0 and CzeSL-man v1 downloadable in two aspects: 
      * There are no texts with alternative error annotation: each text is annotated by a single annotator
      * The two-tier annotation scheme is radically modified to fit the token-based setup of the search tool. 
    * Apart from that, the content and metadata are identical to CzeSL-man v1 downloadable and the search options to those of CzeSL-SGT.
    * The main feature in the annotation is the reversal of the source text and its annotation. The correction at Tier 2 is assumed to be the basis for the annotation. The tokens of this corpus represent the words at Tier 2. The original text is added as annotation of the Tier 2 tokens. Each token of the corrected text receives its corresponding Tier 0 form and a Tier 2 error label as attributes.
    * This annotation discards any Tier 1 corrections and error tags, and simplifies links between tokens at Tier 0 and Tier 2 that are not 1:1.
    * Tier 2 is parsed in a way similar to some other Czech corpora searchable in KonText, such as SYN2015. 

== CzeSL-man v2 ==

  * In this release the two-tier error annotation is represented as pairs of XML elements err and corr. An ill-formed portion of the source text is enclosed within the err structure, immediately followed by its correction, enclosed within the corr structure. 
  * Apart from the error annotation, the content and metadata are the same as in CzeSL-man v1.
  * Linguistic annotation (tags and lemmas) is provided for all tokens at Tier 0 and Tier 2.
  * Downloadable from https://bitbucket.org/czesl/czesl-man/ (files named *.vert).
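
The err/corr representation described above can be read with a few lines of Python. The following is only a sketch under stated assumptions, not a description of the released files: it assumes a vertical layout with one token per line, the word form in the first tab-separated column, structural tags on their own lines, and flat (non-nested) err/corr pairs. The sample sentence, the lemma column and the function name split_versions are made up for illustration.

```python
import re

# Matches a structural line such as <err>, </corr> or <s id="1">.
TAG = re.compile(r"^<(/?)(\w+)[^>]*>$")

# Made-up sample in the ASSUMED layout: word form first, then a
# hypothetical lemma column, tab-separated.
SAMPLE = """<s>
Vidím\tvidět
<err>
velký\tvelký
</err>
<corr>
velkého\tvelký
</corr>
psa\tpes
.\t.
</s>"""


def split_versions(vert_text):
    """Return (original_tokens, corrected_tokens) for flat err/corr pairs."""
    original, corrected = [], []
    inside = None  # "err", "corr", or None when outside both
    for line in vert_text.splitlines():
        if not line.strip():
            continue
        tag = TAG.match(line)
        if tag:
            closing, name = tag.groups()
            if name in ("err", "corr"):
                inside = None if closing else name
            continue  # other structures (e.g. <s>) are ignored here
        form = line.split("\t")[0]
        if inside != "corr":  # tokens outside corr belong to the source text
            original.append(form)
        if inside != "err":   # tokens outside err belong to the corrected text
            corrected.append(form)
    return original, corrected


original, corrected = split_versions(SAMPLE)
print(" ".join(original))
print(" ".join(corrected))
```

If the actual files nest err/corr structures to encode both tiers, the single state variable would have to be replaced by a stack.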

=== CzeSL-TH ===

  * Includes a subset of CzeSL-SGT, hand-corrected, but not error-tagged, in 2017--2018, according to the 2T scheme. 
  * The corpus includes about 1,300 texts (180 thousand tokens), selected from those that had not been manually error-annotated before (i.e. texts that are not part of CzeSL-man).
  * The selection was meant to make the manually annotated part of CzeSL more balanced in terms of L1 and CEFR level.
  * Downloadable from [[https://bitbucket.org/czesl/czesl-th/|here]] in the [[https://bitbucket.org/czesl/feat/|feat]] format

=== CzeSL-MD ===

  * Includes a subset of texts from CzeSL-man, semi-automatically annotated by a multi-domain tagset with a focus on morphology, see the [[http://utkl.ff.cuni.cz/~rosen/public/2018_prirucka_morfologicke_anotace.pdf|annotation manual]] (in Czech)
  * The dataset is available from [[https://bitbucket.org/czesl/czesl-md]] in the [[http://brat.nlplab.org|brat]] format
  * The texts can also be viewed and searched [[https://quest.ms.mff.cuni.cz/brat/czesl.err/index.xhtml#/anna_daniela/|here]] using [[http://brat.nlplab.org|brat]] 
  * To be extended to all CzeSL-man texts
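
As a rough illustration of how a dataset in the brat format might be processed, the sketch below reads text-bound annotations (lines starting with T) from a brat standoff .ann file. It assumes the dataset uses the standard brat standoff layout; the sample text, the label HypotheticalLabel and the attribute line are invented and do not reflect the actual CzeSL-MD tagset.

```python
from typing import List, NamedTuple, Tuple


class TextBound(NamedTuple):
    ident: str                    # e.g. "T1"
    label: str                    # annotation type
    spans: List[Tuple[int, int]]  # character offsets, end exclusive
    mention: str                  # the annotated text


# Invented sample: a text and its standoff annotation.
TEXT = "Vidím velký psa."
ANN = "T1\tHypotheticalLabel 6 11\tvelký\nA1\tHypotheticalAttr T1\n"


def read_text_bound(ann: str) -> List[TextBound]:
    """Collect text-bound annotations; other line types are skipped."""
    result = []
    for line in ann.splitlines():
        if not line.startswith("T"):
            continue
        ident, type_and_span, mention = line.split("\t", 2)
        label, span_str = type_and_span.split(" ", 1)
        spans = []
        for part in span_str.split(";"):  # ";" separates discontinuous fragments
            start, end = part.split()
            spans.append((int(start), int(end)))
        result.append(TextBound(ident, label, spans, mention))
    return result


annotations = read_text_bound(ANN)
for a in annotations:
    start, end = a.spans[0]
    print(a.ident, a.label, TEXT[start:end])
```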

=== CzeSL-UD ===

  * Texts from CzeSL-man with syntactic annotation according to the Universal Dependencies ([[https://universaldependencies.org|UD]]) standard 
  * Available from the [[http://hdl.handle.net/11234/1-2927|LINDAT/CLARIN]] repository
  * CzeSL-man, CzeSL-MD and CzeSL-UD will eventually be merged into a single corpus with multiple types of annotation
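
UD treebanks are normally distributed in the CoNLL-U format (ten tab-separated columns per token, blank lines between sentences). Assuming CzeSL-UD follows this convention, a minimal reader could look like the sketch below. The sample sentence is made up and the helper names (row, read_conllu, dependency_edges) are illustrative only.

```python
FIELDS = ("id", "form", "lemma", "upos", "xpos",
          "feats", "head", "deprel", "deps", "misc")


def row(*cols):
    """Join column values with tabs, as CoNLL-U requires."""
    return "\t".join(cols)


# Made-up sample sentence, not taken from the corpus.
SAMPLE = "\n".join([
    "# text = Pes spí .",
    row("1", "Pes", "pes", "NOUN", "_", "_", "2", "nsubj", "_", "_"),
    row("2", "spí", "spát", "VERB", "_", "_", "0", "root", "_", "_"),
    row("3", ".", ".", "PUNCT", "_", "_", "2", "punct", "_", "_"),
    "",
])


def read_conllu(text):
    """Return a list of sentences, each a list of token dicts."""
    sentences, current = [], []
    for line in text.splitlines():
        if not line.strip():          # blank line closes a sentence
            if current:
                sentences.append(current)
                current = []
            continue
        if line.startswith("#"):      # sentence-level comment
            continue
        cols = line.split("\t")
        if len(cols) != 10:
            raise ValueError("expected 10 columns, got %d" % len(cols))
        if "-" in cols[0] or "." in cols[0]:
            continue                  # skip multiword ranges and empty nodes
        current.append(dict(zip(FIELDS, cols)))
    if current:
        sentences.append(current)
    return sentences


def dependency_edges(sentence):
    """Return (dependent form, relation, head form) triples."""
    forms = {tok["id"]: tok["form"] for tok in sentence}
    return [(tok["form"], tok["deprel"], forms.get(tok["head"], "ROOT"))
            for tok in sentence]


sentences = read_conllu(SAMPLE)
for edge in dependency_edges(sentences[0]):
    print(edge)
```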


=== CzeSL-GEC ===

  * CzeSL Grammatical Error Correction Dataset – manually annotated texts, converted to a format intended for NLP applications
  * Sentence pairs of original and corrected versions of Czech sentences collected from essays written by both non-native learners of Czech (CzeSL-man) and Czech pupils with Romani background
  * Downloadable from [[http://hdl.handle.net/11234/1-2143]]
  * See also AKCES-GEC

=== AKCES-GEC ===

  * AKCES Grammatical Error Correction Dataset for Czech – a grammatical error correction corpus for Czech generated from a subset of AKCES. It contains train, dev and test files annotated in the M2 format. Unlike the CzeSL-GEC dataset, it provides separate edits together with their type annotations and has twice as many sentences.
  * Downloadable from [[http://hdl.handle.net/11234/1-3057]]
  * See [[https://arxiv.org/pdf/1910.00353.pdf]] for a detailed description
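
In the M2 format as commonly used for grammatical error correction, each source sentence is given on a line starting with S, followed by one line per edit starting with A, with fields separated by three vertical bars. The sketch below parses such data and applies the edits of one annotator. It assumes that the AKCES-GEC files follow this common convention; the English toy sentences and the error type X are invented for illustration and are not taken from the dataset.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Edit:
    start: int        # token offset where the edit begins
    end: int          # token offset where the edit ends (exclusive)
    error_type: str
    correction: str   # replacement text; empty for a deletion
    annotator: int


# Invented sample: two sentences, the second one needs no correction.
SAMPLE = (
    "S This are test .\n"
    "A 1 2|||X|||is|||REQUIRED|||-NONE-|||0\n"
    "A 2 2|||X|||a|||REQUIRED|||-NONE-|||0\n"
    "\n"
    "S All fine .\n"
    "A -1 -1|||noop|||-NONE-|||REQUIRED|||-NONE-|||0\n"
)


def parse_m2(text: str) -> List[Tuple[List[str], List[Edit]]]:
    """Parse M2 text into (source tokens, edits) pairs."""
    entries = []
    for block in text.strip().split("\n\n"):
        tokens, edits = [], []
        for line in block.splitlines():
            if line.startswith("S "):
                tokens = line[2:].split()
            elif line.startswith("A "):
                fields = line[2:].split("|||")
                start, end = (int(n) for n in fields[0].split())
                if fields[1] == "noop":   # marks "no correction needed"
                    continue
                correction = "" if fields[2] == "-NONE-" else fields[2]
                edits.append(Edit(start, end, fields[1],
                                  correction, int(fields[-1])))
        entries.append((tokens, edits))
    return entries


def apply_edits(tokens: List[str], edits: List[Edit],
                annotator: int = 0) -> List[str]:
    """Apply one annotator's edits from right to left so offsets stay valid."""
    out = list(tokens)
    chosen = [e for e in edits if e.annotator == annotator]
    for e in sorted(chosen, key=lambda e: (e.start, e.end), reverse=True):
        out[e.start:e.end] = e.correction.split()
    return out


entries = parse_m2(SAMPLE)
for tokens, edits in entries:
    print(" ".join(tokens), "->", " ".join(apply_edits(tokens, edits)))
```

Applying the edits from right to left avoids recomputing offsets after an insertion or deletion changes the sentence length.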

=== CzeSL in TEITOK ===
  * Work in progress: will eventually include all available Czech texts written by non-native learners 
  * See [[http://utkl.ff.cuni.cz/teitok/czesl/|CzeSL in TEITOK]] at the ICTL site


===== Tools =====

  * Annotation editor [[https://bitbucket.org/czesl/feat/|feat]], used for multi-level manual error annotation of CzeSL-man
  * Annotation editor [[http://brat.nlplab.org|brat]], used for multi-domain error annotation of CzeSL-MD
  * Tagger and lemmatizer of Czech [[http://ufal.mff.cuni.cz/morphodita|Morphodita]], used for morphological annotation of CzeSL-SGT
  * Spelling/grammar checker [[http://ufal.mff.cuni.cz/korektor|Korektor]], used for automatic correction of CzeSL-SGT
  * Error identifier (see Jelínek, 2017), used for automatic identification of some types of errors in CzeSL-SGT
  * Multi-level concordancer [[http://utkl.ff.cuni.cz/czesl/selaq.html|SeLaQ]], used for basic searching in CzeSL-man
  * Standard concordancer [[http://wiki.korpus.cz/doku.php/en:manualy:kontext:index|Manatee/KonText]], used for searching in CzeSL-plain and CzeSL-SGT
  * General corpus tool [[http://www.teitok.org|TEITOK]], currently used for building, editing and viewing learner corpora hosted by the Institute of Theoretical and Computational Linguistics (see [[http://utkl.ff.cuni.cz/teitok/|Learner corpora at ICTL]])

===== Bibliography =====

[[http://utkl.ff.cuni.cz/~rosen/public/czesl.html|Bibliography]]

**NEW:** 

Rosen, A., Hana, J., Hladká, B., Jelínek, T., Škodová, S., and Štindlová, B. (2020).
//Compiling and annotating a learner corpus for a morphologically rich language – CzeSL, a corpus of non-native Czech.// [[https://karolinum.cz|Karolinum, Charles University Press, Praha]]. [[https://karolinum.cz/knihy/rosen-compiling-and-annotating-a-learner-corpus-for-a-morphologically-rich-language-23802|Print copy, e-book]] [[https://dspace.cuni.cz/handle/20.500.11956/123103|CU Digital Repository]]

===== Acknowledgement =====

This work was supported by the European Regional Development Fund project “Creativity and Adaptability as Conditions of the Success of Europe in an Interrelated World” (reg. no.: CZ.02.1.01/0.0/0.0/16_019/0000734).