Differences

This page shows the differences between the selected revision and the current version of the page.

czesl:czesl [2019/02/14 12:56]
rosen
czesl:czesl [2020/11/11 22:03] (current)
rosen [Bibliography]
Line 1: Line 1:
-===== CzeSL – a Learner Corpus of Czech ======+======= CzeSL – a Learner Corpus of Czech =======
  
   * //The Corpus of **Cze**ch as a **S**econd **L**anguage//​   * //The Corpus of **Cze**ch as a **S**econd **L**anguage//​
-  * A part of the AKCES/CLAC project (//the Czech Language Acquisition Corpora//)+  * A part of the [[http://​utkl.ff.cuni.cz/​dokuwiki/​doku.php?​id=akces:​akces|AKCES/CLAC project]] (//the Czech Language Acquisition Corpora//)
   * For the official site see [[http://​akces.ff.cuni.cz|AKCES – Akviziční korpusy českého jazyka]] (in Czech)   * For the official site see [[http://​akces.ff.cuni.cz|AKCES – Akviziční korpusy českého jazyka]] (in Czech)
   * An outdated site: [[http://​www.c2j.cz|Investice do rozvoje vzdělávání v oboru čeština jako cizí jazyk]] (//​Investments into Teaching Czech as a Foreign Language//)   * An outdated site: [[http://​www.c2j.cz|Investice do rozvoje vzdělávání v oboru čeština jako cizí jazyk]] (//​Investments into Teaching Czech as a Foreign Language//)
Line 8: Line 8:
     * 2009–2012:​ European Social Funds (ESF) – //​Innovative approach to teaching Czech as a second language//, no. CZ.1.07/​2.2.00/​07.0119     * 2009–2012:​ European Social Funds (ESF) – //​Innovative approach to teaching Czech as a second language//, no. CZ.1.07/​2.2.00/​07.0119
     * 2012–2016:​ Ministry of Education, Youth and Sports – //Czech National Corpus//, no. LM2011023     * 2012–2016:​ Ministry of Education, Youth and Sports – //Czech National Corpus//, no. LM2011023
-    * 2016–2018:​ Grant Agency of the Czech Republic – [[https://​ufal.mff.cuni.cz/​czesl|Non-native Czech from the Theoretical and Computational Perspective]],​ no. 16-10185S ​+    * 2016–2018 ​(extended to mid-2020): Grant Agency of the Czech Republic – [[https://​ufal.mff.cuni.cz/​czesl|Non-native Czech from the Theoretical and Computational Perspective]],​ no. 16-10185S 
 +  * Alternative address of this site: [[http://​utkl.ff.cuni.cz/​learncorp/​]]
  
  
-=== Available versions ===+===== Available versions ​=====
  
-== CzeSL-plain ​==+| ^  Thousands of tokens in  ^^^^  annotation ​ ^^^^^ Metadata ^  Access ​ ^  Year  ^ 
 +| ::: ^  non-native ​ ^^  ethnolect ​ ^  𝚺  ^  Error  ^^  Linguistic ​ ^^^:::​^:::​^:::​^ 
 +| ::: ^  essays ​ ^  theses ​ ^ ::: ^ ::: ^  Tags  ^  TH  ^  T0  ^  T1  ^  T2  ^ ::: ^ ::: ^ ::: ^ 
 +^ CzeSL-plain |  1,315 |  732 |  428 |  2,475 |  --  |  --  |  --  |  --  |  --  |  --  |  SD  |  2012  |
 +^ CzeSL-SGT ​  ​| ​ 1,147 |   -- |   -- |  1,147 |  F  |  K  |  M  |  --  |  M  |  yes  |  SD  |  2014  | 
 +^ CzeSL-man v0, a1  |  134 |   -- |  192 |  326 |  F+G  |  2T  |  --  |  M  |  M  |  --  |  SD  |  2012  | 
 +^ CzeSL-man v0, a2 |  59 |   -- |  149 |  208 |  F+G  |  2T  |  --  |  M  |  M  |  --  |  S  |  2012  | 
 +^ CzeSL-man v1 |  134 |   -- |   -- |  134 |  F+G  |  T2  |  M  |  --  |  M+S  |  yes  |  SD  |  2016  | 
 +^ CzeSL-man v2 |  134 |   -- |   -- |  134 |  F+G  |  2T  |  M  |  M  |  M  |  yes  |  SD  |  2020  | 
 +^ CzeSL-TH |  180 |   -- |   -- |  180 |  --  |  2T  |  --  |  --  |  --  |  yes  |  D  |  2018  | 
 +^ CzeSL-MD |  12 |  -- |  -- |  12 |  MD  |  T2  |  --  |  --  |  --  |  --  |  D  |  2018  | 
 +^ CzeSL-UD |  10 |  -- |  -- |  10 |  --  |  --  |  M+S  |  --  |  --  |  --  |  D  |  2018  | 
 +^ CzeSL-GEC |  ? |  ? |  -- |  20 |  --  |  2T  |  --  |  --  |  --  |  --  |  D  |  2017  | 
 +^ AKCES-GEC |  336 |  -- |  168 |  504 |  G  |  2T  |  --  |  --  |  --  |  --  |  D  |  2019  | 
 +^ CzeSL in TEITOK |  299 |  -- |  -- |  299 |  F+I  |  2T+  |  M  |  M  |  M+S  |  yes  |  S  |  2020  |
  
-  * Transcribed ​texts, 2 mil. words, without annotation, without metadata+  * **Tags**: F -- formal, G -- grammar-based,​ MD -- multi-dimensional,​ I -- implicit 
 +  * **TH** (target hypothesis):​ K -- correction suggested by the proofing tool, 2T -- successive corrections in the 2T scheme, T2 -- correction at Tier 2, 2T+ -- more than 2 successive corrections 
 +  * **Linguistic annotation**:​ M -- morphology (lemmas and morphosyntactic tags), S -- syntax (structure and functions) 
 +  * **Access**: S -- searchable on-line, D -- downloadable in full as a dataset 
 +  * **Year**: year of the first release 
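
The codes in the table can be expanded mechanically. The following is a minimal sketch, assuming only the legend above: the function names are hypothetical, and the `+` separator for tag codes and the concatenated form of access codes are read off the table rows.

```python
# Code tables copied from the legend above.
TAGS = {"F": "formal", "G": "grammar-based", "MD": "multi-dimensional", "I": "implicit"}
LINGUISTIC = {"M": "morphology", "S": "syntax"}
ACCESS = {"S": "searchable on-line", "D": "downloadable in full as a dataset"}


def decode_plus(cell, table):
    """Decode '+'-joined codes such as 'F+G' or 'M+S'; '--' means none."""
    cell = cell.strip()
    if cell == "--":
        return []
    return [table[code] for code in cell.split("+")]


def decode_access(cell):
    """Decode concatenated access codes such as 'SD'."""
    return [ACCESS[ch] for ch in cell.strip()]


if __name__ == "__main__":
    print(decode_plus("F+G", TAGS))
    print(decode_access("SD"))
```
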
 + 
 + 
 +=== CzeSL-plain === 
 + 
 +  * 12.4 thousand transcribed ​texts, 2 mil. words, ​2.5 mil. tokens 
 +  * Plain = without annotation, without metadata
   * Consists of 3 parts:   * Consists of 3 parts:
-    * Texts written by foreign learners of Czech (ciz) +    * Texts written by foreign learners of Czech (ciz): 8,109 texts, 1,161 thousand tokens 
-    * Academic texts written by foreign students in Czech (kval) +    * Academic texts written by foreign students in Czech (kval): 174 texts, 732 thousand tokens 
-    * Texts written by Czech students with Romani background (rom) +    * Texts written by Czech students with Romani background (rom), i.e. an ethnolect of Czech: 4,105 texts, 428 thousand tokens 
-  * Searchable from the [[https://​kontext.korpus.cz/​run.cgi/​first?​corpname=czesl-plain|Czech National Corpus]] site+  * Searchable from the [[https://​kontext.korpus.cz/​first_form?​corpname=czesl-plain|Czech National Corpus]] site
   * Downloadable from the [[http://​ufal.mff.cuni.cz/​lindat/​|LINDAT-Clarin]] data repository as two subcorpora:   * Downloadable from the [[http://​ufal.mff.cuni.cz/​lindat/​|LINDAT-Clarin]] data repository as two subcorpora:
     * [[http://​hdl.handle.net/​11858/​00-097C-0000-000C-2112-B|AKCES 3]] – includes texts produced by non-native students of Czech     * [[http://​hdl.handle.net/​11858/​00-097C-0000-000C-2112-B|AKCES 3]] – includes texts produced by non-native students of Czech
     * [[http://​hdl.handle.net/​11858/​00-097C-0000-000C-2293-0|AKCES 4]] – includes texts produced by students growing up in socially excluded communities     * [[http://​hdl.handle.net/​11858/​00-097C-0000-000C-2293-0|AKCES 4]] – includes texts produced by students growing up in socially excluded communities
-  * Description in [[http://ucnk.ff.cuni.cz/english/​czesl-plain.php|English]] and [[http://ucnk.ff.cuni.cz/​czesl-plain.php|Czech]]+  * Description in [[https://wiki.korpus.cz/doku.php/en:cnk:czesl-plain|English]] and [[https://wiki.korpus.cz/doku.php/​cnk:​czesl-plain|Czech]]
  
-== CzeSL-SGT ==+=== CzeSL-SGT ​===
  
   * **Cze**ch as a **S**econd **L**anguage with **S**pelling,​ **G**rammar and **T**ags   * **Cze**ch as a **S**econd **L**anguage with **S**pelling,​ **G**rammar and **T**ags
-  * Transcribed ​texts, 1 mil. words+  * 8,617 transcribed ​texts, 111 thousand sentences, 1 mil. words, 1.1 mil. tokens
   * Extends the “foreign” (ciz) part of CzeSL-plain by texts collected in 2013   * Extends the “foreign” (ciz) part of CzeSL-plain by texts collected in 2013
   * Original forms and automatic corrections are tagged, lemmatized and assigned error labels   * Original forms and automatic corrections are tagged, lemmatized and assigned error labels
   * Most texts have metadata attributes (30 items) about the author and the text   * Most texts have metadata attributes (30 items) about the author and the text
-  * Searchable from the [[https://​kontext.korpus.cz/​run.cgi/​first?​corpname=czesl-sgt|Czech National Corpus]] site, metadata available from this site are in Czech+  * Searchable from the [[https://​kontext.korpus.cz/​first_form?​corpname=czesl-sgt|Czech National Corpus]] site, metadata available from this site are in Czech
   * Downloadable as [[http://hdl.handle.net/11234/1-162|AKCES 5 (CzeSL-SGT) Release 2]] with the metadata in English. The original release is still available from [[http://hdl.handle.net/11858/00-097C-0000-0023-95B1-E|AKCES 5 (CzeSL-SGT)]]. We suggest the use of Release 2, where a number of bugs were fixed, including an issue in the metadata: native speakers of Ukrainian were labelled as speakers of “another Indo-European language” rather than as speakers of a Slavic language. Release 2 is now also a validated XML document with all annotation represented as XML attributes.   * Downloadable as [[http://hdl.handle.net/11234/1-162|AKCES 5 (CzeSL-SGT) Release 2]] with the metadata in English. The original release is still available from [[http://hdl.handle.net/11858/00-097C-0000-0023-95B1-E|AKCES 5 (CzeSL-SGT)]]. We suggest the use of Release 2, where a number of bugs were fixed, including an issue in the metadata: native speakers of Ukrainian were labelled as speakers of “another Indo-European language” rather than as speakers of a Slavic language. Release 2 is now also a validated XML document with all annotation represented as XML attributes.
   * Description in [[http://​utkl.ff.cuni.cz/​~rosen/​public/​2014-czesl-sgt-en.pdf|English]] and [[http://​utkl.ff.cuni.cz/​~rosen/​public/​2014-czesl-sgt-cs.pdf|Czech]]   * Description in [[http://​utkl.ff.cuni.cz/​~rosen/​public/​2014-czesl-sgt-en.pdf|English]] and [[http://​utkl.ff.cuni.cz/​~rosen/​public/​2014-czesl-sgt-cs.pdf|Czech]]
  
-== CzeSL-man ==+=== CzeSL-man ​===
  
   * Includes texts collected, transcribed and manually annotated within the ESF project, see a [[http://​utkl.ff.cuni.cz/​~rosen/​public/​2015-czesl-man-en.pdf|description in English]].   * Includes texts collected, transcribed and manually annotated within the ESF project, see a [[http://​utkl.ff.cuni.cz/​~rosen/​public/​2015-czesl-man-en.pdf|description in English]].
Line 44: Line 69:
   * Appendix to Transcription manual in [[http://​utkl.ff.cuni.cz/​~rosen/​public/​transkripce_doplnek.pdf|Czech]]   * Appendix to Transcription manual in [[http://​utkl.ff.cuni.cz/​~rosen/​public/​transkripce_doplnek.pdf|Czech]]
   * Transcription manual – summary in [[http://​utkl.ff.cuni.cz/​~rosen/​public/​transcription-reference.pdf|English]];​ please note that formatting specifics concerning the markup of manuscripts in the transcription manual, its appendix and the summary are outdated.   * Transcription manual – summary in [[http://​utkl.ff.cuni.cz/​~rosen/​public/​transcription-reference.pdf|English]];​ please note that formatting specifics concerning the markup of manuscripts in the transcription manual, its appendix and the summary are outdated.
-  ​* Texts searchable on-line via the [[ http://​utkl.ff.cuni.cz/​czesl/​selaq.html|SeLaQ]] tool, using the CzeSL-native format: + 
-    * Includes all manually annotated texts, both the non-native Czech and the Romani ethnolect Czech parts.+== CzeSL-man v0 == 
 +  * Includes subsets of the ciz and rom parts of CzeSL-plain, i.e. the manually annotated Czech texts written by non-native learners and by speakers of the Romani ethnolect of Czech, about 330 thousand tokens in total.
 +  * Texts of about 208 thousand tokens are annotated independently by two annotators.  
 +  ​* Texts are searchable on-line via the [[ http://​utkl.ff.cuni.cz/​czesl/​selaq.html|SeLaQ]] tool, using the CzeSL-native format:
     * [[ http://​utkl.ff.cuni.cz/​czesl/​selaq.html|SeLaQ]] is a purpose-built corpus manager. See its menu for instructions.     * [[ http://​utkl.ff.cuni.cz/​czesl/​selaq.html|SeLaQ]] is a purpose-built corpus manager. See its menu for instructions.
     * Note that metadata and graphical display of links between annotated word tokens are not available in SeLaQ.     * Note that metadata and graphical display of links between annotated word tokens are not available in SeLaQ.
   * Downloadable from [[https://​bitbucket.org/​czesl/​czesl-man/​|here]] in the [[https://​bitbucket.org/​czesl/​feat/​|feat]] format, with metadata   * Downloadable from [[https://​bitbucket.org/​czesl/​czesl-man/​|here]] in the [[https://​bitbucket.org/​czesl/​feat/​|feat]] format, with metadata
-  * **Coming soon:** CzeSL-man searchable from the [[https://​kontext.korpus.cz/​run.cgi/​first?​corpname=czesl-sgt|Czech National Corpus]] site and downloadable from the [[https://​lindat.mff.cuni.cz/​repository|LINDAT/​CLARIN]] repository. 
  
-== CzeSL-MD ==+== CzeSL-man v1 == 
 + 
 +  * CzeSL-man v1 contains 645 texts (128 thousand tokens) from CzeSL-SGT, including 298 doubly annotated texts (59 thousand doubly annotated tokens). 
 +  * Most texts are equipped with metadata about the author, the text and the annotation process. 
 +  * CzeSL-man can be searched or downloaded. The searchable and the downloadable versions contain the same set of texts, but they differ in how they represent the annotation.
 +  * **CzeSL-man v1 downloadable**:​ 
 +    * Downloadable from https://​bitbucket.org/​czesl/​czesl-man/​ 
 +    * This release is in the PML format, generated by the feat tool. 
 +    * Each text with its annotation consists of several related files. 
 +    * Some of the texts are independently annotated twice. 
 +  * **CzeSL-man v1 searchable**:​ 
 +    * Searchable by KonText: https://​kontext.korpus.cz/​first_form?​corpname=czesl-man 
 +    * Differs from both CzeSL-man v0 and CzeSL-man v1 downloadable in two respects: 
 +      * There are no texts with alternative error annotation: each text is annotated by a single annotator 
 +      * The two-tier annotation scheme is radically modified to fit the token-based setup of the search tool.  
 +    * Apart from that, the content and metadata are identical to those of CzeSL-man v1 downloadable, and the search options are the same as for CzeSL-SGT.
 +    * The main feature in the annotation is the reversal of the source text and its annotation. The correction at Tier 2 is assumed to be the basis for the annotation. The tokens of this corpus represent the words at Tier 2. The original text is added as annotation of the Tier 2 tokens. Each token of the corrected text receives its corresponding Tier 0 form and a Tier 2 error label as attributes. 
 +    * This annotation discards any Tier 1 corrections and error tags, and simplifies links between tokens at Tier 0 and Tier 2 that are not 1:1.
 +    * Tier 2 is parsed in a way similar to some other Czech corpora searchable in KonText, such as SYN2015.  
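
The reversal described above can be pictured as a token record whose primary form is the Tier 2 word, with the Tier 0 form and the error label attached as attributes. This is only an illustrative sketch: the attribute names `word`, `t0` and `err`, the placeholder label `LABEL` and the sample sentence are assumptions, not the attribute names or labels used in the KonText corpus.

```python
from dataclasses import dataclass


@dataclass
class Tier2Token:
    word: str  # Tier 2 form: the corrected word, which serves as the corpus token
    t0: str    # corresponding Tier 0 (original) form, "" if there is none
    err: str   # Tier 2 error label, "" if the token carries no error


def original_text(tokens):
    """Rebuild the learner's original wording from the Tier 0 attributes."""
    return " ".join(t.t0 for t in tokens if t.t0)


def tokens_with_errors(tokens):
    """Select the tokens that carry an error label."""
    return [t for t in tokens if t.err]


# Hypothetical sample: "bratr" corrected to "bratry"; "LABEL" is a placeholder.
SAMPLE = [
    Tier2Token("Mám", "Mám", ""),
    Tier2Token("dva", "dva", ""),
    Tier2Token("bratry", "bratr", "LABEL"),
    Tier2Token(".", ".", ""),
]

if __name__ == "__main__":
    print(original_text(SAMPLE))
    print([t.word for t in tokens_with_errors(SAMPLE)])
```
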
 + 
 +== CzeSL-man v2 == 
 + 
 +  * In this release the two-tier error annotation is represented as pairs of XML elements err and corr. An ill-formed portion of the source text is enclosed within the err structure, immediately followed by its correction, enclosed within the corr structure.  
 +  * Apart from the error annotation, the content and metadata are the same as in CzeSL-man v1. 
 +  * Linguistic annotation (tags and lemmas) is provided for all tokens at Tier 0 and Tier 2. 
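
As an illustration of the err/corr pairing, the sketch below collects each ill-formed span together with the correction that immediately follows it. Only the element names `err` and `corr` come from the description above; the `<s>` wrapper, the absence of attributes, the flat (non-nested) layout and the sample sentence are assumptions, so the real files may need a different traversal.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample: only the element names err/corr are taken from the
# description; the <s> wrapper and the sentence are illustrative assumptions.
SAMPLE = "<s>Viděl jsem <err>dobrý</err><corr>dobrého</corr> kamaráda .</s>"


def err_corr_pairs(sentence_xml):
    """Return (ill-formed text, correction) pairs found in one sentence."""
    root = ET.fromstring(sentence_xml)
    children = list(root)  # direct children only; nested pairs are not handled
    pairs = []
    for i, element in enumerate(children):
        if element.tag != "err":
            continue
        following = children[i + 1] if i + 1 < len(children) else None
        if following is not None and following.tag == "corr":
            pairs.append(((element.text or "").strip(),
                          (following.text or "").strip()))
    return pairs


if __name__ == "__main__":
    print(err_corr_pairs(SAMPLE))
```
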
 + 
 +=== CzeSL-TH === 
 + 
 +  * Includes a subset of CzeSL-SGT, hand-corrected,​ but not error-tagged,​ in 2017--2018, according to the 2T scheme.  
 +  * The corpus includes about 1,300 texts (180 thousand tokens), selected from those that had not been manually error-annotated before (i.e. texts that are not part of CzeSL-man). 
 +  * The selection was meant to make the manually annotated part of CzeSL more balanced in terms of L1 and CEFR level. 
 +  * Downloadable from [[https://​bitbucket.org/​czesl/​czesl-th/​|here]] in the [[https://​bitbucket.org/​czesl/​feat/​|feat]] format 
 + 
 +=== CzeSL-MD ​===
  
   * Includes a subset of texts from CzeSL-man, semi-automatically annotated by a multi-domain tagset with a focus on morphology, see the [[http://​utkl.ff.cuni.cz/​~rosen/​public/​2018_prirucka_morfologicke_anotace.pdf|annotation manual]] (in Czech)   * Includes a subset of texts from CzeSL-man, semi-automatically annotated by a multi-domain tagset with a focus on morphology, see the [[http://​utkl.ff.cuni.cz/​~rosen/​public/​2018_prirucka_morfologicke_anotace.pdf|annotation manual]] (in Czech)
-  * Available ​from [[https://​bitbucket.org/​czesl/​czesl-md]] in the [[http://​brat.nlplab.org|brat]] format+  * The dataset is available ​from [[https://​bitbucket.org/​czesl/​czesl-md]] in the [[http://​brat.nlplab.org|brat]] format 
 +  * The texts can also be viewed and searched [[https://​quest.ms.mff.cuni.cz/​brat/​czesl.err/​index.xhtml#/​anna_daniela/​|here]] using [[http://​brat.nlplab.org|brat]] ​
   * To be extended to all CzeSL-man texts   * To be extended to all CzeSL-man texts
  
-== CzeSL-UD ==+=== CzeSL-UD ​===
  
   * Texts from CzeSL-man with syntactic annotation according to the Universal Dependencies ([[https://​universaldependencies.org|UD]]) standard ​   * Texts from CzeSL-man with syntactic annotation according to the Universal Dependencies ([[https://​universaldependencies.org|UD]]) standard ​
   * Available from the [[http://​hdl.handle.net/​11234/​1-2927|LINDAT/​CLARIN]] repository   * Available from the [[http://​hdl.handle.net/​11234/​1-2927|LINDAT/​CLARIN]] repository
   * CzeSL-man, CzeSL-MD and CzeSL-UD will eventually be merged into a single corpus with multiple types of annotation   * CzeSL-man, CzeSL-MD and CzeSL-UD will eventually be merged into a single corpus with multiple types of annotation
 +
 +
 +=== CzeSL-GEC ===
 +
 +  * CzeSL Grammatical Error Correction Dataset – manually annotated texts, converted to a format intended for NLP applications
 +  * Sentence pairs of original and corrected versions of Czech sentences collected from essays written by both non-native learners of Czech (CzeSL-man) and Czech pupils with Romani background
 +  * Downloadable from [[http://​hdl.handle.net/​11234/​1-2143]]
 +  * See also AKCES-GEC
 +
 +=== AKCES-GEC ===
 +
 +  * AKCES Grammatical Error Correction Dataset for Czech – a grammatical error correction corpus for Czech generated from a subset of AKCES. It contains train, dev and test files annotated in the M2 format. Compared with the CzeSL-GEC dataset, this dataset keeps the edits separate, each with its type annotation in the M2 format, and has twice as many sentences.
 +  * Downloadable from [[http://​hdl.handle.net/​11234/​1-3057]]
 +  * See [[https://​arxiv.org/​pdf/​1910.00353.pdf]] for a detailed description
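
For readers unfamiliar with M2, the sketch below reads the general M2 layout used in grammatical error correction work (an `S` line with the tokenized sentence, followed by `A` lines with `|||`-separated fields) and applies one annotator's edits. It is a hedged illustration, not a loader for this dataset: the sample sentences and the error type `TYPE` are placeholders, and the actual type labels and file names in AKCES-GEC are not shown here.

```python
# Hypothetical two-sentence sample in the general M2 layout.
M2_SAMPLE = """S Mám dva bratr .
A 2 3|||TYPE|||bratry|||REQUIRED|||-NONE-|||0

S Dobrý den .
A -1 -1|||noop|||-NONE-|||REQUIRED|||-NONE-|||0
"""


def parse_m2(text):
    """Parse M2 text into a list of (tokens, edits) entries."""
    entries = []
    for block in text.strip().split("\n\n"):
        lines = block.splitlines()
        tokens = lines[0][2:].split()  # drop the leading "S "
        edits = []
        for line in lines[1:]:
            if not line.startswith("A "):
                continue
            span, etype, corr, _req, _none, annotator = line[2:].split("|||")
            start, end = map(int, span.split())
            edits.append((start, end, etype, corr, int(annotator)))
        entries.append((tokens, edits))
    return entries


def apply_edits(tokens, edits, annotator=0):
    """Apply one annotator's edits and return the corrected token list."""
    out = list(tokens)
    # Work right to left so that earlier token offsets stay valid.
    for start, end, etype, corr, ann in sorted(edits, reverse=True):
        if ann != annotator or etype == "noop":
            continue
        replacement = [] if corr in ("", "-NONE-") else corr.split()
        out[start:end] = replacement
    return out


if __name__ == "__main__":
    for tokens, edits in parse_m2(M2_SAMPLE):
        print(" ".join(apply_edits(tokens, edits)))
```
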
 +
 +=== CzeSL in TEITOK ===
 +  * Work in progress: will eventually include all available Czech texts written by non-native learners ​
 +  * See [[http://​utkl.ff.cuni.cz/​teitok/​czesl/​|CzeSL in TEITOK]] at the ICTL site
 +
  
 ===== Tools ===== ===== Tools =====
Line 72: Line 152:
   * Multi-level concordancer [[http://​utkl.ff.cuni.cz/​czesl/​selaq.html|SeLaQ]],​ used for basic searching in CzeSL-man   * Multi-level concordancer [[http://​utkl.ff.cuni.cz/​czesl/​selaq.html|SeLaQ]],​ used for basic searching in CzeSL-man
   * Standard concordancer [[http://​wiki.korpus.cz/​doku.php/​en:​manualy:​kontext:​index|Manatee/​KonText]],​ used for searching in CzeSL-plain and CzeSL-SGT   * Standard concordancer [[http://​wiki.korpus.cz/​doku.php/​en:​manualy:​kontext:​index|Manatee/​KonText]],​ used for searching in CzeSL-plain and CzeSL-SGT
-  * General corpus tool [[http://beta.clul.ul.pt/teitok/site/|TEITOK]], currently used for manuscript transcription ([[http://utkl.ff.cuni.cz/teitok/czesl/]]), to be used for building, editing and viewing the CzeSL corpora+  * General corpus tool [[http://beta.clul.ul.pt/teitok/site/|TEITOK]], currently used for building, editing and viewing learner corpora hosted by the Institute of Theoretical and Computational Linguistics (see [[http://utkl.ff.cuni.cz/teitok/|Learner corpora at ICTL]])
  
 ===== Bibliography ===== ===== Bibliography =====
  
 [[http://​utkl.ff.cuni.cz/​~rosen/​public/​czesl.html|Bibliography]] [[http://​utkl.ff.cuni.cz/​~rosen/​public/​czesl.html|Bibliography]]
 +
 +**NEW:​** ​
 +
 +Rosen, A., Hana, J., Hladká, B., Jelínek, T., Škodová, S., and Štindlová,​ B. (2020).
 +//Compiling and annotating a learner corpus for a morphologically rich language – CzeSL, a corpus of non-native Czech.// [[https://​karolinum.cz|Karolinum,​ Charles University Press, Praha]]. [[https://​karolinum.cz/​knihy/​rosen-compiling-and-annotating-a-learner-corpus-for-a-morphologically-rich-language-23802|Print copy, e-book]] [[https://​dspace.cuni.cz/​handle/​20.500.11956/​123103|CU Digital Repository]]
 +
 +
 +
