Differences
This page shows the differences between the selected revision and the current version of the page.
czesl:czesl [2019/02/11 15:34] rosen |
czesl:czesl [2020/11/11 21:49] rosen [Bibliography] |
||
---|---|---|---|
Line 1: | Line 1: | ||
- | ====== CzeSL – a Learner Corpus of Czech ====== | + | ======= CzeSL – a Learner Corpus of Czech ======= |
* //The Corpus of **Cze**ch as a **S**econd **L**anguage// | * //The Corpus of **Cze**ch as a **S**econd **L**anguage// | ||
- | * A part of the AKCES/CLAC project (//the Czech Language Acquisition Corpora//) | + | * A part of the [[http://utkl.ff.cuni.cz/dokuwiki/doku.php?id=akces:akces|AKCES/CLAC project]] (//the Czech Language Acquisition Corpora//) |
* For the official site see [[http://akces.ff.cuni.cz|AKCES – Akviziční korpusy českého jazyka]] (in Czech) | * For the official site see [[http://akces.ff.cuni.cz|AKCES – Akviziční korpusy českého jazyka]] (in Czech) | ||
* An outdated site: [[http://www.c2j.cz|Investice do rozvoje vzdělávání v oboru čeština jako cizí jazyk]] (//Investments into Teaching Czech as a Foreign Language//) | * An outdated site: [[http://www.c2j.cz|Investice do rozvoje vzdělávání v oboru čeština jako cizí jazyk]] (//Investments into Teaching Czech as a Foreign Language//) | ||
Line 8: | Line 8: | ||
* 2009–2012: European Social Funds (ESF) – //Innovative approach to teaching Czech as a second language//, no. CZ.1.07/2.2.00/07.0119 | * 2009–2012: European Social Funds (ESF) – //Innovative approach to teaching Czech as a second language//, no. CZ.1.07/2.2.00/07.0119 | ||
* 2012–2016: Ministry of Education, Youth and Sports – //Czech National Corpus//, no. LM2011023 | * 2012–2016: Ministry of Education, Youth and Sports – //Czech National Corpus//, no. LM2011023 | ||
- | * 2016–2018: Grant Agency of the Czech Republic – [[https://ufal.mff.cuni.cz/czesl|Non-native Czech from the Theoretical and Computational Perspective]], no. 16-10185S | + | * 2016–2018 (extended to mid-2020): Grant Agency of the Czech Republic – [[https://ufal.mff.cuni.cz/czesl|Non-native Czech from the Theoretical and Computational Perspective]], no. 16-10185S |
+ | * Alternative address of this site: [[http://utkl.ff.cuni.cz/learncorp/]] | ||
===== Available versions ===== | ===== Available versions ===== | ||
- | ==== CzeSL-plain ==== | + | | ^ Thousands of tokens in ^^^^ annotation ^^^^^ Metadata ^ Access ^ Year ^ |
+ | | ::: ^ non-native ^^ ethnolect ^ 𝚺 ^ Error ^^ Linguistic ^^^:::^:::^:::^ | ||
+ | | ::: ^ essays ^ theses ^ ::: ^ ::: ^ Tags ^ TH ^ T0 ^ T1 ^ T2 ^ ::: ^ ::: ^ ::: ^ | ||
+ | ^ CzeSL-plain | 1,315 | 732 | 428 | 2,475 | -- | -- | -- | -- | -- | -- | SD | 2012 | | ||
+ | ^ CzeSL-SGT | 1,147 | -- | -- | 1,147 | F | K | M | -- | M | yes | SD | 2014 | | ||
+ | ^ CzeSL-man v0, a1 | 134 | -- | 192 | 326 | F+G | 2T | -- | M | M | -- | SD | 2012 | | ||
+ | ^ CzeSL-man v0, a2 | 59 | -- | 149 | 208 | F+G | 2T | -- | M | M | -- | S | 2012 | | ||
+ | ^ CzeSL-man v1 | 134 | -- | -- | 134 | F+G | T2 | M | -- | M+S | yes | SD | 2016 | | ||
+ | ^ CzeSL-man v2 | 134 | -- | -- | 134 | F+G | 2T | M | M | M | yes | SD | 2020 | | ||
+ | ^ CzeSL-TH | 180 | -- | -- | 180 | -- | 2T | -- | -- | -- | yes | D | 2018 | | ||
+ | ^ CzeSL-MD | 12 | -- | -- | 12 | MD | T2 | -- | -- | -- | -- | D | 2018 | | ||
+ | ^ CzeSL-UD | 10 | -- | -- | 10 | -- | -- | M+S | -- | -- | -- | D | 2018 | | ||
+ | ^ CzeSL-GEC | ? | ? | -- | 20 | -- | 2T | -- | -- | -- | -- | D | 2017 | | ||
+ | ^ AKCES-GEC | 336 | -- | 168 | 504 | G | 2T | -- | -- | -- | -- | D | 2019 | | ||
+ | ^ CzeSL in TEITOK | 299 | -- | -- | 299 | F+I | 2T+ | M | M | M+S | yes | S | 2020 | | ||
- | * Transcribed texts, 2 mil. words, without annotation, without metadata | + | * **Tags**: F -- formal, G -- grammar-based, MD -- multi-dimensional, I -- implicit |
+ | * **TH** (target hypothesis): K -- correction suggested by the proofing tool, 2T -- successive corrections in the 2T scheme, T2 -- correction at Tier 2, 2T+ -- more than 2 successive corrections | ||
+ | * **Linguistic annotation**: M -- morphology (lemmas and morphosyntactic tags), S -- syntax (structure and functions) | ||
+ | * **Access**: S -- searchable on-line, D -- downloadable in full as a dataset | ||
+ | * **Year**: year of the first release | ||
+ | |||
+ | |||
+ | === CzeSL-plain === | ||
+ | |||
+ | * 12.4 thousand transcribed texts, 2 mil. words, 2.5 mil. tokens | ||
+ | * Plain = without annotation, without metadata | ||
* Consists of 3 parts: | * Consists of 3 parts: | ||
- | * Texts written by foreign learners of Czech (ciz) | + | * Texts written by foreign learners of Czech (ciz): 8,109 texts, 1,161 thousand tokens |
- | * Academic texts written by foreign students in Czech (kval) | + | * Academic texts written by foreign students in Czech (kval): 174 texts, 732 thousand tokens |
- | * Texts written by Czech students with Romani background (rom) | + | * Texts written by Czech students with Romani background (rom), i.e. an ethnolect of Czech: 4,105 texts, 428 thousand tokens |
- | * Searchable from the [[https://kontext.korpus.cz/run.cgi/first?corpname=czesl-plain|Czech National Corpus]] site | + | * Searchable from the [[https://kontext.korpus.cz/first_form?corpname=czesl-plain|Czech National Corpus]] site |
* Downloadable from the [[http://ufal.mff.cuni.cz/lindat/|LINDAT-Clarin]] data repository as two subcorpora: | * Downloadable from the [[http://ufal.mff.cuni.cz/lindat/|LINDAT-Clarin]] data repository as two subcorpora: | ||
* [[http://hdl.handle.net/11858/00-097C-0000-000C-2112-B|AKCES 3]] – includes texts produced by non-native students of Czech | * [[http://hdl.handle.net/11858/00-097C-0000-000C-2112-B|AKCES 3]] – includes texts produced by non-native students of Czech | ||
* [[http://hdl.handle.net/11858/00-097C-0000-000C-2293-0|AKCES 4]] – includes texts produced by students growing up in socially excluded communities | * [[http://hdl.handle.net/11858/00-097C-0000-000C-2293-0|AKCES 4]] – includes texts produced by students growing up in socially excluded communities | ||
- | * Description in [[http://ucnk.ff.cuni.cz/english/czesl-plain.php|English]] and [[http://ucnk.ff.cuni.cz/czesl-plain.php|Czech]] | + | * Description in [[https://wiki.korpus.cz/doku.php/en:cnk:czesl-plain|English]] and [[https://wiki.korpus.cz/doku.php/cnk:czesl-plain|Czech]] |
- | ==== CzeSL-SGT ==== | + | === CzeSL-SGT === |
* **Cze**ch as a **S**econd **L**anguage with **S**pelling, **G**rammar and **T**ags | * **Cze**ch as a **S**econd **L**anguage with **S**pelling, **G**rammar and **T**ags | ||
- | * Transcribed texts, 1 mil. words | + | * 8,617 transcribed texts, 111 thousand sentences, 1 mil. words, 1.1 mil. tokens |
* Extends the “foreign” (ciz) part of CzeSL-plain by texts collected in 2013 | * Extends the “foreign” (ciz) part of CzeSL-plain by texts collected in 2013 | ||
* Original forms and automatic corrections are tagged, lemmatized and assigned error labels | * Original forms and automatic corrections are tagged, lemmatized and assigned error labels | ||
* Most texts have metadata attributes (30 items) about the author and the text | * Most texts have metadata attributes (30 items) about the author and the text | ||
- | * Searchable from the [[https://kontext.korpus.cz/run.cgi/first?corpname=czesl-sgt|Czech National Corpus]] site, metadata available from this site are in Czech | + | * Searchable from the [[https://kontext.korpus.cz/first_form?corpname=czesl-sgt|Czech National Corpus]] site, metadata available from this site are in Czech |
* Downloadable as [[http://hdl.handle.net/11234/1-162|AKCES 5 (CzeSL-SGT) Release 2]] with the metadata in English. The original release is still available from [[http://hdl.handle.net/11858/00-097C-0000-0023-95B1-E|AKCES 5 (CzeSL-SGT)]]. We suggest the use of Release 2, where a number of bugs were fixed, including an issue in the metadata: native speakers of Ukrainian were labelled as speakers of “another Indo-European language” rather than as speakers of a Slavic language. Release 2 is now also a validated XML document with all annotation represented as XML attributes. | * Downloadable as [[http://hdl.handle.net/11234/1-162|AKCES 5 (CzeSL-SGT) Release 2]] with the metadata in English. The original release is still available from [[http://hdl.handle.net/11858/00-097C-0000-0023-95B1-E|AKCES 5 (CzeSL-SGT)]]. We suggest the use of Release 2, where a number of bugs were fixed, including an issue in the metadata: native speakers of Ukrainian were labelled as speakers of “another Indo-European language” rather than as speakers of a Slavic language. Release 2 is now also a validated XML document with all annotation represented as XML attributes. | ||
* Description in [[http://utkl.ff.cuni.cz/~rosen/public/2014-czesl-sgt-en.pdf|English]] and [[http://utkl.ff.cuni.cz/~rosen/public/2014-czesl-sgt-cs.pdf|Czech]] | * Description in [[http://utkl.ff.cuni.cz/~rosen/public/2014-czesl-sgt-en.pdf|English]] and [[http://utkl.ff.cuni.cz/~rosen/public/2014-czesl-sgt-cs.pdf|Czech]] | ||
- | ==== CzeSL-man ==== | + | === CzeSL-man === |
* Includes texts collected, transcribed and manually annotated within the ESF project, see a [[http://utkl.ff.cuni.cz/~rosen/public/2015-czesl-man-en.pdf|description in English]]. | * Includes texts collected, transcribed and manually annotated within the ESF project, see a [[http://utkl.ff.cuni.cz/~rosen/public/2015-czesl-man-en.pdf|description in English]]. | ||
Line 44: | Line 69: | ||
* Appendix to Transcription manual in [[http://utkl.ff.cuni.cz/~rosen/public/transkripce_doplnek.pdf|Czech]] | * Appendix to Transcription manual in [[http://utkl.ff.cuni.cz/~rosen/public/transkripce_doplnek.pdf|Czech]] | ||
* Transcription manual – summary in [[http://utkl.ff.cuni.cz/~rosen/public/transcription-reference.pdf|English]]; please note that formatting specifics concerning the markup of manuscripts in the transcription manual, its appendix and the summary are outdated. | * Transcription manual – summary in [[http://utkl.ff.cuni.cz/~rosen/public/transcription-reference.pdf|English]]; please note that formatting specifics concerning the markup of manuscripts in the transcription manual, its appendix and the summary are outdated. | ||
- | * Texts searchable on-line via the [[ http://utkl.ff.cuni.cz/czesl/selaq.html|SeLaQ]] tool, using the CzeSL-native format: | + | |
- | * Includes all manually annotated texts, both the non-native Czech and the Romani ethnolect Czech parts. | + | == CzeSL-man v0 == |
+ | * Includes subsets of the ciz and rom parts of CzeSL-plain, i.e. the manually annotated Czech texts written by non-native learners and by speakers of the Roma ethnolect of Czech, a total of about 330 thousand tokens. | ||
+ | * Texts of about 208 thousand tokens are annotated independently by two annotators. | ||
+ | * Texts are searchable on-line via the [[ http://utkl.ff.cuni.cz/czesl/selaq.html|SeLaQ]] tool, using the CzeSL-native format: | ||
* [[ http://utkl.ff.cuni.cz/czesl/selaq.html|SeLaQ]] is a purpose-built corpus manager. See its menu for instructions. | * [[ http://utkl.ff.cuni.cz/czesl/selaq.html|SeLaQ]] is a purpose-built corpus manager. See its menu for instructions. | ||
* Note that metadata and graphical display of links between annotated word tokens are not available in SeLaQ. | * Note that metadata and graphical display of links between annotated word tokens are not available in SeLaQ. | ||
* Downloadable from [[https://bitbucket.org/czesl/czesl-man/|here]] in the [[https://bitbucket.org/czesl/feat/|feat]] format, with metadata | * Downloadable from [[https://bitbucket.org/czesl/czesl-man/|here]] in the [[https://bitbucket.org/czesl/feat/|feat]] format, with metadata | ||
- | * **Coming soon:** CzeSL-man searchable from the [[https://kontext.korpus.cz/run.cgi/first?corpname=czesl-sgt|Czech National Corpus]] site and downloadable from the [[https://lindat.mff.cuni.cz/repository|LINDAT/CLARIN]] repository. | ||
- | ==== CzeSL-MD ==== | + | == CzeSL-man v1 == |
+ | |||
+ | * CzeSL-man v1 contains 645 texts (128 thousand tokens) from CzeSL-SGT, including 298 doubly annotated texts (59 thousand doubly annotated tokens). | ||
+ | * Most texts are equipped with metadata about the author, the text and the annotation process. | ||
+ | * CzeSL-man can be searched or downloaded. Although the sets of texts in the searchable and the downloadable versions are identical, the two versions differ in how they represent the annotation. | ||
+ | * **CzeSL-man v1 downloadable**: | ||
+ | * Downloadable from https://bitbucket.org/czesl/czesl-man/ | ||
+ | * This release is in the PML format, generated by the feat tool. | ||
+ | * Each text with its annotation consists of several related files. | ||
+ | * Some of the texts are independently annotated twice. | ||
+ | * **CzeSL-man v1 searchable**: | ||
+ | * Searchable by KonText: https://kontext.korpus.cz/first_form?corpname=czesl-man | ||
+ | * Differs from both CzeSL-man v0 and CzeSL-man v1 downloadable in two respects: | ||
+ | * There are no texts with alternative error annotation: each text is annotated by a single annotator | ||
+ | * The two-tier annotation scheme is radically modified to fit the token-based setup of the search tool. | ||
+ | * Apart from that, the content and metadata are identical to CzeSL-man v1 downloadable and the search options to those of CzeSL-SGT. | ||
+ | * The main feature in the annotation is the reversal of the source text and its annotation. The correction at Tier 2 is assumed to be the basis for the annotation. The tokens of this corpus represent the words at Tier 2. The original text is added as annotation of the Tier 2 tokens. Each token of the corrected text receives its corresponding Tier 0 form and a Tier 2 error label as attributes. | ||
+ | * This annotation discards any Tier 1 corrections and error tags, and simplifies links between tokens at Tier 0 and Tier 2 that are not 1:1. | ||
+ | * Tier 2 is parsed in a way similar to some other Czech corpora searchable in KonText, such as SYN2015. | ||
+ | |||
+ | == CzeSL-man v2 == | ||
+ | |||
+ | * In this release the two-tier error annotation is represented as pairs of XML elements err and corr. An ill-formed portion of the source text is enclosed within the err structure, immediately followed by its correction, enclosed within the corr structure. | ||
+ | * Apart from the error annotation, the content and metadata are the same as in CzeSL-man v1. | ||
+ | * Linguistic annotation (tags and lemmas) is provided for all tokens at Tier 0 and Tier 2. | ||
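The err/corr pairing described for CzeSL-man v2 can be illustrated with a minimal Python sketch. Only the element names err and corr come from the description above; the wrapper element `s`, the sample sentence and the function name `error_pairs` are assumptions made for illustration, and the released files may carry additional elements and attributes.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment: only the element names "err" and "corr" are taken
# from the corpus description; the <s> wrapper and the text are assumptions.
SAMPLE = (
    "<s>Já <err>bydlim</err><corr>bydlím</corr> v "
    "<err>Praha</err><corr>Praze</corr>.</s>"
)


def error_pairs(xml_text):
    """Return (ill-formed, corrected) pairs.

    Each err element is paired with the corr element that immediately
    follows it under the same parent.
    """
    root = ET.fromstring(xml_text)
    pairs = []
    for parent in root.iter():
        children = list(parent)
        # Walk over adjacent siblings and keep err -> corr neighbours only.
        for current, following in zip(children, children[1:]):
            if current.tag == "err" and following.tag == "corr":
                pairs.append(
                    ("".join(current.itertext()), "".join(following.itertext()))
                )
    return pairs


pairs = error_pairs(SAMPLE)
```

Pairing by adjacency mirrors the statement that a correction immediately follows the ill-formed portion; nested corrections, if present in the real data, would need extra handling.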
+ | |||
+ | === CzeSL-TH === | ||
+ | |||
+ | * Includes a subset of CzeSL-SGT, hand-corrected, but not error-tagged, in 2017--2018, according to the 2T scheme. | ||
+ | * The corpus includes about 1,300 texts (180 thousand tokens), selected from those that had not been manually error-annotated before (i.e. texts that are not part of CzeSL-man). | ||
+ | * The selection was meant to make the manually annotated part of CzeSL more balanced in terms of L1 and CEFR level. | ||
+ | * Downloadable from [[https://bitbucket.org/czesl/czesl-th/|here]] in the [[https://bitbucket.org/czesl/feat/|feat]] format | ||
+ | |||
+ | === CzeSL-MD === | ||
* Includes a subset of texts from CzeSL-man, semi-automatically annotated by a multi-domain tagset with a focus on morphology, see the [[http://utkl.ff.cuni.cz/~rosen/public/2018_prirucka_morfologicke_anotace.pdf|annotation manual]] (in Czech) | * Includes a subset of texts from CzeSL-man, semi-automatically annotated by a multi-domain tagset with a focus on morphology, see the [[http://utkl.ff.cuni.cz/~rosen/public/2018_prirucka_morfologicke_anotace.pdf|annotation manual]] (in Czech) | ||
- | * Available from [[https://bitbucket.org/czesl/czesl-md]] in the [[http://brat.nlplab.org|brat]] format | + | * The dataset is available from [[https://bitbucket.org/czesl/czesl-md]] in the [[http://brat.nlplab.org|brat]] format |
+ | * The texts can also be viewed and searched [[https://quest.ms.mff.cuni.cz/brat/czesl.err/index.xhtml#/anna_daniela/|here]] using [[http://brat.nlplab.org|brat]] | ||
* To be extended to all CzeSL-man texts | * To be extended to all CzeSL-man texts | ||
- | ==== CzeSL-UD ==== | + | === CzeSL-UD === |
* Texts from CzeSL-man with syntactic annotation according to the Universal Dependencies ([[https://universaldependencies.org|UD]]) standard | * Texts from CzeSL-man with syntactic annotation according to the Universal Dependencies ([[https://universaldependencies.org|UD]]) standard | ||
* Available from the [[http://hdl.handle.net/11234/1-2927|LINDAT/CLARIN]] repository | * Available from the [[http://hdl.handle.net/11234/1-2927|LINDAT/CLARIN]] repository | ||
* CzeSL-man, CzeSL-MD and CzeSL-UD will eventually be merged into a single corpus with multiple types of annotation | * CzeSL-man, CzeSL-MD and CzeSL-UD will eventually be merged into a single corpus with multiple types of annotation | ||
+ | |||
+ | |||
+ | === CzeSL-GEC === | ||
+ | |||
+ | * CzeSL Grammatical Error Correction Dataset – manually annotated texts, converted to a format intended for NLP applications | ||
+ | * Sentence pairs of original and corrected versions of Czech sentences collected from essays written by both non-native learners of Czech (CzeSL-man) and Czech pupils with Romani background | ||
+ | * Downloadable from [[http://hdl.handle.net/11234/1-2143]] | ||
+ | * See also AKCES-GEC | ||
+ | |||
+ | === AKCES-GEC === | ||
+ | |||
+ | * AKCES Grammatical Error Correction Dataset for Czech – a grammar error correction corpus for Czech generated from a subset of AKCES. It contains train, dev and test files annotated in M2 format. Note that in comparison to the CzeSL-GEC dataset, this dataset contains separate edits together with their type annotations in M2 format and also has twice as many sentences. | ||
+ | * Downloadable from [[http://hdl.handle.net/11234/1-3057]] | ||
+ | * See [[https://arxiv.org/pdf/1910.00353.pdf]] for a detailed description | ||
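As a rough illustration of what the M2 format mentioned above looks like in practice, here is a minimal Python sketch of a reader. It assumes the common M2 layout: an `S` line with the tokenized sentence, followed by `A` lines of the form `start end|||type|||correction|||REQUIRED|||-NONE-|||annotator`. The sample sentence, the placeholder label `ErrorType` and the function names are hypothetical and are not taken from the AKCES-GEC files.

```python
# Hypothetical sample; "ErrorType" is a placeholder, not a real AKCES-GEC label.
SAMPLE = """S Já bydlim v Praha .
A 1 2|||ErrorType|||bydlím|||REQUIRED|||-NONE-|||0
A 3 4|||ErrorType|||Praze|||REQUIRED|||-NONE-|||0
"""


def parse_m2(text):
    """Parse M2 text into a list of (tokens, edits) tuples."""
    sentences = []
    # Sentence blocks are separated by blank lines.
    for block in text.strip().split("\n\n"):
        tokens, edits = [], []
        for line in block.splitlines():
            if line.startswith("S "):
                tokens = line[2:].split()
            elif line.startswith("A "):
                fields = line[2:].split("|||")
                start, end = map(int, fields[0].split())
                edits.append({
                    "start": start,
                    "end": end,
                    "type": fields[1],
                    "correction": fields[2],
                    "annotator": int(fields[-1]),
                })
        if tokens:
            sentences.append((tokens, edits))
    return sentences


def apply_edits(tokens, edits, annotator=0):
    """Apply one annotator's edits, right to left so offsets stay valid."""
    corrected = list(tokens)
    ordered = sorted(edits, key=lambda e: (e["start"], e["end"]), reverse=True)
    for edit in ordered:
        # Skip other annotators and "no change" edits (negative offsets).
        if edit["annotator"] != annotator or edit["start"] < 0:
            continue
        # An empty correction deletes the span.
        corrected[edit["start"]:edit["end"]] = edit["correction"].split()
    return corrected


tokens, edits = parse_m2(SAMPLE)[0]
corrected = apply_edits(tokens, edits)
```

Applying edits from the end of the sentence backwards avoids recomputing token offsets after each replacement; several insertions at the same position would need a stable ordering rule that this sketch does not implement.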
+ | |||
+ | === CzeSL in TEITOK === | ||
+ | * Work in progress: will eventually include all available Czech texts written by non-native learners | ||
+ | * See [[http://utkl.ff.cuni.cz/teitok/czesl/|CzeSL in TEITOK]] at the ICTL site | ||
+ | |||
===== Tools ===== | ===== Tools ===== | ||
Line 72: | Line 152: | ||
* Multi-level concordancer [[http://utkl.ff.cuni.cz/czesl/selaq.html|SeLaQ]], used for basic searching in CzeSL-man | * Multi-level concordancer [[http://utkl.ff.cuni.cz/czesl/selaq.html|SeLaQ]], used for basic searching in CzeSL-man | ||
* Standard concordancer [[http://wiki.korpus.cz/doku.php/en:manualy:kontext:index|Manatee/KonText]], used for searching in CzeSL-plain and CzeSL-SGT | * Standard concordancer [[http://wiki.korpus.cz/doku.php/en:manualy:kontext:index|Manatee/KonText]], used for searching in CzeSL-plain and CzeSL-SGT | ||
- | * General corpus tool [[http://beta.clul.ul.pt/teitok/site/|TEITOK]], currently used for manuscript transcription ([[http://utkl.ff.cuni.cz/teitok/czesl/]]), to be used for building, editing and viewing the CzeSL corpora | + | * General corpus tool [[http://beta.clul.ul.pt/teitok/site/|TEITOK]], currently used for building, editing and viewing learner corpora hosted by the Institute of Theoretical and Computational Linguistics (see [[http://utkl.ff.cuni.cz/teitok/|Learner corpora at ICTL]]) |
===== Bibliography ===== | ===== Bibliography ===== | ||
[[http://utkl.ff.cuni.cz/~rosen/public/czesl.html|Bibliography]] | [[http://utkl.ff.cuni.cz/~rosen/public/czesl.html|Bibliography]] | ||
+ | |||
+ | **NEW:** | ||
+ | |||
+ | Rosen, A., Hana, J., Hladká, B., Jelínek, T., Škodová, S., and Štindlová, B. (2020). | ||
+ | //Compiling and annotating a learner corpus for a morphologically rich language – CzeSL, a corpus of non-native Czech.// Karolinum, Charles University Press, Praha. [[https://dspace.cuni.cz/handle/20.500.11956/123103|http]] | ||
+ | |||
+ | |||
+ |