CzeSL – a Learner Corpus of Czech

The Corpus of Czech as a Second Language
A part of the AKCES/CLAC project (the Czech Language Acquisition Corpora)
For the official site see AKCES – Akviziční korpusy českého jazyka (in Czech)
An outdated site: Investice do rozvoje vzdělávání v oboru čeština jako cizí jazyk (Investments into Teaching Czech as a Foreign Language)
Funded since 2009 by several projects:
- 2009–2012: European Social Funds (ESF) – Innovative approach to teaching Czech as a second language, no. CZ.1.07/2.2.00/07.0119
- 2012–2016: Ministry of Education, Youth and Sports – Czech National Corpus, no. LM2011023
- 2016–2018 (extended to mid-2020): Grant Agency of the Czech Republic – Non-native Czech from the Theoretical and Computational Perspective, no. 16-10185S
- 2018–2022: KREAS, Faculty of Arts, Charles University; Structural and Investment Funds of the European Union
Alternative address of this site: http://utkl.ff.cuni.cz/learncorp/

Available versions

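Corpus	Essays (non-native, thousands of tokens)	Theses (non-native, thousands of tokens)	Ethnolect (thousands of tokens)	Σ (thousands of tokens)	Error annotation: Tags	Error annotation: TH	Linguistic annotation: T0	Linguistic annotation: T1	Linguistic annotation: T2	Metadata	Access	Year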
CzeSL-plain	1,315	732	428	2,475	–	–	–	–	–	–	SD	2012
CzeSL-SGT	1,147	–	–	1,147	F	K	M	–	M	yes	SD	2014
CzeSL-man v0, a1	134	–	192	326	F+G	2T	–	M	M	–	SD	2012
CzeSL-man v0, a2	59	–	149	208	F+G	2T	–	M	M	–	S	2012
CzeSL-man v1	134	–	–	134	F+G	T2	M	–	M+S	yes	SD	2016
CzeSL-man v2	134	–	–	134	F+G	2T	M	M	M	yes	SD	2020
CzeSL-TH	180	–	–	180	–	2T	–	–	–	yes	D	2018
CzeSL-MD	12	–	–	12	MD	T2	–	–	–	–	D	2018
CzeSL-UD	10	–	–	10	–	–	M+S	–	–	–	D	2018
CzeSL-GEC	?	?	–	108	–	2T	–	–	–	–	D	2017
AKCES-GEC	336	–	168	504	G	2T	–	–	–	–	D	2019
CzeSL in TEITOK	299	–	–	299	F+I	2T+	M	M	M+S	yes	S	2020

Tags: F – formal, G – grammar-based, MD – multi-dimensional, I – implicit
TH (target hypothesis): K – correction suggested by the proofing tool, 2T – successive corrections in the 2T scheme, T2 – correction at Tier 2, 2T+ – more than 2 successive corrections
Linguistic annotation: M – morphology (lemmas and morphosyntactic tags), S – syntax (structure and functions)
Access: S – searchable on-line, D – downloadable in full as a dataset
Year: year of the first release

CzeSL-plain

12.4 thousand transcribed texts, 2 mil. words, 2.5 mil. tokens
Plain = without annotation, without metadata
Consists of 3 parts:
- Texts written by foreign learners of Czech (ciz): 8,109 texts, 1,161 thousand tokens
- Academic texts written by foreign students in Czech (kval): 174 texts, 732 thousand tokens
- Texts written by Czech students with Romani background (rom), i.e. an ethnolect of Czech: 4,105 texts, 428 thousand tokens
Searchable from the Czech National Corpus site
Downloadable from the LINDAT/CLARIN data repository as two subcorpora:
- AKCES 3 – includes texts produced by non-native students of Czech
- AKCES 4 – includes texts produced by students growing up in socially excluded communities
Description in English and Czech

CzeSL-SGT

Czech as a Second Language with Spelling, Grammar and Tags
8,617 transcribed texts, 111 thousand sentences, 1 mil. words, 1.1 mil. tokens
Extends the “foreign” (ciz) part of CzeSL-plain by texts collected in 2013
Original forms and automatic corrections are tagged, lemmatized and assigned error labels
Most texts have metadata attributes (30 items) about the author and the text
Searchable from the Czech National Corpus site; the metadata available from that site are in Czech
Downloadable as AKCES 5 (CzeSL-SGT) Release 2 with the metadata in English. The original release is still available from AKCES 5 (CzeSL-SGT). We recommend Release 2, where a number of bugs were fixed, including an issue in the metadata: native speakers of Ukrainian were labelled as speakers of “another Indo-European language” rather than as speakers of a Slavic language. Release 2 is now also a validated XML document with all annotation represented as XML attributes.
Description in English and Czech

CzeSL-man

Includes texts collected, transcribed and manually annotated within the ESF project, see a description in English.
Annotation manual in Czech
Transcription manual in Czech
Appendix to Transcription manual in Czech
Transcription manual – summary in English; please note that formatting specifics concerning the markup of manuscripts in the transcription manual, its appendix and the summary are outdated.

CzeSL-man v0

Includes subsets of the ciz and rom parts of CzeSL-plain, i.e. the manually annotated Czech texts written by non-native learners and by speakers of the Roma ethnolect of Czech, a total of about 330 thousand tokens.
Texts of about 208 thousand tokens are annotated independently by two annotators.
Texts are searchable on-line via the SeLaQ tool, using the CzeSL-native format:
- SeLaQ is a purpose-built corpus manager. See its menu for instructions.
- Note that metadata and graphical display of links between annotated word tokens are not available in SeLaQ.
Downloadable from here in the feat format, with metadata

CzeSL-man v1

CzeSL-man v1 contains 645 texts (128 thousand tokens) from CzeSL-SGT, including 298 doubly annotated texts (59 thousand doubly annotated tokens).
Most texts are equipped with metadata about the author, the text and the annotation process.
CzeSL-man can be searched or downloaded. Although the searchable and the downloadable versions contain the same set of texts, they differ in how they represent the annotation.
CzeSL-man v1 downloadable:
- Downloadable from https://bitbucket.org/czesl/czesl-man/
- This release is in the PML format, generated by the feat tool.
- Each text with its annotation consists of several related files.
- Some of the texts are independently annotated twice.
- Also includes a flat version (files named *.vert), see CzeSL-man v2 below.
CzeSL-man v1 searchable:
- Searchable by KonText: https://kontext.korpus.cz/first_form?corpname=czesl-man
- Differs from both CzeSL-man v0 and CzeSL-man v1 downloadable in two aspects:
  - There are no texts with alternative error annotation: each text is annotated by a single annotator
  - The two-tier annotation scheme is radically modified to fit the token-based setup of the search tool.
- Apart from that, the content and metadata are identical to CzeSL-man v1 downloadable and the search options to those of CzeSL-SGT.
- The main feature in the annotation is the reversal of the source text and its annotation. The correction at Tier 2 is assumed to be the basis for the annotation. The tokens of this corpus represent the words at Tier 2. The original text is added as annotation of the Tier 2 tokens. Each token of the corrected text receives its corresponding Tier 0 form and a Tier 2 error label as attributes.
- This annotation discards any Tier 1 corrections and error tags, and simplifies other than 1:1 links between tokens at Tier 0 and Tier 2.
- Tier 2 is parsed in a way similar to some other Czech corpora searchable in KonText, such as SYN2015.
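
The reversed representation described above can be pictured with a minimal Python sketch. It takes 1:1 aligned triples of Tier 0 form, Tier 2 form and Tier 2 error label and prints one line per Tier 2 token, with the original form and the label as additional columns. The column order, the `_` placeholder for "no error", the sample sentence and the label `agr` are assumptions chosen for illustration; they are not taken from the actual corpus files.

```python
def to_vertical(links):
    """Turn (tier0_form, tier2_form, error_label) triples into tab-separated rows.

    The Tier 2 (corrected) form comes first because it is the token itself;
    the Tier 0 form and the error label follow as its attributes.
    """
    rows = []
    for t0_form, t2_form, label in links:
        rows.append("\t".join([t2_form, t0_form, label or "_"]))
    return "\n".join(rows)


# Hypothetical, 1:1 aligned example sentence (not from the corpus).
LINKS = [
    ("Vidím", "Vidím", ""),
    ("velký", "velké", "agr"),
    ("město", "město", ""),
    (".", ".", ""),
]

print(to_vertical(LINKS))
```

A search for an error label then amounts to filtering rows by the third column, while a search for the learner's original wording filters by the second.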

CzeSL-man v2

In this release the two-tier error annotation is represented as pairs of XML elements err and corr. An ill-formed portion of the source text is enclosed within the err structure, immediately followed by its correction, enclosed within the corr structure.
Apart from the error annotation, the content and metadata are the same as in CzeSL-man v1.
Linguistic annotation (tags and lemmas) is provided for all tokens at Tier 0 and Tier 2.
Downloadable from https://bitbucket.org/czesl/czesl-man/ (files named *.vert).
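
As a rough illustration of how the err/corr pairing can be consumed, the Python sketch below collects (error, correction) pairs from a small vertical-format sample. The sample sentence, the one-token-per-line layout with the word form in the first tab-separated column, the absence of attributes on err and corr, and the absence of nesting are all assumptions made for the example; the actual *.vert files may differ in these respects.

```python
import re

# Hypothetical sample: one token per line, structures as XML-like tags.
SAMPLE = """\
<s>
Já
<err>
sem
</err>
<corr>
jsem
</corr>
student
.
</s>
"""


def err_corr_pairs(vert_text):
    """Return a list of (erroneous tokens, corrected tokens) pairs."""
    pairs = []
    mode = None            # None, "err" or "corr"
    err, corr = [], []
    for line in vert_text.splitlines():
        line = line.strip()
        if not line:
            continue
        if re.match(r"<err\b", line):
            mode, err = "err", []
        elif re.match(r"<corr\b", line):
            mode, corr = "corr", []
        elif line == "</err>":
            mode = None
        elif line == "</corr>":
            # A corr structure closes the pair opened by the preceding err.
            pairs.append((err, corr))
            err, corr = [], []
            mode = None
        elif line.startswith("<"):
            continue       # any other structural tag, e.g. <s> or </s>
        else:
            form = line.split("\t")[0]   # word form assumed in column 1
            if mode == "err":
                err.append(form)
            elif mode == "corr":
                corr.append(form)
    return pairs


print(err_corr_pairs(SAMPLE))
```

Tokens outside any err or corr structure are skipped here; a variant that also keeps them could rebuild both the source sentence and the corrected sentence.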

CzeSL-TH

Includes a subset of CzeSL-SGT, hand-corrected, but not error-tagged, in 2017–2018, according to the 2T scheme.
The corpus includes about 1,300 texts (180 thousand tokens), selected from those that had not been manually error-annotated before (i.e. texts that are not part of CzeSL-man).
The selection was meant to make the manually annotated part of CzeSL more balanced in terms of L1 and CEFR level.
Downloadable from here in the feat format

CzeSL-MD

Includes a subset of texts from CzeSL-man, semi-automatically annotated by a multi-domain tagset with a focus on morphology, see the annotation manual (in Czech)
The dataset is available from https://bitbucket.org/czesl/czesl-md in the brat format
The texts can also be viewed and searched here using brat
To be extended to all CzeSL-man texts

CzeSL-UD

Texts from CzeSL-man with syntactic annotation according to the Universal Dependencies (UD) standard
Available from the LINDAT/CLARIN repository
CzeSL-man, CzeSL-MD and CzeSL-UD will eventually be merged into a single corpus with multiple types of annotation

CzeSL-GEC

CzeSL Grammatical Error Correction Dataset – manually annotated texts, converted to a format intended for NLP applications
Sentence pairs of original and corrected versions of Czech sentences collected from essays written by both non-native learners of Czech (CzeSL-man) and Czech pupils with Romani background
Downloadable from http://hdl.handle.net/11234/1-2143
See also AKCES-GEC

AKCES-GEC

AKCES Grammatical Error Correction Dataset for Czech – a grammatical error correction corpus for Czech generated from a subset of AKCES. It contains train, dev and test files annotated in the M2 format. Compared with the CzeSL-GEC dataset, this dataset contains separated edits together with their type annotations in the M2 format and has twice as many sentences.
Downloadable from http://hdl.handle.net/11234/1-3057
See https://arxiv.org/pdf/1910.00353.pdf for a detailed description
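
For readers unfamiliar with M2, the following Python sketch parses the general M2 layout (an `S` line with the tokenized sentence, followed by `A` lines of the form `start end|||type|||correction|||REQUIRED|||-NONE-|||annotator`) and applies the edits of one annotator to obtain the corrected sentence. The sample sentences and the error type label `SPELL` are invented for illustration and are not taken from AKCES-GEC; see the paper linked above for the type inventory actually used.

```python
# Hypothetical M2 sample; sentences and the type label are illustrative only.
SAMPLE_M2 = """\
S Já sem student .
A 1 2|||SPELL|||jsem|||REQUIRED|||-NONE-|||0

S To je dobře .
A -1 -1|||noop|||-NONE-|||REQUIRED|||-NONE-|||0
"""


def parse_m2(text):
    """Parse M2 text into a list of (tokens, edits) entries."""
    entries = []
    for block in text.strip().split("\n\n"):
        tokens, edits = [], []
        for line in block.splitlines():
            if line.startswith("S "):
                tokens = line[2:].split()
            elif line.startswith("A "):
                span, etype, corr, _req, _none, annot = line[2:].split("|||")
                start, end = map(int, span.split())
                edits.append((start, end, etype, corr, int(annot)))
        entries.append((tokens, edits))
    return entries


def apply_edits(tokens, edits, annotator=0):
    """Return the corrected sentence according to one annotator's edits."""
    out = list(tokens)
    # Apply edits from right to left so that earlier offsets stay valid.
    for start, end, etype, corr, annot in sorted(edits, reverse=True):
        if annot != annotator or etype == "noop":
            continue
        out[start:end] = corr.split()   # empty correction = deletion
    return " ".join(out)


for tokens, edits in parse_m2(SAMPLE_M2):
    print(" ".join(tokens), "->", apply_edits(tokens, edits))
```

Keeping the edits separate from the corrected text, as M2 does, is what allows evaluation per edit and per error type rather than only per sentence.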

CzeSL in TEITOK

Work in progress: will eventually include all available Czech texts written by non-native learners
See CzeSL in TEITOK at the ICTL site

Tools

Annotation editor feat, used for multi-level manual error annotation of CzeSL-man
Annotation editor brat, used for multi-domain error annotation of CzeSL-MD
Tagger and lemmatizer for Czech MorphoDiTa, used for morphological annotation of CzeSL-SGT
Spelling/grammar checker Korektor, used for automatic correction of CzeSL-SGT
Error identifier (see Jelínek, 2017), used for automatic identification of some types of errors in CzeSL-SGT
Multi-level concordancer SeLaQ, used for basic searching in CzeSL-man
Standard concordancer Manatee/KonText, used for searching in CzeSL-plain and CzeSL-SGT
General corpus tool TEITOK, currently used for building, editing and viewing learner corpora hosted by the Institute of Theoretical and Computational Linguistics (see Learner corpora at ICTL)

Bibliography

NEW:

Rosen, A., Hana, J., Hladká, B., Jelínek, T., Škodová, S., and Štindlová, B. (2020). Compiling and annotating a learner corpus for a morphologically rich language – CzeSL, a corpus of non-native Czech. Karolinum, Charles University Press, Praha. Available as a print copy, as an e-book, and in the CU Digital Repository.

Acknowledgement

This work was supported by the European Regional Development Fund project “Creativity and Adaptability as Conditions of the Success of Europe in an Interrelated World” (reg. no.: CZ.02.1.01/0.0/0.0/16_019/0000734).
