Simplify Input for Parsers



A software tool that simplifies Czech linguistic data to facilitate parser training and the parsing of new data.


The tool improves the efficacy of automatic dependency parsing by simplifying Czech linguistic data: it removes part of the syntactically irrelevant linguistic variability. This makes it easier for the parser to recognize syntactic structures, so the parser achieves higher accuracy and runs faster.


Download

The tool has three parts (distributed in one gzipped tar archive, Simplify_Input_For_Parsers.tgz):

SimplifyInputForParsers.pl

SimplifyInput_LingData.tsv

RecoverOriginalFormsAndLemmas.pl


The Perl program SimplifyInputForParsers.pl processes linguistic data (texts): both training data (morphologically and syntactically annotated) and new/test data (annotated only morphologically).

For words that belong to a class whose members have identical syntactic properties, it replaces the word forms and lemmas with a single representative (e.g. all masculine given names are replaced by one proxy name). The program takes its lists of words and their properties from the file SimplifyInput_LingData.tsv.
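The replacement step can be illustrated with a minimal Python sketch. It is not the code of SimplifyInputForParsers.pl: the class name, the proxy name "Petr", the member names, the tags and the token layout (form, lemma, tag) are all assumptions made for this example, and the real word lists come from SimplifyInput_LingData.tsv, whose format is not described here. The sketch also reuses the proxy lemma as the word form and leaves the tag untouched, which may differ from how the real program handles inflected forms.

```python
# Illustrative sketch only; names, tags and data layout are hypothetical.

# Hypothetical word classes: class name -> (proxy lemma, member lemmas).
WORD_CLASSES = {
    "masculine_given_name": ("Petr", {"Jan", "Karel", "Tomáš"}),
}


def build_lookup(word_classes):
    """Map every member lemma to the proxy lemma of its class."""
    lookup = {}
    for proxy, members in word_classes.values():
        for lemma in members:
            lookup[lemma] = proxy
    return lookup


def simplify(tokens, lookup):
    """Replace form and lemma of listed words with their class proxy.

    tokens is a list of (form, lemma, tag) tuples; the tag is kept as is.
    Tokens whose lemma is not listed are returned unchanged.
    """
    simplified = []
    for form, lemma, tag in tokens:
        proxy = lookup.get(lemma)
        if proxy is None:
            simplified.append((form, lemma, tag))
        else:
            simplified.append((proxy, proxy, tag))
    return simplified


lookup = build_lookup(WORD_CLASSES)
sentence = [("Karla", "Karel", "NOUN-GEN"), ("viděl", "vidět", "VERB")]
simplified = simplify(sentence, lookup)
```

In this sketch every listed given name collapses to the same proxy, so a parser trained on such data sees one lexical item where the original text had many.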

Simplified data can be used to train a parser (tested with MSTParser and MaltParser), which can then parse new data (simplified in the same way).

Discarded linguistic information, not used during parsing, can then be restored using the program RecoverOriginalFormsAndLemmas.pl.
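One way to picture the restoration step is the following Python sketch. It is not the code of RecoverOriginalFormsAndLemmas.pl: it assumes that the original forms and lemmas were saved before simplification, that the parser keeps the number and order of tokens, and that each parsed token is a (form, lemma, tag, head, relation) tuple. The function name, the tuple layout and the labels are hypothetical.

```python
# Illustrative sketch only; the data layout and names are hypothetical.

def restore_originals(parsed_tokens, original_tokens):
    """Put the original form and lemma back into parsed tokens.

    parsed_tokens: list of (form, lemma, tag, head, relation) tuples
                   produced from simplified input.
    original_tokens: list of (form, lemma) tuples saved before
                     simplification, in the same token order.
    """
    if len(parsed_tokens) != len(original_tokens):
        raise ValueError("token counts differ; cannot align the two sequences")
    restored = []
    for (_, _, tag, head, relation), (form, lemma) in zip(parsed_tokens, original_tokens):
        # Keep the syntactic annotation, swap the lexical information back.
        restored.append((form, lemma, tag, head, relation))
    return restored


parsed = [("Petr", "Petr", "NOUN", 2, "Sb"), ("spí", "spát", "VERB", 0, "Pred")]
originals = [("Karel", "Karel"), ("spí", "spát")]
restored = restore_originals(parsed, originals)
```

Aligning by position keeps the sketch short; it relies entirely on the assumption that no tokens are added, dropped or reordered between simplification and parsing.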


Input/Output Formats


Usage


Author and License: Tomáš Jelínek, CC BY-SA 3.0


Publications:

Jelínek, Tomáš: Improving Dependency Parsing by Filtering "Linguistic Noise". In Text, Speech and Dialogue, Proceedings of the 16th International Conference TSD 2013, Lecture Notes in Computer Science, pp. 288-294, Springer: Berlin-Heidelberg, Germany, 2013.

Jelínek, Tomáš: A System for Syntactic Annotation of Large Czech Corpora. In Trudy meždunarodnoj konferencii "Korpusnaja lingvistika - 2013" (Proceedings of the International Conference "Corpus Linguistics - 2013"), pp. 44-51, St.-Petersburg University Press, St.-Petersburg, Russia, 2013.