Simplify Input for Parsers: Usage


SimplifyInputForParsers.pl requires Perl 5.0 (or higher).


First, unpack the package in a directory of your choice:

> tar -xzf Simplify_Input_For_Parsers.tgz



You need an input directory (e.g. ~/pdt2.0/) containing linguistic data in vertical format, stored in files with the .vert suffix (e.g. cmpr9406_001.vert, ...).
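
Before running the script, it can help to confirm that the input directory really contains .vert files. The following is a minimal shell sketch: count_vert_files is a hypothetical helper, not part of the package, and it assumes the files sit directly in the directory rather than in subdirectories.

```shell
# Hypothetical helper (not part of the package): print how many .vert
# files sit directly inside the directory given as $1.
count_vert_files() {
    dir="$1"
    count=0
    for f in "$dir"/*.vert; do
        # When nothing matches, the glob stays unexpanded, so check
        # that each candidate is a real file before counting it.
        if [ -f "$f" ]; then
            count=$((count + 1))
        fi
    done
    echo "$count"
}
```

Calling count_vert_files ~/pdt2.0/ should print a number greater than zero if the directory is ready to be processed.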


CoNLL format

To simplify syntactically annotated data and obtain CoNLL format (MaltParser):

> ./SimplifyInputForParsers.pl -TM ~/pdt2.0/

New files with a .conll suffix will be created in the same directory.
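
After a run, one rough sanity check is to compare the number of output files with the number of .vert inputs. The sketch below is only an illustration: count_by_suffix and outputs_complete are hypothetical helpers, and the check assumes that the script writes one output file per input file, which the text above suggests but does not state outright.

```shell
# Hypothetical helper: print how many regular files in directory $1
# have names ending in suffix $2 (e.g. .vert or .conll).
count_by_suffix() {
    dir="$1"
    suffix="$2"
    count=0
    for f in "$dir"/*"$suffix"; do
        # Skip the unexpanded pattern when nothing matches.
        if [ -f "$f" ]; then
            count=$((count + 1))
        fi
    done
    echo "$count"
}

# Succeeds only if directory $1 holds as many files with suffix $2
# as it holds .vert files.
# Assumption (not confirmed by the source): one output per input file.
outputs_complete() {
    [ "$(count_by_suffix "$1" .vert)" -eq "$(count_by_suffix "$1" "$2")" ]
}
```

For example, outputs_complete ~/pdt2.0/ .conll returns a non-zero exit status when the two counts differ. The same check can be used with any other output suffix.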


If you prefer a single output file for training, use:

> ./SimplifyInputForParsers.pl -TM -o ./NameOfOutputFile.conll ~/pdt2.0/


To simplify new data and obtain both CoNLL format and a backup file (used to recover lost information):

> ./SimplifyInputForParsers.pl -NM ~/pdt2.0/

New files with .conll and .backup suffixes will be created in the same directory.


MCD format

To simplify syntactically annotated data and obtain MCD format (MSTParser):

> ./SimplifyInputForParsers.pl -TC3 ~/pdt2.0/

New files with a .mcd suffix will be created in the same directory.


If you prefer a single output file for training, use:

> ./SimplifyInputForParsers.pl -TC3 -o ./NameOfOutputFile.mcd ~/pdt2.0/


To simplify new data and obtain both MCD format and a backup file (used to recover lost information):

> ./SimplifyInputForParsers.pl -NC3 ~/pdt2.0/

New files with .mcd and .backup suffixes will be created in the same directory.


More options are available, but these are the most commonly used and lead to the best parser performance.