Simplify Input for Parsers: Input/Output Formats


The input format of the program is the “vertical format”, in which each word is placed on a separate line with all its attributes separated by tabs: word form, lemma, morphological tag, syntactic function (afun), ID (a number giving the word's position in the sentence, starting with 1), and the head of the dependency relation (the ID of the head, or 0 for the root):


Potřebujete	potřebovat	VB-P---2P-AA---	Pred	1	0
rychle	rychle	Dg-------1A----	Adv	2	3
poradit	poradit	Vf--------A----	Obj	3	1
?	?	Z:-------------	AuxK	4	0

Zvedněte	zvednout	Vi-P---2--A----	Pred_Co	1	3
telefon	telefon	NNIS4-----A----	Obj	2	1
a	a	J^-------------	Coord	3	0
zavolejte	zavolat	Vi-P---2--A----	Pred_Co	4	3
.	.	Z:-------------	AuxK	5	0
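As an illustration of the six-attribute vertical format, the following is a minimal Python sketch of a reader for it. It is not part of the distributed Perl scripts; the names `Token` and `read_vertical` are hypothetical, and the sketch assumes that sentences are separated by a blank line.

```python
from typing import NamedTuple


class Token(NamedTuple):
    form: str
    lemma: str
    tag: str    # 15-position morphological tag
    afun: str   # syntactic function
    id: int     # position in the sentence, starting with 1
    head: int   # ID of the head, 0 for the root


def read_vertical(text: str) -> list[list[Token]]:
    """Split six-attribute vertical text into sentences of tokens."""
    sentences: list[list[Token]] = []
    current: list[Token] = []
    for line in text.splitlines():
        if not line.strip():
            # Assumption: a blank line closes the current sentence.
            if current:
                sentences.append(current)
                current = []
            continue
        form, lemma, tag, afun, tok_id, head = line.split("\t")
        current.append(Token(form, lemma, tag, afun, int(tok_id), int(head)))
    if current:
        sentences.append(current)
    return sentences


sample = (
    "Potřebujete\tpotřebovat\tVB-P---2P-AA---\tPred\t1\t0\n"
    "rychle\trychle\tDg-------1A----\tAdv\t2\t3\n"
    "poradit\tporadit\tVf--------A----\tObj\t3\t1\n"
    "?\t?\tZ:-------------\tAuxK\t4\t0\n"
    "\n"
    "Zvedněte\tzvednout\tVi-P---2--A----\tPred_Co\t1\t3\n"
    "telefon\ttelefon\tNNIS4-----A----\tObj\t2\t1\n"
)

sentences = read_vertical(sample)
```

Each sentence comes back as a list of `Token` records, so `sentences[0][1].head` gives the head ID of "rychle".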


For new texts, only word forms, lemmas and morphological tags are used (a vertical format with three attributes).


Using SimplifyInputForParsers.pl, two output formats can be created, depending on the choice of parser: MaltParser requires the CoNLL format, while MSTParser uses the MCD format. These formats serve as both the input and the output of the parsers (for training and for parsing new texts).


CoNLL format (MaltParser)

1	Potřebujete	potřebovat	V	V	Synt=V|VForm=P|NumGen=-P|Pers=2	0	Pred	0	Pred
2	rychle	rychle	D	D	Synt=D|Gr=1	3	Adv	3	Adv
3	poradit	poradit	V	V	Synt=V|VForm=f	1	Obj	1	Obj
4	?	?	Z	Z	Synt=Z	0	AuxK	0	AuxK

1	Zvedněte	zvednout	V	V	Synt=V|VForm=i|NumGen=-P|Pers=2	3	Pred_Co	3	Pred_Co
2	telefon	telefon	N	N	Synt=N|NumGen=IS|Case=4	1	Obj	1	Obj
3	a	a	J	J	Synt=^	0	Coord	0	Coord
4	zavolejte	zavolat	V	V	Synt=V|VForm=i|NumGen=-P|Pers=2	3	Pred_Co	3	Pred_Co
5	.	.	Z	Z	Synt=Z	0	AuxK	0	AuxK
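The ten columns of each CoNLL line can be read as in the following Python sketch. The column names are those of the CoNLL-X shared-task layout (ID, FORM, LEMMA, CPOSTAG, POSTAG, FEATS, HEAD, DEPREL, PHEAD, PDEPREL); matching them to the example by position is an assumption, and `parse_conll_row` is a hypothetical helper, not a function of MaltParser or of SimplifyInputForParsers.pl.

```python
CONLL_FIELDS = (
    "id", "form", "lemma", "cpostag", "postag",
    "feats", "head", "deprel", "phead", "pdeprel",
)


def parse_conll_row(line: str) -> dict:
    """Turn one tab-separated CoNLL line into a dict of named fields."""
    values = line.rstrip("\n").split("\t")
    if len(values) != len(CONLL_FIELDS):
        raise ValueError(f"expected {len(CONLL_FIELDS)} columns, got {len(values)}")
    row = dict(zip(CONLL_FIELDS, values))
    row["id"] = int(row["id"])
    row["head"] = int(row["head"])
    # FEATS is a "|"-separated list of name=value pairs.
    row["feats"] = dict(pair.split("=", 1) for pair in row["feats"].split("|"))
    return row


row = parse_conll_row(
    "2\ttelefon\ttelefon\tN\tN\tSynt=N|NumGen=IS|Case=4\t1\tObj\t1\tObj"
)
```

Here `row["feats"]` is a dictionary, so the case of "telefon" is available as `row["feats"]["Case"]`.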



MCD format (MSTParser)

Potřebujete	rychle	poradit	?
VB	Dg	Vf	Z:
Pred	Adv	Obj	AuxK
0	3	1	0

Zvedněte	telefon	a	zavolejte	.
Vi	N4	J^	Vi	Z:
Pred_Co	Obj	Coord	Pred_Co	AuxK
3	1	0	3	0
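In contrast to the one-word-per-line formats, the MCD format stores a sentence as four tab-separated lines: word forms, tags, dependency labels, and head IDs. The following Python sketch shows that transposition only. The function `to_mcd` is hypothetical, and the shortened tags (such as `VB` or `N4`) are taken as given, since the rules by which the script derives them are not described here.

```python
def to_mcd(tokens: list[tuple[str, str, str, int]]) -> str:
    """Transpose per-token (form, tag, label, head) tuples into four lines."""
    if not tokens:
        return ""
    # zip(*tokens) turns a list of rows into the four columns.
    forms, tags, labels, heads = zip(*tokens)
    rows = (forms, tags, labels, heads)
    return "\n".join("\t".join(str(value) for value in row) for row in rows)


sentence = [
    ("Potřebujete", "VB", "Pred", 0),
    ("rychle", "Dg", "Adv", 3),
    ("poradit", "Vf", "Obj", 1),
    ("?", "Z:", "AuxK", 0),
]

mcd_block = to_mcd(sentence)
```

Printing `mcd_block` for each sentence, with a blank line between sentences, yields the layout of the example above.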



The final output format (after applying the program RecoverOriginalFormsAndLemmas.pl) is again a vertical format with six attributes, including the syntactic annotation.