reldi-tokeniser

A tokeniser developed inside the ReLDI project. It currently supports five languages -- Slovene, Croatian, Serbian, Macedonian and Bulgarian -- and two modes -- standard and non-standard text.

Usage

Command line

$ echo 'kaj sad s tim.daj se nasmij ^_^.' | ./tokeniser.py hr -n
1.1.1.1-3	kaj
1.1.2.5-7	sad
1.1.3.9-9	s
1.1.4.11-13	tim
1.1.5.14-14	.
1.2.1.15-17	daj
1.2.2.19-20	se
1.2.3.22-27	nasmij
1.2.4.29-31	^_^
1.2.5.32-32	.
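Each default output line pairs an index of the form paragraph.sentence.token.start-end (1-based character span) with the token itself. A minimal sketch of how such lines could be parsed downstream; the helper name and the dictionary layout are illustrative, not part of the tokeniser:

```python
# Hypothetical helper: split one tokeniser output line such as "1.1.1.1-3\tkaj"
# into paragraph, sentence and token indices, the character span, and the token.
def parse_token_line(line):
    index, token = line.rstrip('\n').split('\t')
    par, sent, tok, span = index.split('.')
    start, end = span.split('-')
    return {
        'paragraph': int(par),
        'sentence': int(sent),
        'token': int(tok),
        'start': int(start),
        'end': int(end),
        'text': token,
    }

print(parse_token_line('1.1.1.1-3\tkaj'))
```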

Language is a positional argument, while tokenisation of non-standard text, tagging and lemmatisation of symbols and punctuation, and different output formats are optional ones.

$ python tokeniser.py -h
usage: tokeniser.py [-h] [-c] [-b] [-d] [-n] [-t] {sl,hr,sr,mk,bg}

Tokeniser for (non-)standard Slovene, Croatian, Serbian, Macedonian and Bulgarian

positional arguments:
  {sl,hr,sr,mk,bg}   language of the text

optional arguments:
  -h, --help         show this help message and exit
  -c, --conllu       generates CONLLU output
  -b, --bert         generates BERT-compatible output
  -d, --document     passes through ConLL-U-style document boundaries
  -n, --nonstandard  invokes the non-standard mode
  -t, --tag          adds tags and lemmas to punctuations and symbols

Python module

# string mode
import reldi_tokeniser

text = 'kaj sad s tim.daj se nasmij ^_^.'
output = reldi_tokeniser.run(text, 'hr', nonstandard=True, tag=True)

# object mode
from reldi_tokeniser.tokeniser import ReldiTokeniser

reldi = ReldiTokeniser('hr', conllu=True, nonstandard=True, tag=True)
list_of_lines = [el + '\n' for el in text.split('\n')]
test = reldi.run(list_of_lines, mode='object')

The Python module has two mandatory parameters -- text and language. The other, optional parameters are conllu, bert, document, nonstandard and tag.

CoNLL-U output

This tokeniser can also produce CoNLL-U output (flag -c/--conllu). If the additional -d/--document flag is given, the tokeniser passes through lines starting with # newdoc id = to preserve document structure.

$ echo '# newdoc id = prvi
kaj sad s tim.daj se nasmij ^_^.
haha
# newdoc id = gidru
štaš' | ./tokeniser.py hr -n -c -d
# newdoc id = prvi
# newpar id = 1
# sent_id = 1.1
# text = kaj sad s tim.
1	kaj	_	_	_	_	_	_	_	_
2	sad	_	_	_	_	_	_	_	_
3	s	_	_	_	_	_	_	_	_
4	tim	_	_	_	_	_	_	_	SpaceAfter=No
5	.	_	_	_	_	_	_	_	SpaceAfter=No
# sent_id = 1.2
# text = daj se nasmij ^_^.
1	daj	_	_	_	_	_	_	_	_
2	se	_	_	_	_	_	_	_	_
3	nasmij	_	_	_	_	_	_	_	_
4	^_^	_	_	_	_	_	_	_	SpaceAfter=No
5	.	_	_	_	_	_	_	_	_
# newpar id = 2
# sent_id = 2.1
# text = haha
1	haha	_	_	_	_	_	_	_	_
# newdoc id = gidru
# newpar id = 1
# sent_id = 1.1
# text = štaš
1	štaš	_	_	_	_	_	_	_	_
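The MISC column (the tenth) carries SpaceAfter=No for tokens not followed by whitespace, which is enough to rebuild the surface string shown in the # text comments. A minimal sketch, assuming plain CoNLL-U token lines as input; the function name is illustrative and not part of reldi-tokeniser:

```python
# Sketch: rebuild the surface string of one sentence from its CoNLL-U token
# lines, honouring SpaceAfter=No in the MISC column (column 10).
def detokenise(conllu_lines):
    out = []
    for line in conllu_lines:
        if not line or line.startswith('#'):
            continue  # skip comments and blank separators
        cols = line.split('\t')
        out.append(cols[1])  # FORM column
        if 'SpaceAfter=No' not in cols[9]:
            out.append(' ')
    return ''.join(out).rstrip()

sentence = [
    '1\tkaj\t_\t_\t_\t_\t_\t_\t_\t_',
    '2\tsad\t_\t_\t_\t_\t_\t_\t_\t_',
    '3\ts\t_\t_\t_\t_\t_\t_\t_\t_',
    '4\ttim\t_\t_\t_\t_\t_\t_\t_\tSpaceAfter=No',
    '5\t.\t_\t_\t_\t_\t_\t_\t_\tSpaceAfter=No',
]
print(detokenise(sentence))  # kaj sad s tim.
```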

Pre-tagging

The tokeniser can also pre-annotate text on the part-of-speech (UPOS and XPOS) and lemma level (flag -t or --tag) when the tokenisation regexes provide sufficient evidence (punctuation, mentions, hashtags, URLs, e-mails, emoticons, emojis). The default output format in case of pre-tagging is CoNLL-U.

$ echo -e "kaj sad s tim.daj se nasmij ^_^. haha" | python tokeniser.py hr -n -t
# newpar id = 1
# sent_id = 1.1
# text = kaj sad s tim.
1	kaj	_	_	_	_	_	_	_	_
2	sad	_	_	_	_	_	_	_	_
3	s	_	_	_	_	_	_	_	_
4	tim	_	_	_	_	_	_	_	SpaceAfter=No
5	.	.	PUNCT	Z	_	_	_	_	SpaceAfter=No
# sent_id = 1.2
# text = daj se nasmij ^_^.
1	daj	_	_	_	_	_	_	_	_
2	se	_	_	_	_	_	_	_	_
3	nasmij	_	_	_	_	_	_	_	_
4	^_^	^_^	SYM	Xe	_	_	_	_	SpaceAfter=No
5	.	.	PUNCT	Z	_	_	_	_	_
# sent_id = 1.3
# text = haha
1	haha	_	_	_	_	_	_	_	_
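In the pre-tagged output, only the tokens the regexes could decide on (here the full stops and the emoticon) get LEMMA, UPOS and XPOS values; everything else stays `_`. A small sketch of how the pre-annotated rows could be collected; the helper name is illustrative, not part of the tokeniser:

```python
# Sketch: pull out the tokens that pre-tagging actually annotated, i.e. the
# CoNLL-U rows whose UPOS column is filled in rather than '_'.
def pretagged_tokens(conllu_lines):
    tagged = []
    for line in conllu_lines:
        if not line or line.startswith('#'):
            continue  # skip comments and blank separators
        form, lemma, upos, xpos = line.split('\t')[1:5]
        if upos != '_':
            tagged.append((form, lemma, upos, xpos))
    return tagged

rows = [
    '4\t^_^\t^_^\tSYM\tXe\t_\t_\t_\t_\tSpaceAfter=No',
    '5\t.\t.\tPUNCT\tZ\t_\t_\t_\t_\t_',
    '1\thaha\t_\t_\t_\t_\t_\t_\t_\t_',
]
print(pretagged_tokens(rows))  # [('^_^', '^_^', 'SYM', 'Xe'), ('.', '.', 'PUNCT', 'Z')]
```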
