TUFS Asian Language Parallel Corpus (TALPCo)

Introduction

The TUFS Asian Language Parallel Corpus (TALPCo) is an open parallel corpus consisting of Japanese sentences and their translations into Korean, Burmese (Myanmar; the official language of the Republic of the Union of Myanmar), Malay (the national language of Malaysia, Singapore and Brunei), Indonesian, Thai, Vietnamese and English. TALPCo is licensed under a Creative Commons Attribution 4.0 International (CC BY 4.0) license. See the paper below for the details of TALPCo.

How to cite

(For Korean, Burmese, Malay, Indonesian and English translations)
Nomoto, Hiroki, Kenji Okano, David Moeljadi and Hideo Sawada. 2018. TUFS Asian Language Parallel Corpus (TALPCo). Proceedings of the Twenty-Fourth Annual Meeting of the Association for Natural Language Processing, 436-439.
(For Thai and Vietnamese translations, and interpersonal meaning annotations)
Nomoto, Hiroki, Kenji Okano, Sunisa Wittayapanyanon and Junta Nomura. 2019. Interpersonal meaning annotation for Asian language corpora: The case of TUFS Asian Language Parallel Corpus (TALPCo). Proceedings of the Twenty-Fifth Annual Meeting of the Association for Natural Language Processing, 846-849. Supplement
(For Javanese translations)
Lestari, Sri Budi and Yoshimi Miyake. 2022. Jawago ni mirareru taishouhyougen: Shootomuubii karano youreibunseki [Second person expressions in Javanese: An analysis of examples from short movies]. Indonesia: Gengo to Bunka 28: 65-84.
(For Malay and Indonesian constituency tree annotations)
Nomoto, Hiroki. 2022. Kyokushoushugi ni motoduku heiretsu tsuriibanku no kouchiku [Building a parallel treebank based on minimalism]. Proceedings of the Twenty-Eighth Annual Meeting of the Association for Natural Language Processing, 103-107.

Contents

data_jpn.txt            Japanese (raw sentences)
data_jpn-sound.txt      Japanese (links to sound files)
data_jpn-token.txt      Japanese (tokenized sentences)
data_jpn-IPSpkr.csv     Japanese (interpersonal meaning annotation, speaker)
data_jpn-IPAddr.csv     Japanese (interpersonal meaning annotation, addressee)
data_jpn-IPLex.csv      Japanese (interpersonal meaning annotation, lexical)
data_jpn-prosub.jsonl   Japanese (pronoun substitute annotation)
data_jpn-prosub.txt     Japanese (pronoun substitute annotation)

data_kor.txt            Korean (raw sentences)
data_kor-sound.txt      Korean (links to sound files)
data_kor-token.txt      Korean (tokenized sentences)
data_kor-prosub.jsonl   Korean (pronoun substitute annotation)
data_kor-prosub.txt     Korean (pronoun substitute annotation)

data_zsm.txt            Malay (raw sentences)
data_zsm-sound.txt      Malay (links to sound files)
data_zsm-token.txt      Malay (tokenized sentences)
data_zsm-MWE.txt        Malay (multiword expression list)
data_zsm-tree.txt       Malay (constituency tree annotation)
data_zsm.jpn-zsm        Malay (partial Japanese-Malay alignment)
data_zsm-IPSpkr.csv     Malay (interpersonal meaning annotation, speaker)
data_zsm-IPAddr.csv     Malay (interpersonal meaning annotation, addressee)
data_zsm-IPLex.csv      Malay (interpersonal meaning annotation, lexical)
data_zsm-prosub.jsonl   Malay (pronoun substitute annotation)
data_zsm-prosub.txt     Malay (pronoun substitute annotation)

data_ind.txt            Indonesian (raw sentences)
data_ind-sound.txt      Indonesian (links to sound files)
data_ind-token.txt      Indonesian (tokenized sentences)
data_ind-MWE.txt        Indonesian (multiword expression list)
data_ind-tree.txt       Indonesian (constituency tree annotation)
data_ind.jpn-ind        Indonesian (partial Japanese-Indonesian alignment)
data_ind-IPSpkr.csv     Indonesian (interpersonal meaning annotation, speaker)
data_ind-IPAddr.csv     Indonesian (interpersonal meaning annotation, addressee)
data_ind-IPLex.csv      Indonesian (interpersonal meaning annotation, lexical)
data_ind-prosub.jsonl   Indonesian (pronoun substitute annotation)
data_ind-prosub.txt     Indonesian (pronoun substitute annotation)

data_jav.txt            Javanese (raw sentences)
data_jav-prosub.jsonl   Javanese (pronoun substitute annotation)
data_jav-prosub.txt     Javanese (pronoun substitute annotation)

data_tha.txt            Thai (raw sentences)
data_tha-sound.txt      Thai (links to sound files)
data_tha-token.txt      Thai (tokenized sentences)
data_tha.jpn-tha        Thai (partial Japanese-Thai alignment)
data_tha-IPSpkr.csv     Thai (interpersonal meaning annotation, speaker)
data_tha-IPAddr.csv     Thai (interpersonal meaning annotation, addressee)
data_tha-IPLex.csv      Thai (interpersonal meaning annotation, lexical)
data_tha-prosub.jsonl   Thai (pronoun substitute annotation)
data_tha-prosub.txt     Thai (pronoun substitute annotation)

data_vie.txt            Vietnamese (raw sentences)
data_vie-sound.txt      Vietnamese (links to sound files)
data_vie-token.txt      Vietnamese (tokenized sentences)
data_vie-MWE.txt        Vietnamese (multi-syllable expression list)
data_vie.jpn-vie        Vietnamese (partial Japanese-Vietnamese alignment)
data_vie-IPSpkr.csv     Vietnamese (interpersonal meaning annotation, speaker)
data_vie-IPAddr.csv     Vietnamese (interpersonal meaning annotation, addressee)
data_vie-IPLex.csv      Vietnamese (interpersonal meaning annotation, lexical)
data_vie-prosub.jsonl   Vietnamese (pronoun substitute annotation)
data_vie-prosub.txt     Vietnamese (pronoun substitute annotation)

data_myn.txt            Burmese (raw sentences)
data_myn-sound.txt      Burmese (links to sound files)
data_myn-token.txt      Burmese (tokenized sentences)
data_myn-ps.txt         Burmese (POS-tagged sentences)
data_myn-prosub.jsonl   Burmese (pronoun substitute annotation)
data_myn-prosub.txt     Burmese (pronoun substitute annotation)

data_eng.txt            English (raw sentences)
data_eng-US.txt         English (US) (raw sentences) [by courtesy of Charles Kelly]

readme.md               (this document)

Format

All files are encoded in UTF-8 with DOS (CRLF) line endings.

Raw sentences

Sentence_ID [TAB] Sentence

1176	田中さんは 学生では ありません。
1176	Mr Tanaka is not a student.
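
As a rough illustration of this format, the sketch below parses raw-sentence lines into a dictionary keyed by sentence ID and pairs two languages on their shared IDs. The function names `parse_raw` and `pair_by_id` are hypothetical and not part of TALPCo, and the sample lines are the two example records above. It assumes that "DOS format" means CRLF line endings.

```python
def parse_raw(lines):
    """Map Sentence_ID -> sentence for lines of the form 'ID<TAB>sentence'."""
    sentences = {}
    for line in lines:
        line = line.rstrip("\r\n")  # assumption: DOS format means CRLF endings
        if not line:
            continue
        sent_id, sentence = line.split("\t", 1)
        sentences[sent_id] = sentence
    return sentences


def pair_by_id(source, target):
    """Return (id, source_sentence, target_sentence) for IDs present in both."""
    shared = sorted(source.keys() & target.keys())
    return [(i, source[i], target[i]) for i in shared]


# The two example records shown above, one per language file.
jpn = parse_raw(["1176\t田中さんは 学生では ありません。\r\n"])
eng = parse_raw(["1176\tMr Tanaka is not a student.\r\n"])
pairs = pair_by_id(jpn, eng)
```

The same parser should also work for the sound-file lists, since they share the Sentence_ID [TAB] value layout. When reading from disk, opening the files with `encoding="utf-8"` is the natural choice given the encoding stated above.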

Links to sound files

Sentence_ID [TAB] URL

Tokenized sentences

Sentence_ID [LINEBREAK] token [LINEBREAK] token [LINEBREAK] <EOS>

3627 Buku ini mempunyai se- ratus dua puluh muka surat . <EOS>
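
A minimal sketch of reading this block layout follows, assuming each sentence is a sentence ID line, then one token per line, then a closing `<EOS>` line. The function name `parse_tokenized` is hypothetical, and the sample block is made up for illustration rather than taken from the corpus.

```python
def parse_tokenized(lines):
    """Map Sentence_ID -> list of tokens from ID / token ... / <EOS> blocks."""
    sentences = {}
    current_id = None
    tokens = []
    for line in lines:
        line = line.rstrip("\r\n")
        if current_id is None:
            if line.strip():  # first non-empty line of a block is the ID
                current_id = line.strip()
        elif line == "<EOS>":
            sentences[current_id] = tokens
            current_id, tokens = None, []
        elif line:
            # Keep the whole line: a token may contain spaces
            # (e.g. a multiword expression).
            tokens.append(line)
    return sentences


# Made-up sample block, not taken from the corpus.
sample = ["9001\r\n", "saya\r\n", "letih\r\n", ".\r\n", "<EOS>\r\n"]
tokenized = parse_tokenized(sample)
```

Keeping each line intact, rather than splitting on whitespace, is a deliberate choice here so that multiword tokens survive as single items.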

Burmese POS-tagged sentences

Sentence_ID [TAB] Sentence

White space: Phrasal boundary
Dash: Morpheme boundary

1176	n-pr-postp n pref-v-suf-suf
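
The two boundary markers can be unpacked with two nested splits. The sketch below is only an illustration: `parse_pos_line` is a hypothetical name, and it assumes that the ID and the tag sequence are separated by a tab, as the format line states, and that the tags themselves never contain a dash.

```python
def parse_pos_line(line):
    """Split a POS-tagged line into (sentence_id, phrases).

    Each phrase is a list of morpheme tags: white space separates
    phrases and a dash separates morphemes within a phrase.
    """
    sent_id, tagged = line.rstrip("\r\n").split("\t", 1)
    phrases = [phrase.split("-") for phrase in tagged.split()]
    return sent_id, phrases


# The example record above.
pos_id, phrases = parse_pos_line("1176\tn-pr-postp n pref-v-suf-suf\r\n")
```

For this record, the result has three phrases containing three, one and four morpheme tags respectively.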

Constituency tree annotation

Sentence_ID.n [TAB] bracketing (for the n-th sentence for Sentence_ID)

3695.2	[S [CP [Conj Tapi] [CP [C *C_decl*] [TP [DP_a [D saya]] [T'[T *T*] [AP [AP [AdvP [Adv agak]] [AP [DP *t*<a>] [A letih]]] [AdvP [Adv sedikit]]]]]]] [PU .]]
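
One possible way to read the bracketing is a small recursive-descent parser, sketched below. The names `parse_tree` and `leaves` are hypothetical. The sketch assumes brackets are balanced and that every opening bracket is followed by a label. Treating terminals wrapped in asterisks as empty elements is an inference from the example, not something this document states.

```python
import re

# Brackets are tokens of their own; anything else is a label or a terminal.
TOKEN_RE = re.compile(r"\[|\]|[^\s\[\]]+")


def parse_tree(bracketing):
    """Parse '[Label child ...]' into nested (label, children) tuples."""
    tokens = TOKEN_RE.findall(bracketing)
    pos = 0

    def node():
        nonlocal pos
        assert tokens[pos] == "["
        label = tokens[pos + 1]
        pos += 2
        children = []
        while tokens[pos] != "]":
            if tokens[pos] == "[":
                children.append(node())
            else:
                children.append(tokens[pos])  # terminal string
                pos += 1
        pos += 1  # consume the closing bracket
        return (label, children)

    return node()


def leaves(tree):
    """Return the terminal strings of a parsed tree, left to right."""
    out = []
    for child in tree[1]:
        if isinstance(child, str):
            out.append(child)
        else:
            out.extend(leaves(child))
    return out


# The example record above.
EXAMPLE = (
    "3695.2\t[S [CP [Conj Tapi] [CP [C *C_decl*] [TP [DP_a [D saya]] "
    "[T'[T *T*] [AP [AP [AdvP [Adv agak]] [AP [DP *t*<a>] [A letih]]] "
    "[AdvP [Adv sedikit]]]]]]] [PU .]]"
)
tree_id, bracketing = EXAMPLE.split("\t", 1)
sentence_id, nth = tree_id.split(".")
tree = parse_tree(bracketing)
# Assumption: asterisk-initial terminals are empty elements or traces.
overt = [word for word in leaves(tree) if not word.startswith("*")]
```

Tokenizing on the brackets themselves, rather than on white space, lets the parser cope with sequences such as `[T'[T` where no space separates a label from the next opening bracket.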

Alignment

Sentence_ID [TAB] Japanese_token_index-target_language_token_index

1176	0-1 1-0 3-3 8-2 9-4
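
A short sketch of turning an alignment line into integer index pairs is given below. `parse_alignment` is a hypothetical name; each pair is read as (Japanese token index, target-language token index), following the format line above.

```python
def parse_alignment(line):
    """Split an alignment line into (sentence_id, [(jpn_index, target_index), ...])."""
    sent_id, pairs = line.rstrip("\r\n").split("\t", 1)
    links = []
    for pair in pairs.split():
        jpn_index, target_index = pair.split("-")
        links.append((int(jpn_index), int(target_index)))
    return sent_id, links


# The example record above.
align_id, links = parse_alignment("1176\t0-1 1-0 3-3 8-2 9-4\r\n")
```

Because the alignment is partial, tokens on either side may have no link at all, so code that looks up an index in `links` should allow for missing entries.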

Interpersonal meaning annotation

See the second paper above and its supplement for the details of the interpersonal meaning feature system.

Speaker, Addressee

Sentence_ID, Gender, Marital status, Honour, Age, Social status, Role, Group, Formality, Number

3243,female,,,,neutral,,,,sg
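
The records can be read with Python's standard `csv` module, as in the following sketch. `SPEAKER_FIELDS` and `parse_interpersonal` are hypothetical names, and the sketch assumes the files contain no header row, which this document does not confirm. Empty features are dropped so that only specified values remain.

```python
import csv
import io

# Column names taken from the format line above.
SPEAKER_FIELDS = [
    "Sentence_ID", "Gender", "Marital status", "Honour", "Age",
    "Social status", "Role", "Group", "Formality", "Number",
]


def parse_interpersonal(text):
    """Return one dict per record, keeping only non-empty features."""
    # Assumption: the file has no header row.
    reader = csv.DictReader(io.StringIO(text, newline=""), fieldnames=SPEAKER_FIELDS)
    return [{key: value for key, value in row.items() if value} for row in reader]


# The example record above.
records = parse_interpersonal("3243,female,,,,neutral,,,,sg\r\n")
```

For the example record, only Sentence_ID, Gender, Social status and Number are specified.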

Lexical

Token_index, token, Gender, Marital status, Honour, Age, Social status, Role, Group, Formality, Number

3845,,,,,,,,,,
0,Cô,female,,,elder.parents_younger_sibling,,parents_sibling.paternal,,,
1,tôi,,,,,neutral,,,,sg
2,làm việc,,,,,,,,,
3,ở,,,,,,,,,
4,cửa hàng,,,,,,,,,
5,hoa,,,,,,,,,
6,.,,,,,,,,,
<EOS>,,,,,,,,,,
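
The lexical files combine the block layout of the tokenized files with the feature columns of the speaker and addressee files. The sketch below groups the token records of each sentence under its ID. `LEX_FIELDS` and `parse_lexical` are hypothetical names. It assumes each sentence is one record per line: an ID record whose first field is the sentence ID, then the token records, then a closing `<EOS>` record.

```python
import csv
import io

# Column names taken from the format line above.
LEX_FIELDS = [
    "Token_index", "token", "Gender", "Marital status", "Honour", "Age",
    "Social status", "Role", "Group", "Formality", "Number",
]


def parse_lexical(text):
    """Map Sentence_ID -> list of token dicts (non-empty features only)."""
    sentences = {}
    current_id = None
    rows = []
    for record in csv.reader(io.StringIO(text, newline="")):
        if not any(record):
            continue  # skip blank lines
        if record[0] == "<EOS>":
            sentences[current_id] = rows
            current_id, rows = None, []
        elif current_id is None:
            current_id = record[0]  # first record of a block holds the ID
        else:
            row = dict(zip(LEX_FIELDS, record))
            rows.append({key: value for key, value in row.items() if value})
    return sentences


# The example block above, one record per line.
SAMPLE = "\r\n".join([
    "3845,,,,,,,,,,",
    "0,Cô,female,,,elder.parents_younger_sibling,,parents_sibling.paternal,,,",
    "1,tôi,,,,,neutral,,,,sg",
    "2,làm việc,,,,,,,,,",
    "3,ở,,,,,,,,,",
    "4,cửa hàng,,,,,,,,,",
    "5,hoa,,,,,,,,,",
    "6,.,,,,,,,,,",
    "<EOS>,,,,,,,,,,",
]) + "\r\n"
lexical = parse_lexical(SAMPLE)
```

In this example only the first two tokens carry interpersonal features; the remaining tokens keep just their index and form.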

Notes on sound files

The sound files for the following sentences come from TUFS Open Language Resources.

Japanese: All sentences except sentences 1176, 1178, 1180, 1194, 1222, 1229, 1233, 1244, 1245, 1246, 1247, 1248, 1249, 1250, 1251, 1254, 1255, 1257, 1258, 1259, 1264, 1268, 1274, 1275, 1276, 1280, 1281, 1286, 1291, 1296, 1297, 1299, 1302, 1305, 1308, 1311, 1312, 1319, 1320, 1322, 1323, 1326, 1330, 1333, 1334, 1335, 1341, 1348, 1349, 1350, 1356, 1367, 1369, 1371, 1378, 1379, 1380, 1382, 1386, 1388, 1389, 1392, 1393, 1396, 1397, 1398, 1401, 1402, 1403, 1408, 1409, 1410, 1414, 1418, 1419, 1423, 1426, 1427, 1428, 1431, 1433, 1434, 1435, 1446, 1449, 1450, 1451, 1454, 1455, 1457, 1459, 1460, 1473, 1474, 1476, 1477, 1478, 1481, 1482, 1484, 1485, 1491, 1492, 1494, 1495, 1501, 1507, 1512, 1513, 1515, 1520, 1527, 1528, 1530, 1531, 1532, 1533, 1534, 1535, 1537, 1542, 1544, 1545, 1546, 1547, 1548, 1549, 1551, 1553, 1556, 1557, 1558, 1562, 1586, 1588, 1592, 1599, 1601, 1605, 1609, 1614, 1619, 1625, 1630, 1634, 1636, 1637, 1638, 1640, 1645, 1646, 1647, 1650, 1654, 1660, 1667, 1668, 1671, 1672, 1675, 1679, 1682, 1683, 1692, 1693, 1694, 1696, 1710, 1714, 1715, 1720, 1727, 1728, 1734, 1738, 1745, 1748, 1768, 1778, 1794, 1799, 1801, 1810, 1811, 1813, 1817, 1818, 1824, 1826, 1831, 1836, 1837, 1848, 1849, 1860, 1862, 1879, 1881, 1883, 1885, 1900, 1919, 1921, 1928, 1931, 1935, 1942, 1947, 1965, 1967, 1971, 1974, 1990, 1993, 2004, 2005, 2007, 2011, 2017, 2018, 2019, 2023, 2024, 2030, 2041, 2042, 2043, 2047, 2055, 2056, 2060, 2061, 2064, 2065, 2068, 2069, 2079, 2086, 2088, 2090, 2092, 2093, 2097, 2099, 2100, 2104, 2115, 2120, 2123, 2130, 2153, 2154, 2158, 2170, 2178, 2179, 2180, 2187, 2189, 2192, 2205, 2206, 2213, 2220, 2227, 2229, 2232, 2234, 2247, 2249, 2332, 2369, 2383, 2391, 2396, 2409, 2415, 2429, 2495, 2497, 2498, 2520, 2548, 2562, 2599, 2637, 2651, 2672, 2734, 2744, 2908, 3309, 3322 and 3751.
Malay: All sentences except sentences 1371, 1446, 1459, 2030, 2092, 2154, 2800, 2812, 3103, 3107, 3156, 3241, 3268, 3336, 3342, 3516, 3578, 3730, 3731, 3811, 3840, 3869 and 3888.
Burmese: All sentences.

Notes on tokenization

Malay/Indonesian

The Malay and Indonesian sentences were tokenized manually by Hiroki Nomoto and David Moeljadi, respectively. All clitics (i.e. -nya, -lah, -kah) were split off as separate tokens, as were instances of the prefix se- that function as cardinal numerals. Note that -nya used as a suffix (as opposed to the clitic -nya) and the non-numeral instances of se- were not split off. The following dictionaries were consulted when it was not immediately obvious whether a word sequence constituted a multiword expression.

KBBI5. 2016. Kamus Besar Bahasa Indonesia (edisi kelima). Jakarta: Badan Pengembangan dan Pembinaan Bahasa.
Nomoto, Hiroki. 2016. Pootaburu Nichi-Maree-Ei, Maree-Nichi-Ei Jiten [Japanese-Malay-English, Malay-Japanese-English Portable Dictionary]. Tokyo: Sanshusha.

Thai

The sentences were tokenized using the tokenize function of Deepcut and then checked by Sunisa Wittayapanyanon and Yuka Sato. The principle adopted for the manual correction is:

Tokenize a sequence consisting of two or more syllables if and only if all constituent syllables have a meaning that contributes to the meaning of the whole phrase/sentence.
- Do not tokenize a sequence if it contains a meaningless syllable.
- Do not tokenize a sequence if tokenizing it will change the meaning of the whole phrase/sentence.

Vietnamese

The sentences were tokenized using the word_tokenize function of the Undersea - Vietnamese NLP Project and then checked by Junta Nomura and Hiroki Nomoto. The following dictionary was consulted when it was not immediately obvious whether a syllable sequence constituted a multi-syllable expression.

Hoàng, Phê, ed. 2003. Từ Điển Tiếng Việt. Đà Nẵng: Nhà Xuất Bản Đà Nẵng.

Notes on pronoun substitute annotation

See the pronoun substitute project page.
One can modify the annotation, create a summary table and visualize the annotation by feeding the raw sentences and pronoun substitute (prosub) annotation .txt files to ETA: Easy Text Annotator.
