DaCiDian (大词典)

DaCiDian is an open-sourced lexicon for Chinese Automatic Speech Recognition(ASR)

Design

In mainstream ASR system, lexicon is a core component, that maps word into acoustic modeling units(such as phone). In DaCiDian, we break the mapping into 2 independent layers:

word --> PinYin syllable --> phoneme

The purpose of this design is as follows:

Anyone who is familiar with PinYin (basically every mandarin speaker), can enrich DaCiDian's vocabulary, by adding new entry(word) into the layer-1 mapping.
ASR system developers can easily adapt DaCiDian to their own phone set by defining their own layer-2 mapping.

Layer-1: Word -> Syllable Mapper (word_to_pinyin.txt)

word and its pronunciation(s) are seperated by tab
multiple pronunciations are seperated by ;
each pronunciation is a sequence of Chinese PinYin syllables, seperated by space
tone infomation are encoded into 0,1,2,3,4

examples:

... 裤子 KU_4 ZI_0 好事 HAO_4 SHI_4;HAO_3 SHI_4 教授 JIAO_1 SHOU_4;JIAO_4 SHOU_4 ... 语音识别 YU_3 YIN_1 SHI_2 BIE_2 傅里叶变换 FU_4 LI_3 YE_4 BIAN_4 HUAN_4

Layer-2: Syllabel->Phone Mapper (pinyin_to_phone.txt)

pinyin_to_phone is a user-defined mapping from PinYin syllables to target phone set

Take traditional PinYin's Initial-Final structure for example, a mapping should be defined as follows:

A	$0 a AI	$0 ai AN	$0 an ANG	$0 ang AO	$0 ao BA	b a BAI	b ai BAN	b an BANG	b ang BAO	b ao ... ... ... ZONG	z ong ZOU	z ou ZU	z u ZUAN	z uan ZUI	z ui ZUN	z un ZUO	z uo

syllable and its phoneme representation are seperated by tab
$0 is the NULL-Initial phone.

Notes

English word in Chinese context is a problem. An upgraded version covering both English and Chinese can be found here BigCiDian

Name		Name	Last commit message	Last commit date
Latest commit History 9 Commits
DaCiDian.py		DaCiDian.py
README.md		README.md
pinyin_to_phone.txt		pinyin_to_phone.txt
word_to_pinyin.txt		word_to_pinyin.txt

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

DaCiDian (大词典)

Design

Layer-1: Word -> Syllable Mapper (word_to_pinyin.txt)

Layer-2: Syllabel->Phone Mapper (pinyin_to_phone.txt)

Notes

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

DaCiDian (大词典)

Design

Layer-1: Word -> Syllable Mapper (word_to_pinyin.txt)

Layer-2: Syllabel->Phone Mapper (pinyin_to_phone.txt)

Notes

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages