andjc (Andj) · GitHub


enabling-languages/python-i18n Public

Random notes on Python internationalisation

Jupyter Notebook 19
enabling-languages/library-i18n Public

Exploration of internationalisation issues for libraries.

Jupyter Notebook 1 1

# Grapheme tokenisation in Python

When working with tokenisation and break iterators, it is sometimes necessary to work at the character, syllable, line, or sentence level. Character-level tokenisation is an interesting case. By character, I mean a user-perceived unit of text, which the Unicode Standard would refer to as a grapheme. The usual way I see developers handle character-level tokenisation of English is via a list comprehension or by converting a string to a list.
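As a minimal sketch of those two approaches (the sample strings and variable names here are illustrative assumptions, not taken from the original notes), both split a string into code points, which happens to match user-perceived characters for plain English text but not for a base letter followed by a combining mark:

```python
# Two common ways to split a string into "characters".
# Both iterate over Unicode code points, not graphemes.
text = "hello"

chars_comprehension = [ch for ch in text]  # list comprehension
chars_cast = list(text)                    # converting the string to a list

print(chars_comprehension)                # ['h', 'e', 'l', 'l', 'o']
print(chars_cast == chars_comprehension)  # True

# Assumed example: "é" written as "e" + U+0301 COMBINING ACUTE ACCENT.
# A reader perceives one character, but the list has two elements.
decomposed = "e\u0301"
print(len(list(decomposed)))  # 2
```

The last line shows why code point splitting is not the same as grapheme tokenisation once a language uses combining diacritics.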

enabling-languages/dinka Public

Dinka language resources

JavaScript 2
enabling-languages/nuer Public

Nuer language resources

Rich Text Format 1
enabling-languages/australian_indigenous Public

Keyboard layouts and web support for Aboriginal and Torres Strait Islander languages

4
