OpenNMT/Tokenizer

Tokenizer

Tokenizer is a fast, generic, and customizable text tokenization library for C++ and Python with minimal dependencies.

Overview

By default, the Tokenizer applies a simple tokenization based on Unicode types. It can be customized in several ways:

  • Reversible tokenization
    Mark where tokens were joined or separated by spaces, either by annotating tokens or by injecting modifier characters.
  • Subword tokenization
    Support for training and using BPE and SentencePiece models.
  • Advanced text segmentation
    Split digits, segment on case or alphabet change, segment each character of selected alphabets, etc.
  • Case management
    Lowercase text and return case information as a separate feature or inject case modifier tokens.
  • Protected sequences
    Sequences can be protected against tokenization with the special characters ⦅ and ⦆.

See the available options for an overview of supported features.
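To make the idea of protected sequences concrete, here is a minimal pure-Python sketch. It is not the library's implementation: the function names `split_protected` and `toy_tokenize_with_protection` are hypothetical, the word/punctuation rule is a stand-in for real tokenization, and keeping the ⦅ and ⦆ delimiters in the output token is an assumption made for this illustration.

```python
import re

# Matches one protected span, delimiters included.
PROTECTED = re.compile(r"⦅[^⦆]*⦆")


def split_protected(text):
    """Return (chunk, is_protected) pairs covering the whole input in order."""
    chunks = []
    pos = 0
    for match in PROTECTED.finditer(text):
        if match.start() > pos:
            chunks.append((text[pos:match.start()], False))
        chunks.append((match.group(), True))
        pos = match.end()
    if pos < len(text):
        chunks.append((text[pos:], False))
    return chunks


def toy_tokenize_with_protection(text):
    """Tokenize unprotected chunks; pass protected chunks through untouched."""
    tokens = []
    for chunk, is_protected in split_protected(text):
        if is_protected:
            # Assumption: the span is emitted as a single token, delimiters kept.
            tokens.append(chunk)
        else:
            # Toy rule: runs of word characters, or single punctuation marks.
            tokens.extend(re.findall(r"\w+|[^\w\s]", chunk))
    return tokens
```

With this sketch, a URL wrapped in ⦅ and ⦆ survives as one token while the surrounding text is still split on punctuation.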

Using

The Tokenizer can be used from Python, from C++, or from the command line. Each mode exposes the same set of options.

Python API

pip install pyonmttok

>>> import pyonmttok
>>> tokenizer = pyonmttok.Tokenizer("conservative", joiner_annotate=True)
>>> tokens = tokenizer("Hello World!")
>>> tokens
['Hello', 'World', '■!']
>>> tokenizer.detokenize(tokens)
'Hello World!'

See the Python API description for more details.
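The joiner mark on `'■!'` in the example above is what makes the tokenization reversible: it records that the token was attached to its left neighbor. The following is a rough pure-Python sketch of that idea, not the library's algorithm. The names `toy_tokenize` and `toy_detokenize` are hypothetical, the splitting rule is deliberately simplistic, and the sketch assumes single spaces between words and input that does not itself contain the joiner character.

```python
import re

JOINER = "■"  # marker character as it appears in this README's examples


def toy_tokenize(text):
    """Split on whitespace, then separate word characters from punctuation.

    A piece that was glued to the piece on its left gets a JOINER prefix.
    """
    tokens = []
    for word in text.split():
        pieces = re.findall(r"\w+|[^\w\s]", word)
        for i, piece in enumerate(pieces):
            tokens.append(piece if i == 0 else JOINER + piece)
    return tokens


def toy_detokenize(tokens):
    """Join tokens with spaces, then drop each space that precedes a JOINER."""
    return " ".join(tokens).replace(" " + JOINER, "")
```

Because every space removed during detokenization corresponds to a joiner added during tokenization, the round trip restores the input under the assumptions stated above.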

C++ API

#include <onmt/Tokenizer.h>

using namespace onmt;

int main() {
  Tokenizer tokenizer(Tokenizer::Mode::Conservative, Tokenizer::Flags::JoinerAnnotate);
  std::vector<std::string> tokens;
  tokenizer.tokenize("Hello World!", tokens);
}

See the Tokenizer class for more details.

Command line clients

$ echo "Hello World!" | cli/tokenize --mode conservative --joiner_annotate
Hello World ■!
$ echo "Hello World!" | cli/tokenize --mode conservative --joiner_annotate | cli/detokenize
Hello World!

Run either client with the -h flag to list the available options.

Development

Dependencies

Compiling

CMake and a compiler that supports the C++11 standard are required to compile the project.

git submodule update --init
mkdir build
cd build
cmake .. -DICU_ROOT=<path to root of ICU dependencies>
make

This produces the dynamic library libOpenNMTTokenizer and the tokenization clients in cli/.

  • To compile only the library, use the -DLIB_ONLY=ON flag.

Testing

The tests use Google Test, which is included as a Git submodule. Run the tests with:

mkdir build
cd build
cmake -DBUILD_TESTS=ON ..
make
test/onmt_tokenizer_test ../test/data