tiktoken is a BPE tokeniser for use with OpenAI's models, forked from the original tiktoken library to provide JS/WASM bindings for NodeJS and other JS runtimes.
This repository contains the following packages:
- `tiktoken` (formerly hosted at `@dqbd/tiktoken`): WASM bindings for the original Python library, providing full 1-to-1 feature parity.
- `js-tiktoken`: a pure JavaScript port of the original library with the core functionality, suitable for environments where WASM is not well supported or not desired (such as edge runtimes).
Documentation for `js-tiktoken` can be found here. Documentation for `tiktoken` can be found below.
The WASM version of tiktoken can be installed from NPM:
```
npm install tiktoken
```

Basic usage follows, which includes all the OpenAI encoders and ranks:
```typescript
import assert from "node:assert";
import { get_encoding, encoding_for_model } from "tiktoken";

const enc = get_encoding("gpt2");
assert(
  new TextDecoder().decode(enc.decode(enc.encode("hello world"))) ===
    "hello world"
);
enc.free();

// To get the tokeniser corresponding to a specific model in the OpenAI API:
const modelEnc = encoding_for_model("text-davinci-003");
modelEnc.free();

// Extend an existing encoding with custom special tokens
const customEnc = encoding_for_model("gpt2", {
  "<|im_start|>": 100264,
  "<|im_end|>": 100265,
});

// don't forget to free the encoder once it is no longer used
customEnc.free();
```
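Since the encoder's memory lives on the WASM heap rather than the JS heap, an exception thrown between creating and freeing it will leak that memory. A minimal sketch of a defensive pattern (`countTokens` is a hypothetical helper, not part of the library):

```typescript
import { get_encoding } from "tiktoken";

// hypothetical helper: encode, count, and always release the encoder
function countTokens(text: string): number {
  const enc = get_encoding("cl100k_base");
  try {
    return enc.encode(text).length;
  } finally {
    enc.free(); // runs even if encode() throws
  }
}

console.log(countTokens("hello world"));
```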
In constrained environments (e.g. Edge Runtime, Cloudflare Workers), where you don't want to load all the encoders at once, you can use the lightweight WASM binary via `tiktoken/lite`:

```javascript
const { Tiktoken } = require("tiktoken/lite");
const cl100k_base = require("tiktoken/encoders/cl100k_base.json");

const encoding = new Tiktoken(
  cl100k_base.bpe_ranks,
  cl100k_base.special_tokens,
  cl100k_base.pat_str
);
const tokens = encoding.encode("hello world");
encoding.free();
```

If you want to fetch the latest ranks, use the `load` function:
```javascript
const { Tiktoken } = require("tiktoken/lite");
const { load } = require("tiktoken/load");
const registry = require("tiktoken/registry.json");
const models = require("tiktoken/model_to_encoding.json");

async function main() {
  const model = await load(registry[models["gpt-3.5-turbo"]]);
  const encoder = new Tiktoken(
    model.bpe_ranks,
    model.special_tokens,
    model.pat_str
  );
  const tokens = encoder.encode("hello world");
  encoder.free();
}

main();
```
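Note that `load` fetches the ranks over the network, so calling it for every request downloads them again. One way around that, sketched below with hypothetical `getModel`/`countTokens` helpers, is to cache the resolved model at module scope:

```javascript
const { Tiktoken } = require("tiktoken/lite");
const { load } = require("tiktoken/load");
const registry = require("tiktoken/registry.json");
const models = require("tiktoken/model_to_encoding.json");

// cache the in-flight/resolved download so the ranks are fetched only once
let modelPromise = null;
function getModel() {
  if (!modelPromise) modelPromise = load(registry[models["gpt-3.5-turbo"]]);
  return modelPromise;
}

async function countTokens(text) {
  const model = await getModel();
  const encoder = new Tiktoken(
    model.bpe_ranks,
    model.special_tokens,
    model.pat_str
  );
  try {
    return encoder.encode(text).length;
  } finally {
    encoder.free();
  }
}
```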
If desired, you can create a `Tiktoken` instance directly with custom ranks, special tokens and a regex pattern:

```typescript
import { Tiktoken } from "tiktoken/lite";
import { readFileSync } from "fs";

const encoder = new Tiktoken(
  readFileSync("./ranks/gpt2.tiktoken").toString("utf-8"),
  { "<|endoftext|>": 50256, "<|im_start|>": 100264, "<|im_end|>": 100265 },
  "'s|'t|'re|'ve|'m|'ll|'d| ?\\p{L}+| ?\\p{N}+| ?[^\\s\\p{L}\\p{N}]+|\\s+(?!\\S)|\\s+"
);
```
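Special tokens in the input are not encoded unless you opt in. Assuming the bindings carry over the Python library's `allowed_special` argument to `encode` (an assumption worth verifying against the typings), a continuation of the snippet above would look like:

```typescript
// pass "all" (or an explicit list of tokens) as the second argument
// to allow special tokens instead of raising an error
const tokens = encoder.encode("<|im_start|>hello<|im_end|>", "all");
encoder.free();
```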
Finally, you can use a custom `init` function to override the WASM initialization logic for non-Node environments. This is useful if you are using a bundler that does not support WASM ESM integration:

```typescript
import { get_encoding, init } from "tiktoken/init";

async function main() {
  const wasm = "..."; // fetch the WASM binary somehow
  await init((imports) => WebAssembly.instantiate(wasm, imports));

  const encoding = get_encoding("cl100k_base");
  const tokens = encoding.encode("hello world");
  encoding.free();
}

main();
```

As this is a WASM library, there might be some issues with specific runtimes. If you encounter any, please open an issue.
| Runtime | Status | Notes |
|---|---|---|
| Node.js | ✅ | |
| Bun | ✅ | |
| Vite | ✅ | See here for notes |
| Next.js | ✅ | See here for notes |
| Create React App (via Craco) | ✅ | See here for notes |
| Vercel Edge Runtime | ✅ | See here for notes |
| Cloudflare Workers | ✅ | See here for notes |
| Electron | ✅ | See here for notes |
| Deno | ❌ | Currently unsupported (see dqbd/tiktoken#22) |
| Svelte + Cloudflare Workers | ❌ | Currently unsupported (see dqbd/tiktoken#37) |
For unsupported runtimes, consider using js-tiktoken, which is a pure JS implementation of the tokeniser.
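For comparison, a minimal sketch of the same round trip in `js-tiktoken` (assuming its `getEncoding` export; see its documentation linked above):

```typescript
import { getEncoding } from "js-tiktoken";

const enc = getEncoding("cl100k_base");
const tokens = enc.encode("hello world");
console.log(tokens);
// pure JS: no WASM to initialize and no free() call required
```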
If you are using Vite, you will need to add both the `vite-plugin-wasm` and `vite-plugin-top-level-await` plugins. Add the following to your `vite.config.js`:
```javascript
import wasm from "vite-plugin-wasm";
import topLevelAwait from "vite-plugin-top-level-await";
import { defineConfig } from "vite";

export default defineConfig({
  plugins: [wasm(), topLevelAwait()],
});
```

In Next.js, both API routes and /pages are supported with the following `next.config.js` configuration:
```javascript
// next.config.js
const config = {
  webpack(config, { isServer, dev }) {
    config.experiments = {
      asyncWebAssembly: true,
      layers: true,
    };

    return config;
  },
};

module.exports = config;
```

Usage in pages:
```typescript
import { get_encoding } from "tiktoken";
import { useState } from "react";

const encoding = get_encoding("cl100k_base");

export default function Home() {
  const [input, setInput] = useState("hello world");
  const tokens = encoding.encode(input);

  return (
    <div>
      <input
        type="text"
        value={input}
        onChange={(e) => setInput(e.target.value)}
      />
      <div>{tokens.toString()}</div>
    </div>
  );
}
```

Usage in API routes:
```typescript
import { get_encoding } from "tiktoken";
import { NextApiRequest, NextApiResponse } from "next";

export default function handler(req: NextApiRequest, res: NextApiResponse) {
  const encoding = get_encoding("cl100k_base");
  const tokens = encoding.encode("hello world");
  encoding.free();
  return res.status(200).json({ tokens });
}
```

By default, the Webpack configuration found in Create React App does not support WASM ESM modules. To add support, please do the following:
- Swap `react-scripts` with `craco`, using the guide found here: https://craco.js.org/docs/getting-started/.
- Add the following to `craco.config.js`:
```javascript
module.exports = {
  webpack: {
    configure: (config) => {
      config.experiments = {
        asyncWebAssembly: true,
        layers: true,
      };

      // turn off static file serving of WASM files
      // we need to let Webpack handle WASM import
      config.module.rules
        .find((i) => "oneOf" in i)
        .oneOf.find((i) => i.type === "asset/resource")
        .exclude.push(/\.wasm$/);

      return config;
    },
  },
};
```

Vercel Edge Runtime does support WASM modules by adding a `?module` suffix. Initialize the encoder with the following snippet:
```typescript
// @ts-expect-error
import wasm from "tiktoken/lite/tiktoken_bg.wasm?module";
import model from "tiktoken/encoders/cl100k_base.json";
import { init, Tiktoken } from "tiktoken/lite/init";

export const config = { runtime: "edge" };

export default async function (req: Request) {
  await init((imports) => WebAssembly.instantiate(wasm, imports));

  const encoding = new Tiktoken(
    model.bpe_ranks,
    model.special_tokens,
    model.pat_str
  );
  const tokens = encoding.encode("hello world");
  encoding.free();

  return new Response(`${tokens}`);
}
```

Similar to Vercel Edge Runtime, Cloudflare Workers must import the WASM binary file manually and use the `tiktoken/lite` version to fit the 1 MB limit. However, users need to point directly at the WASM binary via a relative path (including `./node_modules/`).
Add the following rule to the wrangler.toml to upload WASM during build:
```toml
[[rules]]
globs = ["**/*.wasm"]
type = "CompiledWasm"
```

Initialize the encoder with the following snippet:
```typescript
import { init, Tiktoken } from "tiktoken/lite/init";
import wasm from "./node_modules/tiktoken/lite/tiktoken_bg.wasm";
import model from "tiktoken/encoders/cl100k_base.json";

export default {
  async fetch() {
    await init((imports) => WebAssembly.instantiate(wasm, imports));

    const encoder = new Tiktoken(
      model.bpe_ranks,
      model.special_tokens,
      model.pat_str
    );
    const tokens = encoder.encode("test");
    encoder.free();

    return new Response(`${tokens}`);
  },
};
```

To use tiktoken in your Electron main process, you need to make sure the WASM binary gets copied into your application package.
Assuming a setup with Electron Forge and @electron-forge/plugin-webpack, add the following to your webpack.main.config.js:
```javascript
const CopyPlugin = require("copy-webpack-plugin");

module.exports = {
  // ...
  plugins: [
    new CopyPlugin({
      patterns: [{ from: "./node_modules/tiktoken/tiktoken_bg.wasm" }],
    }),
  ],
};
```

To build the tiktoken library, make sure to have:
- Rust and `wasm-pack` installed.
- Node.js 18+, required to build the JS bindings and fetch the latest encoder ranks via `fetch`.
Install all the dev dependencies with `yarn install` and build both the WASM binary and the JS bindings with `yarn build`.
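That is, from the repository root:

```
yarn install
yarn build
```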