unicode-data: Access Unicode Character Database (UCD)

[ apache, data, library, text, unicode ]

unicode-data provides Haskell APIs to efficiently access the Unicode character database (UCD). Performance is the primary goal in the design of this package.

The Haskell data structures are generated programmatically from the UCD files. The latest Unicode version supported by this library is 17.0.0.


Flags

Manual Flags

Name: dev-has-icu
Description: Use ICU for test and benchmark. Intended for development on the repository.
Default: Disabled

Use -f <flag> to enable a flag, or -f -<flag> to disable that flag.

Versions: 0.1.0, 0.1.0.1, 0.2.0, 0.3.0, 0.3.1, 0.4.0, 0.4.0.1, 0.5.0, 0.6.0, 0.7.0, 0.8.0
Change log: Changelog.md
Dependencies: base (>=4.7 && <4.23), ghc-prim
Tested with: ghc ==8.2.2, ghc ==8.4.4, ghc ==8.6.5, ghc ==8.8.4, ghc ==8.10.7, ghc ==9.0.2, ghc ==9.2.8, ghc ==9.4.8, ghc ==9.6.5, ghc ==9.8.4, ghc ==9.10.2, ghc ==9.12.2
License: Apache-2.0
Copyright: 2020 Composewell Technologies and Contributors
Author: Composewell Technologies and Contributors
Maintainer: streamly@composewell.com
Uploaded: by wismill at 2025-09-16T04:55:33Z
Category: Data, Text, Unicode
Home page: http://github.com/composewell/unicode-data
Bug tracker: https://github.com/composewell/unicode-data/issues
Source repo: head: git clone https://github.com/composewell/unicode-data
Reverse Dependencies: 7 direct, 241 indirect
Downloads: 14902 total (31 in the last 30 days)
Rating: 2.25 (votes: 2) [estimated by Bayesian average]
Status: Docs available
Last success reported on 2025-09-16

Readme for unicode-data-0.8.0

README

unicode-data provides Haskell APIs to efficiently access the Unicode character database. Performance is the primary goal in the design of this package.

The Haskell data structures are generated programmatically from the Unicode character database (UCD) files. The latest Unicode version supported by this library is 17.0.0.

Please see the Haddock documentation for reference documentation.
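
As a quick orientation, here is a minimal usage sketch. The module and function names are the ones that appear in the benchmark listing in the Performance section; the helper describe is hypothetical and exists only for illustration. Check the Haddock documentation for the exact exports of the version you use.

```haskell
import qualified Unicode.Char.Case.Compat as Case
import qualified Unicode.Char.General as General
import qualified Unicode.Char.General.Compat as GeneralCompat

-- | Hypothetical helper: summarise a few properties of one character,
-- using unicode-data instead of Data.Char.
describe :: Char -> String
describe c = unwords
    [ show c
    , show (General.generalCategory c)
    , "alpha=" ++ show (GeneralCompat.isAlpha c)
    , "lower=" ++ show (Case.isLower c)
    , "upper-cased=" ++ show (Case.toUpper c)
    ]
```

In GHCi, mapM_ (putStrLn . describe) "aÉ1 " prints one summary line per character. The Compat modules are intended to mirror the corresponding Data.Char functions, so they can usually be swapped in with only an import change.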

Performance

unicode-data is up to 5 times faster than base ≤ 4.17 (see "Partial integration of unicode-data into base" below).

The following benchmark compares the time taken in milliseconds to process all the Unicode code points (except surrogates, private use areas and unassigned), for base-4.16 (GHC 9.2.6) and this package (v0.4). Machine: 8 × AMD Ryzen 5 2500U on Linux.

All
  Unicode.Char.Case.Compat
    isLower
      base:         OK (1.19s)  17.1 ms ± 241 μs
      unicode-data: OK (0.52s)  3.58 ms ± 125 μs, 0.21x
    isUpper
      base:         OK (0.63s)  17.5 ms ± 359 μs
      unicode-data: OK (1.02s)  3.58 ms ±  48 μs, 0.21x
    toLower
      base:         OK (0.59s)  16.3 ms ± 524 μs
      unicode-data: OK (0.80s)  5.63 ms ± 129 μs, 0.35x
    toTitle
      base:         OK (3.91s)  14.9 ms ± 427 μs
      unicode-data: OK (2.84s)  5.31 ms ±  37 μs, 0.36x
    toUpper
      base:         OK (2.12s)  15.4 ms ± 234 μs
      unicode-data: OK (0.86s)  5.80 ms ± 159 μs, 0.38x
  Unicode.Char.General
    generalCategory
      base:         OK (1.16s)  16.6 ms ± 534 μs
      unicode-data: OK (0.62s)  4.14 ms ± 103 μs, 0.25x
    isAlphaNum
      base:         OK (0.62s)  17.1 ms ± 655 μs
      unicode-data: OK (0.97s)  3.59 ms ±  51 μs, 0.21x
    isControl
      base:         OK (0.63s)  17.6 ms ± 494 μs
      unicode-data: OK (0.57s)  3.59 ms ±  90 μs, 0.20x
    isMark
      base:         OK (0.34s)  17.6 ms ± 695 μs
      unicode-data: OK (1.00s)  3.59 ms ±  67 μs, 0.20x
    isPrint
      base:         OK (1.22s)  17.7 ms ± 492 μs
      unicode-data: OK (1.92s)  3.56 ms ±  27 μs, 0.20x
    isPunctuation
      base:         OK (2.23s)  16.6 ms ± 619 μs
      unicode-data: OK (1.05s)  3.60 ms ±  52 μs, 0.22x
    isSeparator
      base:         OK (1.15s)  16.6 ms ± 439 μs
      unicode-data: OK (0.49s)  3.60 ms ±  85 μs, 0.22x
    isSymbol
      base:         OK (2.11s)  16.1 ms ± 553 μs
      unicode-data: OK (1.05s)  3.58 ms ±  62 μs, 0.22x
  Unicode.Char.General.Compat
    isAlpha
      base:         OK (0.58s)  17.2 ms ± 502 μs
      unicode-data: OK (1.02s)  3.58 ms ±  50 μs, 0.21x
    isLetter
      base:         OK (8.57s)  16.4 ms ± 553 μs
      unicode-data: OK (1.05s)  3.58 ms ±  79 μs, 0.22x
    isSpace
      base:         OK (1.09s)  7.56 ms ± 159 μs
      unicode-data: OK (0.97s)  3.58 ms ±  46 μs, 0.47x
  Unicode.Char.Numeric.Compat
    isNumber
      base:         OK (0.58s)  15.7 ms ± 462 μs
      unicode-data: OK (0.58s)  3.58 ms ± 107 μs, 0.23x
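
To make the benchmarked input concrete, the following sketch selects the code points described above: every code point except surrogates, private use areas and unassigned ones. It is not the project's benchmark harness, which is not shown here. It uses only Data.Char from base so that it runs without extra dependencies, and the names isBenchmarked, benchmarkedChars and countMatching are hypothetical.

```haskell
import Data.Char (GeneralCategory (..), generalCategory)
import Data.List (foldl')

-- | True for code points that the benchmark description includes:
-- everything except surrogates, private use and unassigned code points.
isBenchmarked :: Char -> Bool
isBenchmarked c = case generalCategory c of
    Surrogate   -> False
    PrivateUse  -> False
    NotAssigned -> False
    _           -> True

-- | All selected code points, in ascending order.
benchmarkedChars :: String
benchmarkedChars = filter isBenchmarked [minBound .. maxBound]

-- | Apply a predicate to every selected code point and count the hits.
-- The strict fold forces the predicate to be evaluated on the whole range.
countMatching :: (Char -> Bool) -> Int
countMatching p = foldl' (\n c -> if p c then n + 1 else n) 0 benchmarkedChars
```

Timing countMatching with a predicate from base and again with its unicode-data counterpart gives a rough comparison in the spirit of the table above. Note that the set of unassigned code points depends on the Unicode version known to the generalCategory implementation used for the filter.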

Partial integration of unicode-data into base

Since base 4.18, unicode-data has been partially integrated into base, the standard library shipped with GHC, so there should be no relevant performance difference. However, using unicode-data lets you select the exact Unicode version to support, instead of relying on the version supported by your GHC.
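
One way to see whether the Unicode version matters for a given GHC is to compare the two implementations directly. The sketch below lists code points whose general category differs between base and unicode-data. It assumes that both GeneralCategory types use the same constructor names, so it compares them through show; the name categoryMismatches is hypothetical.

```haskell
import qualified Data.Char as Base
import qualified Unicode.Char.General as UD

-- | Code points whose general category differs between base and
-- unicode-data. Categories are compared by constructor name, which
-- assumes both types name their constructors identically.
categoryMismatches :: [(Char, Base.GeneralCategory, UD.GeneralCategory)]
categoryMismatches =
    [ (c, b, u)
    | c <- [minBound .. maxBound]
    , let b = Base.generalCategory c
    , let u = UD.generalCategory c
    , show b /= show u
    ]
```

An empty list suggests that both libraries agree on this property for the installed versions. A non-empty list typically points to code points that were assigned or reclassified in the newer of the two Unicode versions.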

Unicode database version update

To update the Unicode version, change the version number in ucd.sh.

To download the Unicode database, run ucd.sh download from the top-level directory of the repo. This fetches the database into ./ucd.

$ ./ucd.sh download 

To generate the Haskell data structure files from the downloaded database files, run ucd.sh generate from the top level directory of the repo.

$ ./ucd.sh generate 

Running property doctests

Temporarily add QuickCheck to the build-depends of the library, then run:

$ cabal build
$ cabal-docspec --check-properties --property-variables c

Licensing

unicode-data is an open source project available under a liberal Apache-2.0 license.

Contributing

As an open project we welcome contributions.