uunf 16.0.0 (latest) · OCaml Package

Unicode text normalization.

Uunf normalizes Unicode text. It supports all Unicode normalization forms. The module is independent from any IO mechanism or Unicode text data structure and it can process text without a complete in-memory representation of the data.

The supported Unicode version is determined by the unicode_version value.

Consult the basics, limitations and examples of use.

References

The Unicode Consortium. The Unicode Standard. (latest version)
Mark Davis. UAX #15 Unicode Normalization Forms. (latest version)
The Unicode Consortium. Normalization charts.

Normalize

type form = [ `NFD | `NFC | `NFKD | `NFKC ]

The type for normalization forms.

`NFD normalization form D, canonical decomposition.
`NFC normalization form C, canonical decomposition followed by canonical composition (recommended for the www).
`NFKD normalization form KD, compatibility decomposition.
`NFKC normalization form KC, compatibility decomposition, followed by canonical composition.
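
As a rough illustration of how the four forms differ, the sketch below runs two short UTF-8 strings through Uunf_string.normalize_utf_8, the string helper mentioned at the end of this page (assumed here to be linked in from the uunf package). The inputs and the names e_acute and fi_ligature are chosen only for illustration.

```ocaml
(* U+00E9 LATIN SMALL LETTER E WITH ACUTE, in precomposed form. *)
let e_acute = "\u{00E9}"

(* U+FB01 LATIN SMALL LIGATURE FI, a compatibility character. *)
let fi_ligature = "\u{FB01}"

(* Canonical decomposition: "e" followed by U+0301 COMBINING ACUTE ACCENT. *)
let nfd_e = Uunf_string.normalize_utf_8 `NFD e_acute

(* Canonical composition brings the decomposed pair back to U+00E9. *)
let nfc_e = Uunf_string.normalize_utf_8 `NFC nfd_e

(* The canonical forms leave the ligature alone... *)
let nfc_fi = Uunf_string.normalize_utf_8 `NFC fi_ligature

(* ...while the compatibility forms replace it by the letters "f" and "i". *)
let nfkc_fi = Uunf_string.normalize_utf_8 `NFKC fi_ligature
```

The canonical forms (`NFD, `NFC) preserve the distinction between the ligature and its letters; the compatibility forms (`NFKD, `NFKC) erase it, which is often useful for search or matching but loses information.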

type t

The type for Unicode text normalizers.

type ret = [ `Uchar of Stdlib.Uchar.t | `End | `Await ]

The type for normalizer results. See add.

val create : [< form ] -> t

create nf is a Unicode text normalizer for the normal form nf.

val form : t -> form

form n is the normalization form of n.

val add : t -> [ `Uchar of Stdlib.Uchar.t | `Await | `End ] -> ret

add n v is:

`Uchar u if u is the next character in the normalized sequence. The client must then call add with `Await until `Await is returned.
`Await when the normalizer is ready to add a new `Uchar or `End.

For v use `Uchar u to add a new character to the sequence to normalize and `End to signal the end of sequence. After adding one of these two values, always call add with `Await until `Await is returned.

Raises. Invalid_argument if `Uchar or `End is added directly after a `Uchar was returned by the normalizer, or if a `Uchar is added after `End was added.

val reset : t -> unit

reset n resets the normalizer to a state equivalent to the state of Uunf.create (Uunf.form n).

val copy : t -> t

copy n is a copy of n in its current state. Subsequent adds on n do not affect the copy.
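
The following sketch shows one way reset and copy might be used together. The helper drain and the names n, snapshot, a, b and c are hypothetical and not part of Uunf; drain follows the calling protocol described for add above.

```ocaml
(* Hypothetical helper: add [v] to [n], then keep awaiting until the
   normalizer stops returning characters. Returns them in order. *)
let drain n v =
  let rec go acc v = match Uunf.add n v with
  | `Uchar u -> go (u :: acc) `Await
  | `Await | `End -> List.rev acc
  in
  go [] v

let n = Uunf.create `NFC

(* Add "e". Nothing is returned yet: it may still compose with what follows. *)
let pending = drain n (`Uchar (Uchar.of_int 0x0065))

(* Snapshot the normalizer while "e" is still buffered. *)
let snapshot = Uunf.copy n

(* Continue the original with U+0301 COMBINING ACUTE ACCENT. *)
let a = drain n (`Uchar (Uchar.of_int 0x0301)) @ drain n `End

(* Continue the copy with U+0308 COMBINING DIAERESIS instead. *)
let b = drain snapshot (`Uchar (Uchar.of_int 0x0308)) @ drain snapshot `End

(* [n] has seen `End and cannot take new characters until it is reset. *)
let () = Uunf.reset n
let c = drain n (`Uchar (Uchar.of_int 0x0041)) @ drain n `End
```

Copying can be handy when two continuations of the same prefix need to be normalized, and reset avoids allocating a fresh normalizer for every input.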

val pp_ret : Stdlib.Format.formatter -> ret -> unit

pp_ret ppf v prints an unspecified representation of v on ppf.

Normalization properties

These properties are used internally to implement the normalizers. They are not needed to use the module but are exposed as they may be useful to implement other algorithms.

val unicode_version : string

unicode_version is the Unicode version supported by the module.

val ccc : Stdlib.Uchar.t -> int

ccc u is u's canonical combining class value.
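
A small sketch of ccc on three well-known characters. The helper ccc_of is hypothetical and only saves typing.

```ocaml
(* Hypothetical shorthand: combining class of a code point given as an int. *)
let ccc_of cp = Uunf.ccc (Uchar.of_int cp)

(* U+0061 LATIN SMALL LETTER A is a starter: class 0. *)
let ccc_a = ccc_of 0x0061

(* U+0301 COMBINING ACUTE ACCENT attaches above: class 230. *)
let ccc_acute = ccc_of 0x0301

(* U+0323 COMBINING DOT BELOW attaches below: class 220. *)
let ccc_dot_below = ccc_of 0x0323
```

During canonical ordering, adjacent marks with non-zero classes are sorted by class, so a dot below (220) ends up before an acute accent (230) regardless of the input order.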

val decomp : Stdlib.Uchar.t -> int array

decomp u is u's decomposition mapping. If the empty array is returned, u decomposes to itself.

The first number in the array carries additional information and cannot be used directly as a Uchar.t. Use d_uchar on the number to get the actual character and d_compatibility to find out whether this is a compatibility decomposition. All other elements of the array are guaranteed to be convertible using Uchar.of_int.

Warning. Do not mutate the array.

val d_uchar : int -> Stdlib.Uchar.t

See decomp.

val d_compatibility : int -> bool

See decomp.
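
To make the encoding of the first array element concrete, here is a minimal sketch of a wrapper around decomp. The function name decomposition and its return type are assumptions made for this example, not part of Uunf.

```ocaml
(* Hypothetical wrapper: [None] if [u] decomposes to itself, otherwise
   [Some (is_compat, chars)] where [is_compat] tells whether the mapping
   is a compatibility decomposition. The array returned by Uunf.decomp
   is only read (converted to a list), never mutated. *)
let decomposition u =
  match Array.to_list (Uunf.decomp u) with
  | [] -> None
  | first :: rest ->
      (* The first element needs d_uchar; the others are plain scalar values. *)
      Some (Uunf.d_compatibility first,
            Uunf.d_uchar first :: List.map Uchar.of_int rest)

let of_e_acute = decomposition (Uchar.of_int 0x00E9)  (* canonical *)
let of_fi = decomposition (Uchar.of_int 0xFB01)       (* compatibility *)
let of_a = decomposition (Uchar.of_int 0x0061)        (* no mapping *)
```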

val composite : Stdlib.Uchar.t -> Stdlib.Uchar.t -> Stdlib.Uchar.t option

composite u1 u2 is the primary composite canonically equivalent to the sequence <u1,u2>, if any.
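
For instance, a short sketch (the names u, composed and not_composable are illustrative only):

```ocaml
let u = Uchar.of_int

(* "e" followed by U+0301 COMBINING ACUTE ACCENT composes to U+00E9. *)
let composed = Uunf.composite (u 0x0065) (u 0x0301)

(* Two plain letters have no primary composite. *)
let not_composable = Uunf.composite (u 0x0061) (u 0x0062)
```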

Limitations

A Uunf normalizer consumes only a small, bounded amount of memory on ordinary, meaningful text. However, on legal but degenerate text, such as a starter followed by 10,000 combining non-spacing marks, it has to buffer all the marks (a workaround is to first convert your input to the stream-safe text format).
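
One way to spot such degenerate input before normalizing it is to measure the longest run of characters with a non-zero canonical combining class. The sketch below is only a rough pre-check built on ccc, not the Stream-Safe Text Process of UAX #15 (which limits runs of non-starters to 30 and is defined in terms of compatibility decompositions). The function name longest_non_starter_run is hypothetical, and the input is assumed to be UTF-8.

```ocaml
(* Length of the longest run of consecutive characters of the UTF-8
   string [s] whose canonical combining class is non-zero. Invalid
   bytes decode as U+FFFD, which is a starter. *)
let longest_non_starter_run s =
  let len = String.length s in
  let rec loop i run best =
    if i >= len then max run best else
    let dec = String.get_utf_8_uchar s i in
    let u = Uchar.utf_decode_uchar dec in
    let i = i + Uchar.utf_decode_length dec in
    if Uunf.ccc u = 0
    then loop i 0 (max run best)   (* a starter ends the current run *)
    else loop i (run + 1) best     (* one more combining mark *)
  in
  loop 0 0 0

let plain = longest_non_starter_run "abc"
let two_marks = longest_non_starter_run "e\u{0301}\u{0323}x"
```

A caller could reject or pre-process input whose result exceeds a threshold of its choosing before handing it to a normalizer.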

Basics

A normalizer is a stateful filter that inputs a sequence of characters and outputs an equivalent sequence in the requested normal form.

The function create returns a new normalizer for a given normal form:

let nfd = Uunf.create `NFD

To add characters to the sequence to normalize, call add on nfd with `Uchar _. To end the sequence, call add on nfd with `End. The normalized sequence of characters is returned, character by character, by the successive calls to add.

The client and the normalizer must wait on each other to limit internal buffering: each time the client adds to the sequence by calling add with `Uchar or `End it must continue to call add with `Await until the normalizer returns `Await. In practice this leads to the following kind of control flow:

let rec add acc v = match Uunf.add nfd v with
| `Uchar u -> add (u :: acc) `Await
| `Await | `End -> acc

For example, to normalize the character U+00E9 (é) with nfd to a list of characters we can write:

let e_acute = Uchar.of_int 0x00E9
let e_acute_nfd = List.rev (add (add [] (`Uchar e_acute)) `End)
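
The same control flow can be packaged so that it does not depend on a global normalizer. The sketch below is one possible arrangement; the function name normalize_list is an assumption for this example and not part of Uunf.

```ocaml
(* Hypothetical helper: normalize the list of characters [us] to the
   normal form [nf] with a normalizer private to the call. *)
let normalize_list nf us =
  let n = Uunf.create nf in
  (* Add [v], then await until the normalizer has nothing more to return. *)
  let rec add acc v = match Uunf.add n v with
  | `Uchar u -> add (u :: acc) `Await
  | `Await | `End -> acc
  in
  let acc = List.fold_left (fun acc u -> add acc (`Uchar u)) [] us in
  List.rev (add acc `End)

let u = Uchar.of_int

(* U+00E9 decomposes to "e" followed by U+0301. *)
let decomposed = normalize_list `NFD [u 0x00E9]

(* Marks are reordered by combining class: dot below (220) before acute (230). *)
let reordered = normalize_list `NFD [u 0x0061; u 0x0301; u 0x0323]
```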

The next section has more examples.

Examples

UTF-8 normalization

utf_8_normalize nf s is the UTF-8 encoded normal form nf of the UTF-8 encoded string s.

let utf_8_normalize nf s =
  (* Add v to the normalizer and append every character it returns to buf. *)
  let rec add buf normalizer v = match Uunf.add normalizer v with
  | `Uchar u -> Buffer.add_utf_8_uchar buf u; add buf normalizer `Await
  | `Await | `End -> ()
  in
  (* Decode s from byte index i; invalid bytes decode as U+FFFD. *)
  let rec loop buf s i max normalizer =
    if i > max then (add buf normalizer `End; Buffer.contents buf) else
    let dec = String.get_utf_8_uchar s i in
    add buf normalizer (`Uchar (Uchar.utf_decode_uchar dec));
    loop buf s (i + Uchar.utf_decode_length dec) max normalizer
  in
  (* The initial size is only a hint; the buffer grows as needed. *)
  let buf = Buffer.create (String.length s * 3) in
  let normalizer = Uunf.create nf in
  loop buf s 0 (String.length s - 1) normalizer

Note that this functionality is available directly through Uunf_string.normalize_utf_8.
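
A common use of string normalization is comparing text up to canonical equivalence. The sketch below does this with Uunf_string.normalize_utf_8; the function name canonically_equal is hypothetical, and both arguments are assumed to be valid UTF-8.

```ocaml
(* Hypothetical helper: two UTF-8 strings are canonically equivalent
   if their NFD forms are byte-for-byte equal. *)
let canonically_equal a b =
  String.equal
    (Uunf_string.normalize_utf_8 `NFD a)
    (Uunf_string.normalize_utf_8 `NFD b)

(* Precomposed U+00E9 versus "e" followed by U+0301. *)
let same = canonically_equal "\u{00E9}" "e\u{0301}"

(* The ligature U+FB01 is only compatibility equivalent to "fi". *)
let different = canonically_equal "\u{FB01}" "fi"
```

Using `NFKD instead of `NFD in the helper would compare up to compatibility equivalence, which treats the ligature and "fi" as equal.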
