package rosetta
Sources
sha256=d8a2b6b235b7c15025d3d72a87d05bf691fcf7f3d90a892cce9c5529f760498f
sha512=9a323cd5b05e9ae7ba1f572936a42948fbc42090e1be6557840652d9deddee4cb979691047f1b6814afc07e81ec74eb9b1fcab098ba6d525ae88530c790b967a
Rosetta - universal decoder of an encoded flow to Unicode
Rosetta is a merge-point between uuuu, coin and yuscii. It is able to decode UTF-7, ISO-8859 and KOI8 and returns Unicode code points. The end user can then normalize them to UTF-8, with uutf for example.
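To make the two steps concrete (decode to code points, then re-encode as UTF-8), here is a minimal sketch that uses only the OCaml standard library, not rosetta or uutf. The helper name `latin1_to_utf8` is hypothetical, and the sketch covers only ISO-8859-1, where each byte value is its own code point.

```ocaml
(* Illustrative sketch with the standard library only (OCaml >= 4.06).
   [latin1_to_utf8] is a hypothetical helper, not part of rosetta. *)
let latin1_to_utf8 (input : string) : string =
  let buf = Buffer.create (String.length input * 2) in
  String.iter
    (fun byte ->
      (* Step 1: decode. In ISO-8859-1 the byte value is the code point. *)
      let code_point = Uchar.of_int (Char.code byte) in
      (* Step 2: re-encode the code point as UTF-8. *)
      Buffer.add_utf_8_uchar buf code_point)
    input;
  Buffer.contents buf

let () =
  (* "caf\xE9" is "café" in Latin-1; U+00E9 becomes 0xC3 0xA9 in UTF-8. *)
  assert (latin1_to_utf8 "caf\xE9" = "caf\xC3\xA9")
```

rosetta generalizes the first step to encodings where the byte-to-code-point mapping is not the identity.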
The final goal is to provide a universal decoder for any encoding. This project is part of mrmime, an email parser, where it is used to decode encoded-words (according to RFC 2047).
If you want to handle a new encoding (like, hmmhmm, APL-ISO-IR-68...), you can open a new issue. The process will then be to write a new small library and integrate it into rosetta.
How to use it?
rosetta follows the same design as the libraries it builds on. More precisely, it follows the same API as uutf for handling encodings. Here is a small example that transforms a latin1 flow into UTF-8:
let trans ic oc =
  let decoder = Rosetta.decoder (Rosetta.encoding_of_string "latin1") (`Channel ic) in
  let encoder = Uutf.encoder `UTF_8 (`Channel oc) in
  let rec go () = match Rosetta.decode decoder with
    | `Await -> assert false (* XXX(dinosaure): impossible when you use `String or `Channel as source. *)
    | `Uchar _ as uchar -> ignore @@ Uutf.encode encoder uchar ; go ()
    | `End -> ignore @@ Uutf.encode encoder `End (* flush the encoder *)
    | `Malformed err -> failwith err in
  go ()

let () = trans stdin stdout
About encoding_of_string
rosetta follows the aliases available in the IANA character sets database: https://www.iana.org/assignments/character-sets.xhtml
Any other alias will raise an exception. This function is case-insensitive.
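The lookup behaviour described above can be sketched as follows. This is a hypothetical illustration, not rosetta's implementation: the function name `encoding_of_alias`, the reduced `encoding` type, and the choice of `Invalid_argument` as the exception are all assumptions, and only a handful of IANA aliases are listed.

```ocaml
(* Hypothetical sketch of a case-insensitive alias lookup.
   The type and function below are illustrative, not rosetta's API. *)
type encoding = [ `ISO_8859_1 | `KOI8_R | `UTF_7 ]

(* A few aliases from the IANA registry, stored in lowercase. *)
let aliases : (string * encoding) list =
  [ ("iso-8859-1", `ISO_8859_1); ("iso_8859-1", `ISO_8859_1);
    ("latin1", `ISO_8859_1); ("l1", `ISO_8859_1);
    ("csisolatin1", `ISO_8859_1);
    ("koi8-r", `KOI8_R); ("cskoi8r", `KOI8_R);
    ("utf-7", `UTF_7); ("csutf7", `UTF_7) ]

let encoding_of_alias (name : string) : encoding =
  (* Lowercasing the input makes the lookup case-insensitive. *)
  match List.assoc_opt (String.lowercase_ascii name) aliases with
  | Some encoding -> encoding
  | None ->
    (* Assumption: the source only says that "an exception" is raised. *)
    invalid_arg ("unknown character set alias: " ^ name)

let () =
  assert (encoding_of_alias "Latin1" = `ISO_8859_1);
  assert (encoding_of_alias "KOI8-R" = `KOI8_R)
```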
About translation tables
rosetta relies on underlying libraries such as uuuu or coin. They integrate translation tables provided by the Unicode Consortium. These tables should not need updating, so we store them statically in an int array.
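As a rough picture of what a byte-indexed translation table looks like, here is a sketch for ISO-8859-15, which matches ISO-8859-1 except at eight positions. The names `iso_8859_15` and `decode_byte` are hypothetical, and the actual layout inside uuuu or coin may differ. For brevity the array is built once at module initialization instead of being written out as a 256-entry literal.

```ocaml
(* Hypothetical sketch: one int per byte value, holding the Unicode code
   point it maps to. Not the actual table layout of uuuu or coin. *)
let iso_8859_15 : int array =
  (* Start from the identity mapping shared with ISO-8859-1. *)
  let table = Array.init 256 (fun byte -> byte) in
  (* Override the eight positions where ISO-8859-15 differs. *)
  List.iter
    (fun (byte, code_point) -> table.(byte) <- code_point)
    [ (0xA4, 0x20AC); (0xA6, 0x0160); (0xA8, 0x0161); (0xB4, 0x017D);
      (0xB8, 0x017E); (0xBC, 0x0152); (0xBD, 0x0153); (0xBE, 0x0178) ];
  table

(* Decoding a byte is a single array lookup. *)
let decode_byte (byte : char) : Uchar.t =
  Uchar.of_int iso_8859_15.(Char.code byte)

let () =
  (* 0xA4 is the euro sign (U+20AC) in ISO-8859-15. *)
  assert (Uchar.to_int (decode_byte '\xA4') = 0x20AC)
```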
About encoding
rosetta supports only decoding to Unicode code points. Encoding support is not planned, since people should only use Unicode now. Dealing with many encodings is a pain, and we should only produce Unicode output rather than old encodings like latin1.