package markup
Install
Dune Dependency
Authors
Maintainers
Sources
sha256=9526fd06a0afc37d7ae6e2528787142d52b124238ffb0e7e8e83bdd383806eb5
md5=3609724f5408dff41b1cb43107bc24ef
Description
Markup.ml provides an HTML parser and an XML parser. The parsers are wrapped in a simple interface: they are functions that transform byte streams to parsing signal streams. Streams can be manipulated in various ways, such as processing by fold, filter, and map, assembly into DOM tree structures, or serialization back to HTML or XML.
Both parsers are based on their respective standards. The HTML parser, in particular, is based on the state machines defined in HTML5.
The parsers are error-recovering by default, and accept fragments. This makes it very easy to get a best-effort parse of some input. The parsers can, however, be easily configured to be strict, and to accept only full documents.
Apart from this, the parsers are streaming (do not build up a document in memory), non-blocking (can be used with threading libraries), lazy (do not consume input unless the signal stream is being read), and process the input in a single pass. They automatically detect the character encoding of the input stream, and convert everything to UTF-8.
Published: 14 Mar 2022
README
Markup.ml
Markup.ml is a pair of parsers implementing the HTML5 and XML specifications, including error recovery. Usage is simple, because each parser is a function from byte streams to parsing signal streams:
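For example, a pipeline along the following lines (a minimal sketch built from the functions that appear in the breakdown below) parses, corrects, and pretty-prints the bad_html fragment used later in this README:

(* A minimal sketch: parse a malformed fragment with error recovery,
   then pretty-print it back to HTML on standard output. *)
let bad_html = "<body><p><em>Markup.ml<p>rocks!"

let () =
  Markup.(string bad_html
          |> parse_html |> signals |> pretty_print
          |> write_html |> to_channel stdout)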
In addition to being error-correcting, the parsers are:
- streaming: parsing partial input and emitting signals while more input is still being received;
- lazy: not parsing input unless you have requested the next parsing signal, so you can easily stop parsing partway through a document (see the sketch after this list);
- non-blocking: they can be used with Lwt, but still provide a straightforward synchronous interface for simple usage; and
- one-pass: memory consumption is limited since the parsers don't build up a document representation, nor buffer input beyond a small amount of lookahead.
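As a sketch of what laziness means in practice (assuming the synchronous interface and Markup.next), the function below builds a full parsing pipeline but reads only enough input to produce the first signal:

(* A sketch: no input is consumed until a signal is requested, and
   requesting one signal stops as soon as it has been produced. *)
let first_signal html =
  Markup.(string html |> parse_html |> signals |> next)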
The parsers detect character encodings automatically, and emit everything in UTF-8. The HTML parser understands SVG and MathML, in addition to HTML5.
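For instance, a conversion along these lines (a sketch that assumes parse_html's optional ~encoding argument and Markup.Encoding.iso_8859_1) decodes Latin-1 input, with the serialized output emitted as UTF-8:

(* A sketch: the input encoding is normally detected automatically,
   but can be forced; either way, the output byte stream is UTF-8. *)
let to_utf8_html latin1_bytes =
  Markup.(string latin1_bytes
          |> parse_html ~encoding:Encoding.iso_8859_1
          |> signals |> write_html |> to_string)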
Here is a breakdown showing the signal stream and errors emitted during the parsing and pretty-printing of bad_html:

string bad_html         "<body><p><em>Markup.ml<p>rocks!"

|> parse_html           `Start_element "body"
|> signals              `Start_element "p"
                        `Start_element "em"
                        `Text ["Markup.ml"]
                        ~report (1, 10) (`Unmatched_start_tag "em")
                        `End_element          (* </em>: recovery *)
                        `End_element          (* </p>: not an error *)
                        `Start_element "p"
                        `Start_element "em"   (* recovery *)
                        `Text ["rocks!"]
                        `End_element          (* </em> *)
                        `End_element          (* </p> *)
                        `End_element          (* </body> *)

|> pretty_print         (* adjusts the `Text signals *)

|> write_html
|> to_channel stdout;;  "...shown above..."   (* valid HTML *)
The parsers are tested thoroughly.
For a higher-level parser, see Lambda Soup, which is based on Markup.ml, but can search documents using CSS selectors, and perform various manipulations.
Overview and basic usage
The interface is centered around four functions between byte streams and signal streams: parse_html, write_html, parse_xml, and write_xml. These have several optional arguments for fine-tuning their behavior. The rest of the functions either input or output byte streams, or transform signal streams in some interesting way.
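One such transformation, sketched below under the assumption of the synchronous interface, folds over the signal stream to compute a summary of a document without ever building a tree:

(* A sketch: count start tags by folding over the signal stream. *)
let count_start_tags html =
  Markup.(string html
          |> parse_html |> signals
          |> fold (fun n signal ->
               match signal with
               | `Start_element _ -> n + 1
               | _ -> n)
             0)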
Here is an example with an optional argument:
(* Show up to 10 XML well-formedness errors to the user. Stop after
   the 10th, without reading more input. *)
let report =
  let count = ref 0 in
  fun location error ->
    error |> Markup.Error.to_string ~location |> prerr_endline;
    count := !count + 1;
    if !count >= 10 then raise_notrace Exit

let () =
  Markup.(file "some.xml" |> fst |> parse_xml ~report |> signals |> drain)
Advanced: Cohttp + Markup.ml + Lambda Soup + Lwt
This program requests a Google search, then does a streaming scrape of result titles. It exits when it finds a GitHub link, without reading more input. Only one h3 element is converted into an in-memory tree at a time.
let () =
  Lwt_main.run begin
    (* Send request. Assume success. *)
    let url = "https://www.google.com/search?q=markup.ml" in
    let%lwt _, body = Cohttp_lwt_unix.Client.get (Uri.of_string url) in

    (* Adapt response to a Markup.ml stream. *)
    let body = body |> Cohttp_lwt.Body.to_stream |> Markup_lwt.lwt_stream in

    (* Set up a lazy stream of h3 elements. *)
    let h3s = Markup.(body
      |> strings_to_bytes |> parse_html |> signals
      |> elements (fun (_ns, name) _attrs -> name = "h3"))
    in

    (* Find the GitHub link. .iter and .load cause actual reading of data. *)
    h3s |> Markup_lwt.iter (fun h3 ->
      let%lwt h3 = Markup_lwt.load h3 in
      match Soup.(from_signals h3 $? "a[href*=github]") with
      | None -> Lwt.return_unit
      | Some anchor ->
        print_endline (String.concat "" (Soup.texts anchor));
        exit 0)
  end
This prints "GitHub - aantron/markup.ml: Error-recovering streaming HTML5 and ...". To run it, do:
ocamlfind opt -linkpkg -package lwt.ppx,cohttp.lwt,markup.lwt,lambdasoup \
scrape.ml && ./a.out
You can get all the necessary packages by running:
opam install lwt_ssl
opam install cohttp-lwt-unix lambdasoup markup
Installing
opam install markup
Documentation
The interface of Markup.ml is three modules: Markup, Markup_lwt, and Markup_lwt_unix. The last two are available only if you have Lwt installed (OPAM package lwt).
The documentation includes a summary of the conformance status of Markup.ml.
Depending
Markup.ml uses semantic versioning, but is currently in 0.x.x. The minor version number will be incremented on breaking changes.
Contributing
Contributions are very much welcome. Please see CONTRIBUTING
for instructions, suggestions, and an overview of the code. There is also a list of easy issues.
License
Markup.ml is distributed under the MIT license. The Markup.ml source distribution includes a copy of the HTML5 entity list, which is distributed under the W3C document license.
Dev Dependencies (2)
- ounit2 dev
- bisect_ppx dev & >= "2.5.0"
Used by (13)
- camyll
- dream >= "1.0.0~alpha6"
- dream-livereload >= "0.2.0"
- lambdasoup >= "0.6"
- learn-ocaml
- markup-lwt
- odoc >= "1.4.0" & < "2.1.0"
- plist-xml = "0.3.0"
- ppx_bsx
- soupault >= "1.7.0"
- textmate-language >= "0.3.0" & < "0.3.4"
- tyxml-ppx
- valentine
Conflicts
None