package bap-byteweight
Default implementation that uses a memory chunk as the domain.
include V2.S
with type key = Bap.Std.mem
and type corpus = Bap.Std.mem
and type token := Bap.Std.word
include V1.S with type key = Bap.Std.mem with type corpus = Bap.Std.mem
include Bin_prot.Binable.S with type t := t
include Bin_prot.Binable.S_only_functions with type t := t
val bin_size_t : t Bin_prot.Size.sizer
val bin_write_t : t Bin_prot.Write.writer
val bin_read_t : t Bin_prot.Read.reader
val __bin_read_t__ : (int -> t) Bin_prot.Read.reader
This function only needs an implementation if t is exposed as a polymorphic variant. Despite what the type reads, this does *not* produce a function after reading; instead it takes the constructor tag (int) before reading and reads the rest of the variant t afterwards.
val bin_shape_t : Bin_prot.Shape.t
val bin_writer_t : t Bin_prot.Type_class.writer
val bin_reader_t : t Bin_prot.Type_class.reader
val bin_t : t Bin_prot.Type_class.t
include Ppx_sexp_conv_lib.Sexpable.S with type t := t
val t_of_sexp : Sexplib0.Sexp.t -> t
val sexp_of_t : t -> Sexplib0.Sexp.t
type key = Bap.Std.mem
type corpus = Bap.Std.mem
val create : unit -> t
create () creates an empty instance of the byteweight decider.
train decider ~max_length test corpus trains the decider on the specified corpus. The test function classifies extracted substrings. The max_length parameter bounds the maximum length of substrings.
val length : t -> int
length decider returns the total number of distinct substrings known to the decider.
next t ~length ~threshold data begin finds the next positive chunk.
Returns the offset, greater than begin, of the next longest substring up to the given length for which h1 / (h0 + h1) > threshold.
This is a specialization of the next_if function from the extended V1.V2.S interface.
val pp : Format.formatter -> t -> unit
pp ppf decider prints all chunks known to the decider.
next_if t ~length ~f data begin finds the next chunk that satisfies f.
Finds the next offset greater than begin of a string of the given length for which a substring s with length n and statistics stats was observed, such that f s n stats is true.
val fold : t -> init:'b -> f:('b -> Bap.Std.word list -> stats -> 'b) -> 'b
fold t ~init ~f
applies f
to all chunks known to the decider.
val find : t -> length:int -> threshold:float -> corpus -> Bap.Std.addr list
find mem ~length ~threshold corpus extracts the addresses of all memory chunks of the specified length that were classified positively under the given threshold.
val find_if :
t ->
length:int ->
f:(key -> int -> stats -> bool) ->
corpus ->
Bap.Std.addr list
find_if mem ~length ~f corpus finds all positively classified chunks.
This is a generalization of the find function with an arbitrary thresholding function.
It scans the input corpus using the next_if function and collects all positive results.
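The scan-and-collect pattern described here can be sketched in plain OCaml. This is only an illustration, not the library's implementation: collect_all and toy_next are hypothetical names, offsets are modelled as plain integers, and the sketch assumes the "next" function always returns an offset strictly greater than its argument (otherwise the loop would not terminate).

```ocaml
(* [next] is a hypothetical stand-in for a partially applied next_if:
   given an offset, it returns the next positively classified offset
   that is strictly greater, or None when the corpus is exhausted. *)
let collect_all ~(next : int -> int option) (start : int) : int list =
  let rec loop acc pos =
    match next pos with
    | None -> List.rev acc
    | Some off -> loop (off :: acc) off
  in
  loop [] start

(* Toy classifier over a 16-byte corpus: an offset is "positive"
   when it is divisible by 4. *)
let toy_next pos =
  let rec go i =
    if i >= 16 then None
    else if i mod 4 = 0 then Some i
    else go (i + 1)
  in
  go (pos + 1)

(* Starting just before offset 0 so that offset 0 itself is considered. *)
let positives = collect_all ~next:toy_next (-1)
```

Whether the real scan starts before the first offset, as this sketch does, is an assumption; the documentation only states that each result is greater than the previous position.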
val find_using_bayes_factor :
t ->
min_length:int ->
max_length:int ->
float ->
corpus ->
Bap.Std.addr list
find_using_bayes_factor sigs mem classifies function starts using the Bayes factor procedure.
Returns a list of addresses in mem that have a signature in sigs with length min_length <= n <= max_length and the Bayes factor greater than threshold.
The Bayes factor is the ratio between the posterior probabilities of two hypotheses: the h1 hypothesis that the given sequence of bytes occurs at the function start, and the dual h0 hypothesis,
k = P(h1|s)/P(h0|s) = (P(s|h1)/P(s|h0)) * (P(h1)/P(h0)),
where
- P(hN|s) is the probability of the hypothesis hN given the sequence of bytes s as the evidence;
- P(s|hN) is the probability of the sequence of bytes s, given the hypothesis hN;
- P(hN) is the prior probability of the hypothesis hN.
Given that m is the total number of occurrences of a sequence of bytes s at the beginning of a function, and n is the total number of occurrences of s in the middle of a function, we compute P(s|h1) and P(s|h0) as
P(s|h1) = m / (m+n),
P(s|h0) = 1 - P(s|h1) = n / (m+n).
Given that q is the total number of substrings in sigs of length min_length <= l <= max_length and p is the total number of substrings of the length l that start functions, we compute the prior probabilities as
P(h1) = p / q,
P(h0) = 1 - P(h1).
The resulting factor is a value 0 < k < infinity that quantifies the strength of the evidence that a given substring gives in support of the hypothesis h1. Levels below 1 support hypothesis h0, levels above 1 give some support to h1, with the following interpretations (Kass and Raftery (1995)):

Bayes Factor      Strength
1 to 3.2          Weak
3.2 to 10         Substantial
10 to 100         Strong
100 and greater   Decisive
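To make the formulas above concrete, here is a minimal sketch that computes k from the four counts m, n, p and q and maps it onto the strength scale. The names bayes_factor, strength and the variant constructors are hypothetical and not part of the library; the sketch assumes n > 0 and 0 < p < q so that no denominator is zero, and the handling of values that fall exactly on a boundary (1, 3.2, 10, 100) is an arbitrary choice.

```ocaml
(* k = (P(s|h1) / P(s|h0)) * (P(h1) / P(h0)), where
   P(s|h1) = m / (m + n) and P(h1) = p / q, as defined in the text.
   Assumes n > 0 and 0 < p < q. *)
let bayes_factor ~m ~n ~p ~q =
  let m = float_of_int m and n = float_of_int n in
  let p = float_of_int p and q = float_of_int q in
  let ps_h1 = m /. (m +. n) in
  let ps_h0 = 1. -. ps_h1 in
  let ph1 = p /. q in
  let ph0 = 1. -. ph1 in
  (ps_h1 /. ps_h0) *. (ph1 /. ph0)

type strength = Supports_h0 | Weak | Substantial | Strong | Decisive

(* Interpretation scale from the table; boundary values are assigned
   to the upper bucket, which is an assumption of this sketch. *)
let strength k =
  if k < 1. then Supports_h0
  else if k < 3.2 then Weak
  else if k < 10. then Substantial
  else if k < 100. then Strong
  else Decisive

(* A sequence seen 99 times at a function start and once elsewhere,
   where 100 of 1000 known substrings start functions:
   k = (0.99 / 0.01) * (0.1 / 0.9), which is about 11. *)
let k = bayes_factor ~m:99 ~n:1 ~p:100 ~q:1000
let verdict = strength k
```

With these counts the likelihood ratio of 99 is damped by prior odds of 1/9, so the evidence lands in the "Strong" band rather than "Decisive".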
val find_using_threshold :
t ->
min_length:int ->
max_length:int ->
float ->
corpus ->
Bap.Std.addr list
find_using_threshold sigs mem classifies function starts using a simple thresholding procedure.
Returns a list of addresses in mem that have a signature s in sigs with length min_length <= n <= max_length and the sample probability P1(s) of starting a function greater than threshold,
P1(s) = m / (m+n), where
- m - the total number of occurrences of s at the beginning of a function in sigs;
- n - the total number of occurrences of s not at the beginning of a function in sigs.
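The thresholding rule can be illustrated with a short sketch. It is not the library's code: p1 and select are hypothetical helpers, addresses are modelled as plain integers, and each candidate is assumed to arrive as a triple (address, m, n) carrying the two occurrence counts of its signature.

```ocaml
(* Sample probability P1(s) = m / (m + n) that s starts a function.
   Returns None when s was never observed (m + n = 0). *)
let p1 ~m ~n =
  if m + n = 0 then None
  else Some (float_of_int m /. float_of_int (m + n))

(* Keep the addresses of the hypothetical (address, m, n) candidates
   whose P1 is strictly greater than the threshold. *)
let select ~threshold candidates =
  List.filter_map
    (fun (addr, m, n) ->
      match p1 ~m ~n with
      | Some p when p > threshold -> Some addr
      | _ -> None)
    candidates

let starts =
  select ~threshold:0.5
    [ (0x1000, 9, 1);   (* P1 = 0.9, kept *)
      (0x1010, 1, 9);   (* P1 = 0.1, dropped *)
      (0x1020, 5, 5);   (* P1 = 0.5, not strictly greater, dropped *)
      (0x1030, 0, 0) ]  (* never observed, dropped *)
```

Unlike the Bayes factor procedure, this rule ignores the prior probability of a function start and looks only at the per-signature counts.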