mula

ML's radishal library for matching with Universal Levenshtein Automata.

Library mula

The entry point of this library is the module: Mula.

Basic Concepts

This library provides functions and functors to quickly compute Levenshtein edit distances of strings from a base string within a limit k. This can be used for fuzzy-string matching.

The Levenshtein distance from a string s1 to a string s2 is the minimum number of character edits (insert, delete, substitute) operations needed to change s1 into s2. We support both the standard Levenshtein distance as well as the (restricted) Demarau-Levenshtein distance, which includes transpositions of two adjacent characters as a edit operation.

Functionality

The Mula.Strings module provides functions for working with OCaml strings directly, and the Mula.Match module provides functors for working with your own representation of strings.

The libary offers two ways of working with strings. You can use the get_distance function to directly compute edit distances, or you can create a an automata using the start function to create an automata and feed characters and substrings into it lazily. The latter approach allows you to get the live minimum error counts.

The Mula.Strings module (and the functors created by Mula.Match.Make) contains submodules Mula.Strings.Lev for the standard Levenshtein distance and Mula.Strings.Dem for the (restricted) Demarau-Levenshtein distance. If you are unsure of which to use, use Mula.Strings.Dem.

Examples

Getting Edit Distances

# #require "mula";;
# Mula.Strings.Dem.get_distance ~k:2 "abcd" "abdc";;
- : int option = Some 1
# Mula.Strings.Lev.get_distance ~k:2 "abcd" "abdc";;
- : int option = Some 2
# Mula.Strings.Lev.get_distance ~k:2 "abcd" "efgh";;
- : int option = None

Live Minimal Error Counts

Examples of lazily feeding characters and into an automaton and getting live error counts:

# #require "mula";;
# (* Create an automaton for a limit and base string *);;
# module Lev = Mula.Strings.Lev;;
# let lev_nfa = Lev.start ~k:2 ~str:"abcd";;
val lev_nfa : Lev.nfa_state = <abstr>
# (* Get live error counts after feeding some characters into automaton *);;
# Lev.(feed_str lev_nfa ~str:"ab" |> current_error);;
- : int option = Some 0
# Lev.(feed lev_nfa ~ch:'a' |> feed ~ch:'b' |> feed ~ch:'c' |> current_error);;
- : int option = Some 0
# Lev.(feed_str lev_nfa ~str:"abd" |> current_error);;
- : int option = Some 1
# Lev.(feed_str lev_nfa ~str:"ab" |> feed_str ~str:"dc" |> current_error);;
- : int option = Some 1
# (* End input to get edit distance *);;
# Lev.(feed_str lev_nfa ~str:"ab" |> feed_str ~str:"dc" |> end_input);;
- : int option = Some 2

The last two examples show that the live error count can be lower than the edit distance. In the first of the two examples, 'd' is counted as a possible insert edit. In the second of the two examples, 'd' and 'c' are both counted as substitution edits.

Live Minimal Error Counts

Example of using the Mula.Match.Make functor:

# #require "mula";;
# module St = struct
  type ch = int
  type t = int array

  let length = Array.length
  let get = Array.get

  let equal = Int.equal
end;;
module St :
  sig
    ...
  end
# module M = Mula.Match.Make(St);;
module M :
  sig
    module Lev :
      sig
        type nfa_state = Mula.Match.Make(St).Lev.nfa_state
        val start : k:int -> str:St.t -> nfa_state
        val feed : nfa_state -> ch:int -> nfa_state
        val current_error : nfa_state -> int option
        val end_input : nfa_state -> int option
        val feed_str : nfa_state -> str:St.t -> nfa_state
        val get_distance : k:int -> St.t -> St.t -> int option
      end
    module Dem :
      sig
        ...
      end
  end