Universal Levenshtein Automata
The paper by Touzet details the Universal Levenshtein Automata.
Some nice computational properties of the (not-deterministic) automata include:
- There are no epsilon transitions.
- The automata are computable on demand, there is no need to store states and transitions in a data structure.
- The states of the automata carry error counts.
- There is a simple subsumption relation that prunes sets of states, so transitions produce small sets of states.
These allow for efficient implementation together with interesting uses.
- We not only know if two strings are within a certain edit distance, but we also know what the edit distance is if they are within the edit distance limit.
- If several strings are compared, we can rank them by their edit distance.
- We can lazily feed characters and string fragments to the automata and get the current error count.
This module offers a functor to build matchers for different representations of strings and characters. For a prebuilt matcher for OCaml strings and characters, see Strings
.
String Abstraction
We abstract over strings and characters so that we do not rely on any specific encoding. We only need the following:
- a function to compute length of strings,
- a function to fetch a character at an index, and
- a function to check if two characters are equal.
module type S = sig ... end
Levenshtein Automata
Given a string representation, we produce two Automata:
Make.Lev
, for the standard Levenshtein distance.Make.Dem
, for the Demarau-Levenshtein distance which includes transpositions as a primitve edit operation.