Index 1,600,000,000 Keys with Automata and Rust
Posted on by Andrew Gallant: BurntSushi
It turns out that finite state machines are useful for things other than expressing computation. Finite state machines can also be used to compactly represent ordered sets or maps of strings that can be searched very quickly.
In this article, I will teach you about finite state machines as a data
structure for representing ordered sets and maps. This includes introducing
an implementation written in Rust called the
fst
crate.
It comes with
complete API documentation.
I will also show you how to build them using a simple command line tool.
Finally, I will discuss a few experiments culminating in indexing over
1,600,000,000 URLs (134 GB) from the
July 2015 Common Crawl Archive.
The technique presented in this article is also how Lucene represents a part of its inverted index.
Along the way, we will talk about memory maps, automaton intersection with regular expressions, fuzzy searching with Levenshtein distance and streaming set operations.
Target audience: Some familiarity with programming and fundamental data structures. No experience with automata theory or Rust is required.