Tries and You: A Guide!
Data structures are a pretty fundamental topic in computer science, and most people are exposed to them in a whirlwind fashion in school or throughout their career. This exposure provides precious little experience with some of the less common or special purpose data structures.
The lack of understanding of a number of more exotic structures is not particularly surprising given that most of them only need to be used in some very specific cases. If you’re not working with large data sets that have to be searchable or advanced file system work, you probably won’t ever have a practical use for a B-tree outside of an interview. That said, it can also be good to take a step back and dive into some of these structures. To that end, I’m going to be briefly looking at one such structure today: The Trie!
What is a Trie?
Tries go by other, perhaps more recognizable, names, including prefix and radix trees. Tries are built around the concept of encoding the key for a piece of data in the actual structure of the tree itself.
In a normal search tree, each node has a key, and you have some rule for traversal of the key to get to the node you’re looking for. In the case of a binary search tree, you know the nodes in the left subtree of a given node are less than that node’s key, and nodes in the right subtree for a given node are greater than that node’s key. This lets you quickly traverse the tree to find the object you’re looking for. You can see an example of a binary search tree of this format below.
Tries are a bit different - instead of having each key on one node, you search for the key using each digit, letter, or element of the key - in other words, each logical sub-unit of the key - when you find one sub-unit, you proceed to looking for the next one. If you want to find the number “123” you would find a node with “1”, look at its children, and try to find the node that has a “2” and so on. This leads to some useful properties we can get from a trie that we can’t get from a normal search tree, which we’ll get into later. An example of a very basic trie is included below that exemplifies the structure I noted above.
Inserting data into a binary search tree is pretty simple, you just traverse the tree until you reach a dead end, then insert a node. Well, it’s usually that simple, unless you’re doing something fun like red-black or AVL trees - which you probably should be doing - then it gets more complicated.
For tries, we have a very similar approach, at least in spirit, even though it differs some in implementation. For our example, let’s take a string value, say, “program” and say we want to insert it into a trie that already has some data stored in it.
Before we get into exactly how to do an insertion, let’s have a look at a trie and see if we can understand more about the way it stores data.
Data Storage in Tries
I’ve highlighted nodes that denote actual stored key in red. In an actual implementation, this would generally be marked using a boolean flag on the node objects or an accessory table. Let’s look at this and see if we can determine a few properties.
- All the leaf nodes in this tree store keys - since the way we traverse these trees is to look for partial matches for a given input key, and the tree consists only of keys that have been inserted, it would not make sense for us to be able to traverse to a point where there are no other nodes to go to and we have not reached a node representing a stored key.
- Nodes that denote keys stored may occur on interior nodes - that is to stay, while all leaf nodes must denote a stored key, not all stored keys are present in leaf nodes - we can see several nodes here that store keys but are not leafs.
Let’s use the example above to try to find out how we would search for a key in a trie. Based on the structure we see above, the basic strategy we’ll want to employ is to recursively look at the children of a given node, and try to find ones that match the current prefix we’re on, if we find one, we keep going, if we don’t, we return not found.
Before we get to the actual algorithm, let’s just step through searching for “PEN”.
- To start, we look at the root node and check if it maxes our initial prefix, “P” - it does
- We search children of the root node to find our next prefix, “E”, we find a node that has this prefix
- We search children of the “E” node to find our next prefix, “N”, we find a node that has this prefix
- Since we have our entire prefix, we check if this node is marked as storing a key - it is, so we return the value of this node
I’ve colored the edges we used to reach “PEN” in the figure below to help demonstrate this.
This seems pretty straightforward, let’s see if we can write some pseudo-code for this algorithm
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29
Insertion into Tries
Now that we have built an algorithm for traversal, we are part of the way towards building an insertion algorithm since they are so similar for tries. Let’s see if we can figure out how to insert the word “PROMO” into the trie that we have here. We’ll start with something very similar to what we built for traversing tries, and then build on it for the case where we don’t find what we’re looking for.
- To start, we look at the root node, and we see we have a partial match on P, we then look at the children to see if we can find R
- We find R, we now look for children of that containing O
- We find O, we now look for children of that containing M
- We don’t find any children containing M - so we create a child of O that has a key of M - we do not mark it as storing a key
- As we have had to create a node, we now create another node, O, as a child of M and mark it as storing a key.
When all is said and done, we have the graph below:
Which is exactly what we were looking to get out of this algorithm. It’s fairly clear based on this work that the insertion algorithm for tries is closely related to traversal - it’s basically traversal until you either hit a dead end, in which case you keep adding nodes until you have your key stored, or until you hit the desired key, in which case you just mark it as storing a key and update its value, if any.
Using the approach we had above, we can state the general algorithm compactly as:
- Traverse as far as possible to the key you wish to store - if you find the key, you are done.
- If the key is not found, create a child node on the last node you reached containing the next element of the key
- Repeat (2) until you have created nodes for all the remaining segments in the key
- Mark the final node as storing a key, and add any value you require to it
Properties of Tries
Due to their structure and marked difference from binary search trees, tries have some interesting performance characteristics that are worth considering. We’ll get into some of the practical applications of this data in the next section when we start considering where tries are useful, but for now let’s focus purely on their advantages and disadvantages in terms of complexity and structure.
Since to traverse a trie you have to traverse a tree for the exact number of elements your key consists of (in the case of a string, letters), the worst case complexity for a given key is \(O(n)\) in the size of the key you wish to store or retrieve. One particular note - since we have two for loops in our traversal and insertion, many assume that this is quadratic - but you can bound the interior for loop by the size permitted for each portion of a key. In the case of a string that is case insensitive, we can bound this loop at 26 since we have that many letters - and you can then unroll the loop into 26 specific conditionals - so we end up only having one loop that is dependent on the input size.
Similarly, since storing a key amounts to traversing the tree, the complexity for key storage is also \(O(n)\).
We can contrast this with other structures used to store associative arrays. An unbalanced binary search tree shares the same worst case performance, while its average performance may be better. A balanced binary search tree has a worst case performance of \(O(log(n))\) with the tradeoff that you may have to rebalance the tree during insertion operations. Hash tables have a worst case performance matching that of the trie at \(O(n)\) but are generally considered to have an amortized runtime complexity of \(O(1)\) assuming you have a suitably sized hash table and good hash function to avoid collisions.
The structure of a trie allows us not only to find the key we’re looking for, but also to find other keys that are nearby the key we are searching for. Based on this, tries give us the ability to easily find near matches to whatever we’re looking for - which can come in handy for a number of cases.
In addition to near matches, we can also use a trie to get a sorting on the keys. If we understand our key space, in the case of strings, the alphabet, we can then structure the trie so that an in-order traversal of the tree, returning each key, gives us a lexographic ordering of the keys stored.
Uses for Tries
The fact that a trie is an associative array, just like hash tables and binary search trees, means we can use them in theory anywhere one of those other structures would be used. In practice, tries are most often used either as a replacement for a hash table when collisions become a problem or a hash function becomes complicated, or where we need to keep the data sorted as well as simply retrievable.
In addition to their general purpose use cases, tries also have a few specific areas they’re considered strong in. Predictive text completion, spell checking, and other related applications all generally use prefix trees. In the case of predictive text completion, you use the text that has been typed as the key to search for in the trie, and then provide potential complete strings based on the keys stored below the node you traversed to. For spell checking, you can traverse to a given key, and if it does not exist, you can back-track and find near matches to suggest potential corrections.
Optimizations and Storage Formats
In the general case, tries can end up taking a lot of space if they have a lot of interior nodes that are not stored keys. To fix this problem, there are versions of the trie data structure called compact prefix trees that address this problem by collapsing interior nodes. This adds some complexity both to the representation of the tree, as well as the algorithms used for traversal and insertion. I won’t go into a lot of detail on this, but suffice to say for trees with long keys, this can make a big difference in the space complexity of the tree. It’s most useful when a tree is used primarily for lookups instead of for insertions to avoid the complexity of re-inflating interior nodes that have been compacted. Applications for this include things like dictionaries or predictive text lookups that don’t change and may have quite a large set of keys.
Another optimization on tries can be to use a ternary search tree to store the trie - this turns out being a lot like a hybrid of a binary search tree and a trie. I won’t go into a lot of details on this subject at this time, but may follow up in a future article. These are used for similar reasons you’d want to use a compact trie for - to compress the space required by a trie to something more managable.