A Swift Solution to View Controller Dismissal

One of the most common complaints I see from people new to iOS concerns how view controller dismissal works. If your view controller is being presented - on a controller stack, modally, or in some other way - there are a few ways you can clean up once you need to be dismissed. The most obvious solution is to dismiss yourself directly, using something like the following:

@IBAction func finished(_: AnyObject?)
{
  self.dismissViewControllerAnimated(true, completion: nil)
}

The Drawbacks

Apple notes that this is not how you are supposed to dismiss a view controller, but that if you call this on a view controller that is being presented, the call will automatically be forwarded to the presenting view controller, which then performs the dismissal.

This approach is effective, but it has a few drawbacks that make it something of a kludge.

  • This approach is not generic - it doesn’t handle cases where the view controller is not being presented modally, but is instead in a navigation stack. While in practice view controllers are often tied very specifically to the flow of the UI, we should keep them general unless we absolutely cannot, as that makes them easier to refactor or reuse in the future.
  • It is hard to hook into this to make any other changes during dismissal - you’re effectively backing yourself into a corner, relying on a semi-automatic delegation pattern instead of setting one up explicitly.
  • This can complicate the flow of your program through implicit delegation, making it harder to figure out exactly what is going wrong if there’s a problem in your view controller logic. The bug may appear in another view controller you never call dismiss on directly.

The way this is generally avoided is to create a delegate on the view controller being presented. Before presentation, the view controller doing the presenting sets itself as the delegate, then displays the other view controller modally. There are a few drawbacks to this approach as well, notably that you end up with a slightly heavier-weight delegate protocol that is really only used for this dismissal. You also end up having to wire this up every time you present the view controller, as well as write the logic to then dismiss it. This can get a bit heavy.

Thanks to Swift protocol extensions, there’s a pretty easy way around this. It’s nothing particularly new, and I’m sure others use it all the time, but it’s a useful trick.

The Swift Solution

What we really want to accomplish here is to have a view controller that does some form of presentation or showing of another view be able to be notified on dismissal and take appropriate action. To do this, we can create a set of two protocols, one for the thing we will be presenting on, and one for the thing we’ll be presenting, and then use a protocol extension to give us a standard implementation that works for most view controllers. This gives us an almost drop-in solution with minimal fuss - our only real work is setting the delegate where we need to.

protocol DismissalDelegate : class
{
    func finishedShowing(viewController: UIViewController);
}

protocol Dismissable : class
{
    weak var dismissalDelegate : DismissalDelegate? { get set }
}

extension DismissalDelegate where Self: UIViewController
{
    func finishedShowing(viewController: UIViewController) {
        if viewController.isBeingPresented() && viewController.presentingViewController == self
        {
            self.dismissViewControllerAnimated(true, completion: nil)
            return
        }

        self.navigationController?.popViewControllerAnimated(true)
    }
}

Using this is as simple as adopting the protocols in the class definitions for your presenting and presented view controllers - here’s an example of a pair of view controllers using them.

class PresentedController: UIViewController, Dismissable {

    weak var dismissalDelegate: DismissalDelegate?

    @IBAction func done(_: AnyObject?)
    {
        dismissalDelegate?.finishedShowing(self)
    }
}

class PresentingViewController: UIViewController, DismissalDelegate {
  override func prepareForSegue(segue: UIStoryboardSegue, sender _: AnyObject?) {
      if let vc = segue.destinationViewController as? Dismissable
      {
          vc.dismissalDelegate = self
      }
  }
}

Here we can see the only wiring required, as we noted, is to set the delegate and to invoke the finishedShowing callback when we’re done in the presented view controller. In some cases, it can also make sense to make a superclass for the DismissalDelegate that implements prepareForSegue to do the work automatically. Either way, you end up with a lot less glue code related to presentation and dismissal.

Another nice aspect of this, as we noted earlier, is that the view controller no longer has a strong dependency on how it is presented. This lets the view controller act more agnostically and makes it easier to use throughout your application in diverse areas. We let whoever is doing the presentation worry about how to dismiss us - we neither care who is presenting us, nor how they are doing so; we worry only about notifying our parent upon completion.

Conclusion

In this post, we’ve covered a simple way to wrap some common iOS glue code up using a Swift protocol extension to keep dismissal logic out of presented view controllers. Doing this keeps our view controllers agnostic to how they are being displayed, aids reusability, and saves us the time and effort of implementing the same glue code in each view controller that needs to show another.

Oct 23rd, 2015

Tries and You: A Guide!

Data structures are a pretty fundamental topic in computer science, and most people are exposed to them in a whirlwind fashion in school or throughout their career. This exposure provides precious little experience with some of the less common or special purpose data structures.

The lack of understanding of a number of more exotic structures is not particularly surprising given that most of them only need to be used in some very specific cases. If you’re not working with large data sets that have to be searchable, or doing advanced file system work, you probably won’t ever have a practical use for a B-tree outside of an interview. That said, it can also be good to take a step back and dive into some of these structures. To that end, I’m going to be briefly looking at one such structure today: The Trie!

What is a Trie?

Tries go by other, perhaps more recognizable, names, including prefix and radix trees. Tries are built around the concept of encoding the key for a piece of data in the actual structure of the tree itself.

Search Trees

In a normal search tree, each node has a key, and you have some rule for traversal of the key to get to the node you’re looking for. In the case of a binary search tree, you know the nodes in the left subtree of a given node are less than that node’s key, and nodes in the right subtree for a given node are greater than that node’s key. This lets you quickly traverse the tree to find the object you’re looking for. You can see an example of a binary search tree of this format below.

Trie Example

Tries

Tries are a bit different - instead of having the whole key on one node, you search for the key using each digit, letter, or element of the key - in other words, each logical sub-unit of the key. When you find one sub-unit, you proceed to look for the next one. If you want to find the number “123”, you would find a node with “1”, look at its children, try to find the node that has a “2”, and so on. This leads to some useful properties we can get from a trie that we can’t get from a normal search tree, which we’ll get into later. An example of a very basic trie that exemplifies the structure I noted above is included below.

Trie Example

Building Tries

Inserting data into a binary search tree is pretty simple: you just traverse the tree until you reach a dead end, then insert a node. Well, it’s usually that simple - unless you’re doing something fun like red-black or AVL trees (which you probably should be doing), in which case it gets more complicated.

For tries, we have a very similar approach, at least in spirit, even though it differs somewhat in implementation. For our example, let’s take a string value, say “program”, and say we want to insert it into a trie that already has some data stored in it.

Before we get into exactly how to do an insertion, let’s have a look at a trie and see if we can understand more about the way it stores data.

Data Storage in Tries

Trie Example

I’ve highlighted the nodes that denote actual stored keys in red. In an actual implementation, this would generally be marked using a boolean flag on the node objects or an accessory table. Let’s look at this and see if we can determine a few properties.

  • All the leaf nodes in this tree store keys - since the way we traverse these trees is to look for partial matches for a given input key, and the tree consists only of keys that have been inserted, it would not make sense for us to be able to traverse to a point where there are no other nodes to go to and we have not reached a node representing a stored key.
  • Nodes that denote stored keys may occur on interior nodes - that is to say, while all leaf nodes must denote a stored key, not all stored keys are present in leaf nodes - we can see several nodes here that store keys but are not leaves.

Traversing Tries

Let’s use the example above to try to find out how we would search for a key in a trie. Based on the structure we see above, the basic strategy we’ll want to employ is to recursively look at the children of a given node and try to find ones that match the current sub-unit of the key. If we find one, we keep going; if we don’t, we return not found.

Before we get to the actual algorithm, let’s just step through searching for “PEN”.

  1. To start, we look at the root node and check if it matches our initial prefix, “P” - it does
  2. We search children of the root node to find our next prefix, “E”, we find a node that has this prefix
  3. We search children of the “E” node to find our next prefix, “N”, we find a node that has this prefix
  4. Since we have our entire prefix, we check if this node is marked as storing a key - it is, so we return the value of this node

I’ve colored the edges we used to reach “PEN” in the figure below to help demonstrate this.

Trie Example

This seems pretty straightforward - let’s see if we can write some pseudo-code for this algorithm.

C Pseudo-Code for Trie Traversal
void* valueForKey(Node* node, const char* key)
{
  /* Handle base case */
  if (node->key != key[0])
  {
    return NULL;
  }

  for (int i = 1; i < strlen(key); ++i)
  {
    char match = 0;
    for (int j = 0; j < node->numChildren; ++j)
    {
      if (node->children[j]->key == key[i])
      {
        match = 1;
        node = node->children[j];
        break;
      }
    }

    if (!match)
    {
      return NULL;
    }
  }

  return node->value;
}

Insertion into Tries

Now that we have built an algorithm for traversal, we are part of the way towards building an insertion algorithm since they are so similar for tries. Let’s see if we can figure out how to insert the word “PROMO” into the trie that we have here. We’ll start with something very similar to what we built for traversing tries, and then build on it for the case where we don’t find what we’re looking for.

  1. To start, we look at the root node, and we see we have a partial match on P, we then look at the children to see if we can find R
  2. We find R, we now look for children of that containing O
  3. We find O, we now look for children of that containing M
  4. We don’t find any children containing M - so we create a child of O that has a key of M - we do not mark it as storing a key
  5. As we have had to create a node, we now create another node, O, as a child of M and mark it as storing a key.

When all is said and done, we have the graph below:

Trie Example

Which is exactly what we were looking to get out of this algorithm. It’s fairly clear based on this work that the insertion algorithm for tries is closely related to traversal - it’s basically traversal until you either hit a dead end, in which case you keep adding nodes until you have your key stored, or until you hit the desired key, in which case you just mark it as storing a key and update its value, if any.

Using the approach we had above, we can state the general algorithm compactly as:

  1. Traverse as far as possible to the key you wish to store - if you find the key, you are done.
  2. If the key is not found, create a child node on the last node you reached containing the next element of the key
  3. Repeat (2) until you have created nodes for all the remaining segments in the key
  4. Mark the final node as storing a key, and add any value you require to it
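The general algorithm above can be sketched in C. The Node layout here is hypothetical, mirroring the earlier traversal pseudo-code, with a fixed-capacity children array assumed for simplicity:

```c
#include <stdlib.h>
#include <string.h>

#define MAX_CHILDREN 26

typedef struct Node {
    char key;                      /* the sub-unit this node matches    */
    int isKey;                     /* nonzero if a stored key ends here */
    void *value;
    int numChildren;
    struct Node *children[MAX_CHILDREN];
} Node;

Node *newNode(char key)
{
    Node *n = calloc(1, sizeof(Node));
    n->key = key;
    return n;
}

/* Traverse as far as the existing nodes allow, create any missing
 * suffix nodes, and mark only the final node as storing a key. */
void insertKey(Node *root, const char *key, void *value)
{
    Node *node = root;             /* assumes root->key == key[0] */
    size_t len = strlen(key);
    for (size_t i = 1; i < len; ++i)
    {
        Node *next = NULL;
        for (int j = 0; j < node->numChildren; ++j)
        {
            if (node->children[j]->key == key[i])
            {
                next = node->children[j];
                break;
            }
        }
        if (next == NULL)          /* dead end - start creating nodes */
        {
            next = newNode(key[i]);
            node->children[node->numChildren++] = next;
        }
        node = next;
    }
    node->isKey = 1;
    node->value = value;
}
```

Note how the outer loop is exactly the traversal from before; only the dead-end branch is new.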

Properties of Tries

Due to their structure and marked difference from binary search trees, tries have some interesting performance characteristics that are worth considering. We’ll get into some of the practical applications of this data in the next section when we start considering where tries are useful, but for now let’s focus purely on their advantages and disadvantages in terms of complexity and structure.

Runtime Complexity

Since to traverse a trie you have to traverse a tree for the exact number of elements your key consists of (in the case of a string, letters), the worst case complexity for a given key is \(O(n)\) in the size of the key you wish to store or retrieve. One particular note - since we have two for loops in our traversal and insertion, many assume this is quadratic - but you can bound the inner for loop by the number of values permitted for each portion of a key. In the case of a case-insensitive string, we can bound this loop at 26 since we have that many letters - you could even unroll the loop into 26 specific conditionals - so we end up with only one loop that is dependent on the input size.

Similarly, since storing a key amounts to traversing the tree, the complexity for key storage is also \(O(n)\).

We can contrast this with other structures used to store associative arrays. An unbalanced binary search tree shares the same worst case performance, while its average performance may be better. A balanced binary search tree has a worst case performance of \(O(log(n))\) with the tradeoff that you may have to rebalance the tree during insertion operations. Hash tables have a worst case performance matching that of the trie at \(O(n)\) but are generally considered to have an amortized runtime complexity of \(O(1)\) assuming you have a suitably sized hash table and good hash function to avoid collisions.

Structural Considerations

The structure of a trie allows us not only to find the key we’re looking for, but also to find other keys that are nearby the key we are searching for. Based on this, tries give us the ability to easily find near matches to whatever we’re looking for - which can come in handy for a number of cases.

In addition to near matches, we can also use a trie to get a sorted ordering of the keys. If we understand our key space - in the case of strings, the alphabet - we can structure the trie so that an in-order traversal of the tree, returning each key, gives us a lexicographic ordering of the keys stored.
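To illustrate, here is a small C sketch - the Node layout is hypothetical and mirrors the traversal pseudo-code, with the extra assumption that each node keeps its children array sorted by key - in which a depth-first walk appends the stored keys in lexicographic order:

```c
#include <string.h>

#define MAX_CHILDREN 26

typedef struct Node {
    char key;
    int isKey;
    int numChildren;
    struct Node *children[MAX_CHILDREN];   /* kept sorted by key */
} Node;

/* Depth-first traversal that visits children in key order, appending
 * each stored key (space-separated) to `out` as it is reached. */
void collectSorted(const Node *node, char *prefix, size_t depth, char *out)
{
    prefix[depth] = node->key;
    prefix[depth + 1] = '\0';
    if (node->isKey)
    {
        strcat(out, prefix);
        strcat(out, " ");
    }
    for (int i = 0; i < node->numChildren; ++i)
        collectSorted(node->children[i], prefix, depth + 1, out);
}
```

Because the children are visited in key order, the first complete key reached is always the lexicographically smallest one remaining.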

Uses for Tries

The fact that a trie is an associative array, just like hash tables and binary search trees, means we can use them in theory anywhere one of those other structures would be used. In practice, tries are most often used either as a replacement for a hash table when collisions become a problem or a hash function becomes complicated, or where we need to keep the data sorted as well as simply retrievable.

In addition to their general purpose use cases, tries also have a few specific areas they’re considered strong in. Predictive text completion, spell checking, and other related applications all generally use prefix trees. In the case of predictive text completion, you use the text that has been typed as the key to search for in the trie, and then provide potential complete strings based on the keys stored below the node you traversed to. For spell checking, you can traverse to a given key, and if it does not exist, you can back-track and find near matches to suggest potential corrections.

Optimizations and Storage Formats

In the general case, tries can end up taking a lot of space if they have a lot of interior nodes that are not stored keys. To fix this, there are versions of the trie data structure called compact prefix trees that collapse interior nodes. This adds some complexity both to the representation of the tree and to the algorithms used for traversal and insertion. I won’t go into a lot of detail on this, but suffice it to say that for trees with long keys, this can make a big difference in the space complexity of the tree. It’s most useful when a tree is used primarily for lookups instead of insertions, to avoid the complexity of re-inflating interior nodes that have been compacted. Applications for this include things like dictionaries or predictive text lookups that don’t change and may have quite a large set of keys.

Another optimization on tries is to use a ternary search tree to store the trie - this turns out to be a lot like a hybrid of a binary search tree and a trie. I won’t go into a lot of detail on this subject at this time, but may follow up in a future article. These are used for similar reasons you’d want to use a compact trie: to compress the space required by a trie to something more manageable.

Sep 22nd, 2015

Building (Almost) Dependent Types in Swift

Dependent typing is something most of us use in a restricted form, but that most working programmers don’t really encounter in its “stronger” forms outside of school. Parametric polymorphism is the most common expression of type dependence, which lets us have functions or other types that depend on the types of parameters or on a type parameter to the class itself. This is most commonly called generics, and it permits us to generalize the way we write code so that we can use the same algorithm even though the types may vary - the classic example being an array or vector.

There’s another side to dependent typing though that gives us even more freedom, and the best way to introduce it is to ask a simple question: what about tuples? While languages like Swift have built-in tuple types, what type is a tuple really? Is it a type that is parametrically polymorphic as we noted above? Sort of - we know that it should store any arbitrary data type, but the difference between a 3-tuple and a 4-tuple is not the same as the difference between a list containing three items and a list containing four items - indeed, a 3-tuple and a 4-tuple are separate types. The n-tuple is a dependent type based on a value - the number of elements it contains. This is where we really get the value out of dependent typing - we can define functions that only operate on 4-tuples, or functions that only operate on 8-tuples, and then enforce this at compile time using Swift’s static type checking. This helps us stop misuse of functions and methods before runtime, and helps us keep bugs out of our code.

So, let’s dig in, after all, why should Apple get to have all the fun with their tuples?

Tuple Typing

Let’s have a look at some code and how it evaluates in Swift to convince ourselves that tuples work this way. One thing to note, throughout this post I’m going to be using the reflect function in Swift, which lets us get a MirrorType for any object, letting us get easy access to its actual type and other information.

let x = reflect((1,2,3))
let y = reflect((1,2,3,4))
print(x.valueType == y.valueType)

If you try this for yourself, you can see that we get a statement of “false” printed. The types in question are shown as (Int,Int,Int).Type and (Int,Int,Int,Int).Type. We can see here that not only does the type of the values stored determine the type of the tuple, but as we said, so does the number of elements. This makes it clear that there is at least one very common case where allowing dependent types based on values not just types would be valuable. With this in mind, let’s see if we can implement “pseudo” dependent types in Swift with the end goal of implementing a type system rich enough for the tuple type ourselves.

Building (Almost) Dependent Types

Before we can tackle tuples, first we have to figure out how to represent a value in a type. For now, let’s talk about the type we’re going to generate for a particular value in terms of a type constructor, \(F(x)\) where \(x\) can vary over our input domain of valid values for our dependent type system. Given this type constructor, it must satisfy a few fairly common sense properties for us to be able to use it for dependent typing.

The first property is that the function must always map a given input value to a given output type - that is, the type generated for a given input must remain the same and compare as equal regardless of how many times we call the type constructor. Another way to state this is that we have to guarantee that \(F(x)\) is a pure function, that is, it causes no observable side effects, and relies on no external state.

The second property is that for two given inputs to the type constructor, the output must be unique. To state this mathematically the function must satisfy the following rule \(\forall x_{0},x_{1} \in X, x_{0} \neq x_{1} \implies F(x_{0}) \neq F(x_{1})\). This guarantees us that our function is a one to one map for values to types.

Recursive Type Representations in Swift

Swift is great because it offers very good static type checking - but this also presents us with a bit of a problem as we try to generate dependent types: we have to construct types based on values that are not known until runtime. This puts us in hot water, as it means the type inference system cannot figure out our types, and in fact it’s not even entirely clear how we can express this in Swift’s syntax.

To solve this, let’s think about what the types we have will actually represent. For now, for the purposes of simplification, let’s limit the values we support to positive integers; that is, we’ll limit the domain \(X\) of our function to the set of positive integers supported in Swift. This makes sense since it’s all we need for tuples, but we can also generalize this approach, as we’ll discuss later. Based on this, the type we generate with our function really just needs to be an encoding of the value itself. Swift offers us generics, so what if we used container types and strung them together in such a way that they represent the values? Let’s look at a few examples.

class ValueTypeTerm<T>
{
}

/* To represent 1 we will use the following type */
var valueType1 : ValueTypeTerm<Int>

/* To represent 2 we will use the following type */
var valueType2 : ValueTypeTerm<ValueTypeTerm<Int>>

/* And so on... */

Based on the above, it’s fairly clear that this also satisfies the properties we mentioned before - for any input, the type it generates will be unique, and for any input the type will always be the same. Now that we have a method of representation, we just have to write our actual type constructor.

Generating (Almost) Dependent Types

As it turns out, we can use a recursive function to actually generate the type we want. The only problem we will run into with this sort of setup is how we actually pass the type up the stack in such a way as to not confuse the type inference engine. We can do this using an accumulator, given that this is a recursive function, which also has the useful side effect of making this function tail recursive. We use MirrorType here at the top as a convenience to avoid some weird casting.

class ValueTypeTerm<T>
{
}

func valueTypeForValueRecursive<T>(val: Int, blank: T) -> MirrorType
{
  if (val == 0)
  {
    return reflect(blank)
  }
  return valueTypeForValueRecursive(val-1, blank: ValueTypeTerm<T>())
}

func valueTypeForValue(val: Int) -> Any.Type
{
  return valueTypeForValueRecursive(val, blank: val).valueType
}

So we can see here that we have a wrapper function called valueTypeForValue that takes an integer and then returns a type by simply calling the recursive call and unwrapping the result from the MirrorType.

The recursive function itself takes what is in the accumulator, embeds it in a ValueTypeTerm as the parametric type, and then recursively calls itself until it hits zero. Once zero is reached, it wraps it in a MirrorType, and returns the result.

If we do some testing in a playground, we will quickly find that this does exactly what we want: it generates the types exactly as we specified in our previous representation. The returned types also meet the two requirements we stated originally.

There’s one problem we haven’t solved yet though: we can now represent types with numeric values, but another aspect of dependent typing is that the value may actually change the semantics of the type itself. In other words, a tuple is dependent not just on the number of elements it has, but also on the types of those elements. So it looks like we’ve moved the goalposts now that we’ve reached them - let’s dig back in.

Back to the Drawing Board…

There’s a few directions we can head from here - but the most obvious is probably to extend our type constructor to not only be dependent on the integer, but also on a set of types themselves. In other words, instead of just passing in an integer, you would also pass in a list of types that dictate the order of the Tuple. This makes some intuitive sense, but first we have to think about how to tweak our model and see if it makes sense.

Adapting our model

Previously, we used a model where we nested types - we will still need this model to represent the integer portion of our type, but we also want to represent the type at a given tuple index. Why not just add another parametric type to our container, using one to handle the nesting and one to handle the type for the current index? Let’s look at a few examples quickly.

class ValueTypeTerm<Q,T>
{
}

/* Value: 2, Types: Int, String */
var type1 : ValueTypeTerm<String,ValueTypeTerm<Int,Int>>

/* Value: 3, Types: Int, String, CustomType */
var type2 : ValueTypeTerm<CustomType,ValueTypeTerm<String,ValueTypeTerm<Int,Int>>>

This is pretty simple - we’re just adding a parametric type value alongside the “depth” to indicate what type is available at each depth. This sounds like it will work.

Modifying Our Type Constructor

Given our modified representation, now we just have to modify our type constructor to generate our types. The obvious way to do this would be to let callers pass in the types themselves - but doing so runs into issues with the type inference system. Our entire strategy revolves around being able to build a type up dynamically to represent our structure, and when we pass in the types we cannot create this structure dynamically, since Swift is statically typed and we cannot reference types in this way at runtime.

Instead, we can pass in the actual values themselves and construct our structure from the objects, with their rich type information - but there’s still another problem: if we accept [Any], the objects lose their type information. We also cannot dynamically cast based on an Any.Type; otherwise we could pass in MirrorType objects and cast their values to their valueType. That puts us in a bad position, but there is one option: we can, in a quite ugly way, recursively generate our tuple type based on the values we want to pass in.

Let’s have a look at the very simple code that accomplishes this, then we can talk more about it.

class ValueTypeContainer<Q,T>
{
}

func valueTypeForTypeRecursive<Q,T>(val: Q, cont : T) -> ValueTypeContainer<Q,T>
{
    return ValueTypeContainer<Q,T>()
}

let b = valueTypeForTypeRecursive(1,cont:valueTypeForTypeRecursive("A String",cont:valueTypeForTypeRecursive([1],cont: valueTypeForTypeRecursive(1,cont: ValueTypeContainer<Any,Any>()))))

let c = valueTypeForTypeRecursive(1,cont:valueTypeForTypeRecursive("A String",cont:valueTypeForTypeRecursive([1],cont: valueTypeForTypeRecursive([1],cont: ValueTypeContainer<Any,Any>()))))

We can see that this recursive building function basically has us pass in a value at each step, combined with a container structure. The function then aggregates the container structure with the value passed in. At the base case, we have a container that has any types in it to mark that we are done. Here you can see we create two tuples, one of what we would call type (Int, String, [Int], Int) (not literally this type in Swift, but what you can conceptually think of as representing this type) and one of what we would call type (Int, String, [Int], [Int]). If we use reflect and valueType to get their types, we can then actually compare the types and see that they are actually fundamentally different.

This is very ugly, and there are ways we could potentially clean it up, but it’s difficult as we run into a lot of problems where Swift won’t let us change things so it can preserve its static typing capability - which ironically is part of what really gives us value in these sort of approaches.

Generalizing This

With what we’ve done, we are capable of representing and even implementing a tuple using dependent typing. We can extend this to other arbitrary classes of values as well as types using multiple techniques, but they all come down to the primitive operations we displayed here: recursively building types and using polymorphic types to represent dependent types. Regardless, it’s important to note we still haven’t given the type system truly dependent types - just provided a workaround to let us pretend it does in some cases.

Conclusion

After discussing why dependent types are important, we have managed to build pseudo-dependent types based on values in Swift. These types, while not truly giving us a dependent type system, prove that Swift is rich enough that it’s even possible to (almost) extend the type system itself if you’re willing to jump through enough hoops. I wouldn’t recommend using the final results here in most cases, but sometimes using value-driven dependent typing, even when the implementation itself is ugly, can lead to much cleaner and safer results.

I hope you’ve enjoyed the post - as always, please feel free to let me know of any errors I’ve made or any improvements you think I could make to the article. I’ve done my best to keep it accurate, but I’m quite fallible.

Oh - one last note - for those of you that are research computer scientists or theoretical purists, I’ve stepped on type theory quite a bit today without getting into any real depth in terms of explaining how type systems work, why dependent typing matters, the different ranks/types of dependent typing, nor a lot of other very important details. This was done intentionally to keep this article accessible. Rest assured, I’m preparing another article to dig into these topics a lot more though and you’re welcome to nitpick it as much as you’d like!

Thanks

Thanks to curryhoward from Hacker News for pointing out some terminology errors I made and suggesting some other corrections for clarity - as always, such corrections are appreciated!

Aug 11th, 2015

Let's Build Karatsuba Multiplication

Today I’m going to start a new series of posts that explores different algorithms, from the simple to serene, that we end up using more than we might realize every single day of our careers. Many of these algorithms are in the internals of various systems we use, from microprocessors to compilers to operating systems. Like with most other things I post here, these posts are going to focus on the why as well as the how and dig into why these things matter.

The first topic we’re going to cover is the Karatsuba multiplication algorithm and some of its more advanced friends.

Basic Multiplication

Most of us know how to multiply numbers using one of the grade school algorithms frequently taught. I’m going to be looking at the algorithm I was taught in elementary school, which I understand is quite common, at least in the US - but you can extend the arguments below to most multiplication algorithms taught in schools. This particular algorithm is called long multiplication.

In the algorithm we’re going to look at, we multiply each digit in each number by each digit in the other number, adjusting for the place of the digits. In other words for an \(n\) digit number, we can say it’s made up of the digits \(a_{n-1}a_{n-2}a_{n-3}...a_{0}\) and that to multiply it by some number \(b\) of similar form with \(m\) digits, the grade-school multiplication algorithm gives us the following expression

\[ \sum\limits_{i=0}^{m-1}\sum\limits_{j=0}^{n-1}b_{i}\times a_{j}\times 10^{j+i} \]

Let’s break this down a little bit so we can see that this expression really represents how we do the old “grade school” method of multiplication. For each digit in the multiplier we multiply it by each digit of the multiplicand, scaling it by the position of the digit in the multiplier and the multiplicand. In other words, you take each digit in the “bottom number” and multiply it by the digits in the “top number” to get a result, save that, then multiply the next digit by the entire “top number.” When finished we add all these results together to get a final value.
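We can also check this expression mechanically. Below is a small C++ sketch that evaluates the double sum directly (longMultiply is a throwaway name for illustration; digits are stored least significant first, so a[j] carries weight \(10^{j}\)):

```cpp
#include <cstddef>
#include <vector>

// Evaluate the double-sum expression for long multiplication directly:
// sum over i, j of b[i] * a[j] * 10^(i+j).
long long longMultiply(const std::vector<int>& a, const std::vector<int>& b) {
    long long sum = 0;
    for (std::size_t i = 0; i < b.size(); ++i) {
        for (std::size_t j = 0; j < a.size(); ++j) {
            long long scale = 1;
            for (std::size_t k = 0; k < i + j; ++k) scale *= 10;  // 10^(i+j)
            sum += b[i] * a[j] * scale;
        }
    }
    return sum;
}
```

For example, `longMultiply({7, 8, 2}, {1, 2, 4})` evaluates \(287 \times 421\) and reproduces ordinary multiplication.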

Are we crazy?

Let’s look at an example of this on two numbers so we can convince ourselves intuitively that this expression is equivalent to our basic multiplication algorithm.

\[ \begin{array}{cccccccc} & & & & 2 & 8 & 7 & \\ \times & & & & 4 & 2 & 1 & \\ \hline & & & & 2 & 8 & 7 & (1\times 287\times 10^{0})\\ & & & 5 & 7 & 4 & 0 & (2\times 287\times 10^{1})\\ + & 1 & 1 & 4 & 8 & 0 & 0 & (4\times 287\times 10^{2})\\ \hline & 1 & 2 & 0 & 8 & 2 & 7\\ \end{array} \]

So you can see here that each intermediate result row is the result of multiplying a digit in the bottom number - in this case 421 - with the entire top number - in this case 287. That means the expression for this multiplication is as follows:

\[ \sum\limits_{i=0}^{m-1} 10^{i}\times b_{i}\times a \]

We can of course rewrite \(a\) as follows, based on the fact that a number is the sum of its digits scaled by their place - in this case, each successive place multiplies a digit’s value by another factor of 10, since we are in base–10.

\[ \sum\limits_{i=0}^{m-1} \left ( 10^{i}\times b_{i} \right ) \times \sum\limits_{j=0}^{n-1} \left ( a_{j}\times10^{j} \right ) \]

Which is very similar to the formula we came up with earlier - except the \(10^{i}\times b_{i}\) is outside of the sum. This is fine, but as we know, we can move it inside the sum by the distributive property. Why would we want to do this, considering the expressions are the same? Here’s why: by arranging it in this way, we make it clear that we are only ever multiplying single digit numbers together, with the exception of multiplying by powers of ten, which are basically just “shift” operations. This means that if we memorize or store the multiplication tables up to 9, we can compute these internal values simply by looking up the result.

Analyzing Long Multiplication

Now that we’re pretty sure that our expression previously is, in fact, an accurate representation of this multiplication algorithm, let’s write some pseudo-code that implements it and look at its runtime complexity.

// We use the type digit here for an int that is always < 10
int multiply(digit[] a, int a_len, digit[] b, int b_len)
{
  vector<int> resultVec;
  int result = 0;

  for (int i = 0; i < b_len; ++i)
  {
    for (int j = 0; j < a_len; ++j)
    {
      resultVec.push_back(a[j] * b[i] * exp(10, i + j));
    }
  }

  for (vector<int>::iterator i = resultVec.begin(); i != resultVec.end(); ++i)
  {
    result += *i;
  }

  return result;
}

Before we talk about the complexity of this algorithm, we need to consider what complexity means here. Since we’re reducing the problem to a set of single digit multiplications, complexity here really means the number of single digit multiplications we perform. Looking at this pseudo-code, we can see we have a for loop within a for loop - with complexity \(O(n^{2})\) - and another for loop with complexity \(O(n)\). Total complexity is \(O(n^{2} + n) = O(n^{2})\). Unsurprisingly, the runtime complexity of an algorithm that simply multiplies every digit against every other digit, and scales them appropriately, is \(O(n^{2})\). So the question becomes: can we do better than this?

Improving on Naive Long Multiplication

Since you’re reading this article, you’ve probably already figured out we can improve on naive multiplication - so the question then becomes how are we going to accomplish this? A common approach we’ll see time and time again as we look at algorithms is the use of a divide and conquer approach to solving problems. Let’s see if we can leverage that here - we’ll start by trying to rewrite multiplication slightly differently and then figure out how we can work what we’ve done into an algorithm.

Diving Into Long Multiplication

If you’re just interested in understanding how the algorithm works and are not interested in the mathematical basis - feel free to skip this section - we go into some depth on this problem to justify the solution.

Let’s start by writing out the expression that symbolizes long multiplication, before we simplify it, using our previous examples:

\[ a\times b = \sum\limits_{i=0}^{n-1}\left (a_{i}\times10^{i} \right ) \times \sum\limits_{j=0}^{m-1}\left (b_{j}\times10^{j} \right ) \]

We can reformulate this to split \(a\) and \(b\) in half pretty easily as we see below.

\[ a = \left (\sum\limits_{i=0}^{\left(n\over 2\right)-1}\left (a_{i}\times10^{i}\right ) + \sum\limits_{i=\left( n\over 2\right)}^{n-1}\left (a_{i}\times10^{i} \right ) \right )\\ b = \left (\sum\limits_{j=0}^{\left(m\over 2\right)-1}\left (b_{j}\times10^{j}\right ) + \sum\limits_{j=\left( m\over 2\right)}^{m-1}\left (b_{j}\times10^{j} \right ) \right ) \]

Which is just a fancy way of saying we can split \(a\) and \(b\) into two numbers based on the first half of the digits and the second half of the digits. In other words, we can pick an arbitrary place in a number, split the number into two separate numbers (eg: for 1234, 1200 and 0034), and add the two halves back together and get the whole again. Let’s name the two halves as below - note that we can also extract a power term for the “top” half since the digits all start at a non-zero index - so we have done that as well. We have also extracted the expression representing the “midpoint” of the number to another variable.

\[ a = 10^{\left ( n_{center} + 1 \right )}A_{l} + A_{r}\\ n_{center} = \left ( n\over 2 \right )-1 \\ A_{l} = \sum\limits_{i=n_{center} + 1}^{n-1}\left (a_{i}\times 10^{i-\left (n_{center}+1 \right )} \right )\\ A_{r} = \sum\limits_{i=0}^{n_{center}}\left (a_{i}\times10^{i}\right )\\ \]

Let’s use an example so we can get a feel for what this means. To return to our previous problem - let’s use the solution from that problem, the number 120827, and split it up in this manner.

\[ a = 120827\\ A_{l} = 120\\ A_{r} = 827\\ a = \left(120 \times 10^{3}\right ) + 827 \]
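This split is easy to sketch in code on machine integers (splitAt is a hypothetical helper for illustration; center is the index of the last digit kept in the right half):

```cpp
#include <utility>

// Split a non-negative integer at digit index `center`, returning
// (left, right) such that a == left * 10^(center + 1) + right.
std::pair<long long, long long> splitAt(long long a, int center) {
    long long p = 1;
    for (int i = 0; i <= center; ++i) p *= 10;  // p = 10^(center + 1)
    return { a / p, a % p };
}
```

For instance, `splitAt(120827, 2)` yields (120, 827), matching the example above.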

We can see that this notation really just means we can split a number in half in such a way that adding the halves back up results in a “whole” so to speak. Using this new re-written formulation, we can now substitute into our original problem of multiplying the two numbers and we will find ourselves in a new and very interesting form. For simplicity here, we’re going to assume an equal number of digits for each number - if they aren’t equal, we can always pad it with zeroes to make them the same length.

\[ a \times b = \left (10^{n_{center}+1}A_{l} + A_{r} \right) \times \left (10^{n_{center}+1}B_{l} + B_{r} \right )\\ a \times b = 10^{2\left(n_{center}+1\right)}A_{l}B_{l} + 10^{n_{center}+1}A_{l}B_{r} + 10^{n_{center}+1}A_{r}B_{l} + A_{r}B_{r} \]

This is great - as it means we can divide the problem into four sub-pieces - and if we did this recursively, this would give us a new way to multiply two numbers. While this is interesting, we can actually show that this is not any more efficient than the long multiplication algorithm. The key thing that changes this around and turns it into what we call the Karatsuba algorithm is the recognition that we can simplify this to only requiring three multiplications instead of four. To do that, we’re going to look at manipulating the center two terms.

\[ A_{l}B_{r} + A_{r}B_{l} = \left(A_{l} + A_{r}\right )\left (B_{l} + B_{r} \right ) - A_{r}B_{r} - A_{l}B_{l} \]

What we’ve done here is factored the two terms shown above to require one multiplication and several additions and subtractions. The subtracted terms are themselves products, but as you can see, we already have to compute them for the rest of our algorithm. Putting this together, we have a grand total of three multiplication operations required. Let’s put it all together.

\[ a \times b = 10^{2\left(n_{center}+1\right)}t_{1} + 10^{n_{center}+1}t_{2} + t_{3}\\ \begin{array}{rl} t_{1} = & A_{l}B_{l}\\ t_{2} = & (A_{l} + A_{r})(B_{l} + B_{r}) - t_{1} - t_{3}\\ t_{3} = & A_{r}B_{r}\\ \end{array} \]

Let’s try it out using our stand-by example of multiplying 287 by 421. To start out, let’s break it down and figure out what we need to calculate.

\[ n_{center} = 0\\ \begin{array}{c|c} a = 287 & b = 421\\ a_{l} = 28 & b_{l} = 42\\ a_{r}=7 & b_{r}=1\\ \end{array} \]

With these initial computations done, we can evaluate the formulas we derived earlier and get the solution we expected.

\[ \begin{array}{rl} t_{1} = & 28 \times 42 = 1176\\ t_{2} = & (28 + 7)(42 + 1) - 1176 - 7 = 322\\ t_{3} = & 1 \times 7 = 7\\ \hline a \times b = & 10^{2}(1176) + 10^{1}(322) + 7 = \textbf{120827}\\ \end{array} \]
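We can recompute this single step in code as a sanity check (karatsubaStep is a throwaway name; the constants mirror the table above):

```cpp
// Recompute the worked example 287 x 421 using exactly three multiplications.
long long karatsubaStep() {
    long long t1 = 28 * 42;                        // A_l * B_l
    long long t3 = 7 * 1;                          // A_r * B_r
    long long t2 = (28 + 7) * (42 + 1) - t1 - t3;  // the third multiplication
    return 100 * t1 + 10 * t2 + t3;                // n_center = 0: 10^2 and 10^1
}
```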

Clearly, we can also iterate on this - in this case we were left with two-digit multiplications, which we could have split up with further divide and conquer steps, applying the algorithm recursively until we hit single digit multiplications as we did before. We’ll talk about the efficiency of this method in greater detail in the next section.

Details, Details, Details…

A few extra details for those curious and wanting to know more about the previous section. Feel free to skip this part if you’re not interested in even more details about the above computations and math.

Pivot Points and Arrangement

For our purposes here, it is easiest to think of the “pivot” point we use as the “center” of a given number, as this makes things a little more intuitive when we reach the part where we implement this algorithm in actual code. That being said - as you may have realized - you can set the “pivot” point for this algorithm to be anything - the two numbers do not have to be the same length, and the pivot does not have to be the midpoint. As we will see later on when we get to the performance analysis, these choices can also be tuned to optimize the performance of the algorithm.

In addition - there are a number of equivalent ways you can formulate the above equations - I chose one that places emphasis on individual digits to help make a smoother leap between long multiplication and Karatsuba, but rest assured all the formulations you may find are equivalent.

Multiplying by 10

A lot of Karatsuba involves trading multiplications for additions, subtractions, and multiplications by powers of 10. These trade-offs pay off because addition and subtraction are generally much faster than multiplication in hardware. In the case of multiplication by powers of 10, we have to remember we are operating in base–10 - so multiplying by a power of 10 is really just a shift operation (“move the digits over by so many places”), which is very efficient.

Bases Other than 10

This can all be generalized to any particular base simply by substituting the powers of 10 with powers of the given base - this allows you to convert this algorithm to work on binary or any other number system. The proof for this is fairly straightforward and basically mimics the explanation given above.
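In base 2, for example, the scaling multiplications become left shifts (shiftExample is a hypothetical helper for illustration):

```cpp
// In base 2, multiplying by a power of the base is a single shift operation,
// playing the same role as "append zeros" does in base 10.
unsigned shiftExample(unsigned x, unsigned k) {
    return x << k;  // equivalent to x * 2^k
}
```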

Implementing Karatsuba Multiplication

Based on the mathematical work we’ve done in the previous section, we’ve finally distilled down the key formula for performing this ostensibly faster multiplication algorithm. I’ve written out a short copy of this formula below where center is the index of the digit you split on, that is, the last digit in the right hand side.

\[ a \times b = 10^{2\left(center + 1\right)}t_{1} + 10^{center + 1}t_{2} + t_{3}\\ \begin{array}{rl} t_{1} = & A_{l}B_{l}\\ t_{2} = & (A_{l} + A_{r})(B_{l} + B_{r}) - t_{1} - t_{3}\\ t_{3} = & A_{r}B_{r}\\ \end{array} \]

As we noted before, we can apply this method recursively for large numbers, so that we break it down repeatedly until we end up with single digit multiplication operations which we can do very quickly - so we should consider that as we design our code to implement this formula. Let’s write out some pseudo code to represent how this should work.

Number karatsubaMultiplication(Number op1, Number op2)
{
  int midPoint;
  Number t1,t2,t3,pow1,pow2,op1_l,op1_r,op2_l,op2_r;

  // Handle our base case of single digit multiplication
  if (op1.numDigits == 1 && op2.numDigits == 1)
    return op1 * op2;

  // Pad numbers to be of equal length using leading zeroes
  if (op1.numDigits < op2.numDigits)
  {
    op1.padWithZerosToSize(op2.numDigits);
  }

  if (op2.numDigits < op1.numDigits)
  {
    op2.padWithZerosToSize(op1.numDigits);
  }

  // Calculate midpoint and left/right sides
  midPoint = (op1.numDigits / 2)-1;

  op1_r = op1.digitsFrom(0,midPoint);
  op1_l = op1.digitsFrom(midPoint+1,op1.numDigits-1);

  op2_r = op2.digitsFrom(0,midPoint);
  op2_l = op2.digitsFrom(midPoint+1,op2.numDigits-1);

  // Recursive calls to generate sub-expressions
  t1 = karatsubaMultiplication(op1_l,op2_l);
  t3 = karatsubaMultiplication(op1_r,op2_r);
  t2 = karatsubaMultiplication((op1_l + op1_r),(op2_l + op2_r)) - t1 - t3;

  // Compute power values - generally you don't compute these
  // since you would use a shift to accomplish this (eg: "move digits
  // over by so many places")
  pow1 = Number.powerOfTen(2*(midPoint+1));
  pow2 = Number.powerOfTen(midPoint+1);

  // In practice, the below multiplications are generally implemented as
  // shift operations - for binary representations, << and >> can be used
  return pow1*t1 + pow2*t2 + t3;
}

The above code will vary a bit depending on exactly how you’re implementing the algorithm and in what context. The above algorithm is intended to be used in a big integer implementation that uses base–10. In general, big integer implementations use base–2 and perform the multiplications by powers of two using bit shifts, which can be done very efficiently - in which case similar but slightly different code is used. Another thing to note is the use of zero padding to make the numbers equal in length - this lets us treat them uniformly, which makes our lives a lot easier.

We can see from the algorithm that it isn’t actually all that complicated in terms of writing the code for it. How it works seems magical if we don’t understand the background, but it’s actually a very straightforward case of divide and conquer being applied to a problem.
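To make the pseudocode concrete, here is a minimal recursive sketch on machine integers in base 10 - an illustration only, not a production implementation, which would operate on arbitrary-length digit arrays as above:

```cpp
// Minimal recursive Karatsuba on machine integers, base 10.
long long karatsuba(long long a, long long b) {
    // Base case: single digit multiplication
    if (a < 10 || b < 10) return a * b;

    // Count digits of the larger operand
    int n = 0;
    for (long long t = (a > b ? a : b); t > 0; t /= 10) ++n;

    // Split point: p = 10^(n / 2)
    long long p = 1;
    for (int i = 0; i < n / 2; ++i) p *= 10;

    long long al = a / p, ar = a % p;   // a = al * p + ar
    long long bl = b / p, br = b % p;   // b = bl * p + br

    long long t1 = karatsuba(al, bl);
    long long t3 = karatsuba(ar, br);
    long long t2 = karatsuba(al + ar, bl + br) - t1 - t3;

    return t1 * p * p + t2 * p + t3;    // the p multiplications are "shifts"
}
```

Calling `karatsuba(287, 421)` reproduces the worked example from earlier.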

Analyzing Karatsuba Multiplication

Getting into the analysis of this algorithm can at first seem a bit daunting, but we can relate it to other divide and conquer algorithms we have seen before such as merge sort. Let’s start by thinking about how the recurrence tree for this algorithm works.

For any given pass of the algorithm, we split the numbers themselves in half - a right half and a left half - that we then use in three subsequent recursive steps. To put it another way, we apply Karatsuba three times, and for each of these calls the problem size, that is, the number of digits in the operands, is halved. Let’s look at an example, and graph out how the problem is split up at each step.

Problem Division

Based on the above we can see, as we had previously noted, at each step we subdivide the problem in half, and then have three recursive steps. Let’s consider how we can understand this in terms of complexity.

The height of the graph above will be \(log_{2}n\) - since we halve the problem space at each step. I should note that the height given does not include the leaf nodes - those do not count, as they are primitive operations, not part of the recursive steps. Also, if \(n\) is not a power of two, then we should take the ceiling of \(log_{2}n\) as the height.

Furthermore, we know that at each step, we subdivide the problem into three parts. The recurrence relation for the number of problems at any given level is \(op(l) = op(l-1)*3\) with \(op(1) = 1\). The closed form for this recurrence is \(op(l) = 3^{l}\) so in other words at any given step, this expression gives us the total number of operations being performed in the recursion step.

Lastly, note that at any given level, the work to “combine” the recursive operations of that level is of complexity \(O\left(n\over 2^{l}\right)\). In other words, at any given level we have linear complexity, over the size of the problem, to perform any operations required - which, as we noted, will be \(n \over 2^{l}\) for any given level \(l\).

Based on these propositions, we can start deriving the complexity of this algorithm. At any given level, the complexity will be the cost of combining at each node multiplied by the number of nodes at that level. In other words, at level \(l\) the complexity will be \(O\left( n \over 2^{l} \right) \times 3^{l}\). We can simplify this to \(O(n) \times \left( 3 \over 2 \right)^{l}\). Based on this, we can write the cost of the entire series of recursive calls as follows.

\[ \sum\limits_{l=0}^{log_{2}n} O(n) \times \left( 3 \over 2 \right)^{l} \]

Given this expression, we can see that the terms increase as \(l\) increases. Based on this, we can say that the last term, when \(l = log_{2}n\), is the largest term in the summation. Given we are looking for an asymptotic bound, we can ignore the lesser terms. The complexity for this algorithm can then be written as follows.

\[ O(n) \times \left( 3 \over 2 \right)^{log_{2}n} \]

From here we just have to simplify to give us a nicer form, which once completed leaves us with \(O(n^{log_{2}3})\) which is approximately equal to \(O(n^{1.59})\) - which we can see is an improvement over long multiplication.
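For completeness, that simplification rests on the identity \(x^{log_{2}n} = n^{log_{2}x}\):

\[ O(n) \times \left( 3 \over 2 \right)^{log_{2}n} = O(n) \times {3^{log_{2}n} \over 2^{log_{2}n}} = O(n) \times {n^{log_{2}3} \over n} = O\left(n^{log_{2}3}\right) \]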

It’s worth noting that for smaller input sizes, depending on hardware and other factors, long multiplication may be faster - but asymptotically Karatsuba will always win over long multiplication.

Other Algorithms

In addition to Karatsuba there are a number of other algorithms that are more efficient than Karatsuba, but have different tradeoffs in terms of complexity versus Karatsuba and long multiplication. I won’t go into as much detail for each of these, but I will talk a little about each just to provide some background. I may cover some of these in future posts as well, as each of them is interesting in their own way.

Toom-Cook Algorithm

We can generalize the Karatsuba algorithm by changing how many pieces we split each number into at each step and generalizing the formulas used to combine the recursive solutions. This gives us a generalized algorithm for multiplication that encompasses both the long multiplication method and Karatsuba. The Toom-Cook algorithm is this generalization - you can actually say long multiplication and Karatsuba are Toom-Cook algorithms - specifically Toom–1 and Toom–2. There is also a Toom–3 algorithm that divides the number into three smaller parts instead of two at each stage. This algorithm is interesting, as it is the first that really gets us into higher level strategies for multiplication instead of lower level “tactics”. I’ll probably cover this in some future post.

Schönhage–Strassen Algorithm

To take things in an entirely different direction, the Schönhage–Strassen algorithm uses fast Fourier transforms in special algebraic objects called rings, which we may cover at a later time, to speed up the multiplication process. As part of this, it also leverages a divide and conquer approach, though it is quite different than that we used in Karatsuba. At the end of the process, it applies an inverse transform, combined with a series of carry operations to “reduce” the solution into the desired form. This algorithm is very interesting as it shows how areas of math we might not suspect can be used to solve common problems.

Fürer’s algorithm

A more recent addition to the set of multiplication algorithms is Fürer’s. It is a complex algorithm that leverages techniques similar to those used in Schönhage–Strassen, but does so in a way that further decreases the runtime complexity. I won’t get into any real details on this one as it is quite complicated, but I might write a post about it at a future date. It is worth noting that the “cross over” point where this algorithm becomes more efficient than Schönhage–Strassen is not well understood as of this writing, and so the algorithm remains mostly of interest to researchers.

The Final Analysis

Based on all the work we’ve done above, we can see that Karatsuba multiplication is definitely faster than long multiplication for the general case, for numbers with a lot of digits. That being said, most real implementations rely on a combination of long multiplication and one or more fast multiplication algorithms, including Karatsuba, to help balance the poor performance of the fast multiplication algorithms for fewer digit numbers. This provides a tiered approach that helps target algorithms to the segment of the data that they are most useful for. As we look at more algorithms in future posts, we’ll find this is actually a very common technique to balance the drawbacks of a given algorithm against its strengths.

Thanks for reading. As always, I’ve done my best to make sure all this information is accurate and correct, but if you notice any typographical or other errors, feel free to contact me and let me know.

References

  • https://www.cs.berkeley.edu/~vazirani/algorithms/chap2.pdf
  • http://web.archive.org/web/20130425232048/http://www.cse.psu.edu/~furer/Papers/mult.pdf
  • The Art of Computer Programming, Volume 2
Jul 28th, 2015

ARC in Depth: Part II

In my last post, I had a look at how the reference counting implementation in the Objective-C runtime works and talked about some lessons we can learn from the implementation. Reference counting is part of ARC, but only half of it; the other half is the “automated” portion. Today, I’m going to cover how automated reference counting ends up automated, from a variety of levels. We’ll start with a brief tour of Clang, focusing on AST generation and semantic analysis, and then step into code generation, where all those lovely retain and release calls actually get generated.

It’s worth noting that, as in the prior post, all of this is based upon the version of Clang you use, and could be changed at any time. This also only applies to Objective-C code - once the Swift compiler is open sourced I plan on having a look at its semantic analyzer and AST generator to see how it handles ARC as well.

Let's get started

Clang: Modular Compilation

Before we can understand how automated reference counting instructions are generated, we first have to understand how the compiler that generates ARC works, and how its design fundamentally enables this behavior.

Compilation with Clang

Clang is a modular compiler front end, built around a hand-written recursive descent parser, that generates LLVM IR (whose serialized form is called bitcode). Clang is modular in that each step is tied to, but relatively isolated from, the other steps. This design allows Clang to be embedded in a variety of products, including IDEs, debuggers, and other code generators.

We’ll use a slightly simplified view of Clang’s phases to help us understand ARC.

  1. Lexing - The Clang lexer handles tokenizing the input - it is important to note that unlike many compilers/parsers, the lexer here does not distinguish between user identifiers, types, etc - this is the job of the next step of the process
  2. Parsing - The parser acts on the token stream generated by the lexer, and begins building an AST from the input. Throughout this, it regularly calls on the semantic analyzer to validate the input and ensure validity of things like types. The AST generation here plays a big part in how ARC ownership and lifetime qualifiers are applied. Most of the qualifiers and ownership information is added at this step.
  3. Semantic Analysis - The semantic analyzer works with the parser to validate types and make transformations and additions to the AST as necessary. The semantic analyzer is another place where ARC ownership and lifetime qualifiers can be added or modified, and is also where a lot of ARC rules are enforced as we will see.
  4. Code Generation - The code generation library takes an AST and uses it to generate an LLVM IR representation of the AST. The LLVM IR representation can then be passed to LLVM to generate platform specific assembly code which can then be assembled into a final binary representation. This step is where Clang actually emits ARC instructions like retain and release based on the ownership qualifiers and lifetime data created during AST generation and semantic analysis.

Semantic Analyzer and AST Generation in Depth

As I mentioned earlier, the semantic analysis and AST generation phases are the components we really want to focus on when looking at ARC ownership and rule verification. We’re going to look at each, and a sample of how each does its part in making ARC work.

AST Generation

The AST, or abstract syntax tree, is the internal tree representation of the code you’ve written, and is generated by the parser and semantic analyzer as they step through the token stream the lexer has produced. The AST is important in that it contains all the information necessary to generate code for your program, and is a simplified representation of the program itself. The AST provides numerous services to Clang in terms of generating nodes and tracking the state of the token stream, but we’re going to focus on one specific part: ownership and lifetime.

The AST itself can generally determine its context and add the necessary implicit ownership and lifetime qualifiers based on the ARC rule set. In this way, the AST and the parser that generates it plays the key role in terms of generating the annotations the code generator requires to make ARC work.

Let’s have a look at an example of the AST handling ownership.

Climbing the AST

void ObjCMethodDecl::createImplicitParams(ASTContext &Context,
                                          const ObjCInterfaceDecl *OID) {
  QualType selfTy;
  if (isInstanceMethod()) {
    // There may be no interface context due to error in declaration
    // of the interface (which has been reported). Recover gracefully.
    if (OID) {
      selfTy = Context.getObjCInterfaceType(OID);
      selfTy = Context.getObjCObjectPointerType(selfTy);
    } else {
      selfTy = Context.getObjCIdType();
    }
  } else // we have a factory method.
    selfTy = Context.getObjCClassType();

  bool selfIsPseudoStrong = false;
  bool selfIsConsumed = false;

  if (Context.getLangOpts().ObjCAutoRefCount) {
    if (isInstanceMethod()) {
      selfIsConsumed = hasAttr<NSConsumesSelfAttr>();

      // 'self' is always __strong.  It's actually pseudo-strong except
      // in init methods (or methods labeled ns_consumes_self), though.
      Qualifiers qs;
      qs.setObjCLifetime(Qualifiers::OCL_Strong);
      selfTy = Context.getQualifiedType(selfTy, qs);

      // In addition, 'self' is const unless this is an init method.
      if (getMethodFamily() != OMF_init && !selfIsConsumed) {
        selfTy = selfTy.withConst();
        selfIsPseudoStrong = true;
      }
    }
    else {
      assert(isClassMethod());
      // 'self' is always const in class methods.
      selfTy = selfTy.withConst();
      selfIsPseudoStrong = true;
    }
  }

This method[1] is pretty large, so let’s break it down a little bit. This method annotates a given Objective-C method declaration with ownership qualifiers. Before anything else, we resolve the type of self used in this method. This is important for later when we have to annotate self.

With this done, we next check if ARC is enabled. If so, and the method is an instance method, we check whether it is attributed to consume self, and set a temporary accordingly. Based on the comments, we can get a pretty good understanding of what happens next. We take the reference to self that was resolved in the first block of the method and mark it strong, since every instance method has a strong self. We also check if this is in the “init” method family (eg: new, init*, or attributed in such a way as to be in this family) - if not, we add an implicit const to self to prevent modifying the value. We avoid this in the case of init to allow programmers to use the self = [super ...] idiom we are so familiar with.

If we are not an instance method, we must be a class method - the code asserts this. In class methods, self is actually the Class object, so it should always be const.

We can see here that what we’re really doing is not really adding much to the code itself, but rather annotating it in the way the Clang ARC specification specifies. The AST’s job is really to do exactly this: provide a code level representation of what the language standard specifies.

The Semantic Analyzer

The semantic analyzer performs several crucial operations for Clang. Most notably, it validates the types used, generates implicit casts, and applies a variety of annotations and modifications to the generated AST based on the context of the code. In other words, if the parser determines whether a given line of code makes sense syntactically, then the semantic analyzer checks whether it actually makes sense in terms of meaning, and makes various corrections and small changes. Compare them to a writer and an editor.

As far as our analysis of ARC is concerned, the semantic analyzer’s primary purpose is enforcing ARC rules on the code, checking for inconsistencies, and applying qualifiers and lifetime annotations in certain situations where the AST generation phase may not have done so. To this end, we’ll be looking at a few cases where the semantic analyzer enforces ARC rules.

Looking Inside Sema

All of the code for the semantic analyzer is in the Sema subdirectory in Clang’s source. For this example, let’s have a look at SemaExprObjC.cpp.

If we look at the start of the file, we can see it describes what it does[2]:

//===----------------------------------------------------------------------===//
//
//  This file implements semantic analysis for Objective-C expressions.
//
//===----------------------------------------------------------------------===//

Let’s look a little deeper and see if we can find some examples of some of the rules this semantic analysis module has.

ExprResult Sema::ParseObjCStringLiteral(SourceLocation *AtLocs,
                                        Expr **strings,
                                        unsigned NumStrings) {
  StringLiteral **Strings = reinterpret_cast<StringLiteral**>(strings);

  // Most ObjC strings are formed out of a single piece.  However, we *can*
  // have strings formed out of multiple @ strings with multiple pptokens in
  // each one, e.g. @"foo" "bar" @"baz" "qux"   which need to be turned into one
  // StringLiteral for ObjCStringLiteral to hold onto.
  StringLiteral *S = Strings[0];

So by reading this[3], we can see a pretty clear explanation of what this function does - it parses an Objective-C string literal and will combine multiple string literals into one if necessary. This is not ARC related, but will definitely help us understand how the semantic analyzer works. Let’s look a little lower and see how it performs this operation[4].

  // If we have a multi-part string, merge it all together.
  if (NumStrings != 1) {
    // Concatenate objc strings.
    SmallString<128> StrBuf;
    SmallVector<SourceLocation, 8> StrLocs;

    for (unsigned i = 0; i != NumStrings; ++i) {
      S = Strings[i];
      if (!S->isAscii()) {
        Diag(S->getLocStart(), diag::err_cfstring_literal_not_string_constant)
          << S->getSourceRange();
        return true;
      }

      // Append the string.
      StrBuf += S->getString();

      // Get the locations of the string tokens.
      StrLocs.append(S->tokloc_begin(), S->tokloc_end());
    }

So we can see the semantic analyzer does exactly what you’d think for this case - it loops through all the strings present and concatenates them into a buffer. After this step, as you can imagine, it combines them into a single Obj-C string literal and returns this object.
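The merging logic above can be mimicked in a small, self-contained sketch. This mirrors only the shape of the Clang code, not its actual types (no StringLiteral, SmallString, or diagnostics engine here) - the non-ASCII check stands in for the err_cfstring_literal_not_string_constant diagnostic:

```cpp
#include <string>
#include <vector>

// Given the pieces of a multi-part literal like @"foo" "bar" @"baz",
// reject non-ASCII pieces (as the diagnostic above does) and otherwise
// concatenate them into one buffer. Returns false on the error path.
bool mergeStringPieces(const std::vector<std::string> &pieces,
                       std::string &out) {
  std::string buf;
  for (const std::string &piece : pieces) {
    for (unsigned char c : piece)
      if (c > 0x7F)
        return false; // would emit err_cfstring_literal_not_string_constant

    // Append the string, as the Clang loop does with StrBuf.
    buf += piece;
  }
  out = buf;
  return true;
}
```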

It’s worth noting how well documented this code was. Clang is generally very good about this, and if you decide to go code spelunking you’ll usually be helped by a multitude of well written and helpful comments.

Now that we have a feel for how Clang’s semantic analyzers work, let’s see if we can find something related to ARC. Let’s look for ‘ARC’ in this file and see what we get[5].

  // In ARC, forbid the user from using @selector for
  // retain/release/autorelease/dealloc/retainCount.
  if (getLangOpts().ObjCAutoRefCount) {
    switch (Sel.getMethodFamily()) {
    case OMF_retain:
    case OMF_release:
    case OMF_autorelease:
    case OMF_retainCount:
    case OMF_dealloc:
      Diag(AtLoc, diag::err_arc_illegal_selector) <<
        Sel << SourceRange(LParenLoc, RParenLoc);
      break;

This code snippet is from Sema::ParseObjCSelectorExpression. As you might guess, this snippet is responsible for the error you get in Xcode if you try to use @selector(retain) or associated calls. This enforces section 7.1.1 in the Clang ARC standard. As we noted earlier, as far as ARC is concerned most of the semantic analyzer’s work is of this type: making sure the source AST obeys ARC rules.
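The check can be sketched as follows. Again this is illustrative: the enum values echo Clang's OMF_* names, but the string-based family lookup is a stand-in for Sel.getMethodFamily(), and the boolean result stands in for emitting err_arc_illegal_selector:

```cpp
#include <string>

// Method families relevant to the ARC @selector check above.
enum MethodFamily { OMF_None, OMF_Retain, OMF_Release,
                    OMF_Autorelease, OMF_RetainCount, OMF_Dealloc };

// Simplified stand-in for Sel.getMethodFamily().
static MethodFamily familyOf(const std::string &sel) {
  if (sel == "retain")      return OMF_Retain;
  if (sel == "release")     return OMF_Release;
  if (sel == "autorelease") return OMF_Autorelease;
  if (sel == "retainCount") return OMF_RetainCount;
  if (sel == "dealloc")     return OMF_Dealloc;
  return OMF_None;
}

// True if @selector(sel) should be rejected under ARC, mirroring
// the switch in Sema::ParseObjCSelectorExpression.
bool isIllegalARCSelector(const std::string &sel) {
  switch (familyOf(sel)) {
  case OMF_Retain:
  case OMF_Release:
  case OMF_Autorelease:
  case OMF_RetainCount:
  case OMF_Dealloc:
    return true; // would Diag(... err_arc_illegal_selector ...)
  default:
    return false;
  }
}
```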

Code Generation

Once the AST has been created by the parser and has been updated, annotated, and verified by the semantic analyzer, we end up with what we want to send to the code generator. The source code for the code generator is, unsurprisingly, stored in the CodeGen subdirectory of the Clang source code. The code generator takes the AST and evaluates it to generate LLVM IR.

As part of this evaluation process, the code generator takes into account the annotations, lifetime information, and ownership qualifiers previously inserted into the AST in order to generate our well-known and loved retain and release calls.

Let’s have a look at an example of how the code generator performs this work. I’m going to snip out the “primitive” case of a non-ARC type to save us a little space here[6].

void CodeGenFunction::EmitStoreThroughLValue(RValue Src, LValue Dst,
                                             bool isInit) {
  if (!Dst.isSimple()) {
    ...
  }

  // There's special magic for assigning into an ARC-qualified l-value.
  if (Qualifiers::ObjCLifetime Lifetime = Dst.getQuals().getObjCLifetime()) {
    switch (Lifetime) {
    case Qualifiers::OCL_None:
      llvm_unreachable("present but none");

    case Qualifiers::OCL_ExplicitNone:
      // nothing special
      break;

    case Qualifiers::OCL_Strong:
      EmitARCStoreStrong(Dst, Src.getScalarVal(), /*ignore*/ true);
      return;

    case Qualifiers::OCL_Weak:
      EmitARCStoreWeak(Dst.getAddress(), Src.getScalarVal(), /*ignore*/ true);
      return;

    case Qualifiers::OCL_Autoreleasing:
      Src = RValue::get(EmitObjCExtendObjectLifetime(Dst.getType(),
                                                     Src.getScalarVal()));
      // fall into the normal path
      break;
    }
  }

So we can see what this piece of code does pretty clearly. If the destination object has an Objective-C lifetime attribute, we have to decide what to do with it. We look at the lifetime of the object, and since we’re doing an assignment, we emit the code necessary to handle each case. If our lifetime is strong, we call EmitARCStoreStrong which, as you might have guessed, emits the IR representation of objc_storeStrong. Similarly, for a weak pointer, we call EmitARCStoreWeak.

Using these simple rules, the code generator is able to understand what it needs to emit for each case of assignment. Similar, though sometimes a lot more complex, rules exist for a variety of cases throughout the code generator, always checking whether a given object has an Objective-C lifetime or ownership qualifier and then acting on that.
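The lifetime-based dispatch above can be modeled compactly. This is a sketch of the switch in EmitStoreThroughLValue with strings standing in for the emitted IR calls - the enum names echo Clang's Qualifiers::OCL_* values, but nothing here is Clang's real API:

```cpp
#include <string>

// Objective-C lifetime qualifiers, echoing Qualifiers::ObjCLifetime.
enum ObjCLifetime { OCL_None, OCL_ExplicitNone, OCL_Strong,
                    OCL_Weak, OCL_Autoreleasing };

// Which runtime store routine the code generator would emit for an
// assignment into an lvalue with the given lifetime qualifier.
std::string storeCallFor(ObjCLifetime lifetime) {
  switch (lifetime) {
  case OCL_Strong:        return "objc_storeStrong";
  case OCL_Weak:          return "objc_storeWeak";
  case OCL_Autoreleasing: return "extend lifetime, then plain store";
  case OCL_ExplicitNone:  return "plain store"; // __unsafe_unretained
  case OCL_None:          return "plain store"; // not ARC-qualified
  }
  return "plain store";
}
```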

The elimination of duplicate retain and release calls, then, is not really an ‘elimination’ per se in the way we would think about it if we were doing it manually. It is really the compiler having complete knowledge of its own rules, and being able to act on those rules to generate the minimal set of retains and releases required. In many cases, as we saw above, no real retain or release calls are generated at all - the compiler can simply emit code using objc_storeStrong and friends.

One other thing worth noting here is that at no point do we care whether this assignment is happening in C++ or Objective-C or anywhere else - we simply check if the AST has the necessary annotations. The code generator, at this step, doesn’t really care - it is acting solely on what is present in the AST, which includes no real notion of “source format” - only the tree that specifies the program’s state.

Conclusion

We have looked into the depths of Clang to understand how ARC is implemented at each level in the compiler stack: the parser, the semantic analyzer, and the code generator. We have looked at cases where ARC rules are enforced, where ARC annotations are generated, and where ARC calls are emitted into LLVM IR.

With this understanding it’s a lot easier to appreciate what ARC does for us when it manages our retain count, in terms of both optimizing calls to retain and release as well as helping us achieve the minimum set of retain and release calls required. In addition to this appreciation we can also learn some things from the design and implementation of ARC.

Lessons

  • A lot of what ARC does is done during AST generation and semantic analysis. If we aren’t careful in how we manage our objects, and aren’t aware of how implicit lifetimes and ownership qualifiers impact ARC’s behavior, we are always going to be at the mercy of rules we may not completely understand. It really is worth taking the time to understand the ARC specification and how it works.
  • The ARC system is pervasive throughout Clang’s architecture, and though Clang is modular, it’s difficult to really untangle ARC from the system it lives in. As a result, it’s sometimes hard to predict exactly what determinations ARC will make for complex code, and often the only way to tell is to look at the LLVM IR or at the final assembly generated by LLVM. We should not be afraid to do this if it helps us understand our code.
  • The architecture for ARC, though spread throughout Clang, is actually based on very simple principles: lifetimes and ownership qualifiers. Using these primitives in a consistent manner has allowed Clang to have this almost magical ability that we all use every day, and we should take that to heart when we consider the designs of our own programs.

Thanks for Reading

Thanks for taking this journey with me - ARC is a fascinating look at compiler architecture and implementation, as well as an interesting solution to a complicated problem. As always, if you have any corrections please let me know - while I do my best to check the accuracy of my articles, there’s always opportunity for improvement, so feel free to let me know via email or Twitter.

References

  • http://clang.llvm.org/doxygen/SemaExprObjC_8cpp_source.html
  • https://github.com/llvm-mirror/clang/blob/45aa4ad6e537866b40fe9574fa992bf3d6593e39/lib/CodeGen/CGExpr.cpp
  • https://github.com/llvm-mirror/clang/blob/45aa4ad6e537866b40fe9574fa992bf3d6593e39/lib/CodeGen/CGDecl.cpp
  • http://clang.llvm.org/docs/AutomaticReferenceCounting.html
  • http://clang.llvm.org/doxygen/classclang_1_1Sema.html
  • http://clang-analyzer.llvm.org/checker_dev_manual.html
  • http://clang.llvm.org/docs/InternalsManual.html

  1. DeclObjC.cpp (Under University of Illinois Open Source License)  ↩

  2. SemaExprObjC.cpp (Under University of Illinois Open Source License)  ↩

  3. SemaExprObjC.cpp (Under University of Illinois Open Source License)  ↩

  4. SemaExprObjC.cpp (Under University of Illinois Open Source License)  ↩

  5. SemaExprObjC.cpp (Under University of Illinois Open Source License)  ↩

  6. CGExpr.cpp (Under University of Illinois Open Source License)  ↩

Jun 12th, 2015