
ARC In Depth: Part I

As I’ve mentioned before, when I use a system, I like to know why it works the way it does. About half the time, this leads me down the road of reading specifications and design documents, the other half it takes me into the depths of the code that makes everything I write run. Knowing how things work gives me a guide when things go wrong, as well as giving me context when I’m writing code. I try not to make assumptions in my code about how the runtime is implemented, but understanding how the runtime operates makes debugging and optimization a great deal easier. It’s also a lot of fun at very particular types of parties.

With all that in mind, today we’re going to take a look at ARC and figure out exactly how it works. We’re going to be diving into some of the runtime source code and looking at why certain operations do or don’t perform a certain way. We will also be making a quick detour through the runtime to help us understand how ARC uses it to accomplish its goals.

This post, due to length, will be split into two parts - I’ll be publishing the second part soon.

Let’s begin!

Why ARC Internals?

So maybe you buy into the idea that looking into why things work the way they do is worthwhile, but you don’t quite see why ARC in particular is interesting. I can understand that point of view. ARC, at its core, is really just a reference counting system tied to a static analyzer - and both of those components, while individually interesting, don’t seem all that complicated, even in how they work together.

I think skipping these details glosses over the beauty of the Objective-C runtime, and misses a great opportunity to learn how to debug and optimize your own code. The implementation of ARC and reference counting contains a number of surprising optimizations and implementation details you may never have considered, and lessons worth learning from. Lastly, understanding the platform and runtime our code operates in gives us more context as we approach software development as a whole, and leaves us better equipped when we run into impasses.

Reference Counts

An obvious place to start our exploration is to look at how reference counts themselves are implemented. To that end, let’s consider how some simple reference counting calls work. Objects have a retain and a release call available, and when called, they increment or decrement the retain count, which itself is accessible through the retainCount method available on anything conforming to the NSObject protocol. Let’s start by focusing on where the retain count itself is stored.
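
To make that counting behavior concrete, here is a tiny sketch compiled under manual reference counting (retain, release, and retainCount cannot be called directly under ARC, and retainCount should never be relied upon in real code):

#import <Foundation/Foundation.h>

// MRC-only sketch: watch the retain count move as we retain and release.
int main(void)
{
    @autoreleasepool {
        NSObject *obj = [[NSObject alloc] init];          // count is now 1
        [obj retain];                                     // count is now 2
        NSLog(@"count: %lu", (unsigned long)[obj retainCount]);
        [obj release];                                    // back to 1
        [obj release];                                    // 0 - deallocated
    }
    return 0;
}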

The obvious place to store the retain count would be as a property of the object itself - perhaps a private member on NSObject and other root objects? That’s one way you could do it, but it isn’t exactly what Objective-C does. There are a few issues with using a property, but the most notable one is that we don’t want the overhead of invoking objc_msgSend - which is a very expensive operation compared to a simple C function call.

Instead, we want to be able to get performance close to the speed of the aforementioned function call and have it modify the retain count of the object in question as quickly as possible. To facilitate this, Objective-C stores the actual retain count of an object in one of two places. We’ll get into where those are in a minute as we unwind the code.

Let’s see if we can figure out how retain and release work. I would like to note before we really get into this that some of the files we’re talking about have several definitions of the same function or method; I’ll be using the variants defined for tagged pointer support, since they handle the non-tagged case as well.

Understanding Retain and Release

When ARC generates a call to retain or release, it actually emits a call to the C function objc_retain or objc_release - or a variant thereof. By doing this, we stay away from the time sink that is objc_msgSend in most cases, and rely on good old-fashioned function calls that in many cases can be inlined.
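
To make that concrete, here is a rough sketch of the calls ARC emits for a plain strong assignment. It’s an illustration compiled without ARC (ARC code is not allowed to call these entry points by hand), and the names gStrong and StoreStrongByHand are mine, not the compiler’s:

#include <objc/objc.h>

// Real runtime entry points (see the objc4 sources), redeclared here only
// to illustrate the shape of the code ARC emits.
extern id objc_retain(id obj);
extern void objc_release(id obj);

static id gStrong;  // stand-in for a __strong ivar or global

// Roughly what "gStrong = newValue;" lowers to under ARC; in practice the
// compiler folds this pattern into a single objc_storeStrong call.
void StoreStrongByHand(id newValue)
{
    id old = gStrong;
    gStrong = objc_retain(newValue);  // retain the incoming value first...
    objc_release(old);                // ...then release the value it replaces
}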

Let’s have a look at these two functions, and see if we can figure out how they work. Let’s look at how retain works1:

__attribute__((aligned(16)))
id
objc_retain(id obj)
{
    if (!obj) return obj;
    if (obj->isTaggedPointer()) return obj;
    return obj->retain();
}

So this is pretty simple, but let’s break it down a little.

  1. We verify that the object pointer is not nil
  2. We check if the pointer itself is a tagged pointer and if so we don’t actually retain it - we’ll get to why that is done in a minute
  3. Assuming our other checks passed, we go ahead and invoke the retain method on the Objective-C object (it turns out that, in the current runtime, Objective-C objects are C++ objects - for the observant, this was already apparent from the C++ that shows up in the stack traces of many crash logs)

The release call looks just about the same, so I won’t include it here. So based on this, we know that we’re avoiding invoking retain or release on any tagged pointers, whatever those are, and that we invoke a C++ retain or release method. Let’s see where these two methods lead.

You Got Your C++ in My Objective-C!

// Equivalent to calling [this retain], with shortcuts if there is no override
inline id
objc_object::retain()
{
    // UseGC is allowed here, but requires hasCustomRR.
    assert(!UseGC  ||  ISA()->hasCustomRR());
    assert(!isTaggedPointer());

    if (! ISA()->hasCustomRR()) {
        return rootRetain();
    }

    return ((id(*)(objc_object *, SEL))objc_msgSend)(this, SEL_retain);
}

Here we have the definition of objc_object::retain2. Now we’re cooking - this looks promising. So what does this code do? Good question - the comments will help us out a little as we try to figure it out, so let’s go through it section by section like we did earlier.

  1. We assert that we’re not using the defunct garbage collector, or if we are that this class has a “custom RR” whatever that is
  2. We assert that this is not a tagged pointer.
  3. We check whether there is a “custom RR”; if there isn’t, we call rootRetain, otherwise we invoke objc_msgSend to send retain to ourselves.

The obvious question is: what is a ‘custom RR’? Custom RR denotes a custom retain/release implementation. If your class overrides retain or release, the runtime calls objc_class::setHasCustomRR, which tags that Class and all of its subclasses as having a custom implementation, thereby opting them out of the fast retain/release path. This is one of several reasons why overriding retain and release is a bad idea in just about every case, and why ARC steers you away from doing so with compiler diagnostics.
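
For illustration, this is the sort of override that flips that flag - a hypothetical class, buildable only under manual reference counting, since ARC refuses to compile retain or release implementations at all:

#import <Foundation/Foundation.h>

// MRC-only sketch: overriding retain/release marks this class (and its
// subclasses) as having custom RR, so every retain/release of its
// instances goes through objc_msgSend instead of the fast C path.
@interface RRLoggingObject : NSObject
@end

@implementation RRLoggingObject

- (instancetype)retain
{
    NSLog(@"-retain on %@", self);
    return [super retain];
}

- (oneway void)release
{
    NSLog(@"-release on %@", self);
    [super release];
}

@end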

Okay - so that makes sense - we send the actual retain and release messages when they have been overridden. Why does the objc_msgSend call look so weird, though? If you’ve read the 64-bit migration guide, it turns out that objc_msgSend has to be cast in this way or ‘Bad Things’ happen: the compiler doesn’t necessarily know the types of the parameters, and in some cases you end up performing a narrowing implicit cast followed by a widening implicit cast - see my references for more information on this.
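
For comparison, here is the casting pattern the migration guide asks for, as a hedged non-ARC sketch (the function name is mine):

#import <objc/runtime.h>
#import <objc/message.h>

// Cast objc_msgSend to the exact function type before calling it, so
// arguments and the return value are passed with the correct widths.
static id RetainByMessage(id obj)
{
    return ((id (*)(id, SEL))objc_msgSend)(obj, sel_registerName("retain"));
}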

The release call looks identical to the retain call except that it invokes rootRelease for the fast path and sends the SEL_release selector to the object for the custom retain/release path.

The Root of The Matter

Let’s have a look at what rootRetain does and try to understand how it works. This one is a bit of a doozy, so get ready…3

ALWAYS_INLINE id
objc_object::rootRetain(bool tryRetain, bool handleOverflow)
{
    assert(!UseGC);
    if (isTaggedPointer()) return (id)this;

    bool sideTableLocked = false;
    bool transcribeToSideTable = false;

    isa_t oldisa;
    isa_t newisa;

    do {
        transcribeToSideTable = false;
        oldisa = LoadExclusive(&isa.bits);
        newisa = oldisa;
        if (!newisa.indexed) goto unindexed;
        // don't check newisa.fast_rr; we already called any RR overrides
        if (tryRetain && newisa.deallocating) goto tryfail;
        uintptr_t carry;
        newisa.bits = addc(newisa.bits, RC_ONE, 0, &carry);  // extra_rc++

        if (carry) {
            // newisa.extra_rc++ overflowed
            if (!handleOverflow) return rootRetain_overflow(tryRetain);
            // Leave half of the retain counts inline and
            // prepare to copy the other half to the side table.
            if (!tryRetain && !sideTableLocked) sidetable_lock();
            sideTableLocked = true;
            transcribeToSideTable = true;
            newisa.extra_rc = RC_HALF;
            newisa.has_sidetable_rc = true;
        }
    } while (!StoreExclusive(&isa.bits, oldisa.bits, newisa.bits));

    if (transcribeToSideTable) {
        // Copy the other half of the retain counts to the side table.
        sidetable_addExtraRC_nolock(RC_HALF);
    }

    if (!tryRetain && sideTableLocked) sidetable_unlock();
    return (id)this;

 tryfail:
    if (!tryRetain && sideTableLocked) sidetable_unlock();
    return nil;

 unindexed:
    if (!tryRetain && sideTableLocked) sidetable_unlock();
    if (tryRetain) return sidetable_tryRetain() ? (id)this : nil;
    else return sidetable_retain();
}

Phew, that’s more code than I generally include in my blog posts, but it’s worth every line! Let’s dig in and start figuring out what this thing does.

Starting at the top, we do what has become our standard preamble - assert that GC is off and return the object itself if it is a tagged pointer. Next we load the isa into some local variables and check whether newisa is indexed. If it isn’t, we take the unindexed path: we call something called sidetable_retain(), unless we were asked to do a tryRetain, in which case we attempt a sidetable_tryRetain() and return nil if it fails.

If we are indexed, we check whether this is a tryRetain on an object that is already deallocating, and if so we jump to the try-fail logic. If not, we add RC_ONE to the newisa bits and store the carry. If there is a carry and we are handling overflow, we prepare to move half of the extra_rc to the side table, leaving the other half inline. Afterwards, regardless of which path we took, we store the newisa value back into our isa.

Phew, that’s pretty complicated, but we now know the answer to our question, even if we don’t know what it means yet: the retain count is stored in something called a side table if the pointer is unindexed, in the isa if it is indexed, or in both if it is indexed and the count overflows the bits available in the isa. Now we just need to know what all of those words like sidetable and indexed mean.

What’s in An ISA (or how I learned to stop worrying and love bitfields)

Objects have to know what class they belong to for a number of reasons, but the most obvious is to invoke methods. In Objective-C, method invocation means objc_msgSend mapping a SEL to an IMP to call - but where does that lookup start? ISA is the answer - it is the first pointer in every objc_object, and it points to the Class object that the object is an instance of.

What does this have to do with retain counts? On 32-bit platforms, nothing - it all comes down to those 32 extra bits you get on x86_64 or ARM64. 64 bits is more than we need to address all the memory most MMUs can support - in fact, on x86_64, only 48 bits are used for addressing. That means that even though the pointer width of the architecture is 64 bits, we only need 48 of them, leaving 16 bits - two whole bytes - to spare. And as it turns out, ISA is a pointer - which means ISA has two extra bytes.

Wonder what would be a good way to use those two bytes? Fill them with zeroes? Fill them with ones? Use them to store values that are commonly accessed and need low overhead? The last one sounds pretty awesome.

A tagged (or indexed) pointer is a pointer with the pattern I just described - only some of the bits in it are actually used to store the memory location it points to, and the remaining bits are used to store metadata associated with the object or memory pointed to. In Objective-C, tagged pointers are currently supported on x86_64 and ARM64, and are created by the default alloc implementation. And herein lies another of the gotchas: if you override allocWithZone in your class, you will by default lose tagged pointers and all the wonderful optimizations they provide - we’ll talk about how to avoid this a bit later.

At this point I should also mention that you should never make assumptions about the values the tagged pointers Objective-C uses store, or about their internal structure. The internal details of tagged pointers, and even their presence, can change between runtime releases in unpredictable ways. There are generally accessors to get everything you would want out of a tagged pointer, so there’s no real reason to access the data directly.

In the case of objc_object we don’t have the individual object pointers themselves act as tagged pointers - if we did, people writing code would have to use special accessors to retrieve the pointer to memory out of the tagged pointer. Instead, we use something that, as we mentioned before, every object has to have: the ISA pointer.

Among other things, this pointer stores the retain count of the object. Since it is stored here, we can access and modify it very quickly - it’s only manipulating a bitfield on an object from a C function, and for further optimization that C function can often be inlined. This turns out to be much faster than calling objc_msgSend. But what happens when the limited space in the tag fills up - where does the count go?
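
For concreteness, here is the shape of the indexed isa on x86_64, paraphrased from the runtime’s objc-private.h at the time of writing - treat the exact field widths as illustrative, since they differ between architectures and change between runtime releases:

#include <stdint.h>

// Paraphrased sketch of the x86_64 indexed isa layout (field widths vary
// by architecture and runtime version - do not rely on them).
union isa_sketch_t {
    uintptr_t bits;
    struct {
        uintptr_t indexed           : 1;   // is this a non-pointer (indexed) isa?
        uintptr_t has_assoc         : 1;   // has associated objects
        uintptr_t has_cxx_dtor      : 1;   // needs C++ destruction
        uintptr_t shiftcls          : 44;  // the Class pointer, shifted
        uintptr_t magic             : 6;   // sanity-check pattern
        uintptr_t weakly_referenced : 1;   // has weak references
        uintptr_t deallocating      : 1;   // currently deallocating
        uintptr_t has_sidetable_rc  : 1;   // retain count continues in the side table
        uintptr_t extra_rc          : 8;   // inline retain count (minus one)
    };
};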

A SideTable to Put Your RetainCount On

When the space for retain counts in a tagged pointer runs out, the count must be stored somewhere else. If we look at the code, we can see it goes into the place we talked about briefly before - the side table - but the code does not make it entirely clear what the side table is or how it works. To answer that question, we have to look a little further down, where we handle the cases in which pointers are not tagged and we call sidetable_retain. Let’s have a look at sidetable_retain and see if we can get more detail on how side tables work4.

id
objc_object::sidetable_retain()
{
#if SUPPORT_NONPOINTER_ISA
    assert(!isa.indexed);
#endif
    SideTable *table = SideTable::tableForPointer(this);

    if (spinlock_trylock(&table->slock)) {
        size_t& refcntStorage = table->refcnts[this];
        if (! (refcntStorage & SIDE_TABLE_RC_PINNED)) {
            refcntStorage += SIDE_TABLE_RC_ONE;
        }
        spinlock_unlock(&table->slock);
        return (id)this;
    }
    return sidetable_retain_slow(table);
}

When we call sidetable_retain, it first asserts that our isa is not indexed (when non-pointer isa support is compiled in). We then look up the side table for this pointer and try to lock it. If we succeed, we fetch the reference count entry for this particular pointer and increment it. If we couldn’t acquire the lock, we fall through to something called sidetable_retain_slow.

This makes sense. If we look into what a SideTable is defined as, we’ll find that it’s an optimized hash table, with a hash function that is, well, have a look for yourself5:

// Pointer hash function.
// This is not a terrific hash, but it is fast
// and not outrageously flawed for our purposes.

// Based on principles from http://locklessinc.com/articles/fast_hash/
// and evaluation ideas from http://floodyberry.com/noncryptohashzoo/

One more thing to note here: side tables appear to be segmented, with more than one of them in play - and that’s true. To optimize performance, we first get the side table associated with a given range of addresses, and then we look up that particular address in what amounts to a fancy optimized hash table.
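
A simplified sketch of that address-based striping (the real hash, stripe count, and table type live in NSObject.mm; the names here are mine):

#include <stdint.h>
#include <stddef.h>

// Illustration only: hash the object's address to pick one of a small,
// fixed set of side tables; the address itself is then the key inside
// that table's refcnts map.
static inline size_t SideTableIndexForPointer(const void *obj, size_t stripeCount)
{
    uintptr_t addr = (uintptr_t)obj;
    return (size_t)((addr >> 4) % stripeCount);  // illustrative hash only
}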

We don’t really need to get into sidetable_retain_slow, as it is very similar to sidetable_retain except that, instead of trying the lock optimistically, it spins on the spinlock until it becomes available. The split is a performance optimization for the optimistic, and common, case where nothing else is holding the lock. As an aside, this also explains why you don’t have to synchronize retain and release yourself - the runtime does it for you, either by performing the operation on the tagged pointer or by spin-locking on the side table entry.

Putting it all together…

So it seems we have an answer to our original question of where the retain count is stored, as well as how retain and release work. In short, the retain count is stored in a side table - an optimized set of hash tables segmented by memory address - in a tagged pointer, or in both.

This only answers our first questions - how retain and release work and how they are optimized - and we still have a little more to look at before we fully understand ARC. It does, however, tell us a lot about why ARC encourages you to stay away from certain behaviors that break these optimizations, and it provides a good base for understanding the rest of ARC’s internals. Before we close out this post, let’s review the high points of “things not to do” that we learned from diving into the source code.

Custom Retain/Release Implementations

ARC pretty heavily discourages implementing retain and release on your classes, but there are still ways to get an implementation in. If you have any custom retain or release operations, though, your code is going to go through the slow path of using objc_msgSend, which, depending on usage patterns, could result in a pretty big performance hit. Apple spends a lot of time optimizing these memory-management paths, and they often change between runtime releases, so it doesn’t make much sense to give up those optimizations.

The documentation explicitly says that unless you are implementing your own memory management scheme separately from ARC, you should not override these methods - and that’s exactly what you should abide by here. If you’re counting on retain and release to signal or log something in your code, that’s a really bad practice. Even if you’re willing to accept the performance penalty, counting on retain and release being called is still semantically incorrect: as mentioned earlier in this post, ARC makes all kinds of optimizations to avoid calling retain and release, so relying on them being called at all is just flat-out incorrect.

The only case where it might be valid to override retain and release is the one the documentation mentions: when you are actually using an alternative, custom memory management system. Perhaps you have to cooperate explicitly with another reference counting system, and your retain and release overrides don’t call super but instead ask your own implementation to increment or decrement the reference count. That might be fine - just keep in mind that it will make these operations slower.
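
A rough sketch of what that sanctioned case might look like - the class and the external_* calls are hypothetical, and the file would have to be compiled without ARC:

#import <Foundation/Foundation.h>

// Hypothetical hooks belonging to some external reference-counting system
// we need to cooperate with.
extern void external_handle_ref(id handle);
extern void external_handle_unref(id handle);

// MRC-only sketch: retain/release delegate to the external system instead
// of calling super, which is exactly the scenario the documentation carves out.
@interface BridgedHandle : NSObject
@end

@implementation BridgedHandle

- (instancetype)retain
{
    external_handle_ref(self);
    return self;
}

- (oneway void)release
{
    external_handle_unref(self);
}

@end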

Custom Alloc Implementations

One of the responsibilities of allocWithZone: is to set the isa pointer. As a result, if you override alloc or allocWithZone: you may end up without a tagged (indexed) isa, and end up having to use the side table to store reference count information.

To avoid this, as my references note, you should use the object_setClass function to set the isa pointer rather than setting it directly. This is of particular concern for any codebase migrating to 64-bit support (as Apple recently mandated for all iOS applications), since it is an easy change that will get most codebases a decent boost in performance. If you can avoid overriding alloc altogether, all the better.
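
A hedged sketch of what that might look like - the class name is mine, and the file has to be compiled without ARC because the raw instance-creation APIs are unavailable under it:

#import <Foundation/Foundation.h>
#import <objc/runtime.h>

// Non-ARC sketch: if you must override allocation, let the runtime install
// the isa instead of writing the field yourself.
@interface ZoneAwareObject : NSObject
@end

@implementation ZoneAwareObject

+ (instancetype)allocWithZone:(struct _NSZone *)zone
{
    // class_createInstance() allocates the instance and sets up its isa,
    // preserving the runtime's optimized (indexed) form where available.
    // If you allocate the memory yourself, call object_setClass(obj, self)
    // rather than assigning the isa field directly.
    return class_createInstance(self, 0);
}

@end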

Conclusion

In this post, we have looked at how retain and release work in a great deal of depth, and we now understand how the choices we make in our own code can influence the behavior and speed of ARC. We’ve also looked at some of the optimizations made in these parts of the reference counting implementation, and discussed how ARC takes advantage of them.

In the next post, part II of our in depth ARC investigation, we will start digging into even more ARC internals, focusing this time on weak pointers, how ARC gets away with optimizing and eliminating some calls to objc_retain and objc_release through use of objc_storeStrong and friends, and talk about the types of optimizations ARC is allowed to make at various points in the code.

References and Reading Material

Source Attributions