
Let's Build Karatsuba Multiplication

Recently I was working on a toy big integer implementation that is incredibly inefficient but fun to play with, and I ended up implementing the Karatsuba algorithm. Today I’m going to share some details around how this algorithm works, what’s involved with it, and why it is more efficient.

Basic Multiplication

Most of us know how to multiply numbers using one of the grade-school algorithms commonly taught. I’m going to be looking at the algorithm I was taught in elementary school, which I understand is quite common, at least in the US - but you can extend the arguments below to most multiplication algorithms taught in schools. This particular algorithm is called long multiplication.

In the algorithm we’re going to look at, we multiply each digit in each number by each digit in the other number, adjusting for the place of the digits. In other words, for an \(n\)-digit number, we can say it’s made up of the digits \(a_{n-1}a_{n-2}a_{n-3}...a_{0}\), and that to multiply it by some number \(b\) of similar form with \(m\) digits, the grade-school multiplication algorithm gives us the following expression:

\[\sum\limits_{i=0}^{m-1}\sum\limits_{j=0}^{n-1}b_{i}\times a_{j}\times 10^{j+i}\]

Let’s break this down a little bit so we can see that this expression really represents how we do the old “grade school” method of multiplication. For each digit in the multiplier we multiply it by each digit of the multiplicand, scaling it by the position of the digit in the multiplier and the multiplicand. In other words, you take each digit in the “bottom number” and multiply it by the digits in the “top number” to get a result, save that, then multiply the next digit by the entire “top number.” When finished we add all these results together to get a final value.

Are we crazy?

Let’s look at an example of this on two numbers so we can convince ourselves intuitively that this expression is equivalent to our basic multiplication algorithm.

\[\begin{array}{cccccccc} & & & & 2 & 8 & 7 & \\ \times & & & & 4 & 2 & 1 & \\ \hline & & & & 2 & 8 & 7 & (1\times 287\times 10^{0})\\ & & & 5 & 7 & 4 & 0 & (2\times 287\times 10^{1})\\ + & 1 & 1 & 4 & 8 & 0 & 0 & (4\times 287\times 10^{2})\\ \hline & 1 & 2 & 0 & 8 & 2 & 7\\ \end{array}\]

So you can see here that each intermediate result row is the result of multiplying a digit of the bottom number - in this case 421 - with the entire top number - in this case 287. That means the expression for this multiplication is as follows:

\[\sum\limits_{i=0}^{m-1} 10^{i}\times b_{i}\times a\]

We can of course change the representation of \(a\) as follows, based on the fact that a number is the sum of its digits scaled by their place - in this case, each place multiplies a digit’s value by another factor of 10, since we are in base-10.

\[\sum\limits_{i=0}^{m-1} \left ( 10^{i}\times b_{i} \right ) \times \sum\limits_{j=0}^{n-1} \left ( a_{j}\times10^{j} \right )\]

This is very similar to the formula we came up with earlier - except that the \(10^{i}\times b_{i}\) sits outside the inner sum. This is fine, but as we know, we can move it inside the sum by the distributive property. Why would we want to do this, considering the expressions are the same? Here’s why: by arranging it this way, we make it clear that we are only ever multiplying single-digit numbers together, with the exception of multiplying by powers of ten, which are basically just “shift” operations. This means that if we memorize or store the multiplication tables up to 9, we can compute these internal values simply by looking up the result.
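Carrying out that distribution explicitly brings us right back to the double sum we started with - now visibly a sum of single-digit products, each scaled by a power of ten:

\[\sum\limits_{i=0}^{m-1}\sum\limits_{j=0}^{n-1}\left(b_{i}\times a_{j}\right)\times 10^{i+j}\]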

Analyzing Long Multiplication

Now that we’re pretty sure that our expression previously is, in fact, an accurate representation of this multiplication algorithm, let’s write some pseudo-code that implements it and look at its runtime complexity.

#include <cmath>   // for pow
#include <vector>

// We use the type digit here for an int that is always < 10
typedef int digit;

int multiply(const digit a[], int a_len, const digit b[], int b_len)
{
  std::vector<int> resultVec;
  int result = 0;

  // Multiply every digit of b by every digit of a, scaling each
  // product by the combined place value of the two digits
  for (int i = 0; i < b_len; ++i)
  {
    for (int j = 0; j < a_len; ++j)
    {
      resultVec.push_back(a[j] * b[i] * (int)pow(10, i + j));
    }
  }

  // Sum all the partial products to get the final result
  for (std::vector<int>::iterator it = resultVec.begin(); it != resultVec.end(); ++it)
  {
    result += *it;
  }

  return result;
}

Before we talk about the complexity of this algorithm, we need to consider what complexity means here. Since we’re reducing multiplication to a set of single-digit multiplications, complexity here really means the number of single-digit multiplications we perform. Looking at this pseudo-code, we can see we have a for loop within a for loop - with complexity \(O(n^{2})\), taking \(n\) to be the number of digits in each operand - and another for loop with complexity \(O(n)\). Total complexity is \(O(n^{2} + n) = O(n^{2})\); for our 287 × 421 example, that works out to \(3\times3 = 9\) single-digit multiplications. Unsurprisingly, the runtime complexity of an algorithm that simply multiplies every digit against every other digit, scaling appropriately, is \(O(n^{2})\). So the question becomes: is there a way we can do better than this?

Improving on Naive Long Multiplication

Since you’re reading this article, you’ve probably already figured out we can improve on naive multiplication - so the question then becomes how are we going to accomplish this? A common approach we’ll see time and time again as we look at algorithms is the use of a divide and conquer approach to solving problems. Let’s see if we can leverage that here - we’ll start by trying to rewrite multiplication slightly differently and then figure out how we can work what we’ve done into an algorithm.

Diving Into Long Multiplication

If you’re just interested in understanding how the algorithm works and are not interested in the mathematical basis - feel free to skip this section - we go into some depth on this problem to justify the solution.

Let’s start by writing out the expression that symbolizes long multiplication, before we simplify it, using our previous examples:

\[a\times b = \sum\limits_{i=0}^{n-1}\left (a_{i}\times10^{i} \right ) \times \sum\limits_{j=0}^{m-1}\left (b_{j}\times10^{j} \right )\]

We can reformulate this to split \(a\) and \(b\) in half pretty easily as we see below.

\[a = \left (\sum\limits_{i=0}^{\left(n\over 2\right)-1}\left (a_{i}\times10^{i}\right ) + \sum\limits_{i=\left( n\over 2\right)}^{n-1}\left (a_{i}\times10^{i} \right ) \right )\\ b = \left (\sum\limits_{j=0}^{\left(m\over 2\right)-1}\left (b_{j}\times10^{j}\right ) + \sum\limits_{j=\left( m\over 2\right)}^{m-1}\left (b_{j}\times10^{j} \right ) \right )\]

Which is just a fancy way of saying we can split \(a\) and \(b\) into two numbers based on the first half of the digits and the second half of the digits. In other words, we can pick an arbitrary place in a number, split the number into two separate numbers (e.g. 1234 splits into 1200 and 0034), and add the two halves back together to get the whole again. Let’s name the two halves as below - note that we can also extract a power term for the “top” half, since its digits all start at a non-zero index - so we have done that as well. We have also extracted the expression representing the “midpoint” of the number into its own variable.

\[a = 10^{\left ( n_{center} + 1 \right )}A_{l} + A_{r}\\ n_{center} = \left\lfloor {n\over 2} \right\rfloor-1 \\ A_{l} = \sum\limits_{i=n_{center} + 1}^{n-1}\left (a_{i}\times 10^{i-\left (n_{center}+1 \right )} \right )\\ A_{r} = \sum\limits_{i=0}^{n_{center}}\left (a_{i}\times10^{i}\right )\\\]

Let’s use an example so we can get a feel for what this means. To return to our previous problem - let’s take the solution from that problem, the number 120827, and split it up in this manner.

\[a = 120827\\ A_{l} = 120\\ A_{r} = 827\\ a = \left(120 \times 10^{3}\right ) + 827\]

We can see that this notation really just means we can split a number in half in such a way that adding the halves back up results in a “whole,” so to speak. Using this re-written formulation, we can now substitute into our original problem of multiplying the two numbers, and we will find ourselves in a new and very interesting form. For simplicity, we’re going to assume an equal number of digits for each number - if they aren’t equal, we can always pad the shorter one with zeroes to make them the same length.

\[a \times b = \left (10^{n_{center}+1}A_{l} + A_{r} \right) \times \left (10^{n_{center}+1}B_{l} + B_{r} \right )\\ a \times b = 10^{2\left(n_{center}+1\right)}A_{l}B_{l} + 10^{n_{center}+1}A_{l}B_{r} + 10^{n_{center}+1}A_{r}B_{l} + A_{r}B_{r}\]

This is great - as it means we can divide the problem into four sub-pieces - and if we did this recursively, this would give us a new way to multiply two numbers. While this is interesting, we can actually show that this is not any more efficient than the long multiplication algorithm. The key thing that changes this around and turns it into what we call the Karatsuba algorithm is the recognition that we can simplify this to only requiring three multiplications instead of four. To do that, we’re going to look at manipulating the center two terms.

\[A_{l}B_{r} + A_{r}B_{l} = \left(A_{l} + A_{r}\right )\left (B_{l} + B_{r} \right ) - A_{r}B_{r} - A_{l}B_{l}\]
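Expanding the product on the right-hand side makes it easy to verify this identity:

\[\left(A_{l} + A_{r}\right )\left (B_{l} + B_{r} \right ) = A_{l}B_{l} + A_{l}B_{r} + A_{r}B_{l} + A_{r}B_{r}\]

Subtracting \(A_{l}B_{l}\) and \(A_{r}B_{r}\) from this leaves exactly the two middle terms we were after.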

What we’ve done here is rewritten the two middle terms to require one new multiplication plus several additions and subtractions. The terms being subtracted are themselves products, but as you can see, we already have to compute those products for the rest of our algorithm. Putting this together, we have a grand total of three multiplication operations required. Let’s put it all together.

\[a \times b = 10^{2\left(n_{center}+1\right)}t_{1} + 10^{n_{center}+1}t_{2} + t_{3}\\ \begin{array}{rl} t_{1} = & A_{l}B_{l}\\ t_{2} = & (A_{l} + A_{r})(B_{l} + B_{r}) - t_{1} - t_{3}\\ t_{3} = & A_{r}B_{r}\\ \end{array}\]

Let’s try it out using our stand-by example of multiplying 287 by 421. To start out, let’s break it down and figure out what we need to calculate.

\[n_{center} = 0\\ \begin{array}{c|c} a = 287 & b = 421\\ A_{l} = 28 & B_{l} = 42\\ A_{r}=7 & B_{r}=1\\ \end{array}\]

With these initial computations, we can evaluate the formulas we derived earlier, and get the solution we expected.

\[\begin{array}{rl} t_{1} = & 28 \times 42 = 1176\\ t_{2} = & (28 + 7)(42 + 1) - 1176 - 7 = 322\\ t_{3} = & 1 \times 7 = 7\\ \hline a \times b = & 10^{2}(1176) + 10^{1}(322) + 7 = \textbf{120827}\\ \end{array}\]
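For instance, had we recursed on the two-digit subproblem \(28 \times 42\) rather than multiplying it directly, the same split (with \(n_{center} = 0\)) gives:

\[\begin{array}{rl} t_{1} = & 2 \times 4 = 8\\ t_{2} = & (2 + 8)(4 + 2) - 8 - 16 = 36\\ t_{3} = & 8 \times 2 = 16\\ \hline 28 \times 42 = & 10^{2}(8) + 10^{1}(36) + 16 = \textbf{1176}\\ \end{array}\]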

Clearly, we can also iterate on this - in this case we were left with subproblems of two digits each - we could have split those up in another divide and conquer step and applied the algorithm to them again - and so on - until we hit single-digit multiplications as we did before. We’ll talk about the efficiency of this method in greater detail in the next section.

Details, Details, Details…

A few extra details for those curious and wanting to know more about the previous section. Feel free to skip this part if you’re not interested in even more details about the above computations and math.

Pivot Points and Arrangement

For our purposes here, it is easiest to think of the “pivot” point we use as the “center” of a given number, as this makes things a little more intuitive when we reach the part where we implement this algorithm in actual code. That being said - as you may have realized - you can set the “pivot” point for this algorithm to be anything: the two numbers do not have to be the same length, and the pivot does not have to be the midpoint. As we will see later on when we get to the performance analysis, these choices can also be tuned to optimize the performance of the algorithm.

In addition - there are a number of equivalent ways you can formulate the above equations - I chose one that places emphasis on individual digits to help make a smoother leap between long multiplication and Karatsuba, but rest assured all the formulations you may find are equivalent.

Multiplying by 10

A lot of Karatsuba involves trading multiplications for additions, subtractions, and multiplications by powers of 10. These trade-offs pay off because addition and subtraction are usually much faster than multiplication - they can be performed in a single linear pass over the digits, propagating carries as we go. In the case of multiplication by powers of 10, we have to remember we are operating in base-10 - so multiplying by a power of 10 is really just a shift operation on the digits, which is very efficient.
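As a quick sketch of what this looks like in code - assuming a big integer representation that stores its base-10 digits least-significant first, which is an assumption of mine rather than a detail of the implementation above - multiplying by \(10^{k}\) is just a matter of inserting \(k\) zero digits:

#include <vector>

// Sketch only: digits are stored least-significant first, so
// multiplying by 10^k means inserting k zero digits at the low end.
void shiftLeftByPowerOfTen(std::vector<int>& digits, int k)
{
  digits.insert(digits.begin(), k, 0);
}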

Bases Other than 10

This can all be generalized to any particular base simply by substituting the powers of 10 with powers of the given base - this allows you to convert this algorithm to work in binary or any other number system. The proof for this is fairly straightforward and basically mimics the explanation given above.
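For example, in an arbitrary base \(B\), the split and the final recombination become:

\[a = B^{n_{center}+1}A_{l} + A_{r}\\ a \times b = B^{2\left(n_{center}+1\right)}t_{1} + B^{n_{center}+1}t_{2} + t_{3}\]

with the \(t\) terms defined exactly as before.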

Implementing Karatsuba Multiplication

Based on the mathematical work we’ve done in the previous section, we’ve finally distilled down the key formula for performing this ostensibly faster multiplication algorithm. I’ve written the formula out again below, where \(center\) is the index of the digit you split on - that is, the last digit in the right-hand side.

\[a \times b = 10^{2\left(center + 1\right)}t_{1} + 10^{center + 1}t_{2} + t_{3}\\ \begin{array}{rl} t_{1} = & A_{l}B_{l}\\ t_{2} = & (A_{l} + A_{r})(B_{l} + B_{r}) - t_{1} - t_{3}\\ t_{3} = & A_{r}B_{r}\\ \end{array}\]

As we noted before, we can apply this method recursively for large numbers, so that we break it down repeatedly until we end up with single digit multiplication operations which we can do very quickly - so we should consider that as we design our code to implement this formula. Let’s write out some pseudo code to represent how this should work.

Number karatsubaMultiplication(Number op1, Number op2)
{
  int midPoint;
  Number t1, t2, t3, pow1, pow2, op1_l, op1_r, op2_l, op2_r;

  // Handle our base case of single digit multiplication
  if (op1.numDigits == 1 && op2.numDigits == 1)
    return op1 * op2;

  // Pad numbers to be of equal length using leading zeroes
  if (op1.numDigits < op2.numDigits)
  {
    op1.padWithZerosToSize(op2.numDigits);
  }

  if (op2.numDigits < op1.numDigits)
  {
    op2.padWithZerosToSize(op1.numDigits);
  }

  // Calculate midpoint and left/right sides - the right side holds
  // digits 0..midPoint, the left side holds the remaining digits
  midPoint = (op1.numDigits / 2) - 1;

  op1_r = op1.digitsFrom(0, midPoint);
  op1_l = op1.digitsFrom(midPoint + 1, op1.numDigits - 1);

  op2_r = op2.digitsFrom(0, midPoint);
  op2_l = op2.digitsFrom(midPoint + 1, op2.numDigits - 1);

  // Recursive calls to generate sub-expressions
  t1 = karatsubaMultiplication(op1_l, op2_l);
  t3 = karatsubaMultiplication(op1_r, op2_r);
  t2 = karatsubaMultiplication((op1_l + op1_r), (op2_l + op2_r)) - t1 - t3;

  // Compute the power terms - generally you don't compute these
  // explicitly, since you would use a shift to accomplish this
  // (eg: "move digits over by so many places")
  pow1 = Number.powerOfTen(2 * (midPoint + 1));
  pow2 = Number.powerOfTen(midPoint + 1);

  // In practice, the below multiplications are generally implemented as
  // shift operations - for binary representations, << and >> can be used
  return pow1*t1 + pow2*t2 + t3;
}

The above code will vary a bit depending on exactly how you’re implementing the algorithm and in what context. This version is intended for a big integer implementation that uses base-10. In general, big integer implementations use base-2 and perform the multiplications by powers of two using bit shifts, which can be done very efficiently - in which case similar but slightly different code is used. Another thing to note is the use of zero padding to make the numbers equal in length - this lets us treat both operands uniformly, which makes our lives a lot easier.

We can see from the algorithm that it isn’t actually all that complicated in terms of writing the code for it. How it works seems magical if we don’t understand the background, but it’s actually a very straightforward case of divide and conquer being applied to a problem.
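To make this concrete, here is a minimal, runnable sketch of the same recursion using built-in 64-bit integers in place of a big integer type. This is purely illustrative - it only works while the final product fits in 64 bits, which rather defeats the purpose of the algorithm - and the helper names are my own:

#include <algorithm>
#include <cstdint>

// Count the base-10 digits in x (1 for 0-9, 2 for 10-99, ...)
static int numDigits(uint64_t x)
{
  int n = 1;
  while (x >= 10) { x /= 10; ++n; }
  return n;
}

// Compute 10^e for small e
static uint64_t powerOfTen(int e)
{
  uint64_t p = 1;
  while (e-- > 0) p *= 10;
  return p;
}

uint64_t karatsuba(uint64_t a, uint64_t b)
{
  // Base case: single-digit operands can be multiplied directly
  if (a < 10 && b < 10)
    return a * b;

  // Split on the midpoint of the longer operand; p is the place
  // value of the split - 10^(midPoint + 1) in the notation above
  int half = std::max(numDigits(a), numDigits(b)) / 2;
  uint64_t p = powerOfTen(half);

  uint64_t a_l = a / p, a_r = a % p;
  uint64_t b_l = b / p, b_r = b % p;

  // The three recursive multiplications
  uint64_t t1 = karatsuba(a_l, b_l);
  uint64_t t3 = karatsuba(a_r, b_r);
  uint64_t t2 = karatsuba(a_l + a_r, b_l + b_r) - t1 - t3;

  // Recombine - these multiplications by p are the "shifts"
  return t1 * p * p + t2 * p + t3;
}

Calling karatsuba(287, 421) walks essentially the same recursion we traced by hand above and returns 120827.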

Analyzing Karatsuba Multiplication

Getting into the analysis of this algorithm can at first seem a bit daunting, but we can relate it to other divide and conquer algorithms we have seen before such as merge sort. Let’s start by thinking about how the recurrence tree for this algorithm works.

For any given pass of the algorithm, we split the numbers themselves in half - a right half and a left half - that we then use in three subsequent recursive steps. To put it another way, we apply Karatsuba three times, and for each of these calls the problem size, that is, the number of digits in the operands, is halved. Let’s look at an example, and graph out how the problem is split up at each step.

Problem Division - how 1231 × 4112 splits at each recursive step:

digraph "Problem Division" {
  "1231 x 4112" -> "43 x 53";
  "1231 x 4112" -> "12 x 41";
  "1231 x 4112" -> "31 x 12";
  "43 x 53" -> "7 x 8";
  "43 x 53" -> "4 x 5";
  "43 x 53" -> "3 x 3";
  "12 x 41" -> "3 x 5";
  "12 x 41" -> "1 x 4";
  "12 x 41" -> "2 x 1";
  "31 x 12" -> "4 x 3";
  "31 x 12" -> "3 x 1";
  "31 x 12" -> "1 x 2";
}

Based on the above we can see, as we had previously noted, at each step we subdivide the problem in half, and then have three recursive steps. Let’s consider how we can understand this in terms of complexity.

The height of the tree above will be \(log_{2}n\), since we halve the problem size at each step. I should note that the height given does not include the leaf nodes - those do not count, as they are primitive operations rather than recursive steps. Also, if \(n\) is not a power of two, then we should take the ceiling of \(log_{2}n\) as the height.

Furthermore, we know that at each step we subdivide the problem into three parts. The recurrence relation for the number of subproblems at any given level is \(op(l) = op(l-1)\times 3\) with \(op(0) = 1\) at the root. The closed form for this recurrence is \(op(l) = 3^{l}\) - in other words, this expression gives us the total number of subproblems being worked on at level \(l\) of the recursion.

Lastly, note that at any given level, the work to “combine” the recursive results of that level is of complexity \(O\left(n\over 2^{l}\right)\). In other words, at any given level we have linear complexity, over the size of the subproblem, to perform the operations required - and that size, as we noted, will be \(n \over 2^{l}\) at level \(l\).

Based on these propositions, we can start deriving the complexity of this algorithm. At any given level, the cost will be the per-subproblem cost of combining the recursive solutions multiplied by the number of subproblems at that level. In other words, at level \(l\) the complexity will be \(O\left( n \over 2^{l} \right) \times 3^{l}\). We can simplify this to \(O(n) \times \left( 3 \over 2 \right)^{l}\). Based on this, we can write the cost of the entire series of recursive calls as follows.

\[\sum\limits_{l=0}^{log_{2}n} O(n) \times \left( 3 \over 2 \right)^{l}\]

Given this expression, we can see that the terms increase as \(l\) increases. Based on this, we can say that the last term - when \(l = log_{2}n\) - is the largest term in the summation. Since we are looking for an asymptotic bound, we can ignore the lesser terms. The complexity for this algorithm can then be written as follows.

\[O(n) \times \left( 3 \over 2 \right)^{log_{2}n}\]

From here we just have to simplify to get a nicer form, which leaves us with \(O(n^{log_{2}3})\), approximately equal to \(O(n^{1.59})\) - a clear improvement over long multiplication’s \(O(n^{2})\).
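For the curious, the simplification relies on the identity \(x^{log_{2}n} = n^{log_{2}x}\):

\[O(n) \times \left( 3 \over 2 \right)^{log_{2}n} = O(n) \times {3^{log_{2}n} \over 2^{log_{2}n}} = O(n) \times {n^{log_{2}3} \over n} = O\left(n^{log_{2}3}\right)\]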

It’s worth noting that for smaller input sizes, depending on hardware and other factors, long multiplication may be faster - but asymptotically, Karatsuba will always win over long multiplication.

Other Algorithms

Beyond Karatsuba there are a number of algorithms that are even more efficient, though they come with different tradeoffs relative to Karatsuba and long multiplication. I won’t go into as much detail on each of these, but I will say a little about each to provide some background. I may cover some of them in future posts as well, as each is interesting in its own way.

Toom-Cook Algorithm

We can generalize the Karatsuba algorithm by changing how many pieces we split the number into at each step and generalizing the formulas used to combine the recursive solutions. This gives us a generalized algorithm for multiplication that encompasses both the long multiplication method and Karatsuba. The Toom-Cook algorithm is this generalization - you can actually say long multiplication and Karatsuba are Toom-Cook algorithms, specifically Toom-1 and Toom-2. There is also a Toom-3 algorithm that divides the number into three smaller parts instead of two at each stage. This algorithm is interesting, as it is the first that really gets us into higher-level strategies for multiplication instead of lower-level “tactics”. I’ll probably cover this in some future post.

Schönhage–Strassen Algorithm

To take things in an entirely different direction, the Schönhage–Strassen algorithm uses fast Fourier transforms in special algebraic objects called rings, which we may cover at a later time, to speed up the multiplication process. As part of this, it also leverages a divide and conquer approach, though it is quite different than that we used in Karatsuba. At the end of the process, it applies an inverse transform, combined with a series of carry operations to “reduce” the solution into the desired form. This algorithm is very interesting as it shows how areas of math we might not suspect can be used to solve common problems.

Fürer’s algorithm

A more recent addition to the set of multiplication algorithms is Fürer’s algorithm. It is a complex algorithm that leverages techniques similar to those used in Schönhage–Strassen, but in a way that further decreases the runtime complexity. I won’t get into any real details on this one as it is quite complicated, but I might have a post about it at a future date. It is worth noting that the “cross over” point where this algorithm becomes more efficient than Schönhage–Strassen is not well understood as of this writing, and so it remains mostly of interest to researchers.

The Final Analysis

Based on all the work we’ve done above, we can see that Karatsuba multiplication is definitely faster than long multiplication in the general case - that is, for numbers with many digits. That said, most real implementations rely on a combination of long multiplication and one or more fast multiplication algorithms, including Karatsuba, to compensate for the poor performance of the fast algorithms on numbers with fewer digits. This provides a tiered approach that targets each algorithm at the segment of inputs where it is most useful. As we look at more algorithms in future posts, we’ll find this is a very common technique for balancing the drawbacks of a given algorithm against its strengths.
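In the same pseudo-code style as before, the tiered approach boils down to something like the following sketch - the cutoff value is purely illustrative (real libraries tune it empirically per platform), and longMultiplication stands in for the first algorithm we wrote:

// Fall back to long multiplication below a cutoff where Karatsuba's
// recursion overhead outweighs its asymptotic advantage.
// The cutoff of 32 digits is an assumption for illustration only.
const int KARATSUBA_CUTOFF = 32;

Number multiply(Number op1, Number op2)
{
  if (op1.numDigits < KARATSUBA_CUTOFF || op2.numDigits < KARATSUBA_CUTOFF)
    return longMultiplication(op1, op2);

  return karatsubaMultiplication(op1, op2);
}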

Thanks for reading. As always, I’ve done my best to make sure all this information is accurate and correct, but if you notice any typographical or other errors, feel free to contact me and let me know.
