Population Count in Java

How do you count the bits in a 32 bit integer? Since this is possible in a single instruction, popcntd, which is exposed by an intrinsic method in Java and several other languages, this is a completely academic question. Nevertheless, however futile, deriving an efficient expression is instructive.

A naive approach would be to check each of the 32 bits in sequence. This can be written in Java as follows:

  public static int populationCountCheckEachBit(int value) {
    int count = 0;
    for (int i = 0; i < Integer.SIZE; ++i) {
      if ((value & (1 << i)) != 0) {
        ++count;
      }
    }
    return count;
  }

This has constant and high execution time, even when most of the bits are unset: there will always be 32 left shifts and 32 intersections. There is no inherent data dependency in the loop above so it can probably be unrolled and pipelined, even so, it’s just too long to be practically useful. A less naive approach is to skip over the unset bits, which will actually be quite fast when the data is sparse.

  public static int populationCountSkipUnsetBits(int value) {
    int count = 0;
    while (value != 0) {
      value ^= value & -value;
      ++count;
    }
    return count;
  }

The code above calculates the lowest bit and unsets it until there are no bits left. In other languages, resetting the bit can use the blsr instruction, but C2 would emit code using blsi instruction and an xor here. This code will do well for sparse data, but has a data dependency and the performance will be absolutely terrible for dense data (such as small negative numbers).

Since an integer’s population count is the sum of the population counts of its constituent bytes, and the population count of a byte can only take 256 values, why not precompute a small lookup table containing the population counts for each possible byte? Then, with four masks, three right shifts, four moves and three additions, the population count can be calculated.

   private static int[] LOOKUP = {
           0, 1, 1, 2, 1, 2, 2, 3,
           1, 2, 2, 3, 2, 3, 3, 4,
           1, 2, 2, 3, 2, 3, 3, 4,
           2, 3, 3, 4, 3, 4, 4, 5,
           1, 2, 2, 3, 2, 3, 3, 4,
           2, 3, 3, 4, 3, 4, 4, 5,
           2, 3, 3, 4, 3, 4, 4, 5,
           3, 4, 4, 5, 4, 5, 5, 6,
           1, 2, 2, 3, 2, 3, 3, 4,
           2, 3, 3, 4, 3, 4, 4, 5,
           2, 3, 3, 4, 3, 4, 4, 5,
           3, 4, 4, 5, 4, 5, 5, 6,
           2, 3, 3, 4, 3, 4, 4, 5,
           3, 4, 4, 5, 4, 5, 5, 6,
           3, 4, 4, 5, 4, 5, 5, 6,
           4, 5, 5, 6, 5, 6, 6, 7,
           1, 2, 2, 3, 2, 3, 3, 4,
           2, 3, 3, 4, 3, 4, 4, 5,
           2, 3, 3, 4, 3, 4, 4, 5,
           3, 4, 4, 5, 4, 5, 5, 6,
           2, 3, 3, 4, 3, 4, 4, 5,
           3, 4, 4, 5, 4, 5, 5, 6,
           3, 4, 4, 5, 4, 5, 5, 6,
           4, 5, 5, 6, 5, 6, 6, 7,
           2, 3, 3, 4, 3, 4, 4, 5,
           3, 4, 4, 5, 4, 5, 5, 6,
           3, 4, 4, 5, 4, 5, 5, 6,
           4, 5, 5, 6, 5, 6, 6, 7,
           3, 4, 4, 5, 4, 5, 5, 6,
           4, 5, 5, 6, 5, 6, 6, 7,
           4, 5, 5, 6, 5, 6, 6, 7,
           5, 6, 6, 7, 6, 7, 7, 8
   };

  public static int populationCountWithLookupTable(int value) {
    return LOOKUP[value & 0xFF]
         + LOOKUP[(value & 0xFF00) >>> 8]
         + LOOKUP[(value & 0xFF0000) >>> 16]
         + LOOKUP[(value & 0xFF000000) >>> 24];
   }

This isn’t as stupid as it looks. The number of instructions is low and they can be pipelined easily. C2 obviously can’t autovectorise this, but I imagine this could possibly end up being quite fast (if used in a loop) once the Vector API becomes a reality. Lemire and Muła devised a fast vectorised population count algorithm based on a lookup table of precalculated population counts for each nibble. Their algorithm is used by clang to calculate the population count of an array, but is far beyond both the scope of this post and the capabilities of Java.

We can avoid storing the table while using very few instructions with a divide and conquer approach, writing the result in place. The first thing to notice is that the population count of N bits can be expressed in at most N bits. So, interpreting the integer as a 16 element string of 2-bit nibbles we can calculate each 2-bit population count and store it in the same 2 bit nibble.

The masks 0x55555555 and 0xAAAAAAAA each have alternating bits and are logical complements. Remember that the population count is the sum of the population counts of the even bits and the odd bits. The code below calculates the number of bits in each 2-bit nibble and stores the result into the same 2-bit nibble. It works because the addition can only carry left into a zero bit (the odd bits have all been shifted right).

     int output = (value & 0x55555555) // mask the even bits
                + ((value & 0xAAAAAAAA) >>> 1); // mask the odd bits and shift right so they line up with the even bits

By way of example, consider the input value 0b11001010101101010101010101010011. The population count is 17, and the output takes the value 0b10000101011001010101010101010010. Notice that no 2-bit nibble takes the value 0b11 - we have 16 values of either zero, one or two: 2 + 0 + 1 + 1 + 1 + 2 + 1 + 1 + 1 + 1 + 1 + 1 + 1 + 1 + 0 + 2 = 17. It’s not necessary to have two separate constants: (value & 0xAAAAAAAA) >>> 1 is equivalent to (value >>> 1) & 0x55555555. This saves a register.

We now have a smaller problem: we need to add up the 16 2-bit nibbles. The mask 0x33333333 covers all the even 2-bit nibbles, and the mask 0xCCCCCCCC covers all the odd 2-bit nibbles. Shifting the odd values right and adding them to the even ones gives eight nibbles consisting of the 4-bit population counts:

     value = (value & 0x55555555) + ((value >>> 1) & 0x55555555); 
     value = (value & 0x33333333) + ((value >>> 2) & 0x33333333); 

Like before, the expression (value & 0xCCCCCCCC) >>> 2 has been replaced by (value >>> 2) & 0x33333333 to save a constant. Now we have eight nibbles to add up into four bytes, after that we have two shorts, and finally a single integer. The complete method ends up as follows:

  public static int populationCountWithMasks(int value) {
    value = (value & 0x55555555) + ((value >>> 1) & 0x55555555);
    value = (value & 0x33333333) + ((value >>> 2) & 0x33333333);
    value = (value & 0x0F0F0F0F) + ((value >>> 4) & 0x0F0F0F0F);
    value = (value & 0x00FF00FF) + ((value >>> 8) & 0x00FF00FF);
    value = (value & 0x0000FFFF) + ((value >>> 16) & 0x0000FFFF);
    return value;
  }

You can almost see it already, but if you write the hexadecimal constants above in binary you will realise that this is quite an elegant solution: the masks look like a tree:

01010101010101010101010101010101
00110011001100110011001100110011
00001111000011110000111100001111
00000000111111110000000011111111
00000000000000001111111111111111

This elegance comes at a small cost. There are various profitable transformations, the simplest of which is the elision of the redundant final mask. The others are more involved and are covered in depth in chapter 5 of Hacker’s Delight. The end result can be seen in the Integer class.

    @HotSpotIntrinsicCandidate
    public static int bitCount(int i) {
        // HD, Figure 5-2
        i = i - ((i >>> 1) & 0x55555555);
        i = (i & 0x33333333) + ((i >>> 2) & 0x33333333);
        i = (i + (i >>> 4)) & 0x0f0f0f0f;
        i = i + (i >>> 8);
        i = i + (i >>> 16);
        return i & 0x3f;
    }

The method above is intrinsified by C2 to the instruction popcntd and this method is the only way to access the instruction from Java. If it’s not already obvious, the power of having this access can be shown with a comparative benchmark.

Benchmark	Mode	Threads	Samples	Score	Score Error (99.9%)	Unit
intrinsic	thrpt	1	10	341.572057	1.983535	ops/us
lookupTable	thrpt	1	10	205.373131	0.557472	ops/us
masks	thrpt	1	10	191.744272	1.942700	ops/us
naive	thrpt	1	10	26.651332	0.101285	ops/us
skipUnsetBits	thrpt	1	10	94.125249	0.559893	ops/us

Despite its power, since no vectorisation of this operation is possible prior to the AVX-512 VPOPCNTD/VPOPCNTQ extension (available virtually nowhere), loops containing popcnt can quickly become bottlenecks. Looking beneath the surface is intriguing. I’m sure with explicit vectorisation the lookup approach could be powerful.