While working on a string processing algorithm in C++, I checked MSVC 2013’s assembly output to make sure it was inlining a subroutine inside a hot loop. The subroutine was:
char const *advancePastWhitespace(char const *s)
{
    while (*s == ' ' || *s == ',' || *s == '\r' || *s == '\n') {
        ++s;
    }
    return s;
}
(Yes, for the purposes of this algorithm, the definition of whitespace is different.)
I saw that the subroutine was inlined. Easy enough. I became more interested when I looked at the generated 64-bit assembly:
mov rcx, 17596481020928 ; 0000100100002400H
npad 5
$LL82@myFunc:
movzx eax, BYTE PTR [rsi]
cmp al, 44 ; 0000002cH
ja SHORT $LN81@myFunc
movsx rax, al
bt rcx, rax
jae SHORT $LN81@myFunc
inc rsi
jmp SHORT $LL82@myFunc
$LN81@myFunc:
; rest of code...
Wow! I was expecting a sequence of CMPs and JEs, but instead I got something more cryptic, with a magic number and not a JE in sight.
I’m still a beginner at assembly, so I had to look up some instructions and write on the whiteboard to figure out what’s going on here. The result made me happy: this is a clever optimization, and it’s only possible in 64-bit code!
Let’s step through.
mov rcx, 17596481020928 ; 0000100100002400H
npad 5
First, we load a magic number into RCX. Notice that the high 32 bits of this number are nonzero – it wouldn’t fit in a 32-bit register. The NPAD directive emits a few bytes of padding to align the loop that follows.
$LL82@myFunc:
movzx eax, BYTE PTR [rsi]
cmp al, 44 ; 0000002cH
ja SHORT $LN81@myFunc
We load the next byte of our input string. MOVZX zero-extends the byte into EAX, and the upper 32 bits of RAX are implicitly zeroed by the 32-bit move.
Our four ASCII characters of interest have the following values:
'\n' : 10 '\r' : 13 ' ' : 32 ',' : 44
The JA instruction means “jump if above” – an unsigned comparison against the preceding CMP. If the character’s value is greater than 44, it couldn’t possibly be one of the whitespace characters, so we jump straight out of the loop.
movsx rax, al
Then, we copy a sign-extended version of our character into RAX. This part is a bit confusing to me – it seems redundant with our “movzx eax, BYTE PTR [rsi]” from earlier. I figured it was related to signed vs. unsigned semantics, but if I change everything to use unsigned char, this instruction gets replaced with a redundant “movzx eax, al” instead! This is strange; I really don’t understand it. Anyway, moving on…
bt rcx, rax
jae SHORT $LN81@myFunc
inc rsi
jmp SHORT $LL82@myFunc
$LN81@myFunc:
; rest of code...
Now the magic happens! The BT instruction means Bit Test. It’s documented here (in its 32-bit x86 form): http://faydoc.tripod.com/cpu/bt.htm
BT copies the (operand 2)’th bit of (operand 1) into the carry flag.
In other words, if we treat op1 as an array of bits, it sets carry = op1[op2].
Our op1 is our magic number in rcx. If we take a closer look at its set bits, we find, sure enough: bits 10, 13, 32, and 44 are set. So after performing this BT instruction, the carry flag is set if and only if our character was one of the 4 whitespace characters.
JAE (“jump if above or equal”) jumps when the carry flag is zero – it’s the same instruction as JNC. So if the bit wasn’t set, we’ve hit a non-whitespace character and exit the loop. Otherwise, we increment our character pointer and start again.
This optimization is really cool. It replaces 4 compare/conditional jump pairs with one compare, one Bit Test, and two conditional jumps. That should be a significant speedup. According to Agner Fog’s tables, the register/register version of BT has latency of only one clock. I’d expect an almost-2x speedup from this optimization.
In the 32-bit version, we can’t use this optimization because the magic number is too big to fit in a single register. We could perhaps use two separate masks and branch on whether the character value is below 32, but then we’d get very close to the instruction count of the naive version. Plus there are fewer registers to work with in 32-bit code; we might not have enough spare to hold those masks.
MSVC 2013 generates simple branches in 32-bit builds:
$LL59@myFunc:
mov cl, BYTE PTR [esi]
cmp cl, 32 ; 00000020H
je SHORT $LN57@myFunc
cmp cl, 44 ; 0000002cH
je SHORT $LN57@myFunc
cmp cl, 13 ; 0000000dH
je SHORT $LN57@myFunc
cmp cl, 10 ; 0000000aH
jne SHORT $LN58@myFunc
$LN57@myFunc:
inc esi
jmp SHORT $LL59@myFunc
$LN58@myFunc:
; rest of code...
That's what I expected. I did a performance comparison with 4 other ways to do it:
1) a branchless version, doing all 4 comparisons no matter what. This looks just like my original code except the short-circuiting || operators are replaced with |.
2) the C standard library function strspn
3) C++'s std::string::find_first_not_of
4) std::algorithm's find_if with a lambda function
Here are the results:
          original   branchless   strspn   std::string   std::algorithm
32-bit       9.149       12.631   24.584        48.592            9.637
64-bit       5.589       12.489    8.693        33.574            5.724
The branchless version is the most consistent between 32- and 64-bit builds, but it's slower than the short-circuiting version.
strspn performed decently on 64-bit builds but much worse on 32-bit. Unfortunately I'm testing on MS Visual Studio so I don't have access to the source code of strspn. The code is part of Microsoft's C Runtime DLL, so it must handle the most general case – i.e., load the search characters from the second argument rather than inlining immediate values into comparisons. It can't make any optimizations based on searching for a compile-time constant character set.
std::string was even worse, with dismal performance on both 32- and 64-bit.
On the bright side, I was quite happy to see std::algorithm effectively match the performance of my original version.
I was happy to see my compiler do an advanced optimization like this. It looks like something a clever assembly programmer would write. I would like to test on other compilers and operating systems to see if they do the same optimization, and if their standard libraries perform any better.