
Art of Performance

A Volatile Situation


As a hardware engineer, most of the code that I write is either in assembly or low-level C (apart from the shell scripts to run simulation jobs or Perl/Python to parse log files). I am therefore very familiar with the volatile type qualifier in the C language. Referring to my well-worn copy of the Kernighan and Ritchie “The C Programming Language” book, the section on volatile states:

The purpose of volatile is to force an implementation to suppress optimization that could otherwise occur. For example, for a machine with memory-mapped input/output, a pointer to a device register might be declared as a pointer to volatile, in order to prevent the compiler from removing apparently redundant references through the pointer.

There is also one interesting tidbit about it in the previous paragraph:

There are no implementation independent semantics for volatile objects.

Why am I talking about volatile? I ran into a situation where not using volatile in multi-threaded code exposed an interesting compiler behavior.

### The situation

Recently, in my spare time, I wrote some code to demonstrate memory consistency models (this will likely be a future blog post). The code has a producer thread and a consumer thread, and the goal is to look at how ordering between different load and store operations works. I could have written this in assembly, but I wanted to try it across different architectures, so writing it in C made more sense. The entire code can be found here.

The functions executed by the producer and consumer threads are shown below:

Producer thread:

void *write_data(void *args) {
     uint32_t i;

     for (i = 0; i < NUM_ITERS; i++) {
       pthread_barrier_wait(&barrier);
       dat_buf[i] = 2*i;
       *flag = i;
     }

     return EXIT_SUCCESS;
}

Consumer thread:

void *read_data(void *args) {
     uint32_t i;
     int num_errors = 0;

     for (i = 0; i < NUM_ITERS; i++) {
       pthread_barrier_wait(&barrier);
       while (*flag != i);
       if (dat_buf[i] != 2*i) {
         printf("Error: Ordering mismatch at iteration %d\n", i);
         num_errors++;
       }
     }

     printf("Ordering errors: %d/%d\n", num_errors, NUM_ITERS);

     return EXIT_SUCCESS;
}

As you can see, in every iteration i, the producer thread writes a value into dat_buf[i] and then sets the flag value to i. Meanwhile, the consumer thread is polling on flag, waiting for it to equal i, i.e. waiting for the producer to set the flag. It then checks whether the dat_buf[i] value matches the value we expect the producer to have written. If strict write ordering is guaranteed by the processor, then num_errors will always be 0. Note that the barrier at the start of each iteration ensures that the producer and consumer threads run each iteration in lockstep.

### It’s volatile!

When I first wrote this code, I did not declare the flag pointer as volatile. I compiled it without any compiler optimization flags on my laptop with an AMD Zen2 CPU:

gcc -Wall -march=native store_order.c -o store_order

I ran the executable and it worked without any issues. I then recompiled it with -O2 optimizations added, and running the new optimized binary resulted in a hang within the first couple of iterations. I couldn’t immediately understand why optimizations caused this difference. Only when I compiled the code to assembly and diff’ed the read_data function between the two versions did I realize the problem. The assembly1 snippet below corresponds to the C line:

while (*flag != i);
Without optimization:

.L15:
	mov	rax, QWORD PTR flag[rip]
	mov	eax, DWORD PTR [rax]       ;; loads flag into eax
	cmp	DWORD PTR -8[rbp], eax     ;; -8[rbp] holds the value of 'i',
	                               ;; compared against flag
	jne	.L15

With -O2 optimization:

	mov	rax, QWORD PTR flag[rip]
	mov	eax, DWORD PTR [rax]       ;; loads flag into eax

.L7:
	cmp	eax, ebx                   ;; ebx holds the value of 'i'
	jne	.L7

You’ll notice that the load of flag into eax happens inside the loop in the unoptimized case, but with -O2 that load is hoisted outside the loop. So, if the single load of flag doesn’t return the expected value, the .L7 compare loop on the right never exits (an infinite loop). This is when I realized that the flag pointer also needs to be declared as volatile! The volatile qualifier applies whenever the pointed-to object can be changed outside the current thread of execution: that could be done by hardware (in the case of device registers), but also by other threads in a shared-memory program.

I then added the volatile keyword to the flag pointer declaration:

volatile uint32_t *flag; 

I recompiled the code with -O2 and this time the program worked correctly: it did not hang. A quick look at the new assembly reveals what changed:

	mov	rdx, QWORD PTR flag[rip]

.L7:
	mov	eax, DWORD PTR [rdx]           ;; read of flag
	cmp	eax, ebx
	jne	.L7

The read of flag has now moved inside the .L7 compare loop! This simple example shows the kind of optimization that volatile suppresses.

So, if your multi-threaded program execution becomes “volatile” with optimizations, look to see if you need to add volatile qualifier to pointers in your code.

### Epilogue

I recently read a blog post titled “C and C++ Prioritize Performance over Correctness” – associated HN discussion here. This got me thinking of the code that gcc produced with -O2 but without the volatile qualifier:

	mov	eax, DWORD PTR [rax]       ;; loads flag into eax

.L7:
	cmp	eax, ebx                   ;; ebx holds value of 'i' and compared
	jne	.L7

What is the point of this code? Neither eax nor ebx can ever change inside this .L7 loop. Without volatile, gcc assumes that flag is not going to change, and therefore only reads it once outside the loop. Why didn’t gcc then change this while loop into an if statement and check flag once before moving on? Why insert the infinite loop?

Thinking this might be a problem with gcc, I compiled the code with LLVM/clang using the same command-line options. Without the volatile keyword, clang produced the following assembly:

	mov     rax, qword ptr [rip + flag]
	mov     eax, dword ptr [rax]         ;; loads flag into eax
	cmp     r12, rax                     ;; r12 holds value of 'i'
	jne     .LBB2_2

.LBB2_2:
	jmp     .LBB2_2

This is even more absurd: the cmp and the following jne look like the if statement I was suggesting earlier. But then the jmp at .LBB2_2 is an explicit infinite loop! Why, clang? Why?

This brings back memories from my childhood, when I would write infinite loops in BASIC and leave them running to annoy others:

10   PRINT " "
20   GOTO 10

  1. Assembly in Intel Syntax