Spinlocks represent one of the simplest forms of mutual exclusion mechanisms used in multiprocessor operating systems. Their primary role is to prevent concurrent access to critical sections by multiple cores, ensuring data consistency and integrity. Generally, spinlocks safeguard against preemption on the current CPU—often by disabling interrupts—and simultaneously prevent other cores from accessing the protected code region by leveraging atomic memory operations. The defining characteristic of a spinlock is that when a thread attempts to acquire a lock already held by another, it busy-waits or "spins," continuously checking until the lock becomes available. This behavior leads to CPU time consumption during waiting periods, which necessitates that spinlocks be held for minimal durations and ideally not be subject to preemption while owned.
Historically, early spinlock implementations were based purely on standard memory operations such as loads and stores. Classic algorithms like those of Eisenberg & McGuire or the Lamport Bakery demonstrated this approach. However, these algorithms assume sequential consistency, which modern multicore processors do not provide: even the comparatively strong x86 and x86-64 memory model (TSO) reorders stores after later loads, which is enough to break naive load/store-only implementations, and weakly ordered architectures such as ARM reorder far more aggressively. Memory fences can restore correctness, but they tend to reduce efficiency. Consequently, modern spinlocks rely on more powerful atomic read-modify-write operations provided by hardware, such as test-and-set, compare-and-swap, and load-linked/store-conditional instructions. These operations ensure atomicity and proper synchronization, which are critical for the spinlock's correct function.
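The simplest of these primitives, test-and-set, already suffices for a working lock. The following is a minimal sketch using GCC's atomic built-ins; the names (tas_lock_t and friends) are illustrative, not taken from any particular kernel or library.

```c
#include <stdbool.h>

/* Minimal test-and-set spinlock sketch. 0 = free, 1 = held. */
typedef struct {
    volatile unsigned char locked;
} tas_lock_t;

static void tas_lock(tas_lock_t *l)
{
    /* __atomic_test_and_set atomically stores 1 and returns the previous
       value; we hold the lock only if the previous value was 0. */
    while (__atomic_test_and_set(&l->locked, __ATOMIC_ACQUIRE))
        ;  /* busy-wait ("spin") until the lock becomes free */
}

static void tas_unlock(tas_lock_t *l)
{
    /* Release: atomically store 0 so a spinning waiter can proceed. */
    __atomic_clear(&l->locked, __ATOMIC_RELEASE);
}
```

The acquire/release pairing is the minimum ordering a lock needs: acquire keeps critical-section accesses from moving before the lock, release keeps them from moving after the unlock.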
A straightforward user-mode spinlock can be implemented in C using GCC's atomic built-ins. The lock state is represented by an integer: zero indicates the lock is free, one that it is held. Acquisition involves an atomic compare-and-swap operation that transitions the lock from free to held in a single atomic step. Using the strongest memory-ordering semantics (__ATOMIC_SEQ_CST) guarantees sequential consistency across cores, albeit with potential inefficiencies. One drawback of this basic design is unfairness under contention: whichever core happens to win the race for the lock's cache line acquires it next, which can give some cores disproportionately frequent access while others wait much longer.
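The CAS-based design just described can be sketched as follows; again the type and function names are illustrative, and __ATOMIC_SEQ_CST is used throughout as the text assumes.

```c
#include <stdbool.h>

/* CAS-based spinlock sketch: 0 = free, 1 = held. */
typedef struct {
    volatile int locked;
} cas_lock_t;

static void cas_lock(cas_lock_t *l)
{
    int expected = 0;
    /* Atomically swing the lock from 0 (free) to 1 (held). On failure the
       builtin overwrites `expected` with the value it observed, so we must
       reset it to 0 before retrying. */
    while (!__atomic_compare_exchange_n(&l->locked, &expected, 1,
                                        false, /* strong CAS */
                                        __ATOMIC_SEQ_CST, __ATOMIC_SEQ_CST))
        expected = 0;  /* spin until the lock is observed free again */
}

static void cas_unlock(cas_lock_t *l)
{
    __atomic_store_n(&l->locked, 0, __ATOMIC_SEQ_CST);
}
```

In practice the acquire could use __ATOMIC_ACQUIRE and the release __ATOMIC_RELEASE; sequential consistency is the simplest correct choice, not the cheapest.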
To address fairness concerns, spinlock designs sometimes employ a ticket mechanism that enforces a first-in-first-out (FIFO) order. Ticket spinlocks assign an incrementing ticket number to each requester and grant access in ascending ticket order. This approach guarantees fairness, as every thread waits its turn, analogous to the Lamport Bakery algorithm. Implementation-wise, the ticket lock maintains two counters—front and back—representing the current serving ticket and the next ticket to be assigned. When a thread’s ticket matches the front counter, it gains access to the critical section. This method has demonstrated highly fair behavior even on large multicore systems.
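A ticket lock along the lines described, with front and back counters, might look like this sketch (names are illustrative):

```c
/* Ticket spinlock sketch: FIFO ordering via two counters. */
typedef struct {
    volatile unsigned front;  /* ticket currently being served */
    volatile unsigned back;   /* next ticket to hand out */
} ticket_lock_t;

static void ticket_lock(ticket_lock_t *l)
{
    /* Atomically draw a ticket and advance `back` for the next requester. */
    unsigned ticket = __atomic_fetch_add(&l->back, 1, __ATOMIC_RELAXED);

    /* Spin until it is our turn; waiters are served strictly in ticket
       order, which is what makes the lock fair. */
    while (__atomic_load_n(&l->front, __ATOMIC_ACQUIRE) != ticket)
        ;
}

static void ticket_unlock(ticket_lock_t *l)
{
    /* Advance `front` to admit the next ticket holder. */
    __atomic_fetch_add(&l->front, 1, __ATOMIC_RELEASE);
}
```

Note the structural echo of the Bakery algorithm: drawing a ticket replaces choosing a number, but here a single atomic fetch-and-add does the work that the Bakery algorithm achieves with plain loads and stores.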
Linux kernels often use more sophisticated queued spinlocks, which are complex and distributed across multiple source files. However, focusing on 32-bit ARM platforms, the spinlock implementation takes advantage of the processor’s load-linked/store-conditional (ldrex/strex) instructions to perform atomic increments and ensure proper synchronization. The data structure representing the lock can be viewed either as a 32-bit integer or as two 16-bit fields, functioning like the front and back counters in the ticket lock. The use of prefetchw (which compiles to the pldw instruction) primes the cache line for writing, optimizing subsequent lock operations.
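The two-view layout the text describes can be sketched as a C union; field names here are illustrative and the kernel's actual arch_spinlock_t differs in detail (for instance, it handles byte order explicitly, whereas this sketch assumes a little-endian layout with the owner in the low half).

```c
#include <stdint.h>

/* One 32-bit word, viewed either whole (for ldrex/strex updates) or as two
   16-bit ticket-lock counters. Assumes little-endian: owner = low half. */
typedef union {
    uint32_t slock;           /* whole-word view, updated atomically */
    struct {
        uint16_t owner;       /* "front": ticket currently being served */
        uint16_t next;        /* "back": next ticket to hand out */
    } tickets;
} spinlock_view_t;
```

The whole-word view is what ldrex/strex operate on; the halved view is what the ticket logic reasons about, exactly the front/back pair of the ticket lock.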
The ARM spinlock acquisition code uses an inline assembly block that closely mimics the ticket lock’s behavior. The ldrex instruction loads the lock’s value and tags the memory location for exclusive access. The code increments the high 16 bits of the lock value to generate a new ticket number. Subsequently, strex attempts to store this value conditionally, succeeding only if no other processor has modified the memory. If the store fails, the process repeats, ensuring atomicity through optimistic concurrency control. This mechanism enables efficient, low-level synchronization tailored to ARM’s architecture nuances.