SMP: Thundering Herd Problem – Part 6

My last five blog posts have presented:

  1. The Symmetric Multi Processor (SMP) scaling problem.
  2. The solution as presented by Pyramid Technology.
  3. The user space SMP problem.
  4. The various lock primitives.
  5. Debugging locks.

Now, let’s explore something called the Thundering Herd Problem.

If you recall, a previous blog presented the mutex algorithm as:

mutex(void *addr)
{
    while (test_and_set(addr) == 1)
        ;   /* spin until the lock is released */
}

Now consider the scenario where Process 1 (P1) holds a lock and two other processes (P2 and P3) are trying to obtain the same lock.  If P2 has been waiting for a long time and P3 has been waiting for only a short time, there is nothing in the mutex algorithm that enforces fairness.  That is, once P1 releases the lock, P3 can immediately obtain it, even though P2 has been waiting much longer.

Now extend this problem to a 12 CPU SMP system.  One CPU could hold a lock while 11 CPUs wait on the same lock.  Extend it further to a 24 CPU SMP system, and one CPU could hold a lock while 23 CPUs wait on it.

In this case, when a CPU releases a lock, as many as 23 other CPUs could all try to grab the same lock at once.  This mass attempt to obtain a single lock results in a non-linear increase in bus and cache traffic.  The Thundering Herd is those 23 other CPUs.  And the problem is, only one CPU is going to obtain the lock.  The other 22 CPUs go back into a wait state and the process repeats.  Since fairness is not built into this algorithm, there is no guarantee that any particular CPU will obtain the lock in a relatively fair fashion.

Are there locks that suffer from this kind of contention?  Does the Thundering Herd problem occur in a real engineering environment?  Or is all of this just an interesting theory?

At Pyramid we actually ran into this problem.  In the Unix kernel, there are a number of locks that experience this kind of contention.  And we hit the Thundering Herd problem when we expanded our chassis from 12 CPUs to 24 CPUs.  We actually ran into Lock Timeouts (30 seconds) where no single CPU was holding the lock for an excessive amount of time.  The problem was that the mutex algorithm was not fair: a waiting CPU was starved for attention.

In my next post I will present solutions to this problem.