SMP: Debugging the System Part 5

My last four blogs posts have presented:

  1. Symmetric Multi Processor (SMP) scaling problem.
  2. Solution as presented by Pyramid Technology.
  3. User space SMP problem.
  4. Various lock primitives.

It’s logical to now explore how to debug an SMP system. But first I need to present the classic locking problem as defined by Dijkstra.

Classic Locking Problem

Consider two processes: P1 and P2; and two locks: L1 and L2.  Now consider the following locking scenerio: P1 locks L1 (P1.L1) and then L2 (P2.L1); P2 locks L2 (P2.L2) and then L1 (P2.L1).

If the locks are aquired in the order: (1: P1.L1, 2: P2.L2, 3: P1.L2, 4: P2,L1), then steps 3 and 4 will never complete.  This scenerio is called a deadlock in that both processes are stuck waiting for locks that are held by each other.

So how do we deal with a SMP deadlock?  There are essentually two ways to deal with deadlocks.  Method 1: try to detect when a deadlock has occured.  Method 2: prevent deadlocks from occuring.

The first method is somewhat problomatic in that you have to keep track of all of the locks in the system and periodically monitor them for a deadlock.

The second method is much easier to implement.  Pyramid dealt with this problem by adding code to detect when a deadlock was immenant.  To wit, a positive, non-zero level was assigned to each lock and the lock protocol was modified to ensure that a lock could only be aquired if the lock level was increased.

Now let us apply lock levels to the Dijkstra lock problem.  L1 will now be defined as L1.5 (lock 1 with level 5) and L2 will now be defined as L2.10 (lock 1 with level 10).  If the locks are aquired in the order: (1: P1.L1, 2: P2.L2, 3: P1.L2, 4: P2,L1), then step 4 will generate a Lock Level Violation since P2 is holding L2 (which has a level of 10) and P2 is trying to lock L1 (which has a level of 5) – you can only increase the lock levels.

Handling SMP Interrupts

There is another problem that we have to deal with: interrupts.  If a process is holding a lock and interrupts occur, then it will look like the process is holding the lock for a long time.

This problem is dealt with by adding an interrupt level to the definition of a lock.  In addition, the lock protocol is expanded by requiring that the cpu interrupt level be set to the lock interrupt level.  Setting the cpu interrupt level blocks interrupts at a lower level.  System interrupts are defined by levels, for example: 0 = none, 1 = tty, 2 = network, 3 = disk, etc.  The higher the interrupt level, the higher the priority.  Higher level interrupts block out lower level interrupts.  Thus, setting the interrupt level of a cpu to priority X, blocks all interrupts lower than and including level X.

Defining the Locking Protocol

So a lock is now defined as: (Name, Level, Interrupt).  And the locking protocol is now defined as:

1)  If the new lock level is less than the current lock level, then generate a Lock Level Protocol Violation.

2)  If the new lock interrupt priority is less than the current cpu interrupt priority, then generate a Lock Interrupt Protocol Violation.

Next up? SMP’s Thundering Herd Problem.