Singleton Pattern - 1. Thread Safety

INTRODUCTION

The Gang of Four [1] describes the Singleton pattern as: "Ensure a class only has one instance, and provide a global point of access to it."

There are many situations in which the singleton pattern manifests itself. For instance, we may want to have only one instance of a particular logger, memory manager, or window manager. One way to provide a single instance is to make the constructor of the class private and provide a public access function:

class WindowManager
{
private:
    WindowManager();
    static WindowManager* pInstance_;

public:
    static WindowManager* getInstance()
    {
        if(!pInstance_)
        {
            pInstance_ = new WindowManager();
        }
        return pInstance_;
    }
};

WindowManager* WindowManager::pInstance_ = 0;

The getInstance() function ensures that only one instance of WindowManager is created and returns a pointer to it. The user can then access the instance as:

WindowManager* p = WindowManager::getInstance();

While this example will work for a single-threaded program, it is unsafe for multi-threaded programs. We need to use a mutex for thread safety:

static WindowManager* getInstance()
{
    lock_mutex();
    if(!pInstance_)
        pInstance_ = new WindowManager();
    unlock_mutex();
    return pInstance_;
}

Please note that I have not given any consideration to unlocking the mutex in the presence of exceptions; that case is outside the scope of this article.
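For completeness, one common way to make the unlocking exception-safe is an RAII guard that releases the mutex in its destructor. The following is only a minimal sketch, reusing the hypothetical lock_mutex()/unlock_mutex() primitives used throughout this article:

// Minimal sketch of an RAII guard around the hypothetical
// lock_mutex()/unlock_mutex() primitives used in this article.
class MutexGuard
{
public:
    MutexGuard()  { lock_mutex(); }    // acquire on construction
    ~MutexGuard() { unlock_mutex(); }  // release on scope exit, even if an exception is thrown
};

static WindowManager* getInstance()
{
    MutexGuard guard;                  // mutex is released automatically on return or throw
    if(!pInstance_)
        pInstance_ = new WindowManager();
    return pInstance_;
}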

This mutex-based solution is correct but has a limitation: every time getInstance() is called, it locks and unlocks the mutex, and mutex lock and unlock have an overhead associated with them. Note that we do not need thread synchronization after the object has been constructed and pInstance_ has been initialized. One optimization that has been widely used, and probably still is, is to check pInstance_ before acquiring the lock. This is known as the double-checked locking pattern (perhaps a better name would be double-checked serialized initialization):

  1. static WindowManager* getInstance()
  2. {
  3.  if(!pInstance_)
  4.  {
  5.      lock_mutex();
  6.      if(!pInstance_)
  7.         pInstance_ = new WindowManager();
  8.      unlock_mutex();
  9.  }
  10. return pInstance_;
  11. }

Please note that after acquiring the lock, we again need to check whether pInstance_ has been initialized, since some other thread may have acquired the lock first and constructed the object. So, as the rationale goes, once pInstance_ has been created and initialized, any other thread entering getInstance() will find that the if condition on line 3 above evaluates to false, and will therefore avoid the mutex operations entirely.

This logic, however, is incorrect and should not be used. The problem is that the read of pInstance_ on line 3 and the write to pInstance_ on line 7 are not synchronized. It is therefore quite possible for a reader to see a partially initialized pInstance_, or to get a pInstance_ that points to an uninitialized or partially initialized object.

This may happen due to optimizations at two levels: one by the compiler and the other by the hardware. A compiler optimizes the generated code to improve the run-time performance or space requirements of the program. One very important optimization is exploiting instruction-level parallelism. Modern processors (like IA-64) have the ability to execute many instructions in parallel, and to use this potential of the processor, the compiler reorders instructions so that it can group those that can be executed in parallel.

For the following case:

pInstance_ = new WindowManager();

the compiler allocates space for the WindowManager object, calls the constructor, and assigns the resulting pointer to pInstance_. It is quite possible that the optimizer reorders these instructions so that pInstance_ is assigned before the object has been completely constructed. This may cause some other thread calling getInstance() to return a partially initialized object. An attempt to assign to a local variable first will not work either:

WindowManager* p = new WindowManager();
pInstance_ = p;

An optimizing compiler may remove the assignment to p completely. Even if we were somehow able to fix the order of the assignment to pInstance_ in the generated code, it would still not work, because of the optimizations that the processor itself performs, as discussed below.
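To make the compiler-level concern concrete, the single statement pInstance_ = new WindowManager(); can be thought of as three conceptual steps. The following sketch is purely illustrative:

// Conceptually, pInstance_ = new WindowManager(); breaks down into:
void* mem = operator new(sizeof(WindowManager));   // 1. allocate raw memory
new (mem) WindowManager();                          // 2. construct the object in that memory
pInstance_ = static_cast<WindowManager*>(mem);      // 3. publish the pointer

Nothing in the unsynchronized double-checked code forces step 2 to complete before the result of step 3 becomes visible to another thread; either the compiler or the processor may effectively reorder them.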

What about instruction reordering across the critical section, or in the presence of mutexes? Any compiler that supports threads has to either unconditionally avoid such reordering (across the mutex region) or perform it only when it can reliably determine that it will not break thread synchronization guarantees. Also, mutex calls usually reside in an external binary library, so reordering across such calls, especially of accesses to global variables, is generally not possible.

Hardware Optimizations

Modern processors (such as IA-64, or RISC processors such as Alpha and SPARC) have the concept of unordered reads and/or writes. For instance, in the following fragment:

i = 10;
//...
j = 20;

it may happen that the write to j completes before the write to i. So, in our case, even if the compiler generates the assignment to pInstance_ after constructing the object, in practice the change to pInstance_ may become visible before the object is completely initialized.

These processors provide memory barrier instructions to enforce ordering. A memory barrier instruction tells the processor that all read (or write) requests issued before the barrier must complete before it acts on the load/store requests issued after the barrier. Mutex lock and unlock operations issue memory barrier instructions, so the 'un-optimized' getInstance() function will always work:

static WindowManager* getInstance()
{
    lock_mutex();
    if(!pInstance_)
        pInstance_ = new WindowManager();
    unlock_mutex();
    return pInstance_;
}

A thread that locks or unlocks the mutex is guaranteed to see a fully initialized pInstance_.

SOLUTION

So is there no way to optimize the Singleton pattern? It turns out that there is. The problem with the earlier approach is that it checks pInstance_ itself in the outer if condition. However, if we change the variable that is tested in that condition, we can make it work.

The general idea is: delegate the responsibility for the first check to some other variable, and set this variable only after pInstance_ has been completely initialized. We can serialize this initialization:

  1. class S
  2. {
  3. private:
  4.    static S*  pInstance_;
  5. public:
  6.    static int val_;
  7.    static S* instance()
  8.    {
  9.        if(val_ == 0)
  10.       {
  11.           lock_mutex();
  12.           if(pInstance_ == 0)
  13.              pInstance_ = new S();
  14.           unlock_mutex();
  15.           MEMORY_BARRIER();
  16.           val_ = 1;
  17.           return pInstance_;
  18.       }
  19.       else
  20.       {
  21.           return pInstance_;
  22.       }
  23.   }
  24. };
  25. //...

Here, the mutex lock and unlock on lines 11 and 14, together with the memory barrier on line 15, guarantee that pInstance_ has been fully initialized before the assignment to val_ is made. Threads may enter the first branch of the if more than once, but once the updated val_ is visible to them, they will no longer do so.

There is, however, still a problem with the above solution. Another thread may perform an unordered read of val_ and pInstance_ in the else block, which means that it may see a non-zero val_ together with an uninitialized or partially initialized value for pInstance_. To fix this, we need to insert a memory barrier instruction, via the #define'd MEMORY_BARRIER macro, in the else block as well, before returning pInstance_:

  1. class S
  2. {
  3. private:
  4.    static S*  pInstance_;
  5. public:
  6.    static int val_;
  7.    static S* instance()
  8.    {
  9.        if(val_ == 0)
  10.       {
  11.           lock_mutex();
  12.           if(pInstance_ == 0)
  13.              pInstance_ = new S();
  14.           unlock_mutex();
  15.           MEMORY_BARRIER();
  16.           val_ = 1;
  17.           return pInstance_;
  18.       }
  19.       else
  20.       {
  21.           MEMORY_BARRIER();
  22.           return pInstance_;
  23.       }
  24.   }
  25. };
  26. //...

For processors with a strong memory ordering model, such as x86, which do not need explicit barriers for this pattern, the macro can expand to nothing.
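For illustration only, MEMORY_BARRIER might be defined along the following lines; the exact built-ins and fence instructions are compiler and platform specific, so treat this as a sketch rather than a portable definition:

// Illustrative sketch only; real definitions are compiler and platform specific.
#if defined(__GNUC__)
    // GCC/Clang built-in that issues a full memory barrier.
    #define MEMORY_BARRIER() __sync_synchronize()
#else
    // On a strongly ordered processor this may expand to nothing; other
    // platforms need the appropriate fence instruction or intrinsic here.
    #define MEMORY_BARRIER()
#endif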

ANOTHER SOLUTION

One very good solution (suggested by?) is to use thread-specific data (TSD), also known as thread-local storage (TLS), for accessing singletons. Here, each thread goes through the mutex lock once, obtains pInstance_, and stores it in its TLS slot. The next time the same thread requests the singleton, it finds a non-null value for the TLS key and returns that value instead. However, one needs to keep in mind that the TSD mechanism on some platforms may be 'slow', and there is a limit on the number of available TSD keys.
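The following is a minimal sketch of this idea using POSIX thread-specific data, reusing the hypothetical lock_mutex()/unlock_mutex() primitives from earlier; error handling and key destruction are omitted for brevity:

#include <pthread.h>

class S
{
private:
    static S*             pInstance_;
    static pthread_key_t  key_;
    static pthread_once_t keyOnce_;

    static void createKey() { pthread_key_create(&key_, 0); }

public:
    static S* instance()
    {
        pthread_once(&keyOnce_, createKey);     // create the TSD key exactly once

        // Fast path: this thread has already cached the pointer in its TSD slot.
        S* p = static_cast<S*>(pthread_getspecific(key_));
        if(p)
            return p;

        // Slow path: taken at most once per thread; acquiring the mutex
        // guarantees that this thread sees a fully constructed object.
        lock_mutex();
        if(!pInstance_)
            pInstance_ = new S();
        unlock_mutex();

        pthread_setspecific(key_, pInstance_);  // cache the pointer for this thread
        return pInstance_;
    }
};

S*             S::pInstance_ = 0;
pthread_key_t  S::key_;
pthread_once_t S::keyOnce_   = PTHREAD_ONCE_INIT;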

CONCLUSION

It is possible to arrive at an 'optimized' solution for double-checked serialized initialization; however, we may need to use a platform-dependent memory barrier instruction at one point. Alternatively, the TSD/TLS mechanism can also be used for this purpose.

ACKNOWLEDGEMENTS

I would like to express my thanks to Dave Butenhof for his invaluable comments.

References

  1. Gamma, E., Helm, R., Johnson, R., Vlissides, J. Design Patterns: Elements of Reusable Object-Oriented Software. Addison-Wesley, 1994.
