Friday, November 22, 2013

SyncLock Starvation

We had an issue in production that was particularly hard to track down. One of our applications appeared to hang while doing very little work. We figured out that it was running a batch of queries and stalling there. Looking at SQL Server, we saw that the connection was always waiting on async network IO, but each wait was short (less than 2 seconds). Yet our application would hang for hours at a time.

The funny thing was that this only happened occasionally, with no discernible pattern. I had debugged the application several times, but the issue never showed up while debugging. Finally, I was able to debug the application while it was hanging. Pausing all the threads, I noticed that they were all waiting for a single SyncLock.

A little background for this SyncLock: we have a bunch of databases that change frequently, so we store the database locations in one central database. To avoid hitting this database all the time, we cache the result in the application's memory. Whenever the application needs a connection string, it first tries the cache. If the entry is not there, it goes back to the central database and reloads the cache. While the cache is being refreshed, it can't serve other threads looking for a connection string, so a SyncLock guards this section.
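
The pattern described above can be sketched roughly as follows. This is a minimal illustration, not our production code: the class name `ConnectionStringCache`, the `loadAll` delegate (standing in for the query against the central database), and the `ReloadCount` counter are all hypothetical.

```vbnet
Imports System
Imports System.Collections.Generic

' Hypothetical sketch of a lock-guarded connection string cache.
Public Class ConnectionStringCache
    Private ReadOnly _gate As New Object()
    Private _cache As New Dictionary(Of String, String)()

    ' Assumed stand-in for the query that reads all locations
    ' from the central database.
    Private ReadOnly _loadAll As Func(Of Dictionary(Of String, String))

    ' Counts reloads so the cost of cache misses is visible.
    Public ReloadCount As Integer = 0

    Public Sub New(loadAll As Func(Of Dictionary(Of String, String)))
        _loadAll = loadAll
    End Sub

    ' Returns Nothing when the name is unknown even after a reload.
    Public Function GetConnectionString(name As String) As String
        SyncLock _gate
            Dim value As String = Nothing
            If _cache.TryGetValue(name, value) Then
                Return value
            End If

            ' Cache miss: reload everything while still holding the lock,
            ' so every other caller has to wait until the reload finishes.
            _cache = _loadAll()
            ReloadCount += 1

            ' value stays Nothing if the name is still missing.
            _cache.TryGetValue(name, value)
            Return value
        End SyncLock
    End Function
End Class
```

Note that in this sketch a name that does not exist in the central database triggers a full reload on every call, with the lock held the whole time.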

This works great when we expect every requested connection string to exist. However, we recently added a new algorithm that checks whether a connection string exists. Most of the time, it does not. Every such miss triggers a reload, so the application is constantly refreshing its cache. While it's refreshing, no other thread can access any database, because the code that gets a connection string is guarded by that SyncLock. Thus, it looks like our application is hanging. The application would have eventually finished its job, but it would have taken a very long time.

This made me curious. Does SyncLock not serve requests in FIFO order? Can a thread be starved while waiting for a SyncLock? The answer appears to be yes. Here's the code to reproduce this.
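
A minimal sketch of such a test, under stated assumptions, might look like the following. Several threads repeatedly compete for one SyncLock, and each records how long it waited to get in. The module name, the `Run` helper, the thread and iteration counts, and the `Thread.Sleep(1)` that stands in for the cache reload are all illustrative choices, and the FIFO figure it prints is only a rough estimate.

```vbnet
Imports System
Imports System.Diagnostics
Imports System.Threading

Module SyncLockStarvationDemo
    Private ReadOnly Gate As New Object()

    ' Longest wait seen by any thread; only touched while holding Gate.
    Private MaxWaitMs As Long = 0

    Private Sub Worker(iterations As Integer)
        For i As Integer = 1 To iterations
            Dim sw As Stopwatch = Stopwatch.StartNew()
            SyncLock Gate
                ' Time spent waiting to enter the lock.
                Dim waited As Long = sw.ElapsedMilliseconds
                If waited > MaxWaitMs Then MaxWaitMs = waited
                Thread.Sleep(1) ' Stand-in for the guarded work.
            End SyncLock
        Next
    End Sub

    ' Runs the contention test and returns the longest wait in milliseconds.
    Public Function Run(threadCount As Integer, iterations As Integer) As Long
        MaxWaitMs = 0
        Dim threads(threadCount - 1) As Thread
        For t As Integer = 0 To threadCount - 1
            threads(t) = New Thread(New ThreadStart(Sub() Worker(iterations)))
            threads(t).Start()
        Next
        For Each th As Thread In threads
            th.Join()
        Next
        Return MaxWaitMs
    End Function

    Sub Main()
        Dim threadCount As Integer = 8
        Dim iterations As Integer = 200
        Dim total As Stopwatch = Stopwatch.StartNew()
        Dim maxWait As Long = Run(threadCount, iterations)

        ' The lock is held almost continuously, so total time divided by
        ' the number of acquisitions approximates the average hold time.
        Dim avgHoldMs As Double = total.ElapsedMilliseconds / (threadCount * iterations)
        Console.WriteLine("Approximate wait under FIFO: {0:F1} ms", avgHoldMs * (threadCount - 1))
        Console.WriteLine("Maximum observed wait: {0} ms", maxWait)
    End Sub
End Module
```

If the lock were strictly FIFO, each thread would wait for roughly the other threads' turns and no longer. A maximum that is many times the estimate suggests a thread was passed over repeatedly.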


Unfortunately, this is not a very reliable way to reproduce the situation. We do see, however, that the maximum starvation time is occasionally much larger than it should be. Most of the time the threads are served in near-FIFO order, which is why our application did not always have the issue. Sometimes, however, the threads are served in a different order and a thread can be passed over repeatedly, which is when our application appeared to wait forever for that lock.
