The funny thing was that this only happened occasionally, and without a good pattern. I had debugged the application several times, but it never had this issue while debugging. Finally, I was able to debug the application while it was hanging. Pausing all the threads, I noticed that they were all waiting for a single SyncLock.
A little background for this SyncLock: we have a bunch of databases that change frequently, so we store the database locations in one central database. To avoid hitting this database all the time, we cache the result in the application's memory. Whenever the application needs a connection string, it would first try the cache. If it's not there, then it goes back to the database to reload the cache. While the cache is being refreshed, it can't serve other threads looking for a connection string, so there's a SyncLock guarding this section.
This works great when we expect all our requests to exist. However, we recently added a new algorithm that checks to see if a connection string exists or not. Most of the time, it does not. This means that the application is constantly refreshing its cache. While it's refreshing, no other threads can access any databases, because the code that gets a connection string is guarded by that SyncLock. Thus, it looks like our application is hanging. The application would've eventually finished its job, but it would've taken a long long time.
This made me curious. Does SyncLock not serve requests in FIFO order? Can a thread be starved while waiting for a SyncLock? The answer appears to be yes. Here's the code to reproduce this.
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Module Module1 | |
Sub Main() | |
Dim lock As New Object | |
Dim threadCount = 30 | |
Dim lastAcquiredTime = Enumerable.Range(0, threadCount).Select(Function(__) DateTime.Now).ToArray | |
Enumerable.Range(0, threadCount) _ | |
.AsParallel.WithDegreeOfParallelism(threadCount) _ | |
.ForAll(Sub(i) | |
While True | |
SyncLock lock | |
Threading.Thread.Sleep(100) | |
Dim now = DateTime.Now | |
Dim longestStarvation = lastAcquiredTime.Select(Function(l) now - l).Max | |
Console.WriteLine(String.Format("thread {0,2} waited for {1,4}. maximum starvation time: {2,4}", | |
i, | |
(DateTime.Now - lastAcquiredTime(i)).TotalSeconds, | |
longestStarvation.TotalSeconds)) | |
lastAcquiredTime(i) = DateTime.Now | |
End SyncLock | |
End While | |
End Sub) | |
End Sub | |
End Module |
Unfortunately, this isn't a super reliable way of producing the situation. We do see, however, that the maximum starvation time occasionally is much larger than it should be. Most of the time the threads are served in near FIFO order, so our application didn't always have an issue. However, sometimes, the threads are served in some other ordering, so our application did appear to be waiting forever for that lock.