A while back, we had an issue with .NET's garbage collector. In particular, Gen 0, which should be super efficient and fast, was slowing down our multi-threaded application.
We had a 24-core machine, and an application that needs to do 5000 totally independent tasks. No matter who many threads we threw at it, task manager always showed about 200% CPU usage. Digging into the concurrent profiler, we saw that many threads were paused all over the place. The more threads we started, the more time they paused. Looking at what they were blocking on, they were all blocking on memory allocation. Every few milliseconds, one thread would enter Gen 0, and start garbage collecting. The other threads would quickly block when they tried to allocate some memory. Then Gen 0 would complete, and all other threads would continue again. The more threads we had, the more frequent these garbage collection cycles became, causing our application to never exceed 200% CPU on a 24-core system.
Let's try to reproduce it in a simpler program.
First, let's show that threading does work with a simple program:
And the output on my 4-core machine:
Looks like it's working. So let's add some memory allocations:
Here are the results:
|
threadCount |
arraySize |
1 |
2 |
3 |
4 |
5 |
6 |
1 |
658 |
376 |
278 |
223 |
208 |
211 |
10 |
677 |
392 |
282 |
241 |
227 |
237 |
100 |
1049 |
700 |
581 |
535 |
526 |
544 |
1000 |
4607 |
3826 |
3895 |
3918 |
4289 |
4399 |
10000 |
34185 |
35658 |
41861 |
46026 |
49758 |
52625 |
This time around, I am allocating some memory on the heap. There's probably some magic that .NET is doing for me to burn all those extra CPU cycles for larger sizes, but the trend I am looking for is clearly there. When the objects are large, scaling the number of threads does not give me as much of a performance gain as I had hoped.
Let's see if I can tune it for that CPU graph that I am looking for:
This is my CPU graph while scaling through the 6 threads. We see that the CPU usage goes up, but never hits 100% even when using 6 threads.
Now let's try some of the other garbage collector configurations. Let's enable gcServer:
|
threadCount |
arraySize |
1 |
2 |
3 |
4 |
5 |
6 |
1 |
677 |
383 |
282 |
189 |
202 |
255 |
10 |
724 |
373 |
279 |
509 |
309 |
273 |
100 |
1312 |
696 |
627 |
849 |
1565 |
2317 |
1000 |
6329 |
4596 |
4457 |
9292 |
13110 |
12540 |
10000 |
49819 |
48821 |
44955 |
50832 |
|
|
How about gcServer disabled, and gcConcurrent disabled?
|
threadCount |
arraySize |
1 |
2 |
3 |
4 |
5 |
6 |
1 |
644 |
400 |
277 |
221 |
213 |
207 |
10 |
678 |
380 |
295 |
243 |
223 |
224 |
100 |
1055 |
665 |
546 |
516 |
514 |
520 |
1000 |
4712 |
3736 |
3692 |
3984 |
3900 |
4052 |
10000 |
34827 |
35426 |
43847 |
|
|
|
How about gcServer enabled, and gcConcurrent disabled?
|
threadCount |
arraySize |
1 |
2 |
3 |
4 |
5 |
6 |
1 |
675 |
346 |
273 |
192 |
175 |
178 |
10 |
710 |
371 |
281 |
211 |
193 |
197 |
100 |
1309 |
753 |
588 |
529 |
500 |
562 |
1000 |
7537 |
4905 |
4591 |
5654 |
4509 |
4457 |
10000 |
49413 |
46509 |
46088 |
46285 |
46737 |
|
So it seems that the configurations make some impact, but they are not going to solve our problems completely. Let's see if we can do something better.
What if we try to manage our own memory by maintaining our own list of Stuff objects that can be reused?
The code has gotten more complicated, and the results show it as well:
|
threadCount |
|
|
|
|
|
arraySize |
1 |
2 |
3 |
4 |
5 |
6 |
1 |
3684 |
2124 |
1510 |
1186 |
1200 |
1196 |
10 |
3841 |
2178 |
1497 |
1273 |
1200 |
1162 |
100 |
4423 |
2737 |
1782 |
1344 |
1399 |
1256 |
1000 |
7941 |
4294 |
2955 |
2364 |
2480 |
2170 |
10000 |
60650 |
32238 |
22427 |
18197 |
17902 |
16450 |
But we do notice that this scales fairly well with the number of threads. In fact, 6 threads is almost always better than 4 threads. Very weird.
I am not going to delve into the weirdness today because I've come across a solution that scales well, but is much slower on small objects, and twice as slow on a single thread for large objects. And just for kicks, here are the times if we don't empty out the array when we reuse it:
|
threadCount |
|
|
|
|
|
arraySize |
1 |
2 |
3 |
4 |
5 |
6 |
1 |
2674 |
1414 |
1328 |
857 |
801 |
854 |
10 |
2678 |
1418 |
997 |
847 |
876 |
886 |
100 |
2699 |
1396 |
1105 |
791 |
801 |
764 |
1000 |
2714 |
1410 |
1087 |
784 |
853 |
811 |
10000 |
2727 |
1412 |
1005 |
870 |
955 |
961 |
From these experiments, we can see that the garbage collector works very well for small objects. However, it has trouble with large objects. For large objects, we would do well maintaining our own pool of resources. This method seems to be twice as slow, but it scales fairly well on my 4 core system.
I didn't take very good measurements, but the number of gen 0 collections went from 10's of thousands when .NET was managing the memory, to just hundreds when I was doing it myself. When large objects were an issue, the number of gen 1 and 2 collections went from the single digits to thousands.