Monday, October 14, 2013

An Emperical Look at .NET's Gen 0 Garbage Collection

A while back, we had an issue with .NET's garbage collector. In particular, Gen 0, which should be super efficient and fast, was slowing down our multi-threaded application.

We had a 24-core machine, and an application that needs to do 5000 totally independent tasks. No matter who many threads we threw at it, task manager always showed about 200% CPU usage. Digging into the concurrent profiler, we saw that many threads were paused all over the place. The more threads we started, the more time they paused. Looking at what they were blocking on, they were all blocking on memory allocation. Every few milliseconds, one thread would enter Gen 0, and start garbage collecting. The other threads would quickly block when they tried to allocate some memory. Then Gen 0 would complete, and all other threads would continue again. The more threads we had, the more frequent these garbage collection cycles became, causing our application to never exceed 200% CPU on a 24-core system.

Let's try to reproduce it in a simpler program.

First, let's show that threading does work with a simple program:
And the output on my 4-core machine:














Looks like it's working. So let's add some memory allocations:

Here are the results:
threadCount
arraySize 1 2 3 4 5 6
1 658 376 278 223 208 211
10 677 392 282 241 227 237
100 1049 700 581 535 526 544
1000 4607 3826 3895 3918 4289 4399
10000 34185 35658 41861 46026 49758 52625

This time around, I am allocating some memory on the heap. There's probably some magic that .NET is doing for me to burn all those extra CPU cycles for larger sizes, but the trend I am looking for is clearly there. When the objects are large, scaling the number of threads does not give me as much of a performance gain as I had hoped.

Let's see if I can tune it for that CPU graph that I am looking for:
This is my CPU graph while scaling through the 6 threads. We see that the CPU usage goes up, but never hits 100% even when using 6 threads.



Now let's try some of the other garbage collector configurations. Let's enable gcServer:
threadCount
arraySize 1 2 3 4 5 6
1 677 383 282 189 202 255
10 724 373 279 509 309 273
100 1312 696 627 849 1565 2317
1000 6329 4596 4457 9292 13110 12540
10000 49819 48821 44955 50832
How about gcServer disabled, and gcConcurrent disabled?
threadCount
arraySize 1 2 3 4 5 6
1 644 400 277 221 213 207
10 678 380 295 243 223 224
100 1055 665 546 516 514 520
1000 4712 3736 3692 3984 3900 4052
10000 34827 35426 43847
How about gcServer enabled, and gcConcurrent disabled?
threadCount
arraySize 1 2 3 4 5 6
1 675 346 273 192 175 178
10 710 371 281 211 193 197
100 1309 753 588 529 500 562
1000 7537 4905 4591 5654 4509 4457
10000 49413 46509 46088 46285 46737

So it seems that the configurations make some impact, but they are not going to solve our problems completely. Let's see if we can do something better.

What if we try to manage our own memory by maintaining our own list of Stuff objects that can be reused?

The code has gotten more complicated, and the results show it as well:
threadCount
arraySize 1 2 3 4 5 6
1 3684 2124 1510 1186 1200 1196
10 3841 2178 1497 1273 1200 1162
100 4423 2737 1782 1344 1399 1256
1000 7941 4294 2955 2364 2480 2170
10000 60650 32238 22427 18197 17902 16450

But we do notice that this scales fairly well with the number of threads. In fact, 6 threads is almost always better than 4 threads. Very weird.

I am not going to delve into the weirdness today because I've come across a solution that scales well, but is much slower on small objects, and twice as slow on a single thread for large objects. And just for kicks, here are the times if we don't empty out the array when we reuse it:
threadCount
arraySize 1 2 3 4 5 6
1 2674 1414 1328 857 801 854
10 2678 1418 997 847 876 886
100 2699 1396 1105 791 801 764
1000 2714 1410 1087 784 853 811
10000 2727 1412 1005 870 955 961
From these experiments, we can see that the garbage collector works very well for small objects. However, it has trouble with large objects. For large objects, we would do well maintaining our own pool of resources. This method seems to be twice as slow, but it scales fairly well on my 4 core system.

I didn't take very good measurements, but the number of gen 0 collections went from 10's of thousands when .NET was managing the memory, to just hundreds when I was doing it myself. When large objects were an issue, the number of gen 1 and 2 collections went from the single digits to thousands.

No comments:

Post a Comment