John Tseng: October 2013

A while back, we had an issue with .NET's garbage collector. In particular, Gen 0, which should be super efficient and fast, was slowing down our multi-threaded application.

We had a 24-core machine, and an application that needs to do 5000 totally independent tasks. No matter who many threads we threw at it, task manager always showed about 200% CPU usage. Digging into the concurrent profiler, we saw that many threads were paused all over the place. The more threads we started, the more time they paused. Looking at what they were blocking on, they were all blocking on memory allocation. Every few milliseconds, one thread would enter Gen 0, and start garbage collecting. The other threads would quickly block when they tried to allocate some memory. Then Gen 0 would complete, and all other threads would continue again. The more threads we had, the more frequent these garbage collection cycles became, causing our application to never exceed 200% CPU on a 24-core system.

Let's try to reproduce it in a simpler program.

First, let's show that threading does work with a simple program:
And the output on my 4-core machine:

Looks like it's working. So let's add some memory allocations:

Here are the results:


 
 

  
  threadCount
 

  arraySize
  1
  2
  3
  4
  5
  6
 

  1
  658
  376
  278
  223
  208
  211
 

  10
  677
  392
  282
  241
  227
  237
 

  100
  1049
  700
  581
  535
  526
  544
 

  1000
  4607
  3826
  3895
  3918
  4289
  4399
 

  10000
  34185
  35658
  41861
  46026
  49758
  52625

This time around, I am allocating some memory on the heap. There's probably some magic that .NET is doing for me to burn all those extra CPU cycles for larger sizes, but the trend I am looking for is clearly there. When the objects are large, scaling the number of threads does not give me as much of a performance gain as I had hoped.

Let's see if I can tune it for that CPU graph that I am looking for:

This is my CPU graph while scaling through the 6 threads. We see that the CPU usage goes up, but never hits 100% even when using 6 threads.

Now let's try some of the other garbage collector configurations. Let's enable gcServer:


 
 

  
  threadCount
 

  arraySize
  1
  2
  3
  4
  5
  6
 

  1
  677
  383
  282
  189
  202
  255
 

  10
  724
  373
  279
  509
  309
  273
 

  100
  1312
  696
  627
  849
  1565
  2317
 

  1000
  6329
  4596
  4457
  9292
  13110
  12540
 

  10000
  49819
  48821
  44955
  50832

How about gcServer disabled, and gcConcurrent disabled?


 
 

  
  threadCount
 

  arraySize
  1
  2
  3
  4
  5
  6
 

  1
  644
  400
  277
  221
  213
  207
 

  10
  678
  380
  295
  243
  223
  224
 

  100
  1055
  665
  546
  516
  514
  520
 

  1000
  4712
  3736
  3692
  3984
  3900
  4052
 

  10000
  34827
  35426
  43847

How about gcServer enabled, and gcConcurrent disabled?


 
 

  
  threadCount
 

  arraySize
  1
  2
  3
  4
  5
  6
 

  1
  675
  346
  273
  192
  175
  178
 

  10
  710
  371
  281
  211
  193
  197
 

  100
  1309
  753
  588
  529
  500
  562
 

  1000
  7537
  4905
  4591
  5654
  4509
  4457
 

  10000
  49413
  46509
  46088
  46285
  46737

So it seems that the configurations make some impact, but they are not going to solve our problems completely. Let's see if we can do something better.

What if we try to manage our own memory by maintaining our own list of Stuff objects that can be reused?

The code has gotten more complicated, and the results show it as well:


 
 

  
  threadCount
  
  
  
  
  
 

  arraySize
  1
  2
  3
  4
  5
  6
 

  1
  3684
  2124
  1510
  1186
  1200
  1196
 

  10
  3841
  2178
  1497
  1273
  1200
  1162
 

  100
  4423
  2737
  1782
  1344
  1399
  1256
 

  1000
  7941
  4294
  2955
  2364
  2480
  2170
 

  10000
  60650
  32238
  22427
  18197
  17902
  16450

But we do notice that this scales fairly well with the number of threads. In fact, 6 threads is almost always better than 4 threads. Very weird.

I am not going to delve into the weirdness today because I've come across a solution that scales well, but is much slower on small objects, and twice as slow on a single thread for large objects. And just for kicks, here are the times if we don't empty out the array when we reuse it:


 
 

  
  threadCount
  
  
  
  
  
 

  arraySize
  1
  2
  3
  4
  5
  6
 

  1
  2674
  1414
  1328
  857
  801
  854
 

  10
  2678
  1418
  997
  847
  876
  886
 

  100
  2699
  1396
  1105
  791
  801
  764
 

  1000
  2714
  1410
  1087
  784
  853
  811
 

  10000
  2727
  1412
  1005
  870
  955
  961

From these experiments, we can see that the garbage collector works very well for small objects. However, it has trouble with large objects. For large objects, we would do well maintaining our own pool of resources. This method seems to be twice as slow, but it scales fairly well on my 4 core system.


I didn't take very good measurements, but the number of gen 0 collections went from 10's of thousands when .NET was managing the memory, to just hundreds when I was doing it myself. When large objects were an issue, the number of gen 1 and 2 collections went from the single digits to thousands.

John Tseng

Tuesday, October 15, 2013

Hash Lookup vs Binary Search

Monday, October 14, 2013

An Emperical Look at .NET's Gen 0 Garbage Collection

	threadCount
arraySize	1	2	3	4	5	6
1	658	376	278	223	208	211
10	677	392	282	241	227	237
100	1049	700	581	535	526	544
1000	4607	3826	3895	3918	4289	4399
10000	34185	35658	41861	46026	49758	52625

	threadCount
arraySize	1	2	3	4	5	6
1	677	383	282	189	202	255
10	724	373	279	509	309	273
100	1312	696	627	849	1565	2317
1000	6329	4596	4457	9292	13110	12540
10000	49819	48821	44955	50832

	threadCount
arraySize	1	2	3	4	5	6
1	644	400	277	221	213	207
10	678	380	295	243	223	224
100	1055	665	546	516	514	520
1000	4712	3736	3692	3984	3900	4052
10000	34827	35426	43847

	threadCount
arraySize	1	2	3	4	5	6
1	675	346	273	192	175	178
10	710	371	281	211	193	197
100	1309	753	588	529	500	562
1000	7537	4905	4591	5654	4509	4457
10000	49413	46509	46088	46285	46737

	threadCount
arraySize	1	2	3	4	5	6
1	3684	2124	1510	1186	1200	1196
10	3841	2178	1497	1273	1200	1162
100	4423	2737	1782	1344	1399	1256
1000	7941	4294	2955	2364	2480	2170
10000	60650	32238	22427	18197	17902	16450

	threadCount
arraySize	1	2	3	4	5	6
1	2674	1414	1328	857	801	854
10	2678	1418	997	847	876	886
100	2699	1396	1105	791	801	764
1000	2714	1410	1087	784	853	811
10000	2727	1412	1005	870	955	961