So I was browsing through our code base and saw that in several places we have binary searches. Binary search is known to be pretty effective when we are finding things in a sorted array: it has O(lg(n)) running time. Hashtables, on the other hand, have O(1) expected lookup time. For small tables, though, binary search can be faster than a hashtable, since a comparison is cheaper than computing a hash function. At some point, though, we can expect the hashtable to outperform binary search. Let's try to find out where the break-even point is:
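What follows is a minimal sketch of that kind of benchmark, assuming Array.BinarySearch over a sorted int[] versus Dictionary<int, int> lookups; the sizes and lookup counts are arbitrary placeholders, not the exact numbers from my run:

```csharp
using System;
using System.Collections.Generic;
using System.Diagnostics;

class SearchBenchmark
{
    static void Main()
    {
        const int lookups = 1000000;
        var rng = new Random(42);

        for (int size = 2; size <= (1 << 20); size *= 2)
        {
            // Sorted keys for the binary search, and the same keys in a hashtable.
            var keys = new int[size];
            var table = new Dictionary<int, int>(size);
            for (int i = 0; i < size; i++)
            {
                keys[i] = i * 2;        // even, strictly increasing keys
                table[keys[i]] = i;
            }

            // Pre-generate the probes so both searches do identical work.
            var probes = new int[lookups];
            for (int i = 0; i < lookups; i++)
                probes[i] = keys[rng.Next(size)];

            long hits = 0;
            var sw = Stopwatch.StartNew();
            for (int i = 0; i < lookups; i++)
                if (Array.BinarySearch(keys, probes[i]) >= 0) hits++;
            long binaryMs = sw.ElapsedMilliseconds;

            sw.Restart();
            for (int i = 0; i < lookups; i++)
                if (table.ContainsKey(probes[i])) hits++;
            long hashMs = sw.ElapsedMilliseconds;

            Console.WriteLine("{0}\tbinary: {1} ms\thash: {2} ms\t({3})", size, binaryMs, hashMs, hits);
        }
    }
}
```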
Well, it doesn't look like binary search is ever better than the hashtable. Binary search is logarithmic at first, but then it starts growing faster. The hashtable performs pretty consistently at the beginning, and then slowly grows at the end.
The abnormalities are probably due to caching. There are two places where the slope changes, and those probably correspond to different layers of the memory hierarchy: maybe first L2 cache misses, and then L3 cache misses.
I also implemented an iterative binary search:
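A minimal sketch of an iterative version over a sorted int[]:

```csharp
// Returns the index of target in the sorted array, or -1 if it is not present.
static int BinarySearchIterative(int[] sorted, int target)
{
    int lo = 0, hi = sorted.Length - 1;
    while (lo <= hi)
    {
        int mid = lo + (hi - lo) / 2;   // written this way to avoid overflow of (lo + hi)
        if (sorted[mid] == target) return mid;
        if (sorted[mid] < target) lo = mid + 1;
        else hi = mid - 1;
    }
    return -1;
}
```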
A while back, we had an issue with .NET's garbage collector. In particular, Gen 0 collections, which should be super efficient and fast, were slowing down our multi-threaded application.
We had a 24-core machine and an application that needed to do 5000 totally independent tasks. No matter how many threads we threw at it, Task Manager always showed about 200% CPU usage. Digging into the concurrency profiler, we saw that threads were paused all over the place, and the more threads we started, the more time they spent paused. Looking at what they were blocking on, they were all blocking on memory allocation. Every few milliseconds, one thread would trigger a Gen 0 collection and start garbage collecting; the other threads would quickly block as soon as they tried to allocate memory. Then the Gen 0 collection would complete, and all the other threads would continue again. The more threads we had, the more frequent these garbage collection cycles became, which is why our application never exceeded 200% CPU on a 24-core system.
Let's try to reproduce it in a simpler program.
First, let's show that threading does work with a simple program:
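The idea is just to split a fixed amount of pure CPU work across N threads and check that the wall-clock time drops as threads are added. A minimal sketch along those lines (the Work body and iteration count are placeholders I picked, not the original values):

```csharp
using System;
using System.Diagnostics;
using System.Threading;

class ThreadScalingTest
{
    const int TotalIterations = 600000000;   // fixed total work, split across the threads

    // Pure CPU work: no heap allocations, so the garbage collector never gets involved.
    static void Work(int iterations)
    {
        double x = 0;
        for (int i = 0; i < iterations; i++)
            x += Math.Sqrt(i);
    }

    static void Main()
    {
        for (int threadCount = 1; threadCount <= 6; threadCount++)
        {
            int perThread = TotalIterations / threadCount;
            var sw = Stopwatch.StartNew();
            var threads = new Thread[threadCount];
            for (int i = 0; i < threadCount; i++)
            {
                threads[i] = new Thread(() => Work(perThread));
                threads[i].Start();
            }
            foreach (var t in threads) t.Join();
            Console.WriteLine("{0} threads: {1} ms", threadCount, sw.ElapsedMilliseconds);
        }
    }
}
```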
Looks like it's working. So let's add some memory allocations:
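A sketch of the allocating version, under the assumption that each thread churns out throwaway objects, each holding an int[] of the given arraySize (I am calling the class Stuff to match the pooled version later; the allocation counts here are placeholders):

```csharp
using System;
using System.Diagnostics;
using System.Threading;

class Stuff
{
    public int[] Data;
    public Stuff(int arraySize) { Data = new int[arraySize]; }
}

class AllocationTest
{
    const int TotalAllocations = 6000000;   // fixed total work, split across the threads

    static void Work(int allocations, int arraySize)
    {
        for (int i = 0; i < allocations; i++)
        {
            var s = new Stuff(arraySize);   // becomes garbage on every iteration
            s.Data[0] = i;                  // touch the memory so the allocation isn't dead code
        }
    }

    static void Main()
    {
        foreach (int arraySize in new[] { 1, 10, 100, 1000, 10000 })
        {
            for (int threadCount = 1; threadCount <= 6; threadCount++)
            {
                int perThread = TotalAllocations / threadCount;
                var sw = Stopwatch.StartNew();
                var threads = new Thread[threadCount];
                for (int i = 0; i < threadCount; i++)
                {
                    threads[i] = new Thread(() => Work(perThread, arraySize));
                    threads[i].Start();
                }
                foreach (var t in threads) t.Join();
                Console.WriteLine("size {0}, {1} threads: {2} ms", arraySize, threadCount, sw.ElapsedMilliseconds);
            }
        }
    }
}
```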
This time around, I am allocating some memory on the heap. There's probably some magic that .NET is doing for me that burns all those extra CPU cycles at the larger sizes, but the trend I am looking for is clearly there: when the objects are large, scaling up the number of threads does not give me as much of a performance gain as I had hoped.
Let's see if I can tune it to produce the CPU graph I am looking for:
This is my CPU graph while scaling from 1 through 6 threads. We see that CPU usage goes up, but never hits 100%, even when using 6 threads.
Now let's try some of the other garbage collector configurations. Let's enable gcServer:
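For a .NET Framework app, those switches live in the App.config runtime section; the runs below (and the variations further down) just toggle the two attributes:

```xml
<configuration>
  <runtime>
    <!-- server GC: one GC heap and one dedicated GC thread per core -->
    <gcServer enabled="true"/>
    <!-- concurrent/background GC; set enabled="false" for the gcConcurrent-disabled runs below -->
    <gcConcurrent enabled="true"/>
  </runtime>
</configuration>
```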
| arraySize \ threadCount | 1 | 2 | 3 | 4 | 5 | 6 |
|---|---|---|---|---|---|---|
| 1 | 677 | 383 | 282 | 189 | 202 | 255 |
| 10 | 724 | 373 | 279 | 509 | 309 | 273 |
| 100 | 1312 | 696 | 627 | 849 | 1565 | 2317 |
| 1000 | 6329 | 4596 | 4457 | 9292 | 13110 | 12540 |
| 10000 | 49819 | 48821 | 44955 | 50832 | | |
How about gcServer disabled, and gcConcurrent disabled?
| arraySize \ threadCount | 1 | 2 | 3 | 4 | 5 | 6 |
|---|---|---|---|---|---|---|
| 1 | 644 | 400 | 277 | 221 | 213 | 207 |
| 10 | 678 | 380 | 295 | 243 | 223 | 224 |
| 100 | 1055 | 665 | 546 | 516 | 514 | 520 |
| 1000 | 4712 | 3736 | 3692 | 3984 | 3900 | 4052 |
| 10000 | 34827 | 35426 | 43847 | | | |
How about gcServer enabled, and gcConcurrent disabled?
| arraySize \ threadCount | 1 | 2 | 3 | 4 | 5 | 6 |
|---|---|---|---|---|---|---|
| 1 | 675 | 346 | 273 | 192 | 175 | 178 |
| 10 | 710 | 371 | 281 | 211 | 193 | 197 |
| 100 | 1309 | 753 | 588 | 529 | 500 | 562 |
| 1000 | 7537 | 4905 | 4591 | 5654 | 4509 | 4457 |
| 10000 | 49413 | 46509 | 46088 | 46285 | 46737 | |
So it seems that the configurations make some impact, but they are not going to solve our problems completely. Let's see if we can do something better.
What if we try to manage our own memory by maintaining our own list of Stuff objects that can be reused?
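A minimal sketch of one way to do that, assuming a lock-protected Stack<Stuff> per array size, with each array emptied out before it goes back into the pool:

```csharp
using System;
using System.Collections.Generic;

class Stuff
{
    public int[] Data;
    public Stuff(int arraySize) { Data = new int[arraySize]; }
}

// A small hand-rolled pool: Get() reuses an old Stuff when one is available,
// Return() empties the array and puts the object back for the next caller.
class StuffPool
{
    private readonly Stack<Stuff> _pool = new Stack<Stuff>();
    private readonly int _arraySize;

    public StuffPool(int arraySize) { _arraySize = arraySize; }

    public Stuff Get()
    {
        lock (_pool)
        {
            if (_pool.Count > 0) return _pool.Pop();
        }
        return new Stuff(_arraySize);   // pool empty: fall back to a real allocation
    }

    public void Return(Stuff s)
    {
        Array.Clear(s.Data, 0, s.Data.Length);   // empty out the array before reuse
        lock (_pool)
        {
            _pool.Push(s);
        }
    }
}
```

The "don't empty out the array" numbers further down amount to dropping the Array.Clear call from Return().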
The code has gotten more complicated, and the results show it as well:
| arraySize \ threadCount | 1 | 2 | 3 | 4 | 5 | 6 |
|---|---|---|---|---|---|---|
| 1 | 3684 | 2124 | 1510 | 1186 | 1200 | 1196 |
| 10 | 3841 | 2178 | 1497 | 1273 | 1200 | 1162 |
| 100 | 4423 | 2737 | 1782 | 1344 | 1399 | 1256 |
| 1000 | 7941 | 4294 | 2955 | 2364 | 2480 | 2170 |
| 10000 | 60650 | 32238 | 22427 | 18197 | 17902 | 16450 |
But we do notice that this scales fairly well with the number of threads. In fact, 6 threads is almost always better than 4 threads, even though this is only a 4-core machine. Very weird.
I am not going to delve into that weirdness today, because I've come across a solution that scales well but is much slower on small objects, and twice as slow on a single thread for large objects. And just for kicks, here are the times if we don't empty out the array when we reuse it:
| arraySize \ threadCount | 1 | 2 | 3 | 4 | 5 | 6 |
|---|---|---|---|---|---|---|
| 1 | 2674 | 1414 | 1328 | 857 | 801 | 854 |
| 10 | 2678 | 1418 | 997 | 847 | 876 | 886 |
| 100 | 2699 | 1396 | 1105 | 791 | 801 | 764 |
| 1000 | 2714 | 1410 | 1087 | 784 | 853 | 811 |
| 10000 | 2727 | 1412 | 1005 | 870 | 955 | 961 |
From these experiments, we can see that the garbage collector works very well for small objects. However, it has trouble with large objects; for those, we would do well to maintain our own pool of resources. This method seems to be twice as slow, but it scales fairly well on my 4-core system.
I didn't take very good measurements, but the number of Gen 0 collections went from tens of thousands when .NET was managing the memory to just hundreds when I was doing it myself. And when large objects were an issue, the number of Gen 1 and Gen 2 collections went from single digits into the thousands.
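For reference, those counts can be read with GC.CollectionCount before and after a run; RunBenchmark here is just a hypothetical stand-in for whichever variant is being measured:

```csharp
int gen0Before = GC.CollectionCount(0);
int gen1Before = GC.CollectionCount(1);
int gen2Before = GC.CollectionCount(2);

RunBenchmark();   // placeholder for the allocation test or the pooled version

Console.WriteLine("gen0: {0}, gen1: {1}, gen2: {2}",
                  GC.CollectionCount(0) - gen0Before,
                  GC.CollectionCount(1) - gen1Before,
                  GC.CollectionCount(2) - gen2Before);
```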