Tuesday, October 15, 2013

Hash Lookup vs Binary Search

So I was browsing through our code base, and saw that in several places, we have some binary searches. Binary search is known to be pretty effective when we are finding things in a sorted array. It has O(lg(n)) running time. Hashtables, on the other hand, have O(1) running time. For small tables, though, binary search can be faster than hashtables since a comparison is cheaper than computing a hash function. At some point, though, we can expect the hashtable to out perform binary search. Let's try and find out what the break-even point is:
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Threading.Tasks;
namespace HashVsBinary
{
class Program
{
static void Main(string[] args)
{
var r = new Random();
var targets = Enumerable.Range(0, 1000 * 1000).Select(_ => r.Next(int.MaxValue)).ToList();
for (int totalCount = 1; totalCount < 1000*1000*10; totalCount*=2)
{
var a = Enumerable.Range(0, totalCount).Select(_ => r.Next(int.MaxValue)).Distinct().Select(v => new thing(v)).OrderBy(t => t.value).ToArray();
var d = a.ToDictionary(t => t.value);
var watch = new System.Diagnostics.Stopwatch();
{
watch.Start();
var found = targets.Select(t => BinarySearch(t, a)).Where(t => t != null).Count();
watch.Stop();
Console.WriteLine(string.Format("found {0} things out of {2} in {1} ms with binary search", found, watch.ElapsedMilliseconds, a.Length));
}
{
watch.Restart();
var found = targets.Select(t => HashSearch(t, d)).Where(t => t != null).Count();
watch.Stop();
Console.WriteLine(string.Format("found {0} things out of {2} in {1} ms with hash search", found, watch.ElapsedMilliseconds, d.Keys.Count));
}
}
Console.ReadLine();
}
static thing HashSearch(int needle, Dictionary<int, thing> hash)
{
if (!hash.ContainsKey(needle))
return null;
return hash[needle];
}
static thing BinarySearch(int needle, thing[] sortedHaystack)
{
return BinarySearch(needle, sortedHaystack, 0, sortedHaystack.Length - 1);
}
static thing BinarySearch(int needle, thing[] sortedHaystack, int minimum, int maximum)
{
if (minimum > maximum)
return null;
var middle = (minimum + maximum) / 2;
if (needle == sortedHaystack[middle].value)
return sortedHaystack[middle];
if (needle < sortedHaystack[middle].value)
return BinarySearch(needle, sortedHaystack, minimum, middle - 1);
return BinarySearch(needle, sortedHaystack, middle + 1, maximum);
}
class thing
{
public int value;
public thing(int v)
{
value = v;
}
}
}
}
view raw gistfile1.cs hosted with ❤ by GitHub
And here's the results:
Well, it doesn't look like binary search is ever better than the hashtable. Binary search is logarithmic at first, but then it starts growing faster. The hashtable performs pretty consistently at the beginning, and then slowly grows at the end.

The abnormalities are probably due to caching. There's two places where the slope changes. Those probably correspond to different layers of memory. Maybe first an L2 cache miss, and then an L3 cache miss.

I also implemented an iterative binary search:
static thing BinarySearch(int needle, thing[] sortedHaystack)
{
var min = 0;
var max = sortedHaystack.Length - 1;
while (min <= max)
{
var middle = (min + max) / 2;
if (needle == sortedHaystack[middle].value)
return sortedHaystack[middle];
if (needle < sortedHaystack[middle].value)
{
max = middle - 1;
}
else
{
min = middle + 1;
}
}
return null;
}
view raw gistfile1.cs hosted with ❤ by GitHub

It did not perform significantly better. Hashtable was still always faster.

No comments:

Post a Comment