As I did my research on HyperLogLog, I used the built-in MD5 HashAlgorithm
.
As I was running my tests on larger and larger datasets, the running time grew too much to my liking.
A hash function is any function that can be used to map data of arbitrary size to fixed-size values. The values returned by a hash function are called hash values, hash codes, digests, or simply hashes. The values are used to index a fixed-size table called a hash table. Use of a hash function to index a hash table is called hashing or scatter storage addressing.
Since HyperLogLog doesn’t require a cryptographic hash function, MD5 is a bit of an overkill. I went hunting for a faster option. This is when I encountered MurmurHash.
MurmurHash is a non-cryptographic hash function suitable for general hash-based lookup. Luckily for me, there’s already a very comprehensive implementation of it in .Net, with the source code available in GitHub (https://github.com/darrenkopp/murmurhash-net) and as a NuGet package (https://www.nuget.org/packages/murmurhash/).
Murmur3 turned out to be much faster than MD5, allowing me to run my tests with much larger datasets, and faster running time.
While Murmur3 doesn’t have too many collisions, it is not designed to be difficult to reverse by an adversary, making it unsuitable for cryptographic purposes.