A .NET implementation of multiple index hashing by Norouzi et al, based on a description in the threatexchange repository
Multiple Index Hashing (MIH) is a relatively lightweight means for accelerating lookups for fuzzy hashes (e.g. PhotoDNA and PDQ) within a pre-defined hamming distance. Instead of a linear search through every record, multiple indices are made for separate windows/slots within each hash. The threatexchange document (linked above) provides a good, 'plain English' description for those who (like me) struggle with mathematical terminology and notation.
Available on NuGet
Install-Package netMIH
The constructor can be used for custom implementations of match thresholds, window sizes etc, but otherwise, for PDQ, Default values for MIHIndex constructor and train() method are set for PDQ hash, using 32 bit match threshold (30 being non divisible by 8).
using netMIH
// Uses pre-configured values for PDQ hash
var index = new Index(Index.Configuration.PDQ);
var hashes = new string[] {"358c86641a5269ab5b0db5f1b2315c1642cef9652c39b6ced9f646d91f071927"};
//add entries to the index
index.Update(hashes, "ignorable");
// TRAIN the index - YOU NEED TO DO THIS BEFORE QUERYING!
index.train()
//query index for identical hashes
var results = index.Query("fc4d8e2130177f8f6ce2a03bd27fa8e6b1067a1ac8f0068037215df6491eee1f", 0);
//query index for entries with hamming distance of 1 or less
var results = index.Query("fc4d8e2130177f8f6ce2a03bd27fa8e6b1067a1ac8f0068037215df6491eee1f", 1);
// query index for entries with hamming distance of 34
// this is bigger than default match threshold of 32 for PDQ, so will utilise linear lookup.
// This will be slower than MIH and performance will get worse the bigger the dataset gets
var results = index.Query("fc4d8e2130177f8f6ce2a03bd27fa8e6b1067a1ac8f0068037215df6491eee1f", 34);
This is released under an MIT licence.