MinHash is a simple but effective algorithm for estimating set similarity using the Jaccard index. Both the Wikipedia entry and this blog post are good explanations of how it works.
MinHash is attractive because it allows us to decide how similar two sets are without having to enumerate all of their elements. If we want to know how many users that performed action $A$ also performed action $B$, we can compare the MinHashes of the two sets instead of keeping track of multiple sets of millions of user ids. This is not only faster, but also has a fixed memory footprint.
MinHash is also extremely simple to implement: all we need is a set of $k$ hash functions, and a way of keeping track of the minimum value encountered for each hash function. The $k$ parameter gives us a way of trading off precision and efficiency: we get higher accuracy with higher $k$, but it takes longer to process new data points and the hashes themselves occupy more memory.
The following Python implementation uses the built-in hash function and $k$ bitwise XOR masks for hashing, and is sufficiently fast even for high $k$ (unless you really have a lot of data).
class MinHash(object):
def __init__(self, k, seed=10):
self._k = k
self._seed = seed
minint = np.iinfo(np.int64).min
maxint = np.iinfo(np.int64).max
self._masks = (np.random.RandomState(seed=self._seed)
.randint(minint, maxint, self._k))
self._hashes = np.empty(self._k, dtype=np.int64)
self._hashes.fill(maxint)
def add(self, v):
hashes = np.bitwise_xor(self._masks, hash(v))
self._hashes = np.minimum(self._hashes, hashes)
def jaccard(self, other):
if np.any(self.masks != other._masks):
raise Exception('Can only calculate similarity '
'between MinHashes with the same hash '
'functions.')
return (self._hashes == other._hashes).sum() / float(self._k)