MinHash is a simple but effective algorithm for estimating set similarity using the Jaccard index. Both the Wikipedia entry and this blog post are good explanations of how it works.

MinHash is attractive because it allows us to decide how similar two sets are without having to enumerate all of their elements. If we want to know how many users that performed action AA also performed action BB, we can compare the MinHashes of the two sets instead of keeping track of multiple sets of millions of user ids. This is not only faster, but also has a fixed memory footprint.

MinHash is also extremely simple to implement: all we need is a set of kk hash functions, and a way of keeping track of the minimum value encountered for each hash function. The kk parameter gives us a way of trading off precision and efficiency: we get higher accuracy with higher kk, but it takes longer to process new data points and the hashes themselves occupy more memory.

The following Python implementation uses the built-in hash function and kk bitwise XOR masks for hashing, and is sufficiently fast even for high kk (unless you really have a lot of data).

class MinHash(object):

    def __init__(self, k, seed=10):

        self._k = k
        self._seed = seed

        minint = np.iinfo(np.int64).min
        maxint = np.iinfo(np.int64).max

        self._masks = (np.random.RandomState(seed=self._seed)
                       .randint(minint, maxint, self._k))

        self._hashes = np.empty(self._k, dtype=np.int64)
        self._hashes.fill(maxint)

    def add(self, v):

        hashes = np.bitwise_xor(self._masks, hash(v))

        self._hashes = np.minimum(self._hashes, hashes)

    def jaccard(self, other):

        if np.any(self.masks != other._masks):
            raise Exception('Can only calculate similarity '
                            'between MinHashes with the same hash '
                            'functions.')

        return (self._hashes == other._hashes).sum() / float(self._k)