• MinHash & LSH

    In my previous post I explained how the MinHash algorithm is used to estimate the Jaccard similarity coefficient, because calculating it precisely each time can be expensive. However, we still need to compare every document to every other document in order to determine which documents are similar (or in more fancy words - to cluster the documents by similarity).
    Doing that has a time complexity of $O(n^2)$, where n is the number of documents we’re comparing.

    We can optimize this process by using the concepts of Locality-sensitive hashing (LSH).

    As always, in order to get something we need to give something, this time the trade-off is between accuracy and performance. Meaning that we’ll get a much better performance (time complexity), but we’ll pay with a bit less accuracy...

  • MinHash Similarity

    Before we can go into how MinHash works, we need to first talk about the Jaacard similarity coefficient.
    The Jaccard similarity coefficient on two sets A and B is defined to be:

    $J(A,B) = \dfrac{|A \cap B|}{|A \cup B|}$

    Important properties of the Jaccard similarity coefficient:

    • If A and B are identical sets then $J(A,B) = 1$
    • If A and B are disjoint sets (have no common element) then $J(A,B) = 0$
    • If A and B are neither equal nor disjoint then $0 \lt J(A,B) \lt 1$
    • The more common elements A and B have, the closer...