The Jaccard similarity has been widely used in search and machine learning. For binary (0/1) data, it is often called “resemblance” and the method of min-wise hashing has been the standard practice for computing resemblance in massive data. For Jaccard similarity in general weighted data, the commonly used sampling algorithm is the Consistent Weighted Sampling (CWS). A convenient (and perhaps mysterious) implementation of CWS is the so-called “0-bit CWS” which we refer to as the “relaxed CWS” and was purely an empirical observation without any theoretical justification. The difficulty is due to the highly complicated probability problem of “relaxed CWS”. In this paper, we propose using extremal process to generate samples for estimating Jaccard similarity. Surprisingly, this scheme makes possible to analyze the “relaxed ES” variant. Through novel probability endeavours, we can rigorously compute the bias of “relaxed ES” which explains why it works so well and when it does not in extreme corner cases. Interestingly, compared with CWS, the resultant algorithm only involves counting and does not need sophisticated mathematical operations (as required by CWS). It is not surprising that the proposed “Extremal Sampling” (ES) is actually noticeably faster than CWS. Although ES is different from CWS (and other previous algorithms for Jaccard similarity), in retrospect it is closely related to CWS. This paper provides the insight which connects CWS with extremal processes. This insight will help understand CWS (and variants) and help develop new algorithms for similarity estimation.

The Web Conference is announcing latest news and developments biweekly or on a monthly basis. We respect The General Data Protection Regulation 2016/679.