Automated analysis of privacy policies has proved a fruitful research direction, with developments such as automated policy summarization, question answering systems, and compliance detection. So far, prior research has been limited to analysis of privacy policies from a single point in time or from short spans of time, as researchers did not have access to a large-scale, longitudinal, curated dataset. To address this gap, we developed a crawler that discovers, downloads, and extracts archived privacy policies from the Internet Archive’s Wayback Machine. Using the crawler and following a series of validation and quality control steps, we curated a dataset of 1,071,488 English-language privacy policies, spanning over two decades and over 130,000 distinct websites. Our analyses of the data paint a troubling picture of the transparency and accessibility of privacy policies. We find evidence that the self-regulation industry has grown, but that this growth has largely been driven by advertising trade groups rather than first-party sites. Our results contribute to the literature demonstrating the widespread impact of the GDPR, and show that the GDPR stands out in its impact. By comparing the abundance of tracking-related terminology in our dataset against prior works’ measurements, we find that privacy policies under-report the presence of many tracking technologies and all of the most common third parties. We also find that privacy policies, already shown to be inaccessible, have become even more difficult to read over the last twenty years, doubling in length and rising a full grade level in median reading level.
