The crawl archive for December 2024 is now available.
The data was crawled between December 1st and December 15th, and contains 2.64 billion web pages (or 394 TiB of uncompressed content). Page captures are from 47.5 million hosts or 38.3 million registered domains and include 1.05 billion new URLs, not visited in any of our prior crawls.
Archive Location & Download
The December 2024 crawl archive is located in the commoncrawl bucket with the prefix crawl-data/CC-MAIN-2024-51/.
To assist with exploring and using the dataset, we provide gzip-compressed files which list all segments, WARC, WAT and WET files.
By prepending either s3://commoncrawl/ or https://data.commoncrawl.org/ to each line, you obtain the S3 and HTTP paths, respectively. Please see Get Started for detailed instructions.
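The prefixing step can be sketched in Python as follows. This is a minimal illustration, not an official tool: the function name full_paths is hypothetical, and it assumes the gzip-compressed listing has already been downloaded into memory as bytes.

```python
import gzip
import io

S3_PREFIX = "s3://commoncrawl/"
HTTP_PREFIX = "https://data.commoncrawl.org/"


def full_paths(listing_gz: bytes, prefix: str = HTTP_PREFIX) -> list:
    """Decompress a gzip path listing and prepend `prefix` to each non-empty line.

    `full_paths` is a hypothetical helper name used for illustration only.
    """
    with gzip.open(io.BytesIO(listing_gz), "rt", encoding="utf-8") as fh:
        return [prefix + line.strip() for line in fh if line.strip()]
```

Passing S3_PREFIX instead of the default yields S3 paths for the same listing.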
Changes to the WAT Metadata Format
Multi-valued headers
Previously, repeated HTTP and WARC headers were not fully represented in the JSON data in WAT files. When a header was repeated with a further value, only the last value was stored and the other values were lost. This long-standing issue (ia-web-commons#18) is now fixed:
- Single-valued headers are represented as before by a header name and a string value.
- Headers with multiple values are represented by a header name and an associated list of values.
Users are advised to adapt any code consuming WAT files to this change. The examples in the cc-pyspark and cc-warc-examples projects have been updated accordingly; see cc-pyspark#46 and cc-warc-examples#5, respectively.
Below are two JSON snippets of multi-valued headers:
- The WARC-Protocol header field:
{
"Container": { "...": "..." },
"Envelope": {
"WARC-Header-Metadata": {
"...": "...",
"WARC-Target-URI": "https://en.wikipedia.org/wiki/Saturn",
"WARC-Protocol": [
"h2",
"tls/1.3"
],
- Many HTTP headers, most commonly the "Set-Cookie" header:
{
"Container": { "...": "..." },
"Envelope": {
"Payload-Metadata": {
"Actual-Content-Type": "application/http; msgtype=response",
"HTTP-Response-Metadata": {
"...": "...",
"Headers": {
"date": "Sat, 30 Nov 2024 11:13:30 GMT",
"...": "...",
"set-cookie": [
"WMF-Last-Access=30-Nov-2024;Path=/;HttpOnly;secure;Expires=Wed, 01 Jan 2025 12:00:00 GMT",
"WMF-Last-Access-Global=30-Nov-2024;Path=/;Domain=.wikipedia.org;HttpOnly;secure;Expires=Wed, 01 Jan 2025 12:00:00 GMT",
"WMF-DP=5b0;Path=/;HttpOnly;secure;Expires=Sun, 01 Dec 2024 00:00:00 GMT",
"GeoIP=US:VA:Ashburn:39.05:-77.49:v4; Path=/; secure; Domain=.wikipedia.org",
"NetworkProbeLimit=0.001;Path=/;Secure;SameSite=Lax;Max-Age=3600"
],
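Code that consumes WAT records may want to treat both representations uniformly. The following Python sketch shows one way to do so; the helper name header_values is hypothetical, and the case-insensitive name matching is an assumption made for convenience, not something the WAT format prescribes.

```python
def header_values(headers: dict, name: str) -> list:
    """Return all values of a header as a list.

    Handles both WAT representations: a plain string (single value)
    and a list of strings (multiple values). Hypothetical helper;
    names are matched case-insensitively as a convenience.
    """
    wanted = name.lower()
    for key, value in headers.items():
        if key.lower() == wanted:
            # A list means the header was repeated; a string means one value.
            return list(value) if isinstance(value, list) else [value]
    return []
```

Callers can then iterate over the result without checking the value's type first, for example over header_values(headers, "set-cookie").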
Add language attributes of the <html> root element as metadata
The WAT metadata now includes the language attributes of the <html> element. For example, the root element <html lang="en"> is stored in the WAT file as:
"HTML-Metadata": {
"Head": {
"Metas": [
{
"name": "HTML@/lang",
"content": "en"
},
Details on this change are tracked in ia-web-commons#35.
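Reading the attribute back out of a parsed WAT record could look like the sketch below. The function name html_lang is hypothetical, and the code assumes the "HTML-Metadata" object has already been located in the record and parsed into a Python dict.

```python
from typing import Optional


def html_lang(html_metadata: dict) -> Optional[str]:
    """Return the value stored under the "HTML@/lang" meta entry, if any.

    `html_metadata` is assumed to be the parsed "HTML-Metadata" object
    of a WAT record. Hypothetical helper for illustration.
    """
    metas = html_metadata.get("Head", {}).get("Metas", [])
    for meta in metas:
        if meta.get("name") == "HTML@/lang":
            return meta.get("content")
    return None
```

The function returns None when the page declares no lang attribute or when the record has no "Head" or "Metas" entry.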
Do not include <meta itemprop="..."> as metadata
Schema.org annotations in <meta itemprop="..."> in the HTML body are no longer added to the WAT metadata; see ia-web-commons#40.
Crawling with IPv6
The crawler is now ready to crawl IPv6-only websites. While IPv4 is still preferred, sites which are only available by IPv6 are now visited by our crawler. As a consequence, IPv6 addresses now appear in the crawl data, for example in the "WARC-IP-Address" header or in URLs in the URL indexes.
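Code that previously assumed IPv4 addresses may need a small adjustment. The sketch below uses only the Python standard library; the function names are hypothetical, and the addresses in the usage note come from the reserved documentation ranges rather than from actual crawl data.

```python
import ipaddress
from urllib.parse import urlsplit


def ip_version(value: str) -> int:
    """Return 4 or 6 for an address string such as a WARC-IP-Address value."""
    return ipaddress.ip_address(value).version


def url_host_is_ipv6(url: str) -> bool:
    """Return True if the URL's host is an IPv6 literal, e.g. http://[2001:db8::1]/."""
    host = urlsplit(url).hostname  # brackets around IPv6 literals are stripped
    if host is None:
        return False
    try:
        return ipaddress.ip_address(host).version == 6
    except ValueError:
        # The host is a regular domain name, not an IP literal.
        return False
```

For instance, ip_version("2001:db8::1") identifies an IPv6 address, while url_host_is_ipv6 distinguishes bracketed IPv6 literals from ordinary host names and IPv4 literals.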
Crawler Verification
Our crawler "CCBot" is now run on dedicated IP address ranges with reverse DNS. This allows webmasters to verify whether a logged request stems from CCBot. Please read our FAQ for more information.
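One common way to use reverse DNS for this kind of check is forward-confirmed reverse DNS: look up the host name for the logged IP address, check its domain, then resolve that host name again and confirm it maps back to the same address. The Python sketch below illustrates the idea under stated assumptions. The function name, the injectable lookup parameters, and the ".crawler.example" suffix in the usage note are all hypothetical; the actual host names and IP ranges to check against are documented in the FAQ, and whether this matches the exact procedure recommended there is an assumption.

```python
import ipaddress
import socket


def verify_by_reverse_dns(ip, expected_suffix, reverse=None, forward=None):
    """Forward-confirmed reverse DNS check (illustrative sketch).

    `expected_suffix` is the domain suffix the crawler's host names should
    end with; the caller must supply the real value. `reverse` and `forward`
    are optional lookup functions so the logic can be tested offline.
    """
    if reverse is None:
        reverse = lambda addr: socket.gethostbyaddr(addr)[0]
    if forward is None:
        forward = lambda host: {info[4][0] for info in socket.getaddrinfo(host, None)}
    try:
        host = reverse(ip).rstrip(".").lower()
        if not host.endswith(expected_suffix.lower()):
            return False
        # Compare parsed addresses so differing IPv6 spellings still match.
        resolved = {ipaddress.ip_address(addr) for addr in forward(host)}
        return ipaddress.ip_address(ip) in resolved
    except (OSError, ValueError):
        # Lookup failed or an address could not be parsed.
        return False
```

A call such as verify_by_reverse_dns("192.0.2.10", ".crawler.example") performs live DNS lookups; passing stub functions for reverse and forward allows the logic to be exercised without network access.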
Feedback Welcome
We look forward to hearing your thoughts and comments. As ever, please feel free to join the discussions in our Google Group or in our Discord server.