July 16, 2024

The Environmental Impact of the Cloud - the Common Crawl Case Study

Note: this post has been marked as obsolete.
Looking at tools (Green Software) and methodologies to evaluate the environmental impact of the cloud (a nascent activity coined GreenOps).
Julien Nioche
Julien is a member of the Apache Software Foundation, Emeritus member of the Common Crawl Foundation, and is the creator of StormCrawler.

Originally posted on LinkedIn by Julien Nioche on 26th March 2024.

Generated with AI by https://designer.microsoft.com/

It is pretty much impossible to escape AI at the moment: every other social media post, news item, marketing blurb or job advert seems to involve it one way or another. This comes at a time of growing awareness of the environmental impact of IT in general, which AI will only exacerbate: training and running AI models is computationally expensive, and the hardware that enables it takes resources to build, energy to run and water to cool.

A recent study by the International Energy Agency forecasts a substantial growth in energy demand from data centres, fuelled by AI and cryptocurrencies.

“US data centre electricity consumption is expected to grow at a rapid pace in the coming years, increasing from around 200 TWh in 2022 (~4% of US electricity demand), to almost 260 TWh in 2026 to account for 6% of total electricity demand.”
Forecast of global electricity demand, IEA report

This article will look at tools (Green Software) and methodologies to evaluate the environmental impact of the cloud (a nascent activity coined GreenOps). We will examine the impact of an organisation I know quite well, the Common Crawl Foundation, who have kindly agreed to let me use them as a use case. This is a particularly relevant example in the context of AI. As we will see below, the Common Crawl dataset is one of the main sources of data for large language models, so by looking at the impact of generating and distributing their datasets, we not only get a good illustration of the environmental impact of cloud computing in general but also uncover one of the sources of upstream emissions of AI.

This is a complex and fast-changing subject. As I am increasingly working in the field, this article gives me an opportunity to share some of my findings and introduce some key concepts, methodologies and tools, which hopefully might be useful to the reader. Please leave comments if something looks incorrect or unclear.

Common Crawl

The Common Crawl Foundation is a nonprofit 501(c)(3) organisation founded by Gil Elbaz, which has been crawling the web and freely providing its archives and datasets to the public since 2011. Each release contains around 3 billion web pages stored in WARC format, the standard used by the web archiving community. The datasets are hosted and provided for free through Amazon’s Open Data Sponsorship Programme and reside in an AWS S3 bucket in the us-east-1 region. The total size of the bucket is 7.9 PB, most of it using intelligent tiering. As explained on the CC website, the data can be accessed either using S3 clients or over HTTP via CloudFront.
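As a quick aside (not from the original post), here is a minimal Python sketch of how the public bucket can be browsed anonymously with boto3; the bucket name and crawl-data/ prefix are as documented on the Common Crawl website and may change over time.

```python
# A minimal sketch: browsing the public Common Crawl bucket anonymously with boto3.
import boto3
from botocore import UNSIGNED
from botocore.config import Config

# The dataset is public, so requests can be made unsigned (no AWS credentials needed).
s3 = boto3.client("s3", region_name="us-east-1",
                  config=Config(signature_version=UNSIGNED))

# List the crawl releases stored under the crawl-data/ prefix.
resp = s3.list_objects_v2(Bucket="commoncrawl", Prefix="crawl-data/", Delimiter="/")
for p in resp.get("CommonPrefixes", []):
    print(p["Prefix"])   # one line per crawl release prefix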

The Common Crawl dataset is extremely popular and has enabled a large amount of scientific research on a wide number of subjects (from computer science to sociology) as well as allowing startups to build innovative solutions and products. For instance, large research projects such as Open Web Search use Common Crawl data to bootstrap and test their crawl pipeline at scale.

Common Crawl was founded on a vision of providing data for the greater good but more recently, it has been better known for being one of the main sources of training data for the Large Language Models (LLMs) that power AI tools such as ChatGPT. According to a recent study by Mozilla:

"Common Crawl’s massive dataset [...] makes up a significant portion of the training data for many Large Language Models (LLMs) like GPT-3, which powers the free version of ChatGPT. Over 80% of GPT-3 tokens (a representation unit of text data) stemmed from Common Crawl. Many models published by other developers likewise rely heavily on it: the study analyzed 47 LLMs published between 2019 and October 2023 that power text generators and found at least 64% of them were trained on Common Crawl."

The main crawl is generated with a modified version of the venerable Apache Nutch™, whereas another dataset produced by Common Crawl, the NewsCrawl, is powered by our very own StormCrawler (in the process of being incubated at the Apache Software Foundation). The NewsCrawl dataset is in the WARC format too and stored in an AWS S3 bucket. It runs continuously on a single EC2 r5d.xlarge instance, while the main crawl requires a Hadoop cluster for Nutch to run on. The size and nature of the Hadoop cluster vary depending on the stage of the process; the main fetch step for instance, where the web pages are downloaded and bundled into WARC files, takes about two weeks and uses 16 r7g.xlarge EC2 instances. Other parts of the process require a different cluster configuration.

AWS Carbon Footprint Tool

The most obvious place to look for insights into the environmental footprint of the operations of Common Crawl is the AWS Carbon Footprint Tool, which can be found in the Billing and Cost Management section of the AWS Console. The screenshot below is from the Common Crawl account used to run the crawls and other processes, such as the Web Graph generation.

Screenshot from the AWS Carbon Footprint Tool for Common Crawl's account, taken on 22/03/24

The main information provided by the carbon footprint tool is of course the estimate of the carbon emissions generated. The figures are given as metric tons of carbon dioxide-equivalent (MTCO2e).

The screenshot above shows that 3.386 metric tons of CO2-equivalent were emitted, 2.942 of which were saved (we’ll explain how later on), resulting in a net 0.444 metric tons, with no net emissions since January 2022 (you can ignore the bit about 0.319 t having been saved: this is just an estimate of how much more would have been generated had the workload run on premises or in less efficient data centres).
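In other words, the dashboard's arithmetic is simply gross emissions minus the amount matched by renewable purchases; a trivial sketch of the figures quoted above:

```python
# The dashboard arithmetic, restated (values in metric tons of CO2e, market-based accounting).
gross_emissions = 3.386        # total scope 1 + 2 estimated by AWS
matched_by_purchases = 2.942   # the amount "saved", i.e. matched by PPAs/RECs
net_emissions = gross_emissions - matched_by_purchases
print(f"Net market-based emissions: {net_emissions:.3f} tCO2e")  # 0.444
```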

The first thing to notice is that the data is not instantly available: there is a three-month delay. The figures are available per month, with no finer granularity such as per day or per hour. There is an overall breakdown per service; here, nearly half the estimated emissions are due to the use of EC2. The category ‘Other’ corresponds in part to emissions due to networking and the transmission of data. There is a breakdown per geographical area but not at the region level (e.g. us-east-1). Please note that these breakdowns relate to the unmatched emissions (i.e. the 0.444 t emitted before 2022) and not the whole. If you were to select the last year only, these visualisations would be empty.

So it looks like the operations of Common Crawl have had little environmental impact, great! But let’s dig a bit deeper.

A Few Concepts

It is worth looking at the AWS documentation for what this covers and pause for a moment. First, a bit of terminology from the GHG Protocol, one of the main standards to measure and manage emissions. In the GHG protocol, emissions are accounted for in three scopes:

  • Scope 1: direct emissions (i.e. on-site generators, diesel machinery, etc…)
  • Scope 2: indirect emissions from purchased electricity production
  • Scope 3: all the other indirect emissions that come from a company's value or supply chain, including things like waste, shipping products, or product usage by customers

The figures you get from the carbon footprint tool are for scope 1 and 2 only. While scope 1 is relatively straightforward to compute, scope 2 has a few more subtleties to it. There are two main approaches to estimating it.

First, the location-based approach which simply looks at the carbon intensity of the grid at the time and place where energy is consumed.

As you know, energy comes from a mix of sources, divided into renewables (like wind farms, solar PV or hydroelectric), low-carbon (nuclear) and carbon-heavy (coal and gas). The latter produce energy in a way that releases greenhouse gases and are one of the main contributors to climate change. At a given location and time, the energy mix will be more or less carbon intensive. If it is a warm, sunny day with a bit of a breeze, the carbon intensity of the grid (expressed in grams of CO2-equivalent per kilowatt hour, a.k.a. gCO₂e/kWh) is likely to be low, as wind and solar farms will be hard at work. By contrast, on a cold and still winter night when people have the lights and heating on, the demand on the grid will be high, leading to gas or coal power stations being fired up to guarantee the supply of energy and resulting in a higher carbon intensity.

Data about the carbon intensity of the energy grids is widely available, thanks for instance to resources such as ElectricityMaps, which provides data through its website and API.

Live view of the generation mix in Great Britain on 5th March 2024 at 17.30 GMT from ElectricityMaps

As you can see, not all countries are equal when it comes to carbon intensity, which is something we will get back to later.

By the way, this is something you can also use at home to work out when your use of electricity will have the lowest carbon impact. In the UK for instance, the National Grid provides a mobile app, WhenToPlugin, which tells you the current carbon intensity in your region. Ideal for working out the next best slot to charge your electric car or do a load of washing! Similar resources are available for other countries. Taking carbon intensity into account in this way, be it at work or at home, is called carbon awareness.
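To make the idea of carbon awareness concrete, here is a hedged Python sketch using the ElectricityMaps API mentioned above; the endpoint, zone code and response field are based on my reading of their public documentation at the time of writing and may need adjusting for your account or plan.

```python
# A hedged sketch of carbon awareness in practice: fetch the live carbon intensity
# of a grid zone from the ElectricityMaps API and turn an energy figure into emissions.
import requests

API_TOKEN = "your-electricitymaps-token"   # hypothetical placeholder
ZONE = "GB"                                # Great Britain

resp = requests.get(
    "https://api.electricitymap.org/v3/carbon-intensity/latest",
    params={"zone": ZONE},
    headers={"auth-token": API_TOKEN},
    timeout=10,
)
resp.raise_for_status()
intensity = resp.json()["carbonIntensity"]  # grams of CO2e per kWh

# Emissions are simply the energy consumed multiplied by the grid's carbon intensity.
energy_kwh = 2.5                            # roughly one load of washing
print(f"{energy_kwh} kWh at {intensity} gCO2e/kWh = {energy_kwh * intensity:.0f} gCO2e")
```

The same intensity figure is what a carbon-aware scheduler would use to decide whether to run a batch job now or wait for a greener slot.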

The other approach used to calculate scope 2 emissions is the market-based approach, and it is what Amazon uses in its dashboard. Under the market-based approach, the calculation is based on power purchase agreements (PPAs) or instruments like Renewable Energy Certificates (RECs). Certificates are either bundled and sold with the corresponding energy, or unbundled and bought separately. Amazon would have a mix of power purchase agreements (i.e. buying the entire production of a power source) and certificates. A different way of putting it could be that the emissions are not “saved” but “offset” with purchases: at the end of the day, there might still be carbon emitted as a result of your use of AWS.

Going back to life outside work, the market-based approach is what domestic electricity providers rely on to offer green tariffs. The effect is the same as for data centres: being on a green tariff doesn’t mean that you are not generating carbon emissions from your energy use. This is another reason to take carbon intensity into account as a way of minimising your impact.

Of the two approaches, the market-based one paints a more flattering picture than the location-based one. The latter tends to be used by the GreenOps community, with the GHG Protocol recommending that both be reported. The PPAs and RECs that the market-based approach takes into account are not without virtue, as they provide a financial incentive to develop renewable energy projects, which are essential to lowering the carbon intensity of the grid in general. Definitely a good thing! On this subject, the sustainability web page of AWS showcases the renewable energy projects they have developed.

Looking at the dashboard above, we can consider that the emissions related to scope 1+2 for the last 3 years of operations at Common Crawl are 3.386 metric tons of CO2e.

One of the main limitations of the AWS Carbon Footprint Tool is the absence of data on scope 3. Data centres take a lot of resources to build, as do the servers and all the equipment they require (cooling systems, network cables). They need to be installed and, at the end of their working lives, disposed of. There are also resources beyond energy, such as minerals or the water required to cool the data centres.

All these are unaccounted for in the report but could there be a way of estimating them?

Estimating (part of) the Scope 3 Emissions

The approach below was suggested by an AWS employee. The starting point is a study published in 2021, Digital Technologies in Europe: an Environmental Life Cycle Approach which looked at data centres in Europe.

Breakdown of the environmental impact of data centres

We will focus on the climate change impact (i.e. CO2eq emissions) but there are of course other impacts, as shown in the table above. It is worth noting that the study does not include all 15 scope 3 categories under the GHG Protocol; for data centres it includes:

  • IT equipment (compute, storage, network)
  • Non-IT equipment involved in the infrastructure (cooling systems, generators, UPS, batteries, etc.)

The list of inclusions and exclusions is on page 20 of the report.

The study gives an estimate of the relative size of the emissions for each scope. As explained earlier, scope 1 covers the on-site emissions, often from diesel-powered backup generators.

A key concept for data centres is Power Usage Effectiveness (PUE), a measure of how efficient a data centre is. It is the ratio of the total energy drawn from the grid to the energy actually delivered to the computing hardware; the difference is the energy required for cooling, lighting, etc. An ideal data centre would have a PUE of 1, i.e. all the energy it takes from the grid goes into the computing hardware.

The PUE in the study was 1.7, whereas the consensus is that data centres like AWS’ would nowadays have a PUE of around 1.15. Part of the difference is that the data for the study was from 2019, but also that it covered a mix of data centres and not just hyperscalers, which are more efficient thanks in part to techniques like free cooling. After adjusting the figures for a PUE of 1.15, we get a breakdown of 0.9% for scope 1, 63.3% for scope 2 and 35.9% for scope 3.

Rebalanced breakdown per scope based on a PUE of 1.15
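For the curious, the adjustment amounts to rescaling the scope 2 share by the ratio of the two PUE values and renormalising. Below is a minimal sketch; the input breakdown is back-calculated from the adjusted figures quoted above, not copied from the study itself.

```python
# A minimal sketch of the PUE adjustment: scope 2 is assumed to scale linearly with PUE
# (total grid draw per unit of IT load), while scopes 1 and 3 stay unchanged, and the
# shares are then renormalised.
def rebalance_scopes(scope1, scope2, scope3, pue_old, pue_new):
    """Return the scope breakdown (in %) after rescaling scope 2 for the new PUE."""
    scope2_adjusted = scope2 * (pue_new / pue_old)
    total = scope1 + scope2_adjusted + scope3
    return tuple(round(100 * s / total, 1) for s in (scope1, scope2_adjusted, scope3))

# Roughly 0.7% / 71.8% / 27.5% at a PUE of 1.7 gives approximately the
# 0.9% / 63.3% / 35.9% breakdown quoted above for a PUE of 1.15.
print(rebalance_scopes(0.7, 71.8, 27.5, pue_old=1.7, pue_new=1.15))
```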

Location, Location!

The data above were based on average emissions of data centres in the EU but as we saw earlier when looking at carbon intensity, not all grids are the same when it comes to their energy mix.

Estimated scope breakdown based on regional carbon intensities

The table above takes the ratios we obtained earlier for the EU-28 and applies them to different geographical zones, based on their average carbon intensity (provided by ElectricityMaps). France has a low carbon intensity, as its energy comes mainly from nuclear power stations. This means that scope 2 will be a relatively low proportion (23%) compared to scopes 1 and 3. Singapore, on the other hand, gets its energy from natural gas and, as a result, has relatively high emissions under scope 2.

This has interesting implications: different strategies are needed to optimise the overall environmental impact of data centres depending on where they are. If, for instance, a data centre is in France where scope 2 represents a relatively low proportion of emissions, you would not focus as much on the energy efficiency of the hardware but would perhaps try to give it a longer lifespan so as to reduce the emissions under scope 3. At the other end of the scale, in a country like Singapore, you would probably invest in newer and more efficient hardware to reduce the scope 2 emissions.

Virginia 12 months average carbon intensity, taken from ElectricityMaps on 22/03/24

Going back to estimating the scope 3 emissions of Common Crawl, we had established that the scope 1 and 2 emissions for the last 3 years were 3.386 metric tons of CO2eq. Given that the operations are based entirely in Virginia, we can estimate that the scope 3 emissions for the last 3 years are 1.28 metric tons of CO2eq, giving us a total of 4.666 metric tons.
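As a sanity check, the estimate boils down to applying Virginia's scope 3 share to the scope 1 and 2 figure from the dashboard; the sketch below assumes a share of roughly 27% of total emissions, which is the ratio implied by the figures in the text (the regional table itself is not reproduced here).

```python
# A sketch of the scope 3 estimate above, under the assumed Virginia breakdown.
scope_1_2 = 3.386        # tCO2e over three years, from the AWS dashboard
scope3_share = 0.274     # assumed share of total emissions attributable to scope 3
scope3 = scope_1_2 * scope3_share / (1 - scope3_share)
print(f"Estimated scope 3: {scope3:.2f} tCO2e, total: {scope_1_2 + scope3:.2f} tCO2e")
```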

What next?

Those 4.666 metric tons of CO2eq sound like a lot but, to put it in perspective, this is roughly equivalent to one year’s energy use for a home in the US. By crawling the web efficiently, computing useful data on top of it and providing it for free, Common Crawl removes the need for numerous organisations to replicate the same processes. However, and to keep a sense of perspective, the environmental impact we estimated here does not include things like travel. As a hypothetical example, two people on a return flight from the west coast of the US to Europe would put more CO2 into the atmosphere than 3 years of operations on AWS. By measuring things, you can see where your actions will have the most impact rather than relying on assumptions.

We have covered a lot of ground in this article and introduced key terms, concepts and tools. We have only looked at AWS’ carbon footprint dashboard, but other cloud providers offer similar solutions. OVH (a French cloud provider) has published a comparison of the carbon calculators of the main providers.

OVH comparison of carbon calculators from the main cloud providers

AWS is clearly behind the competition, with Azure and GCP providing at least some scope 3 data.

As a user of AWS, it is fair to say that the information provided by the footprint tool is neither complete nor actionable. Worse, it could probably lure you into thinking that your operations on AWS have no environmental impact.

In the next article, we will look at a couple of open source tools you can use to get a better picture of the cloud footprint of your organisation and will apply these to the AWS use of Common Crawl. We will also turn to another aspect of Common Crawl’s footprint and look at the storage and distribution impact of its datasets, which will shed some light on the hidden cost of AI.

I hope that you find this article interesting. Please do get in touch if you have any questions or if you would like me to help your organisation with their GreenOps.
