
We are pleased to introduce an experimental Common Crawl AI Agent, developed by our friends at ReadyAI. The AI Agent uses an LLM plus RAG (Retrieval-Augmented Generation) to answer questions by searching the content of our website, pages one hop away from it on the web, and our public mailing list archive.
Try it out here: https://commoncrawl.org/ai-agent
Examples of things it’s pretty good at answering:
- Questions about Common Crawl’s data formats
- Questions about Common Crawl’s indexes, both CDX and columnar
- Questions about example uses of Common Crawl data
- Generic questions about web archiving
Most answers end with a link to a specific webpage with more information.
Like all LLM+RAG systems, it has a few limitations:
- One of the example queries asks how many harvard.edu pages CC has crawled. The AI Agent gives an answer from a few months ago, but this number changes every month. Why did the AI Agent say that? Well, that’s one number from our mailing list archive, and the nuance of the number changing every month is difficult to teach to the AI Agent.
- If you ask a question that’s totally out of scope, like “What is the Frumious Bandersnatch?”, the AI Agent will answer based on what the LLM knows, even though the RAG system (which searches our website plus one hop, and our mailing list) doesn’t know anything about Lewis Carroll’s poetry.
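Because page counts are specific to each crawl, a fresh number is best obtained by querying a crawl's index directly. The sketch below only builds a query URL for the CDX index server. The crawl ID is an example, the helper name `cdx_query_url` is hypothetical, and the parameter names (`url`, `matchType`, `output`, `limit`) are our understanding of the CDX server API, so check them against the index documentation before relying on them.

```python
from urllib.parse import urlencode

# Public CDX index server for Common Crawl.
INDEX_HOST = "https://index.commoncrawl.org"


def cdx_query_url(crawl_id: str, domain: str, limit: int = 5) -> str:
    """Build a CDX index query URL for one crawl and one domain.

    Hypothetical helper for illustration; it performs no network request.
    """
    params = {
        "url": domain,
        "matchType": "domain",  # include subdomains of the given domain
        "output": "json",       # one JSON record per line
        "limit": limit,         # keep the sample small
    }
    return f"{INDEX_HOST}/{crawl_id}-index?{urlencode(params)}"


if __name__ == "__main__":
    # Example crawl ID; substitute the crawl you are interested in.
    print(cdx_query_url("CC-MAIN-2025-13", "harvard.edu"))
```

Running the same query against a different crawl ID will generally return different results, which is why a single number quoted on the mailing list goes stale.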
ReadyAI has been updating the RAG data in real time, and we’re looking forward to future improvements.
We’ve had fun experimenting with this AI Agent, and we’d love to hear what you think about it.
Please feel free to join our Discord server or Google Group to let us know how you get on.
Erratum:
Content is truncated
Some archived content is truncated due to fetch size limits imposed during crawling. This is necessary to handle infinite or exceptionally large data streams (e.g., radio streams). In crawls before March 2025 (CC-MAIN-2025-13), the truncation threshold was 1 MiB. From the March 2025 crawl onwards, the limit is 5 MiB.
For more details, see our truncation analysis notebook.
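The thresholds described above can be expressed as a small lookup keyed on the crawl ID. This is a minimal sketch, not official tooling: the function name `truncation_limit_bytes` is hypothetical, it assumes crawl IDs of the form `CC-MAIN-YYYY-WW`, and it reads the note as meaning CC-MAIN-2025-13 is the first crawl with the 5 MiB limit.

```python
import re

MIB = 1024 * 1024

# Assumption: CC-MAIN-2025-13 (March 2025) is the first crawl with the 5 MiB limit.
LIMIT_CHANGE = (2025, 13)


def truncation_limit_bytes(crawl_id: str) -> int:
    """Return the fetch size limit in bytes that applied to a given crawl.

    Hypothetical helper; only handles IDs shaped like CC-MAIN-YYYY-WW.
    """
    match = re.fullmatch(r"CC-MAIN-(\d{4})-(\d{2})", crawl_id)
    if match is None:
        raise ValueError(f"unrecognised crawl ID: {crawl_id!r}")
    year, week = int(match.group(1)), int(match.group(2))
    # Tuple comparison orders crawls by year first, then by week.
    if (year, week) >= LIMIT_CHANGE:
        return 5 * MIB
    return 1 * MIB


if __name__ == "__main__":
    for crawl in ("CC-MAIN-2024-51", "CC-MAIN-2025-08", "CC-MAIN-2025-13"):
        print(crawl, truncation_limit_bytes(crawl))
```

A lookup like this is useful when comparing payload sizes across crawls, since a payload of exactly 1 MiB means something different before and after the change.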