Erratum
Missing content_truncated flag in URL indexes
Originally reported by
.
The flag in our URL indexes (CDX and columnar) that indicates whether a WARC record payload was truncated was added in CC-MAIN-2019-47. The flag is missing from the indexes of all earlier crawl releases. The CDX index calls this field "truncated"; the columnar index calls it "content_truncated".
For more information, please refer to the blog post announcing the November 2019 crawl. The reason for the truncation is given only for truncated records, and its value follows the WARC header field "WARC-Truncated".
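
As a rough illustration of how a consumer of the CDX index might account for the missing flag, the sketch below treats the truncation status as unknown for any crawl before CC-MAIN-2019-47 instead of reading an absent field as "not truncated". It assumes each CDX line has the layout "<urlkey> <timestamp> <JSON object>" and that crawl IDs follow the CC-MAIN-YYYY-WW pattern. The function names and the sample line are hypothetical and are not part of any Common Crawl tool.

```python
import json
import re

# First crawl whose indexes carry the flag, per this erratum.
FIRST_CRAWL_WITH_FLAG = (2019, 47)  # CC-MAIN-2019-47


def crawl_has_truncation_flag(crawl_id):
    """Return True if crawl_id is CC-MAIN-2019-47 or later.

    Assumes IDs of the form CC-MAIN-YYYY-WW (hypothetical helper).
    """
    match = re.fullmatch(r"CC-MAIN-(\d{4})-(\d{2})", crawl_id)
    if match is None:
        raise ValueError(f"unrecognized crawl ID: {crawl_id!r}")
    year_week = (int(match.group(1)), int(match.group(2)))
    return year_week >= FIRST_CRAWL_WITH_FLAG


def truncation_status(cdx_line, crawl_id):
    """Classify one CDX line as unknown, not truncated, or truncated."""
    # Assumed line layout: "<urlkey> <timestamp> <JSON object>".
    _, _, json_part = cdx_line.split(" ", 2)
    fields = json.loads(json_part)
    if not crawl_has_truncation_flag(crawl_id):
        # The flag does not exist in these indexes, so absence says nothing.
        return "unknown"
    reason = fields.get("truncated")
    if reason is None:
        return "not truncated"
    return f"truncated: {reason}"


# Hypothetical sample line, for illustration only.
sample = ('org,example)/ 20191112000000 '
          '{"url": "http://example.org/", "status": "200", "truncated": "length"}')
print(truncation_status(sample, "CC-MAIN-2019-47"))
print(truncation_status(sample, "CC-MAIN-2019-43"))
```

The same caution applies to the columnar index: for affected crawls, a filter on "content_truncated" cannot distinguish truncated records from complete ones.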
Affected Crawls
Affected Web Graphs
No items found.