
Erratum

No Truncation Indicator in WARC Records

Originally reported by Henry Thompson.

Due to an issue with our crawler, not all truncations were indicated correctly in the WARC records. In the WARC files, this indicator is the "WARC-Truncated" header. As a workaround for detecting length truncation, treat any record whose content is exactly 1048576 bytes (1 MiB) long as suspect. No comparable workaround exists for truncations caused by time or network limits.
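The length workaround can be sketched as follows. This is a minimal illustration, not part of the crawler: the function name and the plain dict of WARC headers are hypothetical, and it assumes "content" means the captured payload bytes rather than the whole WARC record.

```python
# Length named in this erratum: exactly 1 MiB.
TRUNCATION_LIMIT = 1048576


def is_suspected_length_truncation(payload: bytes, warc_headers: dict) -> bool:
    """Return True if a payload may have been length-truncated without a flag.

    `warc_headers` is a hypothetical dict of the record's WARC header fields.
    """
    # A record that already carries the indicator needs no guessing.
    if "WARC-Truncated" in warc_headers:
        return False
    # Otherwise, a payload of exactly the limit is suspicious, not proven.
    return len(payload) == TRUNCATION_LIMIT
```

A hit is only a hint: content that happens to be exactly 1048576 bytes long would also match.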

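Note that the "length" field in the CDX index is the length of the gzip-compressed WARC record, not the length of the content, so it cannot be compared directly against 1048576. The corresponding column name in the columnar index, warc_record_length, reflects this better.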

It is also worth noting that a complete PDF ends with %%EOF, perhaps followed by a linefeed, so a PDF whose content lacks this marker at the end is likely truncated.
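A check for the PDF end marker might look like the sketch below. The function name is hypothetical, and tolerating carriage returns, spaces, and tabs after the marker (not only a linefeed) is an assumption added here for leniency.

```python
def pdf_looks_complete(payload: bytes, tail_size: int = 1024) -> bool:
    """Return True if the payload ends with the PDF end-of-file marker.

    Only the last `tail_size` bytes are inspected, so large files stay cheap.
    """
    tail = payload[-tail_size:]
    # Assumption: ignore trailing whitespace (LF, CR, space, tab) after %%EOF.
    return tail.rstrip(b"\r\n \t").endswith(b"%%EOF")
```

A payload that fails this check is a candidate for truncation, though a malformed but complete PDF would fail it as well.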

Affected Crawls
Affected Web Graphs
No items found.