Erratum
No Truncation Indicator in WARC Records
Originally reported by Henry Thompson.
Due to an issue with our crawler, not all truncations were indicated correctly. In the WARC files, the truncation indicator is the "WARC-Truncated" header. As a workaround for detecting length truncation, treat any record whose content is exactly 1048576 bytes (1 MiB) long as suspicious. There is no comparable workaround for truncations caused by time limits or network problems.
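The length heuristic above can be sketched in Python as follows. This is only a minimal illustration, not Common Crawl's own tooling: the function name `likely_length_truncated` and its parameters are hypothetical, and it assumes you have already extracted the record's uncompressed content bytes and, if present, the value of its WARC-Truncated header.

```python
from typing import Optional

# Content length named in the erratum as the sign of length truncation (1 MiB).
LENGTH_LIMIT = 1048576


def likely_length_truncated(content: bytes,
                            warc_truncated: Optional[str] = None) -> bool:
    """Return True if a record is marked as truncated or looks length-truncated.

    `content` is assumed to be the uncompressed content of the record.
    `warc_truncated` is assumed to be the WARC-Truncated header value,
    or None when the header is absent.
    """
    if warc_truncated:
        # The crawler did set the indicator, so trust it.
        return True
    # Indicator missing: fall back to the exact-length heuristic.
    return len(content) == LENGTH_LIMIT
```

A record that happens to be exactly 1048576 bytes long without being truncated would be flagged as well, so a positive result means "suspicious", not "certainly truncated".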
Note that the "length" in the CDX index is the length of the gzip-compressed WARC record, not of the content, so it cannot be compared directly against 1048576. The name used in the columnar index, warc_record_length, reflects this better.
It is also worth noting that a complete PDF ends with %%EOF, possibly followed by a linefeed, so a PDF whose content does not end this way may have been truncated.
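The PDF observation can be turned into a simple check, sketched below. The function name `pdf_looks_complete` is hypothetical, and tolerating a trailing carriage return in addition to a linefeed is an assumption that goes slightly beyond what the erratum states.

```python
def pdf_looks_complete(content: bytes) -> bool:
    """Return True if the PDF content ends with the %%EOF marker.

    Trailing LF and CR bytes are ignored before the check
    (accepting CR is an assumption, not part of the erratum).
    """
    return content.rstrip(b"\r\n").endswith(b"%%EOF")
```

A missing %%EOF is only a hint of truncation, and a present one is not proof that the file is intact, so this check is best combined with the length heuristic.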
Affected Crawls
Affected Web Graphs
No items found.