Re: WARC file format now ISO standard

From: <steve_at_nyob>
Date: Tue, 2 Jun 2009 14:35:34 -0700
To: CODE4LIB_at_LISTSERV.ND.EDU

point well taken. :)

there were no significant changes to the WARC format
between the last draft and the published standard.

you can use Heritrix's WARCReader or the WARC Tools warcvalidator
to verify that a WARC you have created is valid according to
the spec.
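
as a rough first pass before running either of those tools, here is a minimal Python sketch that checks only the header of the first record in an uncompressed WARC stream. it is not a substitute for WARCReader or warcvalidator and does not call them. the function name check_first_record and the sample record are made up for illustration; the checks assume the record layout from the spec (a "WARC/" version line, then named fields, then a blank line) and the four fields every record must carry.

```python
import io

# Header fields every WARC record must carry; names are matched
# case-insensitively, so they are listed in lower case here.
REQUIRED_FIELDS = ("warc-type", "warc-record-id", "warc-date", "content-length")


def check_first_record(stream):
    """Return a list of problems found in the first record header.

    `stream` is a binary file-like object positioned at the start of an
    uncompressed WARC. Hypothetical helper, not part of any WARC library.
    """
    version = stream.readline().rstrip(b"\r\n")
    if not version.startswith(b"WARC/"):
        return ["first line is not a WARC version line: %r" % version]

    problems = []
    fields = {}
    for raw in iter(stream.readline, b""):
        line = raw.rstrip(b"\r\n")
        if not line:
            break  # a blank line ends the record header
        if line[:1] in (b" ", b"\t"):
            continue  # folded continuation of the previous field
        name, sep, value = line.partition(b":")
        if not sep:
            problems.append("malformed header line: %r" % line)
            continue
        key = name.strip().lower().decode("utf-8", "replace")
        fields[key] = value.strip().decode("utf-8", "replace")

    for name in REQUIRED_FIELDS:
        if name not in fields:
            problems.append("missing required field: " + name)

    length = fields.get("content-length")
    if length is not None and not length.isdigit():
        problems.append("Content-Length is not a non-negative integer: " + length)
    return problems


# Illustrative record only; the record ID and date are invented.
sample = (
    b"WARC/1.0\r\n"
    b"WARC-Type: warcinfo\r\n"
    b"WARC-Record-ID: <urn:uuid:00000000-0000-0000-0000-000000000000>\r\n"
    b"WARC-Date: 2009-06-02T21:35:34Z\r\n"
    b"Content-Length: 0\r\n"
    b"\r\n"
    b"\r\n\r\n"
)

print(check_first_record(io.BytesIO(sample)))
```

an empty list means the header passed these few checks, nothing more. for a gzipped WARC, wrapping the file with gzip.open(path, "rb") from the standard library before passing it in should work, though that path is not exercised here.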


/steve_at_archive.org


On 6/2/09 2:27 PM, Ray Denenberg, Library of Congress wrote:
> But you have to pay $200 for the document that lists changes from last 
> draft to first official version.
> 
> (Ok, Ok, it was just a joke. But you do get the point.)
> 
> 
> ----- Original Message ----- From: "steve_at_archive.org" <steve_at_ARCHIVE.ORG>
> To: <CODE4LIB_at_LISTSERV.ND.EDU>
> Sent: Tuesday, June 02, 2009 5:18 PM
> Subject: Re: [CODE4LIB] WARC file format now ISO standard
> 
> 
>> hi Karen,
>>
>> understood.
>>
>> the final draft of the spec is available here:
>> http://www.scribd.com/doc/4303719/WARC-ISO-28500-final-draft-v018-Zentveld-080618 
>>
>>
>> and other (similar) versions here:
>> http://archive-access.sourceforge.net/warc/
>>
>>
>> /steve_at_archive.org
>>
>>
>>
>> On 6/2/09 2:15 PM, Karen Coyle wrote:
>>> Unfortunately, being an ISO standard, to obtain it costs 118 CHF 
>>> (about $110 USD). Hard to follow a standard you can't afford to read. 
>>> Is there an online version somewhere?
>>>
>>> kc
>>>
>>> steve_at_archive.org wrote:
>>>> hi code4lib,
>>>>
>>>> if you're archiving web content, please use the WARC format.
>>>>
>>>> thanks,
>>>> /steve_at_archive.org
>>>>
>>>>
>>>>
>>>> WARC File Format Published as an International Standard
>>>> http://netpreserve.org/press/pr20090601.php
>>>>
>>>> ISO 28500:2009 specifies the WARC file format:
>>>>
>>>> * to store both the payload content and control information from
>>>>   mainstream Internet application layer protocols, such as the
>>>>   Hypertext Transfer Protocol (HTTP), Domain Name System (DNS),
>>>>   and File Transfer Protocol (FTP);
>>>> * to store arbitrary metadata linked to other stored data
>>>>   (e.g. subject classifier, discovered language, encoding);
>>>> * to support data compression and maintain data record integrity;
>>>> * to store all control information from the harvesting protocol
>>>>   (e.g. request headers), not just response information;
>>>> * to store the results of data transformations linked to other
>>>>   stored data;
>>>> * to store a duplicate detection event linked to other stored
>>>>   data (to reduce storage in the presence of identical or
>>>>   substantially similar resources);
>>>> * to be extended without disruption to existing functionality;
>>>> * to support handling of overly long records by truncation or
>>>>   segmentation, where desired.
>>>>
>>>>
>>>> more info here:
>>>> http://www.digitalpreservation.gov/formats/fdd/fdd000236.shtml
>>>>
>>>>
>>>
Received on Tue Jun 02 2009 - 17:36:11 UTC