
Here There Be WARCs


The term WARC, an abbreviation of Web ARChive, always reminds me of things like hobbits, elves, dark lords, and orcs, of course. But this post has nothing to do with those things so I need to clear my head and press on.

A WARC is essentially a file format used to capture the content and organization of a web site. Recently, I was asked to add a pair of WARCs to Digital.Grinnell. Doing so proved to be quite an adventure, but I am pleased to report that we now have two objects to show for it: grinnell:27856 and grinnell:27858.

WARC Ingest - Failures and Success

What follows is a table of the steps, both failed and successful, taken to ingest the two new Digital.Grinnell objects, along with notes about each step in the process.

| Ingest Step | Outcome | Notes |
| --- | --- | --- |
| 1. .warc File Creation | Success | Rebecca Ciota used a wget command to create a .warc archive file and a .cdx “index” of each site from existing, “live” web content. That command took this form: wget --warc-file=<FILENAME> --recursive --level=5 --warc-cdx --page-requisites --html-extension --convert-links --execute robots=off --directory-prefix=. -x /solr-search --wait=10 --random-wait <WEBSITE-URL> |
| 2. .gz Compression | Success | Rebecca compressed each .warc and .cdx file pair into a .gz archive to package the contents for subsequent processing. |
| 3. MODS Metadata Prep | Success | Rebecca added two rows of control data and MODS metadata to Google Sheet https://docs.google.com/spreadsheets/d/1X3rs7UhIdS6SumTwFUvRR0F6-OnGEIF5xnGzcLrFqNY to prep for IMI ingest. |
| 4. .gz Files Added to //Storage | Success | The two .gz files generated in step 2 were copied to //Storage for ingest. |
| 5. Attempted IMI Ingest | Failed | The IMI ingest process was invoked in Digital.Grinnell with Google Sheet https://docs.google.com/spreadsheets/d/1X3rs7UhIdS6SumTwFUvRR0F6-OnGEIF5xnGzcLrFqNY. The process ran for more than 30 minutes before it failed with no error messages or indication of root cause. |
| 6. Attempted “Forms” Ingest of WMI Object | Failed | To ingest the World Music Instruments (WMI) archive, I navigated to the management page in our Pending Review collection and engaged the “Add an object to this Collection” link. I selected the “Islandora Web ARChive Content Model” and “Web ARChive MODS Form” for ingest. The form was not available so the ingest failed. |
| 7. WARC Module Updated | Success | Steps were taken to update the key module, islandora_solution_pack_web_archive, on Digital.Grinnell’s node DGDocker1. The update ran without error. |
| 8. Repeated Step 6 | Failed | Still in the Pending Review Manage page, I repeated step 6, but even after the update no “Web ARChive MODS Form” was available. |
| 9. Repeated Step 8 with the Correct Form | Failed | Still in the Pending Review Manage page, I repeated steps 6 and 8, this time choosing the “DG ONE Form to Rule them All”. The form opened and accepted input, but clicking “Next” presented the sub-form shown below, indicating that a .gz file could not be ingested. |
| 10. Unzipped the .gz Files | Success | The WMI’s .gz file was nearly 2 GB in size, so I was unable to unzip it using the archive tool on my iMac. I was able to use gunzip to process the files on CentOS node DGDocker1. The result was a .warc file and .cdx file pair for each object (a total of 4 files). |
| 11. Repeated Step 9 | Failed | Still in the Pending Review Manage page, I repeated step 9 and submitted the sub-form with the unzipped WMI .warc file. I terminated the upload process after about 2 hours (including my lunch break). The process could not be resumed from that point. |
| 12. Repeated Step 11 with the “Global Mongol Century” File | Limited Success | Still in the Pending Review Manage page, I repeated step 11 but with the .warc file representing “The Global Mongol Century” archive. This .warc file was only 24.2 MB in size and the ingest worked (taking less than 5 minutes), producing grinnell:27856. No thumbnail image or other image derivatives were created, presumably because I did not choose the Upload a screenshot? option. |
| 13. Investigated Using drush | Failed | The updated islandora_solution_pack_web_archive module includes at least one drush command, and a non-web ingest of such large files would be preferred. However, investigation determined that the provided drush command does not provide an alternate means of ingest. |
| 14. Repeated Step 12 | Success | I repeated step 12, still using the relatively small “Global Mongol Century” .warc, but gave the object a title of “World Music Instruments”. grinnell:27858 was created, again with no thumbnail image or other image derivatives. |
| 15. Attempted to Replace the OBJ Datastream | Failed | Navigating to https://digital.grinnell.edu/islandora/object/grinnell%3A27858/manage/datastreams, I selected the replace link associated with the OBJ datastream in an attempt to replace the small .warc object with the proper WMI .warc. This process failed to finish after nearly an hour of processing, presumably because the WMI .warc is simply too large, causing the web process to time out before completion. |
| 16. Replaced the OBJ Datastream Using FEDORA | Success | I navigated to Digital.Grinnell’s FEDORA admin page at http://dgdocker1.grinnell.edu:8081/fedora/admin/, logged in as an admin, opened the grinnell:27858 object, and used the FEDORA admin interface there to replace its OBJ datastream. I allowed the upload portion of the process to “spin” for more than an hour, but when I stopped it and saved changes, I found a new OBJ that is 1.97 GB in size. That new OBJ appears to be viable. |
| 17. Generated SCREENSHOT Images | Success | Rebecca used the “Snip Tool” in Windows to collect screenshot images of each “live” website home page. These were uploaded to //Storage for subsequent ingest. |
| 18. Added Images via Manage/Datastreams Menu | Success | I navigated to each WARC object’s manage/datastreams page and used the Add a datastream links to create new SCREENSHOT datastreams using the home page .jpg images. |
| 19. Added Empty MODS Datastreams | Success | While still working in each object’s manage/datastreams page, I used the Add a datastream links again to create new, empty MODS datastreams. This step was necessary because Web ARChive content models do not include a MODS record by default. |
| 20. Exported Google Sheet to TSV | Success | To prep for updating each object’s MODS record, I exported the “MASTER” tab of our Google Sheet to a .tsv (tab-separated values) file and saved the export in //Storage/Library/AllStaff/DG-Metadata-Review-2020-r1/WARC/mods-imvt.tsv. |
| 21. Invoked Islandora-MODS-via-Twig Workflow | Success | I engaged the drush imvt (islandora_mods_via_twig) command and subsequent workflow, including islandora_datastream_replace, to replace the empty MODS records (see step 19) with correct data. See Exporting, Editing, & Replacing MODS Datastreams: Technical Details for complete details. |
| 22. Initiated Derivative Regeneration | Success | Since the two WARC objects were not ingested in a “traditional” manner, it was necessary to regenerate all derivative datastreams to complete each object. I did so, in the case of grinnell:27858, for example, by visiting the object’s manage/properties page and clicking Regenerate all derivatives. |
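
The step 1 wget command is easier to read with one option per line. What follows is a minimal sketch of that command wrapped in a shell function, not the exact script Rebecca ran. The function name archive_site and the DRY_RUN switch are hypothetical additions for illustration. One departure from the command as quoted is also an assumption: the post shows lowercase -x /solr-search, but in wget -x means --force-directories and takes no argument, while uppercase -X (--exclude-directories) takes a directory list, so the sketch assumes the uppercase form was intended.

```shell
# archive_site is a hypothetical wrapper name; it does not appear in the original post.
# Usage: archive_site <FILENAME> <WEBSITE-URL>
archive_site() {
  archive_name=$1   # base name for the .warc and .cdx output files
  site_url=$2       # the "live" site to capture

  set -- --warc-file="$archive_name"   # write the capture to a WARC file
  set -- "$@" --warc-cdx               # also write a CDX index of the capture
  set -- "$@" --recursive --level=5    # follow links up to 5 levels deep
  set -- "$@" --page-requisites        # fetch images, CSS, and scripts pages need
  set -- "$@" --html-extension         # older spelling of --adjust-extension
  set -- "$@" --convert-links          # rewrite links for local browsing
  set -- "$@" --execute robots=off     # ignore robots.txt
  set -- "$@" --directory-prefix=.     # save files under the current directory
  # ASSUMPTION: uppercase -X (exclude directories); the post shows lowercase -x.
  set -- "$@" -X /solr-search
  set -- "$@" --wait=10 --random-wait  # pause about 10 s between requests, with jitter

  if [ "${DRY_RUN:-0}" = 1 ]; then
    echo wget "$@" "$site_url"         # print the command instead of running it
  else
    wget "$@" "$site_url"
  fi
}
```

Setting DRY_RUN=1 before calling archive_site prints the assembled command so it can be reviewed before a long crawl begins. With --wait=10 and five levels of recursion, a real run can take hours on a large site.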

WARC sub-form
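
Because the sub-form rejected .gz uploads, step 10 decompressed the files on the server with gunzip. The sketch below shows one cautious way to do that for a very large file. It assumes each .gz holds a single gzip-compressed file rather than a tar archive, and the helper name unpack_gz is hypothetical. It tests the archive first and streams the output to a new file so the original .gz survives for a retry.

```shell
# unpack_gz is a hypothetical helper name, not something from the original post.
# ASSUMPTION: each .gz holds one gzip-compressed file (not a tar archive).
unpack_gz() {
  gz_file=$1
  out_file=${gz_file%.gz}           # strip the trailing .gz to get the output name

  gzip -t "$gz_file" || return 1    # verify the archive is intact before unpacking

  # Stream to a new file so the original .gz is kept in case a retry is needed.
  gunzip -c "$gz_file" > "$out_file" || return 1

  ls -l "$out_file"                 # report the decompressed size
}
```

A call such as unpack_gz wmi.warc.gz (the file name is illustrative) would leave both wmi.warc.gz and wmi.warc in place.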

Summary

Completing steps 1-4, 7, 10, 12, 14, and 16-22 resulted in the two “complete” WARC objects we now have in Digital.Grinnell: grinnell:27856 and grinnell:27858.

Move to Faculty Scholarship

All of the above steps were performed while the two objects were part of the Pending-Review collection in Digital.Grinnell. Once the objects were reviewed and determined to be correct, the Migrate this Object to another collection option on each object’s manage page was used to move them both to Faculty Scholarship.

Setting Object Permissions

Since both of the WARC objects are for archival purposes only, it was determined that both objects should be accessible only to system administrators. To enforce that restriction I visited each object’s manage/xacml page to set appropriate restrictions on object management and object viewing.

Adding PDF Datastreams

The islandora_solution_pack_web_archive module and the Web ARChive content model provide an option to include a PDF in the ingest process. We did not initially generate any PDFs for the two WARCs that were ingested, but we have since taken steps to add PDF datastreams in order to experience what that option has to offer. The process we employed is briefly documented below.

PDF Creation

As mentioned earlier, a wget command was used to create the .warc files that we ingested, but wget does not appear to offer a viable option to create a PDF file. Fortunately, Rebecca’s research turned up this Adobe Acrobat Pro trick: https://lenashore.com/2019/06/how-to-make-a-pdf-of-an-entire-website/.

Rebecca reports that this process works, but can take a very long time. She apparently had to specify a limited number of “levels” to enable creation of a reasonable PDF in the case of the WMI web site.

PDF Datastream Addition

Each object’s PDF file was uploaded and ingested to join its corresponding object using steps similar to 18 and 22 above. Each object’s manage/datastreams page was engaged and the Add a datastream link used to create new PDF datastreams using the .pdf files created earlier.

Again, since the two WARC objects were not ingested in a “traditional” manner, I thought it necessary to regenerate all derivative datastreams to complete each object. I did so, in the case of grinnell:27858, for example, by visiting the object’s manage/properties page and clicking Regenerate all derivatives. The addition of the PDF datastream did not appear to create any additional derivatives, but each object came away with empty WARC_CSV and WARC_FILTERED datastreams that I manually removed.

The addition of the PDF datastreams did produce new PDF-download links like the one shown here:

DOWNLOAD links

And that’s a wrap. Until next time…

WRITTEN BY
Mark A. McFate
Digital Library Applications Developer