WARC, an abbreviation of Web ARChive, always reminds me of things like hobbits, elves, dark lords, and orcs, of course. But this post has nothing to do with those things so I need to clear my head and press on.
A WARC is essentially a file format used to capture the content and organization of a web site. Recently, I was asked to add a pair of WARCs to Digital.Grinnell. Doing so proved to be quite an adventure, but I am pleased to report that we now have these two objects to show for it:
WARC Ingest - Failures and Success
What follows is a table of the steps, both failed and successful, taken to ingest the two new Digital.Grinnell objects, along with notes about each step in the process.
|Step||Outcome||Notes|
|1. .warc File Creation||Success||Rebecca Ciota used a |
|2. .gz Compression||Success||Rebecca compressed each .warc and .cdx file pair into a compressed .gz archive to package the contents for subsequent processing.|
|3. MODS Metadata Prep||Success||Rebecca added two rows of control data and MODS metadata to Google Sheet https://docs.google.com/spreadsheets/d/1X3rs7UhIdS6SumTwFUvRR0F6-OnGEIF5xnGzcLrFqNY to prep for IMI ingest.|
|4. .gz Files Added to //Storage||Success||The two .gz files generated in a previous step were copied to //Storage for ingest.|
|5. Attempted IMI Ingest||Failed||The aforementioned IMI ingest process was invoked with Google Sheet https://docs.google.com/spreadsheets/d/1X3rs7UhIdS6SumTwFUvRR0F6-OnGEIF5xnGzcLrFqNY in Digital.Grinnell. The process ran for more than 30 minutes before it failed with no error message or indication of root cause.|
|6. Attempted “Forms” Ingest of WMI Object||Failed||I navigated to the management page in our Pending Review collection and engaged the “Add an object to this Collection” link. I selected the “Islandora Web ARChive Content Model” and “Web ARChive MODS Form” for ingest. The form was not available so the ingest failed.|
|7. WARC Module Updated||Success||Steps were taken to update the key module, islandora_solution_pack_web_archive, on Digital.Grinnell’s node DGDocker1. The update ran without error.|
|8. Repeated Step 6||Failed||Still in the Pending Review Manage page I repeated the previous step (6) but even after the update no “Web ARChive MODS Form” was available.|
|9. Repeated Step 8 with the Correct Form||Failed||Still in the Pending Review Manage page I repeated the previous steps (6 and 8) choosing the “DG ONE Form to Rule them All”. The form opened and accepted input, but clicking “Next” presented the sub-form shown below, indicating that a .gz file could not be ingested.|
|10. Unzipped the .gz Files||Success||The WMI’s .gz file was nearly 2 GB in size, so I was unable to unzip it using the archive tool on my iMac. I was able to use |
|11. Repeated Step 9||Failed||Still in the Pending Review Manage page I repeated the previous step (9) and the sub-form. I terminated the upload process after about 2 hours (which included my lunch break). The process could not be resumed from that point.|
|12. Repeated Step 11 with the “Global Mongol Century” File||Limited Success||Still in the Pending Review Manage page I repeated the previous step (11) but with the .warc file representing “The Global Mongol Century” archive. This .warc file was only 24.2 MB in size and the ingest worked (taking less than 5 minutes) producing grinnell:27856. No thumbnail image or other image derivatives were created, presumably because I did not choose the |
|13. Investigated Using ||Failed||The updated islandora_solution_pack_web_archive module includes at least one |
|14. Repeated Step 12||Success||I repeated step 12, still using the relatively small “Global Mongol Century” .warc, but gave the object a title of “World Music Instruments”. grinnell:27858 was created, again with no thumbnail image or other image derivatives.|
|15. Attempted to Replace the OBJ Datastream||Failed||Navigating to https://digital.grinnell.edu/islandora/object/grinnell%3A27858/manage/datastreams, I selected the |
|16. Replaced the OBJ Datastream Using FEDORA||Success||I navigated to Digital.Grinnell’s FEDORA admin page at http://dgdocker1.grinnell.edu:8081/fedora/admin/, logged in as an admin, opened the grinnell:27858 object, and used the FEDORA admin interface there to replace its OBJ datastream. I allowed the upload portion of the process to “spin” for more than an hour, but when I stopped it and saved changes, I found a new OBJ that is 1.97 GB in size. That new OBJ appears to be viable.|
|17. Generated SCREENSHOT Images||Success||Rebecca used the “Snip Tool” in Windows to collect screenshot images of each “live” website home page. These were uploaded to //Storage for subsequent ingest.|
|18. Added Images via Manage/Datastreams Menu||Success||I navigated to each WARC object’s manage/datastreams page and used the |
|19. Added Empty MODS Datastreams||Success||While still working in each object’s manage/datastreams page I used the |
|20. Exported Google Sheet to TSV||Success||To prep for updating each object’s MODS record, I exported the “MASTER” tab of our Google Sheet to a .tsv (tab-separated values) file and saved the export in //Storage/Library/AllStaff/DG-Metadata-Review-2020-r1/WARC/mods-imvt.tsv.|
|21. Invoked Islandora-MODS-via-Twig Workflow||Success||I engaged the |
|22. Initiate Derivative Regeneration||Success||Since the two WARC objects were not ingested in a “traditional” manner, it was necessary to regenerate all derivative datastreams to complete the object. I did so, in the case of |
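Step 10 in the table above ran into trouble unzipping a nearly 2 GB .gz file with a desktop archive tool. The sketch below is not the tool used in that step; it is one possible alternative, assuming Python 3 is available, that decompresses a .gz file in fixed-size chunks so the whole archive never has to fit in memory. The function name `gunzip_streaming` and the file names in the usage note are hypothetical.

```python
import gzip
import shutil
from pathlib import Path


def gunzip_streaming(src: Path, dest: Path, chunk_size: int = 1024 * 1024) -> int:
    """Decompress the gzip file `src` into `dest` and return the bytes written.

    The data is copied in `chunk_size` pieces (1 MiB by default), so memory
    use stays small even when the archive is several gigabytes in size.
    """
    with gzip.open(src, "rb") as compressed, open(dest, "wb") as out:
        shutil.copyfileobj(compressed, out, chunk_size)
    return dest.stat().st_size
```

A call such as `gunzip_streaming(Path("site.warc.gz"), Path("site.warc"))` (placeholder file names) would write the uncompressed .warc next to the archive and report its size in bytes.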
Completing steps 1-4, 7, 10, 12, 14, and 16-22 resulted in the two “complete” WARC objects we now have in:
Move to Faculty Scholarship
All of the above steps were performed while the two objects were part of the Pending-Review collection in Digital.Grinnell. Once the objects were reviewed and determined to be correct, I used each object’s manage page to “Migrate this Object to another collection”, moving both of them to Faculty Scholarship.
Setting Object Permissions
Since both of the WARC objects are for archival purposes only, it was determined that they should be accessible only to system administrators. To enforce that restriction I visited each object’s manage/xacml page to set appropriate restrictions on object management and object viewing.
Adding PDF Datastreams
The islandora_solution_pack_web_archive module and the Web ARChive content model provide an option to include a PDF in the ingest process. We did not initially generate any PDFs for the two WARCs that were ingested, but we have since taken steps to add PDF datastreams to see what that option has to offer. The process we employed is briefly documented below.
As mentioned earlier, a wget command was used to create the .warc files that we ingested, but wget does not appear to offer a viable option to create a PDF file. Fortunately, Rebecca’s research turned up this Adobe Acrobat Pro trick: https://lenashore.com/2019/06/how-to-make-a-pdf-of-an-entire-website/.
Rebecca reports that this process works, but can take a very long time. She apparently had to specify a limited number of “levels” to enable creation of a reasonable PDF in the case of the WMI web site.
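The exact wget command used to create the .warc files is not reproduced in this post. As a rough sketch only, the Python snippet below assembles a GNU Wget command line that uses Wget's documented WARC options (`--warc-file`, `--warc-cdx`, `--no-warc-compression`). The URL, the output name, and the choice of `--mirror` and `--page-requisites` are assumptions for illustration, not the settings Rebecca used. The command is only built and printed, not executed.

```python
import shlex
from typing import List


def build_wget_warc_command(
    url: str, warc_name: str, with_cdx: bool = True, compress: bool = True
) -> List[str]:
    """Return an argument list for a wget crawl that also writes a WARC file.

    `warc_name` is the output base name; wget appends the file extension.
    """
    cmd = ["wget", "--mirror", "--page-requisites", f"--warc-file={warc_name}"]
    if with_cdx:
        # Also write a CDX index file alongside the WARC.
        cmd.append("--warc-cdx")
    if not compress:
        # Wget gzips WARC output by default; this flag keeps a plain .warc.
        cmd.append("--no-warc-compression")
    cmd.append(url)
    return cmd


# Hypothetical site and output name, shown as a copy-and-paste command line.
print(shlex.join(build_wget_warc_command("https://example.org/", "example-site")))
```

Building the command as a list keeps it safe to hand to `subprocess.run` later, without shell quoting problems in URLs.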
PDF Datastream Addition
Each object’s PDF file was uploaded and ingested to join its corresponding object using steps similar to 18 and 22 above. Each object’s manage/datastreams page was engaged and the
Add a datastream link used to create new
Again, since the two WARC objects were not ingested in a “traditional” manner, I thought it necessary to regenerate all derivative datastreams to complete each object. I did so, in the case of grinnell:27858, for example, by visiting the object’s manage/properties page and clicking Regenerate all derivatives.
The addition of the
WARC_FILTERED datastreams that I manually removed.
The addition of the
And that’s a wrap. Until next time…