
Exporting, Editing, & Replacing MODS Datastreams

The transition to distance learning and social distancing at Grinnell College in the wake of the COVID-19 pandemic may afford GC Libraries an opportunity to do some overdue and necessary metadata cleaning in Digital.Grinnell. I believe that library staff who cannot take their usual work home will be asked to assist. I am personally grateful that our leadership sees fit to do this, and I am looking forward to supporting and working with my outstanding colleagues who will tackle this task.

To help implement this process efficiently and effectively, I’m first turning to “Exporting, Editing, & Replacing MODS Datastreams”, a workflow developed by the good folks at The California Historical Society. I’ll initiate the workflow by installing two Drush tools on my local/development instance of ISLE on my Mac workstation.

Installing Necessary Modules

The command line process in my local host/workstation terminal looks like this, where ${Apache} holds the name of my Apache container (isle-apache-ld on this local instance):

docker exec -w /var/www/html/sites/all/modules/islandora/ ${Apache} git clone --recursive
docker exec -w /var/www/html/sites/all/modules/islandora/ ${Apache} git clone --recursive
docker exec -w /var/www/html/sites/all/modules/islandora/ ${Apache} chown -R islandora:www-data *
docker exec -w /var/www/html/sites/default ${Apache} drush en islandora_datastream_exporter islandora_datastream_replace -y
docker exec -w /var/www/html/sites/default ${Apache} drush cc drush -y

Exporting “grinnell:” Namespace Objects

I’ve elected to export all of the “grinnell:” namespace objects to my private: filesystem which resides on my host at ~/GitHub/dg-isle/private and maps into my Apache container as /var/www/private. The commands are:

docker exec -w /var/www/ ${Apache} chown -R islandora:www-data private
docker exec -w /var/www/ ${Apache} chmod 775 private
docker exec -w /var/www/html/sites/default/ ${Apache} drush -u 1 islandora_datastream_export --export_target=/var/www/private --query=PID:grinnell* --dsid=MODS

The last command in the above set populated the export_target directory mentioned above, and the result looks like this:

╭─mark@Marks-Mac-Mini ~/GitHub/dg-isle/private ‹master*›
╰─$ ls -alh
total 128
drwxrwxr-x@ 14 mark  staff   448B Mar 17 15:37 .
drwxr-xr-x  30 mark  staff   960B Mar 12 19:40 ..
-rw-r--r--@  1 mark  staff   675B Mar 10 14:01 .htaccess
-rw-r--r--   1 mark  staff   2.4K Mar 17 15:48 grinnell_11569_MODS.xml
-rw-r--r--   1 mark  staff   2.5K Mar 17 15:50 grinnell_20575_MODS.xml
-rw-r--r--   1 mark  staff   7.8K Mar 17 15:38 grinnell_25482_MODS.xml
-rw-r--r--   1 mark  staff   7.4K Mar 17 15:38 grinnell_25483_MODS.xml
-rw-r--r--   1 mark  staff   7.3K Mar 17 15:38 grinnell_25484_MODS.xml
-rw-r--r--   1 mark  staff   1.3K Mar 17 15:38 grinnell_25493_MODS.xml
-rw-r--r--   1 mark  staff   1.3K Mar 17 15:38 grinnell_25494_MODS.xml
-rw-r--r--   1 mark  staff   1.1K Mar 17 15:38 grinnell_25495_MODS.xml
-rw-r--r--   1 mark  staff   1.3K Mar 17 15:38 grinnell_25497_MODS.xml
-rw-r--r--   1 mark  staff   4.8K Mar 17 15:38 grinnell_25510_MODS.xml
-rw-r--r--   1 mark  staff   2.7K Mar 17 15:48 grinnell_3246_MODS.xml

Note that objects which have no MODS datastream, like collections in the “grinnell:” namespace, were reported as such and were automatically excluded from the export_target directory.
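Since each exported file is named for its object, the list of exported PIDs can be recovered from the file names alone. Here is a minimal sketch, assuming the namespace_number_MODS.xml naming pattern seen in the listing above. The exported_pids function name is my own invention for illustration, not part of either Drush module.

```shell
# Print the PID for each exported MODS file in a directory.
# Assumes file names of the form <namespace>_<number>_MODS.xml.
# exported_pids is a hypothetical helper, not part of any Islandora module.
exported_pids() {
    dir="${1:-.}"
    for f in "$dir"/*_MODS.xml; do
        [ -e "$f" ] || continue            # skip the literal pattern if nothing matched
        base=$(basename "$f" _MODS.xml)    # e.g. grinnell_11569
        echo "${base%_*}:${base##*_}"      # e.g. grinnell:11569
    done
}
```

Something like exported_pids ~/GitHub/dg-isle/private would then list one PID per line, which could be handy for spot checks after an import.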

Editing the MODS Metadata

I found a simple and effective process for editing the .xml files that were produced, using Atom. For testing purposes I made changes in a few of the objects listed above: 3246, 11569, 20575, and 25497. These changes included removing some empty XML tags, changing student/alumni graduation class years from '64 notation to , Class of 1964, as well as elimination of trailing whitespace and whitespace-only lines. Now I’ll push them back into the repository to confirm that the workflow does indeed “work”.
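The whitespace portion of that cleanup could also be scripted rather than done by hand in an editor. What follows is only a sketch under stated assumptions: clean_whitespace is a hypothetical helper name, and it assumes no element in the MODS record depends on blank lines or trailing spaces inside its text content.

```shell
# Strip trailing whitespace and drop blank (whitespace-only) lines from one file.
# clean_whitespace is a hypothetical helper. It assumes blank lines and trailing
# spaces carry no meaning in the file being cleaned.
clean_whitespace() {
    file="$1"
    tmp=$(mktemp) || return 1
    # Write to a temporary file first so a failed sed never truncates the original.
    if sed -e 's/[[:space:]]*$//' -e '/^$/d' "$file" > "$tmp"; then
        cat "$tmp" > "$file"    # overwrite in place, keeping the original permissions
    fi
    rm -f "$tmp"
}
```

The two sed expressions run in order on each line, so a line holding only spaces or tabs is first emptied and then deleted.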

Importing the Changes

My command line “test” import of all the exported objects goes like this:

docker exec -w /var/www/html/sites/default/ ${Apache} drush -u 1 islandora_datastream_replace --dsid=MODS --source=/var/www/private/ --namespace=grinnell

And the output from that command, with blank lines removed, is:

╭─mark@Marks-Mac-Mini ~/GitHub/dg-isle ‹master*›
╰─$ docker exec -w /var/www/html/sites/default/ isle-apache-ld drush -u 1 islandora_datastream_replace --dsid=MODS --source=/var/www/private/ --namespace=grinnell
dsid=MODS  filename=
 dsid is not in filename!
dsid=MODS  filename=.
 dsid is not in filename!
dsid=MODS  filename=
 dsid is not in filename!
 file = grinnell_11569_MODS
Datastream replacement succeeded for grinnell:11569.                   [success]
 file = grinnell_20575_MODS
 file = grinnell_25482_MODS
Datastream replacement succeeded for grinnell:20575.                   [success]
 file = grinnell_25483_MODS
Datastream replacement succeeded for grinnell:25482.                   [success]
Datastream replacement succeeded for grinnell:25483.                   [success]
 file = grinnell_25484_MODS
Datastream replacement succeeded for grinnell:25484.                   [success]
 file = grinnell_25493_MODS
Datastream replacement succeeded for grinnell:25493.                   [success]
 file = grinnell_25494_MODS
Datastream replacement succeeded for grinnell:25494.                   [success]
 file = grinnell_25495_MODS
Datastream replacement succeeded for grinnell:25495.                   [success]
 file = grinnell_25497_MODS
Datastream replacement succeeded for grinnell:25497.                   [success]
 file = grinnell_25510_MODS
Datastream replacement succeeded for grinnell:25510.                   [success]
 file = grinnell_3246_MODS
Datastream replacement succeeded for grinnell:3246.                    [success]

Note that some replacement operations clearly took longer than others, and in some cases they report back in the “wrong order”, but it looks like all the legitimate object MODS records got updated. It’s also worth noting that after import each processed .xml file name is changed to .xml.used so that the file is not processed again unless its name is modified beforehand. Now to check up on one or two of them…
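If an import ever needs to be repeated, the .used suffix has to come off first. A minimal sketch, assuming the suffix is simply appended to the original name as described above. The reset_used function is hypothetical and is not a feature of the replace module.

```shell
# Rename every *.xml.used file in a directory back to *.xml so the files
# are picked up again by the next replace run.
# reset_used is a hypothetical helper, not part of islandora_datastream_replace.
reset_used() {
    dir="${1:-.}"
    for f in "$dir"/*.xml.used; do
        [ -e "$f" ] || continue      # nothing matched the pattern
        mv "$f" "${f%.used}"         # drop only the trailing .used
    done
}
```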

Test Results

Clearly, the changes I made to the four MODS .xml files are present in the repository. However, they are NOT reflected in the objects’ MODS display, presumably because Digital.Grinnell uses a Solr display of MODS data, and Solr has not been re-indexed.

So, I elected to validate and enable an old feature of my IDU - Islandora Drush Utilities module, the “DCTransform” command. Coupling “DCTransform” with the previously validated “SelfTransform” command and the “--reorder” option appears to clean things up as expected.

docker exec -w /var/www/html/sites/default/ isle-apache-ld drush cc all
docker exec -w /var/www/html/sites/default/ ${Apache} drush -u 1 iduF grinnell:25497 DCTransform
docker exec -w /var/www/html/sites/default/ ${Apache} drush -u 1 iduF grinnell:25497 SelfTransform --reorder
docker exec -w /var/www/html/sites/default/ ${Apache} drush -u 1 iduF grinnell:3246 DCTransform
docker exec -w /var/www/html/sites/default/ ${Apache} drush -u 1 iduF grinnell:3246 SelfTransform --reorder
docker exec -w /var/www/html/sites/default/ ${Apache} drush -u 1 iduF grinnell:11569 DCTransform
docker exec -w /var/www/html/sites/default/ ${Apache} drush -u 1 iduF grinnell:11569 SelfTransform --reorder
docker exec -w /var/www/html/sites/default/ ${Apache} drush -u 1 iduF grinnell:20575 DCTransform
docker exec -w /var/www/html/sites/default/ ${Apache} drush -u 1 iduF grinnell:20575 SelfTransform --reorder
docker exec -w /var/www/html/sites/default/ isle-apache-ld drush cc all


It works. However, it’s worth noting that changes made to objects’ title field were NOT reflected in the object titles, presumably because the title becomes a “property” of the object itself, held apart from either the MODS or DC datastreams.
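For more than a handful of objects, the paired commands above could be folded into a loop. This is a sketch only: the refresh_objects name is hypothetical, and it assumes ${Apache} is set to the Apache container name as in the earlier commands.

```shell
# Run DCTransform, then SelfTransform --reorder, for each PID given.
# refresh_objects is a hypothetical wrapper around the iduF commands shown above.
# Assumes ${Apache} names the Apache container.
refresh_objects() {
    for pid in "$@"; do
        docker exec -w /var/www/html/sites/default/ "${Apache}" drush -u 1 iduF "$pid" DCTransform
        docker exec -w /var/www/html/sites/default/ "${Apache}" drush -u 1 iduF "$pid" SelfTransform --reorder
    done
}

# Example call, using the four objects edited in this test:
#   refresh_objects grinnell:25497 grinnell:3246 grinnell:11569 grinnell:20575
```

The cache clears before and after would still be run once each, outside the loop.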

Next step: repeat the installation in production and export ALL of the objects in preparation for review and editing.

Exporting By Collection

After the successful tests documented above, I repeated this process for Digital.Grinnell production on host DGDocker1. The export worked nicely; however, the process produces so many MODS .xml files (9,084 is the count) that I can’t easily work with them: there are too many to “glob” using a single wildcard spec. So, I’m going to try to formulate an export that isolates objects into their primary collections. A test of this process on my local/dev instance of ISLE looks like this:

docker exec -w ${Target} ${Apache} mkdir -p grinnell/${Collection}
docker exec -w /var/www/html/sites/default/ ${Apache} drush -u 1 islandora_datastream_export --export_target=${Target}/grinnell/${Collection} --query=RELS_EXT_isMemberOfCollection_uri_mlt:\"info:fedora/grinnell:${Collection}\" --dsid=MODS
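As an aside, even a flat directory of thousands of files can still be counted or processed with find, because the shell never has to expand the whole list onto one command line. A small sketch, assuming the same _MODS.xml naming; count_mods is a hypothetical helper.

```shell
# Count MODS export files under a directory without expanding a wildcard
# on the command line, so the count works no matter how many files exist.
# count_mods is a hypothetical helper name.
count_mods() {
    find "${1:-.}" -type f -name '*_MODS.xml' | wc -l | tr -d ' '
}
```

The same idea extends to batch processing: find with -exec some_command {} + hands the files to the command in groups small enough to fit the argument-length limit.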

In Production

The same export in production on DGDocker1 looks like this:

Target=/utility-scripts
Collection=jimmy-ley
docker exec -w ${Target} ${Apache} mkdir -p grinnell/${Collection}
docker exec -w /var/www/html/sites/default/ ${Apache} drush -u 1 islandora_datastream_export --export_target=${Target}/grinnell/${Collection} --query=RELS_EXT_isMemberOfCollection_uri_mlt:\"info:fedora/grinnell:${Collection}\" --dsid=MODS

This command set produced a total of 210 MODS .xml files in the Apache container’s new /utility-scripts/grinnell/jimmy-ley directory. All that’s required to repeat it for other collections in the grinnell: namespace is to repeat the command set changing the value of Collection= as needed.

There is a list of ALL Digital.Grinnell collections in this public gist, making it possible to loop the aforementioned command set like so:

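while read -r collection
do
    # Assumes each line of collections.list is a collection ID without the grinnell: prefix
    q="RELS_EXT_isMemberOfCollection_uri_mlt:\"info:fedora/grinnell:${collection}\""
    echo "Processing collection '${collection}'; Query is '${q}'..."
    docker exec -w ${Target} ${Apache} mkdir -p exported-MODS/${collection}
    docker exec -w /var/www/html/sites/default/ ${Apache} drush -u 1 islandora_datastream_export --export_target=${Target}/exported-MODS/${collection} --query="${q}" --dsid=MODS
done < collections.list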

Next Steps

I’ll open another post so that I can document the process of collecting and processing all the dumped MODS records, presumably using OpenRefine.

And that’s a wrap. Until next time… 😄

Mark A. McFate
Digital Library Applications Developer