
Exporting, Editing, & Replacing MODS Datastreams: Technical Details

Attention: On 21-May-2020 an optional, but recommended, sixth step was added to this workflow in the form of a new Drush command: islandora_mods_post_processing, an addition to my previous work in islandora_mods_via_twig. See my new post, Islandora MODS Post Processing for complete details.

A 5-Step Workflow

This document is a follow-up, with technical details, to Exporting, Editing, & Replacing MODS Datastreams, post 069, in my blog. In case you missed it, that post was written specifically for metadata editors working on the 2020 Grinnell College Libraries review of Digital Grinnell MODS metadata.

Attention: This document uses a shorthand ./ in place of the frequently referenced //STORAGE/LIBRARY/ALLSTAFF/DG-Metadata-Review-2020-r1/ directory. For example, ./social-justice is equivalent to the Social Justice collection sub-directory at //STORAGE/LIBRARY/ALLSTAFF/DG-Metadata-Review-2020-r1/social-justice.

Briefly, the five steps in this workflow are:

  1. Export of all grinnell:* MODS datastreams using drush islandora_datastream_export. This step, last performed on April 14, 2020, was responsible for creating all of the grinnell_<PID>_MODS.xml exports found in ./<collection-PID>.

  2. Execute my Map-MODS-to-MASTER Python 3 script on iMac MA8660 to create a mods.tsv file for each collection, along with associated grinnell_<PID>_MODS.log and grinnell_<PID>_MODS.remainder files for each object. The resultant ./<collection-PID>/mods.tsv files are tab-separated-value (.tsv) files, and they are key to this process.

  3. Edit the MODS .tsv files. Refer to Exporting, Editing, & Replacing MODS Datastreams for details and guidance.

  4. Use drush islandora_mods_via_twig in each ready-for-update collection to generate new .xml MODS datastream files. For a specified collection, this command will find and read the ./<collection-PID>/mods-imvt.tsv and create one ./<collection-PID>/ready-for-datastream-replace/grinnell_<PID>_MODS.xml file for each object.

  5. Execute the drush islandora_datastream_replace command once for each collection. This command will process each ./<collection-PID>/ready-for-datastream-replace/grinnell_<PID>_MODS.xml file and replace the corresponding object’s MODS datastream with the contents of the .xml file. The digital_grinnell branch version of the islandora_datastream_replace command also performs an implicit update of the object’s “Title”, a transform of the new MODS to DC (Dublin Core), and a re-indexing of the new metadata in Solr.
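
The file layout these steps rely on can be sketched in a few lines of shell. This is only an illustration: the function names are hypothetical, the PID grinnell:1234 is made up, and it assumes that <PID> in grinnell_<PID>_MODS.xml is the portion of the object PID that follows "grinnell:".

```shell
# Hypothetical helpers, not part of any Drush command described here.
# Assumption: "<PID>" in grinnell_<PID>_MODS.xml is the text after "grinnell:".
mods_filename() {
    # $1 = full object PID, e.g. grinnell:1234 (an invented example)
    printf '%s_%s_MODS.xml\n' "${1%%:*}" "${1#*:}"
}

object_paths() {
    # $1 = collection directory name, $2 = full object PID
    # Prints the Step 1 export path, then the Step 4 output path.
    printf './%s/%s\n' "$1" "$(mods_filename "$2")"
    printf './%s/ready-for-datastream-replace/%s\n' "$1" "$(mods_filename "$2")"
}

object_paths social-justice grinnell:1234
```

The first path is what Step 1 exports and Step 2 reads; the second is what Step 4 writes and Step 5 consumes.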

The remainder of this document provides technical details, frequently in the form of command lines used to build and use the aforementioned tools.

Step 1a - Installation of Drush islandora_datastream_export and islandora_datastream_replace Commands

To help implement this process efficiently and effectively I first turned to Exporting, Editing, & Replacing MODS Datastreams, a workflow developed by the good folks at The California Historical Society. I initiated the workflow by installing two Drush tools on my local/development instance of ISLE on my Mac workstation.

The command line process in my local host/workstation terminal looked like this:

# Name of the local ISLE Apache container
Apache=isle-apache-ld
docker exec -w /var/www/html/sites/all/modules/islandora/ ${Apache} git clone https://github.com/Islandora-Labs/islandora_datastream_exporter.git --recursive
docker exec -w /var/www/html/sites/all/modules/islandora/ ${Apache} git clone https://github.com/pc37utn/islandora_datastream_replace.git --recursive
# Run chown through "sh -c" so the * is expanded inside the container, not by the host shell
docker exec -w /var/www/html/sites/all/modules/islandora/ ${Apache} sh -c 'chown -R islandora:www-data *'
docker exec -w /var/www/html/sites/default ${Apache} drush en islandora_datastream_exporter islandora_datastream_replace -y
docker exec -w /var/www/html/sites/default ${Apache} drush cc drush -y

Local tests of these commands were successful, so I proceeded to install them in the production instance of Digital Grinnell at dgdocker1.grinnell.edu. Before doing that I needed to change the definition of Apache to reflect the production instance of our Apache container, like so: Apache=isle-apache-dg.

Created a Fork of Islandora Datastream Replace

I also chose to “fork” the islandora_datastream_replace project so that I could do a little Digital.Grinnell customization of it. The fork I’m working with is at https://github.com/DigitalGrinnell/islandora_datastream_replace, and my work is limited to the digital_grinnell branch of that fork.

In the digital_grinnell branch I modified the behavior of the islandora_datastream_replace command so that it implicitly performs an UpdateFromMODS operation that lives in our idu, or Islandora Drush Utilities, module. The UpdateFromMODS operation, performed immediately after each datastream replace operation, does the following:

  • Updates the object “Title”, one of its properties, to match the new value of /mods:mods/mods:titleInfo[not(@type)]/mods:title.
  • Invokes the iduF DCTransform operation which runs the default XSLT transform of the new MODS to DC (Dublin Core) and creates a new “DC” datastream for the object.
  • The iduF DCTransform operation also concludes with an implicit iduF IndexSolr operation to ensure that the new object metadata is properly indexed in Solr.
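
To preview the title that the first bullet refers to, the same XPath can be evaluated against a MODS file from the command line. The sketch below is not how UpdateFromMODS is implemented; it assumes xmllint (from libxml2) is installed, uses local-name() to avoid registering the mods: namespace prefix, and runs against an invented sample record.

```shell
# Sketch only: read the untyped title from a MODS file with xmllint.
# mods_title is a hypothetical helper name.
mods_title() {
    # $1 = path to a MODS .xml file
    xmllint --xpath "string(/*[local-name()='mods']/*[local-name()='titleInfo'][not(@type)]/*[local-name()='title'])" "$1"
}

# Invented sample record with one typed and one untyped titleInfo element.
sample=$(mktemp)
cat > "$sample" <<'EOF'
<mods xmlns="http://www.loc.gov/mods/v3">
  <titleInfo type="alternative"><title>Alternate Title</title></titleInfo>
  <titleInfo><title>Example Title</title></titleInfo>
</mods>
EOF

# Only attempt the query when xmllint is actually available.
if command -v xmllint >/dev/null 2>&1; then
    mods_title "$sample"
fi
rm -f "$sample"
```

The [not(@type)] predicate is what skips alternative or translated titles, so only the primary title is considered.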

Step 1b - Installation of Drush islandora_datastream_export and islandora_datastream_replace Commands in Production

To install the commands in production I opened a terminal to dgdocker1.grinnell.edu as user islandora and executed the following commands there:

# Name of the production ISLE Apache container
Apache=isle-apache-dg
docker exec -w /var/www/html/sites/all/modules/islandora/ ${Apache} git clone https://github.com/Islandora-Labs/islandora_datastream_exporter.git --recursive
docker exec -w /var/www/html/sites/all/modules/islandora/ ${Apache} git clone https://github.com/DigitalGrinnell/islandora_datastream_replace.git --recursive
# Run chown through "sh -c" so the * is expanded inside the container, not by the host shell
docker exec -w /var/www/html/sites/all/modules/islandora/ ${Apache} sh -c 'chown -R islandora:www-data *'
# Check out the existing digital_grinnell branch of the fork (no -b, which would create a new branch from the current HEAD)
docker exec -w /var/www/html/sites/all/modules/islandora/islandora_datastream_replace ${Apache} git checkout digital_grinnell
docker exec -w /var/www/html/sites/default ${Apache} drush en islandora_datastream_exporter islandora_datastream_replace -y
docker exec -w /var/www/html/sites/default ${Apache} drush cc drush -y

Step 1c - Mounting //STORAGE to DGDocker1

Attention! This step, and some that come later, will require that the network storage path //STORAGE/LIBRARY/ALLSTAFF/DG-Metadata-Review-2020-r1 be accessible to our production instance of Digital.Grinnell. To make that possible I had to run this sequence on DGDocker1:

docker exec -it isle-apache-dg bash
mount -t cifs -o username=mcfatem //storage.grinnell.edu/LIBRARY/ALLSTAFF/DG-Metadata-Review-2020-r1 /mnt/metadata-review
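
Because later steps fail quietly when the share is not mounted, a quick pre-flight check inside the container can help. The following is a minimal sketch; share_ready is a hypothetical helper, and it only tests that the directory exists and is not empty, which is weaker than confirming an actual CIFS mount.

```shell
# Hypothetical pre-flight check: succeed only if the directory exists
# and contains at least one entry.
share_ready() {
    # $1 = mount point to inspect
    [ -d "$1" ] && [ -n "$(ls -A "$1" 2>/dev/null)" ]
}

if share_ready /mnt/metadata-review; then
    echo "share is mounted and populated"
else
    echo "share is missing or empty; run the mount command first" >&2
fi
```

On Linux, mountpoint -q from util-linux is a stricter alternative when it is available in the container.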

Step 1d - Using Drush islandora_datastream_export

Unfortunately, the islandora_datastream_export results in my local test were woefully incomplete… NONE of the child objects with a compound parent were exported. I’m still not entirely sure why child objects were omitted since the query I used should have captured all objects. In testing I did find that this seems to be a flaw in the islandora_datastream_export command, and specifically in its implementation of any Solr query.

Fortunately, the aforementioned command also has a SPARQL query option, and after some trial-and-error I got it to work properly. To do so I created an export.sh bash script, shown below, and used it on dgdocker1.grinnell.edu like so:

docker exec -it isle-apache-dg bash
source export.sh

The export.sh script is:

Apache=isle-apache-dg
Target=/utility-scripts
# wget https://gist.github.com/McFateM/5bd7e5b0fa5d2928b2799d039a4c0fab/raw/collections.list
while read collection
do
    cp -f ri-query.txt query.sparql
    sed -i "s|COLLECTION|${collection}|g" query.sparql
    docker cp query.sparql ${Apache}:${Target}/${collection}.sparql
    rm -f query.sparql
    q=${Target}/${collection}.sparql
    echo "Processing collection '${collection}'; query is '${q}'..."
    docker exec -w ${Target} ${Apache} mkdir -p /mnt/metadata-review/${collection}
    docker exec -w /var/www/html/sites/default/ ${Apache} drush -u 1 islandora_datastream_export --export_target=/mnt/metadata-review/${collection} --query=${q} --query_type=islandora_datastream_exporter_ri_query  --dsid=MODS
done < collections.list
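
The template substitution at the heart of export.sh can be tried on its own, without Docker. In the sketch below, render_query is a hypothetical helper and the one-line SPARQL template is only a stand-in: it is not the ri-query.txt used for this project, it would not capture children of compound objects, and the grinnell:COLLECTION form of the collection PID is an assumption.

```shell
# Hypothetical helper: fill the COLLECTION token in a query template.
render_query() {
    # $1 = template file containing the literal token COLLECTION
    # $2 = collection name taken from collections.list
    sed "s|COLLECTION|$2|g" "$1"
}

# Stand-in template (NOT the real ri-query.txt; PID form is assumed).
template=$(mktemp)
cat > "$template" <<'EOF'
SELECT ?pid FROM <#ri> WHERE { ?pid <fedora-rels-ext:isMemberOfCollection> <info:fedora/grinnell:COLLECTION> }
EOF

render_query "$template" social-justice
rm -f "$template"
```

Writing to standard output instead of editing a copy in place keeps the template untouched and avoids the temporary query.sparql file.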

In the case of the Digital Grinnell social-justice collection, for example, this script produced 32 .xml files, the correct number. Each collection’s set of exported .xml files can be found in the collection-specific subdirectory of //STORAGE/LIBRARY/ALLSTAFF/DG-Metadata-Review-2020-r1/ and all have filenames of the form: grinnell_<PID>_MODS.xml. Note that objects which have no MODS datastream were not exported.
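
A count like the one quoted for social-justice can be checked with a short function. This is a sketch with a hypothetical name, count_mods_exports, and it assumes the share is mounted at /mnt/metadata-review as in Step 1c.

```shell
# Hypothetical helper: count exported MODS files in one collection directory.
count_mods_exports() {
    # $1 = collection directory; only files named grinnell_*_MODS.xml are counted
    n=0
    for f in "$1"/grinnell_*_MODS.xml; do
        # When nothing matches, the pattern stays literal and fails this test.
        if [ -e "$f" ]; then
            n=$((n + 1))
        fi
    done
    echo "$n"
}

count_mods_exports /mnt/metadata-review/social-justice
```

Comparing this number with the object count reported for the collection is a quick way to spot an incomplete export.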

Step 2 - Map-MODS-to-MASTER Python 3 Script

The Map-MODS-to-MASTER script was developed, in Python 3, on iMac MA8660 at ~/GitHub/Map-MODS-to-MASTER to facilitate generation of mods.tsv and accompanying .log files for each Digital Grinnell collection from the .xml files found in subdirectories of //STORAGE/LIBRARY/ALLSTAFF/DG-Metadata-Review-2020-r1/.

The Map-MODS-to-MASTER project can be found in the master branch of https://github.com/DigitalGrinnell/Map-MODS-to-MASTER. I chose to execute it using PyCharm from iMac MA8660 since the directory holding all of the .xml files and folders is already mapped to /Volumes/LIBRARY/ALLSTAFF/DG-Metadata-Review-2020-r1 on that iMac. Note that this //STORAGE location was chosen because the ./ALLSTAFF directory, and its subordinates, are accessible to all staff in the Grinnell College Libraries.

It should not be necessary to run this script ever again…NEVER. However, if it becomes necessary to look back at this code and process, details can be found in Map-MODS-to-MASTER. Note: If it should ever become necessary to repeat the Map-MODS-to-MASTER process it might be wise to look at replacing the Python 3 script with a new Drush command, maybe islandora_map_mods_to_master, written in PHP and installed directly into the production instance of Digital.Grinnell.

Step 3 - Editing the MODS .tsv Files

Please refer to Exporting, Editing, & Replacing MODS Datastreams, post 069 in my blog, for details and guidance.

Step 4 - Run drush islandora_mods_via_twig

As each individual collection mods-imvt.tsv file is made ready-for-update, it will be necessary to run a drush islandora_mods_via_twig command to process the .tsv data. Running --help with that command produces:

[islandora@dgdocker1 ~]$ docker exec -it isle-apache-dg bash
root@122092fe8182:/# cd /var/www/html/sites/default/
root@122092fe8182:/var/www/html/sites/default# drush -u 1 islandora_mods_via_twig --help
Generate MODS .xml files from the mods-imvt.tsv file for a specified collection.

Examples:
 drush -u 1 islandora_mods_via_twig social-justice   Process ../social-justice/mods-imvt.tsv, for example.

Arguments:
 collection             The name of the collection to be processed.  Defaults to "social-justice".

Aliases: imvt

So, my command sequence to run islandora_mods_via_twig for the “Social Justice” collection, as an example, was:

[islandora@dgdocker1 ~]$ docker exec -it isle-apache-dg bash
root@122092fe8182:/# cd /var/www/html/sites/default/
root@122092fe8182:/var/www/html/sites/default# drush -u 1 islandora_mods_via_twig social-justice

When the islandora_mods_via_twig command is run, it processes the corresponding //STORAGE/LIBRARY/ALLSTAFF/DG-Metadata-Review-2020-r1/<collection-PID>/mods-imvt.tsv file and creates one //STORAGE/LIBRARY/ALLSTAFF/DG-Metadata-Review-2020-r1/<collection-PID>/ready-for-datastream-replace/grinnell_<PID>_MODS.xml file for each object.
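
After the command finishes, it can be useful to list any exported object that did not receive a new file. The sketch below uses a hypothetical helper, missing_replacements, and assumes the Step 1 exports are still present in the collection directory. An object may be absent on purpose (for example, if its row was removed from the .tsv), so treat the output as a prompt to look, not as an error report.

```shell
# Hypothetical helper: list exported MODS files with no regenerated counterpart.
missing_replacements() {
    # $1 = collection directory holding the Step 1 exports
    for f in "$1"/grinnell_*_MODS.xml; do
        [ -e "$f" ] || continue          # no exports at all: nothing to report
        name=${f##*/}                    # strip the directory part
        if [ ! -e "$1/ready-for-datastream-replace/$name" ]; then
            echo "$name"
        fi
    done
}

missing_replacements /mnt/metadata-review/social-justice
```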

Step 5 - Run drush islandora_datastream_replace

The point of this entire process is to arrive here with a set of reviewed and modified .xml files in a collection-specific //STORAGE/LIBRARY/ALLSTAFF/DG-Metadata-Review-2020-r1/<collection-PID>/ready-for-datastream-replace/ subdirectory, so that we can replace existing object MODS datastreams with new data. We use the drush islandora_datastream_replace command to do this.

Running --help for the aforementioned command produced this:

root@122092fe8182:/var/www/html/sites/default# drush -u 1 islandora_datastream_replace --help
Replaces a datastream in all objects given a file list in a directory.

Examples:
 drush -u 1 islandora_datastream_replace --source=/mnt/metadata-review/social-justice/ready-for-datastream-replace
 --dsid=MODS --namespace=grinnell

   Replacing MODS datastream for objects in --source using the digital_grinnell branch of code.

Options:
 --dsid                                    The datastream id of the datastream. Required.
 --namespace                               The namespace of the pids. Required.
 --source                                  The directory to get the datastreams and pid# from. Required.

Aliases: idre

It’s worth noting that this command looks for any files named MODS in whatever ABSOLUTE directory is named with the --source parameter. The command shown below was executed inside the Apache container, isle-apache-dg, on node DGDocker1, in order to process Digital Grinnell’s social-justice collection.

root@122092fe8182:/var/www/html/sites/default# drush -u 1 islandora_datastream_replace --source=/mnt/metadata-review/social-justice/ready-for-datastream-replace --dsid=MODS --namespace=grinnell

The same command could have been executed directly from node DGDocker1 like so:

docker exec -w /var/www/html/sites/default isle-apache-dg drush -u 1 islandora_datastream_replace --source=/mnt/metadata-review/social-justice/ready-for-datastream-replace --dsid=MODS --namespace=grinnell
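
Since the command is run once per collection, a small dry-run wrapper makes it easy to review the exact command lines before anything is replaced. This is a sketch, not part of the workflow as I ran it: replace_command is a hypothetical helper, the list of ready collections is an assumption, and the function only prints commands.

```shell
# Hypothetical dry-run helper: print (do not run) the replace command
# for one collection.
replace_command() {
    # $1 = collection directory name under /mnt/metadata-review
    printf 'docker exec -w /var/www/html/sites/default isle-apache-dg drush -u 1 islandora_datastream_replace --source=/mnt/metadata-review/%s/ready-for-datastream-replace --dsid=MODS --namespace=grinnell\n' "$1"
}

# Assumed list of collections whose .tsv edits are finished.
for collection in social-justice; do
    replace_command "$collection"
done
```

Once the printed lines look right, each can be pasted into a terminal on DGDocker1 and run one collection at a time.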

And that’s a wrap. Until next time, stay safe and wash your hands! 😄

Written by Mark A. McFate, Digital Library Applications Developer