A set of 21 PDF objects was ingested into Digital.Grinnell’s Faculty Scholarship collection using IMI on 22-July-2019. Unfortunately, none of these PDFs contained OCR (optical character recognition) or “text recognition” data, so none of them generated a valid FULL_TEXT datastream. FULL_TEXT datastreams are required to make PDFs, and similar text content, searchable and discoverable in Digital.Grinnell.
To confirm that the lack of OCR was in fact the problem, I ran a little test on https://digital.grinnell.edu/islandora/object/grinnell:26702, one of the 21 objects.
In my test I…
- signed in to Digital.Grinnell as an admin,
- opened the object (see address above) in my browser,
- clicked Manage to see all the object details,
- clicked Datastreams to see the list of all the object’s datastreams,
- clicked the download link corresponding to the OBJ datastream. This allowed me to download a copy of the PDF file to my workstation.
- Once the PDF was downloaded I opened it on my workstation in Adobe Acrobat Pro,
- then I chose* the "Text Recognition" option In This File.
- After a few minutes I had a new PDF with OCR’d and searchable text.
- I saved that new PDF on my workstation,
- went back into the Manage tab in my browser,
- then uploaded the new PDF file to Digital.Grinnell.
Once the upload was complete, the system automatically generated new derivatives for the object, which now has a valid FULL_TEXT datastream. This should make the content searchable and discoverable.
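For anyone who wants to double-check the result without clicking through the Manage tab, here is a minimal sketch of how the FULL_TEXT datastream could be inspected from a script. It assumes the Islandora 7-style URL pattern `/islandora/object/<PID>/datastream/<DSID>/view`, and that the datastream is readable by whoever runs the script. The function names and the 20-character threshold are my own illustrative choices, not anything built into Digital.Grinnell.

```python
from urllib.error import HTTPError
from urllib.parse import quote
from urllib.request import urlopen


def datastream_url(base_url: str, pid: str, dsid: str, action: str = "view") -> str:
    """Build a datastream URL, assuming the Islandora 7-style path layout."""
    if action not in ("view", "download"):
        raise ValueError(f"unsupported action: {action!r}")
    base = base_url.rstrip("/")
    return f"{base}/islandora/object/{quote(pid, safe=':')}/datastream/{dsid}/{action}"


def has_usable_text(text: str, min_chars: int = 20) -> bool:
    """Heuristic: count non-whitespace characters; the threshold is arbitrary."""
    return len("".join(text.split())) >= min_chars


def full_text_is_valid(base_url: str, pid: str) -> bool:
    """Fetch FULL_TEXT for one object and apply the heuristic above.

    Needs network access; a missing datastream (HTTP error) counts as invalid.
    """
    try:
        with urlopen(datastream_url(base_url, pid, "FULL_TEXT")) as response:
            body = response.read().decode("utf-8", errors="replace")
    except HTTPError:
        return False
    return has_usable_text(body)
```

Calling `full_text_is_valid("https://digital.grinnell.edu", "grinnell:26702")` would then report whether the test object has text behind it, provided the assumptions above hold for the site.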
*Note that if I had multiple PDFs to process, I believe I could have selected the In Multiple Files option to save some time and OCR several PDFs in one operation.
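Outside of Acrobat, a batch like this could also be scripted. The sketch below is not what I did for this test. It swaps in the open-source `ocrmypdf` command-line tool, which I am assuming is installed, and uses its `--skip-text` option so pages that already contain text are left alone. The function names and folder layout are hypothetical.

```python
import subprocess
from pathlib import Path


def build_ocr_commands(pdfs: list[Path], out_dir: Path) -> list[list[str]]:
    """Return one ocrmypdf command per input PDF, writing into out_dir."""
    commands = []
    for pdf in sorted(pdfs):
        if pdf.suffix.lower() != ".pdf":
            continue  # ignore anything that is not a PDF
        commands.append(["ocrmypdf", "--skip-text", str(pdf), str(out_dir / pdf.name)])
    return commands


def run_batch(src_dir: Path, out_dir: Path) -> None:
    """OCR every PDF in src_dir; stops at the first failure (check=True)."""
    out_dir.mkdir(parents=True, exist_ok=True)
    for command in build_ocr_commands(list(src_dir.iterdir()), out_dir):
        subprocess.run(command, check=True)
```

Building the commands separately from running them makes it easy to print and review the list before anything touches the files.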
The lesson-to-be-learned here is to…
always run "Text Recognition" on a PDF BEFORE it is ingested into Digital.Grinnell. But if you forget, this procedure, in the hands of any Digital.Grinnell admin, can save the day! 😄
And that’s a wrap. Until next time…