Migration Guide: CONTENTdm

Mark Jordan edited this page May 10, 2017 · 55 revisions

Overview

In the spring of 2016, Simon Fraser University Library migrated from CONTENTdm to Islandora. In total, we migrated approximately 110 collections containing 1.3 million objects (using CONTENTdm's count of objects). This migration guide describes the processes and tools we used during our migration, and provides enough background information to help you plan your migration.

The steps involved in migrating a single CONTENTdm collection using MIK are:

  1. Gather information about the CONTENTdm collection
  2. Configure MIK
  3. Test your configuration
  4. Reconfigure MIK if necessary
  5. Retest until happy
  6. Run MIK to perform the full migration
  7. Ingest your packages into Islandora

The "Migration workflow" section below documents each step in detail. The workflow needs to be applied separately to every collection in your CONTENTdm instance. If the expiry of your CONTENTdm license is a hard deadline, you should be planning and practising migrations well in advance of that date.

The key to a successful migration from CONTENTdm is to prepare thoroughly. The more prep work you can do prior to the migration, the faster and easier the migration will be. Some things you can do to prepare for the migration include:

  • Use the CONTENTdm Collection Inspector or other tools of your choice to get a sense of how consistent your metadata is in terms of date formats, controlled vocabularies, etc. You may want to consider using CONTENTdm's search and replace feature to clean up your metadata, or you can use MIK's metadata manipulators and mappings files to perform cleanup and normalization tasks during the migration. In either case, you will need to know how consistent (or inconsistent) the metadata in your CONTENTdm instance is.
  • Audit your preservation master files to make sure they are named consistently and are organized in ways that MIK can access them. MIK can combine the data extracted from CONTENTdm with your preservation masters, but only if the masters are organized in specific ways.
  • Migrating very large collections can take a long time. Both parts of the migration - MIK extracting objects from CONTENTdm, and ingesting those objects into Islandora - can be slow. For example, migrating our largest newspapers collection (300,000 pages) took over a week of execution time combined. One way of reducing the ingest time is to use the derivatives provided by CONTENTdm so you can avoid having Islandora regenerate them. See the "Reusing derivatives generated by CONTENTdm" section below for more information.

This guide assumes that you have access to your CONTENTdm administrative tools, and that you have some familiarity with MIK. In several places, this guide describes an important tool called the CONTENTdm Collection Inspector, which, like MIK itself, is a PHP command-line application. A third PHP command-line application, the Islandora Import Package QA Tool, is also mentioned in this guide. Both the Collection Inspector and the QA Tool are completely independent of MIK and can be used without it, but they are both also very helpful when used in conjunction with MIK.

Differences between CONTENTdm and Islandora

There are a number of differences between CONTENTdm and Islandora that you should take into account while planning your migration. Some impact the migration directly; others do not, but may inform decisions that you need to make in preparation for the migration:

CONTENTdm | Islandora | Impact for migration
Uses a flat metadata structure in which local sites can easily create their own fields. | Stores metadata in MODS XML, although it can be configured to store discovery metadata in any datastream. | Significant because you will need to determine how to handle fields from CONTENTdm that do not map cleanly to standard MODS elements.
Only stores files intended for consumption by end users. | Can store all files associated with an object. | Significant because you will need to decide if you are going to store your master files in Islandora or not.
Supports hierarchical objects, including books. | Also supports hierarchical objects, but (currently) can't batch ingest them. | Significant because all hierarchical objects will be flattened by MIK. Islandora currently lacks a standard tool for batch ingesting hierarchical compound objects. The Islandora Compound Batch module can only ingest flat compound objects.

The two platforms also differ in some ways that are not as significant during the migration itself, but that you should consider when planning the information architecture and user experience of your Islandora site:

CONTENTdm | Islandora | Impact for information architecture/UX
An object in CONTENTdm can only be in one collection. | Objects in Islandora can be in many collections. | When you ingest an object into Islandora, you ingest it into a single collection. After an object has been ingested, you can use Islandora's management tools to share objects across collections.
Provides a fixed set of content types. | Provides a standard set of "content models", but it is possible to create new ones via custom Solution Packs. | All of the most commonly used CONTENTdm content types have equivalent Islandora Solution Packs.
Provides advanced search forms at both the site and collection level. | Provides a single site-wide advanced search form. | Having only one advanced search form may influence your CONTENTdm-to-MODS field mappings. If your users have come to rely on collection-level advanced search forms that expose the collection's custom metadata fields, you will need to decide how many collection-specific fields to include in your single Islandora site-wide advanced search form.
Provides relatively simple permissions that determine the actions users can perform on objects in the collection. | Provides granular but rather complex permissions on objects, and does not take a purely collection-oriented approach to access control. | Permissions you may have granted users in CONTENTdm may not map cleanly or easily to equivalent permissions in Islandora.

CONTENTdm terminology used within MIK

CONTENTdm uses the terms "alias" and "pointer" to refer to a collection's unique identifier and an object's unique identifier, respectively. These two bits of data are visible in object-level URLs, e.g., in the URL

http://content.lib.sfu.ca/cdm/ref/collection/km/id/12895

km is the collection alias and 12895 is the object's pointer. Because an object in CONTENTdm can only be in one collection, the combination of an alias and a pointer uniquely identifies a CONTENTdm object.
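
If you keep lists of object URLs (in spreadsheets or link reports, for example), the alias and pointer can be pulled out of them programmatically. The following is a minimal Python sketch, not part of MIK; the function name `parse_cdm_url` is hypothetical, and it assumes URLs follow the `/collection/<alias>/id/<pointer>` pattern shown above.

```python
import re
from urllib.parse import urlparse

# Matches the path portion of a CONTENTdm reference URL such as
# /cdm/ref/collection/km/id/12895 (pattern assumed from the example URL).
_REF_PATH = re.compile(r"/collection/(?P<alias>[^/]+)/id/(?P<pointer>\d+)")


def parse_cdm_url(url):
    """Return (alias, pointer) for a CONTENTdm object URL, or None if it does not match."""
    match = _REF_PATH.search(urlparse(url).path)
    if match is None:
        return None
    return match.group("alias"), int(match.group("pointer"))


print(parse_cdm_url("http://content.lib.sfu.ca/cdm/ref/collection/km/id/12895"))
```

Because the alias and pointer together identify an object, the returned pair can serve as a key when reconciling source objects against migrated ones.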

"Alias" and to a lesser extent "pointer" are used within MIK configuration files and in some other places in the MIK documentation.

A third term that is used is "nickname". Nicknames are CONTENTdm's internal names for metadata fields. They usually take the form of abbreviated versions of the labels the CONTENTdm administrator assigns to fields. Examples of a collection's field nicknames are shown in the "Determining your CONTENTdm collection's field mappings" section below.

Reusing derivatives generated by CONTENTdm

For some Islandora content models, it is possible to reuse derivatives generated by CONTENTdm. For example, the Islandora Book and Newspaper Batch modules can load existing page-level derivatives, such as OCR, JP2, and TN. See the WRITER section of the CONTENTdm Book and CONTENTdm Newspapers toolchains for information on how to do this. Some points to consider:

  • You may also want to consider using MIK's generate_fits.php post-write hook script to generate FITS XML to be added to objects as the TECHMD datastream.
  • We found that some (but not all) JP2 files generated by CONTENTdm do not work in Islandora. They showed up as entirely black images. The CONTENTdm files are valid JP2s, but they just don't work when viewed in Islandora. If this happens to some of the JP2 files generated by CONTENTdm, you can use the fixjp2.php script provided as part of Islandora Datastream CRUD.

Migration workflow

MIK imposes two constraints that directly determine the number of jobs necessary to complete the migration from a CONTENTdm repository to Islandora. First, MIK migrates single collections from CONTENTdm. Second, MIK can also only migrate one content model at a time. Both of these constraints are consistent with Islandora, since Islandora ingests content of a single content model into a single collection.

The implication of these two workflow constraints is that for every source CONTENTdm collection, you will have at least one MIK job to run, and at least one Islandora batch ingest process. For CONTENTdm collections with multiple content models (some image objects, some video objects, and some book objects, for example), you will need to run MIK once for each type of content. In that example, that is three MIK jobs for a single CONTENTdm collection.

The steps in migrating a collection using MIK are:

  1. Gather information about the CONTENTdm collection
  2. Configure MIK
  3. Test your configuration
  4. Reconfigure MIK if necessary
  5. Retest until happy
  6. Run MIK to perform the full migration
  7. Ingest your packages into Islandora

Even though you will need to configure, test and run MIK for each content model in a CONTENTdm collection, you can reuse much of the configuration across content models.

Note that MIK only migrates objects contained within collections. It doesn't create collection objects in Islandora. To migrate objects into an Islandora collection, the target collection object must already exist. If you have a small number of collections to migrate, creating each one manually is not onerous, but if you have a very large number of CONTENTdm collections, you may want to check out the Islandora CONTENTdm Collection Migrator module.

Step 1: Gather information about the CONTENTdm collection

In preparation for migrating each CONTENTdm collection using MIK, you need to assemble the following information:

  1. how the metadata for the collection's objects should map to MODS, and
  2. how objects in the collection map to Islandora content models.

The CONTENTdm Collection Inspector is a tool that can help you with both of these questions. It lets you query a CONTENTdm collection to get information that will assist in configuring MIK. Examples of how to use this tool are provided below.

Determining your CONTENTdm collection's field mappings

The first step in migrating a collection is to decide how to map the collection's metadata to MODS. The CONTENTdm Collection Inspector can help you with this. MIK uses the human-readable labels in its mappings files, not the nicknames. However, some metadata manipulators, for example NormalizeDate and InsertXmlFromTemplate, use nicknames. Field nicknames are also used by the CONTENTdm Collection Inspector tool itself, for example when you ask it to provide a list of unique field values.

For example, running the following command will show you the field configuration for the collection with the alias "km" (note that this collection contained some fields whose labels are in Punjabi):

php cdminspect --inspect=nicknames --alias=km

Field nicknames for Komagata Maru - Continuing the Journey

Field label => field nickname
=============================
Title => title
ਸਿਰਲੇਖ => titlep
Subject => subjec
ਵਿਸ਼ਾ (ਸੰਕੇਤ ਸ਼ਬਦ) => subjea
Description => descri
ਵਰਣਨ => descra
Creator => creato
ਰਚਣਹਾਰ => creata
Publisher => publis
ਪ੍ਰਕਾਸ਼ਕ => publia
Contributors => contri
Date => dateso
Display date => date
ਜ਼ਾਹਰ ਤਾਰੀਖ਼ => displa
Type => type
Format => format
Identifier => identi
Repository => source
Language => langua
Duration => durati
Rights => rights
ਹੱਕ => righta
Level1 => level1
Level2 => level2
Level3 => level3
Full text => full
Archival file => fullrs
OCLC number => dmoclcno
Date created => dmcreated
Date modified => dmmodified
CONTENTdm number => dmrecord
CONTENTdm file name => find

Done.

The MIK Cookbook provides two entries that will get you going with the CONTENTdm Collection Inspector:

Determining what content types are in your CONTENTdm collection

As stated earlier in this guide, MIK can only migrate one content model at a time. For CONTENTdm collections that contain objects of a single content type (all PDFs, or all books, for example), you will only need to configure and run a single MIK job. For CONTENTdm collections that contain objects of multiple content types (e.g., some images, some videos, some audio), you will need to configure and run one MIK job per content type.

In either case, you need to know which content types your source objects have so you can map them to the corresponding Islandora content models. The following table illustrates the correspondences between common CONTENTdm content types and Islandora content models:

CONTENTdm content type | Islandora Solution Pack / content model
JP2000 | Large Image Solution Pack
Other image formats | Basic Image
Compound (Monograph) | Book Solution Pack
Newspapers | Newspaper Solution Pack
Compound (Document) | Compound Solution Pack
Compound PDF | PDF Solution Pack

CONTENTdm provides a summary of content types under each collection's Admin > Reports > Item types menu, but you can also use the CONTENTdm Collection Inspector to generate an object-level list of which content types are in use. From this list, you can then determine which Islandora content models you will need to enable. For example, to generate a list of content types used in a collection with the alias "km" and save the list to a file named "km_types.txt", you would run the following command:

php cdminspect --inspect=object_type --alias=km --output_file=km_types.txt

The output will look something like this:

# cdminspect output for the '/km' collection.
15693,compound,Document
15720,compound,Document
15740,compound,Document
15840,compound,Document
15843,compound,Document
15846,compound,Document
15847,simple,jp2
15848,simple,jp2
15849,simple,jp2
15850,simple,pdf
15851,simple,jp2
15852,simple,pdf
15856,compound,Document
15859,compound,Document
15864,compound,Document
15866,compound,Document
15875,compound,Document
15876,simple,mp4
16030,simple,mp4
16031,simple,pdf
9213,compound,Monograph
513,simple,jp2
514,simple,jp2
515,simple,jp2
516,simple,jp2
517,simple,jp2
518,simple,jp2
519,simple,jp2
520,simple,jp2

From this report we see that there are compound documents, JPEG2000s, MP4s, and monographs in this collection. The group of objects in the CONTENTdm source collection adhering to each of these content types will need to be migrated in a separate MIK job, configured according to the documentation for the CONTENTdm Generic Compound, CONTENTdm Single File (for the JP2000 and MP4 objects), and CONTENTdm Books (for the monographs).
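
For a large collection, reading the object-level list by eye is impractical, so it can help to tally it. The sketch below is an illustration in Python rather than a feature of the Collection Inspector; it assumes the three-column `pointer,kind,type` layout shown in the sample output above, and the function name `tally_object_types` is hypothetical.

```python
from collections import Counter


def tally_object_types(lines):
    """Count (kind, type) pairs in cdminspect --inspect=object_type output."""
    counts = Counter()
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and the header comment
        _pointer, kind, content_type = line.split(",", 2)
        counts[(kind, content_type)] += 1
    return counts


# A few rows taken from the report above.
sample = """# cdminspect output for the '/km' collection.
15693,compound,Document
15847,simple,jp2
15850,simple,pdf
15876,simple,mp4
9213,compound,Monograph
513,simple,jp2
""".splitlines()

for (kind, content_type), count in sorted(tally_object_types(sample).items()):
    print(f"{kind},{content_type}: {count}")
```

To run it against a real report, pass an open file instead of the sample, for example `tally_object_types(open("km_types.txt", encoding="utf-8"))`. Each distinct pair in the tally is a candidate for its own MIK job.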

Step 2: Configure MIK

Now that you know which fields your CONTENTdm source collection uses, and the content types of the objects in your source collection, you are ready to create an MIK configuration file. To do this, you will first need to create a mappings file.

Again, it is important to remember that you may need to create more than one MIK configuration file per source collection: specifically, you will need one configuration file per target Islandora content model. However, you can reuse the same mappings file across the jobs required for specific content models.

Mappings files

Mappings files for CONTENTdm migrations contain the human-readable source field label in the first column, and the top-level MODS XML snippet in the second column:

Title,<titleInfo type='translated' lang='eng'><title>%value%</title></titleInfo>
ਸਿਰਲੇਖ,<titleInfo lang='pan' script='Guru'><title>%value%</title></titleInfo>
Subject,<subject><topic>%value%</topic></subject>
ਵਿਸ਼ਾ (ਸੰਕੇਤ ਸ਼ਬਦ),<subject lang='pan' script='Guru'><topic>%value%</topic></subject>
Description,<abstract>%value%</abstract>
ਵਰਣਨ,<abstract lang='pan'>%value%</abstract>
Creator,<name><namePart>%value%</namePart><role><roleTerm type='text' authority='marcrelator'>creator</roleTerm><roleTerm type='code' authority='marcrelator'>cre</roleTerm></role></name> 
ਰਚਣਹਾਰ,<name lang='pan' script='Guru'><namePart>%value%</namePart><role><roleTerm type='text' authority='marcrelator'>creator</roleTerm><roleTerm type='code' authority='marcrelator'>cre</roleTerm></role></name>
Publisher,<originInfo><publisher>%value%</publisher></originInfo>
ਪ੍ਰਕਾਸ਼ਕ,<originInfo lang='pan' script='Guru'><publisher>%value%</publisher></originInfo>
Contributors,<name><namePart>%value%</namePart><role><roleTerm type='text' authority='marcrelator'>contributor</roleTerm><roleTerm type='code' authority='marcrelator'>ctb</roleTerm></role></name>
Date,<originInfo><dateIssued encoding='w3cdtf' keyDate='yes'>%value%</dateIssued></originInfo>
ਜ਼ਾਹਰ ਤਾਰੀਖ਼,<originInfo lang='pan' script='Guru'><dateIssued>%value%</dateIssued></originInfo>
Type,<genre>%value%</genre>
Identifier,<identifier>%value%</identifier>
Repository,<location><physicalLocation>%value%</physicalLocation></location>
Language,<language><languageTerm type='text'>%value%</languageTerm></language>
Duration,<physicalDescription><extent>%value%</extent></physicalDescription>
Rights,<accessCondition type='use and reproduction'>%value%</accessCondition>
ਹੱਕ,<accessCondition lang='pan' script='Guru' type='use and reproduction'>%value%</accessCondition>
Level1,<extension><level_1 type='SFU custom metadata for the Komagata Maru Continuing the Journey Collection'>%value%</level_1></extension>
Level2,<extension><level_2 type='SFU custom metadata for the Komagata Maru Continuing the Journey Collection'>%value%</level_2></extension>
Level3,<extension><level_3 type='SFU custom metadata for the Komagata Maru Continuing the Journey Collection'>%value%</level_3></extension>
null1,<extension><CONTENTdmData></CONTENTdmData></extension>
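
Small slips in a mappings file, such as a stray comma or an unclosed tag, are easy to miss by eye. The following Python sketch is one possible pre-flight check and is not part of MIK; the function name `check_mappings` is hypothetical. It assumes the two-column layout described above and only tests that each snippet is well-formed XML once `%value%` is replaced with a placeholder. It does not validate against the MODS schema.

```python
import csv
import io
import xml.etree.ElementTree as ET


def check_mappings(csv_text):
    """Return (row number, problem) tuples for suspicious rows in a mappings file."""
    problems = []
    for row_number, row in enumerate(csv.reader(io.StringIO(csv_text)), start=1):
        if not row:
            continue  # ignore blank lines
        if len(row) < 2 or not row[1].strip():
            problems.append((row_number, "empty MODS snippet column"))
            continue
        # Substitute a dummy value so the snippet can be parsed as XML.
        snippet = row[1].replace("%value%", "x")
        try:
            ET.fromstring(snippet)
        except ET.ParseError as error:
            problems.append((row_number, f"not well-formed XML: {error}"))
    return problems


# Row 2 has a doubled comma and row 3 an unclosed tag (both deliberate).
sample_mappings = """Title,<titleInfo><title>%value%</title></titleInfo>
Subject,,<subject><topic>%value%</topic></subject>
Description,<abstract>%value%</abstract
"""

for row_number, problem in check_mappings(sample_mappings):
    print(f"row {row_number}: {problem}")
```

A check like this catches structural mistakes only; attribute-name typos still produce well-formed XML, so validating the generated MODS remains necessary.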

While creating your mappings file, you will likely want to get a sense of how consistent your source metadata is, since in some cases you may want to take advantage of two MIK features to improve your metadata during the migration:

  1. The ability to replace inconsistent object-level values in your source metadata with a single, consistent target value. This process is described in the "null mappings" section of the mappings files documentation.
  2. The ability to apply one or more MIK metadata manipulators to your source data. Metadata manipulators you may find useful include:

The CONTENTdm Collection Inspector provides a way of getting a list of unique values in each field for a collection. For example, running

php cdminspect --inspect=field_values --nickname=source --alias=km --output_file=kmsources.txt

will result in the unique values for the "source" field being written to the kmsources.txt file:

# cdminspect output for the '/km' collection.
BC Archives
British Library
City of Vancouver Archives
Library and Archives Canada
National Archives and Records Administration
Nehru Memorial Museum and Library, New Delhi
SFU Library
SFU Library Special Collections and Rare Books
Simon Faser University Library
Simon Fraser University Library
The British Library
The National Archives
University of British Columbia
Unknown
Vancouver Public Library
[blank]
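
Lists like this often contain near-duplicates (compare the two spellings of "Simon Fraser University Library" above). As a rough aid for spotting them, the Python sketch below compares every pair of values with the standard library's `difflib`. It is an illustration, not an MIK feature; the function name `similar_pairs` and the 0.9 threshold are arbitrary choices that you would tune for your own data.

```python
from difflib import SequenceMatcher
from itertools import combinations


def similar_pairs(values, threshold=0.9):
    """Return pairs of distinct values whose similarity ratio meets the threshold."""
    pairs = []
    for first, second in combinations(values, 2):
        ratio = SequenceMatcher(None, first, second).ratio()
        if ratio >= threshold:
            pairs.append((first, second))
    return pairs


# A subset of the unique values listed above.
values = [
    "BC Archives",
    "British Library",
    "Simon Faser University Library",
    "Simon Fraser University Library",
    "The British Library",
]

for first, second in similar_pairs(values):
    print(f"possible duplicate: {first!r} / {second!r}")
```

Pairs flagged this way are only candidates for review; variants that differ by a whole word, such as a leading "The", may need a lower threshold or manual inspection.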

In addition to using MIK's metadata manipulators to improve the consistency of your CONTENTdm metadata, you may also want to consider adding metadata that may prove useful in your migrated repository. Two examples of this are using the AddContentdmData metadata manipulator and configuring your metadata parser to use the include_migrated_from_uri = TRUE option.

Create your .ini file

You will always use the same fetcher and metadata parser, but the filegetter and writer will depend on the content type of the objects you are migrating:

CONTENTdm content type | Islandora Solution Pack | MIK filegetter | MIK writer
JP2000 | Large Image Solution Pack | CdmSingleFile | CdmSingleFile
Other image formats | Basic Image | CdmSingleFile | CdmSingleFile
Compound (Monograph) | Book Solution Pack | CdmBooks | CdmBooks
Newspapers | Newspaper Solution Pack | CdmNewspapers | CdmNewspapers
Compound (Document) | Compound Solution Pack | CdmCompound | CdmCompound
Compound PDF | PDF Solution Pack | CdmPhpDocuments | CdmPhpDocuments

The combination of an MIK fetcher, filegetter, metadata parser, and writer is known as a "toolchain". Toolchains used in migrations from CONTENTdm will always use the CONTENTdm fetcher and the CdmToMods metadata parser. As the table above indicates, the filegetter and writer will depend on what type of objects you are migrating. Detailed documentation on configuring each toolchain is available on the MIK wiki.

Islandora's ability to store all the files associated with an object (masters and web-friendly derivatives) is a major advantage over CONTENTdm. Most of MIK's CONTENTdm toolchains allow you to combine the files retrieved from CONTENTdm with master files stored external to CONTENTdm (on a separate file server, for example) and add them to your Islandora objects as OBJ datastreams. Content-type-specific filegetters can be configured to search for master files on filesystem paths, as documented on the wiki. Also, in some cases you can generate OBJ datastreams for book and newspaper pages if you have no master files. If you plan to combine master files with files retrieved from CONTENTdm, it is very important to audit your master files to make sure they are named consistently and are organized in ways that MIK can access them before you start your migration. Filename and directory conventions are documented in the "Preparing the content files" section of each of the toolchains.

This sample .ini file was used to generate Islandora ingest packages from all JPEG2000 objects in the source CONTENTdm collection. Detailed information on the options is available in the MIK wiki's CONTENTdm Single File toolchain documentation.

[CONFIG]
config_id = km jp2
last_updated_on = "2016-04-16"
last_update_by = "mj"

[FETCHER]
class = Cdm
; The alias of the CONTENTdm collection.
alias = km
ws_url = "http://content.lib.sfu.ca:81/dmwebservices/index.php?q="
temp_directory = "m:\production_loads\km_2_jp2_mods\temp"
; 'record_key' should always be 'pointer' for CONTENTdm fetchers.
record_key = pointer

[METADATA_PARSER]
class = mods\CdmToMods
alias = km
ws_url = "http://content.lib.sfu.ca:81/dmwebservices/index.php?q="
mapping_csv_path = 'extras/sfu/mappings_files/km_2_mappings.csv'
include_migrated_from_uri = TRUE

[FILE_GETTER]
class = CdmSingleFile
alias = km
; "input_directories[]" is a list of file paths to the objects' master files.
input_directories[] = "t:\filestore\km\tiffs"
; input_directories[] = 
ws_url = "http://content.lib.sfu.ca:81/dmwebservices/index.php?q="
utils_url = "http://content.lib.sfu.ca/utils/"
temp_directory = "m:\production_loads\km_2_jp2_mods\temp"

[WRITER]
class = CdmSingleFile
alias = km
ws_url = "http://content.lib.sfu.ca:81/dmwebservices/index.php?q="
output_directory = "m:\production_loads\km_jp2_mods"
; Leave blank for Cdm single file objects (the MIK writer assigns the filename).
metadata_filename =
datastreams[] = MODS
datastreams[] = OBJ

[MANIPULATORS]
; fetchermanipulators[] = "RandomSet|10"
; fetchermanipulators[] = "SpecificSet|kn_jp2_test.pointers"
fetchermanipulators[] = "CdmSingleFileByExtension|jp2"
; We need to use this fetcher manipulator to select only objects that are not children of compound objects.
fetchermanipulators[] = "CdmNoParent"
metadatamanipulators[] = "SplitRepeatedValues|Subject|/subject/topic|;"
metadatamanipulators[] = "AddContentdmData"

[LOGGING]
; Full path to log file for general mik log file.
path_to_log = "m:\production_loads\km_jp2_mods\mik.log"
; Full path to log file for manipulators.
path_to_manipulator_log = "m:\production_loads\km_jp2_mods\manipulator.log"
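
Because the collection alias is repeated in several sections of the .ini file, a copy-and-paste slip can leave one section pointing at a different collection. The Python sketch below shows one way to check for that before running a job. It is not part of MIK, the function name `aliases_by_section` is hypothetical, and it assumes Python's `configparser` reads the file closely enough for this purpose (MIK itself is a PHP application, and PHP's INI parsing differs in some details).

```python
import configparser


def aliases_by_section(ini_text):
    """Map each section that sets 'alias' to the value it uses."""
    # strict=False tolerates repeated keys such as 'datastreams[]';
    # interpolation=None leaves '%' characters in values untouched.
    parser = configparser.ConfigParser(strict=False, interpolation=None)
    parser.read_string(ini_text)
    return {
        section: parser[section]["alias"]
        for section in parser.sections()
        if "alias" in parser[section]
    }


# Abbreviated configuration with a deliberate mismatch in [WRITER].
sample_ini = r"""[FETCHER]
class = Cdm
alias = km

[METADATA_PARSER]
class = mods\CdmToMods
alias = km

[WRITER]
class = CdmSingleFile
alias = kn
datastreams[] = MODS
datastreams[] = OBJ
"""

aliases = aliases_by_section(sample_ini)
if len(set(aliases.values())) > 1:
    print("alias mismatch:", aliases)
```

The same approach could be extended to compare `ws_url` or `temp_directory` across sections.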

Fetchers can be configured further by using one or more fetcher manipulators. Several useful fetcher manipulators relevant to CONTENTdm migrations include:

Step 3: Testing your configuration

Once you have your metadata mappings file and your .ini file, you are ready to run MIK. Mappings files in particular can be difficult to get right, but there are several techniques for testing your configuration before running MIK to generate your entire set of objects for loading into Islandora. These techniques include:

It is also important to validate the MODS that MIK generates. You can do this using a variety of tools, and MIK itself provides a post-write hook script for this purpose. You can also use the Islandora Import Package QA Tool's -v option to validate your MODS.

Steps 4 and 5: Reconfigure and retest

As suggested in step 3, you will likely need to test several variations of your configuration file before getting everything right. Don't forget that you can temporarily comment out lines in your .ini file by using a semicolon (;) at the start of a line, as illustrated in the sample .ini file above.

Step 6: Performing the migration with MIK

MIK is fairly fault tolerant and can run for hours or even days without crashing. However, because it skips problematic objects and logs the problem rather than stopping, a job can finish with some objects missing from its output. You can use the SpecificSet fetcher manipulator to rerun failed objects, whose pointers you can get from an earlier job's problem_records.log using the included extras/scripts/specificsetfromproblemrecords.php script.

If running MIK over long periods of time is problematic, you can use MIK to generate smaller sets from large collections using the SpecificSet fetcher manipulator. To do this, get a list of all of the pointers from your source collection using the CONTENTdm Collection Inspector and break up the list into smaller subsets. Then use each of the subsets as input to the SpecificSet fetcher manipulator.
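
Breaking a pointer list into subsets is easy to script. The following Python sketch is one way to do it, under two assumptions that you should verify against the MIK documentation: that the SpecificSet input file holds one pointer per line, and that any file name is acceptable (the `pointers_part_001.pointers` naming here is hypothetical).

```python
import tempfile
from pathlib import Path


def chunk_pointers(pointers, chunk_size):
    """Split a list of pointers into consecutive subsets of at most chunk_size."""
    if chunk_size < 1:
        raise ValueError("chunk_size must be at least 1")
    return [pointers[i:i + chunk_size] for i in range(0, len(pointers), chunk_size)]


def write_pointer_files(pointers, chunk_size, directory, prefix="pointers_part"):
    """Write each subset to its own file, one pointer per line (assumed format)."""
    paths = []
    for number, subset in enumerate(chunk_pointers(pointers, chunk_size), start=1):
        path = Path(directory) / f"{prefix}_{number:03d}.pointers"
        path.write_text("\n".join(str(p) for p in subset) + "\n", encoding="utf-8")
        paths.append(path)
    return paths


# Demonstration in a temporary directory so nothing is left behind.
with tempfile.TemporaryDirectory() as workdir:
    written = write_pointer_files([513, 514, 515, 516, 517], 2, workdir)
    print([path.name for path in written])
```

Each resulting file can then be named in a separate job's SpecificSet fetcher manipulator entry, so that an interrupted run only needs the affected subset to be repeated.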

Step 7: Ingesting into Islandora

Once MIK has finished running, the resulting ingest packages are ready to load into Islandora, either via the web interface for small sets of data, or via Drush for larger sets.

SFU Library performed all of our batch ingests using the Drush interface to Islandora Batch, Book Batch, Newspaper Batch, and Compound Batch. MIK does not zip its output files, so if you want to ingest files using the web interface, you will need to do that step separately, although there is no reason someone couldn't write a shutdown script that zipped up MIK's output.

We also found that very large collections are best ingested in smaller batches. Ingesting content into Islandora can be slow, especially if it is generating OCR, and separating your large collections into smaller batches reduces the impact of an interrupted job.

Before using Islandora's batch modules to ingest your objects, you should check to make sure that the packages are complete. There are several ways to do this, but you may want to refer to this MIK Cookbook entry for some ideas.
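
As one illustration of such a check, the Python sketch below looks for single-file packages that are missing either their metadata or their content file. It rests on an assumption that is not stated in this guide: that each package in the output directory is a pair of files sharing a base name, one `.xml` metadata file and one content file (for example `513.xml` and `513.jp2`). Book, newspaper, and compound packages are laid out differently and would need a different check. The function name `find_incomplete_packages` is hypothetical.

```python
import tempfile
from pathlib import Path


def find_incomplete_packages(directory):
    """Return base names that have a metadata file but no content file, or vice versa."""
    metadata, payload = set(), set()
    for path in Path(directory).iterdir():
        if not path.is_file() or path.suffix.lower() == ".log":
            continue  # ignore subdirectories and MIK's log files
        if path.suffix.lower() == ".xml":
            metadata.add(path.stem)
        else:
            payload.add(path.stem)
    # Symmetric difference: names present in one set but not the other.
    return sorted(metadata ^ payload)


# Demonstration with empty placeholder files; 514 has no content file.
with tempfile.TemporaryDirectory() as workdir:
    for name in ("513.xml", "513.jp2", "514.xml", "mik.log"):
        (Path(workdir) / name).touch()
    print(find_incomplete_packages(workdir))
```

Any base names it reports can be compared against problem_records.log to decide which pointers to rerun.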

Summary

This migration guide contains a lot of detail, but the most important things to remember about using MIK to generate Islandora ingest packages from CONTENTdm objects are:

  • Time spent analysing your source collections is well worth it.
    • First, analyse your source collections' metadata to make sure you are confident with mapping it to MODS.
    • Then, analyse your objects to make sure you understand which Islandora content models to migrate them to.
    • The CONTENTdm Collection Inspector is your friend for both these tasks.
    • In general, the more time you spend understanding your source content, the smoother your migration will be.
  • Testing your MIK configuration thoroughly before running your production job on a collection is very important. Using the SpecificSet and RandomSet fetcher manipulators can help speed up testing, as can generating metadata only during testing.
  • Take the opportunity to use MIK's metadata manipulators to improve the quality and consistency of the metadata you will be importing into Islandora.
  • Use MIK's ability to combine content retrieved from CONTENTdm with master files stored outside of CONTENTdm.
  • Consider reusing the OCR, JPEG2000, and TN derivatives generated by CONTENTdm. This applies mainly to paged content (newspaper and book pages) but doing so can substantially decrease the amount of time it takes to ingest objects into Islandora.