Toolchain: CSV single file objects

Mark Jordan edited this page Jun 26, 2017 · 53 revisions

Overview

This toolchain creates Islandora import packages that each consist of a single content file (image, PDF, video file, audio file, etc.), where the metadata describing the set of objects is in a Comma Separated Value (CSV) file. The resulting import packages can then be ingested into Islandora using the standard Islandora Batch module.

The CSV file used as the input for this toolchain must meet the following requirements:

  • the first row of the CSV file contains column labels/headings
    • all column headings must be unique, and the heading row cannot contain any empty headings
  • the fields within each record are separated by a single type of field delimiter (defined in the [FETCHER] section's "field_delimiter" configuration setting, as described below)
  • each record in the CSV file corresponds to one Islandora object
  • one of the fields contains a unique identifier for each row in the file (the field is named in the [FETCHER] section's "record_key" configuration setting, as described below), and
  • one of the fields contains the name of the file that is to be used in each of the created objects (the field is named in the [FILE_GETTER] section's "file_name_field" configuration setting, as described below).

Fields that contain line breaks are allowed in the CSV file, as long as those fields are enclosed (wrapped) in double quotation marks or some other valid enclosure character. Also note that you can comment out problematic records in CSV input files.
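
The requirements above can be checked before running MIK. The following is a minimal sketch in Python, not part of MIK itself; the function name check_csv and its messages are hypothetical, and Python's csv module may differ in edge cases from the parser MIK uses.

```python
import csv
import io


def check_csv(text, record_key, file_name_field, delimiter=","):
    """Return a list of problems found in CSV text, per the requirements above."""
    problems = []
    # newline="" lets the csv module handle line breaks inside quoted fields.
    rows = list(csv.reader(io.StringIO(text, newline=""), delimiter=delimiter))
    if not rows:
        return ["file is empty"]
    headings = rows[0]
    if any(h.strip() == "" for h in headings):
        problems.append("heading row contains an empty heading")
    if len(set(headings)) != len(headings):
        problems.append("column headings are not unique")
    for required in (record_key, file_name_field):
        if required not in headings:
            problems.append(f"missing column: {required}")
    if record_key in headings:
        key_index = headings.index(record_key)
        seen = set()
        for row_no, row in enumerate(rows[1:], start=1):
            key = row[key_index] if key_index < len(row) else ""
            if key == "" or key in seen:
                problems.append(f"data row {row_no}: missing or duplicate {record_key}")
            seen.add(key)
    return problems


# Hypothetical sample: the second record reuses the identifier of the first.
sample = (
    'CartoonID,Title,File\n'
    '1,"A title with\na line break",2-1999-08-25.tif\n'
    '1,Another title,3-1988-01-13.tif\n'
)
print(check_csv(sample, "CartoonID", "File"))
```

The sample deliberately includes a quoted field containing a line break, which is read as part of one record, and a repeated identifier, which is reported as a problem.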

A sample CSV metadata file is available in the MIK Tutorial.

Preparing the content files

This toolchain requires that the TIFF, PDF, video, audio, etc. files that will be prepared for ingesting into Islandora are all in a single flat directory:

cartoons_tiffs
├── 2-1999-08-25.tif
├── 3-1988-01-13.tif
├── 3-2004-03-29.tif
├── 5-9115-00-00.tif
├── 5-9272-00-00.tif
└── 6-2000-01-31.tif

etc.

The full path to this directory must be specified in the [FILE_GETTER] section's input_directory value. The filenames within the directory don't matter, but each object's file must be identified by name, including the extension, in a field in the CSV file, in the [FILE_GETTER] section's "file_name_field", as described below.
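
Because each file must be named exactly in the CSV file, it can help to compare the two before running MIK. The sketch below is an illustration only and is not part of MIK; the function name missing_files is hypothetical, and it assumes a comma-delimited, UTF-8 encoded CSV file.

```python
import csv
from pathlib import Path


def missing_files(csv_path, input_directory, file_name_field):
    """Return file names listed in the CSV that are not present in input_directory."""
    directory = Path(input_directory)
    missing = []
    with open(csv_path, newline="", encoding="utf-8") as handle:
        for row in csv.DictReader(handle):
            name = (row.get(file_name_field) or "").strip()
            # Empty values are allowed by the toolchain, so only named files are checked.
            if name and not (directory / name).is_file():
                missing.append(name)
    return missing
```

Calling missing_files with the path to the CSV file, the input_directory value, and the file_name_field value returns the names that would cause MIK to skip rows; an empty list means every named file was found.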

Preparing the configuration file

All MIK configuration files are standard INI files which contain the following sections: [SYSTEM], [CONFIG], [FETCHER], [METADATA_PARSER], [FILE_GETTER], [WRITER], [MANIPULATORS], and [LOGGING]. Entries are required unless indicated otherwise below.

Commented lines begin with a semicolon. Values that contain whitespace or special characters (equals, semicolon, etc.) should be wrapped in double quotation marks. If in doubt, use the quotation marks. The order of the sections, and of the entries within each section, does not matter.

The SYSTEM section

This section of the configuration file sets or overrides configuration settings for PHP and the various third-party PHP components used by MIK. It can contain the following entries:

  • date_default_timezone: Optional. Provide a default timezone if date.timezone is null in the PHP INI. You will know if you need to use this setting because Monolog will throw MIK exceptions and halt MIK. Set to one of the valid PHP timezone values listed at http://php.net/manual/en/timezones.php.
  • verify_ca: Optional. OSX's default PHP configuration uses Apple's Secure Transport rather than OpenSSL, causing issues with Certificate Authority verification in Guzzle requests against websites that use HTTPS. This setting allows Guzzle to override CA verification. You will know if you need to use this setting because Guzzle will write entries in your mik.log complaining about CA verification. Set to false to ignore CA verification.

Example

[SYSTEM]
date_default_timezone = 'America/Vancouver'

The CONFIG section

Key-value pairs of configuration entries in this section are simply written to the top of the log file specified in the [LOGGING] section's path_to_log setting. You can add whatever values you want, but they are static (that is, they can't be dynamically derived at runtime). Therefore, all entries in this section are optional.

Example

[CONFIG]
config_id = cartoons_job_2
last_updated_on = "2015-12-20"
last_update_by = "Mark Jordan"

The FETCHER section

This section of the configuration file contains the following entries:

  • class: Required. Must be 'Csv'.
  • input_file: Required. Full path to the CSV file that contains the data describing the objects you are ingesting into Islandora
  • temp_directory: Required. Full path to the directory where the fetchers write data for use later in the toolchain.
  • field_delimiter: Optional. Default is a comma (,). The string or character used in the CSV file to delimit fields. To read a tab-delimited file, use an actual tab character enclosed in quotation marks, not \t.
  • field_enclosure: Optional. Default is double quotation mark ("). The string or character used in the CSV file to wrap values of fields that contain spaces.
  • escape_character: Optional. Default is backslash (\). The string or character used in the CSV file to escape field delimiters or field enclosure characters within field values.
  • use_cache: Optional. Set to false in automated tests (in other words, you will not need to use this unless you are writing automated tests for this fetcher).
  • record_key: Required. The column label identifying the field that contains each record's unique identifier within the CSV file.

Example

[FETCHER]
class = Csv
input_file = "/home/mark/Downloads/cartoons.csv"
temp_directory = "/tmp/cartoons_temp"
field_delimiter = ","
record_key = "CartoonID"
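
To see how a field delimiter, field enclosure, and escape character interact, the following sketch parses a small tab-delimited sample with Python's csv module. This is an analogy only: MIK is written in PHP and its CSV parser may behave differently in edge cases, and the sample data is invented for illustration.

```python
import csv
import io

# Tab-delimited data. The Title field is wrapped in double quotes because it
# contains the delimiter, and backslashes escape the inner quotation marks.
data = 'CartoonID\tTitle\n1\t"Tabs\tand \\"quotes\\""\n'

reader = csv.reader(
    io.StringIO(data, newline=""),
    delimiter="\t",      # plays the role of field_delimiter
    quotechar='"',       # plays the role of field_enclosure
    escapechar="\\",     # plays the role of escape_character
    doublequote=False,   # rely on the escape character, not doubled quotes
)
rows = list(reader)
print(rows[1])
```

The second record is read as two fields: the identifier, and a title that still contains the tab and the quotation marks.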

The METADATA_PARSER section

This section of the CSV toolchain's configuration file contains the following entries:

  • class: Required. Must be 'mods\CsvToMods' or 'templated\Templated'. Use the former if simple source field-to-MODS-element mappings are sufficient for your needs, the latter if your source metadata requires complex logic to be converted to MODS.
  • mapping_csv_path: Required. The path, either full or relative to the mik script, where the metadata mappings file is located.
  • repeatable_wrapper_elements: Optional. By default MIK reduces repeated top-level wrapper MODS elements (same element name with the same attributes) down to a single instance of the element. This setting lets you indicate which elements are allowed to occur more than once in your MODS. Common uses for this setting include allowing repeated <name>, <subject>, or <extension> elements.

Example

[METADATA_PARSER]
class = mods\CsvToMods
mapping_csv_path = "cartoons_mappings.csv"

The FILE_GETTER section

This section of the CSV toolchain's configuration file contains the following entries:

  • class: Required. Must be 'CsvSingleFile'.
  • input_directory: Required. The full path to the directory where the content files are located. The files should be named as described in the "Preparing the content files" section above. Giving this option an empty value (e.g., input_directory = ) and specifying MODS as the only datastream (e.g., [WRITER]datastreams[] = MODS) allows testing the generation of MODS without requiring access to the content files.
  • temp_directory: Required. Full path to the directory where the file getter will write data for use later in the toolchain. Can be the same as the temp_directory value used in the [FETCHER] section.
  • file_name_field: Required. The column label identifying the field that contains the name of the file that corresponds to each record.
  • validate_input: Optional. Set to false if you do not want MIK to validate the files and directories under input_directory. Defaults to true. See this Cookbook entry for more detail.
  • validate_input_type: Optional. Set to strict if you want MIK to validate the files and directories under input_directory before moving on to generate ingest packages. Defaults to realtime. See this Cookbook entry for more detail.

Example

[FILE_GETTER]
class = CsvSingleFile
input_directory = "/home/mark/Downloads/cartoons_tiffs"
temp_directory = "/tmp/cartoons_temp"
file_name_field = File

The CSV column identified in file_name_field can take two types of values:

  1. The full name, including extension, of the file corresponding to the item described in the CSV record. The filename must be spelled correctly, including case on case-sensitive operating systems such as Linux and OSX.
    • If MIK cannot find the file named in this field, it will skip the row in the CSV input file and log the identifier of the row in both mik.log and problem_records.log.
  2. An empty value.
    • If it is empty and [WRITER]require_source_file is absent or set to false, MIK will generate a MODS file from the CSV row and name the file using the row's identifier (configured in [FETCHER]record_key). Note that the value of [WRITER]preserve_content_filenames is ignored, since there is no content filename.
    • If it is empty and [WRITER]require_source_file is set to true, MIK will skip the row in the CSV input file and log the identifier of the row in both mik.log and problem_records.log.
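
The rules above amount to a small decision table. The sketch below restates them in Python for clarity; it is a paraphrase of this documentation, not MIK's actual code, and the function name and the three return labels are hypothetical.

```python
def row_action(file_name, file_exists, require_source_file=False):
    """Return what the rules above say happens to a CSV row.

    file_name: value of the file_name_field column for the row.
    file_exists: whether that file was found in input_directory.
    require_source_file: the [WRITER] require_source_file setting.
    """
    if file_name == "":
        # Empty value: the outcome depends on require_source_file.
        return "skip_and_log" if require_source_file else "mods_only"
    if not file_exists:
        # A named file that cannot be found: the row is skipped and logged.
        return "skip_and_log"
    # A named file that exists: the row is processed normally.
    return "full_package"
```

For example, a row with an empty file name and require_source_file left at its default yields "mods_only", while the same row with require_source_file set to true yields "skip_and_log".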

The WRITER section

This section of the CSV toolchain's configuration file contains the following entries:

  • class: Required. Must be 'CsvSingleFile'.
  • output_directory: Required. The full path to the directory where output packages are written.
  • preserve_content_filenames: Optional.
    • If given a value of true, and if the [FILE_GETTER] input_directory value and the [WRITER] output_directory are the same, MIK will write the metadata file for each object to the specified directory with a filename corresponding to the matching content file. In this case MIK generates metadata files only and does not copy any content files.
    • If omitted or given a value of false, MIK will name output files using the object's identifier.
  • postwritehooks: Optional. A multivalued list of post-write hook scripts. Values have two parts, the full path to the PHP, Python, or shell executable, and the full path to the script itself.
  • datastreams: Optional. A multivalued list of datastream files that you want MIK to create. If not included, MIK will create all the files that the various file getter, metadata parser, and writer classes used in the toolchain can create. If included, only the indicated datastream files will be generated. Most useful for testing metadata generation, for example datastreams[] = "MODS", which would tell MIK to generate only a MODS.xml file for each object.
  • require_source_file: Optional. Defaults to false. If set to true, MIK will skip rows in the CSV input file that have empty values in the field specified in [FILE_GETTER]file_name_field and log the identifiers of the rows in both mik.log and problem_records.log.

Example

[WRITER]
class = CsvSingleFile
preserve_content_filenames = true
output_directory = "/tmp/cartoons_output"
postwritehooks[] = "/usr/bin/php extras/scripts/postwritehooks/validate_mods.php"
; During testing, we're just interested in MODS
; datastreams[] = "MODS"

The MANIPULATORS section

This section of the CSV toolchain's configuration file defines which manipulators should be used. Multiple manipulators can be defined for each type (fetchermanipulators, filegettermanipulators, metadatamanipulators) as illustrated below. The value of each entry is the manipulator class name plus any pipe-separated parameters that the manipulator may require. Entries in this section are optional.

Example

[MANIPULATORS]
; fetchermanipulators[] = "RandomSet|50"
fetchermanipulators[] = "SpecificSet|cartoons_set.txt"
metadatamanipulators[] = "FilterModsTopic|subject"
metadatamanipulators[] = "AddUuidToMods"
metadatamanipulators[] = "AddCsvData"
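
The structure of a manipulator value (class name, then pipe-separated parameters) can be illustrated with a short Python sketch. This shows the format only and is not how MIK itself parses the entries; the function name parse_manipulator is hypothetical.

```python
def parse_manipulator(entry):
    """Split 'ClassName|param1|param2' into (class_name, [params])."""
    class_name, *params = entry.split("|")
    return class_name, params


for value in ("SpecificSet|cartoons_set.txt", "FilterModsTopic|subject", "AddUuidToMods"):
    print(parse_manipulator(value))
```

An entry with no pipe character, such as AddUuidToMods, is simply a class name with an empty parameter list.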

The LOGGING section

This section of the CSV toolchain's configuration file contains the following entries:

  • path_to_log: Required. The full path to the standard log generated by MIK.
  • path_to_manipulator_log: Required. The full path to the log that the manipulators write status and error messages to.

Example

[LOGGING]
path_to_log = "/tmp/cartoons_output/mik.log"
path_to_manipulator_log = "/tmp/cartoons_output/manipulator.log"