MarcXML parser for Exlibris Alma holdings output #255

Draft: wants to merge 6 commits into base: main
1 change: 1 addition & 0 deletions Gemfile
@@ -9,6 +9,7 @@ gem "canister"
gem "dotenv"
gem "ettin"
gem "faraday"
gem "marc"
gem "push_metrics", git: "https://github.com/hathitrust/push_metrics.git", tag: "v0.9.1"
gem "mongo"
gem "mongoid", "~> 8.1"
182 changes: 182 additions & 0 deletions bin/ex_libris_holdings_xml_parser.rb
@@ -0,0 +1,182 @@
require "marc"
require "date"

# Takes MARC-XML files from Ex Libris and parses them into .tsv files
# that can be loaded into the HathiTrust print holdings db.
# The first argument is the institution id; the rest are input files.
# Output goes to <instid>_mon_full_<YYYYMMDD>.tsv and
# <instid>_ser_full_<YYYYMMDD>.tsv in the current directory.
# Example:
# $ bundle exec ruby bin/ex_libris_holdings_xml_parser.rb <instid> xlib1.xml xlib2.xml

# We expect each record to have the control fields 001 and 008
# and the data fields 035 and ITM.

class ExLibrisHoldingsXmlParser
  def initialize
    @instid = ARGV.shift
    @files = ARGV
    @record_count = 0
    @errors = {}
  end

  # Takes all files and writes every output record to the mon or ser file.
  def run
    date = Date.today.strftime("%Y%m%d")
    mon = File.open("#{@instid}_mon_full_#{date}.tsv", "w")
    ser = File.open("#{@instid}_ser_full_#{date}.tsv", "w")

    mon.puts(%w[oclc local_id status condition enum_chron govdoc].join("\t"))
    ser.puts(%w[oclc local_id issn govdoc].join("\t"))

    main(@files) do |ht_record|
      case ht_record.item_type
      when "ser"
        ser.puts ht_record.to_ser_tsv
      else
        # includes "mix"
        mon.puts ht_record.to_mon_tsv
      end
    rescue ArgumentError => e
      # Count each distinct error message; the record itself is skipped.
      @errors[e.message.to_s] ||= 0
      @errors[e.message.to_s] += 1
    end
    # Print any errors caught above
    if @errors.any?
      warn "Errors caught:"
      @errors.each do |etype, count|
        warn "#{etype}: #{count}"
      end
    end
  ensure
    # Close the output files even if parsing fails partway through.
    mon&.close
    ser&.close
  end

  # Open each file, read its XML records and yield HTRecords
  def main(files)
    files.each do |file|
      MARC::XMLReader.new(file).each do |marc_record|
        @record_count += 1
        ht_record = HTRecord.new(marc_record)
        yield ht_record
      end
    end
  end
end

# An HTRecord has a MARC record and knows how to turn it into a tsv string.
class HTRecord
  def initialize(marc_record)
    @marc_record = marc_record # a MARC::Record
  end

  # Combined layout with every column. ExLibrisHoldingsXmlParser#run does not
  # use it; it writes the mon and ser layouts below.
  def to_tsv
    [item_type, oclc, local_id, status, condition, enum_chron, issn, govdoc]
      .join("\t").delete("\n")
  end

  # Same column order as the mon header written in #run.
  def to_mon_tsv
    [oclc, local_id, status, condition, enum_chron, govdoc]
      .join("\t").delete("\n")
  end

  # Same column order as the ser header written in #run.
  def to_ser_tsv
    [oclc, local_id, issn, govdoc]
      .join("\t").delete("\n")
  end

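  # Returns subfield x of the ITM field (nil if the subfield is absent).
  # Raises ArgumentError so that #run counts the record as an error
  # instead of crashing with NoMethodError on nil.
  def itm(x)
    field = @marc_record["ITM"]
    raise ArgumentError, "Missing ITM" if field.nil?
    field[x]
  end

  def leader
    @leader ||= @marc_record.leader
  end

  def oclc
    @oclc ||= begin
      field = @marc_record["035"]
      raise ArgumentError, "Missing oclc" if field.nil? || field["a"].nil?
      field["a"].strip
    end
  end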

  def local_id
    @local_id ||= (item_type == "ser" ? @marc_record["001"].value : itm("d")).strip
  end

  def condition
    @condition ||= itm("c")
  end

  def item_type
    @item_type ||= map_item_type(leader[7])
  end

  # ExLibris say: ITM|a: volume, ITM|b: issue, ITM|i: year, ITM|j: month
  def enum_chron
    @enum_chron ||= [
      itm("a"), # volume
      itm("b"), # issue
      itm("i"), # year
      itm("j") # month
    ].reject { |x| x.nil? || x.empty? }.map(&:strip).join(",")
  end

  def status
    @status ||= map_status(itm("k"))
  end

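  # Only serials carry an ISSN column; other item types return nil.
  def issn
    return unless item_type == "ser"

    field = @marc_record["022"]
    # An empty string keeps the tsv column blank when there is no 022.
    field ? field["a"] : ""
  end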

  # TODO: confirm the 0-based 008 offsets used below: 15-17 for place of
  # publication and 28 for government publication.
  def govdoc
    @govdoc ||= is_us_govdoc? ? "1" : "0"
  end

  def is_us_govdoc?
    field_008 = @marc_record["008"]
    raise ArgumentError, "Missing 008" if field_008.nil?

    str_val = field_008.value.downcase
    pubplace_008 = str_val[15, 3].to_s
    govpub_008 = str_val[28]

    is_us?(pubplace_008) && govpub_008 == "f"
  end

  # via post-zephir processing "clean_pub_place"
  def is_us?(pub_place)
    return true if pub_place[2] == "u"
    return true if pub_place[0, 2] == "pr"
    return true if pub_place[0, 2] == "us"
    false
  end

  private

  def map_item_type(item_type)
    {
      "s" => "ser",
      "m" => "mon"
    }[item_type] || "mix"
  end

  # Serials get no status (nil); everything else defaults to "CH".
  def map_status(status)
    unless item_type == "ser"
      {
        "MISSING" => "LM",
        "LOST_LOAN" => "LM"
      }[status] || "CH"
    end
  end
end

# Parse the files named on the command line and write the mon/ser .tsv files.
if $0 == __FILE__
  ExLibrisHoldingsXmlParser.new.run
end
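
The TODO on `govdoc` asks whether the 008 offsets are right. Below is a minimal standalone sketch of the same check, assuming Ruby's 0-based string indexing lines up with the 0-based 008 character positions (15-17 for place of publication and 28 for government publication in the books layout). `us_govdoc_008?` and `build_008` are hypothetical helpers for illustration only, not part of this PR; they work on a plain string so the offsets can be tried without the `marc` gem.

```ruby
# Hypothetical helper (not in this PR): the same rule as
# HTRecord#is_us_govdoc?, applied to a plain 008 string.
def us_govdoc_008?(field_008)
  value = field_008.to_s.downcase
  return false if value.length < 29 # too short to hold position 28

  place = value[15, 3] # characters at positions 15, 16 and 17
  govpub = value[28]   # single character at position 28
  us_place = place[2] == "u" || %w[pr us].include?(place[0, 2])
  us_place && govpub == "f"
end

# Hypothetical helper: builds a blank 40-character 008 with only the
# two fields of interest filled in.
def build_008(place:, govpub:)
  field = " " * 40
  field[15, 3] = place.ljust(3)[0, 3]
  field[28] = govpub
  field
end

puts us_govdoc_008?(build_008(place: "dcu", govpub: "f")) # US place, federal
puts us_govdoc_008?(build_008(place: "enk", govpub: "f")) # non-US place
puts us_govdoc_008?(build_008(place: "dcu", govpub: " ")) # not a govdoc
```

Running it prints `true`, `false`, `false`. Building the test string by position rather than typing a literal 008 avoids off-by-one mistakes when counting spaces by hand, which is the risk the TODO is worried about.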