Re-implemented PicaDecoder based on a state machine.
The old PicaDecoder used regular expressions to parse PICA+ records. This led to two problems:

* Errors in the data resulted in exceptions which did not refer to the portion of the data that caused the problem (e.g. a character index).
* Due to the use of String.substring() for extracting data from the record, the full record was kept in memory (see issue #51).

The new PicaDecoder was written to solve these problems. The first one was addressed by constructing the parser so that it only fails in two clearly defined situations (missing id field and unexpected end of record). The second one was solved by copying the parsed data portions into new strings.

In addition to the problems listed above, the following issues were addressed:

* #109 -- removed support for static usages of the encoder
* #112 -- removed support for appendControlSubField. If Metamorph is extended to pass data through (issue #107), this functionality can easily be implemented in a script. It is also not clear how widely it is used at all.

While support for control subfields has been removed, the new decoder introduces a range of new options:

* ignore missing id -- do not fail on missing ids but use an empty string as record id
* skip empty fields -- do not output fields without subfields or with empty subfields only (i.e. subfields without name and value)
* fix unexpected end of record -- if a record does not end with a field delimiter, one will be added automatically
* normalize UTF8 -- automatically performs UTF8 normalization of values

The unit tests have been rewritten to match the new options and to be more useful for debugging.
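The state-machine approach described above can be illustrated with a minimal sketch. This is not the PicaDecoder from this commit: the class name, the state names, the event strings and the `fixUnexpectedEndOfRecord` parameter are assumptions made for illustration, and record id handling is left out. It only shows how a character-by-character parser can fail in one well-defined place (with a character index) and how parsed portions end up in freshly built strings.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch; names and structure are not taken from the commit.
class PicaStateMachineSketch {

    static final char FIELD_DELIMITER = '\u001e';
    static final char SUBFIELD_DELIMITER = '\u001f';

    // Assumed parser states; the real decoder may use different ones.
    enum State { FIELD_NAME, SUBFIELD_NAME, SUBFIELD_VALUE }

    // Returns events such as "field:028A" and "a=Goethe" in input order.
    static List<String> parse(String record, boolean fixUnexpectedEndOfRecord) {
        List<String> events = new ArrayList<>();
        StringBuilder buffer = new StringBuilder();
        State state = State.FIELD_NAME;
        char subfieldName = 0;

        for (int i = 0; i < record.length(); i++) {
            char ch = record.charAt(i);
            boolean delimiter = ch == FIELD_DELIMITER || ch == SUBFIELD_DELIMITER;
            State next = ch == FIELD_DELIMITER ? State.FIELD_NAME : State.SUBFIELD_NAME;
            switch (state) {
            case FIELD_NAME:
                if (delimiter) {
                    // String concatenation builds a new string, so no
                    // reference to the full record is retained.
                    events.add("field:" + buffer);
                    buffer.setLength(0);
                    state = next;
                } else {
                    buffer.append(ch);
                }
                break;
            case SUBFIELD_NAME:
                if (delimiter) {
                    state = next; // subfield without name and value
                } else {
                    subfieldName = ch;
                    state = State.SUBFIELD_VALUE;
                }
                break;
            case SUBFIELD_VALUE:
                if (delimiter) {
                    events.add(subfieldName + "=" + buffer);
                    buffer.setLength(0);
                    state = next;
                } else {
                    buffer.append(ch);
                }
                break;
            }
        }

        // A complete record ends right after a field delimiter.
        boolean complete = state == State.FIELD_NAME && buffer.length() == 0;
        if (!complete) {
            if (!fixUnexpectedEndOfRecord) {
                throw new IllegalStateException(
                        "Unexpected end of record at index " + record.length());
            }
            // Behave as if the missing field delimiter had been present.
            if (state == State.FIELD_NAME) {
                events.add("field:" + buffer);
            } else if (state == State.SUBFIELD_VALUE) {
                events.add(subfieldName + "=" + buffer);
            }
        }
        return events;
    }

    public static void main(String[] args) {
        // Record without a trailing field delimiter, parsed in "fix" mode.
        System.out.println(parse("028A\u001faGoethe\u001fdJohann", true));
    }
}
```

With the flag set to `false`, the same truncated record raises an exception whose message carries the index at which the input ended, which is the kind of error reporting the regular-expression based parser could not provide.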
Showing 8 changed files with 701 additions and 182 deletions.
33 changes: 33 additions & 0 deletions
src/main/java/org/culturegraph/mf/stream/converter/bib/PicaConstants.java
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,33 @@ | ||
/* | ||
* Copyright 2013 Christoph Böhme | ||
* | ||
* Licensed under the Apache License, Version 2.0 (the "License"); | ||
* you may not use this file except in compliance with the License. | ||
* You may obtain a copy of the License at | ||
* | ||
* http://www.apache.org/licenses/LICENSE-2.0 | ||
* | ||
* Unless required by applicable law or agreed to in writing, software | ||
* distributed under the License is distributed on an "AS IS" BASIS, | ||
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. | ||
* See the License for the specific language governing permissions and | ||
* limitations under the License. | ||
*/ | ||
package org.culturegraph.mf.stream.converter.bib; | ||
|
||
/** | ||
* Useful constants for PICA+ | ||
* | ||
* @author Christoph Böhme | ||
* | ||
*/ | ||
final class PicaConstants { | ||
|
||
public static final char FIELD_DELIMITER = '\u001e'; | ||
public static final char SUBFIELD_DELIMITER = '\u001f'; | ||
|
||
private PicaConstants() { | ||
// No instances allowed | ||
} | ||
|
||
} |
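As a rough illustration of how the two delimiters structure a PICA+ record, the following sketch assembles a record from fields and splits it again. Since `PicaConstants` is package-private, the constants are repeated here; the class `PicaDelimiterSketch` and its `field` helper are hypothetical and not part of this commit, and the field names and values are sample data.

```java
// Hypothetical helper; only the two delimiter values come from PicaConstants.
class PicaDelimiterSketch {

    static final char FIELD_DELIMITER = '\u001e';
    static final char SUBFIELD_DELIMITER = '\u001f';

    // Builds one field: the field name, then each subfield (a one-character
    // name directly followed by its value) prefixed with the subfield
    // delimiter, and finally the closing field delimiter.
    static String field(String name, String... subfields) {
        StringBuilder sb = new StringBuilder(name);
        for (String subfield : subfields) {
            sb.append(SUBFIELD_DELIMITER).append(subfield);
        }
        return sb.append(FIELD_DELIMITER).toString();
    }

    public static void main(String[] args) {
        String record = field("003@", "0123") + field("028A", "aGoethe", "dJohann");
        // Splitting on the field delimiter recovers the individual fields.
        for (String f : record.split(String.valueOf(FIELD_DELIMITER))) {
            String[] parts = f.split(String.valueOf(SUBFIELD_DELIMITER));
            System.out.println(parts[0] + " has " + (parts.length - 1) + " subfield(s)");
        }
    }
}
```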