Skip to content

Latest commit

 

History

History
395 lines (257 loc) · 13.8 KB

io.i.md

File metadata and controls

395 lines (257 loc) · 13.8 KB
Input/Output

The CDK has functionality for extracting information from files in many different file formats. Unfortunately, hardly ever the full format specification is supported, but generally the chemical graph and 2D or 3D coordinates are extracted, not uncommonly complemented with formal or partial charge.

File Format Detection

Typically, a human is fairly aware about the format of a file, if he looks at a file. Very often, the file extension (which is hidden on many Microsoft Windows versions by default) gives a clear clue. Files with the .mol and .sdf extension are very likely to have one of the MDL formats. If the file extension is ambiguous, a trained cheminformatician can often help you out quickly, applying tacit knowledge about those formats.

Computer programs, however, cannot rely on file formats, and have to formalize rules for inspecting the file content to determine what file format it is. The CDK has such functionality available for recognizing chemical file formats. But, to ensure no false detections are made, it requires a fairly accurate method for detecting the chemical format of a file. Appendix fileformats provides a list of all chemical file formats the CDK knows about.

Programmatically, the format of a file can be detected using the FormatFactory:

GuessFormat

For example, this script recognizes that a file has the Chemical Markup Language [Q27783585,Q27211680] format:

GuessFormat

To learn if the CDK has a IChemObjectReader or IChemObjectWriter for a format, one can use the methods getReaderClassName() and getWriterClassName() respectively:

HasReaderOrWriter

It reports:

HasReaderOrWriter

Custom format matchers

The SMILES format is one of the few formats which does not have a matcher. This is because there is no generally accepted file format based on this line notation.

However, we can define a custom matcher ourselves and use that. First, the matcher will look something like:

SMILESFormatMatcher

If we then register this new matcher with the FormatFactory:

GuessSMILES

And with this, we can detect a file with SMILES strings and names:

GuessSMILES

Keep in mind that the more specific your custom matcher is, the lower the change of other formats accidentally recognized by your custom matcher.

REINSERT TABLE

Reading from Readers and InputStreams

Many input readers in the CDK allow reading from a Java Reader class, but all are required to also read from an InputStream. The difference between these two Java classes is that the Reader is based on a character stream, while an InputStream is based on an byte stream. For some readers this difference is crucial: processing an XML based format, such as CML and XML formats used by PubChem should be read from an InputStream, not a Reader.

For other formats, it does not matter. This allows, for example, to read a file easily from a string with a StringReader (mind the newlines indicated by \n):

InputFromStringReader

But besides reading XML files correctly, the support for InputStream also allows reading files directly from the internet and from gzipped files (see Section gzip).

Example: Downloading Domoic Acid from PubChem

As an example, below will follow a small script that takes a PubChem compound identifier (CID) and downloads the corresponding ASN.1 XML file, parses it and counts the number of atoms:

PubChemDownload

It reports:

PubChemDownload

PubChem ASN.1 files come with an extensive list of molecular properties. These are stored as properties on the molecule object and can be retrieved using the getProperties() method, or, using the Groovy bean formalism:

PubChemDownloadProperties

which lists the properties for the earlier downloaded domoic acid:

PubChemDownloadProperties

Input Validation

The history of the CDK project has seen many bug reports about problems which in fact turned out to be problems with in the input file. While the general perception seems to be that because files could be written, the content must be consistent.

However, this is a strong misconception. There are several problems found in chemical files in the wild. A first common problem is that the file is not conform the syntax of the specification. An example here can be that at places where a number is expected, something else is given; not uncommonly, this is caused by incorrect use of whitespace.

A second problem is that the file looks perfectly reasonable, but that the software that wrote the file used conventions and extensions that are not supported by the reading software. A common example is the use of the D and T symbols, for deuterium and tritium in MDL molfiles, where the specification does not allow that.

A third problem is that most chemical file formats do not disallow incorrect chemical graphs. For example, formats often allow to bind an atom to itself, which will cause problems when analyzing this graph. These problems are much more rare, though.

Reading modes

The IChemObjectReader has a feature that allows setting a validating mode, which has two values:

ReadingModes

returning:

ReadingModes

The STRICT mode follows the exact format specification. There RELAXED mode allows for a few common extensions, such as the support for the T and D element types. For example, let's consider this file:

code/data/t.mol

If we read this file with:

ReadStrict

we get this exception:

ReadStrict

However, if we read the file in RELAXED mode with this code:

ReadRelaxed

the files will be read as desired:

ReadRelaxed

Validation

When a file is being read in RELAXED mode, it is possible to get error messages. This functionality is provided by the IChemObjectReaderErrorHandler support in IChemObjectReader. For example, we can define this custom error handler:

CustomErrorHandler

and use that when reading a file:

ReadErrorHandler

we get these warnings via the handler interface:

ReadErrorHandler

Because of an issue in version of the CDK, the above does not show any warnings. This has been fixed in CDK 2.3, see commit 547b028e17656f54a080a885a166377320b3a8ad.

Gzipped files

Some remote databases gzip their data files to reduce download sized. The Protein Brookhaven Database (PDB) is such a database. Fortunately, Java has a simple API to work with gzipped files, using the GZIPInputStream:

PDBCoordinateExtraction

Iterating Readers

By default, the CDK readers read structures into memory. This is fine when it is a relatively small model. It no longer works for large files, such as 1GB MDL SD files [Q27783587]. To allow processing of such large files, the CDK can take advantage from the fact that these SD files are basically a concatenation of MDL molfiles. Therefore, one can use an iterating reader to process each individual molecule one by one.

MDL SD files

MDL SD files can be processed using the IteratingSDFReader, for example, to generate a SMILES for each structure:

IteratingSDFReaderDemo

Which outputs the molecular formula for the three entries in the file:

IteratingSDFReaderDemo

PubChem Compounds XML files

Similarly, PubChem Compounds XML files can be processed taking advantage of a XML pull library, which is nicely hidden behind the same iterator interface as used for parsing MDL SD files. Iterating over a set of compounds is fairly straightforward with the IteratingPCCompoundXMLReader class:

PubChemCompoundsXMLDemo

Which outputs the molecular formula for the three entries in the aceticAcids38.xml file:

PubChemCompoundsXMLDemo

Customizing the Output

An interesting feature of file IO in the CDK is that it is customizable. Before I will give all the details, let's start with a simple example: creating a Gaussian input file for optimizing the structure of methane, and let's start with an XYZ file, that is, with methane.xyz:

<in>code/data/methane.xyz</in>

The output will look something like:

<in>code/methane.gin</in>

The writer used the default IO options in the above example. So, the next step is to see which options the writer allows. To get a list of options for a certain IO class in one does something along the lines:

ListIOOptions

which results in the following output:

ListIOOptions

Setting Properties

The IO settings system allows interactive setting of these options, but a perfectly fine alternative is to use a Java Properties object.

Consider the following source code:

PropertiesSettings

The PropertiesListener takes a Properties class as parameter in its constructor. Therefore, the properties are defined by the customSettings variable in the first few lines. The PropertiesListener listener is the instantiated with the customizations as constructor parameter.

The output writer, specified to write to the methane.gin file, is created after which the ChemObjectIOListener is set. Only by setting this listener, the output will be customized with the earlier defined properties. The rest of the code reads a molecule from an XYZ file and writes the content to the created Gaussian Input file.

Example: creating unit tests for atom type perception

We saw earlier an example for reading files directly from PubChem (see Section domoicacid). This can be conveniently used to create CDK source code, for example, for use in unit tests for the atom type perception code (see Section atomtypePerception). But because we do not want 2D and 3D coordinates being set in the source code, we disable those options:

AtomTypeUnitTest

This results in this source code:

AtomTypeUnitTest

Line Notations

Another common input mechanism in cheminformatics is the line notation. Several line notations have been proposed, including the Wiswesser Line Notation (WLN) [Q29042322] and the Sybyl Line Notation (SLN) [Q30853915], but the most popular is SMILES [Q28090714]. There is a Open Standard around this format called OpenSMILES, available at http://www.opensmiles.org/.

SMILES

The CDK can both read and write SMILES, or at least a significant subset of the line notation. You can parse a SMILES into a IAtomContainer with the SmilesParser. The constructor of the parser takes an IChemObjectBuilder (see Section builders) because it needs to know what CDK interface implementation it must use to create classes. This example uses the DefaultChemObjectBuilder:

ReadSMILES

Telling us the number of (non-hydrogen) atoms in aspirin:

ReadSMILES

Writing of SMILES goes in a similar way. But I do like to point out that by default the SMILESGenerator does not use the convention to use lower case element symbols for aromatic atoms.

WriteSMILES

showing the different output without and with that option set:

WriteSMILES

The generic format does not output stereo information. For isomeric SMILES we need to use a different approach:

WriteIsomericSMILES

showing the difference in output between .generic() and .isomeric:

WriteIsomericSMILES

Of course, this does require that aromaticity has been perceived, as explained in Section aromaticity.

Recipes

This section will list for a few formats a recipe for how to read content from those formats, taking into account common issues with the input.

MDL molfile (V2000)

Like any file format, they support a limited number of features. For example, MDL files cannot represent a bond order 4, a quadruple bond. Other missing explicit details include hydrogens, and atom-based stereochemistry. Stereochemistry is wedge-bond-based, see Section ch:stereo:bond.

An example file which uses the bond order 4, is this file:

code/data/azulene4.mol

More recent MDL formats have become more powerful. The V3000 format can do much more than the V2000 format, or even the pre-V2000 format.

Here's a recipe with inline comments:

InputMDLMolfiles

This code will perceive CDK atom types. These types are needed to add the missing hydrogens, as well as to resolve the bond order information. The input has ten atoms and eleven bonds, all marked with bond order 4.

The result of the above post-processing is:

InputMDLMolfiles

SDF files with properties

SDF files are quite similar to MDL files but can have an arbitrary number of chemical structures and have properties (see also Section ctr4). An example file from ChEMBL [Q58608307]:

code/data/CHEMBL3183843.sdf

We can read this file and extract the property with the following approach:

SDFWithProperties

This extracts the chembl_id property:

SDFWithProperties

References