The CDK has functionality for extracting information from files in many different file formats. Unfortunately, hardly ever the full format specification is supported, but generally the chemical graph and 2D or 3D coordinates are extracted, not uncommonly complemented with formal or partial charge.
Typically, a human is fairly aware about the format of a file, if he looks at a file. Very often, the file extension (which is hidden on many Microsoft Windows versions by default) gives a clear clue. Files with the .mol and .sdf extension are very likely to have one of the MDL formats. If the file extension is ambiguous, a trained cheminformatician can often help you out quickly, applying tacit knowledge about those formats.
Computer programs, however, cannot rely on file formats, and have to formalize rules for inspecting the file content to determine what file format it is. The CDK has such functionality available for recognizing chemical file formats. But, to ensure no false detections are made, it requires a fairly accurate method for detecting the chemical format of a file. Appendix fileformats provides a list of all chemical file formats the CDK knows about.
Programmatically, the format of a file can be detected using the
FormatFactory
:
GuessFormat
For example, this script recognizes that a file has the Chemical Markup Language [Q27783585,Q27211680] format:
GuessFormat
To learn if the CDK has a IChemObjectReader
or
IChemObjectWriter
for a format, one can use the methods getReaderClassName()
and getWriterClassName()
respectively:
HasReaderOrWriter
It reports:
HasReaderOrWriter
The SMILES format is one of the few formats which does not have a matcher. This is because there is no generally accepted file format based on this line notation.
However, we can define a custom matcher ourselves and use that. First, the matcher will look something like:
SMILESFormatMatcher
If we then register this new matcher with the FormatFactory
:
GuessSMILES
And with this, we can detect a file with SMILES strings and names:
GuessSMILES
Keep in mind that the more specific your custom matcher is, the lower the change of other formats accidentally recognized by your custom matcher.
REINSERT TABLE
Many input readers in the CDK allow reading from a Java Reader
class,
but all are required to also read from an InputStream
. The difference
between these two Java classes is that the Reader is based on a character
stream, while an InputStream is based on an byte stream. For some readers this
difference is crucial: processing an XML based format, such as CML and XML
formats used by PubChem should be read from an InputStream, not a Reader.
For other formats, it does not matter. This allows, for example, to read
a file easily from a string with a StringReader
(mind the newlines indicated by \n
):
InputFromStringReader
But besides reading XML files correctly, the support for InputStream also allows reading files directly from the internet and from gzipped files (see Section gzip).
As an example, below will follow a small script that takes a PubChem compound identifier (CID) and downloads the corresponding ASN.1 XML file, parses it and counts the number of atoms:
PubChemDownload
It reports:
PubChemDownload
PubChem ASN.1 files come with an extensive list of molecular properties. These
are stored as properties on the molecule object and can be retrieved using the
getProperties()
method, or, using the Groovy bean formalism:
PubChemDownloadProperties
which lists the properties for the earlier downloaded domoic acid:
PubChemDownloadProperties
The history of the CDK project has seen many bug reports about problems which in fact turned out to be problems with in the input file. While the general perception seems to be that because files could be written, the content must be consistent.
However, this is a strong misconception. There are several problems found in chemical files in the wild. A first common problem is that the file is not conform the syntax of the specification. An example here can be that at places where a number is expected, something else is given; not uncommonly, this is caused by incorrect use of whitespace.
A second problem is that the file looks perfectly reasonable, but that the software that wrote the file used conventions and extensions that are not supported by the reading software. A common example is the use of the D and T symbols, for deuterium and tritium in MDL molfiles, where the specification does not allow that.
A third problem is that most chemical file formats do not disallow incorrect chemical graphs. For example, formats often allow to bind an atom to itself, which will cause problems when analyzing this graph. These problems are much more rare, though.
The IChemObjectReader has a feature that allows setting a validating mode, which has two values:
ReadingModes
returning:
ReadingModes
The STRICT
mode follows the exact format specification. There
RELAXED
mode allows for a few common extensions, such as
the support for the T and D element types. For example, let's consider
this file:
code/data/t.mol
If we read this file with:
ReadStrict
we get this exception:
ReadStrict
However, if we read the file in RELAXED
mode with this code:
ReadRelaxed
the files will be read as desired:
ReadRelaxed
When a file is being read in RELAXED
mode, it is possible to get
error messages. This functionality is provided by the
IChemObjectReaderErrorHandler support in
IChemObjectReader.
For example, we can define this custom error handler:
CustomErrorHandler
and use that when reading a file:
ReadErrorHandler
we get these warnings via the handler interface:
ReadErrorHandler
Because of an issue in version of the CDK, the above does not show any warnings. This has been fixed in CDK 2.3, see commit 547b028e17656f54a080a885a166377320b3a8ad.
Some remote databases gzip their data files to reduce download sized. The Protein Brookhaven Database (PDB) is such a database. Fortunately, Java has a simple API to work with gzipped files, using the GZIPInputStream:
PDBCoordinateExtraction
By default, the CDK readers read structures into memory. This is fine when it is a relatively small model. It no longer works for large files, such as 1GB MDL SD files [Q27783587]. To allow processing of such large files, the CDK can take advantage from the fact that these SD files are basically a concatenation of MDL molfiles. Therefore, one can use an iterating reader to process each individual molecule one by one.
MDL SD files can be processed using the IteratingSDFReader, for example, to generate a SMILES for each structure:
IteratingSDFReaderDemo
Which outputs the molecular formula for the three entries in the file:
IteratingSDFReaderDemo
Similarly, PubChem Compounds XML files can be processed taking advantage of a XML pull library, which is nicely hidden behind the same iterator interface as used for parsing MDL SD files. Iterating over a set of compounds is fairly straightforward with the IteratingPCCompoundXMLReader class:
PubChemCompoundsXMLDemo
Which outputs the molecular formula for the three entries in the
aceticAcids38.xml
file:
PubChemCompoundsXMLDemo
An interesting feature of file IO in the CDK is that it is customizable. Before
I will give all the details, let's start with a simple example: creating a
Gaussian input file for optimizing the structure of methane,
and let's start with an XYZ file, that is, with methane.xyz
:
<in>code/data/methane.xyz</in>
The output will look something like:
<in>code/methane.gin</in>
The writer used the default IO options in the above example. So, the next step is to see which options the writer allows. To get a list of options for a certain IO class in one does something along the lines:
ListIOOptions
which results in the following output:
ListIOOptions
The IO settings system allows interactive setting of these options, but a perfectly fine alternative is to use a Java Properties object.
Consider the following source code:
PropertiesSettings
The PropertiesListener
takes a Properties
class as parameter in
its constructor. Therefore, the properties are defined by the
customSettings
variable in the first few lines. The
PropertiesListener
listener
is the instantiated with the
customizations as constructor parameter.
The output writer, specified to write to the methane.gin
file, is
created after which the ChemObjectIOListener
is set. Only by setting
this listener, the output will be customized with the earlier defined
properties. The rest of the code reads a molecule from an XYZ file and writes
the content
to the created Gaussian Input file.
We saw earlier an example for reading files directly from PubChem
(see Section domoicacid).
This can be conveniently used to create CDK source code
, for example,
for use in unit tests for the atom type perception code (see
Section atomtypePerception). But because we do not want
2D and 3D coordinates being set in the source code, we disable those
options:
AtomTypeUnitTest
This results in this source code:
AtomTypeUnitTest
Another common input mechanism in cheminformatics is the line notation. Several line notations have been proposed, including the Wiswesser Line Notation (WLN) [Q29042322] and the Sybyl Line Notation (SLN) [Q30853915], but the most popular is SMILES [Q28090714]. There is a Open Standard around this format called OpenSMILES, available at http://www.opensmiles.org/.
The CDK can both read and write SMILES, or at least a significant subset of the line notation. You can parse a SMILES into a IAtomContainer with the SmilesParser. The constructor of the parser takes an IChemObjectBuilder (see Section builders) because it needs to know what CDK interface implementation it must use to create classes. This example uses the DefaultChemObjectBuilder:
ReadSMILES
Telling us the number of (non-hydrogen) atoms in aspirin:
ReadSMILES
Writing of SMILES goes in a similar way. But I do like to point out that by default the SMILESGenerator does not use the convention to use lower case element symbols for aromatic atoms.
WriteSMILES
showing the different output without and with that option set:
WriteSMILES
The generic
format does not output stereo information. For isomeric SMILES we need to use
a different approach:
WriteIsomericSMILES
showing the difference in output between .generic()
and .isomeric
:
WriteIsomericSMILES
Of course, this does require that aromaticity has been perceived, as explained in Section aromaticity.
This section will list for a few formats a recipe for how to read content from those formats, taking into account common issues with the input.
Like any file format, they support a limited number of features. For example, MDL files cannot represent a bond order 4, a quadruple bond. Other missing explicit details include hydrogens, and atom-based stereochemistry. Stereochemistry is wedge-bond-based, see Section ch:stereo:bond.
An example file which uses the bond order 4, is this file:
code/data/azulene4.mol
More recent MDL formats have become more powerful. The V3000 format can do much more than the V2000 format, or even the pre-V2000 format.
Here's a recipe with inline comments:
InputMDLMolfiles
This code will perceive CDK atom types. These types are needed to add the missing hydrogens, as well as to resolve the bond order information. The input has ten atoms and eleven bonds, all marked with bond order 4.
The result of the above post-processing is:
InputMDLMolfiles
SDF files are quite similar to MDL files but can have an arbitrary number of chemical structures and have properties (see also Section ctr4). An example file from ChEMBL [Q58608307]:
code/data/CHEMBL3183843.sdf
We can read this file and extract the property with the following approach:
SDFWithProperties
This extracts the chembl_id
property:
SDFWithProperties