The following compilation attempts to answer some of the more common questions asked, by both novices and experts, about XML in general and CML in particular. We do not claim that it answers ALL questions (or even most), and we very much depend on you in the community both to pose the questions YOU want answered and to provide some of the answers! We also welcome contributions from others; for example the CML Reference tutorial at the excellent Zvon site.
To contact us, email Peter Murray-Rust or Henry Rzepa. Alternatively, post a question/comment here.
CML (Chemical Markup Language) is a new approach to managing molecular information using recently developed Internet tools such as SGML/XML and Java. It has a large scope as it covers disciplines from macromolecular sequences to inorganic molecules and quantum chemistry. There is also a lot of detail as molecular documents can contain many thousand discrete objects, all of which are manageable in CML. Because of this there is no single place to 'start' learning about CML and this FAQ is offered for those who like a general approach.
CML (Chemical Markup Language) is a new approach to managing molecular information using recently developed Internet tools such as XML and Java. It is based strictly on SGML, the most robust and widely used system for precise information management in many areas. It has been developed over several years and has been tested in many areas and on a variety of machines.
CML is not 'just another file format'; it is capable of holding extremely complex information structures and so of acting as an interchange mechanism or archival format. It interfaces easily with modern database architectures such as relational databases or object-oriented databases. Most importantly, a large amount of generic XML software to process and transform it is already available from the community.
CML has already been used to manage documents and information in:
and others.
CML provides lossless transmission of information: aspects of the encoded data can be selected and processed without loss of the remainder. It is machine-independent and so guarantees portability of data between different operating systems and machine architectures. CML supports and interfaces with developments such as Java, C++ and CORBA.
XML is based on this strategy. The markup (the elements and attributes) describes what the information is, not how it should appear. This preserves complete flexibility and allows the content to be presented in many ways using stylesheets.
A simple chemical example is whether an aromatic ring is presented with 3 double bonds or a circle, whether a charge is "++", "+2" or "2+", etc.
Many people see XML as the way forward for all information and CML provides support for chemistry. It's the only approach that is Open, has a wide range of tools and is designed to interoperate with many other XML-based systems.
There are many things that can only be done with XML/CML. XML files are extensible in that new concepts can be added - CML provides this through its unique dictionary-based system. Namespaces allow many different disciplines to be combined, such as graphics, maths, text, biology, etc.
XML is ideally suited for data capture, archiving and publishing. It is being heavily adopted by regulators, government agencies, and publishers.
The extensibility means that complex object structures are easy to manage. Thus CML can manage a hierarchy of molecules containing other molecules, and this can support macromolecules, salts, complexes, etc. Complete scientific articles can be written in CML and searched and published using new XML tools.
Most current molecular formats are limited in the information they contain. Few can hold both 2- and 3-D coordinates simultaneously. CML represents a superset of current approaches and information can be transferred between applications without loss.
CML has support for atom- and bond-based stereochemistry that is not normally explicitly available.
CML is able to contain a complete audit of all operations on a molecule or chemical system. It is ideally suited to managing logfiles which can be manipulated and reused or reanalysed later.
There are huge amounts of valuable material. The places to go to include:
From these you will be able to find links to anything you want.
It's best to break this into:
A DTD (or DocumentTypeDefinition) is the formal specification of an XML document. It prescribes the syntax and vocabulary completely, e.g. <molecule> can contain <atom>s.
Most CML users will not need to read the CML DTD (http://www.xml-cml.org/dtd/cml1_0_1.dtd). For advanced programmers and others extending the features of CML it may help in the construction of software.
XML Schemas are an extension of DTD functionality, and support more dataTypes and structures. We have published the CML DTD in Schema format (http://www.xml-cml.org/dtd/cml1_0_1.xsd) and may use it for dataType validation but we shall use more powerful methods for full CML validation (e.g. XSLT).
You are not allowed to edit the CML DTD, Schema or the accompanying files.
Most HTML and XML documents are currently encoded in so-called 7-bit ASCII TEXT characters (as opposed to 8-bit binary encoding). A number of these ASCII characters are known as Whitespace (spaces, newlines, tabs) and are a valuable tool for human presentation but these are not usually considered as part of the fundamental content. They are used for delimiters (e.g. to separate words and numbers) and to provide formatting.
By default HTML normalizes all whitespace into single spaces and then makes its own decisions on line wrapping. This can be switched off with the <pre> </pre> tags which require all contained whitespace to be preserved and rendered. XML and XSLT allow several strategies for whitespace, including the deletion of non-significant whitespace (such as between tags), e.g:
<molecule>
<atomArray>
<atom>...</atom>
</atomArray>
</molecule>
This document contains much "ignorable whitespace" (8 spaces and 4
newlines) solely for human readability. CML regards leading and trailing whitespace within the CML ontology as optional, and all else as significant. Thus:
<string builtin="elementType"> O </string> is normalized to
<string builtin="elementType">O</string> while
<string title="periodic table" xml:space="preserve"> H He Li Be B C N O F Ne </string> is preserved (although we would strongly suggest better markup!)
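The normalization rule above can be sketched in a few lines of Python. This is only an illustration using the standard library's xml.etree.ElementTree; the helper name cml_text is hypothetical and is not part of any CML software.

```python
import xml.etree.ElementTree as ET

# The xml: prefix is bound to this namespace by the XML specification.
XML_SPACE = "{http://www.w3.org/XML/1998/namespace}space"


def cml_text(element):
    """Return element text, trimmed unless xml:space="preserve" is set.

    Hypothetical helper: it only mimics the leading/trailing rule described
    in the text and ignores inherited xml:space values from ancestors.
    """
    text = element.text or ""
    if element.get(XML_SPACE) == "preserve":
        return text
    return text.strip()


trimmed = ET.fromstring('<string builtin="elementType"> O </string>')
kept = ET.fromstring(
    '<string title="periodic table" xml:space="preserve"> H He Li </string>'
)

print(repr(cml_text(trimmed)))
print(repr(cml_text(kept)))
```

A real implementation would also have to honour xml:space declared on an enclosing element, which this sketch deliberately leaves out.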
CML uses whitespace delimiters for arrays as in:
<floatArray title="energy">
1.2 3.4 5.6
7.8 9.0
11.234
</floatArray>
which contains 6 floating-point numbers and no significant whitespace. This means
you can format large chunks of data so it can pass through mailers, etc. If
you have to include significant spaces (as in PDB atom identifiers) use
delimiters or entities:
<stringArray delimiter="/"> CA/CA</stringArray> <stringArray delimiter="/">CA/CA</stringArray> represents C-alpha and a Calcium.
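As a rough illustration of how a reader might unpack these arrays, the sketch below splits a floatArray on whitespace and a stringArray on its declared delimiter. The function names are hypothetical, and the code assumes Python's standard xml.etree.ElementTree rather than any CML library.

```python
import xml.etree.ElementTree as ET


def read_float_array(element):
    """Split on any run of whitespace and convert each token to float."""
    return [float(token) for token in (element.text or "").split()]


def read_string_array(element):
    """Split on the declared delimiter so that spaces inside items survive."""
    delimiter = element.get("delimiter")
    text = element.text or ""
    if delimiter is None:
        return text.split()
    return text.split(delimiter)


energies = ET.fromstring(
    '<floatArray title="energy">\n1.2 3.4 5.6\n7.8 9.0\n11.234\n</floatArray>'
)
names = ET.fromstring('<stringArray delimiter="/"> CA/CA</stringArray>')

print(read_float_array(energies))
print(read_string_array(names))
```

With an explicit delimiter the leading space in the first item is kept, which is what distinguishes the two PDB-style identifiers.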
Many names used in CML ("elementType", "list", "sequence") are certain to be used in other markup languages. Since there is (rightly) no central control over vocabulary it is important to avoid collisions. Each markup language should have a namespace, uniquely determined by a namespaceURI. For CML V1.0 this is http://www.xml-cml.org/dtd/cml1_0_1.dtd. A namespaceURI is NOT a location on the WWW, NOR do you have to be connected to the Internet to use namespaces. It is simply a uniquifying string, which normally maps onto a version of a DTD or Schema. Often the namespaceURI is an indicator of what software should be used with a document.
To avoid verbosity, the uniqueness is provided by namespace prefixes, which are prepended to element and attribute names. Example:
<cml:molecule xmlns:cml="http://www.xml-cml.org/dtd/cml1_0_1.dtd">
<cml:atomArray>
<cml:atom><cml:string builtin="elementType">O</cml:string></cml:atom>
<p>This might be some HTML</p>
</cml:atomArray>
</cml:molecule>
Here everything prefixed with cml: belongs to the CML namespace;
everything else does not. It is possible to have many namespaces in a
document, one of which can be a default namespace (in this case HTML).
Common namespaces which will be found with CML are SVG, RDF, XSL, XSD,
MathML, XHTML, and some of the biomolecular or materials sciences ones.
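To make the prefix mechanism concrete, here is a minimal sketch that separates CML elements from everything else by namespaceURI. It assumes Python's standard xml.etree.ElementTree, which reports each element name as "{namespaceURI}localName"; the sample document is a well-formed version of the example above.

```python
import xml.etree.ElementTree as ET

CML_NS = "http://www.xml-cml.org/dtd/cml1_0_1.dtd"

DOCUMENT = """
<cml:molecule xmlns:cml="http://www.xml-cml.org/dtd/cml1_0_1.dtd">
  <cml:atomArray>
    <cml:atom><cml:string builtin="elementType">O</cml:string></cml:atom>
    <p>This might be some HTML</p>
  </cml:atomArray>
</cml:molecule>
"""

root = ET.fromstring(DOCUMENT)
cml_names = []
other_names = []
for element in root.iter():
    # ElementTree expands prefixes, so the test is on the URI, not on "cml:".
    if element.tag.startswith("{" + CML_NS + "}"):
        cml_names.append(element.tag.split("}", 1)[1])
    else:
        other_names.append(element.tag)

print(cml_names)
print(other_names)
```

Note that the comparison never looks at the prefix itself: a document using a different prefix bound to the same namespaceURI would be sorted identically.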
Every element in CML, and in XML in general, can be addressed via the value of an associated attribute of type ID. XML specifies that the value of an ID must start with a letter (or an underscore), e.g. id="a1". In CML, for example, it is essential that each atom within a molecule be given a unique ID if it is to be addressed in any bonding specification. Molecules should similarly have unique IDs. These IDs should be unique within the document so that the addressing is not ambiguous. Part of the process of validating a document involves establishing that the IDs are indeed unique and do not conflict with each other. A much more difficult problem arises when CML components from several different sources are consolidated into a single document, since the addressing may then no longer be unique on a global scale. In the future, resources to assign globally unique IDs to molecules may become available to address issues such as this.
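The clash that arises when components are merged can be detected with a simple scan. The following is a sketch only, assuming Python's standard library; the function name duplicate_ids and the sample document are invented for illustration, and a validating parser would normally perform this check for you.

```python
import xml.etree.ElementTree as ET
from collections import Counter


def duplicate_ids(root):
    """Return the id values that occur on more than one element."""
    counts = Counter(
        element.get("id") for element in root.iter() if element.get("id") is not None
    )
    return sorted(value for value, count in counts.items() if count > 1)


# Two molecules from different sources, each of which numbered its atoms from a1.
MERGED = """
<list>
  <molecule id="m1"><atom id="a1"/><atom id="a2"/></molecule>
  <molecule id="m2"><atom id="a1"/></molecule>
</list>
"""

duplicates = duplicate_ids(ET.fromstring(MERGED))
print(duplicates)
```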
MIME is an IETF standard for labelling electronic documents (files) for transmission between machines by mail or WWW protocols. Every document can be stamped with a MIME content-type such as "text/sgml" or "image/gif". Thus all documents sent from a WWW server have a content-type provided by the server, which allows the client to decide how to treat it (e.g. what software to use for rendering). In 1994 Henry Rzepa put forward the idea that this could be extended to cover molecular science and he, Ben Whittaker and Peter Murray-Rust published a proposal which has been widely adopted. See the Chemical MIME home page (http://www.ch.ic.ac.uk/chemime/). It was originally envisaged that CML would have its own MIME type, and chemical/x-cml was proposed. However, MIME stamps are external to the document and in effect describe how the entire document must be handled. Because it is now recognised that CML is likely to be merely one component within an XML document (i.e. that document may cover more than one namespace), the only safe way of encoding a document type within the document itself is to use the XML DOCTYPE statement. This declares what root DTD the document uses. For example, this document (which is of type text/html) starts with:
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML//EN"> which declares it as using the HTML DTD of the W3C. All elements inherit this DTD, unless another namespace corresponding to another DTD is declared later on.
Because of the possibility that a document may contain multiple
namespaces, associated with multiple DTDs, it is now generally agreed that
the MIME type for any document that contains XML should be text/xml,
and that the use of chemical/x-cml is deprecated.
CML is an information specification and so independent of any operating system, programming language or hardware. Software can be written in most languages, though we don't recommend writing a reader in FORTRAN (better to convert to FORTRAN input using XSLT). We write in Java as it is especially suited to XML, but the CML-DOM and the OMG's IDL can be implemented in Java, Javascript, C, C++, Python, Smalltalk, Visual Basic, etc.
At present (2001) only MS IE5, 6 (http://www.microsoft.com/windows/ie/default.htm) and later support enough XML to make client-side implementation easy. However we believe that all browsers will eventually support XML and XSLT (and hopefully SVG (http://www.w3.org/svg/), or scalable vector graphics, itself an XML language); CML has been designed as platform-independent.
XML is the publishing language of choice, and ideally suited to multiple output formats. Ideally this is through XSL-FO, the W3C's approach to typography, which can be rendered in many formats (HTML, PDF, RTF, SVG) and for many media (browsers, WAP, PDA, etc.). We are developing a series of XSLT stylesheets for chemistry which will provide platform-independent chemical publishing.
JUMBO can output any document in CML (and a variety of legacy formats). We shall use Java technology and/or SVG for capturing screen windows.
In the wider CML project we provide support beyond the concepts in the 1999 CML DTD. Examples of these include spectra and units. We support these through namespace extensions, at present:
Melting point is a scalar, numeric quantity with units and perhaps error estimates. There are hundreds of thousands of similar examples which are found in chemistry and CML cannot accommodate them directly. Therefore they are referenced through a dictionary as in:
<foo xmlns:fooDict="http://my.org/dict/fooDict.html"
xmlns:units="http://your.org/units/dict.xml">
<float title="My melting point" units="units:kelvin" dictRef="fooDict:mpt">298.15</float>
</foo>
The dictionary can be reached through the namespaceURI (which here
references a real location) and similarly for the units. The above example should make this clear. A dictionary is extensible and the entry can contain a large amount of valuable metadata including types, ranges and constraints, annotation by different curators, versions, links to other dictionaries, etc. JUMBO supports dictionaries in this way.
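One way software might follow a dictRef or units value back to its dictionary is to resolve the prefix against the namespace declarations in scope. The sketch below assumes Python's standard xml.etree.ElementTree; resolve_references is a hypothetical name, and the code ignores prefix scoping (it treats every declaration as global), which is a simplification.

```python
import io
import xml.etree.ElementTree as ET

DOCUMENT = """<foo xmlns:fooDict="http://my.org/dict/fooDict.html"
     xmlns:units="http://your.org/units/dict.xml">
  <float title="My melting point" units="units:kelvin" dictRef="fooDict:mpt">298.15</float>
</foo>"""


def resolve_references(xml_text):
    """Return (attribute, namespaceURI, entry) for each prefixed dictRef/units value."""
    prefixes = {}
    resolved = []
    source = io.BytesIO(xml_text.encode("utf-8"))
    for event, payload in ET.iterparse(source, events=("start-ns", "start")):
        if event == "start-ns":
            # payload is a (prefix, uri) pair for each xmlns declaration.
            prefix, uri = payload
            prefixes[prefix] = uri
        else:
            for attribute in ("dictRef", "units"):
                value = payload.get(attribute)
                if value and ":" in value:
                    prefix, entry = value.split(":", 1)
                    resolved.append((attribute, prefixes.get(prefix), entry))
    return resolved


references = resolve_references(DOCUMENT)
for reference in references:
    print(reference)
```

Fetching and interpreting the dictionary entry itself is a separate step that depends on the dictionary's own format.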
CML manages complex extensions through namespaces. We provide these for:
We expect there to be several initiatives in managing spectra and units in XML and the CML approach will interoperate with these.
XML (eXtensible Markup Language) has been developed by a large and dynamic group under the aegis of the W3 consortium. It's essentially a subset (or simplification) of SGML and is much easier to use. If you know how to write valid HTML it's a very small step to using XML. In fact, if you are already familiar with the latest specifications for HTML known as XHTML (http://www.w3.org/MarkUp/), you are already using XML!
The main features of XML are:
Well-formed is an important new concept. It means that a document is syntactically correct (e.g. the start and end tags balance, attribute values are quoted, etc.) even if it might not be valid (e.g. it might contain an unknown tag). XML is therefore very well suited to situations where the documents have already been validated (e.g. because the authoring software is authenticated, or because they have already passed through a validating parser). NOTE, however, that all CML documents must be validatable against the CML DTD, but it is possible to manipulate them without necessarily having to validate them.
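The distinction between well-formed and valid can be seen with any non-validating parser. As a small sketch, Python's standard xml.etree.ElementTree checks well-formedness only; the function name is_well_formed is ours.

```python
import xml.etree.ElementTree as ET


def is_well_formed(xml_text):
    """True if the text parses as XML; says nothing about DTD validity."""
    try:
        ET.fromstring(xml_text)
    except ET.ParseError:
        return False
    return True


balanced = "<molecule><atomArray><atom/></atomArray></molecule>"
unbalanced = "<molecule><atomArray><atom/></molecule>"
unknown_tag = "<molecule><notInTheDTD/></molecule>"

print(is_well_formed(balanced))
print(is_well_formed(unbalanced))
print(is_well_formed(unknown_tag))
```

The third document passes this check even though it would fail validation against a DTD that does not declare the element, which is exactly the gap a validating parser closes.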
If the file is on the chemical/* MIME list there is a good chance that a full or partial converter has been written. If not, one will need to be written (manual conversion is not recommended). We include many examples of this in JUMBO3-J, including some source code. The main task is to identify the information components within your chemical file, such as molecules, scalar data, text, citations, arrays, tables, graphs, dates, URLs, etc. You must then write a parser that reads one of your files and extracts this information. For each of these there is a simple routine which allows you to poke the object into the CML document.
At this stage you will need to think about the logical structure of your information. What data belongs to this molecule (e.g. a date)? Should all the annotations be in a separate section (e.g. a <list>)? You will also find that you start creating your own dictRef attributes for information and so you should draw up a glossary of all those terms used (if you have a user manual this should be in it anyway).
If you are in charge of the generation of chemical data files (e.g. they are output by your software or instrument) consider adding a "Save as CML" option to the system. This is much easier than writing a parser and is not a lengthy process.
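A "Save as CML" routine can be quite short once the data are already in memory. The sketch below is an assumption-laden illustration, not the JUMBO routines mentioned above: it uses Python's standard xml.etree.ElementTree, the function name atoms_to_cml is hypothetical, and the builtin values "x3", "y3" and "z3" for Cartesian coordinates are assumed here and should be checked against the CML DTD.

```python
import xml.etree.ElementTree as ET


def atoms_to_cml(molecule_id, atoms):
    """Serialise (symbol, x, y, z) tuples as a CML-style molecule string."""
    molecule = ET.Element("molecule", {"id": molecule_id})
    atom_array = ET.SubElement(molecule, "atomArray")
    for index, (symbol, x, y, z) in enumerate(atoms, start=1):
        atom = ET.SubElement(atom_array, "atom", {"id": f"a{index}"})
        element_type = ET.SubElement(atom, "string", {"builtin": "elementType"})
        element_type.text = symbol
        # Assumed builtin names for 3-D coordinates; verify against the DTD.
        for axis, value in (("x3", x), ("y3", y), ("z3", z)):
            coordinate = ET.SubElement(atom, "float", {"builtin": axis})
            coordinate.text = repr(value)
    return ET.tostring(molecule, encoding="unicode")


cml = atoms_to_cml("m1", [("O", 0.0, 0.0, 0.0), ("H", 0.96, 0.0, 0.0)])
print(cml)
```

Because the program already knows which number is which, no parsing of a legacy file is needed; the routine only has to label what it writes.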
Note also that you don't have to convert every piece of information initially. In some cases it can be held as text (<string> or XHTML) until you have decided what to do with it (an example is the REMARK cards in the CML version of PDB). But the more markup you add, the more valuable it will be to your readers.
CML also provides a mechanism for total encapsulation of foreign files. Do not use this as a lazy way out, but it's reasonable if you already use standard approaches from other disciplines. Thus a CML file might hold a CGM file (Computer Graphics Metafile) for its graphics - there is no real advantage in conversion.
CML can hold 2-D molecular information in a variety of ways:
If 2-D coordinates are present then JME-CML, JChemPaint and JUMBO3 can draw the structure. JUMBO3-JS uses Javascript and stylesheets to render this in a browser. If 2-D coordinates are missing JUMBO3-J has a layout routine which does a reasonable job on smallish molecules but needs more work for larger ones!
CML can hold 3-D molecular information in a variety of ways:
JUMBO3 can convert between all of these (specifically Z-Matrices and fractionals can be converted to Cartesians).
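For readers unfamiliar with the fractional-to-Cartesian step, here is a sketch of the usual crystallographic transformation. It is not taken from JUMBO3; it assumes the common convention that the a axis lies along x and the b axis lies in the xy plane, and the function name is hypothetical.

```python
import math


def fractional_to_cartesian(frac, a, b, c, alpha, beta, gamma):
    """Convert fractional coordinates to Cartesian ones.

    Cell lengths a, b, c share the output length unit; angles are in degrees.
    Assumed orientation: a along x, b in the xy plane.
    """
    u, v, w = frac
    ca, cb, cg = (math.cos(math.radians(angle)) for angle in (alpha, beta, gamma))
    sg = math.sin(math.radians(gamma))
    # Cell volume divided by a*b*c.
    volume_factor = math.sqrt(1 - ca * ca - cb * cb - cg * cg + 2 * ca * cb * cg)
    x = a * u + b * cg * v + c * cb * w
    y = b * sg * v + c * (ca - cb * cg) / sg * w
    z = c * volume_factor / sg * w
    return (x, y, z)


# Body centre of a 10 x 10 x 10 cubic cell.
centre = fractional_to_cartesian((0.5, 0.5, 0.5), 10.0, 10.0, 10.0, 90.0, 90.0, 90.0)
print(centre)
```

Other orientation conventions exist, so coordinates produced this way should only be compared with software that documents the same choice.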
Yes. We have already done this. The main challenge comes from the need to have markup in a variety of disciplines:
An article consists of separate sections holding these data which can often be automatically converted from other formats.
If the author wishes to render the paper in particular ways (e.g. by interleaving molecules in the text, etc.) they will have to find a system for doing this. The advantage of SGML is that it is very widely used and understood in the publishing and printing communities, so if you take a document specification in SGML to a specialist they are likely to be able to help.
Yes! XML is an extremely powerful way of organising information that can be searched later. XPath (used in XSLT) is a language for addressing into documents; we show some examples below but most of you won't use this syntax as we shall customise searches in GUIs. It is possible to search on:
<author> <name>Pauling</name> </author>
<atom> <string builtin="elementType">O</string> </atom> We could use "atom[string[@builtin='elementType'][.='O']]" ("find all atom elements with a string child with content 'O' and value 'elementType' for attribute 'builtin'").
<atom id="A123"> <string builtin="elementType">O</string> </atom> could be retrieved with "id('A123')". Many XSLT tools will optimize this query, so it is an excellent idea to put unique IDs on all important elements.
<molecule id="A123"> <string title="name">paradimethylaminobenzaldehyde</string> </molecule> (chosen for its scansion!) could be searched by "molecule[string[@title='name' and contains(., 'dimethyl')]]"
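The three searches above can be imitated without an XSLT processor. The sketch below uses Python's standard xml.etree.ElementTree, which implements only a small subset of XPath (no nested predicates and no contains()), so the filtering is written out in plain Python. The function names and the sample document are invented for illustration; a full XPath engine would accept the expressions quoted above directly.

```python
import xml.etree.ElementTree as ET

DOCUMENT = """
<list>
  <molecule id="A123">
    <string title="name">paradimethylaminobenzaldehyde</string>
    <atom id="a1"><string builtin="elementType">O</string></atom>
    <atom id="a2"><string builtin="elementType">N</string></atom>
  </molecule>
  <molecule id="B456">
    <string title="name">water</string>
    <atom id="a3"><string builtin="elementType">O</string></atom>
  </molecule>
</list>
"""


def atoms_of_element(root, symbol):
    """Rough equivalent of atom[string[@builtin='elementType'][.='O']]."""
    hits = []
    for atom in root.iter("atom"):
        for child in atom.findall("string"):
            text = (child.text or "").strip()
            if child.get("builtin") == "elementType" and text == symbol:
                hits.append(atom)
                break
    return hits


def element_by_id(root, wanted):
    """Rough equivalent of id('A123'): first descendant with that id attribute."""
    return root.find(f".//*[@id='{wanted}']")


def molecules_named_like(root, fragment):
    """Rough equivalent of molecule[string[@title='name' and contains(., ...)]]."""
    return [
        molecule
        for molecule in root.iter("molecule")
        if any(
            child.get("title") == "name" and fragment in (child.text or "")
            for child in molecule.findall("string")
        )
    ]


root = ET.fromstring(DOCUMENT)
print([atom.get("id") for atom in atoms_of_element(root, "O")])
print(element_by_id(root, "A123").tag)
print([m.get("id") for m in molecules_named_like(root, "dimethyl")])
```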
Not surprisingly native XSLT does not support chemical substructure searching but it is possible to add homegrown functions and we shall develop these in stylesheets, using JUMBO functionality. JUMBO is able to search a set of CML files using traditional substructure-based methods.
Possibilities include:
The following methods are available:
At present there are a reasonable selection of converters closely following the chemical/* MIME types (e.g. PDB, MOL, JCAMP). It's not too difficult to write other converters. The main difficulty is not normally the CML, but parsing the existing files.
With a graphical editor like CML-JME, JUMBO or JChemPaint it's safe and easy. Think twice before hand-editing - you must know what you are doing! The result must be well-formed, conform to the CML DTD and any non-CML tags must have non-CML namespaces.
This is typical of a wide range of questions about the level of detail and the algorithmic support that CML provides.
CML provides support for a very large number of chemical concepts but does not presume to supply the details. Thus a common mechanism for transmitting aromaticity is to draw a Kekulé structure with alternating double and single bonds. CML supports you in doing this (bondOrder can be "A", in addition to 1,2 or 3), but does not place its interpretation on it. For example, other systems use an 'aromatic bond' (-5 in CSD, ":" in SMILES) and CML will not necessarily recognise that these two correspond to the same concept (CML does not compare molecules - that is the role of the application program.)
CML has a small set of core conventions (e.g. all single bonds are mapped by default into CMLBond.SINGLE). Beyond this CML allows you to use whatever convention you like but you have to tell people what convention you are using. It's easy and more acceptable to do this if the convention is already widely used, and it's probable that CML-aware software will concentrate on the commoner systems. However, so long as you use the convention attribute to describe what you are doing, you are free to do what you want. It is likely, however, that the CML project in collaboration with IUPAC will recommend that certain common conventions (e.g. PDB) are reserved within CML.
Sometimes. The benefit of XML is that it provides explicit markup in ASCII so that encoding and syntactic errors and ambiguities are removed. The penalty is that files are sometimes larger, involving greater bandwidth and processing times. Reading a heavily tagged file into memory may also use a lot of memory and take time. Thus tagging every pixel in an image would be counterproductive. It has been shown, however, that XML compresses very well and the markup overhead is often only a few percent over the raw ASCII - a very acceptable price.
Many XML languages use a compromise and use implicit semantics for repetitive markup. So CML uses <floatArray> for large arrays; in fact we usually find CML equivalents of PDB files are often smaller than the raw files, since formatting whitespace is removed. The tradeoff is that the CML implementer must support both formats - but we have already done that for you. Moreover XSLT is well suited to this sort of interconversion.
Highly tuned compression (e.g. in the various JCAMP methods) will usually outperform compressed XML, but suffers from lack of interoperability - it is not trivial to write the software.
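The effect of generic compression on repetitive markup is easy to try for yourself. The sketch below builds an artificial, highly repetitive document and compresses it with Python's standard zlib module; the document is invented for the experiment, and the ratio it prints says nothing about real chemical files, whose content is far less regular.

```python
import zlib

# An artificial document: 1000 atoms that differ only in their id.
atoms = "".join(
    f'<atom id="a{i}"><string builtin="elementType">C</string></atom>'
    for i in range(1, 1001)
)
document = f"<molecule><atomArray>{atoms}</atomArray></molecule>".encode("ascii")

compressed = zlib.compress(document, 9)
ratio = len(compressed) / len(document)

print(len(document), len(compressed), round(ratio, 3))
```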
CML is not a database management system, but both database schemas and data can be represented in CML and often this can provide new approaches. Thus a CML document can be regarded as the serialisation of an object - in other words an ASCII representation of objects held in programs or databases. (There will be other tools for serialisation but CML can be made isomorphous with them). In this sense CML acts as an object schema, and is being used as the basis for developing IDLs (for use with CORBA) in molecular sciences.
There are many ways to manage objects, including distributed databases. Thus it could be reasonable to deliver a protein entry from several servers. One could hold its sequence, another its coordinates, a third the small molecules, etc. and these can all be transmitted as CML documents.
CML can also be used to represent data in relational databases and can provide a mechanism for input, output and archival.
CML also has a role to play in data entry. If entries are generated in CML (perhaps with an authoring tool) it becomes much easier to abstract the information from them when checking and validating. Moreover, since CML supports metadata, it's easy for authors to add this before submission.
For the vast majority of likely users of CML, NO. You will write and read CML with chemically-friendly software such as JUMBO, CML-JME and other "CML aware" editors. If you need to go deeper, the rules for constructing CML documents are simple:
If you are writing software to create CML files or read them you will find that we have done much of the work already.
We plan to start running real-life and online courses (http://www.cmlconsulting.com/) in CML shortly. Group bookings on-site will also be considered. Since a general knowledge of XML is likely to be important in almost all work with CML we shall offer introductory XML courses.
Yes. CML uses standards wherever possible. It's based on SGML/XML and MIME. Internally it uses ISO standards for dates and terminology. When standard entity sets (e.g. for Unicode characters) are used, CML can take advantage of this and will render them if the software has the appropriate glyphs. CML can also contain information in other standards or near-standards such as CGM, TeX, GIF, etc.
CML is the primary application of XML to molecular science, and we are in close collaboration with many bodies (see the position paper). Note that implementing a markup language for molecular science is a very considerable undertaking (some 500 Java classes may be required) so it is almost certainly worth basing your effort on CML.
We are aware of the importance of freezing versions as shown by our work with OMG. We also force ourselves to conform to our own specifications - CML is not being allowed to "creep" :-)
It's not software, so it won't actively 'do' anything! The relevant question is "what information can CML not carry?"
The following are still at "early adopter" stage:
In the last resort it may be possible to use parsable textual descriptions or contained files. But CML will not be widely useful if it is simply a container for a ragbag of legacy file types!
Strictly speaking, not in CML. CML is an XML DTD and parses satisfactorily, which is as bug-free as any software can be. So the real question is "what has CML got wrong?". We're answering some of this by our work with the OMG.
Do pigs fly? Are elephants pink? [See the JUMBO pages for bugs.]
See our position paper for the organisational links and our current thinking. Beyond that, it's up to you - the community. Remember that every new idea requires a huge amount of work - so the more you offer constructive ways forward, the more likely your ideas are to be incorporated. We welcome support in many ways:
There is no charge for using the CML DTD, but it is NOT in the public domain. Similar restrictions apply to the software - much of which is OpenSource - and other resources in this distribution or on the WebSite. CML is protected by copyright and trademarks, covering words and logos. Many formats have been trademarked to protect their integrity - an example is the IUCR's CIF format.
If you create "CML documents" they must conform to the specifications we have published. We provide tools for determining whether documents conform.
The JUMBO CML system consists of a set of Java classes and these may be freely used and distributed. These files may not be modified and the license explains the precise conditions of use.
The API for the classes is published. You may therefore extend the classes by standard mechanisms without needing to have source code. This is one of the great benefits of Java and means that the community can rely on a single, stable, core on which they can build. If the extensions are widely valuable it may be possible to incorporate them in future versions.
An XML document system normally requires:
There are extremely comprehensive resources at the W3C site (http://www.w3.org). Each XML subdiscipline (e.g. DOM, XSL) has its own page, with many links to tutorials, FAQs, software, etc. XSL has created its own community with excellent resources.
SAX was developed on the XML-DEV mailing list by David Megginson, PeterMR and many others. It is described and downloadable from the SAX home page (http://www.megginson.com/SAX).
Note that the CML software on this site uses all 3 methods extensively.
Many languages have been used for generic XML software, including:
JUMBO-J is by far the largest of these and is based on Java (1.2) and Swing. It uses SAX for initial processing and builds a DOM by extending a generic W3C DOM (current implementation is through Apache/Xerces, but we have also used SUN's XML DOM). It makes increasing use of XSLT for accessing DOM nodeSets, laying out the screen, etc. It will interoperate with Apache/FOP for PDF and other output. JUMBO-J can form the basis of a full CML editing system and we are working very closely with the Object Management Group (OMG) in their implementation of CoreCML as the core of OMG/LifeSciencesResearch.
JUMBO-JS is a lightweight Javascript implementation, based on MSXML DOM (but only because there are no others currently :-( ). It is designed for reading and displaying CML rather than editing it (Javascript has no write permission to local disk). It is likely to support a subset of CML such as CoreCML.
Chimeral. This is based largely on XSLT. Because it pioneered this at an early stage some of the approach is MS-specific. It also uses a variety of applets and plugins for displaying active chemistry. It is likely that Chimeral will evolve into a stylesheet library.
Obviously this depends on your program, and is simplest if the content can be separated into:
<atom id="a123">
<float title="B-factor">1.234</float>
</atom>
as a child of <atom>. If you are feeling very virtuous (please!)
you should also create a dictionary of the names you use (e.g.
"B-factor")- this will be of enormous value in the re-use of your
information.
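Attaching such a quantity programmatically takes only a few lines. This is a sketch using Python's standard xml.etree.ElementTree; annotate_atom is a hypothetical helper, not a JUMBO or CML-DOM call.

```python
import xml.etree.ElementTree as ET


def annotate_atom(atom, title, value):
    """Append <float title="...">value</float> as a child of the given atom."""
    child = ET.SubElement(atom, "float", {"title": title})
    child.text = str(value)
    return child


atom = ET.fromstring('<atom id="a123"/>')
annotate_atom(atom, "B-factor", 1.234)
serialised = ET.tostring(atom, encoding="unicode")
print(serialised)
```

The title string is the natural key for the dictionary of names recommended above, so it is worth keeping the spelling consistent across your output.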
Because CML documents can be so flexible there is no single answer. The questions you must ask are:
Therefore "reading" is often a set of "queries" for information in the input. There are three main strategies:
<atom><string builtin="elementType">Tl</string></atom>
we need only trap SAX events for elements with name = "string", builtin
attribute with value "elementType" and the next characters() equal to
"Tl".
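The event-trapping strategy can be sketched with Python's standard xml.sax module. The handler class and the count_atoms function are hypothetical names; the handler buffers characters() calls because a parser is free to deliver text in several pieces.

```python
import xml.sax


class ElementTypeCounter(xml.sax.ContentHandler):
    """Count <string builtin="elementType"> elements whose text equals a symbol."""

    def __init__(self, symbol):
        super().__init__()
        self.symbol = symbol
        self.count = 0
        self._buffer = None  # collects text only while inside a matching element

    def startElement(self, name, attrs):
        if name == "string" and attrs.get("builtin") == "elementType":
            self._buffer = []

    def characters(self, content):
        if self._buffer is not None:
            self._buffer.append(content)

    def endElement(self, name):
        if name == "string" and self._buffer is not None:
            if "".join(self._buffer).strip() == self.symbol:
                self.count += 1
            self._buffer = None


def count_atoms(xml_text, symbol):
    handler = ElementTypeCounter(symbol)
    xml.sax.parseString(xml_text.encode("utf-8"), handler)
    return handler.count


SAMPLE = """<molecule><atomArray>
<atom><string builtin="elementType">Tl</string></atom>
<atom><string builtin="elementType">O</string></atom>
<atom><string builtin="elementType">Tl</string></atom>
</atomArray></molecule>"""

thallium_count = count_atoms(SAMPLE, "Tl")
print(thallium_count)
```

Nothing is held in memory beyond the current text buffer and the counter, which is the main attraction of SAX for very large documents.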
A CML system may require many components, including:
This is completely dependent on what you want to do! Some of the processes could be:
The full CMLDOM implementation runs to many hundred classes and thousands of methods. (This isn't surprising; so does HTMLDOM, SVGDOM, WMLDOM, MathMLDOM, etc.) There are some applications with complex and flexible documents which require the full power of CMLDOM, but most don't. We have separated this into:
You must read this in conjunction with the JavaDoc for these packages. Firstly there are several support libraries, usually distributed in jars. Those labelled (OPEN) have OpenSource code; others are for early adopters but we expect them to evolve into OPEN.
OpenSource is a powerful philosophy for the rapid development of high-quality code. It is exemplified by the Linux operating system, the Apache server, and many XML tools. The case for OpenSource, including its philosophy and the business case, is made at http://www.opensource.org and in references from there. In particular Eric Raymond's papers on the ethics of "forking" code are useful for this section.
CML is intended to be Open and is offered in the spirit of OpenSource. However there is relatively little experience of OpenSource in the molecular community so some of our approach may be slightly different to that in more generic software projects. In particular we are concerned that our specifications and code, particularly those which we have created as reference tools, are not distributed in modified form.
We have had much experience of the OpenSource and OpenData approach both in molecular science and more generally. For over three years we have run the main development forum for XML (XML-DEV) which has resulted in a great deal of OpenSource code, most specifically SAX (see http://www.megginson.com/SAX). XML is a community which has benefitted enormously from OpenSource and many commercial companies have donated their code to OpenSource groups such as Apache.
This is not a common experience in molecular science, where it is more difficult to get critical mass. It happens in the biological and crystallographic domains with examples such as CCP4 (collaborative development of crystallographic programs), CIF (the IUCR's crystallographic information file), and a large amount of structure- and sequence-aware code for macromolecules.
I believe that OpenSource is a catalyst for the development of ideas in a discipline and the CML and JUMBO tools are offered in this spirit. There are some qualifications, however:
We intend to make robust, stable code available openly in an immutable form. Developers are free to build modules which interoperate with this code but not to change it. I believe that the current architecture allows this, through Java Patterns and similar approaches.
Code requires a gestation period before it can be reasonably offered as Open. In some projects (e.g. Apache) the code is openly mounted and collaborators accept that nightly builds, etc. may have bugs. The success depends on the discipline and critical mass of the collaborators.
Ideally this approach should be adopted for CML/JUMBO. However it proves much more difficult to get critical mass. Opening buggy code to the community can cause disillusionment and criticism, and the authors get flooded with calls for help.
We have therefore created an "early adopter" policy where organisations and individuals can have access to later versions with greater functionality and access to the source code. See the section on the development of CML (#develop).
Currently we have released:
CML is copyright Peter Murray-Rust and Henry Rzepa
JUMBO is copyright Peter Murray-Rust
We do not believe in patenting software and do not intend to patent any of CML or JUMBO
CML and its logo are trademarked by Peter Murray-Rust and Henry Rzepa. Trademarking is common for public scientific information specifications, such as produced by the W3C or IUCr.
You may not modify the DTD or guidelines for its use. You may not explicitly or implicitly modify its functionality.
CML is a published specification; anyone wishing to use it should use their best endeavours to ensure that files or software adhere to the specification. The DTD and Schema mechanisms exist to validate XML files. We are continuing to add tools and guidelines for validation and publish these on the website. It is not acceptable to create and distribute XML files labelled "CML" which deviate from the published specifications and the guidelines on our website.
You may extend CML functionality by non-CML namespaces. These namespaces must be clearly labelled as non-CML (i.e. have distinct prefixes and namespaceURIs). They must not have side-effects on the meaning or behaviour of CML elements, attributes or content.
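The namespace rule above can be checked mechanically. The following is a minimal sketch in Python (standard library only), not part of CML or JUMBO. Both namespace URIs and the ext:note element are hypothetical placeholders invented for illustration; the official CML namespace URI is not given here.

```python
import xml.etree.ElementTree as ET

# Both URIs are hypothetical placeholders, not official values.
CML_NS = "http://example.org/cml"
EXT_NS = "http://example.org/my-extension"

DOC = f"""
<cml:molecule xmlns:cml="{CML_NS}" xmlns:ext="{EXT_NS}" id="m1">
  <cml:atom id="a1"/>
  <ext:note>supplier-specific annotation</ext:note>
</cml:molecule>
"""

def split_tag(tag):
    """Split ElementTree's '{uri}local' tag form into (uri, local)."""
    if tag.startswith("{"):
        uri, local = tag[1:].split("}", 1)
        return uri, local
    return "", tag

def foreign_elements(root, cml_ns=CML_NS):
    """Return (uri, local) for every element outside the CML namespace."""
    return [split_tag(el.tag) for el in root.iter()
            if split_tag(el.tag)[0] != cml_ns]

root = ET.fromstring(DOC)
extensions = foreign_elements(root)
print(extensions)
```

Because the extension element carries its own namespaceURI, a CML-only processor can skip it without its meaning leaking into the CML elements around it.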
You may not modify JUMBO itself, either at source level or through reverse engineering. You may modify or extend JUMBO functionality as described below as long as the extensions and modifications are clearly labelled as such and do not claim to represent JUMBO functionality.
JUMBO can be modified by accepted Patterns and has been designed for this purpose. These include:
See (#develop) for the types of help we value. If you offer help, please make sure that you are offering something tangible and have the commitment to deliver. Our experience is that a very high percentage of offers don't materialise :-(.
We are actively working to provide tools and examples which allow conformance to be assessed. At present it is not acceptable for third parties to offer "CML conformance".
NOTE: this will be an overview of the components of the DTD/Schema and will be automatically generated from the schema. At present most of these entries are placeholders but act as an enumeration of the components. There will be links to the annotated Schema - at present these are inactive.
CML has a number of elements and attributes specifically to support chemical concepts.
The angle element represents a valence angle and may be used to construct a molecule from internal coordinates or a z-matrix.
Further documentation (../cmldom/htmlDoc/org/xmlcml/cml/CMLAngle.html)
The atom element represents an atom in a molecule. It contains builtins including:
Further documentation (../cmldom/htmlDoc/org/xmlcml/cml/CMLAtom.html)
The bond element represents a bond in a molecule. It contains builtins including:
Further documentation (../cmldom/htmlDoc/org/xmlcml/cml/CMLBond.html)
The crystal element represents a unit cell. It contains builtins including:
Further documentation (../cmldom/htmlDoc/org/xmlcml/cml/CMLCrystal.html)
The electron element represents an electron. We are still working out the semantics of this.
The feature element represents a feature in a biomolecule. It is mainly to support sequence and structure files and will probably be mainly textual.
The formula element represents the atom count in a molecule. It can be hierarchical (i.e. a formula can contain sub-formulas recursively).
The molecule element represents a "molecule" as a group of atoms and bonds. It can be hierarchical (i.e. a molecule can contain sub-molecules recursively).
The reaction element represents a reaction. The semantics are yet to be determined completely but early versions use pointers to reactant and product molecules.
The sequence element represents a biomolecular sequence. The semantics are yet to be determined completely.
The torsion element represents a torsion angle and may be used to construct a molecule from internal coordinates or a z-matrix.
Further documentation (../cmldom/htmlDoc/org/xmlcml/cml/CMLAngle.html)
Chemistry contains many concepts shared with other disciplines. In particular there are scalar and array, string and numeric datatypes. These elements and attributes support any named datatype with units and so are generally applicable to a very wide range of disciplines. In early versions of CML this subset was called TecML and this may be resurrected shortly.
The float element holds a real number (usually of at least double precision) along with a title, dictRef and units.
Further documentation (../cmldom/htmlDoc/org/xmlcml/cml/CMLFloatVal.html)
The floatArray element holds an array of real numbers (usually of at least double precision) along with a title, dictRef and units.
Further documentation (../cmldom/htmlDoc/org/xmlcml/cml/CMLFloatArray.html)
The floatMatrix element holds a 2-D matrix of real numbers (usually of at least double precision) along with a title, dictRef and units. The matrix may be rectangular or square, and symmetric, asymmetric or triangular.
Further documentation (../cmldom/htmlDoc/org/xmlcml/cml/CMLFloatMatrix.html)
The integer element holds an integer (usually of at least 32 bits) along with a title, dictRef and units.
Further documentation (../cmldom/htmlDoc/org/xmlcml/cml/CMLIntegerVal.html)
The integerArray element holds an array of integers (usually of at least 32 bits) along with a title, dictRef and units.
Further documentation (../cmldom/htmlDoc/org/xmlcml/cml/CMLIntegerArray.html)
The string element holds a string along with a title, dictRef and units.
Further documentation (../cmldom/htmlDoc/org/xmlcml/cml/CMLStringVal.html)
The stringArray element holds an array of strings along with a title, dictRef and units.
Further documentation (../cmldom/htmlDoc/org/xmlcml/cml/CMLStringArray.html)
CML documents contain generic structure and we have created the following elements to help support this.
The cml element is a general container, often at the root of a document but with no semantics other than to announce a chunk of CML.
Further documentation (../cmldom/htmlDoc/org/xmlcml/cml/CML.html)
The link element is a hyperlink implementable in XLink. It can have strong typing, multiple components, etc.
Further documentation (../cmldom/htmlDoc/org/xmlcml/cml/CMLlink.html)
The list element is a general container which can represent complex structures in a document.
Further documentation (../cmldom/htmlDoc/org/xmlcml/cml/CMLlist.html)
A number of attributes are used on two or more elementTypes and so are explicitly collected and described here. They are also identified in the DTD and Schema.
The count attribute applies a multiplier to several elements such as molecules and atoms. In this way stoichiometry can be built up economically without having to include every atom/molecule.
Further documentation (../cmldom/htmlDoc/org/xmlcml/cml/AttributeCount.html)
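As an illustration of how count can build up stoichiometry, here is a minimal Python sketch (standard library only). It assumes that count multipliers on nested elements multiply together and that a missing count means 1; both are assumptions made for this example, and the annotated Schema should be consulted for the definitive behaviour. The document fragment is invented.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Invented fragment: two water molecules, each written as one O atom
# and one H atom carrying count="2".
DOC = """
<molecule id="m1">
  <molecule id="water" count="2">
    <atom id="o1"><string builtin="elementType">O</string></atom>
    <atom id="h1" count="2"><string builtin="elementType">H</string></atom>
  </molecule>
</molecule>
"""

def atom_counts(element, multiplier=1.0, totals=None):
    """Accumulate element totals, multiplying nested count attributes
    (an assumption of this sketch, not a statement of the specification)."""
    totals = Counter() if totals is None else totals
    multiplier *= float(element.get("count", "1"))
    if element.tag == "atom":
        for child in element.findall("string"):
            if child.get("builtin") == "elementType":
                totals[child.text.strip()] += multiplier
    else:
        for child in element:
            atom_counts(child, multiplier, totals)
    return totals

totals = atom_counts(ET.fromstring(DOC))
print(dict(totals))
```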
The dictRef attribute provides a reference to a (data) dictionary describing the datatype and usage of the element. The dictionaries are normally namespaced and so multiple ones can be used. This is one of the most powerful features of CML.
The size attribute gives the size of an array on fooArray elements. It is redundant but serves as a check for the parsing software.
Further documentation (../cmldom/htmlDoc/org/xmlcml/cml/AttributeSize.html)
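The consistency check that size makes possible might look like the following Python sketch (standard library only). It assumes that array values are separated by whitespace, which is an assumption of this example; the function name read_float_array is hypothetical and is not part of JUMBO.

```python
import xml.etree.ElementTree as ET

def read_float_array(xml_text):
    """Parse a floatArray and verify the redundant size attribute.
    Whitespace-separated values are an assumption of this sketch."""
    el = ET.fromstring(xml_text)
    values = [float(tok) for tok in (el.text or "").split()]
    declared = el.get("size")
    if declared is not None and int(declared) != len(values):
        raise ValueError(f"size={declared} but found {len(values)} values")
    return values

good = read_float_array('<floatArray size="3" title="x">1.0 2.5 4.0</floatArray>')
print(good)
```

A truncated or corrupted array then fails loudly at parse time instead of silently yielding too few numbers.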
The title attribute can occur on almost every element and is a human-readable string. It should not be used to describe the nature of the element - use dictRef instead.
Further documentation (../cmldom/htmlDoc/org/xmlcml/cml/AttributeTitle.html)
The units attribute should occur on all numeric elements and will normally point to a dictionary (not yet fully implemented in JUMBO).
Further documentation (../cmldom/htmlDoc/org/xmlcml/cml/AttributeUnits.html)
The id attribute should occur on almost all elements and provides a unique identifier within the document.
Further documentation (../cmldom/htmlDoc/org/xmlcml/cml/AttributeId.html)
A number of concepts in the CML DTD/Schema have been found to require non-primitive types and these have been modelled as such in CMLDOM.
atomParity is defined as a String in the DTD but is more complex and requires an atomRefs attribute.
Further documentation (../cmldom/htmlDoc/org/xmlcml/cml/CMLatomParity.html)
bondStereo is defined as a String in the DTD but is more complex and requires an atomRefs attribute.
Further documentation (../cmldom/htmlDoc/org/xmlcml/cml/CMLbondStereo.html)
isotope is defined as a String in the DTD but may require a more complex structure.
Further documentation (../cmldom/htmlDoc/org/xmlcml/cml/CMLisotope.html)
NOTE: CML was conceived in 1996, when SGML was the guiding principle. There were no namespaces, virtually no tools beyond sgmls and CoST, and little experience in technical disciplines such as chemistry. Some of the design was constrained by this. In 1998, when CML was submitted for publication, there was no stable XSLT and Schemas were at a very early stage. The distinction between elements and attributes was more important, and some features were guided by the availability of tools.
CML is devised to be extensible, one of the commonest designs being the named datatyped quantity as:
<float title="foo" dictRef="bar" units="plugh">23.2</float>
This design principle avoids the explosion of elementTypes (e.g. <meltingPoint>) with the additional complexity of content models. "float" is a very general mechanism which has proved to work. Since some floats were central to the model they were defined in the DTD as attribute values of "builtin". A redesign in 2001 would probably make more use of some fundamental elementTypes or hardcoded attributes such as:
<atom elementType="O" hydrogenCount="2"/>
However we have constrained ourselves to use the 1999-DTD for the time being.
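To show why the named datatyped quantity keeps software small, here is a minimal Python sketch (standard library only) in which one generic reader handles any float element, whatever it measures. The Quantity class and parse_float function are hypothetical names invented for this example; they are not part of CMLDOM or JUMBO.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass

@dataclass
class Quantity:
    """Hypothetical in-memory form of a CML float element."""
    value: float
    title: str
    dict_ref: str
    units: str

def parse_float(xml_text):
    """Read any float element; its meaning comes from dictRef, not the tag."""
    el = ET.fromstring(xml_text)
    return Quantity(float(el.text), el.get("title"),
                    el.get("dictRef"), el.get("units"))

q = parse_float('<float title="foo" dictRef="bar" units="plugh">23.2</float>')
print(q)
```

A new kind of quantity needs only a new dictionary entry, not a new elementType or a change to this reader.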
Although XML preserves the order of an element's children, CML insists on order only where it makes chemical sense. For example:
<atom id="a1"> <float builtin="occupancy">0.5</float> <string builtin="elementType">O</string> </atom>
and
<atom id="a1"> <string builtin="elementType">O</string> <float builtin="occupancy">0.5</float> </atom>
are semantically equivalent. We would discourage implicit semantics of order. Thus start and end of bonds, or product and reactant molecules, should be represented by markup and not by order in the document.
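The equivalence of the two atoms above can be demonstrated by reading the children into a structure keyed by their builtin name instead of by their position. This is a minimal Python sketch (standard library only); the function name builtins_of is hypothetical.

```python
import xml.etree.ElementTree as ET

def builtins_of(atom_xml):
    """Map each builtin name to its text, ignoring child order."""
    atom = ET.fromstring(atom_xml)
    return {child.get("builtin"): child.text.strip()
            for child in atom if child.get("builtin")}

first = builtins_of(
    '<atom id="a1"> <float builtin="occupancy">0.5</float> '
    '<string builtin="elementType">O</string> </atom>')
second = builtins_of(
    '<atom id="a1"> <string builtin="elementType">O</string> '
    '<float builtin="occupancy">0.5</float> </atom>')
print(first == second)
```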
Atoms in particular should never be identified by their position, but always by an unchanging unique ID. CML does not prescribe the form of this ID or the method of uniquification. Some tools may use hashcodes based on time, userID, etc. while others may have implicit semantics (e.g. cl2, ribose-c3', etc.). There is a difficulty when independent documents are merged and the IDs may collide. One approach would be to base IDs on URL-like syntax to ensure the planet-wide uniqueness of such IDs. Others may wish to convert IDs as part of the merging process. We have not prejudged the solutions that may emerge from the community and would only emphasise here that IDs must be unique within the context of a molecule and may benefit from being unique within a wider one.
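One of the options mentioned above, converting IDs as part of the merging process, could be sketched as follows in Python (standard library only). The prefix scheme (d1_, d2_, ...) and the function names are invented for this example and are not a CML convention; any attributes that refer to IDs would need the same rewriting, using the mapping that prefix_ids returns.

```python
import xml.etree.ElementTree as ET

def prefix_ids(root, prefix):
    """Rename every id under root; return the old-to-new mapping so that
    references to those ids can be rewritten consistently."""
    mapping = {}
    for el in root.iter():
        old = el.get("id")
        if old is not None:
            mapping[old] = f"{prefix}{old}"
            el.set("id", mapping[old])
    return mapping

def merge(docs):
    """Merge documents under one cml container, one id prefix per document."""
    merged = ET.Element("cml")
    for n, text in enumerate(docs, start=1):
        root = ET.fromstring(text)
        prefix_ids(root, f"d{n}_")  # hypothetical prefix scheme
        merged.append(root)
    ids = [el.get("id") for el in merged.iter() if el.get("id")]
    if len(ids) != len(set(ids)):
        raise ValueError("duplicate ids after merge")
    return merged

DOC_A = '<molecule id="m1"><atom id="a1"/><atom id="a2"/></molecule>'
DOC_B = '<molecule id="m1"><atom id="a1"/></molecule>'
merged = merge([DOC_A, DOC_B])
merged_ids = [el.get("id") for el in merged.iter() if el.get("id")]
print(merged_ids)
```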
The following three ways of defining an oxygen atom are semantically the same (i.e. each defines an oxygen atom) but are clearly syntactically different. A frequent question is which is correct, and why, and does it matter?
<atom id="a1"><string builtin="elementType">O</string></atom>
<atom id="a1"><elementType>O</elementType></atom>
<atom id="a1" elementType="O" />
In the first, which is the one CML employs, the element is defined using a string element, employing the builtin attribute of elementType. As noted above, elementType is a fundamental molecular property, and so is handled directly by software (i.e. JUMBO3) when this string is processed (i.e. it is part of the CMLDOM which the software must build to represent the molecule in memory). The second example defines elementType to be an XML element in its own right (remember, XML elements are NOT the same as chemical elements), whilst the last defines elementType to be an attribute of an (XML) element. All three examples can of course be equally valid XML, but only the first is valid CML. So yes, it does matter which way this is expressed, since CML software conforming to the CML DTD will only be able to process the first example.
The reason the first method was used lies partly in design and partly in history: CML grew up concurrently with XML itself and, most importantly, alongside the need to develop working software which implements the definitions. The design aspect hinges on the need to create a small and stable core for CML, which does not permit unrestricted proliferation of XML/CML elements and, to a lesser extent, their attributes. This is most easily achieved by limiting the elements to this defined core, and allowing future extensibility not by creating new core elements or attributes, but by creating dictionaries to handle these extensions, and allowing sub-disciplines to create their own set of complementary (rather than competitive) elements using namespaces.