Chemical Markup Language. A Position Paper.

Peter Murray-Rust (Peter.Murray-rust@nottingham.ac.uk) and Henry S. Rzepa (rzepa@ic.ac.uk)

2001-04-10

Introduction

This paper describes Chemical Markup Language (CML) and its relationship to IUPAC and other organisations.

1. Overview of XML and CML

eXtensible Markup Language (XML) was developed by the World Wide Web Consortium (W3C) from 1996 onwards as a means of describing and validating complex documents. It is a formal subset of SGML (ISO 8879:1986) and completely compliant with it; it can be regarded as "SGML-lite". It emphasises, but is not limited to, the transmission and processing of documents over networks. XML became a Recommendation (the final product of the W3C process) on 1998-02-10 and is now ubiquitous in all areas of the computing environment.

XML was originally designed for complex documents for which HTML is too fragile, and especially to support business-to-consumer e-commerce (B2C) and business-to-business (B2B) processes. However, it has also proved very valuable for "non-document" content such as data, business logic, message wrapping and many aspects of middleware. It was designed to emphasise content over presentation, and the W3C anticipated that communities would create their own vocabularies, as with SGML. These vocabularies are defined in Document Type Definitions (DTDs), and a DTD often stands for a markup language; thus the XHTML DTD represents the XML variant of HyperText Markup Language (XHTML).

The W3C has developed and continues to develop a wide range of generic protocols based on the XML syntax. We may loosely refer to these as the "XML family" or even "XML". They include a small number of content-based MLs:

  1. MathML for semantic and presentational mathematics
  2. SVG for Scalable Vector Graphics
  3. SMIL for Synchronised (streamed) Multimedia
However, in general, the W3C does not create domain-based MLs, leaving that to appropriate individuals and authorities within domains.

The W3C has created a large and powerful set of protocols layered on XML. These provide generic functionality required for locating, processing and interpreting XML documents. All these protocols are effectively "part of XML" and available to any author of an ML; we have taken great care that CML can re-use these protocols and tools. They include:

  1. RDF for metadata discovery
  2. XSLT for document transformation
  3. XSL-FO for high performance document formatting and printing
  4. XLink for hypermedia (links)
  5. XML Schemas for document and data validation (superset of DTDs)
  6. XML Query language specification
  7. Namespaces in XML
  8. Digital signatures and encryption
XML has now become "the official metalanguage" for many communities, including governmental and international organisations. Thus Drug Regulatory Authorities (DRAs) are now actively promoting XML for the support of New Drug Applications (NDAs). Where "chemistry" is required, CML is seen as the appropriate tool to use.

2. Chemical Markup Language (CML).

Historical Development

The origins of domain-specific scientific (i.e. non-bibliographic) markup languages can be traced at least as far back as the first World-Wide Web conference (WWW1), held at CERN in May 1994, when a session on the future of HTML developed into a discussion of how Mathematics and Chemistry might be expressed. In late 1994 this took clearer form with the suggestion by HSR that the output of a data-rich modelling program such as MOPAC, describing molecules, atoms, bonds and their computed properties, should be marked up in SGML. A prototype CML browser written in Tcl/Wish was produced by PMR, and a poster describing CML, together with a working demonstration of a modified version of MOPAC capable of reading and writing CML, was presented at the ACS meeting in Chicago in August 1995. CML was further formalised when an SGML DTD was defined and subsequently published by PMR in 1996 on comp.text.sgml, the official SGML list. It included another language ("TecML") for the representation of general scientific data. In mid 1996, the introduction of Java by Sun Microsystems meant that a platform-independent approach to implementation could be taken, and the JUMBO browser was written by PMR and widely demonstrated.

The XML project was started by W3C in 1996, and PMR was invited to be a member of the XML-WG. In January 1997 he and HSR set up the XML-DEV mailing list to support the development of XML and XML-based tools; the list received some 22,000 postings in its first three years and has around 2000 subscribers. CML was recast in XML and became the first XML DTD (in any domain); a working demo was presented at WWW6 (1997) by Jon Bosak ("father of XML"). CML was mentioned in the first "ChemWeb" virtual lecture, given by HSR, and some 500 participants attended the second such lecture, "Launch of Chemical Markup Language", given by PMR on February 4, 1998. CML catalysed the development of "non-textual" DTDs and is frequently cited in the XML literature. Version 1.0 of the CML specification was formally published in 1999.

Relationship to the XML Community

The XML community has a very strong tradition of encouraging prototype implementations of protocols, and CML evolved in this way. There has always been CML-aware software, frequently published as "JUMBO" and made available as open source.

CML deliberately does not cover all chemistry but concentrates on "molecules" (discrete entities representable by a formula and usually a connection table). It supports a hierarchy for compound molecules (clathrates, macromolecules, etc.). It also supports reactions and macromolecular structures/sequences (though it can interoperate with other macromolecular XML languages as they are developed). It has no specific support for physicochemical concepts, but it supports labelled numeric datatypes of several sorts, which can cover a wide range of requirements. It allows quantities and properties to be attached specifically to molecules, atoms or bonds.
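As a rough illustration of the data model described above, the following Python sketch builds a small molecule document with atoms, bonds (a connection table) and a labelled numeric property attached to the molecule. Only the molecule element is named in this paper; the other element and attribute names used here (atomArray, bondArray, elementType, atomRefs, property, title, units) are assumptions chosen for readability and should not be read as the CML specification.

```python
import xml.etree.ElementTree as ET


def build_molecule(mol_id, atoms, bonds, properties):
    """Build an illustrative molecule element.

    atoms:      list of (atom_id, element_symbol)
    bonds:      list of (atom_id_1, atom_id_2, order)
    properties: dict mapping a label to (value, units)

    All child element and attribute names are hypothetical.
    """
    molecule = ET.Element("molecule", {"id": mol_id})

    # Atoms: each one carries its own id so that bonds can refer to it.
    atom_array = ET.SubElement(molecule, "atomArray")
    for atom_id, symbol in atoms:
        ET.SubElement(atom_array, "atom", {"id": atom_id, "elementType": symbol})

    # Bonds: the connection table, expressed as pairs of atom ids.
    bond_array = ET.SubElement(molecule, "bondArray")
    for first, second, order in bonds:
        ET.SubElement(
            bond_array,
            "bond",
            {"atomRefs": f"{first} {second}", "order": str(order)},
        )

    # Labelled numeric quantities attached to the molecule itself.
    for label, (value, units) in properties.items():
        prop = ET.SubElement(molecule, "property", {"title": label, "units": units})
        prop.text = str(value)

    return molecule


water = build_molecule(
    "water",
    atoms=[("a1", "O"), ("a2", "H"), ("a3", "H")],
    bonds=[("a1", "a2", 1), ("a1", "a3", 1)],
    properties={"molarMass": (18.015, "g/mol")},
)
print(ET.tostring(water, encoding="unicode"))
```

The same pattern could attach a property element to an individual atom or bond instead of the molecule, which is the kind of attachment the paragraph above describes.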

CML is designed to interoperate with several leading MLs and XML protocols, and we have demonstrated interoperation with the following:

  1. XHTML for text and images
  2. SVG for line diagrams, graphs, reaction schemes, phase diagrams, etc.
  3. PlotML for graphs
  4. MathML for equations
  5. XLink for hypermedia (including atom-spectralPeak assignments, reaction mapping)
  6. RDF and Dublin Core for metadata
  7. XML Schemas for numeric and other data types
There are other generic tools required in physical science, including units, multidimensional arrays with varied datatypes, terminology and bibliography. There are no widely accepted MLs for these at present; we shall continue to develop our own to be used with CML but will use others if they become widespread. An example is physicochemical data held as SELF (Prof. Henry Kehiaian, IUPAC+CODATA) and now converted to SELFML (PMR+HK) as an IUPAC/CODATA project.

3. Extensibility

XML languages are extensible in that they can describe an open-ended set of documents as well as rigidly prescribed formats. Thus CML can represent any of the current "legacy" molecular formats, but can also describe a complete chemical publication (v.i.). We have designed CML to be flexible enough to support many uses (v.i.). There are mechanisms (DTD, Schema, XSLT stylesheets, DOM) to enforce rigidity or allow flexibility as required. A key part of CML's design is the linking to XML-based terminologies and ontologies.

4. Uses of CML

Because XML now supports "documents" and "data" in a seamless spectrum, CML is applicable to all aspects of chemical informatics, data-handling and publication. Examples which we have prototyped include:
  1. datafiles: e.g. support for the IUCr's CIF and the Protein Data Bank formats
  2. reports
  3. publications: v.i.
  4. compound data cards: the SELFML project associates many molecules and mixtures with complex physicochemical properties; Materials Safety Data Sheets (MSDS)
  5. molecules: we can convert losslessly from many legacy formats, including MDL Molfile, Sybyl MOL2, JME, XYZ, SMILES, PDB and CIF
  6. reactions
  7. logfiles from computational chemistry: e.g. postprocessors for Gaussian, MOPAC, VAMP
  8. instrumental output: conversion of JCAMP-DX to XML
and many more.
Conversion of these applications to XML has a dramatic effect on the ease of processing, searching, maintenance, re-use and many other aspects.
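To make the idea of converting a legacy format concrete, here is a minimal Python sketch that turns a file in the simple XYZ layout (atom count, comment line, then one "symbol x y z" row per atom) into an XML molecule. It is not the converter referred to above; the element and attribute names (atomArray, atom, elementType, x3, y3, z3) and the function name xyz_to_xml are assumptions made for this illustration.

```python
import xml.etree.ElementTree as ET

XYZ_TEXT = """3
water
O  0.000  0.000  0.117
H  0.000  0.757 -0.469
H  0.000 -0.757 -0.469
"""


def xyz_to_xml(text, mol_id):
    """Convert XYZ text to an illustrative XML molecule element."""
    lines = text.strip().splitlines()
    count = int(lines[0])
    rows = lines[2:2 + count]
    if len(rows) != count:
        raise ValueError("atom count does not match number of coordinate rows")

    # The XYZ comment line is kept as the molecule title.
    molecule = ET.Element("molecule", {"id": mol_id, "title": lines[1].strip()})
    atom_array = ET.SubElement(molecule, "atomArray")
    for number, row in enumerate(rows, start=1):
        symbol, x, y, z = row.split()
        # Coordinates stay as the original strings so that no digits are lost.
        ET.SubElement(
            atom_array,
            "atom",
            {"id": f"a{number}", "elementType": symbol, "x3": x, "y3": y, "z3": z},
        )
    return molecule


molecule = xyz_to_xml(XYZ_TEXT, "m1")
print(ET.tostring(molecule, encoding="unicode"))
```

Once the data is in this form, generic XML tools can search or transform it without a format-specific parser, which is the gain in ease of processing noted above.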

5. CML and IUPAC

Background

We have been invited to present CML to IUPAC committees:
  1. CPEP (Oxford, 1998). An early version of CML was demonstrated and the potential value of converting the Gold Book to XML was shown.
  2. Nomenclature (Washington, 2000). As part of the IChI project, PMR attended the IUPAC meeting in Washington.
In addition, we are invited members of the IUPAC IChI group.

Relationship to IChI

All XML languages make fundamental use of the concept of identifiers (formally, the ID attribute) to address components. An essential component of the CML language which requires such identification is the <molecule> element. We strongly believe that both the CML and the IChI projects have much scope for mutual benefit.
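The role of identifiers can be sketched in a few lines of Python: every element carrying an id attribute is indexed, and a reference elsewhere in the document is resolved through that index. The document, reactant and ref names below are hypothetical and serve only to show the addressing mechanism; they are not taken from the CML specification.

```python
import xml.etree.ElementTree as ET

DOCUMENT = """
<document>
  <molecule id="m1" title="methane"/>
  <molecule id="m2" title="ethanol"/>
  <reactant ref="m2"/>
</document>
"""


def index_by_id(root):
    """Map every id attribute value in the tree to its element."""
    index = {}
    for element in root.iter():
        identifier = element.get("id")
        if identifier is None:
            continue
        # Identifiers must be unique within a document to be usable as addresses.
        if identifier in index:
            raise ValueError(f"duplicate id: {identifier}")
        index[identifier] = element
    return index


root = ET.fromstring(DOCUMENT)
molecules = index_by_id(root)

# Resolve the (hypothetical) reference from the reactant to its molecule.
target = molecules[root.find("reactant").get("ref")]
print(target.get("title"))  # prints "ethanol"
```

An identifier that is unique only within one document is enough for this kind of internal addressing; an identifier that is meaningful across documents is the sort of problem on which the two projects could benefit from each other.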

6. CML and the publication process

SGML and XML are now the de facto languages for technical publishing, and chemical publishers are experienced in them. The CML DTD specifically supports the publication of "chemical" information and we have pioneered (with the RSC and the ACS) the use of CML in the process. The RSC office was able to manage XML/CML submissions and we therefore note that there is an ongoing and evolving process for electronic and paper publication. XML supports high-quality (e-)print through its XSL-FO specification; generic XML content is transformed to an abstract print format ("Formatting Objects") which is then automatically converted to HTML, PDF, PostScript, TeX or any other format.

Chemical information is often rendered graphically and CML supports this through conversion to Scalable Vector Graphics (SVG). CML/SVG will support 2-D chemical presentation (molecules and reaction schemes) and can be integrated with XSL-FO.
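A conversion of this kind can be sketched as follows. This is a deliberately small Python illustration, not the CML/SVG tooling itself: it assumes a molecule whose atoms carry 2-D coordinates in hypothetical x2/y2 attributes, draws each bond as a single line and labels each atom with its element symbol. Bond orders and the usual flip of the y axis are ignored to keep the example short.

```python
import xml.etree.ElementTree as ET

MOLECULE = """
<molecule id="m1">
  <atomArray>
    <atom id="a1" elementType="C" x2="0.0" y2="0.0"/>
    <atom id="a2" elementType="O" x2="1.2" y2="0.0"/>
  </atomArray>
  <bondArray>
    <bond atomRefs="a1 a2" order="2"/>
  </bondArray>
</molecule>
"""

SVG_NS = "http://www.w3.org/2000/svg"


def molecule_to_svg(molecule, scale=50.0, margin=20.0):
    """Render atoms as text labels and bonds as lines in an SVG element."""
    # Scale chemical coordinates to drawing units and record each atom's symbol.
    positions = {}
    for atom in molecule.iter("atom"):
        positions[atom.get("id")] = (
            margin + scale * float(atom.get("x2")),
            margin + scale * float(atom.get("y2")),
            atom.get("elementType"),
        )

    width = max(x for x, _, _ in positions.values()) + margin
    height = max(y for _, y, _ in positions.values()) + margin
    svg = ET.Element(
        "svg",
        {"xmlns": SVG_NS, "width": f"{width:.1f}", "height": f"{height:.1f}"},
    )

    # Every bond is drawn as one line, whatever its order.
    for bond in molecule.iter("bond"):
        first, second = bond.get("atomRefs").split()
        x1, y1, _ = positions[first]
        x2, y2, _ = positions[second]
        ET.SubElement(
            svg,
            "line",
            {
                "x1": f"{x1:.1f}", "y1": f"{y1:.1f}",
                "x2": f"{x2:.1f}", "y2": f"{y2:.1f}",
                "stroke": "black",
            },
        )

    for x, y, symbol in positions.values():
        label = ET.SubElement(
            svg, "text", {"x": f"{x:.1f}", "y": f"{y:.1f}", "text-anchor": "middle"}
        )
        label.text = symbol
    return svg


svg = molecule_to_svg(ET.fromstring(MOLECULE))
print(ET.tostring(svg, encoding="unicode"))
```

Because the result is itself XML, it can be embedded in other XML documents or passed on to a formatting step.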

7. Implementability and software

To ensure the specification can be implemented we have created test implementations, with OpenSource software. All implementations are modular and include:
  1. CMLDOM-JS. A JavaScript implementation of the main components of CML.
  2. JUMBO3-JS. A JavaScript (in-browser) tool to retrieve and display documents containing CML elements.
  3. SELFML-JS browser. This JavaScript tool reads one or more SELFML files and displays them, including the embedded CML describing the compounds.
  4. CMLDOM-J. A complete Java implementation of the CML-DOM, extensible to further refinements of CML, developed in parallel with the OMG project.
  5. JUMBO3-J. A Java browser for any document containing CML elements, including 2D and 3D displays.
  6. Chimeral. Working examples of large CML-based documents and scientific articles which use an XSLT stylesheet component library and applets for viewing.
  7. OpenScience Projects. The OpenScience project to communally develop chemical software tools includes two which have been early adopters of CML: JMol and JChemPaint.
  8. JME Editor. We are collaborating with the developer of JME (Java Molecular Editor) to create a CML-aware 2D chemical structure editor.
  9. JMVS. This is a Java3D-based CML-compliant molecular visualiser.
  10. JChemDig and JChemAgent. Web-based robots which can traverse a remote site, identify chemical content based on chemical MIME types and create a CML-based database of these files, including derived metadata.
  11. JChemValidate. An online resource for converting documents to CML and digitally signing them.

8. Publications relating to CML

Peer-Reviewed articles.

We have published the following articles describing CML:
  1. H. S. Rzepa, P. Murray-Rust and B. J. Whitaker, "The Internet as a Chemical Information Tool", Chem. Soc. Revs, 1997, 1-10.
  2. P. Murray-Rust, "Chemical Markup Language", World Wide Web Journal, 1997, pp 135-147 and "JUMBO, An Object-based XML Browser", ibid, 1997, pp 197-206. Published as chapters in "XML principles, Tools and Techniques", Ed. D. Connolly, 1997, O'Reilly.
  3. P. Murray-Rust, "Chemical Markup Language" in "Electronic communication technologies: techniques and technologies for the 21st century", 1998, Ed. M. Mitchard, Interpharm Press, Buffalo Grove, Ill. ISBN: 1574910698.
  4. P. Murray-Rust and H. S. Rzepa, "Chemical Markup Language and XML Part I. Basic principles", J. Chem. Inf. Comp. Sci., 1999, 39, 928.
  5. P. Murray-Rust, H. S. Rzepa, M. Wright and S. Zara, "A Universal approach to Web-based Chemistry using XML and CML", ChemComm, 2000, 1471-1472.
  6. P. Murray-Rust, H. S. Rzepa and M. Wright, "Development of Chemical Markup Language (CML) as a System for Handling Complex Chemical Content", New J. Chem., 2001, 618-634.
  7. P. Murray-Rust and H. S. Rzepa, "Chemical Markup, XML and the World-Wide Web. Part II: Information Objects and the CMLDOM", J. Chem. Inf. Comp. Sci., in press.
  8. G. Gkoutos, P. Murray-Rust, H. S. Rzepa and M. Wright, "Chemical Markup, XML and the World-Wide Web. Part III: Towards a signed semantic Chemical Web of Trust", J. Chem. Inf. Comp. Sci., submitted for publication.
  9. G. V. Gkoutos, P. R. Kenway, P. Murray-Rust, H. S. Rzepa and M. Wright, "A Resource for Transforming HTML and Molfile Documents to XML Compliant Form", Internet J. Chem, 2001, article 5.
CML is also described in a number of other articles and reviews written by us and others (See supporting documents SS-bib.pdf and ISI-bib.pdf).

Press coverage

CML has been reviewed in Science, Nature, New Scientist and Scientific American.

XML monographs

Many monographs (probably over 30) about XML review CML, some devoting complete chapters to it. CML is reviewed on the definitive OASIS/Cover XML resource at http://www.oasis-open.org/cover/

9. Invited Presentations

Both PMR and HSR have been invited on many occasions to present talks describing CML and its applications at international conferences, including four at ACS national meetings and one at a CSA meeting.

10. Relation to other standards and organisations

Where other standards exist, CML re-uses them rather than re-inventing approaches (v.s. for examples). CML is itself a fully conforming application of XML.
  1. CMLDOM/OMG: CMLDOM (v.s.) has been chosen for the core of the Life Sciences Research group of the Object Management Group (OMG). CMLDOM is therefore consistent with an object-based approach to chemistry. Care has been taken in the design to ensure that CMLDOM is extensible without breaking this core.
  2. MatML (Materials Markup Language): PMR is a virtual member of the working group developing a materials markup language (at NIST), and CML will be used in this where chemistry is required.
  3. CODATA/SELF: PMR has been invited to present XML and CML at CODATA meeting(s). CML is an integral part of the SELFML IUPAC/CODATA project (v.s.)

11. Adoption of CML

Many different types of organisation have adopted, or are adopting CML. We list a few examples:
  1. Governmental and global agencies (e.g. drug regulatory agencies through the International Conference on Harmonisation (ICH/M2)). We have had additional meetings or discussions with several other agencies.
  2. Non-profit research (government). National Cancer Institute, Developmental Therapeutics Program (NCI/DTP): ca. 500K compounds are being converted to CML.
  3. Non-profit research (academia). The University of California at San Diego (UCSD) has adopted CML as the chemical technology for its new terascale information and computing grid portals. This will also be used by the Protein Data Bank (PDB) at the same site.
  4. Software companies. Four companies (names withheld because of confidentiality) are collaborating with us in the introduction of CML for the next or a forthcoming release of their products.

12. IPRights and Protection

CML and its logo are trademarked to protect the status of the name and to ensure integrity. This reflects similar practice for other languages (e.g. CIF). To ensure openness of use, all CML specifications and software are published as OpenSource, with appropriate licences to protect integrity but ensure wide distribution. There is no intention to patent the CML specification or software.