Chemical Markup Language. A Position Paper.
Peter Murray-Rust (Peter.Murray-rust@nottingham.ac.uk) and Henry S.
Rzepa (rzepa@ic.ac.uk)
2001-04-10
Introduction
This paper describes Chemical Markup Language and its relationship to IUPAC
and other organisations.
1. Overview of XML and CML
eXtensible Markup Language (XML) was developed by the World Wide Web
Consortium (W3C) from 1997 onwards as a means for describing and validating
complex documents. It is a formal subset of SGML (ISO-8879:1986) and
completely compliant with SGML; it can be regarded as "SGML-lite". It
emphasizes, but is not limited to, the transmission and processing of
documents over networks. XML became a Recommendation (the final product of
the W3C process) on 1998-02-10 and is now ubiquitous in all areas of the
computing environment.
XML was originally designed for complex documents for which HTML is too
fragile, and especially to support e-commerce (B2C) and
business-to-business (B2B) processes. However, it has also proved to be very
valuable for "non-document" content, such as data, business logic, message
wrapping, and many aspects of middleware. It was designed to emphasise
content over presentation, and the W3C anticipated that communities would
create their own vocabularies as with SGML. These are defined in Document
Type Definitions (DTDs) and often a DTD symbolises a markup language; thus
the XHTML DTD represents the XML variant of HyperText Markup Language
(XHTML).
The W3C has developed and continues to develop a wide range of generic
protocols based on the XML syntax. We may loosely refer to these as the
"XML family" or even "XML". They include a small number of content-based
MLs:
- MathML for semantic and presentational mathematics
- SVG for Scalable Vector Graphics
- SMIL for Synchronised (streamed) Multimedia
However, in general, the W3C does not create domain-based MLs, leaving that to
appropriate individuals and authorities within domains.
The W3C has created a large and powerful set of protocols layered on
XML. These provide generic functionality required for locating, processing,
and interpreting XML documents. All these protocols are effectively "part of
XML" and available to any author of an ML; we have taken great care that
CML can re-use these protocols and tools. They include:
- RDF for metadata discovery
- XSLT for document transformation
- XSL-FO for high performance document formatting and printing
- XLink for hypermedia (links)
- XML Schemas for document and data validation (superset of DTDs)
- XML Query language specification
- Namespaces in XML
- Digital signatures and encryption
XML has now become "the official metalanguage" for many communities,
including governmental and international organisations. Thus Drug
Regulatory Authorities (DRAs) are now actively promoting XML for the
support of New Drug Applications (NDAs). Where "chemistry" is required, CML
is seen as the appropriate tool to use.
2. Chemical Markup Language (CML).
Historical Development
The origins of domain specific scientific (i.e. non-bibliographic) markup
languages can be traced at least as far as the first World-Wide Web conference
(WWW1) held at CERN in May 1994, when a session on the future of HTML
developed into a discussion of how Mathematics and Chemistry might be
expressed. In late 1994 this took clearer form with the suggestion by HSR
that the output of a data-rich modelling program such as MOPAC, as it relates
to molecules, atoms, bonds and their computed properties, should be marked
up in SGML. A prototype CML browser written in Tcl/Wish was produced by
PM-R, and a poster describing CML, together with a working demonstration of a
modified version of MOPAC capable of reading and writing CML, was presented
at the 1995 ACS August meeting in Chicago. CML was further formalised when an
SGML DTD was defined
and subsequently published on the official SGML list comp.text.sgml by PM-R
in 1996. It included another language ("TecML") for the representation of
general scientific data. In mid 1996, the introduction of Java by Sun
Microsystems meant a platform independent approach to implementation could
be taken, and the JUMBO browser was written by PM-R and widely
demonstrated. The XML project was started by W3C in 1996, and PM-R was
invited to be a member of the XML-WG. In January 1997 he and HSR set up the
XML-DEV mailing list to support the development of XML and XML-based tools;
the list received some 22,000 postings in its first three years and has around
2000 subscribers. CML was recast and became the first XML DTD (in any
domain); a working demo was presented to WWW6 (1997) by Jon Bosak ("father
of XML"). CML was mentioned in the first "ChemWeb" virtual lecture by HSR and
some 500 participants attended the "Launch of Chemical Markup
Language", the title of the second such lecture given by PM-R on February 4,
1998. CML catalysed the development of "non-textual" DTDs and is frequently
cited in the XML literature. Version 1.0 of the CML specification was
formally published in 1999.
Relationship to the XML Community
The XML community has a very strong tradition of encouraging prototype
implementations of protocols and CML evolved in this way. There has always
been CML-aware software, frequently published as "JUMBO" and made available
as open source.
CML deliberately does not cover all chemistry but concentrates on
"molecules" (discrete entities representable by a formula and usually a
connection table). It supports a hierarchy for compound molecules
(clathrates, macromolecules, etc.). It also supports reactions, and
macromolecular structures/sequences (though it can interoperate with other
macromolecular XML languages as they are developed). It has no specific
support for physicochemical concepts, but can support labelled numeric
datatypes of several sorts which can cover a wide range of requirements. It
allows quantities and properties to be specifically attached to molecules,
atoms or bonds.
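
As a rough illustration of the scope described above, the following Python sketch builds a molecule with atoms, bonds and one attached property using the standard library. Only the `molecule` element name comes from this paper; every other element and attribute name (`atomArray`, `atom`, `elementType`, `bondArray`, `bond`, `atomRefs2`, `order`, `property`, `title`, `units`) and the helper `build_molecule` are assumptions made for the example and should not be read as the CML specification.

```python
import xml.etree.ElementTree as ET


def build_molecule(mol_id, atoms, bonds, properties=None):
    """Build a CML-style <molecule> element (illustrative names only).

    atoms:      list of (atom_id, element_symbol)
    bonds:      list of (atom_id_1, atom_id_2, bond_order)
    properties: dict mapping a title to a (value, units) pair
    """
    molecule = ET.Element("molecule", {"id": mol_id})

    # Atoms carry their own ids so that bonds and other data can refer to them.
    atom_array = ET.SubElement(molecule, "atomArray")
    for atom_id, symbol in atoms:
        ET.SubElement(atom_array, "atom", {"id": atom_id, "elementType": symbol})

    # The connection table: each bond points at two atom ids.
    bond_array = ET.SubElement(molecule, "bondArray")
    for atom_1, atom_2, order in bonds:
        ET.SubElement(
            bond_array,
            "bond",
            {"atomRefs2": f"{atom_1} {atom_2}", "order": str(order)},
        )

    # Labelled numeric data attached directly to the molecule.
    for title, (value, units) in (properties or {}).items():
        prop = ET.SubElement(molecule, "property", {"title": title, "units": units})
        prop.text = str(value)

    return molecule


water = build_molecule(
    "water",
    atoms=[("a1", "O"), ("a2", "H"), ("a3", "H")],
    bonds=[("a1", "a2", 1), ("a1", "a3", 1)],
    properties={"molar mass": (18.015, "g/mol")},
)
print(ET.tostring(water, encoding="unicode"))
```

The same pattern would extend to properties attached to individual atoms or bonds by adding child elements beneath them.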
CML is designed to interoperate with several leading MLs and XML
protocols and we have demonstrated the following:
- XHTML for text and images
- SVG for line diagrams, graphs, reaction schemes, phase diagrams,
etc.
- PlotML for graphs
- MathML for equations
- XLink for hypermedia (including atom-spectralPeak assignments,
reaction mapping).
- RDF and Dublin Core for metadata
- XML Schemas for numeric and other data types
There are other generic tools required in physical science including units,
multidimensional arrays with varied datatypes, terminology and
bibliography. There are no widely accepted MLs for these at present; we
shall continue to develop our own to be used with CML but will use others
if they become widespread. An example is physicochemical data held as SELF
(Prof. Henry Kehiaian, IUPAC+CODATA) and now converted to SELFML (PMR+HK)
as an IUPAC/CODATA project.
3. Extensibility
XML languages are extensible in that they can represent open-ended sets of
documents as well as rigidly prescribed formats. Thus CML can represent any of
the current "legacy" molecular formats, but can also describe a complete
chemical publication (v.i.). We have designed CML to be flexible enough to
support many uses (v.i.). There are mechanisms (DTD, Schema, XSLT
stylesheets, DOM) to support rigidity or allow flexibility as required. A
key part of CML's design is the linking to XML-based terminologies and
ontologies.
4. Uses of CML
Because XML now supports "documents" and "data" in a seamless spectrum, CML
is applicable to all aspects of chemical informatics, data-handling and
publication. Examples which we have prototyped include:
- datafiles: e.g. support for the IUCr's CIF and the Protein
Data Bank formats
- reports
- publications: v.i.
- compound data cards: the SELFML project associates many
molecules and mixtures with complex physicochemical properties
- Materials Safety Data Sheets (MSDS)
- molecules: we can convert losslessly from many legacy formats,
including MDL Molfile, Sybyl MOL2, JME, XYZ, SMILES, PDB and CIF
- reactions
- logfiles from computational chemistry: e.g. postprocessors for
Gaussian, MOPAC, VAMP
- instrumental output: conversion of JCAMP-DX to XML, and many
more.
Conversion of these applications to XML has a dramatic effect on the ease
of processing, searching, maintenance, re-use and many other aspects.
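
To give a flavour of what converting a legacy format involves, here is a minimal Python sketch that reads the simple XYZ format (atom count, comment line, then one `symbol x y z` line per atom) and emits a CML-style molecule. It is not the converter used in our software: the function name `xyz_to_cml` and the attribute names `title`, `elementType`, `x3`, `y3` and `z3` are assumptions made for the example. Coordinates are copied across as the original strings, so no numeric precision is lost in the conversion.

```python
import xml.etree.ElementTree as ET

WATER_XYZ = """3
water
O 0.000 0.000 0.117
H 0.757 0.000 -0.470
H -0.757 0.000 -0.470
"""


def xyz_to_cml(xyz_text, mol_id="m1"):
    """Convert XYZ text to a CML-style <molecule> (illustrative names only)."""
    lines = xyz_text.strip().splitlines()
    n_atoms = int(lines[0])    # line 1: number of atoms
    title = lines[1].strip()   # line 2: free-text comment

    molecule = ET.Element("molecule", {"id": mol_id, "title": title})
    atom_array = ET.SubElement(molecule, "atomArray")

    # Remaining lines: element symbol followed by Cartesian coordinates.
    for index, line in enumerate(lines[2:2 + n_atoms], start=1):
        symbol, x, y, z = line.split()[:4]
        ET.SubElement(
            atom_array,
            "atom",
            # Keep the coordinate strings verbatim rather than reparsing them.
            {"id": f"a{index}", "elementType": symbol, "x3": x, "y3": y, "z3": z},
        )
    return molecule


cml = xyz_to_cml(WATER_XYZ)
print(ET.tostring(cml, encoding="unicode"))
```

Once the data are in XML, generic tools (parsers, XSLT, query languages) can process them without format-specific code.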
5. CML and IUPAC
Background
We have been invited to present CML to IUPAC committees:
- CPEP (Oxford, 1998) An early version of CML was demonstrated and the
potential value of converting the GoldBook to XML was shown.
- Nomenclature (Washington, 2000). As part of the IChI project PMR
attended the IUPAC meeting in Washington.
In addition we are invited members of the IUPAC IChI group.
Relationship to IChI
All XML languages make fundamental use of the concept of identifiers
(formally, the ID attribute) to address components. An essential component
of the CML language which requires such identification is the
<molecule> element. We strongly believe that both the CML and the
IChI projects have much scope for mutual benefit.
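
The role of identifiers can be sketched in a few lines of Python. The example below indexes every element that carries an `id` attribute and then resolves a reference to a molecule. The wrapper elements `document` and `assignment` and the attribute `moleculeRef` are hypothetical names invented for the illustration; only `molecule` is taken from the text above. A validating parser would enforce uniqueness for attributes declared as type ID in a DTD; this sketch checks for duplicates by hand instead.

```python
import xml.etree.ElementTree as ET

DOC = """<document>
  <molecule id="m1" title="water"/>
  <molecule id="m2" title="ethanol"/>
  <assignment moleculeRef="m2">solvent</assignment>
</document>"""


def index_by_id(root):
    """Map each id value to its element, rejecting duplicate ids."""
    index = {}
    for element in root.iter():
        element_id = element.get("id")
        if element_id is None:
            continue
        if element_id in index:
            raise ValueError(f"duplicate id: {element_id}")
        index[element_id] = element
    return index


root = ET.fromstring(DOC)
ids = index_by_id(root)

# Follow the (hypothetical) reference from the assignment to its molecule.
ref = root.find("assignment").get("moleculeRef")
print(ids[ref].get("title"))  # prints "ethanol"
```

A stable, unique identifier for a chemical structure could serve the same addressing purpose across documents, which is one reason the two projects fit together naturally.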
6. CML and the publication process
SGML and XML are now the de facto languages for technical publishing
and chemical publishers are experienced in them. The CML DTD specifically
supports the publication of "chemical" information and we have pioneered
(with the RSC and the ACS) the use of CML in the process. The RSC office
was able to manage XML/CML submissions and we therefore note that there is
an ongoing and evolving process for electronic and paper publication. XML
supports high-quality (e-)print through its XSL-FO specification; generic
XML content is transformed to an abstract print format ("Formatting
Objects") which is then automatically converted to HTML, PDF, PostScript,
TeX or any other format.
Chemical information is often rendered graphically and CML supports this
through conversion to Scalable Vector Graphics (SVG). CML/SVG will support
2-D chemical presentation (molecules and reaction schemes) and can be
integrated with XSL-FO.
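
The idea of rendering a molecule as SVG can be sketched as follows. This is a simplified illustration in Python and not the CML/SVG converter itself: the function `molecule_to_svg`, its input layout and the `scale` and `margin` parameters are assumptions made for the example. Bonds become SVG `line` elements and atoms become `text` labels placed at scaled 2-D coordinates.

```python
import xml.etree.ElementTree as ET

SVG_NS = "http://www.w3.org/2000/svg"


def molecule_to_svg(atoms, bonds, scale=40.0, margin=20.0):
    """Draw a 2-D molecule as SVG.

    atoms: dict mapping atom_id to (symbol, x, y) in chemical coordinates
    bonds: list of (atom_id_1, atom_id_2)
    """
    xs = [x for _, x, _ in atoms.values()]
    ys = [y for _, _, y in atoms.values()]
    min_x, max_y = min(xs), max(ys)

    def to_pixels(x, y):
        # SVG's y axis points downwards, so the chemical y coordinate is flipped.
        return margin + (x - min_x) * scale, margin + (max_y - y) * scale

    width = 2 * margin + (max(xs) - min_x) * scale
    height = 2 * margin + (max_y - min(ys)) * scale
    svg = ET.Element(
        "svg", {"xmlns": SVG_NS, "width": f"{width:g}", "height": f"{height:g}"}
    )

    # One line per bond, drawn between the two atom positions.
    for atom_1, atom_2 in bonds:
        x1, y1 = to_pixels(*atoms[atom_1][1:])
        x2, y2 = to_pixels(*atoms[atom_2][1:])
        ET.SubElement(
            svg,
            "line",
            {"x1": f"{x1:g}", "y1": f"{y1:g}", "x2": f"{x2:g}", "y2": f"{y2:g}",
             "stroke": "black"},
        )

    # One text label per atom, centred on the atom position.
    for symbol, x, y in atoms.values():
        px, py = to_pixels(x, y)
        label = ET.SubElement(
            svg, "text", {"x": f"{px:g}", "y": f"{py:g}", "text-anchor": "middle"}
        )
        label.text = symbol

    return svg


svg = molecule_to_svg(
    atoms={"a1": ("C", 0.0, 0.0), "a2": ("O", 1.0, 0.0)},
    bonds=[("a1", "a2")],
)
print(ET.tostring(svg, encoding="unicode"))
```

A production renderer would also handle bond orders, stereochemistry and label clearance, which are omitted here for brevity.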
7. Implementability and software
To ensure the specification can be implemented we have created test
implementations, with OpenSource software. All implementations are modular
and include:
- CMLDOM-JS. A JavaScript implementation of the main components of CML.
- JUMBO3-JS. A JavaScript (in-browser) tool to retrieve and display
documents containing CML elements.
- SELFML-JS browser. This (JavaScript) tool reads one or more SELFML
files and displays them, including the embedded CML describing the
compounds.
- CMLDOM-J. A complete Java implementation of the CML-DOM, extensible
to further refinements of CML, developed in parallel with the OMG
project.
- JUMBO3-J. A Java browser for any document containing CML elements
including 2D and 3D displays.
- Chimeral. Working examples of large CML-based
documents and scientific articles which use an XSLT stylesheet component
library and applets for viewing.
- OpenScience Projects. The OpenScience project to communally develop
chemical software tools includes two which have been early adopters of
CML; JMol and JChemPaint.
- JME Editor. We are collaborating with the developer of JME (Java
Molecular Editor) to create a CML-aware 2D chemical structure
editor.
- JMVS. This is a Java3D-based CML-compliant molecular visualiser.
- JChemDig and JChemAgent. Web-based robots which can traverse a remote
site, identify chemical content based on chemical MIME types and create a
CML-based database of these files, including derived metadata.
- JChemValidate. An online resource for converting to and digitally signing
CML documents.
8. Publications relating to CML
Peer-Reviewed articles.
We have published the following articles describing CML:
- "The Internet as a Chemical Information Tool", H. S. Rzepa, P.
Murray-Rust and B. J. Whitaker, Chem. Soc. Revs, 1997, 1-10.
- P. Murray-Rust "Chemical Markup Language", World Wide Web
Journal, 1997, pp 135-147 and "JUMBO, An Object-based XML Browser",
ibid, 1997, pp 197-206. Published as chapters in "XML principles, Tools
and Techniques", Ed. D. Connolly, 1997, O'Reilly.
- P. Murray-Rust, "Chemical Markup Language" in "Electronic
communication technologies. techniques and technologies for the 21st
century", 1998, Ed M. Mitchard, Interpharm Press, Buffalo Grove, Ill.
ISBN: 1574910698.
- Chemical Markup Language and XML Part I. Basic principles, P. Murray-Rust
and H. S. Rzepa, J. Chem. Inf. Comp. Sci., 1999, 39, 928.
- A Universal approach to Web-based Chemistry using XML and CML, P.
Murray-Rust, H. S. Rzepa, M. Wright and S. Zara, ChemComm, 2000, 1471-1472.
- Development of Chemical Markup Language (CML) as a System for Handling
Complex Chemical Content, Peter Murray-Rust, Henry S. Rzepa and Michael
Wright, New J. Chem., 2001, 618-634.
- Chemical Markup, XML and the World-Wide Web. Part II: Information Objects
and the CMLDOM. P. Murray-Rust and H. S. Rzepa, J. Chem. Inf. Comp. Sci., in
press.
- Chemical Markup, XML and the World-Wide Web. Part III: Towards a signed
semantic Chemical Web of Trust, G. Gkoutos, P. Murray-Rust, H. S. Rzepa and
M. Wright, J. Chem. Inf. Comp. Sci., submitted for publication.
- A Resource for Transforming HTML and Molfile Documents to XML Compliant
Form, Georgios V. Gkoutos, Philip R. Kenway, Peter Murray-Rust, Henry S.
Rzepa and Michael Wright, Internet J. Chem, 2001, article 5.
CML is also described in a number of other articles and reviews written by
us and others (See supporting documents SS-bib.pdf and ISI-bib.pdf).
Press coverage
CML has been reviewed in Science, Nature, New Scientist and Scientific
American.
XML monographs
Many monographs (probably over 30) about XML review CML, some devoting
complete chapters to it. CML is reviewed on the definitive OASIS/Cover XML
resource at http://www.oasis-open.org/cover/
9. Invited Presentations
Both PMR and HSR have been invited on many occasions to present talks
describing CML and its applications at international conferences, including
four at ACS national meetings and one at a CSA meeting.
10. Relation to other standards and organisations
When they exist, CML re-uses other standards rather than re-inventing
approaches (v.s. for examples). CML is itself a fully conforming
application of XML.
- CMLDOM/OMG: CMLDOM (v.s.) has been chosen for the core of the Life
Sciences Research group of the Object Management Group (OMG). CMLDOM is
therefore consistent with an object-based approach to chemistry. Care has
been taken in the design to ensure that CMLDOM is extensible without
breaking this core.
- MatML (Materials Markup Language): PMR is a virtual member of the
working group to develop a materials markup language (at NIST) and CML
will be used in this where chemistry is required.
- CODATA/SELF: PMR has been invited to present XML and CML at CODATA
meeting(s). CML is an integral part of the SELFML IUPAC/CODATA project
(v.s.)
11. Adoption of CML
Many different types of organisation have adopted, or are adopting CML. We
list a few examples:
- Governmental and global agencies (e.g. drug regulatory agencies
through the International Conference on Harmonisation (ICH/M2)). We have
had additional meetings or discussions with several other agencies.
- Non-profit research (government). National Cancer Institute,
Developmental Therapeutics Program (NCI/DTP). Ca. 500K compounds are
being converted to CML.
- Non-profit research (academia). The University of California at San
Diego (UCSD) has adopted CML as the chemical technology for its new
terascale information and computing grid portals. This will also be used by
the Protein Data Bank (PDB) at the same site.
- Software companies. Four companies (names withheld because of
confidentiality) are collaborating with us in the introduction of CML for
the next or near release of their products.
12. IPRights and Protection
CML and its logo are trademarked to protect the status of the name and to
ensure integrity. This reflects similar practice for other languages (e.g.
CIF). To ensure openness of use all CML specifications and software are
published as OpenSource with appropriate licences to protect integrity but
ensure wide distribution. There is no intention to patent the CML
specification and software.