There is no controlled vocabulary for conventions, but the author must ensure that the semantics are openly available and that there are mechanisms for implementation. The convention is inherited by all the subelements, so that a convention for molecule would by default extend to its bond and atom children. This can be overridden if necessary by an explicit convention.
It may be useful to create conventions with namespaces (e.g. iupac:name). Use of convention will normally require non-STMML semantics, and should be used with caution. We would expect that conventions prefixed with "ISO" would be useful, such as ISO8601 for dateTimes.
There is no default, but the conventions of STMML or the related language (e.g. CML) will be assumed.
Example:
In the protein database ' CA' and 'CA' are different atom types, and an array could be:
<array delimiter="/" dictRef="pdb:atomTypes">/ N/ CA/CA/ N/</array>
Note that the array starts and ends with the delimiter, which must be chosen to avoid accidental use. There is currently no syntax for escaping delimiters.
A reference to a dictionary entry.
Elements in data instances such as scalar may have a dictRef attribute to point to an entry in a dictionary. To avoid excessive use of (mutable) filenames and URIs we recommend a namespace prefix, mapped to a namespace URI in the normal manner. In this case, of course, the namespace URI must point to a real XML document containing entry elements and validated against STMML Schema.
Where there is concern about the dictionary becoming separated from the document the dictionary entries can be physically included as part of the data instance and the normal XPointer addressing mechanism can be used.
This attribute can also be used on dictionary elements to define the namespace prefix.
ref modifies an element into a reference to an existing element of that type within the document. This is similar to a pointer and it can be thought of as a strongly typed hyperlink. It may also be used for "subclassing" or "overriding" elements.
array manages a homogeneous 1-dimensional array of similar objects. These can be encoded as strings (i.e. XSD-like datatypes) and are concatenated as string content. The size of the array should always be >= 1.
The default delimiter is whitespace. The normalize-space() function of XSLT could be used to normalize all whitespace to single spaces and this would not affect the value of the array elements. To extract the elements java.lang.StringTokenizer could be used. If the elements themselves contain whitespace then a different delimiter must be used and is identified through the delimiter attribute. This method is mandatory if it is required to represent empty strings. If a delimiter is used it MUST start and end the array - leading and trailing whitespace is ignored. Thus size+1 occurrences of the delimiter character are required. If non-normalized whitespace is to be encoded (e.g. newlines, tabs, etc) you are recommended to translate it character-wise to XML character entities.
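The splitting rules above can be sketched in Python. This is only an illustration under stated assumptions: the function name parse_array and its parameters are hypothetical (they are not part of STMML), the delimiter is assumed to be a single character, and elements are returned as plain strings without any type conversion.

```python
def parse_array(content, delimiter=None, size=None):
    """Split STMML-style array content into a list of strings (hypothetical helper)."""
    if delimiter is None:
        # Default case: any run of whitespace separates the elements.
        items = content.split()
    else:
        # Leading and trailing whitespace around the delimited string is ignored.
        text = content.strip()
        well_formed = (
            len(text) >= 2 * len(delimiter)
            and text.startswith(delimiter)
            and text.endswith(delimiter)
        )
        if not well_formed:
            raise ValueError("delimited content must start and end with the delimiter")
        # Drop the outer delimiters, then split; empty strings are preserved.
        inner = text[len(delimiter):len(text) - len(delimiter)]
        items = inner.split(delimiter)
    # The optional size attribute acts as a consistency check.
    if size is not None and len(items) != size:
        raise ValueError(f"expected {size} elements, found {len(items)}")
    return items
```

With a delimiter, an array of n elements contains n+1 delimiter characters, so empty strings and leading spaces inside elements survive the split.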
Note that normal Schema validation tools cannot validate the elements of array (they are defined as string). However, if the string is split, a temporary schema can be constructed from the type and used for validation. Also the type can be contained in a dictionary and software could decide to retrieve this and use it for validation.
When the elements of the array are not simple scalars (e.g. scalars with a value and an error), the scalars should be used as the elements. Although this is verbose, it is simple to understand. If there is a demand for more compact representations, it will be possible to define the syntax in a later version.
Example (the size attribute is not mandatory but provides a useful validity check):
Note that the second array-element is the empty string ''.
A generic container with no implied semantics. It just contains things and can have attributes which bind conventions to it. It could often act as the root element in an STM document.
By default matrix represents a rectangular matrix of any quantities representable as XSD or STMML dataTypes. It consists of rows*columns elements, where columns is the fastest moving index. Assuming the elements are counted from 1 they are ordered V[1,1],V[1,2],...V[1,columns],V[2,1],V[2,2],...V[2,columns], ...V[rows,1],V[rows,2],...V[rows,columns]
By default whitespace is used to separate matrix elements; see array for details. There are NO characters or markup delimiting the end of rows; authors must be careful! The columns and rows attributes have no default values; a row vector requires a rows attribute of 1.
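Because nothing marks the end of a row, software has to rebuild the rows from the rows and columns attributes. A minimal Python sketch of that step follows; parse_matrix and its convert parameter are hypothetical names, and only the default whitespace separator is handled.

```python
def parse_matrix(content, rows, columns, convert=float):
    """Rebuild a rows x columns matrix from whitespace-separated content (hypothetical helper)."""
    values = content.split()
    # Without row delimiters, the element count is the only available check.
    if len(values) != rows * columns:
        raise ValueError(
            f"expected {rows * columns} elements, found {len(values)}"
        )
    # columns is the fastest moving index, so each row is a contiguous slice.
    return [
        [convert(v) for v in values[r * columns:(r + 1) * columns]]
        for r in range(rows)
    ]
```

For example, parse_matrix("1 2 3 4 5 6", rows=2, columns=3) places 1 2 3 in the first row and 4 5 6 in the second.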
matrix also supports many types of square matrix, but at present we require all elements to be given, even if the matrix is symmetric, antisymmetric or banded diagonal. The matrixType attribute allows software to validate and process the type of matrix.
Number of rows
Number of columns
units (recommended for numeric quantities!!)
A general container for metadata, including at least Dublin Core (DC) and CML-specific metadata
In its simple form each element provides a name and content in a similar fashion to the meta element in HTML. metadata may have simpleContent (i.e. a string for adding further information - this is not controlled).
A container for any events that need to be recorded, whether planned or not. They can include notes, measurements, conditions that may be referenced elsewhere, etc. There are no controlled semantics
scalar holds scalar data under a single generic container. The semantics are usually resolved by linking to a dictionary. scalar defaults to a scalar string but has attributes which affect the type.
scalar does not necessarily reflect a physical object (for which object should be used). It may reflect a property of an object such as temperature, size, etc.
Note that normal Schema validation tools cannot validate the data type of scalar (it is defined as string), but that a temporary schema can be constructed from the type and used for validation. Also the type can be contained in a dictionary and software could decide to retrieve this and use it for validation.
An array of coordinateComponents for a single coordinate where these all refer to an X-coordinate (NOT x,y,z). Instances of this type will be used in array-style representation of 2-D or 3-D coordinates.
Currently no machine validation
Currently not used in STMML, but re-used by CML (see example)
An x/y coordinate pair consisting of two real numbers, separated by whitespace or a comma. In arrays and matrices, it may be useful to set a separate delimiter
An x/y/z coordinate triple consisting of three real numbers, separated by whitespace or commas. In arrays and matrices, it may be useful to set a separate delimiter
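Parsing such a pair or triple can be sketched as follows. The function name parse_coordinate is hypothetical, and the sketch is deliberately lenient: it accepts any mixture of whitespace and commas between the numbers.

```python
import re


def parse_coordinate(text, dimension):
    """Parse 'x y' or 'x,y,z' style text into a tuple of floats (hypothetical helper)."""
    # Split on any run of whitespace and/or commas; discard empty fragments.
    parts = [p for p in re.split(r"[\s,]+", text.strip()) if p]
    if len(parts) != dimension:
        raise ValueError(f"expected {dimension} numbers, found {len(parts)}")
    return tuple(float(p) for p in parts)
```

A 2-D pair is read with dimension=2 and a 3-D triple with dimension=3.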
A count multiplier for an element
Many elements represent objects which can occur an arbitrary number of times in a scientific context. Examples are action, object or molecules.
an enumerated type for all builtin allowed dataTypes in STM
dataTypeType represents an enumeration of allowed dataTypes (at present identical with those in XML-Schemas (Part2- datatypes)). This means that implementers should be able to use standard XMLSchema-based tools for validation without major implementation problems.
It will often be used as an attribute on scalar, array or matrix elements.
Some STMML elements (such as array) have content representing concatenated values. The default separator is whitespace (which can be normalised) and this should be used whenever possible. However in some cases the values are empty, or contain whitespace or other problematic punctuation, and a delimiter is required.
Note that the content string MUST start and end with the delimiter so there is no ambiguity as to what the components are. Only printable characters from the ASCII character set should be used, and character entities should be avoided.
When delimiters are used to separate precise whitespace this should always consist of spaces and not the other allowed whitespace characters (newline, tabs, etc.). If the latter are important it is probably best to redesign the application.
Errors in values can be of several types and this simpleType provides a small controlled vocabulary
An observed or calculated estimate of the error in the value of a numeric quantity. It should be ignored for dataTypes such as URL, date or string. The statistical basis of the errorValueType is not defined - it could be a range, an estimated standard deviation, an observed standard error, etc. This information can be added through errorBasisType.
This is not formally of type ID (an XML NAME which must start with a letter and contain only letters, digits and .-_:). It is recommended that IDs start with a letter, and contain no punctuation or whitespace. The function generate-id() in XSLT will generate semantically void unique IDs.
It is difficult to ensure uniqueness when documents are merged. We suggest namespacing IDs, perhaps using the containing elements as the base. Thus mol3:a1 could be a useful unique ID. However this is still experimental.
An array of floats or other real numbers. Not used in STM Schema, but re-used by CML and other languages.
An array of integers; for re-use by other schemas
Not machine-validatable
The maximum INCLUSIVE value of a sortable quantity such as numeric, date or string. It should be ignored for dataTypes such as URL. The min and max attributes can be used together to give a range for the quantity. The statistical basis of this range is not defined. The value of max is usually an observed quantity (or calculated from observations). To restrict a value, the maxExclusive type in a dictionary should be used.
The type of the maximum is the same as the quantity to which it refers - numeric, date and string are currently allowed
Allowed matrix types. These are mainly square matrices
1 2 3 4
0 3 5 6
0 0 4 8
0 0 0 2
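The four rows above appear to show an upper triangular matrix. As a hedged sketch of the kind of check software might run before trusting a declared matrixType, the following Python functions test two common shapes. The function names are illustrative only; the actual enumeration values of matrixType are not listed here.

```python
def is_square(m):
    """True if every row has as many entries as there are rows."""
    return all(len(row) == len(m) for row in m)


def is_upper_triangular(m):
    """True if the matrix is square and all entries below the diagonal are zero."""
    return is_square(m) and all(
        m[i][j] == 0 for i in range(len(m)) for j in range(i)
    )


def is_symmetric(m):
    """True if the matrix is square and equal to its transpose."""
    return is_square(m) and all(
        m[i][j] == m[j][i] for i in range(len(m)) for j in range(i)
    )


example = [
    [1, 2, 3, 4],
    [0, 3, 5, 6],
    [0, 0, 4, 8],
    [0, 0, 0, 2],
]
```

Since all elements must be given even for symmetric or banded matrices, such checks can be run on the fully expanded matrix.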
The minimum INCLUSIVE value of a sortable quantity such as numeric, date or string. It should be ignored for dataTypes such as URL. The min and max attributes can be used together to give a range for the quantity. The statistical basis of this range is not defined. The value of min is usually an observed quantity (or calculated from observations). To restrict a value, the minExclusive type in a dictionary should be used.
The type of the minimum is the same as the quantity to which it refers - numeric, date and string are currently allowed
The namespace is optional but recommended where possible
Note: this convention is only used within STMML and related languages; it is NOT a generic URI.
The namespace prefix must start with an alpha character and can only contain alphanumeric and '_'. The suffix can have characters from the XML ID specification (alphanumeric, '_', '.' and '-').
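The lexical rules for such references can be approximated with a regular expression. The sketch below is an assumption-laden reading of the text: it treats "alpha" and "alphanumeric" as ASCII only, makes the prefix optional, and places no constraint on the first character of the suffix. The name split_namespace_ref is hypothetical.

```python
import re

# Optional prefix: a letter followed by letters, digits or '_', then ':'.
# Suffix: one or more letters, digits, '_', '.' or '-'.
_NAMESPACE_REF = re.compile(r"(?:([A-Za-z][A-Za-z0-9_]*):)?([A-Za-z0-9_.\-]+)")


def split_namespace_ref(value):
    """Return (prefix, suffix) for a valid reference, or None if it is malformed."""
    match = _NAMESPACE_REF.fullmatch(value)
    if match is None:
        return None
    return match.group(1), match.group(2)
```

A reference such as si:m splits into the prefix si and the suffix m; mapping the prefix to a namespace URI is a separate step that this sketch does not attempt.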
A reference to an existing element in the document. The target of the ref attribute must exist. The test for validity will normally occur in the element's appinfo
Any DOM Node created from this element will normally be a reference to another Node, so that if the target node is modified the dereferenced content is modified as well. At present there are no deep copy semantics hardcoded into the schema.
The size of an array. Redundant, but serves as a check for processing software (useful if delimiters are used)
These will be linked to dictionaries of units with conversion information, using namespaced references (e.g. si:m)
Distinguish carefully from unitType which is an element describing a type of a unit in a unitList
It can be used for:
Usually within a molecule. It is almost always contained within atomArray.
the electron children: One or more electrons associated with the atom. The atomRef on the electron should point to the id on the atom. We may relax this later and allow reference by context.
The elementType. Almost mandatory
The explicit hydrogen count
The non-hydrogen count (obsolete - moved to CML Query)
The isotopic mass. Default implies "natural abundance"
The occupancy (mainly from crystallography)
The x coordinate (arbitrary units) of a 2-D representation (unrelated to 3-D structure). Note that x- and y- 2D coordinates are required for graphical stereochemistry such as wedge/hatch. x- and y- coordinates must be both present or both absent.
The x coordinate (in Angstrom units) of a 3-D cartesian representation. x3, y3 and z3 coordinates must be all present or all absent.
The fractional x coordinate in a crystal structure. xFract, yFract and zFract coordinates must be all present or all absent. A crystal element is required
The combined x and y coordinates of a 2-D representation (unrelated to 3-D structure). Note that x- and y- 2D coordinates are required for graphical stereochemistry such as wedge/hatch.
The combined x, y, z coordinates (in Angstrom units) of a 3-D cartesian representation.
The combined x, y, z fractional coordinates in a crystal structure. A crystal element is required
The y coordinate (arbitrary units) of a 2-D representation (unrelated to 3-D structure). Note that x2 and y2 coordinates are required for graphical stereochemistry such as wedge/hatch. x2 and y2 coordinates must be both present or both absent.
The y coordinate (in Angstrom units) of a 3-D cartesian representation. x3, y3 and z3 coordinates must be all present or all absent.
The fractional y coordinate in a crystal structure. xFract, yFract and zFract coordinates must be all present or all absent. A crystal element is required
The z coordinate (in Angstrom units) of a 3-D cartesian representation. x3, y3 and z3 coordinates must be all present or all absent.
The fractional z coordinate in a crystal structure. xFract, yFract and zFract coordinates must be all present or all absent. A crystal element is required
This can be used to describe the purpose of atoms whose elementTypes are dummy or locant.
This is a CCML extension to core CML. atomTypes will normally be defined independently of a particular calculation and stored in a dictionary. The attribute may be included in a primary definition of an atom in a molecule or may be added later through the inherit mechanism.
This is a CCML extension to core CML. It represents a change in the Cartesian X coordinate (e.g. for vibrational modes, molecular dynamics, etc.). Whether dx3 can be added to an x3 value depends on the semantics of the application.
This is a CCML extension to core CML. It represents a change in the Cartesian Y coordinate (e.g. for vibrational modes, molecular dynamics, etc.). See dx3.
This is a CCML extension to core CML. It represents a change in the Cartesian Z coordinate (e.g. for vibrational modes, molecular dynamics, etc.). See dx3.
This is a CCML extension to core CML. It represents combined changes in the Cartesian X, Y, and Z coordinates and is an alternative to dx3, etc. See dx3.
This is a CCML extension to core CML.
Units MUST be given in velocityUnits in the grandparent molecule; there are NO defaults.
This is a CCML extension to core CML.
Units MUST be given in velocityUnits in the grandparent molecule; there are NO defaults.
This is a CCML extension to core CML.
Units MUST be given in velocityUnits in the grandparent molecule; there are NO defaults.
This is a CCML extension to core CML.
Units MUST be given in velocityUnits in the grandparent molecule; there are NO defaults.
The attributes are directly related to the scalar attributes under atom which should be consulted for more info.
NOTE: The CML-1 specifications are also supported but are deprecated
Example - these are exactly equivalent representations
It follows the convention of the MIF format, and uses 4 distinct atoms to define the chirality. These can be any atoms (though they are normally bonded to the current atom). There is no default order and the order is defined by the atoms in the atomRefs4 attribute. If there are only 3 ligands, the current atom should be included in the 4 atomRefs.
The value of the parity is a signed number. (It can only be zero if two or more atoms are coincident or the configuration is planar). The sign is the sign of the chiral volume created by the four atoms (a1, a2, a3, a4):
| 1 1 1 1 |
| x1 x2 x3 x4 |
| y1 y2 y3 y4 |
| z1 z2 z3 z4 |
Note that atomParity cannot be used with the *Array syntax for atoms.
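The sign of the chiral volume described above can be computed directly from the 4x4 determinant. The following Python sketch assumes that each of the four atoms is supplied as an (x, y, z) tuple in the order given by atomRefs4; the function names det and parity_sign are hypothetical, and exact comparison with zero is used, which real software would probably replace with a tolerance.

```python
def det(m):
    """Determinant by Laplace expansion along the first row (fine for a 4x4 matrix)."""
    if len(m) == 1:
        return m[0][0]
    total = 0.0
    for j, value in enumerate(m[0]):
        # Minor: remove the first row and column j.
        minor = [row[:j] + row[j + 1:] for row in m[1:]]
        total += (-1) ** j * value * det(minor)
    return total


def parity_sign(a1, a2, a3, a4):
    """Return +1, -1 or 0 according to the sign of the chiral volume."""
    atoms = [a1, a2, a3, a4]
    matrix = [
        [1.0, 1.0, 1.0, 1.0],
        [float(p[0]) for p in atoms],  # x1 x2 x3 x4
        [float(p[1]) for p in atoms],  # y1 y2 y3 y4
        [float(p[2]) for p in atoms],  # z1 z2 z3 z4
    ]
    d = det(matrix)
    return (d > 0) - (d < 0)
```

Exchanging any two atoms in the argument list swaps two columns of the determinant and therefore flips the sign, which is why the order in atomRefs4 matters.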
bond is a child of bondArray and contains bond information. Bond must refer to at least two atoms (using atomRefs2) but may also refer to more for multicentre bonds. Bond is often EMPTY but may contain electron, length or bondStereo elements.
The bondRef on the electron should point to the id on the bond. We may relax this later and allow reference by context.
only one convention allowed
This will be the normal reference attribute on the bond element. The order of atoms is preserved and may matter for some conventions (e.g. wedge/hatch or donor bonds)
This is designed for multicentre bonds (as in delocalised systems or electron-deficient centres). The semantics are experimental at this stage. As an example, a B-H-B bond might be described as <bond atomRefs="b1 h2 b2"/>
This is designed for pi-bonds and other systems where formal valence bonds are not drawn to atoms. The semantics are experimental at this stage. As an example, a Pt-|| bond (as the Pt-ethene bond in Zeise's salt) might be described as <bond atomRefs="pt1" bondRefs="b32"/>
There is NO default. This order is for bookkeeping only and is not related to length, QM calculations or other experimental or theoretical calculations. See orderType.
bondArray is a child of molecule and contains bond information. There are two strategies:
The attributes are directly related to the scalar attributes under atom which should be consulted for more info.
Example - these are exactly equivalent representations
The IDs for the bonds. Required in array mode
The first atoms in each bond. Required in array mode
The second atoms in each bond. Required in array mode
The bond orders in each bond. Used in array mode
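The relationship between array mode and individual bond elements can be sketched by expanding the parallel arrays into one record per bond. Everything named here is an assumption made for illustration: the function expand_bond_array, its parameter names and the dictionary keys are hypothetical, and the four inputs are taken to be whitespace-separated strings of equal length.

```python
def expand_bond_array(bond_ids, atom_refs1, atom_refs2, orders=None):
    """Expand parallel whitespace-separated arrays into per-bond records (hypothetical helper)."""
    ids = bond_ids.split()
    first = atom_refs1.split()
    second = atom_refs2.split()
    if not (len(ids) == len(first) == len(second)):
        raise ValueError("bond id and atom reference arrays must have equal length")
    # Bond orders are optional in this sketch; missing orders are recorded as None.
    order_list = orders.split() if orders is not None else [None] * len(ids)
    if len(order_list) != len(ids):
        raise ValueError("order array must have one entry per bond")
    return [
        {"id": i, "atomRefs2": (a, b), "order": o}
        for i, a, b, o in zip(ids, first, second, order_list)
    ]
```

Each resulting record carries the same information as a single bond with an id, two atom references and an order.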
An explicit list of atomRefs must be given, or it must be a child of bond. There are no implicit conventions such as E/Z. This will be extended to other types of stereochemistry.
At present the following are supported:
The atomRefs and atomRefs4 attributes cannot be used simultaneously.
Often the root of the CML (sub)document. Has no explicit function but serves to hold the dictionaries, namespace, and can alert CML processors and search/XMLQuery tools that there is chemistry in the document. Can contain any content, but usually a list of molecules and other CML components. Can be nested
Required if fractional coordinates are provided for a molecule.
There are precisely SIX child scalars to represent the cell lengths and angles in that order. There are no default values.
The number of molecules per cell. Molecules are defined as the molecule which directly contains the crystal element.
Since there is very little use of electrons in current chemical information this is a fluid concept. I expect it to be used for electron counting, input and output of theochem operations, descriptions of orbitals, spin states, oxidation states, etc. Electrons can be associated with atoms, bonds and combinations of these. At present there are no hardcoded semantics. However, atomRef and similar attributes can be used to associate electrons with atoms or bonds.
It is defined by atomArrays each with a list of elementTypes and their counts (or default=1). All other information in the atomArray is ignored. formula elements are nestable so that aggregates (e.g. hydrates, salts, etc.) can be described. CML does not require that formula information is consistent with (say) crystallographic information; this allows for experimental variance.
An alternative briefer representation is also available through the conciseForm. This must include whitespace round all elements and their counts, which must be explicit.
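Because every element and count is whitespace-separated and every count is explicit, a concise form can be read by pairing up tokens. The sketch below assumes a string made only of alternating element symbols and integer counts, such as "C 2 H 6 O 1"; the full conciseForm grammar is not given here, so anything beyond that (charges, non-integer counts) is out of scope. The name parse_concise_formula is hypothetical.

```python
def parse_concise_formula(text):
    """Read alternating 'symbol count' tokens into a dict (hypothetical helper)."""
    tokens = text.split()
    if len(tokens) % 2 != 0:
        raise ValueError("every element symbol needs an explicit count")
    counts = {}
    for symbol, count in zip(tokens[0::2], tokens[1::2]):
        if not symbol.isalpha():
            raise ValueError(f"not an element symbol: {symbol!r}")
        # Repeated symbols are summed; int() rejects non-integer counts.
        counts[symbol] = counts.get(symbol, 0) + int(count)
    return counts
```

A string with a missing count, such as "C 2 H", is rejected rather than defaulted to 1, in line with the requirement that counts be explicit.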
The formal charge is normally calculated from the formal charges of the atoms. If the formalCharge attribute is given it overrides this information completely. This allows (say) metal complexes to be represented when it is difficult to apportion the charges to atoms.
Supports compound identifiers such as IChI. At present uses the V0.9 IChI XML representation verbatim but will almost certainly change with future IChIs.
The inclusion of elements from other namespaces causes problems with validation. The content model is deliberately LAX but the actual elements in IChI will fail the validation as they are not declared in CML.
This is either an experimental measurement or used to build up internal coordinates (as in a z-matrix) (only one allowed)
We expect to move length as a child of molecule and remove it from here