Chemical Terms Language Reference

Version 5.10.1

 

Introduction

This document describes ChemAxon's Chemical Terms Language, a general language for formulating chemical expressions. It is currently used to define chemical rules for reaction processing, search filters, and both chemical calculations and chemical filtering in JChem Cartridge. The Evaluator command line tool and the Evaluator API are also available for general purpose expression evaluation.

The Chemical Terms Evaluator is designed to evaluate mathematical expressions on molecules using built-in chemical and general purpose functions. It is also possible to extend this built-in set of calculations by a user-defined configuration.

The heart of the evaluator mechanism is the JEP Java Expression Parser, equipped with chemical plugin calculations, chemical substructure search and some additional chemical and general purpose functions. User-defined functions can also be added to this function set.

Here are some simple examples showing how some well-known chemical rules can be formulated for a given input molecule read from a molecule context:

The following filters are used in drug discovery and drug development to narrow down the set of candidate molecules. They estimate the solubility and permeability of orally active compounds from their physical and chemical properties. The examined properties are given as chemical terms.
  1. Lipinski's rule of five states that the absorption or permeation of a molecule is more likely when the molecular weight does not exceed 500 g/mol, the value of logP does not exceed 5, and the molecule has at most 5 H-donor and 10 H-acceptor atoms. In Chemical Terms, this rule is defined as:
    (mass() <= 500) && 
    (logP() <= 5) && 
    (donorCount() <= 5) && 
    (acceptorCount() <= 10)
    
  2. Lead-likeness:
    (mass() <= 450) &&
    (logD("7.4") >= -4) && (logD("7.4") <= 4) &&
    (ringCount() <= 4) &&
    (rotatableBondCount() <= 10) &&
    (donorCount() <= 5) &&
    (acceptorCount() <= 8)
    
  3. Bioavailability:
    (mass() <= 500) +
    (logP() <= 5) +
    (donorCount() <= 5) +
    (acceptorCount() <= 10) +
    (rotatableBondCount() <= 10) +
    (PSA() <= 200) +
    (fusedAromaticRingCount() <= 5) >= 6
    
    Note that each of the 7 subconditions above evaluates to 1 if it is satisfied and to 0 otherwise, so their sum counts how many of them hold. Requiring the sum to be at least 6 means that at most one of the subconditions is allowed to fail.

  4. Ghose filter:
    (mass() >= 160) && (mass() <= 480) &&
    (atomCount() >= 20) && (atomCount() <= 70) &&
    (logP() >= -0.4) && (logP() <= 5.6) &&
    (refractivity() >= 40) && (refractivity() <= 130)
    
  5. Scaffold hopping:
    refmol = "actives.sdf";
    dissimilarity("ChemicalFingerprint", refmol) - 
    dissimilarity("PharmacophoreFingerprint", refmol) > 0.6
    
    Note that molecule constants can be defined by a molecule file path or a SMILES string. Multiple expressions are separated by ';' characters. Whitespace characters can be added freely for readability, since they are ignored by the evaluation process.
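
The difference between combining subconditions with && (Lipinski's rule) and summing them (the bioavailability rule) can be illustrated outside Chemical Terms. The following Python sketch is only a model of the logic: the property values in props are hypothetical, precomputed numbers standing in for the results of calls such as mass() or logP(), and no ChemAxon API is used.

```python
# Hypothetical, precomputed property values for one molecule. In Chemical
# Terms these would be returned by mass(), logP(), donorCount(), etc.
props = {
    "mass": 512.6,
    "logP": 3.1,
    "donorCount": 2,
    "acceptorCount": 7,
    "rotatableBondCount": 6,
    "PSA": 95.0,
    "fusedAromaticRingCount": 1,
}


def lipinski(p):
    # Conjunction: every subcondition must hold.
    return (p["mass"] <= 500 and p["logP"] <= 5
            and p["donorCount"] <= 5 and p["acceptorCount"] <= 10)


def bioavailability_score(p):
    # Each comparison is True or False; summing counts the satisfied ones.
    checks = [
        p["mass"] <= 500,
        p["logP"] <= 5,
        p["donorCount"] <= 5,
        p["acceptorCount"] <= 10,
        p["rotatableBondCount"] <= 10,
        p["PSA"] <= 200,
        p["fusedAromaticRingCount"] <= 5,
    ]
    return sum(checks)


def bioavailable(p):
    # At most one of the seven subconditions may fail.
    return bioavailability_score(p) >= 6
```

For these sample values only the mass condition fails, so the molecule is rejected by the conjunction but accepted by the counting rule.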

A set of working examples is also available.

 

Language Elements

The Chemical Terms Evaluator parses and evaluates expressions that are built from the following language elements:

A set of short reference tables provides a summary of the available functions / calculations and the use of matching conditions.

 

Expression Syntax

Expression strings consist of an arbitrary number of initial assignments followed by a last subexpression that provides the evaluation result. An assignment sets a variable to the evaluation result of a subexpression. This variable can later be used to reference this result. The assignment syntax is:

<identifier> = <subexpression>;

Note the ending ';' character. Examples for assignments:

x = 2;
y = x + 8;
z = f(x,y) + g(x,y);
where f and g are predefined functions.

An expression is an optional sequence of assignments followed by a subexpression providing the evaluation result:

<identifier1> = <subexpression1>;
<identifier2> = <subexpression2>;
...
<identifierN> = <subexpressionN>;
<result subexpression>
where N can also be zero, in which case the expression coincides with the result subexpression.

Here is an example with assignments:

a = f(2,3);
b = g(4,5);
x = a + b;
x*x

Here is the same without assignments:

(f(2,3) + g(4,5))*(f(2,3) + g(4,5))

Assignments increase efficiency when the same evaluation result is used more than once, since repeating a subexpression inline causes it to be evaluated multiple times. Assignments can also be used to increase readability. In most cases, however, the expression is simple and assignments are not needed. Note that whitespace characters (new-line, tab, space) are skipped when parsing the expression string, so they can be used freely to increase readability.
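
The efficiency argument can be made concrete with a toy model of the expression syntax. The Python sketch below is not how the JEP-based Evaluator is implemented; it is a minimal illustration that splits the string on ';', stores each assignment in an environment, and returns the value of the last subexpression. The functions f and g are hypothetical stand-ins that count their own calls, and splitting on ';' would break on string literals containing that character.

```python
import re

# "<identifier> = <subexpression>"; the lookahead keeps "==" out of the match.
ASSIGNMENT = re.compile(r"^([A-Za-z_]\w*)\s*=(?!=)(.*)$", re.DOTALL)


def evaluate(expression, functions):
    """Toy evaluator for '<id> = <subexpr>; ... <result subexpr>'."""
    env = dict(functions)
    parts = [p.strip() for p in expression.split(";") if p.strip()]
    *assignments, result = parts
    for statement in assignments:
        match = ASSIGNMENT.match(statement)
        if match is None:
            raise ValueError(f"not an assignment: {statement!r}")
        name, subexpression = match.groups()
        # eval() is used for illustration only, with builtins disabled.
        env[name] = eval(subexpression.strip(), {"__builtins__": {}}, env)
    return eval(result, {"__builtins__": {}}, env)


calls = {"f": 0, "g": 0}


def f(a, b):  # hypothetical predefined function
    calls["f"] += 1
    return a + b


def g(a, b):  # hypothetical predefined function
    calls["g"] += 1
    return a * b


functions = {"f": f, "g": g}

with_assignments = evaluate("a = f(2,3); b = g(4,5); x = a + b; x*x", functions)
calls_with_assignments = dict(calls)

calls.update(f=0, g=0)
inline = evaluate("(f(2,3) + g(4,5))*(f(2,3) + g(4,5))", functions)
calls_inline = dict(calls)
```

Both forms give the same result, but the inline form calls f and g twice each, while the form with assignments calls each of them once.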

The following examples demonstrate the expression syntax with very simple subexpressions. Examples with chemical meaning are shown later for matching conditions, chemical calculations and chemical and general purpose functions.

Examples:
  1. A simple expression:
    3+2
    
  2. Using assignments:
    x = 2;
    y = 3;
    x + y
    
  3. A more complicated one:
    x = 2;
    y = 3;
    z = 8*(x + y);
    t = 6*x*y;
    z + t
    
  4. When the same value is used more than once:
    x = (3 + 4)*8 + 16;
    y = 3*x;
    z = x + 20;
    5*(y + 8) + 4*z
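
As a check on the arithmetic, examples 3 and 4 above can be traced step by step. The sketch below simply mirrors the assignments as Python statements; the names result_3 and result_4 are introduced here for illustration and are not part of the expressions.

```python
# Example 3
x = 2
y = 3
z = 8 * (x + y)                  # 40
t = 6 * x * y                    # 36
result_3 = z + t                 # 76

# Example 4: x is computed once and reused in y and z
x = (3 + 4) * 8 + 16             # 72
y = 3 * x                        # 216
z = x + 20                       # 92
result_4 = 5 * (y + 8) + 4 * z   # 1488
```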
    
 

Predefined Functional Groups and Named Molecule Groups

It is sometimes easier to refer to molecules by name rather than by explicit SMARTS strings or molecule file paths. For example, you may want to write nitro or carboxyl as the query in a match function. Frequently used queries are pre-defined in the built-in functional groups file (chemaxon/marvin/templates/functionalgroups.cxsmi within MarvinBeans-templates.jar).

You can also define your favourite query SMARTS in the marvin/config/marvin/templates/functionalgroups.cxsmi file and in the $HOME\chemaxon\marvin\templates\functionalgroups.cxsmi (Windows) or $HOME/.chemaxon/marvin/templates/functionalgroups.cxsmi (UNIX / Linux) file, where marvin is the Marvin installation directory and $HOME is your user home directory.

However, there are some limitations when choosing molecule names. Molecule names should be composed of letters, digits and the '_' character. This means that molecule names cannot contain special characters such as '=' or '-', with the exception of '_'. Molecule name definitions in the functionalgroups.cxsmi file can contain whitespace characters (space, tab), but when names are referenced from a Chemical Terms expression the whitespace characters should be replaced with a single '_' character (e.g. secondary amine should be referred to as secondary_amine in Chemical Terms expressions).
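
The naming rule can be sketched as a small helper that converts a name as written in functionalgroups.cxsmi into the form used in expressions. This is an illustration, not ChemAxon code: the function name is hypothetical, and two points are assumptions, namely that each run of spaces or tabs collapses into one '_' and that "letters" means ASCII letters.

```python
import re

# Assumption: only ASCII letters, digits and '_' are accepted.
VALID_NAME = re.compile(r"[A-Za-z0-9_]+")


def to_chemical_terms_name(group_name):
    """Turn a functionalgroups.cxsmi name into an expression identifier."""
    # Assumption: each run of spaces/tabs becomes a single '_'.
    identifier = re.sub(r"[ \t]+", "_", group_name.strip())
    if not VALID_NAME.fullmatch(identifier):
        raise ValueError(f"invalid characters in molecule name: {group_name!r}")
    return identifier
```

With this helper, "secondary amine" maps to secondary_amine, while a name containing '-' or '=' is rejected.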

Note: as of Marvin 5.4, the mols.smarts configuration file is no longer used by Chemical Terms. It has been replaced by the functionalgroups.cxsmi file.

 

Initial Scripts

You can define molecule sets and other constants in the user-defined initial script $HOME/chemaxon/MARVIN_MAJOR_VERSION/jep.script (Windows) or $HOME/.chemaxon/MARVIN_MAJOR_VERSION/jep.script (UNIX / Linux), where $HOME is your user home directory and MARVIN_MAJOR_VERSION is the major version of Marvin (e.g. "5.1"). This script is run right after the molecule sets are read, and the constants defined in it can be used later in your chemical expressions. Any valid Chemical Terms assignment is allowed here, and the terminating ';' characters may be omitted as long as you write each assignment on a separate line. Typically, you will define a molecule set by

  1. listing its members:
    x = {acid_halide, alcohol, "[#6]CC[#8]"}
    y = {alkene, amide, imide, imine} 
    z = {alkene, amide, amine, alcohol, isocyanate}
    
  2. or deriving it from other sets with the help of set operators:
    all = x + y + z     (union of x, y, z)
    join = y * z        (join of y and z)
    C = (x + y) * z     (join of the union of x and y with z)
    D = z - alcohol     (all elements of z except alcohol)
    E = (x + y) - z     (union of x and y without the elements of z)
    
    where + means set-union, * means set-join and - means exclusion.
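
The results of the set operators above can be modelled with Python sets, using the member names as plain strings. This sketch assumes that set-join means intersection, which the text does not state explicitly; the variable all_groups stands in for all, which is a built-in name in Python.

```python
x = {"acid_halide", "alcohol", "[#6]CC[#8]"}
y = {"alkene", "amide", "imide", "imine"}
z = {"alkene", "amide", "amine", "alcohol", "isocyanate"}

all_groups = x | y | z   # all = x + y + z   (union)
join = y & z             # join = y * z      (assumed: intersection)
C = (x | y) & z          # C = (x + y) * z
D = z - {"alcohol"}      # D = z - alcohol   (remove a single element)
E = (x | y) - z          # E = (x + y) - z
```

Under this assumption, join contains alkene and amide, C additionally contains alcohol, and E is left with acid_halide, "[#6]CC[#8]", imide and imine.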

Predefined molecules and molecule sets are most useful in query definitions of the match function:

 

Input Contexts

When evaluating an expression, the Evaluator substitutes data reference symbols by the corresponding data items. All data items belong to exactly one of the following data groups:

  1. constants: data having the same value at each evaluation

  2. inputs: data possibly changing for each evaluation, such as

The type of the input data depends on the expression evaluation environment, which currently is one of the following:

  1. an expression string evaluated by the command line version of Evaluator refers to the current input molecule read from the input file(s) or the standard input

  2. an inner atomic expression refers to both the input molecule and the current atom - it is used when a Chemical Terms expression is evaluated on some or all atoms of the input molecule (e.g. atom filtering conditions, atomic evaluators and min-max evaluators)

  3. a reaction condition can refer to a reactant and a product array as well as to their atoms mapped according to the reaction equation

The evaluation environment provides a specific input context for accessing its input data. The input context consists of a set of accessor functions that can be used in the expression strings to access the input data. The following input contexts correspond to the evaluation environments described above:

  1. molecule context, used for single molecule input (e.g. command line Evaluator, JChem Cartridge):

  2. atom context, used for single atom input (e.g. inner atomic expressions):

  3. search context, used for filtering search hits (e.g. jcsearch and search queries):

  4. reaction context, used for reaction input initiated by the Reactor: Note: in reaction context, atoms can also be referred to by atom index, but in this case the molecule (reactant / product) parameter always has to be specified in the parameter list of the function (see this example).

Note that the default input molecule is the molecule returned by mol(), provided that this function exists in the context.

 

Configuration

The built-in configuration XML can be extended by user-defined functions and plugin calculations. The configuration syntax is described in the Evaluator Manual.

 

Examples

The examples below are divided into sections according to the input context applied, which corresponds to the different applications that can make use of ChemAxon's chemical expressions. These examples use the built-in configuration XML; the referenced functions and plugin calculations are listed in the short reference tables.

 

Evaluator and JChem Cartridge examples (molecule context)

 

Reactor examples (reaction context)

Note: Reactor is part of the JChem software package; it is not available in Marvin.

 

Search filter examples (search context)

 
Copyright © 1999-2012 ChemAxon Ltd.    All rights reserved.