[Rdkit-discuss] newbie help cleaning up stereochemistry in SMILES string
hari jayaram
2010-09-16 18:44:14 UTC
I am working with several ligands from a database stored in a SMILES
format. I am using the SMILES string to get three dimensional
coordinates (pdb format file) using a third-party program called
libcheck.

For some of these molecules the stereochemistry in the SMILES string
was entered incorrectly in the database, so that the SMILES input to
libcheck returns a mangled coordinate file with rings clashing with
each other. Inputting the SMILES string without the stereochemistry
makes libcheck behave correctly.

Is there a way to use RDKit to clean up the stereochemistry in the SMILES string?

Thanks for your help in advance

Hari
Greg Landrum
2010-09-17 03:36:51 UTC
Dear Hari,
To be certain I understand: you would like to remove the
stereochemistry from the SMILES string?

One way to do this is to read in the SMILES then generate a new SMILES
without stereochemistry information:

>>> from rdkit import Chem
>>> m = Chem.MolFromSmiles('Cl[C@H](F)Br')
>>> Chem.MolToSmiles(m)
'FC(Cl)Br'

A potential problem with this is that it changes the atom ordering.
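
For readers on a recent RDKit release, here is a minimal sketch of the same idea. It assumes a version in which MolToSmiles writes stereochemistry by default, so the stereo information has to be cleared explicitly before writing. Chem.RemoveStereochemistry is an existing RDKit call; the helper name strip_stereo_with_rdkit is made up for this illustration and is not part of the original post.

```python
from rdkit import Chem


def strip_stereo_with_rdkit(smiles):
    """Return a canonical SMILES with atom and bond stereo removed."""
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        raise ValueError(f"RDKit could not parse SMILES: {smiles!r}")
    # Clears chiral tags on atoms and cis/trans labels on double bonds, in place.
    Chem.RemoveStereochemistry(mol)
    return Chem.MolToSmiles(mol)


flat = strip_stereo_with_rdkit('Cl[C@H](F)Br')
print(flat)
```

The output is a canonical SMILES, so the atom order may still differ from the input string, as noted above.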

However, the simplest way to remove stereochemistry information from
SMILES doesn't use the RDKit at all: you just remove the "@" characters
from the string:

>>> smi = 'Cl[C@H](F)Br'
>>> smi.replace('@','')
'Cl[CH](F)Br'
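
The string approach can be wrapped in a small helper. This is a sketch, not a general SMILES parser: it assumes chirality is written only as "@" or "@@", and it additionally strips the "/" and "\" bond symbols used for double-bond geometry, which goes beyond what the one-liner above does. The name strip_stereo_markers is hypothetical.

```python
def strip_stereo_markers(smiles):
    """Remove stereo markers from a SMILES string by plain text replacement.

    Assumption: chirality appears only as '@' or '@@'. Less common chirality
    classes such as '@TH1' would leave stray letters behind and are not handled.
    """
    # '@' marks tetrahedral chirality; '/' and '\\' mark double-bond geometry.
    for marker in ('@', '/', '\\'):
        smiles = smiles.replace(marker, '')
    return smiles


print(strip_stereo_markers('Cl[C@H](F)Br'))
print(strip_stereo_markers('F/C=C/F'))
```

Unlike a round trip through a toolkit, this keeps the atom order of the input, at the cost of leaving bracket atoms such as [CH] in place.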

Hope this helps,
-greg
hari jayaram
2010-09-17 19:49:44 UTC
Thanks a tonne Greg and Paul,

I didn't realize that removing stereochemistry was as simple as
removing the "@" characters.

So now with the replace function in Python I can easily remove
stereochemistry information from the molecule.

smiles_corrected = smiles_broken.replace("@","")

Once I remove the stereochemistry, libcheck does the right thing and
gives me the right 3D coordinates.

Thanks for your help

Hari
Geoffrey Hutchison
2010-09-17 22:05:23 UTC
This doesn't make chemical sense, though. If libcheck is operating on a SMILES without stereochemistry, there's no way it can always give "the right 3D" coordinates. If you have "N" stereo centers, the chance of a correct 3D structure will be (0.5)^N.
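
To put numbers on that estimate, a small sketch. It assumes, as the (0.5)^N formula implies, that each unspecified stereo center is assigned independently and with equal probability for either configuration; the function name is hypothetical.

```python
def chance_all_centers_correct(n_centers):
    """Probability that n randomly assigned stereo centers all come out right."""
    if n_centers < 0:
        raise ValueError("n_centers must be non-negative")
    return 0.5 ** n_centers


for n in (1, 2, 3, 5):
    print(n, chance_all_centers_correct(n))
```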

I'd suggest using a different tool. For example, the upcoming Open Babel 2.3 will handle 3D coordinate generation while ensuring stereochemistry.

But you don't have to use OB -- I'm just saying that your 3D coordinates won't respect stereo with your approach.

-Geoff
Greg Landrum
2010-09-18 04:19:04 UTC
Geoff's point is a good one: if you remove the stereochemistry
information from the SMILES and then generate 3D coordinates, your
odds of getting a correct 3D structure are not good. I had assumed
that you had bad stereochemistry info in the SMILES that you wanted to
get rid of. If the stereochem is correct, then it might be a good idea
to try Geoff's idea and use OB 2.3 when it's released, or to use the
RDKit's 3D coordinate generation (which also respects stereochemistry),
write the files as SDF, and then use the current version of OB to
translate to a PDB if you need things in that format.
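
A minimal sketch of the RDKit route described above, assuming a reasonably recent RDKit release. The helper name smiles_to_3d, the example molecule, and the random seed are choices made for this illustration. Chem.MolToPDBBlock postdates this thread; with it the detour through Open Babel may no longer be needed, though the PDB atom and residue names it writes are generic.

```python
from rdkit import Chem
from rdkit.Chem import AllChem


def smiles_to_3d(smiles, seed=42):
    """Embed a molecule in 3D, keeping the stereochemistry given in the SMILES."""
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        raise ValueError(f"RDKit could not parse SMILES: {smiles!r}")
    mol_h = Chem.AddHs(mol)  # explicit hydrogens give better geometries
    status = AllChem.EmbedMolecule(mol_h, randomSeed=seed)
    if status != 0:
        raise RuntimeError("3D embedding failed")
    AllChem.UFFOptimizeMolecule(mol_h)  # quick force-field cleanup
    return mol_h


mol3d = smiles_to_3d('C[C@H](N)C(=O)O')
sdf_block = Chem.MolToMolBlock(mol3d)   # MDL mol block, as used in SDF files
pdb_block = Chem.MolToPDBBlock(mol3d)   # PDB text written directly by RDKit
```

Reading the mol block back and comparing SMILES is one way to check that the stereo centers survived the embedding.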

Best,
-greg
Paul Emsley
2010-09-18 14:03:38 UTC
The (additional) useful thing libcheck can do is generate geometry
restraints with estimated standard deviations (esds) for
crystallographic refinement: something like "spring constants", e.g.
0.02 Å for a C-C single bond, 3 degrees for C-C-C angles, etc.
(atom-type dependent, of course), plus planes and torsions. I wonder
how hard it would be to get similar/compatible/corresponding numbers
by digging into RDKit's UFF (presumably that would be the way to do
it). Any thoughts/advice?

Thanks,

Paul.
Greg Landrum
2010-09-19 03:44:45 UTC
Apologies for my ignorance about things crystallographic, but I'm not
quite sure what you mean. Are you talking about accessing the
geometric parameters themselves or adding special terms to the
forcefield? Either is pretty straightforward.

-greg
Paul Emsley
2010-09-19 13:16:17 UTC
Hi Greg,

Sorry for the hand-waving question. Thanks for your encouraging answer.

Information is exchanged between the chemical model/restraint
generation programs and modern programs that do refinement using X-ray
data in the form of macromolecular Crystallographic Information File
(mmCIF, STAR-formatted) files, e.g.

http://lmb.bioch.ox.ac.uk/emsley/ccp4/PHE.cif

I was hoping that the tools of RDKit could be used in the generation of
such files (starting from SMILES or a 2D mol2 description [1]). It
seems to me that many of the data items *can* be generated. The
question I had was (something like): how hard would it be to fill the
data for these columns:

_chem_comp_bond.value_dist_esd, _chem_comp_angle.value_angle_esd, _chem_comp_tor.value_angle_esd and _chem_comp_plane_atom.dist_esd?


(I am not sure that this needs an answer any more than you have already
given - I'll start digging).
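
As a possible starting point for that digging, a sketch of reading the UFF bond-stretch parameters through RDKit's Python API. It assumes a recent RDKit release that provides rdForceFieldHelpers.GetUFFBondStretchParams, which was not available when this thread was written. The helper name and the example molecule are hypothetical. Note that kb is a force constant, not an esd: how to turn it into a value suitable for _chem_comp_bond.value_dist_esd is left open here.

```python
from rdkit import Chem
from rdkit.Chem import rdForceFieldHelpers


def uff_bond_parameters(smiles):
    """List (atom1, atom2, r0, kb) for each bond that has UFF parameters."""
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        raise ValueError(f"RDKit could not parse SMILES: {smiles!r}")
    mol = Chem.AddHs(mol)
    rows = []
    for bond in mol.GetBonds():
        i, j = bond.GetBeginAtomIdx(), bond.GetEndAtomIdx()
        params = rdForceFieldHelpers.GetUFFBondStretchParams(mol, i, j)
        if params is None:
            continue  # no UFF parameters available for this atom pair
        kb, r0 = params  # force constant and ideal bond length (Angstrom)
        # Atom labels are element symbol plus index, for illustration only.
        name_i = f"{mol.GetAtomWithIdx(i).GetSymbol()}{i + 1}"
        name_j = f"{mol.GetAtomWithIdx(j).GetSymbol()}{j + 1}"
        rows.append((name_i, name_j, r0, kb))
    return rows


for name_i, name_j, r0, kb in uff_bond_parameters('CCO'):
    print(f"{name_i:4s} {name_j:4s} {r0:6.3f} {kb:8.2f}")
```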

Thanks,

Paul.

[1] people in our community currently use libcheck, PRODRG or CORINA to
do this (those are non-Free programs).
Greg Landrum
2010-09-19 14:54:39 UTC
Dear Paul,
Feel free to ask in case you encounter anything missing/unexpected/confusing.
:-)

-greg
