[Rdkit-discuss] Isomeric smiles and explicit hydrogens
Noel O'Boyle
2008-04-14 10:50:17 UTC
I've been trying to get my head around what's happening when I read
and write isomeric smiles. As a user, I hope that the same molecule
will also have the same isomeric SMILES. However, look at the
following examples using cinfony which read a SMILES string and write
an isomeric SMILES string...

I'm trying to specify the chirality of the carbon in
rdk.readstring("smi", "[C](Cl)Br").write("iso")
'ClCBr'
(No chirality, as expected)
'Cl[CH]Br'
'ClCBr'
'ClCBr'
'Cl[CH]Br'
(Expected chirality, but didn't get it)
'CC(Cl)Br'
(Expected chirality, but didn't get it)
'C[C@@H](Cl)Br'
(Expected chirality, and got it)

Is the problem with me or with RDKit?

On a related note, I have found that RDKit, when reading SDF files,
turns all of the hydrogens into implicit hydrogens. However, when
reading SMILES strings, it retains any explicit hydrogens specified in
C@@H expressions. This doesn't seem to be consistent and requires the
user to remove hydrogens if he/she wants to create a canonical smiles
string.

Apologies in advance if my understanding of SMILES is shaky.

Regards,
Noel
Noel O'Boyle
2008-04-14 14:47:49 UTC
I think I've been misunderstanding the square brackets. I need to
RTFM, I think, after which I'll post here again if still confused.

Noel
Noel O'Boyle
2008-04-14 14:51:36 UTC
And (egg on face) chlorobromomethane isn't chiral in the first
place...what was I thinking?
Greg Landrum
2008-04-14 16:25:37 UTC
Hi Noel,

You already figured out the problem with the chirality of
chlorobromomethane, but I want to clarify a couple of things below.
Post by Noel O'Boyle
I'm trying to specify the chirality of the carbon in
rdk.readstring("smi", "[C](Cl)Br").write("iso")
'ClCBr'
(No chirality, as expected)
Just to be clear on this one, the output here is not technically
correct; you've input a molecule with the formula CClBr (you told the
software that the C has no implicit Hs by putting it in square
brackets), the output however is for something with the formula
CH2ClBr. This is actually a bug; thanks for finding it. :-)
https://sourceforge.net/tracker/index.php?func=detail&aid=1942220&group_id=160139&atid=814650
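The bracket rule described here can be sketched without any toolkit. This is only a minimal illustration, assuming the simplest default valence for a few organic-subset elements; the names `DEFAULT_VALENCE` and `hydrogen_count` are hypothetical and are not part of the RDKit.

```python
# Lowest normal valences for a few organic-subset elements (illustrative
# subset only; the SMILES rules allow more than one valence for N, P and S).
DEFAULT_VALENCE = {"C": 4, "N": 3, "O": 2, "F": 1, "Cl": 1, "Br": 1, "I": 1}


def hydrogen_count(symbol, bond_order_sum, in_brackets=False, bracket_h=0):
    """Return the number of hydrogens a SMILES atom carries.

    Outside brackets, hydrogens are implied up to the default valence.
    Inside brackets, only the hydrogens written in the bracket count.
    """
    if in_brackets:
        return bracket_h
    return max(DEFAULT_VALENCE[symbol] - bond_order_sum, 0)


# Central carbon of ClCBr: two single bonds, so two implied hydrogens (CH2ClBr).
print(hydrogen_count("C", bond_order_sum=2))
# Central carbon of [C](Cl)Br: brackets with no H, so zero hydrogens (CClBr).
print(hydrogen_count("C", bond_order_sum=2, in_brackets=True))
# Central carbon of Cl[CH]Br: exactly one hydrogen, as written.
print(hydrogen_count("C", bond_order_sum=2, in_brackets=True, bracket_h=1))
```

Under these assumptions the three calls give 2, 0 and 1 hydrogens, which is why writing `[C](Cl)Br` back out as `ClCBr` changes the formula.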
Post by Noel O'Boyle
'Cl[CH]Br'
'ClCBr'
'ClCBr'
'Cl[CH]Br'
(Expected chirality, but didn't get it)
As you've realized: this molecule isn't chiral, so the RDKit is doing
the right thing by not marking chirality. It's doing something
arguable with the canonical smiles though, because it's showing the
explicit H (inside the square brackets). If you input exactly the same
molecule as ClCBr, you'd get a different canonical smiles. This is a
known oddity of the way things are currently handled internally and I
haven't quite figured out a solution yet. Basically, explicit Hs always
remain explicit, even if they don't need to be.
Post by Noel O'Boyle
'CC(Cl)Br'
(Expected chirality, but didn't get it)
Again, the molecule as provided isn't chiral because carbon 1 only has
three neighbors (you've told it that there are no implicit Hs).
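A rough sketch of why inputs like these cannot carry a chirality tag: a tetrahedral carbon needs four neighbors (counting hydrogens), and all four must differ. The helper below is hypothetical and compares simple text labels; a real toolkit ranks whole substituents rather than just the attached atom, so treat this as an illustration only.

```python
def can_be_tetrahedral_stereocenter(neighbors):
    """True if there are exactly four neighbors and no two are the same.

    `neighbors` is a list of labels for the substituents, hydrogens included.
    Labels stand in for a real substituent ranking, so this is only a sketch.
    """
    return len(neighbors) == 4 and len(set(neighbors)) == 4


# Chlorobromomethane, CH2ClBr: two identical hydrogens, so not chiral.
print(can_be_tetrahedral_stereocenter(["Cl", "Br", "H", "H"]))  # False
# A bracketed carbon with no hydrogens and only three neighbors.
print(can_be_tetrahedral_stereocenter(["CH3", "Cl", "Br"]))  # False
# C[C@@H](Cl)Br: methyl, H, Cl and Br are all different.
print(can_be_tetrahedral_stereocenter(["CH3", "H", "Cl", "Br"]))  # True
```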
Post by Noel O'Boyle
'C[C@@H](Cl)Br'
(Expected chirality, and got it)
It's even the right chirality, which is good to see. :-)
Post by Noel O'Boyle
Is the problem with me or with RDKit?
I'll answer that "or" question with a "yes", because it's a little of both. :-)
Post by Noel O'Boyle
On a related note, I have found that RDKit, when reading SDF files,
turns all of the hydrogens into implicit hydrogens.
Correct.
Post by Noel O'Boyle
However, when
reading SMILES strings, it retains any explicit hydrogens specified in
C@@H expressions. This doesn't seem to be consistent and requires the
user to remove hydrogens if he/she wants to create a canonical smiles
string.
I commented on this above. It's a known problem and I've been stewing
over how to solve it for a while. Now that someone other than me is
complaining I'll bump it up a bit in priority.

-greg
Noel O'Boyle
2008-04-14 19:12:49 UTC
If I found a bug earlier, it was completely by accident. The following
though I think is also a bug. I find that I can invert the
stereocenter by adding and removing Hs.
mol.write("iso")
mol.addh()
mol.write("iso")
mol.removeh()
mol.write("iso")
'C[***@H](O)(Cl)c1ccccc1'

Can you tell whether the problem is when I add the Hs, or when I
remove them? I might be able to work around it if the adding is working
okay.

Noel
Noel O'Boyle
2008-04-14 19:33:23 UTC
Wait a second, that molecule has five substituents on the isomeric C.
But I think we share the blame again this time, Greg, because I took
that structure from the RDKit Python tutorial Section 2.3. :-)

Noel
Greg Landrum
2008-04-14 19:54:20 UTC
Post by Noel O'Boyle
Wait a second, that molecule has five substituents on the isomeric C.
But I think we share the blame again this time, Greg, because I took
that structure from the RDKit Python tutorial Section 2.3. :-)
Indeed. That's a documentation bug. I'll fix it.

There's also something bad in general going on with the handling of
organic-subset atoms in square brackets that I'm going to have to
track down (I think the five coordinate neutral C should have caused
an error). Thanks for reporting it.
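The kind of check that would reject such an atom could look roughly like this. It is only a sketch under simple assumptions (neutral atoms, one allowed valence per element); `MAX_VALENCE` and `check_valence` are hypothetical names, not RDKit functions.

```python
# Maximum valence for a few neutral atoms (illustrative subset only).
MAX_VALENCE = {"C": 4, "N": 3, "O": 2, "Cl": 1}


def check_valence(symbol, bond_orders, h_count=0):
    """Raise ValueError if bonds plus hydrogens exceed the allowed valence."""
    total = sum(bond_orders) + h_count
    if total > MAX_VALENCE[symbol]:
        raise ValueError(
            f"{symbol} has valence {total}, maximum is {MAX_VALENCE[symbol]}"
        )
    return total


# The tutorial carbon: single bonds to CH3, O, Cl and a phenyl ring, plus one H.
try:
    check_valence("C", bond_orders=[1, 1, 1, 1], h_count=1)
except ValueError as err:
    print("rejected:", err)
```

Dropping either the H or one of the four heavy-atom bonds brings the total back to four, and the check passes.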

-greg
Greg Landrum
2008-04-14 20:04:02 UTC
Post by Noel O'Boyle
If I found a bug earlier, it was completely by accident. The following
though I think is also a bug. I find that I can invert the
stereocenter by adding and removing Hs.
mol.write("iso")
mol.addh()
mol.write("iso")
mol.removeh()
mol.write("iso")
Can you tell whether the problem is when I add the Hs, or when I
remove them? I might be able to work around it if the adding is working
okay.
As discussed in your later message, this molecule has a 5-coordinate
C, so it probably shouldn't have the @ in the output SMILES at all.
(Sarcasm doesn't work in email: that "probably" is a joke, it
definitely shouldn't be in there; that's another nice bug).

I'm prepared to believe that there could be a bug that causes
inversion of chirality when Hs are added and removed (I wouldn't be
overly surprised), but it definitely doesn't always happen, as this
case demonstrates:
[18]>>> m = Chem.MolFromSmiles('O[***@H](F)Cl')
[19]>>> Chem.MolToSmiles(m,1)
Out[19] 'O[***@H](F)Cl'
[20]>>> m2=Chem.AddHs(m)
[21]>>> Chem.MolToSmiles(m2)
Out[21] '[H]OC(F)(Cl)[H]'
[22]>>> Chem.MolToSmiles(m2,True)
Out[22] '[H]O[C@](F)(Cl)[H]'
[23]>>> m3 = Chem.RemoveHs(m2)
[24]>>> Chem.MolToSmiles(m3,True)
Out[24] 'O[***@H](F)Cl'

After playing around a bit with a model, I think this is also ok:
[25]>>> m = Chem.MolFromSmiles('C[C@@H](O)Cl')
[27]>>> Chem.MolToSmiles(m,True)
Out[27] 'C[C@@H](O)Cl'
[28]>>> m2 = Chem.AddHs(m)
[30]>>> Chem.MolToSmiles(m2,True)
Out[30] '[H]O[C@@](Cl)(C([H])([H])[H])[H]'
[31]>>> m3 = Chem.RemoveHs(m2)
[32]>>> Chem.MolToSmiles(m3,True)
Out[32] 'C[C@@H](O)Cl'
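One way to double-check the second session without a physical model is to compare neighbor orders. In SMILES, swapping any two neighbors of a stereocenter flips @ and @@, so two descriptions of the same center agree when the tags match and the orders differ by an even permutation (or the tags differ and the permutation is odd). The helper names below are hypothetical, and the neighbor lists were read off the SMILES strings by hand, with an implicit H placed directly after the atom that precedes the stereocenter.

```python
def is_even_permutation(order_a, order_b):
    """True if order_b is an even permutation of order_a."""
    position = {label: i for i, label in enumerate(order_a)}
    perm = [position[label] for label in order_b]
    # Count inversions; an even count means an even permutation.
    inversions = sum(
        1
        for i in range(len(perm))
        for j in range(i + 1, len(perm))
        if perm[i] > perm[j]
    )
    return inversions % 2 == 0


def same_chirality(order_a, tag_a, order_b, tag_b):
    """Compare two (neighbor order, '@' or '@@') descriptions of one center."""
    return is_even_permutation(order_a, order_b) == (tag_a == tag_b)


# C[C@@H](O)Cl: neighbors in order are CH3, H (implicit), O, Cl.
before = ["CH3", "H", "O", "Cl"]
# [H]O[C@@](Cl)(C([H])([H])[H])[H]: neighbors in order are O, Cl, CH3, H.
after = ["O", "Cl", "CH3", "H"]
print(same_chirality(before, "@@", after, "@@"))  # True
```

The two orders differ by an even permutation and both carry @@, so the sketch agrees that adding the Hs did not invert this center.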


-greg
