[Rdkit-discuss] Detecting R groups using RDKit and the MCS code (R group decomposition).

Discussion:

JP

2013-01-24 18:06:15 UTC

Hola RDkitters,

I have a number of analogue molecules (how lucky) - from which I can
extract a scaffold using Dalke's MCS code (great piece of work, btw).

I would like to identify each R group from each molecule. My current idea,
which I wanted to bounce off you, is this. For every molecule that I have:

0. Substructure search for the MCS scaffold; this will give me a set of
atom ids.
1. For every atom id above, find out whether it is connected to something
else (get its neighbours and check for indices which are not in the MCS
scaffold set).
2. If there is a connection to an R group, break that bond.
3. Somehow (how?) retrieve the fragment part and label it Rn (I need to
have distinct sets: R1, R2, R3, etc.).
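
The steps above could be sketched roughly as follows. This is only an
illustration under stated assumptions: it relies on a recent RDKit in which
Chem.FragmentOnBonds is available, the helper name manual_r_groups is made
up, the indole core and test molecule are chosen purely as an example, and
the labels are "core atom index + 1" rather than a consecutive R1, R2, R3
numbering.

```python
from rdkit import Chem


def manual_r_groups(mol, core):
    """Hypothetical helper following steps 0-3.

    Returns {label: [fragment SMILES, ...]} with label = core atom index + 1.
    """
    match = mol.GetSubstructMatch(core)          # step 0: scaffold atom ids
    core_atoms = set(match)
    bond_ids, labels = [], []
    for core_idx, atom_idx in enumerate(match):  # step 1: outside neighbours
        for nbr in mol.GetAtomWithIdx(atom_idx).GetNeighbors():
            if nbr.GetIdx() not in core_atoms:
                bond = mol.GetBondBetweenAtoms(atom_idx, nbr.GetIdx())
                bond_ids.append(bond.GetIdx())
                # same label on both sides of the cut, 1-based so none is 0
                labels.append((core_idx + 1, core_idx + 1))
    if not bond_ids:
        return {}
    # step 2: break the scaffold/R-group bonds, leaving labelled dummy atoms
    pieces = Chem.FragmentOnBonds(mol, bond_ids, dummyLabels=labels)
    groups = {}
    for frag in Chem.GetMolFrags(pieces):        # step 3: non-scaffold parts
        if core_atoms.intersection(frag):
            continue  # this fragment is the scaffold itself
        for idx in frag:
            atom = pieces.GetAtomWithIdx(idx)
            if atom.GetAtomicNum() == 0:  # dummy atom carries the label
                smi = Chem.MolFragmentToSmiles(pieces, atomsToUse=list(frag))
                groups.setdefault(atom.GetIsotope(), []).append(smi)
    return groups


core = Chem.MolFromSmiles('c1cccc2c1[nH]cc2')
mol = Chem.MolFromSmiles('c1c(O)c(C)cc2c1n(CC)cc2')
groups = manual_r_groups(mol, core)
print(sorted(groups))
```

Only the first substructure match is used here; a symmetric scaffold would
need a decision about which match to keep.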

Is there a better way to do this? Am I missing something?

Many Thanks,
-
Jean-Paul Ebejer
Early Stage Researcher

Greg Landrum

2013-01-25 04:36:06 UTC

Hi JP,

Post by JP
Hola RDkitters,
I have a number of analogue molecules (how lucky) - from which I can extract
a scaffold using Dalke's MCS code (great piece of work, btw).
I would like to identify each R group from each molecule. My current idea,
which I wanted to bounce off you, is this. For every molecule that I have:
0. Substructure search for the MCS scaffold; this will give me a set of atom
ids.
1. For every atom id above, find out whether it is connected to something
else (get its neighbours and check for indices which are not in the MCS
scaffold set).
2. If there is a connection to an R group, break that bond.
3. Somehow (how?) retrieve the fragment part and label it Rn (I need to have
distinct sets: R1, R2, R3, etc.).
Is there a better way to do this? Am I missing something?

That's pretty much what I would do. Fortunately, you don't have to
code it, because it's already there:[1]

In [1]: from rdkit import Chem

In [2]: core = Chem.MolFromSmiles('c1cccc2c1[nH]cc2')

In [3]: mol = Chem.MolFromSmiles('c1c(O)c(C)cc2c1n(CC)cc2')

In [4]: chains = Chem.ReplaceCore(mol,core,labelByIndex=True)

In [5]: pieces = Chem.GetMolFrags(chains,asMols=True)

In [6]: [Chem.MolToSmiles(x,True) for x in pieces]
Out[6]: ['[1*]O', '[2*]C', '[6*]CC']

There's a bit more text about this in the GettingStarted document:
http://www.rdkit.org/docs/GettingStartedInPython.html#substructure-based-transformations

-greg
[1] I love being able to give that answer. Thanks! :-)

JP

2013-01-25 11:47:53 UTC

Hi Greg,

Post by Greg Landrum
That's pretty much what I would do. Fortunately, you don't have to
code it, because it's already there:[1]

[1] Incidentally, I love being given that answer :)

Post by Greg Landrum
In [6]: [Chem.MolToSmiles(x,True) for x in pieces]
Out[6]: ['[1*]O', '[2*]C', '[6*]CC']

Out of pedantry, why do some dummy atoms *not* carry a numeric label (using
2012_12_1)? Every atom has a numeric id, so every attachment point should
carry a number, e.g.

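from rdkit import Chem
from rdkit.Chem import MCS

mols = [Chem.MolFromSmiles('CC(=O)CN(C)C'),
        Chem.MolFromSmiles('c1ccccc1C(=O)CN(c1ccccc1)C'),
        Chem.MolFromSmiles('COC(=O)CN')]
mcs = MCS.FindMCS(mols)  # run the MCS search once and reuse the result
if mcs.smarts:
    core = Chem.MolFromSmarts(mcs.smarts)
    for m in mols:
        # cut away the core; dummy atoms mark where each chain was attached
        chains = Chem.ReplaceCore(m, core, labelByIndex=True)
        print("chains " + Chem.MolToSmiles(chains, True))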

Gives:

chains [*]C.[2*]C.[2*]C
chains [*]c1ccccc1.[2*]C.[2*]c1ccccc1
chains [*]OC

Now, where is the number label on each first entry? Not a big deal of
course, but it wreaks havoc with my regex.
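
A regex that tolerates the missing number is one way around this. The
sketch below is plain Python, not part of RDKit; the helper name
attachment_labels is made up for illustration, and it assumes that an
unnumbered dummy atom can be read as label 0. It also accepts a bare `*`
without brackets, in case a SMILES writer emits the unlabelled dummy that
way.

```python
import re

# A dummy atom with an optional numeric label ([*], [2*], [12*]) or a bare *
DUMMY = re.compile(r'\[(\d*)\*\]|\*')


def attachment_labels(smiles):
    """Return one integer label per dummy atom; a missing number becomes 0."""
    return [int(num) if num else 0 for num in DUMMY.findall(smiles)]


print(attachment_labels('[*]C.[2*]C.[2*]C'))  # [0, 2, 2]
```

This does not understand other bracket forms such as atom-mapped dummies
(`[*:1]`), so it is only suitable for output shaped like the lines above.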

Also, should these lists be uniquified or not? Take a look at the first
example (e.g. [2*]C.[2*]C).

Thank-you,
JP

Greg Landrum

2013-01-25 13:46:55 UTC

Post by JP
Out of pedantry, why do some dummy atoms *not* carry a numeric label (using
2012_12_1)? Every atom has a numeric id, so every attachment point should
carry a number, e.g.
mols = [Chem.MolFromSmiles('CC(=O)CN(C)C'),
        Chem.MolFromSmiles('c1ccccc1C(=O)CN(c1ccccc1)C'),
        Chem.MolFromSmiles('COC(=O)CN')]
core = Chem.MolFromSmarts(MCS.FindMCS(mols).smarts)
for m in mols:
    chains = Chem.ReplaceCore(m, core, labelByIndex=True)
    print("chains " + Chem.MolToSmiles(chains, True))
chains [*]C.[2*]C.[2*]C
chains [*]c1ccccc1.[2*]C.[2*]c1ccccc1
chains [*]OC
Now, where is the number label on each first entry? Not a big deal of
course, but it wreaks havoc with my regex.

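It's the usual numbering-starts-at-zero thing: with labelByIndex=True each
dummy atom is labelled with the index of the core atom it was attached to,
and a label of 0 is written as a bare [*]. I will try to figure out whether
there's an easy workaround and get back to you later (most likely this
weekend).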
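
One possible stopgap, sketched here and not necessarily the fix that ends
up in RDKit, is to skip the regex and read the label from the dummy atom
itself. It assumes that ReplaceCore stores the label as the isotope of the
dummy atom, so an attachment at core atom 0 simply comes back as 0. The
helper name r_groups_by_core_atom is made up for illustration.

```python
from collections import defaultdict

from rdkit import Chem


def r_groups_by_core_atom(mol, core):
    """Hypothetical helper: map core atom index -> list of chain SMILES."""
    groups = defaultdict(list)
    if not mol.HasSubstructMatch(core):
        return dict(groups)
    chains = Chem.ReplaceCore(mol, core, labelByIndex=True)
    for frag in Chem.GetMolFrags(chains, asMols=True):
        for atom in frag.GetAtoms():
            if atom.GetAtomicNum() == 0:  # dummy atom = attachment point
                # the isotope holds the label; 0 is what prints as a bare [*]
                groups[atom.GetIsotope()].append(Chem.MolToSmiles(frag))
    return dict(groups)


core = Chem.MolFromSmiles('c1cccc2c1[nH]cc2')
mol = Chem.MolFromSmiles('c1c(O)c(C)cc2c1n(CC)cc2')
groups = r_groups_by_core_atom(mol, core)
print(sorted(groups))
```

Keeping a list per label means that a core atom with several substituents
keeps all of them.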

Post by JP
Also, should these lists be uniquified or not? Take a look at the first
example (e.g. [2*]C.[2*]C).

I guess that atom has two substituents.
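
If that is right, uniquifying would silently drop a substituent. A small
plain-Python sketch (the helper name substituent_counts is made up, and it
assumes every dot-separated component of the output is one substituent)
that counts the repeats instead of discarding them:

```python
from collections import Counter


def substituent_counts(chains_smiles):
    """Count how often each dot-separated fragment occurs."""
    return Counter(chains_smiles.split('.'))


counts = substituent_counts('[*]C.[2*]C.[2*]C')
print(counts['[2*]C'])  # 2
```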

-greg