Parsing Molecular Identifiers From the Ideal Database, part 3

In our last episode, we were left with 25 cases where the inferred molecule did not agree with the reported formula or molar mass, in our quest to turn the IDEaL database into a comprehensive f-element separation database. Here we fix them by hand and generate the final result…

Import the data

SetDirectory@NotebookDirectory[];
missing = Import["2023.10.16_missing_entries_needs_proofing.xlsx", {"Dataset", 1}, "HeaderLines" -> 1]
correct = Dataset[{}];
excluded = Dataset[{}]; 
  
(*revised from last post: build one output record from an old record and a corrected Molecule*)
outputData[oldRecord_, mol_Molecule] := With[
   {reportedMass = Interpreter[Number]@ oldRecord["reported_mass"], 
    computedMass = QuantityMagnitude[mol["MolarMass"]]}, 
   Dataset@Association[
     "url" -> oldRecord["url"], 
     "abbreviation" -> oldRecord["abbreviation"], 
     "SMILES" -> mol["CanonicalSMILES"], 
     "InChI" -> mol["InChI"]["ExternalID"], 
     "InChIKey" -> mol["InChIKey"]["ExternalID"], 
     "computed_formula" -> mol["MolecularFormulaString"], 
     "reported_formula" -> oldRecord["reported_formula"], 
     "computed_mass" -> computedMass, 
     "reported_mass" -> reportedMass, 
     "formula_matchQ" -> StringMatchQ[mol["MolecularFormulaString"], oldRecord["reported_formula"]], 
     "mass_matchQ" -> (Round[reportedMass] == Round[computedMass]) 
    ]]
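
One caveat about "formula_matchQ": StringMatchQ compares the two formula strings literally, so the same composition written in a different element order counts as a mismatch. Below is a minimal sketch of an order-insensitive comparison. The names formulaCounts and sameFormulaQ are hypothetical helpers that are not part of the original notebook, and the parser assumes simple formulas without parentheses, charges, or dots.

```mathematica
(* hypothetical helper: element symbol -> count, for simple formulas only *)
formulaCounts[formula_String] := KeySort@Merge[
    StringCases[formula,
     (el : (_?UpperCaseQ ~~ (_?LowerCaseQ | ""))) ~~
       (n : RepeatedNull[DigitCharacter]) :>
      (el -> If[n === "", 1, FromDigits[n]])],
    Total]

(* hypothetical helper: compare compositions rather than raw strings *)
sameFormulaQ[a_String, b_String] := formulaCounts[a] === formulaCounts[b]

sameFormulaQ["C2H6O", "OC2H6"]   (* same composition, different order *)
sameFormulaQ["C2H6O", "C2H6O2"]  (* different oxygen count *)
```

The first call should give True and the second False; KeySort is there because associations compare by key order as well as by content.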

0cv2w8fok5rbk

1

missing[[1]]

0tgp12j6w63y1

WebImage@missing[[1, "url"]]

1ff7t9f98whks

MoleculePlot@Molecule@missing[[1, "InChI"]]

1sft34z9qhp6t

Comment: Imported structure is consistent with the diagram and with the molecular formula. This implies an error in the IDEaL mass.

AppendTo[correct, missing[[1]]]

1y019lhbzm4b4

2

missing[[2]]
WebImage@%["url"]
MoleculePlot@Molecule@%%["InChI"]

0lrkt37w0crtt

16qyt5pplps1r

0fpn3tdlxmp4a

Comment: Mass and structure match, which implies an error in the IDEaL chemical formula.

AppendTo[correct, missing[[2]]]

0ga1dxvmv5kl5
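
Cases 1 and 2 are mirror images: in one the formula agrees and the mass does not, in the other it is the reverse. A minimal sketch of that triage logic follows, assuming each record carries the two Boolean flags that outputData produces ("formula_matchQ" and "mass_matchQ"). The function name diagnose and its messages are hypothetical, and it only suggests where to look first; the final call was still made by eye.

```mathematica
(* hypothetical triage helper based on the two match flags *)
diagnose[formulaOK_, massOK_] := Which[
  TrueQ[formulaOK] && TrueQ[massOK], "consistent",
  TrueQ[formulaOK], "formula agrees; suspect the reported mass",
  TrueQ[massOK], "mass agrees; suspect the reported formula",
  True, "neither agrees; inspect the structure"]

diagnose[True, False]   (* the situation in case 1 *)
diagnose[False, True]   (* the situation in case 2 *)
```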

3

missing[[3]]
WebImage@%["url"]
MoleculePlot@Molecule@%%["InChI"]

15eya67r8moi6

0ak19d6awtsux

0adbquoh86sjs

Comment: It seems strange (and unlikely) to me that one of the sidechains is different (missing a carbon compared to the others; all the others are 2-ethylhexyl, as implied by the name). I suspect that the formula is correct, but the structure drawn is incorrect.

CopyToClipboard@missing[[3]]["InChI"]

AppendTo[
  correct, 
  outputData[missing[[3]], Molecule["CCCCC(CN(CC(N(CC(CCCC)CC)CC(CCCC)CC)=O)CC(N(CC(CCCC)CC)CC(CCCC)CC)=O)CC"]]]

0m3a0egt2poo8

4

missing[[4]]
WebImage@%["url"]
MoleculePlot@Molecule@%%["InChI"]

17wlqncagc14l

1l0vc0atk6j83

1lagtq21bjx0i

04exfz5i02jjp

(*MoleculePlot[Molecule["Missing[\"NotAvailable\"][\"ExternalID\"]"]]*)

Comment: Looks like the structure is fine (matches molecular weight, etc.) and the only problem is that there are some extra asterisks. This is actually a six-membered ring(!), so we have to open it in ChemDraw and edit it manually (copying back the result).

CopyToClipboard@missing[[4]]["SMILES"]

Editing fixes the agreement…

outputData[missing[[4]], Molecule["COc1cc(Cc2c(OCC(N(CC)CC)=O)c(Cc3c(OCC(N(CC)CC)=O)c(Cc4c(OCC(N(CC)CC)=O)c(Cc5c(OCC(N(CC)CC)=O)c(C6)cc(OC)c5)cc(OC)c4)cc(OC)c3)cc(OC)c2)c(OCC(N(CC)CC)=O)c(Cc7c(OCC(N(CC)CC)=O)c6cc(OC)c7)c1"]]
AppendTo[correct, %]

1lp6xfa75vdtf

0dqrvz324ylz4

Comment: It appears we have fixed it!
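
Since the symptom here is stray asterisks (wildcard atoms) in the SMILES, the affected entries can be flagged up front. A minimal sketch, assuming each record has a "SMILES" field: hasWildcardQ and the two example records are hypothetical and are not taken from the database.

```mathematica
(* hypothetical helper: does the SMILES string contain a wildcard atom "*"? *)
hasWildcardQ[smiles_String] := MemberQ[Characters[smiles], "*"]

(* made-up records, for illustration only *)
records = {
   <|"abbreviation" -> "A", "SMILES" -> "CCO"|>,
   <|"abbreviation" -> "B", "SMILES" -> "*c1ccccc1*"|>};

Select[records, hasWildcardQ[#["SMILES"]] &]
```

Only record "B" should survive the Select. The test goes through Characters rather than a string pattern so that "*" is always treated as a literal character.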

5

missing[[5]]

10dsguynejj9j

Looks like another ring… copy to clipboard and repair

CopyToClipboard@missing[[5]]["SMILES"]

outputData[missing[[5]], Molecule["CCCCCCCCOc1cc(Cc2c(OCC(N(CC)CC)=O)c(Cc3c(OCC(N(CC)CC)=O)c(Cc4c(OCC(N(CC)CC)=O)c(Cc5c(OCC(N(CC)CC)=O)c(Cc6c(OCC(N(CC)CC)=O)c(Cc7c(OCC(N(CC)CC)=O)c(Cc8c(OCC(N(CC)CC)=O)c(C9)cc(OCCCCCCCC)c8)cc(OCCCCCCCC)c7)cc(OCCCCCCCC)c6)cc(OCCCCCCCC)c5)cc(OCCCCCCCC)c4)cc(OCCCCCCCC)c3)cc(OCCCCCCCC)c2)c(OCC(N(CC)CC)=O)c9c1"]]
AppendTo[correct, %]

0egtc05hj45p4

1pq0yxz6qge4b

6

missing[[6]]

0vsvrtopbzwze

Comment: This appears to be a pattern with all of these CA* extractants… OK

CopyToClipboard@missing[[6, "SMILES"]]

outputData[missing[[6]], Molecule["CC(C)(C)c1cc(Cc2c(OCC(N(CC)CC)=O)c(Cc3c(OCC(N(CC)CC)=O)c(Cc4c(OCC(N(CC)CC)=O)c(Cc5c(OCC(N(CC)CC)=O)c(Cc6c(OCC(N(CC)CC)=O)c(C7)cc(C(C)(C)C)c6)cc(C(C)(C)C)c5)cc(C(C)(C)C)c4)cc(C(C)(C)C)c3)cc(C(C)(C)C)c2)c(OCC(N(CC)CC)=O)c7c1"]]
AppendTo[correct, %];

197mtc1xjow1v

7

missing[[7]]

0nchmzd27g2z8

WebImage[missing[[7, "url"]]]

10t4ouh56mxg7

Comment: This is not a well-defined compound (it appears to be some mix?) and there are no D-values, so we are going to leave this one out….

AppendTo[excluded, missing[[7]]]

05qol5h4c09ie

8

missing[[8]]

0zioincu1p8ki

WebImage@missing[[8, "url"]]

13iu6zipk8sjk

Comment: Looks like the issue here is that the position of the methyl group on the right is not explicit. The cited paper clarifies it to be the 4,4’(5’) compound (purchased commercially), so I will edit it appropriately, starting from the downloaded CDX file.

outputData[missing[[8]], Molecule["CC1CC2OCCOCCOC3C(CC(C)CC3)OCCOCCOC2CC1"]]
AppendTo[correct, %];

0s806tbnbur4r

9

missing[[9]]

0esbgocanfabg

Same thing as #8, but the tert-butyl compound (same source paper)

outputData[missing[[9]], Molecule["CC(C)(C)C1CC2OCCOCCOC3C(CC(C(C)(C)C)CC3)OCCOCCOC2CC1"]]
AppendTo[correct, %];

17awluakk2vnf

10

missing[[10]]

0cu0jtkcot6vn

WebImage[missing[[10, "url"]]]

0o05pwwat0wya

This appears to be a clear case where the formula is wrong: judging by the name, there is no oxygen in this molecule. The mass is correct.

AppendTo[correct, missing[[10]]];

11

missing[[11]]

1e6w8lk20tmjt

WebImage@missing[[11, "url"]]

14sm6q8p4iioi

Another ring…best fixed manually in ChemDraw

CopyToClipboard@missing[[11, "SMILES"]]

outputData[missing[[11]], Molecule["O=C(CP(c1ccccc1)(c2ccccc2)=O)Nc3cc(Cc4c(OCCCCCCCCCCCCCC)c(Cc5c(OCCCCCCCCCCCCCC)c(Cc6c(OCCCCCCCCCCCCCC)c(C7)cc(NC(C)=O)c6)cc(NC(CP(c8ccccc8)(c9ccccc9)=O)=O)c5)cc(NC(CP(c%10ccccc%10)(c%11ccccc%11)=O)=O)c4)c(OCCCCCCCCCCCCCC)c7c3"]]
AppendTo[correct, %];

13tgjv9ojss17

12

missing[[12]]

08nklkulam90y

Yet another ring…we know what to do…

CopyToClipboard@missing[[12, "SMILES"]]

outputData[missing[[12]], Molecule["COc1cc(Cc2c(OCC(N(CC)CC)=O)c(Cc3c(OCC(N(CC)CC)=O)c(Cc4c(OCC(N(CC)CC)=O)c(Cc5c(OCC(N(CC)CC)=O)c(Cc6c(OCC(N(CC)CC)=O)c(Cc7c(OCC(N(CC)CC)=O)c(C8)cc(OC)c7)cc(OC)c6)cc(OC)c5)cc(OC)c4)cc(OC)c3)cc(OC)c2)c(OCC(N(CC)CC)=O)c(Cc9c(OCC(N(CC)CC)=O)c8cc(OC)c9)c1"]]
AppendTo[correct, %];

0wowch6wqfkti

13

missing[[13]]

1gh6zt5b0ahpl

Same story here…

CopyToClipboard@missing[[13, "SMILES"]]

outputData[missing[[13]], Molecule["CCCCCOc1cc(Cc2c(OCC(N(CC)CC)=O)c(Cc3c(OCC(N(CC)CC)=O)c(Cc4c(OCC(N(CC)CC)=O)c(Cc5c(OCC(N(CC)CC)=O)c(Cc6c(OCC(N(CC)CC)=O)c(Cc7c(OCC(N(CC)CC)=O)c(C8)cc(OCCCCC)c7)cc(OCCCCC)c6)cc(OCCCCC)c5)cc(OCCCCC)c4)cc(OCCCCC)c3)cc(OCCCCC)c2)c(OCC(N(CC)CC)=O)c(Cc9c(OCC(N(CC)CC)=O)c8cc(OCCCCC)c9)c1"]]
AppendTo[correct, %];

1kzx2hpr9mogb

14

missing[[14]]

1fz1z6yzhhx7q

Seems to be a common story for all the NEA* ligands…

CopyToClipboard@missing[[14, "SMILES"]]

outputData[missing[[14]], Molecule["CC(C)(C)c1cc(Cc2c(OCC(N(CC)CC)=O)c(Cc3c(OCC(N(CC)CC)=O)c(Cc4c(OCC(N(CC)CC)=O)c(Cc5c(OCC(N(CC)CC)=O)c(Cc6c(OCC(N(CC)CC)=O)c(Cc7c(OCC(N(CC)CC)=O)c(C8)cc(C(C)(C)C)c7)cc(C(C)(C)C)c6)cc(C(C)(C)C)c5)cc(C(C)(C)C)c4)cc(C(C)(C)C)c3)cc(C(C)(C)C)c2)c(OCC(N(CC)CC)=O)c(Cc9c(OCC(N(CC)CC)=O)c8cc(C(C)(C)C)c9)c1"]]
AppendTo[correct, %];

0rb9xaslbcx2i

15

missing[[15]]

0qxmcbnn7u0jp

CopyToClipboard@missing[[15, "SMILES"]]

outputData[missing[[15]], Molecule["O=C(N(CC)CC)COc1c2cccc1Cc3c(OCC(N(CC)CC)=O)c(Cc4c(OCC(N(CC)CC)=O)c(Cc5c(OCC(N(CC)CC)=O)c(Cc6c(OCC(N(CC)CC)=O)c(Cc7c(OCC(N(CC)CC)=O)c(Cc8c(OCC(N(CC)CC)=O)c(Cc9c(OCC(N(CC)CC)=O)c(C2)ccc9)ccc8)ccc7)ccc6)ccc5)ccc4)ccc3"]]
AppendTo[correct, %];

0opt78ezuovqz

16

missing[[16]]

0549jnovkasy3

CopyToClipboard@missing[[16, "SMILES"]]

outputData[missing[[16]], Molecule["O=C(N(CC)CC)COc1c2cc(OCc3ccccc3)cc1Cc4c(OCC(N(CC)CC)=O)c(Cc5c(OCC(N(CC)CC)=O)c(Cc6c(OCC(N(CC)CC)=O)c(Cc7c(OCC(N(CC)CC)=O)c(Cc8c(OCC(N(CC)CC)=O)c(C2)cc(OCc9ccccc9)c8)cc(OCc%10ccccc%10)c7)cc(OCc%11ccccc%11)c6)cc(OCc%12ccccc%12)c5)cc(OCc%13ccccc%13)c4"]]
AppendTo[correct, %];

1ekp711cdqkfj

17

missing[[17]]

1ebs00qlomnse

CopyToClipboard@missing[[17, "SMILES"]]

outputData[missing[[17]], Molecule["CCCCCCCCOc1cc(Cc2c(OCC(N(CC)CC)=O)c(Cc3c(OCC(N(CC)CC)=O)c(Cc4c(OCC(N(CC)CC)=O)c(Cc5c(OCC(N(CC)CC)=O)c(C6)cc(OCCCCCCCC)c5)cc(OCCCCCCCC)c4)cc(OCCCCCCCC)c3)cc(OCCCCCCCC)c2)c(OCC(N(CC)CC)=O)c(Cc7c(OCC(N(CC)CC)=O)c6cc(OCCCCCCCC)c7)c1"]]
AppendTo[correct, %];

0fvhh7yrbbxzk

18

missing[[18]]

12q5ex39xye7z

CopyToClipboard@missing[[18, "SMILES"]]

outputData[missing[[18]], Molecule["O=C(N(CC)CC)COc1c2cccc1Cc3c(OCC(N(CC)CC)=O)c(Cc4c(OCC(N(CC)CC)=O)c(Cc5c(OCC(N(CC)CC)=O)c(Cc6c(OCC(N(CC)CC)=O)c(Cc7c(OCC(N(CC)CC)=O)c(C2)ccc7)ccc6)ccc5)ccc4)ccc3"]]
AppendTo[correct, %];

1n3lqom6zcwhu

19

missing[[19]]

0r84pjp85xwuf

CopyToClipboard@missing[[19, "SMILES"]]

outputData[missing[[19]], Molecule["O=C(N(CC)CC)COc1c2cc(OCc3ccccc3)cc1Cc4c(OCC(N(CC)CC)=O)c(Cc5c(OCC(N(CC)CC)=O)c(Cc6c(OCC(N(CC)CC)=O)c(Cc7c(OCC(N(CC)CC)=O)c(Cc8c(OCC(N(CC)CC)=O)c(Cc9c(OCC(N(CC)CC)=O)c(Cc%10c(OCC(N(CC)CC)=O)c(C2)cc(OCc%11ccccc%11)c%10)cc(OCc%12ccccc%12)c9)cc(OCc%13ccccc%13)c8)cc(OCc%14ccccc%14)c7)cc(OCc%15ccccc%15)c6)cc(OCc%16ccccc%16)c5)cc(OCc%17ccccc%17)c4"]]
AppendTo[correct, %];

0cdl4d7kbuah4

20

missing[[20]]

0120y491zzlu6

URL[missing[[20, "url"]]]

18038cos67115

Comment: Computed and reported formulas are the same and seem consistent with the name. The mass seems low by 3 AMU (?)–maybe just a typo. In any case, it does not matter much, as there are no distribution coefficients reported.

AppendTo[correct, missing[[20]]];
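
The "mass_matchQ" test in outputData compares rounded values, which can disagree for two masses that are nearly identical but straddle a half-integer. A minimal sketch of a tolerance-based alternative follows. The name massReport and the default tolerance of 0.5 are assumptions, and the numbers in the example are made up rather than taken from any record.

```mathematica
(* hypothetical helper: quantify a mass discrepancy instead of rounding *)
massReport[reported_?NumericQ, computed_?NumericQ, tol_ : 0.5] := <|
  "difference" -> reported - computed,
  "relative" -> (reported - computed)/computed,
  "withinToleranceQ" -> Abs[reported - computed] <= tol|>

Round[100.49] == Round[100.51]                  (* False: 100 vs 101 *)
massReport[100.49, 100.51]["withinToleranceQ"]  (* True: differs by 0.02 *)
massReport[397., 400.]["withinToleranceQ"]      (* False: a 3 AMU gap *)
```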

21

missing[[21]]

1fogj62juug95

WebImage[missing[[21, "url"]]]
URL@missing[[21, "url"]]

0o6zd5jy91ukp

11mzjoxptqljd

This is an interesting case, as there are a few problems: (i) the listed formula appears to be wrong (too many hydrogens); (ii) the LLM-scraped formula is wrong (apparently OpenAI did not like the dot, or at least treated it as the end of the formula…a reasonable mistake); (iii) the InChI representation correctly captures the idea that this is a salt, whereas the SMILES does not (?)

Map[MoleculePlot@Molecule[#] &]@{ missing[[21, "InChI"]], missing[[21, "SMILES"]]} // GraphicsRow

05undtzrakd0w

We can fix the SMILES by round-tripping it from the initial InChI

MoleculePlot@Molecule@#["CanonicalSMILES"] &@Molecule@missing[[21, "InChI"]]

1kw8i71zjqbx9

outputData[missing[[21]], Molecule@missing[[21, "InChI"]]]
AppendTo[correct, %];

0xqxm4kgflmk8
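
A quick way to check whether a SMILES string represents a multi-component species such as a salt is to count its dot-separated fragments. A minimal sketch: componentCount and saltLikeQ are hypothetical helpers, and the example strings are generic illustrations, not the entry above.

```mathematica
(* hypothetical helpers: "." separates disconnected components in SMILES *)
componentCount[smiles_String] := Length@StringSplit[smiles, "."]
saltLikeQ[smiles_String] := componentCount[smiles] > 1

componentCount["[Na+].[Cl-]"]  (* 2 *)
saltLikeQ["CCO"]               (* False *)
```

Running saltLikeQ on the SMILES before and after the round trip would show whether the counter-ion was recovered.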

22

missing[[22]]
URL[%["url"]]
WebImage[%]

0p7ddiibri4oo

0b4v4f2wbghtj

03nuprhdyzmj3

Comment: Hmmm….no tests performed. Mol weight seems fine, which suggests the formula is goofy. But in the end it will not matter much

AppendTo[correct, missing[[22]]];

23

missing[[23]]
URL[%["url"]]
WebImage[%]

1mld8ti6akju8

0vvi8a3yb5su2

0b0upqq3169ii

Not much to go on here. The structure backbone indeed looks like an ornithine (note that this is likely a racemate given the name) and the other parts seem to match. There should be 5 nitrogens in total (3 from the aza-phenanthroline plus 2 from the ornithine), so this looks like a bad formula and bad mass on IDEaL. Not that it matters much, because there is no data reported…

Molecule["ornithine"]
Molecule["3-aza-5,6-dihydro-1,10-phenanthroline"]

0bo19d0hsojkz

062nxfpmydgzf

AppendTo[correct, missing[[23]]];

24

missing[[24]]
URL[%["url"]]
WebImage[%]

00c4dbtlf99me

0pprsfrwml2ej

1q90fm39471py

There is not much to go on here; apparently we rejected it initially because the name is ambiguous (the placement of the sulfo groups on the phenyl rings is not specified):

mol = Molecule["6,6'-bis(5,6-di(sulfophenyl)-1,2,4-triazin-3-yl)-2,2'-bipyridine"]
MoleculePlot[%]
MoleculeValue[%%, {"MolecularFormulaString", "MolarMass"}]

06ntsvu3li71p

02w6mlig0war6

1vnboskzaahcu

0ul0hlfz3a7mo

This is totally not the same molecular formula and mass given on the IDEaL website. However, this structure looks consistent with the stated name; I do not see how you would get the bis(di(sulfophenyl)) compound with fewer atoms. And there is no other data, so my decision is to run with this.

outputData[missing[[24]], Molecule[mol]]
AppendTo[correct, %];

0jk9amhkq90se

25

missing[[25]]
URL[%["url"]]
WebImage[%]

17mqyp3vjbrhe

018hnwiocqq5n

07oc2ucziq7mv

Once again, not much to go on here. I found a paper which uses this for a separation with ratios of HDBP:Zr of 9, but that makes the mass too high; the closest I can get to the stated mass is 7:1 and even then it is not a perfect match:

Molecule["dibutyl phosphoric acid"]
%["MolarMass"]*7 + ElementData["Zr", "MolarMass"]

1partbw9jxbdf

0by3fp0mqjdlr
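
The search for the ligand:metal ratio closest to a stated mass can be written as a small helper. This is a minimal sketch under stated assumptions: closestRatio is a hypothetical function, the molar masses are approximate literature values (about 210.21 g/mol for HDBP, C8H19O4P, and 91.224 g/mol for Zr), and the target of 1500 is a made-up placeholder, not the mass reported on IDEaL.

```mathematica
(* hypothetical helper: integer n minimizing |n*ligand + metal - target| *)
closestRatio[target_?NumericQ, ligandMass_?NumericQ,
   metalMass_?NumericQ, maxN_Integer : 12] :=
  First@MinimalBy[Range[maxN],
    Abs[# ligandMass + metalMass - target] &]

hdbp = 210.21; (* approximate molar mass of dibutyl phosphoric acid *)
zr = 91.224;   (* approximate molar mass of zirconium *)

Table[{n, n hdbp + zr}, {n, 6, 9}]  (* candidate masses for n:1 complexes *)
closestRatio[1500., hdbp, zr]       (* 7 for this placeholder target *)
```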

Modest proposal: It does not matter much, because there are no references. So I will just put in a 1:1 stoichiometry and call it a day.

mol = Molecule@StringJoin[
    "[Zr].", 
    Molecule["dibutyl phosphoric acid"]["CanonicalSMILES"]]
MoleculePlot[%]

0g9mh64bqc86e

1223wyrsd7e5y

outputData[missing[[25]], mol]
AppendTo[correct, %];

1sfwo94u3tr5i

Conclusion

Did we get them all?

(Length[excluded] + Length[correct] ) == Length[missing]

(*True*)

Remember, we excluded one case:

excluded[[1]]

1x9pyrcvcdqts

Merge these new results with the correct values found in the last episode; the resulting file can be downloaded here.

With[
   {previous =  Import["2023.10.16_all_correct_entries.xlsx", {"Dataset", 1}, "HeaderLines" -> 1]}, 
   Join[previous, correct]];
Export["2023.10.17_all_correct_entries.xlsx", %];
Length[%%] (*count how many we have*)


(*438*)
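
Before trusting the merged count, it may be worth checking that no molecule appears twice after the Join. A minimal sketch, assuming every row has an "InChIKey" field: the rows below are made up for illustration, and a Dataset would first need to be converted with Normal.

```mathematica
(* made-up rows standing in for Normal[mergedDataset] *)
rows = {
   <|"abbreviation" -> "X1", "InChIKey" -> "KEY-A"|>,
   <|"abbreviation" -> "X2", "InChIKey" -> "KEY-B"|>,
   <|"abbreviation" -> "X3", "InChIKey" -> "KEY-A"|>};

(* InChIKeys that occur more than once *)
duplicateKeys = Keys@Select[Counts[Lookup[rows, "InChIKey"]], # > 1 &]
```

Here duplicateKeys should contain only "KEY-A"; an empty list on the real data would mean every entry is unique by InChIKey.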

ToJekyll["Parsing Molecular Identifiers From the Ideal Database, part 3", "mathematica chemdraw science"]