Suppose you (not-so-hypothetically) want to screen-scrape the IDEaL database in order to build your own f-element separation database. While it is straightforward to task a student with scraping the data tables, extracting machine-readable chemical structures is more challenging, as they do not provide standard chemical identifiers such as SMILES, InChI, etc. However, they do provide Chemdraw files and IUPAC-ish names which we can try to parse. This also provides an opportunity to do some LLM-based screen scraping. (Spoiler alert: We uncover approximately 105/438 entries with unresolvable structures, including 6 cases where the stated molecular formula is inconsistent with the stated molecular mass.) Mathematica to the rescue…

Get a list of all the links to each compound

Start by getting a list of all the ligand URLs:

homepage = "https://www.oecd-nea.org/ideal/";
links = URLBuild /@ URLParse /@ Select[StringContainsQ["/extractants/"]]@Import[homepage, "Hyperlinks"];
Length[%]

(*439*)

Why round-trip the retrieved Hyperlinks through URLBuild and URLParse? Hyperlink importing returns an unescaped string (i.e., replaces %20 with a blankspace). But we need these to actually retrieve the files! If you do not do this, then the following hardExample will throw an error when we try to retrieve files:

hardExample = links[[50]]

(*"https://www.oecd-nea.org/ideal/extractants/AFR%20Nitro"*)

Comment: I did not know this was a problem when I started, and only appreciated it when I started applying the functions to the bulk job.

Retrieve data for each compound

In this section we will run through the various extraction functions. To test them we will try the following demonstration example:

example = "https://www.oecd-nea.org/ideal/extractants/CB4";

Abbreviation

The compound abbreviation names are the last part of the URL path. We use URLParse to remove the %-escaping characters (which we added in the round-tripping function above):

moleculeAbbreviation[url_String] := Last@Lookup["Path"]@URLParse[url] 
  
 (*demo*)
moleculeAbbreviation[example]
moleculeAbbreviation[hardExample]

(*"CB4"*)

(*"AFR Nitro"*)

Retrieve and parse the Chemdraw file

Yeah, Mathematica can Import Chemdraw .cdx files. Again, the extracted Hyperlinks get unescaped, so we roundtrip the retrieved URL through URLBuild/URLParse to make it readable:

(*given an extractant URL return the Molecule found in the linked Chemdraw file*)
  moleculeFromChemdraw[url_] := With[
     {cdxFileURL = SelectFirst[StringContainsQ[".cdx"]]@Import[url, "Hyperlinks"]},
     Import@URLBuild@URLParse[cdxFileURL] 
    ] 
   
  (*demo*) 
   exampleMol = moleculeFromChemdraw[example] 
   moleculeFromChemdraw[hardExample]

06alvp81k0gcz

0ngzbxwc5444y

Extract other information using GPT-3.5

Each HTML file contains some preliminary information (untagged) about the molecule name, molecular formula and molecular mass:

text = Import[example, "Plaintext"]
StringLength[text]

(*"IDEaL        Browse Search  Back-office        2,2'-thiobis(N,N-diisobutylacetamide)  SVG   PNG   ChemDraw      General infos   Tests   References      Abbreviation(s)  CB4    Chemical group  thiodiglycolamide   Formula  C 20 H 40 N 2 O 2 S  not CHON     Molecular mass  372.612 g/mol   Elements separated  Eu, Pu, Am, Cm     Tests  Test carried out Test type Conditions Ref.Extraction [1]      References  CEA"*)

(*460*)

We could try to write a parser for this, but it seems tedious, so instead I decided to use an LLMFunction to extract the information; we will trim the plaintext down to a few thousand characters to make sure it fits into the context window. In practice, the information we care about is near the top, in the first 500-1000 characters. (I had originally experimented with three separate functions, but ultimately combined them so that I can make a single API call and thus save a bit of time and a few pesos. However, as we shall see below, it is also useful to break these out to deal with certain debugging tasks.) As of this writing, LLMFunction defaults is to use OpenAI GPT-3.5 with temperature = 0. The progress reports are annoying when you apply this to a batch of results, so we turn it off with an undocumented ProgressReporting option:

llmExtractor = LLMFunction["What is the compound name, molecular formula, and molecular mass in the following text?  Return only the name, formula and number with no units or other information.  Return the results as a JSON dictionary with keys compound_name, molecular_formula, and molecular_mass.\n``", 
     ProgressReporting -> False 
    ]; 
 
compoundInformation[url_] := With[
    {text = StringTake[#, UpTo[1500]] &@Import[url, "Plaintext"]}, 
    ImportString[#, "RawJSON"] &@llmExtractor[text] 
   ] 
  
 (*demo*)
exampleResult = compoundInformation[example]

(*<|"compound_name" -> "2,2'-thiobis(N,N-diisobutylacetamide)", "molecular_formula" -> "C20H40N2O2S", "molecular_mass" -> "372.612"|>*)

Comment: I am actually pleasantly surprised at how well this worked. If you look at the string, it is quite ugly and I suspect there are many edge cases for a simple regexp. What I like about this use case is that it fails in a transparent way—if we get bad data in our consistency check then we will just have to manually correct the data.

Addendum (06 Nov 2023): The new function calling API options may allow you to simplify the prompt for returning JSON, in case you are trying this in the future.

We can check for consistency against the listed molecular formula and molecular mass (rounding the latter to the nearest integer as good-enough. In practice, getting the molecular formula correct should also get the mass correct, but I like a belt-and-suspenders approach.

moleculeConsistentQ[mol_Molecule, info_Association] := With[
    {formula = MoleculeProperty["MolecularFormulaString"]@mol, 
     mass = QuantityMagnitude@ MoleculeProperty["MolarMass"]@mol}, 
    (formula == info["molecular_formula"]) && (Round[mass] == Round[ Interpreter["Number"][info["molecular_mass"] ]]) 
   ] 
  
 (*demo*)
moleculeConsistentQ[exampleMol, exampleResult]

(*True*)

We also want to have backup ways of trying to assign the molecules, if some of these are messed up. We can try to parse the IUPAC name or kick it to PubChem. Both of these might fail, so we use the Enclose[Confirm[]] pattern, and just return empty molecules if it fails. I overload the function definitions to take both strings and the information dictionary returned from the LLM interpretation just for convenience:

interpretName[name_String] := Enclose[
    ConfirmQuiet[Molecule[name]], 
    Molecule[""] &] 
 
interpretName[info_Association] := interpretName@info["compound_name"] 
  
 (*demo*)
interpretName[exampleResult]
moleculeConsistentQ[%, exampleResult]

189v5a7i5nskr

(*True*)

Similarly use the built-in Pubchem service to do a synonym lookup; again, if we get a warning message about the compound not being present, then we just return an empty molecule. In this case, this is not actually a molecule in PubChem, so this shows the values of different methods:

interpretNameWithPubchem[name_String] := With[
    {pubchemResults = Enclose[
       ConfirmQuiet[
        ServiceExecute["PubChem", "CompoundProperties", {"Name" -> name, "Property" -> "InChI"}]], 
       {<|"InChI" -> ""|>} &]}, 
    Molecule@First@Lookup[#, "InChI", ""] &@Normal@pubchemResults] 
 
interpretNameWithPubchem[info_Association] := interpretNameWithPubchem@info["compound_name"] 
  
 (*demo*)
interpretNameWithPubchem[exampleResult]
moleculeConsistentQ[%, exampleResult]

05unae5m4zqm7

(*False*)

Note that IDEaL also has images of the molecules, and one could use MoleculeRecognize to parse them. However, in some isolated tests, I found that these were pretty bad (because of transparencies, abbreviations, etc. in the images), so it is not worth implementing.

Putting it all together:

Now that we have defined all the pieces, we can combine them to generate the output:

Define the function

We process each link one at a time, defining the output for each line as an Association. If the molecule is FUBAR, then we just report it as missing. Finally we try reading the molecule from the chemdraw file, the name, and pubchem (in order) until one of them passes the consistency test (w.r.t. molecular formula and molecular weight):

(*we have a valid molecule!  yay! *)
  outputData[url_String, mol_?MoleculeQ] := Association[
     "url" -> url, 
     "abbreviation" -> moleculeAbbreviation[url], 
     "SMILES" -> MoleculeValue[mol, "CanonicalSMILES"], 
     "InChI" -> MoleculeValue[mol, "InChI"]["ExternalID"], 
     "InChIKey" -> MoleculeValue[mol, "InChIKey"]["ExternalID"] 
    ] 
   
  (* something is fishy, so fail *) 
   outputData[url_String, _] := Association[
     "url" -> url, 
     "abbreviation" -> moleculeAbbreviation[url], 
     "SMILES" -> Missing[], 
     "InChI" -> Missing[], 
     "InChIKey" -> Missing[] 
    ] 
   
  (*I don't really like this nested if, but it does the job...*) 
   processURL[url_String] := Module[
     {mol = moleculeFromChemdraw[url], 
      info = compoundInformation[url]}, 
     If[ moleculeConsistentQ[mol, info], 
      outputData[url, mol], (*if true*)
      mol = interpretName[info]; (*else try the name...*)
      If[moleculeConsistentQ[mol, info], 
       outputData[url, mol], (*if true*)
       mol = interpretNameWithPubchem[info]; (*else try pubchem...*)
       If[moleculeConsistentQ[mol, info], 
        outputData[url, mol], (*if true*)
        outputData[url, Missing[]] (*it failed...*) 
       ]]]] 
   
  (*demo*) 
   processURL[example]

(*<|"url" -> "https://www.oecd-nea.org/ideal/extractants/CB4", "abbreviation" -> "CB4", "SMILES" -> "CC(C)CN(CC(C)C)C(=O)CSCC(=O)N(CC(C)C)CC(C)C", "InChI" -> "InChI=1S/C20H40N2O2S/c1-15(2)9-21(10-16(3)4)19(23)13-25-14-20(24)22(11-17(5)6)12-18(7)8/h15-18H,9-14H2,1-8H3", "InChIKey" -> "DNJRGLMKJFJUJD-UHFFFAOYSA-N"|>*)

Map it over the list of entries

Showtime! Map this over links and convert it to a Dataset. I use a ParallelMap so that I can run multiple LLM calls simultaneously (waiting for the response is the bottleneck). It took less than 5 minutes to run. However some errors came out. We will deal with these later (vide infra):

results = ParallelMap[processURL, links] // Dataset

0kfkfee3u6mgf

06mzpftp2h16o

1k8lnsfcm2zqn

0hlzusayn82b9

1snmscer9o1so

1wr69lzk57e3h

0bmlyasort54c

0vgsljqop6hu7

0mbcg5snqoymq

1jdbianvqjs6u

1k0mcb42x2bry

How many unresolved entries are there? Note that these are not computational errors but instead are cases where we were unable to resolve a structure that passed the consistency test:

Length@Select[MissingQ]@Lookup["SMILES"]@Normal@Select[AssociationQ]@results

(*103*)

Comment: Yikes! This seems bad. But it is probably not so bad in practice, because many of the problematic molecules have no distribution coefficient data? (Spoiler alert: We will care about 75 of them…) So I suggest that rather than try to fill all of these in, that one instead just see which entries are missing after doing the final merge.

Before dealing with the (actual) errors, I save the results obtained so far to avoid giving OpenAI another dime.¹ To export as a spreadsheet we must remove rows that are not valid Associations (or stated more positively, we Select only rows that are validly structured before exporting:

SetDirectory@NotebookDirectory[];
Export["2023.10.11_ideal_moleculelist.xlsx", Select[AssociationQ]@results];
DumpSave["2023.10.11_ideal_moleculelist.mx", results];

Investigating the Problem Children

There are a limited number of problem cases; it is possible that the relevant servers just timed out or there are strange unicode errors, or there might be other problems. Given that this is a one-time project, and the problems only affect about 1% of the data, I am not so inclined to solve it rigorously, but instead we can take a look:

problems = Cases[outputData[x_, _] :> x] /@ Normal@Select[(! AssociationQ[#]) &]@results // Flatten
Length[%]

(*{"https://www.oecd-nea.org/ideal/extractants/9H", "https://www.oecd-nea.org/ideal/extractants/HONTA", "https://www.oecd-nea.org/ideal/extractants/SO3-Ph-BTBP", "https://www.oecd-nea.org/ideal/extractants/T2EHDGA", "https://www.oecd-nea.org/ideal/extractants/ZS%20HDBP"}*)

(*5*)

We will address each of these individually, creating a place to store the results:

solutions = Table[Association[], {5}];

1) 9H

This looks like a problem with the ChemDraw file–it contains two disconnected fragments, but the first one seems correct (correct

processURL@problems[[1]]

051h75b58rnq2

url = problems[[1]];
info = compoundInformation[url]
cdxMol = First@ moleculeFromChemdraw[url]
iupacMol = interpretName[info] 
 
moleculeConsistentQ[iupacMol, info] 
 
solutions[[1]] = outputData[url, iupacMol]


(*<|"compound_name" -> "3,6-di-pyridin-4-yl-1,2-dihydro-[1,2,4,5]tetrazine", "molecular_formula" -> "C12H10N6", "molecular_mass" -> "238.254"|>*)

1ugt2niia2qyl

1jh7fvjz5b4qe

(*True*)

(*<|"url" -> "https://www.oecd-nea.org/ideal/extractants/9H", "abbreviation" -> "9H", "SMILES" -> "c1cc(C2=NN=C(c3ccncc3)NN2)ccn1","InChI" -> "InChI=1S/C12H10N6/c1-5-13-6-2-9(1)11-15-17-12(18-16-11)10-3-7-14-8-4-10/h1-8H,(H,15,16)(H,17,18)", "InChIKey" -> "QWYDFEKBBFVMQW-UHFFFAOYSA-N"|>*)

2) HONTA

It looks like the problem here is parsing the JSON:

url = problems[[2]];
cdxMol = moleculeFromChemdraw[url]
info = compoundInformation[url]

1uodk9ezm6kp2

12tp0xnrduqw6

1ozusa2v5id62

(*$Failed*)

WebImage[url]

07rs26phzo5ka

It looks like the problem is with the curly quotes that are being used for indicating primes and double primes (manually copy-pasted below)

iupacMol = Molecule["N,N,N\[CloseCurlyQuote],N\[CloseCurlyQuote],N\[CloseCurlyDoubleQuote],N\[CloseCurlyDoubleQuote]-hexa-n-octylnitrilotrisacetamide"]

07zcq0od1b2c2

formulaExtract[url_] := LLMFunction["What is the molecular formula in the following text? Return only the formula and no other information.\n ``"]@
      StringTake[#, UpTo[1500]] &@Import[url, "Plaintext"] 
 
formulaExtract[url]

(*"C54H108N4O3"*)

solutions[[2]] = outputData[url, iupacMol]

(*<|"url" -> "https://www.oecd-nea.org/ideal/extractants/HONTA", "abbreviation" -> "HONTA", "SMILES" -> "CCCCCCCCN(CCCCCCCC)C(=O)CN(CC(=O)N(CCCCCCCC)CCCCCCCC)CC(=O)N(CCCCCCCC)CCCCCCCC", "InChI" -> "InChI=1S/C54H108N4O3/c1-7-13-19-25-31-37-43-56(44-38-32-26-20-14-8-2)52(59)49-55(50-53(60)57(45-39-33-27-21-15-9-3)46-40-34-28-22-16-10-4)51-54(61)58(47-41-35-29-23-17-11-5)48-42-36-30-24-18-12-6/h7-51H2,1-6H3", "InChIKey" -> "YTTXOOIRNCYALH-UHFFFAOYSA-N"|>*)

3) SO3-Ph-BTBP

url = problems[[3]];
WebImage[url]

1vowhyafg4rek

Comment: Looks like the problem here is that there is no chemdraw file AND we have the curly-quote problem

info = compoundInformation[url]

1w4wnms8evkph

03f06l9rivlz9

(*$Failed*)

nameExtract[url_] := LLMFunction["What is the compound name in the following text? Return only the name and no other information.\n ``"]@
      StringTake[#, UpTo[1500]] &@Import[url, "Plaintext"] 
 
iupacMol = Molecule@nameExtract[url]
formula = formulaExtract[url]
MoleculeValue[iupacMol, "MolecularFormulaString"] == formula

0lhbmqwm7cfs4

0przfafiwhpic

(*"C11H7N7"*)

(*False*)

There appears to be a problem here, but I am not inclined to solve it, as there are no distribution coefficients reported.

solutions[[3]] = outputData[url, Missing[]]

(*<|"url" -> "https://www.oecd-nea.org/ideal/extractants/SO3-Ph-BTBP", "abbreviation" -> "SO3-Ph-BTBP", "SMILES" -> Missing[], "InChI" -> Missing[], "InChIKey" -> Missing[]|>*)

4. T2EHDGA

Another problem with non-unicode characters in the name

url = problems[[4]];
WebImage[url]

0n6dg2w141gku

cdxMol = moleculeFromChemdraw[url]
formula = formulaExtract[url]
MoleculeValue[cdxMol, "MolecularFormulaString"] == formula

18ti9ol4eavdf

(*"C36H72N2O3"*)

(*True*)

solutions[[4]] = outputData[url, cdxMol]

(*<|"url" -> "https://www.oecd-nea.org/ideal/extractants/T2EHDGA", "abbreviation" -> "T2EHDGA", "SMILES" -> "CCCCC(CC)CN(CC(CC)CCCC)C(=O)COCC(=O)N(CC(CC)CCCC)CC(CC)CCCC", "InChI" -> "InChI=1S/C36H72N2O3/c1-9-17-21-31(13-5)25-37(26-32(14-6)22-18-10-2)35(39)29-41-30-36(40)38(27-33(15-7)23-19-11-3)28-34(16-8)24-20-12-4/h31-34H,9-30H2,1-8H3", "InChIKey" -> "MJJHOBMBUANRML-UHFFFAOYSA-N"|>*)

#5

url = problems[[5]]
URL[url]
WebImage[url]

(*"https://www.oecd-nea.org/ideal/extractants/ZS%20HDBP"*)

0fdvd79b7f55l

1l3dqhgke95di

No D values and no chemdraw. So again, no reason to work too hard on this.

solutions[[5]] = outputData[url, Missing[]]

(*<|"url" -> "https://www.oecd-nea.org/ideal/extractants/ZS%20HDBP", "abbreviation" -> "ZS HDBP", "SMILES" -> Missing[], "InChI" -> Missing[], "InChIKey" -> Missing[]|>*)

Resolving the problem children

We have now resolved these error messages, obtaining the following resolutions:

solutions

(*{<|"url" -> "https://www.oecd-nea.org/ideal/extractants/9H", "abbreviation" -> "9H", "SMILES" -> "c1cc(C2=NN=C(c3ccncc3)NN2)ccn1", "InChI" -> "InChI=1S/C12H10N6/c1-5-13-6-2-9(1)11-15-17-12(18-16-11)10-3-7-14-8-4-10/h1-8H,(H,15,16)(H,17,18)", "InChIKey" -> "QWYDFEKBBFVMQW-UHFFFAOYSA-N"|>, <|"url" -> "https://www.oecd-nea.org/ideal/extractants/HONTA", "abbreviation" -> "HONTA", "SMILES" -> "CCCCCCCCN(CCCCCCCC)C(=O)CN(CC(=O)N(CCCCCCCC)CCCCCCCC)CC(=O)N(CCCCCCCC)CCCCCCCC", "InChI" -> "InChI=1S/C54H108N4O3/c1-7-13-19-25-31-37-43-56(44-38-32-26-20-14-8-2)52(59)49-55(50-53(60)57(45-39-33-27-21-15-9-3)46-40-34-28-22-16-10-4)51-54(61)58(47-41-35-29-23-17-11-5)48-42-36-30-24-18-12-6/h7-51H2,1-6H3", "InChIKey" -> "YTTXOOIRNCYALH-UHFFFAOYSA-N"|>, <|"url" -> "https://www.oecd-nea.org/ideal/extractants/SO3-Ph-BTBP", "abbreviation" -> "SO3-Ph-BTBP", "SMILES" -> Missing[], "InChI" -> Missing[], "InChIKey" -> Missing[]|>, <|"url" -> "https://www.oecd-nea.org/ideal/extractants/T2EHDGA", "abbreviation" -> "T2EHDGA", "SMILES" -> "CCCCC(CC)CN(CC(CC)CCCC)C(=O)COCC(=O)N(CC(CC)CCCC)CC(CC)CCCC", "InChI" -> "InChI=1S/C36H72N2O3/c1-9-17-21-31(13-5)25-37(26-32(14-6)22-18-10-2)35(39)29-41-30-36(40)38(27-33(15-7)23-19-11-3)28-34(16-8)24-20-12-4/h31-34H,9-30H2,1-8H3", "InChIKey" -> "MJJHOBMBUANRML-UHFFFAOYSA-N"|>, <|"url" -> "https://www.oecd-nea.org/ideal/extractants/ZS%20HDBP", "abbreviation" -> "ZS HDBP", "SMILES" -> Missing[], "InChI" -> Missing[], "InChIKey" -> Missing[]|>}*)

Keep the valid associations from the first case and append our resolved solutions`:

Dataset[
  results2 = Join[ Select[AssociationQ]@results, solutions]]

12wj2opaxqiwu

Investigating the Missing entries

We still have many entries which did not give an overt failure, and yet we did not resolve a name. We must try to figure out what is going on here:

missingInfo = Select[MissingQ[#["SMILES"]] &]@results2;
Length[missingInfo]

(*105*)

An example

Take a look at thefirst example:

url = missingInfo[[1, "url"]]
WebImage[%]
info = compoundInformation[%%]

(*"https://www.oecd-nea.org/ideal/extractants/%28Dodecylpyrazol%292Py"*)

0kg9g99ovo5jc

(*<|"compound_name" -> "2,6-bis(1-dodecyl-pyrazol-3-yl)-pyridine", "molecular_formula" -> "C35H57N5", "molecular_mass" -> "447.876"|>*)

This appears to be a case where the chemdraw is bad, but the name is fine; this tends to be a common error for many of these molecules which have large alkyl side chains (the ChemDraw structures are missing these entirely).

cdxMol = moleculeFromChemdraw[url]
iupacMol = interpretName[info]
moleculeConsistentQ[iupacMol, info]

19wt72lbzu93w

0ci5yas8c2qbq

(*False*)

MoleculePlot[iupacMol]

0xmn0qkl5vxoa

The structure appears to be correct and the chemical formula matches. But the posted molar mass on the website is wrong!

MoleculeValue[iupacMol, "MolarMass"]

18wujnq4qg5ll

This suggests a more general problem elsewhere in the dataset which we shall need to investigate …

Where are the molecular weights inconsistent with the formulas?

massFormulaOnly[url_] := ImportString[#, "RawJSON"] & @ LLMFunction["What is the molecular formula and the molecular mass in the following text? Return only the formula and number with no units or other information.  Return the results as a JSON dictionary with keys molecular_formula and molecular_mass.\n``", 
        ProgressReporting -> False]@
       StringTake[#, UpTo[1500]] &@Import[url, "Plaintext"] 
 
computeMassFromFormula[info_Association] := ChemicalFormula[info["molecular_formula"]]["MolecularMass"] // QuantityMagnitude 
 
consistentMassQ[info_Association] := With[
    {statedMass = Interpreter[ "Number"]@info["molecular_mass"], 
     computedMass = computeMassFromFormula[info] }, 
    Round[statedMass] == Round[computedMass]] 
 
inconsistentMass[url_] := With[
    {info = massFormulaOnly[url]}, 
    If[! consistentMassQ[info], 
     <|"url" -> url, 
      "formula" -> info["molecular_formula"], 
      "stated_mass" -> Interpreter["Number"][info["molecular_mass"]], 
      "compute_mass" -> computeMassFromFormula[info] 
     |>, 
     Nothing 
    ] 
   ] 
  
 (*demo*)
inconsistentMass[url]

(*<|"url" -> "https://www.oecd-nea.org/ideal/extractants/%28Dodecylpyrazol%292Py", "formula" -> "C35H57N5", "stated_mass" -> 447.876, "compute_mass" -> 547.9|>*)

As usual, we will parallelize on the HTTP requests / API calls to speed things up a bit (most of the time is spent having our computer wait for responses). Some errors arise for gateway failures or other problems:

inconsistencies = ParallelMap[inconsistentMass, Lookup[missingInfo, "url"]];
Dataset[%]

0a66box0rfyi5

0muyoujr2ugqw

We will just drop the gateway errors or cases of missing formulas. I have contacted a representative from OECD-NEA.org (12 Oct 2023) to make them aware of these discrepancies. Alas, it appears that this accounts for only six of our problems

Dataset@Select[AssociationQ]@inconsistencies
Export["2023.10.12_inconsistent_formula_masses_ideal.xlsx", %]

0zqirmtfaoli6

(*"2023.10.12_inconsistent_formula_masses_ideal.xlsx"*)

Retrieve the structures for these inconsistent results:

foundMols = moleculeFromChemdraw /@ Lookup[ Select[AssociationQ]@inconsistencies, "url"]

1wog0pzdizj3m

How does this compare to the listed formula?

Lookup["formula"]@Select[AssociationQ]@inconsistencies

(*{"C35H57N5", "C23H9N5", "C22H20N6O", "C15H19N7O2", "C35H23N7O12S4", "C24H18N7"}*)

Comment: Only #4 has a chemdraw file consistent with the molecular formula, but give all the other discrepancies I think this should be

Investigating other problems

Other cases appear to have totally inconsistent information. Here is one example:

WebImage@missingInfo[[3, "url"]]

1h6wkpaen9g9n

The CDX file is suspicious (missing atoms?), and the stated name is unresolvable:

cdxMol = moleculeFromChemdraw@missingInfo[[3, "url"]]
info = compoundInformation@missingInfo[[3, "url"]]
iupacMol = Molecule@info["compound_name"]

0g1bymefgf5zk

(*<|"compound_name" -> "2-ethylhexyl diamide amine", "molecular_formula" -> "C44H89N3O2", "molecular_mass" -> "692.215"|>*)

18vrp1qrwyd3b

(*Molecule["2-ethylhexyl diamide amine"]*)

You can also try to OCR the image, but this is also inconsistent with the stated molecular formula:

0misx2t3ffng6

149exuqk0maco

Conclusion: There is no simple solution here. Lots of inconsistent results. This is one where a human will have to go back to the original reference papers to sort out.

How many of these problem entries do we actually care about?

We do not care about missing entries unless they contain distribution coefficient data. How many of these have D values?

containsDValuesQ[url_] := 
   StringContainsQ["Distribution coefficients", IgnoreCase -> True]@Import[url, "Plaintext"] 
  
 (*demo*)
containsDValuesQ@missingInfo[[1, "url"]]

(*False*)

Now apply to the entire list:

missingInfo2 = Append[#, "containsDValuesQ" -> containsDValuesQ[#["url"]]] & /@ missingInfo;
Counts@Lookup[missingInfo2, "containsDValuesQ"]

(*<|False -> 30, True -> 75|>*)

Conclusion: Too bad. Looks like we have 75 cases that will need attention. Write them out to a spreadsheet and dispatch someone to get to work…

Export["2023.10.12_inconsistent_entries_with_D_values.xlsx", 
   Select[#["containsDValuesQ"] &]@Dataset@missingInfo2];

Conclusion

Even so, we still have a large number of the results that can be used:

Export["2023.10.12_completed_ideal_molecules.xlsx", 
   Select[! MissingQ[#["SMILES"]] &]@Dataset@results2];

Export["2023.10.12_missing_entries.xlsx", Dataset[missingInfo2]];

You can download the final spreadsheet file of correct values here. Merge this with your scraped data tables (indexing on the the URL or abbreviation fields) and you are off to the races.

You can also download the spreadsheet of unresolved entries with D values and the spreadsheet containing the subset of these with inconsistent molecular formulas and masses. There is also a complete list of all unresolved entries

The Sequel: The Revenge of ChemDraw

Upon further poking on this, it turns out that we can hack around this problem by AppleScripting Chemdraw to export SMILES strings. We treat this in depth in the next post…

ToJekyll["Parsing molecular identifiers from the IDEaL Database", "llm mathematica science"]

Looking back at my usage log, I spent about $0.40 USD in total for all of the calls made in writing, debugging, and using this blog post. ↩