Ligand preparation

“Preparation” in MolProp Toolkit is not a single toggle; it is a reproducible sequence that turns an input SMILES into a standardized parent structure, then optionally resolves tautomer/stereo ambiguity and optionally chooses a protomer at a defined pH before any conformer-based 3D work. The calculators write traceable SMILES columns at each stage so downstream tools can be explicit about which representation they are using.


What “fully prepared” means (for further calculation)

A “fully prepared ligand” means you have made and documented the structural decisions that materially change computed properties: parent fragment selection (salt stripping), charge normalization, tautomer choice, stereochemistry choice, and (if relevant) protonation state at a specified pH. For purely 2D triage descriptors, the default preparation chain is usually sufficient. For any workflow that relies on 3D geometry (3D descriptors, USR/USRCAT similarity, docking, shape screening), stereochemistry and protonation state should be treated as first-class choices rather than left unspecified.

In the results CSV, the key traceability columns are Input_Canonical_SMILES, Canonical_SMILES, Calc_Base_SMILES, and Calc_Canonical_SMILES. If you are unsure which to use downstream, prefer Calc_Canonical_SMILES, because it reflects the exact structure used to compute descriptors.
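As a minimal sketch of how these columns support auditing, the Python below flags rows whose structure changed between the input and the standardized parent, and between the standardized parent and the calculation structure. The column names are the ones listed above. The function names `audit_rows` and `audit_csv` are hypothetical, and comparing SMILES as exact text is an assumption that only holds if all columns were written by the same canonicalizer. Calc_Base_SMILES is left out because its exact meaning is not defined in this section.

```python
import csv


def audit_rows(rows):
    """Summarize, per row, which preparation stage changed the structure.

    `rows` is any iterable of dicts keyed by results-CSV column names.
    A comparison is only made when both columns are non-empty.
    """
    summaries = []
    for row in rows:
        input_smi = (row.get("Input_Canonical_SMILES") or "").strip()
        parent_smi = (row.get("Canonical_SMILES") or "").strip()
        calc_smi = (row.get("Calc_Canonical_SMILES") or "").strip()
        summaries.append({
            "input": input_smi,
            # True when standardization (e.g. salt stripping) altered the input.
            "changed_by_standardization": bool(
                input_smi and parent_smi and input_smi != parent_smi
            ),
            # True when descriptors were computed on something other than the parent.
            "changed_for_calculation": bool(
                parent_smi and calc_smi and parent_smi != calc_smi
            ),
        })
    return summaries


def audit_csv(path):
    """Read a results CSV from `path` and return the per-row summaries."""
    with open(path, newline="") as handle:
        return audit_rows(csv.DictReader(handle))
```

Calling `audit_csv("results.csv")` and filtering on either flag gives a quick list of compounds where the structure of record differs from what was supplied.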

Recommended preparation sequences

Pick one sequence and use it consistently across a project. Switching preparation policy midstream will change clusters, risk flags, and any model training data you derive from the table.

Four sequences are described below, in this order: Standard (2D triage), Audit ambiguity, 3D-ready, and Minimal / external prep.

Standard (2D triage)

Use this for the majority of early triage. It produces a standardized parent structure and uses a single canonical tautomer while leaving stereochemistry as supplied. Ionization is represented as pH-aware features (Ion_* columns) without changing the structure used for 2D descriptors.

The internal chain is RDKit MolStandardize Cleanup → FragmentParent → Uncharger → Reionizer → Canonicalize tautomer, followed by stereochemistry assignment for auditing. The standardized parent is written to Canonical_SMILES.

molprop-calc-v5 input.smi -o results.csv \
  --tautomer-mode prep-canonical \
  --stereo-mode keep \
  --ph 7.4 --ionization heuristic
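The chain named above can be approximated directly with RDKit's `rdMolStandardize` module. This is a sketch under stated assumptions, not MolProp Toolkit's own code: the function name `standardize_parent` is hypothetical, every step uses RDKit defaults, and the toolkit may configure or order the steps differently.

```python
def standardize_parent(smiles):
    """Return a standardized parent SMILES, or None if the input does not parse.

    Assumption: default RDKit settings at every step; the toolkit's actual
    configuration is not documented here.
    """
    # Imported lazily so this module still loads where RDKit is not installed.
    from rdkit import Chem
    from rdkit.Chem.MolStandardize import rdMolStandardize

    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        return None

    mol = rdMolStandardize.Cleanup(mol)                # sanitize and normalize groups
    mol = rdMolStandardize.FragmentParent(mol)         # keep the largest organic fragment
    mol = rdMolStandardize.Uncharger().uncharge(mol)   # neutralize where possible
    mol = rdMolStandardize.Reionizer().reionize(mol)   # fix charge placement
    mol = rdMolStandardize.TautomerEnumerator().Canonicalize(mol)  # canonical tautomer

    # Re-perceive stereo labels so they can be audited on the final structure.
    Chem.AssignStereochemistry(mol, cleanIt=True, force=True)
    return Chem.MolToSmiles(mol)
```

For example, sodium acetate written as `[Na+].CC(=O)[O-]` loses its counter-ion at the FragmentParent step and is neutralized by the Uncharger, so the sketch returns acetic acid.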

Audit ambiguity

Use this when your inputs are messy or when you want to surface ambiguity explicitly. The calculator enumerates plausible tautomers and unresolved stereochemistry (bounded by the --tautomer-max and --stereo-max limits), selects a representative, and records the enumerated options in Tautomer_* and Stereo_* columns.

molprop-calc-v5 input.smi -o results.csv \
  --tautomer-mode enumerate --tautomer-max 64 --tautomer-topk 5 \
  --stereo-mode enumerate --stereo-max 32 --stereo-topk 5 --stereo-select canonical

This mode is about traceability, not about “correct chemistry.” RDKit enumeration is rule-based, and the selected representative is a reproducible choice, not a guarantee of biological relevance.
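To illustrate why a deterministic selection rule is reproducible regardless of enumeration order, here is a small sketch. It is not the toolkit's selection logic: the function name `select_representative` is hypothetical, and plain lexical ordering of SMILES strings is used only as a stand-in for whatever "canonical" selection means internally. The `max_variants` and `top_k` parameters loosely mirror the max and topk options shown in the command above.

```python
def select_representative(candidates, max_variants=64, top_k=5):
    """Pick a reproducible representative from enumerated SMILES strings.

    Returns (representative, recorded_options, truncated).
    Assumption: lexical order stands in for the real selection rule.
    """
    # Deduplicate and sort first, so the result does not depend on the
    # order in which the enumerator happened to emit the variants.
    unique = sorted(set(candidates))
    truncated = len(unique) > max_variants
    bounded = unique[:max_variants]
    if not bounded:
        return None, [], False
    # The first entry is the representative; the first top_k are recorded.
    return bounded[0], bounded[:top_k], truncated
```

Feeding the same set of variants in any order yields the same representative, which is the property that matters for traceability. It says nothing about which variant is chemically dominant.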

3D-ready

Use this when you will compute conformer-based 3D descriptors (--3d), cluster with 3D fingerprints (USR/USRCAT) in molprop-series, or run docking. In these workflows, protonation state and stereochemistry materially change geometry, so decide them before generating conformers.

A practical reproducible pattern is to enumerate protomers in a defined pH window, select one protomer, compute descriptors on that protomer (--calc-on-protomer), and then generate 3D conformers with a recorded seed.

# optional dependency
pip install dimorphite_dl

molprop-calc-v5 input.smi -o results.csv \
  --ph 7.4 --ionization dimorphite --calc-on-protomer \
  --protomer-select closest-charge \
  --tautomer-mode prep-canonical \
  --stereo-mode keep \
  --3d --3d-num-confs 10 --3d-minimize mmff --3d-seed 0
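The role of the recorded seed can be seen in a plain RDKit sketch of seeded conformer generation followed by MMFF minimization. This is an illustration under assumptions, not a description of what --3d does internally: the function name `embed_conformers` is hypothetical, the embedding uses RDKit defaults, and the iteration limit is an arbitrary choice.

```python
def embed_conformers(smiles, num_confs=10, seed=0):
    """Embed and MMFF-minimize conformers with a fixed random seed.

    Returns (mol_with_hydrogens, conformer_ids, energies).
    Assumption: RDKit default embedding parameters.
    """
    # Imported lazily so this module still loads where RDKit is not installed.
    from rdkit import Chem
    from rdkit.Chem import AllChem

    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        raise ValueError(f"could not parse SMILES: {smiles!r}")

    # Explicit hydrogens are needed for sensible 3D geometry.
    mol = Chem.AddHs(mol)

    # A fixed randomSeed makes the embedding repeatable for the same input.
    conf_ids = list(
        AllChem.EmbedMultipleConfs(mol, numConfs=num_confs, randomSeed=seed)
    )

    # Each result is (status, energy); status 0 means minimization converged.
    results = AllChem.MMFFOptimizeMoleculeConfs(mol, maxIters=500)
    energies = [energy for _status, energy in results]
    return mol, conf_ids, energies
```

Because the input SMILES already encodes the chosen protomer and stereochemistry, changing either one changes the embedded geometry, which is why those decisions come before this step.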

Minimal / external prep

Use this only if you already prepared structures externally and you want MolProp Toolkit to treat the input as authoritative. It disables the internal RDKit MolStandardize preparation chain. Salt forms, alternative tautomers, and charge variants will remain, which can be appropriate for auditing, but can also fragment your clustering and inflate “diversity” artificially.

molprop-calc-v5 input.smi -o results.csv \
  --no-prep \
  --tautomer-mode none \
  --ionization none

Which SMILES should downstream tools use?

For series clustering / scaffolds

If you generated the results CSV with MolProp Toolkit, run molprop-series on that CSV and let it use the preferred SMILES priority automatically. It will prefer Calc_Canonical_SMILES (calculation structure) and fall back to Canonical_SMILES or SMILES only when needed.
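If you consume the results CSV in your own scripts, the same priority can be mirrored with a small helper. This is a sketch, not molprop-series code: the name `pick_smiles` is hypothetical, and applying the fallback per row (rather than once per file) is an assumption, since the behavior is not specified here.

```python
# Column priority as described above: calculation structure first.
SMILES_PRIORITY = ("Calc_Canonical_SMILES", "Canonical_SMILES", "SMILES")


def pick_smiles(row, priority=SMILES_PRIORITY):
    """Return (column_name, smiles) for the first non-empty priority column.

    `row` is a dict keyed by column name, e.g. from csv.DictReader.
    Returns (None, None) when no priority column has a value.
    """
    for column in priority:
        value = (row.get(column) or "").strip()
        if value:
            return column, value
    return None, None
```

Returning the column name alongside the SMILES lets a downstream table record which representation was actually used for each compound.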

For modeling / 3D workflows

Use Calc_Canonical_SMILES as the “structure of record,” and retain the accompanying Protomer_*, Stereo_*, and Tautomer_* columns for auditability. If you need a different protomer distribution or a specific salt form, treat that as a separate dataset variant and document it explicitly.