molprop-featurize

Generates ML-ready feature packs from a MolProp results table (recommended) or from a SMILES file. Designed for tens of thousands of ligands: it outputs a sparse hashed fingerprint matrix for scikit-learn / XGBoost / LightGBM, plus optional deep-learning exports (packed-bit fingerprints, aligned SMILES, and graph tensors) and an interpretable functional-group + fragment summary for SAR review.

Related: Ligand preparation guide

Typical usage (from a MolProp table)

Auto-selects the structure-of-record SMILES and prefers Calc_Canonical_SMILES when present.

# 1) make a MolProp table (CSV or Parquet)
molprop-calc-v5 input.smi -o results.parquet

# 2) build a feature pack
molprop-featurize results.parquet -o features/results_features

Parquet requires pyarrow. Install via pip install "molprop-toolkit[parquet]" (or pip install pyarrow).

SMILES input mode

Accepts a simple SMILES file (tab, comma, or whitespace separated).

molprop-featurize library.smi -o features/library_features
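For reference, a minimal input file might look like the following. This assumes the conventional SMILES-then-identifier layout; the compound IDs shown are placeholders.

```
CCO	CPD-001
c1ccccc1	CPD-002
CC(=O)Nc1ccc(O)cc1	CPD-003
```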

What it writes (feature pack layout)

Model-friendly (sparse) outputs

The main artifact is a CSR sparse matrix saved as NPZ. This is suitable for scikit-learn and gradient boosting. Numeric descriptor columns from the input table are exported separately so you can scale/impute them explicitly.

X_sparse_counts_csr.npz        # CSR: data/indices/indptr/shape
X_numeric.npy                 # optional dense numeric block
X_numeric_columns.json        # column names for X_numeric
ids.csv                       # row_index → Compound_ID + SMILES (always)
ids.parquet                   # optional (when --id-map-format parquet|both)
features_spec.json            # optional frozen vocab/block spec (for train/test stability)
features_metadata.json        # run metadata (includes selected SMILES column)

SAR / interpretability outputs

For quick SAR exploration and rule-based filtering, the tool can write a compact CSV of functional group counts/flags plus fragment summary columns.

features_interpretable.csv    # FG_* + Env_* + fragment summaries
features_interpretable.parquet # optional (when --interpretable-format parquet|both)
fragments_brics.json          # BRICS vocabulary + frequencies
fragments_recap.json          # RECAP vocabulary + frequencies
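A sketch of rule-based filtering on that CSV, using an inline stand-in for features_interpretable.csv. The FG_/Env_ column names here are illustrative, not the tool's actual vocabulary; check the real header row before filtering.

```python
import io

import pandas as pd

# Tiny stand-in for features_interpretable.csv (column names are hypothetical)
csv_text = """Compound_ID,FG_Amide,FG_Nitro,Env_ArOH
CPD-001,2,0,1
CPD-002,0,1,0
"""
df = pd.read_csv(io.StringIO(csv_text))

# Example rule: drop anything flagged with a nitro group
keep = df[df["FG_Nitro"] == 0]
print(list(keep["Compound_ID"]))  # ['CPD-001']
```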

Deep learning exports

Exports are framework-agnostic and designed to avoid generating thousands of small files. Fingerprints are packed to bytes by default (2048 bits → 256 bytes per molecule).

dl_fp_packed_uint8.npy        # packed bits (default)
dl_smiles.csv                 # aligned SMILES (for token models)
dl_graphs.npz                 # concatenated graph tensors (for GNNs)
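Packed fingerprints can be restored to 0/1 bit vectors with NumPy before feeding a dataloader. The sketch below uses a synthetic array in place of dl_fp_packed_uint8.npy (in practice you would np.load that file); the round-trip works because 2048 bits is an exact multiple of 8.

```python
import numpy as np

# Synthetic stand-in for np.load("dl_fp_packed_uint8.npy"):
# three molecules, 2048 bits each, packed to 256 bytes per row
rng = np.random.default_rng(0)
bits = rng.integers(0, 2, size=(3, 2048), dtype=np.uint8)
packed = np.packbits(bits, axis=1)        # shape (3, 256)

# Unpack back to 0/1 vectors for a DL dataloader
unpacked = np.unpackbits(packed, axis=1)  # shape (3, 2048)
assert np.array_equal(unpacked, bits)
```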

Feature blocks (how columns are organized)

X_sparse_counts_csr.npz is a concatenation of multiple blocks. The block offsets and sizes are recorded in features_metadata.json, so you can slice or ablate feature families reproducibly.

FP_Morgan   (hashed counts)
FP_AtomPair (hashed counts)
FP_Torsion  (hashed counts)
FG_*        (named functional-group/env features)
Frag_BRICS  (presence features)
Frag_RECAP  (presence features)
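Block-wise slicing can then be done by column range. The sketch below hard-codes a hypothetical block table; in practice you would read it from features_metadata.json, and the exact key names ("offset", "size") are assumptions to verify against your metadata file.

```python
import numpy as np
from scipy.sparse import csr_matrix

# Hypothetical block table (normally parsed from features_metadata.json)
blocks = {
    "FP_Morgan":   {"offset": 0, "size": 4},
    "FP_AtomPair": {"offset": 4, "size": 4},
}

# Toy concatenated matrix: 2 rows x 8 columns
X = csr_matrix(np.arange(16, dtype=np.float64).reshape(2, 8))

def block_slice(X, blocks, name):
    """Column-slice one feature family out of the concatenated matrix."""
    b = blocks[name]
    return X[:, b["offset"] : b["offset"] + b["size"]]

X_morgan = block_slice(X, blocks, "FP_Morgan")  # shape (2, 4)
```

Slicing by block makes ablation studies reproducible: drop one family, retrain, and compare.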

Common options

Run molprop-featurize --help for the full list. Common knobs include structure column choice, fragment method, fingerprint sizes, numeric column selection, and which deep-learning exports to write.

# choose which SMILES column from results.parquet to featurize
molprop-featurize results.parquet --smiles-col Canonical_SMILES -o features/out

# fragments (BRICS, RECAP, both, or none)
molprop-featurize results.parquet --fragments both --max-frags-per-mol 128 -o features/out

# fingerprint dimensions
molprop-featurize results.parquet --morgan-nbits 4096 --ap-nbits 2048 --torsion-nbits 2048 -o features/out

# deep learning exports
molprop-featurize results.parquet --dl all --dl-fp-nbits 2048 --dl-fp-format packed -o features/out

# numeric block selection
molprop-featurize results.parquet --numeric-cols MolWt,LogP,TPSA,CNS_MPO -o features/out

# disable the interpretable summary CSV
molprop-featurize results.parquet --no-interpretable -o features/out

# freeze vocab + blocks for train/test consistency
molprop-featurize train.parquet -o features/train_features
molprop-featurize test.parquet --spec features/train_features/features_spec.json -o features/test_features

How to load the sparse matrix (example)

The NPZ contains CSR arrays. If SciPy is available, construct a csr_matrix directly. This is compatible with scikit-learn estimators and can be passed into XGBoost/LightGBM.

import numpy as np
from scipy.sparse import csr_matrix

npz = np.load("features/out/X_sparse_counts_csr.npz")
X = csr_matrix((npz["data"], npz["indices"], npz["indptr"]), shape=tuple(npz["shape"]))

# Optional numeric block
X_num = np.load("features/out/X_numeric.npy")
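Since the numeric block is exported separately precisely so it can be scaled explicitly, one way to combine the two (a sketch with illustrative stand-in arrays in place of the loaded X and X_num) is to standardize the dense block and append it as extra sparse columns:

```python
import numpy as np
from scipy.sparse import csr_matrix, hstack

# Illustrative stand-ins for the X and X_num loaded above
X = csr_matrix(np.array([[0.0, 2.0, 0.0], [1.0, 0.0, 3.0]]))
X_num = np.array([[350.2, 2.1], [412.7, 3.4]])

# Standardize each numeric column (mean 0, std 1), then append as columns
X_num_scaled = (X_num - X_num.mean(axis=0)) / X_num.std(axis=0)
X_full = hstack([X, csr_matrix(X_num_scaled)], format="csr")
# X_full: 2 rows x (3 + 2) columns, usable with sklearn / XGBoost / LightGBM
```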