molprop-featurize
Generates ML-ready feature packs from a MolProp results table (recommended) or from a SMILES file. Designed for tens of thousands of ligands: it writes a sparse hashed-fingerprint matrix for scikit-learn / XGBoost / LightGBM, plus optional deep-learning exports (packed-bit fingerprints, aligned SMILES, and graph tensors) and an interpretable functional-group + fragment summary for SAR review.
Related: Ligand preparation guide
Typical usage (from a MolProp table)
Auto-selects the structure-of-record SMILES column, preferring Calc_Canonical_SMILES when present.
# 1) make a MolProp table (CSV or Parquet)
molprop-calc-v5 input.smi -o results.parquet
# 2) build a feature pack
molprop-featurize results.parquet -o features/results_features
Parquet requires pyarrow. Install via pip install "molprop-toolkit[parquet]" (or pip install pyarrow).
SMILES input mode
Accepts a simple SMILES file (tab, comma, or whitespace separated).
molprop-featurize library.smi -o features/library_features
What it writes (feature pack layout)
Model-friendly (sparse) outputs
The main artifact is a CSR sparse matrix saved as NPZ. This is suitable for scikit-learn and gradient boosting. Numeric descriptor columns from the input table are exported separately so you can scale/impute them explicitly.
X_sparse_counts_csr.npz # CSR: data/indices/indptr/shape
X_numeric.npy # optional dense numeric block
X_numeric_columns.json # column names for X_numeric
ids.csv # row_index → Compound_ID + SMILES (always)
ids.parquet # optional (when --id-map-format parquet|both)
features_spec.json # optional frozen vocab/block spec (for train/test stability)
features_metadata.json # run metadata (includes selected SMILES column)
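Because ids.csv keeps the row order of the feature matrix, it is the natural place to join model outputs back to compound identifiers. A minimal sketch (the tiny inline table below stands in for a real ids.csv, which you would load with pd.read_csv("features/out/ids.csv")):

```python
import io
import pandas as pd

# Stand-in for ids.csv, which maps matrix row_index -> Compound_ID + SMILES.
ids_csv = io.StringIO(
    "row_index,Compound_ID,SMILES\n"
    "0,CMPD-001,CCO\n"
    "1,CMPD-002,c1ccccc1\n"
)
ids = pd.read_csv(ids_csv)

# Attach per-row model scores (same row order as X) back to compound IDs.
scores = [0.91, 0.12]  # e.g. model.predict_proba(X)[:, 1]
ids["score"] = scores
print(ids.sort_values("score", ascending=False).head())
```

The same pattern works for ids.parquet when the Parquet map was requested.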
SAR / interpretability outputs
For quick SAR exploration and rule-based filtering, the tool can write a compact CSV of functional group counts/flags plus fragment summary columns.
features_interpretable.csv # FG_* + Env_* + fragment summaries
features_interpretable.parquet # optional (when --interpretable-format parquet|both)
fragments_brics.json # BRICS vocabulary + frequencies
fragments_recap.json # RECAP vocabulary + frequencies
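The interpretable CSV is a normal tidy table, so rule-based filters are one-liners in pandas. A sketch, with an inline stand-in table (the specific FG_* column names below are illustrative; the real names depend on the tool's functional-group vocabulary):

```python
import io
import pandas as pd

# Stand-in for features_interpretable.csv; real FG_* names come from the
# tool's functional-group vocabulary (these are illustrative).
csv_text = io.StringIO(
    "Compound_ID,FG_carboxylic_acid,FG_nitro,Frag_count_BRICS\n"
    "CMPD-001,1,0,5\n"
    "CMPD-002,0,2,3\n"
    "CMPD-003,0,0,4\n"
)
df = pd.read_csv(csv_text)

# Rule-based filter for SAR follow-up: e.g. drop nitro-containing compounds.
fg_cols = [c for c in df.columns if c.startswith("FG_")]
keep = df[df["FG_nitro"] == 0]
print(len(fg_cols), len(keep))
```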
Deep learning exports
Exports are framework-agnostic and designed to avoid generating thousands of small files. Fingerprints are packed to bytes by default (2048 bits → 256 bytes per molecule).
dl_fp_packed_uint8.npy # packed bits (default)
dl_smiles.csv # aligned SMILES (for token models)
dl_graphs.npz # concatenated graph tensors (for GNNs)
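Packed fingerprints can be expanded back to 0/1 bit vectors with NumPy's np.unpackbits when a framework expects dense bits. A sketch using random bytes in place of the real dl_fp_packed_uint8.npy:

```python
import numpy as np

# Simulate dl_fp_packed_uint8.npy: N molecules x 256 bytes (2048 bits each).
# In practice: packed = np.load("features/out/dl_fp_packed_uint8.npy")
n_mols, n_bits = 4, 2048
rng = np.random.default_rng(0)
packed = rng.integers(0, 256, size=(n_mols, n_bits // 8), dtype=np.uint8)

# Unpack each byte row-wise into a dense 0/1 matrix.
fp = np.unpackbits(packed, axis=1)
print(fp.shape, fp.dtype)
```

Keeping the packed form on disk (256 bytes per molecule) and unpacking lazily per batch avoids an 8x storage blow-up for large libraries.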
Feature blocks (how columns are organized)
X_sparse_counts_csr.npz is a concatenation of multiple blocks. The block offsets and sizes are recorded in
features_metadata.json, so you can slice or ablate feature families reproducibly.
FP_Morgan (hashed counts)
FP_AtomPair (hashed counts)
FP_Torsion (hashed counts)
FG_* (named functional-group/env features)
Frag_BRICS (presence features)
Frag_RECAP (presence features)
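Because block offsets and sizes are recorded in features_metadata.json, a feature family can be sliced out of the concatenated matrix by column range. A minimal sketch, assuming a metadata layout of block name → offset/size (the real JSON schema may differ; check the file for the actual keys):

```python
from scipy.sparse import random as sparse_random

# Hypothetical block layout mirroring features_metadata.json; consult the
# real file for the actual schema and key names.
blocks = {
    "FP_Morgan":   {"offset": 0,    "size": 2048},
    "FP_AtomPair": {"offset": 2048, "size": 2048},
}
total = sum(b["size"] for b in blocks.values())

# Stand-in for the loaded CSR matrix (normally built from the saved NPZ).
X = sparse_random(10, total, density=0.05, format="csr", random_state=0)

def block_slice(X, blocks, name):
    """Return the column slice for one feature family (useful for ablations)."""
    b = blocks[name]
    return X[:, b["offset"] : b["offset"] + b["size"]]

X_morgan = block_slice(X, blocks, "FP_Morgan")
print(X_morgan.shape)
```

Dropping a block before training (an ablation) is then just a matter of concatenating the remaining slices.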
Common options
Run molprop-featurize --help for the full list. Common knobs include structure column choice, fragment method,
fingerprint sizes, numeric column selection, and which deep-learning exports to write.
# choose which SMILES column from results.parquet to featurize
molprop-featurize results.parquet --smiles-col Canonical_SMILES -o features/out
# fragments (BRICS, RECAP, both, or none)
molprop-featurize results.parquet --fragments both --max-frags-per-mol 128 -o features/out
# fingerprint dimensions
molprop-featurize results.parquet --morgan-nbits 4096 --ap-nbits 2048 --torsion-nbits 2048 -o features/out
# deep learning exports
molprop-featurize results.parquet --dl all --dl-fp-nbits 2048 --dl-fp-format packed -o features/out
# numeric block selection
molprop-featurize results.parquet --numeric-cols MolWt,LogP,TPSA,CNS_MPO -o features/out
# disable the interpretable summary CSV
molprop-featurize results.parquet --no-interpretable -o features/out
# freeze vocab + blocks for train/test consistency
molprop-featurize train.parquet -o features/train_features
molprop-featurize test.parquet --spec features/train_features/features_spec.json -o features/test_features
How to load the sparse matrix (example)
The NPZ contains CSR arrays. If SciPy is available, construct a csr_matrix directly. This is compatible with
scikit-learn estimators and can be passed into XGBoost/LightGBM.
import numpy as np
from scipy.sparse import csr_matrix
npz = np.load("features/out/X_sparse_counts_csr.npz")
X = csr_matrix((npz["data"], npz["indices"], npz["indptr"]), shape=tuple(npz["shape"]))
# Optional numeric block
X_num = np.load("features/out/X_numeric.npy")
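Since the numeric block is exported separately so it can be scaled explicitly, a common final step is to standardize it and append it to the sparse matrix with scipy.sparse.hstack. A sketch with small stand-in arrays in place of the loaded files:

```python
import numpy as np
from scipy.sparse import csr_matrix, hstack

# Stand-ins for the two saved blocks (normally loaded from the feature pack).
X = csr_matrix(np.array([[0, 1, 0], [2, 0, 1]], dtype=np.float32))
X_num = np.array([[300.1, 2.4], [412.7, 3.1]], dtype=np.float32)

# Scale the dense descriptors explicitly, then append them as extra columns.
mu, sd = X_num.mean(axis=0), X_num.std(axis=0)
X_num_scaled = (X_num - mu) / sd
X_full = hstack([X, csr_matrix(X_num_scaled)], format="csr")
print(X_full.shape)
```

In a real pipeline, fit the scaler on the training set only (e.g. sklearn's StandardScaler) and reuse it on the test set, mirroring the --spec workflow above.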