转自 A walkthrough of TMAP - Xinhao Li
7 minute read
Published: May 03, 2020
The notebook version is here.
TMAP is a really cool interactive visualization for big data. Examples can be found here.
Tree MAP (TMAP) is an algorithm developed by Dr. Probst for visualization of very large high-dimensional data sets as minimum spanning trees.
Visualizations based on TMAP are better suited than t‑SNE or UMAP for the exploration and interpretation of large data sets due to their tree‑like nature, increased local and global neighborhood and structure preservation, and the transparency of the methods the algorithm is based on.
The four projects mentioned are:
TMAP consists of 4 phases:
The first two steps are designed for big data setting and can be replaced by other k-nearest neighbor algorithms. In the resulting graph, the nodes are molecules (or other entities) and the edges are weighted by the distances between nodes. In the 3rd step, a tree is basicly a graph wihch doest not contrain cycle. A MST is the tree has the minimum sum of edge weights. The MST is ready for visualization. In the last step, we choose a layout to visualize the MST.
Installation of (1) RDKit (2) TMAP (3) MHFP and (4) Faerun.
conda install -c rdkit rdkit
conda install -c tmap tmap
pip install mhfp
pip install faerun
import pandas as pd
import tmap
from faerun import Faerun
from mhfp.encoder import MHFPEncoder
from rdkit.Chem import AllChem
Here we use ESOL dataset includes 1117 molecules.
url = 'https://raw.githubusercontent.com/XinhaoLi74/molds/master/clean_data/ESOL.csv'
df = pd.read_csv(url)
df.shape
(1117, 2)
df.head(1)
smiles | logSolubility | |
---|---|---|
0 | N#CC(OC1OC(COC2OC(CO)C(O)C(O)C2O)C(O)C(O)C1O)c... | -0.77 |
MHFP6
MHFP6 (MinHash fingerprint, up to six bonds) is a molecular fingerprint which encodes detailed substructures using the extended connectivity principle of ECFP in a fundamentally different manner, increasing the performance of exact nearest neighbor searches in benchmarking studies and enabling the application of locality sensitive hashing (LSH) approximate nearest neighbor search algorithms.
# The number of permutations used by the MinHashing algorithm
perm = 512
# Initializing the MHFP encoder with 512 permutations
enc = MHFPEncoder(perm)
# Create MHFP fingerprints from SMILES
# The fingerprint vectors have to be of the tm.VectorUint data type
fingerprints = [tmap.VectorUint(enc.encode(s)) for s in df["smiles"]]
Data types for TMAP: VectorUnit
and VectorFloat
# Initialize the LSH Forest
lf = tmap.LSHForest(perm)
# Add the Fingerprints to the LSH Forest and index
lf.batch_add(fingerprints)
lf.index()
# Get the coordinates
x, y, s, t, _ = tmap.layout_from_lsh_forest(lf)
x
and y
are the coordinates of the nodes. s
and t
store the indexes of start nodes and to nodes in the MST, respectively.
# Now plot the data
faerun = Faerun(view="front", coords=False)
faerun.add_scatter(
"ESOL_Basic",
{ "x": x,
"y": y,
"c": list(df.logSolubility.values),
"labels": df["smiles"]},
point_scale=5,
colormap = ['rainbow'],
has_legend=True,
legend_title = ['ESOL (mol/L)'],
categorical=[False],
shader = 'smoothCircle'
)
faerun.add_tree("ESOL_Basic_tree", {"from": s, "to": t}, point_helper="ESOL_Basic")
# Choose the "smiles" template to display structure on hover
faerun.plot('ESOL_Basic', template="smiles", notebook_height=750)
/images/TMAP/ESOL_Basic.html
Sometime we want to plot multiple labels for a single dataset. There are two types of lables, continues and categorical. The type of labels need to explicitly assign to the categorical
in faerun.add_scatter
.
We compute two categorical labels: (1) the number of rings
(2) Linear molecules (is_linear
): 0
if numrings > 1 and 1
if numrings = 0.
from rdkit.Chem import rdMolDescriptors
from rdkit import Chem
numrings = [rdMolDescriptors.CalcNumRings(Chem.MolFromSmiles(s)) for s in df["smiles"]]
set(numrings)
{0, 1, 2, 3, 4, 5, 6, 7, 8}
is_linear = [1 if r == 0 else 0 for r in numrings]
The molecules in ESOL datast contains rings from 0 to 8. Now we are going to plot three labels. We need to change setting in faerun.add_scatter
, information of multiple labels are passed as lists.
# Now plot the data
faerun = Faerun(view="front", coords=False)
faerun.add_scatter(
"ESOL",
{ "x": x,
"y": y,
"c": [list(df.logSolubility.values), numrings, is_linear],
"labels": df["smiles"]},
point_scale=5,
colormap = ['rainbow', 'Set1'],
has_legend=True,
categorical=[False, True, True],
series_title = ['ESOL (mol/L)', 'Rings', 'is_linear'],
legend_labels = [None, None, [(0, "No"), (1, "Yes")]],
shader = 'smoothCircle'
)
faerun.add_tree("ESOL_tree", {"from": s, "to": t}, point_helper="ESOL")
# Choose the "smiles" template to display structure on hover
faerun.plot('ESOL', template="smiles", notebook_height=750)
/images/TMAP/ESOL.html
We can also use other descriptors/fingerprints. The descriptors/fingerprints need to be converted to Minhash vectors first. It supports binary, indexed, string and also int and float weighted vectors as input and returns a list of Minhash vectors (List of VectorUint
)
Methods for different types of input:
batch_from_binary_array
. MinHash vectors from binary arrays. The input vectors need to be a list of VectorUchar
.
batch_from_int_weight_array
. Create MinHash vectors from integer arrays (not only zeros and ones). The input vectors need to be a list of VectorUint
.
batch_from_sparse_binary_array
. Create MinHash vectors from sparse binary arrays. The input vectors need to be a list of VectorUint
– A list of vectors containing indices of ones in a binary array.
batch_from_string_array
. Create MinHash vectors from string arrays. The input vector is a list of list
or string
.
batch_from_weight_array
. Create MinHash vectors from float arrays. The input vectors need to be a list of VectorFloat
. – A list of vectors containing float values. Keyword Arguments: method (str) – The weighted hashing method to use (ICWS
(default) or I2CWS
).
Let’s try ECFP4 (binary).
bits = 1024
mols = [Chem.MolFromSmiles(s) for s in df['smiles']]
ECFP4_fps = [AllChem.GetMorganFingerprintAsBitVect(x,2,bits) for x in mols]
ecfp4_lists = [tmap.VectorUchar(list(fp)) for fp in ECFP4_fps]
# Initialize the Minhash
enc = tmap.Minhash(bits)
# Initialize the LSH Forest
lf_ecfp4 = tmap.LSHForest(bits)
# Add the Fingerprints to the LSH Forest and index
lf_ecfp4.batch_add(enc.batch_from_binary_array(ecfp4_lists))
lf_ecfp4.index()
x, y, s, t, _ = tmap.layout_from_lsh_forest(lf_ecfp4)
# Now plot the data
faerun = Faerun(view="front", coords=False)
faerun.add_scatter(
"ESOL_ECFP4",
{ "x": x,
"y": y,
"c": [list(df.logSolubility.values), numrings, is_linear],
"labels": df["smiles"]},
point_scale=5,
colormap = ['rainbow', 'Set1'],
has_legend=True,
categorical=[False, True, True],
series_title = ['ESOL (mol/L)', 'Rings', 'is_linear'],
legend_labels = [None, None, [(0, "No"), (1, "Yes")]],
shader = 'smoothCircle'
)
faerun.add_tree("ESOL_ECFP4_tree", {"from": s, "to": t}, point_helper="ESOL_ECFP4")
# Choose the "smiles" template to display structure on hover
faerun.plot("ESOL_ECFP4",template="smiles", notebook_height=750)