The space of synthesizable drug-like molecules is estimated to contain between 10^60 and 10^100 distinct chemical species, while current virtual screening algorithms such as docking can address only a small fraction of this space, roughly 10^7 to 10^9 molecules. Larger libraries quickly become computationally infeasible. For comparison, the total number of web pages indexed by search engines is around 4.6*10^9, and most of them are indexed by more than one search engine. The total number of distinct chemical entities ever synthesized by humanity (including all those made for experimental drug screening) is close to 300 million, or 3*10^8. All of the drugs we currently use (or have ever tested as drug candidates) are a subset of this set. It is very likely that the “black”, as yet unexplored chemical space contains most of the drugs that we will discover in the future.

The crystallographic structure of a protein can be represented as a 3D image of atomic coordinates, the lock, and the structure of a drug as the key. Virtual screening can then be seen as a search for a key to a specific lock when the relative position of key and lock is not known at query time. This massive image search and retrieval task calls for a very powerful algorithmic approach.

To address this problem we developed Affinity, a high-level machine learning API (Application Programming Interface) dedicated exclusively to molecular geometry. Affinity is written in TensorFlow, with a small proportion of high-performance code in low-level C++. Depending on the application, it can be configured as a multi-CPU, multi-CPU single-GPU, or multi-GPU system.

Affinity can be imported into a Python script with a single line: import affinity as af. The API contains:

  • highly specialized graph convolutions: af.nn.graph_conv3d()
  • high-level geometry subroutines: af.geom.pointcloud_pairlist()
  • input preparation pipelines for some of the largest molecular datasets: af.input.InputPipeQM9(), af.input.InputPipePDBBind()
  • some of our best-performing networks for each of the datasets: af.networks.spatial_transformer_net3D()
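
The text does not give the signature or exact semantics of af.geom.pointcloud_pairlist(), so the following does not call it. As a rough illustration of what a pair list over a point cloud generally means, here is a minimal pure-Python sketch that lists every pair of atoms closer than a cutoff distance. The function name pairlist, its arguments, and the example coordinates are all hypothetical and are not part of Affinity.

```python
import math
from itertools import combinations


def pairlist(coords, cutoff):
    """Return (i, j, distance) for every pair of points closer than `cutoff`.

    Hypothetical helper for illustration only; this is a brute-force O(n^2)
    scan, not Affinity's implementation.
    """
    pairs = []
    for (i, a), (j, b) in combinations(enumerate(coords), 2):
        d = math.dist(a, b)  # Euclidean distance (Python 3.8+)
        if d < cutoff:
            pairs.append((i, j, d))
    return pairs


# Three made-up atoms, coordinates in angstroms.
atoms = [(0.0, 0.0, 0.0), (1.5, 0.0, 0.0), (10.0, 0.0, 0.0)]
pairs = pairlist(atoms, cutoff=4.0)
print(pairs)  # [(0, 1, 1.5)]
```

Only the first two atoms lie within the cutoff, so the third contributes no pairs. A production routine would typically replace the quadratic scan with a cell list or spatial tree.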

Here is a short list of important problems we are currently working on:

1) Atomic spaces are sparse: they contain very few atoms separated by large voids. At the same time, a drug has to fit a given protein with extremely high precision, to within a fraction of an atomic radius. Existing sparse convolutions (Graham's) are an algorithmic method of performing the computation and are otherwise algebraically similar to regular dense convolutions. Google's molecular graph convolutions represent molecular images as undirected graphs and are much more specialized to the problem of predicting binding affinity. In their current formulation, however, convolutions on molecular graphs are extremely limited. For example, Google's molecular graph networks have to learn at training time that distances exist in 3D-symmetric space (rather than in a space of arbitrary dimensionality), which prevents efficient parameter utilization. It is important to develop new convolutions that can efficiently summarize atomic interaction energies.
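
To make the sparsity argument concrete, the sketch below counts how many cells of a dense voxel grid actually contain an atom. The box size, grid resolution, and atom coordinates are made-up values chosen for illustration, and the function occupancy is a hypothetical helper, not part of Affinity.

```python
def occupancy(coords, box, resolution):
    """Fraction of voxels in a cubic box of side `box` that hold at least one atom.

    Assumes all coordinates lie in [0, box). Illustrative helper only.
    """
    n = int(round(box / resolution))  # voxels per side
    occupied = {
        tuple(min(int(c // resolution), n - 1) for c in xyz)
        for xyz in coords
    }
    return len(occupied) / n ** 3


# Four made-up atoms in a 20-angstrom box, gridded at 0.5 angstrom
# (a fraction of a typical atomic radius).
atoms = [(1.0, 1.0, 1.0), (2.5, 1.0, 1.0), (10.0, 10.0, 10.0), (18.2, 3.3, 7.7)]
fraction = occupancy(atoms, box=20.0, resolution=0.5)
print(fraction)  # 6.25e-05, i.e. 4 occupied voxels out of 40**3 = 64000
```

Halving the resolution multiplies the number of voxels by eight while the number of atoms stays fixed, so the precision the problem demands makes a dense grid emptier still.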

2) Both we and others have had some initial success applying metric learning techniques to drug-protein interactions. In this approach, a static representation of both drug and protein, typically a single vector of a few thousand floating-point values, is learned in a manner similar to word2vec. The similarity between two drugs, or between two proteins, can then be represented as the cosine distance between their feature vectors. The limitation of static feature extraction is that feature vectors have to be very large to achieve high precision. In the future, feature extraction should become a dynamic, bi-directional process in which different features are repeatedly extracted from both images depending on the context.
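
The cosine distance mentioned above can be sketched in a few lines. The four-element vectors here are toy stand-ins for learned feature vectors (which, per the text, would have a few thousand entries), and the names drug_a, drug_b, and drug_c are hypothetical.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def cosine_distance(u, v):
    """1 - cosine similarity: 0 for parallel vectors, 1 for orthogonal ones."""
    return 1.0 - cosine_similarity(u, v)


# Toy feature vectors, made up for illustration.
drug_a = [1.0, 0.0, 1.0, 0.0]
drug_b = [2.0, 0.0, 2.0, 0.0]  # same direction as drug_a, different length
drug_c = [0.0, 1.0, 0.0, 1.0]  # orthogonal to drug_a

d_ab = cosine_distance(drug_a, drug_b)  # close to 0: treated as similar
d_ac = cosine_distance(drug_a, drug_c)  # close to 1: treated as dissimilar
print(d_ab, d_ac)
```

Because the measure depends only on direction, rescaling a feature vector does not change its distance to any other vector.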

3) Every drug discovery algorithm to date has had to rely either on brute force or, at best, on biased sampling (with an evolutionary algorithm or a Markov chain) of all available molecular structures. Only a tiny fraction of the space, ~10^8 molecules, has ever been searched because of the computational expense. Representing all molecular structures in a continuous space would allow more powerful search and optimization techniques, such as gradient descent, to be applied.
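
As a toy illustration of why a continuous representation helps, the sketch below runs plain gradient descent on a smooth score defined over a two-dimensional vector. Everything here is an assumption made for illustration: the quadratic score is a stand-in for a predicted binding energy, the vector z for a continuous molecular representation, and the text does not specify how molecules would be mapped into or decoded out of such a space.

```python
def score(z):
    """Hypothetical smooth stand-in for a predicted binding energy.

    Chosen so that the minimum is known: it sits at z = (1.0, -2.0).
    """
    return (z[0] - 1.0) ** 2 + (z[1] + 2.0) ** 2


def grad(z):
    """Analytic gradient of `score` with respect to z."""
    return [2.0 * (z[0] - 1.0), 2.0 * (z[1] + 2.0)]


def gradient_descent(z0, lr=0.1, steps=100):
    """Follow the negative gradient from z0 for a fixed number of steps."""
    z = list(z0)
    for _ in range(steps):
        g = grad(z)
        z = [zi - lr * gi for zi, gi in zip(z, g)]
    return z


z_opt = gradient_descent([0.0, 0.0])
print(z_opt, score(z_opt))  # z_opt approaches (1.0, -2.0); score approaches 0
```

Each step uses the local slope to move directly toward a better candidate, whereas brute-force or sampling-based search over discrete structures has to evaluate candidates one by one without such guidance.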