Content-Based Similarity Search in Large-Scale DNA Data Storage Systems

Authors

Callista Bee 1 *, Yuan-Jyue Chen 2 , David Ward 1 , Xiaomeng Liu 1 , Georg Seelig 1,3 , Karin
Strauss 2 *, Luis Ceze 1 *.
Affiliations:
1 Paul G. Allen School of Computer Science & Engineering, University of Washington.
2 Microsoft.
3 Department of Electrical & Computer Engineering, University of Washington.
*Correspondence to: [email protected], [email protected],
[email protected].

Abstract

  • Here, we demonstrate content-based retrieval from a DNA database by learning a mapping from images to DNA sequences such that an encoded query image will retrieve visually similar images from the database via DNA hybridization.
  • We encoded and synthesized a database of 1.6 million images and queried it with a variety of images, showing that each query
    retrieves a sample of the database containing visually similar images are retrieved at a rate much greater than chance.
  • We compare our results with several algorithms for similarity search in electronic systems, and demonstrate that our molecular approach is competitive with state-of-the-art electronics.

One Sentence Summary:

Learned encodings enable content-based image similarity search from a database of 1.6 million images encoded in synthetic DNA.

Main Text:

P1

  • Since the dawn of online social media, the amount of text, images, and videos that the world produces has been growing rapidly in an exponential fashion, outpacing the capacity growth of traditional storage media (solid state,magnetic, optical).
  • a high density and durable storage medium (1–7), with several orders of magnitude higher density and durability than current storage technologies.
  • An important feature of these systems is the ability to retrieve single documents without having to sequence a large, dense pool of data in molecular form.
  • To accomplish this, these systems leverage the specificity of DNA hybridization to implement key-based retrieval, where a short, predetermined sequence is used to retrieve the document associated with that sequence.

P2

  • Electronic database systems separate the concerns of search and storage by maintaining an index structure that is designed to facilitate content-based retrieval.
  • We follow this design principle by creating a DNA-based index that is optimized for similarity search, where an example document is used to retrieve similar documents from a database.
  • Augmenting the density and longevity of DNA data storage with the computational power of similarity search will enable applications that are infeasible for electronic systems.

P3

  • Document similarity is typically formulated as a geometric problem (13) where each document is converted to a vector in a high-dimensional feature space, with the property that neighboring feature vectors represent subjectively similar documents.

P4

  • Searching in a high-dimentional space is challenging for electronic systems.
  • Instead, efficient similarity search algorithms perform an approximate search that allows for errors in the results. We adopt that relaxation as well.
  • Rather than finding exact nearest neighbors, our goal is to maximize the number of near neighbors retrieved while minimizing the number of irrelevant results.

P5

  • They want to encode feature vectors as DNA sequences such that single stranded molecules created from an encoded target and the reverse-complement of an encoded query are likely to form stable hybridized structures when the query and target feature vectors are neighboring, but not when they are distant (Figure 1B).

-Given such an encoding, we can construct an index where each document is associated with a single strand of DNA that contains the document’s ID, alongside with its encoded feature vector (Figure 1C-D).

  • An encoded query (Figure 1E) can then be used as a hybridization probe to filter out similar documents from the index (Figure 1F-G).

-The filtered index can be sequenced and decoded to recover the IDs of documents that are similar to the query.

  • These documents can then be retrieved from a key-based database and displayed to the user (Figure 1H).

P6

  • Designing a universal encoding from feature vectors to DNA sequences is difficult and early researchers (9) used machine learning to optimize the encoding for a particular dataset by clustering the dataset using k-means, then mapping each cluster center to a known DNA sequence which is assigned to each document in that cluster.

  • By reducing content-based retrieval to exact key-based retrieval, their approach sidesteps any issues with unwanted DNA hybridization.

  • However, once the cluster labels are chosen, they are static; any additional data added to the database must fit into the existing clusters, even if it would cause the cluster centers to change.

P7

  • Rather than using a codebook of pre-determined sequences, our technique leverages neural networks to map feature vectors directly onto fixed-length DNA sequences.
  • A key component of our technique is a differentiable model of DNA hybridization that is trained alongside the encoder. This allows us to optimize the encoder to minimize unwanted hybridization while preserving the flexibility of inexact matching.
  • To train the predictor, the reactions between pairs of sequences output by the encoder are simulated via NUPACK (17). We consider this simulation to be the “gold standard” for computing the outcome of a reaction, but it cannot be used directly to train the encoder, because its computation is not differentiable.

P8

  • We chose to focus first on encoding feature vectors derived from images.
  • Figure 2A illustrates the neural network that maps images to DNA sequences. First, we use the FC2 intermediate layer of VGG16 (14) to extract a feature vector.
  • A fully-connected neural network with a single hidden layer converts these feature vectors to “one-hot” DNA sequences, where each position is represented numerically by a four-channel vector (one channel for each possible base) whose components sum to one.


    Figure 2

P9

  • The hybridization predictor (Fig. 2B) takes a pair of one-hot sequences and predicts the yield of the hybridization reaction between the top sequence and the reverse-complement of the bottom sequence.

  • It first pre-processes the sequence pair with a sliding window operation that extracts local matches (Fig. S3B). Following local average pooling, a trainable convolutional layer enables the predictor to weigh different kinds of local matches differently.


    image.png
  • Because we are not predicting where hybridization occurs, a global average pooling operation removes spatial relevance.

  • The prediction is then computed with a logistic regression (in our experience, this is more effective than a linear regression because most hybridization yields are either close to 0 or close to 1).

P10

Fig.4
  • Figure 2C outlines the training procedure, which alternates between encoder and predictor training phases.
  • In the encoder training phase, a pair of image feature vectors are compared to determine if they are close enough to be deemed “similar”. The threshold for similarity is determined based on subjective evaluation of the feature space (see Figure S1 in the supplementary material).


    image.png
  • The feature vectors are encoded independently to produce a pair of one-hot sequences, which are passed to the hybridization predictor.

  • If the predicted yield is low and the documents are similar, or the predicted yield is high and the documents are not similar, the encoder’s parameters are modified (via gradient descent) to correct the error.

P10

  • In the predictor training phase, the one-hot sequences output by the encoder are discretized and their reaction is simulated with NUPACK.
  • If the predicted yield differs from NUPACK’s simulation result, the predictor’s parameters are modified (via gradient descent) to correct the error.
  • Eventually, the predictor learns a model of hybridization that is good enough for the encoder to learn how to map feature vectors to DNA sequences, such that neighboring feature vectors are much more likely to result in pairs of sequences that hybridize (Figure 2D-E).

P11

  • After training our encoder, we transformed each of these images into a DNA sequence using the trained encoder.
  • In addition to the encoded features, each image’s sequence contains a unique, decodable barcode that can identify the sequence without using reference-based alignment, as well as conserved regions to facilitate amplification and processing via PCR (polymerase chain reaction).
  • Each image’s sequence is short enough to fit on a single synthesizable DNA oligomer (see Figure S4 in the supplementary material).


    image.png

P12

  • To conduct similarity search with an image, we ordered a biotinylated probe oligomer that contains the reverse complement of the query’s encoded feature sequence.
  • We annealed the probe with a sample of the database, and then separated the annealed target/query pairs from the database using streptavidin-conjugated magnetic beads.
  • We then use high-throughput sequencing to reveal which database sequences persist in the separated mixture, and measure how frequently each of them occur.

P13

image.png
  • Figure 3 shows the experimental results for three different query images.

  • If we consider images with sequencing read counts above a certain threshold to be “retrieved”, we can characterize the set of retrieved images for a variety of thresholds.

  • Figure 3A shows that higher read counts are associated with sets of images that are closer to the query in Euclidean distance.

  • We can quantitatively characterize the quality of a retrieved set by its recall of the 100 nearest neighbors; that is, the number of images in the set that are among the 100 most similar images to the query in the database.

  • Figure 3B shows that, as the read threshold increases, the number of total images in the retrieved set drops very low before you begin to sacrifice nearest neighbor recall.

  • We can also visually inspect the retrieved set by sorting its contents and displaying the most similar images.

  • Figure 3C shows that, even with very aggressive filtering, the retrieved set still contains images that are relevant to the query.

  • If the read counts for each image are proportional to their concentrations in the filtered mixture, this means that the filtered mixture could be diluted about 1000x, conserving sequencing resources while still retrieving relevant images.

P14

  • The performance of a similarity search algorithm can be summarized by the curve in Figure 3B, which measures the proportion of the database that must be retrieved and sorted to achieve a particular 100-nearest neighbor recall.
  • The dashed line above the curve illustrates a “naive” algorithm that randomly samples the database.
  • To retrieve half of the hundred nearest neighbors, it must retrieve half of the database. The dashed-and-dotted line below the curve illustrates a perfect “oracle” algorithm.
  • To retrieve half of the hundred nearest neighbors, it would retrieve exactly those 50 images from the 1.6 million in the database.

P15

image.png
  • Figure 4 places the curve from Figure 3B in context alongside several state-of-the-art in silico algorithms that were benchmarked using the same query and same database for each of the queries we evaluated experimentally.

  • While there is room for improvement, our experimental performance is comparable to the state-of-the-art, indicating that DNA-based similarity search is a viable technique for searching the databases of the future.

Summary

  • This paper detailed the first demonstration of similarity search of digital information in DNA and compared its potential efficiency with electronic systems.
  • The results suggest that, as DNA data storage becomes more practical and scales to larger data sets, similarity search in DNA form is an attractive possibility compared to electronic systems.
  • Combining DNA data storage with similarity search support may offer a path to viable hybrid molecular-electronic computer systems.

Materials and Methods

Datasets

Feature Extraction

  • To extract image features, we processed each image with VGG16, a convolutional neural network designed for image classification.
  • The weights were loaded from the publicly available trained model and left unchanged during our processing.
  • We used the activations of FC2 (the second fully-connected layer) as 4096-dimensional feature vectors.
  • Figure S1 illustrates the relationship between subjective image similarity and feature vector Euclidean distance.
  • Pairs of images with Euclidean distance of 75 or less tend to be consistently similar, so during training we label these pairs as “similar” and all other pairs as “not similar”.

Sequence Encoding

  • The sequence encoder is a fully-connected neural network. Its topology is depicted in Figure S2.


    Fig.S2
  • The 4096-dimensional FC2 vectors are fed into a 2048-dimensional hidden layer with a rectified linear activation, followed by an output layer with a “one-hot” sequence representation that is 80 nucleotides in length.

  • In this representation, each sequence position has four channels, one for each base. A softmax activation function is applied that forces each position’s channels to sum to 1.

  • A DNA sequence can be read off by picking the channel with
    the maximum activity at each position.

  • The one-hot representation can produce indeterminate bases (for example, if all four channels at a position have a value of 0.25).

  • Because of this, a regularization is applied during encoder training to minimize the entropy at each position. This encourages each position to have a well-defined maximum, which improves the accuracy of the yield predictor.

  • The yield predictor takes a pair of one-hot sequence representations and produces an estimate of the yield of the hybridization reaction between the first sequence and the reverse-
    complement of the second sequence.

  • It is structured as a convolutional neural network (Figure S3A). The network makes use of a novel local match layer (Figure S3B) that produces vectors of possible matches between each window of 3-mers. This encourages the predictor to make use of any unaligned matches between the two sequences.


    image.png

Training

你可能感兴趣的:(Content-Based Similarity Search in Large-Scale DNA Data Storage Systems)