Image - based approximate DNA storage system

Abstract

  • However, prior work simply investigated the feasibility of DNA storage storing different types of data and simply store images in DNA storage, which did not fully investigate the fault-tolerant potential of images in the DNA storage.

  • In this paper, we proposed a new image-based DNA system called IMG-DNA, which can efficiently store images in DNA storage with improved DNA storage robustness.

    • First, a new DNA architecture is proposed to fit JPEG-based images and improve the image’s robustness in DNA storage.
    • Moreover, barriers inserted in DNA sequences efficiently prevent error propagation in images of DNA storage.
    • The experimental results indicate that the proposed IMG-DNA achieves much higher fault-tolerant than prior work.

Introduce

The rapidly increased pace of digital volume motivates researchers to search for a storage medium with a real high density and long-term preservation.

Therefore, it is critical for emerging storage systems to
store such a massive amount of image data with a low cost.


Two challenges of many new emerging storage devices:

  • One is that the increased storage density cannot
    catch up with the rapid rate of increased digital volume.
  • The other is that the digital data can be reliably stored in these storage devices only for several years to at most tens of years.

Synthetic Deoxyribonucleic Acid (DNA) storage is
one promising storage medium for archiving due to its high storage density and long durability.


One issue of DNA storage is its high error rate. The write (synthesis) and read (sequencing) processes are error-prone.

For each base pair (one nucleotide), it may involve around 1% error rate [ 8 ]. To handle these errors, researchers use error-correction code (ECC) to recover errors resulting in a much high overhead. The space overhead may reach 15% or even higher.

Why choose image?

According to the high error rate in DNA storage, images might be well fitted to the DNA storage due to its properties of fault tolerance and large volume in social media.

Thus, DNA storage provides great potential to store a massive amount of image data to avoid high overhead induced by errors.


There are two types of prior work related to how to store image data in DNA storage.

  • One type of studies investigated the feasibility of DNA storage [9][8][10].

    • Images as one type of applications are stored in DNA.
    • They proposed different encoding schemes (i.e., converting digital data into DNA sequences), error-correction codes, and biochemical technologies for generally improving the efficiency of sequencing (read data out from DNA) or decreasing error rates.
    • However, they did not focus on the property of image data, which cannot achieve the best performance for image data in terms of storage density and error tolerance.
  • The other focused on approximately store images based on their property in different storage devices such as SSDs or Non-volatile memory [ 2 ][ 11 ][ 12 ].

    • For examples, Kuo et al. [ 11 ] proposed a
      new JPEG encoding scheme to improve the error tolerance of SSD based on the sensitivities of JPEG image data.
    • Fan et al. [ 12 ] proposed an adaptive-length image coding scheme based on the reliability of NAND flash memory to improve the density and reduce the total cost of JPEG-based storage.
    • However, due to the special properties of DNA storage (such as error propagation discussed in Section 3), those technologies cannot be efficiently applied in image-based DNA storage.

In this paper, we proposed a new scheme to efficiently store JPEG-based images in DNA storage, called IMG-DNA, which combining the special properties of DNA storage and images.

  • First, we propose a new DNA architecture by
    adding barriers to prevent error propagation in DNA storage. The barrier design is based on the encoding scheme and biochemical constraints in DNA storage, and thus it can be feasibly applied to DNA storage.

  • Then, according to the error sensitivities of coefficients in JPEG images, we separately store those coefficients in different DNA sequences, and proper internal index is designed to associate these coefficients with their corresponding images.

  • Moreover, to further improve the error tolerance of DNA storage, separate barrier schemes are applied to the different coefficients of JPEG-images according to their characteristics.

  • Finally, the IMG-DNA scheme improves the error tolerance capability compared to prior work.


Section 2: the background of DNA storage and JPEG- based images.
Section 3: demonstrates the motivation behind this work.

Section 4: introduces the implementation of
IMG-DNA scheme.

Section 5: shows the experimental results of IMG-DNA and compare them with some prior work.

Section 6: draw a conclusions

2 Background

  • 2.1 DNA Storage
    For DNA storage, as shown in Fig. 1, there are
    four major processes in the DNA storage system: encoding, synthesis, sequencing, and decoding.
  • The encoding and synthesis are the processes to write digital data into DNA storage.
  • The sequencing and decoding are the processes to read data out from DNA storage.
Figure 1:Basic steps of DNA storage.
Figure 2: DNA format within the same primer pair.
Figure 3: A rotating encoding to nucleotides avoids homopolymers (repetitions of the same nucleotide), which are error-prone.

  • 2.2 Fundamentals of JPEG-based Image
Figure 4: The encoding process of JPEG-based image.

3 Motivation

  • There are two properties of image-based data.

    • One is the large volume since image-based data type is explosively increased and generated from social media such as Facebook, Snapchat, and Instagram. The large volume increases the demand for high-capacity storage systems.
    • The other property of image-based data is fault tolerance. Many existing studies investigated images in approximate storage such as flash memory or non-volatile memory to reduce the storage overhead or improve the robustness of images [ 11 ][ 12 ][ 25 ].
  • Therefore, these two properties of images are well fitted to the DNA storage system, which is error-prone and has an extremely high density.

  • However, none of the previous studies focused on the robustness of images in DNA storage.
    Although some existing studies [ 8 ][ 9 ][ 14 ] implemented image-based binary data in DNA storage, all of them focused on the feasibility of DNA storage for different types of data including images.

Therefore, how to efficiently store images and improve the robustness of images in DNA storage is a critical and interesting research issue.


Moreover, due to the biochemical constraints as mentioned before, we should avoid long homopolymers in DNA sequences, which cause high error rates.

Therefore, the state-of-the-art encoding scheme [9][13][8][16] uses a rotating code manner as shown in Fig. 3 to avoid long homopolymers.


However, the rotating manner induces a phenomenon called error propagation (DNA-level).

In DNA storage, there are three types of errors (i.e., insertion, substitution, and deletion).

Any nucleotide changed in the middle of one DNA
strand has a significant influence on its following nucleotides.As a result, all subsequence data cannot be correctly read out due to one nucleotide error.

Figure 5: An example of deletion error for error propagation. The deletion error causes that one nucleotide ‘C’ is missed and all subsequence are decoded wrongly.

Additionally, although prior work [ 11 ][ 12 ] has mitigated the image-level error propagation for traditional storage devices, these techniques are not efficient for the DNA-level error propagation due to the special encoding manner in DNA storage.

Therefore, there is a need to find a proper architecture preventing such error propagation in DNA storage.

4 Aligorithm Design

In this section, we introduce the proposed IMG-DNA scheme to improve the robustness of image-based DNA storage.

The IMG-DNA scheme has three major optimizations compared to previous work.

  • The first one is that the separation of DC and AC coefficients at the DNA level.
  • The second one is ‘adding barrier’ to both AC and DC coefficients.
  • The third one is that the DC and AC coefficients are applied with different densities of barriers.

4.1 AC/DC Coefficient Separation at DNA Level

According to JPEG-based images, the DC and AC coefficients have different sensitivities to errors.

  • There are two reasons. One is that the DC coefficients have higher resolution than the AC coefficients during the quantization process since the human visual system is more sensitive to DC coefficients (low-frequency part).

  • The other reason is that the DC coefficients use differential pulse code modulation coding (DPCM) to compute its consecutive DC value for the next 8*8 block
    ( which is called error propagation phenomenon [ 11 ].)

4.2 Adding ‘Barriers’

We can use some ’barriers’ to prevent such error propagation.

However, the barriers are not easy to be added since the barriers should be distinguished from the payload.


The constraint of the number of consecutive identical nucleotides is less than four nucleotides [ 16 ].

The reason of using the barrier pattern based on ‘A’ is that the ‘AA’ and ‘AAA’ patterns have higher sequencing accuracies than other three types of patterns [ 26 ].

Since we predefine the PL, an error will be prevented
within its partition and will not affect other partitions.
So, one error causes the maximum errors within a partition size (i.e., PL).


Figure 6: Definition of Partition Length (PL) and Barrier Window (BW).

Moreover, substitution, deletion, and insertion errors may generate a pattern which is the same as the barrier pattern (i.e., AA). Misclassification of the barrier patterns may mitigate the effect of error propagation prevention.

To mitigate such a scenario, we propose another mechanism called Barrier Window (BW), which means that only a pattern ’AA’ locating in a barrier window will be regarded as a true barrier if detected. The pattern ’AA’ outside the barrier window will be treated as a normal payload. If there is no barrier pattern in the BW, it means that an error happens on the barrier pattern and changes the barrier to other patterns.

Thus, we increase the number of partitions by one and use the next barrier pattern to prevent the error propagation. In this case, the error will be propagated to the next partition but we can prevent it in two partitions.

We set BW to 12 bp, which means 5 bp on both left and right sides of the barrier pattern.

Since the probability of errors happening on the barrier pattern is much lower than that of other payloads (25 times less if the fixed-length takes 50 bp), it is highly possible that the error propagation can be prevented in one or two partitions.

Therefore, with adding barriers and Barrier Window, the IMG-DNA scheme can significantly prevent the error propagation in image-based DNA storage.

4.3 Asymmetric Barriers for AC/DC Coefficients

Therefore, there is a trade-off between the robustness and encoding density of DNA storage.


Since DC coefficients are more sensitive to errors, we assign a smaller PL value to DC coefficients to increase the robustness of DNA storage. Meanwhile, since AC coefficients have much larger size than that of DC, a larger PL value for AC coefficients can reduce the overhead of the scheme of adding barriers.

Moreover, compared with the data sizes of DC and AC coefficients in JPEG images, normally DC coefficients only occupy 4% - 8% of one image size [ 12 ].
Thus, the increased overhead of adding more barriers in DC coefficients is limited. According to our experiments, the capacity overheads of barriers in DC and AC coefficients are about 0.28% and 3% of image size with PL=20 bp for DC and PL=50 bp for AC, respectively.

4.4 Image-based DNA Storage Architecture

  • After JPEG-based encoding, we first separate DC and
    AC coefficients.

  • After that, the digital sequence will be encoded into DNA sequence.

  • Then, we insert the barrier pattern ’AA’ into DC and AC coefficients separately.

  • According to the different error sensitivities of DC and AC, we use different partition lengths for AC and DC DNA strands.

  • After inserting barriers, long DNA sequences will be chunked into small segments with a fixed length (e.g., 200 bp).

  • Then, each segment with a primer pair and its corresponding
    internal index are assembled. The internal index contains the type and address offset of payload.

image.png

4.5 Feasibility Discussion

In this study, the experiments are based on simulation.


A set of design rules for synthesis and sequencing efficiency are used based on commercial design rules [ 27 , 28 ] and previous studies [ 14 , 15 , 8 ]:

  • One is that the absence of long homopolymers (less than four nucleotides) [ 15 , 16 ].
  • The second one is that GC contents in a sequence should be around 40% - 60%.
  • The third one is DNA strand length smaller than 1000 bp.

5 Experimental Results

The default DNA strand length is 250 bp including a primer pair, payload, and index (the DNA strand lengths may be changed slightly due to the length of the index).

Two previous studies are used as baselines:

  • One is the straightforward implementation of DNA storage [ 8 ][ 14 ] (denoted by Raw-DNA ).
    (【8】Random access in large-scale dna data storage.
    【14】A dna-based archival storage system)

  • The other one is the approximate image schemes with the digital-level optimization [ 11 ] (denoted by Approx-IMG ).
    (【11】 Long-term jpeg data protection and recovery for nand flash-based solid-state storage.该文章用于磁盘存储中的jpeg文件优化))

  • We use 500 JPEG-based images from the ImageNet image dataset[29] for all experiments.

  • The error model is extracted from error distributions of the wet-lab DNA storage implementation [8].

5.1 Robustness of Image-based DNA System

We investigate the robustness of image-based DNA systems for the ImageNet data set by manually injecting errors from 0.1% to 2% following the distribution of the error model [ 8 ].

Figure 8

The AC and DC coefficients use the same DNA
strand length (250 bp) and different partition lengths (50 bp and 20 bp respectively).

  • Compared with these three schemes, the proposed IMG-DNA achieves a much higher fault tolerance ability than the other two with different error rates (2.6x -19.7x SSIM improvement).
    The reason is that the IMG-DNA considers both digit- and DNA- level error propagation prevention for images.
    Moreover, a graphic view of an image with different encoding schemes are used in Fig. 9.
    We can find that the proposed IMG-DNA achieves much close vision to the original picture and the other two baselines are vague on parts of the picture.
Figure 9

In summary, without any optimization, the propagation error in DNA storage can cause image corruption even with a small error in AC or DC data.
By using the proposed scheme, the error propagation is mitigated and finally results in little image quality degradation. The proposed scheme can assist those protection schemes and reduce the overhead of those ECCs (error-correction codes). For example, originally one error can propagate to ten errors.
Thus, to recover all those ten errors, a strong ECC should be used.
With the prevention of the error propagation, the ECC only needs to recover this one error.

5.2 Effect of Different Partition Lengths

We compare SSIM values by varying the PL values for AC and DC coefficients.

The IMG-DNA scheme with no barriers is normalized to 1 (denoted by ‘No Barrier’)


Figure 10

As shown in Fig. 10, as the PL values increase for AC and DC, the SSIM is decreased.
In other words, the robustness of image-based DNA storage becomes lower with larger PL values. Moreover, when injecting error rates increase, the SSIM difference between PL values become larger.

Therefore, for the cases with high error rates, it is always better to use a smaller PL value to prevent errors while may introduce a higher capacity overhead.

5.3 Errors on Different Coefficients

In this subsection, we investigate the effect of error injection on different coefficients.

As indicated in Figure 11, for all three schemes, as injecting different errors on AC and DC coefficients, injecting errors in DC coefficients has a much larger effect on the quality of images than that of AC coefficients. (模拟实验,直接往AC、DC部分引入错误)

Figure 11

If we compare different schemes, we can find that the proposed IMG-DNA can support much higher SSIM values for both AC and DC coefficients than the other two schemes. The reason is that the proposed scheme of adding barriers prevents error propagation in DNA storage and thus improves the overall robustness of image-based DNA storage systems.

5.4 Overhead Discussion

Overhead Discussion In this subsection, we discuss the overhead of the IMG-DNA scheme.the computation overhead of both schemes is much similar. IMG-DNA can introduce some capacity overhead
due to adding extra barriers.

5.4

In this subsection, we discuss the overhead of the IMG-DNA scheme.

As shown in Figure 12, we investigate the effect of adding barriers on the encoding density with different partition lengths.

Figure 12

6 Conclusion

In this paper, we proposed a new scheme to efficiently store JPEG-based images in DNA storage, called IMG-DNA with a new DNA architecture by adding barriers to prevent error propagation in DNA storage.

The barrier design is based on the encoding scheme and biochemical constraints in DNA storage.

Then, we separately store DC and AC coefficients in different DNA sequences and use separate barrier schemes to further improve the error tolerance of DNA

Finally, the IMG-DNA scheme improves SSIM value by 2.6x - 19.7x and thus enhance the robustness of DNA
storage compared to existing work.

你可能感兴趣的:(Image - based approximate DNA storage system)