== OligoArchive: A DNA-based database archive for PostgreSQL ==

Today, most, if not all, enterprises in our society are driven by data. If data is the oil that fuels the metaphorical AI vehicle, storage technologies are the cogs that keep the wheels spinning. For decades, we wanted fast storage devices that can quickly deliver data, and storage technologies evolved to meet this requirement. As data-driven decision making becomes an integral part of enterprises, we are increasingly faced with a new need: cheap, long-term storage devices that can safely store the data we generate for tens or hundreds of years to meet legal and regulatory compliance requirements.

In this talk, we will first explore recent trends in the storage hardware landscape that clearly show that all current storage media face fundamental limitations that threaten our ability to store, much less process, the data we generate over long time frames. We will then focus on a radically new storage medium that has received considerable attention recently: synthetic deoxyribonucleic acid (DNA). After highlighting the pros and cons of using DNA, a biological building block, as a digital storage medium, I will present our recent work in the EU-funded Future and Emerging Technologies (FET) project OligoArchive, which focuses on overcoming challenges in using DNA as a deep archival tier of PostgreSQL databases.

Speaker name:

Dr. Raja Appuswamy

Speaker bio:

Raja Appuswamy is an Assistant Professor in the Data Science department at EURECOM, a Grande École located in the sunny Sophia Antipolis tech valley of southern France. Previously, he was a Researcher and Visiting Professor at EPFL, Switzerland, a Visiting Researcher in the Systems and Networking group at Microsoft Research, Cambridge, and a Software Development Engineer in the Windows 7 kernel team at Microsoft, Redmond.

He received his Ph.D. in Computer Science from the Vrije Universiteit, Amsterdam, where he worked under the guidance of Prof. Andrew S. Tanenbaum on designing and implementing a new storage stack for the MINIX 3 microkernel operating system. He also holds dual Master's degrees in Computer Science and Agricultural Engineering from the University of Florida, Gainesville.

More Decks by Warsaw PostgreSQL Users Group

Transcript

  1. “50% of 175ZB global datasphere will be enterprise data in 2025” [IDC]
    • “80% data is cold, and increasing at 60% CAGR” [Horison]
    • Growth of archival data
    • [Figure: storage hierarchy of DRAM, SSD, HDD, and tape, with archival data access latency ranging from ns through µs and ms to mins; axes: performance, capacity, storage cost]
    • Tape provides lowest cost/GB for data archives today
  2. “Kryder’s rate of tape: 31%/YR average” [Fontana]
    • Problems with tape
      • Tape has a lifetime of 10-20 years
      • Continuous data migration
    • “60% of archival data stored longer than 20 years” [SNIA]
  3. Net effect: Media Obsolescence
    • “There’s going to be a large dead period,” he told me, “from the late ’90s through 2020, where most media will be lost.”
    • Enterprise DBMS archives will soon face obsolescence
  4. Why DNA?
    • How do we use DNA as an archival medium for relational DBMS?
    • Dense
    • Durable
    • Kryder’s rate: 0
  5. Project OligoArchive: EU FET Initiative
    • OligoArchive (https://oligoarchive.eu) is a €3M European Commission funded research effort to deliver the building blocks needed to make DNA storage a reality. It involves six partners across three countries.
    • Eurecom, Univ. Nice, CNRS, Imperial College, Helixworks
    • Interdisciplinary team & research agenda:
      • Computer Science: Efficient encoding of structured and unstructured data
      • Molecular biology: Near-molecule data analysis for content detection in DNA; accelerated sequencing and thus reading for DNA storage
      • Biochemistry: Novel, cost-effective synthesis technology for DNA storage
      • Robotics/microfluidics: Automation of manual library prep for writing and readout
  6. OligoArchive: DNA Storage Stack
    • Goal: implement a custom storage stack for data archival on DNA
    • Application Layer: Encoding structured (database) and unstructured (imaging) data
    • Controller Layer: Data processing capabilities
    • OS Layer: File system abstraction
    • Media Layer: Synthesis and Sequencing Automation
  7. Database archival architecture
    • [Figure: a DBMS holding example rows (1 CODD 0.1; 2 GRAY 0.2) is connected through pg_oligo_archive and pg_oligo_restore to a DNA storage system comprising a DNA synthesizer, a PCR thermocycler, a DNA sequencer, and a DNA library]
    • [ Get, [OID: GTTCAG] ]
    • [ Put, [OID: GTTCAG], [value: ATATGTGAGT], [value: GATGGATCTA] ]
    • [ [value: ATATGTGAGT], [value: ATGTGAGT…], [value: GATGTATCTA] [value: GATGGATCTATT] [value: GATGGATCTA] ]
  8. Writing data to DNA (1): The unstructured way
    • Issues using DNA as a storage media
      • Limited DNA (oligo) length, homopolymer/G-C constraint, indels/subst. errors
    • Approach-1: Dump database to a binary archive file and encode
    • Limitations
      • log4(#segments) nucleotides reserved for offset (1TB => 17 nts in 150nt oligo)
      • Cannot perform near-molecule data processing
    • [Figure: pg_archive dumps example rows (1 CODD 0.1; 2 GRAY 0.2) to a bit stream (0111001010011); an encoder maps it to oligos AAAACTCA, AAACTGCA, and AAAGCAGT, each sent to synthesis; each oligo consists of a unique offset (AAAA) followed by data (CTCA)]
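
The offset overhead quoted on slide 8 can be sanity-checked with a few lines of arithmetic. The sketch below is not from the talk: it assumes a decimal terabyte, an ideal density of 2 bits per nucleotide, and that the full 150 nt of each oligo carries payload; `offset_nucleotides` is a hypothetical helper name. Real encoders lose capacity to primers, error correction, and sequence constraints, so the true segment count would be higher.

```python
import math

def offset_nucleotides(archive_bytes: int, oligo_nt: int = 150,
                       bits_per_nt: float = 2.0) -> float:
    """Estimate log4(#segments), the nucleotides needed for a unique offset."""
    # Assumption: the whole oligo carries payload at bits_per_nt (an upper bound).
    bits_per_oligo = oligo_nt * bits_per_nt
    segments = archive_bytes * 8 / bits_per_oligo
    return math.log(segments, 4)

ONE_TB = 10 ** 12  # assumption: decimal terabyte
estimate = offset_nucleotides(ONE_TB)
print(f"offset needs about {estimate:.1f} of 150 nt "
      f"({estimate / 150:.0%} of the oligo)")
```

Under these assumptions the estimate comes out at roughly 17.3 nucleotides, in line with the slide's figure of 17 nt; a whole number of address nucleotides would have to round up.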
  9. Writing data to DNA (2): Structured data layout
    • NSM on DNA: one row per oligo
      • Use unique primary key to avoid additional indexing
    • DSM on DNA: columnset partitioning for “large” rows
    • Reduces overhead from log4(#segments) to log4(cardinality) (#segs. >> Card.)
    • [Figure: pg_oligo_archive encodes example rows (1 CODD 0.1; 2 GRAY 0.2) into oligos (CTCAGTAG, TGCACGAT, TCATGACT, GCTGATA) that are sent to synthesis, shown for the row-wise and the column-set layouts]
    • Pg_oligo_archive picks right policy for each table automatically
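
To make the "one row per oligo" idea on slide 9 concrete, here is a toy encoder in which the primary key doubles as the address, so the address costs about log4(cardinality) nucleotides. Everything in it is an assumption made for illustration: the 2-bit mapping (0→A, 1→C, 2→G, 3→T), the helper names, and the absence of primers, error correction, and homopolymer/G-C handling. It is not the encoding used by pg_oligo_archive.

```python
NT = "ACGT"  # assumed mapping: 0->A, 1->C, 2->G, 3->T

def int_to_nt(value: int, width: int) -> str:
    """Write a non-negative integer as a fixed-width base-4 nucleotide string."""
    digits = []
    for _ in range(width):
        value, digit = divmod(value, 4)
        digits.append(NT[digit])
    if value:
        raise ValueError("value does not fit in the requested width")
    return "".join(reversed(digits))

def bytes_to_nt(data: bytes) -> str:
    """Encode each byte as four nucleotides (2 bits per nucleotide)."""
    return "".join(int_to_nt(byte, 4) for byte in data)

def key_width(key_space: int) -> int:
    """Smallest number of nucleotides that can address key_space distinct keys."""
    width = 1
    while 4 ** width < key_space:
        width += 1
    return width

def encode_row(key: int, payload: bytes, width: int) -> str:
    """One oligo per row: primary key as the address, then the row payload."""
    return int_to_nt(key, width) + bytes_to_nt(payload)

rows = {1: b"CODD", 2: b"GRAY"}      # toy table keyed by primary key
width = key_width(max(rows) + 1)     # keys 0..2 fit in a single nucleotide
oligos = {key: encode_row(key, value, width) for key, value in rows.items()}
print(oligos)
```

With only a handful of rows the key fits in one nucleotide, whereas an offset into a large binary dump needs an address that grows with the size of the whole archive.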
  10. Reading data from DNA: Data cleaning
    • Read path for restoring unstructured data
      • Clustering and reassembly time-consuming, necessary step before decoding
    • But, pg_oligo_archive performs structure preserving encoding
      • Can map DNA read restoration to a data cleaning operation
    • Pg_oligo_restore uses schema information to restore data from DNA
    • [Figure: two read paths. Unstructured: sequencing reads (AAAACTCA, AAACTGCA, ..., AAAACTCA) go through clustering and reassembly (AAAACTCA, AAACTGCA, AAAGCTCA) and are then decoded to a bit stream (0111001010011). Structured: sequencing reads are decoded directly into noisy candidate rows, and data cleaning recovers the row 1 CODD 0.1]
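
One way to picture "restoration as data cleaning" on slide 10 is a per-key majority vote over candidate rows decoded from individual reads, with the table schema used to reject malformed candidates. This is only a sketch under that assumption; the talk does not describe how pg_oligo_restore cleans the data, and `clean` and its arguments are hypothetical names.

```python
from collections import Counter, defaultdict

def clean(candidates, schema):
    """Majority-vote each non-key column among schema-valid candidate rows.

    candidates: tuples of raw strings, one tuple per decoded read.
    schema: one parser per column (key first) that raises ValueError on bad input.
    """
    votes = defaultdict(lambda: [Counter() for _ in schema[1:]])
    for raw in candidates:
        if len(raw) != len(schema):
            continue  # wrong number of fields, e.g. after an insertion or deletion
        try:
            key, *values = (parse(field) for parse, field in zip(schema, raw))
        except ValueError:
            continue  # field does not match the column type
        for counter, value in zip(votes[key], values):
            counter[value] += 1
    return {key: tuple(counter.most_common(1)[0][0] for counter in counters)
            for key, counters in votes.items()}

reads = [
    ("1", "CODD", "0.1"),
    ("1", "CODD", "1.0"),   # substitution error in the last column
    ("1", "CODD", "0.1"),
    ("1", "CODD", "x.1"),   # not a valid float, rejected by the schema
    ("1", "CODD"),          # truncated read, rejected by arity
]
restored = clean(reads, schema=[int, str, float])
print(restored)
```

Because every read decodes to a self-describing row, no global clustering and reassembly pass is needed before voting.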
  11. Evaluation (1): DNA archival and restoration
    • PostgreSQL TPC-H SF 10^-5
    • 36 records across 8 tables, size 12KB
    • pg_oligo_archive to archive database to DNA
    • 404 oligos of 150 nt each, synthesized with Twist Bioscience
    • Sequencing with Illumina NextSeq 500
    • Deep sequencing provided very high coverage
    • pg_oligo_restore performed automated restoration
  12. Near-molecule query processing: Selection
    • Mol. Biology 101: Polymerase Chain Reaction (PCR)
      • amplify, i.e., copy ‘matching’ oligo countless times
      • need to know start and end sequences of matching oligo
    • SQL “select * from table” with PCR
      • Could potentially be extended to project columns, or select tuples with specific value of an attribute
    • [Figure: PCR amplifies the oligo AGGCTCAGATAGATCTAATT into many copies. Oligo layout: AGG (Table & Column ID), CTCAGAT (Primary Key), AGATCTA (Data + Error Correction Codes), ATT (Table & Column ID)]
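
The selection on slide 12 can be mimicked in software as a filter on the flanking sequences followed by exponential amplification. The model below is deliberately idealized and its names are hypothetical: it matches the start and end sequences as plain prefixes and suffixes (a real reverse primer binds the complementary strand), it assumes perfect doubling per cycle, and two of the three pool oligos are invented for the example.

```python
from collections import Counter

def pcr_select(pool: Counter, start: str, end: str, cycles: int = 10) -> Counter:
    """Idealized PCR: each cycle doubles every oligo flanked by start and end."""
    amplified = Counter(pool)
    for oligo in pool:
        if oligo.startswith(start) and oligo.endswith(end):
            amplified[oligo] *= 2 ** cycles
    return amplified

pool = Counter({
    "AGGCTCAGATAGATCTAATT": 1,  # oligo shown on the slide
    "AGGGATTACAAGATCTAATT": 1,  # hypothetical second row, same table and column
    "TCCCTCAGATAGATCTAGCA": 1,  # hypothetical oligo from another table
})
result = pcr_select(pool, start="AGG", end="ATT")
for oligo, copies in result.most_common():
    print(oligo, copies)
```

After amplification the oligos carrying the requested table and column ID dominate the pool, which is what makes the equivalent of "select * from table" readable by sequencing.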
  13. Near-molecule query processing: Join (1/2)
    • Molecular biology 101: Complementarity, i.e. matching base pairs (A pairs with T, G pairs with C)
    • Key technique: Annealing of complementary single stranded oligos
  14. Near-molecule query processing: Join (2/2)
    • Each attribute encoded as before: AGG CTCAGAT AGATCTA ATT
    • but an additional reversed oligo with the value complemented: TAA ATCGAGGGATTACA TTT
    • Process:
      1. Annealing binds together matching/equal attributes
      2. PCR retrieves only annealed pairs
    • [Figure: oligo AGGCTCAGATAGATCTAATT, flanked by Table & Column ID (e.g. part & partkey), and oligo TAAATCGAGGGATTACATTT, flanked by Table & Column ID (e.g. partsupp & partkey) and carrying the complementary value]
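
The annealing step behind the join can be sketched with reverse complements: a value strand and a probe strand bind when one is the reverse complement of the other. The code below is an illustrative software model only; the row IDs, the example values, and the function names are hypothetical, and it ignores the flanking Table & Column IDs and partial or mismatched binding.

```python
from collections import defaultdict

COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq: str) -> str:
    """Return the strand that anneals to seq (A<->T, C<->G, read in reverse)."""
    return seq.translate(COMPLEMENT)[::-1]

def anneal_join(left_values: dict, right_probes: dict) -> list:
    """Pair rows whose value strand and probe strand are exact reverse complements."""
    by_value = defaultdict(list)
    for row_id, probe in right_probes.items():
        by_value[reverse_complement(probe)].append(row_id)
    return [(left_id, right_id)
            for left_id, value in left_values.items()
            for right_id in by_value.get(value, [])]

# Hypothetical encoded join-key values for two tables.
part = {"p1": "AGATCTA", "p2": "CTCAGAT"}
partsupp = {"ps1": "AGATCTA", "ps2": "GGTTAAC"}

# The second table is stored as probes: the reverse complement of each value.
probes = {row_id: reverse_complement(value) for row_id, value in partsupp.items()}
matches = anneal_join(part, probes)
print(matches)
```

Only rows with equal join-key values produce an annealed pair, which corresponds to step 1 of the process on the slide; retrieving those pairs with PCR is step 2.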
  15. Evaluation (2): Near-molecule query processing (Join)
    • [Figure: gel image with lanes labeled 1 Kb plus, 10^-1 through 10^-11, and nc]
    • Encode matching records (only value attribute) from the TPC-H part and partsupp tables using two oligos
    • Perform join between the matching records in increasing background of random oligos
    • Gel electrophoresis after PCR confirms successful annealing
  16. Summary
    • All contemporary media types suffer from media obsolescence
      • Low durability and high Kryder’s rate
    • DNA provides a biological alternative to the magnetic world
      • Dense & durable as a storage substrate
      • Scalable as a computational substrate
    • DNA and DBMS: a symbiotic relationship
      • DNA can act as a zetta-scale active archive with near-data processing
      • DBMS knowledge can be used to optimize read/write path