== OligoArchive: A DNA-based database archive for PostgreSQL ==

Raja Appuswamy, Assistant Professor, Data Science Department, EURECOM, Biot, France

Growth of archival data
• “50% of 175ZB global datasphere will be enterprise data in 2025” [IDC]
• “80% data is cold, and increasing at 60% CAGR” [Horison]
• Tape provides lowest cost/GB for data archives today
[Figure: storage hierarchy of DRAM, SSD, HDD, and Tape with access latencies of ns, µs, ms, and mins; labels: Performance, Capacity, Archival Data, Access Latency, Storage Cost]

Problems with tape
• “Kryder’s rate of tape: 31%/YR average” [Fontana]
• Tape has a lifetime of 10-20 years
• Continuous data migration
• “60% of archival data stored longer than 20 years” [SNIA]

Net effect: Media Obsolescence
• “There’s going to be a large dead period,” he told me, “from the late ’90s through 2020, where most media will be lost.”
• Enterprise DBMS archives will soon face obsolescence

DNA as a digital storage medium

Why DNA?
• Dense
• Durable
• Kryder’s rate: 0
How do we use DNA as an archival medium for relational DBMS?

Project OligoArchive: EU FET Initiative
OligoArchive (https://oligoarchive.eu) is a €3M European Commission funded research effort to deliver the building blocks needed to make DNA storage a reality. It involves six partners across three countries.
• Eurecom, Univ. Nice, CNRS, Imperial College, Helixworks
Interdisciplinary team & research agenda:
• Computer Science
– Efficient encoding of structured and unstructured data
• Molecular biology
– Near-molecule data analysis for content detection in DNA
– Accelerated sequencing and thus reading for DNA storage
• Biochemistry
– Novel, cost-effective synthesis technology for DNA storage
• Robotics/microfluidics
– Automation of manual library prep for writing and readout

OligoArchive: DNA Storage Stack
Goal: implement a custom storage stack for data archival on DNA
• Application Layer: Encoding structured (database) and unstructured (imaging) data
• Controller Layer: Data processing capabilities
• OS Layer: File system abstraction
• Media Layer: Synthesis and Sequencing Automation

Database archival architecture
[Figure labels: DNA Synthesizer, PCR thermocycler, DNA Sequencer, DNA library, DBMS, pg_oligo_archive, pg_oligo_restore, DNA storage system; example table rows (1, CODD, 0.1) and (2, GRAY, 0.2)]
Example messages exchanged with the DNA storage system:
[ Get, [OID: GTTCAG] ]
[ Put, [OID: GTTCAG], [value: ATATGTGAGT], [value: GATGGATCTA] ]
[ [value: ATATGTGAGT], [value: ATGTGAGT…], [value: GATGTATCTA] [value: GATGGATCTATT] [value: GATGGATCTA] ]

Writing data to DNA (1): The unstructured way
• Issues using DNA as a storage medium
• Limited DNA (oligo) length, homopolymer/G-C constraint, indels/subst. errors
• Approach-1: Dump database to a binary archive file and encode
• Limitations
• log4(#segments) nucleotides reserved for offset (1TB => 17 nts in 150nt oligo)
• Cannot perform near-molecule data processing
[Figure: pg_archive dumps table rows (1, CODD, 0.1) and (2, GRAY, 0.2) to a binary stream 0111001010011; the Encoder emits oligos AAAACTCA, AAACTGCA, AAAGCAGT, each sent to Synthesis; oligo layout: AAAA (Unique offset) followed by CTCA (Data)]
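The offset overhead quoted on this slide can be approximated with a few lines of arithmetic. The sketch below is an illustration only: it assumes 2 bits per nucleotide, 1TB = 10^12 bytes, and no space reserved for primers or error correction, none of which the slide states. The function names `address_length` and `offset_overhead` are hypothetical.

```python
def address_length(num_items: int) -> int:
    """Smallest k with 4**k >= num_items: nucleotides needed to index num_items."""
    k = 0
    while 4 ** k < num_items:
        k += 1
    return k


def offset_overhead(total_bytes: int, oligo_len: int, bits_per_nt: int = 2) -> tuple:
    """Return (offset_nts, num_segments) for a flat archive split across oligos.

    Assumption: every oligo spends k nucleotides on the offset and the
    remaining (oligo_len - k) nucleotides on payload.
    """
    total_bits = total_bytes * 8
    for k in range(oligo_len):
        payload_bits = (oligo_len - k) * bits_per_nt
        segments = -(-total_bits // payload_bits)  # ceiling division
        if 4 ** k >= segments:
            return k, segments
    raise ValueError("archive too large for this oligo length")


if __name__ == "__main__":
    print(offset_overhead(10 ** 12, 150))  # flat 1TB archive in 150nt oligos
    print(address_length(36))              # indexing 36 records directly
```

Under these assumptions the 1TB case needs 18 offset nucleotides, within one nucleotide of the 17 quoted on the slide; the difference depends on how payload density is counted, which the slide does not specify. Indexing a table of 36 records, by contrast, needs only 3 nucleotides.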

Writing data to DNA (2): Structured data layout
• NSM on DNA: one row per oligo
• Use unique primary key to avoid additional indexing
• DSM on DNA: columnset partitioning for “large” rows
• Reduces overhead from log4(#segments) to log4(cardinality) (#segs. >> Card.)
• pg_oligo_archive picks the right policy for each table automatically
[Figure: pg_oligo_archive examples; table rows (1, CODD, 0.1) and (2, GRAY, 0.2); example oligos CTCAGTAG, TGCACGAT, TCATGACT, GCTGATA, each sent to Synthesis]
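To make the "one row per oligo" idea concrete, here is a minimal sketch of a row-oriented layout in which the primary key doubles as the address. It is not the pg_oligo_archive encoder: the 2-bits-per-nucleotide mapping, the field order, and the names `bytes_to_nts`, `int_to_nts`, and `encode_row` are all assumptions made for illustration, and the sketch ignores homopolymer/G-C constraints and error correction.

```python
NTS = "ACGT"  # assumed mapping: 00->A, 01->C, 10->G, 11->T


def bytes_to_nts(data: bytes) -> str:
    """Encode each byte as four nucleotides, most significant bits first."""
    return "".join(NTS[(b >> shift) & 3] for b in data for shift in (6, 4, 2, 0))


def nts_to_bytes(seq: str) -> bytes:
    """Inverse of bytes_to_nts; len(seq) must be a multiple of 4."""
    out = bytearray()
    for i in range(0, len(seq), 4):
        value = 0
        for ch in seq[i:i + 4]:
            value = (value << 2) | NTS.index(ch)
        out.append(value)
    return bytes(out)


def int_to_nts(value: int, width: int) -> str:
    """Fixed-width base-4 encoding of a non-negative integer."""
    digits = []
    for _ in range(width):
        digits.append(NTS[value % 4])
        value //= 4
    if value:
        raise ValueError("value does not fit in the given width")
    return "".join(reversed(digits))


def encode_row(table_id: str, key: int, payload: bytes, key_width: int) -> str:
    """One oligo per row: table ID, then primary key as address, then data."""
    return table_id + int_to_nts(key, key_width) + bytes_to_nts(payload)


if __name__ == "__main__":
    oligo = encode_row("AGG", 1, b"CODD", key_width=3)
    print(oligo)
    print(nts_to_bytes(oligo[6:]))  # strip table ID and key, decode payload
```

Because the key field is sized by table cardinality rather than by the number of segments in a flat dump, the addressing cost per oligo shrinks accordingly.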

Reading data from DNA: Data cleaning
• Read path for restoring unstructured data
• Clustering and reassembly time-consuming, necessary step before decoding
• But, pg_oligo_archive performs structure preserving encoding
• Can map DNA read restoration to a data cleaning operation
• pg_oligo_restore uses schema information to restore data from DNA
[Figure: unstructured read path: Sequencing (reads AAAACTCA, AAACTGCA, ..., AAAACTCA), then Clustering and Reassembly (AAAACTCA, AAACTGCA, AAAGCTCA), then Decode (0111001010011). Structured read path: Sequencing (reads AAAACTCA, AAACTGCA, ..., AAAACTCA), then Decode, then Data cleaning, which recovers the row (1, CODD, 0.1) from noisy decoded values]
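The slide does not say which cleaning algorithm pg_oligo_restore uses. As one plausible illustration of treating restoration as data cleaning, the sketch below decodes every read independently, discards reads that violate the schema, groups the rest by primary key, and takes a per-column majority vote. The function name `clean_reads` and the input format are hypothetical.

```python
from collections import Counter, defaultdict


def clean_reads(decoded_reads, schema):
    """Recover one row per primary key from noisy decoded reads.

    decoded_reads: iterable of tuples of strings, one tuple per sequenced read.
    schema: list of casting callables (e.g. int, str, float); column 0 is
            assumed to be the primary key.
    """
    groups = defaultdict(list)
    for read in decoded_reads:
        if len(read) != len(schema):
            continue  # wrong number of columns: drop the read
        try:
            row = tuple(cast(value) for cast, value in zip(schema, read))
        except ValueError:
            continue  # a value does not parse as its column type: drop the read
        groups[row[0]].append(row)

    cleaned = []
    for key in sorted(groups):
        columns = zip(*groups[key])
        # Majority vote per column across all reads sharing this key.
        cleaned.append(tuple(Counter(col).most_common(1)[0][0] for col in columns))
    return cleaned


if __name__ == "__main__":
    reads = [
        ("1", "CODD", "0.1"),
        ("1", "CODD", "1.0"),   # substitution error in the last column
        ("1", "CODD", "0.1"),
        ("1", "COD"),           # truncated read
        ("x", "CODD", "0.1"),   # key no longer parses as an integer
    ]
    print(clean_reads(reads, [int, str, float]))
```

A read whose key is corrupted into another valid key would land in the wrong group here; a real system would need error correction or coverage thresholds to handle that case.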

Evaluation (1): DNA archival and restoration
• PostgreSQL TPC-H SF 10^-5
• 36 records across 8 tables, size 12KB
• pg_oligo_archive to archive database to DNA
• 404 oligos of 150nt synthesized with Twist Bioscience
• Sequencing with Illumina NextSeq 500
• Deep sequencing provided very high coverage
• pg_oligo_restore performed automated restoration

Near-molecule query processing: Selection
• Mol. Biology 101: Polymerase Chain Reaction (PCR)
• amplify, i.e., copy ‘matching’ oligo countless times
• need to know start and end sequences of matching oligo
• SQL “select * from table” with PCR
• Could potentially be extended to project columns, or select tuples with specific value of an attribute
[Figure: PCR amplifying many copies of the oligo AGGCTCAGATAGATCTAATT; oligo layout: AGG (Table & Column ID), CTCAGAT (Primary Key), AGATCTA (Data + Error Correction Codes), ATT (Table & Column ID)]
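The selection idea can be mimicked in software as a filter on the start and end sequences of each oligo. This is an abstraction, not a PCR simulation: real primers are longer and anneal to complementary strands, whereas the sketch uses plain string matching. The field widths (3, 7, 7, 3) are read off the slide's example oligo, and the names `pcr_select` and `split_fields` are hypothetical.

```python
FIELD_WIDTHS = (3, 7, 7, 3)  # table/column ID, primary key, data + ECC, table/column ID


def pcr_select(pool, start, end):
    """Keep oligos that begin with `start` and finish with `end`."""
    return [oligo for oligo in pool if oligo.startswith(start) and oligo.endswith(end)]


def split_fields(oligo, widths=FIELD_WIDTHS):
    """Cut an oligo into its fixed-width fields."""
    fields, pos = [], 0
    for width in widths:
        fields.append(oligo[pos:pos + width])
        pos += width
    return fields


if __name__ == "__main__":
    pool = [
        "AGGCTCAGATAGATCTAATT",
        "TAAATCGAGGGATTACATTT",
        "AGGCTCAGATAGATCTAATT",
    ]
    # Analogue of "select * from table" for the table tagged AGG ... ATT.
    for oligo in pcr_select(pool, "AGG", "ATT"):
        print(split_fields(oligo))
```

Extending the start sequence to cover the primary key field would turn the same filter into a point lookup, in the spirit of the slide's last bullet.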

Near-molecule query processing: Join (1/2)
• Molecular biology 101: Complementarity
– matching base pairs: A with T, G with C
• Key technique: Annealing of complementary single stranded oligos

Near-molecule query processing: Join (2/2)
• Each attribute encoded as before: AGG CTCAGAT AGATCTA ATT
• but an additional reversed oligo with the value complemented: TAA ATCGAGGGATTACA TTT
• Process:
1. Annealing binds together matching/equal attributes
2. PCR retrieves only annealed pairs
[Figure: oligo AGGCTCAGATAGATCTAATT labeled Table & Column ID, e.g. part & partkey; oligo TAAATCGAGGGATTACATTT labeled Table & Column ID, e.g. partsupp & partkey, and Complementary value]
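The join relies on a value strand binding to its reverse complement. The sketch below models that in software: one table stores each join value as-is, the other stores its reverse complement, and two records join when the stored sequences would anneal. It does not reproduce the slide's example oligos, which appear to be illustrative; the record format and the names `reverse_complement`, `anneals`, and `join_by_annealing` are assumptions.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Complement every base, then reverse the strand."""
    return seq.translate(COMPLEMENT)[::-1]


def anneals(strand: str, other: str) -> bool:
    """True when the two single strands are exact reverse complements."""
    return strand == reverse_complement(other)


def join_by_annealing(left, right):
    """Pair records whose value strands would anneal.

    left:  list of (row_id, value_seq), value stored as-is.
    right: list of (row_id, stored_seq), value stored reverse-complemented.
    """
    return [
        (left_id, right_id)
        for left_id, value in left
        for right_id, stored in right
        if anneals(value, stored)
    ]


if __name__ == "__main__":
    part = [("part-1", "AGATCTA"), ("part-2", "CTCAGAT")]
    partsupp = [
        ("partsupp-1", reverse_complement("AGATCTA")),
        ("partsupp-2", reverse_complement("GGGTTTA")),
    ]
    print(join_by_annealing(part, partsupp))
```

In the wet lab the pairwise comparison happens in parallel across the whole pool during annealing; the nested loop here only stands in for that chemistry.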

Evaluation (2): Near-molecule query processing (Join)
• Encode matching records (only value attribute) from the TPC-H part and partsupp tables using two oligos
• Perform join between the matching records in increasing background of random oligos
• Gel electrophoresis after PCR confirms successful annealing
[Figure: gel image with lanes labeled 1 Kb plus, 10^-1 through 10^-11, and nc]

Summary
• All contemporary media types suffer from media obsolescence
• Low durability and high Kryder’s rate
• DNA provides a biological alternative to the magnetic world
• Dense & durable as a storage substrate
• Scalable as a computational substrate
• DNA and DBMS – a symbiotic relationship
• DNA can act as a zetta-scale active archive with near-data processing
• DBMS knowledge can be used to optimize read/write path

UAG UGA UAA

Warsaw PostgreSQL Users Group
