A talk I gave at PyWaw Summit 2015.
SMALL DATA: STORAGE FOR THE REST OF US. Andrew Godwin, @andrewgodwin
Hi, I'm Andrew Godwin. Django Core Developer. Senior Engineer at [company logo]. Far too many hobbies.
BIG DATA: What does it mean? What is 'big'?
1,000 rows? 1,000,000 rows? 1,000,000,000 rows? 1,000,000,000,000 rows?
Scalable designs are a tradeoff: NOW vs LATER
Small company? Agency? Focus on ease of change, not scalability.
You don't need to scale from day one. But always leave yourself scaling points.
Rapid development. Continuous deployment. Hardware choice. Scaling 'breakpoints'.
Rapid development: It's all about schema change overhead
Explicit Schema
ID (int)  Name (text)  Weight (uint)
1         Alice        76
2         Bob          84
3         Charles      65
Implicit Schema
{"id": 342, "name": "David", "weight": 44}
Silent Failure
{"id": 342, "name": "David", "weight": 74}
{"id": 342, "name": "Ellie", "weight": "85kg"}
{"id": 342, "nom": "Frankie", "weight": 77}
{"id": 342, "name": "Frankie", "weight": -67}
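One way to stop an implicit schema from failing silently is to check each document in application code before it is written. The sketch below is only an illustration: the `SCHEMA` mapping, the `validate` helper and the "weight must not be negative" rule are assumptions made up for this example (the last one standing in for the unsigned column type), not something the talk prescribes.

```python
# Hypothetical schema for the example documents: field name -> required type.
SCHEMA = {"id": int, "name": str, "weight": int}


def validate(doc):
    """Return a list of problems found in doc; an empty list means it passes."""
    problems = []
    for field, expected in SCHEMA.items():
        if field not in doc:
            problems.append(f"missing field: {field}")
        elif not isinstance(doc[field], expected):
            problems.append(f"{field} should be {expected.__name__}")
    for field in doc:
        if field not in SCHEMA:
            problems.append(f"unexpected field: {field}")
    # Assumed rule standing in for the unsigned column in the explicit schema.
    if isinstance(doc.get("weight"), int) and doc["weight"] < 0:
        problems.append("weight must not be negative")
    return problems


# The four documents from the "Silent Failure" slide.
docs = [
    {"id": 342, "name": "David", "weight": 74},
    {"id": 342, "name": "Ellie", "weight": "85kg"},
    {"id": 342, "nom": "Frankie", "weight": 77},
    {"id": 342, "name": "Frankie", "weight": -67},
]
results = [validate(doc) for doc in docs]
```

Only the first document passes. The other three are each accepted by a schemaless store without complaint, which is exactly the failure an explicit schema would have rejected at write time.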
Continuous deployment: It's 11pm. Do you know where your locks are?
Add NULL and backfill. 1-to-1 relation and backfill. DBMS-supported type changes.
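The "add NULL and backfill" pattern can be sketched roughly as follows, using SQLite from the Python standard library so it runs anywhere. The `people` table, the `weight_grams` column and the batch size are hypothetical names chosen for this example. The idea, as read from the slide, is to add the new column as nullable and then fill it in small batches so that no single transaction holds locks for long.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE people (id INTEGER PRIMARY KEY, name TEXT, weight INTEGER)"
)
conn.executemany(
    "INSERT INTO people (id, name, weight) VALUES (?, ?, ?)",
    [(i, f"person{i}", 60 + i) for i in range(1, 11)],
)
conn.commit()

# Step 1: add the new column as nullable, with no default to compute per row.
conn.execute("ALTER TABLE people ADD COLUMN weight_grams INTEGER")

# Step 2: backfill in small batches, committing after each one so every
# transaction stays short.
BATCH_SIZE = 3  # deliberately tiny for the demo
batches = 0
while True:
    cur = conn.execute(
        "UPDATE people SET weight_grams = weight * 1000 "
        "WHERE id IN (SELECT id FROM people "
        "WHERE weight_grams IS NULL LIMIT ?)",
        (BATCH_SIZE,),
    )
    conn.commit()
    if cur.rowcount == 0:
        break  # nothing left to backfill
    batches += 1

remaining = conn.execute(
    "SELECT COUNT(*) FROM people WHERE weight_grams IS NULL"
).fetchone()[0]
```

On a production database the same two steps would normally run as separate deploys, with the application writing to both columns in between. Locking behaviour differs between database systems, so treat this as an outline rather than a recipe.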
Hardware choice: ZOMG RUN IT ON THE CLOUD
VMs are TERRIBLE at IO: Up to 10x slowdown, even with VT-d.
Memory is king: Your database loves it. Don't let other apps steal it.
Adding more power goes far: Especially with PostgreSQL or read-only replicas
Scaling Breakpoints
Sharding point: Datasets partitioned by primary key
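A sharding point of this kind can be as small as one function that maps a primary key to a shard. The sketch below assumes four shards with made-up names and uses a CRC32 hash of the key. It is one possible routing scheme, not the one the talk necessarily has in mind.

```python
import zlib

# Hypothetical shard names; in practice these would be database aliases or DSNs.
SHARDS = ["shard0", "shard1", "shard2", "shard3"]


def shard_for(primary_key):
    """Pick a shard for a row based only on its primary key."""
    # crc32 is stable across processes and machines, unlike Python's built-in
    # hash() for strings, which is randomised per process by default.
    digest = zlib.crc32(str(primary_key).encode("utf-8"))
    return SHARDS[digest % len(SHARDS)]


# Every lookup for the same key lands on the same shard.
home = shard_for(342)
```

Keeping all data access behind a function like this from the start is what leaves the scaling point open: with a single database it can simply return the same name every time. Note that plain modulo routing moves most keys when the number of shards changes, so growing the cluster later needs a migration plan or a different scheme.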
Vertical split: Entirely unrelated tables
Denormalisation: It's not free!
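To make the cost concrete, here is a small sketch with SQLite from the Python standard library. The `posts` and `comments` tables and the `add_comment` helper are hypothetical. The denormalised `comment_count` column makes reads cheaper, but every write path now has to update two places in one transaction, or the copy drifts from the truth.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE posts (
        id INTEGER PRIMARY KEY,
        title TEXT,
        comment_count INTEGER NOT NULL DEFAULT 0  -- denormalised copy
    );
    CREATE TABLE comments (
        id INTEGER PRIMARY KEY,
        post_id INTEGER NOT NULL,
        body TEXT
    );
    INSERT INTO posts (id, title) VALUES (1, 'Small data');
    """
)


def add_comment(post_id, body):
    # "with conn" commits both statements together, or rolls both back if
    # either one raises, so the counter cannot drift from the real rows.
    with conn:
        conn.execute(
            "INSERT INTO comments (post_id, body) VALUES (?, ?)",
            (post_id, body),
        )
        conn.execute(
            "UPDATE posts SET comment_count = comment_count + 1 WHERE id = ?",
            (post_id,),
        )


add_comment(1, "First")
add_comment(1, "Second")

cached = conn.execute(
    "SELECT comment_count FROM posts WHERE id = 1"
).fetchone()[0]
actual = conn.execute(
    "SELECT COUNT(*) FROM comments WHERE post_id = 1"
).fetchone()[0]
```

Deletes, bulk imports and manual fixes all need the same treatment, which is where the ongoing cost of denormalisation tends to show up.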
Consistency leeway: Can you take inconsistent views?
Load Shapes
Read-heavy, Write-heavy, Large size
Examples: Wikipedia, TV show website, Minecraft, Forums, Amazon Glacier, Eventbrite, Logging
Approaches: Offline storage, Append formats, In-memory cache / flat files, Many indexes, Fewer indexes
Extremes
Extreme Reads: Heavy Replication
Extreme Writes: Sacrifice ordering or consistency
Extreme Size: Sacrifice query time
Extreme Longevity: Flash in cold storage
Extreme Survivability: Rad-hardened Flash
Extreme Auditability: True append only storage
Approximate time to bit flip, unpowered at room temperature
Media shown: SSDs, Magnetic Tape, Hard Drives, Consumer Flash, CDs/DVDs, Long-life Flash, Metal-Carbon DVDs
Time ranges shown: 3-6 months, 5-10 years, 3-5 years, 100+ years
Big Data isn't one thing: It depends on type, size, complexity, throughput, latency...
Focus on the current problems: Future problems don't matter if you never get there
Efficiency and iterating fast matter: The smaller you are, the more your time is worth
Good architecture affects product: You're not writing a system in a vacuum
Thanks. Andrew Godwin, @andrewgodwin