Big changes
• 17 more systems in stage
• Postgres: +10 systems to store all raw and processed JSON (5 pairs) (SSDs?)
• Ceph (or something) instead of HBase
• Likely get rid of the NetApp...
System type      Who
HBase            BIDW/Annie
Non-HBase        jakem, cshields
Elasticsearch    adrian/solarce/bugzilla
RabbitMQ         solarce
NFS (symbols)    lerxst
Zeus             jakem
Add'l systems    lonnen, lars, me, adrian
Next steps
• Test processing throughput (Lars)
• Implement Ceph/S3 crashstorage class (see the sketch below)
• Test Ceph (Inktank meeting Friday!)
• Plan for symbols (see Ted later this morning)
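A minimal sketch of what a Ceph/S3 crashstorage class could look like, assuming Ceph's S3-compatible radosgw endpoint and the boto library; the class and method names (CephCrashStorage, save_raw_crash, etc.) are illustrative, not Socorro's actual crashstorage API:

```python
import json

import boto
import boto.s3.connection


class CephCrashStorage(object):
    """Illustrative crash storage backed by Ceph's S3-compatible radosgw."""

    def __init__(self, host, port, access_key, secret_key, bucket_name):
        self.conn = boto.connect_s3(
            aws_access_key_id=access_key,
            aws_secret_access_key=secret_key,
            host=host,
            port=port,
            is_secure=False,
            calling_format=boto.s3.connection.OrdinaryCallingFormat(),
        )
        # create_bucket returns the existing bucket if we already own it
        self.bucket = self.conn.create_bucket(bucket_name)

    def save_raw_crash(self, raw_crash, dump, crash_id):
        # store the metadata JSON and the binary dump under predictable keys
        self.bucket.new_key('%s.raw_crash.json' % crash_id) \
            .set_contents_from_string(json.dumps(raw_crash))
        self.bucket.new_key('%s.dump' % crash_id) \
            .set_contents_from_string(dump)

    def save_processed(self, processed_crash):
        crash_id = processed_crash['uuid']
        self.bucket.new_key('%s.processed_crash.json' % crash_id) \
            .set_contents_from_string(json.dumps(processed_crash))

    def get_raw_crash(self, crash_id):
        key = self.bucket.get_key('%s.raw_crash.json' % crash_id)
        return json.loads(key.get_contents_as_string())
```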
Assumptions
• Durability: No loss of user-submitted data (crashes)
• Size: Need a distributed storage mechanism for ~60TB of crash dumps (current footprint: 50TB unreplicated, ~150TB replicated x3)
Do we need to store raw crashes / processed JSON in HBase?
If HBase is to continue as our primary crash storage, yes: we need all three of raw_crash, raw_dump and processed_crash in there. Saving raw_crash and processed_crash there is required if we are to continue to support Map/Reduce jobs on our data.
Assumptions
Performance: Need to store single crashes in a timely fashion for crashmovers. The only time requirement is that priority jobs must be saved, retrieved and processed within 60 seconds. Since any crash could potentially be a priority job, we must be able to store from the mover within seconds.
Assumptions
HBase is a CP (consistent, partition-tolerant) system. This wasn't initially an explicit requirement, but it is now important architecturally for our processors and front-end, which assume consistency.
Theory
• To replace HDFS/HBase/Hadoop, we'll likely need a combination of a few new systems.
• If we use an AP (available, partition-tolerant) system instead of a CP one, we'll need another layer to ensure consistency.
GlusterFS
• Supported by Red Hat; lacks an API, just looks like a filesystem
• Probably too bare-bones for our needs
• We've already been down the NFS road...
Ceph Cons
• No prior ops experience (though Moz Ops deployed a test cluster!)
• Need to test performance (but not likely to be a dealbreaker)
• Need a Map/Reduce story (maybe)
Cassandra Cons
• Potential loss of data on write due to network partition/node loss: http://aphyr.com/posts/294-call-me-maybe-cassandra
• Not designed for large object storage
• Best fit for a starter streaming/reporting system
Larger re-architecture
• Pursue Kafka + a streaming system (like LinkedIn/Twitter)
• Requires more research, more dev involvement
• Peter prototyped a Kafka consumer (see the sketch below)
• Point is faster mean-time-to-reports, not immediate access to data
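For reference, a consumer along the lines of what Peter prototyped might look like the sketch below, assuming the kafka-python library, a hypothetical processed_crashes topic, and a placeholder broker address; it just tallies crashes per signature to illustrate the streaming-reports idea:

```python
import json

from kafka import KafkaConsumer  # kafka-python

# Topic, broker, and group names here are placeholders for illustration.
consumer = KafkaConsumer(
    'processed_crashes',
    bootstrap_servers=['kafka1.example.com:9092'],
    group_id='report-builder',
    value_deserializer=lambda raw: json.loads(raw.decode('utf-8')),
)

# Tally crashes per signature as they stream in; a real reporter would
# periodically flush these counts to a reporting store.
counts = {}
for message in consumer:
    crash = message.value
    signature = crash.get('signature', 'unknown')
    counts[signature] = counts.get(signature, 0) + 1
```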
Next steps
• Performance test Ceph (a starting-point benchmark is sketched below)
• Performance test Cassandra, implement reports (TCBS?)
• Report back, evaluate whether more research into streaming is warranted
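A rough single-threaded write benchmark for the Ceph performance test could start as simply as this, again assuming boto against a radosgw endpoint; the host, credentials, payload size, and iteration count are all placeholders:

```python
import time
import uuid

import boto
import boto.s3.connection

# Connect to the S3-compatible radosgw endpoint (placeholder host/keys).
conn = boto.connect_s3(
    aws_access_key_id='ACCESS_KEY',
    aws_secret_access_key='SECRET_KEY',
    host='radosgw.example.com',
    is_secure=False,
    calling_format=boto.s3.connection.OrdinaryCallingFormat(),
)
bucket = conn.create_bucket('crash-bench')

payload = 'x' * (500 * 1024)  # arbitrary payload size for the test
count = 100

start = time.time()
for _ in range(count):
    key = bucket.new_key('bench/%s.dump' % uuid.uuid4().hex)
    key.set_contents_from_string(payload)
elapsed = time.time() - start

print('%d writes in %.1fs (%.1f crashes/sec)' % (count, elapsed, count / elapsed))
```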