building full-‐stack web soHware systems with a past focus on e-‐commerce and publishing. Currently responsible for building engineering capability to enable Traackr's growth goals.
about Big Data/NoSQL: what’s behind the buzz words? • Our reasons and method for picking a NoSQL database • Share the lessons we learned going through the process
sets too large or coming in at too high a velocity to process using tradi;onal databases or desktop tools. E.g. big science web logs rfid sensor networks social networks social data internet text and documents internet search indexing call detail records Astronomy atmospheric science genomics biogeochemical military surveillance medical records photography archives video archives large-‐scale e-‐commerce
• NoSQL ≈ Not Only SQL • NoSQL addresses RDBMS limita;ons, it’s not about the SQL language • RDBMS = sta;c schema • NoSQL = schema flexibility; don’t have to know exact structure before storing
a problem into many tasks, each of which can be solved by one or more computers • Allows computa;ons to be accomplished in acceptable ;meframes • Distributed computa;on approaches were developed to leverage mul;ple machines: MapReduce • With MapReduce, the program goes to the data since the data is too big to move
being able to process large amounts of data in real-‐;me can yield a compe;;ve advantage. E.g. – Online retailers leveraging buying history and click-‐ though data for real-‐;me recommenda;ons • No ;me to wait for MapReduce jobs to finish • Solu;ons: streaming processing (e.g. Twider Storm), pre-‐compu;ng (e.g. aggregate and count analy;cs as data arrives), quick to read key/value stores (e.g. distributed hashes)
Data Science • Data Scien;st ≈ Sta;s;cian • Possess scien;fic discipline & exper;se • Formulate and test hypotheses • Understand the math behind the algorithms so they can tweak when they don’t work • Can dis;ll the results into an easy to understand story • Help businesses gain ac;onable insights
to launch a new plakorm; how does it find the most influen;al IT bloggers on the web that can help bring visibility to the new product? How does it find the opinion leaders, the people that mader?
• Distributed hashtables • Designed for high load • In-‐memory or on-‐disk • Eventually consistent Column Databases • Spread sheet like • Key is a row id • Adributes are columns • Columns can be grouped into families Document Databases • Like Key/Value • Value = Document • Document = JSON/BSON • JSON = Flexible Schema Graph Databases • Graph Theory G=(E,V) • Great for modeling networks • Great for graph-‐based query algorithms
• Distributed hashtables • Designed for high load • In-‐memory or on-‐disk • Eventually consistent Column Databases • Spread sheet like • Key is a row id • Adributes are columns • Columns can be grouped into families Document Databases • Like Key/Value • Value = Document • Document = JSON/BSON • JSON = Flexible Schema Graph Databases • Graph Theory G=(E,V) • Great for modeling networks • Great for graph-‐based query algorithms
• Designed for high load • In-‐memory or on-‐disk • Eventually consistent Column Databases • Spread sheet like • Key is a row id • Adributes are columns • Columns can be grouped into families Document Databases • Like Key/Value • Value = Document • Document = JSON/BSON • JSON = Flexible Schema Graph Databases • Graph Theory G=(E,V) • Great for modeling networks • Great for graph-‐based query algorithms Graph Databases: while we can model our domain as a graph we don’t want to pigeonhole ourselves into this structure. We’d rather use these tools for specialized data analysis but not as the main data store.
• Designed for high load • In-‐memory or on-‐disk • Eventually consistent Column Databases • Spread sheet like • Key is a row id • Adributes are columns • Columns can be grouped into families Document Databases • Like Key/Value • Value = Document • Document = JSON/BSON • JSON = Flexible Schema Graph Databases • Graph Theory G=(E,V) • Great for modeling networks • Great for graph-‐based query algorithms Memcache: memory-‐based, we need true persistence
• Designed for high load • In-‐memory or on-‐disk • Eventually consistent Column Databases • Spread sheet like • Key is a row id • Adributes are columns • Columns can be grouped into families Document Databases • Like Key/Value • Value = Document • Document = JSON/BSON • JSON = Flexible Schema Graph Databases • Graph Theory G=(E,V) • Great for modeling networks • Great for graph-‐based query algorithms Amazon SimpleDB: not willing to store our data in a proprietary datastore.
• Designed for high load • In-‐memory or on-‐disk • Eventually consistent Column Databases • Spread sheet like • Key is a row id • Adributes are columns • Columns can be grouped into families Document Databases • Like Key/Value • Value = Document • Document = JSON/BSON • JSON = Flexible Schema Graph Databases • Graph Theory G=(E,V) • Great for modeling networks • Great for graph-‐based query algorithms Not willing to store our data in a proprietary datastore. Redis and LinkedIn’s Project Voldermort: no query filters, beder used as queues or distributed caches
• Designed for high load • In-‐memory or on-‐disk • Eventually consistent Column Databases • Spread sheet like • Key is a row id • Adributes are columns • Columns can be grouped into families Document Databases • Like Key/Value • Value = Document • Document = JSON/BSON • JSON = Flexible Schema Graph Databases • Graph Theory G=(E,V) • Great for modeling networks • Great for graph-‐based query algorithms CouchDB: no ad-‐hoc queries; maturity in early 2010 made us shy away although we did try early prototypes.
• Designed for high load • In-‐memory or on-‐disk • Eventually consistent Column Databases • Spread sheet like • Key is a row id • Adributes are columns • Columns can be grouped into families Document Databases • Like Key/Value • Value = Document • Document = JSON/BSON • JSON = Flexible Schema Graph Databases • Graph Theory G=(E,V) • Great for modeling networks • Great for graph-‐based query algorithms Cassandra: in early 2010, maturity ques;ons, no secondary indexes and no batch processing op;ons (came later on).
• Designed for high load • In-‐memory or on-‐disk • Eventually consistent Column Databases • Spread sheet like • Key is a row id • Adributes are columns • Columns can be grouped into families Document Databases • Like Key/Value • Value = Document • Document = JSON/BSON • JSON = Flexible Schema Graph Databases • Graph Theory G=(E,V) • Great for modeling networks • Great for graph-‐based query algorithms MongoDB: in early 2010, maturity ques;ons, adop;on ques;ons and no batch processing op;ons.
• Designed for high load • In-‐memory or on-‐disk • Eventually consistent Column Databases • Spread sheet like • Key is a row id • Adributes are columns • Columns can be grouped into families Document Databases • Like Key/Value • Value = Document • Document = JSON/BSON • JSON = Flexible Schema Graph Databases • Graph Theory G=(E,V) • Great for modeling networks • Great for graph-‐based query algorithms Riak: very close but in early 2010, we had adop;on ques;ons.
• Designed for high load • In-‐memory or on-‐disk • Eventually consistent Column Databases • Spread sheet like • Key is a row id • Adributes are columns • Columns can be grouped into families Document Databases • Like Key/Value • Value = Document • Document = JSON/BSON • JSON = Flexible Schema Graph Databases • Graph Theory G=(E,V) • Great for modeling networks • Great for graph-‐based query algorithms HBase: came across as the most mature at the ;me, with several deployments, a healthy community, "out-‐of-‐the box" secondary indexes through a contrib and support for batch processing using Hadoop/MR .
• Designed for high load • In-‐memory or on-‐disk • Eventually consistent Column Databases • Spread sheet like • Key is a row id • Adributes are columns • Columns can be grouped into families Document Databases • Like Key/Value • Value = Document • Document = JSON/BSON • JSON = Flexible Schema Graph Databases • Graph Theory G=(E,V) • Great for modeling networks • Great for graph-‐based query algorithms
have a Hadoop HDFS cluster of at least 2x replica;on factor nodes Must have an odd number of Zookeeper quorum nodes Then you can run your Hbase nodes but it’s recommended to co-‐locate regionservers with hadoop datanodes so you have to manage resources. Master/slave architecture means a single point of failure, so you need to protect your master. And then we also have to manage the MapReduce processes and resources in the Hadoop layer.
with secondary indexing out-‐of-‐the-‐box • It’s been moved out of the trunk to GitHub • Where only one other company besides us seems to care about it
hdp://www.huffingtonpost.com/arianna-‐huffington/post_1.html hdp://www.huffingtonpost.com/arianna-‐huffington/post_2.html hdp://www.huffingtonpost.com/arianna-‐huffington/post_3.html hdp://www.huffingtonpost.com/shaun-‐donovan/post1.html hdp://www.huffingtonpost.com/shaun-‐donovan/post2.html hdp://www.huffingtonpost.com/shaun-‐donovan/post3.html writes for authored by published under writes for authored by published under
hdp://www.huffingtonpost.com/arianna-‐huffington/post_1.html hdp://www.huffingtonpost.com/arianna-‐huffington/post_2.html hdp://www.huffingtonpost.com/arianna-‐huffington/post_3.html hdp://www.huffingtonpost.com/shaun-‐donovan/post1.html hdp://www.huffingtonpost.com/shaun-‐donovan/post2.html hdp://www.huffingtonpost.com/shaun-‐donovan/post3.html writes for authored by published under writes for authored by published under Denormalized/duplicated for fast run;me access and storage of influencer-‐ to-‐site rela;onship proper;es
hdp://www.huffingtonpost.com/arianna-‐huffington/post_1.html hdp://www.huffingtonpost.com/arianna-‐huffington/post_2.html hdp://www.huffingtonpost.com/arianna-‐huffington/post_3.html hdp://www.huffingtonpost.com/shaun-‐donovan/post1.html hdp://www.huffingtonpost.com/shaun-‐donovan/post2.html hdp://www.huffingtonpost.com/shaun-‐donovan/post3.html writes for authored by published under writes for authored by published under Content adribu;on logic could some;mes mis-‐adribute posts because of the duplicated data.
hdp://www.huffingtonpost.com/arianna-‐huffington/post_1.html hdp://www.huffingtonpost.com/arianna-‐huffington/post_2.html hdp://www.huffingtonpost.com/arianna-‐huffington/post_3.html hdp://www.huffingtonpost.com/shaun-‐donovan/post1.html hdp://www.huffingtonpost.com/shaun-‐donovan/post2.html hdp://www.huffingtonpost.com/shaun-‐donovan/post3.html writes for authored by published under writes for authored by published under Exacerbated when we started tracking people’s content on a daily basis in mid-‐2011
hdp://www.huffingtonpost.com/arianna-‐huffington/post_1.html hdp://www.huffingtonpost.com/arianna-‐huffington/post_2.html hdp://www.huffingtonpost.com/arianna-‐huffington/post_3.html hdp://www.huffingtonpost.com/shaun-‐donovan/post1.html hdp://www.huffingtonpost.com/shaun-‐donovan/post2.html hdp://www.huffingtonpost.com/shaun-‐donovan/post3.html writes for authored by published under writes for authored by published under Normalize the sites
• Distributed hashtables • Designed for high load • In-‐memory or on-‐disk • Eventually consistent Column Databases • Spread sheet like • Key is a row id • Adributes are columns • Columns can be grouped into families Document Databases • Like Key/Value • Value = Document • Document = JSON/BSON • JSON = Flexible Schema Graph Databases • Graph Theory G=(E,V) • Great for modeling networks • Great for graph-‐based query algorithms
• Distributed hashtables • Designed for high load • In-‐memory or on-‐disk • Eventually consistent Column Databases • Spread sheet like • Key is a row id • Adributes are columns • Columns can be grouped into families Document Databases • Like Key/Value • Value = Document • Document = JSON/BSON • JSON = Flexible Schema Graph Databases • Graph Theory G=(E,V) • Great for modeling networks • Great for graph-‐based query algorithms Nope!
• Distributed hashtables • Designed for high load • In-‐memory or on-‐disk • Eventually consistent Column Databases • Spread sheet like • Key is a row id • Adributes are columns • Columns can be grouped into families Document Databases • Like Key/Value • Value = Document • Document = JSON/BSON • JSON = Flexible Schema Graph Databases • Graph Theory G=(E,V) • Great for modeling networks • Great for graph-‐based query algorithms Graph Databases: we looked at Neo4J a bit closer but passed again for the same reasons as before.
• Distributed hashtables • Designed for high load • In-‐memory or on-‐disk • Eventually consistent Column Databases • Spread sheet like • Key is a row id • Adributes are columns • Columns can be grouped into families Document Databases • Like Key/Value • Value = Document • Document = JSON/BSON • JSON = Flexible Schema Graph Databases • Graph Theory G=(E,V) • Great for modeling networks • Great for graph-‐based query algorithms Memcache: s;ll no
• Distributed hashtables • Designed for high load • In-‐memory or on-‐disk • Eventually consistent Column Databases • Spread sheet like • Key is a row id • Adributes are columns • Columns can be grouped into families Document Databases • Like Key/Value • Value = Document • Document = JSON/BSON • JSON = Flexible Schema Graph Databases • Graph Theory G=(E,V) • Great for modeling networks • Great for graph-‐based query algorithms Amazon SimpleDB: s;ll no.
• Distributed hashtables • Designed for high load • In-‐memory or on-‐disk • Eventually consistent Column Databases • Spread sheet like • Key is a row id • Adributes are columns • Columns can be grouped into families Document Databases • Like Key/Value • Value = Document • Document = JSON/BSON • JSON = Flexible Schema Graph Databases • Graph Theory G=(E,V) • Great for modeling networks • Great for graph-‐based query algorithms Not willing to store our data in a proprietary datastore. Redis and LinkedIn’s Project Voldermort: s;ll no
• Distributed hashtables • Designed for high load • In-‐memory or on-‐disk • Eventually consistent Column Databases • Spread sheet like • Key is a row id • Adributes are columns • Columns can be grouped into families Document Databases • Like Key/Value • Value = Document • Document = JSON/BSON • JSON = Flexible Schema Graph Databases • Graph Theory G=(E,V) • Great for modeling networks • Great for graph-‐based query algorithms CouchDB: more mature but s;ll no ad-‐hoc queries.
• Distributed hashtables • Designed for high load • In-‐memory or on-‐disk • Eventually consistent Column Databases • Spread sheet like • Key is a row id • Adributes are columns • Columns can be grouped into families Document Databases • Like Key/Value • Value = Document • Document = JSON/BSON • JSON = Flexible Schema Graph Databases • Graph Theory G=(E,V) • Great for modeling networks • Great for graph-‐based query algorithms Cassandra: matured quite a bit, added secondary indexes and batch processing op;ons but more restric;ve in its’ use than other solu;ons. AHer the Hbase lesson, simplicity of use was now more important.
• Distributed hashtables • Designed for high load • In-‐memory or on-‐disk • Eventually consistent Column Databases • Spread sheet like • Key is a row id • Adributes are columns • Columns can be grouped into families Document Databases • Like Key/Value • Value = Document • Document = JSON/BSON • JSON = Flexible Schema Graph Databases • Graph Theory G=(E,V) • Great for modeling networks • Great for graph-‐based query algorithms Riak: strong contender s;ll but adop;on ques;ons remained.
• Distributed hashtables • Designed for high load • In-‐memory or on-‐disk • Eventually consistent Column Databases • Spread sheet like • Key is a row id • Adributes are columns • Columns can be grouped into families Document Databases • Like Key/Value • Value = Document • Document = JSON/BSON • JSON = Flexible Schema Graph Databases • Graph Theory G=(E,V) • Great for modeling networks • Great for graph-‐based query algorithms MongoDB: matured by leaps and bounds, increased adop;on, support from 10gen, advanced indexing out-‐of-‐the-‐box as well as some batch processing op;ons, breeze to use, well documented and fit into our exis;ng code base very nicely.
easier to write with JavaScript: no need for a Java developer to write map reduce code to extract the data in a usable form like it was needed with Hbase. • Simpler backups: Hbase mostly relied on HDFS redundancy; intra-‐ cluster replica;on is available but experimental and a lot more involved to setup. • Great documenta;on • Great adop;on and community
Data: – Volume – Velocity – Variety ß Traackr • Big Data technologies are complementary to SQL and RDBMS • Un;l machines can think for themselves Data Science will be increasingly important
with less mature tech • Be as flexible as the data => fearless refactoring • Importance of ease of use and administra;on cannot be overstated for a small startup