Sharing a Startup’s Big Data Lessons

Sharing a Startup’s Big Data Lessons
Experiences with non-RDBMS solutions at Traackr

Who we are
•  A search engine
•  A people search engine
•  An influencer search engine
•  Subscription-based

George Stathis, VP Engineering
14+ years of experience building full-stack web software systems, with a past focus on e-commerce and publishing. Currently responsible for building engineering capability to enable Traackr's growth goals.

What’s this talk about?
•  Share what we know about Big Data/NoSQL: what’s behind the buzz words?
•  Our reasons and method for picking a NoSQL database
•  Share the lessons we learned going through the process

Big Data/NoSQL: behind the buzz words

What is Big Data?
•  3 Vs:
   – Volume
   – Velocity
   – Variety

What is Big Data? Volume + Velocity
•  Data sets too large, or coming in at too high a velocity, to process using traditional databases or desktop tools. E.g. big science, web logs, RFID, sensor networks, social networks, social data, internet text and documents, internet search indexing, call detail records, astronomy, atmospheric science, genomics, biogeochemical, military surveillance, medical records, photography archives, video archives, large-scale e-commerce

What is Big Data? Variety
•  Big Data is varied and unstructured
[Diagram labels: Traditional static reports; Analytics, exploration & experimentation]

What is Big Data?
•  Scaling data processing cost-effectively
[Diagram labels: $$$$$$$$, $$$$$, $$$]

What is NoSQL?
•  NoSQL ≠ No SQL
•  NoSQL ≈ Not Only SQL
•  NoSQL addresses RDBMS limitations; it’s not about the SQL language
•  RDBMS = static schema
•  NoSQL = schema flexibility; you don’t have to know the exact structure before storing

What is Distributed Computing?
•  Sharing the workload: divide a problem into many tasks, each of which can be solved by one or more computers
•  Allows computations to be accomplished in acceptable timeframes
•  Distributed computation approaches were developed to leverage multiple machines: MapReduce
•  With MapReduce, the program goes to the data since the data is too big to move
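The map, shuffle and reduce steps can be sketched on a single machine. This is a minimal illustration only, assuming a word-count job; the function names are made up for the example, and a real framework such as Hadoop would run each map task on the node that already holds that split of the data.

```python
from collections import defaultdict


def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one input split."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle(pairs):
    """Shuffle: group all emitted values by their key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce: combine the values for one key into a single result."""
    return key, sum(values)


def map_reduce(documents):
    # In a cluster, each split is mapped where it is stored; here the
    # "nodes" are simply iterations of a loop.
    pairs = [pair for doc in documents for pair in map_phase(doc)]
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())


splits = ["big data big buzz", "data beats buzz"]
counts = map_reduce(splits)
print(counts)
```

Because each split is mapped independently, the work can be spread across as many machines as there are splits; only the grouped intermediate pairs need to travel between them.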

What is MapReduce? Source: developer.yahoo.com

What is MapReduce?
•  MapReduce = batch processing = analytical
•  MapReduce ≠ interactive
•  Therefore many NoSQL solutions don’t outright replace warehouse solutions; they complement them
•  RDBMS is still safe :-)

What is Big Data? Velocity
•  In some instances, being able to process large amounts of data in real time can yield a competitive advantage. E.g.
   – Online retailers leveraging buying history and click-through data for real-time recommendations
•  No time to wait for MapReduce jobs to finish
•  Solutions: stream processing (e.g. Twitter Storm), pre-computing (e.g. aggregate and count analytics as data arrives), quick-to-read key/value stores (e.g. distributed hashes)
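The "pre-computing" option can be pictured as keeping running totals that are updated on every incoming event, so that reads never wait for a batch job. The sketch below is illustrative only: the class name and the click events are invented for the example.

```python
from collections import Counter


class ClickCounter:
    """Keeps per-product click totals up to date as events arrive."""

    def __init__(self):
        self.totals = Counter()

    def record(self, product_id):
        # Constant-time update per event: nothing has to rescan history later.
        self.totals[product_id] += 1

    def top(self, n):
        # Reads are answered straight from the pre-computed totals.
        return self.totals.most_common(n)


counter = ClickCounter()
for product_id in ["tent", "stove", "tent", "lamp", "tent", "stove"]:
    counter.record(product_id)

top_two = counter.top(2)
print(top_two)
```

The trade-off is that only the aggregates decided on up front are available instantly; new questions about past data still need a batch pass.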

What is Big Data? Data Science
•  Emergence of Data Science
•  Data Scientist ≈ Statistician
•  Possess scientific discipline & expertise
•  Formulate and test hypotheses
•  Understand the math behind the algorithms so they can tweak when they don’t work
•  Can distill the results into an easy-to-understand story
•  Help businesses gain actionable insights

Big Data Landscape Source: capgemini.com

So what’s Traackr and why did we need a
NoSQL DB?

Traackr: context
•  A cloud computing company is about to launch a new platform; how does it find the most influential IT bloggers on the web that can help bring visibility to the new product? How does it find the opinion leaders, the people that matter?

Traackr: a people search engine Up to 50 keywords
per search!

Traackr: a people search engine
•  People as search results
•  Content aggregated by author
•  Proprietary 3-scale ranking

Traackr: 30,000 feet
Acquisition, Processing, Storage & Indexing, Services, Applications

NoSQL is usually associated with “Web Scale” (Volume &
Velocity)

Do we fit the “Web scale” profile?
•  In terms of users/traffic?

Source: compete.com

Do we fit the “Web scale” profile?
•  In terms of users/traffic?
•  In terms of the amount of data?

PRIMARY> use traackr
switched to db traackr
PRIMARY> db.stats()
{
    "db" : "traackr",
    "collections" : 12,
    "objects" : 68226121,
    "avgObjSize" : 2972.0800625760330,
    "dataSize" : 202773493971,
    "storageSize" : 221491429671,
    "numExtents" : 199,
    "indexes" : 33,
    "indexSize" : 27472394891,
    "fileSize" : 266623699968,
    "nsSizeMB" : 16,
    "ok" : 1
}
That’s a quarter of a terabyte …
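The "quarter of a terabyte" figure can be checked directly from the numbers above. The short calculation below assumes the sizes are reported in bytes, which is the default for db.stats().

```python
# Values copied from the db.stats() output above (assumed to be in bytes).
stats = {
    "objects": 68226121,
    "dataSize": 202773493971,
    "fileSize": 266623699968,
}

# Average document size: total data size divided by the number of documents.
avg_obj_size = stats["dataSize"] / stats["objects"]

# File size expressed in decimal terabytes and in binary tebibytes.
file_size_tb = stats["fileSize"] / 10**12
file_size_tib = stats["fileSize"] / 2**40

print(f"average object size: {avg_obj_size:.2f} bytes")
print(f"file size: {file_size_tb:.3f} TB ({file_size_tib:.3f} TiB)")
```

Either way of counting lands close to 0.25, and the computed average agrees with the reported avgObjSize of roughly 2972 bytes per document.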

Wait! What? My Synology NAS at home can
hold 2TB!

No need for us to track the entire web
[Diagram labels: Web Content; Influencer Content]
Not at scale :-)

Variety view of “Web Scale”
Web data is:
•  Heterogeneous
•  Unstructured (text)

Source: http://www.opte.org/ (visualization of the Internet, Nov. 23rd 2003)

Data sources are isolated islands of rich data with loose links to one another

How do we build a database that models all possible entities found on the web?

Modeling the web: the RDBMS way

Source: socialbutterflyclt.com

Influencer data as JSON
{
    "realName": "David Chancogne",
    "title": "CTO",
    "description": "Web. Geek.\r\nTraackr: http://traackr.com\r\nPropz: http://propz.me",
    "primaryAffiliation": "Traackr",
    "email": "[email protected]",
    "location": "Cambridge, MA, United States",
    "siteReferences": [
        {
            "siteUrl": "http://twitter.com/dchancogne",
            "metrics": [
                { "value": 216, "name": "twitter_followers_count" },
                { "value": 2107, "name": "twitter_statuses_count" }
            ]
        },
        {
            "siteUrl": "http://traackr.com/blog/author/david",
            "metrics": [
                { "value": 21, "name": "google_inbound_links" }
            ]
        }
    ]
}

NoSQL = schema flexibility
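Schema flexibility can be illustrated with plain dictionaries standing in for documents. In this sketch the first record is trimmed from the influencer JSON above, while the second record and the helper function are hypothetical, added only to show two differently shaped records living side by side.

```python
# A plain list stands in for a schemaless collection.
influencers = []

influencers.append({
    "realName": "David Chancogne",
    "title": "CTO",
    "siteReferences": [
        {
            "siteUrl": "http://twitter.com/dchancogne",
            "metrics": [{"name": "twitter_followers_count", "value": 216}],
        }
    ],
})

# Hypothetical second record with a different shape: no schema change needed.
influencers.append({"realName": "Example Blogger", "location": "Boston, MA"})


def follower_count(doc):
    """Return the Twitter follower count if the document carries one."""
    for site in doc.get("siteReferences", []):
        for metric in site.get("metrics", []):
            if metric["name"] == "twitter_followers_count":
                return metric["value"]
    return None


print([follower_count(doc) for doc in influencers])
```

The flexibility moves the burden to the reader of the data: code has to tolerate missing fields, as the `.get(..., [])` calls do here.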

Do we fit the “Web scale” profile?
•  In terms of users/traffic?
•  In terms of the amount of data?
•  In terms of the variety of the data ✓

Traackr’s Datastore Requirements
•  Schema flexibility ✓
•  Good at storing lots of variable length text
•  Batch processing options

Requirement: text storage
Variable text length, with big variance: from 140-character tweets to multi-page blog posts

Requirement: text storage
RDBMS’ answer to variable text length:
•  Plan ahead for largest value
•  CLOB/BLOB

Requirement: text storage
Issues with CLOB/BLOB for us:
•  No clue what largest value is
•  CLOB/BLOB for tweets = wasted space

Requirement: text storage
NoSQL solutions are great for text:
•  No length requirements (automated chunking)
•  Limited space overhead

Traackr’s Datastore Requirements
•  Schema flexibility ✓
•  Good at storing lots of variable length text ✓
•  Batch processing options

Requirement: batch processing
Some NoSQL solutions come with MapReduce
Source: http://code.google.com/

Requirement: batch processing
MapReduce + RDBMS:
•  Possible, but proprietary solutions
•  Usually involves exporting data from the RDBMS into a NoSQL system anyway, which defeats the data locality benefit of MR

Traackr’s Datastore Requirements
•  Schema flexibility ✓
•  Good at storing lots of variable length text ✓
•  Batch processing options ✓
A NoSQL option is the right fit

How did we pick a NoSQL DB?

Bewildering number of options (early 2010)
Key/Value Databases
•  Distributed hashtables
•  Designed for high load
•  In-memory or on-disk
•  Eventually consistent
Column Databases
•  Spreadsheet-like
•  Key is a row id
•  Attributes are columns
•  Columns can be grouped into families
Document Databases
•  Like Key/Value
•  Value = Document
•  Document = JSON/BSON
•  JSON = Flexible Schema
Graph Databases
•  Graph Theory G=(E,V)
•  Great for modeling networks
•  Great for graph-based query algorithms
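To make the four families more concrete, the sketch below stores the same small fact in each shape using plain Python structures. The keys, ids and edge label are hypothetical and chosen only for illustration; real products differ in many details.

```python
# Key/value: an opaque value looked up by key; the store cannot query inside it.
key_value = {"influencer:1234": b'{"realName": "David Chancogne"}'}

# Column store: row id -> column family -> column -> value.
column_store = {
    "influencer:1234": {
        "attributes": {"realName": "David Chancogne", "title": "CTO"},
    }
}

# Document store: the value is a structured document whose fields are queryable.
document_store = [
    {"_id": "1234", "realName": "David Chancogne", "title": "CTO"},
]

# Graph: vertices plus labelled edges between them.
vertices = {"1234": {"realName": "David Chancogne"}, "site:traackr": {}}
edges = [("1234", "writes for", "site:traackr")]

# The same question, "what is this person's title?", per model.
title_from_columns = column_store["influencer:1234"]["attributes"]["title"]
title_from_documents = next(
    doc["title"] for doc in document_store if doc["_id"] == "1234"
)
sites_written_for = [dst for src, label, dst in edges
                     if src == "1234" and label == "writes for"]
print(title_from_columns, title_from_documents, sites_written_for)
```

In the key/value case the application has to fetch and decode the whole value before it can answer anything, which is the "no query filters" limitation noted for some options below.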

Trimming options
•  Graph Databases: while we can model our domain as a graph, we don’t want to pigeonhole ourselves into this structure. We’d rather use these tools for specialized data analysis, not as the main data store.
•  Memcache: memory-based; we need true persistence.
•  Amazon SimpleDB: not willing to store our data in a proprietary datastore.
•  Redis and LinkedIn’s Project Voldemort: no query filters; better used as queues or distributed caches.
•  CouchDB: no ad-hoc queries; its lack of maturity in early 2010 made us shy away, although we did try early prototypes.
•  Cassandra: in early 2010, maturity questions, no secondary indexes and no batch processing options (these came later on).
•  MongoDB: in early 2010, maturity questions, adoption questions and no batch processing options.
•  Riak: very close, but in early 2010 we had adoption questions.
•  HBase: came across as the most mature at the time, with several deployments, a healthy community, “out-of-the-box” secondary indexes through a contrib and support for batch processing using Hadoop/MR.

Lessons Learned
Challenges
-  Complexity
-  Missing features
-  Problem/solution fit
-  Resources
Rewards
-  Choices
-  Empowering
-  Community
-  Cost

Rewards: Choices
Key/Value Databases, Column Databases, Document Databases, Graph Databases

Rewards: Choices Source: capgemini.com

When Big-Data = Big Architectures
Source: http://www.larsgeorge.com/2009/10/hbase-architecture-101-storage.html
•  Must have a Hadoop HDFS cluster of at least 2x replication factor nodes
•  Must have an odd number of ZooKeeper quorum nodes
•  Then you can run your HBase nodes, but it’s recommended to co-locate regionservers with Hadoop datanodes, so you have to manage resources
•  Master/slave architecture means a single point of failure, so you need to protect your master
•  And then we also have to manage the MapReduce processes and resources in the Hadoop layer

Jokes aside, no one said open source was easy
to use

To be expected
•  Hadoop/HBase are designed to move mountains
•  If you want to move big stuff, be prepared to sometimes use big equipment

What it means to a startup
[Diagram labels: Development capacity before; Development capacity after]
Congrats, you are now a sysadmin…

Mapping a saved search to a column store
[Diagram labels: Name; Ranks; References to influencer records]
•  Unique key
•  “attributes” column family for general attributes, e.g. the “name” attribute
•  “influencerId” column family for influencer ranks and foreign keys
•  Influencer ranks can be attribute names as well
•  Can get pretty long, so needs indexing and pagination

Problem: no out-of-the-box row-based indexing and pagination
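The row layout described above, and the pagination the application then has to provide itself, might look roughly like this. It is a sketch under assumptions: the row key, the search name and the influencer ids are invented, and a nested dict stands in for an HBase row with two column families.

```python
# One saved search = one row. Column families are modelled as nested dicts.
saved_search = {
    "row_key": "search_0001",  # hypothetical unique key
    "attributes": {"name": "cloud computing bloggers"},
    # Column name = rank, value = foreign key to an influencer record.
    "influencerId": {rank: f"influencer_id_{1000 + rank}" for rank in range(1, 8)},
}


def page_of_influencers(row, page, page_size):
    """Return the (rank, influencer id) pairs for one zero-based page."""
    ranked = sorted(row["influencerId"].items())
    start = page * page_size
    return ranked[start:start + page_size]


second_page = page_of_influencers(saved_search, page=1, page_size=3)
print(second_page)
```

Using the rank as the column name keeps results ordered within the row, but slicing a long row into pages is logic the application owns, which is the gap the slide above points out.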

Jumping right into the code

a few months later…

Need to upgrade to HBase 0.90
•  Making sure to remain on recent code base
•  Performance improvements
•  Mostly to get the latest bug fixes
No thanks!

Looks like something is missing

Our DB indexes depend on this!

Let’s get this straight
•  HBase no longer comes with secondary indexing out-of-the-box
•  It’s been moved out of the trunk to GitHub
•  Where only one other company besides us seems to care about it

Only one other maintainer besides us

What it means to a startup
[Diagram label: Development capacity]
Congrats, you are now an HBase contrib maintainer…

Homegrown HBase Indexes
•  Rows have id prefixes that can be efficiently scanned using the STARTROW and STOPROW scan parameters
[Diagram: row ids for Posts]
•  Find posts for influencer_id_1234
•  Find posts for influencer_id_5678
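The prefix-scan idea can be sketched without HBase by keeping row ids sorted and slicing between a start row and a stop row. The row id format below (influencer id, underscore, post suffix) is an assumption made for the example, and `scan` only imitates the start/stop semantics of an HBase scan (start inclusive, stop exclusive).

```python
from bisect import bisect_left

# Row ids kept in sorted order, as a column store keeps its rows.
post_row_ids = sorted([
    "influencer_id_1234_post_001",
    "influencer_id_1234_post_002",
    "influencer_id_12345_post_001",
    "influencer_id_5678_post_001",
])


def scan(row_ids, start_row, stop_row):
    """Return the row ids in [start_row, stop_row)."""
    lo = bisect_left(row_ids, start_row)
    hi = bisect_left(row_ids, stop_row)
    return row_ids[lo:hi]


def posts_for(row_ids, influencer_id):
    # The trailing delimiter stops influencer_id_1234 from matching
    # influencer_id_12345.
    prefix = influencer_id + "_"
    # Smallest string greater than every string that starts with the prefix:
    # bump the prefix's last character by one.
    stop = prefix[:-1] + chr(ord(prefix[-1]) + 1)
    return scan(row_ids, prefix, stop)


print(posts_for(post_row_ids, "influencer_id_1234"))
print(posts_for(post_row_ids, "influencer_id_5678"))
```

The scan touches only the contiguous range of matching rows, which is why this works as a poor man's index; the cost is that every access path needs its own row id layout.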

Homegrown HBase Indexes
•  No longer depending on unmaintained code
•  Works with an out-of-the-box HBase installation

You are back but you still need to maintain indexing logic

a few months later…

Cracks in the data model
[Diagram labels: huffingtonpost.com (shown twice); relationships “writes for”, “authored by” and “published under” (each shown twice); post URLs as listed below]
http://www.huffingtonpost.com/arianna-huffington/post_1.html
http://www.huffingtonpost.com/arianna-huffington/post_2.html
http://www.huffingtonpost.com/arianna-huffington/post_3.html
http://www.huffingtonpost.com/shaun-donovan/post1.html
http://www.huffingtonpost.com/shaun-donovan/post2.html
http://www.huffingtonpost.com/shaun-donovan/post3.html
•  Denormalized/duplicated for fast runtime access and storage of influencer-to-site relationship properties
•  Content attribution logic could sometimes mis-attribute posts because of the duplicated data
•  Exacerbated when we started tracking people’s content on a daily basis in mid-2011

Fixing the cracks in the data model
[Diagram labels: huffingtonpost.com (shown once); the same six post URLs; relationships “writes for”, “authored by” and “published under”]
Normalize the sites
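As a rough picture of what normalizing the sites means, the sketch below collapses per-influencer copies of a site into one shared record that influencers reference by id. The record shapes and the generated ids are hypothetical, not Traackr's actual model.

```python
# Denormalized: every influencer carries its own copy of the site record.
denormalized = {
    "arianna-huffington": {"site": {"url": "huffingtonpost.com"}},
    "shaun-donovan": {"site": {"url": "huffingtonpost.com"}},
}


def normalize(records):
    """Collapse duplicate site copies into one record per URL."""
    site_ids = {}       # site URL -> generated site id
    influencers = {}    # influencer -> list of referenced site ids
    for name, record in records.items():
        url = record["site"]["url"]
        site_id = site_ids.setdefault(url, f"site_{len(site_ids) + 1}")
        influencers[name] = {"siteIds": [site_id]}
    return site_ids, influencers


site_ids, normalized = normalize(denormalized)
print(site_ids)
print(normalized)
```

With a single site record there is only one place for site data to live, but answering "which influencers write for this site?" now means looking up influencers by a referenced id, which is where secondary indexing comes back in.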

Fixing the cracks in the data model
•  Normalization requires stronger secondary indexing
•  Our application layer indexing would need revisiting… again!

Psych! You are back to writing indexing code.

Traackr’s Datastore Requirements (Revisited)
•  Schema flexibility
•  Good at storing lots of variable length text
•  Out-of-the-box SECONDARY INDEX support!
•  Simple to use and administer

NoSQL picking – Round 2 (mid 2011)
Nope!
•  Graph Databases: we looked at Neo4J a bit closer but passed again for the same reasons as before.
•  Memcache: still no.
•  Amazon SimpleDB: still no.
•  Redis and LinkedIn’s Project Voldemort: still no.
•  CouchDB: more mature but still no ad-hoc queries.
•  Cassandra: matured quite a bit, added secondary indexes and batch processing options, but more restrictive in its use than other solutions. After the HBase lesson, simplicity of use was now more important.
•  Riak: strong contender still, but adoption questions remained.
•  MongoDB: matured by leaps and bounds, increased adoption, support from 10gen, advanced indexing out-of-the-box as well as some batch processing options, a breeze to use, well documented, and fit into our existing code base very nicely.

Immediate Benefits
•  No more maintaining custom application-layer secondary indexing code

Yay! I’m back!

•  Single binary installation greatly simplifies administration

Honestly, I thought I’d never see you guys again!

•  Our NoSQL could now support our domain model

many-to-many relationship

Modeling an influencer
Embedded list of references to sites, augmented with influencer-specific site attributes (e.g. percent contribution to content)
{
    "_id": "770cf5c54492344ad5e45fb791ae5d52",
    "realName": "David Chancogne",
    "title": "CTO",
    "description": "Web. Geek.\r\nTraackr: http://traackr.com\r\nPropz: http://propz.me",
    "primaryAffiliation": "Traackr",
    "email": "[email protected]",
    "location": "Cambridge, MA, United States",
    "siteReferences": [
        {
            "siteId": "b31236da306270dc2b5db34e943af88d",
            "contribution": 0.25
        },
        {
            "siteId": "602dc370945d3b3480fff4f2a541227c",
            "contribution": 1.0
        }
    ]
}

Modeling an influencer
siteId indexed for “find influencers connected to site X”
> db.influencers.ensureIndex({"siteReferences.siteId": 1});
> db.influencers.find({"siteReferences.siteId": "602dc370945d3b3480fff4f2a541227c"});
{
    "_id": "770cf5c54492344ad5e45fb791ae5d52",
    "realName": "David Chancogne",
    "title": "CTO",
    "description": "Web. Geek.\r\nTraackr: http://traackr.com\r\nPropz: http://propz.me",
    "primaryAffiliation": "Traackr",
    "email": "[email protected]",
    "location": "Cambridge, MA, United States",
    "siteReferences": [
        {
            "siteId": "b31236da306270dc2b5db34e943af88d",
            "contribution": 0.25
        },
        {
            "siteId": "602dc370945d3b3480fff4f2a541227c",
            "contribution": 1.0
        }
    ]
}
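What an index over an embedded array field buys can be pictured as a reverse map from each siteId to the documents that reference it. The pure-Python sketch below is only a conceptual stand-in, not how MongoDB implements its indexes; the short `_id` values and the second document are hypothetical.

```python
from collections import defaultdict

influencers = [
    {
        "_id": "a",  # shortened, hypothetical id
        "realName": "David Chancogne",
        "siteReferences": [
            {"siteId": "b31236da306270dc2b5db34e943af88d", "contribution": 0.25},
            {"siteId": "602dc370945d3b3480fff4f2a541227c", "contribution": 1.0},
        ],
    },
    {
        "_id": "b",  # hypothetical second influencer
        "realName": "Example Author",
        "siteReferences": [
            {"siteId": "b31236da306270dc2b5db34e943af88d", "contribution": 0.75},
        ],
    },
]


def build_site_index(docs):
    """Map each siteId to the _ids of the documents that reference it."""
    index = defaultdict(list)
    for doc in docs:
        for ref in doc.get("siteReferences", []):
            index[ref["siteId"]].append(doc["_id"])
    return index


site_index = build_site_index(influencers)
print(site_index["602dc370945d3b3480fff4f2a541227c"])
print(site_index["b31236da306270dc2b5db34e943af88d"])
```

This is the same kind of structure the homegrown HBase indexes had to maintain by hand; with the database maintaining it, the many-to-many lookup becomes a one-line query.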

Other Benefits
•  Ad hoc queries and reports became easier to write with JavaScript: no need for a Java developer to write MapReduce code to extract the data in a usable form, as was needed with HBase.
•  Simpler backups: HBase mostly relied on HDFS redundancy; intra-cluster replication is available but experimental and a lot more involved to set up.
•  Great documentation
•  Great adoption and community

Looks like we found the right fit!

We have more of this Development capacity

And less of this
Source: socialbutterflyclt.com

Recap & Final Thoughts
•  3 Vs of Big Data:
   – Volume
   – Velocity
   – Variety ← Traackr
•  Big Data technologies are complementary to SQL and RDBMS
•  Until machines can think for themselves, Data Science will be increasingly important

Recap & Final Thoughts
•  Be prepared to deal with less mature tech
•  Be as flexible as the data => fearless refactoring
•  Importance of ease of use and administration cannot be overstated for a small startup
