10TB of data per day? • Parallel is a must, but not enough • Usual approaches fail at this scale because of secondary effects • Operational costs • Faults 7 Monday, 18 January 2010
‘80s Object-Oriented DBMS • ‘80s & ‘90s Parallel DBMS • Not much has happened since the ‘70s • The fundamental model and the code lines are still the same
computing near data • Focus on scalability and fault tolerance • Simple! • Shared nothing architecture on commodity hardware • Data streaming
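The principles on this slide can be illustrated with a toy, single-process word count. This is only a sketch of the programming model, not of any real framework: the names `map_phase`, `shuffle`, `reduce_phase` and `run_job` are hypothetical, and the "partitions" are plain Python lists standing in for data blocks that a real system would process on the nodes that store them.

```python
from collections import defaultdict
from itertools import chain

def map_phase(record):
    # Emit one (word, 1) pair per word in the input record.
    return [(word.lower(), 1) for word in record.split()]

def shuffle(pairs):
    # Group values by key, as a framework would do between map and reduce.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    # Aggregate all values observed for one key.
    return key, sum(values)

def run_job(partitions):
    # Each partition is mapped independently of the others (shared nothing);
    # records are streamed through the mapper one at a time.
    mapped = chain.from_iterable(
        map_phase(record) for part in partitions for record in part
    )
    return dict(reduce_phase(k, v) for k, v in shuffle(mapped).items())

# Two hypothetical partitions, each holding the records stored on one node.
partitions = [["big data big"], ["data near compute"]]
counts = run_job(partitions)
```

Because the mapper only sees one record at a time and keeps no shared state, a failed partition can simply be re-run, which is where the fault tolerance of the model comes from.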
• Computing on large data vs Handling large data • OLAP vs OLTP • User Defined Functions vs Select-Project-Join • Nested vs Flat data model
agile) Grow big (optimize common patterns) • "MapReduce: a major step backwards" (DeWitt, Stonebraker) • "If the only tool you have is a hammer, you tend to see every problem as a nail" (Abraham Maslow) • SQL and Relational Model are not the answer
• Industry is leading the trend, has cutting edge software • Different approaches • Most focus on MapReduce • Shift toward higher level abstractions
H. Yang, A. Dasdan, R. Hsiao, and D. Parker. In SIGMOD 2007. • Goal: implement relational operators efficiently • How: new final phase that merges two key-value lists • Issues: very low level and hard to use; needs integration into a high-level language
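The idea of a final phase that merges two key-value lists can be sketched as a sort-merge join over the outputs of two reduce steps. This is an illustrative assumption about what such a phase might do for an equi-join, not the paper's actual interface: `merge_phase` and the sample data are hypothetical, and both inputs are assumed to be sorted by key with at most one entry per key.

```python
def merge_phase(left, right):
    """Join two key-sorted lists of (key, value) pairs on equal keys.

    Assumes each list is sorted by key and contains each key at most once.
    """
    i = j = 0
    joined = []
    while i < len(left) and j < len(right):
        left_key, left_value = left[i]
        right_key, right_value = right[j]
        if left_key == right_key:
            # Matching keys: emit the combined record and advance both sides.
            joined.append((left_key, (left_value, right_value)))
            i += 1
            j += 1
        elif left_key < right_key:
            i += 1  # No partner for this left key.
        else:
            j += 1  # No partner for this right key.
    return joined

# Hypothetical outputs of two separate reduce steps, keyed by department.
employees_by_dept = [("d1", 3), ("d2", 5), ("d4", 1)]
budget_by_dept = [("d2", 100), ("d3", 70), ("d4", 20)]
joined = merge_phase(employees_by_dept, budget_by_dept)
```

Writing this by hand for every join is the kind of low-level work the "Issues" bullet refers to.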
Afrati and J. Ullman. Unpublished. • Goal: I/O cost characterization • Issues: only theoretical analysis; no existing reference system • Future: best algorithms for the model; adaptation of the model to real systems
S. Suri, and S. Vassilvitskii. In SODA, 2010. • Goal: theoretical computability characterization of MapReduce algorithms • Result: algorithmic design technique for MapReduce • Future: develop algorithms in this class; find relationships with other classes
Olston, E. Bortnikov, K. Elmeleegy, F. Junqueira, B. Reed. In CIDR, 2009. • Goal: speed up general queries for big data • How: pre-computed templates to fill at run-time • Future: which templates are useful for interactive use?; help the user to formulate templates (sampling?)
P. Alvaro, J. Hellerstein, K. Elmeleegy, and R. Sears. Technical report, University of California, Berkeley, 2009. • Goal: speed up turnaround of MapReduce jobs • How: operator pipelining, online aggregation • Issues: limited inter-job pipelining (data only); inter-job aggregation problematic (scratch data)
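Online aggregation can be illustrated with a running average that is refined as batches of map output arrive, instead of being reported only when the whole job has finished. This is a minimal sketch under that assumption; `online_average` and the batch contents are hypothetical and do not reflect the API of the system described in the report.

```python
def online_average(batches):
    """Yield a refined estimate of the mean after each batch arrives."""
    total = 0.0
    count = 0
    for batch in batches:
        total += sum(batch)
        count += len(batch)
        if count:
            # Early estimates are approximate; the last one is exact.
            yield total / count

# Hypothetical batches pipelined from mappers to the aggregator.
estimates = list(online_average([[10, 20], [30], [40, 50, 60]]))
# estimates == [15.0, 20.0, 35.0]
```

The user can stop the job as soon as an estimate is good enough, which is how pipelining shortens turnaround.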
DBMS technologies for analytical workloads. A. Abouzeid, K. Bajda-Pawlikowski, D. Abadi, A. Silberschatz, A. Rasin. In VLDB, 2009. • Goal: advantages of both DB and MapReduce • How: integrate a DBMS (PostgreSQL) in Hadoop, Hive as interface • Issues: it is better to reuse principles than technology
A Look at the Stragglers Problem in MapReduce. J. Lin. In LSDS-IR, 2009. • Goal: study data distribution effects on MapReduce; parallel query/pairwise similarity as case study • How: balance input data (split long posting lists) • Issues: very specific for the problem/algorithm
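The balancing step can be sketched as cutting every posting list into chunks of bounded length, so that no single task receives a disproportionately long list. This shows only the splitting, not how the paper's algorithm recombines partial results; `split_postings`, `max_len` and the toy index are hypothetical.

```python
def split_postings(index, max_len):
    """Split each term's posting list into chunks of at most max_len entries."""
    chunks = []
    for term, postings in index.items():
        for start in range(0, len(postings), max_len):
            chunks.append((term, postings[start:start + max_len]))
    return chunks

# Toy inverted index: a frequent term has a much longer posting list.
index = {"the": [1, 2, 3, 4, 5, 6, 7], "hadoop": [2, 5]}
chunks = split_postings(index, max_len=3)
# "the" is split into three chunks; "hadoop" stays in one.
```

With bounded chunks the slowest task is limited by `max_len` rather than by the most frequent term, which is what removes the straggler.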
enough result is often acceptable • Semantic clues • Leverage properties of M/R functions (associativity, commutativity) • Properties of the input may speed up the computation
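One well-known way to exploit associativity and commutativity is a combiner: because addition can be applied in any order and grouping, each mapper can pre-aggregate its own output before anything is sent to the reducer. The sketch below assumes a simple counting job; `combine`, `reduce_all` and the sample data are hypothetical names for illustration.

```python
from collections import Counter

def combine(pairs):
    # Local pre-aggregation on one mapper's output.
    partial = Counter()
    for key, value in pairs:
        partial[key] += value
    return partial

def reduce_all(partials):
    # Final aggregation over the (much smaller) partial results.
    total = Counter()
    for partial in partials:
        total.update(partial)  # Counter.update adds counts per key.
    return dict(total)

# Hypothetical outputs of two mappers.
mapper_outputs = [
    [("a", 1), ("b", 1), ("a", 1)],
    [("a", 1), ("c", 1)],
]
partials = [combine(output) for output in mapper_outputs]
result = reduce_all(partials)
```

The result is the same as reducing the raw pairs directly, but fewer records cross the network. This shortcut is only valid when the reduce function really is associative and commutative.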
• Integrate DB principles into Cloud systems • Enable interactive and approximate analytics • Evolve beyond the MapReduce paradigm