look beyond the tried-and-true methods that are prevalent at that time” (Jacobs, CACM 2009). Requirements for a large-scale data analysis system: scalability (scale free), cost effectiveness (autonomic), fault tolerance (highly available).
Monday, 8 March 2010
analysis. ’60s: CODASYL. ’70s: relational DBMS. ’80s and ’90s: parallel DBMS. Not much has happened since the ’70s. The fundamental model and code are still the same.
nothing architectures using dataflow operators and horizontal partitioning. Problems: not enough flexibility and ease of use; limited fault tolerance and scalability.
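As an illustration of horizontal partitioning on a shared-nothing layout, here is a minimal sketch in Python. It is not taken from any parallel DBMS: the helper names (`node_for`, `horizontal_partition`, `local_sum`), the CRC32-based placement and the sample rows are all assumptions made for the example. Each "node" is just a list that holds its own rows and aggregates them locally.

```python
import zlib


def node_for(value, num_nodes):
    """Pick a node by hashing the partitioning key (deterministic CRC32)."""
    return zlib.crc32(str(value).encode("utf-8")) % num_nodes


def horizontal_partition(rows, key, num_nodes):
    """Split rows across nodes so that equal keys land on the same node."""
    partitions = [[] for _ in range(num_nodes)]
    for row in rows:
        partitions[node_for(row[key], num_nodes)].append(row)
    return partitions


def local_sum(partition, key, value):
    """Per-node aggregation operator: sum `value` grouped by `key`."""
    totals = {}
    for row in partition:
        totals[row[key]] = totals.get(row[key], 0) + row[value]
    return totals


# Hypothetical sample data.
rows = [
    {"customer": "a", "amount": 10},
    {"customer": "b", "amount": 5},
    {"customer": "a", "amount": 7},
]
partitions = horizontal_partition(rows, "customer", num_nodes=3)

# Groups are disjoint across nodes because the data is partitioned on the
# grouping key, so the per-node results can simply be unioned.
merged = {}
for partition in partitions:
    merged.update(local_sum(partition, "customer", "amount"))
```

Partitioning on the grouping key is what lets each node work without talking to the others; a query that groups on a different column would need a repartitioning step first.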
architectures. Focus on being scale-free, fault tolerant, cost effective and easy to use. Buzzword! Distributed system. Scalability. Location, replication and failure transparency.
"If the only tool you have is a hammer, you tend to see every problem as a nail" (Abraham Maslow). SQL and the relational model are not the answer.
characterization. Issues: only theoretical analysis, no existing reference system. Future: best algorithms for the model, model adaptation to real systems. F. Afrati and J. Ullman. Unpublished.
of MapReduce algorithms. Result: algorithmic design technique for MapReduce. Future: develop algorithms in this class, find relationships with other classes. H. Karloff, S. Suri, and S. Vassilvitskii. In SODA, 2010.
engineering of MapReduce. How: top-down functional analysis. Result: simplified and rationalized MapReduce model with runnable functional specification. R. Lämmel. In Science of Computer Programming, 2007.
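To give a feel for what a runnable functional specification of MapReduce can look like, here is a minimal sketch in Python. It is not the specification from the cited paper; the function names and the word-count example are assumptions chosen for illustration. The whole model is expressed as one higher-order function over a mapper and a reducer.

```python
from itertools import groupby
from operator import itemgetter


def map_reduce(mapper, reducer, records):
    """Apply mapper to every (key, value) record, group by key, then reduce."""
    intermediate = [
        pair for key, value in records for pair in mapper(key, value)
    ]
    # Sorting stands in for the shuffle: it brings equal keys together.
    intermediate.sort(key=itemgetter(0))
    return {
        key: reducer(key, [value for _, value in group])
        for key, group in groupby(intermediate, key=itemgetter(0))
    }


def word_count_mapper(_doc_id, text):
    """Emit (word, 1) for each word in the document."""
    return [(word, 1) for word in text.split()]


def word_count_reducer(_word, counts):
    """Sum the partial counts of one word."""
    return sum(counts)


docs = [("d1", "a b a"), ("d2", "b c")]
counts = map_reduce(word_count_mapper, word_count_reducer, docs)
```

Everything runs in one process here; distribution, fault tolerance and combiners are deliberately left out so that only the data flow of the model remains.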
relational operators efficiently. How: new final phase that merges two key-value lists. Issues: very low level and hard to use, needs integration into a high-level language. H. Yang, A. Dasdan, R. Hsiao, and D. Parker. In SIGMOD 2007.
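The idea of a final phase that merges two key-value lists can be sketched as a sort-merge join over the outputs of two separate reduce phases. The Python below is only an illustration under stated assumptions: `merge_phase` and `merger` are hypothetical names, not the interface defined in the cited paper, and both inputs are assumed to be sorted by key with at most one entry per key.

```python
def merge_phase(left, right, merger):
    """Join two key-sorted lists of (key, value) pairs on equal keys.

    Assumes each list is sorted by key and has unique keys, as the
    output of a reduce phase typically would.
    """
    out = []
    i = j = 0
    while i < len(left) and j < len(right):
        left_key, left_value = left[i]
        right_key, right_value = right[j]
        if left_key == right_key:
            out.append((left_key, merger(left_value, right_value)))
            i += 1
            j += 1
        elif left_key < right_key:
            i += 1  # no partner on the right for this key
        else:
            j += 1  # no partner on the left for this key
    return out


# Hypothetical reduced outputs: total bonus per department id,
# and department name per department id.
bonus_by_dept = [(1, 300), (2, 150)]
name_by_dept = [(1, "sales"), (3, "ops")]
joined = merge_phase(bonus_by_dept, name_by_dept, lambda b, n: (n, b))
```

This corresponds to an inner equi-join; other relational operators would need a different policy for keys that appear in only one list.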
analytical workloads. Goal: advantages of both DB and MapReduce. How: integrate a DBMS (PostgreSQL) in Hadoop, with Hive as interface. Issues: it is better to reuse principles than artifacts. A. Abouzeid, K. Bajda-Pawlikowski, D. Abadi, A. Silberschatz, A. Rasin. In VLDB, 2009.
for big data. How: pre-computed templates to fill at run-time. Future: which templates are useful for interactive analysis? How to help the user formulate templates (sampling?). C. Olston, E. Bortnikov, K. Elmeleegy, F. Junqueira, B. Reed. In CIDR, 2009.
operator pipelining, online aggregation. Issues: limited inter-job pipelining (data only); inter-job aggregation problematic (scratch data). T. Condie, N. Conway, P. Alvaro, J. Hellerstein, K. Elmeleegy, and R. Sears. Technical report, University of California, Berkeley, 2009.
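Online aggregation means reporting a running estimate while input is still arriving, instead of waiting for the job to finish. The following Python sketch shows the idea for an average; it is a single-process illustration, not the mechanism of the cited system, and the function name and batch data are assumptions.

```python
def online_average(batches):
    """Yield (records_seen, running_mean) after each pipelined batch."""
    count = 0
    total = 0.0
    for batch in batches:
        for x in batch:
            count += 1
            total += x
        if count:
            # Early, approximate answer based on the data seen so far.
            yield count, total / count


# Hypothetical batches, e.g. map output pushed to the reducer as it is produced.
snapshots = list(online_average([[1, 3], [5], [7, 4]]))
```

Each snapshot is only an estimate of the final answer; how good it is depends on how representative the data seen so far is of the whole input.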
push for efficient relational operators. Not clear which algorithms and problems are best for these systems. Research efforts are “erratic”: no common research agenda yet. Early stage of development: more effort needed.
data analysis? How to support these algorithms on cloud computing systems? Is it possible to carry out online data analysis on such systems?
representative workload. Identify weaknesses in existing systems. Use principles of database research to fill the gaps. Evaluate contributions from both theoretical and experimental points of view.
is often acceptable. Semantic clues: leverage properties of M/R functions (distributivity, associativity, commutativity). Properties of the input may speed up the computation.
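One way such algebraic properties can be exploited is partial aggregation: when the combining function is associative and commutative, each input split can be pre-aggregated independently and the partial results merged in any order. The Python sketch below is an assumption-laden illustration (the names `combine` and `partial_aggregate` and the sample splits are made up). It computes a mean by carrying (sum, count) pairs, since the mean itself is not associative.

```python
from functools import reduce


def combine(a, b):
    """Merge two (sum, count) pairs; associative and commutative."""
    return (a[0] + b[0], a[1] + b[1])


def partial_aggregate(values):
    """Combiner-style pre-aggregation of one input split."""
    return reduce(combine, ((v, 1) for v in values), (0, 0))


# Hypothetical input splits processed independently.
splits = [[4, 8], [6], [2, 10, 6]]
partials = [partial_aggregate(s) for s in splits]

# The partials are merged in reverse order on purpose: the result
# does not depend on the order because combine is commutative.
total, count = reduce(combine, reversed(partials), (0, 0))
mean = total / count
```

Averaging the per-split means directly would give a different, wrong answer whenever the splits have different sizes, which is why the pair representation is used.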
large-scale data analysis on cloud computing systems. Design extensions to existing programming paradigms in order to support these algorithms. Develop methods to speed up these algorithms to support online processing.