R Stories from the Trenches - Budapest R Meetup - Aug 2015

szilard
August 26, 2015

Transcript

  1. T-10 ~ 2006 - cost was not an issue! - data.frame - 800 packages
  2. Data frames: “in-memory table” with (fast) bulk (“vectorized”) operations; thousands of packages providing a high-level API; R, Python (pandas), Spark; the best way to work with structured data (a minimal sketch follows the transcript)
  4. R data.table (on one server!): aggregation of 100M rows into 1M groups in 1.3 sec; join of 100M rows x 1M rows in 1.7 sec (a data.table sketch follows the transcript)
  7. A colleague from work asked me to investigate Spark and R, so the most obvious thing to do was to look into SparkR. I came across a piece of code that reads lines from a file and counts how many lines contain an "a" and how many contain a "b". I prepared a file with 5 columns and 1 million records. Spark: 26.45734 seconds for a million records? Nice job :-) R: 48.31641 seconds? Looks like Spark was almost twice as fast this time... and this is a pretty simple example... I'm sure that when complexity arises... the gap is even bigger... HOLY CRAP UPDATE! Markus gave me this code in the comments... [R: 0.1791632 seconds]. I just added a couple of things to make it compliant... but... damn... I wish I could code like that in R (a vectorized version of the task is sketched after the transcript)
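
Slide 2's point about bulk ("vectorized") operations and high-level APIs is easiest to see in code. The following is a minimal illustrative sketch in base R; the column names and data are made up for illustration and are not from the talk.

    # A data frame is an in-memory table; operations apply to whole columns
    # at once ("vectorized") rather than via row-by-row loops.
    df <- data.frame(
      user  = c("a", "b", "a", "c"),
      spend = c(10, 25, 5, 40)
    )

    df$spend_eur <- df$spend * 0.9                 # bulk operation on a whole column
    aggregate(spend ~ user, data = df, FUN = sum)  # grouped aggregation via a high-level API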
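
The exact code and hardware behind the slide 4 timings are not shown on the slide; the sketch below only shows how such an aggregation and keyed join are typically written with data.table. The sizes are reduced here so it runs quickly, and actual timings depend on hardware.

    library(data.table)

    n <- 1e7                                          # slide uses 100M rows; reduced here
    d <- data.table(x = sample(1e6L, n, replace = TRUE), v = runif(n))
    m <- data.table(x = 1:1e6, w = runif(1e6))

    system.time(d[, .(sum_v = sum(v)), by = x])       # aggregate n rows into ~1M groups

    setkey(d, x)
    setkey(m, x)
    system.time(m[d])                                 # join the n-row table with the 1M-row table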
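
The fast R code "Markus" posted is not reproduced on slide 7; a vectorized base-R version of the same task (count the lines containing "a" and the lines containing "b") might look roughly like the following, which is the kind of approach that typically explains sub-second timings on a million lines. The file name is a placeholder.

    # Count lines containing "a" and lines containing "b" with vectorized base R;
    # "data.csv" stands in for the 1M-record file mentioned on the slide.
    lines <- readLines("data.csv")
    numAs <- sum(grepl("a", lines, fixed = TRUE))
    numBs <- sum(grepl("b", lines, fixed = TRUE))
    cat("Lines with a:", numAs, " Lines with b:", numBs, "\n")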