PredictionIO – An Open Source Scalable Machine Learning Architecture

Simon Chan
Data Science London - April 24, 2013
Big Data Week

Machine Learning is... computers learning to predict from data

putting Machine Learning into practice

challenge #1 Scalability

Big Data Bottlenecks Machine Learning Processing

PredictionIO has a horizontally scalable architecture

Async SDK

    Client client = new Client(appkey);
    // Adding user behaviors
    req = client.getUserRateItemRequestBuilder(uid, iid, rate);
    client.userRateItemAsFuture(req);
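The "as future" call pattern on this slide can be illustrated with Python's standard library. This is only a sketch under stated assumptions: `send_rating` and `user_rate_item_as_future` are hypothetical names standing in for the network request, and they are not part of any PredictionIO SDK.

```python
from concurrent.futures import Future, ThreadPoolExecutor
import time


def send_rating(uid: str, iid: str, rate: int) -> dict:
    """Hypothetical stand-in for the HTTP request an SDK would send."""
    time.sleep(0.05)  # simulate network latency
    return {"uid": uid, "iid": iid, "rate": rate, "status": "created"}


executor = ThreadPoolExecutor(max_workers=4)


def user_rate_item_as_future(uid: str, iid: str, rate: int) -> Future:
    # submit() returns immediately; the request runs on a worker thread.
    return executor.submit(send_rating, uid, iid, rate)


# Fire off several requests without waiting for each one to finish.
futures = [user_rate_item_as_future("user1", f"item{i}", 5) for i in range(3)]

# The caller can do other work here, then collect results when needed.
results = [f.result() for f in futures]
executor.shutdown()
```

The point of the pattern is that the application thread never waits on a single round trip unless it explicitly asks for the result.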

Play Framework
‣ stateless - no server session
‣ non-blocking web request

Play: A Non-blocking Example

    def index = Action {
      val futureInt = scala.concurrent.Future { slowDataProcess() }
      Async {
        futureInt.map(i => Ok(views.html.result.render(i)))
      }
    }
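The same non-blocking idea can be sketched in Python with `asyncio`: the slow work is handed to a thread pool so the event loop stays free to serve other requests. The names `slow_data_process` and `index` mirror the slide, but the handler below is a hypothetical illustration and not Play code.

```python
import asyncio
import time


def slow_data_process() -> int:
    """Hypothetical blocking computation."""
    time.sleep(0.05)
    return 42


async def index() -> str:
    loop = asyncio.get_running_loop()
    # Run the blocking call on the default thread pool; the event loop
    # is not blocked while the computation is in progress.
    value = await loop.run_in_executor(None, slow_data_process)
    return f"result: {value}"


async def main() -> list:
    # Three "requests" are served concurrently rather than one by one.
    return await asyncio.gather(index(), index(), index())


responses = asyncio.run(main())
```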

MongoDB
‣ Read scaling: Replica Sets
‣ Write scaling: Sharding
‣ Indexes (e.g. geospatial)

    { geoSearch : "places", near : [33, 33], maxDistance : 6, search : { uid : "user1" } }
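To make "write scaling by sharding" concrete, here is a minimal sketch of hash-based routing: each document is assigned to a shard by hashing its key, so writes spread across machines. This shows the general idea only and is not MongoDB's actual implementation; `NUM_SHARDS` and `shard_for` are assumptions made for the illustration.

```python
import hashlib
from collections import defaultdict

NUM_SHARDS = 4  # assumed shard count for this illustration


def shard_for(key: str, num_shards: int = NUM_SHARDS) -> int:
    """Map a shard key to a shard index using a stable hash."""
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_shards


# Route 1000 writes; each shard receives only a part of the load.
shards = defaultdict(list)
for n in range(1000):
    doc = {"uid": f"user{n}", "action": "view"}
    shards[shard_for(doc["uid"])].append(doc)

sizes = {i: len(shards[i]) for i in range(NUM_SHARDS)}
```

Because the hash is stable, all writes for the same `uid` land on the same shard, which keeps per-user lookups on a single machine.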

Hadoop
‣ Hadoop
‣ Cascading (Java)
‣ Scalding (Scala)

MapReduce - Native Java

    public class WordCount {

      public static class Map extends Mapper<LongWritable, Text, Text, IntWritable> {
        private final static IntWritable one = new IntWritable(1);
        private Text word = new Text();

        public void map(LongWritable key, Text value, Context context) throws ..... {
          String line = value.toString();
          StringTokenizer tokenizer = new StringTokenizer(line);
          while (tokenizer.hasMoreTokens()) {
            word.set(tokenizer.nextToken());
            context.write(word, one);
          }
        }
      }

      public static class Reduce extends Reducer<Text, IntWritable, Text, IntWritable> {
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
          int sum = 0;
          for (IntWritable val : values) {
            sum += val.get();
          }
          context.write(key, new IntWritable(sum));
        }
      }

      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = new Job(conf, "wordcount");
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        job.setMapperClass(Map.class);
        job.setReducerClass(Reduce.class);
        job.setInputFormatClass(TextInputFormat.class);
        job.setOutputFormatClass(TextOutputFormat.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        job.waitForCompletion(true);
      }
    }

MapReduce - Scalding

    class ScaldingTestJob(args: Args) extends Job(args) {
      Tsv(args(0), 'text)
        .flatMap('text -> 'word) { text: String => text.split("\\s+") }
        .groupBy('word) { _.size }
        .write(Tsv(args(1)))
    }
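Both word-count jobs above follow the same three steps: map each line to (word, 1) pairs, group the pairs by word, and reduce each group to a sum. A single-process Python sketch makes those steps explicit; the function names are chosen for illustration, and nothing here runs on Hadoop.

```python
import re
from collections import defaultdict


def map_phase(line: str):
    # Emit (word, 1) for every whitespace-separated token.
    for word in re.split(r"\s+", line.strip()):
        if word:
            yield word, 1


def shuffle(pairs):
    # Group values by key, as the framework does between map and reduce.
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    # Sum the counts collected for one word.
    return key, sum(values)


lines = ["to be or not to be", "to predict"]
pairs = [pair for line in lines for pair in map_phase(line)]
counts = dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())
```

In a real cluster the map and reduce calls run on different machines and the shuffle moves data over the network, which is where the frameworks on these slides earn their keep.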

Sample Code

    ### Sample PredictionIO Python SDK Code
    import predictionio

    client = predictionio.Client(appkey="<your app key>")

    # Add Data
    client.create_user(uid="user123")
    client.create_item(iid="itemXYZ", itypes=(1,))
    client.user_view_item(uid="user123", iid="itemXYZ")

    # Get Prediction
    rec = client.get_itemrec(engine="<engine name>", uid="user123", n=5)

Getting Involved!
- @PredictionIO
- prediction.io
- Newsletter
- github.com/predictionio

Q&A

Q: Selecting the right features is a big problem. Can PredictionIO solve this problem?
A: Not at this moment. That’s why we currently focus on collaborative filtering algorithms, which don’t require features. We also believe that many specific problems need the involvement of data scientists. PredictionIO is positioned as a tool to make their work easier, not as a replacement.

Q: How is PredictionIO different from Weka?
A: Weka, like Mahout, is an ML algorithm library. You can see PredictionIO as a layer on top of it that helps you put algorithms into a production environment by providing a complete infrastructure.

Q: How do you compare PredictionIO with RapidMiner?
A: RapidMiner is a great product for defining data engineering workflows visually. PredictionIO focuses on a different problem: deploying ML solutions into a production environment.

Q: How do the algorithm evaluation metrics work in PredictionIO?
A: At this moment, you can evaluate algorithms with offline metrics, such as Mean Average Precision, computed on your existing data.

Q: What’s the business model?
A: We focus on making PredictionIO a useful open source product at this moment.
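As an illustration of the offline metric named in the evaluation answer, the sketch below computes Mean Average Precision at k over held-out data. The talk does not give PredictionIO's exact definition, so the normalization by `min(len(relevant), k)` is an assumption (one common convention), and the sample data is made up.

```python
def average_precision(recommended, relevant, k):
    """AP@k: precision at each rank where a relevant item appears, averaged."""
    hits = 0
    score = 0.0
    for rank, item in enumerate(recommended[:k], start=1):
        if item in relevant:
            hits += 1
            score += hits / rank  # precision at this rank
    denom = min(len(relevant), k)  # assumed normalization convention
    return score / denom if denom else 0.0


def mean_average_precision(all_recommended, all_relevant, k):
    """MAP@k: the mean of AP@k over all users."""
    aps = [
        average_precision(recs, rel, k)
        for recs, rel in zip(all_recommended, all_relevant)
    ]
    return sum(aps) / len(aps) if aps else 0.0


# Made-up example: ranked recommendations and held-out items for two users.
recs = [["a", "b", "c"], ["x", "y", "z"]]
held_out = [{"a", "c"}, {"y"}]
map_at_3 = mean_average_precision(recs, held_out, k=3)
```

For the first user the relevant items sit at ranks 1 and 3, for the second at rank 2, so the metric rewards rankings that place held-out items near the top.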
