Slide 1

PROPRIETARY • CONFIDENTIAL

Distributed Classification with ADMM
Peter Lubell-Doughtie and Jon Sondag, Intent Media

Slide 2

About Us
Introduction
• Intent Media, an ad tech startup in New York City
• Visibility into over 70% of online travel transactions in the US
• Site optimization for e-commerce
  • 95%+ of site visitors do not buy
  • We arbitrage transaction and media revenue
• Binary classification is a theme throughout our business
  • Modeling P(Ad Click)
  • Modeling P(Conversion)

Slide 3

Background
Intent Media Data Stack
[Architecture diagram]

Slide 4

For large-scale logistic regression
Alternatives
• Sample the data and run on a single machine
• Apache Mahout
  • Stochastic Gradient Descent (SGD)
  • Adaptive parallel learning of parameters in the SGD algorithm
  • Does not parallelize over machines: multithreaded parameter selection on a single machine

Slide 5

For large-scale logistic regression
Alternatives
• Vowpal Wabbit
  • Parallel SGD and L-BFGS
  • Runs on Hadoop using AllReduce
  • Handles tera-scale feature sets
  • Communication outside MapReduce
• MLlib (Spark)
  • SGD
  • Runs on the JVM
  • In-memory, but caches large datasets
  • Not Hadoop

Slide 6

Technology Perspective
MapReduce is Good Enough
• Make machine learning just another Hadoop job

“Instead of trying to invent screwdrivers, we should simply get rid of everything that's not a nail.” (Jimmy Lin)

Slide 7

Technology Perspective
Motivation for ADMM on Hadoop
• Years of company experience with Hadoop for data processing and aggregation
  • Aggregating user session statistics by site and product type
  • Aggregating unique visitors by site and time window
  • Many other jobs
[Diagram: Hadoop uses — Aggregation, Ad Hoc Analysis, Attribution, Machine Learning]

Slide 8

Technology Perspective
Motivation for ADMM on Hadoop
• Much of the Intent Media codebase is Java, with Clojure for distributed data processing tasks
[Pipeline diagram: Log Data (JSON, on S3) → ETL (Cascalog, on EMR) → Modeling (Lin Reg/ADMM, on EMR)]

Slide 9

Dataset
Motivation for ADMM on Hadoop
• Tens of millions of rows
• Hundreds of features
  • Time since last conversion
  • Click count
  • Has visited checkout page
  • Percentage of recent searches with consistent product details
• Rapidly expanding datasets and feature sets

Slide 10

Machine Learning Perspective
Motivation for ADMM on Hadoop
• Scaling up the machine learning problem
• Automatically tuned hyperparameters
• Easily transition from prototype to cloud

Slide 11

Background
ADMM
• Presented in Boyd et al. [2011], Distributed Optimization and Statistical Learning via the Alternating Direction Method of Multipliers
• A general framework for distributing convex optimization problems
• We will focus on the logistic regression case

Slide 12

Background
ADMM
• Linear machine learning problem:

    minimize (1/m) ∑_{i=1}^{m} l_i(A_i^T x − b_i) + r(x)

  where A_i is the i-th row of the feature matrix A, b_i is the i-th value of the target vector b, and r(x) is a regularization penalty on the model parameters x.

Slide 13

Parallelizing Logistic Regression
ADMM
• Logistic regression (labels b_i ∈ {−1, +1}, L2 penalty with weight λ):

    minimize (1/m) ∑_{i=1}^{m} log(1 + exp(−b_i A_i^T x)) + λ‖x‖₂²
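The objective above can be sketched in plain Python as a sanity check (illustrative only, not the production code; the function name and tiny dataset are assumptions):

```python
import math

def logistic_objective(A, b, x, lam):
    """(1/m) * sum_i log(1 + exp(-b_i * A_i.x)) + lam * ||x||_2^2."""
    m = len(A)
    loss = sum(math.log1p(math.exp(-bi * sum(aj * xj for aj, xj in zip(ai, x))))
               for ai, bi in zip(A, b)) / m
    return loss + lam * sum(xj * xj for xj in x)

# Tiny example: two rows, labels in {-1, +1}. At x = 0 every
# per-row loss is log(1 + exp(0)) = log 2 and the penalty is 0.
A = [[1.0, 0.0], [0.0, 1.0]]
b = [1.0, -1.0]
print(logistic_objective(A, b, [0.0, 0.0], 0.1))  # log(2) ≈ 0.693
```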

Slide 14

Parallelizing Logistic Regression
ADMM
• Logistic regression:

    minimize (1/m) ∑_{i=1}^{m} log(1 + exp(−b_i A_i^T x)) + λ‖x‖₂²

• Rewrite for parallelization: split the rows into N blocks, give block j its own copy x_j, and tie the copies to a consensus variable z:

    minimize (1/m) ∑_{i=1}^{m} log(1 + exp(−b_i A_i^T x_j)) + λ‖z‖₂²
    subject to x_j − z = 0, ∀j

Slide 15

Parallelizing Logistic Regression
ADMM
• Logistic regression:

    minimize (1/m) ∑_{i=1}^{m} log(1 + exp(−b_i A_i^T x)) + λ‖x‖₂²

• Rewrite for parallelization:

    minimize (1/m) ∑_{i=1}^{m} log(1 + exp(−b_i A_i^T x_j)) + λ‖z‖₂²
    subject to x_j − z = 0, ∀j

• Method of Multipliers: add the augmented quadratic penalty:

    minimize (1/m) ∑_{i=1}^{m} log(1 + exp(−b_i A_i^T x_j)) + λ‖z‖₂² + (ρ/2) ∑_{j=1}^{N} ‖x_j − z‖₂²
    subject to x_j − z = 0, ∀j

Slide 16

Algorithm for Logistic Regression
ADMM
• Separability: break each iteration into three steps:

    x_j^{k+1} := argmin_{x_j} (1/m_j) ∑_{i_j=1}^{m_j} log(1 + exp(−b_{i_j} A_{i_j}^T x_j)) + (ρ^k/2) ‖x_j − z^k + u_j^k‖₂²

    z_q^{k+1} := x̄_q^{k+1} + ū_q^k                                  if q = 1 (the unregularized intercept)
    z_q^{k+1} := (N ρ^k / (2λ + N ρ^k)) (x̄_q^{k+1} + ū_q^k)        otherwise

    u_j^{k+1} := u_j^k + x_j^{k+1} − z^{k+1}

Slide 17

Algorithm for Logistic Regression
ADMM
• Separability: break each iteration into three steps:

    x_j^{k+1} := argmin_{x_j} (1/m_j) ∑_{i_j=1}^{m_j} log(1 + exp(−b_{i_j} A_{i_j}^T x_j)) + (ρ^k/2) ‖x_j − z^k + u_j^k‖₂²        (computationally intensive)

    z_q^{k+1} := x̄_q^{k+1} + ū_q^k                                  if q = 1
    z_q^{k+1} := (N ρ^k / (2λ + N ρ^k)) (x̄_q^{k+1} + ū_q^k)        otherwise        (computationally easy)

    u_j^{k+1} := u_j^k + x_j^{k+1} − z^{k+1}        (computationally easy)
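The two easy steps above can be sketched in plain Python (a sketch, not the production code: the x-update, which calls an L-BFGS solver per block, is elided; the intercept is taken to be coordinate 0 since Python is 0-indexed; all names are assumptions):

```python
def z_update(x_bar, u_bar, N, rho, lam):
    """Consensus step: the intercept (coordinate 0) is unregularized;
    every other coordinate is shrunk by N*rho / (2*lam + N*rho)."""
    shrink = N * rho / (2 * lam + N * rho)
    return [xb + ub if q == 0 else shrink * (xb + ub)
            for q, (xb, ub) in enumerate(zip(x_bar, u_bar))]

def u_update(u, x_new, z_new):
    """Dual step for one block: u_j <- u_j + x_j - z."""
    return [uj + xj - zj for uj, xj, zj in zip(u, x_new, z_new)]

# Example with N = 2 blocks, rho = 1, lam = 1: shrink factor = 2/4 = 0.5.
z = z_update([1.0, 4.0], [0.0, 0.0], N=2, rho=1.0, lam=1.0)
print(z)                                        # [1.0, 2.0]
print(u_update([0.0, 0.0], [1.0, 4.0], z))      # [0.0, 2.0]
```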

Slide 18

The Algorithm on MapReduce
Translating ADMM to Hadoop

    procedure ADMM(A, b, N, maxIterations)
        k = 0
        while notConverged and k < maxIterations do
            for j = 1 → N do
                update x_j^k using Eq. 1
            end for
            update z^k using Eq. 2
            for j = 1 → N do
                update u_j^k using Eq. 3
            end for
            update ρ^k using Eq. 4
            k ← k + 1
        end while
        write z^k to S3
    end procedure

Figure 1. The ADMM procedure implemented for Hadoop MapReduce. notConverged is a helper function that evaluates the norms to check for convergence: the algorithm has converged when ‖r^k‖₂ ≤ ε_pri and ‖s^k‖₂ ≤ ε_dual. If the driver has not converged and the maximum number of iterations has not passed, we dispatch the mappers and begin the next iteration.

Figure 2. The Job Runner sets paths for the Jar file with ADMM and the value of the maximum number of iterations. ADMM computes iterations against HDFS; after the final iteration, the result is output to S3.
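The driver loop in Figure 1 can be sketched as a plain loop (local and single-process; the actual mapper/reducer dispatch and the Eq. 1–4 bodies are elided, and every name here is an assumption):

```python
def admm_driver(x_step, z_step, u_step, rho_step, converged, max_iterations):
    """Figure 1 as a loop: run x-, z-, u-, and rho-updates until convergence
    or until the iteration budget is exhausted; returns the iteration count."""
    k = 0
    while not converged(k) and k < max_iterations:
        x_step(k)     # one mapper per block j (Eq. 1)
        z_step(k)     # reducer: consensus update (Eq. 2)
        u_step(k)     # reducer: dual update per block (Eq. 3)
        rho_step(k)   # driver: penalty-parameter update (Eq. 4)
        k += 1
    return k

# Trivial example: record the call order; "convergence" at iteration 3.
calls = []
n = admm_driver(lambda k: calls.append(("x", k)),
                lambda k: calls.append(("z", k)),
                lambda k: calls.append(("u", k)),
                lambda k: calls.append(("rho", k)),
                converged=lambda k: k >= 3,
                max_iterations=10)
print(n)  # 3
```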

Slide 19

Slide 19 text

P RO P R I E TA RY • C O N F I D E N T I A L < > MEDIA intent Optimization with MapReduce Translating ADMM to Hadoop In the mapper compute L-BFGS 1 9 Mapper x k+1 1 x k+1 2 ( x k 1, z k , u k 1 ) ( x k 2, z k , u k 2 )

Slide 20

Slide 20 text

P RO P R I E TA RY • C O N F I D E N T I A L < > MEDIA intent Optimization with MapReduce Translating ADMM to Hadoop In the reducer compute update z and u 2 0 Reducer ( x k+1 1 , u k 1 ) ( x k+1 2 , u k 2 ) zk+1 ( u k 1, x k+1 1 , z k+1) ( u k 2, x k+1 2 , z k+1) uk+1 1 uk+1 2

Slide 21

Slide 21 text

P RO P R I E TA RY • C O N F I D E N T I A L < > MEDIA intent Optimization with MapReduce Translating ADMM to Hadoop • Store state in HDFS without HBase or any external tools • Store a JSON hash-map from each split ID to its and • Distribute to all the mapper • on the next iteration each mapper knows which and to load 2 1 xj uj xj uj HDFS split ID 1 split ID 2 x1 x2 u2 u1
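The JSON state map described above might look like the following (a local sketch using an ordinary file in place of HDFS; the key names, paths, and helper functions are all assumptions, not the production layout):

```python
import json
import os
import tempfile

def write_state(path, state):
    """Persist {split ID -> {"x": [...], "u": [...]}} between iterations."""
    with open(path, "w") as f:
        json.dump(state, f)

def load_block(path, split_id):
    """Each mapper looks up its own x_j and u_j by split ID."""
    with open(path) as f:
        state = json.load(f)
    return state[split_id]["x"], state[split_id]["u"]

path = os.path.join(tempfile.mkdtemp(), "admm_state.json")
write_state(path, {"split-1": {"x": [0.1, 0.2], "u": [0.0, 0.0]},
                   "split-2": {"x": [0.3, 0.4], "u": [0.0, 0.0]}})
x, u = load_block(path, "split-2")
print(x)  # [0.3, 0.4]
```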

Slide 22

Slide 22 text

P RO P R I E TA RY • C O N F I D E N T I A L < > MEDIA intent How cluster time is spent Results • Warm start L-BFGS with average result from previous iteration • Time per iteration is roughly constant 2 2 15.2% 4.1% 24.0% 1.2% 96.0% Seconds(Per(Itera.on,(Average( Get%Input%Data% L4BFGS% Shuffle% Reduce% Job%Setup/Teardown%

Slide 23

Slide 23 text

P RO P R I E TA RY • C O N F I D E N T I A L < > MEDIA intent Automatic multiplier tuning Translating ADMM to Hadoop 2 3 0.01$ 0.1$ 1$ 10$ 0$ 5$ 10$ 15$ 20$ 25$ itera,on$ rNorm,&sNorm& rNorm[k]$ sNorm[k]$ 0" 0.2" 0.4" 0.6" 0.8" 1" 1.2" 0" 5" 10" 15" 20" 25" itera/on" Rho$ rho[k]" ⇢k+1 := 8 > < > : ⌧incr⇢k if krkk2 > µkskk2 ⇢k/⌧decr if kskk2 > µkrkk2 ⇢k otherwise
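The ρ-update rule above is straightforward to express in code (a sketch; the default values τ = 2 and μ = 10 follow Boyd et al., not these slides):

```python
def update_rho(rho, r_norm, s_norm, mu=10.0, tau_incr=2.0, tau_decr=2.0):
    """Keep the primal residual r and dual residual s balanced by
    scaling the penalty parameter rho between iterations."""
    if r_norm > mu * s_norm:
        return rho * tau_incr   # primal residual too large: increase rho
    if s_norm > mu * r_norm:
        return rho / tau_decr   # dual residual too large: decrease rho
    return rho

print(update_rho(1.0, r_norm=50.0, s_norm=1.0))  # 2.0
print(update_rho(1.0, r_norm=1.0, s_norm=50.0))  # 0.5
print(update_rho(1.0, r_norm=5.0, s_norm=1.0))   # 1.0
```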

Slide 24

Slide 24 text

P RO P R I E TA RY • C O N F I D E N T I A L < > MEDIA intent Objective function & Fit vs Iteration Results 2 4 0.6$ 6$ 60$ 0$ 2$ 4$ 6$ 8$ 10$ 12$ 14$ f"#"f*" itera.on$ Primal'Objec-ve'Func-on'Loss' 77.21%& 77.53%& 0.5& 0.6& 0.7& 0.8& 0& 2& 4& 6& 8& 10& 12& 14& AUC$ itera2on& Test%Dataset%AUC%

Slide 25

Slide 25 text

P RO P R I E TA RY • C O N F I D E N T I A L < > MEDIA intent for ADMM on Hadoop Some Next Steps 1. Add standard errors for model parameters 2. Soft thresholding for L1 regularization 3. Hinge loss for SVM 2 5

Slide 26

Slide 26 text

P RO P R I E TA RY • C O N F I D E N T I A L < > MEDIA intent Thank you Questions? ADMM is open source https://github.com/intentmedia/admm 2 6