
Distributed Classification with ADMM


We describe a specific implementation of the Alternating Direction Method of Multipliers (ADMM) algorithm for distributed optimization. This implementation runs logistic regression with L2 regularization over large datasets and does not require a user-tuned learning rate metaparameter or any tools beyond MapReduce. Throughout we emphasize the practical lessons learned while implementing an iterative MapReduce algorithm and the advantages of remaining within the Hadoop ecosystem.

Peter Lubell-Doughtie

October 09, 2013

Transcript

1. Distributed Classification with ADMM. Peter Lubell-Doughtie and Jon Sondag, Intent Media.
2. About us (Introduction)
   • Intent Media, ad tech startup in New York City
   • Visibility into over 70% of online travel transactions in the US
   • Site optimization for e-commerce
     • 95%+ of site visitors do not buy
     • We arbitrage transaction and media revenue
   • Binary classification is a theme throughout our business
     • Modeling P(Ad Click)
     • Modeling P(Conversion)
3. Intent Media Data Stack (Background)
4. Alternatives for large-scale logistic regression
   • Sample the data and run on a single machine
   • Apache Mahout
     • Stochastic Gradient Descent (SGD)
     • Adaptive parallel learning of parameters in the SGD algorithm
     • Does not parallelize over machines; multithreaded for parameter selection on a single machine
5. Alternatives for large-scale logistic regression
   • Vowpal Wabbit
     • Parallel SGD and L-BFGS
     • Runs on Hadoop using AllReduce
     • Handles tera-scale feature sets
     • Communication outside MapReduce
   • MLlib (Spark)
     • SGD
     • Runs on the JVM
     • In-memory, but caches large datasets
     • Not Hadoop
6. MapReduce is Good Enough (Technology Perspective)
   • Make machine learning just another Hadoop job
   “instead of trying to invent screwdrivers, we should simply get rid of everything that's not a nail.” (Jimmy Lin)
7. Motivation for ADMM on Hadoop (Technology Perspective)
   • Years of company experience with Hadoop for data processing and aggregation
     • Aggregating user session statistics by site and product type
     • Aggregating unique visitors by site and time window
     • Many other jobs
   [Diagram: Hadoop workloads spanning Aggregation, Ad Hoc Analysis, Attribution, and Machine Learning]
8. Motivation for ADMM on Hadoop (Technology Perspective)
   • Much of the IM codebase is Java, with Clojure for distributed data processing tasks
   [Pipeline diagram: Log Data (JSON, on S3) feeds ETL (Cascalog, on EMR), which feeds Modeling (Lin Reg/ADMM, on EMR)]
9. Motivation for ADMM on Hadoop (Dataset)
   • Tens of millions of rows
   • Hundreds of features
     • Time since last conversion
     • Click count
     • Has visited checkout page
     • Percentage of recent searches with consistent product details
   • Rapidly expanding datasets and feature sets
10. Motivation for ADMM on Hadoop (Machine Learning Perspective)
   • Scaling up the machine learning problem
   • Automatically tuned hyperparameters
   • Easy transition from prototype to cloud
11. ADMM (Background)
   • Presented in Boyd et al. [2011], Distributed Optimization and Statistical Learning via the Alternating Direction Method of Multipliers
   • General framework for distributing convex optimization problems
   • We will focus on the logistic regression case
12. ADMM (Background)
   • Linear machine learning problem:

     $$\underset{x}{\text{minimize}} \quad \frac{1}{m} \sum_{i=1}^{m} l_i(A_i^T x - b_i) + r(x)$$

   where A_i is the i-th row of the feature matrix A, b_i is the i-th value of the target vector b, and r(x) is a regularization penalty on the model parameters x.
13. ADMM (Parallelizing logistic regression)
   • Logistic regression:

     $$\underset{x}{\text{minimize}} \quad \frac{1}{m} \sum_{i=1}^{m} \log\left(1 + \exp(-b_i A_i^T x)\right) + \lambda \|x\|_2^2$$
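To make the objective concrete, here is a minimal Java sketch of the L2-regularized logistic loss above (names are illustrative, not the repository's API; for simplicity the intercept is regularized along with the other weights):

final class LogisticObjective {

    static double dot(double[] a, double[] x) {
        double s = 0.0;
        for (int q = 0; q < a.length; q++) s += a[q] * x[q];
        return s;
    }

    /** (1/m) * sum_i log(1 + exp(-b_i * a_i^T x)) + lambda * ||x||_2^2 */
    static double loss(double[][] a, double[] b, double[] x, double lambda) {
        int m = a.length;
        double sum = 0.0;
        for (int i = 0; i < m; i++) {
            sum += Math.log1p(Math.exp(-b[i] * dot(a[i], x)));
        }
        double reg = 0.0;
        for (double xq : x) reg += xq * xq;
        return sum / m + lambda * reg;
    }
}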
14. ADMM (Parallelizing logistic regression)
   • Rewrite for parallelization: split the data into N blocks, give each block j its own local parameters x_j, and tie them together through a global consensus variable z:

     $$\underset{x_j,\, z}{\text{minimize}} \quad \frac{1}{m} \sum_{i=1}^{m} \log\left(1 + \exp(-b_i A_i^T x_j)\right) + \lambda \|z\|_2^2 \quad \text{s.t. } x_j - z = 0, \ \forall j$$
15. ADMM (Parallelizing logistic regression)
   • Method of Multipliers: augment the constrained problem with a quadratic penalty on the consensus constraint:

     $$\underset{x_j,\, z}{\text{minimize}} \quad \frac{1}{m} \sum_{i=1}^{m} \log\left(1 + \exp(-b_i A_i^T x_j)\right) + \lambda \|z\|_2^2 + \frac{\rho}{2N} \sum_{j=1}^{N} \|x_j - z\|_2^2 \quad \text{s.t. } x_j - z = 0, \ \forall j$$
16. ADMM (Algorithm for logistic regression)
   • Separability: break each iteration into steps

     $$x_j^{k+1} := \underset{x_j}{\operatorname{argmin}} \ \frac{1}{m_j} \sum_{i_j=1}^{m_j} \log\left(1 + \exp(-b_{i_j} A_{i_j}^T x_j)\right) + \frac{\rho^k}{2} \left\|x_j - z^k + u_j^k\right\|_2^2 \tag{Eq. 1}$$

     $$z_q^{k+1} := \begin{cases} \bar{x}_q^{k+1} + \bar{u}_q^k & \text{if } q = 1 \\ \frac{N\rho^k}{2\lambda + N\rho^k}\left(\bar{x}_q^{k+1} + \bar{u}_q^k\right) & \text{otherwise} \end{cases} \tag{Eq. 2}$$

     $$u_j^{k+1} := u_j^k + x_j^{k+1} - z^{k+1} \tag{Eq. 3}$$

   where bars denote averages over the N splits and q = 1 indexes the unregularized intercept.
17. ADMM (Algorithm for logistic regression)
   • The same steps, by cost: the x_j update (Eq. 1) is computationally intensive; the z and u updates (Eqs. 2 and 3) are computationally easy.
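The "computationally easy" updates are simple enough to sketch directly. Below is a minimal Java version of Eqs. 2 and 3, assuming xBar and uBar are the element-wise averages of the x_j and u_j over the N splits, and that index q = 0 holds the unregularized intercept (the slides index it as q = 1; all names are illustrative):

final class AdmmUpdates {

    /** z-update (Eq. 2): shrink every component except the intercept. */
    static double[] updateZ(double[] xBar, double[] uBar,
                            int n, double rho, double lambda) {
        double[] z = new double[xBar.length];
        double shrink = (n * rho) / (2.0 * lambda + n * rho);
        for (int q = 0; q < z.length; q++) {
            double avg = xBar[q] + uBar[q];
            z[q] = (q == 0) ? avg : shrink * avg;  // intercept passes through unshrunk
        }
        return z;
    }

    /** u-update (Eq. 3): u_j^{k+1} = u_j^k + x_j^{k+1} - z^{k+1}. */
    static double[] updateU(double[] u, double[] x, double[] z) {
        double[] uNext = new double[u.length];
        for (int q = 0; q < u.length; q++) uNext[q] = u[q] + x[q] - z[q];
        return uNext;
    }
}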
18. Translating ADMM to Hadoop (The Algorithm on MapReduce)

   1: procedure ADMM(A, b, N, maxIterations)
   2:   k = 0
   3:   while notConverged and k < maxIterations do
   4:     for j = 1 → N do
   5:       update x_j^k using Eq. 1
   6:     end for
   7:     update z^k using Eq. 2
   8:     for j = 1 → N do
   9:       update u_j^k using Eq. 3
   10:    end for
   11:    update ρ^k using Eq. 4
   12:    k ← k + 1
   13:  end while
   14:  write z^k to S3
   15: end procedure

   Figure 1. The ADMM procedure implemented for Hadoop MapReduce. notConverged is a helper function that evaluates the norms to check for convergence: ‖r^k‖₂ ≤ ε^pri and ‖s^k‖₂ ≤ ε^dual. If the driver has not converged and the maximum number of iterations has not passed, we dispatch the mappers and begin the next iteration.

   Figure 2. The Job Runner sets paths for the input data and the JAR file with ADMM, and the value of the maximum number of iterations. ADMM keeps intermediate state on HDFS; after the final iteration, the result (in JSON notation) is output to S3.
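A sketch of how the Figure 1 loop maps onto a driver that submits one Hadoop MapReduce job per ADMM iteration. Job and Configuration are the real Hadoop APIs; configureIo and converged are hypothetical stand-ins for the path setup and the ‖r^k‖₂ ≤ ε^pri, ‖s^k‖₂ ≤ ε^dual checks described in the captions:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class AdmmDriver {
    public static void run(Configuration base, int maxIterations) throws Exception {
        for (int k = 0; k < maxIterations; k++) {
            Configuration conf = new Configuration(base);
            conf.setInt("admm.iteration", k);       // mappers use this to find state on HDFS
            Job job = Job.getInstance(conf, "admm-iteration-" + k);
            configureIo(job, k);                    // hypothetical: input data + state paths
            if (!job.waitForCompletion(true)) {
                throw new IllegalStateException("ADMM iteration " + k + " failed");
            }
            if (converged(conf, k)) break;          // hypothetical: evaluates residual norms
        }
        // after the final iteration, copy z^k from HDFS to S3 (omitted)
    }

    // hypothetical helper stubs standing in for the logic described above
    private static void configureIo(Job job, int k) { /* set iteration-k paths */ }
    private static boolean converged(Configuration conf, int k) { /* check norms */ return false; }
}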
19. Translating ADMM to Hadoop (Optimization with MapReduce)
   • In the mapper, compute the x-update (Eq. 1) with L-BFGS
   [Diagram: mapper j takes (x_j^k, z^k, u_j^k) and emits x_j^{k+1}]
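A sketch of the mapper side under these assumptions: each mapper buffers its split's rows in map() and runs one L-BFGS solve in cleanup() once the split has been read. Mapper, Context, and the Writable types are real Hadoop APIs; loadSplitState, minimizeWithLbfgs, serialize, and the admm.split.id key are hypothetical stand-ins, not the repository's actual code:

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class AdmmMapper extends Mapper<LongWritable, Text, IntWritable, Text> {

    private final List<double[]> rows = new ArrayList<>();  // this split's feature rows A_{i_j}
    private final List<Double> labels = new ArrayList<>();  // this split's labels b_{i_j}

    @Override
    protected void map(LongWritable offset, Text line, Context ctx) {
        String[] parts = line.toString().split(",");        // assumed format: "label,f1,f2,..."
        labels.add(Double.parseDouble(parts[0]));
        double[] row = new double[parts.length - 1];
        for (int q = 1; q < parts.length; q++) row[q - 1] = Double.parseDouble(parts[q]);
        rows.add(row);
    }

    @Override
    protected void cleanup(Context ctx) throws IOException, InterruptedException {
        Configuration conf = ctx.getConfiguration();
        int j = conf.getInt("admm.split.id", 0);            // hypothetical: how the split learns its ID
        double[][] s = loadSplitState(conf, j);             // {x_j^k, z^k, u_j^k} read from HDFS
        double[] xNext = minimizeWithLbfgs(rows, labels, s[0], s[1], s[2]); // solve Eq. 1
        // emit under a single key so that one reducer sees every split's result
        ctx.write(new IntWritable(0), new Text(serialize(j, xNext, s[2])));
    }

    // hypothetical helper stubs standing in for the real implementations:
    private static double[][] loadSplitState(Configuration conf, int j) { return new double[3][]; }
    private static double[] minimizeWithLbfgs(List<double[]> a, List<Double> b,
                                              double[] x0, double[] z, double[] u) { return x0; }
    private static String serialize(int j, double[] x, double[] u) { return ""; }
}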
20. Translating ADMM to Hadoop (Optimization with MapReduce)
   • In the reducer, update z and u
   [Diagram: the reducer collects (x_j^{k+1}, u_j^k) from every mapper, computes z^{k+1}, and emits (u_j^{k+1}, x_j^{k+1}, z^{k+1}) for each split]
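A matching sketch of the reducer. Because every mapper emits under a single key, one reduce() call sees every split's (x_j^{k+1}, u_j^k) pair; it averages them, applies Eqs. 2 and 3 via the AdmmUpdates helpers sketched earlier, and writes each split's new state. SplitUpdate, deserialize, and serialize are hypothetical:

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class AdmmReducer extends Reducer<IntWritable, Text, IntWritable, Text> {

    private static final class SplitUpdate {   // one mapper's (j, x_j^{k+1}, u_j^k)
        int j; double[] x; double[] u;
    }

    @Override
    protected void reduce(IntWritable key, Iterable<Text> values, Context ctx)
            throws IOException, InterruptedException {
        double rho = ctx.getConfiguration().getFloat("admm.rho", 1.0f);
        double lambda = ctx.getConfiguration().getFloat("admm.lambda", 0.0f);

        List<SplitUpdate> updates = new ArrayList<>();
        for (Text v : values) updates.add(deserialize(v.toString()));

        int n = updates.size(), d = updates.get(0).x.length;
        double[] xBar = new double[d], uBar = new double[d];
        for (SplitUpdate s : updates)
            for (int q = 0; q < d; q++) { xBar[q] += s.x[q] / n; uBar[q] += s.u[q] / n; }

        double[] z = AdmmUpdates.updateZ(xBar, uBar, n, rho, lambda);  // Eq. 2
        for (SplitUpdate s : updates) {
            double[] uNext = AdmmUpdates.updateU(s.u, s.x, z);         // Eq. 3
            ctx.write(new IntWritable(s.j), new Text(serialize(s.x, uNext, z)));
        }
    }

    // hypothetical helper stubs:
    private static SplitUpdate deserialize(String line) {
        SplitUpdate s = new SplitUpdate();
        s.x = new double[0]; s.u = new double[0];
        return s;
    }
    private static String serialize(double[] x, double[] u, double[] z) { return ""; }
}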
21. Translating ADMM to Hadoop (Optimization with MapReduce)
   • Store state in HDFS without HBase or any other external tools
   • Store a JSON hash-map from each split ID to its x_j and u_j
   • Distribute it to all the mappers, so on the next iteration each mapper knows which x_j and u_j to load
   [Diagram: HDFS maps split ID 1 to (x_1, u_1) and split ID 2 to (x_2, u_2)]
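A sketch of that state store: one JSON document on HDFS mapping split IDs to their x_j and u_j, written once per iteration and read back by every mapper. FileSystem and Path are the real Hadoop APIs and ObjectMapper is Jackson; the SplitState shape and the state path are illustrative, not the repository's actual schema:

import java.io.InputStream;
import java.io.OutputStream;
import java.util.Map;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import com.fasterxml.jackson.core.type.TypeReference;
import com.fasterxml.jackson.databind.ObjectMapper;

public class HdfsStateStore {

    public static class SplitState {   // one split's slice of the state
        public double[] x;
        public double[] u;
    }

    private static final ObjectMapper JSON = new ObjectMapper();

    /** Write the whole splitId -> (x_j, u_j) map as one JSON file on HDFS. */
    public static void save(Configuration conf, Path path,
                            Map<String, SplitState> state) throws Exception {
        FileSystem fs = FileSystem.get(conf);
        try (OutputStream out = fs.create(path, true)) {
            JSON.writeValue(out, state);
        }
    }

    /** Each mapper loads the map and picks out its own split's x_j and u_j. */
    public static Map<String, SplitState> load(Configuration conf, Path path) throws Exception {
        FileSystem fs = FileSystem.get(conf);
        try (InputStream in = fs.open(path)) {
            return JSON.readValue(in, new TypeReference<Map<String, SplitState>>() {});
        }
    }
}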
22. Results (How cluster time is spent)
   • Warm start L-BFGS with the average result from the previous iteration
   • Time per iteration is roughly constant
   [Chart: average seconds per iteration, broken down into Get Input Data, L-BFGS, Shuffle, Reduce, and Job Setup/Teardown]
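The warm start is just a seeding choice: rather than starting each L-BFGS solve from zero, begin from the average of the previous iteration's solutions. A minimal sketch (names illustrative):

final class WarmStart {
    /** Seed L-BFGS with the element-wise mean of the previous iteration's x_j. */
    static double[] average(double[][] previousX) {
        int n = previousX.length, d = previousX[0].length;
        double[] x0 = new double[d];
        for (double[] xj : previousX)
            for (int q = 0; q < d; q++) x0[q] += xj[q] / n;
        return x0;
    }
}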
23. Translating ADMM to Hadoop (Automatic multiplier tuning)

     $$\rho^{k+1} := \begin{cases} \tau^{\text{incr}} \rho^k & \text{if } \|r^k\|_2 > \mu \|s^k\|_2 \\ \rho^k / \tau^{\text{decr}} & \text{if } \|s^k\|_2 > \mu \|r^k\|_2 \\ \rho^k & \text{otherwise} \end{cases} \tag{Eq. 4}$$

   [Plots: rNorm and sNorm vs. iteration (log scale); ρ vs. iteration]
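Eq. 4 translates directly to code. A minimal sketch, with mu, tauIncr, and tauDecr as the usual constants (Boyd et al. suggest μ = 10 and τ = 2):

final class RhoUpdate {
    /** Eq. 4: adapt the penalty parameter from the residual norms. */
    static double update(double rho, double rNorm, double sNorm,
                         double mu, double tauIncr, double tauDecr) {
        if (rNorm > mu * sNorm) return tauIncr * rho;  // primal residual dominates: increase rho
        if (sNorm > mu * rNorm) return rho / tauDecr;  // dual residual dominates: decrease rho
        return rho;                                    // otherwise leave rho unchanged
    }
}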
24. Results (Objective function & fit vs. iteration)
   [Plots: primal objective function loss f - f* vs. iteration (log scale); test dataset AUC vs. iteration, rising from 77.21% to 77.53%]
25. Some Next Steps for ADMM on Hadoop
   1. Add standard errors for model parameters
   2. Soft thresholding for L1 regularization
   3. Hinge loss for SVM
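For next step 2, the L1 variant would swap the ridge shrinkage in the z-update for the soft-thresholding operator S_κ(a) = sign(a) · max(|a| - κ, 0), with κ = λ/(Nρ) in the consensus form per Boyd et al. A minimal sketch:

final class SoftThreshold {
    /** S_kappa(a) = sign(a) * max(|a| - kappa, 0). */
    static double apply(double a, double kappa) {
        if (a > kappa)  return a - kappa;  // shrink positive components
        if (a < -kappa) return a + kappa;  // shrink negative components
        return 0.0;                        // zero out small components: sparsity
    }
}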
26. Thank you. Questions? ADMM is open source: https://github.com/intentmedia/admm