Lucene/Solr Revolution 2015: Where Search Meets Machine Learning

Diana Hu
October 15, 2015

Search engines have focused on solving the document retrieval problem, so their scoring functions do not naturally handle non-traditional IR data types such as numerical or categorical fields. In domains beyond traditional search, scores representing strengths of associations or matches can therefore vary widely, and the original retrieval model no longer suffices: relevance ranking becomes a two-phase approach of 1) regular search followed by 2) an external model that re-ranks the filtered items. Metrics such as click-through and conversion rates capture users' responses to the items served, and selection rates predicted in real time can be critical for optimal matching. For example, in recommender systems the predicted performance of a recommended item in a given context, also called response prediction, is often used to determine the set of recommendations to serve for a given serving opportunity; similar techniques are used in the advertising domain. To address this issue the authors have created ML-Scoring, an open source framework that tightly integrates machine learning models into a popular search engine (Solr/Elasticsearch), replacing the default IR-based ranking function. A custom model is trained through either Weka or Spark and loaded as a plugin used at query time to compute custom scores.


Transcript

  1. Where Search Meets Machine Learning Diana Hu @sdianahu — Data

    Science Lead, Verizon Joaquin Delgado @joaquind — Director of Engineering, Verizon
  2. Disclaimer

    The content of this presentation reflects the authors’ personal statements and does not officially represent their employer’s views in any way. Included content is especially not intended to convey the views of OnCue or Verizon.
  3. Index 1.  Introduction 2.  Search and Information Retrieval 3.  ML

    problems as Search-based Systems 4.  ML Meets Search!
  4. Scaling learning systems is hard! •  Millions of users, items

    •  Billions of features •  Imbalanced Datasets •  Complex Distributed Systems •  Many algorithms have not been tested at “Internet Scale”
  5. Typical approaches •  Distributed systems – Fault tolerance, Throughput vs.

    latency •  Parallelization Strategies – Hashing, trees •  Processing – Map reduce variants, MPI, graph parallel •  Databases – Key/Value Stores, NoSQL Such a custom system requires TLC
  6. Search Search is about finding specific things that are either

    known or assumed to exist; Discovery is about helping the user encounter what he/she didn’t even know existed. •  Focused on Search: Search Engines, Database Systems •  Focused on Discovery: Recommender Systems, Advertising. Predicate Logic and Declarative Languages Rock!
  7. Search stack

    [Diagram] Offline processing: Documents → Representation Function (with optional *Metadata Engineering) → Doc Representation Index. Online processing: Input Query → Representation Function → Query Representation → Similarity Calculation against the index → Matched Hits → Retrieved Documents, with optional (*) Relevance Feedback.
  8. Search Engines: the big hammer •  Search engines are largely

    used to solve non-IR search problems, because: •  Widely Available •  Fast and Scalable •  Integrates well with existing data stores
  9. But… Are we using the right tool? •  Search Engines

    were originally designed for IR. •  Complex non-IR search tasks sometimes require a two-phase approach: Phase 1) Filter, Phase 2) Rank
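    As a hedged illustration of that two-phase pattern (the class, interface, and model call below are placeholders for illustration, not from the deck):

    import java.util.Comparator;
    import java.util.List;
    import java.util.stream.Collectors;

    public class TwoPhaseRanking {
        interface Model { double score(String docText); }   // stand-in for the external re-ranking model

        // Phase 1: filter candidates with the engine's match logic.
        // Phase 2: re-rank the filtered set with the externally trained model.
        public static List<String> rank(List<String> candidateDocs, String queryTerm, Model model) {
            return candidateDocs.stream()
                    .filter(doc -> doc.contains(queryTerm))                          // phase 1: filter
                    .sorted(Comparator.comparingDouble(model::score).reversed())     // phase 2: rank
                    .collect(Collectors.toList());
        }
    }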
  10. Machine Learning Machine Learning, in particular supervised learning, refers to

    techniques used to learn how to classify or score previously unseen objects based on a training dataset. Inference and Generalization are the Key!
  11. Learning systems’ stack

    [Diagram] Components: Visualization / UI, Query Generation and Contextual Pre-filtering, Retrieval, Ranking, Contextual Post Filtering (online); Data/Events Collections, Data Analytics, Model Building, Index Building (offline); Experimentation.
  12. Case study: Recommender Systems •  Reduce information load by estimating

    relevance •  Ranking (aka Relevance) Approaches: •  Collaborative filtering •  Content Based •  Knowledge Based •  Hybrid •  Beyond rating prediction and ranking •  Business filtering logic •  Low latency and Scale
  13. RecSys: Content based models •  Rec Task: Given a user

    profile, find the best matching items by their attributes •  Similarity calculation: based on keyword overlap between user/items •  Neighborhood methods (e.g. nearest neighbor) •  Query-based retrieval (e.g. Rocchio’s method) •  Probabilistic methods (classical text classification) •  Explicit decision models •  Feature representation: based on content analysis •  Vector space model •  TF-IDF •  Topic Modeling
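    A minimal sketch of the keyword-overlap similarity described above, as cosine similarity over sparse TF-IDF vectors (the term weights and profile contents are illustrative assumptions):

    import java.util.HashMap;
    import java.util.Map;

    public class TfIdfSimilarity {
        // Cosine similarity between two sparse TF-IDF vectors keyed by term.
        public static double cosine(Map<String, Double> userProfile, Map<String, Double> item) {
            double dot = 0.0, normU = 0.0, normI = 0.0;
            for (Map.Entry<String, Double> e : userProfile.entrySet()) {
                Double w = item.get(e.getKey());
                if (w != null) dot += e.getValue() * w;      // only overlapping terms contribute
                normU += e.getValue() * e.getValue();
            }
            for (double w : item.values()) normI += w * w;
            return (normU == 0 || normI == 0) ? 0.0 : dot / (Math.sqrt(normU) * Math.sqrt(normI));
        }

        public static void main(String[] args) {
            Map<String, Double> user = new HashMap<>();
            user.put("sci-fi", 0.8); user.put("thriller", 0.3);
            Map<String, Double> item = new HashMap<>();
            item.put("sci-fi", 0.9); item.put("drama", 0.4);
            System.out.println(cosine(user, item));          // higher score = better content match
        }
    }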
  14. RecSys: Collaborative Filtering Matrix Factorization

    [Diagram] Offline processing: Rating Dataset → Matrix Factorization → User Factors and Item Factors. Online processing: Input Query → Re-Ranking Model over the factors → Recommendations.
  15. RecSys: Collaborative Filtering Matrix Factorization

    [Diagram] Same architecture as the previous slide.
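    A hedged sketch of the online scoring step in the diagram above: the offline-trained user and item latent factors are combined by a dot product (the bias terms and factor values below are illustrative assumptions):

    public class MatrixFactorizationScore {
        // Predicted relevance is the dot product of latent factors,
        // optionally shifted by user/item biases learned offline.
        public static double score(double[] userFactors, double[] itemFactors,
                                    double userBias, double itemBias, double globalMean) {
            double dot = 0.0;
            for (int k = 0; k < userFactors.length; k++) {
                dot += userFactors[k] * itemFactors[k];
            }
            return globalMean + userBias + itemBias + dot;
        }

        public static void main(String[] args) {
            double[] u = {0.2, -0.5, 1.1};   // user latent factors (from offline training)
            double[] v = {0.7,  0.1, 0.9};   // item latent factors
            System.out.println(score(u, v, 0.1, -0.05, 3.5));   // online re-ranking score
        }
    }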
  16. Remember the elephant?

    [Diagram] The learning systems’ stack from slide 11: Visualization / UI, Query Generation and Contextual Pre-filtering, Retrieval, Ranking, Contextual Post Filtering (online); Data/Events Collections, Data Analytics, Model Building, Index Building (offline); Experimentation.
  17. Simplifying the stack!

    [Diagram] The same stack, with Retrieval, Contextual Post Filtering, and Ranking grouped together: Visualization / UI, Query Generation and Contextual Pre-filtering (online); Data/Events Collections, Data Analytics, Model Building, Index Building (offline); Experimentation.
  18. Search stack

    [Diagram] The search stack from slide 7: offline, Documents → Representation Function (with optional *Metadata Engineering) → Doc Representation Index; online, Input Query → Representation Function → Query Representation → Similarity Calculation → Matched Hits → Retrieved Documents, with optional (*) Relevance Feedback.
  19. Simplifying the Search stack

    [Diagram] The same search stack, with Retrieval, Contextual Post Filtering, and Ranking handled by an ML-Scoring Plugin inside the engine that loads a Serialized ML Model.
  20. ML-Scoring architecture

    [Diagram] Offline processing: a Trainer + Indexer consumes Instances + Labels and produces a Serialized ML Model and an Instances Index. Online processing: Lucene/Solr answers queries using the ML Scoring Plugin, which applies the Serialized ML Model over the Instances Index.
  21. ML-Scoring Options •  Option A: Solr FunctionQuery •  Pro: Model

    is just a query! •  Cons: Limits expressiveness of models •  Option B: Solr Custom Function Query •  Pro: Loading any type of model (also PMML) •  Cons: Memory limitations, also multiple model reloading •  Option C: Lucene CustomScoreQuery •  Pro: Can use PMML and tune how PMML gets loaded •  Cons: No control on matches •  Option D: Lucene Low level Custom Query •  *Mahout vectors from Lucene text (only trains, so not an option)
  22. Real-life Problem •  Census database that contains documents with the

    following fields: 1. Age: continuous; 2. Workclass: 8 values; 3. Fnlwgt: continuous; 4. Education: 16 values; 5. Education-num: continuous; 6. Marital-status: 7 values; 7. Occupation: 14 values; 8. Relationship: 6 values; 9. Race: 5 values; 10. Sex: Male, Female; 11. Capital-gain: continuous; 12. Capital-loss: continuous; 13. Hours-per-week: continuous; 14. Native-country: 41 values; 15. >50K Income: Yes, No. •  Task is to predict whether a person makes more than 50K a year based on their attributes
  23. 1) Learn from the (training) data Naïve Bayes SVM Logistic

    Regression Decision Trees Train with your favorite ML Framework
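    The abstract mentions training through either Weka or Spark. A minimal training sketch with Weka on the census data is shown below; the ARFF file name and the choice of logistic regression are assumptions for illustration:

    import java.io.BufferedReader;
    import java.io.FileReader;
    import weka.classifiers.Classifier;
    import weka.classifiers.functions.Logistic;
    import weka.core.Instances;

    public class TrainCensusModel {
        public static void main(String[] args) throws Exception {
            // Load the census training data (ARFF with the >50K income label as the last attribute)
            Instances data = new Instances(new BufferedReader(new FileReader("census-train.arff")));
            data.setClassIndex(data.numAttributes() - 1);

            // Train one of the classifiers from this slide; Logistic is used here as an example
            Classifier model = new Logistic();
            model.buildClassifier(data);

            // Class distribution for the first instance (order follows the ARFF class attribute)
            double[] dist = model.distributionForInstance(data.instance(0));
            System.out.println("Predicted class distribution: " + dist[0] + ", " + dist[1]);
        }
    }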
  24. Option A: Just a Solr Function Query

    q="sum(C, product(age,w1), product(Workclass,w2), product(Fnlwgt,w3), product(Education,w4), …)". The Serialized ML Model is just the query produced by the Trainer + Indexer: Y_prediction = C + XB
  25. May result in a crazy Solr functionQuery See more at

    https://wiki.apache.org/solr/FunctionQuery q=dismax&bf="ord(education-num)^0.5 recip(rord(age),1,1000,1000)^0.3"
  26. Option B: Custom Solr FunctionQuery

    1.  Subclass org.apache.solr.search.ValueSourceParser.
    public class MyValueSourceParser extends ValueSourceParser {
      public void init(NamedList namedList) {
        …
      }
      public ValueSource parse(FunctionQParser fqp) throws ParseException {
        return new MyValueSource();
      }
    }
    2.  In solrconfig.xml, register the new ValueSourceParser directly under the <config> tag:
    <valueSourceParser name="myfunc" class="com.custom.MyValueSourceParser" />
    3.  Subclass ValueSource and instantiate it in ValueSourceParser.parse() (see the sketch below)
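    A minimal sketch of MyValueSource, assuming the Lucene 5.x function-query API (org.apache.lucene.queries.function) and a hypothetical scoreDoc helper where the trained model would be applied:

    import java.io.IOException;
    import java.util.Map;
    import org.apache.lucene.index.LeafReaderContext;
    import org.apache.lucene.queries.function.FunctionValues;
    import org.apache.lucene.queries.function.ValueSource;
    import org.apache.lucene.queries.function.docvalues.FloatDocValues;

    public class MyValueSource extends ValueSource {
        @Override
        public FunctionValues getValues(Map context, LeafReaderContext readerContext) throws IOException {
            return new FloatDocValues(this) {
                @Override
                public float floatVal(int doc) {
                    // Hypothetical: read this doc's feature values and apply the trained model
                    return scoreDoc(readerContext, doc);
                }
            };
        }

        private float scoreDoc(LeafReaderContext ctx, int doc) {
            // Placeholder for the model evaluation (e.g. the linear model from Option A)
            return 0.0f;
        }

        @Override
        public boolean equals(Object o) { return o instanceof MyValueSource; }

        @Override
        public int hashCode() { return getClass().hashCode(); }

        @Override
        public String description() { return "myfunc"; }
    }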
  27. Option C: Lucene CustomScoreQuery 2C) Serialize model with PMML • 

    Can use JPMML library to read serialized model in Lucene •  On Lucene will need to implement an extension with JPMML-evaluator to take vectors as expected 3C) In Lucene: •  Override CustomScoreQuery: load PMML •  Create CustomScoreProvider: do model PMML data marshaling •  Rescoring: PMML evaluation
  28. Predictive Model Markup Language •  Why use PMML •  Allows

    users to build a model in one system •  Export model and deploy it in a different environment for prediction •  Fast iteration: from research to deployment to production •  Model is an XML document with: •  Header: description of model, and where it was generated •  DataDictionary: defines fields used by model •  Model: structure and parameters of model •  http://dmg.org/pmml/v4-2-1/GeneralStructure.html
  29. Example: Train in Spark to PMML

    import org.apache.spark.mllib.clustering.KMeans
    import org.apache.spark.mllib.linalg.Vectors

    // Load and parse the data
    val data = sc.textFile("/path/to/file")
      .map(s => Vectors.dense(s.split(',').map(_.toDouble)))

    // Cluster the data into three classes using KMeans
    val numIterations = 20
    val numClusters = 3
    val kmeansModel = KMeans.train(data, numClusters, numIterations)

    // Export clustering model to PMML
    kmeansModel.toPMML("/path/to/kmeans.xml")
  30. Overriding scores with CustomScoreQuery

    [Diagram] A CustomScoreQuery wraps a Lucene Query: the wrapped query finds the next match and its score, then a CustomScoreProvider rescores the doc to produce the new score. *Credit to Doug Turnbull's Hacking Lucene for Custom Search Results
  31. Overriding scores with CustomScoreQuery

    •  Matching remains •  Scoring overridden. [Diagram] Same flow as the previous slide: the Lucene Query still finds the matches, while the CustomScoreProvider replaces each doc's score with a new score. *Credit to Doug Turnbull's Hacking Lucene for Custom Search Results
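    A minimal sketch of how the two classes in the diagram fit together, assuming the Lucene 5.x org.apache.lucene.queries API; MyCustomScoreQuery matches the name used on the next slide, and the provider body is a placeholder for the PMML evaluation that follows:

    import java.io.IOException;
    import org.apache.lucene.index.LeafReaderContext;
    import org.apache.lucene.queries.CustomScoreProvider;
    import org.apache.lucene.queries.CustomScoreQuery;
    import org.apache.lucene.search.Query;

    public class MyCustomScoreQuery extends CustomScoreQuery {
        public MyCustomScoreQuery(Query subQuery) {
            super(subQuery);   // matching is still done by the wrapped query
        }

        @Override
        protected CustomScoreProvider getCustomScoreProvider(LeafReaderContext context) throws IOException {
            return new CustomScoreProvider(context) {
                @Override
                public float customScore(int doc, float subQueryScore, float valSrcScores[]) throws IOException {
                    // Scoring is overridden here: the PMML evaluation from slides 33-36 goes in this method
                    return subQueryScore;   // placeholder
                }
            };
        }
    }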
  32. Implementing CustomScoreQuery

    1.  Given a normal Lucene Query, use a CustomScoreQuery to wrap it:
    TermQuery q = new TermQuery(term);
    MyCustomScoreQuery mcsq = new MyCustomScoreQuery(q);
    // Make sure the query has all fields needed by PMML!
  33. Implementing CustomScoreQuery

    2.  Initialize PMML:
    PMML pmml = ...;
    ModelEvaluatorFactory modelEvaluatorFactory = ModelEvaluatorFactory.newInstance();
    ModelEvaluator<?> modelEvaluator = modelEvaluatorFactory.newModelManager(pmml);
    Evaluator evaluator = (Evaluator) modelEvaluator;
  34. Implementing CustomScoreQuery

    3.  Rescore each doc with the IndexReader and docID:
    public float customScore(int doc, float subQueryScore, float valSrcScores[]) throws IOException {
      // Lucene reader
      IndexReader r = context.reader();
      Terms tv = r.getTermVector(doc, _field);
      TermsEnum tenum = null;
      tenum = tv.iterator(tenum);
      // convert the iterator order to the fields needed by the model
      TermsEnum tenumPMML = tenum2PMML(tenum, evaluator.getActiveFields());
  35. Implementing CustomScoreQuery

    3.  Rescore each doc with the IndexReader and docID (continued):
      // Marshal data into PMML
      Map<FieldName, FieldValue> arguments = new LinkedHashMap<FieldName, FieldValue>();
      List<FieldName> activeFields = evaluator.getActiveFields();
      for (FieldName activeField : activeFields) {
        // The raw values arrive ordered to match the fields needed by the model
        Object rawValue = tenumPMML.next();
        FieldValue activeValue = evaluator.prepare(activeField, rawValue);
        arguments.put(activeField, activeValue);
      }
  36. Implementing CustomScoreQuery

    3.  Rescore each doc with the IndexReader and docID (continued):
      // Rescore and evaluate with PMML
      Map<FieldName, ?> results = evaluator.evaluate(arguments);
      FieldName targetName = evaluator.getTargetField();
      Object targetValue = results.get(targetName);
      return (float) targetValue;
    }
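    Putting the pieces together, a hedged usage sketch that reuses mcsq from slide 32; the index Directory and the method name are assumptions:

    import java.io.IOException;
    import org.apache.lucene.index.DirectoryReader;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.TopDocs;
    import org.apache.lucene.store.Directory;

    public class MlScoringSearchExample {
        // Runs the wrapped query: matching comes from the inner TermQuery,
        // while each returned score is produced by MyCustomScoreQuery's PMML evaluation.
        public static TopDocs topModelScoredDocs(Directory indexDir, MyCustomScoreQuery mcsq) throws IOException {
            IndexSearcher searcher = new IndexSearcher(DirectoryReader.open(indexDir));
            return searcher.search(mcsq, 10);   // top 10 docs ranked by the model score
        }
    }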
  37. Potential issues •  Performance •  If search space is very

    large •  If model complexity explodes (e.g. kernel expansion) •  Operations •  Code is running on key infrastructure •  Versioning •  Binary Compatibility
  38. Option D: Low Level Lucene •  CustomScoreQuery or Custom FunctionScore

    can’t control matches •  If you want custom matches and scoring…. •  Implement: •  Custom Query Class •  Custom Weight Class •  Custom Scorer Class •  http://opensourceconnections.com/blog/2014/03/12/using- customscorequery-for-custom-solrlucene-scoring/
  39. Conclusion •  Importance of the full picture – Learning systems

    through the lens of the whole elephant •  Reducing the time from science to production is complicated •  Scalability is hard! •  Why not have ML use Search in its core during online eval? •  Solr and Lucene are a start to customize your learning system
  40. OCTOBER 13-16, 2016 • AUSTIN, TX