Machine Learning Infrastructure at Stripe


Rob Story

July 24, 2017


  1. Machine Learning Infrastructure at Stripe: Bridging from Python -> JVM (and some other stuff)
  2. $ whoami
     Rob Story, Software Engineer, Machine Learning Infra (@oceankidbilly)

  4. Why?

  5. Merchant Fraud

  7. Transaction Fraud

  8. Support Tools

  9. Serialization

  11. YEP, RIGHT NOW
      Let’s ship a model to production
  12. make_serializable(pipeline, owner="pydata-2017")
      estimator.to_contentment_hash()

  13. return SerializableWrapper(
          registry.make_serializable(obj),
          owner=owner,
          tags=tags,
          metadata=metadata)

  14. @if_delegate_has_method(delegate='estimator')
      def fit_transform(self, X, y=None, **fit_params):
          # The receiver of .fit(...) is cut off in the slide text; self.fit is assumed.
          self.fit(X, y, **fit_params)
          return self.transform(X)

      @if_delegate_has_method(delegate='estimator')
      def transform(self, X):
          return self.estimator.transform(X)
  15. def _fit_serializable(serializable, X, y=None, **fit_params):
          if not isinstance(X, pd.DataFrame):
              raise ValueError(
                  'serializable {} requires a pandas.DataFrame'
                  .format(type(serializable.get_estimator())))
          init_feature_names = list(X.columns.values)
          serializable.fit_with_feature_names(
              None, init_feature_names, X, y, **fit_params)
          # Allow feature selection to propagate backwards.
          serializable.set_output_features(None)
          return serializable
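  16. def fit_with_feature_names(self, name, feature_names, X, y):
          self._feature_names = feature_names
          # The receiver of the fit call is cut off in the slide text; self.estimator.fit is assumed.
          self.estimator.fit(X, y)
          return self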
  17. Ok so what if I just want to ship a new model type?
  18. class FillMissing(SerializableEstimator, TransformerMixin):
          def __init__(self, columns='all', missing_value=-1):
              self.columns = columns
              self.missing_value = missing_value

          def serialize(self, name):
              bytes = json_to_bytes({
                  "features": list(self.columns_),
                  "value": self.missing_value
              })
              return ApplyFeatureEncoder('fill_missing', name, bytes, 'json')
  19. class RandomForestSerializer(ModelSerializer):
          """Serializer for RandomForest models."""

          def is_serializer_for(self, obj):
              return isinstance(obj, RandomForestRegressor)

          def serialize_model(self, name, model, feature_names):
              decision_trees = []
              for decision_tree in model.estimators_:
                  decision_trees.append(
                      _tree_to_dict(decision_tree, feature_names))
              bonsai_bytes = get_bonsai_bytes(decision_trees)
              return Model("simple-bonsai-regression-forest",
                           name, bonsai_bytes, "bonsai")
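
The `is_serializer_for` method on the slide suggests a registry that asks each serializer whether it can handle a given object. The following is a minimal sketch of that dispatch pattern, not Stripe's implementation: `Registry`, `ToyModel`, and `ToyModelSerializer` are hypothetical names, and a plain dict stands in for the `Model(...)` record.

```python
class ModelSerializer:
    """Base class: each serializer declares which objects it handles."""

    def is_serializer_for(self, obj):
        raise NotImplementedError

    def serialize_model(self, name, model, feature_names):
        raise NotImplementedError


class Registry:
    """Hypothetical registry that returns the first matching serializer."""

    def __init__(self):
        self._serializers = []

    def register(self, serializer):
        self._serializers.append(serializer)

    def find(self, obj):
        for serializer in self._serializers:
            if serializer.is_serializer_for(obj):
                return serializer
        raise TypeError(
            "no serializer registered for {}".format(type(obj).__name__))


class ToyModel:
    """Stand-in for a fitted estimator (not a scikit-learn class)."""

    def __init__(self, weights):
        self.weights = weights


class ToyModelSerializer(ModelSerializer):
    def is_serializer_for(self, obj):
        return isinstance(obj, ToyModel)

    def serialize_model(self, name, model, feature_names):
        # Pair each weight with its feature name; the dict is a placeholder
        # for whatever record type the real system returns.
        return {
            "model_type": "toy-linear",
            "name": name,
            "weights": dict(zip(feature_names, model.weights)),
        }


registry = Registry()
registry.register(ToyModelSerializer())
```

Under this design, shipping a new model type means writing one serializer class and registering it; the calling code does not change.
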
  20. Aside: Ok what is this bonsai thing?

  21. Scala library for transforming arbitrary tree structures into read-only versions that take up a fraction of the space. Open Source!
  22. def _tree_to_dict(decision_tree, feature_names, fraudulent_class_idx=1):
          # This is where the internal tree structure lives in an sk DecisionTree
          tree = decision_tree.tree_
          if isinstance(decision_tree, t.DecisionTreeClassifier):
              # NOTE: This ONLY WORKS with binary classification, where the
              # second class is the fraudulent class.
              probs = np.nan_to_num(
                  tree.value[:, 0, fraudulent_class_idx] /
                  (tree.value[:, 0, 0] + tree.value[:, 0, 1]))
          elif isinstance(decision_tree, t.DecisionTreeRegressor):
              probs = [v[0][0] for v in tree.value]
          else:
              raise ValueError("You can only serialize scikit decision trees!")
          return {
              "feature_names": feature_names,
              "features_used": _features_used(tree, feature_names),
              "node_features": map(int, tree.feature),
              "node_thresholds": map(float, tree.threshold),
              "left_children": map(int, tree.children_left),
              "right_children": map(int, tree.children_right),
              "probabilities": [float(p) for p in probs],
              # Deprecated, moving these to Pipeline
              "encodings": {}
          }

      Brittle to version changes!
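
To show how the parallel arrays in that dict describe a tree, here is a small sketch that scores one row against a hand-built dict with the same keys. It relies on two scikit-learn conventions: a child index of -1 marks a leaf, and a row goes left when its feature value is less than or equal to the threshold. The function name `predict_from_tree_dict` and the toy tree are illustrative, and the slides do not say whether the Scala scorer walks the tree this way.

```python
def predict_from_tree_dict(tree, row):
    """Walk one serialized tree for a single row (a dict of feature -> value)."""
    feature_names = tree["feature_names"]
    node = 0  # scikit-learn stores the root at index 0
    # A left child of -1 means the current node is a leaf.
    while tree["left_children"][node] != -1:
        name = feature_names[tree["node_features"][node]]
        if row[name] <= tree["node_thresholds"][node]:
            node = tree["left_children"][node]
        else:
            node = tree["right_children"][node]
    return tree["probabilities"][node]


# Toy tree: one split on "amount" at 100.0, with two leaves.
# Leaf entries use scikit-learn's placeholder values (-2 feature, -2.0 threshold).
toy_tree = {
    "feature_names": ["amount", "card_age_days"],
    "node_features": [0, -2, -2],
    "node_thresholds": [100.0, -2.0, -2.0],
    "left_children": [1, -1, -1],
    "right_children": [2, -1, -1],
    "probabilities": [0.3, 0.1, 0.8],
}
```

For example, `predict_from_tree_dict(toy_tree, {"amount": 250.0, "card_age_days": 3})` follows the right branch and returns the second leaf's probability.
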
  23. Now our models and encoders know how to serialize themselves. Let’s put it all together!
  24. estimator.to_contentment_hash()

  25. In [2]: model_package = estimator.model_package
      In [3]: model_package.encoder
      Out[3]: <…>
      In [4]: model_package.model
      Out[4]: <…>
      In [5]: model_package.encoder.encoder_type
      Out[5]: 'stripe-categorical-encoding'
      In [6]: model_package.model.model_type
      Out[6]: 'simple-bonsai-regression-forest'
  26. {'MODEL': {
          'encoders': {
              'apply': {
                  'encoderType': 'stripe-categorical-encoding',
                  'path': 'label-encoder.json'}},
          'model': {
              'modelType': 'simple-bonsai-regression-forest',
              'path': 'random-forest-regressor.bonsai'},
          'owner': 'pydata-2017'},
  27. 'label-encoder.json': {
          'encodings': {
              'bird': {'chicken': 0, 'finch': 1, 'raven': 2},
              'food': {'cheese': 0, 'hamburger': 1, 'tomato': 2},
              'planet': {'earth': 0, 'mars': 1, 'pluto': 2}},
          'features': ['bird', 'food', 'planet']}
  28. x08\t\n\x0b\x0c\r\x0e\x0f\x10\x11\x12\x13\x14\x15\x16\x17\x18\x19\x1a\x01\x 00\x00\x00\x1b@\x7f2=p\xa3\xd7\n@\x84\x13>E0n\xb4@z\xfd\xb6\xdbm\xb6\xdb@| \xeepc\xe7\x06>@\x7f\x02\x16B\xc8Y\x0b@\x82\x1b\xd7\n=p\xa4@\x82.y\xe7\x9ey \xe8@~\xbbm\xb6\xdbm\xb7@\x7f\xf8\xaf\x8a\xf8\xaf\x8b@~8q\xc7\x1cq\xc7@~\xd 7q\x1d\xc4w\x12@|\xb2I$\x92I%@\x80\x1cI$ \x92I%@\x80J\xe3\x8e8\xe3\x8e@~;^P\xd7\x946@\x80\xb4\xb4\xb4\xb4\xb4\xb5@\x 81\xa7\x89\xd8\x9d\x89\xd9@\x81\x9a\xaa\xaa\xaa\xaa\xab@| \xc9UUUUU@~\xd7\x0f\x0f\x0f\x0f\x0f@\x80l\xa1\xaf(k\xca@| <\xcc\xcc\xcc\xcc\xcd@}\xa0c\xe7\x06>p@z(\x00\x00\x00\x00\x00@zGE\xd1t] \x17@y\xae\xb3\xe4S\x06\xeb@q\xf7\x945\xe5\ry\x00\x00\x00\x1b\x00\x01\x02\x

    03\x04\x05\x06\x07\x08\t\n\x0b\x0c\r\x0e\x0f\x10\x11\x12\x13\x14\x15\x16\x1 7\x18\x19\x1a\x00\x00\x00\x1b\x01\x00\x00\x005\x00\x00\x00\x01\x00\x00\x00\ x02\x00\x00\x00\x00\x00\x00$ \x00y\xc0\x88\x00\x00\x1f\xff\xe3\x00\x00\x00k\x01\x00\x00\x00k\x00\x00\x00 \x01\x00\x00\x00\x03\x00\x00\x00\x00\x02\xe0t\x00\x00\x00\x005~\x7f\xff\xff \x80x\x1f\xfe\x00\x00\x07\xe1\x00\x00\x00\x00
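
The `label-encoder.json` payload in the package above maps each categorical value to an integer. A minimal sketch of applying it to one input row follows. The function `apply_encodings` and the `unknown=-1` fallback for unseen values are assumptions for illustration; the slides do not show how unseen categories are handled.

```python
label_encoder = {
    "encodings": {
        "bird": {"chicken": 0, "finch": 1, "raven": 2},
        "food": {"cheese": 0, "hamburger": 1, "tomato": 2},
        "planet": {"earth": 0, "mars": 1, "pluto": 2},
    },
    "features": ["bird", "food", "planet"],
}


def apply_encodings(encoder, row, unknown=-1):
    """Return a copy of row with the listed features replaced by integer codes."""
    encoded = dict(row)  # leave the caller's row untouched
    for feature in encoder["features"]:
        if feature in row:
            codes = encoder["encodings"][feature]
            # Unseen categories fall back to `unknown` (an assumption).
            encoded[feature] = codes.get(row[feature], unknown)
    return encoded
```

Features that are not listed in `features`, such as numeric columns, pass through unchanged.
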
  29. Contentment: S3 as Content-Addressed Store
      In [1]: estimator.to_contentment_hash()
      Out[1]: 'sha256.Q5NJ5DVQC…'
      We can load the model on the fly if we know its hash!
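
The idea of a content-addressed store can be sketched in a few lines: the key is derived from a hash of the bytes, so identical content always gets the same key and a fetched blob can be verified against it. In this sketch `ContentStore` is a hypothetical class, an in-memory dict stands in for S3, and the Base32 encoding of the digest is a guess based on the shape of the hash shown on the slide.

```python
import base64
import hashlib


class ContentStore:
    """In-memory stand-in for a content-addressed bucket."""

    def __init__(self):
        self._blobs = {}

    @staticmethod
    def key_for(data):
        digest = hashlib.sha256(data).digest()
        # Base32 without padding is an assumption about the key format.
        encoded = base64.b32encode(digest).decode("ascii").rstrip("=")
        return "sha256." + encoded

    def put(self, data):
        key = self.key_for(data)
        self._blobs[key] = data
        return key

    def get(self, key):
        data = self._blobs[key]
        # Re-hash on read so corrupted content is detected.
        if self.key_for(data) != key:
            raise ValueError("stored bytes do not match key " + key)
        return data
```

Because the key depends only on the content, uploading the same serialized model twice is idempotent, and a deployed hash always refers to exactly one set of bytes.
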
  30. Candidate Models & Promotion

  31. My model lives in S3. How do I actually promote it to production?
  32. Model Deployer!
      /model/$MODEL_ID/predict
      /tag/$TAG_ID/predict
      TAG_ID -> MODEL_ID

  33. ml-tool model-deploy \
          -t txn_fraud.production \
          -m sha256.Q5NJ5DVQC…
      ml-tool model-deploy \
          -t txn_fraud.production -b
  34. Deploy History For Tag: txn_fraud.production
      Deployed At    Model ID
      2017-01-01     sha.12345…
      2017-02-01     sha.67891…
      2017-03-01     sha.abcde…
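
The tag indirection and deploy history above can be modeled as an append-only list per tag, where the newest entry decides which model a request to the tag endpoint reaches. This is a sketch under that assumption; `ModelDeployer` and its methods are hypothetical names, not the `ml-tool` implementation.

```python
class ModelDeployer:
    """Maps each tag to an append-only history of deployed model IDs."""

    def __init__(self):
        self._history = {}  # tag -> list of (deployed_at, model_id), oldest first

    def deploy(self, tag, model_id, deployed_at):
        self._history.setdefault(tag, []).append((deployed_at, model_id))

    def resolve(self, tag):
        """Model ID that a request to the tag would be routed to."""
        return self._history[tag][-1][1]

    def history(self, tag):
        # Return a copy so callers cannot rewrite the record.
        return list(self._history[tag])


deployer = ModelDeployer()
deployer.deploy("txn_fraud.production", "sha.12345…", "2017-01-01")
deployer.deploy("txn_fraud.production", "sha.67891…", "2017-02-01")
deployer.deploy("txn_fraud.production", "sha.abcde…", "2017-03-01")
```

Callers only ever name the tag, so promoting a new model changes what `resolve` returns without any client-side change.
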
  35. Hot-swapping models on the fly sounds scary. Can I give it a trial run?
  36. Shadow Models

  37. Model Hierarchy: Fan Out
      MODEL: SHA.12345…
      MODEL: SHA.ABC…
      MODEL: SHA.XYZ…
      MODEL: SHA.987…
  38. ml-tool model-deploy-add \
          -t merchant_fraud.shadow \
          -m $YOUR_MODEL_SHA
      ml-tool model-deploy-remove \
          -t merchant_fraud.shadow \
          -m $YOUR_MODEL_SHA
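
One way to picture the shadow fan-out: every request is scored by the primary model and by each shadow model, but only the primary result is returned, and shadow results are recorded for later comparison. The sketch below assumes exactly that behavior; `ShadowedTag` and its methods are hypothetical, and the slides do not describe how shadow results are stored.

```python
class ShadowedTag:
    """Scores with a primary model and records shadow model outputs."""

    def __init__(self, primary_id, primary_fn):
        self.primary_id = primary_id
        self.primary_fn = primary_fn
        self.shadows = {}      # model_id -> predict function
        self.shadow_log = []   # (model_id, prediction or error text)

    def add_shadow(self, model_id, predict_fn):
        self.shadows[model_id] = predict_fn

    def remove_shadow(self, model_id):
        self.shadows.pop(model_id, None)

    def predict(self, features):
        result = self.primary_fn(features)
        for model_id, predict_fn in self.shadows.items():
            try:
                self.shadow_log.append((model_id, predict_fn(features)))
            except Exception as exc:
                # A failing shadow must never affect the primary response.
                self.shadow_log.append((model_id, repr(exc)))
        return result
```

Adding and removing a shadow mirrors the `model-deploy-add` and `model-deploy-remove` commands: a candidate can be trialled on live traffic and withdrawn without touching the production response.
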
  39. You haven’t talked about the JVM bits yet!

  40. We implement everything on the Scala side, including the encoders.
      case class StandardCategoryEncoder(
        features: Set[String],
        encodings: Map[String, Map[String, Double]]
      ) extends FeatureEncoder {
        private[this] val (featureTypes, featureParsers) =
          StandardCategoryEncoder.makeParsers(features, encodings)

        def encode(features: Map[String, FeatureValue]): Try[Map[String, FeatureValue]] =
          Try {
            // The receiver of this block is cut off in the slide text; features.map is assumed.
            features.map {
              case (key, value) =>
                featureParsers.get(key) match {
                  case Some(parse) => key -> parse(value).get
                  case None => key -> value
                }
            }
          }
      }
  41. Some components are already open source! MOSTLY TREE THINGS
  42. Spark: PMML
      Apple CoreML: Protobuf
      Stripe: JSON (mostly)
      OPEN SOURCE?
  43. QUESTIONS?