Machine Learning Infrastructure at Stripe

Machine Learning Infrastructure at Stripe
Bridging from Python -> JVM (and some other stuff)

$ whoami
Rob Story
Software Engineer, Machine Learning Infra
@oceankidbilly

Merchant Fraud

Transaction Fraud

Support Tools

Serialization

YEP, RIGHT NOW
Let’s ship a model to production

make_serializable(pipeline, owner="pydata-2017")
estimator.to_contentment_hash()

return SerializableWrapper(
    registry.make_serializable(obj),
    owner=owner,
    tags=tags,
    metadata=metadata)

@if_delegate_has_method(delegate='estimator')
def fit_transform(self, X, y=None, **fit_params):
    self.fit(X, y, **fit_params)
    return self.transform(X)

@if_delegate_has_method(delegate='estimator')
def transform(self, X):
    return self.estimator.transform(X)

def _fit_serializable(serializable, X, y=None, **fit_params):
    if not isinstance(X, pd.DataFrame):
        raise ValueError(
            'serializable {} requires a pandas.DataFrame'
            .format(type(serializable.get_estimator())))
    init_feature_names = list(X.columns.values)
    serializable.fit_with_feature_names(
        None, init_feature_names, X, y, **fit_params)
    # Allow feature selection to propagate backwards.
    serializable.set_output_features(None)
    return serializable

def fit_with_feature_names(self, name, feature_names, X, y):
    self._feature_names = feature_names
    self.estimator.fit(X, y)
    return self

Ok so what if I just want to ship a new model type?

class FillMissing(SerializableEstimator, TransformerMixin):
    def __init__(self, columns='all', missing_value=-1):
        self.columns = columns
        self.missing_value = missing_value

    def serialize(self, name):
        bytes = json_to_bytes({
            "features": list(self.columns_),
            "value": self.missing_value
        })
        return ApplyFeatureEncoder('fill_missing', name, bytes, 'json')

class RandomForestSerializer(ModelSerializer):
    """Serializer for RandomForest models."""

    def is_serializer_for(self, obj):
        return isinstance(obj, RandomForestRegressor)

    def serialize_model(self, name, model, feature_names):
        decision_trees = []
        for decision_tree in model.estimators_:
            decision_trees.append(
                _tree_to_dict(decision_tree, feature_names))
        bonsai_bytes = get_bonsai_bytes(decision_trees)
        return Model("simple-bonsai-regression-forest", name,
                     bonsai_bytes, "bonsai")
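The is_serializer_for method suggests a registry that asks each serializer in turn whether it handles a given object. The slides do not show that registry, so the following is only a minimal sketch of the dispatch idea. Serializer, ListSerializer, DictSerializer and Registry are hypothetical names invented for illustration, not Stripe's classes.

```python
class Serializer:
    """Hypothetical base class: one serializer per supported object type."""

    def is_serializer_for(self, obj):
        raise NotImplementedError

    def serialize(self, name, obj):
        raise NotImplementedError


class ListSerializer(Serializer):
    def is_serializer_for(self, obj):
        return isinstance(obj, list)

    def serialize(self, name, obj):
        return {"type": "list", "name": name, "size": len(obj)}


class DictSerializer(Serializer):
    def is_serializer_for(self, obj):
        return isinstance(obj, dict)

    def serialize(self, name, obj):
        return {"type": "dict", "name": name, "keys": sorted(obj)}


class Registry:
    """Hypothetical registry: the first serializer that claims the object wins."""

    def __init__(self, serializers):
        self._serializers = list(serializers)

    def find(self, obj):
        for serializer in self._serializers:
            if serializer.is_serializer_for(obj):
                return serializer
        raise ValueError(
            "no serializer registered for {}".format(type(obj).__name__))


registry = Registry([ListSerializer(), DictSerializer()])
```

Under this assumption, shipping a new model type means writing one more serializer class and adding it to the registry; objects that no serializer claims fail loudly with a ValueError.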

Aside: Ok what is this bonsai thing?

Scala library for transforming arbitrary tree structures into read-only versions that take up a fraction of the space
Open Source!

def _tree_to_dict(decision_tree, feature_names, fraudulent_class_idx=1):
    # This is where the internal tree structure lives in an sk DecisionTree
    tree = decision_tree.tree_
    if isinstance(decision_tree, t.DecisionTreeClassifier):
        # NOTE: This ONLY WORKS with binary classification, where the
        # second class is the fraudulent class.
        probs = np.nan_to_num(tree.value[:, 0, fraudulent_class_idx] /
                              (tree.value[:, 0, 0] + tree.value[:, 0, 1]))
    elif isinstance(decision_tree, t.DecisionTreeRegressor):
        probs = [v[0][0] for v in tree.value]
    else:
        raise ValueError("You can only serialize scikit decision trees!")
    return {
        "feature_names": feature_names,
        "features_used": _features_used(tree, feature_names),
        "node_features": map(int, tree.feature),
        "node_thresholds": map(float, tree.threshold),
        "left_children": map(int, tree.children_left),
        "right_children": map(int, tree.children_right),
        "probabilities": [float(p) for p in probs],
        # Deprecated, moving these to Pipeline
        "encodings": {}
    }

Brittle to version changes!
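To see why these parallel arrays are enough to score a row on another runtime, here is a minimal sketch of walking one flattened tree in plain Python. It assumes scikit-learn's conventions (node 0 is the root, a child index of -1 marks a leaf, and a row goes left when its value is <= the threshold) and that feature_names is ordered like the training columns. predict_tree, predict_forest and toy_tree are hypothetical names; this is not the bonsai implementation.

```python
LEAF = -1  # scikit-learn uses -1 for "no child"


def predict_tree(tree, row):
    """Walk one flattened tree for a row given as {feature name: value}."""
    node = 0  # the root is always node 0
    while tree["left_children"][node] != LEAF:
        column = tree["node_features"][node]
        name = tree["feature_names"][column]
        if row[name] <= tree["node_thresholds"][node]:
            node = tree["left_children"][node]
        else:
            node = tree["right_children"][node]
    return tree["probabilities"][node]


def predict_forest(trees, row):
    """Average the per-tree outputs, as a regression forest does."""
    return sum(predict_tree(tree, row) for tree in trees) / len(trees)


# Hand-built stump: split on "amount" at 100.0 (values are made up).
toy_tree = {
    "feature_names": ["amount"],
    "node_features": [0, -2, -2],
    "node_thresholds": [100.0, -2.0, -2.0],
    "left_children": [1, -1, -1],
    "right_children": [2, -1, -1],
    "probabilities": [0.3, 0.1, 0.9],
}
```

With this stump, predict_tree(toy_tree, {"amount": 50.0}) ends in the left leaf and returns 0.1, while an amount of 250.0 ends in the right leaf and returns 0.9.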

Now our models and encoders know how to serialize themselves.
Let’s put it all together!

estimator.to_contentment_hash()

In [2]: model_package = estimator.model_package

In [3]: model_package.encoder
Out[3]: <scripts.ml.lib.diorama.serialize.model_package.ApplyFeatureEncoder…>

In [4]: model_package.model
Out[4]: <scripts.ml.lib.diorama.serialize.model_package.Model…>

In [5]: model_package.encoder.encoder_type
Out[5]: 'stripe-categorical-encoding'

In [6]: model_package.model.model_type
Out[6]: 'simple-bonsai-regression-forest'

{'MODEL': {'encoders': {'apply': {'encoderType': 'stripe-categorical-encoding',
                                  'path': 'label-encoder.json'}},
           'model': {'modelType': 'simple-bonsai-regression-forest',
                     'path': 'random-forest-regressor.bonsai'},
           'owner': 'pydata-2017'},

 'label-encoder.json': {'encodings': {'bird': {'chicken': 0, 'finch': 1, 'raven': 2},
                                      'food': {'cheese': 0, 'hamburger': 1, 'tomato': 2},
                                      'planet': {'earth': 0, 'mars': 1, 'pluto': 2}},
                        'features': ['bird', 'food', 'planet']}
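The label-encoder.json payload is plain data, so any runtime can apply it. As a rough sketch of what applying it might look like, the hypothetical encode function below maps each listed feature through its encoding table and passes other features through untouched. The fallback of -1 for unseen categories is an assumption; the slides do not say how unknown values are handled.

```python
label_encoder = {
    "encodings": {
        "bird": {"chicken": 0, "finch": 1, "raven": 2},
        "food": {"cheese": 0, "hamburger": 1, "tomato": 2},
        "planet": {"earth": 0, "mars": 1, "pluto": 2},
    },
    "features": ["bird", "food", "planet"],
}


def encode(spec, row, unknown=-1):
    """Return a copy of row with categorical features replaced by their codes.

    `unknown` is an assumed fallback for categories missing from the table.
    """
    encoded = dict(row)
    for feature in spec["features"]:
        if feature in encoded:
            table = spec["encodings"][feature]
            encoded[feature] = table.get(encoded[feature], unknown)
    return encoded
```

For example, encode(label_encoder, {"bird": "raven", "food": "cheese", "amount": 12.5}) gives {"bird": 2, "food": 0, "amount": 12.5}.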

x08\t\n\x0b\x0c\r\x0e\x0f\x10\x11\x12\x13\x14\x15\x16\x17\x18\x19\x1a\x01\x 00\x00\x00\x1b@\x7f2=p\xa3\xd7\n@\x84\x13>E0n\xb4@z\xfd\xb6\xdbm\xb6\xdb@| \xeepc\xe7\x06>@\x7f\x02\x16B\xc8Y\x0b@\x82\x1b\xd7\n=p\xa4@\x82.y\xe7\x9ey \xe8@~\xbbm\xb6\xdbm\xb7@\x7f\xf8\xaf\x8a\xf8\xaf\x8b@~8q\xc7\x1cq\xc7@~\xd 7q\x1d\xc4w\x12@|\xb2I$\x92I%@\x80\x1cI$ \x92I%@\x80J\xe3\x8e8\xe3\x8e@~;^P\xd7\x946@\x80\xb4\xb4\xb4\xb4\xb4\xb5@\x 81\xa7\x89\xd8\x9d\x89\xd9@\x81\x9a\xaa\xaa\xaa\xaa\xab@| \xc9UUUUU@~\xd7\x0f\x0f\x0f\x0f\x0f@\x80l\xa1\xaf(k\xca@| <\xcc\xcc\xcc\xcc\xcd@}\xa0c\xe7\x06>p@z(\x00\x00\x00\x00\x00@zGE\xd1t] \x17@y\xae\xb3\xe4S\x06\xeb@q\xf7\x945\xe5\ry\x00\x00\x00\x1b\x00\x01\x02\x
03\x04\x05\x06\x07\x08\t\n\x0b\x0c\r\x0e\x0f\x10\x11\x12\x13\x14\x15\x16\x1 7\x18\x19\x1a\x00\x00\x00\x1b\x01\x00\x00\x005\x00\x00\x00\x01\x00\x00\x00\ x02\x00\x00\x00\x00\x00\x00$ \x00y\xc0\x88\x00\x00\x1f\xff\xe3\x00\x00\x00k\x01\x00\x00\x00k\x00\x00\x00 \x01\x00\x00\x00\x03\x00\x00\x00\x00\x02\xe0t\x00\x00\x00\x005~\x7f\xff\xff \x80x\x1f\xfe\x00\x00\x07\xe1\x00\x00\x00\x00

Contentment: S3 as Content-Addressed Store

In [1]: estimator.to_contentment_hash()
Out[1]: ‘sha256.Q5NJ5DVQC…’

We can load the model on the fly if we know its hash!
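The core idea of a content-addressed store is that the key is derived from the bytes themselves. The sketch below uses an in-memory dict in place of S3 and a hex SHA-256 digest as the key. ContentStore is a hypothetical class, and the exact key encoding Contentment uses is not shown in the talk, so the "sha256." plus hex format here is an assumption.

```python
import hashlib


def content_key(data):
    """Derive the storage key from the bytes themselves."""
    return "sha256." + hashlib.sha256(data).hexdigest()


class ContentStore:
    """In-memory stand-in for a content-addressed bucket."""

    def __init__(self):
        self._blobs = {}

    def put(self, data):
        key = content_key(data)
        self._blobs[key] = data  # identical bytes always land on the same key
        return key

    def get(self, key):
        data = self._blobs[key]
        # The key doubles as a checksum, so corruption is detectable on read.
        if content_key(data) != key:
            raise ValueError("stored bytes do not match their key")
        return data


store = ContentStore()
model_id = store.put(b"serialized model bytes")
```

Because the key is a pure function of the content, uploading the same model twice is idempotent, and a model ID names one immutable artifact.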

Candidate Models & Promotion

My model lives in S3. How do I actually promote it to production?

Model Deployer!
/model/$MODEL_ID/predict
/tag/$TAG_ID/predict
TAG_ID MODEL_ID

ml-tool model-deploy \
  -t txn_fraud.production \
  -m sha256.Q5NJ5DVQC…

ml-tool model-deploy \
  -t txn_fraud.production -b

Deploy History For Tag: txn_fraud.production

Deployed At    Model ID
2017-01-01     sha.12345…
2017-02-01     sha.67891…
2017-03-01     sha.abcde…
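One way to picture the tag-to-model indirection is an append-only history per tag, where the newest entry is what the tag currently serves. This is only a sketch: TagRegistry is a hypothetical class, and the model IDs are made-up placeholders rather than real hashes.

```python
class TagRegistry:
    """Hypothetical tag -> model ID mapping with an append-only history."""

    def __init__(self):
        self._history = {}  # tag -> list of (deployed_at, model_id)

    def deploy(self, tag, model_id, deployed_at):
        self._history.setdefault(tag, []).append((deployed_at, model_id))

    def current(self, tag):
        """The model a request to the tag would be scored with."""
        return self._history[tag][-1][1]

    def history(self, tag):
        return list(self._history[tag])


tags = TagRegistry()
tags.deploy("txn_fraud.production", "model-a", "2017-01-01")
tags.deploy("txn_fraud.production", "model-b", "2017-02-01")
```

Callers only ever name the tag, so promoting a new model is a metadata change and no client needs to redeploy.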

Hot-swapping models on the fly sounds scary. Can I give it a trial run?

Shadow Models

Model Hierarchy: Fan Out
MODEL: SHA.12345…
MODEL: SHA.ABC…
MODEL: SHA.XYZ…
MODEL: SHA.987…

ml-tool model-deploy-add \
  -t merchant_fraud.shadow \
  -m $YOUR_MODEL_SHA

ml-tool model-deploy-remove \
  -t merchant_fraud.shadow -m $YOUR_MODEL_SHA
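A shadow setup can be sketched as a fan-out: the primary model's score is the only one returned, while every shadow model scores the same features and has its output recorded for later comparison. predict_with_shadows is a hypothetical function, and swallowing shadow errors is an assumed design choice rather than something the slides state.

```python
def predict_with_shadows(primary, shadows, features, log):
    """Score with the primary model and record shadow scores on the side.

    primary: callable taking features and returning a score
    shadows: dict of shadow name -> callable
    log:     list that receives (name, score_or_error) tuples
    """
    result = primary(features)
    for name, model in shadows.items():
        try:
            log.append((name, model(features)))
        except Exception as exc:
            # A broken shadow must never affect the production answer.
            log.append((name, repr(exc)))
    return result


def failing_model(features):
    raise RuntimeError("shadow model crashed")


shadow_log = []
score = predict_with_shadows(
    primary=lambda f: 0.2,
    shadows={"candidate": lambda f: 0.7, "broken": failing_model},
    features={"amount": 50.0},
    log=shadow_log,
)
```

Here score is 0.2 regardless of what the shadows do; shadow_log holds the candidate's 0.7 and the error from the broken model.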

You haven’t talked about the JVM bits yet!

We implement everything on the Scala side, including the encoders.

case class StandardCategoryEncoder(
  features: Set[String],
  encodings: Map[String, Map[String, Double]]
) extends FeatureEncoder {
  private[this] val (featureTypes, featureParsers) =
    StandardCategoryEncoder.makeParsers(features, encodings)

  def encode(features: Map[String, FeatureValue]): Try[Map[String, FeatureValue]] =
    Try {
      features.map { case (key, value) =>
        featureParsers.get(key) match {
          case Some(parse) => key -> parse(value).get
          case None => key -> value
        }
      }
    }
}

Some components are already open source!
MOSTLY TREE THINGS

Spark: PMML
Apple CoreML: Protobuf
Stripe: JSON (mostly)
OPEN SOURCE?

QUESTIONS?
Thanks!
