Slide 1

Slide 1 text

Machine Learning Infrastructure at Stripe
Bridging from Python -> JVM (and some other stuff)

Slide 2

Slide 2 text

$ whoami
Rob Story
Software Engineer, Machine Learning Infra
@oceankidbilly

Slide 3

Slide 3 text

No content

Slide 4

Slide 4 text

Why?

Slide 5

Slide 5 text

Merchant Fraud

Slide 6

Slide 6 text

No content

Slide 7

Slide 7 text

Transaction Fraud

Slide 8

Slide 8 text

Support Tools

Slide 9

Slide 9 text

Serialization

Slide 10

Slide 10 text

No content

Slide 11

Slide 11 text

YEP, RIGHT NOW
Let’s ship a model to production

Slide 12

Slide 12 text

make_serializable(pipeline, owner="pydata-2017")
estimator.to_contentment_hash()

Slide 13

Slide 13 text

return SerializableWrapper(
    registry.make_serializable(obj),
    owner=owner,
    tags=tags,
    metadata=metadata)
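The registry-plus-wrapper pattern on this slide can be sketched in a few lines. This is a hypothetical reconstruction with invented names (`Registry`, `DummySerializer`), not Stripe's actual API: a registry holds serializers, asks each one whether it can handle an object, and the wrapper carries ownership metadata alongside the chosen serializer.

```python
# Hypothetical sketch of a serializer registry + wrapper; names are
# illustrative, not Stripe's real implementation.

class Registry:
    def __init__(self):
        self._serializers = []

    def register(self, serializer):
        self._serializers.append(serializer)

    def make_serializable(self, obj):
        # Return the first registered serializer that claims this object.
        for serializer in self._serializers:
            if serializer.is_serializer_for(obj):
                return serializer
        raise ValueError("no serializer registered for {!r}".format(type(obj)))


class SerializableWrapper:
    """Pairs a serializer with ownership/tagging metadata."""
    def __init__(self, serializer, owner=None, tags=None, metadata=None):
        self.serializer = serializer
        self.owner = owner
        self.tags = tags or []
        self.metadata = metadata or {}


def make_serializable(registry, obj, owner=None, tags=None, metadata=None):
    return SerializableWrapper(
        registry.make_serializable(obj),
        owner=owner, tags=tags, metadata=metadata)
```

The payoff of the registry indirection is that shipping a new model type only requires registering one new serializer, which is exactly the question a later slide asks.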

Slide 14

Slide 14 text

@if_delegate_has_method(delegate='estimator')
def fit_transform(self, X, y=None, **fit_params):
    self.fit(X, y, **fit_params)
    return self.transform(X)

@if_delegate_has_method(delegate='estimator')
def transform(self, X):
    return self.estimator.transform(X)

Slide 15

Slide 15 text

def _fit_serializable(serializable, X, y=None, **fit_params):
    if not isinstance(X, pd.DataFrame):
        raise ValueError(
            'serializable {} requires a pandas.DataFrame'
            .format(type(serializable.get_estimator())))
    init_feature_names = list(X.columns.values)
    serializable.fit_with_feature_names(
        None, init_feature_names, X, y, **fit_params)
    # Allow feature selection to propagate backwards.
    serializable.set_output_features(None)
    return serializable

Slide 16

Slide 16 text

def fit_with_feature_names(self, name, feature_names, X, y):
    self._feature_names = feature_names
    self.estimator.fit(X, y)
    return self
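The idea in the two slides above — capture column names at fit time so the serving side can match features by name rather than position — can be shown with a toy, self-contained sketch. `NamedEstimator` and `MeanModel` are invented stand-ins, not Stripe classes:

```python
# Toy illustration of recording feature names at fit time.
# Names are invented for the sketch; not Stripe's actual classes.

class MeanModel:
    """Trivial stand-in estimator: predicts the mean of y."""
    def fit(self, X, y):
        self.mean_ = sum(y) / len(y)
        return self


class NamedEstimator:
    """Wraps a bare estimator and remembers the feature names it was fit with,
    so they can be serialized alongside the model for the JVM side."""
    def __init__(self, estimator):
        self.estimator = estimator
        self._feature_names = None

    def fit_with_feature_names(self, feature_names, X, y=None):
        self._feature_names = list(feature_names)
        self.estimator.fit(X, y)
        return self

    @property
    def feature_names(self):
        return self._feature_names


est = NamedEstimator(MeanModel())
est.fit_with_feature_names(["amount", "country"], [[1, 0], [2, 1]], [0.0, 1.0])
```

Without this bookkeeping, a scikit-learn model only knows column *positions*, which is exactly what makes cross-language serving brittle.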

Slide 17

Slide 17 text

Ok so what if I just want to ship a new model type?

Slide 18

Slide 18 text

class FillMissing(SerializableEstimator, TransformerMixin):
    def __init__(self, columns='all', missing_value=-1):
        self.columns = columns
        self.missing_value = missing_value

    def serialize(self, name):
        bytes = json_to_bytes({
            "features": list(self.columns_),
            "value": self.missing_value
        })
        return ApplyFeatureEncoder('fill_missing', name, bytes, 'json')
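At serve time, the JSON emitted by `serialize` describes a very simple transform: replace missing values in the listed features with a sentinel. A minimal pure-Python sketch of that runtime behavior (the function name and dict-based row format are illustrative, not the actual Scala implementation):

```python
# Sketch of what a fill_missing encoder does when applied to one row of
# features at serve time. Illustrative only.

def fill_missing(row, columns, missing_value=-1):
    """Return a copy of `row` (a feature-name -> value dict) with missing
    (None) values in `columns` replaced by `missing_value`."""
    out = dict(row)
    for col in columns:
        if out.get(col) is None:
            out[col] = missing_value
    return out


fill_missing({"amount": None, "count": 3}, ["amount", "count"])
# -> {"amount": -1, "count": 3}
```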

Slide 19

Slide 19 text

class RandomForestSerializer(ModelSerializer):
    """Serializer for RandomForest models."""

    def is_serializer_for(self, obj):
        return isinstance(obj, RandomForestRegressor)

    def serialize_model(self, name, model, feature_names):
        decision_trees = []
        for decision_tree in model.estimators_:
            decision_trees.append(
                _tree_to_dict(decision_tree, feature_names))
        bonsai_bytes = get_bonsai_bytes(decision_trees)
        return Model("simple-bonsai-regression-forest",
                     name, bonsai_bytes, "bonsai")

Slide 20

Slide 20 text

Aside: Ok what is this bonsai thing?

Slide 21

Slide 21 text

Scala library for transforming arbitrary tree structures into read-only versions that take up a fraction of the space. Open Source!

Slide 22

Slide 22 text

def _tree_to_dict(decision_tree, feature_names, fraudulent_class_idx=1):
    # This is where the internal tree structure lives in an sk DecisionTree
    tree = decision_tree.tree_
    if isinstance(decision_tree, t.DecisionTreeClassifier):
        # NOTE: This ONLY WORKS with binary classification, where the
        # second class is the fraudulent class.
        probs = np.nan_to_num(tree.value[:, 0, fraudulent_class_idx] /
                              (tree.value[:, 0, 0] + tree.value[:, 0, 1]))
    elif isinstance(decision_tree, t.DecisionTreeRegressor):
        probs = [v[0][0] for v in tree.value]
    else:
        raise ValueError("You can only serialize scikit decision trees!")
    return {
        "feature_names": feature_names,
        "features_used": _features_used(tree, feature_names),
        "node_features": map(int, tree.feature),
        "node_thresholds": map(float, tree.threshold),
        "left_children": map(int, tree.children_left),
        "right_children": map(int, tree.children_right),
        "probabilities": [float(p) for p in probs],
        # Deprecated, moving these to Pipeline
        "encodings": {}
    }

Brittle to version changes!
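To see why those parallel arrays are enough, here is a sketch of how a serialized tree can be walked at prediction time. The field names mirror the dict above; the traversal itself is a guess at what the Scala/Bonsai side does, using scikit-learn's convention that a child index of -1 marks a leaf:

```python
# Sketch: evaluating one feature vector against the flattened-array tree
# produced by _tree_to_dict. Traversal logic is illustrative.

def predict_proba(tree, x):
    """Walk the serialized tree for feature vector `x` and return the
    probability stored at the leaf reached."""
    node = 0
    while tree["left_children"][node] != -1:  # -1 marks a leaf
        feature = tree["node_features"][node]
        if x[feature] <= tree["node_thresholds"][node]:
            node = tree["left_children"][node]
        else:
            node = tree["right_children"][node]
    return tree["probabilities"][node]


# A stump: one split on feature 0 at threshold 0.5, two leaves.
stump = {
    "node_features": [0, -2, -2],       # -2 is scikit-learn's leaf marker
    "node_thresholds": [0.5, -2.0, -2.0],
    "left_children": [1, -1, -1],
    "right_children": [2, -1, -1],
    "probabilities": [0.5, 0.1, 0.9],
}
predict_proba(stump, [0.3])  # -> 0.1
```

Because the format is just integer and float arrays, it is language-agnostic — which is the whole point of the Python -> JVM bridge — but, as the slide warns, it depends on scikit-learn's private `tree_` internals and is brittle to version changes.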

Slide 23

Slide 23 text

Now our models and encoders know how to serialize themselves. Let’s put it all together!

Slide 24

Slide 24 text

estimator.to_contentment_hash()

Slide 25

Slide 25 text

In [2]: model_package = estimator.model_package

In [3]: model_package.encoder
Out[3]:

In [4]: model_package.model
Out[4]:

In [5]: model_package.encoder.encoder_type
Out[5]: 'stripe-categorical-encoding'

In [6]: model_package.model.model_type
Out[6]: 'simple-bonsai-regression-forest'

Slide 26

Slide 26 text

{'MODEL': {'encoders': {'apply': {'encoderType': 'stripe-categorical-encoding',
                                  'path': 'label-encoder.json'}},
           'model': {'modelType': 'simple-bonsai-regression-forest',
                     'path': 'random-forest-regressor.bonsai'},
           'owner': 'pydata-2017'},

Slide 27

Slide 27 text

'label-encoder.json': {'encodings': {'bird': {'chicken': 0,
                                              'finch': 1,
                                              'raven': 2},
                                     'food': {'cheese': 0,
                                              'hamburger': 1,
                                              'tomato': 2},
                                     'planet': {'earth': 0,
                                                'mars': 1,
                                                'pluto': 2}},
                       'features': ['bird', 'food', 'planet']}
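Applying that label-encoder JSON at serve time is a dictionary lookup per categorical feature. A minimal sketch (the `encode_row` helper and the unknown-value fallback are illustrative assumptions, not the actual Scala encoder):

```python
# Sketch: applying the label-encoder JSON above to one row of features.
# encode_row and its unknown-value handling are illustrative.

encodings = {
    "bird": {"chicken": 0, "finch": 1, "raven": 2},
    "food": {"cheese": 0, "hamburger": 1, "tomato": 2},
    "planet": {"earth": 0, "mars": 1, "pluto": 2},
}


def encode_row(row, encodings, unknown=-1):
    """Replace categorical strings with their integer codes; categories not
    seen at training time map to `unknown` so serving never raises."""
    return {
        key: encodings[key].get(value, unknown) if key in encodings else value
        for key, value in row.items()
    }


encode_row({"bird": "raven", "food": "pizza", "planet": "mars"}, encodings)
# -> {"bird": 2, "food": -1, "planet": 1}
```

Shipping the mapping as data (JSON) rather than code is what lets the Python trainer and the Scala server agree on the encoding exactly.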

Slide 28

Slide 28 text

random-forest-regressor.bonsai: raw Bonsai-encoded model bytes (e.g. `\x08\t\n\x0b\x0c\r\x0e\x0f…`), compact but not human-readable.

Slide 29

Slide 29 text

Contentment: S3 as Content-Addressed Store

In [1]: estimator.to_contentment_hash()
Out[1]: 'sha256.Q5NJ5DVQC…'

We can load the model on the fly if we know its hash!
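The content-addressing idea is that the storage key is a digest of the artifact's bytes, so identical model bytes always resolve to the same key and a deployer can fetch a model knowing only its hash. A sketch of that scheme — the base32 encoding is a guess at the `sha256.Q5NJ5DVQC…` format shown above, not Contentment's actual implementation:

```python
import base64
import hashlib

# Sketch of content addressing: derive the storage key from the bytes
# themselves. The "sha256." prefix + base32 body is an assumption about
# the hash format, not Contentment's real code.

def contentment_hash(payload: bytes) -> str:
    digest = hashlib.sha256(payload).digest()
    return "sha256." + base64.b32encode(digest).decode("ascii").rstrip("=")


key = contentment_hash(b'{"modelType": "simple-bonsai-regression-forest"}')
# The same bytes always produce the same key, so the store is immutable
# and safe to cache aggressively.
```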

Slide 30

Slide 30 text

Candidate Models & Promotion

Slide 31

Slide 31 text

My model lives in S3. How do I actually promote it to production?

Slide 32

Slide 32 text

Model Deployer!

/model/$MODEL_ID/predict
/tag/$TAG_ID/predict

Slide 33

Slide 33 text

ml-tool model-deploy \
  -t txn_fraud.production \
  -m sha256.Q5NJ5DVQC…

ml-tool model-deploy \
  -t txn_fraud.production -b

Slide 34

Slide 34 text

Deploy History For Tag: txn_fraud.production

Deployed At    Model ID
2017-01-01     sha.12345…
2017-02-01     sha.67891…
2017-03-01     sha.abcde…

Slide 35

Slide 35 text

Hot-swapping models on the fly sounds scary. Can I give it a trial run?

Slide 36

Slide 36 text

Shadow Models

Slide 37

Slide 37 text

Model Hierarchy: Fan Out

MODEL: SHA.12345…
MODEL: SHA.ABC…
MODEL: SHA.XYZ…
MODEL: SHA.987…
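The fan-out picture above can be sketched in a few lines: one production model answers the request, while every shadow model scores the same features and has its output logged for offline comparison, never returned to the caller. Function and parameter names here are illustrative, not Stripe's serving code:

```python
# Sketch of shadow-model fan-out: score with production, log the shadows.
# Names are illustrative.

def score_with_shadows(features, production_model, shadow_models, log):
    """Return the production score; record every shadow model's score so
    candidates can be evaluated against live traffic without affecting it."""
    result = production_model(features)
    for model_id, shadow in shadow_models.items():
        # Shadow scores are logged for analysis, never returned.
        log.append((model_id, shadow(features)))
    return result


log = []
score = score_with_shadows(
    {"amount": 10.0},
    production_model=lambda f: 0.2,
    shadow_models={"sha.abc": lambda f: 0.3},
    log=log,
)
# score == 0.2; log == [("sha.abc", 0.3)]
```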

Slide 38

Slide 38 text

ml-tool model-deploy-add \
  -t merchant_fraud.shadow \
  -m $YOUR_MODEL_SHA

ml-tool model-deploy-remove \
  -t merchant_fraud.shadow \
  -m $YOUR_MODEL_SHA

Slide 39

Slide 39 text

You haven’t talked about the JVM bits yet!

Slide 40

Slide 40 text

We implement everything on the Scala side, including the encoders.

case class StandardCategoryEncoder(
  features: Set[String],
  encodings: Map[String, Map[String, Double]]
) extends FeatureEncoder {
  private[this] val (featureTypes, featureParsers) =
    StandardCategoryEncoder.makeParsers(features, encodings)

  def encode(features: Map[String, FeatureValue]): Try[Map[String, FeatureValue]] =
    Try {
      features.map { case (key, value) =>
        featureParsers.get(key) match {
          case Some(parse) => key -> parse(value).get
          case None => key -> value
        }
      }
    }
}

Slide 41

Slide 41 text

Some components are already open source!
MOSTLY TREE THINGS

Slide 42

Slide 42 text

OPEN SOURCE?

Spark: PMML
Apple CoreML: Protobuf
Stripe: JSON (mostly)

Slide 43

Slide 43 text

QUESTIONS?
Thanks!