Implementing a Decision Tree

0 Implementing a Decision Tree 2023-09-01 58th NearMe Tech Study Session Takuma Kakinoue

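1 Implementation and a picture of a decision tree
Implementation: https://github.com/kakky-hacker/algorithm_sandbox/tree/main/decision-tree
* At each node, compute the feature and split threshold that maximize the information gain (the amount by which Gini impurity decreases).
* Child nodes keep being expanded until max_depth is reached or the Gini impurity becomes 0.
* The figure was drawn by passing a trained scikit-learn DecisionTreeClassifier to scikit-learn's tree.plot_tree function.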

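2 What is Gini impurity?

    import numpy as np

    def calc_gini_score(y) -> float:
        # An empty set of labels is treated as having no impurity.
        if len(y) == 0:
            return 0.0
        y_unique = np.unique(y)
        res = 0.0
        for value in y_unique:
            # Add the squared proportion of each class.
            res += (np.count_nonzero(y == value) / len(y)) ** 2
        return 1 - res

Roughly speaking, it is something like the information content of the data: the more mixed the classes, the higher the value. The Gini impurity I_G(t) at node t is given by the following formula, which is what the code above computes.

    I_G(t) = 1 - \sum_{i=1}^{c} p(i \mid t)^2

Here c is the number of classes and p(i | t) is the probability of class i occurring at node t. When all the data belong to the same class, the Gini impurity is 0.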
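As a quick check of the formula, here is a minimal dependency-free sketch that computes the same quantity from class counts. The function name `gini_impurity` and the sample labels are illustrative assumptions, not part of the original implementation (which uses NumPy).

```python
from collections import Counter


def gini_impurity(labels):
    """Gini impurity 1 - sum(p_i ** 2) of a sequence of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0  # same convention as calc_gini_score for empty input
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


pure = gini_impurity([1, 1, 1, 1])      # a single class: impurity 0
balanced = gini_impurity([0, 0, 1, 1])  # two classes, half each: 1 - 2 * 0.25
three_way = gini_impurity([0, 1, 2])    # three classes, one third each
```

For c equally frequent classes the value is 1 - 1/c, so the maximum impurity approaches 1 as the number of classes grows.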

3 What is information gain?

    def calc_gain(input_y, output_y_left, output_y_right) -> float:
        assert len(input_y) == (len(output_y_left) + len(output_y_right))
        input_gini_impurity = calc_gini_score(input_y)
        # Impurity after the split: average of the two children, weighted by size.
        output_gini_impurity = (
            calc_gini_score(output_y_left) * (len(output_y_left) / len(input_y))
            + calc_gini_score(output_y_right) * (len(output_y_right) / len(input_y))
        )
        return input_gini_impurity - output_gini_impurity

Information gain is the amount by which Gini impurity decreases between before and after a split.
A large decrease in Gini impurity = samples of the same class are grouped together = the split classifies well.
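To make the definition concrete, the following sketch compares a split that separates the classes perfectly with one that separates nothing. It is a dependency-free illustration; `gini_impurity`, `information_gain`, and the label lists are assumed names and data, not the author's code.

```python
from collections import Counter


def gini_impurity(labels):
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def information_gain(parent, left, right):
    """Parent impurity minus the size-weighted impurity of the two children."""
    assert len(parent) == len(left) + len(right)
    n = len(parent)
    children = (len(left) / n) * gini_impurity(left) \
        + (len(right) / n) * gini_impurity(right)
    return gini_impurity(parent) - children


parent = [0, 0, 1, 1]
perfect_split = information_gain(parent, [0, 0], [1, 1])  # both children pure
useless_split = information_gain(parent, [0, 1], [0, 1])  # children as mixed as the parent
```

The perfect split removes all of the parent's impurity (gain 0.5), while the useless split leaves it unchanged (gain 0).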

4 Finding the split with maximum information gain

    def calc_best_split_feature(x, y):
        num_of_features = x.shape[1]
        max_gain = -1
        max_gain_feature_index = -1
        max_gain_threshold = -1
        # Examine each feature in turn.
        for feature_index in range(num_of_features):
            feature_values = x[:, feature_index]
            feature_values_unique = np.unique(feature_values)
            # Examine each candidate split threshold in turn.
            for feature_value_threshold in feature_values_unique:
                y_left = y[feature_values <= feature_value_threshold]
                y_right = y[feature_values > feature_value_threshold]
                # Compute the information gain of this split.
                gain = calc_gain(y, y_left, y_right)
                # Record the feature and threshold that give the maximum gain.
                if max_gain < gain:
                    max_gain = gain
                    max_gain_feature_index = feature_index
                    max_gain_threshold = feature_value_threshold
        return max_gain, max_gain_feature_index, max_gain_threshold
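The exhaustive search above can be traced on a tiny dataset. The sketch below is a dependency-free restatement under assumed names (`best_split`, `rows`, `labels`); it follows the same `value <= threshold` rule but works on plain lists rather than NumPy arrays.

```python
from collections import Counter


def gini_impurity(labels):
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def best_split(rows, labels):
    """Return (gain, feature_index, threshold) of the best `value <= threshold` split."""
    n = len(labels)
    parent_impurity = gini_impurity(labels)
    best = (-1.0, -1, None)
    for feature_index in range(len(rows[0])):
        # Every distinct value of the feature is a candidate threshold.
        for threshold in sorted({row[feature_index] for row in rows}):
            left = [y for row, y in zip(rows, labels) if row[feature_index] <= threshold]
            right = [y for row, y in zip(rows, labels) if row[feature_index] > threshold]
            weighted = (len(left) * gini_impurity(left)
                        + len(right) * gini_impurity(right)) / n
            gain = parent_impurity - weighted
            if gain > best[0]:
                best = (gain, feature_index, threshold)
    return best


# Feature 0 separates the classes at 2.0; feature 1 is deliberately uninformative.
rows = [[1.0, 7.0], [2.0, 3.0], [3.0, 8.0], [4.0, 2.0]]
labels = [0, 0, 1, 1]
gain, feature_index, threshold = best_split(rows, labels)
```

Each candidate threshold requires a pass over all samples, so one node costs on the order of (features × distinct values × samples) operations. That is fine for small data; sorting each feature once is a common way to reduce the cost.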

5 Node definition

    class Node:
        def __init__(self, x, y, num_of_class, max_depth, current_depth):
            # Class probabilities of the samples that reached this node.
            self.prob = [np.count_nonzero(y == i) / len(y) for i in range(num_of_class)]
            # Recurse until max_depth is reached or the node is pure
            # (Gini impurity 0, the stopping rule stated on slide 1).
            if current_depth <= max_depth and calc_gini_score(y) > 0:
                self.is_leaf = False
                # Find the split condition (feature and threshold) with maximum gain, then split the data.
                self.gain, self.split_feature_index, self.split_threshold = calc_best_split_feature(x, y)
                feature_values = x[:, self.split_feature_index]
                x_left = x[feature_values <= self.split_threshold]
                x_right = x[feature_values > self.split_threshold]
                y_left = y[feature_values <= self.split_threshold]
                y_right = y[feature_values > self.split_threshold]
                if len(y_left) == 0 or len(y_right) == 0:
                    # The best split leaves one side empty, so stop here.
                    self.is_leaf = True
                else:
                    # Create the child nodes.
                    self.left_node = Node(x_left, y_left, num_of_class, max_depth, current_depth + 1)
                    self.right_node = Node(x_right, y_right, num_of_class, max_depth, current_depth + 1)
            else:
                self.is_leaf = True

        # Note: Tree (slide 6) calls Node.output(values); that method is not shown on this slide.
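The slide does not show how a sample is routed through the nodes at prediction time. The sketch below is one plausible version, assuming the same `value <= threshold` rule used for splitting and that a leaf returns its `prob`. `SketchNode` is a hypothetical stand-in that reuses the attribute names from `Node`; the `output` body is an assumption, not the author's implementation.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class SketchNode:
    """Hypothetical stand-in for Node, holding only what prediction needs."""
    prob: List[float]
    is_leaf: bool = True
    split_feature_index: int = -1
    split_threshold: float = 0.0
    left_node: Optional["SketchNode"] = None
    right_node: Optional["SketchNode"] = None

    def output(self, values):
        """Go left when value <= threshold, otherwise right, until a leaf is reached."""
        if self.is_leaf:
            return self.prob
        if values[self.split_feature_index] <= self.split_threshold:
            return self.left_node.output(values)
        return self.right_node.output(values)


# A hand-built tree with one split on feature 0 at threshold 2.0.
root = SketchNode(
    prob=[0.5, 0.5],
    is_leaf=False,
    split_feature_index=0,
    split_threshold=2.0,
    left_node=SketchNode(prob=[1.0, 0.0]),
    right_node=SketchNode(prob=[0.0, 1.0]),
)
left_result = root.output([1.5, 9.0])   # 1.5 <= 2.0, so the left leaf answers
right_result = root.output([3.0, 9.0])  # 3.0 > 2.0, so the right leaf answers
```

Using the same comparison in training and prediction matters: a sample whose value equals the threshold must end up on the side it was counted on during the split.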

6 Decision tree definition

    class Tree:
        def __init__(self, num_of_class, max_depth=5):
            self.num_of_class = num_of_class
            self.max_depth = max_depth
            self.root_node = None

        def fit(self, x, y):
            # Building the root node recursively builds the whole tree.
            self.root_node = Node(x, y, self.num_of_class, self.max_depth, 1)

        def predict_proba(self, x):
            # Class probabilities for each sample.
            return [self.root_node.output(values) for values in x]

        def predict(self, x):
            # Index of the most probable class for each sample.
            return [np.argmax(self.root_node.output(values)) for values in x]
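The relationship between `predict_proba` and `predict` is just an argmax over each probability vector. The sketch below shows it without NumPy; the `argmax` helper and the `probas` values are hypothetical, chosen only for illustration.

```python
def argmax(probabilities):
    """Index of the largest value; ties go to the lowest index, as with np.argmax."""
    return max(range(len(probabilities)), key=probabilities.__getitem__)


# Hypothetical predict_proba output for three samples and three classes.
probas = [[0.1, 0.7, 0.2], [0.6, 0.3, 0.1], [0.5, 0.5, 0.0]]
predictions = [argmax(p) for p in probas]
```

The third sample is a tie between classes 0 and 1, and it resolves to class 0. A leaf with evenly mixed classes therefore always predicts the lower class index.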

7 Coming next
• Implement a random forest
• Implement gradient boosting
• Train on Kaggle's Titanic dataset

8 Thank you
