Slide 29
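Verification ③: Table joins (Dask)
merge is faster than join
Dask's set_index runs a sort (among other steps) internally, which adds significant overhead
Lookup (definition) tables are faster when kept in a single partition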
import dask.dataframe as dd

# `common` (the pandas lookup tables) and `taxi` (the Dask trip data) are defined elsewhere.
# Load each small lookup table as a single-partition Dask DataFrame.
vendors = dd.from_pandas(common.vendors, npartitions=1)
ratecodes = dd.from_pandas(common.ratecodes, npartitions=1)
pulocations = dd.from_pandas(common.pulocations, npartitions=1)
dolocations = dd.from_pandas(common.dolocations, npartitions=1)
payment_types = dd.from_pandas(common.payment_types, npartitions=1)

# Match each key column of taxi against the lookup table's index with merge,
# so set_index never has to be called on taxi.
taxi = (taxi.merge(vendors, left_on='VendorID', right_index=True, how='inner')
            .merge(ratecodes, left_on='RatecodeID', right_index=True, how='inner')
            .merge(pulocations, left_on='PULocationID', right_index=True, how='inner')
            .merge(dolocations, left_on='DOLocationID', right_index=True, how='inner')
            .merge(payment_types, left_on='payment_type', right_index=True, how='inner'))