
Introduction to Cloud Dataflow (Python SDK: Batch Processing)

ver1.1 2017/11/18

Etsuji Nakai
November 11, 2017
Transcript

  1. The usual word count

    ['this is a pen', 'this is an apple', 'apple pen']
      ↓ map (\x -> x.split(" "))
    [['this', 'is', 'a', 'pen'], ['this', 'is', 'an', 'apple'], ['apple', 'pen']]
      ↓ flatten
    ['this', 'is', 'a', 'pen', 'this', 'is', 'an', 'apple', 'apple', 'pen']
      ↓ map (\x -> (x, 1))
    [('this', 1), ('is', 1), ('a', 1), ('pen', 1), ('this', 1), ('is', 1), ('an', 1), ('apple', 1), ('apple', 1), ('pen', 1)]
      ↓ groupByKey
    [('a', [1]), ('apple', [1, 1]), ('this', [1, 1]), ('is', [1, 1]), ('an', [1]), ('pen', [1, 1])]
      ↓ map (\(k, v) -> (k, sum(v)))
    [('a', 1), ('apple', 2), ('this', 2), ('is', 2), ('an', 1), ('pen', 2)]
  2. A plain Python implementation (Python 2)

    def groupByKey(items):
        result = {}
        for (k, v) in items:
            if k not in result.keys():
                result[k] = []
            result[k].append(v)
        return [(k, v) for k, v in result.items()]

    items = ['this is a pen', 'this is an apple', 'apple pen']
    items = map(lambda x: x.split(' '), items)
    items = sum(items, [])  # Flatten
    items = map(lambda x: (x, 1), items)
    items = groupByKey(items)
    items = map(lambda (k, v): (k, sum(v)), items)
    print(items)
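    Note that the tuple-unpacking lambda `lambda (k, v): ...` is Python 2
    syntax and no longer exists in Python 3. A Python 3 equivalent of the
    same chain might look like this (a sketch, not part of the original
    deck; the group_by_key helper is illustrative):

    def group_by_key(items):
        result = {}
        for k, v in items:
            result.setdefault(k, []).append(v)
        return list(result.items())

    items = ['this is a pen', 'this is an apple', 'apple pen']
    items = [x.split(' ') for x in items]
    items = [w for ws in items for w in ws]  # flatten
    items = [(w, 1) for w in items]
    items = group_by_key(items)
    items = [(k, sum(v)) for k, v in items]
    print(items)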
  3. Implemented with Dataflow (Apache Beam)

    import argparse
    import logging
    import apache_beam as beam

    logging.getLogger().setLevel(logging.INFO)

    parser = argparse.ArgumentParser()
    parser.add_argument('--input', dest='input', required=True,
                        help='Input file to process.')
    parser.add_argument('--output', dest='output', required=True,
                        help='Output file to write results to.')
    known_args, pipeline_args = parser.parse_known_args()

    p = beam.Pipeline(argv=pipeline_args)
    _ = (p
         | 'Step1. Read lines' >> beam.io.textio.ReadFromText(known_args.input)
         | 'Step2. Split into words' >> beam.FlatMap(lambda x: x.split(' '))  # Map + Flatten
         | 'Step3. Assign one to each word' >> beam.Map(lambda x: (x, 1))
         | 'Step4. Group by word' >> beam.GroupByKey()
         | 'Step5. Do count' >> beam.Map(lambda (k, v): (k, sum(v)))
         | 'Step6. Output results' >> beam.io.textio.WriteToText(known_args.output))
    p.run()
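    Step2 uses FlatMap, which combines the map and flatten steps from
    slide 1: the transform returns a list per element and Beam emits each
    list item as a separate record. In plain Python terms (an illustrative
    sketch, not from the deck):

    lines = ['this is a pen', 'this is an apple', 'apple pen']
    # Map alone would yield one list per line;
    # FlatMap concatenates those lists into a single stream of words.
    words = [w for line in lines for w in line.split(' ')]
    print(words)  # ['this', 'is', 'a', 'pen', 'this', 'is', 'an', 'apple', 'apple', 'pen']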
  4. Execution example

    sudo pip install six==1.10.0  (*)
    python applepen.py \
      --runner DataflowRunner \
      --project=$PROJECT \
      --staging_location gs://$BUCKET/staging \
      --temp_location gs://$BUCKET/staging \
      --input gs://$BUCKET/applepen.txt \
      --output gs://$BUCKET/applepen_out.txt

    (*) https://stackoverflow.com/questions/46300173/import-apache-beam-metaclass-conflict
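    For a quick local test before submitting a Dataflow job, the same
    script can typically be run with Beam's DirectRunner; a sketch
    assuming local files with the same names:

    python applepen.py \
      --runner DirectRunner \
      --input ./applepen.txt \
      --output ./applepen_out.txt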
  5. Splitting the pipeline into multiple branches

    # Assumes a '--tmp_output' argument was added to the parser
    # alongside '--input' and '--output'.
    p = beam.Pipeline(argv=pipeline_args)
    tmp = (p
           | 'Step1. Read lines' >> beam.io.textio.ReadFromText(known_args.input)
           | 'Step2. Split into words' >> beam.FlatMap(lambda x: x.split(' '))  # Map + Flatten
           | 'Step3. Assign one to each word' >> beam.Map(lambda x: (x, 1)))
    _ = (tmp
         | 'Step3.5. Temp output' >> beam.io.textio.WriteToText(known_args.tmp_output))
    _ = (tmp
         | 'Step4. Group by word' >> beam.GroupByKey()
         | 'Step5. Do count' >> beam.Map(lambda (k, v): (k, sum(v)))
         | 'Step6. Output results' >> beam.io.textio.WriteToText(known_args.output))
    p.run()
  6. Anagram finder

    From a dictionary file with one English word per line, find all
    anagrams (words that become identical when their letters are
    rearranged).

      binder brined inbred loop parrot polo pool proart raport raptor rebind

    Turn this... ( ^ω^) ...into this! ( ^ω^)

      parrot proart raport raptor
      binder brined inbred rebind
      loop polo pool

    https://raw.githubusercontent.com/dwyl/english-words/master/words_alpha.txt
  7. Answer

    binder brined inbred loop parrot polo pool proart raport raptor rebind

      ↓ Map (attach the sorted letters as a key)

    (bdeinr, binder) (bdeinr, brined) (bdeinr, inbred) (loop, loop)
    (aoprrt, parrot) (loop, polo) (loop, pool) (aoprrt, proart)
    (aoprrt, raport) (aoprrt, raptor) (bdeinr, rebind)

      ↓ GroupByKey

    (aoprrt, [parrot, proart, raport, raptor])
    (bdeinr, [binder, brined, inbred, rebind])
    (loop, [loop, polo, pool])
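    The label attached in the Map stage is simply the word's letters
    joined in sorted order, so all anagrams of a word receive the same
    key. For example:

    >>> ''.join(sorted('parrot'))
    'aoprrt'
    >>> ''.join(sorted('polo'))
    'loop'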
  8. Answer

    p = beam.Pipeline(argv=pipeline_args)
    _ = (p
         | 'Step1. Read lines' >> beam.io.textio.ReadFromText(known_args.input)
         # ''.join(...) turns the sorted letters back into a string,
         # giving a hashable key for GroupByKey.
         | 'Step2. Add labels' >> beam.Map(lambda x: (''.join(sorted(x)), x))
         | 'Step4. Group by label' >> beam.GroupByKey()
         | 'Step5. Drop labels' >> beam.Map(lambda (k, v): ','.join(v))
         | 'Step6. Output results' >> beam.io.textio.WriteToText(known_args.output))
    p.run()

    python anagram.py \
      --runner DataflowRunner \
      --project=$PROJECT \
      --staging_location gs://$BUCKET/staging \
      --temp_location gs://$BUCKET/staging \
      --input gs://$BUCKET/words_alpha.txt \
      --output gs://$BUCKET/anagram_out.txt
  9. Applied exercise

    The code below is meant to output only the groups with three or more
    anagram words, but it has a problem. Where is it?

    p = beam.Pipeline(argv=pipeline_args)
    _ = (p
         | 'Step1. Read lines' >> beam.io.textio.ReadFromText(known_args.input)
         | 'Step2. Add labels' >> beam.Map(lambda x: (''.join(sorted(x)), x))
         | 'Step4. Group by label' >> beam.GroupByKey()
         | 'Step5. Filter out less than three words' >> beam.Filter(lambda (k, v): len(v) > 2)
         | 'Step6. Drop labels' >> beam.Map(lambda (k, v): ','.join(v))
         | 'Step7. Output results' >> beam.io.textio.WriteToText(known_args.output))
    p.run()
  10. Answer

    The "list" that Dataflow hands you here is actually an iterable
    object inside the PCollection, not a real Python list, so the list
    operation len() cannot be used. (If a "huge list" really were a
    list, it could exhaust memory...)

    p = beam.Pipeline(argv=pipeline_args)
    _ = (p
         | 'Step1. Read lines' >> beam.io.textio.ReadFromText(known_args.input)
         | 'Step2. Add labels' >> beam.Map(lambda x: (''.join(sorted(x)), x))
         | 'Step4. Group by label' >> beam.GroupByKey()
         # Count by iterating instead of calling len().
         | 'Step5. Filter out less than three words' >> beam.Filter(lambda (k, v): sum(1 for _ in v) > 2)
         | 'Step6. Drop labels' >> beam.Map(lambda (k, v): ','.join(v))
         | 'Step7. Output results' >> beam.io.textio.WriteToText(known_args.output))
    p.run()
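    Plain Python exhibits the same constraint: a generator, like the
    grouped values here, can be iterated but has no len(). A small
    illustration (not from the deck):

    gen = (w for w in ['parrot', 'proart', 'raport', 'raptor'])
    # len(gen)  # would raise TypeError: object of type 'generator' has no len()
    print(sum(1 for _ in gen) > 2)  # True: counts by consuming the iterable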
  11. References

    Apache Beam Programming Guide
    https://beam.apache.org/documentation/programming-guide/

    Apache Beam Quick Start with Python
    http://shzhangji.com/blog/2017/09/12/apache-beam-quick-start-with-python/
  12. Bonus

    map.py:

    #!/usr/bin/env python3
    import sys
    for line in iter(sys.stdin.readline, ''):
        word = line.rstrip()
        canonical = ''.join(sorted(word))
        print(canonical + ' ' + word)

    reduce.py:

    #!/usr/bin/env python3
    import sys
    pre_canonical = ''
    result = []
    for line in iter(sys.stdin.readline, ''):
        canonical, word = line.rstrip().split(' ')
        if pre_canonical != canonical:
            if pre_canonical != '':
                print(' '.join(result))
            pre_canonical = canonical
            result = []
        result.append(word)
    if result:  # flush the final group
        print(' '.join(result))

    $ time ./map.py < words_alpha.txt | sort | ./reduce.py > output.txt

    real 0m1.830s
    user 0m1.831s
    sys  0m0.061s