
Handling a Huge Image List with Elixir Flow

enpedasi
August 24, 2018

Transcript

  1. Example of a Flow pipeline (aggregating a CSV)

     result =
       filename
       |> File.stream!()
       # Data cleansing
       |> Flow.from_enumerable()
       |> Flow.map(&(&1
            |> String.replace(",", "\t")      # (1) commas to tabs (CSV -> TSV)
            |> String.replace("\r\n", "\n")   # (2) CRLF -> LF
            |> String.replace("\"", "")))     # (3) strip double quotes
       # Aggregation
       |> Flow.map(&String.split(&1, "\t"))   # (4) split on tabs
       |> Flow.map(&Enum.at(&1, 2 - 1))       # (5) extract the 2nd field
       |> Flow.partition()
       |> Flow.reduce(fn -> %{} end, fn name, acc ->   # (6) count occurrences of each value
            Map.update(acc, name, 1, &(&1 + 1))
          end)
       |> Enum.sort(&(elem(&1, 1) > elem(&2, 1)))      # (7) most frequent first
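For reference, the same cleanse / split / count / sort pipeline can be written with plain Enum (sequential, standard library only); the Flow version in the slide parallelizes exactly these steps. A minimal self-contained sketch over an in-memory CSV sample (the sample data is made up for illustration):

```elixir
# Sequential equivalent of the slide's Flow pipeline, stdlib only.
lines = [
  "1,\"apple\",x\r\n",
  "2,\"banana\",y\r\n",
  "3,\"apple\",z\r\n"
]

result =
  lines
  |> Enum.map(&(&1
       |> String.replace(",", "\t")      # CSV -> TSV
       |> String.replace("\r\n", "\n")   # CRLF -> LF
       |> String.replace("\"", "")))     # strip double quotes
  |> Enum.map(&String.split(&1, "\t"))   # split on tabs
  |> Enum.map(&Enum.at(&1, 2 - 1))       # extract the 2nd field
  |> Enum.reduce(%{}, fn name, acc ->    # count occurrences
       Map.update(acc, name, 1, &(&1 + 1))
     end)
  |> Enum.sort(&(elem(&1, 1) > elem(&2, 1)))

IO.inspect(result)   # [{"apple", 2}, {"banana", 1}]
```

Swapping `Enum` for `Flow` (plus `Flow.partition` before the reduce) is what distributes this work across stages.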
  2. The same pipeline as SQL (not real syntax)

     SELECT word, count(*) cnt
       FROM (SELECT R.rec[2] AS word
               FROM (SELECT split(R.rec, '\t')
                       FROM (SELECT replace(R.rec, '\"', '')
                               FROM (SELECT replace(R.rec, '\r\n', '\n')
                                       FROM (SELECT replace(R.rec, ',', '\t')
                                               FROM (SELECT rec
                                                       FROM TABLE (stram_pkg.read("text.csv"))
                                                    ) R
                                            ) R
                                    ) R
                            ) R
                    ) R
            ) R
      GROUP BY word
  3. Inside the 14-million-entry URL list: category ID + URL (fall11_urls.txt)

     /content/elixir/imagenet/urls> head -n 10 fall11_urls.txt
     n00004475_6590   http://farm4.static.flickr.com/3175/2737866473_7
     n00004475_15899  http://farm4.static.flickr.com/3276/2875184020_9
     n00004475_32312  http://farm3.static.flickr.com/2531/4094333885_e
     n00004475_35466  http://farm4.static.flickr.com/3289/2809605169_8
     n00004475_39382  http://2.bp.blogspot.com/_SrRTF97Kbfo/SUqT9y-qTV
     n00004475_41022  http://fortunaweb.com.ar/wp-content/uploads/2009
     n00004475_42770  http://farm4.static.flickr.com/3488/4051378654_2
     n00004475_54295  http://farm4.static.flickr.com/3368/3198142470_6
     n00005787_13     http://www.powercai.net/Photo/UploadPhotos/20050
     n00005787_32     http://www.web07.cn/uploads/Photo/c101122/12Z3Y5
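Each line is a category ID and a URL separated by whitespace, and the ID itself is a WordNet category ID plus an image number. A small sketch of splitting one line (the sample line comes from the slide, with its URL truncated as shown there; the whitespace separator is an assumption):

```elixir
# Split one fall11_urls.txt line into {category_id, url}.
line = "n00004475_6590\thttp://farm4.static.flickr.com/3175/2737866473_7"

[id, url] = String.split(line, ~r/\s+/, parts: 2)

# IDs look like "n00004475_6590": WordNet ID + "_" + image number.
[wnid, _img_no] = String.split(id, "_", parts: 2)

IO.inspect({wnid, url})
```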
  4. Category IDs and names: words.txt (82,000 entries)

     n00001740  entity
     n00001930  physical entity
     n00002137  abstraction, abstract entity
     n00002452  thing
     n00002684  object, physical object
     n00003553  whole, unit
     n00003993  congener
     n00004258  living thing, animate thing
     n00004475  organism, being
     n00005787  benthos
     n00005930  dwarf
     n00006024  heterotroph
     n00006150  parent
     n00006269  life
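The search code on the next slide calls labels_to_list/1, whose body is not shown in the deck. A hypothetical sketch of what it could look like, scanning words.txt-style "ID, names" lines and collecting the IDs whose name list contains the search word (the module name, the in-memory sample, and this implementation are all assumptions, not the deck's code):

```elixir
# Hypothetical labels_to_list/1: given a word, return matching
# category IDs from words.txt-style "ID<TAB>name, name, ..." lines.
defmodule Labels do
  @words [
    "n00001740\tentity",
    "n00002684\tobject, physical object",
    "n00004475\torganism, being"
  ]

  def labels_to_list(word) do
    @words
    |> Enum.map(&String.split(&1, "\t", parts: 2))
    |> Enum.filter(fn [_id, names] -> String.contains?(names, word) end)
    |> Enum.map(fn [id, _names] -> id end)
  end
end

IO.inspect(Labels.labels_to_list("object"))   # ["n00002684"]
```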
  5. Parallel streaming search (excerpt) — Flow handles the search over the 14 million rows

     @urls_filename "urls/fall11_urls.txt"

     defp url_list(word) do
       urls = File.stream!(@urls_filename)
       labels = labels_to_list(word)

       urls
       |> Flow.from_enumerable()
       |> Flow.filter(&String.contains?(&1, labels))
       |> Enum.to_list()
     end
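The filter works in a single pass because String.contains?/2 accepts a list as its second argument and matches when any element is a substring, so one scan of the file checks all the hit IDs at once:

```elixir
labels = ["n09500936", "n12198286", "n12593826"]

# String.contains?/2 with a list matches if ANY element occurs.
# The sample lines are made up for illustration.
line_hit  = "n12198286_42\thttp://example.com/a.jpg"
line_miss = "n00004475_13\thttp://example.com/b.jpg"

IO.inspect(String.contains?(line_hit, labels))   # true
IO.inspect(String.contains?(line_miss, labels))  # false
```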
  6. Parallel scraping (code excerpt)

     def scraping(urls) do
       urls
       # 10 parallel stages; one task per server at a time
       |> Flow.from_enumerable(max_demand: 1, stages: 10)
       |> Flow.map(fn url ->
            case get_image(url) do   # HTTP request
              {:ok, image} ->
                path = URI.parse(url).path |> Path.basename()
                File.write!("images/#{path}", image)

              {:error, status} ->
                IO.inspect("error=#{url} #{status}")
            end
          end)
       |> Enum.to_list()
     end
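The same bounded-concurrency download loop can also be expressed with the standard library's Task.async_stream/3, where max_concurrency plays the role of the stages option. A self-contained sketch with a stubbed fetch function in place of the real HTTP request (fetch here is a stand-in, not the deck's get_image):

```elixir
# Bounded-concurrency downloads using only the standard library.
# fetch/1 stubs the HTTP request so the example runs offline.
fetch = fn url -> {:ok, "bytes-of-#{url}"} end

urls = ["http://example.com/a.jpg", "http://example.com/b.jpg"]

results =
  urls
  |> Task.async_stream(fn url ->
       case fetch.(url) do
         {:ok, image} ->
           # Derive a local filename from the URL path, as in the slide.
           {Path.basename(URI.parse(url).path), byte_size(image)}

         {:error, status} ->
           {:error, url, status}
       end
     end, max_concurrency: 10, timeout: :infinity)
  |> Enum.map(fn {:ok, res} -> res end)

IO.inspect(results)
```

Flow is the better fit when the downloads feed further pipeline stages; Task.async_stream suffices when you only need a parallel map.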
  7. Searching the label list for the word "phoenix" hits three IDs

     iex(1)> url_list = Imagenet.get_urls "phoenix"
     [labels: ["n09500936", "n12198286", "n12593826"]]
     [elapsed: 113.922]
     [count: 1146]

     It took about two minutes to extract 1,146 entries, roughly 30 seconds per ID.
     (Measured on a Kaby Lake Core i5, 2 cores / 4 threads.)
     At this speed, the wait hardly feels like a burden.