
Super fast CSV parser

RejectKaigi2018

秒速284km

May 12, 2018

Transcript

  1. Super fast CSV parser @284km

  2. benchmark/parse.rb

    # 284km/rcsv
    Calculating -------------------------------------
            unquoted    166.007 (±12.7%) i/s -    810.000 in 5.007349s
              quoted    146.088 (±24.6%) i/s -    656.000 in 5.009174s
     include col_sep    131.046 (±28.2%) i/s -    580.000 in 5.001424s
     include row_sep    138.830 (±18.7%) i/s -    666.000 in 5.054874s
        encode utf-8    100.167 (±26.0%) i/s -    448.000 in 5.576945s
         encode sjis    137.429 (±18.2%) i/s -    660.000 in 5.028713s
    =========================================================
    # ruby/csv
    Calculating -------------------------------------
            unquoted     37.546 (±21.3%) i/s -    177.000 in 5.066859s
              quoted     16.773 (±23.8%) i/s -     78.000 in 5.026788s
     include col_sep      8.316 (±24.0%) i/s -     39.000 in 5.113550s
     include row_sep      1.842 (±54.3%) i/s -      9.000 in 5.422059s
        encode utf-8     26.126 (±15.3%) i/s -    126.000 in 5.055306s
         encode sjis     29.573 (±16.9%) i/s -    142.000 in 5.028898s
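The contents of benchmark/parse.rb are not shown in the slides. As a rough sketch of what such a comparison could look like, the following uses only the standard library `Benchmark` module and ruby/csv. The input shapes, the row count, and the `parses_per_second` helper are assumptions made for illustration; the i/s output format above resembles benchmark-ips, which this sketch does not use.

```ruby
require "benchmark"
require "csv"

ROW_COUNT = 200 # assumed size; the deck does not state the input size

# Hypothetical inputs mirroring four of the benchmark labels.
INPUTS = {
  "unquoted"        => "aaa,bbb,ccc\r\n" * ROW_COUNT,
  "quoted"          => %("aaa","bbb","ccc"\r\n) * ROW_COUNT,
  "include col_sep" => %("a,aa","b,bb","c,cc"\r\n) * ROW_COUNT,
  "include row_sep" => %("a\r\naa","b\r\nbb","c\r\ncc"\r\n) * ROW_COUNT,
}.freeze

# Hypothetical helper: parses per second for one input.
def parses_per_second(data, iterations: 20)
  elapsed = Benchmark.realtime { iterations.times { CSV.parse(data) } }
  iterations / elapsed
end

THROUGHPUT = INPUTS.transform_values { |data| parses_per_second(data) }
THROUGHPUT.each { |label, ips| puts format("%-16s %10.1f i/s", label, ips) }
```

Swapping `CSV.parse` for another parser's entry point would give the second column of a comparison; absolute numbers depend heavily on the machine and input size.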
  3. Yesterday's benchmark. It still has various problems; development is ongoing.

  4. Yesterday's benchmark. It still has various problems; development is ongoing.

    About 3 times faster.
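To see how the speedup varies by case, the ratios can be computed directly from the i/s figures on the benchmark slide. This is only arithmetic on the numbers shown there; given the reported error margins (±12% to ±54%), the ratios should be read as rough.

```ruby
# i/s figures copied from the benchmark slide: [284km/rcsv, ruby/csv].
FIGURES = {
  "unquoted"        => [166.007, 37.546],
  "quoted"          => [146.088, 16.773],
  "include col_sep" => [131.046, 8.316],
  "include row_sep" => [138.830, 1.842],
  "encode utf-8"    => [100.167, 26.126],
  "encode sjis"     => [137.429, 29.573],
}.freeze

# Ratio of the two throughputs, rounded to one decimal place.
SPEEDUPS = FIGURES.transform_values { |(fast, slow)| (fast / slow).round(1) }

SPEEDUPS.each { |label, ratio| puts format("%-16s %5.1fx", label, ratio) }
```

The smallest ratio is for `encode utf-8` and the largest for `include row_sep`.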
  5. 284km/rcsv, forked from arp/rcsv

    Using the Ruby binding of libcsv with FFI, I made the interface as close to ruby/csv as possible.
  6. # Motivation # Concern

  7. # Motivation

    - CSV is often used
    - Sometimes I work with large CSV files
  8. # Concern

    - oj (a fast JSON parser and Object marshaller as a Ruby gem)
    - Demand (few effective use cases?)
    - Will the performance gain be too small for the cost?

    => Hmm, let's do it.
  9. # CSV RFC 4180

  10. # CSV 1. Each record is located on a separate line, delimited by a line break (CRLF).

    aaa,bbb,ccc CRLF
    zzz,yyy,xxx CRLF
  11. # CSV 2. The last record in the file may or may not have an ending line break.

    aaa,bbb,ccc CRLF
    zzz,yyy,xxx
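Rules 1 and 2 can be checked against ruby/csv from the standard library. The sketch below assumes default options, where the row separator is auto-detected; the constant names are illustrative.

```ruby
require "csv"

WITH_TRAILING_BREAK    = "aaa,bbb,ccc\r\nzzz,yyy,xxx\r\n"
WITHOUT_TRAILING_BREAK = "aaa,bbb,ccc\r\nzzz,yyy,xxx"

# Each CRLF-delimited line becomes one record (an Array of fields).
PARSED_WITH    = CSV.parse(WITH_TRAILING_BREAK)
PARSED_WITHOUT = CSV.parse(WITHOUT_TRAILING_BREAK)

p PARSED_WITH
p PARSED_WITH == PARSED_WITHOUT
```

Both inputs yield the same two records, so the trailing line break is optional for the parser as well.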
  12. # CSV 3. There may be an optional header line appearing as the first line of the file. This header should contain the same number of fields as the records.

    field_name,field_name,field_name CRLF
    aaa,bbb,ccc CRLF
    zzz,yyy,xxx CRLF
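With ruby/csv, the header line is opt-in through the `headers:` option. In this sketch the column names `id`, `name`, and `note` are made up for illustration, replacing the slide's `field_name` placeholder so that the keys stay distinct.

```ruby
require "csv"

# Column names are hypothetical; the slide uses the placeholder "field_name".
HEADER_INPUT = "id,name,note\r\naaa,bbb,ccc\r\nzzz,yyy,xxx\r\n"

# headers: true treats the first line as field names and returns a CSV::Table.
TABLE = CSV.parse(HEADER_INPUT, headers: true)

p TABLE.headers
p TABLE.map(&:to_h)
```

Without `headers: true`, the first line would simply be returned as an ordinary record.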
  13. # CSV 4. … Each line should contain the same number of fields. Spaces are considered part of a field and should not be ignored. The last field in the record must not be followed by a comma.

    aaa,bbb,ccc
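A short sketch of how rule 4 plays out in ruby/csv with default options. The helper `uniform_field_count?` is hypothetical: ruby/csv itself does not reject records of differing length, so such a check has to be done by the caller.

```ruby
require "csv"

# Spaces around a field are kept as part of the field.
SPACED_ROW = CSV.parse_line("aaa, bbb ,ccc")

# A trailing comma introduces one more (empty) field.
TRAILING_COMMA_ROW = CSV.parse_line("aaa,bbb,ccc,")

# Hypothetical helper: true when every record has the same number of fields.
def uniform_field_count?(rows)
  rows.map(&:size).uniq.size <= 1
end

p SPACED_ROW
p TRAILING_COMMA_ROW.size
p uniform_field_count?(CSV.parse("a,b\r\nc\r\n"))
```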
  14. # CSV 5. Each field may or may not be enclosed in double quotes. If fields are not enclosed with double quotes, then double quotes may not appear inside the fields.

    "aaa","bbb","ccc" CRLF
    zzz,yyy,xxx
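Rule 5 in ruby/csv terms, assuming default (strict) options: quoting is optional, and a bare double quote inside an unquoted field is rejected. The helper `well_formed?` is hypothetical and only wraps the exception.

```ruby
require "csv"

QUOTED_ROW   = CSV.parse_line(%("aaa","bbb","ccc"))
UNQUOTED_ROW = CSV.parse_line("aaa,bbb,ccc")

# Hypothetical helper: does the line parse under ruby/csv's default settings?
def well_formed?(line)
  CSV.parse_line(line)
  true
rescue CSV::MalformedCSVError
  false
end

p QUOTED_ROW == UNQUOTED_ROW
p well_formed?(%(aaa,b"bb,ccc))
```

ruby/csv also has a `liberal_parsing` option that relaxes this check; it is not used here.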
  15. # CSV 6. Fields containing line breaks (CRLF), double quotes, and commas should be enclosed in double-quotes.

    "aaa","b CRLF
    bb","ccc" CRLF
    zzz,yyy,xxx
  16. # CSV 7. If double-quotes are used to enclose fields, then a double-quote appearing inside a field must be escaped by preceding it with another double quote.

    "aaa","b""bb","ccc"
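The doubled-quote escape works in both directions in ruby/csv. A minimal sketch with default options; the constant names are illustrative.

```ruby
require "csv"

# Parsing: "" inside a quoted field becomes a single double quote.
UNESCAPED_FIELDS = CSV.parse_line(%("aaa","b""bb","ccc"))

# Generating: a field containing a double quote is quoted and the quote doubled.
ESCAPED_LINE = CSV.generate_line(["aaa", %(b"bb), "ccc"])

p UNESCAPED_FIELDS
puts ESCAPED_LINE
```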
  17. # CSV The ABNF grammar is given in the RFC.
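As an illustration of how compact the grammar is, the RFC 4180 ABNF rules can be transcribed almost one-to-one into Ruby regular expressions. This is a validation sketch only, not how any of the parsers discussed here work; it follows the RFC's ASCII-only `TEXTDATA`, so it rejects input that real-world parsers accept, and the constant names are chosen to mirror the rule names.

```ruby
# TEXTDATA = %x20-21 / %x23-2B / %x2D-7E  (printable ASCII except '"' and ',')
TEXTDATA    = /[\x20-\x21\x23-\x2B\x2D-\x7E]/
# non-escaped = *TEXTDATA
NON_ESCAPED = /#{TEXTDATA}*/
# escaped = DQUOTE *(TEXTDATA / COMMA / CR / LF / 2DQUOTE) DQUOTE
ESCAPED     = /"(?:#{TEXTDATA}|,|\r|\n|"")*"/
# field = (escaped / non-escaped)
FIELD       = /(?:#{ESCAPED}|#{NON_ESCAPED})/
# record = field *(COMMA field)
RECORD      = /#{FIELD}(?:,#{FIELD})*/
# file = [header CRLF] record *(CRLF record) [CRLF]
# (a header has the same shape as a record, so it needs no separate pattern)
CSV_FILE    = /\A#{RECORD}(?:\r\n#{RECORD})*(?:\r\n)?\z/

p CSV_FILE.match?("aaa,bbb,ccc\r\nzzz,yyy,xxx\r\n")
p CSV_FILE.match?(%(aaa,b"bb,ccc))
```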

  18. # CSV libraries

    - ruby/csv
    - FasterCSV
    - fastest-csv
    - rcsv
  19. # FasterCSV JEG2/faster_csv # ruby/csv https://docs.ruby-lang.org/ja/latest/library/csv.html

    This library was designed to address the concerns of its users. It has three main goals:

    - Be considerably faster than the original CSV library while remaining pure Ruby
    - Keep a small, easy-to-maintain code base (FasterCSV has grown quite large and feature-rich, but the parsing core remains quite small)
    - Improve the CSV interface
  20. # ruby/csv

    As of 2018/05/03: 250 commits. I read them all.
    Introducing some characteristic commits.
  21. # ruby/csv

    - 1998/1/16: the first commit appears (but the CSV code itself does not exist yet)
    - 2003/6/19: the csv module was imported by nahi?
    - 2003/9/3: the test directory is created.

    test/csv/bom.csv      |    2 +
    test/csv/test_csv.rb  | 1510 ++++++++++++++++++++++++++++++++++++++++++++
    test/csv/tmp/.keep_me |    0
  22. # ruby/csv

    - 2007-12-24 23:41 jeg2 o * lib/csv.rb, test/csv/test_csv.rb: Removed in preparation for
      All of the code disappears, apparently in preparation for bringing in FasterCSV.
    - 2007-12-25 02:46 jeg2 o * lib/csv.rb: Import the FasterCSV source as the new CSV class.
      FasterCSV comes in.
  23. # ruby/csv

    2008-09-21 00:39 jeg2 o * lib/csv/csv.rb: Reworked CSV's parser and generator to be m17

    Many things changed substantially, and m17n support also appears to have been done around this time.
    In 2009 and 2010, many of the commits are related to encoding.

    2012-11-14 02:53 zzak o * lib/csv.rb (init_comments): Document private method #init_comm
    2012-09-19 22:07 zzak o * lib/csv.rb (Object#CSV, Array#to_csv, String#parse_csv): Exa

    One characteristic of 2012 is that zzak wrote documentation for CSV.
  24. # ruby/csv

    2017-04-24 17:38 SHIBATA Hiroshi o <v0.0.1> Enabled travis
    2017-04-24 17:37 SHIBATA Hiroshi o Enabled tests used by test suite of ruby core
    2017-04-24 17:25 SHIBATA Hiroshi o Update basically configuration for gemspec
    2017-04-24 17:16 SHIBATA Hiroshi o Update BSDL license.
    2017-04-24 17:15 SHIBATA Hiroshi o Update repository name
    2017-04-24 17:15 SHIBATA Hiroshi o Removed needless skelton files
    2017-04-24 15:43 SHIBATA Hiroshi o overrided boilerplate by bundle init cmath

    ruby/csv was born thanks to Shibata-san.
  25. # ruby/csv 2018

    Recently, Sutou-san became the maintainer. The code appears to be getting cleaned up.

    2018-03-06 09:34 Kenta Murata o Describe our attitude to RuboCop

    ### NOTE: About RuboCop
    We don't use RuboCop because we can manage our coding style by ourselves. We want to accept small fluctuations in our coding style because we use Ruby. Please do not submit issues and PRs that aim to introduce RuboCop in this repository.
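  26. # fastest_csv

    - The name certainly sounds fast
    - The parser is implemented in C
    - The implementation itself is simple
    - Few features are implemented; for example, it does not support parsing headers at all.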
  27. # fastest_csv # benchmark

    Comparison: fastest_csv vs ruby/csv
      (fastest) unquoted:         210.3 i/s
      (fastest) include col_sep:  182.2 i/s - same-ish: difference falls within error
      (fastest) quoted:           177.1 i/s - same-ish: difference falls within error
      (fastest) encode sjis:      160.5 i/s - 1.31x slower
      (fastest) encode utf-8:     156.0 i/s - 1.35x slower
      unquoted:                    24.4 i/s - 8.62x slower
      encode sjis:                 22.5 i/s - 9.34x slower
      encode utf-8:                19.8 i/s - 10.61x slower
      quoted:                      15.7 i/s - 13.41x slower
      include col_sep:              8.1 i/s - 25.82x slower
      include row_sep:              4.6 i/s - 45.25x slower
  28. # rcsv arp/rcsv

    A fast libcsv-based CSV parser for Ruby

    It covers fewer features than ruby/csv, but considerably more than fastest_csv.
  29. ## How did I start

    - If I have a very fast parser, will I win?
    - Is there still room for improvement?
    - Would fastest-csv become practical if it had a full set of features?
  31. ## How did I start

    - Indeed it is fast
    - It is difficult to have flexibility (such as adapting to an optional specification).
  32. ## How did I start

    What I thought next:

    - Write a part of ruby/csv in C
    - A homegrown ("OreOre") CSV implementation
    - Can I make good use of libcsv?
  33. ## How did I start

    - I judged that a libcsv-based implementation behind the ruby/csv interface is a realistic goal worth aiming for
    - There is rcsv (libcsv-based CSV parser).
    - Match its interface to ruby/csv. In other words, the goal is a state where it passes ruby/csv's tests
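  34. ## In summary

    - A fast CSV parser && maintainability
    - libcsv
    - The ease of use and ease of maintenance of ruby/csv
    - Conforming to its interface and tests

    In practice, I wrote a libcsv wrapper with FFI and made the interface on the side that uses it match ruby/csv.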
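The goal described here is to pass ruby/csv's own test suite. As a much smaller sketch of the same idea, a candidate parser can be compared against ruby/csv on a few inputs. `NAIVE_PARSER`, `SAMPLES`, and `incompatible_samples` are hypothetical names; the naive parser stands in for an implementation under development and is deliberately incomplete.

```ruby
require "csv"

SAMPLES = [
  "aaa,bbb,ccc\r\nzzz,yyy,xxx\r\n",
  %("aaa","b""bb","ccc"\r\n),
  %("aaa","b\r\nbb","ccc"\r\n),
].freeze

# Hypothetical stand-in for a parser under development: no quote handling.
NAIVE_PARSER = ->(data) { data.split("\r\n").map { |line| line.split(",") } }

# Returns the samples on which the candidate disagrees with ruby/csv.
def incompatible_samples(candidate)
  SAMPLES.reject { |data| candidate.call(data) == CSV.parse(data) }
end

puts "naive parser fails on #{incompatible_samples(NAIVE_PARSER).size} of #{SAMPLES.size} samples"
```

Using ruby/csv as the reference keeps the expected values out of the test data, at the cost of inheriting any quirks of the reference.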
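  35. # Pain points of developing on top of rcsv

    - Its structure differs greatly from ruby/csv, whose interface I want to match.
    - It is not very object-oriented, e.g. whether there is a row object
    - The option names for the same behavior differ from ruby/csv.
    - Doing optional parsing behavior with libcsv takes a fair amount of work to write, e.g. "aaa", " foo "
    - Many cases need an intermediate layer to provide compatibility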
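One small piece of such an intermediate layer is translating option names. The sketch below is an assumption-heavy illustration: the left-hand names are ruby/csv options, while the right-hand names (`column_separator`, `header`) are placeholders that have not been checked against rcsv's actual options, and `translate_options` is a hypothetical helper.

```ruby
# Left: ruby/csv option names. Right: assumed names on the wrapped parser
# (illustrative only; not verified against rcsv).
OPTION_MAP = {
  col_sep: :column_separator,
  headers: :header,
}.freeze

# Translates ruby/csv-style options; raises on options the layer does not know.
def translate_options(options)
  options.each_with_object({}) do |(name, value), translated|
    target = OPTION_MAP.fetch(name) do
      raise ArgumentError, "unsupported option: #{name}"
    end
    translated[target] = value
  end
end

p translate_options({ col_sep: "\t", headers: true })
```

Failing loudly on unknown options makes gaps in compatibility visible instead of silently ignoring them.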
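  36. # Things I am worried about

    What about C extensions? Even if replacing the code makes it faster, won't readability suffer and fewer people be able to maintain it? There is no point in overthinking it, so I am building it anyway, but I would like to hear from anyone with experience.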

  37. ## test-unit

    Fixing things bit by bit in the gems test-unit depends on: they did not work on Ruby 2.5, or produced warnings when running the tests.
  38. ## Not done yet

    - Memory usage should also be compared
    - ruby/csv encoding handling
    - Escape sequence handling
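For the memory comparison, one cheap starting point is counting object allocations around a parse call with `GC.stat`. This measures allocation counts rather than bytes, and the helper `allocated_objects` and the input size are assumptions made for this sketch.

```ruby
require "csv"

# Hypothetical helper: number of objects allocated while the block runs.
def allocated_objects
  before = GC.stat(:total_allocated_objects)
  yield
  GC.stat(:total_allocated_objects) - before
end

CSV_TEXT = ("aaa,bbb,ccc\r\n" * 1_000).freeze # assumed input size

puts "ruby/csv allocated #{allocated_objects { CSV.parse(CSV_TEXT) }} objects"
```

Running the same block with another parser's entry point gives a like-for-like count; a byte-level comparison would need a dedicated profiler.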

  39. # Continue development

    - libcsv based CSV parser
    - ruby/csv compatible