Super fast CSV parser

RejectKaigi2018

秒速284km

May 12, 2018
Transcript

  1. Super fast CSV
    parser
    @284km

  2. benchmark/parse.rb
    # 284km/rcsv
    Calculating -------------------------------------
    unquoted 166.007 (±12.7%) i/s - 810.000 in 5.007349s
    quoted 146.088 (±24.6%) i/s - 656.000 in 5.009174s
    include col_sep 131.046 (±28.2%) i/s - 580.000 in 5.001424s
    include row_sep 138.830 (±18.7%) i/s - 666.000 in 5.054874s
    encode utf-8 100.167 (±26.0%) i/s - 448.000 in 5.576945s
    encode sjis 137.429 (±18.2%) i/s - 660.000 in 5.028713s
    =========================================================
    # ruby/csv
    Calculating -------------------------------------
    unquoted 37.546 (±21.3%) i/s - 177.000 in 5.066859s
    quoted 16.773 (±23.8%) i/s - 78.000 in 5.026788s
    include col_sep 8.316 (±24.0%) i/s - 39.000 in 5.113550s
    include row_sep 1.842 (±54.3%) i/s - 9.000 in 5.422059s
    encode utf-8 26.126 (±15.3%) i/s - 126.000 in 5.055306s
    encode sjis 29.573 (±16.9%) i/s - 142.000 in 5.028898s
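
The figures above are iterations per second. As a rough illustration of how such a number can be measured, here is a minimal sketch that uses only Ruby's standard `benchmark` library and ruby/csv. The sample data and the `ips` helper are assumptions made for this example; they are not the contents of the talk's benchmark/parse.rb.

```ruby
require "benchmark"
require "csv"

# Hypothetical sample inputs; the data used in the talk is not shown in the slides.
SAMPLES = {
  "unquoted"        => "aaa,bbb,ccc\n" * 1000,
  "quoted"          => %("aaa","bbb","ccc"\n) * 1000,
  "include col_sep" => %("a,a","b,b","c,c"\n) * 1000,
  "include row_sep" => %("a\na","b\nb","c\nc"\n) * 1000,
}

# Rough iterations-per-second estimate: call the block repeatedly for
# about `seconds` seconds and divide the call count by the elapsed time.
def ips(seconds = 0.2)
  count = 0
  elapsed = Benchmark.realtime do
    deadline = Process.clock_gettime(Process::CLOCK_MONOTONIC) + seconds
    while Process.clock_gettime(Process::CLOCK_MONOTONIC) < deadline
      yield
      count += 1
    end
  end
  count / elapsed
end

RESULTS = SAMPLES.to_h do |label, data|
  [label, ips { CSV.parse(data) }]
end

RESULTS.each { |label, rate| puts format("%-16s %10.3f i/s", label, rate) }
```

Absolute numbers depend heavily on the machine and on the input, so only the ratio between two parsers measured on the same data is meaningful.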

  3. Yesterday's benchmark.
    It still has various problems;
    development is ongoing.

  4. Yesterday's benchmark.
    It still has various problems;
    development is ongoing.
    About 3 times faster

  5. 284km/rcsv
    forked from arp/rcsv
    Using the Ruby binding of libcsv
    with FFI,
    I made the interface as close to ruby/csv
    as possible.

  6. # Motivation
    # Concern

  7. # Motivation
    - CSV is often used
    - Sometimes I use a large CSV

  8. # Concern
    - oj (A fast JSON parser and Object
    marshaller as a Ruby gem.)
    - Demand (few effective use cases?)
    - Performance may not improve enough
    to justify the cost?
    => Hmm, let's do it.

  9. # CSV
    RFC 4180

  10. # CSV
    1. Each record is located on a
    separate line, delimited by a line
    break (CRLF).
    aaa,bbb,ccc CRLF
    zzz,yyy,xxx CRLF

  11. # CSV
    2. The last record in the file may
    or may not have an ending line
    break.
    aaa,bbb,ccc CRLF
    zzz,yyy,xxx
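
Rules 1 and 2 can be checked against ruby/csv with a short sketch like the one below. The constant names are made up for this example.

```ruby
require "csv"

# Same two records, with and without a line break after the last one.
WITH_FINAL_BREAK    = CSV.parse("aaa,bbb,ccc\r\nzzz,yyy,xxx\r\n")
WITHOUT_FINAL_BREAK = CSV.parse("aaa,bbb,ccc\r\nzzz,yyy,xxx")

# Both inputs yield the same two rows.
p WITH_FINAL_BREAK
p WITH_FINAL_BREAK == WITHOUT_FINAL_BREAK
```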

  12. # CSV
    3. There may be an optional header line
    appearing as the first line of the file.
    This header should contain the same
    number of fields as the records.
    field_name,field_name,field_name CRLF
    aaa,bbb,ccc CRLF
    zzz,yyy,xxx CRLF
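
In ruby/csv the header line is opt-in through the `headers: true` option. A minimal sketch follows; the distinct header names are an assumption for the example, since the slide repeats `field_name` three times.

```ruby
require "csv"

# Hypothetical header names, made distinct so each column can be looked up by name.
HEADER_DATA = "field_a,field_b,field_c\r\naaa,bbb,ccc\r\nzzz,yyy,xxx\r\n"

# With headers: true the first line becomes the header and
# the result is a CSV::Table of CSV::Row objects.
TABLE = CSV.parse(HEADER_DATA, headers: true)

p TABLE.headers
p TABLE[0]["field_b"]
p TABLE.map(&:to_h)
```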

  13. # CSV
    4. … Each line should contain the same
    number of fields. Spaces are
    considered part of a field and should not
    be ignored. The last field in the record
    must not be followed by a comma.
    aaa,bbb,ccc
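
A small sketch of how ruby/csv treats the two cases in rule 4, using its default options. The constant names are made up for this example.

```ruby
require "csv"

# Spaces are part of the field, so they are preserved as-is.
SPACED = CSV.parse_line("aaa, bbb ,ccc")

# A comma after the last field is not ignored: it starts a fourth,
# empty field, which ruby/csv represents as nil.
TRAILING_COMMA = CSV.parse_line("aaa,bbb,ccc,")

p SPACED
p TRAILING_COMMA
```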

  14. # CSV
    5. Each field may or may not be enclosed
    in double quotes. If fields are not
    enclosed with double quotes, then
    double quotes may not appear inside the
    fields.
    "aaa","bbb","ccc" CRLF
    zzz,yyy,xxx

  15. # CSV
    6. Fields containing line breaks
    (CRLF), double quotes, and commas
    should be enclosed in double-quotes.
    "aaa","b CRLF
    bb","ccc" CRLF
    zzz,yyy,xxx

  16. # CSV
    7. If double-quotes are used to
    enclose fields, then a double-quote
    appearing inside a field must be
    escaped by preceding it with
    another double quote.
    "aaa","b""bb","ccc"
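
The quoting rules (5 to 7) are where most of the parsing cost comes from, so here is a short sketch of them against ruby/csv. The constant names are made up for this example, and `row_sep` is passed explicitly only to keep the multi-line case unambiguous.

```ruby
require "csv"

# Rule 5: quoted and unquoted fields parse to the same values.
QUOTED   = CSV.parse_line(%("aaa","bbb","ccc"))
UNQUOTED = CSV.parse_line("aaa,bbb,ccc")

# Rule 6: a quoted field may contain a line break.
MULTILINE = CSV.parse(%("aaa","b\r\nbb","ccc"\r\nzzz,yyy,xxx), row_sep: "\r\n")

# Rule 7: a double quote inside a quoted field is written as two double quotes.
ESCAPED = CSV.parse_line(%("aaa","b""bb","ccc"))

# Generation applies the same rules in reverse: only fields that need
# quoting are quoted, and embedded quotes are doubled.
GENERATED = ["aaa", %(b"bb), "c,c"].to_csv

p QUOTED, MULTILINE, ESCAPED, GENERATED
```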

  17. # CSV
    The ABNF grammar
    is given in the RFC

  18. # CSV libraries
    - ruby/csv
    - FasterCSV
    - fastest-csv
    - rcsv

  19. # FasterCSV
    JEG2/faster_csv
    # ruby/csv
    https://docs.ruby-lang.org/ja/latest/library/csv.html
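    This library was designed to address concerns users had.
    It has three main goals:
    - Be significantly faster than the original CSV library while remaining pure Ruby
    - Use a smaller, easier-to-maintain code base (FasterCSV has grown
    considerably larger and richer in features, but the parsing core remains
    quite small)
    - Improve on the CSV interface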

  20. # ruby/csv
    250 commits as of 2018/05/03
    I read them all.
    Here are some of the
    notable commits.

  21. # ruby/csv
    - 1998/1/16: the first commit appears (though the CSV code itself
    does not exist yet)
    - 2003/6/19: the csv module was imported by nahi (?)
    - 2003/9/3: the test directory is created.
    test/csv/bom.csv | 2 +
    test/csv/test_csv.rb | 1510 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
    test/csv/tmp/.keep_me | 0

  22. # ruby/csv
    - 2007-12-24 23:41 jeg2 o * lib/csv.rb, test/csv/
    test_csv.rb: Removed in preparation for
    All the code disappears, apparently in preparation for bringing in FasterCSV.
    - 2007-12-25 02:46 jeg2 o * lib/csv.rb: Import the
    FasterCSV source as the new CSV class.
    FasterCSV comes in.

  23. # ruby/csv
    2008-09-21 00:39 jeg2 o * lib/csv/csv.rb: Reworked CSV's parser
    and generator to be m17
    Many things changed substantially; m17n support also appears to have
    been added around this time.
    In 2009 and 2010, many of the commits are about encoding.
    2012-11-14 02:53 zzak o * lib/csv.rb (init_comments): Document
    private method #init_comm
    2012-09-19 22:07 zzak o * lib/csv.rb (Object#CSV, Array#to_csv,
    String#parse_csv): Exa
    One notable thing about 2012 is that zzak wrote documentation
    for CSV.

  24. # ruby/csv
    2017-04-24 17:38 SHIBATA Hiroshi o Enabled travis
    2017-04-24 17:37 SHIBATA Hiroshi o Enabled tests used by test suite of
    ruby core
    2017-04-24 17:25 SHIBATA Hiroshi o Update basically configuration for
    gemspec
    2017-04-24 17:16 SHIBATA Hiroshi o Update BSDL license.
    2017-04-24 17:15 SHIBATA Hiroshi o Update repository name
    2017-04-24 17:15 SHIBATA Hiroshi o Removed needless skelton files
    2017-04-24 15:43 SHIBATA Hiroshi o overrided boilerplate by bundle
    init cmath
    Thanks to Shibata-san, ruby/csv is born.

  25. # ruby/csv
    2018, recently
    Sutou-san became a maintainer.
    The code cleanup appears to be making progress.
    2018-03-06 09:34 Kenta Murata o Describe our attitude to
    RuboCop
    ### NOTE: About RuboCop
    We don't use RuboCop because we can manage our coding style by ourselves.
    We want to accept small fluctuations in our coding style because we use Ruby.
    Please do not submit issues and PRs that aim to introduce RuboCop in this
    repository.

  26. # fastest_csv
    The name alone sounds fast.
    The parser is implemented in C.
    The implementation itself is simple.
    Few features are implemented; for example, header
    parsing is not supported at all.

  27. # fastest_csv
    # benchmark
    Comparison: fastest_csv vs ruby/csv
    (fastest) unquoted: 210.3 i/s
    (fastest) include col_sep: 182.2 i/s - same-ish: difference falls within error
    (fastest) quoted: 177.1 i/s - same-ish: difference falls within error
    (fastest) encode sjis: 160.5 i/s - 1.31x slower
    (fastest) encode utf-8: 156.0 i/s - 1.35x slower
    unquoted: 24.4 i/s - 8.62x slower
    encode sjis: 22.5 i/s - 9.34x slower
    encode utf-8: 19.8 i/s - 10.61x slower
    quoted: 15.7 i/s - 13.41x slower
    include col_sep: 8.1 i/s - 25.82x slower
    include row_sep: 4.6 i/s - 45.25x slower

  28. # rcsv
    arp/rcsv
    A fast libcsv-based CSV parser for Ruby
    It covers fewer features than ruby/csv, but
    considerably more than fastest_csv.

  29. ## How did I start
    - If I have a very fast parser, will I win?
    - Is there any room left for
    improvement?
    - Would fastest-csv become practical if it
    had a full set of features?

  31. ## How did I start
    - Indeed it is fast
    - Flexibility is hard to achieve
    (such as supporting optional
    behavior).

  32. ## How did I start
    What I considered next:
    - Rewrite part of ruby/csv in C
    - An "OreOre" (homegrown) CSV implementation
    - Can I make good use of libcsv?

  33. ## How did I start
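    - I judged that a libcsv-based implementation behind the
    ruby/csv interface is a realistic goal worth
    pursuing.
    - There is rcsv (a libcsv-based CSV parser).
    - Match its interface to ruby/csv.
    In other words, the goal is to reach a state
    where it passes the ruby/csv tests.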

  34. ## In summary
    - A fast CSV parser && maintainability
    - libcsv
    - The ease of use and ease of maintenance of ruby/csv
    - Conforming to its interface and tests
    In practice, I wrote a libcsv wrapper with FFI and
    made the interface on the calling side match
    ruby/csv.

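  35. # Pain points of developing on top of rcsv
    - Its structure differs greatly from ruby/csv, whose interface
    I want to match.
    - It is not very object-oriented, e.g. whether a row object exists.
    - Options that do the same thing have different names than in ruby/csv.
    - Doing anything optional during parsing with libcsv takes
    a fair amount of work to write, e.g. "aaa", " foo "
    Many cases need an intermediate layer to provide
    compatibility.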
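
To make the idea of an intermediate compatibility layer concrete, here is a minimal sketch. Everything in it is hypothetical: `FakeBackend` is a pure-Ruby stand-in for a libcsv binding (it does no quote handling), its `column_separator` option name is an assumption for illustration, and `CompatCSV` is not the actual 284km/rcsv code. It shows only the two adaptations named above: translating ruby/csv-style option names and wrapping plain arrays in row objects.

```ruby
# Hypothetical low-level backend standing in for a libcsv binding.
# It returns plain arrays and uses its own option name.
module FakeBackend
  def self.parse(data, column_separator: ",")
    data.each_line(chomp: true).map { |line| line.split(column_separator, -1) }
  end
end

# Compatibility layer that exposes ruby/csv-style options on top of the backend.
module CompatCSV
  # ruby/csv option name => hypothetical backend option name
  OPTION_MAP = { col_sep: :column_separator }.freeze

  # Minimal row object, loosely modelled on the idea of CSV::Row.
  class Row
    attr_reader :headers, :fields

    def initialize(headers, fields)
      @headers = headers
      @fields = fields
    end

    # Look a field up by header name; returns nil for unknown headers.
    def [](header)
      index = headers.index(header)
      index && fields[index]
    end

    def to_h
      headers.zip(fields).to_h
    end
  end

  def self.parse(data, headers: false, **options)
    # Unknown options raise KeyError instead of being silently dropped.
    backend_options = options.to_h { |name, value| [OPTION_MAP.fetch(name), value] }
    rows = FakeBackend.parse(data, **backend_options)
    return rows unless headers

    names = rows.shift
    rows.map { |fields| Row.new(names, fields) }
  end
end

rows = CompatCSV.parse("name\tage\nalice\t30\n", headers: true, col_sep: "\t")
p rows.first["name"]
p rows.first.to_h
```

Keeping the translation in one table means the caller-facing names can follow ruby/csv while the backend stays unchanged.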

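  36. # What I am worried about
    What about a C extension?
    Even if replacing the implementation makes it faster,
    won't readability suffer and fewer people be
    able to maintain it?
    Worrying about it gets me nowhere, so I am building it anyway,
    but I would like to hear from anyone with experience.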

  37. ## test-unit
    I have also been making small fixes along the way:
    a gem that test-unit depends on did not work
    on Ruby 2.5, and running the tests
    produced warnings.

  38. ## Not done yet
    Memory usage should be compared too
    ruby/csv
    Encoding handling
    Escape sequence handling
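
As one possible starting point for the memory comparison, the sketch below counts object allocations with `GC.stat`. This is a proxy for memory pressure, not a measurement in bytes. The `allocated_objects` helper and the naive `split` parser used as a second candidate are assumptions made for this example; neither comes from the talk.

```ruby
require "csv"

# Count objects allocated while the block runs.
# GC is disabled during the measurement so a collection does not interfere.
def allocated_objects
  GC.disable
  before = GC.stat(:total_allocated_objects)
  yield
  GC.stat(:total_allocated_objects) - before
ensure
  GC.enable
end

# Hypothetical sample input.
DATA_SAMPLE = "aaa,bbb,ccc\n" * 1_000

ALLOCATIONS = {
  "ruby/csv"    => allocated_objects { CSV.parse(DATA_SAMPLE) },
  # Naive stand-in for a second parser; it does not handle quoting.
  "naive split" => allocated_objects do
    DATA_SAMPLE.each_line(chomp: true).map { |line| line.split(",") }
  end,
}

ALLOCATIONS.each { |label, count| puts format("%-12s %8d objects", label, count) }
```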

  39. # Continue development
    - libcsv based CSV parser
    - ruby/csv compatible
