Slide 1

Slide 1 text

Super fast CSV parser @284km

Slide 2

Slide 2 text

benchmark/parse.rb

# 284km/rcsv
Calculating -------------------------------------
            unquoted    166.007 (±12.7%) i/s -    810.000 in 5.007349s
              quoted    146.088 (±24.6%) i/s -    656.000 in 5.009174s
     include col_sep    131.046 (±28.2%) i/s -    580.000 in 5.001424s
     include row_sep    138.830 (±18.7%) i/s -    666.000 in 5.054874s
        encode utf-8    100.167 (±26.0%) i/s -    448.000 in 5.576945s
         encode sjis    137.429 (±18.2%) i/s -    660.000 in 5.028713s
=========================================================
# ruby/csv
Calculating -------------------------------------
            unquoted     37.546 (±21.3%) i/s -    177.000 in 5.066859s
              quoted     16.773 (±23.8%) i/s -     78.000 in 5.026788s
     include col_sep      8.316 (±24.0%) i/s -     39.000 in 5.113550s
     include row_sep      1.842 (±54.3%) i/s -      9.000 in 5.422059s
        encode utf-8     26.126 (±15.3%) i/s -    126.000 in 5.055306s
         encode sjis     29.573 (±16.9%) i/s -    142.000 in 5.028898s
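The figures above are iterations per second measured over windows of roughly five seconds. A minimal sketch of that kind of measurement follows, using only Ruby's standard library and the csv library. The helper name `iterations_per_second` and the sample data are made up for illustration; this is not the benchmark/parse.rb script itself.

```ruby
require "csv"

# Hypothetical helper: call the block repeatedly for about `seconds`
# and return how many iterations per second were completed.
def iterations_per_second(seconds: 0.2)
  count = 0
  start = Process.clock_gettime(Process::CLOCK_MONOTONIC)
  elapsed = 0.0
  while elapsed < seconds
    yield
    count += 1
    elapsed = Process.clock_gettime(Process::CLOCK_MONOTONIC) - start
  end
  count / elapsed
end

# Made-up sample data: 1,000 unquoted rows and 1,000 quoted rows.
unquoted = "aaa,bbb,ccc\n" * 1_000
quoted   = %("aaa","bbb","ccc"\n) * 1_000

unquoted_ips = iterations_per_second { CSV.parse(unquoted) }
quoted_ips   = iterations_per_second { CSV.parse(quoted) }

puts format("unquoted %10.3f i/s", unquoted_ips)
puts format("quoted   %10.3f i/s", quoted_ips)
```

A short window like 0.2 seconds gives noisy numbers; a longer window and a warm-up pass would be closer to how the results above were presumably obtained.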

Slide 3

Slide 3 text

Yesterday's benchmark. There are still various problems; development is ongoing.

Slide 4

Slide 4 text

Yesterday's benchmark. There are still various problems; development is ongoing.

About 3 times faster

Slide 5

Slide 5 text

284km/rcsv, forked from arp/rcsv

Using a Ruby binding of libcsv via FFI, I made the interface as close to ruby/csv as possible.

Slide 6

Slide 6 text

# Motivation
# Concern

Slide 7

Slide 7 text

# Motivation
- CSV is used often
- I sometimes work with large CSV files

Slide 8

Slide 8 text

# Concern
- oj (A fast JSON parser and Object marshaller as a Ruby gem.)
- Demand (few effective use cases?)
- Will the performance gain be too small to justify the cost?

=> Hmm, let's do it.

Slide 9

Slide 9 text

# CSV
RFC 4180

Slide 10

Slide 10 text

# CSV
1. Each record is located on a separate line, delimited by a line break (CRLF).

aaa,bbb,ccc CRLF
zzz,yyy,xxx CRLF

Slide 11

Slide 11 text

# CSV
2. The last record in the file may or may not have an ending line break.

aaa,bbb,ccc CRLF
zzz,yyy,xxx
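As a small illustration of rules 1 and 2, the sketch below feeds both variants (with and without the final line break) to ruby/csv. The variable names are only for illustration.

```ruby
require "csv"

with_final_break    = "aaa,bbb,ccc\r\nzzz,yyy,xxx\r\n"
without_final_break = "aaa,bbb,ccc\r\nzzz,yyy,xxx"

rows_a = CSV.parse(with_final_break)
rows_b = CSV.parse(without_final_break)

p rows_a            # [["aaa", "bbb", "ccc"], ["zzz", "yyy", "xxx"]]
p rows_a == rows_b  # true
```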

Slide 12

Slide 12 text

# CSV
3. There may be an optional header line appearing as the first line of the file. This header should contain the same number of fields as the records.

field_name,field_name,field_name CRLF
aaa,bbb,ccc CRLF
zzz,yyy,xxx CRLF
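RFC 4180 cannot tell a header line from a data line by content alone, so the caller has to say whether one is present. A minimal sketch with ruby/csv's `headers: true` option; the column names `id`, `name` and `city` are invented for the example.

```ruby
require "csv"

data = "id,name,city\r\n1,aaa,bbb\r\n2,zzz,yyy\r\n"

# With headers: true the first line is treated as field names
# and CSV.parse returns a CSV::Table of CSV::Row objects.
table = CSV.parse(data, headers: true)

p table.headers       # ["id", "name", "city"]
p table[0]["name"]    # "aaa"
p table.map(&:to_h)   # one Hash per record, keyed by header name
```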

Slide 13

Slide 13 text

# CSV
4. … Each line should contain the same number of fields. Spaces are considered part of a field and should not be ignored. The last field in the record must not be followed by a comma.

aaa,bbb,ccc
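A short sketch of what rule 4 means in practice with ruby/csv: surrounding spaces are kept, and a trailing comma is read as one more (empty) field rather than being ignored.

```ruby
require "csv"

# Spaces are part of the field.
spaced = CSV.parse_line("aaa, bbb ,ccc")
p spaced    # ["aaa", " bbb ", "ccc"]

# A trailing comma introduces a fourth, empty field (nil in ruby/csv).
trailing = CSV.parse_line("aaa,bbb,ccc,")
p trailing  # ["aaa", "bbb", "ccc", nil]
```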

Slide 14

Slide 14 text

# CSV
5. Each field may or may not be enclosed in double quotes. If fields are not enclosed with double quotes, then double quotes may not appear inside the fields.

"aaa","bbb","ccc" CRLF
zzz,yyy,xxx

Slide 15

Slide 15 text

# CSV
6. Fields containing line breaks (CRLF), double quotes, and commas should be enclosed in double-quotes.

"aaa","b CRLF
bb","ccc" CRLF
zzz,yyy,xxx

Slide 16

Slide 16 text

# CSV
7. If double-quotes are used to enclose fields, then a double-quote appearing inside a field must be escaped by preceding it with another double quote.

"aaa","b""bb","ccc"
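Rules 5 to 7 together can be checked against ruby/csv in both directions, parsing and generating. The sample strings below are illustrative.

```ruby
require "csv"

# Rule 7: a doubled quote inside a quoted field is one literal quote.
# Rule 6: a comma inside a quoted field does not split the field.
fields = CSV.parse_line(%("aaa","b""bb","c,cc"\r\n))
p fields   # ["aaa", "b\"bb", "c,cc"]

# Rule 6: a line break inside a quoted field does not end the record.
rows = CSV.parse(%("aaa","b\r\nbb","ccc"\r\nzzz,yyy,xxx))
p rows     # [["aaa", "b\r\nbb", "ccc"], ["zzz", "yyy", "xxx"]]

# Generation applies the same rules: only fields that need quotes get them.
line = ["aaa", 'b"bb', "c,cc"].to_csv
p line     # "aaa,\"b\"\"bb\",\"c,cc\"\n"
```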

Slide 17

Slide 17 text

# CSV
The ABNF grammar is given in the RFC.
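To make the grammar concrete, here is a minimal character-by-character parser sketch in pure Ruby. It is an illustration written for this transcript, not the parser used by ruby/csv, libcsv or rcsv; the method name `parse_rfc4180` is made up, and it is deliberately lenient (it accepts a bare LF as a record separator and ignores stray CR outside quotes).

```ruby
# Illustrative only: a tiny state machine following the shape of the
# RFC 4180 grammar (escaped vs. non-escaped fields, doubled quotes).
def parse_rfc4180(text)
  rows = []
  row = []
  field = +""
  quoted = false
  i = 0
  while i < text.length
    ch = text[i]
    if quoted
      if ch == '"' && text[i + 1] == '"'
        field << '"'   # doubled quote: one literal quote
        i += 1
      elsif ch == '"'
        quoted = false # closing quote
      else
        field << ch    # commas and line breaks are data here
      end
    elsif ch == '"'
      quoted = true
    elsif ch == ","
      row << field
      field = +""
    elsif ch == "\n"
      row << field
      rows << row
      row = []
      field = +""
    elsif ch != "\r"
      field << ch
    end
    i += 1
  end
  # The last record may or may not end with a line break.
  rows << (row << field) unless row.empty? && field.empty?
  rows
end

sample = %("aaa","b""bb","c\r\ncc"\r\nzzz,yyy,xxx)
parsed = parse_rfc4180(sample)
p parsed
```

Even this small version shows why optional behavior (custom separators, headers, converters, encodings) quickly makes a parser larger than the grammar suggests.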

Slide 18

Slide 18 text

# CSV libraries
- ruby/csv
- FasterCSV
- fastest-csv
- rcsv

Slide 19

Slide 19 text

# FasterCSV
JEG2/faster_csv

# ruby/csv
https://docs.ruby-lang.org/ja/latest/library/csv.html

This library is designed to address its users' concerns. It has three main goals:
- Be considerably faster than the original CSV library while remaining pure Ruby
- Have a small, easy-to-maintain code base (FasterCSV has grown considerably larger and richer in features; the parsing code remains quite small)
- Improve the CSV interface

Slide 20

Slide 20 text

# ruby/csv
As of 2018/05/03: 250 commits. I read them all.

Introducing some characteristic commits

Slide 21

Slide 21 text

# ruby/csv
- 1998/1/16: the first commit appears (although the CSV code itself does not exist yet)
- 2003/6/19: the csv module was imported by nahi-san?
- 2003/9/3: the test directory is created.

test/csv/bom.csv      |    2 +
test/csv/test_csv.rb  | 1510 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
test/csv/tmp/.keep_me |    0

Slide 22

Slide 22 text

# ruby/csv
- 2007-12-24 23:41 jeg2 o * lib/csv.rb, test/csv/test_csv.rb: Removed in preparation for
  All the code disappears. This appears to be preparation for bringing in FasterCSV.
- 2007-12-25 02:46 jeg2 o * lib/csv.rb: Import the FasterCSV source as the new CSV class.
  FasterCSV comes in.

Slide 23

Slide 23 text

# ruby/csv
2008-09-21 00:39 jeg2 o * lib/csv/csv.rb: Reworked CSV's parser and generator to be m17
A lot changed significantly, and m17n support also appears to have been added around this time.

In 2009 and 2010, many of the commits relate to encoding.

2012-11-14 02:53 zzak o * lib/csv.rb (init_comments): Document private method #init_comm
2012-09-19 22:07 zzak o * lib/csv.rb (Object#CSV, Array#to_csv, String#parse_csv): Exa
One notable feature of 2012 is that zzak wrote documentation for CSV.

Slide 24

Slide 24 text

# ruby/csv
2017-04-24 17:38 SHIBATA Hiroshi o Enabled travis
2017-04-24 17:37 SHIBATA Hiroshi o Enabled tests used by test suite of ruby core
2017-04-24 17:25 SHIBATA Hiroshi o Update basically configuration for gemspec
2017-04-24 17:16 SHIBATA Hiroshi o Update BSDL license.
2017-04-24 17:15 SHIBATA Hiroshi o Update repository name
2017-04-24 17:15 SHIBATA Hiroshi o Removed needless skelton files
2017-04-24 15:43 SHIBATA Hiroshi o overrided boilerplate by bundle init cmath

Thanks to Shibata-san, ruby/csv was born.

Slide 25

Slide 25 text

# ruby/csv
2018, recently: Sutou-san became the maintainer.
Code cleanup appears to be progressing.

2018-03-06 09:34 Kenta Murata o Describe our attitude to RuboCop

### NOTE: About RuboCop
We don't use RuboCop because we can manage our coding style by ourselves. We want to accept small fluctuations in our coding style because we use Ruby. Please do not submit issues and PRs that aim to introduce RuboCop in this repository.

Slide 26

Slide 26 text

# fastest_csv
The name alone sounds fast.
The parser is implemented in C.
The implementation itself is simple.
Few features are implemented. For example, it does not support parsing headers at all.

Slide 27

Slide 27 text

# fastest_csv
# benchmark
Comparison: fastest_csv vs ruby/csv
         (fastest) unquoted:      210.3 i/s
  (fastest) include col_sep:      182.2 i/s - same-ish: difference falls within error
           (fastest) quoted:      177.1 i/s - same-ish: difference falls within error
      (fastest) encode sjis:      160.5 i/s - 1.31x slower
     (fastest) encode utf-8:      156.0 i/s - 1.35x slower
                   unquoted:       24.4 i/s - 8.62x slower
                encode sjis:       22.5 i/s - 9.34x slower
               encode utf-8:       19.8 i/s - 10.61x slower
                     quoted:       15.7 i/s - 13.41x slower
            include col_sep:        8.1 i/s - 25.82x slower
            include row_sep:        4.6 i/s - 45.25x slower

Slide 28

Slide 28 text

# rcsv
arp/rcsv
A fast libcsv-based CSV parser for Ruby

It covers fewer features than ruby/csv, but considerably more than fastest_csv.

Slide 29

Slide 29 text

## How did I start
- If I have a very fast parser, will I win?
- Is there any room left for improvement?
- Would fastest-csv become practical if it had a full set of features?

Slide 30

Slide 30 text

## How did I start
- If I have a very fast parser, will I win?
- Is there any room left for improvement?
- Would fastest-csv become practical if it had a full set of features?

Slide 31

Slide 31 text

## How did I start
- It is indeed fast
- It is hard to make it flexible (for example, adapting to optional parts of the specification).

Slide 32

Slide 32 text

## How did I start
What I thought next:
- Rewrite part of ruby/csv in C
- A homegrown ("OreOre") CSV implementation
- Can I make good use of libcsv?

Slide 33

Slide 33 text

## How did I start
- I judged that a libcsv-based implementation with the ruby/csv interface is a realistic goal worth pursuing
- There is rcsv (a libcsv-based CSV parser).
- Match its interface to ruby/csv. In other words, the goal is to reach a state where it passes ruby/csv's tests

Slide 34

Slide 34 text

## To summarize
- Fast CSV parser && maintainability
  - libcsv
- ruby/csv's ease of use and ease of maintenance
  - conform to its interface and tests

In practice, I wrote a libcsv wrapper with FFI and made the interface on the calling side match ruby/csv.
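The layering described here can be sketched in pure Ruby. Everything below is hypothetical: `LowLevelParser` merely stands in for a libcsv binding (it splits on commas and does not handle quotes), and `CompatCSV` is an invented name for the ruby/csv-style facade, not the actual 284km/rcsv code.

```ruby
# Stand-in for the low-level layer. A real binding would call into
# libcsv; this placeholder just yields each row as an Array of Strings.
module LowLevelParser
  def self.each_row(text)
    text.split(/\r?\n/).each { |line| yield line.split(",", -1) }
  end
end

# Facade that exposes a ruby/csv-like interface on top of it.
module CompatCSV
  # Minimal row object, loosely modelled on CSV::Row.
  class Row
    attr_reader :headers, :fields

    def initialize(headers, fields)
      @headers = headers
      @fields = fields
    end

    def [](key)
      index = key.is_a?(Integer) ? key : headers.index(key)
      index && fields[index]
    end

    def to_h
      headers.zip(fields).to_h
    end
  end

  def self.parse(text, headers: false)
    rows = []
    header_row = nil
    LowLevelParser.each_row(text) do |fields|
      if headers && header_row.nil?
        header_row = fields              # first row supplies field names
      elsif headers
        rows << Row.new(header_row, fields)
      else
        rows << fields
      end
    end
    rows
  end
end

records = CompatCSV.parse("id,name\n1,aaa\n2,zzz\n", headers: true)
p records[0]["name"]
p records[1].to_h
```

The point of the split is that the fast layer only needs to produce arrays of strings, while headers, row objects and option handling live in Ruby where ruby/csv's tests can drive them.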

Slide 35

Slide 35 text

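# Pain points of developing on top of rcsv
- Its structure differs greatly from ruby/csv, whose interface I want to match.
  - It is not very object-oriented, e.g. whether there is a row object
- Options that do the same thing have different names than in ruby/csv.
- Doing optional parsing behavior with libcsv takes a fair amount of work to write
  "aaa", " foo " and so on

Many cases need an intermediate layer to provide compatibility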
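One piece of such an intermediate layer is translating option names. A minimal sketch follows; the ruby/csv names (`col_sep`, `headers`) are real, but the backend names and values (`column_separator`, `header`, `:use`, `:none`) are assumptions chosen for illustration and should be checked against the rcsv README before relying on them.

```ruby
# Hypothetical translation table: ruby/csv option name => lambda that
# returns the backend's [name, value] pair. Backend names are assumed.
OPTION_TRANSLATIONS = {
  col_sep: ->(value) { [:column_separator, value] },
  headers: ->(value) { [:header, value ? :use : :none] }
}.freeze

def translate_options(csv_style_options)
  csv_style_options.to_h do |name, value|
    translator = OPTION_TRANSLATIONS.fetch(name) do
      raise ArgumentError, "unsupported option: #{name.inspect}"
    end
    translator.call(value)
  end
end

backend_options = translate_options(col_sep: "\t", headers: true)
p backend_options
```

Raising on unknown options keeps silent incompatibilities from creeping in while the compatibility layer is still incomplete.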

Slide 36

Slide 36 text

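# Things I am worried about
Is a C extension really a good idea?

Even if replacing the implementation makes it faster, won't readability suffer, and won't fewer people be able to maintain it?

Worrying about it gets me nowhere, so I am building it anyway, but I would like to hear from anyone with insight.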

Slide 37

Slide 37 text

## test-unit
Along the way I made small fixes, bit by bit, to test-unit's dependency gems, which did not work on Ruby 2.5 or emitted warnings when the tests were run.

Slide 38

Slide 38 text

## Not done yet
Memory usage should also be compared

ruby/csv
Encoding handling
Escape sequence handling
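For the memory comparison, one rough starting point is counting allocated objects around a parse call. This is a sketch under assumptions: the helper name `allocated_objects` is invented, the naive `split` baseline is only there to have something to compare against, and object counts are a proxy for memory use, not a measurement of bytes.

```ruby
require "csv"

# Hypothetical helper: number of Ruby objects allocated while the
# block runs, read from the GC's cumulative allocation counter.
def allocated_objects
  before = GC.stat(:total_allocated_objects)
  yield
  GC.stat(:total_allocated_objects) - before
end

data = "aaa,bbb,ccc\n" * 1_000

csv_allocations   = allocated_objects { CSV.parse(data) }
split_allocations = allocated_objects { data.each_line.map { |line| line.chomp.split(",") } }

puts "CSV.parse:   #{csv_allocations} objects"
puts "naive split: #{split_allocations} objects"
```

The same helper could wrap a libcsv-based parser call to put the two implementations side by side.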

Slide 39

Slide 39 text

# Continue development
- libcsv based CSV parser
- ruby/csv compatible