Implementing a Lightweight Markup Language Parser in Go / Gohn: parser written in Go

aereal
August 04, 2017

Talk given at builderscon tokyo 2017

Transcript

  1. Implementing a lightweight markup language parser in Go
     id:aereal @ builderscon tokyo 2017

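  2. What I'll talk about
     • Lightweight markup languages and Hatena notation
     • Text processing and why parser generators are needed
     • Introducing a Hatena notation parser built with Go/goyacc
     • Practical know-how for goyacc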

  3. About me
     • id:aereal
     • GitHub: aereal
     • Application engineer at Hatena Co., Ltd.

  4. ⚠ Disclaimer ⚠
     • This is private work, driven by issues I personally noticed while working on our services
     • Whether it will be adopted in any service is unknown

  5. References
     • http://b.hatena.ne.jp/aereal/2017gokyoto/
     • Bookmarked under a tag on Hatena Bookmark

  6. Lightweight markup languages and Hatena notation

  7. What is a lightweight markup language?
     • LML = Lightweight Markup Language
     • Sits between HTML/XML and plain text
     • Markdown, Textile, Hatena notation, etc.
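  9. What is Hatena notation?
     • An LML usable in several services provided by Hatena
     • Hatena Blog, Hatena Diary, etc.
     • A convenient notation that is converted to HTML
     • Its grammar is somewhat similar to org-mode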

  10. * Heading 1
      ** Heading 2
      [http://127.0.0.1/:title=This is my IP]
      - Ruby
      - Perl
      - Go
      + Introduction
      + Development
      + Twist
      + Conclusion
  11. <h1>Heading 1</h1>
      <h2>Heading 2</h2>
      <p>
        <a href="http://127.0.0.1/">This is my IP</a>
      </p>
      <ul>
        <li>Ruby</li>
        <li>Perl</li>
        <li>Go</li>
      </ul>
      <ol>
        <li>Introduction</li>
        <li>Development</li>
        <li>Twist</li>
        <li>Conclusion</li>
      </ol>
  12. Various implementations
      • Hatena Blog, Hatena Diary, Hatena Group
      • Text-Hatena (CPAN)
      • Text-Xatena (CPAN)
      • chris4403/WikiTextConverter
      • motemen/pandoc
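  13. Various implementations
      • Specification ≈ implementation
      • There are many implementations
      • In other words:
      • There are as many specifications as there are implementations
      • To learn the specification you have to read Perl and regular expressions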
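  14. No handy implementation, which is a problem
      • I want to use Hatena notation in applications written in something other than Perl, but...
      • The existing code leans heavily on Perl's extended regular expressions, so porting is hard
      • Many parsers go all the way to HTML conversion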

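  15. Portability
      • You'd want things like a live preview in the browser, right?
      • I want to define only the output (AST) for a given input
      • I want to write it in Perl, Go, Scala, JavaScript, and others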

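  16. I don't want to go as far as HTML conversion
      • Many parser implementations go all the way to HTML conversion
      • On the other hand, how much HTML is allowed in the input differs per service (!= per parser implementation)
      • → I want to separate the parser from HTML conversion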
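  17. The Hatena notation parser I want
      • A plain implementation that can serve as a reference
      • = one that does not try too hard to solve everything with regular expressions
      • Parsing yields an intermediate representation rather than HTML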

  18. Summed up in three lines
      • Give me an AST!!!

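  19. Coming up next
      • Isn't there some nice text-processing tooling that doesn't depend on a particular language?
      • Excluding (extended) regular expressions, though
      • Off to look for promising text-processing techniques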

  20. Text processing and parser generators

  21. A tour of text processing
      • An introduction to various text-processing techniques
      • In some cases you don't even need to write a parser

  22. Token position
      "id:aereal".substring(3) // => "aereal"

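  23. Token position
      • If tokens appear at fixed-length positions, this much is enough
      • It breaks down once lengths vary
      • And most grammars consist almost entirely of variable-length tokens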

  24. Regular expressions
      "id:aereal".match(/id:(.+)/)[1] // => "aereal"

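  25. Regular expressions
      • Cannot detect unbalanced brackets
      • (Impossible with POSIX regular expressions; it should be possible with Perl's extended regular expressions)
      • If you can't fish the token out in one shot, you need the state management described next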
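
A minimal Go sketch of the regular-expression approach on slides 24 and 25. The function name `extractID` is hypothetical. Go's `regexp` package uses RE2 syntax, which has no recursion or backreferences, so a pattern like this can pull out the name after `id:` but cannot verify that brackets are balanced.

```go
package main

import (
	"fmt"
	"regexp"
)

// idPattern captures everything after the "id:" prefix.
var idPattern = regexp.MustCompile(`id:(.+)`)

// extractID returns the captured name and whether the input matched.
// The name extractID is an assumption made for this sketch.
func extractID(s string) (string, bool) {
	m := idPattern.FindStringSubmatch(s)
	if m == nil {
		return "", false
	}
	return m[1], true
}

func main() {
	name, ok := extractID("id:aereal")
	fmt.Println(name, ok) // prints: aereal true
}
```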

  26. Managing state transitions
      var isInIdNotation = false;
      while (1) {
        if (isInIdNotation) {
          var name = readText(); // => "aereal"
        } else {
          switch (readChar()) {
            case ':':
              isInIdNotation = true;
              break; // avoid falling through to default
            default:
              // ...
          }
        }
      }
  27. Managing state transitions
      var isInIdNotation = false;
      var isInHeading = false;
      var isInUnorderedList = false;
      var isInOrderedList = false;
      while (1) {
        if (isInIdNotation) // ...
        if (isInHeading) // ...
        if (isInUnorderedList) // ...
        if (isInOrderedList) // ...
      }
  28. (image-only slide)
  29. • All of these make it hard to get an overview of the grammar
      • Modularization is difficult
      • → If only it could be built by stacking up small parts...

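  30. Enter yacc
      • One of the parser generators
      • Generates a parser from BNF-like syntax rules
      • Builds up one rule by combining multiple rules
      • Converts rules into program code in a callback style (reduction, "reduce")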
  31. https://tools.ietf.org/html/rfc7230
      HTTP-Message = start-line
                     *( header-field CRLF )
                     CRLF
                     [ message-body ]
      start-line   = request-line / status-line
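  32. yacc
      • Being able to describe the grammar in an abstract form, BNF, is a plus
      • More portable than an internal DSL
      • The lexer (lexical analyzer) has to be implemented separately
      • The callback part of a syntax rule is not highlighted in editors (there is probably a good way around this)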
  33. Coming up next
      • We have learned that yacc looks promising
      • Can Go and yacc be combined?
      • Can we actually build a Hatena notation parser?

  34. https://git.io/v7gcD
      github.com/aereal/gohn

  35. gohn
      • Written in Go w/goyacc
      • Pronounced like `gone`
      • The main notations are already implemented
  36. gohn's design
      • Reads Hatena notation from standard input,
      • and writes the AST, serialized as JSON, to standard output
      • → HTML conversion is implemented separately
      • Very UNIX-like

  37. AST
      • Serialized as JSON
      • A JSON Schema is published
      • An HTML converter could probably be generated automatically from the schema
      • https://github.com/aereal/gohn/blob/master/schema.json
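
To make the "AST as JSON" idea concrete, here is a rough Go sketch. The `Node` type and its field names are assumptions made up for illustration; gohn's actual node shape is whatever its published schema.json defines.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Node is a hypothetical AST node, NOT gohn's real schema.
type Node struct {
	Type     string  `json:"type"`
	Level    int     `json:"level,omitempty"`
	Text     string  `json:"text,omitempty"`
	Children []*Node `json:"children,omitempty"`
}

// encode serializes an AST to a JSON string.
func encode(n *Node) (string, error) {
	b, err := json.Marshal(n)
	if err != nil {
		return "", err
	}
	return string(b), nil
}

func main() {
	doc := &Node{Type: "document", Children: []*Node{
		{Type: "heading", Level: 1, Text: "Heading 1"},
	}}
	out, err := encode(doc)
	if err != nil {
		panic(err)
	}
	fmt.Println(out)
}
```

Because the output is plain JSON, a separate program in any language can read it and decide how to render HTML, which is the separation the design aims for.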

  38. Go and yacc
      • There is a tool called goyacc
      • go get golang.org/x/tools/cmd/goyacc
      • Actions can be written in Go

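  39. Go and lexical analysis
      • Lexical analysis = returning what the characters just read mean
      • The standard package text/scanner is handy
      • Its behavior can be customized
      • It records the position of the last consumed token, which makes building error messages easy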
  40. Demo

  41. Advanced topics

  42. HTTP notation
      [http://example.com/]
      # <a href="http://example.com/">
      #   http://example.com/
      # </a>
      [http://127.0.0.1/:title=My IP]
      # <a href="http://127.0.0.1/">
      #   My IP
      # </a>
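  43. HTTP notation
      • A notation that is converted to an anchor link
      • Expansion options can be written at the end, following a `:`
      • `:` can also appear as part of the URL
      • → Reading only the next character is not enough to tell whether :title starts here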
  44. Decided to treat the first `:` as part of the scheme and ignore it
      if !l.seenColon {
          l.seenColon = true
          return false // maybe part of URL
      } else {
          return true
      }
      https://github.com/aereal/gohn/blob/master/parser/lex.go#L100
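
The "first colon belongs to the scheme" rule can be sketched in isolation as follows. The `lexer` struct, `isOptionStart`, and `splitHTTPNotation` are hypothetical names for this illustration; only the `seenColon` flag and its branching come from the slide.

```go
package main

import "fmt"

// lexer is a hypothetical, stripped-down stand-in for real lexer state.
type lexer struct {
	seenColon bool
}

// isOptionStart reports whether the colon just read begins an option.
// The first colon is assumed to belong to the URL scheme ("http:").
func (l *lexer) isOptionStart() bool {
	if !l.seenColon {
		l.seenColon = true
		return false // maybe part of URL
	}
	return true
}

// splitHTTPNotation splits the text inside [...] into URL and option.
func splitHTTPNotation(body string) (url, option string) {
	l := &lexer{}
	for i, r := range body {
		if r == ':' && l.isOptionStart() {
			return body[:i], body[i+1:]
		}
	}
	return body, ""
}

func main() {
	url, opt := splitHTTPNotation("http://127.0.0.1/:title=My IP")
	fmt.Printf("%q %q\n", url, opt) // prints: "http://127.0.0.1/" "title=My IP"
}
```

This sketch would misread a URL that contains a second colon of its own, such as an explicit port number, so a fuller lexer needs more context than a single flag.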
  45. Recursive rules
      • How to write a rule made up of N > 1 child rules
      • Just be careful not to get the order of append wrong

  46. http_options:
          http_option
          { $$ = []string{$1} }
        | http_option http_options
          {
            options := $2
            $$ = append([]string{$1}, options...)
          }
  47. Testing
      • Table-driven tests are recommended
      • https://github.com/golang/go/wiki/TableDrivenTests
      • I think functional tests of the parser, lexer included, are enough
      • https://github.com/aereal/gohn/blob/master/parser/parser_test.go#L17
  48. Debugging
      • It is handy to define a method that looks up a token's name (string) from its identifier (int)
      • Whether you print or use a debugger
      • https://github.com/aereal/gohn/blob/master/parser/lex.go#L29
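
One possible shape for such a reverse lookup is sketched below. The token constants and their values are invented for this example; in a real goyacc project the constants come from the generated parser.

```go
package main

import "fmt"

// Hypothetical token identifiers standing in for generated constants.
const (
	TEXT = iota + 1000
	HEADING
	LBRACKET
)

// tokenNames maps each identifier back to a readable name.
var tokenNames = map[int]string{
	TEXT:     "TEXT",
	HEADING:  "HEADING",
	LBRACKET: "LBRACKET",
}

// tokenName returns the name for tok, or a placeholder for unknown values.
func tokenName(tok int) string {
	if name, ok := tokenNames[tok]; ok {
		return name
	}
	return fmt.Sprintf("UNKNOWN(%d)", tok)
}

func main() {
	fmt.Println(tokenName(HEADING)) // prints: HEADING
	fmt.Println(tokenName(42))      // prints: UNKNOWN(42)
}
```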
  49. Summary

  50. Go/goyacc is handy
      • Go is well suited to building complex CLIs in a portable way
      • goyacc (yacc) is well suited to parsers for complex grammars

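  51. Lightweight markup languages are hard
      • What is easy for humans to read and write differs from what is easy for machines to read and write
      • Might a parser that corrects errors be more practical than one that enforces strict grammar rules?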

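  52. Building parsers is fun
      • A goal that nicely balances what I can do, what I want to do, and what I'm interested in
      • Web and text processing
      • Small goals can be stacked up little by little
      • "Today I got list notation implemented!"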

  53. For those who got interested
      • Writing a JSON parser seems like a good first step
      • Extend it toward the RFC, relaxed JSON, etc.
      • Next, try hand-writing a lexer
      • After that, try hand-writing the parser too
  54. Fin