Everybody Stand Back! I Know Regular Expressions

Transcript

  1. None
  2. @listochkin CTO at Viravix

  3. None
  4. xkcd 2005

  5. None
  6. None
  7. None
  8. None
  9. None
  10. A language for text manipulations

  11. Regular Expressions

  12. None
  13. None
  14. Used them to describe math of events in neural networks

  15. SDS 940: 1966 - QED; 1968 - Mother of All Demos; 1969 - 1st ARPANET IMP node
  16. 1966 - Ken Thompson, QED; 1986 - Henry Spencer, regex; 1997 - Philip Hazel, PCRE
  17. None
  18. /(.*)?\/\/.+[^z]{1,}\1(\b\w+)\s?/

  19. None
  20. None
  21. None
  22. None
  23. None
  24. None
  25. Quiz Time!

  26. s = s[4] === 's' ? s.substr(8) : s.substr(7);
    const iQ = s.indexOf('?');
    const iH = s.indexOf('#');
    const iS = s.indexOf('/');
    if (Math.min(iQ, iH, iS) > 0) {
      return s.substr(0, Math.min(iQ, iH, iS));
    } else {
      return s;
    }
  27. /http[s]?:\/\/(?<domain>[^\?#\/]+)/ .exec(s).groups.domain
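
The regex on slide 27 can be wrapped in a small function, as a rough sketch. The names `DOMAIN_RE` and `extractDomain` and the `null` fallback are assumptions added here, not part of the talk. The pattern is the slide's, written slightly shorter (`https?` instead of `http[s]?`, and no escapes inside the character class).

```javascript
// Hypothetical helper (not from the slides) around the slide-27 pattern.
const DOMAIN_RE = /https?:\/\/(?<domain>[^?#/]+)/;

function extractDomain(url) {
  const match = DOMAIN_RE.exec(url);
  // exec() returns null when nothing matches, so guard before reading groups.
  return match ? match.groups.domain : null;
}

console.log(extractDomain('https://example.com/path?q=1#top')); // example.com
console.log(extractDomain('http://example.com'));               // example.com
console.log(extractDomain('not a url'));                        // null
```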

  28. Math

  29. + - * /

  30. Trigonometry Complex numbers Integration Derivatives

  31. Text

  32. substr(3, 5) indexOf('a')

  33. Regular expressions are like higher-level math for text

  34. How do regexes work?

  35. SQL CSS selectors File glob patterns

  36. DSL

  37. Special VM

  38. Statically / JIT compiled

  39. Often fast even for “slow” languages

  40. 2008 JavaScriptCore Regex JIT

  41. $("form.add-product button[type=submit]") .on("click", function (e) { ... })

  42. /abc/

  43. function* enumerate(iterable) {
      let i = 0;
      for (const x of iterable) {
        yield [i, x];
        i++;
      }
    }
  44. function matchAbc(string) {
      for (const [startPos] of enumerate(string)) {
        let matchPos = startPos;
        if (string[matchPos] !== 'a') continue;
        matchPos += 1;
        if (string[matchPos] !== 'b') continue;
        matchPos += 1;
        if (string[matchPos] !== 'c') continue;
        return true;
      }
      return false;
    }
  45. None
  46. ababc

  47. ababc

  48. ababc

  49. ababc

  50. function matchAbc(string) {
      for (const [startPos] of enumerate(string)) {
        let matchPos = startPos;
        if (string[matchPos] !== 'a') continue;
        matchPos += 1;
        if (string[matchPos] !== 'b') continue;
        matchPos += 1;
        if (string[matchPos] !== 'c') continue;
        return true;
      }
      return false;
    }
  51. ababc

  52. ababc
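
The walk over "ababc" can be reproduced with a small variation of `matchAbc`. This is only a sketch: `findAbc` and its `attempts` array are hypothetical additions that record which start positions were tried, and the function returns the match position instead of a boolean.

```javascript
// Yields [index, item] pairs, like the enumerate helper on slide 43.
function* enumerate(iterable) {
  let i = 0;
  for (const x of iterable) {
    yield [i, x];
    i++;
  }
}

// Same loop as matchAbc, but it returns the start position of the match
// (or -1) and logs every start position it tried into `attempts`.
function findAbc(string, attempts = []) {
  for (const [startPos] of enumerate(string)) {
    attempts.push(startPos);
    let matchPos = startPos;
    if (string[matchPos] !== 'a') continue;
    matchPos += 1;
    if (string[matchPos] !== 'b') continue;
    matchPos += 1;
    if (string[matchPos] !== 'c') continue;
    return startPos;
  }
  return -1;
}

const attempts = [];
console.log(findAbc('ababc', attempts)); // 2
console.log(attempts);                   // [ 0, 1, 2 ]
```

The first attempt gets as far as "ab" before failing on the second "a"; the matcher then simply restarts one character later, which is the "not smart but predictable" behaviour.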

  53. Not smart but predictable

  54. Very good for hardware

  55. Language

  56. Every character is an instruction

  57. Most instructions mean “Match this exact character, move forward by one character if matched”
  58. /abc|abx/

  59. None
  60. None
  61. Longer to compile Faster to execute

  62. /anteater|antelope|ant/ An ant encountered an anteater

  63. None
  64. /anteater|antelope|ant/ An ant encountered an anteater

  65. None
  66. Put most likely alternative to the left
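
In a backtracking engine such as JavaScript's, alternatives are tried left to right and the first one that succeeds wins, so order can change the result as well as the speed. A minimal sketch using the sentence from the slides; the reordered pattern is an assumption added for contrast.

```javascript
const text = 'An ant encountered an anteater';

// Order from the slide: longer words first, so "anteater" is matched whole.
const longFirst = text.match(/anteater|antelope|ant/g);

// Reordered for contrast (not from the slides): "ant" succeeds first at
// every position, so the longer alternatives are never reached.
const shortFirst = text.match(/ant|anteater|antelope/g);

console.log(longFirst);  // [ 'ant', 'anteater' ]
console.log(shortFirst); // [ 'ant', 'ant' ]
```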

  67. /x*z/

  68. None
  69. /".*"/

  70. I watched "Home Alone" and "Back to the Future" last week
  71. I watched "Home Alone" and "Back to the Future" last week
  72. I watched "Home Alone" and "Back to the Future" last week
  73. I watched "Home Alone" and "Back to the Future" last week
  74. I watched "Home Alone" and "Back to the Future" last week
  75. I watched "Home Alone" and "Back to the Future" last week
  76. I watched "Home Alone" and "Back to the Future" last week
  77. I watched "Home Alone" and "Back to the Future" last week
  78. I watched "Home Alone" and "Back to the Future" last week
  79. /x*z/

  80. /x*?z/

  81. None
  82. None
  83. /".*"/

  84. /".*?"/

  85. I watched "Home Alone" and "Back to the Future" last week
  86. I watched "Home Alone" and "Back to the Future" last week
  87. I watched "Home Alone" and "Back to the Future" last week
  88. I watched "Home Alone" and "Back to the Future" last week
  89. I watched "Home Alone" and "Back to the Future" last week
  90. I watched "Home Alone" and "Back to the Future" last week
  91. I watched "Home Alone" and "Back to the Future" last week
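
The difference between the greedy and the lazy quantifier can be checked directly. A small sketch, assuming the sample sentence is written with straight double quotes; the variable names are illustrative.

```javascript
// Sample sentence from the slides, with straight double quotes (an assumption).
const sentence = 'I watched "Home Alone" and "Back to the Future" last week';

// Greedy: .* runs to the end of the line, then backs up to the LAST quote.
const greedy = sentence.match(/".*"/)[0];

// Lazy: .*? grows one character at a time and stops at the FIRST closing quote.
const lazy = sentence.match(/".*?"/)[0];

// With the g flag the lazy pattern finds each title separately.
const allTitles = sentence.match(/".*?"/g);

console.log(greedy);    // "Home Alone" and "Back to the Future"
console.log(lazy);      // "Home Alone"
console.log(allTitles); // [ '"Home Alone"', '"Back to the Future"' ]
```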
  92. How to regex

  93. DSL

  94. Treat regexes as code

  95. Step 1. Formatting

  96. function (s){function*b(a){let b=0;for(const c of a)yield[b,c],b++}for(const[c] of b(s)){let a=c;if("a"===s[a]&&(a+=1,"b"===s[a]&&(a+=1,"c"===s[a])))return!0}return!1}
  97. Use extended formatting!

  98. /(\d{4})-(\d{2})-(\d{2})/

  99. /(?x)
      # ISO date format
      \d {4}
      -
      \d {2}
      -
      \d {2}
    /
  100. None
  101. None
  102. function r(re, flags = 'u') {
      return new RegExp(
        re
          .replace(/^(?<regex>.*)(?<!\\)#.*$/gm, '$<regex>')
          .replace(/\\#/gm, '#')
          .replace(/\s/gm, ''),
        flags
      );
    }
  103. r(String.raw`
      # ISO date format
      \d {4}
      -
      \d {2}
      -
      \d {2}
    `)
  104. Step 2. Naming

  105. /http[s]?:\/\/(?<domain>[^\?#\/]+)/ .exec(s).groups.domain

  106. /http[s]?:\/\/([^\?#\/]+)/ .exec(s)[1]

  107. function r(re, flags = 'u') {
      return new RegExp(
        re
          .replace(/^(?<regex>.*)(?<!\\)#.*$/gm, '$<regex>')
          .replace(/\\#/gm, '#')
          .replace(/\s/gm, ''),
        flags
      );
    }
  108. function r(re, flags = 'u') {
      return new RegExp(
        re
          .replace(/^(.*)(?<!\\)#.*$/gm, '$1')
          .replace(/\\#/gm, '#')
          .replace(/\s/gm, ''),
        flags
      );
    }
  109. r(String.raw`
      # ISO date format
      (?<year> \d {4})
      -
      (?<month> \d {2})
      -
      (?<day> \d {2})
    `)
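
Put together, the `r` helper and the named groups can be run as below. This is a sketch that assumes an engine with named groups and lookbehind (ES2018 or later); `isoDate` and the sample date are illustrative names and values, not from the talk.

```javascript
// Helper from the slides: drop "# comments", unescape "\#", strip whitespace.
function r(re, flags = 'u') {
  return new RegExp(
    re
      .replace(/^(?<regex>.*)(?<!\\)#.*$/gm, '$<regex>')
      .replace(/\\#/gm, '#')
      .replace(/\s/gm, ''),
    flags
  );
}

const isoDate = r(String.raw`
  # ISO date format
  (?<year> \d {4})
  -
  (?<month> \d {2})
  -
  (?<day> \d {2})
`);

// The compiled pattern no longer contains the comment or the spacing.
console.log(isoDate.source); // (?<year>\d{4})-(?<month>\d{2})-(?<day>\d{2})

const { year, month, day } = isoDate.exec('2020-09-11').groups;
console.log(year, month, day); // 2020 09 11
```

Because the helper removes every whitespace character, a literal space in the pattern has to be written as `\s` or `\x20`.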
  110. /http[s]?:\/\/(?<domain>[^\?#\/]+)/ .exec(s).groups.domain

  111. r(String.raw`
      http[s]? ://
      (?<domain> [^?\#/]+ )
      # rest of the url
    `).exec(s).groups.domain
  112. Step 3. Structure

  113. Perl & PCRE define

  114. /(?x)
      (?&sign)? (?&mantissa) (?&exponent)?
      (?(DEFINE)
        (?<sign> [+-] )
        (?<mantissa> \d++\.?+\d*+ | \. \d++ )
        (?<exponent> [eE] (?&sign)?+ \d++ )
      )
    /
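
JavaScript's `RegExp` has neither `(?(DEFINE)...)` nor possessive quantifiers, so the slide above cannot be copied over as is. One possible approximation, shown only as a sketch, is to build the pattern from smaller regexes through their `.source` strings. The variable names mirror the slide; the `^`/`$` anchors and the plain (non-possessive) quantifiers are assumptions of this example.

```javascript
// Named building blocks, mirroring the DEFINE section of the PCRE pattern.
const sign = /[+-]/;
const mantissa = /\d+\.?\d*|\.\d+/;
const exponent = new RegExp(`[eE](?:${sign.source})?\\d+`);

// Compose the pieces; each one is wrapped in (?:...) so that the alternation
// inside `mantissa` cannot leak into its neighbours.
const numberRe = new RegExp(
  `^(?:${sign.source})?(?:${mantissa.source})(?:${exponent.source})?$`
);

console.log(numberRe.test('-1.5e+10')); // true
console.log(numberRe.test('.5'));       // true
console.log(numberRe.test('1e'));       // false
```

Unlike `(?&name)` calls, this is plain string substitution, so it cannot express recursion.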
  115. Yes, even recursion

  116. /(?x)
      (?(DEFINE)
        (?<object> \{ (?&nvp) (?: , (?&nvp) )* \} )
        (?<nvp> (?&name) : (?&value) )
        (?<value> (?&number) | (?&boolean) | (?&string) | (?&object) | (?&array) )
        ...
      )
      (?&value)
    /
  117. JSON parsing Regex

  118. use PPR; my $perl_block = qr{ (?&PerlBlock) $PPR::GRAMMAR }x;

  119. 1860 lines

  120. Mind-blowing

  121. None
  122. None
  123. Raku grammars

  124. /
      :my regex sign { <[+-]> }
      :my regex mantissa { \d+: '.'?: \d*: | '.' \d+: }
      :my regex exponent { <[eE]> <sign>?: \d+: }
      <sign>? <mantissa> <exponent>?
    /
  125. function r(re, flags = 'u') {
      return new RegExp(
        re
          .replace(/^(?<regex>.*)(?<!\\)#.*$/gm, '$<regex>')
          .replace(/\\#/gm, '#')
          .replace(/\s/gm, ''),
        flags
      );
    }
  126. [2020-09-11T15:57:20.848Z - 099fe80d-7e11-4c47-9bbf-2372bd6b2527: 108ms] 200 OK GET /configuration-tree/8 <- http://web.app.ui/ 178.133.180.67 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:82.0) Gecko/20100101 Firefox/82.0' {"id":1,"email":"andrei.listochkin@viravix.com",...}
  127. [Time in UTC - Request ID: Response Time]
      [2020-09-11T15:57:20.848Z - 099fe80d-7e11-4c47-9bbf-2372bd6b2527: 108ms]
    Status Code and Message, Request Method and Path <- Referrer URL, User IP
      200 OK GET /configuration-tree/8 <- http://web.app.ui/ 178.133.180.67
    User Agent
      'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:82.0) Gecko/20100101 Firefox/82.0'
    Session Token
      {"id":1,"email":"andrei.listochkin@viravix.com",...}
  128. ^
      \[
      (?<requestDate>2020-0[789]\S+)
      \s - \s
      (\S+) # request ID
      : \s
      (?<requestDuration>\d+)ms
      \]
      \s .+ \s
      (?<session>\S+)
      (?<!null) # non-anonymous sessions only
      $
  129. const match = lineRe.exec(logLine);
    if (match) {
      const { requestDate, requestDuration, session } = match.groups;
      const requestData = {
        date: new Date(requestDate),
        duration: parseInt(requestDuration, 10),
        email: JSON.parse(session).email
      };
    }
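
The log-parsing slides can be combined into one runnable sketch. Several things here are assumptions rather than the talk's code: the closing bracket is escaped as `\]` (an unescaped `]` is a syntax error under the `u` flag), the named groups are read from `match.groups`, `parseLogLine` is a hypothetical wrapper, and the session JSON in the sample lines is made up because the slide elides it.

```javascript
// Helper from the slides: drop "# comments", unescape "\#", strip whitespace.
function r(re, flags = 'u') {
  return new RegExp(
    re
      .replace(/^(?<regex>.*)(?<!\\)#.*$/gm, '$<regex>')
      .replace(/\\#/gm, '#')
      .replace(/\s/gm, ''),
    flags
  );
}

const lineRe = r(String.raw`
  ^
  \[
  (?<requestDate> 2020-0[789]\S+ )
  \s - \s
  (\S+)                        # request ID
  : \s
  (?<requestDuration> \d+ ) ms
  \]
  \s .+ \s
  (?<session> \S+ )
  (?<!null)                    # non-anonymous sessions only
  $
`);

// Hypothetical wrapper: returns the parsed fields, or null if the line is skipped.
function parseLogLine(logLine) {
  const match = lineRe.exec(logLine);
  if (!match) return null;
  const { requestDate, requestDuration, session } = match.groups;
  return {
    date: new Date(requestDate),
    duration: parseInt(requestDuration, 10),
    email: JSON.parse(session).email,
  };
}

// Sample lines modelled on the slide; the session payload is invented.
const prefix = `[2020-09-11T15:57:20.848Z - 099fe80d-7e11-4c47-9bbf-2372bd6b2527: 108ms] 200 OK GET /configuration-tree/8 <- http://web.app.ui/ 178.133.180.67 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:82.0) Gecko/20100101 Firefox/82.0'`;
const signedIn = parseLogLine(`${prefix} {"id":1,"email":"user@example.com"}`);
const anonymous = parseLogLine(`${prefix} null`);

console.log(signedIn.duration, signedIn.email); // 108 user@example.com
console.log(anonymous);                         // null
```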
  130. Regexes are awesome.

  131. Üñiçødé aware
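
In JavaScript, Unicode awareness is opt-in through the `u` flag. The checks below are a small sketch of what that flag changes; the variable names and the emoji sample are illustrative assumptions, not from the slides.

```javascript
// Normalise to NFC so every accented letter is a single code point.
const word = 'Üñiçødé'.normalize('NFC');

// \w only covers ASCII letters, digits and underscore.
const asciiOnly = /^\w+$/.test(word);

// With the u flag, \p{L} matches any Unicode letter.
const unicodeAware = /^\p{L}+$/u.test(word);

// Without u, "." sees one UTF-16 code unit; with u, one whole code point.
const dotWithoutU = /^.$/.test('😀');
const dotWithU = /^.$/u.test('😀');

console.log(asciiOnly, unicodeAware); // false true
console.log(dotWithoutU, dotWithU);   // false true
```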

  132. Advanced “math” on text

  133. Declarative

  134. Maintainable

  135. None
  136. Your best friend when it comes to text processing

  137. None
  138. None