Upgrade to Pro — share decks privately, control downloads, hide ads and more …

Finding the Longest Common Substring & Gestalt Pattern Matching with SQL & PL/SQL

D7b6e701f0155fc189bbca6c89223b3c?s=47 Chris
October 20, 2020

Finding the Longest Common Substring & Gestalt Pattern Matching with SQL & PL/SQL

A deep-dive into solving the longest common substring problem and Ratcliff-Obershelp similarity algorithm using SQL & PL/SQL.

Discusses the algorithm to do this and the performance of the different solutions.

D7b6e701f0155fc189bbca6c89223b3c?s=128

Chris

October 20, 2020
Tweet

Transcript

  1. Your SQL Office Hours begins soon… Longest Common Substring Gestalt

    Pattern Matching Chris Saxon @ChrisRSaxon & @SQLDaily https://blogs.oracle.com/sql https://www.youtube.com/c/TheMagicofSQL Stew Ashton @StewAshton https://stewashton.wordpress.com/
  2. Copyright © 2020, Oracle and/or its affiliates. All rights reserved.

    | Welcome to Ask TOM Office Hours!
  3. Copyright © 2020, Oracle and/or its affiliates. All rights reserved.

    | Help me find the longest common substring* for two words *should be subsequence
  4. Copyright © 2020, Oracle and/or its affiliates. All rights reserved.

    | Find the Longest Common Substring WIKIMEDIA REMEDYA
  5. Copyright © 2020, Oracle and/or its affiliates. All rights reserved.

    | Find the Longest Common Substring WIKIMEDIA REMEDYA MED
  6. Copyright © 2020, Oracle and/or its affiliates. All rights reserved.

    | LCS solution : concept Given two strings, A and B : • LEN is the potential substring length, starting from the length of B and decrementing to 1 • For each value of LEN, try all possible starting points B_STRT • Test whether the substring of B starting at B_STRT for length LEN is within A • Stop at first substring found • Return A starting point, B starting point and substring length LEN B-STRT B = "SQL" 3 1 "SQL" 2 1 "SQ" 2 2 "QL" 1 1 "S" 1 2 "Q" 1 3 "L"
  7. Copyright © 2020, Oracle and/or its affiliates. All rights reserved.

    | a) lcs solution in SQL with input(a, b) as ( select 'antidisestablishmentarianism', 'disentanglement' from dual ) select instr(a, substr(b,b_strt,len)) a_strt, b_strt, len, substr(b, b_strt, len) lcs from input i, lateral( -- calculates LEN select length(i.b) + 1 - level len from dual connect by length(i.b) + 1 - level > 0 ) s, lateral( -- calculates B_STRT select level b_strt from dual connect by level + s.len - 1 <= length(i.b) ) where instr(a, substr(b,b_strt,len)) > 0 and rownum = 1; A_STRT B_STRT LEN LCS ------ ------ --- ---- 5 1 4 dise
  8. Copyright © 2020, Oracle and/or its affiliates. All rights reserved.

    | SQL implementation: imperfections • Assumes the lateral clauses generate rows in order of decreasing length • Compare to the reply "Another recursive variant", where Iudith Mentzel specifically orders the rows before returning the first one. • Starts with length of B, even when A is shorter.
  9. Copyright © 2020, Oracle and/or its affiliates. All rights reserved.

    | Was SQL the right choice? For • Can be done using hierarchical queries • Uses "fast dual" so no additional data access Against • Assumes "procedural" implementation: process in order, stop at first hit. • It's just a calculation with no data access other than the input strings • How will it work as part of the Pattern Recognition solution?
  10. Copyright © 2020, Oracle and/or its affiliates. All rights reserved.

    | Pattern Recognition: concept 1. Get lcs of A and B: starting points, length of lcs 2. Add lcs length of left sides & right sides; 3. Continue recursively until no match; 4. Multiply cumulative length by 2, divide by sum of lengths of A and B. A_STRT B_STRT LEN cumul level 1 level 2 level 3 4 4 2 2 OU aFbOUcRd iFjOUkRl A_STRT B_STRT LEN cumul level 1 level 2 level 3 4 4 2 2 OU aFbOUcRd iFjOUkRl 2 2 1 3 F aFb iFj A_STRT B_STRT LEN cumul level 1 level 2 level 3 4 4 2 2 OU aFbOUcRd iFjOUkRl 2 2 1 3 F aFb iFj 0 3 a i 0 3 b j A_STRT B_STRT LEN cumul level 1 level 2 level 3 4 4 2 2 OU aFbOUcRd iFjOUkRl 2 2 1 3 F aFb iFj 0 3 a i 0 3 b j 2 2 1 4 R cRd kRl A_STRT B_STRT LEN cumul level 1 level 2 level 3 4 4 2 2 OU aFbOUcRd iFjOUkRl 2 2 1 3 F aFb iFj 0 3 a i 0 3 b j 2 2 1 4 R cRd kRl 0 4 c k 0 4 d l
  11. Copyright © 2020, Oracle and/or its affiliates. All rights reserved.

    | b) Gestalt solution in PL/SQL, lcs uses SQL create or replace function gestalt( l in varchar2, r in varchar2 ) return number is function lcs ( l in varchar2, r in varchar2 ) return number is ll varchar2(4000); rl varchar2(4000); lr varchar2(4000); rr varchar2(4000); begin for rec in (<SELECT lcs data>) loop ll := substr(l, 1, rec.l_strt - 1); rl := substr(r, 1, rec.r_strt - 1); lr := substr(l, rec.l_strt + rec.len); rr := substr(r, rec.r_strt + rec.len); return rec.len + lcs(ll, rl) + lcs(lr, rr); end loop; return 0; end lcs; begin return 2 * lcs(l, r) / ( length(l) + length(r) ); end gestalt; FOR loop not correct here, use SELECT INTO as in Iudith Mentzel's reply
  12. Copyright © 2020, Oracle and/or its affiliates. All rights reserved.

    | Solution b) drawbacks PL/SQL Runtime Engine SQL Engine PL/SQL Statement Executor PL/SQL Block SQL Statement Executor 1 2 3 4
  13. Copyright © 2020, Oracle and/or its affiliates. All rights reserved.

    | Solution b) drawbacks PL/SQL Runtime Engine SQL Engine PL/SQL Statement Executor PL/SQL Block SQL Statement Executor 1 2 3 4 5 6 SCN 1234 SCN 4567
  14. Copyright © 2020, Oracle and/or its affiliates. All rights reserved.

    | c) Gestalt solution in PL/SQL, including for lcs function lcs ( l in varchar2, r in varchar2 ) return number is l_start pls_integer := 0; r_start pls_integer := 0; lr_length pls_integer := least(length(l),length(r)); ll varchar2(128); rl varchar2(128); lr varchar2(128); rr varchar2(128); begin if l is null or r is null then return 0; end if; <<len_loop>> loop l_start := 1; loop r_start:=instr(r,substr(l,l_start,lr_length)); exit len_loop when r_start > 0; exit when l_start = length(l)-lr_length+1; l_start := l_start + 1; end loop; exit when lr_length = 1; lr_length := lr_length - 1; end loop; if r_start = 0 then return 0; else <snipped> return lr_length + lcs(ll, rl) + lcs(lr, rr); end if; end lcs;
  15. Copyright © 2020, Oracle and/or its affiliates. All rights reserved.

    | Solution c) advantages • No row generation, no fake data access (DUAL) • Starting length is lesser of two string lengths • Procedural order: • Test lengths from longest to shortest • Test starting point from the left • Stop at first find • Minimal context switches • Result: performance
  16. Copyright © 2020, Oracle and/or its affiliates. All rights reserved.

    | 1 167 164 325 0 50 100 150 200 250 300 350 PL/SQL PL/SQL + SQL SQL with PL/SQL Recursive SQL SQL model Time in s Time to process ~57,000 pairs 1,980
  17. Copyright © 2020, Oracle and/or its affiliates. All rights reserved.

    | Why is SQL so Slooooooowwwwww? • Processing many rows – ( n2 + n ) / 2 • Set logic => no exit early guarantees • lateral join => nested loops
  18. Copyright © 2020, Oracle and/or its affiliates. All rights reserved.

    | Further Reading Ask TOM Question https://asktom.oracle.com/pls/apex/asktom.search? tag=longest-common-substring-ratcliff-obershelp- similarity-algorithm On Cursors, Context Switches, and Mistakes https://blogs.oracle.com/oraclemagazine/on- cursors-context-switches-and-mistakes
  19. Copyright © 2020, Oracle and/or its affiliates. All rights reserved.

    | Ryan McGuire / Gratisography See you soon! asktom.oracle.com #AskTOMOfficeHours