Slide 1

Slide 1 text

Your SQL Office Hours begins soon… Longest Common Substring Gestalt Pattern Matching Chris Saxon @ChrisRSaxon & @SQLDaily https://blogs.oracle.com/sql https://www.youtube.com/c/TheMagicofSQL Stew Ashton @StewAshton https://stewashton.wordpress.com/

Slide 2

Slide 2 text

Copyright © 2020, Oracle and/or its affiliates. All rights reserved. | Welcome to Ask TOM Office Hours!

Slide 3

Slide 3 text

Copyright © 2020, Oracle and/or its affiliates. All rights reserved. | Help me find the longest common substring* for two words *should be subsequence

Slide 4

Slide 4 text

Copyright © 2020, Oracle and/or its affiliates. All rights reserved. | Find the Longest Common Substring WIKIMEDIA REMEDYA

Slide 5

Slide 5 text

Copyright © 2020, Oracle and/or its affiliates. All rights reserved. | Find the Longest Common Substring WIKIMEDIA REMEDYA MED

Slide 6

Slide 6 text

Copyright © 2020, Oracle and/or its affiliates. All rights reserved. | LCS solution : concept Given two strings, A and B : • LEN is the potential substring length, starting from the length of B and decrementing to 1 • For each value of LEN, try all possible starting points B_STRT • Test whether the substring of B starting at B_STRT for length LEN is within A • Stop at first substring found • Return A starting point, B starting point and substring length LEN B-STRT B = "SQL" 3 1 "SQL" 2 1 "SQ" 2 2 "QL" 1 1 "S" 1 2 "Q" 1 3 "L"

Slide 7

Slide 7 text

Copyright © 2020, Oracle and/or its affiliates. All rights reserved. | a) lcs solution in SQL with input(a, b) as ( select 'antidisestablishmentarianism', 'disentanglement' from dual ) select instr(a, substr(b,b_strt,len)) a_strt, b_strt, len, substr(b, b_strt, len) lcs from input i, lateral( -- calculates LEN select length(i.b) + 1 - level len from dual connect by length(i.b) + 1 - level > 0 ) s, lateral( -- calculates B_STRT select level b_strt from dual connect by level + s.len - 1 <= length(i.b) ) where instr(a, substr(b,b_strt,len)) > 0 and rownum = 1; A_STRT B_STRT LEN LCS ------ ------ --- ---- 5 1 4 dise

Slide 8

Slide 8 text

Copyright © 2020, Oracle and/or its affiliates. All rights reserved. | SQL implementation: imperfections • Assumes the lateral clauses generate rows in order of decreasing length • Compare to the reply "Another recursive variant", where Iudith Mentzel specifically orders the rows before returning the first one. • Starts with length of B, even when A is shorter.

Slide 9

Slide 9 text

Copyright © 2020, Oracle and/or its affiliates. All rights reserved. | Was SQL the right choice? For • Can be done using hierarchical queries • Uses "fast dual" so no additional data access Against • Assumes "procedural" implementation: process in order, stop at first hit. • It's just a calculation with no data access other than the input strings • How will it work as part of the Pattern Recognition solution?

Slide 10

Slide 10 text

Copyright © 2020, Oracle and/or its affiliates. All rights reserved. | Pattern Recognition: concept 1. Get lcs of A and B: starting points, length of lcs 2. Add lcs length of left sides & right sides; 3. Continue recursively until no match; 4. Multiply cumulative length by 2, divide by sum of lengths of A and B. A_STRT B_STRT LEN cumul level 1 level 2 level 3 4 4 2 2 OU aFbOUcRd iFjOUkRl A_STRT B_STRT LEN cumul level 1 level 2 level 3 4 4 2 2 OU aFbOUcRd iFjOUkRl 2 2 1 3 F aFb iFj A_STRT B_STRT LEN cumul level 1 level 2 level 3 4 4 2 2 OU aFbOUcRd iFjOUkRl 2 2 1 3 F aFb iFj 0 3 a i 0 3 b j A_STRT B_STRT LEN cumul level 1 level 2 level 3 4 4 2 2 OU aFbOUcRd iFjOUkRl 2 2 1 3 F aFb iFj 0 3 a i 0 3 b j 2 2 1 4 R cRd kRl A_STRT B_STRT LEN cumul level 1 level 2 level 3 4 4 2 2 OU aFbOUcRd iFjOUkRl 2 2 1 3 F aFb iFj 0 3 a i 0 3 b j 2 2 1 4 R cRd kRl 0 4 c k 0 4 d l

Slide 11

Slide 11 text

Copyright © 2020, Oracle and/or its affiliates. All rights reserved. | b) Gestalt solution in PL/SQL, lcs uses SQL create or replace function gestalt( l in varchar2, r in varchar2 ) return number is function lcs ( l in varchar2, r in varchar2 ) return number is ll varchar2(4000); rl varchar2(4000); lr varchar2(4000); rr varchar2(4000); begin for rec in () loop ll := substr(l, 1, rec.l_strt - 1); rl := substr(r, 1, rec.r_strt - 1); lr := substr(l, rec.l_strt + rec.len); rr := substr(r, rec.r_strt + rec.len); return rec.len + lcs(ll, rl) + lcs(lr, rr); end loop; return 0; end lcs; begin return 2 * lcs(l, r) / ( length(l) + length(r) ); end gestalt; FOR loop not correct here, use SELECT INTO as in Iudith Mentzel's reply

Slide 12

Slide 12 text

Copyright © 2020, Oracle and/or its affiliates. All rights reserved. | Solution b) drawbacks PL/SQL Runtime Engine SQL Engine PL/SQL Statement Executor PL/SQL Block SQL Statement Executor 1 2 3 4

Slide 13

Slide 13 text

Copyright © 2020, Oracle and/or its affiliates. All rights reserved. | Solution b) drawbacks PL/SQL Runtime Engine SQL Engine PL/SQL Statement Executor PL/SQL Block SQL Statement Executor 1 2 3 4 5 6 SCN 1234 SCN 4567

Slide 14

Slide 14 text

Copyright © 2020, Oracle and/or its affiliates. All rights reserved. | c) Gestalt solution in PL/SQL, including for lcs function lcs ( l in varchar2, r in varchar2 ) return number is l_start pls_integer := 0; r_start pls_integer := 0; lr_length pls_integer := least(length(l),length(r)); ll varchar2(128); rl varchar2(128); lr varchar2(128); rr varchar2(128); begin if l is null or r is null then return 0; end if; <> loop l_start := 1; loop r_start:=instr(r,substr(l,l_start,lr_length)); exit len_loop when r_start > 0; exit when l_start = length(l)-lr_length+1; l_start := l_start + 1; end loop; exit when lr_length = 1; lr_length := lr_length - 1; end loop; if r_start = 0 then return 0; else return lr_length + lcs(ll, rl) + lcs(lr, rr); end if; end lcs;

Slide 15

Slide 15 text

Copyright © 2020, Oracle and/or its affiliates. All rights reserved. | Solution c) advantages • No row generation, no fake data access (DUAL) • Starting length is lesser of two string lengths • Procedural order: • Test lengths from longest to shortest • Test starting point from the left • Stop at first find • Minimal context switches • Result: performance

Slide 16

Slide 16 text

Copyright © 2020, Oracle and/or its affiliates. All rights reserved. | 1 167 164 325 0 50 100 150 200 250 300 350 PL/SQL PL/SQL + SQL SQL with PL/SQL Recursive SQL SQL model Time in s Time to process ~57,000 pairs 1,980

Slide 17

Slide 17 text

Copyright © 2020, Oracle and/or its affiliates. All rights reserved. | Why is SQL so Slooooooowwwwww? • Processing many rows – ( n2 + n ) / 2 • Set logic => no exit early guarantees • lateral join => nested loops

Slide 18

Slide 18 text

Copyright © 2020, Oracle and/or its affiliates. All rights reserved. | Further Reading Ask TOM Question https://asktom.oracle.com/pls/apex/asktom.search? tag=longest-common-substring-ratcliff-obershelp- similarity-algorithm On Cursors, Context Switches, and Mistakes https://blogs.oracle.com/oraclemagazine/on- cursors-context-switches-and-mistakes

Slide 19

Slide 19 text

Copyright © 2020, Oracle and/or its affiliates. All rights reserved. | Ryan McGuire / Gratisography See you soon! asktom.oracle.com #AskTOMOfficeHours