# Finding the Longest Common Substring & Gestalt Pattern Matching with SQL & PL/SQL

A deep-dive into solving the longest common substring problem and Ratcliff-Obershelp similarity algorithm using SQL & PL/SQL.

Discusses the algorithm to do this and the performance of the different solutions.

October 20, 2020

## Transcript

1. ### Your SQL Office Hours begins soon… Longest Common Substring Gestalt

Pattern Matching Chris Saxon @ChrisRSaxon & @SQLDaily https://blogs.oracle.com/sql https://www.youtube.com/c/TheMagicofSQL Stew Ashton @StewAshton https://stewashton.wordpress.com/

| Welcome to Ask TOM Office Hours!

| Help me find the longest common substring* for two words *should be subsequence

| Find the Longest Common Substring WIKIMEDIA REMEDYA

| Find the Longest Common Substring WIKIMEDIA REMEDYA MED

| LCS solution : concept Given two strings, A and B : • LEN is the potential substring length, starting from the length of B and decrementing to 1 • For each value of LEN, try all possible starting points B_STRT • Test whether the substring of B starting at B_STRT for length LEN is within A • Stop at first substring found • Return A starting point, B starting point and substring length LEN B-STRT B = "SQL" 3 1 "SQL" 2 1 "SQ" 2 2 "QL" 1 1 "S" 1 2 "Q" 1 3 "L"

| a) lcs solution in SQL with input(a, b) as ( select 'antidisestablishmentarianism', 'disentanglement' from dual ) select instr(a, substr(b,b_strt,len)) a_strt, b_strt, len, substr(b, b_strt, len) lcs from input i, lateral( -- calculates LEN select length(i.b) + 1 - level len from dual connect by length(i.b) + 1 - level > 0 ) s, lateral( -- calculates B_STRT select level b_strt from dual connect by level + s.len - 1 <= length(i.b) ) where instr(a, substr(b,b_strt,len)) > 0 and rownum = 1; A_STRT B_STRT LEN LCS ------ ------ --- ---- 5 1 4 dise

| SQL implementation: imperfections • Assumes the lateral clauses generate rows in order of decreasing length • Compare to the reply "Another recursive variant", where Iudith Mentzel specifically orders the rows before returning the first one. • Starts with length of B, even when A is shorter.

| Was SQL the right choice? For • Can be done using hierarchical queries • Uses "fast dual" so no additional data access Against • Assumes "procedural" implementation: process in order, stop at first hit. • It's just a calculation with no data access other than the input strings • How will it work as part of the Pattern Recognition solution?

| Pattern Recognition: concept 1. Get lcs of A and B: starting points, length of lcs 2. Add lcs length of left sides & right sides; 3. Continue recursively until no match; 4. Multiply cumulative length by 2, divide by sum of lengths of A and B. A_STRT B_STRT LEN cumul level 1 level 2 level 3 4 4 2 2 OU aFbOUcRd iFjOUkRl A_STRT B_STRT LEN cumul level 1 level 2 level 3 4 4 2 2 OU aFbOUcRd iFjOUkRl 2 2 1 3 F aFb iFj A_STRT B_STRT LEN cumul level 1 level 2 level 3 4 4 2 2 OU aFbOUcRd iFjOUkRl 2 2 1 3 F aFb iFj 0 3 a i 0 3 b j A_STRT B_STRT LEN cumul level 1 level 2 level 3 4 4 2 2 OU aFbOUcRd iFjOUkRl 2 2 1 3 F aFb iFj 0 3 a i 0 3 b j 2 2 1 4 R cRd kRl A_STRT B_STRT LEN cumul level 1 level 2 level 3 4 4 2 2 OU aFbOUcRd iFjOUkRl 2 2 1 3 F aFb iFj 0 3 a i 0 3 b j 2 2 1 4 R cRd kRl 0 4 c k 0 4 d l

| b) Gestalt solution in PL/SQL, lcs uses SQL create or replace function gestalt( l in varchar2, r in varchar2 ) return number is function lcs ( l in varchar2, r in varchar2 ) return number is ll varchar2(4000); rl varchar2(4000); lr varchar2(4000); rr varchar2(4000); begin for rec in (<SELECT lcs data>) loop ll := substr(l, 1, rec.l_strt - 1); rl := substr(r, 1, rec.r_strt - 1); lr := substr(l, rec.l_strt + rec.len); rr := substr(r, rec.r_strt + rec.len); return rec.len + lcs(ll, rl) + lcs(lr, rr); end loop; return 0; end lcs; begin return 2 * lcs(l, r) / ( length(l) + length(r) ); end gestalt; FOR loop not correct here, use SELECT INTO as in Iudith Mentzel's reply

| Solution b) drawbacks PL/SQL Runtime Engine SQL Engine PL/SQL Statement Executor PL/SQL Block SQL Statement Executor 1 2 3 4

| Solution b) drawbacks PL/SQL Runtime Engine SQL Engine PL/SQL Statement Executor PL/SQL Block SQL Statement Executor 1 2 3 4 5 6 SCN 1234 SCN 4567

| c) Gestalt solution in PL/SQL, including for lcs function lcs ( l in varchar2, r in varchar2 ) return number is l_start pls_integer := 0; r_start pls_integer := 0; lr_length pls_integer := least(length(l),length(r)); ll varchar2(128); rl varchar2(128); lr varchar2(128); rr varchar2(128); begin if l is null or r is null then return 0; end if; <<len_loop>> loop l_start := 1; loop r_start:=instr(r,substr(l,l_start,lr_length)); exit len_loop when r_start > 0; exit when l_start = length(l)-lr_length+1; l_start := l_start + 1; end loop; exit when lr_length = 1; lr_length := lr_length - 1; end loop; if r_start = 0 then return 0; else <snipped> return lr_length + lcs(ll, rl) + lcs(lr, rr); end if; end lcs;

| Solution c) advantages • No row generation, no fake data access (DUAL) • Starting length is lesser of two string lengths • Procedural order: • Test lengths from longest to shortest • Test starting point from the left • Stop at first find • Minimal context switches • Result: performance