Talking SQL to Strangers

Talking SQL to strangers Oleksii Kachaiev, @kachayev Principal Engineer @
Riot Games, League of Legends June, 2020

SQL as a Language • old invention, popularized by RDMS
in 70s • a few years of being "not cool" • collateral damage of NoSQL movement? • what do we have now? • databases that never talked SQL (Druid, Cassandra[1] etc) • not even databases (Spark, Flink, Kafka etc) [1] limitations applied

SQL as an API • GraphQL is a query language
made it to be an API • (probably for a wrong reason) • Strata talk comparing GraphQL and SQL • (mostly recognized as a smart trolling) • this talk: how to expose SQL API • (dealing with why as we go)

SQL as an API: Demo • camille-sql • running SQL
queries over local Maven artifacts

Maven Artifacts Tree % tree ~/.m2/repository/ | less /Users/okachaiev/.m2/repository ├──
aleph │ └── aleph │ └── 0.4.7-alpha5 │ ├── _remote.repositories │ ├── aleph-0.4.7-alpha5.jar │ ├── aleph-0.4.7-alpha5.jar.sha1 │ ├── aleph-0.4.7-alpha5.pom │ └── aleph-0.4.7-alpha5.pom.sha1 ├── ant │ └── ant │ └── 1.6.5 │ ├── _remote.repositories │ ├── ant-1.6.5.jar │ ├── ant-1.6.5.jar.sha1 │ ├── ant-1.6.5.pom │ └── ant-1.6.5.pom.sha1

camille-sql: Running Server % bin/camille-server [INFO] Scanning for projects... [INFO]
[INFO] -----------------< org.okachaiev.camille:camille-sql >------------------ [INFO] Building Camille SQL 1.0-SNAPSHOT [INFO] --------------------------------[ jar ]--------------------------------- [INFO] [INFO] >>> exec-maven-plugin:1.2.1:java (default-cli) > validate @ camille-sql >>> [INFO] [INFO] <<< exec-maven-plugin:1.2.1:java (default-cli) < validate @ camille-sql <<< [INFO] [INFO] [INFO] --- exec-maven-plugin:1.2.1:java (default-cli) @ camille-sql --- Artifacts repository path: /Users/okachaiev/.m2/repository/ Running server on localhost:26727

camille-sql: PSQL Client % PGPASSWORD=nopass psql "sslmode=disable host=localhost port=26727" psql
(12.2, server 9.5.0) Type "help" for help. okachaiev=>

Building Blocks • server/client protocol • SQL query lifecycle •
parser • planner • optimizer • execution • bridge between the two

Protocol • the goal: to choose wisely to have clients
• e.g. you have a good client for HTTP: browser • not the case for SQL, databases use different protocols • JDBC makes them look similar in Java • (but we're looking a layer deeper)

Protocol: PGWIRE • originally PostgreSQL • own low-level binary serialization
format • own somewhat documented control ﬂow • why? • clients: psql, SQL IDEs, BI • adoption outside PostgreSQL: Cockroach, Materialize • high performance streaming for large datasets

Sequence Diagram (simpliﬁed)

Protocol: Ignoring Today • SSL handshake (custom) • authentication handshake
• prepared statements & bindings • data copy • cursors, suspensions, cancellation • (and a lot more)

Protocol: Implementation • Netty to run TCP server • custom
Codec to deal with network serialization • sequence of Handlers • mapping between transport and application level logic • Handler to orchestrate main loop: get query, execute, send result set • async i/o is an absolute overkill in this case

Protocol: Example Serialization public ByteBuf toByteBuf(ByteBufAllocator allocator) { final ByteBuf
buf = allocator.buffer(12 + this.totalLength); buf.writeByte('E'); // not included in length buf.writeInt(11 + this.totalLength); // length, self-included buf.writeByte('S'); // magical constant buf.writeBytes(this.severity.getBytes()); buf.writeZero(1); // null-termination buf.writeByte('C'); // magical constant buf.writeBytes(this.code.getBytes()); buf.writeZero(1); // null-termination buf.writeByte('M'); // magical constant buf.writeBytes(this.message.getBytes()); buf.writeZero(2); // double null-termination return buf; }

Life of a Query: Theory • parse • plan (compile)
• optimize • execute

Life of a Query: Practice • leverage Apache Calcite •
not the easiest framework to work with • alternatives: • Catalyst (too much Spark) • ZetaSQL (too young)

Life of a Query: Calcite • parse ← lexer, dialect
• plan (compile) ← catalog (databases, "tables") • optimize ← statistics (CBO), rules folding • execution ← scan deﬁnitions

Life of a Query: Lexer • SQL is standard, right?
well... it depends • Lex is a set of rules that deﬁnes parsing • using MYSQL_ANSI: double quotes, case insensitive • still problematic to deal with • expressions like E"\n" • ; as a query termination (standard but optional)

Life of a Query: Parser • Babel-based parser • Default
Calcite dialect (uppercase, "`" for quoting) SELECT `group_id`, COUNT(*) AS `n_files` FROM `versions` GROUP BY `group_id` ORDER BY `n_files` DESC

Life of a Query: Catalog • schema: tables, functions, types
etc • table: tables, views, temp views etc • table is anything that implements Table interface • declare "statically": • 2 tables • each table deﬁnes own row type • enough information for compilation

Register Tables (Statically) this.resolver = new MavenArtifactsResolver(baseFolder); this.artifactsTable = new
MavenArtifactsTable(this.resolver); this.versionsTable = new MavenArtifactVersionsTable(this.resolver); ... SchemaPlus rootSchema = calciteConnection.getRootSchema(); rootSchema.add("artifacts", artifactsTable); rootSchema.add("versions", versionsTable);

Table Deﬁnes Row Type protected final RelProtoDataType protoRowType = new
RelProtoDataType() { public RelDataType apply(RelDataTypeFactory typeFactory) { return new RelDataTypeFactory.Builder(typeFactory) .add("uid", SqlTypeName.BIGINT) .add("group_id", SqlTypeName.VARCHAR, 1023) .add("artifact_id", SqlTypeName.VARCHAR, 255) .add("name", SqlTypeName.VARCHAR) .add("url", SqlTypeName.VARCHAR) .build(); } };

Life of a Query: Logical Plan select group_id, name from
artifacts ↓ Root { kind: SELECT, rel: rel#5:LogicalProject#5, rowType: RecordType(VARCHAR(1023) group_id, VARCHAR name), fields: [<0, group_id>, <1, name>], } ↓ LogicalProject(group_id=[$1], name=[$3]) LogicalTableScan(table=[[artifacts]])

Life of a Query: Logical Plan select group_id, name from
artifacts where group_id = 'nrepl' order by artifact_id limit 20 ↓ LogicalProject(group_id=[$0], name=[$1]) LogicalSort(sort0=[$2], dir0=[ASC], fetch=[20]) LogicalProject(group_id=[$1], name=[$3], artifact_id=[$2]) LogicalFilter(condition=[=($1, 'nrepl')]) LogicalTableScan(table=[[artifacts]])

Life of a Query: Logical Plan select group_id, count(*) as
n_files from artifacts left join versions on artifacts.uid=versions.uid group by group_id order by n_files desc ↓ LogicalSort(sort0=[$1], dir0=[DESC]) LogicalAggregate(group=[{0}], n_files=[COUNT()]) LogicalProject(group_id=[$1]) LogicalJoin(condition=[=($0, $5)], joinType=[left]) LogicalTableScan(table=[[artifacts]]) LogicalTableScan(table=[[versions]])

Life of a Query: Execution • minimal requirement: deﬁnition of
"full scan" @Override public Enumerable<Object[]> scan(DataContext root) { // better replace this with something interesting :) return Linq4j.emptyEnumerable(); }

Life of a Query: Execution • result is a Linq4j
enumerable (collection or dynamic) • analog of C# LINQ (Language Integrated Queries) • Items.Select(*).Where(Item.Version > 1).Take(5) • a few years later, Spark popularized the idea for Big Data • Linq4j does heavy-lifting (folding relational operators)

Life of a Query: Optimizations • generic optimizations provided by
the framework • cost-based optimizations • opt out by setting Statistics.UNKNOWN • speciﬁcs: • skip calculating ﬁle size (projection pruning) • skip reading checksum (projection pruning) • jump to subfolder (predicate push-down)

Projection Pruning • extremely common technique for big data analytics
• (parquet, hehe) @Override public Enumerable<Object[]> scan( DataContext root, int[] projects) { // now we know what columns are necessary return Linq4j.emptyEnumerable(); }

Predicate Push-Down • Calcite supports basic ﬁlters push-down • analysis
of predicate is... non trivial @Override public Enumerable<Object[]> scan(DataContext root, List<RexNode> filters, int[] projects) { // now we also know predicates return Linq4j.emptyEnumerable(); }

Predicate Push-Down Exact match: select * from artifacts where group_id
= 'nrepl'; Exact preﬁx: select * from artifacts where group_id LIKE 'com.apache.%';

Predicate Push-Down What if... select * from artifacts where group_id
= 'nrepl' OR group_id = 'maven'; select * from artifacts where group_id = 'nrepl' AND group_id = 'maven'; What if... select * from artifacts where LOWER(group_id) + CAST("spark" AS VARCHAR) = 'com.apache.spark';

Predicates • do not need to read the code •
analysis is too rigit and inﬂexible • might be easily "broken" by user or even compiler • imperative code quickly turns into a mess • visitor pattern helps, still :( • pattern matching, pls • logical PL, ideally

Optimization: toRel • table as a relational expression • opt-out
from being a TableScan • e.g. artifacts might be deﬁned as DISTINCT over versions • or, both artifacts and version tables are Project + Filter applied to virtual files table • in both cases, JOIN between them might be folded

Optimization: toRel • nice part of relational algebra: it's an
algebra • hardest part of relational algebra: it's an algebra

Stitch Them Together • leverage Avatica (part of Calcite project)
• standard JDBC interface • registers Calcite as a driver • local & remote connections

HAVING Q and A Oleksii Kachaiev camille-sql

Talking SQL to Strangers

Talking SQL to Strangers

More Decks by Oleksii Kachaiev

Other Decks in Programming

Featured

Transcript