Slide 1

Slide 1 text

© 2019 Frédéric G. MARAND - licensed under a Creative Commons Attribution 4.0 International License. Scaling up and accelerating Drupal 8 with NoSQL Frédéric G. MARAND drupal.org: fgm - irc/twitter: @osinet

Slide 2

Slide 2 text

No content

Slide 3

Slide 3 text

NoSQL

Slide 4

Slide 4 text

Topic ? Simple idea: “No SQL” ● Alternate storage engines: KV, Structures, Document, Graph, Columnar… ● No standard, often no fixed schema, no joins, no FKs ● → Engine-specific application design ● Drupal architecture ? Evolved idea: Not Only SQL ● For engines, add equivalent features to SQL ● For Drupal, combine SQL and NoSQL solutions ● Start from the default SQL-based architecture ● Offload services to non-SQL implementations ○ front-end caches, search engines, queue servers ○ specialized storage: cache, KV, lock, sessions… ● Often involves NoSQL as cache for SQL
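The last bullet, NoSQL as a cache for SQL, is commonly implemented as a cache-aside read path. The following is a minimal sketch under stated assumptions: SQLite stands in for the main SQL database, a plain dict stands in for the key-value store, and the table, key format and function names are hypothetical, not Drupal APIs.

```python
import sqlite3

# Stand-ins (assumptions): SQLite plays the main SQL database,
# a dict plays the NoSQL key-value store.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE node (nid INTEGER PRIMARY KEY, title TEXT)")
db.execute("INSERT INTO node VALUES (1, 'Hello')")
kv = {}
stats = {"sql_reads": 0}


def load_title(nid):
    """Cache-aside read: try the KV store first, fall back to SQL on a miss."""
    key = f"node:{nid}:title"
    if key in kv:
        return kv[key]
    stats["sql_reads"] += 1
    row = db.execute("SELECT title FROM node WHERE nid = ?", (nid,)).fetchone()
    title = row[0] if row else None
    kv[key] = title  # populate the cache for the next reader
    return title


def save_title(nid, title):
    """Write to SQL, then invalidate the cached copy."""
    db.execute("UPDATE node SET title = ? WHERE nid = ?", (title, nid))
    kv.pop(f"node:{nid}:title", None)
```

Repeated reads of the same node hit SQL once; a write invalidates the key so the next read goes back to SQL.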

Slide 5

Slide 5 text

NOSQL: do you need it ? ● Start by observing the current state ○ Database queries → devel + webprofiler ○ Cache → heisencache (D7), webprofiler (D8) ○ Build cacheability → renderviz ● Observe behaviour ○ Core observability built-in: DBTNG logging, cache decorators, QueryInterface for KV, config, content… ○ Monitoring module (400 sites) by Karan Poddar (Google SoC) and MD Systems ○ Add your choice of time-series store (e.g. Prometheus, InfluxDB) and UI (e.g. Grafana) ○ ⇨ Use it ! ● You want to see this when it happens ⟶
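To illustrate what "observing database queries" means in practice, here is a minimal sketch of a profiling wrapper that records every statement and its duration. It assumes SQLite as a stand-in database; the class name and its methods are hypothetical and are not how devel or webprofiler are implemented.

```python
import sqlite3
import time
from collections import Counter


class ProfiledConnection:
    """Wraps a connection and records each statement with its duration."""

    def __init__(self, conn):
        self.conn = conn
        self.log = []  # list of (sql, seconds) tuples

    def execute(self, sql, params=()):
        start = time.perf_counter()
        rows = self.conn.execute(sql, params).fetchall()
        self.log.append((sql, time.perf_counter() - start))
        return rows

    def report(self):
        """Return (statement, count) pairs, most frequent first."""
        return Counter(sql for sql, _ in self.log).most_common()
```

A statement that dominates the report is a candidate for caching or offloading; one that appears once is probably not worth the effort.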

Slide 6

Slide 6 text

“If you can’t measure it, you can’t improve it.” Peter Drucker

Slide 7

Slide 7 text

Fixing an identified problem is cheaper than “trying things” Fix from acquired information ● It /MAY/ involve taking queries off the main DB to a NoSQL solution ● But poorly configured NoSQL may make it worse.

Slide 8

Slide 8 text

“Just do it” ? ● Drupal is built on SQL: ○ Views depends on it by default ○ Most sites rely on Views data model awareness ○ → Contrib often assumes SQL, injects @database ○ NoSQL support doable, rarely done ● Contrib support level is limited ○ Most NoSQL contrib not ported from D7 to D8 ○ Drupalshop knowledge limited except biggest or specialized ○ Products may die… e.g. RethinkDB ● Pro support from publishers = costs. Availability. ● Extra support needed = costs NoSQL == added build costs → balance gains vs costs Example case: RethinkDB At DevDays Milan 2016, after lots of work, Gizra’s @RoySegall demoed a Drupal 8 ORM/ODM for RethinkDB. Then, this happened...

Slide 9

Slide 9 text

Do you really need it ? (source: http://www.commitstrip.com/en/2012/04/10/what-do-you-mean-its-oversized)

Slide 10

Slide 10 text

Front caching

Slide 11

Slide 11 text

Caching ahead of real work Default situation with SQL ● Browser caching, limited ● Internal / dynamic page cache in main SQL DB ● Need DB connection, a few SELECT queries ● Fetch cache from DB ● All data from main storage ● ⇨ Serve cached pages in about 20 msec All this work makes DoS-ing comparatively cheap. NoSQL improvements ● Add caching ahead of site itself ○ Browser ■ Optimized browser caching (Cache-Control) ■ PWA: use browser local storage ○ CDN ■ CDN module (2k sites) ■ Akamai module (600 sites) ■ ⇨ Serve cached pages in about 15 msec (TTFB) ■ Web-scale ○ Varnish and other reverse proxies ■ ⇨ Serve cached pages in about 10 msec (TTFB) ■ Core support ■ Varnish Purger (3k sites) ● ⇨ Most requests will mean 0 SQL queries ○ DoS-ing more costly, especially with CDN ● Move page caches off main DB: next section
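Browser, CDN and reverse-proxy caching are all driven by the Cache-Control response header. A minimal sketch of how such a header value could be built follows; the function and its parameters are hypothetical, and the policy for authenticated pages is an assumption, not a description of what Drupal core emits.

```python
def cache_control(max_age, cdn_max_age=None, authenticated=False):
    """Build a Cache-Control header value for a page response."""
    if authenticated or max_age <= 0:
        # Personalized or uncacheable pages must not be stored by shared caches.
        return "private, no-cache"
    directives = ["public", f"max-age={max_age}"]
    if cdn_max_age is not None:
        # s-maxage applies to shared caches only (CDN, reverse proxy).
        directives.append(f"s-maxage={cdn_max_age}")
    return ", ".join(directives)
```

A short browser lifetime combined with a longer `s-maxage` lets the CDN or Varnish absorb most anonymous traffic while purge modules handle invalidation.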

Slide 12

Slide 12 text

Choices

Slide 13

Slide 13 text

Storage

Slide 14

Slide 14 text

Storage: the “Big 3” The most active NoSQL suites for Drupal 8.x Redis ● Type: Key-value (structure server) ● Module ○ redis ● DB-Engines ranking: ○ #1 Key-value store ● Usage ○ Drupal 7: 10k sites ○ Drupal 8: 10k sites ● Supported by ○ Drupal 7: Makina Corpus ○ Drupal 8: MD Systems Memcached ● Type: Key-value ● Module ○ memcache ● DB-Engines ranking: ○ #3 Key-value store ○ #5 Key-value store (Hazelcast) ● Usage (memcache_storage) ○ Drupal 7: 32k (2k) sites ○ Drupal 8: 15k (800) sites ● Supported by: ○ Acquia ○ Tag1 Consulting MongoDB / CosmosDB ● Type: Document store ● Module ○ mongodb ● DB-Engines ranking: ○ #1 Document store (MongoDB) ○ #4 Document store (CosmosDB) ● Usage ○ Drupal 7: 300 sites ○ Drupal 8: 50 sites ● Supported by ○ OSInet

Slide 15

Slide 15 text

Redis https://www.drupal.org/project/redis ● Driver support ○ phpredis and predis both supported ● Supported Services ○ Driver adapter for custom code ○ Cache, including invalidations ○ Flood ○ Lock ○ Lock.Persistent ○ Queue ● CLI support ○ Not included ● Other modules ○ Redis Watchdog: logger + UI Recent events (from @Berdir) ● Deadlock/race condition on node_list invalidations (#2966607) finally fixed in core 8.8.x with latest release ● php-redis 5.0 broke module, fixed in latest 8.x and 7.x releases ● Module users: please test and report !

Slide 16

Slide 16 text

Performance / scalability Redis https://www.drupal.org/project/redis ● Performance, single-server ○ Memory-only implementation ■ Usually among the fastest ■ Often the fastest ■ Even with concurrent access ○ Persistent ■ A bit slower even with just RDB ■ Slower with AOF ● Persistence, single instance ○ RDB: ■ compact snapshots, shippable off-site ■ data loss: since latest snapshot ○ AOF ■ up to last-second fsync’ed journal ■ less compact ● Fault-tolerance: Sentinel 2 ○ master/slave supervision ○ automatic failover possible ○ observability support ● Scaling ○ Cluster-based sharding ○ Master → Slaves → Slaves ○ No strong consistency ○ Recommended config: 6 servers ● Cloud-native: ○ Redis Enterprise Cloud ○ AWS Elasticache, Azure, Google Memorystore ○ many others

Slide 17

Slide 17 text

Memcached https://www.drupal.org/project/memcache ● Driver support ○ memcache extension (limited availability) ○ memcached extension ○ PHP ≥ 5.6 ● Supported Services ○ Driver adapter for custom code ○ Cache, including invalidations ○ Lock ○ Lock.Persistent removed in #2995907 ○ Sessions ported, then removed in 7.x ○ Monitoring UI ● CLI support ○ Not included: core commands ● Other module: memcache_storage ○ Cache with core SQL invalidations ○ No lock ○ Monitoring UI Recent events (from @Berdir) ● Deadlock/race condition on node_list invalidations (#2966607) finally fixed in core 8.8.x with latest release, based on Redis fix.

Slide 18

Slide 18 text

Performance / scalability Memcached https://www.drupal.org/project/memcache ● Performance, single-server ○ Memory-only implementation ■ Usually among the fastest ■ Slower than in-memory Redis ■ A bit faster than MySQL / MongoDB K/V ○ Persistence: extstore NVRAM support ■ No significant slowdown ■ Usually a bad idea (expectations) ■ https://memcached.org/blog/persistent-memory/ ● Fault-tolerance ○ Module support for sharded clusters ○ Consistent hashing: avoid thundering herd problem ○ Replication: with Hazelcast ● Scaling ○ Cluster-based sharding ○ Consistent hashing allows elastic scaling ○ Recommended config: 2 instances per cluster, 1 cluster per bin, with some exceptions: usually 10-20 instances per D8 site ○ Some bins must stay in core (form, update) ● Monitoring ○ Instant: module-provided memcache_admin ○ Evolved: phpmemcacheadmin ● Cloud-native ○ AWS Elasticache ○ Azure Memcached Cloud ○ Google AppEngine Memcache
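Consistent hashing is what lets a sharded cache cluster grow without remapping most keys. The sketch below is a generic hash ring, not the memcache module's or the PHP extension's implementation; the class name, the MD5-based hash and the number of virtual points per node are all assumptions chosen for illustration.

```python
import bisect
import hashlib


class HashRing:
    """Generic consistent-hash ring with virtual points per node."""

    def __init__(self, nodes, replicas=100):
        self.replicas = replicas
        self._ring = []  # sorted list of (hash, node) points
        for node in nodes:
            self.add(node)

    @staticmethod
    def _hash(value):
        return int(hashlib.md5(value.encode()).hexdigest(), 16)

    def add(self, node):
        """Insert the node's virtual points; existing points are untouched."""
        for i in range(self.replicas):
            bisect.insort(self._ring, (self._hash(f"{node}#{i}"), node))

    def node_for(self, key):
        """Return the node owning the first point at or after the key's hash."""
        idx = bisect.bisect(self._ring, (self._hash(key), "")) % len(self._ring)
        return self._ring[idx][1]
```

When a fourth instance is added to a three-instance ring, the only keys that change owner are those that now belong to the new instance; every other key stays where it was, so most of the cache stays warm.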

Slide 19

Slide 19 text

Mainstream packages MongoDB https://www.drupal.org/project/mongodb Drupal 7 features ● Driver support: ○ mongo extension for PHP 5.x ○ mongodb extension for PHP 7.x ○ MongoDB 2.x, 3.x ● Supported Services ○ Driver adapter for custom code ○ Block ○ Cache ○ Path ○ Queue ● Unsupported services ○ Field storage ○ Lock ○ (Session) ○ Watchdog = logger + UI ● Other modules ○ Views driver: EFQ Views Drupal 8.x-2.x features ● Driver support ○ mongodb extension for PHP ≥ 7.1 ○ mongodb/mongodb php driver ○ MongoDB 3.x, 4.x ● Supported Services ○ Driver adapter for custom code ○ Key-value (e.g. State) ○ Key-value expirable (e.g. *tempstore*, form_cache) ○ Watchdog = logger + UI ● CLI support ○ Drupal Console 1.9.x ○ Drush 9.x ● Other services ○ Entity/field storage ● Other modules ○ MongoDB Indexer

Slide 20

Slide 20 text

Exotic packages MongoDB https://www.drupal.org/project/mongodb Drupal 8.x-1.x ● Driver support: ○ mongo extension for PHP 5.x ○ MongoDB 3.x ● Supported services ○ Complete NoSQL distribution ○ @database implementation ○ No SQL DBMS needed ○ Unpatched Drupal core ● Status ○ Sponsored by MongoDB, led by chx ○ Development halted before Drupal 8.0.0 ● Performance: ○ About 4x faster than equivalent Drupal core Drumongous ● Driver support ○ mongo extension for PHP ≥ 5.6 ○ MongoDB ≥ 3.6 ● Supported Services ○ Complete NoSQL distribution ○ @database implementation ● Source: patched Drupal core + module ○ https://gitlab.com/daffie/drumongous/ ○ https://gitlab.com/daffie/mongodb ● CLI support ○ Drupal Console 1.x ○ Drush 9.x ● Status ○ Off-drupal.org ○ No issue queue ○ Active, led by daffie

Slide 21

Slide 21 text

Performance / scalability MongoDB https://www.drupal.org/project/mongodb Engine features ● Fault-tolerance ○ Built-in replication ○ Recommended config: 2+1 servers ● Scaling ○ Read-only replicas ○ Data-center awareness ○ Sharding ● Both supported by existing module Monitoring / Ops ● In-module: logs ● Cloud: MongoDB Atlas, free monitoring, OpsManager Cloud native ● Azure: CosmosDB ● MongoDB: Atlas ● Mlab (née Mongolab) Production example Custom social network (2M users), migrated from MySQL: MySQL slow queries: -85%, uncached content build time: -98%

Slide 22

Slide 22 text

NoSQL storage features

Slide 23

Slide 23 text

Other NoSQL support modules
NoSQL Product | Module | Wrapper | Features | 7.x | 8.x | Supported ?
Neo4J | neo4j | Y | - | Y | Y | N
RethinkDB | rethinkdb | Y | ORM | N | Y | ?
CouchDB | couchdb | Y | Node export | Y | N | N
Couchbase | couchbase | Y | Logger + UI | Y | N | ?
ElasticSearch | elasticsearch_connector | Y | Logger + improved UI, Statistics, Views | Y | N | Y
 | | | SearchAPI | Y | Y |
AWS DynamoDB | dynamodb | N | Cache | Y | N | ?
AWS SimpleDB | awssdk, creeper | Y | - | Y | N | ?
Riak | riak_field_storage | Y | Field storage, map-reduce | Y | N | unsupported
Apache Cassandra | cassandra | Y | Example app | 6.x | N | unsupported
Tokyo Tyrant | node/844354 | N | Logger + UI | 6.x | N | unapproved

Slide 24

Slide 24 text

Sessions

Slide 25

Slide 25 text

NoSQL Sessions ? ● Why the weak/removed session support, especially for memcache ? ○ Memcache session support is baked in PHP memcached extension ○ It was popular in Drupal 6.x time ○ It is popular in Symfony, even documented on symfony.com ○ So ? ● Experience ○ Session data ○ Instance restart → all sessions data on instance lost ○ Bigger session data saturating bin → evictions ○ LRU means vulnerability to DoS-ing and blocking admins via evictions ○ DB load is bigger in Drupal than most frameworks ■ Session DB load is a smaller part of load for us
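The eviction risk listed under "Experience" can be shown with a toy LRU bin. This is a minimal sketch: the class is hypothetical, and real memcached eviction is more involved (slab classes, per-class LRU), but the effect on sessions is the same.

```python
from collections import OrderedDict


class LruBin:
    """Toy fixed-capacity bin that evicts the least recently used entry."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.items = OrderedDict()

    def set(self, key, value):
        self.items[key] = value
        self.items.move_to_end(key)
        while len(self.items) > self.capacity:
            self.items.popitem(last=False)  # evict least recently used

    def get(self, key):
        if key not in self.items:
            return None
        self.items.move_to_end(key)
        return self.items[key]
```

If an attacker opens enough fresh sessions to fill the bin, an idle administrator's session is evicted and the administrator is logged out, which is the DoS vector the slide refers to.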

Slide 26

Slide 26 text

Logs

Slide 27

Slide 27 text

Logs in core The “SQL” problem ● All sites really need some sort of logging feature ● Smaller sites only have a database ○ ⇨ Database Logging default-enabled ● Code is not perfect, throws notices, errors ● Modules are verbose, log debug info ● “Drupal is too slow, please help, agency is stuck” ○ ⇨ Audit : 1500 inserts/min in watchdog table ○ ⇨ Other audits: watchdog > 99% of site size ● DBlog inserts compete with content work ● Owner disables logging ○ ⇨ now misses essential info ● Does not disable logging ○ ⇨ now can’t find essential info buried in noise The core NoSQL module ● Core has been bundling a syslog client since 6.0 ● Decouple logs from DB load ○ ⇨ No more SQL logs workload ● But where do they go ? ○ ⇨ Needs OS-level configuration ● How are logs cleaned ? ○ ⇨ Needs OS-level configuration ● Where is the UI ? ○ ⇨ Needs extra tools ● Solutions ? ○ D7 has logging hook ○ D8 has PSR-3 standard logging ○ ⇨ Contributions

Slide 28

Slide 28 text

NoSQL on-site logs (mongodb|redis)_watchdog ● mongodb_watchdog ○ Logger service ■ Standard Drupal PSR-3 logs backend ■ Pre-storage filtering ■ Uses capped collections: auto-rotation, no ops ■ Dedicated database: zero contention ■ Per-request event tracing ○ Improved logs UI ■ Based on core UI ■ Groups recurring events on single line ■ Details page for occurrences ■ Per-HTTP-request log page ○ Most common reason to deploy MongoDB on D8 ● redis_watchdog ○ Logger service ○ Logs UI based on core UI ○ Usage: 1 site
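Two of the ideas above, capped storage that rotates itself and grouping recurring events by message template, can be sketched with the standard library. This is an analogy only: a bounded deque stands in for a MongoDB capped collection, and the class and method names are hypothetical, not the mongodb_watchdog API.

```python
from collections import Counter, deque


class CappedLog:
    """Bounded event log: the oldest events drop off when the cap is reached."""

    def __init__(self, max_events):
        self.events = deque(maxlen=max_events)

    def log(self, template, context):
        # Store the message template and its placeholders separately,
        # so recurring events can be grouped by template.
        self.events.append((template, context))

    def grouped(self):
        """Return (template, occurrences) pairs, most frequent first."""
        return Counter(template for template, _ in self.events).most_common()
```

Because the store is bounded, no cleanup job is needed, and the grouped view shows one line per kind of event instead of one line per occurrence.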

Slide 29

Slide 29 text

Off-site logs: BELK stack BELK stack ● Beats (typically FileBeat) ● Elastic Search ● Logstash ● Kibana Operation ● Drupal syslog → local syslog server → local logs ● DON’T log straight from Drupal ● Filebeat pulls logs, sends to Logstash ● Logstash massages logs, sends to ES ● ES provides storage, indexing ● Kibana provides UI Deployment ● Hosted with site ● SaaS: Loggly, Logz.io, ...

Slide 30

Slide 30 text

Off-site logs: Graylog Graylog ● Dual server: ES (logs, search) + MongoDB (meta, conf) ● Includes GROK log handling ● Accepts syslog or GELF input ● Design inspired by Splunk Operation ● Drupal syslog → local syslog server → local logs ● DON’T log straight from Drupal via monolog_gelf ● Local syslog forwards to Graylog2 ● Graylog2 massages logs, sends to ES ● ES provides storage, indexing ● Graylog2 provides UI Deployment ● Hosted with site ● SaaS: StackHero

Slide 31

Slide 31 text

(source: Graylog) Off-site logs: BELK vs Graylog design

Slide 32

Slide 32 text

Non-SQL Logs: do I need them ? ● Small site, little traffic, single webmaster: just use dblog ● Any other site: upgrade to something else ○ Hosting company provides a logs dashboard (e.g. Splunk): use it ■ syslog into their stack, via local syslog then pull ○ Have an internal ops team ? ■ syslog into internal BELK or Graylog ○ No ops expertise ? don’t have time to learn Kibana/Graylog ? hosting company doesn’t provide real time logs access ? ■ Want to minimize costs and/or have logs in-site ? ● use mongodb_watchdog ■ Otherwise, use SaaS logs vendor ● Datadog, Scalyr, Loggly or Papertrail (SolarWinds), Logz.io...

Slide 33

Slide 33 text

Queues

Slide 34

Slide 34 text

Queue API services ● Core: mostly for Batch API ● General D8 use: proxy invalidation ○ Invalidation queues ● Commerce sites ○ ERP links ○ Third-party catalog/inventory ● Media sites ○ Real time news feeds ingestion ○ Deferred derived media generation
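All of these use cases rely on the same create / claim / delete / release cycle, where a claimed item is leased to one worker for a limited time. The sketch below models that cycle loosely after Drupal's Queue API; the Python class, its method names and the injectable clock are hypothetical and only illustrate the lease semantics.

```python
import itertools


class LeaseQueue:
    """In-memory queue where claimed items are leased for a limited time."""

    def __init__(self, clock):
        self.clock = clock  # callable returning the current time in seconds
        self.items = {}     # item id -> [data, lease expiry time]
        self.ids = itertools.count(1)

    def create_item(self, data):
        item_id = next(self.ids)
        self.items[item_id] = [data, 0.0]  # expiry 0.0 means "not leased"
        return item_id

    def claim_item(self, lease_time=30.0):
        """Lease the first available item, or return None if all are leased."""
        now = self.clock()
        for item_id, entry in self.items.items():
            if entry[1] <= now:
                entry[1] = now + lease_time
                return item_id, entry[0]
        return None

    def delete_item(self, item_id):
        """Call after successful processing."""
        self.items.pop(item_id, None)

    def release_item(self, item_id):
        """Call after a failure, to make the item claimable again at once."""
        if item_id in self.items:
            self.items[item_id][1] = 0.0
```

If a worker crashes without deleting or releasing its item, the lease simply expires and another worker can claim it, which is why handlers should be safe to run twice.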

Slide 35

Slide 35 text

Queue modules SQL and NoSQL SQL ● Core bundled: queue.database service ○ used by all Drupal sites ● advanced_queue project ○ created for Drupal Commerce projects ○ used by Commerce 2.x NoSQL: storage-based ● Core bundled: queue.memory service ● Redis: ○ 7.x: redis_queue project ○ 8.x: redis project ● MongoDB ○ 7.x: mongodb project NoSQL: message servers ● Beanstalkd ○ 6.x/7.x: popular, used by drupal.org itself ○ 8.x complete port, but no users (?) ● RabbitMQ ○ 7.x: little used, 8.x: most popular ○ Users include public TV, major French e-tailer ○ Hardened by production at these levels ● AWS SQS ○ 7.x: some use, but no 8.x port ● Apache Kafka ○ 8.x only ○ Created for largest French retail chain ● Other queue services ○ Less used: Gearman, IronMQ, 0MQ ○ No 8.x versions

Slide 36

Slide 36 text

Queue API modules by usage D7/D8

Slide 37

Slide 37 text

NoSQL Queue: do I need it ? ● Mainstream Drupal site without Varnish / CDN ○ probably not, advancedqueue is still a nice improvement though ● Content site with a lot of generated content, Varnish and/or CDN ○ consider using Redis (D8), MongoDB (D7), RabbitMQ (D8) ○ or use Kafka (D8) if you need to (e.g. corporate mandate) ● Drupal Commerce standalone ○ advancedqueue is normally enough ● Site generating lots of dynamic media (image, video, sound) ...or ingesting fast feeds (> 1 item/sec) ○ need a dedicated message server

Slide 38

Slide 38 text

NoSQL Queue: which should I use ? ● The one your ops team supports best ○ Content management has a low event rate (< 1 event/sec) ● Kafka-class is for high-throughput queues ○ Think LinkedIn, Twitter, Netflix, Spotify, Airbnb, Paypal… ● RabbitMQ is solid ○ usually well known and monitored ○ D8 driver used for years on Cyber Monday, Black Friday, Olympic games... ● Beanstalkd is simple ○ It “just works” ○ Good first queue upgrading from DB

Slide 39

Slide 39 text

Search

Slide 40

Slide 40 text

SQL-based search ● Search has long been the weakest core feature in Drupal ○ In spite of improvements with each version ● Relevant issues ○ Good recall, but bad precision ○ Multilingual support, but no language awareness ○ Low awareness of language inflections → preprocessing API ○ Limited ability to handle Asian (CJK) languages ○ Slow updates, cron-based pull mode ○ Indexing costs impacting site users ○ Indexed search for content only → search plugins ○ Other entity types limited to unindexed search by default ○ No support for restricted content search ● Useful complements: porterstemmer, snowball_stemmer ● SQL Alternative: Search API database search. Similar.
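"Good recall, but bad precision" means the engine finds the relevant documents but buries them among irrelevant ones. A minimal sketch of the two measures follows; the function name and the sample numbers are illustrative, not measurements of core search.

```python
def precision_recall(returned, relevant):
    """Return (precision, recall) for one query.

    precision = relevant results returned / all results returned
    recall    = relevant results returned / all relevant documents
    """
    returned, relevant = set(returned), set(relevant)
    hits = returned & relevant
    precision = len(hits) / len(returned) if returned else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall
```

For example, a query returning ten node IDs of which only the two relevant ones matter has perfect recall (1.0) but a precision of 0.2.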

Slide 41

Slide 41 text

NoSQL search solutions Cloud-based / SaaS ● SaaS offerings: ○ Algolia ○ Google CSE ● Drupal Hosting offerings (alphabetic order): ○ Acquia Search SOLR ○ Amazee.io SOLR ○ Pantheon SOLR ○ Platform.sh ElasticSearch / SOLR On-site / near-site ● Core support: Search API (14% of D7, 16% of D8 sites) ● Standard solution: ○ Local SOLR ○ Multilingual search supported ● Alternatives: ○ Elastic Search → heart of BELK suite ○ Xunsearch: Xapian for Chinese ○ Xapian (8.x dev) ● D7 backends not on D8: ○ Elastic Search via Elastica ○ Google Search Appliance: killed by Google ○ MongoDB via MongoDB module ○ Sphinx ● Proprietary search engine publishers have custom, unpublished, non-GPL (!) Drupal modules

Slide 42

Slide 42 text

SQL and NoSQL search solutions by usage in D8

Slide 43

Slide 43 text

Non-core search: which should I use ? ● Any content deserves search ● SQL ○ Core for small content quantities ○ Search API DB backend used by drupal.org ● SaaS ○ For entry level: Algolia/Google = 0 recurring cost, near 0 set-up cost ○ Both perform better than core, but non-free ● Drupal PaaS have managed ES/SOLR ● Others: cost equilibrium ○ ES/SOLR have setup and recurring costs of possession (server load) ○ SaaS has lower set-up costs, but recurring fees ○ Core search has the cost of lost opportunity

Slide 44

Slide 44 text

Best practices

Slide 45

Slide 45 text

Best current practice: NoSQL in general Drupal 8 core tries hard to be SQL-agnostic ● Every use of the DB goes through @database ○ So anything able to pass for a SQL engine may be used ○ The mongodb_dbtng, mongodb 8.x-1.x, and Drumongous projects do just that ● Even Views has a query plugin. Project efq_views (7.x, 8.x) supports NoSQL engines that way ● No service except “storage” services should receive databases ○ Write a storage service for your data, defining its interface ○ Write a SQL provider implementing it, receiving @database ○ Tag the service as “backend_overridable” ○ Core mostly does it, custom code should always do it. ● References: ○ https://www.drupal.org/project/drupal/issues/2302617 ○ https://www.drupal.org/node/2306083
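The storage-service recipe above (define an interface, write a SQL provider, let the backend be swapped) is language-independent. Here is a minimal sketch of the pattern in Python, with SQLite standing in for @database; all class and function names are hypothetical, and service tagging such as "backend_overridable" is container configuration that this sketch does not show.

```python
import sqlite3
from abc import ABC, abstractmethod


class BookmarkStorage(ABC):
    """Interface the rest of the code depends on (hypothetical example)."""

    @abstractmethod
    def save(self, uid, url):
        """Store one bookmark for a user."""

    @abstractmethod
    def load(self, uid):
        """Return the user's bookmark URLs in insertion order."""


class SqlBookmarkStorage(BookmarkStorage):
    """Default provider: the only class that receives the database."""

    def __init__(self, database):
        self.database = database
        database.execute(
            "CREATE TABLE IF NOT EXISTS bookmark (uid INTEGER, url TEXT)"
        )

    def save(self, uid, url):
        self.database.execute("INSERT INTO bookmark VALUES (?, ?)", (uid, url))

    def load(self, uid):
        rows = self.database.execute(
            "SELECT url FROM bookmark WHERE uid = ? ORDER BY rowid", (uid,)
        )
        return [row[0] for row in rows]


class MemoryBookmarkStorage(BookmarkStorage):
    """Alternate provider standing in for a NoSQL backend."""

    def __init__(self):
        self.data = {}

    def save(self, uid, url):
        self.data.setdefault(uid, []).append(url)

    def load(self, uid):
        return list(self.data.get(uid, []))


def bookmark_count(storage, uid):
    """Business code: depends on the interface, never on the database."""
    return len(storage.load(uid))
```

Because `bookmark_count` only sees the interface, replacing the SQL provider with another backend requires no change to calling code.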

Slide 46

Slide 46 text

Best current practice: MongoDB ● Connecting to MongoDB with 8.x-2.x ○ Using multiple databases ? Use @mongodb.client_factory ■ The client you get is a standard mongodb/mongodb Client instance ■ You have to handle topology ○ Using single database ? Use @mongodb.database_factory ■ The database you get is a standard mongodb/mongodb Database instance ■ Your DB topology is now configurable in settings ○ You probably don’t want to use Doctrine ODM, especially when interacting with Drupal data ● Designing a custom schema ○ Start from the queries, not from some canonicalization ○ For large scale data sets, consider: ■ Splitting live and archive data for sharding ■ Having a write DB and a read DB, and a CLI-based service between them - read about CQRS ○ Never use a monotonic increasing key for sharding ○ In most cases, joined data in lists don’t need to be as up-to-date as primary views ■ Embed “light” versions of dependent objects for lists, only use $lookup and DBRef joins on full datum view
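The warning against monotonically increasing shard keys can be illustrated without a database. In this minimal sketch, the shard boundaries, the ID range and both routing functions are hypothetical; they model range-based and hash-based routing in general, not MongoDB's chunk balancer.

```python
import hashlib
from collections import Counter


def range_shard(key, boundaries):
    """Range sharding: shard i holds keys below boundaries[i], the last shard the rest."""
    for shard, upper in enumerate(boundaries):
        if key < upper:
            return shard
    return len(boundaries)


def hashed_shard(key, shard_count):
    """Hashed sharding: route on a hash of the key instead of its value."""
    digest = hashlib.sha256(str(key).encode()).digest()
    return int.from_bytes(digest[:8], "big") % shard_count


# 1000 newly inserted documents with auto-increment style IDs.
new_ids = range(1_000_000, 1_001_000)
boundaries = [250_000, 500_000, 750_000]

range_load = Counter(range_shard(k, boundaries) for k in new_ids)
hash_load = Counter(hashed_shard(k, 4) for k in new_ids)
```

With the increasing key, every new write lands on the last shard, so one server takes the whole insert load; hashing the key spreads the same writes across all four shards.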

Slide 47

Slide 47 text

“Contribution is its own reward.” There, I said it !

Slide 48

Slide 48 text

Join us for contribution opportunities Thursday, October 31, 2019 9:00-18:00 Room: Europe Foyer 2 Mentored Contribution First Time Contributor Workshop General Contribution #DrupalContributions 9:00-14:00 Room: Diamond Lounge 9:00-18:00 Room: Europe Foyer 2

Slide 49

Slide 49 text

What did you think? Locate this session at the DrupalCon Amsterdam website: https://drupal.kuoni-congress.info/2019/program/ Take the Survey! https://www.surveymonkey.com/r/DrupalConAmsterdam