How to integrate Elasticsearch and sleep soundly

Martino Pizzol
PyCon Sette

Martino Pizzol
Full-stack developer @ SpazioDati
@mpiz1
github.com/martino

2012, 2013, 2015
P.S.: we're hiring [email protected]

Atoka ❤ ES
• Websites: 150 GB
• Companies: 180 GB
• News: 40 GB
• 170 million documents, 210 million documents, 27 million documents

• Search engine based on Lucene
• Document-oriented
• Distributed and scalable
• RESTful APIs

You Know, for Search
• Download it
• Extract it
• Run Elasticsearch
• Enjoy it
    {
      "name" : "Arides",
      "cluster_name" : "martino-es",
      "version" : {
        "number" : "2.3.1",
        "build_hash" : "bd980929010aef404e7…",
        "build_timestamp" : "2016-04-04T12:…",
        "build_snapshot" : false,
        "lucene_version" : "5.5.0"
      },
      "tagline" : "You Know, for Search"
    }

Indexing
    PUT /companies/company/1
    {
      "nomeLegale": "ACME S.R.L.",
      "partitaIVA": "02241890223",
      "ragioneSociale": "Società A Responsabilità Limitata",
      "dataDiFondazione": "2012-02-13",
      "sedeLegale": {
        "coordinate": { "lon": 11.10781231, "lat": 46.06241231 },
        "indirizzo": {
          "indirizzoCompleto": "Via dei serpenti, 13, 38122, Trento (TN)",
          "provincia": "Trento",
          "cap": "38122",
          "comune": "Trento",
          "stato": "Italia"
        }
      },
      "keywords": [ "python", "django", "big data" ]
    }
Response:
    {
      "_index": "companies",
      "_type": "company",
      "_id": "1",
      "_version": 1,
      "_shards": { "total": 2, "successful": 1, "failed": 0 },
      "created": true
    }

Query
• Full-text
• Multifield
• Geo Point
• Geo Hashes
• Proximity matching
• Partial matching
    GET /companies/company/_search
    {
      "query": {
        "multi_match": {
          "query": "Super azienda",
          "fields": ["nomeLegale^3", "insegne"]
        }
      }
    }
Partial matching (wildcard):
    GET /companies/company/_search?q=descrizione:bar*
    matches: bar | baretto | barbaforte
Fuzzy matching:
    GET /companies/company/_search?q=descrizione:bur~
    matches: bar | bur
Proximity matching:
    GET /companies/company/_search?q=descrizione:"bar rossi"~1
    matches: bar dei rossi | bar mario rossi | bar gino rossi
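As a rough sketch, the multi_match request above can be assembled in Python as a plain dictionary and then sent with whichever HTTP client or Elasticsearch library you prefer. The helper name `multi_match_query` and its `boosts` argument are made up for illustration; only the resulting JSON structure comes from the slide.

```python
import json


def multi_match_query(text, boosts):
    """Build a multi_match body from a list of (field, boost) pairs.

    A boost of 1 is written as the bare field name, anything else as
    "field^boost", which is the syntax used on the slide.
    """
    fields = [
        name if boost == 1 else "{}^{}".format(name, boost)
        for name, boost in boosts
    ]
    return {"query": {"multi_match": {"query": text, "fields": fields}}}


# Hypothetical usage: nomeLegale counts three times as much as insegne.
body = multi_match_query("Super azienda", [("nomeLegale", 3), ("insegne", 1)])
print(json.dumps(body, indent=2))
```

Keeping the body as a dictionary until the last moment makes it easy to unit-test the query without a running cluster.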

Aggregations
    GET companies/company/_search
    {
      "aggs": {
        "comuni": {
          "terms": { "field": "sedi.comune", "size": 10 }
        }
      }
    }
Response (truncated on the slide):
    {
      "took": 180,
      "aggregations": {
        "comuni": {
          "buckets": [
            { "key": "roma", "doc_count": 343969 },
            { "key": "milano", "doc_count": 233817 },
            { "key": "napoli", "doc_count": 107456 },
            …

Aggregations / 2
    GET companies/company/_search
    {
      "query": { "term": { "flags": "startup" } },
      "aggs": {
        "keywords": {
          "significant_terms": { "field": "websiteMetaKeywords" }
        }
      }
    }
Response (truncated on the slide):
    {
      "took": 27,
      "aggregations": {
        "keywords": {
          "doc_count": 5358,
          "buckets": [
            { "key": "startup", "score": 46.3063793145 },
            { "key": "wearable", "score": 11.8933947021 },
            { "key": "spinoff", "score": 11.4179948602 },
            …

We tried it… lessons learned

Nice, but… what if I use Django?

• haystack: http://haystacksearch.org
• djangoes: https://github.com/Exirel/djangoes
• requests: http://docs.python-requests.org/en/master/
• elasticsearch-py: https://github.com/elastic/elasticsearch-py
• elasticsearch-dsl-py: https://github.com/elastic/elasticsearch-dsl-py
Two approaches:
• Build the index inside my Django app
• Use an index generated externally

Query driven document design
Original document:
    {
      "dipendente" : [
        { "nome" : "Mario", "cognome" : "Rossi" },
        { "nome" : "Roberto", "cognome" : "Fazzoletti" }
      ]
    }
Index created by ES:
    {
      "dipendente.nome" : [ "Mario", "Roberto" ],
      "dipendente.cognome" : [ "Rossi", "Fazzoletti" ]
    }
Warning: the index ES creates no longer keeps each nome paired with its cognome.

    {
      "azienda" : "Azienda A",
      "dipendente" : [
        { "nome" : "Mario", "cognome" : "Rossi" },
        { "nome" : "Roberto", "cognome" : "Fazzoletti" }
      ]
    }
    {
      "azienda" : "Azienda B",
      "dipendente" : [
        { "nome" : "Roberto", "cognome" : "Rossi" },
        { "nome" : "Gino", "cognome" : "Parigino" }
      ]
    }
    GET /companies/company/_search?q=dipendente.nome:Roberto AND dipendente.cognome:Rossi
Matches: Azienda A, Azienda B
Azienda A matches too, although none of its employees is called Roberto Rossi.
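The surprising match on Azienda A can be reproduced without a cluster. The snippet below is a simplified simulation in plain Python, not Elasticsearch code: `flatten` imitates the "index created by ES" shown on the previous slide, and all function names are invented for this illustration.

```python
def flatten(doc, list_field):
    """Imitate the default indexing of an array of objects: one multi-valued
    field per inner key, so the link between values of one object is lost."""
    flat = {k: v for k, v in doc.items() if k != list_field}
    for item in doc[list_field]:
        for key, value in item.items():
            flat.setdefault("{}.{}".format(list_field, key), []).append(value)
    return flat


def matches_flattened(doc, nome, cognome):
    """What the query on the slide effectively checks."""
    flat = flatten(doc, "dipendente")
    return nome in flat["dipendente.nome"] and cognome in flat["dipendente.cognome"]


def matches_same_employee(doc, nome, cognome):
    """What the user probably meant: both values on the same employee."""
    return any(
        d["nome"] == nome and d["cognome"] == cognome for d in doc["dipendente"]
    )


azienda_a = {
    "azienda": "Azienda A",
    "dipendente": [
        {"nome": "Mario", "cognome": "Rossi"},
        {"nome": "Roberto", "cognome": "Fazzoletti"},
    ],
}
azienda_b = {
    "azienda": "Azienda B",
    "dipendente": [
        {"nome": "Roberto", "cognome": "Rossi"},
        {"nome": "Gino", "cognome": "Parigino"},
    ],
}

for doc in (azienda_a, azienda_b):
    print(
        doc["azienda"],
        matches_flattened(doc, "Roberto", "Rossi"),
        matches_same_employee(doc, "Roberto", "Rossi"),
    )
```

Both companies pass the flattened check, while only Azienda B has an actual Roberto Rossi. In Elasticsearch the usual way to keep the pairing is the nested field type, which the slides do not go into.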

Automatic mapping?
• Always define a mapping!
• You can define templates for indices that are created dynamically
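To make the advice concrete, here is a minimal sketch of an explicit mapping and of an index template, written as Python dictionaries that could be serialized and sent to the cluster. It assumes the Elasticsearch 2.x syntax used elsewhere in the deck ("string" fields, a "template" pattern key); the field types chosen for each property and the `companies-*` pattern are assumptions for illustration, not the author's actual mapping.

```python
import json

# Explicit mapping for the "company" type (ES 2.x syntax, assumed types).
company_mapping = {
    "properties": {
        "nomeLegale": {"type": "string"},
        # Identifiers are matched exactly, so they are not analyzed.
        "partitaIVA": {"type": "string", "index": "not_analyzed"},
        "dataDiFondazione": {"type": "date"},
        "keywords": {"type": "string", "index": "not_analyzed"},
    }
}

# Body for creating one index with that mapping, e.g. PUT /companies
create_index_body = {"mappings": {"company": company_mapping}}

# Body for an index template, e.g. PUT /_template/<name>: every new index
# whose name matches the pattern gets the same mapping automatically.
index_template_body = {
    "template": "companies-*",  # hypothetical naming pattern
    "mappings": {"company": company_mapping},
}

print(json.dumps(index_template_body, indent=2))
```

Later Elasticsearch versions changed both the field types and the template syntax, so the bodies need adapting to the version you run.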

Warning: to change a mapping you have to reindex everything!

Always use DocTypes
    from elasticsearch_dsl import DocType, String, Date

    class Company(DocType):
        nome_legale = String()
        data_di_fondazione = Date()
        partita_iva = String()

        class Meta:
            index = 'companies'
https://elasticsearch-dsl.readthedocs.org/en/latest/persistence.html

Testing Django + Elasticsearch

1, 2, 3… Elasticsearch!
• Docker
• Docker Compose

    django:
      build: .
      command: "echo 'fake according to https://github.com/docker/compose/pull/1754#issuecomment-154218084'"
      links:
        - elasticsearch:atoka_es
        - postgres:pg
        - redis:redis
      environment:
        - ES_INDEX_HOST=atoka_es
        - RDS_HOSTNAME=pg
        - REDIS_HOSTNAME=redis
    postgres:
      image: postgres:latest
      environment:
        POSTGRES_USER: user
        POSTGRES_PASSWORD: password
    elasticsearch:
      image: elasticsearch:1.7.5
      command: elasticsearch --node.local=true --index.store.type=memory
    redis:
      image: docker.io/redis:2.8.21

Data snapshots for tests
[Diagram: prod → tests]

Generating the snapshot
• Dynamic generation
• Static generation

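A TestRunner that prepares an Elasticsearch instance for testing
    import os
    import time

    from django.test.runner import DiscoverRunner

    class YourAppTestSuiteRunner(DiscoverRunner):
        def setup_databases(self, **kwargs):
            if os.getenv('SKIP_LOADING_ES_DATA') is None:
                print('Loading es data...')
                print('Set SKIP_LOADING_ES_DATA to avoid this step')
                load_es_data()  # project helper that loads the snapshot (not shown on the slide)
                time.sleep(5)  # crude wait: eventual consistency, new documents are not searchable right away
            return super(YourAppTestSuiteRunner, self).setup_databases(**kwargs)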

“Data likes to change!”

Elasticsearch Alias
• Hot swap of indices
• Groups of indices (useful for time-based indices)
• Create “views”
    POST /_aliases
    {
      "actions": [
        { "remove": { "index": "companies-20160401", "alias": "companies" } },
        { "add": { "index": "companies-20160410", "alias": "companies" } }
      ]
    }
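A small sketch of how the hot swap could be scripted: build a dated index name, then produce the body for POST /_aliases. Because the remove and add actions travel in a single request, the alias switches atomically. The helper names `dated_index` and `swap_alias_body` are invented here, and the name-plus-date convention is inferred from the index names on the slide.

```python
from datetime import date


def dated_index(alias, day):
    """Name a concrete index after its alias plus the build date."""
    return "{}-{}".format(alias, day.strftime("%Y%m%d"))


def swap_alias_body(alias, old_index, new_index):
    """Body for POST /_aliases that moves `alias` from old_index to new_index."""
    return {
        "actions": [
            {"remove": {"index": old_index, "alias": alias}},
            {"add": {"index": new_index, "alias": alias}},
        ]
    }


old = dated_index("companies", date(2016, 4, 1))
new = dated_index("companies", date(2016, 4, 10))
body = swap_alias_body("companies", old, new)
print(body)
```

The application only ever talks to the alias, so a freshly built index can be verified before the swap and the old one can be kept around for a quick rollback.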

TestCase + Alias
    class SpecificFieldFacetTestCase(MyFancyAppFacetTestCase):
        companies_id = ["123acs", "930d9f", "58588f", "1ced29"]

        def test_field_without_data(self):
            self.assertTrue(…)

Let’s go to production

Enlarge your cluster
[Diagram: Node1 alone, then Node1, Node2, Node3]

Discovery Cloud Ready

Cluster size
[Diagram: Node1 (master), Node2, Node3; shards 1, 2, 3]

Cluster size / 2
[Diagram: Node1 (master), Node2, Node3; shards 1, 2, 3 and 1, 2, 3]
Always make sure replicas are configured for your shards

Which server should I use?
• Disk size: mind the ratio between index size and number of shards
• Disk type: SSD > spinning disks, RAID 0 > spinning disks
• Amount of memory: depends on the queries
No silver bullet

Aggregations Sorting Scripts

Manage memory properly
• Watch out for fielddata: it will be your worst nightmare
  https://es:9200/_cat/fielddata?v
• ES_HEAP_SIZE ~ RAM/2 (max 32 GB)
• Avoid an excessive number of shards
https://www.elastic.co/blog/a-heap-of-trouble
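The heap rule of thumb is simple arithmetic; the function below just restates it (half of the RAM, capped at 32 GB as on the slide). The function name and the `cap_gb` parameter are illustrative, and the linked blog post is the place to check the exact ceiling for your JVM.

```python
def suggested_heap_gb(ram_gb, cap_gb=32):
    """Half of the machine's RAM, never more than the cap (slide: max 32 GB)."""
    return min(ram_gb / 2.0, cap_gb)


for ram in (16, 64, 128):
    print(ram, "GB RAM ->", suggested_heap_gb(ram), "GB heap")
```

The other half of the memory is not wasted: it is left to the operating system's file cache, which Lucene relies on heavily.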

Monitor your cluster with the Cluster API!
• Health of the cluster
• Health of every single node
• Marvel + an alerting system
https://www.elastic.co/guide/en/elasticsearch/reference/2.3/cluster.html

Tuning insert performance
• Use bulk inserts, with the optimal batch size
• Tune indexing throttling
• Remember to disable replicas while bulk indexing
https://www.elastic.co/guide/en/elasticsearch/guide/current/indexing-performance.html
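To illustrate the first and last points, this sketch splits documents into batches and serializes each batch in the newline-delimited format expected by the `_bulk` endpoint, next to the settings bodies that switch replicas off and back on. The helper names and the batch size of 2 are chosen for the demo only; the right batch size has to be found by measuring. In a real project elasticsearch-py's `helpers.bulk` would normally do the serialization for you.

```python
import json


def chunks(items, size):
    """Yield consecutive batches of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def bulk_body(docs, index, doc_type):
    """Serialize (id, source) pairs for the bulk API: one action line
    followed by one source line per document."""
    lines = []
    for doc_id, source in docs:
        action = {"index": {"_index": index, "_type": doc_type, "_id": doc_id}}
        lines.append(json.dumps(action))
        lines.append(json.dumps(source))
    return "\n".join(lines) + "\n"  # the bulk API requires a trailing newline


# Bodies for PUT /companies/_settings before and after the load.
disable_replicas = {"index": {"number_of_replicas": 0}}
restore_replicas = {"index": {"number_of_replicas": 1}}  # 1 is an assumed target

docs = [(str(i), {"nomeLegale": "Azienda {}".format(i)}) for i in range(5)]
bodies = [bulk_body(batch, "companies", "company") for batch in chunks(docs, 2)]
print(len(bodies), "bulk requests")
print(bodies[0])
```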

Optimize your indices!
• ES < 2.1: Optimize API
• ES >= 2.1: Force Merge

Lean queries
• Keep the size of responses under control (the _source parameter helps)
• Use the scroll API

Rolling restart
1. Stop indexing, if possible
2. Disable shard allocation
3. Shut down one node
4. Do the maintenance work
5. Restart the node and check that it has rejoined the cluster
6. Re-enable shard allocation
7. Repeat for every node
Disable shard allocation:
    PUT /_cluster/settings
    { "transient" : { "cluster.routing.allocation.enable": "none" } }
Re-enable shard allocation:
    PUT /_cluster/settings
    { "transient" : { "cluster.routing.allocation.enable": "all" } }
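The procedure above can be sketched as a loop. To keep the example runnable without a cluster, the three operations that touch real machines are passed in as callables; `rolling_restart`, `allocation_settings` and the callable names are hypothetical, and only the settings body comes from the slide.

```python
def allocation_settings(enabled):
    """Body for PUT /_cluster/settings that toggles shard allocation."""
    value = "all" if enabled else "none"
    return {"transient": {"cluster.routing.allocation.enable": value}}


def rolling_restart(nodes, put_cluster_settings, restart_node, wait_until_joined):
    """Restart nodes one at a time, with allocation disabled while each is down."""
    for node in nodes:
        put_cluster_settings(allocation_settings(False))  # step 2
        restart_node(node)                                # steps 3-5
        wait_until_joined(node)                           # step 5
        put_cluster_settings(allocation_settings(True))   # step 6


# Dry run with fake callables that only record what would happen.
log = []
rolling_restart(
    ["node1", "node2"],
    put_cluster_settings=lambda body: log.append(
        body["transient"]["cluster.routing.allocation.enable"]
    ),
    restart_node=lambda node: log.append("restart " + node),
    wait_until_joined=lambda node: log.append("joined " + node),
)
print(log)
```

A production script would probably also wait for the cluster health to recover before moving on to the next node; that check is not on the slide and is left out here.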

Watch the size of your indices… because size matters

Keeping everything in ES may look like a great solution
Be aware of your DB

Thank you @mpiz1
