
opening.io system architecture

Insights into the technical stack

Adrian Mihai - opening.io

November 04, 2016


Transcript

1. the technical stack: overview, capabilities, architectural decisions, data flows, ai

Adrian Mihai, co-founder/cto
[email protected] / @dublin_io

2. what do we do?

01 Resume parsing at scale. Deploying specialized, location-transparent units of work (akka-camel actors) which gradually build a complete candidate profile out of each file. We ingest files from the web or in batch via dedicated import processes (mail, cloud).

02 Information retrieval. Identifying patterns within the structure and phrasing of resumes in order to match them to suitable job descriptions automatically. We employ linguistics to cluster pools of similar candidates and job posts, and tap into external sources of signal: GitHub, LinkedIn.

03 Real-time analytics. Salary recommendations / forecasts, similarity between candidates and/or job descriptions, rankings, search & extra business logic.
3. flows: anatomy of an upload

[Diagram: web socket front-end -> back-end (node.js) -> Kafka JSON messages on topic resume_upload_pipeline, answers on topic results-resume_upload_pipeline -> S3.]

Internally, independent from the web back-end, we run a Scala/Java application: Connemara. Requests are fulfilled by pipelines, or chains of actions: s3_download, feature_extraction, salaries_for_skills, etc. Actions are JSON Kafka / HTTP requests sent to their respective topics, while the answers to these requests decorate the pipeline's state tree in return. Once its conditions are met, the pipeline returns an answer.

Workers are actors within an Akka Cluster: Kafka consumers / HTTP servers (receiving JSON queries) and producers (sending output). They are also Camel context-aware (they can switch transport from Kafka to http/websocket/gearman via configuration), so they can independently handle API requests, for instance; see the sketch below. For NLP tasks and document vectors we maintain Python workers too.
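A minimal sketch of such a Camel-aware worker, assuming akka-camel; the topic name, broker address and extraction logic are illustrative, not the production configuration:

import akka.camel.{CamelMessage, Consumer}

// A pipeline action as a Camel consumer actor: the transport lives entirely
// in the endpoint URI, so switching Kafka -> HTTP is a configuration change,
// e.g. "jetty:http://0.0.0.0:11115/feature_extraction" (URIs illustrative).
class FeatureExtractionWorker extends Consumer {
  def endpointUri = "kafka:resume_upload_pipeline?brokers=localhost:9092"

  def receive = {
    case msg: CamelMessage =>
      val request = msg.bodyAs[String] // JSON action request from a pipeline
      // ... run feature extraction, then decorate the pipeline's state tree
      // by publishing the result JSON to the results topic ...
  }
}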
4. everything is a stream

Example service call (e.g. feature extraction):

http post ( ***.***.***.***:11115/feature_extraction )
  -> Http().outgoingConnection (round-robin flow to an available service, e.g. yyy.yyy.yyy.yyy:11178)
  -> Service actor: Future[Json] execute() { ... }
  -> return Future[HttpResponse]

[Diagram: a .doc upload from the web comes back as an http response: the fully classified json resume (with portfolio screenshots, GitHub code, etc). Distributing work: a Registry (cluster singleton actor) at http://xxx.xxx.xxx.xxx:11115 routes work from the web node to worker instances over ws / xhr, kafka / http / .., akka.]
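The HTTP hop above can be sketched with Akka HTTP's connection-level client API (a minimal sketch; the worker address and route are the placeholders from the slide, and the registry wiring is omitted):

import akka.actor.ActorSystem
import akka.http.scaladsl.Http
import akka.http.scaladsl.model._
import akka.stream.ActorMaterializer
import akka.stream.scaladsl.{Sink, Source}
import scala.concurrent.Future

object ServiceCall {
  implicit val system = ActorSystem("gateway")
  implicit val materializer = ActorMaterializer()

  // One connection flow per worker instance; the registry (cluster singleton)
  // would hand these addresses out round-robin.
  val worker = Http().outgoingConnection("yyy.yyy.yyy.yyy", 11178)

  def featureExtraction(json: String): Future[HttpResponse] =
    Source
      .single(HttpRequest(
        method = HttpMethods.POST,
        uri    = "/feature_extraction",
        entity = HttpEntity(ContentTypes.`application/json`, json)))
      .via(worker)      // materializes one HTTP connection per call
      .runWith(Sink.head)
}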
5. a simplified pipeline: services as pure actors

Future[Json] parsedResume =
  Source(file_bucket_path)
  -> flatMap(downloadFileFromCloud)   // Future[ByteArrayOutputStream]
  -> flatMap(convert)                 // HTTP request to a conversion back-end (Docker conversion containers, http://engineering.opening.io/demo.html); returns HTTPResponse (ByteArrayOutputStream)
  -> flatMap(parsing)                 // PDFBox (java) -> NER, tokenization, regex, vectors, etc; returns HTTPResponse (json)
  -> flatMap(extract_education)       // Percolation (Elasticsearch inverted search); returns HTTPResponse (json: universities)
  -> fold(flatMap(portfolio_links))   // HTTP request (screenshotting service) -> Response (Array[ByteArrayOutputStream]) -> flow graph (async write screenshots to disk) -> ...; returns HTTPResponse (json: disk paths)
  -> flatMap(topics_classifier)       // returns HTTPResponse (json)
  -> flatMap(salaries)                // regression; returns HTTPResponse (json)
  …
  -> flatMap(github_portfolios/dribbble/etc) // clone non-forked repos (ramdisks) -> extract source code -> sample code per all identified languages; returns HTTPResponse (json)
  …
  .via(SharedKillSwitch.flow)
  .via(instrumentation: Prometheus)
  .to(Sink.foreach)

(everything is a Future[…], obviously; trimming down non-core info)

parsedResume onComplete {
  case Success(json_resume) =>
    async store (Couchbase)
    HTTPResponse(json_resume)
  ..
}
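As the next slide notes, these stages are orchestrated with mapAsync. A minimal sketch of the skeleton, assuming Akka Streams; the stage functions are hypothetical stand-ins for the real HTTP/Kafka service calls:

import akka.actor.ActorSystem
import akka.stream.{ActorMaterializer, KillSwitches}
import akka.stream.scaladsl.{Sink, Source}
import scala.concurrent.Future

object ResumePipeline {
  implicit val system = ActorSystem("pipeline")
  implicit val materializer = ActorMaterializer()

  val killSwitch = KillSwitches.shared("resume-pipeline")

  // Hypothetical stand-ins for the service calls above:
  def download(path: String): Future[Array[Byte]] = ???    // S3 fetch
  def convert(doc: Array[Byte]): Future[Array[Byte]] = ??? // conversion container
  def parse(pdf: Array[Byte]): Future[String] = ???        // PDFBox + NLP, returns JSON

  def run(bucketPath: String): Future[String] =
    Source.single(bucketPath)
      .mapAsync(1)(download)
      .mapAsync(1)(convert)
      .mapAsync(1)(parse)
      .via(killSwitch.flow)   // shared kill switch across pipelines
      .runWith(Sink.head)
}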
6. task pool traffic

[Diagram: task pool (S3 download, convert, parse, nlp, feature selection, embedding, regressions) fed by dropbox, google drive, batch, github and portfolio inputs, plus the ai / analytics pipelines / business logic; task orchestration via mapAsync / for comprehensions.]

Akka Streams:
Flow[String, (String, ByteArrayOutputStream, Boolean)]
Flow[(String, ByteArrayOutputStream, Boolean), (String, JSONObject, Boolean)]
…

One single Graph describes how the various Flows combine: i/o, broadcasts, etc (sketched below). All I/O (conversions, screenshots, etc) + BI/AI = ~650ms (Grafana).
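A minimal sketch of one such Graph combining typed Flows with a broadcast, assuming Akka Streams GraphDSL; the flow bodies and tuple payloads are placeholders:

import akka.NotUsed
import akka.actor.ActorSystem
import akka.stream.{ActorMaterializer, ClosedShape}
import akka.stream.scaladsl._
import java.io.ByteArrayOutputStream

object TaskGraph {
  implicit val system = ActorSystem("tasks")
  implicit val materializer = ActorMaterializer()

  // Placeholder flows with the tuple shapes from the slide:
  val convert = Flow[String].map(path => (path, new ByteArrayOutputStream(), true))
  val parse = Flow[(String, ByteArrayOutputStream, Boolean)]
    .map { case (p, _, ok) => (p, s"""{"path":"$p"}""", ok) }

  val graph = RunnableGraph.fromGraph(GraphDSL.create() {
    implicit b: GraphDSL.Builder[NotUsed] =>
      import GraphDSL.Implicits._
      val bcast = b.add(Broadcast[(String, String, Boolean)](2))

      Source.single("s3://bucket/resume.doc") ~> convert ~> parse ~> bcast.in
      bcast.out(0) ~> Sink.foreach[(String, String, Boolean)](t => println(t)) // main pipeline
      bcast.out(1) ~> Sink.ignore                                              // analytics branch
      ClosedShape
  })
  // graph.run()
}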
7. back-end layout

[Diagram: an Akka cluster of workers, each addressed by ip:port (HTTP) or by a kafka topic, with IN/OUT edges between them; pipelines have shared read access to the data stores.]

Pipelines and regular workers (Akka Streams) are flows defining behavior via a series of mapAsync calls.
8. document conversion

[Diagram: LibreOffice running in a ramdisk; registrator checks each container into consul for discovery; the fabio load balancer on http :9999 is the access point; a golang web server (iris) fronts each container.]

for {
  req          ← createRequest(target, testFile)
  response     ← Http().singleRequest(req)
  responseBody ← Unmarshal(response).to[…]
} yield responseBody
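Filled in, the for-comprehension above could look like this minimal sketch, assuming Akka HTTP and a String response body; createRequest, the target URI and the file entity are illustrative:

import akka.actor.ActorSystem
import akka.http.scaladsl.Http
import akka.http.scaladsl.model._
import akka.http.scaladsl.unmarshalling.Unmarshal
import akka.stream.ActorMaterializer
import java.nio.file.{Files, Paths}
import scala.concurrent.Future

object ConversionClient {
  implicit val system = ActorSystem("conversion")
  implicit val materializer = ActorMaterializer()
  import system.dispatcher

  // POST the document to the fabio access point (route is hypothetical).
  def createRequest(target: String, file: String): Future[HttpRequest] =
    Future.successful(HttpRequest(
      method = HttpMethods.POST,
      uri    = target, // e.g. the fabio :9999 access point
      entity = HttpEntity(ContentTypes.`application/octet-stream`,
                          Files.readAllBytes(Paths.get(file)))))

  def convert(target: String, testFile: String): Future[String] =
    for {
      req          <- createRequest(target, testFile)
      response     <- Http().singleRequest(req)
      responseBody <- Unmarshal(response).to[String]
    } yield responseBody
}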
9. why akka?

http://www.lightbend.com/activator/template/akka-distributed-workers

[Diagram: an Akka Producer (1x) sends TASKs to an Akka Broker, which feeds an Akka executor pool; vertical scale within a node, horizontal scale across (n) nodes, each with a Hazelcast near cache.]

Akka Producer (1x):
- Kafka consumer (request topic)
- Kafka producer (response on the results topic)
- Akka-Camel (abstracting transport)

Akka executor pool:
- Akka Routers per task (see the sketch below)
- Workers as pure transforms
- Isolated actor system
- Read access to Hazelcast
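The "Akka Routers per task" line maps to something like this minimal sketch; the pool size and worker logic are illustrative:

import akka.actor.{Actor, ActorSystem, Props}
import akka.routing.RoundRobinPool

// A pure-transform worker: message in, result out, no shared mutable state.
class ParseWorker extends Actor {
  def receive = {
    case doc: String => sender() ! s"""{"parsed":"$doc"}"""
  }
}

object ExecutorPool extends App {
  val system = ActorSystem("executors")
  // One router per task type; resizing the pool is the vertical-scale knob.
  val parseRouter = system.actorOf(RoundRobinPool(8).props(Props[ParseWorker]), "parse-router")
  parseRouter ! "resume.pdf"
}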
10. why couchbase?

- Updates as versioned inserts (Data rev. 1, Data rev. 2, … Data rev. N): "free" analytics (see the sketch below).
- Memcache-compatible bucket: cross-system shared session (node.js, php?, java?); efficient KV lookup (in-mem, eventual persistence).
- View queries, Javascript indexes, Elasticsearch replication.

Elasticsearch use cases:
- Search queries.
- Percolation (inverted search): raw text is matched against ~22k stored searches (university names); detected education is attached to document keys.
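"Updates as versioned inserts" could be sketched like this with the Couchbase Java SDK; the docid::rev-N key scheme is an assumption for illustration, not the production layout:

import com.couchbase.client.java.CouchbaseCluster
import com.couchbase.client.java.document.JsonDocument
import com.couchbase.client.java.document.json.JsonObject

object VersionedStore {
  val cluster = CouchbaseCluster.create("localhost")
  val bucket  = cluster.openBucket("resumes")

  // Never overwrite: each update becomes a new revision key, so the full
  // history stays queryable ("free" analytics via view queries).
  def store(id: String, rev: Int, payload: JsonObject): JsonDocument =
    bucket.insert(JsonDocument.create(s"$id::rev-$rev", payload))
}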
11. why kafka?

[Diagram: requests and responses from web, mail, batch and cloud sources flow through Kafka topics; S3 takes the initial "heat".]

- Job scheduling, logging, job recovery.
- Sticky to data.
- Confidentiality: private clouds.
12. ai

Internally, we represent both candidates and job descriptions like this: an industry-specific topic vector, whose values are a series of regressions:

  Web design & E-Commerce: 0.95
  Software Development: 0.86
  ….

And also like this: a document vector pointing to a direction within a high-dimensional (in our case 300-dimensional) semantic space.

We looked at probability distributions too (LDA/HDP), but we actually like how well regression works here.

[Visualizations: t-SNE and Kohonen projections of the vector space.]
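As plain data, the two representations might look like this minimal sketch; the type names are invented, the values are the ones on the slide:

// Industry-specific topic vector: each value is the output of a regression.
case class TopicVector(scores: Map[String, Double])

// Document vector: a direction in the 300-dimensional semantic space.
case class DocumentVector(components: Array[Double]) {
  require(components.length == 300)
}

val candidate = TopicVector(Map(
  "Web design & E-Commerce" -> 0.95,
  "Software Development"    -> 0.86
))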
13. vectors and similarity

Topic vectors. Mostly supervised: skills extraction, classification, regression.

Document vectors. Mostly unsupervised: we employ a word2vec model to semantically map the document. We then do SVD, trim, cluster the information and filter out irrelevant clusters. Information in the clean clusters contributes to the final document vector.

Similarity. All similarity (candidates/candidates, jobs/candidates, jobs/jobs) is cosine similarity / dot product (for speed, knn approximation: Annoy, https://github.com/spotify/annoy).
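The similarity math reduces to this minimal sketch:

// Cosine similarity over document/topic vectors; with L2-normalized vectors
// it reduces to a plain dot product, which is what Annoy approximates at scale.
def dot(a: Array[Double], b: Array[Double]): Double =
  (a zip b).map { case (x, y) => x * y }.sum

def cosine(a: Array[Double], b: Array[Double]): Double =
  dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))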
14. salary forecasts

We extract salaries from job boards (15 job boards monitored) and use these to infer, on average, how valuable abilities are: web developers, data scientists, etc. Mostly regression and supervised learning (IT mapped as a tree-like structure). We then normalize this structure to resumes (lengthier text, etc), ending up forecasting salaries for candidates.
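A minimal sketch of the kind of regression involved, reduced to a single skill feature (closed-form ordinary least squares); the production models, features and tree structure are not shown in the deck:

// Fit salary ≈ a + b * skillScore by ordinary least squares.
def fitOLS(skillScores: Array[Double], salaries: Array[Double]): (Double, Double) = {
  val n     = skillScores.length.toDouble
  val meanX = skillScores.sum / n
  val meanY = salaries.sum / n
  val b = (skillScores zip salaries).map { case (x, y) => (x - meanX) * (y - meanY) }.sum /
          skillScores.map(x => (x - meanX) * (x - meanX)).sum
  val a = meanY - b * meanX
  (a, b)
}

// Forecast for a candidate: val (a, b) = fitOLS(xs, ys); a + b * candidateScore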
15. automation: mining

Web scrapers (Python, Scrapy) run on cron and feed data to Couchbase. From there it is replicated (XDCR) to Elasticsearch, where a small listener plugin gets triggered by the new incoming job descriptions. The plugin relays each description to Connemara for analysis (pay scale extraction, named entities, requirements), which in turn writes the results back to the DB. This enables us to simply write a job description into the database and call it a day.
16. information flow

[Diagram: signals update a baobab.js state tree, which drives a react.js view layer; alternative state trees and view layers plug in at the same boundaries.]

Store / view layers are pluggable back-ends (easy switch to immutable.js/angular/etc).
17. batch & scale

A customer-facing win/osx native app (JavaFX) performs batch uploads to S3, feeding the Akka Streams pipelines; all i/o happens in ramdisks (containers over Mesos). Input comes from a variety of back-ends: mail, http, kafka…
18. container i/o

[Diagram: http traffic enters through fabio and fans out to replicated conversion containers (doc-to-pdf, pdf-to-img, web-to-img, img-to-ocr); registrator publishes active tasks to an event stream.]
19. monitoring & instrumentation

Prometheus and Grafana.
- Instrumentation: Java/Node clients.
- Monitoring: systems via node exporter; Couchbase via telegraf.
- Alerts.
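Instrumentation with the Prometheus Java client boils down to sketches like this; the metric names are illustrative:

import io.prometheus.client.{Counter, Histogram}

object Metrics {
  // Counts uploads; scraped by Prometheus, charted in Grafana.
  val uploads = Counter.build()
    .name("resume_uploads_total").help("Resumes ingested.").register()

  // Times a pipeline end to end (e.g. the ~650ms figure above).
  val pipelineLatency = Histogram.build()
    .name("pipeline_seconds").help("Pipeline latency.").register()

  def timed[A](f: => A): A = {
    val timer = pipelineLatency.startTimer()
    try f finally timer.observeDuration()
  }
}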
20. references

https://github.com/haifengl/smile
https://radimrehurek.com/gensim/
https://github.com/cemoody/lda2vec
http://scikit-learn.org/stable/
https://github.com/mit-nlp/MITIE
https://www.tensorflow.org/
http://kafka.apache.org/
http://akka.io/
https://prometheus.io/
http://grafana.org/
http://camel.apache.org/
http://www.libreoffice.org/
http://mesos.apache.org/
https://www.gluster.org/
https://nodejs.org/en/
http://www.cerebraljs.com/
https://facebook.github.io/react/
http://www.couchbase.com/
https://www.elastic.co/
https://www.nginx.com/
https://www.phusionpassenger.com/
https://www.docker.com/
https://www.ansible.com/