Reactive Resumes

reactive resumes
journeys from react: websocket -> kafka -> camel -> akka -> ai -> couchbase
Scala by the Bay (scala.bythebay.io)
Nov 12, 2016 | 1355 Market Street, San Francisco, US (Twitter’s HQ)
Adrian Mihai, co-founder/cto @dublin_io / engineering.opening.io

what do we do?
01 Resume parsing at scale
   Deploying specialized, location transparent units of work (akka-camel actors) which gradually build a complete candidate profile out of each file. We ingest files from web or in batch via dedicated import processes (mail, cloud).
02 Information retrieval
   Identifying patterns within structure and phrasing of resumes in order to match them to suitable job descriptions automatically. We employ linguistics to cluster pools of similar candidates and job posts and tap into external sources of signal: GitHub, LinkedIn.
03 Real-time analytics
   Salary recommendations / forecasts, similarity between candidates and/or job descriptions, rankings, search & extra business logic.

flows: anatomy of an upload
Diagram labels: S3, web socket, front-end, back-end (node.js), kafka message (json), topic: resume_upload_pipeline, topic: results-resume_upload_pipeline
Internally, independent from the web back-end, we run a Scala/Java application: Connemara.
Requests are fulfilled by pipelines, or chains of actions: s3_download, feature_extraction, salaries_for_skills, etc. Actions are JSON Kafka / HTTP requests sent to their respective topics, while the answers to these requests decorate the pipeline’s state tree in return. Once its conditions are met, the pipeline returns an answer.
Workers are actors within an Akka Cluster: Kafka consumers / HTTP servers (receiving JSON queries) and producers (sending output). They are also Camel context-aware (they can switch transport from Kafka to http/websocket/gearman via configuration), so they can handle API requests independently.
For NLP tasks and document vectors we maintain Python workers too.
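The pipeline idea above can be sketched in a few lines of plain Python. This is a toy model, not the Akka/Kafka implementation: the `Pipeline` class, its method names, the message fields and the sample payloads are all hypothetical, and the completion condition (every action has answered) is an assumption. Only the action and topic names come from the slide.

```python
import json


class Pipeline:
    """Toy pipeline: a chain of named actions whose answers decorate a state tree."""

    def __init__(self, name, actions):
        self.name = name
        self.actions = list(actions)
        self.state = {}  # the state tree, keyed by action name

    def requests(self):
        """One JSON request per action, as it might be sent to that action's topic."""
        return [json.dumps({"pipeline": self.name, "action": a}) for a in self.actions]

    def on_answer(self, message):
        """Merge one JSON answer into the state tree."""
        answer = json.loads(message)
        if answer["action"] not in self.actions:
            raise ValueError("unknown action: " + answer["action"])
        self.state[answer["action"]] = answer["result"]

    def is_complete(self):
        """Assumed completion condition: every action has answered."""
        return all(a in self.state for a in self.actions)

    def result(self):
        """The pipeline's answer, or None while the condition is not met."""
        if not self.is_complete():
            return None
        return json.dumps({"pipeline": self.name, "state": self.state}, sort_keys=True)


pipeline = Pipeline("resume_upload_pipeline", ["s3_download", "feature_extraction"])
pipeline.on_answer(json.dumps({"action": "s3_download", "result": {"bytes": 1024}}))
partial = pipeline.result()  # still waiting for feature_extraction
pipeline.on_answer(json.dumps({"action": "feature_extraction", "result": {"skills": ["scala"]}}))
final = pipeline.result()    # JSON document holding the full state tree
```

Keeping the state tree as plain JSON means any worker, whatever its transport, can contribute an answer without knowing about the other actions.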

everything is a stream
Example service call (ex: feature extraction):
  http post (***.***.***.***:11115/feature_extraction)
  -> HTTP.outgoingConnection (round robin flow to an available service - ex: yyy.yyy.yyy.yyy:11178)
  -> Service actor: Future<Json> execute() { ... }
  return Future[HttpResponse]
In: http .doc upload, web
Out: http response, fully classified json resume (with portfolio screenshots, Github code, etc)
Distributing work: worker instances, Registry (cluster singleton actor) http://xxx.xxx.xxx.xxx:11115
Diagram labels (steps 01 02 03 04): web, node, router, worker; ws / xhr, kafka / http / .., akka
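The round-robin step above can be illustrated with a small in-memory registry. This is a hedged sketch in plain Python, not the cluster singleton actor itself: the `Registry` class, its methods and the worker addresses are hypothetical placeholders.

```python
from collections import defaultdict


class Registry:
    """Toy service registry that hands out worker addresses in round-robin order."""

    def __init__(self):
        self._workers = defaultdict(list)  # service name -> registered addresses
        self._next = defaultdict(int)      # service name -> next index to hand out

    def register(self, service, address):
        """Record a worker instance for a service (duplicates are ignored)."""
        if address not in self._workers[service]:
            self._workers[service].append(address)

    def pick(self, service):
        """Return the next worker address for the service, cycling through all of them."""
        workers = self._workers.get(service)
        if not workers:
            raise LookupError("no worker registered for " + service)
        index = self._next[service] % len(workers)
        self._next[service] = index + 1
        return workers[index]


registry = Registry()
registry.register("feature_extraction", "worker-a:11178")  # placeholder addresses
registry.register("feature_extraction", "worker-b:11178")
picks = [registry.pick("feature_extraction") for _ in range(3)]
```

With two workers registered, three consecutive picks visit the first worker, then the second, then the first again.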

a simplified pipeline
Services as pure actors
Future[Json] parsedResume = Source(file_bucket_path)
  -> flatMap(downloadFileFromCloud) - (Future[ByteArrayOutputStream])
  -> flatMap(convert)
       -> HTTP request to a conversion back-end (Docker conversion containers - http://engineering.opening.io/demo.html)
       return HTTPResponse (ByteArrayOutputStream)
       (Future[…] obviously, trimming down non-core info)
  -> flatMap(parsing)
       -> PDFBox (java)
       -> NER, tokenization, regex, vectors, etc
       return HTTPResponse(json)
  -> flatMap(extract_education)
       -> Percolation (Elasticsearch inverted search)
       return HTTPResponse(json - universities)
  -> fold(flatMap(portfolio links))
       -> HTTP request (screenshotting service)
       -> Response (Array[ByteArrayOutputStream])
       -> flow graph (async write screenshots to disk)
       -> ...
       return HTTPResponse(json - disk paths)
  -> flatMap(topics_classifier)
       -> return HTTPResponse(json)
  -> flatMap(salaries - regression)
       -> return HTTP Response (json)
  …
  -> flatMap(github_portfolios/dribbble/etc)
       -> clone non-forked repos (ramdisks)
       -> extract source code
       -> sample code per all identified languages
       return HTTPResponse(json)
  …
  .via(SharedKillSwitch.flow)
  .via(instrumentation - Prometheus)
  .to(Sink.foreach)
parsedResume onComplete {
  case Success(json_resume) =>
    Async store(Couchbase)
    HTTPResponse(json_resume)
  ..
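The chaining pattern in this pipeline (each stage starts only when the previous asynchronous stage has completed, and a failure anywhere short-circuits the rest) can be sketched with Python's asyncio. This is an illustration of the pattern only, not the Akka Streams code: the stage bodies, the state keys and the file path are made-up stand-ins.

```python
import asyncio


async def download(state):
    # Stand-in for the cloud download; the bytes are a placeholder.
    return {**state, "raw": b"resume bytes"}


async def convert(state):
    # Stand-in for the conversion back-end call.
    return {**state, "pdf": state["raw"].upper()}


async def parse(state):
    # Stand-in for parsing; needs the output of convert.
    return {**state, "json": {"length": len(state["pdf"])}}


async def run_pipeline(path, stages):
    """Run the stages in order, feeding each stage the state from the previous one."""
    state = {"path": path}
    for stage in stages:
        state = await stage(state)
    return state


async def safe_run(path, stages):
    """Mirror an onComplete handler: report success or failure instead of raising."""
    try:
        return ("success", await run_pipeline(path, stages))
    except Exception as error:
        return ("failure", repr(error))


outcome = asyncio.run(safe_run("bucket/resume.doc", [download, convert, parse]))
# Skipping convert makes parse fail, and the failure ends the pipeline.
failed = asyncio.run(safe_run("bucket/resume.doc", [download, parse]))
```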

task orchestration (mapAsync / for comprehensions)
Diagram labels: task pool, traffic, pipelines / business logic
Inputs: S3 download, dropbox, google drive, batch, github portfolio
Tasks: convert, parse, nlp, feature selection, embedding, regressions, ai / analytics
Akka streams:
  Flow[String, (String, ByteArrayOutputStream, Boolean)]
  Flow[(String, ByteArrayOutputStream, Boolean), (String, JSONObject, Boolean)]
  …
One single Graph describing how various Flows combine -> i/o, broadcasts, etc
All I/O (conversions, screenshots, etc) + BI/AI = ~650ms (Grafana)
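Akka Streams' mapAsync runs an asynchronous function over stream elements with a bounded number in flight and emits results in upstream order. The same behavior can be approximated in Python's asyncio as a rough sketch: `map_async` and `fake_task` are hypothetical names, and the task names are borrowed from the slide purely as sample input.

```python
import asyncio


async def map_async(parallelism, func, items):
    """Apply func to every item with at most `parallelism` calls in flight.

    Results come back in input order, regardless of which call finishes first.
    """
    semaphore = asyncio.Semaphore(parallelism)

    async def guarded(item):
        async with semaphore:
            return await func(item)

    return await asyncio.gather(*(guarded(item) for item in items))


in_flight = 0
peak = 0


async def fake_task(name):
    """Stand-in for a worker call; records how many calls overlap."""
    global in_flight, peak
    in_flight += 1
    peak = max(peak, in_flight)
    await asyncio.sleep(0.01)
    in_flight -= 1
    return name.upper()


results = asyncio.run(map_async(2, fake_task, ["convert", "parse", "nlp", "embedding"]))
```

The parallelism bound is what keeps a burst of uploads from overwhelming a slow back-end such as a conversion container.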

back-end layout
Akka cluster: workers reachable over HTTP (worker ip:port) or over Kafka (worker kafka topic); each worker has an IN and an OUT
Pipelines (shared read), Data stores
Regular workers (Akka Streams) - flows defining behavior via series of mapAsync calls

document conversion
Diagram labels: libreoffice, ramdisk, golang web server (iris), registrator, consul (check-in, discovery), fabio load balancer (http :9999 access point)
for {
  req ← createRequest(target, testFile)
  response ← Http().singleRequest(req)
  responseBody ← Unmarshal(response).to[…]
} yield responseBody

interactions
Front-end: React.JS (view), Cerebral.js, Baobab.js (state)
Back-end: websocket, xhr

why akka?
http://www.lightbend.com/activator/template/akka-distributed-workers
Akka Producer (1x)
  - Kafka consumer (topic request)
  - Kafka producer (results topic response)
  - Akka-Camel (abstracting transport)
Akka Broker
Akka executor pool
  - Akka Routers per task
  - Workers as pure transforms
  - Isolated actor system
  - Read access to Hazelcast
Diagram labels: TASK (x3), near cache (x3), Vertical scale, Horizontal scale (n)

why couchbase?
Updates as versioned inserts: Data rev. 1, Data rev. 2, … Data rev. N
“free” analytics
  - View queries
  - Javascript indexes
  - Elasticsearch replication
  - Efficient KV lookup (in-mem, eventual persistence)
Memcache-compatible bucket: cross systems shared session (node.js, php? java?)
Elasticsearch use cases
  - Search: query -> document keys
  - Percolation (inverted search): raw text -> detected education; university names (stored search), ~22k
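Percolation inverts the usual search: the queries are stored, and each incoming document is checked against them. The toy matcher below shows the idea in plain Python. It is not the Elasticsearch percolator API; the `Percolator` class, the query ids and the sample university names and resume line are hypothetical illustrations.

```python
import re


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def contains_phrase(tokens, phrase):
    """True if the phrase tokens appear consecutively inside tokens."""
    n = len(phrase)
    return any(tokens[i:i + n] == phrase for i in range(len(tokens) - n + 1))


class Percolator:
    """Stores phrase queries and reports which ones match an incoming document."""

    def __init__(self):
        self._queries = {}  # query id -> tokenized phrase

    def store(self, query_id, phrase):
        tokens = tokenize(phrase)
        if not tokens:
            raise ValueError("empty query phrase")
        self._queries[query_id] = tokens

    def percolate(self, text):
        """Return the ids of all stored queries matched by the text, sorted."""
        doc = tokenize(text)
        return sorted(qid for qid, phrase in self._queries.items()
                      if contains_phrase(doc, phrase))


percolator = Percolator()
percolator.store("tcd", "Trinity College Dublin")  # sample stored searches
percolator.store("stanford", "Stanford University")
matches = percolator.percolate("BSc Computer Science, Trinity College Dublin, 2010")
```

A real percolator indexes the stored queries so that matching does not scan all of them per document, which matters at the ~22k stored names mentioned above.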

why kafka?
S3 takes the initial “heat”
  - Sticky to data.
  - Confidentiality: private clouds
Diagram labels: web, mail, batch, cloud; request / response; job scheduling, logging, job recovery

ai
Internally, we represent both candidates and job descriptions like this:
  an industry-specific topic vector, values are series of regressions
    Web design & E-Commerce: 0.95
    Software Development: 0.86
    ….
And also like this:
  (a document vector pointing to a direction within a high-dimensional - in our case 300 - semantic space)
We looked at probability distributions too (LDA/HDP) but we actually like how well regression works here.
Figures: tsne, tsne, kohonen

vectors and similarity
Topic vectors
  Mostly supervised: skills extraction, classification, regression.
Document vectors
  Mostly unsupervised: we employ a word2vec model to syntactically map the document. We then do SVD, trim, cluster information and filter out irrelevant clusters. Information in the clean clusters contributes to the final document vector.
Similarity
  All similarity (candidates/candidates, jobs/candidates, jobs/jobs) is cosine similarity / dot product (for speed, knn approximation - Annoy (https://github.com/spotify/annoy))
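Cosine similarity is the dot product of two vectors after each has been scaled to unit length, which is why the slide treats the two as interchangeable. A minimal sketch in plain Python follows, with an exact brute-force neighbor search standing in for what Annoy approximates; the 3-dimensional vectors and the candidate names are made up for illustration (the deck's document vectors have 300 dimensions).

```python
import math


def dot(a, b):
    """Dot product of two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    return sum(x * y for x, y in zip(a, b))


def normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(dot(v, v))
    if norm == 0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in v]


def cosine_similarity(a, b):
    """Cosine of the angle between a and b: the dot product of their unit vectors."""
    return dot(normalize(a), normalize(b))


def nearest(query, candidates, k):
    """Exact k nearest neighbors by cosine similarity (brute force)."""
    q = normalize(query)
    scored = [(dot(q, normalize(v)), name) for name, v in candidates.items()]
    return [name for _, name in sorted(scored, reverse=True)[:k]]


candidates = {
    "candidate_a": [1.0, 0.0, 0.0],
    "candidate_b": [0.9, 0.1, 0.0],
    "candidate_c": [0.0, 1.0, 0.0],
}
top_two = nearest([1.0, 0.0, 0.0], candidates, 2)
```

If vectors are normalized once at write time, every later comparison is a plain dot product.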

salaries forecasts
We extract salaries from job boards and use these to infer, on average, how valuable abilities are: web developers, data scientists, etc - mostly regression and supervised (IT mapped as a tree-like structure). We then normalize this structure to resumes (lengthier text, etc), ending up forecasting salaries for candidates.
15 job boards monitored
Chart: axis ticks 25, 50, 75, 100

automation: mining
Web scrapers (Python, Scrapy) - running on Cron.
Diagram labels: elasticsearch, replication (xdcr), plugin
The scraping processes feed data to Couchbase. From there it is replicated to Elasticsearch, where a small listener plugin is triggered by each new incoming job description. The plugin relays each description to Connemara for analysis (pay scales extraction, named entities, requirements), which in turn writes the results back to the DB. This enables us to simply write a job description into the database and call it a day.

front-end
Diagram labels: web bridge, websocket access point, Connemara

unified json state
cerebral.js: controller
baobab.js: state
react.js: view layer
state tree - cursors - lazy state computations

interactions as signals

information flow
Diagram labels: signals, baobab.js state tree, state, react.js view layer; altern. view layer, altern. state tree
store / view layers = pluggable back-ends (easy switch to immutable.js/angular/etc)

batch & scale
customer facing win/osx native app (java fx)
batch upload, S3, akka streams
all i/o in ramdisks (containers over Mesos).
Input from a variety of back-ends: mail, http, kafka…

container i/o
Diagram labels: http, fabio, registrator, active tasks, event stream
Containers (several instances of each): doc-to-pdf, pdf-to-img, web-to-img, img-to-ocr

monitoring & instrumentation
Prometheus, Grafana
Same thing for us
Instrumentation: Java/Node clients
Monitoring: Systems - node exporter, Couchbase - telegraf
Alerts

references
https://github.com/haifengl/smile
https://radimrehurek.com/gensim/
https://github.com/cemoody/lda2vec
http://scikit-learn.org/stable/
https://github.com/mit-nlp/MITIE
https://www.tensorflow.org/
http://kafka.apache.org/
http://akka.io/
https://prometheus.io/
http://grafana.org/
http://camel.apache.org/
http://www.libreoffice.org/
http://mesos.apache.org/
https://www.gluster.org/
https://nodejs.org/en/
http://www.cerebraljs.com/
https://facebook.github.io/react/
http://www.couchbase.com/
https://www.elastic.co/
https://www.nginx.com/
https://www.phusionpassenger.com/
https://www.docker.com/
https://www.ansible.com/
https://www.scala-lang.org/

thank you and q and a
Adrian Mihai / @dublin_io / engineering.opening.io
