Harnessing Data for Active Business Insight

forLoop
August 22, 2016

Emeka Onu's presentation for forLoop Machine Learning

Transcript

  1. About Emeka: Data Scientist / Machine Learning Enthusiast. Previous: Support Software Engr @Afro. Current: Data Scientist @Leadcops.com; Visualizer @viz.ng; Community defender @datascience.ng
  2. BUSINESS INTELLIGENCE: A set of techniques, skills or tools used for the acquisition, transformation & interpretation of raw data into meaningful insights for profitable or competitive business purposes.
  3. Business Intelligence: A Case Study of hotels.ng
    Hotel customers complain about paying too much for ‘not so great’ services. As a data consultant, you need to deliver a performance measure that ranks hotels and supports recommendations to customers. Likes are great but only a binary representation. Customer ratings are great but only an abstract representation on a scale of 1-10.
  4. Classifying & Ranking Hotels By Customer Reviews Using NLP & K-Means Clustering
    1. Scrape Lagos hotel reviews from hotels.ng.
    2. Use Natural Language Processing to extract review polarity & subjectivity.
    3. Create a matrix of review words across all reviews (converting text into a quantitative measurement).
    4. Use K-Means to cluster reviews.
    5. Aggregate polarity, subjectivity & cluster scores to score hotel performance.
  5. Example: hotel_mapper.py
    A mapper performs the initial preparation of data for analysis (e.g. filtering, sorting, downloading) and passes its output to a reducer. Here it grabs the web content from each page of hotel listings.
  6. Example: hotel_reducer.py
    A reducer takes the mapper's output as its input & performs the actual analysis. Here the reducer takes the page content and extracts the information needed.
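
The transcript does not include the code of hotel_mapper.py or hotel_reducer.py, so the following is only a minimal sketch of the mapper/reducer pattern those two slides describe. The HTML markup, the `HOTEL_NAME` pattern and the sample pages are hypothetical stand-ins, and no network access is performed.

```python
import re
from typing import Iterable, Iterator

# Hypothetical markup: the real hotels.ng page structure is not shown in the deck.
HOTEL_NAME = re.compile(r'<h2 class="hotel-name">(.*?)</h2>')


def mapper(pages: Iterable[str]) -> Iterator[str]:
    """Prepare raw input: drop empty pages and emit each page as one line."""
    for page in pages:
        flattened = " ".join(page.split())  # collapse newlines and extra spaces
        if flattened:
            yield flattened


def reducer(lines: Iterable[str]) -> Iterator[str]:
    """Analyse mapper output: pull the hotel names out of each page."""
    for line in lines:
        for name in HOTEL_NAME.findall(line):
            yield name.strip()


# Made-up stand-in for downloaded listing pages.
sample_pages = [
    '<h2 class="hotel-name">Hotel A</h2>\n<h2 class="hotel-name">Hotel B</h2>',
    "",
    '<div><h2 class="hotel-name"> Hotel C </h2></div>',
]

hotel_names = list(reducer(mapper(sample_pages)))
print(hotel_names)
```

In a shell pipeline the two stages would live in separate scripts, with the mapper printing to standard output and the reducer reading from standard input; they are chained as generators here only to keep the sketch self-contained.
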
  7. GET DATA: Scrape data from hotels.ng and write the results to a file, polarity.csv:
    ./hotels/mapper.py | ./hotels/reducer.py | ./reviews/mapper.py | ./reviews/reducer.py | ./reviews/analyse.py > polarity.csv
    Result: 2998 reviews for 148 hotels.
  8. CHOOSING K
    GOAL: Minimize the amount of difference within each cluster, but not so far that clusters shrink to single members. K is the number of groups to classify hotels into. How do we know which K to pick? Visualize the within-group differences for K up to at least 30-50 clusters. Here, K is fine anywhere between 10 and 23 clusters. We can assign new hotels to a cluster by feeding their reviews into this matrix. The matrix can learn new words from new reviews and re-classify hotels based on updates.
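
One common way to "visualize within-group differences" is to compute the within-cluster sum of squares (WCSS) for a range of K and look for the point where it stops dropping sharply. The sketch below assumes that reading of the slide. It uses a plain standard-library version of Lloyd's k-means on made-up 2D points, not the talk's review matrix or its actual implementation.

```python
import random
from typing import List, Sequence, Tuple

Point = Sequence[float]


def squared_distance(a: Point, b: Point) -> float:
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points: List[Point], k: int, iterations: int = 50,
           seed: int = 0) -> Tuple[List[List[float]], float]:
    """Plain Lloyd's algorithm; returns (centroids, within-cluster sum of squares)."""
    rng = random.Random(seed)
    centroids = [list(p) for p in rng.sample(points, k)]
    for _ in range(iterations):
        # Assignment step: attach every point to its nearest centroid.
        clusters: List[List[Point]] = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: squared_distance(p, centroids[i]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster.
        new_centroids = []
        for i, members in enumerate(clusters):
            if members:
                new_centroids.append([sum(col) / len(members) for col in zip(*members)])
            else:
                new_centroids.append(centroids[i])  # keep an empty cluster's centroid
        if new_centroids == centroids:
            break  # converged
        centroids = new_centroids
    wcss = sum(min(squared_distance(p, c) for c in centroids) for p in points)
    return centroids, wcss


# Made-up toy data: three loose groups of points.
toy_points = [
    (0.0, 0.0), (0.2, 0.1), (0.1, 0.3),
    (5.0, 5.0), (5.1, 4.8), (4.9, 5.2),
    (10.0, 0.0), (10.2, 0.1), (9.9, -0.2),
]

wcss_by_k = {k: kmeans(toy_points, k)[1] for k in range(1, 7)}
for k, wcss in wcss_by_k.items():
    print(k, round(wcss, 3))
```

Plotting `wcss_by_k` against K gives the kind of chart the slide refers to. At the extreme where K equals the number of points, every cluster is a single point and the WCSS is zero, which is the degenerate case the goal warns against.
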
  9. DOCUMENT-TERM MATRIX
    Given a group of documents, create an N x M matrix whose entries count how many times each word occurs in each document, where N = number of text files (in this case, reviews) and M = number of distinct words across all documents. Normalize the matrix: [x - mean(x)] / standard deviation(x). Cluster the documents based on word frequency.
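
A minimal sketch of the document-term matrix and its normalization, using only the standard library. The sample reviews are made up, and two details are assumptions because the slide does not specify them: the normalization is applied per word column, and the population standard deviation is used.

```python
import re
from collections import Counter
from statistics import mean, pstdev
from typing import List, Tuple


def document_term_matrix(documents: List[str]) -> Tuple[List[str], List[List[int]]]:
    """Return (vocabulary, N x M count matrix) for N documents and M distinct words."""
    tokenized = [re.findall(r"[a-z']+", doc.lower()) for doc in documents]
    vocabulary = sorted({word for tokens in tokenized for word in tokens})
    matrix = []
    for tokens in tokenized:
        counts = Counter(tokens)
        matrix.append([counts[word] for word in vocabulary])
    return vocabulary, matrix


def standardize_columns(matrix: List[List[int]]) -> List[List[float]]:
    """Apply (x - mean(x)) / std(x) to each word column (assumed per-column)."""
    scaled_columns = []
    for column in zip(*matrix):
        mu, sigma = mean(column), pstdev(column)
        # A word with the same count in every document has sigma == 0; map it to 0.0.
        scaled_columns.append([(x - mu) / sigma if sigma else 0.0 for x in column])
    return [list(row) for row in zip(*scaled_columns)]


# Made-up sample reviews, one document per review.
reviews = [
    "Great room, great staff",
    "Room was dirty",
    "Staff was rude",
]

vocabulary, counts = document_term_matrix(reviews)
scaled = standardize_columns(counts)

print(vocabulary)
for row in counts:
    print(row)
```

Each row of `scaled` is one review expressed as a numeric vector, which is the form a clustering step such as K-Means takes as input.
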
  10. TOP 10 HOTELS BY HOTEL PERFORMANCE
    1. Beni Gold Hotel & Apartments
    2. Travel House Lekki
    3. The Belaggio Corporate Suites
    4. Piccadilly Suites
    5. Ikoyi Fairview Apartments
    6. Precinct Comfort
    7. Victoria Continental Hotel
    8. Lakeem Suites
    9. Hotel Ibis Royale
    10. Signature Suits
    TOP 10 HOTELS BY USER LIKES
    1. Glonik Hotels
    2. Beni Gold Hotel & Apartments
    3. Unilag Guest House
    4. Regent Luxury Suites
    5. Wazobia Plaza
    6. Intercontinental Lagos Hotel
    7. Hotel Ibis Royale
    8. Sheraton Hotel Lagos
    9. Eko Hotels
    10. Victoria Continental Hotel
  11. This method can help with service improvements and better professional recommendations. Deeper analysis, e.g. adding time and location variables, can identify what draws good and bad reviews from customers, and when. This can be integrated with employee data to identify ‘bad luck’ employees and the best performing employees.