JP
February 08, 2017

Automating the Extraction of Static Content and Dynamic Behaviour from e-Commerce Websites

E-commerce website owners rely heavily on analysing and summarising the behaviour of customers, making efforts to influence user actions and optimize success metrics.
Machine learning and data mining techniques have been applied in this field, greatly influencing Internet marketing activities.
When faced with a new e-commerce website, the data scientist starts a process of collecting real-time and historical data about it, analysing and transforming this data in order to gain insight into the website and its users. Data scientists commonly resort to tracking domain-specific events, which requires modifying the code of the web pages. This paper proposes an alternative approach to retrieving information from a given e-commerce website: collecting data from the site's structure, retrieving semantic information from predefined locations and analysing users' access logs, thus enabling the development of accurate models for predicting users' future behaviour. This is accomplished by applying a web mining process that covers the site's structure, content and usage in a pipeline, resulting in a web graph of the website, complemented with a categorization of each page and the website's archetypical user profiles.

Transcript

  1. Automating the Extraction of Static Content and Dynamic Behaviour from e-Commerce Websites
    João Pedro Matos Teixeira Dias
    Supervisor: Hugo Sereno Ferreira, PhD
    Faculdade de Engenharia da Universidade do Porto
    08/02/2017
  2. Outline
    1. Introduction
    2. Literature Review
    3. High-level Overview and Implementation Details
    4. Evaluation
    5. Conclusion
  3. Context
    • E-commerce is one of the most disruptive innovations in trading.
    • Marketing and advertising techniques are used to influence customers' behaviour, trying to increase sales and profits.
    • Recommendation systems are one of the techniques used.
    • Data mining and machine learning techniques have been applied to e-commerce as a way to improve e-metrics:
        • Customer retention and engagement, click-through rate, conversion rate, shopping cart abandonment rate, customer lifetime value.
  4. Motivation
    • To improve business, e-commerce owners often resort to machine learning providers to develop new algorithms and models to run on their websites.
    • At an early stage, data scientists face some challenges:
        • Getting a grasp of the website's structure and content;
        • Understanding the users' behaviour (archetypical users);
        • Dealing with the heterogeneous nature of Web data;
        • Handling semi-structured data;
        • Finding a good process for extracting and representing the data collected from the websites.
  5. Goals
    1. An all-in-one approach for collecting and processing information within the scope of an e-commerce website;
    2. A consistent and adaptable model that represents the website structure, content and users, establishing connections and relationships between the data;
    3. A reduced need to develop and apply a different approach for each website, optimizing costs and resources.
  6. Web Mining
    Web mining process overview.
    Source: Li Mei and Feng Cheng. Overview of Web mining technology and its application in e-commerce.
  7. Web Mining
    Web mining taxonomy and description.
    Source: Ahmad Siddiqui and Sultan Aljahdali. Web Mining Techniques in E-Commerce Applications.
  8. User Profiling
    • Key Components:
        • User Background;
        • User Objectives;
        • User Interests.
        Source: Djallel Bouneffouf. Towards user profile modelling in recommender system.
    • User Profiling Approaches:
        • Behaviour-based;
        • Knowledge-based.
        Source: Siping He and Meiqi Fang. Ontological user profiling on personalized recommendation in e-commerce.
    • User Profile Representation:
        • Keyword-based Profiles;
        • Ontologies Representation;
        • Semantic Network Profiles;
        • Concept-based Profiles.
        Source: Susan Gauch, Mirco Speretta, Aravind Chandramouli, and Alessandro Micarelli. User profiles for personalized information access.
  9. Desiderata
    1. Collect the data present on the website and usage records;
    2. Transform the collected data into structured data formats;
    3. Categorise the website's pages by page type and category;
    4. Identify unique users and sessions and categorise the sessions into pre-defined types;
    5. Establish new relationships between the different analysed data sources:
        • Website category tree;
        • Keyword-based user profiles;
    6. Identify archetypical website users through clustering;
    7. Build a coherent representation of the website structure, content and users as an information model.
  10. Data Collection and Processing
    Web Structure Mining
        Main challenges:
            • Spider traps;
            • URL extraction and canonicalization.
        Approach used:
            • Web Crawler.
    Web Usage Mining
        Main challenges:
            • Complex log formats and unavailable information.
        Approach used:
            • Uniformization of the log data;
            • Unique user and session identification;
            • Session categorization (length, duration and mean time per page).
    Web Content Mining
        Main challenges:
            • Heterogeneity of the websites;
            • Semi-structured nature of the data.
        Approach used:
            • Scraper with manual approach;
            • Page categorization (Page Type and Page Category).
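
The usage-mining steps on this slide (session identification, then categorization by length, duration and mean time per page) can be illustrated with a minimal Python sketch. It is not the implementation from the talk: the 30-minute inactivity timeout, the `(user, timestamp, url)` event shape and all names below are assumptions made for illustration only.

```python
from collections import defaultdict
from dataclasses import dataclass

# Assumption: 30 minutes of inactivity ends a session (a common heuristic,
# not a value taken from the slides).
SESSION_TIMEOUT_S = 30 * 60


@dataclass
class Session:
    user: str
    pages: list        # URLs in visit order
    timestamps: list   # matching request times, in seconds

    @property
    def length(self) -> int:
        return len(self.pages)

    @property
    def duration(self) -> float:
        return self.timestamps[-1] - self.timestamps[0]

    @property
    def mean_time_per_page(self) -> float:
        # Time spent on the last page is unknown from logs alone,
        # so the average is taken over page-to-page transitions.
        return self.duration / (self.length - 1) if self.length > 1 else 0.0


def sessionize(events, timeout=SESSION_TIMEOUT_S):
    """Group (user, timestamp, url) events into per-user sessions."""
    by_user = defaultdict(list)
    for user, ts, url in events:
        by_user[user].append((ts, url))

    sessions = []
    for user, hits in by_user.items():
        hits.sort()  # chronological order per user
        current = None
        for ts, url in hits:
            # Start a new session on the first hit or after a long gap.
            if current is None or ts - current.timestamps[-1] > timeout:
                current = Session(user, [], [])
                sessions.append(current)
            current.pages.append(url)
            current.timestamps.append(ts)
    return sessions


# Hypothetical, already-uniformized log events: (user id, unix time, URL).
events = [
    ("u1", 0, "/"),
    ("u1", 40, "/shoes"),
    ("u1", 100, "/shoes/42"),
    ("u1", 5000, "/"),  # more than 30 minutes later: new session
    ("u2", 10, "/bags"),
]
sessions = sessionize(events)
for s in sessions:
    print(s.user, s.length, s.duration, s.mean_time_per_page)
```

The three derived values per session are the features the slide names for session categorization; how they map onto the pre-defined session types is not stated in the transcript, so that step is left out.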
  11. Data Crossing
    Website's Category Tree
        Sources:
            • Web Graph;
            • Information extracted from pages (Categories).
        Output:
            • Tree structure with categories and sub-categories present in the website product catalogue.
    Keyword-based Profiles
        Sources:
            • User sessions;
            • Information extracted from pages (Categories and Page Types).
        Output:
            • Information about pages visited by category and by type in user profiles.
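
As a rough illustration of the keyword-based profile crossing described on this slide, the sketch below joins user sessions with per-page categories and page types and counts visits per user. The data shapes, URLs, category names and the `build_profiles` helper are hypothetical; the talk does not specify how profiles are stored.

```python
from collections import Counter, defaultdict

# Hypothetical output of the content-mining step: URL -> (page type, category).
page_info = {
    "/": ("home", None),
    "/shoes": ("product_list", "Shoes"),
    "/shoes/42": ("product", "Shoes"),
    "/bags/7": ("product", "Bags"),
}

# Hypothetical sessions from the usage-mining step: (user id, visited URLs).
sessions = [
    ("u1", ["/", "/shoes", "/shoes/42"]),
    ("u1", ["/shoes/42", "/bags/7"]),
    ("u2", ["/", "/bags/7", "/unknown"]),
]


def build_profiles(sessions, page_info):
    """Count, per user, the pages visited by category and by page type."""
    profiles = defaultdict(lambda: {"by_category": Counter(), "by_type": Counter()})
    for user, urls in sessions:
        for url in urls:
            if url not in page_info:
                continue  # page was not crawled or could not be categorised
            page_type, category = page_info[url]
            profiles[user]["by_type"][page_type] += 1
            if category is not None:
                profiles[user]["by_category"][category] += 1
    return dict(profiles)


profiles = build_profiles(sessions, page_info)
for user, profile in profiles.items():
    print(user, dict(profile["by_category"]), dict(profile["by_type"]))
```

Counting visits is only one possible weighting; time spent per category would be another, and the transcript does not say which one was used.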
  12. Pattern Discovery and Analysis
    To find the archetypical users in our user profile database, we apply the k-means clustering algorithm:
        • Keyword-based profiles clustering;
        • Session-based profiles clustering.
    • The algorithm yields a set of clusters, each containing users that are similar to one another. From these groups of similar users we can get a grasp of the archetypical website users.
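
For readers unfamiliar with k-means, the following is a minimal pure-Python sketch of the standard Lloyd iteration applied to toy profile vectors. The feature vectors (share of visits to two hypothetical categories), the choice of `k = 2` and the random initialisation are assumptions for illustration; the talk does not state which features, initialisation or library were used.

```python
import random


def kmeans(points, k, iterations=100, seed=0):
    """Plain Lloyd's k-means on lists of floats; returns (centroids, labels)."""
    rng = random.Random(seed)
    # Assumption: initial centroids are k distinct points picked at random.
    centroids = [list(p) for p in rng.sample(points, k)]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid
        # (squared Euclidean distance).
        new_labels = [
            min(range(k), key=lambda c: sum((x - y) ** 2 for x, y in zip(p, centroids[c])))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, label in zip(points, new_labels) if label == c]
            if members:  # keep the old centroid if a cluster went empty
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
        if new_labels == labels:
            break  # assignments stopped changing: converged
        labels = new_labels
    return centroids, labels


# Hypothetical keyword-based profiles: [share of "Shoes" visits, share of "Bags" visits].
profiles = [
    [0.9, 0.1], [0.8, 0.2], [1.0, 0.0],
    [0.1, 0.9], [0.2, 0.8], [0.0, 1.0],
]
centroids, labels = kmeans(profiles, k=2)
print(labels)
print(centroids)
```

Each resulting centroid can be read as one candidate archetypical user. In practice the number of clusters has to be chosen, for example with an elbow or silhouette analysis; the transcript does not say how the reported cluster counts were selected.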
  13. Data Sources and Experimental Parameters
    Niche dedicated e-commerce website
        • Sample usage data synthetically generated to mimic server-layer logs.
    General purpose e-commerce website
        • Sample usage data size: 1 000 000 events captured over approx. 2 days and 17 hours. Captured at the application layer.
  14. Results
    Niche dedicated e-commerce website
        • 2687 crawled pages with 361 344 links;
        • Category tree with 128 nodes;
        • Synthetic data used to make sanity checks on the proof-of-concept.
    General purpose e-commerce website
        • 621 303 crawled pages with 11 044 225 links;
        • Category tree with 1632 nodes;
        • 111 141 unique users with 135 056 sessions;
        • Average of 4.6 pages visited per user, with an average session time of 125.07 seconds;
        • 5 user clusters resulted from clustering by preferences and 7 clusters resulted from clustering by session.
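
A few ratios can be derived from the figures on this slide to make the two datasets easier to compare. The short Python snippet below only performs arithmetic on the reported numbers; the ratios themselves are not stated in the talk, and the dictionary layout is an illustrative choice.

```python
# Figures reported on the results slide.
sites = {
    "niche": {"pages": 2_687, "links": 361_344},
    "general": {
        "pages": 621_303,
        "links": 11_044_225,
        "users": 111_141,
        "sessions": 135_056,
    },
}

# Average number of links found per crawled page, per website.
links_per_page = {name: s["links"] / s["pages"] for name, s in sites.items()}
for name, ratio in links_per_page.items():
    print(f"{name}: {ratio:.1f} links per crawled page")

# Average number of sessions per unique user (general purpose website only).
general = sites["general"]
sessions_per_user = general["sessions"] / general["users"]
print(f"general: {sessions_per_user:.2f} sessions per unique user")
```

The niche website comes out at roughly 134 links per crawled page against roughly 18 for the general purpose one, and the general purpose website averages about 1.2 sessions per unique user over the capture window.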
  15. Final Remarks and Main Contributions
    • The application of Web Mining techniques to the web and e-commerce is not new, with a lot of research being done in this field.
    • The main contributions of this work are:
        • An all-in-one process to collect and structure data from an e-commerce website's content, structure and users.
        • Crossing of the data collected from diverse sources in order to find non-trivial relationships, enriching the process output.
        • An information model of the e-commerce website, containing the collected and structured information, including data resulting from crossing different sources and from pattern discovery tasks.
  16. Further Work
    • Improve the crawler by implementing parallelism and/or prioritisation of the frontier;
    • Identify and differentiate static from dynamic hyperlinks;
    • Carry out experiments with other kinds of web scrapers;
    • Experiment with new data crossings;
    • Apply different algorithms to find and understand the website's archetypical users;
    • Analyse the possibility of expanding this methodology beyond e-commerce websites, finding other use cases.