Automating the Extraction of Static Content and Dynamic Behaviour from e-Commerce Websites

João Pedro Matos Teixeira Dias
Supervisor: Hugo Sereno Ferreira, PhD
Faculdade de Engenharia da Universidade do Porto
08/02/2017

Outline
1. Introduction
2. Literature Review
3. High-level Overview and Implementation Details
4. Evaluation
5. Conclusion

Introduction

Context
• E-commerce is one of the most disruptive innovations in trading.
• Marketing and advertising techniques are used to influence customers' behaviour in order to increase sales and profits.
• Recommendation systems are one such technique.
• Data mining and machine learning techniques have been applied to e-commerce as a way to improve e-metrics:
  • Customer retention and engagement, click-through rate, conversion rate, shopping cart abandonment rate, customer lifetime value.

Motivation
• To improve their business, e-commerce owners often turn to machine learning providers to develop new algorithms and models to run on their websites.
• At an early stage, data scientists face some challenges:
  • Getting a grasp of the website's structure and content;
  • Understanding the users' behaviour (archetypical users);
  • Dealing with the heterogeneous nature of Web data;
  • Handling semi-structured data;
  • Finding a good process for extracting and representing the data collected from the websites.

Goals
1. An all-in-one approach for collecting and processing the information present within the scope of an e-commerce website;
2. A consistent and adaptable model that represents the website structure, content and users, establishing connections and relationships between the data;
3. A reduced need to develop and apply a different approach for each website, optimizing costs and resources.

Literature Review

Web Mining
Web mining process overview.
Li Mei and Feng Cheng. Overview of Web mining technology and its application in e-commerce.

Web Mining
Web mining taxonomy and description.
Ahmad Siddiqui and Sultan Aljahdali. Web Mining Techniques in E-Commerce Applications.

User Profiling
• Key Components:
  • User Background;
  • User Objectives;
  • User Interests.
  (Djallel Bouneffouf. Towards user profile modelling in recommender system.)
• User Profiling Approaches:
  • Behaviour-based;
  • Knowledge-based.
  (Siping He and Meiqi Fang. Ontological user profiling on personalized recommendation in e-commerce.)
• User Profile Representation:
  • Keyword-based Profiles;
  • Ontologies Representation;
  • Semantic Network Profiles;
  • Concept-based Profiles.
  (Susan Gauch, Mirco Speretta, Aravind Chandramouli, and Alessandro Micarelli. User profiles for personalized information access.)

High-level Overview and Implementation Details

Desiderata
1. Collect the data present on the website and usage records;
2. Transform the collected data into structured data formats;
3. Categorise the website's pages by page type and category;
4. Identify unique users and sessions and categorise the sessions into pre-defined types;
5. Establish new relationships between the different analysed data sources:
  • Website category tree;
  • Keyword-based user profiles;
6. Identify archetypical website users through clustering;
7. Build a coherent representation of the website structure, content and users as an information model.

Overview
Representation of the data flow, operations and outputs.

Data Collection and Processing
Web Structure Mining
• Main challenges:
  • Spider traps;
  • URL extraction and canonicalization.
• Approach used:
  • Web Crawler.
Web Usage Mining
• Main challenges:
  • Complex log formats and unavailable information.
• Approach used:
  • Normalisation of the log data into a uniform format;
  • Unique user and session identification;
  • Session categorization (length, duration and mean time per page).
Web Content Mining
• Main challenges:
  • Heterogeneity of the websites;
  • Semi-structured nature of the data.
• Approach used:
  • Scraper with a manual approach;
  • Page categorization (Page Type and Page Category).
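
The session identification and categorization step can be illustrated with a minimal sketch. It assumes that a user is keyed by IP address plus user agent and that a session ends after 30 minutes of inactivity; neither choice is stated on the slide, and the names `Session`, `sessionize` and `SESSION_TIMEOUT` are hypothetical.

```python
from collections import defaultdict
from dataclasses import dataclass

# Assumed inactivity threshold in seconds; the slides do not state one.
SESSION_TIMEOUT = 30 * 60


@dataclass
class Session:
    user: str
    timestamps: list  # page-view times in seconds, ascending

    @property
    def length(self):
        """Number of pages viewed in the session."""
        return len(self.timestamps)

    @property
    def duration(self):
        """Seconds between the first and the last page view."""
        return self.timestamps[-1] - self.timestamps[0]

    @property
    def mean_time_per_page(self):
        # Time spent on the last page is unknown, so average over the gaps.
        gaps = self.length - 1
        return self.duration / gaps if gaps else 0.0


def sessionize(events, timeout=SESSION_TIMEOUT):
    """Group (user_key, timestamp) events into sessions split on inactivity."""
    by_user = defaultdict(list)
    for user, ts in events:
        by_user[user].append(ts)

    sessions = []
    for user, stamps in by_user.items():
        stamps.sort()
        current = [stamps[0]]
        for ts in stamps[1:]:
            if ts - current[-1] > timeout:
                sessions.append(Session(user, current))
                current = []
            current.append(ts)
        sessions.append(Session(user, current))
    return sessions


# Hypothetical normalised log events: (IP|user agent, timestamp in seconds).
events = [
    ("10.0.0.1|Firefox", 0),
    ("10.0.0.1|Firefox", 40),
    ("10.0.0.1|Firefox", 100),
    ("10.0.0.1|Firefox", 4000),  # more than 30 minutes later: new session
    ("10.0.0.2|Chrome", 10),
]
sessions = sessionize(events)
for s in sessions:
    print(s.user, s.length, s.duration, s.mean_time_per_page)
```

The three properties correspond to the session features named above (length, duration and mean time per page); how they map to pre-defined session types is left open here.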

Data Crossing
Website's Category Tree
• Sources:
  • Web Graph;
  • Information extracted from pages (Categories).
• Output:
  • Tree structure with the categories and sub-categories present in the website product catalogue.
Keyword-based Profiles
• Sources:
  • User sessions;
  • Information extracted from pages (Categories and Page Types).
• Output:
  • Information about pages visited by category and by type in user profiles.
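
A rough sketch of both crossings follows. It assumes the scraper yields a page type and a category path for each URL, and it builds the tree from those paths alone (the web graph is not used here). The sample data and the names `build_category_tree` and `build_profiles` are hypothetical.

```python
from collections import Counter

# Hypothetical scraper output: URL -> (page type, category path).
pages = {
    "/": ("home", []),
    "/shoes": ("category", ["Shoes"]),
    "/shoes/running": ("category", ["Shoes", "Running"]),
    "/p/101": ("product", ["Shoes", "Running"]),
    "/p/202": ("product", ["Bags"]),
}

# Hypothetical usage data: user -> URLs visited across their sessions.
visits = {
    "u1": ["/", "/shoes", "/shoes/running", "/p/101"],
    "u2": ["/p/202", "/p/202"],
}


def build_category_tree(pages):
    """Nest the category paths into a dict-of-dicts tree."""
    tree = {}
    for _page_type, path in pages.values():
        node = tree
        for name in path:
            node = node.setdefault(name, {})
    return tree


def build_profiles(visits, pages):
    """Count, per user, the pages visited by category and by page type."""
    profiles = {}
    for user, urls in visits.items():
        by_category, by_type = Counter(), Counter()
        for url in urls:
            page_type, path = pages[url]
            by_type[page_type] += 1
            if path:
                # Assumption: credit only the most specific category.
                by_category[path[-1]] += 1
        profiles[user] = {"categories": by_category, "page_types": by_type}
    return profiles


tree = build_category_tree(pages)
profiles = build_profiles(visits, pages)
print(tree)
print(profiles["u1"])
```

Crediting only the leaf category keeps the counts simple; crediting every ancestor as well would be an equally plausible choice.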

Pattern Discovery and Analysis
To find the archetypical users in our user profile database, we apply the k-means clustering algorithm:
• Keyword-based profiles clustering;
• Session-based profiles clustering.
• The algorithm yields a set of clusters, each containing users that are similar to one another. From these groups of similar users we can get a grasp of the archetypical website users.
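
As an illustration of the clustering step, here is a plain-Python sketch of k-means (Lloyd's algorithm). It assumes each profile has already been turned into a numeric vector; the two-feature sample vectors, the value of `k` and the function name `kmeans` are hypothetical, and feature scaling is left out.

```python
import random


def kmeans(points, k, iterations=100, seed=0):
    """Plain Lloyd's algorithm; returns (centroids, labels)."""
    rng = random.Random(seed)
    # Initialise the centroids with k distinct sample points.
    centroids = [list(p) for p in rng.sample(points, k)]
    labels = None
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        new_labels = [
            min(
                range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centroids[c])),
            )
            for p in points
        ]
        if new_labels == labels:
            break  # assignments are stable, so the centroids are too
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = [sum(dim) / len(members) for dim in zip(*members)]
    return centroids, labels


# Hypothetical profile vectors: [product pages viewed, category pages viewed].
profile_vectors = [[9, 1], [8, 2], [10, 0], [1, 9], [0, 10], [2, 8]]
centroids, labels = kmeans(profile_vectors, k=2)
print(labels)
print(centroids)
```

Each resulting centroid can be read as one archetypical user: in this toy data, one group that mostly views product pages and one that mostly browses category pages.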

Website Information Model
An information model for representing the website data.

Evaluation

Data Sources and Experimental Parameters
Niche dedicated e-commerce website
• Sample usage data synthetically generated to mimic server-layer logs.
General purpose e-commerce website
• Sample usage data size: 1 000 000 events captured over approx. 2 days and 17 hours.
• Captured at the application layer.

Results
Niche dedicated e-commerce website
• 2687 crawled pages with 361 344 links;
• Category tree with 128 nodes;
• Synthetic data used to run sanity checks on the proof-of-concept.
General purpose e-commerce website
• 621 303 crawled pages with 11 044 225 links;
• Category tree with 1632 nodes;
• 111 141 unique users with 135 056 sessions;
• Average of 4.6 pages visited per user, with an average session time of 125.07 seconds.
• Clustering by preferences resulted in 5 user clusters and clustering by session resulted in 7 clusters.
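
A couple of ratios can be derived from the figures above. The short calculation below uses only the reported counts; the slide does not say whether the link totals include duplicate links, so the per-page figure is only a rough average.

```python
# Counts reported on the Results slide.
sites = {
    "niche": {"pages": 2_687, "links": 361_344},
    "general": {
        "pages": 621_303,
        "links": 11_044_225,
        "users": 111_141,
        "sessions": 135_056,
    },
}

# Average number of links per crawled page for each website.
links_per_page = {name: s["links"] / s["pages"] for name, s in sites.items()}

# Average number of sessions per unique user (general purpose website only).
sessions_per_user = sites["general"]["sessions"] / sites["general"]["users"]

for name, ratio in links_per_page.items():
    print(f"{name}: {ratio:.1f} links per crawled page")
print(f"general: {sessions_per_user:.2f} sessions per unique user")
```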

Conclusions

Final Remarks and Main Contributions
• The application of Web Mining techniques to the web and e-commerce is not new, and a lot of research has been done in this field.
• The main contributions of this work are:
  • An all-in-one process to collect and structure data from an e-commerce website's content, structure and users.
  • Crossing of the data collected from diverse sources in order to find non-trivial relationships, enriching the process output.
  • An information model of the e-commerce website, containing the collected and structured information, including data resulting from crossing different sources and from pattern discovery tasks.

Further Work
• Improve the crawler by implementing parallelism and/or prioritisation of the frontier;
• Identify and differentiate static from dynamic hyperlinks;
• Carry out experiments with other kinds of web scrapers;
• Experiment with new data crossings;
• Apply different algorithms to find and understand the website's archetypical users;
• Analyse the possibility of expanding this methodology beyond e-commerce websites, finding other use cases.

Thank you for your attention. 08/02/2017
