Slide 1

Slide 1 text

Data to Discovery: Unveiling Clustering in BERTopic Topic Modeling
Abhiram Ravikumar, ML Engineer at Collinson
Jaspal Singh, Lead Data Scientist at Collinson
Conf42 | 18th May 2023

Slide 2

Slide 2 text

Abhiram Ravikumar and Jaspal Singh | Conf42 | 18th May, 2023
Who are we?

Abhiram Ravikumar
● Cloud Machine Learning Engineer, Collinson
● MSc in Data Science, King’s College London
● Data Science Research Fellow, SAP Labs (ex)
● LinkedIn Learning Instructor
● Volunteer at DataKind Bengaluru
● Loves to play badminton and listen to 80’s rock music
● Twitter: @abhi12ravi

Jaspal Singh
● Lead Data Scientist, Collinson
● Expertise: AI, Python, AWS and Data Products
● CBA, Advanced Business Analytics, Indian School of Business
● Loves to play football and favourite football club is Arsenal
● Twitter: @jaspalsingh26

Slide 3

Slide 3 text

Agenda
● Topic Modeling Use Case
● Why BERTopic?
● BERTopic end-to-end flow
● Clustering (HDBSCAN)
● Dataset Description (Amazon Alexa reviews)
● Run-through of topic modeling on Google Colab
● Conclusion and Future Scope

Slide 4

Slide 4 text

Topic Modeling Use Case
[Diagram: Alexa Echo Dot -> Topic Modeling -> Topic 1, Topic 2, ..., Topic n]

Slide 5

Slide 5 text

Why BERTopic?
● Works on unstructured data
● Takes advantage of transformer models
● Offers modularity
● Contextual embeddings
● Flexible structure
● New advancements in clustering can be adapted easily
● c-TF-IDF extraction of topic representations
Source: Clustering in BERTopic
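
The c-TF-IDF idea mentioned above can be sketched in a few lines. This is a minimal illustration, assuming the formula from the BERTopic paper, W(t, c) = tf(t, c) * log(1 + A / tf(t)), where A is the average number of words per class. Raw counts are used as term frequency here, and the function name `c_tf_idf` and the toy clusters are hypothetical; the library's own implementation may differ in details such as normalization.

```python
import math
from collections import Counter


def c_tf_idf(docs_per_class):
    """Score terms per class. docs_per_class maps a class label to a list
    of tokenized documents (lists of strings)."""
    # All documents of one class are pooled into a single bag of words.
    class_tf = {
        label: Counter(token for doc in docs for token in doc)
        for label, docs in docs_per_class.items()
    }
    # Frequency of each term across all classes.
    total_tf = Counter()
    for tf in class_tf.values():
        total_tf.update(tf)
    # A: average number of words per class.
    avg_words = sum(total_tf.values()) / len(class_tf)
    return {
        label: {
            term: freq * math.log(1 + avg_words / total_tf[term])
            for term, freq in tf.items()
        }
        for label, tf in class_tf.items()
    }


# Hypothetical clusters of tokenized reviews (not taken from the talk).
clusters = {
    "sound": [["alexa", "great", "sound"], ["sound", "quality"]],
    "setup": [["alexa", "easy", "setup"], ["setup", "app"]],
}
scores = c_tf_idf(clusters)
top_sound_term = max(scores["sound"], key=scores["sound"].get)
```

In this toy input, "alexa" appears in both clusters, so it scores lower within each cluster than a word of the same count that is unique to that cluster.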

Slide 6

Slide 6 text

BERTopic End-to-End Flow
[Flow diagram; input: untokenized reviews]
Source: DataHour by Bharath Kumar Bolla

Slide 7

Slide 7 text

Clustering
● HDBSCAN
● K-Means
● cuML HDBSCAN
● sklearn algorithms
● Agglomerative clustering
Credits: Clustering in BERTopic

Slide 8

Slide 8 text

Dataset Description
Amazon Alexa Reviews dataset - Kaggle

rating | date | variation | verified_reviews | feedback
5 | 31-Jul-18 | Charcoal Fabric | Love my Echo! | 1
5 | 31-Jul-18 | Charcoal Fabric | Loved it! | 1
4 | 31-Jul-18 | Walnut Finish | Sometimes while playing a game, you can answer a question correctly but Alexa says you got it wrong and answers the same as you. I like being able to turn lights on and off while away from home. | 1
5 | 31-Jul-18 | Charcoal Fabric | I have had a lot of fun with this thing. My 4 yr old learns about dinosaurs, i control the lights and play games like categories. Has nice sound when playing music as well. | 1
5 | 31-Jul-18 | Charcoal Fabric | Music | 1
5 | 31-Jul-18 | Heather Gray Fabric | I received the echo as a gift. I needed another Bluetooth or something to play music easily accessible, and found this smart speaker. Can’t wait to see what else it can do. | 1
3 | 31-Jul-18 | Sandstone Fabric | Without having a cellphone, I cannot use many of her features. I have an iPad but do not see that of any use. It IS a great alarm. If u r almost deaf, you can hear her alarm in the bedroom from out in the living room, so that is reason enough to keep her.It is fun to ask random questions to hear her response. She does not seem to be very smartbon politics yet. | 1
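
To feed reviews like these into a topic model, only the `verified_reviews` column is needed. The sketch below is one possible way to extract it with the standard library. It assumes a tab-separated layout with the column names shown on the slide; the inline sample (three rows copied from the slide) and the helper name `load_reviews` are illustrative stand-ins for the real file.

```python
import csv
import io

# Three rows copied from the slide, laid out as tab-separated text.
# The tab delimiter is an assumption about the file format.
SAMPLE_TSV = (
    "rating\tdate\tvariation\tverified_reviews\tfeedback\n"
    "5\t31-Jul-18\tCharcoal Fabric\tLove my Echo!\t1\n"
    "5\t31-Jul-18\tCharcoal Fabric\tLoved it!\t1\n"
    "5\t31-Jul-18\tCharcoal Fabric\tMusic\t1\n"
)


def load_reviews(handle):
    """Return the non-empty review texts from an open tab-separated file."""
    reader = csv.DictReader(handle, delimiter="\t")
    return [
        row["verified_reviews"].strip()
        for row in reader
        if row["verified_reviews"].strip()
    ]


reviews = load_reviews(io.StringIO(SAMPLE_TSV))
```

With a real file, the same function could be called on `open(path, newline="", encoding="utf-8")` instead of the in-memory sample.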

Slide 9

Slide 9 text

Hands-on: BERTopic

Slide 10

Slide 10 text

Diving Deep into HDBSCAN

Slide 11

Slide 11 text

What is HDBSCAN?

Slide 12

Slide 12 text

HDBSCAN is a long acronym AND a clustering algorithm!
Hierarchical Density-Based Spatial Clustering of Applications with Noise

Slide 13

Slide 13 text

To understand HDBSCAN we need to know DBSCAN
• Clusters based on density.
• Circles/hyperspheres around data points of fixed radius “epsilon”
• Robust, flexible, and outlier-resistant.
• No predefined number of clusters
• Need to define radius of the circle
• SLOW!
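
The DBSCAN behaviour listed above can be made concrete with a small pure-Python sketch. It is a simplified textbook version, not the implementation used in the talk: the names `dbscan`, `region_query`, `eps` and `min_pts` are illustrative, and the neighbour search is a brute-force scan over all points.

```python
import math

NOISE = -1


def region_query(points, i, eps):
    """Indices of all points within distance eps of point i (including i)."""
    return [j for j, q in enumerate(points) if math.dist(points[i], q) <= eps]


def dbscan(points, eps, min_pts):
    """Return one label per point: 0, 1, ... for clusters, NOISE for outliers."""
    labels = [None] * len(points)
    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue  # already assigned or marked as noise
        neighbors = region_query(points, i, eps)
        if len(neighbors) < min_pts:
            labels[i] = NOISE  # not a core point; may become a border point later
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in neighbors if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point: joins but does not expand
            if labels[j] is not None:
                continue
            labels[j] = cluster
            j_neighbors = region_query(points, j, eps)
            if len(j_neighbors) >= min_pts:
                queue.extend(j_neighbors)  # core point: keep growing the cluster
    return labels


# Two tight squares and one far-away point (made-up data).
points = [(0, 0), (0, 1), (1, 0), (1, 1),
          (10, 10), (10, 11), (11, 10), (11, 11),
          (50, 50)]
labels = dbscan(points, eps=1.5, min_pts=3)
```

The number of clusters is never passed in, but the result depends on the fixed radius `eps`, which is the limitation the next slide addresses.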

Slide 14

Slide 14 text

What if there was no fixed radius?
• No fixed radius
• Helps identify dense region
• FAST!

Slide 15

Slide 15 text

K-NN algorithm to define radius
[Figure labels: Core Distance, Mutual Reachability Distance, K=5]
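
The two quantities named on this slide can be sketched as follows. The core distance of a point is taken here as the distance to its k-th nearest neighbour, and the mutual reachability distance between two points as the largest of their two core distances and their direct distance. Whether the point counts as its own neighbour varies between implementations; this sketch excludes it, which is an assumption. The function names and the sample points are illustrative, and k=2 is used instead of the slide's K=5 only because the sample is tiny.

```python
import math


def core_distance(points, i, k):
    """Distance from point i to its k-th nearest neighbour (i itself excluded)."""
    dists = sorted(math.dist(points[i], q) for j, q in enumerate(points) if j != i)
    return dists[k - 1]


def mutual_reachability(points, a, b, k):
    """max(core distance of a, core distance of b, direct distance a to b)."""
    return max(
        core_distance(points, a, k),
        core_distance(points, b, k),
        math.dist(points[a], points[b]),
    )


# A dense square plus one isolated point (made-up data).
pts = [(0, 0), (0, 1), (1, 0), (1, 1), (5, 5)]
dense_pair = mutual_reachability(pts, 0, 1, k=2)
sparse_pair = mutual_reachability(pts, 3, 4, k=2)
```

Points in dense regions keep their original distances, while any pair involving the isolated point is pushed further apart, because that point's core distance is large.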

Slide 16

Slide 16 text

Minimum spanning tree finds density and hierarchy
• Identifying connected components
• Hierarchical structure
• Clustering interpretation
• Efficient computation
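
A minimum spanning tree over the points can be built with Prim's algorithm, sketched below. The helper name `minimum_spanning_tree` and the sample data are illustrative. To keep the example self-contained, plain Euclidean distance is used as the edge weight; in HDBSCAN the weight would be the mutual reachability distance instead.

```python
import math


def minimum_spanning_tree(n, weight):
    """Prim's algorithm on a complete graph with nodes 0..n-1.

    weight(a, b) returns the edge weight between nodes a and b.
    Returns a list of (parent, node, weight) edges.
    """
    # For every node outside the tree: (cheapest known edge weight, tree endpoint).
    best = {j: (weight(0, j), 0) for j in range(1, n)}
    edges = []
    while best:
        # Attach the outside node with the cheapest connection to the tree.
        node = min(best, key=lambda j: best[j][0])
        w, parent = best.pop(node)
        edges.append((parent, node, w))
        # The new tree node may offer cheaper connections to the remaining nodes.
        for other in best:
            candidate = weight(node, other)
            if candidate < best[other][0]:
                best[other] = (candidate, node)
    return edges


# Three close points and one far away (made-up data).
pts = [(0, 0), (0, 1), (0, 2), (10, 0)]
mst = minimum_spanning_tree(len(pts), lambda a, b: math.dist(pts[a], pts[b]))
longest_edge = max(mst, key=lambda e: e[2])
```

Removing the longest edges first splits the tree into connected components, which is how the tree exposes both the dense groups and their hierarchy.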

Slide 17

Slide 17 text

Density Based Spatial Clustering

Slide 18

Slide 18 text

Stability score "λ"
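
The slide gives only the name of the score, so the following is a sketch based on the usual description in the HDBSCAN documentation, not on the talk itself. There, λ is defined as 1 / distance, and a cluster's stability is the sum over its points of (λ at which the point leaves the cluster minus λ at which the cluster was born). A cluster is then preferred over its children when its stability exceeds the sum of theirs. The function names and the numbers are hypothetical.

```python
def cluster_stability(birth_distance, leave_distances):
    """Sum of (lambda_point - lambda_birth) over the points of a cluster.

    birth_distance: distance threshold at which the cluster first appears.
    leave_distances: for each point, the (smaller) distance threshold at
    which that point drops out of the cluster.
    """
    lambda_birth = 1.0 / birth_distance
    return sum(1.0 / d - lambda_birth for d in leave_distances)


def parent_is_selected(parent_stability, child_stabilities):
    """Keep the parent cluster if it is more stable than its children combined."""
    return parent_stability > sum(child_stabilities)


# Hypothetical cluster: born at distance 2.0, three points leaving later.
stability = cluster_stability(2.0, [1.0, 1.0, 0.5])
```

Points that stay in a cluster down to small distances (large λ) contribute the most, so long-lived, dense clusters receive high scores.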

Slide 19

Slide 19 text

Stability score "λ"

Slide 20

Slide 20 text

Final Clusters

Slide 21

Slide 21 text

HDBSCAN steps
1. Transform the space as per density
2. Build minimum spanning tree
3. Construct cluster hierarchy
4. Condense the hierarchy based on min cluster size
5. Extract stable clusters from the condensed tree
Source: How HDBSCAN works
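
Steps 3 and 4 above can be illustrated with a short union-find sketch: spanning-tree edges are processed from shortest to longest, and each merge records the sizes of the two groups it joins. Read in reverse, a merge is a split; a split counts as a real division into two clusters only when both sides reach the minimum cluster size. This is a simplified illustration, not the library's implementation; the names `single_linkage_merges` and `min_cluster_size` and the edge list are made up for the example.

```python
def single_linkage_merges(n, edges):
    """Build the merge hierarchy from spanning-tree edges (a, b, weight).

    Returns a list of (weight, larger_group_size, smaller_group_size),
    ordered from the shortest edge to the longest.
    """
    parent = list(range(n))
    size = [1] * n

    def find(x):
        # Follow parent links to the group representative, compressing the path.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    merges = []
    for a, b, w in sorted(edges, key=lambda e: e[2]):
        ra, rb = find(a), find(b)
        if ra == rb:
            continue  # cannot happen for a tree, kept as a safeguard
        if size[ra] < size[rb]:
            ra, rb = rb, ra
        merges.append((w, size[ra], size[rb]))
        parent[rb] = ra
        size[ra] += size[rb]
    return merges


# Hypothetical spanning tree: two tight groups, one straggler, one long bridge.
tree_edges = [(0, 1, 1.0), (1, 2, 1.0), (3, 4, 1.0), (4, 5, 1.0),
              (5, 6, 3.0), (2, 3, 8.0)]
merges = single_linkage_merges(7, tree_edges)

min_cluster_size = 3
true_splits = [m for m in merges if min(m[1], m[2]) >= min_cluster_size]
```

In this example only the long bridge separates two groups of at least three points; every other merge just adds single points to an existing group, which the condensed tree treats as points falling out of a cluster and not as a new cluster.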

Slide 22

Slide 22 text

HDBSCAN – performance comparison
Source: http://hdbscan.readthedocs.io

Slide 23

Slide 23 text

HDBSCAN – strengths and weaknesses
● HDBSCAN focuses on high-density clustering -> reduces noise clustering problem
● Min-cluster-size parameter can be set, relatively fast
● Difficulty in handling large amounts of data
  ○ cuML HDBSCAN speeds up HDBSCAN using GPU acceleration
● Read: Comparing Python Clustering Algorithms https://hdbscan.readthedocs.io/en/latest/comparing_clustering_algorithms.html

Slide 24

Slide 24 text

Conclusion and Future Scope
● BERTopic – modular, scalable, flexible
● Clustering is modular
● BERTopic assumption: Every document contains only one topic
● Large Language Models (ChatGPT, etc.) could impact topic modeling
● Be cautious of inherent biases, and ethical issues with LLMs
● Online topic modeling
● Cloud vendors
● Operationalization

Slide 25

Slide 25 text

References
1. The BERTopic Algorithm
2. Kaggle Dataset
3. BERTopic paper
4. HDBSCAN package
5. Pinecone HDBSCAN notebook

Slide 26

Slide 26 text

Session Resources
1. Slides on Speaker Deck: https://speakerdeck.com/abhi12ravi/
2. Notebooks on GitHub: https://github.com/abhi12ravi/BERTopic_Conf42/

Slide 27

Slide 27 text

Thank you!