PyData Berlin 2023 - Speaker Deck

Slide 1

Slide 1 text

You are what you read
Building a personal internet frontpage with spaCy and Prodigy
Victoria Slocum, PyData Berlin 2023

Slide 2

Slide 2 text

Introductions

Victoria Slocum, Developer Advocate

Slides 3-5: Intro to spaCy, Prodigy, and the project
Slides 6-10: Customizing the annotation experience
Slides 11-14: Building an NLP pipeline with spaCy
Slides 15-19: Setting up your frontpage and future plans

(blog.)victoriaslocum.com
twitter.com/victorialslocum
linkedin.com/in/victorialslocum
github.com/victorialslocum

Slide 3

Slide 3 text

What’s spaCy?

spaCy provides a modular architecture to construct NLP pipelines that can be tailored towards individual project needs.

spaCy pipeline: Text -> tokenizer -> tagger -> parser -> ... -> Doc

config.cfg:

    [nlp]
    lang = "en"
    pipeline = ["tok2vec", "tagger", "parser"]
    batch_size = 1000
    disabled = []
    before_creation = null
    after_creation = null
    after_pipeline_creation = null
    tokenizer = {"@tokenizers": "spacy.Tokenizer.v1"}
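The pipeline idea above can be sketched in plain Python: a tokenizer turns text into a document object, and each following component adds annotations to it. This is a conceptual toy, not spaCy's API; `ToyDoc`, `length_tagger`, and `upper_flagger` are hypothetical names made up for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class ToyDoc:
    """Hypothetical stand-in for a document object that collects annotations."""
    text: str
    tokens: list = field(default_factory=list)
    annotations: dict = field(default_factory=dict)


def tokenizer(text):
    # First step: text in, document out (whitespace splitting only).
    return ToyDoc(text=text, tokens=text.split())


def length_tagger(doc):
    # A component receives the document, annotates it, and returns it.
    doc.annotations["lengths"] = [len(token) for token in doc.tokens]
    return doc


def upper_flagger(doc):
    doc.annotations["is_upper"] = [token.isupper() for token in doc.tokens]
    return doc


def run_pipeline(text, components):
    """Apply the tokenizer, then every component in the listed order."""
    doc = tokenizer(text)
    for component in components:
        doc = component(doc)
    return doc


doc = run_pipeline("spaCy builds NLP pipelines", [length_tagger, upper_flagger])
print(doc.tokens)
print(doc.annotations)
```

Because every component has the same shape (document in, document out), components can be added, removed, or reordered by editing only the list, which is the property the `pipeline` setting in config.cfg relies on.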

Slide 4

Slide 4 text

What’s Prodigy?

Prodigy is a scriptable annotation tool. We provide a lot of out-of-the-box solutions but still allow for extensive customization through custom recipe functions.

Slide 5

Slide 5 text

The frontpage project

New data, every day: The project automatically updates the page and datasets every day.
Do it your own way: Annotate your own data on topics you’re interested in to get a personalized frontpage.
arxiv API for Python: We’re using the arxiv PyPI library to get the papers using a search query.

Slide 6

Slide 6 text

Data, data, and data

Search query:

    'ti:dataset OR ti:corpus OR ti:database OR abs:"a new dataset"'

data.jsonl:

    {"title":"RGB Arabic Alphabets Sign Language Dataset", "description":"This paper introduces the RGB Arabic Alphabet Sign Language (AASL) dataset. ...", "tags":["arxiv","dataset"], "meta":{"query":"ti:dataset OR ti:corpus OR ti:database OR abs:\"a new dataset\""}}
    {...}
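As a minimal sketch of how one line of data.jsonl could be written and read back, the snippet below serializes a record with Python's json module. The field names follow the record on the slide. The `make_record` helper is hypothetical, and fetching the paper (done with the arxiv library in the project) is left out.

```python
import json

QUERY = 'ti:dataset OR ti:corpus OR ti:database OR abs:"a new dataset"'


def make_record(title, description, query, tag):
    """Build one record shaped like the slide's example (hypothetical helper)."""
    return {
        "title": title,
        "description": description,
        "tags": ["arxiv", tag],
        "meta": {"query": query},
    }


record = make_record(
    "RGB Arabic Alphabets Sign Language Dataset",
    "This paper introduces the RGB Arabic Alphabet Sign Language (AASL) dataset. ...",
    QUERY,
    "dataset",
)

# JSONL stores one JSON object per line; json.dumps escapes the quotes inside the query.
line = json.dumps(record)
restored = json.loads(line)
print(restored["meta"]["query"] == QUERY)
```

Keeping the query in `meta` means every record remembers which search produced it, which is useful later when the same file mixes several topics.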

Slide 7

Slide 7 text

Annotating the data

pattern.jsonl:

    {
      "label": "dataset",
      "pattern": [
        {"LEMMA": {"IN": ["present", "introduce", "propose", "publish", "provide", "derive", "construct", "create", "develop", "contribute", "release"]}, "POS": "VERB"},
        {"OP": "{,6}"},
        {"LEMMA": {"NOT_IN": ["performance", "result", "benchmark", "evaluate", "algorithm", "framework", "technique", "workflow"]}},
        {"OP": "{,6}"},
        {"LOWER": {"IN": ["database", "dataset", "corpus"]}}
      ]
    }

This paper introduces the RGB Arabic Alphabet Sign Language (AASL) dataset. AASL comprises 7,856 raw and fully labelled RGB images of the Arabic sign language alphabets, which to our best knowledge is the first publicly available RGB dataset. The dataset is aimed to help those interested in developing real-life Arabic sign language classification models. AASL was collected from more than 200 participants and with different settings such as lighting, background, image orientation, image size, and image resolution. Experts in the field supervised, validated and filtered the collected images to ensure a high-quality dataset. AASL is made available to the public on Kaggle.
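To make the logic of the pattern easier to follow, here is a rough plain-Python approximation of what it looks for: a trigger verb, then up to 13 tokens of which at least one is not on the excluded list, then "database", "dataset", or "corpus". This is not spaCy's Matcher. It assumes word-only tokenization, replaces lemmatization with a crude suffix check, and ignores the `POS: VERB` constraint, so its matches can differ from what the real pattern returns.

```python
import re

TRIGGERS = {"present", "introduce", "propose", "publish", "provide", "derive",
            "construct", "create", "develop", "contribute", "release"}
EXCLUDED = {"performance", "result", "benchmark", "evaluate", "algorithm",
            "framework", "technique", "workflow"}
TARGETS = {"database", "dataset", "corpus"}
MAX_BETWEEN = 13  # up to 6 tokens + 1 required token + up to 6 tokens


def has_lemma(token, lemmas):
    """Crude lemma check: try the token itself and with 1 or 2 final letters removed."""
    return any(candidate in lemmas for candidate in (token, token[:-1], token[:-2]))


def find_matches(text):
    """Return the first matching span (as lowercased text) for each trigger verb."""
    tokens = re.findall(r"\w+", text.lower())
    spans = []
    for i, token in enumerate(tokens):
        if not has_lemma(token, TRIGGERS):
            continue
        # At least one and at most MAX_BETWEEN tokens sit between trigger and target.
        for j in range(i + 2, min(i + 2 + MAX_BETWEEN, len(tokens))):
            if tokens[j] not in TARGETS:
                continue
            between = tokens[i + 1:j]
            if any(not has_lemma(t, EXCLUDED) for t in between):
                spans.append(" ".join(tokens[i:j + 1]))
                break
    return spans


sentence = "This paper introduces the RGB Arabic Alphabet Sign Language (AASL) dataset."
print(find_matches(sentence))
```

In the real project the pattern file is consumed by spaCy, which also supplies proper lemmas and part-of-speech tags; the sketch only mirrors the shape of the rule.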

Slide 8

Slide 8 text

Combining rules & ML

Writing rules gives you a better understanding of, and more control over, your pipeline and process. This ensures consistency and can produce better output.

Diagrams: Linear pipeline workflow; Iterative pipeline workflow

Slide 9

Slide 9 text

Building a custom recipe

Custom HTML to display the title, the abstract with pattern-matched highlighting, and a link for any further information
Meta tag for the query provided to arxiv
Prefer data entries where the abstract has a matched pattern
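The last point, preferring entries with a matched pattern, can be sketched as a reordering step applied to the examples before they are shown for annotation. The snippet assumes each example is a dict and that pattern matches are stored under a `spans` key; both the field layout and the `prefer_matched` helper are assumptions for illustration, not the project's recipe code.

```python
def prefer_matched(examples):
    """Put examples with at least one matched span first, keeping input order otherwise."""
    # sorted() is stable and False sorts before True, so matched examples move to the front.
    return sorted(examples, key=lambda eg: not eg.get("spans"))


examples = [
    {"title": "Paper A", "spans": []},
    {"title": "Paper B", "spans": [{"start": 11, "end": 21, "label": "dataset"}]},
    {"title": "Paper C"},
]

ordered = [eg["title"] for eg in prefer_matched(examples)]
print(ordered)  # ['Paper B', 'Paper A', 'Paper C']
```

Sorting needs the whole batch in memory; for a long or endless stream, a variant that buffers a fixed number of examples at a time would be the usual alternative.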

Slide 10

Slide 10 text

Customizing your solutions

[Figure: annotation examples with the labels SOURCE, CUE, and CONTENT]

https://explosion.ai/blog/guardian

Slide 11

Slide 11 text

spaCy’s config system

config.cfg:

    [nlp]
    lang = "en"
    pipeline = ["textcat_multilabel"]
    batch_size = 1000
    disabled = []
    before_creation = null
    after_creation = null
    after_pipeline_creation = null
    tokenizer = {"@tokenizers": "spacy.Tokenizer.v1"}

    [components]

    [components.textcat_multilabel]
    factory = "textcat_multilabel"
    scorer = {"@scorers": "spacy.textcat_multilabel_scorer.v1"}
    threshold = 0.5

The single source of truth: includes all settings and records all defaults
Customize the architecture by swapping out components
Preset with sensible defaults to get you started
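Since the config is a plain text file with named sections and JSON-like values, its settings are easy to inspect. The sketch below reads a shortened copy of the config with Python's standard library only. spaCy loads the file with its own config system, so `configparser` plus `json` is an approximation used here for illustration, and `load_settings` is a hypothetical helper.

```python
import configparser
import json

CONFIG_TEXT = """
[nlp]
lang = "en"
pipeline = ["textcat_multilabel"]
batch_size = 1000
before_creation = null

[components]

[components.textcat_multilabel]
factory = "textcat_multilabel"
threshold = 0.5
"""


def load_settings(text):
    """Parse each section into a dict, decoding every value as JSON."""
    parser = configparser.ConfigParser(delimiters=("=",), interpolation=None)
    parser.read_string(text)
    return {
        section: {key: json.loads(value) for key, value in parser[section].items()}
        for section in parser.sections()
    }


settings = load_settings(CONFIG_TEXT)
print(settings["nlp"]["pipeline"])
print(settings["components.textcat_multilabel"]["threshold"])
```

Every component named in `pipeline` has its own `[components.<name>]` section, so swapping a component means changing one list entry and one section.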

Slide 12

Slide 12 text

spaCy textcat model

https://spacy.io/usage/training#quickstart

Slide 13

Slide 13 text

Training

Train model:

    $ python -m spacy train spacy_data/config.cfg --paths.train spacy_data/train.spacy --paths.dev spacy_data/dev.spacy --output training/

Load model:

    nlp = spacy.load("training/model-best")

Slide 14

Slide 14 text

Built-in solutions

Using preconfigured building blocks
Has a set of sensible defaults
Easy to get started and iterate moving forward

More on this: https://explosion.ai/blog/spacy-design-concepts

Slide 15

Slide 15 text

spaCy project system

https://spacy.io/usage/projects

Your frontpage:

    $ spacy project run new-frontpage

Workflow steps shown on the slide: download, preprocess, spacy-train, content, build, data-to-spacy

project.yml:

    - name: "download"
      help: "Download data from sources."
      script:
        - >-
          python scripts/download_arxiv.py --query 'ti:dataset OR ti:corpus OR ti:database OR abs:"a new dataset"' --tag dataset

Slide 16

Slide 16 text

Creating your frontpage

https://github.com/victorialslocum/frontpage

frontpage.yaml:

    description: |
      ...
    sections:
      - name: "Datasets on Arxiv"
        tags: ["arxiv", "dataset"]
        n: 12
        classes:
          - name: dataset
            threshold: 0.5
      - name: "Prompt Engineering on Arxiv"
        tags: ["arxiv", "prompteng"]
        n: 12
        classes:
          - name: prompteng
            threshold: 0.5

Automatically runs spaCy project commands daily:

    $ python -m spacy project run download
    $ python -m spacy project run new-frontpage-github
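One plausible reading of a section entry is: keep articles that carry the section's tags, whose class score reaches the threshold, and show at most `n` of them. The sketch below implements that reading in plain Python. The slide does not spell out these rules, so the tag matching, the `>=` comparison, the ordering by score, and the `cats` field name are all assumptions, and the articles and scores are made-up illustration data.

```python
SECTION = {
    "name": "Datasets on Arxiv",
    "tags": ["arxiv", "dataset"],
    "n": 12,
    "classes": [{"name": "dataset", "threshold": 0.5}],
}


def select_articles(articles, section):
    """Filter by tags and class thresholds, then return the top n by score."""
    selected = []
    for article in articles:
        has_tags = set(section["tags"]) <= set(article["tags"])
        passes = all(
            article["cats"].get(cls["name"], 0.0) >= cls["threshold"]
            for cls in section["classes"]
        )
        if has_tags and passes:
            selected.append(article)
    # Rank by the score of the section's first class, highest first.
    first_class = section["classes"][0]["name"]
    selected.sort(key=lambda a: a["cats"].get(first_class, 0.0), reverse=True)
    return selected[: section["n"]]


# Made-up articles and scores, for illustration only.
articles = [
    {"title": "Paper A", "tags": ["arxiv", "dataset"], "cats": {"dataset": 0.91}},
    {"title": "Paper B", "tags": ["arxiv", "dataset"], "cats": {"dataset": 0.32}},
    {"title": "Paper C", "tags": ["arxiv", "prompteng"], "cats": {"prompteng": 0.88}},
    {"title": "Paper D", "tags": ["arxiv", "dataset"], "cats": {"dataset": 0.67}},
]

titles = [a["title"] for a in select_articles(articles, SECTION)]
print(titles)  # ['Paper A', 'Paper D']
```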

Slide 17

Slide 17 text

Slide 18

Slide 18 text

Future plans

Better topic customization
Upload your own data for model training
Page customization
Text classification trick

textcat score: .70
This paper introduces the RGB Arabic Alphabet Sign Language (AASL) dataset. AASL comprises 7,856 raw and fully labelled RGB images of the Arabic sign language alphabets, which to our best knowledge is the first publicly available RGB dataset.

textcat score: .65
AASL comprises 7,856 raw and fully labelled RGB images of the Arabic sign language alphabets, which to our best knowledge is the first publicly available RGB dataset.
+
textcat score: .95
This paper introduces the RGB Arabic Alphabet Sign Language (AASL) dataset.
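The example suggests that scoring sentences separately can surface a strong signal that gets diluted in the full abstract. A minimal sketch of that idea follows: split the text into sentences, score each one, and keep the best. Taking the maximum is an assumption, since the slide does not say how the sentence scores are combined. The sentence splitter is a naive regex, and `toy_score` simply looks up the three scores shown on the slide instead of calling a trained model.

```python
import re

FIRST = "This paper introduces the RGB Arabic Alphabet Sign Language (AASL) dataset."
SECOND = ("AASL comprises 7,856 raw and fully labelled RGB images of the Arabic sign "
          "language alphabets, which to our best knowledge is the first publicly "
          "available RGB dataset.")
ABSTRACT = FIRST + " " + SECOND

# Scores copied from the slide; a real setup would call the textcat model instead.
SLIDE_SCORES = {ABSTRACT: 0.70, FIRST: 0.95, SECOND: 0.65}


def toy_score(text):
    return SLIDE_SCORES[text]


def split_sentences(text):
    """Naive splitter: break after ., ! or ? when followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def best_sentence_score(text, score):
    """Score every sentence separately and return (best_score, best_sentence)."""
    scored = [(score(sentence), sentence) for sentence in split_sentences(text)]
    return max(scored, key=lambda pair: pair[0])


best, sentence = best_sentence_score(ABSTRACT, toy_score)
print(toy_score(ABSTRACT), best)  # 0.7 0.95
```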

Slide 19

Slide 19 text

Future plans

Support for Matcher instead of trained spaCy models
A/B Prodigy classification to learn your interests

Slide 20

Slide 20 text

Conclusion

Advanced workflows for modern NLP and machine learning
Ease of use with pre-configured building blocks and good defaults

Slide 21

Slide 21 text

Thank you for listening!

(blog.)victoriaslocum.com
twitter.com/victorialslocum
linkedin.com/in/victorialslocum
victoria@explosion.ai

Frontpage project - github.com/victorialslocum/frontpage
Explosion blog - explosion.ai/blog
More events - explosion.ai/events
Vincent on Twitter - twitter.com/fishnets88