PyData Berlin 2023

You are what you read 01
Building a personal internet frontpage with spaCy and Prodigy
Victoria Slocum, PyData Berlin 2023

Introductions 02
Victoria Slocum, Developer Advocate
Slide 3-5: Intro to spaCy, Prodigy, and the project
Slide 6-10: Customizing the annotation experience
Slide 11-14: Building an NLP pipeline with spaCy
Slide 15-19: Setting up your frontpage and future plans
(blog.)victoriaslocum.com
twitter.com/victorialslocum
linkedin.com/in/victorialslocum
github.com/victorialslocum

What’s spaCy? 03
spaCy provides a modular architecture to construct NLP pipelines that can be tailored towards individual project needs.
spaCy pipeline: Text -> tokenizer -> tagger -> parser -> ... -> Doc
config.cfg:
    [nlp]
    lang = "en"
    pipeline = ["tok2vec", "tagger", "parser"]
    batch_size = 1000
    disabled = []
    before_creation = null
    after_creation = null
    after_pipeline_creation = null
    tokenizer = {"@tokenizers": "spacy.Tokenizer.v1"}

What’s Prodigy? 04
Prodigy is a scriptable annotation tool. We provide a lot of out-of-the-box solutions but still allow for extensive customization through custom recipe functions.

The frontpage project 05
- New data, every day: The project automatically updates the page and datasets every day.
- Do it your own way: Annotate your own data on topics you’re interested in to get a personalized frontpage.
- arxiv API for Python: We’re using the arxiv PyPI library to get the papers using a search query.

06 Data, data, and data
'ti:dataset OR ti:corpus OR ti:database OR abs:"a new dataset"'
data.jsonl:
    {"title":"RGB Arabic Alphabets Sign Language Dataset", "description":"This paper introduces the RGB Arabic Alphabet Sign Language (AASL) dataset. ...", "tags":["arxiv","dataset"], "meta":{"query":"ti:dataset OR ti:corpus OR ti:database OR abs:\"a new dataset\""}}
    {...}
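A record in this shape can be produced with a few lines of Python. The following is a minimal sketch, not the project's actual download script: `make_record` and `write_jsonl` are hypothetical helper names, and the title and abstract are passed in directly rather than fetched from arXiv.

```python
import json

QUERY = 'ti:dataset OR ti:corpus OR ti:database OR abs:"a new dataset"'


def make_record(title, abstract, tag, query=QUERY):
    """Build one data.jsonl entry (hypothetical helper mirroring the slide's fields)."""
    return {
        "title": title,
        "description": abstract,
        "tags": ["arxiv", tag],
        "meta": {"query": query},
    }


def write_jsonl(records, path):
    """Write one JSON object per line (JSONL)."""
    with open(path, "w", encoding="utf-8") as f:
        for record in records:
            f.write(json.dumps(record, ensure_ascii=False) + "\n")


record = make_record(
    "RGB Arabic Alphabets Sign Language Dataset",
    "This paper introduces the RGB Arabic Alphabet Sign Language (AASL) dataset. ...",
    tag="dataset",
)
# json.dumps escapes the inner quotes of the query, as in the slide's example line.
line = json.dumps(record)
```

Keeping the query in `meta` means every entry records which search produced it, which is what the slide's example line shows.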

Annotating the data 07
pattern.jsonl:
    {
      "label": "dataset",
      "pattern": [
        {"LEMMA": {"IN": ["present", "introduce", "propose", "publish", "provide", "derive", "construct", "create", "develop", "contribute", "release"]}, "POS": "VERB"},
        {"OP": "{,6}"},
        {"LEMMA": {"NOT_IN": ["performance", "result", "benchmark", "evaluate", "algorithm", "framework", "technique", "workflow"]}},
        {"OP": "{,6}"},
        {"LOWER": {"IN": ["database", "dataset", "corpus"]}}
      ]
    }
Example abstract:
This paper introduces the RGB Arabic Alphabet Sign Language (AASL) dataset. AASL comprises 7,856 raw and fully labelled RGB images of the Arabic sign language alphabets, which to our best knowledge is the first publicly available RGB dataset. The dataset is aimed to help those interested in developing real-life Arabic sign language classification models. AASL was collected from more than 200 participants and with different settings such as lighting, background, image orientation, image size, and image resolution. Experts in the field supervised, validated and filtered the collected images to ensure a high-quality dataset. AASL is made available to the public on Kaggle.

Combining rules & ML 08
Writing rules gives you a better understanding of, and more control over, your pipeline and process. This ensures consistency and can lead to better output.
[Diagrams: linear pipeline workflow; iterative pipeline workflow]

Building a custom recipe 09
- Custom HTML to display title, abstract with pattern-matched highlighting, and link for any further information
- Meta tag for the query provided to arxiv
- Prefer data entries where the abstract has a matched pattern
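The slide does not say how matched entries are preferred. One simple way is to reorder the examples before they are shown to the annotator. The sketch below assumes the stream is small enough to hold in memory; `prefer_matched` and `keyword_match` are hypothetical names, and the keyword check is only a stand-in for the real pattern matcher.

```python
def prefer_matched(examples, has_match):
    """Return examples with matched ones first, keeping input order otherwise."""
    # sorted() is stable, and False sorts before True, so matched entries
    # (key False) come first without being reshuffled among themselves.
    return sorted(examples, key=lambda eg: not has_match(eg))


def keyword_match(eg):
    """Stand-in for the pattern matcher: a plain keyword check on the abstract."""
    text = eg["description"].lower()
    return any(word in text for word in ("dataset", "corpus", "database"))


examples = [
    {"title": "A", "description": "We evaluate a framework."},
    {"title": "B", "description": "This paper introduces a new dataset."},
    {"title": "C", "description": "We release a corpus of tweets."},
]

ordered = prefer_matched(examples, keyword_match)
```

In a custom recipe, a function like this would be applied to the loaded examples before they are returned as the stream.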

Customizing your solutions 10
[Figure: four panels, each labeled SOURCE, CUE, CONTENT]
https://explosion.ai/blog/guardian

spaCy’s config system 11
- the single source of truth, includes all settings and records all defaults
- customize the architecture by swapping out components
- preset with sensible defaults to get you started
config.cfg:
    [nlp]
    lang = "en"
    pipeline = ["textcat_multilabel"]
    batch_size = 1000
    disabled = []
    before_creation = null
    after_creation = null
    after_pipeline_creation = null
    tokenizer = {"@tokenizers": "spacy.Tokenizer.v1"}
    [components]
    [components.textcat_multilabel]
    factory = "textcat_multilabel"
    scorer = {"@scorers": "spacy.textcat_multilabel_scorer.v1"}
    threshold = 0.5
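To make the structure of this config concrete, the sketch below reads the same settings with Python's standard library only. This is an illustration, not how spaCy loads its config: spaCy's own loader also resolves variable interpolation and registered functions such as the `@tokenizers` and `@scorers` entries. `parse_config` is a hypothetical helper, and it relies on the observation that each value here is valid JSON.

```python
import configparser
import json

CONFIG_TEXT = """
[nlp]
lang = "en"
pipeline = ["textcat_multilabel"]
batch_size = 1000
disabled = []
before_creation = null
after_creation = null
after_pipeline_creation = null
tokenizer = {"@tokenizers": "spacy.Tokenizer.v1"}

[components]

[components.textcat_multilabel]
factory = "textcat_multilabel"
scorer = {"@scorers": "spacy.textcat_multilabel_scorer.v1"}
threshold = 0.5
"""


def parse_config(text):
    """Parse sections with configparser and decode every value as JSON."""
    # Only "=" separates keys from values, so ":" inside a JSON value is safe.
    parser = configparser.ConfigParser(interpolation=None, delimiters=("=",))
    parser.optionxform = str  # keep keys exactly as written
    parser.read_string(text)
    return {
        section: {key: json.loads(value) for key, value in parser[section].items()}
        for section in parser.sections()
    }


config = parse_config(CONFIG_TEXT)
```

After parsing, `config["nlp"]["pipeline"]` is a Python list and `config["components.textcat_multilabel"]["threshold"]` is a float, which shows how one file can hold every setting of the pipeline.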

spaCy textcat model 12
https://spacy.io/usage/training#quickstart

Training 13
Train model:
    $ python -m spacy train spacy_data/config.cfg --paths.train spacy_data/train.spacy --paths.dev spacy_data/dev.spacy --output training/
Load model:
    nlp = spacy.load("training/model-best")

Built-in solutions 14
- Using preconfigured building blocks
- Has a set of sensible defaults
- Easy to get started and iterate moving forward
More on this: https://explosion.ai/blog/spacy-design-concepts

spaCy project system 15
https://spacy.io/usage/projects
    $ spacy project run new-frontpage
[Diagram labels: download, preprocess, spacy-train, content, build, data-to-spacy; result: Your frontpage]
project.yml:
    - name: "download"
      help: "Download data from sources."
      script:
        - >-
          python scripts/download_arxiv.py
          --query 'ti:dataset OR ti:corpus OR ti:database OR abs:"a new dataset"'
          --tag dataset

Creating your frontpage 16
https://github.com/victorialslocum/frontpage
frontpage.yaml:
    description: |
      ...
    sections:
      - name: "Datasets on Arxiv"
        tags: ["arxiv", "dataset"]
        n: 12
        classes:
          - name: dataset
            threshold: 0.5
      - name: "Prompt Engineering on Arxiv"
        tags: ["arxiv", "prompteng"]
        n: 12
        classes:
          - name: prompteng
            threshold: 0.5
Automatically runs spaCy project commands daily:
    $ python -m spacy project run download
    $ python -m spacy project run new-frontpage-github
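The slide does not spell out how these fields are used. A plausible reading is that `tags` filters entries, `threshold` is the minimum textcat score for a class, and `n` caps how many entries a section shows. The sketch below implements that reading. All of it is an assumption: `select_for_section` is a hypothetical helper, and the `cats` field (class name to score) is an invented entry layout, not the project's data format.

```python
SECTION = {
    "name": "Datasets on Arxiv",
    "tags": ["arxiv", "dataset"],
    "n": 12,
    "classes": [{"name": "dataset", "threshold": 0.5}],
}


def section_score(section, entry):
    """Sum of the entry's scores over the section's classes (assumed ranking key)."""
    return sum(entry["cats"].get(c["name"], 0.0) for c in section["classes"])


def select_for_section(section, entries):
    """Pick up to section['n'] entries that carry all tags and pass all thresholds."""
    wanted_tags = set(section["tags"])
    picked = [
        entry
        for entry in entries
        if wanted_tags.issubset(entry["tags"])
        and all(
            entry["cats"].get(c["name"], 0.0) >= c["threshold"]
            for c in section["classes"]
        )
    ]
    # Highest-scoring entries first, then cut to the section size.
    picked.sort(key=lambda entry: section_score(section, entry), reverse=True)
    return picked[: section["n"]]


entries = [
    {"title": "mid", "tags": ["arxiv", "dataset"], "cats": {"dataset": 0.6}},
    {"title": "low", "tags": ["arxiv", "dataset"], "cats": {"dataset": 0.3}},
    {"title": "other", "tags": ["arxiv", "prompteng"], "cats": {"prompteng": 0.8}},
    {"title": "high", "tags": ["arxiv", "dataset"], "cats": {"dataset": 0.9}},
]

selected = select_for_section(SECTION, entries)
```

Under this reading, raising `threshold` makes a section stricter, while `n` only limits its length.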

Future plans
- Better topic customization
- Upload your own data for model training
- Page customization
Text classification trick 18
textcat score: .70
This paper introduces the RGB Arabic Alphabet Sign Language (AASL) dataset. AASL comprises 7,856 raw and fully labelled RGB images of the Arabic sign language alphabets, which to our best knowledge is the first publicly available RGB dataset.
textcat score: .65
AASL comprises 7,856 raw and fully labelled RGB images of the Arabic sign language alphabets, which to our best knowledge is the first publicly available RGB dataset.
+
textcat score: .95
This paper introduces the RGB Arabic Alphabet Sign Language (AASL) dataset.
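The example shows that one sentence can score much higher on its own (.95) than the full text does (.70). A minimal sketch of scoring per sentence follows. Two things here are assumptions: the slide does not state how sentence scores are combined (this sketch takes the maximum), and `toy_score` is a made-up stand-in for a trained textcat model. The regex splitter is likewise a naive substitute for proper sentence segmentation.

```python
import re


def split_sentences(text):
    """Naive splitter: break after ., ! or ? when followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def best_sentence_score(text, score_fn):
    """Score each sentence separately and return (highest score, that sentence)."""
    scored = [(score_fn(sentence), sentence) for sentence in split_sentences(text)]
    if not scored:
        return 0.0, ""
    return max(scored, key=lambda pair: pair[0])


def toy_score(sentence):
    """Made-up scorer for illustration only; a real model would go here."""
    return 0.95 if "introduces" in sentence else 0.65


ABSTRACT = (
    "This paper introduces the RGB Arabic Alphabet Sign Language (AASL) dataset. "
    "AASL comprises 7,856 raw and fully labelled RGB images of the Arabic sign "
    "language alphabets, which to our best knowledge is the first publicly "
    "available RGB dataset."
)

score, sentence = best_sentence_score(ABSTRACT, toy_score)
```

Taking the maximum lets a single strong sentence decide the label, which fits abstracts where the key claim appears once.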

Future plans 19
- support for Matcher instead of trained spaCy model
- A/B Prodigy classification to learn your interests

Conclusion 20
- advanced workflows for modern NLP and machine learning
- ease-of-use with pre-configured building blocks and good defaults

21 Thank you for listening!
(blog.)victoriaslocum.com
twitter.com/victorialslocum
linkedin.com/in/victorialslocum
victoria@explosion.ai
Frontpage project: github.com/victorialslocum/frontpage
Explosion blog: explosion.ai/blog
More events: explosion.ai/events
Vincent on Twitter: twitter.com/fishnets88
