PyData Berlin 2023

Victoria's talk at PyData Berlin 2023 on creating a personalized internet frontpage with spaCy and Prodigy.

Victoria Slocum

April 19, 2023

Transcript

  1. You are what you read

    Building a personal internet frontpage with spaCy and Prodigy. Victoria Slocum, PyData Berlin 2023
  2. Introductions

    Victoria Slocum, Developer Advocate
    Slides 3-5: Intro to spaCy, Prodigy, and the project
    Slides 6-10: Customizing the annotation experience
    Slides 11-14: Building an NLP pipeline with spaCy
    Slides 15-19: Setting up your frontpage and future plans
    Links: (blog.)victoriaslocum.com, twitter.com/victorialslocum, linkedin.com/in/victorialslocum, github.com/victorialslocum
  3. What’s spaCy?

    spaCy provides a modular architecture to construct NLP pipelines that can be tailored towards individual project needs.
    Diagram: Text, then the spaCy pipeline (tokenizer, tagger, parser, ...), then Doc.
    config.cfg:
        [nlp]
        lang = "en"
        pipeline = ["tok2vec", "tagger", "parser"]
        batch_size = 1000
        disabled = []
        before_creation = null
        after_creation = null
        after_pipeline_creation = null
        tokenizer = {"@tokenizers": "spacy.Tokenizer.v1"}
  4. What’s Prodigy?

    Prodigy is a scriptable annotation tool. We provide many out-of-the-box solutions but still allow extensive customization through custom recipe functions.
  5. The frontpage project

    New data, every day: The project automatically updates the page and datasets every day.
    Do it your own way: Annotate your own data on topics you’re interested in to get a personalized frontpage.
    arxiv API for Python: We’re using the arxiv PyPI library to get the papers using a search query.
  6. Data, data, and data

    Search query:
        'ti:dataset OR ti:corpus OR ti:database OR abs:"a new dataset"'
    data.jsonl:
        {"title":"RGB Arabic Alphabets Sign Language Dataset", "description":"This paper introduces the RGB Arabic Alphabet Sign Language (AASL) dataset. ...", "tags":["arxiv","dataset"], "meta":{"query":"ti:dataset OR ti:corpus OR ti:database OR abs:\"a new dataset\""}}
        {...}
  7. Annotating the data

    pattern.jsonl:
        {
          "label": "dataset",
          "pattern": [
            {"LEMMA": {"IN": ["present", "introduce", "propose", "publish", "provide", "derive", "construct", "create", "develop", "contribute", "release"]}, "POS": "VERB"},
            {"OP": "{,6}"},
            {"LEMMA": {"NOT_IN": ["performance", "result", "benchmark", "evaluate", "algorithm", "framework", "technique", "workflow"]}},
            {"OP": "{,6}"},
            {"LOWER": {"IN": ["database", "dataset", "corpus"]}}
          ]
        }
    Example abstract:
        This paper introduces the RGB Arabic Alphabet Sign Language (AASL) dataset. AASL comprises 7,856 raw and fully labelled RGB images of the Arabic sign language alphabets, which to our best knowledge is the first publicly available RGB dataset. The dataset is aimed to help those interested in developing real-life Arabic sign language classification models. AASL was collected from more than 200 participants and with different settings such as lighting, background, image orientation, image size, and image resolution. Experts in the field supervised, validated and filtered the collected images to ensure a high-quality dataset. AASL is made available to the public on Kaggle.
  8. Combining rules & ML

    Writing rules gives you a better understanding of, and more control over, your pipeline and process. That helps ensure consistency and can produce better output.
    Diagrams: linear pipeline workflow; iterative pipeline workflow.
  9. Building a custom recipe

    - Custom HTML to display the title, the abstract with pattern-matched highlighting, and a link for any further information
    - Meta tag for the query provided to arxiv
    - Prefer data entries where the abstract has a matched pattern
  10. Customizing your solutions

    Figure labels: SOURCE, CUE, CONTENT.
    https://explosion.ai/blog/guardian
  11. spaCy’s config system

    config.cfg:
        [nlp]
        lang = "en"
        pipeline = ["textcat_multilabel"]
        batch_size = 1000
        disabled = []
        before_creation = null
        after_creation = null
        after_pipeline_creation = null
        tokenizer = {"@tokenizers": "spacy.Tokenizer.v1"}

        [components]

        [components.textcat_multilabel]
        factory = "textcat_multilabel"
        scorer = {"@scorers": "spacy.textcat_multilabel_scorer.v1"}
        threshold = 0.5
    - The single source of truth: includes all settings and records all defaults
    - Customize the architecture by swapping out components
    - Preset with sensible defaults to get you started
  12. Training

    Train model:
        $ python -m spacy train spacy_data/config.cfg --paths.train spacy_data/train.spacy --paths.dev spacy_data/dev.spacy --output training/
    Load model:
        nlp = spacy.load("training/model-best")
  13. Built-in solutions

    - Using preconfigured building blocks
    - Has a set of sensible defaults
    - Easy to get started and iterate moving forward
    More on this: https://explosion.ai/blog/spacy-design-concepts
  14. spaCy project system

    https://spacy.io/usage/projects
        $ spacy project run new-frontpage
    Diagram labels: download, preprocess, spacy-train, content, build, data-to-spacy; result: "Your frontpage".
    project.yml:
        - name: "download"
          help: "Download data from sources."
          script:
            - >-
              python scripts/download_arxiv.py
              --query 'ti:dataset OR ti:corpus OR ti:database OR abs:"a new dataset"'
              --tag dataset
  15. Creating your frontpage

    https://github.com/victorialslocum/frontpage
    frontpage.yaml:
        description: |
          ...
        sections:
          - name: "Datasets on Arxiv"
            tags: ["arxiv", "dataset"]
            n: 12
            classes:
              - name: dataset
                threshold: 0.5
          - name: "Prompt Engineering on Arxiv"
            tags: ["arxiv", "prompteng"]
            n: 12
            classes:
              - name: prompteng
                threshold: 0.5
    Automatically runs spaCy project commands daily:
        $ python -m spacy project run download
        $ python -m spacy project run new-frontpage-github
  16. (No text on this slide.)

  17. Future plans

    - Better topic customization
    - Upload your own data for model training
    - Page customization
    - Text classification trick
    textcat score: .70
        This paper introduces the RGB Arabic Alphabet Sign Language (AASL) dataset. AASL comprises 7,856 raw and fully labelled RGB images of the Arabic sign language alphabets, which to our best knowledge is the first publicly available RGB dataset.
    textcat score: .65
        AASL comprises 7,856 raw and fully labelled RGB images of the Arabic sign language alphabets, which to our best knowledge is the first publicly available RGB dataset.
    +
    textcat score: .95
        This paper introduces the RGB Arabic Alphabet Sign Language (AASL) dataset.
  18. Future plans

    - Support for Matcher instead of a trained spaCy model
    - A/B Prodigy classification to learn your interests
  19. Conclusion

    - Advanced workflows for modern NLP and machine learning
    - Ease of use with pre-configured building blocks and good defaults
  20. Thank you for listening!

    Contact: (blog.)victoriaslocum.com, twitter.com/victorialslocum, linkedin.com/in/victorialslocum, victoria@explosion.ai
    Frontpage project - github.com/victorialslocum/frontpage
    Explosion blog - explosion.ai/blog
    More events - explosion.ai/events
    Vincent on Twitter - twitter.com/fishnets88
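
As a rough illustration of the data.jsonl format shown on the "Data, data, and data" slide, the sketch below builds one record per paper and serializes it as a JSON line. The field names (`title`, `description`, `tags`, `meta.query`) and the query string come from the slide. The helper names `make_record` and `to_jsonl` are hypothetical and are not taken from the project's scripts, which may build the records differently.

```python
import json

# Search query shown on the slide.
QUERY = 'ti:dataset OR ti:corpus OR ti:database OR abs:"a new dataset"'


def make_record(title, abstract, query, tag):
    """Build one record in the shape shown on the slide (hypothetical helper)."""
    return {
        "title": title,
        "description": abstract,
        "tags": ["arxiv", tag],
        "meta": {"query": query},
    }


def to_jsonl(records):
    """Serialize records as JSONL: one JSON object per line."""
    return "\n".join(json.dumps(r, ensure_ascii=False) for r in records)


record = make_record(
    title="RGB Arabic Alphabets Sign Language Dataset",
    abstract="This paper introduces the RGB Arabic Alphabet Sign Language (AASL) dataset. ...",
    query=QUERY,
    tag="dataset",
)
jsonl_text = to_jsonl([record])
print(jsonl_text)
```

Keeping the query in `meta` means each annotated example carries a trace of the search that produced it, which is what the custom recipe slide refers to as the meta tag for the query.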
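
The "Text classification trick" on the first "Future plans" slide shows one abstract scored as a whole and then sentence by sentence, with a single sentence scoring higher than the full text. One possible reading is that scoring sentences separately and keeping the best one gives a sharper signal. The sketch below illustrates only that reading. The regex sentence splitter and the keyword-based `toy_score` are made-up stand-ins for illustration, not the trained textcat component or the project's actual method, and the scores they produce are not the ones on the slide.

```python
import re


def split_sentences(text):
    """Naive sentence splitter for illustration (splits after ., ! or ?)."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def best_sentence(text, score_fn):
    """Score each sentence separately and return (score, sentence) for the best one."""
    scored = [(score_fn(s), s) for s in split_sentences(text)]
    return max(scored, key=lambda pair: pair[0])


def toy_score(sentence):
    """Made-up stand-in for a textcat score: fraction of cue words present."""
    cues = ("introduces", "dataset")
    lowered = sentence.lower()
    return sum(cue in lowered for cue in cues) / len(cues)


abstract = (
    "This paper introduces the RGB Arabic Alphabet Sign Language (AASL) dataset. "
    "AASL comprises 7,856 raw and fully labelled RGB images of the Arabic sign "
    "language alphabets, which to our best knowledge is the first publicly "
    "available RGB dataset."
)
score, sentence = best_sentence(abstract, toy_score)
print(score, sentence)
```

In a real pipeline, `score_fn` would wrap a trained model, for example by reading a label's value from `doc.cats` after running the text through the loaded pipeline.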