Kurian Benoy
April 26, 2025

How I ended up maintaining a python package with 1M+ downloads so far?

Transcript

  1. How I ended up maintaining a Python package with 1M+ downloads

    KochiFOSS Meetup, April 26, 2025, by Kurian Benoy
  2. $whoami

    ML Engineer @ Sarvam. Volunteer @ Swathanthra Malayalam Computing. Loves walking and likes to participate in marathons. Bird watching is my hobby (PS: Sarvam models are named after birds because of me). I gave a talk at this same venue, the KeyValue office, in 2023.
  3. Who I Am: A Snapshot

    FOSS & Python: Active contributor with a passion for open source software.
    Machine Learning: Engineer focusing on deep learning and fast.ai frameworks.
    Community & Travel: Engaged in Malayalam computing and explored 10 Indian cities.
    Speaker & Volunteer: Presented at PyCon India and contributes to AI4Bharat initiatives.
    (Prompt: create images to represent things like FOSS, ML, Malayalam computing, Walking, Ooty, Pune, 10cities travelling in India, fast.ai, Deep Learning, Sarvam, Startups, Python, Kaggle, bird watching, Language hero comes with pride)
  4. Why did I create a Python package?

    Identify a problem: At my previous company I was benchmarking various ASR providers, in particular for Malayalam ASR.
    I knew how to build a Python package: I learned nbdev, made by Jeremy Howard, Hamel Husain, Wasim Lorgat and others, during the fast.ai course in 2022.
    Frustration led me to publish it as a Python package: this happened while I was working on the Malayalam ASR benchmarking project and giving talks on OpenAI Whisper and its amazing capacity for fine-tuning.
  5. What is nbdev?

    "Create delightful software with Jupyter notebooks." Using Jupyter notebooks, build a Python package with proper documentation. Easily publish it to PyPI, GitHub and the Anaconda ecosystem. Check out the nbdev tutorial: https://nbdev.fast.ai/tutorials/tutorial.html
  6. How to evaluate ASR providers?

    ASR is evaluated by comparing the ground truth with the ASR output. Two common metrics are:
    Word Error Rate (WER)
    Character Error Rate (CER)
  7. Example of ASR evaluation

    Ground truth = I am at Kochi FOSS today. I am presenting a talk today, at Key Value Systems along with Andrew from Hoppscotch and Renjith from Wikidata.
    ASR output = I am at Kochi Foods today. I am presenting a talk today at Key Value Systems along with Andrew from Hope's Coach and Ranjit from Wikidata.
    WER without normalization = 0.2
    CER without normalization = 0.08759
    WER with normalization = 0.2
    CER with normalization = 0.067164
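The two metrics can be sketched in a few lines of plain Python. This is a minimal illustration, not the code used in the talk: it assumes WER and CER are the Levenshtein (edit) distance over words and characters respectively, divided by the length of the ground truth, and the helper names `edit_distance`, `wer` and `cer` are made up for this example.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (strings or lists of words)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def wer(reference, hypothesis):
    """Word Error Rate: word-level edits divided by the number of reference words."""
    ref_words = reference.split()
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference, hypothesis):
    """Character Error Rate: character-level edits divided by reference length."""
    return edit_distance(reference, hypothesis) / len(reference)


ground_truth = ("I am at Kochi FOSS today. I am presenting a talk today, at Key Value "
                "Systems along with Andrew from Hoppscotch and Renjith from Wikidata.")
asr_output = ("I am at Kochi Foods today. I am presenting a talk today at Key Value "
              "Systems along with Andrew from Hope's Coach and Ranjit from Wikidata.")

# Without normalization: 5 word-level edits over 25 reference words, i.e. WER = 0.2
print(f"WER = {wer(ground_truth, asr_output):.4f}")
print(f"CER = {cer(ground_truth, asr_output):.4f}")
```

Note that "today," versus "today" counts as a word error here, which is exactly the kind of formatting difference a normalizer is meant to remove before scoring.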
  8. What is Whisper and its normalizer?

    Whisper was open-sourced on September 21, 2022, with the release of the inference code and pre-trained model weights. The Whisper normalizer is a text normalization tool and algorithm used in OpenAI's Whisper automatic speech recognition (ASR) system. Its main purpose is to standardize transcribed text so that formatting differences (such as punctuation, capitalization or whitespace) do not unfairly penalize evaluation metrics like Word Error Rate (WER) and Character Error Rate (CER). Normalization makes transcriptions easier to compare by ensuring that only genuine transcription errors are counted, not superficial formatting differences.
    To explain: EnglishNormalizer and BasicTextNormalizer.
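The idea can be illustrated with a deliberately simplified normalizer. This is a sketch under stated assumptions, not Whisper's actual EnglishNormalizer or BasicTextNormalizer: the hypothetical `simple_normalize` below only lowercases, replaces ASCII punctuation with spaces and collapses whitespace.

```python
import re
import string


def simple_normalize(text):
    """Lowercase, replace ASCII punctuation with spaces, collapse whitespace."""
    text = text.lower()
    # Map every ASCII punctuation character to a space.
    text = text.translate(str.maketrans(string.punctuation, " " * len(string.punctuation)))
    return re.sub(r"\s+", " ", text).strip()


reference = "I am presenting a talk today, at Key Value Systems."
hypothesis = "i am presenting a talk today at key value systems"

# Raw comparison: casing and punctuation make the word lists differ.
print(reference.split() == hypothesis.split())  # False

# After normalization only genuine transcription errors would remain.
print(simple_normalize(reference).split() == simple_normalize(hypothesis).split())  # True
```

A real normalizer does considerably more than this (the English one, for instance, also handles things like contractions and numbers), so treat the function above purely as an illustration of why normalization changes WER and CER.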
  9. Hello world to Whisper Normalizer

    YouTube video (18:52): Hello World to whisper_normalizer package which has 1M+ download…
    Colab Notebook: https://colab.research.google.com/gist/kurianbenoy/7d27d9ec193a4a97ec78…
  10. Why did the package get popular?

    The whole field of voice agents, and speech in general, exploded from 2023 and 2024 onwards, with increasingly better speech-to-text, text-to-speech and speech-to-speech models.
    SEO on Google: my Python package shows up when googling "whisper normalizer" or when asking Perplexity.
    Perplexity AI, "What is whisper normalizer and how to use it": "The Whisper normalizer is a text normalization tool and algorithm used in OpenAI's Whisper automatic speech recognition (ASR) system. Its main…"
  11. Best Monthly Downloads

    Kurian Benoy on Twitter / X: "Sometimes stuff you build for fun can be humbling. I worked on Malayalam Speech to Text for last year and as a byproduct I got this unexpected…"
    YouTube video (00:14): whisper_normalizer hits 500K+ downloads
  12. Getting even more downloads than nbdev

    Kurian Benoy on Twitter / X: "Thank you @jeremyphoward, @HamelHusain and @wasimlorgat for creating nbdev. Many thanks to…"
    Jeremy Howard on Twitter / X: "Nice job! If this is a competition, then you're…"
  13. Identifying a big issue with Malayalam ASR benchmarking

    1. Kavya noticed a big bug: whisper_normalizer was removing vowel signs as part of the BasicTextNormalizer.
    2. Kavya and I tweeted: we informed the community, through a blog post Kavya wrote and through tweets, that the normalizers used in Meta's ASR paper, AssemblyAI, OpenAI etc. are wrong.
    3. Kavya published a paper: "What is lost in Normalization? Exploring Pitfalls in Multilingual ASR Model Evaluations", published at EMNLP.
    4. Both of us are trying to fix the issues: the problem was fixed using normalizers written by Anoop Kunchukuttan and AI4Bharat. There are now normalizers such as MalayalamNormalizer and HindiNormalizer for 9 Indian languages.
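Why a punctuation-stripping rule can delete vowel signs is easier to see in code. The sketch below is an assumption-laden simplification, not the code of Whisper or whisper_normalizer: it assumes the cleaning step replaces every character whose Unicode general category starts with M (marks), S (symbols) or P (punctuation) with a space. The function name `replace_marks_symbols_punct` is hypothetical. Malayalam vowel signs fall in the mark categories, so such a rule treats them like punctuation.

```python
import re
import unicodedata


def replace_marks_symbols_punct(text):
    """Replace characters in Unicode categories M*, S* and P* with spaces."""
    cleaned = "".join(
        " " if unicodedata.category(ch)[0] in "MSP" else ch
        for ch in unicodedata.normalize("NFKC", text)
    )
    return re.sub(r"\s+", " ", cleaned).strip()


# For English, only punctuation is lost, which is the intended effect.
print(replace_marks_symbols_punct("hello, world!"))

# For Malayalam, vowel signs and the anusvara are marks, so they are lost too.
word = "മലയാളം"  # the word "Malayalam" in Malayalam script
for ch in word:
    print(f"U+{ord(ch):04X} {unicodedata.category(ch)} {unicodedata.name(ch)}")
print(replace_marks_symbols_punct(word))
```

The per-character listing shows which code points carry a mark category; those are the ones that disappear from the word, which makes the reference and the ASR output look more alike than they are and lowers the reported WER. A language-specific normalizer avoids this by keeping the marks that belong to the script.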
  14. Kavya Manohar on Twitter / X

    "For people who care deeply about NLP tech for under represented languages, here is a detailed story on how and why the normalization routine in Whisper…"
  15. Kurian Benoy on Twitter / X

    "Why aren't @huggingface @Meta @AssemblyAI not caring about this? Not going to tag famous audio folks there. It's good to atleast acknowledge these issues. https://t.co/CCckAAHuwG" (Kurian Benoy, @kurianbenoy2, May 8, 2024)
  16. Kurian Benoy on Twitter / X

    "Nice catch @kavya_manohar. This is a big bug. I was suspicious of this event and did benchmarking just to see if numbers hold up. Now my benchmarking was also wrong since it depended on this normalization https://t.co/ML4V1qUB15" (Kurian Benoy…)
    Kavya Manohar on Twitter / X: "Loud Rant: I came to know that the surprisingly low WER in #whisper ASR for Malayalam reported in the @huggingface fine-tuning event last year was just because the evaluation script removed all the vowel signs before computing WER!!! And the…"
  17. Rajiv Shah on Twitter / X

    "Going to dig into this, but looks like another story of some initial great results falling victim to evaluation issues. Kavya writes: During the Whisper fine-tuning event hosted by Hugging Face in December 2022, researchers and practitioners worldwi…"
  18. Crafting a User-Friendly Python package

    Simplify usage: a proper GitHub README and a simple name.
    Comprehensive documentation: properly document usage and accept contributions.
    Examples & tutorials: made YouTube videos.
  19. Maintenance required for a Python package

    1. Fix bugs quickly: prioritize issues reported by users for reliability.
    2. Add features: implement enhancements based on community feedback, like MalayalamNormalizer and the updates to the English normalizer.
    3. Update dependencies: keep libraries current to ensure compatibility and security.
    4. Monitor usage: track downloads and feedback to inform future development.
  20. LinkedIn posts

    LinkedIn: "Last year March, I created a python package called whisper_normalizer package with nbdev. I realized it was not possible to use the normalization…"
    LinkedIn, "Release v0.1.0 · kurianbenoy/whisper_normalizer | Kurian Benoy": "Weekend Release Alert! Just shipped whisper_normalizer v0.1.0. What's new: 1. Support for converting arabic numbers to Indic script in…"
  21. Thank you

    Make something you want.
    Ensure quality with a good README and documentation.
    Engage the community and identify pain points.
    Iterate continuously: improve based on feedback and usage.