Making man-made medicine price data more useful by JD Bothma

Slide 1

Slide 1 text

Making man-made medicine price data more useful JD Bothma [email protected] @jdbothma Man made medicine price data More useful Used to be Code For South Africa People put too much emphasis on the word code. We do write code, but the objective is always to empower people in south africa. To make them informed and more empowered to take part in solving problems they face. Sometimes coding helps. So now we’re called OpenUp

Slide 2

Slide 2 text

● Data is fun ● Data is dirty ● You can do it! There are really three main points to this talk.

Slide 3

Slide 3 text

Data is fun Bear with me…..

Slide 4

Slide 4 text

South African Medicine Price Registry mpr.gov.za We became aware of something called the Medicine Price Registry The MPR is a registry of medicines and regulated prices There’s a base price, percentage margin for the pharmacy’s dispensing fee, Tax, and it comes to the SINGLE EXIT PRICE which is basically the max price that product can be sold for as packaged. Like most government websites, this one doesn’t exactly get you super excited. But on the right over there is a link to “Databases”

Slide 5

Slide 5 text

South African Medicine Price Registry mpr.gov.za And on the databases page there’s one link to an Excel Spreadsheet. Note that it’s 22MB I’m not sure about excel, but LibreOffice doesn’t exactly love a 20MB spreadsheet, even today.

Slide 6

Slide 6 text

South African Medicine Price Registry mpr.gov.za Here you have your typical horrendous spreadsheet A variety of formatting styles and locked columns and all sorts of things that make you want to play with it.

Slide 7

Slide 7 text

A Product mpr.gov.za You’ll have distinct products for a pack with 20 pills and a pack with 40 pills Prices vary dramatically. One might be R80, another might be R3000 Cancer drugs are expensive! It also indicates whether the drug is an originator produced under patent, or a generic

Slide 8

Slide 8 text

What should your medicines cost? So what should your drugs cost? So, my colleague found this dataset and had a few hours so he obviously made a site to help you find cheap drugs

Slide 9

Slide 9 text

Find Cheap Generics CENSORED Not this one

Slide 10

Slide 10 text

Find Cheap Generics mpr.code4sa.org This one Simple little django app. The database is SQLite file on disk. It works a charm. Thanks to bootstrap we don’t need to think much about making it look sharp enough Simple little json API With something like 10000 products, search works fine. People can find what’s written on their prescription, find the maximum any pharmacy is allowed to charge for it They can also find generic alternatives. It’s a bit more complicated than just the active ingredients. That’s why there’s a pharmacist dispensing these things and not just giving you the drug you ask for. We know people use it. One time the server went down, and a doctor got in touch and asked when it will be back up, because they use it daily to find cheaper alternatives for their low income patients.

Slide 11

Slide 11 text

Find Cheap Generics mpr.code4sa.org It’s not really just about saving a few rand though Did you even know our medicine prices were regulated? So you might get here because you want to find out more about one medicine, but then you can just learn a bit more about the systems that are in place And I think that’s important. Because if we don’t use the systems that are in place, and we don’t know about them, we won’t take part in the debate if people want to change them, or abuse them. So we need to be at least a bit aware of what’s out there.

Slide 12

Slide 12 text

One or more rows per record Now this makes me a bit sick but we can deal with it. I mean it worked for the users thus far, right? One product might have multiple ingredients. So in excel that means one product must have its ingredients on one or more rows.

Slide 13

Slide 13 text

Find price and “equivalent” medicines ● Load latest MPR xls ● Query by name ● Given Product, find Products with same Ingredients ○ (at same strength) So every few months when there’s a new dataset, we want to load the latest spreadsheet Users want to query for products by product name or ingredient name If you’re looking at a product, you want to find the products with equivalent ingredients

Slide 14

Slide 14 text

Parsing MPR def parse(self, filename): workbook = xlrd.open_workbook(filename) worksheet = workbook.sheet_by_index(0) product = None for idx in range(1, worksheet.nrows): ... We start of with a nice little loop over the spreadsheet rows

Slide 15

Slide 15 text

Multirow Records: Yield on new Product for idx in range(1, worksheet.nrows): regno = worksheet.cell_value(idx, 2).lower() pack_size = worksheet.cell_value(idx, 11) num_packs = worksheet.cell_value(idx, 12) sep = worksheet.cell_value(idx, 16) name = worksheet.cell_value(idx, 6).title() if regno.strip() != "": if product: yield product # ... CLEAN SOME STUFF ... Then we pull out the values we want from the row. If it’s a new product row, we yield the previous row. A row with a product code is a new product. A row without a product row is an ingredient from the previous product

Slide 16

Slide 16 text

# CLEAN SOME STUFF ● Fix case on Originator/Generic ● Skip if NAPPI code isn’t an integer ● Skip if Single Exit Price (SEP) is missing There’s some quick basic cleaning you can do inline, like making casing consistent, ensuring important values are present, and so on.

Slide 17

Slide 17 text

Product + Ingredients if regno.strip() != "": if product: yield product # ... CLEAN SOME STUFF ... product = { "nappi_code" : nappi_code, "regno" : regno, "applicant" : worksheet.cell_value(idx, 1).title(), "schedule" : worksheet.cell_value(idx, 5), "name" : name, "dosage_form" : worksheet.cell_value(idx, 10).title(), "pack_size" : pack_size, "num_packs" : num_packs, "sep" : sep, "is_generic" : is_generic, "ingredients" : [] } ingredient_name = worksheet.cell_value(idx, 7).title() product["ingredients"].append({ "name" : name_change.get(ingredient_name.lower(), ingredient_name), "strength" : worksheet.cell_value(idx, 8), "unit" : worksheet.cell_value(idx, 9).lower(), }) Once we’ve done basic cleaning, we put the data into a dictionary. You don’t need to be able to read this, I just think dictionary instantiation is pretty

Slide 18

Slide 18 text

Add them to the DB ● New dataset published: ○ Delete all Products ○ Load new products

Slide 19

Slide 19 text

So far so good sufficient Or...

Slide 20

Slide 20 text

What about medicine prices over time? Now this dataset is published periodically. Which means we have prices over time. That could be fun.

Slide 21

Slide 21 text

Medicine prices over time ● Adjust data model ○ split Price from Product ○ Price has Effective Date To load the prices over time, we need to store prices separately from the product. No big deal

Slide 22

Slide 22 text

No migrations? No Problem! ● (Small Problem) ● Too lazy to learn SQLite and upgrade old django ● SQLAlchemy + Alembic FTW!! Unfortunately my colleage is a hacker and just stuck stuff into SQLite willy nilly. I’m a clutz with no memory and want repeatability. It felt like too much of a pain to upgrade the old django to be able to add migrations easily, So I just copied everything into SQLAlchemy models

Slide 23

Slide 23 text

Fresh start: Add Constraints class Product(Base): __tablename__ = 'product' id = Column(Integer, primary_key=True) applicant_licence_no = Column(String, nullable=False) applicant_name = Column(String, nullable=False) nappi_code = Column(String, nullable=False) regno = Column(String, nullable=False) name = Column(String, nullable=False) schedule = Column(String) dosage_form = Column(String) pack_size = Column(Float, nullable=False) is_generic = Column(String) num_packs = Column(Integer) ingredients = relationship("ProductIngredient", back_populates="product") prices = relationship("ProductSEP", back_populates="product") equivalence_key = Column(String) __table_args__ = ( UniqueConstraint('regno', 'nappi_code', 'pack_size', 'num_packs', 'schedule', 'dosage_form', name='product_unique'), ) So I happily start building my models with SQLAlchemy’s Object Relational Models Like a good little engineer I don’t use a domain-specific primary key but just an integer. I add a unique constraint on what could be the primary key: the NAPPI code I import the data - ERROR! Duplicate key - two products have the same NAPPI code but different pack size Ok, the data isn’t perfect, let’s see where this goes. I add pack size to the unique constraint. ERROR! Duplicate key. And so on, until I had 6 columns in the unique constraint

Slide 24

Slide 24 text

Fresh start: Add Constraints UniqueConstraint('regno', 'nappi_code', 'pack_size', 'num_packs', 'schedule', 'dosage_form', name='product_unique') Theory: NAPPI code is primary key Practise: 5 fields violate this ● Which version is correct? ● Quorum might solve this ○ multiple differences ● Humans suck In theory, the NAPPI code is a unique identifier per product In practise, there are variations all over this dataset For two or more versions of a product, which one is correct? Perhaps we say the odd one out is the wrong one. Perhaps that one is a correction on the others.

Slide 25

Slide 25 text

What I should have done: ● For each violating column ○ how many rows violate the correct unique constraint? ● Maybe I can fix 10 rows by hand ● Worry about quorum and other clever things later Hindsight is 20:20 I should probably have gotten a quick idea here of how many thigns are wrong If it’s only a few, I can fix them by hand

Slide 26

Slide 26 text

OpenRefine: Normalise names openrefine.org Other fields are easier to clean There’s a GUI tool to cluster similar names and normalise them. This is inherently interactive so it’s probably wise to use a GUI tool. OpenRefine can do much more, but this was really what I used it for in this project. You select a column, tell it to cluster the names, you can select between algorithms and tweak parameters, and replace all the variants with your chosen one. You can also export this to JSON to apply programmatically or import back to OpenRefine for use in another session, e.g. when you get more data.

Slide 27

Slide 27 text

What about medicine prices over time? Can’t tell much automatically Can’t reliably track product over time Unit Price = max_price / (pack_size + num_packs) It would be nice to write queries that can tell us how the medicine market behaves over time What’s the average number of competitors? How much does competition affect the price We know there are errors - we’re not sure of the extent of the errors It’d be interesting to track the price of a product over time, but since the unique constraint had to be so broad, we know there are interruptions to the product series, and therefore the product price series.

Slide 28

Slide 28 text

Data is dirty

Slide 29

Slide 29 text

What can we tell?

Slide 30

Slide 30 text

What can we tell? All is not lost. Find interesting candidates So it’s impractical to answer the question you set out with Pivot Data isn’t perfect Go find candidates for interesting stories

Slide 31

Slide 31 text

Equivalent products Same set of ingredients in same quantity ● Name-amount-unit slug ● Sorted, -appended ● lowercase We want to easily look at equivalent products It’s easier

Slide 32

Slide 32 text

Equivalent products { ..., u'mefenamic-acid-500-mg': [, , , , ], ..., }

Slide 33

Slide 33 text

Equivalent products ● Product.equivalence_key ● Too slow as a property of Product ● Calculate and cache

Slide 34

Slide 34 text

Jupyter Notebook is your friend jupyter

Slide 35

Slide 35 text

Jupyter Notebook is your friend jupyter

Slide 36

Slide 36 text

Experiment: Generic After Originator jupyter

Slide 37

Slide 37 text

Still a lot of data (not Big Data) Equivalent product groups: 5323 ...Where Generic not present at the start of the window 253 jupyter Do little experiments to focus on smaller subsets of the data Focus on areas where you have a suspicion there’s something interesting It’s difficult to learn stuff about about unmanageable amounts of data. Your data isn’t going anywhere, but time is running away

Slide 38

Slide 38 text

Use your eyes matplotlib.pyplot

Slide 39

Slide 39 text

Use your eyes matplotlib.pyplot

Slide 40

Slide 40 text

Use your eyes matplotlib.pyplot

Slide 41

Slide 41 text

Use your eyes timolol-1-mg-g.png timolol-3-mg-ml.png timolol-5-mg-ml.png timolol-maleate-1-%-travoprost-0-%.png timolol-maleate-1-%-travoprost-40--g-ml.png tin-fluoride-0-mg-vial.png tinidazole-500-mg.png tioconazole-280-mg-g.png tioguanine-40-mg.png tiotropium-bromide-18--g-tiotropium-bromide-complete-18-mcg.png tiotropium-bromide-18--g-tiotropium-bromide-refill-18-mcg.png matplotlib.pyplot The same drug in different strengths isn’t equivalent, but it’s similar The charts get complex quickly We don’t put them on the same chart, but your favourite photo viewer probably lets you browse the filenames alphabetically

Slide 42

Slide 42 text

Put similar things close together tramadol-hydrochloride-100-mg-2-ml.png tramadol-hydrochloride-100-mg-2ml.png tramadol-hydrochloride-100-mg.png tramadol-hydrochloride-150-mg.png tramadol-hydrochloride-200-mg.png tramadol-hydrochloride-50-mg-1ml.png tramadol-hydrochloride-50-mg-ml.png tramadol-hydrochloride-50-mg.png matplotlib.pyplot There are inconsistencies in how strengths get recorded. Sometimes the strength is repeated in the ingredient name. So when something should be part of an equivalent group, it isn’t. That’s ok, the chart will be adjacent and your brains can put the charts together What’s the difference between 50mg/1ml, 50mg/ml and 50mg? It doesn’t matter at this stage. If it looks like there’s something interesting going on, we can figure it out. If we really want to do programmatic analysis, we can clean the relevant subset of data sufficiently. We will never be able to understand everything there is about an interesting dataset. We’ll never be able to clean it.

Slide 43

Slide 43 text

Manual verification is possible but expensive! opengazettes.org.za If you google for some of these products where the data looks wrong, you’ll find registration documents in government gazettes on our open searcheable gazette archive This isn’t just a cheeky little plug of one of our other products It might be possible to scrape structured data from the PDFs published by the government but that’s also a lot of work. Manual verification is possible but it’s expensive. It’s very important that you narrow your criteria and only do the cleaning and other work you have to do.

Slide 44

Slide 44 text

Put similar things close together ● haemagglutinin-1-ml.png ● haemagglutinin-strain----a-california-7-2009--h1n1--like-strain----variant-a-calif ornia-7-2009--nymc-x-181-----a-perth-16-2009--h3n2--like-strain----variant-a-vi ctoria-210-2009--nymc-x-187-----b-brisbane-60-2008--15--g-0-5ml.png ● halofantrine-100-mg-5ml.png ● halofantrine-250-mg.png ● haloperidol-1-mg.png ● haloperidol-20-mg-2ml.png So, excitedly, we send big folders full of these charts to people exploring this data with us. Ugh What is this?!?! It turns out, sometimes products rows don’t have product codes. So you add a whole list of ingredients to a product that they don’t belong to. Data is dirty. Accept it. Be subservient to the data. Respect it. But have fun

Slide 45

Slide 45 text

You can do it! Really. I just showed you. It’s messy. That’s ok. It’s not like the textbook makes it look. Just play with it. You’ll find interesting things to do with it in no time.

Slide 46

Slide 46 text

You can do it! And I don’t know what this is… ENHANCE

Slide 47

Slide 47 text

You can do it! I’m just showing there’s more data there… ENHANCE

Slide 48

Slide 48 text

You can do it! I wonder what this could be?

Slide 49

Slide 49 text

● Data is fun ● Data is dirty ● You can do it! Data is fun. ● I’ve never played with a dataset and not gotten excited about its potential. ● You’ll find stories in the data that you never expected. Just be open and dive in ● Be cynical. Question everything. Why is this so? There’s a story there. Data is dirty. ● It’s ok. ● You have to learn to compromise. ● Sorry about your beautiful data model. ● Can you answer your question or not? ● Iteratively develop your questions based on what’s really interesting, necessary and practical You can do it. ● There’s great tooling. ● Code only when you need to. ○ Get familiar with pivot tables in your favourite spreadsheet program and OpenRefine ○ CSVKit is such an amazing tool for getting to know and transforming big and small CSVs. pip install it in any project using csv. ● When you need to, code. ○ Try out Pandas. I find the syntax frustrating but when you get it right it’s so powerful. You don’t want to code the stuff it does for you. ○ Don’t be afraid to put stuff into the database. Migrations make it easy to develop your data model as you learn about your data and needs

Slide 50

Slide 50 text

○ Go to the shell programming talk tomorrow.

Slide 51

Slide 51 text

Join the community ● CodeBridge Community Evenings ○ TONIGHT 18:30 ○ Every Second Thursday ○ CodeBridge, Thicket Street, Cape Town ● CodeBridge DurbanLab ● Hacks/Hackers in Joburg ● Start your city’s community ● Find your tribe! Civic Tech Directory coming soon ○ ZATech Slack #civic-tech What you don’t see in this picture is the fat lounge music playing in the background. Until I finally convinced someone else to start DJing (instructions are on the wall)

Slide 52

Slide 52 text

We’re hiring :) https://openup.org.za/careers