
Open Data APIs: Exploring OpenCorporates’ 80 million companies in 100+ jurisdictions


More Decks by API Strategy & Practice Conference


Transcript

  1. Building the underlying dataset of all company data
    • A critical underpinning of understanding the corporate world
    • De-siloing data from official corporate registers, and other government data, especially regulatory
    • Linking critical (and previously obscure) datasets
  2. One entry per legal entity
    • Assembled from company registers around the world
    • All automatically ingested – no manual imports
    • Over 100 jurisdictions already... more added every month
    • Disparate data normalised to key fields
    • Searchable across jurisdictions
    • Automatically matching foreign branches to home companies
    • Use company-register-based identifiers – non-proprietary, non-monopoly. Avoid lock-in
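The idea of a register-based identifier can be sketched as follows. This is an illustration only, not OpenCorporates' actual data model: it assumes an entity is identified by the pair of a jurisdiction code and the number the register itself issued, and `CompanyId`, its field names and the sample values are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CompanyId:
    """Hypothetical register-based identifier (names are assumptions)."""

    jurisdiction_code: str  # e.g. "gb"; the code scheme is assumed
    company_number: str     # the number issued by the official register

    def key(self) -> str:
        # Normalise case and whitespace so the same entity always
        # yields the same key, whichever source it came from.
        return f"{self.jurisdiction_code.strip().lower()}/{self.company_number.strip()}"


def same_entity(a: CompanyId, b: CompanyId) -> bool:
    """Two records describe one legal entity if their keys match."""
    return a.key() == b.key()


# Illustrative values only, not real register entries.
from_register = CompanyId("GB", "01234567")
from_licence_list = CompanyId("gb", " 01234567 ")
```

Because the key is built only from public register data, any other dataset holding the same jurisdiction and number can be joined to it without a proprietary lookup table.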
  3. Eventual target: Every bit of company-related public data in the world, matched to the relevant company
    • Already have millions of company data points from disparate public sources: WIPO trademarks, UK government spending, corporate structures from SEC filings & Federal Reserve
    • Current focus: every bank licence in the world
    • Next targets: business licences, other financial licences, non-profit data, government gazettes
    • Driven by user demand, or where there is structural benefit (e.g. corporate relationship data)
    • ‘Open’ critical to both mission and quality
  4. Common proprietary data quality issues
    Problem: Cause
    • Data accuracy: Data is re-keyed. Few eyeballs. Often little downside to lying
    • Gaps in data: High (& often duplicated) cost of data entry. Limited to payers
    • Lack of granularity: Legacy systems/data models hard to re-engineer in closed world
    • Errors go uncorrected: Few feedback mechanisms
    • Black box/No provenance: Can’t reveal (sometimes dubious) sources. Limits usefulness/trust
    • Isolated: Proprietary IDs are internal identifiers & are barriers to sharing & improved data quality
  5. Just released: v0.4
    • search by registered address, plus search for companies starting with given phrase (e.g. ‘Barclays Bank’)
    • filter by multiple jurisdictions (e.g. Ireland and UK)
    • filter by country (e.g. US)
    • richer filtering of inactive and branch companies
    • a new nonprofit filter, to restrict to/exclude companies with a nonprofit company type
    • users with API keys can now get addresses (and dates of birth) for directors/officers
    • search officers by address, date of birth, position or status
    • more powerful date searching
    • a completely new way of representing industry codes that is far more granular and allows more powerful search filtering
    https://www.flickr.com/photos/usairforce/6904504692
  6. What use is that?
    • Companies with registered address at the Empire State Building
    • Companies with ‘condominium’ in the name in the US and Canada
    • Officers who were born over 105 years ago, but are still active (requires API token)
    • Nonprofit companies in UK and US with ‘political’ in the name and incorporated in 2014 (requires API token)
    • Companies in the UK or Belgium with tax in the title and with the EU industry code for “Accounting, bookkeeping and auditing activities; tax consultancy”*
    • Companies based in Berlin with foreign branches
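Queries like these might be expressed as request URLs roughly as follows. This is a minimal sketch under stated assumptions, not the documented API: the base URL, the `/companies/search` path, the parameter names (`q`, `registered_address`, `jurisdiction_code`, `api_token`) and the pipe-separated multi-jurisdiction syntax are not given in the slides, and `build_search_url` is a hypothetical helper. It only builds URLs and sends no requests.

```python
from urllib.parse import urlencode

# Assumed base URL for the v0.4 API; check the official docs before use.
BASE_URL = "https://api.opencorporates.com/v0.4"


def build_search_url(resource, api_token=None, **params):
    """Build a search URL for `resource` (e.g. "companies").

    Parameter names are passed through unchanged, so their validity
    depends on the real API, which this sketch does not verify.
    """
    query = {name: value for name, value in params.items() if value is not None}
    if api_token:
        # Some searches on the slide are marked "requires API token".
        query["api_token"] = api_token
    return f"{BASE_URL}/{resource}/search?{urlencode(query)}"


# Companies with a registered address at the Empire State Building
# (the `registered_address` parameter name is an assumption).
empire_state = build_search_url(
    "companies", registered_address="Empire State Building"
)

# Name search restricted to the UK and Ireland (the "gb|ie" pipe-separated
# form is an assumption).
uk_ie_banks = build_search_url(
    "companies", q="barclays bank", jurisdiction_code="gb|ie"
)
```

Keeping URL construction separate from the HTTP call makes it easy to inspect or log the exact query before spending any API quota on it.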
  7. Innovative business model: Share-Alike or paid for
    • Cross-subsidy model brings best of both worlds
    • Public benefit – free and open website
    • Plus free access to data under share-alike licence for open data projects
    • Many eyes – improves quality
    • Benefit from efficiencies of scale
    • No blackbox data – full provenance for all data (source + date retrieved)
    • Gives added context and confidence
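"Full provenance (source + date retrieved)" could be modelled by attaching a source and a retrieval date to every stored value. The sketch below is an assumption about one way to do this, not a description of OpenCorporates' schema; `Provenance`, `Fact`, their fields and the sample values are all hypothetical.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Provenance:
    source_url: str      # where the value was obtained
    retrieved_on: date   # when it was retrieved


@dataclass(frozen=True)
class Fact:
    field: str
    value: str
    provenance: Provenance

    def cite(self) -> str:
        # Render the value together with its source and retrieval date,
        # so a reader can check it rather than trust a black box.
        p = self.provenance
        return (
            f"{self.field}: {self.value} "
            f"(source: {p.source_url}, retrieved {p.retrieved_on.isoformat()})"
        )


# Illustrative values only.
status = Fact(
    field="current_status",
    value="Active",
    provenance=Provenance("https://example.org/register/01234567", date(2015, 4, 1)),
)
```

Calling `status.cite()` yields a single line that carries the value, its source and its retrieval date together.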
  8. Who's using our data
    • World Bank
    • LinkedIn
    • Bureau van Dijk
    • Stripe
    • Avention (OneSource)
    • Creditsafe
    • Palantir
    • Funding Circle
    • etc
  9. Helping the open data community work together
    • Missions: A platform for collaborating on data-sourcing, scraping and cleansing
    • Turbot: A docker-based framework for scrapers
    • #FlashHacks: Collaborative crowdscraping events for fun and the public good
    Next FlashHacks April 29, London & Berlin