Assessment and Visualization of Metadata Quality for Open Government Data

Konrad Reiche
January 21, 2014

With the rise of the open data movement, governments and public agencies have started to open up their data for public use. Repositories are the technical tool for implementing this infrastructure: they facilitate the collection, publishing and distribution of data in a centralized and possibly standardized way. Metadata is used to catalog and organize the provided data. The operability and interoperability of a repository depend on the quality of its metadata.

Quantifying metadata quality can help to measure the efficiency of a repository and to discover low-quality metadata records that prevent users from finding what they are looking for. To this end, a range of metrics from the field of metadata quality assessment is researched and implemented. Current approaches should be adapted to the specifics of open government data repositories, and new approaches should be explored to fit their requirements.

To show the feasibility of these metrics, a platform is implemented that demonstrates the automatic quality assessment of different repositories. A harvester component gathers metadata from the repositories. The metrics are discussed in detail, and the platform's experimental results are analyzed for practical usage.

Transcript

  1. Assessment and Visualization
    of Metadata Quality
    for Open Government Data
    Konrad Johannes Reiche

  3. O·pen Da·ta /ˈəʊp(ə)n ˈdeɪtə/
    “A piece of content or data is open if anyone is free to
    use, reuse, and redistribute it — subject only, at most,
    to the requirement to attribute and/or share-alike.”

  6. XML
    JSON
    RDF
    PDF XLS CSV
    DOC

  7. Quality.
    What could possibly go wrong?
    Metadata Record
    Name regional-household-income
    ID 98899446-0a1a-43bc-874c-2d54dc700670
    Maintainer Margaret Jarmon
    Maintainer Email [email protected]
    Author Office for National Statistics
    Author Email [email protected]
    License ID uk-ogl
    Resources
    URL http://www.ons.gov.uk/ons/rhi13
    Description Spring 2013
    Format CSV
    URL http://www.ons.gov.uk/ons/rhi14
    Description Spring 2014
    Format CSV

  8. Quality.
    What could possibly go wrong?
    Metadata Record
    Name regional-household-income
    ID 98899446-0a1a-43bc-874c-2d54dc700670
    Maintainer
    Maintainer Email
    Author Office for National Statistics
    Author Email
    License ID uk-ogl
    Resources
    URL http://www.ons.gov.uk/ons/rhi13
    Description Spring 2013
    Format CSV
    URL http://www.ons.gov.uk/ons/rhi14
    Description
    Format CSV

  9. Quality.
    What could possibly go wrong?
    Metadata Record
    Name regional-household-income
    ID 98899446-0a1a-43bc-874c-2d54dc700670
    Maintainer
    Maintainer Email
    Author Office for National Statistics
    Author Email
    License ID uk-ogl
    Resources
    URL http://www.ons.gov.uk/ons/rhi13
    Description Spring 2013
    Format CSV
    URL http://www.ons.gov.uk/ons/rhi14
    Description
    Format CSV
    CSV
    HTML

  10. Quality.
    What could possibly go wrong?
    Metadata Record
    Name regional-household-income
    ID 98899446-0a1a-43bc-874c-2d54dc700670
    Maintainer
    Maintainer Email
    Author Office for National Statistics
    Author Email
    License ID uk-ogl
    Resources
    URL http://www.ons.gov.uk/ons/rhi13
    Description Spring 2013
    Format CSV
    URL http://www.ons.gov.uk/ons/rhi14
    Description
    Format CSV
    CSV

  11. Quality.
    What could possibly go wrong?
    Metadata Record
    Name
    ID 98899446-0a1a-43bc-874c-2d54dc700670
    Maintainer
    Maintainer Email
    Author
    Author Email
    License ID uk-ogl
    Resources
    URL http://www.ons.gov.uk/ons/rhi13
    Description Spring 2013
    Format CSV
    URL http://www.ons.gov.uk/ons/rhi14
    Description
    Format CSV
    CSV

  12. Reputation Loss
    QUALITY LOSS
    Information Loss

  13. Meta·da·ta Qual·i·ty
    /ˈmɛtədeɪtə kwɒlɪti/
    The fitness to describe the data (resources), supporting
    the task dimensions of finding, identifying, selecting and
    eventually obtaining the resources. The quality is inversely
    proportional to the uncertainty of the user about the
    actual data.

  14. Assessing Metadata Quality is HARD

  15. Automated Quality Assessment

  16. QUALITY METRICS

  17. Quality Measurement.
    metric : metadata record ⟶ score ∈ [0, 1]

  18. Completeness. How many fields have been completed?
    Metadata Record
    Name uk-civil-service-high-earners
    ID 68addaac-59ae-4230-bb67-c5a8f6a76285
    Maintainer
    Maintainer Email
    Author Civil Service Capability Group
    Author Email [email protected]
    License ID uk-ogl
    Resources
    Size 40959
    Description Civil Servants Salaries 2010
    Format CSV
    Size
    Description Civil Servants Salaries 2011
    Format CSV
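
The completeness question on this slide can be illustrated with a minimal sketch. It assumes that every leaf field counts equally, that resource fields are counted alongside top-level fields, and that `None`, blank strings and empty containers count as missing. The function names and the `author@example.org` address are hypothetical; the platform's actual counting rules are not given in the slides.

```python
from typing import Any, Tuple


def is_filled(value: Any) -> bool:
    """Treat None, blank strings and empty containers as missing."""
    if value is None:
        return False
    if isinstance(value, str):
        return value.strip() != ""
    if isinstance(value, (list, dict)):
        return len(value) > 0
    return True


def count_fields(value: Any) -> Tuple[int, int]:
    """Return (filled, total) over all leaf fields, descending into nested data."""
    if isinstance(value, dict) and value:
        children = list(value.values())
    elif isinstance(value, list) and value:
        children = value
    else:
        # A leaf (or an empty container) counts as exactly one field.
        return (1 if is_filled(value) else 0), 1
    filled = total = 0
    for child in children:
        child_filled, child_total = count_fields(child)
        filled += child_filled
        total += child_total
    return filled, total


def completeness(record: dict) -> float:
    """Fraction of fields that carry a value, in [0, 1]."""
    filled, total = count_fields(record)
    return filled / total if total else 0.0


# The record from the slide; the author email is a placeholder (assumption).
RECORD = {
    "name": "uk-civil-service-high-earners",
    "id": "68addaac-59ae-4230-bb67-c5a8f6a76285",
    "maintainer": "",
    "maintainer_email": "",
    "author": "Civil Service Capability Group",
    "author_email": "author@example.org",
    "license_id": "uk-ogl",
    "resources": [
        {"size": 40959, "description": "Civil Servants Salaries 2010", "format": "CSV"},
        {"size": None, "description": "Civil Servants Salaries 2011", "format": "CSV"},
    ],
}

print(round(completeness(RECORD), 3))  # 10 of 13 fields filled -> 0.769
```

Under these assumptions the record scores 10/13, because the maintainer, the maintainer email and the second resource's size are empty.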

  19. Weighted Completeness. Not all fields are equally relevant.
    Metadata Record
    Name uk-civil-service-high-earners
    ID 68addaac-59ae-4230-bb67-c5a8f6a76285
    Maintainer
    Maintainer Email
    Author Civil Service Capability Group
    Author Email [email protected]
    License ID uk-ogl
    Resources
    Size 40959
    Description Civil Servants Salaries 2010
    Format CSV
    Size
    Description Civil Servants Salaries 2011
    Format CSV
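
A weighted variant can be sketched as follows, restricted to the top-level fields for brevity. The weights are purely illustrative assumptions (the slides do not list the weights the platform uses), and the function names and the `author@example.org` address are hypothetical.

```python
from typing import Any, Dict


def is_filled(value: Any) -> bool:
    """Treat None, blank strings and empty containers as missing."""
    if value is None:
        return False
    if isinstance(value, str):
        return value.strip() != ""
    if isinstance(value, (list, dict)):
        return len(value) > 0
    return True


def weighted_completeness(record: Dict[str, Any], weights: Dict[str, float]) -> float:
    """Sum of weights of filled fields divided by the sum of all weights."""
    total = sum(weights.values())
    if total <= 0:
        return 0.0
    achieved = sum(w for field, w in weights.items() if is_filled(record.get(field)))
    return achieved / total


# Illustrative weights only (assumption): a higher weight means a more relevant field.
WEIGHTS = {
    "name": 3.0,
    "id": 1.0,
    "maintainer": 1.0,
    "maintainer_email": 2.0,
    "author": 2.0,
    "author_email": 2.0,
    "license_id": 3.0,
}

RECORD = {
    "name": "uk-civil-service-high-earners",
    "id": "68addaac-59ae-4230-bb67-c5a8f6a76285",
    "maintainer": "",
    "maintainer_email": "",
    "author": "Civil Service Capability Group",
    "author_email": "author@example.org",  # placeholder address
    "license_id": "uk-ogl",
}

print(round(weighted_completeness(RECORD, WEIGHTS), 3))  # 11 of 14 weight units -> 0.786
```

With these weights, a missing maintainer email costs twice as much as a missing maintainer name, which is the kind of distinction the slide title points at.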

  20. Accuracy. How accurate is the resource represented?
    Metadata Record
    Name regional-household-income
    ID 98899446-0a1a-43bc-874c-2d54dc700670
    Maintainer
    Maintainer Email
    Author Office for National Statistics
    Author Email
    License ID uk-ogl
    Resources
    URL http://www.ons.gov.uk/ons/rhi13
    Description Spring 2013
    Format CSV
    URL http://www.ons.gov.uk/ons/rhi14
    Description
    Format CSV
    CSV
    HTML
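
One plausible reading of the CSV versus HTML contrast on this slide is a resource whose declared format differs from what the URL actually serves. The sketch below follows that reading and is an assumption, not the platform's documented metric: it compares each declared format with an observed content type and reports the matching fraction. The `EXPECTED_MIME` mapping and the function name are hypothetical.

```python
from typing import Iterable, Tuple

# Hypothetical mapping from declared format labels to acceptable MIME types.
EXPECTED_MIME = {
    "CSV": {"text/csv"},
    "HTML": {"text/html"},
    "JSON": {"application/json"},
    "PDF": {"application/pdf"},
}


def format_accuracy(resources: Iterable[Tuple[str, str]]) -> float:
    """Fraction of (declared format, observed content type) pairs that agree."""
    pairs = list(resources)
    if not pairs:
        return 0.0
    matches = 0
    for declared, observed in pairs:
        # Drop parameters such as "; charset=utf-8" before comparing.
        mime = observed.split(";")[0].strip().lower()
        if mime in EXPECTED_MIME.get(declared.strip().upper(), set()):
            matches += 1
    return matches / len(pairs)


# Both resources declare CSV, but the second one actually serves an HTML page.
observed = [("CSV", "text/csv"), ("CSV", "text/html; charset=utf-8")]
print(format_accuracy(observed))  # 0.5
```

The observed content types would have to come from requesting each resource URL; that step is left out here so the sketch stays free of network access.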

  21. Richness of Information. How much value is added?


  22. Readability. How readable are the descriptions?
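
The slide does not say which readability formula is used. As an illustration only, the sketch below applies the Flesch Reading Ease formula, a common choice for English text, with a rough vowel-group syllable heuristic, and clamps the result into [0, 1]. The function names and the clamping step are assumptions.

```python
import re


def count_syllables(word: str) -> int:
    """Rough heuristic: one syllable per group of consecutive vowels, at least one."""
    groups = re.findall(r"[aeiouy]+", word.lower())
    return max(1, len(groups))


def flesch_reading_ease(text: str) -> float:
    """Flesch Reading Ease: 206.835 - 1.015 * (words/sentences) - 84.6 * (syllables/words)."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z]+", text)
    if not sentences or not words:
        return 0.0
    syllables = sum(count_syllables(w) for w in words)
    return (
        206.835
        - 1.015 * (len(words) / len(sentences))
        - 84.6 * (syllables / len(words))
    )


def readability_score(text: str) -> float:
    """Map the Flesch score onto [0, 1] by dividing by 100 and clamping (assumption)."""
    return min(1.0, max(0.0, flesch_reading_ease(text) / 100.0))


simple = "The cat sat on the mat."
dense = "Interoperability considerations necessitate comprehensive institutional harmonization."
print(readability_score(simple), readability_score(dense))
```

A short sentence of one-syllable words lands at the top of the scale, while a sentence of long, polysyllabic words falls to the bottom. Descriptions in other languages would need a formula calibrated for that language.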

  23. Availability. Are the links working?
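
A link check of this kind can be sketched as the fraction of resource URLs that respond. The sketch assumes a HEAD request with a status below 400 counts as "working"; the function names, the timeout and that threshold are assumptions rather than the platform's documented behavior. The checker is passed in as a parameter so the metric can be exercised without network access.

```python
import urllib.error
import urllib.request
from typing import Callable, Iterable


def url_responds(url: str, timeout: float = 5.0) -> bool:
    """Return True if a HEAD request succeeds with a status below 400."""
    try:
        request = urllib.request.Request(url, method="HEAD")
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status < 400
    except (urllib.error.URLError, ValueError, OSError):
        # Covers HTTP errors, malformed URLs, DNS failures and timeouts.
        return False


def availability(urls: Iterable[str], check: Callable[[str], bool] = url_responds) -> float:
    """Fraction of URLs for which the checker reports success, in [0, 1]."""
    url_list = list(urls)
    if not url_list:
        return 0.0
    return sum(1 for url in url_list if check(url)) / len(url_list)


# Offline usage with a stub checker that pretends one of the two links is dead.
links = ["http://www.ons.gov.uk/ons/rhi13", "http://www.ons.gov.uk/ons/rhi14"]
print(availability(links, check=lambda url: url.endswith("rhi13")))  # 0.5
```

Some servers reject HEAD requests, so a production checker might fall back to GET; that refinement is omitted here.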

  24. Implementation.

  25. REQUIREMENTS
    Non-functional
    Functional

  26. DESIGN.
    Repository
    + url : String
    + name : String
    + type : Symbol
    Snapshot
    + date : Date
    MetaMetadata
    + metadata_record : Hash
    + score : Float
    + statistics : Hash
    + completeness : Hash
    + weighted_completeness : Hash
    + richness_of_information : Hash
    ...
    + latitude : String
    + longitude : String
    + best_record() : MetaMetadata
    + worst_record() : MetaMetadata
    + score() : Float
    0..* 1..*

  27. Metric
    + compute(record)
    CompletenessMetric
    WeightedCompleteness
    OpennessMetric
    MetricWorker
    + perform(snapshot, metric)
    GenericMetricWorker
    CompletenessMetricWorker
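
The relationship between a metric and a worker in this diagram can be sketched as below. Only the names `Metric`, `compute(record)`, `CompletenessMetric`, `MetricWorker` and `perform(snapshot, metric)` come from the slide. Everything else is an assumption made for illustration: the choice of Python, a snapshot modeled as a plain list of records, a flat top-level completeness computation, and the shape of the returned summary.

```python
from abc import ABC, abstractmethod
from typing import Dict, List


class Metric(ABC):
    """A quality metric maps one metadata record to a score in [0, 1]."""

    @abstractmethod
    def compute(self, record: Dict) -> float:
        ...


class CompletenessMetric(Metric):
    """Fraction of top-level fields that carry a value (simplified)."""

    def compute(self, record: Dict) -> float:
        if not record:
            return 0.0
        filled = sum(1 for value in record.values() if value not in (None, "", [], {}))
        return filled / len(record)


class MetricWorker:
    """Applies one metric to every record of a snapshot and summarizes the scores."""

    def perform(self, snapshot: List[Dict], metric: Metric) -> Dict[str, float]:
        scores = [metric.compute(record) for record in snapshot]
        average = sum(scores) / len(scores) if scores else 0.0
        return {
            "average": average,
            "best": max(scores, default=0.0),
            "worst": min(scores, default=0.0),
        }


snapshot = [
    {"name": "regional-household-income", "author": ""},
    {"name": "uk-civil-service-high-earners", "author": "Civil Service Capability Group"},
]
result = MetricWorker().perform(snapshot, CompletenessMetric())
print(result)  # {'average': 0.75, 'best': 1.0, 'worst': 0.5}
```

Keeping the metric behind a single `compute` method means a worker can run any metric over a snapshot without knowing how the score is derived.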

  28. Metadata
    Harvester
    JSON JSON
    JSON
    Archives
    API Requests
    Records

  29. Imports
    Persist
    Metadata Census
    Metadata
    Harvester
    JSON JSON
    JSON
    Archives
    API Requests
    Records
    Preliminary
    Analyzer
    Dump
    Importer
    Database

  30. Imports
    Persist
    Metadata Census
    Metadata
    Harvester
    JSON JSON
    JSON
    Archives
    API Requests
    Records
    Metric Processor
    Query
    Records
    Scheduler
    Analyzer
    Preliminary
    Analyzer
    Dump
    Importer
    Database

  31. View
    User
    Generates
    Investigates
    Imports
    Persist
    Metadata Census
    Metadata
    Harvester
    JSON JSON
    JSON
    Archives
    API Requests
    Records
    Metric Processor
    Query
    Records
    Scheduler
    Analyzer
    Preliminary
    Analyzer
    Dump
    Importer
    Database

  34. Evaluation

  36. Rank  Repository                Score  Misspelling  Richness of Information  Openness  Completeness  Availability  Weighted Completeness  Readability  Accuracy
      1     data.gc.ca                74     97           86                       80        79            79            81                     71           20
      2     data.sa.gov.au            71     98           63                       94        77            86            82                     72           0
      3     GovData.de                67     99           4                        38        55            81            87                     79           56
      4     data.qld.gov.au           66     99           67                       96        73            60            78                     59           0
      4     PublicData.eu             66     98           84                       69        64            70            67                     42           32
      4     data.gov.uk               66     97           85                       69        62            74            67                     44           28
      4     africaopendata.org        66     100          20                       78        70            87            68                     55           53
      5     datos.codeandomexico.org  65     100          55                       84        65            100           75                     37           0
      6     catalogodatos.gub.uy      63     100          64                       1         70            74            78                     65           52
      6     data.openpolice.ru        63     100          0                        0         58            100           81                     100          64
      7     dados.gov.br              61     100          87                       36        53            57            72                     44           39
      8     opendata.admin.ch         59     100          12                       0         58            100           68                     35           100
      9     data.gv.at                57     100          21                       99        51            68            65                     59           0
      10    data.gov.sk               49     100          51                       0         48            92            58                     37           7
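
The Rank column in this table gives tied scores the same rank and then continues with the next integer (four repositories share rank 4, followed by rank 5). A minimal sketch of that dense ranking, using the Score values from the table, could look like this. The function name is hypothetical, and how the platform itself computes ranks is not stated in the slides.

```python
from typing import Dict, List, Tuple

# Overall scores as listed in the table above.
SCORES: Dict[str, int] = {
    "data.gc.ca": 74,
    "data.sa.gov.au": 71,
    "GovData.de": 67,
    "data.qld.gov.au": 66,
    "PublicData.eu": 66,
    "data.gov.uk": 66,
    "africaopendata.org": 66,
    "datos.codeandomexico.org": 65,
    "catalogodatos.gub.uy": 63,
    "data.openpolice.ru": 63,
    "dados.gov.br": 61,
    "opendata.admin.ch": 59,
    "data.gv.at": 57,
    "data.gov.sk": 49,
}


def dense_rank(scores: Dict[str, int]) -> List[Tuple[int, str, int]]:
    """Return (rank, name, score) rows, highest score first; ties share a rank."""
    rows = []
    rank = 0
    previous = None
    # sorted() is stable, so tied repositories keep their input order.
    for name, score in sorted(scores.items(), key=lambda item: -item[1]):
        if score != previous:
            rank += 1
            previous = score
        rows.append((rank, name, score))
    return rows


for rank, name, score in dense_rank(SCORES):
    print(rank, name, score)
```

Run on the table's scores, this reproduces the rank sequence 1, 2, 3, 4, 4, 4, 4, 5, 6, 6, 7, 8, 9, 10.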

  37. Conclusion

  38. What is good about this approach?

  40. What is not so good about this approach?

  41. Final Thought. Do not aim for excellence,
    aim for low-quality metadata.

  45. DEMO
