"Data journalism, small newsroom."

A workshop presented at the Florida Times-Union (Jacksonville). Jul. 28, 2012.

Carl V. Lewis

July 28, 2012

Transcript

  1. Data visualization in the newsroom { “presented by”: “carl v.

    lewis”, “for”: “the florida times-union”, “slides”: “bit.ly/NIXkOD”, “email”:“[email protected]” }
  2. What is data visualization? •Data itself is the story; standalone

    narrative. •Interactive, communicative, visual. •Ranges from simple (charts) to complex (database-driven applications). •Both a technique and a format. •Both entertaining and factual. • See: “The Many Words for Visualization”
  3. The history of data journalism •Grew out of CAR (computer-

    assisted reporting) tradition •John Snow’s 1854 cholera map •Has coincided with the era of “Big Data”
  4. On the emergence of the field of data journalism: •"When

    information was scarce, most of our efforts were devoted to hunting and gathering. Now that information is abundant, processing is more important." –Philip Meyer, UNC Chapel Hill
  5. On the growing importance of data-driven journalism: •“Journalists need to

    be data-savvy . . . Data-driven journalism is the future.” –Sir Tim Berners-Lee. •“The explosion of Web-based tools and ways of sifting through and sharing data has created something approaching a revolution, and the potential benefits for journalism are only just beginning to reveal themselves.” –Mathew Ingram
  6. What data journalism is not: • Simply incorporating public data

    into your textual narrative • Infographics • Illustration • Resource-intensive • Just about numbers and programming • Just about making data flashy
  7. What data journalism is: • Visual • Often evergreen •

    Transparent – direct access to primary source • Credible • Engaging • A good business model
  8. Democratization of data journalism • Free and open-source tools (Google

    Drive, JavaScript libraries, etc.). • Open Data laws. • “Anyone can do it. Data journalism is the new punk.” -Simon Rogers, The Guardian
  9. The job of the data journalist • Part statistician, part

    journalist, part programmer. • “We're statisticians. We don't program.” • “We’re programmers. We don’t report.” • “We’re journalists. We don’t code.”
  10. Notable examples of data visualization • “Mapping America: Every City,

    Every Block,” NYTimes.com. • “Where Does My Money Go?”, Open Knowledge Foundation. • “Illinois school report cards,” Chicago Tribune • “We Feel Fine,” Jonathan Harris • “Top Secret America,” The Washington Post
  11. When to use data visualization: • Show change over time

    • Comparing discrete values • Showing connections and flows • Showing hierarchy • Browsing large databases
  12. When not to use data visualization: • When text or

    multimedia tells story better • When you have very few data points • When there is no statistical significance • When a map is not a map • When a table would do
  13. Process of data journalism 1. Research – Think of topic

    and research factors. 2. Find the data – Locate and retrieve relevant public data 3. Analysis and evaluation – Crunch numbers, look for trends or inconsistencies 4. Visualize – Display the data in appropriate manner
  14. Research 1. Think of a topic – what factors influence

    it? 2. What public data might shed light on those factors? 3. Seek out the data
  15. Locating public data • Thousands of public “data dumps” by

    government bodies and nonprofits. • Most commonly in delimited spreadsheet format (look for .csv, .xls), sometimes in XML and JSON. • For geographic data, look for .kml or .shp • Can be found directly at source or by search engine keyword
  16. Search tips for data retrieval • If you don’t know

    which source to look to find your data, an initial Web search might help. • After your keywords, type “filetype:XLS”, “filetype:CSV”, or whatever the extension is of the data you’re seeking, and you’ll see only files of that type from across the Web. • If you get no results, try broadening your search term to locate sources that cover the general discipline (i.e. instead of “malaria deaths,” try “public health data”)
  17. Locating public data • Federal sources: Data.gov, Census.gov, OpenSecrets.org, FollowTheMoney.org,

    USA.gov, USGovXML.com (full federal list by topic/agency here). • Data catalogs such as thedatahub.org, datamarket.com, infochimps.org, datacatalogs.org are good places to find non-
  18. • Florida’s “Sunshine” law requires all state agencies to provide

    open access to public records, including data. • Chapter 119 of Florida State Statutes mandates that “any records made or received by any public agency in the course of its official business are available for inspection, unless specifically exempted by the Florida Legislature.” Florida public data sources
  19. • Dozens of useful open data sources maintained by Florida

    government agencies, including TransparencyFlorida.gov, FloridaHasARightToKnow.com and MyFlorida.gov • Full-list of state-maintained databases by topic here. • A few state-maintained databases worth mentioning: the Division of Elections’ campaign finance data, the DOE’s test score reports and the Department of Law Enforcement’s arrest and officer reports. Florida public data sources
  20. Florida public data sources • A number of advocacy groups

    also maintain useful, downloadable statewide databases: • FloridaOpenGov.org, which focuses on public employee payroll data. • FloridaRedistricting.org, which provides demographic data (.csv) and geographic polygons (.shp) for new district boundaries. • Florida Housing Data Clearinghouse, which provides regularly updated property values, housing data (.xls). (for even more, see my semi-exhaustive list with descriptions here).
  21. Georgia public data sources • Although Georgia has no law

    requiring all government agencies to make public data accessible online, many do anyway. • In 2008, the Transparency in Government Act expanded the public data site, Open.Georgia.gov, to include all three branches of government, regional education service agencies, local boards of education, and transactions made by the General Assembly.
  22. Georgia public data sources • A comprehensive list of downloadable

    databases from state agencies in Georgia can be found here. • The State Ethics Committee has made all campaign finance reports, lobbyist reports and campaign contributions available in downloadable spreadsheets. • OASIS provides a set of web-based tools to browse the Georgia Department of Public Health’s Data Warehouse, and download the data yourself if you wish.
  23. Locating geographic data • Most geographic data available as TIGER/Line

    Shapefile packages (archives containing .shp, .dbf, .prj, .xml, .shx) from U.S. Census Bureau. • Google also hosts a directory of .kml files for most geographic boundaries here. • Alternatively, Florida and Georgia GIS data can be found at FGDL.org, Geoplan and Data.GeorgiaSpatial.org.
  24. What to look for • Most numeric spreadsheet data comes

    either as a comma-separated values (.csv) or Microsoft Excel (.xls) file. Example of .csv structure: "Name","Date","Address","Zip","State","Country" • XML (eXtensible Markup Language) stores data hierarchically for the Web, and is good for building news applications because of its broad interoperability. <menu id="file" value="File"> <popup> <menuitem value="New" onclick="CreateNewDoc()" /> <menuitem value="Open" onclick="OpenDoc()" /> <menuitem value="Close" onclick="CloseDoc()" /> </popup> </menu> • JSON (JavaScript Object Notation) – Similar to XML in structure, but has “lighter” punctuation, based on JavaScript conventions. May eventually replace XML as standard. {"menu": { "id": "file", "value": "File", "popup": { "menuitem": [ {"value": "New", "onclick": "CreateNewDoc()"}, {"value": "Open", "onclick": "OpenDoc()"}, {"value": "Close", "onclick": "CloseDoc()"} ] } }}
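
As a rough illustration of the three formats above, the following Python sketch reads each one with the standard library. The CSV row is a made-up sample; the XML and JSON snippets reuse the menu example from the slide. The function names are hypothetical and chosen only for this example.

```python
import csv
import io
import json
import xml.etree.ElementTree as ET

# Made-up CSV sample, for illustration only.
CSV_TEXT = '"Name","Date","Zip"\n"Jane Doe","2012-07-28","32202"\n'

# The XML and JSON menu examples from the slide.
XML_TEXT = """
<menu id="file" value="File">
  <popup>
    <menuitem value="New" onclick="CreateNewDoc()" />
    <menuitem value="Open" onclick="OpenDoc()" />
    <menuitem value="Close" onclick="CloseDoc()" />
  </popup>
</menu>
"""

JSON_TEXT = """
{"menu": {"id": "file", "value": "File", "popup": {"menuitem": [
  {"value": "New", "onclick": "CreateNewDoc()"},
  {"value": "Open", "onclick": "OpenDoc()"},
  {"value": "Close", "onclick": "CloseDoc()"}
]}}}
"""


def read_csv_rows(text):
    """Return each CSV row as a dict keyed by the header row."""
    return list(csv.DictReader(io.StringIO(text)))


def read_xml_menu_values(text):
    """Walk the XML tree and collect the 'value' attribute of every menuitem."""
    root = ET.fromstring(text.strip())
    return [item.get("value") for item in root.iter("menuitem")]


def read_json_menu_values(text):
    """Follow the nested JSON keys down to the same list of menu items."""
    data = json.loads(text)
    return [item["value"] for item in data["menu"]["popup"]["menuitem"]]


if __name__ == "__main__":
    print(read_csv_rows(CSV_TEXT))
    print(read_xml_menu_values(XML_TEXT))
    print(read_json_menu_values(JSON_TEXT))
```

Both the XML and the JSON reader return the same three menu values, which shows that the two formats carry the same hierarchy with different punctuation.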
  25. Scraping other sources • Scrape data from an HTML table

    with simple Google spreadsheet formula: =ImportHtml("http://the-url-goes-here", "table", 1) • For database of HTML tables, try Haystax. • For PDFs, try CometDocs. • Scrape webpages by running or creating Python script at ScraperWiki.
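
For readers who would rather script the table scrape than use a spreadsheet formula, here is a minimal sketch using only Python's standard library. It is not how any of the tools above work internally; the class name, the `table_number` parameter and the sample HTML (with invented figures) are assumptions made for this example, and nested tables are not handled.

```python
from html.parser import HTMLParser

# Invented sample page, for illustration only.
SAMPLE_HTML = (
    "<table>"
    "<tr><th>County</th><th>Turnout</th></tr>"
    "<tr><td>Duval</td><td>61%</td></tr>"
    "</table>"
)


class TableParser(HTMLParser):
    """Collect every <table> on a page as a list of rows of cell text."""

    def __init__(self):
        super().__init__()
        self.tables = []   # one list of rows per table found
        self._row = None   # cells of the row currently being read
        self._cell = None  # text fragments of the cell currently being read

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self.tables.append([])
        elif tag == "tr" and self.tables:
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.tables[-1].append(self._row)
            self._row = None


def extract_table(html, table_number=1):
    """Return the rows of the chosen table (1 = first table on the page)."""
    parser = TableParser()
    parser.feed(html)
    return parser.tables[table_number - 1]


if __name__ == "__main__":
    for row in extract_table(SAMPLE_HTML):
        print(row)
```

In practice the HTML would come from a downloaded page (for example via `urllib.request.urlopen`) rather than a hard-coded string.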
  26. APIs for data retrieval • APIs (application programming interfaces) are

    how many websites and services share content with one another. • Allows a computer system to fetch, interpret and use data created on another system, even if it used a different programming language or structure. • Examples: Twitter Search API, Google Maps API, NYTimes Campaign Finance API. • Usually returns data as XML, JSON or .txt • Often requires use of an API key.
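
A typical API call boils down to building a URL that carries your query and API key, then parsing the JSON that comes back. The sketch below shows that shape in Python. The endpoint `api.example.com`, the `api-key` parameter name, the key `DEMO_KEY` and the sample response are all hypothetical; real APIs document their own URLs, parameter names and response layouts.

```python
import json
from urllib.parse import urlencode

# Hypothetical endpoint; substitute the URL from the API's documentation.
BASE_URL = "https://api.example.com/v1/contributions"

# Hypothetical response body, shaped like a typical JSON API reply.
SAMPLE_BODY = (
    '{"status": "OK", "results": '
    '[{"candidate": "Example Candidate", "total": 1250.0}]}'
)


def build_request_url(base_url, api_key, **params):
    """Attach the query parameters and the API key to the base URL."""
    query = dict(params)
    query["api-key"] = api_key  # parameter name is an assumption
    # Sorting keeps the URL stable, which helps with caching and debugging.
    return base_url + "?" + urlencode(sorted(query.items()))


def parse_response(body):
    """Decode the JSON text and return the list under the 'results' key."""
    payload = json.loads(body)
    return payload["results"]


if __name__ == "__main__":
    print(build_request_url(BASE_URL, "DEMO_KEY", state="FL", cycle=2012))
    print(parse_response(SAMPLE_BODY))
```

To fetch live data, the URL would be passed to something like `urllib.request.urlopen` and the response text handed to `parse_response`.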
  27. Manipulating datasets • Data rarely ready for analysis and visualization

    out-of-the-box (hence “raw data”). • Spreadsheet applications most common and easiest way to work with data (Excel, Google Spreadsheets). • Allow for complex calculations, formulas, sorting. • Compatible with a variety of file formats (.xls, .ods, .csv, .txt, .tsv). • Scripts may also be written to automate bulk manipulation (Python). • R Project (r-project.org)
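
As one small example of the kind of bulk manipulation a script can automate, the Python sketch below converts tab-separated text to CSV while trimming stray whitespace, and turns currency-style strings into numbers. The function names and the sample salary row are invented for illustration.

```python
import csv
import io


def convert_tsv_to_csv(tsv_text):
    """Rewrite tab-separated text as CSV, trimming whitespace in each cell."""
    reader = csv.reader(io.StringIO(tsv_text), delimiter="\t")
    output = io.StringIO()
    writer = csv.writer(output, lineterminator="\n")
    for row in reader:
        writer.writerow([cell.strip() for cell in row])
    return output.getvalue()


def clean_number(text):
    """Turn strings such as ' $52,000 ' into floats; return None if not numeric."""
    cleaned = text.strip().replace("$", "").replace(",", "")
    try:
        return float(cleaned)
    except ValueError:
        return None


if __name__ == "__main__":
    raw = "name\tsalary\n Ann \t$52,000\n"  # invented sample row
    print(convert_tsv_to_csv(raw))
    print(clean_number(" $52,000 "))
```

Note that the CSV writer quotes `$52,000` automatically because the value itself contains a comma.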
  28. Data analysis • To figure out what your data says,

    you’ll need to crunch the numbers. • Statistical significance is litmus test. • Skewed or normal distribution? Why? • Outliers? If so, error or unexplained factor?
  29. Benchmarks for analysis • Mean (μ) simplest to calculate, but

    susceptible to errors caused by outliers. • Median usually a better metric in determining conclusion, especially with skewed distribution. • If mean=mode, no skewness. • Standard deviation (σ) measures how widely values in the data set are spread around the mean. • Z-Score = how many standard deviations a value is away from the mean and, thus, its likelihood of being an outlier.
  30. Calculating values in Excel • Mean: =AVERAGE(A1:A27) • Median: =MEDIAN(A1:A27)

    • Standard deviation: =STDEV(A1:A27) • Z-score of a given value: Subtract mean of dataset from value. Divide result by the standard deviation.
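
The same benchmarks can be computed outside a spreadsheet. The sketch below uses Python's `statistics` module on a made-up list of salaries (in thousands) with one obvious outlier, to show how the mean gets pulled upward while the median stays put. It assumes the sample standard deviation, which is what Excel's STDEV also calculates.

```python
import statistics

# Made-up salaries in thousands of dollars; 200 is a deliberate outlier.
SALARIES = [30, 32, 35, 31, 33, 200]


def summarize(values):
    """Return (mean, median, sample standard deviation) for a list of numbers."""
    mean = statistics.mean(values)
    median = statistics.median(values)
    stdev = statistics.stdev(values)  # sample standard deviation
    return mean, median, stdev


def z_score(value, mean, stdev):
    """How many standard deviations a value sits above or below the mean."""
    return (value - mean) / stdev


if __name__ == "__main__":
    mean, median, stdev = summarize(SALARIES)
    print("mean:", round(mean, 2))
    print("median:", median)
    print("stdev:", round(stdev, 2))
    print("z-score of 200:", round(z_score(200, mean, stdev), 2))
```

Here the median (32.5) describes the typical salary far better than the mean (about 60.2), and the outlier's z-score is roughly 2.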
  31. Other commonly used Excel formulas • Concatenate to merge multiple

    columns. • MID to split columns. • Percent change to display relative change over time: =(new_value-original_value)/ABS(original_value) • See this guide of helpful Excel tricks for data journalists, compiled by Mary-Jo Webster of St. Paul Pioneer Press: https://docs.google.com/file/d/0ByLyArAQRhaBNDc3NjJjYTUtY2U0Yi00NmIwLThkNTgtYzNlYThmNGE1ZTEz/edit
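
For comparison, here is a rough Python equivalent of the three spreadsheet operations above. The function names mirror the Excel ones for readability; they are written for this example and are not part of any library.

```python
def percent_change(original_value, new_value):
    """Relative change from original to new, as a fraction (0.25 means +25%)."""
    return (new_value - original_value) / abs(original_value)


def concatenate(*parts, separator=""):
    """Join several column values into one string, like Excel's CONCATENATE."""
    return separator.join(str(part) for part in parts)


def mid(text, start, length):
    """Return `length` characters starting at 1-based position `start`, like Excel's MID."""
    return text[start - 1:start - 1 + length]


if __name__ == "__main__":
    print(percent_change(200, 250))
    print(concatenate("Jacksonville", "FL", separator=", "))
    print(mid("32202-1234", 1, 5))
```

Dividing by the absolute value of the original keeps the sign of the result meaningful when the starting value is negative.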
  32. Refining and cleaning data • Sometimes Excel and Google Spreadsheets

    aren’t enough, especially when working with large datasets. • Google Refine – free tool that lets you explore, power sort and process data. • Useful for finding and fixing errors and inconsistencies, “power tool for working with messy data.” • Facets to sort data • Cleaning with clusters • Shan Carter’s Mr. Data Converter to convert spreadsheets to more web-friendly format.
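
To give a feel for what "cleaning with clusters" means, the sketch below groups spellings that collapse to the same normalized key. This is a simplified key-collision approach written for illustration; it is similar in spirit to clustering in Refine but is not the tool's actual implementation, and the county names are sample values.

```python
import string
from collections import defaultdict


def fingerprint(value):
    """Normalize a string: trim, lowercase, drop punctuation, sort unique words."""
    cleaned = value.strip().lower()
    cleaned = cleaned.translate(str.maketrans("", "", string.punctuation))
    tokens = sorted(set(cleaned.split()))
    return " ".join(tokens)


def cluster(values):
    """Group values sharing a fingerprint; keep groups with more than one spelling."""
    groups = defaultdict(list)
    for value in values:
        groups[fingerprint(value)].append(value)
    return [group for group in groups.values() if len(set(group)) > 1]


if __name__ == "__main__":
    names = ["Duval County", "duval county ", "County, Duval", "Clay County"]
    for group in cluster(names):
        print(group)
```

Each group returned is a set of candidate duplicates that a person should review before merging them into one spelling.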
  33. Other data analysis tips and tricks • Put field names

    in first row. • Put geographic data in first columns • When you have two different datasets, a good tool to merge them is Google Fusion Tables (make sure they share a common attribute). • Never round until the end of calculations. Round to two decimal places for visualization purposes. • Cut and paste calculations into a new column as values only. • Know the principal data types (integer, real, string, boolean), and make sure numeric data is classified as either integer (whole numbers only) or real (any value).
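
Merging two datasets on a shared attribute, and rounding only at the very end, can be sketched in a few lines of Python. This is a generic illustration, not how Fusion Tables works internally; the county labels and figures are invented.

```python
# Invented sample datasets that share the "county" attribute.
POPULATION = [
    {"county": "A", "population": 1000},
    {"county": "B", "population": 3000},
]
CASES = [
    {"county": "A", "cases": 7},
    {"county": "C", "cases": 2},
]


def merge_on(left_rows, right_rows, key):
    """Combine rows from both datasets wherever the key value matches."""
    right_index = {row[key]: row for row in right_rows}
    merged = []
    for row in left_rows:
        match = right_index.get(row[key])
        if match is not None:
            merged.append({**row, **match})
    return merged


def cases_per_1000(row):
    """Unrounded rate; round only when the number is displayed."""
    return row["cases"] / row["population"] * 1000


if __name__ == "__main__":
    for row in merge_on(POPULATION, CASES, "county"):
        print(row["county"], round(cases_per_1000(row), 2))
```

Rows without a match in both datasets are dropped, so it is worth checking how many rows survive the merge before drawing conclusions.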
  34. Planning your visualization • Identify your key message • Choose

    the best data series to illustrate your point • Consider the number of points in the data • Think about complementary/supporting datasets you can incorporate, e.g. sanitation with poverty. • Plan for user interaction, i.e. visual feedback. • Make numerical changes to raw data to enhance your point, e.g. absolute values vs. percent change • Brainstorm potential technologies • Consult experts on topic to back up your interpretation of data
  35. Choosing the right type of visualization • Change of single

    variable over time: line chart. • Comparison of single variable among multiple classes: bar chart. • Two variables: scatter plot, bubble chart. • Hierarchical data: treemap, bubbletree. • Area charts for area only • Makeup of whole: pie chart. • Distribution: histograms, box-and-whisker plots. • Geographic data (point, polygon, choropleth and symbol maps). • Records: searchable database. • Chronological data: timeline, sparklines. • Other possibilities: matrices, heatmap, games, slopegraphs, stepper graphics,
  36. Visualization design principles • Typography: clear, consistent, not distracting. •

    Use bold, mix of serif/sans-serif to provide emphasis. • Don’t set type at an angle • Color: Let color correspond to variable, design for accessibility, choose from same side of color wheel, consider cultural associations but avoid thematic palettes. Use Adobe Kuler or 0to255.com • Visual overload, emotional design, skeuomorphism. No white type on black background. No angled type.
  37. • Some guidelines for graphical integrity, according to Edward Tufte

    in The Visual Display of Quantitative Information: 1. Representation of numbers should be directly proportional to numerical qualities represented. 2. Clear, detailed labeling throughout. 3. Show data variation, not design variation. 4. Avoid excessive and unnecessary use of graphical effects What Edward Tufte calls “the worst visualization ever published.” Visualization design principles
  38. • Design for the eye • User should be able

    to discern key message visually. • Design for interaction • Highlighting and details on demand (example) • User-driven content selection (example) Visualization design principles
  39. “Four Ways to Slice Obama’s Budget Proposal” • From NYTimes.com:

    http://www.nytimes.com/interactive/2012/02/13/us/politics/2013-budget-proposal-graphic.html • What makes this visualization effective? How does it approach color, complexity, interactivity and typography? How does it avoid visual overload?
  40. Wireframing/ prototyping • Follow a structured grid system (i.e., 12

    column, 960px grid – see 960.gs and Subtraction). • Very selectively, you can break the grid to emphasize a certain visual element. • Sketch out/prototype your wireframe on paper first (print templates such as this)
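
The arithmetic behind a 12-column, 960px grid is simple enough to sketch. The Python below assumes the 960.gs defaults of 60px columns separated by 20px gutters (10px of margin on each side of a column); other grid systems use different numbers.

```python
def grid_width(columns, column_width=60, gutter=20):
    """Pixel width of an element spanning the given number of grid columns."""
    if columns < 1:
        raise ValueError("an element must span at least one column")
    return columns * column_width + (columns - 1) * gutter


if __name__ == "__main__":
    for span in (1, 4, 6, 12):
        print(span, "columns:", grid_width(span), "px")
```

A full-width element comes to 940px; the remaining 20px of the 960px container is the outer margin on either side.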
  41. Selecting tools/technologies • A wealth of free, open-source data visualization

    tools and libraries exist to shorten development times • Examples: Google Visualization API, Google Fusion Tables, Highcharts.js, CartoDB, d3.js, Tableau Public. • For everything else, HTML5 + CSS + JavaScript
  42. Web app anatomy Three components of a Web app: 1.

    HTML (structure) 2. CSS (styles) 3. JavaScript (interactivity)
  43. Parts of an HTML file An HTML file is made

    up of: 1. Doctype declaration 2. Head <head> 3. CSS/JavaScript references 4. Title <title> 5. Body <body> 6. A Div container 7. Divs (IDs and classes)
  44. Parts of a CSS file A CSS file is made

    up of: 1. Container ID 2. Default paragraph (p) style 3. Default H1,H2, etc. styles 4. Default .body style 5. Styles for all divs
  45. Maps 101 • Interactive maps combine geocoded data – points

    or polygons – along with metadata and/or numeric data. • KML (Keyhole Markup Language) quickly becoming popular file format, but Shapefile (shp.zip) is still the most widely available • Geographic data can either be geocoded, downloaded from the Web, or custom-drawn. • Good purveyor of news maps: The Texas Tribune.
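
To show what geocoded point data looks like once it is stored as KML, here is a minimal Python sketch that writes placemarks with the standard library. The place name and coordinates are sample values for illustration. One detail worth knowing: KML lists longitude before latitude.

```python
import xml.etree.ElementTree as ET

KML_NS = "http://www.opengis.net/kml/2.2"


def build_kml(points):
    """Build a KML document from (name, latitude, longitude) tuples."""
    ET.register_namespace("", KML_NS)
    kml = ET.Element(f"{{{KML_NS}}}kml")
    document = ET.SubElement(kml, f"{{{KML_NS}}}Document")
    for name, latitude, longitude in points:
        placemark = ET.SubElement(document, f"{{{KML_NS}}}Placemark")
        ET.SubElement(placemark, f"{{{KML_NS}}}name").text = name
        point = ET.SubElement(placemark, f"{{{KML_NS}}}Point")
        # KML coordinate order is longitude,latitude.
        coordinates = ET.SubElement(point, f"{{{KML_NS}}}coordinates")
        coordinates.text = f"{longitude},{latitude}"
    return ET.tostring(kml, encoding="unicode")


if __name__ == "__main__":
    # Sample marker; name and coordinates are illustrative only.
    print(build_kml([("Sample polling place", 30.33, -81.66)]))
```

The resulting text can be saved with a .kml extension and opened in most mapping tools that accept KML.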
  46. Mapping services and libraries • Google Fusion Tables – Quick,

    versatile and classic maps that integrate seamlessly with the Google Maps JavaScript API. • CartoDB – A newer open-source tool much like Fusion Tables, but with a better looking out-of-the-box experience. • Leaflet – An open-source, client-side mapping library with an API that allows you to achieve a number of advanced features. Plays nicely with Fusion Tables and CartoDB-hosted maps. Part of CloudMade suite.
  47. Handy desktop mapping software • QGIS – Free program that

    supports almost every conceivable map file type, and allows you to add or manipulate vector data, which can then be exported as a KML or Shapefile package. • TileMill – A map creation and styling software; ideal for those with little programming experience. UTF-grid enabled tilesets only.
  48. Primary map types • Choropleth – Colors for each geometry

    correspond to numeric values of a given variable. • Point – Locations on a map displayed by geocoded markers. • Less frequently: proportional maps and geo maps. Choropleth map of Georgia voter turnout. Point map of Jacksonville polling locations.
  49. Tips and tricks • If you have street address data,

    you can use BatchGeocode to convert them to lat-long coordinates. • For choropleth maps: • Include no more than five fill colors or “buckets.” • Don’t define an equidistant color ramp; use ColorBrewer instead. • Use MarkerClusterer when there are too many points for certain zoom levels. Using ColorBrewer to define an accurate, accessible color ramp. Using MarkerClusterer to cluster points at further zoom levels.
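
One way to apply the five-bucket guideline is to sort values into classes and pair each class with a color from a ColorBrewer ramp. The Python sketch below uses quantile breaks, which is an assumption of this example (the slide does not prescribe a classification method), together with ColorBrewer's five-class "Blues" palette. It needs Python 3.8 or later for `statistics.quantiles`.

```python
import bisect
import statistics

# ColorBrewer five-class sequential "Blues" palette, light to dark.
BLUES_5 = ["#eff3ff", "#bdd7e7", "#6baed6", "#3182bd", "#08519c"]


def quantile_breaks(values, buckets=5):
    """Cut points that split the values into equally populated buckets."""
    if not 2 <= buckets <= 5:
        raise ValueError("use between two and five buckets")
    return statistics.quantiles(values, n=buckets)


def bucket_index(value, breaks):
    """Index of the bucket a value falls into (0 is the lowest bucket)."""
    return bisect.bisect_right(breaks, value)


def fill_color(value, breaks, palette=BLUES_5):
    """Pick the fill color for a value from the palette."""
    return palette[bucket_index(value, breaks)]


if __name__ == "__main__":
    turnout = list(range(1, 101))  # placeholder values, not real data
    breaks = quantile_breaks(turnout)
    print(breaks)
    print(fill_color(1, breaks), fill_color(100, breaks))
```

Quantile breaks put roughly the same number of geometries in each color, which avoids maps where almost everything lands in one bucket.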
  50. Tips and tricks • To convert Shapefiles so they can

    be imported into Fusion Tables, either use Shape to Fusion, or export it as KML from CartoDB. • Before using the embed tool in Fusion Tables or CartoDB, make sure the map is centered where you want it. • Ensure your map is set to “Public.” Export a Shapefile as KML in CartoDB. Making your map public in Fusion Tables
  51. Charts • Basic building block of visualization • Simple, but

    also easy to mess up. • Should always be interactive. • Should always include data source. • Should always include a legend. • Unless necessary, only show labels on mouseover.
  52. Interactive charting tools • Out-of-the-box: Google Drive charts, infogr.am. •

    More advanced: Google Code Playground. • Most agile: Highcharts.js. • Most extendible: Tableau Public A combo chart made using Highcharts.js
  53. Charting best practices • Color: Pick palette of no more

    than 3-4 colors from same side of color wheel. • Increments: Use natural increments like (0,2,4,6...) instead of, say, (0,3,6,9...) • Scale: Don’t plot two unrelated series with one scale on left and one on right. • Style: Flat and simple. No 3D effects, shadows, narrow bars or distracting shading. Don’t plot two different variables on same scale. Bars too narrow Distracting shading Misleading 3D effects Pointless shadows Source: The Wall Street Journal Guide to Information Graphics, Dona M. Wong.
  54. Charting best practices • Always set the baseline to zero.

    • Always order starting with greatest value • Use broken bars sparingly • No more than five slices on pie charts; no “donut” pie charts. • No more than 3-4 lines on line chart Wrong order Right order Wrong baseline Right baseline No donut-pies Source: The Wall Street Journal Guide to Information Graphics, Dona M. Wong.
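
Two of the rules above (order by greatest value, no more than five pie slices) are easy to enforce before the data ever reaches a charting tool. The Python sketch below sorts categories from largest to smallest and folds the smallest ones into a single "Other" slice. The function name and the sample counts are invented for illustration, and placing "Other" last is a convention chosen here.

```python
def prepare_slices(counts, max_slices=5):
    """Order categories by value and fold the smallest into 'Other'."""
    ordered = sorted(counts.items(), key=lambda item: item[1], reverse=True)
    if len(ordered) <= max_slices:
        return ordered
    head = ordered[:max_slices - 1]
    other_total = sum(value for _, value in ordered[max_slices - 1:])
    return head + [("Other", other_total)]


if __name__ == "__main__":
    # Invented category counts.
    sample = {"A": 5, "B": 40, "C": 20, "D": 10, "E": 15, "F": 10}
    for label, value in prepare_slices(sample):
        print(label, value)
```

The totals are preserved, so the slices still add up to the whole after the small categories are combined.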
  55. Utilizing JavaScript/HTML5 libraries • Together, JavaScript, HTML5 and jQuery have

    expanded boundaries of data visualization • Abundance of open-source libraries and packages mean less programming required to produce unique, interactive visualizations. • Examples: Timeline.js, Bubbletree.js, Raphael.js, ProPublica tools
  56. The HTML5 revolution • Adobe Edge for HTML5 development; end

    of Flash’s reign • Platform-agnostic, mobile-first movement • Forking resources and packages off GitHub
  57. Pushing the limits • RaphaelJS for easier manipulation of serialized

    vector graphics • Other boundary-pushing data visualization projects: Processing!, Gephi, d3.js, IBM’s Many Eyes. A network map produced using D3.js
  58. Helpful resources and communities • Blogs/Tutorials: FlowingData.com, Vis4.net, Driven-by-data.net, Chryswu.com, datavisualization.ch

    • Books: The Data Journalism Handbook, O’Reilly Media. Visualize This: The FlowingData Guide to Design, Visualization, and Statistics, Nathan Yau. The Wall Street Journal Guide to Information Graphics, Dona M. Wong. • Communities: visual.ly, Hacks/Hackers, NICAR. Free data journalism handbook from O’Reilly Media