of fields: • h: Bitly user hash identifier • g: Bitly global hash identifier • a: browser user agent • u: long URL • t: timestamp (UTC) • c: country (two-letter code) • nk: repeat client • kw: keyword alias for user hash • ckw: custom keyword • cy: city (optional) All about the data! Decode data 1usagov_data 1usagov_data_small 1usagov_data_tiny
• Domain Name • Ex. loc.gov • Domain Type • Ex. Federal Agency • Global Hash • Same as in decodes data • Hostname • Ex. www.loc.gov • State All about the data! Agency data agency_map
to link. •Output results as a three column TSV: type, value, count Pure Python Count Clicks & Countries EXERCISE 1.2 Try on your own for 5 minutes and then we’ll review a solution
the top 20 links and 20 countries in descending order. •Output results as a three column TSV: type, value, count Pure Python Top Links & Countries EXERCISE 1.3 Try on your own for 5 minutes and then we’ll review a solution
the top 20 agencies, in addition to the top countries and links. •Output results as a three column TSV: type, value, count Pure Python Join With Agencies EXERCISE 1.4 Try on your own for 5 minutes and then we’ll review a solution
a agency of “Department of State”. •Calculate the top 20 links and countries within the filter. •Output results as a three column TSV: type, value, count Pure Python Filter By Agency EXERCISE 1.5 Try on your own for 5 minutes and then we’ll review a solution
is to count clicks by hash. •Write a mapper, combiner, and reducer to accomplish this. •Run the job on your VM. Hadoop Count Clicks EXERCISE 2.1 Try on your own for 5 minutes and then we’ll review a solution
to link. •Update your mapper, combiner, and reducer to accomplish this. •Run the job on your VM. Hadoop EXERCISE 2.2 Try on your own for 5 minutes and then we’ll review a solution Count Clicks & Countries
the top 20 links and 20 countries in descending order. •Hint: you will want to update your reducer to accomplish this. •Run the job on your VM. Hadoop EXERCISE 2.3 Try on your own for 5 minutes and then we’ll review a solution Top Links & Countries
the top 20 agencies, in addition to the top countries and links. •Run the job on your VM. Hadoop EXERCISE 2.4 Try on your own for 5 minutes and then we’ll review a solution Join With Agencies
a agency of “Department of State”. •Calculate the top 20 links and countries within the filter. •Run the job on your VM. Hadoop EXERCISE 2.5 Try on your own for 5 minutes and then we’ll review a solution Join With Agencies
top links per country? •Looking at 20 is fine (show() will output 20) EXERCISE 3.1 Spark Top Links & Countries Try on your own for 5 minutes and then we’ll review a solution
as the agency? •What are the top countries with the Department of Education as the agency? EXERCISE 3.2 Spark Filter by Agency Try on your own for 5 minutes and then we’ll review a solution
and France (FR) •Find the most popular links for 3 agencies of your choice •How do they differ between countries? Are there any similarities? EXERCISE 4 More agencies, Try on your own for 15 minutes and then we’ll review a solution links, and countries