to SQL of WoS - Pick up sufficient data to analyze - Which tag is needed? - Title? - Names Count? (as known as scientific ordering) - Publisher? - Fund information? - Name? First name? Last Name? - ID?
*.zip to *.gz format • Take 3 mins for 8.16GB • Decrypt data from *.gz to *.zip format • Take 30 mins for 20GB • Decrypt data from *.zip to *.xml format • Take 1.2 hours for 10GB • Parsing xml data into (my)sql format • Take 5.5 hours for 40.5GB • Binding separated *.sql format data into one single file for each year • Take 35 mins • Importing *sql format data into MySQL Server • Take 120 hours for 49.8GB ZIP - > GZ GZ -> ZIP ZIP -> XML XML -> SQL SQL -> SQL Server Accessin g from SQL client Finally you could get the data! Stata?R? Analyze Submissi on Revise Publis h
XML data to SQL format. • Using generic_paser.py which distributed in Github. • https://github.com/titipata/wos_pars er • Takes 5.5 hours to parse XML data in 40.5GB into SQL format. • Then import to SQL server, it takes 1.5-3 days with 2 million entries per year.
of the listing • host_id; host ID • host_name; name of the host • neighbourhood_grouplocation • neighbourhoodarea • Latitude; latitude coordinates • Longitude; longitude coordinates • room_type; listing space type • Price; price in dollars • minimum_nights; amount of nights minimum • number_of_reviews; number of reviews • last_review; latest review • reviews_per_month; number of reviews per month • calculated_host_listings_count; amount of listing per host • availability_365; number of days when listing is available for booking