to SQL of WoS - Pick up sufficient data to analyze - Which tag is needed? - Title? - Names Count? (as known as scientific ordering) - Publisher? - Fund information? - Name? First name? Last Name? - ID?
*.zip to *.gz format • Take 3 mins for 8.16GB • Decrypt data from *.gz to *.zip format • Take 30 mins for 20GB • Decrypt data from *.zip to *.xml format • Take 1.2 hours for 10GB • Parsing xml data into (my)sql format • Take 5.5 hours for 40.5GB • Binding separated *.sql format data into one single file for each year • Take 35 mins • Importing *sql format data into MySQL Server • Take 120 hours for 49.8GB ZIP - > GZ GZ -> ZIP ZIP -> XML XML -> SQL SQL -> SQL Server Accessin g from SQL client Finally you could get the data! Stata?R? Analyze Submissi on Revise Publis h
XML data to SQL format. • Using generic_paser.py which distributed in Github. • https://github.com/titipata/wos_pars er • Takes 5.5 hours to parse XML data in 40.5GB into SQL format. • Then import to SQL server, it takes 1.5-3 days with 2 million entries per year.