PyConZA 2012: "High-performance Computing with Python" by Kevin Colville and Andy Rabagliati (part 2)

Pycon ZA
October 04, 2012

Part 1 of this talk will give an overview of how Python scales to supercomputer-sized problems and a brief introduction to using the Message Passing Interface (MPI) library to scale a Python program to large distributed-memory cluster systems.

Part 2 will cover Python libraries for access to 'big data' -- HDF5, NetCDF, PyTables -- mostly in reference to Earth observation data.
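
As a taste of what those libraries look like, here is a minimal sketch (not from the talk) of reading a variable with the netCDF4 module; the file name 'chlorophyll.nc' and variable name 'chlor_a' are hypothetical:

    from netCDF4 import Dataset

    ds = Dataset('chlorophyll.nc', 'r')      # open an existing NetCDF file
    print ds.variables.keys()                # names of the variables it holds
    chl = ds.variables['chlor_a'][0, :, :]   # one time slice as a NumPy array
    ds.close()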

Python scales from smartphones to supercomputers. There are two pillars to huge computing problems: high-performance computing and massive data.

The fastest and largest high-performance computing (HPC) systems are all distributed-memory cluster systems. The message passing interface (MPI) was designed to allow many processes to communicate efficiently across the high-speed network of a cluster supercomputer and effectively act as a single HPC program on thousands of processors. Python has access to the MPI library through the mpi4py module. An overview of HPC will be followed by an introduction to MPI with examples in Python using the mpi4py module.
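
A minimal sketch (not from the talk) of the mpi4py idiom described above, launched with something like mpirun -np 4 python hello_mpi.py:

    from mpi4py import MPI

    comm = MPI.COMM_WORLD
    rank = comm.Get_rank()               # this process's id in the communicator
    size = comm.Get_size()               # total number of MPI processes

    if rank == 0:
        # rank 0 sends a (pickled) Python object to every other rank
        for dest in range(1, size):
            comm.send({'greeting': 'hello', 'to': dest}, dest=dest, tag=0)
    else:
        msg = comm.recv(source=0, tag=0) # blocking receive from rank 0
        print 'rank %d of %d got %r' % (rank, size, msg)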

In part 2, Andy will show example routines for accessing NASA and ESA Earth observation data -- and routines for storing large files on the CHPC DIRISA data store. This also requires a local database to store the metadata -- in this case PostGIS.

Transcript

  1. Earth Observation Applications
     • Marine Remote Sensing Unit attached to Oceanography dept, University of Cape Town
     • Archive primary data downloaded from ESA and NASA, files between 100 and 700 GB
     • Processed by coastal region around Africa
     • Visualised, and raw data split by region as NetCDF
     • Website at www.afro-sea.org.za
     • Partnering with CHPC for data processing and storage
  2. Data storage systems
     • CHPC DIRISA with 700TB data storage
     • Concurrent access through all 14 nodes
     • Storage policies – mostly to do with replication
       – Local replication, RAID, Offsite
       – Sister disk cluster at CSIR Pretoria
       – SANReN connectivity between the two
     • Our current usage at 100TB including replication
     • No CHPC compute node access at present
  3. PostGIS schema
     • PostGIS database to store spatial, temporal data
     • Also indexed by filename and WOS Object ID
     • Allows geographical search for intersection, distance, area and other specialist geography functions
     • Satellite data stored in Object store
     • Use Django on website
  4. PostGIS schema

      class Swaths(models.Model):
          filename = models.CharField('base file name', max_length=100, db_index=True)
          filepath = models.CharField('file path', max_length=100)
          fileext = models.CharField('file extension', max_length=20)
          shape = models.PolygonField('Border', spatial_index=True, srid=4326, geography=True)
          objects = models.GeoManager()
          source = models.CharField(max_length=100)
          version = models.CharField(max_length=20)
          overpass = models.DateTimeField(db_index=True)
          oid = models.CharField('WOS ObjectId', max_length=40, db_index=True)
          orbit = models.IntegerField(db_index=True, null=True)
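
A geographical search against this model might look like the following sketch (not from the slides), using standard GeoDjango lookups; the bounding box and date range are hypothetical:

      from django.contrib.gis.geos import Polygon

      # lon/lat bounding box off the southern African coast (SRID 4326)
      region = Polygon.from_bbox((15.0, -36.0, 33.0, -28.0))

      swaths = Swaths.objects.filter(
          shape__intersects=region,                       # spatial search
          overpass__range=('2012-01-01', '2012-02-01'),   # temporal search
      ).order_by('overpass')

      for s in swaths:
          print s.filename, s.oid
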
  5. WOS API
     • The WosCluster constructor is used to connect to the cluster. The returned handle is used in all future Put, Get, and similar commands. If all of the nodes in a cluster are configured under the same name in DNS, DNS will provide a round-robin response of different node IP addresses for successive connect statements, thereby achieving some measure of load-balancing.

       wos = WosCluster("cluster-dnsname-or-ipaddr")

     • Only one connection to the cluster per process may be open at one time. Connect is a somewhat expensive operation. It is intended that applications do this once (or infrequently) and then keep their connection open.
  6. WOS API
     • Create a WosObj

       obj = WosObj()
       obj.data = 'lots of bytes of data...'
       obj.meta['a'] = 'b'
       obj.meta['date'] = time.time()

     • Put object, policy - Store an object with replication directives specified by policy. The return value is an object-id (generally referred to as an OID).

       policy = 'replicate'
       oid = wos.put(obj, policy)
  7. WOS API
     • Get oid - Retrieve an object by OID from the cluster

       obj = wos.get(oid)
       print obj.data
       print obj.meta['date']

     • Delete oid - Delete an object from the cluster. Note that although the space will be reclaimed for future use, the previously-used OID will never be returned again.

       wos.delete(oid)
  8. WOS API
     • It is possible to break the Put step into two pieces which can be executed separately: the first step is Reserve, which returns an OID, and the second step is PutOID, which associates an object with that OID. Note that for a given OID, PutOID can only be called once.
     • Reserve policy - Reserve an OID

       policy = 'copy3'
       oid = wos.reserve(policy)

     • PutOID object, oid - Associate an object with a previously-reserved OID

       wos.putoid(obj, oid)
  9. WOS Streaming API
     • The streaming API deals with large objects which would not fit in memory.
     • WosCluster.CreatePutStream policy - Create a WosPutStream instance.
     • The WosPutStream and WosGetStream classes are intended to mimic a file stream in Python as closely as possible.
     • WosPutStream.write data - Data can be any Python binary-safe string; appended to the object.
     • WosPutStream.meta - Meta is a dictionary of string-based key/value pairs, saved with the object.
     • WosPutStream.close()
     • WosCluster.CreateGetStream oid, prefetch_meta - Takes the OID of the object to be retrieved, and an (optional) boolean for metadata.
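
Reading an object back with these calls might look like the following sketch (not from the slides); it assumes WosGetStream.read behaves like file.read, and CHUNKSIZE and the output path are hypothetical:

      CHUNKSIZE = 1024 * 1024              # 1 MB reads
      gets = wos.CreateGetStream(oid)      # stream handle for a stored object
      f = open('swath.raw', 'wb')
      while True:
          data = gets.read(CHUNKSIZE)
          if len(data) == 0:               # empty read signals end of object
              break
          f.write(data)
      f.close()
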
  10. WOS Streaming API

      # excerpt from a larger function; file, wos, policy and CHUNKSIZE
      # are defined by the surrounding code
      f = open(file, 'rb')
      puts = wos.CreatePutStream(policy)
      puts.meta['date'] = str(time.time())
      while True:
          data = f.read(CHUNKSIZE)
          if len(data) == 0:
              break
          puts.write(data)
      f.close()
      oid = puts.close()      # close() returns the new object's OID
      return oid
  11. Zlib and WOS

      wosstream = woscluster.CreateGetStream(swath.oid)
      f = open('file.N1', 'wb')
      # 16 + MAX_WBITS makes zlib expect a gzip wrapper (header and trailer)
      z = zlib.decompressobj(16 + zlib.MAX_WBITS)
      while True:
          data = wosstream.read(CHUNKSIZE)
          if len(data) == 0:
              break
          f.write(z.decompress(data))
      f.write(z.flush())
      f.close()
  12. Zlib and WOS

      wosstream = woscluster.CreateGetStream(swath.oid)
      f = open('file.N1', 'wb')
      z = gzip.GzipFile('file.N1', 'r', 0, wosstream)
      while True:
          # fails with 'wosapi.WosGetStream' object has no attribute 'tell'
          data = z.read(CHUNKSIZE)
          if len(data) == 0:
              break
          f.write(data)
      f.close()