PyConZA 2012: "High-performance Computing with Python" by Kevin Colville and Andy Rabagliati (part 2)

Pycon ZA
October 04, 2012

Part 1 of this talk will give an overview of how Python scales to supercomputer-sized problems and a brief introduction to using the Message Passing Interface (MPI) library to scale a Python program to large distributed-memory cluster systems.

Part 2 will cover Python libraries for access to 'big data' -- HDF5, NetCDF, PyTables -- mostly in reference to Earth observation data.
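
As a taste of what those libraries look like, here is a minimal sketch (not from the talk) of reading a variable with the netCDF4 module; the file name 'chlorophyll.nc' and variable name 'chlor_a' are hypothetical:

    from netCDF4 import Dataset

    ds = Dataset('chlorophyll.nc', 'r')      # open an existing NetCDF file
    print ds.variables.keys()                # names of the variables it holds
    chl = ds.variables['chlor_a'][0, :, :]   # one time slice as a NumPy array
    ds.close()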

Python scales from smartphones to supercomputers. There are two pillars to huge computing problems: high-performance computing and massive data.

The fastest and largest high-performance computing (HPC) systems are all distributed-memory cluster systems. The message passing interface (MPI) was designed to allow many processes to communicate efficiently across the high-speed network of a cluster supercomputer and effectively act as a single HPC program on thousands of processors. Python has access to the MPI library through the mpi4py module. An overview of HPC will be followed by an introduction to MPI with examples in Python using the mpi4py module.
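
A minimal sketch (not from the talk) of the mpi4py idiom described above, launched with something like mpirun -np 4 python hello_mpi.py:

    from mpi4py import MPI

    comm = MPI.COMM_WORLD
    rank = comm.Get_rank()               # this process's id in the communicator
    size = comm.Get_size()               # total number of MPI processes

    if rank == 0:
        # rank 0 sends a (pickled) Python object to every other rank
        for dest in range(1, size):
            comm.send({'greeting': 'hello', 'to': dest}, dest=dest, tag=0)
    else:
        msg = comm.recv(source=0, tag=0) # blocking receive from rank 0
        print 'rank %d of %d got %r' % (rank, size, msg)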

In part 2, Andy will show example routines for accessing NASA and ESA Earth observation data -- and routines for storing large files on the CHPC DIRISA data store. This also requires a local database to store the metadata -- in this case PostGIS.

Transcript

  1. Earth Observation Applications
     • Marine Remote Sensing Unit attached to Oceanography dept, University of Cape Town
     • Archive primary data downloaded from ESA and NASA, files between 100 and 700 GB
     • Processed by coastal region around Africa
     • Visualised, and raw data split by region as NetCDF
     • Website at www.afro-sea.org.za
     • Partnering with CHPC for data processing and storage
  2. Data storage systems
     • CHPC DIRISA with 700TB data storage
     • Concurrent access through all 14 nodes
     • Storage policies – mostly to do with replication
       – Local replication, RAID, Offsite
       – Sister disk cluster at CSIR Pretoria
       – SANReN connectivity between the two
     • Our current usage at 100TB including replication
     • No CHPC compute node access at present
  3. PostGIS schema
     • PostGIS database to store spatial, temporal data
     • Also indexed by filename and WOS Object ID
     • Allows geographical search for intersection, distance, area and other specialist geography functions
     • Satellite data stored in Object store
     • Use Django on website
  4. PostGIS schema

      class Swaths(models.Model):
          filename = models.CharField('base file name', max_length=100, db_index=True)
          filepath = models.CharField('file path', max_length=100)
          fileext = models.CharField('file extension', max_length=20)
          shape = models.PolygonField('Border', spatial_index=True, srid=4326, geography=True)
          objects = models.GeoManager()
          source = models.CharField(max_length=100)
          version = models.CharField(max_length=20)
          overpass = models.DateTimeField(db_index=True)
          oid = models.CharField('WOS ObjectId', max_length=40, db_index=True)
          orbit = models.IntegerField(db_index=True, null=True)
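
A geographical search against this model might look like the following sketch (not from the slides), using standard GeoDjango lookups; the bounding box and date range are hypothetical:

      from django.contrib.gis.geos import Polygon

      # lon/lat bounding box off the southern African coast (SRID 4326)
      region = Polygon.from_bbox((15.0, -36.0, 33.0, -28.0))

      swaths = Swaths.objects.filter(
          shape__intersects=region,                       # spatial search
          overpass__range=('2012-01-01', '2012-02-01'),   # temporal search
      ).order_by('overpass')

      for s in swaths:
          print s.filename, s.oid
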
  5. WOS API
     • The WosCluster constructor is used to connect to the cluster. The returned handle is used in all future Put, Get, and similar commands. If all of the nodes in a cluster are configured under the same name in DNS, DNS will provide a round-robin response of different node IP addresses for successive connect statements, thereby achieving some measure of load-balancing.

       wos = WosCluster("cluster-dnsname-or-ipaddr")

     • Only one connection to the cluster per process may be open at one time. Connect is a somewhat expensive operation. It is intended that applications do this once (or infrequently) and then keep their connection open.
  6. WOS API
     • Create a WosObj

       obj = WosObj()
       obj.data = 'lots of bytes of data...'
       obj.meta['a'] = 'b'
       obj.meta['date'] = time.time()

     • Put object, policy - Store an object with replication directives specified by policy. The return value is an object-id (generally referred to as an OID).

       policy = 'replicate'
       oid = wos.put(obj, policy)
  7. WOS API
     • Get oid - Retrieve an object by OID from the cluster

       obj = wos.get(oid)
       print obj.data
       print obj.meta['date']

     • Delete oid - Delete an object from the cluster. Note that although the space will be reclaimed for future use, the previously-used OID will never be returned again.

       wos.delete(oid)
  8. WOS API
     • It is possible to break the Put step into two pieces which can be executed separately: the first step is Reserve, which returns an OID, and the second step is PutOID, which associates an object with that OID. Note that for a given OID, PutOID can only be called once.
     • Reserve policy - Reserve an OID

       policy = 'copy3'
       oid = wos.reserve(policy)

     • PutOID object, oid - Associate an object with a previously-reserved OID

       wos.putoid(obj, oid)
  9. WOS Streaming API
     • The streaming API deals with large objects which would not fit in memory.
     • WosCluster.CreatePutStream policy - Create a WosPutStream instance.
     • The WosPutStream and WosGetStream classes are intended to mimic a file stream in Python as closely as possible.
     • WosPutStream.write data - Data can be any Python binary-safe string; appended to the object.
     • WosPutStream.meta - Meta is a dictionary of string-based key/value pairs, saved with the object.
     • WosPutStream.close()
     • WosCluster.CreateGetStream oid, prefetch_meta - Takes the OID of the object to be retrieved, and an (optional) boolean for metadata.
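
Reading an object back with these calls might look like the following sketch (not from the slides); it assumes WosGetStream.read behaves like file.read, and CHUNKSIZE and the output path are hypothetical:

      CHUNKSIZE = 1024 * 1024              # 1 MB reads
      gets = wos.CreateGetStream(oid)      # stream handle for a stored object
      f = open('swath.raw', 'wb')
      while True:
          data = gets.read(CHUNKSIZE)
          if len(data) == 0:               # empty read signals end of object
              break
          f.write(data)
      f.close()
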
  10. WOS Streaming API

      # excerpt from a larger function; file, wos, policy and CHUNKSIZE
      # are defined by the surrounding code
      f = open(file, 'rb')
      puts = wos.CreatePutStream(policy)
      puts.meta['date'] = str(time.time())
      while True:
          data = f.read(CHUNKSIZE)
          if len(data) == 0:
              break
          puts.write(data)
      f.close()
      oid = puts.close()      # close() returns the new object's OID
      return oid
  11. Zlib and WOS

      wosstream = woscluster.CreateGetStream(swath.oid)
      f = open('file.N1', 'wb')
      # 16 + MAX_WBITS makes zlib expect a gzip wrapper (header and trailer)
      z = zlib.decompressobj(16 + zlib.MAX_WBITS)
      while True:
          data = wosstream.read(CHUNKSIZE)
          if len(data) == 0:
              break
          f.write(z.decompress(data))
      f.write(z.flush())
      f.close()
  12. Zlib and WOS

      wosstream = woscluster.CreateGetStream(swath.oid)
      f = open('file.N1', 'wb')
      z = gzip.GzipFile('file.N1', 'r', 0, wosstream)
      while True:
          # fails with 'wosapi.WosGetStream' object has no attribute 'tell'
          data = z.read(CHUNKSIZE)
          if len(data) == 0:
              break
          f.write(data)
      f.close()