In many fields of science standard data formats are used. Here we describe packages which implement a few of these standards, thus allowing Python users ready access to such data.
NetCDF is a format developed by the Unidata project of the University Corporation for Atmospheric Research. Maintenance and further development has been taken over by the National Center for Atmospheric Research (NCAR). NetCDF is used in a wide variety of fields outside of the atmospheric sciences due to its simplicity, robustness, and ability do provide information about the contents of a data file within the file itself, so called metadata. NetCDF3 maintains the general structure and philosophy of NetCDF present from its creation. NetCDF4 constitutes a major departure from the initial philosophy, and is described separately below.
NetCDF stores data in the form of multidimensional arrays. Each array is called a variable. Associated with each variable is information about its dimensionality, size along each axis, the data type of the array elements, and attributes of the variable, including such things as units and missing values which indicate when data do not exist for particular elements. Information is also stored separately about the dimensions of variables existing in the file. Not all variables have to have the same number of dimensions – variables can even be zero-dimensional, or scalars. Dimensions are fixed in length and cannot be changed in an existing NetCDF file with the exception of one, called the unlimited dimension. This is provided so that a file can be expanded along that dimension. Typically this dimension is used to provide, for example, records of variables at successive times.
In addition to the above, the NetCDF file may contain global attributes which provide overall information about the data in the file. A prime example is the history attribute, which contains a record of what has been done to the file. Attributes can take the form of numbers or strings, or tuples of these data types. NetCDF imposes no preconditions on the type or number of attributes, though useful conventions have developed in particular scientific fields.
As our first example, here is a Python script which creates an example of a two-dimensional array, which is plotted and then written to a NetCDF file:
#!/usr/bin/python
# Create a sample netcdf file.
from numpy import *
import matplotlib.pyplot as plt
# Use the netCDF3 package of Jeffrey Whitaker. This package is
# similar to the Scientific Python netcdf package, but works with
# NumPy instead of the older (and generally obsolete) Numeric package.
from netCDF3 import *
# Alternatively, use Roberto De Almeida's Pupynere NetCDF package,
# which is written totally in Python. This form of import changes
# the class name to be consistent with that of Whitaker's package.
#from pupynere import NetCDFFile as Dataset
# Create some interesting fields.
x = linspace(0,20,21)
y = linspace(0,16,17)
# A mis-feature of meshgrid is that it assumes that the most rapidly
# varying index in accessing sequential positions in memory is the
# first one. However, consistent with the C-array-type default of
# NumPy, the last index should vary most rapidly. The netCDF package
# follows the default NumPy convention, so we reverse x and y here to
# limit the spread of this contagion.
(Y,X) = meshgrid(y,x)
a = 3*cos(6.28318*X/20)*sin(3.14159*Y/16)
# Plot our new function to see what it looks like. Notice that we
# have to take the transpose of a to make it plot correctly, in
# agreement with the above comment. This wouldn't need to be done
# if we had not reversed the order of x and y in the call to meshgrid.
# However, it would have then caused problems later in the creation
# of the NetCDF file.
c = plt.contourf(x,y,a.transpose())
ax = plt.axis([0,20,0,16])
b = plt.colorbar(c,orientation='vertical')
xl = plt.xlabel("x")
yl = plt.ylabel("y")
plt.show()
# Write them to a netcdf file.
f = Dataset('create_example_nc.nc','w')
setattr(f,'history','sample nc file from Python')
dx = f.createDimension('x',21)
dy = f.createDimension('y',17)
xvar = f.createVariable('x','f4',('x',))
yvar = f.createVariable('y','f4',('y',))
avar = f.createVariable('a','f4',('x','y'))
svar = f.createVariable('s','i4',())
# The indexing on the left sides of the assignments below is needed to
# tell Python that the netcdf variables should be assigned the values
# of the corresponding arrays, not just redefined out of existence!
xvar[:] = x[:]
yvar[:] = y[:]
avar[:,:] = a[:]
# A different technique is used to assign a value to a scalar variable,
# since the array slicing technique doesn't work in that case.
svar.assignValue(42)
# A convention used in many applications of NetCDF is that a missing
# value be specified, indicating the existence of missing data in grid
# cells this occurs.
setattr(xvar,'missing_value',1.e30)
setattr(yvar,'missing_value',1.e30)
setattr(avar,'missing_value',1.e30)
setattr(svar,'missing_value',99999)
f.close()
Here is a dump of the header of the resulting NetCDF file using the utility ncdump, which is included with NCAR’s NetCDF package:
$ ncdump -h create_example_nc.nc
netcdf create_example_nc {
dimensions:
x = 21 ;
y = 17 ;
variables:
float x(x) ;
x:missing_value = 1.e+30 ;
float y(y) ;
y:missing_value = 1.e+30 ;
float a(x, y) ;
a:missing_value = 1.e+30 ;
int s ;
s:missing_value = 99999 ;
// global attributes:
:history = "sample nc file from Python" ;
}
The plot of the variable "a" is shown in figure 5.1.
The above Python script is largely self-explanatory and covers most of what one needs to do to write a new NetCDF file from scratch. If one desires to make one of the dimensions the unlimited dimension, assign it a length of 0 or None in the f.createDimension variable. If one of the variables is a scalar, the assignValue method must be used instead of the direct assignment.
Reading a NetCDF file takes some different techniques. After importing NumPy and netCDF3 in the usual way, we open an NetCDF file for reading:
>>> from numpy import *
>>> from netCDF3 import *
>>> f = Dataset('create_example_nc.nc')
>>>
The names of the global attributes are extracted using the ncattrs method applied to the file instance. These names are attributes of the instance and can be extracted using standard Python syntax:
>>> f.ncattrs() ['history'] >>> f.history 'sample nc file from Python' >>>
The dimensions of a NetCDF file are represented in a dictionary and the length of each dimension can be extracted with the len() function:
>>> f.dimensions.keys() ['y', 'x'] >>> len(f.dimensions['y']) 17 >>>
The variables of a NetCDF file are also represented in a dictionary:
>>> f.variables.keys() ['y', 'x', 's', 'a'] >>>
We can extract the variables using standard dictionary notation:
>>> xvar = f.variables['x'] >>> yvar = f.variables['y'] >>> avar = f.variables['a'] >>> svar = f.variables['s'] >>>
Various important aspects of the variables (including the attributes) can be extracted:
>>> avar.ncattrs()
['missing_value']
>>> avar.missing_value
1e+30
>>> avar.dimensions
('x', 'y')
>>> avar.size
357
>>> avar.shape
(21, 17)
>>>
The shape attribute can be used to define NumPy variables to receive the actual data:
>>> x = zeros(xvar.shape) >>> y = zeros(yvar.shape) >>> a = zeros(avar.shape) >>> a.shape (21, 17) >>>
The data for arrays can then be transferred by simple assignment with slicing. Scalar variable data can be transferred by the getValue() method:
>>> x[:] = xvar[:] >>> y[:] = yvar[:] >>> a[:] = avar[:] >>> svar = f.variables['s'] >>> s = svar.getValue() >>> s array([42]) >>>
Plotting shows that we end up with the same array a with which
we started.
The NetCDF variable instances such as xvar, yvar, and
avar are technically not NumPy arrays, but they have the
indexing properties of this data type. This means that transfer of
only part of the data from the variable instance to a smaller NumPy
array can be done using normal NumPy slicing:
>>> b = zeros(yvar.shape)
>>> b.shape
(17,)
>>> b[:] = avar[4,:]
>>> b
array([ 0.00000000e+00, 1.80859119e-01, 3.54767919e-01,
5.15043259e-01, 6.55525744e-01, 7.70816803e-01,
8.56485903e-01, 9.09240723e-01, 9.27053988e-01,
9.09241199e-01, 8.56486797e-01, 7.70818174e-01,
6.55527472e-01, 5.15045285e-01, 3.54770213e-01,
1.80861533e-01, 2.46002105e-06])
>>>
In the above example, the NumPy array b is created with the
shape of the y variable and a slice in the y direction is
transferred to b.
NetCDF version 4 is still considered to be somewhat experimental, and so we will delay investigating its capabilities.
Candis is a file format for gridded numerical data. It adopts the “filter” philosophy of Unix and Linux, in which the desired result is generated by the application of small, specialized applications linked together by the Unix pipe mechanism. Candis works on Unix, Linux, and Mac OS-X systems. Windows in any form is problematic.
The Pycandis package allows Candis files to be read and written from Python programs. In Pycandis, a Candis object defines the representation of a Candis file inside Python. Candis and ordinary Python methods are used to create and access Candis files.
Conceptually, a Candis file consists of 4 parts: (1) a "comment" section, which contains arbitrary text describing typically what has been done to the file; (2) a "parameters" section containing scalar variable-value pairs; (3) a "static fields" section containing data fields of which only a single instance is needed; and (4) a "variable fields" section containing data fields for which a sequence of one or more instances are defined in the Candis format. These instances represent, for instance, the values of the variable fields at successive times or possibly successive levels in numerical model output. The meaning of multiple instances is left up to the user. The Python representation of a Candis file contains a single instance of the variable fields at any given time, and methods (described below) are available to scan sequentially through successive instances. This division into successive instances allows very large Candis files to be defined without having to read the entire file into (virtual) memory at one time.
A "field" contains a zero to four dimensional array of 32 bit floating point numbers as well as information describing the size and shape of the array. Also included is an ASCII name for the field as well as the names of the dimensions associated with the array axes. Each of the dimensions is associated with a one-dimensional static field called an "index field" which gives the values of the variable associated with that axis. There can be no more than 4 dimensions in a Candis file.
Parameters are generally used to document values used in the creation or modification of the file. There are two additional uses: (1) The "bad" and "badlim" parameters are positive floating point numbers. "badlim" specifies the maximum absolute value a field element can take on and still be considered "good" data. "bad" is the preferred value used to indicate bad or missing data elements, though any value greater than "badlim" will do. If missing, default values are 1e30 for bad and 0.999e30 for badlim. The values of the bad and badlim parameters apply globally to all fields. (2) Redundant information about the starting and step values for index fields is present for historical reasons and is represented as parameters. The normal end user doesn’t have to worry about these parameters, as Pycandis takes care of them automatically.
Within Python, a Candis object represents the comments as a list of strings, the parameters as a dictionary of name-value pairs, and the static and variable fields as dictionaries of name-field pairs.
Each Field object contains field data in the form of a NumPy array in C format, i. e., the last dimension iterates most rapidly as one steps through memory. It also contains information about the field, most notably the size and shape of the array and the dimension names associated with each array axis. (There are a few other minor pieces of information present for historical reasons which are of no concern to the end user.)
The following Python program shows how to write Candis files using the Pycandis package:
#!/usr/bin/python
# Create a sample Candis file.
# Use the Pycandis package of Max Brister.
from numpy import *
from pycandis import *
# Create index field data.
nx = 21
ny = 17
x = linspace(0,20,nx)
y = linspace(0,16,ny)
# A mis-feature of meshgrid is that it assumes that the most rapidly
# varying index in accessing sequential positions in memory is the
# first one. However, consistent with the C-array-type default of
# NumPy, the last index should vary most rapidly. The Candis package
# follows the default NumPy convention, so we reverse x and y here to
# limit the spread of this contagion.
(Y,X) = meshgrid(y,x)
# Generate a Candis file with 2 variable slices.
c = Candis()
# Comments.
c.comments.append('sample Candis file from Python')
c.comments.append('this has 2 variable slices')
# A random parameter.
c.params['answer'] = 42
# Set up static index fields with contents added here.
c.sfields['x'] = Field(['x'], x)
c.sfields['y'] = Field(['y'], y)
# Set up zeroed variable fields with actual contents added later.
c.vfields['a'] = Field(['x', 'y'], zeros([nx,ny]))
# For zero dimensional field (a scalar), Pycandis knows that a
# scalar value rather than an array is the second argument.
c.vfields['slice'] = Field([], 0)
# Create a first variable slice after creating field a.
a = 3*cos(6.28318*X/20)*sin(3.14159*Y/16)
c.vfields['a'].data = a
# Note the special treatment of a scalar variable.
c.vfields['slice'].data.flat = 1
# Write header, static slice and first variable slice.
c.write('create_example_cdf.cdf')
# Create a second variable slice.
a = 3*sin(6.28318*X/20)*cos(3.14159*Y/16)
c.vfields['a'].data = a
c.vfields['slice'].data.flat = 2
# Write second variable slice. (Pycandis remembers that header
# and static slice have already been written.)
c.write('create_example_cdf.cdf')
# Finish
c.close()
To write to the standard output, omit the file name in the call to
c.write().
To read a Candis file, first import the needed packages and generate an empty Candis object:
>>> from numpy import * >>> from pycandis import * >>> c = Candis() >>>
We will look at the Candis file generated by the Python program illustrated in the previous section. Let’s read in the header, the static slice, and the first variable slice of this file:
>>> c.read('create_example_cdf.cdf')
True
>>>
The returned “True” indicates that the operation succeeded.
We can look at the comments,
>>> c.comments ['sample Candis file from Python', 'this has 2 variable slices'] >>>
the parameter names,
>>> c.params.keys() ['badlim', 'bad', 'dx', 'dy', 'answer', 'y0', 'x0'] >>>
and the static and variable file names:
>>> c.sfields.keys() ['y', 'x'] >>> c.vfields.keys() ['a', 'slice'] >>>
The values stored in parameters and fields may be examined:
>>> c.params['answer']
42.0
>>> c.sfields['y'].data
array([ 0., 1., 2., 3., 4., 5., 6., 7., 8., 9., 10.,
11., 12., 13., 14., 15., 16.], dtype=float32)
>>> c.vfields['a'].dims
['x', 'y']
>>> c.vfields['slice'].data
array(1.0, dtype=float32)
They may also be extracted for further use:
>>> a = c.params['answer'] >>> print a 42.0 >>> y = c.sfields['y'].data >>> print y [ 0. 1. 2. 3. 4. 5. 6. 7. 8. 9. 10. 11. 12. 13. 14. 15. 16.] >>> slice = c.vfields['slice'].data >>> print slice 1.0 >>>
Call the read command again to get the second variable slice:
>>> c.read('create_example_cdf.cdf')
True
>>> slice = c.vfields['slice'].data
>>> print slice
2.0
>>>
Notice what happens when we try to read the third (non-existent) variable slice:
>>> c.read('create_example_cdf.cdf')
False
>>>
If the Candis file of interest has only one variable slice, a single convenience command can be used to read the entire file in one statement:
>>> d = ReadCandis("single_slice.cdf")
>>>
Everything else works as above. (This also works for files with multiple variable slices, but additional “read” statements are required to access variable slices other than the first.)
If it is desired to read a Candis file from the standard input, omit the file name in the read and ReadCandis calls. For example:
>>> d = ReadCandis()
Information on NetCDF may be obtained from the National Center for Atmospheric Research at http://www.unidata.ucar.edu/software/netcdf/. NCAR itself has produced a module allowing Python to access NetCDF files, called PyNIO; http://www.pyngl.ucar.edu/Nio.shtml. This module is well documented and allows the user to access a number of other formats, including GRIB1, GRIB2, and HDF 4. However, this module is tightly connected to a wide array of other data analysis tools produced by NCAR, which makes it rather difficult to install from source code.
The first module to allow Python access to NetCDF files is a part of the Scientific Python package http://www.scipy.org/. The difficulty from our point of view is that this module works with the Python array package Numeric, a predecessor to NumPy. Unfortunately, there are enough incompatibilities between the two array packages to make use of this NetCDF module problematic.
Jeffrey Whitaker of the National Oceanic and Atmospheric Adminstration’s Physical Sciences Division of the Earth System Research Laboratory (http://www.cdc.noaa.gov/) has developed Python packages for both NetCDF3 and NetCDF4; http://code.google.com/p/netcdf4-python/. These packages build and install with little fuss and are at least minimally documented.
Roberto De Almeida (http://dealmeida.net/projects/, a Brazilian oceanographer, has created a package for NetCDF version 3 which is written totally in Python using the Python memory map module. This package also seems to work well. It requires the Python setuptools module.
David Raymond and collaborators in the Physics Department at New Mexico Tech (http://physics.nmt.edu/~raymond/tools.html) wrote the Candis system. Undergraduate student Max Brister wrote the Python package Pycandis.