As a developer, I need access to information about the 'chunks' in an HDF5 file so that I can read those from S3 using HTTP Range-GET.
This story is really about choosing which version of the HDF5 library to use as the basis for our work with A2. Should we use 1.10 or one of the 1.8 libraries?
The basis of the Architecture #2 software is that HDF5 files store the data for a variable in one or more blocks of bytes (i.e., chunks) and each file maintains tables of the locations of those chunks. The location is given as a byte offset from the start of the file, and the size of the chunk is given in bytes. So, for a particular variable, there might be something like:
The 'trick' is that getting this information is hard. The HDF5 library versions we use don't normally make this available, and we don't know if newer versions have an explicit interface for it.
However, Kent Yang wrote code to do this and Elana P @ THG sent it to us (it was developed under task 28 - I think it was '28'). I have that code in a private repo named 'chunks' on the opendap site (https://github.com/OPENDAP/chunks; you need to sign in to see it). I'm treating this as somewhat sensitive information because I know THG views it that way (but at the same time, NASA paid for it...), so we can go ahead and use it, but we should work with THG to the extent we can. Hence the private repo.
In the version of HDF5 in this repo, there are new API calls that we can use to get this information. This task is to use that API to find the chunk sizes and offsets for each variable in an HDF5 file. There are some more tricks/caveats to this task: some variables are stored in a single 'contiguous block' and our software treats this as one chunk. Often arrays are broken up into many chunks and each is separately compressed. Thus, while each chunk may hold, e.g., 4096 bytes of uncompressed data, each takes up less space on disk, and the compressed size differs from chunk to chunk. As a result, we need to know several different values for each chunk that makes up part of an array.
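To make the offset-and-size bookkeeping above concrete, here is a minimal Python sketch. It does not use the HDF5 API from the 'chunks' repo; the ChunkRef record, the layout_chunks helper, and the starting offset of 4016 are all made up for illustration, and zlib stands in for whatever compression filter a real file uses. It shows two things: how a chunk's offset and on-disk size turn into an HTTP Range header, and why separately compressed chunks of the same uncompressed size end up with different on-disk sizes.

```python
import zlib
from dataclasses import dataclass


@dataclass(frozen=True)
class ChunkRef:
    """Hypothetical record for one chunk: where its bytes sit in the file."""
    offset: int   # byte offset from the start of the file
    nbytes: int   # size of the chunk as stored on disk, in bytes

    def range_header(self) -> dict:
        # HTTP byte ranges are inclusive at both ends, hence the -1.
        last = self.offset + self.nbytes - 1
        return {"Range": f"bytes={self.offset}-{last}"}


def layout_chunks(raw_chunks, first_offset):
    """Compress each chunk separately and record where it would land.

    Returns (stored_bytes, refs). Chunks are packed back to back here,
    which is a simplification of a real file layout.
    """
    stored, refs, offset = [], [], first_offset
    for raw in raw_chunks:
        packed = zlib.compress(raw)
        stored.append(packed)
        refs.append(ChunkRef(offset=offset, nbytes=len(packed)))
        offset += len(packed)
    return stored, refs


# Two made-up 4096-byte chunks with different contents.
raw_chunks = [bytes(4096), bytes(range(256)) * 16]
stored, refs = layout_chunks(raw_chunks, first_offset=4016)

for ref, packed in zip(refs, stored):
    # Each chunk inflates back to 4096 bytes but occupies fewer bytes on disk.
    print(ref, ref.range_header(), len(zlib.decompress(packed)))
```

A variable stored as a single contiguous block would simply be one ChunkRef covering the whole block, which matches how our software treats it as one chunk.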
Note: In the following examples, there's some extraneous information (e.g., the UUID and MD5 fields).
Here's an example of a chunked variable that uses compression but stores all its data in one chunk:
Here's an example of a variable with a bunch of chunks. I cut out the middle part since there are many chunks:
Nathan is working on how we're going to change this XML.