More performant handling of contiguous data for the DMR++ handler
HDF5 files store the data for variables in two ways (slightly untrue - it uses more, but for our purposes in the DMR++ code, there are two forms variables can take). The first is as a series of chunks, where each chunk is individually compressed. Our DMR++ software processes these by determining which chunks should be read to get the data needed for a given request, pushing those on to a queue and then getting those chunks in parallel using pthreads. The data for each chunk is stored in an object of type Chunk that also contains various information about the chunk (like which algorithms to use to decompress it).
The other storage form is to put all of the data for the variable in a single block of memory. As far as our code is considered, this is a single chunk. We can probably improve the transfer rate for the data for these variables by breaking them into pieces and transferring those in parallel.
Look at the class Chunk in bes/modules/dmrpp_module. The code that reads Arrays is in DmrppArray. Here’s a description of the most important methods in DmrppArray:
The first method reads contiguous array data - that’s what needs to be modified.
The rest of the methods process variables that chunked - find_needed_chunks() looks at a request and determines which chunk needs to be read; read_chunks() reads the chunks in parallel; insert_chunk_unconstrained() forms the whole array by reading data from the Chunk objects and writing it in to the memory for the whole array.
What I think should be done is to divide the single ‘chunk’ of a contiguous array’s data into pieces and read them in parallel. The read_chunks() method does this (not that there’s a BES conf parameter that controls this, so that the BES can be configured to read the chunk serially when it starts up. To do this, we will need to determine the number and size of the memory blocks, rad them and then copy the data from the blocks to the memory for the array. Based on what amazon told us, 8-16 MB is the optimal transfer size, but smaller sizes should be OK. Between 4 and 8 parallel transfers in best for a single EC2 instance (which is where the server runs when it’s reading fro S3).
Given that we know the data are ‘contiguous’ some optimizations can be made. If the chunk size is N bytes, the code and allocate a single N-byte memory block. For the first version of this code, it can split the read up into 4 threads, each of which read N/4 bytes (later on we can get fancy and do things differently depending on the size of the data block and so on). We can use pointers into the single block so that the reads dump the data directly into the block. Contiguous blocks in HDF5 files are often not compressed so once the data are read, the function can return. Of course, if they are compressed, that has to be handled, but a first version can just throw an exception noting that it’s an unhandled case.
This limited first version will give us something to test, which is really what we need. If we can get this running early next week, then we can test it. Thus, it would be good if the implementation supported both the current serial chunk reading and a new parallel chunk reader.
The current code for reading a chunk is in chunk.read_chunk();
I would not try to hack the Chunk class and instead put the changes in DmrppArray. You can look at void DmrppArray::read_contiguous() and use DmrppRequestHandler::d_use_parallel_transfers to choose to either use your new code or do things the current way. This will enable us to test the affect of the new code on performance.
Here’s how the variables with multiple chunks are handled:
The ‘multiple chunks’ code can be adapted for use by the proposed ‘read contiguous data in parallel code.’
One approach is to take the single Chunk (HDF5 contiguous data are processed by our code using a single Chunk object), and use it to make a set of Chunk objects and then push those into a queue and use the above code. Once they complete the read operations, assemble the data. If you can, try to share the memory between the first Chunk that describes the whole data block and the four temporary Chunk objects used for reading.
NB: using 4 for the number of parallel threads is just a first version; we can get fancy later.