Right now the code is set to use four threads. Make that equal to the number of threads set in the configuration file. But for 'small' blocks of memory, limit the max number of threads to a smaller number (like 4). Use a configuration parameter for the size that counts as 'small'. I think 'small' should be 1MB in this case.
Then, change the code so that really small blocks of memory are read as a single block, even though we could read them in parallel. Set that threshold using a configuration parameter too. I think anything under 100KB should just be read in one go.
NB: most of our arrays are not byte arrays, so a 1024 * 1024 array of Float32 is 4MB. The thresholds should be compared against the size in bytes, not the element count.
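
To make the intent concrete, here is a rough Python sketch of the thread-count selection I have in mind. It is only an illustration, not our actual code: the names `ReadConfig`, `threads_for_block`, and all four config fields are made up, the default of 16 threads is a placeholder, and it assumes 1KB = 1024 bytes and that a block of exactly 1MB still counts as 'small'.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ReadConfig:
    """Hypothetical config; field names and defaults are placeholders."""
    num_threads: int = 16                 # thread count from the config file
    small_block_max_threads: int = 4      # cap applied to 'small' blocks
    small_block_bytes: int = 1024 * 1024  # 'small' threshold: 1MB
    single_read_bytes: int = 100 * 1024   # below this, read in one go: 100KB


def threads_for_block(num_elements: int, element_size: int, cfg: ReadConfig) -> int:
    """Return how many threads to use for a block, based on its size in bytes."""
    nbytes = num_elements * element_size  # bytes, not element count
    if nbytes < cfg.single_read_bytes:
        return 1  # really small: a single plain read
    if nbytes <= cfg.small_block_bytes:
        # small: parallel, but capped (and never more than the configured total)
        return min(cfg.num_threads, cfg.small_block_max_threads)
    return cfg.num_threads  # everything else: the configured thread count


cfg = ReadConfig()
float32_size = 4  # bytes per Float32 element
print(threads_for_block(10_000, float32_size, cfg))       # about 39KB
print(threads_for_block(1024 * 1024, 1, cfg))             # 1MB of bytes
print(threads_for_block(1024 * 1024, float32_size, cfg))  # 4MB of Float32
```

With these placeholder defaults, the 1024 * 1024 Float32 array lands in the full-thread-count branch because it is 4MB, while the same number of single-byte elements is capped at 4 threads.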