During baseline performance testing of the dmr++ machinery in various parts of Hyrax, we noticed that the server's response times indicated a problem when we had a joinNew aggregation of dmr++ files whose binary objects were held on a remote system (in this case S3). The response times would be consistent for a while (~23 requests) and then one response would take a very long time to produce. After the slow response, the next request would fail. (This is a typical pattern when the BES listener dies and the OLFS discovers the problem while servicing the next request.) After this the pattern would begin again. Based on the testing I have already done (see work log below), it's pretty clear that there is a memory leak in the interaction between the ncml_handler and the dmrpp_module. The long response time preceding the failed request corresponds to kernel swap activity dominating the process list, as observed with top.
The mission: Find and fix this memory leak.
If the problem lies in D4Group::transform_to_dap2(), then once we fix it we should review the API and reduce the copying.
We need to look at the pattern of 'Get DMR, then get DDS, answer request' and see how that might be changed.
What is the cost of changing the NCML code to use DMR objects in addition to DDS objects?
How should we change the implementation of the transform_to_dap2() functionality so that it does not leak memory? Currently, the DMR and DDS objects each assume sole ownership of the objects they contain (even though they hold those objects by pointer and so could share them). That means the add_var() --> add_var_nocopy() change will break once we fix a second issue, which is that the new DDS is almost certainly not being deleted. Once we do delete it, the contained variables will be deleted twice: once when the DDS is destroyed and again when the DMR is destroyed.
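The ownership problem described above can be illustrated with a small model. This is only a sketch, not libdap code: Var, Container, Counts, and copy_transform_counts() are hypothetical stand-ins, with Container playing the role of both the DMR and the DDS (each deletes everything it holds). It shows the copy-based transform: the copies survive the DMR and are reclaimed only if the DDS itself is deleted. The add_var_nocopy() variant is deliberately not executed, because placing the same pointer in two owning containers would be a double delete.

```cpp
#include <cassert>
#include <vector>

// Hypothetical stand-in for a libdap variable; counts live instances.
struct Var {
    static int live;
    Var() { ++live; }
    Var(const Var &) { ++live; }
    ~Var() { --live; }
};
int Var::live = 0;

// Hypothetical stand-in for a DMR or a DDS: it assumes sole ownership
// and deletes every variable it holds when it is destroyed.
class Container {
    std::vector<Var *> vars_;
public:
    Container() = default;
    Container(const Container &) = delete;
    Container &operator=(const Container &) = delete;
    ~Container() { for (Var *v : vars_) delete v; }

    void add_var(const Var *v) { vars_.push_back(new Var(*v)); } // stores a copy
    void add_var_nocopy(Var *v) { vars_.push_back(v); }          // adopts the pointer
    const std::vector<Var *> &vars() const { return vars_; }
};

struct Counts {
    int while_dds_alive;   // variables that leak if the DDS is never deleted
    int after_dds_deleted; // variables left once the DDS is deleted
};

// Models 'build a DDS from a DMR by copying each variable'.
Counts copy_transform_counts() {
    const int before = Var::live;
    Container *dds = new Container;
    {
        Container dmr;
        dmr.add_var_nocopy(new Var);
        dmr.add_var_nocopy(new Var);
        for (const Var *v : dmr.vars()) dds->add_var(v); // one copy per variable
        assert(Var::live - before == 4);                 // 2 originals + 2 copies
        // Using dds->add_var_nocopy(v) here instead would make both containers
        // delete the same Var objects: a double delete.
    } // dmr destroyed here; its two originals are freed
    Counts c;
    c.while_dds_alive = Var::live - before;   // the two copies are still alive
    delete dds;
    c.after_dds_deleted = Var::live - before; // nothing left
    return c;
}
```

In this model, skipping the final delete leaves exactly the copied variables allocated, which matches the "new DDS is almost certainly not being deleted" suspicion above.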
Here are some ideas:
Modify the DMR so that it provides a 'DDS interface'.
Modify the DMR and DDS so that they use shared_ptr<>.
Go back to the copy semantics (but check the transform_to_dap2() code to make sure it only makes one copy...) and then arrange to correctly delete the DDS.
Fix the transform_to_dap2() code, make sure the DDS is deleted, and modify the NCML code to use DMRs.
Migrate to DAP4 inside the server and use a post-processing step to build DAP2 responses.
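As a rough illustration of the shared_ptr<> idea from the list above, here is a minimal sketch. It is not the libdap API: SharedVar, SharedContainer, to_dap2(), and shared_transform_counts() are hypothetical names invented for this example. The point is that when both containers hold std::shared_ptr, the transform copies no variables, and each variable is destroyed exactly once, when the last container holding it goes away.

```cpp
#include <memory>
#include <utility>
#include <vector>

// Hypothetical stand-in for a libdap variable; counts live instances.
struct SharedVar {
    static int live;
    SharedVar() { ++live; }
    SharedVar(const SharedVar &) { ++live; }
    ~SharedVar() { --live; }
};
int SharedVar::live = 0;

// Hypothetical stand-in for a DMR or a DDS that shares its variables.
class SharedContainer {
    std::vector<std::shared_ptr<SharedVar>> vars_;
public:
    void add_var(std::shared_ptr<SharedVar> v) { vars_.push_back(std::move(v)); }
    const std::vector<std::shared_ptr<SharedVar>> &vars() const { return vars_; }
};

// Models transform_to_dap2(): the result shares the source's variables.
SharedContainer to_dap2(const SharedContainer &dmr) {
    SharedContainer dds;
    for (const auto &v : dmr.vars()) dds.add_var(v); // copies the pointer only
    return dds;
}

struct SharedCounts {
    int while_shared;    // live variables while DMR and DDS both exist
    int after_dmr_gone;  // live variables after only the DMR is destroyed
    int after_both_gone; // live variables after the DDS is destroyed too
};

SharedCounts shared_transform_counts() {
    const int before = SharedVar::live;
    SharedCounts c{};
    {
        SharedContainer dds;
        {
            SharedContainer dmr;
            dmr.add_var(std::make_shared<SharedVar>());
            dmr.add_var(std::make_shared<SharedVar>());
            dds = to_dap2(dmr);
            c.while_shared = SharedVar::live - before; // no extra copies made
        } // dmr destroyed; the variables survive because dds still holds them
        c.after_dmr_gone = SharedVar::live - before;
    } // dds destroyed; each variable is released exactly once
    c.after_both_gone = SharedVar::live - before;
    return c;
}
```

The trade-off is that this changes the ownership contract of both containers, so every caller that currently passes or stores raw variable pointers would need to be reviewed.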
The memory leaks detected in the DMR++ code have been removed. We used tests in libdap (running libdap4/unit-tests/DmrToDap2Test with valgrind on CentOS7) and ran both NcML aggregations and DMR++ aggregations of AIRS data under valgrind (using besstandalone as the executable).
Persistent leaks remain in the libxml2 code, and we are working on them. That code leaks 256 bytes every time the parser is called to parse a document. The leak originates in xmlCreatePushParserCtxt.