h5tools Package
h52np Module
Read HDF5 files efficiently.
Includes functions useful for merging sorted datasets.
Some code is shared by H52NP and NP2H5; it could be factored into a common superclass (e.g. an optionally_h5_context_manager) implementing __init__, __enter__ and __exit__, accepting either a filename or a file handle, and opening and closing the file only when a filename was passed.
Also, the functionality specific to sorted datasets could be moved to a subclass.
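A minimal sketch of what such a superclass could look like, using h5py (the class and attribute names below are illustrative assumptions, not part of the package)::

    import h5py

    class OptionallyManagedH5File(object):
        """Own the HDF5 file only if given a filename.

        If `h5file` is a filename, the file is opened in __enter__ and
        closed in __exit__; if it is an already open h5py.File, it is
        used as is and left open on exit.
        """

        def __init__(self, h5file, mode='r'):
            self._owns_file = isinstance(h5file, str)
            self._h5file = h5file
            self._mode = mode
            self.file = None

        def __enter__(self):
            if self._owns_file:
                self.file = h5py.File(self._h5file, self._mode)
            else:
                self.file = self._h5file
            return self

        def __exit__(self, exc_type, exc_value, traceback):
            if self._owns_file:
                self.file.close()

H52NP and NP2H5 would then only add their read and write buffering on top of this, and the sorted-dataset utilities could live in a further subclass.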
h5_handler Module
Sort the rows of several two-dimensional numeric datasets (each possibly with just one column) according to a numeric key stored in a two-dimensional key dataset with a single column (the first dimension of all datasets involved must match). The result replaces the original datasets. The buffer size is given in kilobytes.
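The intended semantics, illustrated in memory with numpy (the arrays below are made-up examples; the module performs the same reordering out of core on HDF5 datasets)::

    import numpy as np

    # a single-column key dataset and two data datasets sharing
    # the same first dimension
    key = np.array([[3.], [1.], [2.]])
    data_a = np.array([[10., 11.], [20., 21.], [30., 31.]])
    data_b = np.array([[0.5], [0.6], [0.7]])

    # rows of every dataset are reordered according to the key column
    order = np.argsort(key[:, 0], kind='mergesort')
    data_a = data_a[order]  # rows now in key order 1., 2., 3.
    data_b = data_b[order]
    key = key[order]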
Two things need improvement:
the backup solution is too slow for big files
the case of very small files should be handled nicely by using an internal (in-memory) sort
To gain time, the sort could be parallelized, although it is not clear how that would fit with the merging part. Cythonizing the 'read chunk' part might also help efficiency when there are many chunks.
A function should also be written to determine buffer_size from the available RAM and the size of the file to be sorted: aim for about 30 chunks, or as few chunks as possible without exhausting the RAM, unless the whole file fits in memory, in which case an internal sort should be done (see the sketch below).
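A possible heuristic for such a function (a sketch only; the parameters and the 50% RAM budget are assumptions, and the target of roughly 30 chunks comes from the note above)::

    def guess_buffer_size(file_size, available_ram, target_chunks=30):
        """Choose a buffer size in kilobytes for the external sort.

        `file_size` and `available_ram` are in bytes. Returns None if
        the whole file fits comfortably in memory, meaning an internal
        (in-memory) sort should be used instead.
        """
        ram_budget = 0.5 * available_ram  # assumed safety margin
        if file_size <= ram_budget:
            return None  # small file: sort it entirely in memory
        # aim for ~target_chunks chunks, capping the buffer at the RAM
        # budget (which then yields more, smaller chunks)
        buffer_bytes = min(file_size / float(target_chunks), ram_budget)
        return buffer_bytes / 1024.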
h5io Module
np2h5 Module
Class for efficiently writing to disk (in a dataset of an HDF5 file) simple two-dimensional numpy arrays that are generated incrementally along their first dimension. It uses buffers to avoid many small I/O operations.
It needs to be used within a 'with' statement, so as to handle buffer flushing and the opening and closing of the underlying HDF5 file smoothly.
The buffer size should be chosen according to a speed/memory trade-off; due to cache effects there is probably an optimal size.
The size of the dataset to be written must be known in advance, except when overwriting an existing dataset. Not writing exactly the expected amount of data causes an exception to be thrown, unless the fixed_size option was set to False when the dataset was added.
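A hedged usage sketch (the import path, method names and signatures below are assumptions based on the description above, not the verified API)::

    import numpy as np
    from h5tools.np2h5 import NP2H5  # assumed import path

    n_rows, n_columns = 10000, 3
    with NP2H5('features.h5') as writer:
        # the total number of rows is declared up front; with
        # fixed_size=False it could be left open-ended
        out = writer.add_dataset('group', 'data', n_rows=n_rows,
                                 n_columns=n_columns, fixed_size=True)
        for start in range(0, n_rows, 1000):
            # chunks generated incrementally along the first dimension
            # are accumulated in a buffer and flushed in larger writes
            out.write(np.random.rand(1000, n_columns))
    # leaving the 'with' block flushes remaining buffers and closes the file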