h5tools Package

h52np Module

Read HDF5 files efficiently.

Includes functions useful for merging sorted datasets.
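As an in-memory analogue of that merging, Python's `heapq.merge` lazily combines already-sorted sequences without loading them all at once; the function below is purely illustrative and not part of this module:

```python
import heapq

def merge_sorted_streams(*streams):
    # Lazily merge several already-sorted iterables into one sorted
    # stream, consuming each input incrementally -- the same idea as
    # merging sorted HDF5 datasets chunk by chunk.
    return heapq.merge(*streams)
```

For example, `list(merge_sorted_streams([1, 4], [2, 3]))` yields `[1, 2, 3, 4]`.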

Some code is shared by H52NP and NP2H5; it could be factored into a superclass, say optionally_h5_context_manager, implementing __init__, __enter__ and __exit__ so that either a filename or a file handle can be passed, the file being managed by the context manager only when a filename is passed.

The functionality specific to sorted datasets could also be moved to a subclass.
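The proposed refactoring might look like the following sketch (the class name and attributes are assumptions, and a plain `open` stands in for `h5py.File`):

```python
class OptionallyOpening:
    # Hypothetical sketch of the proposed superclass: accept either a
    # filename or an already-open file handle, and own (open/close)
    # the file only when a filename was passed.
    def __init__(self, file_or_name, mode='r'):
        self._owns_file = isinstance(file_or_name, str)
        self._file_or_name = file_or_name
        self._mode = mode
        self.file = None

    def __enter__(self):
        if self._owns_file:
            self.file = open(self._file_or_name, self._mode)
        else:
            # Caller keeps responsibility for closing its own handle.
            self.file = self._file_or_name
        return self

    def __exit__(self, exc_type, exc_value, traceback):
        if self._owns_file:
            self.file.close()
        return False
```

With this design, the same class works whether the caller hands over a path or an open file, and only closes what it opened itself.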

class ABXpy.h5tools.h52np.H52NP(h5file)[source]

Bases: object

add_dataset(group, dataset, buf_size=100, minimum_occupied_portion=0.25)[source]
add_subdataset(group, dataset, buf_size=100, minimum_occupied_portion=0.25, indexes=None)[source]
class ABXpy.h5tools.h52np.H52NPbuffer(parent, group, dataset, buf_size, minimum_occupied_portion)[source]

Bases: object

class ABXpy.h5tools.h52np.H5dataset2NPbuffer(parent, group, dataset, buf_size, minimum_occupied_portion, indexes=None)[source]

Bases: ABXpy.h5tools.h52np.H52NPbuffer

Augmentation of H52NPbuffer that reads from a subdataset selected by indexes.


h5_handler Module

Sort the rows of several two-dimensional numeric datasets (possibly with just one column) according to a numeric key stored in a two-dimensional key dataset with a single column (the first dimension of all datasets involved must match). The result replaces the original datasets. Buffer sizes are in kilobytes.
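In memory, the operation amounts to the following, where numpy stands in for the chunked, on-disk version:

```python
import numpy as np

# One-column key dataset and two data datasets sharing the same
# first dimension.
key = np.array([[3], [1], [2]])
data_a = np.array([[30, 31], [10, 11], [20, 21]])
data_b = np.array([[300], [100], [200]])

# Permutation that sorts the key, applied to every dataset's rows.
order = np.argsort(key[:, 0], kind='stable')
sorted_a = data_a[order]
sorted_b = data_b[order]
```

The on-disk version does the same reordering, but by sorting chunks externally and merging them, since the datasets may not fit in memory.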

Two things need improvement:

  • backup solution is too slow for big files

  • the case of very small files should be handled nicely by using internal sort

To save time, the sort could be parallelized, although it is not clear how that would interact with the merging part. Cythonizing the 'read chunk' part might also improve efficiency when there are many chunks.

A function should also be written to determine buffer_size from the amount of available RAM and the size of the file to be sorted: aim for 30 chunks, or the fewest possible without exhausting the RAM; if the file can be loaded in memory as a whole, do an internal sort instead.
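That heuristic might be sketched as follows (the function name, treating the 30-chunk target as a parameter, and working in kilobytes are all assumptions):

```python
def choose_buffer_size(file_size_kb, available_ram_kb, target_chunks=30):
    # Whole file fits in RAM: sort internally in a single pass.
    if file_size_kb <= available_ram_kb:
        return file_size_kb
    # Otherwise aim for about `target_chunks` chunks, but never ask
    # for a buffer larger than the available RAM.
    return min(file_size_kb // target_chunks + 1, available_ram_kb)
```

For instance, a 3000 kB file with 1000 kB of RAM would get a buffer of roughly 101 kB (about 30 chunks), while a 600000 kB file would be capped at the full 1000 kB of RAM.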

class ABXpy.h5tools.h5_handler.H5Handler(h5file, keygroup, keyset, groups=None, datasets=None)[source]

Bases: object

extract_chunk(i_start, i_end, chunk_id)[source]
sort(buffer_size=1000, o_buffer_size=1000, tmpdir=None)[source]
class ABXpy.h5tools.h5_handler.H5TMP(tmpdir=None)[source]

Bases: object


h5io Module

class ABXpy.h5tools.h5io.H5IO(filename, datasets=None, indexes=None, fused=None, group='/')[source]

Bases: object

write(data, append=True, iterate=False, indexed=False)[source]

np2h5 Module

Class for efficiently writing to disk (in a dataset of an HDF5 file) simple two-dimensional numpy arrays that are incrementally generated along the first dimension. It uses buffers to avoid many small I/O operations.

It must be used within a 'with' statement, so that buffer flushing and the opening and closing of the underlying HDF5 file are handled smoothly.

Buffer size should be chosen according to a speed/memory trade-off; due to cache effects there is probably an optimal size.

The size of the dataset to be written must be known in advance, except when overwriting an existing dataset. Not writing exactly the expected amount of data raises an Exception, unless the fixed_size option was set to False when adding the dataset.
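The buffering strategy itself can be sketched independently of HDF5 (the class below is illustrative, not the actual NP2H5 API; a plain list stands in for the on-disk dataset):

```python
import numpy as np

class BufferedWriter:
    # Illustrative sketch: accumulate rows generated incrementally
    # along the first dimension and flush them in large blocks,
    # avoiding many small writes to the backing store.
    def __init__(self, store, n_columns, buf_rows):
        self.store = store  # list standing in for an HDF5 dataset
        self.buf = np.empty((buf_rows, n_columns))
        self.filled = 0

    def write(self, block):
        for row in np.atleast_2d(block):
            self.buf[self.filled] = row
            self.filled += 1
            if self.filled == len(self.buf):
                self.flush()

    def flush(self):
        # Write out only the occupied part of the buffer.
        if self.filled:
            self.store.append(self.buf[:self.filled].copy())
            self.filled = 0
```

A final flush (here explicit, handled by __exit__ in the real class) is what makes the 'with' statement necessary: without it the tail of the data would stay in the buffer.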

class ABXpy.h5tools.np2h5.NP2H5(h5file)[source]

Bases: object

add_dataset(group, dataset, n_rows=0, n_columns=None, chunk_size=10, buf_size=100, item_type=<Mock>, overwrite=False, fixed_size=True)[source]
class ABXpy.h5tools.np2h5.NP2H5buffer(parent, group, dataset, n_rows, n_columns, chunk_size, buf_size, item_type, overwrite, fixed_size)[source]

Bases: object

ABXpy.h5tools.np2h5.nb_lines(item_size, n_columns, size_in_mem)[source]
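Judging from its signature, nb_lines presumably computes how many rows of a given item type fit in a given amount of memory; a hypothetical re-implementation (this is an assumption, not the actual source) would be:

```python
def nb_lines(item_size, n_columns, size_in_mem):
    # Assumed semantics: number of rows of `n_columns` items, each of
    # `item_size` bytes, that fit in `size_in_mem` bytes.
    return size_in_mem // (item_size * n_columns)
```

Such a helper would be the natural way to turn a memory budget into a row count for the buffers above.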