PyTables - Getting the most *out* of your data
What is PyTables?
PyTables is a package for managing hierarchical datasets, designed to cope efficiently and easily with extremely large amounts of data. You can download PyTables and use it for free. Documentation, usage examples and presentations are available in the HowToUse section.
PyTables is built on top of the HDF5 library, using the Python language and the NumPy package. It features an object-oriented interface that, combined with C extensions for the performance-critical parts of the code (generated using Cython), makes it a fast yet extremely easy-to-use tool for interactively browsing, processing and searching very large amounts of data. One important feature of PyTables is that it optimizes memory and disk resources so that data takes much less space (especially if on-the-fly compression is used) than in other solutions such as relational or object-oriented databases.
You can have a look at the MainFeatures of PyTables. Also, find more info by reading the PyTables FAQ.
PyTables was created, developed, maintained and supported for a long time by Francesc Alted, with contributions from Ivan Vilata and the community. Nowadays, maintenance is led and coordinated by a group of people. Feel free to join us; your contribution is welcome!
Strong foundations for solid performance
Besides making use of the de facto standard packages for handling large datasets (NumPy for in-memory data and HDF5 for on-disk data), PyTables leverages additional libraries for its internal computations.
The first one is Numexpr, a just-in-time compiler that evaluates expressions in a way that both optimizes CPU usage and avoids in-memory temporaries. Numexpr also supports multiple threads to take advantage of modern multicore computers.
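As a rough illustration of what avoiding temporaries means in practice, the sketch below evaluates the same expression with plain NumPy and with Numexpr. The array names and sizes are made up for the example; it is not taken from the PyTables sources.

```python
import numpy as np
import numexpr as ne

# Illustrative inputs (assumed for this example only).
a = np.arange(1_000_000, dtype=np.float64)
b = np.ones(1_000_000, dtype=np.float64)

# Plain NumPy materializes full-size temporaries for 2*a and 3*b
# before adding them together.
expected = 2 * a + 3 * b

# Optionally cap the number of worker threads Numexpr may use.
ne.set_num_threads(2)

# Numexpr compiles the expression string and evaluates it block by block,
# so no full-size intermediate arrays are needed.
result = ne.evaluate("2 * a + 3 * b", local_dict={"a": a, "b": b})
```

Passing the operands through local_dict is a choice made here for explicitness; Numexpr can also pick them up from the calling scope.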
The other pillar of performance in PyTables is Blosc, a compressor designed to transmit data from memory to cache (and back) at very high speeds. It does so by using the full capabilities of modern CPUs, including their SIMD instruction sets (SSE2 or higher), on any number of available cores.
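A minimal sketch of how Blosc compression can be requested through the tables.Filters class follows. The dataset, file names and compression level are assumptions chosen for illustration, and the size comparison only says something about this very regular data.

```python
import os
import tempfile

import numpy as np
import tables

# Illustrative, highly regular data so that compression has something to gain.
data = np.arange(1_000_000, dtype=np.int64).reshape(1000, 1000)

# Compression level 5 with the Blosc compressor; shuffle is on by default.
blosc_filters = tables.Filters(complevel=5, complib="blosc")

with tempfile.TemporaryDirectory() as tmp:
    plain_path = os.path.join(tmp, "plain.h5")
    packed_path = os.path.join(tmp, "packed.h5")

    # Same data, written once without and once with compression.
    with tables.open_file(plain_path, "w") as h5:
        h5.create_carray("/", "data", obj=data)

    with tables.open_file(packed_path, "w") as h5:
        h5.create_carray("/", "data", obj=data, filters=blosc_filters)

    # Reading back decompresses transparently.
    with tables.open_file(packed_path, "r") as h5:
        roundtrip = h5.root.data[:]

    plain_size = os.path.getsize(plain_path)
    packed_size = os.path.getsize(packed_path)
```

Comparing plain_size and packed_size gives a quick feel for the space saved; real-world ratios depend heavily on the data.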
In short, PyTables takes every measure to reduce memory and disk usage during its operation. This not only allows larger datasets to be handled on the same hardware, but also accelerates I/O operations, the most frequent source of bottlenecks in today's systems (see this article on why this is so).
Querying your data in many different ways, fast
One characteristic that sets PyTables apart from similar tools is its ability to perform extremely fast queries on your tables, which serves your main goal: getting important information *out* of your datasets.
PyTables achieves this via a very flexible and efficient query iterator named Table.where(). This, in combination with OPSI, the powerful indexing engine that comes with PyTables, and the efficiency of underlying tools like NumPy, HDF5, Numexpr and Blosc, makes PyTables one of the fastest and most powerful query engines available.
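The following sketch shows what such a query can look like. The Reading table layout, the synthetic rows and the threshold are hypothetical and exist only for this example; on a table this small the index brings no real speed benefit and is created purely to show the call.

```python
import os
import tempfile

import tables


# Hypothetical table layout used only for this example.
class Reading(tables.IsDescription):
    sensor_id = tables.Int32Col()
    temperature = tables.Float64Col()


with tempfile.TemporaryDirectory() as tmp:
    with tables.open_file(os.path.join(tmp, "readings.h5"), "w") as h5:
        table = h5.create_table("/", "readings", Reading)

        # Fill the table with synthetic rows.
        row = table.row
        for i in range(1000):
            row["sensor_id"] = i % 10
            row["temperature"] = 15.0 + (i % 50) * 0.5
            row.append()
        table.flush()

        # Index the column used in the condition so queries can use it.
        table.cols.temperature.create_index()

        # Table.where() yields matching rows lazily; column names are
        # resolved by PyTables, extra variables are passed via condvars.
        hot_sensors = [
            int(r["sensor_id"])
            for r in table.where("temperature > limit", condvars={"limit": 39.0})
        ]

        # read_where() returns all matches at once as a structured array.
        hot_rows = table.read_where("(temperature > 39.0) & (sensor_id == 9)")
```

The condition is a string so that it can be evaluated by Numexpr rather than row by row in Python.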
Using PyTables as a Computing Kernel
After looking at all the weaponry implemented with the main goal of allowing very fast queries, the PyTables developers realized that the same techniques could be used to accelerate algebraic operations on potentially large vectors and arrays. The tables.Expr class, integrated in PyTables, implements all this machinery to allow efficient vector/array operations, not only for disk-based data but also for memory-based data.
tables.Expr typically outperforms the memmap module available in NumPy, which is another solution for out-of-core computations. What's more, even when evaluating complex expressions on in-memory datasets, the tables.Expr class can be faster than NumPy itself. This is a great achievement because, unlike tables.Expr, NumPy uses an in-core paradigm for performing computations.
For example, when it comes to evaluating polynomials, the plot shows how tables.Expr beats both numpy.memmap and pure NumPy in both speed and disk/memory consumption, especially if Blosc is used. Also, if you are going to use transcendental (trigonometric, exponential, logarithmic...) functions in your expressions, you can optionally make use of Intel's Vector Mathematical Library to accelerate their evaluation.
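As a hedged sketch of the polynomial case, the code below evaluates a cubic on a disk-based array with tables.Expr and writes the result to another disk-based array. The coefficients, array length and file name are assumptions for illustration, and the example makes no claim about timings.

```python
import os
import tempfile

import numpy as np
import tables

# Illustrative input values (assumed for this example only).
x_values = np.linspace(-1.0, 1.0, 100_000)

with tempfile.TemporaryDirectory() as tmp:
    with tables.open_file(os.path.join(tmp, "poly.h5"), "w") as h5:
        # Input and output both live on disk.
        x = h5.create_array("/", "x", obj=x_values)
        y = h5.create_carray(
            "/", "y", atom=tables.Float64Atom(), shape=x_values.shape
        )

        # Build the expression, bind the variable x to the on-disk array,
        # and direct the result into y instead of a new in-memory array.
        expr = tables.Expr(
            "0.25*x**3 + 0.75*x**2 - 1.5*x - 2", uservars={"x": x}
        )
        expr.set_output(y)
        expr.eval()

        computed = y[:]

# Reference result computed in-core with NumPy.
expected = 0.25 * x_values**3 + 0.75 * x_values**2 - 1.5 * x_values - 2
```

Because the output is a disk-based array, the full result never has to fit in memory at once; it is read back here only to compare it with the NumPy reference.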
Where can PyTables be used?
PyTables can be used in any scenario where you need to save and retrieve large amounts of data and provide metadata (that is, data about the actual data) for it. Whether you want to work with large (potentially multidimensional) datasets, save and structure your NumPy datasets, or just provide a categorized structure for some portions of your cluttered RDBMS, give PyTables a try. It works well for storing data from data acquisition systems, sensors in geosciences, simulation software and network data monitoring systems, or as a centralized repository for system logs, to name only a few possible uses.
However, it's important to emphasize that PyTables is not designed to work as a relational database competitor, but rather as a teammate. For example, if you have very large tables in your existing relational database, you can move those tables to PyTables to reduce the burden on your existing database while efficiently keeping those huge tables on disk.
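To make the idea of metadata concrete, here is a small sketch that attaches attributes to a stored array and reads them back after reopening the file. The attribute names (units, sampling_hz) and the values are invented for this example.

```python
import os
import tempfile

import numpy as np
import tables

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "sensor.h5")

    with tables.open_file(path, "w") as h5:
        temps = h5.create_array(
            "/", "temperature", obj=np.array([20.5, 21.0, 21.7])
        )
        # Metadata is stored as attributes on the node itself.
        temps.attrs.units = "celsius"
        temps.attrs.sampling_hz = 10.0

    # The attributes persist in the file alongside the data.
    with tables.open_file(path, "r") as h5:
        node = h5.root.temperature
        units = str(node.attrs.units)
        sampling_hz = float(node.attrs.sampling_hz)
        values = node[:]
```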
Finally, remember that PyTables is Open Source software, so you are free to adapt it to your own needs, and due to its liberal BSD license, you can include it in any software you like (even if it is commercial).
Design goals
PyTables has been designed to fulfill the following requirements:
Allow you to structure your data in a hierarchical way.
Easy to use. It implements the NaturalNaming scheme for allowing convenient access to the data.
All the cells in datasets can be multidimensional entities.
The speed of most I/O operations should be limited only by the underlying I/O subsystem, be it disk or memory.
Enable the end user to save and deal with large datasets with minimum overhead, i.e. each single byte of data on disk has to be represented by one byte plus a small fraction when loaded into memory.
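A few of these goals can be seen together in a short sketch: a group hierarchy, attribute-style access through natural naming, and a table column whose cells are multidimensional. The group, node and column names are hypothetical and chosen only for this example.

```python
import os
import tempfile

import numpy as np
import tables


# Hypothetical record layout used only for this example.
class Event(tables.IsDescription):
    run = tables.Int32Col()
    # Each cell of this column holds a whole 2x3 array.
    image = tables.Float64Col(shape=(2, 3))


with tempfile.TemporaryDirectory() as tmp:
    with tables.open_file(os.path.join(tmp, "hierarchy.h5"), "w") as h5:
        # Hierarchy: /detector/pressure and /detector/events
        detector = h5.create_group("/", "detector")
        h5.create_array(detector, "pressure", obj=np.arange(5.0))
        events = h5.create_table(detector, "events", Event)

        row = events.row
        row["run"] = 1
        row["image"] = np.ones((2, 3))
        row.append()
        events.flush()

        # Natural naming: walk the hierarchy with attribute access.
        pressure_slice = h5.root.detector.pressure[1:3]
        first_image = h5.root.detector.events[0]["image"]
        cell_shape = first_image.shape
```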