TL;DR: Preallocate, use h5py and think :)
Since I stepped into the world of crunching large amounts of data for analysis and machine learning with Python and NumPy, I have had to learn a few tricks to get by. Here are some tips I wish I had known when I started.
Let's say you retrieve a large number of values from a database. You want to process this data somehow and then save it to a NumPy array.
Don't iterate over each entry and add it to the array one by one like this:
import numpy as np

entries = range(1000000)  # 1 million entries
results = np.array([])  # empty array

for entry in entries:
    processed_entry = entry + 5  # do something
    results = np.append(results, [processed_entry])
This example takes roughly 11 seconds on my MacBook Air. I once used such a loop to fill a multi-dimensional array that ended up with over 1 billion entries; building it this way took ~45 minutes. The problem is that np.append allocates a brand-new array and copies all the existing data on every call, which is very time consuming.
Instead, preallocate the array using np.zeros:
entries = range(1000000)  # 1 million entries
results = np.zeros((len(entries),))  # preallocated array

for idx, entry in enumerate(entries):
    processed_entry = entry + 5  # do something
    results[idx] = processed_entry
This finishes in under 1 second, because the array is already sitting in memory at its full size.
You can do this even if you don't know the final array size beforehand: grow the array in chunks with np.resize, which is still much faster than appending element by element.
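Here is a minimal sketch of that chunked-growth idea; the chunk size of 100,000 and the fake data source are just illustrative assumptions:
import numpy as np

CHUNK = 100000
results = np.zeros((CHUNK,))  # start with one chunk
size = 0

for entry in range(1000000):  # pretend we don't know this length up front
    if size >= results.shape[0]:
        # grow by another chunk instead of appending element by element
        results = np.resize(results, (results.shape[0] + CHUNK,))
    results[size] = entry + 5  # do something
    size += 1

results = results[:size]  # trim the unused tail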
Sometimes your arrays get so big they won't fit into RAM anymore. So don't use it ;)
results = np.ones((1000, 1000, 1000, 5))  # 5 billion float64s, ~40 GB
# do something...
results[100, 25, 1, 4] = 42
Execute that and your RAM is... just gone. Don't expect your computer to do much else while the script hogs nearly 40 GB of memory.
Obviously that's something to avoid. We need to store this data on disk instead of in RAM. So h5py to the rescue:
import h5py

hdf5_store = h5py.File("./cache.hdf5", "a")
results = hdf5_store.create_dataset("results", (1000, 1000, 1000, 5), compression="gzip")

# do something...
results[100, 25, 1, 4] = 42

hdf5_store.close()  # flush everything to disk so other scripts can read it
This creates a file cache.hdf5 which will contain the data. create_dataset gives us an object that we can treat just like a NumPy array (most of the time, anyway). Additionally, we get a file that contains this array and that we can access from other scripts:
import h5py

hdf5_store = h5py.File("./cache.hdf5", "r")
print(hdf5_store["results"][100, 25, 1, 4])  # 42.0
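And because the dataset behaves like a NumPy array, you can fill it slice by slice, so only one block ever has to sit in RAM at a time. A rough sketch of that idea, where the dataset name chunked_results, the block size, and the random stand-in data are all just assumptions:
import numpy as np
import h5py

with h5py.File("./cache.hdf5", "a") as hdf5_store:
    chunked = hdf5_store.require_dataset(
        "chunked_results", shape=(1000, 1000, 1000, 5), dtype="float64")
    for i in range(1000):
        block = np.random.rand(1000, 1000, 5)  # stand-in for real data
        chunked[i] = block  # this block goes to disk, RAM stays free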
This one should be obvious, but I still see it sometimes. You need a value from some entry of an array inside a loop over something else:
some_array = np.ones((100, 200, 300))

for _ in range(10000000):
    some_array[50, 12, 199]  # access some_array on every iteration
Even though NumPy is really fast at accessing even big arrays by index, each lookup still takes some time, which gets quite expensive in big loops.
By simply moving the array access outside the loop you can gain a significant improvement:
some_array = np.ones((100, 200, 300))
the_value_I_need = some_array[50, 12, 199]  # access some_array once

for _ in range(10000000):
    the_value_I_need
This runs about twice as fast as the other version on my MacBook Air. Most of the time it's simple things like this that slow everything down!