TL;DR: Preallocate, use h5py and think :)

Since I stepped into the world of crunching large amounts of data for analysis and machine learning with Python and NumPy, I've had to learn a few tricks to get by. Here are some tips I wish I had known when I started.

1. faster: use pre-allocated arrays

Let's say you retrieve a large number of values from a database. You want to process them somehow and then save the results to a numpy array.

Don't

Don't iterate over each entry and add it to the array one by one like this:

import numpy as np

entries = range(1000000) # 1 million entries
results = np.array([]) # start with an empty array

for entry in entries:
  processed_entry = entry + 5 # do something
  results = np.append(results, [processed_entry]) # np.append copies everything into a new array each time

This example takes roughly 11 seconds on my MacBook Air. I once used a loop like this to fill a multi-dimensional array that ended up with over 1 billion entries, and building it that way took about 45 minutes. The problem is that np.append does not grow the array in place: it allocates a fresh array and copies all the data over on every call, which is very time consuming.

Do

Instead preallocate the array using np.zeros:

entries = range(1000000) # 1 million entries
results = np.zeros((len(entries),)) # prefilled array

for idx, entry in enumerate(entries):
  processed_entry = entry + 5 # do something
  results[idx] = processed_entry

This finishes in under 1 second, because the array is already allocated in memory at its full size.

You can use this approach even if you don't know the final array size beforehand: grow the array in chunks with np.resize whenever it fills up, which is still much faster than appending entry by entry (see the sketch below).
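Here is a minimal sketch of that chunked pattern. produce_entries is a hypothetical stand-in for whatever yields your values, and the chunk size is an arbitrary choice:

import numpy as np

CHUNK = 100000 # grow the results array in chunks of this size (arbitrary choice)

results = np.zeros((CHUNK,))
n = 0

for entry in produce_entries(): # hypothetical source of values, length unknown
  if n >= results.shape[0]:
    # out of space: grow by one more chunk (one copy per chunk, not one per entry)
    results = np.resize(results, (results.shape[0] + CHUNK,))
  results[n] = entry + 5 # do something
  n += 1

results = results[:n] # trim the unused tail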

2. larger: use h5py

Sometimes your arrays get so big that they won't fit into RAM anymore.

Don't

Don't skip h5py and try to keep the whole thing in a plain in-memory array ;)

results = np.ones((1000, 1000, 1000, 5)) # 5 billion float64 values, roughly 40 GB

# do something...
results[100, 25, 1, 4] = 42

Execute that and your RAM is... just gone. Don't expect your computer to do much else while the script hogs nearly 40 GB of memory.

Do

Obviously that's something to avoid. We need to store the data on disk instead of in RAM. h5py to the rescue:

import h5py

hdf5_store = h5py.File("./cache.hdf5", "a")
results = hdf5_store.create_dataset("results", (1000,1000,1000,5), compression="gzip")

# do something...
results[100, 25, 1, 4] = 42

This creates a file cache.hdf5 that will contain the data. create_dataset returns an object we can treat just like a numpy array (at least most of the time), and because the array lives in a file on disk, we can also access it from other scripts:

hdf5_store = h5py.File("./cache.hdf5", "r")

print(hdf5_store["results"][100, 25, 1, 4]) # 42.0
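One habit worth picking up: close the file when you are done so everything gets flushed to disk, either by calling close() on the store or by opening it in a with block. A small self-contained sketch of the read side using a with block:

import h5py

# the with block flushes and closes the file automatically
with h5py.File("./cache.hdf5", "r") as hdf5_store:
  value = hdf5_store["results"][100, 25, 1, 4]

print(value) # 42.0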

3. smarter: don't access arrays more often than necessary

Don't

This one should be obvious, but I still see it from time to time. You need a value from an array while looping over something else:

some_array = np.ones((100, 200, 300))

for _ in range(10000000):
  some_array[50, 12, 199] # read the same value from some_array on every iteration

Even though numpy is very fast at accessing elements by index, even in big arrays, each access still carries some overhead, and that gets quite expensive in big loops.

Do

By simply moving the array access outside the loop you can gain a significant improvement:

some_array = np.ones((100, 200, 300))

the_value_I_need = some_array[50, 12, 199] # access some_array

for _ in range(10000000):
  the_value_I_need

This runs about twice as fast as the other version on my MacBook Air. Most of the time it's simple things like this that slow everything down!
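If you want to check this on your own machine, a quick sketch with the standard library's timeit module looks something like this. The function names are just illustrative, and the absolute numbers will vary from machine to machine:

import timeit
import numpy as np

some_array = np.ones((100, 200, 300))

def access_inside_loop():
  for _ in range(10000000):
    some_array[50, 12, 199] # index into the array on every iteration

def access_outside_loop():
  the_value_I_need = some_array[50, 12, 199] # index once, up front
  for _ in range(10000000):
    the_value_I_need

# time a single run of each version and print the elapsed seconds
print(timeit.timeit(access_inside_loop, number=1))
print(timeit.timeit(access_outside_loop, number=1))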