Iterating through pandas objects is generally slow. In many cases, iterating manually over the rows is not needed [...].
Answer: DON'T!
Iteration in pandas is an anti-pattern, and is something you should only do when you have exhausted every other option. You should not use any function with "iter" in its name for more than a few thousand rows or you will have to get used to a lot of waiting.
Do you want to print a DataFrame? Use DataFrame.to_string().
Do you want to compute something? In that case, search for methods in this order (list modified from here):
- Vectorization
- Cython routines
- List Comprehensions (vanilla for loop)
- DataFrame.apply(): i) Reductions that can be performed in Cython, ii) Iteration in Python space
- DataFrame.itertuples() and iteritems()
- DataFrame.iterrows()
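To make the ordering above concrete, here is a minimal sketch that computes the same row-wise sum with several of the listed approaches. The DataFrame and its column names `a` and `b` are assumptions made up for illustration; the point is only that every approach gives the same answer, so the choice between them comes down to speed and readability.

```python
import pandas as pd

# Hypothetical example data; the column names are assumptions for illustration.
df = pd.DataFrame({"a": [1, 2, 3], "b": [10, 20, 30]})

# Vectorization: operates on whole columns at once.
vectorized = (df["a"] + df["b"]).tolist()

# List comprehension over the zipped columns.
comprehension = [a + b for a, b in zip(df["a"], df["b"])]

# DataFrame.apply with axis=1 calls the function once per row in Python space.
applied = df.apply(lambda row: row["a"] + row["b"], axis=1).tolist()

# itertuples yields one namedtuple per row.
from_tuples = [row.a + row.b for row in df.itertuples(index=False)]

# iterrows yields (index, Series) pairs, building a Series for every row.
from_rows = [row["a"] + row["b"] for _, row in df.iterrows()]
```

All five lists hold the same values here. On a frame this small the timing differences are invisible; they only start to matter as the row count grows.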
iterrows and itertuples (both receiving many votes in answers to this question) should be used in very rare circumstances, such as generating row objects/namedtuples for sequential processing, which is really the only thing these functions are useful for.
Faster than Looping: Vectorization, Cython
A good number of basic operations and computations are "vectorized" by pandas (either through NumPy, or through Cythonized functions). This includes arithmetic, comparisons, (most) reductions, reshaping (such as pivoting), joins, and groupby operations. Look through the documentation on Essential Basic Functionality to find a suitable vectorized method for your problem.
If none exists, feel free to write your own using custom Cython extensions.
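As a rough illustration of what "vectorized" means in practice, the sketch below runs an arithmetic operation, a comparison, a reduction, and a groupby without any explicit Python loop. The data and the column names (`group`, `price`, `qty`) are hypothetical and chosen only for this example.

```python
import pandas as pd

# Hypothetical example data; names and values are assumptions for illustration.
df = pd.DataFrame({
    "group": ["x", "x", "y"],
    "price": [2.0, 3.0, 4.0],
    "qty": [1, 2, 3],
})

# Arithmetic on whole columns: no explicit loop over rows.
df["total"] = df["price"] * df["qty"]

# Comparison produces a boolean Series that can be used as a filter.
mask = df["total"] > 5
large_orders = df[mask]

# Reduction over a column.
grand_total = df["total"].sum()

# Groupby followed by a reduction within each group.
per_group = df.groupby("group")["total"].sum()
```

Each statement hands the whole column to pandas at once, so the per-row work happens in compiled code rather than in the Python interpreter.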
Next Best Thing: List Comprehensions
List comprehensions should be your next port of call if 1) there is no vectorized solution available, 2) performance is important, but not important enough to go through the hassle of cythonizing your code, and 3) you're trying to perform an elementwise transformation on your data. There is a good amount of evidence to suggest that list comprehensions are sufficiently fast (and even sometimes faster) for many common pandas tasks.
The formula is simple,
# iterating over one column - `f` is some function that processes your data
result = [f(x) for x in df['col']]
# iterating over two columns, use `zip`
result = [f(x, y) for x, y in zip(df['col1'], df['col2'])]
# iterating over multiple columns
result = [f(row[0], ..., row[n]) for row in df[['col1', ...,'coln']].values]
If you can encapsulate your business logic into a function, you can use a list comprehension that calls it. You can make arbitrarily complex things work through the simplicity and speed of raw Python.
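For a concrete instance of the pattern, here is a small sketch in which the "business logic" lives in an ordinary Python function and a list comprehension applies it across two columns. The function `label`, the column names, and the pass mark of 60 are all assumptions invented for this example.

```python
import pandas as pd

# Hypothetical example data; names and values are assumptions for illustration.
df = pd.DataFrame({"name": ["ann", "bob"], "score": [91, 58]})


def label(name, score):
    """Hypothetical business logic: build a pass/fail label for one row."""
    status = "pass" if score >= 60 else "fail"
    return f"{name.title()}: {status}"


# Iterate over two columns in lockstep with zip and call the function per row.
df["label"] = [label(n, s) for n, s in zip(df["name"], df["score"])]
```

Because the result is a plain list with one entry per row, it can be assigned straight back as a new column.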