How to iterate over rows in a DataFrame in Pandas?

Iterating through pandas objects is generally slow. In many cases, iterating manually over the rows is not needed [...].

Answer: DON'T!

Iteration in pandas is an anti-pattern, and is something you should only do when you have exhausted every other option. You should not use any function with "iter" in its name for more than a few thousand rows or you will have to get used to a lot of waiting.
Do you want to print a DataFrame? Use DataFrame.to_string().
Do you want to compute something? In that case, search for methods in this order (list modified from here):

Vectorization
Cython routines
List Comprehensions (vanilla for loop)
DataFrame.apply(): i) Reductions that can be performed in cython, ii) Iteration in python space
DataFrame.itertuples() and iteritems()
DataFrame.iterrows()

iterrows and itertuples (both receiving many votes in answers to this question) should be used in very rare circumstances, such as generating row objects/nametuples for sequential processing, which is really the only thing these functions are useful for.

Faster than Looping: Vectorization, Cython

A good number of basic operations and computations are "vectorised" by pandas (either through NumPy, or through Cythonized functions). This includes arithmetic, comparisons, (most) reductions, reshaping (such as pivoting), joins, and groupby operations. Look through the documentation on Essential Basic Functionality to find a suitable vectorised method for your problem.
If none exists, feel free to write your own using custom cython extensions.

Next Best Thing: List Comprehensions

List comprehensions should be your next port of call if 1) there is no vectorized solution available, 2) performance is important, but not important enough to go through the hassle of cythonizing your code, and 3) you're trying to perform elementwise transformation on your code. There is a good amount of evidence to suggest that list comprehensions are sufficiently fast (and even sometimes faster) for many common pandas tasks.
The formula is simple,

# iterating over one column - `f` is some function that processes your data
result = [f(x) for x in df['col']]
# iterating over two columns, use `zip`
result = [f(x, y) for x, y in zip(df['col1'], df['col2'])]
# iterating over multiple columns
result = [f(row[0], ..., row[n]) for row in df[['col1', ...,'coln']].values]

If you can encapsulate your business logic into a function, you can use a list comprehension that calls it. You can make arbitrarily complex things work through the simplicity and speed of raw python.

Python Knowledge Center

vrijdag 28 februari 2020

How to iterate over rows in a DataFrame in Pandas?

Answer: DON'T!

Faster than Looping: Vectorization, Cython

Next Best Thing: List Comprehensions

Geen opmerkingen:

Een reactie posten

Datums bepalen adhv begin en einddatum in Dataframe

Zoeken in deze blog