dinsdag 12 februari 2019

select pandas dataframe rows and columns using LOC

The Pandas loc indexer can be used with DataFrames for two different use cases:

    a.) Selecting rows by label/index

    b.) Selecting rows with a boolean / conditional lookup



a.) Selecting rows by label/index


The loc indexer is used with the same syntax as iloc: data.loc[<row selection>, <column selection>] .

Selections met LOC zijn gebaseerd op de index van het dataframe (als die er is)

met set_index kan je index op dataframe zetten

data.set_index("last_name", inplace=True)

als de index gezet is kan je direct rijen selecteren via de last_name

1 rij select:
- data.loc['Andrade']    >>>>> series

2 rijen select
- data.loc[['Andrade','Veness']]  >>>>>>>   dataframe


Select columns with .loc using the names of the columns. In most of my data work, typically I have named columns, and use these named selections.

When using the .loc indexer, columns are referred to by names using lists of strings, or “:” slices.

# Select rows with index values 'Andrade' and 'Veness', with all columns between 'city' and 'email'
data.loc[['Andrade', 'Veness'], 'city':'email']
# Select same rows, with just 'first_name', 'address' and 'city' columns
data.loc['Andrade':'Veness', ['first_name', 'address', 'city']]

# Change the index to be based on the 'id' column
data.set_index('id', inplace=True)
# select the row with 'id' = 487
data.loc[487]

Note that in the last example, data.loc[487] (the row with index value 487) is not equal to data.iloc[487] (the 487th row in the data). The index of the DataFrame can be out of numeric order, and/or a string or multi-value.



  b.) Selecting rows with a boolean / conditional lookup

 Conditional selections with boolean arrays using data.loc[<selection>] is the most common method that people use with Pandas DataFrames. With boolean indexing or logical selection, you pass an array or Series of True/False values to the .loc indexer to select the rows where your Series has True values.

In most use cases, you will make selections based on the values of different columns in your data set.

For example, the statement data[‘first_name’] == ‘Antonio’] produces a Pandas Series with a True/False value for every row in the ‘data’ DataFrame, where there are “True” values for the rows where the first_name is “Antonio”. These type of boolean arrays can be passed directly to the .loc indexer as so:

data.loc[data['first_name] == 'Antonio'] 

a second argument can be passed to .loc to select particular columns out of the data frame. Again, columns are referred to by name for the loc indexer and can be a single string, a list of columns, or a slice “:” operation.

data.loc[data['first_name] == 'Erasmo', ['column1', 'column2', 'column3']


Selecting multiple columns with loc can be achieved by passing column names to the second argument of .loc[]

Let op welke datatype gereturned wordt

data.loc[data['first_name] == 'Antonio']   ======> SERIES

data.loc[data['first_name] == 'Antonio'] data.loc[data['first_name] == 'Erasmo', ['column1', 'column2', 'column3'] =======> DATAFRAME

Voorbeelden
# Select rows with first name Antonio, # and all columns between 'city' and 'email'
data.loc[data['first_name'] == 'Antonio', 'city':'email']

# Select rows where the email column ends with 'hotmail.com', include all columns
data.loc[data['email'].str.endswith("hotmail.com")]

# Select rows with last_name equal to some values, all columns
data.loc[data['first_name'].isin(['France', 'Tyisha', 'Eric'])]

# Select rows with first name Antonio AND hotmail email addresses
data.loc[data['email'].str.endswith("gmail.com") & (data['first_name'] == 'Antonio')]

# select rows with id column between 100 and 200, and just return 'postal' and 'web' columns
data.loc[(data['id'] > 100) & (data['id'] <= 200), ['postal', 'web']]

# A lambda function that yields True/False values can also be used.
# Select rows where the company name has 4 words in it.
data.loc[data['company_name'].apply(lambda x: len(x.split(' ')) == 4)]

# Selections can be achieved outside of the main .loc for clarity:
# Form a separate variable with your selections:
idx = data['company_name'].apply(lambda x: len(x.split(' ')) == 4)
# Select only the True values in 'idx' and only the 3 columns specified:
data.loc[idx, ['email', 'first_name', 'company']]

'

Geen opmerkingen:

Een reactie posten

Datums bepalen adhv begin en einddatum in Dataframe

Voorbeeld op losse velden  ####################################################################### # import necessary packages from datetime...