Monday, 25 February 2019
Thursday, 14 February 2019
Difference between map, applymap and apply methods in Pandas
Summing up,
- apply works on a row / column basis of a DataFrame
- applymap works element-wise on a DataFrame
- map works element-wise on a Series
APPLY
Another frequent operation is applying a function on 1D arrays to each column or row. DataFrame’s apply method does exactly this:
In [116]: frame = DataFrame(np.random.randn(4, 3), columns=list('bde'), index=['Utah', 'Ohio', 'Texas', 'Oregon'])
In [117]: frame
Out[117]:
               b         d         e
Utah   -0.029638  1.081563  1.280300
Ohio    0.647747  0.831136 -1.549481
Texas   0.513416 -0.884417  0.195343
Oregon -0.485454 -0.477388 -0.309548
In [118]: f = lambda x: x.max() - x.min()
In [119]: frame.apply(f)
Out[119]:
b 1.133201
d 1.965980
e 2.829781
dtype: float64
APPLYMAP
Many of the most common array statistics (like sum and mean) are DataFrame methods, so using apply is not necessary.
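A minimal sketch of that point, assuming a small hand-made frame (the values are illustrative, not the random ones from the session above, and value_range is a name chosen here). The range computed through apply should match the one built from the built-in max and min methods, and axis=1 applies the function per row instead of per column.

```python
import pandas as pd

# Illustrative values, not the random frame from the session above.
frame = pd.DataFrame(
    {"b": [1.0, 4.0, 2.0], "d": [0.5, -1.5, 3.0], "e": [2.0, 2.5, -0.5]},
    index=["Utah", "Ohio", "Texas"],
)

def value_range(x):
    """Difference between the largest and smallest value of a Series."""
    return x.max() - x.min()

per_column = frame.apply(value_range)        # one result per column (default axis=0)
per_row = frame.apply(value_range, axis=1)   # one result per row
builtin = frame.max() - frame.min()          # same numbers as per_column, without apply
```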
Element-wise Python functions can be used, too. Suppose you wanted to compute a formatted string from each floating point value in frame. You can do this with applymap:
In [120]: format = lambda x: '%.2f' % x
In [121]: frame.applymap(format)
Out[121]:
            b      d      e
Utah    -0.03   1.08   1.28
Ohio     0.65   0.83  -1.55
Texas    0.51  -0.88   0.20
Oregon  -0.49  -0.48  -0.31
MAP
The reason for the name applymap is that Series has a map method for applying an element-wise function:
In [122]: frame['e'].map(format)
Out[122]:
Utah 1.28
Ohio -1.55
Texas 0.20
Oregon -0.31
Name: e, dtype: object
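The three methods can be put side by side in one small script. This is a sketch under assumptions: the two-row frame and the name two_decimals are made up for illustration. Note that newer pandas releases (2.1 and later) rename the element-wise DataFrame method to DataFrame.map and deprecate applymap, so the sketch picks whichever one the installed version provides.

```python
import pandas as pd

# Small illustrative frame; values copied from two rows of the session above.
frame = pd.DataFrame(
    {"b": [-0.029638, 0.647747], "e": [1.280300, -1.549481]},
    index=["Utah", "Ohio"],
)

def two_decimals(x):
    """Format a single float with two decimals."""
    return "%.2f" % x

# apply: the function receives a whole column (a Series) at a time.
col_range = frame.apply(lambda col: col.max() - col.min())

# Element-wise on a DataFrame: DataFrame.map in pandas 2.1+, applymap before that.
elementwise = frame.map if hasattr(frame, "map") else frame.applymap
formatted_frame = elementwise(two_decimals)

# map: element-wise on a single Series.
formatted_e = frame["e"].map(two_decimals)
```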
Summing up, apply works on a row / column basis of a DataFrame, applymap works element-wise on a DataFrame, and map works element-wise on a Series.
Tuesday, 12 February 2019
select pandas dataframe rows and columns using LOC
The Pandas loc indexer can be used with DataFrames for two different use cases:
a.) Selecting rows by label/index
b.) Selecting rows with a boolean / conditional lookup
a.) Selecting rows by label/index
The loc indexer is used with the same syntax as iloc: data.loc[<row selection>, <column selection>] .
Selections with loc are based on the index of the DataFrame (if it has one).
You can set an index on the DataFrame with set_index:
data.set_index("last_name", inplace=True)
Once the index is set, you can select rows directly by last_name.
Selecting one row:
- data.loc['Andrade'] returns a Series
Selecting two rows:
- data.loc[['Andrade','Veness']] returns a DataFrame
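A small runnable sketch of these two selections. The rows below are hypothetical sample data; only the column names follow the notes above.

```python
import pandas as pd

# Hypothetical sample data; only the column names follow the notes above.
data = pd.DataFrame({
    "first_name": ["Antonio", "France", "Eric"],
    "last_name": ["Andrade", "Veness", "Rampy"],
    "city": ["Leeds", "Bristol", "York"],
})
data.set_index("last_name", inplace=True)

one_row = data.loc["Andrade"]                # single label -> Series
two_rows = data.loc[["Andrade", "Veness"]]   # list of labels -> DataFrame
```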
Select columns with .loc using the names of the columns. In most of my data work, typically I have named columns, and use these named selections.
When using the .loc indexer, columns are referred to by names using lists of strings, or “:” slices.
# Select rows with index values 'Andrade' and 'Veness', with all columns between 'city' and 'email'
data.loc[['Andrade', 'Veness'], 'city':'email']
# Select all rows from 'Andrade' through 'Veness' (label slices include both ends), with just 'first_name', 'address' and 'city' columns
data.loc['Andrade':'Veness', ['first_name', 'address', 'city']]
# Change the index to be based on the 'id' column
data.set_index('id', inplace=True)
# select the row with 'id' = 487
data.loc[487]
Note that in the last example, data.loc[487] (the row with index value 487) is not equal to data.iloc[487] (the row at integer position 487, i.e. the 488th row in the data). The index of the DataFrame can be out of numeric order, and/or a string or multi-value.
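The difference between label and position can be seen on a tiny frame. This is a sketch with hypothetical ids that are deliberately out of numeric order.

```python
import pandas as pd

# Hypothetical data with ids that are out of numeric order.
data = pd.DataFrame({"id": [487, 3, 12], "city": ["Leeds", "York", "Bath"]})
data.set_index("id", inplace=True)

by_label = data.loc[487]     # row whose index value is 487
by_position = data.iloc[0]   # row at integer position 0, which happens to be the same row

# Position 487 does not exist in a 3-row frame, so iloc raises IndexError.
try:
    data.iloc[487]
    out_of_bounds = False
except IndexError:
    out_of_bounds = True
```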
b.) Selecting rows with a boolean / conditional lookup
Conditional selection with boolean arrays, data.loc[<selection>], is the most common way people select data from Pandas DataFrames. With boolean indexing or logical selection, you pass an array or Series of True/False values to the .loc indexer to select the rows where your Series has True values. In most use cases, you will make selections based on the values of different columns in your data set.
For example, the statement data['first_name'] == 'Antonio' produces a Pandas Series with a True/False value for every row in the 'data' DataFrame, with True for the rows where first_name is "Antonio". This type of boolean array can be passed directly to the .loc indexer like so:
data.loc[data['first_name'] == 'Antonio']
A second argument can be passed to .loc to select particular columns out of the data frame. Again, columns are referred to by name for the loc indexer and can be a single string, a list of columns, or a slice “:” operation.
data.loc[data['first_name'] == 'Erasmo', ['column1', 'column2', 'column3']]
Selecting multiple columns with loc can be achieved by passing column names to the second argument of .loc[]
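Pay attention to which data type is returned:
data.loc[data['first_name'] == 'Antonio'] returns a DataFrame (a boolean row selection keeps all columns)
data.loc[data['first_name'] == 'Erasmo', ['column1', 'column2', 'column3']] returns a DataFrame
Only a single column label as the second argument, e.g. data.loc[data['first_name'] == 'Antonio', 'email'], returns a Series.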
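A short sketch, assuming hypothetical sample data, of which type comes back from each form of boolean selection with .loc.

```python
import pandas as pd

# Hypothetical sample data for illustration.
data = pd.DataFrame({
    "first_name": ["Antonio", "Erasmo", "Antonio"],
    "email": ["a@gmail.com", "e@hotmail.com", "t@hotmail.com"],
    "city": ["Leeds", "York", "Bath"],
})
mask = data["first_name"] == "Antonio"   # boolean Series, one value per row

all_columns = data.loc[mask]               # DataFrame: matching rows, all columns
one_column = data.loc[mask, "email"]       # Series: a single column label
one_column_df = data.loc[mask, ["email"]]  # DataFrame: one-element list of columns
```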
Examples
# Select rows with first name Antonio, and all columns between 'city' and 'email'
data.loc[data['first_name'] == 'Antonio', 'city':'email']
# Select rows where the email column ends with 'hotmail.com', include all columns
data.loc[data['email'].str.endswith("hotmail.com")]
# Select rows with first_name equal to some values, all columns
data.loc[data['first_name'].isin(['France', 'Tyisha', 'Eric'])]
# Select rows with first name Antonio AND gmail email addresses
data.loc[data['email'].str.endswith("gmail.com") & (data['first_name'] == 'Antonio')]
# Select rows with id column between 100 and 200, and just return 'postal' and 'web' columns
data.loc[(data['id'] > 100) & (data['id'] <= 200), ['postal', 'web']]
# A lambda function that yields True/False values can also be used.
# Select rows where the company name has 4 words in it.
data.loc[data['company_name'].apply(lambda x: len(x.split(' ')) == 4)]
# Selections can be achieved outside of the main .loc for clarity:
# Form a separate variable with your selections:
idx = data['company_name'].apply(lambda x: len(x.split(' ')) == 4)
# Select only the True values in 'idx' and only the 3 columns specified:
data.loc[idx, ['email', 'first_name', 'company']]
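To try selections of this kind without the original file, here is a self-contained sketch. The four rows are hypothetical and only the column names follow the examples.

```python
import pandas as pd

# Hypothetical sample data; only the column names follow the examples.
data = pd.DataFrame({
    "id": [90, 150, 200, 310],
    "first_name": ["Antonio", "France", "Antonio", "Eric"],
    "email": ["a@gmail.com", "f@hotmail.com", "t@hotmail.com", "e@gmail.com"],
    "company_name": ["Alpha Beta Gamma Ltd", "Solo", "Two Words", "One Two Three Four"],
})

# Comparisons need their own parentheses: & binds tighter than == and >.
gmail_antonio = data.loc[
    data["email"].str.endswith("gmail.com") & (data["first_name"] == "Antonio")
]
id_range = data.loc[(data["id"] > 100) & (data["id"] <= 200), ["first_name", "email"]]

# A boolean Series built separately and reused as the row selector.
idx = data["company_name"].apply(lambda x: len(x.split(" ")) == 4)
four_words = data.loc[idx, ["first_name", "company_name"]]
```

Building idx as a separate variable keeps the final .loc call short and lets the same mask be reused for several column selections.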
select pandas dataframe rows and columns using iloc
see https://www.shanelynn.ie/select-pandas-dataframe-rows-and-columns-using-iloc-loc-and-ix/
Selection Options
There are three main options for selection and indexing in Pandas:
- Selecting data by row numbers (.iloc)
- Selecting data by label or by a conditional statement (.loc)
- Selecting in a hybrid approach (.ix), deprecated since Pandas 0.20 and removed in pandas 1.0
ILOC
integer-location based indexing/selection
data.iloc[<row selection>, <column selection>]
Each row has a row number from 0 to the total number of rows minus one (data.shape[0] - 1), and iloc[] allows selections based on these numbers. The same applies to columns (ranging from 0 to data.shape[1] - 1).
# Single selections using iloc and DataFrame
# Rows:
data.iloc[0] # first row of data frame (Aleshia Tomkiewicz) - Note a Series data type output.
data.iloc[1] # second row of data frame (Evan Zigomalas)
data.iloc[-1] # last row of data frame (Mi Richan)
# Columns:
data.iloc[:,0] # first column of data frame (first_name)
data.iloc[:,1] # second column of data frame (last_name)
data.iloc[:,-1] # last column of data frame (id)
# Multiple row and column selections using iloc and DataFrame
data.iloc[0:5] # first five rows of dataframe
data.iloc[:, 0:2] # first two columns of data frame with all rows
data.iloc[[0,3,6,24], [0,5,6]] # 1st, 4th, 7th, 25th row + 1st 6th 7th columns.
data.iloc[0:5, 5:8] # first 5 rows and 6th, 7th, 8th columns of data frame (county -> phone1).
When selecting multiple columns or multiple rows in this manner, remember that in a selection such as [1:5], the rows/columns selected run from the first number to one minus the second number: [1:5] gives 1, 2, 3, 4, and in general [x:y] goes from x to y-1.
Note that .iloc returns a Pandas Series when a single row or a single column is selected with an integer, and a Pandas DataFrame when multiple rows or columns are selected. To get a DataFrame in the single-row or single-column case, pass a single-valued list, e.g. data.iloc[[0]].
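The slicing and return-type rules can be checked on a tiny frame. This is a sketch with a hypothetical 4x3 frame; the names reuse those mentioned in the comments above, but the layout is made up.

```python
import pandas as pd

# Hypothetical 4x3 frame for illustration.
data = pd.DataFrame({
    "first_name": ["Aleshia", "Evan", "France", "Ulysses"],
    "last_name": ["Tomkiewicz", "Zigomalas", "Andrade", "Mcwalters"],
    "id": [1, 2, 3, 4],
})

first_row = data.iloc[0]        # Series: a single row selected by integer
first_row_df = data.iloc[[0]]   # DataFrame: single-valued list keeps two dimensions
middle_rows = data.iloc[1:3]    # DataFrame: positions 1 and 2 (the end is exclusive)
first_col = data.iloc[:, 0]     # Series: a single column selected by integer
block = data.iloc[0:2, 0:2]     # DataFrame: first two rows, first two columns
```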
Friday, 8 February 2019
Statistics 2
Measures of central tendency
There are three measures to describe the center of a distribution: the mode indicates the class with the most observations, the median indicates the class that separates the bottom 50% from the top 50%, and the mean takes into account not only the counts but also the magnitude of each score…
You can remove extremes from a distribution by using the interquartile range: only the second and third quarters of the data around the median (the middle 50%).
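A minimal sketch of these measures using Python's standard statistics module. The scores are hypothetical and include one extreme value to show how the mean reacts while the median and the interquartile range do not.

```python
import statistics

# Hypothetical scores with one extreme value (40).
scores = [2, 3, 3, 4, 5, 6, 7, 40]

mode = statistics.mode(scores)       # most frequent score
median = statistics.median(scores)   # middle of the sorted scores
mean = statistics.mean(scores)       # pulled upward by the extreme value

# Quartile cut points; method="inclusive" treats the list as the full data set.
q1, q2, q3 = statistics.quantiles(scores, n=4, method="inclusive")
iqr = q3 - q1

# Keep only the scores between the first and third quartile.
middle_half = [s for s in scores if q1 <= s <= q3]
```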
A frequency distribution can be used to show how often a score occurs. You can also use it to determine the probability of a score.
For a normal distribution there are tables in which you can look up the probability that something occurs. You do need to convert the scores to z-scores first.
Important values
z lies between -1.96 and 1.96 for 95% of the scores; 2.5% is cut off on each side.
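Instead of a printed table, the same lookups can be sketched with NormalDist from Python's standard library (available from Python 3.8). The mean of 100 and standard deviation of 15 in the z-score conversion are hypothetical numbers chosen for illustration.

```python
from statistics import NormalDist

standard_normal = NormalDist(mu=0, sigma=1)

# Share of scores with z between -1.96 and 1.96 (roughly 95%).
central_share = standard_normal.cdf(1.96) - standard_normal.cdf(-1.96)

# The z value that leaves 2.5% in the upper tail (roughly 1.96).
z_cutoff = standard_normal.inv_cdf(0.975)

# Converting a raw score to a z-score; mean and standard deviation are hypothetical.
mean, std_dev = 100, 15
z_score = (130 - mean) / std_dev
```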
We've talked a little about the difference between working with a full population of data and working with a sample. In most real-world scenarios, you won't have access to the full population. For example, you're unlikely to have rainfall measurements for every day ever; and even if you did, that's a lot of data to try and manage.
Generally you work with samples that are representative of the population, and you use sample statistics such as the mean and standard deviation to approximate the parameters of the full data population. In practice it's best to get as large a sample as you can. The larger the sample, the better it will approximate the distribution and parameters of the full population.
Another thing you can do is to take multiple random samples. Each sample has a sample mean, and you can record these to form what's called a sampling distribution. With enough samples, two things happen.
One is that, thanks to something called the central limit theorem, the sampling distribution takes on an approximately normal shape regardless of the shape of the population distribution; and the second is that the mean of the sampling distribution, in other words the mean of all the sample means, will be very close to the population mean.
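A small simulation can make this concrete. It is a sketch under assumptions: the population is synthetic (exponentially distributed, so clearly not normal), and the sample size of 50 and the 1,000 repetitions are arbitrary choices.

```python
import random
import statistics

random.seed(0)  # fixed seed so the sketch is repeatable

# A skewed, clearly non-normal synthetic population.
population = [random.expovariate(1.0) for _ in range(20_000)]
population_mean = statistics.mean(population)

# Draw many random samples and record each sample mean.
sample_means = [
    statistics.mean(random.sample(population, 50))
    for _ in range(1_000)
]

# Center and spread of the sampling distribution of the mean.
mean_of_sample_means = statistics.mean(sample_means)
spread_of_sample_means = statistics.stdev(sample_means)
```

The mean of the sample means lands close to the population mean, and the sample means vary much less than the individual values in the population do.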