Python Knowledge Center: February 2019

Monday, 25 February 2019

Useful Python links

How to manage configuration in Python

https://hackernoon.com/4-ways-to-manage-the-configuration-in-python-4623049e841b

Very handy videos, e.g. about local and global variables

http://www.pythonbytesize.com/user-defined-functions.html

Thursday, 14 February 2019

Difference between map, applymap and apply methods in Pandas

Summing up,

apply works on a row / column basis of a DataFrame
applymap works element-wise on a DataFrame
map works element-wise on a Series

APPLY

Another frequent operation is applying a function on 1D arrays to each column or row. DataFrame’s apply method does exactly this:

In [115]: import numpy as np; from pandas import DataFrame

In [116]: frame = DataFrame(np.random.randn(4, 3), columns=list('bde'), index=['Utah', 'Ohio', 'Texas', 'Oregon'])

In [117]: frame
Out[117]:
               b         d         e
Utah   -0.029638  1.081563  1.280300
Ohio    0.647747  0.831136 -1.549481
Texas   0.513416 -0.884417  0.195343
Oregon -0.485454 -0.477388 -0.309548

In [118]: f = lambda x: x.max() - x.min()

In [119]: frame.apply(f)
Out[119]:
b    1.133201
d    1.965980
e    2.829781
dtype: float64

Many of the most common array statistics (like sum and mean) are DataFrame methods, so using apply is not necessary.

APPLYMAP

Element-wise Python functions can be used, too. Suppose you wanted to compute a formatted string from each floating point value in frame. You can do this with applymap:

In [120]: format = lambda x: '%.2f' % x

In [121]: frame.applymap(format)
Out[121]:
            b      d      e
Utah    -0.03   1.08   1.28
Ohio     0.65   0.83  -1.55
Texas    0.51  -0.88   0.20
Oregon  -0.49  -0.48  -0.31

MAP

The reason for the name applymap is that Series has a map method for applying an element-wise function:

In [122]: frame['e'].map(format)
Out[122]:
Utah       1.28
Ohio      -1.55
Texas      0.20
Oregon    -0.31
Name: e, dtype: object

Summing up, apply works on a row / column basis of a DataFrame, applymap works element-wise on a DataFrame, and map works element-wise on a Series.
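
As a rough, self-contained sketch of the three methods side by side: the frame below hard-codes the values printed above instead of drawing random numbers, and the helper names spread and two_decimals are made up for this example. One assumption to be aware of: in pandas 2.1 and later DataFrame.applymap was renamed to DataFrame.map, so the sketch picks whichever of the two the installed version offers.

```python
import pandas as pd

# Same values as the frame printed above (hard-coded instead of random).
frame = pd.DataFrame(
    [[-0.029638, 1.081563, 1.280300],
     [0.647747, 0.831136, -1.549481],
     [0.513416, -0.884417, 0.195343],
     [-0.485454, -0.477388, -0.309548]],
    columns=list("bde"),
    index=["Utah", "Ohio", "Texas", "Oregon"],
)


def spread(values):
    """Range (max - min) of one column or row; apply passes it in as a Series."""
    return values.max() - values.min()


def two_decimals(value):
    """Format one float as a string with two decimals."""
    return "%.2f" % value


per_column = frame.apply(spread)        # apply: one result per column
per_row = frame.apply(spread, axis=1)   # apply with axis=1: one result per row

# Element-wise on a DataFrame: DataFrame.map in pandas >= 2.1, applymap before that.
elementwise = getattr(frame, "map", None) or frame.applymap
formatted_frame = elementwise(two_decimals)

# Element-wise on a Series.
formatted_column = frame["e"].map(two_decimals)

print(per_column)
print(formatted_frame)
print(formatted_column)
```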

Tuesday, 12 February 2019

select pandas dataframe rows and columns using LOC

The Pandas loc indexer can be used with DataFrames for two different use cases:

a.) Selecting rows by label/index
b.) Selecting rows with a boolean / conditional lookup

a.) Selecting rows by label/index

The loc indexer is used with the same syntax as iloc: data.loc[<row selection>, <column selection>].

Selections with loc are based on the index of the DataFrame (if one has been set).

With set_index you can set an index on the DataFrame:

data.set_index("last_name", inplace=True)

Once the index is set, you can select rows directly by last_name.

Select 1 row:
- data.loc['Andrade'] >>>>> Series

Select 2 rows:
- data.loc[['Andrade','Veness']] >>>>>>> DataFrame
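
A minimal sketch of the two selections above, assuming a tiny made-up data set (the rows are hypothetical; only the column names follow the notes). It uses the non-inplace form of set_index, which returns a new DataFrame instead of modifying the existing one.

```python
import pandas as pd

# Hypothetical mini data set; only the column names follow the notes.
data = pd.DataFrame({
    "first_name": ["France", "Roselle", "Antonio"],
    "last_name": ["Andrade", "Veness", "Villamarin"],
    "city": ["Bournemouth", "Abbey Ward", "Daventry"],
})

# Make last_name the index (returns a new DataFrame).
data = data.set_index("last_name")

one_row = data.loc["Andrade"]                # single label -> Series
two_rows = data.loc[["Andrade", "Veness"]]   # list of labels -> DataFrame

print(type(one_row).__name__)
print(type(two_rows).__name__)
```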

Select columns with .loc using the names of the columns. In most of my data work, typically I have named columns, and use these named selections.

When using the .loc indexer, columns are referred to by names using lists of strings, or “:” slices.

# Select rows with index values 'Andrade' and 'Veness', with all columns between 'city' and 'email'
data.loc[['Andrade', 'Veness'], 'city':'email']
# Select all rows from 'Andrade' through 'Veness' (a label slice, both ends included), with just 'first_name', 'address' and 'city' columns
data.loc['Andrade':'Veness', ['first_name', 'address', 'city']]

# Change the index to be based on the 'id' column
data.set_index('id', inplace=True)
# select the row with 'id' = 487
data.loc[487]

Note that in the last example, data.loc[487] (the row with index value 487) is not equal to data.iloc[487] (the row at integer position 487, i.e. the 488th row in the data). The index of the DataFrame can be out of numeric order, and/or a string or multi-value.
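
To make the label-versus-position difference concrete, here is a small sketch with a hypothetical three-row frame whose 'id' index is deliberately out of numeric order (the names and ids are invented for illustration).

```python
import pandas as pd

# Hypothetical frame; the 'id' index is intentionally not sorted.
data = pd.DataFrame(
    {"first_name": ["Ann", "Bob", "Cy"]},
    index=pd.Index([487, 2, 0], name="id"),
)

by_label = data.loc[487]     # row whose index label is 487 (the first row here)
by_position = data.iloc[0]   # row at integer position 0, whatever its label
label_zero = data.loc[0]     # label 0 belongs to the last row in this frame

print(by_label["first_name"], by_position["first_name"], label_zero["first_name"])
```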

b.) Selecting rows with a boolean / conditional lookup

Conditional selection with boolean arrays using data.loc[<selection>] is the most common method that people use with Pandas DataFrames. With boolean indexing or logical selection, you pass an array or Series of True/False values to the .loc indexer to select the rows where your Series has True values.

In most use cases, you will make selections based on the values of different columns in your data set.

For example, the statement data['first_name'] == 'Antonio' produces a Pandas Series with a True/False value for every row in the 'data' DataFrame, with True values for the rows where the first_name is "Antonio". This type of boolean array can be passed directly to the .loc indexer like so:

data.loc[data['first_name'] == 'Antonio']

A second argument can be passed to .loc to select particular columns out of the data frame. Again, columns are referred to by name for the loc indexer and can be a single string, a list of columns, or a slice ":" operation.

data.loc[data['first_name'] == 'Erasmo', ['column1', 'column2', 'column3']]

Selecting multiple columns with loc can be achieved by passing column names to the second argument of .loc[]

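Pay attention to which data type is returned:

data.loc[data['first_name'] == 'Antonio'] ======> DataFrame (a boolean mask yields a DataFrame, even if only one row matches)

data.loc[data['first_name'] == 'Erasmo', ['column1', 'column2', 'column3']] =======> DataFrame

data.loc[data['first_name'] == 'Antonio', 'column1'] ======> Series (a single column given as a plain string)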
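
A minimal sketch for checking the return type yourself, assuming a small made-up frame (rows and e-mail addresses are hypothetical). It shows what comes back from a boolean mask alone, a mask plus a list of columns, and a mask plus a single column name.

```python
import pandas as pd

# Hypothetical rows; only the column names follow the notes.
data = pd.DataFrame({
    "first_name": ["Antonio", "Erasmo", "Antonio"],
    "email": ["a@gmail.com", "e@hotmail.com", "t@hotmail.com"],
    "city": ["Leeds", "York", "Bath"],
})

mask = data["first_name"] == "Antonio"   # boolean Series, one value per row

rows_only = data.loc[mask]                         # mask only -> DataFrame
some_columns = data.loc[mask, ["email", "city"]]   # list of columns -> DataFrame
one_column = data.loc[mask, "email"]               # single column name -> Series
one_column_frame = data.loc[mask, ["email"]]       # one-element list -> DataFrame

for result in (rows_only, some_columns, one_column, one_column_frame):
    print(type(result).__name__)
```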

Examples
# Select rows with first name Antonio, and all columns between 'city' and 'email'

data.loc[data['first_name'] == 'Antonio', 'city':'email']

# Select rows where the email column ends with 'hotmail.com', include all columns

data.loc[data['email'].str.endswith("hotmail.com")]

# Select rows with first_name equal to some values, all columns

data.loc[data['first_name'].isin(['France', 'Tyisha', 'Eric'])]

# Select rows with first name Antonio AND gmail email addresses

data.loc[data['email'].str.endswith("gmail.com") & (data['first_name'] == 'Antonio')]

# select rows with id column between 100 and 200, and just return 'postal' and 'web' columns

data.loc[(data['id'] > 100) & (data['id'] <= 200), ['postal', 'web']]

# A lambda function that yields True/False values can also be used.

# Select rows where the company name has 4 words in it.

data.loc[data['company_name'].apply(lambda x: len(x.split(' ')) == 4)]

# Selections can be achieved outside of the main .loc for clarity:

# Form a separate variable with your selections:

idx = data['company_name'].apply(lambda x: len(x.split(' ')) == 4)

# Select only the True values in 'idx' and only the 3 columns specified:

data.loc[idx, ['email', 'first_name', 'company_name']]

select pandas dataframe rows and columns using iloc

See https://www.shanelynn.ie/select-pandas-dataframe-rows-and-columns-using-iloc-loc-and-ix/

Selection Options

There are three main options for selection and indexing in Pandas:

Selecting data by row numbers (.iloc)
Selecting data by label or by a conditional statement (.loc)
Selecting in a hybrid approach (.ix) (deprecated since Pandas 0.20 and removed in Pandas 1.0)

ILOC

integer-location based indexing/selection

data.iloc[<row selection>, <column selection>]

Each row has a row number from 0 to the total number of rows minus one (data.shape[0] - 1), and iloc[] allows selections based on these numbers. The same applies to columns (ranging from 0 to data.shape[1] - 1).

# Single selections using iloc and DataFrame

# Rows:
data.iloc[0] # first row of data frame (Aleshia Tomkiewicz) - Note a Series data type output.
data.iloc[1] # second row of data frame (Evan Zigomalas)
data.iloc[-1] # last row of data frame (Mi Richan)

# Columns:
data.iloc[:,0] # first column of data frame (first_name)
data.iloc[:,1] # second column of data frame (last_name)
data.iloc[:,-1] # last column of data frame (id)

# Multiple row and column selections using iloc and DataFrame
data.iloc[0:5] # first five rows of dataframe
data.iloc[:, 0:2] # first two columns of data frame with all rows
data.iloc[[0,3,6,24], [0,5,6]] # 1st, 4th, 7th, 25th row + 1st 6th 7th columns.
data.iloc[0:5, 5:8] # first 5 rows and the columns at positions 5, 6, 7, i.e. the 6th, 7th and 8th columns (county -> phone1).

When selecting multiple columns or multiple rows in this manner, remember that in your selection, e.g. [1:5], the rows/columns selected will run from the first number to one minus the second number: [1:5] will go 1, 2, 3, 4, and in general [x:y] goes from x to y-1.

Note that .iloc returns a Pandas Series when a single row or a single column is selected with a plain integer, and a Pandas DataFrame when multiple rows or columns are selected. To counter this, pass a single-valued list, e.g. data.iloc[[0]], if you require DataFrame output.
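
The slicing and return-type behaviour described above can be tried out with a small sketch; the frame and its column names a, b, c are invented for this example.

```python
import pandas as pd

# Hypothetical frame with 10 rows and 3 columns.
data = pd.DataFrame({"a": range(10), "b": range(10, 20), "c": range(20, 30)})

rows_1_to_4 = data.iloc[1:5]            # positions 1, 2, 3, 4 (5 is excluded)

single_row = data.iloc[0]               # plain integer -> Series
single_row_frame = data.iloc[[0]]       # single-valued list -> DataFrame with one row

single_column = data.iloc[:, 0]         # plain integer -> Series
single_column_frame = data.iloc[:, [0]] # single-valued list -> DataFrame with one column

print(rows_1_to_4.index.tolist())
print(type(single_row).__name__, type(single_row_frame).__name__)
```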

Friday, 8 February 2019

Statistics 2

Measures of central tendency
There are three measures to describe the center of a distribution: the mode is the class with the most observations, the median is the class that separates the lower 50% from the upper 50%, and the mean takes into account not only the counts but also the magnitude of each score…
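
A quick sketch of the three measures using Python's standard statistics module. The scores are made up, with one extreme value added to show that the mean is pulled towards it while the mode and median are not.

```python
import statistics

# Made-up scores with one extreme value (40).
scores = [2, 3, 3, 4, 5, 5, 5, 6, 40]

mode = statistics.mode(scores)      # most frequent score
median = statistics.median(scores)  # middle score of the sorted list
mean = statistics.mean(scores)      # sum of the scores divided by their count

print(mode, median, mean)
```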

You can remove extremes from a distribution by using the interquartile range: keep only the 2nd and 3rd quarters around the median, i.e. the middle 50% of the scores.
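
One possible way to do this in code, sketched with statistics.quantiles (available from Python 3.8). The scores are made up, and note that different quartile conventions can give slightly different cut points, so treat the exact boundaries as an assumption of this sketch.

```python
import statistics

# Made-up scores with one extreme value (40).
scores = [2, 3, 3, 4, 5, 5, 5, 6, 40]

# Three cut points that split the sorted scores into four quarters.
q1, q2, q3 = statistics.quantiles(scores, n=4)
iqr = q3 - q1  # interquartile range

# Keep only the scores between the first and third quartile.
middle = [score for score in scores if q1 <= score <= q3]

print(q1, q3, iqr)
print(middle)
```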

A frequency distribution can be used to show how often a score occurs. You can also use it to determine the probability of a score.

For a normal distribution there are tables in which you can look up the probability that a value occurs. You do have to convert the score to a z-score first.

Important values
z lies between -1.96 and 1.96 for 95% of the scores; 2.5% is cut off on each side.
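
Instead of a printed table, the same lookup can be sketched with statistics.NormalDist (Python 3.8+). The helper z_score and the example score, mean and standard deviation are hypothetical and only serve as illustration.

```python
from statistics import NormalDist

standard_normal = NormalDist(mu=0, sigma=1)

# Probability that z falls between -1.96 and 1.96 (roughly 95%).
central = standard_normal.cdf(1.96) - standard_normal.cdf(-1.96)

# Probability in the upper tail beyond 1.96 (roughly 2.5%).
upper_tail = 1 - standard_normal.cdf(1.96)


def z_score(x, mean, sd):
    """Convert a raw score to a z-score: its distance from the mean in standard deviations."""
    return (x - mean) / sd


# Hypothetical example: a score of 130 with mean 100 and standard deviation 15.
z = z_score(130, mean=100, sd=15)

print(round(central, 4), round(upper_tail, 4), z)
```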

We've talked a little about the difference between working with a full population of data and working with a sample. In most real-world scenarios, you won't have access to the full population. For example, you're unlikely to have rainfall measures for every day ever; and even if you did, that's a lot of data to try and manage.
Generally you work with samples that are representative of the population, and you use sample statistics such as the mean and standard deviation to approximate the parameters of the full population. In practice it's best to get as large a sample as you can. The larger the sample, the better it will approximate the distribution and parameters of the full population.

Another thing you can do is to take multiple random samples. Each sample has a sample mean, and you can record these to form what's called a sampling distribution. With enough samples, two things happen.
First, thanks to the central limit theorem, the sampling distribution of the mean takes on an approximately normal shape regardless of the shape of the population distribution, provided each sample is reasonably large. Second, the mean of the sampling distribution, in other words the mean of all the sample means, will be very close to the population mean.
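
A small simulation can illustrate this. The sketch below assumes a made-up, clearly non-normal population (exponentially distributed values) and arbitrary choices for the sample size (50) and the number of samples (1,000); it only compares the mean of the sample means with the population mean and looks at how much the sample means vary.

```python
import random
import statistics

random.seed(0)  # fixed seed so the run is repeatable

# Hypothetical, strongly skewed population.
population = [random.expovariate(1.0) for _ in range(20_000)]
population_mean = statistics.mean(population)

# Draw many random samples and record each sample mean.
sample_means = [
    statistics.mean(random.sample(population, 50))
    for _ in range(1_000)
]

mean_of_means = statistics.mean(sample_means)
spread_of_means = statistics.stdev(sample_means)

print(population_mean, mean_of_means, spread_of_means)
```

The sample means vary far less than the individual values in the population, which is why a sample mean is a useful estimate of the population mean.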
