Python Knowledge Center: September 2018

Tuesday 11 September 2018

git

Starting from scratch: create a directory in Windows.
Go into that directory and open a command prompt.

- git clone https://git.datapunt.amsterdam.nl/jwagener/mypython.git

This creates a directory named mypython.

- cd mypython
- git status

Copy the files into the Windows directory
- git add *
- git commit -m "Dit is een test"
- git push origin master

Every morning you can refresh your own working directory from master with
- git pull

To get the credentials back in order (note that the store helper saves them unencrypted on disk)
- git config credential.helper store

Monday 3 September 2018

Reading Excel and making data selections

Remarks

Note that when reading Excel, pandas will probably perform all kinds of type conversions automatically. You can prevent this by, for example, reading everything in as strings:

dflookup1=xls_file.parse(sheet,skiprows=0,dtype=str)

You can always adjust the data types afterwards.
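A minimal sketch of why reading everything as strings can matter. It uses pd.read_csv on in-memory text (an assumption, chosen so the example runs without an Excel file); the dtype argument is used in the same way as in xls_file.parse. The sample values are made up.

```python
import io
import pandas as pd

# Made-up sample data; the first BSN starts with a zero.
raw = "BSN,JAAR\n012345678,2017\n123456789,2018\n"

# Default: pandas infers the types, so BSN becomes an integer
# and the leading zero is lost.
df_auto = pd.read_csv(io.StringIO(raw))

# dtype=str: every column stays exactly as it was written in the source.
df_str = pd.read_csv(io.StringIO(raw), dtype=str)

print(df_auto["BSN"].tolist())
print(df_str["BSN"].tolist())
```

Identifiers such as BSN or postcode keys are typical candidates for string columns, because they are never used in arithmetic.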

Per-field conversion

>>> string conversion to datetime

dfin=dfin.astype({'VD_Ingangsdatum': 'datetime64[ns]', 'VD_Einddatum': 'datetime64[ns]'})
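As an alternative sketch, pd.to_datetime converts one column at a time and lets you decide what happens to values that cannot be parsed. The sample values and the ISO date format are assumptions for illustration only.

```python
import pandas as pd

# Made-up sample data; the last end date is deliberately invalid.
dfin = pd.DataFrame({
    "VD_Ingangsdatum": ["2018-01-01", "2018-02-15"],
    "VD_Einddatum": ["2018-12-31", "not a date"],
})

for col in ["VD_Ingangsdatum", "VD_Einddatum"]:
    # errors="coerce" turns unparseable values into NaT instead of raising.
    dfin[col] = pd.to_datetime(dfin[col], format="%Y-%m-%d", errors="coerce")

print(dfin.dtypes)
print(dfin)
```

Counting the NaT values afterwards (for example with dfin['VD_Einddatum'].isna().sum()) is a quick way to see how many dates failed to parse.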

>>> string conversion to number

dfjgd2['BEDRAG'] = dfjgd2['BEDRAG'].str.replace(',', '').astype(float)
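Removing the comma only works when the comma is a thousands separator. The sketch below contrasts that case with Dutch-style notation, where the comma is the decimal separator; which notation the real BEDRAG column uses is not stated here, so both series are made-up examples.

```python
import pandas as pd

# Made-up amounts in two notations.
english = pd.Series(["1,234.56", "99.00"])  # comma = thousands separator
dutch = pd.Series(["1.234,56", "99,00"])    # dot = thousands, comma = decimal

# regex=False treats "." as a literal dot, not as a regex wildcard.
english_num = english.str.replace(",", "", regex=False).astype(float)
dutch_num = (
    dutch.str.replace(".", "", regex=False)   # drop the thousands separator
         .str.replace(",", ".", regex=False)  # decimal comma becomes a dot
         .astype(float)
)

print(english_num.tolist())
print(dutch_num.tolist())
```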

Conversion while reading

If you only want to set the type of certain fields, do it as follows:

dflookup1=xls_file.parse(sheet,skiprows=0,dtype={'BSN': str,'GEB_PC_HNR':str,'JAAR':str})

Example code

import pandas as pd
import numpy as np

xls_file=pd.ExcelFile('..\\Factuur regels 303-Jgd 01-01-2016 tm 03-09-2018 dd 03-09-2018.xlsx')
dfnew=xls_file.parse('ZorgNed_FactRegl303')

print(list(dfnew.columns))
lscolsnew = dfnew.columns.values

bjaar2017=dfnew['Factuurregel betrekking op jaar']==2017
bGefactureerd=dfnew['Goedkeuren']=='Gefactureerd'
dfSelect=dfnew[bjaar2017 & bGefactureerd]

# Count rows: method 1
rows,columns = dfSelect.shape
print('number of rows: ',rows )
# Count rows: method 2
dfSelect['Factuurregel betrekking op jaar'].count()

dfSelect['Bedrag goedgekeurd'].sum()

# a single pair of square brackets returns a Series object
dfSelect.groupby("Voorzieningsoort")['Bedrag goedgekeurd'].sum().sort_values()

# two pairs of square brackets return a DataFrame
dfSelect.groupby("Voorzieningsoort")[['Bedrag goedgekeurd']].sum().sort_values(by='Bedrag goedgekeurd')
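The selection logic above can be tried without the Excel file. This is a minimal sketch on a made-up DataFrame with the same column names; the status value 'Afgekeurd' and all amounts are invented for illustration.

```python
import pandas as pd

# Made-up rows with the same column names as the invoice sheet.
dfnew = pd.DataFrame({
    "Factuurregel betrekking op jaar": [2017, 2017, 2018, 2017],
    "Goedkeuren": ["Gefactureerd", "Afgekeurd", "Gefactureerd", "Gefactureerd"],
    "Voorzieningsoort": ["A", "A", "B", "B"],
    "Bedrag goedgekeurd": [100.0, 50.0, 75.0, 25.0],
})

# Each comparison yields a boolean Series with one value per row.
is_2017 = dfnew["Factuurregel betrekking op jaar"] == 2017
is_invoiced = dfnew["Goedkeuren"] == "Gefactureerd"

# & combines the masks element by element; only rows where both are True remain.
dfSelect = dfnew[is_2017 & is_invoiced]

per_type = dfSelect.groupby("Voorzieningsoort")["Bedrag goedgekeurd"].sum().sort_values()
print(len(dfSelect))
print(per_type)
```

Both masks are built from the same DataFrame, which keeps their indexes aligned with the rows being selected.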

Summarize, aggregate and group

Basic functions on a dataset

# How many rows are in the dataset?
data['item'].count()

# What was the longest phone call / data entry?
data['duration'].max()

# How many seconds of phone calls are recorded in total?
data['duration'][data['item'] == 'call'].sum()

# How many entries are there for each month?
data['month'].value_counts()
Out[41]: 
2014-11    230
2015-01    205
2014-12    157
2015-02    137
2015-03    101
dtype: int64

# Number of non-null unique network entries
data['network'].nunique()

group by

On one variable:

data.groupby('month')['duration'].sum()

On multiple variables:

data.groupby(['month', 'item'])['date'].count()
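A small sketch of what grouping on two variables returns. The phone-log rows are made up; only the column names follow the examples above.

```python
import pandas as pd

# Made-up phone log.
data = pd.DataFrame({
    "month": ["2014-11", "2014-11", "2014-11", "2014-12"],
    "item": ["call", "call", "sms", "call"],
    "date": ["d1", "d2", "d3", "d4"],
    "duration": [10.0, 20.0, 1.0, 5.0],
})

counts = data.groupby(["month", "item"])["date"].count()

# The result is a Series with a two-level index (month, item).
print(counts)

# A single group is addressed with a tuple of the grouping values.
print(counts.loc[("2014-11", "call")])
```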

Output format of group by

As a rule of thumb, if you calculate more than one column of results, your result will be a DataFrame. For a single column of results, the agg function will, by default, produce a Series.

You can change this by selecting your operation column differently:

# produces Pandas Series

data.groupby('month')['duration'].sum()

or

dfjgd2.groupby('JAAR')['BEDRAG'].agg('sum')

# produces Pandas DataFrame

data.groupby('month')[['duration']].sum()

or

dfjgd2.groupby('JAAR')['BEDRAG'].agg(['sum'])
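The difference between the variants can be checked directly by looking at the type of each result. A minimal sketch on made-up data:

```python
import pandas as pd

# Made-up data with the column names used above.
data = pd.DataFrame({
    "month": ["2014-11", "2014-11", "2014-12"],
    "duration": [10.0, 20.0, 5.0],
})

as_series = data.groupby("month")["duration"].sum()          # one pair of brackets
as_frame = data.groupby("month")[["duration"]].sum()         # two pairs of brackets
as_frame_agg = data.groupby("month")["duration"].agg(["sum"])  # list passed to agg

print(type(as_series).__name__)
print(type(as_frame).__name__, list(as_frame.columns))
print(type(as_frame_agg).__name__, list(as_frame_agg.columns))
```

Note that the column name differs: the double-bracket form keeps the original column name, while passing a list to agg names the column after the function.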

Adjusting the format of grouping output

dfjgd2.groupby('JAAR')['BEDRAG'].sum().apply(lambda x: '{:.2f}'.format(x))

or

create a DataFrame

df1=dfjgd2.groupby('JAAR')['BEDRAG'].agg(['sum'])

apply it to the column (Series)

df1['sum'].apply(lambda x: '{:.2f}'.format(x))
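One thing to keep in mind is that formatting turns the numbers into strings. The sketch below, on made-up amounts, keeps the numeric totals and derives formatted copies for display only; the thousands-separator variant is an extra illustration, not part of the notes above.

```python
import pandas as pd

# Made-up amounts per year.
dfjgd2 = pd.DataFrame({
    "JAAR": ["2017", "2017", "2018"],
    "BEDRAG": [1000.5, 234.25, 99.0],
})

totals = dfjgd2.groupby("JAAR")["BEDRAG"].sum()  # numeric, usable in calculations

# Formatted copies hold strings and are meant for display only.
formatted = totals.apply(lambda x: "{:.2f}".format(x))
with_separator = totals.map("{:,.2f}".format)

print(formatted.tolist())
print(with_separator.tolist())
```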

Multi-Index 1

The groupby output will have an index or multi-index on rows corresponding to your chosen grouping variables. To avoid setting this index, pass “as_index=False” to the groupby operation.

data.groupby('month', as_index=False).agg({"duration": "sum"})

https://www.shanelynn.ie/summarising-aggregation-and-grouping-data-in-python-pandas/

Multi-index method 2

Another way is to call reset_index on the grouped result:

data.groupby('month').agg({"duration": "sum"}).reset_index(drop=False)
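The two ways of getting the grouping variable back as an ordinary column can be compared side by side. A minimal sketch on made-up data:

```python
import pandas as pd

# Made-up data with the column names used above.
data = pd.DataFrame({
    "month": ["2014-11", "2014-11", "2014-12"],
    "duration": [10.0, 20.0, 5.0],
})

# Way 1: never put the grouping variable in the index.
flat_1 = data.groupby("month", as_index=False).agg({"duration": "sum"})

# Way 2: group normally, then move the index back into a column.
flat_2 = data.groupby("month").agg({"duration": "sum"}).reset_index()

print(list(flat_1.columns))
print(list(flat_2.columns))
```

reset_index returns a new DataFrame, so the result has to be assigned or chained; calling it on its own line without assignment has no lasting effect.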

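Multiple aggregations on different fields at once

# count every row per group:
aggregations = { 'bsn':'count','bedrag': 'sum'}
# or count unique values per group (this second assignment replaces the first):
aggregations = { 'bsn':'nunique','bedrag': 'sum'}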
df_zeng =dflever.groupby('code_voorziening').agg(aggregations)

df_zeng = df_zeng.rename(columns={"bsn": "aantal_klant_facturatie", "bedrag": "bedrag_gefactureerd"})
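A runnable sketch of the aggregate-then-rename pattern on made-up data. As an alternative, it also shows named aggregation, which sets the output column names in the same call; that variant needs pandas 0.25 or later and is an addition to the notes, not something they use.

```python
import pandas as pd

# Made-up delivery data.
dflever = pd.DataFrame({
    "code_voorziening": ["V1", "V1", "V1", "V2"],
    "bsn": ["111", "111", "222", "333"],
    "bedrag": [10.0, 20.0, 30.0, 40.0],
})

# Dictionary form followed by a rename (rename returns a new DataFrame).
aggregations = {"bsn": "nunique", "bedrag": "sum"}
df_zeng = dflever.groupby("code_voorziening").agg(aggregations)
df_zeng = df_zeng.rename(
    columns={"bsn": "aantal_klant_facturatie", "bedrag": "bedrag_gefactureerd"}
)

# Named aggregation: output_name=(input_column, function).
df_named = dflever.groupby("code_voorziening").agg(
    aantal_klant_facturatie=("bsn", "nunique"),
    bedrag_gefactureerd=("bedrag", "sum"),
)

print(df_zeng)
print(df_named)
```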

Counting unique values

Use nunique:
aggregations = { 'bsn':'nunique','bedrag': 'sum'}
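The difference between count and nunique in a minimal sketch, using a made-up Series with a repeated value and a missing value:

```python
import pandas as pd

# Made-up BSN values: one duplicate and one missing entry.
bsn = pd.Series(["111", "111", "222", None])

n_filled = bsn.count()      # number of non-null values
n_distinct = bsn.nunique()  # number of distinct non-null values

print(n_filled, n_distinct)
```

When counting customers on invoice lines, count gives the number of lines and nunique gives the number of different customers.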
