maandag 3 september 2018

summarize aggregate and grouping

basis functies op dataset

# How many rows the dataset
data['item'].count()

# What was the longest phone call / data entry?
data['duration'].max()

# How many seconds of phone calls are recorded in total?
data['duration'][data['item'] == 'call'].sum()

# How many entries are there for each month?
data['month'].value_counts()
Out[41]: 
2014-11    230
2015-01    205
2014-12    157
2015-02    137
2015-03    101
dtype: int64

# Number of non-null unique network entries
data['network'].nunique()
 

group by 

op 1 variabele:

data.groupby('month')['duration'].sum()
 
 

op meerdere variabelen

 data.groupby(['month', 'item'])['date'].count()
 

output format group by

As a rule of thumb, if you calculate more than one column of results, your result will be a Dataframe. For a single column of results, the agg function, by default, will produce a Series.

You can change this by selecting your operation column differently:

# produces Pandas Series
data.groupby('month')['duration'].sum() 
of
data.groupby('JAAR')['BEDRAG'].agg('sum')) 
 
 # Produces Pandas DataFrame
 data.groupby('month')[['duration']].sum() # Produces Pandas DataFrame
of 
dfjgd2.groupby('JAAR')['BEDRAG'].agg(['sum'])
 
 

Opmaak aanpassen van grouping output

 
dfjgd2.groupby('JAAR')['BEDRAG'].sum().apply(lambda x: '{:.2f}'.format(x)) 
 
of

maak een dataframe
df1=dfjgd2.groupby('JAAR')['BEDRAG'].agg(['sum'])
 
pas toe op de kolom (series) 
df1['sum'].apply(lambda x: '{:.2f}'.format(x))

 Multi-Index 1

The groupby output will have an index or multi-index on rows corresponding to your chosen grouping variables. To avoid setting this index, pass “as_index=False” to the groupby operation.

data.groupby('month', as_index=False).agg({"duration": "sum"})

https://www.shanelynn.ie/summarising-aggregation-and-grouping-data-in-python-pandas/

Multi-index manier 2

Een andere manier is via reset_index

data.groupby('month', as_index=False).agg({"duration": "sum"})
data.reset_index(drop=False) 

Meerdere groepering op verschillende velden in 1x

aggregations = {   'bsn':'count','bedrag': 'sum'}
aggregations = {   'bsn':'nunique','bedrag': 'sum'}
df_zeng =dflever.groupby('code_voorziening').agg(aggregations)

df_zeng.rename(columns={"bsn": "aantal_klant_facturatie", "bedrag": "bedrag_gefactureerd"})



Unieke aantal tellen


gebruik nunique
aggregations = {   'bsn':'nunique','bedrag': 'sum'}

Geen opmerkingen:

Een reactie posten

Datums bepalen adhv begin en einddatum in Dataframe

Voorbeeld op losse velden  ####################################################################### # import necessary packages from datetime...