Wednesday, 27 March 2019

Python handy tricks 1

value_counts

Gives, for each value, the number of occurrences in a given Series:

df_iris.species.value_counts()
versicolor    50
setosa        50
virginica     50
Name: species, dtype: int64
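A minimal, self-contained sketch (the `species` Series below is a hypothetical stand-in for `df_iris.species`); passing `normalize=True` returns relative frequencies instead of raw counts:

```python
import pandas as pd

# Hypothetical stand-in for df_iris.species
species = pd.Series(["setosa"] * 50 + ["versicolor"] * 50 + ["virginica"] * 50)

counts = species.value_counts()                    # absolute counts per value
fractions = species.value_counts(normalize=True)   # relative frequencies
```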


get_dummies

To get the genres as separate columns we can use the get_dummies method in the str namespace. This splits strings on a separator and creates dummy variables for all the possible values:
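A minimal sketch, assuming pipe-separated genre strings (the `genres` Series here is hypothetical illustration data, not the actual movies dataframe):

```python
import pandas as pd

# Hypothetical genres column: pipe-separated genre strings per movie
genres = pd.Series(["Animation|Comedy", "Comedy", "Crime|Drama"])

# str.get_dummies splits on the separator and creates one 0/1 column per value
dummies = genres.str.get_dummies(sep="|")
```

The resulting dataframe has one column per unique genre, with 1 where the row contains that genre and 0 otherwise.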
 

sort_values


You can sort dataframes (or Series) by values using the sort_values method:
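A minimal sketch on a small hypothetical dataframe; sorting is ascending by default, and `ascending=False` reverses the order:

```python
import pandas as pd

# Small hypothetical dataframe
df = pd.DataFrame({"title": ["b", "a", "c"], "year": [1999, 2005, 1987]})

by_year = df.sort_values("year", ascending=False)  # newest first
by_title = df.sort_values("title")                 # alphabetical
```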

 


String namespace

You can apply many of the string methods that are available in core Python. You can access these methods through the "string namespace" by calling .str on a Series object containing strings:

  • movies.title.str.upper().head(3)

  • crime_movies = movies[movies.genres.str.contains("crime", case=False)] # the input is actually a regular expression

  • years = movies.title.str.extract(r".+ \(([0-9]{4})\)", expand=False)

groupby


By "group by" we are referring to a process involving one or more of the following steps:

  •     Splitting the data into groups based on some criteria
  •     Applying a function to each group independently
  •     Combining the results into a data structure

In the apply step, we might wish to do one of the following:

  •     Aggregation: computing a summary statistic (or statistics) about each group. Some examples:
        Compute group sums or means
        Compute group sizes / counts
  •     Transformation: perform some group-specific computations and return a like-indexed object. Some examples:
        Standardizing data (z-score) within a group
        Filling NAs within groups with a value derived from each group
  •     Filtration: discard some groups, according to a group-wise computation that evaluates to True or False. Some examples:
        Discarding data that belongs to groups with only a few members
        Filtering out data based on the group sum or mean
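The three apply-step flavours can be sketched as follows, on a small hypothetical dataframe (aggregation via mean, transformation via transform, filtration via filter):

```python
import pandas as pd

# Hypothetical dataframe similar to the example below
df = pd.DataFrame({
    "gender": ["man", "woman", "man", "woman"],
    "height": [174.0, 171.0, 177.0, 167.0],
})

grouped = df.groupby("gender")

# Aggregation: one number per group
means = grouped["height"].mean()

# Transformation: like-indexed result (here a within-group z-score)
zscores = grouped["height"].transform(lambda s: (s - s.mean()) / s.std())

# Filtration: keep only the rows of groups passing a group-wise test
tall_groups = grouped.filter(lambda g: g["height"].mean() > 170)
```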


Example:

  gender children      height     weight
0    man      yes  173.964090  68.624437
1  woman      yes  171.565582  77.870054
2    man       no  174.348325  72.875702
3  woman      yes  178.476953  86.612571
4    man       no  177.031283  74.815527
5  woman       no  167.551663  74.784314
6  woman       no  165.706805  84.409427
7  woman      yes  183.063584  70.793055

import numpy as np

grouped = df.groupby('gender')

grouped.groups           # mapping of group labels to row indices

grouped.get_group('woman')

grouped = df.groupby(['gender', 'children'])
grouped.groups

grouped.height.mean()    # mean height per (gender, children) group

grouped.aggregate(np.mean)
# grouped.aggregate(lambda x: x.iloc[0])  # first row of each group

Wednesday, 6 March 2019

Data types and pandas

https://pbpython.com/pandas_dtypes.html

Pandas Data Types

A data type is essentially an internal construct that a programming language uses to understand how to store and manipulate data. For instance, a program needs to understand that you can add two numbers together like 5 + 10 to get 15. Or, if you have two strings such as “cat” and “hat” you could concatenate (add) them together to get “cathat.”
A possibly confusing point about pandas data types is that there is some overlap between pandas, Python and NumPy. This table summarizes the key points:

Pandas dtype mapping

Pandas dtype    Python type    NumPy type                                                        Usage
object          str            string_, unicode_                                                 Text
int64           int            int_, int8, int16, int32, int64, uint8, uint16, uint32, uint64    Integer numbers
float64         float          float_, float16, float32, float64                                 Floating point numbers
bool            bool           bool_                                                             True/False values
datetime64      NA             datetime64[ns]                                                    Date and time values
timedelta[ns]   NA             NA                                                                Differences between two datetimes
category        NA             NA                                                                Finite list of text values
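A quick check of this mapping (a minimal sketch; the dtypes come out as in the table above):

```python
import pandas as pd

s_int = pd.Series([1, 2, 3])                        # -> int64
s_float = pd.Series([1.0, 2.5])                     # -> float64
s_text = pd.Series(["cat", "hat"])                  # -> object
s_date = pd.to_datetime(pd.Series(["2019-03-06"]))  # -> datetime64[ns]
```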

For the most part, there is no need to worry about trying to explicitly force the pandas type to a corresponding NumPy type. Most of the time, using pandas' default int64 and float64 types will work. The only reason I included them in this table is that sometimes you may see the numpy types pop up online or in your own analysis.

HABIT: check the data types when reading in a DATAFRAME

When reading data into a dataframe, always check the data types first

df.dtypes
 
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5 entries, 0 to 4
Data columns (total 10 columns):
Customer Number    5 non-null float64
Customer Name      5 non-null object
2016               5 non-null object
2017               5 non-null object
Percent Growth     5 non-null object
Jan Units          5 non-null object
Month              5 non-null int64
Day                5 non-null int64
Year               5 non-null int64
Active             5 non-null object
dtypes: float64(1), int64(3), object(6)
memory usage: 480.0+ bytes

After looking at the automatically assigned data types, there are several concerns:
  • The Customer Number is a float64 but it should be an int64
  • The 2016 and 2017 columns are stored as objects, not numerical values such as a float64 or int64
  • Percent Growth and Jan Units are also stored as objects not numerical values
  • We have Month, Day and Year columns that should be converted to datetime64
  • The Active column should be a boolean
Until we clean up these data types, it is going to be very difficult to do much additional analysis on this data.

Clean up the data types

In order to convert data types in pandas, there are three basic options:
  • Use astype() to force an appropriate dtype
  • Create a custom function to convert the data
  • Use pandas functions such as to_numeric() or to_datetime()

The astype() function

The simplest way to convert a pandas column of data to a different type is to use astype().

From float64 to int64

For instance, to convert the Customer Number to an integer we can call it like this:

df['Customer Number'].astype('int')
 

From object to float64

Looks easy, but there is a catch here.

 Let’s try to do the same thing to our 2016 column and convert it to a floating point number:
df['2016'].astype('float') 
ValueError       Traceback (most recent call last)
<ipython-input-45-999869d577b0> in <module>()
----> 1 df['2016'].astype('float')

[lots more code here]

ValueError: could not convert string to float: '$15,000.00'

In a similar manner, we can try to convert the Jan Units column to an integer:
 
df['Jan Units'].astype('int')
ValueError         Traceback (most recent call last)

<ipython-input-44-31333711e4a4> in <module>()
----> 1 df['Jan Units'].astype('int')

[lots more code here]


ValueError: invalid literal for int() with base 10: 'Closed'

===>
Both of these return ValueError exceptions, which means that the conversions did not work.
In each of the cases, the data included values that could not be interpreted as numbers. In the sales columns, the data includes a currency symbol as well as a comma in each value. In the Jan Units column, the last value is "Closed", which is not a number; so we get the exception.
So far it’s not looking so good for astype() as a tool. We should give it one more try on the Active column.
df['Active'].astype('bool')
0    True
1    True
2    True
3    True
4    True
Name: Active, dtype: bool
At first glance, this looks ok but upon closer inspection, there is a big problem. All values were interpreted as True but the last customer has an Active flag of N so this does not seem right.
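The reason is Python's truthiness rules: astype('bool') treats any non-empty string as truthy, so even "N" becomes True. A minimal demonstration:

```python
import pandas as pd

flags = pd.Series(["Y", "N"])

# Any non-empty string is truthy, so both values come out as True
as_bool = flags.astype("bool")
```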

Limitations of astype()

The takeaway from this section is that astype() will only work if:
  • the data is clean and can be simply interpreted as a number
  • you want to convert a numeric value to a string object
If the data has non-numeric characters or is not homogeneous, then astype() will not be a good choice for type conversion. You will need to do additional transforms for the type change to work correctly.
 

Custom function

Since this data is a little more complex to convert, we can build a custom function that we apply to each value and convert to the appropriate data type.

Example: currency conversion

For currency conversion (of this specific data set), here is a simple function we can use:

def convert_currency(val):
    """
    Convert the string number value to a float
     - Remove $
     - Remove commas
     - Convert to float type
    """
    new_val = val.replace(',','').replace('$', '')
    return float(new_val) 
 
The code uses python's string functions to strip out the '$' and ',' and then convert the value to a floating point number. In this specific case, we could convert the values to integers as well, but I'm choosing to use floating point in this case.

note1

I also suspect that someone will recommend that we use a Decimal type for currency. This is not a native data type in pandas so I am purposely sticking with the float approach.

note 2

Also of note, is that the function converts the number to a python float but pandas internally converts it to a float64. As mentioned earlier, I recommend that you allow pandas to convert to specific size float or int as it determines appropriate. There is no need for you to try to downcast to a smaller or upcast to a larger byte size unless you really know why you need to do it.


Now, we can use the pandas apply function to apply this to all the values in the 2016 column.
df['2016'].apply(convert_currency)
0    125000.0
1    920000.0
2     50000.0
3    350000.0
4     15000.0
Name: 2016, dtype: float64
All the values are showing as float64, so we can do all the math functions we need to.

This could also be done with a lambda function:
df['2016'].apply(lambda x: x.replace('$', '').replace(',', '')).astype('float')
 
 

Example 2: fixing the Percent Growth column

Using a lambda:
df['Percent Growth'].apply(lambda x: x.replace('%', '')).astype('float') / 100 
 
Doing the same thing with a custom function:
def convert_percent(val):
    """
    Convert the percentage string to an actual floating point percent
    - Remove %
    - Divide by 100 to make decimal
    """
    new_val = val.replace('%', '')
    return float(new_val) / 100

df['Percent Growth'].apply(convert_percent) 
 
Both produce the same value:

0    0.30
1    0.10
2    0.25
3    0.04
4   -0.15
Name: Percent Growth, dtype: float64

 

Example 3: converting the Active column to a boolean

We use np.where() to convert the Active column to a boolean. There are several possible ways to solve this specific problem; the np.where() approach is useful for many types of problems, so I'm choosing to include it here.

The basic idea is to use the np.where() function to convert all "Y" values to True and everything else to False:

df["Active"] = np.where(df["Active"] == "Y", True, False)
 
Whether you choose to use a lambda function, create a more standard python function or use another approach like np.where() , these approaches are very flexible and can be customized for your own unique data needs.


Pandas helper functions

Pandas has a middle ground between the blunt astype() function and the more complex custom functions. These helper functions can be very useful for certain data type conversions.

If you have been following along, you’ll notice that I have not done anything with the date columns or the Jan Units column. Both of these can be converted simply using built in pandas functions such as pd.to_numeric() and pd.to_datetime() .

The reason the Jan Units conversion is problematic is the inclusion of a non-numeric value in the column. If we tried to use astype() we would get an error (as described earlier). The pd.to_numeric() function can handle these values more gracefully:
 
pd.to_numeric(df['Jan Units'], errors='coerce')
0    500.0
1    700.0
2    125.0
3     75.0
4      NaN
Name: Jan Units, dtype: float64

There are a couple of items of note. First, the function easily processes the data and creates a float64 column. Additionally, it replaces the invalid "Closed" value with a NaN value because we passed errors='coerce'. We can leave that value there or fill it in with a 0 using fillna(0):
pd.to_numeric(df['Jan Units'], errors='coerce').fillna(0)
0    500.0
1    700.0
2    125.0
3     75.0
4      0.0
Name: Jan Units, dtype: float64
 
The final conversion I will cover is converting the separate month, day and year columns into a datetime . The pandas pd.to_datetime() function is quite configurable but also pretty smart by default.
pd.to_datetime(df[['Month', 'Day', 'Year']])
0   2015-01-10
1   2014-06-15
2   2016-03-29
3   2015-10-27
4   2014-02-02
dtype: datetime64[ns]
 
In this case, the function combines the columns into a new series of the appropriate datetime64 dtype.

We need to make sure to assign these values back to the dataframe:
df["Start_Date"] = pd.to_datetime(df[['Month', 'Day', 'Year']])
df["Jan Units"] = pd.to_numeric(df['Jan Units'], errors='coerce').fillna(0)

  Customer Number     Customer Name      2016       2017  Percent Growth  Jan Units  Month  Day  Year  Active Start_Date
0           10002  Quest Industries  125000.0   162500.0            0.30      500.0      1   10  2015    True 2015-01-10
1          552278    Smith Plumbing  920000.0  1012000.0            0.10      700.0      6   15  2014    True 2014-06-15
2           23477   ACME Industrial   50000.0    62500.0            0.25      125.0      3   29  2016    True 2016-03-29
3           24900        Brekke LTD  350000.0   490000.0            0.04       75.0     10   27  2015    True 2015-10-27
4          651029         Harbor Co   15000.0    12750.0           -0.15        NaN      2    2  2014   False 2014-02-02
Now the data is properly converted to all the types we need:

df.dtypes

Customer Number             int64
Customer Name              object
2016                      float64
2017                      float64
Percent Growth            float64
Jan Units                 float64
Month                       int64
Day                         int64
Year                        int64
Active                       bool
Start_Date         datetime64[ns]
dtype: object

The dataframe is ready for analysis!

Bringing it all together

The basic concepts of using astype() and custom functions can be included very early in the data intake process. If you have a data file that you intend to process repeatedly and it always comes in the same format, you can define the dtype and converters to be applied when reading the data.

dtype

 It is helpful to think of dtype as performing astype() on the data.

converters

The converters argument allows you to apply functions to the various input columns, similar to the approaches outlined above.

It is important to note that you can only apply a dtype or a converter function to a specified column once using this approach. If you try to apply both to the same column, then the dtype will be skipped.
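This precedence can be demonstrated on a tiny in-memory CSV (hypothetical two-row data; pandas emits a warning and applies only the converter, so the requested dtype is skipped):

```python
import io
import warnings

import pandas as pd

csv_data = "val\n1\n2\n"

# Both a dtype and a converter for the same column: pandas warns and
# applies only the converter, so the int64 dtype request is skipped
with warnings.catch_warnings():
    warnings.simplefilter("ignore")
    df_demo = pd.read_csv(io.StringIO(csv_data),
                          dtype={"val": "int64"},
                          converters={"val": float})
```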


Here is a streamlined example that does almost all of the conversion at the time the data is read into the dataframe:
df_2 = pd.read_csv("sales_data_types.csv",
                   dtype={'Customer Number': 'int'},
                   converters={'2016': convert_currency,
                               '2017': convert_currency,
                               'Percent Growth': convert_percent,
                               'Jan Units': lambda x: pd.to_numeric(x, errors='coerce'),
                               'Active': lambda x: np.where(x == "Y", True, False)
                              })

df_2.dtypes
Customer Number      int64
Customer Name       object
2016               float64
2017               float64
Percent Growth     float64
Jan Units          float64
Month                int64
Day                  int64
Year                 int64
Active              object
dtype: object

As mentioned earlier, I chose to include a lambda example as well as the function example for converting data. The only conversion that cannot be applied here is turning the Month, Day and Year columns into the corresponding datetime column. Still, this is a powerful convention that can help improve your data processing pipeline.

Determining dates from a start and end date in a DataFrame

Example on separate fields  ####################################################################### # import necessary packages from datetime...