https://pbpython.com/pandas_dtypes.html
Pandas Data Types
A data type is essentially an internal construct that a programming language
uses to understand how to store and manipulate data. For instance, a program
needs to understand that you can add two numbers together like 5 + 10 to get 15.
Or, if you have two strings such as “cat” and “hat” you could concatenate (add) them
together to get “cathat.”
A possible confusing point about pandas data types is that there is some overlap
between pandas, python and numpy. This table summarizes the key points:
Pandas
dtype
mapping
Pandas dtype |
Python type |
NumPy type |
Usage |
object |
str |
string_, unicode_ |
Text |
int64 |
int |
int_, int8, int16, int32, int64, uint8, uint16, uint32, uint64 |
Integer numbers |
float64 |
float |
float_, float16, float32, float64 |
Floating point numbers |
bool |
bool |
bool_ |
True/False values |
datetime64 |
NA |
datetime64[ns] |
Date and time values |
timedelta[ns] |
NA |
NA |
Differences between two datetimes |
category |
NA |
NA |
Finite list of text values |
For the most part, there is no need to worry about determining if you should try
to explicitly force the pandas type to a corresponding to NumPy type.
Most of the time, using pandas default
int64
and
float64
types will work. The only reason
I included in this table is that sometimes you may see the numpy types pop up on-line
or in your own analysis.
GEWOONTE: Check datatypes bij inlezen DATAFRAME
Check bij het inlezen van data in een dataframe altijd eerst de datatypes
df.dtypes
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5 entries, 0 to 4
Data columns (total 10 columns):
Customer Number 5 non-null float64
Customer Name 5 non-null object
2016 5 non-null object
2017 5 non-null object
Percent Growth 5 non-null object
Jan Units 5 non-null object
Month 5 non-null int64
Day 5 non-null int64
Year 5 non-null int64
Active 5 non-null object
dtypes: float64(1), int64(3), object(6)
memory usage: 480.0+ bytes
After looking at the automatically assigned data types, there are several concerns:
- The
Customer Number
is a
float64
but it should be an
int64
- The
2016
and
2017
columns are stored as objects, not numerical values such as a
float64
or
int64
Percent Growth
and
Jan Units
are also stored as objects not numerical values
- We have
Month
,
Day
and
Year
columns that should be converted to
datetime64
- The
Active
column should be a boolean
Until we clean up these data types, it is going to be very difficult to do much
additional analysis on this data.
Schoon de datatypes op
In order to convert data types in pandas, there are three basic options:
- Use
astype()
to force an appropriate
dtype
- Create a custom function to convert the data
- Use pandas functions such as
to_numeric()
or
to_datetime()
ASTYPE() functie
The simplest way to convert a pandas column of data to a different type is to
use
astype()
.
van float64 naar int64
The simplest way to convert a pandas column of data to a different type is to
use
astype()
. For instance, to convert the
Customer Number
to an integer
we can call it like this:
df['Customer Number'].astype('int')
van object naar float64
Lijkt makkelijk maar er zit hier een addertje onder het gras
Let’s try to do the same thing to
our
2016
column and convert it to a floating point number:
df['2016'].astype('float')
ValueError Traceback (most recent call last)
<ipython-input-45-999869d577b0> in <module>()
----> 1 df['2016'].astype('float')
[lots more code here]
ValueError: could not convert string to float: '$15,000.00'
In a similar manner, we can try to conver the
Jan Units
column to an integer:
df['Jan Units'].astype('int')
ValueError Traceback (most recent call last)
<ipython-input-44-31333711e4a4> in <module>()
----> 1 df['Jan Units'].astype('int')
[lots more code here]
ValueError: invalid literal for int() with base 10: 'Closed'
===>
Both of these return
ValueError
exceptions which mean that the conversions
did not work.
In each of the cases, the data included values that could not be interpreted as
numbers. In the sales columns, the data includes a currency symbol as well as a comma in each value.
In the
Jan Units
columnm the last value is “Closed” which is not a number; so we get the exception.
So far it’s not looking so good for
astype()
as a tool. We should give it
one more try on the
Active
column.
df['Active'].astype('bool')
0 True
1 True
2 True
3 True
4 True
Name: Active, dtype: bool
At first glance, this looks ok but upon closer inspection, there is a big problem.
All values were interpreted as
True
but the last customer has an Active flag
of
N
so this does not seem right.
Beperking ASTYPE
The takeaway from this section is that
astype()
will only work if:
- the data is clean and can be simply interpreted as a number
- you want to convert a numeric value to a string object
If the data has non-numeric characters or is not homogeneous, then
astype()
will not be a good choice for type conversion. You will need to do additional transforms
for the type change to work correctly.
Custom Function
Since this data is a little more complex to convert, we can build a custom
function that we apply to each value and convert to the appropriate data type.
vb: currency conversion
For currency conversion (of this specific data set), here is a simple function we can use:
def convert_currency(val):
"""
Convert the string number value to a float
- Remove $
- Remove commas
- Convert to float type
"""
new_val = val.replace(',','').replace('$', '')
return float(new_val)
The code uses python’s string functions to strip out the ‘$” and ‘,’ and then
convert the value to a floating point number. In this specific case, we could convert
the values to integers as well but I’m choosing to use floating point in this case.
note1
I also suspect that someone will recommend that we use a
Decimal
type for currency.
This is not a native data type in pandas so I am purposely sticking with the float approach.
note 2
Also of note, is that the function converts the number to a python
float
but pandas internally converts it to a
float64.
As mentioned earlier,
I recommend that you allow pandas to convert to specific size
float
or
int
as
it determines appropriate. There is no need for you to try to downcast to a smaller
or upcast to a larger byte size unless you really know why you need to do it.
Now, we can use the pandas
apply
function to apply this to all the values
in the 2016 column.
df['2016'].apply(convert_currency)
0 125000.0
1 920000.0
2 50000.0
3 350000.0
4 15000.0
Name: 2016, dtype: float64
ll the values are showing as
float64
so we can do all the math
functions we need to.
Zou ook met een lambda functie kunnen
df['2016'].apply(lambda x: x.replace('$', '').replace(',', '')).astype('float')
vb2 : fixing the Percent Growth
column.
Using the
lambda
:
df['Percent Growth'].apply(lambda x: x.replace('%', '')).astype('float') / 100
Doing the same thing with a custom function:
def convert_percent(val):
"""
Convert the percentage string to an actual floating point percent
- Remove %
- Divide by 100 to make decimal
"""
new_val = val.replace('%', '')
return float(new_val) / 100
df['Percent Growth'].apply(convert_percent)
Both produce the same value:
0 0.30
1 0.10
2 0.25
3 0.04
4 -0.15
Name: Percent Growth, dtype: float64
vb3: to convert the active column to a boolean
using np.where()
to convert the active column to a boolean. There are several possible ways to solve this specific problem.
The np.where()
approach is useful for many types of problems so I’m choosing to include
it here.
The basic idea is to use the np.where()
function to convert all “Y” values to True
and everything else assigned
False
df["Active"] = np.where(df["Active"] == "Y", True, False)
Whether you choose to use a
lambda
function, create a more standard python
function or use another approach like
np.where()
, these approaches
are very flexible and can be customized for your own unique data needs.
Pandas helper functions
Pandas has a middle ground between the blunt
astype()
function and the
more complex custom functions. These helper functions can be very useful for
certain data type conversions.
If you have been following along, you’ll notice that I have not done anything with
the date columns or the
Jan Units
column. Both of these can be converted
simply using built in pandas functions such as
pd.to_numeric()
and
pd.to_datetime()
.
The reason the
Jan Units
conversion is problematic is the inclusion of
a non-numeric value in the column. If we tried to use
astype()
we would
get an error (as described earlier). The
pd.to_numeric()
function can
handle these values more gracefully:
pd.to_numeric(df['Jan Units'], errors='coerce')
0 500.0
1 700.0
2 125.0
3 75.0
4 NaN
Name: Jan Units, dtype: float64
There are a couple of items of note. First, the function easily processes the data
and creates a
float64
column. Additionally, it replaces the invalid “Closed”
value with a
NaN
value because we passed
errors=coerce
. We can
leave that value there or fill it in with a 0 using
fillna(0)
:
pd.to_numeric(df['Jan Units'], errors='coerce').fillna(0)
0 500.0
1 700.0
2 125.0
3 75.0
4 0.0
Name: Jan Units, dtype: float64
The final conversion I will cover is converting the separate month, day and year columns
into a
datetime
. The pandas
pd.to_datetime()
function is quite
configurable but also pretty smart by default.
pd.to_datetime(df[['Month', 'Day', 'Year']])
0 2015-01-10
1 2014-06-15
2 2016-03-29
3 2015-10-27
4 2014-02-02
dtype: datetime64[ns]
In this case, the function combines the columns into a new series of the appropriate
datateime64
dtype.
We need to make sure to assign these values back to the dataframe:
df["Start_Date"] = pd.to_datetime(df[['Month', 'Day', 'Year']])
df["Jan Units"] = pd.to_numeric(df['Jan Units'], errors='coerce').fillna(0)
|
Customer Number |
Customer Name |
2016 |
2017 |
Percent Growth |
Jan Units |
Month |
Day |
Year |
Active |
Start_Date |
0 |
10002 |
Quest Industries |
125000.0 |
162500.0 |
0.30 |
500.0 |
1 |
10 |
2015 |
True |
2015-01-10 |
1 |
552278 |
Smith Plumbing |
920000.0 |
1012000.0 |
0.10 |
700.0 |
6 |
15 |
2014 |
True |
2014-06-15 |
2 |
23477 |
ACME Industrial |
50000.0 |
62500.0 |
0.25 |
125.0 |
3 |
29 |
2016 |
True |
2016-03-29 |
3 |
24900 |
Brekke LTD |
350000.0 |
490000.0 |
0.04 |
75.0 |
10 |
27 |
2015 |
True |
2015-10-27 |
4 |
651029 |
Harbor Co |
15000.0 |
12750.0 |
-0.15 |
NaN |
2 |
2 |
2014 |
False |
2014-02-02 |
Now the data is properly converted to all the types we need:
Customer Number int64
Customer Name object
2016 float64
2017 float64
Percent Growth float64
Jan Units float64
Month int64
Day int64
Year int64
Active bool
Start_Date datetime64[ns]
The dataframe is ready for analysis!
Bringing it all together
The basic concepts of using
astype()
and custom functions can be included
very early in the data intake process. If you have a data file that you intend
to process repeatedly and it always comes in the same format, you can define the
dtype
and
converters
to be applied when reading the data.
dtype
It is helpful to
think of
dtype
as performing
astype()
on the data.
converters
The
converters
arguments allow you to apply functions to the various input columns similar to the approaches
outlined above.
It is important to note that you can only apply a
dtype
or a
converter
function to a specified column once using this approach. If you try to apply both
to the same column, then the dtype will be skipped.
Here is a streamlined example that does almost all of the conversion at the time
the data is read into the dataframe:
df_2 = pd.read_csv("sales_data_types.csv",
dtype={'Customer Number': 'int'},
converters={'2016': convert_currency,
'2017': convert_currency,
'Percent Growth': convert_percent,
'Jan Units': lambda x: pd.to_numeric(x, errors='coerce'),
'Active': lambda x: np.where(x == "Y", True, False)
})
df_2.dtypes
Customer Number int64
Customer Name object
2016 float64
2017 float64
Percent Growth float64
Jan Units float64
Month int64
Day int64
Year int64
Active object
dtype: object
As mentioned earlier, I chose to include a
lambda
example as well as the function
example for converting data. The only function that can not be applied here is
the conversion of the
Month
,
Day
and
Year
columns to the
corresponding
datetime
column. Still, this is a powerful convention that
can help improve your data processing pipeline.