The fundamental Pandas data structures for working with time series data:
- For time stamps, Pandas provides the `Timestamp` type. As mentioned before, it is essentially a replacement for Python's native `datetime`, but is based on the more efficient `numpy.datetime64` data type. The associated index structure is `DatetimeIndex`.
- For time periods, Pandas provides the `Period` type. This encodes a fixed-frequency interval based on `numpy.datetime64`. The associated index structure is `PeriodIndex`.
- For time deltas or durations, Pandas provides the `Timedelta` type. `Timedelta` is a more efficient replacement for Python's native `datetime.timedelta` type, and is based on `numpy.timedelta64`. The associated index structure is `TimedeltaIndex`.
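
The relationship between the three scalar types and their index structures can be sketched as follows. This is a minimal illustration; the dates and durations are arbitrary values chosen for the example, not taken from the text.

```python
import pandas as pd

# Scalar types: a point in time, a fixed-frequency span, and a duration
stamp = pd.Timestamp("2015-07-04")
period = pd.Period("2015-07", freq="M")   # the whole month of July 2015
delta = pd.Timedelta(days=1, hours=12)

# Each scalar type has an associated index structure
stamp_index = pd.DatetimeIndex([stamp, stamp + delta])
period_index = pd.PeriodIndex([period, period + 1])
delta_index = pd.TimedeltaIndex([delta, 2 * delta])

print(type(stamp_index).__name__)
print(type(period_index).__name__)
print(type(delta_index).__name__)
```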
The most fundamental of these date/time objects are the `Timestamp` and `DatetimeIndex` objects.
While these class objects can be invoked directly, it is more common to use the `pd.to_datetime()` function, which can parse a wide variety of formats. Passing a single date to `pd.to_datetime()` yields a `Timestamp`; passing a series of dates by default yields a `DatetimeIndex`:

    from datetime import datetime
    import pandas as pd

    dates = pd.to_datetime([datetime(2015, 7, 3), '4th of July, 2015',
                            '2015-Jul-6', '07-07-2015', '20150708'])
    dates
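
To make the difference between the two return types concrete, here is a small sketch. It uses uniformly formatted ISO date strings (an assumption made for this example) so that parsing does not depend on format inference.

```python
import pandas as pd

# A single date string yields a Timestamp
single = pd.to_datetime("2015-07-04")

# A list of date strings yields a DatetimeIndex
several = pd.to_datetime(["2015-07-03", "2015-07-04", "2015-07-06"])

print(type(single).__name__)
print(type(several).__name__)
```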
Any `DatetimeIndex` can be converted to a `PeriodIndex` with the `to_period()` function with the addition of a frequency code; here we'll use `'D'` to indicate daily frequency:

    dates.to_period('D')
A `TimedeltaIndex` is created, for example, when one date is subtracted from another:

    dates - dates[0]
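
Both conversions can be checked in a self-contained sketch. The three dates below are illustrative stand-ins, not the `dates` object defined earlier.

```python
import pandas as pd

sample = pd.to_datetime(["2015-07-03", "2015-07-04", "2015-07-06"])

# DatetimeIndex -> PeriodIndex at daily frequency
as_periods = sample.to_period("D")

# Subtracting a Timestamp from a DatetimeIndex -> TimedeltaIndex
elapsed = sample - sample[0]

print(as_periods)
print(elapsed)
```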
Regular sequences: pd.date_range()
To make the creation of regular date sequences more convenient, Pandas offers a few functions for this purpose: `pd.date_range()` for timestamps, `pd.period_range()` for periods, and `pd.timedelta_range()` for time deltas. We've seen that Python's `range()` and NumPy's `np.arange()` turn a startpoint, endpoint, and optional stepsize into a sequence. Similarly, `pd.date_range()` accepts a start date, an end date, and an optional frequency code to create a regular sequence of dates. By default, the frequency is one day:

    pd.date_range('2015-07-03', '2015-07-10')
Alternatively, the date range can be specified not with a start and endpoint, but with a startpoint and a number of periods:

    pd.date_range('2015-07-03', periods=8)

The spacing can be modified by altering the `freq` argument, which defaults to `D`. For example, here we will construct a range of hourly timestamps:

    pd.date_range('2015-07-03', periods=8, freq='H')
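
The following sketch checks that the start/end form and the start/periods form describe the same daily range. For the hourly range it passes an offset object, `pd.offsets.Hour()`, instead of a string code; that choice is an assumption made here for portability, because pandas 2.2 and later deprecate the `H` spelling in favor of `h`.

```python
import pandas as pd

# Two equivalent ways of specifying the same eight daily timestamps
by_endpoint = pd.date_range("2015-07-03", "2015-07-10")
by_count = pd.date_range("2015-07-03", periods=8)
print(by_endpoint.equals(by_count))

# Hourly spacing, expressed with an offset object rather than a string code
hourly = pd.date_range("2015-07-03", periods=8, freq=pd.offsets.Hour())
print(hourly)
```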
To create regular sequences of `Period` or `Timedelta` values, the very similar `pd.period_range()` and `pd.timedelta_range()` functions are useful. Here are some monthly periods:

    pd.period_range('2015-07', periods=8, freq='M')

And a sequence of durations increasing by an hour:

    pd.timedelta_range(0, periods=10, freq='H')

All of these require an understanding of Pandas frequency codes, which are summarized in the next section.
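
As a rough check of what these two functions return, the sketch below builds the same monthly periods and a sequence of hourly durations. The hourly step is given as a `pd.Timedelta` rather than a string code, which is an assumption made here to sidestep differences in alias spelling between pandas versions.

```python
import pandas as pd

# Eight monthly periods starting in July 2015
months = pd.period_range("2015-07", periods=8, freq="M")

# Ten durations, each one hour longer than the last
durations = pd.timedelta_range(start="0 days", periods=10,
                               freq=pd.Timedelta(hours=1))

print(months[0], months[-1])
print(durations[0], durations[-1])
```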
Frequencies and Offsets
Fundamental to these Pandas time series tools is the concept of a frequency or date offset. Just as we saw the `D` (day) and `H` (hour) codes above, we can use such codes to specify any desired frequency spacing. The following table summarizes the main codes available:

Code | Description | Code | Description |
---|---|---|---|
`D` | Calendar day | `B` | Business day |
`W` | Weekly | | |
`M` | Month end | `BM` | Business month end |
`Q` | Quarter end | `BQ` | Business quarter end |
`A` | Year end | `BA` | Business year end |
`H` | Hours | `BH` | Business hours |
`T` | Minutes | | |
`S` | Seconds | | |
`L` | Milliseconds | | |
`U` | Microseconds | | |
`N` | Nanoseconds | | |

The monthly, quarterly, and annual frequencies are all marked at the end of the specified period. By adding an `S` suffix to any of these, they instead will be marked at the beginning:

Code | Description | Code | Description |
---|---|---|---|
`MS` | Month start | `BMS` | Business month start |
`QS` | Quarter start | `BQS` | Business quarter start |
`AS` | Year start | `BAS` | Business year start |

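
The difference between end-marked and start-marked frequencies can be seen in a short sketch. It uses the `MonthEnd` and `MonthBegin` offset classes, which correspond to the month-end and month-start codes; using the classes instead of the string codes is an assumption made here because pandas 2.2 and later rename some of the end-of-period aliases (for example `M` becomes `ME`).

```python
import pandas as pd
from pandas.tseries.offsets import MonthBegin, MonthEnd

# Same start date, marked at the end versus the beginning of each month
month_ends = pd.date_range("2015-01-01", periods=3, freq=MonthEnd())
month_starts = pd.date_range("2015-01-01", periods=3, freq=MonthBegin())

print(month_ends)
print(month_starts)
```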
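Additionally, you can change the month used to mark any quarterly or annual code by adding a three-letter month code as a suffix:
- `Q-JAN`, `BQ-FEB`, `QS-MAR`, `BQS-APR`, etc.
- `A-JAN`, `BA-FEB`, `AS-MAR`, `BAS-APR`, etc.

In the same way, the split point of the weekly frequency can be changed by adding a three-letter weekday code:
- `W-SUN`, `W-MON`, `W-TUE`, `W-WED`, etc.

Codes can also be combined with numbers to specify other frequencies. For example, a frequency of 2 hours 30 minutes combines the hour (`H`) and minute (`T`) codes as follows:

    pd.timedelta_range(0, periods=9, freq="2H30T")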
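
Anchored weekly codes and numeric multiples can be tried out as in the sketch below. The start date is an arbitrary choice for illustration, and the 2 hour 30 minute step is written as `"150min"`, an equivalent spelling assumed here because the `H` and `T` aliases are deprecated in pandas 2.2 and later.

```python
import pandas as pd

# 2015-07-01 is a Wednesday; each range starts at the next anchor weekday
sundays = pd.date_range("2015-07-01", periods=3, freq="W-SUN")
mondays = pd.date_range("2015-07-01", periods=3, freq="W-MON")

# Nine durations spaced 2 hours 30 minutes apart
steps = pd.timedelta_range(start="0 days", periods=9, freq="150min")

print(sundays)
print(mondays)
print(steps)
```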
All of these short codes refer to specific instances of Pandas time series offsets, which can be found in the `pd.tseries.offsets` module. For example, we can create a business day offset directly as follows:

    from pandas.tseries.offsets import BDay
    pd.date_range('2015-07-01', periods=5, freq=BDay())
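
As a small sketch of how an offset object relates to its short code, the example below compares a range built with `BDay()` against one built with the `B` code, and also adds the offset to a single timestamp. The arithmetic line is an addition for illustration and does not appear in the text above.

```python
import pandas as pd
from pandas.tseries.offsets import BDay

# The offset object and its short code describe the same frequency
business_days = pd.date_range("2015-07-01", periods=5, freq=BDay())
same_by_code = pd.date_range("2015-07-01", periods=5, freq="B")
print(business_days.equals(same_by_code))

# Offsets also take part in date arithmetic: Friday plus one business day
next_business_day = pd.Timestamp("2015-07-03") + BDay()
print(next_business_day)
```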
For more discussion of the use of frequencies and offsets, see the "DateOffset" section of the Pandas documentation.
Resampling, Shifting, and Windowing
The ability to use dates and times as indices to intuitively organize and access data is an important piece of the Pandas time series tools. The benefits of indexed data in general (automatic alignment during operations, intuitive data slicing and access, etc.) still apply, and Pandas provides several additional time series-specific operations.
resample and asfreq
One common need for time series data is resampling at a higher or lower frequency. This can be done using the `resample()` method, or the much simpler `asfreq()` method. The primary difference between the two is that `resample()` is fundamentally a data aggregation, while `asfreq()` is fundamentally a data selection.

For up-sampling, `resample()` and `asfreq()` are largely equivalent, though `resample()` has many more options available. In this case, the default for both methods is to leave the up-sampled points empty, that is, filled with NA values. Just as with the `fillna()` method discussed previously, `asfreq()` accepts a `method` argument to specify how values are imputed. Here, we will resample the business day data at a daily frequency (i.e., including weekends):
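
The behavior described above can be sketched with a small made-up series. The five values and their business-day index are hypothetical stand-ins for the business day data, chosen only so that the results are easy to verify by hand.

```python
import pandas as pd

# Hypothetical values on five business days (Wed 1 July to Tue 7 July 2015)
index = pd.date_range("2015-07-01", periods=5, freq="B")
data = pd.Series([10.0, 11.0, 12.0, 13.0, 14.0], index=index)

# Up-sampling to daily frequency: the weekend rows are NaN by default
daily = data.asfreq("D")
daily_via_resample = data.resample("D").asfreq()

# The method argument controls how the new rows are filled
daily_ffill = data.asfreq("D", method="ffill")   # carry Friday's value forward
daily_bfill = data.asfreq("D", method="bfill")   # pull Monday's value back

# Down-sampling to weeks ending on Friday:
# resample aggregates each week, asfreq selects the value at each Friday
weekly_mean = data.resample("W-FRI").mean()
weekly_pick = data.asfreq("W-FRI")

print(daily)
print(daily_ffill)
print(weekly_mean)
print(weekly_pick)
```

The down-sampling lines show the aggregation versus selection distinction: the weekly mean averages all values in each week, while `asfreq()` simply reports the value found on the Friday.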