Time Series Stationarity
Testing for Stationarity in Time Series Data
Components of Time Series data
- Trend
- Seasonality
- Irregularity
- Cyclicality
When not to use Time Series Analysis
- Values are constant - it's pointless
- Values are in the form of functions - just use the function
Stationarity
- Constant mean
- Constant variance
- Autocovariance that does not depend on time
Because these properties do not change over time, a stationary series is likely to follow the same pattern in the future
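Written out for a series $X_t$, the three conditions above are usually stated as follows. This is the standard definition of weak stationarity; the symbols $\mu$, $\sigma^2$ and $\gamma$ are notation introduced here for illustration:

```latex
\begin{aligned}
\mathbb{E}[X_t] &= \mu && \text{for all } t \\
\operatorname{Var}(X_t) &= \sigma^2 && \text{for all } t \\
\operatorname{Cov}(X_t, X_{t+h}) &= \gamma(h) && \text{for all } t \text{ and each lag } h
\end{aligned}
```

The last line says that the covariance between two observations depends only on how far apart they are (the lag $h$), not on where in the series they sit.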
Stationarity Tests
- Rolling Statistics - moving average, moving variance, visualization
- ADF (Augmented Dickey-Fuller) Test
ARIMA
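ARIMA (AutoRegressive Integrated Moving Average) is a common model for time series analysis and forecasting
The ARIMA model has the following parameters:
- p - Auto Regressive (AR): the number of lagged observations used
- d - Integration (I): the number of times the series is differenced
- q - Moving Average (MA): the number of lagged forecast errors used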
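One common way to write an ARIMA model with these three parameters (conventionally lowercase p, d, q) uses the backshift operator $B$, where $B X_t = X_{t-1}$. The symbols $\phi_i$, $\theta_j$, $c$ and $\varepsilon_t$ are notation assumed here, and the sign convention for the MA terms varies between textbooks and libraries:

```latex
\left(1 - \sum_{i=1}^{p} \phi_i B^i\right) (1 - B)^d X_t
  = c + \left(1 + \sum_{j=1}^{q} \theta_j B^j\right) \varepsilon_t
```

The factor $(1 - B)^d$ is the differencing step: with $d = 1$ it replaces $X_t$ by $X_t - X_{t-1}$, which is one way of removing a trend before the AR and MA parts are fitted.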
Applying the Above
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory
import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))
# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All"
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session
import seaborn as sns
df = pd.read_csv('/kaggle/input/air-passengers/AirPassengers.csv')
df.head()
| | Month | #Passengers |
|---|---|---|
| 0 | 1949-01 | 112 |
| 1 | 1949-02 | 118 |
| 2 | 1949-03 | 132 |
| 3 | 1949-04 | 129 |
| 4 | 1949-05 | 121 |
# Parse 'YYYY-MM' strings with an explicit format (infer_datetime_format is deprecated in pandas 2.x)
df['Month'] = pd.to_datetime(df['Month'], format='%Y-%m')
df = df.set_index(['Month'])
df.head()
| Month | #Passengers |
|---|---|
| 1949-01-01 | 112 |
| 1949-02-01 | 118 |
| 1949-03-01 | 132 |
| 1949-04-01 | 129 |
| 1949-05-01 | 121 |
sns.lineplot(data=df)
[Figure: line plot of #Passengers by Month]
In the above we can see that there is an upward trend as well as some seasonality
Next, we can check some summary statistics using a rolling mean approach
Rolling Averages
Note that the rolling functions use a window of 12 because the data has a yearly seasonality, which is 12 observations for monthly data
rolling_mean = df.rolling(window=12).mean()
rolling_std = df.rolling(window=12).std()
df_summary = df.assign(Mean=rolling_mean)
df_summary = df_summary.assign(Std=rolling_std)
sns.lineplot(data=df_summary)
[Figure: line plot of #Passengers with its rolling Mean and Std, by Month]
Since the mean and standard deviation are not constant we can conclude that the data is not stationary
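To see what a rolling window does on a small scale, here is a minimal sketch on a made-up series (`toy_rolling` is a hypothetical name, and a window of 3 is used only to keep the output short):

```python
import pandas as pd

# Hypothetical toy series, not the AirPassengers data
toy_rolling = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])

# Each value is the mean of the current observation and the two before it
toy_rolling_mean = toy_rolling.rolling(window=3).mean()
print(toy_rolling_mean.tolist())  # [nan, nan, 2.0, 3.0, 4.0, 5.0]
```

The first `window - 1` entries are NaN because a full window is not yet available. For the same reason, the rolling mean computed with `window=12` has no value for the first 11 months.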
ADF Test
The null hypothesis for the test is that the series is non-stationary (it has a unit root). We reject it, and treat the series as stationary, if the resulting p-value is < 0.05 (or some other chosen threshold)
from statsmodels.tsa.stattools import adfuller
def print_adf(adf):
    print('ADF test statistic', adf[0])
    print('p-value', adf[1])
    print('Lags used', adf[2])
    print('Observations used', adf[3])
    print('Critical values', adf[4])
adf = adfuller(df['#Passengers'])
print_adf(adf)
In the result of the ADF test we can see that the p-value is much higher than 0.05, so we cannot reject the null hypothesis and we treat the data as non-stationary
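The decision rule can be sketched as a small helper. `rejects_non_stationarity` is a hypothetical name, and the two tuples below are made-up values laid out in the same order as the first five elements that `adfuller` returns. They are not results from the passenger data:

```python
def rejects_non_stationarity(adf_result, alpha=0.05):
    """Return True if the ADF null hypothesis (non-stationary) is rejected."""
    p_value = adf_result[1]
    return p_value < alpha

# Made-up tuples: (statistic, p-value, lags used, observations used, critical values)
made_up_critical = {'1%': -3.5, '5%': -2.9, '10%': -2.6}
example_high_p = (-0.5, 0.90, 12, 131, made_up_critical)
example_low_p = (-4.2, 0.001, 12, 131, made_up_critical)

print(rejects_non_stationarity(example_high_p))  # False: cannot reject, treat as non-stationary
print(rejects_non_stationarity(example_low_p))   # True: reject, treat as stationary
```

A p-value below the threshold means the null hypothesis of non-stationarity is rejected. A p-value above it means the test found no evidence of stationarity.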
Because the data is non-stationary, the next thing we need to do is estimate the trend. We start by taking the log of the series:
df_log = np.log(df)
sns.lineplot(data=df_log)
[Figure: line plot of log(#Passengers) by Month]
rolling_mean_log = df_log.rolling(window=12).mean()
df_summary = df_log.assign(Mean=rolling_mean_log)
sns.lineplot(data=df_summary)
[Figure: line plot of log(#Passengers) with its rolling Mean, by Month]
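One reason for taking the log can be sketched on a made-up series rather than the passenger data. When a series grows by a constant percentage, the raw step-to-step changes keep getting larger, but the changes in the log are constant. `toy_growth` is a hypothetical name:

```python
import numpy as np

# Hypothetical series growing by 10% per step
toy_growth = 100 * 1.1 ** np.arange(6)

raw_changes = np.diff(toy_growth)          # grow with the level of the series
log_changes = np.diff(np.log(toy_growth))  # all equal to log(1.1)

print(raw_changes)
print(log_changes)
```

The log turns multiplicative growth into additive growth, which makes a growing trend easier to estimate and subtract.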
Even after taking the log there is still some trend visible. We can try subtracting the rolling mean from the original data:
df_diff = df - rolling_mean
sns.lineplot(data=df_diff)
[Figure: line plot of #Passengers minus its rolling mean, by Month]
rolling_mean_diff = df_diff.rolling(window=12).mean()
rolling_std_diff = df_diff.rolling(window=12).std()
df_summary = df_diff.assign(Mean=rolling_mean_diff)
df_summary = df_summary.assign(Std=rolling_std_diff)
sns.lineplot(data=df_summary)
[Figure: line plot of the difference series with its rolling Mean and Std, by Month]
adf_diff = adfuller(df_diff.dropna())
print_adf(adf_diff)
We can do the same with the log:
df_diff_log = df_log - rolling_mean_log
sns.lineplot(data=df_diff_log)
[Figure: line plot of log(#Passengers) minus its rolling mean, by Month]
rolling_mean_diff_log = df_diff_log.rolling(window=12).mean()
rolling_std_diff_log = df_diff_log.rolling(window=12).std()
df_summary = df_diff_log.assign(Mean=rolling_mean_diff_log)
df_summary = df_summary.assign(Std=rolling_std_diff_log)
sns.lineplot(data=df_summary)
[Figure: line plot of the log difference series with its rolling Mean and Std, by Month]
adf_diff_log = adfuller(df_diff_log.dropna())
print_adf(adf_diff_log)
The ADF p-value for the log diff is less than 0.05, so we reject the null hypothesis and treat the result as stationary
We can also try dividing the original data by the rolling mean:
df_div = df / rolling_mean
sns.lineplot(data=df_div)
[Figure: line plot of #Passengers divided by its rolling mean, by Month]
rolling_mean_div = df_div.rolling(window=12).mean()
rolling_std_div = df_div.rolling(window=12).std()
df_summary = df_div.assign(Mean=rolling_mean_div)
df_summary = df_summary.assign(Std=rolling_std_div)
sns.lineplot(data=df_summary)
[Figure: line plot of the ratio series with its rolling Mean and Std, by Month]
adf_div = adfuller(df_div.dropna())
print_adf(adf_div)
The ADF p-value for the division is less than 0.05, so we reject the null hypothesis and treat the result as stationary
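The difference between subtracting and dividing by the rolling mean can be sketched on a made-up series whose seasonal swings grow with its level. The series and the `spread_growth` helper are hypothetical and only illustrate the idea:

```python
import numpy as np
import pandas as pd

# Hypothetical series: linear trend multiplied by a 12-step seasonal factor
steps = np.arange(48)
toy = pd.Series((100 + 5 * steps) * (1 + 0.2 * np.sin(2 * np.pi * steps / 12)))
toy_roll = toy.rolling(window=12).mean()

toy_diff = (toy - toy_roll).dropna()   # subtract the rolling mean
toy_ratio = (toy / toy_roll).dropna()  # divide by the rolling mean

def spread_growth(series):
    """Std of the later half divided by std of the earlier half."""
    half = len(series) // 2
    return series.iloc[half:].std() / series.iloc[:half].std()

diff_growth = spread_growth(toy_diff)
ratio_growth = spread_growth(toy_ratio)
print(diff_growth, ratio_growth)
```

On this toy series the subtracted version still has swings that grow over time (a ratio well above 1), while the divided version has swings of roughly constant size (a ratio close to 1). A constant spread is one of the conditions for stationarity.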
Next we can try to do a decomposition on the above series since it is stationary:
from statsmodels.tsa.seasonal import seasonal_decompose
decomposition = seasonal_decompose(df_div.dropna())
trend = decomposition.trend
sns.lineplot(data=trend.dropna())
[Figure: line plot of the trend component by Month]
seasonal = decomposition.seasonal
sns.lineplot(data=seasonal.dropna())
[Figure: line plot of the seasonal component by Month]
resid = decomposition.resid
sns.lineplot(data=resid.dropna())
[Figure: line plot of the residual component by Month]
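To make the three components concrete, here is a rough sketch of a classical additive decomposition written with pandas only. It runs on a made-up series (`toy`, a straight-line trend plus a repeating 12-month pattern), not on the passenger data. It is a simplified illustration and is not meant to reproduce what `seasonal_decompose` does internally:

```python
import numpy as np
import pandas as pd

# Hypothetical monthly series: linear trend plus a zero-mean 12-month pattern
index = pd.date_range('2000-01-01', periods=48, freq='MS')
true_trend = 10 + 0.5 * np.arange(48)
pattern = np.tile(np.arange(12) - 5.5, 4)
toy = pd.Series(true_trend + pattern, index=index)

# Trend: centred moving average. With an even period of 12, average two
# neighbouring 12-month means and shift the result back by 6 months.
toy_trend = toy.rolling(window=12).mean().rolling(window=2).mean().shift(-6)

# Seasonal: average detrended value for each calendar month
detrended = toy - toy_trend
toy_seasonal = detrended.groupby(detrended.index.month).transform('mean')

# Residual: whatever the trend and seasonal components do not explain
toy_resid = toy - toy_trend - toy_seasonal

print(toy_resid.dropna().abs().max())
```

Because the toy series is built from exactly a linear trend and a fixed monthly pattern, the residual is numerically zero wherever the trend is defined. On real data the residual is what remains after the trend and seasonal estimates are removed. The first and last six months have no trend estimate because the centred window does not fit there.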