Time Series Analysis & Visualization in Python (original) (raw)

Every dataset has distinct qualities that function as essential aspects in the field of data analytics, providing insightful information about the underlying data. Time series data is one kind of dataset that is especially important. This article delves into the complexities of time series datasets, examining their unique features and how they may be utilized to gain significant insights.

**What are time series visualization and analytics?

Time series visualization and analytics empower users to graphically represent time-based data, enabling the identification of trends and the tracking of changes over different periods. This data can be presented through various formats, such as line graphs, gauges, tables, and more.

The utilization of time series visualization and analytics facilitates the extraction of insights from data, enabling the generation of forecasts and a comprehensive understanding of the information at hand. Organizations find substantial value in time series data as it allows them to analyze both real-time and historical metrics.

What is Time Series Data?

Time series data is a sequential arrangement of data points organized in consecutive time order. Time-series analysis consists of methods for analyzing time-series data to extract meaningful insights and other valuable characteristics of the data.

Importance of time series analysis

Time-series data analysis is becoming very important in so many industries, like financial industries, pharmaceuticals, social media companies, web service providers, research, and many more. To understand the time-series data, visualization of the data is essential. In fact, any type of data analysis is not complete without visualizations, because one good visualization can provide meaningful and interesting insights into the data.

Basic Time Series Concepts

Types of Time Series Data

Time series data can be broadly classified into two sections:

****1. Continuous Time Series Data:**Continuous time series data involves measurements or observations that are recorded at regular intervals, forming a seamless and uninterrupted sequence. This type of data is characterized by a continuous range of possible values and is commonly encountered in various domains, including:

2. **Discrete Time Series Data: Discrete time series data, on the other hand, consists of measurements or observations that are limited to specific values or categories. Unlike continuous data, discrete data does not have a continuous range of possible values but instead comprises distinct and separate data points. Common examples include:

**Visualization Approach for Different Data Types:

Time Series Data Visualization using Python

We will use Python libraries for visualizing the data. The link for the dataset can be found here. We will perform the visualization step by step, as we do in any time-series data project.

Importing the Libraries

We will import all the libraries that we will be using throughout this article in one place so that do not have to import every time we use it this will save both our time and effort.

import pandas as pd import numpy as np import seaborn as sns import matplotlib.pyplot as plt from statsmodels.graphics.tsaplots import plot_acf from statsmodels.tsa.stattools import adfuller

`

Loading The Dataset

To load the dataset into a dataframe we will use the pandas read_csv() function. We will use head() function to print the first five rows of the dataset. Here we will use the ‘**parse_dates’ parameter in the read_csv function to convert the ‘Date’ column to the DatetimeIndex format. By default, Dates are stored in string format which is not the right format for time series data analysis.

Python `

reading the dataset using read_csv

df = pd.read_csv("stock_data.csv", parse_dates=True, index_col="Date")

displaying the first five rows of dataset

df.head()

`

**Output:

Unnamed: 0	Open	High	Low	Close	Volume	Name  

Date
2013-02-08 NaN 15.07 15.12 14.63 14.75 8407500 AAL
2013-02-11 NaN 14.89 15.01 14.26 14.46 8882000 AAL
2013-02-12 NaN 14.45 14.51 14.10 14.27 8126000 AAL
2013-02-13 NaN 14.30 14.94 14.25 14.66 10259500 AAL
2013-02-14 NaN 14.94 14.96 13.16 13.99 31879900 AAL

Dropping Unwanted Columns

We will drop columns from the dataset that are not important for our visualization.

Python `

deleting column

df.drop(columns='Unnamed: 0', inplace =True) df.head()

`

**Output:

Date Open High Low Close Volume Name
2013-02-08 15.07 15.12 14.63 14.75 8407500 AAL
2013-02-11 14.89 15.01 14.26 14.46 8882000 AAL
2013-02-12 14.45 14.51 14.10 14.27 8126000 AAL
2013-02-13 14.30 14.94 14.25 14.66 10259500 AAL
2013-02-14 14.94 14.96 13.16 13.99 31879900 AAL

**Plotting Line plot for Time Series data:

Since, the volume column is of continuous data type, we will use line graph to visualize it.

Python `

Assuming df is your DataFrame

sns.set(style="whitegrid") # Setting the style to whitegrid for a clean background

plt.figure(figsize=(12, 6)) # Setting the figure size sns.lineplot(data=df, x='Date', y='High', label='High Price', color='blue')

Adding labels and title

plt.xlabel('Date') plt.ylabel('High') plt.title('Share Highest Price Over Time')

plt.show()

`

**Output:

download

Line plot

Resampling

To better understand the trend of the data we will use the resampling method, resampling the data on a monthly basis can provide a clearer view of trends and patterns, especially when we are dealing with daily data.

Python `

Assuming df is your DataFrame with a datetime index

df_resampled = df.resample('M').mean(numeric_only=True) # Resampling to monthly frequency, using mean as an aggregation function

sns.set(style="whitegrid") # Setting the style to whitegrid for a clean background

Plotting the 'high' column with seaborn, setting x as the resampled 'Date'

plt.figure(figsize=(12, 6)) # Setting the figure size sns.lineplot(data=df_resampled, x=df_resampled.index, y='High', label='Month Wise Average High Price', color='blue')

Adding labels and title

plt.xlabel('Date (Monthly)') plt.ylabel('High') plt.title('Monthly Resampling Highest Price Over Time')

plt.show()

This code is modified by Susobhan Akhuli

`

**Output:

download1

line plot

We have observed an upward **trend in the resampled monthly volume data. An upward trend indicates that, over the monthly intervals, the “high” column tends to increase over time.

Detecting Seasonality Using Auto Correlation

We will detect Seasonality using the autocorrelation function (ACF) plot. Peaks at regular intervals in the ACF plot suggest the presence of seasonality.

Python `

Check if 'Date' is already the index

if 'Date' not in df.columns: print("'Date' is already the index or not present in the DataFrame.") else: df.set_index('Date', inplace=True)

Plot the ACF

plt.figure(figsize=(12, 6)) plot_acf(df['Volume'], lags=40) # You can adjust the number of lags as needed plt.xlabel('Lag') plt.ylabel('Autocorrelation') plt.title('Autocorrelation Function (ACF) Plot') plt.show()

This code is modified by Susobhan Akhuli

`

**Output:

download

Autocorrelation Function

The presence of seasonality is typically indicated by peaks or spikes at regular intervals, as there are none there is no seasonality in our data.

Detecting Stationarity

We will perform the ADF test to formally test for stationarity.

The test is based on the;

The ADF test employs an augmented regression model that includes lagged differences of the series to determine the presence of a unit root.

Python `

from statsmodels.tsa.stattools import adfuller

Assuming df is your DataFrame

result = adfuller(df['High']) print('ADF Statistic:', result[0]) print('p-value:', result[1]) print('Critical Values:', result[4])

`

**Output:

ADF Statistic: -2.0394210870439844
p-value: 0.2695601609296777
Critical Values: {'1%': -3.4355629707955395, '5%': -2.863842063387667, '10%': -2.567995644141416}

Smoothening the data using Differencing and Moving Average

Differencing involves subtracting the previous observation from the current observation to remove trends or seasonality.

Python `

Differencing

df['high_diff'] = df['High'].diff()

Plotting

plt.figure(figsize=(12, 6)) plt.plot(df['High'], label='Original High', color='blue') plt.plot(df['high_diff'], label='Differenced High', linestyle='--', color='green') plt.legend() plt.title('Original vs Differenced High') plt.show()

`

**Output:

download

Original vs Differenced High

The df['High'].diff() part calculates the difference between consecutive values in the ‘High’ column. This differencing operation is commonly used to transform a time series into a new series that represents the changes between consecutive observations.

Python `

Moving Average

window_size = 120 df['high_smoothed'] = df['High'].rolling(window=window_size).mean()

Plotting

plt.figure(figsize=(12, 6))

plt.plot(df['High'], label='Original High', color='blue') plt.plot(df['high_smoothed'], label=f'Moving Average (Window={window_size})', linestyle='--', color='orange')

plt.xlabel('Date') plt.ylabel('High') plt.title('Original vs Moving Average') plt.legend() plt.show()

`

**Output:

download

Original vs Moving Average

This calculates the moving average of the ‘High’ column with a window size of 120(A quarter) , creating a smoother curve in the ‘high_smoothed’ series. The plot compares the original ‘High’ values with the smoothed version.Now let’s plot all other columns using a subplot.

Original Data Vs Differenced Data

Printing the original and differenced data side by side we get;

Python `

Create a DataFrame with 'high' and 'high_diff' columns side by side

df_combined = pd.concat([df['High'], df['high_diff']], axis=1)

Display the combined DataFrame

print(df_combined.head())

`

**Output:

         High  high_diff  

Date
2013-02-08 15.12 NaN
2013-02-11 15.01 -0.11
2013-02-12 14.51 -0.50
2013-02-13 14.94 0.43
2013-02-14 14.96 0.02

Hence, the ‘high_diff’ column represents the differences between consecutive high values .The first value of ‘high_diff’ is NaN because there is no previous value to calculate the difference.

As, there is a NaN value we will drop that proceed with our test,

Python `

Remove rows with missing values

df.dropna(subset=['high_diff'], inplace=True) df['high_diff'].head()

`

**Output:

high_diff  

Date
2013-02-11 -0.11
2013-02-12 -0.50
2013-02-13 0.43
2013-02-14 0.02
2013-02-15 -0.35
dtype: float64

After that if we conduct the ADF test;

Python `

from statsmodels.tsa.stattools import adfuller

Assuming df is your DataFrame

result = adfuller(df['high_diff']) print('ADF Statistic:', result[0]) print('p-value:', result[1]) print('Critical Values:', result[4])

`

**Output:

ADF Statistic: -30.782419342418
p-value: 0.0
Critical Values: {'1%': -3.4355629707955395, '5%': -2.863842063387667, '10%': -2.567995644141416}

You can download the dataset and source code from here: