Time-series - YData Profiling (original) (raw)

Time-Series data

ydata-profiling can be used for a quick Exploratory Data Analysis on time-series data. This is useful for a quick understanding on the behaviour of time dependent variables regarding behaviours such as time plots, seasonality, trends, stationary and data gaps.

Combined with the profiling reports compare, you're able to compare the evolution and data behaviour through time, in terms of time-series specific statistics such as PACF and ACF plots. It also provides the identification of gaps in the time series, caused either by missing values or by entries missing in the time index.

Time-series profiling

Time-series profiling report

The following syntax can be used to generate a profile under the assumption that the dataset includes time dependent features:

Setting the configurations for time-series profiling
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20	import pandas as pd from ydata_profiling.utils.cache import cache_file from ydata_profiling import ProfileReport file_name = cache_file( "pollution_us_2000_2016.csv", "https://query.data.world/s/mz5ot3l4zrgvldncfgxu34nda45kvb", ) df = pd.read_csv(file_name, index_col=[0]) # Filtering time-series to profile a single site site = df[df["Site Num"] == 3003] #Enable tsmode to True to automatically identify time-series variables #Provide the column name that provides the chronological order of your time-series profile = ProfileReport(df, tsmode=True, sortby="Date Local", title="Time-Series EDA") profile.to_file("report_timeseries.html")

Setting the configurations for time-series profiling

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20

import pandas as pd from ydata_profiling.utils.cache import cache_file from ydata_profiling import ProfileReport file_name = cache_file( "pollution_us_2000_2016.csv", "https://query.data.world/s/mz5ot3l4zrgvldncfgxu34nda45kvb", ) df = pd.read_csv(file_name, index_col=[0]) # Filtering time-series to profile a single site site = df[df["Site Num"] == 3003] #Enable tsmode to True to automatically identify time-series variables #Provide the column name that provides the chronological order of your time-series profile = ProfileReport(df, tsmode=True, sortby="Date Local", title="Time-Series EDA") profile.to_file("report_timeseries.html")

To enable a time-series report to be generated ts_mode needs to be set to True. If True the variables that have temporal dependence will be automatically identified based on the presence of autocorrection. The time-series report uses the sortby attribute to order the dataset. If not provided it is assumed that the dataset is already ordered. You can set up the correlation level to detect to apply the time-series validation by setting the x configuration.

Warnings and validations

Specific to time-series analysis, 2 new warnings were added to the ydata-profilingwarnings family: NON_STATIONARY and SEASONAL.

Stationarity

In the realm of time-series analysis, a stationary time-series is a dataset where statistical properties, such as mean, variance, and autocorrelation, remain constant over time. This property is essential for many time-series forecasting and modeling techniques because they often assume that the underlying data is stationary. Stationarity simplifies the modeling process by making it easier to detect patterns and trends.

ydata-profiling stationary warning is based on an Augmented Dickey-Fuller(ADF) test. Nevertheless, you should always combine the output of this warning with a visual inspection to your time-series behaviour and search for variance of the rolling statistics analysis.

Seasonality

A seasonal time-series is a specific type of time-series data that exhibits recurring patterns or fluctuations at regular intervals. These patterns are known as seasonality and are often observed in data associated with yearly, monthly, weekly, or daily cycles. Seasonal time-series data can be challenging to model accurately without addressing the underlying seasonality.

ydata-profiling seasonality warning is based on an Augmented Dickey-Fuller(ADF) test. Nevertheless, you should always combine the output of this warning with a seasonal decomposition PACF and ACF plots (also computed in your time-series profiling).

Time-series missing gaps

Time-series gap analysis

Time-series missing data visualization

As a data scientist, one of the critical aspects of working with time-series data is understanding and analyzing time-series gaps. Time-series gaps refer to the intervals within your time-series data where observations are missing or incomplete. While these gaps might seem like inconveniences, they hold valuable information and can significantly impact the quality and reliability of your analyses and predictions.

ydata-profiling automated identification of potential time-series gaps is based on time intervals analysis. By analyzing the time intervals between data points, the gaps are expected to be reflected as larger intervals in the distribution.

You can set up the configuration and intervals for the missing gaps identification here.

Customizing time-series variables

In some cases you might be already aware of what variables are expected to be time-series or, perhaps, you just want to ensure that the variables that you want to analyze as time-series are profiled as such:

Setting what variables are time-series
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32	import pandas as pd from ydata_profiling.utils.cache import cache_file from ydata_profiling import ProfileReport file_name = cache_file( "pollution_us_2000_2016.csv", "https://query.data.world/s/mz5ot3l4zrgvldncfgxu34nda45kvb", ) df = pd.read_csv(file_name, index_col=[0]) # Filtering time-series to profile a single site site = df[df["Site Num"] == 3003] # Setting what variables are time series type_schema = { "NO2 Mean": "timeseries", "NO2 1st Max Value": "timeseries", "NO2 1st Max Hour": "timeseries", "NO2 AQI": "timeseries", } profile = ProfileReport( df, tsmode=True, type_schema=type_schema, sortby="Date Local", title="Time-Series EDA for site 3003", ) profile.to_file("report_timeseries.html")

Setting what variables are time-series

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32

import pandas as pd from ydata_profiling.utils.cache import cache_file from ydata_profiling import ProfileReport file_name = cache_file( "pollution_us_2000_2016.csv", "https://query.data.world/s/mz5ot3l4zrgvldncfgxu34nda45kvb", ) df = pd.read_csv(file_name, index_col=[0]) # Filtering time-series to profile a single site site = df[df["Site Num"] == 3003] # Setting what variables are time series type_schema = { "NO2 Mean": "timeseries", "NO2 1st Max Value": "timeseries", "NO2 1st Max Hour": "timeseries", "NO2 AQI": "timeseries", } profile = ProfileReport( df, tsmode=True, type_schema=type_schema, sortby="Date Local", title="Time-Series EDA for site 3003", ) profile.to_file("report_timeseries.html")

For more questions and suggestions around time-series analysis reach us out at the Data-Centric AI community.