Extracting Features from Time Series Data Using tsfresh

Time series data is prevalent in various fields such as finance, healthcare, and engineering. Extracting meaningful features from this data is crucial for building predictive models. The tsfresh Python package simplifies this process by automatically calculating a wide range of features. This article provides a comprehensive guide on how to use tsfresh to extract features from time series data.

Table of Contents

Introduction to tsfresh
How to Use tsfresh for Feature Extraction

Installation
Basic Usage: Step-by-Step Procedure
Step 1: Preparing the Data
Step 2: Extracting Features
Step 3: Filtering Relevant Features
Step 4: Visualizing Results

tsfresh to Extract Features from Time Series Data: Advanced Usage
Conclusion

Introduction to tsfresh

tsfresh (Time Series Feature extraction based on scalable hypothesis tests) is a Python package designed to automate the extraction of a large number of features from time series data. It is particularly useful for tasks such as classification, regression, and clustering of time series data. The package integrates seamlessly with pandas and scikit-learn, making it easy to incorporate into existing workflows.

Key Features of tsfresh:

Automated Feature Extraction: Extracts hundreds of features from time series data automatically.
Feature Selection: Identifies relevant features using statistical tests.
Scalability: Supports parallel processing and integration with dask for handling large datasets.
Compatibility: Works well with pandas DataFrames and scikit-learn pipelines.

How to Use tsfresh for Feature Extraction

Installation

To install tsfresh, you can use pip:

pip install tsfresh

pip also installs the required dependencies, such as pandas and numpy, which are commonly used alongside tsfresh.

Basic Usage: Step-by-Step Procedure

The basic usage of tsfresh involves three main steps, which the sections below cover along with an optional visualization step:

Preparing your data
Extracting features
Selecting relevant features

Step 1: Preparing the Data

Your data should be in a long-format pandas DataFrame where each row corresponds to a single observation at a specific time point. The DataFrame should have at least the following columns:

id: Identifier for each time series (e.g., sensor ID)
time: Time of the observation
value: Observed value at that time point

Example:

Python

import pandas as pd

data = {
    'id': [1, 1, 1, 2, 2, 2],
    'time': [1, 2, 3, 1, 2, 3],
    'value': [10, 15, 14, 7, 8, 9]
}
df = pd.DataFrame(data)
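
Data often arrives in wide format, with one column per series. The following is a minimal sketch of reshaping such a table into the long format described above with the pandas melt method. The wide table and its sensor column names are made up for illustration.

```python
import pandas as pd

# Hypothetical wide-format table: one row per time point, one column per sensor.
wide = pd.DataFrame({
    'time': [1, 2, 3],
    'sensor_1': [10, 15, 14],
    'sensor_2': [7, 8, 9],
})

# Reshape to long format: one row per (series, time point) pair.
long_df = wide.melt(id_vars='time', var_name='id', value_name='value')

# Sort so that each series is contiguous and ordered in time,
# then put the columns in the id / time / value order used in this article.
long_df = long_df.sort_values(['id', 'time']).reset_index(drop=True)
long_df = long_df[['id', 'time', 'value']]
print(long_df)
```

The resulting long_df has the same id, time, and value columns as df above, so it can be passed to tsfresh in the same way.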

Step 2: Extracting Features

Use the extract_features function to compute features for each time series, passing the names of the id and sort columns:

Python

from tsfresh import extract_features

extracted_features = extract_features(df, column_id='id', column_sort='time')
print(extracted_features.head())

Output:

Feature Extraction: 100%|██████████| 2/2 [00:00<00:00,  6.65it/s]
   value__variance_larger_than_standard_deviation  value__has_duplicate_max  \
1                                             1.0                       0.0   
2                                             0.0                       0.0   

   value__has_duplicate_min  value__has_duplicate  value__sum_values  \
1                       0.0                   0.0               39.0   
2                       0.0                   0.0               24.0   

   value__abs_energy  value__mean_abs_change  value__mean_change  \
1              521.0                     3.0                 2.0   
2              194.0                     1.0                 1.0   

   value__mean_second_derivative_central  value__median  ...  \
1                                   -3.0           14.0  ...   
2                                    0.0            8.0  ...   

   value__fourier_entropy__bins_5  value__fourier_entropy__bins_10  \
1                        0.693147                         0.693147   
2                        0.693147                         0.693147   

   value__fourier_entropy__bins_100  \
1                          0.693147   
2                          0.693147   

   value__permutation_entropy__dimension_3__tau_1  \
1                                            -0.0   
2                                            -0.0   

   value__permutation_entropy__dimension_4__tau_1  \
1                                             NaN   
2                                             NaN   

   value__permutation_entropy__dimension_5__tau_1  \
1                                             NaN   
2                                             NaN   

   value__permutation_entropy__dimension_6__tau_1  \
1                                             NaN   
2                                             NaN   

   value__permutation_entropy__dimension_7__tau_1  \
1                                             NaN   
2                                             NaN   

   value__query_similarity_count__query_None__threshold_0.0  \
1                                                NaN          
2                                                NaN          

   value__mean_n_absolute_max__number_of_maxima_7  
1                                             NaN  
2                                             NaN

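This function computes a wide range of features for each time series identified by id. As the output shows, features that cannot be computed on a series this short, such as permutation entropy with a dimension of 4 or more, are reported as NaN.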
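
To make the column names less opaque, a few of the simpler features can be reproduced by hand. The sketch below recomputes sum_values, abs_energy, mean_abs_change, mean_change, and median with pandas and NumPy, assuming the usual definitions behind these names rather than quoting tsfresh's own code. The helper name simple_features is made up for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'id': [1, 1, 1, 2, 2, 2],
    'time': [1, 2, 3, 1, 2, 3],
    'value': [10, 15, 14, 7, 8, 9],
})

def simple_features(values: pd.Series) -> dict:
    """Hand-rolled versions of a few features (illustrative helper, not part of tsfresh)."""
    x = values.to_numpy(dtype=float)
    diffs = np.diff(x)  # changes between consecutive observations
    return {
        'value__sum_values': x.sum(),
        'value__abs_energy': np.sum(x ** 2),               # sum of squared values
        'value__mean_abs_change': np.mean(np.abs(diffs)),
        'value__mean_change': np.mean(diffs),
        'value__median': np.median(x),
    }

# Compute the features per series, with observations ordered in time.
ordered = df.sort_values(['id', 'time'])
manual = pd.DataFrame.from_dict(
    {series_id: simple_features(group['value']) for series_id, group in ordered.groupby('id')},
    orient='index',
)
manual.index.name = 'id'
print(manual)
```

For this toy data the hand-computed values agree with the corresponding columns in the output above, for example an abs_energy of 521.0 for series 1 and 194.0 for series 2.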

Step 3: Filtering Relevant Features

After feature extraction, you may want to filter out irrelevant or redundant features. Use the select_features function to keep only the significant ones:

Python

from tsfresh import select_features
from tsfresh.utilities.dataframe_functions import impute

# First, impute any missing values
impute(extracted_features)

# Assume we have target labels for a classification task
y = pd.Series([0, 1], index=[1, 2])

selected_features = select_features(extracted_features, y)
print(selected_features.head())

Output:

'value__autocorrelation__lag_4' 'value__autocorrelation__lag_5'
 'value__autocorrelation__lag_6' 'value__autocorrelation__lag_7'
 'value__autocorrelation__lag_8' 'value__autocorrelation__lag_9'
 'value__partial_autocorrelation__lag_0'
 'value__partial_autocorrelation__lag_1'
 'value__partial_autocorrelation__lag_2'
 'value__partial_autocorrelation__lag_3'
 'value__partial_autocorrelation__lag_4'
 'value__partial_autocorrelation__lag_5'
 'value__partial_autocorrelation__lag_6'
 'value__partial_autocorrelation__lag_7'
 'value__partial_autocorrelation__lag_8'
 'value__partial_autocorrelation__lag_9'

Note: This is only a glimpse of the output.

This step ensures that only features relevant to your prediction task are retained.

Step 4: Visualizing Results

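Visualizing the selected features with libraries such as matplotlib or seaborn can help you understand their importance and distribution. Before plotting, check that the selection actually returned any features. The code below reruns the selection with a named target series; on this two-series toy dataset no feature passes the significance tests, so the result is empty: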

Python

# Assume we have target labels for a classification task
y = pd.Series([0, 1], index=[1, 2], name='target') # Give the series a name

selected_features = select_features(extracted_features, y)
print(selected_features.head())

Output:

Empty DataFrame
Columns: []
Index: [1, 2]
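
A plot only makes sense once there are features to show. The sketch below assumes a feature table with one row per series, such as the output of select_features on a larger dataset. To keep the example runnable on its own, it builds a synthetic stand-in table by hand instead of calling tsfresh. The helper name plot_features_by_class and the stand-in values are made up for illustration.

```python
import matplotlib
matplotlib.use('Agg')  # render off-screen; drop this line in a notebook
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

def plot_features_by_class(features: pd.DataFrame, y: pd.Series):
    """Draw one box plot per feature column, grouped by target class."""
    if features.shape[1] == 0:
        raise ValueError('No features to plot; the selection returned an empty table.')
    n_features = features.shape[1]
    fig, axes = plt.subplots(1, n_features, figsize=(4 * n_features, 3), squeeze=False)
    classes = sorted(y.unique())
    for ax, column in zip(axes[0], features.columns):
        # One array of feature values per class, in class order.
        groups = [features.loc[y == c, column].to_numpy() for c in classes]
        ax.boxplot(groups)
        ax.set_xticks(range(1, len(classes) + 1))
        ax.set_xticklabels([str(c) for c in classes])
        ax.set_title(column)
        ax.set_xlabel(y.name or 'target')
    fig.tight_layout()
    return fig

# Synthetic stand-in for a selected-feature table (not real tsfresh output).
rng = np.random.default_rng(0)
y = pd.Series([i % 2 for i in range(40)], name='target')
features = pd.DataFrame({
    'value__mean': rng.normal(loc=5.0 * y.to_numpy(), scale=0.3),
    'value__standard_deviation': rng.normal(loc=1.0, scale=0.1, size=40),
})

fig = plot_features_by_class(features, y)
```

Call plt.show() or fig.savefig(...) to display or store the figure. A feature whose boxes barely overlap between classes, like the mean in this stand-in data, is likely to be useful to a classifier.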

tsfresh to Extract Features from Time Series Data: Advanced Usage

tsfresh offers several advanced features:

Custom Feature Extraction: You can define custom feature extraction functions to tailor the feature set to your specific needs.
Parallel Processing: Utilize parallel processing to speed up feature extraction on large datasets.
Feature Extraction Settings: Customize the feature extraction process by modifying settings such as which features to calculate and how to handle missing values.

Example of custom settings:

from tsfresh import extract_features
from tsfresh.feature_extraction import ComprehensiveFCParameters

settings = ComprehensiveFCParameters()
# A value of None only means "no parameters"; delete the key to skip a feature
del settings['maximum']  # Disable maximum feature extraction

extracted_features = extract_features(df, column_id='id', column_sort='time', default_fc_parameters=settings)

Conclusion

The tsfresh package is a robust tool for extracting and selecting features from time series data. By automating the feature extraction process, it allows you to focus on building and optimizing your machine learning models. With its flexibility and ease of use, tsfresh is an essential package for anyone working with time series data.

Referred: https://www.geeksforgeeks.org
