Streamlining Data Cleaning with PyJanitor: A Comprehensive Guide

Data cleaning is a crucial step in the data analysis pipeline. It involves transforming raw data into a clean dataset that can be used for analysis. This process can be time-consuming and error-prone, especially when dealing with large datasets. PyJanitor is a Python library that aims to simplify data cleaning by providing a set of convenient functions for common data cleaning tasks. In this article, we will explore PyJanitor, its features, and how it can be used to streamline the data cleaning process.

Table of Contents

What is PyJanitor?
Key Features of PyJanitor
Installing PyJanitor
Using PyJanitor for Data Cleaning in Python

1. Cleaning Column Names with PyJanitor
2. Removing Empty Rows and Columns
3. Identifying Duplicate Data Points
4. Encoding Object Data Type to Categorical Data Type
5. Renaming Columns
6. Filtering Data

Pipe() Method in PyJanitor : Custom Functions
Exploring Different PyJanitor Functions

1. fill_empty(data, column_names, value)
2. filter_on(data, criteria, complement=False)
3. rename_column(data, old_column_name, new_column_name)
4. add_column(df, column_name, value, fill_remaining=False)

What is PyJanitor?

PyJanitor is an open-source Python library built on top of Pandas, designed to extend its functionality with additional data cleaning features. It provides a set of functions that make it easier to perform common data cleaning tasks, such as removing missing values, renaming columns, and filtering data. PyJanitor aims to make data cleaning more efficient and less error-prone by providing a consistent and intuitive API.

Key Features of PyJanitor

PyJanitor offers a variety of features that simplify data cleaning:

Chaining Methods: PyJanitor allows for method chaining, making it easy to apply multiple data cleaning operations in a single line of code.
Convenient Functions: It provides a set of functions for common data cleaning tasks, such as removing missing values, renaming columns, and filtering data.
Integration with Pandas: PyJanitor is built on top of Pandas, so it integrates seamlessly with existing Pandas workflows.
Custom Functions: Users can define their own custom cleaning functions and integrate them into the PyJanitor workflow.

Installing PyJanitor

PyJanitor can be installed from PyPI using pip. The package is named pyjanitor, but it is imported in code as janitor.

pip install pyjanitor

Using PyJanitor for Data Cleaning in Python

1. Cleaning Column Names with PyJanitor

Let’s explore some common data cleaning tasks and how PyJanitor can simplify them. We can clean multiple column names at once using the clean_names() function of PyJanitor. This function converts the column names to lowercase and replaces spaces with underscores. Passing remove_special=True, as in the example below, also removes special characters such as @.

Python

import pandas as pd
import janitor
data = {'Column @1': [1, 2], 'Column @2': [3, 4]}
data = pd.DataFrame(data)

print(data)

Output:

   Column @1  Column @2
0          1          3
1          2          4

Python

data = data.clean_names(remove_special=True)

print(data)

Output:

   column_1  column_2
0         1         3
1         2         4
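To make the three transformations concrete, here is a rough approximation written with plain pandas and a regular expression. It is only a sketch of the idea, not PyJanitor's implementation, and approx_clean_name is a hypothetical helper introduced for illustration.

```python
import re
import pandas as pd

def approx_clean_name(name: str) -> str:
    """Hypothetical helper: lowercase, spaces to underscores, drop other symbols."""
    name = str(name).strip().lower()
    name = re.sub(r"\s+", "_", name)      # each run of whitespace becomes one underscore
    return re.sub(r"[^\w]", "", name)     # drop anything that is not a letter, digit, or underscore

df = pd.DataFrame({'Column @1': [1, 2], 'Column @2': [3, 4]})
renamed = df.rename(columns=approx_clean_name)
print(list(renamed.columns))  # ['column_1', 'column_2']
```

The real clean_names() handles more cases (for example different case styles), so prefer it over a hand-rolled helper in practice.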

2. Removing Empty Rows and Columns

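We can remove rows and columns that are entirely empty (every value is missing) using the remove_empty() function. In the example below, the second row is dropped because both of its values are None.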

Python

import pandas as pd
import janitor

data = {'A': [1, None, 3], 'B': [4, None, 6]}
data = pd.DataFrame(data)  

data = data.remove_empty()   

print(data)

Output:

     A    B
0  1.0  4.0
1  3.0  6.0
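To show what counts as "empty", here is a plain-pandas sketch of the same idea. It assumes that only rows and columns in which every value is missing should be dropped, which matches the output above; the extra column C and the fourth row are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    'A': [1, None, 3, None],
    'B': [4, None, 6, 8],
    'C': [None, None, None, None],   # a column with no data at all
})

cleaned = (
    df.dropna(axis=0, how='all')     # drop rows where every value is missing
      .dropna(axis=1, how='all')     # drop columns where every value is missing
      .reset_index(drop=True)
)
print(cleaned)
```

The last row survives because B still holds a value there; a partially filled row is not considered empty.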

3. Identifying Duplicate Data Points

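We can flag repeated rows using the duplicated() function. This is a standard pandas DataFrame method rather than a PyJanitor addition, so it can be used alongside the PyJanitor functions. It returns True for a row only when every column matches an earlier row, and False otherwise; the first occurrence of a repeated row is not marked.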

Python

import pandas as pd
import janitor

data = {
    'A': [1, 2, 2, 4],
    'B': [5, 6, 6, 8]
}
data = pd.DataFrame(data)

duplicates = data.duplicated()
print(duplicates)

Output:

0    False
1    False
2     True
3    False
dtype: bool
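By default only the later copy is flagged. The short sketch below, using only pandas, shows two common follow-ups: flagging every copy with keep=False, and dropping the repeats with drop_duplicates().

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 2, 4], 'B': [5, 6, 6, 8]})

all_copies = df.duplicated(keep=False)                   # flag every row that has a twin
deduped = df.drop_duplicates().reset_index(drop=True)    # keep the first copy of each row

print(all_copies.tolist())  # [False, True, True, False]
print(deduped)
```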

4. Encoding Object Data Type to Categorical Data Type

We can convert columns of the object (string) data type to the categorical data type using the encode_categorical() function, passing the names of the columns we want to convert.

Python

import pandas as pd
import janitor

data = {
    'A': ['low', 'medium', 'high', 'medium', 'low'],
    'B': ['type1', 'type2', 'type1', 'type3', 'type2']
}
data = pd.DataFrame(data)
print(data.dtypes)

# Encoding columns 'A' and 'B' as categorical
# (column_names is the keyword in recent releases; older ones used columns)
data = data.encode_categorical(column_names=['A', 'B'])

print(data)
print(data.dtypes)

Output:


A    object
B    object
dtype: object
        A      B
0     low  type1
1  medium  type2
2    high  type1
3  medium  type3
4     low  type2
A    category
B    category
dtype: object
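The values look the same after encoding, so it helps to see what the category dtype stores. The sketch below uses plain pandas rather than PyJanitor and assumes we want the levels ordered low < medium < high; each label is stored once and every row holds an integer code.

```python
import pandas as pd

df = pd.DataFrame({'A': ['low', 'medium', 'high', 'medium', 'low']})

# An explicit ordered dtype keeps low < medium < high
level = pd.CategoricalDtype(categories=['low', 'medium', 'high'], ordered=True)
df['A'] = df['A'].astype(level)

codes = df['A'].cat.codes.tolist()
ordered_values = df.sort_values('A')['A'].tolist()
print(codes)           # [0, 1, 2, 1, 0]
print(ordered_values)  # ['low', 'low', 'medium', 'medium', 'high']
```

With an ordered category, sorting follows the declared level order instead of alphabetical order.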

5. Renaming Columns

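Renaming columns is a common task when cleaning data. When many columns need renaming at once, the clean_names function standardizes them in a single call by converting them to lowercase and replacing spaces with underscores. To rename one specific column, see rename_column later in this article.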

Python

import pandas as pd
import janitor

# Sample DataFrame with messy column names
data = {
    'First Name': [1, 2, 3, 4],
    'Last Name': [5, 6, 7, 8],
    'Age (Years)': [9, 10, 11, 12]
}
df = pd.DataFrame(data)

# Clean column names
cleaned_df = df.clean_names()
print(cleaned_df)

Output:

6. Filtering Data

Filtering data based on certain conditions is a common data cleaning task. PyJanitor provides the filter_string function, which keeps only the rows whose value in a given column contains a search string.

Python

import pandas as pd
import janitor

# Sample DataFrame
data = {
    'Name': ['Alice', 'Bob', 'Charlie', 'David'],
    'Age': [25, 30, 35, 40]
}
df = pd.DataFrame(data)

# Keep rows where Name contains a lowercase 'a' (the match is case-sensitive by default)
filtered_df = df.filter_string(column_name='Name', search_string='a')
print(filtered_df)

Output:

Pipe() Method in PyJanitor : Custom Functions

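The pipe() method is a pandas DataFrame method that passes the DataFrame to a function and returns the result.
Because PyJanitor's cleaning functions take a DataFrame as their first argument, pipe() lets us chain them, along with our own custom functions, into more readable code.
Each operation appears on its own line in the order it runs, making the sequence easier to follow. Here’s an example of how to use it.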

First, we import the necessary libraries and the functions we will use from the janitor library. Then we create some dummy company sales data in a dictionary. Finally, we build a DataFrame from this dictionary and, in the same expression, use pipe() to call the clean_names function and then the remove_empty function. The result is a DataFrame with cleaned column names and without the row whose values are all None.

Python

import pandas as pd
from janitor import clean_names, remove_empty

company_sales = {
    'SalesMonth': ['Jan', 'Feb', None, 'April'],
    'Company1': [150.0, 200.0, None, 400.0],
    'Company2': [180.0, 250.0, None, 500.0],
    'Company3': [400.0, 500.0, None, 675.0]
}

data = (
    pd.DataFrame.from_dict(company_sales)
    .pipe(clean_names)
    .pipe(remove_empty)
)
print(data)

Output:

  salesmonth  company1  company2  company3
0        Jan     150.0     180.0     400.0
1        Feb     200.0     250.0     500.0
2      April     400.0     500.0     675.0
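The example above pipes PyJanitor's own functions. Custom functions slot into a chain in the same way. The following is a minimal sketch using only pandas; drop_all_missing_rows and add_total are hypothetical helpers written for illustration and are not part of PyJanitor.

```python
import pandas as pd

def drop_all_missing_rows(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical custom step: remove rows in which every value is missing."""
    return df.dropna(how='all').reset_index(drop=True)

def add_total(df: pd.DataFrame, columns: list) -> pd.DataFrame:
    """Hypothetical custom step: add a 'total' column summing the given columns."""
    return df.assign(total=df[columns].sum(axis=1))

company_sales = {
    'SalesMonth': ['Jan', 'Feb', None, 'April'],
    'Company1': [150.0, 200.0, None, 400.0],
    'Company2': [180.0, 250.0, None, 500.0],
}

data = (
    pd.DataFrame(company_sales)
    .pipe(drop_all_missing_rows)
    .pipe(add_total, columns=['Company1', 'Company2'])  # extra arguments pass through pipe()
)
print(data)
```

Any function that accepts a DataFrame as its first argument and returns a DataFrame can be used this way, so custom steps and PyJanitor functions can be mixed in one chain.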

Exploring Different PyJanitor Functions

Now that we have covered the main features of PyJanitor, let’s look at a few more of its functions in detail.

1. fill_empty(data, column_names, value)

We can use the fill_empty function to fill the empty data points of a specific column with a value.

The first argument we need to pass is column_names. To fill the empty values of a single column, pass its name as a string.
To fill several columns, pass their names as a list. The second argument is value: the number or other object that replaces each missing (None or NaN) entry.

Now let’s see how to use this function with an example.

Python

import pandas as pd
import janitor

data = pd.DataFrame(
    {
        'col1': [1, 2, 3],
        'col2': [None, 4, None],
        'col3': [None, 5, 6]
    }
)

data.fill_empty(column_names=['col2', 'col3'], value=0)

Output:

   col1  col2  col3
0     1   0.0   0.0
1     2   4.0   5.0
2     3   0.0   6.0

2. filter_on(data, criteria, complement=False)

We can use the filter_on function to filter data based on a criteria expression. This function doesn’t mutate the original dataset; it returns a new DataFrame containing only the rows that satisfy the criteria.

The first argument is the criteria, written as a string such as "score >= 50".
The second is complement, which defaults to False. If we pass complement=True, we get the rows for which the criteria is False instead.

Now let’s look at an example of how to use this function.

Python

import pandas as pd
import janitor

data = pd.DataFrame({
    "student_id": ["S1", "S2", "S3", "S4", "S5"],
    "score": [45, 75, 50, 90, 30],
})

data.filter_on("score >= 50", complement=False)

Output:


  student_id  score
1         S2     75
2         S3     50
3         S4     90
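To illustrate what complement does, here is a plain-pandas sketch of the same filter. It assumes the criteria string is evaluated the way pandas evaluates expressions in DataFrame.eval() and DataFrame.query(); it is not PyJanitor's implementation.

```python
import pandas as pd

df = pd.DataFrame({
    "student_id": ["S1", "S2", "S3", "S4", "S5"],
    "score": [45, 75, 50, 90, 30],
})

mask = df.eval("score >= 50")   # boolean Series, one entry per row
passed = df[mask]               # like complement=False
failed = df[~mask]              # like complement=True

print(failed)
print(len(df))                  # the original DataFrame still has 5 rows
```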

3. rename_column(data, old_column_name, new_column_name)

As its name suggests, the rename_column function changes the name of a single column. The first argument, old_column_name, is the name of the existing column we want to change. The second, new_column_name, is the new name for that column.

Now let’s see how to use this with an example.

Python

import pandas as pd
import janitor

data = pd.DataFrame({"x": [10, 20, 30], "y":[40, 50, 60]})
data.rename_column(old_column_name='x', new_column_name='x_new')

Output:


   x_new   y
0     10  40
1     20  50
2     30  60

4. add_column(df, column_name, value, fill_remaining=False)

We can use the add_column function to add a new column to the dataset.

The first argument is column_name, the name of the new column.
The second argument is value. It can be a single value, such as value=3, which stores 3 in every row, or a sequence such as a list or a range that supplies a different value for each row.

Now, let’s see how to use this function with an example.

Python

import pandas as pd
import janitor

data = pd.DataFrame({"a": list(range(3)), "b": list("abc")})
data = data.add_column(column_name="c", value=1)
data = data.add_column(column_name="d", value=list("efg"))
data = data.add_column(column_name="e", value=range(4, 7))
data

Output:


   a  b  c  d  e
0  0  a  1  e  4
1  1  b  1  f  5
2  2  c  1  g  6
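For comparison, the same three columns can be added with pandas' own assign(). This plain-pandas sketch is not PyJanitor's implementation; it only illustrates the difference between a single value and a per-row sequence described above.

```python
import pandas as pd

df = pd.DataFrame({"a": list(range(3)), "b": list("abc")})

extended = df.assign(
    c=1,              # a single value is repeated in every row
    d=list("efg"),    # a list needs one entry per row
    e=range(4, 7),    # a range of the right length works too
)
print(extended)
print(list(df.columns))  # ['a', 'b'] (assign returns a new DataFrame)
```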

Conclusion

In conclusion, PyJanitor is a useful library for data cleaning in Python. Its functions make common cleaning steps simpler and faster to write. One of its main features is that multiple data cleaning operations can be chained into a single expression, which improves the readability of the code. PyJanitor covers basic cleaning operations and also provides functions for more complex ones. So the next time you need to clean data in a project, give PyJanitor a try.

Referred: https://www.geeksforgeeks.org
