
BOOK STORE SALES DATA CLEANING PROJECT - PYTHON (Click to view full python script here)

Introduction

Data cleaning is a fundamental step in any data analysis project. It involves preparing raw data for analysis by correcting errors, handling missing values, and ensuring consistency. This project is a step-by-step walkthrough of the data cleaning process for a book store sales dataset. The dataset was loaded into Python (in Visual Studio Code) using the pandas library and found to contain 2000 rows and 8 columns.
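As a rough illustration of that first look at the data, here is a minimal sketch. The original file name and column names are not given in this write-up, so the tiny in-memory CSV and its columns (`order_id`, `order_date`, `book_title`, `customer_phone`) are assumptions used only to make the example runnable.

```python
import io

import pandas as pd

# Hypothetical stand-in for the real file (the actual dataset has 2000 rows
# and 8 columns; these column names are assumptions for illustration).
SAMPLE_CSV = io.StringIO(
    "order_id,order_date,book_title,customer_phone\n"
    "1,2023-01-05,Dune,555-123-4567\n"
    "2,2023-01-06,Emma,(555) 987 6543\n"
)

# With the real data this would be a call such as pd.read_csv("<path to csv>").
df = pd.read_csv(SAMPLE_CSV)

print(df.shape)   # (number of rows, number of columns)
print(df.dtypes)  # data type pandas inferred for each column
print(df.head())  # first few records
```

Checking `shape` and `dtypes` right after loading is a quick way to confirm the row and column counts and to spot columns, such as dates, that were read in as plain text.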

Importance of Data Cleaning

Data cleaning is crucial because:

  • Accuracy: Ensures the data is correct and reliable.
  • Consistency: Standardizes data formats and values.
  • Completeness: Fills in missing values or removes incomplete records.
  • Efficiency: Reduces the time and effort required for analysis by eliminating errors and inconsistencies.

Without proper data cleaning, any analysis performed on the dataset could lead to incorrect conclusions and decisions.

Libraries Used

For this project, I used the following Python library:

  • pandas: For data manipulation and cleaning.

                            

[Figure: A LOOK AT THE DATASET]

DATA CLEANING:

Here are the key data cleaning steps I performed:

1. Checking Duplicates: I checked the dataset for duplicate records using pandas; none were found.

2. Handling Missing Values: Missing values were either filled in or the affected records removed, depending on the context.

3. Correcting Data Types: Ensured all columns had the correct data types. For example, date columns were converted to datetime objects.

4. Standardizing Text Data: Converted text data to a consistent format (e.g., all uppercase).
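Steps 1 to 4 can be sketched in pandas roughly as follows. This is a minimal sketch, not the project's actual script: the sample rows, the column names (`order_date`, `book_title`, `quantity`), and the choice of which missing values to fill versus drop are all assumptions made for illustration.

```python
import pandas as pd

# Hypothetical sample data; column names and values are assumptions.
df = pd.DataFrame({
    "order_date": ["2023-01-05", "2023-01-06", "2023-01-07", "bad"],
    "book_title": ["dune", " Emma ", None, "Ulysses"],
    "quantity": [1, None, 3, 4],
})

# Step 1: count fully duplicated rows, then drop them (none in this sample).
n_duplicates = int(df.duplicated().sum())
df = df.drop_duplicates()

# Step 2: fill where a default makes sense, drop where the record is unusable.
df["quantity"] = df["quantity"].fillna(0)
df = df.dropna(subset=["book_title"]).copy()

# Step 3: convert types; unparseable dates become NaT instead of raising.
df["order_date"] = pd.to_datetime(df["order_date"], errors="coerce")
df["quantity"] = df["quantity"].astype(int)

# Step 4: trim whitespace and convert text to uppercase.
df["book_title"] = df["book_title"].str.strip().str.upper()

print(n_duplicates)
print(df)
```

Using `errors="coerce"` keeps the conversion from failing on a single malformed date; the resulting `NaT` values can then be handled like any other missing value.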

5. Splitting Address into 3 separate columns: The address column was split into 3 columns (street address, city, and state) for better analysis.


[Figure: CODE FOR SPLITTING ADDRESS]
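The original code is not reproduced in this text, so the following is only a minimal sketch of one way to do the split. It assumes each address is a single comma-separated string in the order street, city, state; the column names `address`, `street_address`, `city`, and `state` are hypothetical.

```python
import pandas as pd

# Hypothetical sample addresses, assumed to be "street, city, state".
df = pd.DataFrame({
    "address": [
        "123 Main St, Springfield, IL",
        "45 Oak Ave,Portland,OR",
        None,  # missing addresses stay missing in all three new columns
    ]
})

# Split on commas into at most 3 parts, one part per new column.
parts = df["address"].str.split(",", n=2, expand=True)

# Remove stray spaces left around each part.
parts = parts.apply(lambda col: col.str.strip())

parts.columns = ["street_address", "city", "state"]
df = df.join(parts)

print(df)
```

Limiting the split with `n=2` guarantees exactly three columns; addresses containing extra commas would need additional handling.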

6. Standardizing Phone Numbers: The customer phone number column had many inconsistencies, such as nulls, irregular formats, and extensions. Regular expressions, among other methods, were used to standardize the phone numbers.
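A regular-expression approach to the phone number step might look like the sketch below. The target format `XXX-XXX-XXXX`, the assumption of 10-digit US-style numbers, and the helper name `standardize_phone` are illustrative choices, not details taken from the original script.

```python
import re

import pandas as pd


def standardize_phone(raw):
    """Return a phone number as 'XXX-XXX-XXXX', or None if it cannot be parsed.

    Hypothetical helper: assumes 10-digit numbers with an optional leading 1.
    """
    if pd.isna(raw):
        return None
    # Drop a trailing extension such as "x12" or "ext. 12".
    text = re.sub(r"\s*(?:ext\.?|x)\s*\d+\s*$", "", str(raw), flags=re.IGNORECASE)
    # Keep digits only, discarding spaces, dots, dashes, and parentheses.
    digits = re.sub(r"\D", "", text)
    # Drop a leading country code of 1 if present.
    if len(digits) == 11 and digits.startswith("1"):
        digits = digits[1:]
    if len(digits) != 10:
        return None
    return f"{digits[:3]}-{digits[3:6]}-{digits[6:]}"


phones = pd.Series(
    ["(555) 123-4567", "555.987.6543 x12", "+1 555 222 3333", None, "12345"]
)
cleaned = phones.map(standardize_phone)
print(cleaned)
```

Returning `None` for values that cannot be parsed keeps bad entries visible as missing data rather than silently passing them through in an inconsistent format.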

CONCLUSION:

Data cleaning is not just a preliminary step but a crucial process that significantly impacts the quality and reliability of the analysis. By investing time and effort in cleaning the data, we ensure that the subsequent analysis is robust, accurate, and valuable. This project highlights the transformative power of data cleaning in turning raw data into meaningful insights that drive better business outcomes. The dataset is now ready for analysis and data visualization.
