
CUSTOMER LIST DATA CLEANING PROJECT - PYTHON (Click to view full Python script here)

Introduction:

Data cleaning is a fundamental step in any data analysis project. It involves preparing raw data for analysis by correcting errors, handling missing values, and ensuring consistency. This project is a step-by-step walkthrough of the data cleaning process for a book store sales dataset. The dataset was imported into Python (in Visual Studio Code) using the pandas library and was found to contain 2,000 rows and 8 columns.
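As a rough illustration of that import-and-inspect step, here is a minimal sketch. The column names and the two sample rows are assumptions made up for this example, and an in-memory string stands in for the real file, which would normally be passed to pd.read_csv as a path.

```python
import io

import pandas as pd

# Hypothetical stand-in for the real CSV file; the column names and rows
# below are assumptions for illustration, not the project's actual data.
sample_csv = io.StringIO(
    "CustomerID,First_Name,Last_Name,Phone_Number,Address\n"
    '1001,Frodo,Baggins,123-545-5421,"123 Shire Lane, Shire"\n'
    "1002,Abed,Nadir,123/643/9775,93 West Main Street\n"
)

# In the real project this would be pd.read_csv("<path to the dataset>").
df = pd.read_csv(sample_csv)

# Quick first look: size, column types, and missing values per column.
rows, cols = df.shape
print(f"{rows} rows, {cols} columns")
print(df.dtypes)
print(df.isna().sum())
```

On the full dataset, the same df.shape check is what would report the 2,000 rows and 8 columns mentioned above.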

Importance of Data Cleaning:

Data cleaning is crucial for four reasons:

  • Accuracy: Ensures the data is correct and reliable.
  • Consistency: Standardizes data formats and values.
  • Completeness: Fills in missing values or removes incomplete records.
  • Efficiency: Reduces the time and effort required for analysis by eliminating errors and inconsistencies.

Without proper data cleaning, any analysis performed on the dataset could lead to incorrect conclusions and decisions.

Libraries Used:

For this project, I used one Python library:

  • pandas: For data manipulation and cleaning.

                            

[Figure: A look at the dataset]

DATA CLEANING:

Here are the key data cleaning steps I performed:

1. Checking for Duplicates: The dataset was checked for duplicate records using pandas; none were found.

2. Handling Missing Values: Missing values were filled or removed, depending on the context.

3. Correcting Data Types: All columns were given the correct data types; for example, date columns were converted to datetime objects.

4. Standardizing Text Data: Text data was converted to a consistent format (e.g., all uppercase), and extraneous, inconsistent elements were removed.

5. Splitting Address into 3 Separate Columns: The address column was split into three columns (street address, state, and zip code) for better analysis.


[Figure: Removing inconsistencies from the last name column]

6. Standardizing Phone Numbers: The customer phone number column had many inconsistencies, such as nulls and irregular formats. Regular expressions and lambda functions were used to bring the phone numbers into a single format.

7. Removing Unwanted Columns: Columns that were not suitable for contact purposes were removed from the dataset using a for loop.
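The seven steps above can be sketched roughly as follows. This is an illustrative outline rather than the project's actual script: the sample rows, the column names (last_name, order_date, phone, address, not_useful), the "UNKNOWN" placeholder, and the XXX-XXX-XXXX target phone format are all assumptions chosen for the example.

```python
import pandas as pd

# Hypothetical sample; column names and values are assumptions, not the real data.
df = pd.DataFrame({
    "last_name": ["  baggins_", "Nadir", "/White", None],
    "order_date": ["2024-01-05", "2024-02-10", "2024-03-15", "2024-04-20"],
    "phone": ["123-545-5421", "123/643/9775", "N/a", None],
    "address": [
        "123 Shire Lane, Shire, 11111",
        "93 West Main Street, Ohio, 22222",
        "298 Drugs Driveway, New Mexico, 33333",
        "1 Rock Road, Texas, 44444",
    ],
    "not_useful": [True, False, True, False],
})

# 1. Duplicates: count fully identical rows, then drop any that exist.
n_duplicates = int(df.duplicated().sum())
df = df.drop_duplicates().reset_index(drop=True)

# 2. Missing values: fill with a placeholder here (dropping rows is the alternative).
df["last_name"] = df["last_name"].fillna("UNKNOWN")

# 3. Data types: convert the date column from text to datetime.
df["order_date"] = pd.to_datetime(df["order_date"], format="%Y-%m-%d")

# 4. Text: strip stray spaces and symbols from both ends, then uppercase.
df["last_name"] = df["last_name"].str.strip(" ._/").str.upper()

# 5. Address: split on commas into three new columns and trim whitespace.
parts = df["address"].str.split(",", n=2, expand=True)
df[["street_address", "state", "zip_code"]] = parts.apply(lambda col: col.str.strip())

# 6. Phone: keep digits only (regex), then reformat valid 10-digit numbers (lambda).
digits = df["phone"].fillna("").str.replace(r"\D", "", regex=True)
df["phone"] = digits.apply(
    lambda d: f"{d[0:3]}-{d[3:6]}-{d[6:10]}" if len(d) == 10 else ""
)

# 7. Unwanted columns: drop each one in a for loop if it is present.
for column in ["address", "not_useful"]:
    if column in df.columns:
        df = df.drop(columns=column)

print(df)
```

In this sketch, phone values that do not reduce to exactly ten digits are blanked rather than guessed at, which keeps invalid numbers easy to spot later.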

CONCLUSION:

Data cleaning is not just a preliminary step but a crucial process that significantly affects the quality and reliability of the analysis. By investing time and effort in cleaning the data, we ensure that the subsequent analysis is robust, accurate, and valuable. This project shows how data cleaning turns raw data into a form that can yield meaningful insights and support better business decisions. The dataset is now ready for analysis and data visualization.
