Chapter 2: Data Cleaning

Duplicate Data

Duplicate data refers to records that appear more than once in a dataset when the same information should exist only once. These repeated entries can bias analysis and distort results if they are not removed.

Example:

ID     Name    Score
101    Ravi    80
102    Mira    92
101    Ravi    80

Here, the first and last rows are exact duplicates of each other, so one of them should be removed and the other kept.


Why Duplicate Data Matters

Duplicate records inflate dataset size, waste storage, and skew statistics. They can lead to inaccurate insights, for example by making some patterns seem stronger than they are or by giving a false impression of volume in reporting.
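To make the effect on statistics concrete, the short sketch below computes the average score from the example table with and without the repeated row. The variable names are illustrative, and the rows are simply the three from the example above.

```python
# Rows from the example table: (ID, Name, Score)
rows = [(101, "Ravi", 80), (102, "Mira", 92), (101, "Ravi", 80)]

# Averaging over every row counts Ravi's score twice.
scores_all = [score for _, _, score in rows]
mean_with_duplicates = sum(scores_all) / len(scores_all)

# dict.fromkeys keeps the first occurrence of each row and preserves order.
unique_rows = list(dict.fromkeys(rows))
scores_unique = [score for _, _, score in unique_rows]
mean_without_duplicates = sum(scores_unique) / len(scores_unique)

print(mean_with_duplicates)     # 84.0
print(mean_without_duplicates)  # 86.0
```

The duplicate pulls the average toward Ravi's score, and the row count reports three records where there are only two.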


How to Handle Duplicate Data (Best Practices)

Identify Duplicates Based on Key Fields
Detect duplicates by comparing fields that should be unique for each record, such as an ID, a name + email combination, or another key combination, and checking whether any value occurs more than once.
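One way to sketch key-based detection in plain Python is to count how often each key combination occurs. The helper name find_duplicate_keys and the field names are assumptions made for this illustration, not part of any library.

```python
from collections import Counter

records = [
    {"id": 101, "name": "Ravi", "score": 80},
    {"id": 102, "name": "Mira", "score": 92},
    {"id": 101, "name": "Ravi", "score": 80},
]

def find_duplicate_keys(rows, key_fields):
    """Return the key tuples that occur more than once in rows."""
    counts = Counter(tuple(row[field] for field in key_fields) for row in rows)
    return [key for key, count in counts.items() if count > 1]

# Single-field key: which IDs repeat?
duplicate_ids = find_duplicate_keys(records, ["id"])

# Composite key: which (id, name) combinations repeat?
duplicate_pairs = find_duplicate_keys(records, ["id", "name"])

print(duplicate_ids)    # [(101,)]
print(duplicate_pairs)  # [(101, 'Ravi')]
```

Choosing the key fields is the important decision here: a key that is too narrow (name only) may flag different people as duplicates, while a key that is too wide may miss real ones.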

Remove Exact Duplicate Rows
Keep only one instance of each repeated record, either the first or the most relevant one, and remove the extra copies.
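A minimal sketch of the "keep the first instance" rule follows. It assumes each record is a dictionary whose values are hashable, and the function name drop_exact_duplicates is chosen for this example only.

```python
records = [
    {"id": 101, "name": "Ravi", "score": 80},
    {"id": 102, "name": "Mira", "score": 92},
    {"id": 101, "name": "Ravi", "score": 80},
]

def drop_exact_duplicates(rows):
    """Keep the first occurrence of each row, preserving the original order."""
    seen = set()
    unique = []
    for row in rows:
        # Dictionaries are not hashable, so build a hashable fingerprint
        # from the sorted (field, value) pairs.
        fingerprint = tuple(sorted(row.items()))
        if fingerprint not in seen:
            seen.add(fingerprint)
            unique.append(row)
    return unique

cleaned = drop_exact_duplicates(records)
print(len(records), len(cleaned))  # 3 2
```

Keeping "the most relevant" copy instead of the first would need an extra rule, such as preferring the most recently updated record, which this sketch does not attempt.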

Standardize Formats Before Removing
Normalize text fields before comparing them (e.g., consistent capitalization, trimmed whitespace, uniform formatting) so that near-duplicates become exact matches and can be detected.
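As an illustration, the sketch below normalizes names before comparing them. The normalization rules (trim, collapse repeated spaces, ignore case) are assumptions for this example; real data may need others, such as a common date or phone number format.

```python
def normalize(text):
    """Trim the ends, collapse runs of whitespace, and ignore case."""
    return " ".join(text.split()).casefold()

names = ["Ravi", " ravi ", "RAVI", "Mira"]

seen = set()
unique_names = []
for name in names:
    key = normalize(name)
    if key not in seen:
        seen.add(key)
        unique_names.append(name)  # keep the first spelling encountered

print(unique_names)  # ['Ravi', 'Mira']
```

Without the normalization step, all four strings would count as different values and none would be flagged.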

Use Automated Tools for Large Datasets
When working with large datasets, manual review is impractical. Automated deduplication tools or built-in platform features make the process faster and less error-prone.
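Many data tools include deduplication features; the pandas library, for instance, provides DataFrame.drop_duplicates. As a dependency-free sketch of the same idea, the example below streams a CSV file row by row and skips any row whose key has already been seen. The function name dedupe_csv and the in-memory sample data are assumptions for illustration; with real files you would pass open file objects instead.

```python
import csv
import io

def dedupe_csv(source, target, key_fields):
    """Copy CSV rows from source to target, skipping repeated keys.

    Returns the number of rows written. Only the set of keys is held
    in memory, not the full dataset.
    """
    reader = csv.DictReader(source)
    writer = csv.DictWriter(
        target, fieldnames=reader.fieldnames, lineterminator="\n"
    )
    writer.writeheader()
    seen = set()
    kept = 0
    for row in reader:
        key = tuple(row[field] for field in key_fields)
        if key in seen:
            continue  # duplicate key: drop this row
        seen.add(key)
        writer.writerow(row)
        kept += 1
    return kept

# Small in-memory stand-in for a large file.
source = io.StringIO("ID,Name,Score\n101,Ravi,80\n102,Mira,92\n101,Ravi,80\n")
target = io.StringIO()
kept = dedupe_csv(source, target, ["ID"])

print(kept)               # 2
print(target.getvalue())
```

Because rows are processed one at a time, memory use grows with the number of distinct keys rather than with the size of the file.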