chapter 2 : Data cleaning

Data Types

Every column in a dataset has a data type — a category that tells a system how to interpret and use the values in that column. Common data types include text/string, numeric (numbers), date/time, and Boolean (true/false). The correct data type ensures that the data behaves as expected during calculations, sorting, filtering, and visualization. When values don’t match the expected type (for example, text in a numeric field), it can cause errors, miscalculations, and misinterpretation of results.

Example:

Product

Price

Book

250

Pen

“thirty”

In this example, the value “thirty” is text, but Price should be a numeric type to perform calculations (like total sales or average price).


Why Data Types Matter

Accurate Calculations: Numeric fields must be numbers for formulas (like sum or average) to work correctly.
Correct Sorting & Filtering: Dates must be in date format to sort chronologically.
Valid Visualizations: Tools need correct types to decide which charts are appropriate (e.g., line charts for numeric time series).
Using wrong types can lead to errors, wasted time, or misleading insights.


How Data Type Issues Are Usually Fixed

Here are common ways to handle data type problems during cleaning:
Convert Text to Numbers: If a numeric field has text (e.g., “thirty”), convert it to a numeric value or fix the format.
Convert Text to Date Formats: Ensure dates are stored as Date/Time types, not text, so tools can sort and analyze them correctly.
Check for Mixed Types: In a single column, move errors or mismatches to the correct format before analysis.