Lesson video

Hello, my name is Mrs. Jones, and I'm really pleased that you've decided to join this lesson today.

In this lesson, we will look at data cleansing and the impact on visualisations if the data is not accurate.

We will then look at how to cleanse data and a selection of examples before you visualise your own data set.

So let's get started.

Welcome to today's lesson.

Today's lesson is called "Data cleansing" from the unit "Data science." And by the end of the lesson, you'll be able to describe the need for data cleansing and apply data cleansing techniques to a data set.

There is one keyword to today's lesson: data cleansing.

Data cleansing is the process of detecting and correcting, or removing, corrupt or inaccurate data.

There are three sections to today's lesson.

The first is describe the need for data cleansing.

The second is apply data cleansing techniques to a data set.

And the third, visualise a data set.

So let's start with describe the need for data cleansing.

Once you have collected the data needed to help you answer a question, you should look through it to see if it needs cleansing.

You can see that this falls in the data section of the PPDAC framework.

Izzy says, "Well, how can you clean data?" Data cleansing involves detecting and correcting, or removing, corrupt or inaccurate data.

Here is a data set that contains data about zoo animals.

You can see we have name, animal type, weight, height, age, and gender.

Spend some time looking at the data and see whether it is ready for analysis.

Pause the video to have a quick look at that data before we move on.

Can you spot any errors in the data? So look at the data.

Did you find any errors that we need to rectify, or cleanse, before we move on? Pause the video, look at the data, and consider what we need to fix.

Let's check your answer.

First, we have missing or incomplete data, and that is where the cells are empty.

We have duplicate data.

We have two entries here for Alexander Meerkat.

We also have invalid data, data that's outside the normal range.

Well done if you got those correct.

How could these errors have happened? Here on the left, we have the original data collection form, and we also have the entry that was made in the table.

And Sam says, "It looks like the original form was completed correctly, but the error was made when entering the data electronically. Height has been entered as 3 instead of 0.3."

Cleansing data is important because incorrect data can significantly impact findings and visualisations.

Before cleansing, the visualisation shows that the average height of a meerkat is greater than that of a lion.

After cleansing, this now looks more in line with what we would expect: a meerkat is not as tall as a lion.
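
As a rough illustration of this effect, here is a minimal Python sketch of how a single data-entry error can distort an average. The heights are invented for illustration and are not taken from the zoo data set; only the "3 instead of 0.3" error comes from the example above.

```python
from statistics import fmean

# Hypothetical heights in metres, invented for illustration.
meerkat_heights_raw = [0.3, 0.3, 3.0]  # the last value was typed as 3 instead of 0.3
lion_heights = [1.1, 1.2]

# Correct the data-entry error using the value on the original form.
meerkat_heights_clean = [0.3, 0.3, 0.3]

average_before = fmean(meerkat_heights_raw)
average_after = fmean(meerkat_heights_clean)
lion_average = fmean(lion_heights)

print("Meerkat average before cleansing:", round(average_before, 2))
print("Meerkat average after cleansing:", round(average_after, 2))
print("Lion average:", round(lion_average, 2))
```

With these made-up numbers, the uncleansed meerkat average comes out above the lion average, and the cleansed one comes out below it.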

Decisions based on "garbage in" lead to "garbage out." This means without data cleansing, your data isn't reliable.

Clean data ensures that decisions are based on accurate and reliable information.

By eliminating duplicates and correcting errors, you can trust your data.

Let's have a quick check.

What is the primary goal of data cleansing? Is it A, to increase the size of the data set, B, to anonymise personal data, or C, to remove inaccurate or incomplete records?

Pause the video to consider your answer, and then we'll check it.

Let's check your answer.

The answer was C, remove inaccurate or incomplete records.

Well done for getting that correct.

Let's look at an activity.

Describe why data cleansing is important and needed when collecting data.

Pause the video, go back through the slides, use your worksheet, and then we'll check your answer.

Let's check your answer.

Raw data often contains typos, missing values, duplicates, and errors.

Cleansing identifies and corrects issues, ensuring your data set is accurate and trustworthy for decision-making.

Well done if you got that correct.

Let's look at the second part of today's lesson, apply data cleansing techniques to a data set.

There are a number of different techniques which can be used for data cleansing.

These include examining the data set for errors, removing duplicates, entering missing data, and validation.

Let's have a quick check.

Which of the following is an example of data cleansing? A, printing out the data, B, removing duplicate entries from a data set, or C, sorting data in alphabetical order?

Pause the video to consider your answer, and then we'll check it.

Let's check your answer.

The answer was B, removing duplicate entries from a data set.

Well done if you got that correct.

Examining the data set for errors.

This involves looking at a data set to try to spot anomalies, missing data, or inconsistent patterns.

It is unlikely that a meerkat is a similar height to an elephant, so this flags to us that the data might be incorrect and that there is an error.
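
A check like this can also be automated. The following is a minimal Python sketch, assuming each animal type has a plausible height range. The ranges, names, and records are hypothetical and are not taken from the zoo data set.

```python
# Hypothetical plausible height ranges in metres, invented for illustration.
PLAUSIBLE_HEIGHT = {
    "meerkat": (0.2, 0.4),
    "elephant": (2.0, 4.0),
}

# Hypothetical records, invented for illustration.
records = [
    {"name": "Milo", "animal": "meerkat", "height": 3.0},
    {"name": "Nellie", "animal": "elephant", "height": 3.1},
    {"name": "Pip", "animal": "meerkat", "height": 0.3},
]


def find_suspect_heights(rows):
    """Return the rows whose height is outside the plausible range for that animal."""
    suspects = []
    for row in rows:
        low, high = PLAUSIBLE_HEIGHT[row["animal"]]
        if not low <= row["height"] <= high:
            suspects.append(row)
    return suspects


suspects = find_suspect_heights(records)
for row in suspects:
    print(row["name"], row["animal"], row["height"])
```

A flagged row is only a prompt to check the original source; the sketch does not decide what the correct value should be.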

Removing duplicates.

This involves looking at a data set to see if there are any duplicates, where the same data has been entered more than once, and removing them from the data set.

It's unlikely that there are two meerkats called Alexander in the same zoo.

The two entries need to be reduced to one because the record has been added twice.
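
Removing exact duplicates can be sketched in Python as follows. This is a minimal illustration, assuming a duplicate means a row where every field matches; the repeated Alexander entry mirrors the example above, but the other row and all the weights are invented.

```python
# Hypothetical rows; the weights and the lion entry are invented for illustration.
rows = [
    {"name": "Alexander", "animal": "meerkat", "weight": 0.7},
    {"name": "Leo", "animal": "lion", "weight": 190.0},
    {"name": "Alexander", "animal": "meerkat", "weight": 0.7},
]


def remove_duplicates(rows):
    """Keep the first copy of each row and drop exact repeats, preserving order."""
    seen = set()
    unique = []
    for row in rows:
        key = tuple(sorted(row.items()))  # a hashable fingerprint of the whole row
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique


cleaned = remove_duplicates(rows)
print(len(rows), "rows before,", len(cleaned), "rows after")
```

Rows that differ in any field are kept, so near-duplicates, such as the same animal with a mistyped weight, would still need checking by eye.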

Entering missing data.

This involves looking at a data set to find missing values and then trying to find out what these values should be.

Sometimes you may not have the original data, so you may have to add an average value or remove the record.

And here, we're missing Stretch's age.

Let's have a quick check.

True or false? If you have missing data and cannot find the original source, there is nothing you can do.

Pause the video to consider your answer, and then we'll check it.

Let's check your answer.

The answer is false.

Pause the video to consider why it is false, and then we'll check your answer.

Let's check your answer.

You could look at similar data and generate an average value to insert.

For example, if the zoo is missing a weight for a lion, they could calculate the average weight of the other lions and enter this.

Well done if you got those correct.
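
The average-value approach from that check can be sketched in Python. This is a minimal illustration, assuming the missing value is stored as None; the lion names and weights are hypothetical.

```python
from statistics import fmean

# Hypothetical lion weights in kilograms; None marks the missing value.
lion_weights = {"Lion A": 190.0, "Lion B": 130.0, "Lion C": None, "Lion D": 160.0}

# Average only the weights that are actually known.
known_weights = [w for w in lion_weights.values() if w is not None]
average_weight = fmean(known_weights)

# Fill the gap with the average, leaving the known values unchanged.
filled_weights = {
    name: (w if w is not None else average_weight)
    for name, w in lion_weights.items()
}

print(filled_weights)
```

An inserted average is an estimate rather than a measurement, so it is worth noting which values were filled in this way.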

Let's do an activity.

Open the spreadsheet file zoo-data.xls, available as a downloadable file for this lesson.

Open the original data file collection-forms.pdf as well, which is also available as a downloadable file for this lesson.

Use the original data collection forms to clean the data set.

Check that the visualisations have updated with the corrected information.

Pause the video, use those sources, complete the activity, and then we'll check it.

Let's check your answer.

There was some missing data that needed to be added, there was inaccurate data that needed to be corrected, and there was duplicate data that needed to be removed.

Well done if you got those correct.

Let's move to the last section of today's lesson, visualise a data set.

What is the main purpose of a data visualisation? Is it A, to make numbers look more fun, B, to hide confusing information, or C, to help people understand data easily.

Pause the video to consider your answer, and then we'll check it.

Let's check your answer.

The answer was C, to help people understand data easily.

Well done if you got that correct.

You may have collected some data on reducing litter and recycling in your school.

If you haven't, then some sample data is provided as a downloadable file for this lesson.

Online forms often have the ability to visualise responses.

Some of them may be helpful to us, but none of them will show the potential relationships or correlations between the variables.

Before we use a tool to help us create visualisations, check the data for any errors and download the data to a spreadsheet.

So you can see on here, we have responses and the visualisations that are created for each individual question.

In Google Forms, we need to click on Responses to see those visualisations, but we also have an option at the top there to click on Link to Sheets to open the data in a spreadsheet.

That will open the spreadsheet in Google Sheets for you to be able to use.

You are now going to upload your data into CODAP to help you analyse it.

CODAP accepts data as plain text, not a spreadsheet file type.

You need to save your data from the spreadsheet as a CSV file.

To do that, you need to download the spreadsheet as a CSV file, and you may have to go to Download or Save as and select Comma-separated values, which is .csv.

The illustration here shows Google Sheets, where you click File, Download, and then select Comma-separated values (.csv).
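
To show what "plain text" means here, the following is a minimal Python sketch that writes a couple of rows as comma-separated values using the standard csv module. The column names and rows are hypothetical and are not taken from the sample data file.

```python
import csv
import io

# Hypothetical litter survey rows, invented for illustration.
rows = [
    {"location": "Dining hall", "litter_type": "Food packaging"},
    {"location": "Languages corridor", "litter_type": "Paper"},
]

# Write to an in-memory text buffer; a real file opened with
# open("data.csv", "w", newline="") would work the same way.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["location", "litter_type"])
writer.writeheader()
writer.writerows(rows)

csv_text = buffer.getvalue()
print(csv_text)
```

The result is one header line followed by one line per row, with the values separated by commas.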

Let's have a quick check.

True or false? CODAP accepts data as XLS, which is a spreadsheet file type.

Pause the video to consider if that is true or false, and then we'll check your answer.

Let's check your answer.

The answer is false.

Pause the video to consider why it is false, and then we'll check your answer.

Let's check your answer.

CODAP only accepts data as plain text, which is not a spreadsheet file type, so you have to download your spreadsheet file as a CSV file before uploading it.

Well done if you got those correct.

Go to oak.link/codap-new.

Select Launch CODAP.

And then select Open document.

The screenshot here shows where it says Open document or Browse examples.

And we want to open a document, so we need to click that button.

The data should then show in CODAP once you've clicked Upload so you can start creating your visualisation.

Click on Graph to generate a visualisation.

To change the data you are looking at in the visualisation, you can click the column heading and drag it across to either the vertical axis, the y-axis, or the horizontal axis, the x-axis.

This visualisation here is showing the type of litter by the location.

So you can see on the left, it says the type of litter: drinks bottle, drinks carton, food packaging, food waste, paper, and stationery. Then along the x-axis, the horizontal one, are the locations: English corridor, IT corridor, PE corridor, art corridor, dining hall, languages corridor, maths corridor, and technologies corridor.

Let's do an activity.

One, write a question that you want to investigate using the data.

Note, you may have created these questions in a previous lesson.

Two, create a visualisation that helps to answer the question.

Three, write a few sentences to describe what you have learnt from the visualisation and whether it helps you to answer the question.

Pause the video to complete this activity, and then we'll go through an answer.

Let's check your answer.

The first part of this activity was write a question that you want to investigate using the data.

In this example, we have: are there enough bins? The second part was create a visualisation that helps to answer the question.

And here we've got the location on the left, the y-axis, the vertical axis, and then on the bottom, on the x-axis, the horizontal axis, we have the distance from the nearest bin.

You can see all the dots there representing where the data has been plotted in that visualisation.

And the third part was write a few sentences to describe what you have learnt from the visualisation and whether it helps you to answer the question.

The visualisation shows that most of the litter is in the dining hall where there are bins close by, so there is no link between littering and the distance to bins here.

However, there were a few occurrences of litter in the languages corridor where the distance from the bins was greater.

This suggests that more bins are needed in the languages corridor.

Well done for completing that activity.

In summary, data cleansing involves detecting and correcting, or removing, corrupt or inaccurate data.

Data cleansing is important because real-world data is often messy, with errors or missing information.

Once the data is clean, charts or graphs can be created to help understand patterns and trends.

Well done for completing this lesson, "Data cleansing."

Files you will need for this lesson

Download these files to use in the lesson.
  • collection-forms, 14.88 MB (PDF)
  • litter-sample-data, 20.7 KB (XLSX)
  • zoo-data, 29.08 KB (XLSX)