video

Lesson video

In progress...

Loading...

Hello and welcome to lesson five of our Data Science unit.

I'm Ben and this lesson's called Clean It Up.

Now hopefully you're coming to this lesson all equipped with the data that you've collected since last lesson, but don't worry if you didn't do it, we've got a set of data that you can use.

So, we're going to take some data, we're going to look at it and hopefully do some data cleansing and once we've done all of that, we'll be in a position where we can start analysing the data and visualising it as well.

Okay, so all you'll need for this lesson is a computer and a web browser, and other than that, if you can clear away any distractions that you might have, perhaps turn off your mobile phone and if you've got a nice quiet place to work, that'd be great too and when you're ready, we'll get started.

Okay, so in this lesson, you're going to describe the need for data cleansing, and then you're going to apply some of those data cleansing techniques to a data set.

Before then, we're going to start working on your data set that you collected about the litter in your local area.

So, we will take that data, we'll download it, maybe do some data cleansing if it's needed and then we're going to upload it into that CODAP platform that we looked at in lesson three, where we looked at the rollercoaster examples and we're going to use that same platform to analyse and visualise your data, okay? So let's get started with an exercise.

Now this exercise is called Clean it up.

So, if we look at the PP that cycle, we're now going to look at the data step.

Now the data step's where we collected the data, but once we have the data we need to help us answer the question, we need to look through it to see if it needs cleansing.

Now cleansing involved detecting and correcting or removing corrupt or inaccurate data, okay? Now it goes back to that phrase that you might have heard before which is garbage in, garbage out.

If your data is bad quality, you can never guarantee that the analysis that you're going to do, or the conclusions that you're going to make are also going to be good quality, but if we can try and make the data as best it can be, as clean as it can be, then we got more chance of being able to trust the data and get some good results or trust the analysis or conclusion that we're going to make, okay? Our exercise involves zoo animals.

So you're going to be given a data set about some zoo animals.

So what I'd like you to do is spend some time looking at the data to see whether or not it's ready for analysis.

So essentially we're asking you, you need to look at it to see whether or not it needs cleansing, okay? So I'd like you to check if there's any anomalies with the data, anything you think doesn't look quite right and then to see if there's something you can do about that, if there's anything you can do to correct this data.

So let me just head over to this spreadsheet now that you're going to have access to.

Now a link to this spreadsheet is on your worksheet as well.

So I'll show you this.

So this is the spreadsheet you'll be given and you can see this is the data on the left hand side, okay, we've also got some data on the right hand side as well.

We've also got some gross, which can visualise the data to a certain extent.

Now, I'd recommend you spend a little bit of time just looking through the spreadsheet and then you might want to look at the graphs 'cause the graphs might also help you spot maybe things that don't quite look right, and then also there's an interesting hyperlink which you might want to look at as well, okay? So, I'm just going to head back over to the slides.

Okay, so I'd like you to now pause the video, go to task one on your worksheet where you'll find a link to that spreadsheet.

Now, the spreadsheet is in Google Forms, so if you have got your own Google account, you can go to file and make a copy if you wanted to change some of the data around, but if you haven't got Google, then you might want to download it as maybe an Excel Spreadsheet.

In which case, you can go file, download and then select Microsoft Excel, okay? So pause the video, have a go at task one, see if you can cleanse the data, see if you can solve some of the problems in there and then unpause when you're ready.

Okay, so how did you get on with that? Do you think you were able to spot all the areas that were there? I mean, there was certainly some data that did need cleansing.

And also, hopefully you also had an opportunity to cleanse the data as well, did you find out how to correct the data? Well, let's go through some of the answers.

So first of all, there were some missing values in our data.

So Stretch, our giraffe, didn't have an age recorded in our data and also Sher Lion didn't have a weight.

And of course, that can cause all sorts of problems, particularly if we're tryna find our averages then that's going to skew the data somewhat.

Okay, so let's have a look at another mistake that was in our data.

Okay, so there was also a duplicate entry, I'm not sure whether you spotted that, but Alexander Meerkat appeared twice in our data, okay? So that was obviously unnecessary repetition when actually it's the same data, so we don't want to record that twice.

Another mistake that was made in this was there was some invalid data that maybe fell outside of the normal range.

Now if you looked at this or studied this in detail, I mean, sometimes it's not clear whether or not something falls outside of a normal range, but certainly, in some of these instances, they definitely did.

I'm not sure whether you're aware of meerkats, but they tend to be about this big, not very big or in fact, 0.

32 metres, okay? So that's not very big at all.

However, Bogdan and Alexander were three metres tall.

I mean, that's a huge terrifying meerkat, the size of an elephant in fact.

So that's clearly wrong, we have no meerkats that are three metres tall.

So maybe it was missing a decimal place there, okay? And also, our elephant, Drona, had a weight of zero and clearly, no elephant's got a weight of zero.

So that's definitely outside the normal range, okay? Now those were the mistakes that were made and maybe even, I asked you to, or hinted, that you should look at the graphs as well, because the graph may have indicated some problems. For example, if we look at the before graph at the top here, it says that, you can see the elephants and giraffes, but you can see the meerkats are taller than lions because we had those three meerkats that were three metres in height and therefore, that skewed the averages so that meerkats looked like they're taller than the lions, and clearly meerkats aren't taller than lions, so that gave us some indication there.

But once we corrected that data, you can look at the graph below, then actually you can see now that graph starting to look about right in terms of orders of height.

So how do we correct the data? Well often sometimes we work out that there are problems with data.

I mean, you could just put in the average data in there, so you could work out the average height of all the meerkats and just put in the average height for the ones that you knew are wrong, but that doesn't necessarily mean you're going to get accurate results.

So, if possible you should always go back to the original data source.

So there's a hyperlink on that spreadsheet that took you to the original data capture forms that we're collected by the zoo keepers.

So we could have scrolled through those to find the original data and then populated the spreadsheet with the ones where we noticed there was inaccuracies or data missing, for example.

Now computers can really help us out with this as well.

So it's not just a human process.

I think particularly when we're looking at really large data sets, I think it's important for the human to, first of all, look at the data and get a sense of what the normal value should be or get a sense of the problems that might be occurring, but with large data sets, it's really difficult for a human to spot all the possible mistakes.

So there are certainly coding techniques that can be put in place where you can programme something to look through and teach it what the normal value should be and whether a missing value is to maybe remove it or put in the average, like we talked about before, okay? So let's actually move onto the next stage of our process which is looking at our data and downloading it.

Okay, so this next section is split into two parts.

So it's the downloading of the data and then uploading it into the CODAP platform, okay? So I'm going to show you if you've put your data into a Google Form or a Microsoft Form, or any kind of online form platform, I'm going to show you how you can download that as a spreadsheet.

Now, if you entered the data straight away manually into a spreadsheet, then you don't need to do the downloading part, you can skip right ahead into the uploading part to this lesson, okay? So, let's do the downloading part.

Okay, so we'll start off by looking at these online forms, okay? Now online forms are really great 'cause often they have the ability to visualise data for us, but we need to be a little bit wary about that because although it's really good to have the visualisation, it might not be helpful to us to help us answer the questions to the problems we set out right at the beginning of the project.

Because to really have a good question, or focus question, we needed to compare two variables at least.

Now, these visualisations are great, but they're not comparing variables.

They're showing an answer to one part of the question.

So they might help us, certainly, help us form our opinion or conclusion at the end of this, but we really need to be doing a little bit more than that with our analysis and visualisation by using the CODAP platform to compare two or more variables, okay? Now, I'm going to show you how to download it using a Google Form, but the process I'm about to show you now is very, very similar if you were using Microsoft Forms, okay? So I'm going to head over to my Google Form now, okay, so, we can see from the Google Form we have our questions, but when people start responding to it or when you've entered things in, you'll notice we have the thing called responses and it shows you how many responses they've been, okay? Now don't worry, I asked you to collect a minimum of 20, so if you need the data set that I've already created, then it has got more than that so don't worry, but this is just for demonstration purposes.

So we click on responses, then you'll be able to see the graph, as they appear in a second.

There you go, so you can see the different types of visualisations that have been made.

So, once we've had a look at that, you'll notice that in a slightly discreet way, we're just going on to the top, we've got this little green spreadsheet looking icon at the top there, okay? So if we click our mouse on that icon, then what it's going to do is take these responses and put them into a spreadsheet.

Now you can either put them into an existing spreadsheet or you can create a new spreadsheet, okay, I'm just going to call mine demonstration for now.

But you obviously give it a sensible name.

You give it a sensible name relating to the data it is, okay? I've not spelt demonstration correctly, there we go.

So, you give it a name that's going to relate to this, so it might be litter data in my local area spreadsheet or something like that, okay? But something that's going to help you remember when you next look at that spreadsheet.

So I'm going to click create, okay? And hopefully it's going to create a new spreadsheet for me, as you can see here, I'm just opening it up and it's going to take its time to load up and do that for me, and there you go.

So all the results have now been put in a spreadsheet.

So this is now going to be something that I can now use to upload into CODAP platform, okay? So I'd like you to, and it's as simple as that and like I say, if you're using Microsoft Forms, it's a very, very similar process, just look for the option to download it as a spreadsheet, okay? So all I'd like you to do now is I'd like you to pause the video now and like I say, only do this if you used one of the online forms. If it's already in a spreadsheet, you don't need to do this, but if you have to, just pause the video, it'll take you about two, three minutes to do that and when you've done that then unpause the video and we'll look at what we need to do to upload it into the CODAP platform, okay? Okay so now you've downloaded the data as a spreadsheet, we're now going to upload our data into the CODAP platform that you used in lesson three, if you remember, to look at the rollercoaster data so that we can visualise the data and start to analyse it.

Now, it's worth noting that COPAD only accepts data as plain text, as in, not in a spreadsheet file type.

So what we need to do is save our data from the spreadsheet into a csv file.

Now trust me, that sounds way more complicated than it actually is, okay? Now csv file, which stands for comma-separated values, is a plain text file format.

So what it does, it converts it from a spreadsheet into just normal text, but then puts commas between each values, okay? So I'm going to demonstrate how to do that, but also on your worksheet, I have a list of steps for you to follow.

So I'm going to go through this process which is to save my spreadsheet as a csv file and then go to open the CODAP website, which is the web address is there, then we're going to create a new document by selecting Try CODAP and then upload it as a csv file, okay? So let's go through those steps now.

So I'm going to start off with my spreadsheet, okay? Now you can see my spreadsheet on the screen there, so what we need to do is we click on file, if you're using Google Sheets you go file and then you select download and then you can select comma-separated values there and if I click on that, it'll start downloading it locally to my device, okay? If you're using Microsoft Excel, it's as simple as going to file, save as and then where it says save as type, there's a drop down list and again, from that drop down list, just select a csv file, okay? And then that'll save it onto your computer or your device and then you're ready to use the CODAP website.

Now, like I said, on the worksheet, there's a link to the CODAP website, but I'm going to show you this now.

So I'm going to select the CODAP website there.

Now I said what you need to do is select Try CODAP.

So at the top right hand side you can select Try CODAP and that will then open up a new window and it'll ask you, do you want to open a document or browser examples, or do you want to create a new document? Well, what I need to do is create a new document 'cause we haven't got anything already preloaded up.

So once you've selected new document, we then, on the top left hand side, you can see there's the three lines in a little drop down list, so see if you can spot that, it's a small icon at the top left hand side.

So once you've selected the drop down list, you then select import and then that'll bring up an option for a local file.

So you need to either drag the file into that window or just click on that window there and select that file that you've just downloaded, okay? And once you've done that, it'll then upload it into the CODAP platform and you'll have something that looks not too dissimilar to this, okay? So when you can see the data uploaded in table format, that's as far as I want you to get before continuing with the rest of the lesson, okay? So I would like you to, if possible please, I'd like you to pause the video now and, oh I've gone to far, I'd like you to pause the video now and I would like you to follow those steps that I just showed you, but remember, they're also on your worksheet as well, okay? So you're going to download as a csv file, upload it into CODAP and when you've done all of that, I'll be here when you get back.

Okay, so the next part of our lesson, now that we've downloaded the data and we've uploaded it into our CODAP platform, you're really ready to start the analysis section of this lesson, okay? So, what I'd like you to do is go over to your task three on your worksheet where there's a template there for you to record the findings that you need to make, okay? Now, to do this, we need to use CODAP to make the visualisation to help answer your questions, okay? So, the important thing is, right back in lesson four, you posed two questions.

So in this lesson, we're just going to look at one of those questions, not both of them, okay? And then next lesson, we'll look at the second one.

So the task is just to look at one of the questions.

So you pick which question you'd like to analyse first of all, okay? So pick that question and then let me head over to the worksheet so I can show you what I'm talking about, so.

Okay, so you can see that we've got space for your question to be written in there, so you can either copy and paste it or type it out again, okay? I've given space for you to take a screenshot of your visualisation and then as you've done in previous lessons, all you need to do here is write a few sentences about what your findings were, okay? Remember you should be thinking about, does this help answer the question? Does it give you a positive answer or a negative answer? Does it prove your theory or not? Okay? So, let's just head over to the CODAP website so I can give you a quick reminder about how we did this.

So hopefully you're at a point now where you've got your data loaded up.

So we can get rid of some of these windows, so this information about the data source, I don't really need that, so I can get rid of that and if you want to make the window a bit bigger so you can see the data, you can use the dragging tool, so to make that window, as I'm doing here, a little bit bigger, okay? Now we want to visualise the data now, so we need to select the graph tool, okay? So in this menu on the top left hand side, just select graph, okay, and what it will do is it will create a mini graph for you already and you can move that around and resize it as well, okay? Now the graph is just plotted at the dots of the different data that is on there, okay? Now in itself, this isn't particularly helpful in the format it is in, okay? So you've got to use the variables to help you answer your question, okay? So it may well be that you wanted to find out something about the compare the type of litter to the distance to the nearest bin, for example, okay? So to do that, I'm going to get the title of type of litter, I'm going to click my mouse plotting down and drag it down to whichever axis you want it on, okay? So let's say I was on the X axis, I'm just going to drop it there and you can see that it's going to group the data together.

So we can see a visualisation starting to happen.

But we're comparing two variables, so the distance to the nearest bin was the other thing that I wanted to compare, okay? So I'm going to grab that title, and put it onto the Y axis, okay and there we go, we're starting to see maybe a more useful visualisation that we can use.

And just to, I'm not convinced this will be particularly useful or this will answer many of your questions, but let's say for example you also wanted to know whether or not it was recyclable or not.

So I'm going to select that one.

Now we haven't got a space on the X and the Y axis, so I'm just going to put that into the middle of the graph and then let go and then you remember, it changes the colours of the dots to reflect the key that we got in place there, so it's a yes or no and now we can see, not only can we see the type of litter, we can see distance to nearest bin and whether or not it was recyclable, okay? It may well be that you don't think that's a particularly great way to compare them, so let me just close that off and I'm going to create a separate graph.

What might be more useful is I'm going to put instead, I'm going to put in distance nearest bin on the X axis, I'm going to put whether or not it was recyclable, I'm going to put that on the Y axis, okay? And maybe I'm going to put the type of litter in the middle of the graph.

Okay, and maybe, although we're using the same variables here, we've presented it in a different way and maybe this visualisation is more useful than the previous one.

So we're only answering one of your questions in this lesson, so don't be afraid to just play around with this.

Have a few attempts and find the visualisation that you think helps us perform that analysis a little bit easier than other ways, okay? So do play around with it and get that visualisation feeling about right before you take a screenshot of it and put it into your worksheet to discuss your findings, okay? So that's your task for the remainder of this lesson, so it's only one question remember.

So if you can pause the video, head over to task three on your worksheet, also be using the CODAP platform, visualise your data, take the screenshot and then do some kind of analysis to the findings that the visualisation helped you find, okay? So pause the video now and I'll be here when you get back.

Okay, so that's all for lesson five and we're well on our way to hopefully solving our problems of the litter in our local area.

So you've taken the data, you've downloaded it, you've uploaded it and you've started doing the analysis process.

So in the next lesson, we're going to continue with that analysis process so we'll analyse the second problem or second question that you posed, then we'll go through the process of concluding and maybe wrapping up the findings and hopefully making some sort of recommendation that we can make a change to you and your local area, okay? So I'm looking forward to that already, I hope you're looking forward to it and I'd really like to see some of the visualisations that you've made from this lesson.

So if you're willing to share that with us, please ask your parent or carer to share your work on Instagram, Facebook or Twitter tagging @OakNational and using the hashtag LearnwithOak, okay? So looking forward to seeing you next lesson and I'll see you then.