Hello, and welcome to "Lesson Three" of our data science unit.

I'm Ben, and in this lesson you're going to solve a real-world problem using a real dataset, following the steps of an investigative cycle to help us do that.

So in this lesson, all you'll need is your computer, and a web browser.

And if you can clear away any distractions that you might have and find a really nice quiet place to work, and when you've done all that, let's get started.

Okay, so in this lesson, we're going to look at the terms, correlation and outliers in relation to data trends.

We're also going to identify those steps of the investigative cycle that we just mentioned.

Then we're going to solve a real problem using those steps.

And then once we've done all of that, hopefully we'll be able to make a recommendation or find some kind of data that'll support some kind of recommendation we can make, okay.

So the first task for us to do, is to go over to a website that we used in lesson two, which is the Gapminder website.

Now in this one, I've given you a particular data set to look at.

Now, what I'd like you to do is identify what that data is.

So what data are you looking at on the graph? I'd like to see whether or not you think the graph shows a trend? Where do you think the anomalies are in the data? 'Cause there are some anomalies.

And why do you think those anomalies have occurred? Okay, so I'd like to pause the video now and head over to your worksheets.

Follow the link on your worksheet, to the graph, and then answer the questions.

And once you've done that, you can unpause the video and we'll go through the answers together.

Okay, so how did you get on with that? Let's now go through some of the answers for the questions that you were asked.

So what data is being shown on the graph? Well the graph compares two variables and they were life expectancy and income.

And it compares the life expectancy of the people living in France between 1800 and 2019.

So hopefully you found that out.

But one of the key words I wanted to highlight here is this key word, which is variables.

So variables in this instance are categories of data.

So two categories of variables are life expectancy and income.

And they're variable because you can see that they're changing because there's a new value for each one of the years as we go through.

So let's move on to the next question.

So what did you think, did the graph show a trend? What I've done is I've highlighted here that there is a trend, but the trend may be a little bit more obvious from 1949 onwards.

So I've answered here yes.

So from 1949 onwards, it shows a clear upwards trend showing that there's a relationship between two variables.

And we call that a correlation.

So when there is a relationship between two variables, like I said, that's called a correlation.

Now this is an example of a positive correlation.

Meaning that as one of these variables increases, the other one increases too.

Now, so if that's the case, what do you think is meant by a negative correlation? So I'll give you just a moment to think about that.

Okay, so the way I've worded this here is perhaps a little bit misleading. I've said that this is a positive correlation, meaning that as one variable increases, the other one increases too.

So you might have thought that a negative correlation is when one decreases and the other one decreases too.

Now that's actually not the case, that would also be a positive correlation.

It just means that as one is moving in a certain direction, the other one is moving in the same direction.

So a negative correlation, hopefully now you can start thinking about the fact that it's the opposite of that.

So as one is moving in one direction, the other one's moving in a different direction.

So for example, you might see a negative correlation when the price of a product increases and therefore the sales decrease. Or, if a train was travelling from A to B, it may well be that as the train speeds up, the time it takes to get from A to B decreases.

So we're seeing a negative correlation because they're moving in opposite directions.
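
To put a number on that idea, here is a minimal Python sketch that works out a correlation coefficient for two pairs of variables. The figures are made up for illustration (they are not taken from Gapminder or from any real shop), and the helper name `pearson` is just a label chosen for this example. A result close to +1 suggests a positive correlation, and a result close to -1 suggests a negative one.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient for two equal-length lists of numbers."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # How much the two variables move together around their means.
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # How much each variable moves on its own.
    spread_x = sum((x - mean_x) ** 2 for x in xs)
    spread_y = sum((y - mean_y) ** 2 for y in ys)
    return covariance / sqrt(spread_x * spread_y)


# Hypothetical data, invented for this example.
temperature_c = [12, 15, 19, 23, 27, 30]
ice_cream_sales = [20, 28, 41, 55, 70, 84]   # rises as it gets hotter

price_pounds = [1, 2, 3, 4, 5, 6]
units_sold = [95, 80, 72, 60, 41, 30]        # falls as the price rises

positive = pearson(temperature_c, ice_cream_sales)
negative = pearson(price_pounds, units_sold)

print(f"temperature vs ice cream sales: {positive:.2f}")
print(f"price vs units sold: {negative:.2f}")
```

The sign of the result tells you the direction of the relationship. It says nothing about whether one variable causes the other.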

So, there's one thing I wanted to point out.

In fact, if we just go back to the graph there, we looked at income and life expectancy here.

So do you think that means that if income increases life expectancy also increases? Well, there may well be a direct relationship and it may well be that as income increases that causes life expectancy to increase.

Perhaps we need to investigate that further and see if that's the case with other countries as well.

But, one thing that's really worth pointing out is that correlation doesn't always mean causation.

So correlation shows that there is a relationship between two or more variables, but that doesn't guarantee that one causes the other.

So for example, there is likely to be a correlation between ice cream sales and the weather.

So for example, as the weather gets hotter, ice cream sales also increase.

So we'll see a positive correlation there, but that doesn't mean that ice cream sales cause hot weather.

So, another really great example of this, and there's a lovely website if you ever want to go and visit it.

It's called Spurious Correlations, by Tyler Vigen.

And there's a really nice one that I picked out here, which is the raw divorce rate in Maine, which is a US state and how there's a positive correlation between the divorce rate in Maine and the per capita consumption of margarine.

I really love that because there is, as you can see from the graph a correlation, but clearly there's no causation there.

Well, I hope there's not anyway.

Okay, so I asked you a question which is to look at where are the anomalies in the data.

Now I've pointed out three key anomalies, though there are some other ones as well.

If we look at a particular one, 1871 for example... well, let's go through this. It says until 1949, most of the data follows a slow upward trend, but there were a few odd blips.

Okay, now any data that sits outside of a trend is known as an outlier.

So I can say that 1871, 1918 and 1940 sit slightly outside of that trend.

There are some other ones as well, but we would call them outliers.

Now outliers can cause problems when working with statistics, such as the mean, the average.

But we shouldn't remove them from the data without investigating the reasons for them.

Maybe to put that in a different context, imagine a teacher was looking at the marks that their students got for a test.

Now it may well be that everyone did really well on the test, but there was one student who turned up really late.

Maybe he had only about five minutes of the test left, sat down and started the test, but clearly got a really low mark.

Now that one student's mark would bring down the average for the rest of the class.

So the average might not necessarily be a fair reflection of the majority of students in the class. However, we don't want to remove that mark without knowing the reason behind it.

We can remove it, but before we do, being able to tell the story about that student, the fact that he turned up late, only sat the test for a few minutes and didn't do particularly well, helps us tell a story for the rest of the data as well.

So, before we just remove the outliers, we should investigate them first and work out why there was that outlier data.
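
Here is a small Python sketch of that test-marks story, to show how much one outlier can pull the mean around. The marks are invented for illustration; the 12 stands in for the student who turned up late.

```python
from statistics import mean, median

# Hypothetical test marks out of 100; the 12 belongs to the late student.
marks = [78, 82, 85, 88, 90, 12]
late_student_mark = 12

mean_with_outlier = mean(marks)
mean_without_outlier = mean([m for m in marks if m != late_student_mark])
median_with_outlier = median(marks)

print("mean with the late student:", mean_with_outlier)
print("mean without the late student:", mean_without_outlier)
print("median with the late student:", median_with_outlier)
```

With these made-up marks the mean is 72.5 with the late student and 84.6 without, while the median (the middle value, 83.5 here) barely notices the outlier. That is one reason to look at the story behind a value before deciding what to do with it.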

So let's actually do that here, with 1871.

If we look at the life expectancy between 1869 and 1872, you'll see that it clearly is outlier data, because in 1869 the life expectancy was 41.

But in 1870, it went down significantly to 36.4, and even further in 1871 to 29.

But then in 1872, just one year later, it shoots right back up again to 42.

Now we shouldn't just rule out that data without trying to investigate why.

Now hopefully, fingers crossed, you're thinking back to lesson one now, and that visualisation that Joseph Minard did.

I mean, if you remember what the topic of that was, well, it's actually the point where Napoleon was making a march on Russia. So the likely reason for that life expectancy going down was that so many soldiers were dying, probably at quite a young age, and therefore bringing the life expectancy down.

So that war was between 1870 and 1871.

So you can see in those years, that was actually probably a very significant reason why the life expectancy decreased.

Now I've also highlighted 1918 and 1940.

And if you're a history student and you know your stuff, can you think of the reason, potential reason behind those two outliers? Well, hopefully you were thinking that 1918, the first world war was taking place.

And what do you think was taking place in 1940? That was the second world war.

So actually in these three particular instances of outlier data, there were significant wars going on in which, of course, lots of people were dying, and that's going to bring down the life expectancy.

So you might also look at the graph and I think there were three others there in between 1980 and 1940.

So I wonder if there are reasons behind those; perhaps that's something we could investigate as well.

Okay so now we're going to look at the investigative cycle.

Now, as you can probably tell, this is not a word that I find easy to say.

So I wonder, can you say it any better than I can? Do you want to give that a go? All right, you can do it better than I can, so I apologise if I avoid saying it, because my tongue trips over itself a little bit when I try and say it.

So let's go through what the cycle is.

Now, so far, we spent time investigating datasets to see patterns or extract meaning.

But the PPDAC cycle is a framework for us to follow in asking and answering real-world problems. So this is, like I say, a framework, and we'll follow the steps of this every time we want to solve a problem or answer a question using data.

So let's go through each one of the steps of this cycle, so we get a better understanding of what each step involves.

So starting off with the problem.

Now, the problem is that point where we would pose a question that you think data will help you to answer.

So what's really important here is context when we're framing the question.

For example, we might ask: what is the average number of goals scored in the first half by teams in the Premier League? That's an example of quite a good question, because what we're doing there is including some variables that can be compared against each other.

So the whole point of the problem stage is that we work on the problem that we're trying to solve, not by thinking of it as one big problem, but by narrowing it down into a question that we think data can help us answer.

And that question should comprise some variables that we think can be collected, so data sets or categories that can be collected and therefore compared against each other.

Okay, so once we've defined the problem, we move on to the planning stage. The plan involves working out where we're going to get the data from, and maybe how it will be collected if we're collecting it ourselves.

It may well be that there's a data set already out there.

For example, if we were looking at first-half goals by a team in the Premier League, the chances are that data already exists.

Somebody is already collecting it, so we can ask or pay somebody for it, or it might be open-source, free data that we can go and get.

And we'll just use that to answer our problem, but it may well be that we need to go and collect this data ourselves.

At this stage, we also need to predict what we think the answer will be.

So it's very much like a science experiment.

Once we've got our problem, we think about what we think the answer is going to be before we actually investigate and find out what the answer is.

So we find our data set, plan how we're going to collect the data if we're going to do that, and then evaluate the quality of the data as well, okay.

So I'm going to move on to the data step.

So in this step we gather the data. That's, of course, if you need to collect it yourself; otherwise you'll be going and getting it from the source that you need to get it from.

But once we have the data to help us answer the question, we should look through the data to see if it needs cleansing.

Now cleansing means detecting and correcting or removing any corrupt or inaccurate data.

It may well be there's some kind of data entry error. Maybe, where we're collecting it ourselves, somebody has put something illegible on the piece of paper and we're trying to type that into the computer, or maybe there's a typo that puts an extra zero on to a number, which wouldn't be very helpful to us.

So we'd look through the data to make sure it's as clean as we can possibly make it before we move on to the next step.
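
As a rough illustration of cleansing, the Python sketch below sorts some typed-in values into ones that look usable and ones that need a second look. Everything here is an assumption made for the example: the values are invented, the function name `split_clean_and_suspect` is hypothetical, and the "plausible" range of 10 to 500 is just a guess at sensible roller coaster heights in feet. Suspect values are set aside rather than deleted, so they can still be investigated.

```python
def split_clean_and_suspect(raw_values, low, high):
    """Separate values that parse as numbers within [low, high] from the rest."""
    clean, suspect = [], []
    for value in raw_values:
        try:
            number = float(value)
        except (TypeError, ValueError):
            # Not a number at all, e.g. an illegible or mistyped entry.
            suspect.append(value)
            continue
        if low <= number <= high:
            clean.append(number)
        else:
            # A number, but outside the range we assumed was plausible.
            suspect.append(value)
    return clean, suspect


# Hypothetical heights in feet, as typed in from paper forms.
# "2450" has an extra zero and "31O" uses the letter O instead of a zero.
raw_heights = ["205", "230", "2450", "31O", "195"]

clean, suspect = split_clean_and_suspect(raw_heights, low=10, high=500)

print("clean:", clean)
print("needs checking:", suspect)
```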

So the next step is the analysis stage.

So now this is getting to the stage that we're more familiar with, based on the learning that we've done so far in these lessons.

So this is about making sense of the data.

So far we've worked with data sets that have been collected for us, and probably cleansed, and we've then analysed them.

So this is the step that we're discussing here.

The data needs to be visualised so that we can spot any patterns, trends and correlations, and maybe identify those outliers that we spoke about earlier on in this lesson.

And then you should write down any observations you make about the data, okay? So like I say, that's a step that we're very much comfortable with, one that we practised in lesson two of this unit, okay.

So then finally, and perhaps the most crucial part of this, is the conclusion and recommendations.

So this is the point where, once we've analysed the data, we ask: what's the answer to our question?

Do we have an answer to our question? Does the data help prove the answer to the question? Is the answer reliable? Is there anything that makes this data unreliable? And then finally, what can we do with these results? For example, can we use this data to make a case for action or, importantly, has it led to further questions that need to be answered? It may well be that we collect a lot of data, analyse it and think, actually, this doesn't quite help us answer this question. Or maybe it does answer the question, but it's led me to think of more questions that I think must be answered before I can make any kind of decision or recommendation.

And that's where the concept of the cycle comes into play, because we can see here that this is an investigative cycle: problem, plan, data, analysis, conclusions.

And then if you're not satisfied at this point, or you feel there are more questions to be answered, you would move through the cycle again, starting back at the problem stage.

Re-identify a problem, ask a new question, and then work through that until you get to the conclusion stage again.

And you keep running through that cycle until you're satisfied that you are able to make a recommendation or make a decision based on your findings.
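
If it helps to see the "keep going round until you're satisfied" idea written down, here is a loose Python sketch of it. This is only an illustration of the loop, not a real analysis tool: the function name `run_ppdac`, the example questions and the rule for being satisfied are all made up.

```python
STAGES = ["Problem", "Plan", "Data", "Analysis", "Conclusion"]


def run_ppdac(questions, is_satisfied):
    """Work through the five stages for each question in turn.

    Stops as soon as is_satisfied(question) returns True, and returns a
    list of (cycle_number, stage, question) tuples recording what was done.
    """
    log = []
    for cycle_number, question in enumerate(questions, start=1):
        for stage in STAGES:
            log.append((cycle_number, stage, question))
        if is_satisfied(question):
            break  # confident enough to make a recommendation
    return log


# Hypothetical questions; here the second one is treated as good enough.
questions = [
    "Are newer roller coasters faster?",
    "Are steel roller coasters faster than wooden ones?",
]
log = run_ppdac(questions, is_satisfied=lambda q: "steel" in q)

for cycle_number, stage, question in log:
    print(cycle_number, stage, question)
```

In this made-up run the first question goes through all five stages without a satisfying answer, so a second question is posed and the cycle is walked again.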

Okay, so let's now find out about River Kingdom roller coasters, because this is an opportunity now for us to apply the steps of the PPDAC cycle using real data in the context of River Kingdom roller coasters.

So, like I say, let's find out a little bit more about them.

Okay, so River Kingdom is a new theme park that is opening in the UK and they want you to recommend design considerations that would help them make a great experience for their visitors.

Now, essentially what they're asking is that they would like us to build a roller coaster.

But one of the main restrictions that they know of is that they can't build a rollercoaster over the height of 350 feet, due to limitations of the site where they're going to build it.

So let's think about this problem a little bit more. Essentially what they're asking is: what makes a really cool roller coaster?

That's essentially the broader problem that we're trying to solve.

But as we know from looking at the steps in the PPDAC cycle, this would be considered a poorly defined problem.

It doesn't help us understand what we're measuring.

And we have to make sure that the question we're defining at least compares variables.

So what variables about roller coasters do you think we could measure in order to help us answer that question? Okay, so I'll give you a clue.

One of them might be the length of the roller coaster.

So how far is it in distance, in metres maybe.

Okay, so that would be one variable that I would pick.

So I'd like you to pause the video for a moment and see if you can come up with some other variables that you think would help us measure what makes a good roller coaster. Okay, so let's go through some of the answers. You might have come up with some different ones, but this is what I've got.

Okay, so I thought that speed would be a key consideration, and the height, clearly, because we've got a restriction of 350 feet.

So that's definitely something that we know is going to be a variable that we can use.

The drop maybe, so what's the angle of the drop that it goes down, and maybe the number of twists or loops.

Those are what we call inversions in the roller coaster world. Then there's the length of it in distance, maybe the duration, so how long the ride takes from start to finish once the riders sit down, and then maybe the position as well.

'Cause there are different types of roller coasters out there: on some of them you're sitting in a seat, and on some you're suspended, with your legs kind of dangling down.

So those are the variables that I think we could use to measure, okay? We can get data on existing rollercoasters to find out what is possible, so that's what we're going to do.

Okay, but first we need to pose questions that help us turn our vague aim of finding out what makes a cool rollercoaster into precise goals.

So remember that well-defined question, which was something along the lines of, if I remember correctly, what's the average number of goals scored in the first half by Premier League teams, okay.

So we're looking at a question that uses some variables there.

So what I'd like you to do now is head over to your task on your worksheet and see if you can think of two questions that we can ask that use the variables you can see on the screen now.

There are plenty of variables for you to use.

In each question, try and use two of them that you could compare against each other, okay? So if you can pause the video now, go over to task two and see if you can use the variables that we talked about to define two good questions.

And once you've done that, unpause the video and I'll be here when you get back.

Okay, so hopefully now you've got two questions, and we can now use some data to investigate the particular problems or questions that you've posed, okay.

So to do that, we're going to use a website called CODAP.

Now on this CODAP website, there is a dataset of historical rollercoasters.

So we'll look at that now.

You don't have to worry about the link to this because it's on your worksheet.

So when we come to the part where you open your worksheet, you'll see there's a link that you can follow straight to the website that I'm going to show you now, okay? So I'm moving to that now. Once you follow the link, it'll take you to this CODAP platform.

That's going to allow you to analyse this data.

And preloaded is a set of historical data about US roller coasters.

So that's something that we can use as a starting point to our investigation, because it's data that's already been collected for us.

We might find that it doesn't solve all of our problems or answer our questions, but it might do as well.

So it's certainly worth trying this dataset before we go down the route of trying to collect our own.

So when we look at this window here, you might want to move things around a little bit. This window on the right-hand side, you can get rid of it if you want to, but I recommend leaving it up because it gives you a little bit of a mini tutorial. If you feel that you need it after I've finished showing you this, you might want to read through it, and it'll maybe help you with some of the work that I'm going to ask you to do.

So I'm just going to move this to the right-hand side.

And that's going to allow me to make this window a little bit bigger, 'cause this is the visualisation of the data, okay.

So I'm going to move this over here, and that's going to allow me a little bit more room to actually look at the raw data itself.

Okay, so this is a very interactive platform because I can move my mouse over each one.

And it'll tell me what the roller coaster is, okay.

But let's start using those variables to form the graph, or the visualisation, that I want to create, okay? Obviously I'm doing this for demonstration purposes; I'm not doing this to answer the questions that you've got to answer.

So I'd like you to put in the variables that you think you need to help you answer those.

Okay, so what I'm kind of interested in is this: I've got a mini theory in my head that maybe the most recent roller coasters are probably the fastest ones, okay.

So my question might be, does the age of the roller coaster determine the speed of the roller coaster? Or, what is the speed of roller coasters based on the year? Okay, so one of the variables that I've got here is year opened. To put this onto one of my axes, the X or the Y axis, all I need to do is hold down my mouse button and drag it down. Can you see the way it's moving? Okay, I'm going to drag it down to the bottom and then let go of my mouse, and it should populate the axis, okay.

So now we've got year opened, okay? And it's changed the graph.

So we can see the year that each one of these opened.

Now that in itself tells me a bit of a story.

It tells me that there's the odd little roller coaster opened up here and there, but really around this kind of year here, there's lots of roller coasters, so that's interesting.

I wonder what happened in 1999 that made all these roller coasters want to open.

I wonder if it was a particular park that opened with lots of roller coasters. I don't know, but either way, that's an interesting thing to explore. But I'm trying to compare the year to the speed of the roller coaster.

So now what I'm going to do is the same thing again, but I'm going to click on top speed.

I'm going to drag that into the other axis, the Y axis.

So I'm going to click that and drop it in.

And once I've dropped it in, it should then do some work and show me the results, okay.

There you go.

So now we're showing the year across the X axis and the top speed on the Y axis.

Okay, so we can see that maybe it's not quite as I expected.

There's not the correlation that I thought there would be.

I thought that as the year increased, so would the speed, but this data is showing me that's perhaps not the case.

We see a slight one, in that the fastest ones here are certainly more recent than the older ones, but it's not quite the smooth correlation that I thought it was going to be.

So just to show you something else you can do, which might be interesting: I might say, well, I wonder if there's also a difference between the types of roller coaster.

So we've got type here, which shows wooden and steel.

So some of them are wooden and some of them are steel.

Okay, so I'm going to click this one.

Now I haven't got room to put this on my X or my Y axis.

So I'm just going to put this into the data itself and see what happens.

So I'm going to click on the type, I'm going to drag it, and I'm going to dump it in the middle of the graph, okay.

You can see the way it's highlighted yellow.

I'm just going to let go of my mouse.

Let's see what that does.

Now what this does is colour-code the data.

So now, if we use the key at the bottom there, we can see that the red ones are the wooden roller coasters.

And the green ones are the steel roller coasters.

So what this graph is now showing me, although there's not necessarily that correlation that I thought there might be between the year opened and the top speed, is that the fastest roller coasters are the steel ones.

And maybe it's also showing me that the more modern ones tend to be steel, but not solely, because of this one here: that's a wooden one that was opened in the year 2016.

So it's not exclusively steel, but there seems to be a transition in the more modern building of roller coasters, where steel tends to be the preferred choice.

So although this is not necessarily helping me prove my theory, I am actually also able to extract some interesting information from it, okay.
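
The same kind of comparison can be sketched in a few lines of Python, as a rough stand-in for dragging the type onto the graph. The rows below are invented for the example and are not the CODAP roller coaster data, and the field names (`year_opened`, `top_speed_mph`, `type`) are assumptions rather than the names used on the platform.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical rows, invented for this example (not the CODAP data).
coasters = [
    {"year_opened": 1979, "top_speed_mph": 50, "type": "Wooden"},
    {"year_opened": 2000, "top_speed_mph": 65, "type": "Wooden"},
    {"year_opened": 2016, "top_speed_mph": 70, "type": "Wooden"},
    {"year_opened": 1990, "top_speed_mph": 60, "type": "Steel"},
    {"year_opened": 1999, "top_speed_mph": 85, "type": "Steel"},
    {"year_opened": 2005, "top_speed_mph": 100, "type": "Steel"},
    {"year_opened": 2010, "top_speed_mph": 120, "type": "Steel"},
]

# Group the top speeds by type, a bit like colour coding the points on the graph.
speeds_by_type = defaultdict(list)
for coaster in coasters:
    speeds_by_type[coaster["type"]].append(coaster["top_speed_mph"])

# Summarise each group so the two types can be compared side by side.
summary = {
    coaster_type: {
        "count": len(speeds),
        "mean": mean(speeds),
        "fastest": max(speeds),
    }
    for coaster_type, speeds in speeds_by_type.items()
}

for coaster_type, stats in summary.items():
    print(coaster_type, stats)
```

Splitting the data by a third variable like this can show a pattern that the plain year-against-speed graph hides.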

So, your task is this: I would like you to go ahead and do that, and you're going to answer your two questions.

So you've defined two questions.

I'd like you to create the graphs for those two questions, and I'd also like you to document your findings in the worksheet.

So if I just head over to your worksheet for a second, you can see that for task three, you've got a link to the worksheet there.

And then you've got two slides here where you've got an opportunity to write down what your question is.

Maybe take a screenshot, if it's possible, of the visualisation that you made, and then you can write down a few sentences about your findings, okay? And then do that again for the second question, okay.

And then what I'd like you to do is write a conclusion as well.

So let me just head back to my slides.

So you would analyse this data, okay.

And then see if you can write a conclusion about it.

So it's got to be based on your findings.

Do you feel comfortable making a recommendation to River Kingdom based on your findings, and if that's the case, what is your recommendation? How does the data help support that recommendation? And is there enough data to support your recommendation? Importantly, don't be afraid to say if there's some further action or research that you think you need to do.

Your analysis might have helped you answer part of the question, but you might not feel comfortable that you know enough to be able to give a full recommendation.

So if that's the case, don't be afraid to say that in your conclusion, but also say what other information or data you would like to collect to help you answer the question a bit more fully, okay.

So if you can pause the video now, go to task three in your worksheet and use the link to the CODAP website, create your two visualisations based on your questions, comment on your findings, and then make sure you write your conclusion, okay.

So you can pause the video now and then unpause when you've done all of that.

Okay, so if you've done all of that, well done, because what you've done there is take your first steps in going through the full PPDAC cycle, by getting a data set and then visualising it to help you answer the question that you posed, okay.

Now I wonder, did you find out what you think makes the coolest roller coaster? I mean, do you feel confident that this data's provided you with all the information you need to make that recommendation? Well, what I wanted to highlight here is the fact that it is a cycle.

So don't worry if you felt that there wasn't enough data there and you didn't feel confident in making that recommendation, because it may well be that you just need to pose more questions or different questions.

It may well be that you need to collect more data, but either way, that's the cycle element of this: you created the problem.

So you got your questions, you created the plan, you got the data, you analysed that data, and then you made a conclusion.

Because it's a cycle, at the end of that conclusion stage it may well be that you need more data or you need to pose slightly different questions.

So you would go through the cycle again until you felt confident to make that recommendation to River Kingdom.

Okay, so that's all for this lesson and you really should feel quite proud of what you've done in this lesson, because you've been through all the steps of the PPDAC cycle and that's fantastic.

So the next thing for us to do is define our own problem.

This lesson we gave you the scenario of River Kingdom, but next lesson, we're going to look at creating a problem that you can solve by looking for or collecting your own data, a problem that maybe you can take real action on.

So I hope you're excited about that.

'Cause I'm looking forward to that already, okay.

Now I would really love to see the work that you've done.

I'd love to see the visualisations that you've made in this lesson.

So if you feel willing to share that with us, please ask your parent or carer to share your work on Instagram, Facebook, or Twitter, tagging @OakNational or using #LearnwithOak.

Okay so I'm looking forward to seeing you next lesson.