Hello, and thank you for choosing this lesson.

My name is Dr. Rowlandson, and I'm excited to be helping you with your learning today.

Let's get started.

Welcome to today's lesson from the unit of "graphical representations of data with scatter graphs and time series." This lesson is called "checking understanding of correlation", and by the end of today's lesson, we'll be able to interpret scatter graphs.

Here are some previous keywords that we're going to use again during today's lesson.

So you may want to pause the video if you want to remind yourselves what these words mean before pressing play to continue.

This lesson contains two learn cycles.

In the first learn cycle, we're going to focus on identifying correlation and types of correlation.

And then in the second learn cycle, we're going to focus on interpreting what correlation tells us about the data in different contexts.

Let's start off with identifying correlation.

Scatter graphs can be used to investigate the nature of a relationship between two variables.

Here we have four scatter graphs, labelled A, B, C, and D, and the points appear quite different in these four scatter graphs.

Which scatter graphs would suggest there is an association between the two variables? Pause the video while you think about this, and press play when you're ready to continue.

Scatter graphs B, C, and D would each suggest there is an association between their two variables, because the points seem to form a discernible shape or pattern.

With scatter graph A, the points are all over the place.

They seem to sort of be randomly placed around the graph, so that would suggest there's not so much of an association between those two variables.

But with scatter graphs B, C, and D, those points seem to form some kind of pathway through the graph, and leave certain parts of the graph without any points plotted whatsoever.

That would suggest that the variables in these scatter graphs are interacting with each other in some kind of way, and that they are associated with one another.

And an association between variables may be a linear association, or a non-linear association.

With scatter graphs B and C, we have a linear association, because the points appear to be forming a straight line.

With scatter graph D, we have a non-linear association between its variables, because the points seem to be forming a curve.

When the points appear to form a straight line, it suggests that the rate of change is constant throughout the data, and in these cases, there is correlation between the two variables.

So which scatter graphs here show correlation? Pause the video while you think about it, and press play when you're ready to continue.

Scatter graphs B and C show correlation.

And that's because the points appear to follow the path of a straight line, which means that the rate of change is pretty much constant throughout the data.

Every time we move to the right along the horizontal axis, we move up to get to the data again, a similar amount each time.

Whereas with scatter graph D, yes, those points do show an association, but it's non-linear.

The points seem to follow the path of a curve, which means as we go from left to right on that graph, the amount that we go up each time to meet the data differs from the start of the axes to the end of the axes.
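
If you'd like to see that difference with numbers, here is a minimal sketch in Python. The values are made up for illustration and are not read from the graphs in this lesson, and the helper name `steps` is just a label chosen for this sketch. It looks at how far we go up each time we move one step to the right.

```python
def steps(ys):
    """Return the change in y between each pair of neighbouring points."""
    return [later - earlier for earlier, later in zip(ys, ys[1:])]

# Made-up y-values for illustration; x goes up by 1 from one value to the next.
straight_line = [2, 5, 8, 11, 14]
curve = [1, 2, 4, 8, 16]

print(steps(straight_line))  # [3, 3, 3, 3] -> the same step every time
print(steps(curve))          # [1, 2, 4, 8] -> the step keeps changing
```

With real data, the steps for a linear association would only be roughly equal rather than exactly equal, because the points rarely sit perfectly on a line.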

When the points form an upward sloping line, a straight line, we can describe the variables as having positive correlation.

So, out of these four graphs, which one shows a positive correlation? Pause the video while you think about it, and press play when you're ready to continue.

Scatter graph B shows a positive correlation, because each time the independent variable increases, the dependent variable also increases at a constant rate.

When the points form a downward sloping line, we can describe the variables as having a negative correlation.

So with that in mind, which of these scatter graphs shows a negative correlation? Pause the video while you think about that, and then press play when you're ready to continue.

Scatter graph C shows a negative correlation, and that's because each time the independent variable increases, the dependent variable decreases, and at a constant rate.

As one increases, the other decreases, so that means they have a negative correlation.

Now, correlation can be clearer to see in some scatter graphs than in others.

For example, here we have two scatter graphs, A and B.

They both show correlation between their two variables, but in which scatter graph is that correlation clearer to see? Pause the video while you think about this, but also think about, why do you think that particular graph shows clearer correlation than the other? What is it about the points that makes it clearer? And then press play when you're ready to continue.

So both of these scatter graphs do show a positive correlation, because the points appear to be sloping upwards in a straight line, but the relationship between the variables is clearer to see in scatter graph A than it is in B.

And that's because those points seem to be more tightly packed together in that pathway of a straight line than with B.

And we can visualise that with these line segments that have been drawn around the data.

These line segments try to capture most of the data, and follow the pathway that the data has taken.

It doesn't capture all the data points.

You can see there are some points on the outside of the lines, but it does capture most of them, and generally follows the path that the data has taken.

And you can see that those line segments are closer together in scatter graph A, than they are in scatter graph B, because the points seem to be following that pathway more closely in scatter graph A.
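
One way to put a number on how closely points follow a straight line is Pearson's correlation coefficient. That measure goes beyond this lesson, so treat the following as an optional sketch rather than the method used here. The data values are made up for illustration, and `pearson_r` is a helper written for this sketch. Values near 1 mean the points sit tightly along an upward sloping line, and values nearer 0 mean they are more spread out.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson's correlation coefficient for two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

# Made-up data: both sets slope upwards, but one is packed more tightly.
xs = [1, 2, 3, 4, 5, 6]
tight_ys = [2.1, 3.9, 6.2, 7.8, 10.1, 12.0]
loose_ys = [4, 1, 8, 5, 12, 9]

r_tight = pearson_r(xs, tight_ys)
r_loose = pearson_r(xs, loose_ys)
print(round(r_tight, 2), round(r_loose, 2))
```

Both values come out positive, and the one for the tightly packed points is the closer of the two to 1, which matches what the narrower line segments show visually.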

Now, a small amount of data may not be enough to make a reliable assessment of the correlation between two variables.

Let's take a look at an example of that now.

Here we have a scatter graph that shows data taken from the Met Office about the weather in Bradford for five months.

Each of those points represents a month.

The horizontal axis shows us the amount of sunshine that was recorded that month, that's in hours.

And the vertical axis shows us the mean daily maximum temperature in that month as well.

And what could we observe about the correlation with these five points? Well, Andeep says, "It looks like there is a negative correlation between the amount of sunshine and the temperature." And Andeep would be right to say that, because the points seem to be following a downward sloping straight line, as we go from left to right.

But Andeep says, "This suggests that months with more sunshine tend to be colder," but that doesn't really seem right in the context.

Maybe five months of data is not enough, and we might be able to see things more clearly if we had some more data.

A more accurate assessment of correlation can be obtained by plotting a larger number of data points.

For example, here we now have a scatter graph that shows data from the Met Office about the weather in Bradford, again, but this time it's for every month from 1941 up to 2022.

We have a lot more data points here now.

We can still see the original five points we had from the previous graph, and these five points are still following that same downward sloping straight line, but these five points may have just been a chance finding.

It's really hard to tell when you've only got five data points, which is why it's important to have a large amount of data wherever possible.

And what we can see here is that the inclusion of additional data points shows a different correlation.

Andeep says, "It now looks like there is a positive correlation.

This suggests that months with more sunshine tend to be warmer, and in the context of the data, that makes more sense." One other thing you might notice about this scatter graph is in the bottom right hand corner, there's an icon with three chain links in.

Now wherever you see that icon next to a scatter graph, it means in the slide deck, you can click on the image and it'll take you to a Desmos version of that scatter graph, which is interactive, so you can zoom in, you can zoom out, and explore the data in more depth if you'd like to do so.
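
The Bradford example, where five points suggested the opposite of what the full data showed, can be imitated with a small sketch in Python. The numbers below are invented for illustration and are not the Met Office data. The sketch uses the gradient of a least-squares line of best fit, which is a technique this lesson does not cover, simply as a way to check whether the points slope upwards or downwards overall. The name `gradient` is a helper written for this sketch.

```python
def gradient(points):
    """Least-squares gradient of the line of best fit through (x, y) points."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    return sxy / sxx

# Invented (sunshine hours, temperature) pairs, NOT real weather data.
all_points = [
    (20, 3), (30, 6), (40, 4), (50, 9), (60, 12),
    (70, 11), (80, 11), (90, 9), (100, 8), (110, 13),
    (120, 15), (130, 14), (140, 18), (150, 17), (160, 20),
]
five_points = all_points[4:9]  # a small run of points taken from the middle

print(gradient(five_points) < 0)  # True: these five alone slope downwards
print(gradient(all_points) > 0)   # True: the full set slopes upwards
```

The five points chosen here happen to slope downwards, even though the whole set slopes upwards, which is the same kind of chance finding that a small sample can produce.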

So let's check what we've learned.

Which scatter graph here shows a positive correlation? Is it A, B, or C? Pause the video while you write down an answer, and press play when you're ready to continue.

The answer is A, and that's because the points appear to be forming an upward slope in a straight line.

The points don't follow that straight line perfectly, but the majority of the points seem to follow that pathway.

Which scatter graph shows a negative correlation? Is it A, B, or C? Pause the video while you make a choice, and press play when you're ready for an answer.

The answer is C.

The points in that scatter graph appear to be following a downward sloping straight line.

Once again, not all the points, but the majority of points, seem to be going in that direction.

Here we have four scatter graphs.

Which scatter graphs show negative correlation here? And it may be more than one.

Pause the video while you make your choices, and press play when you're ready for answers.

The answers are C and D.

In both cases, the points seem to be forming a downward sloping straight line.

It's more clear with C than it is with D, but they both seem to be following that similar kind of pathway, whereas with A and B, yes, those points are going downwards, but they are going more in a curved pathway.

Okay, it's over to you now for task A.

This task contains two questions, and here is question one.

You need to match each scatter graph with its description.

Pause the video while you do that, and press play when you're ready for question two.

And here is question two.

On each scatter graph, plot at least 20 data points so that its correlation matches its description.

For example, with part A, plot 20 or more data points on a pair of axes to show positive correlation, and then follow those instructions with B, C, and D, in a similar way.

Pause the video while you do this, and press play when you're ready for some answers.

Let's now look at some answers.

In question one, we need to match each scatter graph to its description.

With scatter graph one, as one variable increases, the other one also increases, but it's not really at a constant rate.

It seems to be more of a curve than a straight line.

So that would be an association which is not linear.

With graph two, as one variable increases, the other one also increases, and it appears to be at an approximately constant rate.

Those points seem to be following the pathway of a straight line, not perfectly, but mostly, so that would be a positive correlation.

And then with graph three, as one variable increases, the other variable could either be high or it could be low.

Doesn't really seem to be much of a pattern forming with the data points in that scatter graph.

So there is no association between those variables.

And then with graph four, as one variable increases, the other variable tends to decrease, and that appears to be in a straight line, so that would be a negative correlation.

And then question two, you have to plot at least 20 points on each scatter graph so that its correlation matches the description.

In part A, you want to plot your points to show a positive correlation.

It could look something a bit like this.

It doesn't have to look exactly like this.

It could be steeper, it could be less steep, but generally, as one variable increases, the other one increases as well.

And the points seem to follow the pathway of a straight line.

And then with B, it might look something a bit like this.

Very similar, but going downwards.

And then with part C, you had to show positive correlation that was not as clear as your graph from part A, so you still want an upward slope in a straight line, but maybe have those points being further apart from each other, so something maybe a bit like this.

And then part D, you want to show no correlation, so it might look something a little bit like this.

Doesn't really matter how far across you are on that horizontal axis, the points could be of any height.

Great job so far.

Now let's move on to the second learn cycle, which is interpreting correlation in context.

Here we have a scatter graph that shows data taken from the Met Office about the weather in Stornoway, where each point represents a month from 1941 to 2022.

On the horizontal axis, we have the total rainfall for each month, which is measured in millimetres.

And the vertical axis, we have the total sunshine duration for each month, which is measured in hours.

Now if we look at the data points in this scatter graph, generally they seem to be falling in a downward sloping straight line, as we move from the left side to the right side of the scatter graph.

Not all the points follow that pathway quite so clearly as others, but generally, overall, it seems to be following that pattern.

Therefore, it is reasonable to model this relationship as a negative correlation.

So let's now think about what that means in the context of this data.

This data is about weather.

So what does the correlation suggest about these two variables in the context of this data? Pause the video while you think about it, and press play when you're ready to continue.

The negative correlation would suggest that the more rainfall there is in a month, the less visible sunshine there is likely to be.

Here we have a different scatter graph.

This one shows data taken from the ONS, the Office for National Statistics, where each data point represents a region in England or Wales.

On the horizontal axis, it shows the population of each region, and that's in the thousands.

And on the vertical axis, it shows the number of cell phone towers that can be found in that region.

And if you want to, you can click on the image in the slide deck to access a Desmos version of this scatter graph.

So let's think about the correlation here.

What we can see is that, as we move from left to right on that horizontal axis, the points generally seem to be getting higher and higher.

And overall, that seems to be happening at a pretty constant rate.

So that would suggest that it shows a positive correlation.

But what does that correlation mean about these two variables in the context of this data? Pause the video while you think about this, and press play when you're ready to continue.

A positive correlation here would suggest that the greater the population of a region, the more cell towers it is likely to contain.

And here we have another scatter graph, with once again data taken from the ONS, where each point represents a region in England or Wales.

Once again, we have the population of each region on the horizontal axis, but this time, on the vertical axis, we have transport emissions, with the units that it's measured in.

And we can see a positive correlation again.

But what does positive correlation mean in this context? Pause the video while you think about it, and press play when you are ready to continue.

In this context, it suggests that regions with greater populations tend to have more transport emissions.

And here is one more scatter graph.

This one once again shows data from the ONS, where each point represents a region in England or Wales.

But this time, on the horizontal axis, we have the number of cell phone towers, and on the vertical axis, we have the amount of transport emissions.

Once again, it seems to show a positive correlation, but what does it suggest in the context of the data? Pause the video while you think about it, and press play when you're ready to continue.

A positive correlation here would suggest that the more cell phone towers there are in a region, the more transport emissions there are too.

Let's explore this one a little bit more.

Lucas says, "Does this mean that cell phone towers are causing transport emissions?" What do you think about that? Pause the video while you think about what Lucas has said.

Do you agree with him, or do you think there might be another explanation? Laura says, "That doesn't really seem likely, Lucas.

For example, I doubt that removing cell phone towers from a region would directly reduce the amount of transport emissions." Let's see what's going on here a bit more.

So as we've seen with positive correlation, as one variable increases, the other one increases as well.

But a positive correlation does not always mean that an increase in one variable is causing an increase in the other variable.

Sometimes this is expressed with the phrase "correlation does not mean causation." In other words, just because two things correlate, does not necessarily mean that one is causing the other to happen.

Sometimes it does, but not always.

Let's take a look at these three graphs here.

So with the graph on the left, we have population and transport emissions in each region of the UK.

Increasing the population does seem likely to cause more transport emissions.

The more people there are in a town or city, the more cars and buses there are likely to be in that town and city, and therefore there are more transport emissions.

So increase of one is likely to cause an increase in the other.

And then with the middle scatter graph, we have population and cell towers.

Well, increase in the population is likely to cause more cell towers to be built, as more people live in a town or region, the more cell towers are going to get built to meet the needs of the population.

So in these two cases, it does seem that increase of one is causing an increase in the other.

But what about that third scatter graph? Cell towers and transport emissions? Is it likely that increasing the number of cell towers is what is causing the increase in transport emissions? Probably not.

Building more cell towers is unlikely to be directly causing more transport emissions.

But what the positive correlation does show is that there is some kind of relationship between these two variables, even if one of them is not directly affecting the other.

The thing that might be connecting these two variables could be the population.

The positive correlation between the cell towers and transport emissions could be due to both variables also having a positive correlation with the population.

As the population of a region increases, the number of cell towers and the amount of transport emissions both tend to increase, and therefore, those two variables would have a positive correlation, even though one of them is not directly affecting the other.
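
To see how a shared link to population can produce a correlation on its own, here is a minimal sketch in Python. Everything in it is an assumption made for illustration: the populations, the multipliers, and the small adjustments are invented and are not the ONS data. Towers and emissions are each built from population only, and neither one is used to work out the other. Pearson's correlation coefficient, which goes beyond this lesson, is used as the measure, and `pearson_r` is a helper written for this sketch.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson's correlation coefficient for two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

# Invented populations (in thousands) for eight imaginary regions.
population = [100, 200, 300, 400, 500, 600, 700, 800]

# Invented region-to-region variation, so the points are not perfectly in line.
tower_variation = [5, -8, 3, -2, 7, -6, 4, -3]
emission_variation = [-20, 15, 10, -25, 5, 30, -15, 0]

# Each variable depends on population only; towers never appear in the
# emissions calculation, and emissions never appear in the towers calculation.
towers = [p // 10 + v for p, v in zip(population, tower_variation)]
emissions = [2 * p + v for p, v in zip(population, emission_variation)]

r = pearson_r(towers, emissions)
print(r > 0.9)  # True: a strong positive correlation between towers and emissions
```

Even though the sketch never lets towers affect emissions, the two still come out strongly positively correlated, because both rise with population.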

So here we have Aisha, Jacob, and Sam, who are doing a data investigation about careers.

They collect data from the ONS about different occupations in 2023.

They prepare to plot a scatter graph for the mean hours worked per week, and the mean weekly pay.

And what they are thinking about is what correlation they expect to see, if any.

Aisha says, "Working more hours per week means you earn more money.

So I think there'll be a positive correlation." Jacob says, "But some jobs pay more than others, so I think there'll be no correlation." And Sam says, "If your job pays more, then you can afford to work fewer hours.

So I think there'll be a negative correlation." Who do you agree with out of these three ideas? Pause the video while you think about it, and then press play when you're ready to look at a scatter graph together.

Here's a scatter graph based on the data they collected.

It shows data from the ONS about different occupations in 2023, where each point represents a different occupation.

Along the horizontal axis, we have the mean number of hours worked per week, and the vertical axis shows how much money they earn weekly, on average.

And if you want to, you can click on the scatter graph in the slide deck to access a Desmos version, where you can explore it in a bit more depth.

Now we can see the scatter graph, let's see what we think.

Jacob says, "Most of the points seem to be forming an upward sloping straight line, but there are some points that don't quite fit this." Aisha says, "Overall though, there seems to be a positive correlation." So let's check what we've learned.

Here we have a scatter graph with data from the ONS, where each point represents a region in England or Wales.

The horizontal axis shows the population of each region, and the vertical axis shows how many properties were sold in that region.

What correlation does it show? Is it A, negative correlation? B, no correlation? Or C, a positive correlation? Pause the video while you make a choice, and press play when you're ready for an answer.

It shows a positive correlation.

So, explain what a positive correlation suggests about these two variables in the context of this data.

Write a sentence to describe what this data means about the context.

Pause the video while you do that, and press play when you're ready for an answer.

Let's take a look at some example answers.

We could say that regions with greater populations tend to have more property sales, or we could phrase it as, the more people that live in a region, the more property sales there are likely to be.

However you phrased it, if you acknowledged that, as the population increases, the number of property sales also increases, you've done a great job.

Okay, it's over to you now for task B.

This task contains one question, and here it is.

Here we have three scatter graphs that each show data taken from the ONS about regions in the UK.

And for each one of them, you can click on the image in the slide deck to open up a Desmos version of that scatter graph if you want to explore it in more detail.

You don't need to, but it's there for you, if you'd like to.

For question A, I'd like you to assume that all of these graphs show positive correlation.

You might not necessarily think they do, but we'll get back to that later.

I'd like you to describe what a positive correlation would suggest about the context of the data in each graph.

In question B, I'd like you to think about, how could the correlations for the graphs in parts one and two explain the correlation for the graph in part three? And then for question C, for the graphs in parts two and three, explain why a non-linear association may be a better model for that data than correlation.

Pause the video while you answer these questions, and then press play when you're ready to go through them together.

So let's see how we got on with that then.

In question A, we need to describe what the positive correlation would suggest about the context of each graph.

So in part one, we have the average disposable income and the median house price.

We could say that houses tend to cost more in regions where people have more disposable income, or something along those lines.

In part two, we have average disposable income and mean life expectancy.

We could say life expectancy tends to be longer in regions where people have more disposable income, or something along those lines.

And in part three, we have median house price, and mean life expectancy.

We could say life expectancy tends to be longer in regions where houses cost more.

So in question B, we need to describe how the correlations in parts one and two could explain the correlation in part three.

Well, we could say that as the average disposable income in a region increases, the median house price and the mean life expectancy both increase together.

And then question C, for the graphs in parts two and three, explain why a non-linear association may be a better model for the data than correlation.

We can describe that in a couple different ways.

One way we could explain it is by describing the shape that the points take.

The data points form a slight curve, rather than a straight line, particularly on the right hand side of the graph.

Or we could explain it using the context.

Life expectancy is unlikely to just keep increasing at a constant rate as the average disposable income increases.

In other words, having more and more money isn't necessarily going to keep increasing the life expectancy.

Otherwise, some of our billionaires will be living to the age of about a thousand.

Fantastic work today.

Now let's summarise what we've learned.

The distribution of points on a scatter graph may suggest a correlation, which can be either positive or negative.

However, the distribution of points on a scatter graph may suggest there is no correlation or no association between the variables.

And correlation can be assessed more reliably when there is a large number of data points.

Just a few data points probably won't be enough to give you an accurate picture of whether or not data has correlation, or what that correlation is.

More data points will be more helpful.

And correlation does not necessarily imply that one variable is affecting the other.

The two might be related, but without one directly affecting the other.

Great job today.

Thank you very much.
