
Hello, my name is Mrs. Holborow, and welcome to computing.

I'm so pleased you've decided to join me for the lesson today.

In today's lesson, we're going to be exploring the impact of data on machine learning models.

What happens when training data is not high-quality or contains bias?

Welcome to today's lesson from the unit "Using Data Science and AI Tools Effectively and Safely." This lesson is called "Bias and Accuracy in Machine Learning," and by the end of today's lesson, you'll be able to describe the impact of data on machine learning models and explain bias in machine learning model predictions.

Shall we make a start? We will be using these keywords throughout today's lesson.

Let's take a look at them now.

Fair.

Fair: When something is free from bias and gives equal consideration and treatment to all parts of a group or situation.

Unrepresentative.

Unrepresentative: When the data does not properly reflect the whole group or situation it is meant to describe.

Diverse.

Diverse: When data includes a wide range of examples that fairly represent different parts of the whole group or situation.

Look out for these keywords throughout today's lesson.

Today's lesson is split into two parts.

We'll start by describing the impact of data on machine learning models, and then we'll explain bias in machine learning model predictions.

Let's make a start by describing the impact of data on machine learning models.

Machine learning is a branch of AI that focuses on improving the computer's ability to perform a specific task over time.

In machine learning, the computer changes its outputs by analysing data instead of following specific rules written for each task.

Machine learning is often shortened to "ML."

Alex says, "Jacob, I don't really understand how data makes an impact on ML models."

Jacob says, "All ML models are trained on and use data, Alex, so the data can really affect the quality of the ML model output."

Machine learning models are built using patterns found in large amounts of training data.
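If you'd like to see what this might look like in code, here is a minimal sketch, assuming the scikit-learn library and some made-up fruit measurements, of a model learning a pattern from examples rather than from hand-written rules:

```python
# A minimal sketch of "learning from data instead of rules", assuming
# the scikit-learn library. The fruit measurements are invented.
from sklearn.tree import DecisionTreeClassifier

# Training data: [length_cm, weight_g] for each fruit (made-up values).
X_train = [[12, 150], [13, 170], [7, 90], [8, 100]]
y_train = ["banana", "banana", "apple", "apple"]

model = DecisionTreeClassifier()
model.fit(X_train, y_train)        # the model finds the patterns itself

# Predict for a new, unseen fruit: no rule for "banana" was ever written.
print(model.predict([[11, 160]]))  # -> ['banana']
```

Notice that nothing in the code spells out what makes a banana a banana; the model works that out from the examples it is given, which is exactly why the quality of those examples matters so much.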

The quality and balance of training data affect how ML models work and whether the outputs are useful and fair.

Sofia says, "Fair means when something is free from bias and gives equal consideration and treatment to all parts of a group or situation." Balanced data helps make sure a machine learning model gives fair results.

If the ML training data is not suitable or contains errors, it may produce inaccurate outputs.

A weather ML model may give incorrect predictions if it is trained on data from a different location because the patterns in that data may not match the local weather conditions.

High-quality data should be accurate, complete, and relevant to the task.

Alex says, "So the quality of data used in ML models really does matter, Jacob." Jacob says, "That's right, Alex.

Errors, missing examples, and unbalanced data lead to poor and unfair outputs from ML models." Sofia says, "The size of the set of data used to train ML models also matters!" The size of data sets affects how well a machine learning model performs and how accurately it can handle new situations and input data.

It is important, however, not to assume that simply adding more data will improve a machine learning model.

The extra data also needs to be accurate and fit for purpose, so quality still matters.

Small data sets may not contain enough patterns to create reliable machine learning models.

Larger data sets are more useful, but only when the data is accurate and relevant to the model's intended purpose.

Sofia says, "If data overrepresents one group, then the ML model may output unrepresentative or unfair information. Data should be as relevant, diverse, balanced, and accurate as possible."

Alex says, "But new real-world data is being created all the time. How does that impact ML models?"

Maybe pause the video whilst you think about Alex's question.

Data can become outdated when the real world changes, and this can happen really quickly.

This can make ML models less accurate.

For example, a shopping recommendation model built with data from five years ago might not reflect current trends or available products.

Updating data keeps models accurate and relevant.

New data helps ML models reflect changes in the world, recent patterns, and situations.

Alex says, "A weather prediction model needs recent climate data to give reliable forecasts about future weather events." That's a great example from Alex.

Jacob says, "The type of data used is also important and can affect ML models." The type of data should match the task the machine learning model is designed for.

Using the wrong type of data can lead to inaccurate or irrelevant outputs.

For example, if a model is meant to recognise animals in photos but is trained on images of fruits, it will not work for its intended purpose.

So we've got the example here of new input data, the machine learning classification model, and then we've got a prediction of a banana, which obviously is not accurate in this case.

Alex says, "How do ML models determine whether data is correct or incorrect?" Again, maybe pause your video here and have a think about Alex's question.

Jacob says, "They don't! ML models will process all data provided, even if it contains mistakes." Data scientists and engineers use data cleaning methods to find and correct problems in data so it is more accurate, complete, and ready to use.

Data cleaning is about finding and fixing problems in data so it is accurate, complete, and ready to use.

For example, data cleaning may include: removing errors or incorrect information, filling in missing information, removing duplicate data records.
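As a rough illustration of those three steps, here is a short sketch assuming the pandas library and a small invented table of temperature readings:

```python
# A rough sketch of the cleaning steps above, assuming the pandas
# library. The cities and temperature readings are invented.
import pandas as pd

df = pd.DataFrame({
    "city":   ["Leeds", "Leeds", "York", "York", "York"],
    "temp_c": [14.2, 14.2, None, 980.0, 13.5],  # None = missing, 980.0 = an error
})

df = df.drop_duplicates()                                     # remove duplicate records
df = df[df["temp_c"].between(-40, 50) | df["temp_c"].isna()]  # remove impossible values
df["temp_c"] = df["temp_c"].fillna(df["temp_c"].mean())       # fill in missing values

print(df)
```

After these steps the duplicate Leeds row is gone, the impossible 980.0 reading has been removed, and the missing York value has been filled in with an average, so the data is in a much better state for training.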

Note that the developers also set aside some data to test that the ML model works as intended after training.

So we've got data which is allocated as training data, and we've got data which is allocated as testing data.

Testing the model in this way can help identify bias, inaccuracies, or unrepresentative information in the output.
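Here is a minimal sketch of how that split might be made in practice, assuming the scikit-learn library and an invented set of features and labels:

```python
# A minimal sketch of setting data aside for testing, assuming the
# scikit-learn library. The features and labels are invented.
from sklearn.model_selection import train_test_split

X = [[i] for i in range(100)]    # 100 hypothetical feature rows
y = [i % 2 for i in range(100)]  # 100 hypothetical labels

# 80% of the data trains the model; 20% is held back to test it.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
print(len(X_train), len(X_test))  # -> 80 20
```

Because the test data was never seen during training, it gives a fairer picture of how the model will behave on genuinely new input.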

Okay, time to check your understanding.

I have a question for you.

A machine learning model built using incorrect or incomplete data is likely to: A, output incorrect information; B, not use the incorrect data; or C, correct the mistake automatically? Pause the video whilst you have a think.

Did you select A? Well done.

Remember, a machine learning model built using incorrect or incomplete data is likely to output incorrect information.

Another question for you.

Why is balanced data important? Is it: A, because it makes the model process data faster; B, because it helps the model produce fairer outputs; or C, because it reduces the need for updates? Pause the video whilst you think about your answer.

Did you select B? Well done.

Balanced data is important because it helps the model produce fairer outputs.

A true or false statement for you: An ML model may become less accurate if it uses or is trained on outdated data.

Is this true or false? Pause the video whilst you have a think.

Did you select true? Well done.

Okay, we are moving on to the first task of today's lesson, task A, where you are going to describe the impact of data on ML models.

For part one, in your own words, describe how data can impact: A, the accuracy of information an ML model outputs, and B, the fairness of the information an ML model outputs.

Pause the video whilst you have a go at the task.

How did you get on? Did you manage to describe the impact of data on ML models? Well done.

Let's have a look at a sample answer together.

So for part one, A, you were asked in your own words to describe how data can impact the accuracy of information an ML model outputs.

If the data is correct, complete, and relevant, the model's outputs are more likely to be accurate.

If the data has errors, is outdated, or doesn't match the task, the outputs are much more likely to be wrong.

For part B, you were asked to describe how data can impact the fairness of the information an ML model outputs.

If the data fairly represents all groups, the outputs are more likely to treat these groups equally.

If some groups are overrepresented or underrepresented, the outputs can be biassed and less fair.

Did you have some similar ideas in your answers? Remember, you can always pause the video and revisit your answer if you want to.

Okay, we're moving on to the second part of today's lesson, where we're going to explain bias in ML model predictions.

Alex says, "What does 'bias' mean, Jacob?" Jacob says, "I think it's about when something isn't fair." Do you agree with Jacob? When information is unrepresentative or favours one group over another, it is said to be biassed.

ML models can be biassed.

An ML model is biassed when it outputs unfair or unrepresentative information.

Bias happens when data or the way data is used leads to outputs that are not accurate or fair for all groups.

Alex has a question: "How do ML models become biassed in the first place?" Maybe pause the video whilst you think about Alex's question.

ML models are trained on large amounts of data.

If data contains stereotypes, historical unfairness, or incorrect labels, an ML model may generate biassed outputs based on the patterns it identifies in that data.

Sofia says, "Biassed ML models can really affect people in real life." For example, if a face recognition system's ML model works well for some skin tones but poorly for others, this is an example of bias.

This could have a number of negative impacts, such as: incorrect identification or recognition, unequal or unfair access to systems, reinforcing historical inequality.

Bias matters because it can affect people in real-life outcomes, like jobs, loans, or policing.

Biassed ML models can directly impact the life chances, well-being, and opportunities of certain groups.

Alex says, "That's not fair.

Is there a way to reduce bias in ML models?" What do you think? Do you think there's a way to reduce bias in ML models?" Bias can never be completely removed from ML models, but it can be reduced in a number of ways.

It can be reduced by improving the quality and diversity of data, by testing the outputs for fairness, by adjusting the model design, and by updating data regularly.

To reduce bias, data should include examples from all relevant groups in fair proportions.

Where possible, incorrect labels and stereotypes should be removed from the dataset.
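One quick way to check whether groups appear in fair proportions is simply to count them. Here is a hedged sketch, assuming the pandas library and an invented set of group labels:

```python
# A quick sketch of checking group proportions in training data,
# assuming the pandas library. The group labels are invented.
import pandas as pd

groups = pd.Series(["adult"] * 80 + ["child"] * 20)
print(groups.value_counts(normalize=True))
# adult    0.8
# child    0.2   <- children are underrepresented in this made-up data
```

A count like this does not fix anything on its own, but it makes an imbalance visible so that more examples of the underrepresented group can be collected.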

Testing the outputs of machine learning models is an important step in identifying and reducing bias.

For example, output can be analysed to check whether it is balanced and fair, or whether some groups of people are underrepresented or subject to stereotypes.
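As a hedged sketch of one such fairness test, here is how a model's accuracy could be compared across groups in the test data; all of the names and results below are invented:

```python
# A sketch of one fairness test: comparing the model's accuracy
# for each group in the test data. All names and results are invented.
from collections import defaultdict

# (group, true_label, predicted_label) for each test example.
results = [
    ("group_a", 1, 1), ("group_a", 0, 0), ("group_a", 1, 1),
    ("group_b", 1, 0), ("group_b", 0, 0), ("group_b", 1, 0),
]

correct = defaultdict(int)
total = defaultdict(int)
for group, truth, prediction in results:
    total[group] += 1
    correct[group] += int(truth == prediction)

for group in total:
    print(group, correct[group] / total[group])
# group_a 1.0, group_b ~0.33: a large gap like this is a sign of bias.
```

If the accuracy for one group is much lower than for another, that is a clear signal that the model, or the data it was trained on, needs to be improved.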

Developers can apply techniques to adjust machine learning models in ways that help reduce bias.

Data should be updated regularly so models remain accurate and reflect current realities and context in the real world.

Okay, time to check your understanding.

I have a question for you.

Which of these is an example of bias in an ML model? Is it: A, a weather prediction model that gives slightly wrong forecasts due to a faulty sensor; B, a model that takes longer to process large amounts of data; or C, a face recognition system that works well for some skin tones but poorly for others? Pause the video whilst you think about your answer.

Did you select C? Well done.

That's an example of bias in a machine learning model.

Which of the following can cause bias in an ML model? Is it: A, data that is unrepresentative of some groups of people; B, using larger amounts of data to train the model; or C, using the model on a faster computer? Pause the video whilst you think about your answer.

Did you select A? Well done.

That's right.

Bias can be caused in ML models by using data that is unrepresentative of some groups of people.

Okay, we're moving on to our second task of today's lesson, task B, and you're doing a fantastic job, so well done.

In your own words, explain what bias in a machine learning model means.

As a tip in your answer, you might want to give a short definition of bias, a cause of bias, and an example of how bias can affect people in real life.

Pause the video here whilst you have a go at the task.

How did you get on? Did you manage to explain what bias in a machine learning model means? Well done.

Let's have a look at a sample answer together.

Bias in a machine learning model occurs when the model generates less accurate or unfair outputs for certain groups.

It can be caused by training data that is unrepresentative of some groups or contains stereotypes.

An example of bias is a face recognition model that identifies some skin tones accurately but performs poorly on others, resulting in unequal outcomes.

Remember, if you want to pause the video whilst you add any detail to your answer, you can do that now.

Okay, we've come to the end of today's lesson, "Bias and Accuracy in Machine Learning," and you've done a fantastic job, so well done.

Let's summarise what we've learned together in this lesson.

The type, quality, and amount of data used affects how accurate a machine learning model is.

ML models require both training data and separate test data to ensure reliability.

Bias is introduced into ML models when the data is unrepresentative or contains stereotypes.

Use large, representative data sets and diverse perspectives to reduce bias and improve ML model fairness and accuracy.

I hope you've enjoyed today's lesson, and I hope you'll join me again soon.

Bye.