Hello, my name is Mrs. Holbro and welcome to Computing.

I'm so pleased you've decided to join me for the lesson today.

In today's lesson, you'll be exploring data-driven models in AI.

How is data used in AI systems? And what happens if the data used is poor quality? Welcome to today's lesson from the unit, "Data science: AI and machine learning." This lesson is called data-driven models, and by the end of today's lesson, you'll be able to recognise that AI systems rely on data-driven models and the importance of data quality.

Shall we make a start? We will be exploring these key words throughout today's lesson.

Let's take a look at them now.

Bias.

When something is unfair towards or against something or someone.

Cleaning.

Dealing with various issues that are commonly found in raw datasets such as missing data, duplicated records, and outliers.

Look out for these key words throughout today's lesson.

Today's lesson is broken down into two parts.

We'll start by describing how data is used in AI systems, and then we'll move on to explain the impact of poor quality data.

Let's start by describing how data is used in AI systems. The traditional programming approach of specifying an algorithm to solve a problem is useful when we know and can define the exact steps needed to solve a problem.

When we don't know all the rules or there are far too many rules to realistically write them ourselves, we can instead use data-driven models.

A data-driven approach is a way of designing systems using data instead of step-by-step instructions.

Data-driven systems are suitable for solving problems where rules that cover every situation are difficult to produce.

Current AI systems are mainly data driven.

These systems are provided with lots of training data.

An algorithm identifies patterns in the data to make a machine learning model.

This model can then make predictions on new unseen data.
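
To make that idea concrete, here is a minimal sketch in Python. It is only an illustration: the fruit weights, the labels and the `train` and `predict` functions are made up for this example and are not part of the lesson. Instead of writing the rule ourselves, a very simple algorithm works out a weight threshold from the training data, and the resulting model is then used on new, unseen values.

```python
from statistics import mean

# Made-up training data: (fruit weight in grams, label)
training_data = [
    (110, "apple"), (120, "apple"), (130, "apple"),
    (5, "grape"), (6, "grape"), (7, "grape"),
]

def train(examples):
    """Learn a weight threshold halfway between the two class averages."""
    apple_avg = mean(w for w, label in examples if label == "apple")
    grape_avg = mean(w for w, label in examples if label == "grape")
    return (apple_avg + grape_avg) / 2

def predict(threshold, weight):
    """Use the learned threshold (the 'model') on a new weight."""
    return "apple" if weight > threshold else "grape"

threshold = train(training_data)   # the pattern found in the data
print(predict(threshold, 100))     # a weight the model has not seen before
print(predict(threshold, 8))
```

Nobody typed in the threshold by hand. It came from the data, so different training data would give a different model.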

Training data is the set of data and examples processed by AI systems so that the patterns and relationships can be identified.

Data-driven AI systems rely on training data to perform tasks by using patterns identified in the data.

When using the data-driven approach, solving problems can be broken down into the following steps.

One, collect data.

Two, prepare the data for the model.

Three, train the model.

And four, evaluate and use the model.

Let's take a look at each of these steps in a bit more detail.

Stage one, collect the data.

What data should be used to train the model? Maybe pause the video whilst you have a think.

The aim is to find data that is representative of the context you are modelling, so suitable for the purpose, good enough quality, and ethically sourced.

That means we've obtained people's permission to use that data.

The data you collect should fully represent the context you are modelling.

For example, facial recognition systems may produce biased predictions if there's not enough data from people with a range of different skin tones included in the training data.

The data you collect should be good quality.

If poor quality data is used to train the model, then the model is likely to produce inaccurate predictions.

The data you collect should be ethically sourced.

Data must be obtained lawfully and ethically.

This is particularly important in situations when personal data is collected and processed.

Step two, prepare the data for the model.

The process of preparing data involves exploring the data and understanding its strengths and weaknesses.

You must ensure that the data is consistent, accurate, and in the right format.

Data cleaning is a term that describes dealing with various issues that are commonly found in raw datasets.

These issues can include missing data, duplicated records, and outliers.

Missing data may be treated as a null value by a machine learning algorithm.

You need to decide whether or not missing data is acceptable, should be removed, or should be generated or replaced with something else.
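
Two of those choices can be sketched in a few lines of Python. This is an illustration only: the heights are made-up values, `None` is assumed to mark a missing entry, and replacing a gap with the average of the known values is just one possible way of generating a replacement.

```python
from statistics import mean

# Made-up heights in cm; None marks a missing value
heights_cm = [150, None, 160, 170, None]

# Option 1: remove the records with missing data
removed = [h for h in heights_cm if h is not None]

# Option 2: replace each missing value with the average of the known values
fill_value = mean(removed)
replaced = [h if h is not None else fill_value for h in heights_cm]

print(removed)
print(replaced)
```

Removing records keeps only real measurements but shrinks the dataset. Replacing keeps the dataset the same size but adds values that were never actually measured.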

Duplicated records are a common occurrence in very large datasets.

This may happen due to a data entry error or when datasets from multiple sources are combined.

Removing duplicates can reduce processing time.

Sometimes duplicate data is appropriate.

For example, if an event has occurred multiple times, then we would want it to appear multiple times in the dataset.
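
Here is a small sketch of that distinction, assuming a made-up event log where each record is a tuple of user, time and event. The `drop_exact_duplicates` function is a hypothetical helper written for this example. It removes rows that are identical in every field, but keeps repeated events that happened at different times.

```python
# Made-up event log: (user, time, event). All values are illustrative.
records = [
    ("ana", "09:00", "login"),
    ("ana", "09:00", "login"),   # exact copy, likely an entry or merge error
    ("ben", "09:05", "login"),
    ("ana", "17:30", "login"),   # same user and event, but a separate occurrence
]

def drop_exact_duplicates(rows):
    """Keep the first copy of each identical row, preserving the order."""
    seen = set()
    unique = []
    for row in rows:
        if row not in seen:
            seen.add(row)
            unique.append(row)
    return unique

cleaned = drop_exact_duplicates(records)
print(len(records), "->", len(cleaned))
```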

Outliers are data points that are significantly different from other data points in the dataset.

Outliers can significantly affect the accuracy of a model.
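
One common way to flag outliers is the interquartile range rule, sketched below. The lesson does not name a particular method, so treat this as one possible approach. The readings are made up, and the multiplier of 1.5 is a widely used convention rather than a fixed rule.

```python
from statistics import mean, quantiles

# Made-up sensor readings; one value is very different from the rest
readings = [12, 13, 12, 14, 13, 15, 14, 98]

def find_outliers(values, k=1.5):
    """Flag values further than k * IQR below Q1 or above Q3."""
    q1, _, q3 = quantiles(values, n=4)   # the three quartile cut points
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]

outliers = find_outliers(readings)
cleaned = [v for v in readings if v not in outliers]

print("outliers:", outliers)
print("mean with outliers:", mean(readings))
print("mean without outliers:", mean(cleaned))
```

Comparing the two averages shows how much a single extreme value can pull a result away from the rest of the data. Whether to remove an outlier is still a judgement call, because it may be a genuine measurement.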

Okay, time to check your understanding.

I have a question for you.

Why is data cleaning an important step before using data in AI systems? Is it A, it introduces bias into a model which can lead to more human-based decisions; B, it resolves issues such as missing data, duplicates and outliers leading to more accurate predictions; or C, it replaces missing data with random guesses to save time? Pause the video here whilst you have a think.

Did you select B? Well done.

Cleaning is the process of resolving issues such as missing data, duplicates and outliers, which leads to more accurate predictions from a model.

Step three, train the model.

The aim of training is to adjust the parameters of a model so that it can be used to make accurate predictions on new, unseen data.

The set of clean historical data that has been carefully prepared for this purpose is used to train the model.

Step four, evaluate and use the model.

You should use a fresh set of data, a set that the model has not processed before to evaluate the model and check for accuracy, fairness, and reliability.

The performance of the model should also be continually monitored, and the model updated with new training data as required, because a model can become out of date very quickly.

When evaluating a machine learning model, two important considerations are bias and variance.

Bias, the model makes unfair decisions or predictions towards or against something or someone.

Variance, the model performs well on the training data set, but poorly on new, unprocessed data.
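
The variance problem can be seen by comparing accuracy on the training data with accuracy on fresh data. The sketch below is only an illustration. The datasets are made up, and `memorising_model` is a deliberately poor, hypothetical model that just remembers its training examples and gives a fixed guess for anything else.

```python
# Made-up data: (number, label). The test set was not used for training.
train_set = [(1, "small"), (2, "small"), (9, "large"), (10, "large")]
test_set = [(3, "small"), (4, "small"), (7, "large"), (8, "large")]

# A model that only memorises the training examples
memory = dict(train_set)

def memorising_model(x):
    """Return the remembered label, or a fixed guess for unseen inputs."""
    return memory.get(x, "small")

def accuracy(model, examples):
    """Fraction of examples where the model's prediction matches the label."""
    correct = sum(1 for x, label in examples if model(x) == label)
    return correct / len(examples)

train_accuracy = accuracy(memorising_model, train_set)
test_accuracy = accuracy(memorising_model, test_set)
print("training accuracy:", train_accuracy)
print("test accuracy:", test_accuracy)
```

A large gap between the two scores is a sign that the model has learned the training examples rather than a pattern that carries over to new data.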

Okay, time to check your understanding.

I have a question for you.

Which of the following are important considerations when evaluating a machine learning model? Is it A, programming language and training speed; B, number of developers and size of program; or C, bias and variance? Pause the video whilst you think about your answer.

Did you select C? Well done.

Bias and variance are important considerations when evaluating a machine learning model.

Okay, we're moving on to our first task of today's lesson, task A.

Describe each step of how data is used in data-driven systems. If you need to, you can pause the video and go back and have a look through previous slides to help you with your answer.

How did you get on? Did you manage to describe the steps of how data is used in data-driven systems? Well done.

Let's have a look at a sample answer together.

So step one, collect data.

Gather relevant data from a range of different sources.

Data should be representative of the problem being solved, be high quality and ethically sourced.

Step two, prepare the data.

Ensure the data is accurate, consistent, and in the right format.

Clean the data to resolve any issues.

Step three, train the model.

Expose the model to a set of historical data that has been carefully prepared for training.

Step four, evaluate and use the model.

Test the model to check for accuracy, fairness and reliability.

Two important considerations are bias and variance.

Did you have the steps in the right order? Well done.

Remember, if you need to pause the video and go back and make any corrections or add any detail to your answer, you can do that now.

Okay, so we've described how data is used in AI systems. Let's now move on to explain the impact of poor quality data.

Jun says machine learning is having a huge impact on society, but how do we know that these applications will treat everyone fairly?

Bias is when the output of a machine learning model favours some things and deprioritizes or excludes others.

Some examples of machine learning bias include computer programmer jobs only being shown to users that a social media platform has identified as male, and a facial recognition system that is less accurate in recognising people with certain skin tones.

Data bias occurs due to data being used that does not accurately represent what is being modelled.

For example, if we train a machine learning model to identify apples and we only use training images of green apples, this is data bias.

The resulting model will be less likely to identify red apples correctly.
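
The apple example can be sketched in Python. This is a simplified illustration, not how a real image model works: each made-up training example is reduced to a colour and a label, and the hypothetical `is_apple` function only recognises colours it saw on apples during training.

```python
from collections import Counter

# Made-up training data: (colour, label). Every apple example is green.
training = [
    ("green", "apple"), ("green", "apple"), ("green", "apple"),
    ("yellow", "banana"), ("yellow", "banana"),
]

# Check how well the apple examples cover the range of apple colours
apple_colours = Counter(colour for colour, label in training if label == "apple")
print(apple_colours)

def is_apple(colour):
    """Naive model: an apple is any colour seen on apples in training."""
    return colour in apple_colours

print(is_apple("green"))
print(is_apple("red"))
```

Counting what is in the training data before training is a simple way to spot this kind of gap early.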

What questions should be considered about the data used to create the model? Maybe pause the video whilst you have a think.

Jun says, was enough data used? Izzy says, was the right data used? I think these are two important questions.

Was enough data used? So in our example of the apples, were enough images of apples used? Was the right data used? Only images of green apples were used to create the model, so probably not.

Were the images representative of what the fruit would look like in a supermarket or different situations?

Societal bias is when data being used to train a model reflects bias that exists in society.

Imagine there is an AI application that predicts the job a person might do.

The model has been trained using employment data from the 1960s.

What societal bias do you think could appear in the data? Pause the video whilst you have a think.

Time to check your understanding.

I've got a true or false statement for you here.

Bias is when the output of a machine learning model favours all things equally.

Is this true or false? Pause the video whilst you have a think.

That's right.

It's false.

Bias is when the output of a machine learning model favours some things and deprioritizes or excludes others.

So, not all things are treated equally.

In the data from the 1960s, which gender might have been associated with the following jobs? A professional football player, a teacher, a nurse, and an astronaut.

Maybe pause the video whilst you have a think.

What decisions did you make about the gender associated with the jobs? I wonder if they match these images.

So a professional footballer may have been male, a teacher and a nurse may have been female and an astronaut may have been male.

Funnily enough, I generated these images using an AI image generator, and I didn't specify the gender when I asked the image generator to create these images.

So, is that model biased in its outputs?

It's impossible to remove all bias from a data set.

However, there are steps you must take to reduce bias.

Use large and representative sets of data to train your model.

Before training the model, set aside a proportion of the data for testing, so that the model can be checked on data it has not processed before.

During the process of designing the machine learning application, consider people with different backgrounds, experiences and opinions.
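
Setting aside test data before training can be sketched as follows. The `split_data` function, the 20% test fraction and the fixed seed are all assumptions made for this example. Real projects choose the proportion to suit the amount of data available.

```python
import random

def split_data(data, test_fraction=0.2, seed=0):
    """Shuffle a copy of the data, then hold back a fraction for testing."""
    rows = list(data)
    random.Random(seed).shuffle(rows)   # fixed seed so the split is repeatable
    n_test = int(len(rows) * test_fraction)
    return rows[n_test:], rows[:n_test]

# Made-up dataset of ten records, numbered 0 to 9
dataset = list(range(10))
train_rows, test_rows = split_data(dataset)

print("training records:", len(train_rows))
print("test records:", len(test_rows))
```

Shuffling first helps both parts contain a similar mix of examples, and keeping the test records out of training means they really are unseen when the model is evaluated.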

Machine learning models trained with poor quality data that contains errors, missing data or duplicate data may produce inaccurate results.

So there are some phrases here.

Rubbish in, rubbish out.

And also bias in, bias out.

If machine learning models are trained on poor quality data, they may produce inaccurate or unpredictable results.

People using the model may lose trust and feel the model is producing unfair results.

This ultimately will result in them not using the model.

Okay, time to check your understanding.

I have a question for you.

What is the most likely impact of training a machine learning model on poor quality data? Is it A, the model will still achieve high accuracy as long as the data set is large; B, the model might take longer to train, but will eventually overcome the data issues; or C, the model may produce inaccurate, biased, or unreliable predictions? Pause the video whilst you have a think about your answer.

Did you select C? Well done.

The most likely impact of training a machine learning model on poor quality data is that the model may produce inaccurate, biased, or unreliable predictions.

Okay, we're moving on to our second task of today's lesson, and you've done a great job to get this far, so well done.

For part one, explain the impact of poor quality data on a machine learning model.

For part two, explain the term bias and the steps that can be taken to reduce bias in a machine learning model.

Pause the video whilst you have a go at the task.

How did you get on? Did you manage to explain the impact of poor quality data? Well done.

Let's have a look at a sample answer together.

So for part one, you were explaining the impact of poor quality data on a machine learning model.

Poor quality data negatively affects the performance of a machine learning model in several ways.

If the data contains errors or missing data, the model may produce inaccurate results.

If the data is biased or reflects existing stereotypes, the model's predictions may be biased.

Models trained on poor quality data may appear unpredictable or unfair, making them unsuitable for real world use.

For part two, you were asked to explain the term bias and the steps that can be taken to reduce bias in a machine learning model.

Bias is when the output of a machine learning model favours some things and deprioritizes or excludes others.

Bias occurs due to data being used that does not accurately represent what is being modelled.

Bias can be reduced by using large, representative data sets to train the model and by considering people with different backgrounds, experiences, and opinions when designing the application.

Remember, if you'd like to, you can pause the video here to add any extra detail to your answers.

Okay, we've come to the end of today's lesson, data-driven models, and you've done a great job, so well done.

Let's summarise what we've learned together in this lesson.

Data-driven models are designed to find patterns in data in order to make decisions or predictions.

Accurate, complete and unbiased data is essential for effective AI applications.

Poor data leads to wrong predictions and unreliable AI results.

I hope you've enjoyed this lesson and I hope you'll join me again soon.

Bye.
