Hello, my name's Mr. Davidson and I'm going to be guiding you through your learning today.

Today's lesson is called "Representing text and using ASCII and Unicode", from the unit "Representation of text, images and sound".

By the end of the lesson today you'll be able to explain how computers represent text.

We'll also be learning about these key words: state, which is the value of data at a specific point in time.

Character set, a system that matches characters to a unique binary sequence.

ASCII, which is a method of character representation that uses seven bits per character.

And Unicode, a method of character representation that uses up to 32 bits per character.

The lesson's going to involve three learning cycles today.

Let's start with the first one.

Calculate the number of states of a sequence.

Binary sequences are made up of binary digits or what we refer to often as bits.

They can vary a lot in length, but the state of an individual bit is its value at a certain point in time.

So if we consider an individual binary digit, that bit can only ever have the value zero or one. That may change over time, but it'll only ever be one of those two values.

Now let's just check that you've remembered that.

What is the total number of states that a single binary digit can have? Good, you've remembered it's two.

And remember that's zero or one as a value.

If we extend that to a binary sequence, meaning multiple bits collected together, then because a single bit has two states, the sequence as a whole is going to have a number of different states.

Now Sam's asking, what are all the states a two bit sequence can be? So we mean a sequence that has two bits collected in it, and the sequence is considered as those two bits together.

Lucas finds that easy, and he's responding with the four different states that it can be.

It can have 00.

So the first bit and the second bit are both zero. Then 01, where the last bit has changed, through to 10 and 11. It's quite easy to figure out all the different states that we can have.

So we would say a two bit sequence is going to have four different states.
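
If you'd like to check Lucas's list on a computer, here's a minimal Python sketch. It assumes the standard itertools module, and the name two_bit_states is just one chosen for this illustration.

```python
from itertools import product

# Every way of choosing 0 or 1 for each of the two bit positions.
two_bit_states = ["".join(bits) for bits in product("01", repeat=2)]

print(two_bit_states)       # ['00', '01', '10', '11']
print(len(two_bit_states))  # 4
```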

So let's extend that.

And Sam's asking now, what are all the states a three bit sequence can be?

Again, Lucas is pretty confident he can list them all out, change one bit at a time, and come up with all the states that that three bit sequence can be.

So we would confirm that and say, a three bit sequence will have eight different states.

However we have a problem.

The longer the binary sequence becomes, the more states it can have.

So Sam extends this again and he moves it up to five bits.

So what are all the states a five bit sequence can be? It's at this point Lucas realizes that that's gonna take him a long time to work out.

It was easy when there were two or three bits, but because we've got those extra bits in the sequence, listing those combinations becomes difficult.

Now actually, if you worked it out, a five bit sequence can have 32 different states.

The method Lucas uses for determining the number of states a binary sequence has is to list them, and it helps to do it in counting order. You start with the rightmost bit changing to a one, and then as we progress the next column along changes, and the first column resets itself back to zero.

That process carries on until you've listed all of the possible states of that binary sequence.
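
That counting-order process can be sketched in Python. This is only one possible way to write it, and the function names next_state and list_states are made up for this example.

```python
def next_state(bits):
    """Return the next state in counting order, or None after the last one."""
    bits = list(bits)
    # Work from the rightmost column towards the left.
    for i in range(len(bits) - 1, -1, -1):
        if bits[i] == "0":
            bits[i] = "1"  # this column changes to a one
            return "".join(bits)
        bits[i] = "0"      # this column was a one, so it resets to zero
    return None            # every column was a one: nothing left to list


def list_states(n_bits):
    """List every state of an n_bits-long sequence, starting from all zeros."""
    states = []
    state = "0" * n_bits
    while state is not None:
        states.append(state)
        state = next_state(state)
    return states


print(list_states(3))
# ['000', '001', '010', '011', '100', '101', '110', '111']
```

Calling list_states(5) gives the 32 states that Lucas didn't fancy writing out by hand.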

As Lucas found out though, even if we have just one extra bit put on to the left hand side of this sequence, it's actually gonna take a lot more work to list all of those different possible states.

The good news is that we can use maths to determine the number of possible states of a binary sequence.

So if we think we've got N number of bits, we can say that the number of states is two to the power of N.

And in our example, if we have a four bit binary sequence, the number of states would be two raised to the power of four.

Remember, N is the number of bits, and we've said we are using four bits.

So we do two to the power of four.

And that gives us our number of states as 16.

Now if you're not sure how that's worked out, we can do it on a calculator, or we can consider that two to the power of four is two times two times two times two.

So that's four twos multiplied together.
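
Here's a short Python sketch of that calculation, assuming we call the function number_of_states (a name chosen just for this example). In Python, ** means "to the power of".

```python
def number_of_states(n_bits):
    """Number of different states of a binary sequence with n_bits bits."""
    return 2 ** n_bits


# Two to the power of four is the same as four twos multiplied together.
assert number_of_states(4) == 2 * 2 * 2 * 2 == 16

for n in range(1, 6):
    print(n, "bits ->", number_of_states(n), "states")
```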

Let's check that you can work that out for yourself.

How many states would a binary sequence of three bits have? Well done.

You worked that out perfectly.

It's eight: two to the power of three is eight.

I want you to try and practice some of that now.

First task I'm gonna get you to do is to complete a table listing all the possible different states for different amounts of bits.

You can confirm you've got the right number of states by calculating the possible number of sequences mathematically as well.

So in my first example I've realized that with one bit I've got the states zero and one. I can check that by using my formula of two to the power of N, where N is the number of bits.

So two to the power of one gives me two. I want you to do the rest yourself.

Once you've done that, there's a second part.

So remember the change in the total number of states follows a pattern as the number of bits increases.

Have a look back through your answers in the first part and then describe how the total changes and explain why this happens.

Pause the video and have a go now.

Well done.

There was a lot of working out there to do.

Let's check the answers that you've got.

So you can see I've listed the four different possible states for two bits and checked mathematically that two to the power of two is four.

I've then gone on with three bits, and written out the eight possible states.

Again, checking mathematically I've got the correct number of states.

And then finally the difficult one, where we've got four bits and we've got a lot of ones and zeros there to write.

But again, I've got the right number of states and checked it, because two to the power of four gives me 16 and I've written the 16 different sequences there correctly.

And for part two of the task, did you spot a pattern? When the number of bits used in the sequence increases by one, the total amount of possible states doubles from the previous total.

If you look back, you can check that: two bits have four states, three bits have eight states and so on.

Each time the number of states is doubling.

So we could say as another bit is added to the sequence, the existing states are joined with this new bit, which itself has two states.

This has the effect of doubling the previous total, therefore every bit added doubles the amount of states.
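
One way to see that doubling is the Python sketch below. It assumes the new bit is joined on at the left of each existing state, and add_bit is a made-up helper name for this example.

```python
def add_bit(states):
    """Join each existing state with a new leftmost bit of 0, then of 1."""
    return ["0" + s for s in states] + ["1" + s for s in states]


one_bit = ["0", "1"]
two_bit = add_bit(one_bit)    # ['00', '01', '10', '11']
three_bit = add_bit(two_bit)  # eight states

print(len(one_bit), len(two_bit), len(three_bit))  # 2 4 8
```

Every existing state appears twice in the new list, once with a 0 in front and once with a 1, which is why the total doubles.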

Let's move on to the second part of the lesson now, where we're going to describe how ASCII is used to represent characters.

Computers have to represent lots of different things and certainly one of those things is textual data, such as words.

Words are formed from an alphabet.

So we can see here the English alphabet starting from A all the way through to Z.

Now if a computer is going to represent those characters, each of the characters is going to need its own unique binary sequence.

That means we've got to represent all 26 letters, and if we look back at the numbers of states we worked out, five bits are needed, because two to the power of five provides 32 different states.

That's a few more than we need, but if we used only four bits, we wouldn't have enough for 26 letters.

So just see if you can remember what we've just discussed.

Let's check how you got on.

So all 26 letters of the alphabet can be represented by a binary sequence of five bits.

Two to the power of five provides 32 different states.

Four bits would not be enough as two to the power of four would only provide 16 different states.
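
The same reasoning can be sketched in Python. The function bits_needed is a hypothetical helper written for this example: it keeps adding bits until there are enough states.

```python
def bits_needed(n_characters):
    """Smallest number of bits whose states cover n_characters characters."""
    n_bits = 0
    while 2 ** n_bits < n_characters:
        n_bits += 1
    return n_bits


print(bits_needed(26))  # 5, because 2**4 = 16 is too few and 2**5 = 32 is enough
```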

Computers though need to represent more than just the 26 letters of the alphabet.

For example, we've got both upper and lower case letters that need to be represented, and they need to be treated differently.

And we're not just limited to letters.

Other symbols also need to be represented such as punctuation, numbers and spaces between words.

And this example sentence that I've got here, "This sentence uses 17 different characters!", actually includes 17 different characters if we list them. Some of them are repeated, but we need 17 separate sequences to represent those 17 different characters, including the space and the exclamation mark.
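
You can count distinct characters like this in Python. The transcript doesn't show the sentence on screen, so the wording used below is an assumption reconstructed from what is said about it.

```python
# Assumed wording of the on-screen example sentence (not shown in the transcript).
sentence = "This sentence uses 17 different characters!"

# A set keeps only one copy of each character; "T" and "t" count separately.
distinct = set(sentence)

print(len(distinct))  # 17
```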

But if you think about it a bit more character representation includes more than just the letters, numbers, punctuation and spaces used in typical sentences.

We might have to represent the enter character, to move the cursor down to the next line.

So that's like when you press enter in a word processing document.

We need to represent tab spaces, and the tab key when we press it creates a larger space to indent text like we would have in paragraphs, or perhaps you've done it in Python where you need to indent a block of code.

We sometimes have to represent special characters like mathematical symbols. Symbols such as Pi, Epsilon and Delta are all used in formulas.

So we need to be able to represent them in a computer.
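
The enter and tab characters mentioned above really are characters with their own values. In Python strings they are commonly written as "\n" and "\t", and this small sketch shows the numbers Python reports for them.

```python
newline = "\n"  # the line break that pressing enter commonly produces
tab = "\t"      # the character produced by the tab key

# Each one is a single character with its own numeric code.
print(len(newline), ord(newline))  # 1 10
print(len(tab), ord(tab))          # 1 9

# repr() shows these otherwise invisible characters inside a string.
print(repr("if x:\n\tprint(x)"))
```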

A standard English keyboard typically has 104 keys on it, give or take a few.

So to be able to give 104 different binary sequences, we are going to need to use a seven bit sequence as two to the power of seven provides 128 different binary sequences.

We've also got to consider that of those 104 keys, some of them have double symbols, meaning we can change what symbol we produce with the same key.

So 128 is probably just about right.

Now we need to be able to remember which symbol is matched to which binary sequence, and that's where something called a character set comes in.

Now a character set is a system that matches binary sequences to characters.

And as Alex knows, a computer can represent characters by picking long enough binary sequences and assigning the character it wants to each sequence.

But Izzy has thought about this. She's asking what happens if another computer chooses a different order for its character set.

It's all gonna get pretty confusing if everyone has a different character set for the different characters and the binary sequences that represent them.

So it would help if we could find a way of standardizing these character sets.

Computers all need to use an identical character set to know which sequence goes with which letter.

Using the same character set makes sure that the binary sequence matches the same character on all devices.

If we were sending an email, for example, we'd know that the characters are transferred between computers as binary sequences, and we'd be confident that using the same character set ensures that the sequences are matched to the same characters that you sent.
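
Here's a toy Python sketch of Izzy's problem. The two character sets below are invented purely for illustration (they are not real standards), and encode and decode are made-up helper names.

```python
# Two made-up 2-bit character sets: same characters, different order.
charset_a = {"00": "h", "01": "i", "10": "!", "11": " "}
charset_b = {"00": "!", "01": "h", "10": " ", "11": "i"}


def encode(text, charset):
    """Turn each character into its binary sequence using the given set."""
    lookup = {char: bits for bits, char in charset.items()}
    return [lookup[char] for char in text]


def decode(sequences, charset):
    """Turn each binary sequence back into a character using the given set."""
    return "".join(charset[bits] for bits in sequences)


sent = encode("hi!", charset_a)          # ['00', '01', '10']
same_set = decode(sent, charset_a)       # 'hi!'
different_set = decode(sent, charset_b)  # '!h '

print(same_set)
print(repr(different_set))
```

The binary sequences arrive unchanged both times. Only the receiver that uses the same character set as the sender gets the original message back.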

So let's just check that you remember what that key term is.

What is a record of characters matched to a unique binary sequence known as? Well done.

It's a character set.

In order to provide consistency for character representation, the first edition of the ASCII standard, which is short for American Standard Code for Information Interchange, was created and used in 1963.

It listed all the possible characters that needed to be represented in its character set and chose for each character a seven bit binary sequence that would be used to represent it.
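
You can look up these seven bit sequences in Python. This is a minimal sketch using the built-in ord and format functions; ascii_bits is a helper name invented for this example.

```python
def ascii_bits(char):
    """Seven-bit ASCII sequence for a single ASCII character."""
    code = ord(char)  # the character's numeric value
    if code > 127:
        raise ValueError(f"{char!r} is not an ASCII character")
    return format(code, "07b")  # binary, padded with zeros to seven digits


print(ascii_bits("A"))  # 1000001
print(ascii_bits("B"))  # 1000010
print(ascii_bits("!"))  # 0100001
```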

I'm gonna want you to put some of that into practice now yourself.

Firstly, I'm gonna want you to describe how the ASCII standard is used to represent characters stored on a computer, and it's important to be able to articulate how that standard works and how it affects the characters that can be represented.

Once you've done that, I want you to try and complete the table.

The table is part of the ASCII character set.

Now you'll notice that the letters go sequentially, in order, and so do the seven bit binary sequences. I want you to have a go and see if you can predict what the next three binary sequences are, if they go up in counting order.

And for the last part of the task, we're going to explore the difference between upper and lowercase letters in the ASCII character set.

Now if you look at the difference between the different cases of characters, the sequences are almost the same, but the sixth bit from the right is the one that changes.

Now in Python, a string can be converted to uppercase by using the dot upper method.

I'm gonna want you to explain how this feature of the ASCII character set makes this method easier than if it had to change the ASCII value to a totally different sequence.

Pause the video at this point and have a go at these tasks and then I'll go through the answers once you're done.

Well done.

Let's check the answers.

ASCII uses a seven bit binary sequence to represent characters in a computer.

Each character has a standardized binary sequence, so all devices using ASCII know which character matches each sequence.

Now we saw before that letters and the binary sequences in a character set go up sequentially.

So the next three sequences are listed on the table there.

And for the last part, we see that the dot upper method only has to change one bit in our seven bit sequence, not the whole sequence.

And this makes the method's algorithm simpler and easier to use.

If we think about how we design our data, we can actually make our algorithms that process that data easier to operate and easier to program.
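
To see that single bit in action, here's a hedged Python sketch. It is not how Python's own upper method is written (that one handles far more than ASCII). It is just an illustration of the idea, and to_upper_ascii and CASE_BIT are names invented here.

```python
CASE_BIT = 0b0100000  # the sixth bit from the right, worth 32


def to_upper_ascii(char):
    """Uppercase one ASCII letter by clearing a single bit."""
    if "a" <= char <= "z":
        return chr(ord(char) & ~CASE_BIT)
    return char  # anything that isn't a lowercase letter is left unchanged


print(format(ord("a"), "07b"))  # 1100001
print(format(ord("A"), "07b"))  # 1000001
print("".join(to_upper_ascii(c) for c in "hello!"))  # HELLO!
```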

Okay, let's get onto the last learning cycle, which is to explain why Unicode representation is needed.

ASCII provides a character set of 128 possible characters.

However, computers need to be able to represent more than 128 characters.

ASCII is only specified to work with Western alphabets, not other languages from around the world.

And if you think about it a little bit more, character representation also needs to include other symbols that aren't parts of speech.

For example, a maths textbook might need to include mathematical equations with different symbols and different styles of symbols as well.

And we know if our ASCII character set is limited, it wouldn't be possible to represent this fully using ASCII.

We may find over time as well that new characters need to be created and represented as part of a character set.

Take emojis, for example: each emoji is represented by a binary sequence in a character set.

They are characters after all, just like letters and numbers.

So what can you remember? Which of these three options cannot be represented in ASCII? Well done.

It's the Urdu characters we've got there.

The exclamation mark and the uppercase T can both be represented in ASCII, but we need something else to be able to represent other languages that aren't based on the same alphabet.

To get around this, Unicode was created and first used in 1991, and that allowed more characters to be represented than is possible with ASCII.

Unicode uses up to a 32 bit binary sequence, which is actually four bytes, to represent characters.

Again, if we work out how many sequences that is, this gives two to the power of 32 different sequences, which can be used to represent over 4 billion different characters.
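
That number is quick to check in Python. The variable names below are just chosen for this example.

```python
ascii_bits_per_character = 7
unicode_max_bits = 32

ascii_sequences = 2 ** ascii_bits_per_character
unicode_sequences = 2 ** unicode_max_bits

print(ascii_sequences)    # 128
print(unicode_sequences)  # 4294967296
```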

Unicode was set up to be compatible with ASCII.

The first 128 characters in Unicode are the same as the first 128 in ASCII.

That means those first 128 characters have the same values in Unicode as they do in ASCII, and the other 25 bits that are available with Unicode are used for other characters.

Let's check if you remember what we've just seen.

How many characters in Unicode are the same as in ASCII? Well done.

It's 128.

So the first 128 characters in Unicode are the same as in ASCII.
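
If you'd like to confirm that for yourself, here's a minimal Python sketch. It relies on Python's built-in chr function, which gives the character at a Unicode position, and its built-in "ascii" codec.

```python
# Decode every possible 7-bit value (0 to 127) using the ASCII character set.
ascii_characters = bytes(range(128)).decode("ascii")

# chr(i) gives the character at Unicode position i.
unicode_first_128 = "".join(chr(i) for i in range(128))

print(ascii_characters == unicode_first_128)  # True
print(ord("A"), format(ord("A"), "07b"))      # 65 1000001
```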

Okay, it's time for our last practice tasks of the day.

First, I've given you a scenario and I want you to have a think and give an explanation as to why this scenario has happened.

A shipping company in Beijing, China, has sent a parcel to Hull, UK.

When the parcel leaves China, an automated confirmation message in Mandarin Chinese is sent via the internet to the destination.

The tracking system at the destination in Hull is an old system and still uses ASCII encoding.

When the message from China is received, it's full of missing characters represented by a rectangle.

Can you explain why the message received has these missing characters, even though it was sent correctly? Next, I want you to give Andeep some help.

He's saying, "My computer uses Unicode, so I won't be able to read documents encoded in ASCII."

Explain why Andeep is incorrect and why he will be able to open any ASCII documents even though he uses Unicode.

Pause the video and have a go at those tasks.

How did you get on? Let's check your answers against mine.

So in our first part we've got to understand that the message from China was probably encoded in Unicode to be able to represent Mandarin characters.

When it's received by the older system that still uses ASCII, the system can't recognize these characters, and any unrecognized character appears as a rectangle.

It shows that the systems aren't compatible because ASCII can't show characters it doesn't have a record for in its character set.
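
A rough Python sketch of the same problem is below. The message text is a made-up example, not the real tracking message, and Python marks characters it can't encode with a question mark rather than the rectangle a display might show.

```python
message = "包裹已发出"  # hypothetical message text, for illustration only

# isascii() is True only if every character is in the ASCII character set.
print(message.isascii())  # False

# With errors="replace", Python substitutes "?" for each character
# that has no record in the ASCII character set.
as_ascii = message.encode("ascii", errors="replace")
print(as_ascii)  # b'?????'
```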

For the next part, we helped Andeep by telling him the first 128 Unicode characters match the ASCII character set.

So a system using Unicode can still display ASCII characters correctly.

Unicode is designed to work with ASCII ensuring the same characters are represented.
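
Andeep's situation can be sketched in Python too. The lesson doesn't say which Unicode encoding his computer uses, so UTF-8 is an assumption here, and the document text is invented for the example.

```python
# Bytes as an ASCII-only system would store them (example text is made up).
ascii_document = "Parcel delivered to Hull".encode("ascii")

# Reading the same bytes with a Unicode encoding (UTF-8 assumed) gives the same text.
text = ascii_document.decode("utf-8")

print(text)  # Parcel delivered to Hull
```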

Well done.

You worked really well today.

Let's just check what we learned.

Binary sequences have two to the power of N different states where N is the number of bits in the sequence.

Character sets keep a record of characters against their unique binary sequence.

And we saw that ASCII uses seven bits to represent characters, whereas Unicode uses up to 32 bits.

Remember as well, Unicode can represent a wider range of languages and symbols, but is still compatible with ASCII.
