Lesson video

In progress...

Hello, I'm Dr.

Gus, welcome to computing.

I'm so pleased that you have decided to join me for this lesson today.

In today's lesson, we will be looking at Huffman coding.

Welcome to today's lesson from the unit, data compression.

This lesson is called Huffman coding, and by the end of this lesson you should be able to describe and use Huffman coding to create a compressed representation of data.

Some of the keywords that we will be using in today's lesson include Huffman tree, Huffman tree, a special kind of binary tree used to create short binary codes for data based on how often each piece of data appears.

Frequency, frequency, how often a piece of data appears in a file.

Encoding, encoding, to change data into a different format so a computer can store or send it.

This lesson is split into two sections.

The first section looks at describing Huffman coding.

In the second section, we will look at how you can compress data using Huffman coding.

Let's start with the first section where we will look at what Huffman coding is.

Data compression is the process of reducing the size of a file.

To compress a file, compression algorithms are used.

Data compression algorithms work by removing unnecessary data or rearranging information more efficiently.

There are a number of advantages to using data compression, such as a reduction in the space needed to store files locally and in the cloud, less bandwidth usage when transferring data on a network, faster uploading and downloading times to reduce latency.

Algorithms, which are just sequences of steps, are used to reduce the number of bits required for each file.

So if we start off with this original file, after having used a compression algorithm, the compressed file is of a much reduced size.

These algorithms are called compression algorithms. There are two categories of compression algorithms, lossy compression and lossless compression.

The choice of compression method will depend on factors like the need for data accuracy, the acceptable level of quality loss and the importance of reducing file size.

Huffman coding is a method of lossless data compression.

It reduces file size by assigning shorter binary codes to more frequent data patterns and longer codes to less frequent data patterns.

To do this, it builds a special binary tree called the Huffman tree, based on the frequencies of the data patterns.

The result is a more efficient binary representation that saves space without losing any information.

Huffman coding can be used to reduce the file size of lots of different types of files, such as images, audio, video, and text.

Aisha says, "I'd like to compress a text file, but I've heard that some compression algorithms don't work very well with text, what should I do?" Lucas says, "Huffman coding can be used to compress text files.

It can reduce the size of a file to between 30% and 70% of the original." In a text file, each letter, number, space, or symbol is stored as a binary code.

Each of the characters are typically stored using a fixed number of bits.

If you were using 8-bit ASCII, each character would take up 1 byte or 8 bits of space.

This would be higher for 16 or 32-bit Unicode.

Uncompressed text may be acceptable for a small text file, but what if you were writing a book with 100,000 words in it? An uncompressed book would end up having a file size of around 500,000 bytes.

To ensure that hundreds or even thousands of books can be saved on e-readers.

Some compression needs to occur.

A single word can be used as an example.

So if you choose the word Mississippi, there are 11 characters in this word.

The uncompressed text file would have a file size of 88 bits or 11 bytes if using 8-bit ASCII, each character is represented by 8 bits, there are 11 characters, that gives you 8 times 11, so 88 bits, or in other words, 11 bytes, 'cause 1 byte has 8 bits in it.

With Huffman coding, it is possible to reduce the raw file size from 88 bits to 21 bits.

You can create a frequency table to count how many times each letter appears in the text file.

So for Mississippi, M appears once, I appears four times, S appears four times and P appears twice.

The letters can then be sorted into order to show which letters occur most frequently and which occur less frequently.

So in Mississippi, if we rearrange the letters in the order of frequency, S comes first with 4, I, 4, P 2 and M, 1.

A Huffman tree is then created based on these frequencies.

A Huffman tree uses a binary tree structure and is used to visually represent the encoding scheme of the compressed data.

A binary tree is made up of nodes and branches as you can see in the image here, each node can have a maximum of two child nodes.

This is where the term binary tree comes from.

The node at the top is referred to as the root node and is also a parent node.

A parent node is any node that has links to other nodes beneath it.

The nodes connected directly below a parent node are known as child nodes.

Right, before we move on, let's do a quick check.

A Huffman tree is made up of nodes and branches.

What do you think? Is this statement true or is it false? Why don't you pause the video here and have a quick think.

The correct answer is true.

A Huffman tree is made up of nodes and branches.

Well done.

A frequency table can be used to create a Huffman tree for the word Mississippi.

First place the two least common characters along with their frequencies in the binary tree.

So as you can see here, the two least common characters are P and M.

So you create a parent node and that branches out into two child nodes, one child node is M with a frequency one, and the other child node is P with a frequency two.

Next add a new parent and child node and place the next least common character along with its frequency in the new child node.

So we have I with a frequency of four under a parent node, and then of course the old parent node becomes the child node for the new parent node.

So the old parent node is represented with the black circle and the new parent node is represented with the clear purple circle.

Finally, add a new parent and child node and place the last, which is the most frequent character, along with its frequency in the new child node.

So we have a new parent node, so clear purple circle with the child node, S, along with its frequency, which is 4, and the old parent node becomes the new parent node's child node.

A Huffman tree for the word Mississippi then looks like this.

You can see that in the image shown here.

You can see the four characters in the word listed here on the left hand side, so your S, I, M, and P, you can see the frequency of each character listed on the right hand side, 4, 4, 1, and 2.

Notice how the less frequently occurring characters are furthest away from the root node.

Characters that occur more frequently will be assigned shorter bit patterns.

So S gets the bit pattern 0.

While M, which is a character that occurs less frequently, is assigned a longer bit pattern of 110.

Huffman coding reduces the size of a text file by assigning shorter binary codes to frequent characters and longer binary codes to less frequent ones.

This reduces the total number of bits needed to represent the data without losing any information.

The encoding data contained in a Huffman tree is needed to decompress the data back to its original form.

Without access to the Huffman tree or the encoding scheme, decompression would not be possible because there would be no way to reverse the encoding process.

Note that if the encoding data needs to be transmitted alongside the compressed data, this can reduce the compression efficiency, especially for small files.

Right, another quick check.

In a Huffman tree, characters that occur less frequently are placed, you have been given three options.

Why don't you post the video here and work this one out? The correct answer is option B.

In a Huffman tree, characters that occur less frequently are placed further away from the root node and have a larger bit pattern.

Well done on working that one out! Let's move on.

In this check, you are being asked whether the statement is true or false.

Huffman coding is only used to compress the size of text files.

What do you think? Why don't you pause the video here and try and work this one out? Correct answer is false.

Why? Because Huffman coding can be used to reduce the file size of lots of different types of files, such as images, audio, video, and text.

Well done on working that one out.

This brings us to the end of this section.

Before we move on, why don't you try this task just to make sure that you have understood what we've covered in this section.

You're being asked to write two or three sentences in your own words to describe Huffman coding.

Why don't you pause the video here, look through what we have covered in this section and try and work this out.

I'm sure you'll be able to.

Right, a possible answer.

Huffman coding is a lossless data compression algorithm that can be used to reduce the overall size of different file types such as image, audio, video, and text.

The algorithm will count how many times each letter appears in the text file and then a Huffman tree is created based on the frequencies.

More frequent characters are placed closer to the root node with a smaller bit pattern and less frequent characters are placed further down the Huffman tree with larger bit patterns.

Huffman coding reduces the total number of bits needed to represent data without losing any information.

Well done on attempting this task and working that part out, let's move on.

Right, this next section looks at how you can compress data using Huffman coding.

Lucas says, "I understand that the less frequent characters are furthest away from the root node, but I'm not sure how the Huffman tree is used to work out the shorter binary code." The cumulative frequency of each node is placed in the parent node.

So for the bottom two child nodes, the cumulative frequency is 1 plus 2 which gives you 3 so 3 is placed in the parent node.

For the next one up, 4, which is the frequency of the letter I added to the cumulative frequency of the other child node, which is 3 gives you 7.

So that's placed in the parent node.

Moving further up, 4 plus 7 gives you 11.

So the final parent node, the root node, gets the number 11.

Once the Huffman tree is created, a binary number is assigned to each character.

Moving left uses 0 and moving right uses a 1.

For the character S, you just move left once.

So the bit pattern for S is 0.

For the character I you move right once and left once, so the bit pattern is 10 for the character M, you move right twice as you can see in the diagram and then move left once, so the bit pattern for M is 110.

Finally, for the character P, you move right three times, so the pattern, the bit pattern for character P is 111.

Right! Let's do a quick check.

In a Huffman tree, moving left from a node assigns a zero to that branch.

What do you think? Is that statement true or is it false? Why don't you pause the video here and try and work this out.

The answer is true, In a Huffman tree moving left from a mode assigns a zero to that branch.

Well done.

Which of the following is used to create a Huffman tree? You've been given three options.

Why don't you pause the video again and try and work this one out.

I'm sure you'll be able to.

The correct answer is character frequency.

So character frequency is used to create a Huffman tree.

Well done on working both these out.

Let's move on.

These bit patterns can now be used to create the compressed file.

So the bit pattern for S was 0, bit pattern for I was 10, bit pattern for M was 110 and bit pattern for P was 111.

So if you put all of this together we get the following bit pattern, M is 110, I is 10, S is 0, S is 0, I is 10, S is 0, S is 0, I is 10, P is 111, the next P is also 111 followed by I which is 10.

The compressed file now has a file size of 21 bits compared to the original file size of 88 bits.

So that's 21 bits altogether.

Notice that there has been no loss of data during this compression, this means that Huffman coding is a form of lossless compression algorithm.

Note, 21 bits refers to the raw file size and doesn't take into account the amount of storage required to store the tree that is needed for decompression.

Right, this brings us to the end of this section.

Why don't you attempt this task just to make sure that you have understood what covered in this section.

So as you can see, a Huffman tree has been created for the word hello.

This task is made up of four parts.

The first part asks you to complete the table to state the bit pattern that will be used for each character.

Why don't you pause the video here and try and work this one out.

Let's look at the answer.

So if we work, try and work this out.

So L is your most frequently occurring character with a frequency of 2, to go from the first parent node we move once left to get to L, so the bit pattern for L is 0, the next character is O, to get to O we move right once and then left once, so the bit pattern for O is 10.

To get to H we move right twice and left once, so the bit pattern for H is 110.

And finally for the letter E we move right three times, so the bit pattern becomes 111.

Well done on working this one out.

Let's look at the next part.

Part two asks you to write down the full bit pattern for the compressed word, hello.

A helpful tip, use the bit pattern table that we used in the previous question, question one for reference.

Pause the video here and work this one out.

I'm sure you'll be able to.

Right.

So for the word hello, the bit pattern is, for H, the bit pattern was 110, so 110 followed by E, which was 111, then two Ls, each L is has a bit pattern 0, so followed by two 0s, and then ending with the O, which has the bit pattern 10.

Really well done on working this part out, let's look at the final two parts.

Part three asks you what is the file size of the original file containing the word hello in bits if 8-bit ASCII is used.

Part four asks you what is the file size of the compressed file containing the word, hello in bits? Why don't you pause the video here and try and work these two parts out.

You've been doing so well.

I'm sure you'll be able to come up with the solutions.

For the question where it asks you the file size of the original file, there are five characters in the word, hello.

Each character takes up a space of 8 bits, 5 x 8 gives you 40 bits.

So the original file, the file size of the original file containing the word, hello, in bits, if 8-bit ASCII is used is 40 bits.

Now when we compress the file using Huffman coding, the file size becomes 10 bits.

So from 40 we have reduced the file size to 10 bits.

Well done on working both these parts out.

You have done brilliantly.

Let's move on to the summary.

To summarise, Huffman coating is a lossless compression technique used to reduce the size of many types of files.

Huffman coding assigns shorter bit patterns to more frequent characters and longer bit patterns to less frequent ones.

Huffman trees are used to visually represent the encoding scheme that is used to compress data.

Huffman trees sent with compressed data do add overhead and can reduce overall compression efficiency.

I've finished the video