Hello, I'm Dr. Das.
Welcome to Computing.
I am so pleased that you have decided to join me for this lesson today.
In today's lesson, we will be looking at comparing compression algorithms. This lesson, from the unit "Data compression," is called "Comparing compression algorithms," and by the end of it, you should be able to compare the suitability of different compression algorithms. Some of the keywords that we will be using in this lesson today are: algorithm, a set of step-by-step instructions that are followed to perform a task.
Efficiency, how effectively compression reduces the size of data while maintaining its usability.
Trade-offs, the limitations and choices faced when trying to optimise multiple competing factors.
This lesson is split into two sections.
The first section describes the different types of compression.
The second section compares the suitability of compression algorithms. Let's start with the first section, where we will see how we can describe types of compression.
Data compression is the process of reducing the size of a file.
To compress a file, compression algorithms are used.
Data compression algorithms work by removing unnecessary data or rearranging information more efficiently.
There are a number of advantages to using data compression, such as a reduction in the space needed to store files locally and in the cloud, less bandwidth usage when transferring data on a network, and faster uploading and downloading times to reduce latency.
Algorithms, which are just sequences of steps, are used to reduce the number of bits required for each file.
So if the original file looks like this, we are looking at getting a compressed file that is much smaller in size.
These algorithms are called compression algorithms. So the original file, when acted on by a compression algorithm, gives you a compressed file, which is smaller in size.
There are two types of compression algorithms, lossy compression and lossless compression.
The choice of compression method will depend on factors like the need for data accuracy, the acceptable level of quality loss, and the importance of reducing file size.
Lossy compression always results in a loss of data.
The decompressed file that uses lossy compression will never be an exact copy of the original, only an approximation.
This is because lossy compression algorithms permanently remove data from the file to aid with the compression.
There are many types of lossy compression.
Some common ones are outlined in the table below.
So JPEG compression is generally used on image files, and the key feature of the JPEG technique is that it removes fine detail and reduces colour precision.
MP3 compression is used on audio files, and this discards inaudible sounds and compresses waveform data.
Finally, the MP4 technique is used on video files, and an MP4 file can hold video, audio, subtitles, and images.
Lossy compression algorithms are generally more efficient at reducing file size because they can achieve higher compression ratios than lossless compression algorithms. A trade-off of using higher compression ratios is that more of the original data will be lost and the quality of the information contained in the file could be reduced.
Lossy compression may be acceptable for media files such as images, audio, and video as it can be hard for a human to notice the difference between the original and compressed files.
Lossy compression is not suitable for text and executable files as all the original information needs to be retained.
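To make the idea of a compression ratio concrete, here is a small illustrative calculation in Python. The file sizes are invented purely for the example; they are not measurements of any real file.

```python
# Illustrative only: a compression ratio compares the original size
# of a file to its compressed size. These figures are made up.

original_size_kb = 4000    # e.g. an uncompressed bitmap image
compressed_size_kb = 400   # the same image after lossy compression

ratio = original_size_kb / compressed_size_kb
print(f"Compression ratio: {ratio:.0f}:1")  # Compression ratio: 10:1

# A higher ratio means a smaller file, but with lossy methods it also
# means more of the original data has been permanently discarded.
```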
Lossless compression results in no loss of data.
This is particularly important for files where the information contained in a compressed file must be exactly the same, such as text and executable files.
Text files need to use lossless compression because it is important that no letters or words are lost during the compression process.
There are many types of lossless compression.
Some are outlined in the table below.
The PNG compression technique generally works on image files, and it supports transparent pixels and maintains original quality.
The FLAC technique is used on audio files.
It's open source and supports high-resolution audio.
The ZIP technique can work on mixed files.
It compresses multiple file types into a single folder.
Run length encoding works on text and images.
It has a high compression ratio for data with many repeated identical values.
Finally, Huffman coding is mainly used for compressing text files.
And Huffman coding makes common letters shorter to save space.
Generally, lossless compression algorithms are less efficient at reducing file size and achieve smaller compression ratios than lossy.
The main reason for choosing a lossless compression method would be to preserve all the information contained in the original file.
The trade-off of achieving no loss of original information is that the compressed file size will be larger than if a lossy method had been used.
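As a quick illustration of this lossless guarantee, here is a minimal sketch using Python's built-in zlib module, a DEFLATE-based lossless algorithm from the same family used by ZIP. The decompressed data comes back byte-for-byte identical to the original.

```python
# A minimal lossless compression round trip using Python's built-in
# zlib module (DEFLATE, the same family of algorithm used by ZIP).

import zlib

original = b"ASSESS ASSESS ASSESS ASSESS ASSESS ASSESS"

compressed = zlib.compress(original)
restored = zlib.decompress(compressed)

print(len(original), "bytes before,", len(compressed), "bytes after")

# Lossless means the restored data is byte-for-byte identical.
assert restored == original
```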
Note that care should be taken when choosing the type of compression method and ratio.
It can be tempting to think that the best compression method will be the one with the highest compression ratio, the one that would make the smallest file.
Sometimes, important details are lost using high-compression lossy methods, so the best method always depends on the file type, quality needs, and purpose.
Let's do a quick check.
Which of the following is a lossy file format? You've been given three options.
Why don't you pause the video here and have a quick think.
The correct answer is JPEG.
Well done.
Right, another check.
The main reason for choosing lossless compression is to preserve all the information contained in the original file.
Is this statement true, or is it false? Why don't you pause the video here and try and work this out.
The correct answer is true.
The main reason for choosing lossless compression is to preserve all the information contained in the original file.
Well done.
Another true or false? Lossless compression can achieve higher compression ratios than lossy methods.
What do you think? Is the statement true, or is it false? Pause the video again and try and work this out.
I'm sure you know the answer to this one.
The correct answer is false.
Why? Because lossy compression can achieve much higher compression ratios than lossless methods.
Well done.
Right, this brings us to the end of this section.
Before we move on, why don't you do this task just to make sure that you have understood what we covered in this section.
Task A is made up of two parts.
The first part asks you to describe what a compression algorithm is and why they are used.
Pause the video here and try and work this one out.
Compression algorithms are used to reduce the size of a file.
They reduce the number of binary digits needed to store the file and therefore reduce the space needed to store the file.
A compressed file also needs less bandwidth when transferring on a network, which can reduce upload and download times.
Well done.
Let's look at the next part.
The second part asks you to describe two main types of compression algorithms and give an example for each.
Pause the video again and try and work this one out.
I'm sure that you'll be able to.
A possible answer looks like this: There are two main types of compression, lossy and lossless.
In lossy compression, file size is reduced by permanently removing some data that may be less noticeable, especially in media files like images and audio.
Examples of lossy compression could be JPEG and MP3.
In lossless compression, file size is reduced by encoding the information in a more efficient way without losing any of the original data.
Lossless compression is a good choice whenever file size needs to be reduced without losing quality.
PNG and ZIP are types of lossless compression.
Well done on completing this task.
You're doing brilliantly.
Let's move on.
This section talks about comparing the suitability of compression algorithms. Sofia says, "There are so many different types of compression to choose from.
How do I know which one is the most suitable for certain situations?" It can be hard to choose a suitable compression algorithm, but by thinking about a number of factors, it can be made easier.
A simple algorithm can help to simplify the process.
First, identify the type of file.
Are you trying to compress an image file, a video file, audio, text, et cetera? Then decide whether file size reduction (lossy) or maintaining the exact original data (lossless) is the priority.
Then choose the compression method best suited to your needs from the lossy or lossless categories.
Sofia says, "I need to compress an essay I have written for my English homework.
It's a text file, so I need to use a lossless method as I don't want to lose any words or meaning, but I'm still not sure which method to use." Compression algorithms can be compared to help choose the most suitable method for text.
Let's have a look how.
Many types of compression algorithms can be used to compress text.
Two are run length encoding and Huffman coding.
Both run length encoding and Huffman coding can be used to compress the word ASSESS.
To use run length encoding to compress the word ASSESS, you need to follow each run of characters, count the number of consecutive occurrences of the same character, and log this information in the encoding.
Let's see how we can do this.
So in the word ASSESS, there is only one A character in the first run.
So this is logged in the encoding as 1A.
So if you look at the image below, in the run, the first character is A, and there is only one instance of it.
So in the encoding, you write it as 1A.
The next character is S, and there is a run of two characters, so this is logged as 2S.
So S appears twice, consecutively, so we record that as 2S.
The next character is E.
There is only one in the run, so this is logged as 1E in the encoding.
Finally, the character S has a run of two, and this is logged in the encoding as 2S.
The original word ASSESS had a total of six characters.
If each character were 8 bits, the total number of bits needed to store the word would be 8 times 6, which is 48 bits.
When run length encoding is used to compress the word ASSESS, a total of eight characters are needed.
If you look in the table below, 1A, 2S, 1E, and 2S make up eight characters, so we will need eight characters to store the encoded word.
If each character were 8 bits, the total number of bits needed to store the encoded word would be 8 times 8, which is 64 bits.
Now, clearly in this case, using run length encoding to compress the word ASSESS has actually increased the size of the original file.
So if you look at the tables below, the original word ASSESS had six characters; at 8 bits per character, that is 48 bits of storage needed.
The encoded word has eight characters; again, at 8 bits per character, that is 64 bits of storage needed.
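If you want to experiment with this yourself, here is a minimal run length encoding sketch in Python that mirrors the ASSESS walkthrough. The function name is illustrative rather than any standard library API.

```python
# A minimal run length encoding sketch: each run of identical
# characters is logged as <count><character>, as in the walkthrough.

def rle_encode(text: str) -> str:
    if not text:
        return ""
    encoded = []
    run_char, run_length = text[0], 1
    for char in text[1:]:
        if char == run_char:
            run_length += 1          # extend the current run
        else:
            encoded.append(f"{run_length}{run_char}")  # log the run
            run_char, run_length = char, 1
    encoded.append(f"{run_length}{run_char}")          # log the final run
    return "".join(encoded)

word = "ASSESS"
encoded = rle_encode(word)
print(encoded)                         # 1A2S1E2S
print(len(word) * 8, "bits before")    # 48 bits
print(len(encoded) * 8, "bits after")  # 64 bits: the "compressed" file grew

# With long repeated runs, run length encoding pays off instead:
print(rle_encode("WWWWWWWWWWBB"))      # 10W2B
```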
Run length encoding can be a useful compression technique.
However, it does come with some advantages and disadvantages, as outlined in the table below.
So the advantages include the fact that it is simple to implement and is good for certain types of data.
The disadvantages are that run length encoding is inefficient if data has few repeats.
And also, it can increase the file size.
Before we move on, let's do a quick check.
Run length encoding can achieve high compression ratios with data that has few repeats.
What do you think? Is the statement true, or is it false? Pause the video here and try and work this one out.
The answer is false.
Why? Run length encoding is inefficient if data has few repeats, and it can even increase the file size.
Well done.
Huffman coding is a method of lossless data compression.
It reduces file size by assigning shorter binary codes to more frequent data patterns and longer codes to less frequent patterns.
To do this, it builds a special binary tree called the Huffman tree based on the frequencies of the data patterns.
To use Huffman coding to compress the word ASSESS, you need to create a frequency table and then create a Huffman tree.
A frequency table shows the number of times each character appears in the word.
So for the original word ASSESS, the frequency table would look like this.
A and E both occur once, and S occurs four times.
The letters can then be sorted into order to see which letters occur most frequently and which occur least frequently.
So for the word ASSESS, if we rearrange the frequency table, it will look like this.
So S with its most occurrences comes first, then comes A, and then comes E.
A frequency table can be used to create a Huffman tree for the word ASSESS.
We start by placing the two least common characters, along with their frequencies, into a binary tree, as you can see in the image here.
So A and E are the two least common characters.
So you have a parent node with the child nodes being A and E, and each has its frequency written next to it.
Next, add a new parent node and a new child node, and place the last character along with its frequency in the new child node.
So we have added a new parent node and a new child node; the new child node is S, with its frequency of 4.
The new parent node also connects to the old parent node, so the old parent node becomes its other child.
The cumulative frequency of each node is then placed in the parent node.
So, for the bottom two, the least frequently occurring characters, A occurs once, E occurs once.
You add the two together, you put 2 in the parent node.
For the next node up, you add 4, which is the frequency of S, to the 2 to get 6.
A binary number is then assigned to each character, starting from the root node.
Moving left uses a 0, and moving right uses a 1.
For the character S, you just move left once, so the bit pattern for S is 0.
For the character A, you move right once and left once, so the bit pattern for character A is 10.
For the character E, you move right twice, so the bit pattern for E is 11.
The bit patterns from the Huffman tree can now be used to create the compressed file.
So as you can see here, the word ASSESS has been written out with the bit pattern for each character.
A, the pattern is 10.
S, the pattern is 0.
S again, the pattern is 0.
E is 11.
S is 0.
And S is 0 again.
In this case, using Huffman coding to compress the word ASSESS has reduced the size of the original file.
The original storage needed for the word was 48 bits, because it had six characters at 8 bits each.
For the compressed version, if you count up the number of bits at the bottom, you can see that we need just 8 bits of storage.
So this has compressed the file size.
Note that the actual amount of storage needed would be higher than 8 bits.
This is because the encoding data contained in the Huffman tree must also be stored to enable the data to be decoded.
So for the word ASSESS, the total storage needed is 8 bits plus the overhead of the Huffman tree.
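For anyone curious to try this, here is a minimal Huffman coding sketch in Python, built with a priority queue from the standard heapq module. Ties in the tree can be broken either way, so the bit patterns it prints may be mirrored compared with the walkthrough above, but the encoded word still comes to 8 bits. A real implementation would also store the tree (or the code table) alongside the output so the data can be decoded.

```python
# A minimal Huffman coding sketch. A leaf is a character; an internal
# node is a (left, right) pair. The integer tie-breaker keeps heap
# comparisons well defined when frequencies are equal.

import heapq
from collections import Counter

def huffman_codes(text: str) -> dict:
    heap = [(freq, i, char)
            for i, (char, freq) in enumerate(Counter(text).items())]
    heapq.heapify(heap)
    next_id = len(heap)
    while len(heap) > 1:
        f1, _, left = heapq.heappop(heap)    # two least frequent nodes...
        f2, _, right = heapq.heappop(heap)
        heapq.heappush(heap, (f1 + f2, next_id, (left, right)))  # ...merged
        next_id += 1

    codes = {}
    def walk(node, path):
        if isinstance(node, str):
            codes[node] = path or "0"        # leaf: record its bit pattern
        else:
            walk(node[0], path + "0")        # left branch adds a 0
            walk(node[1], path + "1")        # right branch adds a 1
    walk(heap[0][2], "")
    return codes

codes = huffman_codes("ASSESS")
print(codes)   # one valid assignment, e.g. {'A': '00', 'E': '01', 'S': '1'}
encoded = "".join(codes[c] for c in "ASSESS")
print(encoded, f"({len(encoded)} bits, down from 48, plus the tree)")
```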
Right, let's do a quick check.
What is needed to decode Huffman encoded data? You've been given three options.
Pause the video here and try and work this out.
You're doing so well.
I'm sure you'll be able to.
The correct answer is Huffman tree.
A Huffman tree is needed to decode the Huffman encoded data.
Well done.
Huffman coding can be a useful compression technique.
However, it does come with some advantages and disadvantages, as outlined in the table below.
The advantages include that Huffman coding is a lossless type of compression and is good for certain types of data such as text.
The disadvantages are that it adds extra overhead for the tree and is less effective for smaller files.
A more detailed table can help compare the suitability of the Huffman coding and run length encoding compression algorithms. So if you look at this table here, in terms of compression efficiency, Huffman coding is high, especially with varied character frequencies.
For run length encoding, it's low to high.
It's only efficient with long repeated runs.
Huffman coding is best for text, general files with uneven character frequency, while run length encoding is best for images or data with long runs of repeated values.
In terms of complexity, Huffman coding is more complex.
There is tree building, there are variable length codes, et cetera.
Run length encoding, on the other hand, is very simple.
Just count the repeated values.
In terms of speed, Huffman coding is slower to encode and decode, while run length encoding is very fast.
Sofia says, "The table really helps me to compare the suitability of compression algorithms and choose the best method." Right, this brings us to the end of this section.
Just to make sure that you have understood what we've covered in this section, why don't you attempt this task? Task B is made up of two parts.
Jun says, "I've scanned a written letter, and I need to email it to someone.
It's now a simple bitmap image with only black and white colours.
I wonder whether run length encoding or Huffman coding would be most suitable for me to use to compress this file." So, the question is: can you compare the suitability of the run length encoding and Huffman coding compression algorithms for Jun's scenario and suggest the most suitable method? Why don't you pause the video here and try and work on this.
Right, now Jun's file is a simple black-and-white scanned image and not actual text.
The document will have large blocks of the same colour with lots of repeated patterns.
Both Huffman and run length encoding could be used to compress the file, but run length encoding would achieve a higher compression as it is well suited to large repeated runs of data.
Also, Huffman is a more complex method than run length encoding and takes more time to encode and decode, whereas run length encoding is faster to encode and decode.
So, I would suggest that run length encoding is more suitable for Jun to use to compress his scanned image in this scenario.
Aisha now says, "I've just completed a large written assignment for my history homework.
It's a large text document, and I need to compress it so I can submit it online.
I'm not sure whether run length encoding or Huffman coding would be most suitable to compress the file." So your second question in this task says, "Compare the suitability of run length encoding and Huffman compression algorithms for Aisha's scenario and suggest the best method." Pause the video again and try and work this out.
You're doing brilliantly, so I'm sure you'll be able to work the solution out.
Right, Aisha's text file will contain lots of characters that will occur frequently in her data but are unlikely to have long runs of the same character.
Huffman coding can achieve high compression with this kind of data, whereas run length encoding is only efficient when there are long repeated runs of data.
Run length encoding might even make the compressed size bigger than the original.
Although Huffman coding is slower to encode and decode compared to run length encoding, the trade-off is that Huffman will be able to make the compressed file much smaller.
So you could suggest that Huffman coding is more suitable for Aisha to use to compress her text document in this scenario.
Well done on attempting these two questions that were part of task B.
This brings us to the end of this lesson.
Let's do a quick summary.
Compression algorithms achieve the same goal of compressing data in different ways.
Compression methods can be compared based on file size reduction, compression and decompression speed, and compressed data quality.
The best compression method depends on the intended application, acceptable processing time, and desired quality.