I would say the correct answer is Wrong.
Two points to consider:
- Probability does not imply certainty, and
- Lossless compression is closely linked to two factors: noise, and the ability of the compression algorithm to encode a dictionary of patterns.
Let’s assume that the 4-character alphabet is composed of the following symbols:
- A (binary representation 00)
- B (binary representation 01)
- C (binary representation 10)
- D (binary representation 11)
Simulation 1: A file containing 1024[A]
The probability of a file containing a single distinct character repeated N times is low (each of the 1024 characters is an A with probability 1/2, so 1/2^1024 in this simulation), but not zero.
Its total size is 2048 bits (2 kilobits, or 256 bytes).
Assume a compression algorithm that uses the following rule as an escape code:
If the AD pair is found, check the next 2 characters.
- If the content is AD: The final content is just an AD pair.
- If the content is different from AD: repeat the first character after the escape code the number of times given by the decimal value of the following 12 bits.
The compressed representation of the Simulation 1 file would be as follows:
A D A B A A A A A
00 11 00 01 00 00 00 00 00
| | | |_______________|_ Hex value &H400, or 1024 decimal
| | |____________________ Character to be repeated (**A**)
|___|______________________ Escape code
Parsing this instruction generates the character A 1024 times: a lossless representation of the original file.
That is, it is possible to represent the file with only 18 bits, or about 0.9% of the original size.
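To make the rule concrete, here is a minimal decoder sketch in Python. It is my own illustration of the toy scheme described above; the function and names are hypothetical, not from any real library:

```python
# Minimal sketch of the toy decoder, assuming the 2-bit alphabet
# A=00, B=01, C=10, D=11 and the AD escape rule described above.
VAL = {"A": 0, "B": 1, "C": 2, "D": 3}

def decode(stream: str) -> str:
    out = []
    i = 0
    while i < len(stream):
        if stream[i:i + 2] == "AD":
            if stream[i + 2:i + 4] == "AD":
                out.append("AD")          # ADAD escapes a literal AD pair
                i += 4
            else:
                char = stream[i + 2]      # character to be repeated
                count = 0
                for sym in stream[i + 3:i + 9]:   # 6 symbols = 12 bits
                    count = (count << 2) | VAL[sym]
                out.append(char * count)
                i += 9
        else:
            out.append(stream[i])         # ordinary symbol, copied through
            i += 1
    return "".join(out)

# Simulation 1: the 9-symbol (18-bit) stream expands to 1024 copies of A.
assert decode("ADABAAAAA") == "A" * 1024
```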
But let’s assume that the statement, instead of:
[...]Also consider that the probability of occurrence of each
element is equal to 1/2, 1/4, 1/8 and 1/8[...]
had phrased this passage as follows:
[...]Also consider that the occurrence of each
element is equal to 1/2, 1/4, 1/8 and 1/8[...]
That is no longer a probability; it is a fact that all of the alphabet's characters appear in the file. On to the next simulation:
Simulation 2: A file containing, in order: 512[A] 256[B] 128[C] 128[D]
Let's use the same algorithm as in the previous example:
A D A A C A A A A
00 11 00 00 10 00 00 00 00 = 512 characters A
A D B A B A A A A
00 11 01 00 01 00 00 00 00 = 256 characters B
A D C A A C A A A
00 11 10 00 00 10 00 00 00 = 128 characters C
A D D A A C A A A
00 11 11 00 00 10 00 00 00 = 128 characters D
Final representation:
ADAACAAAAADBABAAAAADCAACAAAADDAACAAA
Total: 36 characters, or 72 bits. Compression ratio: about 3.5% of the original size.
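As a sanity check, here is a hedged sketch of the matching run-length encoder (again with hypothetical names). It reproduces the four instruction blocks and the 72-bit total above:

```python
# Sketch of the encoder for a single run under the same toy scheme.
SYM = "ABCD"  # index = 2-bit value of each symbol

def encode_run(char: str, count: int) -> str:
    assert 0 < count < 4096            # the count must fit in 12 bits
    digits = []
    for _ in range(6):                 # 6 base-4 digits = 12 bits
        digits.append(SYM[count & 3])
        count >>= 2
    return "AD" + char + "".join(reversed(digits))

# Simulation 2: four runs of 9 symbols (18 bits) each, 72 bits in total.
blocks = [encode_run("A", 512), encode_run("B", 256),
          encode_run("C", 128), encode_run("D", 128)]
compressed = "".join(blocks)
print(compressed)            # ADAACAAAAADBABAAAAADCAACAAAADDAACAAA
print(len(compressed) * 2)   # 72 (bits), versus 2048 bits originally
```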
Considerations
In both cases I used an algorithm with a trivial dictionary of length N=1 (repetition of a single character), and files with as little noise (pattern variation) as possible. For example, a file containing the ABCD sequence repeated 256 times would produce an output many times larger than the original file.
It is a terribly bad algorithm for any real application, but it serves as a proof of concept that, yes, given the characteristics of the problem and under ideal conditions, it is possible to compress the file to less than 80% of its original size.
-x-
Note the end of the problem statement:
[...]it will not be possible to losslessly compress this file at a rate of
80%.[...]
I am assuming the statement actually means 'with a compression rate below 80%'. If the rate must be exactly 80%, the algorithm can be modified to append garbage (padding) to the end of the content, for example, as sketched below.
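A minimal sketch of that padding idea, under one extra assumption the answer does not spell out: the decoder must learn the true payload length out of band (for example, from a header), otherwise the filler symbols would be decoded as data:

```python
# Hypothetical helper: pad the compressed stream with filler symbols so the
# file lands as close to a target ratio as the 2-bit symbol size allows.
def pad_to_ratio(payload: str, original_bits: int, ratio: float = 0.8) -> str:
    target_symbols = int(original_bits * ratio) // 2   # 2 bits per symbol
    filler = max(0, target_symbols - len(payload))
    return payload + "A" * filler                      # filler, not data

padded = pad_to_ratio("ADABAAAAA", 2048)
print(len(padded) * 2)   # 1638 bits, roughly 80% of the 2048-bit original
```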
Thanks for the answer; it's a little complex for me, but I'll read it carefully and pay attention to the details.
– Metalus
Feel free to ask any questions, @Metalus!
– OnoSendai
The question seems to imply the Huffman compression algorithm.
– epx
I tend to agree, @epx, and I believe that was the original goal. However, the old axiom that 'the statement is law' forced me to abandon assumptions; if something is important to the question, it must be mentioned in the statement.
– OnoSendai