Data compression. Question from the 2012 Criminal Forensics Expert exam, CESPE/UnB


Hello, I have searched in several places for an answer to this question, but so far I have not found one. The statement is to be judged as Right or Wrong.

  • Question

Consider a file composed of a large number of independent characters belonging to an alphabet with four distinct elements. Also consider that the probabilities of occurrence of the elements are 1/2, 1/4, 1/8 and 1/8, and that each character is mapped to 2 bits. In this case, with the compression rate defined as the ratio between the size of the compressed file and the size of the original file, it will not be possible to losslessly compress this file with a compression rate of 80%.

Answer key: Right

2 answers



I would say the correct answer is Wrong.

Two points to consider:

  • Probability is not certainty, and
  • Lossless compression is a process closely linked to two factors: noise and the ability of the compression algorithm to encode a dictionary of patterns.

Let’s assume that the 4-character alphabet is composed of the following symbols:

  1. A (binary representation 00)
  2. B (binary representation 01)
  3. C (binary representation 10)
  4. D (binary representation 11)

Simulation 1: a file containing 1024 [A] characters

The probability of a file containing a single distinct character repeated N times is low (1/2^1024 in this simulation), but not zero. Its total size is 2048 bits (2 Kb).

Assume a compression algorithm that uses the following rule as an escape code:

If the pair AD is found, check the next 2 characters.

  • If they are AD: the decoded content is just a literal AD pair.
  • If they are different from AD: repeat the first character after the escape code the number of times given by the decimal value of the following 12 bits.

The compressed representation of the simulation 1 file would be as follows:

A  D  A  B  A  A  A  A  A
00 11 00 01 00 00 00 00 00
|   | |  |_______________|_ Hex value &H400, or 1024 decimal
|   | |____________________ Character to be repeated (**A**)
|___|______________________ Escape code

Result of parsing this instruction: the character A is generated 1024 times, a lossless representation of the original file.

That is, it is possible to represent the file with only 18 bits, or about 0.9% of the original size.
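
To make this rule concrete, here is a minimal Python sketch of the scheme above, assuming the 2-bit symbol table and the AD escape from the answer; the function names and the treatment of the whole file as a single run are my own illustration:

# A toy implementation of the escape-code run-length scheme described above.
SYMBOLS = "ABCD"                       # A=00, B=01, C=10, D=11
BITS = {c: format(i, "02b") for i, c in enumerate(SYMBOLS)}

def encode_run(char, count):
    """Encode `count` repetitions of `char` as: escape 'AD' + char + 12-bit count."""
    assert 0 < count < 4096, "12 bits can only express counts up to 4095"
    count_bits = format(count, "012b")                         # 12-bit count
    count_syms = "".join(SYMBOLS[int(count_bits[i:i + 2], 2)]  # 2 bits -> 1 symbol
                         for i in range(0, 12, 2))
    return "AD" + char + count_syms                            # 9 symbols = 18 bits

def decode(stream):
    """Expand every 9-symbol escape block back into its run of characters."""
    out, i = [], 0
    while i < len(stream):
        assert stream[i:i + 2] == "AD", "this sketch only handles escape blocks"
        char, count_syms = stream[i + 2], stream[i + 3:i + 9]
        count = int("".join(BITS[c] for c in count_syms), 2)
        out.append(char * count)
        i += 9
    return "".join(out)

# Simulation 1: 1024 'A' characters (2048 bits) compress to a single 18-bit block.
compressed = encode_run("A", 1024)
print(compressed, len(compressed) * 2, "bits")   # ADABAAAAA 18 bits
assert decode(compressed) == "A" * 1024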

But let’s assume that the statement, instead of:

[...]Also consider that the probability of occurrence of each element is equal to 1/2, 1/4, 1/8 and 1/8[...]

had phrased this passage as follows:

[...]Also consider that the occurrence of each element is equal to 1/2, 1/4, 1/8 and 1/8[...]

Then it would no longer be a probability; it would be a fact that all of the alphabet's characters appear in the file. Let's move on to the next simulation:

Simulation 2: a file containing, in order: 512[A] 256[B] 128[C] 128[D]

Let’s use the same algorithm as the previous example:

A  D  A  A  C  A  A  A  A
00 11 00 00 10 00 00 00 00   = 512 A characters
A  D  B  A  B  A  A  A  A
00 11 01 00 01 00 00 00 00   = 256 B characters
A  D  C  A  A  C  A  A  A
00 11 10 00 00 10 00 00 00   = 128 C characters
A  D  D  A  A  C  A  A  A
00 11 11 00 00 10 00 00 00   = 128 D characters

Final representation:

ADAACAAAAADBABAAAAADCAACAAAADDAACAAA

Total: 36 characters, or 72 bits. Compression rate: approximately 3.5%.
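
As a sanity check on those figures, the short Python sketch below rebuilds the simulation 2 stream from its four runs and recomputes the rate; the run list mirrors the answer, and the helper simply repeats the encoding logic of the earlier sketch so this snippet stands on its own:

SYMBOLS = "ABCD"

def encode_run(char, count):
    # escape 'AD' + character to repeat + 12-bit count expressed as 6 symbols
    bits = format(count, "012b")
    return "AD" + char + "".join(SYMBOLS[int(bits[i:i + 2], 2)] for i in range(0, 12, 2))

runs = [("A", 512), ("B", 256), ("C", 128), ("D", 128)]
stream = "".join(encode_run(c, n) for c, n in runs)
original_bits = sum(n for _, n in runs) * 2      # 1024 characters * 2 bits = 2048
compressed_bits = len(stream) * 2                # 36 symbols * 2 bits = 72

print(stream)                                    # ADAACAAAAADBABAAAAADCAACAAAADDAACAAA
print(f"{compressed_bits} / {original_bits} = {compressed_bits / original_bits:.1%}")  # about 3.5%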

Considerations

In both cases I used an algorithm with a pattern dictionary of length N=1 (repetition of a single character), and files with as little noise (pattern variation) as possible. For example, a file containing the sequence ABCD repeated 256 times would generate an output several times larger than the original file.

It is a terribly bad algorithm for any real application, but it serves as a proof of concept that, yes, given the characteristics of the problem and under ideal conditions, it is possible to compress the file at rates below 80%.

-x-

Pay attention to the end of the statement:

[...]it will not be possible to losslessly compress this file with a compression rate of 80%.[...]

I am assuming that the statement actually means 'with a compression rate below 80%'. If the rate had to be exactly 80%, the algorithm could be modified to generate garbage (padding) at the end of the content, for example.

  • Thanks for the answer; it's a little complex for me, but I'll read it carefully and pay attention to the details.

  • Feel free to ask any questions, @Metalus!

  • The question seems to imply the Huffman compression algorithm.

  • I tend to agree, @Exp, and I believe that was the original intent. However, the old axiom that 'the statement is law' forced me to drop that assumption; if it matters to the question, it must be mentioned in the statement.


The problem with the statement is that it does not specify whether it deals with the best case, the average case or the worst case. If we use common sense, however, we can see that the question only makes sense when we consider the average case:

Worst case

No lossless compression algorithm performs well in the worst case. The best it can achieve is zero compression (i.e. every file keeps its original size). By the pigeonhole principle, if all 4^n files of size n are possible, and there is no "waste" in the original format (4 symbols, 2 bits per symbol, the minimum representation), then for every file that shrinks another file must necessarily grow. Only if the algorithm shrinks nothing (keeping everything as it is, or merely permuting files) will the worst-case performance not be negative; it will be zero.
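
A small Python check of that counting argument (the loop bounds and variable names are just for illustration): with no slack in the 2-bit representation, there are fewer strictly shorter bit strings than there are possible files, so no injective (lossless) encoder can shrink them all.

for n in range(1, 6):
    files = 4 ** n                                            # all files of n symbols (2n bits each)
    shorter_strings = sum(2 ** k for k in range(1, 2 * n))    # non-empty bit strings shorter than 2n bits
    print(f"n={n}: {files} files, only {shorter_strings} shorter bit strings available")
    assert shorter_strings < files   # so some file cannot shrink without a collision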

Best case

Consider the following "encoding":

  • Take all possible files and sort them according to some criterion (for example, first sort them lexicographically, then apply the n-th permutation).
  • Encode each file as the natural number that gives its index in the resulting list. Discard leading zeros.

In this encoding, one file will be represented by just 0, another by 10, another by 11, another by 100, and so on. If we choose n (the permutation number) so that our particular file is first on the list, that file is compressed almost completely, down to a single bit! (But it can be shown that, on average, this encoding is terrible.)
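
A toy version of this trick in Python, for a tiny alphabet and file length; putting our chosen file first in the ordering stands in for picking the "n-th permutation", and all names here are illustrative:

from itertools import product

length = 4
favourite = "ABCD"
all_files = ["".join(p) for p in product("ABCD", repeat=length)]
# Any ordering is allowed, so choose one whose first entry is our favourite file.
ordering = [favourite] + sorted(f for f in all_files if f != favourite)

def encode(f):
    return format(ordering.index(f), "b")     # the file's rank in the list, written in binary

print(encode(favourite))      # '0' -> 1 bit instead of 8: near-total compression
print(encode(ordering[-1]))   # '11111111' -> other files still need the full 8 bits
# The catch is that the chosen ordering is itself information the decoder must
# know, which is why this "best case" says nothing about real efficiency.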

Not only is this an absurd premise (a kludge, so to speak), but it is also useless for determining the efficiency of the compression process.

Average case

I don't have a reference to confirm this, but if I remember my college classes correctly, when the only information available is the relative frequency of each symbol, Huffman coding gives the best results (i.e. no alternative encoding will do better in the average case). Let us then apply it, representing each character by a code whose length is inversely related to its frequency in the set being compressed:

Character   Frequency    Binary repr.
---------------------------------------
A           0.5          0
B           0.25         10
C           0.125        110
D           0.125        111

The same file - which previously used 2 bits per character - will then, represented in this way, require on average:

0.5 * 1 + 0.25 * 2 + 0.125 * 3 + 0.125 * 3 = 1.75 bits per character

Since 1.75 is 87.5% of 2, this will be the average case whenever the set of files to be compressed obeys this distribution. Note that not every file encoded in this way will get smaller: those in which the last symbols occur more often than their stated probabilities will grow in size. However, by the statement, such files occur less frequently in the set than the others.*

* Without these probabilities, the average-case performance also becomes zero, since under a uniform distribution the number of documents that grow, weighted by their added size, equals the number of documents that shrink, weighted by their saved size (again by the pigeonhole principle).
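
As a quick empirical check of the 1.75 bits per character / 87.5% figure, the Python sketch below draws a random file with the probabilities from the statement and encodes it with the prefix code from the table above; the sampling setup and names are my own:

import random

CODE = {"A": "0", "B": "10", "C": "110", "D": "111"}
PROBS = {"A": 0.5, "B": 0.25, "C": 0.125, "D": 0.125}

random.seed(0)
n = 100_000
data = random.choices(list(PROBS), weights=list(PROBS.values()), k=n)

encoded_bits = sum(len(CODE[c]) for c in data)
original_bits = 2 * n                                   # 2 bits per character
print(f"{encoded_bits / n:.3f} bits per character")     # close to 1.75
print(f"rate: {encoded_bits / original_bits:.1%}")      # close to 87.5%, i.e. above 80%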
