
Best Text/number compression method?

  • 09-10-2004 9:14pm
    #1
    Registered Users Posts: 3,969 ✭✭✭mp3guy


    What's the best compression algorithm for numbers or hex numbers? I've tried rar, zip, gzip and all those. Is there a special algorithm for numbers, or even a more efficient way of storing numeric values (up to 16777216)?


Comments

  • Moderators, Recreation & Hobbies Moderators, Science, Health & Environment Moderators, Technology & Internet Moderators Posts: 91,687 Mod ✭✭✭✭Capt'n Midnight


    I assume that by "best" you mean smaller rather than faster etc...

    www.7-zip.org - better than ZIP for some formats, and open source so you can see how they do it.

    It depends on the numbers/raw data. eg: if you try to compress the digits of pi stored as hex text you won't ever get better than 50%, because each hex character carries only 4 bits of information in an 8-bit byte and the digits themselves are effectively random.

    So the compression depends on the lack of randomness in the original data.
    Is there any chance of a pattern in the numbers?
    eg: you could represent each number by its +/- difference from the previous one if the numbers change slowly (see the sketch below)
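
    A minimal Python sketch of that difference idea (delta encoding) - the function names are my own, just to show the shape of it:

    # Delta encoding: store the first value as-is, then only the change
    # from each value to the next. Slowly-changing data turns into small,
    # repetitive numbers that a general compressor handles much better.
    def delta_encode(values):
        deltas = [values[0]]
        for prev, cur in zip(values, values[1:]):
            deltas.append(cur - prev)
        return deltas

    def delta_decode(deltas):
        values = [deltas[0]]
        for d in deltas[1:]:
            values.append(values[-1] + d)
        return values

    # e.g. delta_encode([1000000, 1000003, 1000001]) -> [1000000, 3, -2]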

    If the numbers represent real-world data you could look at lossy compression à la mp3 / jpeg etc. They reprogrammed Voyager so that it was able to store more data from the outer planets than when it was launched.

    I also assume that you are referring only to positive integers, so the raw data would be three bytes per number.

    you could do a statistical analysis of the numbers to see if some are more common, eg: the 65,535 most common numbers could be stored in two bytes (as long as you realise that less common numbers could then take more than three bytes) - a sketch of this follows
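
    A rough Python sketch of that statistical idea - the escape marker is my own addition, so the less common numbers can still be stored (as 2 + 3 bytes, i.e. more than three):

    from collections import Counter

    ESCAPE = 0xFFFF  # reserved 2-byte index meaning "raw 3-byte value follows"

    def build_table(values):
        # map the 65,535 most common values to 2-byte indexes 0..65534
        common = Counter(values).most_common(ESCAPE)
        return {v: i for i, (v, _) in enumerate(common)}

    def encode(values, table):
        out = bytearray()
        for v in values:
            if v in table:
                out += table[v].to_bytes(2, "big")   # common number: 2 bytes
            else:
                out += ESCAPE.to_bytes(2, "big")     # rare number: 2 + 3 bytes
                out += v.to_bytes(3, "big")
        return bytes(out)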

    Without knowing the data set it's hard to suggest more.


  • Registered Users Posts: 604 ✭✭✭Kai


    very insightful.


  • Registered Users Posts: 3,969 ✭✭✭mp3guy


    Would compression work better if the numbers were in sequence or random? e.g. would 1,6,3,5,2,4 compress differently if it was changed to 1,2,3,4,5,6?


  • Closed Accounts Posts: 79 ✭✭zt


    Are you planning on writing the compression algorithm yourself?

    If so, traditional zip algorithms are probably not the best approach, because they assume text and commonly grouped characters.

    Here is an option if you are coding:

    Assume that you have 500,000 entries; this would normally require 1.5 million bytes (3 bytes each).

    Split the values into 256 ranges depending on the first byte. Store the remaining two bytes for each item. You might find that some ranges have no values.

    Size is reduced. You will need some delimiters between the arrays, but this is a single byte each. All other values are reduced to two bytes. A sketch of this layout is below.
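
    A Python sketch of that bucket layout, assuming the values are integers up to 16777215 - I've used a 4-byte count per bucket rather than a delimiter, since a delimiter byte could collide with real data in a binary stream:

    def bucket_encode(values):
        # group 3-byte values by their first (high) byte
        buckets = [[] for _ in range(256)]
        for v in values:
            buckets[v >> 16].append(v & 0xFFFF)  # keep only the low two bytes
        out = bytearray()
        for b in buckets:
            out += len(b).to_bytes(4, "big")     # bucket header: item count
            for low in b:
                out += low.to_bytes(2, "big")
        return bytes(out)

    # 500,000 values: 256*4 + 500,000*2 = 1,001,024 bytes vs 1,500,000 raw.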

    I'm assuming, since you suggested a sort, that the order is not critical?

    I would agree with Capt'n Midnight that knowing the data is key to the solution. Is speed or compression the key requirement? What is the level of compression that you are getting from zip?


  • Closed Accounts Posts: 79 ✭✭zt


    This is interesting. If you look at the solution proposed by CM above, and you have a sorted list and some type of relationship exists in the data, then you could potentially compress twice.

    First get the offsets for the numbers (I'll use 10, 22, 33, 44, 55, 80, 90, 95, 100).

    Keeping the first value as-is, the offsets are 10, 12, 11, 11, 11, 25, 10, 5, 5

    You could then double encode these using, say, 4-bit operations (using the spare bits of the two bytes required for the offset).

    say
    code (00) means apply the offset once
    code (01) means twice
    code (02) means three times, etc.

    You could then encode the run of 11, 11, 11 as operation 02-11. Two bytes gets you what previously took 9 bytes. (A sketch is below.)
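
    A Python sketch of the double encode - I've kept the repeat count and offset as a plain pair rather than bit-packing them, just to show the logic:

    def double_encode(values):
        # first pass: offsets, keeping the first value as-is
        deltas = [values[0]]
        for prev, cur in zip(values, values[1:]):
            deltas.append(cur - prev)
        # second pass: collapse runs of identical offsets into
        # (repeat_count, offset) pairs, like the 02-11 operation above
        runs = []
        for d in deltas:
            if runs and runs[-1][1] == d:
                runs[-1] = (runs[-1][0] + 1, d)
            else:
                runs.append((1, d))
        return runs

    # double_encode([10, 22, 33, 44, 55, 80, 90, 95, 100])
    # -> [(1, 10), (1, 12), (3, 11), (1, 25), (1, 10), (2, 5)]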

    Show us the data and let's see if we can compress it into a single byte :D
    Nothing on TV tonight ....


  • Moderators, Recreation & Hobbies Moderators, Science, Health & Environment Moderators, Technology & Internet Moderators Posts: 91,687 Mod ✭✭✭✭Capt'n Midnight


    www.7-zip.org - download it, use it, and look up its help, especially the settings for dictionary size, which goes up to 192MB (RAM needs to be roughly 10 times the dictionary size). If you have enough RAM, being able to do the whole data set in one go should give you the best compression. Depending on how long you expect repeated sequences to be, you can also alter the word size - in this case that would be where you expect repeated patterns of numbers.

    Numbers in sequence compress better because patterns repeat in the end digits and of course the first few digits will be the same. eg: 10001, 10002, 10007 ... would compress better than 10001, 33, 478 etc. (a quick test of this is sketched below)
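
    A quick Python test of that, using zlib (the same deflate family as zip) - the exact sizes will vary run to run, but the sorted stream should come out noticeably smaller:

    import random, zlib

    nums = random.sample(range(16777216), 100000)
    shuffled = "\n".join(map(str, nums)).encode()
    ordered = "\n".join(map(str, sorted(nums))).encode()
    print(len(zlib.compress(shuffled)), len(zlib.compress(ordered)))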

    It also depends on how you store the numbers (but less so if the compression program is REALLY good), eg: hex or binary vs. decimal, and on whether the numbers are integers or reals. If they are real numbers you could reduce the precision so there are fewer unique values (see the sketch below), or look at lossy compression that might fuzz the last digit (no, I don't know of any off hand) to get better compression without affecting the data much - note that any lossy compression will affect the data.
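
    A tiny Python sketch of that precision reduction for reals - rounding to a fixed step means fewer unique values, at the cost of being lossy:

    def quantize(xs, step=0.01):
        # snap each value to the nearest multiple of step
        return [round(x / step) * step for x in xs]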

    If it was a single data set compressed once for reference, then you could try several lossy compression techniques to see which one distorted the data least, and then do up a correction table for it - I have no idea if this would work, it depends on the data. But it would take a lot of time, since you are looking for the best match; it's not something that could be used on an ongoing basis, more for a once-off reference data set where you can reject the tables that give too many errors.
    eg: if the numbers were for a milling machine, then any errors in the data that caused the machine to drill into thin air might be acceptable, but any that caused it to drill into the metal in the wrong place might not be.

    Again, without knowing the data set / how it was generated / what it is needed for, it's hard to offer anything other than vague suggestions.

    Try 7-zip first. Then resequence the data / change the number format and try again / then try offset data. Also LOOK at the data - graph it if possible; on average humans are better at overall pattern recognition than machines. If you find a pattern, model it and then see if storing the deltas from the model would be better.

    Final note: before you encrypt data, compress it. This should hide it a little better from statistical decryption techniques. Also, you can't compress something after it has been encrypted - if you can, then the encryption method is not randomising the data enough.


  • Moderators, Recreation & Hobbies Moderators, Science, Health & Environment Moderators, Technology & Internet Moderators Posts: 91,687 Mod ✭✭✭✭Capt'n Midnight


    http://en.wikipedia.org/wiki/Data_compression

    http://en.wikipedia.org/wiki/Burrows-Wheeler_transform
    I like this - it's a way of reordering data so that it is likely to be more compressible (a small sketch is at the end of this post).

    http://en.wikipedia.org/wiki/Four-stage_model_of_data_compression


    http://www.vectorsite.net/ttdcmp1.html#m6
    [1.1] INTRODUCTION TO DATA COMPRESSION
    [1.2] RUN LENGTH ENCODING
    [1.3] HUFFMAN CODING
    [1.4] ARITHMETIC CODING
    [1.5] LZ-77 ENCODING
    [1.6] LZW CODING

    Note: Arithmetic Coding is patented, so you can't use it for others or commercially.
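
    And here is the promised Burrows-Wheeler sketch in Python - the naive O(n^2 log n) version, just to show the idea (real implementations use suffix arrays):

    def bwt(s, eof="\0"):
        # append a unique end marker so the transform is reversible,
        # then sort all rotations and keep the last column - it groups
        # similar characters together, which a later stage compresses well
        s += eof
        rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
        return "".join(row[-1] for row in rotations)

    # bwt("banana") -> "annb\x00aa"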

