3694813 is referenced by 52 patents and cites 6 patents.

The present invention relates to a method practicable on a general purpose electronic computer for statistically analyzing a data set and for producing a set of encoding and decoding (E/D) tables for achieving compaction of the original data set utilizing a variable-length code. The method disclosed may operate under constraints of available core, desired compaction rate and speed of compaction/decompaction to produce differing sets of encoding/decoding tables depending upon the constraints imposed. The method would most normally be provided and utilized as a software package wherein the primary inputs are the data set itself and the above enumerated constraints. By utilizing a variable-length code wherein the code assignment is dependent upon the characteristic of preceding data, good compaction rates may be achieved utilizing reasonable amounts of memory for the E/D tables.

The method comprises three principal steps. The first is the construction of a matrix showing the probability of occurrence of every member of the data set with respect to the immediately preceding member. The second step comprises grouping various rows or columns of this matrix having similar probabilities of occurrence. The third step comprises a reordering of all of the previously grouped rows or columns. Finally, a second clustering into coding sets may be performed.

CROSS-REFERENCE TO RELATED APPLICATIONS

This invention is related to an application entitled CODE PROCESSOR FOR VARIABLE-LENGTH DEPENDENT CODE, having the same inventors as the present application and filed concurrently herewith, which discloses a hardware embodiment utilizing the assignment and mapping tables of the present invention to produce Encoding/Decoding tables for effecting data compaction.

Application Ser. No. 119,275 entitled METHOD OF DECODING A VARIABLE-LENGTH PREFIX-FREE COMPACTION CODE, filed Feb. 26, 1971 of L.S. Loh, J.H. Mommens and J. Raviv, discloses a method for decoding compacted data wherein the code assignments may be provided by the present invention.

BACKGROUND OF THE INVENTION

It is characteristic of information handling systems that the cost of the storage devices used to hold the files strains the user's budget. As the files grow, and they always do, more physical storage devices are needed until, eventually, the limit is reached. Regardless of whether the limit is set by hardware constraints, budget, floor space, or customer attitude, some alternative method of coping with the storage problem is required.

There are known procedures for reducing the size of files. In general, they sacrifice time to save space. The simplest of these procedures is to eliminate unnecessary records. This is an extreme case of file migration.

A second class of procedures involves blocking records within a file to minimize unused storage space.

A third method of reducing file size is data compaction. Two levels of compaction are most significant. The first is character and symbol suppression and the second is character and symbol encoding.

Character suppression is a form of run-length encoding in which a string of identical characters (or multi-character symbols and words) is replaced by an identifier and a count.

After migration and blocking have been applied to a file, it is possible to achieve additional compaction, in some cases quite a lot, by substituting more efficient codes for those commonly used. In the S/360, which has eight-bit bytes, it is possible to use 256 different characters. Most applications use fewer characters in their alphabet for the simple reason that the sources of input and the devices for output only handle 64 or fewer characters. Similarly, programming languages have limited character sets (COBOL: FORTRAN and PL/1:60, being examples).

An alphanumeric file may contain only 64 different character codes out of the 256 available.
Also, when a file contains all the 256 possible characters in the eight-bit byte, they are not all used equally often; i.e., some are very frequent and others are very rare (as mentioned before, some may never be used). Therefore, an efficient coding scheme can achieve data compaction. This would be accomplished by encoding the common symbols with short codes and the rare symbols with longer codes such that the average code length for the file is reduced. Table 1 shows such a coding scheme for an oversimplified alphabet of only four symbols (A, B, C, D).

TABLE 1

If A is known to occur twice as often as B, and B occurs twice as often as C and D, a new code can take this into account.

Expected Length = (1/2 × 1) + (1/4 × 2) + (1/8 × 3) + (1/8 × 3) = 1.75 bits/character.

The code used in the above Table is a simple one known as the Huffman code and is only exemplary of such compaction codes. It has many desirable characteristics. The Huffman code has the minimum expected length (i.e., it is very efficient) and is constructed in a straightforward way. It is prefix-free; that is, the code for one character cannot be confused with the beginning of the code for another character. Decoding can be done by a single table look-up. However, storage requirements are very severe if the length of the longest code word is large. Every character in the original message can be reconstructed from the coded message. The code is content-independent in that it ignores what the files are about; it only depends on the frequency of occurrence of characters in the alphabet.

The size of the alphabet or character set is arbitrary in such a system. The method of deriving the Huffman code words for any list of symbols is based on the probability of their occurrence. The alphabet selected for an information storage and retrieval application might contain all 256 possible byte configurations plus common multi-character symbols such as 'and,' 'the,' 'Jan-Dec,' etc.
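
The expected-length arithmetic for the four-symbol alphabet can be checked with a short Python sketch. It assumes the probabilities implied by the text (A = 1/2, B = 1/4, C = D = 1/8) and builds a Huffman code with the standard merge-the-two-rarest procedure. The function names and the choice of which branch receives "0" or "1" are illustrative assumptions; the code words of Table 1 itself are not reproduced in this text, so only the code lengths and the 1.75 bits/character figure are compared.

```python
import heapq
from fractions import Fraction
from itertools import count


def huffman_code(probabilities):
    """Return {symbol: bit string} for a Huffman code over the given probabilities."""
    tiebreak = count()  # unique counter so heap entries never compare the dicts
    heap = [(p, next(tiebreak), {sym: ""}) for sym, p in probabilities.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        # Merge the two least probable groups, prefixing their code words.
        p0, _, codes0 = heapq.heappop(heap)
        p1, _, codes1 = heapq.heappop(heap)
        merged = {s: "0" + w for s, w in codes0.items()}
        merged.update({s: "1" + w for s, w in codes1.items()})
        heapq.heappush(heap, (p0 + p1, next(tiebreak), merged))
    return heap[0][2]


def expected_length(probabilities, code):
    """Average number of bits per character under the given code."""
    return sum(p * len(code[s]) for s, p in probabilities.items())


def is_prefix_free(code):
    """True if no code word is the beginning of another code word."""
    words = list(code.values())
    return not any(
        i != j and b.startswith(a)
        for i, a in enumerate(words)
        for j, b in enumerate(words)
    )


# Probabilities implied by the text: A twice as frequent as B, B twice as C and D.
TABLE_1_PROBS = {
    "A": Fraction(1, 2),
    "B": Fraction(1, 4),
    "C": Fraction(1, 8),
    "D": Fraction(1, 8),
}

code = huffman_code(TABLE_1_PROBS)
average = expected_length(TABLE_1_PROBS, code)
print(code, float(average))
```

With these probabilities the code lengths come out as 1, 2, 3 and 3 bits, which reproduces the 1.75 bits/character figure, against 2 bits/character for a fixed-length code over four symbols.
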
The user has flexibility in establishing the list of the symbols to be encoded. The Huffman code is not the only one possible. There are other efficient prefix-free codes.

In compaction codes such as the Huffman code, the coding of a particular character is based solely on the identity of the character.

SUMMARY & OBJECTS

It has been found that an improvement is achievable in data compaction methods by coding characters utilizing variable-length codes based not only on the frequency of occurrence of the particular character but also on the character which immediately precedes the character being coded. If this notion were applied straightforwardly, it would require a substantial amount of storage. Storage space is saved by grouping together various sets of characters having similar occurrence properties.

Accordingly, it is a primary object of the present invention to provide an improved method for achieving data compaction.

It is a further object of the invention to provide such a method utilizing variable-length compaction codes.

It is another object of the invention to provide such a data compaction method wherein the variable-length codes are prefix-free.

It is yet another object of the invention to provide such a data compaction method wherein the coding is done on a preceding character dependent basis.

It is still a further object of the invention to provide such a data compaction method wherein a character co-occurrence matrix is developed for a particular data base.

It is another object to provide such a method wherein dependence groups having similar statistical characteristics are joined together.

It is yet another object to provide such a method wherein further joining may be performed after reordering of the members of the groups. Then, further clustering is done into coding sets.

Other features, objects and advantages of the invention will be apparent from the following more particular description of the preferred embodiment of the invention as illustrated in the accompanying drawings.
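
The idea of coding each character according to its immediate predecessor can be illustrated with a minimal Python sketch. It is not the patented procedure: it builds the co-occurrence counts and then gives every preceding character its own Huffman table, which is the straightforward variant described above as requiring substantial storage, and it performs no grouping, reordering or clustering. The sample string, all function names, and the choice to code the first character with an ordinary frequency-based table are assumptions made for illustration.

```python
import heapq
from collections import Counter, defaultdict
from itertools import count


def huffman_code(freqs):
    """Return {symbol: bit string} for a Huffman code over a {symbol: count} table."""
    if len(freqs) == 1:  # a lone symbol still needs one bit to be decodable
        return {next(iter(freqs)): "0"}
    tiebreak = count()  # unique counter so heap entries never compare the dicts
    heap = [(n, next(tiebreak), {s: ""}) for s, n in freqs.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        n0, _, c0 = heapq.heappop(heap)
        n1, _, c1 = heapq.heappop(heap)
        merged = {s: "0" + w for s, w in c0.items()}
        merged.update({s: "1" + w for s, w in c1.items()})
        heapq.heappush(heap, (n0 + n1, next(tiebreak), merged))
    return heap[0][2]


def cooccurrence_matrix(data):
    """matrix[prev][cur] = number of times cur immediately follows prev."""
    matrix = defaultdict(Counter)
    for prev, cur in zip(data, data[1:]):
        matrix[prev][cur] += 1
    return matrix


def encode_dependent(data, first_code, tables):
    """Code the first character with first_code, each later one with its predecessor's table."""
    bits = [first_code[data[0]]]
    for prev, cur in zip(data, data[1:]):
        bits.append(tables[prev][cur])
    return "".join(bits)


def decode_dependent(bits, first_code, tables):
    """Invert encode_dependent by switching tables after every decoded character."""
    out, pos = [], 0
    while pos < len(bits):
        table = tables[out[-1]] if out else first_code
        inverse = {w: s for s, w in table.items()}
        end = pos + 1
        while bits[pos:end] not in inverse:
            if end >= len(bits):
                raise ValueError("bit string ends inside a code word")
            end += 1
        out.append(inverse[bits[pos:end]])
        pos = end
    return "".join(out)


SAMPLE = "ababababcdcdcdcd"  # hypothetical data with strong predecessor dependence

matrix = cooccurrence_matrix(SAMPLE)
tables = {prev: huffman_code(row) for prev, row in matrix.items()}
first_code = huffman_code(Counter(SAMPLE))  # ordinary, predecessor-independent code

dependent_bits = encode_dependent(SAMPLE, first_code, tables)
independent_bits = "".join(first_code[c] for c in SAMPLE)
print(len(dependent_bits), len(independent_bits))
```

On this toy input the dependent encoding takes 17 bits against 32 for the single frequency-based code, at the price of one table per preceding character. That table count is the storage cost which the grouping of rows with similar statistics is meant to reduce.
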

Method of achieving data compaction utilizing variable-length dependent coding techniques
Application Number
Publication Number
Application Date
October 30, 1970
Publication Date
September 26, 1972
Raviv Josef
Mommens Jacques H
Loh Louis S
International Business Machines Corporation
G06f 07/00
G11b 13/00
H03M 07/42
G06F 17/18