DATA COMPRESSION FOR VERY SPARSE INTEGER ARRAYS
The method of data compression described in this article was developed for archiving and processing EXOSAT images, but could be adapted to other kinds of very sparse data also. It is perhaps not the optimum method in terms of reducing data volume, but it is relatively simple to implement and does not lead to any loss of information from the original uncompressed data.
All EXOSAT images exist already on Final Observation Tapes in the form of a list of X-ray events. Usually one event is described by 8 bytes of data (position, quality, sum signal, and time tag). For applications which need only the positions of events, only 2 of the above 8 bytes per event need be retained, however this more compact form for storing positional data has never been implemented in the EXOSAT observatory software; up to now the accumulation of pixels has been performed by starting from the 8-byte events on an FOT. Recently the following even more compact way of storing positional data has instead been implemented; it requires only about 4 bits per event, and so is about 4 times more compact than simply listing the positions of individual photons; for very long exposures it can be up to 8 times more compact.
2. Compression Algorithm.
In a typical 2048 x 2048-pixel EXOSAT image the average density of photons is as little as I per 5-10 pixels; pixels in background areas of the image will therefore nearly all have values of 0 or 1. This property is shared by printed text (the corresponding values being white and black) and suggests that one could copy a data compression method used for facsimile transmission (see e.g. ref 1). However, in parts of the image there axe, of course, pixels with values exceeding 1, and allowance for any 16-bit integer value must be made. A system of encoding GROUPS of consecutive pixels has been adopted, using four Huffman codes: 1, 01, 001, and 000. These are bit patterns which indicate how to decode the accompanying group of pixels. The four corresponding encoding methods are defined by the following bit strings:
(i) I represents 4 consecutive zero-valued pixels; the zeroes are not actually present in the file but are implied by this single bit set to 1,
(ii) 01nn represents 4 pixels containing just one photon in position nn within the group, the other 3 pixels being implicitly zero,
(iii) 001nnmm represents 4 pixels containing two photons in positions nn and mm within the group,
(iv) 000aabbccdd represents 4 pixels containing at least three photons altogether.
The numbers nn, aa etc are two-bit binary integers. In methods (ii) and (iii) it is more economical to encode the position of a photon as an offset within the group of four pixels; method (iii) includes the case of two photons in the same pixel (nn=mm). For method (iv), however, the pixels are directly encoded as two-bit integers if they have values 0, 1, or 2. Individual pixels in the range 3 to 17 are encoded as (binary) 11 followed by a four-bit string in the range 0 to Pixels greater than 17 are encoded as (binary) 111111 followed by the full 16-bit pixel value. Thus the total length of the bit string for method (iv) is variable, although most commonly 11 bits including the Huffman code.
The full field-of-view is subdivided into 16 x 16 "small maps" each of 128 x 128 pixels, as on the FOT itself; a table near the beginning of a compressed image gives the offset of each small map and the total number of photons it contains. The decompression of the data therefore has to be performed in units of a complete small map, and unwanted areas of the image can be disregarded, in order to reduce the CPU time needed.
The size of a Pixel group was chosen above to be four, because this leads to a minimum size for a well-populated small map. Any other small power of two could instead have been chosen, with a similar encoding algorithm, and indeed for a relatively empty small map (between 1 and 1024 photons) a group size of 16 was adopted; it is advantageous because the minimum- overhead of 512 bytes (4096 groups each encoded according to method (i) above) is reduced to only 128 bytes, at the expense of increasing the number of additional bits per photon from 3 to 5. A small map containing no photons at all is entirely omitted from the compressed image; this happens quite frequently, especially when the on-board software uses a spatial filter to select events. Clearly there exist possibilities for varying the algorithm, and some tests were made with other group sizes and additional Huffman codes, but none resulted in any worthwhile reduction to the total sizes of images formed from actual EXOSAT data, and so for the sake of simplicity the algorithm was restricted to the cases already described.
The decoding of compressed images relies on the efficient manipulation of bit strings; care must be taken not to introduce unnecessary inefficiencies because of the possible impact on CPU time. A precise specification of the image layout and a FORTRAN listing of the decompression routines currently in use on the HP-1000 computers at ESTEC can be provided on request. A program wil be provided to convert these compressed images to FITs format.
l. Huang, T.S. and Hussian, A.B.S., Facsimile Coding by Skipping White, IEEE Trans Commun COM 23, 12 1975 pp,1452-1466.