13.2 Optimization Strategies

1. Because the data in FITS files is always stored in "big-endian" byte order, where the first byte of numeric values contains the most significant bits and the last byte contains the least significant bits, CFITSIO must swap the order of the bytes when reading or writing FITS files when running on little-endian machines (e.g., Linux and Microsoft Windows operating systems running on PCs with x86 CPUs).

On relatively new CPUs that support "SSSE3" machine instructions (e.g., starting with Intel Core 2 CPUs in 2007, and in AMD CPUs beginning in 2011) significantly faster 4-byte and 8-byte swapping algorithms are available. These faster byte swapping functions are not used by default in CFITSIO (because of potential code portablility issues), but users can enable them on supported platforms by adding the appropriate compiler flags (-mssse3 with gcc or icc on linux) when compiling the swapproc.c source file, which will allow the compiler to generate code using the SSSE3 instruction set. A convenient way to do this is to configure the CFITSIO library with the following command:

  >  ./configure --enable-ssse3
Note, however, that a binary executable file that is created using these faster functions will only run on machines that support the SSSE3 machine instructions.

For faster 2-byte swaps on virtually all x86-64 CPUs (even those that do not support SSSE3), a variant using only SSE2 instructions exists. SSE2 is enabled by default on x86_64 CPUs with 64-bit operating systems (and is also automatically enabled by the -enable-ssse3 flag). When running on x86_64 CPUs with 32-bit operating systems, these faster 2-byte swapping algorithms are not used by default in CFITSIO, but can be enabled explicitly with:

./configure --enable-sse2
Preliminary testing indicates that these SSSE3 and SSE2 based byte-swapping algorithms can boost the CFITSIO performance when reading or writing FITS images by 20% - 30% or more. It is important to note, however, that compiler optimization must be turned on (e.g., by using the -O1 or -O2 flags in gcc) when building programs that use these fast byte-swapping algorithms in order to reap the full benefit of the SSSE3 and SSE2 instructions; without optimization, the code may actually run slower than when using more traditional byte-swapping techniques.

2. When dealing with a FITS primary array or IMAGE extension, it is more efficient to read or write large chunks of the image at a time (at least 3 FITS blocks = 8640 bytes) so that the direct IO mechanism will be used as described in the previous section. Smaller chunks of data are read or written via the IO buffers, which is somewhat less efficient because of the extra copy operation and additional bookkeeping steps that are required. In principle it is more efficient to read or write as big an array of image pixels at one time as possible, however, if the array becomes so large that the operating system cannot store it all in RAM, then the performance may be degraded because of the increased swapping of virtual memory to disk.

3. When dealing with FITS tables, the most important efficiency factor in the software design is to read or write the data in the FITS file in a single pass through the file. An example of poor program design would be to read a large, 3-column table by sequentially reading the entire first column, then going back to read the 2nd column, and finally the 3rd column; this obviously requires 3 passes through the file which could triple the execution time of an IO limited program. For small tables this is not important, but when reading multi-megabyte sized tables these inefficiencies can become significant. The more efficient procedure in this case is to read or write only as many rows of the table as will fit into the available internal IO buffers, then access all the necessary columns of data within that range of rows. Then after the program is completely finished with the data in those rows it can move on to the next range of rows that will fit in the buffers, continuing in this way until the entire file has been processed. By using this procedure of accessing all the columns of a table in parallel rather than sequentially, each block of the FITS file will only be read or written once.

The optimal number of rows to read or write at one time in a given table depends on the width of the table row and on the number of IO buffers that have been allocated in CFITSIO. The CFITSIO Iterator routine will automatically use the optimal-sized buffer, but there is also a CFITSIO routine that will return the optimal number of rows for a given table: fits_get_rowsize. It is not critical to use exactly the value of nrows returned by this routine, as long as one does not exceed it. Using a very small value however can also lead to poor performance because of the overhead from the larger number of subroutine calls.

The optimal number of rows returned by fits_get_rowsize is valid only as long as the application program is only reading or writing data in the specified table. Any other calls to access data in the table header would cause additional blocks of data to be loaded into the IO buffers displacing data from the original table, and should be avoided during the critical period while the table is being read or written.

4. Use the CFITSIO Iterator routine. This routine provides a more `object oriented' way of reading and writing FITS files which automatically uses the most appropriate data buffer size to achieve the maximum I/O throughput.

5. Use binary table extensions rather than ASCII table extensions for better efficiency when dealing with tabular data. The I/O to ASCII tables is slower because of the overhead in formatting or parsing the ASCII data fields and because ASCII tables are about twice as large as binary tables that have the same information content.

6. Design software so that it reads the FITS header keywords in the same order in which they occur in the file. When reading keywords, CFITSIO searches forward starting from the position of the last keyword that was read. If it reaches the end of the header without finding the keyword, it then goes back to the start of the header and continues the search down to the position where it started. In practice, as long as the entire FITS header can fit at one time in the available internal IO buffers, then the header keyword access will be relatively fast and it makes little difference which order they are accessed.

7. Avoid the use of scaling (by using the BSCALE and BZERO or TSCAL and TZERO keywords) in FITS files since the scaling operations add to the processing time needed to read or write the data. In some cases it may be more efficient to temporarily turn off the scaling (using fits_set_bscale or fits_set_tscale) and then read or write the raw unscaled values in the FITS file.

8. Avoid using the `implicit data type conversion' capability in CFITSIO. For instance, when reading a FITS image with BITPIX = -32 (32-bit floating point pixels), read the data into a single precision floating point data array in the program. Forcing CFITSIO to convert the data to a different data type can slow the program.

9. Where feasible, design FITS binary tables using vector column elements so that the data are written as a contiguous set of bytes, rather than as single elements in multiple rows. For example, it is faster to access the data in a table that contains a single row and 2 columns with TFORM keywords equal to '10000E' and '10000J', than it is to access the same amount of data in a table with 10000 rows which has columns with the TFORM keywords equal to '1E' and '1J'. In the former case the 10000 floating point values in the first column are all written in a contiguous block of the file which can be read or written quickly, whereas in the second case each floating point value in the first column is interleaved with the integer value in the second column of the same row so CFITSIO has to explicitly move to the position of each element to be read or written.

10. Avoid the use of variable length vector columns in binary tables, since any reading or writing of these data requires that CFITSIO first look up or compute the starting address of each row of data in the heap. In practice, this is probably not a significant efficiency issue.

11. When copying data from one FITS table to another, it is faster to transfer the raw bytes instead of reading then writing each column of the table. The CFITSIO routines fits_read_tblbytes and fits_write_tblbytes will perform low-level reads or writes of any contiguous range of bytes in a table extension. These routines can be used to read or write a whole row (or multiple rows for even greater efficiency) of a table with a single function call. These routines are fast because they bypass all the usual data scaling, error checking and machine dependent data conversion that is normally done by CFITSIO, and they allow the program to write the data to the output file in exactly the same byte order. For these same reasons, these routines can corrupt the FITS data file if used incorrectly because no validation or machine dependent conversion is performed by these routines. These routines are only recommended for optimizing critical pieces of code and should only be used by programmers who thoroughly understand the internal format of the FITS tables they are reading or writing.

12. Another strategy for improving the speed of writing a FITS table, similar to the previous one, is to directly construct the entire byte stream for a whole table row (or multiple rows) within the application program and then write it to the FITS file with fits_write_tblbytes. This avoids all the overhead normally present in the column-oriented CFITSIO write routines. This technique should only be used for critical applications because it makes the code more difficult to understand and maintain, and it makes the code more system dependent (e.g., do the bytes need to be swapped before writing to the FITS file?).

13. Finally, external factors such as the speed of the data storage device, the size of the data cache, the amount of disk fragmentation, and the amount of RAM available on the system can all have a significant impact on overall I/O efficiency. For critical applications, the entire hardware and software system should be reviewed to identify any potential I/O bottlenecks.