NAME

ftjoin - Join the columns of two tables according to a matching criterium

USAGE

ftjoin leftfile rightfile outfile expr

DESCRIPTION

The ftjoin task joins the columns of two tables into one table, using an arbitrary matching expression. In the languages of databases, this is known as a "database join." The two tables to be joined are designated the "left" and "right" tables.

ftpaste and ftjoin can both be used to join the columns of two tables. When using ftpaste, the rows must match one-to-one. When using ftjoin, the rows are matched according to any desired criteria.

Each row in the left table is matched with each of the rows in the right table, using a user-supplied matching expression. For example, if table "L" is a list of detected stars, and table "R" is a catalog of known stars, then one could ask for the stars in table "L" that are within a predefined search radius of the stars in table "R". Any CFITSIO calculator expression which evaluates to true or false is valid (see help for 'colfilter').

The rows that match are saved to the output file. Multiple matches are allowed. For example, two stars of a binary in table "L" may match a single catalog entry in table "R", as long as they both satisfy the matching criterium. These matches would appear as two separate output rows. The output will contain a single table with all the columns of both input tables.

ftjoin has a direct analogy to classical database join operations. For example, in the SQL database language, one might issue a star catalog matching operation with the following query:

  SELECT * FROM L,R WHERE ANGSEP(L.RA,L.DEC,R.RA,R.DEC) < 10.0/3600;

which selects those stars in "L" that are within 10 arcseconds of the catalog positions in "R". This is equivalent to the ftjoin operation,

  ftjoin l.fits r.fits out.fits "ANGSEP(L_RA,L_DEC,R_RA,R_DEC) < 10.0/3600"  \
    leftnameprefix="L_" rightnameprefix="R_"

Please see below for the way that columns are renamed.

It is also possible to save non-matching rows as well. This depends on the type of join (specified by the jointype parameter). The default type of join, "INNER", saves only rows that match both tables. As shown in the table below, a row can appear in the output depending on whether the row appears in the left table, right table or both.

 Row appears in output if it is in...
jointypeLeft onlyLeft and RightRight only
INNER / INTERSECTIONNoYesNo
LEFT OUTERYesYesNo
RIGHT OUTERNoYesYes
FULL OUTER / UNIONYesYesYes

An "INNER" join is equivalent to a standard set-wise intersection. A "FULL OUTER" join is equivalent to a standard set-wise union.

Continuing with the star catalog analogy, a "LEFT OUTER" join between tables "L" and "R" would produce all matches between the the detection table "L" and the catalog table "R", plus any remaining non-matching stars in table "L".

When using an "OUTER" join type, missing values are filled with null values. Note that the input tables must have null values defined via TNULLn keywords in order for this to work properly.

Avoiding Column Name Clashes

In order to avoid column name clashes, ftjoin can rename columns automatically. ftjoin can append a prefix and/or suffix to each column name. The pre/suffix should be different for the left and right tables. By default, the column names of the left table have the prefix "L_" added, and the prefix "R_" is applied to the right table. However, it is worth noting that simply adding a prefix does not always avoid column name clashes. The output table and the matching expression use the renamed column names, not the original names.

ftjoin can issue a warning or terminate if there are duplicate column names, depending on the dupcolname parameter.

Special Variables

The standard CFITSIO calculator expressions are available to the user, but the #ROW expression is treated specially. Because two different tables are being joined, the two new different variables are available: #L_ROW and #R_ROW. The variable #L_ROW is the row number in the left table, and #R_ROW is the row number in the right table. Note that the names #L_ROW and #R_ROW are hard-coded; they are independent of any leftnameprefix or rightnameprefix settings that you choose.

Other Information

Upon finishing successfully, ftjoin updates the task parameters nbothmatch, nleftmatch and nrightmatch with the number of joint matches, left-only matches, and right-only matches, respectively. These can be used to determine the success of the operation.

Generally speaking ftjoin is only efficient for modest size tables. The full tables should fit into memory at once. No optimization or special indexing is done. For two tables with M and N rows, the number of cross-comparisons is M * N.

PARAMETERS

leftfile [filename]
Name of "left" input table. Non-structural keywords are copied from this table to the output file.
rightfile [filename]
Name of "right" input table.
outfile [filename]
Name of output file. If no matching rows are found, it is possible for this table to be empty.
expr [string]
Matching expression. Any expression that results in a boolean can be used to compare the columns in the two input tables. Note that if column renaming is done, then the renamed column names should be used.
(jointype = "INNER") [string]
The type of join to be done. One of "INNER", "LEFT OUTER", "RIGHT OUTER", or "FULL OUTER". Users can also use "INTERSECTION" (="INNER") or "UNION" (="FULL OUTER"). The spaces are optional, i.e. you can use "LEFTOUTER" instead of "LEFT OUTER". See above for the descriptions of these operations.
(leftnameprefix = "L_") [string]
String to be automatically prepended to each left table column name, or NONE to indicate no prefix.
(leftnamesuffix = "NONE") [string]
String to be automatically appended to each left table column name, or NONE to indicate no suffix.
(rightnameprefix = "R_") [string]
String to be automatically prepended to each right table column name, or NONE to indicate no prefix.
(rightnamesuffix = "NONE") [string]
String to be automatically appended to each right table column name, or NONE to indicate no suffix.
nbothmatch = 0 [integer - OUTPUT]
Upon successful task completion, this parameter is updated with the number of joint row matches.
nleftmatch = 0 [integer - OUTPUT]
Upon successful task completion, this parameter is updated with the number of joint left-only matches (in the cases of LEFTOUTER or FULLOUTER joins.
nrightmatch = 0 [integer - OUTPUT]
Upon successful task completion, this parameter is updated with the number of joint right-only matches (in the cases of RIGHTOUTER or FULLOUTER joins.
(dupcolnames = NO) [boolean]
If YES, then duplicate column names are allowed (although a warning is issued when duplicates are found). If NO, then duplicate column names are not allowed, and the task terminates.
(outcolumns = "*") [string]
Currently not used.
(clobber = NO) [boolean]
If the output file already exists, then setting "clobber = yes" will cause it to be overwritten.

(chatter = 2) [integer, 0 - 5]
Amount of verbosity of the task. For chatter=0, no output is printed. For chatter=5, debugging output is printed.

(history = YES) [boolean]
If history = YES, then a set of HISTORY keywords will be written to the header of the specified HDU in the output file to record the value of all the task parameters that were used to produce the output file.

EXAMPLES

1. Match two tables by name. Assume two tables have a NAME column which refers to the names of various stars. This example forms a combined table with the columns of both tables where the NAMEs match.

     ftjoin left.fits right.fits out.fits "L_NAME == R_NAME" \
       leftnameprefix="L_" rightnameprefix="R_"

2. Match two catalogs by position. Consider an example where we are looking for pulsars inside of supernova remnants. We can download the Green supernova catalog as a FITS file (see HEASARC snrgreen catalog); and the ATNF pulsar catalog as a FITS file (see atnfpulsar catalog). Then, we can match the tables by asking the question, which pulsars are inside of the catalogged supernova remnants. In this case, the maximum of the MINOR_DIAMETER and MAJOR_DIAMETER columns is used to estimate the radius of the remnant.

     ftjoin snrgreen.fits atnfpulsar.fits snr_psr.cat \
       'ANGSEP(L_RA,L_DEC,R_RA,R_DEC) < MAX(L_MINOR_DIAMETER,L_MAJOR_DIAMETER)/60.0/2.0' \
       leftnameprefix="L_" rightnameprefix="R_"

The factor of 2.0 converts a diameter into a radius, and 60.0 converts arcminutes into degrees. Of course, this table will require follow-up screening to eliminate chance coincidences and double-matches (of which there are some).

3. Add new data to an existing base catalog. Consider a case where we have a base catalog of stars, and a catalog of new data from a second observatory for a few stars which we would like to join to this table. We would like new data to be associated with existing data, when the star already exists in the base catalog. This is a case where an "outer" join is appropriate: we want to keep all the main catalog entries regardless of whether they match the ancillary data. Here we will match by NAME columns.

     ftjoin main.cat newobs.cat newmain.cat "NAME == NEWOBS_NAME" \
       leftnameprefix=NONE rightnameprefix="NEWOBS_" jointype="LEFTOUTER"

Since we would like to keep the original table unperturbed, the left prefix is set to NONE so that the column names are not renamed. The new observations are prefixed with "NEWOBS_". Entries in the main catalog which don't have any corresponding ancillary data will have NULL ancillary values.

4. Selecting output columns. Currently ftjoin does not have an explicit mechanism to select output columns. However, this can be done with CFITSIO pipes. For example, if one wanted to keep only the columns NAME,FLUX,PERIOD, one could use the following:

     ftjoin left.fits right.fits - "MATCHING EXPRESSION" | \
       ftcopy '-[col L_NAME;R_FLUX;R_PERIOD]' out.fits

The "-" hyphen is used to pipe the output of ftjoin, and ftcopy is then used to select the requested columns.

SEE ALSO

ftpaste, ftselfjoin, colfilter, calc_express