NAME

ftselfjoin - Search for matching entries within a single table

USAGE

ftselfjoin infile outfile expr

DESCRIPTION

The ftselfjoin task searches for matching entries within a single table, usually duplicates. ftselfjoin is a "wrapper" around the ftjoin task, but specialized for the operation of duplicate-searching within a single table. In a sense, this is a database "join" of a table with itself (in database terminology, a "join" is a matching operation).

Users may specify any matching criteria. For example, one could search a table of sources for all sources which have the same name. However, the power of ftselfjoin is in specifying more complicated matching criteria. A more complicated example might be to search a table of sources for all sources that are within 1 degree of any other source in the table. (see example below)

Matches occur in pairs, known as the "left" and "right" matches. The rows that match are saved to the output file. Multiple matches are allowed. These matches would appear as two separate output rows. The output will contain a single table with all the columns of both left and right matches.

The matching expression is commutative if it doesn't care about whether left is being compared to right, or right is compared to left. If the expression is commutative, then the default processing will produce two output entries for every match: one when left matches right, and another when right matches left. If this is undesireable, the duplicates can be avoided by setting commutative=YES.

ftselfjoin never reports a match between a table row and itself, regardless of the 'commutative' setting.

Avoiding Column Name Clashes

In order to avoid column name clashes, ftselfjoin should rename columns automatically. ftjoin can append a prefix and/or suffix to each column name. The output table and the matching expression use the renamed column names, not the original names.

At least one of the prefixes or suffixes must be used to remove the ambiguity between the column names of the left and right matches. By default, the column names of the left table have the prefix "L_" added, and the prefix "R_" is applied to the right table. ftjoin will issue a warning or terminate if there are duplicate column names.

Generally speaking ftjoin is only efficient for modest size tables. The full tables should fit into memory at once. No optimization or special indexing is done. For two tables with M and N rows, the number of cross-comparisons is M * N.

PARAMETERS

infile [filename]
Name of input table. Non-structural keywords are copied from this table to the output file.
outfile [filename]
Name of output file. If no matching rows are found, it is possible for this table to be empty.
expr [string]
Matching expression. Any expression that results in a boolean can be used to compare the columns in the two input tables. Note that if column renaming is done, then the renamed column names should be used.
(jointype = "INNER") [string]
The type of join to be done. One of "INNER", "LEFT OUTER", "RIGHT OUTER", or "FULL OUTER". Users can also use "INTERSECTION" (="INNER") or "UNION" (="FULL OUTER"). The spaces are optional, i.e. you can use "LEFTOUTER" instead of "LEFT OUTER". See above for the descriptions of these operations.
(leftnameprefix = "L_") [string]
String to be automatically prepended to each left table column name. This prefix is ignored if the leftnamemap is used.
(leftnamesuffix = "NONE") [string]
String to be automatically appended to each left table column name. This suffix is ignored if the leftnamemap is used.
(rightnameprefix = "R_") [string]
String to be automatically prepended to each right table column name. This prefix is ignored if the rightnamemap is used.
(rightnamesuffix = "NONE") [string]
String to be automatically appended to each right table column name. This suffix is ignored if the rightnamemap is used.
nbothmatch = 0 [integer - OUTPUT]
Upon successful task completion, this parameter is updated with the number of joint row matches.
nleftmatch = 0 [integer - OUTPUT]
Upon successful task completion, this parameter is updated with the number of joint left-only matches (in the cases of LEFTOUTER or FULLOUTER joins.
nrightmatch = 0 [integer - OUTPUT]
Upon successful task completion, this parameter is updated with the number of joint right-only matches (in the cases of RIGHTOUTER or FULLOUTER joins.
(dupcolnames = NO) [boolean]
If YES, then duplicate column names are allowed (although a warning is issued when duplicates are found). If NO, then duplicate column names are not allowed, and the task terminates.
(commutative = NO) [boolean]
Set to YES if the matching operation is commutative.
(outcolumns = "*") [string]
Currently not used.
(clobber = NO) [boolean]
If the output file already exists, then setting "clobber = yes" will cause it to be overwritten.

(chatter = 2) [integer, 0 - 5]
Amount of verbosity of the task. For chatter=0, no output is printed. For chatter=5, debugging output is printed.

(history = YES) [boolean]
If history = YES, then a set of HISTORY keywords will be written to the header of the specified HDU in the output file to record the value of all the task parameters that were used to produce the output file.

EXAMPLES

1. Search a table for entries with the same name. Assume the table has a NAME column which refers to the names of various stars. This example forms a table which shows where the NAMEs match.

     ftselfjoin table.fits out.fits "L_NAME == R_NAME" \
       leftnameprefix="L_" rightnameprefix="R_" commutative=YES

Note that the setting "commutative=YES" will avoid duplicate entries.

2. Match two catalogs by position. Consider an example of a table containing a star catalog, and it is desired to find all stars within a given radius of any other star, say 0.5 degrees. We use the ANGSEP() function to measure the distance between stars:

     ftselfjoin table.fits out.fits \
       'ANGSEP(L_RA,L_DEC,R_RA,R_DEC) < 0.5' \
       leftnameprefix="L_" rightnameprefix="R_" commutative=YES

Note that the setting "commutative=YES" will avoid duplicate entries.

SEE ALSO

ftpaste, ftjoin, colfilter, calc_express