THE ANNOTATED SOURCE DETECTION
2nd X-Ray Astronomy School, Berkeley Springs, WV
2002 August 21
Vinay Kashyap & Peter Freeman
Overview
- Why source detection?
* why is it so difficult in X-rays to know that there is a source? Consider that for the same energy flux as at, say, 5000 Angstrom, there are a thousand times fewer photons at 5 Angstrom; not to mention that for most objects -- e.g., stars -- the X-ray flux is lower than the optical by many orders of magnitude, leaving us scrounging for photons in the Poisson regime (a quick scaling check appears below)
[HST image of 47 Tuc]
[Chandra ACIS image of central portion of 47 Tuc]
* dealing only with source detection, not image reconstruction, image deconvolution, or modeling
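As a quick check of the factor of a thousand quoted above -- a minimal sketch, with an arbitrary flux value; only the wavelength scaling matters:

    # At fixed energy flux F, the photon rate is F/E_photon with E_photon = h*c/lambda,
    # so the photon rate scales linearly with wavelength.
    h = 6.626e-27        # Planck constant [erg s]
    c = 2.998e10         # speed of light [cm/s]

    def photon_rate(F, wavelength_cm):
        """Photons/s/cm^2 for a monochromatic source of energy flux F [erg/s/cm^2]."""
        return F * wavelength_cm / (h * c)

    F = 1e-12                                           # arbitrary energy flux
    print(photon_rate(F, 5e-8) / photon_rate(F, 5e-5))  # 5 A vs 5000 A -> 0.001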
- How to detect sources
- In Theory
* Peter Freeman speaks on the theory of source detection (viewgraphs in postscript and PDF)
- [Detection Theory: the Short Form]
- [The Five-Fold Path]
- [Potholes on the Five-Fold Path]
- [Classic Detection: CELLDETECT]
- [Figure of source and background cells]
- [New Detection: WAVDETECT]
- [Figure of MexHat wavelet and negative annulus]
- In Practice
[SOURCE DETECTION IN THE REAL WORLD: SNARES AND TRAPS]
- Dealing with detected sources
- Hardness Ratios
[HARDNESS RATIOS]
- logN-logS
[BIASES IN THE CONSTRUCTION OF logN-logS CURVES]
- Upper limits
* there is a difference between the confidence limits derived on the source strength parameter and the answer to the question ``at what count rate would the source be detected?'' (i.e., a difference between parameter bounds and upper limits that is not always recognized)
* For example, consider a source cell with 5 counts and a background cell of 100 times larger area with 200 counts. Then the background estimate in the source cell is B = 2 +- 0.141, and the estimated source strength is S = 3 +- 2.24 (assuming simple root-N errors). A ``3-sigma'' upper limit could be set based on the measured error, i.e., S < 9.7; in other words, 12 counts in the source cell are required for a 3-sigma detection.
On the other hand, if a hypothesis test is performed by computing the probability that a background of 2 counts can generate >= N counts, then the putative 3-sigma limit is reached at 6.5, for a nominal source strength of < 5 counts (the sketch after this list works through both calculations)
* Of course, note that the former method is not internally self-consistent: if 12 counts are indeed observed in the source cell, the result would still only be classified as a 2.9-sigma source
* The difference is due to the fact that it is not necessary to measure the flux of a source to determine its existence or lack thereof
* A Bayesian approach would consider the relative probabilities of two models, one with only the background and one with a source and a background
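A minimal sketch of the two calculations above, assuming scipy; the integer Poisson threshold it prints is somewhat more conservative than the interpolated 6.5 quoted above:

    import numpy as np
    from scipy.stats import poisson

    N_src, N_bkg, area_ratio = 5, 200, 100.0

    B = N_bkg / area_ratio                 # background estimate in the source cell: 2.0
    sig_B = np.sqrt(N_bkg) / area_ratio    # +- 0.141
    S = N_src - B                          # 3.0
    sig_S = np.sqrt(N_src + sig_B**2)      # +- 2.24, simple root-N errors

    # Method 1: ``3-sigma'' upper limit from error propagation
    print("S < %.1f" % (S + 3.0 * sig_S))  # S < 9.7, i.e., ~12 counts in the cell

    # Method 2: hypothesis test -- smallest total count N whose chance of arising
    # from the background alone falls below the one-sided 3-sigma tail, 1.35e-3
    N = np.arange(30)
    N_det = N[poisson.sf(N - 1, B) < 1.35e-3][0]
    print(N_det, N_det - B)                # detection threshold and implied source strength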
SOURCE DETECTION IN THE REAL WORLD: SNARES AND TRAPS
- The Starter Kit
* these are basic issues without which source detection is impossible, and ones that the analysis software will work out for you, if not in canned routines, then at least in threads
- Here be Hippogrif
* these are problems that are by and large taken care of by the algorithms, but they are also traditionally where source detection algorithms tend to break down, and results must be double checked
- What, me worry?
* these are problems that source detection algorithms generally do not consider in any detail; they do not cause much trouble in the general run of things, but they can have a large impact in certain cases
- Source Position and plate scale
* at large off-axis locations and for weak sources, source position determination may be contaminated by background photons
* when large numbers of sources have been detected in the field and identified with counterparts, it is always a good idea to double check the plate scale
- Source spectrum and ExpMaps and PSFs
[variation of PSF size with energy and off-axis]
[exposure map showing energy dependence, courtesy Jonathan McDowell]
* an exposure map that has the correct energy dependence will take into account the differences between the ACIS-S FI and BI chip responses, so that the algorithms will know that the change in background level across the chip boundaries is not real
- Pileup
[ACIS-S/LETG 0th order image of XTE J1118+48 showing the effects of pileup on the on-axis PSF]
- Detection Sensitivity
* Type I Errors, aka false positives, or sources that are detected even though they are not real (the sketch after this list shows how the threshold set by the Type I rate fixes the Type II rate)
- Detection Probability
* Type II Errors, aka false negatives, or real sources that are not detected because of count fluctuations
* Type I and Type II errors must be properly accounted for when computing logN-logS curves
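To make the interplay concrete, here is a minimal sketch; the background level, the per-cell false-positive rate, and the function names are illustrative assumptions, not any mission's actual values:

    import numpy as np
    from scipy.stats import poisson

    def threshold(b, p_false=1e-6):
        """Smallest count N with P(N or more counts | background b) < p_false (Type I)."""
        N = np.arange(200)
        return N[poisson.sf(N - 1, b) < p_false][0]

    def detection_prob(s, b, p_false=1e-6):
        """Chance that a source of mean strength s clears that threshold (1 - Type II)."""
        return poisson.sf(threshold(b, p_false) - 1, s + b)

    # With b = 2 background counts per cell, a source sitting exactly at the
    # count threshold is detected only about half the time:
    N_det = threshold(2.0)
    print(N_det, detection_prob(N_det - 2.0, 2.0))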
HARDNESS RATIOS
(Not as simple as they seem)
- Useful with large samples and when spectral fits are infeasible
- Types of hardness ratios:
- The simple ratio, R = S/H,
sigR = R sqrt{ (sigS/S)^2 + (sigH/H)^2 }
* A simple ratio of counts in two passbands, Soft and Hard; range is 0<R<infinity
- The color, C = log_a(S) - log_a(H),
sigC = (1/ln(a)) sqrt{ (sigS/S)^2 + (sigH/H)^2 }
* the equivalent of color, such as B-V; range is -infinity<C<+infinity
- The fractional difference, HR = (H-S)/(H+S),
sigHR = (2/(H+S)^2) sqrt{ H^2 sigS^2 + S^2 sigH^2 }
* the fractional difference, often used in extragalactic astronomy because of the better behavior of the denominator; range is -1<HR<+1 (all three forms are computed in the sketch below)
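A minimal sketch of all three forms and their propagated errors, assuming base a=10 for the color and root-N errors when none are supplied:

    import numpy as np

    def hardness(S, H, sigS=None, sigH=None):
        """R, C (base 10), and HR with Gaussian error propagation."""
        sigS = np.sqrt(S) if sigS is None else sigS
        sigH = np.sqrt(H) if sigH is None else sigH
        rel = np.sqrt((sigS / S)**2 + (sigH / H)**2)
        R = S / H
        C = np.log10(S / H)
        HR = (H - S) / (H + S)
        sigR = R * rel
        sigC = rel / np.log(10.0)
        sigHR = 2.0 * np.sqrt(H**2 * sigS**2 + S**2 * sigH**2) / (H + S)**2
        return (R, sigR), (C, sigC), (HR, sigHR)

    print(hardness(30.0, 10.0))   # e.g., 30 soft counts and 10 hard counts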
- Mathematical obstinacy
* formal error propagation is mathematically invalid for both R and HR, which result in Cauchy-like distributions for which a mean cannot be defined and the variance is infinite; this is not a problem that goes away with ``better statistics'' -- in fact, it gets worse
* use the log form, C, which is well behaved, as much as possible
- Poisson statistics
* Better to use the full Poisson likelihoods to determine the errors, especially since the ratios behave very unintuitively in the Poisson regime -- for instance, for small values of H, the mean value of R increases, but the mode decreases (see the simulation sketch at the end of this section)
- Background
* Poisson formulations can take the background into account correctly; with regular error propagation, the best one can do is add the background errors in quadrature
- Upper limits
* again, Poisson formulations deal naturally with the cases where one or both of S and H are consistent with 0; DO NOT resort to simplistic constructs such as plugging in the 1-sigma limit to compute the ratios
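A small simulation of the low-count regime -- a minimal sketch, assuming a weak hard band; it shows the long tail that develops in R (mean pulled far from the median) while C and HR stay comparatively tame, and note that merely discarding the zero-count draws already biases the answer:

    import numpy as np
    rng = np.random.default_rng(7)

    mu_S, mu_H, n = 8.0, 2.0, 100000         # true rates: weak hard band
    S = rng.poisson(mu_S, n).astype(float)
    H = rng.poisson(mu_H, n).astype(float)
    keep = (S > 0) & (H > 0)                 # R and C are undefined at zero counts
    S, H = S[keep], H[keep]

    R = S / H
    C = np.log10(S / H)
    HR = (H - S) / (H + S)

    # R develops a heavy, sample-size-dependent tail; C and HR are better behaved
    for name, x in (("R", R), ("C", C), ("HR", HR)):
        print(name, x.mean(), np.median(x), x.max())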
log N(>S) - log S
Say n(r) = n_0, f(Lx) = delta(Lx - Lx0), S = Lx0/(4 pi r^2). The number of sources within r is N(<r) = (4 pi/3) r^3 n_0, so N(>S) = n_0 (4 pi/3) (Lx0/(4 pi S))^{3/2}, i.e., N(>S) ~ S^{-3/2}. In general,
N(>S,l,b) = dOmega int_0^inf dLx int_0^inf dL'x f(L'x) Sigma(Lx,L'x) int_0^{r'} dr n(r,l,b) r^2
where r' is implicitly defined by S = (L'x/(4 pi r'^2)) e^{-tau(r')}, and Sigma(Lx,L'x) is a function that takes into account statistical fluctuations in the intrinsic luminosity.
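The Euclidean special case is easy to verify by simulation -- a minimal sketch, assuming standard candles distributed uniformly in a sphere:

    import numpy as np
    rng = np.random.default_rng(42)

    n, R_max, L = 200000, 1.0, 1.0
    r = R_max * rng.random(n)**(1.0 / 3.0)     # uniform in volume: P(<r) ~ r^3
    S = L / (4.0 * np.pi * r**2)               # standard-candle fluxes

    # Fit the slope of log N(>S) vs log S, staying away from small-number noise
    S_grid = np.logspace(np.log10(np.percentile(S, 60)),
                         np.log10(np.percentile(S, 99)), 20)
    NgtS = np.array([(S > s).sum() for s in S_grid])
    slope = np.polyfit(np.log10(S_grid), np.log10(NgtS), 1)[0]
    print(slope)                               # close to -1.5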
- Biases
- Source confusion
* in crowded fields, detection algorithms tend to miss weaker sources near strong sources, and sometimes they merge multiple sources into a single one
- False sources (false positives)
* false sources that arise due to fluctuations in the background will contaminate the numbers at the low flux end
- Lost sources (false negatives)
* because of statistical fluctuations, a source of a given intrinsic strength will produce a different number of counts in any given observation; as the source strength decreases, the chance that it goes undetected increases even if it is nominally above the detection threshold
- Malmquist Bias
* the volume in which high-luminosity sources can be detected is larger than the volume in which low-luminosity sources are detected; thus luminous objects will be overrepresented in flux-limited samples
- Faint source fluctuations
* when there are many more low-flux sources than high-flux sources, statistical fluctuations scatter more of the weaker sources up into higher flux bins than stronger sources down into lower flux bins
- Eddington Bias
* when fluxes of sources with intensities near the detection threshold are measured, the average measured flux tends to be higher than the true flux, because fluctuations towards smaller counts are censored by the detection threshold (see the simulation sketch after this list)
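Eddington bias shows up readily in a toy simulation -- a minimal sketch, assuming a Euclidean S^{-3/2} population and an arbitrary conversion of 4 counts per unit flux:

    import numpy as np
    rng = np.random.default_rng(1)

    n, expos = 500000, 4.0                        # sources; counts per unit flux
    S_true = rng.pareto(1.5, n) + 1.0             # N(>S) ~ S^{-1.5} for S >= 1
    counts = rng.poisson(S_true * expos)          # Poisson-fluctuated observations
    S_meas = counts / expos

    det = counts >= 10                            # detection threshold in counts
    near = det & (S_meas < 4.0)                   # detected sources near the threshold
    print((S_meas[near] - S_true[near]).mean())   # > 0: measured fluxes biased high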
- Construction
- Sky coverage
* construct the area of the field over which a given sensitivity is reached, and correct the measured N(>S) vs. S accordingly (see the sketch below)
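A minimal sketch of that correction; the sensitivity curve area_of and all numbers are hypothetical:

    import numpy as np

    def corrected_NgtS(S_det, area_of, S_grid):
        """Each detected source counts as 1/Omega(S_i), the sky area [deg^2]
        over which a source of its flux could have been detected."""
        S_det = np.asarray(S_det)
        w = 1.0 / area_of(S_det)                       # per-source weights [deg^-2]
        return np.array([w[S_det > s].sum() for s in S_grid])

    def area_of(S):                                    # hypothetical coverage curve
        return 0.08 * np.clip(S / 1e-14, 0.0, 1.0)     # deg^2 sensitive at flux S

    S_det = np.array([2e-15, 5e-15, 2e-14, 8e-14])     # detected fluxes (made up)
    print(corrected_NgtS(S_det, area_of, np.array([1e-15, 1e-14])))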
- Modeling
* recent efforts have concentrated on developing methods to model f(Lx) to match the observed log(N)-log(S)
vkashyap@cfa.harvard.edu
pfreeman@cfa.harvard.edu