A trend filtering algorithm for wide-field variability surveys
Abstract
We show that various systematics related to certain instrumental effects and data reduction anomalies in wide-field variability surveys can be efficiently corrected by a trend filtering algorithm (TFA) applied to the photometric time-series produced by standard data pipelines. Statistical tests, performed on the data base of the HAT Network project, show that by the application of this filtering method the cumulative detection probability of periodic transits increases by up to 0.4 for variables brighter than 11 mag, with a trend of increasing efficiency toward brighter magnitudes. We also show that the TFA can be used for the reconstruction of periodic signals by iteratively filtering out systematic distortions.
1 Introduction
In the resolute hunt for transiting hot Jupiters, much effort is focused on very small instruments that are capable of observing a large area of the sky and gathering data with a photometric precision better than 0.01 mag. There are many projects targeting this goal (see Horne 2003), using instruments with typical apertures of the order of 10 cm; it has even been argued that small telescope projects with ∼2-inch diameter optics are the best for this purpose (Pepper, Gould & Depoy 2003).
The attained photometric precision plays a central role in transit signal detection. However, the large fields of view and the semiprofessional CCD detectors often employed by these low-budget projects drastically amplify some of the standard problems of CCD photometry. These include narrow point spread functions (PSFs), crowded fields, and differential extinction and refraction (see Bakos et al. 2004, hereafter B04, for a more comprehensive summary). Most of the problems become more severe toward fainter magnitudes, where the objects are more affected by time-dependent merging with brighter stars. Most of these, and other hidden effects, not only increase the noise of the light curves but also leave their fingerprint in the data as systematic variations, because they vary in a non-smooth fashion over the field, and the standard position-dependent smooth transformations applied at various stages of the data reduction do not eliminate them perfectly.
The efficiency of the detection of periodic signals drops dramatically when the light curves become dominated by noise and systematic variations. For example, the box-fitting least-squares (BLS) method (Kovács et al. 2002, hereafter K02), an application used to search for transits, is efficient on light curves exhibiting a box-shaped dip, but fails to recover transits superposed on strongly variable light curves. These systematics are also harmful in discovering interesting low-amplitude phenomena, such as δ Scuti or γ Dor stars.
Many of the systematic variations in a given light curve are shared by light curves of other stars in the same data set, due to common effects such as colour-dependent extinction, or blending of two or more star images (which could produce similar light-curve variations in all of them), etc. A solution to remove or reduce these common systematic variations can therefore be devised using the already reduced data. For each target star, one must identify objects in the field that suffer from the same type of systematics as the target, and apply some kind of optimum filtering of the target light curve based on the light curves of a set of comparison/template stars.
The algorithm to be presented in this paper can be applied to two types of problems. The first application is to remove trends from trend- and noise-dominated time-series, thereby increasing the probability of detecting a weak signal (periodic or non-periodic). Secondly, if a periodic signal is present, the trend filtering algorithm (TFA) can be used in a slightly different way to reconstruct the shape of that signal as well.
The structure of the paper is the following. In Section 2 we give a more detailed description of the systematics and their possible causes. In Section 3 we describe the mathematical formulation of the TFA, and in Section 4 we present a number of tests to show its ability to remove spurious trends from the data. In Section 5 we explore the increase in detectability of periodic transit signals (using the BLS algorithm) from data after processing with the TFA. In Section 6 we then investigate the capability of reconstructing the detailed shape and amplitude of such periodic signals, using a variant of the basic algorithm. The paper closes with a brief highlighting of the most important results. Throughout this paper we rely on the data base of the HAT Network (HATNet) project (see B04).
2 Light-Curve Systematics and Their Possible Sources
The systematic variations might be intrinsic to the data, or they could be due to uncorrected instrumental effects or changing observing conditions, or they could also originate from imperfect data reduction. First, we outline the procedure of common photometry reduction methods along with their smooth transformations that correct for trivial systematic variations. Then we proceed to the characterization of the remaining systematics, and discuss these and their possible causes through the examples of the HATNet light curves.
The typical procedure of wide-field photometry either employs flux extraction with aperture photometry or PSF fitting (e.g. DAOPHOT; Stetson 1987), or uses image subtraction (ISIS; Alard & Lupton 1998; Alard 2000).
In the first type of approach, the instrumental magnitudes of the individual frames are transformed to a common magnitude reference system, in order to eliminate the instrumental or observational changes between the epochs. For narrow-field surveys, all stars experience the same fractional flux change, and thus, to a good approximation, the transformation can be treated as a constant magnitude shift.
For wide-field surveys, this transformation takes a more complex form, and for convenience it is assumed to be a smooth function of the fitting parameters, e.g. of the coordinates. The transformation is established by iterative fitting over a substantial sample of stars and subsequent rejection of outlier values. The underlying assumption in this procedure is that CCD images of typical surveys have a large number of stars, and the overwhelming majority of them are non-variable above the noise level (see Pojmański 2003). We note in passing that other wide-field surveys use similar methods in their basic reduction pipelines.
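As an illustration of such an iterative fit, the following is a minimal sketch in Python/NumPy, not the HATNet pipeline code; the function name and the default clipping parameters are ours:

    import numpy as np

    def fit_frame_transformation(x, y, dmag, order=4, n_iter=5, clip=3.0):
        """Fit dmag (frame minus reference magnitudes) as a polynomial of
        the pixel coordinates (x, y), iteratively rejecting outliers such
        as intrinsic variables by sigma clipping."""
        # design matrix with all monomials x^p * y^q, p + q <= order
        terms = [x**p * y**q for p in range(order + 1)
                 for q in range(order + 1 - p)]
        A = np.column_stack(terms)
        keep = np.ones(len(dmag), dtype=bool)
        for _ in range(n_iter):
            coeff, *_ = np.linalg.lstsq(A[keep], dmag[keep], rcond=None)
            resid = dmag - A @ coeff
            keep = np.abs(resid) < clip * np.std(resid[keep])
        return A @ coeff  # the smooth correction to subtract per star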
Ideally, after applying the above-determined transformation, the resulting light curves should be free of systematic variations, and constant stars should be dominated only by noise. However, omission of some ‘hidden’ parameters, characteristic of the transformation, might lead to systematic spurious variations in the transformed magnitudes; sources with close values of these parameters will exhibit similar variations. An improper functional form of the transformation, or inclusion of variable stars in the fit might lead to similar systematics. Furthermore, because local effects might dominate the light curves (local in position or other parameters), variations remain even after application of the best large-scale (smooth) transformations. Such systematics are known to exist in various massive photometric surveys – see, for example, Drake & Cook (2004) for the Massive Compact Halo Objects (MACHO) and Kruszewski & Semeniuk (2003) for the Optical Gravitational Lensing Experiment (OGLE) projects – and even in classical CCD observations (Makidon et al. 2004), but in wide-field observations they could be especially severe (Alonso et al. 2003).
To give examples, a time-dependent PSF causes variable merging, where the flux contribution of a star to the flux of another star and its background annulus varies (assuming aperture photometry). The extent of the induced systematics will depend on the exact configuration of the neighbourhood of the source. Although image subtraction convolves the profiles of the reference frame before the subtraction so that they have (in principle) the same width as on the frame being analysed, we have found that small-amplitude correlations remain with the periodic variation of the original PSF widths. Periodic motion of stars across imperfectly corrected pixels, hot pixels or bad columns, or simply the gate structure of front-illuminated pixels, can be another cause of systematic variations. Faint stars that fall in the vertical vicinity of stars at the saturation limit might be periodically affected by the saturation of the bright source, e.g. with the 29-d period of the lunar cycle through the increased background, or by the nightly variation of extinction as the object rises and sets.
Here and throughout the paper, we employ the data base of the moderately crowded (b = 20°) G175 field observed by HATNet. Similar results were obtained with other fields. Observations of G175 were made during the fall and winter of 2003, spanning a compact interval of 202 d with ∼3000 data points per star in the I band. Nearly 10000 stars down to I = 12 were analysed with aperture photometry in the field. We employed a fourth-order polynomial function of the x and y coordinates to fit each frame's magnitude system to that of the reference frame. V-band data were also taken on selected nights to have complementary colour information on the sources, but this was not incorporated in the above fits. Observations of field G175 started or ended every night whenever the field crossed an elevation of 35°, and exposures were taken with 5-min resolution with no interruption by observations of another field. There is obviously a (close to sidereal) daily periodicity in the time-base, which has 4–8 h long sections and 16-h or longer gaps. Although the field contains nearly 10000 photometered stars, we present tests on only the 4293 stars, among the 5000 brightest, whose light curves contain at least 2800 points. These bright, intensively observed and accurately photometered stars are the most interesting ones for shallow transit searches. The faintest stars in this sample have I ∼ 11.0 mag.
After discrete Fourier transformation (DFT) of the light curves in the [0, 6] d−1 frequency range with 20000 frequency steps, the derived histogram of primary frequencies shows a distribution with high peaks around n d−1 frequencies, where n is a small integer. The histogram also shows that the 1- and 2-d−1 peaks have double structures, with slight offsets from the exact integer values (see Fig. 1). The double peak structure is attributed to the close-to-sidereal daily cadence of the observations and its alias. Other, higher-frequency peaks do not show double structure, only an offset, because they are further from their alias counterparts leaking in from the negative frequency regime. A very large fraction (83 per cent) of the peak frequencies fall in the n ± 0.02 d−1 intervals, with n = 0, 1, 2, 3, 4 and 5. Even excluding the n = 0 peak, to avoid counting long-period variables as stars showing real systematics, the occurrence rate of integer peak frequencies is still 69 per cent. Weighting the histogram according to the signal-to-noise (S/N) ratio of the frequency peaks does not significantly alter its appearance.
Figure 1
Distribution of the peak frequencies obtained from the DFT of a sample of stars of field G175 (see text for details on the data selection). Largest panel: distribution function in the full bandwidth of the analysis. The sum of the bin occupation number N per cent over the full frequency range is equal to 100. Small panels: blow-ups of the neighbourhood of the peaks. Each bin has a width of 0.002 d−1 in all panels.
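The peak-frequency statistic described above can be sketched as follows (a Deeming-type periodogram for uneven sampling; plain NumPy, our own function name, with the grid parameters taken from the text):

    import numpy as np

    def peak_frequency(t, m, fmax=6.0, nfreq=20000):
        """Frequency (d^-1) of the highest DFT amplitude for a light
        curve sampled at times t (days) with magnitudes m."""
        freqs = np.linspace(fmax / nfreq, fmax, nfreq)
        m0 = m - m.mean()
        power = [(np.sum(m0 * np.cos(2 * np.pi * f * t)))**2 +
                 (np.sum(m0 * np.sin(2 * np.pi * f * t)))**2 for f in freqs]
        return freqs[int(np.argmax(power))]

    # fraction of peaks within +/-0.02 d^-1 of an integer frequency:
    # peaks = np.array([peak_frequency(t, m) for t, m in light_curves])
    # near_int = np.mean(np.abs(peaks - np.round(peaks)) < 0.02)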
The systematic variations are of small amplitude, typically ∼0.01 mag. To obtain a better measure of this quantity, we derived folded and phase-binned light curves (with 100 bins), rejected the lowest and highest five points of the phased curve, and determined the amplitude. For instance, the median amplitude of the 1-d−1 stars is 0.015 mag, with 85 per cent of them falling below 0.03 mag. For illustration, in Fig. 2 we exhibit examples of daily trends for a few selected stars in our field.
Figure 2
Examples of daily trends in field G175. The numbers on the top of each panel show the star's identification (GSC number) and the mean value of the instrumental magnitude. The deviations from this mean value are plotted. Phases are computed with 1-d period and arbitrary epochs.
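The amplitude measure quoted above may be sketched as follows (our own minimal version; the handling of empty bins and the exact treatment of the rejected extreme points are our assumptions):

    import numpy as np

    def trend_amplitude(t, m, freq=1.0, nbins=100, nreject=5):
        """Fold at `freq`, average in phase bins, drop the `nreject`
        lowest and highest binned points, and return the range."""
        phase = (t * freq) % 1.0
        idx = np.minimum((phase * nbins).astype(int), nbins - 1)
        binned = np.array([m[idx == k].mean()
                           for k in range(nbins) if np.any(idx == k)])
        binned.sort()
        return binned[-nreject - 1] - binned[nreject]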
We searched the observational parameters of HAT for commensurate periodicities, the effect of which might become superposed on the data and cause the above-mentioned systematics. The trivial daily hour-angle or zenith-distance variation, and the fact that we employ a simple fourth-order polynomial transformation that neglects colour dependence, suggested that colour-dependent differential extinction should also cause a daily periodicity: redder or bluer than average stars should deviate from the others as the field rises or sets. Probably because our data are dominated by other systematics, we found no sign of this when plotting the distribution of the higher-amplitude n-d−1 stars on a V–I versus V colour–magnitude diagram (see Fig. 3). However, we noticed that the PSF also varies with a frequency close to 1 d−1, which should result in variable merging. Indeed, as shown in Fig. 4, the most prominent systematic variations are exhibited by stars that have close-by merging or almost saturated companions. On the other hand, there remain a few perplexing stars that seem equally merged, but show no strong n-d−1 variations. Early image subtraction results on another field show a slightly smaller occurrence of integer cycle-per-day systematics as compared to aperture photometry, which suggests that merging indeed plays a central role. Using more accurate crowded-field photometry, which takes into account the variable stellar profiles, the extent of the systematics is somewhat suppressed. There is also a sign of a slight increase in the occurrence of systematics in the proximity of bad columns. In conclusion, the degree and the appearance of the systematics depend on many factors (position, colour, brightness, merging), among which merging is probably the most important in our case, but not the only one.
Figure 3
Colour–magnitude diagram of field G175 in instrumental V_–_I versus V system (small dots) with stars exhibiting strong 1-d−1 (filled boxes) or 2-d−1 (open circles) systematics overplotted.
Figure 4
Image section from field G175 with dominant systematic stars marked by circles (1 d−1) and boxes (2 d−1). North is up, east is left, and the image size is ≈ 68 × 49 arcmin2. White vertical lines are due to bad CCD columns. Most of the 1-d−1 stars are merging with neighbours. The bright star marked as ‘A’ is at the saturation limit, and might induce the systematic variations seen in the stars south of it. The open cluster NGC 2281 (marked as ‘B’) is moderately crowded at our 14-arcsec pixel scale, and the merging contributes to the 1- or 2-d−1 systematics of some of its stars. Nevertheless, there are unexplained cases where isolated stars also show systematic variations, such as ‘C’ or ‘D’.
3 Trend Filtering Algorithm
In this section we give a description of the algorithm in two steps. In Section 3.1 we outline the method and highlight the basic problems to be dealt with, whereas in Section 3.2 we give more technical details.
3.1 Preliminaries
First of all, it is important to mention that the simplest and seemingly the most general approach to trend filtering would be the application of high-pass filtering to each daily track of observation. In this case, with a low-order polynomial or spline fit, we would filter out any variation occurring on a daily time-base, presuming that the order of the polynomial (or that of the spline) is properly set. Although we experimented with this approach at the beginning of our tests (a sketch of this baseline is given below), we did not find it satisfactory for several reasons: (i) although the daily trends may be the strongest ones, other time-scales play a role too; (ii) it is unclear what part of the variation comes from the trend and what part from the true change in the object; (iii) short tracks of observations may not be treated well in this way at all, because of the increasing chance of mixing intrinsic and systematic variations.
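For reference, the rejected baseline approach could be written as follows (a minimal sketch; the track-splitting gap of 0.5 d and the polynomial order are our assumptions):

    import numpy as np

    def detrend_nightly(t, m, order=2, gap=0.5):
        """Subtract a low-order polynomial from each nightly track;
        tracks are split where consecutive times differ by > `gap` d.
        Tracks too short to fit are left unchanged."""
        out = m.copy()
        edges = np.where(np.diff(t) > gap)[0] + 1
        for track in np.split(np.arange(len(t)), edges):
            if len(track) > order + 1:
                c = np.polyfit(t[track], m[track], order)
                out[track] = m[track] - np.polyval(c, t[track])
        return out

Note that such a filter removes intrinsic variability on nightly time-scales together with the trends, which is precisely objection (ii) above.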
The idea of the TFA is based on the observation that in a large photometric data base there are many stars with similar systematic effects (this is why they are called ‘systematics’). As we have discussed in the previous section, various systematics may be present in the individual objects. We assume that the sample size is large enough to allow us to select a representative set for all the possible systematics. Once this subsample (‘template set’) is selected, one may try to build up a filter function from these light curves and subtract systematics from the other, non-template light curves.
The first question is how to choose the template set. Because of the lack of a priori information on the type of systematics influencing our target, selection of the template set follows only some very broad guidelines; it is basically a random sample drawn from the available data base. More details on the template selection are given in Section 3.2.
The second question that must be addressed is how to choose the filter function. Because we have no a priori knowledge on the functional form of the systematics, we take the simplest form of it, i.e. the ‘linear combination’ of the template light curves.
The third question is how to choose the weights of the template light curves in the filter function. If we assume that the variation in all light curves is caused only by systematics, a natural choice is a simple least-squares criterion for the residuals (observed minus filter-predicted). Because – as we mentioned in Section 2 – most of the stars are non-variable, the criterion of minimum variance seems to be a good one for most of the observed stars.
The fourth question is what distortion the TFA process causes in the signal being sought. It is clear that the signal will suffer from some distortion that can be very severe if the time-scale and phase of its variation are close enough to those of some of the template time-series.
In view of the above considerations, TFA is designed to tackle two types of problems:
(i) increase the detection probability by filtering out trends from trend- and noise-dominated signals; and
(ii) restore the signal form by filtering out trends iteratively from periodic signals, assuming that the period is known.
If the trends are sufficiently small (as compared to the signal amplitude), TFA processing is not necessary for the signal search (i.e. period analysis, application i), but (ii) can still be a useful utility if the signal turns out to be periodic.
The fifth question is whether the method generates unwanted signal components by the chance inclusion of variables in the template set. In classical variable star observations, special care was taken to avoid variables among the comparison stars. This was necessary because variable star magnitudes were obtained by simply subtracting the directly observed magnitudes of the variable star from those of the comparison stars. In the case of the TFA, the situation is very different. Here we create a linear combination of all template light curves that fits the target light curve in the best way. This approach tends to fit features that appear in similar forms and with the proper timing, both in the target and in one or more of the template light curves. Because this is unlikely to happen for non-systematic variations, variables in the template set are not expected to influence the TFA filtering. More specifically, templates containing pure sinusoidal signals will not generate detectable artificial signals in a target of pure white noise (see Appendix A). Of course, this result, referring to the statistical average of the signal detection parameter, does not exclude rare false signal detections due to statistical fluctuations.
3.2 Mathematical formulation
Let us assume that all time-series are sampled at the same moments and contain the same number of data points N. Let our filter be assembled from a subset of M time-series (this is the template set; the method of selection will be discussed later). The filter {F(i); i = 1, 2, …, N} is built up from the following linear combination of the template time-series {X_j(i); i = 1, 2, …, N; j = 1, 2, …, M}:
F(i) = Σ_{j=1}^{M} c_j X_j(i).   (1)
It is assumed that the template time-series are zero-averaged (i.e. Σ_{i=1}^{N} X_j(i) = 0 for all j). The coefficients {c_j} are determined through the minimization of the following expression:
D = Σ_{i=1}^{N} [Y(i) − A(i) − F(i)]².   (2)
Here, {Y(i); i = 1, 2, … , N} denotes the target time-series being filtered, and {A(i); i = 1, 2, … , N} denotes the current best estimate of the detrended light curve, defined in the following way.
In application (i) (‘frequency analysis’), we start from the assumption that the time-series is dominated by systematics and noise, and we have no a priori knowledge of any real (periodic or aperiodic) signal in the light curve. Consequently, in this case we set {A(i)} equal to the average of the target time-series, i.e. A(i) = ⟨Y⟩ for all i. As we will see in the following sections, this simple choice of {A(i)} works very well, except for the rare case when the signal is similar to some of the templates (this usually occurs for signals with long periods, comparable to the length of the total time-span).
In application (ii) (‘signal reconstruction’), we have already established (from previous analysis of the data) that the time-series contains a periodic signal. The phase-folded time-series can thus be used iteratively to estimate {A(i)}. First, an initial filter is constructed using A(i) = ⟨Y⟩ as above. The filtered time-series is then phase-folded and binned, then remapped to the original time-base to give a new estimate of {A(i)}, which is in turn fed into equation (2) to compute a new set of filter coefficients {c_j}. The new filter leads to a better determination of {A(i)}, and the iteration continues until some convergence criterion is satisfied. Further details and examples of signal reconstruction are given in Section 6.
Note that the unbiased estimate of the variance of noise of the filtered data can be obtained from the following equation:
σ² = Σ_{i=1}^{N} [Y(i) − A(i) − F(i)]² / (N − M − 1).   (3)
At a more technical level, the main steps of the TFA are the following.
1. Select M template time-series in the full field, distributed nearly uniformly over the field (presumably ensuring uniform sampling also in other parameters, e.g. in colour). Because we have no a priori knowledge of which stars are bona fide variables, the above selection is almost random, except that stars with a low number of data points, low brightness or high standard deviation are not selected. Although the result is not sensitive to the actual values of the above limits, it is still better to employ them, in order to avoid any (however small) chance of biasing the target light curve. Nevertheless, for long-period target variables there is a higher chance of finding a template member varying on a similar time-scale with a similar phase. Once the template set is selected, it is fixed throughout the analysis.

2. Define the time-base to be used by the filter and target time-series. Because in modern automated surveys nearly all photometered stars have the same number of data points distributed over the same moments of time, selection of the uniform time-base is made on a subsample of the template light curves (containing ∼50 stars in our case). We select the time-base from the template light curve that contains the largest number of data points. Occasionally, data points at some moments defined by the template time-base might be missing in the observed light curve. In these cases they are filled in with the average value of that light curve.

3. Compute zero-averaged template time-series by using some criterion for outlier selection (in our case a 5σ clipping).

4. Compute the normal matrix from the above template time-series,

g_{j,k} = Σ_{i=1}^{N} X_j(i) X_k(i),   (4)

and compute its inverse {G_{j,k}}.

5. For each light curve, compute the scalar products of the target and template time-series:

h_j = Σ_{i=1}^{N} [Y(i) − A(i)] X_j(i).   (5)

Here, the modified target time-series Y(i) − A(i) is also assumed to be free of outliers.

6. Compute the solution for {c_j} and apply the correction accordingly:

c_j = Σ_{k=1}^{M} G_{j,k} h_k.   (6)

The corrected time-series is computed from

Ŷ(i) = Y(i) − Σ_{j=1}^{M} c_j X_j(i).   (7)
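The algebra of steps 4–6 is ordinary linear least squares and can be condensed into a few lines (a sketch in NumPy with our own names, assuming the templates are already outlier-cleaned, zero-averaged and mapped to the common time-base):

    import numpy as np

    def tfa_setup(templates):
        """templates: (M, N) array of zero-averaged template light curves
        on the common time-base. Returns the inverse normal matrix."""
        g = templates @ templates.T       # equation (4)
        return np.linalg.inv(g)

    def tfa_filter(y, a, templates, G):
        """One TFA correction of target y, given the current estimate a
        of the detrended signal (a constant for frequency analysis)."""
        h = templates @ (y - a)           # equation (5)
        c = G @ h                         # equation (6)
        return y - c @ templates          # equation (7)

In frequency-analysis mode one simply calls tfa_filter(y, y.mean(), templates, G).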
It is important to make the following comments concerning additional details of the computational implementation.
A significant advantage of the above algorithm is that there is no need to compute the normal matrix for each target separately. This means only the computation of simple array–array and matrix–array products (steps 5 and 6 above). For period search this is done only once for each star, but for signal reconstruction this is repeated several times (which is still not too time-consuming).
Because the template set is fixed, in principle the normal matrix need only be computed once. In practice, if the target accidentally coincides with one of the template components, the latter must be eliminated from the normal matrix before computing the filter. This requires reshuffling and inverting the normal matrix; however, it should happen fairly rarely, because of the relatively low number of template time-series (∼10²–10³) compared to the size of the data base (∼10⁴ stars). The CPU cost of the initial set-up of a filter with a few hundred templates is only a few times that of a single run of the BLS algorithm for transit search. Nevertheless, for large template numbers the extra matrix manipulations increase the execution time substantially. For example, on a 3-GHz commercial PC a full BLS analysis of a set of 13000 time-series with 2300 data points per time-series, 30000 frequency steps per time-series and 920 TFA templates required 100 CPU hours. The cost of the non-TFA analysis of the same data set was only one-fifth of this.
The computer memory required by TFA can be quite large. This is mostly because of the template set, because it requires the usage of N_×_M floating point array elements. Inversion of the normal matrix may also pose some numerical problems if M is too large, although we have not encountered such problems even for very large M, such as 800. This stability is probably due to the basically uncorrelated noise of the light curves.
We note that a possible way to decrease the number of templates is to perform a singular value decomposition (SVD; see Press et al. 1992) and employ only the ‘significant’ eigenvectors derived from this decomposition. Although we experimented with SVD at some point, we decided that a clear-cut division of the eigenvectors into ‘significant’ and ‘non-significant’ groups was not possible. Therefore, the original ‘brute force’, full template set approach was followed.
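The SVD variant we experimented with would look roughly like the following (a sketch; the truncation level `keep` is exactly the free parameter that we found impossible to fix in a clear-cut way):

    import numpy as np

    def svd_templates(templates, keep=50):
        """Replace the (M, N) template set by its first `keep`
        orthonormal 'eigen-light-curves'."""
        _, s, vt = np.linalg.svd(templates, full_matrices=False)
        return vt[:keep]

A side benefit of orthonormal templates is that the normal matrix of equation (4) becomes the identity, so equation (6) reduces to c_j = h_j.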
There is also the question of whether magnitude or intensity (flux) values should be used in the above procedure. We experimented with both quantities and finally settled on the direct magnitude values, because no definite advantage was observed to result from a conversion to fluxes.
4 Trend Suppression Tests
The simplest way to check the most straightforward effect of the TFA is to compare the standard deviations of the original and TFA-processed time-series. This comparison, of course, will not tell us if all temporal distortions due to trends have been successfully filtered out or not. Nevertheless, it will at least give us some indication of the efficiency of the method.
The effect of decreasing the standard deviation is displayed in Fig. 5. Unbiased estimates of the standard deviations σ for the TFA runs were computed with the aid of equation (3), using a template number of M = 361. (Originally we set M = 400, but the constraints on the template time-series – see Section 3.2 – lowered this value. A similar statement applies to the other template numbers used in this paper.) We see that with increasing standard deviation σ(non-TFA) of the original time-series, the TFA is likely to introduce significant corrections. As expected, this means that toward larger σ(non-TFA) (i.e. at fainter magnitudes) there are time-series that are more affected by systematic errors than most of the brighter stars. There are relatively sharp upper and lower boundaries in the decrease of the standard deviation. The existence of the upper boundary at small corrections suggests that all stars are affected by some (however small) systematic errors and that the TFA is capable of filtering out a substantial part of them. Although the TFA also has some side effects on any real signal in the time-series, it suppresses the systematics much more effectively than the signal, as shown by the fact that it leads to a significant increase in detection probability (see Section 5).
Figure 5
Decrease of the unbiased estimate of the standard deviation of the time-series due to the application of the TFA. The brightest 4293 stars with greater than 2800 data points per star of field G175 are used. The TFA result was obtained with 361 templates. For better visibility of the minimum decrease in σ, we plotted a horizontal line to indicate the zero correction level.
In order to examine the question of whether the decrease of the standard deviation has also led to the diminishing of the temporal signatures of the systematics, we frequency analysed the original and TFA-processed light curves. The BLS routine was run on both data sets. The analysis was performed in the [0.01, 0.99] d−1 band, which covers the periods of interest in transit search but excludes very long periods close to the total time-span. We recall that the BLS routine generates several aliases (subharmonics) from the daily trends at frequencies of small integer ratios (see K02). This effect results in the appearance of several aliases in the frequency band chosen for this test. Therefore, the success of the TFA in filtering out systematic periodicities can be judged from the above analysis.
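For readers who wish to reproduce this type of comparison, a public BLS implementation can be substituted for the original Fortran code of K02 (here astropy's BoxLeastSquares, purely for illustration; the trial transit durations are our assumption):

    import numpy as np
    from astropy.timeseries import BoxLeastSquares

    def bls_peak(t, m, fmin=0.01, fmax=0.99, nfreq=30000):
        """Frequency (d^-1) and power of the strongest BLS peak
        in the [fmin, fmax] band (t in days)."""
        periods = 1.0 / np.linspace(fmin, fmax, nfreq)
        durations = np.array([0.05, 0.1, 0.2])   # trial durations (d)
        res = BoxLeastSquares(t, m).power(periods, durations)
        k = int(np.argmax(res.power))
        return 1.0 / periods[k], res.power[k]

    # f_raw, p_raw = bls_peak(t, y)           # original light curve
    # f_tfa, p_tfa = bls_peak(t, y_filtered)  # after TFA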
The distribution function of the frequencies of the highest peaks obtained on the original, unfiltered data set is shown in the upper panel of Fig. 6. The large number of stars affected by the daily trends (that appears here mostly at ∼0.5 d−1) is surprising. Actually, it turns out that 24 per cent of the highest peaks fall in the [0.49, 0.51] d−1 frequency range and about 22 per cent of them appear in the ±0.01 d−1 neighbourhood of 0.02, 0.33 and 0.66 d−1. It is clear that a simple direct use of the data for transit search would leave a quite substantial part of the sample not utilized because of the domination of trends.
Figure 6
Comparison of the distribution functions of the peak frequencies obtained in the BLS analysis with and without TFA processing (lower and upper panels, respectively). Each bin has a width of 0.002 d−1. The sum of the bin occupation number N per cent over the full frequency range is equal to 100. The same number of data points and templates are used as in Fig. 5.
Next, the above analysis is repeated by applying the TFA. The result is displayed in the lower panel of Fig. 6. We see that severe aliases due to the daily trends disappeared. Now, as can be expected for a sample with time-series of mostly pure noise, all frequency bins are nearly uniformly occupied. The remaining fluctuations are all of statistical origin, due to the low number of events falling in the narrow bins.
We note in passing that we performed the above test also on the peak frequencies based on DFT spectra. Although at low or moderate template numbers we still obtained some remnants under the 1 per cent level in the frequency distribution around integer frequencies, at a high number of templates we basically obtained a flat distribution at the 0.3 per cent level, very similar to the one we obtained for the BLS spectra. We recall that 69 per cent of the DFT peak frequencies obtained by the non-TFA DFT analysis fell in the ±0.02 d−1 neighbourhood of positive integer frequencies.
5 Transit Detection Tests
We examine the periodic transit detection capabilities of the BLS method on the TFA-processed light curves. This is done with the aid of various test signals generated from the observed time-series and synthetic signals. To each of the 4293 observed, zero-averaged light curves we add the transit signals given in Table 1. Although most of the tests presented in this paper refer to signal 1, the other test signals were also utilized in order to check the efficiency of the method for various signal parameters. Most of these parameters correspond to realistic transit configurations, except perhaps for signal 5, where the relative duration of the transit is slightly too long.
Table 1
Parameters of the synthetic signals. P is the period; δ is the depth of the transit; ϕ is the phase of the transit, defined as [t_in − t(1)]/P, where t_in is the moment of the first ingress after t(1), the time of the first item of the time-series; and q is the relative length of the transit, defined as the ratio of the actual transit length to the period.
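A box-shaped test signal with these parameters can be injected as follows (a sketch; apart from the frequency of signal 1, 0.1952 d−1, quoted below, the default depth and relative duration here are placeholders, since the table values are not reproduced):

    import numpy as np

    def add_transit(t, m, P=1.0 / 0.1952, delta=0.01, phi=0.0, q=0.05):
        """Inject a box transit of period P (d), depth delta (mag),
        phase phi and relative duration q into magnitudes m."""
        phase = ((t - t[0]) / P - phi) % 1.0
        return m + np.where(phase < q, delta, 0.0)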
In order to illustrate the signal detection capability after applying the TFA, in Fig. 7 we exhibit two specifically chosen examples for the extreme improvement one can obtain for trend-dominated signals. It is obvious that the TFA is capable of suppressing a considerable part of the trend without a simultaneous suppression of the periodic signal component. Although the signal also suffers from some amount of suppression, in a very large number of cases this is smaller than the one exerted on the trends. As we will see below, this ultimately leads to significantly higher detection rates for TFA-processed signals.
Figure 7
Examples of the signal detection capability of the TFA. In both cases, test signal 1 was added to the observed light curves of the stars given by the identification numbers in the headers. Left column: BLS spectra of the non-TFA (original) test light curves. Right column: BLS spectra of the TFA test light curves. Template numbers are shown in the headers. The spectra are normalized to unity in each panel separately.
In order to compare the detection rates obtained from the original and TFA-processed light curves, we need to define the meaning of the word ‘detection’. Based on the frequency spectrum, we employ the following two criteria to regard a peak frequency as being identical with the injected test frequency.
(i) The highest peak in the BLS spectrum should have a frequency in the [f_test − 0.001, f_test + 0.001] d−1 interval, where f_test is the frequency of the test signal; in the present case of signal 1, it is equal to 0.1952 d−1.
(ii) The S/N ratio of the highest peak in the frequency spectrum must be greater than 6. The definition of the S/N ratio is the following: S/N = [p(highest peak) − ⟨p⟩]/σ(p). Here, p denotes the BLS statistic (denoted by SR in K02), ⟨p⟩ is the average power in the frequency band analysed, and σ(p) is the standard deviation of the spectrum (we omit large peaks in the computation of ⟨p⟩ and σ(p)).
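Criterion (ii) can be sketched as follows (the exact rule for omitting the large peaks from ⟨p⟩ and σ(p) is not specified above; the 5 per cent trim used here is our assumption):

    import numpy as np

    def spectrum_sn(power, trim=0.05):
        """S/N of the highest peak: (p_max - <p>) / sigma(p), with the
        top `trim` fraction of values excluded from <p> and sigma(p)."""
        cut = np.sort(power)[int((1.0 - trim) * len(power))]
        base = power[power < cut]
        return (power.max() - base.mean()) / base.std()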
The above lower limit of the S/N ratio is somewhat arbitrary and it is a compromise between avoiding too high false alarm rates but not losing too many positive peak identifications. At the end of this section, we discuss the effect of posing a more stringent condition on the S/N ratio.
For the characterization of the detection probability, we introduce the cumulative detection probability (CDP), which is defined in the following way:
CDP(I) = N_d(m < I) / N_nv(m < I).   (8)
Here, N_d(m < I) denotes the number of detections for test signals brighter than magnitude I and N_nv(m < I) denotes the number of non-variable stars in this magnitude range. Because some 5–10 per cent of the stars in the sample are true variables, which will either dominate the signal or decrease the S/N ratio of the test signal below the detection limit, we conservatively take N_nv(m < I) = 0.95 N_t(m < I), where N_t(m < I) is the total number of stars brighter than magnitude I. Because in this test all stars have injected transit signals, the detection rates to be derived have nothing to do with the true incidence rates of transits in our sample or in stars in general. The sole purpose of this test is to find out by how much we increase the chance of detection by applying the TFA.
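A sketch of equation (8) (our own function name; `detected` is a boolean flag per star encoding the two criteria above):

    import numpy as np

    def cdp(mags, detected, mag_limit):
        """Cumulative detection probability, equation (8), assuming
        5 per cent of the sample are true variables."""
        bright = mags < mag_limit
        return (np.count_nonzero(detected & bright)
                / (0.95 * np.count_nonzero(bright)))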
The derived CDPs on the sample of 4293 stars are shown in Fig. 8. The jagged variation of the curves for bright stars is attributed to the relatively small sample size for these stars.
Figure 8
CDP for test signal 1 in the presence of observational noise. All cases considered have S/N > 6. The TFA result was obtained with 595 templates.
It is clear from this figure that the TFA introduces a substantial improvement in our detection capability. The high and almost constant detection rate down to ∼10.2 mag indicates the steady performance of the TFA. The situation changes at fainter magnitudes, when the random photometric noise becomes higher. Coupled with the not too favourable combination of period, phase and data distribution for signal 1, this results in a definite decrease in CDP at fainter magnitudes. The decrease of the detection probability is more visible if we check narrow magnitude intervals. For example, for test signal 1, the detection probability decreases from 0.45 to 0.08 for the raw data when moving from the ±0.1 range of 10.0 mag to the same range of 11.0 mag. The same probabilities for the TFA-processed light curves are 0.87 and 0.27, respectively. For short-period signals, the detection capability (for both the original and TFA-processed time-series) is greater than for long-period signals, as shown at the end of this section.
It is also important to characterize the change in the S/N ratio when we lift the S/N > 6 constraint. Then, by keeping only the frequency constraint given above, and plotting the S/N ratio values for the variables satisfying this single constraint (in both the TFA-processed and original data sets), we obtain the plot displayed in Fig. 9. Again, the improvement introduced by the TFA is obvious. Many transits, with S/N < 6 in the original time-series, become confidently ‘discovered’ after applying the TFA. However, it is also seen that there are cases when the original time-series have higher S/N ratio. Although this feature is expected to be rare, it is not too surprising. While the TFA suppresses mostly those features in the target light curve that can also be found in the template set, it may happen that, during this procedure, features (i.e. intrinsic variability) not representative of the trends also become suppressed. This effect of the TFA is not too harmful at higher S/N ratio values (e.g. for S/N > 8), because the signal will be securely detected in both data sets, albeit with a somewhat smaller significance for the TFA-processed time-series. The situation becomes worse at lower S/N ratio, when the signal may become suppressed under the detection limit.
Figure 9
Comparison of the S/N ratios obtained for test signal 1 with BLS spectra satisfying only the frequency constraint for detectability. Shading is proportional to the number of cases (darker regions contain a larger number of cases). The TFA result was obtained with 595 templates.
In order to obtain a more quantitative insight into this phenomenon, we checked the number of cases when the signal was detected in the original time-series, but escaped detection in the TFA-processed time-series. By performing this and the opposite check with two different template numbers and detection limits for various test signals, we obtained the results shown in Table 2. It is seen that the number of such exclusive detections is always much higher among the TFA-processed light curves. In addition, the application of higher template numbers increases the number of cases when only the TFA is capable of detecting the signal. Nevertheless, it seems that there is almost always a relatively small fraction of stars that escape detection in the TFA, because of the unwanted side effect of trend suppression.
Table 2
Mutually exclusive detections for various test signals. N^i_j is the number of exclusive detections obtained with S/N > i and template number j. Upper rows: not detected using TFA-processed data, but detected using raw data. Lower rows: detected using TFA-processed data, but not detected using raw data.
It is interesting to compare the effect of the various signal parameters on the detection probability. As before, here we limit ourselves to the test signals given in Table 1. For each test signal we check the CDP obtained at two magnitudes and at the two S/N ratio levels entering the detection condition discussed above. Changing the S/N ratio level illustrates the expected decrease of the CDP when one aims at detections with high confidence and a practically zero false alarm rate. The derived CDP values are given in Table 3. As expected, the TFA always yields a higher CDP, independent of the signal. The gain is of course smaller for cases of high S/N ratio, such as for signal 2. It is noticeable that single-site observations inevitably lead to a bias toward discovering short-period transits (see also Brown 2003). For example, test signals 2 and 5 have the same parameters except for their periods. This results in a dramatic difference in the CDP. In terms of the S/N ratio of the frequency spectra, the S/N ratio rarely reaches 8 even when the TFA is employed for signal 5, whereas an overwhelming majority of the spectra of the original (non-TFA) data for signal 2 have S/N > 8 and quite often above 10 (reaching a maximum value of 25).
Table 3
CDP for various test signals. CDP^i_j is the CDP at magnitude j for stars with S/N > i. Upper rows: non-TFA results. Lower rows: TFA results with 820 templates.
6 Signal Reconstruction
In Section 3 we emphasized that when using the TFA for signal detection, we assume that the signal is trend- and noise-dominated; therefore, the use of a constant detrended signal {A(i) = const} seems to be justified. Nevertheless, the original signal will suffer from some level of distortion, because of the requirement of minimum variance for a signal that is assumed to be constant. It is clear that for non-periodic signals of arbitrary shape, sampled unevenly and gapped in time, it is very difficult to make any reasonable initial guess of the original signal form. Therefore, the reconstruction method developed in this section can be applied only to periodic time-series.
The main step of the algorithm is the iterative approximation of {A(i)}. This has already been described in Section 3. It is important to find a good method to estimate {A(i)} at each step of iteration. Here we employ the simple method of bin averaging to derive the updated set of {A(i)} at each step of iteration. Of course, the reliable estimate of {A(i)} requires the use of a proper number of bins in order both to allow an ample sampling of the light curve and to ensure statistical stability due to observational noise. Although in our case of ∼3000 data points per light curve even 100 bins give a reasonable noise averaging, usually a lower number of bins (e.g. 50) is also acceptable, because of the smooth shapes of most of the light curves.
Once the signal shape is more accurately determined by the above general method, one may proceed further and utilize this information to derive a more specific model for the signal shape. For example, if the signal turns out to be box-shaped, one can use the BLS routine to obtain a more accurate approximation for {A(i)}. However, it is emphasized that this second level of filtering can be used only if the first, more general method firmly supports our assumption on the signal form. Otherwise, the output will be biased through our incorrect assumption on the signal shape.
Iteration on the folded light curve continues until the relative difference between the standard deviations of the residuals (see equations 2 and 3) in successive iterations falls below a certain limit. We set this limit at 10−3.
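The whole reconstruction loop can be sketched as follows (building on tfa_filter from the sketch in Section 3.2; the handling of empty phase bins is our assumption):

    import numpy as np

    def tfa_reconstruct(t, y, period, templates, G, nbins=100, tol=1e-3):
        """Iterative TFA reconstruction for a known period: alternate
        the filter of equation (7) with a phase-binned estimate of the
        detrended signal A(i)."""
        a = np.full_like(y, y.mean())
        prev = np.inf
        while True:
            yf = tfa_filter(y, a, templates, G)
            idx = np.minimum((((t / period) % 1.0) * nbins).astype(int),
                             nbins - 1)
            means = np.array([yf[idx == k].mean() if np.any(idx == k)
                              else yf.mean() for k in range(nbins)])
            a = means[idx]                   # remap to the time-base
            sigma = np.std(yf - a)           # residual scatter
            if abs(prev - sigma) < tol * sigma:
                return a, yf
            prev = sigma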
For the illustration of the efficiency of the method, first we generate a sinusoidal test signal in the same way as we did in the case of the transit signals in Section 5 (i.e. by injecting the synthetic signal in the real light curves). The period of the injected signal is the same as that of test signal 1 (see Table 1), whereas its amplitude is 0.015 mag. We use 100 bins for the estimation of the detrended light curve {A(i)} during each step of iteration. In Fig. 10 we show two examples of the substantial improvement obtained after TFA filtering. As we see, the TFA filtering returned the original signal form without suppressing the amplitude or modifying the phase.
Figure 10
Reconstruction of a sinusoidal test signal with the aid of iterative TFA filtering. Left column: original folded signal. Right column: TFA-filtered signal with the synthetic signal shown by a solid line. Identifiers (GSC numbers) of the stars employed in the generation of the test signals and the corresponding periods are given in the headers of the panels on the left. We used 595 template light curves and bin averaging in the TFA filtering.
The second example in Fig. 11 shows the result for our standard box-shaped transit signal (signal 1; see Table 1). Again, it is seen that the signal can be completely scrambled by systematics, yet the TFA is able to reconstruct the original signal shape (and also to find the correct period).
Figure 11
Reconstruction of our standard box-shaped test signal 1 (see Table 1) with the aid of iterative TFA filtering. The notation, the method of reconstruction and the TFA parameters used are the same as in Fig. 10.
In order to quantify the accuracy of the estimated parameters when TFA signal reconstruction is used, we compare the derived transit parameters with those obtained in the course of the BLS period search (i.e. without iterative TFA correction of the transit shape). In Table 4 we show the averages and standard deviations of the three parameters (computed only from the significant detections, as defined in Section 5). We see that both the phase and the width are estimated with the same accuracy in both cases, with no apparent systematic errors. However, the depth is substantially underestimated if the TFA is applied without iterative signal reconstruction.
Table 4
Estimated parameters of test signal 1. Transit parameters are described in Table 1. Only significant detections (as defined in Section 5) have been considered. At each parameter the upper and lower rows refer, respectively, to the values obtained without and with iterative TFA filtering. σ denotes the standard deviations of the detected signal parameters.
Finally, in Fig. 12 we show two real examples for signal reconstruction from the G175 data base. As in all of our examples presented in this section, we used bin averaging applicable for arbitrary-shaped signals. Although these examples are not representative of the overall effect of the TFA on the other variables in the data base, they clearly show the size of improvement one may obtain in cases heavily corrupted by systematics.
Figure 12
Examples of the reconstruction of the light curves of observed intrinsic variables with the aid of iterative TFA filtering. The notation, the method of reconstruction and the TFA parameters used are the same as in Fig. 10.
7 Conclusions
Current studies indicate that the incidence rate of hot Jupiters is much lower than was believed a few years ago (Brown 2003). Although we do not yet have good observational constraints for bright Galactic field stars, based on the OGLE survey of faint Galactic bulge stars and the very low number of positive cases obtained by spectroscopic follow-ups (Alonso et al. 2004; Pont et al. 2004, and references therein), the predicted low incidence rate could be close to reality. Therefore, it is of prime importance to develop effective methods both for data reduction and for subsequent data analysis. The goal to be reached by photometric surveys targeting extrasolar planets is twofold: (i) to reach photometric accuracy close to the photon noise; (ii) to minimize systematic effects that lead to coloured noise and therefore seriously jeopardize the discovery rate. Although current sophisticated data reduction techniques (e.g. differential image analysis by Alard & Lupton 1998 and Alard 2000) may help to reach the near photon-noise limit and minimize systematic effects, the latter never seem to be eliminated completely. Systematic effects, usually appearing on a daily time-base, are present even in observations that cover a small area of the sky per frame (MACHO, OGLE). Wide-field surveys have severe additional disadvantages (e.g. narrow PSFs, strong non-linearities in the magnitude/position transformations, etc.) that further amplify the systematic effects and thereby make subsequent data analysis more difficult.
Because the appearance of systematics seems to be generic, it is important to devise a method that is capable of filtering out systematics from the time-series photometry in a post-reduction phase. The reason why such an approach may work is that in standard data reduction pipelines the photometry is performed through image processing of snapshots. Therefore, these types of data reduction methods are unable to consider the time-series properties of the data base. The TFA described in this paper takes into consideration the temporal behaviour of the data by constructing a linear filter from a template set of time-series chosen from the data base to be analysed. Our TFA is capable of handling two problems:
(i) create optimally filtered time-series for frequency analysis; and
(ii) reconstruct periodic signals scrambled by systematics.
In both cases the optimization of the filter is made in the standard least-squares sense. In application (i) the signal is assumed to be trend- and noise-dominated, whereas in (ii) this assumption is not made and the filtered signal is built up iteratively by applying the case (i) assumption to the residual (observed minus cycle-averaged) time-series.
Tests made on the data base of the HATNet project have shown that by the application of the TFA the signal detection rate (the CDP) increases by up to 0.4 for stars brighter than 11 mag. The improvement in the S/N ratio is often spectacular, leading to secure discoveries from signals originally completely hidden in the systematics. Once the period is found, the signal can be reconstructed by applying the TFA iteratively, as mentioned above.
The results presented in this paper indicate that there is a steady increase in CDP and S/N ratio even for template numbers above 500. This suggests that systematics show up in many different forms and one needs to consider as many of them as possible when targeting the detection of weak signal components. Limitations to very high template numbers are set only by the size of the sample, the number of the data points, statistical and numerical stability and computational power.
Acknowledgments
A part of this work was carried out during GK's stay at CfA. He is grateful for the support and hospitality provided by the staff of the Institute. The thorough review of the paper by the referee is greatly appreciated. We thank Kris Stanek for the useful discussions. The operation of HATNet at the Fred Lawrence Whipple Observatory, Arizona, and the Submillimeter Array, Hawaii has been supported by the CfA. We are grateful to Carl Akerlof and the Robotic Optical Transient Search Experiment (ROTSE) project (University of Michigan) for the long-term loan of the lenses and CCDs used in HATNet. We acknowledge support from National Aeronautics and Space Administration (NASA) grant NAG5–10854 and Hungarian Scientific Research Fund (OTKA) grant T–038437.
References
Alonso R., et al., 2003, in Deming D., Seager S., eds, ASP Conf. Ser. Vol. 294, Scientific Frontiers in Research on Extrasolar Planets. Astron. Soc. Pac., San Francisco, p. 371
Alonso R., et al., 2004, ApJ, 613, L153
Bakos G. Á., et al., 2004, PASP, 116, 266 (B04)
Horne K., 2003, in Deming D., Seager S., eds, ASP Conf. Ser. Vol. 294, Scientific Frontiers in Research on Extrasolar Planets. Astron. Soc. Pac., San Francisco, p. 36
Kovács G., Zucker S., Mazeh T., 2002, A&A, 391, 369 (K02)
Kruszewski A., Semeniuk I., 2003, Acta Astron., 53, 241
Pepper J., Gould A., Depoy D. L., 2003, Acta Astron., 53, 213
Pojmański G., 2003, Acta Astron., 53, 341
Press W. H., Teukolsky S. A., Vetterling W. T., Flannery B. P., 1992, Numerical Recipes in Fortran: The Art of Scientific Computing. Cambridge Univ. Press, Cambridge
Appendix A
The purpose of this appendix is to show that templates containing periodic signals do not induce similar components at the level of detection in a pure noise target time-series. For simplicity, we discuss only the case for a single template. The result holds also for an arbitrary number of orthogonal templates.
Let the target {y_i; i = 1, 2, …, n} be equal to a pure Gaussian white-noise series {η_i} with the following properties:

E(η_i) = 0,   E(η_i η_j) = σ² δ_{ij},   (A1)
where E(.) denotes the expectation value, δ_ij_ is the Kronecker delta function, and σ is the standard deviation. The template is a simple harmonic function in the form of
X_i = sin Φ_i,   (A2)
where Φ_i_ =ω0_ti_+ϕ0. The TFA minimization criterion leads to the following expression for the estimate of the filtered target :
(A3)
It is seen that the amplitude of the induced periodic signal is expected to be very small, because it is computed through the Fourier transformation of the noise.
In order to compute the power spectrum of {ŷ_i} at an arbitrary test frequency ω, we take the Fourier transform of the filtered series. For the sine and cosine transforms we obtain

S(ω) = (2/n) Σ_i η_i sin(ω t_i) − c α,   C(ω) = (2/n) Σ_i η_i cos(ω t_i) − c β,   (A4)

where

α = (2/n) Σ_i sin Φ_i sin(ω t_i),   β = (2/n) Σ_i sin Φ_i cos(ω t_i).   (A5)
By using the properties of equation (A1), for the expectation value of the power P(ω) = S² + C² we obtain

E[P(ω)] = (2σ²/n)(2 − α² − β²).   (A6)
By omitting sums of oscillating terms, far from the template frequency ω0 we obtain the well-known result for the average noise power (see, for example, Kovács 1980):

E[P(ω)] = 4σ²/n.   (A7)

At the template frequency, with the above assumption on the oscillating terms, we have α² + β² ≈ 1. Therefore, we obtain

E[P(ω0)] = 2σ²/n.   (A8)
Thus, E[P(ω0)]/E[P(ω)] = 1/2, which means that the TFA (on average) does not induce an extra signal component in the filtered time-series; on the contrary, it ‘whitens out’ (in the statistical sense) even the pure noise target, resulting in a decrease by a factor of 2 in the average power at the template frequency.
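This statistical statement is easy to verify numerically; the following sketch (a single sine template with matching phase; all parameter choices are ours) reproduces the factor of ∼1/2:

    import numpy as np

    rng = np.random.default_rng(1)
    n, sigma, w0 = 3000, 0.01, 2 * np.pi * 1.0   # 1 d^-1 template
    t = np.sort(rng.uniform(0.0, 200.0, n))      # ~200-d time-base
    x = np.sin(w0 * t)                           # single sine template

    def mean_power(w, ntrial=500):
        p = 0.0
        for _ in range(ntrial):
            eta = rng.normal(0.0, sigma, n)
            yf = eta - (x @ eta / (x @ x)) * x   # one-template TFA (A3)
            s = (2.0 / n) * np.sum(yf * np.sin(w * t))
            c = (2.0 / n) * np.sum(yf * np.cos(w * t))
            p += s * s + c * c
        return p / ntrial

    # ratio of the mean power at the template frequency to that far
    # from it; expected to approach 1/2 (equation A8 versus A7):
    print(mean_power(w0) / mean_power(2 * np.pi * 0.37))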