by Joseph Long

In this post, we will reconstruct a time-varying signal that we can't measure at consistent intervals. A lot of the tools in time-domain signal analysis are derived for continuous signals, with their discrete (digital) forms dependent on consistent sampling. Engineers talk about signaling in terms of frequencies and sample rates, assuming that in the ideal case that you'll have one measurement every \(\Delta t\) seconds. (If a measurement is missing, that means your equipment isn't working right.) The well known discrete Fourier transform (and the FFT algorithm) are based on the idea that you're sampling your signal at regular intervals, and can give you unexpected results when you violate that assumption.

On the other hand, when you're looking for signals from outer space, you might set out to measure once a night. But then you have a cloudy night, and miss a data point. And then your allotted telescope time runs out, so you have a gap before you can get more data.

Or a space telescope has to fire its thrusters mid-observation, throwing off a few measurements in your time series. (Which then might inspire you to write a blog post.)

The Lomb-Scargle periodogram

The discrete Fourier transform has the undesirable characteristic of introducing noise at longer periods when the data contains gaps. This can be mitigated slightly by manipulating the data set to "fill in the gaps", but ultimately the Fourier transform is the wrong tool for the job. The technique I'm going to use here comes under the broader category of "least-squares spectral analysis", and is called the Lomb-Scargle periodogram.

As anyone who has fit functions to data knows, the least-squares regression technique is a powerful way to determine the appropriate values for your free parameters. If you had a single sine wave signal, measured at a bunch of different times, you could use a least-squares fit to recover its amplitude and frequency. The Lomb-Scargle technique effectively does this for a spectrum of frequencies of interest, telling you how much power in the signal exists at each frequency. Fortunately, we don't have to implement the specifics ourselves: an optimized implementation exists in SciPy.

Let's set up our notebook.

%pylab inline --no-import-all
import scipy.signal

The implementation we'll be using lives in scipy.signal.lombscargle. The documentation provides references to the literature, if you want to read up. The implementation itself is only a few lines of Cython.

Our input signal

At least to start with, we should use a signal we synthesize ourselves. Then we can be sure we're getting the same thing out as we're putting in, giving us an important sanity check on our code.

Our input signal will have a period of three days, and a phase offset \(\phi\) of \(\pi/2\). Below, I define these parameters for our input signal.

# Conversion factors
days_to_seconds = 60 * 60 * 24
seconds_to_days = 1.0 / days_to_seconds

# Signal parameters
A = 2. # amplitude
period = 3 * days_to_seconds # seconds
frequency = 1.0 / period # Hertz
omega = 2. * np.pi * frequency # radians per second
phi = 0.5 * np.pi # radians

Since this is a discretely sampled signal, we also have to define the length of our signal in terms of a number of data points and the spacing between them. (This may seem at odds with the idea that we have gaps in our data. You're right! The gaps are introduced a little later on.)

We make an array of timesteps and a corresponding array signal defined as \(\text{signal}_i = A * \sin ( \omega t_i + \phi )\) for each element i in timesteps and signal.

N = 1000 # number of samples we're dealing with
dt = 30 * 60 / 1.0 # 30 minutes each sample, ~5e-4 samples / sec

timesteps = np.linspace(0.0, N*dt, N)
signal = A * np.sin(omega * timesteps + phi)
# make some prettier time-axis labels, since timesteps is in seconds
timestep_day_ticks = days_to_seconds * np.arange(22)
timestep_day_labels = ['{}'.format(i) for i in np.arange(22)]

# plot the signal
plt.plot(timesteps, signal)
plt.xticks(timestep_day_ticks, timestep_day_labels)
plt.xlabel('Time (days)')
plt.ylabel('Signal Value')


A moth-eaten sine wave

To get data that demonstrates the strengths of the Lomb-Scargle periodogram, we have to take our regularly-sampled signal array and throw out points from it. To do so, we use numpy.random.rand to generate an array the same length as our input with uniform random values, and throw out a set fraction of the input based on the random array element and a threshold.

Here we set frac_points to 0.05, meaning we want to retain 5% of the input samples for our next step.

frac_points = 0.05 # fraction of points to select

r = np.random.rand(signal.shape[0])
timesteps_subset = timesteps[r >= (1.0 - frac_points)]
signal_subset = signal[r >= (1.0 - frac_points)]
plt.plot(timesteps, signal, 'k', alpha=0.3, label="Original signal")
plt.plot(timesteps_subset, signal_subset, 'b+', label="Points kept in new signal")
plt.xticks(timestep_day_ticks, timestep_day_labels)
plt.xlabel('Time (days)')
plt.ylabel('Signal Value')


This perhaps makes it a bit too easy to see the underlying sine wave. If you were taking these data, and didn't know a priori what you were looking for, you'd probably have a plot that looked like this instead.

plt.plot(timesteps_subset, signal_subset, 'b+')
plt.xticks(timestep_day_ticks, timestep_day_labels)
plt.xlabel('Time (days)')
plt.ylabel('Signal Value')
plt.ylim(-4, 4)


Perhaps you can guess that there's some periodicity there by looking, but it's much less obvious. This data set is also unnaturally clean by construction. The only signal present above is a sinusoid; real data have systematic errors that will further confound your analysis.

Computing the periodogram

We call scipy.signal.lombscargle with three arguments: x, y, and freqs. In this case, x and y are given by the timestamps and measurements from our gap-filled signal array. We generate freqs, the array of angular frequencies over which the periodogram should be evaluated, by picking 10000 periods in the range from 0.1 to 10 days and using the relation \(\text{Angular Frequency} = \frac{2 \pi}{\text{Period}}\).

# Choosing points in frequency space depends on what frequencies we think we'll detect.
# There's no point looking for 100 Hz signals when we're looking at T = 3 days periods.

nout = 10000 # number of frequency-space points at which to calculate the signal strength (output)
periods = np.linspace(0.1 * days_to_seconds, 10 * days_to_seconds, nout)
freqs = 1.0 / periods
angular_freqs = 2 * np.pi * freqs
pgram = scipy.signal.lombscargle(timesteps_subset, signal_subset, angular_freqs)

From the documentation, we have one caveat about the output given by scipy.signal.lombscargle:

The computed periodogram is unnormalized; it takes the value (A**2) * N/4 for a harmonic signal with amplitude A for sufficiently large N.

Therefore, if we want to get out the same amplitude \(A\) that we put in, we will have to divide by the number of samples, multiply by four, and take the square- root. In practice, one generally cares more about the frequency than the detected amplitude, so this is not always strictly necessary.

normalized_pgram = np.sqrt(4 * (pgram / signal_subset.shape[0]))

Let's take a look at the normalized periodogram. Below, I'm plotting the normalized periodogram output against the periods it was evaluated at, instead of the angular frequencies that were passed to the lombscargle function.

For reference, I've also added a vertical line marking the 3 day period of the input signal.

plt.plot(periods, normalized_pgram)
plt.xlabel('Period $T$ (days)')
plt.axvline(3 * days_to_seconds, lw=2, color='red', alpha=0.4)
day_ticks, day_labels = np.arange(10) * days_to_seconds, ['{:2.1f}'.format(t) for t in np.arange(10)]
plt.xticks(day_ticks, day_labels)


Visualizing the detected period

Once you have a periodogram that hints at a periodicity, you will most likely want to "fold" or "phase" the curve (pick your favorite term) to get a better idea of how each observation fits into your attempted fit.

We take a copy of the timesteps_subset array and use the modulus operator to limit the values to the range [0, 3] days.

timesteps_phased = timesteps_subset.copy()
timesteps_phased = timesteps_phased % (3 * days_to_seconds)
phase_angles = 2 * np.pi * timesteps_phased / (3 * days_to_seconds)

Now we plot the original and phased signals. The plot is color coded so that you can see how the points are rearranged on the (former) time axis to assume the appropriate phase position.

fig = plt.figure(figsize=(14,4))
# Gap-filled data
ax1 = plt.subplot2grid((1,3), (0,0), colspan=2)
ax1.scatter(timesteps_subset, signal_subset, c=timesteps_subset, cmap='rainbow')
ax1.set_xlabel('Time (days)')
ax1.set_ylabel('Signal Value')
ax1.set_xlim(0, 21 * days_to_seconds)

# Phased data
ax2 = plt.subplot2grid((1,3), (0,2), colspan=1)
ax2.scatter(phase_angles, signal_subset, c=timesteps_subset, cmap='rainbow')
ax2.set_xticks([0, 0.5 * np.pi, np.pi, 1.5 * np.pi, 2 * np.pi], ['0', '$\pi/2$', '$\pi$', '$3\pi/2$', '$2\pi$'])
ax2.set_xlim(0, 2*np.pi)
ax2.set_xlabel('Phase $\phi$ (radians)')
ax2.set_ylabel('Signal Value')


How pretty! Much easier to interpret than that sparsely populated graph of a moth-eaten sine wave.

Hopefully this helped explain how to accomplish something useful to your research. Some more high level descriptions of Lomb-Scargle spectral analysis, as well as other periodograms and when to use them, is online as part of the NASA Exoplanet Archive Periodogram Service documentation. (Bear in mind that they have their own implementation of Lomb-Scargle which may follow different conventions than the SciPy one.)

If you desire more mathematical rigor, in addition to the original paper by Scargle (and the older one by Lomb), there's a treatment (with some useful generalizations) by G. Larry Bretthorst linked below. Thanks to Nick Earl from Space Telescope for sharing it with me.

Further Reading and References


← return home
Background image of the Carina nebula by NASA, ESA, N. Smith (University of California, Berkeley), and The Hubble Heritage Team (STScI/AURA)