Exponential spectro-temporal modulation generation

Traditionally, real-time generation of spectro-temporally modulated noise has been performed on a linear amplitude scale, partially due to computational constraints. Experiments often require modulation that is sinusoidal on a logarithmic amplitude scale as a result of the many perceptual and physiological measures which scale linearly with exponential changes in the signal magnitude. A method is presented for computing exponential spectro-temporal modulation, showing that it can be expressed analytically as a sum over linearly offset sidebands with component amplitudes equal to the values of the modified Bessel function of the first kind. This approach greatly improves the efficiency and precision of stimulus generation over current methods, facilitating real-time generation for a broad range of carrier and envelope signals.


I. INTRODUCTION
Spectro-temporal modulation (STM) is of great interest in psychoacoustics and auditory physiology because of its relevance to speech decoding 1-3 as well as the broad applicability of the modulation-based linear-systems approach to parametric investigations. Investigations involving STM have used a variety of modulator shapes and carrier types. The two most common carriers are broadband noise and tonal complexes, although the details of each vary from study to study.
Creating STM with modulation that is sinusoidal on a logarithmic amplitude scale through explicit evaluation of the time-domain representation is computationally costlyfar too costly to be practical for generating stimuli for presentation during an experiment. To make use of such exponential STM, stimuli frequently must be calculated in advance using fixed parameters, which limits the psychophysical methods that can be used. Additionally, an insufficiently robust set of pre-generated stimuli requires limiting adaptive procedures or risks design flaws due to the recognizability of frozen noise. 4 As a result of these computational constraints and the relative simplicity of its form, many studies use STM with a modulation envelope that is sinusoidal on a linear amplitude scale. A stimulus comprised of the sum of N simple carrier tones decomposes into 3N different tones under linear modulation. However, one of the oldest, best-established psychoacoustics principles is that sensitivity to sound intensity is logarithmic not linear. 5 This is a critical assumption of most estimates of the internal excitation in response to an acoustic stimulus. [6][7][8][9] Thus, implementing the desired modulation pattern on a logarithmic amplitude scale ensures a similar pattern to a first approximation at the level of excitation, which translates to applying exponential rather than linear modulation.
Existing methods for achieving exponential STM tend to be either unsuitable for real-time generation because they are too computationally intensive or poorly suited for research due to constraints or inherent imprecision. Further, some attempts to limit the cost of generating stimuli, such as making use of low carrier-tone densities, have led to unwanted stimulus artifacts. 10 A computationally efficient method for generating exponential STM from independent carrier tones in the frequency domain is presented here along with metrics comparing the resultant stimulus with the explicit form and an existing alternative.

II. STM
When generating signals that vary in intensity as a function of time and frequency, it is necessary to consider how a) Electronic mail: tstavrop@ucr.edu, ORCID: 0000-0001-6183-3012. variation is represented by the amplitude and phase as a function of frequency. An inherent feature of any finite, practical auditory filter (like that of the human auditory system) is some degree of frequency selectivity. Given any such auditory filter and using noise composed of stationary, uncorrelated tones, a tone-density sensitivity threshold exists above which the system is insensitive to further increases in tone density. 11 This is part of the basis for the common practice of generating noise in the frequency domain as a sum of a limited number of stationary, uncorrelated tones distributed through a frequency band.
The fact that complex stimuli can be represented as a linear sum of simple carrier tones has long been appreciated (and exploited). It allows for the leveraging of powerful mathematical technology, like the discrete Fourier transform (DFT), to generate stimuli much more rapidly than could otherwise be done. Improvements to the computational efficiency of generating stimuli allows for the flexible generation of custom stimuli in real-time, circumventing the aforementioned issues that arise when relying on pregenerated stimuli. Computational complexity and, thus, generation time can be limiting factors for real-time generation of robust, broad-spectrum noise.
A generalized form that can be used to represent the time waveform, S(t), of STM generated from noise that is composed as of a sum of N pure sine wave carrier tones is given by A n sin ð2 pf n t þ / n Þ zfflfflfflfflfflfflfflfflfflfflfflfflfflffl ffl}|fflfflfflfflfflfflfflfflfflfflfflfflfflffl ffl{ where each carrier tone is described by an amplitude A n , a frequency f n , and a phase / n , and Mðf n ; tÞ is the modulatora function of both time and carrier frequency. This representation cleanly separates the properties of the underlying noise, such as the different statistical distributions and spectral shaping, which would be reflected in the carrier tone amplitudes, A n , from the modulator function, Mðf n ; tÞ. With different choices for the modulator, this form can represent both exponential and linear STM, as well as their respective spectral modulation (SM) and temporal modulation (TM) counterparts. Starting with this general, conceptual model helps establish a common framework in which linear and exponential modulations can be developed and compared without unnecessarily constraining how they are applied.

A. Linear modulation
Linear modulation is often used either because of the simplicity of its form or the efficiency of generating such a stimulus in the frequency domain. Using the above generalized form, simple linear modulation could be expressed as where m is the linear modulation depth and a value in the range [0,1), 12 x is the TM rate in Hz, and Uðf Þ is a function that determines the modulation phase. Application of this modulator to the carrier-tone sum results in each tone receiving a sinusoidal temporal envelope with an envelope phase shift determined by its frequency, which is responsible for creating the progressive spectral offset that is characteristic of STM. A scaling prefactor for normalization has been dropped for convenience and clarity. Direct use of this form of the modulator would change the root-mean-square (RMS) level of the carrier, requiring either the level of the output stimulus to be rescaled or the normalization constant to be calculated beforehand. In the literature, the linear modulation depth is frequently reported as 20 log 10 m.
Typically, SM that is periodic on a logarithmic frequency (octave) scale is desired, in which case, the envelope phase would be where X is the spectral density of the modulation in terms of cycles per octave, U 0 is the phase shift of the entire modulation envelope, and f 0 is any reference frequency (typically the lower-bound of the spectral domain of the noise, although in principle any value can be used as it only determines which frequency receives the modulator phase shift of U 0 ). While Eq. (3) was included for clarity and completeness, it is not necessary to specify the spectral properties of the modulation envelope to derive a result for either linear or exponential STM and as such, the results apply broadly to any spectral relationship whether logarithmic, linear, or constant as is the case with TM. For notational convenience, the envelope phase of the nth carrier tone will be defined as U n Uðf n Þ, although the results can be reinterpreted for continuous frequency distributions simply by reversing this substitution. When M Lin [Eq. (2)] is substituted into the general form [Eq. (1)] as the modulator function M and the trigonometric product rules are applied, the product of this modulator and each carrier tone simplifies to the sum of three simple sine waves, representing a base carrier tone of frequency f and two sidebands of frequencies f þ x and f À x [see Eq. (18) and Table I].

B. Exponential modulation
Exponential modulation requires a more complex representation, where m is the modulation depth in decibels, a positive value representing the level difference between a peak or a valley and the midpoint, x is the TM rate in Hz, and Uðf Þ is a function that determines the modulation phase. Note that when m ¼ 0 dB, the modulator is strictly equal to one, leaving the carrier tone unmodified. A significant downside to this form, and likely the reason it has been neglected in favor of the simpler linear modulator, is that the modulator can no longer be represented by sidebands determined by a simple trigonometric relationship as in the linear case. Much like linear STM, exponential STM can be accurately represented as the sum over a limited number of sidebands [see Eq. (17) and Table II]. The full derivation of the sideband relationship can be found in Sec. III. Whereas the analytic solution contains a sum over an infinite number of sidebands, Sec. IV A discusses how the sum converges quickly enough that a very limited number of sidebands is sufficient to produce an accurate stimulus across the range of modulation depths required in practice.

III. DERIVATION
The goal is to manipulate the expression for the exponential modulator [Eq. (4)] so that it is expressed solely as a sum of pure tones with well-defined frequencies, amplitudes, and phases. Such a representation can then be exploited to a proiri calculate a STM stimulus in the frequency domain, reducing the entire computational cost of generating exponential STM to little more than that of a single DFT and potentially reducing the computational burden by several orders of magnitude.

A. Overview
The first step is to find the Taylor expansion of the exponential to pull the sine wave in the exponent into a marginally more cooperative form, an infinite sum over powers of sine [Eq. (6)]. Next, the sine power reduction formula is used to rewrite each power of sine into a sum of sine and cosine terms [Eq. (10). The formula being invoked here is the result of the recursive application of the trigonometric product rules [such as 2 sin h sin / ¼ cos ðh À /Þ À cos ðh þ /Þ]. However, because this formula has a different form for even and odd exponents of sine (as sine and cosine end up transforming back and forth with each additional power), these terms are temporarily split into a sum over the even terms, a sum over the odd terms, and the 0th term.
The result of this manipulation is a finite sum inside an infinite sum. The next step is to collect every trigonometric function with the same argument together and sum the amplitudes. To take the even sum, as an example, the first term of the outer sum (m ¼ 2) is a scalar value multiplied by cos 2h, and the second term of the outer sum (m ¼ 4) is a cos 2h term plus a cos 4h term. Similarly, the third term of the outer sum (m ¼ 6) is a sum of a cos 2h term, a cos 4h term, and a cos 6h term. This expression becomes significantly simpler if the sums are rearranged and all of the cos 2h terms are collected together, all of the cos 4h terms are collected together, etc., effectively turning the nested sum inside out. The simplified representation that this yields is a sum over constant values [Eq. (11a)], cosine terms [Eq. This new representation is in the form of a particularly well-studied function, the modified Bessel function of the first kind, I ðzÞ [Eq. (12)]. Bessel functions, of which I ðzÞ is included, appear throughout physics and engineering due to involvement in the representations of spherical harmonics. It is pertinent here to note that many mathematical packages include this function (besseli in MATLAB, for instance), and highly efficient numerical recipes exist for it as well (see Ref. 13). The expression collapses into a sum over sine and cosine terms with amplitudes equal to values of the Bessel functions [Eq. (15)].
Finally, the modulator expression is multiplied by its corresponding carrier tone [Eq. (1)], and the trigonometric product rules are applied to generate the final representation: a base tone plus sidebands that are linearly offset by integer multiples of the TM rate with amplitudes equal to values of the modified Bessel function of the first kind [Eq. (17)]. This form is closely parallel to the linear modulation form [Eq. (18)] albeit with a little more complexity. Nonetheless, the values of I k ðzÞ drop off rapidly as k increases, especially in the range of physiologically relevant modulation (see the discussion on error in Sec. IV A for more details), allowing for high accuracy with the use of as few as ten sidebands.

B. Complete derivation
The first step to finding a cleaner representation of the exponential modulation is to use the Taylor expansion of the TABLE I. The transformation to apply to each carrier tone in order to apply linear modulation. Each carrier tone is replaced with a base tone and two sidebands of linearly offset frequency.

Linear modulation sidebands Frequency
Amplitude a Phase Whereas it is improper to express amplitude as a negative value, it does greatly simplify the comparison with the exponential sidebands in this format. To recover the proper amplitudes and phase shifts, remove the factor of À1 appearing in an amplitude term and add a phase shift of p.
TABLE II. The transformation to apply to each carrier tone in order to apply exponential modulation. Each carrier tone is replaced with a base tone and its associated sidebands. The first sidebands correspond to k ¼ 1.
The even-numbered sidebands use the appropriate Even k row, the oddnumbered sidebands use the respective Odd k row. The amplitude terms fall off exponentially for successive sidebands. For 20-dB peak-to-valley modulation (m ¼ 10), by the fifth set of sidebands (k ¼ 5), each subsequent amplitude term is more than 1 order of magnitude smaller than the prior.

Exponential modulation sidebands
Frequency Band Amplitude a , b , c Phase Base f n -I 0 ðM 0 ÞA n / n Upper bands f n þ kx (Even k) ðÀ1Þ k=2 I k ðM 0 ÞA n / n þ kU n (Odd k) ðÀ1Þ ðkþ1Þ=2 I k ðM 0 ÞA n / n À p 2 þ kU n Lower bands f n À kx (Even k) ðÀ1Þ k=2 I k ðM 0 ÞA n / n À kU n (Odd k) ðÀ1Þ ðkÀ1Þ=2 I k ðM 0 ÞA n / n À p 2 À kU n a The substitution M 0 ðm=20Þ ln ð10Þ is made for readability. b I ðzÞ is the modified Bessel function of the first kind of order and argument z. 3 See the footnote regarding negative amplitudes in Table I. exponential. This will pull the modulating sine wave out of the exponent, allowing for easier manipulation.
x n ln ðnÞ ð10Þ n! ; where ln ðxÞ is the natural logarithm or log e ðxÞ. By substituting this into the equation for the modulator, Eq. (4) becomes Several expressions make repeated reappearances, therefore, it is prudent to make substitutions for clarity and brevity. The argument to the sine function in the modulator is substituted with H n ðtÞ 2 p x t þ U n , and a constant factor appearing as the amplitude is represented with C m ln ð10Þ=20. Using these simplifying expressions, the modulator becomes M Exp ðf n ; tÞ ¼ X 1 a¼0 C a a! sin a ðH n ðtÞÞ: To simply this, an explicit, analytic substitution for sin n ðhÞ, called the sine power-reduction formula 14 is used ðÀ1Þ n=2þk n k ! cos ðn À 2kÞ h ð Þ ; n is even; where the binomial coefficient is defined as Substituting the sine power-reduction formula into the expression for the modulator M Exp ðf n ; tÞ [Eq. (6) where even and odd refer to only summing over the even and odd terms, respectively. Separating the unitary, cosine, and sine terms into separate expressions for convenience and collecting the similar sine and cosine terms, the modulation terms become Next, the modified Bessel function of the first kind is required, where the Gamma function, when the argument is restricted to natural numbers, is equal to a factorial CðnÞ ¼ ðn À 1Þ! 8 n 2 N: When values of are constrained to the set of natural numbers (as will be true in this case), Eq. (12) simplifies to Simplifying Eq. (16) with the trigonometric product rules yields the following expression: ð Þcos 2pðf n À k xÞ t þ / n À k U n ð Þ

> > > > > > > > > > > > > > > > <
> > > > > > > > > > > > > > > > : As a useful comparison, the linear modulation of Eq. (2) can be expanded to take the form A common interpretation of the linear modulation result above is that each carrier tone becomes the sum of three waves. The base wave is of amplitude A n , frequency f n , and phase / n , and the remaining two are sidebands with frequencies of f n þ x and f n À x, phases of / n þ U n þ p and / n À U n , and both with amplitudes of 1 2 A n m (see Table I). In this context, Eq. (17) can be interpreted as a base tone plus an infinite number of diminishing pure-tone sidebands (see Table II). This representation is convenient because the sidebands converge to zero fairly quickly, and a number of efficient numerical recipes for evaluating the modified Bessel function exist, 13 and it is included in standard MATLAB functions like besseli.
Representing STM in this way enables quickly and efficiently composing a stimulus in the frequency domain from many thousands of carrier tones and generating the timedomain representation with a single inverse discrete Fourier transform (IDFT). This method is easily fast enough to be used for real-time stimulus generation-decreasing the computation time for a 1-s stimulus by over 3 orders of magnitude to less than 40 ms (see Sec. IV B).

IV. STRENGTHS AND LIMITATIONS
Whereas the solution derived for representing exponential modulation is explicit and analytic, making use of it is not without necessarily invoking some simplifying assumptions.

A. Error
Although the analytic representation of exponential modulation is expressed as a sum over an infinite number of sidebands, in practice, this expression does not require the inclusion of many terms to be highly accurate. Figure 1 demonstrates a direct comparison between the proposed sideband approach [ Fig. 1(a)] and the explicit evaluation of the modulation [ Fig. 1(b)]. One stimulus exemplar was generated with each method, using the same carrier component frequencies, amplitudes, phases phases, and 20-dB midpoint-to-peak modulation. The differences between the respective spectrograms, which are equivalent to the ratios in the spectral power density, visualzied in Fig. 1(c) show the signals to be nearly identical. Figure 2 visualizes how quickly the sum over the sidebands converges to the proper infinite sum. It is worth noting that the sideband convergence ordinate is in units of decibels plotted on a logarithmic scale, and the linear features shown in this scale represent hyper-exponential convergence. Thus, the midpoint ordinate value of À10 À5 represents a -0.00001-dB difference between the partial sum and the complete sum, or effectively 0.00001 dB of "missing" energy. The trend is that the greater the depth of modulation, the more sidebands are required to capture the dynamics. Even an envelope with a 40-dB midpoint-to-peak modulation depth (80-dB peak-to-valley) requires only 21 terms (20 sidebands between À10x and þ10x plus the fundamental frequency) to have a power spectrum accurate to one part in 10 9 in decibels. The perceptual relevance of these small level differences is difficult to evaluate because of the many different stimulus types that can be created and the variety of perceptual tasks that one might use with such stimuli. Using the extremely conservative estimate that a total energy difference of 0.01 dB due to omitted sidebands should be physiologically undetectable, a sideband extent of four would be sufficient for modulation depths up to 40 dB. In comparison to perceptual data, the best thresholds obtained by incrementing a single component in a tonal complex correspond to about a À20-dB signal-to-standard ratio, 15 which corresponds to a level difference of 0.828 dB. 16

B. Computational complexity
As previously alluded to, the DFT is a powerful tool. When feasible, generating stimuli in the frequency domain can be several orders of magnitude faster than comparable time-domain methods. For example, generating a 1-s sample of STM with 10 000 carrier tones in the frequency domain with the proposed algorithm is consistently over 1200 times faster than exhaustively evaluating the explicit form of the stimulus in the time domain. 17 In fact, the difference in computational complexity between the linear and exponential modulation cases in the frequency domain is relatively small as the cost for stimuli of duration greater than even 100 ms is dominated by the cost of the DFT [a fast Fourier transform (FFT) in this case], which is represented equally in the complexity of both algorithms. In the aforementioned test   2. (Color online) The ratio between the energy of each term in the partial sum over a limited number of sidebands and the energy of the complete infinite sum, expressed in decibels and shown for several midpoint-to-peak modulation depths. Because sidebands are distributed symmetrically about the carrier tone, they are counted in terms of "sideband extent," which is half of the total number of sidebands. The visualized "power ratio" value can be interpreted as the energetic contribution of the omitted terms relative to the entire sum. The values converge to zero quickly enough such that few total datapoints are visible when visualized with a linear ordinate axis. case, it took on average 38 ms to generate 1 s of exponential STM composed of 10 000 carrier tones in the frequency domain, whereas 48.7 s were required to generate the same stimulus in the time domain.
This comparison can also be made in the language of algorithmic complexity. In terms of big O complexity, using C for the carrier tone count, N for the number of samples, and B for the number of sidebands used, the complexity of the explicit calculation is OðCNÞ, whereas the sidebandbased approach is OðN ln NÞ when N is a power of two and, thus, the FFT can be used. The latter would contain the addition of a term of BC, but it falls away due to insignificance. However, the algorithmic complexity only captures part of the computational savings of the frequency domain approach as a result of the unequal cost of evaluating different arithmetic operations. In particular, the exponentiation and trigonometric function evaluations, which are evaluated B times for every sample in the explicit form, are notably slower than the addition and multiplication operations that form the backbone of the FFT.
Consideration of computational efficiency is particularly pertinent in light of Resnick et al., 10 demonstrating the effects of the spectral aliasing that occurs when generating STM with an insufficient carrier density. Low carrier densities have been used in the past to offset computational constraints, but the proposed sideband approach renders such carrier degradation optional.

V. VALIDATION
Validation of the proposed sideband exponential stimulus generation technique can be further explored using metrics of the spectro-temporal envelope fluctuations as the basis for comparison between the explicit calculations and a third existing exponential technique as described by Chi et al. 1,18 Direct, meaningful evaluation of the envelope of STM with its characteristic spectral and temporal variation is challenging. However, both pure SM and pure TM allow for much simpler approaches in accessing and analyzing the envelope. The modulation envelope in SM is directly inscribed into the spectrum just as the modulation envelope in TM appears in the time-domain envelope. Because all of the approaches analyzed here are composed of a sum over modulated carrier tones, the difference between SM and STM is the omission of the TM term in Eq. (4), x, which advances the envelope phase with time, whereas the Uðf Þ term that advances the envelope phase with frequency in the same equation is omitted in the case of TM. The envelope of the spectrum of pure SM is the focus of this analysis to best resolve any frequency-sensitive artifacts that might be present in the tested generation methods.
To evaluate the spectral envelope, 100 stimulus exemplars were computed with each method at each of 20 selected modulation depths, using maximum-density carriers with amplitudes sampled from a Rayleigh distribution and scaled to match a bandpass filter (À32 dB/octave) from 400 to 3200 Hz. The SM frequency was 2 cycles/octave, and the starting phase of the modulator was randomized. For each exemplar, two metrics of the spectral-envelope fluctuation were computed.
The first metric used to analyze the signals was the normalized fourth moment (M 4 ) of the spectrum, 19,20 where E is the stimulus spectrum, obtained from a DFT. The fourth moment characterizes the degree of fluctuations of a signal and could potentially capture issues like inadequate curvature or frequency-dependent modulation artifacts. Figure 3(a) shows the normalized fourth moment of the spectrum as a function of modulation depth (peak-to-valley difference in dB) from 0 to 50 dB. The three functions represent the explicit evaluation (black, open circles), the existing method (red, open squares), 1 and the proposed sideband method (green, filled triangles). The most important observation in this context is that the proposed sideband method (green triangles) maps directly onto the explicit method (black circles), indicating that the minimal computational error discussed above has minimal effect on modulation envelope curvature and depth. Also of potential interest is the fact that the normalized fourth moment values for the existing method (red squares) are quite different from the other two methods. For modulation depths below about 20 dB, the normalized fourth moment is greater than the other two methods, and for greater modulation depths, the normalized fourth moment is less. The second metric was the crest factor, CF, of the spectrum, where E peak is the peak in the spectral envelope, and E rms is the RMS magnitude of the spectrum. The crest factor characterizes the extrema of a signal and would reveal if a method suppressed or exaggerated transients and peaks. This metric, shown in Fig. 3(b), also confirms that the explicit method and proposed sideband method are nearly identical in terms of envelope peaks, whereas the existing method had an elevated crest factor (by about 36% on average across modulation depths).
Overall, these methods of quantifying spectral envelope fluctuations show that the explicit and proposed sideband methods yield roughly the same spectral envelope depth and both differ from the envelope of the existing method even when the same spectral envelope parameters are specified during stimulus generation. Although the differences between theses methods shown in Fig. 3 do point to potential envelope cues introduced by using the existing method, it is currently unknown what impacts, if any, the differences in envelope actually have on perception. With the proposed, more efficient method in hand, it will now be much easier to systematically examine exactly which spectral and temporal cues are responsible for sensitivity to STM and how this interacts with the stimuli that have been used to measure sensitivity in the past.

VI. SUMMARY AND CONCLUSIONS
The derivation of an explicit, analytic expression for the sidebands necessary to capture exponential modulation has been presented. Where existing solutions capable of real-time execution either estimate this effect through coarser Fourier transforms or approximate filtering, the proposed solution solves for the precise sideband values necessary to capture the behavior. A set of metrics evaluating the spectral envelope demonstrate that the modulation generated by the proposed method more closely matches the expected form than does an accepted alternative. The degree to which the modulation envelope shape interacts with the auditory system and impacts STM detection is an area of study that is still evolving, and empirical evidence on the impact of differences in shape is limited. For example, Shamma and Versnel 21 reported that modulation shape did not lead to a difference in ferret cortical single-unit responses. Isarangura et al., 22 however, reported that SM detection thresholds differed significantly based on the spectral envelope shape. Consequently, efficient methods of generation like the proposed sideband approach will help keep differences and potential errors in generation from slowing progress in our further understanding of the perceptual and physiological bases and implications of modulation sensitivity.