Acoustic detection of unmanned aerial vehicles using biologically inspired vision processing a)

: Robust detection of acoustically quiet, slow-moving, small unmanned aerial vehicles is challenging. A biologically inspired vision approach applied to the acoustic detection of unmanned aerial vehicles is proposed and demonstrated. The early vision system of insects signiﬁcantly enhances signal-to-noise ratios in complex, cluttered, and low-light (noisy) scenes. Traditional time-frequency analysis allows acoustic signals to be visualized as images using spectrograms and correlograms. The signals of interest in these representations of acoustic signals, such as linearly related harmonics or broadband correlation peaks, essentially offer equivalence to meaningful image patterns immersed in noise. By applying a model of the photoreceptor stage of the hoverﬂy vision system, it is shown that the acoustic patterns can be enhanced and noise greatly suppressed. Compared with traditional narrowband and broadband techniques, the bio-inspired processing can extend the maximum detectable distance of the small and medium-sized unmanned aerial vehicles by between 30% and 50%, while simultaneously increasing the accuracy of ﬂight parameter and trajectory estimations. V C 2022 Author(s). All article content, except where otherwise noted, is licensed under a Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).


I. INTRODUCTION
Passive distributed acoustic sensor arrays have been used for detecting and tracking moving aircraft, 1-8 ground vehicles, [9][10][11][12][13][14][15] and unmanned aerial vehicles (UAV). [16][17][18][19][20][21][22][23][24][25] Several array configurations have been explored, including the small aperture circular array, 4 the L-shaped planar array, 7 a tetrahedral array, 6 and the widely distributed small arrays. [10][11][12] By analysing the acoustic signals received by the individual microphones, a variety of applications can be realized, including object classification, target tracking, and simultaneous localization and mapping (SLAM). 26 With the fast development of UAV platforms, these systems have become a very useful tool for a wide variety of applications, 27 including structure inspection, 28 surveillance, 29 3D mapping, 30 and acoustic tomography. 31 Nevertheless, an unauthorized UAV may pose a threat to an airport, individuals, or military bases. Therefore, long range detection and precise location of the UAV becomes important for safety and security purposes.
The acoustic signature emitted by UAVs makes their passive detection and tracking possible. Depending on the spectral components of the acoustic signal, two traditional processing techniques for flight parameter estimation are readily available. For propeller-driven aircraft and helicopters that emit strong harmonic tones a narrowband processing technique 20,32 may be used to estimate the flight parameters. The approach is based on identification of the instantaneous frequency of the motion-induced, Doppler shifted acoustic signature of the aircraft. A broadband processing technique, 20 which measures the temporal variation of the time delays between multiple microphone pairs, may also be used. Compared with the narrowband technique, the broadband approach is more flexible because it can handle UAVs that do not emit strong narrowband tones or fixed harmonic frequencies. Accurate estimation of flight parameters is best achieved through application of both methods when a UAV overflies the array. Indeed, using such approaches several researchers have reported detection ranges for aircraft and UAVs in excess of 2 km. 2,17,21,33 When a UAV is far away from the microphones, however, the signal is weak compared to noise and both broad and narrowband approaches struggle to achieve reliable results. This raises a challenge for UAV detection, localization, and tracking, as observation of the acoustic signal at long range is usually highly desirable.
Similar signal conditions exist in the natural world. For instance, the spread of luminance in naturally lit scenes typically covers a very large dynamic range, and the details in dark regions are immersed in noise. 34 In this regard, insect visual systems, such as that of the hoverfly, have proven to be a powerful information capture system. 35 Similar to the visual scenes, acoustic signals can be converted into "images" using spectrograms and correlograms. These traditional techniques thus permit us to visualize the (one-dimensional) acoustic signal as (two-dimensional) images based on a corresponding time-frequency analysis. With the narrowband method, detected acoustic signals are usually visualized as spectrograms and the frequency of harmonics extracted. For the broadband technique correlograms are used, from which the time delays can be obtained. In this sense, the acoustic signal of interest, in the form of harmonics or correlation peaks, can be presented as patterns in twodimensional arrays or matrices and analysed using vision processing techniques. The particular model used in this study is described in more detail below in Sec. III and the references therein. It derives from a fully elaborated biologically inspired vision (BIV) model that, in addition to enabling scene agnostic, sub-pixel motion detection via electro-optic and infrared sensor modalities, can apply signal conditioning to sensor output prior to downstream processing. The model is based on multiple layers of non-linear dynamic adaptive components measured or suspected from responses of neurons in pathways of the hoverfly brain. Uniquely, however, we have transitioned the techniques from biological finding and theoretical simulation to embedded hardware implementation running in real time. 36 The approach enhances and suppresses elements of related and unrelated signal and noise, providing crisp sub-pixel/low-amplitude signal detection and classification for these difficult target sets. 37 This vision-inspired method differs substantially from techniques that draw upon the biology of auditory systems. 38,39 The technique also differs from the well-known biologically inspired convolution neural network (CNN) method. 40,41 A CNN is an adaptive algorithm that is based on interconnected nodes (neurons) arranged in a layered structure to resemble the human brain. The CNN learns from pre-supplied data and can thus be trained to classify the data by breaking down the input layers into layers of abstraction. It is trained using many examples, by the way in which the individual neurons are connected and by the strength (weight) of those connections. The weights are automatically adjusted during training according to specified learning rules until the algorithm performs appropriately. The last layer of the CNN indicates whether the target class (in our case a UAV signature) was present within the data.
This means that while the hidden layers of a CNN may extract and enhance features the core task is typically classification, and any decision making is based on patterns in the training data. This contrasts with the proposed algorithm in this paper, which is focused on enhancement only: classification of the data is still required post facto. This means the two approaches are not mutually exclusive. In fact, there is reason to believe a CNN trained on outputs from the BIV model could be smaller and more accurate than one trained on raw data (due to enhancement of the signals relative to the noise). This paper is organized into six sections including the present one. Section II briefly describes the traditional methods of UAV acoustic signal visualization, including the narrowband and broadband processing. Section III describes the principles of the bio-vision technique. In Sec. IV field trial used to gather the experimental data is explained. The results of the comparisons between the traditional and bioinspired methods are given in Sec. V. Finally, in Sec. VI this work is summarized and possible future investigations suggested.

A. Notations and assumptions
Passive acoustic localization 9,33 is based on the following assumptions. First, the atmosphere is assumed to be an isospeed sound propagation medium, with the speed of sound denoted as c. Second, the UAV is assumed to travel in a straight line at constant speed V and altitude h for the duration of the inter-observation period, as shown in Fig. 1. The position of the UAV at time s is uðsÞ ¼ ½xðsÞ; yðsÞ; zðsÞ T , which can be expressed in Cartesian coordinates as 9 xðsÞ ¼ d c cos a c þ ðs À s c ÞV sin a c ; (1) where s c , d c , and a c are the time, the ground range and the bearing angle at the closest point of approach (CPA) to the origin ½0; 0; 0 T , respectively, h is the UAV altitude, and is the slant range of CPA. In this case, the UAV trajectory can be explicitly described by five flight parameters fV; s c ; h; d c ; a c g. Therefore, the UAV position can also be expressed as uðs; kÞ, where k ¼ ½V; s c ; h; d c ; a c T is the flight parameter vector. From Eq. (1), the UAV velocity is _ u ¼ ½V sin a c ; ÀV cos a c ; 0 T . Note that these assumptions are only required to be satisfied during the interobservation period, which is typically short compared to flight dynamics.
The UAV position uðsÞ can be described in spherical coordinates. Equivalently, uðsÞ ¼ rðsÞ Á qðsÞ, where rðsÞ is the slant range at time s, and qðsÞ is the direction-of-arrival (DOA) unit vector defined as qðsÞ ¼ cos hðsÞ cos /ðsÞ; cos hðsÞ sin /ðsÞ; sin hðsÞ ½ T ; (2) in which hðsÞ and /ðsÞ are the bearing and elevation angles, respectively. The non-linear relationship between the bearing and elevation angles and the source location is given by hðsÞ ¼ tan À1 yðsÞ xðsÞ ; /ðsÞ ¼ tan À1 zðsÞ ffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffi x 2 ðsÞ þ y 2 ðsÞ p ! : ( The acoustic array here comprises a general form, that has a wide aperture with N RS small aperture arrays, each with N CH microphones (each microphone representing a different channel). The j-th channel of the i-th small acoustic array is denoted as S i;j ; 1 i N RS ; 1 j N CH located at position s i;j ¼ ½x i;j ; y i;j ; z i;j T . The central position of the i-th small microphone array is denoted as s i ¼ ½x i ; y i ; z i T , which can be obtained through s i ¼ ð1=N CH Þ P N CH j¼1 s i;j . In this paper, one small microphone array is located at the origin point of the coordinate system (see Fig. 6 later).

B. Narrowband technique
The narrowband technique suits acoustic signals that contain strong narrowband tones. The acoustic signals are visualized as spectrograms using time-frequency analysis. Figure 2 shows the power spectral density (in dB) of a UAV observed by a single ground microphone. As well as the harmonic tones emitted by the UAV, the wind noise and acoustical Lloyd's mirror effects are also visible. The wind noise exists throughout the observation period, its spectral width varying in accordance with different gust intensities. Although the maximum frequency of wind noise contamination can reach 500 Hz and simply highpass filtering the signal could eliminate wind noise, considerable harmonic information could also be discarded, which is not desirable.
The Lloyd's mirror effect, generated by reflection of acoustic waves by the ground, is clearly visible when the UAV is near the array. This introduces a slow-change variation in the spectrogram, making some harmonic components hard to detect. For instance, in Fig. 2 the amplitude of the harmonic signal around 600 Hz at time 10 s is almost as low as the noise floor, which results in a breakpoint in the harmonics of this order. It is noted that the Lloyd's mirror effect can be utilized in estimating some flight parameters. 42,43 However, the Lloyd's mirror effect is distinct within only a short period when the airborne source flies over the acoustic sensor and thus not suitable for longdistance detection.
To mitigate the wind noise and Lloyd's mirror effect, the cepstrum filtering technique is used. 44 The (power) cepstrum of a signal x(t) is defined as where F ðÁÞ is the Fourier transform and Xðf Þ ¼ F xðtÞ È É is the spectrum of signal x(t). q is the quefrency variable of the cepstrum. After performing the cepstrum transform on the spectrogram X(f, t), the outputs form a new image called the cepstrogramxðq; tÞ. Figure 3 shows the cepstrogram for the same UAV considered in Fig. 2. Unlike the spectrogram, in the cepstrogram the power of wind noise is concentrated within a fixed quefrency range. The Lloyd's mirror is distributed within 0 < q < 0:01 s and 0 t < 20 s, corresponding to the UAV being within a range of about 300 m from the array. The harmonic components are distributed near q % 0:02 s, which is clearly distant from both the Lloyd mirror and the wind noise. Therefore, by applying a bandpass filter that retains the harmonic components on the cepstrogram we can effectively distill the essential harmonics. Figure 4(a), which contains only the harmonics, shows the result of cepstrum filtering. Figure 4(b) shows the background spectrogram and contains only the Lloyd mirror and wind noise. Using this approach, we can eliminate the influence of the unwanted signals.  After cepstrum filtering, the harmonic spectrogram is then processed for pitch detection. A number of traditional methods are available for high-accuracy pitch detection. For example, in time domain the zero-crossing rate (ZCR), 45 robust algorithm for pitch tracking (RAPT) 46 and YIN estimators 47 may be used, while in the frequency domain there are component frequency ratios, 48 cepstrum analysis, 44 optimum comb filters, 49 and harmonic product spectrum (HPS). 50 In this paper, since the acoustic signal has been visualized into spectrogram, the HPS method for pitch estimation is adopted.
The tonal frequency emitted by the UAV at time s is received by the i, j-th acoustic sensor at time t. This can be expressed as 23 f i;j ðt; kÞ ¼ f i;j ðs þ Ds i;j ðs; kÞ; kÞ where Ds i;j ðs; kÞ is the travel time from the UAV to the i; j-th microphone, i.e., Ds i;j ðs; kÞ ¼ jjuðs; kÞ À s i;j jj=c: Using the relation t ¼ s þ Ds i;j ðs; kÞ and substituting it into Eq. (6), we have where s 0 RðaÞ is the rotation matrix written as For the acoustic sensor located at the origin, Eq. (7) reduces to the same form as in Ref. 32.
It is worth noting that not all UAVs exhibit strong acoustic harmonics. Consequently, the narrowband processing is not suitable for all UAV signatures.

C. Broadband technique
The other traditional method for detecting and locating acoustic signatures is based on broadband processing. Unlike the spectrograms in the narrowband technique, the broadband processing uses correlograms to visualise the acoustic signals. Suppose the acoustic signal emitted by the UAV at time s arrives at two microphones (the m-th and n-th channels) of the i-th array at time t, the time delay between these channels can be estimated from the correlogram C i;mn ðb; tÞ, which is obtained through the generalized cross correlation and phase transform (GCC-PHAT) method 51,52 where b is the lag. X i;m ðf ; tÞ and X i;n ðf ; tÞ are the complexvalued spectrograms of the m-th and n-th channels of the i-th array, respectively. W(f) is the spectral windowing function. The time delay between these two channels can be obtained by searching the peak in the correlogram, i.e., When the flight trajectory is fully described by Eq. (1), the time delay d i;mn ðtÞ is also referred as d i;mn ðt; kÞ. According to Eq. (6), it can be rewritten as  jjuðs; kÞ À s i;m jj À jjuðs; kÞ À s i;n jj c ; (11) where Ds i;m ðs; kÞ and Ds i;n ðs; kÞ are the travel time from the UAV to the microphone channels S i;m and S i;n , respectively. Since the channels were placed near to each other within an array, the relationship between the sound emitted at time s and received at time t can be treated as a plane wave across the small array. Thus, considering t ¼ s þ Ds i ðs; kÞ and solving Eq. (6) along with Eq. (1), the relationship between s and t for the i-th acoustic array centered at s i can be expressed as When the centre of the small microphone array is located at the origin point, i.e., s i ¼ s 0 i ¼ ½0; 0; 0 T , Eq. (12) reduces to the same form as Ref. 9.
Besides the correlogram, the acoustic signal in broadband processing can also be visualized into the global coherence field (GCF). 53 A GCF shows the plausibility for an acoustic source. For the i-th acoustic array, it can be calculated as where M ¼ ð 2 N CH Þ is the number of microphone pairs. b i ðuÞ is the expected at time delay of the i-th small acoustic array when the acoustic source is at position u. The GCF can also be presented as a 2-dimensional image in terms of the bearing and elevation angles, based on the DOA unit vector q defined in Eq. (2).

III. BIO-VISION PROCESSING
A complete biologically inspired vision (BIV) model is a multi-stage non-linear system with adaptive feedback both within and between stages. This model has previously been used only on electromagnetic data but has shown great promise in estimating optic flow 54 and detecting targets in clutter 34 in both visual 36 and infrared 37 portions of the spectrum. Even using the first stage of the model in isolation has yielded improved clarity in poor lighting conditions 55,56 and better target detection. 57,58 The photoreceptor cell (PRC), which is responsible for dynamic range reduction of the input signal, provides the first stage of the biological visual system. Photoreceptors dynamically adjust the dark and bright regions of input images through temporal pixel-wise operations. 59 Figure 5 shows the elaborated mathematical model of the bio-vision photoreceptor. 34 It includes four stages: (1) the adaptive filtering; (2) the low-pass divisive feedback, which is also called as the DeVries-Rose stage; (3) the exponential low-pass divisive feedback, which is known as the Weber stage; and (4) the non-linear Naka-Rushton transform. This four-stage model is functionally equivalent to the processing conducted in a primate cone. 60 The detailed implementation is described as follows.

A. Adaptive filter
This stage includes low-pass filtering with a dynamic cut-off frequency, the value of which depends on the adaptation state (long-term average value of the element). This is followed by a variable gain control that acts to reduce the differences over a wide range of acoustic intensities by using a larger gain when the signal is low compared to when it is high. In visual images the power of the signal component is approximately inversely proportional to its spatial frequency, whereas the noise component is essentially white, i.e., constant over all frequencies. There is therefore a frequency above which the signal-to-noise ratio (SNR) falls below an acceptable level and a low-pass filter should be employed. The threshold at which the SNR drops varies in accordance with the intensity (brightness) of the input signal because the majority of the noise is in the sensor itself and therefore independent of the scene being observed. Similarly, in acoustic applications the elements that relate to low amplitude acoustic signals must be more heavily filtered than those that receive high amplitude inputs since their SNR will be low and can be increased by the reduction of high frequency signal components.
To begin, the adaptation state is calculated as where f c1 is the corner frequency of this stage and f r is the frame rate, which is the reciprocal of the time step interval. Then the low-passed filtered result ' 1;t k passes through a tone-mapping (bit depth normalisation) process, which is realized via a Naka-Rushton transform 61 in order to estimate the adaptation state, I nr;t k , expressed as where I mid1 is the mid-point value chosen from the empirical data set. By using the adaptation state it is possible to independently classify each element of the incoming data by its average intensity, and hence estimate the filtering required to improve the SNR and the gain required to amplify low intensity sections if the input signal. The adaptive LPF is realized through where f fm;t k is the adaptive corner frequency calculated as In Eq. (17), f max and f min are the bio-vision parameters corresponding to the maximal and minimal adaptation rates of the temporal LPF, respectively, and are set depending on the level of noise in the sensor relative to the intensity of the recorded signal. Last, an operation to compress the dynamic range is applied, using a non-linear adaptive gain which is given by The gain factor g af;t k compresses the dynamic range by amplifying the lower values more than the higher ones. The output signal of this stage is expressed as The initialization of this stage is derived from the steady-state response, i.e., ' 1;t 0 ¼ ' 2;t 0 ¼ I in;t 0 ; I nr;t 0 ¼ I in;t 0 = ðI in;t 0 þ I mid1 ; I af;t 0 ¼ I af;t 0 ½ðg max À 1ÞI mid1 =ðI in;t 0 þ I mid1 Þ þ 1, where I in;t 0 is the initial input to the bio-vision model.

B. Low-pass divisive feedback (DeVries-Rose)
The second stage incorporates rapid, short-term adaption of the input intensities, allowing the element to respond, but quickly adapt, to adjust for any changes. As shown in Fig. 5, it has a feedback loop with a LPF such that where I dvr;t kÀ1 is the output of the previous time step, and f c2 is the corner frequency of this stage. Function f LPF is defined in Eq. (14). The filtered signal from the previous time step serves as the denominator in the divisive feedback loop (i.e., divided by the input of this stage), which can be written as The steady state-behaviour follows a square root, thus the initialization of this stage at time step t 0 is I dvr;t 0 ¼ ffiffiffiffiffiffiffiffi I af;t 0 p , which is also called the DeVries-Rose law. Due to the lowpass filter, this stage produces overshoots and undershoots at incremental and decremental steps, respectively. The result of this stage is an output that responds to a change in the input with no time delay but will decay over time, asymptotically approaching the square-root of the input value, if the input remains unchanged. This preserves the presence and temporal coherence of changes while simultaneously reducing the required bandwidth of the signal. The operation is similar to a leaky high-pass filter with the addition of compressive non-linearity. Weighting with current and previous iterations, this process supports temporal coherency when processing input sequences.

C. Exponential low-pass divisive feedback (Weber)
The third stage is the Weber model, which contains an exponential operation in the feedback loop. It is parametrically similar to the previous DeVries-Rose stage, but provides long-term, slow adaptation due to the lower corner frequency in the filter, but includes the exponential to alter the rate of change of the system to a disturbance. As with the previous stage the temporal filtering is not on the main signal path, only in the feedback. This allows the model to adapt to slow changes in intensity while maintaining temporal coherency and resistance to high-frequency changes in the overall scene. The low-pass feedback loop in this stage is where I weber;t kÀ1 is the output at previous time step and f c3 is the corner frequency of this stage. The filtered signal is first manipulated by an exponential operation and then divided by the input signal of the current time step. It is written as where a is the exponential sensitivity of the system. The initialization of this stage comes from the steady state function that I weber;t 0 ¼ I dvr;t 0 = exp ðI weber;t 0 Þ with the solution I weber;t 0 ¼ WðI dvr;t 0 Þ, where WðÁÞ is the Lambert W function. When ln ðxÞ ( x, the solution can be approximated as I weber;t 0 ¼ ln ðI dvr;t 0 Þ. This logarithmic relationship is referred to as the Weber law, which is the name of this stage. The exponential scaling enables this stage to perform significant non-linear rescaling of the signal, which can drastically reduce the intensity of the largest elements.

D. Non-linearity (Naka-Rushton)
The final stage of the photoreceptor is the static nonlinearity Naka-Rushton transform process expressed as where I mid2 is an empirically selected positive offset. As a result, the response becomes increasingly non-linear as the intensity values rise.

IV. FIELD TRIAL EXPERIMENT
A. Field trial equipment Field trials were conducted at a site known as Evett's Field at the Woomera Test Range, South Australia. The terrain is flat and open, has sandy/rocky ground, no grass and sparse vegetation. Aside from an equipment hut 300 m west of the microphone array there are no substantial scattering objects obstructing line of sight propagation. As shown in Fig. 6(a), Evett's Field has two runways (Runway #1 and Runway #2), each with a length of around 2 km. The acoustic array was located to the south eastern end of the two runways. An array of 49 microphones was deployed in a fractal pattern of 7 groups of 7 smaller arrays (Fig. 6). Each small array comprised a microphone at its centre (height 2 m) and two sets of three microphones (height 0.15 m) at radii of 1 m and 5 m, each separated in angle by 120 . The 7 smaller arrays were themselves arranged in a similar pattern of equilateral triangles, the inner triangle having 50 m sides, the outer 100 m sides. The position of each microphone was located using real time kinematic carrier phase differential global positioning system, which has a 1r accuracy of 60.03 m.
The sound fields at each array were measured using ECM800 10 mV/Pa condenser microphones sampled at 44.1 kHz using an 8-channel, 24-bit Data Acquisition (DAQ) recorder with 107 dB spurious free dynamic range. Accurate time stamping of the data were obtained from a GPS-derived one pulse per second (1PPS), sampled using channel one of the DAQ.  Table I were mounted with an iMet XQ UAV sensor, which recorded the UAV's GPS location and local meteorological factors (temperature, pressure, relative humidity) at a rate of 1 Hz. The GPS data from the iMet sensor were used as ground truth for the UAV flight trajectories.

V. RESULTS AND DISCUSSION
Once the data acquisition was completed, the signals were processed using both the narrowband and broadband techniques. Note that as the narrowband processing is only suitable for UAVs with strong harmonic signals, the technique was only applied to FS1. For the other two flight scenarios, there are not obvious fixed harmonic tones. This was because, despite FS2 having a petrol engine, the changing demands on the engine during flight resulted in an acoustic signature with high variability in the dominant frequencies.
Consequently, FS2 and FS3 were not suitable for narrowband processing. The broadband technique, however, was applied to all three flight scenarios. The processing details and results, including the improvements obtained by applying the proposed bio-vision technique, are described below.

A. Narrowband technique
The data from each microphone was pre-filtered through a low-pass finite impulse response (FIR) anti-aliasing filter (AAF) with a passband cut-off of 5 kHz, passband ripple of 0.1 dB and stop band attenuation of 100 dB, prior to downsampling by a factor of 5. The spectrograms were obtained through time-frequency analysis, with a digital Fourier transform block size of 4096 samples, 75% overlap between two consecutive data blocks, Hann windowing, and 2 times zeropadding. When combined with the bio-processing model the BIV parameters were set as f c1 ¼ 1 Hz, I mid1 ¼ 0:02; f max ¼ 2 Hz, f min ¼ 0:5 Hz, g max ¼ 40; f c2 ¼ 1 Hz, f c3 ¼ 1 Hz, I mid2 ¼ 0:02. These parameters were selected using an empirical process and were not optimised against any quantifiable criteria. Figure 7(a) shows the normalized (with respect to maximum power spectral density) spectrogram of FS1 (Matrice 600 with acoustic payload) obtained from the first channel of the microphone array RS1, while Fig. 7(b) shows the same spectrogram after bio-processing. With the help of adaptive filtering and nonlinear transforms, the bioprocessing has enhanced the related acoustic harmonics and suppressed the unrelated noise. Two particular regions, which correspond to ranges of <300 m and around 1000 m from the array and are marked as Z1 and Z2 in Fig. 7, are expanded to more clearly show the improvement of the bio-processing approach. Z1 represents the high-frequency region when the UAV was near the acoustic array. In this case, the narrowband harmonics were low in power, but quickly varied due to the high harmonic order. Z2 denotes the low-frequency region when the UAV is far away from the array.  the normalized spectrogram, respectively. Some harmonics are barely visible as their amplitudes are insufficient to be clearly identified from the background. Figures 7(d) and 7(f) are the same two regions after BIV processing. The former illustrates a more distinct, clearer set of harmonics up to t ¼ 20 s and the latter demonstrates an obvious harmonic signal around t ¼ 90 s. Both the original spectrogram and the one processed using BIV were then passing through the same cepstrum filter, which removed the spectral signal with quefrency range jqj < 0:01 s to eliminate the influence of wind noise and Lloyd's mirror. As described in Sec. II, the Doppler frequency (pitch) received by each microphone was obtained by searching the peaks in the HPS algorithm. For each small microphone array, the pitch data were complemented by the first four channels (Ch1-Ch4), since they were located very close to each other. Figures 8(a) and 8(b) depict the estimated pitch frequency for the acoustic signal of FS1 obtained from RS1 without and with bio-vision processing, respectively. To quantitatively evaluate the improvement of our proposed method, we adopted the peak signal-to-noise ratio (PSNR) n, which is defined as n ¼ 20 log 10 maxðIÞ where maxðIÞ is the maximum of the intensity and RMSE ðn I Þ is the root mean square error of the intensity noise. The data points in Fig. 8 were used for flight parameter estimation by the non-linear least squares (NLS) regression given byk wherek ¼ ½V ;ŝ c ;ĥ;d c ;â c T are the estimates of k; w m ðt k Þ is the weighting function related to the PSNR n of the m-th microphone of time step t k . The NLS fitting results are marked as the red solid lines, which stop at the low tracking confidence regions with severe data deviations and PSNRs lower than n ¼ 6 dB. The pitch estimated by traditional methods has the highest PSNR of 20.6 dB and the maximum visible period of 70.9 s, as shown in Fig. 8. As a contrast, the PSNR with BIV processing can be as high as 30.1 dB with a maximum visible period of 96.2 s, resulting in a 10 dB improvement on PSNR, and a 35.7% improvement on the maximum visible period. The flight parameters estimated from the NLS regression are depicted in Table II. The error values are 1r for the traditional and bio-vision methods and derived from the iMet GPS sensor performance envelope for iMet (ground truth) data. The estimates of flight parameters appear slightly biased, mainly because of the wind noise. Compared with the iMet sensor data, however, the estimation of both traditional and bio-vision processing are acceptable.
The tracks of the UAV trajectory estimated by the acoustic array are superimposed onto the satellite photograph (Fig. 9). The red triangles are the locations of the small microphone arrays (RS1-RS7). In Fig. 9(a), the gray circles are the trajectory measured by the iMet XQ sensor GPS records, while the blue circles represent the UAV measured acoustically as it travels about 1134.5 m along the Runway #2, corresponding to a maximum slant range of 1147.5 m. As a comparison, the green circles in Fig. 9(b) show the measured UAV trajectory up to 1509.1 m, corresponding to a slant range of 1519.2 m. This indicates that for flight scenario 1, the maximum detection range was improved by approximately 33%.

B. Broadband technique
For the broadband technique, the acoustic data were visualized as correlograms. For each small microphone array, 7 channels (Ch1-Ch7) formed N p ¼ 21 sensor pairs. The correlograms were implemented through the GCC-PHAT algorithm in the frequency domain, with a FFT window size of 8192 points, 2 times zero-padding, 50% overlapping, and Kaiser windowing. Note that there was no downsampling in this stage. The time step Dt of the correlograms was 0.093 s, resulting in a frame rate f r of 10.77 Hz. When  processing with the bio-vision, the input images were the correlograms, and the BIV parameters were set as f c1 ¼ 1:5 Hz, I mid1 ¼ 0:7; f max ¼ 2:5 Hz, f min ¼ 0:1 Hz, g max ¼ 10; f c2 ¼ 1:5 Hz, f c3 ¼ 1:5 Hz, I mid2 ¼ 0:7. Once again, these parameters were not optimised against any objective criteria. Figures 10(a) and 10(c) show the correlograms of the first microphone pair (Ch1-Ch2) and the last microphone pair (Ch6-Ch7), respectively. The former represents the minimum distance between two microphones, the latter the maximum. The two corresponding correlograms processed using bio-vision are shown in Fig. 10(b) and 10(d). It is worth noting that the correlation peak has a higher amplitude and is temporally prolonged due to the amplification and filtering present in the first stage of the bio-inspired processing chain. There is also a "shadow" after the fast varying correlation peaks [as in Fig. 10(d)]. This is mainly due to the low-pass filter in the divisive feedbacks stages of the bio-vision model. However, the influence of this effect is negligible, since the delays were estimated through the peak searching of the correlograms and if anything increased the local contrast of said peaks.   with BIV processing colour-coded according to peak signal-to-noise ratio (PSNR). Not only did the bio-vision processing improve the PSNR of the correlograms, it also made the tracking of the peak more coherent, even at low PSNRs. calculated via Eq. (25). With the traditional broadband method, the maximal PNSR is about 30 dB, while the bio-vision achieved a PSNR improvement of around 10 dB. Figure 11(a) illustrates that without bio-processing, the correlogram peaks oscillate violently after 54.7 s, while with bio-processing, they remain stable up to 85.2 s. This result indicates an improvement in the tracking of 56%, which is even higher than the improvement for the narrowband processing.
The GCF from the correlograms were also computed according to Eq. (13). Figure 12 shows the GCF of FS1 with and without bio-vision at time 6 s and 80 s, respectively. The red crosses show the estimated bearing and elevation angles, and the white lines the trajectory traces obtained from the iMet data. This demonstrates that the BIV processing resulted in higher contrast and more accurate peaks in the GCF making the estimation of bearing and elevation more accurate at longer ranges. When the UAV was near the acoustic array, i.e., T ¼ 6 s, the crosses for the traditional (a) and bio-processed (b) lie on the iMet trace, indicating both provide an accurate estimate of h and /. Note that the highlighted area of Fig. 12(b) is larger than that in (a), which was mainly due to the prolonged low-pass filtering effect in the BIV (as in Fig. 10). When the UAV was far away from the acoustic array, i.e., T ¼ 80 s, traditional GCF lost track and provided an incorrect estimate, as in Fig. 12(c). However, with bio-vision processing the UAV was still visible in Fig. 12(c), and the estimate matches well with the iMet trace.
Similar broadband processing was conducted for all three flight scenarios (FS1-FS3). The flight parameters were estimated through the NLS regression given bŷ The flight parameters obtained from the NLS regression are shown in Table III. For all flight scenarios, an accurate estimate of flight parameters could be obtained using both traditional and bio-vision methods, although in a few cases the accuracy of the bio-vision method is worse than that of the traditional approach. However, these differences in accuracy were minor and are traded against large detection range extensions. The reason for the discrepancies between the BIV and traditional results is likely due to use of lowpass filters by the former, which induces a small phase delay within the data.
As in the narrowband technique, we plot the track of UAV trajectory by broadband technique with the satellite photograph as in Fig. 13. With the traditional method, the maximum detectable slant range R max of the Matrice 600 (FS1), Skywalker X-8 (FS2), and Mavic Air (FS3) were 915.7 m, 901.1 m and 258.7 m, respectively. After bioprocessing, the maximum detectable range of these FSs were 1360.5 m, 1204.7 m, and 336.7 m, indicating distance improvements of 48.6%, 33.7%, and 30.2%, respectively (see Table IV). It is worth noting that the Mavic Air has a much shorter detection range because of its smaller size, lower signal intensity, and higher spectral signature compared with the other medium size UAVs.
Although all figures shown in this paper relate to flights beginning with the UAV starting close to the microphones (i.e., the SNR declines from that point on, and at towards the end of the run any angular change is small), overall the analysis is drawn from both the outward and the return legs of the flights, i.e., it includes trajectories where the SNR starts low and increases as the target approaches the observer. This was to eliminate any potential influence of "track extrapolation." In contrast to many traditional imaging systems that operate with a single global or regional gain and attempt to capture the world as faithfully as possible, the BIV operates at multiple local time scales, uses pixel-wise integration and manipulation, and employs self-adapting non-linear feedback between its stages. This enables it to process all parts of the data in parallel, whilst simultaneously allowing scene-independent adaptations between its components as there is no concept of spatial (and thus spectral) structure in the initial processing stages. Data elements considered dynamic are accentuated, whilst static ones are condensed. This allows the huge dynamic range of the real world to be compressed into manageable bandwidth for optimal information transmission and downstream processing across a diverse range of environments. Consequently, although extraneous coherent noise sources (e.g., interferers such as petrol generators) will be accentuated relative to any background noise, consistent noise sources (such as a constant generator) will be suppressed relative to variable sources (such as a moving UAV), in the same way moving objects are enhanced relative to stationary ones within the visual system. Consequently, unless the temporal-spectral properties of extraneous signals completely overlay those of the UAV targets, in which case they are indistinguishable, the BIV processing will make it easier to discriminate interferers from the signals of interest. A detailed examination of  the topics of the influence of interference and track extrapolation is beyond the scope of this paper and will be published elsewhere.

VI. CONCLUSION
This paper presents the use of a bio-inspired signal processing technique for detecting the acoustic signature of UAVs. Two standard time-frequency processing methods, based on narrowband and broadband techniques, were considered. Such approaches are commonly used by other researchers in this field, and the ranges reported (prior to the addition of any BIV signal conditioning) are similar to other publicly reported findings for such experiments. [16][17][18][23][24][25] The photoreceptor model of the insect vision system was applied in conjunction with both these traditional methods. Field trials using three different types of UAV (fixed and rotary wing), and various flight scenarios show that for narrowband processing the bio-vision technique improved the maximum detection range by a factor of 33%, while for broadband processing the bio-inspired method achieved range extension of between 30% and 49%, depending on the UAV model/type and flight scenario.
Recently BIV processing has been shown to greatly increase the detection range of UAVs in both visual 62 and infrared 37 data. However, this is the first time such a finding has been translated to acoustic detection.
Compared with the traditional methods, the bio-vision method also achieves comparable accuracy in flight parameter estimation, indicating that the proposed method is accurate and reliable. Since BIV is a pre-processing (signal conditioning) technique it augments, not replaces, existing detection and tracking methods. This means that BIV can be integrated with other more complex UAV detection algorithms. Furthermore, since it is causal and made up only of relatively simple mathematical operations, BIV is also suitable for real-time applications. Optimisation of BIV parameters against a defined goal would likely lead to a further increase in performance, as has been observed in a different context. 63 However, such improvements are beyind the scope of this paper. Future work includes verification using more UAVs and flight scenarios, fusion of the narrowband and broadband techniques, including further components in the BIV processing pathway, 64 and application of the BIV to the real and imaginary components of the analytic signals which would allow more accurate determination of the mainlobe.