Model-based Bayesian analysis in acoustics – A tutorial

Bayesian analysis has been increasingly applied in many acoustical applications. In these applications, prediction models are often involved so as to better understand the process under investigation by purposeful learning from the experimental observations. When model-based data analysis is carried out within the Bayesian framework, issues related to incorporating the experimental data and assigning probabilities into the inferential learning procedure need fundamental consideration. This paper introduces Bayesian probability theory on a tutorial level, including fundamental rules for manipulating probabilities and the principle of maximum entropy for assigning the probabilities required prior to the data analysis. This paper also employs a number of examples recently published in this journal to explain the detailed steps of applying model-based Bayesian inference to the solution of acoustical problems.


I. INTRODUCTION
Many recent Bayesian applications in acoustics hint at a 'Bayesian revolution' in science and engineering (Ballard et al., 2020; Landschoot and Xiang, 2019; Lee et al., 2020; Martiartu et al., 2019; Nannuru et al., 2018). However, the logical foundations of Bayesian methods cannot be put in fully satisfactory form until the classical problem of arbitrariness (sometimes called 'subjectivity') in assigning prior probabilities is resolved (Jaynes, 2003). These recent Bayesian applications have already reached a level where the problem of prior probabilities can no longer be ignored.
This paper describes model-based Bayesian analysis at an introductory level for the readers of this journal, sacrificing some rigor for clarity so as to keep the mathematical handling of Bayesian probability methods as light as feasible. Model-based approaches rely crucially on parametric models, which can be derived from physical/acoustical principles but can also be phenomenological or numerical. Another class of Bayesian data analysis, which does not necessarily rely on parametric models, the so-called nonparametric Bayesian analysis (Müller et al., 2015), is outside the scope of this paper.
This tutorial paper is organized as follows: Sec. II introduces the logical concept of probability in the Bayesian view. Sec. III discusses two simple examples within the Bayesian framework. Sec. IV applies the principle of maximum entropy to assign prior probabilities. Sec. V explains two levels of Bayesian inference. Sec. VI presents a number of recent applications of model-based Bayesian analysis, and Sec. VII summarizes the paper.

II. BAYESIAN PROBABILITY
When considering the interpretation of probability, the debate between the two main schools of statistical inference, the Bayesian school and the frequentist school, is now in its third century. Frequentists consider probability to be a proportion in a large ensemble of repeatable observations (Cox, 2006). This interpretation was dominant in statistics until recent decades (McGrayne, 2011). But in considering, for instance, the probability of the mass of the universe falling between certain bounds, the probability cannot be interpreted in terms of frequencies of repeatable observations, because there is only one universe. The Bayesian school [henceforth just 'Bayesian' (Fienberg, 2006)] views probability instead as a degree or strength of implication: how strongly one thing implies another (Garrett, 1998; Keynes, 1921). Carnap (1950), taking a similar view, considered it the degree of confirmation (or conclusion) of a hypothesis on the basis of some given evidence (or premises). In detail, p(B|A) represents how strongly the assumed truth of one binary proposition A implies the truth of another, B, according to the relations known between the things that the propositions refer to; in this expression the proposition in question (B) appears first, to the left of the vertical line or 'conditioning solidus', and the proposition held to be true (A) appears to its right. Use of implication or confirmation, rather than belief, does away with common criticisms of Bayesianism relating to psychology, and also demonstrates clearly that all probabilities are conditional: they depend on at least two propositions, one of which is the conditioning information, so there is no such thing as an unconditional probability (Carnap, 1950; de Finetti, 2017). Furthermore, since propositions obey an algebra, Boolean algebra, and since propositions are the arguments of probabilities, an algebra for the probabilities follows from the algebra of propositions. This algebra turns out to be the sum and product rules (Cox, 1946), and this is their deepest rationale. Herein lies the justification for calling the degree of implication 'probability': it obeys the two 'laws of probability', and it is what is actually needed in all problems involving uncertainty.
If there were objections to this meaning of 'probability', Bayesians would not waste time on semantics but would get on with calculating the degree of implication, because it is what is needed in solving any real problem, regardless of the name attached to it. Below, 'probability' should be understood as shorthand for 'degree of implication'. Note that a state of knowledge, or state of information, can be quantitatively encoded in probabilities, that is, in degrees of implication.
Probability is a real-valued quantity ranging between 0 and 1, with p(B|A) = 0 meaning that the truth of A implies the falsehood of B with certainty, and p(B|A) = 1 meaning that the truth of A implies the truth of B with certainty. Bayesian probability theory is not restricted to applications involving a large number of repeatable events or so-called random variables. Using Bayesian probability it is possible to reason in a consistent and rational manner about single events (such as the mass of the universe) when the information needed for certainty is lacking; this is so-called inductive reasoning (Jaynes, 2003; Keynes, 1921). At the same time, Bayesian probability can also be applied in the 'frequentist' case of repeated trials with an uncontrolled variable; the extent of control is included in the conditioning proposition. In the case of repeated trials, the value of the probability of an outcome is often numerically equal to the relative frequency (proportion) of that outcome, but the concepts are distinct.
The frequentist view is associated with 'randomness' about which member of an ensemble is chosen by nature in a 'random process'. But when people speak of a random process (perhaps yielding a 'random number'), they really mean a process which they believe nobody can work out how to predict. That statement is as much about human ingenuity as about the actual process, for a smarter person might work out how to analyze the process better or gather further information about it. Randomness is not intrinsic to a system, which is why mathematicians have not been able to agree on any definition of it. Accordingly, the Bayesian view downplays the notion (Jaynes, 2003).

A. Product and Sum Rules
Given the probabilities of various propositions, how can the probability of any compound proposition assembled from them, using the Boolean operations of logical product, logical sum, and negation, be calculated? Two relations turn out to be enough to decompose the probability of any compound proposition. These are known as the product rule and the sum rule. They follow principally from the associativity property of the Boolean logical product, by decomposing the probability of the logical product of three propositions in differing ways and equating the results (Cox, 1946, 1961). For their interesting history and a full tutorial derivation, see Jaynes (2003).
Product rule: given two propositions A and B, the probability of the logical product AB (i.e., the probability of both being true) is given by

p(AB|Z) = p(A|B, Z) p(B|Z),   (1)

where Z is a proposition specifying the 'background information'. This rule is named after the product of probabilities on its right-hand side, although it is best understood as specifying how to move a proposition (here, A) across the conditioning solidus. Inserting a comma between the propositions in a logical product, for instance p(A, B|Z), often makes the probability expression clearer where they appear as an argument of a probability, and in particular where continuous variables are involved (as discussed later in Sec. II B), so as to maintain consistency with conventional mathematical notation for functions of multiple arguments. If knowledge of whether or not A is true has no bearing on one's knowledge of whether B is true, so that

p(B|A, Z) = p(B|Z),   (2)

then the product rule reduces to

p(A, B|Z) = p(A|Z) p(B|Z).   (3)

In this case, A and B are said to be logically independent of each other, given Z.
The product rule for two propositions can be generalized to more propositions. In particular, if multiple propositions A, B, C, ... are logically independent of each other, given Z, then

p(A, B, C, ...|Z) = p(A|Z) p(B|Z) p(C|Z) ...   (4)

Sum rule: if the probability p(A|Z) that a proposition A is true is known, given Z, then the probability that it is false, p(Ā|Z), must be calculable from it: p(Ā|Z) is a unique function of p(A|Z), where Ā represents the logical negation of A. The relation between these two probabilities is

p(A|Z) + p(Ā|Z) = 1,   (5)

and is known, from its form, as the sum rule. For any two propositions A, B, it can be shown using the sum and product rules, together with de Morgan's laws (Patrick, 2015) relating the Boolean operations of logical sum, logical product, and negation, that

p(A + B|Z) = p(A|Z) + p(B|Z) − p(A, B|Z),   (6)

where A + B denotes the logical sum ('A or B'). Suppose now that A and B are exclusive, given Z, so that at most one of them is true. Then p(A|B, Z) and p(B|A, Z) are zero, from which it follows that p(A, B|Z) is zero upon decomposing it using the product rule. Hence, in that case,

p(A + B|Z) = p(A|Z) + p(B|Z).   (7)

When a discrete variable is considered, this relation normalizes the probabilities of its values. For instance, if D_k is the proposition that 'the kth face of a die shows', and Z specifies that 'the die has K faces', then

p(D_1|Z) + p(D_2|Z) + ... + p(D_K|Z) = 1.   (8)
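As a numerical sanity check, the product and sum rules can be verified on a small joint probability table for two binary propositions. The following Python sketch uses an arbitrary, made-up table purely for illustration:

```python
import numpy as np

# Joint probability table p(A,B|Z) for two binary propositions.
# Rows index A (False, True); columns index B (False, True).
p_AB = np.array([[0.10, 0.20],
                 [0.30, 0.40]])

p_A = p_AB.sum(axis=1)          # p(A|Z) by summing over B
p_B = p_AB.sum(axis=0)          # p(B|Z) by summing over A

# Product rule: p(A,B|Z) = p(B|A,Z) p(A|Z)
p_B_given_A = p_AB / p_A[:, None]
assert np.allclose(p_B_given_A * p_A[:, None], p_AB)

# Sum rule: p(A|Z) + p(Abar|Z) = 1
assert np.isclose(p_A.sum(), 1.0)

# Extended sum rule: p(A+B|Z) = p(A|Z) + p(B|Z) - p(A,B|Z);
# 'A or B' fails only when both propositions are false.
p_A_or_B = p_A[1] + p_B[1] - p_AB[1, 1]
assert np.isclose(p_A_or_B, 1.0 - p_AB[0, 0])
```

The same check works for any normalized joint table, since the rules are identities of the probability algebra rather than properties of particular numbers.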

B. Marginalization and Probability Density
In general, marginalization is a means of reducing the dimension of a multivariate distribution or density. In Bayesian data analysis particularly, it enables the removal of nuisance parameters. These are variables which, along with the variable of interest, contain relevant information, but which are not themselves of interest. The rule for marginalization follows from the sum and product rules.

Consider the expression

p(A, B|Z) + p(A, B̄|Z).   (9)

Using the product rule to move A across the conditioning solidus in these two joint probabilities, this expression becomes

p(A|Z) [p(B|A, Z) + p(B̄|A, Z)].   (10)

Since the term in square brackets takes the value one by the sum rule, it follows that

p(A, B|Z) + p(A, B̄|Z) = p(A|Z).   (11)

Generalization is routine to the case of a discrete variable θ taking one of a set of K possible values θ_k, k = 1, 2, ..., K, when the value of θ has a bearing on a proposition of interest, A. Denote by Θ_k the proposition that θ takes the value θ_k, so that the set of propositions Θ_k, k = 1, 2, ..., K, is exclusive and exhaustive. Then

p(A, Θ_1|Z) + p(A, Θ_2|Z) + ... + p(A, Θ_K|Z) = p(A|Z).   (12)

Now suppose that the value of θ may have a bearing on another variable, x, taking one of a set of J possible values x_j, j = 1, 2, ..., J. Denote by X_j the proposition that x takes the value x_j, so that the set of propositions X_j, j = 1, 2, ..., J, is exclusive and exhaustive. The preceding relation is true when A is replaced by any X_j:

p(X_j, Θ_1|Z) + p(X_j, Θ_2|Z) + ... + p(X_j, Θ_K|Z) = p(X_j|Z).   (13)

This is a relation between probabilities of propositions.
One can now replace the probabilities by functional forms, and think of p(X_j|Z) as a function of the discrete variable j, denoted p(x_j|Z), and think of p(Θ_k|X_j, Z) as a function of j and k, denoted p(θ_k|x_j, Z).
In the case of continuous variables it is necessary to introduce the idea of probability densities, by applying the sum and product rules to propositions of the type 'the continuous variable takes a value between x and x + dx', and defining the probability of this proposition to be a probability density multiplied by dx,

p(x < X ≤ x + dx|z) = p(x|z) dx,   (14)

where z specifies the given background information. The sums over exclusive and exhaustive alternatives then become integrals. Applying the product rule to the joint density,

p(x, θ|z) = p(θ|x, z) p(x|z),   (15)

and integrating both sides over all possible values of θ gives

∫ p(x, θ|z) dθ = p(x|z) ∫ p(θ|x, z) dθ,   (16)

where the integral on the right-hand side takes the value one by the sum rule. Hence

∫ p(x, θ|z) dθ = p(x|z).   (17)

This result for marginalization can be generalized to multiple alternative propositions as a consequence of both the product and the sum rules; given a joint probability p(x_j, θ_k|y) of K alternative propositions,

p(x_j|y) = ∑_{k=1}^{K} p(x_j, θ_k|y),   (20)

where y specifies that x_j, θ_k are discrete variables, with k = 1, 2, ..., K. Bayes' theorem can be straightforwardly derived from the product rule. Consider again the joint probability p(A, B|Z) in Eq. (1); since the logical product of two propositions is commutative, it is equal to p(B, A|Z), and so by the product rule

p(B|A, Z) = p(A|B, Z) p(B|Z) / p(A|Z).   (21)

A special case of Bayes' theorem was published posthumously in 1763 in the Philosophical Transactions of the Royal Society (Bayes, 1763) through the effort of Richard Price (Hooper, 2013), an amateur mathematician and a close friend of Reverend Thomas Bayes (1702-1761), two years after Bayes had died. While sorting through Bayes' unpublished mathematical papers, Price recognized the importance of an essay by Bayes giving a solution to an inverse probability problem: moving mathematically from observations of the natural world inversely back to their ultimate cause (McGrayne, 2011). The general mathematical form of the theorem is attributable to Laplace (1812), who was also the first to apply Bayes' theorem to astronomy, earth science, and the social sciences.
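Discrete marginalization and the Bayes update can be sketched numerically. In this Python illustration, the prior over θ and the conditional table p(x|θ) are invented purely for demonstration:

```python
import numpy as np

# theta takes K = 3 values; x takes J = 4 values.
rng = np.random.default_rng(0)
p_theta = np.array([0.2, 0.5, 0.3])                 # prior p(theta_k|Z)

# Conditional table p(x_j|theta_k, Z): each row normalized over j.
p_x_given_theta = rng.random((3, 4))
p_x_given_theta /= p_x_given_theta.sum(axis=1, keepdims=True)

# Marginalization: p(x_j|Z) = sum_k p(x_j|theta_k, Z) p(theta_k|Z)
p_x = p_theta @ p_x_given_theta
assert np.isclose(p_x.sum(), 1.0)

# Bayes' theorem after observing x = x_2:
# p(theta_k|x_j, Z) = p(x_j|theta_k, Z) p(theta_k|Z) / p(x_j|Z)
j = 2
posterior = p_x_given_theta[:, j] * p_theta / p_x[j]
assert np.isclose(posterior.sum(), 1.0)
```

The denominator p(x_j|Z) is exactly the marginalization sum, which is why the posterior automatically normalizes to one.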

III. BAYESIAN INVERSION EXAMPLES
An example in seismic/acoustic research is the investigation of earthquakes in California, based on a limited number of globally deployed seismic sensors. Must an ensemble of repeatable, independent devastating earthquakes occur at the same location in order to infer the location of their epicenter?
To introduce model-based Bayesian analysis in acoustic studies, consider a data analysis task common not only in acoustic investigations but in many scientific and engineering fields. This example begins with an acoustic measurement in a room, which records a discrete dataset expressed as the sequence D = [d_1, d_2, ..., d_K]. Figure 1 illustrates the data, which consist of a finite number (K) of observation points. These data represent a sound energy decay process in an enclosed space. Based on visual inspection of Fig. 1, an experienced architectural acoustician will formulate the hypothesis that this set of data points probably represents an exponential decay. This hypothesis can be formulated as an analytical function of time (t), specifically a parametric model,

H(θ, t) = θ_0 + θ_1 exp(−θ_2 t).   (22)

This model contains a set of parameters θ = [θ_0, θ_1, θ_2], in which θ_0 represents the background noise, θ_1 the initial amplitude, and θ_2 the decay constant. The aim is to estimate this set of parameters θ so that the modeled curve (solid line) is consistent with the data points (black dots). The data analysis task is to estimate the relevant parameters θ contained in the model H(θ), particularly the decay coefficient θ_2, from the experimental observations D. This is known as an inverse problem. To highlight the practical importance of Bayes' theorem in the context of data analysis and model-based inference, the propositions in Eq. (21) can be related to concrete experimental data and the parameters of interest. Again, the sound energy decay example above is helpful in relating more general problems to concrete data analysis tasks often encountered in acoustics. Proposition A can be stated as 'the experimental data D took the values stated', namely A = D, and proposition B as 'the set of parameters takes certain values θ', so that B = θ; the experimenter wishes to estimate the actual values of these parameters θ from the experimental data D.
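As a sketch of this setup, the exponential-decay hypothesis and a synthetic dataset D can be generated in Python. The parameter values, observation times, and noise level below are invented for illustration and are not taken from Fig. 1:

```python
import numpy as np

def decay_model(theta, t):
    """Parametric decay hypothesis: background noise theta0 plus an
    exponential decay with initial amplitude theta1 and decay constant theta2."""
    theta0, theta1, theta2 = theta
    return theta0 + theta1 * np.exp(-theta2 * t)

# Synthetic 'measured' data D = [d_1, ..., d_K]: model values plus noise.
rng = np.random.default_rng(1)
t = np.linspace(0.0, 2.0, 200)           # K = 200 observation times (invented)
theta_true = (0.01, 1.0, 3.0)            # invented ground-truth parameters
D = decay_model(theta_true, t) + 0.005 * rng.standard_normal(t.size)

# Residual errors between the data and the model at the true parameters.
residuals = D - decay_model(theta_true, t)
```

The inverse problem discussed in the text is to recover θ from D alone; here θ is known because the data are synthetic, which makes such a sketch useful for testing an inference scheme.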
In addition, there is an important proposition reflecting the relevant background information I known to the experimenter, of the form 'an appropriate set of parameters θ will generate a hypothesized dataset via a well-established prediction model H (hypothesis) which approximates the experimental data well.' Given this background information I, substitution of A = D, B = θ, and Z = I into Eq. (21) yields Bayes' theorem for this data analysis task as

p(θ|D, I) = p(D|θ, I) p(θ|I) / p(D|I),   (23)

[Fig. 2: Model-based parameter estimation scheme as an inverse problem. The inversion to be performed is to estimate a set of parameters encapsulated in a model, based on experimental data and the model itself (the 'hypothesis').]
supposing that p(D|I) ≠ 0. Bayes' theorem represents the principle of inverse probability (Jeffreys, 1965). The significance of Bayes' theorem, as recognized by R. Price, is that an initial implication of θ, based on the background information I and expressed by p(θ|I), is updated by new information, p(D|θ, I), about the probable cause of those items of data. The new information, comprising the data D, comes from experiments.
In the room-acoustic example mentioned above, the quantity p(θ|I) is a probability representing the initial implication of the parameter values from the background information, before taking into account the experimental data D. It is consequently referred to as the prior probability for θ ('prior' for short).
The term p(D|θ, I) reads 'the probability of the data D given the parameters θ and the background information I', and represents the strength of implication that the measured data D would have been generated for a given value of θ. It represents the probability of observing the dataset D if the parameters take any particular set of values θ, and is often called the likelihood function ('likelihood' for short). This likelihood represents the probability of obtaining the measured data D supposing that the model H(θ) holds with given values of its defining parameters θ.
The term p(θ|D, I) represents the probability of the parameter values θ after taking the data values D into account. It is therefore referred to as the posterior probability ('posterior' for short). The quantity p(D|I) in the denominator on the right-hand side of Eq. (23) represents the probability that the observed data occur no matter what the values of the parameters. It can be interpreted as a normalization factor ensuring that the posterior probability for the parameters integrates to unity.
Acousticians who face data analysis tasks from experimental observations are typically challenged to estimate a set of parameters encapsulated in the model H(θ), also called the hypothesis, via Bayes' theorem in Eq. (23). The posterior probability, up to the normalization constant p(D|I), arises by updating the prior probability p(θ|I). Once the experimental data D become available, the likelihood function p(D|θ, I) acts as a multiplicative factor that updates the prior probability, transforming it into the posterior probability (up to a normalization constant). Bayes' theorem thus represents how one's initial assignment is updated in the light of the data. This corresponds exactly to the process of scientific exploration: acoustical scientists seek to gain new knowledge from acoustic experiments. The prior knowledge in many fields of acoustics is the fruit of long development, leading to well-understood hypotheses, i.e., models. These models, as part of the prior knowledge (Candy, 2016), are typically based on generations of learning and education.
In Eq. (23), Bayes' theorem requires two probabilities in the calculation of the posterior probability up to a normalization constant: the prior probability, p(θ|I), and the likelihood function, p(D|θ, I). Use of the prior probability in data analysis has been an element of controversy between the Bayesian and frequentist schools. Frequentists have criticized the 'subjective' aspect of the prior probability involved in the Bayesian methodology [Berger (2006)], since different prior assignments lead to different posteriors according to Bayes' theorem [see Fig. 3 and Cowan (2007)]. Figure 3(a) shows that, if the prior probability is sharply peaked in parameter space, the likelihood function multiplied by the prior probability may give rise to a posterior probability peaked at a significantly different position than either the prior or the likelihood. Sharply peaked probability density functions encode a strong implication that the parameter values fall within certain (narrow) ranges in parameter space. Assignment of the prior probability in this way implies injection of information that may differ from person to person into the data analysis. In the case of Fig. 3(a), the parameters are already known accurately, in which case there is almost no need for an experiment, unless the data can sharpen the peak significantly.
The use of the prior probability is actually a strength of Bayesian analysis, not a weakness: in the extreme case that the experimenter knows the parameters exactly, the prior probability is heaped entirely at a single value and is zero elsewhere, as shown in the example in Fig. 4. Bayes' theorem shows immediately that this feature carries through to the posterior, in accord with intuition but not with frequentist methods of data analysis.
The assignment of the prior probability should proceed from the prior information. If it is not known how to assign a prior distribution from the prior information, as is often the case when limited prior knowledge about the parameter values is available, or when no preference is intended to be incorporated into the data analysis, then a broad prior probability density can safely be assigned, as in Fig. 3(b). Bayesian analysis often involves a broad or flat prior, representing maximal non-commitment to any particular value. Below, the model-based Bayesian analysis proceeds by assigning such a noninformative prior probability.
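The effect of the prior illustrated in Fig. 3 can be reproduced numerically on a one-parameter grid. The Gaussian shapes below are invented stand-ins for a likelihood and two candidate priors, not taken from the figure:

```python
import numpy as np

theta = np.linspace(0.0, 10.0, 1001)
d_theta = theta[1] - theta[0]

# Invented likelihood: the data favor theta near 6.
likelihood = np.exp(-0.5 * ((theta - 6.0) / 0.8) ** 2)

flat_prior = np.ones_like(theta)                          # noninformative
peaked_prior = np.exp(-0.5 * ((theta - 3.0) / 0.3) ** 2)  # sharp prior near 3

def posterior(prior):
    """Posterior on the grid: likelihood times prior, then normalized."""
    post = likelihood * prior
    return post / (post.sum() * d_theta)

flat_post = posterior(flat_prior)
peaked_post = posterior(peaked_prior)

# With a flat prior the posterior peaks where the likelihood peaks;
# the sharp prior pulls the posterior peak away from the likelihood peak.
assert abs(theta[np.argmax(flat_post)] - 6.0) < 0.02
assert 3.0 < theta[np.argmax(peaked_post)] < 6.0
```

This is the whole mechanism behind Fig. 3: the posterior is a compromise between prior and likelihood, weighted by their relative sharpness.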

IV. MAXIMUM ENTROPY PRIOR PROBABILITY
Bayes' theorem takes account of both the prior information about the process (parameters) under investigation, through p(θ|I), and the data observed in the experiment, through the likelihood function, p(D|θ, I). The prior probability encodes the experimenter's initial implication of the possible values of the parameters, and the likelihood function encodes how likely the data are given particular values of the parameters. The parameters are coefficients in a functional form which specifies the model. The prior probability p(θ|I) must be assigned from the available information prior to the data analysis in order to apply Bayes' theorem. Berger (2006) rebutted a common criticism of the Bayesian school arising from the supposedly subjective use of prior probabilities. The first Bayesians, including Bayes (1763) and Laplace (1812), conducted probabilistic analysis using a constant prior probability for unknown parameters. A fundamental technique often used in Bayesian analysis relies on whatever is already known (the prior knowledge) about a probability distribution in order to assign it. This technique, developed in recent decades (Jaynes, 1968), encodes whatever is already known about the probability distribution in mathematical form. This is the maximum entropy method, and it generates a so-called maximum entropy prior probability. Jaynes (1968) applied a continuum version of the Shannon (1948) information-theoretic entropy, which is a measure of uncertainty, to encode the available information into a probability assignment. To assign a probability distribution p(x), the information entropy of this probability, S[p(x)], given by

S[p(x)] = − ∫ p(x) ln [p(x)/m(x)] dx,   (24)

needs to be examined, with

∫ m(x) dx = 1,   (25)

where m(x) is a Lebesgue measure which ensures that the entropy remains invariant under a change of variables (Gregory, 2005). The probability p(x) is assigned by maximization of the entropy in Eq. (24), subject to whatever is known about it as constraints on the maximization process. Knowledge directly about a probability distribution is in a different category from knowledge about samples drawn from it; moments of the distribution provide a good example. In the Bayesian literature (Gregory, 2005; Jaynes, 1968) this technique is termed the principle of maximum entropy, and it provides a consistent and rigorous way to encode such 'testable' information into a unique probability distribution. The principle of maximum entropy assigns the probability distribution as non-committally as possible while satisfying all known constraints on the distribution. The resulting distribution is also guaranteed to be non-negative. The following sections demonstrate how to arrive at two common probability assignments using the principle of maximum entropy. The method of Lagrange multipliers is well suited to solving such constrained maximization problems. Detailed derivations are given elsewhere (Gregory, 2005; Jaynes, 1968; Sivia and Skilling, 2006; Woodward, 1953).

A. Prior probability assignment
Assignment of the prior probability [in Eq. (23)] requires that no possible value of a parameter be preferred to any other, except to the extent necessary to conform to any known constraints on the probability distribution. The following illustration involves a one-dimensional distribution, for simplicity. Normalization is a universal constraint, such that the prior probability density p(x) integrates to unity:

∫ p(x) dx = 1.   (26)

In the absence of further constraints, incorporating the constraint in Eq. (26) into the maximization of the entropy in Eq. (24) with respect to p(x) yields

∂/∂p(x) { − ∫ p(x) ln [p(x)/m(x)] dx + λ [1 − ∫ p(x) dx] } = 0,   (27)

where λ is the undetermined Lagrange multiplier. Solution of Eq. (27) leads to

p(x) = m(x) e^{−(1+λ)}.   (28)

Upon applying the only constraint, Eq. (26), and since the measure is normalized as in Eq. (25), the (undetermined) Lagrange multiplier becomes λ = −1, and

p(x) = m(x).   (29)

In other words, subject only to the constraint of normalization, the probability assignment is equal to the measure. Jaynes (1968) used group theory to show how the measure may be assigned in the case that nothing known to the experimenter distinguishes one value of x from another; for example, if x describes a location, and nothing is known about where. In that case, if the coordinate system describing x were to be shifted by an amount x_0, the state of knowledge about the location should not change, so that

p(x) dx = p(x + x_0) d(x + x_0).   (30)

For this to be true, the measure m(x) must be constant [since m(x) = m(x + x_0) for all x_0], and for a quantity located on an interval (a, b), the probability assignment encoding this state of knowledge is then the (bounded) uniform distribution

p(x|I) = 1/(b − a),  a ≤ x ≤ b,   (31)

so as to fulfill the normalization constraint in Eq. (26) within the finite range (a, b). If the parameter in question represents not a location but a magnitude or size, then relative changes in the scale of the quantity, rather than in its position, should be invariant; it is a so-called scale parameter (Jeffreys, 1946). Mathematically, scaling the quantity by an arbitrary factor c should not change the state of knowledge about the parameter, leading to (Sivia and Skilling, 2006)

p(x) dx = p(c x) d(c x).   (32)

This leads to a functional equation having the solution

p(x) ∝ 1/x,   (33)

also known as the Jeffreys prior, equivalent to a uniform distribution on a logarithmic scale. In this form the Jeffreys prior represents an 'improper' probability, since it is not normalizable over the entire positive domain. Since the posterior probability is given, via Bayes' theorem, by multiplying the prior probability by the likelihood and then normalizing the result, any constant factor by which the prior is multiplied cancels out. As a result, the use of an improper prior is harmless provided that the resulting posterior is normalizable. (If it is not, the experiment was not very well designed!) In summary, encoding the state of 'no knowledge' about the value of a parameter, but with use of its physical meaning, gives the measure for it via invariance arguments; the prior probability, in the absence of testable information, is then equal to the (normalized) measure according to the principle of maximum entropy. The idea of maximum entropy is to distribute the prior probability as non-committally as possible (Gregory, 2005; Jaynes, 1968). The result for a location parameter is a uniform prior, and for a scale parameter it is the logarithmic (Jeffreys) prior.
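The scale invariance that singles out the Jeffreys prior can be checked numerically: the (unnormalized) mass that p(x) ∝ 1/x assigns to an interval (a, b) is unchanged when the interval is rescaled to (ca, cb), which is not true of a uniform density. A small Python sketch, with the interval and scale factor chosen arbitrarily:

```python
import numpy as np

def riemann_mass(f, lo, hi, n=200_000):
    """Simple midpoint-rule integral of f over (lo, hi)."""
    x = np.linspace(lo, hi, n, endpoint=False) + (hi - lo) / (2 * n)
    return np.sum(f(x)) * (hi - lo) / n

jeffreys = lambda x: 1.0 / x            # improper Jeffreys prior

a, b, c = 2.0, 5.0, 7.3                 # arbitrary interval and scale factor
mass = riemann_mass(jeffreys, a, b)
mass_scaled = riemann_mass(jeffreys, c * a, c * b)

assert np.isclose(mass, mass_scaled, rtol=1e-6)
assert np.isclose(mass, np.log(b / a), rtol=1e-6)   # uniform on a log scale

# A uniform density lacks this invariance: its interval mass scales with c.
assert not np.isclose(b - a, c * b - c * a)
```

The equality mass = ln(b/a) makes the 'uniform on a logarithmic scale' interpretation concrete: equal ratios b/a receive equal prior mass.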
This maximum entropy assignment does not involve any consideration that the prior probability distribution must be assigned according to the results of any prior 'random' experiment. Nevertheless, if testable information about the distribution is declared to be known on the basis of such results, it can be incorporated (Jaynes, 1968). As mentioned, this enables wider application of Bayesian analysis to problems where the prior probabilities have no reasonable frequency interpretation. Bayesian probabilistic analysis, equipped with the principle of maximum entropy, represents a quantitative tool capable of dealing with ignorance in the sense of a complete lack of knowledge. The least informative prior probability is consistent with the principle of indifference (Keynes, 1921), or the principle of insufficient reason, as historically applied to Bayesian inference by its pioneers Bayes (1763), Richard Price (Hooper, 2013), and Laplace (1812), among others. The principle of indifference was eventually given a deeper logical justification by Jaynes (1968), based on the principle of transformation groups followed by the principle of maximum entropy. The methods of probability theory constitute consistent reasoning in scenarios where insufficient information for certainty is available, namely, inductive reasoning. Thus, probability is always the appropriate tool for dealing with uncertainty, lack of scientific data, and ignorance.

B. Likelihood function assignment
The likelihood function represents the probability of obtaining the measured data given assumed values of the parameters. Often the mean and standard deviation of the data measurements are of physical relevance to the inference problem for the parameters. For instance, the mean of the distribution might actually be the value of a parameter which is measured directly, and the standard deviation might relate to a physically significant noise process in the measuring apparatus. In that case the likelihood for a single datum is given by maximizing the entropy subject to (unknown) values of the mean and standard deviation. The experimenters know in advance that the model [such as Eq. (22) in the example above] is capable of representing the data well, so that the variance of the residual errors should take a finite value, expressed mathematically as

∫ (x − µ)² p(x) dx = σ²,   (36)

where µ is the mean and σ² is the variance.
To assign the likelihood function, the data are supposed to be defined on a space having uniform measure, for simplicity. (They might be a location measurement, for instance.) The likelihood function p(x) is given by maximizing the entropy (Gregory, 2005; Woodward, 1953)

S[p(x)] = − ∫ p(x) ln p(x) dx,   (37)

subject to the values of the mean and standard deviation (and normalization). With appropriate Lagrange multipliers λ and λ_1, the maximization condition is

∂/∂p(x) { − ∫ p(x) ln p(x) dx + λ [1 − ∫ p(x) dx] + λ_1 [σ² − ∫ (x − µ)² p(x) dx] } = 0,   (38)

which yields

p(x) = exp[−λ_0 − λ_1 (x − µ)²],   (39)

with λ_0 = 1 + λ. The Lagrange multipliers λ_0 and λ_1 can be determined and eliminated in favor of µ and σ by substituting this expression into the constraints, Eqs. (26) and (36), giving (see Appendix A)

λ_1 = 1/(2σ²),  e^{−λ_0} = 1/(σ √(2π)),   (40)

so that

p(x) = 1/(σ √(2π)) exp[−(x − µ)²/(2σ²)].   (41)

It is straightforward to generalize this for discrete variables:

p(x_k) = 1/(σ √(2π)) exp[−(x_k − µ)²/(2σ²)],   (42)

where x_k can take only one of a number (1 ≤ k ≤ K) of specific discrete values. Upon taking into account the finite variance of the errors about the mean in the maximum entropy procedure, on a continuous space of uniform measure, the result is the Gaussian or normal distribution (Gregory, 2005; Jaynes, 1968).
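The maximum entropy property of the Gaussian can be verified numerically: among candidate densities sharing the same variance (here a uniform and a Laplace density, both invented for comparison), the Gaussian attains the largest differential entropy, matching the closed form ½ ln(2πeσ²):

```python
import numpy as np

sigma = 1.0
x = np.linspace(-20.0, 20.0, 400_001)
dx = x[1] - x[0]

def entropy(p):
    """Differential entropy -∫ p ln p dx of a density sampled on the grid."""
    p = p / (p.sum() * dx)              # normalize on the grid
    mask = p > 0
    return -np.sum(p[mask] * np.log(p[mask])) * dx

gauss = np.exp(-0.5 * (x / sigma) ** 2)

w = sigma * np.sqrt(12.0)               # uniform density with variance sigma^2
uniform = (np.abs(x) < w / 2).astype(float)

b = sigma / np.sqrt(2.0)                # Laplace density with variance sigma^2
laplace = np.exp(-np.abs(x) / b)

h_gauss = entropy(gauss)
assert h_gauss > entropy(uniform)
assert h_gauss > entropy(laplace)
assert np.isclose(h_gauss, 0.5 * np.log(2 * np.pi * np.e * sigma**2), atol=1e-3)
```

Any other density with the same variance can be substituted for the uniform or Laplace candidates; the Gaussian remains the entropy maximizer under the mean-and-variance constraints.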
The following discussion specifies this assignment in terms of the data and the model parameters.The residual errors are taken to have a zero mean by a transformation of variables provided that the mean is tied to a model parameter of physical relevance.Then, for residual errors the principle of the maximum-entropy assigns, as in Eq. ( 42), the probability of individual residual error k given the model H, the model parameters θ, an undetermined error standard deviation σ k and the background information I, as In formulating the likelihood function, which is the probability that a particular dataset would be observed given the model parameters, the probability of each datum is equal to the probability of the residual error at each datum p(d k |θ, H, σ k ) = p( k |θ, H, σ k ).The overall likelihood function is then equal to the joint probability of all data points, The individual measurements d k are logically independent of each other: given σ, θ, H, I, the value of one observation indicates nothing about any other.This independence is justified by the principle of maximum entropy as well, since any dependence of one value from any other would lower the entropy (Jaynes, 1968).Consequently, substituting Eq. ( 43) into Eq.( 44) and 'chaining' via repeated application of the product rule (see Eq.( 4) in Sec.II A), lead to where different values σ k of the standard deviation for the kth measurement are allowed for.On the other hand, in most experiments the noise is the same for each measurement, and where In summary, this assignment of the likelihood function incorporates no more than the available information; the experimenter knows a priori that the data model H(θ) describes the experimental data D well such that the residual errors between the model and the data are expressed in a finite error variance σ 2 .The resulting Gaussian likelihood distribution in Eq. 
(42) is then a consequence of the principle of maximum entropy, which is fundamentally different from simply assuming that the probability of the residual errors is Gaussian. The end result in Eq. (47) also follows from the residual errors being logically independent of each other given the conditioning information, which is likewise in accord with the principle of maximum entropy. The assigned distribution ensures no bias for which there is no prior evidence.
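As a concrete numerical illustration, the Gaussian likelihood of Eq. (47) can be evaluated directly from the residual errors; a minimal sketch, in which the function name and the toy data are hypothetical:

```python
import math

def gaussian_log_likelihood(data, model, sigma):
    """Log of the maximum-entropy (Gaussian) likelihood, Eq. (47):
    ln p(D|theta, sigma, H, I) = -(K/2) ln(2*pi*sigma^2) - Q/(2*sigma^2),
    with Q the sum of squared residual errors."""
    K = len(data)
    Q = sum((d - m) ** 2 for d, m in zip(data, model))
    return -0.5 * K * math.log(2 * math.pi * sigma ** 2) - Q / (2 * sigma ** 2)

# The likelihood is largest when the model reproduces the data exactly.
data = [1.0, 2.0, 3.0]
ll_exact = gaussian_log_likelihood(data, [1.0, 2.0, 3.0], sigma=0.1)
ll_off = gaussian_log_likelihood(data, [1.1, 2.1, 3.1], sigma=0.1)
```

Working with the log-likelihood avoids numerical underflow when K is large, since the exponential factor in Eq. (47) quickly becomes smaller than machine precision.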

C. Student's t-distribution
In the above formulation, the likelihood function encodes the extent to which the experimental data D are implied by the values θ of the parameters in the model H(θ). However, the likelihood also contains a hyperparameter, the error standard deviation σ, stemming from the maximum entropy assignment discussed in the previous section. In many applications this is a nuisance parameter of no interest to the analysis. Marginalization (as introduced in Sec. II B) may be applied to the joint probability of the data D and the standard deviation σ (Bretthorst, 1988, 1990). Recent acoustic applications of Bayesian analysis have also utilized a similar marginalization process to remove nuisance parameters (Jasa and Xiang, 2009; Xiang et al., 2011). The following derivation relies heavily on these references, particularly Bretthorst (1988, 1990).
To perform the marginalization [as in Eq. (20)], the joint probability p(D, σ | θ, H, I) of the data D and the standard deviation σ, namely the product of the conditional likelihood in Eq. (47) and the marginal distribution p(σ | θ, H, I) of σ, is integrated over all possible values of σ, resulting in a likelihood free from the error standard deviation:

p(D | θ, H, I) = ∫ p(D | θ, σ, H, I) p(σ | θ, H, I) dσ. (49)

The marginal distribution p(σ | θ, H, I) is regarded as a prior probability for the standard deviation σ, which is a scale parameter as discussed in Sec. IV A above, so that the principle of maximum entropy assigns the Jeffreys prior,

p(σ | θ, H, I) ∝ 1/σ. (50)

Substituting Eq. (47) and Eq. (50) in Eq. (49) and integrating over all possible values of σ results in

p(D | θ, H, I) ∝ (2π)^{−K/2} ∫_0^∞ σ^{−(K+1)} exp[−Σ_k ε_k²/(2σ²)] dσ. (51)

This integral can be performed (see Appendix B) to give a marginalized likelihood function taking the form of Student's t-distribution,

p(D | θ, H, I) ∝ [Γ(K/2)/2] [π Σ_{k=1}^{K} ε_k²]^{−K/2}, (52)

where Γ(·) is the standard Gamma function (Abramowitz and Stegun, 1964), and the residuals are ε_k = d_k − h_k(θ), with h_k(θ) the model prediction at the kth data point. An extended Bayesian parameter estimation scheme is a sequential Bayesian process (Candy, 2015; Carriere and Hermand, 2012; Özdamar et al., 1990), such as the work recently reported on underwater applications (Yardim et al., 2011), in which the parameters encapsulated in the model evolve in time and space, with the data arriving consecutively.
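The marginalized likelihood of Eq. (52) is attractive computationally because σ never has to be estimated. A minimal sketch of its logarithm, with the proportionality constant from the improper Jeffreys prior dropped (the function name is hypothetical):

```python
import math

def log_student_t_likelihood(residuals):
    """Log of the marginalized (Student's t) likelihood, Eq. (52), with the
    error standard deviation sigma integrated out under a Jeffreys prior:
    p(D|theta,H,I) proportional to [Gamma(K/2)/2] * [pi * Q]^(-K/2),
    where Q is the sum of squared residuals (constant factors omitted)."""
    K = len(residuals)
    Q = sum(e * e for e in residuals)
    return math.lgamma(0.5 * K) - math.log(2.0) - 0.5 * K * math.log(math.pi * Q)

# A better-fitting model (smaller residuals) receives a larger marginal
# likelihood, without knowing or estimating sigma.
ll_good = log_student_t_likelihood([0.1, -0.2, 0.15, -0.05])
ll_bad = log_student_t_likelihood([1.0, -2.0, 1.5, -0.5])
```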

V. TWO LEVELS OF BAYESIAN INFERENCE
In many acoustic experiments there are a finite number of models (hypotheses) H_1, H_2, . . ., H_M that compete against one another to explain the data. In the room-acoustic example above, H_1 is specified in Eq. (22) as containing one exponential decay term and one noise term, but the same data in Fig. 1 may alternatively be described by H_2, containing a sum of two exponential decay terms (a double-rate decay) with differing time constants and amplitude coefficients. Which model explains the data better? Figure 5 illustrates this scenario. Each model H_S is governed by a set of parameters θ_S. Model selection is an inverse problem: to infer which of a competing finite set of models is preferred by the data.
In practice, architectural acousticians often expect single-, double-, or triple-rate energy decays (Jasa and Xiang, 2012; Xiang et al., 2011). In the case of such competing models, it would be unhelpful to apply an inappropriate model to the parameter estimation problem (Xiang et al., 2010, 2011). Before undertaking parameter estimation, one should therefore ask: given the experimental data and the alternative models, which model is preferred by the data?
Bayesian data analysis applied to solving parameter estimation problems, as in the example above, is referred to as the first level of inference, while solving model selection problems is known as the second level of inference. Bayesian data analysis is capable of performing both the parameter estimation and the model selection using Bayes' theorem. The following discussion begins with the second level of inference, namely model selection. This top-down approach is logical, as one should determine which of the competing models is appropriate before the parameters governing the model are estimated (Xiang, 2015).

A. Model selection: The second level of inference
Given a set of competing models, the model that best fits the data is not necessarily the best choice for inference. More complex models are capable of fitting the data better than simpler models, but they tend to fit everything, including the noise, and predict nothing accurately, a phenomenon known as overfitting (Jefferys and Berger, 1992; Knuth et al., 2015; MacKay, 2003). To penalize over-parameterization, Bayes' theorem is applied to an arbitrary member H_S of the finite set of models H_1, H_2, . . ., H_M, given the data D; this procedure defers any interest in the values of the model parameters. In this context, the background information I specifies that 'each model of this finite model set describes the data D well.' Bayes' theorem applied to each model H_S in this set of M competing models, given the data D and the background information I, can be written by replacing θ in Eq. (23) by H_S as follows:

p(H_S | D, I) = p(D | H_S, I) p(H_S | I) / p(D | I). (53)

Using Eq. (53), model comparison between two different models H_i and H_j evaluates the so-called Bayes factor K_{i,j} (Kass and Raftery, 1995),

K_{i,j} = [p(D | H_i, I) / p(D | H_j, I)] · [p(H_i | I) / p(H_j | I)],

where 1 ≤ i, j ≤ M; i ≠ j. On the right-hand side of the Bayes factor, the second fraction, termed the prior ratio, represents one's prior knowledge as to how strongly model H_i is preferred over H_j before considering the data D.
Often one is unable to incorporate any prior preference for either of the models, and the principle of maximum entropy, as discussed previously, assigns equal prior probability to each of the M models. In this case, the Bayes factor for model comparison between two different models H_i and H_j relies solely on the posterior ratio between the models, which is consequently equal to the marginal likelihood ratio when the model prior probabilities are uniform. The marginal likelihood p(D | H_i, I) therefore plays a crucial role in Bayesian model selection. For computational convenience, the Bayes factor is expressed on a logarithmic scale with the unit 'decibans' (Jeffreys, 1961),

L_{i,j} = 10 log10(Z_i / Z_j),

with the simplified notations for the Bayesian evidence, Z_i = p(D | H_i, I) and Z_j = p(D | H_j, I). This enables the evidence values (marginal likelihoods) of two models to be compared quantitatively against one another. Among a finite set of competing models, the highest positive Bayes factor L_{i,j} implies that the data prefer model H_i over H_j the most. The Bayes factor can therefore also be applied to rank the finite number of models under consideration.
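In practice the evidences Z_i are computed and stored as natural logarithms to avoid numerical underflow, and the conversion to decibans is a one-line change of base; a minimal sketch (the function name is hypothetical):

```python
import math

def bayes_factor_decibans(ln_Z_i, ln_Z_j):
    """L_ij = 10 * log10(Z_i / Z_j) in decibans, computed from natural-log
    evidences ln(Z_i) and ln(Z_j) so the raw marginal likelihoods, which may
    underflow, never need to be exponentiated."""
    return 10.0 * (ln_Z_i - ln_Z_j) / math.log(10.0)

# A factor-of-ten evidence ratio corresponds to 10 decibans (one ban).
L_db = bayes_factor_decibans(math.log(10.0), math.log(1.0))
```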
Although more complex models may fit the data better, they pay a penalty by spreading some of the prior probability for their parameters over regions where the data subsequently indicate those parameters are extremely unlikely to be. There is therefore a trade-off between goodness of fit and simplicity of model. This can be seen as a quantitative generalization of the qualitative principle known as Occam's Razor, which is to prefer the simpler theory that fits the facts well (Garrett, 1991; Jefferys and Berger, 1992; MacKay, 2003).

B. Parameter estimation: the first level of inference
Once a model has been chosen based on the experimental data, denote it as model H_S; the Bayesian framework is then used with this selected model to estimate its parameters θ_S. For this purpose the model is specified with its parameters in view as H_S(θ_S); for instance, model H_1(θ_1) in Eq. (22) contains one exponential decay term with its parameters collectively denoted as θ_1 = [θ_0, θ_1, θ_2]. Bayes' theorem is now applied as before in order to estimate the parameters θ_S, given the data D and the model H_S. The quantity p(D | I) in Eq. (23) now becomes p(D | H_S), the probability of the data D given the model H_S. In this context, the background information I now includes the fact that the specific model H_S is selected or given. Bayes' theorem for the parameter estimation problem is written as

p(θ | D, H) = p(D | θ, H) p(θ | H) / p(D | H), (58)

where the subscript S and the background information I have been dropped for simplicity. Bayes' theorem here represents how one's prior knowledge about the parameters θ given the specific model H, encoded in p(θ | H), is updated by incorporating the data D through the likelihood p(D | θ, H).
The prior p(θ | H) encodes all of the knowledge about the parameters before the data are incorporated; it is denoted by Π(θ) ≡ p(θ | H) in the following discussion for simplicity. Once the data have been observed or measured, the likelihood p(D | θ, H) incorporates the data to update the prior probability for the parameters. To emphasize that the data are settled once observed, and that the likelihood is a function of the parameter values, it is denoted by L(θ) ≡ p(D | θ, H). The posterior probability for the parameters, p(θ | D, H), encodes the updated knowledge of the parameters in the light of the data.

C. Two levels of inference in one unified framework
The posterior p(θ | D, H) must be normalized so as to integrate to unity over the entire parameter space [see Eq. (26)]. With the notational changes of the previous paragraph, this normalization constraint involves integrating both sides of Eq. (58) over the entire parameter space:

p(D | H) = ∫ p(D | θ, H) p(θ | H) dθ = ∫ L(θ) Π(θ) dθ, (59)

where the integral marginalizes out the parameters θ, which appear in the likelihood p(D | θ, H), by assigning the prior p(θ | H) for the parameters θ [compare Eq. (20)].
At the first level of inference, namely parameter estimation, the quantity p(D | H) plays the role of a normalization constant in Eq. (58). Furthermore, it is identical to the Bayesian evidence in Eq. (53), where the simplified notation Z ≡ p(D | H) is used. This quantity is central to model selection, the second level of inference. Rearrangement of the terms of Bayes' theorem in Eq. (58), with the notational changes, gives

Z · p(θ | D, H) = L(θ) Π(θ), (60)

which shows the logical relationship among the quantities of Bayesian inference (Skilling, 2006): the likelihood function L(θ) and the prior probability Π(θ) are the inputs, while the posterior probability p(θ | D, H) and the evidence Z are the outputs of Bayesian inference. The evidence Z is then needed in the second level of inference, model selection, while the posterior probability is the output for the first level of inference, parameter estimation. Bayesian evidence automatically encapsulates the principle of parsimony and quantitatively embodies Occam's razor (Garrett, 1991; Jefferys and Berger, 1992; Knuth et al., 2015; MacKay, 2003): when two competing theories explain the data equally accurately, the simpler one is preferred. Figure 6 displays the quantitative embodiment of Occam's Razor, in which a sharply peaked posterior distribution at position θ_MAP in parameter space is assumed relative to a broad prior distribution; θ_MAP stands for the parameter set which maximizes the posterior probability, the so-called maximum a posteriori (MAP) estimate. Equation (59) can be simplified (Bishop, 2006; Knuth et al., 2015) as

Z ≈ L(θ_MAP) · Δw_post / Δw_prior, (61)

where the prior distribution is taken as flat with width Δw_prior, so that Π(θ) = 1/Δw_prior. The integral, which is essentially the area under the likelihood curve, is approximated by the peak value L(θ_MAP) multiplied by the width Δw_post of the posterior, since the prior is flat so that the posterior and the likelihood peak at the same location in parameter space; the conditioning 'given the model H' is dropped for notational simplicity. The
logarithm of Eq. (61) yields

ln Z ≈ ln L(θ_MAP) + ln(Δw_post / Δw_prior). (62)

The first term on the right-hand side, the logarithm of the peak value of the likelihood, represents the goodness of the model fit to the data. The second term represents the penalty on over-parameterized models: over-parameterized models give a larger value of the first term, but more parameters cause the prior probability to be distributed over a larger space, shrinking the width ratio. Equation (62) represents a quantitative trade-off between the goodness of model fit and the simplicity of the model. A similar approximation for model selection that has been used recently in some acoustical applications (Dettmer et al., 2011; Steininger et al., 2014; Xiang et al., 2011) is the Bayesian information criterion (BIC) (Schwarz, 1978) and its variation, the deviance information criterion (DIC) (Spiegelhalter et al., 2002). The BIC asymptotically approximates the Bayesian evidence (Schwarz, 1978) under the assumption that a (multi-dimensional) Gaussian distribution can approximate the posterior probability distribution around the global extremum of the likelihood (Stoica and Selen, 2004). Supposing that the dataset involved in the analysis is large (large K), the (base ten logarithmic) inverse BIC for ranking a set of decay models H_1, H_2, H_3, . . . is given (Xiang et al., 2011) by

IBIC_S = 10 [log10 L(θ_MAP) − (N_θ/2) log10 K], (63)

where N_θ is the number of parameters involved in model H_S, and K is the total number of data points involved. The quantity L(θ_MAP) is the peak value of the likelihood, at the location in the parameter space specified as θ_MAP.
As in Eq. (62), the first term in Eq. (63) represents the goodness of the model fit to the data, and the second term represents a penalty on over-parameterized models. The DIC (Spiegelhalter et al., 2002; Steininger et al., 2014) ranks the competing models similarly, yet with a slightly different penalty term. In accordance with the (natural logarithmic) simplified evidence in Eq. (62), the present paper prefers the inverse BIC definition [Eq. (63)], which differs from that in Schwarz (1978) in sign and in the base of the logarithms. The IBIC as defined in Eq. (63) is then in units of decibans, as advocated by Jeffreys (1961).
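A sketch of an IBIC-style ranking following the form of Eq. (63); the function name is hypothetical, and the deciban scaling is the convention adopted in the text:

```python
import math

def ibic_decibans(ln_L_max, n_params, n_data):
    """Inverse BIC in the form of Eq. (63), in decibans: a goodness-of-fit
    term from the peak likelihood L(theta_MAP), minus a penalty that grows
    with the number of model parameters N_theta and (logarithmically) with
    the number of data points K."""
    return 10.0 * (ln_L_max / math.log(10.0)
                   - 0.5 * n_params * math.log10(n_data))

# With an identical peak likelihood, the model with fewer parameters ranks
# higher; only a sufficiently better fit can justify the extra parameters.
simple = ibic_decibans(ln_L_max=50.0, n_params=3, n_data=1000)
complex_ = ibic_decibans(ln_L_max=50.0, n_params=5, n_data=1000)
```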
To calculate the evidence in Eq. (59), the product of the likelihood and the prior probability must be integrated over the entire parameter space. Estimation of the evidence therefore requires substantial computational effort. Often it is necessary to use numerical sampling methods based on Markov chain Monte Carlo (MCMC) approaches, such as a trans-dimensional approach using reversible-jump MCMC (Dettmer et al., 2010; Green, 1995), or nested sampling (Jasa and Xiang, 2005, 2012; Skilling, 2004). In fact, Eq. (60) indicates implicitly that model selection and parameter estimation can both be accomplished within a unified Bayesian framework. Once sufficient exploration has been performed to estimate the evidence, the explored likelihood function multiplied by the prior probability yields the normalized posterior probability, since the evidence is then also estimated. From estimates of the posterior distributions it is straightforward to estimate the mean values of the relevant parameters, their uncertainties in terms of the associated individual variances, and the inter-relationships between the parameters of interest (Xiang et al., 2011).
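To make the integral in Eq. (59) concrete, the crudest possible estimator simply averages the likelihood over draws from the prior. This brute-force sketch (with a hypothetical one-dimensional toy problem) is only viable in very low dimensions; nested sampling and MCMC are the practical tools the text refers to:

```python
import math
import random

def evidence_by_prior_sampling(log_likelihood, prior_sampler, n=20000, seed=1):
    """Crude Monte Carlo estimate of Z = integral L(theta) Pi(theta) dtheta:
    draw theta from the prior and average the likelihood values.  Illustrative
    only; realistic problems require nested sampling or MCMC."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n):
        total += math.exp(log_likelihood(prior_sampler(rng)))
    return total / n

# Toy check: uniform prior on [0, 1] and likelihood L(theta) = 2*theta.
# The exact evidence is Z = integral_0^1 2*theta dtheta = 1.
z = evidence_by_prior_sampling(lambda t: math.log(2.0 * t + 1e-300),
                               lambda rng: rng.random())
```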

A. Modal, Decay, and Direction of Arrival Analysis
This section discusses room-acoustic modal and decay analysis, and direction of arrival estimation using a microphone array. These applications have in common that the prediction models used are in the form of generalized linear models (Bretthorst, 1988; Ò Ruanaidh and Fitzgerald, 1996), consisting in essence of a sum of simple functions.

Room-Acoustic Modal Analysis
This application explores a method to identify multiple decaying modes in room impulse responses measured in existing spaces. Beaton and Xiang (2017) reported a method employing the Bayesian framework, working in the time domain, to identify numerous decaying modes in a room impulse response. The experimental measurement is carried out at one strategic location in the room under investigation using one monophonic microphone. The model describing the room impulse response h(t_k) employed in this application is the so-called Prony model (d. Prony, 1795), a sum of exponentially decaying sinusoids,

h(t_k) = Σ_{s=1}^{S} A_s e^{−6.91 t_k/T_s} cos(2π f_s t_k + φ_s), (64)

where A_s, T_s, f_s and φ_s are the modal parameters: amplitude, decay time, modal frequency, and phase, respectively. The model contains altogether S room modes, namely S sets of parameters {A_s, T_s, f_s, φ_s} with s = 1, . . ., S. Figure 7 compares an experimentally measured room impulse response with predicted data based on the model in Eq. (64) when the number of modes, S, and the modal parameters were known or well estimated. Both levels of Bayesian inference are suited to this application: the model selection estimates the number of modes, S, present in the room impulse response, while the parameter estimation determines the relevant parameters of each mode (e.g. decay time, modal frequency and amplitude) given the selected model.
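A minimal sketch of a Prony-type forward model as in Eq. (64); the envelope factor exp(−6.91 t/T_s) assumes T_s is a 60 dB decay time (an illustrative convention, not necessarily the paper's exact normalization), and the function name and example mode are hypothetical:

```python
import math

def prony_response(t, modes):
    """Prony-type model, Eq. (64): a sum of exponentially decaying sinusoids.
    `modes` is a list of (A_s, T_s, f_s, phi_s) tuples; exp(-6.91*t/T_s)
    makes the amplitude envelope fall by 60 dB over one decay time T_s."""
    return sum(A * math.exp(-6.91 * t / T) * math.cos(2 * math.pi * f * t + phi)
               for (A, T, f, phi) in modes)

# One hypothetical mode: unit amplitude, 0.5 s decay time, 100 Hz, zero phase.
modes = [(1.0, 0.5, 100.0, 0.0)]
h_start = prony_response(0.0, modes)   # full amplitude at t = 0
h_decay = prony_response(0.5, modes)   # about 60 dB down after one decay time
```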

Sound Energy Decay Analysis
Due to recent developments in concert hall design (Jaffe, 2005) and the high variability in chamber-based sound absorption measurements (Balint et al., 2019), there is an increasing interest in the analysis of sound energy decays consisting of multiple exponential decay rates. To meet the need of characterizing energy decays of potentially multiple decay processes, the decay model of Eq. (65), a superposition of S exponential decay terms and a noise term,
has been found to be capable of characterizing multiple-slope decays beyond single-slope and double-slope energy decays, with T_s = θ_{2s} being the sth decay time (parameter); Θ_S = {θ_0, θ_1, . . ., θ_{2S}} then includes all 2S + 1 parameters, with S being the number of exponential decay terms (slopes). The variable t_k represents discrete time, the value t_K is the upper limit of Schroeder's integration, and the term θ_0(t_K − t_k) is associated with the background noise in the experimentally measured room impulse responses (Xiang, 2017a). Within the Bayesian framework, sound energy decays more complicated than those of single- and double-slope nature, such as triple-slope decays, have been identified and characterized (Sü Gül et al., 2016; Xiang et al., 2011). Figure 8 illustrates an experimentally measured sound energy decay function, the so-called Schroeder decay curve, in a monumental mosque, in comparison with the predicted curve based on the model in Eq. (65), where S = 3 decay slopes are identified and the decay parameters are well estimated (Sü Gül et al., 2016). Using the estimated decay parameters, the three exponentially decaying terms and the noise term can be decomposed as depicted in Fig. 8.
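A sketch of the multi-slope decay model family referenced as Eq. (65); the factor 13.8 assumes the T_s are 60 dB energy-decay times, and the parameterization shown (amplitude–decay-time pairs plus a linear noise term) is illustrative rather than the paper's exact normalization:

```python
import math

def schroeder_decay_model(t, t_K, theta0, slopes):
    """Multi-slope Schroeder decay model in the spirit of Eq. (65): a linear
    background-noise term plus S exponential energy-decay terms.  `slopes`
    holds (A_s, T_s) pairs; exp(-13.8*t/T_s) drops the energy by 60 dB over
    one decay time T_s.  The model vanishes at the integration limit t_K."""
    noise = theta0 * (t_K - t)
    decay = sum(A * (math.exp(-13.8 * t / T) - math.exp(-13.8 * t_K / T))
                for (A, T) in slopes)
    return noise + decay

# Hypothetical single-slope example: T = 1 s, noise level 1e-6, t_K = 2 s.
d_start = schroeder_decay_model(0.0, 2.0, 1e-6, [(1.0, 1.0)])
d_mid = schroeder_decay_model(1.0, 2.0, 1e-6, [(1.0, 1.0)])
d_end = schroeder_decay_model(2.0, 2.0, 1e-6, [(1.0, 1.0)])
```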

Estimation of Multiple Directions of Arrival
Estimating the direction of arrival (DoA) of sound sources is an important acoustical problem, often tackled using microphone arrays. Coprime linear microphone arrays represent an innovative sparse sensing technique which extends the frequency range of a given number of array elements by exceeding the spatial Nyquist limit. Whereas initial coprime array theory was derived for an operating frequency dictated by the specific coprime spacing (Vaidyanathan and Pal, 2011; Xiang et al., 2015), recent investigations demonstrate the advantages of broadband beamforming (Bush and Xiang, 2015). Parametric models describing this broadband behavior enable the use of model-based Bayesian inference for estimating not only the source directions, but also the number of sources present in the sound field, which is often unknown prior to the estimation. The following section discusses the DoA analysis recently published by Bush and Xiang (2018) to demonstrate this line of Bayesian applications.
a. Broadband coprime sensing model. Bush and Xiang (2018) demonstrated that for sufficiently broadband beamforming of coprime microphone array data, a generalized Laplace distribution function,

b(θ) = A_0 + Σ_{s=1}^{S} A_s e^{−|φ_s−θ|/δ_s}, (66)

sufficiently predicts the beamforming data, with θ being the azimuth angular variable. A_0 is a constant parameter accounting for the noise floor, and the three parameters per sound source are the amplitude A_s, the angle of arrival φ_s, and the beam width δ_s of each sound source.
The model parameters for S sound sources, Θ_S = {A_0, A_1, . . ., A_S, φ_1, . . ., φ_S, δ_1, . . ., δ_S}, include all the amplitude and angular parameters. Figure 9 illustrates the directional response experimentally measured using a coprime linear microphone array of 16 elements, compared with the predicted one based on the model in Eq. (66), where two simultaneous sound sources are present in the data. Figure 9(b) also illustrates the residual error function when the predictive model in Eq. (66) fits the experimental data well. The residual errors fulfill the condition expressed in Eq. (36) when assigning the likelihood function in Sec. IV B. The DoA model using a coprime microphone array expressed in Eq. (66), like those in Eqs. (64) and (65), represents a class of models, the generalized linear models, consisting of a linear superposition of S nonlinear functions. Escolano et al. (2014)'s DoA estimation using two microphones, the room-acoustic modal analysis (Beaton and Xiang, 2017), and the decay analysis (Xiang et al., 2011) all employ generalized linear models, where Bayesian model selection is applied to estimate the number S of nonlinear functions in their models.
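A minimal sketch of the generalized-Laplace beam model of Eq. (66); the function name and the two example sources are hypothetical:

```python
import math

def doa_response(theta, A0, sources):
    """Generalized-Laplace beamforming model, Eq. (66):
    b(theta) = A0 + sum_s A_s * exp(-|phi_s - theta| / delta_s),
    with sources given as (A_s, phi_s, delta_s) tuples (angles in degrees)."""
    return A0 + sum(A * math.exp(-abs(phi - theta) / delta)
                    for (A, phi, delta) in sources)

# Two hypothetical sources at 60 and 120 degrees; the response peaks at the
# source angles and falls off exponentially with angular distance.
sources = [(1.0, 60.0, 5.0), (0.8, 120.0, 5.0)]
on_source = doa_response(60.0, 0.05, sources)
off_source = doa_response(90.0, 0.05, sources)
```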
b. Spherical harmonic sensing model. Xiang and Landschoot (2019) recognize similar problems using a spherical microphone array in the DoA analysis of sound events. They apply spherical harmonics beamforming to formulate a parametric model predicting multiple sound sources as a superposition

b(Φ) = Σ_{s=1}^{S} A_s g_s(Φ_s, Φ),

with A_s representing the strength associated with the sth sound source. Φ_S = {θ_1, . . ., θ_S; φ_1, . . ., φ_S} are the S sound source directions.

FIG. 10. Logarithm of the Bayesian evidence among competing models for three simultaneous sound sources. Twenty trials of evidence estimation are run per model. In this case, the three-source model correctly shows a significant increase over the one- and two-source models. The four- and five-source models show slight increases in evidence, though not enough to justify their higher complexity (Bush and Xiang, 2018).

Instead of basic nonlinear functions, a normalized sound energy function, g_s(Φ_s, Φ),
exploits the completeness property of the spherical harmonics; with a finite spherical harmonic order, it is expressed by the truncated completeness relation (Williams, 1999), where Φ_s = {θ_s, φ_s} denotes the specific filtering direction, and g(Φ_s, Φ) represents the specific beamforming function oriented towards the direction Φ_s over the angular range specified by Φ. The maximum order N, dictated by the number of microphone channels, determines the sharpness of the beam patterns (Landschoot and Xiang, 2019). This section, Sec. VI A, has dealt with three different applications, yet they have in common that the prediction models are nested analytical expressions in the form of generalized linear models (Bretthorst, 1988; Ò Ruanaidh and Fitzgerald, 1996), consisting in essence of a sum of simple nonlinear functions. Sums with different total numbers of terms, s = 1, . . ., S, lead to the different competing models for these problems. Bayesian model selection, the higher level of inference, is applied to select the suitable model given the experimental data, before the parameters encapsulated in the selected model are inferred. The three applications discussed above employed nested sampling (Skilling, 2004; Jasa and Xiang, 2012) to estimate the evidence in Eqs. (59)-(60).

B. Multi-Layered Porous Media
In physical acoustics and many noise control applications, porous materials are of practical interest. Recently, Chazot et al. (2012) and Roncen et al. (2018) applied Bayesian parameter estimation (the first level of inference) to rigid-frame porous media analysis. When depth-dependent anisotropy of the porous media occurs, multilayered porous absorbers of finite-thickness layers approximate the depth anisotropy, with each layer considered as isotropic. Fackler et al. (2018) reported an application employing the two levels of inference within the Bayesian framework to analyze multilayer porous materials, developing a method to determine simultaneously the number of constituent layers as well as the macroscopic physical properties of each layer.

Multi-Layered Model
In the work by Fackler et al. (2018), the rigid-frame porous materials are modeled with the Miki model (Miki, 1990); this model contains three physical parameters: the flow resistivity σ_f, porosity φ, and tortuosity α_∞ of a porous material. The Miki model predicts the propagation coefficient γ(ω) in Eq. (70) and the characteristic impedance Z_c(ω) in Eq. (71) of such a material (Miki, 1990), with σ_e being the effective flow resistivity of the porous material, j = √−1, and ω the angular frequency. When combining multiple distinct layers into multilayered media, the transfer matrix method is used to model the overall response of the media (Allard and Atalla, 2009; Blauert and Xiang, 2009). A two-by-two transfer matrix T_eq^(m) relates the acoustic pressure and the normal component of the particle velocity between the two sides of the mth layer with thickness d_m,

T_eq^(m) = | cosh(γ_m d_m)            Z_c^(m) sinh(γ_m d_m) |
           | sinh(γ_m d_m)/Z_c^(m)    cosh(γ_m d_m)         |, (73)

where γ_m and Z_c^(m) of the mth layer are given in Eq. (70) and Eq. (71). T_rigid is a rigid termination matrix appended to the end of the transfer matrix chain of the composite material (Allard and Atalla, 2009) to model the mounting and termination of the material against a rigid backing. Matrix multiplication of this chain of distinct multiple T_eq^(m) in Eq. (73), along with the rigid termination, results in the complex-valued surface impedance

Z_s = T_11 / T_21,

with T_11, T_21 associated with the acoustic pressure and the normal component of the particle velocity at the front surface of the potentially M-layer media with a rigid backing (Fackler et al., 2018).
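The transfer-matrix chain can be sketched compactly in code. The layer-matrix form below is the standard cosh/sinh parameterization [as in Eq. (73)]; the function names, and feeding γ_m and Z_c^(m) in directly rather than computing them from the Miki model, are simplifications for illustration:

```python
import cmath

def layer_matrix(gamma, Zc, d):
    """Two-by-two transfer matrix of one layer, standard form of Eq. (73)."""
    c, s = cmath.cosh(gamma * d), cmath.sinh(gamma * d)
    return [[c, Zc * s], [s / Zc, c]]

def matmul2(A, B):
    """Product of two 2x2 matrices."""
    return [[A[0][0]*B[0][0] + A[0][1]*B[1][0], A[0][0]*B[0][1] + A[0][1]*B[1][1]],
            [A[1][0]*B[0][0] + A[1][1]*B[1][0], A[1][0]*B[0][1] + A[1][1]*B[1][1]]]

def surface_impedance(layers):
    """Chain the layer matrices (front layer first) and form Z_s = T11/T21
    for a rigid backing.  `layers` is a list of (gamma, Zc, d) tuples."""
    T = [[1, 0], [0, 1]]
    for gamma, Zc, d in layers:
        T = matmul2(T, layer_matrix(gamma, Zc, d))
    return T[0][0] / T[1][0]

# Consistency check: splitting one homogeneous layer into two half-thickness
# layers must give the same surface impedance (transfer matrices chain).
Z_one = surface_impedance([(1j, 1.0, 0.5)])
Z_two = surface_impedance([(1j, 1.0, 0.25), (1j, 1.0, 0.25)])
```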

Likelihood function
The squared error between the measured (Z_s,meas) and modeled (Z_s,mod) complex-valued surface impedance data as a function of frequency is given as

ε_k² = [Re Z_s,meas(f_k) − Re Z_s,mod(f_k)]² + [Im Z_s,meas(f_k) − Im Z_s,mod(f_k)]²,

where, at each measured frequency point f_k, the real and imaginary parts of the complex surface impedance contribute separately to the error function. The likelihood function is assigned as a Student's t-distribution as in Eq. (52), where the squared errors ε_k² have been summed across all K measured data points over the frequency range of interest. Note that the model-predicted surface impedance Z_s,mod(f_k) is dictated by the parameter vector θ containing 3-4 parameters for each of the potentially M layers, as discussed below.

Parameter priors
Before involving any data, limited knowledge is available about the parameters under study. The prior probability distributions for each parameter encode this state of knowledge into a Bayesian analysis. For realistic porous materials, the physical parameters describing the pore structure fall into broad ranges of physically realistic values. Following the principle of maximum entropy introduced in Sec. IV A, a uniform prior distribution in the form of Eq. (32) is assigned to each of the physical porous material parameters, encoding a lack of specific prior knowledge:

Pr(flow resist. σ_f) = Uniform(0.1, 1000 kNs/m⁴), (78)
Pr(porosity φ) = Uniform(0.1, 1), (79)
Pr(tortuosity α_∞) = Uniform(1, 7). (80)

The material layers experimentally tested are on the order of a few centimeters thick. To remain impartial when considering the layer thickness as an unknown parameter to be estimated, it is assigned the following prior,

Pr(layer thickness d) = Uniform(0.1 mm, 10 cm); (81)

otherwise, it is fixed at the physically measured value.
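In a sampling implementation, the uniform priors of Eqs. (78)-(81) reduce to simple bounded draws; a minimal sketch (the dictionary layout and function name are illustrative, with flow resistivity in kN·s/m⁴ and thickness in metres):

```python
import random

# Uniform prior bounds following Eqs. (78)-(81).
PRIOR_BOUNDS = {
    "flow_resistivity": (0.1, 1000.0),   # kN*s/m^4
    "porosity":         (0.1, 1.0),
    "tortuosity":       (1.0, 7.0),
    "thickness":        (1e-4, 0.10),    # metres (0.1 mm to 10 cm)
}

def sample_layer_prior(rng):
    """Draw one porous-layer parameter set from the uniform (maximum-entropy)
    priors; each draw is one candidate theta for the sampling algorithm."""
    return {name: rng.uniform(lo, hi) for name, (lo, hi) in PRIOR_BOUNDS.items()}

rng = random.Random(0)
theta = sample_layer_prior(rng)
```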
C. Concise Representation of Head-Related Transfer Functions

Head-related transfer functions (HRTFs) represent the directionally dependent responses between an external source and the eardrums of a binaural listener. They are a crucial part of the databases for auditory virtual reality (Blauert, 2001; Xie, 2013) in communication-acoustics applications. Experimental measurements of HRTFs are often represented by several hundred data points in the frequency domain; approximation by a recursive filter can result in substantially less computation and data storage. Botts et al. (2013) represent them by a pole-zero model of digital filters with infinite impulse responses (IIR filters). In order to represent measured HRTFs in a parsimonious form, Botts et al. (2013) applied Bayesian model-based analysis to the filter design in the form of an IIR filter.

Parametric representation
The head-related transfer functions, H(z), can be represented in the frequency domain (Botts et al., 2013) through a zero-pole model

H(z) = G · [Π_l (z − q_l)/(z − p_l)] · [Π_m (z − q_m)(z − q*_m)/((z − p_m)(z − p*_m))],

where z = e^{−jω}; q_l, p_l ∈ R are real-valued zeros and poles, respectively; q_m = r_{q,m} e^{jφ_{q,m}} and p_m = r_{p,m} e^{jφ_{p,m}} are complex-valued; and G is a real-valued gain factor. In the second part of the above expression, the complex-valued zeros, (z − q_m)(z − q*_m), and poles, (z − p_m)(z − p*_m), are represented by magnitudes r_{q,m}, r_{p,m} and phases φ_{q,m}, φ_{p,m} with −1 ≤ cos φ_{p,m}, cos φ_{q,m} ≤ 1. N = L + M is the collective filter order. Botts et al. (2013) apply this model to derive the filter coefficients of concise IIR filters for the HRTFs. The squared error between the complex-valued HRTF data and the model prediction at each frequency point is

ε_k² = |H_meas(ω_k) − H_mod(ω_k)|². (87)

Figure 11 exemplifies one HRTF data set from an experimentally measured data bank (Gardner and Martin, 1995). The two levels of Bayesian inference are formulated so as to estimate concise infinite impulse response filters with the parsimonious filter order N = L + M, also using nested sampling (Jasa and Xiang, 2012; Skilling, 2004) for the filter order selection, followed by the filter coefficient estimation as detailed in Sec. V. The effort benefits virtual auditory reality, given the computational demand of dynamically changing auditory scenes in real time (Xiang et al., 2019; Xie, 2013). The magnitude spectra shown in Figure 11(b) indicate that an IIR filter of order 7 approximates the experimentally measured data concisely yet reasonably well, while order 9 slightly improves the filter response. When considering the trade-off between simplicity of model and closeness of fit, the model selection narrows the alternatives down to filter orders 7 and 9. As in Sec. V A, among the two models the closeness of fit favors filter order 9 (Botts et al., 2013), while the simplicity of the model favors filter order 7.
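A sketch of evaluating a pole-zero model of this kind on the unit circle; the function name and argument layout are illustrative, not the paper's implementation:

```python
import cmath
import math

def hrtf_pole_zero(omega, G, zeros, poles):
    """Pole-zero (IIR) model evaluated on the unit circle, z = exp(-j*omega),
    in the spirit of the zero-pole HRTF parameterization.  `zeros` and
    `poles` are lists of complex numbers; complex-conjugate pairs should be
    listed explicitly so the corresponding filter coefficients stay real."""
    z = cmath.exp(-1j * omega)
    num = 1.0 + 0j
    for q in zeros:
        num *= (z - q)
    den = 1.0 + 0j
    for p in poles:
        den *= (z - p)
    return G * num / den

# With no zeros or poles the response reduces to the gain G.
flat = hrtf_pole_zero(0.3, 2.0, [], [])
```

Listing conjugate pairs explicitly keeps H(−ω) equal to the conjugate of H(ω), the symmetry a real-coefficient filter must satisfy.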

VII. SUMMARY
Experimental investigations are vital for scientists and engineers to better understand underlying theories in acoustics. Scientific experiments are often subject to uncertainties and randomness. Probability theory is the appropriate mathematical language for the quantification of uncertainty and randomness. This tutorial paper emphasizes the logical interpretation of probability: the probability itself quantitatively represents a strength of implication or, equivalently, a state of knowledge. The arguments of the probabilities within the Bayesian school represent propositions expressed by statements, not necessarily limited to large ensembles of repeatable events or random variables. Bayesian probability theory, centered around Bayes' theorem, includes all the calculus necessary to manipulate probabilities, relying essentially on the product and sum rules of probability, which are the same rules as in the propositional calculus, as detailed in Sec. II C.
Bayes' theorem takes the prior probability and the likelihood function as its input quantities. It represents how one's prior knowledge, encoded in the initial assignment, is updated once experimental data are incorporated into the analysis through the likelihood function. Model-based Bayesian inference relies critically on well-understood parametric models, which can be physical, phenomenological, or numerical. The parametric model is considered part of the prior information when incorporating the observed data into the likelihood function; this ensures that the output of Bayes' theorem is a posterior probability distribution based precisely on the available information put into the prior probability assignment, with the information encoded in the prior probabilities utilized rigorously according to the principle of maximum entropy. Section IV introduces in detail how the information-theoretic entropy is maximized to assign the prior probability and the likelihood. The application of the maximum entropy principle also constitutes consistent reasoning in scenarios where the available information is insufficient for certainty. Bayesian probability has therefore been considered the logical tool of inductive reasoning for dealing with uncertainty, lack of scientific data, and ignorance.
Bayesian analysis has recently been applied to an increasing extent in acoustic science and engineering. Many data analysis tasks in acoustics and other fields require two levels of inference: model selection and, within the chosen model, parameter estimation. Model selection quantitatively embodies Occam's razor in favoring the more concise, simpler model; Sec. V C explains this point of view in detail. Bayesian probabilistic methods solve both levels using Bayes' theorem within a unified framework. One of the two outputs of Bayes' theorem, the Bayesian evidence, is the key quantity for the higher level of inference, model selection; the other output, the posterior probability, is responsible for parameter estimation. In demonstrating model-based data analysis within this unified Bayesian framework, the last section (Sec. VI) briefly discusses a number of recent applications in room acoustics, noise control, and communication acoustics. These examples all have in common that well-established models exist for predicting the processes under consideration and that both levels of inference are required to solve the acoustic problems.
Implementation of model-based Bayesian analysis for both levels of inference often requires numerical exploration techniques, so-called random-walk or Markov chain Monte Carlo sampling methods; these are beyond the scope of the current tutorial paper.

VIII. APPENDICES
For completeness, this appendix derives in some detail the maximum-entropy assignment of the likelihood function (Appendix A) and the marginalization of the hyperparameter (Appendix B). The derivations rely heavily on prior work (Bretthorst, 1988; Gregory, 2005; Jaynes, 1968; Xiang and Goggans, 2001, 2003).

IX. ACKNOWLEDGMENT
The author is grateful to Drs. John Skilling, Anthony Garrett, and Paul Goggans for their insightful discussions, and to Drs. Jose Escolano, Tomislav Jasa, Zühre Sü-Gül, Jonathan Botts, Cameron Fackler, Dane Bush, Mr. Douglas Beaton, and Mr. Christopher Landschoot for their collaborative research applications of Bayesian analysis.

FIG. 1. (Color online) Comparison between experimental data for a sound energy decay process and data predicted by a parametric model. The sequence D = [d1, d2, . . ., dK] expresses experimental data values collected in an enclosure, plotted on a logarithmic scale. A parametric model function can be specified by a single exponential decay function H(θ) having three parameters θ = [θ0, θ1, θ2]. When the three parameters θ are estimated well, the model prediction yields the model curve shown as a solid line, whereas poorly estimated parameters yield the mismatched model curves shown by dotted lines.
Figure 2 illustrates this task schematically. A model H(θ) is specified by a functional form containing a set of parameters θ whose values are not known in advance. The task is to go backwards: given the experimental data D and the well-established model form containing the parameters θ, estimate the unknown (hopefully optimal) set of parameters.
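The inverse task just described can be sketched numerically. Purely for illustration, assume a single exponential decay model whose parameters θ = [θ0, θ1, θ2] play the roles of a decay amplitude, a decay constant, and a constant noise floor (this interpretation is this sketch's assumption, not the paper's definition). A brute-force least-squares search over a parameter grid then recovers the parameters from noisy synthetic data:

```python
import numpy as np

def decay_model(t, theta):
    """Single exponential energy decay plus a constant noise floor:
    H(theta) = theta0 * exp(-theta1 * t) + theta2."""
    theta0, theta1, theta2 = theta
    return theta0 * np.exp(-theta1 * t) + theta2

# Synthetic "experimental" data D generated from known parameters.
rng = np.random.default_rng(1)
t = np.linspace(0.0, 2.0, 200)
true_theta = np.array([1.0, 3.0, 0.01])
data = decay_model(t, true_theta) + rng.normal(0.0, 0.005, t.size)

# Going backwards: search a coarse parameter grid for the least-squares
# fit (a stand-in for the posterior exploration discussed in the paper).
best_theta, best_err = None, np.inf
for th0 in np.linspace(0.5, 1.5, 21):
    for th1 in np.linspace(1.0, 5.0, 41):
        for th2 in np.linspace(0.0, 0.05, 11):
            err = np.sum((data - decay_model(t, (th0, th1, th2))) ** 2)
            if err < best_err:
                best_theta, best_err = (th0, th1, th2), err
```

With the small noise level assumed here, the search lands on (or next to) the true grid point; real analyses replace the exhaustive grid with the sampling methods mentioned in the introduction.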
FIG. 3. (Color online) Bayes' theorem [in Eq. (23)] used to estimate a parameter with the same data set incorporated in the likelihood function p(D|θ, I), in which different prior probabilities p(θ|I) lead to different posterior probabilities p(θ|D, I) (Cowan, 2007). (a) A more sharply peaked prior p(θ|I) has a considerable influence on the posterior p(θ|D, I). (b) A flat prior p(θ|I) has almost no influence on the posterior p(θ|D, I).
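The prior's influence illustrated in Fig. 3 can be reproduced with a simple grid calculation. The sketch below assumes a Gaussian likelihood and two illustrative priors, one sharply peaked and one flat, and normalizes prior × likelihood numerically; all numbers are made up for illustration.

```python
import numpy as np

theta = np.linspace(-5.0, 5.0, 2001)   # parameter grid
dtheta = theta[1] - theta[0]

# Assumed Gaussian likelihood, peaked at theta = 1.
likelihood = np.exp(-0.5 * (theta - 1.0) ** 2)

def posterior(prior):
    """Bayes' theorem on a grid: posterior proportional to prior x likelihood,
    normalized so that it integrates to one."""
    unnorm = prior * likelihood
    return unnorm / (np.sum(unnorm) * dtheta)

flat_prior = np.ones_like(theta)
sharp_prior = np.exp(-0.5 * ((theta + 1.0) / 0.3) ** 2)  # peaked at -1

post_flat = posterior(flat_prior)
post_sharp = posterior(sharp_prior)

# The flat prior leaves the posterior mode at the likelihood peak;
# the sharp prior pulls the posterior mode toward its own peak at -1.
mode_flat = theta[np.argmax(post_flat)]
mode_sharp = theta[np.argmax(post_sharp)]
```

This mirrors panels (a) and (b) of the figure: with the flat prior the posterior mode sits at the likelihood peak, while the sharply peaked prior shifts it considerably.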

Figure 3 illustrates Bayes' theorem as shown in Eq. (23). The posterior probability, up to a normalization constant p(D|I), arises from updating the prior probability p(θ|I). Once the experimental data D become available, the likelihood function p(D|θ, I) represents a multiplicative factor that updates the prior probability so as to transform it into the posterior probability (up to a normalization constant). Bayes' theorem represents how one's initial assignment is updated in the light of the data. This corresponds exactly to the process of scientific exploration: acoustical scientists seek to gain new knowledge from acoustic experiments. The prior knowledge in many fields of acoustics is the fruit of long development, leading to well-understood hypotheses, namely models. These models, as part of the prior knowledge (Candy, 2016), are typically based on generations of learning and education. In Eq. (23), Bayes' theorem requires two probabilities in the calculation of the posterior probability up to a normalization constant: the prior probability p(θ|I) and the likelihood function p(D|θ, I). Use of the prior probability in data analysis was an element of controversy between the Bayesian and frequentist schools. Frequentists have criticized the 'subjective'

FIG. 4. Sharply peaked probability density function over the range of frequency. The probability indicates that the parameter is strongly implied to lie in an extremely narrow range of frequencies.

FIG. 5. Model selection, comprising a second level of inference, represents an inverse problem of choosing one of a finite set of models. The model-selection task is to infer which of a finite set of (S) models/hypotheses is preferred according to the experimental data.
In p(H_S|D, I) ∝ p(D|H_S, I) p(H_S|I) of Eq. (53), Bayes' theorem represents how one's prior knowledge about model H_S, encoded in the prior probability p(H_S|I), is updated in the presence of data D, given the background information I. In the context of model selection, the probability p(D|H_S, I) represents the likelihood of the models and is referred to as the marginal likelihood of the data, or the Bayesian evidence [as further explained by Eq. (59) in Sec. V C], while p(H_S|D, I) is the posterior probability of the model H_S, given the data.
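To make the model-level form of Bayes' theorem concrete, the sketch below combines assumed evidence values p(D|H_S, I) for a small set of candidate models with equal model priors to obtain posterior model probabilities. The evidence values are invented for illustration; in practice each comes from marginalizing over that model's parameters.

```python
import numpy as np

# Assumed log-evidences ln p(D|H_S, I) for three candidate models
# (illustrative numbers only; in practice each requires an integral
# over the corresponding model's parameter space).
log_evidence = np.array([-120.0, -112.0, -115.0])
log_prior = np.log(np.ones(3) / 3.0)   # equal model priors p(H_S|I)

# Posterior model probabilities via Bayes' theorem, computed in log
# space for numerical stability and then normalized.
log_post = log_evidence + log_prior
log_post -= np.max(log_post)
post = np.exp(log_post)
post /= post.sum()

# Bayes factor between the two best-supported models.
bayes_factor = np.exp(log_evidence[1] - log_evidence[2])
```

With equal priors the ranking of posterior model probabilities reduces to the ranking of the evidences, which is why the Bayesian evidence is the key quantity at this level of inference.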

FIG. 6. (Color online) Illustration of model selection, quantitatively implementing Occam's razor so as to penalize overparameterized models. The posterior distribution is approximated by the peak value of the likelihood L(θ_MAP) and the distribution width ∆w_post.
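The approximation sketched in Fig. 6 lends itself to a small numerical illustration. Assuming a one-dimensional parameter with a flat prior of width ∆w_prior, and the common evidence estimate Z ≈ L(θ_MAP) × ∆w_post / ∆w_prior, the snippet below shows how a model with a much wider prior range (a more flexible, over-parameterized model) is penalized even when it fits the data equally well; all widths and the peak value are illustrative.

```python
def evidence_estimate(L_max, width_post, width_prior):
    """Evidence approximation Z ~ L(theta_MAP) * dw_post / dw_prior.
    The ratio dw_post / dw_prior is the Occam factor (at most 1)."""
    return L_max * width_post / width_prior

# Two models fitting the data equally well (same likelihood peak and
# posterior width), but the second spreads its prior over a range
# one hundred times wider.
z_simple = evidence_estimate(L_max=1.0, width_post=0.1, width_prior=1.0)
z_flexible = evidence_estimate(L_max=1.0, width_post=0.1, width_prior=100.0)
```

The Occam factor ∆w_post/∆w_prior automatically discounts the flexible model: its evidence is a hundred times smaller despite the identical fit quality.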

FIG. 7. (Color online) Comparison between experimentally measured and model-predicted room impulse responses. Segments of the first 200 ms are compared (Beaton and Xiang, 2017).

FIG. 8. (Color online) Schroeder curve and the model curve derived from impulse responses collected in a monumental mosque, filtered at 250 Hz in field tests. Three decomposed decay slope lines and two turning points are shown (Sü Gül et al., 2016).

FIG. 9. (Color online) Broadband beamforming data and the Laplace model for two simultaneous sound sources (Bush and Xiang, 2018). (a) Broadband directional pattern in response to two noise sources (solid line). The two-source Laplace distribution function model with reasonable values of the model parameters is superimposed onto the experimental data (dashed line). (b) Errors between model and data are finite with zero mean.

FIG. 11. Bayes factors and magnitude spectra of different filter orders (Botts et al., 2013). (a) Bayes factors for filter orders 5-10, in both linear and logarithmic form. (b) Measured and estimated head-related transfer functions of filter orders 5, 7, and 9.