
Signal Estimation and Detection Notes

0. Introduction to Estimation

An estimator is a random variable. As such, its performance can only be completely described statistically or by its PDF.

In the basic case of estimating a single parameter from a single sample, with a two-variable probability density function p(x,θ):

  • Fixing x, we can estimate θ using maximum likelihood estimation.

  • Fixing θ, we can evaluate the estimator by calculating its expectation and variance.

An estimator is considered unbiased when its expectation equals the true value of the parameter for every value of the parameter.

0.1 Unbiased Estimator

An estimator is unbiased when its expectation equals the parameter's true value.

The three key criteria for evaluating an estimator are bias, consistency, and efficiency.

0.2 Sample variance

Suppose x[n] (n = 1, 2, …, N) are i.i.d. random variables with expectation μ and variance $\sigma^2$. The unbiased estimator of the variance is

\[S = \frac1{N-1}\sum_1^N(x[n]- \bar x)^2, \quad\bar x = \frac1N \sum_1^N x[m]\]

Proof:

\[\begin{align*} E[\sum_1^N(x[n]- \bar x)^2] &= E[\sum((x[n]-\mu)-(\bar x-\mu))^2]\\ &= \sum E[(x[n]-\mu)^2-2(x[n]-\mu)(\bar x-\mu)+(\bar x-\mu)^2]\\ &= \sum E[(x[n]-\mu)^2]-2E[(\bar x-\mu)\sum(x[n]-\mu)]+E[N(\bar x-\mu)^2]\\ &= N\sigma^2-NE[(\bar x-\mu)^2]\\ &= (N-1)\sigma^2 \end{align*}\]

while

\[\begin{align*} \text{i.i.d.}&\Rightarrow \mathrm{Cov}_{i,j} = E[(x[i]-\mu)(x[j]-\mu)]=0, \quad i\neq j\\ E[(\bar x-\mu)^2] &= \frac{1}{N^2}E[\sum_i\sum_j(x[i]-\mu)(x[j]-\mu)]\\ &= \frac{1}{N^2}E[\sum_n(x[n]-\mu)^2] = \sigma^2/N \end{align*}\]

ref: Bias of an estimator

For white noise, the spectrum is flat, which means the autocorrelation is a delta function, so the samples are uncorrelated. Since WGN is Gaussian (and stationary), uncorrelated samples are also independent, so WGN has i.i.d. samples.
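As a quick numerical check of the unbiasedness result above, here is a minimal Monte Carlo sketch (assuming NumPy; the values of μ, σ, N, and the number of trials are arbitrary) comparing the 1/(N−1) estimator with the biased 1/N version:

```python
import numpy as np

rng = np.random.default_rng(0)
mu, sigma, N, trials = 1.0, 2.0, 10, 100_000

x = rng.normal(mu, sigma, size=(trials, N))            # i.i.d. Gaussian samples
xbar = x.mean(axis=1, keepdims=True)                   # sample mean per trial

s2_unbiased = ((x - xbar) ** 2).sum(axis=1) / (N - 1)  # divide by N-1
s2_biased = ((x - xbar) ** 2).sum(axis=1) / N          # divide by N

print("true variance          :", sigma**2)            # 4.0
print("mean of 1/(N-1) version:", s2_unbiased.mean())  # ~4.0 (unbiased)
print("mean of 1/N version    :", s2_biased.mean())    # ~(N-1)/N * 4.0 (biased)
```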

0.3 MVUE

Minimum variance unbiased estimators do not, in general, exist. When they do, several methods can be used to find them.

The MSE of an estimator satisfies:

\[\begin{align*} E[(\hat\theta-\theta)^2] &= E[(\hat\theta-E[\hat\theta]+E[\hat\theta]-\theta)^2]\\ &= \mathrm{var}(\hat\theta)+2\underbrace{E[\hat\theta-E[\hat\theta]]}_{=0}(E[\hat\theta]-\theta)+b^2(\hat\theta)\\ &= \mathrm{var}(\hat\theta) + b^2(\hat\theta) \end{align*}\]

So an unbiased estimator with small variance has a small MSE.
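A small sketch of this decomposition (again assuming NumPy; the biased 1/N sample variance from the previous section is used as the test estimator, with arbitrary parameters):

```python
import numpy as np

rng = np.random.default_rng(1)
mu, sigma, N, trials = 0.0, 1.0, 8, 200_000

x = rng.normal(mu, sigma, size=(trials, N))
theta_true = sigma**2                      # parameter being estimated
theta_hat = x.var(axis=1)                  # biased 1/N sample variance (ddof=0)

mse = ((theta_hat - theta_true) ** 2).mean()
var = theta_hat.var()
bias = theta_hat.mean() - theta_true

print("MSE          :", mse)
print("var + bias^2 :", var + bias**2)     # matches MSE up to Monte Carlo error
```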

  • The MVUE requires uniformly minimum variance over all parameter values. An estimator whose variance is minimal only for certain parameter values does not qualify, since the true parameter value is unknown and cannot dictate the choice of estimator.

  • A vector estimator is an MVUE if each of its components is an MVUE.

1. Cramér-Rao Lower Bound (CRLB)

The Cramér-Rao Lower Bound (CRLB) sets a lower bound on the variance of any unbiased estimator, but it is not always attainable. An estimator that reaches the CRLB is called an efficient estimator. For distributions in the exponential family, including the common normal distribution, an efficient estimator typically exists.

1.1 Regularity Condition

The PDF must satisfy the regularity condition for the CRLB to apply:

\[E[\frac{\partial \ln p(x;\theta)}{\partial \theta}]=0\]

1.2 Lower Bound

The Fisher information determines the minimum variance for any unbiased estimator:

\[I(\theta)=E[(\frac{\partial \ln p}{\partial \theta})^2]=-E[\frac{\partial^2 \ln p}{\partial \theta^2}], \quad \mathrm{var}(\hat\theta) \geq 1/I(\theta)\]
  • If θ is a vector, the covariance matrix minus the inverse of the Fisher information matrix is positive semi-definite

    \[I_{i,j}=-E[\frac{\partial^2\ln p}{\partial\theta_i\partial\theta_j}], \quad C-I^{-1}\geq \mathbf{0}\]
  • g(x) is the MVUE with var=1/I(θ) if and only if we can express

    \[\frac{\partial \ln p(x;\theta)}{\partial \theta} = I(\theta)(g(x)-\theta)\]

The Fisher information is built from partial derivatives of the log-likelihood with respect to the parameter. It can be regarded as a measure of the curvature of the log-likelihood function, which is related to the variance of the distribution and to how difficult the parameter is to estimate.

The expectation is taken with respect to p(x;θ):

\[E[f(x)]=\int f(x)p(x;\theta)dx\]

We can deduce from the analysis that as the number of unknown parameters increases, the CRLB also rises, indicating a more challenging estimation problem. Conversely, the greater the influence a parameter has on x, the easier it becomes to estimate.

  • The estimator of a function g(θ) of the parameter satisfies
\[\mathrm{var}(g(\hat\theta))- \frac{\partial g} {\partial \theta}I^{-1}(\theta) (\frac{\partial g}{\partial \theta})^T \geq 0,\quad[\frac{\partial g}{\partial \theta}]_{ij} = \frac{\partial g_i}{\partial \theta_j}\]

which is attained at least asymptotically by g(θ̂) when θ̂ is efficient.
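For a concrete case, consider estimating a DC level A in WGN, for which the Fisher information is I(A) = N/σ² and the sample mean attains the bound. A minimal check of this (NumPy assumed; A, σ, and N are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(2)
A, sigma, N, trials = 3.0, 1.5, 20, 100_000

x = A + rng.normal(0.0, sigma, size=(trials, N))   # x[n] = A + w[n]
A_hat = x.mean(axis=1)                             # sample-mean estimator

fisher = N / sigma**2                              # I(A) for this model
print("CRLB 1/I(A)        :", 1.0 / fisher)        # sigma^2 / N
print("var of sample mean :", A_hat.var())         # ~ sigma^2 / N (efficient)
```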

1.3 Linear Case

For the common case in signal processing, the samples are linear combinations of parameter vectors with AWGN:

The number of parameters should be less than the number of samples, meaning H should have more rows than columns.

\[\mathbf x = H \mathbf\theta + \mathbf w, \quad\mathbf w\sim \mathcal N(0,\sigma^2I)\]

we have

\[\begin{align*} \hat {\mathbf \theta} &= (H^TH)^{-1}H^T\mathbf x\\ C &= \sigma^2(H^TH)^{-1}\\ \hat{\mathbf \theta} &\sim \mathcal N(\theta,C) \end{align*}\]
  • For general (colored) Gaussian noise, whose covariance matrix C is not a scaled identity, we can whiten the model by multiplying both sides of the model equation by D on the left:
\[\begin{align*} D^TD &= C^{-1}, \quad E[(D\mathbf{w})(D\mathbf{w})^T] = DE[\mathbf{w}\mathbf{w}^T]D^T = DCD^T = I\\ \mathbf{\hat\theta} &= (H^TC^{-1}H)^{-1}H^TC^{-1}\mathbf x,\quad C_{\hat\theta} = (H^TC^{-1}H)^{-1} \end{align*}\]
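Here is a minimal sketch of both formulas for a simple line-fit model x[n] = θ₀ + θ₁n + w[n] (NumPy/SciPy assumed; the exponentially decaying noise covariance used to illustrate the colored-noise case is an arbitrary choice):

```python
import numpy as np
from scipy.linalg import toeplitz

rng = np.random.default_rng(3)
N = 50
n = np.arange(N)
H = np.column_stack([np.ones(N), n])     # linear model: x = H @ theta + w
theta = np.array([1.0, 0.2])

# colored Gaussian noise with a Toeplitz (AR(1)-like) covariance
C = toeplitz(0.8 ** np.arange(N))
w = rng.multivariate_normal(np.zeros(N), C)
x = H @ theta + w

# white-noise formula (ignores the correlation)
theta_white = np.linalg.solve(H.T @ H, H.T @ x)

# whitened / colored-noise formula: (H^T C^-1 H)^-1 H^T C^-1 x
Cinv = np.linalg.inv(C)
theta_colored = np.linalg.solve(H.T @ Cinv @ H, H.T @ Cinv @ x)

print("true theta     :", theta)
print("white formula  :", theta_white)
print("colored formula:", theta_colored)
print("estimator cov  :", np.linalg.inv(H.T @ Cinv @ H))
```

Note that the colored-noise formula also returns the estimator covariance directly, matching the CRLB expression above.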

2. Sufficient Statistics

2.1 Definition

A statistic is a function of a set of data. The statistic that contains the least number of elements is called the minimal statistic. A statistic T(x) is sufficient when $p(x\vert T(x)=T_0;\theta)$ does not depend on $\theta$.

2.2 Neyman-Fisher Factorization Theorem

T is a sufficient statistic if we can factor the PDF as:

\[p(x;\theta)=g(T(x),\theta)h(x)\]

Here we are factoring the pdf itself, not the derivative of its logarithm.

2.3 Complete Statistic

A statistic is complete if it yields a unique unbiased estimator. Equivalently, if d represents the difference between two unbiased estimators, then completeness means that $\int d(T)p(T;\theta)dT=0$ for all θ holds if and only if d=0.

2.3.1 Example

Estimate μ when the samples $x[n]\sim\mathcal{N}(\mu,\sigma^2)$ are i.i.d.

\[\begin{align*}p&=\frac{1}{(\sqrt{2\pi}\sigma)^N} e^{-\frac{1}{2\sigma^2}\sum(x[n]-\mu)^2}\\ &=\frac{1}{(\sqrt{2\pi}\sigma)^N}e^{\frac{1}{2\sigma^2}(2\mu\sum x[n]-N\mu^2)}e^{-\frac{1}{2\sigma^2}\sum x^2[n]} \end{align*}\]

We have a sufficient statistic

\[T(x)=\sum x[n]\sim \mathcal N(N\mu,N\sigma^2)\]

Then for all μ and any d

\[\begin{align*} \int d(T)p(T)dT &= \int d(T)e^{-\frac{1}{2N\sigma^2}(T-N\mu)^2}dT=0\\ \Rightarrow f(N\mu) &= d(T)*\exp(-T^2/(2N\sigma^2))=0 \quad\text{(a convolution, zero for all }\mu)\\ \Rightarrow D(f)E(f) &= 0 \quad\text{(Fourier transform)}\\ E(f) > 0 &\Rightarrow d = 0 \end{align*}\]

2.4 Rao-Blackwell-Lehmann-Scheffé Theorem (RBLS)

When T is determined to be both sufficient and complete, the RBLS asserts that the unbiased estimator derived from T is the MVUE. In scenarios where such an estimator is not readily identifiable, the MVUE is determined by the conditional expectation of any unbiased estimator given T.

2.4.1 Example

Find the MVUE of β from x[n] ~ U(0, β) (we skip the proof of completeness).

\[\begin{align*} p(\mathbf{x};\beta) &= \prod \frac{1}{\beta}u(x[n])u(\beta-x[n])\\ &= \frac{1}{\beta^N}u(\min x[n])u(\beta-\max x[n])\\ T(x) &= \max x[n]\\ p(T) &= \frac{d}{dT}\left(\frac{T}{\beta}\right)^N = \frac{N}{\beta^N}T^{N-1}\\ E[T] &= \int_0^\beta Tp(T)dT = \frac{N}{N+1}\beta\\ \hat\beta &= \frac{N+1}{N}\max x[n] \end{align*}\]

In this special case, E[T] and β have a linear relationship, making it easy to find the unbiased estimator.

Its variance is, in fact, smaller than that of the obvious estimator, twice the sample mean:

\[\mathrm{var}(\hat\beta) = (\frac{N+1}{N})^2(E[T^2]-E^2[T])=\frac{\beta^2}{N(N+2)}<\frac{\beta^2}{3N}\]
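A quick Monte Carlo comparison of the two estimators (NumPy assumed; β, N, and the number of trials are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(4)
beta, N, trials = 5.0, 10, 200_000

x = rng.uniform(0.0, beta, size=(trials, N))

beta_mvue = (N + 1) / N * x.max(axis=1)   # (N+1)/N * max x[n]
beta_mean = 2.0 * x.mean(axis=1)          # twice the sample mean

print("means (MVUE, 2*mean):", beta_mvue.mean(), beta_mean.mean())  # both ~beta
print("var (MVUE)  :", beta_mvue.var(), "theory:", beta**2 / (N * (N + 2)))
print("var (2*mean):", beta_mean.var(), "theory:", beta**2 / (3 * N))
```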

Summary of MVUE

To find the MVUE, it is customary to determine the CRLB first. An estimator that attains the CRLB is the MVUE. When the data are a linear combination of the parameters plus AWGN, a closed-form formula gives the estimator.

If the CRLB is unattainable, or the MVUE's variance exceeds the CRLB, we resort to the RBLS theorem. This approach requires a complete sufficient statistic of the data; the unbiased estimator built from it is then the MVUE.

3. Best Linear Unbiased Estimator

Consider a linear model where the noise is characterized by a given covariance structure:

\[\mathbf{x}=H\mathbf{\theta}+\mathbf{w}\]

By employing the Lagrange multiplier technique, with the unbiasedness condition as a constraint and the objective of minimizing variance, the BLUE can be derived:

\[\mathbf{\hat\theta} = (H^TC^{-1}H)^{-1}H^TC^{-1}\mathbf{x},\quad var = (H^TC^{-1}H)^{-1}\]

Notably, the derivation of BLUE does not necessitate knowledge of the PDF of the noise.

3.1 Example

Determine the coefficients of a Finite Impulse Response (FIR) filter that extracts the DC level of a signal embedded in zero-mean Wide-Sense Stationary (WSS) noise. The noise is characterized by its autocorrelation function (ACF) r(k). The solution should minimize the noise power in the output sample at the end of the sequence, under the constraint that H(exp(j0)) = 1, where H is the transfer function (Z-transform) of the FIR filter.

The question is equivalent to finding the minimum-variance unbiased linear estimator of the DC level A in noise with known covariance:

\[\begin{align*} \mathbf{x} &= \mathbf{1}A + \mathbf{w}\\ C_{i,j} &= r(\vert i - j\vert) \end{align*}\]
  • The FIR coefficients are the BLUE weights in reverse order:

    \[\hat A = \sum_k h(k)x(N-1-k)\]
  • The constraint on H enforces unbiasedness:

    \[H(\exp(j0)) = \sum h(k)(\exp(j0))^{-k}=\mathbf{h}^T\mathbf{1} = 1\]
  • Since the noise has zero mean, the output noise power equals its variance; a numerical sketch follows.
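A sketch of the BLUE solution to this example (NumPy/SciPy assumed; the exponential ACF r(k) = 0.9^|k| is only an illustrative assumption):

```python
import numpy as np
from scipy.linalg import toeplitz

N = 16
r = 0.9 ** np.arange(N)            # assumed ACF r(k) of the zero-mean WSS noise
C = toeplitz(r)                    # covariance: C[i, j] = r(|i - j|)
ones = np.ones(N)

# BLUE weights for A in x = 1*A + w:  h = C^-1 1 / (1^T C^-1 1)
Cinv_1 = np.linalg.solve(C, ones)
h_blue = Cinv_1 / (ones @ Cinv_1)

print("sum of weights (unbiasedness, H(e^{j0}) = 1):", h_blue.sum())
print("output noise power (estimator variance)    :", 1.0 / (ones @ Cinv_1))

# the FIR impulse response applies these weights in reverse time order
h_fir = h_blue[::-1]
print("FIR coefficients h(k):", h_fir)
```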

4. Maximum Likelihood Estimator (MLE)

When the PDF is viewed as a function of the unknown parameter (with x fixed), it is termed the likelihood function.

4.1 Definition

The MLE is the maximizer of the likelihood function:

\[\hat\theta = \arg\max_\theta{p(x;\theta)}\]

Since the PDF is positive, we can find stationary points through the derivative of the log-likelihood function:

\[\frac{\partial\ln p}{\partial\theta}=0\]

The likelihood function gives the probability (density) of observing x under different values of θ.

4.2 Properties

When the first- and second-order derivatives of the log-likelihood function exist and the regularity condition is satisfied, the MLE has the following properties:

  • asymptotically unbiased (i.e., consistent: it converges in probability to the true value as N → ∞)

  • asymptotically distributed as $\mathcal N(\theta, I^{-1}(\theta))$

When an efficient estimator exists, i.e., the derivative of the log-likelihood can be factored as I(θ)(g(x)−θ), setting this derivative to zero shows that the MLE coincides with the efficient estimator.

In situations involving multiple unknown parameters, we obtain a set of likelihood equations. In such cases, the estimate of one parameter can be substituted into the other equations, so one parameter is estimated with the help of another.

  • MLE possesses an invariance property. This means that the MLE of a one-to-one function of a parameter is obtained by applying the function to the MLE of that parameter.

    \[\frac{\partial\ln p}{\partial f}=\frac{\partial\ln p}{\partial \theta}\frac{\partial \theta}{\partial f}=0\]
    • When f is not a one-to-one function, we get the MLE of the modified likelihood function

      \[f(\hat\theta) = \arg \max{p(x;f^{-1}(\alpha))}, \quad \alpha = f(\theta)\]

The PDF of a function of random variable?

We can also use numerical methods to find the zeros of the derivative of the log-likelihood function.
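As a sketch of this numerical approach (SciPy assumed), the example below estimates the level A of a DC signal in WGN, whose MLE is known to be the sample mean, so the optimizer's answer can be checked:

```python
import numpy as np
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(5)
A_true, sigma, N = 2.0, 1.0, 100
x = A_true + rng.normal(0.0, sigma, N)

def neg_log_likelihood(A):
    # -ln p(x; A) for x[n] = A + w[n], w ~ N(0, sigma^2), up to a constant
    return np.sum((x - A) ** 2) / (2 * sigma**2)

res = minimize_scalar(neg_log_likelihood)
print("numerical MLE:", res.x)        # ~ sample mean
print("sample mean  :", x.mean())
```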

5. Least Squares

5.1 Definition

The MVUE aims to minimize the MSE between the estimator and the parameter(s). Under a model with additive noise, the LS estimator instead minimizes the squared error between the data and the signal model:

\[\begin{align*} \mathbf{x} &= \mathbf{s}(\mathbf{\theta}) + \mathbf{w} \\ J &= (\mathbf{x}-\mathbf{s})^T(\mathbf{x}-\mathbf{s})\\ \mathbf{\hat\theta} &= \arg\min_\mathbf{\theta} J \end{align*}\]

The LS estimator requires neither the PDF nor the statistics of the noise.

5.2 Linear Case

When the signal is linear in the parameter(s), we can derive a closed-form LSE:

\[\begin{align*} \mathbf{s} &= H\mathbf{\theta}\\ \frac{\partial J}{\partial\mathbf{\theta}} = 0 \Rightarrow \mathbf{\hat\theta} &= (H^TH)^{-1}H^T\mathbf{x} \end{align*}\]

The formula is the same as the MVUE/CRLB formula for the linear model with white noise.

We can understand this through a geometric perspective: we project the N-dimensional data vector onto the P-dimensional subspace spanned by the columns of H. Essentially, the difference between the original vector and its projection must be perpendicular to the subspace's basis vectors:

\[\begin{align*} (\mathbf{x}-H\mathbf{\theta})^TH &= \mathbf{0}\\ \mathbf{x}^TH-\mathbf{\theta}^TH^TH &= \mathbf{0}\\ H^TH\mathbf{\theta} &= H^T\mathbf{x}\\ \mathbf{\hat\theta} &= (H^TH)^{-1}H^T\mathbf{x} \end{align*}\]

The orthogonal projection matrix $P = H(H^TH)^{-1}H^T$ is symmetric, idempotent and singular.

The minimum least-squares error is:

\[\min J = \mathbf{x}^T(I-H(H^TH)^{-1}H^T)\mathbf{x}\]

5.3 Example

Estimate the amplitude and phase of a sinusoid:

\[\begin{align*} s[n] &= A\cos(2\pi f_0n+\phi)\\ &= A\cos(2\pi f_0n)\cos(\phi) - A\sin(2\pi f_0n)\sin(\phi)\\ \alpha &= A\cos(\phi), \quad \beta = A\sin(\phi)\\ \mathbf{s} &= H[\alpha\ \beta]^T \end{align*}\]
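A sketch of this linear LS fit (NumPy assumed; f₀, A, φ, and the noise level are arbitrary). With α = A cos φ and β = A sin φ, the columns of H are cos(2πf₀n) and −sin(2πf₀n):

```python
import numpy as np

rng = np.random.default_rng(6)
N, f0, A, phi, sigma = 200, 0.05, 1.5, 0.7, 0.3
n = np.arange(N)

s = A * np.cos(2 * np.pi * f0 * n + phi)
x = s + rng.normal(0.0, sigma, N)

H = np.column_stack([np.cos(2 * np.pi * f0 * n), -np.sin(2 * np.pi * f0 * n)])
alpha, beta = np.linalg.lstsq(H, x, rcond=None)[0]   # LS solution (H^T H)^-1 H^T x

A_hat = np.hypot(alpha, beta)       # A   = sqrt(alpha^2 + beta^2)
phi_hat = np.arctan2(beta, alpha)   # phi = atan2(beta, alpha)
print("A_hat, phi_hat:", A_hat, phi_hat)   # ~ (1.5, 0.7)
```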

6. Bayesian Estimator

In classical theory, we only require the data distribution for a fixed parameter, $p(x;\theta)$. Bayesian theory treats the parameter(s) as random variable(s) with a prior distribution, so we now have $p(\theta)$ and $p(x\vert\theta)$, which give us the posterior distribution $p(\theta\vert x)$.

The likelihood function does not treat the parameter as a random variable and does not necessarily integrate to 1 over θ.

6.1 Definition

Different statistics of the posterior, namely its mean, median, and mode, give different estimators. They minimize the squared-error, absolute-error, and hit-or-miss costs, respectively.

\[\begin{align*} C_{MMSE} &= (\theta-\hat\theta)^2\\ \min E[C_{MMSE}] &= \min\int (\theta-\hat\theta)^2p(\theta\vert x)d\theta\\ \frac{\partial}{\partial\hat\theta}\int (\theta-\hat\theta)^2p(\theta\vert x)d\theta &= 0\\ \hat\theta &= E[\theta\vert x] \end{align*}\]

and

\[\begin{align*} C_{abs} &= \vert \theta - \hat\theta\vert,\quad \hat\theta : \int_{-\infty}^{\hat\theta}p(\theta\vert x)d\theta=\frac{1}{2}\\ C_{MAP} &= \begin{cases} 1,\quad \vert \theta - \hat\theta\vert > \delta\\ 0,\quad \vert \theta - \hat\theta\vert < \delta \end{cases},\quad \hat\theta = \arg\max p(\theta\vert x) \end{align*}\]
  • The risk in Bayesian estimation is the average value of the cost function, taken over the parameter and all possible values of x:

    \[\begin{align*} R = E[C] &= \int Cp(x)dx\\ R_{MMSE} &= \int(\int (\theta-E[\theta\vert x])^2p(\theta\vert x)d\theta)p(x)dx = E[var(\theta\vert x)] \end{align*}\]

6.2 Properties

  • In a linear model where the parameter has a Gaussian prior, the data and parameter are jointly Gaussian. The posterior PDF is also Gaussian, and its mean and covariance have closed forms:

    \[\begin{align*} E[\mathbf \theta|\mathbf x] &= \mathbf \mu_\mathbf \theta + (C_\mathbf \theta^{-1} + H^T C_\mathbf w^{-1} H)^{-1} H^T C_\mathbf w^{-1} (\mathbf x - H\mathbf \mu_\mathbf \theta)\\ C_{\mathbf \theta|\mathbf x} &= (C_\mathbf \theta^{-1} + H^T C_\mathbf w^{-1} H)^{-1} \end{align*}\]

    The posterior mean, which is also the MMSE estimate, effectively assigns different weights to the prior mean and to the data statistic (see the numerical sketch at the end of this section).

    As the prior variance approaches infinity, the MMSE estimator in the linear model reduces to the MVUE.

  • MAP has properties:

    \[\arg\max_\theta p(\theta\vert x) = \arg\max_\theta p(x\vert \theta)\frac{p(\theta)}{p(x)} = \arg\max_\theta p(x\vert \theta)p(\theta)\]

    If the prior distribution is flat, the MAP estimate is also the maximum of the conditional distribution p(x|θ), resembling the MLE. This also holds as N → ∞.

    • If x and θ are jointly Gaussian, then MAP = MMSE.
    • MAP only commutes with invertible linear transformations.
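A minimal sketch of the posterior mean/covariance formulas above for a scalar DC level with a Gaussian prior (NumPy assumed; all parameter values are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(7)
N, sigma_w = 20, 1.0
mu_theta, sigma_theta = 0.0, 2.0              # Gaussian prior on the DC level A

A = rng.normal(mu_theta, sigma_theta)         # draw the parameter from its prior
x = A + rng.normal(0.0, sigma_w, N)           # x = H*A + w with H = ones

H = np.ones((N, 1))
C_theta_inv = np.array([[1.0 / sigma_theta**2]])
C_w_inv = np.eye(N) / sigma_w**2

C_post = np.linalg.inv(C_theta_inv + H.T @ C_w_inv @ H)
A_mmse = mu_theta + (C_post @ H.T @ C_w_inv @ (x - H.flatten() * mu_theta))[0]

print("true A        :", A)
print("MMSE estimate :", A_mmse)
print("posterior var :", C_post[0, 0])   # < sigma_w^2 / N: the prior adds information
```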

7. Detection

Detection theory, often called signal detection theory, is a framework for measuring the ability to differentiate between informative signals and noise in a system.

7.1 Hypothesis detection

Hypothesis detection in the context of signal detection theory involves testing two competing hypotheses: the null hypothesis that only noise is present, and the alternative hypothesis that a signal plus noise is present.

The commonly used framework in signal detection is Binary Hypothesis Testing, where we have two hypotheses, $\mathcal H_0$ and $\mathcal H_1$, and we need to decide which one to choose based on observed data.

We use the notation $P(\mathcal H_0;\mathcal H_1)$ for the probability of deciding $\mathcal H_0$ when $\mathcal H_1$ is true.

Typically, we want to find a region $R_1$ of data values in which we decide $\mathcal H_1$ is true.

7.2 Neyman-Pearson Theory

We maximize $P(\mathcal H_1;\mathcal H_1)$ (detection) subject to an upper bound on $P(\mathcal H_1;\mathcal H_0)$ (false alarm):

\[\begin{align*} P(\mathcal{H_1};\mathcal{H_1})=\int_{R_1}p(x;\mathcal{H_1})dx &= P_D\\ P(\mathcal{H_1};\mathcal{H_0})=\int_{R_1}p(x;\mathcal{H_0})dx &= P_{FA} \end{align*}\]

The NP theorem states that, given $P_{FA}\leq\alpha$, the region of the data $x$ in which we decide $\mathcal{H_1}$ should satisfy

\[R_1(x):\frac{p(x;\mathcal{H_1})}{p(x;\mathcal{H_0})}>\gamma\]

where $\gamma$ is determined by

\[P_{FA}=\alpha\]

7.2.1 Example

Given N observations $x[n], n=0,1,\dots,N-1$, and the hypotheses

\[\begin{align*} \mathcal H_0: x[n] &= w[n]\\ \mathcal H_1: x[n] &= A+w[n], \quad w[n]\sim\mathcal{N}(0,\sigma^2) i.i.d \end{align*}\]

We now determine the NP test:

\[\begin{align*} \frac{p(x;\mathcal{H_1})}{p(x;\mathcal{H_0})}&=\frac{(\frac{1}{\sqrt{2\pi}\sigma})^N \exp(-\frac{1}{2\sigma^2}\sum (x[n]-A)^2)}{(\frac{1}{\sqrt{2\pi}\sigma})^N \exp(-\frac{1}{2\sigma^2}\sum x^2[n])}>\gamma\\ \ln(\frac{p(x;\mathcal{H_1})}{p(x;\mathcal{H_0})})&=\frac{1}{2\sigma^2}(2A\sum x[n]-NA^2)>\ln \gamma\\ T(x)&=\frac{1}{N}\sum x[n]>\frac{2\sigma^2\ln\gamma+NA^2}{2AN}=\gamma' \end{align*}\]

(assuming A > 0, so dividing by 2AN preserves the inequality direction)

We have

\[T(x)\sim \left\{\begin{array}{l} \mathcal{N}(0,\sigma^2/N),\ \quad\mathcal H_0\\ \mathcal{N}(A,\sigma^2/N),\quad \mathcal H_1 \end{array}\right.\]

We can express $P_D$ in terms of α:

\[\begin{align*} P_{FA}&=\Pr\lbrace T(x)>\gamma';\mathcal{H_0}\rbrace =Q(\frac{\gamma'}{\sqrt{\sigma^2/N}})=\alpha\\ P_D&=Q(\frac{\gamma'-A}{\sqrt{\sigma^2/N}})=Q(Q^{-1}(\alpha)-\sqrt{NA^2/\sigma^2}) \end{align*}\]
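A Monte Carlo check of this test (NumPy/SciPy assumed; A, σ, N, and α are arbitrary; the Q-function is written via scipy.stats.norm):

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(8)
A, sigma, N, alpha, trials = 0.5, 1.0, 16, 0.05, 100_000

Q = lambda z: norm.sf(z)       # Q-function: right-tail probability
Qinv = lambda p: norm.isf(p)

gamma_p = np.sqrt(sigma**2 / N) * Qinv(alpha)        # threshold gamma' from P_FA = alpha

x0 = rng.normal(0.0, sigma, size=(trials, N))        # H0: noise only
x1 = A + rng.normal(0.0, sigma, size=(trials, N))    # H1: DC level + noise

P_fa = (x0.mean(axis=1) > gamma_p).mean()
P_d = (x1.mean(axis=1) > gamma_p).mean()

print("P_FA (MC, theory):", P_fa, alpha)
print("P_D  (MC, theory):", P_d, Q(Qinv(alpha) - np.sqrt(N * A**2 / sigma**2)))
```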

7.3 Bayesian Detection

Assuming the hypotheses are random variables, we can derive the Bayesian detection rules

\[\begin{align*} &\text{min} P_E:\frac{p(x|\mathcal{H_1})}{p(x|\mathcal{H_0})} > \frac{P(\mathcal{H_0})}{P(\mathcal{H_1})}\\ &\text{MAP}:p(\mathcal{H_1}|x) > p(\mathcal{H_0}|x)\\ &\text{ML}:p(x|\mathcal{H_1}) > p(x|\mathcal{H_0}) \end{align*}\]

8. Detection with Deterministic Signals

  • If we have prior probabilities for the hypotheses, as in BPSK, we tend to use the Bayesian rule; otherwise, we tend to use the Neyman-Pearson rule.

  • Detect a known signal in noise using the NP method:

    \[\begin{align*} \mathcal{H_0}:x[n] &= w[n]\\ \mathcal{H_1}:x[n] &= s[n] + w[n] \end{align*}\] \[\Rightarrow T(x) = \mathbf{x}^T\mathbf{s}\sim \left\{\begin{array}{l} \mathcal{N}(0, \sigma^2\mathcal{E}),\quad\mathcal H_0\\ \mathcal{N}(\mathcal{E}, \sigma^2\mathcal{E}),\quad \mathcal H_1 \end{array}\right.,\quad \mathcal{E}=\mathbf{s}^T\mathbf{s}\]

T(x) is the output of a matched filter in this situation (see the numerical sketch at the end of this section).

\[P_{FA} = Q(\frac{\gamma'}{\sqrt{\sigma^2\mathcal{E}}}),\quad P_D = Q(Q^{-1}(P_{FA})-\sqrt{\frac{\mathcal{E}}{\sigma^2}})\]
  • For colored noise, whose covariance matrix is not a scaled identity, we can whiten it, giving the generalized matched filter $T(x) = \mathbf{x}^TC^{-1}\mathbf{s}$.

  • Distinguish between two known signals using the ML method:

\[\begin{align*} \mathcal{H_0}:x[n] &= s_0[n] + w[n]\\ \mathcal{H_1}:x[n] &= s_1[n] + w[n] \end{align*}\] \[\max_i\ T_i(x) = \sum x[n]s_i[n] - \mathcal{E}_i/2 \;\Leftrightarrow\; \min_i\ \|\mathbf{x}-\mathbf{s}_i\|_2\]
  • More generally, we can detect a random signal. If $\mathbf{s}\sim \mathcal{N}(\mathbf{\mu}_s,\mathbf{C}_s)$ and the noise has covariance $\mathbf{C}_w$, we can derive
\[T(x) = \frac{1}{2}\mathbf{x}^T C_w^{-1}C_s(C_s + C_w)^{-1}\mathbf{x} + \mathbf{x}^T(C_s + C_w)^{-1}\mathbf{\mu}_s\]
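A sketch of the matched-filter and generalized matched-filter statistics from the bullets above (NumPy/SciPy assumed; the sinusoidal signal and the exponential noise covariance are arbitrary illustrative choices):

```python
import numpy as np
from scipy.linalg import toeplitz

rng = np.random.default_rng(9)
N, sigma, trials = 32, 1.0, 50_000
n = np.arange(N)
s = 0.5 * np.cos(2 * np.pi * 0.1 * n)          # known deterministic signal
E = s @ s                                       # signal energy

# --- white noise: T(x) = x^T s ---
w0 = rng.normal(0.0, sigma, size=(trials, N))   # H0: noise only
w1 = rng.normal(0.0, sigma, size=(trials, N))
T0 = w0 @ s                                     # H0 statistic
T1 = (s + w1) @ s                               # H1 statistic
print("H0 mean, var:", T0.mean(), T0.var(), "theory: 0,", sigma**2 * E)
print("H1 mean, var:", T1.mean(), T1.var(), "theory:", E, ",", sigma**2 * E)

# --- colored noise: generalized matched filter T(x) = x^T C^-1 s ---
C = toeplitz(0.9 ** np.arange(N))               # assumed noise covariance
wc = rng.multivariate_normal(np.zeros(N), C, size=trials)
Tc1 = (s + wc) @ np.linalg.solve(C, s)          # H1 statistic
print("colored H1 mean, theory:", Tc1.mean(), s @ np.linalg.solve(C, s))
```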