Machine Learning Notes (4): Bayesian Classification

Introduction

Here is a not-so-short note (rolling update) on machine learning and artificial intelligence. Main concepts and some references are presented. This note is based on Wikipedia, Andrew Ng's CS229 lecture notes from Stanford, the Machine Learning course of SJTU IEEE-yangyang, and Google's Machine Learning Crash Course.

Bayesian estimation

If you flip a coin, what's the probability that it lands heads up? To answer this, you can flip the coin many times and collect the data:

$$X=\{X_i\}^n_{i=1}, \quad X_i \in \{H,T\}$$

where $H$ means Head and $T$ means Tail. The probability should be:

$$P(Head) = \theta, \quad P(Tail) = 1 - \theta$$

If you flip the coin 100 times and get 57 heads and 43 tails, the estimate of $P(Head)$ should be 57/100. This is because the flips are i.i.d., which means:

  • Independent events
  • Identically distributed according to the Bernoulli distribution (the binomial distribution with $n=1$ is the Bernoulli distribution)

Maximum likelihood estimation

The aim of maximum likelihood estimation is to choose the $\theta$ that maximizes the probability of the observed data. The MLE of the probability of head can be calculated from the i.i.d. assumption as below:

$$\hat{\theta}_{MLE} = \arg \max \limits_{\theta} P(X \mid \theta)$$

$$= \arg \max \limits_{\theta} \prod_{i=1}^n P(X_i \mid \theta)$$

$$= \arg \max \limits_{\theta} \prod_{i:X_i=H} \theta \prod_{i:X_i=T} (1-\theta)$$

$$= \arg \max \limits_{\theta} \theta^{\alpha_H} (1-\theta)^{\alpha_T}$$

Let $J(\theta)=\theta^{\alpha_H}(1-\theta)^{\alpha_T}$, where $\alpha_H$ and $\alpha_T$ are the numbers of observed heads and tails. It is easy to find the $\theta$ that maximizes the probability of the observed data:

$$\frac{\partial J(\theta)}{\partial \theta} \Big\vert_{\theta=\hat{\theta}_{MLE}} = 0$$
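Explicitly, taking the logarithm of $J(\theta)$ (which has the same maximizer, since the log is monotone) and setting its derivative to zero gives:

$$\log J(\theta) = \alpha_H \log\theta + \alpha_T \log(1-\theta)$$

$$\frac{\partial \log J(\theta)}{\partial \theta} = \frac{\alpha_H}{\theta} - \frac{\alpha_T}{1-\theta} = 0 \;\Rightarrow\; \alpha_H (1-\hat{\theta}_{MLE}) = \alpha_T \hat{\theta}_{MLE}$$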

Thus the MLE of the probability of head is:

$$\hat{\theta}_{MLE} = \frac{\alpha_H}{\alpha_H+\alpha_T}$$
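A minimal numerical sketch of this estimator; the coin's simulated bias and the number of flips are hypothetical:

```python
import numpy as np

# Minimal sketch: estimate P(Head) by MLE from simulated flips.
# The coin's true bias (0.57) and the number of flips are hypothetical.
rng = np.random.default_rng(0)
flips = rng.random(100) < 0.57          # True = Head, False = Tail

alpha_H = int(flips.sum())              # number of heads
alpha_T = len(flips) - alpha_H          # number of tails

theta_mle = alpha_H / (alpha_H + alpha_T)   # fraction of heads
print(f"heads={alpha_H}, tails={alpha_T}, theta_MLE={theta_mle:.3f}")
```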

Prior and posterior

If you flip a coin 5 times, it is possible that all 5 flips come up heads. In this case, the maximum likelihood estimate of the probability of head is 1.0. However, it is unreasonable to conclude that flipping this coin always gives heads. Thus, we use a prior probability to adjust the estimate.

A prior probability distribution of an uncertain quantity is the probability distribution that would express one’s beliefs about this quantity before some evidence is taken into account. A prior can be determined from past information such as previous experiments, or according to some principle such as Jeffreys prior.

A posterior probability distribution of an unknown quantity is the probability distribution, treated as a conditional probability, that is assigned after the relevant evidence or background is taken into account. In contrast with the likelihood function $P(X \mid \theta)$, the posterior probability is given as:

$$P(\theta \mid X) = \frac{P(X \mid \theta)P(\theta)}{P(X)}$$

where $P(\theta)$ is the probability density function (pdf) of the prior.

Conjugate prior

In Bayesian probability theory, if the posterior distribution $P(\theta \mid X)$ is in the same probability distribution family as the prior distribution $P(\theta)$, the prior and posterior are called conjugate distributions, and the prior is called a conjugate prior for the likelihood function. When a family of conjugate priors exists, choosing a prior from that family simplifies the form of the posterior.

An example of the conjugate prior: In the problem of coin flipping,

  • The likelihood is a binomial distribution:

$$P(X \mid \theta) = \binom{n}{\alpha_H} \theta^{\alpha_H} (1-\theta)^{\alpha_T}$$

$$= \frac{n!}{\alpha_H! (n-\alpha_H)!} \theta^{\alpha_H} (1-\theta)^{\alpha_T}$$

This means that in $n$ independent flips, the probability of observing $\alpha_H$ heads follows a binomial distribution, with probability of head $\theta$ for every flip.

  • The prior is assumed to be a Beta distribution:

$$P(\theta) = \frac{\theta^{\beta_H-1} (1-\theta)^{\beta_T-1}}{B(\beta_H,\beta_T)} \sim Beta(\beta_H,\beta_T)$$

where $B(\beta_H,\beta_T)=\frac{\Gamma(\beta_H) \Gamma(\beta_T)}{\Gamma(\beta_H+\beta_T)}$, $\Gamma(z)$ is the gamma function, an extension of the factorial function, $\Gamma(z) = \int_0^{\infty} x^{z-1}e^{-x} dx$, and $\Gamma(n) = (n-1)!$ if $n$ is a positive integer.

  • The posterior is also a Beta distribution:

$$P(\theta \mid X) \sim Beta(\beta_H + \alpha_H,\ \beta_T + \alpha_T)$$

For the binomial distribution, the conjugate prior is the Beta distribution. The pdf of the Beta distribution becomes more concentrated as the values of $\beta_H$ and $\beta_T$ increase. As more samples are obtained, $\beta_H + \alpha_H \gg \beta_H$, and the effect of the prior is "washed out".
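A minimal sketch of this Beta-binomial update; the prior pseudo-counts and observed flip counts below are hypothetical:

```python
# Minimal sketch of the Beta-binomial conjugate update.
# The prior pseudo-counts and the observed flip counts are hypothetical.
beta_H, beta_T = 2, 2        # Beta prior parameters
alpha_H, alpha_T = 57, 43    # observed heads and tails

post_H = beta_H + alpha_H    # posterior is Beta(beta_H + alpha_H, beta_T + alpha_T)
post_T = beta_T + alpha_T

posterior_mean = post_H / (post_H + post_T)
posterior_mode = (post_H - 1) / (post_H + post_T - 2)  # requires post_H, post_T > 1
print(f"posterior = Beta({post_H}, {post_T})")
print(f"mean = {posterior_mean:.3f}, mode = {posterior_mode:.3f}")
```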

Another example of the conjugate prior: in the problem of rolling a die, there are $k=6$ outcomes.

  • The likelihood is a multinomial distribution ($\theta=\{\theta_1,\theta_2,...,\theta_k\}$):

$$P(X \mid \theta) = \frac{n!}{\alpha_1!\alpha_2!...\alpha_k!} \prod_{i=1}^k \theta_i^{\alpha_i}$$

The multinomial distribution is a generalization of the binomial distribution. In $n$ independent rolls, the probability that the outcomes $X_1,X_2,...,X_k$ occur $\alpha_1,\alpha_2,...,\alpha_k$ times follows a multinomial distribution, with probability $\theta_i$ for outcome $X_i$ on each roll. The $\theta_i$ sum to 1 and the $\alpha_i$ sum to $n$.

  • The prior is assumed to be a Dirichlet distribution:

$$P(\theta) = \frac{\prod_{i=1}^k \theta_i^{\beta_i -1}}{B(\beta_1,\beta_2,...,\beta_k)} \sim Dir(\beta_1,\beta_2,...,\beta_k)$$

  • The posterior is also a Dirichlet distribution:

$$P(\theta \mid X) \sim Dir(\beta_1+\alpha_1,\beta_2+\alpha_2,...,\beta_k+\alpha_k)$$

For the multinomial distribution, the conjugate prior is the Dirichlet distribution.
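A minimal sketch of the Dirichlet-multinomial update for the die example; the prior pseudo-counts and observed roll counts are hypothetical:

```python
import numpy as np

# Minimal sketch of the Dirichlet-multinomial conjugate update for a
# six-sided die; prior pseudo-counts and roll counts are hypothetical.
beta = np.ones(6)                            # Dirichlet prior parameters
alpha = np.array([9, 11, 8, 12, 10, 10])     # observed counts for faces 1..6

posterior = beta + alpha                     # Dirichlet posterior parameters
posterior_mean = posterior / posterior.sum() # E[theta_i | X]
print("posterior parameters:", posterior)
print("posterior mean:", np.round(posterior_mean, 3))
```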

MLE compared with MAP: maximum a posteriori estimation

  • MLE: choose the $\theta$ that maximizes the likelihood (the probability of the observed data) $P(X \mid \theta)$.
  • MAP: choose the $\theta$ that maximizes the posterior probability $P(\theta \mid X)$.

$$\hat{\theta}_{MAP} = \arg \max \limits_{\theta} P(\theta \mid X)$$

$$= \arg \max \limits_{\theta} P(X \mid \theta)P(\theta)$$

In the coin-flipping problem, the posterior of the probability of head is a Beta distribution, and the MAP estimate is its mode:

$$P(\theta \mid X) \sim Beta(\beta_H+\alpha_H,\beta_T+\alpha_T)$$

MLE vs MAP:

$$\hat{\theta}_{MLE} = \frac{\alpha_H}{\alpha_H+\alpha_T}$$

$$\hat{\theta}_{MAP} = \frac{\alpha_H + \beta_H - 1}{\beta_H+\alpha_H+\beta_T+\alpha_T-2}$$

For a small sample size, the prior is important; as $n \to \infty$, the effect of the prior vanishes. Moreover, when $\beta_H=\beta_T=1$ (a uniform prior), MAP is identical to MLE.
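A small sketch comparing the two estimators on the 5-heads example above, with a hypothetical Beta(2, 2) prior:

```python
# Compare MLE and MAP on the coin example; counts and prior are hypothetical.
def theta_mle(alpha_H, alpha_T):
    return alpha_H / (alpha_H + alpha_T)

def theta_map(alpha_H, alpha_T, beta_H, beta_T):
    # Mode of the Beta(beta_H + alpha_H, beta_T + alpha_T) posterior
    return (alpha_H + beta_H - 1) / (alpha_H + beta_H + alpha_T + beta_T - 2)

# 5 flips, all heads: MLE gives 1.0, MAP with a Beta(2, 2) prior pulls it back
print(theta_mle(5, 0))        # 1.0
print(theta_map(5, 0, 2, 2))  # 6/7 ~= 0.857
# With a uniform Beta(1, 1) prior, MAP coincides with MLE
print(theta_map(5, 0, 1, 1))  # 1.0
```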

We use the Gaussian distribution for a continuous variable $x$ that is i.i.d. The central limit theorem (CLT) establishes that, in most situations, when independent random variables are added, their properly normalized sum tends toward a normal distribution.

The pdf of $N(\mu,\sigma^2)$ is:

$$P(x \mid \mu,\sigma) = \frac{1}{\sigma \sqrt{2\pi}} e^{\frac{-(x-\mu)^2}{2\sigma^2}}$$

where $\mu$ is the mean and $\sigma^2$ is the variance.

MLE for the Gaussian distribution:

$$\hat{\theta}_{MLE} = \arg \max \limits_{\theta} P(X \mid \theta)$$

$$= \arg \max \limits_{\theta} \prod_{i=1}^n P(X_i \mid \theta)$$

MLE for Gaussian mean and variance:

$$\mu_{MLE} = \frac{1}{n} \sum_{i=1}^n x_i$$

$$\sigma_{MLE}^2 = \frac{1}{n} \sum_{i=1}^n (x_i - \hat{\mu})^2$$

Here we use the sample average to estimate $\mu$, which is not the true parameter. The MLE of $\sigma^2$ is an under-estimate, so the MLE of the variance of a Gaussian is biased. Because the average of $x_1,x_2,...,x_n$ is used as $\hat{\mu}$, and the true mean is generally not equal to the sample average, estimating the population mean by the sample average introduces a bias into the variance estimate. An unbiased variance estimator is:

$$\sigma_{unbiased}^2 = \frac{1}{n-1} \sum_{i=1}^n (x_i - \hat{\mu})^2$$

This is because only $n-1$ of the $x_i$ are independent given the sample average: the $n^{th}$ $x_i$ can be recovered from the average and the other $n-1$ values, so it carries no additional information.
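A small simulation sketch (hypothetical mean, variance, and sample size) illustrating that the MLE of the variance is biased low by a factor of $(n-1)/n$, while the $n-1$ estimator is unbiased:

```python
import numpy as np

# Simulation sketch comparing the biased (divide by n) and unbiased
# (divide by n-1) variance estimators; true parameters are hypothetical.
rng = np.random.default_rng(0)
true_mu, true_sigma2, n = 0.0, 4.0, 10

biased, unbiased = [], []
for _ in range(20000):
    x = rng.normal(true_mu, np.sqrt(true_sigma2), size=n)
    mu_hat = x.mean()
    ss = np.sum((x - mu_hat) ** 2)
    biased.append(ss / n)          # MLE estimator
    unbiased.append(ss / (n - 1))  # unbiased estimator

# On average, the MLE underestimates sigma^2 by a factor of (n-1)/n
print(np.mean(biased), np.mean(unbiased), true_sigma2 * (n - 1) / n)
```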

If we draw different samples, the estimated mean will also differ. The prior for the mean is also a Gaussian distribution:

$$P(\mu \mid \eta,\lambda) = \frac{1}{\lambda\sqrt{2\pi}}e^{\frac{-(\mu-\eta)^2}{2\lambda^2}}$$

The conjugate prior for the mean is a Gaussian distribution, and for the variance (more precisely, the precision) it is the Wishart distribution (a gamma distribution in the one-dimensional case).

Bayes classifier

Bayes Optimal Classifier

The Bayes optimal classifier is a kind of ensemble learning. In Bayesian learning, the primary question is: what is the most probable hypothesis given the data? Suppose we have:

  • X: Features
  • Y: Target classes
    How do we decide which class $y$ a new sample with feature $x$ belongs to? ($x\in X$ and $y\in Y$)

The Bayes rule is:

$$P(Y=y \mid X=x) = \frac{P(X=x \mid Y=y)P(Y=y)}{P(X=x)}$$

According to the Bayes decision rule, we need to minimize the conditional risk of misclassifying a sample. Because $R(y \mid x)=1-P(y \mid x)$ under 0-1 loss, the Bayes optimal classifier can be written as:

$$h^*(x) = \arg \max \limits_{Y=y} P(Y=y \mid X=x)$$

As the denominator $P(X=x)$ is the same for all classes, the maximization can be rewritten as:

$$h^*(x) = \arg \max \limits_{Y=y} P(X=x \mid Y=y)P(Y=y)$$

  • $P(X=x \mid Y=y)$ is the class-conditional density (likelihood)
  • $P(Y=y)$ is the class prior

A binary classification problem and the optimal Bayes decision boundary are shown below.
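As an illustration, the sketch below evaluates $h^*(x)=\arg\max_y P(x \mid y)P(y)$ for two classes with 1-D Gaussian class-conditional densities; the priors, means, and standard deviations are hypothetical.

```python
import numpy as np
from scipy.stats import norm

# Sketch of the Bayes optimal decision rule h*(x) = argmax_y P(x|y) P(y)
# for two classes with 1-D Gaussian class-conditional densities.
# The priors, means, and standard deviations below are hypothetical.
prior = {0: 0.6, 1: 0.4}
params = {0: (-1.0, 1.0), 1: (2.0, 1.5)}   # (mean, std) of P(x | Y=y)

def bayes_predict(x):
    scores = {y: norm.pdf(x, loc=m, scale=s) * prior[y]
              for y, (m, s) in params.items()}
    return max(scores, key=scores.get)

for x in [-2.0, 0.0, 0.8, 3.0]:
    print(x, "->", bayes_predict(x))
```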

Conditional independence

Given $Z$, $X$ being conditionally independent of $Y$ means:

$$P(X,Y \mid Z) = P(X \mid Z)P(Y \mid Z)$$

e.g., if the alarm in your house is active (Z), your neighbor Mary will call the police (X) or another neighbor John will call the police (Y). It is assumed that they do not communicate with each other. In this case, Mary's call and John's call are conditionally independent. It also means that, given the alarm, whether Mary calls the police is unrelated to whether John calls:

$$P(X \mid Y,Z) = P(X \mid Z)$$

Note: Independence $\nLeftrightarrow$ Conditional Independence
$X$ being conditionally independent of $Y$ given $Z$ does not mean that $X$ is independent of $Y$. E.g., it is unknown whether Mary's call is independent of John's call: they may call the police for reasons other than the alarm in your house. Conversely, $X$ being independent of $Y$ does not imply that $X$ is conditionally independent of $Y$ given $Z$. E.g.,

  • X: the first roll of a die shows 3.
  • Y: the second roll shows 2.
  • Z: the die is rolled twice and the sum of the two numbers is 5.
    In this case, X is not conditionally independent of Y given Z (a quick simulation sketch follows).
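A quick simulation of this dice example (sample size chosen arbitrarily) shows that $P(X,Y) \approx P(X)P(Y)$ while $P(X,Y \mid Z) \neq P(X \mid Z)P(Y \mid Z)$:

```python
import numpy as np

# Quick simulation of the dice example: X and Y are independent, but
# conditioning on Z (the two rolls sum to 5) makes them dependent.
rng = np.random.default_rng(0)
n = 200_000
first = rng.integers(1, 7, size=n)
second = rng.integers(1, 7, size=n)

X = first == 3
Y = second == 2
Z = (first + second) == 5

# Unconditionally: P(X, Y) ~= P(X) P(Y)
print(np.mean(X & Y), np.mean(X) * np.mean(Y))
# Given Z: P(X, Y | Z) != P(X | Z) P(Y | Z)
print(np.mean(X[Z] & Y[Z]), np.mean(X[Z]) * np.mean(Y[Z]))
```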

Naive Bayes (NB)

The Naive Bayes classifier assumes that the features are conditionally independent given the class, which makes the computation more feasible.

$$P(X_1,...,X_d \mid Y) = \prod_{i=1}^d P(X_i \mid Y)$$

where $d$ is the number of features. For each discrete feature, the likelihood is $P(X_i \mid Y)$. If the class prior is $P(Y)$, the decision rule of Naive Bayes is:

$$h^*_{NB}(x) = \arg \max \limits_{y} P(x_1,x_2,...,x_d \mid y)P(y)$$

$$= \arg \max \limits_{y} \prod_{i=1}^d P(x_i \mid y)P(y)$$

In practice, the features are usually not conditionally independent, so

$$P(X_1,...,X_d \mid Y) \neq \prod_{i=1}^d P(X_i \mid Y)$$

Nonetheless, Naive Bayes sometimes still performs well even when the conditional independence assumption is violated. If the training data is insufficient, e.g., there is no sample with a given value of feature $X_1$ in class $Y$, the corresponding likelihood will be zero. Then, no matter what values $X_2,...,X_d$ take, the product stays zero:

$$P(X_1,...,X_d \mid Y) = 0$$
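Below is a minimal count-based sketch of this zero-probability effect on a hypothetical toy dataset (the feature values and classes are invented for illustration); a common remedy, in the spirit of the Beta/Dirichlet pseudo-counts above, is to add a small count to every cell.

```python
from collections import Counter, defaultdict

# Count-based discrete Naive Bayes on a hypothetical toy dataset,
# showing how one unseen feature value zeroes the whole product.
data = [  # ((x1, x2), y)
    (("sunny", "hot"), "no"),
    (("sunny", "mild"), "no"),
    (("rainy", "mild"), "yes"),
    (("sunny", "cool"), "yes"),
]

class_counts = Counter(y for _, y in data)
feat_counts = defaultdict(Counter)  # (class, feature index) -> value counts
for x, y in data:
    for i, v in enumerate(x):
        feat_counts[(y, i)][v] += 1

def nb_score(x, y):
    # P(y) * prod_i P(x_i | y), estimated with raw frequencies (no smoothing)
    score = class_counts[y] / len(data)
    for i, v in enumerate(x):
        score *= feat_counts[(y, i)][v] / class_counts[y]
    return score

print(nb_score(("sunny", "hot"), "yes"))  # 0.0: "hot" is unseen for class "yes"
print(nb_score(("sunny", "hot"), "no"))   # 0.25
```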

If the features are continuous, we use Gaussian Naive Bayes (GNB) instead. Let $X_i = x$ and $Y=y_k$. The probability is:

$$P(x \mid y_k) = \frac{1}{\sigma_{ik}\sqrt{2\pi}} e^{\frac{-(x-\mu_{ik})^2}{2\sigma_{ik}^2}}$$

The subscripts indicate that the mean and variance are different for each class $k$ and each feature $i$.
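A minimal Gaussian Naive Bayes sketch, estimating per-class, per-feature means and variances from a hypothetical toy dataset (heights and weights invented for illustration):

```python
import numpy as np
from scipy.stats import norm

# Gaussian Naive Bayes sketch: per-class, per-feature means and standard
# deviations are estimated from a hypothetical toy dataset (height, weight).
X = np.array([[180.0, 80.0], [175.0, 75.0], [160.0, 55.0], [165.0, 60.0]])
y = np.array([0, 0, 1, 1])

classes = np.unique(y)
prior = {k: np.mean(y == k) for k in classes}
mu = {k: X[y == k].mean(axis=0) for k in classes}
sigma = {k: X[y == k].std(axis=0, ddof=1) for k in classes}

def gnb_predict(x):
    # argmax_k P(y_k) * prod_i N(x_i; mu_ik, sigma_ik^2)
    scores = {k: prior[k] * np.prod(norm.pdf(x, loc=mu[k], scale=sigma[k]))
              for k in classes}
    return max(scores, key=scores.get)

print(gnb_predict(np.array([178.0, 78.0])))  # expected class 0
print(gnb_predict(np.array([162.0, 57.0])))  # expected class 1
```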

