# About

## Introduction

Bayes’ rule is a rigorous method for interpreting evidence in the context of previous experience or knowledge. It was discovered by Thomas Bayes (c. 1701-1761) and independently rediscovered by Pierre-Simon Laplace (1749-1827).

The Naive Bayes algorithm is called “naive” because it makes the assumption that the occurrence of a certain feature is independent of the occurrence of other features.

In Naive Bayes, the posterior probability `P(y|x)` is difficult to estimate directly, so we use the prior probability `P(y)` together with the likelihood `P(x|y)` to compute it.

In the rest of this post, we look at how this Bayesian approach works.

## Bayesian decision theory

Suppose there are n labels, `Y = {y1, y2, ..., yn}`, and let `λij` denote the loss incurred by misclassifying a sample whose true class is yj as yi. We can then define the conditional risk `R(yi | x)`: the expected loss of classifying sample x as yi:

`R(yi | x) = Σj λij · P(yj | x)`
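The conditional risk can be computed directly from a loss matrix and the posterior probabilities. The sketch below uses a hypothetical 3-class example with made-up numbers, not values from the text:

```python
# Conditional risk R(yi | x) = sum_j lambda_ij * P(yj | x).
# lam[i][j] is the loss of predicting class i when the true class is j
# (0 on the diagonal, i.e. the 0-1 loss); values here are illustrative.
lam = [
    [0.0, 1.0, 1.0],
    [1.0, 0.0, 1.0],
    [1.0, 1.0, 0.0],
]
posterior = [0.7, 0.2, 0.1]  # assumed P(yj | x) for j = 0, 1, 2

def conditional_risk(i, lam, posterior):
    """Expected loss of predicting class i for sample x."""
    return sum(lam[i][j] * p for j, p in enumerate(posterior))

risks = [conditional_risk(i, lam, posterior) for i in range(3)]
print(risks)  # with 0-1 loss, R(yi | x) = 1 - P(yi | x)
```

Note that under the 0-1 loss the class with the smallest risk is exactly the class with the largest posterior.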

Taking the expectation over all samples gives the overall risk of a classifier `h: X -> Y`:

`R(h) = E_x[ R(h(x) | x) ]`

To make `R(h)` as small as possible, we need a mapping h that minimizes the conditional risk `R(h(x) | x)` for every sample x. Such an h is called the Bayes optimal classifier. For the classification error rate, `λij` can be defined as the 0-1 loss:

`λij = 0 if i = j, otherwise 1`

Under the 0-1 loss, the conditional risk becomes:

`R(yi | x) = 1 - P(yi | x)`

Where `P(yi | x)` is the probability of a correct classification, and the Bayes optimal classifier can be written as:

`h*(x) = argmax_y P(y | x)`

From here we can conclude: to minimize the classification risk, we should pick the class with the largest posterior `P(y | x)`. But in most cases `P(y | x)` is not easy to obtain directly. Bayes’ theorem lets us rewrite it:

`P(y | x) = P(x | y) · P(y) / P(x)`

Where `P(y)` is the prior probability and `P(x | y)` is the class-conditional probability of x given class y (the likelihood).
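Bayes’ rule can be checked with a tiny numeric example. The prior and likelihood values below are assumptions chosen for illustration:

```python
# Bayes' rule: P(y|x) = P(x|y) * P(y) / P(x),
# where P(x) = sum_y P(x|y) * P(y) by the law of total probability.
prior = {"spam": 0.3, "ham": 0.7}        # assumed P(y)
likelihood = {"spam": 0.8, "ham": 0.1}   # assumed P(x|y) for an observed feature x

evidence = sum(likelihood[y] * prior[y] for y in prior)  # P(x)
posterior = {y: likelihood[y] * prior[y] / evidence for y in prior}
print(posterior)  # posteriors sum to 1 by construction
```

Even though "ham" has the larger prior, the much larger likelihood of the observed feature under "spam" flips the posterior decision.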

## Maximum Likelihood Estimator (MLE)

In statistics, maximum likelihood estimation (MLE) is a method of estimating the parameters of a statistical model given observations, by finding the parameter values that maximize the likelihood of making the observations given the parameters (Wikipedia: https://en.wikipedia.org/wiki/Maximum_likelihood_estimation).

Without considering any prior opinion, a typical approach for estimating the parameters is maximum likelihood. For observations `xi ∈ X (i = 1, 2, ..., n)`, all independent and identically distributed, the likelihood of the data given θ is the product:

`L(θ) = ∏i P(xi | θ)`

Now we look at this function from a different perspective: the observed values `x1, x2, ..., xn` are treated as fixed “parameters” of this function, while θ becomes the function’s variable and is allowed to vary freely. In practice the algebra is often more convenient when working with the natural logarithm of the likelihood function, called the log-likelihood:

`ln L(θ) = Σi ln P(xi | θ)`

Because log(x) is monotonically increasing, maximizing the log-likelihood is equivalent to maximizing the likelihood, and sums are easier to work with than products.

The method of maximum likelihood estimates θ by finding the value of θ that maximizes `ln L(θ)`. This defines the maximum likelihood estimator (MLE) of θ:

`θ̂ = argmax_θ ln L(θ)`

To find this maximum, we differentiate the log-likelihood with respect to θ and set the derivative to zero; the solution is the maximum likelihood estimate.
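For a Gaussian, setting these derivatives to zero yields closed-form estimates: the sample mean and the (biased) sample variance. A minimal sketch with invented observations:

```python
import math

# Maximum likelihood estimation for a Gaussian N(mu, sigma^2):
# solving d/d(theta) ln L(theta) = 0 gives the sample mean and
# the biased sample variance as the MLE solutions.
data = [4.8, 5.1, 5.3, 4.9, 5.4]  # hypothetical observations

n = len(data)
mu_hat = sum(data) / n
var_hat = sum((x - mu_hat) ** 2 for x in data) / n  # divides by n, not n-1

def log_likelihood(mu, var, xs):
    """ln L(mu, var) = sum_i ln N(x_i; mu, var)."""
    return sum(-0.5 * math.log(2 * math.pi * var) - (x - mu) ** 2 / (2 * var)
               for x in xs)

print(mu_hat, var_hat)
# The MLE should score at least as high as any other parameter value:
print(log_likelihood(mu_hat, var_hat, data) >= log_likelihood(5.0, var_hat, data))
```

The comparison at the end is a quick sanity check that the closed-form solution really is a maximizer of the log-likelihood.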

## Naive Bayesian theory

OK, let’s go back to Naive Bayes.

The posterior `P(y|x)` is hard to calculate directly, and when the attributes of x are interrelated, we cannot easily estimate the joint likelihood `P(x|y)` from data either. Naive Bayes therefore uses the attribute conditional independence assumption: all attributes of x are assumed not to affect each other given the class. Under this assumption, the likelihood factorizes:

`P(x | y) = ∏_{i=1}^{m} P(xi | y)`

Where x has m attributes, all assumed independent given the class. The Bayes optimal classifier can then be rewritten as the Naive Bayes classifier:

`h(x) = argmax_y P(y) · ∏_{i=1}^{m} P(xi | y)`
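This decision rule can be sketched with simple frequency counts for discrete attributes. The toy dataset below is invented, and Laplace (+1) smoothing is added so unseen attribute values do not zero out the product:

```python
from collections import Counter, defaultdict

# Naive Bayes decision rule: h(x) = argmax_y P(y) * prod_i P(xi | y).
# Toy training data (invented): each sample is (attributes, label).
train = [
    (("sunny", "hot"), "no"),
    (("sunny", "mild"), "no"),
    (("rainy", "mild"), "yes"),
    (("overcast", "hot"), "yes"),
    (("rainy", "cool"), "yes"),
]

label_counts = Counter(label for _, label in train)
# attr_counts[i][label][value] = occurrences of value for attribute i given label
attr_counts = defaultdict(lambda: defaultdict(Counter))
for attrs, label in train:
    for i, v in enumerate(attrs):
        attr_counts[i][label][v] += 1

def predict(attrs):
    scores = {}
    for label, c in label_counts.items():
        score = c / len(train)  # prior P(y) estimated by class frequency
        for i, v in enumerate(attrs):
            vocab = len({a[i] for a, _ in train})  # distinct values of attribute i
            # likelihood P(xi | y) estimated by frequency with Laplace smoothing
            score *= (attr_counts[i][label][v] + 1) / (c + vocab)
        scores[label] = score
    return max(scores, key=scores.get)

print(predict(("sunny", "hot")))
```

Everything here is plain counting: the prior comes from label frequencies and each factor `P(xi | y)` from conditional value frequencies.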

For continuous attributes, we need to choose a distribution to describe `P(x|y)`.

## Bayesian prior

There are many choices of distribution in Bayesian modeling: the Gaussian (normal) distribution, the Bernoulli distribution, and conjugate distributions. Once we have chosen a prior distribution `Pr(p)`, we then observe the process and use Bayes’ theorem to update our probability distribution appropriately.

For the normal distribution, the density is:

`P(x | y) = 1 / (√(2π) σ) · exp(-(x - μ)² / (2σ²))`

Assuming `P(x | y) ~ N(μ, σ²)`, the maximum likelihood estimates of the mean and variance are:

`μ̂ = (1/n) Σi xi`

`σ̂² = (1/n) Σi (xi - μ̂)²`
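Under the Gaussian assumption, `P(x|y)` for a continuous attribute is evaluated with the class-conditional mean and variance estimated by maximum likelihood. A sketch with invented attribute values for one class:

```python
import math

def gaussian_pdf(x, mu, var):
    """Density of N(mu, var) at x, used as the likelihood P(x | y)."""
    return math.exp(-(x - mu) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

# Hypothetical attribute values observed for one class y:
samples = [1.0, 1.2, 0.8, 1.1, 0.9]

# MLE estimates: sample mean and (biased) sample variance.
mu = sum(samples) / len(samples)
var = sum((x - mu) ** 2 for x in samples) / len(samples)

# The fitted density is highest near the mean and falls off away from it.
print(mu, var, gaussian_pdf(1.0, mu, var))
```

In a full Gaussian Naive Bayes classifier, one such (μ, σ²) pair is estimated per attribute per class, and the densities replace the frequency-count factors in the decision rule.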

In the next post I will write some code to show how Naive Bayes works in practice.

## Reference

- Zhou Zhihua: *Machine Learning* (《机器学习》, 周志华)
- Bayes’ Rule With Python - A Tutorial Introduction to Bayesian Analysis, by James V Stone
- Kevin P. Murphy: Conjugate Bayesian analysis of the Gaussian distribution
- Bayesian Inference for Bernoulli processes: Is that coin fair?
- Wiki Bernoulli: https://en.wikipedia.org/wiki/Bernoulli_distribution
- Wiki Maximum_likelihood: https://en.wikipedia.org/wiki/Maximum_likelihood_estimation

#### Originally published at 夏日小草; please credit the source when reposting: http://homeway.me/2017/05/22/machine-learning-naive-bayesian/

-by grasses

2017-05-22 23:52:34