Machine Learning - Naive Bayesian theory



Introduction

Bayes’ rule is a rigorous method for interpreting evidence in the context of previous experience or knowledge. It was discovered by Thomas Bayes (c. 1701-1761) and independently rediscovered by Pierre-Simon Laplace (1749-1827).

The Naive Bayes algorithm is called “naive” because it assumes that the occurrence of a certain feature is independent of the occurrence of the other features, given the class.

In Naive Bayes it is difficult to estimate the posterior probability directly, so we compute it from the prior p(y) and the class-conditional probability (likelihood) p(x|y):

Figure-1: Naive Bayesian model

In the rest of this post, we discuss how this Bayesian approach works.

Bayesian decision theory

Suppose there are n labels, Y = {y1, y2, ..., yn}, and let λij denote the loss incurred by classifying a sample whose true label is yj as yi. From this we can define the conditional risk R(yi | x): the expected loss of classifying sample x as yi:

Formula-1: Conditional risk of classifying x as yi
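With n classes, and writing P(yj | x) for the posterior probability of class yj, this expected loss is:

$$R(y_i \mid x) = \sum_{j=1}^{n} \lambda_{ij}\, P(y_j \mid x)$$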

Taking the expectation over the distribution of samples gives the overall risk:

Formula-2: Overall risk (expectation of the conditional risk)
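In symbols, writing R(h) for the overall risk of a classifier h:

$$R(h) = \mathbb{E}_{x}\big[\, R(h(x) \mid x) \,\big]$$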

To minimize the overall risk, we need a mapping h: X -> Y that minimizes the conditional risk R(h(x) | x) for every sample x. Such an h is called the Bayes optimal classifier. For classification, λij can be defined as the 0/1 loss:

Formula-3: λij
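Explicitly, the loss is zero for a correct decision and one otherwise:

$$\lambda_{ij} = \begin{cases} 0, & \text{if } i = j \\ 1, & \text{otherwise} \end{cases}$$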

Under this 0/1 loss, the conditional risk becomes:

Formula-4: risk probability
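Since the expected 0/1 loss is just the probability of being wrong, this simplifies to:

$$R(y_i \mid x) = 1 - P(y_i \mid x)$$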

where P(yi|x) is the probability of classifying x correctly. The Bayes optimal classifier can then be defined as:

Formula-5: h(x)
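That is, for every sample it chooses the label with the largest posterior probability:

$$h^{*}(x) = \arg\max_{y \in Y} P(y \mid x)$$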

In other words, minimizing the classification risk comes down to finding, for each x, the label y with the largest posterior P(y | x). In most cases, however, P(y | x) is not easy to obtain directly, so we turn to Bayes’ theorem:

Formula-6: Bayes’ theorem
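Bayes’ theorem expresses the posterior in terms of the prior and the likelihood:

$$P(y \mid x) = \frac{P(y)\, P(x \mid y)}{P(x)}$$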

where P(y) is the prior probability and P(x|y) is the class-conditional probability of x given y, also called the likelihood.

Maximum Likelihood Estimator (MLE)

In statistics, maximum likelihood estimation (MLE) is a method of estimating the parameters of a statistical model given observations, by finding the parameter values that maximize the likelihood of making the observations given the parameters. (Wikipedia: https://en.wikipedia.org/wiki/Maximum_likelihood_estimation)

Without bringing in any prior opinion, a typical approach for estimating the parameters is the method of maximum likelihood. For observations xi ∈ X (i = 1, 2, ..., n) that are independent and identically distributed, the probability of observing the whole sample given θ is:

Formula-7: Likelihood probability
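Because the observations are assumed i.i.d., the likelihood of the sample is the product of the individual probabilities:

$$L(\theta) = \prod_{i=1}^{n} P(x_i \mid \theta)$$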

Now we look at this function from a different perspective by considering the observed values x1, x2, …, xn to be fixed “parameters” of this function, whereas θ will be the function’s variable and allowed to vary freely. In practice the algebra is often more convenient when working with the natural logarithm of the likelihood function, called the log-likelihood:

Formula-8: Log-likelihood
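Taking the natural logarithm turns the product into a sum:

$$\ln L(\theta) = \sum_{i=1}^{n} \ln P(x_i \mid \theta)$$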

Because the logarithm is strictly increasing, maximizing the log-likelihood is equivalent to maximizing the likelihood itself, and sums are easier to work with than products.
The method of maximum likelihood estimates θ by finding a value of θ that maximizes ln(L(θ)). This method of estimation defines a maximum likelihood estimator (MLE) of θ:

Formula-9: Maximum Likelihood Estimator
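The estimator is the value of θ that maximizes the log-likelihood:

$$\hat{\theta} = \arg\max_{\theta} \ln L(\theta)$$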

To find the maximum of the likelihood, we differentiate the log-likelihood with respect to θ and set the derivative to zero; a θ at which the derivative vanishes (and which is a maximum) is the maximum likelihood solution.
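In other words, a candidate solution θ satisfies the first-order condition:

$$\frac{\partial \ln L(\theta)}{\partial \theta} = \sum_{i=1}^{n} \frac{\partial}{\partial \theta} \ln P(x_i \mid \theta) = 0$$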

Naive Bayesian theory

OK, let’s go back to Naive Bayes.

We noted that the posterior probability P(y|x) is not easy to calculate, and if the attributes are interrelated we also cannot reliably estimate the class-conditional probability P(x|y) from limited data. Naive Bayes therefore adopts the attribute conditional independence assumption: it supposes that the attributes of x do not affect each other given the class. Under this assumption, Formula 6 becomes:

Formula-10: attribute conditional independence hypothesis
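Under this assumption the class-conditional probability factorizes attribute by attribute:

$$P(y \mid x) = \frac{P(y)\, P(x \mid y)}{P(x)} = \frac{P(y)}{P(x)} \prod_{i=1}^{m} P(x_i \mid y)$$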

where x has m attributes, all treated as independent given the class. Since P(x) is the same for every class, Formula 5 can be rewritten as:

Formula-11: Naive Bayes classifier
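Writing h_nb for the Naive Bayes classifier, this is:

$$h_{nb}(x) = \arg\max_{y \in Y} P(y) \prod_{i=1}^{m} P(x_i \mid y)$$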

Here we need to choose a distribution to describe p(x|y); in the Bayesian framework this initial modelling assumption plays the role of a prior.

Bayesian prior

There are many possible choices of Bayesian prior: the Gaussian (normal) distribution, the Bernoulli distribution, or a conjugate distribution. Once we have chosen a prior distribution Pr(p), we then observe the process and use Bayes’ theorem to update our probability distribution appropriately.

For the normal distribution:

Formula-12: Normal distribution for p(x|y)
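Writing μ(y,i) and σ(y,i)² for the mean and variance of attribute i within class y, the density is:

$$p(x_i \mid y) = \frac{1}{\sqrt{2\pi}\,\sigma_{y,i}} \exp\!\left( -\frac{(x_i - \mu_{y,i})^2}{2\sigma_{y,i}^2} \right)$$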

From the properties of the normal distribution, assuming p(x|y) ~ N(μ, σ²), we can estimate the mathematical expectation (mean) and variance by maximum likelihood:

Formula-14: normal distribution mathematical expectation

Formula-15: normal distribution variance
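Maximum likelihood gives the familiar sample estimates, computed separately for each class and each attribute:

$$\hat{\mu} = \frac{1}{n} \sum_{i=1}^{n} x_i, \qquad \hat{\sigma}^2 = \frac{1}{n} \sum_{i=1}^{n} (x_i - \hat{\mu})^2$$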

In the next post I will write some code to show how Naive Bayes works.
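As a small preview, here is a minimal Gaussian Naive Bayes sketch in Python, assuming Gaussian class-conditionals as in Formulas 12, 14 and 15; the class name GaussianNaiveBayes and the toy data are invented for illustration and should not be read as the final implementation:

```python
import numpy as np

class GaussianNaiveBayes:
    """Minimal Gaussian Naive Bayes: fits per-class mean/variance (Formulas 14-15)
    and classifies with argmax of log P(y) + sum_i log p(x_i | y) (Formula 11)."""

    def fit(self, X, y):
        self.classes_ = np.unique(y)
        self.mu_ = {}       # per-class feature means
        self.var_ = {}      # per-class feature variances
        self.prior_ = {}    # class prior probabilities P(y)
        for c in self.classes_:
            Xc = X[y == c]
            self.mu_[c] = Xc.mean(axis=0)
            self.var_[c] = Xc.var(axis=0) + 1e-9  # small epsilon avoids division by zero
            self.prior_[c] = len(Xc) / len(X)
        return self

    def predict(self, X):
        preds = []
        for x in X:
            scores = {}
            for c in self.classes_:
                # log prior + sum of log Gaussian densities (logs avoid underflow)
                log_likelihood = -0.5 * np.sum(
                    np.log(2 * np.pi * self.var_[c]) + (x - self.mu_[c]) ** 2 / self.var_[c]
                )
                scores[c] = np.log(self.prior_[c]) + log_likelihood
            preds.append(max(scores, key=scores.get))
        return np.array(preds)


# Tiny usage example with made-up data: two 1-D clusters.
X = np.array([[1.0], [1.2], [0.8], [5.0], [5.2], [4.8]])
y = np.array([0, 0, 0, 1, 1, 1])
model = GaussianNaiveBayes().fit(X, y)
print(model.predict(np.array([[1.1], [4.9]])))  # expected: [0 1]
```

Working in log space is a design choice: multiplying many small probabilities underflows quickly, while summing their logarithms does not change the argmax.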






This post is from 夏日小草 (homeway.me); please credit the source when reposting: http://homeway.me/2017/05/22/machine-learning-naive-bayesian/

-by grasses

2017-05-22 23:52:34
