Machine Learning - K-Nearest Neighbor

K-Nearest Neighbor is a type of instance-based learning, or lazy learning, where the function is only approximated locally and all computation is deferred until classification.
The KNN algorithm is among the simplest of all machine learning algorithms.

The KNN classifier offers an alternative approach to classification using lazy learning: it makes predictions without any model training, but at the cost of an expensive prediction step.

As shown in Figure 1, the KNN algorithm offers an alternative approach to classification using lazy learning.

2. How It Works

2.1 Simple Steps

We assume that two points close to each other are likely to have similar attributes, so we classify them into the same class. First, we convert the samples into vectors, then compute the distances between them, and finally find the nearest neighbors and classify the query point accordingly.

Suppose we have sample pairs <X1, Y1>, <X2, Y2> ... <Xn, Yn>, and we want to predict which class the sample <X, Y> belongs to. We calculate the distance between <X, Y> and each <Xm, Ym> (m ∈ [1, n]), then determine the class of <X, Y> by some mathematical approach.

As described above, the KNN algorithm follows these steps:

1. Build the training model: convert the training samples into vectors.
2. Choose the number k and a distance metric.
3. Calculate the distances between <X, Y> and each <Xm, Ym>, and keep the k nearest neighbors.
4. Decide the class of <X, Y> by weighted vote.
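The four steps above can be sketched in plain Python; the toy samples below are made up purely for illustration:

```python
import math
from collections import Counter

def knn_predict(train_list, test_X, k = 3):
    # step 3: distance from test_X to every training sample (last column is the label)
    distances = []
    for row in train_list:
        d = math.sqrt(sum((float(a) - float(b)) ** 2 for a, b in zip(row[:-1], test_X)))
        distances.append((d, row[-1]))
    # sort by distance and keep the k nearest neighbors
    distances.sort(key = lambda pair: pair[0])
    k_labels = [label for _, label in distances[:k]]
    # step 4: majority vote among the k neighbor labels
    return Counter(k_labels).most_common(1)[0][0]

# step 1: training samples as <x1, x2, label> rows (toy data)
train_list = [[1.0, 1.1, 'A'], [1.2, 0.9, 'A'], [8.0, 8.2, 'B'], [7.9, 8.1, 'B']]
print(knn_predict(train_list, [1.1, 1.0]))   # nearest neighbors are mostly 'A'
```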


2.2 Distance Algorithm

The Euclidean distance between points p and q is the length of the line segment connecting them.

In Cartesian coordinates, if p = (p1, p2, …, pn) and q = (q1, q2, …, qn) are two points in Euclidean n-space, then the distance d from p to q (or from q to p) is given by the Pythagorean formula:

    d(p, q) = √((q1 − p1)² + (q2 − p2)² + … + (qn − pn)²)
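A direct translation of the Euclidean distance formula into Python (the points used here are just illustrative):

```python
import math

def euclidean_distance(p, q):
    # square root of the sum of squared coordinate differences
    return math.sqrt(sum((qi - pi) ** 2 for pi, qi in zip(p, q)))

print(euclidean_distance([0, 0], [3, 4]))   # 3-4-5 right triangle -> 5.0
```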

Cosine similarity is the cosine of the angle between two n-dimensional vectors in an n-dimensional space. It is the dot product of the two vectors divided by the product of the two vectors’ lengths (or magnitudes). For two vectors A and B in an n-dimensional space:

    cos(θ) = (A · B) / (‖A‖ ‖B‖) = Σ AiBi / (√(Σ Ai²) · √(Σ Bi²))

Cosine similarity ranges between -1 and 1, where -1 is perfectly dissimilar and 1 is perfectly similar.
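Cosine similarity can be computed directly from the definition; the vectors below are illustrative:

```python
import math

def cosine_similarity(a, b):
    # dot product divided by the product of the two magnitudes
    dot = sum(ai * bi for ai, bi in zip(a, b))
    return dot / (math.sqrt(sum(ai ** 2 for ai in a)) * math.sqrt(sum(bi ** 2 for bi in b)))

print(cosine_similarity([1, 0], [0, 1]))    # orthogonal vectors -> 0.0
print(cosine_similarity([1, 2], [2, 4]))    # parallel vectors, similarity ≈ 1
```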

Another metric is the Manhattan distance: the distance between two points in a grid measured along a strictly horizontal and/or vertical path (that is, along the grid lines), as opposed to the diagonal or “as the crow flies” distance. The Manhattan distance is the simple sum of the horizontal and vertical components, whereas the diagonal distance would be computed by applying the Pythagorean theorem:

    d(p, q) = |p1 − q1| + |p2 − q2| + … + |pn − qn|
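The Manhattan distance is even simpler to implement, since no square root is needed (again, the points are illustrative):

```python
def manhattan_distance(p, q):
    # sum of the absolute coordinate differences along each axis
    return sum(abs(pi - qi) for pi, qi in zip(p, q))

print(manhattan_distance([0, 0], [3, 4]))   # 3 + 4 = 7
```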

3.Coding

Python sklearn documentation: http://scikit-learn.org/stable/modules/neighbors.html

My github example project: https://github.com/grasses/Machine-Learning/tree/master/ml/knn

Let’s start with an example: we have a table with rows described as <x1, x2, x3, x4>, and now we want to predict the class of <5.1, 3.5, 1.4, 0.2>.

step 1 Build the training matrix

    def train_predict(data_list = [], split = 0.5):
        train_list = []
        test_list = []
        # build training list && testing list
        for x in range(len(data_list)):
            if random.random() < split:
                test_list.append(data_list[x])
            else:
                train_list.append(data_list[x])
        for x in range(len(test_list)):
            # the last column of each row is the label
            y = test_list[x][len(test_list[x]) - 1]
            # testing one by one
            test(test_list[x], y)


step 2 Choose the number of k and a distance metric

    '''
    test() is the entrance of knn
    @input  test_X     list      sample in [1, 2, 3, 4] format
    @input  predict    string    actual label of test_X
    '''
    def test(test_X = [], predict = ''):
        neighbors = get_neighbors(train_list, test_X)
        result = get_min_dest(neighbors)

        total_count = 0
        right_count = 0

        if str(result) == str(predict):
            right_count += 1
        total_count += 1

        # append test_X to test_list by default
        test_X.append(predict)
        test_list.append(test_X)


step 3 Calculate distances

    '''
    Compute distance from <X1, Y1> to <X2, Y2> with the Euclidean distance algorithm
    '''
    def get_dest(source, destination):
        distance = 0
        # skip the last column, which holds the label
        for i in range(len(source) - 1):
            distance += pow((float(source[i]) - float(destination[i])), 2)
        return math.sqrt(distance)

    '''
    Get train list neighbors sorted by distance.
    '''
    def get_neighbors(train_list = [], test_X = [], k = 5):
        distances = []
        neighbors = []
        for x in range(len(train_list)):
            distances.append((train_list[x], get_dest(test_X, train_list[x])))
        # sort by distance, ascending
        distances.sort(key = operator.itemgetter(1))

        for i in range(k):
            neighbors.append(distances[i][0])
        return neighbors


step 4 Decide by weighted vote

    '''
    get_min_dest() returns the closest class among the neighbors by vote
    @input  neighbors    list      [[], []] format
    @return class        string    class with the most votes
    '''
    def get_min_dest(neighbors = []):
        class_votes = {}
        for x in range(len(neighbors)):
            # the last column of each neighbor is its class label
            response = neighbors[x][-1]
            class_votes[response] = class_votes.get(response, 0) + 1
        # sort votes in descending order and return the top class
        sorted_vote = sorted(class_votes.items(), key = operator.itemgetter(1), reverse = True)
        return sorted_vote[0][0]


Example Output

===============test instance()=================
test()-> knn predicted = 'Iris-versicolor', your given actual = 'Iris-versicolor'
test()-> knn predicted = 'Iris-virginica', your given actual = 'Iris-versicolor'

===============accuracy()=================
training size = 150, testing size = 2
accuracy()-> total instance = 2, right instance = 1, error instance = 1, right rate = 50.0%

===============predict()=================
predict()-> instance = [7.1, 3.0, 5.9, 2.1], knn predict = Iris-virginica

===============mult_predict()=================
predict()-> instance = [4.6, 3.4, 1.4, 1.3], knn predict = Iris-setosa
predict()-> instance = [7.1, 3.0, 5.9, 1.1], knn predict = Iris-virginica
predict()-> instance = [2.3, 3.3, 4.5, 6.7], knn predict = Iris-versicolor


Reference:

This article is by 夏日小草; when reposting, please credit the source: http://homeway.me/2017/04/21/machine-learning-knn/

-by grasses

2017-04-21 23:52:34