Machine Learning - K-Nearest Neighbor



1. About

K-Nearest Neighbor (KNN) is a type of instance-based learning, or lazy learning, where the function is only approximated locally and all computation is deferred until classification.
The KNN algorithm is among the simplest of all machine learning algorithms.

The KNN classifier offers an alternative approach to classification using lazy learning: it allows us to make predictions without any model training, but at the cost of an expensive prediction step.

Figure 1: Example of k-NN classification. The test sample (green circle) should be classified either to the first class of blue squares or to the second class of red triangles. If k = 3 (solid line circle) it is assigned to the second class because there are 2 triangles and only 1 square inside the inner circle. If k = 5 (dashed line circle) it is assigned to the first class (3 squares vs. 2 triangles inside the outer circle).

As Figure 1 shows, the class of a test sample is decided by the classes of its k nearest training samples.



2. How It Works

2.1 Basic Steps

The intuition is that two points that are close to each other are likely to have similar attributes, so they should belong to the same class. First, we convert the samples into vectors; then we compute the distances between them; finally we find the nearest neighbors and classify the sample by them.

Suppose we have labelled sample pairs <X1, Y1>, <X2, Y2> ... <Xn, Yn>, and we want to predict which class a new sample <X, Y> belongs to. We calculate the distance between <X, Y> and each <Xm, Ym> (m ∈ [1, n]), and then decide the class of <X, Y> from its nearest neighbors.

As described above, the KNN algorithm follows these steps:

1. Build the training set and convert the training samples into feature vectors.
2. Choose the number of neighbors k and a distance metric.
3. Calculate the distances between <X, Y> and each <Xm, Ym>, and keep the k nearest neighbors.
4. Decide the class of <X, Y> by a (weighted) majority vote of those neighbors.

2.2 Distance Metrics

The Euclidean distance between points p and q is the length of the line segment connecting them.

In Cartesian coordinates, if p = (p1, p2, …, pn) and q = (q1, q2, …, qn) are two points in Euclidean n-space, then the distance (d) from p to q, or from q to p, is given by the Pythagorean formula:

d(p, q) = sqrt((q1 - p1)^2 + (q2 - p2)^2 + ... + (qn - pn)^2)
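
For example, with p = (1, 2) and q = (4, 6): d(p, q) = sqrt((4 - 1)^2 + (6 - 2)^2) = sqrt(9 + 16) = 5.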

Cosine similarity is the cosine of the angle between two n-dimensional vectors in an n-dimensional space. It is the dot product of the two vectors divided by the product of the two vectors’ lengths (or magnitudes). For two vectors A and B in an n-dimensional space:

cos(A, B) = (A · B) / (|A| * |B|) = Σ Ai*Bi / (sqrt(Σ Ai^2) * sqrt(Σ Bi^2))

Cosine similarity ranges between -1 and 1, where -1 is perfectly dissimilar and 1 is perfectly similar.
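
For example, A = (1, 1) and B = (2, 2) point in the same direction, so their cosine similarity is 1, while A = (1, 0) and B = (0, 1) are orthogonal and their cosine similarity is 0.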

Another metric is the Manhattan distance: the distance between two points in a grid measured along strictly horizontal and/or vertical paths (that is, along the grid lines), as opposed to the diagonal or "as the crow flies" distance. The Manhattan distance is the simple sum of the horizontal and vertical components, whereas the diagonal distance would be computed by applying the Pythagorean theorem.

d(p, q) = |p1 - q1| + |p2 - q2| + ... + |pn - qn|
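
All three metrics are easy to write by hand. Below is a minimal sketch (the function names are mine, not part of the project):

    import math

    def euclidean_distance(p, q):
        # square root of the sum of squared component differences
        return math.sqrt(sum((pi - qi) ** 2 for pi, qi in zip(p, q)))

    def cosine_similarity(a, b):
        # dot product divided by the product of the vector lengths
        dot = sum(ai * bi for ai, bi in zip(a, b))
        norm_a = math.sqrt(sum(ai ** 2 for ai in a))
        norm_b = math.sqrt(sum(bi ** 2 for bi in b))
        return dot / (norm_a * norm_b)

    def manhattan_distance(p, q):
        # sum of the absolute component differences
        return sum(abs(pi - qi) for pi, qi in zip(p, q))

    print(euclidean_distance([1, 2], [4, 6]))    # 5.0
    print(cosine_similarity([1, 1], [2, 2]))     # ~1.0
    print(manhattan_distance([1, 2], [4, 6]))    # 7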



3. Coding

Python scikit-learn documentation: http://scikit-learn.org/stable/modules/neighbors.html

My GitHub example project: https://github.com/grasses/Machine-Learning/tree/master/ml/knn

Testing csv data: https://github.com/grasses/Machine-Learning/blob/master/ml/knn/knn.csv
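
Before walking through the hand-written version, here is a minimal sketch of the same task with scikit-learn's KNeighborsClassifier (using the Iris data bundled with scikit-learn as a stand-in for knn.csv; this is not the project code):

    from sklearn.datasets import load_iris
    from sklearn.neighbors import KNeighborsClassifier

    # load the Iris flower data (150 labelled samples with 4 features each)
    iris = load_iris()
    X, y = iris.data, iris.target

    # k = 5 neighbors with the default (Euclidean) metric
    knn = KNeighborsClassifier(n_neighbors = 5)
    knn.fit(X, y)

    # predict the class of a new sample
    label = knn.predict([[5.1, 3.5, 1.4, 0.2]])[0]
    print(iris.target_names[label])    # 'setosa'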

Let's start with an example. We have a table whose rows are described as <x1, x2, x3, x4> plus a class label, and we want to predict the class of <5.1, 3.5, 1.4, 0.2>.

Step 1: build the training set

    # imports used by the snippets in this section
    import math
    import operator
    import random

    def train_predict(data_list = [], split = 0.5):
        train_list = []
        test_list = []
        # randomly split data_list into a training list && a testing list
        for x in range(len(data_list)):
            if random.random() < split:
                test_list.append(data_list[x])
            else:
                train_list.append(data_list[x])
        for x in range(len(test_list)):
            # the last column of each row is the actual class label
            y = test_list[x][-1]
            # test the samples one by one
            test(test_list[x], y)
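
These snippets assume data_list has already been loaded from knn.csv. A minimal sketch of that loading step (load_csv is my own helper, not part of the project) could look like this, assuming each row holds the four feature values followed by the class label:

    import csv

    def load_csv(filename = 'knn.csv'):
        data_list = []
        with open(filename) as f:
            for row in csv.reader(f):
                if not row:
                    continue
                # four feature columns as floats, last column kept as the class label
                data_list.append([float(v) for v in row[:-1]] + [row[-1]])
        return data_list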

Step 2: choose the number k and a distance metric

    '''
    test() is the entrance of knn
    @input  test_X    list      sample in [1, 2, 3, 4] format
    @input  actual    string    actual class label of test_X
    @return result    string    predicted class label
    '''
    def test(test_X = [], actual = ''):
        # find the k nearest neighbors in the training list, then vote
        neighbors = get_neighbors(train_list, test_X)
        result = get_min_dest(neighbors)

        # accuracy counters (accumulated across calls in the full project)
        total_count = 0
        right_count = 0
        if str(result) == str(actual):
            right_count += 1
        total_count += 1

        # append test_X (with its actual label) back to the test list
        test_X.append(actual)
        test_list.append(test_X)
        return result

Step 3: calculate distances

    '''
    Compute the Euclidean distance between <X1, Y1> and <X2, Y2>
    (the last column is the class label, so it is skipped)
    '''
    def get_dest(source, destination):
        distance = 0
        for i in range(len(source) - 1):
            distance += pow((float(source[i]) - float(destination[i])), 2)
        return math.sqrt(distance)

    '''
    Get the k nearest neighbors from the train list, sorted by distance.
    '''
    def get_neighbors(train_list = [], test_X = [], k = 5):
        distances = []
        neighbors = []
        for x in range(len(train_list)):
            # (training sample, distance to test_X) pairs
            distances.append((train_list[x], get_dest(test_X, train_list[x])))
        # sort by distance, closest first
        distances.sort(key = operator.itemgetter(1))

        # keep the k closest training samples
        for i in range(k):
            neighbors.append(distances[i][0])
        return neighbors

Step 4: decide by majority vote

    '''
    get_min_dest() returns the winning class label by majority vote
    @input  neighbors    list      neighbors in [[], []] format
    @return label        string    class label with the most votes
    '''
    def get_min_dest(neighbors = []):
        class_votes = {}
        for x in range(len(neighbors)):
            # the last column of each neighbor is its class label
            response = neighbors[x][-1]
            if response in class_votes:
                class_votes[response] += 1
            else:
                class_votes[response] = 1
        # sort classes by vote count, highest first (items() works in Python 3)
        sorted_vote = sorted(class_votes.items(), key = operator.itemgetter(1), reverse = True)
        return sorted_vote[0][0]
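
The vote above is a plain majority vote. The step list in section 2.1 also mentions a weighted vote; a minimal sketch of a distance-weighted variant (get_weighted_vote is my own function, fed with the (sample, distance) pairs already built in get_neighbors) might look like:

    def get_weighted_vote(distances, k = 5):
        # distances: list of (training sample, distance) pairs sorted by distance
        class_votes = {}
        for sample, dist in distances[:k]:
            label = sample[-1]
            # closer neighbors contribute more (inverse-distance weight)
            weight = 1.0 / (dist + 1e-6)
            class_votes[label] = class_votes.get(label, 0.0) + weight
        # return the class with the largest total weight
        return max(class_votes, key = class_votes.get)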

Example Output

===============test instance()=================
test()-> knn predicted = 'Iris-versicolor', your given actual = 'Iris-versicolor'
test()-> knn predicted = 'Iris-virginica', your given actual = 'Iris-versicolor'

===============accuracy()=================
training size = 150, testing size = 2
accuracy()-> total instance = 2, right instance = 1, error instance = 1, right rate = 50.0%

===============predict()=================
predict()-> instance = [7.1, 3.0, 5.9, 2.1], knn predict = Iris-virginica

===============mult_predict()=================
predict()-> instance = [4.6, 3.4, 1.4, 1.3], knn predict = Iris-setosa
predict()-> instance = [7.1, 3.0, 5.9, 1.1], knn predict = Iris-virginica
predict()-> instance = [2.3, 3.3, 4.5, 6.7], knn predict = Iris-versicolor






This article is from 夏日小草. Please credit the source when reposting: http://homeway.me/2017/04/21/machine-learning-knn/

-by grasses

2017-04-21 23:52:34
