KNN Classification using Scikit-learn

    Learn K-Nearest Neighbors (KNN) classification and build a KNN classifier using the Python Scikit-learn package.

    K-Nearest Neighbors (KNN) is a simple, easy-to-understand, versatile and widely used machine learning algorithm. KNN is applied in a variety of domains such as finance, healthcare, political science, handwriting detection, image recognition and video recognition. In credit rating, financial institutions predict the credit rating of customers. In loan disbursement, banks predict whether a loan is safe or risky. In political science, potential voters can be classified into two classes: will vote or won't vote. The KNN algorithm is used for both classification and regression problems, and it is based on a feature similarity approach.

    In this tutorial, you are going to cover the following topics:

    • K-Nearest Neighbor Algorithm
    • How does the KNN algorithm work?
    • Eager Vs Lazy learners
    • How do you decide the number of neighbors in KNN?
    • Curse of Dimensionality
    • Classifier Building in Scikit-learn
    • Pros and Cons
    • How to improve KNN performance?
    • Conclusion

    K-Nearest Neighbors

    KNN is a non-parametric and lazy learning algorithm. Non-parametric means there is no assumption about the underlying data distribution; in other words, the model structure is determined from the dataset. This is very helpful in practice, where most real-world datasets do not follow theoretical mathematical assumptions. Lazy means the algorithm does not build a generalized model during a training phase; all training data is used in the testing phase. This makes training faster and the testing phase slower and costlier, in both time and memory: in the worst case, KNN must scan all data points to make a prediction, and storing all of the training data requires more memory.

    How does the KNN algorithm work?

    In KNN, K is the number of nearest neighbors. The number of neighbors is the core deciding factor. K is generally an odd number if the number of classes is 2. When K=1, the algorithm is known as the nearest neighbor algorithm. This is the simplest case: suppose P1 is the point whose label needs to be predicted. You first find the single closest point to P1, and then the label of that nearest point is assigned to P1.

    More generally, suppose P1 is the point for which a label needs to be predicted. First, you find the k closest points to P1 and then classify P1 by a majority vote of those k neighbors: each neighbor votes for its own class, and the class with the most votes is taken as the prediction. To find the closest points, you compute the distance between points using distance measures such as Euclidean distance, Hamming distance, Manhattan distance and Minkowski distance. KNN has the following basic steps (see the sketch after this list):

    1. Calculate distance
    2. Find closest neighbors
    3. Vote for labels
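
    To make these three steps concrete, here is a minimal from-scratch sketch in Python. The toy 2D points, labels and the knn_predict function are made up purely for illustration; it uses Euclidean distance and a simple majority vote.

    import math
    from collections import Counter

    def knn_predict(train_points, train_labels, query, k=3):
        # 1. Calculate distance: Euclidean distance from the query to every training point
        distances = [math.dist(query, p) for p in train_points]
        # 2. Find closest neighbors: indices of the k smallest distances
        neighbor_idx = sorted(range(len(distances)), key=lambda i: distances[i])[:k]
        # 3. Vote for labels: the most common label among the k neighbors wins
        votes = Counter(train_labels[i] for i in neighbor_idx)
        return votes.most_common(1)[0][0]

    # Toy training data: two clusters labelled 'A' and 'B'
    train_points = [(1, 1), (2, 1), (8, 8), (9, 7)]
    train_labels = ['A', 'A', 'B', 'B']
    print(knn_predict(train_points, train_labels, query=(2, 2), k=3))  # prints 'A'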

    Eager Vs. Lazy Learners

    Eager learners, when given training points, construct a generalized model before performing predictions on new points to classify. You can think of such learners as being ready, active and eager to classify unobserved data points.

    Lazy learning means there is no explicit training of the model; all of the data points are used at the time of prediction. Lazy learners wait until the last minute before classifying any data point. A lazy learner merely stores the training dataset and waits until a classification needs to be performed. Only when it sees the test tuple does it perform generalization, classifying the tuple based on its similarity to the stored training tuples. Unlike eager learning methods, lazy learners do less work in the training phase and more work in the testing phase to make a classification. Lazy learners are also known as instance-based learners because they store the training points, or instances, and all learning is based on those instances.

    Curse of Dimensionality

    KNN performs better with a small number of features than with a large number of features. As the number of features increases, the amount of data required also increases; an increase in dimension also leads to the problem of overfitting. To avoid overfitting, the amount of data needed grows exponentially as you increase the number of dimensions. This problem in high dimensions is known as the Curse of Dimensionality.

    To deal with the curse of dimensionality, you can perform principal component analysis (PCA) before applying a machine learning algorithm, or you can use a feature selection approach. Research has shown that in high dimensions, Euclidean distance stops being useful. Therefore, you may prefer other measures such as cosine similarity, which are much less affected by high dimensionality.
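
    As a rough sketch of these two remedies, the snippet below reduces the features with PCA before running KNN and also shows how to switch the distance measure to cosine. It uses scikit-learn's built-in wine dataset (introduced later in this tutorial) purely for convenience, and the numbers of components and neighbors are arbitrary choices.

    from sklearn import datasets
    from sklearn.decomposition import PCA
    from sklearn.preprocessing import StandardScaler
    from sklearn.pipeline import make_pipeline
    from sklearn.neighbors import KNeighborsClassifier
    from sklearn.model_selection import cross_val_score

    wine = datasets.load_wine()

    # Scale the features, project them onto 2 principal components, then classify with KNN
    pca_knn = make_pipeline(StandardScaler(), PCA(n_components=2), KNeighborsClassifier(n_neighbors=5))
    print(cross_val_score(pca_knn, wine.data, wine.target, cv=5).mean())

    # Alternatively, keep all features but use cosine distance instead of Euclidean
    cosine_knn = KNeighborsClassifier(n_neighbors=5, metric='cosine')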

    How do you decide the number of neighbors in KNN?

    Now that you understand how the KNN algorithm works, a question arises: how do you choose the optimal number of neighbors, and what is its effect on the classifier? The number of neighbors (K) in KNN is a hyperparameter that you need to choose at model-building time. You can think of K as a controlling variable for the prediction model.

    Research has shown that no single number of neighbors suits all kinds of datasets; each dataset has its own requirements. With a small number of neighbors, noise has a higher influence on the result, while a large number of neighbors makes prediction computationally expensive. Research has also shown that a small number of neighbors gives the most flexible fit, with low bias but high variance, whereas a large number of neighbors produces a smoother decision boundary, which means lower variance but higher bias.

    Generally, data scientists choose an odd value of K when the number of classes is even. You can also generate the model for different values of K and compare their performance. You can also try the elbow method here, as sketched below.
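
    For example, a rough sketch of checking the model's performance for different values of K could look like the snippet below. It uses scikit-learn's built-in wine dataset (used later in this tutorial); the train/test split and the range of K values are arbitrary choices.

    from sklearn import datasets
    from sklearn.model_selection import train_test_split
    from sklearn.neighbors import KNeighborsClassifier
    from sklearn.metrics import accuracy_score

    X, y = datasets.load_wine(return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1)

    # Train and evaluate a KNN classifier for each candidate value of K
    for k in range(1, 16):
        knn = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train)
        print(k, accuracy_score(y_test, knn.predict(X_test)))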

    Classifier Building in Scikit-learn

    KNN Classifier

    Defining dataset

    Let's first create your own dataset. Here you need two kinds of attributes or columns in your data: features and a label. The reason for the two types of columns is the supervised nature of the KNN algorithm.

    # Assigning features and label variables
    # First Feature
    weather=['Sunny','Sunny','Overcast','Rainy','Rainy','Rainy','Overcast','Sunny','Sunny',
    'Rainy','Sunny','Overcast','Overcast','Rainy']
    # Second Feature
    temp=['Hot','Hot','Hot','Mild','Cool','Cool','Cool','Mild','Cool','Mild','Mild','Mild','Hot','Mild']
    
    # Label or target variable
    play=['No','No','Yes','Yes','Yes','No','Yes','No','Yes','Yes','Yes','Yes','Yes','No']

    In this dataset, you have two features (weather and temperature) and one label (play).

    Encoding data columns

    Various machine learning algorithms require numerical input data, so you need to represent categorical columns numerically.

    In order to encode this data, you could map each value to a number, e.g. Overcast:0, Rainy:1, and Sunny:2.

    This process is known as label encoding, and sklearn conveniently does this for you using LabelEncoder.

    # Import LabelEncoder
    from sklearn import preprocessing
    
    #creating labelEncoder
    le = preprocessing.LabelEncoder()
    
    # Converting string labels into numbers.
    weather_encoded=le.fit_transform(weather)
    print(weather_encoded)

    Here, you imported the preprocessing module and created a LabelEncoder object. Using this LabelEncoder object, you fit and transform the "weather" column into a numeric column.

    Similarly, you can encode temperature and label into numeric columns.

    # converting string labels into numbers
    temp_encoded=le.fit_transform(temp)
    label=le.fit_transform(play)

    Combining Features

    Here, you will combine multiple columns or features into a single set of data using the zip() function.

    # combining weather and temp into a single list of tuples
    features=list(zip(weather_encoded,temp_encoded))

    Generating Model

    Let's build the KNN classifier model.

    First, import the KNeighborsClassifier module and create a KNN classifier object by passing the number of neighbors as an argument to the KNeighborsClassifier() function.

    Then, fit your model on the training data using fit() and perform a prediction on a new data point using predict().

    from sklearn.neighbors import KNeighborsClassifier
    
    model = KNeighborsClassifier(n_neighbors=3)
    
    # Train the model using the training sets
    model.fit(features,label)
    
    #Predict Output
    predicted= model.predict([[0,2]]) # 0:Overcast, 2:Mild
    print(predicted)

    In the above example, you gave the input [0,2], where 0 means Overcast weather and 2 means Mild temperature. The model predicts [1], which corresponds to 'Yes', i.e. play.
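
    As a small aside, because the same le object was last fitted on the play column in this script, you can map the numeric prediction back to the original label (ordinarily you would keep a separate encoder per column):

    # Decode the numeric prediction back to the original label
    print(le.inverse_transform(predicted))  # ['Yes']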

    KNN with Multiple Labels

    Till now, you have learned how to create a KNN classifier for two classes in Python using scikit-learn. Now you will learn about KNN with multiple classes.

    In this model building part, you can use the wine dataset, which is a very famous multi-class classification problem. The data is the result of a chemical analysis of wines grown in the same region in Italy but derived from three different cultivars. The analysis determined the quantities of 13 constituents found in each of the three types of wine.

    The dataset comprises 13 features ('alcohol', 'malic_acid', 'ash', 'alcalinity_of_ash', 'magnesium', 'total_phenols', 'flavanoids', 'nonflavanoid_phenols', 'proanthocyanins', 'color_intensity', 'hue', 'od280/od315_of_diluted_wines', 'proline') and a target (type of cultivars).

    This data has three types of cultivar classes: 'class_0', 'class_1', and 'class_2'. Here, you can build a model to classify the type of cultivar. The dataset is available in the scikit-learn library, or you can also download it from the UCI Machine Learning Repository.

    Loading Data

    Let's first load the required wine dataset from scikit-learn datasets.

    #Import scikit-learn dataset library
    from sklearn import datasets
    
    #Load dataset
    wine = datasets.load_wine()

    Exploring Data

    After you have loaded the dataset, you might want to know a little bit more about it. You can check feature and target names.

    # print the names of the features
    print(wine.feature_names)
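
    Similarly, you can print the target (class) names:

    # print the names of the target classes
    print(wine.target_names)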