In this blog, I will write about a very famous supervised learning algorithm: k-nearest neighbors, or KNN for short.
This algorithm is interesting because it differs from most other classification algorithms: it is often called a lazy learner. KNN falls under instance-based learning, where models are characterized by memorizing the training dataset; lazy learning is the special case of instance-based learning with essentially zero cost during the learning process, because all the computation is deferred until prediction time.
The algorithm can be summarized in three steps:
Choose the number of neighbors k, and compute the distance between the test object and every object in the training dataset
Find the k nearest neighbors of the test object we want to classify
Assign the test object to the class with the maximum frequency among those neighbors.
Exception: if neighbors sit at identical distances, the algorithm chooses the class label that appears first in the training dataset, so ties are broken by training order.
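To make the lazy-learning idea concrete, here is a minimal from-scratch sketch of the three steps above. This is illustrative only, not scikit-learn's internals: knn_predict is a name of my own choosing, the inputs are assumed to be NumPy arrays, and Euclidean distance is just one possible metric.
import numpy as np
from collections import Counter
def knn_predict(X_train, y_train, x_test, k=5):
    # Step 1: distance from the test object to every training object (Euclidean here)
    distances = np.linalg.norm(X_train - x_test, axis=1)
    # Step 2: indices of the k nearest neighbors; a stable sort breaks
    # distance ties by training order, matching the exception above
    nearest = np.argsort(distances, kind='stable')[:k]
    # Step 3: majority vote among the neighbors' class labels
    return Counter(y_train[nearest]).most_common(1)[0][0]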
Let’s implement the KNN classifier on the breast cancer dataset.
# Import the required packages
import pandas as pd
import matplotlib.pyplot as plt
import warnings
warnings.filterwarnings('ignore')
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn import metrics
# Load the dataset and take a first look
data = pd.read_csv('data.csv')
data.head()
# Drop the id column, which carries no predictive information
data.drop(['id'], axis=1, inplace=True)
data.columns
data.info()
# Dimensions of the dataset
print("Dimension:", data.shape)
# Class distribution of the target
data['diagnosis'].value_counts()
Out of 569 samples, 212 are categorized as malignant (M) and the remaining 357 as benign (B).
# Separate features and target; 'Unnamed: 32' is an empty trailing column in the CSV
X = data.drop(['diagnosis', 'Unnamed: 32'], axis=1)
y = data['diagnosis']
# Hold out a third of the data for testing
x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)
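One caveat worth noting: KNN is distance-based, so features with large numeric ranges can dominate the distance calculation. A common optional preprocessing step is to standardize the features; a minimal sketch using scikit-learn's StandardScaler is below. The rest of this post keeps the raw features, so the numbers match the grid search that follows.
from sklearn.preprocessing import StandardScaler
# Fit the scaler on the training split only, so no test information leaks in
scaler = StandardScaler()
x_train_scaled = scaler.fit_transform(x_train)
x_test_scaled = scaler.transform(x_test)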
# Hyperparameter tuning - GridSearchCV (cross-validates every parameter combination)
parameters = {'metric': ('manhattan', 'euclidean', 'minkowski'),
              'n_neighbors': range(1, 21)}
model = KNeighborsClassifier()
tuning = GridSearchCV(model, param_grid=parameters, scoring='accuracy')
tuning.fit(x_train, y_train)
print("Best parameters -", tuning.best_params_)
print("Best accuracy score -", tuning.best_score_)
The best accuracy score is achieved with k = 5 (and, per the best parameters above, the Manhattan distance metric).
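If you want to see how accuracy varies with k rather than just the single best value, GridSearchCV stores every tried combination in its cv_results_ attribute. A quick sketch, assuming the parameter grid above (the column names come from the grid keys):
results = pd.DataFrame(tuning.cv_results_)
# Keep only the rows for the Manhattan metric and plot mean CV accuracy against k
manhattan = results[results['param_metric'] == 'manhattan']
plt.plot(manhattan['param_n_neighbors'], manhattan['mean_test_score'])
plt.xlabel('n_neighbors (k)')
plt.ylabel('Mean CV accuracy')
plt.show()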
# Refit with the best parameters; p only applies to the Minkowski metric, so it is dropped here
knn = KNeighborsClassifier(n_neighbors=5, metric='manhattan')
knn.fit(x_train, y_train)
y_pred = knn.predict(x_test)
print("The accuracy score of the training dataset:", metrics.accuracy_score(y_train, knn.predict(x_train)))
print("The accuracy score of the testing dataset:", metrics.accuracy_score(y_test, y_pred))