### Effects of Outliers on Data Science

As we begin working with data, we often find errors such as missing values, outliers, and improper formatting. In a nutshell, we call these inconsistencies. Such inconsistencies skew the data and hamper the ability of machine learning algorithms to predict correctly. In this article, we will see how outliers affect the accuracy of machine learning algorithms, and how scaling helps or hurts our learning. For simplicity, we use two non-parametric algorithms: k-NN and decision trees.

We will use the Hepatitis C Virus (HCV) for Egyptian Patients Data Set, obtained from the UCI Machine Learning Repository. The data consists of Egyptian patients who underwent treatment dosages for HCV over about 18 months. There are 1,385 patients in total, described by 29 attributes ranging from age to counts of white blood cells (WBC), red blood cells (RBC), platelets, etc.

First and foremost, we load the data and the required libraries in Python. Once we have the data set, it is advisable to check for the inconsistencies mentioned earlier. To do so, we use pandas' `info` function. Here we observe that there are no missing values, and since the data is numeric, we can be certain that all attribute values are of type int64 or float64. With no null values, the data is ready for modeling. We also want to check for outliers; one quick check in pandas is the `describe` function, which provides summary statistics such as the minimum and maximum values, quantiles, and standard deviation. Here, we observe that RNA EOT has a minimum value of 5, which is very far from its mean. Now we are certain that there is at least one outlier, so we will see how treating outliers affects our models.
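
The two checks above can be sketched as follows. The article loads the UCI file directly; here a tiny synthetic frame with the same kinds of numeric columns (including a planted low RNA EOT value) stands in for it, and the file name in the comment is only a guess:

```python
import pandas as pd

# In the article the frame comes from the UCI file, e.g.:
# df = pd.read_csv("HCV-Egy-Data.csv")   # hypothetical filename
# A small synthetic stand-in with similar numeric columns:
df = pd.DataFrame({
    "Age": [35, 46, 57, 42, 61],
    "WBC": [7425, 4576, 8610, 5102, 12101],
    "RNA EOT": [5, 808450, 625321, 744463, 582301],  # note the lone value of 5
})

df.info()                # dtypes and non-null counts: all numeric, no nulls
stats = df.describe()    # min/max, quartiles, standard deviation per column
print(stats.loc["min", "RNA EOT"])  # the minimum sits far below the mean
```

A minimum that sits several standard deviations below the mean, as here, is the quick `describe`-based signal that at least one outlier is present.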

Since our attributes are on different scales, we perform scaling to bring every attribute onto a common scale, which also reduces the influence of outliers. We use the scikit-learn library to split and model our data.
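
A minimal sketch of the split-then-scale step, using random data in place of the HCV attributes (the array shapes are illustrative, not the real data set):

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the predictors and the class label.
rng = np.random.default_rng(0)
X = rng.normal(loc=100, scale=30, size=(200, 5))
y = rng.integers(1, 5, size=200)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42)

# Fit the scaler on the training split only, then apply it to both splits,
# so no information from the test set leaks into the scaling.
scaler = StandardScaler()
X_train_s = scaler.fit_transform(X_train)
X_test_s = scaler.transform(X_test)

print(X_train_s.mean(axis=0).round(6))  # each column centered near 0
```

Fitting the scaler before splitting would let test-set statistics influence the transform, which is the usual leakage pitfall at this step.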

Once we have split our data into training and test sets, we can run the k-NN algorithm with k = 1 and check its accuracy. The model's accuracy is 24%, which is poor, but raw accuracy is not our objective here. We can run the elbow method to select an optimal k value; we find that the minimum error rate occurs at k = 35, with an accuracy of 31%.
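
The elbow method described above can be sketched like this: fit k-NN for a range of k values, record the test error rate for each, and pick the k where the error bottoms out. Synthetic data stands in for the HCV features, so the chosen k and accuracy will differ from the article's numbers:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic classification data standing in for the HCV attributes.
X, y = make_classification(n_samples=400, n_features=10,
                           n_informative=4, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42)

# Record the test error rate for each candidate k.
error_rates = {}
for k in range(1, 40, 2):
    knn = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train)
    error_rates[k] = 1 - knn.score(X_test, y_test)

# The "elbow": the k with the lowest error rate.
best_k = min(error_rates, key=error_rates.get)
print("best k:", best_k)
```

In practice the error rates are usually plotted against k so the elbow can be read off visually rather than just taking the minimum.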

Next, we remove outliers lying beyond the 2nd and 98th percentiles of the data. After removal, we observe that k-NN accuracy at k = 1 is 28%, a 16% relative increase. Similarly, k = 21 gives 33%, a 6% relative increase. For decision trees, however, accuracy decreased by almost 4%.
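
One common way to implement the percentile-based trimming described above is to keep only the rows where every column falls inside the 2nd and 98th percentiles. This sketch uses synthetic data with a planted extreme value, mirroring the RNA EOT outlier noted earlier:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
df = pd.DataFrame({"RNA EOT": rng.normal(600000, 100000, 500)})
df.loc[0, "RNA EOT"] = 5          # plant an outlier like the one observed

# Per-column 2nd and 98th percentile bounds.
lo, hi = df.quantile(0.02), df.quantile(0.98)

# Keep rows where every column lies within its bounds.
trimmed = df[((df >= lo) & (df <= hi)).all(axis=1)]

print(len(df), "->", len(trimmed))  # the extreme value is dropped
```

An alternative is clipping (winsorizing) the values to the bounds instead of dropping the rows, which preserves the sample size; the article's approach removes the rows outright.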

We can conclude that removing outliers can increase model accuracy: it improved accuracy by a significant amount for k-NN, but decreased it for decision trees. This leads us to the next step in our analysis, parameter tuning, which we will dig into to target better accuracy.
