Distance-based algorithms are widely used for data classification problems. The k-nearest neighbour classification (k-NN) is one of the most popular distance-based algorithms. This classification is based on measuring the distances between the test sample and the training samples to determine the final classification output. The traditional k-NN classifier works naturally with numerical data. The main objective of this paper is to investigate the performance of k-NN on heterogeneous datasets, where data can be described as a mixture of numerical and categorical features. For the sake of simplicity, this work considers only one type of categorical data, which is binary data. In this paper, several similarity measures have been defined by combining well-known distances for both numerical and binary data, in order to investigate k-NN performance when classifying such heterogeneous data sets. The experiments used six heterogeneous datasets from different domains and two categories of measures. Experimental results showed that the proposed measures performed better for heterogeneous data than Euclidean distance, and that the challenges raised by the nature of heterogeneous data call for personalised similarity measures adapted to the data characteristics.

Classification is a supervised machine learning process that maps input data into predefined groups or classes. The main condition for applying a classification technique is that all data objects are assigned to classes, and that each data object is assigned to only one class. Distance-based classification algorithms are techniques that classify data objects by computing the distance between the test sample and all training samples using a distance function. Distance-based algorithms, though, were originally proposed to deal with a single type of data, using distance measurements to determine the similarity between data objects. These algorithms were subsequently developed to handle heterogeneous data, as real-world data sets are often diverse in type, format, content and quality, particularly when they are gathered from different sources. In general, when classifying heterogeneous data using distance-based algorithms, there are two categories of methods. The first category converts values from one data type to another (e.g. binning, interpolating or projecting data); distance-based algorithms can then be used with an appropriate measure to classify the data. However, this approach is not always effective, as the similarity measure on the transformed data does not necessarily represent the similarity of the original heterogeneous data consistently, especially when the transformation is not fully reversible. Moreover, the conversion can fundamentally alter values to make them more equidistant, meaning there is no guarantee that the data will be interpreted correctly, which introduces the risk of losing or altering vital information and thereby compromising the decision the classification task is designed to support. The second category extends distance-based algorithms to handle heterogeneous data directly. This can be done using distance measures that can handle heterogeneous data. One common classification technique based on distance measures is k-nearest neighbours (k-NN). The traditional k-NN algorithm finds the k nearest neighbour(s) and classifies numerical data records by calculating the distance between the test sample and all training samples using the Euclidean distance.
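The kind of combined measure described above can be sketched in code. The following is an illustrative sketch only, not a reproduction of the paper's proposed measures: it sums a Euclidean distance over the numerical features and a Hamming-style mismatch count over the binary features, and the function names, the equal weighting of the two components, and the toy data are all assumptions made for the example.

```python
import numpy as np

def mixed_distance(x_num, x_bin, y_num, y_bin):
    """Illustrative heterogeneous distance: Euclidean distance on the
    numerical features plus a Hamming-style mismatch count on the
    binary features, combined with equal weight (an assumption)."""
    d_num = np.sqrt(np.sum((x_num - y_num) ** 2))
    d_bin = np.sum(x_bin != y_bin)
    return d_num + d_bin

def knn_predict(train_num, train_bin, train_labels, q_num, q_bin, k=3):
    """Classify one query sample by majority vote among its k nearest
    training samples under the mixed distance."""
    dists = [mixed_distance(q_num, q_bin, tn, tb)
             for tn, tb in zip(train_num, train_bin)]
    nearest = np.argsort(dists)[:k]
    votes = [train_labels[i] for i in nearest]
    return max(set(votes), key=votes.count)

# Toy data: each sample has two numerical and two binary features.
train_num = np.array([[1.0, 2.0], [1.1, 1.9], [8.0, 8.0]])
train_bin = np.array([[0, 1], [0, 1], [1, 0]])
labels = ['a', 'a', 'b']

pred = knn_predict(train_num, train_bin, labels,
                   np.array([1.05, 2.0]), np.array([0, 1]), k=3)
# The query is close to the first two samples, so the vote is 'a'.
```

In practice the weighting between the numerical and binary components matters, since an unscaled Euclidean term can dominate the mismatch count (or vice versa); normalising the numerical features first is a common precaution.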