In the figure, the red, blue, and green dots are sample data belonging to three different categories. To classify a new point, find the five sample points closest to it (that is, k is 5). Among these five points, the category with the most members wins the vote (here, four points belong to one category and one point to another), so the new point is assigned to the majority category.
The algorithm flow of KNN is also very simple; see the flow chart below.
KNN is a simple and practical classification algorithm that can be used in many classification scenarios, such as news classification, commodity classification, and even simple text recognition. Take news classification as an example: a batch of news articles can be manually labeled with their categories in advance, and a feature vector can be computed for each of them. For an unclassified article, after computing its feature vector, we calculate its distance to every labeled article and then apply the KNN algorithm to classify it automatically.
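To make this concrete, here is a minimal sketch of KNN in Python. It uses Euclidean distance, which is explained in the next section; the function names and toy data are illustrative, not from any particular library:

```python
import math
from collections import Counter

def knn_classify(point, samples, k=5):
    """Classify `point` by majority vote among its k nearest samples.

    `samples` is a list of (feature_vector, category) pairs.
    """
    # 1. Compute the distance from the point to every labeled sample.
    distances = [(math.dist(point, features), category)
                 for features, category in samples]
    # 2. Keep the k samples with the smallest distances.
    nearest = sorted(distances)[:k]
    # 3. The most common category among those k neighbors wins the vote.
    votes = Counter(category for _, category in nearest)
    return votes.most_common(1)[0][0]

# Toy data: two categories in a two-dimensional feature space.
samples = [((1, 1), "red"), ((1, 2), "red"), ((2, 1), "red"),
           ((8, 8), "green"), ((8, 9), "green")]
print(knn_classify((2, 2), samples, k=3))  # -> "red"
```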
Reading this, you will surely ask: how is the distance between data computed? And how do we obtain the feature vector of a news article?
The key to the KNN algorithm is comparing the distance between the data to be classified and the sample data. In machine learning this is usually done by extracting feature values from the data, forming an n-dimensional real vector space (also called the feature space) from those feature values, and then computing the distance between vectors in that space. There are many ways to compute this distance, such as Euclidean distance and cosine similarity.
For data $x_i$ and $x_j$, if the feature space is an n-dimensional real vector space $\mathbb{R}^n$, that is, $x_i = (x_{i1}, x_{i2}, \ldots, x_{in})$ and $x_j = (x_{j1}, x_{j2}, \ldots, x_{jn})$, then the Euclidean distance between them is:

$$d(x_i, x_j) = \sqrt{\sum_{k=1}^{n}(x_{ik} - x_{jk})^2}$$
In fact, we learned this Euclidean distance formula back in junior high school; the distance between two points in both plane geometry and solid geometry is calculated with it, with n = 2 in plane geometry (two dimensions) and n = 3 in solid geometry (three dimensions). But whatever the dimension n of the feature space is, the formula for the spatial distance between two data points remains this Euclidean formula. Most machine learning algorithms need to compute the distance between data, so mastering this distance formula is a foundation for mastering machine learning algorithms.
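As a quick sanity check of the formula, here is a direct translation into Python (a hand-written sketch, not a library function), with the plane and solid geometry cases as examples:

```python
import math

def euclidean_distance(x, y):
    """Euclidean distance between two n-dimensional vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

print(euclidean_distance((0, 0), (3, 4)))        # plane geometry, n = 2 -> 5.0
print(euclidean_distance((0, 0, 0), (1, 2, 2)))  # solid geometry, n = 3 -> 3.0
```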
Euclidean distance is the most commonly used distance formula, but in machine learning on text data and user rating data, a more commonly used measure is cosine similarity:

$$\cos\theta = \frac{x_i \cdot x_j}{|x_i|\,|x_j|}$$
The closer the cosine similarity is to 1, the more similar the data; the closer it is to 0, the greater the difference. Using cosine similarity can strip out some redundant information and, in some cases, gets closer to the essence of the data. A simple example: suppose the feature words of two articles are "big data", "machine learning", and "geek time". The feature vector of article A is (3, 3, 3), meaning each of the three words appears 3 times; the feature vector of article B is (6, 6, 6), meaning each word appears 6 times. Looking only at the feature vectors, the two look quite different, and by Euclidean distance they are indeed far apart, yet their cosine similarity is 1, which means they are very similar.
Cosine similarity actually measures the angle between vectors, while the Euclidean distance formula measures the spatial distance. Cosine similarity cares more about the similarity of the data. For example, if two users rate two products (3, 3) and (4, 4) respectively, the two users' preferences are similar; in such cases cosine similarity is more reasonable than Euclidean distance.
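A small hand-rolled sketch makes the contrast concrete, reusing the article and rating examples above (the function is our own, not a library call):

```python
import math

def cosine_similarity(x, y):
    """cos(theta) = dot(x, y) / (|x| * |y|)."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return dot / (norm_x * norm_y)

a, b = (3, 3, 3), (6, 6, 6)               # articles A and B from the example
print(math.dist(a, b))                    # Euclidean distance ~5.2: far apart
print(cosine_similarity(a, b))            # 1.0: same direction, very similar
print(cosine_similarity((3, 3), (4, 4)))  # two users' ratings -> 1.0
```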
We know that machine learning algorithms need to compute distances, and computing distances requires the feature vectors of the data, so extracting feature vectors is an important task for machine learning engineers, sometimes even the most important one. Different data and different application scenarios call for different features. Let's take common text data as an example and see how to extract text feature vectors.
The feature values of text data are its keywords, and TF-IDF is a common and intuitive keyword extraction algorithm. The algorithm is composed of two parts: TF and IDF.
TF is Term Frequency, the frequency with which a word appears in a document. The more often a word appears in a document, the higher its TF value.
Word frequency:

$$\text{TF} = \frac{\text{number of times the word appears in the document}}{\text{total number of words in the document}}$$
IDF is Inverse Document Frequency, which indicates how scarce the word is across all documents. The fewer documents the word appears in, the higher its IDF value.
Inverse document frequency:

$$\text{IDF} = \log\frac{\text{total number of documents}}{\text{number of documents containing the word}}$$
The product of TF and IDF is TF-IDF: $\text{TF-IDF} = \text{TF} \times \text{IDF}$.
So if a word appears frequently in one document but rarely across all documents, it is probably a keyword of that document. For example, in a technical article about atomic energy, words such as "nuclear fission", "radioactivity", and "half-life" will appear frequently, so their TF is high; but they appear rarely across all documents, so their IDF is high. The TF-IDF values of these words will therefore be very high, and they are likely the keywords of the document. If the article is about China's atomic energy program, the word "China" may also appear frequently, so its TF is high too; but "China" appears in many other documents as well, so its IDF is low. In the end the TF-IDF of "China" will be very low, and it will not become a keyword of this document.
Once the keywords are extracted, a feature vector can be built from the keywords' word frequencies. For example, in the atomic energy article above, the three words "nuclear fission", "radioactivity", and "half-life" are the features, and they appear 12, 9, and 4 times respectively, so the feature vector of the article is (12, 9, 4). We can then use the spatial distance formulas above to compute the distance to other documents and, combined with the KNN algorithm, classify documents automatically.
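Here is a compact sketch of TF-IDF in Python, assuming each document is already tokenized into a list of words; the corpus, function name, and numbers are illustrative:

```python
import math

def tf_idf(word, document, corpus):
    """TF-IDF of `word` in `document`; each document is a list of words.

    Assumes `word` appears in at least one document of the corpus.
    """
    tf = document.count(word) / len(document)             # term frequency
    containing = sum(1 for doc in corpus if word in doc)  # document frequency
    idf = math.log(len(corpus) / containing)              # inverse document frequency
    return tf * idf

corpus = [
    ["nuclear", "fission", "radioactivity", "half-life", "china"],
    ["china", "economy", "trade"],
    ["china", "sports", "football"],
]
doc = corpus[0]
# "fission" is rare across the corpus -> high TF-IDF; "china" is everywhere -> 0.
print(tf_idf("fission", doc, corpus))  # ~0.22
print(tf_idf("china", doc, corpus))    # 0.0
```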
The Bayes formula is a classification method based on conditional probability: if we already know the probabilities that A and B each occur, and the probability that A occurs given B, we can use the Bayes formula to compute the probability that B occurs given A. That is, from the observed situation A, the input data, we can judge the probability, i.e. the likelihood, of B, and classify accordingly.
For example, suppose 60% of the students in a school are boys and 40% are girls. Boys always wear trousers, while half the girls wear trousers and half wear skirts. Walking on campus, you see a student in trousers ahead of you. Can you infer the probability that this student is a boy?
The answer is 75%, and the calculation is:

$$P(\text{boy} \mid \text{trousers}) = \frac{P(\text{trousers} \mid \text{boy}) \, P(\text{boy})}{P(\text{trousers})} = \frac{1 \times 0.6}{0.6 \times 1 + 0.4 \times 0.5} = \frac{0.6}{0.8} = 75\%$$
This calculation uses the Bayes formula, which is written as:

$$P(B \mid A) = \frac{P(A \mid B) \, P(B)}{P(A)}$$
It says that the probability of B occurring given that A has occurred equals the probability of A occurring given B, multiplied by the probability of B and divided by the probability of A. Going back to the example above: if I ask you the probability that a student wearing a skirt is a girl, plugging into the Bayes formula gives 100%. This result can also be reached by common sense, but common sense is often disturbed by various factors and becomes biased. For example, someone who saw a news story about a doctoral student working for a boss with only a junior high school education lamented that studying is useless. In fact, such a case only draws attention because it is rare, and the sample size is far too small; statistical regularities over large amounts of data can accurately reflect the classification probabilities of things.
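For readers who want to check the arithmetic, here is the same calculation in a few lines of Python (only the numbers from the example above; the variable names are ours):

```python
p_boy, p_girl = 0.6, 0.4
p_trousers_given_boy, p_trousers_given_girl = 1.0, 0.5
p_skirt_given_boy, p_skirt_given_girl = 0.0, 0.5

# P(boy | trousers) = P(trousers | boy) * P(boy) / P(trousers)
p_trousers = p_trousers_given_boy * p_boy + p_trousers_given_girl * p_girl
print(p_trousers_given_boy * p_boy / p_trousers)  # 0.75

# P(girl | skirt) = P(skirt | girl) * P(girl) / P(skirt)
p_skirt = p_skirt_given_boy * p_boy + p_skirt_given_girl * p_girl
print(p_skirt_given_girl * p_girl / p_skirt)      # 1.0
```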
A typical application of Bayesian classification is spam filtering. From statistics over sample emails, we know the probability of each word appearing in an email, the prior probabilities of normal email and spam, and the probability of each word appearing in spam. When a new email arrives, we can compute, from the words it contains, the probability that the email is spam given those words, and then judge whether it is spam.
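The following is a minimal sketch of this idea as a naive Bayes classifier in Python. It assumes the word probabilities have already been estimated from labeled sample emails, treats words as independent of each other (the "naive" assumption), and all names and numbers are illustrative rather than from any real dataset:

```python
import math

def spam_probability(words, p_spam, p_word_given_spam, p_word_given_ham):
    """P(spam | words) under the naive word-independence assumption,
    accumulated in log space to avoid numerical underflow."""
    log_spam = math.log(p_spam)
    log_ham = math.log(1 - p_spam)
    for w in words:
        # Tiny fallback probability for words never seen in training.
        log_spam += math.log(p_word_given_spam.get(w, 1e-6))
        log_ham += math.log(p_word_given_ham.get(w, 1e-6))
    # Normalize: P(spam | words) = spam / (spam + ham).
    spam, ham = math.exp(log_spam), math.exp(log_ham)
    return spam / (spam + ham)

# Word probabilities estimated from labeled sample emails (made-up numbers).
p_word_given_spam = {"free": 0.30, "winner": 0.20, "meeting": 0.01}
p_word_given_ham  = {"free": 0.02, "winner": 0.01, "meeting": 0.20}
print(spam_probability(["free", "winner"], 0.4,
                       p_word_given_spam, p_word_given_ham))  # ~0.995
```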
In practice, the probabilities on the right-hand side of the Bayes formula can all be obtained through statistics over large amounts of data. When new data arrives, we plug it into the Bayes formula to compute its probability, and if that probability exceeds a certain threshold, we classify the data accordingly. The specific process is shown in the figure below.
The training samples are our raw data. Sometimes the raw data does not contain the dimension we want to compute. For example, to classify spam automatically with the Bayes formula, the original emails must first be labeled: we need to mark which emails are normal and which are spam. Machine learning that requires labeled training data like this is also called supervised machine learning.