Word Segmentation and Relation Diagram in Text Analysis
In text analysis, we need to split text into words and analyze those words statistically. The Python library jieba is a very popular word segmentation library, and Python's Matplotlib together with networkx can also draw a relationship network diagram, but here we will build it with the help of Gephi instead. This software is very easy to use, and below we explain some of its methods.

The jieba library is an important third-party Chinese word segmentation library for Python; it can split a Chinese text into a sequence of Chinese words.

The principle of word segmentation in jieba is to match the input text against its Chinese dictionary, build a graph of all candidate word combinations, and then use dynamic programming to find the segmentation with the highest probability.
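The dictionary-plus-dynamic-programming idea can be illustrated with a toy sketch. The mini dictionary and its frequencies below are made up for illustration; they are not jieba's real data or algorithm internals:

```python
import math

# Toy word dictionary with made-up frequencies (NOT jieba's real dict.txt).
FREQ = {"生": 100, "态": 80, "生态": 500, "环境": 600, "生态环境": 50}
TOTAL = sum(FREQ.values())

def best_segmentation(s):
    """Dynamic programming over the sentence: route[i] holds the best
    log-probability of segmenting s[i:] and the first word of that route."""
    n = len(s)
    route = {n: (0.0, "")}
    for i in range(n - 1, -1, -1):
        candidates = []
        for j in range(i + 1, n + 1):
            word = s[i:j]
            if word in FREQ:
                candidates.append((math.log(FREQ[word] / TOTAL) + route[j][0], word))
        if not candidates:  # unknown character: fall back to a single char
            candidates.append((math.log(1 / TOTAL) + route[i + 1][0], s[i]))
        route[i] = max(candidates)
    # Walk the best route from the start of the sentence.
    words, i = [], 0
    while i < n:
        word = route[i][1]
        words.append(word)
        i += len(word)
    return words

print(best_segmentation("生态环境"))  # → ['生态', '环境']
```

Note that with these frequencies the pair 生态 + 环境 beats the single long entry 生态环境, because the long entry's frequency is much lower; this is the same effect that makes custom-dictionary word frequencies matter, as discussed below.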

jieba supports four word segmentation modes: precise mode, full mode, search engine mode, and paddle mode.

Examples of the four segmentation modes:

Results:

As can be seen from the above, "ecological environment", "sewage treatment" and "limited company" should each come out as separate words. Precise mode and paddle mode fail to separate them, and although full mode and search engine mode do separate them, their output also contains unsegmented phrases. So here we can use a custom dictionary, loaded with load_userdict(). We must pay attention to the word frequency assigned to each custom word, however, otherwise the custom dictionary will not take effect: when a custom word's frequency is lower than the frequency in the default dictionary, the default segmentation is still used, and only when it is higher does the custom-dictionary segmentation win.

There is no specific formula for setting the word frequency in a custom dictionary: it simply needs to exceed the word's frequency in the default dictionary (the higher it is, the more certain the effect), but it should not be excessively large.

Default dictionary:

Custom dictionary

Where user_dict is defined as follows:

That covers the introduction and basic use of jieba; the deeper theory and usage can be found at this address: jieba-github reference.

In graph theory, the clustering coefficient (also called the aggregation coefficient) is a coefficient used to describe the degree of clustering between vertices in a graph; specifically, how interconnected the neighbours of a node are. In a real-life social network, for example, it measures how well your friends know each other. There is evidence that in all kinds of network structures reflecting the real world, and especially in social network structures, nodes tend to form relatively high-density groups (see the literature on transitivity in small-group structure models and on the collective dynamics of 'small-world' networks). In other words, the clustering coefficient of a real-world network is higher than that of a network obtained by randomly connecting pairs of nodes.

If some nodes in the graph are connected in pairs, we can find many "triangles", each consisting of three nodes that are all connected to one another; such a triple is called a closed triplet. There are also open triplets, where only two of the three possible edges between the three nodes are present (a triangle with one side missing).

There are two definitions of the clustering coefficient: global and local.

Global: the global clustering coefficient is the ratio of closed triplets to all triplets (open and closed), i.e. C = 3 × (number of triangles) / (number of connected triples of nodes).

Local: for a node i with k_i neighbours, the local clustering coefficient is the fraction of possible edges between those neighbours that actually exist, i.e. C_i = 2·e_i / (k_i·(k_i − 1)), where e_i is the number of edges between the neighbours of i.

Average coefficient: the average clustering coefficient of the graph is the mean of the local coefficients over all n nodes, i.e. C̄ = (1/n) Σ C_i.

The following analyses how these coefficients are computed:
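To make the three definitions concrete, here is a small sketch using networkx (a toy graph made up for illustration, not the Gephi dataset used below):

```python
# Requires: pip install networkx
import networkx as nx

# Toy graph: a triangle 1-2-3 with a pendant node 4 attached to node 3.
G = nx.Graph([(1, 2), (2, 3), (1, 3), (3, 4)])

# Global coefficient: 3 * triangles / connected triples = 3 * 1 / 5
print(nx.transitivity(G))        # → 0.6

# Local coefficient per node: C_i = 2*e_i / (k_i*(k_i - 1));
# nodes 1 and 2 get 1.0, node 3 gets 1/3, the pendant node 4 gets 0.
print(nx.clustering(G))

# Average of the local coefficients: (1 + 1 + 1/3 + 0) / 4 ≈ 0.5833
print(nx.average_clustering(G))
```

Node 3 illustrates the difference between the definitions: it sits in the triangle, yet its local coefficient is only 1/3 because its neighbour 4 is connected to neither 1 nor 2.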

Next, we use an example to analyse an application of the clustering coefficient. The tool used here is Gephi, and the data is Gephi's built-in sample data.

In the analysis above, we mentioned that node size represents the node's weight, but sometimes the nodes we need to identify are hard to pick out within the chosen size range. In that case we can work with colour instead, judging the weight by a colour change from small to large values; a gradient of a single colour also works. Here I use a three-colour range, selected and displayed as follows:

In the picture above we chose a red, yellow, blue colour sequence. On the right, it is now easier to judge a node's weight from both its size and its colour: the more often a word appears, the closer its colour is to blue, and vice versa.

It can be seen from the last two pictures that their layout and distribution are the same. What is the reason for this?

As shown in the figure, the layout forms aggregated clusters whose members are strongly attracted to each other as if by springs; in other words, closely related nodes are pulled together.

In the data, our graph is composed of nodes and edges. The handling of nodes was analysed briefly above, so how do we analyse edges? The strength of the relationship between two words can be judged from the thickness of the line in the edge diagram, that is, the number of times the two words occur together. As shown in the figure below:

Because the range of frequencies is too wide, we convert them into the range 0–1: the highest weight becomes 1, and all other values are converted relative to it.

That is, each converted value is the ratio of its weight to the maximum weight.
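The conversion described above can be sketched as follows; the word pairs and weight values are made up for illustration:

```python
# Raw edge weights: co-occurrence counts per word pair (made-up values).
weights = {
    ("word_a", "word_b"): 120,
    ("word_a", "word_c"): 30,
    ("word_b", "word_c"): 60,
}

# Normalize into the range 0-1: divide every weight by the maximum weight,
# so the heaviest edge gets exactly 1.0.
max_weight = max(weights.values())
normalized = {edge: w / max_weight for edge, w in weights.items()}

print(normalized)
# → {('word_a', 'word_b'): 1.0, ('word_a', 'word_c'): 0.25, ('word_b', 'word_c'): 0.5}
```

In Gephi the same effect is achieved through the edge-weight ranking settings, so this step is only needed when preparing the edge list yourself.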

References:

- jieba-github reference
- Clustering coefficient
- ForceAtlas2, a continuous graph layout algorithm for convenient network visualization