Microbial environmental factors
Many environmental and clinical factors affect the composition of the sample flora, but many of them are strongly multicollinear (correlated with one another), which distorts the subsequent correlation analysis. Before analyzing these factors, we can therefore screen them and keep only those with low multicollinearity for further study. The variance inflation factor (VIF) is a commonly used screening measure. It is defined as $\mathrm{VIF}_i = 1 / (1 - R_i^2)$, where $R_i^2$ is the coefficient of determination obtained by regressing the i-th independent variable on all the other independent variables in the model, that is, the proportion of its variance that the other variables explain. It measures how collinear the i-th variable is with the rest. The larger the VIF, the more severe the multicollinearity among the independent variables. Generally, environmental factors with a VIF greater than 10 are considered redundant. Remove the factors with a VIF greater than 10 and repeat the screening until the VIF of every remaining factor is below 10.
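The screening loop described above can be sketched in a few lines of Python with NumPy. This is only an illustration under stated assumptions: the function names `vif` and `screen_by_vif`, the example factor names, and the choice to drop one factor per round (the one with the largest VIF) are not from the original text, and the VIF here comes from ordinary least squares on the factors themselves rather than from a fitted RDA/CCA model.

```python
import numpy as np


def vif(X):
    """Return the VIF of every column of X (rows = samples, columns = factors)."""
    X = np.asarray(X, dtype=float)
    n, p = X.shape
    values = np.empty(p)
    for i in range(p):
        y = X[:, i]
        # Regress factor i on all other factors plus an intercept.
        A = np.column_stack([np.ones(n), np.delete(X, i, axis=1)])
        coef, *_ = np.linalg.lstsq(A, y, rcond=None)
        ss_res = np.sum((y - A @ coef) ** 2)
        ss_tot = np.sum((y - y.mean()) ** 2)
        r2 = 1.0 - ss_res / ss_tot
        values[i] = np.inf if r2 >= 1.0 else 1.0 / (1.0 - r2)
    return values


def screen_by_vif(X, names, threshold=10.0):
    """Drop the factor with the largest VIF until every VIF is <= threshold."""
    X = np.asarray(X, dtype=float)
    names = list(names)
    while X.shape[1] > 1:
        values = vif(X)
        worst = int(np.argmax(values))
        if values[worst] <= threshold:
            break
        X = np.delete(X, worst, axis=1)
        del names[worst]
    return names, X


# Hypothetical data: "pH_probe2" is almost an exact copy of "pH".
rng = np.random.default_rng(0)
ph = rng.normal(7.0, 1.0, size=40)
temperature = rng.normal(20.0, 3.0, size=40)
ph_probe2 = ph + rng.normal(0.0, 0.01, size=40)
env = np.column_stack([ph, temperature, ph_probe2])

kept, env_kept = screen_by_vif(env, ["pH", "temperature", "pH_probe2"])
print(kept)
```

In this made-up example one of the two nearly identical pH columns is removed, after which the remaining factors all have a VIF well below 10.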

VIF analysis is carried out on an RDA or CCA model, and the choice between the two models follows the same principle as in RDA/CCA analysis itself.

RDA (redundancy analysis) is a PCA constrained by environmental factors. It displays samples and environmental factors on the same two-dimensional ordination plot, from which the relationship between sample distribution and environmental factors can be seen intuitively. CCA (canonical correspondence analysis) is an ordination method based on correspondence analysis. It combines correspondence analysis with multiple regression, regressing each step of the calculation on the environmental factors, and is also known as multivariate direct gradient analysis. It is mainly used to show the relationship between flora and environmental factors. RDA is based on a linear model and CCA on a unimodal model. Both analyses can reveal the relationships among environmental factors, samples, and flora, either pairwise or all together.
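To make the "PCA constrained by environmental factors" idea concrete, the following NumPy sketch regresses the centered species table on the centered environmental table and then runs a PCA on the fitted values. It is a simplified illustration, not the procedure of any particular package: the function name `rda`, the returned dictionary keys, and the simulated data are assumptions, the scaling of the scores is left at its simplest, and CCA is not covered.

```python
import numpy as np


def rda(Y, X):
    """Minimal RDA: regress centered Y on centered X, then run PCA on the fit."""
    Y = np.asarray(Y, dtype=float)
    X = np.asarray(X, dtype=float)
    Yc = Y - Y.mean(axis=0)
    Xc = X - X.mean(axis=0)
    # Step 1: multivariate linear regression of species on environment.
    B, *_ = np.linalg.lstsq(Xc, Yc, rcond=None)
    Y_fit = Xc @ B
    # Step 2: PCA (via SVD) of the fitted values gives the constrained axes.
    _, s, Vt = np.linalg.svd(Y_fit, full_matrices=False)
    n_axes = min(np.linalg.matrix_rank(Xc), Y.shape[1])
    axes = Vt[:n_axes].T
    return {
        "eigenvalues": s[:n_axes] ** 2 / (Y.shape[0] - 1),
        "site_scores": Yc @ axes,      # samples projected on the constrained axes
        "species_scores": axes,        # species loadings on the same axes
        "constrained_fraction": np.sum(Y_fit ** 2) / np.sum(Yc ** 2),
    }


# Hypothetical data: 30 samples, 2 environmental factors, 5 species.
rng = np.random.default_rng(0)
env = rng.normal(size=(30, 2))
weights = np.array([[1.0, -1.0, 0.5, 0.0, 2.0],
                    [0.0, 1.0, 1.0, -2.0, 0.5]])
abundance = env @ weights + rng.normal(0.0, 0.1, size=(30, 5))

result = rda(abundance, env)
print(result["constrained_fraction"])
```

The number of constrained axes cannot exceed the number of environmental factors, which is why two factors give a two-dimensional ordination here.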

RDA is a constrained ordination method that is usually based on Euclidean distance. However, Euclidean distance is not suitable for some types of data. db-RDA removes this restriction on data type and can be used to analyze the relationship between species and environmental factors with other distance measures.

Db-RDA (Distance-Based Redundancy Analysis) is a five-step analysis process:

The Mantel test is a nonparametric statistical method for testing the correlation between two matrices. In ecology it is mainly used to test the correlation (Spearman rank correlation coefficient, etc.) between a community distance matrix (such as a UniFrac distance matrix) and a distance matrix of environmental variables (such as a matrix of differences in pH, temperature, or geographical location). The partial Mantel test checks whether the residual variation of matrix A is related to matrix B while controlling for the effect of a third matrix C. The analysis takes two numeric matrices as input, and the third (control) matrix is determined by the factors selected.

Software: Qiime
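As a rough illustration of what a Mantel test computes, the sketch below correlates the upper triangles of two distance matrices and builds a null distribution by permuting the rows and columns of one of them together. It is a stand-alone NumPy sketch, not the QIIME implementation: the function name `mantel`, the use of the Pearson coefficient (ranks could be substituted for a Spearman version), the one-sided p-value, and the simulated data are all assumptions. The partial test is not shown.

```python
import numpy as np


def mantel(D1, D2, permutations=999, seed=0):
    """Simple one-sided Mantel test between two square distance matrices."""
    D1 = np.asarray(D1, dtype=float)
    D2 = np.asarray(D2, dtype=float)
    n = D1.shape[0]
    upper = np.triu_indices(n, k=1)        # each pair of samples counted once
    x = D1[upper]
    r_obs = np.corrcoef(x, D2[upper])[0, 1]
    rng = np.random.default_rng(seed)
    hits = 0
    for _ in range(permutations):
        order = rng.permutation(n)
        # Permute rows and columns of D2 together to keep it a distance matrix.
        permuted = D2[np.ix_(order, order)]
        if np.corrcoef(x, permuted[upper])[0, 1] >= r_obs:
            hits += 1
    p_value = (hits + 1) / (permutations + 1)
    return r_obs, p_value


# Hypothetical data: community differences that track pH differences.
rng = np.random.default_rng(1)
ph = rng.uniform(4.0, 9.0, size=12)
env_dist = np.abs(ph[:, None] - ph[None, :])
profile = 0.1 * ph + rng.normal(0.0, 0.02, size=12)
community_dist = np.abs(profile[:, None] - profile[None, :])

r, p = mantel(community_dist, env_dist)
print(r, p)
```

With 999 permutations the smallest p-value this sketch can report is 0.001.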

Correlation heat map analysis calculates a correlation coefficient (Spearman rank correlation coefficient, Pearson correlation coefficient, etc.) between each environmental factor and each selected species, and displays the resulting numeric matrix directly as a heat map. Color represents the values in the two-dimensional matrix or table: the depth of the color indicates the magnitude of each value, so the data can be read intuitively from the defined color scale.

Software: R (pheatmap package).
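The matrix behind such a heat map can be computed as in the following Python sketch, which ranks each column (averaging tied ranks) and then takes Pearson correlations of the ranks, which is the Spearman coefficient. The helper names `average_ranks` and `spearman_matrix` and the toy data are assumptions for illustration. Drawing the heat map itself is left to a plotting package.

```python
import numpy as np


def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their ranks."""
    v = np.asarray(values, dtype=float)
    order = np.argsort(v, kind="mergesort")
    sorted_v = v[order]
    ranks = np.empty(len(v))
    i = 0
    while i < len(v):
        j = i
        while j + 1 < len(v) and sorted_v[j + 1] == sorted_v[i]:
            j += 1
        ranks[order[i:j + 1]] = (i + j) / 2 + 1
        i = j + 1
    return ranks


def spearman_matrix(env, taxa):
    """Spearman correlation of every env column with every taxa column."""
    env_r = np.column_stack([average_ranks(c) for c in np.asarray(env, float).T])
    taxa_r = np.column_stack([average_ranks(c) for c in np.asarray(taxa, float).T])
    n_env = env_r.shape[1]
    full = np.corrcoef(env_r, taxa_r, rowvar=False)
    return full[:n_env, n_env:]            # rows = factors, columns = species


# Toy data: one factor, one species rising with it and one falling with it.
env = np.array([[1.0], [2.0], [3.0], [4.0], [5.0]])
taxa = np.column_stack([env[:, 0] ** 3, -env[:, 0]])
rho = spearman_matrix(env, taxa)
print(rho)
```

A column with no variation has no defined correlation and yields `nan`, so constant factors or species should be removed beforehand.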

Linear regression is a statistical method that uses regression analysis to determine the relationship between one or more independent variables and a dependent variable. Ordination regression analysis of environmental factors is usually based on the results of α diversity or β diversity analysis: the α diversity index, or the score of each sample on the PC1 axis of the β diversity analysis, is plotted on the y axis, and the environmental factor (pH, temperature, etc.) of the corresponding sample is plotted on the x axis. A linear regression is then fitted to the scatter plot and labeled with $R^2$, which can be used to evaluate the relationship between the two. $R^2$ is the coefficient of determination, the proportion of variation explained by the regression line.
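A minimal Python sketch of this fit is shown below. The function name `linear_fit_r2` and the pH and PC1 values are made up for illustration, and the scatter plot itself is omitted.

```python
import numpy as np


def linear_fit_r2(x, y):
    """Fit y = slope * x + intercept by least squares and return R^2 as well."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    slope, intercept = np.polyfit(x, y, 1)
    predicted = slope * x + intercept
    ss_res = np.sum((y - predicted) ** 2)   # variation left unexplained
    ss_tot = np.sum((y - y.mean()) ** 2)    # total variation of y
    return slope, intercept, 1.0 - ss_res / ss_tot


# Hypothetical values: sample pH on the x axis, PC1 score on the y axis.
ph = [5.0, 5.5, 6.0, 6.5, 7.0, 7.5]
pc1 = [-0.42, -0.30, -0.11, 0.05, 0.28, 0.41]
slope, intercept, r2 = linear_fit_r2(ph, pc1)
print(slope, intercept, r2)
```

An $R^2$ close to 1 means the regression line accounts for almost all of the variation in the y values, while a value near 0 means it accounts for almost none.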

Variance partitioning analysis (VPA) quantifies how much of the variation in a response variable (such as differences between microbial communities) is explained by each of two or more groups (2 to 4 groups) of environmental factors individually and how much is explained by them jointly. It is often used in combination with RDA/CCA.

Analysis software: variance partitioning in the vegan package for R.
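For two groups of factors, the partition can be sketched in Python from three regressions: one on each group alone and one on both together. The sketch below is an assumption-laden illustration rather than the vegan code: the function names, the dictionary keys, the use of adjusted $R^2$ from a linear (RDA-style) fit, and the simulated "soil" and "climate" groups are all hypothetical.

```python
import numpy as np


def adjusted_r2(Y, X):
    """Adjusted R^2 of a multivariate linear regression of Y on X."""
    Y = np.asarray(Y, dtype=float)
    X = np.asarray(X, dtype=float)
    Yc = Y - Y.mean(axis=0)
    Xc = X - X.mean(axis=0)
    B, *_ = np.linalg.lstsq(Xc, Yc, rcond=None)
    r2 = np.sum((Xc @ B) ** 2) / np.sum(Yc ** 2)
    n, p = X.shape  # assumes the columns of X are linearly independent
    return 1.0 - (1.0 - r2) * (n - 1) / (n - p - 1)


def partition_two_groups(Y, X1, X2):
    """Split explained variation into unique, shared and residual fractions."""
    r2_1 = adjusted_r2(Y, X1)
    r2_2 = adjusted_r2(Y, X2)
    r2_both = adjusted_r2(Y, np.column_stack([X1, X2]))
    return {
        "group1_only": r2_both - r2_2,
        "shared": r2_1 + r2_2 - r2_both,
        "group2_only": r2_both - r2_1,
        "residual": 1.0 - r2_both,
    }


# Hypothetical data: the community responds to soil factors only.
rng = np.random.default_rng(0)
soil = rng.normal(size=(40, 2))
climate = rng.normal(size=(40, 2))
weights = np.array([[1.0, 0.5, -1.0], [0.5, -1.0, 1.0]])
community = soil @ weights + rng.normal(0.0, 0.3, size=(40, 3))

fractions = partition_two_groups(community, soil, climate)
print(fractions)
```

Because the fractions are differences of adjusted values, small negative numbers can appear for a group that explains nothing, which is normally read as zero.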

MaAsLin (Multivariate Association with Linear Models) analysis uses linear models to explore the association between environmental factors (such as clinical metadata) and the relative abundance data of microbial communities. Each reported association links one environmental factor to the relative abundance of one species or function, independently of the other environmental factors. Environmental factors can be continuous (such as age and weight), Boolean (gender), or discrete/factor data (cohort grouping and phenotype). The abundance data, expressed as the percentage relative abundance of species or functions, generally do not follow a normal distribution, so they are normalized with the arcsine square root transformation, and a boosting algorithm is used to select the environmental factors potentially associated with the data. Before the multivariate linear model is built, the quality of the environmental factors and the abundance data must be checked, and outliers and features with low abundance or no variation must be removed. Finally, a multivariate linear model is built with the environmental factors as predictors and the abundance data as the response, the corresponding coefficient is calculated, and its significance is tested. A coefficient greater than 0 indicates a positive association, and a coefficient less than 0 indicates a negative association. When the significance values p and q reach the threshold, a box plot is drawn for discrete factors and a scatter plot with the best-fit line is drawn for continuous factors.
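Two of the steps above, the arcsine square root transformation and the linear model fit, can be illustrated with the short Python sketch below. It is not the MaAsLin code: the function names `arcsine_sqrt` and `fit_feature` and the five-sample data set are hypothetical, abundances are assumed to be proportions between 0 and 1 (percentages would first be divided by 100), and quality control, boosting, and the p and q significance tests are left out.

```python
import numpy as np


def arcsine_sqrt(rel_abundance):
    """Arcsine square root transform of proportions in the range [0, 1]."""
    p = np.asarray(rel_abundance, dtype=float)
    if np.any(p < 0.0) or np.any(p > 1.0):
        raise ValueError("relative abundances must be proportions in [0, 1]")
    return np.arcsin(np.sqrt(p))


def fit_feature(rel_abundance, metadata):
    """Regress one transformed feature on all metadata columns at once.

    Returns one coefficient per metadata column (the intercept is dropped).
    """
    y = arcsine_sqrt(rel_abundance)
    M = np.asarray(metadata, dtype=float)
    A = np.column_stack([np.ones(len(y)), M])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return coef[1:]


# Hypothetical data: one taxon whose abundance rises with age.
age = [20, 30, 40, 50, 60]
sex = [0, 1, 0, 1, 0]                      # Boolean factor coded as 0/1
abundance = [0.05, 0.10, 0.15, 0.20, 0.25]
coefficients = fit_feature(abundance, np.column_stack([age, sex]))
print(coefficients)
```

Fitting all metadata columns in one model is what lets each coefficient describe one factor with the others held fixed. The sign of the age coefficient is positive here, matching the positive association built into the example.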

Procrustes analysis is a method for analyzing shape distributions. Mathematically, it finds a canonical shape by iteration and uses least squares to find the affine transformation that maps each sample shape onto this canonical shape. Procrustes analysis can take the ordination configurations of different multivariate data sets (≥2 groups) and superimpose them as closely as possible through translation, rotation, scaling, and similar transformations, which allows the data sets to be compared. The ordination method can be PCA, PCoA, etc.
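For the two-configuration case, the superimposition can be sketched with NumPy as follows. This is an illustrative sketch under assumptions: the function name `procrustes_fit` and the toy coordinates are hypothetical, both configurations are centered and scaled to unit size before fitting, reflections are allowed, and the iterative search for a canonical shape across many configurations is not covered.

```python
import numpy as np


def procrustes_fit(X, Y):
    """Superimpose configuration Y on X by translation, scaling and rotation."""
    X = np.asarray(X, dtype=float)
    Y = np.asarray(Y, dtype=float)
    # Translation: move both centroids to the origin.
    Xc = X - X.mean(axis=0)
    Yc = Y - Y.mean(axis=0)
    # Scaling: give both configurations unit total size.
    Xc = Xc / np.linalg.norm(Xc)
    Yc = Yc / np.linalg.norm(Yc)
    # Rotation: the least-squares optimum comes from the SVD of Yc^T Xc.
    U, s, Vt = np.linalg.svd(Yc.T @ Xc)
    rotation = U @ Vt
    Y_fit = s.sum() * Yc @ rotation
    m2 = np.sum((Xc - Y_fit) ** 2)         # residual sum of squares
    return Xc, Y_fit, m2


# Toy data: the second ordination is the first one rotated, scaled and shifted.
a = np.array([[0.0, 0.0], [1.0, 0.0], [1.0, 1.0], [0.0, 2.0]])
theta = np.deg2rad(30.0)
rot = np.array([[np.cos(theta), -np.sin(theta)],
                [np.sin(theta), np.cos(theta)]])
b = 3.0 * a @ rot + np.array([5.0, -2.0])

a_std, b_fit, m2 = procrustes_fit(a, b)
print(m2)
```

A residual near 0 means the two ordinations have essentially the same shape, and a residual near 1 means they have little in common.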