Complex Analysis of Fluorescence Intensity Fluctuations of Molecular Compounds

A method is proposed for the complex analysis of fluctuations in the fluorescence intensity of molecular compounds, which allows the structural composition of protein oligomers to be determined. The idea of the method is to analyze the photon counting histograms (PCHs) of experimental measurements using principal component analysis to assess the presence of oligomeric compounds, to perform hierarchical cluster analysis to determine the data classes corresponding to various molecular compounds, and then to select cluster medoids to determine the oligomeric composition of protein complexes. The efficiency of the analysis algorithms developed within the framework of the proposed method was confirmed on simulated and experimental photon counting histograms of the measured fluorescence intensity fluctuations of monomeric and dimeric forms of green fluorescent protein (GFP).


Methodology.
The developed method is based on the hypothesis that a set of multidimensional experimental data can be separated, in a certain information space, into several populations representing various molecular oligomeric compounds [10]. A small measurement volume is considered, in which molecular compounds of the same type prevail in a series of short time intervals. A normal distribution of the measured attributes is assumed for molecular compounds of the same type in the allocated space. For example, protein monomers can form a cloud, or spherical Gaussian cluster, of data in a multidimensional space of measurable attributes. If, however, protein oligomers are added to the monomeric forms of the protein, then the cloud is extended or divided into two parts along a line connecting the centers of the two populations. In the extreme case, two separate clouds or clusters, one of monomers and one of oligomers, are expected. Thus, if the data divide into clusters in a multidimensional space of attributes, this confirms the presence of several forms of protein compounds. Tasks of this kind are solved using data mining algorithms such as data dimensionality reduction and cluster analysis [10,13,14]. Dimensionality reduction algorithms allow switching to a low-dimensional space without losing the essential information [15,16]. Cluster analysis algorithms make it possible to find clusters of data with varying degrees of similarity, the number of which may be associated with aggregates of molecular compounds. In particular, principal component analysis (PCA) carries out a rotation such that the axis of the first principal component coincides with the diagonal (the direction of greatest scatter) of the data cloud in multidimensional space [17].
Therefore, the relative fraction of the scatter attributable to the first principal component for two types of molecular compounds (for which an elongated ellipsoid or two spherical data clouds are expected in a multidimensional space of attributes) should differ significantly from that for a monomer solution (one spherical cloud). It should be noted that the scatter diagram of the first two principal components is informative in that it reveals the data structure in two-dimensional space.
The idea behind the method of complex analysis is as follows: the PCHs are calculated from the recorded fluorescence intensities (other attributes can also be used, for example, the autocorrelation function or factorial cumulants of the distribution of the number of photocounts [18]); PCA is used to assess the presence of oligomeric compounds; hierarchical cluster analysis is used to determine groups of data corresponding to various molecular compounds; and cluster medoids (the PCHs having the smallest average distances to the remaining objects of the corresponding clusters) are then isolated to assess the parameters of the oligomeric composition of protein complexes. Comprehensive analysis requires experimental data for both the reference (monomers) and tested (oligomeric forms) samples. The block diagram of the developed method is shown in Fig. 1. Consider the main stages of the method.
Calculation of the PCH. We calculate $N$ PCHs from the registered sets of fluorescence intensities $S_i$, $i = 1, 2, \ldots, N$, and form objects $n_1, n_2, \ldots, n_N$ characterized by attributes $X_1, X_2, \ldots, X_K$, the histogram channels representing the frequencies of occurrence $f_j$ of the number of photons $l = j - 1$, $j = 1, 2, \ldots, K$, during a certain (short) time interval $\Delta t$. As a standard or reference sample, we use the experimental data of the monomer solution, and as a test sample, the data for the oligomeric forms of the protein.
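The histogram construction described above can be sketched as follows. This is a minimal illustration in Python rather than the authors' Matlab code; the function name `photon_counting_histogram`, the choice to normalize frequencies to unit sum, and the decision to drop counts beyond the last channel are all assumptions of the sketch.

```python
from collections import Counter

def photon_counting_histogram(counts, n_channels):
    """Relative frequencies f_j of observing l = j - 1 photons per time bin.

    counts: photon numbers registered in successive bins of width dt.
    n_channels: K, the number of histogram channels (l = 0 .. K - 1).
    Counts above K - 1 are dropped, which is an assumption of this sketch.
    """
    tally = Counter(c for c in counts if 0 <= c < n_channels)
    total = sum(tally.values())
    if total == 0:
        return [0.0] * n_channels
    return [tally.get(l, 0) / total for l in range(n_channels)]

# Hypothetical photon counts from ten consecutive time bins.
trace = [0, 1, 0, 2, 1, 0, 0, 3, 1, 0]
pch = photon_counting_histogram(trace, n_channels=4)
print(pch)
```

Each histogram produced this way is one object in the K-dimensional attribute space; repeating the calculation for N intensity records gives the N objects analyzed in the later stages.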
Data Dimensionality Reduction. PCA is applied to the datasets of the reference and test samples. PCA defines a linear transformation by which the initial data $X_1, X_2, \ldots, X_K$ are expressed by a set of principal components $Z_1, Z_2, \ldots, Z_K$, where the first $M$ principal components ($M \ll K$) provide the required fraction $\gamma$ of the variance of groups of attributes. In expanded form, the principal component $Z_j$ is expressed through the attribute vectors $X_1, X_2, \ldots, X_K$:

$$Z_j = a_{1j} X_1 + a_{2j} X_2 + \ldots + a_{Kj} X_K,$$

where $a_{ij}$ are the loading parameters of the principal components. The relative proportion of the scatter (%) attributable to the principal component $Z_j$ is

$$\alpha_j = \frac{D(Z_j)}{\sum_{k=1}^{K} D(Z_k)} \cdot 100\%,$$

where $D(Z_j)$ is the variance of the component $Z_j$. If the relative proportions of the scatter falling on the first principal component $Z_1$ are the same in the reference and the tested samples, then we assume that there are no oligomers and stop the algorithm. Otherwise, we accept that oligomers may be present and continue the algorithm.
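As an illustration of this stage, the relative proportion of the scatter carried by the first principal component can be estimated as the largest eigenvalue of the covariance matrix divided by its trace. The sketch below uses mean centering and plain power iteration in Python; the function name `first_component_share` and the iteration count are assumptions, and the text itself reports using Matlab's `eig` rather than this procedure.

```python
import random

def first_component_share(data, n_iter=500):
    """Percent of the total variance carried by the first principal component.

    data: list of N objects, each a list of K attribute values.
    The largest eigenvalue of the covariance matrix is found by power iteration.
    """
    n, k = len(data), len(data[0])
    means = [sum(row[j] for row in data) / n for j in range(k)]
    centered = [[row[j] - means[j] for j in range(k)] for row in data]
    cov = [[sum(r[a] * r[b] for r in centered) / (n - 1) for b in range(k)]
           for a in range(k)]
    total_var = sum(cov[j][j] for j in range(k))
    if total_var == 0:
        return 0.0
    rng = random.Random(0)
    v = [rng.random() + 0.1 for _ in range(k)]  # strictly positive start vector
    for _ in range(n_iter):
        w = [sum(cov[a][b] * v[b] for b in range(k)) for a in range(k)]
        norm = sum(x * x for x in w) ** 0.5
        if norm == 0:
            break
        v = [x / norm for x in w]
    cov_v = [sum(cov[a][b] * v[b] for b in range(k)) for a in range(k)]
    # Rayleigh quotient gives the eigenvalue estimate for the current vector.
    eigenvalue = sum(x * y for x, y in zip(v, cov_v)) / sum(x * x for x in v)
    return 100.0 * eigenvalue / total_var

# Points lying on one line: nearly all scatter falls on the first component.
elongated = [[0.0, 0.0], [1.0, 1.0], [2.0, 2.0]]
# An isotropic cloud: the scatter is shared equally between two components.
spherical = [[1.0, 0.0], [-1.0, 0.0], [0.0, 1.0], [0.0, -1.0]]
print(first_component_share(elongated), first_component_share(spherical))
```

Comparing the value for the reference sample with the value for the test sample then gives the stop-or-continue decision described above; what counts as "the same" is left to the analyst here, since the text does not specify a tolerance.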
Hierarchical Cluster Analysis of the Reference Sample (HCARS). A hierarchical cluster analysis of the histograms of the reference sample $n_1^R, n_2^R, \ldots, n_N^R$ is performed in the space of initial attributes. In this case, it is necessary to specify a method for comparing objects with each other (a measure of similarity, for example, the Euclidean, Minkowski, or correlation distance). In the developed method, to eliminate inter-experimental inhomogeneities associated with separate measurements of the reference and test samples, we propose to use the standardized Euclidean distance, which is invariant to inhomogeneity in the data [10]:

$$d(n_i, n_j) = \sqrt{\sum_{l=1}^{K} \frac{(x_{il} - x_{jl})^2}{\sigma_l^2}},$$

where $x_{il}$ and $x_{jl}$ are the coordinates of objects $n_i$ and $n_j$, and $\sigma_l^2$ is the variance of the attribute $X_l$. We determine the maximum connection distance (or threshold) $d_1$ on the dendrogram, at which the data are combined into one cluster. The maximum connection distance $d_1$ is used as a threshold for finding the number of oligomer clusters on the dendrogram for the test data.
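The standardized Euclidean distance used at this stage weights each squared coordinate difference by the inverse variance of the corresponding attribute. A minimal Python sketch follows; the helper names are hypothetical, and skipping channels with zero variance is an assumption made to avoid division by zero.

```python
def attribute_variances(data):
    """Sample variance of every attribute (histogram channel) over all objects."""
    n, k = len(data), len(data[0])
    means = [sum(row[j] for row in data) / n for j in range(k)]
    return [sum((row[j] - means[j]) ** 2 for row in data) / (n - 1)
            for j in range(k)]

def standardized_euclidean(x, y, variances):
    """Euclidean distance with each squared difference divided by the variance.

    Channels whose variance is zero are skipped (an assumption of this sketch).
    """
    return sum((a - b) ** 2 / v
               for a, b, v in zip(x, y, variances) if v > 0) ** 0.5

# Hypothetical objects with two attributes of very different scale.
objects = [[0.0, 10.0], [2.0, 10.0], [4.0, 40.0]]
variances = attribute_variances(objects)
print(standardized_euclidean(objects[0], objects[2], variances))
```

Because every attribute is rescaled by its own variance, a channel with large absolute values does not dominate the distance, which is the property the method relies on when the reference and test samples are measured separately.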

Hierarchical Cluster Analysis of the Test Sample (HCATS).
A hierarchical cluster analysis of the histograms of the tested sample $n_1^T, n_2^T, \ldots, n_N^T$ is performed in the space of initial attributes. Using the threshold $d_1$ found in the previous step of the algorithm, we select data clusters on the dendrogram. We assume that one cluster belongs to monomers and the other(s) to oligomeric forms.
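Selecting clusters with a threshold can be illustrated by a naive agglomerative procedure that keeps merging the two closest clusters until the next merge would exceed the threshold. The Python sketch below uses the Ward merge distance in the convention $\sqrt{2 n_A n_B / (n_A + n_B)}\,\lVert c_A - c_B \rVert$ for clusters with sizes $n_A$, $n_B$ and centroids $c_A$, $c_B$. It assumes the points have already been divided by the per-attribute standard deviation, so that plain Euclidean geometry matches the standardized distance. The function name and test points are hypothetical, and the cubic-time loop is meant for illustration only, not as a replacement for a library routine.

```python
def ward_clusters(points, threshold):
    """Merge clusters by the Ward criterion while the merge distance <= threshold.

    points: list of objects (lists of floats), assumed pre-standardized.
    Returns a list of clusters, each a sorted list of object indices.
    """
    k = len(points[0])
    clusters = [[i] for i in range(len(points))]

    def centroid(members):
        return [sum(points[i][j] for i in members) / len(members)
                for j in range(k)]

    def ward_distance(a, b):
        ca, cb = centroid(a), centroid(b)
        squared = sum((u - v) ** 2 for u, v in zip(ca, cb))
        return (2.0 * len(a) * len(b) / (len(a) + len(b)) * squared) ** 0.5

    while len(clusters) > 1:
        # Find the closest pair of clusters under the Ward merge distance.
        d, i, j = min((ward_distance(clusters[i], clusters[j]), i, j)
                      for i in range(len(clusters))
                      for j in range(i + 1, len(clusters)))
        if d > threshold:
            break  # the next merge lies above the cut on the dendrogram
        merged = clusters[i] + clusters[j]
        clusters = [c for idx, c in enumerate(clusters) if idx not in (i, j)]
        clusters.append(merged)
    return [sorted(c) for c in clusters]

# Two well-separated hypothetical groups of objects.
points = [[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0]]
print(sorted(ward_clusters(points, threshold=1.0)))
```

With a threshold taken from the reference sample, the number of clusters returned for the test sample is the quantity of interest: one cluster suggests monomers only, while two or more suggest additional oligomeric forms.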
Determination of Cluster Medoids. Clusters of monomers and oligomers are displayed on the scatter diagram of the first two principal components. The medoid of each cluster is calculated, and the resulting medoid PCHs form the datasets used to accurately determine the parameters of molecular compounds by the PCH and fluorescence intensity distribution analysis (FIDA) methods.
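A medoid, understood as the cluster member with the smallest average distance to the remaining members, can be found by direct search. The Python sketch below is illustrative; the function names are hypothetical, and the plain Euclidean distance is used only to keep the example short (any distance function, such as the standardized Euclidean one, can be passed in).

```python
def euclidean(x, y):
    """Plain Euclidean distance between two objects."""
    return sum((a - b) ** 2 for a, b in zip(x, y)) ** 0.5

def medoid(cluster, distance=euclidean):
    """Return the member with the smallest mean distance to the other members."""
    def mean_distance(i):
        others = [distance(cluster[i], cluster[j])
                  for j in range(len(cluster)) if j != i]
        return sum(others) / len(others) if others else 0.0
    best = min(range(len(cluster)), key=mean_distance)
    return cluster[best]

# Hypothetical cluster with one outlying object.
cluster = [[0.0, 0.0], [1.0, 0.0], [2.0, 0.0], [3.0, 0.0], [10.0, 0.0]]
print(medoid(cluster))
```

Unlike a centroid, the medoid is always one of the measured histograms, so it can be passed unchanged to a PCH or FIDA fitting routine.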
Materials and Methods. Consider simulated and experimental data. The simulated data make it possible to qualitatively assess the performance of the method and explore the limits of its application. The experimental data are used to confirm that the developed approach can, in principle, be applied to real problems of experimental research. A simulation model of the photocount flow with a given distribution of the number of photocounts is presented in [19]. The number of photons emitted by a molecule during the observation time $T$ is approximated by the Poisson distribution with the intensity $\lambda_f = \langle q \rangle B(\mathbf{r}) T$, where $\langle q \rangle$ is the brightness, or the average number of photons emitted by one molecule per unit of time; $B(\mathbf{r})$ is the exposure profile function; and $\mathbf{r}(x, y, z)$ is the radius vector of the molecule. A three-dimensional Gaussian distribution is used as the exposure profile function $B(\mathbf{r})$. The number of molecules in solution in a certain volume obeys the Poisson distribution with the parameter $\langle N_m \rangle V_0$, where $\langle N_m \rangle$ is the average number of molecules of the test sample per unit volume and $V_0$ is the exposure volume. For each molecule, the coordinates of its location in the volume $V_0$ (according to the uniform distribution law) and the number of emitted photons (according to the Poisson distribution with the intensity $\lambda_f$) are generated. If a mixture of molecules of different types is simulated, photon generation cycles are performed for each type of molecule. The generation cycle is repeated until enough photons have accumulated to form a PCH with a given signal-to-noise ratio.
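The generation cycle can be sketched in Python as below. This is not the simulation model of [19]; it is a simplified illustration under several assumptions: the molecules are placed uniformly in a cube of half-width `half_size`, the Gaussian exposure profile takes the form $\exp(-2(x^2 + y^2)/w_{xy}^2 - 2z^2/w_z^2)$, the Poisson intensity per molecule is the product of brightness, profile value, and bin duration, and a fixed number of bins is generated instead of iterating to a target signal-to-noise ratio. All names and default values are hypothetical.

```python
import math
import random

def poisson(lam, rng):
    """Poisson sample by Knuth's multiplication method (suitable for small lam)."""
    limit, k, p = math.exp(-lam), 0, 1.0
    while True:
        p *= rng.random()
        if p <= limit:
            return k
        k += 1

def simulate_bin(mean_molecules, brightness, bin_time, rng,
                 half_size=1.0, w_xy=0.3, w_z=1.0):
    """Photon count in one time bin for a single molecular species."""
    photons = 0
    for _ in range(poisson(mean_molecules, rng)):
        # Uniform position inside the cubic volume (assumed geometry).
        x, y, z = (rng.uniform(-half_size, half_size) for _ in range(3))
        profile = math.exp(-2.0 * (x * x + y * y) / w_xy ** 2
                           - 2.0 * z * z / w_z ** 2)
        photons += poisson(brightness * profile * bin_time, rng)
    return photons

def simulate_trace(n_bins, species, bin_time=1e-5, seed=0):
    """Photon counts per bin for a mixture; species = [(mean_molecules, brightness)]."""
    rng = random.Random(seed)
    return [sum(simulate_bin(m, q, bin_time, rng) for m, q in species)
            for _ in range(n_bins)]

# Hypothetical mixture: a dimmer species and one twice as bright.
trace = simulate_trace(200, [(2.0, 1e5), (1.0, 2e5)], seed=1)
print(sum(trace), max(trace))
```

Histogramming the counts of such a trace yields one simulated PCH; the scatter of the model parameters described in the text could be mimicked by drawing `brightness` and `mean_molecules` from a normal distribution before each call.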
To take into account the scattering of data, or "blurring" of PCH clusters, caused by various distortions, such as the presence of unremovable impurities that quench or stimulate the fluorescence of molecules, high background noise, flare, and degradation of dyes, we draw the model parameters from a normal distribution with a given mathematical expectation and standard deviation σ. Variation of σ makes it possible to control the scatter of data, or the blur of clusters of PCH curves, in a multidimensional space of time samples.
The simulated data make it possible to investigate the applicability of the developed method in the case of different separability of data clusters (varied by the parameter σ) corresponding to protein compounds. The data representing the GFP protein in the buffer solution and the cell lysate are experimentally confirmed and make it possible to check the efficiency of the method using examples of real model data. A mixture of monomeric and dimeric forms of the GFP protein is an example of a dataset specifically containing various forms of protein aggregation. Assuming that molecules of the same type were predominantly found in the observation volume, the PCHs of the experimental samples were constructed over a time interval of $5 \cdot 10^{-2}$ s or less in one measurement of fluorescence intensity fluctuations with a duration of 120 s.
The algorithms were implemented in the Matlab mathematical programming environment using the pdist, linkage, cluster, and eig functions, which integrate algorithms for hierarchical cluster analysis and PCA [21]. The hierarchical method of cluster analysis was used with the most common choices of distance (standardized Euclidean) and of cluster similarity measure (Ward) [13]. The data centering procedure is applied in the PCA. To assess the error ε of restoring the PCHs of various types of molecules, the ratio of incorrectly determined PCHs to the total number of PCHs (in %) was considered.
Results and Discussion. The results of the analysis of the simulated datasets using the algorithms of the integrated approach are shown in Fig. 2 and in Table 1. The analysis of the simulated data was carried out separately for monomers and dimers (Fig. 2a and 2b). The relative proportion of the scatter $\alpha_1$ for the first principal component is 54.6% for monomers and 58.8% for dimers, and the data clouds in the space of the principal components have a spherical Gaussian shape. The threshold value of the similarity measure at which the molecules form a single cluster, $d_1 = 15$, is a criterion for determining clusters of different molecular forms. The connection distance of the resulting clusters into one is <2, which indicates a significant similarity of the combined clusters. The application of the algorithms of the developed method to the analysis of the combined set of simulated data makes it possible to accurately determine the samples of monomeric and dimeric forms of proteins (error ε = 0). This is confirmed by the high relative fraction of the scatter falling on the first principal component, $\alpha_1 > 98\%$ (for monomers 54.6%), by the clear separability of data into two clusters in the space of the principal components $Z_1$ and $Z_2$ (Fig. 2c), and by the long connection distances of the resulting clusters into one (>50), which confirm the significance of the difference between clusters. It should be noted that the method works successfully under the conditions of the considered example of blurring and partial overlapping of data clusters (σ = 0.2, ε = 1.5%; Fig. 2d), which is typical for molecular systems such as a mixture of GFP monomers and dimers in a cell lysate. Samples of monomeric and dimeric forms of proteins were determined: the relative proportion of scatter is $\alpha_1 = 99\%$, the data form two clusters in the space of the principal components $Z_1$ and $Z_2$ (Fig. 2d), and the connection distance at which the resulting clusters merge into one is >30.
In the course of the study, together with the standardized Euclidean distance, three additional measures of similarity between objects that are invariant to data heterogeneity were considered: the Mahalanobis, correlation, and Spearman distances [9,13,14]. The best results were obtained for the standardized Euclidean and Mahalanobis distances. However, the Mahalanobis measure requires the computation of the covariance matrix of the input data, which can be costly when analyzing large datasets (N → ∞, K → ∞).
The results of the analysis of experimental datasets using the algorithms of the integrated approach are shown in Fig. 3 and in Table 1. Study of the data for the GFP protein in a buffer solution allows one to determine the threshold value of the similarity measure ($d_1 = 23$) at which the monomers form a single cluster, for use in the subsequent analysis of protein compounds (Fig. 3a). The connection distance of the resulting clusters into one (<5), the spherical shape of the data cloud in the space of the first two principal components (Fig. 3a), and the low relative proportion of the scatter falling on the first principal component, $\alpha_1 = 50.5\%$ (Table 1), qualitatively confirm the working hypothesis underlying the implemented method. Analysis of the combined experimental data of mGFP and diGFP proteins in cell lysates confirmed the presence of two forms of proteins corresponding to monomeric and dimeric forms (Fig. 3b): $\alpha_1 = 99.9\%$, the data form two clusters in the space of the principal components, and the connection distance of the resulting clusters into one is >40. Analysis of the experimental data of a mixture of mGFP and diGFP proteins in the cell lysate revealed the presence of two forms of protein oligomers. The relative proportion of the scatter $\alpha_1$ falling on the first principal component of the tested data, 93.6%, significantly exceeds the value of 50.5% obtained for monomeric forms of the GFP protein in a buffer solution. The connection distance at which the final cluster is formed is 40 (Fig. 3c); the data form two clusters in the space of the principal components; and, at 18, the connection distance of the resulting clusters into one significantly exceeds the value of 5 for GFP monomers. A value ≥23 should be taken as the threshold for determining the number of clusters of nonmonomeric forms.
At a connection distance of 23, two clusters formed by the majority of mGFP or diGFP molecules can be distinguished on the dendrogram of the tested data (Fig. 3c). Further evaluation of the parameters of protein complexes can be carried out by analyzing the medoids of the obtained PCH clusters using classical algorithms for analyzing fluorescence spectroscopy data [5,6]. Note that the monomers of the GFP protein form a spherical cluster of data in the space of the first two principal components (Fig. 3a), while an elongated ellipsoidal cloud, formed by clusters of monomers and dimers, is observed for a mixture of mGFP and diGFP (Fig. 3c).

Conclusions.
A method for the complex analysis of fluctuations of the fluorescence intensity of molecular compounds is proposed, which makes it possible to determine the structural composition of protein oligomers and complements the classical methods of PCH and FIDA analysis. The efficiency of the algorithms developed within the framework of the proposed method was confirmed during the analysis of simulated and experimental data representing the fluorescence of monomeric and dimeric forms of the GFP protein. The developed method has the following advantages over the classical method for analyzing data from fluorescence fluctuation spectroscopy: it improves the accuracy of data analysis, since it uses the entire data set rather than individual histograms; it provides computational performance due to the high speed of execution of the procedures of principal component analysis and cluster analysis in comparison with a separate analysis of the full set of histograms; and it provides the ability to visualize data in the space of the first two principal components, which is much more informative than a diagram of a complete set of initial histograms.