ORIGINAL RESEARCH

Determining optimal ambient ionization mass spectrometry data pre-processing parameters in neurosurgery

About authors

1 Moscow Institute of Physics and Technology, Moscow, Russia

2 Semenov Federal Research Center for Chemical Physics of the Russian Academy of Sciences, Moscow, Russia

3 Skolkovo Institute of Science and Technology, Moscow, Russia

4 Siberian State Medical University, Tomsk, Russia

Correspondence should be addressed: Denis S. Zavorotnyuk
Institutskiy per., 9, str. 7, Dolgoprudny, Moscow Region, 141701; denis.zavorotnyuk@gmail.com

About paper

Funding: the study was performed within the framework of the state assignment of the Ministry of Science and Higher Education of the Russian Federation (agreement № 075-03-2022-107, project № 0714-2020-0006). The study involved the use of equipment of the Semenov Federal Research Center for Chemical Physics RAS.

Author contribution: Zavorotnyuk DS — data acquisition and interpretation, software development, manuscript writing and editing; Sorokin AA — study planning, data analysis and interpretation, manuscript editing; Bormotov DS — data acquisition and interpretation, manuscript writing; Eliferov VA — financial support of the experiment; Bocharov KV — data acquisition; Pekov SI — study planning, data analysis and interpretation, manuscript draft writing and manuscript text finalization; Popov IA — project management, financial support.

Compliance with ethical standards: the study was approved by the Ethics Committee of the Burdenko Research Institute of Neurosurgery (protocols № 40 dated 12 April 2016 and № 131 dated 17 July 2018) and conducted in accordance with the principles of the Declaration of Helsinki (2000) and its subsequent revisions. All patients provided informed consent to participate in the study and to the use of their biomaterial for scientific purposes.

Received: 2024-12-19 Accepted: 2024-03-03 Published online: 2024-04-27

Ambient ionization mass spectrometry is a promising method for improving the accuracy and completeness of glial tumor resection, since radical tumor removal is currently the most effective treatment for brain tumors [1]. The difficulty lies in identifying the tumor margins: resection must be complete enough to prevent relapse, yet not so extensive that it causes neuropathological sequelae [2]. The main universal methods for intraoperative control of the resected tumor margins still include positron emission tomography–computed tomography (PET-CT), magnetic resonance imaging (MRI), and histochemical analysis, since other methods, such as fluorescence staining, can be non-specific for certain diagnoses. However, these methods are time-consuming, and tomography is also expensive because it requires specially equipped surgical units [3].

Ambient ionization mass spectrometry (MS) makes it possible to quickly acquire data on the molecular composition of the sample [4–6]. However, the vast majority of current computational tools for mass spectrometry data are designed for spectra acquired by tandem MS coupled with gas or liquid chromatography. In such data, the number of peaks per scan is much smaller than in a scan obtained by ambient ionization MS [7, 8]. With ambient ionization MS, the simplicity of sample preparation and the speed of analysis make it possible to acquire far more complex mass spectra, i.e., large amounts of data within minutes. The analysis of such data requires automated processing methods and complex analysis algorithms [9–11]; therefore, great attention should be paid to data quality control and pre-processing [12].

Mass spectrometry data are time-ordered sets of scans. Each scan represents the profile of the ion current intensities accumulated by the instrument over a certain time, ordered on the mass-to-charge ratio (m/z) scale. In the pre-processing phase, each scan has to be transformed into a set of intensities and m/z values of the detected peaks. Usually, this is achieved through such steps as normalization of intensity values, noise estimation and elimination, and peak position determination and alignment [13–15]. The great diversity of approaches to MS data processing suggests that these steps should be implemented with different parameters depending on the nature of the samples, the mass spectrometer design, the ion acquisition mode, and the type of further analysis.

The paper describes a method for determining mass spectra pre-processing parameters that unify mass spectrometry data for further automated analysis. The method is illustrated with experimental data obtained by mass spectrometry without sample preparation from human brain tumor tissue samples.

METHODS

The study involved mass spectrometry data acquired from brain tissue samples of patients diagnosed with glioblastoma and grade IV astrocytoma (according to the 2021 WHO classification [16]) and from non-neoplastic samples obtained during surgical treatment of drug-resistant epilepsy. A total of 307 tissue samples obtained from 74 patients were assessed. The data were acquired using the Thermo LTQ XL Orbitrap ETD mass spectrometer (Thermo Fisher Scientific; USA) with inline cartridge extraction [3, 17]. Each sample was divided into two parts. The first part was sent for standard histochemical analysis to obtain a medical record on the sample, while the remaining part was used to extract three fragments, about 1 mm³ each, for mass spectrometry analysis. The mass spectrometry protocol involved the analysis and detection of ions in eight different modes, each characterized by the polarity of the ions, the detector resolution, and the m/z range of the registered ions. Ion acquisition was performed twice in each mode.

The experimental data acquired were pre-processed using different values of the parameters described in the Results section. The pre-processing procedure involved peak intensity calibration, peak alignment relative to the scan showing maximum total ion current (TIC), reciprocal alignment of peaks among scans performed in the same mode of ion detection and filtration of rare and low-intensity peaks. Distinct scan sets were obtained for each ion detection mode. Each set of scans was transformed into the matrix of peak intensities used to train a classification model. When training the models, the matrix columns containing distributions of peak intensities across all scans of the appropriate mode were used as predictors, while the patients’ histological diagnoses were used as response. The mass spectrometry data acquired for brain tissue samples of 33 patients diagnosed with glioblastoma and seven patients diagnosed with non-neoplastic disorders were used to train and validate the models. The dataset available for each mode was divided into the training and validating groups in a ratio of 3 : 1, respectively; division was implemented in such a way that different scans of the same sample were present in both groups, to reduce model overfitting.

The data were analyzed on a computer running Ubuntu 16.04 with R v. 3.4.4 and the R packages MALDIquant [18], caret [19], glmnet [20], and ggplot2 [21]. For this purpose, the data received from the mass spectrometer were converted from the source Thermo Finnigan format to the open NetCDF format [22] using a software tool developed in-house [23].

RESULTS

In 2012, it was shown that the differences between mass spectra of tumors and non-neoplastic brain tissues could be used to construct classifiers for automatic recognition of cancerous tissues in biopsy samples [24]. Fig. 1 shows the peaks of two mass scans of tissue samples obtained from patients diagnosed with glioblastoma and non-neoplastic disorders.

The mass spectrometry data pre-processing procedure consists of several phases. In the first phase, noise is assessed and the signal-to-noise ratios are determined for all scans:

$$\mathrm{SNR} = \frac{I_s}{I_n},$$

where $I_s$ is the signal intensity and $I_n$ is the noise intensity. There are several methods to estimate the noise intensity of digital data, for example, the median absolute deviation (MAD) or regression with adaptive bandwidths (Super Smoother) [25]. In the subsequent phases, low-intensity peaks with signal-to-noise ratios below the specified SNR value are excluded from the spectrum. Positions of maxima within a scan may vary slightly due to variable environmental factors and random fluctuations. In the next phase, the profiles of different scans are aligned to compensate for such changes. The scan showing the maximum TIC is used as the reference, since it is assumed that this scan has the largest number of recorded ions and its profile comprises the largest number of distinct ion peaks. Every profile is shifted along the m/z axis to become as similar to the reference profile as possible. The maximum permissible shift is specified by the alignment tolerance (TA). Then peaks are detected: the scan profile is converted into a set of individual peaks. For this purpose, the entire profile is divided into several parts. The size of each part is determined by the half window size (HWS), the range of m/z points within which the point with the maximum intensity is sought. This point is designated as the peak in this part of the profile. Then positions of identical peaks are aligned across the entire set of scans. Here, peaks whose m/z values differ by no more than the tolerance specified for peak detection (TBP) are considered identical. In the final phase, rare peaks are removed, and the peaks of all scans are combined into a common matrix of intensities.
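
The noise estimation, peak detection, and SNR filtering steps can be illustrated with a minimal Python sketch. It is not the authors' implementation (the study used the R package MALDIquant). The function names are hypothetical. The sketch assumes a single constant noise level estimated with the scaled median absolute deviation, and a sliding window of ± HWS points rather than a fixed partition of the profile.

```python
from statistics import median


def mad_noise(intensities):
    """Constant noise estimate: scaled median absolute deviation (assumption)."""
    med = median(intensities)
    return 1.4826 * median(abs(x - med) for x in intensities)


def detect_peaks(mz, intensities, half_window_size, snr):
    """Return (m/z, intensity) pairs that are the maximum within
    +/- half_window_size points and exceed snr * noise."""
    noise = mad_noise(intensities)
    n = len(intensities)
    peaks = []
    for i in range(n):
        lo = max(0, i - half_window_size)
        hi = min(n, i + half_window_size + 1)
        is_local_max = intensities[i] == max(intensities[lo:hi])
        if is_local_max and intensities[i] > snr * noise:
            peaks.append((mz[i], intensities[i]))
    return peaks


# Toy profile: two clear signals on a low, fluctuating baseline.
profile_mz = [500.0 + 0.1 * k for k in range(13)]
profile_int = [1, 2, 1, 2, 50, 2, 1, 2, 1, 30, 1, 2, 1]
found = detect_peaks(profile_mz, profile_int, half_window_size=2, snr=2)
print([intensity for _, intensity in found])  # [50, 30]
```

In this toy profile the baseline ripples are local maxima as well, but they fall below the SNR threshold and are discarded, which is the role the SNR parameter plays in the text.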

Thus, mass spectrometry data pre-processing produces a matrix [26] in which the number of rows is determined by the number of scans obtained during the experiment, while the number of columns represents the combined number of peaks from all scans. It is clear that the above parameters (SNR, TA, HWS, and TBP) have a significant impact on the number of peaks in the matrix of intensities, and the question of which values these parameters should take in each particular ion acquisition mode is not trivial.
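
How the tolerance shapes the matrix can be shown with a small Python sketch that groups peaks from several scans into common columns. This is a simplified greedy binning, not the algorithm used in the study. The function name is hypothetical, and the tolerance is assumed to be relative (a fraction of m/z), which the text does not specify.

```python
def bin_peaks(scans, tolerance):
    """Group peaks from all scans into shared m/z bins.

    scans: list of scans, each a list of (mz, intensity) pairs.
    tolerance: assumed relative tolerance, e.g. 200e-6.
    Returns (bin_centers, matrix) with one row per scan, one column per bin.
    """
    tagged = sorted(
        (mz, intensity, scan_index)
        for scan_index, peaks in enumerate(scans)
        for mz, intensity in peaks
    )
    bins = []
    for item in tagged:
        if bins:
            center = sum(p[0] for p in bins[-1]) / len(bins[-1])
            if abs(item[0] - center) / center <= tolerance:
                bins[-1].append(item)  # same peak as the current bin
                continue
        bins.append([item])  # start a new column
    centers = [sum(p[0] for p in b) / len(b) for b in bins]
    matrix = [[0.0] * len(bins) for _ in scans]
    for col, members in enumerate(bins):
        for _, intensity, scan_index in members:
            matrix[scan_index][col] = max(matrix[scan_index][col], intensity)
    return centers, matrix


example_scans = [
    [(500.000, 10.0), (760.50, 5.0)],
    [(500.005, 12.0)],
    [(760.52, 7.0), (885.55, 3.0)],
]
_, wide = bin_peaks(example_scans, tolerance=200e-6)
_, narrow = bin_peaks(example_scans, tolerance=20e-6)
print(len(wide[0]), len(narrow[0]))  # 3 4
```

With the narrower tolerance, the peaks at 760.50 and 760.52 are no longer merged and occupy separate columns, so the same three scans yield a wider matrix.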

In classic problems of determining the model that best describes experimental data [27, 28], information criteria [29] are used, and the extreme values of these criteria correspond to the optimal values of the set of model construction criteria obtained with the regularization method. In our study, the minimum value of the classic Akaike information criterion (AIC) [30] was used to determine the optimal SNR value. Optimality of the other parameters (HWS, TA and TBP) was determined by manual evaluation of spectra processing quality.

SNR parameter

The optimal SNR value was determined using the Akaike criterion of the LASSO classification models. For this purpose, we formed combinations of SNR, TA and TBP values, pre-processed the mass spectra, constructed the matrix of intensities, and then trained a LASSO model using the matrix and the patients' diagnoses as the training data. Training of the models involved 5/10-fold cross-validation, and the best model was selected based on the Accuracy metric. The parameter combinations were formed from the following value sets:

SNR := {1.5, 2}

TA = TBP := {20, 200, 2000}

The combination of parameters, with which the resulting model had the lowest AIC value, was named optimal. The optimal parameter values are provided in tab. 1.
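
The selection rule can be sketched in Python under stated assumptions. The text does not say how the AIC was computed for the LASSO models. The sketch uses the standard definition, AIC = 2k − 2 ln L, with k taken as the number of non-zero coefficients and L as the binomial likelihood of the predicted class probabilities. All function names and the toy numbers are hypothetical.

```python
import math


def binomial_log_likelihood(y_true, p_pred, eps=1e-12):
    """Log-likelihood of binary labels given predicted probabilities."""
    total = 0.0
    for y, p in zip(y_true, p_pred):
        p = min(max(p, eps), 1.0 - eps)  # guard against log(0)
        total += y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total


def aic(y_true, p_pred, coefficients):
    """AIC = 2k - 2 ln L, with k = number of non-zero coefficients (assumption)."""
    k = sum(1 for c in coefficients if c != 0.0)
    return 2 * k - 2 * binomial_log_likelihood(y_true, p_pred)


def select_optimal(candidates):
    """candidates maps a parameter tuple to (y_true, p_pred, coefficients)."""
    return min(candidates, key=lambda params: aic(*candidates[params]))


labels = [1, 0, 1, 0]
candidates = {
    # (SNR, TA = TBP): sparse model with a slightly worse fit
    (2, 200): (labels, [0.9, 0.1, 0.8, 0.2], [0.5, 0.0, 0.0, -1.2]),
    # (SNR, TA = TBP): better fit, but many more non-zero coefficients
    (1.5, 20): (labels, [0.95, 0.05, 0.9, 0.1], [0.4, 0.3, -0.2, 0.1, -0.6]),
}
print(select_optimal(candidates))  # (2, 200)
```

The example shows the trade-off the criterion encodes: the second model fits the labels better, but the penalty for its additional coefficients outweighs the gain in likelihood.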

To prevent the emergence of negative noise intensities in the scan, 100 zeros were added to the set of points (m/z, intensity) on the left and on the right. As a result, the noise signal was evaluated over a broader range of m/z values with a constant number of significant peaks in the spectrum.
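
A minimal Python sketch of this padding step follows. The function name is hypothetical. The text does not state how the m/z values of the added points were chosen, so the sketch assumes they continue the grid with the spacing found at each edge of the scan.

```python
def pad_scan(mz, intensities, n_pad=100):
    """Add n_pad zero-intensity points on each side of a scan.

    The m/z values of the added points continue the grid using the
    spacing at each edge (assumption; not specified in the text).
    """
    step_left = mz[1] - mz[0]
    step_right = mz[-1] - mz[-2]
    left = [mz[0] - step_left * k for k in range(n_pad, 0, -1)]
    right = [mz[-1] + step_right * k for k in range(1, n_pad + 1)]
    padded_mz = left + list(mz) + right
    padded_int = [0.0] * n_pad + list(intensities) + [0.0] * n_pad
    return padded_mz, padded_int


scan_mz = [500.0, 500.5, 501.0]
scan_int = [3.0, 9.0, 4.0]
padded_mz, padded_int = pad_scan(scan_mz, scan_int)
print(len(padded_mz), padded_mz[0], padded_mz[-1])  # 203 450.0 551.0
```

The original points are unchanged, so the number of significant peaks stays the same while the noise estimator sees a wider m/z range.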

HWS, TA, TBP parameters

Optimality of the HWS, TA and TBP parameters was determined by manual evaluation of spectra processing quality. For this purpose, we developed the interactive Shiny application Mass-Spectrum Observer, which allows one to explore how the spectrum shape, the peak positions, and the characteristics of the intensity matrix of a given mass scan change as the values of these parameters change. The application source code is available from the GitHub repository [31], and a demo version is available from the open access library of Shiny applications [32]. Screen captures of the application are provided in fig. 2 and fig. 3.

The lists of possible HWS, TA and TBP values were determined, and the mass spectrometry data pre-processing procedure was applied to each combination of these values to obtain a separate matrix of intensities for each ion acquisition mode. The TBP parameter was proportional to the TA parameter, with three possible values of the proportionality coefficient. The lists of parameter values are provided in tab. 2.
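
The construction of such a parameter grid can be sketched in Python. The numeric values below are placeholders chosen for illustration only; they are not the values from tab. 2, and the function name is hypothetical.

```python
from itertools import product

# Placeholder values for illustration only (not the values from tab. 2).
HWS_VALUES = [5, 10, 20]
TA_VALUES = [20e-6, 200e-6]
TBP_COEFFICIENTS = [1, 2, 5]  # TBP = coefficient * TA


def parameter_grid(hws_values, ta_values, tbp_coefficients):
    """Enumerate every (HWS, TA, TBP) combination with TBP proportional to TA."""
    return [
        {"HWS": hws, "TA": ta, "TBP": coefficient * ta}
        for hws, ta, coefficient in product(hws_values, ta_values, tbp_coefficients)
    ]


grid = parameter_grid(HWS_VALUES, TA_VALUES, TBP_COEFFICIENTS)
print(len(grid))  # 18
```

Each dictionary in the grid would drive one pre-processing run per ion acquisition mode. Keeping the coefficients at 1 or above guarantees that TBP never falls below TA.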

The number of columns corresponding to the total number of peaks obtained from the mass scan profiles was determined for each matrix of intensities. Furthermore, when constructing the intensity matrix, we determined the number of peaks located close to each other in the resulting spectra. When the distance between peaks was smaller than two instrument resolutions for the given ion detection mode, the peaks were considered probable duplicates. Such peaks can emerge during conversion of the scan profiles into sets of individual peaks. For example, they can appear within the same scan at too low HWS values, when an intensity spike that is relatively broad on the m/z scale is represented by several spectral peaks. They can also appear in the scans of the same file at low TBP values, when the algorithm cannot compile the list of identical peaks from different scans. The duplicate peaks were determined within the same scan, in all scans of the same tissue sample used for mass spectrometry analysis, and among all peaks of the intensity matrix. Peak duplication was defined based on the mass spectrometer resolution in the given ion acquisition mode: the value of 800 at m/z = 400 was selected for the low-resolution mode, and the value of 30,000 at m/z = 400 was selected for the high-resolution mode.
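
The duplicate criterion can be illustrated with a short Python sketch. It assumes the usual definition of resolution, R = m/Δm, so that the resolvable width at m/z = 400 is Δm = 400/R (0.5 for R = 800 and about 0.013 for R = 30,000). Applying this width unchanged across the whole m/z range is a simplification made here, and the function names are hypothetical.

```python
def resolution_width(resolution, reference_mz=400.0):
    """Peak width implied by R = m / delta_m at the reference m/z."""
    return reference_mz / resolution


def count_probable_duplicates(peak_mz, resolution, reference_mz=400.0):
    """Count adjacent peak pairs closer than two resolution widths.

    The width is treated as constant over the m/z range (assumption).
    """
    limit = 2 * resolution_width(resolution, reference_mz)
    ordered = sorted(peak_mz)
    return sum(1 for a, b in zip(ordered, ordered[1:]) if b - a < limit)


peaks = [760.50, 760.52, 761.30, 885.55]
print(count_probable_duplicates(peaks, resolution=800))    # 2
print(count_probable_duplicates(peaks, resolution=30000))  # 1
```

The same peak list produces different duplicate counts in the two modes, which is why the criterion has to be tied to the resolution of the ion acquisition mode.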

The reference HWS, TA and TBP values that were later subjected to manual evaluation using Mass-Spectrum Observer were determined based on the changes in these four indicators in accordance with the processing parameters. The manual evaluation results are provided in tab. 3.

DISCUSSION

The findings show a close relationship between the ambient ionization mass spectrometry data pre-processing parameters and the quality of the acquired spectra. The SNR parameter makes it possible to reduce the number of peaks in the resulting spectrum. However, attention should be paid to negative estimates of noise signal values, which may occur in the border spectral regions as an artifact. When detecting peaks in the profile, the noise estimate is used to determine peak intensity in the given region of the profile, so negative noise can result in an excessive number of peaks in the spectrum. This may not matter much when ions are detected in a broad m/z range (for example, 120–2000), but may be significant for a narrow range such as 500–1000. In some cases, it is possible to eliminate such artifacts by fine-tuning the Super Smoother method (for example, by changing the degree of smoothing during approximation or by narrowing the profile region for which noise is estimated). However, these adjustments can yield different results for each particular mass scan; therefore, artificial expansion of the dataset was selected as a more robust way to eliminate negative values.

The HWS, TA and TBP values should be selected based primarily on the instrument resolution. Increasing the half window size during conversion of the profile into the intensity matrix eliminates artifact and duplicate peaks (fig. 4), but excessively high values of this parameter exclude significant peaks from the subsequent analysis (fig. 5). The values of peak position tolerance at alignment and detection are also closely related to the half window size and, therefore, to resolution, as well as to other mass spectrometer features resulting from the mass drift and the signal digitization methods. Furthermore, the TBP value should not be less than the TA value, since such a configuration always increases the average number of possible duplicate peaks. This is because the algorithm does not have enough tolerance for the shift of identical peaks in different scans to eliminate duplicate peaks, even after alignment of all scans relative to the scan with the highest ion current. It should also be noted that changing the width of the m/z range without changing the resolution and polarity of the detected ions has no significant effect on the parameter values, which is the expected result.

CONCLUSIONS

We developed a universal approach to determining the optimal parameter values for pre-processing of data acquired by ambient ionization MS. The approach was demonstrated on data acquired from human brain tissue samples using the Thermo LTQ XL Orbitrap ETD mass spectrometer. It can also be used to determine the optimal pre-processing parameter values for data acquired from samples of other types using other mass spectrometry equipment. The findings show that the mass spectrometry data processing parameters have to be adjusted thoroughly when ambient ionization MS is used in the clinic as a faster and more affordable alternative to conventional intraoperative monitoring methods. The parameters have to be determined with regard to the mass spectrometer and the research conditions. In particular, the SNR parameter, which determines the number of peaks in the resulting spectra, should be selected based on the assessed tissue type and ionization method, while the value of 1.5–2 can be considered the lower limit. When performing scan profile alignment and peak detection, the half window size (HWS) and the scan alignment tolerance (TA) should be selected in accordance with the resolution of the mass spectrometer used, and the tolerance for spectra peak alignment (TBP) should not be lower than the TA value. Both machine learning methods and manual evaluation of the quality of the acquired spectra can be used to choose the optimal values of these parameters from several options.
