Chemometrics in analytical chemistry – an overview of applications from 2014 to 2018

A compilation of papers published between 2014 and 2018 was evaluated. Many of the papers address multivariate calibration and classification, as well as design of experiments applications and artificial intelligence methods. Applications of chemometric techniques in medical and pharmaceutical, food, fuel, biological and forensic analysis are highlighted in this review. Most studies are related to developing methods for practical solutions in industry or routine analysis. The number of published papers shows a promising scenario: a total of 832 for this period using the keywords multivariate classification, multivariate calibration, analysis, chemometrics, prediction, analytical chemistry, artificial neural networks (ANN), design of experiments (DoE) and factorial design. This review presents a useful overview for Analytical Chemistry researchers working with Chemometrics.


Introduction
With the continuous technological progress of instrumental techniques for analytical purposes, multivariate methods applied to chemical data have become mandatory in several applications. Chemometrics is a prominent area dedicated to developing multivariate strategies for chemical data evaluation and interpretation 1,2 . Figure 1 shows the main subjects related to chemometric techniques according to the most frequent words in a bibliographic search performed in the Science Citation Index Expanded (SCI-E) in Clarivate Analytics' ISI Web of Science © , on June 21, 2018.
The main topics revealed by the search can be seen in the color clusters of Figure 1, as follows: green symbols denote classification and educational purposes (with "hands-on-learning" and "laboratory instruction" as keywords); light blue circles denote chemistry and separation techniques (chromatography); purple circles represent sophisticated instrumentation such as mass spectrometry, metabolomics, and principal component analysis (PCA); dark blue symbols are related to regression and quality control; red symbols are linked to calibration models and spectral techniques; and process analytical technology (PAT) and identification/differentiation issues appear as orange and yellow circles, respectively. In the center of the clustering, chemometrics is surrounded by all those words and connected to analytical chemistry.
A total of 832 occurrences were found by the authors during the search. Most are research papers, followed by a smaller number of reviews and proceedings. Of this total, multivariate classification as a topic accounts for 390 papers, multivariate calibration for 209, design of experiments (DoE) for 136 and artificial neural networks (ANN) for 97. It is clear that chemometrics and analytical chemistry are growing together, and the goal of this review was to make a concise compilation of representative studies in the area between 2014 and 2018. Several topics are discussed using key papers mentioned throughout the text.

Medical and pharmaceutical applications
The use of chemometrics to study multifunctional indole alkaloids from Psychotria nemorosa (Palicourea comb. nov.) was reported by Júnior et al. 3 . PCA, partial least squares (PLS) and orthogonal PLS1 (O-PLS1) were helpful for modelling the activities using ultra-high-performance liquid chromatography-diode array detector (UPLC-DAD) data as fingerprints 3 . Ultraviolet spectral data from monoclonal antibodies compounded in a hospital pharmacy were evaluated to build models using PLS-discriminant analysis (PLS-DA). The challenge was to identify the monoclonal antibodies after compounding and just before administration to the patient, for quality control at the hospital 4 . Linear sweep voltammograms generated by integrating three commercial screen-printed electrodes into a voltammetric electronic tongue were combined with PLS to predict concentrations of cysteine, glutathione and homocysteine. The authors applied a genetic algorithm and interval PLS (i-PLS) for variable selection 5 . It was possible to monitor the extraction of indole alkaloids from toad skin 6 and to quantify geniposide in Gardenia jasminoides fruit (a Chinese medicinal herb) 7 using near infrared (NIR) spectroscopy and PLS regression (PLSR). The toad skin extraction model showed a coefficient of determination (R²) of 0.99 and a root mean square error of cross validation (RMSECV) of 8.26 mg mL⁻¹ 6 , while for geniposide in Gardenia the correlation values were between 0.92 and 0.99 and the RMSECV values between 0.32 and 1.66 mg mL⁻¹ 7 . Fourier transform infrared (FTIR) spectroscopy in combination with PLSR was used to evaluate the anti-inflammatory properties of honeysuckle extracts 8 .
Junior et al. 9 used NIR to compare two multivariate calibration techniques, PLSR and multiple linear regression (MLR), for the quality control of tablets, and the PLSR model presented better results than MLR. Raman spectral imaging was also used to predict active pharmaceutical ingredient concentration in microtablets 10 , in oral suspensions 11 , in intravaginal rings 12 , and in the quantitative assessment of pharmaceutical tablets 13 . The PLSR and least squares-support vector machine (LS-SVM) models exhibited excellent prediction abilities for the active pharmaceutical ingredient in microtablets, with a correlation higher than 0.98 16 , besides quantifying the presence of melamine at levels as low as 1.0% (w/w).
NIR spectroscopy and PLSR were used to determine protein nitrogen in yellow fever vaccine 17 . Haghighi, Hemmateenejad and Shamsipur 18 , using fluorescence spectroscopy and multivariate curve resolution-alternating least squares (MCR-ALS), were able to predict the enantiomeric excess of some amino acids in frozen plasma with root mean square error of prediction (RMSEP) values lower than 0.07% and R higher than 0.97. PLS was applied to NIR spectra to analyze an indicator of β-thalassemia in human hemolysate samples 19 . FTIR and a multivariate calibration method were used to identify three groups of clinical bacterial pathogens in human serum 20 . Both models, PCA and PLS-DA, proved to be feasible tools for discriminating the samples, with 100% accuracy 20 .
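The figures of merit quoted in this section (R², RMSECV, RMSEP) can be illustrated with a minimal sketch. It assumes a univariate straight-line calibration as a stand-in for PLSR and uses leave-one-out cross validation; the function names and the signal/concentration values are hypothetical and are not taken from any of the cited studies.

```python
import math

def fit_line(x, y):
    """Ordinary least squares for y = a + b*x (univariate stand-in for PLSR)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope

def loo_predictions(x, y):
    """Leave-one-out cross validation: predict each sample from all the others."""
    predictions = []
    for i in range(len(x)):
        x_train = x[:i] + x[i + 1:]
        y_train = y[:i] + y[i + 1:]
        intercept, slope = fit_line(x_train, y_train)
        predictions.append(intercept + slope * x[i])
    return predictions

def rmse(y_true, y_pred):
    """Root mean square error; called RMSECV when y_pred comes from cross validation."""
    squared = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared) / len(squared))

def r_squared(y_true, y_pred):
    """Coefficient of determination, 1 - SS_res / SS_tot."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot

# Hypothetical calibration data: instrument signal versus reference concentration.
signal = [0.10, 0.21, 0.29, 0.42, 0.50, 0.61]
conc = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]

cv_pred = loo_predictions(signal, conc)
rmsecv = rmse(conc, cv_pred)
r2_cv = r_squared(conc, cv_pred)
print(f"RMSECV = {rmsecv:.3f}, R2 (cross validation) = {r2_cv:.3f}")
```

The same `rmse` function gives RMSEP when the predictions come from an independent test set instead of cross validation. RMSECV carries the units of the reference values, which is why the cited studies report it in mg mL⁻¹.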

Food analysis
NIR and PLSR were used to measure both the physicochemical 21 and the antioxidant properties of honey 22 , as well as to detect and quantify adulterants in honey samples from stingless bees 23 with relative errors from 0.58 to 1.49%. Three-dimensional fluorescence spectroscopy combined with multivariate calibration was also used for detecting the presence of adulterants in honey 24 .
A near infrared hyperspectral imaging system was used for predicting viability and vigor in muskmelon seeds using PLS-DA. Variable selection methods, such as variable importance in projection (VIP), selectivity ratio (SR) and significance multivariate correlation (sMC), were also applied to select the most effective wavelengths 25 .
In order to determine oxidation parameters in frying canola oil, Talpur et al. 26  FTIR was used as a tool to determine the total sugar content in soy-based drinks 33 and to measure the concentration of nutrients in grapevine petioles 34 . For the soy-based drinks, two models were applied to different spectral bands, (1) iPLS and (2) synergy interval PLS (siPLS), and both presented a high correlation coefficient (R) of 0.98.
Milk samples were evaluated using NIR and PLS-DA to detect contamination by Salmonella 35 . Mid-infrared (MIR) spectroscopy and PLSR were applied to determine tetracycline residues in cow's milk 36 and to detect and quantify adulteration in milk powder 37 . Augusto et al. used laser-induced breakdown spectroscopy (LIBS) data and a PLSR model for the direct determination of mineral elements, such as Ca, K and Mg, in powdered milk and dietary supplements 38 .

Fuel
Data acquired using C-13 nuclear magnetic resonance (NMR) spectroscopy were evaluated using support vector regression (SVR) with variable selection by genetic algorithm (GA) to determine the contents of saturated, aromatic, and polar compounds in crude oil. The use of small amounts of sample was highlighted by the authors 39 . Sulfur content in petroleum derivatives was determined using MIR spectroscopy and PLS in association with variable selection methods, among them iPLS, siPLS, uninformative variable elimination (UVE), and GA 40 . UV spectroscopy was integrated with different chemometric strategies to quantify the hydrocarbon contents in fuel oil samples 41 .
In order to quantify the presence of adulterants in gasoline, Raman spectroscopy was used, and calibration models were constructed using PLSR, iPLS and PLS-GA, with correlations between 0.80 and 0.99 42 . Gas chromatography data were used to build a PLSR model for quality assessment of gasoline, and low errors from 0.005 to 0.010% were obtained 43 . Dadson, Pandam and Asiedu evaluated adulterants in gasoline, such as kerosene, diesel and naphtha, using a multivariate model with FTIR data and PLSR 44 . Mabood et al. 45 also applied a PLSR model to detect and estimate gasoline adulteration by NIR spectroscopy.

Biological samples
Neves et al. 46 used mass spectrometry coupled with PCA, GA with support vector machine (SVM), linear discriminant analysis (LDA) and quadratic discriminant analysis (QDA) as an untargeted lipidomic approach to classify 76 blood plasma samples as negative for intraepithelial lesion or malignancy or as squamous intraepithelial lesion. The PCA-SVM models provided the best classification results, achieving 80% sensitivity and 83% specificity. Santos et al. 47 proposed a method to determine 11 polycyclic aromatic hydrocarbons (PAHs) in urine samples based on the coupling of a programmed temperature vaporizer (PTV) with a quadrupole mass spectrometer (qMS) via deactivated fused silica tubing. The authors used PLS-DA, LDA, soft independent modeling of class analogy (SIMCA) and SVM to classify the samples according to the presence or absence of the PAHs. Gilany et al. 48 investigated untargeted metabolomic profiling of the seminal plasma of men with non-obstructive azoospermia using gas chromatography-mass spectrometry (GC-MS), applying QDA to the total ion chromatograms to identify discriminatory retention times. The receiver operating characteristic (ROC) curves for these classification models presented 88% accuracy for the discrimination of 36 metabolites that may be considered discriminatory biomarkers for different groups in non-obstructive azoospermia. Boll et al. 49 developed a method using attenuated total reflection (ATR) FTIR spectroscopy combined with a PLS-DA model that allowed discrimination between dyed and non-dyed hair in 380 hair samples. Monakhova, Diehl and Fareed 50 used PCA, factor discriminant analysis (FDA), PLS-DA and LDA combined with high resolution (600 MHz) NMR spectroscopy data to distinguish 102 authentic samples of heparin and low-molecular weight heparins produced from porcine, bovine and ovine mucosal tissues, as well as their blends.
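Sensitivity, specificity and accuracy, the figures used above to compare classification models, follow directly from the counts of a two-class confusion matrix. The sketch below is a minimal illustration; the function name, the class labels and the toy predictions are hypothetical and are not the data of any cited study.

```python
def classification_metrics(y_true, y_pred, positive):
    """Sensitivity, specificity and accuracy for a two-class problem.

    `positive` is the label treated as the positive class; every other
    label is counted as negative.
    """
    tp = fn = tn = fp = 0
    for truth, guess in zip(y_true, y_pred):
        if truth == positive:
            if guess == positive:
                tp += 1  # positive sample correctly flagged
            else:
                fn += 1  # positive sample missed
        else:
            if guess == positive:
                fp += 1  # negative sample wrongly flagged
            else:
                tn += 1  # negative sample correctly rejected
    return {
        "sensitivity": tp / (tp + fn) if (tp + fn) else float("nan"),
        "specificity": tn / (tn + fp) if (tn + fp) else float("nan"),
        "accuracy": (tp + tn) / len(y_true),
    }

# Toy example: 5 positive and 6 negative samples (hypothetical labels).
y_true = ["lesion"] * 5 + ["negative"] * 6
y_pred = ["lesion"] * 4 + ["negative"] + ["negative"] * 5 + ["lesion"]

metrics = classification_metrics(y_true, y_pred, positive="lesion")
print(metrics)
```

Sensitivity is the fraction of true positives that the model recovers and specificity is the fraction of true negatives it correctly rejects, so a single accuracy value can hide a poor result for the smaller class.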
proposed a method to discriminate 38 extra-virgin olive oils from seven regions along the Italian coasts based on the isotopic composition and the carotenoid contents. The authors used isotope ratio mass spectrometry (IRMS) and resonant Raman spectroscopy (RRS) to determine the isotopic composition and the carotenoid contents, respectively. The LDA model presented 50-100% correct classification for the seven regions, achieving 82% accuracy for the validation set.

Food analysis
Márquez et al. 56 used NIR and FT-Raman data to perform mid- and high-level data fusion with a SIMCA model for classifying unadulterated and adulterated hazelnut samples. Rodrigues Júnior et al. 57 applied PLS-DA for discrimination of lactose-free samples and classification of adulterated and unadulterated milk powder samples by FT-Raman spectroscopy. Wakholi et al. 58 used short-wave IR (SWIR) hyperspectral imaging coupled with LDA, PLS-DA and SVM models to classify corn seeds as viable (control group) or nonviable (microwave treated). The authors used a total of 200 samples of each type of corn: white, yellow and purple. Sandasi et al. 59 applied a fast and non-destructive method for the quality control of herbal tea blends. The authors used hyperspectral images obtained from SWIR combined with PLS-DA to classify herbal tea blends of the species Sceletium tortuosum and Cyclopia genistoides.
Kimuli et al. 60 also used a SWIR hyperspectral imaging system (1000-2500 nm) to detect aflatoxin B1, combined with PCA, PLS-DA and FDA techniques to explore and classify maize kernels of four varieties from different states of the USA. Cortés et al. 61 discriminated two nectarine varieties in different spectral ranges of NIR and Vis-NIR using PLS-DA and LDA models. Shrestha et al. 62 used NIR hyperspectral imaging data from 975 to 2500 nm and PLS-DA to investigate seed quality parameters such as year of production and variety in tomato seed lots. Santos, Pereira-Filho and Colnago 63 applied ¹H time domain nuclear magnetic resonance (TD-NMR) combined with multivariate analysis to identify and quantify adulteration in 78 samples of milk using whey, urea, hydrogen peroxide and synthetic material (urine or milk).

Forensic purposes
Chen et al. 64 used NIR combined with hierarchical cluster analysis (HCA), PLS-DA and SVM to authenticate stamps of 12 seals on a Chinese traditional painting. The accuracies obtained on the validation set were 93% and 100% for the PLS-DA and SVM models, respectively. Oliveira et al. 65

Fuels
Silva et al. 68 developed a method to classify gasoline according to its origin (Brazil, Venezuela and Peru), using IR spectroscopy and multivariate classification. A set of 126 gasoline samples: 56 Brazilian, 66 Venezuelan, and 4 Peruvian, was analyzed. The spectra were standardized using the direct standardization method achieving 100% of correct predictions. In another study, Silva et al. 69 proposed the use of IR spectroscopy and LDA and PLS-DA for classification of gasoline with or without dispersant and detergent additives in 125 samples. The authors selected the variables for the multivariate models using stepwise (SW), successive projections algorithm (SPA) and GA algorithms. Sinkov, Sandercock and Harynuk 70 developed PLS-DA and SIMCA models for classifying the levels of gasoline in casework arson samples based on GC-MS data.

Artificial intelligence methods for multiple questions
An interesting approach for ANN is to handle the samples more efficiently to achieve streamlined processes. Kovalishyn and Poda 71 created a new variable selection method for ANN known as the batch pruning algorithm (BPA), which is faster and more efficient than traditional methods. ANN was used for Sobol sensitivity estimation 72 , with the advantage of reduced computational cost. Bian et al. 73 combined the advantages of linear and non-linear methods in a novel algorithm called the extreme learning machine (ELM) and compared it with principal component regression (PCR), PLS, SVR and back propagation algorithm-ANN (BP-ANN) on three NIR spectral datasets (diesel fuel, ternary mixture and blood), showing that ELM presented the best performance in the spectral quantitative analysis of complex samples.
NIR spectral data acquired for beers from three types of fermentation were used to evaluate foam and color-related parameters 74 , using a robotic pourer and chemical fingerprinting. Results from NIR were used to create PLS and ANN models to predict four parameters: pH, alcohol, Brix and maximum volume of foam. The ANN was implemented using the Levenberg-Marquardt training algorithm and produced more accurate models than PLS 74 . Oliveira et al. 75 determined protein concentration, foam stability, haze, color, total acidity (TA), alcohol content, and bitterness in Ale beers using UV-Vis spectra in combination with PCR and also with ANN models.
Many authors also compared ANN with other chemometric methods. Das et al. 76 modeled variations in sucrose, reducing sugar and total sugar content due to water-deficit stress in rice leaves using Vis, NIR and SWIR spectroscopies. They tested the following multivariate techniques: ANN feed-forward model, multivariate adaptive regression splines (MARS), MLR, PLSR, random forest regression and SVM. The variables that affect the performance of ANNs and PLS for spectral interference correction are random noise level, intensity ratio, peak separation and wavelength shift. The results showed that ANNs and PLS are about equally effective for spectral interference correction 77,78 . A portable artificial olfaction system for real-time monitoring of black tea fermentation was developed, based on the combination of k-nearest neighbors (kNN) and adaptive boosting, namely kNN-AdaBoost, with a discrimination rate of 100% correct predictions 79 . The classification of orange growing locations based on NIR spectra using data mining was evaluated 80 . The experimental results showed that the juice NIR spectra are the most suitable dataset for identifying the orange growing locations, and that the decision tree is the best and most stable classifier 80 , with an average prediction rate of 97%.
In relation to herbal products, Wang et al. 81 used LIBS combined with PCA and BP-ANN to classify Chinese herbal medicines, with 99.9% classification accuracy for three types of herbal products: roots of Angelica pubescens, Codonopsis pilosula, and Ligusticum wallichii. The authors established 82 quality control markers for Chinese herbal medicines using BP-ANN; and the study of Ding et al. 83 developed a method to improve the markers to Q-markers in Chinese herbal medicine quality management, using PLS-DA for screening analysis of the chemical markers and identification of herbal origin. The BP-ANN algorithm was used to clarify the non-linear relationship between the Q-markers and their integral anti-inflammation effect. Still in relation to herbal medicines, Ito et al. 84 developed a model based on an ANN to quantify proteolytic and amylolytic enzymes using UV-Vis spectra of diluted samples from a particular solid-state fermentation in wheat bran, soybean meal, type II wheat flour and sugarcane bagasse.
Li et al. 85 combined GA with ANN to determine the elements copper and vanadium in steel samples with satisfactory quantitative results. Zhang et al. 86 developed a method combining GA, PCA and ANN to select spectral segments from the original spectra to improve LIBS performance, and showed that using only an appropriate fixed-length segment provides better results than using the entire spectral range.
Gurbanov, Bilgin and Severcan 87 used ANN to evaluate secondary structural variations in diabetic kidney cell membrane proteins based on ATR-FTIR spectroscopy, analyzing the effects of selenium on diabetes. Hasanjani and Sohrabi 88 used UV-Vis spectroscopy and a back-propagation ANN to predict fluoxetine and sertraline in tablets. Guo et al. 89 developed a kinetic spectrofluorometric method for the analysis of sibutramine, indapamide, and hydrochlorothiazide, which are very common in weight-reducing health foods; the data of the mixtures were processed by parallel factor analysis (PARAFAC), PLS, PCR and radial basis function-artificial neural network (RBF-ANN).

Design of Experiments (DoE) and response surface methodology (RSM)
Before the application of any analytical method, it is necessary to optimize some conditions, named variables. These variables can be related to the sample preparation procedure, such as volumes of reagents, pressure and temperature, or to instrumental parameters, such as wavelength and power, among others. The goal in most cases is to obtain a condition with high analytical signal, signal-to-noise ratio (SNR), and signal-to-background ratio (SBR). In addition, analysts may also be interested in analytical methods with low relative standard deviation, low limits of detection and quantification, and reduced cost per determination. In order to achieve these goals, DoE can be applied to almost any type of problem 90-92 .
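The responses named above can be computed from a few replicate readings. The sketch below is only an illustration under stated assumptions: SBR is taken as net signal over background, SNR as net signal over the standard deviation of the background, and the limits of detection and quantification as 3 and 10 times the blank standard deviation divided by the calibration slope. Other conventions exist, and all function names and readings are hypothetical.

```python
import statistics

def relative_standard_deviation(replicates):
    """RSD in percent: 100 * sample standard deviation / mean."""
    return 100.0 * statistics.stdev(replicates) / statistics.mean(replicates)

def signal_to_background(peak, background):
    """SBR, assumed here to be net signal divided by background."""
    return (peak - background) / background

def signal_to_noise(peak, background_readings):
    """SNR, assumed here to be net signal over the background standard deviation."""
    net = peak - statistics.mean(background_readings)
    return net / statistics.stdev(background_readings)

def detection_limit(blank_readings, slope, k=3.0):
    """k * s_blank / slope; k = 3 for detection, k = 10 for quantification."""
    return k * statistics.stdev(blank_readings) / slope

# Hypothetical readings in arbitrary intensity units.
background = [19.0, 21.0, 20.0]
peak = 120.0
slope = 50.0  # assumed calibration sensitivity, intensity per mg/L

print("RSD (%):", relative_standard_deviation([98.0, 101.0, 101.0]))
print("SBR:", signal_to_background(peak, statistics.mean(background)))
print("SNR:", signal_to_noise(peak, background))
print("LOD (mg/L):", detection_limit(background, slope))
print("LOQ (mg/L):", detection_limit(background, slope, k=10.0))
```

In an optimization study, any of these values can serve as the response recorded for each experimental run.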
DoE can be defined as a group of tools that applies statistical and mathematical knowledge to optimize a system. Several strategies can be used, such as central composite design (CCD) 93 , Doehlert design (DD) 94 and Box-Behnken design (BBD) 95 . If the goal is only to identify the most important variables, full factorial 96 , fractional factorial 97 or Plackett-Burman 98 designs can also be applied. Figure 2 shows the several possibilities related to the application of DoE in any type of problem. The applicability and advantages of DoE in analytical chemistry are well established, and hundreds of studies can be easily found in the scientific literature. In the time span selected for this review we found 125 papers using only the combination of the words "DoE", "Factorial Design", "Chemistry" and "Analytical chemistry". The scientific literature presents several reviews and tutorial reviews about the use of DoE 90,92 . For this reason, the fundamentals of DoE will not be discussed in this review, and only applications will be described and discussed. Table 1 shows the main applications observed and several remarks.
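As a brief illustration of two of the designs named above, the sketch below builds a two-level full factorial design and a central composite design in coded units and estimates main effects from a factorial experiment. It assumes a rotatable CCD built on a full factorial core, with axial distance equal to the fourth root of the number of factorial points. The function names, the variable ranges and the responses are hypothetical.

```python
from itertools import product

def full_factorial(k):
    """All 2**k combinations of coded levels -1 and +1 for k variables."""
    return [list(run) for run in product((-1.0, 1.0), repeat=k)]

def central_composite(k, n_center=3, alpha=None):
    """Factorial, axial and center points of a CCD, in coded units."""
    if alpha is None:
        alpha = (2 ** k) ** 0.25  # rotatable design with a full factorial core
    axial = []
    for i in range(k):
        for sign in (-1.0, 1.0):
            point = [0.0] * k
            point[i] = sign * alpha
            axial.append(point)
    center = [[0.0] * k for _ in range(n_center)]
    return full_factorial(k) + axial + center

def main_effects(design, responses):
    """Mean response at the high level minus mean response at the low level."""
    effects = []
    for j in range(len(design[0])):
        high = [y for run, y in zip(design, responses) if run[j] > 0]
        low = [y for run, y in zip(design, responses) if run[j] < 0]
        effects.append(sum(high) / len(high) - sum(low) / len(low))
    return effects

def decode(coded, low, high):
    """Convert a coded level to real units for a variable ranging from low to high."""
    return (high + low) / 2 + coded * (high - low) / 2

# Hypothetical 2**2 screening: variable 0 = temperature, variable 1 = reagent volume.
design = full_factorial(2)
responses = [10.0, 12.0, 20.0, 22.0]  # hypothetical analytical signals, one per run
effects = main_effects(design, responses)
print("Main effects:", effects)

ccd = central_composite(2, n_center=3)
print("CCD runs:", len(ccd))
print("Temperature at coded -1 for a 20-80 range:", decode(-1.0, 20.0, 80.0))
```

A screening step of this kind ranks the variables, and a response surface design such as the CCD is then usually run only on the variables with the largest effects.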

Conclusions
According to the authors' search for this review, most of the methods developed in combination with chemometrics are undoubtedly for food analysis. In general, the goal of the studies was to speed up the analysis and to reduce the number of steps in the analytical method. Fuel samples were also a prominent topic in analytical chemistry. Sophisticated techniques such as mass spectrometry and chromatography allied to chemometrics have been increasingly applied in the area. Classification techniques account for the majority of the papers, followed by multivariate calibration, DoE and ANN. Chemometrics and analytical chemistry are a powerful combination for improving the robustness, analytical frequency and practicality of methods.