5
$\begingroup$

I have the following dataset: dataset

I am trying to fit this with a series of gaussians, without overfitting. I have tried the following,

g[x_, xo_, \[Sigma]_, a_] := a Exp[-((x - xo)^2/(2 \[Sigma]^2))]/(\[Sigma] Sqrt[2 \[Pi]]); kvar[k_Integer] := Through[{xo, \[Sigma], a}[k]]; gmodel[n_Integer] := Sum[g[x, Sequence @@ kvar[i]], {i, 1, n}]; gpars[n_Integer] := Flatten@Array[kvar, n]; fitg[data_, maxn_Integer] := MinimalBy[ Table[{#, #["AIC"]} &@ NonlinearModelFit[data, gmodel[n], gpars[n], x], {n, maxn}], Last][[1, 1]]; 

This does not work, as the peaks I am trying to fit are positioned nowhere close to the default starting point of 1. How do I implement a couple of "guesses" as to the positions of the peaks (i.e. a starting point for xo in the gaussian fitting).

Or is this not a good approach? Some background: this is an infrared spectrum, and this is only a part of the whole dataset.

$\endgroup$
3
  • 1
    $\begingroup$ related question here $\endgroup$ Commented Jul 8 at 18:53
  • 2
    $\begingroup$ Please give the results (estimates and a plot showing the fit) so that we know what you mean by "This does not work". Certainly plotting the data and choosing the locations of what appears to be 6 peaks and their heights as starting values will almost certainly produce a fit that matches the observed peaks. AIC is great for selecting models that will be used for future predictions. But if the object is just to identify peaks just for this dataset, you should start with 6 gaussians. $\endgroup$ Commented Jul 8 at 21:09
  • 1
    $\begingroup$ For the part of your dataset that is shown this doesn't matter too much but in general the parameters labeled as xo are not the peaks. The peaks and troughs are where the first derivatives are zero. $\endgroup$ Commented Jul 9 at 5:46

3 Answers 3

11
$\begingroup$

You can use PeakDetect to get an estimate of the peak positions. Using some of your code (I took the liberty to rename xo to x0):

g[x_, x0_, \[Sigma]_, a_] := a Exp[-((x - x0)^2/(2 \[Sigma]^2))]/(\[Sigma] Sqrt[2 \[Pi]]); kvar[k_Integer] := Through[{x0, \[Sigma], a}[k]]; gmodel[n_Integer] := Sum[g[x, Sequence @@ kvar[i]], {i, 1, n}]; {expectedPeakPositions, expectedPeakAmplitudes} = Transpose[data[[#]]] &@Flatten@Position[PeakDetect[data[[All, 2]]], 1] nPeaks = Length@expectedPeakPositions; (* Set the expected positions and amplitudes as best guesses *) params = Catenate[ MapThread[ Function[{X0, A, K}, {{x0[K], X0}, \[Sigma][K], {a[K], A}}], {expectedPeakPositions, expectedPeakAmplitudes, Range[nPeaks]}]] (* Fit *) fit = FindFit[data, gmodel[nPeaks], params, x] 

I don't think it is necessary to use NonlinearModelFit here, the plain FindFit should be sufficient.

Fit of data

$\endgroup$
1
  • $\begingroup$ Please add in your answer the code which produced the above plot. $\endgroup$ Commented Jul 17 at 10:58
3
$\begingroup$

It's good to use AIC (or AICc) to help protect against overfitting as you state in your question and Mathematica code.

The example data you've given at first glance appears to have 6 well defined peaks but a fit (good or bad) with 6 gaussians can have 6 or fewer peaks with a peak defined where a predicted value has a slope of zero and positive slope just to the left of the associated predictor and a negative slope just to the right of the associated predictor.

Because of the iterative nature of the fitting, good starting values are essential. The other answers using FindPeaks for starting values assume that the ideal number of peaks matches the number gaussians. That's not the case with your dataset. Using AICc, the best fit (using multiple gaussians) has 7 gaussians and 6 resulting peaks. So while FindPeaks is extremely helpful when the "best" model using AICc has the same number of gaussians as peaks, that isn't the case in general.

So one might consider setting the number of peaks (which I don't believe one can set directly in FindPeaks - other than to set the threshold parameter that matches what one sees by eye) and then looking at the data to set initial values.

The fit with 6 gaussians has an AICc value of -320.859 (and you'll need to use NonlinearModelFit to get that summary statistic). The need for one more than 6 gaussians is that one sees a "shoulder" (between x=160 and x=166) possibly requiring another gaussian. Here is that fit:

(* Initial guesses *) nGaussians = 7; c0 = {165, 168, 172, 177, 185, 192, 197}; (* Middle of each gaussian kernel *) b0 = {1, 1, 1, 1, 1, 1, 1}; (* Spread *) a0 = {0.11, 0.27, 0.83, 0.06, 0.19, 0.12, 0.17}; (* Height *) inits = Join[Table[{a[i], a0[[i]]}, {i, 1, nGaussians}], Table[{b[i], b0[[i]]}, {i, 1, nPeaks}], Table[{c[i], c0[[i]]}, {i, 1, nPeaks}]] nlm7 = NonlinearModelFit[data, Sum[a[i] Exp[-(x - c[i])^2/b[i]^2], {i, 1, nGaussians}], inits, x] nlm7["AICc"] (* -429.1 *) Show[ListPlot[data], Plot[nlm7[x], {x, Min[data[[All, 1]]], Max[data[[All, 1]]]}]] 

Data and fit with 7 gaussians

The AICc statistic is far lower for the 7-gaussian fit: a difference of 108.241. And values larger than 4 are considered important enough to take notice.

This fit has just 6 peaks and I assume you are interested in identifying the locations of the peaks. Here's code to do so:

(* Slope: we see 6 crossings at zero which identifies the locations of the peaks *) slope = D[nlm7[x], x]; Plot[slope, {x, 160, 200}] 

Slopes at each predictor value

(* Use above figure to obtain starting values *) peaks = x /. FindRoot[slope == 0, {{x, #}}] & /@ {167.9, 171.8, 177.1, 185, 192.2, 196.7} (* {167.899, 171.744, 177.142, 184.96, 192.146, 196.676} *) Show[ListPlot[data], Plot[nlm7[x], {x, 160, 200}], ListPlot[Table[{peaks[[i]], Sum[a[j] Exp[-(peaks[[i]] - c[j])^2/b[j]^2] /. nlm7["BestFitParameters"], {j, 1, 7}]}, {i, 1, 6}], PlotStyle -> Red]] 

Data and estimates of peak locations

The peaks of the fitted curve do not necessarily occur in general at the estimated values of the $c$ parameters (although with your data that is the case - except for the left-most gaussian where there is no peak directly associated with that gaussian).

The next step (which I don't have time right now to do) is to obtain confidence intervals for both the peak locations and the associated height: any estimate should always have a measure of precision.

$\endgroup$
3
  • 1
    $\begingroup$ Actually, 8 gaussians give a much better fit with an AICc value of -465.561 but with the same number of peaks (6) in only slightly different estimated positions. $\endgroup$ Commented Jul 9 at 23:54
  • 1
    $\begingroup$ Assuming that data contains information about spectral lines, measured with normally distributed errors, I would think that 8 gaussians would indicate 8 spectral lines, some of these are just very weak and masked by stronger ones. (I can be mistaken of course - it was very long time ago when I was involved in physics and also I have no idea about OP data source or experiment purposes.) $\endgroup$ Commented Jul 10 at 11:35
  • $\begingroup$ This is exactly what I was looking for! I have many datasets similar to this one that contain lots of shoulders and gaussians not associated with peaks! Thank you! $\endgroup$ Commented Jul 11 at 21:19
2
$\begingroup$

As a supplement to to @user136848's nice answer , we can look for more general peaks nearby the estimated positions.

par = Table[{a[i], b[i], c[i]}, {i, Length[expectedPeakPositions ]}] // Flatten; inipar = Flatten[ Transpose[{expectedPeakAmplitudes, ConstantArray[1, Length[expectedPeakAmplitudes]], expectedPeakPositions}]]; fit = NonlinearModelFit[data, Sum[ a[i] Exp[-(x - c[i] )^2/b[i ]] , {i, 1, Length@ expectedPeakPositions} ] , Transpose[{par, inipar}], x ] Show[Plot[fit[x], {x, Min[data[[All, 1]]], Max[data[[All, 1]]]}], ListPlot[data]] 

enter image description here

$\endgroup$

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.