
I have several ranking distributions and would, for each one, like to fit a [Zipf distribution][1], and estimate the goodness of fit relative to some standard benchmark.

With the MATLAB code below, I tried a sanity check: does a "textbook" Zipf rank distribution pass the statistical test? Clearly something is wrong, because it does not; if that doesn't pass, nothing will!

Using the Kolmogorov-Smirnov test, or the Anderson-Darling test with a custom-built (non-normal) distribution, in place of the chi-squared test does not change this.

```matlab
% Define some empirical frequency distribution
x = 1:10;
freq = 1000 ./ x.^1.5;        % textbook Zipf frequencies

% Define the Zipf distribution
alpha = 1.5;                  % Shape parameter, 1.5 is apparently a good all-round value to start with
N = sum(freq);                % Total number of observations
k = 1:length(x);              % Rank of each observation
zipf_dist = N ./ (k.^alpha);  % Compute the Zipf distribution

% Plot our empirical frequency distribution alongside the Zipf distribution
figure;
bar(x, freq);                 % or freq/N
hold on;
plot(x, zipf_dist, 'r--');
xlabel('Rank');
ylabel('Frequency');
legend('Observed', 'Zipf');

% Compute the goodness of fit using the chi-squared test
expected_freq = zipf_dist .* N;
chi_squared = sum((freq - expected_freq).^2 ./ expected_freq);
dof = length(freq) - 1;
p_value = 1 - chi2cdf(chi_squared, dof);

% Display the results
fprintf('Chi-squared statistic = %.4f\n', chi_squared);
fprintf('p-value = %.4f\n', p_value);
if p_value < 0.05
    fprintf('Conclusion: The data is not from a Zipf distribution.\n');
else
    fprintf('Conclusion: The data is from a Zipf distribution.\n');
end
```
  • This is no way to go about fitting, because the variability in y (presumably counts?) depends strongly on y. Please visit our posts about fitting Zipf distributions. Commented Mar 30, 2023 at 18:41
  • Thanks. I did check the older posts before asking, but I am still unclear. Also, y does indeed refer to counts, but I don't get what you mean when you say that the variability in y depends strongly on y. Aside from that, what exactly in my fitting approach is incorrect? Commented Mar 30, 2023 at 20:12
  • When the probability of something is $p$ and you take a sample of size $n,$ the count of that will be somewhere around $y = np$ and its standard deviation will be $\sqrt{np(1-p)},$ which is approximately proportional to $\sqrt y.$ Most curve-fitting methods (and I presume yours is one of them) assume the standard deviations of all the responses are the same. This makes a big difference! See my post at stats.stackexchange.com/a/493749/919 for an extended explanation. Commented Mar 30, 2023 at 20:25
  • Thanks! Although I think I understand the concept you are nicely explaining in that answer, I still don't understand why it's a conceptual no-go to try and see how well a Zipf distribution fits a vector of numbers. I also (still) don't understand how this can be done in MATLAB (fitdist doesn't seem to have Zipf among its preset distributions). Commented Mar 31, 2023 at 14:56
  • Disappointingly, the bounty I have set for this question produced no further answers (not even comments), and so I am still unclear as to my question! :-( Commented Apr 18, 2023 at 10:17
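
The comments above point away from generic curve fitting on raw frequencies (whose variance grows with the expected count) and toward a likelihood-based fit, and the last comments note that fitdist has no Zipf preset. For reference, a Zipf exponent can be estimated by maximum likelihood and the fit checked with a chi-squared test on expected counts built from normalised probabilities. The sketch below is an illustration only, under assumptions made here (the counts are invented and the support is taken to be the ranks 1..K); it is not code from the question or the comments.

```matlab
% Illustrative only: hypothetical counts per rank, not data from the question.
freq = [120 55 38 27 22 18 15 13 12 10];   % observed counts for ranks 1..10
k    = 1:numel(freq);                      % ranks
N    = sum(freq);                          % total number of observations

% Negative log-likelihood of a Zipf(alpha) model on ranks 1..K
negloglik = @(a) a * sum(freq .* log(k)) + N * log(sum(k.^(-a)));

% Maximum-likelihood estimate of the exponent (search over a plausible range)
alpha_hat = fminbnd(negloglik, 0.01, 10);

% Expected counts under the fitted Zipf model (probabilities must sum to 1)
p_k = k.^(-alpha_hat) ./ sum(k.^(-alpha_hat));
expected = N * p_k;

% Chi-squared goodness of fit, losing one extra degree of freedom for the estimated exponent
chi2_stat = sum((freq - expected).^2 ./ expected);
dof       = numel(freq) - 1 - 1;
p_value   = 1 - chi2cdf(chi2_stat, dof);

fprintf('alpha_hat = %.3f, chi2 = %.3f, p = %.3f\n', alpha_hat, chi2_stat, p_value);
```

Note that the expected counts here are N times normalised probabilities, which is also where the chi-squared attempt in the question differs: there zipf_dist is not normalised into probabilities before being multiplied by N.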

1 Answer


This is a way of doing it with the Kolmogorov-Smirnov test. To plug in your actual data you'd have to change y_obs. I think this is what you wanted, but please let me know if you have any questions.

```matlab
% Define the Zipf distribution
N = 100;
alpha = 1;                % Shape parameter, 1.5 is apparently a good all-round value to start with
x = 1:N-1;
y_exp = 1./(x.^alpha);    % A correct Zipf distribution

% Create a noisy signal
rng(42)
noise = normrnd(0, 0.01, [1, N-1]);
y_obs = y_exp + noise;

% Plot our empirical frequency distribution alongside the Zipf distribution
plot(x, y_obs, 'r--')
hold on
plot(x, y_exp, 'b--')
xlabel('Rank')
ylabel('Frequency')
legend('Observed', 'Zipf')
hold off

% Compute the goodness of fit using the Kolmogorov-Smirnov test
[H, p_value, D] = kstest2(y_obs, y_exp);

% Display the results
fprintf('Kolmogorov-Smirnov statistic = %.4f\n', D)
fprintf('p-value = %.4f\n', p_value)
if p_value < 0.05
    fprintf('Conclusion: The data is not from a Zipf distribution.\n')
else
    fprintf('Conclusion: The data is from a Zipf distribution.\n')
end
```

[EDIT]

You could also try to fit a power-law curve of the form $y = a x^b$ to your data. In MATLAB, you can fit it with lsqcurvefit and then use the anova1 function to perform an analysis of variance (ANOVA) on the residuals of the regression, which can give you a p-value that indicates whether the model is a good fit for the data.

```matlab
% Define the dataset
xdata = [1, 2, 3, 4, 5];
ydata = [2.7, 7.5, 16.6, 30.2, 48.6];

% Fit the power-law function to the dataset using lsqcurvefit
p0 = [1, 1];    % Initial parameter values
popt = lsqcurvefit(@(p, xdata) func(xdata, p), p0, xdata, ydata);

% Extract the fitting parameters
a = popt(1);
b = popt(2);

% Calculate the residuals and perform an ANOVA test
yfit = func(xdata, popt);
resid = ydata - yfit;
[pval, tbl, stats] = anova1(resid);

% Print the fitting parameters and p-value
fprintf('a = %f\n', a);
fprintf('b = %f\n', b);
fprintf('p-value = %f\n', pval);

% Check significance against alpha = 0.05
if pval < 0.05
    disp('The power-law regression model is not a good fit for the data.');
else
    disp('The power-law regression model is a good fit for the data.');
end

% Plot the original data and the fitted curve
plot(xdata, ydata, 'ko', 'DisplayName', 'Original data')
hold on
plot(xdata, func(xdata, popt), 'r-', 'DisplayName', 'Fitted curve')
xlabel('x')
ylabel('y')
legend('Location', 'best')
hold off

% Define the power-law function to fit (local functions must come last in a script)
function y = func(x, p)
    y = p(1) .* x .^ p(2);
end
```
  • Thanks a lot! The plot makes sense, and suggests the Zipf distribution fits the example data set very well. However, the Kolmogorov-Smirnov statistic is significant at p = 0.02, with the unexpected conclusion that therefore "The data is not from a Zipf distribution." I thought maybe it's the other way around, i.e. maybe significance in this test actually indicates a good rather than bad fit; but it's quite clear that the null hypothesis (which here we're rejecting) is that the sample is drawn from the reference distribution. Commented Apr 28, 2023 at 8:36
  • This is from the 'scipy' documentation for a two-sided test: "The null hypothesis is that the two distributions are identical, F(x)=G(x) for all x; the alternative is that they are not identical. The statistic is the maximum absolute difference between the empirical distribution functions of the samples. [...] Suppose we wish to test the null hypothesis that two samples were drawn from the same distribution. We choose a confidence level of 95%; that is, we will reject the null hypothesis in favor of the alternative if the p-value is less than 0.05." I.e. the test is the right way around. Commented Apr 28, 2023 at 10:36
  • This is also how MATLAB does it, and the first output variable H is 0 for accepting the null hypothesis and 1 for rejecting it. But running this code, it accepted the null hypothesis, as my p is 0.052 > alpha = 0.05. Commented Apr 28, 2023 at 10:38
  • Sorry but I'm still not getting it. Running your first code, I get p-value = 0.0195 every time. This enables me to reject the null hypothesis, meaning that my test distribution - constructed to be as Zipfian as possible except for the added noise - is qualitatively not the same as the reference (Zipf) distribution. That is, Zipf isn't a good fit. Am I missing something? Commented Apr 30, 2023 at 13:43
  • One more sanity check: if I create a distribution of random numbers (y_obs_null = randn(1, length(x)) in the code above), the significance of the K-S test is the same as for the actual Zipf distribution (p-value very close to 0), so they are both rejected as Zipfian for the same reason, so to speak, even though one is clearly an excellent fit and the other one is not. This tells me either the K-S test is not an appropriate goodness-of-fit test for this distribution, or that higher p-values do indicate a better fit - but that a statistical threshold (alpha) much higher than .05 should be used for them. Commented May 2, 2023 at 19:07
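
One possible reading of the behaviour described in these comments is that kstest2(y_obs, y_exp) treats the two vectors of frequency values as two raw samples, rather than comparing distributions over ranks weighted by counts, so its p-value does not measure what the question is after. A KS-style comparison on the rank CDFs would look more like the sketch below; this is an illustration under assumptions made here (invented counts, Zipf exponent fixed at 1), not the method used in the answer.

```matlab
% Illustrative only: hypothetical counts per rank, not data from the question.
freq  = [120 55 38 27 22 18 15 13 12 10];    % observed counts for ranks 1..10
k     = 1:numel(freq);
alpha = 1;                                   % assumed Zipf exponent for this sketch

% Empirical CDF over ranks (weighted by counts) vs. theoretical Zipf CDF
cdf_obs = cumsum(freq) / sum(freq);
p_zipf  = k.^(-alpha) / sum(k.^(-alpha));    % normalised Zipf probabilities
cdf_exp = cumsum(p_zipf);

% KS-type statistic: largest absolute gap between the two CDFs
D = max(abs(cdf_obs - cdf_exp));
fprintf('Max CDF difference D = %.4f\n', D);
```

Because the support is discrete and short, and the exponent would normally be estimated from the data, the standard KS critical values do not strictly apply here; D is best read descriptively or calibrated by simulation.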
