I have several ranking distributions and would, for each one, like to fit a [Zipf distribution][1], and estimate the goodness of fit relative to some standard benchmark.
With the Matlab code below, I tried to do a sanity check and see if a "textbook" Zipf rank distribution passes the statistical test. Clearly something is wrong, as it does not. If that doesn't, nothing will!
Using the Kolmogorov-Smirnov test, or the Anderson-Darling test with a custom-built (non-normal) distribution, in place of the chi-squared test does not change this.

    % Define some empirical frequency distribution
    x = 1:10;
    freq = randn(1,10); % textbook zipf!

    % Define the Zipf distribution
    alpha = 1.5; % Shape parameter, 1.5 is apparently a good all-round value to start with
    N = sum(freq); % Total number of observations
    k = 1:length(x); % Rank of each observation
    zipf_dist = N ./ (k.^alpha); % Compute the Zipf distribution

    % Plot our empirical frequency distribution alongside the Zipf distribution
    figure;
    bar(x, freq); % or freq\N
    hold on;
    plot(x, zipf_dist, 'r--');
    xlabel('Rank');
    ylabel('Frequency');
    legend('Observed', 'Zipf');

    % Compute the goodness of fit using the chi-squared test
    expected_freq = zipf_dist .* N;
    chi_squared = sum((freq - expected_freq).^2 ./ expected_freq);
    dof = length(freq) - 1;
    p_value = 1 - chi2cdf(chi_squared, dof);

    % Display the results
    fprintf('Chi-squared statistic = %.4f\n', chi_squared);
    fprintf('p-value = %.4f\n', p_value);
    if p_value < 0.05
        fprintf('Conclusion: The data is not from a Zipf distribution.\n');
    else
        fprintf('Conclusion: The data is from a Zipf distribution.\n');
    end

… `y` (presumably counts?) depends strongly on `y`. Please visit our posts about fitting Zipf distributions.

`y` does indeed refer to counts, but I don't get what you mean when you say that the variability in `y` depends strongly on `y`. Aside from that, what exactly in my fitting approach is incorrect?

… `fitdist` doesn't seem to have Zipf among its preset distributions)
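
For comparison, here is a minimal sketch of how the sanity check from the question could be set up so that the data really are drawn from a Zipf law. It is one possible arrangement under stated assumptions, not a definitive fix. It differs from the posted code in two ways: the counts are sampled from normalised Zipf probabilities (`randn` returns standard-normal values, not counts), and the expected counts are `N * p`, so they are scaled by `N` only once. The sample size `N = 1000` is a hypothetical choice, `alpha` is treated as known rather than estimated, and `gammainc(..., 'upper')` stands in for `1 - chi2cdf(...)` so that no toolbox is needed. The sampling line relies on implicit expansion (MATLAB R2016b or later).

```matlab
% Sanity check sketch: counts drawn from a known Zipf law, then chi-squared.
alpha = 1.5;                 % assumed shape parameter (treated as known)
K     = 10;                  % number of ranks
N     = 1000;                % hypothetical sample size

k = 1:K;
p = k.^(-alpha);
p = p / sum(p);              % normalised Zipf probabilities over ranks 1..K

% Draw N ranks by inverse-CDF sampling, then tabulate counts per rank.
u     = rand(N, 1);
ranks = min(sum(u > cumsum(p), 2) + 1, K);   % min() guards against round-off
freq  = accumarray(ranks, 1, [K 1])';        % observed counts (1-by-K)

% Pearson chi-squared statistic against the expected counts N*p.
expected_freq = N * p;
chi_squared   = sum((freq - expected_freq).^2 ./ expected_freq);
dof           = K - 1;       % alpha is fixed here, not estimated from freq

% Upper tail of the chi-squared distribution;
% equivalent to 1 - chi2cdf(chi_squared, dof).
p_value = gammainc(chi_squared / 2, dof / 2, 'upper');

fprintf('Chi-squared statistic = %.4f\n', chi_squared);
fprintf('p-value = %.4f\n', p_value);
```

If `alpha` were instead estimated from the same counts, the degrees of freedom would usually be reduced by one (`K - 2`). Because the counts are random, a small p-value will still turn up in a fraction of runs even when the data do follow the assumed law.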