3
$\begingroup$

I have a dataset with hashtags and their frequencies (~370k frequencies), as for example (after a sort):

373827 hashtag_1 373826 hashtag_2 373826 hashtag_3 373826 hashtag_4 373825 hashtag_5 373823 hashtag_6 373823 hashtag_7 373822 hashtag_8 

and I want to check if this frequencies follows Zipf-Mandelbrot's law (link) through fitting in R. So I must estimate the two parameters of the law (the exponent s and q following Wikipedia page). For this purpose I have decided to use function zm.ll of tolerance package that uses ML estimation, obtaining s~1,65 and q~0.085.

Now I'd understand if this estimation is correct through Chi-square test but I didn't understand how to do this test in R and the interpretation of the p-value returned. I've tried with this that return p.val=0:

# ZipfMan law where y is the frequencies x <- (((1:length(y))+0.085)^1.65) p <- x/sum(x) tab <- tabulate(y, length(y)) discrepancy <- (tab-length(y)*p)^2/(length(y)*p) chi.stat <- sum(discrepancy) p.val <- pchisq(chi.stat, df= length(y)-2-1, lower.tail = FALSE) c(p.val, chi.stat) 

I have some doubts about this piece of code (also about the choice of degree of freedom) and the result returned by it and I have difficulties to interpret p-value.

$\endgroup$

1 Answer 1

1
$\begingroup$

I've resolved using max likelihood estimation for fitting parameters and the probability mass function of Zipf-Mandelbrot to compute probabilities and expected frequencies. Recall the problem: i'd check if the occurrences of hashtags (and not the frequencies of the occurrences) follow a Zipfian distribution (Zipf or ZM).

# use lzipfman for plotting lzipfman <- function(alpha, beta) -s*log((1:length(data$V2)) + beta)-log(sum(1/((1:length(data$V2))+beta)^s)) ##### MLE estimation ##### zm.ll <- function(alpha, beta) { sum(freq*(alpha*log(1:length(data$V2)+beta)+log(sum(1/(1:length(data$V2)+beta)^alpha)))) } zm.fit <- mle(minuslogl = zm.ll, start = list(alpha=1, beta=0.1)) summary(zm.fit) ##### Probabilities and test ##### probs =(1/(1:length(data$V2)+beta)^alpha)/sum(1/(1:length(data$V2)+beta)^alpha) chisq.test(data$V2, p = probs) 
$\endgroup$

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.