  • Hi! 1) Your code is wrong. A single 'off' letter should not cause $\chi^2 = \infty$. Remember that it's a stochastic measure. 2) Greek is not English. 3) Non-printable characters are not English. Commented Mar 3, 2019 at 13:25
  • I'm aware of these issues. But since I don't have frequency data for the remaining bytes, I can't think of a more reasonable solution than assuming their frequency is 0, which makes the corresponding fraction in the chi-square sum infinite (or undefined, under another interpretation of positive value/0). English text may very well contain non-English characters; your comment is an example of that. Commented Mar 3, 2019 at 13:51
  • In entropy calculations for files like Shakespeare, we simply drop any characters we don't like. That eliminates f(char) = 0. We also convert all printable text to upper case. And $\chi$ isn't English, it's maths drawn in ink. As is $\int_0^\infty \mathrm{e}^{-x}\,\mathrm{d}x$. Remember that a single metric like chi is a very broad-brush approach, and therefore you have to compromise (a lot). Just some observations... Commented Mar 3, 2019 at 14:09
  • docs.scipy.org/doc/scipy/reference/generated/… Commented Mar 3, 2019 at 14:11
  • @PaulUszak $\chi$ is a Greek letter (and therefore an example of a non-English character appearing in an English sentence). It is not an $x$. Commented Mar 3, 2019 at 14:26
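
The approach suggested in the comments (drop characters that have no reference frequency, fold everything to upper case, then compute the statistic over the remaining letters) could look roughly like the sketch below. This is only an illustration under stated assumptions: the function name `chi_squared_english` is made up for this example, and the frequency table holds approximate, commonly quoted percentages for English letters, not data from this thread. Because unknown characters are discarded before counting, no expected count is ever 0 and the score stays finite.

```python
from collections import Counter

# ASSUMPTION: approximate relative frequencies (in percent) of English
# letters, for illustration only. They are renormalised below, so they do
# not need to sum to exactly 100.
ENGLISH_FREQ = {
    "A": 8.2, "B": 1.5, "C": 2.8, "D": 4.3, "E": 12.7, "F": 2.2, "G": 2.0,
    "H": 6.1, "I": 7.0, "J": 0.15, "K": 0.77, "L": 4.0, "M": 2.4, "N": 6.7,
    "O": 7.5, "P": 1.9, "Q": 0.095, "R": 6.0, "S": 6.3, "T": 9.1, "U": 2.8,
    "V": 0.98, "W": 2.4, "X": 0.15, "Y": 2.0, "Z": 0.074,
}


def chi_squared_english(text: str) -> float:
    """Chi-squared statistic of the ASCII letters in `text` against English.

    Hypothetical helper: every character that is not an ASCII letter
    (Greek, digits, punctuation, control bytes) is dropped before counting.
    """
    # Keep ASCII letters only and fold them to upper case.
    letters = [c.upper() for c in text if c.isascii() and c.isalpha()]
    n = len(letters)
    if n == 0:
        raise ValueError("text contains no ASCII letters to score")

    observed = Counter(letters)
    total = sum(ENGLISH_FREQ.values())

    score = 0.0
    for letter, freq in ENGLISH_FREQ.items():
        expected = n * freq / total  # always > 0, so no division by zero
        score += (observed[letter] - expected) ** 2 / expected
    return score


if __name__ == "__main__":
    plain = "The quick brown fox jumps over the lazy dog and then the other dog"
    noisy = plain + " \u03c7 \u03c7 \x00"  # same text plus Greek chi and a NUL byte
    print(chi_squared_english(plain))
    print(chi_squared_english(noisy))  # identical: the extra characters are ignored
```

Whether silently dropping characters is the right compromise depends on the use case. If the goal is to rank candidate decryptions, a fixed penalty per unknown byte may be a better fit than ignoring them, but that is a different design from the one sketched here.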