Let's assume for the moment that the sort keys are all distinct, and think about the information theory of sorting.
An unsorted array is a permutation of the sorted array. There are $n!$ possible permutations. The job of sorting, in an information theoretic sense, is to discover precisely which permutation it is.
To transmit a number between $1$ and $m$ requires transmitting $\log_2 m$ bits of information. To transmit a permutation of $n$ elements, therefore, requires $\log_2 n!$ bits of information. By Stirling's approximation, this turns out to be $n \log_2 n + O(\hbox{low order})$ bits.
A binary comparison operation discovers one bit of information. It follows that any sorting algorithm which only uses a binary comparison operation must perform $O(n \log n)$ comparisons.
A radix sort beats this by discovering more than one bit per query. If you can discover $\Theta(\log n)$ bits per query, for example, then you can sort using $O(n)$ queries.
EDIT The key word in that last sentence is "if".
Suppose that we have $n$ distinct integers, and we wish to radix sort them using $k$ buckets.
Clearly, each query will discover $\log_2 k$ bits of information. It follows that any such sort must perform $\frac{n \log_2 n}{\log_2 k}$ queries (ignoring the low-order term).
If $k = 2$, then this has the same complexity as a comparison-based sort. However, if $k = \Theta(\log n)$, then sorting will be linear. In most cases, $k$ will be somewhere in between these two values.