In Mathematical Snapshots (second edition, 1950), pp. 38-39, Hugo Steinhaus discusses the problem of ordering $n$ objects by pairwise comparisons with respect to a transitive relation. The book, addressed to a general audience, presents the problem of ordering weights with a balance scale, and then that of ranking $n$ tennis players, thus generalising a problem Steinhaus himself had proposed in a mathematical seminar during the 1929-1930 academic year, namely to find the minimum number of matches needed to determine the best and second-best player in a tennis tournament (the problem was solved in 1932 by Schreier, but his proof of optimality was flawed; the first correct proof was published only in 1964 by Kislitsyn). In effect, Steinhaus describes binary insertion and shows that the problem can be solved with $M(n)=1+nS(n)-2^{S(n)}$ comparisons, where $S(n)=1+\lfloor\log_2(n)\rfloor$. In the volume, Steinhaus states that it "has not been proved that there is no shorter proceeding possible, but we rather think it to be true".
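To see the formula in action, here is a minimal Python sketch (the helper names are mine, not Steinhaus's) that checks the closed form against a direct count of the comparisons binary insertion needs in the worst case, namely $\lceil\log_2 k\rceil$ to insert the $k$-th element among the $k-1$ already sorted ones:

```python
def S(n: int) -> int:
    """S(n) = 1 + floor(log2(n)), computed exactly with integer arithmetic."""
    return n.bit_length()

def M(n: int) -> int:
    """Steinhaus's closed form for the worst case of binary insertion."""
    return 1 + n * S(n) - 2 ** S(n)

def binary_insertion_worst_case(n: int) -> int:
    """Direct count: inserting the k-th item into k-1 sorted items costs
    ceil(log2(k)) comparisons in the worst case, and ceil(log2(k)) == (k-1).bit_length()."""
    return sum((k - 1).bit_length() for k in range(2, n + 1))

for n in range(2, 100):
    assert M(n) == binary_insertion_worst_case(n)
```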
Several years later, in Some remarks about tournaments. Calcutta Math. Soc. Golden Jubilee Commemoration Vol. (1958/59), Part II, pp. 323-327, Steinhaus reports that two of his colleagues, Stanisław Trybuła and Czen Ping, disproved his conjecture by determining the exact minimum number of comparisons needed to sort $n$ items, for $n\leq 11$ (apparently without publishing their results). From this fact, Knuth, in TAOCP, second edition, volume 3, p. 186, argues that Trybuła and Ping may have independently discovered the method known as merge-insertion, proposed by Lester R. Ford, Jr. and Selmer M. Johnson in A tournament problem. The American Mathematical Monthly, 66(5):387-389, 1959. In this paper, the authors introduce the so-called decision-tree model and prove the lower bound $\lceil\log_2(n!)\rceil$, which is $\Theta(n\log n)$.
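The gap between the two quantities is easy to exhibit numerically. The short sketch below (helper names mine) computes the lower bound exactly and sets it against Steinhaus's $M(n)$; the first discrepancy appears at $n=5$, where the bound is $7$ but binary insertion needs $8$ comparisons, which is precisely where Steinhaus's conjecture breaks down:

```python
from math import factorial

def lower_bound(n: int) -> int:
    """Information-theoretic lower bound ceil(log2(n!)), computed exactly."""
    return (factorial(n) - 1).bit_length()

def M(n: int) -> int:
    """Steinhaus's binary-insertion worst case, as in the previous snippet."""
    return 1 + n * n.bit_length() - 2 ** n.bit_length()

for n in range(2, 12):
    print(n, lower_bound(n), M(n))
# The first gap appears at n = 5: lower bound 7, binary insertion 8.
```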
Merge-insertion sort, or the Ford-Johnson algorithm, uses fewer comparisons in the worst case than the best previously known methods, binary insertion sort and merge sort. Merge-insertion sort achieves the minimum possible number of comparisons for $n\leq 22$ items, and it uses the fewest comparisons known for $n \leq 46$. Only in 1979 did Glenn Manacher publish another sorting algorithm that uses even fewer comparisons, for sufficiently large inputs.
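Knuth (TAOCP, volume 3, section 5.3.1) gives the worst-case comparison count of merge-insertion as $F(n)=\sum_{k=1}^{n}\lceil\log_2(3k/4)\rceil$. Assuming that formula, the sketch below (helper names mine) tabulates it against the information-theoretic bound and binary insertion: for $n\leq 11$, $F(n)$ meets the bound, which settles optimality there immediately, while optimality for $12\leq n\leq 22$ required separate, largely computer-assisted arguments.

```python
from math import factorial

def lower_bound(n: int) -> int:
    return (factorial(n) - 1).bit_length()          # ceil(log2(n!))

def binary_insertion(n: int) -> int:
    return sum((k - 1).bit_length() for k in range(2, n + 1))

def merge_insertion(n: int) -> int:
    # F(n) = sum_{k=1..n} ceil(log2(3k/4)); exactly,
    # ceil(log2(3k/4)) = ceil(log2(3k)) - 2 = (3k - 1).bit_length() - 2.
    return sum((3 * k - 1).bit_length() - 2 for k in range(1, n + 1))

print(" n  bound  F-J  bin-ins")
for n in range(2, 13):
    print(f"{n:2d} {lower_bound(n):5d} {merge_insertion(n):4d} {binary_insertion(n):7d}")
# F(n) equals the lower bound for every n <= 11; at n = 12 it needs 30 against a bound of 29.
```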
Unfortunately, at this point the story gets a little complicated. As Georg Essl correctly pointed out, a permutation-counting argument had already appeared in Demuth's thesis Electronic Data Sorting (1956). This was "perhaps the first publication to deal in any detail with questions of computational complexity. Demuth considered several abstract models for sorting devices, and established lower and upper bounds on the mean and maximum execution times achievable with each model." [Knuth, cit., p. 246] "It considered three abstract models of the sorting problem, using cyclic, linear, and random-access memories [... A]lthough no practical consequences flowed immediately from Demuth's thesis, it established important ideas about how to link theory with practice" [Knuth, cit., p. 388]. "[...] one of Demuth's main results was that in a certain sense bubble sorting is the optimum way to sort. [...] It turns out that [all other sorting methods studied] are always better in practice, in spite of the fact that Demuth has proved the optimality of bubble sorting on a certain peculiar type of machine." [Knuth, The Dangers of Computer Science Theory, pp. 190-191, emphasis mine]
Knuth thus suggests that Demuth's work not only fails to take into account certain peculiarities of the implementation of the algorithms (but the same holds for the Ford-Johnson algorithm, as "their approach has fortunately never been used by programmers, because the complex method used to decide what comparison to make costs much more time than the comparisons themselves", Knuth, The Dangers of Computer Science Theory, p. 191), but that it is also too constrained by particular models of computation. Knuth, again, describes an optimal way to order five elements and states that this method "was first found by H. B. Demuth [in his] thesis, [pp.] 41-43" and that "a pleasant generalization [...] has been discovered by Lester Ford, Jr. and Selmer Johnson" [TAOCP, volume 3, p. 184].
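As an illustration of what such an optimal scheme for five elements looks like, here is a hedged Python sketch: it is my own rendering of the standard seven-comparison procedure (essentially merge-insertion for $n=5$), not a transcription of Demuth's pages; note that $7=\lceil\log_2 5!\rceil$ is the information-theoretic minimum.

```python
def sort5(a, b, c, d, e):
    """Sort five comparable items with at most 7 comparisons."""
    comparisons = 0

    def less(x, y):
        nonlocal comparisons
        comparisons += 1
        return x < y

    # Comparisons 1-2: order two disjoint pairs, so that a <= b and c <= d.
    if less(b, a): a, b = b, a
    if less(d, c): c, d = d, c
    # Comparison 3: order the pairs by their larger elements, giving a <= b <= d and c <= d.
    if less(d, b): a, b, c, d = c, d, a, b
    # Comparisons 4-5: binary-insert e into the 3-chain a <= b <= d.
    if less(e, b):
        chain = [e, a, b, d] if less(e, a) else [a, e, b, d]
    else:
        chain = [a, b, e, d] if less(e, d) else [a, b, d, e]
    # Comparisons 6-7: binary-insert c among the (at most 3) elements below d,
    # which suffices because c <= d is already known.
    below = chain[:chain.index(d)]
    lo, hi = 0, len(below)
    while lo < hi:
        mid = (lo + hi) // 2
        if less(c, below[mid]):
            hi = mid
        else:
            lo = mid + 1
    result = below[:lo] + [c] + below[lo:] + chain[chain.index(d):]
    return result, comparisons
```

For instance, `sort5(3, 1, 4, 1, 5)` returns the sorted list together with the number of comparisons actually used, which never exceeds seven.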
Since I cannot (at least for now) access Demuth's thesis in its entirety, I will allow myself to formulate my own hypothesis (based on section 5.5 of TAOCP, Summary, History, and Bibliography, pp. 380-391, which I recommend reading if possible). In the second half of the 1940s, the problem of sorting large amounts of data began to be tackled extensively, and many internal sorting techniques were already more or less known, but without any real theory having been developed. In 1952 Daniel Goldenberg published Time analyses of various methods of sorting data, Digital Computer Laboratory memo M-1680, in which he coded five different sorting methods and carried out best-case and worst-case analyses of each. Then, in 1956, Demuth's thesis laid the foundations of computational complexity theory and arguably stated the permutation-counting argument. His thesis probably remained inaccessible and circulated only in relatively restricted circles, but thanks to a number of surveys published on the subject in the years 1956-1959, his ideas were disseminated, and Ford and Johnson finally made them known to a wider audience.