24
\$\begingroup\$

The purpose of this task is to write a program or function to find the most common substring within a given string, of the specified length.

Inputs

  1. A string, s, of any length and characters that are valid as inputs for your language.
  2. A number, n, indicating the length of substring to find. Guaranteed to be equal to or less than the length of the string.

Output

The most common (case-sensitive) substring of s of length n. In the case of a tie, any of the options may be output.

Win Criteria

This is , so lowest bytes wins; usual exceptions etc. apply.

Examples

Hello, World!, 1 > l (appears 3 times)
Hello! Cheerio!, 2 > o! (appears twice)
The Cat sat on the mat, 2 > at (appears three times)
The Cat sat on the mat, 3 > at or he (both with trailing spaces. Both appear twice - note case-sensitivity so The and the are different substrings)
Mississippi, 4 > issi (appears twice, note that the two occurrences overlap each other)

Some more examples, all using the same input string*

Bb:maj/2 F:maj/5 Bb:maj/2 G:9/3 Bb:maj7 F:maj/3 G:min7 C:sus4(b7,9) C:sus4(b7,9) C:sus4(b7,9) F:maj F:maj 

(note the trailing newline)

1 > \r\n(appears 12 times, counting \r\n as a single character - my choice, your code may differ on this - otherwise either \r or \n appear the same number of times). : also appears 12 times
2 > :m (appears 8 times)
3 > :ma or maj (appears 7 times)
4 > :maj (appears 7 times)
5 > F:maj or \r\nF:ma or :maj/ (appears 4 times)
14 > \r\nC:sus4(b7,9)\r\n (appears 3 times)

(* input is taken from How predictable is popular music?. The result of this task might be used, for example, to find the most efficient way to apply a substitution in order to compress the string)

\$\endgroup\$
7
  • \$\begingroup\$ @KevinCruijssen thanks for the n=5 spot. regarding your second point, each line in the test case ends with a newline - there are 12 lines in the test case, so 11 newlines. Or am I missing something? Or maybe the formatting is messing things up? \$\endgroup\$ Commented Feb 18, 2020 at 11:24
  • \$\begingroup\$ ...but, : does appear 12 times (not sure how you got 8?) - so that's actually the largest. I've added a trailing newline to the test case, so now we have \r\n and : tying. \$\endgroup\$ Commented Feb 18, 2020 at 11:29
  • \$\begingroup\$ Ah oops, not sure how I got 8 instead of 12 indeed.. Mainly wanted to say : occurs one more than \n, but with a trailing newline they indeed both occur an equal amount of times. \$\endgroup\$ Commented Feb 18, 2020 at 11:38
  • 1
    \$\begingroup\$ Is it acceptable to return the highest occurring substring with its count? (C.f. this answer) \$\endgroup\$ Commented Feb 18, 2020 at 14:40
  • 2
    \$\begingroup\$ @RGS sure, why not? \$\endgroup\$ Commented Feb 18, 2020 at 14:42

21 Answers 21

11
\$\begingroup\$

Python 3.8, 75 68 97 63 bytes

lambda s,n:max(l:=[s[j:j+n]for j in range(len(s))],key=l.count) 

You can try it online! Increased byte count because, as @xnor kindly pointed out, Python's str.count doesn't count overlapping substrings... But then @xnor's reformulation of what I was doing allowed to slice a third of the bytes!

How:

l:=[s[j:j+n]for j in range(len(s))] 

This creates a list of all the substrings in s of size n, plus the suffixes of size n-1, n-2, ..., 1. Then we find the max on that list, with the numerical value being used to sort given by

l.count 

That means that if a and b are from l, a > b if a shows up more times in l than b. This means the shorter suffixes are never the result of the max because they only show up once, and even if the correct-sized substrings only show up once, they are first in l so they come out instead of the short suffixes. This allows me to save some bytes in the range used, given that I don't have to prevent i+n from being larger than len(s).

# Python, 83 64 bytes

lambda s,n:max((s[i:i+n]for i in range(len(s)-n+1)),key=s.count) 

Thanks to @mypetlion I saved a LOT of bytes :D

You can try it online with my very own incredible test case!

\$\endgroup\$
11
  • \$\begingroup\$ Replace sorted(...)[-1] with max(...) to save 7 bytes. \$\endgroup\$ Commented Feb 18, 2020 at 18:17
  • \$\begingroup\$ Replace lambda t:s.count(t) with s.count to save 12 bytes. \$\endgroup\$ Commented Feb 18, 2020 at 18:17
  • 1
    \$\begingroup\$ fixes typos \$\endgroup\$ Commented Feb 18, 2020 at 23:25
  • 1
    \$\begingroup\$ @S.S.Anne gives credit, too \$\endgroup\$ Commented Feb 18, 2020 at 23:31
  • 1
    \$\begingroup\$ It looks like this unfortunately doesn't get the "Mississippi", 4 test case right, due to how Python counts substrings. \$\endgroup\$ Commented Feb 19, 2020 at 8:04
7
\$\begingroup\$

05AB1E, 5 bytes

ŒIù.M 

Try it online or verify the smaller test cases or verify the larger test case (which is a bit slow).

Explanation:

Œ # Push all substrings of the (implicit) input-string Iù # Only keep substrings of a length equal to the second input-integer .M # Only keep the most frequent item of the remaining substrings # (after which it is output implicitly as result) 
\$\endgroup\$
5
\$\begingroup\$

Brachylog, 8 bytes

s₎ᶠọtᵒth 

Try it online!

Explanation

 ᶠ Find all… s …substrings of <1st element of input>… ₎ …of length <2nd element of input> ọ Get the list of each substring with its number of occurrence tᵒ Order by the number of occurrence t Take the last one h Output the substring itself 
\$\endgroup\$
5
\$\begingroup\$

JavaScript (ES6),  85 83  80 bytes

Takes input as (string)(length).

s=>n=>[...s].map(o=m=(x,i)=>(o[k=s.substr(i,n)]=-~o[k])<m|!k[n-1]?0:m=o[O=k])&&O 

Try it online!

\$\endgroup\$
5
\$\begingroup\$

Java 10, 212 211 196 195 146 143 bytes

s->n->{String r="",t;for(int m=0,c,j,i=s.length();i-->n;)for(c=0,j=-1;(j=s.indexOf(t=s.substring(i-n,i),j+1))>=0;)if(++c>m){m=c;r=t;}return r;} 

-15 bytes by outputting the substrings with their count, which is allowed.
-1 byte thanks to @ceilingcat.
-52 bytes thanks to @OlivierGrégoire.

Try it online.

Explanation:

s->n->{ // Method with String and Integer parameters and String return-type String r="", // Result-String, starting empty t; // Temp-String for(int m=0, // Largest count, starting at 0 c, // Temp count integer j, // Temp index integer i // Index-integer `i` =s.length();i-->n;) // Loop `i` in the range (input-length, n]: for(c=0, // Reset the temp-count to 0 j=-1; // And the temp-index integer to -1 (j=s.indexOf(t=s.substring(i-n,i) // Set String `t` to a substring of the input-String in the range [i-n,i) j+1))// Set the temp-index `j` to the index of String `t` in the input-String, // only looking at the trailing portion after index `j+1` >=0;) // And continue looping as long as long as that substring is found if(++c>m){// If the temp-count + 1 is larger than the current largest count: m=c; // Set this current largest count to this temp-count + 1 r=t;} // And the result-String to this temp-String return r;} // After the nested loops, return the result-String 
\$\endgroup\$
4
  • 1
    \$\begingroup\$ 146 bytes: S->n->{String R="",x;for(int M=0,I=0,m,i;I<S.length()-n;I++)for(m=0,i=-1;(i=S.indexOf(x=S.substring(I,I+n),i+1))>=0;)if(++m>M){M=m;R=x;}return R;} (no imports, and Java 8, not 10) \$\endgroup\$ Commented Feb 20, 2020 at 9:04
  • \$\begingroup\$ @OlivierGrégoire Thanks! That's a much better approach. I had the feeling it was possible without the Map and/or stream, but couldn't figure it out. \$\endgroup\$ Commented Feb 20, 2020 at 9:53
  • 1
    \$\begingroup\$ I also tried with .split("\\Q"+substring+"\\E"), without success because of the overlapping test case in Mississippi \$\endgroup\$ Commented Feb 20, 2020 at 9:55
  • \$\begingroup\$ 143 bytes: s->n->{String r="",t;for(int m=0,c,j,i=s.length();i-->n;)for(c=0,j=-1;(j=s.indexOf(t=s.substring(i-n,i),j+1))>=0;)if(++c>m){m=c;r=t;}return r;} (just changing the iteration manner) \$\endgroup\$ Commented Feb 20, 2020 at 10:04
5
\$\begingroup\$

Perl 6, 49 46 bytes

-3 bytes thanks to SirBogman

{$^n;bag($^a~~m:ov/.**{$n}/X~$).max(*{*}).key} 

Try it online!

Explanation

{ } # Anonymous block $^n; # Declare argument $^a~~m / / # Regex match input string :ov # Also return overlapping matches .**{$n} # Match n-character strings X~$ # Stringify matches bag( ) # Convert to Bag (multiset) .max(*{*}) # Find Pair string=>count with max value .key # Get key (string) 
\$\endgroup\$
5
  • \$\begingroup\$ 46 bytes /.**{$n}/ works. Not sure why /.**$n/ doesn't. \$\endgroup\$ Commented Feb 20, 2020 at 15:50
  • \$\begingroup\$ @SirBogman Cool, this sure would have helped me a few times. /$n/ is supposed to interpolate literal text (like \Q...\E in Perl 5), so I can see why .**$n doesn't work. \$\endgroup\$ Commented Feb 20, 2020 at 19:25
  • \$\begingroup\$ I don't fully understand what .max(*{*}) is doing. It seems like both of those whatever parameters are Pairs? \$\endgroup\$ Commented Feb 20, 2020 at 19:49
  • 1
    \$\begingroup\$ @SirBogman Only the first * is a whatever parameter. The second * is simply to get a full slice of the Pair, returning a list containing its only value. So *{*} is effectively the same as *.values. \$\endgroup\$ Commented Feb 20, 2020 at 22:20
  • \$\begingroup\$ Cool. Those whatevers are very flexible. \$\endgroup\$ Commented Feb 21, 2020 at 0:05
4
\$\begingroup\$

PHP, 133 126 bytes

for(;$k<=strlen($i=$argv[1])-$j=$argv[2];)$s[]=substr($i,$k++,$j);$c=array_count_values($s);asort($c);echo array_key_last($c); 

Try it online!

Annoyingly asort() and end() can't be chained into one expression.

-7 bytes thanks to @manatwork

\$\endgroup\$
2
  • 1
    \$\begingroup\$ PHP 7.3 has array_key_last() to reduce it to 126 characters. \$\endgroup\$ Commented Feb 18, 2020 at 11:32
  • \$\begingroup\$ I really should learn all my array functions. \$\endgroup\$ Commented Feb 18, 2020 at 11:43
3
\$\begingroup\$

JavaScript (Node.js), 92 bytes

s=>n=>s.map(F=t=>([F[w=(w+t).slice(-n)]=-~F[w],w]),w="").sort(([a],[b])=>a-b)[s.length-1][1] 

Try it online!

Takes s as a list of characters in (s)(n) syntax.

\$\endgroup\$
3
  • \$\begingroup\$ =>a-b)[s.length-1] to =>b-a)[0] to sort in reverse order and save some bytes. \$\endgroup\$ Commented Feb 18, 2020 at 13:11
  • 1
    \$\begingroup\$ @KevinCruijssen That could fail if there are no repeating substrings. Try f([..."123456"])(2). \$\endgroup\$ Commented Feb 18, 2020 at 13:36
  • \$\begingroup\$ Ah, you're indeed right. My bad. \$\endgroup\$ Commented Feb 18, 2020 at 13:42
3
\$\begingroup\$

C# (Visual C# Interactive Compiler), 97, 96, 92 bytes

s=>n=>s.Skip(n-1).Select((_,i)=>s.Substring(i,n)).GroupBy(x=>x).OrderBy(g=>g.Count()).Last() 

Try it online!

Explanation:

s.Skip(n-1).Select((_,i)=>s.Substring(i,n)) //select every substring of size n .GroupBy(x=>x) //group by value .OrderBy(g=>g.Count()) //order by increasing recurrence .Last() //return the last one 
\$\endgroup\$
3
  • 1
    \$\begingroup\$ -1 byte by using a currying lambda (s,n)=> to s=>n=>: Try it online. \$\endgroup\$ Commented Feb 20, 2020 at 10:08
  • \$\begingroup\$ I'm pretty sure returning a Grouping is not acceptable. \$\endgroup\$ Commented Feb 21, 2020 at 1:10
  • \$\begingroup\$ @mypronounismonicareinstate check the comments section \$\endgroup\$ Commented Feb 21, 2020 at 10:07
2
\$\begingroup\$

Charcoal, 28 bytes

NθF⊕⁻Lηθ⊞υ✂ηι⁺θιUMυ⟦№υιι⟧⊟⌈υ 

Try it online! Link is to verbose version of code. Outputs the lexicographically largest substring with the highest count. Explanation:

Nθ 

Input the desired substring length.

F⊕⁻Lηθ⊞υ✂ηι⁺θι 

Extract all substrings of that length and push them to a list.

UMυ⟦№υιι⟧ 

Replace each substring with a tuple of its count and the substring..

⊟⌈υ 

Print the substring with the highest count.

\$\endgroup\$
2
\$\begingroup\$

Japt -g, 8 bytes

ãV ü ñÊÌ 

Try it

\$\endgroup\$
2
  • \$\begingroup\$ Is there any specific reason you use your own interpreter instead of the ethproductions one or the TIO one? \$\endgroup\$ Commented Feb 18, 2020 at 23:23
  • 1
    \$\begingroup\$ @S.S.Anne I don't know about Shaggy, but I use it because it has a lot of features that make golfing in Japt a lot easier, like the autogolf button \$\endgroup\$ Commented Feb 19, 2020 at 1:04
2
\$\begingroup\$

Wolfram Language (Mathematica), 30 bytes

#&@@Commonest[##~Partition~1]& 

Takes an array of characters, returns an array of characters.

Try it online!

Wolfram Language (Mathematica), 36 bytes

#&@@Commonest[##~StringPartition~1]& 

Takes a string, returns a string.

Try it online!

\$\endgroup\$
2
\$\begingroup\$

Zsh, 89 bytes

local -A m repeat $#1 ((++m[${1:$[i++]:$2}])) for k v (${(kv)m})((v>M))&&r=$k&&M=$v <<<$r 

Try it online!

local -A m # associative array, "local" is shorter than "typeset" repeat $#1 { # looping for the length of the string will cause the # program to use shorter, invalid strings as # the bash-style slice gets cut short ((++m[${1:$[i++]:$2}])) # increment the map at the key with substring } # starting at $[i++] with length $2 for k v (${(kv)m}) { # loop over the associative array's keys/values ((v>M)) && r=$k && M=$v # find the max value M, save the key as r. } # looping forward ensures we avoid our invalid subtrings <<<$r # print $r 
\$\endgroup\$
2
\$\begingroup\$

C++ (clang), 201 \$\cdots\$ 151 150 bytes

#import<map> using S=std::string;S f(S s,int n){std::map<S,int>m;S t,r;for(int i=0,x=0;i<=s.size()-n;r=x<++m[t=s.substr(i++,n)]?x=m[t],t:r);return r;} 

Try it online!

Ungolfed

#include<map> #include<string> std::string f(const std::string& s,int n) { std::map<std::string,int> m; for(int i=0;i<=s.size()-n;++i) { m[s.substr(i,n)]++; } int x=0; std::string r; for(auto a:m) { if(x<a.second) { x=a.second; r=a.first; } } return r; } 
\$\endgroup\$
0
2
\$\begingroup\$

Excel (Ver. 1911), 60 Bytes

B2 'Input: String B3 'Input: Sub-string Length C2 =-COUNTIF(D2#,D2#) D2 =MID(B2,SEQUENCE(LEN(B2)),B3) E2 =SORT(C2#:D2) F2 'Output 

Test Sample

Note: Newlines do exist in cells, but default formatting doesn't show them. Also only 10 of the 105 rows are shown to keep the image a decent size.

Test sample

\$\endgroup\$
1
\$\begingroup\$

Jelly, 6 bytes

I have a feeling there may be an elusive 5-byter out there.

ṡċ@Þ`Ṫ 

Try it online!

\$\endgroup\$
1
\$\begingroup\$

PowerShell, 72 bytes

param($s,$n)(0..($s.Length-$n)|%{$s|% s*g $_ $n}|group|sort C*|% N*)[-1] 

Try it online!

Unrolled:

param($s,$n) ( 0..($s.Length-$n)|%{ $s|% substring $_ $n }|group|sort Count|% Name )[-1] 
\$\endgroup\$
1
\$\begingroup\$

Ruby, 58 bytes

->s,n{a=(0..s.size).map{|i|s[i,n]};a.max_by{|e|a.count e}} 

Try it online!

\$\endgroup\$
1
\$\begingroup\$

Perl 5 -lp, 47 bytes

++$k{$_}>$k{$\}&&($\=$_)for<>=~/(?=(.{$_}))/g}{ 

Try it online!

\$\endgroup\$
1
\$\begingroup\$

Japt -h, 8 bytes

ãV ñ@è¶X 

Try it

\$\endgroup\$
1
\$\begingroup\$

Go, 131 bytes

func f(s string,l int)(p string){x,i:=0,0 m:=map[string]int{} for;i<=len(s)-l;i++{t:=s[i:i+l] m[t]++ if m[t]>x{x=m[t] p=t}} return} 

Takes as input a string and an int representing the string and the length of the substring respectively.

Some newlines can be replaced by semicolons if people prefer one-liners.

Try it online!

\$\endgroup\$

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.