The most common substring

Question

The purpose of this task is to write a program or function to find the most common substring within a given string, of the specified length.

Inputs

A string, s, of any length and characters that are valid as inputs for your language.
A number, n, indicating the length of substring to find. Guaranteed to be equal to or less than the length of the string.

Output

The most common (case-sensitive) substring of s of length n. In the case of a tie, any of the options may be output.

Win Criteria

This is code-golf, so lowest bytes wins; usual exceptions etc. apply.

Examples

Hello, World!, 1 > l (appears 3 times)
Hello! Cheerio!, 2 > o! (appears twice)
The Cat sat on the mat, 2 > at (appears three times)
The Cat sat on the mat, 3 > at or he (both with trailing spaces. Both appear twice - note case-sensitivity so The and the are different substrings)
Mississippi, 4 > issi (appears twice, note that the two occurrences overlap each other)

Some more examples, all using the same input string^*

Bb:maj/2 F:maj/5 Bb:maj/2 G:9/3 Bb:maj7 F:maj/3 G:min7 C:sus4(b7,9) C:sus4(b7,9) C:sus4(b7,9) F:maj F:maj

(note the trailing newline)

1 > \r\n(appears 12 times, counting \r\n as a single character - my choice, your code may differ on this - otherwise either \r or \n appear the same number of times). : also appears 12 times
2 > :m (appears 8 times)
3 > :ma or maj (appears 7 times)
4 > :maj (appears 7 times)
5 > F:maj or \r\nF:ma or :maj/ (appears 4 times)
14 > \r\nC:sus4(b7,9)\r\n (appears 3 times)

_{(* input is taken from How predictable is popular music?. The result of this task might be used, for example, to find the most efficient way to apply a substitution in order to compress the string)}

@KevinCruijssen thanks for the n=5 spot. regarding your second point, each line in the test case ends with a newline - there are 12 lines in the test case, so 11 newlines. Or am I missing something? Or maybe the formatting is messing things up? — simonalexander2005
– simonalexander2005, Commented Feb 18, 2020 at 11:24
...but, : does appear 12 times (not sure how you got 8?) - so that's actually the largest. I've added a trailing newline to the test case, so now we have \r\n and : tying. — simonalexander2005
– simonalexander2005, Commented Feb 18, 2020 at 11:29
Ah oops, not sure how I got 8 instead of 12 indeed.. Mainly wanted to say : occurs one more than \n, but with a trailing newline they indeed both occur an equal amount of times. — Kevin Cruijssen
– Kevin Cruijssen, Commented Feb 18, 2020 at 11:38
Is it acceptable to return the highest occurring substring with its count? (C.f. this answer) — RGS
– RGS, Commented Feb 18, 2020 at 14:40

RGS · Accepted Answer · 2020-02-19 09:23:32Z

Python 3.8, 75 68 97 63 bytes

lambda s,n:max(l:=[s[j:j+n]for j in range(len(s))],key=l.count)

You can try it online! Increased byte count because, as @xnor kindly pointed out, Python's str.count doesn't count overlapping substrings... But then @xnor's reformulation of what I was doing allowed to slice a third of the bytes!

How:

l:=[s[j:j+n]for j in range(len(s))]

This creates a list of all the substrings in s of size n, plus the suffixes of size n-1, n-2, ..., 1. Then we find the max on that list, with the numerical value being used to sort given by

l.count

That means that if a and b are from l, a > b if a shows up more times in l than b. This means the shorter suffixes are never the result of the max because they only show up once, and even if the correct-sized substrings only show up once, they are first in l so they come out instead of the short suffixes. This allows me to save some bytes in the range used, given that I don't have to prevent i+n from being larger than len(s).

~~# Python, 83 64 bytes~~

lambda s,n:max((s[i:i+n]for i in range(len(s)-n+1)),key=s.count)

Thanks to @mypetlion I saved a LOT of bytes :D

~~You can try it online with my very own incredible test case!~~

Replace lambda t:s.count(t) with s.count to save 12 bytes. — mypetlion
– mypetlion, Commented Feb 18, 2020 at 18:17
It looks like this unfortunately doesn't get the "Mississippi", 4 test case right, due to how Python counts substrings. — xnor
– xnor, Commented Feb 19, 2020 at 8:04

Community · Accepted Answer · 2020-06-17 09:04:33Z

05AB1E, 5 bytes

ŒIù.M

Try it online or verify the smaller test cases or verify the larger test case (which is a bit slow).

Explanation:

Œ # Push all substrings of the (implicit) input-string Iù # Only keep substrings of a length equal to the second input-integer .M # Only keep the most frequent item of the remaining substrings # (after which it is output implicitly as result)

Fatalize · Accepted Answer · 2020-02-18 11:04:44Z

Brachylog, 8 bytes

s₎ᶠọtᵒth

Try it online!

Explanation

 ᶠ Find all… s …substrings of <1st element of input>… ₎ …of length <2nd element of input> ọ Get the list of each substring with its number of occurrence tᵒ Order by the number of occurrence t Take the last one h Output the substring itself

Community · Accepted Answer · 2020-06-17 09:04:33Z

JavaScript (ES6), 85 83 80 bytes

Takes input as (string)(length).

s=>n=>[...s].map(o=m=(x,i)=>(o[k=s.substr(i,n)]=-~o[k])<m|!k[n-1]?0:m=o[O=k])&&O

Try it online!

Community · Accepted Answer · 2020-06-17 09:04:33Z

Java 10, 212 211 196 195 146 143 bytes

s->n->{String r="",t;for(int m=0,c,j,i=s.length();i-->n;)for(c=0,j=-1;(j=s.indexOf(t=s.substring(i-n,i),j+1))>=0;)if(++c>m){m=c;r=t;}return r;}

-15 bytes by outputting the substrings with their count, which is allowed.
-1 byte thanks to @ceilingcat.
-52 bytes thanks to @OlivierGrégoire.

Try it online.

Explanation:

s->n->{ // Method with String and Integer parameters and String return-type String r="", // Result-String, starting empty t; // Temp-String for(int m=0, // Largest count, starting at 0 c, // Temp count integer j, // Temp index integer i // Index-integer `i` =s.length();i-->n;) // Loop `i` in the range (input-length, n]: for(c=0, // Reset the temp-count to 0 j=-1; // And the temp-index integer to -1 (j=s.indexOf(t=s.substring(i-n,i) // Set String `t` to a substring of the input-String in the range [i-n,i) j+1))// Set the temp-index `j` to the index of String `t` in the input-String, // only looking at the trailing portion after index `j+1` >=0;) // And continue looping as long as long as that substring is found if(++c>m){// If the temp-count + 1 is larger than the current largest count: m=c; // Set this current largest count to this temp-count + 1 r=t;} // And the result-String to this temp-String return r;} // After the nested loops, return the result-String

146 bytes: S->n->{String R="",x;for(int M=0,I=0,m,i;I<S.length()-n;I++)for(m=0,i=-1;(i=S.indexOf(x=S.substring(I,I+n),i+1))>=0;)if(++m>M){M=m;R=x;}return R;} (no imports, and Java 8, not 10) — Olivier Grégoire
– Olivier Grégoire, Commented Feb 20, 2020 at 9:04
@OlivierGrégoire Thanks! That's a much better approach. I had the feeling it was possible without the Map and/or stream, but couldn't figure it out. — Kevin Cruijssen
– Kevin Cruijssen, Commented Feb 20, 2020 at 9:53
I also tried with .split("\\Q"+substring+"\\E"), without success because of the overlapping test case in Mississippi — Olivier Grégoire
– Olivier Grégoire, Commented Feb 20, 2020 at 9:55
143 bytes: s->n->{String r="",t;for(int m=0,c,j,i=s.length();i-->n;)for(c=0,j=-1;(j=s.indexOf(t=s.substring(i-n,i),j+1))>=0;)if(++c>m){m=c;r=t;}return r;} (just changing the iteration manner) — Olivier Grégoire
– Olivier Grégoire, Commented Feb 20, 2020 at 10:04

nwellnhof · Accepted Answer · 2020-02-20 19:24:26Z

5

Perl 6, 49 46 bytes

-3 bytes thanks to SirBogman

{$^n;bag($^a~~m:ov/.**{$n}/X~$).max(*{*}).key}

Try it online!

Explanation

{ } # Anonymous block $^n; # Declare argument $^a~~m / / # Regex match input string :ov # Also return overlapping matches .**{$n} # Match n-character strings X~$ # Stringify matches bag( ) # Convert to Bag (multiset) .max(*{*}) # Find Pair string=>count with max value .key # Get key (string)

edited Feb 20, 2020 at 19:24

answered Feb 18, 2020 at 12:28

nwellnhof

10.6k1 gold badge20 silver badges38 bronze badges

\$\begingroup\$ 46 bytes /.**{$n}/ works. Not sure why /.**$n/ doesn't. \$\endgroup\$

SirBogman
– SirBogman

2020-02-20 15:50:57 +00:00
Commented Feb 20, 2020 at 15:50
\$\begingroup\$ @SirBogman Cool, this sure would have helped me a few times. /$n/ is supposed to interpolate literal text (like \Q...\E in Perl 5), so I can see why .**$n doesn't work. \$\endgroup\$

nwellnhof
– nwellnhof

2020-02-20 19:25:29 +00:00
Commented Feb 20, 2020 at 19:25
\$\begingroup\$ I don't fully understand what .max(*{*}) is doing. It seems like both of those whatever parameters are Pairs? \$\endgroup\$

SirBogman
– SirBogman

2020-02-20 19:49:48 +00:00
Commented Feb 20, 2020 at 19:49
1

\$\begingroup\$ @SirBogman Only the first * is a whatever parameter. The second * is simply to get a full slice of the Pair, returning a list containing its only value. So *{*} is effectively the same as *.values. \$\endgroup\$

nwellnhof
– nwellnhof

2020-02-20 22:20:29 +00:00
Commented Feb 20, 2020 at 22:20
\$\begingroup\$ Cool. Those whatevers are very flexible. \$\endgroup\$

SirBogman
– SirBogman

2020-02-21 00:05:18 +00:00
Commented Feb 21, 2020 at 0:05

Add a comment |

Guillermo Phillips · Accepted Answer · 2020-02-18 11:42:27Z

4

PHP, 133 126 bytes

for(;$k<=strlen($i=$argv[1])-$j=$argv[2];)$s[]=substr($i,$k++,$j);$c=array_count_values($s);asort($c);echo array_key_last($c);

Try it online!

Annoyingly asort() ~~and end()~~ can't be chained into one expression.

-7 bytes thanks to @manatwork

edited Feb 18, 2020 at 11:42

answered Feb 18, 2020 at 11:23

Guillermo Phillips

8295 silver badges9 bronze badges

1

\$\begingroup\$ PHP 7.3 has array_key_last() to reduce it to 126 characters. \$\endgroup\$

manatwork
– manatwork

2020-02-18 11:32:00 +00:00
Commented Feb 18, 2020 at 11:32
\$\begingroup\$ I really should learn all my array functions. \$\endgroup\$

Guillermo Phillips
– Guillermo Phillips

2020-02-18 11:43:11 +00:00
Commented Feb 18, 2020 at 11:43

Add a comment |

Shieru Asakoto · Accepted Answer · 2020-02-18 12:55:21Z

3

JavaScript (Node.js), 92 bytes

s=>n=>s.map(F=t=>([F[w=(w+t).slice(-n)]=-~F[w],w]),w="").sort(([a],[b])=>a-b)[s.length-1][1]

Try it online!

Takes s as a list of characters in (s)(n) syntax.

edited Feb 18, 2020 at 12:55

answered Feb 18, 2020 at 12:48

Shieru Asakoto

6,47819 silver badges44 bronze badges

\$\begingroup\$ =>a-b)[s.length-1] to =>b-a)[0] to sort in reverse order and save some bytes. \$\endgroup\$

Kevin Cruijssen
– Kevin Cruijssen

2020-02-18 13:11:52 +00:00
Commented Feb 18, 2020 at 13:11
1

\$\begingroup\$ @KevinCruijssen That could fail if there are no repeating substrings. Try f([..."123456"])(2). \$\endgroup\$

Shieru Asakoto
– Shieru Asakoto

2020-02-18 13:36:38 +00:00
Commented Feb 18, 2020 at 13:36
\$\begingroup\$ Ah, you're indeed right. My bad. \$\endgroup\$

Kevin Cruijssen
– Kevin Cruijssen

2020-02-18 13:42:51 +00:00
Commented Feb 18, 2020 at 13:42

Add a comment |

Innat3 · Accepted Answer · 2020-02-20 14:12:12Z

C# (Visual C# Interactive Compiler), 97, 96, 92 bytes

s=>n=>s.Skip(n-1).Select((_,i)=>s.Substring(i,n)).GroupBy(x=>x).OrderBy(g=>g.Count()).Last()

Try it online!

Explanation:

s.Skip(n-1).Select((_,i)=>s.Substring(i,n)) //select every substring of size n .GroupBy(x=>x) //group by value .OrderBy(g=>g.Count()) //order by increasing recurrence .Last() //return the last one

-1 byte by using a currying lambda (s,n)=> to s=>n=>: Try it online. — Kevin Cruijssen
– Kevin Cruijssen, Commented Feb 20, 2020 at 10:08

Neil · Accepted Answer · 2020-02-18 12:18:06Z

Charcoal, 28 bytes

ＮθＦ⊕⁻Ｌηθ⊞υ✂ηι⁺θιＵＭυ⟦№υιι⟧⊟⌈υ

Try it online! Link is to verbose version of code. Outputs the lexicographically largest substring with the highest count. Explanation:

Ｎθ

Input the desired substring length.

Ｆ⊕⁻Ｌηθ⊞υ✂ηι⁺θι

Extract all substrings of that length and push them to a list.

ＵＭυ⟦№υιι⟧

Replace each substring with a tuple of its count and the substring..

⊟⌈υ

Print the substring with the highest count.

Community · Accepted Answer · 2020-06-17 09:04:33Z

2

Japt `-g`, 8 bytes

ãV ü ñÊÌ

Try it

edited Jun 17, 2020 at 9:04

CommunityBot

1

answered Feb 18, 2020 at 14:23

Shaggy

45k4 gold badges39 silver badges95 bronze badges

\$\begingroup\$ Is there any specific reason you use your own interpreter instead of the ethproductions one or the TIO one? \$\endgroup\$

S.S. Anne
– S.S. Anne

2020-02-18 23:23:01 +00:00
Commented Feb 18, 2020 at 23:23
1

\$\begingroup\$ @S.S.Anne I don't know about Shaggy, but I use it because it has a lot of features that make golfing in Japt a lot easier, like the autogolf button \$\endgroup\$

Gymhgy
– Gymhgy

2020-02-19 01:04:30 +00:00
Commented Feb 19, 2020 at 1:04

Add a comment |

JungHwan Min · Accepted Answer · 2020-02-18 17:38:22Z

Wolfram Language (Mathematica), 30 bytes

#&@@Commonest[##~Partition~1]&

Takes an array of characters, returns an array of characters.

Try it online!

Wolfram Language (Mathematica), 36 bytes

#&@@Commonest[##~StringPartition~1]&

Takes a string, returns a string.

Try it online!

GammaFunction · Accepted Answer · 2020-02-18 21:12:48Z

Zsh, 89 bytes

local -A m repeat $#1 ((++m[${1:$[i++]:$2}])) for k v (${(kv)m})((v>M))&&r=$k&&M=$v <<<$r

Try it online!

local -A m # associative array, "local" is shorter than "typeset" repeat $#1 { # looping for the length of the string will cause the # program to use shorter, invalid strings as # the bash-style slice gets cut short ((++m[${1:$[i++]:$2}])) # increment the map at the key with substring } # starting at $[i++] with length $2 for k v (${(kv)m}) { # loop over the associative array's keys/values ((v>M)) && r=$k && M=$v # find the max value M, save the key as r. } # looping forward ensures we avoid our invalid subtrings <<<$r # print $r

Noodle9 · Accepted Answer · 2020-02-19 16:23:48Z

C++ (clang), ^{201 \$\cdots\$ 151} 150 bytes

#import<map> using S=std::string;S f(S s,int n){std::map<S,int>m;S t,r;for(int i=0,x=0;i<=s.size()-n;r=x<++m[t=s.substr(i++,n)]?x=m[t],t:r);return r;}

Try it online!

Ungolfed

#include<map> #include<string> std::string f(const std::string& s,int n) { std::map<std::string,int> m; for(int i=0;i<=s.size()-n;++i) { m[s.substr(i,n)]++; } int x=0; std::string r; for(auto a:m) { if(x<a.second) { x=a.second; r=a.first; } } return r; }

emegolf123 · Accepted Answer · 2020-02-19 21:05:47Z

Excel (Ver. 1911), 60 Bytes

B2 'Input: String B3 'Input: Sub-string Length C2 =-COUNTIF(D2#,D2#) D2 =MID(B2,SEQUENCE(LEN(B2)),B3) E2 =SORT(C2#:D2) F2 'Output

Test Sample

Note: Newlines do exist in cells, but default formatting doesn't show them. Also only 10 of the 105 rows are shown to keep the image a decent size.

Jonathan Allan · Accepted Answer · 2020-02-18 13:29:34Z

1

Jelly, 6 bytes

I have a feeling there may be an elusive 5-byter out there.

ṡċ@Þ`Ṫ

Try it online!

answered Feb 18, 2020 at 13:29

Jonathan Allan

115k8 gold badges68 silver badges293 bronze badges

Add a comment |

mazzy · Accepted Answer · 2020-02-18 17:03:07Z

PowerShell, 72 bytes

param($s,$n)(0..($s.Length-$n)|%{$s|% s*g $_ $n}|group|sort C*|% N*)[-1]

Try it online!

Unrolled:

param($s,$n) ( 0..($s.Length-$n)|%{ $s|% substring $_ $n }|group|sort Count|% Name )[-1]

Value Ink · Accepted Answer · 2020-02-18 21:34:25Z

1

Ruby, 58 bytes

->s,n{a=(0..s.size).map{|i|s[i,n]};a.max_by{|e|a.count e}}

Try it online!

answered Feb 18, 2020 at 21:34

Value Ink

13.5k1 gold badge18 silver badges46 bronze badges

Add a comment |

Xcali · Accepted Answer · 2020-02-18 23:00:40Z

1

Perl 5 `-lp`, 47 bytes

++$k{$_}>$k{$\}&&($\=$_)for<>=~/(?=(.{$_}))/g}{

Try it online!

answered Feb 18, 2020 at 23:00

Xcali

17k2 gold badges17 silver badges42 bronze badges

Add a comment |

Community · Accepted Answer · 2020-06-17 09:04:33Z

1

Japt `-h`, 8 bytes

ãV ñ@è¶X

Try it

edited Jun 17, 2020 at 9:04

CommunityBot

1

answered Feb 19, 2020 at 1:15

Gymhgy

8,03813 silver badges35 bronze badges

Add a comment |

S.S. Anne · Accepted Answer · 2020-02-19 21:43:47Z

Go, 131 bytes

func f(s string,l int)(p string){x,i:=0,0 m:=map[string]int{} for;i<=len(s)-l;i++{t:=s[i:i+l] m[t]++ if m[t]>x{x=m[t] p=t}} return}

Takes as input a string and an int representing the string and the length of the substring respectively.

Some newlines can be replaced by semicolons if people prefer one-liners.

Try it online!

Stack Exchange Network

The most common substring

Inputs

Output

Win Criteria

Examples

Some more examples, all using the same input string^*

21 Answers 21

Python 3.8, 75 68 97 63 bytes

How:

05AB1E, 5 bytes

Brachylog, 8 bytes

Explanation

JavaScript (ES6), 85 83 80 bytes

Java 10, 212 211 196 195 146 143 bytes

Perl 6, 49 46 bytes

Explanation

PHP, 133 126 bytes

JavaScript (Node.js), 92 bytes

C# (Visual C# Interactive Compiler), 97, 96, 92 bytes

Charcoal, 28 bytes

Japt `-g`, 8 bytes

Wolfram Language (Mathematica), 30 bytes

Wolfram Language (Mathematica), 36 bytes

Zsh, 89 bytes

C++ (clang), ^{201 \$\cdots\$ 151} 150 bytes

Ungolfed

Excel (Ver. 1911), 60 Bytes

Test Sample

Jelly, 6 bytes

PowerShell, 72 bytes

Ruby, 58 bytes

Perl 5 `-lp`, 47 bytes

Japt `-h`, 8 bytes

Go, 131 bytes

Linked

Hot Network Questions

The most common substring

Inputs

Output

Win Criteria

Examples

Some more examples, all using the same input string*

21 Answers 21

Python 3.8, 75 68 97 63 bytes

How:

05AB1E, 5 bytes

Brachylog, 8 bytes

Explanation

JavaScript (ES6), 85 83 80 bytes

Java 10, 212 211 196 195 146 143 bytes

Perl 6, 49 46 bytes

Explanation

PHP, 133 126 bytes

JavaScript (Node.js), 92 bytes

C# (Visual C# Interactive Compiler), 97, 96, 92 bytes

Charcoal, 28 bytes

Japt -g, 8 bytes

Wolfram Language (Mathematica), 30 bytes

Wolfram Language (Mathematica), 36 bytes

Zsh, 89 bytes

C++ (clang), 201 \$\cdots\$ 151 150 bytes

Ungolfed

Excel (Ver. 1911), 60 Bytes

Test Sample

Jelly, 6 bytes

PowerShell, 72 bytes

Ruby, 58 bytes

Perl 5 -lp, 47 bytes

Japt -h, 8 bytes

Go, 131 bytes

Linked

Related

Hot Network Questions

Some more examples, all using the same input string^*

Japt `-g`, 8 bytes

C++ (clang), ^{201 \$\cdots\$ 151} 150 bytes

Perl 5 `-lp`, 47 bytes

Japt `-h`, 8 bytes