Given a set of arbitrary files, what's the best way to identify the text strings shared between them (either in all files or a subset of them) from the Linux command line?

This would be useful for quickly identifying candidate strings for YARA rules covering clusters of similar malicious files (for instance, malicious executables).

1 Answer


Here's one approach, for malicious files in a directory named malware:

find malware/ -type f -print0 | xargs -0 -n1 sh -c 'strings "$1" | sort -u' sh | sort | uniq -c | sort -n

The output will look something like the following, where the first number on each line is the number of files containing the string:

    ...
      1 Sleep
    ...
      2 JFIF
      2 SetBkColor
    ...
      5 !This program cannot be run in DOS mode.
      5 t@PW
      5 @tVH
    ...
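To answer the "in all files" part of the question, the counts can be filtered so that only strings present in every file survive. A minimal self-contained sketch of that filter, using throwaway sample files and grep's printable-run matching as a stand-in for strings(1) so it runs without binutils (the file names and SHARED_STRING content are made up for illustration):

```shell
# Throwaway sample directory standing in for malware/
dir=$(mktemp -d)
printf 'SHARED_STRING\nonly_in_a\n' > "$dir/a.bin"
printf 'SHARED_STRING\nonly_in_b\n' > "$dir/b.bin"

# Count the files, so only strings present in every file pass the filter
n=$(find "$dir" -type f | wc -l)

# grep -ao '[ -~]\{4,\}' emulates strings(1): runs of 4+ printable characters
shared=$(find "$dir" -type f | xargs -n1 -I{} sh -c "grep -ao '[ -~]\{4,\}' {} | sort -u" \
  | sort | uniq -c | awk -v n="$n" '$1 == n { print $2 }')
echo "$shared"
rm -rf "$dir"
```

Replacing the awk filter with `$1 == n` by a plain `sort -n` recovers the original count-sorted listing.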

One useful variation when the input files are Windows executables is to use strings -el instead of strings, which extracts UTF-16 little-endian strings (also known as wide character strings).
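The effect of wide-character encoding is easy to demonstrate. The sketch below builds a file whose only string is UTF-16LE-encoded; where strings -el is unavailable, iconv plus grep is used here as a rough stand-in (it assumes the whole file is UTF-16LE, unlike strings -el, which scans for embedded runs):

```shell
# Sample file containing only a UTF-16LE string: "Hello" with interleaved NULs
f=$(mktemp)
printf 'H\0e\0l\0l\0o\0' > "$f"

# Plain single-byte extraction finds nothing: no run of 4+ printable bytes
plain=$(grep -ao '[ -~]\{4,\}' "$f" | wc -l)

# Decoding as UTF-16LE first recovers the wide string
wide=$(iconv -f UTF-16LE -t UTF-8 "$f" | grep -ao '[ -~]\{4,\}')
echo "$wide"
rm -f "$f"
```

This is why running plain strings over a Windows executable often misses the most interesting artifacts (registry paths, URLs, mutex names), which are frequently stored wide.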

To tie strings back to their containing files, use strings -f malware/* | grep <string>; the -f flag prefixes each string with the name of the file it came from.
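A self-contained illustration of that lookup, with sample files and grep -H standing in for real malware and strings -f (the EvilMutex123 marker and sample file names are invented for the demo; grep's -H flag prefixes matches with their file name, just as strings -f does):

```shell
# Sample directory standing in for malware/
dir=$(mktemp -d)
printf 'EvilMutex123\nalpha_only\n' > "$dir/sample1"
printf 'EvilMutex123\nbeta_only\n'  > "$dir/sample2"

# grep -aoH prints file:match for each printable run, like strings -f
hits=$(grep -aoH '[ -~]\{4,\}' "$dir"/* | grep 'EvilMutex123')
echo "$hits"
rm -rf "$dir"
```

Both files show up against the shared marker, which is exactly the information needed when deciding which samples a candidate YARA string actually covers.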
