
Under the folder

/grid/sdh/hadoop/yarn/local/usercache/hdfs/appcache 

we have more than 100 nested folders.

One of these folders contains thousands of files. Is it possible to identify which folder it is?

I am asking because of this one folder that contains thousands of files: we may not be able to remove the files there because there are so many of them.

  • It could be a thousand or more. In this folder, for example, if you type ls, it returns no output because of the huge number of files. Commented Mar 26, 2019 at 6:27
  • Have you run fsck just to make sure that your filesystem is not corrupted? Commented Mar 26, 2019 at 6:44
  • No, we have not run it, but why do you suspect that? Commented Mar 26, 2019 at 6:45
  • Because if ls fails in the directory, one possible cause is a corrupted filesystem. fsck checks for filesystem corruption. Commented Mar 26, 2019 at 6:50
  • With some file systems, the size of the directory files can give an indication of how many entries they have without having to list their content. Commented Mar 26, 2019 at 7:29

2 Answers


The number of items in a directory may be counted using

set -- * 

This sets the positional parameters ($1, $2, etc.) to the names in the current directory. The number of names that * expands to is found in $#. If you use the bash shell and set the dotglob shell option, this will additionally count hidden names.
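A minimal demonstration in a scratch directory (created here only for the example):

```shell
# Create a scratch directory with a known number of files (example setup only)
tmpdir=$(mktemp -d)
touch "$tmpdir/one" "$tmpdir/two" "$tmpdir/three"

cd "$tmpdir"
set -- *          # positional parameters become the names in this directory
echo "$#"         # number of names the * glob matched, here 3
```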

Using this to find directories under /grid/sdh/hadoop/yarn/local/usercache/hdfs/appcache that contain more than 1000 names:

find /grid/sdh/hadoop/yarn/local/usercache/hdfs/appcache \
    -type d -exec bash -O dotglob -c '
        for pathname do
            set -- "$pathname"/*
            if [ "$#" -gt 1000 ]; then
                printf "%d\t%s\n" "$#" "$pathname"
            fi
        done' bash {} +

This expands the * shell glob in each found directory and outputs the pathname of the directory if there are more than 1000 names in it, along with the number of names. It does this by executing a short bash script for batches of directories. The script will loop over each batch of directories and for each, it will expand the * glob inside it to count the number of entries. An if statement then triggers printf if appropriate.

Note that if a directory contains millions of names, then it may take a bit of time to actually expand the * glob in that directory.
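As an alternative sketch for counting the entries of a single directory, GNU find can print one byte per entry; counting bytes rather than lines avoids miscounting names that contain newlines (the `-printf` action and the `dir` variable here are assumptions for illustration):

```shell
# Count the direct entries of one directory, including hidden ones.
# Requires GNU find for -printf; set dir to the directory to inspect.
dir=.
find "$dir" -mindepth 1 -maxdepth 1 -printf x | wc -c
```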

  • do we need to run "set -- *" before we run your find code? Commented Mar 26, 2019 at 7:06
  • @yael No, that is not necessary. That was just a bit of explaining the way that the embedded bash script works that find calls. Commented Mar 26, 2019 at 7:12
  • So when you say "names", does that include both folders and files? Am I correct that if we have more than 1000 folders under a folder, it will also print it? Commented Mar 26, 2019 at 7:17
  • @yael Yes, it would include the names of any entry in the directory. Commented Mar 26, 2019 at 7:41

On a GNU system

(export LC_ALL=C
 find /grid/sdh/hadoop/yarn/local/usercache/hdfs/appcache -print0 |
   tr '\n\0' '\0\n' |
   sed 's|/[^/]*$||' |
   sort |
   uniq -c |
   sort -rn |
   head |
   tr '\0' '\n')

Would list the 10 directories with the most entries.
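To see why the \n/\0 swap works, here is the same pipeline core run on two hand-made NUL-terminated pathnames, one of which has a newline in its base name (GNU tr and sed assumed; the pathnames are made up for the example):

```shell
# Two NUL-terminated pathnames; the second has a newline in its base name.
printf '/a/b\0/a/x\ny\0' |
  tr '\n\0' '\0\n' |      # swap: records become newline-terminated, embedded newlines become NUL
  sed 's|/[^/]*$||' |     # strip the last path component, leaving the parent directory
  sort | uniq -c          # uniq -c reports 2 occurrences of the parent /a
```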

If the directories have so many files that even listing them would be too expensive, you can try to guess which they are without entering them, by looking at their size.

find /grid/sdh/hadoop/yarn/local/usercache/hdfs/appcache -type d \
    -size +10000000c -print -prune

Would list the directories whose directory file itself is larger than 10 MB, without descending into them.
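The `+10000000c` test above compares against the size in bytes of the directory file itself, not the files inside it. That size can be checked for one directory with GNU stat (the path here is only an example):

```shell
# Print the size in bytes of the directory entry table, not of its contents.
# On ext4, this grows as entries are added and typically does not shrink
# when they are removed, so a large value hints at a once-huge directory.
dir=/tmp            # hypothetical example path
stat -c '%s %n' "$dir"
```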
