
I have a large number of compressed tar files, where each tar itself contains several files. I want to extract these files, and I want to use Hadoop or a similar technique to speed up the processing. Are there any tools for this kind of problem? As far as I know, Hadoop and similar frameworks like Spark or Flink do not operate on files directly and do not give you direct access to the filesystem. I also want to do some basic renaming of the extracted files and move them into appropriate directories.

I can imagine a solution where one creates a list of all tar files. This list is then passed to the mappers, and each mapper extracts one tar from the list. Is this a reasonable approach?
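
Roughly, I picture something like the following sketch (just to illustrate the idea; it assumes a plain-text file listing the HDFS paths of the tars, fed to the job with NLineInputFormat so that each mapper receives one path):

import java.io.IOException;
import java.io.InputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Sketch only: each mapper receives one line of the list file, i.e. the HDFS path of one tar.
public class TarListMapper extends Mapper<LongWritable, Text, Text, NullWritable> {
    @Override
    protected void map(LongWritable offset, Text tarPath, Context context)
            throws IOException, InterruptedException {
        FileSystem fs = FileSystem.get(context.getConfiguration());
        try (InputStream in = fs.open(new Path(tarPath.toString()))) {
            // ... extract the tar here, rename the entries, and write them back to HDFS ...
        }
        context.write(tarPath, NullWritable.get()); // record which tar was handled
    }
}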

    You would probably have to write a custom input format for Hadoop, Flink, or Spark to implement this. In the InputFormat code you can handle the files however you want. Commented Aug 5, 2015 at 8:48

2 Answers


It is possible to instruct MapReduce to use an input format where the input to each mapper is a single whole file (from https://code.google.com/p/hadoop-course/source/browse/HadoopSamples/src/main/java/mr/wholeFile/WholeFileInputFormat.java?r=3):

public class WholeFileInputFormat extends FileInputFormat<NullWritable, BytesWritable> {

    @Override
    protected boolean isSplitable(JobContext context, Path filename) {
        return false;
    }

    @Override
    public RecordReader<NullWritable, BytesWritable> createRecordReader(
            InputSplit inputSplit, TaskAttemptContext context)
            throws IOException, InterruptedException {
        WholeFileRecordReader reader = new WholeFileRecordReader();
        reader.initialize(inputSplit, context);
        return reader;
    }
}
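
The input format above relies on a WholeFileRecordReader, which is not shown here. A minimal sketch of what such a reader typically looks like (my own sketch, not the code from the linked project):

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;

// Reads the entire (unsplit) file into a single BytesWritable value.
public class WholeFileRecordReader extends RecordReader<NullWritable, BytesWritable> {
    private FileSplit split;
    private Configuration conf;
    private final BytesWritable value = new BytesWritable();
    private boolean processed = false;

    @Override
    public void initialize(InputSplit inputSplit, TaskAttemptContext context) {
        this.split = (FileSplit) inputSplit;
        this.conf = context.getConfiguration();
    }

    @Override
    public boolean nextKeyValue() throws IOException {
        if (processed) {
            return false;
        }
        byte[] contents = new byte[(int) split.getLength()];
        Path file = split.getPath();
        FileSystem fs = file.getFileSystem(conf);
        try (FSDataInputStream in = fs.open(file)) {
            IOUtils.readFully(in, contents, 0, contents.length);
        }
        value.set(contents, 0, contents.length);
        processed = true;
        return true;
    }

    @Override
    public NullWritable getCurrentKey() {
        return NullWritable.get();
    }

    @Override
    public BytesWritable getCurrentValue() {
        return value;
    }

    @Override
    public float getProgress() {
        return processed ? 1.0f : 0.0f;
    }

    @Override
    public void close() {
        // nothing to clean up; the input stream is closed in nextKeyValue()
    }
}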

Then, in your mapper, you can use the Apache Commons Compress library to unpack the tar file: https://commons.apache.org/proper/commons-compress/examples.html
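
Such a mapper could look roughly like this (a sketch under some assumptions: the tars are gzip-compressed, each entry is written back to HDFS under a target directory read from a hypothetical "extract.output.dir" configuration property, and any renaming would happen where the target path is built):

import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.OutputStream;
import org.apache.commons.compress.archivers.tar.TarArchiveEntry;
import org.apache.commons.compress.archivers.tar.TarArchiveInputStream;
import org.apache.commons.compress.compressors.gzip.GzipCompressorInputStream;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Unpacks one gzipped tar per map call; the whole tar arrives as the BytesWritable value
// produced by WholeFileInputFormat.
public class TarExtractMapper extends Mapper<NullWritable, BytesWritable, Text, NullWritable> {
    @Override
    protected void map(NullWritable key, BytesWritable value, Context context)
            throws IOException, InterruptedException {
        Configuration conf = context.getConfiguration();
        FileSystem fs = FileSystem.get(conf);
        Path outDir = new Path(conf.get("extract.output.dir", "/data/extracted"));

        try (TarArchiveInputStream tar = new TarArchiveInputStream(
                new GzipCompressorInputStream(
                        new ByteArrayInputStream(value.getBytes(), 0, value.getLength())))) {
            TarArchiveEntry entry;
            while ((entry = tar.getNextTarEntry()) != null) {
                if (entry.isDirectory()) {
                    continue;
                }
                // Basic renaming/relocation of the extracted file can happen here.
                Path target = new Path(outDir, entry.getName());
                try (OutputStream out = fs.create(target)) {
                    IOUtils.copyBytes(tar, out, conf, false);
                }
                context.write(new Text(target.toString()), NullWritable.get());
            }
        }
    }
}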

You don't need to pass a list of files to Hadoop; just put all the tar files in a single HDFS directory and use that directory as your input path.
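
A driver for such a map-only job might be wired up roughly like this (the class names follow the sketches above and all paths are placeholders):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// Hypothetical driver: the input path is simply the HDFS directory holding the tars.
public class ExtractTarsJob {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("extract.output.dir", "/data/extracted"); // where the mapper writes the entries

        Job job = Job.getInstance(conf, "extract-tars");
        job.setJarByClass(ExtractTarsJob.class);
        job.setInputFormatClass(WholeFileInputFormat.class);
        job.setMapperClass(TarExtractMapper.class);
        job.setNumReduceTasks(0); // map-only job
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(NullWritable.class);

        FileInputFormat.addInputPath(job, new Path("/data/tars"));        // directory of tar files
        FileOutputFormat.setOutputPath(job, new Path("/data/job-output"));

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}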


9 Comments

This sounds like a good solution. I will try to implement it.
If the tars are few and very large, the small number of mappers will be a performance bottleneck.
The question states "I have a large number of compressed tar files".
That's what I wonder about: what is "large" in the OP's understanding?
Probably the better idea would be to use one job to decompress the tars and a second job to process the unpacked files.

DistCp moves files from one place to another; you can take a look at its docs, but I don't think it offers any decompress or unpack capability. If a file is bigger than main memory, you will probably get out-of-memory errors. 8 GB is not very big for a Hadoop cluster; how many machines do you have?

2 Comments

I don't think I can use DistCp, but the way it works should be similar to what I need. I am studying its source code (I don't fully understand it yet), and it seems that the mapper receives the input file name and the status of the input files as key/value pairs and performs the copying. But maybe you are right and I should just process the whole open file.
OK, I think DistCp creates a list of the files and directories to be copied.
