
I have a large number of compressed tar files, where each tar itself contains several files. I want to extract these files, and I want to use Hadoop or a similar technique to speed up the processing. Are there any tools for this kind of problem? As far as I know, Hadoop and similar frameworks like Spark or Flink do not operate on files directly and do not give you direct access to the filesystem. I also want to do some basic renaming of the extracted files and move them into appropriate directories.

I can imagine a solution where one creates a list of all tar files. This list is then passed to the mappers, and each mapper extracts one tar from the list. Is this a reasonable approach?
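
Roughly, I picture something like the following sketch (just to illustrate the idea; it assumes a plain-text file listing the HDFS paths of the tars, fed to the job with NLineInputFormat so that each mapper receives one path):

import java.io.IOException;
import java.io.InputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Sketch only: each mapper receives one line of the list file, i.e. the HDFS path of one tar.
public class TarListMapper extends Mapper<LongWritable, Text, Text, NullWritable> {
    @Override
    protected void map(LongWritable offset, Text tarPath, Context context)
            throws IOException, InterruptedException {
        FileSystem fs = FileSystem.get(context.getConfiguration());
        try (InputStream in = fs.open(new Path(tarPath.toString()))) {
            // ... extract the tar here, rename the entries, and write them back to HDFS ...
        }
        context.write(tarPath, NullWritable.get()); // record which tar was handled
    }
}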

    You would probably have to write a custom input format for Hadoop, Flink, or Spark to implement this. In the InputFormat code you can handle the files however you want. Commented Aug 5, 2015 at 8:48

2 Answers


It is possible to instruct MapReduce to use an input format where the input to each mapper is a single whole file (from https://code.google.com/p/hadoop-course/source/browse/HadoopSamples/src/main/java/mr/wholeFile/WholeFileInputFormat.java?r=3):

public class WholeFileInputFormat extends FileInputFormat<NullWritable, BytesWritable> {

    @Override
    protected boolean isSplitable(JobContext context, Path filename) {
        return false;
    }

    @Override
    public RecordReader<NullWritable, BytesWritable> createRecordReader(
            InputSplit inputSplit, TaskAttemptContext context)
            throws IOException, InterruptedException {
        WholeFileRecordReader reader = new WholeFileRecordReader();
        reader.initialize(inputSplit, context);
        return reader;
    }
}
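
The input format above relies on a WholeFileRecordReader, which is not shown here. A minimal sketch of what such a reader typically looks like (my own sketch, not the code from the linked project):

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;

// Reads the entire (unsplit) file into a single BytesWritable value.
public class WholeFileRecordReader extends RecordReader<NullWritable, BytesWritable> {
    private FileSplit split;
    private Configuration conf;
    private final BytesWritable value = new BytesWritable();
    private boolean processed = false;

    @Override
    public void initialize(InputSplit inputSplit, TaskAttemptContext context) {
        this.split = (FileSplit) inputSplit;
        this.conf = context.getConfiguration();
    }

    @Override
    public boolean nextKeyValue() throws IOException {
        if (processed) {
            return false;
        }
        byte[] contents = new byte[(int) split.getLength()];
        Path file = split.getPath();
        FileSystem fs = file.getFileSystem(conf);
        try (FSDataInputStream in = fs.open(file)) {
            IOUtils.readFully(in, contents, 0, contents.length);
        }
        value.set(contents, 0, contents.length);
        processed = true;
        return true;
    }

    @Override
    public NullWritable getCurrentKey() {
        return NullWritable.get();
    }

    @Override
    public BytesWritable getCurrentValue() {
        return value;
    }

    @Override
    public float getProgress() {
        return processed ? 1.0f : 0.0f;
    }

    @Override
    public void close() {
        // nothing to clean up; the input stream is closed in nextKeyValue()
    }
}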

Then, in your mapper, you can use the Apache Commons Compress library to unpack the tar file: https://commons.apache.org/proper/commons-compress/examples.html
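
Such a mapper could look roughly like this (a sketch under some assumptions: the tars are gzip-compressed, each entry is written back to HDFS under a target directory read from a hypothetical "extract.output.dir" configuration property, and any renaming would happen where the target path is built):

import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.OutputStream;
import org.apache.commons.compress.archivers.tar.TarArchiveEntry;
import org.apache.commons.compress.archivers.tar.TarArchiveInputStream;
import org.apache.commons.compress.compressors.gzip.GzipCompressorInputStream;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Unpacks one gzipped tar per map call; the whole tar arrives as the BytesWritable value
// produced by WholeFileInputFormat.
public class TarExtractMapper extends Mapper<NullWritable, BytesWritable, Text, NullWritable> {
    @Override
    protected void map(NullWritable key, BytesWritable value, Context context)
            throws IOException, InterruptedException {
        Configuration conf = context.getConfiguration();
        FileSystem fs = FileSystem.get(conf);
        Path outDir = new Path(conf.get("extract.output.dir", "/data/extracted"));

        try (TarArchiveInputStream tar = new TarArchiveInputStream(
                new GzipCompressorInputStream(
                        new ByteArrayInputStream(value.getBytes(), 0, value.getLength())))) {
            TarArchiveEntry entry;
            while ((entry = tar.getNextTarEntry()) != null) {
                if (entry.isDirectory()) {
                    continue;
                }
                // Basic renaming/relocation of the extracted file can happen here.
                Path target = new Path(outDir, entry.getName());
                try (OutputStream out = fs.create(target)) {
                    IOUtils.copyBytes(tar, out, conf, false);
                }
                context.write(new Text(target.toString()), NullWritable.get());
            }
        }
    }
}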

You don't need to pass a list of files to Hadoop; just put all the tar files in a single HDFS directory and use that directory as your input path.
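
A driver for such a map-only job might be wired up roughly like this (the class names follow the sketches above and all paths are placeholders):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// Hypothetical driver: the input path is simply the HDFS directory holding the tars.
public class ExtractTarsJob {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("extract.output.dir", "/data/extracted"); // where the mapper writes the entries

        Job job = Job.getInstance(conf, "extract-tars");
        job.setJarByClass(ExtractTarsJob.class);
        job.setInputFormatClass(WholeFileInputFormat.class);
        job.setMapperClass(TarExtractMapper.class);
        job.setNumReduceTasks(0); // map-only job
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(NullWritable.class);

        FileInputFormat.addInputPath(job, new Path("/data/tars"));        // directory of tar files
        FileOutputFormat.setOutputPath(job, new Path("/data/job-output"));

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}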


9 Comments

This sounds like a good solution. I will try to implement it.
If the tars are few and very large, the small number of mappers will be a performance bottleneck.
The question states "I have a large number of compressed tar files".
That's what I wonder about: what is "large" in the OP's understanding?
Probably the better idea would be to use one job to decompress the tars and a second job to process the unpacked files.

DistCp moves files from one place to another; you can take a look at its docs, but I don't think it offers any decompress or unpack capability. If a file is bigger than main memory, you will probably get out-of-memory errors. 8 GB is not very big for a Hadoop cluster; how many machines do you have?

2 Comments

I don't think I can use DistCp, but the way it works should be similar to what I need. I am studying its source code (I don't fully understand it yet), and it seems that the mapper receives the input file name and the status of the input files as key/value pairs and performs the copying. But maybe you are right and I should just process the whole open file.
OK, I think DistCp creates a list of the files and directories to be copied.
