I have a large number of compressed tar files, where each tar itself contains several files. I want to extract these files, and I would like to use Hadoop or a similar technique to speed up the processing. Are there any tools for this kind of problem? As far as I know, Hadoop and similar frameworks like Spark or Flink do not operate on individual files directly and don't give you direct access to the filesystem. I also want to do some basic renaming of the extracted files and move them into appropriate directories.
I can imagine a solution where one creates a list of all tar files. This list is then passed to the mappers, and each mapper extracts a single tar file from the list. Is this a reasonable approach?