
I've got an xml-file containing the directory structure for files I want to put into a tar.gz file (flattened).

How should I parse the xml to extract the path for each file?

Right now I'm using lxml and finding the paths like this:

    paths = []
    for case in root.iter('case'):
        for language in case.iter('language'):
            for result in language.iter('result'):
                for file in result.iter('file'):
                    paths.append('/'.join([node.get('id') for node in [case, language, result, file]]))

But this feels too hardcoded, and it breaks if the structure changes.

I can find each file-node with root.iter('file'), but how can I get all parents/directories for each node/file? Or should I do this a (completely?) different way?

The xml looks like this:

    <?xml version="1.0" encoding="UTF-8"?>
    <files batch="regular">
        <case id="case_10_some_description">
            <language id="english">
                <result id="images">
                    <file id="screenshot_1.png"/>
                    <file id="screenshot_2.png"/>
                    <file id="screenshot_3.png"/>
                    <file id="screenshot_4.png"/>
                    <file id="screenshot_5.png"/>
                    <file id="screenshot_6.png"/>
                </result>
            </language>
        </case>
        <case id="case_12_some_description">
            <language id="english">
                <result id="images">
                    <file id="screenshot_1.png"/>
                    <file id="screenshot_2.png"/>
                    <file id="screenshot_3.png"/>
                </result>
            </language>
        </case>
    </files>

And these are the files:

    regular/case_10_some_description/english/images/screenshot_1.png
    regular/case_10_some_description/english/images/screenshot_2.png
    regular/case_10_some_description/english/images/screenshot_3.png
    regular/case_10_some_description/english/images/screenshot_4.png
    regular/case_10_some_description/english/images/screenshot_5.png
    regular/case_10_some_description/english/images/screenshot_6.png
    regular/case_12_some_description/english/images/screenshot_1.png
    regular/case_12_some_description/english/images/screenshot_2.png
    regular/case_12_some_description/english/images/screenshot_3.png

2 Answers


Do you create this file schema yourself? If you can change it, I definitely would. Try something like this:

    <?xml version="1.0" encoding="UTF-8"?>
    <Directory id="regular">
        <Directory id="case_10_some_description">
            <Directory id="english">
                <Directory id="images">
                    <file id="screenshot_1.png"/>
                    <file id="screenshot_2.png"/>
                    <file id="screenshot_3.png"/>
                    <file id="screenshot_4.png"/>
                    <file id="screenshot_5.png"/>
                    <file id="screenshot_6.png"/>
                </Directory>
            </Directory>
        </Directory>
        <Directory id="case_12_some_description">
            <Directory id="english">
                <Directory id="images">
                    <file id="screenshot_1.png"/>
                    <file id="screenshot_2.png"/>
                    <file id="screenshot_3.png"/>
                </Directory>
            </Directory>
        </Directory>
    </Directory>

Always give tags the same name if they have the same meaning. Consider distinguishing nodes by attributes rather than by tag names; that would make your parsing easier.
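To illustrate why the uniform schema is easier to parse, here is a minimal sketch of a recursive walk over it. It uses the standard library's `xml.etree.ElementTree`; the `collect_paths` helper and the shortened `SAMPLE` string are made up for this example and are not part of the original question.

```python
import xml.etree.ElementTree as ET

# Shortened copy of the uniform schema suggested above (sample data for this sketch).
SAMPLE = """<Directory id="regular">
  <Directory id="case_10_some_description">
    <Directory id="english">
      <Directory id="images">
        <file id="screenshot_1.png"/>
        <file id="screenshot_2.png"/>
      </Directory>
    </Directory>
  </Directory>
</Directory>"""


def collect_paths(node, prefix=()):
    """Yield one slash-joined path for every <file> at or below node."""
    here = prefix + (node.get("id"),)
    if node.tag == "file":
        yield "/".join(here)
        return
    # Any other element is treated as a directory, however deeply nested.
    for child in node:
        yield from collect_paths(child, here)


paths = list(collect_paths(ET.fromstring(SAMPLE)))
print("\n".join(paths))
```

Because the walk only distinguishes `file` from everything else, adding or removing directory levels should not require touching the code.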

    import xml.etree.ElementTree as ET

    tree = ET.parse('sample.xml')
    root = tree.getroot()
    for file in root.iter('file'):
        print('regular/case_10_some_description/english/images/' + file.attrib['id'])

1 Comment

Thanks for the answer, but this is more hardcoded than the solution I want to get rid of. This only works for the first case also.
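A less hardcoded variant might look up each file's ancestors instead of spelling out the prefix. The sketch below keeps the original schema and uses only the standard library, which has no parent pointers, so it builds a child-to-parent map first. The names `parent_of` and `path_for` and the shortened `SAMPLE` are made up for this example. With lxml, as used in the question, `node.iterancestors()` should give the same chain without the map.

```python
import xml.etree.ElementTree as ET

# Shortened copy of the XML from the question (sample data for this sketch).
SAMPLE = """<files batch="regular">
  <case id="case_12_some_description">
    <language id="english">
      <result id="images">
        <file id="screenshot_1.png"/>
      </result>
    </language>
  </case>
</files>"""

root = ET.fromstring(SAMPLE)

# ElementTree elements do not know their parent, so build the lookup once.
parent_of = {child: parent for parent in root.iter() for child in parent}


def path_for(node):
    """Join the names of node and all its ancestors, root first."""
    parts = []
    while node is not None:
        # The root element stores its name in "batch" rather than "id".
        parts.append(node.get("id") or node.get("batch"))
        node = parent_of.get(node)
    return "/".join(reversed(parts))


paths = [path_for(f) for f in root.iter("file")]
print("\n".join(paths))
```

This does not depend on the tag names `case`, `language`, or `result`, only on `file` and on the attributes that carry the names.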
