I'm working on a directory with 27 folders and various XML files inside each folder.

So far I was able to:

  • parse one XML file and write to a CSV file
  • traverse one folder and read and parse all XML files in it

Challenge:

  • I can't work out how to go from the root folder down into its subfolders and parse all the XML files there

Any help is appreciated, thank you. Code snippet below:

    # working in one folder only
    import csv
    import xml.etree.ElementTree as ET
    import os

    ## directory
    path = "/Users.../y"
    filenames = []

    ## Count the number of xml files of each folder
    files = os.listdir(path)
    print("\n")

    xml_data_to_csv = open('/Users.../xml_extract.csv', 'w')
    list_head = []
    csvwriter = csv.writer(xml_data_to_csv)

    # Read XML files in a folder
    for filename in os.listdir(path):
        if not filename.endswith('.xml'):
            continue
        fullname = os.path.join(path, filename)
        print("\n", fullname)
        filenames.append(fullname)

    # parse elements in each XML file
    for filename in filenames:
        tree = ET.parse(filename)
        root = tree.getroot()
        extract_xml = []

        ## extract child elements per xml file
        print("\n")
        for x in root.iter('Info'):
            for element in x:
                print(element.tag, element.text)
                extract_xml.append(element.text)

        ## Write list nodes to csv
        csvwriter.writerow(extract_xml)

    ## Close CSV file
    xml_data_to_csv.close()

3 Answers

You can get the list of all the XML files in a given path with

    import os

    path = "main/root"
    filelist = []

    for root, dirs, files in os.walk(path):
        for file in files:
            if not file.endswith('.xml'):
                continue
            filelist.append(os.path.join(root, file))

    for file in filelist:
        print(file)  # or in your case parse the XML 'file'

For instance, given this tree:

    $ tree /main/root
    /main/root
    ├── a
    │   ├── a.xml
    │   ├── b.xml
    │   └── c.xml
    ├── b
    │   ├── d.xml
    │   ├── e.xml
    │   └── x.txt
    └── c
        ├── f.xml
        └── g.xml

we get:

    /main/root/c/g.xml
    /main/root/c/f.xml
    /main/root/b/e.xml
    /main/root/b/d.xml
    /main/root/a/c.xml
    /main/root/a/b.xml
    /main/root/a/a.xml

If you want to sort directories and files:

    for root, dirs, files in os.walk(path):
        dirs.sort()
        for file in sorted(files):
            if not file.endswith('.xml'):
                continue
            filelist.append(os.path.join(root, file))
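
The in-place `dirs.sort()` matters because `os.walk`, in its default top-down mode, reads that same list to decide which subfolders to visit next; sorting a copy would leave the traversal order unchanged. A minimal sketch of this, assuming a hypothetical helper name `find_xml_files` and a throwaway directory tree built only for the demonstration:

```python
import os
import tempfile


def find_xml_files(root_dir):
    """Return all .xml paths under root_dir in a deterministic order."""
    found = []
    for root, dirs, files in os.walk(root_dir):
        dirs.sort()  # in place: os.walk reads this list to pick the next subfolders
        for name in sorted(files):
            if name.endswith(".xml"):
                found.append(os.path.join(root, name))
    return found


# Build a small throwaway tree to try it on (hypothetical folder and file names).
with tempfile.TemporaryDirectory() as tmp:
    for folder, name in [("b", "d.xml"), ("a", "b.xml"), ("a", "a.xml"), ("b", "x.txt")]:
        os.makedirs(os.path.join(tmp, folder), exist_ok=True)
        open(os.path.join(tmp, folder, name), "w").close()

    # Paths relative to the temporary root, in the order they were found
    demo_paths = [os.path.relpath(p, tmp) for p in find_xml_files(tmp)]
    print(demo_paths)
```

Folder `a` is listed before `b` and the `.txt` file is skipped, regardless of the order in which the files were created.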

3 Comments

Thanks for clearing this up. Thank you all so much. In love with you guys already.
Also, is there a workflow or method to include a header or column name during parsing when saving it to csv?
You should create a new post for this.

You can use os.walk:

    import os

    for dir_name, dirs, files in os.walk('<root_dir>'):
        for file_name in files:
            # parse os.path.join(dir_name, file_name) here
            print(os.path.join(dir_name, file_name))
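
Filling in the parsing step for the question's case could look roughly like this. It is only a sketch: the function name `xml_tree_to_csv` and its parameters are made up for illustration, and it assumes, as in the question, that the values of interest are the children of each `Info` element.

```python
import csv
import os
import xml.etree.ElementTree as ET


def xml_tree_to_csv(root_dir, csv_path):
    """Walk root_dir recursively and write one CSV row per XML file.

    Returns the number of rows written. Hypothetical helper, not a library API.
    """
    rows_written = 0
    # newline="" is what the csv module expects for files it writes to
    with open(csv_path, "w", newline="") as out:
        writer = csv.writer(out)
        for dir_name, dirs, files in os.walk(root_dir):
            dirs.sort()  # visit subfolders in a stable order
            for name in sorted(files):
                if not name.endswith(".xml"):
                    continue
                tree = ET.parse(os.path.join(dir_name, name))
                # Collect the text of every child of every <Info> element
                row = [child.text for info in tree.getroot().iter("Info") for child in info]
                writer.writerow(row)
                rows_written += 1
    return rows_written
```

It would be called as `xml_tree_to_csv('<root_dir>', 'xml_extract.csv')`, with both paths replaced by your own.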

1 Comment

thank you all so much. in love with you guys already

You can use the pathlib module to "glob" the XML files. It will search all subdirectories for the pattern you supply and return Path objects that already include the path to the file. Cleaning up your script a bit, you would have

    import csv
    import xml.etree.ElementTree as ET
    from pathlib import Path

    ## directory
    path = Path("/Users.../y")

    # newline='' is what the csv module expects for files it writes to
    with open('/Users.../xml_extract.csv', 'w', newline='') as xml_data_to_csv:
        csvwriter = csv.writer(xml_data_to_csv)

        # Read XML files in the folder and all of its subfolders
        for filepath in path.glob("**/*.xml"):
            tree = ET.parse(filepath)  # use the loop variable, not 'filename'
            root = tree.getroot()
            extract_xml = []

            ## extract child elements per xml file
            print("\n")
            for x in root.iter('Info'):
                for element in x:
                    print(element.tag, element.text)
                    extract_xml.append(element.text)

            ## Write list nodes to csv
            csvwriter.writerow(extract_xml)
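
As a side note, `Path.rglob("*.xml")` is shorthand for `Path.glob("**/*.xml")`, and wrapping either in `sorted()` gives a repeatable order, since glob results are not guaranteed to come back sorted. A small sketch, assuming a hypothetical helper name `list_xml_files` and a throwaway tree created just to try it:

```python
import tempfile
from pathlib import Path


def list_xml_files(root_dir):
    """Return every .xml file under root_dir (any depth) as sorted Path objects."""
    # rglob("*.xml") is shorthand for glob("**/*.xml")
    return sorted(Path(root_dir).rglob("*.xml"))


# Try it on a throwaway tree (hypothetical folder and file names).
with tempfile.TemporaryDirectory() as tmp:
    base = Path(tmp)
    (base / "a" / "deep").mkdir(parents=True)
    (base / "top.xml").touch()
    (base / "a" / "one.xml").touch()
    (base / "a" / "deep" / "two.xml").touch()
    (base / "a" / "notes.txt").touch()

    # Paths relative to the temporary root, using forward slashes
    demo_names = [p.relative_to(base).as_posix() for p in list_xml_files(base)]
    # Check that the explicit "**/" pattern finds the same files
    same_as_glob = sorted(base.glob("**/*.xml")) == list_xml_files(base)
    print(demo_names)
```

Files at the top level, one level down, and two levels down are all picked up, while the `.txt` file is ignored.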

6 Comments

thank you all so much. in love with you guys already
Any suggestions or workarounds for including a header or column names in the CSV?
@Panda, right after creating csvwriter, you can write the header with csvwriter.writerow(["column 1", "column 2", "column 3"]) ... replacing with your names of course.
tried this but it's giving me the column names on each row (like a loop)
@Panda What column names do you want? I can add it to the example.
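
On the header question: column names end up repeated on every row when the header `writerow` call sits inside the loop over files. Writing it once, right after the writer is created and before the loop starts, avoids that. A small sketch, with made-up column names and made-up rows standing in for the values parsed from each XML file, and an in-memory buffer standing in for the CSV file:

```python
import csv
import io

# Hypothetical column names and rows; in the real script each row would be
# the extract_xml list built from one XML file.
header = ["column 1", "column 2", "column 3"]
rows = [["a1", "a2", "a3"], ["b1", "b2", "b3"]]

buffer = io.StringIO()  # stands in for the opened CSV file
csvwriter = csv.writer(buffer)

csvwriter.writerow(header)  # once, before the loop
for extract_xml in rows:
    csvwriter.writerow(extract_xml)  # once per XML file, inside the loop

csv_text = buffer.getvalue()
print(csv_text)
```

The same placement applies to the scripts above: the header line goes directly under `csvwriter = csv.writer(...)`, outside the `for` loop.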