
I am trying to read a file in Linux and, as soon as a "&" character is encountered, write the output so far to another file, send that file to another folder, and then continue reading the original file until the next "&", and so on.

Input XML file:

<Document> <tag1> <tag2> </Document>
&
<Document> <tag3> <tag4> </Document>
&
<Document> <tag5> <tag6> </Document>

My code snippet:

while IFS= read -r line; do
    if [["$line" =="$delimeter"]]; then
        echo "$line" | sed "s/delimeter.*//" >> "$output_file"
        cp "$output_file" "$TARGET_FOLDER"
        break
    else
        echo "$line" >> "$output_file"
    fi
done < "$input_file"

However, the code produces the entire file as output instead of splitting at each occurrence of the delimiter. Could someone point out where I'm going wrong?

Expected output: the first <Document> to </Document> section (up to the &) is put in the output file, which is copied to TARGET_FOLDER. Then the next <Document> to </Document> section is copied, and so on.

Thank you for your help!

  • Is it XML or HTML? Commented Oct 11, 2023 at 4:36
  • Hi, it is XML.. Commented Oct 11, 2023 at 4:37
  • Can you share a valid XML sample with closing nodes? I will share how to parse XML properly. And add your expected output. Please edit your post accordingly. Tools to parse XML in a shell: xidel, xmlstarlet, xmllint Commented Oct 11, 2023 at 5:04
  • Hi Gilles, correct me if I'm wrong, but do we need an XML parser here? Is there no way to just read up to the character "&" and copy that text to another file? Commented Oct 11, 2023 at 5:11
  • We don't have sufficient information, I made recommendations Commented Oct 11, 2023 at 5:13

3 Answers


Sounds like a job for csplit:

mkdir -p target && csplit -f target/output. your-file '/^&$/' '{*}' 

This would create target/output.00, target/output.01, ... files, splitting on lines that consist solely of &.
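To see the effect on data shaped like the question's, here is a small sketch that builds a sample file in a scratch directory and splits it. The sample contents and the scratch directory are assumptions for illustration, with each & taken to sit on its own line. csplit starts each new piece at the matching line, so pieces after the first would otherwise begin with the & line; the --suppress-matched option, which is specific to GNU csplit, drops those lines.

```shell
# Scratch directory and sample input are hypothetical, for illustration only.
work=$(mktemp -d)
cd "$work" || exit 1
printf '%s\n' \
  '<Document> <tag1> <tag2> </Document>' '&' \
  '<Document> <tag3> <tag4> </Document>' '&' \
  '<Document> <tag5> <tag6> </Document>' > your-file

mkdir -p target
# -s                  : do not print the byte count of each piece
# --suppress-matched  : GNU extension, leaves the & lines out of the pieces
csplit -s --suppress-matched -f target/output. your-file '/^&$/' '{*}'

ls target
```

Each file under target/ should then hold one <Document> section, without the & separator.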

If you just want one target/output file with the & lines removed, then that's just:

grep -vx '&' < your-file > target/output 

Or, if each piece should go to an output file in its own target.xx directory:

csplit -f '' -b target.%02d/output your-file '/^&$/' '{*}' 

Though note that the target.00..target.n directories must exist beforehand.
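One possible way to create those directories first is sketched below. It assumes the number of pieces equals the number of & lines plus one, and it uses a hypothetical sample file in a scratch directory; adapt the names to your own setup.

```shell
# Scratch directory and sample input are hypothetical, for illustration only.
work=$(mktemp -d)
cd "$work" || exit 1
printf '%s\n' \
  '<Document> <tag1> <tag2> </Document>' '&' \
  '<Document> <tag3> <tag4> </Document>' '&' \
  '<Document> <tag5> <tag6> </Document>' > your-file

# Count the separator lines; grep -c exits non-zero when the count is 0,
# so "|| true" keeps this safe under "set -e".
n=$(grep -cx '&' your-file || true)

# Create target.00 .. target.n (one directory per expected piece).
i=0
while [ "$i" -le "$n" ]; do
  mkdir -p "$(printf 'target.%02d' "$i")"
  i=$((i + 1))
done

# Empty prefix, and a suffix format that places each piece in its directory.
csplit -s -f '' -b 'target.%02d/output' your-file '/^&$/' '{*}'
```

The loop here only creates directories; the text itself is still processed by csplit.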

In any case, you don't want to use a shell loop to process text.

  • Hi, thanks for your help, the first suggestion from you is only generating one output file - output.00 Commented Oct 11, 2023 at 5:51
  • @python6 then the delimiter line likely contains not only &. Maybe it has whitespace or invisible characters around it such as a CR character if it comes from the Microsoft world, and you'd need to adapt the regexp (/^&$/) accordingly (like /^[[:space:]]*&[[:space:]]*$/ to allow any amount of whitespace including CR characters on either side of the &). Commented Oct 11, 2023 at 5:54
  • great, works now Commented Oct 11, 2023 at 6:04
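
A sketch of the adapted pattern from the comment above, run against a hypothetical sample with CRLF line endings (the scratch directory and sample contents are assumptions). With the stricter /^&$/ the separator line would not match, because it ends in a CR character.

```shell
# Scratch directory and sample input are hypothetical, for illustration only.
work=$(mktemp -d)
cd "$work" || exit 1
# Every line ends in CR LF, as in a file coming from Windows.
printf '%s\r\n' \
  '<Document> <tag1> <tag2> </Document>' '&' \
  '<Document> <tag3> <tag4> </Document>' > your-file

mkdir -p target
# Allow any amount of whitespace, including CR, on either side of the &.
csplit -s -f target/output. your-file '/^[[:space:]]*&[[:space:]]*$/' '{*}'

ls target
```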

With awk:

awk 'BEGIN{RS="&"}{print $0 > ++c".xml"}' file.xml
ls -ltr
  • This may break on things like &amp; occurring elsewhere in the input. Commented Oct 11, 2023 at 5:39
  • Regarding print $0 > ++c".xml" - 1) you don't need the $0 as that's what awk prints by default, 2) an unparenthesized expression on the right side of input or output redirection is undefined behavior and will give you a syntax error in some awks, 3) not closing the output files as you go will lead to a "too many open files" error from most awks when you pass the system limit. It should be print > (++c".xml"); close(c".xml") or better out=++c".xml"; print > out; close(out) to be portable and robust. Commented May 20, 2024 at 10:50
  • The output files will all start with a blank line, though, (the one after the & in the input) so you should probably add a sub(/^\n/,"") before the print, and they will end with a blank line (there's one before the & and at the end of the file in the input and ORS will add one by default) so you should set ORS to null and just use the one from the input as the terminating newline for each file, i.e. awk 'BEGIN{RS="&"; ORS=""}{out=++c".xml"; sub(/^\n/,""); print > out; close(out)}' file.xml. Obviously in GNU awk you could manipulate RS and RT to handle that instead. Commented May 20, 2024 at 10:55
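
Putting the suggestions from these comments together, a runnable sketch might look like the following. The scratch directory and sample file are assumptions for illustration, and the caveat about & appearing elsewhere (for example in &amp;) still applies, since every & ends a record.

```shell
# Scratch directory and sample input are hypothetical, for illustration only.
work=$(mktemp -d)
cd "$work" || exit 1
printf '%s\n' \
  '<Document> <tag1> <tag2> </Document>' '&' \
  '<Document> <tag3> <tag4> </Document>' '&' \
  '<Document> <tag5> <tag6> </Document>' > file.xml

awk '
  BEGIN { RS = "&"; ORS = "" }   # split records on &, add no extra newline
  {
    out = (++c) ".xml"           # parenthesized to avoid ambiguous parsing
    sub(/^\n/, "")               # drop the newline that followed the &
    print > out
    close(out)                   # avoid running out of file descriptors
  }
' file.xml

ls -ltr
```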

Using any POSIX awk and accounting for the possibility of CRs or other white space around the &s and assuming you really only want to split the file where there are &s on their own lines, not if/when they appear mid-line as part of some other string:

mkdir -p "$TARGET_FOLDER" &&
awk '
    /^[[:space:]]*&[[:space:]]*$/ { close(out); out=""; next }
    !out { out=ENVIRON["TARGET_FOLDER"] "/" FILENAME "_out" (++cnt) }
    { print > out }
' file

$ cd "$TARGET_FOLDER"
$ head file_out*
==> file_out1 <==
<Document> <tag1> <tag2> </Document>

==> file_out2 <==
<Document> <tag3> <tag4> </Document>

==> file_out3 <==
<Document> <tag5> <tag6> </Document>

The above assumes your TARGET_FOLDER variable is exported, since it's all upper case (if it's not exported, don't make it all upper case). If that's not true, then just change awk to TARGET_FOLDER="$TARGET_FOLDER" awk, all on one line, to set it in awk's environment anyway. By the way, they're called directories in Unix, not folders.
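
For the non-exported case, the variant described above could be sketched like this. The scratch directory, the sample input and the "out" directory name are assumptions for illustration; the awk script itself is unchanged.

```shell
# Scratch directory and sample input are hypothetical, for illustration only.
work=$(mktemp -d)
cd "$work" || exit 1
printf '%s\n' \
  '<Document> <tag1> <tag2> </Document>' '&' \
  '<Document> <tag3> <tag4> </Document>' '&' \
  '<Document> <tag5> <tag6> </Document>' > file

TARGET_FOLDER=$work/out   # a plain shell variable, deliberately not exported

mkdir -p "$TARGET_FOLDER" &&
# The assignment prefix puts the variable into the environment of awk only.
TARGET_FOLDER="$TARGET_FOLDER" awk '
    /^[[:space:]]*&[[:space:]]*$/ { close(out); out=""; next }
    !out { out=ENVIRON["TARGET_FOLDER"] "/" FILENAME "_out" (++cnt) }
    { print > out }
' file

ls "$TARGET_FOLDER"
```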
