Read file till special char, copy that section into another file, and continue till eof

Question

I am trying to read a file on Linux. As soon as a "&" character is encountered, I want to write what has been read so far to another file, send that file to another folder, and then continue reading the original file until the next "&", and so on.

Input XML file:

<Document>
<tag1>
<tag2>
</Document>
&
<Document>
<tag3>
<tag4>
</Document>
&
<Document>
<tag5>
<tag6>
</Document>

My code snippet:

while IFS= read -r line;do
    if [["$line" =="$delimeter"]];then
        echo "$line" | sed "s/delimeter.*//">> "$output_file"
        cp "$output_file" "$TARGET_FOLDER"
        break
    else
        echo "$line" >> "$output_file"
    fi
done < "$input_file"

However, the code writes the entire file to the output instead of splitting it at each occurrence of the delimiter. Could someone point out where I'm going wrong?

Expected Output - The first <Document> to </Document> (till &) section is put in output file, which is copied to TARGET_FOLDER. Then the next <Document> to </Document> section is copied and so on.

Thank you for your help!

Can you share a valid XML sample with closing nodes? I will then show how to parse XML properly. Please also add your expected output and edit your post accordingly. Tools to parse XML in a shell: xidel, xmlstarlet, xmllint
– Gilles Quénot, Commented Oct 11, 2023 at 5:04
Hi Gilles, correct me if I'm wrong, but do we need an XML parser here? Is there no way to just read until the character "&" and copy that text to another file?
– python6, Commented Oct 11, 2023 at 5:11
We don't have sufficient information; I made recommendations.
– Gilles Quénot, Commented Oct 11, 2023 at 5:13

Stéphane Chazelas · Accepted Answer · 2023-10-11 05:30:21Z

Sounds like a job for csplit:

mkdir -p target && csplit -f target/output. your-file '/^&$/' '{*}'

That would create target/output.00, target/output.01... files, splitting on lines that consist solely of &.

If you just want one target/output file with the & lines removed, then that's just:

grep -vx '&' < your-file > target/output

Or if it's to send to an output file in target.xx directories:

csplit -f '' -b target.%02d/output your-file '/^&$/' '{*}'

Though note that the target.00..target.n directories must exist beforehand.
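One way to satisfy that precondition, as a rough sketch: count the delimiter lines first, since n delimiter lines yield n+1 pieces numbered 00 to n, and create that many directories before running csplit. The sample input and the scratch directory are made up for the demonstration, and '{*}' and -b are GNU csplit features.

```shell
# Hypothetical sample input in a scratch directory (illustration only).
workdir=$(mktemp -d) && cd "$workdir" || exit 1
printf '%s\n' '<Document>' '<tag1>' '</Document>' '&' \
              '<Document>' '<tag3>' '</Document>' > your-file

# n delimiter lines produce n+1 pieces, numbered 00..n.
n=$(grep -cx '&' your-file)

# Create target.00 .. target.n before csplit tries to write into them.
i=0
while [ "$i" -le "$n" ]; do
    mkdir -p "$(printf 'target.%02d' "$i")"
    i=$((i + 1))
done

# -s silences the per-file byte counts that csplit prints by default.
csplit -s -f '' -b 'target.%02d/output' your-file '/^&$/' '{*}'
ls target.*/output
```

The loop here only creates directories; the text itself is still processed by csplit rather than by the shell.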

In any case, you don't want to use a shell loop to process text.

Hi, thanks for your help. Your first suggestion is only generating one output file: output.00
– python6, Commented Oct 11, 2023 at 5:51
@python6 then the delimiter line likely contains more than just &. Maybe it has whitespace or invisible characters around it, such as a CR character if the file comes from the Microsoft world, and you'd need to adapt the regexp (/^&$/) accordingly (like /^[[:space:]]*&[[:space:]]*$/ to allow any amount of whitespace, including CR characters, on either side of the &).
– Stéphane Chazelas, Commented Oct 11, 2023 at 5:54
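A small sketch of that diagnosis and fix, assuming GNU csplit: od -c makes a stray CR visible as \r, and the whitespace-tolerant regexp then matches the delimiter line. The sample file and the scratch directory are invented for the demonstration. The --suppress-matched option is a GNU extension that drops the matched & lines, which would otherwise start each following piece.

```shell
# Hypothetical sample input with CRLF line endings (illustration only).
tmp=$(mktemp -d)
printf '%s\r\n' '<Document>' '<tag1>' '</Document>' '&' \
                '<Document>' '<tag3>' '</Document>' > "$tmp/your-file"

# Reveal invisible characters: each line ends in \r \n instead of just \n.
od -c "$tmp/your-file" | head -n 3

mkdir -p "$tmp/target"
# Allow whitespace (including CR) around the &; -s silences the byte counts.
csplit -s --suppress-matched -f "$tmp/target/output." "$tmp/your-file" \
    '/^[[:space:]]*&[[:space:]]*$/' '{*}'

ls "$tmp/target"
```

The pieces still contain the CR characters on every other line; removing those is a separate step.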

Gilles Quénot · Accepted Answer · 2023-10-11 05:26:34Z

With awk:

awk 'BEGIN{RS="&"}{print $0 > ++c".xml"}' file.xml
ls -ltr

This may break on things like & occurring elsewhere in the input.
– Kusalananda, Commented Oct 11, 2023 at 5:39
Regarding print $0 > ++c".xml": 1) you don't need the $0, as that's what awk prints by default; 2) an unparenthesized expression on the right side of input or output redirection is undefined behavior and will give you a syntax error in some awks; 3) not closing the output files as you go will lead to a "too many open files" error from most awks when you pass the system limit. It should be print > (++c".xml"); close(c".xml") or, better, out=++c".xml"; print > out; close(out) to be portable and robust.
– Ed Morton, Commented May 20, 2024 at 10:50
The output files will all start with a blank line, though (the one after the & in the input), so you should probably add a sub(/^\n/,"") before the print. They will also end with a blank line (there's a newline before the & and at the end of the file in the input, and ORS adds one by default), so you should set ORS to null and just use the one from the input as the terminating newline for each file, i.e. awk 'BEGIN{RS="&"; ORS=""}{out=++c".xml"; sub(/^\n/,""); print > out; close(out)}' file.xml. Obviously in GNU awk you could manipulate RS and RT to handle that instead.
– Ed Morton, Commented May 20, 2024 at 10:55
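Put together, the revisions suggested in these comments might look like the sketch below. The sample input and the scratch directory are invented for the demonstration, and the caveat about & appearing elsewhere in the input still applies, since RS="&" splits on every & rather than only on lines consisting of one.

```shell
# Hypothetical sample input in a scratch directory (illustration only).
workdir=$(mktemp -d) && cd "$workdir" || exit 1
printf '%s\n' '<Document>' '<tag1>' '</Document>' '&' \
              '<Document>' '<tag3>' '</Document>' > file.xml

awk 'BEGIN { RS = "&"; ORS = "" }
{
    out = (++c) ".xml"   # build the file name first, so the redirection target is a plain variable
    sub(/^\n/, "")       # drop the newline that followed the & in the input
    print > out          # ORS is empty, so the record keeps its own trailing newline
    close(out)           # close each file to stay under the open-files limit
}' file.xml

# Show the numbered pieces (the pattern skips file.xml itself).
head -- [0-9]*.xml
```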


Ed Morton · Accepted Answer · 2024-05-20 11:16:16Z

Using any POSIX awk and accounting for the possibility of CRs or other white space around the &s and assuming you really only want to split the file where there are &s on their own lines, not if/when they appear mid-line as part of some other string:

mkdir -p "$TARGET_FOLDER" &&
awk '
    /^[[:space:]]*&[[:space:]]*$/ { close(out); out=""; next }
    !out { out=ENVIRON["TARGET_FOLDER"] "/" FILENAME "_out" (++cnt) }
    { print > out }
' file

$ cd "$TARGET_FOLDER"
$ head file_out*
==> file_out1 <==
<Document>
<tag1>
<tag2>
</Document>

==> file_out2 <==
<Document>
<tag3>
<tag4>
</Document>

==> file_out3 <==
<Document>
<tag5>
<tag6>
</Document>

The above assumes your TARGET_FOLDER variable is exported, since it's all upper case (if it's not exported, don't make it all upper case). If it isn't exported, just change awk to TARGET_FOLDER="$TARGET_FOLDER" awk, all on one line, to set it in awk's environment anyway. By the way, they're called directories in Unix, not folders.
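To illustrate the difference, here is a minimal sketch of what awk's ENVIRON array sees in each case. The variable values and the lower-case name target_folder are made up for the demonstration.

```shell
unset TARGET_FOLDER target_folder

# A plain shell variable is not in the environment, so ENVIRON cannot see it.
target_folder=/tmp/demo
not_exported=$(awk 'BEGIN { print "[" ENVIRON["target_folder"] "]" }')

# Prefixing the command puts the variable in the environment of that awk only.
TARGET_FOLDER=/tmp/demo
one_off=$(TARGET_FOLDER="$TARGET_FOLDER" awk 'BEGIN { print "[" ENVIRON["TARGET_FOLDER"] "]" }')

# After export, every later child process inherits it.
export TARGET_FOLDER
exported=$(awk 'BEGIN { print "[" ENVIRON["TARGET_FOLDER"] "]" }')

printf '%s\n' "$not_exported" "$one_off" "$exported"
# prints [] and then [/tmp/demo] twice
```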
