How can I delete a line if it is longer than, e.g., 2048 chars?
- Do you insist on using sed? This is easy, for example in python. And no doubt even easier in perl. Though the question is not terribly well defined. Copy a file, removing all lines longer than 2048, or something else? – Faheem Mitha, Mar 23, 2011 at 18:21
9 Answers
sed '/^.\{2048\}./d' input.txt > output.txt
- I get the error message sed: 1: "/^.\{2048\}..*/d": RE error: invalid repetition count(s) (Mac OS X) – wedi, Oct 13, 2014 at 15:47
- @wedi you probably want to install the GNU version instead of the BSD version that ships with Mac. This is easy with brew. – Freedom_Ben, Jul 6, 2016 at 0:00
- The question says "if longer than XY (e.g., 2048 chars)". Then it must be > 2048 and not => 2048. – acgbox, Aug 28, 2019 at 13:53
- @ajcg, It is > 2048. Notice that there's an extra period in the end of the regex to match the 2049th character. – forcefsck, Aug 30, 2019 at 14:02
- @forcefsck and wouldn't it be better to take away the "^"? (With your command you are only removing lines that "start with XYZ", but if XYZ is in another part of the line then it does not delete it.) – acgbox, Aug 30, 2019 at 16:13
Here's a solution which deletes lines that have 2049 or more characters:
sed '/.\{2049\}/d' <file.in >file.out
The regular expression .\{2049\} would match any line that contains a substring of 2049 characters (another way of saying "at least 2049 characters"). The d command deletes them from the input, producing only shorter lines on the output.
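The point that an unanchored .\{2049\} already means "at least 2049 characters", and that a leading ^ changes nothing, can be checked outside of sed. This is a minimal Python sketch, not part of the answer itself. It uses Python's re module rather than sed, a limit of 5 instead of 2048 to keep the cases readable, and a made-up helper name, too_long.

```python
import re

LIMIT = 5  # stand-in for 2048, kept small for readability

# Unanchored: matches any line containing LIMIT + 1 consecutive characters.
unanchored = re.compile(r".{%d}" % (LIMIT + 1))
# Anchored variant, shaped like '/^.\{2048\}./d': LIMIT characters from the
# start of the line, plus one more.
anchored = re.compile(r"^.{%d}." % LIMIT)


def too_long(line: str) -> bool:
    """Return True if the line has more than LIMIT characters (no newline)."""
    return unanchored.search(line) is not None


for text in ["abcde", "abcdef", "abcdefghij"]:
    # Both patterns agree, because '.' matches any character on the line.
    assert (anchored.search(text) is not None) == too_long(text)
    print(len(text), too_long(text))
```

Since every character on the line matches '.', a run of LIMIT + 1 characters exists somewhere on the line exactly when one exists at the start, so both forms select the same lines.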
BSD sed (on e.g. macOS) can only handle repetition counts of up to 255 in the \{...\} operator (the value of RE_DUP_MAX; see getconf RE_DUP_MAX in the shell). On these systems, you may instead use awk:
awk 'length <= 2048' <file.in >file.out
Mimicking the sed solution literally with awk:
awk 'length >= 2049 { next } { print }' <file.in >file.out
Note that any awk implementation is only guaranteed to be able to handle records of lengths up to LINE_MAX bytes (see getconf LINE_MAX in the shell), but may support longer ones. On macOS, LINE_MAX is 2048.
perl -lne "length < 2048 && print" infile > outfile
- Does not work for me. Perl v5.16.2. Warning: Use of "length" without parentheses is ambiguous at -e line 1. Unterminated <> operator at -e line 1. – wedi, Oct 13, 2014 at 15:51
- You may try length($_) > 2048 && print. length is a shortcut for length($_) anyway. – MaratC, Nov 17, 2014 at 12:10
- Had to use ' instead of ". – Larsen, Sep 30, 2021 at 13:31
Something like this should work in Python.
with open("orig") as of, open("new", "w") as nf:
    for line in of:
        # len(line) counts the trailing newline, so strip it before comparing.
        # Keep lines of at most 2048 characters; drop anything longer.
        if len(line.rstrip("\n")) <= 2048:
            nf.write(line)
- Personally, @Faheem, I prefer your answer. The reason why is that it was very easy for me to turn it around into 'delete all lines smaller than x'. I don't use Python all the time, but when I do I always feel I should learn it well. – ixtmixilix, May 22, 2011 at 18:18
- @ixtmixilix: Yes, using a full featured language like Python is pretty flexible. Thanks for the comment. – Faheem Mitha, May 24, 2011 at 16:46
- If you love Python but also like using the CLI and not having to write and run a separate script for this task, check out pz: github.com/CZ-NIC/pz. It brings Python to shell pipes. For this question the solution would be cat input | pz 's if len(s) < 2048 else ""' > output – Chris, Feb 17, 2022 at 21:40
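As the comments note, the Python approach is easy to turn around into "keep only the long lines". This is one possible sketch of that idea, not the answerer's code. The function name filter_lines and its limit and keep_long parameters are made up for illustration, and the demo uses in-memory files with a small limit instead of 2048.

```python
import io


def filter_lines(src, dst, limit=2048, keep_long=False):
    """Copy lines from src to dst.

    By default, lines longer than `limit` characters are dropped.
    With keep_long=True the selection is inverted.
    """
    for line in src:
        # The trailing newline is not counted towards the length.
        is_long = len(line.rstrip("\n")) > limit
        if is_long == keep_long:
            dst.write(line)


# Demo with in-memory files and a tiny limit instead of 2048.
source = "short\n" + "x" * 20 + "\nalso short\n"

kept_short = io.StringIO()
filter_lines(io.StringIO(source), kept_short, limit=10)

kept_long = io.StringIO()
filter_lines(io.StringIO(source), kept_long, limit=10, keep_long=True)

print(repr(kept_short.getvalue()))
print(repr(kept_long.getvalue()))
```

Passing open file objects instead of io.StringIO should work the same way, since the function only iterates over src and calls dst.write.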
The above answers do not work for me on Mac OS X 10.9.5.
The following code does work:
sed '/.\{2048\}/d'.
Although not asked for, here for reference is the reverse, which can be achieved with the following code:
sed '/.\{2048\}/!d'.
- lol, but sed: 1: "/.\{2048\}/d": RE error: invalid repetition count(s) (Mac OS X, 10.10.4) – alex gray, Jul 24, 2015 at 13:29
- Ah. I installed the GNU version instead of the BSD version that ships with Mac as @Freedom_Ben suggested above. But Kusalananda found the switch to enable extended regex. So you should go with his solution if you still have that problem. ;) – wedi, Nov 30, 2018 at 19:40
With GNU sed, you may use the -r flag to avoid typing the backslashes, and a comma to define an open interval:
sed -r "/.{2049,}/d" input.txt > output.txt
with:
- x{2049} meaning exactly 2049 xs
- x{2049,3072} meaning from 2049 to 3072 xs
- x{2049,} meaning at least 2049 xs
- x{,2049} meaning at most 2049 xs
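The four interval forms can be tried out with small numbers. This is a rough sketch using Python's re module, whose interval syntax looks the same for these cases. It is not GNU sed's regex engine, so treat the agreement between the two as an assumption. The helper name whole is made up for illustration.

```python
import re

# Small numbers stand in for 2049 and 3072.
exactly_3 = re.compile(r"x{3}")
from_3_to_5 = re.compile(r"x{3,5}")
at_least_3 = re.compile(r"x{3,}")
at_most_3 = re.compile(r"x{,3}")


def whole(pattern, text):
    """True if the entire text matches (like anchoring with ^ and $)."""
    return pattern.fullmatch(text) is not None


# One row per length: n, then the result for each of the four patterns.
for n in range(0, 7):
    s = "x" * n
    print(n, whole(exactly_3, s), whole(from_3_to_5, s),
          whole(at_least_3, s), whole(at_most_3, s))
```

fullmatch is used so that each pattern has to cover the whole string. With an unanchored search, x{3} would also be found inside a longer run of xs.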
For the intervals, to avoid matching longer lines, you would need line anchors, like
sed -r "/^.{32,64}$/d" input.txt > output.txt
The sed solutions are all very slow when the line lengths become very long. This is the disadvantage of matching line length with regexes. (But of course the advantage is that sed is everywhere.)
If you like the speed of the Perl solution, but prefer using Python, the pz CLI tool makes this really easy. It brings Python to shell pipes.
With pz the solution would be:
cat input | pz 's if len(s) < 2048 else ""' > output
Split the row at each character by setting FS to nothing:
awk 'BEGIN{FS=""} NF <= 2048' file
Test with:
perl -e 'print "z"x2048' | awk 'BEGIN{FS=""} NF <= 2048' # this line is printed
perl -e 'print "z"x2049' | awk 'BEGIN{FS=""} NF <= 2048' # this line is not
With Ruby:
ruby -ne 'print if $_.size <= 2048' input.txt > output.txt
Or to edit in place and create a backup:
ruby -i.bak -ne 'print if $_.size <= 2048' file.txt
Without a backup:
ruby -i -ne 'print if $_.size <= 2048' file.txt
Note: $_.size includes the trailing newline, if any. You can use $_.chomp.size to ignore trailing newlines.
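The effect of counting the trailing newline is easiest to see right at the limit. This is a small Python sketch of the same distinction, not Ruby. The helper name visible_length is made up, and rstrip is only roughly comparable to Ruby's chomp.

```python
def visible_length(line: str) -> int:
    """Length of a line without its trailing newline (roughly $_.chomp.size)."""
    return len(line.rstrip("\n"))


line = "z" * 2048 + "\n"
print(len(line))             # 2049: the newline is counted
print(visible_length(line))  # 2048: the newline is ignored
```

So a line with exactly 2048 visible characters fails a "<= 2048" check when the newline is counted, and passes when it is stripped first.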
You could also check line size via a regex, like some of the other examples, but it will be slower:
# slow
ruby -ne 'print unless /.{2048}./' input.txt > output.txt