Simple Tag Parser

Question

This is a model of a forgiving HTML parser. Instead of parsing HTML and extracting attributes, in this code golf, the tag parser will be simple.

Write a function that parses a tag structure and returns its parenthized form. An opening tag consists of one lowercase letter, and a closing tag consists of one uppercase letter. For example, aAbaAB parses into (a)(b(a)), or in HTML, <a></a><b><a></a></b>. Of course, tags can be in juxtaposition and nest.

"Prematurely" closed tags must be handled. For example, in abcA, the A closes the outermost a, so it parses into (a(b(c))).

Extra closing tags are simply ignored: aAB parses into (a).

Overlapping tags are NOT handled. For example, abAB parses into (a(b)), not (a(b))(b), by the previous rule of extra closing tags (abAB -> abA ((a(b))) + B (extra)).

Assuming no whitespaces and other illegal characters in the input.

You are not allowed to use any library.

Here is a reference implementation and a list of test cases:

#!/usr/bin/python def pars(inpu): outp = "" stac = [] i = 0 for x in inpu: lowr = x.lower() if x == lowr: stac.append(x) outp += "(" + x i = i + 1 else: while len(stac) > 1 and stac[len(stac) - 1] != lowr: outp += ")" stac.pop() i = i - 1 if len(stac) > 0: outp += ")" stac.pop() i = i - 1 outp += ")" * i return outp tests = [ ("aAaAbB", "(a)(a)(b)"), ("abBcdDCA", "(a(b)(c(d)))"), ("bisSsIB", "(b(i(s)(s)))"), ("aAabc", "(a)(a(b(c)))"), ("abcdDA", "(a(b(c(d))))"), ("abcAaA", "(a(b(c)))(a)"), ("acAC", "(a(c))"), ("ABCDEFG", ""), ("AbcBCabA", "(b(c))(a(b))") ] for case, expe in tests: actu = pars(case) print "%s: C: [%s] E: [%s] A: [%s]" % (["FAIL", "PASS"][expe == actu], case, expe, actu)

Shortest code wins.

You should add a test case for input that leads with a closing tag, such as AbcBCabA (should parse as (b(c))(a(b)). My code could have been shorter except for this case. — MtnViewMark
– MtnViewMark, Commented May 9, 2011 at 7:12

MtnViewMark · Accepted Answer · 2011-05-10 02:14:23Z

Haskell, 111 characters

s@(d:z)§c|c>'^'=toEnum(fromEnum c-32):s++'(':[c]|d<'='=s|d==c=z++")"|1<3=(z++")")§c p=tail.foldl(§)"$".(++"$")

This one's pretty golf'd for Haskell. Fun feature: The stack and the accumulating output are kept in the same string!

Test cases:

> runTests Pass: aAbaAB parsed correctly as (a)(b(a)) Pass: abcA parsed correctly as (a(b(c))) Pass: aAB parsed correctly as (a) Pass: abAB parsed correctly as (a(b)) Pass: aAaAbB parsed correctly as (a)(a)(b) Pass: abBcdDCA parsed correctly as (a(b)(c(d))) Pass: bisSsIB parsed correctly as (b(i(s)(s))) Pass: aAabc parsed correctly as (a)(a(b(c))) Pass: abcdDA parsed correctly as (a(b(c(d)))) Pass: abcAaA parsed correctly as (a(b(c)))(a) Pass: acAC parsed correctly as (a(c)) Pass: AbcBCabA parsed correctly as (b(c))(a(b))

Edit: (113 → 111) used an @ pattern as suggested by FUZxxl

\$\begingroup\$ Using an @-pattern for d:z might save two chars. \$\endgroup\$

FUZxxl
– FUZxxl

2011-05-09 12:07:11 +00:00
Commented May 9, 2011 at 12:07 — FUZxxl
– FUZxxl, Commented May 9, 2011 at 12:07

Lux · Accepted Answer · 2017-10-01 22:32:26Z

Z80 Machine Code for TI-83+, 41 bytes

This is an implementation in hexadecimal machine code for a z80 cpu running on a TI-83+.

11XXXX131AFE61380F6FE53E28CD9DB47DCD9DB4188EE1BDC03E29CD9DB4189BEF4504E5214CE1C9

The XXXX (3 - 6 inclusive) is the 16-bit address of the string you are parsing, minus 1 byte.

Encoded in Z80-ASCII:

¹XX≤¯•⟙8𝑭o↥>(ˣïÑ}ˣïÑ≠á↑γ∊>)ˣïÑ≠Ì⬆︎E𝑤↥!₄L↑Φ

(Approximate, because TI calculators have their own character set.)

NOTE THAT THE AsmPrgm IS NOT INCLUDED IN THE ABOVE

Community · Accepted Answer · 2020-06-17 09:04:33Z

Windows PowerShell, 142 146 147 152 156 169

{$s='' -join([char[]]"$args "|%{if(90-ge$_){')'*(($x=$s.indexOf("$_".ToLower())+1)+$s.Length*!$x) $s=$s.substring($x)}else{"($_" $s="$_$s"}})}

Some things to note: This is just a script block. It can be assigned to a variable or given a function name, if necessary. You can also run it by putting . or & in front of it and the arguments at the end. Uses a final space to terminate unclosed tags.

Passes all tests. Test script:

$tests = ("aAaAbB","(a)(a)(b)"),("abBcdDCA","(a(b)(c(d)))"),("bisSsIB","(b(i(s)(s)))"),("aAabc","(a)(a(b(c)))"),("abcdDA","(a(b(c(d))))"),("abcAaA", "(a(b(c)))(a)"),("acAC","(a(c))") "function f " + ((gc ./tags.ps1)-join"`n") | iex $tests | %{ $result = f $_[0] ("FAIL: $($_[0]):$($_[1]) - $result", 'PASS')[$result -ceq $_[1]] }

Community · Accepted Answer · 2020-06-17 09:04:33Z

Python - 114 113 153 192 174 159 characters

from sys import * s="";c=a=argv[1] for f in a: o=c.find;p=f.lower if '@'<f<'\\': \td=o(f)-o(p()) \ts+=")"*d \tc=(c[:o(p())]+c[o(f)+1:]) else:s+=("("+f) print s

Abuses python's indentation parser to use one space for a full tab, five for two tabs.

Edit 1 - saved an unneeded space in the range() function

Edit 2 - fixed to deal with improper parse grammars, unterminated tags.

Edit 3 - fixed a bug whereby "incorrect" parses could be generated by ambiguity in the tag tree. Implemented a stack-based strategy, rather than a counter.

Edit 4 - renamed s.find to o to prevent save the chars used to repeatedly call it. did the same for f.lower.

Edit 5 - added the space/tab hack, saving three chars.

Edit 6 - ditched the loop in favor of ")"*d.

instead of ord(f)... you can use '@'<f<'\\' If you don't need to check for '\\' you can use ']' instead — gnibbler
– gnibbler, Commented May 9, 2011 at 6:15
you can use a single tab instead of 5 spaces. SO code markup can't handle it though :(. In your case just put leave out the newline and the spaces altogether. eg if ...:s+=")";c-=1 and else:s+="("+f;c+=1 — gnibbler
– gnibbler, Commented May 9, 2011 at 6:18
for i in range(d):s+=")" can be rewritten as s+=")"*d. And you have 174 chars. — cemper93
– cemper93, Commented May 9, 2011 at 18:06
@cemper - good point that. I do "_"*80 all day long and forget about it when golfing.... Also, thanks to @gnibbler for the suggestions! — arrdem
– arrdem, Commented May 9, 2011 at 20:10
Actually, I meant that you had had 174 chars before. So you are at 159 now. — cemper93
– cemper93, Commented May 9, 2011 at 21:01

Community · Accepted Answer · 2020-06-17 09:04:33Z

Golfscript, 54 chars

{[]:|\{.96>{.|+:|;40\}{32+|?).')'*\|>:|;}if}%|,')'*}:$

Tests

;["aAaAbB" "abBcdDCA" "bisSsIB" "aAabc" "abcdDA" "abcAaA" "acAC" "aAB" "abAB" "AbcBCabA"]{.' '\$n}% aAaAbBaAaAbB (a)(a)(b) abBcdDCA (a(b)(c(d))) bisSsIB (b(i(s)(s))) aAabc (a)(a(b(c))) abcdDA (a(b(c(d)))) abcAaA (a(b(c)))(a) acAC (a(c)) aAB (a) abAB (a(b)) AbcBCabA (b(c))(a(b))

Stack Exchange Network

Simple Tag Parser

5 Answers 5

Haskell, 111 characters

Z80 Machine Code for TI-83+, 41 bytes

Windows PowerShell, 142 146 147 152 156 169

Python - 114 113 153 192 174 159 characters

Golfscript, 54 chars

Hot Network Questions

Simple Tag Parser

5 Answers 5

Haskell, 111 characters

Z80 Machine Code for TI-83+, 41 bytes

Windows PowerShell, 142 146 147 152 156 169

Python - 114 113 153 192 174 159 characters

Golfscript, 54 chars

Related

Hot Network Questions