Splitting siblings by different @class attributes into columns using IMPORTXML formula

Question

I am trying to scrape sample sentences from an online Japanese dictionary using IMPORTXML in a spreadsheet.

As shown in the image above, some li parents inside the ul contain two sibling span children, distinguished by two different @class values ('furigana' and 'unlinked').

(For those who don't know: furigana are the phonetic characters assigned to kanji as a reading aid. Kanji are the characters borrowed or adapted from Chinese writing.)

On the other hand, other li parents have only one child, and that is a span with @class='unlinked'.

My goal is to split @class='furigana' and @class='unlinked' into 2 separate columns. For any 'unlinked' characters that don't have a 'furigana' counterpart, the furigana cell should show the — symbol instead of being blank.

I haven't done any filtering on my formula yet, but here it is:

= IMPORTXML( "https://jisho.org/word/%E6%A4%85%E5%AD%90", "//span[@class='furigana']/ancestor::ul[@class='japanese japanese_gothic clearfix']/li[@class='clearfix']" )

My formula gave me 2 columns, for a reason I don't understand. That is somewhat helpful, and it seems to have already separated some 'unlinked' characters from the 'furigana' column. But I think it only distributed the characters in zigzag order, which is why some 'unlinked' characters ended up on the 'furigana' side.

I hope someone could help me and provide some simple formula that I could easily comprehend.

Blindspots · Accepted Answer · 2023-03-21 22:25:56Z

What you are seeing is the array returned by your search.

The <li> items are being returned as rows, and each <span> within the <li> represents a column in that row. Which value is returned in column 1 vs column 2 is dictated by the order of the spans in the HTML code, not the class name. If there is only one <span> in an <li>, it will be in column 1 regardless of the class name.

If you want to return only the furigana spans, your formula would be

=IMPORTXML( "https://jisho.org/word/%E6%A4%85%E5%AD%90", "//ul[@class='japanese japanese_gothic clearfix']/li[@class='clearfix']/span[@class='furigana']" )

<ul class="japanese japanese_gothic clearfix" lang="ja">
  <li class="clearfix">
    <span class="furigana">かれ</span>
    <span class="unlinked">彼</span>
  </li>
  <li class="clearfix">
    <span class="unlinked">は</span>
  </li>
  <li class="clearfix">
    <span class="furigana">かなら</span>
    <span class="unlinked">必ず</span>
  </li>
  <li class="clearfix">
    <span class="furigana">だいとうりょう</span>
    <span class="unlinked">大統領</span>
  </li>
  <li class="clearfix">
    <span class="unlinked">の</span>
  </li>
  <li class="clearfix">
    <span class="furigana">いす</span>
    <span class="unlinked"><span class="hit">椅子</span></span>
  </li>
  <li class="clearfix">
    <span class="unlinked">に</span>
  </li>
  <li class="clearfix">
    <span class="unlinked">つく</span>
  </li>
  <li class="clearfix">
    <span class="furigana">じんぶつ</span>
    <span class="unlinked">人物</span>
  </li>
  <li class="clearfix">
    <span class="unlinked">だ</span>
  </li>
</ul>
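To make the row and column behaviour concrete, here is a rough Python sketch. It is not a Sheets formula and not how IMPORTXML is implemented. It only mimics the two ways of laying out the data on a trimmed copy of the markup above, using the standard-library xml.etree.ElementTree parser. The function names (text_of, rows_by_position, rows_by_class) are made up for this illustration.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the <ul> quoted in the answer; it is well-formed,
# so the stdlib XML parser can read it directly.
HTML = """
<ul class="japanese japanese_gothic clearfix" lang="ja">
  <li class="clearfix"><span class="furigana">かれ</span><span class="unlinked">彼</span></li>
  <li class="clearfix"><span class="unlinked">は</span></li>
  <li class="clearfix"><span class="furigana">いす</span><span class="unlinked"><span class="hit">椅子</span></span></li>
  <li class="clearfix"><span class="unlinked">に</span></li>
</ul>
"""

PLACEHOLDER = "\u2014"  # the dash symbol the question asks for


def text_of(node):
    """Return all text inside a node, including nested spans such as class='hit'."""
    return "".join(node.itertext()).strip()


def rows_by_position(ul):
    """One row per <li>, one column per direct <span> child, in document order.

    This mirrors the behaviour described in the answer: a lone 'unlinked'
    span lands in column 1 because it is the first span in its <li>.
    """
    return [[text_of(span) for span in li.findall("span")] for li in ul.findall("li")]


def rows_by_class(ul, placeholder=PLACEHOLDER):
    """One row per <li>, with column 1 = furigana and column 2 = unlinked.

    A missing span is replaced by the placeholder instead of shifting columns.
    """
    rows = []
    for li in ul.findall("li"):
        furigana = li.find("span[@class='furigana']")
        unlinked = li.find("span[@class='unlinked']")
        rows.append([
            text_of(furigana) if furigana is not None else placeholder,
            text_of(unlinked) if unlinked is not None else placeholder,
        ])
    return rows


ul = ET.fromstring(HTML.strip())
by_position = rows_by_position(ul)
by_class = rows_by_class(ul)

for row in by_class:
    print("\t".join(row))
```

The first function reproduces the zigzag effect from the question, because single-span rows put their only value in column 1. The second selects each span by its class, which is the layout the question asks for. In the sheet itself, the same idea would mean picking values by class rather than by position.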
