
After scraping a website, I ended up with a list which looks like this:

    data = ['\xa0header1', 'element1', 'element2', 'element3', '\xa0header2', 'element4', 'element5']

and so on.

I want to create a pandas DataFrame from the scraped data that looks like this:

              A        B
    0  element1  header1
    1  element2  header1
    2  element3  header1
    3  element4  header2
    4  element5  header2

So, basically, I want to show in the next column the header which is above a group of elements of the initial list.

How can it be done, considering the special character in front of the headers makes it easy to look them up in the list?

2 Answers


itertools groupby + repeat + chain

This is one solution using the itertools module. In essence these are the only operations we need to undertake:

  1. Group items according to whether they start with \xa0.
  2. Repeat each header once for every element in the group that follows it.
  3. Chain results for series A and B to remove nested lists.

Crucially, these operations are already implemented lazily and efficiently in the standard library, so there's no need to reimplement them in pure Python (although doing so is, in itself, a good learning exercise).
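To make the three steps concrete, here is a minimal sketch of the intermediate values for the sample list, using only the standard library. The names `is_header`, `groups`, `repeated`, `column_a` and `column_b` are illustrative and do not appear in the solution itself.

```python
from itertools import chain, groupby, repeat

data = ['\xa0header1', 'element1', 'element2', 'element3',
        '\xa0header2', 'element4', 'element5']

def is_header(x):
    # headers are marked by a leading non-breaking space
    return x.startswith('\xa0')

# Step 1: group consecutive items by whether they are headers
groups = [(flag, list(items)) for flag, items in groupby(data, key=is_header)]

# split the groups into cleaned headers and lists of elements
headers = [items[0].strip() for flag, items in groups if flag]
elements = [items for flag, items in groups if not flag]

# Step 2: repeat each header once per element in its group
repeated = [list(repeat(h, len(e))) for h, e in zip(headers, elements)]

# Step 3: chain the nested lists into two flat columns
column_a = list(chain.from_iterable(elements))
column_b = list(chain.from_iterable(repeated))
```

Note that pairing `headers` with `elements` through `zip` assumes every header is followed by at least one element; two headers in a row would fall into the same group and misalign the columns.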

Complete solution:

    import pandas as pd
    from itertools import chain, groupby, repeat

    chainer = chain.from_iterable

    data = ['\xa0header1', 'element1', 'element2', 'element3',
            '\xa0header2', 'element4', 'element5']

    def condition(x):
        return x.startswith('\xa0')

    # create list of lists for elements
    elements = [list(j) for i, j in groupby(data, key=condition) if not i]

    # create list of headers, stripping the leading non-breaking space
    headers = [next(j).strip() for i, j in groupby(data, key=condition) if i]

    # chain list of lists, and use repeat for headers
    df = pd.DataFrame({'A': list(chainer(elements)),
                       'B': list(chainer(repeat(i, j) for i, j in
                                         zip(headers, map(len, elements))))})

    print(df)

              A        B
    0  element1  header1
    1  element2  header1
    2  element3  header1
    3  element4  header2
    4  element5  header2

1 Comment

Thank you for providing a solution. However, maybe I was not clear enough: the headers actually include words like \xa0Eggs or \xa0Plates, for example, which I want to extract and put in the next column instead of enumerating.

An alternative solution is to use collections.defaultdict to create a dictionary mapping headers to elements. It is potentially more intuitive than itertools.groupby and requires only one pass over the data.

    import pandas as pd
    from collections import defaultdict
    from itertools import chain, repeat

    chainer = chain.from_iterable

    data = ['\xa0header1', 'element1', 'element2', 'element3',
            '\xa0header2', 'element4', 'element5']

    # create dictionary of lists
    # each key a separate header; values are list of elements
    d = defaultdict(list)
    for item in data:
        if item.startswith('\xa0'):
            key = item.strip()
        else:
            d[key].append(item)

    # chain list of lists, and use repeat for headers
    df = pd.DataFrame({'A': list(chainer(d.values())),
                       'B': list(chainer(repeat(i, j) for i, j in
                                         zip(d.keys(), map(len, d.values()))))})

    print(df)

              A        B
    0  element1  header1
    1  element2  header1
    2  element3  header1
    3  element4  header2
    4  element5  header2
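One thing to be aware of: if the same header text occurs more than once in the list, a dictionary merges those groups under a single key. As a hedged variation, the sketch below uses a plain single-pass loop that records each (element, header) pair in its original order instead. The names `rows` and `current` are illustrative, and the list is assumed to start with a header.

```python
data = ['\xa0header1', 'element1', 'element2', 'element3',
        '\xa0header2', 'element4', 'element5']

rows = []
current = None  # stays None for any elements that appear before the first header

for item in data:
    if item.startswith('\xa0'):
        # remember the most recent header, without the non-breaking space
        current = item.strip()
    else:
        # pair each element with the header seen most recently
        rows.append((item, current))
```

The resulting list of tuples can be passed straight to pandas, for example `pd.DataFrame(rows, columns=['A', 'B'])`.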
