Pull headers from a list and create a DataFrame with headers side-by-side to list elements

Question

After scraping a website, I ended up with a list which looks like this:

data = ['\xa0header1', 'element1', 'element2', 'element3', '\xa0header2', 'element4', 'element5']

and so on.

I want to create a pandas DataFrame from the scraped data that looks like this:

          A        B
0  element1  header1
1  element2  header1
2  element3  header1
3  element4  header2
4  element5  header2

So, basically, I want to show in the next column the header which is above a group of elements of the initial list.

How can it be done, considering the special character in front of the headers makes it easy to look them up in the list?

jpp · Accepted Answer · 2018-06-09 21:29:09Z

itertools groupby + repeat + chain

This is one solution using the itertools module. In essence these are the only operations we need to undertake:

Group items according to whether they start with \xa0.
Repeat headers for each list within your list of lists after grouping.
Chain results for series A and B to remove nested lists.
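As a rough illustration of the three steps above, the sketch below prints nothing and only builds the intermediate lists, so each stage can be inspected on its own. The names `is_header`, `groups`, `repeated`, `col_a` and `col_b` are made up for this illustration and are not part of the answer's code; stripping the leading `\xa0` with `str.strip()` is an assumption about the desired output.

```python
from itertools import chain, groupby, repeat

data = ['\xa0header1', 'element1', 'element2', 'element3',
        '\xa0header2', 'element4', 'element5']

def is_header(x):
    # Headers are marked by a leading non-breaking space
    return x.startswith('\xa0')

# Step 1: group consecutive items by whether they are headers
groups = [(flag, list(items)) for flag, items in groupby(data, key=is_header)]

# Split the groups into headers (marker stripped) and element lists
headers = [items[0].strip() for flag, items in groups if flag]
elements = [items for flag, items in groups if not flag]

# Step 2: repeat each header once per element in its group
repeated = [list(repeat(h, len(e))) for h, e in zip(headers, elements)]

# Step 3: flatten the nested lists into two flat columns
col_a = list(chain.from_iterable(elements))
col_b = list(chain.from_iterable(repeated))
```

Note that this pairing of `headers` with `elements` by position assumes every header is followed by at least one element.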

Crucially, these operations are already implemented lazily and efficiently in the standard library, so there is no need to reimplement them in pure Python (although doing so is, in itself, a good learning exercise).

Complete solution:

import pandas as pd
from itertools import chain, groupby, repeat

chainer = chain.from_iterable

data = ['\xa0header1', 'element1', 'element2', 'element3', '\xa0header2', 'element4', 'element5']

def condition(x):
    return x.startswith('\xa0')

# create list of lists for elements
elements = [list(j) for i, j in groupby(data, key=condition) if not i]

# create list of headers, stripping the leading \xa0
headers = [next(j).strip() for i, j in groupby(data, key=condition) if i]

# chain list of lists, and use repeat for headers
df = pd.DataFrame({'A': list(chainer(elements)),
                   'B': list(chainer(repeat(i, j) for i, j in
                                     zip(headers, map(len, elements))))})

print(df)

          A        B
0  element1  header1
1  element2  header1
2  element3  header1
3  element4  header2
4  element5  header2

Thank you for providing a solution. However, maybe I was not clear enough: the headers actually include words like \xa0Eggs or \xa0Plates, for example, which I want to extract and put in the next column instead of enumerating.
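To make the point in this comment concrete, here is a minimal sketch showing that the header text itself, whatever it is, can be carried alongside each element. The function name `pair_with_headers`, its `marker` parameter and the `sample` list are hypothetical and chosen only for this illustration; `Eggs` and `Plates` come from the comment.

```python
def pair_with_headers(items, marker='\xa0'):
    """Yield (element, header) pairs; entries starting with `marker` are headers."""
    current = None
    for item in items:
        if item.startswith(marker):
            # Remember the header text without the leading marker
            current = item.strip()
        else:
            yield item, current

# Hypothetical sample using the header words mentioned in the comment
sample = ['\xa0Eggs', 'element1', 'element2', '\xa0Plates', 'element3']
rows = list(pair_with_headers(sample))
```

The resulting list of tuples could then be passed to `pd.DataFrame(rows, columns=['A', 'B'])`.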

jpp · Accepted Answer · 2018-06-09 21:59:26Z

An alternative solution is to use collections.defaultdict to create a dictionary mapping headers to elements. This is potentially more intuitive than itertools.groupby and requires only one pass over the data.

import pandas as pd
from collections import defaultdict
from itertools import chain, repeat

chainer = chain.from_iterable

data = ['\xa0header1', 'element1', 'element2', 'element3', '\xa0header2', 'element4', 'element5']

# create dictionary of lists
# each key a separate header; values are list of elements
d = defaultdict(list)

for item in data:
    if item.startswith('\xa0'):
        key = item.strip()
    else:
        d[key].append(item)

# chain list of lists, and use repeat for headers
df = pd.DataFrame({'A': list(chainer(d.values())),
                   'B': list(chainer(repeat(i, j) for i, j in
                                     zip(d.keys(), map(len, d.values()))))})

print(df)

          A        B
0  element1  header1
1  element2  header1
2  element3  header1
3  element4  header2
4  element5  header2
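One design consequence of keying a dictionary by header is worth checking before relying on this approach: if the same header text appears more than once in the list, its elements are merged under a single key. The sketch below wraps the same loop in a hypothetical helper, `group_by_header`, and runs it on a made-up list, `repeated`, to show that behaviour; neither name appears in the answer.

```python
from collections import defaultdict

def group_by_header(items, marker='\xa0'):
    """Map each header (marker stripped) to the list of elements that follow it."""
    d = defaultdict(list)
    key = None  # elements seen before any header are filed under None
    for item in items:
        if item.startswith(marker):
            key = item.strip()
        else:
            d[key].append(item)
    return dict(d)

data = ['\xa0header1', 'element1', 'element2', 'element3',
        '\xa0header2', 'element4', 'element5']
grouped = group_by_header(data)

# Made-up input in which the header 'h1' occurs twice
repeated = ['\xa0h1', 'a', '\xa0h2', 'b', '\xa0h1', 'c']
merged = group_by_header(repeated)  # 'c' joins 'a' under the single key 'h1'
```

If headers can repeat and the original order of rows matters, the groupby-based answer may be the safer choice.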
