To assert that the tuples are in fact one element each (so you get an exception if that assumption is violated, rather than silently discarding data), you can use iterable unpacking in any of the following three forms to extract the element from a known single-element iterable:
    [el for el, in p]
    [el for (el,) in p]
    [el for [el] in p]
All three are 100% equivalent in behavior and performance; it's mostly a question of which form you find most readable.
As a bonus, unpacking is typically a little faster than indexing (only about 5-10% on Python 3.11, not enough to worry about if you actually wanted to ignore any extra elements). But if your code's correctness relies on there being exactly one element per tuple, unpacking gets you that check automatically, at negative cost.
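For example, if a stray two-element tuple sneaks in, the unpacking form fails loudly instead of quietly taking the first element (p here is made-up stand-in data, not your actual query results):

    p = [(1,), (2,), (3, 4)]   # the third tuple violates the one-element assumption
    [row[0] for row in p]      # silently produces [1, 2, 3], discarding the 4
    [el for el, in p]          # raises ValueError: too many values to unpack (expected 1)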
If the tuples might be variable length and you want to keep the contents, itertools.chain.from_iterable is the way to go:
    from itertools import chain

    list(chain.from_iterable(p))  # Or [*chain.from_iterable(p)] if you prefer
which will avoid discarding any data.
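For instance, with made-up variable-length tuples, everything survives in order:

    from itertools import chain

    p = [(1,), (2, 3), (4, 5, 6)]        # variable-length tuples (hypothetical example data)
    print(list(chain.from_iterable(p)))  # [1, 2, 3, 4, 5, 6]; nothing is discarded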
Lastly, if you're just looking to optimize taking the first element, and you're really okay with silently discarding data if the one-element assumption is violated, you could micro-optimize this a tiny bit for large inputs:
    from operator import itemgetter

    list(map(itemgetter(0), p))  # Or [*map(itemgetter(0), p)] if you prefer
but that's a small and shrinking benefit over the listcomp (it gets smaller as the interpreter gets faster), and probably not worth the trouble unless you're sure this is your hot loop and there is no way to improve it algorithmically.
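Just be aware that, unlike the unpacking forms, itemgetter(0) silently grabs index 0 and ignores anything else in the tuple (again, made-up data):

    from operator import itemgetter

    p = [(1,), (2, 3)]                  # second tuple has an extra element
    print(list(map(itemgetter(0), p)))  # [1, 2]; the 3 is silently dropped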
On Performance
Some performance notes to address your concern about for loops, prompted by "my Teacher in university suggested me to avoid using for loop when it is possible to use 'built in' function cause they are built in C so they are really really faster".
A plain for loop that repeatedly appends to a list, like this:
    lst = []
    for x, in p:
        lst.append(x)
will lose out on performance due to the cost of repeatedly loading lst from the stack, looking up its append method, and calling it through generalized code paths. That said, as of 3.11 the interpreter gained a lot of cached and self-modifying/specializing bytecode, so the extra cost is much smaller than it used to be.
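If you're ever stuck with an explicit loop in a genuinely hot path, the classic (ugly) workaround for that method-lookup cost is to hoist the bound method out of the loop; a sketch, not a recommendation:

    lst = []
    append = lst.append  # cache the bound method so the loop body skips the repeated attribute lookup
    for x, in p:
        append(x)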
map with a built-in function implemented in C can win. But even before 3.11 the gains were small (when the result is unpacked to a list, typically no more than 10% faster than an equivalent listcomp, and only slightly better than that against the higher-CPU-overhead generator expression equivalent). As of 3.11, with the massive interpreter speed improvements, the differences have shrunk even further.
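Concretely, the spellings being compared there are (same result, at least for one-tuples, just different machinery):

    from operator import itemgetter

    # p is the list of one-tuples from above
    as_map      = list(map(itemgetter(0), p))  # map with a C-level callable
    as_listcomp = [x for x, in p]              # list comprehension
    as_genexpr  = list(x for x, in p)          # generator expression fed to list(); extra iterator overhead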
For a concrete example, here are the timings I get running IPython microbenchmarks on Python 3.11.6 on my local machine, building a 10K-element list of one-tuples and then converting it to a list of their contents with each of the three approaches given above, plus the explicit for loop I mentioned as being slower:
    >>> import random
    >>> from itertools import chain; from operator import itemgetter
    >>> %%timeit p = [(random.randrange(1000),) for _ in range(10000)]
    ... [x for x, in p]
    ...
    346 µs ± 2.07 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
    >>> %%timeit p = [(random.randrange(1000),) for _ in range(10000)]
    ... [*map(itemgetter(0), p)]
    ...
    344 µs ± 2.29 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
    >>> %%timeit p = [(random.randrange(1000),) for _ in range(10000)]
    ... [*chain.from_iterable(p)]
    ...
    491 µs ± 1.8 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
    >>> %%timeit p = [(random.randrange(1000),) for _ in range(10000)]
    ... lst = []
    ... for x, in p:
    ...     lst.append(x)
    ...
    388 µs ± 12.8 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
As you can see:
map with itemgetter(0) (both of which are C-level built-ins, though itemgetter could stand to be optimized a touch better for special cases like being given a single integer index) does win, but only barely: a mere 2 µs difference per loop, less than 1% of the total time (and the standard deviation is larger than the difference; in further runs map tended to win more often than not, but it wasn't a sure thing).
Using chain.from_iterable lost badly, despite being a built-in implemented entirely in C, presumably because chain assumes the sub-iterables can be anything and has no optimizations for tuples, let alone one-tuples (it constructs an iterator for each of them, pulls from it twice, with the second pull failing each time, and moves on to the next). Being a C built-in is no guarantee of speed if it can't specialize to the task at hand.
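For intuition, the roughly equivalent pure-Python version given in the itertools documentation makes that per-sub-iterable overhead visible:

    def from_iterable(iterables):
        # Rough pure-Python equivalent of chain.from_iterable (per the itertools docs):
        # a fresh inner iteration is set up for every sub-iterable, even a one-tuple,
        # which is exactly the per-element overhead described above.
        for it in iterables:
            for element in it:
                yield element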
While the plain for loop, without a listcomp involved, did lose, it only took about 12% longer than the listcomp and 13% longer than map. I'd favor the listcomp over the plain for loop for brevity and clarity, reserving the for loop for more complicated work, but speed-wise? It's not going to be your problem. Actually processing the data will almost certainly involve an order of magnitude or so more effort (you're getting this from a database which, unless it's an in-memory SQLite DB, involves disk or network access, either of which will be much slower than Python); unpacking the rows is going to be pretty much irrelevant to the overall performance of your code.
One final caution: don't flatten with sum using an empty tuple as the base; it's repeated concatenation, throwing away the old tuple each time to build new tuples at every step. It's O(n²) where it could be O(n).
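For reference, the pattern being warned against looks like this (shown only so you can recognize it; prefer chain.from_iterable from above):

    flattened = sum(p, ())  # starts from an empty tuple and re-concatenates at every step: O(n²) total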