I have a Polars dataframe in which the cells contain a sequence of single digits as a string of characters, and I want to find the number of differences between the elements of the string. For example:

    import polars as pl

    df = pl.DataFrame({"pop_1": ["100", "0021"], "pop_2": ["11002", "0000"]})

    shape: (2, 2)
    ┌───────┬───────┐
    │ pop_1 ┆ pop_2 │
    │ ---   ┆ ---   │
    │ str   ┆ str   │
    ╞═══════╪═══════╡
    │ 100   ┆ 11002 │
    │ 0021  ┆ 0000  │
    └───────┴───────┘

In row 1, pop_1 has 2 differences with itself (the 1 differs from a 0 two times), pop_2 has 8 differences with itself, and pop_1 and pop_2 have 9 differences between them.
The naive implementation of this would be:

    def get_dxy(str1, str2):
        # Count pairs (x from str1, y from str2) whose characters differ.
        diffs = 0
        for x in str1:
            for y in str2:
                if x != y:
                    diffs += 1
        return diffs

    def get_pi(str1):
        # Count unordered pairs of positions within str1 whose characters differ.
        diffs = 0
        for i in range(len(str1) - 1):
            for j in range(i + 1, len(str1)):
                if str1[i] != str1[j]:
                    diffs += 1
        return diffs

I need to report these differences in separate columns. I am able to do this by using map_elements at each row:

    df = df.with_columns(
        pl.col('pop_1')
        .map_elements(get_pi, return_dtype=pl.Int64)
        .alias('pi_1')
    )

    df = df.with_columns(
        pl.col('pop_2')
        .map_elements(get_pi, return_dtype=pl.Int64)
        .alias('pi_2')
    )

    df = df.with_columns(
        pl.struct("pop_1", "pop_2")
        .map_elements(lambda cols: get_dxy(cols["pop_1"], cols["pop_2"]), return_dtype=pl.Int64)
        .alias("dxy")
    )
    df

    shape: (2, 5)
    ┌───────┬───────┬──────┬──────┬─────┐
    │ pop_1 ┆ pop_2 ┆ pi_1 ┆ pi_2 ┆ dxy │
    │ ---   ┆ ---   ┆ ---  ┆ ---  ┆ --- │
    │ str   ┆ str   ┆ i64  ┆ i64  ┆ i64 │
    ╞═══════╪═══════╪══════╪══════╪═════╡
    │ 100   ┆ 11002 ┆ 2    ┆ 8    ┆ 9   │
    │ 0021  ┆ 0000  ┆ 5    ┆ 0    ┆ 8   │
    └───────┴───────┴──────┴──────┴─────┘

However, my data is too large, and this method is not very fast. I was wondering what the fastest way to accomplish this using Polars built-in tools would be (perhaps without using map_elements?)
Could I get some hints on how to do this?