Replace personal names and addresses with company ones

Question

The problem:

I am given a data frame. Somewhere in that dataframe there is 3*N number of columns that I need to modify based on a condition. The columns of interest look like this:

names_1	address_1	description_1	names_2	address_2	...
Joe	joe_address	...	George	...	...
Kate	kate_address	...	Daphne	...	...
Bob	bob_address	...	Jake	...	...

I can generate this with the following code:

import pandas as pd

names_dict = {'names_1': ['Joe', 'Kate', 'Bob'],
              'address_1': ['a1', 'a2', 'a3'],
              'description_1': ['d1', 'd2', 'd3'],
              'names_2': ['George', 'Daphne', 'Jake'],
              'address_2': ['a4', 'a5', 'a6'],
              'description_2': ['d4', 'd5', 'd6']}
df = pd.DataFrame(data=names_dict)

There is also a dictionary that I need to use. The keys to that dictionary are names of some companies. Each key has a list of names attached. It looks like this:

companies_dict = {'company1': ['Kate', 'Mark', 'Ben'], 'company2':['Jacob', 'Michael', 'Ken'], 'company3':['Jake', 'Don', 'Joe']}

I need to go over all names_k columns. If I encounter a name that is in one of the companies lists, I swap the name of that person with the name of that company. Moreover, I swap the address and description of that person with the address and the description of that company.

Here are dictionaries to use for this purpose:

companies_descriptions = {'company1': 'company1_desc', 'company2': 'company2_desc', 'company3': 'company3_desc'}
companies_addresses = {'company1': 'company1_address', 'company2': 'company2_address', 'company3': 'company3_address'}

Note: The columns are somewhere in the dataframe, but they are next to each other. That is, the names_1 all the way to description_N are next to each other.

My solution:

I wrote the following Python code.

N = 2
number_of_columns = N
for k in range(1, number_of_columns+1):
    for index, name in enumerate(df[f'names_{k}']):
        for company, name_list in companies_dict.items():
            if name in name_list:
                df.loc[index, f'names_{k}'] = company
                # address takes the company address, description the company description
                df.loc[index, f'address_{k}'] = companies_addresses.get(company)
                df.loc[index, f'description_{k}'] = companies_descriptions.get(company)

Note:

We can safely assume that each person's name is unique. So no two companies have the same employee.
N = 2 is an arbitrary value. Should work for any int>=1. It dictates how many columns (named names_k) there are and is defined by a separate process. N = 2 is given here as an example.

My solution is ugly, but it solves the problem. How can I write it better?

Here is the whole code to copy:

import pandas as pd

names_dict = {'names_1': ['Joe', 'Kate', 'Bob'],
              'address_1': ['a1', 'a2', 'a3'],
              'description_1': ['d1', 'd2', 'd3'],
              'names_2': ['George', 'Daphne', 'Jake'],
              'address_2': ['a4', 'a5', 'a6'],
              'description_2': ['d4', 'd5', 'd6']}
df = pd.DataFrame(data=names_dict)

companies_dict = {'company1': ['Kate', 'Mark', 'Ben'],
                  'company2': ['Jacob', 'Michael', 'Ken'],
                  'company3': ['Jake', 'Don', 'Joe']}
companies_descriptions = {'company1': 'company1_desc', 'company2': 'company2_desc', 'company3': 'company3_desc'}
companies_addresses = {'company1': 'company1_address', 'company2': 'company2_address', 'company3': 'company3_address'}

N = 2
number_of_columns = N
for k in range(1, number_of_columns+1):
    for index, name in enumerate(df[f'names_{k}']):
        for company, name_list in companies_dict.items():
            if name in name_list:
                df.loc[index, f'names_{k}'] = company
                # address takes the company address, description the company description
                df.loc[index, f'address_{k}'] = companies_addresses.get(company)
                df.loc[index, f'description_{k}'] = companies_descriptions.get(company)

And what should happen when two different companies' employees have the same name? That does happen, frequently.
– Toby Speight, Commented Feb 17, 2023 at 11:04
Yes, that latest edit has improved the question. It looks complete now. Thanks for responding positively.
– Toby Speight, Commented Feb 18, 2023 at 9:40
Is this homework or an interview question? Either way, please tag it as such.
– Reinderien, Commented Feb 18, 2023 at 19:09
The person ("you") in the problem statement was confusing and made it seem like a homework problem, so I have changed it to the first person.
– Reinderien, Commented Feb 20, 2023 at 15:26

Reinderien · Accepted Answer · 2023-02-20 15:33:00Z

The question

It's unhelpful, in a handful of ways:

It presents you data that are shaped in an unhelpful way
It implies the use of non-vectorised dictionary lookups
It implies the use of non-vectorised iteration
What you're really doing is building up a dataframe of merged contacts, but the question does not describe this

Since you say this is not from a course but from your own scenario, this is somewhat of an X/Y problem: you asked how to do X when you shouldn't do X at all, and should do Y instead.

The existing code

Get rid of all of your for-loops. Get rid of all of your dictionary lookups. Get rid of element-wise reassignment.

Vectorised approach

You've tagged your question vectorization, but the language of the problem statement is not guiding you toward this, and your own solution (perhaps unsurprisingly) is not vectorised - but it should be.

This will be done in, roughly, the following steps:

Fix the broken column representation in the first dataframe
Load the other dictionaries into dataframes with sensible indices and columns
Left-merge to get a porous merged-dataframe
fillna to substitute where possible
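
The last two steps can be seen in isolation on a single column group. This is only a minimal sketch on toy data: the `employees` frame and its column names are illustrative assumptions, not part of the suggested code, which works on the full multi-group layout.

```python
import pandas as pd

# Hypothetical toy data: one column group only.
contacts = pd.DataFrame({
    'names': ['Joe', 'Kate', 'Bob'],
    'address': ['a1', 'a2', 'a3'],
})
# Hypothetical lookup frame: one row per employee, with company properties.
employees = pd.DataFrame({
    'employee_name': ['Kate', 'Joe'],
    'company_name': ['company1', 'company3'],
    'company_address': ['company1_address', 'company3_address'],
})

# A left merge keeps every contact; contacts that are not employees
# get NaN in the company columns (the "porous" part).
merged = contacts.merge(
    employees, how='left', left_on='names', right_on='employee_name',
)

# fillna falls back to the personal value wherever no company matched.
result = pd.DataFrame({
    'names': merged['company_name'].fillna(merged['names']),
    'address': merged['company_address'].fillna(merged['address']),
})
print(result)
```

Bob is not in the lookup frame, so his row comes out of the merge with missing company values and keeps his own name and address after `fillna`.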

Suggested

import pandas as pd

# You are given a data frame. Somewhere in that dataframe there is 3*N number of columns
contacts: pd.DataFrame = pd.DataFrame({
    'names_1': ('Joe', 'Kate', 'Bob'),
    'address_1': ('a1', 'a2', 'a3'),
    'description_1': ('d1', 'd2', 'd3'),
    'names_2': ('George', 'Daphne', 'Jake'),
    'address_2': ('a4', 'a5', 'a6'),
    'description_2': ('d4', 'd5', 'd6'),
})
contacts.columns = pd.MultiIndex.from_frame(
    contacts.columns.str.extract(r'(.+)_(\d+)$'),
    names=('property', 'contact_group'),
)
contacts.index.name = 'contact'

# There is also a dictionary that you need to use. The keys to that dictionary
# are names of some companies. Each key has a list of names attached.
companies_dict = {
    'company1': ('Kate', 'Mark', 'Ben'),
    'company2': ('Jacob', 'Michael', 'Ken'),
    'company3': ('Jake', 'Don', 'Joe'),
}
company_employees: pd.Series = pd.DataFrame(companies_dict).stack()
company_employees.index.names = 'employee', 'company_name'
company_employees.name = 'employee_name'

# you swap the address and description of that person with the address and the
# description of that company. Here are dictionaries to use for this purpose:
companies_descriptions = {
    'company1': 'company1_desc',
    'company2': 'company2_desc',
    'company3': 'company3_desc',
}
companies_addresses = {
    'company1': 'company1_address',
    'company2': 'company2_address',
    'company3': 'company3_address',
}
companies = pd.DataFrame.from_dict({
    'description': companies_descriptions,
    'address': companies_addresses,
})
companies.index.name = 'company_name'

# You need to iterate over all [employee] names_k columns. If you encounter an [employee] name that
# is in one of the companies lists, you swap the name of that person with the name of that company.
# Moreover, you swap the address and description of that person with the address and the description
# of that company.
employees_with_company_properties = pd.merge(
    left=company_employees, right=companies,
    left_on='company_name', right_on='company_name'
)
contacts_long = contacts.stack(level='contact_group')
contacts_merged = pd.merge(
    left=contacts_long, right=employees_with_company_properties.reset_index(),
    left_on='names', right_on='employee_name',
    suffixes=('_employee', '_company'), how='left',
).set_index(contacts_long.index)
contacts_replaced = contacts_merged[[
    'company_name', 'address_company', 'description_company'
]].rename(columns={
    'company_name': 'names',
    'address_company': 'address',
    'description_company': 'description',
}).fillna(
    contacts_merged[['names', 'address_employee', 'description_employee']]
    .rename(columns={
        'address_employee': 'address',
        'description_employee': 'description',
    })
).unstack(
    level='contact_group'
).sort_values('contact_group', axis=1)

pd.set_option('display.max_columns', 10)
pd.set_option('display.width', 1000)
print(contacts_replaced)

Output

                  names           address    description     names           address    description
contact_group         1                 1              1         2                 2              2
contact
0              company3  company3_address  company3_desc    George                a4             d4
1              company1  company1_address  company1_desc    Daphne                a5             d5
2                   Bob                a3             d3  company3  company3_address  company3_desc

Thanks, nice data manipulation :D. "It implies the use of non-vectorised iteration" - yes, that was unfortunate, I'll change that now. The only issue I have with this solution is that the idea of what we are trying to do is not quite clear as we read the code. But I suppose that's due to the amount of changes you had to make to the structure of the data. Do you agree?
– Glue, Commented Feb 20, 2023 at 11:33
Re. I'll change that now - either it is a quote, in which case you shouldn't change it, or it's not actually a quote.
– Reinderien, Commented Feb 20, 2023 at 13:13
Re. clarity - I mostly disagree. Whereas vectorisation can make the code longer and more difficult to understand for people unfamiliar with Pandas, once you understand the individual calls, the whole is fairly straightforward.
– Reinderien, Commented Feb 20, 2023 at 13:31

Toby Speight · Accepted Answer · 2023-02-18 11:07:05Z

Before starting the loop, we should probably invert the employees list, so that we have a dict that maps each employee to their company:

employers = {
    name: co
    for co, names in companies_dict.items()
    for name in names
}

That eliminates the inner loop, replacing it with a simple dict.get(), which improves the code's scalability.
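
A minimal sketch of the effect, using the question's companies_dict and assuming, as the question states, that no name appears under two companies (with duplicates, the later company would silently win):

```python
companies_dict = {
    'company1': ['Kate', 'Mark', 'Ben'],
    'company2': ['Jacob', 'Michael', 'Ken'],
    'company3': ['Jake', 'Don', 'Joe'],
}

# Invert once: employee name -> company name.
employers = {
    name: company
    for company, names in companies_dict.items()
    for name in names
}

# Each lookup is now a single hash access instead of a scan over every list.
print(employers.get('Kate'))    # company1
print(employers.get('George'))  # None: George is not in any company list
```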

It's not clear where N = 2 comes from, as that's not mentioned in the extract of the problem statement. Could be worth a comment. Or perhaps we should change the code to work with any number of replacements - surely we just need to go up to ⅓ the number of columns?

max_k = len(df.columns) // 3

Or forget about N and, starting from names_1, just keep incrementing k until there's no names_k column.
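
One way that last idea might look, as a sketch only: count_name_groups is a hypothetical helper, and the extra 'id' column is made up to show that unrelated columns are ignored. It takes any collection of column names, so df.columns could be passed in directly.

```python
def count_name_groups(columns):
    """Count how many consecutive names_1, names_2, ... columns exist."""
    present = set(columns)
    k = 0
    while f'names_{k + 1}' in present:
        k += 1
    return k


# Column names as in the question, plus one made-up unrelated column.
columns = [
    'id',
    'names_1', 'address_1', 'description_1',
    'names_2', 'address_2', 'description_2',
]
n_groups = count_name_groups(columns)
print(n_groups)  # 2
```

Counting stops at the first gap, so a frame with names_1 and names_3 but no names_2 would report a single group.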

Indexing using .iloc is much more efficient than enumerate(df[f'names_{k}']), and we're told about the columns' order, so we can use that information:

col = df.columns.get_loc('names_1')
colgroup = 1
while col < len(df.columns) and df.columns[col] == f'names_{colgroup}':
    for row_index in range(len(df)):
        company = employers.get(df.iloc[row_index, col])
        if company:
            df.iloc[row_index, col] = company
            # col+1 is the address column, col+2 the description column
            df.iloc[row_index, col+1] = companies_addresses.get(company)
            df.iloc[row_index, col+2] = companies_descriptions.get(company)
    colgroup += 1
    col += 3

I'm not sure what you mean by the last statement. There are 3*N columns. That's because there are N columns such as names_k. Each names_k column has a corresponding address and description column.
– Glue, Commented Feb 17, 2023 at 14:00
If there are 3*N columns or more but we are not given N, then we can deduce an upper bound for N: it's not more than (columns/3). That's all.
– Toby Speight, Commented Feb 17, 2023 at 14:08
Yes, we don't use the guarantee that the columns are consecutive. However, we can use this guarantee, but I don't see how to do that exactly. I also don't see how inverting the companies_dict helps.
– Glue, Commented Feb 17, 2023 at 21:09
Inverting the dict turns the lookup from a linear search (for company, name_list in companies_dict.items(): if name in name_list:) into a simple dict access (company = employers[name]).
– Toby Speight, Commented Feb 18, 2023 at 9:37
Thanks for providing the code, it is clearer for me now. What do you think, which solution is more readable? Mine or yours? That's what I value the most here.
– Glue, Commented Feb 18, 2023 at 12:06
