@sinhrks
Member

@sinhrks sinhrks commented May 14, 2015

Related to #10081. Add a short (fast) path that uses numpy's string functions when all the target values are strings. Otherwise, use the current path.

The current comparison results are as follows:

import pandas as pd
import numpy as np
import string
import random

np.random.seed(1)
s = [''.join([random.choice(string.ascii_letters + string.digits) for i in range(3)]) for i in range(1000000)]

# s_str uses short path
s_str = pd.Series(s)

# set object
s[-1] = 1
# s_obj uses current path
s_obj = pd.Series(s)

%timeit s_str.str.lower()
#1 loops, best of 3: 696 ms per loop

%timeit s_obj.str.lower()
#1 loops, best of 3: 1.46 s per loop

%timeit s_str.str.split('a')
#1 loops, best of 3: 1.55 s per loop

%timeit s_obj.str.split('a')
#1 loops, best of 3: 3.52 s per loop

The logic adds some overhead, because it has to check whether the target values are all strings using lib.is_string_array. But this should still be a net speed-up in most cases, because the check takes relatively little time compared with the string ops themselves, and (I believe) the values will be all strings in most cases.

%timeit pd.lib.is_string_array(s_str.values)
#10 loops, best of 3: 21.9 ms per loop
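The overhead argument above can be checked with a small stand-alone measurement. This is only a sketch: `all_strings` and `lower_all` are hypothetical pure-Python stand-ins for `lib.is_string_array` and the string op, and the data set is smaller than the one used in the timings above, so absolute numbers will differ.

```python
import random
import string
import timeit

random.seed(1)
alphabet = string.ascii_letters + string.digits
# 10,000 random 3-character strings (smaller than the 1,000,000 used above)
data = [''.join(random.choice(alphabet) for _ in range(3)) for _ in range(10000)]


def all_strings(values):
    """Hypothetical stand-in for the all-string check."""
    return all(isinstance(v, str) for v in values)


def lower_all(values):
    """Hypothetical stand-in for the string operation itself."""
    return [v.lower() for v in values]


# Best-of-3 timings, 10 calls each
check_time = min(timeit.repeat(lambda: all_strings(data), number=10, repeat=3))
op_time = min(timeit.repeat(lambda: lower_all(data), number=10, repeat=3))
print(f"check: {check_time:.4f}s  lower: {op_time:.4f}s  ratio: {check_time / op_time:.2f}")
```

The printed ratio gives a rough idea of how much the check costs relative to the operation it guards.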

If this looks OK, I'll extend it to all the funcs that numpy supports.
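For illustration, the dispatch described above might look roughly like the following. This is a minimal sketch under stated assumptions, not the patch in this PR: `lower_with_fast_path` is a hypothetical name, the all-string check is a pure-Python stand-in for `lib.is_string_array`, and returning NaN for non-string values in the fallback is assumed to mirror what `Series.str` does.

```python
import numpy as np


def lower_with_fast_path(values):
    """Lowercase an object array, using numpy's string funcs when possible.

    Hypothetical helper for illustration only.
    """
    values = np.asarray(values, dtype=object)

    # Stand-in for the all-string check (the PR uses lib.is_string_array)
    if all(isinstance(v, str) for v in values):
        # Fast path: convert to a fixed-width unicode array and let numpy
        # apply the operation, then go back to object dtype
        return np.char.lower(values.astype(str)).astype(object)

    # Fallback path: element-wise, with non-string values mapped to NaN
    # (assumed behaviour, see the note above)
    return np.array(
        [v.lower() if isinstance(v, str) else np.nan for v in values],
        dtype=object,
    )
```

Calling `lower_with_fast_path(['AbC', 'X1y'])` takes the numpy path, while `lower_with_fast_path(['AbC', 1])` falls back to the element-wise loop.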

@sinhrks sinhrks added the Performance (Memory or execution speed performance) and Strings (String extension data type and string data) labels May 14, 2015
@sinhrks sinhrks added this to the 0.17.0 milestone May 14, 2015
@sinhrks sinhrks force-pushed the str_perf branch 2 times, most recently from 8b39ce5 to 80f436e Compare May 14, 2015 15:14
@sinhrks
Member Author

sinhrks commented May 14, 2015

Ah, I noticed the comparison above is not fair; preparing a valid one...

@sinhrks
Member Author

sinhrks commented May 14, 2015

I mistakenly thought the difference between str and object was caused by the numpy logic. numpy appears to use logic similar to pandas, so we cannot expect such a performance gain... Allow me to close this.

@sinhrks sinhrks closed this May 14, 2015
@jorisvandenbossche
Member

Where did the initial speed-up come from then?

@sinhrks
Member Author

sinhrks commented May 17, 2015

The difference above already exists on current master, depending on the dtypes handled. I mistakenly thought it was caused by the logic I changed, but it is not.

@jorisvandenbossche jorisvandenbossche modified the milestones: 0.17.0, 0.16.2, No action Jun 2, 2015
@sinhrks sinhrks deleted the str_perf branch November 13, 2015 05:40