Let's say I have the following extremely large string in Python 3.x, several GB in size and more than 10 billion characters long:

string1 = "XYZYXZZXYZZXYZYXYXZYXZYXZYZYZXY.....YY" 

Given its length, this already takes several GB of RAM to load.

I would like to write a function that will replace every X with A, Y with B, and Z with C. My goal is to make this as quick as possible. Naturally, this should be efficient as well (e.g. there may be some RAM trade-offs I'm not sure about).

The most obvious solution to me is to chain calls to str.replace():

    def replace_characters(input_string):
        new_string = input_string.replace("X", "A").replace("Y", "B").replace("Z", "C")
        return new_string

    foo = replace_characters(string1)
    print(foo)

which outputs

ABCBACCABCCABCBABACBACBACBCBCAB...BB

I worry this is not the most efficient approach, as I'm chaining three calls, each of which makes a full pass over such a large string.

What is the most efficient solution for a string this large?

2 Comments
  • How does your current approach perform? Do you have reason to believe that it is unsatisfactory in some way? Commented Jun 25, 2017 at 3:11
  • @wallyk It's clunky. I think .replace() is first passing through the entire string. So, this function is actually three function calls with at least three temporary strings held in memory. It's not terribly efficient. Commented Jun 25, 2017 at 3:31

1 Answer

A more memory-efficient method, one that does not generate so many temporary strings along the way, is to use str.translate.

    >>> string1 = "XYZYXZZXYZZXYZYXYXZYXZYXZYZYZXY"
    >>> string1.translate({ord("X"): "A", ord("Y"): "B", ord("Z"): "C"})
    'ABCBACCABCCABCBABACBACBACBCBCAB'

This will allocate just one (extra large in your case) string.
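As a minimal sketch of the same idea, the mapping can also be built once with `str.maketrans` and reused across calls. The function name `replace_characters` is borrowed from the question, `TABLE` is a name chosen here for illustration, and the short sample string stands in for the real multi-GB data.

```python
# Build the translation table once; str.maketrans maps each character in the
# first argument to the character at the same position in the second.
TABLE = str.maketrans("XYZ", "ABC")


def replace_characters(input_string):
    """Return a copy of input_string with X->A, Y->B, Z->C in a single pass."""
    return input_string.translate(TABLE)


sample = "XYZYXZZXYZZXYZYXYXZYXZYXZYZYZXY"  # short stand-in for the real string
result = replace_characters(sample)
print(result)
```

Characters that are not in the table pass through unchanged, so the result should match the chained `.replace()` version for this particular mapping.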

4 Comments

Oh, didn't know about this one.
@Coldspeed Should be a lot faster than a regex I'd expect!
Whoah! Excellent solution
Or you could just read the entire data into an array (docs.python.org/3/library/array.html) and then convert the specific bytes in the array. No need to allocate another huge buffer.
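The in-place idea from the last comment could look roughly like the sketch below. It rests on assumptions that are not in the original thread: the data is ASCII-only and held as bytes (for example, read from a file in binary mode) rather than as a `str`, and a `bytearray` is used in place of the `array` module the comment links to. The names `translate_in_place` and `chunk_size` are hypothetical.

```python
# Byte-level lookup table: bytes.maketrans returns a 256-byte table.
TABLE = bytes.maketrans(b"XYZ", b"ABC")


def translate_in_place(buf, chunk_size=1 << 20):
    """Translate a bytearray chunk by chunk.

    Only one chunk-sized temporary exists at a time, rather than a
    second full-size copy of the data.
    """
    for start in range(0, len(buf), chunk_size):
        chunk = buf[start:start + chunk_size]  # temporary copy of one chunk
        # The translated chunk has the same length, so the buffer is not resized.
        buf[start:start + chunk_size] = chunk.translate(TABLE)


data = bytearray(b"XYZYXZZXYZZXYZYXYXZYXZYXZYZYZXY")  # stand-in for the real data
translate_in_place(data, chunk_size=8)
print(data.decode("ascii"))
```

This trades a single large allocation for many small ones; whether it is actually faster than `str.translate` on the real data would need to be measured.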
