
I have a large set of strings, and I am looking to extract a certain part of each of them. Each string contains a substring like this:

my_token:[ "key_of_interest" ], 

This is the only place in each string where my_token appears. I was thinking about getting the end index position of ' my_token:[" ' and, after that, getting the beginning index position of ' "], ', then taking all the text between those two index positions.
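For concreteness, this is a rough sketch of the index-based approach I have in mind (the marker strings are taken from the example above, and s stands for one of my strings):

    start_marker = 'my_token:[ "'
    end_marker = '" ]'
    start = s.index(start_marker) + len(start_marker)  # position just past the opening marker
    end = s.index(end_marker, start)                   # first closing marker after that point
    key_of_interest = s[start:end]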

Is there a better or more efficient way of doing this? I'll be doing this for strings of length ~10,000 and sets of size 100,000.

Edit: The file is a .ion file. From my understanding it can be treated as a flat file, as it is text-based and used for describing metadata.

  • If the string is JSON, use json.loads and access it on the parsed object; don't try to slice it as a string (see the sketch after these comments). Commented Jan 6, 2016 at 2:22
  • Assuming that is the only time my_token appears in each string, without an additional constraint (such as 'it is likely to be in the last half of the string'), what sort of efficiency improvement are you looking for? I think you could get a marginal increase in efficiency using regex to simply capture key_of_interest by making a regex for the surrounding characters, but not to an algorithmically significant degree. Commented Jan 6, 2016 at 2:24
  • @Amadan Why would that be more efficient? If the string is already in-memory and he has no need for any other part of the string, isn't that just adding the overhead of loading the string into a separate object? Commented Jan 6, 2016 at 2:25
  • @NathanielFord: You are right, I was thinking of safety and managed to skip the bit where it is "the only part in each string it says my_token". Feel free to disregard. Commented Jan 6, 2016 at 2:28
  • You should clarify your question. Speed-wise, your solution is the fastest possible, but it might be reasonable to actually sacrifice a bit of processing speed in favor of other factors, such as abstraction, expandability etc. Commented Jan 6, 2016 at 9:56
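A minimal sketch of the json.loads suggestion from the first comment, assuming (purely hypothetically) that the whole string were valid JSON rather than the actual format:

    import json

    # Hypothetical example document; this only works if the string really is valid JSON.
    s = '{ "my_token": [ "key_of_interest" ] }'
    obj = json.loads(s)
    key_of_interest = obj["my_token"][0]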

3 Answers


How could this possibly be done in the "dumbest and simplest way"?

  • find the starting position
  • search onward from there for the ending position
  • grab everything indiscriminately between the two

This is indeed what you're doing. Thus any further improvement can only come from optimizing each step. Possible ways include:

  • narrow down the search region (requires additional constraints/assumptions, as per the comments on the question)
  • speed up the search operation bits, which include:
    • extracting raw data from the format
      • you have already done this by disregarding the format altogether, so you have to make sure there will never be any incorrect parsing (e.g. your search terms embedded in strings elsewhere, or matching part of a token), as per the comments on the question
    • elementary pattern comparison operation
      • unlikely to improve on in pure Python, since str.index is implemented in C already and its implementation is probably already as simple as it can possibly be
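If in doubt, this is easy to check empirically. A rough timeit sketch, with a made-up sample string of roughly the size you mention:

    import timeit

    # Made-up ~10,000-character sample with the token somewhere in the middle.
    sample = 'padding ' * 600 + 'my_token:[ "key_of_interest" ], ' + 'padding ' * 600

    def by_index(s=sample):
        start = s.index('my_token:[ "') + len('my_token:[ "')
        return s[start:s.index('" ]', start)]

    print(timeit.timeit(by_index, number=10000))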



The underlying requirement shows through when you clarify:

I was thinking about getting the end index position of ' my_token:[" ' and after that getting the beginning index position of ' "], ' and getting all the text between those two index positions.

That sounds like you're trying to avoid the correct approach: use a parser for whatever language is in the string.

There is no good reason to build directly on top of string primitives for parsing, unless you are interested in writing yet another parsing framework.

So, use libraries written by people who have dealt with the issues before you.

  • If it's JSON, use the standard library json module; ditto if it's some other language with a parser already in the Python standard library.
  • If it's some other widely-implemented standard: get whichever already-existing third-party Python library knows how to parse that properly.
  • If it's not already implemented: write a custom parser using pyparsing or some other well-known solid library.

So, to make a good choice, you need to know what the data format is (this is not answered by the file names; rather, you need to know the data format of the content of those files). Then you'll be able to search for a parser library that knows about that data format.
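For illustration only, a rough sketch of what a tiny pyparsing grammar for the fragment in the question might look like (this is built around the markers shown in the question and assumes the third-party pyparsing package; it is not a parser for the actual file format):

    from pyparsing import Suppress, QuotedString

    # Grammar for: my_token:[ "key_of_interest" ],
    key = QuotedString('"')
    fragment = Suppress("my_token:[") + key("key") + Suppress("],")

    for match in fragment.searchString(s):  # s is one of the large strings
        key_of_interest = match["key"]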

1 Comment

The file is a .ion file. Would you suggest using pyparsing?

Well, as already mentioned - a parser seems the best option.

But to answer your question without all the extra advice: if you're just looking at speed, a parser isn't really the best way of doing this. The faster method, if you already have a string like this, would be to use a regex.

    import re
    matches = re.search(r'my_token:\[\s*"(.*?)"\s*\]\s*,', s)  # search, not match: the token sits mid-string
    key_of_interest = matches.group(1)

There are other issues that come up. For example, what if your key has a " inside it? Stringified JSON will automatically use an escape character there, and that will be captured by the regex too. So this can get a bit too complicated.

And JSON is not parsable by regex in itself (is-json-a-regular-language). So, use at your own risk. But with the appropriate restrictions and assumptions, a regex would be faster than a JSON parser.
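Since you mention sets of around 100,000 strings, it is also worth compiling the pattern once and reusing it. Roughly (strings here stands for your collection):

    import re

    # Compile once, reuse for every string in the set.
    TOKEN_RE = re.compile(r'my_token:\[\s*"(.*?)"\s*\]\s*,')

    def extract_key(s):
        m = TOKEN_RE.search(s)
        return m.group(1) if m else None

    keys = [extract_key(s) for s in strings]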

2 Comments

The file is not a JSON file. It is a .ion file.
That's interesting. I've never heard of a .ion file. Could you give the full form, or say what it is being used for?
