6

This might be a very broad question. I wanted to create a way to represent strings that would support O(1) append, O(1) append to the left, and O(1) comparison while maintaining O(N) slicing and O(1) indexing. My idea is that I would store unicode characters as their unicode number, and use mathematical operations to add and delete characters from either end. I called it NumString:

class Numstring: def __init__(self, init_str=""): self.num = 0 self.length = 0 for char in init_str: self.append(char) def toString(self, curr=None): if curr is None: curr = self.num retlst = [] while curr: char = chr(curr % 10000) retlst.append(char) curr = curr // 10000 return "".join(retlst) def appendleft(self, char): assert len(char) == 1 self.num *= 10000 self.num += ord(char) self.length += 1 def append(self, char): self.num += ord(char)*(10000**self.length) self.length += 1 def pop(self): # self.num = self.num % (10000**self.length-1) self.num = self.num % 10000**(self.length-1) self.length -= 1 def popleft(self): self.num = self.num // 10000 self.length -= 1 def compare(self, numstring2): return self.num == numstring2.num def size(self): return self.length def get(self, start, end=None): if start >= self.length: raise IndexError("index out of bounds") if end and end > self.length + 1: raise IndexError("index out of bounds") if end is not None: if start == end: return "" curr = self.num curr = curr // (10000 ** start) curr = curr % 10000**(end) return self.toString(curr) else: curr = self.num curr = curr // (10000 ** start) char = chr(curr % 10000) return char 

Here is the profiling code:

import string import time import random from NumString import Numstring import numpy as np import matplotlib.pyplot as plt if __name__ == '__main__': numstring_times = [] regstring_times = [] numstring = Numstring("abc") regstring = "abc" for i in range(0, 10000): char_to_add = random.choice(string.ascii_letters) start_time = time.time() numstring.append(char_to_add) end_time = time.time() numstring_times.append(end_time-start_time) start_time = time.time() regstring += char_to_add end_time = time.time() regstring_times.append(end_time-start_time) plt.plot(numstring_times, label="Numstring") plt.plot(regstring_times, label="Regular strings") plt.legend() plt.show() 

It works the way I want it to, but when I tested the time it takes to append to the string and my Numstring class performed way worse. I understood that there would be overhead, but I didn't expect the overall trend to be way worse than concatenating strings in python.

Time it takes to append a character to

Curiously, comparison times are actually better for Numstrings than regular strings:

Comparison times for Numstrings vs Regular strings

I've realized I don't really understand how integer operations in Python are implemented and what the limitations of infinite integer precision are. Could somebody help me understand what I'm missing?

2
  • 1
    provide the profiling code... Commented Sep 24, 2020 at 23:59
  • @juanpa.arrivillaga just did Commented Sep 25, 2020 at 0:02

1 Answer 1

5

ints are essentially represented as strings of digits (in a higher base than 10), and most operations on them, including addition and multiplication, take time proportional to the number of digits they contain or worse. Because the base you’re using (10000) doesn’t match the base ints use, operations like multiplying or dividing by the base become a complex operation instead of a simple copying of bytes of memory. So it’s pretty much a less efficient reimplementation of the operations strings already do (which is what you found by benchmarking) and it doesn’t support all of Unicode (which goes up to 0x10FFFF, not 10000).

Hint for a data structure that actually has the properties you’re looking for, though: a deque based on a circular buffer.

Sign up to request clarification or add additional context in comments.

10 Comments

I thought of using a deque, but I also wanted O(1) comparison. In my understanding, to compare two strings you have to go through all the characters and see if they're equal. I wanted to use numbers because I thought it was an O(1) operation to compare. I guess to a computer, there's no way to actually get O(1) comparison, since you'll always need to check the bytes
@AlexPeniz: It isn’t an O(1) operation to compare arbitrary-size ints. Like the rest of the operations, it’s O(digits). Algorithms often work with integers of bounded size (and integers in other languages are often bounded, even), whose operations can be considered O(1) because of that.
got it, that makes a lot of sense. Sad, I spent a fair bit of time on this haha
@AlexPeniz: You can get probabilistic O(1) comparison, though, by storing a hash alongside the deque and updating them in sync. The probability of failure can even be negligible, through use of a cryptographic hash (warning: asymptotically fast, but with a big constant factor). It kind of depends on what you’re using it for.
@AlexPeniz that's why the first rule of optimization is to profile. Your intuition will often lead you astray.
|

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.