A smart bubble/insertion sort is faster than quick sort when the array is already mostly sorted (reused on the next frame).
To speed up the copy of data rather than copying entire vertex values use an index buffer to sort the vertices.
When you remove quads you can quickly degenerate them (set all vertices to the same value) instead of removing them and copying over the following quads.
You can then gradually compact them over multiple frames or compact them only as needed. Inserting into a "swiss-cheese" vertex array with such holes lets you move data only up to the first hole and stop there rather than move the whole array.
Krom Stern's answer of using linked vectors (a list of shorter arrays) is excellent as you can fine tune the technique to a sweet spot between [number of draw calls] and [sort time of individual buffers], always put an extra bit of free space in each buffers.
And for the more adventurous:
If insertion speed is critical you can use the top-1 N bits of a positive float as an integer, skipping the sign bit, into an array of linked lists.
IEEE floating points of the same sign can be compared and sorted as integers, its part of the standard's design. They are one's-complements rather than two's-complements but this doesn't matter if we're only using positive z. If we need both positive and negative z then the negative z needs a set of reverse-sort functions due to being one's-complements.
#define NBITS 10 // 1024 entries SortList *lists[1 << NBITS] = {0}; ... float z = object->z; if(z <= 0){ // IMPORTANT this includes negative zero which is distinct from positive zero z = 0; // negative float number are in reverse when interpreted as integers, this would need a set of reverse functions } uint32_t raw_z_bits = ((uint32_t &)z); uint32_t list_number = (raw_z_bits >> (31 - NBITS)); // 31 because we skip the sign bit uint32_t index_within_list = raw_z_bits & ((1 << (31 - NBITS))-1); AddToList(lists[list_number], index_within_list, object );
This is useful when you don't know the range you'll need to sort and/or are already using floating point depth values.
If you're using integer depth values using this float hack will limit you to 23 bits of depth values (0 to +8M) as 32bit floats only have 23 bits (+ 1bit sign) of precision (8bits used for the exponent).
All this gets complicated quickly so a good idea for debugging is to have an independant sanity-check using a per-frame sort-all-active-objects-and-verify function that will confirm everything is sorted properly and there are no bugs in the faster, gradual sorting system.
std::sort(), but I'm thinking to switch over to a quad-tree. \$\endgroup\$