
I've been testing the time complexity of different sorting algorithms on different number sequences, and it was all going well until I got the results for quicksort (with the pivot in the middle) on sequences where one half is ascending and the other descending. The graph:

[Graph: measured running times for quicksort on A- and V-shaped sequences]

(By "V" I mean a sequence in which the first half is descending, and the other ascending, and by "A" I mean a sequence where the first half is ascending, and the other half is descending.)

Results for other kinds of sequences look as I would expect, but maybe there is something wrong with my algorithm?

```c
void quicksort(int l, int p, int *tab) {
    int i = l, j = p, x = tab[(l + p) / 2], w; // x - pivot
    do {
        while (tab[i] < x) { i++; }
        while (x < tab[j]) { j--; }
        if (i <= j) {
            w = tab[i]; tab[i] = tab[j]; tab[j] = w;
            i++; j--;
        }
    } while (i <= j);
    if (l < j) { quicksort(l, j, tab); }
    if (i < p) { quicksort(i, p, tab); }
}
```

Does anybody have an idea what caused such weird results?

  • Known flaw of a quicksort with middle pivot. Classical. Commented Apr 5, 2016 at 20:36
  • @SergeyA I've been searching the internet for similar graphs and found nothing. Is there any place I could read up on why it jumps between complexities? Commented Apr 5, 2016 at 21:18
  • Well, what is your metric for creating such an epic graph? Are you measuring time? How? Commented Apr 5, 2016 at 21:18
  • @RainbowTom clock_t start = clock(); quicksort(0,n-1,tab); clock_t finish = clock(); sfile<<double(finish-start)/CLOCKS_PER_SEC<<" "; Commented Apr 5, 2016 at 21:22
  • @vois what's wrong with my answer? :) Commented May 14, 2016 at 11:18

4 Answers


TL;DR: The problem is the pivot-choosing strategy, which makes repeatedly poor choices on these types of inputs (A- and V-shaped sequences). These result in quicksort making highly "imbalanced" recursive calls, which in turn result in the algorithm performing very poorly (quadratic time for A-shaped sequences).

Congratulations, you've (re)discovered an adversarial input (or rather a family of inputs) for the version of quicksort that chooses the middle element as the pivot.

For reference, an example of an A-shaped sequence is 1 2 3 4 3 2 1, i.e., a sequence that increases, reaches its peak at the middle, and then decreases; an example of a V-shaped sequence is 4 3 2 1 2 3 4, i.e., a sequence that decreases, reaches its minimum at the middle, and then increases.

Think about what happens when you pick the middle element as the pivot of an A- or a V-shaped sequence. In the first case, when you pass the algorithm the A-shaped sequence 1 2 ... n-1 n n-1 ... 2 1, the pivot is the largest element of the array, because the largest element of an A-shaped sequence is the middle one, and you choose the middle element as the pivot. You will then make recursive calls on subarrays of sizes 0 and n-1 (your code doesn't actually make the call on 0 elements). In the next call, on the subarray of size n-1, you will pick as the pivot the largest element of that subarray (which is the second-largest element of the original array); and so on. This results in poor performance: the running time is O(n)+O(n-1)+...+O(1) = O(n^2), because in each step you essentially pass on almost the whole array (all elements except the pivot). In other words, the sizes of the subarrays in the recursive calls are highly imbalanced.

Here's the trace for the A-shaped sequence 1 2 3 4 5 4 3 2 1:

```
blazs@blazs:/tmp$ ./test
pivot=5
   1   2   3   4   1   4   3   2   5
pivot=4
   1   2   3   2   1   3   4   4
pivot=3
   1   2   3   2   1   3
pivot=3
   1   2   1   2   3
pivot=2
   1   2   1   2
pivot=2
   1   1   2
pivot=1
   1   1
pivot=4
   4   4
   1   1   2   2   3   3   4   4   5
```

You can see from the trace that at each recursive call the algorithm chooses a largest element (there can be up to two largest elements, hence the article a, not the) as the pivot. This means that the running time for the A-shaped sequence really is O(n)+O(n-1)+...+O(1) = O(n^2). (In technical jargon, the A-shaped sequence is an example of an adversarial input that forces the algorithm to perform poorly.)

This means that if you plot running times for "perfectly" A-shaped sequences of the form

1 2 3 ... n-1 n n-1 ... 3 2 1 

for increasing n, you will see a nice quadratic function. Here's a graph I just computed for n = 5, 105, 205, 305, ..., 9905 for A-shaped sequences 1 2 ... n-1 n n-1 ... 2 1:

Running times for A-shaped sequences
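The quadratic behavior can also be checked without a clock at all, by counting comparisons instead of measuring seconds (counts are deterministic, so the "graph" comes out perfectly smooth). Below is a sketch; the helper names are mine, not from the original code, but the partitioning scheme is the same as in the question. Doubling the array length should roughly quadruple the count on A-shaped inputs:

```c
#include <stddef.h>
#include <stdlib.h>

/* Same partitioning scheme as in the question (middle element as the
 * pivot), instrumented to count element comparisons. */
static long long comparisons;

static void quicksort_counting(int l, int p, int *tab) {
    int i = l, j = p, x = tab[(l + p) / 2], w; /* x - pivot */
    do {
        while (comparisons++, tab[i] < x) i++;
        while (comparisons++, x < tab[j]) j--;
        if (i <= j) {
            w = tab[i]; tab[i] = tab[j]; tab[j] = w;
            i++; j--;
        }
    } while (i <= j);
    if (l < j) quicksort_counting(l, j, tab);
    if (i < p) quicksort_counting(i, p, tab);
}

/* Builds the A-shaped sequence 1 2 ... n-1 n n-1 ... 2 1 (2n-1 elements),
 * sorts it, and returns the number of comparisons performed. */
long long count_comparisons_A(size_t n) {
    size_t len = 2 * n - 1, i;
    int *tab = malloc(sizeof *tab * len);
    for (i = 0; i + 1 < n; i++)
        tab[i] = tab[len - 1 - i] = (int)(i + 1);
    tab[n - 1] = (int)n;
    comparisons = 0;
    quicksort_counting(0, (int)len - 1, tab);
    free(tab);
    return comparisons;
}
```

For example, count_comparisons_A(200) should come out close to four times count_comparisons_A(100), which is the signature of quadratic growth.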

In the second case, when you pass the algorithm a V-shaped sequence, you choose the smallest element of the array as the pivot, and thus make recursive calls on subarrays of sizes n-1 and 0 (your code doesn't actually make the call on 0 elements). In the next call, on the subarray of size n-1, you will pick the largest element as the pivot; and so on. (But you won't always make such terrible choices; this case is slightly more complicated, as it depends on how you do the "moving" step, and it's hard to say much more about it.) This results in poor performance for similar reasons.

Here's a graph of running times for V-shaped sequences n n-1 ... 2 1 2 ... n-1 n for n = 5, 105, 205, ..., 49905. The running times are somewhat less regular; as I said, this case is more complicated, because you don't always pick the smallest element as the pivot. The graph:

Running times for V-shaped sequences for increasing sizes.

Code that I used to measure time:

```c
double seconds(size_t n) {
    int *tab = (int *)malloc(sizeof(int) * (2*n - 1));
    size_t i;
    // construct A-shaped sequence 1 2 3 ... n-1 n n-1 ... 3 2 1
    for (i = 0; i < n-1; i++) {
        tab[i] = tab[2*n-i-2] = i+1;
        // To generate a V-shaped sequence, use tab[i] = tab[2*n-i-2] = n-i+1;
    }
    tab[n-1] = n; // For a V-shaped sequence use tab[n-1] = 1;
    clock_t start = clock();
    quicksort(0, 2*n-2, tab);
    clock_t finish = clock();
    free(tab);
    return (double)(finish - start) / CLOCKS_PER_SEC;
}
```

I adapted your code to print the "trace" of the algorithm so that you can play with it yourself and gain insight into what's going on:

```c
#include <stdio.h>

void print(int *a, size_t l, size_t r);
void quicksort(int l, int p, int *tab);

int main() {
    int tab[] = {1,2,3,4,5,4,3,2,1};
    size_t sz = sizeof(tab) / sizeof(int);
    quicksort(0, sz-1, tab);
    print(tab, 0, sz-1);
    return 0;
}

void print(int *a, size_t l, size_t r) {
    size_t i;
    for (i = l; i <= r; ++i) {
        printf("%4d", a[i]);
    }
    printf("\n");
}

void quicksort(int l, int p, int *tab) {
    int i = l, j = p, x = tab[(l+p)/2], w; // x - pivot
    printf("pivot=%d\n", x);
    do {
        while (tab[i] < x) { i++; }
        while (x < tab[j]) { j--; }
        if (i <= j) {
            w = tab[i]; tab[i] = tab[j]; tab[j] = w;
            i++; j--;
        }
    } while (i <= j);
    print(tab, l, p);
    if (l < j) { quicksort(l, j, tab); }
    if (i < p) { quicksort(i, p, tab); }
}
```

By the way, I think the graph showing the running times would be smoother if you took the average of, say, 100 running times for each input sequence.

We see that the problem here is the pivot-choosing strategy. Let me note that you can alleviate the problems with adversarial inputs by randomizing the pivot-choosing step. The simplest approach is to pick the pivot uniformly at random (each element is equally likely to be chosen as the pivot); you can then show that the algorithm runs in O(n log n) time with high probability. (Note, however, that to show this sharp tail bound you need some assumptions on the input; the result certainly holds if the numbers are all distinct; see, for example, Motwani and Raghavan's Randomized Algorithms book.)
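A minimal sketch of the randomized variant, assuming the same Hoare-style partitioning as in the question; only the line choosing x changes (the function name is mine):

```c
#include <stdlib.h>

/* The question's quicksort with one change: the pivot value is drawn
 * uniformly at random from tab[l..p] instead of always taking the middle
 * element. Call srand(...) once at program start before using this. */
void quicksort_random(int l, int p, int *tab) {
    int i = l, j = p, w;
    int x = tab[l + rand() % (p - l + 1)]; /* random pivot */
    do {
        while (tab[i] < x) i++;
        while (x < tab[j]) j--;
        if (i <= j) {
            w = tab[i]; tab[i] = tab[j]; tab[j] = w;
            i++; j--;
        }
    } while (i <= j);
    if (l < j) quicksort_random(l, j, tab);
    if (i < p) quicksort_random(i, p, tab);
}
```

The correctness argument for the partitioning step only needs x to be some value present in tab[l..p], so changing which index supplies the pivot doesn't affect correctness, only the distribution of subproblem sizes.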

To corroborate my claims, here's the graph of running times for the same sequences if you choose the pivot uniformly at random, with x = tab[l + (rand() % (p-l))]; (make sure you call srand(time(NULL)) in main). For A-shaped sequences:

[Graph: running times for A-shaped sequences with a random pivot]

For V-shaped sequences:

[Graph: running times for V-shaped sequences with a random pivot]


Comments


In quicksort, one of the major things that affects running time is whether the input is random.

Generally, choosing a pivot at a fixed position may not be the best choice unless it is certain that the input is randomly shuffled. Using median-of-three partitioning is one of the widely used ways to make sure the pivot is not a consistently bad choice. Your code doesn't implement it.
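A minimal sketch of median-of-three selection (the function name is mine); you would use its result as x in place of tab[(l+p)/2]:

```c
/* Median-of-three pivot selection: returns the median of the first,
 * middle, and last element of tab[l..p]. On an A- or V-shaped sequence
 * this avoids picking the extreme middle element as the pivot. */
int median_of_three(const int *tab, int l, int p) {
    int m = l + (p - l) / 2;
    int a = tab[l], b = tab[m], c = tab[p];
    if ((a <= b && b <= c) || (c <= b && b <= a)) return b;
    if ((b <= a && a <= c) || (c <= a && a <= b)) return a;
    return c;
}
```

On the A-shaped sequence 1 2 ... n ... 2 1, for instance, the first and last elements are both 1 and the middle is n, so the median is 1 rather than the maximum, already a different (though still not random) choice.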

Also, recursive quicksort incurs some overhead from the call stack (it has to set up stack frames and pass parameters), so it's advisable that when the size of the remaining subarray is around 10-20 elements you switch to another sorting algorithm, such as insertion sort; this can make it about 20% faster.

```c
void quicksort(int l, int p, int *tab) {
    if (p - l + 1 <= 10) {
        insertionSort(tab, l, p);
        return;
    }
    // ... partition and recurse as before ...
}
```

Something of this nature.

In general, quicksort's best-case running time is O(n log n), and its worst-case running time is O(n^2), often caused by non-random inputs or inputs with many duplicates.

2 Comments

Could a non-random input cause it to jump between n log n and n^2 complexity? If so, why?
Yes, a non-random input can cause it to jump from n log n to n^2; it depends on the implementation, though. That's why, to be on the safe side, some people shuffle the input first. If the input is not random, picking the middle element makes little difference, because the resulting partition will be tilted to one side. I can't give a full explanation here, but you can check out the "Choice of pivot" part of en.wikipedia.org/wiki/Quicksort

All the answers here make very good points. My idea is that there is nothing wrong with the algorithm (the pivot problem is well known, and it is a reason for O(n^2) behavior), but that there is something wrong with the way you measure it.

clock() returns the number of processor ticks elapsed since some fixed point (probably program launch? Not important).

Your way of measuring time relies on the tick length being constant, which I don't think is guaranteed.

The point is that many (all?) of today's modern processors dynamically change their frequency to save energy. I think this is very non-deterministic, so every time you launch your program, the CPU frequency will depend not only on the size of your input but also on what else is happening in your system right now. The way I understand it, the length of one tick can vary considerably during program execution.

I tried to look up what the macro CLOCKS_PER_SEC actually does. Is it the current clocks per second? Does it average over some mysterious time period? I sadly wasn't able to find out. Therefore, I think your way of measuring time could be totally wrong.

Since my argument rests on something I don't know for sure, I might be totally wrong.

One way to find out is to run multiple tests with the same data under different overall system loads and see whether it behaves significantly differently each time. Another way is to pin your computer's CPU frequency to some static value and test in a similar way.

IDEA: Wouldn't it be better to measure "time" in ticks?
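An alternative to clock() worth considering (this is my sketch, assuming a POSIX system such as Linux; the function names are mine): clock_gettime with CLOCK_MONOTONIC reads a steadily advancing wall clock that is not affected by system time adjustments, while clock() measures consumed CPU time in ticks.

```c
#define _POSIX_C_SOURCE 199309L
#include <time.h>

/* Volatile sink so the compiler can't optimize the work loop away. */
volatile long sink;

void busy_work(void) {
    for (long k = 0; k < 1000000; k++) sink += k;
}

/* Times a call to `work` using the monotonic wall clock and returns the
 * elapsed time in seconds. */
double elapsed_monotonic(void (*work)(void)) {
    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    work();
    clock_gettime(CLOCK_MONOTONIC, &t1);
    return (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
}
```

Note this measures elapsed wall time, so it includes time spent in other processes; for benchmarking a single sort, averaging several runs on a lightly loaded machine is still advisable.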

EDIT 1: Thanks to @BeyelerStudios, we now know for certain that you shouldn't rely on clock() on Windows machines, because Microsoft's implementation doesn't follow the C99 standard. Source

I hope I helped; if I am wrong, please correct me. I am a student, not a HW specialist.

12 Comments

In Microsoft's libraries CLOCKS_PER_SEC is a constant value (at least up to VS2013; I don't know about MSVC 14.0)
@BeyelerStudios That can mean either that a program is able to force its CPU frequency, or that this method of time calculation just doesn't work.
Normalizing this would be difficult: how would you normalize it well? There is no (visible) bond between clock() and CLOCKS_PER_SEC, which would mean there is some magic happening somewhere inside the code, and that's not very good design, so I don't think it works this way. Think of it like this: to measure time, you have to call clock() twice. How would the background code know at which point the normalization should begin? Every even call? That would be extremely bad design.
Well, it is possible to do that. The magic would be to keep track of the last two clock() calls and run two normalizations simultaneously. Then the macro would report CLOCKS_PER_SEC correctly for the last two calls. But does it work this way?
If we read the MSDN article, two things are notable: clock() as implemented by MS is not standard (it doesn't follow C99), and it's specifically mentioned that you shouldn't rely on it for time computations[1]. So there, don't use clock for anything... [1] keywords: "approximately 1/CLOCKS_PER_SEC" and "breaks after about 59 hours"

Quicksort has a worst-case time complexity of O(n^2) and an average of O(n log n) for n entries in a data set. More details on an analysis of the time complexity can be found here:

https://www.khanacademy.org/computing/computer-science/algorithms/quick-sort/a/analysis-of-quicksort

and here:

http://www.cise.ufl.edu/class/cot3100fa07/quicksort_analysis.pdf

2 Comments

Yes, but should an algorithm jump between complexities for the same type of sequence?
@vois I think, it shouldn't. See my answer.
