Problem
Input: array , .
Output: th smallest number of .
Consider , which is just finding the minimum. This can obviously be done in , or if we try to sort first. Finding is simple too, we first find the smallest element, remove it from , and then rerun to find the newest smallest.
This doesn’t scale though, especially for larger values of . Since , we have to iterate through times, so runtime is . Imagine we want to find the median, . Then, this algorithm runs in , and we could have instead just sorted the array to find the th element.
That would take , but we can actually find the solution in linear time.
Procedure
The algorithm works as follows: at each step, we (somehow) pick an element as the pivot, which we use to partition with, just like quicksort. Then we check which position the pivot ended up in.
- If is less than the position, we know it has to be in the array to the left of the pivot’s position.
- If is greater than the position, then the element is in the array on the right.
- If equals the pivot position, then we’ve found the element. The pivot is the answer.
Naturally, this leads to a divide and conquer algorithm.
Picking the best pivot to partition around
Pick an element as pivot, scan through array and partition such that everything to the left of pivot is smaller than element, and to the right everything is greater than pivot.
This determines how many elements we recurse into, and therefore the overall runtime of the algorithm. The less elements we recurse into at each subproblem, the faster we get.
To divide and partition the array, we scan across the whole array, so that takes . Then, the three cases determine the recursive runtime:
- We’ve found the th element, no other steps are needed. This is .
- Recursing on means next step takes .
- Recursing on means next step takes .
How big and are depend on what we choose as the pivot.
Worst case pivot
If the pivot was the min or max element in , we’re essentially performing work for a single unit of progress (one element has been “removed” from ). So avoid that.
This is , which is the same as implementing insertion sort using D&C. Recurrence simplifies to . Here, we are summing consecutive numbers, and gives us an overall runtime of .
Picking the median as the pivot
Consider if the pivot was the th element of , aka the median. Then the recurrence relation is given by . Expanding gives us
Solving the geometric series tells us that the value is less than . Therefore , and therefore . We just made the selection algorithm linear!
Issue is, how do we find the median of the array? Isn’t that what we were trying to find in the first place, using this algorithm?
It turns out that finding the median will take way too long. We can, however, find a good enough pivot such that the time it takes to find it still fits in our budget, and also allows us to still eliminate enough elements during each recursion for the overall algorithm to remain linear.
Picking a “good enough” pivot
Consider the pivot that splits the array such that (both arrays contain at least elements). This means every recursion step, we remove at least elements.
Solving the recurrence for this gives us a geometric series that still evaluates to a constant less than infinity, and so runtime is still .
How do we find such a good pivot during each divide step in such a small runtime, such that the term is still the upper limit? The answer is by using the median of medians as the pivot.
Algorithm
- Group the array into collections of 5. There may be a remaining group with less than 5 elements.
- Find the median within each collection, including that last group.
- Recursively call the algorithm on the array of medians, where . We want to find the median of these medians.
- Partition using the median of the medians. Keep track of the final position of our pivot.
- Depending on the three cases, we either return as our answer, recurse into , or recurse into (maybe updating along the way, because we’re now considering a smaller array).
Runtime
Grouping the array takes time, and finding the median of each collection takes time (because it’s at most 5 elements, we can find it in constant time).
We recursively call the algorithm on this list of elements. There are of such collections.
Partitioning takes time (if we just swap elements in place). Since we remove at least of the elements when we recurse, the subproblem has at most of the original array. The recurrence relation is given by
There is a really long, formal proof that shows this reduces to a term that is , but intuitively, when we consider the two recursive steps, we have
Recalling the intuition behind the master theorem, when the recursive step takes less time than the merging step, then the latter will dominate. So the final runtime is .
Using selection to find the median efficiently
Selection is commonly used within other algorithms in order to find the median of a list. Its efficiency means that it rarely dominates the overall runtime and basically can be used “for free.”
One example of this is in deterministic quicksort, where we consistently find the median of the array to use as the pivot, which dramatically brings down its overall runtime.