Order statistics

2 Order statistics

The most commonly used statistic is the average (mean) of a sequence of data points: the sum of the values divided by the number of values. When the values are skewed, however, the median is often preferred. The median is the middle value: half of the values are larger and half are smaller. Note that the median makes sense for some data where the average does not, namely data that can be ordered but not added. For instance, one can ask for the median of a set of strings: what is the middle word in the dictionary? More generally, one may want the k-th smallest value. For example, if the n values were sorted, the first decile would be the (n/10)-th element and the maximum would be the n-th element.
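As a small illustration of the difference between these statistics, here is a sketch in Python. The data values, the word list, and the helper name kth_smallest are made up for this example; the statistics module is part of the Python standard library.

```python
import statistics

# Illustrative data only: one large value skews the mean.
data = [1, 2, 3, 4, 100]
mean_value = statistics.mean(data)      # sum divided by count
median_value = statistics.median(data)  # middle value after sorting

# Strings can be ordered but not averaged, so only the median applies.
# median_low always returns an actual element of the data.
words = ["pear", "apple", "fig"]
middle_word = statistics.median_low(words)

def kth_smallest(values, k):
    """Return the k-th smallest value (1-based k) by sorting a copy."""
    return sorted(values)[k - 1]

largest = kth_smallest(data, len(data))  # the n-th smallest is the maximum
```

Here the mean is pulled far above most of the values, while the median stays at the middle value.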

It is useful to use the notion of rank when discussing order statistics. Let A be a sequence of distinct values. For a value x in A, let rank(x) denote one more than the number of items in A that are less than x (i.e., rank(x) = 1 + k, where A has k entries less than x). In other words, rank(A[i]) is the position where A[i] would end up if A were sorted. For example, rank(3) = 2 in {3, 5, 1}.
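The definition of rank translates directly into code. The following is a minimal sketch in Python, assuming distinct values as in the text; the function name rank and its two-argument form (sequence, then value) are choices made for this example.

```python
def rank(a, x):
    """Return one more than the number of items in a that are less than x."""
    return 1 + sum(1 for y in a if y < x)

a = [3, 5, 1]
ranks = [rank(a, x) for x in a]

# With distinct values, rank(a, x) is the 1-based position of x in sorted order.
in_sorted_position = all(sorted(a)[rank(a, x) - 1] == x for x in a)
```

Computing one rank this way takes a full pass over A, so it is meant to clarify the definition rather than to serve as an efficient method.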

Problem Select(A, k)
Input: sequence A of size n, and index k such that 1 ≤ k ≤ n.
Output: A is permuted so that A[i] ≤ A[k], for 1 ≤ i < k and A[k] ≤ A[i], for k < i ≤ n.

We specify permuting the array so that the k-th element is where it would be if the array were sorted. Thus any sorting algorithm solves the selection problem, but we want faster solutions. Variants of the problem are to just return the value or index of the k-th element without permuting the data.
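The output condition and the sorting-based baseline can be sketched as follows in Python. The names is_selected and select_by_sorting are hypothetical, and since Python lists are 0-based, the 1-based index k of the problem statement corresponds to a[k - 1]. This is only the slow baseline mentioned above, not one of the faster solutions.

```python
def is_selected(a, k):
    """Check the Select output condition for 1-based k on a 0-based list."""
    kth = a[k - 1]
    left_ok = all(x <= kth for x in a[:k - 1])   # A[i] <= A[k] for i < k
    right_ok = all(kth <= x for x in a[k:])      # A[k] <= A[i] for i > k
    return left_ok and right_ok

def select_by_sorting(a, k):
    """Baseline Select: sort a in place, then return the k-th smallest."""
    if not 1 <= k <= len(a):
        raise ValueError("k must satisfy 1 <= k <= len(a)")
    a.sort()
    return a[k - 1]

a = [7, 2, 9, 4, 1]
second = select_by_sorting(a, 2)
```

Note that the condition does not require the whole array to be sorted: [2, 1, 3, 5, 4] satisfies it for k = 3 even though neither side of the third element is in order.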

Algorithms for Select solve these other order statistics problems (though for some there are faster methods).
Problem Minimum(A) = Select(A, 1).
Problem Maximum(A) = Select(A, size(A)).
Problem Median(A) = Select(A, ceil(size(A)/2)).
Problem Max2(A) = Select(A, size(A)-1).
Problem Maximin(A) = permute A so that both the first (minimum) and the last (maximum) elements are positioned correctly.
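These reductions can be sketched in Python as thin wrappers around any Select routine. The stand-in select below simply sorts, and all function names are hypothetical. One assumption to flag: the median index is taken here as ceil(n/2), the lower median, so that it always lies in the range 1 ≤ k ≤ n.

```python
def select(a, k):
    """Stand-in Select: sorts a in place, returns the k-th smallest (1-based k)."""
    a.sort()
    return a[k - 1]

def minimum(a):
    return select(a, 1)

def maximum(a):
    return select(a, len(a))

def median(a):
    # (n + 1) // 2 equals ceil(n / 2); an assumed convention (lower median).
    return select(a, (len(a) + 1) // 2)

def max2(a):
    """Second-largest element; requires at least two elements."""
    return select(a, len(a) - 1)

data = [7, 2, 9, 4, 1]
results = (minimum(list(data)), maximum(list(data)),
           median(list(data)), max2(list(data)))
```

For the minimum and maximum, a single linear scan (Python's built-in min and max) is an example of a faster method than a general Select.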