52.219 Sorting
What do we mean by "sorting" and why bother?
The sorting problem is to arrange a sequence of objects so that the
values of their "key" fields are in "non-decreasing" sequence.
That is, given objects O1, O2, ..., On with key values K1, K2, ..., Kn
respectively, arrange the objects in an order Oi1, Oi2, ..., Oin
such that Ki1 <= Ki2 <= ... <= Kin.
NOTES: we assume that the objects to be sorted have more than one
attribute, and one of those attributes is taken to be the "key" (for
example we might sort employee records using employee number as the
key, or we could sort the same set of records by name of employee, or weight).
Also, we generally assume we are sorting into "non-decreasing" order.
This is not essentially true, as we could equally well sort the
objects in decreasing order (largest first), and it does not change
the nature of the problem.
Why do we want to sort objects?
Because it results in some kind of
order, and we can exploit this later on when accessing those objects
(and we will discuss that later on).
Comparing Sorting Algorithms
The criteria to evaluate the running time of sorting algorithms is
generally the number of comparisons between pairs of keys, or the
number of times objects are swapped from one position to another.
The cost of a comparison can be significant. For example, consider
comparing the names of two people Johanson and Johansin. Also,
swapping can be expensive if the objects are large.
Polymorphism
In our examples we will assume that the key and the object are the
same thing, but the scheme functions that are presented are
polymorphic, and by defining the predicate p we can sort any class of
object.
The Source Code
The lectures and source code only examine a subset of the sorting
algorithms. These are some of the algorithms that exist: bubble sort,
insertion sort, selection sort, shellsort, quicksort, merge sort,
heap sort, bin sort, radix sort. These algorithms fall into 3 classes
O(n*n), O(n log n), and O(n). We will examine only 4 algorithms:
1. Simple sorting schemes, such as
(a) bubble sort
(b) insertion sort
(c) selection sort
These are O(n*n) algorithms
2. Mergesort ... O(n log n)
See source /home/s7/pat/scheme/pub/sort.scm
Complexity of mergesort
The merge sort function/algorithm (mergesort l) takes a list l of length n,
and does a merge on the mergesort of the first half of l, and the
mergesort of the second half of l. The stopping condition is when the
list l is of size 0 or 1.
The merge function takes two sorted lists l1 and l2. At each step
merge takes the smaller of the head of l1 and the head of l2, and
appends it to a growing list, and removes that element from the list
(either l1 or l2).
A merge on lists of length n/2 is O(n).
The running time of mergesort on a list of n elements is then
t(0) = 0
t(1) = 1
t(n) = 2.t(n/2) + c.n
where c.n is the cost of merging two lists of length n/2, and the term
2t(n/2) is the two recursive calls to mergesort with lists l1 and l2
of length n/2. Consequently,
t(n) = 2.t(n/2) + c.n
= 2.(2.t(n/4) + c.n/2) + c.n
= 2.(2.(2.t(n/8) + c.n/4) + c.n/2) + c.n
= 8.t(n/8) + 3.c.n
A pattern emerges, and by induction on i we obtain
t(n) = 2^i.t(n/2^i) + i.c.n
where the operator ^ is "raised to the power". If we assume that n is
a power of 2 (ie 2, 4, 8, 16, 32, ... generally 2^k) the expansion
process comes to an end when we get t(1) on the right, and that occurs
when i=k, whereupon
t(n) = 2^k.t(1) + k.c.n
We have just stated that the process comes to an end when i=k, where
n = 2^k. Put another way, k = log n (to the base 2 of course), therefore
t(n) = n + c.n.log n
= O(n log n)
Choosing an Algorithm
Again, choice of algorithm is not a straight forward matter, as a
number of issues may be relevant. It may be the case that an O(n*n)
algorithm is more suitable than an O(n log n) algorithm. Some factors
may be the quality of object code, computing platforms available,
size of objects to be swapped, number of times algorithm is to be
used versus time to develop (if not already in place), criticality of
run time (maybe we don't care), size of input.
A lower bound for sorting
There is a "folk theorem" that says that sorting n elements requires
nlogn time. This aint true. It might be the case that the keys of the
objects fall within a known range. We might have this in a controlled
environment, where we produce the keys. For example, we might issue
employee numbers in a given order and in a given sequence, or we
might issue part numbers to items that we manufacture. Therefore, if
the keys are unique and known to be in a given range, we might allow
the key to be the location of the object (imagine placing the objects
in a vector such that object with key K is in (vector-ref v k)).
However the algorithms presented only rely on the fact that we can
compare keys, to determine if one is less than another.
Demo Software
Software is available to demo the sorting algorithms
(see file /home/s7/pat/scheme/pub/sort-demo.scm).
There are 3 demo's
NOTE: load and run those demo's. They are part of the
course material and I will assume you are familiar with them. Run them for small
values of n (about 20) and then increase n and see what happens.