52.219 Sorting

What do we mean by "sorting" and why bother?

The sorting problem is to arrange a sequence of objects so that the values of their "key" fields are in "non-decreasing" sequence. That is, given objects O1, O2, ..., On with key values K1, K2, ..., Kn respectively, arrange the objects in an order Oi1, Oi2, ..., Oin such that Ki1 <= Ki2 <= ... <= Kin.

NOTES: we assume that the objects to be sorted have more than one attribute, and that one of those attributes is taken to be the "key" (for example, we might sort employee records using employee number as the key, or we could sort the same set of records by employee name, or by weight). Also, we generally assume we are sorting into non-decreasing order. This is not essential: we could equally well sort the objects into non-increasing order (largest first), and it would not change the nature of the problem.

Why do we want to sort objects?

Because sorting imposes an order on the objects, and we can exploit that order when we subsequently access them (we will discuss this later in the course).

Comparing Sorting Algorithms

The criteria used to evaluate the running time of sorting algorithms are generally the number of comparisons between pairs of keys, and the number of times objects are moved from one position to another. The cost of a single comparison can be significant: consider, for example, comparing the two names Johanson and Johansin, which agree on their first six characters. Equally, moving objects can be expensive if the objects are large.

Polymorphism

In our examples we will assume that the key and the object are the same thing, but the Scheme functions presented are polymorphic: by supplying a comparison predicate p we can sort any class of object.
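
For example, a polymorphic insertion sort might look as follows. This is a minimal sketch for illustration, not the code from sort.scm; the predicate p decides which of two objects should come first.

    ;; insert: place x into the sorted list l, using predicate p.
    (define (insert x l p)
      (if (or (null? l) (p x (car l)))
          (cons x l)
          (cons (car l) (insert x (cdr l) p))))

    ;; insertion-sort: sort the rest of the list, then insert the head.
    (define (insertion-sort l p)
      (if (null? l)
          '()
          (insert (car l) (insertion-sort (cdr l) p) p)))

    ;; Sorting bare keys, then (name . weight) records by weight:
    (insertion-sort '(3 1 2) <=)
    ;; => (1 2 3)
    (insertion-sort '(("tom" . 80) ("ann" . 55) ("sue" . 62))
                    (lambda (a b) (<= (cdr a) (cdr b))))
    ;; => (("ann" . 55) ("sue" . 62) ("tom" . 80))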

The Source Code

The lectures and source code examine only a subset of the sorting algorithms that exist, among them: bubble sort, insertion sort, selection sort, shellsort, quicksort, merge sort, heap sort, bin sort, and radix sort. These algorithms fall into three classes: O(n*n), O(n log n), and O(n). We will examine only 4 algorithms:
    1. Simple sorting schemes, such as 
       (a) bubble sort
       (b) insertion sort
       (c) selection sort
      These are O(n*n) algorithms. (A sketch of selection sort
      follows the source reference below.)
  
    2. Mergesort ... O(n log n)
            
     See source /home/s7/pat/scheme/pub/sort.scm
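
As promised above, here is a minimal sketch of selection sort on lists (an illustration only; the definitions in sort.scm may differ). At each step we find the smallest remaining element, remove it, and place it at the front: n passes, each doing O(n) work, hence O(n*n).

    ;; smallest: return the least element of a non-empty list.
    (define (smallest l)
      (if (null? (cdr l))
          (car l)
          (let ((s (smallest (cdr l))))
            (if (<= (car l) s) (car l) s))))

    ;; remove-first: remove the first occurrence of x from l.
    (define (remove-first x l)
      (if (equal? (car l) x)
          (cdr l)
          (cons (car l) (remove-first x (cdr l)))))

    ;; selection-sort: n passes, each of O(n) work => O(n*n).
    (define (selection-sort l)
      (if (null? l)
          '()
          (let ((s (smallest l)))
            (cons s (selection-sort (remove-first s l))))))

    (selection-sort '(5 1 4 2 3))
    ;; => (1 2 3 4 5)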

Complexity of mergesort

The merge sort function/algorithm (mergesort l) takes a list l of length n, and merges the result of mergesorting the first half of l with the result of mergesorting the second half of l. The stopping condition is when the list l is of size 0 or 1.

The merge function takes two sorted lists l1 and l2. At each step, merge takes the smaller of the head of l1 and the head of l2, adds it to the growing output list, and removes it from whichever input list (l1 or l2) it came from.
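
The following is a minimal sketch of merge and mergesort as just described, written with an explicit comparison predicate p so that it is polymorphic in the sense discussed earlier (the actual definitions in sort.scm may differ, e.g. the text above writes (mergesort l) without the predicate).

    ;; merge: combine two sorted lists into one sorted list; at each
    ;; step the smaller head moves to the output.
    (define (merge l1 l2 p)
      (cond ((null? l1) l2)
            ((null? l2) l1)
            ((p (car l1) (car l2)) (cons (car l1) (merge (cdr l1) l2 p)))
            (else (cons (car l2) (merge l1 (cdr l2) p)))))

    ;; split: divide l into two halves with a slow/fast pointer walk,
    ;; returning the halves as a pair.
    (define (split l)
      (let loop ((slow l) (fast l) (front '()))
        (if (or (null? fast) (null? (cdr fast)))
            (cons (reverse front) slow)
            (loop (cdr slow) (cddr fast) (cons (car slow) front)))))

    ;; mergesort: lists of size 0 or 1 are already sorted.
    (define (mergesort l p)
      (if (or (null? l) (null? (cdr l)))
          l
          (let ((halves (split l)))
            (merge (mergesort (car halves) p)
                   (mergesort (cdr halves) p)
                   p))))

    (mergesort '(5 1 4 2 3) <=)
    ;; => (1 2 3 4 5)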

             A merge of two lists each of length n/2 is O(n), since each
             step moves one element to the output and there are n elements
             in total.

The running time of mergesort on a list of n elements is then

                        t(0) = 0
                        t(1) = 1
                        t(n) = 2.t(n/2) + c.n
where c.n is the cost of merging two lists of length n/2, and the term 2.t(n/2) is the cost of the two recursive calls to mergesort on the lists l1 and l2, each of length n/2. Consequently,

                        t(n) = 2.t(n/2) + c.n
                             = 2.(2.t(n/4) + c.n/2) + c.n
                             = 4.t(n/4) + 2.c.n
                             = 4.(2.t(n/8) + c.n/4) + 2.c.n
                             = 8.t(n/8) + 3.c.n

A pattern emerges, and by induction on i we obtain

                        t(n) = 2^i.t(n/2^i) + i.c.n
where the operator ^ means "raised to the power". If we assume that n is a power of 2 (i.e. 2, 4, 8, 16, 32, ..., in general n = 2^k), the expansion process comes to an end when we get t(1) on the right, and that occurs when i = k, whereupon
                        t(n) = 2^k.t(1) + k.c.n
Since n = 2^k, the first term 2^k.t(1) is just n (recall that t(1) = 1), and k = log n (to the base 2, of course). Therefore
                        t(n) = n + c.n.log n 
                             = O(n log n)
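
As a rough empirical check (a sketch, not part of sort.scm), we can wrap the comparison predicate so that it counts how often it is called, then run the mergesort sketched earlier:

    ;; Count calls to the comparison predicate (assumes the mergesort
    ;; sketched above, which takes p as a parameter).
    (define comparisons 0)

    (define (counting<= a b)
      (set! comparisons (+ comparisons 1))
      (<= a b))

    (mergesort '(8 3 5 1 7 2 6 4) counting<=)
    comparisons
    ;; => at most n.log n = 8 * 3 = 24 comparisons for n = 8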
            

Choosing an Algorithm

Again, choice of algorithm is not a straightforward matter, as a number of issues may be relevant, and it may even be the case that an O(n*n) algorithm is more suitable than an O(n log n) algorithm. Relevant factors include the quality of the object code, the computing platforms available, the size of the objects to be swapped, the number of times the algorithm is to be used versus the time to develop it (if not already in place), the criticality of run time (maybe we don't care), and the size of the input.

A lower bound for sorting

There is a "folk theorem" that says that sorting n elements requires n.log n time. This isn't true. It might be the case that the keys of the objects fall within a known range. We might have this in a controlled environment where we produce the keys ourselves: for example, we might issue employee numbers in a known sequence, or assign part numbers to items that we manufacture. Therefore, if the keys are unique and known to lie in a given range, we can let the key be the location of the object (imagine placing the objects in a vector such that the object with key K goes in (vector-ref v K)). However, the algorithms presented here rely only on the fact that we can compare two keys to determine whether one is less than the other.
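
A minimal sketch of this idea, assuming keys are unique exact integers in the range 0..n-1 and a key accessor is supplied by the caller (an illustration only, not code from sort.scm):

    ;; bin-sort: place each object directly at index (key obj) in a
    ;; vector -- one pass over the objects, hence O(n).
    (define (bin-sort objects n key)
      (let ((v (make-vector n #f)))
        (for-each (lambda (obj) (vector-set! v (key obj) obj))
                  objects)
        v))

    ;; e.g. records of the form (key . name), keys unique in 0..2:
    (bin-sort '((2 . "carol") (0 . "alice") (1 . "bob")) 3 car)
    ;; => #((0 . "alice") (1 . "bob") (2 . "carol"))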

Demo Software

Software is available to demo the sorting algorithms (see file /home/s7/pat/scheme/pub/sort-demo.scm). There are 3 demos. NOTE: load and run those demos. They are part of the course material and I will assume you are familiar with them. Run them for small values of n (about 20) and then increase n and see what happens.