Do data structures matter? Who cares?
          -------------------------------------
Here is a small example of the effect a data structure can have.

The Problem
-----------
The problem is as follows. We have a flat file representing the Internet Movie Database
(imdb.com) from Robert Sedgewick's site. The file has one line for a movie, giving 

 - the title of the movie
 - the (year) as a string, possibly (????)
 - the actors in the film 

Produce a sorted list of actor names, removing duplicates

First Stab
----------
This is a "no frills" solution and took less than an hour to get going. An ArrayList
was used because I had no idea just how many actors would be in the input, so I 
could not use static space such as an array. I also wanted to get something up and
running with little effort. The algorithm is below

 1     Use existing code to get the MovieDB 
 2     Create an empty ArrayList, actors
 3     for each movie
 3.1     for each actor in the movie
 3.1.1     if the actor is not in the ArrayList actor
 3.1.2     add the actor to the ArrayList actors
 4     convert the ArrayList of actors to an array
 5     sort the array
 6     output the array

Second Go
---------
Rather than use an ArrayList we now use a HashSet. Essentially a HashSet
implements a Set, and as yet I do not know how this has been done. Why use
a set? By definition a set disallows duplicates. Therefore I can add an actor
to the set without coding up a test to determine if the actor is already
in our set. The algorithms is below

 1     Use existing code to get the MovieDB 
 2     Create an empty HashSet, actors
 3     for each movie
 3.1     for each actor in the movie
 3.1.1     add the actor to the HashSet actors
 4     convert the HashSet of actors to an array
 5     sort the array
 6     output the array

There is only a minor difference.

What's the effect? 
------------------
How long does it take to get sorted actors using the two
programs TestArrayList and TestHashSet? We need only measure the
time spent in the steps 2 to 4 inclusive, and use the same machine for
our experiment. Below are experiments run over various imdb data sets 
from Robert Sedgewick's website. Run times are given in milliseconds (ms).

      input file    movies   actors   ArrayList(ms)    HashSet(ms)
    cast.G.txt        1288    21177      9956             89
    cast.PG13.txt     2538    70325    123325            258
    cast.PG.txt       3687    74724    147732            265
    cast.rated.txt    4527   122406    440563            384
    cast.06.txt       8780    84236    157924            262
    cast.00-06.txt   52195   606531   3563394           1073
    cast.all.txt    285460   933802   ???????           5659

For the data set cast.00-06.txt runtime was almost exactly one hour
using the ArrayList whereas the MashSet took just over 1 second!
For the data set cast.all.txt the ArrayList implementation was aborted after
one hour.

We can see that to all intents and purposes, in this 
example if we make the wrong engineering decision we get a
program that does not really work

Patrick Prosser
25th June 2010