Do data structures matter? Who cares? ------------------------------------- Here is a small example of the effect a data structure can have. The Problem ----------- The problem is as follows. We have a flat file representing the Internet Movie Database (imdb.com) from Robert Sedgewick's site. The file has one line for a movie, giving - the title of the movie - the (year) as a string, possibly (????) - the actors in the film Produce a sorted list of actor names, removing duplicates First Stab ---------- This is a "no frills" solution and took less than an hour to get going. An ArrayList was used because I had no idea just how many actors would be in the input, so I could not use static space such as an array. I also wanted to get something up and running with little effort. The algorithm is below 1 Use existing code to get the MovieDB 2 Create an empty ArrayList, actors 3 for each movie 3.1 for each actor in the movie 3.1.1 if the actor is not in the ArrayList actor 3.1.2 add the actor to the ArrayList actors 4 convert the ArrayList of actors to an array 5 sort the array 6 output the array Second Go --------- Rather than use an ArrayList we now use a HashSet. Essentially a HashSet implements a Set, and as yet I do not know how this has been done. Why use a set? By definition a set disallows duplicates. Therefore I can add an actor to the set without coding up a test to determine if the actor is already in our set. The algorithms is below 1 Use existing code to get the MovieDB 2 Create an empty HashSet, actors 3 for each movie 3.1 for each actor in the movie 3.1.1 add the actor to the HashSet actors 4 convert the HashSet of actors to an array 5 sort the array 6 output the array There is only a minor difference. What's the effect? ------------------ How long does it take to get sorted actors using the two programs TestArrayList and TestHashSet? We need only measure the time spent in the steps 2 to 4 inclusive, and use the same machine for our experiment. Below are experiments run over various imdb data sets from Robert Sedgewick's website. Run times are given in milliseconds (ms). input file movies actors ArrayList(ms) HashSet(ms) cast.G.txt 1288 21177 9956 89 cast.PG13.txt 2538 70325 123325 258 cast.PG.txt 3687 74724 147732 265 cast.rated.txt 4527 122406 440563 384 cast.06.txt 8780 84236 157924 262 cast.00-06.txt 52195 606531 3563394 1073 cast.all.txt 285460 933802 ??????? 5659 For the data set cast.00-06.txt runtime was almost exactly one hour using the ArrayList whereas the MashSet took just over 1 second! For the data set cast.all.txt the ArrayList implementation was aborted after one hour. We can see that to all intents and purposes, in this example if we make the wrong engineering decision we get a program that does not really work Patrick Prosser 25th June 2010