- 18.01 (released) ...
Hash tables are a convenient mechanism tostore and retrieve objects identified by unique keys.
Why use hash tables?
The main advantage of hash tables is the efficiency of the basic operations: store ( put ) and retrieve ( item , remove ).
The idea behind hash tables is to try to emulate the data structure that provides the ultimate in efficiency: the array. On an array
a.put (x, i)
x := a.item (i)
x := a @ i
The first causes the value of a at index
Not only are the operation times small; they are constant (or more precisely bounded by a constant). This is a great advantage over structures such as lists or trees which you must traverse at least in part to retrieve an item, so that access and modification times grow with the number of items. With an array, disregarding the influence of other factors such as memory paging, the time for a put or item is for all practical purposes the same whether the array has five items or five hundred thousand. These properties make arrays excellent data structures for keeping objects. Unfortunately, they are only applicable if the objects satisfy three requirements:
- A1. For each object there must be an associated integer, which for the purpose of this discussion we may call the object's index (since it will serve as index for the object in the array.)
- A2. No two objects may have the same index.
- A3. If we want to avoid wasting huge amount of storage, all the indices must lie in a contiguous or almost contiguous range.
Hash tables may be viewed as a rehabilitation mechanism for objects that do not naturally possess these three properties. If we are unable to find a natural index, we can sometimes devise an artificial one. To do so we must be able to find a key. Each key must uniquely identify the corresponding object; this is the same as property A2, making keys similar to indices. But keys are not necessarily integers (violating property A1), although it must be possible to associate an integer with each key. The mechanism that maps keys to integers is called the hashing function.
Thanks to the hashing mechanism we will indeed be able to store suitable objects into arrays, approaching the optimal efficiency of this data structure. The efficiency will not be quite as good, however, for two reasons:
- We must pay the price of computing the hash function whenever we store or retrieve an object.
- Different keys may hash into the same integer value, requiring extra processing to find an acceptable index.
With good implementations, however, it is possible to use hash tables with a performance that is not much worse than that of arrays and, most importantly, may be treated as if the time for a put, an item or a remove were constant. This will mean that you can consider operations such as
h.put (x, k)
x := h.item (k)
The quality of a hashed implementation will depend both on the data structure that will store the objects, and on the choice of hashing function. Class HASH_TABLE attempts to address the first concern; for the second, client developers will be responsible for choosing the proper hashing function, although Base provides a few predefined functions, in particular for class STRING .
When hash tables are appropriate
You may keep objects in a hash table if for each one of these objects you can find a key that uniquely identifies it. The objects and their keys may be of many possible kinds:
- H1. In a simple example, the objects are integers; each integer serves as its own key. (More precisely we will use its absolute value, since it is convenient to have non-negative keys only.) This case is of more than theoretical interest, since it makes hash tables appropriate for storing a set of integers with widely scattered values, for which simple array storage would be a waste of space (see requirement A3 above).
- H2. Frequently, the objects will be composite, that is to say, instances of a developer-defined class, and one of the attributes of that class, of type STRING , can serve as the key. For example if you were writing an Eiffel compiler you would probably need to keep a data structure that includes information about classes of the system. Each class is represented by an object with several fields describing the properties of the class; one of these fields, the class name, corresponding to an attribute of type STRING , will serve as key.
- H3. Instead of being the full object (as in case H1) or one of the object's fields (as in case H2), the key may have to be computed through a function of the generating class, which will take into account several attributes of the class (that is to say, for each object, several fields).
What this practically means is that in all cases you will need, in the generating class of the objects to be stored, a query (attribute or function) that gives the key. The type of the key is highly variable but must in all cases be a descendant of HASHABLE . This is true of both INTEGER (case H1) and STRING (case H2). The requirements for being a HASHABLE are not harsh: all you need is a function hash_code that returns a non-negative integer.>
Using hash tables
Class HASH_TABLE takes two generic parameters:
class HASH_TABLE [G, H -> HASHABLE] ...
G represents the type of the objects to be stored in the hash table, H the type of their keys.
When viewed as an implementation of containers, HASH_TABLE , in a strict sense, represents bags rather than sets: unlike the other classes in this chapter, it allows an object to have two or more distinct occurrences in a single container. But this is only true if we consider a hash table as a repository of objects of type G. In reality each item of the table is identified by a pair of values, one from G and one from H. Because the keys must uniquely identify objects, the hash table viewed as a container of such pairs is indeed a set. The creation procedure make takes an integer argument, as in
create my_table.make (n)
The value of
- The actual size of the underlying array representation will be higher than n since efficient operation of hash table algorithms require the presence of enough breathing space - unoccupied positions.
- If the number of items in the table grows beyond the initial allocation, the table will automatically be resized.
It is useful, however, to use a reasonable upon creation: not too large to avoid wasting space, but not too small to avoid frequent applications of resizing, an expensive operation.