Homework 5: Concordance Construction

Due Thursday, May 20.

A concordance is a listing that shows, for every word that occurs in a given text, the numbers of the lines on which that word occurs. It can be a very useful tool for studying the concepts discussed in the text, for finding the defining instance of a term, and so on. Prior to computer text processing, concordance construction was a laborious and error-prone task, undertaken only for the most heavily studied texts such as the Bible. Even once concordance construction could be programmed, the programming was complex and error-prone. However, we will see in this project that high-level data structures (maps and sets) can make the job relatively easy.
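To make the definition concrete, here is a minimal sketch of a concordance as a map from each word to the set of line numbers on which it occurs. The names Concordance, normalize, and buildConcordance are hypothetical and are not taken from the book's code; the rule of lowercasing words and dropping non-letters is likewise an assumption made only for this illustration.

```cpp
#include <cassert>
#include <cctype>
#include <istream>
#include <map>
#include <ostream>
#include <set>
#include <sstream>
#include <string>

// Hypothetical type name: each word maps to the sorted set of line numbers
// on which it occurs.
typedef std::map<std::string, std::set<int> > Concordance;

// Assumed normalization rule: keep letters only and lowercase them.
std::string normalize(const std::string& token) {
    std::string word;
    for (unsigned char ch : token) {
        if (std::isalpha(ch)) word += static_cast<char>(std::tolower(ch));
    }
    return word;
}

// Read the text line by line, recording the 1-based line number of each word.
Concordance buildConcordance(std::istream& text) {
    Concordance result;
    std::string line;
    int lineNumber = 0;
    while (std::getline(text, line)) {
        ++lineNumber;
        std::istringstream words(line);
        std::string token;
        while (words >> token) {
            const std::string word = normalize(token);
            // A set ignores repeats, so a word used twice on one line is
            // listed once for that line.
            if (!word.empty()) result[word].insert(lineNumber);
        }
    }
    return result;
}

// Print one entry per word, in alphabetical order, as "word: n1 n2 ...".
void printConcordance(const Concordance& concordance, std::ostream& out) {
    for (const auto& entry : concordance) {
        // Invariant: an entry exists only because a line number was inserted.
        assert(!entry.second.empty());
        out << entry.first << ":";
        for (int n : entry.second) out << ' ' << n;
        out << '\n';
    }
}
```

Because both the map and the sets keep their contents ordered, the words come out alphabetically and the line numbers in increasing order without any explicit sorting step.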

As an example, if the first two lines of this paragraph
were the entire text, then the first few lines of the concordance
would look like this:

In this final project we will construct an extension and improvement of the concordance example in section 16.2.3. This will not require extensive code, as maps and sets will make our work easy. The code from the book (not quite identical) is online in the 220/concordance directory.

There are two main goals of the project and a third optional one.

  1. Instead of using multimaps we'll use maps of sets. Specifically, instead of declaring the concordance data structure as:
    typedef multimap < string, int > wordDictTyp;
    
    We will use:
    typedef map < string, set < int > > wordDictTyp;
    

    This will cause changes in how the addWord and printConcordance functions are written. The resulting code will be both simpler and more efficient.

  2. The value of a concordance of a text usually lies in allowing those studying the text to find the uses of the most important words in the text. Thus, entries for common words such as "the" are just an annoyance, making the concordance unnecessarily long. We will add to the concordance class a set of words to be skipped, say set<string> skipWords. Then we'll
    1. add a member function readSkips(string skipfile) to set up skipWords
    2. modify addWord to not make entries in the concordance (in wordMap) for those words.

  3. [Extra credit] Another problem for concordances can be multiple entries for essentially the same word. For instance, do you want separate entries for "hide", "hides", and "hiding", or for "colour"
    (break in text here for sake of later example)
    and "color"? For flexibility on this issue, we won't try to systematically merge all singular and plural forms. Instead, we process a collection of pairs of words that are to be treated as the same. Suppose the user wants words A, B, and C to be treated as the same and, furthermore, wants the entry to be under A. That user would put the pairs "A B" and "A C" in a file. The resulting concordance should output all the lines on which any of A, B, and C occur under the listing for A. Under the listings for B and C it should simply print "see A". This extra credit part of the project is to create an additional function readPairings, use it to read in the information about related versions of words, and then use this information correctly in addWord and printConcordance to get the desired result. You may assume the pairings file is never inconsistent; for instance, it will not contain both "A B" and "B C" pairings. For example, if the pairings file contained
    color colour
    color colors
    color colours
    
    then the concordance for this writeup would contain:
    ...
    color: 74 76 95 96 97 102 104 107 
    ...
    colors: see color
    colot: 105 111
    ...
    colour: see color 
    ...
    

    footnote 1: colot is an Indonesian word meaning jump or leap.
    footnote 2: the line numbers in the last example are as in the html source and will not be the same as those displayed by a browser in a window of a particular size. But, because of explicit line breaks, the first "color" occurs two lines after the first "colour" in most formattings as well as in the html source file.
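As one possible reading of the three goals above, the following sketch combines a map of sets, a skip list, and word pairings. It is not the book's code: the class name ConcordanceSketch and the member primaryOf are hypothetical, readSkips and readPairings take a stream here rather than a file name so the sketch can be tried without files, and printing "see A" only for variants that actually occur in the text is an assumption about the intended behavior.

```cpp
#include <cassert>
#include <istream>
#include <map>
#include <ostream>
#include <set>
#include <sstream>
#include <string>

typedef std::map<std::string, std::set<int> > wordDictTyp;

// Hypothetical class layout; the book's concordance class may differ.
class ConcordanceSketch {
public:
    // Read whitespace-separated words that should get no entry.
    void readSkips(std::istream& in) {
        std::string word;
        while (in >> word) skipWords.insert(word);
    }

    // Read "A B" pairs: occurrences of B are to be listed under A.
    void readPairings(std::istream& in) {
        std::string primary, variant;
        while (in >> primary >> variant) primaryOf[variant] = primary;
    }

    // Record one occurrence of word on the given 1-based line.
    void addWord(const std::string& word, int lineNumber) {
        assert(lineNumber >= 1);
        if (skipWords.count(word)) return;  // skipped words get no entry
        auto it = primaryOf.find(word);
        if (it == primaryOf.end()) {
            wordMap[word].insert(lineNumber);
        } else {
            wordMap[it->second].insert(lineNumber);  // file under the primary
            wordMap[word];  // empty entry so the variant prints "see ..."
        }
    }

    // Print entries alphabetically; paired variants refer to their primary.
    void printConcordance(std::ostream& out) const {
        for (const auto& entry : wordMap) {
            out << entry.first << ":";
            auto it = primaryOf.find(entry.first);
            if (it != primaryOf.end()) {
                out << " see " << it->second;
            } else {
                for (int n : entry.second) out << ' ' << n;
            }
            out << '\n';
        }
    }

private:
    wordDictTyp wordMap;                           // word -> line numbers
    std::set<std::string> skipWords;               // words to leave out
    std::map<std::string, std::string> primaryOf;  // variant -> primary word
};
```

With a map of sets, addWord needs only a single insert, and a word repeated on one line is recorded once. Storing the pairings as a variant-to-primary map relies on the stated guarantee that the pairings file is consistent, so no chain of lookups is needed.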