Homework 5: Concordance Construction
Due Thursday, May 20.
A concordance is a listing which shows, for every word
that occurs in a given text, a list of the line numbers of lines on which
that word occurs. It can be a very useful tool for studying concepts
being discussed in the text, for finding the defining instance of a term,
etc. Prior to computer text processing, concordance construction was
a laborious and error-prone task. It was undertaken only for the most
heavily studied texts such as the Bible. Even when concordance construction
could be programmed, the programming was complex and error-prone.
However, as we will see in this project, high-level data structures
(maps and sets) can make the job relatively easy.
As an example, if the first two lines of this paragraph
were the entire text, then the first few lines of the concordance
would look like this:
- an: 1
- as: 1
- concordance: 2
- entire: 2
- example: 1
- few: 2
- first: 1 2
- ...
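A minimal sketch of how such a listing might be built with a map of sets is shown here. It is not the book's code: the free-function signatures for addWord and printConcordance, the helper buildConcordance, and the tokenizing rule (words are maximal runs of letters, folded to lower case, so an apostrophe splits a word) are all assumptions made for illustration.

```cpp
#include <cctype>
#include <iostream>
#include <map>
#include <set>
#include <sstream>
#include <string>

using namespace std;

typedef map<string, set<int> > wordDictTyp;

// Record that `word` occurs on line `lineNo`. The set ignores duplicates,
// so a word that appears twice on one line is still listed once.
void addWord(wordDictTyp& wordMap, const string& word, int lineNo) {
    wordMap[word].insert(lineNo);
}

// Hypothetical helper: build a concordance from text held in a string.
// Assumption: a word is a maximal run of letters, converted to lower case.
wordDictTyp buildConcordance(const string& text) {
    wordDictTyp wordMap;
    istringstream in(text);
    string line;
    int lineNo = 0;
    while (getline(in, line)) {
        ++lineNo;
        string word;
        for (string::size_type i = 0; i <= line.size(); ++i) {
            // Treat the end of the line as a separator so the last word is flushed.
            unsigned char c = (i < line.size())
                                  ? static_cast<unsigned char>(line[i])
                                  : ' ';
            if (isalpha(c)) {
                word += static_cast<char>(tolower(c));
            } else if (!word.empty()) {
                addWord(wordMap, word, lineNo);
                word.clear();
            }
        }
    }
    return wordMap;
}

// Print one "word: n1 n2 ..." line per entry. The map keeps the words in
// alphabetical order and each set keeps its line numbers in increasing order.
void printConcordance(const wordDictTyp& wordMap, ostream& out) {
    for (wordDictTyp::const_iterator w = wordMap.begin(); w != wordMap.end(); ++w) {
        out << w->first << ":";
        for (set<int>::const_iterator n = w->second.begin(); n != w->second.end(); ++n)
            out << " " << *n;
        out << "\n";
    }
}
```

Passing the two example lines to buildConcordance and the result to printConcordance(wordMap, cout) gives a listing that starts with "an: 1" and "as: 1". No explicit sorting or duplicate checking is needed, because the map and the sets take care of both.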
In this final project we will construct an extension and improvement
of the concordance example in section 16.2.3. This will not require
extensive code, as maps and sets will make our work easy.
The code from the book (not quite identical) is online in the 220/concordance directory.
There are two main goals of the project and a third optional one.
- Instead of using multimaps we'll use maps of sets. Specifically,
instead of declaring the concordance data structure as:
typedef multimap < string, int > wordDictTyp;
We will use:
typedef map < string, set < int > > wordDictTyp;
This will cause changes in how the addWord and printConcordance
functions are written. The resulting code will be both simpler and more
efficient.
- The value of a concordance of a text usually lies in allowing those
studying the text to find the uses of the most important words in the
text. Thus entries for common words such as "the" are just an annoyance,
making the concordance unnecessarily long. We will add to the concordance
class a set of words to be skipped, say set < string > skipWords.
Then we'll
  - add a member function readSkips(string skipfile) to set up
    skipWords
  - modify addWord so that it makes no entries in the concordance
    (in wordMap) for those words.
- [Extra credit]
Another problem for concordances can be multiple entries for essentially
the same word. For instance, do you want separate entries
for "hide", "hides", and "hiding" or for "colour"
(break in text here for sake of later example)
and "color"?
For flexibility on this issue,
we won't try to systematically merge all singular and plural forms.
Instead we process a collection of pairs of words that are to be treated
as the same. Suppose the user wants words A, B, and C to be treated
as the same and furthermore wants the entry to be under A. That user
would put the pairs "A B" and "A C" in a file. The resulting concordance
should output all the lines on which any of A, B, and C occur under the
listing for A. Under the listings for B and C it should simply print
"see A". This extra credit part of the project is to create an additional
function readPairings, use it to read
in the information about related versions of words, and then use this information
correctly in addWord and printConcordance to get the
desired result. You may assume the pairings file is never inconsistent;
for instance, it never contains both "A B" and "B C" pairings.
For example, if the pairings file contained
color colour
color colors
color colours
then the concordance for this writeup would contain:
...
color: 74 76 95 96 97 102 104 107
...
colors: see color
colot: 105 111
...
colour: see color
...
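The second goal above (a skipWords set, readSkips, and a modified addWord) might be sketched as follows. The class layout is an assumption, since the book's concordance class is not reproduced here. The bool return value of readSkips, the helper loadSkips, and the accessor words() are hypothetical additions. Skip words are assumed to be separated by whitespace and to match exactly, including case.

```cpp
#include <fstream>
#include <iostream>
#include <map>
#include <set>
#include <sstream>   // only needed to try loadSkips on in-memory text
#include <string>

using namespace std;

typedef map<string, set<int> > wordDictTyp;

class Concordance {
public:
    // Hypothetical helper: read whitespace-separated skip words from any stream.
    void loadSkips(istream& in) {
        string word;
        while (in >> word)
            skipWords.insert(word);
    }

    // Set up skipWords from the named file.
    // Assumption: returns false if the file cannot be opened.
    bool readSkips(const string& skipfile) {
        ifstream in(skipfile.c_str());
        if (!in)
            return false;
        loadSkips(in);
        return true;
    }

    // Make an entry in wordMap only for words that are not being skipped.
    void addWord(const string& word, int lineNo) {
        if (skipWords.count(word) == 0)
            wordMap[word].insert(lineNo);
    }

    // Hypothetical read-only accessor, used here for inspection.
    const wordDictTyp& words() const { return wordMap; }

private:
    wordDictTyp wordMap;
    set<string> skipWords;
};
```

Because skipWords is a set, the test in addWord is a single logarithmic-time lookup, and listing the same word twice in the skip file is harmless.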
footnote 1: colot is an Indonesian word meaning jump or leap.
footnote 2: the line numbers in the last example are as in the html source
and will not be the same as displayed by a browser in a particular
sized window. But, by explicit linebreaks, the first "color" occurs two lines
after the first "colour" in most formattings as well as in the html source file.
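One possible approach to the extra credit part is sketched below, using the "hide" family of words mentioned earlier. It is not a reference solution. The class layout, the member name aliasOf, the helper loadPairings, and the bool return value of readPairings are hypothetical. The sketch also assumes that a "see A" line is printed only for a variant that actually occurs in the text, which the writeup leaves open.

```cpp
#include <fstream>
#include <iostream>
#include <map>
#include <set>
#include <sstream>   // only needed to try loadPairings on in-memory text
#include <string>
#include <utility>

using namespace std;

typedef map<string, set<int> > wordDictTyp;

class Concordance {
public:
    // Hypothetical helper: read "A B" pairs, meaning "list B under A".
    void loadPairings(istream& in) {
        string canonical, variant;
        while (in >> canonical >> variant)
            aliasOf[variant] = canonical;
    }

    // Assumption: returns false if the file cannot be opened.
    bool readPairings(const string& pairfile) {
        ifstream in(pairfile.c_str());
        if (!in)
            return false;
        loadPairings(in);
        return true;
    }

    void addWord(const string& word, int lineNo) {
        map<string, string>::const_iterator a = aliasOf.find(word);
        if (a == aliasOf.end()) {
            wordMap[word].insert(lineNo);        // ordinary word
        } else {
            wordMap[a->second].insert(lineNo);   // file the line under A
            // Give B an (empty) entry so that it shows up when printing.
            wordMap.insert(make_pair(word, set<int>()));
        }
    }

    void printConcordance(ostream& out) const {
        for (wordDictTyp::const_iterator w = wordMap.begin(); w != wordMap.end(); ++w) {
            out << w->first << ":";
            map<string, string>::const_iterator a = aliasOf.find(w->first);
            if (a != aliasOf.end()) {
                out << " see " << a->second;     // variant: point at A
            } else {
                for (set<int>::const_iterator n = w->second.begin(); n != w->second.end(); ++n)
                    out << " " << *n;
            }
            out << "\n";
        }
    }

private:
    wordDictTyp wordMap;
    map<string, string> aliasOf;   // variant B -> canonical word A
};
```

With the pairs "hide hides" and "hide hiding" loaded, adding "hide" on line 1, "hiding" on line 3, and "hides" on line 7 prints "hide: 1 3 7", followed by "hides: see hide" and "hiding: see hide". Since the pairings file is assumed to be consistent, a single lookup in aliasOf is enough and no chains of aliases need to be followed.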