CISC 320 Algorithms and Advanced Programming, Spring 2001

Homework set #2 - revised - Algorithm analysis in the wild

Handed out: April 10, 2001
Due date: April 17, 2001

Note: This assignment is worth 9% of the course grade. It is due in class on April 12, since sketches of the solutions will be handed out on that day. The theme is algorithm analysis and development in the wild (outside of textbooks).

(3%) In Dr. Dobb's Journal, #323, April, 2001 (distributed in class), on page 145 Jon Bentley says this about the suffix array algorithm he is discussing: "On typical text files of n characters, the algorithm runs in O(n lg(n)) time."
1. Do you think he is speaking about the worst case time, the average case time, or other? Explain why you think so.
2. Bentley uses quicksort, but for this part assume that introspective sort is used. Let Ws(n) be the worst case number of string comparisons made by the algorithm on a text of n characters. Show that Ws(n) is Theta(n lg(n)). Remark: It is relatively clear that Ws(n) is O(n lg(n)), the upper bound. But here the strings in the array being sorted are not independent. Indeed, for every two suffixes, one is a substring of the other. Could it be that introspective sort always runs in Ws(n) = Theta(n) time for this special case? Show that this is not so, either by exhibiting a text of length n on which a constant multiple of n lg(n) comparisons occur, or by showing that the sorted order of the suffixes may be (virtually) any permutation of the initial order of the suffixes in the array. Count each call of strcmp and each call of comlen as one string comparison.
3. Let Wc(n) be the worst case number of character comparisons made by the algorithm on a text of n characters. Show that Wc(n) is O(n² lg(n)) -- upper bound -- and is Omega(n²) -- lower bound. Again assume introspective sort is used. You must show that Wc(n) is no more than a constant multiple of n² lg(n), and by example show that it can be at least a positive multiple of n², say n²/10. Character comparisons occur inside comlen -- *p++ == *q++ -- and inside strcmp. You may assume strcmp is written as follows.
```
int strcmp(char* p, char* q){
  while(*p != 0 && *q != 0 && *p == *q) ++p, ++q;
  if (*p < *q) return -1;
  else if (*p == *q) return 0; // they are both 0 - eos.
  else return 1;
}
```
  Simply put, the number of character comparisons used by strcmp is about 3 times (a constant multiple of) the number of initial matching characters in the two strings. The number of character comparisons used by comlen is about 2 times the number of initial matching characters in the two strings (which is also the value returned).
  Comment (not meant to affect your homework): I think the while loop in comlen would be faster and still correct if written as
```
while (*p == *q && *p) p++, q++, i++;
```
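To check the "about 3 times" and "about 2 times" estimates above on small inputs, the two routines can be instrumented with a counter. This is only a sketch: the names char_cmps, counted_strcmp and counted_comlen are made up for illustration and appear neither in Bentley's code nor in the assumed strcmp, and counted_comlen uses the rewritten loop from the comment above, not Bentley's original loop.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical counter of character comparisons (not in the original code). */
unsigned long char_cmps = 0;

/* The strcmp assumed in problem 3, with every character test counted. */
int counted_strcmp(const char *p, const char *q) {
    assert(p != NULL && q != NULL);
    for (;;) {
        char_cmps++;              /* test *p != 0 */
        if (*p == 0) break;
        char_cmps++;              /* test *q != 0 */
        if (*q == 0) break;
        char_cmps++;              /* test *p == *q */
        if (*p != *q) break;
        ++p, ++q;
    }
    char_cmps++;                  /* test *p < *q */
    if (*p < *q) return -1;
    char_cmps++;                  /* test *p == *q */
    if (*p == *q) return 0;       /* both are 0 - eos */
    return 1;
}

/* comlen with the loop "while (*p == *q && *p)", every test counted. */
int counted_comlen(const char *p, const char *q) {
    int i = 0;
    assert(p != NULL && q != NULL);
    for (;;) {
        char_cmps++;              /* test *p == *q */
        if (*p != *q) break;
        char_cmps++;              /* test *p != 0 */
        if (*p == 0) break;
        p++, q++, i++;
    }
    return i;                     /* number of initial matching characters */
}
```

For example, on "abcx" and "abcy" (3 initial matching characters), counted_strcmp adds 13 to char_cmps and counted_comlen adds 7, in line with roughly 3m and 2m plus a small constant.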
(6%) Implement an algorithm to find the longest duplicated substring in a text. Your algorithm must be
- Correct on all valid inputs (all possible texts: strings of ASCII characters).
- As fast as you can make it. For partial credit, submit Bentley's code and a report on the times you got on some test cases. For full credit, make your own modifications, explain what you modified and why, and report the times you got on some test cases.
On the due date, implementations must be submitted by email. They will be run on some test cases. If they pass an initial correctness screening they will be candidates for timed runs in class. This way we will see whose modifications have had the most success in speedup.
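As a starting point for timing experiments, the suffix array approach can be sketched roughly as follows. This is a minimal baseline under stated assumptions, not Bentley's program and not a tuned solution: the function name longest_dup is hypothetical, the text is assumed to be a NUL-terminated string held in memory (so it cannot contain the character 0), the C library's qsort stands in for whatever sort you choose, and the two occurrences of the duplicated substring are allowed to overlap.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Length of the common prefix of p and q (same role as comlen above). */
static size_t comlen(const char *p, const char *q) {
    size_t i = 0;
    while (*p == *q && *p) p++, q++, i++;
    return i;
}

/* qsort comparator: each array element is a pointer to a suffix. */
static int pstrcmp(const void *p, const void *q) {
    return strcmp(*(const char *const *)p, *(const char *const *)q);
}

/* Hypothetical helper: returns a pointer to the start of a longest
 * duplicated substring of text and stores its length in *len.
 * Returns NULL only if memory allocation fails. */
const char *longest_dup(const char *text, size_t *len) {
    assert(text != NULL && len != NULL);
    size_t n = strlen(text);
    *len = 0;
    if (n < 2) return text;               /* nothing can repeat */

    const char **a = malloc(n * sizeof *a);
    if (a == NULL) return NULL;
    for (size_t i = 0; i < n; i++)        /* a[i] points at suffix i */
        a[i] = text + i;
    qsort(a, n, sizeof *a, pstrcmp);      /* sort the suffixes */

    /* The longest repeat is a common prefix of two adjacent suffixes. */
    const char *best = text;
    for (size_t i = 0; i + 1 < n; i++) {
        size_t k = comlen(a[i], a[i + 1]);
        if (k > *len) { *len = k; best = a[i]; }
    }
    free(a);
    return best;
}
```

A caller would print the result with something like printf("%.*s\n", (int)len, s). On the text "banana" this sketch reports "ana" (length 3). Note that the sort dominates the running time, so the comparison routine and the choice of sort are the natural places to look for speedup.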