Hall of Fame

A few of the assignments/projects will be competitive. The best solution will get the Trnka award and will retain it until the next competition, like the Stanley Cup.

Awardees

Dongqing Zhu just edged out Team Java in the user evaluation of the term cloud project. See here for the evaluation and final reports. His solution relies primarily on NP chunking with regular expressions over part-of-speech tag patterns in NLTK.
Nicole Sparks, Dongqing Zhu, and Tim Walsh (Team A) won the evaluation for the Microsoft Research speller challenge. In contrast to other systems, they handled situations with multiple errors in a query and had good scoring. We'll have a picture if Praveen can ever get it off of his phone.
Chris Boston won the sentence segmentation assignment by adding some post-processing to the ML-based Punkt segmenter in NLTK. His solution was robust in re-attaching certain characters, such as quotes and parentheses, that follow a period or other sentence-final punctuation. Unlike other solutions, he simply merged back sentences of 4 characters or fewer rather than using a fixed list, which covered many of the unusual characters (such as the four possible ways of ending a double quote in the data).
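To make the merge-back idea concrete, here is a minimal sketch in Python. It is not Chris's actual code: the function name, the choice to count characters after stripping whitespace, and the choice to attach each fragment to the preceding sentence are all assumptions made for illustration.

```python
def merge_short_fragments(sentences, max_len=4):
    """Re-attach very short "sentences" to the sentence before them.

    `sentences` is the list of strings produced by any sentence segmenter.
    Fragments of at most `max_len` characters (an assumed reading of the
    "4 characters or fewer" rule) are treated as stray closing quotes,
    parentheses, and similar leftovers.
    """
    merged = []
    for sent in sentences:
        if merged and len(sent.strip()) <= max_len:
            # Short fragment: glue it onto the previous sentence.
            merged[-1] = merged[-1] + sent
        else:
            # Normal sentence (or a short fragment with nothing before it).
            merged.append(sent)
    return merged


# Hypothetical segmenter output with closing characters split off.
raw = ['She said, "Stop.', '"', 'He left (quietly.', ')', 'Done.']
print(merge_short_fragments(raw))
```

The trade-off is that a genuine short sentence such as "No." would also be merged, so the threshold is worth checking against the data.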

Calendar/Announcements

Date Topics Assignments
February 8/10
February 15/17
  • bring top 3 picks for corpus linguistics on Monday
February 22/24
March 1/3
March 8/10
  • Testing examples posted for sentence segmentation. Write your segmenter first, then test it on the four files there. If there are any errors in the gold standard, let me know. The chapter.detex file won't be a major factor in the evaluation; it's more of a curiosity and discussion subject.
March 15/17
  • Assignment 2 due Monday, March 14
  • Assignment 3 out, due Apr 4
March 22/24
  • everyone read the tutorial and paper, Dongqing and Dan will lead discussion on them
March 28-April 1 Spring Break!
April 5/7
  • spell checking assignment due on Monday by noon. Be sure to bring your examples of good/bad corrections to class.
  • project 1 out
April 12/14
April 19/21
  • phase 2 of project due Monday/noon
April 26/28
  • phase 3 (final report) of project due Wednesday/noon
  • submit final evaluation (on my special set - listed on Sakai) to me by Friday evening
  • project 2 released
May 3/5
  • Wednesday - phase 1 of project 2 due (short description of your group, the plan, the leader, etc)
May 10/12
  • Thursday - turn in your report for phase 2 (implementation) of project 2
  • Saturday - reviews of the report are due
May 17/19
  • Tuesday - revised reports are due
  • Tuesday - user evaluation will begin

Course Overview

Instructor:
Office: 100 Elkton Rd.
Office hours: Monday 10:30-11:30, Tuesday 3-4

Overseer: Kathleen McCoy
Office: 100 Elkton Rd.
Office hours: TBA

Lecture: TR 11:00-12:15, Smith 102a

Course Description

Content
This course focuses on the practical aspects of natural language processing (NLP). We will study common tasks and problems in the field and apply existing techniques and tools to quickly develop accurate solutions. The course takes a project-based, hands-on approach to solving NLP problems and draws on the wealth of available tools in human language technologies, machine learning, and statistics. In addition to existing methods, we will highlight the differences between semi-artificial tasks and domain-specific, practical tasks. We will use a collaborative learning approach so that students not only understand NLP practices but also learn to adapt existing methods to new tasks and design novel methods based on previous research. Some background in natural language processing, such as that offered by CISC882, is helpful.

Structure
The course content is divided into several modules. Each module will have an introductory lecture or two, and then we will discuss tools and papers for the module. When discussing papers, the whole class will read the paper.

Everyone is required to give two module unit presentations over the course of the semester. Take a look at the list of modules and pick two units from different modules you find interesting (first-come first-serve). More info in the grading/rules section.

Themes
Because I love bulleted lists:

Instructor Information

Biography
I recently received my Ph.D. from UD in Computer and Information Sciences and now I'm teaching a couple courses while applying for jobs. My dissertation focuses on language modeling for word prediction in devices for people with speech and motor impairments. The methods are similar to applications of language modeling for text entry on mobile devices.

In relation to the course, I've spent a substantial amount of time on practical issues in NLP. Here are some of the practical things I found I needed (or wanted) in my thesis:

Modules

Each module should have a list of tools, papers, or sub-topics which students can pick for their presentations. If there isn't a volunteer for an item, it won't be covered unless it has a star to indicate that it's very important. In that case, I'll probably pick someone.

Linguistic Background (~2 classes)

I'll cover this. The purpose is to bring everyone to the same page, but we'll only focus on the aspects of linguistics that are useful for our programming. Many high-level concepts and issues that don't affect text will be omitted.

Basic NLP Programming (1 class I hope)

I'll cover this. In general, I suggest people use Perl/Python. Java's regular expressions are incredibly annoying. On the other hand, Java seems to handle Unicode better than Perl.

Corpus Linguistics (~3-4 classes)

I'll give 1-2 lectures/discussions on corpus linguistics and then we'll spend a few classes for student presentations of their first assignment, in which you'll analyse a corpus. Here's a list of places to find freely available corpora:

General toolkits

We'll be using modules from various toolkits throughout the class, so general-purpose introductions to the toolkits may be helpful.

Lexical resources

Language modeling

Spell checking

We'll look at some tools and papers for spell/grammar checking.

Textual similarity

I have to fill this in a little more.

Collocations

I have to fill this in more.

Text formats

Present a tool to process a specific file format in any common language:

Evaluation metrics and issues

Information Extraction

I'd like to fill this out some more. Looking for a general-purpose resource or survey paper.

Morphology

Although morphology can't be used for a lot of things by itself, it tends to be used to improve other methods.

Part of speech tagging

Chunking

Mostly we're interested in base noun phrase chunking. I need to fill this in a little more.
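As a rough illustration of base noun phrase chunking, here is a minimal sketch that matches a regular expression against a sequence of part-of-speech tags. NLTK offers this style of chunking through its RegexpParser class; the version below uses only the Python standard library so it runs without any downloads. The tag pattern (optional determiner, any adjectives, one or more nouns, Penn Treebank tags) and the function name are assumptions for illustration, not a recommended grammar.

```python
import re

# Assumed base-NP pattern over Penn Treebank tags:
# optional determiner, any number of adjectives, one or more nouns.
NP_PATTERN = re.compile(r"(?:<DT>)?(?:<JJ[RS]?>)*(?:<NNP?S?>)+")


def chunk_noun_phrases(tagged):
    """Return base noun phrases from a list of (word, tag) pairs."""
    # Encode the tags as one string like "<DT><JJ><NN>" and remember
    # which character offset corresponds to which token boundary.
    boundaries = {}
    tag_string = ""
    for i, (_, tag) in enumerate(tagged):
        boundaries[len(tag_string)] = i
        tag_string += f"<{tag}>"
    boundaries[len(tag_string)] = len(tagged)

    phrases = []
    for match in NP_PATTERN.finditer(tag_string):
        # Matches always begin at "<" and end at ">", so both offsets
        # fall on token boundaries recorded above.
        first = boundaries[match.start()]
        last = boundaries[match.end()]
        phrases.append(" ".join(word for word, _ in tagged[first:last]))
    return phrases


# Hand-tagged example sentence (tags supplied by hand, not by a tagger).
sentence = [("The", "DT"), ("quick", "JJ"), ("brown", "JJ"), ("fox", "NN"),
            ("jumped", "VBD"), ("over", "IN"),
            ("the", "DT"), ("lazy", "JJ"), ("dogs", "NNS")]
print(chunk_noun_phrases(sentence))
```

In practice the input would come from a part-of-speech tagger, and the pattern would need tuning for the domain (possessives, numbers, and proper-noun sequences are common additions).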

Parsing

Machine learning toolkits

Summarization

I need to touch this section up a bit, add link to Staples, and more papers.

Optional topics

We'll cover these topics if there's interest (i.e., if someone volunteers to cover it)

Additional Resources

Grading & Policies

The course grades will roughly follow this breakdown:
Participation 15%
Presentations (2+) 30%
Individual assignments 25%
Group projects (2) 30%

Participation

This isn't like an undergrad participation grade. It's important to have discussions in class. If you don't understand something, ask. If you want more information on something, speak up. I expect to have lively discussions over the course of the semester.

Examples of good participation:

Examples of bad participation:

Presentations

You'll each give at least two 15-30 minute presentations. The length of the talk should depend on the material. For example, the corpus linguistics talks may vary based on the data format, the annotation manual(s), the data sampling method(s), and the level of linguistic annotation.

The majority of the topics throughout the semester will be on this syllabus/webpage from the beginning. Talk to me if you're interested in a topic (even a little) and I'll write you down for that. About a week ahead of the presentation time, I'll confirm that we've got our class periods booked. If not, I'll pressure people into giving talks or assign people quasi-randomly.

Here are some bulleted guidelines (but they aren't an exhaustive checklist). You should periodically talk to me about your planned presentations, even informally.

Individual assignments

Sometimes they'll be written homework assignments like "Give me three examples of X" or "Read paper X and come up with three questions you had". Other times it'll be programming assignments, like "Design and evaluate a sentence segmenter or apply and evaluate an existing one." Some of the programming assignments will qualify for HALL OF FAME.

Questions/examples will be on paper, but I had better be able to read them, and don't even think about folding over the corner of some ragged, lined paper because you don't want to go over to the CIS office next door and use the stapler. The programming assignments might involve writing a short report. For most programming assignments I'll ask for the report via email as a PDF.

Group projects

There will be two group projects for teams of about 3 people. The first will involve spelling correction, probably with Microsoft's spelling challenge. The second will probably involve summarization of product reviews. Of course these qualify for HALL OF FAME.

Projects will generally have associated timelines. You'll go about the task much like a normal research problem, so there will be a stage where you analyse data, there will be a stage where you describe the subproblems you're focusing on and propose solutions, and a final stage where it's all implemented. Then at the end your group will give a presentation.