Hall of Fame
A few of the assignments/projects will be competitive. The best solution will get the Trnka award and will retain it until the next competition, like the Stanley Cup.
Awardees
Calendar/Announcements
Course Overview
| Instructor: | Keith Trnka | Overseer: | Kathleen McCoy |
|---|---|---|---|
| Office: | 100 Elkton Rd. | Office: | 100 Elkton Rd. |
| Office hours: | Monday 10:30-11:30, Tuesday 3-4 | Office hours: | TBA |
| Lecture: | TR 11:00-12:15, Smith 102a | | |
Course Description
Content
This course focuses on the practical aspects of natural language processing (NLP). We will study common tasks/problems in the field and apply existing techniques and tools to quickly develop accurate solutions. The course takes a project-based, hands-on approach to solving NLP problems and focuses on the wealth of available tools in human language technologies, machine learning, and statistics. In addition to the focus on existing methods, we will highlight the differences between semi-artificial tasks and domain-specific, practical tasks. We will use a collaborative learning approach to not only understand NLP practices but to teach students to adapt existing methods to new tasks and design novel methods based on previous research. It is helpful to have some background in natural language processing such as offered by CISC882.
Structure
The course content is divided into several modules. Each module will have an introductory lecture or two, and then we will discuss tools and papers for the module. When discussing papers, the whole class will read the paper.
Everyone is required to give two module unit presentations over the course of the semester. Take a look at the list of modules and pick two units from different modules that you find interesting (first-come, first-served). More info is in the grading/rules section.
Themes
Because I love bulleted lists:
- easy-to-understand explanations - It's better to have most of the class learn a reasonable amount than to have all of the class understand nothing. Some tips for reports and presentations: use visuals, give examples, ask for feedback during talks, and prepare questions to stir discussion.
- developing tutorial resources
- tools vs published research - Both are important to get things done.
- practical information - I think this is the origin of the phrase "free as in beer".
- discussion
- competitions
- presentation/writing quality
- the importance of the task/problem
Instructor Information
Biography
I recently received my Ph.D. from UD in Computer and Information Sciences and now I'm teaching a couple courses while applying for jobs. My dissertation focuses on language modeling for word prediction in devices for people with speech and motor impairments. The methods are similar to applications of language modeling for text entry on mobile devices.
In relation to the course, I've spent a substantial amount of time on practical issues in NLP. Here are some of the practical things I found I needed (or wanted) in my thesis:
- text normalization - when training and testing language models on different corpora, you need to make sure that they "speak the same language", so you need to preprocess them to a standard form. Here are some example issues:
- specialized formats - every corpus has its own format. How does it label speakers? Separate files or line headers? Does it annotate non-word sounds (e.g., laughter)? Does it contain titles? Do emails contain quoted text or signatures?
- character encoding - some corpora use plain 7-bit ASCII. Some use UTF-8. Some use UTF-16. Some use ISO 8859-1. Some use Windows-1252.
- capitalization of the first word in a sentence - What's going to happen to my language model if I train on a corpus that doesn't capitalize the first word, but test on a corpus that does capitalize the first word? Keystroke savings would be awful. So I had to come up with simple corpus-based methods to decide whether the first word in a sentence was a proper noun or not.
- speech repair removal - written corpora don't have speech repairs or backchannels (uh/um), so I had to remove those when I could. Some corpora signal abandoned words or pauses but some don't.
- parenthetical removal - spoken corpora don't really have parenthetical expressions or frequent quotations (in contrast to news text). So I extracted parentheticals/quotations from written/formal texts and treated them as a separate sentence.
- sentence splitting/tokenization - different corpora have different information. Spoken corpora split up the sentences based on pauses or speaker-switches. You need different methods for written data.
- practice vs theory of language models - In theory, it seems like you're ranking the whole vocabulary. If you actually code that, it's too slow to get anything done.
- balancing cross-validation sets for topic adaptation
If you randomly split up documents into cross-validation sets, you can potentially pick a worst-case scenario (i.e., sets are abnormally different). So you want to balance them to ensure that you're fairly evaluating your methods.
- part of speech tagging
- taggers tokenize in Penn Treebank format, but we predict words in a more standard format (split on spaces only)
- taggers are trained on news data (Treebank), but I'm tagging spoken data. So we need to add extra processing to get Treebank-based taggers to work.
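To make the sentence-initial capitalization issue above concrete, here's a minimal Python sketch of one corpus-based heuristic (just an illustration, not necessarily the exact method from my thesis): treat a sentence-initial word as a proper noun if it is almost always capitalized when it occurs mid-sentence.

```python
from collections import Counter

def capitalization_counts(sentences):
    """Count capitalized vs. lowercased occurrences of each word in
    non-sentence-initial positions (where capitalization is informative)."""
    capped, lowered = Counter(), Counter()
    for tokens in sentences:
        for tok in tokens[1:]:  # skip the sentence-initial token
            if tok[:1].isupper():
                capped[tok.lower()] += 1
            else:
                lowered[tok.lower()] += 1
    return capped, lowered

def is_probably_proper_noun(word, capped, lowered, threshold=0.9):
    """A sentence-initial word is treated as a proper noun if it is
    capitalized in at least `threshold` of its mid-sentence occurrences."""
    key = word.lower()
    total = capped[key] + lowered[key]
    if total == 0:
        return False  # unseen mid-sentence; assume it's an ordinary word
    return capped[key] / total >= threshold

# Toy corpus: "Smith" is always capitalized mid-sentence, "the" never is.
corpus = [["John", "met", "Smith", "yesterday"],
          ["the", "dog", "saw", "Smith", "near", "the", "park"]]
capped, lowered = capitalization_counts(corpus)
print(is_probably_proper_noun("Smith", capped, lowered))  # True
print(is_probably_proper_noun("The", capped, lowered))    # False
```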
Modules
Each module should have a list of tools, papers, or sub-topics that students can pick for their presentations. If there isn't a volunteer, a unit won't be covered unless there's a star to indicate that it's very important; in that case, I'll probably pick someone.
Linguistic Background (~2 classes)
I'll cover this. The purpose is to get everyone on the same page, but we'll only focus on the aspects of linguistics that are useful for our programming. Many high-level concepts and issues that don't affect text will be omitted.
Basic NLP Programming (1 class, I hope)
I'll cover this. In general, I suggest people use Perl/Python. Java's regular expressions are incredibly annoying. On the other hand, Java seems to handle Unicode better than Perl.
- character sets and Unicode
- Joel Spolsky on Unicode (2003) - if you read one thing, read this
- file - quick test for some encodings. Doesn't work well if there are headers/annotations. It can identify pure ASCII, ISO-8859-x, UTF-8, UTF-16, and EBCDIC. Anything else is reported as non-ISO extended ASCII.
- iconv - converts between various character sets. Run with -l to get a list of supported encodings. Doesn't work well with Windows-1252 in my experience. It's handy if some language supports UTF-8 but not UTF-16.
- ASCII - scroll to the bottom of any Wikipedia charset page to witness the horror of pre-Unicode life.
- ISO 8859-1 - if there's a "standard" extended ASCII, this is it
- Windows-1252 - differs from ISO 8859-1 in the 0x80-0x9F range, adding special printable characters.
- Table of the Windows-1252 differences - the table maps Windows-1252 to Unicode code points and UTF-8 bytes, provides escape codes, and a short description.
- Perl charset tutorial - It says Unicode tutorial, but it's just kidding. It gives info about the Encode module. If we're just talking about normal-ish files, you can avoid it by specifying the encoding in the open command, as mentioned in the FAQ. (A rough Python analogue is sketched at the end of this module's list.)
- IBM on UTF-8 vs UTF-16 - interesting read
- Unicode tools/libraries - some of these are really neat
- regular expressions
- Perl's regular expression page - The reference. Also printed in Perl in a Nutshell.
- txt2re - type in some text, then click the link for the piece you want to recognize. But it only somewhat works. On the plus side, it generates output in multiple languages. You could probably become famous by writing a real version of that (especially if it's pure Javascript).
- J&M Chapter 2 is good; can skip discussion of FSAs, regular languages
- unification as a framework or design pattern
- training/testing/development, cross-validation, etc
- word tokenization, sentence tokenization/splitting/segmentation
- J&M Section 3.9
- Mikheev (2000). Document centered approach to text normalization
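To tie the encoding and tokenization items above together, here's a minimal Python sketch. It assumes NLTK (with its Punkt models) is installed, and the filename is just a placeholder.

```python
import nltk

def read_and_tokenize(path, encoding="cp1252"):
    # Decode explicitly so bytes like Windows-1252 curly quotes don't turn
    # into mojibake; swap in "utf-8", "latin-1", etc. as appropriate.
    with open(path, encoding=encoding) as handle:
        text = handle.read()
    sentences = nltk.sent_tokenize(text)            # Punkt sentence splitter
    return [nltk.word_tokenize(sent) for sent in sentences]

# for tokens in read_and_tokenize("some_corpus.txt"):
#     print(tokens)
```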
Corpus Linguistics (~3-4 classes)
I'll give 1-2 lectures/discussions on corpus linguistics and then we'll spend a few classes on student presentations of their first assignment, in which you'll analyse a corpus. Here's a list of places to find freely available corpora:
- NLTK corpora (also available for download) - good first place to look for something interesting. It has plain text corpora as well as treebanks.
- LDC freebies - Not all of these are English, but there are some neat ones like Timebank (news tagged with time info)
- /m/blizzard/corpora - we've collected several corpora over the years, available on blizzard. Also, if you're interested in parts of ANC not yet on blizzard, we have that too.
- Wikipedia - you can download a dump of Wikipedia, though you need to deal with whatever format they use. Excellent resource for many tasks.
- Reuters - large collection of news stories from TREC. We'd have to request it, but it's an option.
- Oxford Text Archive - nice listing of corpora and whether they're free or not.
- Dialogue Diversity corpus
- CSTBank - small corpus annotated for cross-document discourse structure.
- ClairLib corpora - The 20newsgroups corpus might be neat, there are others listed though.
- Project Gutenberg - huge collection of traditional literature (free because copyrights expire)
- Hansard corpus - official records (Hansards) of 36th Canadian Parliament. Parallel corpus in English and French.
- Copenhagen Dependency Treebank - relatively small amount of English, but annotated for part of speech and syntax.
- Corpus of Contemporary American English (COCA) and Corpus of Historical American English (COHA) - These are corpora designed for corpus linguistics rather than NLP/computational linguistics (they're web-access only). The Google Books Ngram Viewer is similar to COHA (though prettier and with less annotation).
- If you want to work with the xkcd color experiment data on blizzard, one of the things you should do is correlation with Crayola colors and/or standard CSS/HTML named-colors.
- feel free to pick something I haven't listed or linked to, so long as it's freely available to all EECIS students. For example, someone interested in BioNLP might consider MEDLINE.
General toolkits
We'll be using modules from various toolkits throughout the class, so general-purpose introductions to the toolkits may be helpful.
- [Nicole] NLTK (Python) - This is the most popular toolkit and Python is a nice language for NLP. Contains dozens of NLP modules with various implementations. The NLTK book is available for free online or for pay in stores.
- LingPipe (Java) - Unlike other toolkits, LingPipe is more commercially-oriented (free for research), so they've put time into performance/memory optimization and such. The API has many tools.
- ClairLib (Perl) - ClairLib has several NLP tools for Perl, including summarization.
- FreeLing (C++, Python, PHP) - seems like it was originally made for Spanish and related languages, but supports English. You can play with some of it in an online demo.
- GATE (Java) - GATE has been around for ages and continues to grow. It's a little unclear exactly what it can do, but it's been very successful in information extraction.
- OpenNLP (Java) - has a handful of tools. It was (relatively recently) picked up by Apache Incubator, so this may develop into something bigger in the future.
- LT TTT2 - unlike others, this is more like a group of tools rather than an API (as far as I know).
- Mallet (Java) - package for lots of NLP tasks. I don't know much about it.
- MorphAdorner (Java/CLI) - designed for morphological analysis, but also has tools for sentence segmentation and named entity recognition/classification. Has many other tools, including TextTiling and C99 for text segmentation.
Lexical resources
- *WordNet - has APIs in many, many languages. Toolkits like NLTK and others have their own interfaces as well.
- [Keith] general intro
- [Charlie] *similarity metrics
This can be two talks: theoretical similarity metrics and practical metrics implemented in NLTK or other tools.
- *FrameNet - more about verbs/frames than WordNet.
- VerbNet - large online verb lexicon. Addresses the weaknesses of WordNet for verbs.
- PropBank - text annotated with basic semantic propositions
- *Unified Verb Index - merges and links many verb ontologies (VerbNet, PropBank, FrameNet, OntoNotes)
- OpenCyc - a large, general-purpose knowledge base and reasoning engine. Unlike the previous Cyc project, this is free and open.
- Open Mind Common Sense - a large knowledge base of "common sense" knowledge. Addresses some of the problems in knowledge-based approaches to AI/NLP.
- SIGLEX - ACL's special interest group on lexical issues, including lexical semantics. Their online resources section is very useful.
- COMLEX - oldschool lexicon with part of speech and other info. See also the LDC page. Version 3.0 (1997) and 2.2 (1994) are on blizzard.
- person names - If you want to give a talk on this, you should collect some aggregate statistics of interest to NLP, such as the percentage of first names that are also last names (and vice versa), ambiguity between first names and regular words when we don't have capitalization, male/female ambiguity, etc.
- US Census publishes some statistics about names
- There are countless baby name databases out there for many countries.
- word lists - Yet Another Word List (YAWL), aspell, etc
There are many word lists or dictionaries available online. If you'd like, you can do a presentation on them, but because the processing is easy, it'll be more like a corpus linguistics presentation (number of words, word length distributions, does it have proper nouns, are there any strange sorts of words, etc). You should also estimate the coverage of the word lists by comparing overlap with other resources (like the Google unigram model).
- Roget's Thesaurus - The 1911 version is available from Project Gutenberg and an API to access it here. They plan to publish a version of their tool with modern thesaurus information. They show comparisons to the 1987 edition on their webpage (surprisingly close for some tasks).
- Lingua::EN::Phoneme (Perl) - interface to the popular CMU phonetic dictionary
- Visual Thesaurus - not really worthy of a full talk, but a really good example of NLP combined with a nice user interface
- a talk on stopwords may also be interesting, especially if you take a look at borderline cases and how stopwords correlate (or don't correlate) with function words. May also consider the idea of core/nuclear vocabulary.
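For the practical side of the similarity metrics unit above, here's a tiny example using NLTK's WordNet interface (it assumes the WordNet data has been downloaded); an actual talk would compare several metrics.

```python
from nltk.corpus import wordnet as wn

# Take the first noun sense of each word (a real application would
# disambiguate senses rather than just grabbing the first one).
dog = wn.synsets("dog", pos=wn.NOUN)[0]
cat = wn.synsets("cat", pos=wn.NOUN)[0]
car = wn.synsets("car", pos=wn.NOUN)[0]

# Path similarity: based on the shortest path between senses in the
# hypernym hierarchy (1.0 means identical senses).
print(dog.path_similarity(cat))  # fairly similar
print(dog.path_similarity(car))  # much less similar
```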
Language modeling
- [Keith] Intro: I'll adapt J&M to my experiences.
- toolkits/etc
- *ARPA format - it seems that all the language modeling toolkits support this format. Details on the web are a little sparse, but J&M has a section describing the format.
- [Praveen] *CMU language modeling toolkit - very popular toolkit
- *SRI language modeling toolkit - seems more up-to-date and maintained than the CMU one, has many options
- MIT language modeling toolkit - new contender
- Microsoft Research language modeling toolkit - designed for scalability, but with small set of features. Available from here.
- random forest language modeling toolkit (JHU) - I haven't seen it used, but it seems like it should be good.
- precomputed language models - There are some popular language models.
- Google ngram model - The real name is "Web 1T 5-gram Version 1" but that isn't as recognizable. It has frequencies for 5grams, 4grams, etc. (This means you need to do your own smoothing) Available from research machines in /m/blizzard/corpora/google
- Google books ngram data - Similar style to the Google web ngrams, except with additional segmentation by year. Used to build the Google Books Ngram Viewer.
- [Praveen] Microsoft Web N-gram Services - cloud-based language model. Instead of trying to load it locally, you query their servers using SOAP/REST or a Python API. It's a smoothed backoff 5gram model.
- noisy channel models - bonus points if you fix the Wikipedia article on noisy channel models, which is tagged with requests for improvement.
- papers
- Jelinek (2009). The Dawn of Statistical ASR and MT.
- [Tim A] Jelinek (1991). Up from trigrams! - the struggle for improved language models.
- Brill, Florian, Henderson, and Mangu (1998). Beyond N-Grams: Can Linguistic Sophistication Improve Language Modeling?
- Rosenfeld (2000). Two decades of statistical language modeling: Where do we go from here?
- *Goodman (2001). A Bit of Progress in Language Modeling: Extended Version.
- Bellegarda (2004). Statistical language model adaptation: review and perspectives.
- Chen and Goodman (1996). An Empirical Study of Smoothing Techniques for Language Modeling
There are a couple updated versions of this paper to look at for this talk.
- smoothing and discounting methods
It might be a little detailed for this class, but someone could give a talk on smoothing methods that goes beyond the introductory material. If you're considering this, you might take a look at papers for Witten-Bell, Simple Good-Turing, Chen and Goodman (above), Katz backoff, and Kneser-Ney smoothing.
- hidden parameter estimation via expectation maximization (EM)
Bonus points if you can give a simple/clear tutorial on how to write code that finds weights in a language model using the EM family of algorithms. The primary application should be linear interpolation of LMs (a toy sketch appears at the end of this module's list).
- class-based language models, like Brown et al. (1992) "Class-based n-gram models of natural language"
- the entropy of English
There are two papers you'd want to look at: 1) Shannon's 1951 paper "Prediction and entropy of printed English" and 2) Brown et al. (1992) "An estimate of an upper bound for the entropy of English".
- other potential topics: trigger pairs, cache-based models, topic adaptation (LSA, similarity-based, EM-based), web-based adaptation for topic modeling, decision-tree models, pruning
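For the EM unit above, here's a toy sketch of estimating linear-interpolation weights on held-out data. The component models are stand-ins (just lists of per-token probabilities); a real setup would plug in, say, a trigram model and a unigram model.

```python
def em_interpolation_weights(component_probs, iterations=20):
    """component_probs[i][t] is the probability model i assigns to held-out
    token t. Returns mixture weights that maximize held-out likelihood."""
    n_models = len(component_probs)
    n_tokens = len(component_probs[0])
    weights = [1.0 / n_models] * n_models  # start with uniform weights

    for _ in range(iterations):
        expected = [0.0] * n_models
        for t in range(n_tokens):
            # E-step: each model's share of responsibility for this token.
            mixture = sum(w * probs[t] for w, probs in zip(weights, component_probs))
            for i in range(n_models):
                expected[i] += weights[i] * component_probs[i][t] / mixture
        # M-step: renormalize the expected counts into new weights.
        weights = [count / n_tokens for count in expected]
    return weights

# Toy example: model 0 assigns higher probability to the held-out tokens,
# so EM should give it most of the weight.
model0 = [0.20, 0.10, 0.30, 0.25]
model1 = [0.05, 0.02, 0.10, 0.01]
print(em_interpolation_weights([model0, model1]))
```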
Spell checking
We'll look at some tools and papers for spell/grammar checking.
- [Keith] Intro: J&M 3.10 and 5.9
- papers
- *Kukich (1992). Techniques for automatically correcting words in text.
- [Dan] *Mitton (1996). Spellchecking by computer.
Excellent article, but keep the year in mind. Real-word spelling correction is very common in research, but still very scarce in practice.
- *Brill and Moore (2000). An improved error model for noisy channel spelling correction
- Toutanova and Moore (2002). Pronunciation modeling for improved spelling correction.
- *Cucerzan and Brill (2004). Spelling correction as an iterative process that exploits the collective knowledge of web users
- *minimum edit distance (and weighted minimum edit distance) - note the issues in converting it to a noisy channel model. Also note search space issues. Also note issues in balancing accuracy vs dictionary coverage. Note that I'll cover basic minimum edit distance in the intro lecture, so you'll have to cover more of the issues and adaptations of it. Should also cover the simple version of Damerau–Levenshtein distance. (A bare-bones dynamic programming sketch appears at the end of this module's list.)
- helpful links on Levenshtein distance (aka minimum edit distance aka string alignment)
- Levenshtein distance (wikipedia) - page include a couple tables and proof of correctness
- Visualization of computing Levenshtein distance (type in the box, interact in popup window)
- Another interactive min edit distance - this one works in the same window, but doesn't show you the trace.
- Another example, and C++/Java code
- Damerau–Levenshtein distance - this is the modification of Levenshtein distance to include transposition errors. Although in general it's not trivial, it's trivial with some assumptions about the transpositions
- *After the Deadline - very nice spelling/grammar/style checker, but it's mostly used as a web service. Firefox/Chrome plugins, WordPress plugin, OpenOffice plugin, etc. There's a lot of information in the paper below and the blog.
- Raphael Mudge (2010). The Design of a Proofreading Software Service.
There's more info (slides, etc) on the After the Deadline blog. It looks like they even make their ngram models available.
- [Dongqing] Peter Norvig's How to Write a Spelling Corrector - great guide and tutorial. Covers both the theoretical and practical information. Provides a 21 line Python corrector right at the top. Links to a spelling error corpus in the body of the text. Also includes extensive future work.
- LingPipe's spelling correction tutorial - nice tutorial though maybe not introductory level. Excellent implementation and coverage of context-sensitive spelling correction. I'm guessing this is basically an explanation of LingPipe's spelling correction module.
- Language tool - multi-language spelling, grammar, and style checking. You can browse the English rules online.
- Manning, Raghavan, and Schütze (2008). Introduction to Information Retrieval.
Has a nice couple of sections on spelling correction for information retrieval (Chapter 3).
- SOUNDEX - an interesting old algorithm that can help (included by default in Perl's Text::Soundex)
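As a starting point for the minimum edit distance unit above, here's a bare-bones dynamic programming sketch (unweighted; a weighted version would replace the unit costs with per-operation costs):

```python
def edit_distance(source, target):
    """Levenshtein distance: minimum number of insertions, deletions,
    and substitutions needed to turn source into target."""
    m, n = len(source), len(target)
    # d[i][j] = cost of transforming source[:i] into target[:j]
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i  # delete everything
    for j in range(n + 1):
        d[0][j] = j  # insert everything
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            substitution = 0 if source[i - 1] == target[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,                 # deletion
                          d[i][j - 1] + 1,                 # insertion
                          d[i - 1][j - 1] + substitution)  # substitution/match
    return d[m][n]

print(edit_distance("speling", "spelling"))  # 1
print(edit_distance("korrect", "correct"))   # 1
```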
Textual similarity
I have to fill this in a little more.
- vector space model (aka unigrams)
- metrics - TF-IDF, cosine, Jaccard, LSA, LDA, generative probability, naive Bayes, etc (a quick cosine example appears at the end of this list)
- clustering (WEKA? ClairLib?)
- text classification
- Author identification/attribution
- Language identification
- (Any kind of classification really)
- text segmentation
- TextTiling - implemented in Lingua::EN::Segmenter::TextTiling (Perl). Also in NLTK and MorphAdorner. Joel Tetreault has several links to tools (at the bottom).
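To make the vector space model above concrete, here's a toy cosine similarity sketch over raw term frequencies (a real system would apply TF-IDF weighting and proper tokenization):

```python
import math
from collections import Counter

def cosine_similarity(text_a, text_b):
    # Bag-of-words vectors from naive whitespace tokenization.
    a = Counter(text_a.lower().split())
    b = Counter(text_b.lower().split())
    dot = sum(a[word] * b[word] for word in a if word in b)
    norm_a = math.sqrt(sum(count * count for count in a.values()))
    norm_b = math.sqrt(sum(count * count for count in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

print(cosine_similarity("the cat sat on the mat", "the cat sat on the hat"))
print(cosine_similarity("the cat sat on the mat", "language models are fun"))
```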
Collocations
I have to fill this in more.
- [Keith] *pointwise mutual information - this is an essential technique in NLP. I'll start the ball rolling with PMI, but someone can do a followup lecture (a toy PMI sketch appears at the end of this list).
- Tools/methods:
- WordNet compounds
- Wikipedia titles
- Text::NSP (Perl)
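Here's a toy sketch of PMI over adjacent word pairs (raw maximum-likelihood estimates; real collocation work would use a much larger corpus plus frequency or significance cutoffs):

```python
import math
from collections import Counter

def pmi_scores(tokens, min_count=2):
    """PMI(x, y) = log2( P(x, y) / (P(x) * P(y)) ) for adjacent pairs."""
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    total = len(tokens)
    scores = {}
    for (x, y), count in bigrams.items():
        if count < min_count:
            continue  # rare pairs give unreliable, inflated PMI scores
        p_xy = count / (total - 1)
        p_x = unigrams[x] / total
        p_y = unigrams[y] / total
        scores[(x, y)] = math.log2(p_xy / (p_x * p_y))
    return scores

tokens = "new york is bigger than new jersey but new york is colder".split()
for pair, score in sorted(pmi_scores(tokens).items(), key=lambda kv: -kv[1]):
    print(pair, round(score, 2))
```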
Text formats
Present a tool to process a specific file format in any common language:
- Formats of interest:
- HTML (see SIGWAC - web as corpus)
- *crawling to build a corpus (e.g., Amazon product reviews, Google search, Google specialty searches). A talk should cover web crawling for a specific language.
- *processing the HTML
CLEANEVAL is an important resource in this area - a competition to best extract text from HTML.
- *LaTeX and BibTeX - There are some Perl modules to process LaTeX/BibTeX files (probably other languages too). Or you can go the detex route. If you want good examples of LaTeX processing, take a look at various helper tools, like style-check.rb (Ruby).
- *Wikipedia (and other forum/wiki/blog formats) - There are some tools listed on Wikipedia's database download page. Many other HTML-like formats exist, such as BBCode for forums.
- Twitter - not a text format per se, but there are APIs to follow the live stream of millions of tweets per day in addition to searching them. It's been used for certain kinds of NLP, such as novelty detection, or tracking the spread of seasonal flu.
- Tools of interest:
- *Apache Tika - Java, but it might be run as a standalone tool
- If you come across a processing module for some language or a general-purpose tool, suggest it to me as a unit.
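For the "processing the HTML" unit above, here's a bare-bones sketch using Python's standard-library HTML parser. It just strips tags and skips script/style content, which is exactly the naive baseline that CLEANEVAL-style boilerplate removal improves on.

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    SKIP_TAGS = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.parts = []
        self.skip_depth = 0  # are we nested inside <script>/<style>?

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP_TAGS:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP_TAGS and self.skip_depth:
            self.skip_depth -= 1

    def handle_data(self, data):
        if not self.skip_depth and data.strip():
            self.parts.append(data.strip())

extractor = TextExtractor()
extractor.feed("<html><body><h1>Hi</h1><script>var x = 1;</script>"
               "<p>Some text.</p></body></html>")
print(" ".join(extractor.parts))  # Hi Some text.
```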
Evaluation metrics and issues
- [Keith] *Keystroke savings
- precision, recall, F-measure
- ROC curves and AUC
- evaluation for IR: MAP, MRR, etc
- accuracy
- BLEU
- user studies
- annotator agreement and the kappa statistic
Information Extraction
I'd like to fill this out some more. Looking for a general-purpose resource or survey paper.
- Intro: R&N has a nice intro. J&M second edition also has a decent chapter.
- *Time expression tagging
- named entity recognition/classification
- Lingua::EN::NamedEntity (Perl)
- Links from MIT - I'll try and read this later when I have time, looks nice.
Morphology
Although morphology can't be used for a lot of things by itself, it tends to be used to improve other methods.
- [Keith] Intro: J&M Chapter 3
- Stemming/Lemmatization
- *Porter's stemmer - Natively available in almost every language out there.
Someone could also do a talk on Porter's original publications, which have been reprinted more recently.
- *KIMMO/PC-KIMMO or Pykimmo (some MIT students adapted KIMMO to Python, but I can't find it anymore. It might be in an NLTK alpha or on the wayback machine.) Side note: If you want to make a good version of KIMMO for Perl and put it on CPAN, you'll get a major bonus and probably some fame.
- Snowball (Perl)
- *NLTK's WordNet lemmatizer
- NLTK's Lancaster stemmer
- Manabu's stemmer (on /m/blizzard)
- (demo) Xerox XRCE language tools: English demo
- Orthography and part of speech
- Krovetz (1993). Viewing morphology as an inference process
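As a quick taste of the stemming/lemmatization tools above, here's a tiny NLTK example (it assumes the WordNet data is downloaded) contrasting the Porter stemmer with WordNet lemmatization:

```python
import nltk

porter = nltk.PorterStemmer()
lemmatizer = nltk.WordNetLemmatizer()

for word in ["running", "flies", "corpora", "relational"]:
    print(word,
          porter.stem(word),                    # crude rule-based suffix stripping
          lemmatizer.lemmatize(word, pos="n"))  # WordNet lookup (treating each word as a noun here)
```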
Part of speech tagging
- [Keith] Intro: J&M Chapter 5
- Penn Treebank tagset - The annotation manual is an instructive read. Here are some cheat sheets:
- Inside cover of J&M has the tags and examples!
- http://faculty.washington.edu/dillon/GramResources/penntable.html (has some examples)
- http://www.computing.dcu.ie/~acahill/tagset.html
- http://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html
- http://bulba.sdsu.edu/jeanette/thesis/PennTags.html
- *AMALGAM - automatic mapping between multiple formalisms. They have an excellent set of pages on the various tag sets. J&M also has some comparison.
- Taggers:
- *Stanford maxent tagger
- [Nicole] *TreeTagger - decision-tree based tagger
- TnT - HMM-based tagger, free for non-commercial research
- *hunpos - a re-implementation of TnT (but completely free)
- GENIA - tagger made for BioNLP, but also works for regular NLP
- Brill's TBL tagger - I have a copy of Brill's tagger, and you might be able to find it from his old Microsoft page using the Internet way-back machine. There's also an implementation in NLTK. See also Lingua::BrillTagger (Perl)
- *fnTBL - a transformation-based learning (TBL) toolkit with a tagger like Brill's and also several other things (like chunking, etc)
- CLAWS tagger - CLAWS can only be used for free via a web interface, but it's a good example of another popular tag set
- MXPOST - used to be a popular maximum entropy tagger from Penn
- LT TTT2 - I believe this contains a descendant of LT POS, which used to be a popular tagger.
- *QTag - rule-based tagger. Supposedly handles misspelled words and junk text well.
- NLTK bigram/trigram/ngram taggers - if giving a talk on these, compare unigram tagging to bigram tagging to trigram tagging, etc. to show the benefit of context and the marginal benefit of added context (a short sketch appears at the end of this list).
- LingPipe has a tagger
- FreeLing has two taggers
- Supertagging - some software from here. The interesting thing is that instead of tagging with plain part of speech tags, you're tagging with simple trees from tree-adjoining grammar (TAG).
- Anything on ACL's state of the art page
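For the NLTK ngram tagger comparison above, here's a rough sketch (it assumes NLTK with its Penn Treebank sample downloaded; the train/test split is arbitrary):

```python
import nltk
from nltk.corpus import treebank

def tagging_accuracy(tagger, gold_sents):
    # Tag each sentence ourselves and compare against the gold tags.
    correct = total = 0
    for sent in gold_sents:
        words = [word for word, _ in sent]
        for (_, guess), (_, gold) in zip(tagger.tag(words), sent):
            correct += guess == gold
            total += 1
    return correct / total

tagged = treebank.tagged_sents()
train, test = tagged[:3000], tagged[3000:]

default = nltk.DefaultTagger("NN")                     # baseline: everything is NN
unigram = nltk.UnigramTagger(train, backoff=default)   # most frequent tag per word
bigram = nltk.BigramTagger(train, backoff=unigram)     # adds one tag of left context

for name, tagger in [("default", default), ("unigram", unigram), ("bigram", bigram)]:
    print(name, round(tagging_accuracy(tagger, test), 3))
```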
Chunking
Mostly we're interested in base noun phrase chunking. I need to fill this in a little more.
- [Keith] Intro: J&M??? Also covering the formulation of chunking as a tagging problem.
- *YamCha - SVM-based model for NP chunking and a variety of other labeling tasks, such as POS tagging, and NER. If you scroll down a bit, there's a nice example showing how chunking is represented as a tagging problem.
- *fnTBL - TBL-based model for NP chunking and other labeling tasks - POS tagging, more general chunking, sentence segmentation, and WSD.
- NLTK - has a chunker module
- FreeLing - has a chunker
- LT TTT2 - has a chunker
- Fill in section with either papers, shared tasks, etc. Check textbooks!
Parsing
- Summary of different kinds of parsing: grammatical parsing, dependency parsing, semantic parsing. Probabilistic vs not.
- Probabilistic parsers:
- Stanford parser - Java, open source. Has an online demo. Also provides dependencies. See Stanford dependencies
- Dan Bikel's parser - Java. Contains an implementation of Collins' parser. The description sounds like this is suited for easy use.
- Collins' parser - C source and executables. It looks like there's a Perl frontend in Lingua::CollinsParser
- Charniak's parser - C++ source distribution
- Link Grammar parser - link grammar is a bit different than a traditional HPSG parse.
- MSTParser - Java dependency parser
- XTAG - there's a C version of a TAG parser on there
- ASSERT - semantic role labeling/parsing based on PropBank.
- Anything else on Stanford's site, especially the semantic parsers.
- If you want a paper/theoretical talk, you could talk about the kinds of information (besides the grammar) used in probabilistic parsers.
Machine learning toolkits
- [Oana] *SVMLight
- [Charlie] *C4.5/C5.0
- [Rich] Bayesian networks
- Weka - you can cover a particular ML implementation
- Infomap - latent semantic analysis toolkit
- R - R is an excellent statistics environment (and is a free alternative to SAS). In addition to normal statistics (and some nice usage of gnuplot), it can do things like linear regression and logistic regression.
- It isn't machine learning per se, but linear regression is an important idea that isn't often taught. It's also pretty easy (see the tiny sketch below).
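Since linear regression is easy to demystify, here's a tiny ordinary-least-squares sketch for the single-variable case (R's lm() does this and much more):

```python
def fit_simple_regression(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance = sum((x - mean_x) ** 2 for x in xs)
    slope = covariance / variance
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Noisy points scattered around y = 2x + 1.
xs = [1, 2, 3, 4, 5]
ys = [3.1, 4.9, 7.2, 9.0, 10.8]
print(fit_simple_regression(xs, ys))  # roughly slope 2, intercept 1
```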
Summarization
I need to touch this section up a bit, add a link to Staples, and more papers.
- *MEAD (ClairLib/Perl) - multi-document summarization. ClairLib also has an interface for MEAD.
- NewsInEssence - an application of MEAD.
- Columbia's NewsBlaster - It's amazing that NewsBlaster still works; it's been around for ages. It isn't a toolkit that you can use or anything though.
- HTML::Summary (Perl)
- Lingua::EN::Summarize (Perl)
- Lingua::EN::Squeeze (Perl) - the goal is to shorten text for SMS
- Pick a commercial summarizer and evaluate it
- *KPC
- *MMR
Optional topics
We'll cover these topics if there's interest (i.e., if someone volunteers to cover it).
- Surface realization
- FUF/SURGE
- Nitrogen/Halogen
- ACL's list of systems
- Chat bots, namely ALICE
- Word sense disambiguation - Senseval, SenseClusters
- Coreference resolution
- Machine translation
- Rhetorical structure theory (RST)
- Lexical chains
Rich got Greg Silber's lexical chainer working again, but it had some other problems.
- Latent Dirichlet Allocation - If you can actually make me understand this, I'll seriously bake you something. I understand the high-level idea, but not how it works. Fair warning: no one has succeeded so far.
- Speech recognition as input - Sphinx, Windows, etc
- Speech synthesis tools
- Text simplification
- The only available tool I know of was written by Oana and Peter. (It's a web-based system and I'll find the link if there's interest.) Super bonus points if you want to improve on their work and release a Perl/Python module.
- text-to-text generation
- Kryptos - a sculpture of encrypted text that has stood at the CIA since 1990. The first public solution to the first three parts was found in 1999. The fourth part is still unsolved. Cracking encryption is (in part) a language modeling problem, but an abnormally hard one because the source text has dropped punctuation, no capitalization information, usage of 'X' for declarative sentence breaks, uncommon words, and some deliberate misspellings.
- Keith's experiences in getting things done
- reading and writing CSV files
- writing gnuplot files
- want a balance of human-readable and machine-readable/storable? gzipped xml
- anything in my list in the biography (or stuff I've published)
Additional Resources
- Software and tools
- Stanford NLP resources - best list of tools out there by far
- Natural Language Software Repository - a searchable system of software. Has some unique information, but lots of commercial tools and the categorization is so-so.
- LDC's Linguistic Annotation Page - just "found" this recently; nice list of annotation formats and tools.
- Morphix-NLP Live CD Manual - I'd never heard of this Linux Live CD distribution, but the manual actually has some nice documentation on some tools. For example, I didn't realize that Adam Berger has some tools for trigger pairs. The nice thing is that older software is usually hard to find, but all of this stuff is built into their distribution with some wrapper scripts.
- Hal Daumé's software - a handful of tools, some in (arguably) more rare languages like Haskell or OCaml.
- CPAN (Perl) - central place for almost every Perl module. You can search by keyword. There are many NLP modules under the prefix "Lingua". For installation, just use the CPAN module like so:
perl -MCPAN -e "install X", where X is the module you want. If you're installing into a custom location (if you don't have root access), then do perl -MCPAN -e "shell" and navigate through the help files to add the parameter PREFIX=/path/to/folder as a make_install_arg or maybe as mbuildpl_arg (I can't remember which at the moment)
- Python Package Index (PyPI) - CPAN for Python
- If you come across a nice list of tools or Perl scripts/snippets on someone's webpage, let me know.
- Books
I'll try to provide links to useful books when we cover related material, but here's a rough list.
- Speech and Language Processing by Jurafsky and Martin (2009)
Great overview of the field. I'll use material from the second edition. This is abbreviated J&M all over the syllabus.
- Text Processing in Python by Mertz (2003)
Free web version of the book; looks like it has some good programming knowledge.
- Natural Language Processing with Python by Bird, Klein, and Loper (2009)
It's a book about coding NLP, but uses NLTK extensively. Free online.
- Foundations of Statistical Natural Language Processing by Manning and Schütze (1999)
I have trouble believing that this book was from 1999. It's an excellent intro to NLP with a focus on statistical methods.
- Natural Language Understanding by James Allen (1994)
This is one of the classics, but hasn't been updated in years.
- Register, Genre, and Style by Biber and Conrad (2009)
This is the modernization of Biber's famous corpus linguistics work. It's a nice reference/example for that aspect of the class.
- Anaphora Resolution by Ruslan Mitkov (2002)
I remember that I enjoyed reading this; I'll try to skim it again to give a better overview.
- Cohesion in English by Halliday and Hasan (1976)
Gives a good background in seeing the origins of the modern idea of cohesion in NLP.
- machine learning book(s)?
- Information Retrieval by C.J. "Keith" van Rijsbergen (1979)
Important oldschool text on IR, but still has a lot of good information on text similarity, classification, and search. I remember that it's easy to read. It's also free online.
- Text Analysis with LingPipe 4 by Bob Carpenter (2011)
It's currently a freely available draft, but I bet it's really useful for those looking to use LingPipe.
Grading & Policies
The course grades will roughly follow this breakdown:

| Component | Weight |
|---|---|
| Participation | 15% |
| Presentations (2+) | 30% |
| Individual assignments | 25% |
| Group projects (2) | 30% |
Participation
This isn't like an undergrad participation grade. It's important to have discussions in class. If you don't understand something, ask. Or if you want more information on something, speak up. I expect to have lively discussions over the course of the semester.
Examples of good participation:
- asking about the differences between similar tags
- discussing pros and cons of some tool/approach
- application questions (i.e., How did you just do that?)
- there are so many more
Examples of bad participation:
- not showing up to class
- texting/whatever while someone's presenting
- asking a question you know the answer to, but you want to show off (asking these kinds of questions won't help your participation grade, but will hurt it if it gets distracting)
Presentations
You'll each give at least two 15-30 minute presentations. The length of the talk should depend on the material. For example, the corpus linguistics talks may vary based on the data format, the annotation manual(s), the data sampling method(s), and the level of linguistic annotation.
The majority of the topics throughout the semester will be on this syllabus/webpage from the beginning. Talk to me if you're interested in a topic (even a little) and I'll write you down for that. About a week ahead of the presentation time, I'll confirm that we've got our class periods booked. If not, I'll pressure people into giving talks or assign people quasi-randomly.
Here are some bulleted guidelines (but they aren't an exhaustive checklist). You should periodically talk to me about your planned presentations, even informally.
- General guidelines:
- encourage discussion!
Mentally prepare a few challenging questions to stir up discussion. For example, at the beginning of the semester, I'll ask you "What's a word?" It can seem uncontroversial at first, but with some examples it becomes very controversial.
- communicate clearly!
If it's a hard concept, try using visuals and examples. It's better to cover a little less material but make sure everyone understands it.
- Google Scholar is your friend. But Wikipedia and research webpages/blogs can also have an incredible wealth of information in plain English. Textbooks will sometimes describe tools or research in simple English too.
- If you're presenting a tool:
- it's a tutorial!
At the end of your talk, the class should know how to install/setup/compile/cajole the tool, the input and output formats for any data, any prerequisites (some tools require POS tagging), strengths/weaknesses of the tool, etc. Basically they should know what they need to know in order to walk out of the class and start using it.
- consider watching some cooking/baking shows
You should be showing people how to get it done.
- slides should convey any steps/etc
Many/most of the presentations will be posted on this webpage as a resource. A good presentation will save some poor grad student weeks of effort in trying to figure things out.
- discuss unintended uses
If the tool was trained on Treebank/WSJ, consider the quality on different genres like spoken language, twitter, SMS, email, blogging, etc. (Don't pick them all, but pick something and give good/bad/ugly examples)
- If you're presenting a paper:
- the whole class will read the paper as preparation (and they'll submit some questions as homework)
Because the whole class is reading the paper, you need to be more of an expert to lead discussion. That means reading some of the citations and/or followup work. You can find followup work based on who cites the paper and/or the author's webpage.
- lead discussion
Papers can be controversial. They express moments of genius and moments of desperation. Did some part not make sense? It could've been an implementation decision rather than a research decision. Or try asking the class - someone will have an opinion. But be sure to keep the class on track.
- what is the linguistic or HCI intuition?
The goal of computational linguistics is not just to solve meaningful language problems with software, but also to learn something about linguistics based on what's effective.
- If you're presenting a corpus or language model:
- how do we process it?
What kind of crazy format is it in? How should we deal with tokenization?
- collect statistics
For a corpus, collect some interesting statistics (like Biber) and compare them to some baseline. Does it have a lot more proper nouns than usual? More symbols and crazy stuff? More pronouns? More interjections?
- read annotation manuals and resources
My favorite example is that Santa Barbara transcribes "woulda" as "would've", but Switchboard uses "woulda". That makes a world of difference to algorithms.
Individual assignments
Sometimes they'll be written homework assignments like "Give me three examples of X" or "Read paper X and come up with three questions you had". Other times it'll be programming assignments, like "Design and evaluate a sentence segmenter or apply and evaluate an existing one." Some of the programming assignments will qualify for HALL OF FAME.
Questions/examples will be on paper, but I had better be able to read it. And don't even think about folding over the corner of some ragged, lined paper, because you don't want to go over to the CIS office next door and use the stapler. The programming assignments might involve writing a short report. For most programming assignments, I'll ask for the report via email as a PDF.
Group projects
There will be two group projects for teams of about 3 people. The first will involve spelling correction, probably with Microsoft's spelling challenge. The second will probably involve summarization of product reviews. Of course this qualifies for HALL OF FAME.
Projects will generally have associated timelines. You'll go about the task much like a normal research problem: there will be a stage where you analyse data, a stage where you describe the subproblems you're focusing on and propose solutions, and a final stage where it's all implemented. Then at the end, your group will give a presentation.