CISC 889: Information Extraction Projects
Many of the items below represent a topic rather than a specific
project. Thus, many projects or variations of a single project are possible
for each topic.
The general idea is that the project will give you
hands-on experience and let you learn about or investigate a topic in
far more detail than would be possible from class
discussions alone. There are various possible ways to do so.
For example, you may choose to explore a method (discussed or not
discussed in class) and implement and investigate extensions. You could also
try to explore its application to another domain. Such a project can
raise issues concerning annotation and the creation of data.
Other possible projects could compare different methods, investigating
their relative advantages and disadvantages, and/or propose a method
that combines the strengths of different existing methods.
You are encouraged to come up with projects that explore new ideas,
whether they be new methods, new applications, new domains, etc.
Please note that the criteria for success in the project are not
limited to an effective system (accurate, efficient, etc.). For the
purposes of this course, I care more
about how you went about the design, how you addressed the main issues,
what you learnt, etc.
- Building a named entity recognizer.
Several possibilities exist. You can build your own named entity
recognizer using some of the interesting ideas from the papers
we have discussed in class, from other papers you have read or can read,
and/or from your own ideas.
An alternative project within this topic is to build a named entity
recognizer for a new domain, such as biology. A named-entity-annotated
corpus is available for this domain.
- Text simplification and application to learning IE patterns.
The motivation here is that many IE pattern learning methods don't work well
because they don't capture the syntax and structure of language but rather
try to generalize at the surface string level. So you could
investigate whether applying some simple "text simplification" methods
to simplify the sentences can make standard pattern learning methods
work better.
- Social Networks (idea due to Keith Trnka). You could investigate
techniques to discover a "social network" from multiple text documents.
One can formulate a general idea
of a social network as a graph where the nodes represent entities of interest
(whether they are people or genes or organizations ...) and the edges
represent relationships between the entities that they connect.
These relations may be a set of predefined relationships that can be
extracted or could be some vague notion of relationship that may be
inferred from co-occurrence of the entities in text. From such a graph,
different kinds of graph-theoretic notions can be explored to see what
types of groups are formed. Alternatively, the graph can be used for
knowledge discovery or mining.
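The co-occurrence variant of this idea can be sketched in a few lines of Python. This is only a minimal illustration, not a prescribed method: it assumes the entities have already been recognized and that each document is given as a list of entity names. The function names `build_cooccurrence_graph` and `neighbors` and the toy data are hypothetical.

```python
from collections import Counter
from itertools import combinations


def build_cooccurrence_graph(docs):
    """Map each (entity_a, entity_b) edge to the number of documents
    in which both entities appear."""
    edges = Counter()
    for entities in docs:
        # Count each unordered pair at most once per document.
        for a, b in combinations(sorted(set(entities)), 2):
            edges[(a, b)] += 1
    return edges


def neighbors(edges, entity, min_weight=1):
    """Entities linked to `entity` by an edge of weight >= min_weight."""
    linked = set()
    for (a, b), weight in edges.items():
        if weight >= min_weight and entity in (a, b):
            linked.add(b if a == entity else a)
    return linked


# Hypothetical toy input: one list of entity mentions per document.
docs = [
    ["Alice", "Bob", "Acme"],
    ["Alice", "Bob"],
    ["Bob", "Carol"],
]
graph = build_cooccurrence_graph(docs)
```

Raising `min_weight` keeps only pairs that co-occur repeatedly, which is one simple way to separate incidental mentions from a plausible relationship. A project could replace the raw counts with extracted, typed relations.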
- Topic Categorization: Classifying the topic of a document
(or clustering similar documents based on their topic) has many applications
including information retrieval. Consider, for example, a search engine
that presents its results organized by topic and
user preferences. Here, various techniques can be explored to see how
to determine the topic (from some pre-defined set of topics) or
cluster documents (when no predefined set of topics is assumed).
- Query Expansion: The quality of retrieved documents can often be
improved by including a few related terms in the query. Many techniques have
been suggested for query expansion, and you can compare the different
techniques or explore them in the context of different domains or genres of text.
- Terminology and ontology building: One of the main hurdles in
applying IE techniques or building question-answering systems in specialized
domains is the unique terminology. Simple methods can be used to
extract a considerable amount of information about the
terminology or the ontological relationships between the terms.
- IE from non-textual domains: While the course focus is on IE from
text documents, you can explore extraction of specific kinds of
information from images, simple graphs/charts etc.
- Question answering systems: This topic will be explored towards the
end of the semester. You can get a head-start by looking at
question-answering systems and exploring issues in question-answering.
- Extracting important concepts from a document. You can explore methods
to figure out which words or concepts are important and/or tell us what
the document is about. One application would be to test whether the presence of
such a word in a query makes the document more relevant to that query, so that
the document should be assigned a higher relevance score in the search.
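One common baseline for scoring how important a word is to a document is TF-IDF (term frequency times inverse document frequency). The sketch below is an assumption about where such a project might start, not a method the course prescribes. The tokenizer is deliberately crude, and the function names and sample documents are hypothetical.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase, split on whitespace, and strip surrounding punctuation."""
    tokens = (tok.strip(".,;:!?()\"'") for tok in text.lower().split())
    return [tok for tok in tokens if tok]


def tfidf_scores(docs):
    """Return one {term: score} dict per document."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents containing each term.
    df = Counter(term for toks in tokenized for term in set(toks))
    scores = []
    for toks in tokenized:
        tf = Counter(toks)
        scores.append({
            term: (count / len(toks)) * math.log(n_docs / df[term])
            for term, count in tf.items()
        })
    return scores


def top_terms(score, k=3):
    """The k highest-scoring terms, ties broken alphabetically."""
    ranked = sorted(score.items(), key=lambda kv: (-kv[1], kv[0]))
    return [term for term, _ in ranked[:k]]


# Hypothetical toy collection.
docs = [
    "the gene regulates the protein",
    "the protein binds the receptor",
    "the court heard the appeal",
]
scores = tfidf_scores(docs)
```

A word that occurs in every document, such as "the" here, scores zero, while words concentrated in one document score highest. To connect this to search, a query term's score in a document could be added to that document's relevance score.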