CISC 889: Information Extraction

CISC 889: Information Extraction Projects

Many of the items below represent a topic rather than a specific project, so many projects, or variations of a single project, are possible for each topic. The general idea is that the project gives you hands-on experience and lets you investigate a topic in far more detail than would be possible from class discussions alone. There are various ways to do so. For example, you may choose a method (discussed in class or not), implement it, and investigate extensions. You could also explore its application to another domain; such a project can raise issues of annotation and data creation. Other projects could compare different methods, investigating their relative advantages and disadvantages, and/or propose a method that combines the strengths of existing methods. You are encouraged to come up with projects that explore new ideas, whether they be new methods, new applications, new domains, etc. Please note that the criteria for success in the project are not limited to an effective system (accurate, efficient, etc.). For the purposes of this course, I care more about how you went about the design, how you addressed the main issues, what you learnt, etc.

Building a named entity recognizer. Several possibilities exist. You can build your own named entity recognizer using some of the interesting ideas from the papers we have discussed in class, from other papers you have read or can read, and/or your own additional ideas. An alternative project within this topic is to build an NER system for a new domain, such as biology. There is a named-entity-annotated corpus available for this domain.
Text simplification and application to learning IE patterns. The motivation here is that many IE pattern learning methods don't work well because they don't capture the syntax and structure of language but rather try to generalize at the surface string level. So you could investigate whether applying some simple "text simplification" methods to the sentences makes standard pattern learning methods work better.
Social Networks (idea due to Keith Trnka). You could investigate techniques to discover a "social network" from multiple text documents. One can formulate a general idea of a social network as a graph where the nodes represent entities of interest (whether they are people or genes or organizations ...) and the edges represent relationships between the entities that they connect. These relations may be a set of predefined relationships that can be extracted, or some vaguer notion of relationship inferred from co-occurrence of the entities in text. On such a graph, different graph-theoretic notions can be explored to see what types of groups are formed. Alternatively, the graph can be used for knowledge discovery or mining.
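To make the co-occurrence formulation concrete, here is a minimal sketch in Python. It is only one possible starting point and rests on assumptions that are not part of the project description: the entity list is given in advance, "co-occurrence" means appearing in the same document, and "groups" are taken to be connected components. The function names `cooccurrence_graph` and `connected_groups` are made up for this illustration.

```python
import re
from collections import Counter
from itertools import combinations


def cooccurrence_graph(documents, entities, min_count=1):
    """Return {(entity_a, entity_b): count} for entity pairs sharing a document.

    Assumption: entities are matched as whole words, ignoring case. A real
    project would use a named entity recognizer instead of a fixed list.
    """
    edge_counts = Counter()
    for text in documents:
        present = sorted(
            e for e in entities
            if re.search(r"\b" + re.escape(e) + r"\b", text, re.IGNORECASE)
        )
        # Every unordered pair of entities found in this document gets one vote.
        for a, b in combinations(present, 2):
            edge_counts[(a, b)] += 1
    return {pair: n for pair, n in edge_counts.items() if n >= min_count}


def connected_groups(edges):
    """Group entities into connected components using a small union-find."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in edges:
        parent[find(a)] = find(b)

    groups = {}
    for node in list(parent):
        groups.setdefault(find(node), set()).add(node)
    return sorted(sorted(g) for g in groups.values())


if __name__ == "__main__":
    docs = [
        "Alice met Bob in Paris.",
        "Bob emailed Carol.",
        "Dave wrote to Erin.",
        "Alice and Bob spoke again.",
    ]
    graph = cooccurrence_graph(docs, ["Alice", "Bob", "Carol", "Dave", "Erin"])
    print(graph)
    print(connected_groups(graph))
```

Raising `min_count` keeps only pairs that co-occur repeatedly, which is one crude way to filter out accidental mentions. Replacing document-level co-occurrence with sentence-level co-occurrence, or with extracted predefined relations, changes only how the edges are produced.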
Topic Categorization: Classifying the topic of a document (or clustering similar documents based on their topic) has many applications, including in information retrieval. Consider, for example, a search engine that presents its search results organized by topic and user preferences. Here various techniques can be explored to determine the topic (from some predefined set of topics) or to cluster documents (when no predefined set of topics is assumed).
Query Expansion: The quality of retrieved documents can often be improved by including a few related terms in the query. Many techniques have been suggested for query expansion, and you can compare the different techniques or explore them in the context of different domains or genres of text.
Terminology and ontology building: One of the main hurdles in applying IE techniques or building question-answering systems in specialized domains is the unique terminology. Simple methods can be used to extract a considerable amount of information about the terminology or the ontological relationships between the terms.
IE from non-textual domains: While the course focuses on IE from text documents, you can explore extraction of specific kinds of information from images, simple graphs/charts, etc.
Question answering systems: This topic will be explored towards the end of the semester. You can get a head start by looking at question-answering systems and exploring issues in question answering.
Extracting important concepts from a document. You can explore methods to figure out which words or concepts are important and/or tell us what the document is about. An application would be to see whether the presence of such a word in a query makes the document more relevant to that query, so that the document should be assigned a higher relevance score in the search.
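As one possible baseline for the last topic, the sketch below scores the words of a document by TF-IDF: a word counts as important if it is frequent in this document but rare in the rest of the collection. TF-IDF is my choice of illustration here, not a required method, and the helper names `tokenize` and `top_terms` are hypothetical.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and keep runs of letters (a deliberately crude tokenizer)."""
    return re.findall(r"[a-z]+", text.lower())


def top_terms(documents, doc_index, k=3):
    """Return the k highest-scoring (term, tf-idf) pairs for one document.

    tf  = term count in the document / document length
    idf = log(number of documents / number of documents containing the term)
    """
    tokenized = [tokenize(d) for d in documents]
    n_docs = len(tokenized)

    # Document frequency: in how many documents does each term appear?
    doc_freq = Counter()
    for tokens in tokenized:
        doc_freq.update(set(tokens))

    counts = Counter(tokenized[doc_index])
    total = sum(counts.values())
    scores = {
        term: (count / total) * math.log(n_docs / doc_freq[term])
        for term, count in counts.items()
    }
    # Sort by descending score, breaking ties alphabetically for stable output.
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))[:k]


if __name__ == "__main__":
    docs = [
        "the protein binds the receptor",
        "the gene encodes the protein kinase",
        "the cat sat on the mat",
    ]
    for term, score in top_terms(docs, 0, k=2):
        print(f"{term}\t{score:.3f}")
```

A word such as "the", which appears in every document, gets a score of zero, while words unique to the document rise to the top. For the retrieval application, one could then boost a document's relevance score when a query term is among its top-scoring terms; how large that boost should be is exactly the kind of question a project could investigate.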