How to read Penn Treebank notation using JLex and Java Cup

Introduction

Penn Treebank notation represents a parse tree as simple labelled brackets in a text file. The representation is popular because it is light on resources and the tree structure is relatively easy to read without software tools. Many treebanks adopt the Penn Treebank bracketing standard, but they are not always consistent with it. For example, a leaf node might hold a phrase instead of a single word, special characters might carry domain-specific meanings, and the brackets might not match each other. A reader for Penn Treebank annotation therefore needs to be robust to such deviations.
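As an illustration of the unmatched-bracket problem, the following is a minimal Java sketch of a pre-check that can be run on a file before parsing. It is not part of the source package; the class name `BracketCheck` and the method `firstUnmatched` are hypothetical. It assumes that every `(` and `)` in the text is structural, which holds when literal brackets in the sentence are escaped as -LRB- and -RRB-, as in the standard Penn Treebank.

```java
import java.util.ArrayDeque;

class BracketCheck {

    /**
     * Returns the index of the first stray ")" if there is one; otherwise the
     * index of the earliest "(" that is never closed; otherwise -1.
     */
    static int firstUnmatched(String text) {
        ArrayDeque<Integer> open = new ArrayDeque<>();   // positions of "(" still open
        for (int i = 0; i < text.length(); i++) {
            char ch = text.charAt(i);
            if (ch == '(') {
                open.push(i);
            } else if (ch == ')') {
                if (open.isEmpty()) {
                    return i;                            // ")" with nothing to close
                }
                open.pop();
            }
        }
        // push() adds at the head, so the tail holds the outermost unclosed "("
        return open.isEmpty() ? -1 : open.peekLast();
    }

    public static void main(String[] args) {
        System.out.println(firstUnmatched("(S (NP (DT the) (NN dog)) (VP (VBZ barks)))"));
        System.out.println(firstUnmatched("(S (NP (DT the) (NN dog))"));
    }
}
```

A check like this only reports where the bracketing breaks; deciding how to recover (skip the tree, or repair it) is left to the parser.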

A convenient way to build such a reader is to use JLex and Java Cup. Here I demonstrate a piece of code that takes a treebank file as input and returns a list of trees. Because it is based on compiler techniques, a parser built this way can detect and handle errors in the input. The source package below contains both the lex and cup files used to generate the lexer and parser. It also provides classes to construct the tree structure.

The readme file and copyright notice are still under construction, but the makefile is easy to understand. Error detection will be added soon. Please feel free to make modifications and to report any problems you encounter.

Grammar

tree-list ::= tree | tree-list tree

tree ::= ( node )

node ::= leaf | token node-list

leaf ::= token token | leaf token

node-list ::= ( node ) | node-list ( node )
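The grammar above can also be sketched as a hand-written recursive-descent parser in plain Java. This is not the lexer and parser that JLex and Java Cup generate from the source package; it is a swapped-in technique used only to make the rules concrete. The names `TreebankSketch`, `Node`, `tokenize` and `parse` are hypothetical, and the sketch assumes that every child in a node-list is itself bracketed. Malformed input raises an `IllegalArgumentException` instead of being recovered from.

```java
import java.util.ArrayList;
import java.util.List;

class TreebankSketch {

    /** A tree node: a label plus either words (leaf) or child nodes. */
    static final class Node {
        final String label;
        final List<String> words = new ArrayList<>();
        final List<Node> children = new ArrayList<>();

        Node(String label) { this.label = label; }

        @Override public String toString() {
            StringBuilder sb = new StringBuilder("(").append(label);
            for (String w : words) sb.append(' ').append(w);
            for (Node c : children) sb.append(' ').append(c);
            return sb.append(')').toString();
        }
    }

    private final List<String> tokens;
    private int pos = 0;

    private TreebankSketch(List<String> tokens) { this.tokens = tokens; }

    /** Splits the input into "(", ")" and whitespace-delimited tokens. */
    static List<String> tokenize(String text) {
        List<String> out = new ArrayList<>();
        StringBuilder cur = new StringBuilder();
        for (char ch : text.toCharArray()) {
            if (ch == '(' || ch == ')' || Character.isWhitespace(ch)) {
                if (cur.length() > 0) { out.add(cur.toString()); cur.setLength(0); }
                if (!Character.isWhitespace(ch)) out.add(String.valueOf(ch));
            } else {
                cur.append(ch);
            }
        }
        if (cur.length() > 0) out.add(cur.toString());
        return out;
    }

    /** tree-list ::= tree | tree-list tree */
    static List<Node> parse(String text) {
        TreebankSketch p = new TreebankSketch(tokenize(text));
        List<Node> trees = new ArrayList<>();
        do {
            trees.add(p.tree());
        } while (p.pos < p.tokens.size());
        return trees;
    }

    /** tree ::= ( node ) */
    private Node tree() {
        expect("(");
        Node n = node();
        expect(")");
        return n;
    }

    /** node ::= leaf | token node-list */
    private Node node() {
        Node n = new Node(token());                               // the label
        if ("(".equals(peek())) {
            while ("(".equals(peek())) n.children.add(tree());    // node-list
        } else {
            do { n.words.add(token()); } while (isWord(peek()));  // leaf: one or more words
        }
        return n;
    }

    private String peek() { return pos < tokens.size() ? tokens.get(pos) : null; }

    private static boolean isWord(String t) {
        return t != null && !t.equals("(") && !t.equals(")");
    }

    private String token() {
        String t = peek();
        if (!isWord(t)) throw new IllegalArgumentException("expected a token at position " + pos + ", found " + t);
        pos++;
        return t;
    }

    private void expect(String s) {
        if (!s.equals(peek())) throw new IllegalArgumentException("expected " + s + " at position " + pos + ", found " + peek());
        pos++;
    }

    public static void main(String[] args) {
        for (Node t : parse("(S (NP (DT the) (NN dog)) (VP (VBZ barks))) (NP New York)")) {
            System.out.println(t);
        }
    }
}
```

Because a leaf accepts one or more words after its label, a phrase such as `(NP New York)` is read as a single leaf, which matches the leaf rule in the grammar.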

Download

Source package

updated 10/16/2010

Copyright (c) 2010 Yifan Peng