Abstract

We present an automatic approach to tree annotation in which basic nonterminal symbols are alternately split and merged to maximize the likelihood of a training treebank. Starting with a simple X-bar grammar, we learn a new grammar whose nonterminals are subsymbols of the original nonterminals. In contrast with previous work, we are able to split various terminals to different degrees, as appropriate to the actual complexity in the data. Our grammars automatically learn the kinds of linguistic distinctions exhibited in previous work on manual tree annotation. On the other hand, our grammars are much more compact and substantially more accurate than previous work on automatic annotation. Despite its simplicity, our best grammar achieves an F1 of 90.2% on the Penn Treebank, higher than fully lexicalized systems.
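The split-merge cycle the abstract describes can be illustrated with a toy sketch: each nonterminal is split into two subsymbols, and splits whose likelihood gain on the training treebank is too small are merged back. The per-symbol gain values below are made-up placeholders standing in for the real EM-estimated likelihoods, and all function names are hypothetical.

```python
def split(symbol):
    """Split one nonterminal into two subsymbols, e.g. NP -> NP-0, NP-1."""
    return [f"{symbol}-0", f"{symbol}-1"]

def split_merge_round(nonterminals, likelihood_gain, threshold=0.01):
    """One split-merge round: split every symbol, then keep a split only
    if its (estimated) likelihood gain exceeds the threshold; otherwise
    merge the subsymbols back into the parent symbol."""
    refined = []
    for sym in nonterminals:
        if likelihood_gain(sym) > threshold:
            refined.extend(split(sym))   # keep the split
        else:
            refined.append(sym)          # merge back to the parent
    return refined

# Hypothetical gains: NP and VP benefit from splitting, PRT does not,
# mirroring the paper's point that symbols are split to different degrees.
gains = {"NP": 0.2, "VP": 0.15, "PRT": 0.001}
print(split_merge_round(["NP", "VP", "PRT"], lambda s: gains[s]))
# -> ['NP-0', 'NP-1', 'VP-0', 'VP-1', 'PRT']
```

Iterating this round on the surviving subsymbols is what lets different nonterminals end up refined to different degrees.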

Keywords

Treebank, Terminal and nonterminal symbols, Computer science, Annotation, Natural language processing, Artificial intelligence, Tree (set theory), Rule-based machine translation, Grammar, Grammar induction, Mathematics, Linguistics

Publication Info

Year: 2006
Type: article
Pages: 433-440
Citations: 808
Access: Closed

Citation Metrics

808 (OpenAlex)

Cite This

Slav Petrov, Leon Barrett, Romain Thibaux et al. (2006). Learning accurate, compact, and interpretable tree annotation. 433-440. https://doi.org/10.3115/1220175.1220230

Identifiers

DOI
10.3115/1220175.1220230