public class PennTreebankReader extends org.apache.uima.fit.component.JCasCollectionReader_ImplBase
PennTreebankReader reads the Penn Treebank (PTB) data distributed by the LDC. It only loads the raw treebank text into a view called "TreebankView". To parse that text and post the results to the CAS, use the TreebankGoldAnnotator, which does the real work of parsing the treebank format. In general, treebank data can be read by a PlainTextCollectionReader or some other simple collection reader. This class exists because the Penn Treebank has a specific directory structure whose sections are often used in specific ways to conduct experiments, e.g. sections 02-20 for training and sections 21-24 for testing. This collection reader makes it easy to read in specific sections for later processing. Only files ending with ".mrg" are read.
WSJ stands for Wall Street Journal, the source of the articles treebanked in the PTB.
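To make the section-based selection concrete, here is a minimal sketch that uses only the Java standard library. It is not ClearTK's implementation: the class `MrgSectionCollector` and its `collect` method are hypothetical, and the sketch assumes that each section lives in a two-digit subdirectory (such as `02`) directly under the WSJ directory.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.stream.Stream;

/** Hypothetical helper for illustration only; not part of ClearTK. */
final class MrgSectionCollector {

  /**
   * Collects the ".mrg" files for sections first..last (inclusive).
   * Assumption: each section is a two-digit subdirectory of wsjDirectory.
   */
  static List<Path> collect(Path wsjDirectory, int first, int last) throws IOException {
    List<Path> result = new ArrayList<>();
    for (int section = first; section <= last; section++) {
      Path sectionDir = wsjDirectory.resolve(String.format(Locale.ROOT, "%02d", section));
      if (!Files.isDirectory(sectionDir)) {
        continue; // skip sections that are absent from this copy of the corpus
      }
      // Files.list returns a stream that must be closed, hence try-with-resources.
      try (Stream<Path> entries = Files.list(sectionDir)) {
        entries
            .filter(Files::isRegularFile)
            .filter(p -> p.getFileName().toString().endsWith(".mrg"))
            .sorted()
            .forEachOrdered(result::add);
      }
    }
    return result;
  }
}
```

Under these assumptions, `MrgSectionCollector.collect(wsjDir, 2, 20)` would gather a training set and `collect(wsjDir, 21, 24)` a test set, mirroring the split mentioned above.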
Modifier and Type | Field and Description |
---|---|
protected File | directory |
protected LinkedList<File> | files |
protected int | numberOfFiles |
static String | PARAM_CORPUS_DIRECTORY_NAME |
static String | PARAM_SECTIONS_SPECIFIER |
protected ListSpecification | sections |
static String | TREEBANK_VIEW: The view containing the parenthesized text of a TreeBank .mrg file. |
PARAM_AGGREGATE_SOFA_MAPPINGS, PARAM_CONFIG_MANAGER, PARAM_CONFIG_PARAM_SETTINGS, PARAM_EXTERNAL_OVERRIDE_SETTINGS, PARAM_PERFORMANCE_TUNING_SETTINGS, PARAM_RESOURCE_MANAGER, PARAM_UIMA_CONTEXT
Constructor and Description |
---|
PennTreebankReader() |
Modifier and Type | Method and Description |
---|---|
void | close() |
static void | collectSections(File wsjDirectory, List<File> treebankFiles, ListSpecification wsjSections): This will add all the .mrg files in the given WSJ sections to treebankFiles. |
void | getNext(JCas jCas): Reads the next file and stores its text in cas as the "TreebankView" SOFA. |
Progress[] | getProgress() |
boolean | hasNext() |
void | initialize(UimaContext context) |
void | setCorpusDirectoryName(String corpusDirectoryName) |
void | setSectionsSpecifier(String sectionsString) |
getLogger, getNext, initialize
destroy, getCasInitializer, getProcessingResourceMetaData, initialize, isConsuming, reconfigure, setCasInitializer, typeSystemInit
getConfigParameterValue, getConfigParameterValue, setConfigParameterValue, setConfigParameterValue
getCasManager, getMetaData, getResourceManager, getUimaContext, getUimaContextAdmin, setLogger, setMetaData
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
getConfigParameterValue, getConfigParameterValue, setConfigParameterValue, setConfigParameterValue
getMetaData, getResourceManager, getUimaContext, getUimaContextAdmin, setLogger
protected LinkedList<File> files
protected int numberOfFiles
public static final String PARAM_CORPUS_DIRECTORY_NAME
public static final String PARAM_SECTIONS_SPECIFIER
protected ListSpecification sections
public static final String TREEBANK_VIEW
public PennTreebankReader()
public void close() throws IOException
close in interface BaseCollectionReader
close in class org.apache.uima.fit.component.JCasCollectionReader_ImplBase
Throws: IOException
@Beta public static void collectSections(File wsjDirectory, List<File> treebankFiles, ListSpecification wsjSections)
wsjDirectory - The top level of the WSJ part of Treebank. Underneath here are the section subdirectories.
treebankFiles - The List to which the treebank files should be added.
wsjSections - The set of sections to include.
public void getNext(JCas jCas) throws IOException, CollectionException
getNext in class org.apache.uima.fit.component.JCasCollectionReader_ImplBase
Throws: IOException, CollectionException
public Progress[] getProgress()
public boolean hasNext() throws IOException, CollectionException
Throws: IOException, CollectionException
public void initialize(UimaContext context) throws ResourceInitializationException
initialize in class org.apache.uima.fit.component.JCasCollectionReader_ImplBase
Throws: ResourceInitializationException
@Beta public void setCorpusDirectoryName(String corpusDirectoryName)
@Beta public void setSectionsSpecifier(String sectionsString)
Copyright © 2014. All rights reserved.