public class GeniaPosParser extends Object implements Iterator<GeniaParse>
This class parses the file GENIAcorpus3.02.pos.xml, which provides sentence, word, and part-of-speech data. The parser preserves the whitespace found in the XML file so that the text added to the CAS does not come out as:
"... of anti- Ro(SSA) antibodies . A pair of restriction "
but instead comes out as:
"... of anti-Ro(SSA) antibodies. A pair of restriction "
The GENIA corpus provides no whitespace between sentences, so this parser simply inserts two spaces between consecutive sentences. It also inserts two newlines between the title and the body of the abstract.
The parses returned by this parser will not have any named entities, i.e. there will be no values returned from GeniaParse.getSemTags().
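The spacing rule described above can be illustrated with a minimal, self-contained sketch. It is not the parser's actual implementation: the class `AbstractTextSketch` and the methods `joinSentences` and `buildText` are hypothetical names introduced only for illustration, and the sketch assumes the title is a single string and the body is already split into sentences.

```java
import java.util.Arrays;
import java.util.List;

public class AbstractTextSketch {

    // Hypothetical helper: joins sentences with two spaces,
    // mirroring the separator described for this parser.
    static String joinSentences(List<String> sentences) {
        return String.join("  ", sentences);
    }

    // Hypothetical helper: places two newlines between the title
    // and the body of the abstract.
    static String buildText(String title, List<String> bodySentences) {
        return title + "\n\n" + joinSentences(bodySentences);
    }

    public static void main(String[] args) {
        String text = buildText(
                "A title.",
                Arrays.asList("First sentence.", "Second sentence."));
        System.out.println(text);
    }
}
```

Because the separators are fixed strings, character offsets for each sentence can be computed while the text is assembled, which is presumably why a deterministic separator is useful when the text is added to the CAS.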
About 4000 word (w) tags have a part-of-speech assignment "*" which I refer to as the wildcard part-of-speech tag. An example is:
<w c="*">Ras</w><w c="NN">/protein</w>
The above tags are parsed as a single token Ras/protein with the tag "NN".
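The wildcard handling described above can be sketched as follows. This is an illustration under stated assumptions, not the parser's code: the class `WildcardMergeSketch`, the nested `Word` type, and `mergeWildcards` are hypothetical names. The sketch assumes that each wildcard-tagged word is attached to the next word that carries a real tag, and that a trailing wildcard with nothing to attach to is kept unchanged; the source does not say how the parser treats that last case.

```java
import java.util.ArrayList;
import java.util.List;

public class WildcardMergeSketch {

    static final String WILDCARD = "*";

    // Hypothetical holder for the text and "c" attribute of one <w> element.
    static final class Word {
        final String text;
        final String pos;

        Word(String text, String pos) {
            this.text = text;
            this.pos = pos;
        }
    }

    // Merges each run of wildcard-tagged words into the following word,
    // which supplies the part-of-speech tag for the merged token.
    static List<Word> mergeWildcards(List<Word> words) {
        List<Word> tokens = new ArrayList<>();
        StringBuilder pending = new StringBuilder();
        for (Word w : words) {
            if (WILDCARD.equals(w.pos)) {
                pending.append(w.text);
            } else {
                tokens.add(new Word(pending + w.text, w.pos));
                pending.setLength(0);
            }
        }
        // Assumption: a trailing wildcard keeps its wildcard tag.
        if (pending.length() > 0) {
            tokens.add(new Word(pending.toString(), WILDCARD));
        }
        return tokens;
    }

    public static void main(String[] args) {
        List<Word> words = new ArrayList<>();
        words.add(new Word("Ras", "*"));
        words.add(new Word("/protein", "NN"));
        for (Word t : mergeWildcards(words)) {
            System.out.println(t.text + " " + t.pos); // prints "Ras/protein NN"
        }
    }
}
```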
Constructor and Description |
---|
GeniaPosParser() |
GeniaPosParser(File xmlFile) |
Modifier and Type | Method and Description |
---|---|
boolean | hasNext() |
static void | main(String[] args) |
GeniaParse | next() |
GeniaParse | parse(Element articleElement) |
void | remove() |
public GeniaPosParser()
public GeniaPosParser(File xmlFile) throws IOException, JDOMException
Throws: IOException, JDOMException
public boolean hasNext()
Specified by: hasNext in interface Iterator<GeniaParse>
public GeniaParse next()
Specified by: next in interface Iterator<GeniaParse>
public GeniaParse parse(Element articleElement)
public void remove()
Specified by: remove in interface Iterator<GeniaParse>
Copyright © 2014. All rights reserved.