Using the StanfordCoreNLP class

The StanfordCoreNLP class supports sentence detection through the ssplit annotator. The following example uses the tokenize and ssplit annotators. A pipeline object is created, the paragraph is wrapped in an Annotation object, and the pipeline's annotate method is applied to that annotation:

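import java.util.Properties;
import edu.stanford.nlp.pipeline.Annotation;
import edu.stanford.nlp.pipeline.StanfordCoreNLP;

// ssplit depends on tokenize, so both annotators are listed
Properties properties = new Properties();
properties.setProperty("annotators", "tokenize, ssplit");
StanfordCoreNLP pipeline = new StanfordCoreNLP(properties);
Annotation annotation = new Annotation(paragraph);
pipeline.annotate(annotation);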

Printing the annotation, for example with the pipeline's prettyPrint method, produces a lot of information. Only the output for the first sentence is shown here:

    Sentence #1 (13 tokens):
    When determining the end of sentences we need to consider several factors.
    [Text=When CharacterOffsetBegin=0 CharacterOffsetEnd=4] [Text=determining CharacterOffsetBegin=5 CharacterOffsetEnd=16] [Text=the CharacterOffsetBegin=17 CharacterOffsetEnd=20] [Text=end CharacterOffsetBegin=21 CharacterOffsetEnd=24] [Text=of CharacterOffsetBegin=25 CharacterOffsetEnd=27] [Text=sentences CharacterOffsetBegin=28 CharacterOffsetEnd=37] [Text=we CharacterOffsetBegin=38 CharacterOffsetEnd=40] [Text=need CharacterOffsetBegin=41 CharacterOffsetEnd=45] [Text=to CharacterOffsetBegin=46 CharacterOffsetEnd=48] [Text=consider CharacterOffsetBegin=49 CharacterOffsetEnd=57] [Text=several CharacterOffsetBegin=58 CharacterOffsetEnd=65] [Text=factors CharacterOffsetBegin=66 CharacterOffsetEnd=73] [Text=. CharacterOffsetBegin=73 CharacterOffsetEnd=74] 
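
The offsets in this output index into the original text: CharacterOffsetBegin is the position of the token's first character and CharacterOffsetEnd is one past its last, which matches the convention used by String.substring. The following is a minimal sketch of that relationship using only the JDK. The OffsetDemo class and its tokenAt method are hypothetical names introduced for illustration and are not part of the Stanford API:

```java
class OffsetDemo {
    // Returns the text covered by the span [begin, end), the same
    // begin-inclusive, end-exclusive convention seen in the output above.
    static String tokenAt(String text, int begin, int end) {
        return text.substring(begin, end);
    }

    public static void main(String[] args) {
        String sentence =
            "When determining the end of sentences we need to consider several factors.";
        // Offset pairs copied from the listing above
        int[][] spans = { {0, 4}, {5, 16}, {66, 73}, {73, 74} };
        for (int[] span : spans) {
            System.out.println(span[0] + "-" + span[1] + ": "
                + tokenAt(sentence, span[0], span[1]));
        }
    }
}
```

Because the end offset is exclusive, the final period occupies the single position 73, immediately after the token factors, which ends at 73.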
  

Alternatively, we can use the xmlPrint method, which produces the output in XML format. This often makes it easier to extract the information of interest. The method can throw an IOException, which must be handled:

try { 
    pipeline.xmlPrint(annotation, System.out); 
} catch (IOException ex) { 
    // Handle exception 
}

A partial listing of the output is as follows:

<?xml version="1.0" encoding="UTF-8"?> 
<?xml-stylesheet href="CoreNLP-to-HTML.xsl" type="text/xsl"?> 
<root> 
  <document> 
    <sentences> 
      <sentence id="1"> 
        <tokens> 
          <token id="1"> 
            <word>When</word> 
            <CharacterOffsetBegin>0</CharacterOffsetBegin> 
            <CharacterOffsetEnd>4</CharacterOffsetEnd> 
          </token> 
... 
          <token id="34"> 
            <word>...</word> 
            <CharacterOffsetBegin>316</CharacterOffsetBegin> 
            <CharacterOffsetEnd>317</CharacterOffsetEnd> 
          </token> 
        </tokens> 
      </sentence> 
    </sentences> 
  </document> 
</root> 
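
To illustrate why the XML form can be easier to work with, the following sketch reads token elements with the JDK's built-in DOM parser. It assumes the XML has been captured as a string. The XmlTokenReader class, its readTokens method, and the SAMPLE constant are hypothetical names, and SAMPLE is a hand-trimmed two-token fragment modeled on the listing above, not actual pipeline output:

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.SAXException;

class XmlTokenReader {
    // Hand-trimmed sample that mirrors the element names in the listing above
    static final String SAMPLE =
        "<?xml version=\"1.0\" encoding=\"UTF-8\"?>"
        + "<root><document><sentences><sentence id=\"1\"><tokens>"
        + "<token id=\"1\"><word>When</word>"
        + "<CharacterOffsetBegin>0</CharacterOffsetBegin>"
        + "<CharacterOffsetEnd>4</CharacterOffsetEnd></token>"
        + "<token id=\"2\"><word>determining</word>"
        + "<CharacterOffsetBegin>5</CharacterOffsetBegin>"
        + "<CharacterOffsetEnd>16</CharacterOffsetEnd></token>"
        + "</tokens></sentence></sentences></document></root>";

    // Returns one "word [begin,end)" string per <token> element
    static List<String> readTokens(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            List<String> result = new ArrayList<>();
            NodeList tokens = doc.getElementsByTagName("token");
            for (int i = 0; i < tokens.getLength(); i++) {
                Element token = (Element) tokens.item(i);
                result.add(childText(token, "word") + " ["
                    + childText(token, "CharacterOffsetBegin") + ","
                    + childText(token, "CharacterOffsetEnd") + ")");
            }
            return result;
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalArgumentException("Could not parse XML", e);
        }
    }

    // Text of the first descendant element with the given tag name
    private static String childText(Element parent, String tag) {
        return parent.getElementsByTagName(tag).item(0).getTextContent().trim();
    }

    public static void main(String[] args) {
        for (String line : readTokens(SAMPLE)) {
            System.out.println(line);
        }
    }
}
```

The same approach could be applied to the full output by writing it to a ByteArrayOutputStream instead of System.out and passing the resulting text to readTokens.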