Using the MedlineSentenceModel class

The LingPipe MEDLINE sentence model is built on MEDLINE, a large collection of biomedical literature. The collection is stored in XML format and is maintained by the United States National Library of Medicine (http://www.nlm.nih.gov/).

LingPipe uses its MedlineSentenceModel class to perform SBD on this kind of text. The model has been trained against the MEDLINE data. It takes plain text, which is first split into tokens and the whitespace between them. The MEDLINE model is then applied to that token sequence to find the sentences in the text.
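To make the "tokens and whitespace" step more concrete, the following is a minimal sketch of how text might be split into two parallel lists before a sentence model looks at it. It is not LingPipe's tokenizer: the TokenWhitespaceSketch class, its Split holder, and the rule that a token is either a run of letters and digits or a single other character are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; this class is not part of LingPipe.
final class TokenWhitespaceSketch {

    // Parallel lists: whitespaces.get(i) precedes tokens.get(i), and the
    // final whitespace entry follows the last token (assumed layout).
    static final class Split {
        final List<String> tokens = new ArrayList<>();
        final List<String> whitespaces = new ArrayList<>();
    }

    static Split split(String text) {
        Split result = new Split();
        int i = 0;
        int n = text.length();
        while (true) {
            // Collect the (possibly empty) whitespace run before the next token.
            int start = i;
            while (i < n && Character.isWhitespace(text.charAt(i))) {
                i++;
            }
            result.whitespaces.add(text.substring(start, i));
            if (i >= n) {
                break;
            }
            // A token is a run of letters/digits, or one punctuation/symbol character.
            start = i;
            if (Character.isLetterOrDigit(text.charAt(i))) {
                while (i < n && Character.isLetterOrDigit(text.charAt(i))) {
                    i++;
                }
            } else {
                i++;
            }
            result.tokens.add(text.substring(start, i));
        }
        return result;
    }

    // Interleaves whitespace and tokens again; reproduces the input exactly.
    static String rejoin(Split split) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < split.tokens.size(); i++) {
            sb.append(split.whitespaces.get(i)).append(split.tokens.get(i));
        }
        sb.append(split.whitespaces.get(split.tokens.size()));
        return sb.toString();
    }

    public static void main(String[] args) {
        Split s = split("were used only until passage 30. They were");
        System.out.println(s.tokens);
        System.out.println(s.whitespaces.size() + " whitespace entries");
    }
}
```

Keeping the whitespace alongside the tokens means that nothing is lost: the original text, and therefore the character offsets of any sentence found, can always be recovered. It also leaves the period as a token of its own, which is the kind of position a sentence model has to classify as a boundary or not.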

In the following example, we will use a paragraph from http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3139422/ to demonstrate the use of the model. The paragraph is declared here:

String paragraph = "HepG2 cells were obtained from the American Type Culture "
    + "Collection (Rockville, MD, USA) and were used only until "
    + "passage 30. They were routinely grown at 37°C in Dulbecco's "
    + "modified Eagle's medium (DMEM) containing 10 % fetal bovine "
    + "serum (FBS), 2 mM glutamine, 1 mM sodium pyruvate, and 25 "
    + "mM glucose (Invitrogen, Carlsbad, CA, USA) in a humidified "
    + "atmosphere containing 5% CO2. For precursor and 13C-sugar "
    + "experiments, tissue culture treated polystyrene 35 mm "
    + "dishes (Corning Inc, Lowell, MA, USA) were seeded with 2 "
    + "× 106 cells and grown to confluency in DMEM.";

The code that follows is based on the SentenceChunker class, as demonstrated in the previous section. The difference is in the use of the MedlineSentenceModel class:

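// Imports needed at the top of the source file
import com.aliasi.chunk.Chunk;
import com.aliasi.chunk.Chunking;
import com.aliasi.sentences.MedlineSentenceModel;
import com.aliasi.sentences.SentenceChunker;
import com.aliasi.tokenizer.IndoEuropeanTokenizerFactory;
import com.aliasi.tokenizer.TokenizerFactory;
import java.util.Set;

// Tokenizer that splits the text into tokens and whitespace
TokenizerFactory tokenizerFactory =
    IndoEuropeanTokenizerFactory.INSTANCE;
// Sentence model trained against MEDLINE data
MedlineSentenceModel sentenceModel = new MedlineSentenceModel();
SentenceChunker sentenceChunker =
    new SentenceChunker(tokenizerFactory, sentenceModel);
// Each chunk in the resulting chunking is one sentence
Chunking chunking = sentenceChunker.chunk(
    paragraph.toCharArray(), 0, paragraph.length());
Set<Chunk> sentences = chunking.chunkSet();
String slice = chunking.charSequence().toString();
for (Chunk sentence : sentences) {
    // start() and end() are character offsets into the chunked text
    System.out.println("["
        + slice.substring(sentence.start(), sentence.end())
        + "]");
}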

The output is as follows:

    [HepG2 cells were obtained from the American Type Culture Collection (Rockville, MD, USA) and were used only until passage 30.]
    [They were routinely grown at 37°C in Dulbecco's modified Eagle's medium (DMEM) containing 10 % fetal bovine serum (FBS), 2 mM glutamine, 1 mM sodium pyruvate, and 25 mM glucose (Invitrogen, Carlsbad, CA, USA) in a humidified atmosphere containing 5% CO2.]
    [For precursor and 13C-sugar experiments, tissue culture treated polystyrene 35 mm dishes (Corning Inc, Lowell, MA, USA) were seeded with 2 × 106 cells and grown to confluency in DMEM.]

Because it has been trained against MEDLINE data, this model can be expected to perform better on medical and biomedical text than models trained on general-purpose text.
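One way to check how a model behaves on your own documents is to run a general-purpose splitter over the same text and compare the sentence boundaries by eye. The following is a minimal sketch of such a baseline that uses only the JDK's java.text.BreakIterator; the BaselineSentenceSplitter class name and the shortened sample text are assumptions made for illustration and are not part of LingPipe.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical baseline for comparison; uses only the JDK, not LingPipe.
final class BaselineSentenceSplitter {

    // Returns the text cut at every boundary the JDK sentence iterator reports.
    // Trailing whitespace stays attached to each segment.
    static List<String> split(String text) {
        BreakIterator iterator = BreakIterator.getSentenceInstance(Locale.US);
        iterator.setText(text);
        List<String> segments = new ArrayList<>();
        int start = iterator.first();
        for (int end = iterator.next();
                end != BreakIterator.DONE;
                start = end, end = iterator.next()) {
            segments.add(text.substring(start, end));
        }
        return segments;
    }

    public static void main(String[] args) {
        // Shortened excerpt of the sample paragraph (assumption: any text can be used here).
        String text = "HepG2 cells were obtained from the American Type Culture "
            + "Collection (Rockville, MD, USA) and were used only until "
            + "passage 30. They were routinely grown in DMEM.";
        for (String segment : split(text)) {
            System.out.println("[" + segment.trim() + "]");
        }
    }
}
```

Printing the segments in the same bracketed form as the LingPipe output makes differences easy to spot, for example around abbreviations, units, or citations that are common in biomedical writing.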
