Evaluation

The tutorial_key.txt file is a plain text file, and its content will look similar to the following screenshot:

It contains one line per topic; as we asked for 20 topics, it has 20 lines. Each line has three parts. The first is a number, starting from 0, denoting the topic number. The second is the Dirichlet parameter for the topic, with a default of 2.5. The third is a paragraph listing the terms that characterize the topic.

The tutorial_compostion.txt file contains a breakdown of the topics within each original text file. It can be opened in Excel or LibreOffice so that you can read it more easily. Each row shows the filename followed by pairs of a topic number and the proportion of the document's words assigned to that topic:

The first file is hawes.txt, and topic 19 has a proportion of 0.438, that is, about 43.8% of the document.
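The row layout described above can also be read programmatically. The following Python sketch assumes the tab-separated layout written by older MALLET releases such as 2.0.6 (document index, source name, then alternating topic and proportion values); other releases may arrange the columns differently, so check your own file first. The function name and the second topic pair in the sample row are invented for illustration; only hawes.txt, topic 19, and 0.438 come from the text.

```python
def parse_composition_row(line):
    """Split one doc-topics row into (source, [(topic, proportion), ...]).

    Assumes: tab-separated fields, doc index first, source name second,
    then alternating topic number / proportion values.
    """
    fields = [f for f in line.rstrip("\n").split("\t") if f != ""]
    source = fields[1]
    rest = fields[2:]
    pairs = [(int(rest[i]), float(rest[i + 1]))
             for i in range(0, len(rest) - 1, 2)]
    return source, pairs


# Illustrative row: the (5, 0.2) pair is made up for the example.
sample_row = "0\thawes.txt\t19\t0.438\t5\t0.2"
source, pairs = parse_composition_row(sample_row)

# Pick the topic with the largest share and show it as a percentage.
top_topic, top_share = max(pairs, key=lambda pair: pair[1])
print(f"{source}: topic {top_topic} = {top_share:.1%}")
```

Formatting the proportion with `.1%` makes it explicit that 0.438 is a fraction of the document, not a percentage.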

Let's try this using custom data. Create a mydata folder in the mallet directory containing four text files named 1.txt, 2.txt, 3.txt, and 4.txt. The following are the contents of the files:

Filename    Content
1.txt       I love eating bananas.
2.txt       I have a dog. He also loves to eat bananas.
3.txt       Banana is a fruit, rich in nutrients.
4.txt       Eating bananas in the morning is a healthy habit.
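If you prefer not to create the files by hand, a small script can write them. This is a minimal sketch; the function name write_sample_corpus and the SAMPLE_DOCS dictionary are illustrative, and the script assumes you pass the path of your mallet directory.

```python
from pathlib import Path

# File names and sentences copied from the table above.
SAMPLE_DOCS = {
    "1.txt": "I love eating bananas.",
    "2.txt": "I have a dog. He also loves to eat bananas.",
    "3.txt": "Banana is a fruit, rich in nutrients.",
    "4.txt": "Eating bananas in the morning is a healthy habit.",
}


def write_sample_corpus(mallet_dir):
    """Create <mallet_dir>/mydata and write the four sample files into it."""
    data_dir = Path(mallet_dir) / "mydata"
    data_dir.mkdir(parents=True, exist_ok=True)
    for name, text in SAMPLE_DOCS.items():
        (data_dir / name).write_text(text + "\n", encoding="utf-8")
    # Return the file names so the caller can confirm what was written.
    return sorted(path.name for path in data_dir.iterdir())
```

Calling write_sample_corpus(".") from inside the mallet directory creates the mydata folder used in the next step.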

Let's train and evaluate the model. Execute the following two commands:

mallet-2.0.6$ bin/mallet import-dir --input mydata/ --output mytutorial.mallet --keep-sequence --remove-stopwords

mallet-2.0.6$ bin/mallet train-topics --input mytutorial.mallet --num-topics 2 --output-state mytopic-state.gz --output-topic-keys mytutorial_keys.txt --output-doc-topics mytutorial_composition.txt
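The same two steps can be driven from a script, which is convenient when you want to rerun them with a different number of topics. The sketch below only wraps the command-line flags shown above with Python's subprocess module; build_commands and run_pipeline are hypothetical helper names, and the doc-topics file is named mytutorial_composition.txt to match the name used when the file is opened later.

```python
import subprocess


def build_commands(data_dir="mydata/", model="mytutorial.mallet", num_topics=2):
    """Return the import and training commands as argument lists."""
    import_cmd = [
        "bin/mallet", "import-dir",
        "--input", data_dir,
        "--output", model,
        "--keep-sequence",
        "--remove-stopwords",
    ]
    train_cmd = [
        "bin/mallet", "train-topics",
        "--input", model,
        "--num-topics", str(num_topics),
        "--output-state", "mytopic-state.gz",
        "--output-topic-keys", "mytutorial_keys.txt",
        "--output-doc-topics", "mytutorial_composition.txt",
    ]
    return [import_cmd, train_cmd]


def run_pipeline(mallet_dir="."):
    """Run both commands from the mallet directory, stopping on failure."""
    for cmd in build_commands():
        subprocess.run(cmd, cwd=mallet_dir, check=True)
```

Passing check=True makes the script stop if the import step fails, so training never runs against a missing .mallet file.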

As mentioned previously, it will create three files, which we will now look at in detail.

The first file is mytopic-state.gz. Extract and open the file. It lists every word that is used, along with the topic to which each word is assigned:
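The state file can also be read without extracting it by hand. The following sketch assumes the usual layout of a MALLET state file: header lines starting with #, followed by whitespace-separated rows whose last two fields are the word and its topic number. Verify this against your own file before relying on it. The function names are illustrative.

```python
import gzip
from collections import defaultdict


def words_by_topic(lines):
    """Group words by their assigned topic.

    Assumes each data row ends with '<word> <topic>' and that header
    lines begin with '#'.
    """
    topics = defaultdict(list)
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and header lines
        parts = line.split()
        word, topic = parts[-2], int(parts[-1])
        topics[topic].append(word)
    return dict(topics)


def read_state_file(path):
    """Open a gzip-compressed state file and group its words by topic."""
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        return words_by_topic(handle)
```

For example, read_state_file("mytopic-state.gz") returns a dictionary that maps each topic number to the list of words assigned to it.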

The next file is mytutorial_keys.txt, which, when opened, will display the topic terms. As we have asked for two topics, it will have two lines:
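Each line of a topic keys file holds a topic number, a Dirichlet parameter, and the topic terms. A minimal sketch of reading those three parts follows; it assumes the fields are separated by tabs and the terms by spaces, and parse_topic_keys is an illustrative name. In the sample lines, the weights and the terms of topic 1 are invented; only the terms shown for topic 0 come from this example.

```python
def parse_topic_keys(lines):
    """Turn topic-keys lines into dictionaries with topic, weight, and terms.

    Assumes each line is '<topic>\\t<weight>\\t<space-separated terms>'.
    """
    topics = []
    for line in lines:
        line = line.rstrip("\n")
        if not line.strip():
            continue  # ignore blank lines
        number, weight, terms = line.split("\t", 2)
        topics.append({
            "topic": int(number),
            "weight": float(weight),
            "terms": terms.split(),
        })
    return topics


# Illustrative input: weights and topic 1 terms are made up.
sample_lines = [
    "0\t25.0\tbanana nutrients love healthy\n",
    "1\t25.0\tbananas eating dog\n",
]
for row in parse_topic_keys(sample_lines):
    print(row["topic"], row["weight"], ", ".join(row["terms"]))
```

To read the real file, pass an open file handle, for example parse_topic_keys(open("mytutorial_keys.txt", encoding="utf-8")).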

The last file is mytutorial_composition.txt, which we will open in Excel or LibreOffice. It will display doc, topic, and proportion:

It can be seen that for the 3.txt file, which contains "Banana is a fruit, rich in nutrients.", topic 0 has a higher proportion than topic 1. From the first file, we can see that topic 0 contains the terms banana, nutrients, love, and healthy.
