Difference: ReutersSubset (1 vs. 4)

Revision 4 - 2011-11-01 - CyrilAllauzen

Line: 1 to 1
 
META TOPICPARENT name="KernelQuickTour"

Reuters-21578 subset: a dataset example

Changed:
<
<
The reuters.subset directory contains a subset of the [[http://www.daviddlewis.com/resources/testcollections/reuters21578/][Reuters-21578]] corpus often used for text categorization experiments. This subset contains 466 documents over 4 categories.
>
>
The reuters.subset.tgz archive contains a subset of the [[http://www.daviddlewis.com/resources/testcollections/reuters21578/][Reuters-21578]] corpus often used for text categorization experiments. This subset contains 466 documents over 4 categories.
Line: 79 to 79
 

-- CyrilAllauzen - 30 Oct 2007

Added:
>
>
META FILEATTACHMENT attachment="reuters.subset.tgz" attr="" comment="" date="1320181715" name="reuters.subset.tgz" path="reuters.subset.tgz" size="3306370" stream="reuters.subset.tgz" tmpFilename="/var/tmp/CGItemp25041" user="CyrilAllauzen" version="1"

Revision 3 - 2010-08-03 - CyrilAllauzen

Line: 1 to 1
 
META TOPICPARENT name="KernelQuickTour"

Reuters-21578 subset: a dataset example

Line: 21 to 21
  In the following, we assume that your PATH and LD_LIBRARY_PATH environment variables are set as suggested in the quick tour (PATH should contain far, kernel/bin and
Changed:
<
<
libsvm-2.82, LD_LIBRARY_PATH should contain kernel/lib and kernel/plugin).
>
>
libsvm-2.82, LD_LIBRARY_PATH should contain far, kernel/lib and kernel/plugin).
  The documents are in the data subdirectory. Each document is present as a text file. The data.list file contains the list of text files that defines our dataset.

Revision 2 - 2010-04-09 - CyrilAllauzen

Line: 1 to 1
 
META TOPICPARENT name="KernelQuickTour"

Reuters-21578 subset: a dataset example

Line: 16 to 16
 
  1. The two datasets do not contain the same documents.
  2. In Lodhi et al., the authors also performed some text normalization (removing stop words, punctuation, etc.) on the documents.
Changed:
<
<
Hence, we anticipate the performance of a given kernel on our subset to be worse than what is reported
>
>
Hence, due to the lack of text normalization, we anticipate the performance of a given kernel on this dataset to be slightly worse than what is reported
 in Lodhi et al.
Deleted:
<
<
The documents are in the data subdirectory. Each document is present as a text file and a corresponding fst file where each (ascii) character is represented by a transition. The fst files were generated by the ascii2fst command in the utils subdirectory.
 In the following, we assume that your PATH and LD_LIBRARY_PATH environment variables are
Changed:
<
<
set as suggested in the quick tour (PATH should contain kernel/bin and
>
>
set as suggested in the quick tour (PATH should contain far, kernel/bin and
 libsvm-2.82, LD_LIBRARY_PATH should contain kernel/lib and kernel/plugin).
Added:
>
>
The documents are in the data subdirectory. Each document is present as a text file. The data.list file contains the list of text files that defines our dataset.

Following Lodhi et al., we are going to evaluate n-gram kernels at the character level. The first step is to convert each text file into a linear automaton where each transition represents an (ascii) character. This was done using the farcompilestrings utility, as shown below, and the result is a far file containing a collection of Fsts (in the OpenFst library binary format) appearing in the same order as in the data.list file.

$ farcompilestrings --arc_type=log --entry_type=file --token_type=byte --generate_keys=3 --file_list_input data.list data.far
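Conceptually, the compiled automaton for each document is just a chain of states with one byte-labeled transition per character. A minimal Python sketch of that structure (for illustration only; farcompilestrings builds the real OpenFst binary representation):

```python
def string_to_linear_automaton(text):
    # One state per position; byte i labels the transition i -> i+1.
    data = text.encode("ascii")
    arcs = [(src, src + 1, byte) for src, byte in enumerate(data)]
    return arcs, len(data)  # transition list and the single final state

arcs, final_state = string_to_linear_automaton("acq")
# a 3-character string yields 3 transitions ending in state 3
```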
 A normalized 4-gram kernel for this dataset can be generated using the command:
Changed:
<
<
$ klngram -order=4 -sigma=255 fst.list > 4-gram.kar
>
>
$ klngram -order=4 -sigma=256 data.far > 4-gram.kar
 

To evaluate the performance of this kernel for classifying the acq category (one vs. others), we train an SVM:

Line: 43 to 48
 nSV = 217, nBSV = 60
Total nSV = 217
openkernel: 82563 kernel computations
Added:
>
>

The generated model can then be used for prediction:

 $ svm-predict acq.test acq.train.4-gram.model acq.test.4-gram.pred
Loading open kernel
open kernel: 4-gram.kar
Line: 52 to 61
 Squared correlation coefficient = 0.566988 (regression)
Changed:
<
<
Finally, this prediction can be scored using the utils/score.sh utility:
>
>
Finally, this prediction can be scored using the score.sh utility:
 
Changed:
<
<
$ ./utils/score.sh acq.test.4-gram.pred acq.test
>
>
$ ./score.sh acq.test.4-gram.pred acq.test
 true positive = 21
true negative = 59
false positive = 4

Revision 1 - 2007-10-30 - CyrilAllauzen

Line: 1 to 1
Added:
>
>
META TOPICPARENT name="KernelQuickTour"

Reuters-21578 subset: a dataset example

The reuters.subset directory contains a subset of the Reuters-21578 corpus often used for text categorization experiments. This subset contains 466 documents over 4 categories.

        all categories   earn   acq   crude   corn
train              377    154   114      76     38
test                89     42    26      15     10

The idea of this example is to have a dataset similar to one used for the experiments in the first part of: Huma Lodhi, Craig Saunders, John Shawe-Taylor, Nello Cristianini and Chris Watkins. Text Classification using String Kernels, Journal of Machine Learning Research 2:419-444, 2002. The sizes of the dataset and splits are similar, however:

  1. The two datasets do not contain the same documents.
  2. In Lodhi et al., the authors also performed some text normalization (removing stop words, punctuation, etc.) on the documents.

Hence, we anticipate the performance of a given kernel on our subset to be worse than what is reported in Lodhi et al.

The documents are in the data subdirectory. Each document is present as a text file and a corresponding fst file where each (ascii) character is represented by a transition. The fst files were generated by the ascii2fst command in the utils subdirectory.

In the following, we assume that your PATH and LD_LIBRARY_PATH environment variables are set as suggested in the quick tour (PATH should contain kernel/bin and libsvm-2.82, LD_LIBRARY_PATH should contain kernel/lib and kernel/plugin).

A normalized 4-gram kernel for this dataset can be generated using the command:

$ klngram -order=4 -sigma=255 fst.list > 4-gram.kar
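The quantity klngram computes with weighted transducers can be illustrated directly on strings: an n-gram kernel is the dot product of the two documents' n-gram count vectors, normalized so that K(x, x) = 1. A minimal Python sketch of that definition (for intuition only; it does not reproduce klngram's transducer-based implementation or output format):

```python
from collections import Counter
from math import sqrt

def ngram_counts(text, n):
    # Count vector over all contiguous character n-grams of the string.
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))

def normalized_ngram_kernel(x, y, n=4):
    # Dot product of n-gram count vectors, cosine-normalized.
    cx, cy = ngram_counts(x, n), ngram_counts(y, n)
    kxy = sum(c * cy[g] for g, c in cx.items())
    kxx = sum(c * c for c in cx.values())
    kyy = sum(c * c for c in cy.values())
    return kxy / sqrt(kxx * kyy)
```

For example, "abcdefgh" and "abcdxyzw" each have five 4-grams and share exactly one ("abcd"), giving a normalized kernel value of 1/5 = 0.2.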

To evaluate the performance of this kernel for classifying the acq category (one vs. others), we train an SVM:

$ svm-train -k openkernel -K 4-gram.kar acq.train acq.train.4-gram.model
open kernel successfully loaded
*
optimization finished, #iter = 362
nu = 0.339642
obj = -74.288867, rho = -0.368477
nSV = 217, nBSV = 60
Total nSV = 217
openkernel: 82563 kernel computations
$ svm-predict acq.test acq.train.4-gram.model acq.test.4-gram.pred
Loading open kernel
open kernel: 4-gram.kar
open kernel successfully loaded
Accuracy = 89.8876% (80/89) (classification)
Mean squared error = 0.404494 (regression)
Squared correlation coefficient = 0.566988 (regression)

Finally, this prediction can be scored using the utils/score.sh utility:

$ ./utils/score.sh acq.test.4-gram.pred acq.test
true positive =  21
true negative =  59
false positive =  4
false negative =  5
---
accuracy =  0.898876
precision =  0.84
recall =  0.807692
F1 =  0.823529

This is comparable to the F1 of 0.873 reported by Lodhi et al.
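The metrics printed by score.sh follow directly from the four confusion counts; a small Python sketch reproducing them from the numbers above:

```python
def classification_metrics(tp, tn, fp, fn):
    # Standard binary-classification metrics from confusion counts.
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, f1

# Counts reported above for the acq one-vs-others 4-gram run:
acc, prec, rec, f1 = classification_metrics(tp=21, tn=59, fp=4, fn=5)
# acc = 80/89 ≈ 0.898876, prec = 21/25 = 0.84,
# rec = 21/26 ≈ 0.807692, f1 ≈ 0.823529
```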

-- CyrilAllauzen - 30 Oct 2007

 