---+ Reuters-21578 subset: a dataset example

The [[%ATTACHURL%/reuters.subset.tgz][reuters.subset.tgz%ICON{download}%]] archive contains a subset of the [[http://www.daviddlewis.com/resources/testcollections/reuters21578/][Reuters-21578%ICON{external}%]] collection, which is often used for text categorization experiments. This subset contains 466 documents over 4 categories.

| | all categories | =earn= | =acq= | =crude= | =corn= |
| train | 377 | 154 | 114 | 76 | 38 |
| test | 89 | 42 | 26 | 15 | 10 |

The idea of this example is to have a dataset similar to the one used for the experiments in the first part of:

Huma Lodhi, Craig Saunders, John Shawe-Taylor, Nello Cristianini and Chris Watkins. [[http://www.jmlr.org/papers/volume2/lodhi02a/lodhi02a.pdf][Text Classification using String Kernels%ICON{download}%]], _Journal of Machine Learning Research_ 2:419-444, 2002.

The size of the dataset and the splits are similar, however:
   1. The two datasets do not contain the same documents.
   1. In Lodhi _et al._, the authors also performed some text normalization (removing stop words, punctuation, ...) on the documents.

Hence, due to the lack of text normalization, we anticipate the performance of a given kernel on this dataset to be slightly worse than what is reported in Lodhi _et al._

In the following, we assume that your =PATH= and =LD_LIBRARY_PATH= environment variables are set as suggested in the [[KernelQuickTour][quick tour]] (=PATH= should contain =far=, =kernel/bin= and =libsvm-2.82=, =LD_LIBRARY_PATH= should contain =far=, =kernel/lib= and =kernel/plugin=).

The documents are in the =data= subdirectory. Each document is stored as a text file. The =data.list= file contains the list of text files that defines our dataset.

Following Lodhi _et al._, we are going to evaluate _n_ -gram kernels at the character level. The first step is to convert each text file into a linear automaton where each transition represents an (ASCII) character.
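To make the notion of a linear automaton concrete, here is a minimal Python sketch of the idea, not of the actual tools. The function names =linear_automaton= and =ngram_counts= are hypothetical and are not part of OpenFst or the kernel library; the sketch only assumes that a document is read as a sequence of bytes, with one transition per byte.

```python
from collections import Counter


def linear_automaton(text: bytes):
    """Arcs of a linear chain accepting exactly `text` (illustrative only).

    Each arc is (source_state, byte_label, destination_state). State 0 is
    the start state and state len(text) is the single final state.
    """
    return [(i, label, i + 1) for i, label in enumerate(text)]


def ngram_counts(arcs, n):
    """Count the label sequences of every n consecutive arcs in the chain."""
    labels = bytes(label for _, label, _ in arcs)
    return Counter(labels[i:i + n] for i in range(len(labels) - n + 1))


# One arc per character: 'o' = 111, 'i' = 105, 'l' = 108.
arcs = linear_automaton("oil".encode("ascii"))
print(arcs)  # [(0, 111, 1), (1, 105, 2), (2, 108, 3)]

# Character bigrams of "banana": ba, an, na, an, na.
counts = ngram_counts(linear_automaton(b"banana"), 2)
print(counts[b"an"], counts[b"na"], counts[b"ba"])  # 2 2 1
```

A character-level _n_ -gram kernel compares documents through counts of this kind. The sketch only illustrates the data structure; the actual conversion of the text files is a separate step.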
The conversion was done using the =farcompilestrings= utility, as shown below. The result is a =far= file containing a collection of FSTs (in the [[FST.WebHome][OpenFst library]] binary format) that appear in the same order as in the =data.list= file.

<verbatim>
$ farcompilestrings --arc_type=log --entry_type=file --token_type=byte --generate_keys=3 --file_list_input data.list data.far
</verbatim>

A normalized 4-gram kernel for this dataset can be generated using the command:

<verbatim>
$ klngram -order=4 -sigma=256 data.far > 4-gram.kar
</verbatim>

To evaluate the performance of this kernel for classifying the =acq= category (one vs. others), first train an SVM model:

<verbatim>
$ svm-train -k openkernel -K 4-gram.kar acq.train acq.train.4-gram.model
open kernel successfully loaded
*
optimization finished, #iter = 362
nu = 0.339642
obj = -74.288867, rho = -0.368477
nSV = 217, nBSV = 60
Total nSV = 217
openkernel: 82563 kernel computations
</verbatim>

The generated model can then be used for prediction:

<verbatim>
$ svm-predict acq.test acq.train.4-gram.model acq.test.4-gram.pred
Loading open kernel
open kernel: 4-gram.kar
open kernel successfully loaded
Accuracy = 89.8876% (80/89) (classification)
Mean squared error = 0.404494 (regression)
Squared correlation coefficient = 0.566988 (regression)
</verbatim>

Finally, this prediction can be scored using the =score.sh= utility:

<verbatim>
$ ./score.sh acq.test.4-gram.pred acq.test
true positive = 21
true negative = 59
false positive = 4
false negative = 5
---
accuracy = 0.898876
precision = 0.84
recall = 0.807692
F1 = 0.823529
</verbatim>

The resulting F1 of 0.824 is comparable to the F1 of 0.873 reported by Lodhi _et al._

-- Main.CyrilAllauzen - 30 Oct 2007
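The figures reported by =score.sh= can be cross-checked from the four confusion counts alone. The Python sketch below uses the standard definitions of accuracy, precision, recall and F1; the internals of =score.sh= are not shown on this page, so treating it as using these definitions is an assumption, and the function name =score= is hypothetical.

```python
def score(tp, tn, fp, fn):
    """Standard binary classification metrics from confusion counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, f1


# Counts taken from the score.sh output for the acq category.
accuracy, precision, recall, f1 = score(tp=21, tn=59, fp=4, fn=5)
print(f"accuracy = {accuracy:.6f}")    # accuracy = 0.898876
print(f"precision = {precision:.6f}")  # precision = 0.840000
print(f"recall = {recall:.6f}")        # recall = 0.807692
print(f"F1 = {f1:.6f}")                # F1 = 0.823529
```

The recomputed values agree with the =score.sh= output, and the accuracy (80/89) matches the one printed by =svm-predict=.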
---++ Topic attachments

| *Attachment* | *History* | *Size* | *Date* | *Who* |
| reuters.subset.tgz | r1 | 3228.9 K | 2011-11-01 - 21:08 | CyrilAllauzen |
Topic revision: r4 - 2011-11-01 - CyrilAllauzen