Reuters-21578 subset: a dataset example
The reuters.subset.tgz archive contains a subset of the [[http://www.daviddlewis.com/resources/testcollections/reuters21578/][Reuters-21578]] collection, which is often used for text categorization experiments. This subset contains 466 documents over 4 categories.
The idea of this example is to have a dataset similar to the one used for the experiments in the first part of: Huma Lodhi, Craig Saunders, John Shawe-Taylor, Nello Cristianini and Chris Watkins. Text Classification using String Kernels, Journal of Machine Learning Research 2:419-444, 2002. The size of the dataset and the splits are similar; however, no text normalization was applied here.
Hence, due to the lack of text normalization, we expect the performance of a given kernel on this dataset to be slightly worse than what is reported in Lodhi et al.
In the following, we assume that your
The documents are in the
Following Lodhi et al., we are going to evaluate n-gram kernels at the character level. The first step is to convert each text file into a linear automaton where each transition represents an (ASCII) character. This was done using the following command:

$ farcompilestrings --arc_type=log --entry_type=file --token_type=byte --generate_keys=3 --file_list_input data.list data.far

A normalized 4-gram kernel for this dataset can be generated using the command:

$ klngram --order=4 --sigma=256 data.far > 4gram.kar
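To make the idea of a normalized character-level n-gram kernel concrete, here is a minimal Python sketch. It is not how klngram works internally: it ignores the weighted-automaton representation and the sigma option, and it assumes that only substrings of exactly n bytes are counted and that normalization means dividing by the geometric mean of the two self-kernels. The function names are hypothetical and chosen for illustration only.

```python
from collections import Counter
from math import sqrt


def ngram_counts(data: bytes, n: int) -> Counter:
    """Count every contiguous byte n-gram of length exactly n."""
    return Counter(data[i:i + n] for i in range(len(data) - n + 1))


def ngram_kernel(a: bytes, b: bytes, n: int = 4) -> int:
    """Unnormalized kernel: sum over shared n-grams of count products."""
    ca, cb = ngram_counts(a, n), ngram_counts(b, n)
    if len(cb) < len(ca):
        ca, cb = cb, ca  # iterate over the smaller table
    return sum(count * cb[gram] for gram, count in ca.items())


def normalized_ngram_kernel(a: bytes, b: bytes, n: int = 4) -> float:
    """K(a, b) / sqrt(K(a, a) * K(b, b)), or 0.0 if either text is too short."""
    kaa = ngram_kernel(a, a, n)
    kbb = ngram_kernel(b, b, n)
    if kaa == 0 or kbb == 0:
        return 0.0
    return ngram_kernel(a, b, n) / sqrt(kaa * kbb)


if __name__ == "__main__":
    x = b"the company said it acquired the firm"
    y = b"the company said profits rose"
    print(normalized_ngram_kernel(x, y, n=4))
```

With this normalization, a document compared with itself scores 1, and two documents that share no 4-gram score 0.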
To evaluate the performance of this kernel for classifying the acq category, first train a model:

$ svm-train -k openkernel -K 4gram.kar acq.train acq.train.4gram.model
open kernel successfully loaded
*
optimization finished, #iter = 362
nu = 0.339642
obj = 74.288867, rho = 0.368477
nSV = 217, nBSV = 60
Total nSV = 217
openkernel: 82563 kernel computations

The generated model can then be used for prediction:

$ svm-predict acq.test acq.train.4gram.model acq.test.4gram.pred
Loading open kernel
open kernel: 4gram.kar
open kernel successfully loaded
Accuracy = 89.8876% (80/89) (classification)
Mean squared error = 0.404494 (regression)
Squared correlation coefficient = 0.566988 (regression)
Finally, this prediction can be scored using the score.sh script:

$ ./score.sh acq.test.4gram.pred acq.test
true positive = 21
true negative = 59
false positive = 4
false negative = 5

accuracy = 0.898876
precision = 0.84
recall = 0.807692
F1 = 0.823529

This is comparable to the F1 of 0.873 reported by Lodhi et al.
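The scores above follow directly from the four confusion counts. The short Python sketch below recomputes them using the standard definitions of accuracy, precision, recall and F1. It is an illustration only, not the contents of score.sh, and the function name is hypothetical.

```python
def binary_scores(tp: int, tn: int, fp: int, fn: int) -> dict:
    """Derive accuracy, precision, recall and F1 from confusion counts."""
    total = tp + tn + fp + fn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    # F1 is the harmonic mean of precision and recall.
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "F1": f1}


if __name__ == "__main__":
    # Counts taken from the score.sh output for the acq test set.
    scores = binary_scores(tp=21, tn=59, fp=4, fn=5)
    for name, value in scores.items():
        print(f"{name} = {value:.6g}")
```

For example, precision is 21 / (21 + 4) = 0.84 and recall is 21 / (21 + 5), which is about 0.8077.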
-- CyrilAllauzen - 30 Oct 2007