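I am trying to add more instances to my training set and then run 10-fold cross-validation.
My instances are in String format, so I use the StringToWordVector filter to convert them to numeric attributes. If I don't add the extra pages I want, everything works fine. But when I add the call trainSet.addAll(data2); and pass trainSet to the filter, I get a strange IndexOutOfBoundsException in the first iteration, at the line Instances fTrainSet = Filter.useFilter(trainSet, filter);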
    Instances data = getDataFromFile("pathtofile.arff");    // main dataset, 1821 instances
    Instances data2 = getDataFromFile("anotherpath.arff");  // 709 instances I want to add
    int folds = 10;
    for (int i = 0; i < folds; i++) {
        Instances trainSet = data.trainCV(folds, i);  // training set
        System.out.println(trainSet.numInstances());  // prints 1638
        Instances testSet = data.testCV(folds, i);    // testing set

        // add more instances
        trainSet.addAll(data2);
        System.out.println(trainSet.numInstances());  // prints 2347

        // filter
        StringToWordVector filter = new StringToWordVector();
        filter.setInputFormat(trainSet);
        filter.setWordsToKeep(10000);
        filter.setTFTransform(true);
        filter.setLowerCaseTokens(true);
        filter.setOutputWordCounts(true);
        Stemmer stemmer = new IteratedLovinsStemmer();
        filter.setStemmer(stemmer);
        WordsFromFile stopwords = new WordsFromFile();
        stopwords.setStopwords(new File(".data/stopwords2.txt"));
        filter.setStopwordsHandler(stopwords);

        Instances fTrainSet = Filter.useFilter(trainSet, filter);  // error!!!
        Instances fTestSet = Filter.useFilter(testSet, filter);
        ....
        // classification and evaluation....
I get the following error when I try to use the filter:
    Exception in thread "main" java.lang.IndexOutOfBoundsException: Index: 2161, Size: 1749
        at java.util.ArrayList.rangeCheck(Unknown Source)
        at java.util.ArrayList.get(Unknown Source)
        at weka.core.Attribute.addStringValue(Attribute.java:924)
        at weka.core.StringLocator.copyStringValues(StringLocator.java:150)
        at weka.core.StringLocator.copyStringValues(StringLocator.java:91)
        at weka.filters.Filter.copyValues(Filter.java:399)
        at weka.filters.Filter.bufferInput(Filter.java:342)
        at weka.filters.unsupervised.attribute.StringToWordVector.input(StringToWordVector.java:655)
        at weka.filters.Filter.useFilter(Filter.java:692)
        at CrossValidationExample.main(CrossValidationExample.java:108)
Any idea what is going wrong?
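After some searching, I realized that the addAll call is the problem. One reason I can think of is that addAll only adds references to the instances, which becomes a problem when I then pass the set to the filter. Instead, I used the merge function suggested here, so I replaced trainSet.addAll(data2); with Instances newTrainSet = merge(trainSet, data2); and everything works.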
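The stack trace seems consistent with that guess: the failing frame is an ArrayList.get inside Attribute.addStringValue, asking for index 2161 in a list of size 1749, which is what one would expect if an appended instance carried a string index that belongs to another dataset's value table. The sketch below is a toy model in plain Java, not Weka code. ToyDataset, addAllShallow, and this merge are hypothetical names invented for illustration, and the model assumes that each row stores only an index into a per-dataset string table.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model (NOT Weka): shows why appending rows that carry indices into
// another dataset's string table can fail, and why re-adding the string
// values themselves avoids the problem.
class StringTableDemo {

    /** Hypothetical dataset with a single string attribute. */
    static class ToyDataset {
        final List<String> stringTable = new ArrayList<>(); // per-dataset value table
        final List<Integer> rows = new ArrayList<>();       // each row holds an index into the table

        /** Adds a row, registering the text in this dataset's own table. */
        void addText(String text) {
            int idx = stringTable.indexOf(text);
            if (idx < 0) {
                stringTable.add(text);
                idx = stringTable.size() - 1;
            }
            rows.add(idx);
        }

        /** Resolves a row's index against this dataset's table. */
        String text(int row) {
            return stringTable.get(rows.get(row));
        }

        /** Shallow append (assumed analogue of the problem): copies raw indices only. */
        void addAllShallow(ToyDataset other) {
            rows.addAll(other.rows);
        }

        /** Merge (assumed analogue of the fix): re-registers every string value. */
        static ToyDataset merge(ToyDataset a, ToyDataset b) {
            ToyDataset out = new ToyDataset();
            for (int i = 0; i < a.rows.size(); i++) out.addText(a.text(i));
            for (int i = 0; i < b.rows.size(); i++) out.addText(b.text(i));
            return out;
        }
    }

    public static void main(String[] args) {
        ToyDataset train = new ToyDataset();
        train.addText("first page");

        ToyDataset extra = new ToyDataset();
        extra.addText("alpha");
        extra.addText("beta");
        extra.addText("gamma");

        // Merging by value keeps every index valid in the new table.
        ToyDataset merged = ToyDataset.merge(train, extra);
        System.out.println(merged.text(3)); // prints "gamma"

        // Shallow append leaves indices 1 and 2 pointing past train's one-entry table.
        train.addAllShallow(extra);
        try {
            train.text(3);
        } catch (IndexOutOfBoundsException e) {
            System.out.println("shallow append failed with IndexOutOfBoundsException");
        }
    }
}
```

In this model the shallow append does not fail at the moment of adding. It fails later, when something tries to resolve the indices, which mirrors the fact that the exception in the question only appears once the filter copies the string values.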