Possible bug in the isDuplicate method.

description

I'm trying to understand how this method is meant to be used; it looks like the return values are inverted. As written, it returns true when the set grows, i.e. when the tweet is new, so the DF variable only ends up with values for terms in duplicated "tweets".

Original:
public boolean isDuplicate(String tweet) throws IOException {
    TokenizerFactory tokFactory = new NormalizedTokenizerFactory();
    int size = set.size();
    set = DedupeJaccard.filterTweetsJaccard(set, tokFactory, tweet, .80);
    if (set.size() > size) {
        return true;
    }
    return false;
}
Modified:
public boolean isDuplicate(String tweet) throws IOException {
    TokenizerFactory tokFactory = new NormalizedTokenizerFactory();
    int size = set.size();
    set = DedupeJaccard.filterTweetsJaccard(set, tokFactory, tweet, .80);
    if (set.size() > size) {
        return false;
    }
    return true;
}
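
To make the reasoning concrete, here is a minimal, self-contained sketch of the behaviour I am assuming. I have not checked the real DedupeJaccard.filterTweetsJaccard; the sketch assumes it adds the tweet to the set only when no stored tweet reaches the Jaccard threshold. The class DuplicateCheckSketch, the whitespace tokenizer, and the helper names are hypothetical stand-ins, not the library's code. Under that assumption the set grows only for new tweets, so an unchanged size is what signals a duplicate.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashSet;
import java.util.Set;

// Hypothetical stand-in for the class that owns isDuplicate(); not the library code.
class DuplicateCheckSketch {
    private final Set<String> seen = new LinkedHashSet<>();
    private final double threshold;

    DuplicateCheckSketch(double threshold) {
        this.threshold = threshold;
    }

    // Assumed tokenizer: lowercase and split on whitespace
    // (the real code uses NormalizedTokenizerFactory).
    static Set<String> tokens(String text) {
        return new HashSet<>(Arrays.asList(text.toLowerCase().trim().split("\\s+")));
    }

    // Jaccard similarity: |A intersect B| / |A union B|.
    static double jaccard(Set<String> a, Set<String> b) {
        if (a.isEmpty() && b.isEmpty()) {
            return 1.0;
        }
        Set<String> intersection = new HashSet<>(a);
        intersection.retainAll(b);
        Set<String> union = new HashSet<>(a);
        union.addAll(b);
        return (double) intersection.size() / union.size();
    }

    // Assumed filter behaviour: store the tweet only if nothing similar is already stored.
    boolean isDuplicate(String tweet) {
        int sizeBefore = seen.size();
        Set<String> candidate = tokens(tweet);
        boolean similarExists = false;
        for (String stored : seen) {
            if (jaccard(tokens(stored), candidate) >= threshold) {
                similarExists = true;
                break;
            }
        }
        if (!similarExists) {
            seen.add(tweet);
        }
        // The set grew => the tweet was new => not a duplicate.
        return seen.size() == sizeBefore;
    }
}
```

With a threshold of .80, the first call for a given tweet returns false and a repeated or near-identical tweet returns true, which matches the modified version above. If filterTweetsJaccard actually behaves differently (for example, if it always adds the tweet), this reasoning does not apply.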

comments