Time for one last thinking cap. So here goes...
1. We talked about classification learning in class over the last couple of days. One important issue in classification learning is access to training data that is "labeled" -- i.e., training examples that are pre-classified. Often, we have a lot of training data, but only part of it is pre-classified.
Consider, for example, spam mails. It is easy to get access to a lot of mails, but only some of them may be known for sure to be spam vs. non-spam. It would be great if learning algorithms could use not just pre-labeled data, but also unlabeled data. Is there a technique that you can think of that can do this? (Hint: Think a bit back beyond decision trees..)
(Learning scenarios where we get by with some labeled and some unlabeled data are called "semi-supervised learning tasks".)
Okay. One is enough for now, I think..
Rao
We can use a Naive Bayes classifier, which can help increase accuracy by using unlabeled data. First, build a classifier using the labeled data. Use this classifier to classify the unlabeled data. Then merge both data sets to build a final classifier.
- Dinu
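A minimal sketch of the self-training loop Dinu describes, in Python with scikit-learn's MultinomialNB; the bag-of-words setup and the names X_labeled, y_labeled, X_unlabeled are assumptions for illustration, not from the thread:

    import numpy as np
    from sklearn.naive_bayes import MultinomialNB

    def self_train(X_labeled, y_labeled, X_unlabeled):
        # Step 1: build a classifier from the labeled examples alone.
        clf = MultinomialNB()
        clf.fit(X_labeled, y_labeled)
        # Step 2: use it to guess ("pseudo-label") the unlabeled examples.
        pseudo_labels = clf.predict(X_unlabeled)
        # Step 3: merge both sets and train the final classifier.
        X_all = np.vstack([X_labeled, X_unlabeled])
        y_all = np.concatenate([y_labeled, pseudo_labels])
        return MultinomialNB().fit(X_all, y_all)

One caveat: any mistakes the first classifier makes on the unlabeled pool get frozen into the merged training set. The EM idea further down in the thread softens this by re-estimating the guessed labels on every iteration.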
I am looking for an idea that we discussed in class that is directly applicable here..
Rao
Looking over the statistical learning slides from class, I think I have an idea. I'm sure you'll correct me if I'm wrong, but this unlabeled data scenario you've given seems similar to the missing data example. Since we have some examples that are classified, can't we just treat the unlabeled examples as missing data?
If we did this, we could use maximum likelihood to learn from the pre-classified data and generate the missing information.
So I think we could use the EM algorithm to classify the rest of our data.
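To make that concrete, here is a rough sketch of semi-supervised EM with a Bernoulli Naive Bayes model, treating the missing class labels as hidden variables. The binary-feature assumption and the names X_l, y_l, X_u are mine, not from the thread. Labeled rows keep their known classes as hard one-hot "responsibilities"; the E-step fills in soft labels for the unlabeled rows, and the M-step re-estimates the parameters from both:

    import numpy as np

    def em_naive_bayes(X_l, y_l, X_u, n_classes, n_iters=20, eps=1e-6):
        # Responsibilities: hard one-hot for labeled rows (classes known),
        # uniform to start for unlabeled rows (classes are the hidden data).
        R_l = np.eye(n_classes)[y_l]
        R_u = np.full((len(X_u), n_classes), 1.0 / n_classes)
        X = np.vstack([X_l, X_u])

        for _ in range(n_iters):
            R = np.vstack([R_l, R_u])
            # M-step: maximum-likelihood estimates from the (soft) counts,
            # with a little smoothing so no probability hits 0 or 1.
            priors = (R.sum(axis=0) + eps) / (len(X) + n_classes * eps)
            theta = (R.T @ X + eps) / (R.sum(axis=0)[:, None] + 2 * eps)
            # E-step: recompute P(class | x) for the unlabeled rows only.
            log_post = (X_u @ np.log(theta).T
                        + (1 - X_u) @ np.log(1 - theta).T
                        + np.log(priors))
            log_post -= log_post.max(axis=1, keepdims=True)  # numeric safety
            R_u = np.exp(log_post)
            R_u /= R_u.sum(axis=1, keepdims=True)
        return priors, theta, R_u

Seen this way, the self-training scheme earlier in the thread is roughly a single iteration of this loop with hard label assignments; full EM repeats the E- and M-steps until the parameters stop changing.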