cse471-s12: Thinking Cap: A Post-Easter Resurrection..

Saturday, April 14, 2012

Considering that this is the last quiet weekend before the beginning of the end of the semester, I could sense a collective yearning for one last thinking cap. So here goes...

1. We talked about classification learning in class over the last couple of days. One important issue in classification learning is access to training data that is "labeled", i.e., training examples that are pre-classified. Often, we have a lot of training data, but only part of it is pre-classified.

Consider, for example, spam mail. It is easy to get access to a lot of mail, but only some of the messages may be known for sure to be spam vs. non-spam. It would be great if learning algorithms could use not just pre-labeled data, but also unlabeled data. Is there a technique that you can think of that can do this? (Hint: Think a bit back beyond decision trees..)

(Learning scenarios where we get by with some labeled and some unlabeled data are called "semi-supervised learning" tasks.)

Okay. One is enough for now, I think..

Rao

4 comments:

Dinu John, April 15, 2012 at 2:22 AM
We can use a Naive Bayes classifier, which can use the unlabeled data to help increase accuracy. First, we build a classifier using the labeled data. Next, we use this classifier to classify the unlabeled data. Finally, we merge both data sets to build the final classifier.

-Dinu
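
The three steps in the comment above (train on labeled data, label the rest, retrain on everything) can be sketched as follows. This is only a minimal illustration, assuming a bag-of-words multinomial Naive Bayes with Laplace smoothing; the function names (`train_nb`, `predict`, `self_train`) and the toy mails are hypothetical and not from the class material. Note that committing to hard guesses like this is not guaranteed to improve accuracy, since early mistakes get baked into the final classifier.

```python
import math
from collections import Counter

def train_nb(docs, labels, vocab, alpha=1.0):
    """Fit a multinomial Naive Bayes model with Laplace smoothing."""
    classes = sorted(set(labels))
    class_counts = Counter(labels)
    word_counts = {c: Counter() for c in classes}
    for words, y in zip(docs, labels):
        word_counts[y].update(words)
    model = {}
    for c in classes:
        total = sum(word_counts[c].values()) + alpha * len(vocab)
        log_prior = math.log(class_counts[c] / len(labels))
        log_like = {w: math.log((word_counts[c][w] + alpha) / total) for w in vocab}
        model[c] = (log_prior, log_like)
    return model

def predict(model, words):
    """Return the most probable class for one document (unknown words are skipped)."""
    scores = {c: lp + sum(ll[w] for w in words if w in ll)
              for c, (lp, ll) in model.items()}
    return max(scores, key=scores.get)

def self_train(labeled_docs, labels, unlabeled_docs):
    """Step 1: train on labeled data. Step 2: label the rest. Step 3: retrain on all."""
    vocab = {w for d in labeled_docs + unlabeled_docs for w in d}
    first = train_nb(labeled_docs, labels, vocab)
    guessed = [predict(first, d) for d in unlabeled_docs]
    final = train_nb(labeled_docs + unlabeled_docs, labels + guessed, vocab)
    return final, guessed

# Toy data, invented purely for illustration.
labeled_docs = [["cheap", "pills"], ["meeting", "agenda"]]
labels = ["spam", "ham"]
unlabeled_docs = [["cheap", "offer"], ["agenda", "notes"], ["offer", "pills"]]

final_model, guessed_labels = self_train(labeled_docs, labels, unlabeled_docs)
offer_class = predict(final_model, ["offer"])
notes_class = predict(final_model, ["notes"])
```

In this toy run, "offer" and "notes" never appear in the labeled mails, so the final classifier can only have an opinion about them because of the unlabeled data.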
John Brannan, April 16, 2012 at 3:00 PM
Looking over the statistical learning slides from class, I think I have an idea. I'm sure you'll correct me if I'm wrong but this unlabeled data scenario you've given seems similar to the missing data example. Since we have some examples that are classified, can't we just treat the unlabeled examples as missing data?
If we did this, we could use maximum likelihood to learn from the pre-classified data and generate the missing information.
So I think we could use the EM algorithm to classify the rest of our data.
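
The EM idea in the comment above, treating the missing labels as hidden variables, might look roughly like the sketch below. It assumes a bag-of-words Naive Bayes model; the helper names (`m_step`, `e_step`, `em_naive_bayes`), the smoothing constant, the fixed iteration count, and the toy mails are all assumptions made for illustration. Labeled mails keep their known class with weight 1, while each unlabeled mail contributes fractional counts to every class in proportion to its current posterior.

```python
import math

def m_step(docs, weights, classes, vocab, alpha=1.0):
    """Re-estimate priors and word probabilities from (possibly fractional) class weights."""
    model = {}
    for c in classes:
        class_mass = sum(w[c] for w in weights)
        counts = {v: alpha for v in vocab}          # Laplace smoothing
        for words, w in zip(docs, weights):
            for word in words:
                counts[word] += w[c]
        total = sum(counts.values())
        prior = (class_mass + alpha) / (len(docs) + alpha * len(classes))
        model[c] = (math.log(prior),
                    {v: math.log(n / total) for v, n in counts.items()})
    return model

def e_step(model, words):
    """Posterior P(class | words) under the current model (unknown words are skipped)."""
    log_scores = {c: lp + sum(ll[w] for w in words if w in ll)
                  for c, (lp, ll) in model.items()}
    top = max(log_scores.values())                  # subtract max for numerical stability
    exp = {c: math.exp(s - top) for c, s in log_scores.items()}
    z = sum(exp.values())
    return {c: e / z for c, e in exp.items()}

def em_naive_bayes(labeled, labels, unlabeled, iterations=10):
    classes = sorted(set(labels))
    vocab = {w for d in labeled + unlabeled for w in d}
    hard = [{c: 1.0 if c == y else 0.0 for c in classes} for y in labels]
    model = m_step(labeled, hard, classes, vocab)   # start from the labeled data only
    for _ in range(iterations):
        soft = [e_step(model, d) for d in unlabeled]            # E-step: guess missing labels
        model = m_step(labeled + unlabeled, hard + soft, classes, vocab)  # M-step: refit
    return model

# Toy data, invented purely for illustration.
labeled = [["cheap", "pills"], ["meeting", "agenda"]]
labels = ["spam", "ham"]
unlabeled = [["cheap", "offer"], ["agenda", "notes"], ["offer", "pills"]]

model = em_naive_bayes(labeled, labels, unlabeled)
offer_posterior = e_step(model, ["offer"])
notes_posterior = e_step(model, ["notes"])
```

Compared with labeling the unlabeled mails once and retraining, the soft assignments let an uncertain mail contribute a little to each class, and the repeated E and M steps let those guesses be revised as the model changes.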