Saturday, April 14, 2012

Thinking Cap: A Post-Easter Resurrection..

Considering that this is the last quiet weekend before the beginning of the end of the semester, I could sense a collective yearning
for one last thinking cap. So here goes...

1. We talked about classification learning in class over the last couple of days. One important issue in classification learning is access to training data that is "labeled" --i.e., training examples that are pre-classified.   Often we have a lot of training data, but only part of it is pre-classified.
Consider, for example, spam mails. It is easy to get access to a lot of mail, but only some of it may be known for sure to be spam vs. non-spam.   It would be great if learning algorithms could use not just pre-labeled data, but also unlabeled data. Is there a technique you can think of that can do this?  (Hint: think a bit further back than decision trees..)

(Learning scenarios where we get by with some labeled and some unlabeled data are called "semi-supervised learning tasks".)

Okay. One is enough for now, I think..



  1. We can use a Naive Bayes classifier, which can help increase accuracy using unlabeled data. First we build a classifier using the labeled data. We then use this classifier to classify the unlabeled data. Finally, we merge both data sets to build a final classifier.


    1. I am looking for an idea that we discussed in class that is directly applicable here..
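
The procedure described in the first comment is often called "self-training." A minimal sketch of the idea, using a toy Bernoulli Naive Bayes on made-up binary "trigger word" features (the data, function names, and smoothing choice are all illustrative assumptions, not anything from the post):

```python
# Sketch of the self-training idea: fit Naive Bayes on labeled data,
# label the unlabeled pool with it, merge, and retrain.
# All data below is a made-up toy example.

def fit_nb(examples, labels):
    """Fit a Bernoulli Naive Bayes with Laplace smoothing.
    examples: list of binary feature tuples; labels: list of 0/1."""
    n = len(examples)
    d = len(examples[0])
    prior = {c: (labels.count(c) + 1) / (n + 2) for c in (0, 1)}
    cond = {}
    for c in (0, 1):
        rows = [x for x, y in zip(examples, labels) if y == c]
        cond[c] = [(sum(r[j] for r in rows) + 1) / (len(rows) + 2)
                   for j in range(d)]
    return prior, cond

def predict_nb(model, x):
    """Return the class with the higher posterior score."""
    prior, cond = model
    def score(c):
        s = prior[c]
        for j, xj in enumerate(x):
            p = cond[c][j]
            s *= p if xj else (1 - p)
        return s
    return max((0, 1), key=score)

# Toy "spam" data: each feature marks the presence of a trigger word.
labeled = [((1, 1, 0), 1), ((1, 0, 1), 1), ((0, 0, 0), 0), ((0, 1, 0), 0)]
unlabeled = [(1, 1, 1), (0, 0, 1)]

# Step 1: build a classifier from the labeled data only.
X, y = zip(*labeled)
model = fit_nb(list(X), list(y))

# Step 2: pseudo-label the unlabeled pool with that classifier.
pseudo = [(x, predict_nb(model, x)) for x in unlabeled]

# Step 3: merge both sets and build the final classifier.
X2, y2 = zip(*(labeled + pseudo))
final = fit_nb(list(X2), list(y2))
```

As the instructor's reply hints, this hard-labeling loop is a simpler cousin of the EM-based approach discussed further down in the thread.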


  2. This comment has been removed by the author.

  3. Looking over the statistical learning slides from class, I think I have an idea. I'm sure you'll correct me if I'm wrong, but this unlabeled-data scenario you've given seems similar to the missing-data example. Since we have some examples that are classified, can't we just treat the unlabeled examples as missing data?
    If we did this, we could use maximum likelihood to learn from the pre-classified data and estimate the missing labels.
    So I think we could use the EM algorithm to classify the rest of our data.
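
The EM idea in the last comment can be sketched as follows: treat the missing labels as hidden variables, compute posterior class probabilities for the unlabeled points (E-step), and re-estimate the model parameters from all points (M-step). The 1-D two-class Gaussian model with unit variance below is purely an illustrative assumption; the data and function names are made up.

```python
# Semi-supervised EM sketch: labeled points keep their known class;
# unlabeled points get soft "responsibilities" that are refined each round.
import math

def semi_supervised_em(labeled, unlabeled, iters=20):
    """labeled: list of (x, c) with c in {0, 1}; unlabeled: list of x."""
    # Initialize class means and priors from the labeled data alone.
    means = [sum(x for x, c in labeled if c == k) /
             max(1, sum(1 for _, c in labeled if c == k)) for k in (0, 1)]
    priors = [0.5, 0.5]

    def pdf(x, m):
        # Unit-variance Gaussian density (a simplifying assumption).
        return math.exp(-0.5 * (x - m) ** 2) / math.sqrt(2 * math.pi)

    for _ in range(iters):
        # E-step: labeled points have fixed (0/1) responsibilities;
        # unlabeled points get posterior class probabilities.
        resp = [[1.0 - c, float(c)] for _, c in labeled]
        xs = [x for x, _ in labeled]
        for x in unlabeled:
            w = [priors[k] * pdf(x, means[k]) for k in (0, 1)]
            z = w[0] + w[1]
            resp.append([w[0] / z, w[1] / z])
            xs.append(x)
        # M-step: responsibility-weighted updates of priors and means.
        for k in (0, 1):
            wk = sum(r[k] for r in resp)
            priors[k] = wk / len(xs)
            means[k] = sum(r[k] * x for r, x in zip(resp, xs)) / wk
    return means, priors

# Toy data: class 0 clusters near -1.75, class 1 near +1.75.
labeled = [(-2.0, 0), (-1.5, 0), (1.5, 1), (2.0, 1)]
unlabeled = [-1.8, -1.2, 1.3, 1.9]
means, priors = semi_supervised_em(labeled, unlabeled)
```

After a few iterations the unlabeled points pull the class means toward their true clusters, which is exactly the "missing data" treatment the comment describes.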