Ph.D. (2003) Cornell University
Abstract: Many endeavors require the integration of data from multiple sources. One major obstacle to such undertakings is that different sources may vary considerably in how they choose to represent their data, even when their data collections are otherwise perfectly compatible. In practice, this problem is usually solved by manually constructing translations between these data representations, although there have been some recent attempts to supplement this manual work with automated algorithms based on machine learning methods.
This work addresses the problem of making classification predictions based on data from multiple sources, without constructing explicit translations between them. We view this problem as a special case of multitask learning: both intuition and much empirical work indicate that learning can be improved by attacking multiple related tasks simultaneously. Thus far, however, no theoretical work has supported this claim, and no concrete definition has been proposed for what it means for two learning tasks to be "related."
In this work, we introduce a general notion of relatedness between tasks, establish the standard sort of information complexity bound for learning such tasks, and give general conditions under which this bound improves on the standard single-task learning result.
Finally, we apply these results to the problem of learning from disparate data sources. We give a decision tree learning algorithm for this problem for a particular type of data source disparity and demonstrate its empirical success on real data sets.
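To make the flavor of the disparate-sources setting concrete, here is a toy sketch, not the thesis's actual algorithm: two sources label the same underlying quantity, but one reports it on a different scale (a hypothetical representation disparity). A single decision stump fit to the naively pooled sample does poorly, while searching a small assumed family of candidate rescalings before pooling recovers the shared concept. The scale family, the stump learner, and the data generator are all illustrative assumptions.

```python
import random

random.seed(0)

def label(x):
    # Shared target concept, defined on the canonical [0, 1] scale.
    return 1 if x > 0.5 else 0

# Source A reports values on the canonical [0, 1] scale.
src_a = [(x, label(x)) for x in [random.random() for _ in range(200)]]
# Source B measures the same quantity but reports it scaled by 100
# (the disparity; the scale factor is unknown to the learner).
src_b = [(100 * x, label(x)) for x in [random.random() for _ in range(200)]]

def best_stump_error(data):
    """Training error of the best threshold classifier (decision stump)."""
    xs = sorted({x for x, _ in data})
    thresholds = [xs[0] - 1] + [(a + b) / 2 for a, b in zip(xs, xs[1:])] + [xs[-1] + 1]
    best = len(data)
    for t in thresholds:
        err = sum(1 for x, y in data if (1 if x > t else 0) != y)
        best = min(best, err)
    return best / len(data)

# Naive pooling ignores the disparity: one stump must fit both scales at once.
naive_err = best_stump_error(src_a + src_b)

# Instead, search a small family of candidate rescalings for source B and keep
# the transformation under which a single stump fits the pooled sample best.
candidate_scales = [1, 10, 100, 1000]
aligned_err = min(
    best_stump_error(src_a + [(x / s, y) for x, y in src_b])
    for s in candidate_scales
)
```

With the correct rescaling in the candidate family, the pooled sample becomes perfectly separable by one threshold (`aligned_err` drops to zero), whereas the naive pooled stump misclassifies roughly a quarter of the points. The thesis's setting replaces this toy scale family with a general class of transformations relating the sources and a full decision tree learner.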