Developing Representationally Robust Learning Algorithms Over Structured Data
Learning novel concepts and relations from relational databases is an important problem with many applications in data management and machine learning. Given a relational database and training examples for a target relation, relational learning algorithms learn a definition of the target relation in terms of the existing relations in the database. For these algorithms to be effective, the input data must be clean and follow a specific format. Real-world data, however, rarely meets these requirements. In this work, we explore the challenges of learning in the presence of representational variations in data, namely variations in structure and in content.
It is well established that the same relational database may be represented under different schemas for various reasons, such as efficiency, data quality, and usability. Unfortunately, the output of relational learning algorithms often varies substantially with the choice of schema, because these algorithms must employ a language bias and heuristics in order to learn efficiently. We introduce the property of schema independence for relational learning algorithms, and study, both theoretically and empirically, how existing algorithms depend on schema transformations. We propose Castor, a relational learning algorithm that achieves schema independence by leveraging data constraints. We also propose AutoMode, a system that leverages information in the schema and content of the database to automatically induce the language bias used by relational learning systems.
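To make the notion of schema independence concrete, the following minimal Python sketch stores the same toy facts under two schemas and checks that equivalent definitions of a target relation select the same tuples. The relation names, the data, and the helpers `compose`, `senior_a`, and `senior_b` are all hypothetical and chosen only for illustration; this is not Castor's algorithm.

```python
# Hypothetical toy data: the same facts stored under two schemas.
# Schema A (decomposed): student(id), in_phase(id, phase), years(id, n)
# Schema B (composed):   student(id, phase, n)
schema_a = {
    "student": {("s1",), ("s2",)},
    "in_phase": {("s1", "post_quals"), ("s2", "pre_quals")},
    "years": {("s1", 5), ("s2", 1)},
}


def compose(db):
    """Join the three Schema A relations on the student id to obtain Schema B."""
    phase = dict(db["in_phase"])
    years = dict(db["years"])
    return {
        "student": {(sid, phase[sid], years[sid]) for (sid,) in db["student"]}
    }


schema_b = compose(schema_a)


def senior_a(db):
    # Definition over Schema A: senior(x) <- student(x), in_phase(x, post_quals)
    return {
        sid for (sid,) in db["student"] if (sid, "post_quals") in db["in_phase"]
    }


def senior_b(db):
    # Equivalent definition over Schema B: senior(x) <- student(x, post_quals, _)
    return {sid for (sid, phase, _n) in db["student"] if phase == "post_quals"}


# Both definitions select the same students, although their syntax differs.
print(senior_a(schema_a) == senior_b(schema_b))
```

A schema-independent learner, in the sense described above, would be expected to return semantically equivalent definitions (such as `senior_a` and `senior_b` here) regardless of which of the two schemas it is given.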
Another common form of representational variation is the use of different names to refer to the same entity. This is particularly common when the information about a domain is spread across several databases. Learning algorithms treat entities with different names as different entities, which may significantly reduce the accuracy of the learned models. We propose CastorX, a relational learning system that learns over heterogeneous databases. The user specifies matching attributes between the heterogeneous databases through matching dependencies. Because the learning process may become expensive, CastorX implements sampling techniques that allow it to learn efficiently and output accurate definitions.
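The effect of name variation can be sketched in a few lines of Python. The example below assumes two hypothetical toy databases whose movie titles differ slightly, and stands in for a matching dependency with a simple string-similarity test from the standard library's `difflib`. The data, the 0.8 threshold, and the helpers `similar` and `join_titles` are assumptions made for illustration, not the mechanism CastorX uses.

```python
from difflib import SequenceMatcher

# Hypothetical toy databases that name the same movies differently.
db1_movies = {("m1", "Star Wars: Episode IV"), ("m2", "The Godfather")}
db2_ratings = {
    ("Star Wars Episode IV", 8.6),
    ("the godfather", 9.2),
    ("Alien", 8.5),
}


def similar(a, b, threshold=0.8):
    """Treat two titles as matching if their similarity ratio is high enough."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio() >= threshold


def join_titles(movies, ratings, match):
    """Pair movie ids with ratings whenever `match` accepts the two titles."""
    return {
        (mid, score)
        for (mid, title) in movies
        for (rated_title, score) in ratings
        if match(title, rated_title)
    }


# Exact equality finds no common entities across the two databases.
exact = join_titles(db1_movies, db2_ratings, lambda a, b: a == b)

# Similarity-based matching on the title attributes recovers both movies.
matched = join_titles(db1_movies, db2_ratings, similar)

print(sorted(exact))
print(sorted(matched))
```

With exact equality the join is empty, so a learner would see no connection between the two databases; with the similarity-based match, `m1` and `m2` are linked to their ratings while the unrelated title is left out.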
Major Advisor: Arash Termehchy
Committee: Alan Fern
Committee: Prasad Tadepalli
Committee: David Maier
GCR: Hector Vergara
Tuesday, November 6 at 3:00pm to 5:00pm
Kelley Engineering Center, 1126
110 SW Park Terrace, Corvallis, OR 97331