
Thesis Examination: MSc Student Hoora Rezaei Moghaddam

Thu. Nov. 14 09:00 AM - Thu. Nov. 14 11:00 AM
Contact: Dylan Armitage
Location: 1RC028


Hoora Rezaei Moghaddam - MSc Student in Applied Computer Science and Society

EXPLORING SCALABILITY AND CONCEPT DRIFT ISSUES IN LEARNING CATEGORICAL FACTS WITH TOLERANCE ROUGH SETS

The growth of the web has spawned knowledge bases built from web corpora, where construction is performed using semi-supervised or unsupervised methods. These methods require minimal or no human intervention and can recursively learn new nouns, relations, and instances in a fully automated, scalable manner. Three typical issues arise from this form of construction, or fact learning: i) the training examples (i.e., nouns and their known categories, also known as categorical facts) are few, ii) a noun may belong to more than one category depending on its contextual patterns, and iii) new nouns end up being miscategorized, a problem also known as concept drift. Recent efforts at improving the accuracy of learning categories of nouns and relations using a tolerance form of rough sets and fuzzy rough sets were successfully demonstrated with two semi-supervised learning algorithms: Tolerant Pattern Learner (TPL 1.0) and Fuzzy Rough Pattern Learner (FRL), respectively. This thesis revisits two issues identified during the development of the TPL 1.0 and FRL algorithms, namely: i) scalability and ii) the handling of concept drift in a larger data set with more iterations during the semi-supervised learning process. The contributions of this thesis include: i) extracting categorical information from a large, noisy data set of crawled web pages, ii) preparing a contextual co-occurrence matrix for experimentation, and iii) redesigning the Tolerant Pattern Learner algorithm (TPL 2.0) to learn categorical facts. This thesis demonstrates that the TPL 2.0 algorithm produces promising results in terms of precision and handles concept drift for twice the number of iterations used in all previous experiments with TPL 1.0, FRL, and Coupled Bayesian Sets (CBS).