Towards web-scale structured web data extraction
Abstract
In this paper we present an ongoing PhD research on unsupervised and domain-independent structured data extraction from the Web. We propose a novel method to extract structured data records from template-generated Web pages. The method is based on clustering visually similar Web page elements by exploiting their visual formatting and HTML structural features. Tag paths of clustered Web page elements are then employed to derive extraction rules. These rules, called wrappers, can be later reused on thousands of same template-generated Web pages. This opens the possibility for the proposed method to be deployed in Web-Scale structured data extraction systems.