dc.contributor.author | Grigalis, Tomas | |
dc.date.accessioned | 2023-09-18T19:47:35Z | |
dc.date.available | 2023-09-18T19:47:35Z | |
dc.date.issued | 2013 | |
dc.identifier.other | (BIS)VGT02-000026698 | |
dc.identifier.uri | https://etalpykla.vilniustech.lt/handle/123456789/143224 | |
dc.description.abstract | In this paper we present ongoing PhD research on unsupervised, domain-independent structured data extraction from the Web. We propose a novel method for extracting structured data records from template-generated Web pages. The method clusters visually similar Web page elements by exploiting their visual formatting and HTML structural features. The tag paths of the clustered elements are then used to derive extraction rules. These rules, called wrappers, can later be reused on thousands of Web pages generated from the same template. This opens the possibility of deploying the proposed method in web-scale structured data extraction systems. | eng |
dc.format | PDF | |
dc.format.extent | p. 753-758 | |
dc.format.medium | text / txt | |
dc.language.iso | eng | |
dc.title | Towards web-scale structured web data extraction | |
dc.type | Straipsnis recenzuotame konferencijos darbų leidinyje / Paper published in peer-reviewed conference publication | |
dcterms.references | 34 | |
dc.type.pubtype | P1d - Straipsnis recenzuotame konferencijos darbų leidinyje / Article published in peer-reviewed conference proceedings | |
dc.contributor.institution | Vilniaus Gedimino technikos universitetas | |
dc.contributor.faculty | Fundamentinių mokslų fakultetas / Faculty of Fundamental Sciences | |
dc.contributor.department | Informacinių sistemų katedra / Department of Information Systems | |
dc.subject.researchfield | T 007 - Informatikos inžinerija / Informatics engineering | |
dcterms.sourcetitle | Web Search and Data Mining (WSDM'13) : proceedings of the sixth ACM international conference | |
dc.publisher.name | ACM | |
dc.publisher.city | New York | |
dc.identifier.doi | 10.1145/2433396.2433491 | |
dc.identifier.elaba | 4030629 | |