dc.contributor.author | Grigalis, Tomas | |
dc.contributor.author | Čenys, Antanas | |
dc.date.accessioned | 2023-09-18T20:02:14Z | |
dc.date.available | 2023-09-18T20:02:14Z | |
dc.date.issued | 2014 | |
dc.identifier.issn | 1820-0214 | |
dc.identifier.other | (BIS)VGT02-000027998 | |
dc.identifier.uri | https://etalpykla.vilniustech.lt/handle/123456789/145938 | |
dc.description.abstract | Template-generated Web pages contain most of the structured data on the Web. Clustering these pages according to their template structure is an important problem in wrapper-based structured data extraction systems. These systems extract structured data using wrappers that must be matched only to pages of a particular template. Selecting pages of a single template type from all crawled Web pages is a time-consuming task. Although there are methods to cluster Web pages according to their structural similarity, in most cases they are too computationally expensive to be applicable at Web scale. We propose a novel, highly scalable approach to structurally cluster Web pages by employing XPath addresses of inbound inner-site links. We demonstrate the effectiveness of our method by clustering more than one million Web pages from many real-world Websites in a few minutes and achieving >90% accuracy. | eng |
dc.format | PDF | |
dc.format.extent | p. 111-131 | |
dc.format.medium | tekstas / txt | |
dc.language.iso | eng | |
dc.relation.isreferencedby | Scopus | |
dc.relation.isreferencedby | Science Citation Index Expanded (Web of Science) | |
dc.source.uri | http://www.comsis.org/archive.php?show=ppr464-1304 | |
dc.subject | IK01 - Informacinės technologijos, ontologinės ir telematikos sistemos / Information technologies, ontological and telematic systems | |
dc.title | Using XPaths of inbound links to cluster template-generated web pages | |
dc.type | Straipsnis Web of Science DB / Article in Web of Science DB | |
dcterms.references | 35 | |
dc.type.pubtype | S1 - Straipsnis Web of Science DB / Web of Science DB article | |
dc.contributor.institution | Vilniaus Gedimino technikos universitetas | |
dc.contributor.faculty | Fundamentinių mokslų fakultetas / Faculty of Fundamental Sciences | |
dc.contributor.department | Informacinių sistemų katedra / Department of Information Systems | |
dc.contributor.department | Taikomosios informatikos institutas / Institute of Applied Computer Science | |
dc.subject.researchfield | T 007 - Informatikos inžinerija / Informatics engineering | |
dc.subject.ltspecializations | L106 - Transportas, logistika ir informacinės ir ryšių technologijos (IRT) / Transport, logistics, and information and communication technologies (ICT) | |
dc.subject.en | Web data extraction | |
dc.subject.en | Structural clustering | |
dc.subject.en | Template-generated pages | |
dc.subject.en | Wrapper induction | |
dcterms.sourcetitle | Computer science and information systems (ComSIS) | |
dc.description.issue | iss. 1 | |
dc.description.volume | Vol. 11 | |
dc.publisher.name | ComSIS Consortium | |
dc.publisher.city | Novi Sad (Serbia) | |
dc.identifier.doi | 10.2298/CSIS130416020G | |
dc.identifier.elaba | 4062966 | |
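
The abstract describes clustering pages by the XPath addresses of the inner-site links that point to them. The following is only a minimal sketch of that general idea, not the authors' algorithm: it assumes the links have already been extracted as (XPath, target URL) pairs, that the XPaths carry no positional indices, and that each page is assigned to the single most frequent XPath among its inbound links. The function name `cluster_by_inbound_xpath` and the sample data are hypothetical.

```python
from collections import Counter, defaultdict


def cluster_by_inbound_xpath(links):
    """Group target URLs by the most frequent XPath of their inbound links.

    `links` is an iterable of (xpath, target_url) pairs (an assumed input
    format; the record does not describe the authors' data structures).
    """
    # Count, for every target page, how often each XPath points at it.
    xpaths_per_target = defaultdict(Counter)
    for xpath, target in links:
        xpaths_per_target[target][xpath] += 1

    # Assign each target to its dominant XPath; ties are broken
    # alphabetically so the result is deterministic.
    clusters = defaultdict(set)
    for target, counts in xpaths_per_target.items():
        dominant = min(counts, key=lambda xp: (-counts[xp], xp))
        clusters[dominant].add(target)
    return dict(clusters)


# Hypothetical sample: product pages are mostly linked from a list,
# category pages from a navigation bar.
example_links = [
    ("/html/body/div/ul/li/a", "/product/1"),
    ("/html/body/div/ul/li/a", "/product/2"),
    ("/html/body/div/nav/a", "/category/a"),
    ("/html/body/div/nav/a", "/category/b"),
    ("/html/body/div/ul/li/a", "/product/1"),
    ("/html/body/div/nav/a", "/product/1"),
]

example_clusters = cluster_by_inbound_xpath(example_links)
```

In this sketch `/product/1` is linked from both locations but lands with the other product page, because the list XPath is its most frequent inbound address. A single pass over the link list is what keeps this kind of grouping cheap compared with pairwise structural comparison of full pages.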