
dc.contributor.author: Grigalis, Tomas
dc.contributor.author: Čenys, Antanas
dc.date.accessioned: 2023-09-18T20:02:14Z
dc.date.available: 2023-09-18T20:02:14Z
dc.date.issued: 2014
dc.identifier.issn: 1820-0214
dc.identifier.other: (BIS)VGT02-000027998
dc.identifier.uri: https://etalpykla.vilniustech.lt/handle/123456789/145938
dc.description.abstract: Template-generated Web pages contain most of the structured data on the Web. Clustering these pages according to their template structure is an important problem in wrapper-based structured data extraction systems. These systems extract structured data using wrappers, each of which must be matched only to pages generated from a particular template. Selecting the pages of a single template type from all crawled Web pages is a time-consuming task. Although methods exist to cluster Web pages according to their structural similarity, in most cases they are too computationally expensive to be applicable at Web scale. We propose a novel, highly scalable approach to structurally cluster Web pages by employing the XPath addresses of inbound inner-site links. We demonstrate the effectiveness of our method by clustering more than one million Web pages from many real-world Websites in a few minutes and achieving more than 90% accuracy. [eng]
dc.format: PDF
dc.format.extent: p. 111-131
dc.format.medium: tekstas / txt
dc.language.iso: eng
dc.relation.isreferencedby: Scopus
dc.relation.isreferencedby: Science Citation Index Expanded (Web of Science)
dc.source.uri: http://www.comsis.org/archive.php?show=ppr464-1304
dc.subject: IK01 - Informacinės technologijos, ontologinės ir telematikos sistemos / Information technologies, ontological and telematic systems
dc.title: Using XPaths of inbound links to cluster template-generated web pages
dc.type: Straipsnis Web of Science DB / Article in Web of Science DB
dcterms.references: 35
dc.type.pubtype: S1 - Straipsnis Web of Science DB / Web of Science DB article
dc.contributor.institution: Vilniaus Gedimino technikos universitetas
dc.contributor.faculty: Fundamentinių mokslų fakultetas / Faculty of Fundamental Sciences
dc.contributor.department: Informacinių sistemų katedra / Department of Information Systems
dc.contributor.department: Taikomosios informatikos institutas / Institute of Applied Computer Science
dc.subject.researchfield: T 007 - Informatikos inžinerija / Informatics engineering
dc.subject.ltspecializations: L106 - Transportas, logistika ir informacinės ir ryšių technologijos (IRT) / Transport, logistic and information and communication technologies
dc.subject.en: Web data extraction
dc.subject.en: Structural clustering
dc.subject.en: Template-generated pages
dc.subject.en: Wrapper induction
dcterms.sourcetitle: Computer science and information systems (ComSIS)
dc.description.issue: iss. 1
dc.description.volume: Vol. 11
dc.publisher.name: ComSIS Consortium
dc.publisher.city: Novi Sad (Serbia)
dc.identifier.doi: 10.2298/CSIS130416020G
dc.identifier.elaba: 4062966

