• Lietuvių
    • English
  • English 
    • Lietuvių
    • English
  • Login
View Item 
  •   DSpace Home
  • Mokslinės publikacijos (PDB) / Scientific publications (PDB)
  • Moksliniai ir apžvalginiai straipsniai / Research and Review Articles
  • Straipsniai Web of Science ir/ar Scopus referuojamuose leidiniuose / Articles in Web of Science and/or Scopus indexed sources
  • View Item
  •   DSpace Home
  • Mokslinės publikacijos (PDB) / Scientific publications (PDB)
  • Moksliniai ir apžvalginiai straipsniai / Research and Review Articles
  • Straipsniai Web of Science ir/ar Scopus referuojamuose leidiniuose / Articles in Web of Science and/or Scopus indexed sources
  • View Item
JavaScript is disabled for your browser. Some features of this site may not work without it.

Using XPaths of inbound links to cluster template-generated web pages

Thumbnail
View/Open
comsis_Vol11_No1_111?131_cenys.pdf (625.5Kb)
Date
2014
Author
Grigalis, Tomas
Čenys, Antanas
Metadata
Show full item record
Abstract
Template-generated Web pages contain most of structured data on the Web. Clustering these pages according to their template structure is an important problem in wrapper-based structured data extraction systems. These systems extract structured data using wrappers that must be matched to only particular template pages. Selecting single type of template from all crawled Web pages is a time consuming task. Although there are methods to cluster Web pages according to their structural similarity, however, in most cases they are too computationally expensive to be applicable at Web-Scale. We propose a novel highly scalable approach to structurally cluster Web pages by employing XPath addresses of inbound inner-site links. We demonstrate the effectiveness of our method by clustering more than one million Web pages from many real world Websites in a few minutes and achieving>90% accuracy.
Issue date (year)
2014
URI
https://etalpykla.vilniustech.lt/handle/123456789/145938
Collections
  • Straipsniai Web of Science ir/ar Scopus referuojamuose leidiniuose / Articles in Web of Science and/or Scopus indexed sources [7946]

 

 

Browse

All of DSpaceCommunities & CollectionsBy Issue DateAuthorsTitlesSubjects / KeywordsInstitutionFacultyDepartment / InstituteTypeSourcePublisherType (PDB/ETD)Research fieldStudy directionVILNIUS TECH research priorities and topicsLithuanian intelligent specializationThis CollectionBy Issue DateAuthorsTitlesSubjects / KeywordsInstitutionFacultyDepartment / InstituteTypeSourcePublisherType (PDB/ETD)Research fieldStudy directionVILNIUS TECH research priorities and topicsLithuanian intelligent specialization

My Account

LoginRegister