dc.contributor.author | Grigalis, Tomas | |
dc.contributor.author | Čenys, Antanas | |
dc.date.accessioned | 2023-09-18T19:15:38Z | |
dc.date.available | 2023-09-18T19:15:38Z | |
dc.date.issued | 2012 | |
dc.identifier.other | (BIS)VGT02-000025148 | |
dc.identifier.uri | https://etalpykla.vilniustech.lt/handle/123456789/137400 | |
dc.description.abstract | Record segmentation is a core problem in structured web data extraction. In this paper we present a novel technique that segments structured web data into individual data records that come from an underlying database. The proposed technique exploits visual as well as structural features of web page elements to group them into semantically similar clusters. The resulting clusters reflect the page structure and are used to segment data records. During the segmentation process the technique also generates XPath expressions. These expressions can later be used to extract data records directly from web pages generated from the same template, without repeating the clustering and segmentation processes. The extracted structured data can be reused in a wide range of applications, such as price comparison portals, meta-search, and knowledge bases. Experimental evaluation of the proposed technique on three publicly available benchmark data sets demonstrates nearly perfect results in terms of precision and recall. | eng |
dc.format | PDF | |
dc.format.extent | p. 38-47 | |
dc.format.medium | text / txt | |
dc.language.iso | eng | |
dc.relation.ispartofseries | Communications in Computer and Information Science vol. 319 1865-0929 1865-0937 | |
dc.relation.isreferencedby | Conference Proceedings Citation Index - Science (Web of Science) | |
dc.relation.isreferencedby | SpringerLink | |
dc.relation.isreferencedby | Scopus | |
dc.relation.isreferencedby | MathSciNet | |
dc.source.uri | https://doi.org/10.1007/978-3-642-33308-8_4 | |
dc.title | Generating Xpath expressions for structured web data record segmentation | |
dc.type | Straipsnis konferencijos darbų leidinyje Web of Science DB / Paper in conference publication in Web of Science DB | |
dcterms.references | 18 | |
dc.type.pubtype | P1a - Straipsnis konferencijos darbų leidinyje Web of Science DB / Article in conference proceedings Web of Science DB | |
dc.contributor.institution | Vilniaus Gedimino technikos universitetas | |
dc.contributor.faculty | Fundamentinių mokslų fakultetas / Faculty of Fundamental Sciences | |
dc.contributor.department | Informacinių sistemų katedra / Department of Information Systems | |
dc.subject.researchfield | T 007 - Informatikos inžinerija / Informatics engineering | |
dc.subject.en | Web data segmentation | |
dc.subject.en | Structured web data | |
dc.subject.en | Web data extraction | |
dc.subject.en | Wrapper induction | |
dcterms.sourcetitle | Information and software technologies : 18th International Conference, ICIST 2012, Kaunas, Lithuania, September 13-14, 2012 : proceedings | |
dc.publisher.name | Springer | |
dc.publisher.city | New York | |
dc.identifier.doi | 000312463800004 | |
dc.identifier.doi | 10.1007/978-3-642-33308-8_4 | |
dc.identifier.elaba | 3994295 | |
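
The abstract notes that XPath expressions generated during segmentation can later be reused to extract records directly from pages built from the same template. The following is a minimal sketch of that reuse step only, not the authors' implementation: the sample page, the `RECORD_XPATH` and `FIELD_XPATHS` values, and the `extract_records` helper are all hypothetical, and the clustering that would produce such expressions is not shown. It uses the limited XPath subset supported by Python's standard `xml.etree.ElementTree`, so the sample markup is kept well-formed.

```python
import xml.etree.ElementTree as ET

# Hypothetical template-generated page (well-formed so the stdlib XML parser accepts it).
PAGE = """
<html><body>
  <div class="header">Shop</div>
  <div class="results">
    <div class="record"><span class="title">Laptop</span><span class="price">999</span></div>
    <div class="record"><span class="title">Phone</span><span class="price">499</span></div>
  </div>
  <div class="footer">Contact</div>
</body></html>
"""

# Assumed output of an earlier segmentation run: one expression that selects
# every data record, plus expressions relative to a record for its fields.
RECORD_XPATH = ".//div[@class='results']/div[@class='record']"
FIELD_XPATHS = {
    "title": "./span[@class='title']",
    "price": "./span[@class='price']",
}


def extract_records(page, record_xpath, field_xpaths):
    """Apply stored XPath expressions to a page and return one dict per record."""
    root = ET.fromstring(page)
    records = []
    for node in root.findall(record_xpath):
        # Each field path is evaluated relative to the record node.
        records.append({name: node.findtext(xp) for name, xp in field_xpaths.items()})
    return records


records = extract_records(PAGE, RECORD_XPATH, FIELD_XPATHS)
for record in records:
    print(record)
```

Because the expressions are stored separately from the page, the same call can be repeated on any other page that shares the template. Real-world HTML is often not well-formed, so a tolerant HTML parser would likely be needed in place of `ElementTree`.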