dc.contributor.author | Grigalis, Tomas | |
dc.contributor.author | Čenys, Antanas | |
dc.date.accessioned | 2023-09-18T20:04:02Z | |
dc.date.available | 2023-09-18T20:04:02Z | |
dc.date.issued | 2014 | |
dc.identifier.issn | 0948-695X | |
dc.identifier.other | (BIS)VGT02-000028249 | |
dc.identifier.uri | https://etalpykla.vilniustech.lt/handle/123456789/146446 | |
dc.description.abstract | This paper studies structured data extraction from template-generated Web pages. Such pages contain most of the structured data on the Web. Extracted structured data can later be integrated and reused in a wide range of applications, such as price comparison portals, business intelligence tools, and various mashups. This encourages industry and academia to seek automatic solutions. To tackle the problem of automatic structured Web data extraction, we present a new approach: structured data extraction based on clustering visually similar Web page elements. Our method, called ClustVX, combines visual and pure HTML features of a Web page to cluster visually similar Web page elements and then extract structured Web data. ClustVX can extract structured data from Web pages where more than one data record is present. With extensive experimental evaluation on three benchmark datasets, we demonstrate that ClustVX achieves better results than other state-of-the-art automatic structured Web data extraction methods. | eng |
dc.format | PDF | |
dc.format.extent | p. 169-192 | |
dc.format.medium | tekstas / txt | |
dc.language.iso | eng | |
dc.relation.isreferencedby | Scopus | |
dc.relation.isreferencedby | Science Citation Index Expanded (Web of Science) | |
dc.source.uri | http://www.jucs.org/jucs_20_2/unsupervised_structured_data_extraction | |
dc.subject | IK01 - Informacinės technologijos, ontologinės ir telematikos sistemos / Information technologies, ontological and telematic systems | |
dc.title | Unsupervised structured data extraction from template-generated web pages | |
dc.type | Straipsnis Web of Science DB / Article in Web of Science DB | |
dcterms.references | 46 | |
dc.type.pubtype | S1 - Straipsnis Web of Science DB / Web of Science DB article | |
dc.contributor.institution | Vilniaus Gedimino technikos universitetas | |
dc.contributor.faculty | Fundamentinių mokslų fakultetas / Faculty of Fundamental Sciences | |
dc.contributor.department | Informacinių sistemų katedra / Department of Information Systems | |
dc.contributor.department | Taikomosios informatikos institutas / Institute of Applied Computer Science | |
dc.subject.researchfield | T 007 - Informatikos inžinerija / Informatics engineering | |
dc.subject.ltspecializations | L106 - Transportas, logistika ir informacinės ir ryšių technologijos (IRT) / Transport, logistic and information and communication technologies | |
dc.subject.en | Deep Web | |
dc.subject.en | Data extraction | |
dc.subject.en | Structured web data | |
dc.subject.en | Wrapper induction | |
dcterms.sourcetitle | Journal of Universal Computer Science (J.UCS) | |
dc.description.issue | iss.2 | |
dc.description.volume | Vol. 20 | |
dc.publisher.name | Graz University of Technology | |
dc.publisher.city | Graz | |
dc.identifier.doi | 10.2298/CSIS130416020G | |
dc.identifier.elaba | 4070163 | |