
dc.contributor.author: Griazev, Kiril
dc.contributor.author: Ramanauskaitė, Simona
dc.date.accessioned: 2023-09-18T16:39:59Z
dc.date.available: 2023-09-18T16:39:59Z
dc.date.issued: 2023
dc.identifier.uri: https://etalpykla.vilniustech.lt/handle/123456789/115700
dc.description.abstract: Web page segmentation is one of the most influential factors for the automated integration of web page content with other systems. Existing solutions focus on segmentation but do not provide a more detailed description of each segment, including its range (the minimum and maximum HTML code bounds covering the segment content) and its variants (the same segments with different content). Therefore, this paper proposes a novel solution designed to find all web page content blocks and describe them in detail for further use. It applies text similarity and document object model (DOM) tree analysis methods to indicate the maximum and minimum ranges of each identified HTML block. In addition, it indicates each block's relation to other blocks, both hierarchical and sibling. The evaluation of the method reveals its ability to identify more content blocks than human labeling (in manual labeling, only 24% of blocks were labeled). By using the proposed method, manual labeling effort could be reduced by at least 70%. The method performed better than the other web page segmentation methods analyzed, and it achieved better recall because it processes every block present on a page and provides a more detailed division of the page into content blocks by reporting block boundary ranges and block variation data. [eng]
dc.format: PDF
dc.format.extent: p. 1-16
dc.format.medium: text / txt
dc.language.iso: eng
dc.relation.isreferencedby: Science Citation Index Expanded (Web of Science)
dc.relation.isreferencedby: Scopus
dc.relation.isreferencedby: DOAJ
dc.relation.isreferencedby: INSPEC
dc.relation.isreferencedby: Agris
dc.rights: Freely available online
dc.source.uri: https://www.mdpi.com/2076-3417/13/9/5680
dc.source.uri: https://talpykla.elaba.lt/elaba-fedora/objects/elaba:165049601/datastreams/MAIN/content
dc.title: Web page content block identification with extended block properties
dc.type: Article in Web of Science DB
dcterms.accessRights: This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/)
dcterms.license: Creative Commons – Attribution – 4.0 International
dcterms.references: 41
dc.type.pubtype: S1 - Web of Science DB article
dc.contributor.institution: Vilniaus Gedimino technikos universitetas
dc.contributor.faculty: Faculty of Fundamental Sciences
dc.subject.researchfield: T 007 - Informatics engineering
dc.subject.researchfield: N 009 - Computer science
dc.subject.vgtuprioritizedfields: IK0303 - Artificial intelligence and decision support systems
dc.subject.ltspecializations: L106 - Transport, logistic and information and communication technologies
dc.subject.en: web segmentation
dc.subject.en: hierarchical segments
dc.subject.en: web page labeling
dcterms.sourcetitle: Applied sciences: Special issue: New horizons in web search, web data mining, and web-based applications
dc.description.issue: iss. 9
dc.description.volume: vol. 13
dc.publisher.name: MDPI
dc.publisher.city: Basel
dc.identifier.doi: 000987192400001
dc.identifier.doi: 10.3390/app13095680
dc.identifier.elaba: 165049601
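
The abstract describes combining DOM tree analysis with text similarity to find block variants (the same segment repeated with different content). The Python sketch below illustrates that general idea only; it is not the authors' algorithm. The `Node`, `TreeBuilder`, and `sibling_variants` names, the tag-only structural signature, and the sample HTML are all assumptions made for illustration.

```python
from difflib import SequenceMatcher
from html.parser import HTMLParser

# Elements that never have a closing tag, so the parser must not descend into them.
VOID = {"area", "base", "br", "col", "embed", "hr", "img",
        "input", "link", "meta", "source", "track", "wbr"}


class Node:
    """Hypothetical minimal DOM node: tag name, parent, children, own text."""

    def __init__(self, tag, parent=None):
        self.tag = tag
        self.parent = parent
        self.children = []
        self.text = ""

    def signature(self):
        # Structural signature built from tag names only; text is ignored.
        return self.tag + "(" + ",".join(c.signature() for c in self.children) + ")"

    def full_text(self):
        # Own text followed by the text of all descendants, in document order.
        parts = [self.text.strip()] + [c.full_text() for c in self.children]
        return " ".join(p for p in parts if p)


class TreeBuilder(HTMLParser):
    """Builds a Node tree from reasonably well-formed HTML."""

    def __init__(self):
        super().__init__()
        self.root = Node("root")
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        node = Node(tag, self.current)
        self.current.children.append(node)
        if tag not in VOID:
            self.current = node

    def handle_endtag(self, tag):
        # Only step back up when the closing tag matches the open element.
        if tag == self.current.tag and self.current.parent is not None:
            self.current = self.current.parent

    def handle_data(self, data):
        self.current.text += data


def sibling_variants(node):
    """Yield groups of siblings that share a structure but differ in text."""
    groups = {}
    for child in node.children:
        groups.setdefault(child.signature(), []).append(child)
    for members in groups.values():
        if len(members) > 1 and len({m.full_text() for m in members}) > 1:
            yield members
    for child in node.children:
        yield from sibling_variants(child)


# Assumed sample page: two "card" blocks with identical markup, different content.
SAMPLE = """
<div id="list">
  <div class="card"><h2>Alpha</h2><p>First item</p></div>
  <div class="card"><h2>Beta</h2><p>Second item</p></div>
</div>
"""

builder = TreeBuilder()
builder.feed(SAMPLE)
variant_groups = list(sibling_variants(builder.root))
texts = [m.full_text() for m in variant_groups[0]]
# Text similarity between the two variants, as a ratio between 0 and 1.
similarity = SequenceMatcher(None, texts[0], texts[1]).ratio()
```

Here the two cards are grouped because their tag structure matches while their text differs. A real implementation would also need to determine each block's minimum and maximum HTML bounds, which this sketch does not attempt.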

