Web page content block identification with extended block properties

Griazev, Kiril; Ramanauskaitė, Simona

dc.contributor.author	Griazev, Kiril
dc.contributor.author	Ramanauskaitė, Simona
dc.date.accessioned	2023-09-18T16:39:59Z
dc.date.available	2023-09-18T16:39:59Z
dc.date.issued	2023
dc.identifier.uri	https://etalpykla.vilniustech.lt/handle/123456789/115700
dc.description.abstract	Web page segmentation is one of the most influential factors for the automated integration of web page content with other systems. Existing solutions focus on segmentation itself but do not describe each segment in more detail, including its range (the minimum and maximum HTML code bounds covering the segment content) and its variants (the same segment with different content). Therefore, the paper proposes a novel solution designed to find all web page content blocks and describe them in detail for further use. It applies text similarity and document object model (DOM) tree analysis methods to indicate the maximum and minimum ranges of each identified HTML block. In addition, it indicates each block's relation to other blocks, both hierarchical and sibling. The evaluation of the method reveals its ability to identify more content blocks than human labeling (in manual labeling, only 24% of blocks were labeled). By using the proposed method, manual labeling effort could be reduced by at least 70%. The method performed better than the other analyzed web page segmentation methods: it achieved better recall owing to its focus on processing every block present on a page, and it provides a more detailed division of the web page into content blocks by presenting block boundary range and block variation data.	eng
dc.format	PDF
dc.format.extent	p. 1-16
dc.format.medium	text / txt
dc.language.iso	eng
dc.relation.isreferencedby	Science Citation Index Expanded (Web of Science)
dc.relation.isreferencedby	Scopus
dc.relation.isreferencedby	DOAJ
dc.relation.isreferencedby	INSPEC
dc.relation.isreferencedby	Agris
dc.rights	Freely available online
dc.source.uri	https://www.mdpi.com/2076-3417/13/9/5680
dc.source.uri	https://talpykla.elaba.lt/elaba-fedora/objects/elaba:165049601/datastreams/MAIN/content
dc.title	Web page content block identification with extended block properties
dc.type	Straipsnis Web of Science DB / Article in Web of Science DB
dcterms.accessRights	This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/)
dcterms.license	Creative Commons – Attribution – 4.0 International
dcterms.references	41
dc.type.pubtype	S1 - Straipsnis Web of Science DB / Web of Science DB article
dc.contributor.institution	Vilniaus Gedimino technikos universitetas
dc.contributor.faculty	Fundamentinių mokslų fakultetas / Faculty of Fundamental Sciences
dc.subject.researchfield	T 007 - Informatikos inžinerija / Informatics engineering
dc.subject.researchfield	N 009 - Informatika / Computer science
dc.subject.vgtuprioritizedfields	IK0303 - Dirbtinio intelekto ir sprendimų priėmimo sistemos / Artificial intelligence and decision support systems
dc.subject.ltspecializations	L106 - Transportas, logistika ir informacinės ir ryšių technologijos (IRT) / Transport, logistic and information and communication technologies
dc.subject.en	web segmentation
dc.subject.en	hierarchical segments
dc.subject.en	web page labeling
dcterms.sourcetitle	Applied sciences: Special issue: New horizons in web search, web data mining, and web-based applications
dc.description.issue	iss. 9
dc.description.volume	vol. 13
dc.publisher.name	MDPI
dc.publisher.city	Basel
dc.identifier.doi	000987192400001
dc.identifier.doi	10.3390/app13095680
dc.identifier.elaba	165049601

Files in this item

Name: applsci-13-05680.pdf
Size: 1.603Mb
Format: PDF

This item appears in the following Collection(s)

Straipsniai Web of Science ir/ar Scopus referuojamuose leidiniuose / Articles in Web of Science and/or Scopus indexed sources [7946]