| dc.contributor.author | Griazev, Kiril | |
| dc.contributor.author | Ramanauskaitė, Simona | |
| dc.date.accessioned | 2023-09-18T16:39:59Z | |
| dc.date.available | 2023-09-18T16:39:59Z | |
| dc.date.issued | 2023 | |
| dc.identifier.uri | https://etalpykla.vilniustech.lt/handle/123456789/115700 | |
| dc.description.abstract | Web page segmentation is one of the most influential factors for the automated integration of web page content with other systems. Existing solutions focus on segmentation itself but do not describe each segment in detail, including its range (the minimum and maximum HTML code bounds covering the segment content) and its variants (the same segments with different content). Therefore, this paper proposes a novel solution designed to find all web page content blocks and describe them in detail for further use. It applies text similarity and document object model (DOM) tree analysis methods to indicate the maximum and minimum ranges of each identified HTML block. In addition, it indicates each block's relation to other blocks, both hierarchical and sibling. The evaluation shows that the method identifies more content blocks than human labeling (only 24% of blocks were labeled manually). By using the proposed method, manual labeling effort could be reduced by at least 70%. The method performed better than the other analyzed web page segmentation methods: it achieved better recall because it processes every block present on a page, and it provides a more detailed division of a web page into content blocks by presenting block boundary range and block variation data. | eng |
| dc.format | PDF | |
| dc.format.extent | p. 1-16 | |
| dc.format.medium | text / txt | |
| dc.language.iso | eng | |
| dc.relation.isreferencedby | Science Citation Index Expanded (Web of Science) | |
| dc.relation.isreferencedby | Scopus | |
| dc.relation.isreferencedby | DOAJ | |
| dc.relation.isreferencedby | INSPEC | |
| dc.relation.isreferencedby | Agris | |
| dc.rights | Freely available online | |
| dc.source.uri | https://www.mdpi.com/2076-3417/13/9/5680 | |
| dc.source.uri | https://talpykla.elaba.lt/elaba-fedora/objects/elaba:165049601/datastreams/MAIN/content | |
| dc.title | Web page content block identification with extended block properties | |
| dc.type | Straipsnis Web of Science DB / Article in Web of Science DB | |
| dcterms.accessRights | This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/) | |
| dcterms.license | Creative Commons – Attribution – 4.0 International | |
| dcterms.references | 41 | |
| dc.type.pubtype | S1 - Straipsnis Web of Science DB / Web of Science DB article | |
| dc.contributor.institution | Vilniaus Gedimino technikos universitetas | |
| dc.contributor.faculty | Fundamentinių mokslų fakultetas / Faculty of Fundamental Sciences | |
| dc.subject.researchfield | T 007 - Informatikos inžinerija / Informatics engineering | |
| dc.subject.researchfield | N 009 - Informatika / Computer science | |
| dc.subject.vgtuprioritizedfields | IK0303 - Dirbtinio intelekto ir sprendimų priėmimo sistemos / Artificial intelligence and decision support systems | |
| dc.subject.ltspecializations | L106 - Transportas, logistika ir informacinės ir ryšių technologijos (IRT) / Transport, logistics and information and communication technologies (ICT) | |
| dc.subject.en | web segmentation | |
| dc.subject.en | hierarchical segments | |
| dc.subject.en | web page labeling | |
| dcterms.sourcetitle | Applied sciences: Special issue: New horizons in web search, web data mining, and web-based applications | |
| dc.description.issue | iss. 9 | |
| dc.description.volume | vol. 13 | |
| dc.publisher.name | MDPI | |
| dc.publisher.city | Basel | |
| dc.identifier.doi | 000987192400001 | |
| dc.identifier.doi | 10.3390/app13095680 | |
| dc.identifier.elaba | 165049601 | |