| dc.contributor.author | Griazev, Kiril | |
| dc.contributor.author | Ramanauskaitė, Simona | |
| dc.date.accessioned | 2023-09-18T17:06:39Z | |
| dc.date.available | 2023-09-18T17:06:39Z | |
| dc.date.issued | 2018 | |
| dc.identifier.uri | https://etalpykla.vilniustech.lt/handle/123456789/119734 | |
| dc.description.abstract | Automatic data extraction is an important task, but websites contain a lot of secondary information that has little value, so it is important to correctly identify information blocks. This can be done using various techniques, one of which is HTML block comparison: blocks are identified by estimating their similarity score. This paper proposes an algorithm for HTML block similarity estimation using multiple methods: structure similarity; structure and tag similarity; and structure, tag, and content similarity. Additionally, the proposed algorithm is tested against other open-source algorithms by analyzing the same data. | eng |
| dc.format | PDF | |
| dc.format.extent | p. 1-4 | |
| dc.format.medium | text / txt | |
| dc.language.iso | eng | |
| dc.relation.isreferencedby | Conference Proceedings Citation Index - Science (Web of Science) | |
| dc.relation.isreferencedby | IEEE Xplore | |
| dc.relation.isreferencedby | Scopus | |
| dc.source.uri | https://ieeexplore.ieee.org/document/8592241 | |
| dc.title | HTML block similarity estimation | |
| dc.type | Straipsnis konferencijos darbų leidinyje Web of Science DB / Paper in conference publication in Web of Science DB | |
| dcterms.references | 11 | |
| dc.type.pubtype | P1a - Straipsnis konferencijos darbų leidinyje Web of Science DB / Article in conference proceedings Web of Science DB | |
| dc.contributor.institution | Vilniaus Gedimino technikos universitetas | |
| dc.contributor.faculty | Fundamentinių mokslų fakultetas / Faculty of Fundamental Sciences | |
| dc.subject.researchfield | T 007 - Informatikos inžinerija / Informatics engineering | |
| dc.subject.researchfield | N 009 - Informatika / Computer science | |
| dc.subject.vgtuprioritizedfields | IK0101 - Informacijos ir informacinių technologijų sauga / Information and Information Technologies Security | |
| dc.subject.ltspecializations | L106 - Transportas, logistika ir informacinės ir ryšių technologijos (IRT) / Transport, logistic and information and communication technologies | |
| dc.subject.en | estimation | |
| dc.subject.en | testing | |
| dc.subject.en | navigation | |
| dc.subject.en | task analysis | |
| dc.subject.en | web pages | |
| dc.subject.en | noise measurement | |
| dc.subject.en | content similarity | |
| dc.subject.en | DOM | |
| dc.subject.en | html block similarity | |
| dc.subject.en | TED | |
| dc.subject.en | tree edit distance | |
| dcterms.sourcetitle | 2018 IEEE. 6th workshop on Advances in Information, Electronic and Electrical Engineering (AIEEE), November 8-10, 2018 Vilnius, Lithuania : proceedings / edited by: Dalius Navakauskas, Andrejs Romanovs, Darius Plonis | |
| dc.publisher.name | IEEE | |
| dc.publisher.city | New York | |
| dc.identifier.doi | 2-s2.0-85061479936 | |
| dc.identifier.doi | 000458738600016 | |
| dc.identifier.doi | 10.1109/AIEEE.2018.8592241 | |
| dc.identifier.elaba | 33410035 | |