dc.contributor.author: Griazev, Kiril
dc.contributor.author: Ramanauskaitė, Simona
dc.date.accessioned: 2023-09-18T17:06:39Z
dc.date.available: 2023-09-18T17:06:39Z
dc.date.issued: 2018
dc.identifier.uri: https://etalpykla.vilniustech.lt/handle/123456789/119734
dc.description.abstract (eng): Automatic data extraction is an important task, but websites contain a lot of secondary information of little value, so it is important to identify information blocks correctly. This can be done using various techniques, one of which is HTML block comparison, which identifies blocks by estimating their similarity score. This paper proposes an algorithm for HTML block similarity estimation using multiple methods: structure similarity; structure and tag similarity; and structure, tag, and content similarity. Additionally, the proposed algorithm is tested against other open-source algorithms by analyzing the same data.
dc.format: PDF
dc.format.extent: p. 1-4
dc.format.medium: text / txt
dc.language.iso: eng
dc.relation.isreferencedby: Conference Proceedings Citation Index - Science (Web of Science)
dc.relation.isreferencedby: IEEE Xplore
dc.relation.isreferencedby: Scopus
dc.source.uri: https://ieeexplore.ieee.org/document/8592241
dc.title: HTML block similarity estimation
dc.type: Paper in conference publication in Web of Science DB
dcterms.references: 11
dc.type.pubtype: P1a - Article in conference proceedings Web of Science DB
dc.contributor.institution: Vilniaus Gedimino technikos universitetas
dc.contributor.faculty: Faculty of Fundamental Sciences
dc.subject.researchfield: T 007 - Informatics engineering
dc.subject.researchfield: N 009 - Computer science
dc.subject.vgtuprioritizedfields: IK0101 - Information and Information Technologies Security
dc.subject.ltspecializations: L106 - Transport, logistic and information and communication technologies (ICT)
dc.subject.en: estimation
dc.subject.en: testing
dc.subject.en: navigation
dc.subject.en: task analysis
dc.subject.en: web pages
dc.subject.en: noise measurement
dc.subject.en: content similarity
dc.subject.en: DOM
dc.subject.en: html block similarity
dc.subject.en: TED
dc.subject.en: tree edit distance
dcterms.sourcetitle: 2018 IEEE. 6th workshop on Advances in Information, Electronic and Electrical Engineering (AIEEE), November 8-10, 2018 Vilnius, Lithuania : proceedings / edited by: Dalius Navakauskas, Andrejs Romanovs, Darius Plonis
dc.publisher.name: IEEE
dc.publisher.city: New York
dc.identifier.doi: 2-s2.0-85061479936
dc.identifier.doi: 000458738600016
dc.identifier.doi: 10.1109/AIEEE.2018.8592241
dc.identifier.elaba: 33410035


Files in this item

There are no files associated with this item.
