Towards automatic structured web data extraction system

Grigalis, Tomas

Date

2012

Author

Grigalis, Tomas

Abstract

Automatic extraction of structured data from web pages is one of the key challenges for Web search engines to advance to a more expressive semantic level. Here we propose a novel data extraction method called ClustVX. It exploits visual as well as structural features of web page elements to group them into semantically similar clusters. The resulting clusters reflect the page structure and are used to derive data extraction rules. Preliminary evaluation results of the ClustVX system on three public benchmark datasets demonstrate high efficiency and indicate a need for a much larger, up-to-date benchmark dataset that reflects contemporary Web 2.0 pages.
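
The clustering idea in the abstract can be illustrated with a minimal sketch. This is not the ClustVX implementation: the record does not describe the actual feature encoding or the rule-derivation step, so the `Element` fields, the exact-match grouping key, and the `min_size` threshold below are all assumptions chosen for illustration. The sketch only shows how combining a structural feature (the tag path) with visual features (font size, color, weight) can group page elements that play the same role.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Element:
    """A rendered page element (hypothetical fields, for illustration only)."""
    tag_path: str   # structural feature, e.g. "html/body/div/ul/li/a"
    font_size: int  # visual feature, in pixels
    color: str      # visual feature, e.g. "#0000ee"
    bold: bool      # visual feature
    text: str       # the visible text content


def cluster_key(el: Element) -> tuple:
    # Assumption: elements are "similar" when structure and style match exactly.
    return (el.tag_path, el.font_size, el.color, el.bold)


def cluster_elements(elements: list[Element]) -> dict[tuple, list[str]]:
    # Group element texts by their combined structural and visual key.
    clusters: dict[tuple, list[str]] = defaultdict(list)
    for el in elements:
        clusters[cluster_key(el)].append(el.text)
    return dict(clusters)


def repeated_clusters(clusters: dict[tuple, list[str]], min_size: int = 2):
    # Keep only clusters that repeat; these are candidates for data fields.
    return {k: v for k, v in clusters.items() if len(v) >= min_size}


# Toy input: two product records and one page heading.
elements = [
    Element("html/body/div/ul/li/a", 16, "#0000ee", True, "Laptop A"),
    Element("html/body/div/ul/li/span", 12, "#cc0000", False, "$999"),
    Element("html/body/div/ul/li/a", 16, "#0000ee", True, "Laptop B"),
    Element("html/body/div/ul/li/span", 12, "#cc0000", False, "$1299"),
    Element("html/body/div/h1", 24, "#000000", True, "Search results"),
]

for key, texts in repeated_clusters(cluster_elements(elements)).items():
    print(key[0], texts)
```

In this toy input the product names and the prices fall into two separate repeated clusters, while the one-off page heading is filtered out. A real system would need a tolerant similarity measure rather than exact key matching, since rendered styles and paths vary slightly between records.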

Publication date (year)

2012

URI

https://etalpykla.vilniustech.lt/handle/123456789/143223

Collections

Conference Articles [15192]