Usage of non-probability sample and scraped data to estimate proportions

Nekrašaitė-Liegė, Vilma; Čiginas, Andrius; Krapavickaitė, Danutė

BNU 2022 Proceedings.pdf (3.060Mb)

Date

2022

Abstract

The growing number of data sources raises the task of integrating them with the traditional data sources used in official statistics. One of the problems under study at Statistics Lithuania is to revise some indicators and to find out whether their accuracy can be improved using data from additional sources. The proportion of companies that have a website is one such indicator. Traditionally, it is estimated from the data of the Information and Communication Technology (ICT) sample survey. Information about enterprise website possession is also provided by a private company. However, this data source is updated on a voluntary basis and has drawbacks: it does not cover the whole population, so an estimator based on this source alone is expected to be biased (Tam and Kim, 2018). Another way to create a list of enterprises owning websites is web scraping (ESSnet Big Data I, ESSnet Big Data II). Following a common methodology, ten potential URLs are found for each enterprise in the population using a search engine. A logistic regression model is used to estimate the probability that a selected URL is the website of the particular enterprise. If this probability reaches a fixed threshold, the enterprise is concluded to own the website; otherwise, the opposite conclusion is made. However, it is known from other research that the accuracy of such enterprise classification is around 59-89 percent and depends on the search engine, the training sample, etc. Therefore, it may seem that the collection of data on websites through the ICT survey cannot be abandoned; nevertheless, combining different sources may lead to more efficient estimators; see Beaumont (2020), Kim and Tam (2021) and Rao (2021), among others. In this research, a number of methods for integrating auxiliary data obtained from alternative sources with the survey data for bias adjustment are examined. The integration leads to more efficient estimators in comparison with the estimators based only on the survey data. The accuracy measures of the estimators considered are evaluated.
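
The threshold rule described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the two URL features, the coefficient values, the 0.5 threshold, and the choice to flag an enterprise when its best-scoring candidate URL reaches the threshold are all assumptions made here for illustration, and the function names are hypothetical.

```python
import math

# Hypothetical fitted logistic model (illustrative values, not from the paper).
# Assumed features per candidate URL: [name_in_domain, id_code_on_page].
COEFFICIENTS = [2.0, 3.0]
INTERCEPT = -3.0


def url_match_probability(features, coefficients=COEFFICIENTS, intercept=INTERCEPT):
    """Logistic model: estimated probability that a URL belongs to the enterprise."""
    linear = intercept + sum(c * x for c, x in zip(coefficients, features))
    return 1.0 / (1.0 + math.exp(-linear))


def owns_website(candidate_features, threshold=0.5):
    """Classify an enterprise from its candidate URLs (up to ten in the abstract).

    Assumption: the enterprise is flagged as owning a website when the
    highest-scoring candidate URL reaches the threshold.
    """
    probabilities = [url_match_probability(f) for f in candidate_features]
    return bool(probabilities) and max(probabilities) >= threshold


def scraped_proportion(enterprises, threshold=0.5):
    """Naive share of enterprises classified as website owners.

    This share inherits the classification errors, which is why the
    abstract considers combining it with survey data.
    """
    flags = [owns_website(candidates, threshold) for candidates in enterprises]
    return sum(flags) / len(flags)


# Toy population of four enterprises with made-up candidate URL features.
toy_enterprises = [
    [[1, 1], [0, 0]],  # one strong candidate
    [[0, 0], [1, 0]],  # only weak candidates
    [],                # search engine returned nothing
    [[0, 1]],          # candidate exactly at the threshold
]

if __name__ == "__main__":
    print(scraped_proportion(toy_enterprises))
```

Raising the threshold makes the classifier more conservative, so the naive proportion can only stay the same or decrease; the sketch makes no attempt at the bias adjustment that the abstract goes on to study.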

URI

https://etalpykla.vilniustech.lt/handle/123456789/113485

Collections

Konferencijų pranešimų santraukos / Conference and Meeting Abstracts [3431]