
Usage of non-probability sample and scraped data to estimate proportions

BNU 2022 Proceedings.pdf (3.060Mb)
Date
2022
Author
Nekrašaitė-Liegė, Vilma
Čiginas, Andrius
Krapavickaitė, Danutė
Abstract
The growing number of data sources raises the task of integrating them with the conventional data sources used in official statistics. One problem under study at Statistics Lithuania is to revise some indicators and to determine whether their accuracy can be improved using data from additional sources. The proportion of companies that have a website is one such indicator. Traditionally, it is estimated from the data of the Information and Communication Technology (ICT) sample survey. Information about enterprise website ownership is also provided by a private company. However, this data source is updated on a voluntary basis and has drawbacks: it does not cover the whole population, so an estimator based on this source alone is expected to be biased (Tam and Kim, 2018). Another way to build a list of enterprises that own websites is web scraping (ESSnet Big Data I, ESSnet Big Data II). Following a common methodology, ten candidate URLs are found for each enterprise in the population using a search engine. A logistic regression model is used to estimate the probability that the selected URL is the website of that particular enterprise. If this probability reaches a fixed threshold, the enterprise is classified as owning a website; otherwise, it is classified as not owning one. However, other research reports that the accuracy of such enterprise classification is around 59-89 percent and depends on the search engine, the training sample, and other factors. It may therefore seem impossible to stop collecting website data through the ICT survey; however, combining different sources may lead to more efficient estimators. See Beaumont (2020), Kim and Tam (2021) and Rao (2021), among others. In this research, a number of methods for integrating auxiliary data from alternative sources with the survey data, with adjustment for bias, are examined. The integration leads to more efficient estimators than those based only on the survey data. The accuracy measures of the estimators considered are evaluated.
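
The classification step described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. The feature names, the coefficients, the intercept and the threshold value of 0.5 are all hypothetical, and the rule that an enterprise counts as a website owner when its best-scoring candidate URL reaches the threshold is an assumption about how "the selected URL" is chosen.

```python
import math

def url_probability(features, coefficients, intercept):
    """Logistic model: probability that a candidate URL belongs to the enterprise."""
    score = intercept + sum(c * x for c, x in zip(coefficients, features))
    return 1.0 / (1.0 + math.exp(-score))

def owns_website(candidate_features, coefficients, intercept, threshold=0.5):
    """Assumed rule: owner if any candidate URL reaches the fixed threshold."""
    return any(
        url_probability(f, coefficients, intercept) >= threshold
        for f in candidate_features
    )

def scraped_proportion(enterprises, coefficients, intercept, threshold=0.5):
    """Share of enterprises classified as website owners (no bias adjustment)."""
    flags = [owns_website(c, coefficients, intercept, threshold) for c in enterprises]
    return sum(flags) / len(flags)

# Hypothetical features per candidate URL: (name_match, registration_code_found).
COEFFICIENTS = (2.0, 3.0)   # illustrative values, not fitted to any data
INTERCEPT = -2.5            # illustrative value

# Three toy enterprises, each with a short list of candidate URLs
# (the abstract uses ten candidates per enterprise).
ENTERPRISES = [
    [(1, 1), (0, 0)],
    [(0, 0), (1, 0)],
    [(0, 1)],
]

proportion = scraped_proportion(ENTERPRISES, COEFFICIENTS, INTERCEPT)
print(f"Estimated proportion of website owners: {proportion:.3f}")
```

The proportion computed this way inherits any misclassification from the model, which is why the abstract treats the scraped data as an auxiliary source to be combined with the ICT survey rather than as a replacement for it.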
Issue date (year)
2022
URI
https://etalpykla.vilniustech.lt/handle/123456789/113485
Collections
  • Konferencijų pranešimų santraukos / Conference and Meeting Abstracts [3431]

 

 
