ISSN 0021-3454 (print version)
ISSN 2500-0381 (online version)

DOI 10.17586/0021-3454-2023-66-12-1002-1010

UDC 004.912: 004.822

PHENOMENOLOGICAL DESCRIPTION OF INTERNET DOCUMENTS COLLECTING AND PROCESSING

S. V. Kuleshov
St. Petersburg Institute for Informatics and Automation of the Russian Academy of Sciences (SPIIRAS), Laboratory of Research Activities Automation


A. A. Zaytseva
St. Petersburg Institute for Informatics and Automation of the Russian Academy of Sciences, Laboratory of Research Automation; Senior Scientist


Reference for citation: Kuleshov S. V., Zaytseva A. A. Phenomenological description of Internet documents collecting and processing. Journal of Instrument Engineering. 2023. Vol. 66, N 12. P. 1002–1010 (in Russian). DOI: 10.17586/0021-3454-2023-66-12-1002-1010.

Abstract. The state of the Internet as a repository of information resources is analyzed from the point of view of a bot, that is, a program that collects data to monitor resources, populate a search engine, or serve other commercial or research purposes. An approach is proposed that describes the problem through a set of phenomena arising when documents are collected on the Internet. These phenomena must be taken into account when developing monitoring systems or search engines. A number of peculiarities encountered in web scraping, harvesting, and other uses of bots for collecting data on the Internet are presented. The problems posed by subdomains, recursive subdomains, dynamically loaded content, search engine optimization of text content, and others are described. It is shown that collecting data from Internet resources is not only a technological task but, to a greater extent, a knowledge-intensive one, and that, since research in this area is still active, no “out-of-the-box” solution exists. The article will be useful to researchers in the field of Internet development, search engine developers, specialists in data retrieval and Internet technologies, and specialists in the creation and support of Internet resources and in Internet marketing.
Keywords: Internet documents, data collection technologies, data retrieval, search engines, Internet resources

Acknowledgement: This work was supported by State Assignment for 2023 No. FFZF-2022-0005.
