ISSN 0021-3454 (print version)
ISSN 2500-0381 (online version)

DOI 10.17586/0021-3454-2022-65-11-826-832

UDC 004.912: 004.822

FORMATION OF THE CORE OF DOCUMENTS IN INTERNET MONITORING SYSTEMS UNDER RESOURCE CONSTRAINTS

S. V. Kuleshov
St. Petersburg Institute for Informatics and Automation of the Russian Academy of Sciences (SPIIRAS), Laboratory of Research Activities Automation


A. A. Zaytseva
St. Petersburg Institute for Informatics and Automation of the Russian Academy of Sciences, Laboratory of Research Automation; Senior Scientist


A. Y. Aksenov
St. Petersburg Federal Research Center of the RAS, St. Petersburg Institute for Informatics and Automation of the RAS, Research Automation Laboratory; Senior Researcher



Abstract. The paper considers the development of open-type Internet monitoring systems that cover an unlimited number of sources but have limited data storage capacity. The aim of the work is to form a set of documents of the minimum required size (the core of documents) that satisfies the requirements of representativeness and topic variability when monitoring the Internet. To formalize and solve this problem, a set-theoretic model of the document core is developed. The proposed approach is distinguished by a preemptive algorithm that keeps only relevant documents in the database, within the available storage volume. Results of an experiment on real data are presented and confirm the applicability of the developed model. The approach can be used in a number of practical tasks, in particular for searching the Internet for information (documents, pages) when the a priori information needed for a keyword search is not available.
Keywords: core of documents, monitoring, crawler, document search, Internet resources
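
The preemptive idea described in the abstract can be illustrated with a minimal sketch. It is not the authors' set-theoretic model: the numeric relevance score, the capacity measured in bytes, and the rule of displacing the least relevant documents first are all assumptions made for illustration, and the names `Document` and `DocumentCore` are hypothetical. The sketch also does not model the representativeness and topic variability requirements.

```python
import heapq
from dataclasses import dataclass, field


@dataclass(order=True)
class Document:
    # Hypothetical relevance score; higher means more relevant.
    relevance: float
    url: str = field(compare=False)
    size: int = field(compare=False)  # size in bytes


class DocumentCore:
    """Bounded store that displaces the least relevant documents first."""

    def __init__(self, capacity):
        self.capacity = capacity  # assumed storage budget in bytes
        self.used = 0
        self._heap = []           # min-heap ordered by relevance

    def offer(self, doc):
        """Try to admit doc; return True if it ends up in the core."""
        if doc.size > self.capacity:
            return False
        evicted, freed = [], 0
        # Displace strictly less relevant documents until doc fits.
        while self.used - freed + doc.size > self.capacity:
            if not self._heap or self._heap[0].relevance >= doc.relevance:
                # Cannot free enough space: restore the store and reject doc.
                for d in evicted:
                    heapq.heappush(self._heap, d)
                return False
            victim = heapq.heappop(self._heap)
            evicted.append(victim)
            freed += victim.size
        self.used += doc.size - freed
        heapq.heappush(self._heap, doc)
        return True

    def urls(self):
        """Return the URLs currently held, sorted alphabetically."""
        return sorted(d.url for d in self._heap)


core = DocumentCore(capacity=100)
core.offer(Document(0.2, "a", 60))
core.offer(Document(0.9, "b", 40))
core.offer(Document(0.5, "c", 50))   # displaces "a"
core.offer(Document(0.1, "d", 20))   # rejected: nothing less relevant to displace
print(core.urls(), core.used)
```

A min-heap is used here so that the least relevant stored document is always found in constant time. A new document is admitted only if enough strictly less relevant documents can be displaced to make room for it; otherwise the store is left unchanged.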
