<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Publishing DTD v1.3 20210610//EN" "JATS-journalpublishing1-3.dtd">
<article article-type="research-article" dtd-version="1.3" xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:xlink="http://www.w3.org/1999/xlink" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xml:lang="ru"><front><journal-meta><journal-id journal-id-type="publisher-id">pribor</journal-id><journal-title-group><journal-title xml:lang="ru">Известия высших учебных заведений. Приборостроение</journal-title><trans-title-group xml:lang="en"><trans-title>Journal of Instrument Engineering</trans-title></trans-title-group></journal-title-group><issn pub-type="ppub">0021-3454</issn><issn pub-type="epub">2500-0381</issn><publisher><publisher-name>Национальный исследовательский университет ИТМО</publisher-name></publisher></journal-meta><article-meta><article-id pub-id-type="doi">10.17586/0021-3454-2022-65-11-826-832</article-id><article-id custom-type="elpub" pub-id-type="custom">pribor-305</article-id><article-categories><subj-group subj-group-type="heading"><subject>Research Article</subject></subj-group><subj-group subj-group-type="section-heading" xml:lang="ru"><subject>МАТЕМАТИЧЕСКОЕ И ПРОГРАММНОЕ ОБЕСПЕЧЕНИЕ ИНФОРМАЦИОННЫХ СИСТЕМ</subject></subj-group><subj-group subj-group-type="section-heading" xml:lang="en"><subject>MATHEMATICAL AND SOFTWARE SUPPORT OF INFORMATION SYSTEMS</subject></subj-group></article-categories><title-group><article-title>Формирование ядра документов в системах интернет-мониторинга в условиях ресурсных ограничений</article-title><trans-title-group xml:lang="en"><trans-title>Formation of the core of documents in Internet monitoring systems under resource constraints</trans-title></trans-title-group></title-group><contrib-group><contrib contrib-type="author" corresp="yes"><name-alternatives><name name-style="eastern" xml:lang="ru"><surname>Кулешов</surname><given-names>С. В.</given-names></name><name name-style="western" xml:lang="en"><surname>Kuleshov</surname><given-names>S. 
V.</given-names></name></name-alternatives><bio xml:lang="ru"><p>Сергей Викторович Кулешов — д-р техн. наук, профессор; лаборатория автоматизации научных исследований; гл. научный сотрудник</p><p>Санкт-Петербург</p></bio><bio xml:lang="en"><p>Sergey V. Kuleshov — Dr. Sci., Professor; St. Petersburg Institute for Informatics and Automation of the RAS, Research Automation Laboratory; Chief Researcher</p><p>St. Petersburg</p></bio><email xlink:type="simple">kuleshov@iias.spb.su</email><xref ref-type="aff" rid="aff-1"/></contrib><contrib contrib-type="author" corresp="yes"><name-alternatives><name name-style="eastern" xml:lang="ru"><surname>Зайцева</surname><given-names>А. А.</given-names></name><name name-style="western" xml:lang="en"><surname>Zaytseva</surname><given-names>A. A.</given-names></name></name-alternatives><bio xml:lang="ru"><p>Александра Алексеевна Зайцева — канд. техн. наук; лаборатория автоматизации научных исследований; ст. научный сотрудник</p><p>Санкт-Петербург</p></bio><bio xml:lang="en"><p>Alexandra A. Zaytseva — PhD; St. Petersburg Institute for Informatics and Automation of the RAS, Research Automation Laboratory; Senior Researcher</p><p>St. Petersburg</p></bio><email xlink:type="simple">cher@iias.spb.su</email><xref ref-type="aff" rid="aff-1"/></contrib><contrib contrib-type="author" corresp="yes"><name-alternatives><name name-style="eastern" xml:lang="ru"><surname>Аксенов</surname><given-names>А. Ю.</given-names></name><name name-style="western" xml:lang="en"><surname>Aksenov</surname><given-names>A. Yu.</given-names></name></name-alternatives><bio xml:lang="ru"><p>Алексей Юрьевич Аксенов — канд. техн. наук; лаборатория автоматизации научных исследований; ст. научный сотрудник</p><p>Санкт-Петербург</p></bio><bio xml:lang="en"><p>Alexey Yu. Aksenov — PhD; St. Petersburg Institute for Informatics and Automation of the RAS, Research Automation Laboratory; Senior Researcher</p><p>St. 
Petersburg</p></bio><email xlink:type="simple">a_aksenov@iias.spb.su</email><xref ref-type="aff" rid="aff-1"/></contrib></contrib-group><aff-alternatives id="aff-1"><aff xml:lang="ru"><institution>Санкт-Петербургский федеральный исследовательский центр Российской академии наук</institution></aff><aff xml:lang="en"><institution>St. Petersburg Federal Research Center of the RAS</institution></aff></aff-alternatives><pub-date pub-type="collection"><year>2022</year></pub-date><pub-date pub-type="epub"><day>03</day><month>12</month><year>2024</year></pub-date><volume>65</volume><issue>11</issue><fpage>826</fpage><lpage>832</lpage><permissions><copyright-statement>Copyright &#x00A9; Национальный исследовательский университет ИТМО, 2024</copyright-statement><copyright-year>2024</copyright-year><copyright-holder xml:lang="ru">Национальный исследовательский университет ИТМО</copyright-holder><copyright-holder xml:lang="en">Национальный исследовательский университет ИТМО</copyright-holder><license xlink:href="https://pribor.ifmo.ru/jour/about/submissions#copyrightNotice" xlink:type="simple"><license-p>https://pribor.ifmo.ru/jour/about/submissions#copyrightNotice</license-p></license></permissions><self-uri xlink:href="https://pribor.ifmo.ru/jour/article/view/305">https://pribor.ifmo.ru/jour/article/view/305</self-uri><abstract><p>Рассматриваются особенности разработки систем интернет-мониторинга открытого типа с неограниченным количеством источников в условиях ограниченного объема систем хранения собранных данных. Цель работы — решение задачи формирования множества документов минимально необходимого размера (ядра документов), отвечающего требованиям репрезентативности и вариативности тем при мониторинге сети Интернет. Для формализации и решения поставленной задачи разработана теоретико-множественная модель ядра документов. 
Предложенный подход отличается использованием вытесняющего алгоритма, поддерживающего в базе данных наличие только актуальных документов в пределах доступного объема системы хранения данных. Приведены результаты эксперимента с использованием реальных данных, подтверждающие применимость разработанной модели. Предложенный подход может быть использован в ряде практических задач, в частности для поиска в сети Интернет сведений (документов, страниц), по которым отсутствует априорная информация, необходимая для поиска по ключевым словам.</p></abstract><trans-abstract xml:lang="en"><p>This paper considers the development of open-type Internet monitoring systems that draw on an unlimited number of sources while the storage capacity for the collected data is limited. The aim of the work is to form a set of documents of the minimum necessary size (the core of documents) that meets the requirements of representativeness and topic variability when monitoring the Internet. To formalize and solve this problem, a set-theoretic model of the core of documents is developed. The proposed approach is distinguished by its use of an eviction algorithm that keeps only up-to-date documents in the database, within the available capacity of the data storage system. Results of an experiment on real data are presented, confirming the applicability of the developed model. 
The proposed approach can be applied to a range of practical tasks, in particular to searching the Internet for information (documents, pages) when the a priori information needed for a keyword search is unavailable.</p></trans-abstract><kwd-group xml:lang="ru"><kwd>ядро документов</kwd><kwd>мониторинг</kwd><kwd>краулер</kwd><kwd>поиск документов</kwd><kwd>интернет-ресурсы</kwd></kwd-group><kwd-group xml:lang="en"><kwd>core of documents</kwd><kwd>monitoring</kwd><kwd>crawler</kwd><kwd>document search</kwd><kwd>Internet resources</kwd></kwd-group><funding-group><funding-statement xml:lang="ru">Работа выполнена в рамках реализации Государственного задания на 2022 г., № FFZF-2022-0005.</funding-statement><funding-statement xml:lang="en">The work was carried out under State Assignment No. FFZF-2022-0005 for 2022.</funding-statement></funding-group></article-meta></front><back><ref-list><title>References</title><ref id="cit1"><label>1</label><citation-alternatives><mixed-citation xml:lang="ru">Zachlod C., Samuel O., Ochsner A., Werthmüller S. Analytics of social media data – state of characteristics and application // Journal of Business Research. 2022. Vol. 144. P. 1064—1076. DOI: 10.1016/j.jbusres.2022.02.016.</mixed-citation><mixed-citation xml:lang="en">Zachlod C., Samuel O., Ochsner A., &amp; Werthmüller S. Journal of Business Research, 2022, vol. 144, pp. 1064–1076, DOI: 10.1016/j.jbusres.2022.02.016.</mixed-citation></citation-alternatives></ref><ref id="cit2"><label>2</label><citation-alternatives><mixed-citation xml:lang="ru">Fink C., Toivonen T., Correia R. A., Di Minin E. Mapping the online songbird trade in Indonesia // Applied Geography. 2021. P. 134. DOI: 10.1016/j.apgeog.2021.102505.</mixed-citation><mixed-citation xml:lang="en">Fink C., Toivonen T., Correia R. A., &amp; Di Minin E. Applied Geography, 2021, pp. 
134, DOI: 10.1016/j.apgeog.2021.102505.</mixed-citation></citation-alternatives></ref><ref id="cit3"><label>3</label><citation-alternatives><mixed-citation xml:lang="ru">Han H., Wang C., Zhao Y., Shu M., Wang W., Min Y. SSLE: A framework for evaluating the “Filter bubble” effect on the news aggregator and recommenders // World Wide Web. 2022. N 25(3). P. 1169—1195. DOI: 10.1007/s11280-022-01031-4.</mixed-citation><mixed-citation xml:lang="en">Han H., Wang C., Zhao Y., Shu M., Wang W., &amp; Min Y. World Wide Web, 2022, no. 3(25), pp. 1169–1195, DOI: 10.1007/s11280-022-01031-4.</mixed-citation></citation-alternatives></ref><ref id="cit4"><label>4</label><citation-alternatives><mixed-citation xml:lang="ru">Krewinkel A., Sünkler S., Lewandowski D. et al. Concept for automated computer-aided identification and evaluation of potentially non-compliant food products traded via electronic commerce // Food Control. 2016. Vol. 61. P. 204—212. DOI: 10.1016/j.foodcont.2015.09.039.</mixed-citation><mixed-citation xml:lang="en">Krewinkel A., Sünkler S., Lewandowski D. et al. Food Control, 2016, vol. 61, pp. 204–212, DOI: 10.1016/j.foodcont.2015.09.039.</mixed-citation></citation-alternatives></ref><ref id="cit5"><label>5</label><citation-alternatives><mixed-citation xml:lang="ru">Беляевский К. О. Формирование октодерева по облаку точек при ограничении объема оперативной памяти // Научно-технический вестник СПбПУ. Информатика. Телекоммуникации. Управление. 2019. Т. 12, № 4. С. 97—110.</mixed-citation><mixed-citation xml:lang="en">Beliaevskii K.O. Peter the Great St. Petersburg Polytechnic University. Computing, Telecommunications and Control, 2019, no. 4(12), pp. 97–110. (in Russ.)</mixed-citation></citation-alternatives></ref><ref id="cit6"><label>6</label><citation-alternatives><mixed-citation xml:lang="ru">Puzak T.R. Analysis of Cache Replacement-Algorithms: Doctor’s Thesis. 1985.</mixed-citation><mixed-citation xml:lang="en">Puzak T.R. 
Analysis of Cache Replacement-Algorithms, Doctor’s thesis, 1985.</mixed-citation></citation-alternatives></ref><ref id="cit7"><label>7</label><citation-alternatives><mixed-citation xml:lang="ru">Wilson P. R. et al. Dynamic storage allocation: A survey and critical review // Lecture Notes in Computer Science (Including Subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics). 1995. Vol. 986. P. 1—116.</mixed-citation><mixed-citation xml:lang="en">Wilson P.R. et al. Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), 1995, vol. 986, pp. 1–116.</mixed-citation></citation-alternatives></ref><ref id="cit8"><label>8</label><citation-alternatives><mixed-citation xml:lang="ru">Laliwala Z., Shaikh A. Web Crawling and Data Mining with Apache Nutch. Packt Publ., 2013.</mixed-citation><mixed-citation xml:lang="en">Laliwala Z., Shaikh A. Web Crawling and Data Mining with Apache Nutch, Packt Publishing, 2013.</mixed-citation></citation-alternatives></ref><ref id="cit9"><label>9</label><citation-alternatives><mixed-citation xml:lang="ru">Nasraoui O. Web data mining: exploring hyperlinks, contents, and usage data // ACM SIGKDD Explorations Newsletter. 2008.</mixed-citation><mixed-citation xml:lang="en">Nasraoui O. Computer Science, 2008, DOI: 10.1145/1540276.1540281.</mixed-citation></citation-alternatives></ref><ref id="cit10"><label>10</label><citation-alternatives><mixed-citation xml:lang="ru">Van den Broucke S., Baesens B. From Web Scraping to Web Crawling. Practical Web Scraping for Data Science. Berkeley, CA: Apress, 2018. P. 155—172.</mixed-citation><mixed-citation xml:lang="en">Van den Broucke S., Baesens B. From Web Scraping to Web Crawling. Practical Web Scraping for Data Science, Apress – Berkeley, CA, 2018, pp. 
155–172.</mixed-citation></citation-alternatives></ref><ref id="cit11"><label>11</label><citation-alternatives><mixed-citation xml:lang="ru">Alkalbani A. M., Hussain W., Kim J. Y. A Centralised Cloud Services Repository (CCSR) Framework for Optimal Cloud Service Advertisement Discovery from Heterogenous Web Portals // IEEE Access. 2019. Vol. 7. P. 128213—128223. DOI: 10.1109/ACCESS.2019.2939543.</mixed-citation><mixed-citation xml:lang="en">Alkalbani A.M., Hussain W., &amp; Kim J.Y. IEEE Access, 2019, vol. 7, pp. 128213–128223, DOI: 10.1109/ACCESS.2019.2939543.</mixed-citation></citation-alternatives></ref><ref id="cit12"><label>12</label><citation-alternatives><mixed-citation xml:lang="ru">Wu Z., Cai Z., Tang X., Xu Y., Deng T. A forward and backward private oblivious RAM for storage outsourcing on edge-cloud computing // Journal of Parallel and Distributed Computing. 2022. Vol. 166. P. 1—14. DOI: 10.1016/j.jpdc.2022.04.008.</mixed-citation><mixed-citation xml:lang="en">Wu Z., Cai Z., Tang X., Xu Y., &amp; Deng T. Journal of Parallel and Distributed Computing, 2022, vol. 166, pp. 1–14, DOI: 10.1016/j.jpdc.2022.04.008.</mixed-citation></citation-alternatives></ref><ref id="cit13"><label>13</label><citation-alternatives><mixed-citation xml:lang="ru">Зайцева А. А., Кулешов С. В., Михайлов С. Н. Метод оценки качества текстов в задачах аналитического мониторинга информационных ресурсов // Тр. СПИИРАН. 2014. Вып. 37. С. 144—155.</mixed-citation><mixed-citation xml:lang="en">Zaitseva A.A., Kuleshov S.V., Mikhailov S.N. Trudy SPIIRAN (SPIIRAS Proceedings), 2014, no. 37, pp. 144–155. (in Russ.)</mixed-citation></citation-alternatives></ref><ref id="cit14"><label>14</label><citation-alternatives><mixed-citation xml:lang="ru">Кулешов С. В., Зайцева А. А., Левашкин С. П. Технологии и принципы сбора и обработки неструктурированных распределенных данных с учетом современных особенностей предоставления медиа-контента // Информатизация и связь. 2020. № 4. С. 
62—66.</mixed-citation><mixed-citation xml:lang="en">Kuleshov S.V., Zaytseva A.A., Levashkin S.P. Informatization and communication, 2020, no. 5, pp. 22–28. (in Russ.)</mixed-citation></citation-alternatives></ref><ref id="cit15"><label>15</label><citation-alternatives><mixed-citation xml:lang="ru">Kuleshov S., Zaytseva A., Aksenov A. Natural Language Search and Associative-Ontology Matching Algorithms Based on Graph Representation of Texts // Intelligent Systems Applications in Software Engineering, CoMeSySo 2019; Advances in Intelligent Systems and Computing. 2019. Vol. 1046. P. 7—26. DOI: 10.1007/978-3-030-30329-7_26.</mixed-citation><mixed-citation xml:lang="en">Kuleshov S., Zaytseva A., Aksenov A. Intelligent Systems Applications in Software Engineering. CoMeSySo 2019. Advances in Intelligent Systems and Computing, 2019, vol. 1046, pp. 7–26, DOI: 10.1007/978-3-030-30329-7_26.</mixed-citation></citation-alternatives></ref></ref-list><fn-group><fn fn-type="conflict"><p>The authors declare no conflicts of interest.</p></fn></fn-group></back></article>
