References

Известия высших учебных заведений. Приборостроение

Journal of Instrument Engineering

ISSN 0021-3454; 2500-0381

National Research University ITMO (ITMO University)

10.17586/0021-3454-2024-67-11-958-968

Research Article

METHODOLOGICAL AND SOFTWARE-INFORMATION SUPPORT FOR THE FUNCTIONING OF AUTOMATED SYSTEMS

Analysis of Statistical Characteristics of Artificially Generated Texts

Kuleshov S. V.

Sergey V. Kuleshov, Dr. Sci., Professor of the RAS; St. Petersburg Institute for Informatics and Automation of the RAS, Laboratory of Automation of Scientific Research, Chief Researcher

kuleshov@iias.spb.su

Zaytseva A. A.

Alexandra A. Zaytseva, PhD; St. Petersburg Institute for Informatics and Automation of the RAS, Laboratory of Automation of Scientific Research, Senior Researcher

cher@iias.spb.su

Aksenov A. Yu.

Alexey Yu. Aksenov, PhD; St. Petersburg Institute for Informatics and Automation of the RAS, Laboratory of Automation of Scientific Research, Senior Researcher

a_aksenov@iias.spb.su

St. Petersburg Federal Research Center of the RAS

2024, vol. 67, no. 11, pp. 958–968

Date: 07.12.2024

Copyright 2024, National Research University ITMO (ITMO University)

https://pribor.ifmo.ru/jour/about/submissions#copyrightNotice

https://pribor.ifmo.ru/jour/article/view/314

The paper considers an emerging trend: the creation of content with artificial intelligence tools and technologies. The active adoption of artificial intelligence technologies for data generation increases the share of artificially generated data, which must be detected automatically to prevent errors (unreliable or misleading information). Approaches are proposed for identifying text produced by neural network technologies, including heuristic rules based on the dependence of summary length on the summarization threshold. This makes it possible to evaluate text documents automatically in monitoring and search systems that process large volumes of unstructured data. The results lay a technological foundation for a wide range of practical solutions that provide intelligent support for the collective behavior of participants in human-machine communities, through the development of theoretical and technological foundations for processing unstructured data.

internet documents, artificial neural networks, large language model, Internet resources, artificial intelligence methods, data generation

This work was supported by state assignment No. FFZF-2022-0005 for 2024.

References

YouTube will require labeling of content created by neural networks [Online]. Available: https://www.fontanka.ru/2023/11/14/72913286/ (accessed 27.06.2024). (in Russ.)

Fang X., Che Sh., Mao M., Zhang H., Zhao M., Zhao X. Bias of AI-Generated Content: An Examination of News Produced by Large Language Models. Sci. Rep., 2024, vol. 14, no. 1, art. no. 5224. DOI: 10.1038/s41598-024-55686-2. Also available: https://papers.ssrn.com/sol3/papers.cfm?abstract_id=4574226 (accessed 27.06.2024).

Chen Ch., Fu J., Lyu L. A Pathway Towards Responsible AI Generated Content. arXiv:2303.01325v3, 27 Dec. 2023. DOI: 10.48550/arXiv.2303.01325.

Wahle J.Ph., Ruas T., Mohammad S.M., Meuschke N., Gipp B. AI Usage Cards: Responsibly Reporting AI-Generated Content. Proc. of the 2023 ACM/IEEE Joint Conf. on Digital Libraries (JCDL 2023), Santa Fe, NM, USA, June 2023, pp. 282–284.

Huang X., Li P., Du H., Kang J., Niyato D., Kim D.I., Wu Y. Federated Learning-Empowered AI-Generated Content in Wireless Networks. 2023. DOI: 10.48550/arXiv.2307.07146.

Gragnaniello D., Marra F., Verdoliva L. Detection of AI-Generated Synthetic Faces. In: Handbook of Digital Face Manipulation and Detection. Advances in Computer Vision and Pattern Recognition, 2022, pp. 191–212.

Xi Z., Wenmin H., Kangkang W., Weiqi L., Peijia Zh. AI-Generated Image Detection using a Cross-Attention Enhanced Dual-Stream Network. Proc. of the 2023 Asia Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC), Taipei, Taiwan, November 2023, pp. 1463–1470.

Weber-Wulff D., Anohina-Naumeca A., Bjelobaba S., Foltýnek T., Guerrero-Dib J., Popoola O., Šigut P., Waddington L. Testing of Detection Tools for AI-Generated Text. 2023. DOI: 10.48550/arXiv.2306.15666.

Joo-Wha H., Fischer K., Ha Y., Zeng Y. Human, I wrote a song for you: An experiment testing the influence of machines’ attributes on the AI-composed music evaluation. Computers in Human Behavior, 2022, vol. 131, art. no. 107239.

Cao Y., Li S., Liu Y., Yan Zh., Dai Y., Yu Ph., Sun L. A Comprehensive Survey of AI-Generated Content (AIGC): A History of Generative AI from GAN to ChatGPT. 2023. DOI: 10.48550/arXiv.2303.04226.

Wu J., Wensheng G., Zefeng Ch., Shicheng W., Hong L. AI-Generated Content (AIGC): A Survey. 2023. DOI: 10.48550/arXiv.2304.06632.

Ruchika L., Priyanka Bh., Neha V., Anshika J. AI-Generated Text Detection: A Review. Intern. J. of Creative Research Thoughts (IJCRT), 2023, vol. 11, no. 10, pp. d784–d789.

Zhengyuan J., Jinghuai Zh., Neil Zh.G. Evading Watermark based Detection of AI-Generated Content. Proc. of the 2023 ACM SIGSAC Conf. on Computer and Communications Security (CCS '23), Copenhagen, Denmark, November 2023, pp. 1168–1181.

Elkhatat A., Elsaid Kh., Almeer S. Evaluating the efficacy of AI content detection tools in differentiating between human and AI-generated text. Intern. J. for Educational Integrity, 2023, vol. 19, art. no. 17.

Elkhatat A.M. Evaluating the authenticity of ChatGPT responses: a study on text-matching capabilities. Intern. J. for Educational Integrity, 2023, vol. 19, art. no. 15. DOI: 10.1007/s40979-023-00137-0.

Otterbacher J. Why technical solutions for detecting AI-generated content in research and education are insufficient. Patterns, 2023, vol. 4, no. 7, art. no. 100796.

Pengyu W., Linyang K.R., Botian J., Dong Zh., Xipeng Q. SeqXGPT: Sentence-Level AI-Generated Text Detection. Proc. of the 2023 Conf. on Empirical Methods in Natural Language Processing, Singapore, December 2023, pp. 1144–1156.

Price G., Sakellarios M. The Effectiveness of Free Software for Detecting AI-Generated Writing. Intern. J. of Teaching, Learning and Education, 2023, vol. 2, pp. 31–38.

Qu Y., Liu P., Song W., Liu L., Cheng M. A Text Generation and Prediction System: Pre-training on New Corpora Using BERT and GPT-2. IEEE 10th Intern. Conf. on Electronics Information and Emergency Communication (ICEIEC), Beijing, China, July 2020, pp. 323–326.

Chen W., Su Y., Yan X., Wang W.Y. KGPT: Knowledge-Grounded Pre-Training for Data-to-Text Generation [Online]. Available: https://arxiv.org/abs/2010.02307 (accessed 27.06.2024).

GPT for Dummies: From Tokenization to Fine-Tuning [Online]. Available: https://habr.com/ru/articles/599673/ (accessed 27.06.2024). (in Russ.)

Ackley D., Hinton G., Sejnowski T. A learning algorithm for Boltzmann machines. Cognitive Science, 1985, vol. 9, no. 1, pp. 147–169.

OpenAI Codex [Online]. Available: https://openai.com/blog/openai-codex (accessed 27.06.2024).

GPT-4 Technical Report. OpenAI [Online]. Available: https://cdn.openai.com/papers/gpt-4.pdf (accessed 27.06.2024).

GPTZero [Online]. Available: https://gptzero.me/technology (accessed 27.06.2024).

Chaka C. Detecting AI content in responses generated by ChatGPT, YouChat, and Chatsonic: The case of five AI content detection tools. Journal of Applied Learning and Teaching, 2023, vol. 6, no. 2. DOI: 10.37074/jalt.2023.6.2.12.

Yang X., Cheng W., Petzold L., Wang W.Y., Chen H. DNA-GPT: Divergent N-Gram Analysis for Training-Free Detection of GPT-Generated Text. arXiv:2305.17359, 2024. Available: https://www.semanticscholar.org/paper/DNA-GPT%3A-Divergent-N-Gram-Analysis-for-Detection-of-Yang-Cheng/08145978da4c8912f4a05444a6bbf048778dc4af.

Kuleshov S.V., Zaytseva A.A., Markov S.V. Associative-ontological approach to natural language text processing. Intellectual Technologies on Transport, 2015, no. 4, pp. 40–45. (in Russ.)

Jiang A.Q. et al. Mistral 7B [Online]. Available: https://arxiv.org/abs/2310.06825 (accessed 27.06.2024).

The authors declare no conflicts of interest.