Journal of Instrument Engineering

DOI 10.17586/0021-3454-2024-67-11-958-968
UDC 004.912: 004.822

ANALYSIS OF STATISTICAL CHARACTERISTICS OF ARTIFICIALLY GENERATED TEXTS

S. V. Kuleshov
St. Petersburg Institute for Informatics and Automation of the Russian Academy of Sciences (SPIIRAS), Laboratory of Research Activities Automation

A. A. Zaytseva
St. Petersburg Institute for Informatics and Automation of the Russian Academy of Sciences, Laboratory of Research Automation; Senior Scientist

A. Y. Aksenov
St. Petersburg Federal Research Center of the RAS, St. Petersburg Institute for Informatics and Automation of the RAS, Research Automation Laboratory; Senior Researcher


Reference for citation: Kuleshov S. V., Zaytseva A. A., Aksenov A. Yu. Analysis of statistical characteristics of artificially generated texts. Journal of Instrument Engineering. 2024. Vol. 67, N 11. P. 958–968 (in Russian). DOI: 10.17586/0021-3454-2024-67-11-958-968.

Abstract. The paper considers a growing trend: the creation of content with artificial intelligence tools and technologies. The active adoption of artificial intelligence technologies for data generation increases the share of artificially generated data, which must be identified automatically to prevent errors such as unreliable or misleading information. Approaches to identifying text data created with neural network technologies are proposed, including heuristic rules based on the dependence of abstract volume on the abstracting threshold. These rules allow text documents to be evaluated automatically in monitoring and search systems that process large volumes of unstructured data. The results lay a technological basis for a wide range of practical solutions that provide intellectual support for the collective behavior of participants in human-machine communities, through the development of theoretical and technological foundations for processing unstructured data.

Keywords: Internet documents, artificial neural networks, large language models, Internet resources, artificial intelligence methods, data generation

Acknowledgement: This work was supported by State Assignment No. FFZF-2022-0005 for 2024.

