ISSN 0021-3454 (print version)
ISSN 2500-0381 (online version)

vol 63 / July, 2020

DOI 10.17586/0021-3454-2019-62-11-976-981

UDC 004.89


K. V. Nenausnikov
St. Petersburg Institute for Informatics and Automation of the RAS, Laboratory of Automation of Scientific Research; Junior Researcher

S. V. Kuleshov
St. Petersburg Institute for Informatics and Automation of Russian Academy of Sciences (SPIIRAS), Laboratory of Research Activities Automation

Abstract. To improve the accuracy of an associative search system, an algorithm for automatic extraction of collocations from a corpus of natural language texts is proposed. The algorithm computes an additive estimate of the bigrams (pairs of elements) in a text on the basis of a statistical approach and selects the most relevant bigrams using the Zipf distribution. Collocation extraction methods are analyzed on a random corpus of texts obtained from the Internet, using associative measures based on the frequency of occurrence of bigrams in the text (t-test, MI, and χ2), a grammatical filter, removal of stop words, and subsequent evaluation of these measures. Applying the additive estimation method when constructing the Zipf distribution makes it possible to determine the region of correct collocations, which reduces the number of errors in the resulting collocation lists.
Keywords: semantic analysis, entity, collocation, dictionary, associative measure, linguistic pattern, MI, t-test, χ2, associative search, Zipf distribution
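
The three associative measures named in the abstract can be illustrated with a minimal Python sketch. It uses the common textbook definitions of the t-score, MI (pointwise mutual information), and χ2 for adjacent word pairs, which may differ in detail from the authors' formulation. The function name `bigram_measures` and the toy token list are hypothetical, and the additive estimate, grammatical filter, stop-word removal, and Zipf-based selection are not reproduced because the abstract does not specify them.

```python
import math
from collections import Counter


def bigram_measures(tokens):
    """Return {bigram: (t_score, mi, chi2)} for adjacent token pairs.

    Textbook definitions are assumed; this is not the authors' code.
    """
    bigrams = list(zip(tokens, tokens[1:]))
    n = len(bigrams)                              # total number of bigrams
    pair_freq = Counter(bigrams)                  # f(xy)
    first_freq = Counter(a for a, _ in bigrams)   # f(x) as first element
    second_freq = Counter(b for _, b in bigrams)  # f(y) as second element

    scores = {}
    for (a, b), o11 in pair_freq.items():
        # 2x2 contingency table of observed counts
        o12 = first_freq[a] - o11    # x followed by something other than y
        o21 = second_freq[b] - o11   # y preceded by something other than x
        o22 = n - o11 - o12 - o21    # neither x first nor y second

        # expected count of the pair if x and y were independent
        expected = first_freq[a] * second_freq[b] / n

        t_score = (o11 - expected) / math.sqrt(o11)
        mi = math.log2(o11 / expected)

        denom = (o11 + o12) * (o11 + o21) * (o12 + o22) * (o21 + o22)
        chi2 = n * (o11 * o22 - o12 * o21) ** 2 / denom if denom else 0.0

        scores[(a, b)] = (t_score, mi, chi2)
    return scores


# Toy example (hypothetical data): rank bigrams by each measure separately.
tokens = "new york is a big city and new york is busy".split()
scores = bigram_measures(tokens)
for name, idx in (("t-score", 0), ("MI", 1), ("chi2", 2)):
    best = max(scores, key=lambda bg: scores[bg][idx])
    print(name, best, round(scores[best][idx], 3))
```

In practice the measures tend to rank the same bigrams differently (MI, for example, favors rare pairs), which is one reason to combine them rather than rely on any single one.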

  1. Chen W.T., Bonial C., Palmer M. Proc. of the 29th AAAI Conf. on Artificial Intelligence, Austin, TX, USA, 2015, pp. 2368–2374.
  2. Kolesnikova O., Gelbukh A. Lecture Notes in Computer Science, Gonzalez-Mendoza M., Castro F., Miranda-Jimenez S., ed., Mexico, 2018, pp. 3–14. DOI:10.13140/RG.2.1.2610.0242.
  3. Bobkova A. Thought Elaboration: Linguistics, Literature, Media Expression, Satkauskaite D., ed., Vilnius, Vilnius Univ., 2017, pp. 64–78.
  4. Granger S. Understanding Formulaic Language: A Second Language Acquisition Perspective, Siyanova-Chanturia A., Pellicer-Sanchez A., ed., NY, Routledge, 2018, pp. 228–247. DOI:10.4324/9781315206615.
  5. Gyllstad H., Wolter B. Language Learning, 2017, no. 3(45), pp. 296–323. DOI:10.1111/lang.12143.
  6. Leskina S.V., Sharanova V.B. South Ural State University Bulletin. Linguistics, 2014, no. 1, pp. 22–28. (in Russ.)
  7. Verma R., Vuppuluri V., Nguyen A., Mukherjee A., Mammar G., Baki S., Armstrong R. Lecture Notes in Computer Science, Springer Verlag, 2018, pp. 177–194. DOI:10.1007/978-3-319-75477-2_11.
  8. Vlavatskaya M.V. Philological Sciences. Issues of Theory and Practice, 2015, no. 11, pt. 1, pp. 56–60. (in Russ.)
  9. Yagunova E.V., Pivovarova L.M. Automatic Documentation and Mathematical Linguistics, 2010, no. 6, pp. 30–40. (in Russ.)
  10. Zakharov V.P., Khokhlova M.V. Computational Linguistics and Intelligent Technologies, 2010, no. 9(16), pp. 137–143. (in Russ.)
  11. Liu X., Huang D., Yin Z., Ren F. IEICE Transact. on Information and Systems, 2019, pp. 620–627. DOI:10.1587/transinf.2018EDP7255.
  12. Petrov A.S., Shul'ga T.E. Proc. of Voronezh State University. Series: Systems analysis and information technologies, 2017, no. 3, pp. 195–203. (in Russ.)
  13. Kuleshov S.V., Zaytseva A.A., Markov V.S. Intellectual Technologies on Transport, 2015, no. 4, pp. 40–45. (in Russ.)
  14. Naykhanova L.V. Tekhnologiya sozdaniya metodov avtomaticheskogo postroyeniya ontologii s primeneniyem geneticheskogo i avtomatnogo programmirovaniya (The Technology of Creating Methods for Automatically Constructing Ontologies Using Genetic and Automatic Programming), Ulan-Ude, 2008, 244 p. (in Russ.)