DOI 10.17586/0021-3454- 2017-60-10-932-939
UDC 004.822, 004.021
QUICK SEARCH METHOD FOR NODES OF A SEMANTIC NETWORK BY EXACT WORD FORMS MATCHING
ITMO University, Saint Petersburg, 197101, Russian Federation; Assistant
E. A. Tsopa
ITMO University, Saint Petersburg, 197101, Russian Federation; Assistant
S. A. Zhmylev
GK YADRO LLC, Moscow, 125373, Russian Federation; Engineering consultant
A. V. Pokid
SVD Embedded Systems Ltd.; Engineer-Programmer
N. M. Tkeshelashvili
Elcom Ltd., Software Development Department; Programmer
Read the full article
Abstract. Development and usage of ontologies is an important part of modern text analysis application. When huge amount of text is analyzed, the lookup time in ontology becomes critical bottleneck. An approach to creation of search prefix tree of wordforms’ parts called x-gram is proposed. Division of the wordform into 3- and 4-gramms as well as phonetic and morphological syllable for Russian language is used. Every x-gram is represented by numerical index that allows its storage in the plain array. Resulting arrays are very sparse, so approach uses compactification to “insert” one array into another. When looking for a word, it is split into x-grams, the index for every x-gram is computed, consequent lookup is performed in constructed arrays, where each array corresponds to a single level of prefix tree. The developed program demonstrates the advantage of 36—50 % over Google dense hash-map in seek time and 12 % over Google sparse hash-map in memory consumption for set of Russian wordforms extracted from wellknown grammatical dictionary and Russian National Corpus. This approach is well suited for dictionary search in rarely changing wordform sets, such as ontology based on Russian Wiktionary.
Keywords: dictionary lookup, ontology based on Wiktionary, prefix tree, syllable, n-gramm, hashtables, exact match
References:
References:
- Pis'mak A.E., Kharitonova A.E., Tsopa E.A., Klimenkov S.V. Programmnye produkty i sistemy, 2016, no. 3, рр. 74–78. (in Russ.)
- Klimenkov S., Tsopa E., Pismak A., Yarkeev A. Proceedings of the 8th International Joint Conference on Knowledge Discovery, Knowledge Engineering and Knowledge Management (KDIR), 2016, no. 2, рр.74–80.
- Bocharov V.V., Granovskiy D.V., Surikov A.V. Novye informatsionnye tekhnologii v avtomatizirovannykh sistemakh (New Information Technologies in Automated Systems), Proceedings of the 15th Scientific-Practical Seminar, Moscow, 2012. (in Russ.)
- Neyer M.P. A Comparison of Dictionary Implementations, 2009, https://www.cs.unc.edu/~plaisted/comp550/Neyer%20paper.pdf.
- Mehlhorn K. and Sanders P.Algorithms and Data Structures. The Basic Toolbox, Springer, 2007, 285р.
- Peterson W.W. IBM J. of Research and Development, 1957, no. 1, рр. 130–146.
- Dumey A.I. Computers and Automation, 1956, no. 5(12), рр. 6–9.
- Comer D. and Shen V.Y. Practice and Experience, 1982, no. 7(12), рр. 669–682.
- Silverstein C. Google sparsehash package, 2010.
- Silverstein C. Google sparsehash package performance, http://goog-sparsehash.sourceforge.net/doc/performance.html 2010.
- Acharya Anurag, Huican Zhu, and Kai Shen. Workshop on Algorithm Engineering and Experimentation. Lecture Notes in Computer Science, Berlin–Heidelberg, Springer, 1999, no. 1619, рр. 300–315.
- Thenmozhi M., Srimathi H. Indian Journal of Science and Technology, 2015.
- Pryor D.V., Thistle M.R., and Shirazi N. Proceedings of the Workshop on FPGAs for Custom Computing Machines, 1993.
- Aho A.V., Hill M., Hopcroft J., Ulman J.D. Data Structures and Algorithms, 1983.
- Andreychenko L.N. Russkiy yazyk. Fonetika i fonologiya. Orfoepiya. Grafika i orfografiya (Russian. Phonetics and Phonology. Orthoepy. Graphics and Spelling), Moscow, 2003. (in Russ.)
- Malyshev B., Samarin A., and Vulis D. TUGboat, 1991, no. 2(12), рр.212–214.
- Zaliznyak A.A. Grammaticheskiy slovar' russkogo yazyka (Grammatical Dictionary of Russian), Moscow, 2008. (in Russ.)
- Granovskiy D.V., Bocharov V.V., Bichineva S.V. Komp'yuternaya lingvistika i razvitie semanticheskogo poiska v Internete (Computational Linguistics and Development of Semantic Search on the Internet), Proceedings of the Scientific Workshop of the XIII All-Russian Joint Conference "Internet and modern society", St. Petersburg, 2010, 94 р. (in Russ.)
- Zhukov M.A., Afanas'ev D.B. Scientific and Technical Journal of Information Technologies, Mechanics and Optics, 2015, no. 3(15), рр. 470–475 (in Russ.)
- Cisłak A. and Grabowski S. A practical index for approximate dictionary matching with few mismatches, arXiv:1501.04948, 2015.