Comparing Classic Methods and Corpus-Based Statistical Measures in the Automatic Extraction of Core Medical Terms Using a Frequency Approach

Article type: Research article

Authors

1 PhD student in General Linguistics, Payam-e-Noor University, Iran.

2 Associate Professor, Department of General Linguistics, Payam-e-Noor University, Iran.

3 Associate Professor, Department of General Linguistics, Payam-e-Noor University, Iran.

Abstract

Over the past two decades, with advances in science and technology, the use of corpus-based methods in language teaching and in the compilation of teaching materials has expanded remarkably. The present study aimed at an automatic method for extracting terms from corpora in Farsi. To this end, frequency-counting methods were examined in two groups, classic methods and methods based on statistical measures, and the capabilities of each were compared. The classic methods comprise frequency counting in a general corpus, frequency counting in a specialized corpus, and their improved variants. The results show that, by applying certain techniques, the classic methods' selection of specialized terms can be improved; among them, the best performance belonged to the improved frequency-counting method on the specialized corpus. The statistical methods used in the study are pointwise mutual information (PMI) and chi-square, and the results obtained for them likewise confirm the applicability of corpus-based frequency methods in Farsi: the chi-square method, extracting 32% specialized terms, and the PMI method, extracting 52% specialized terms, perform well in the automatic identification of specialized terms. Applying these methods to the corpora and comparing the results shows that statistical measures can be used for automatic term extraction in the language, opening the way to a new approach in preparing and compiling teaching materials, in which teachers have access to lists of words that are useful, and sometimes essential, for their language learners.
 

Keywords


Article Title [English]

Extraction of Core Medical Terms Using Frequency Approach

Authors [English]

  • Zohreh Zolfaghar Kondori 1
  • Tayebeh Mosavi Miangah 2
  • Belgheis Rowshan 3
1 PhD Candidate in General Linguistics, Payam-e-Noor University, Iran.
2 Corresponding author, Associate Professor, Department of General Linguistics, Payam-e-Noor University, Iran.
3 Associate Professor, Department of General Linguistics, Payam-e-Noor University, Iran.
Abstract [English]

During the past two decades, the use of corpus-based approaches in language teaching and in the design of teaching materials has increased remarkably. The goal of the present study was to develop an automatic approach to extracting medical terms from Farsi corpora. To this end, classic and statistical-measure-based frequency-counting methods were applied, and the capability of each approach was compared with the others. The classic frequency approaches include general-corpus frequency, specialized-corpus frequency, and their enhanced variants. The results showed that the specialized-term extraction of the classic approaches can be improved by applying certain techniques; the best performance belonged to the improved frequency approach on the specialized corpus, which covered 60% of the specialized terms within the first 50 extracted words. The chi-square and PMI methods confirmed the feasibility of corpus-based frequency approaches in Farsi: chi-square, extracting 32% of the specialized terms, and PMI, extracting 52%, performed well in automatic term extraction. Overall, the results of applying these approaches to the corpora, and their comparison, show that statistical-measure approaches are suitable for automatic term extraction. This opens the way to a new approach in preparing teaching materials, giving teachers access to lists of words that are useful, and occasionally essential, for language learners.
 
Extended Abstract
Over the past few decades, with the advancement of technology, the use of corpora in linguistic studies has increased dramatically. By providing large collections of text, linguistic corpora enable experts to apply a range of methods of linguistic analysis. Most studies to date have been carried out on English, French, and Japanese; research on Farsi remains limited, and this gap is especially tangible in specialized fields such as medical sciences, mathematics, science, and tourism. So far, most term or vocabulary extraction in Farsi has been done non-automatically, by researchers reading texts and collecting data by hand. Term extractors that have been quite successful for other languages such as English, French, and Japanese cannot simply be reused, because each extractor is defined around the particular features and properties of the language it was built for, and Farsi has its own technical characteristics. Addressing this problem was of paramount importance for improving Farsi teaching materials, so we set out to apply several extraction methods and to devise an extraction method that works properly for Farsi. Since Iran's universities annually admit many non-native international students who come to study fields such as medicine, engineering, and the humanities, preparing standard, modern Farsi teaching materials based on current technologies is significantly important. The purpose of this study was therefore to improve the resources used in teaching Farsi at the university level, especially to non-native speakers, to explore the feasibility of frequency-based methods for the automatic extraction of core medical terms, and to compare the capabilities of each method.
The findings reveal the strengths and weaknesses of these methods for Farsi, establish the possibility of using each of them, and offer technical solutions for improving the results.
Research Methodology:
The frequency-counting approaches used in this study drew on a general corpus and a specialized corpus created by the researcher. The general corpus was the Hamshahri Corpus; the researcher-made specialized corpus comprised texts from the science textbooks of grades 1-4 of senior high school and grades 1-3 of junior high school in Iran, science courses at the Imam Khomeini Farsi language center, and general medical texts from journals and the internet. After the corpus had been compiled, prepared, and tokenized, two families of frequency methods were introduced, classical and modern, and the capabilities of each method were then compared. The classical frequency methods were the frequency in the general corpus, the frequency in the specialized corpus, and their improved variants; the modern methods were PMI and chi-square. Pearson correlation analysis and trend analysis were used to compare the methods.
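The improved classical approach can be pictured as a frequency-ratio ranking: score each word by how much more frequent it is in the specialized corpus than in the general one, then take the top of the list. The sketch below is illustrative only; the function name, the add-one smoothing, and the toy tokens are assumptions, not the study's exact formulation.

```python
from collections import Counter

def freq_ratio_terms(special_tokens, general_tokens, top_n=50):
    """Rank candidate terms by the ratio of their relative frequency in the
    specialized corpus to their relative frequency in the general corpus."""
    sp = Counter(special_tokens)
    ge = Counter(general_tokens)
    n_sp, n_ge = len(special_tokens), len(general_tokens)
    scores = {}
    for word, f in sp.items():
        # Add-one smoothing keeps words absent from the general corpus
        # from causing a division by zero.
        scores[word] = (f / n_sp) / ((ge[word] + 1) / (n_ge + 1))
    return sorted(scores, key=scores.get, reverse=True)[:top_n]
```

On toy data, a word frequent in the specialized tokens but rare in the general ones rises to the top of the ranking, while high-frequency function words are pushed down.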
Research findings
The results showed that the classical methods in their basic form have little accuracy in identifying specialized vocabulary; by applying certain techniques, however, the selection of specialized vocabulary could be improved. The best performance belonged to the improved frequency method on the specialized corpus, which extracted 60% specialized vocabulary among the first 50 high-frequency words. This result improved as the scope of the study was widened to the first 100, 150, and 200 extracted words, where the percentage of specialized vocabulary identified rose to about 75%. The results for the modern methods indicated that they too can be used for Farsi: the chi-square method, with 32%, and the PMI method, with 52% specialized vocabulary among the first 50 high-frequency words, performed well in automatic term extraction, and these percentages also improved when the scope was widened to the first 200 words.
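The two modern measures can be sketched from a 2x2 contingency table that contrasts a word's occurrences in the specialized and general corpora. This is a minimal illustration assuming the standard textbook formulas for PMI and Pearson's chi-square; the function and the example counts are hypothetical, not taken from the study's corpora.

```python
import math

def termhood_scores(f_sp, f_ge, n_sp, n_ge):
    """PMI and chi-square association between a word and the specialized
    corpus, from a 2x2 contingency table of token counts:
        f_sp, f_ge : occurrences of the word in each corpus
        n_sp, n_ge : total tokens in each corpus
    """
    a, b = f_sp, f_ge                   # the word, in each corpus
    c, d = n_sp - f_sp, n_ge - f_ge     # all other tokens, in each corpus
    n = n_sp + n_ge
    # PMI: how much more often the word occurs in the specialized corpus
    # than its overall frequency across both corpora would predict.
    pmi = math.log2((a / n_sp) / ((a + b) / n))
    # Pearson's chi-square for the 2x2 table (one degree of freedom).
    chi2 = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    return pmi, chi2
```

For example, a word with 30 hits in a 10,000-token specialized corpus but only 5 hits in a 1,000,000-token general corpus receives a strongly positive PMI and a chi-square value far above the 3.84 significance threshold, so both measures would flag it as a specialized term.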
Conclusion:
The results of the research showed that frequency-based methods are applicable to Farsi. When classic frequency methods are used, their improved variants are needed to increase the accuracy of the extracted words. Likewise, to obtain reliable results with the modern frequency approaches, a sufficiently large vocabulary scope must be chosen for the extracted terms.

Keywords [English]

  • Automatic extraction of medical terms
  • corpus
  • Mixed extraction approaches
  • teaching Persian language

References:

Biber, D. (1993). Representativeness in corpus design. Literary and Linguistic Computing, 8(4), 243-257.

Biber, D. (1990). Some methodological issues in corpus-based analyses of linguistic variation, ms. University of Southern California, Los Angeles.

Bin, H., & Zhang, Y. (2013). Automatic term extraction in large text corpora. Retrieved from: https://www.cs.dal.ca/~yongzhen/course/6509/report.pdf

Chujo, K., Utiyama, M., & Oghigian, K. (2006). Selecting level-specific Kyoto tourism vocabulary using statistical measures. In New aspects of English language teaching and learning (pp. 126-138). Taipei: Crane Publishing Company Ltd.

Chujo, K., Oghigian, K., Nishigaki, C., Utiyama, M., & Nakamura, T. (2007). Creating e-learning material with statistically-extracted spoken and written business vocabulary from the British National Corpus. Journal of the College of Industrial Technology Nihon University, 40, 1-12.

Chujo, K., Nishigaki, C., & Utiyama, M. (2005). Selecting 500 essential daily-life words for Japanese EFL elementary students from English picture dictionaries and a children's spoken corpus. In Proceedings of the inaugural international conference on the teaching and learning of English in Asia, Penang, Malaysia, Vol. 11, No. 15, 1-12.

Chujo, K., Utiyama, M., & Nakamura, T. (2007). Extracting level-specific science and technology vocabulary from the Corpus of Professional English (CPE). Retrieved from: http://www.birmingham.ac.uk/documents/college-artslaw/corpus/conference-archives/2007/47Paper.pdf

Chujo, K., Utiyama, M., & Nakamura, T. (2010). Evaluating statistically extracted domain-specific word lists. In G. Weir & S. Ishikawa (Eds.), Corpus, ICT, and Language Education. Glasgow, UK: University of Strathclyde Publishing.

Chujo, K. (2004). Measuring Vocabulary Levels of English Textbooks and Tests Using a BNC Lemmatised High Frequency Word List. English Corpora under Japanese Eyes, Rodopi, pp. 231-249

Chujo, K., & Nishigaki, Ch. (2006). Creating Spoken Academic Vocabulary Lists from the British National Corpus. Practical English Studies, Vol.12, 19-34.

Coxhead, A., & Nation, P. (2001). The specialised vocabulary of English for academic purposes. In J. Flowerdew & M. Peacock (Eds.), Research perspectives on English for academic purposes (pp. 252-267). Cambridge: Cambridge University Press.

Daille, B. (1994). Study and implementation of combined techniques for automatic extraction of terminology in The balancing act: Combining symbolic and statistical approaches to language, 29-36.

Enguehard, C. & Pantera, L. (1995). Automatic natural acquisition of a terminology. Journal of quantitative linguistics, 2(1), 27-32.

Foo, J. (2012).Computational Terminology: Exploring Bilingual and Monolingual Term Extraction, Department of Computer and Information Science Linköping University, Linköping University Electronic Press.

Francis, W. N. (1992). Language corpora BC. In Directions in Linguistics: Proceedings of Nobel Symposium, Vol. 82, 17-32.

Frantzi, K., Ananiadou, S., & Mima, H. (2000). Automatic recognition of multi-word terms: the c-value/nc-value method. International Journal on Digital Libraries, 3(2), 115-130.

Gries, S. T. (2010). Useful statistics for corpus linguistics. A mosaic of corpus linguistics: Selected approaches, 66, 269-291.

Granger, S. (2015). The contribution of learner corpora to reference and instructional materials design. UK: Cambridge University Press.

Hulth, A. (2004). Combining machine learning and natural language processing for automatic keyword extraction. Doctoral dissertation, Stockholm University.

Jahangardi, K., Asi, M., Afrashi, A., & Vakilifard, A. R. (2016). Vocabulary in the textbooks of teaching Persian to non-Persian speakers: A corpus-based study. Journal of Teaching Persian to Speakers of Other Languages, 5(12), 3-26. [In Persian]

Leech, G. (1991). The state of the art in corpus linguistics. In Aijmer, K. and Altenberg, B. (Eds.) English corpus linguistics: Studies in Honour of Jan Svartvik. London: Longman

Leech, G. (1992). Corpora and theories of linguistic performance. Directions in corpus linguistics, 1992: 105-122.

McEnery, T. (2001). Corpus linguistics: An introduction. Edinburgh University Press.

Nematzadeh, Sh. (2013). Core vocabulary serving the first grade primary school mathematics textbook. Journal of Curriculum Studies (J.C.S.), 7(27), 67-84. [In Persian]

Nakagawa, H. & Mori, T. (2002). A simple but powerful automatic term extraction method. In COLING-02 on COMPUTERM 2002: second international workshop on computational terminology-Volume 14 , 1-7.

Pantel, P., & Lin, D. (2001). A statistical corpus-based term extractor. Advances in Artificial Intelligence, Vol. 2056, 36-46.

Patry, A., & Langlais, P. (2005). Corpus-based terminology extraction. In Terminology and Content Development: Proceedings of the 7th International Conference on Terminology and Knowledge Engineering. Copenhagen: Litera.

Rasooli, M. S., & Minaei-Bidgoli, B. (2008). A new approach for Persian spellchecking. In IDMC2008, 11-12 Nov, Amir Kabir University, Tehran, Iran. [In Persian]

Reppen, R. (2010). Using corpora in the language classroom. Cambridge University Press.

Rezai, M. J. & Mosavi Miangah, T. (2017). FarsiTag: a part-of-speech tagging system for Persian. Digital Scholarship in the Humanities, 32(3), 632–642.

Sepehri, M. (2006). Concordancing and its pedagogical implications. Language & Linguistics, Journal of the Linguistic Society of Iran, 2, 47-60. [In Persian]

Sinclair, J. (1991). Corpus, concordance, collocation (Vol. 1). Oxford: Oxford University Press.

Sinclair, J. (1998). Corpus evidence in language description. In G. Knowles, T. McEnery, S. Fligelstone, & A. Wichmann (Eds.), Teaching and language corpora (pp. 27-39). Longman.

Swales, J. M. (2002). Integrated and fragmented worlds: EAP materials and corpus linguistics. Academic discourse, 150-164.

Vulanovic, R. & Mosavi Miangah, T. (2018). A Comparison of the Accuracy of Parts-of-Speech Tagging Systems Based on a Mathematical Model, Journal of Quantitative Linguistics, DOI: 10.1080/09296174.2018.1474517. 

Wermter, V. J. (2008). Collocation and term extraction using linguistically enhanced statistical methods. Retrieved from: https://db_thueringen.de/servlets/MCRfileNodeServlet/dbt_derivate_00017176/warmter/dissertation.pdf

Zhang, C., & Wu, D. (2012). Bilingual terminology extraction using multi-level termhood. The Electronic Library, 30(2), 295-309.