Ramkhamhaeng Research Journal

Abstract

Symbolic sequence classification is used in a variety of applications, such as DNA sequence analysis, intrusion detection, and electrocardiography (ECG) analysis. Conventional methods applicable to this field include probabilistic language models, support vector machines, and artificial neural networks. However, the list of words in very long sequences, such as DNA sequences, is sometimes unknown. In such cases, fixed-size imitation words are used instead. Consequently, the probabilities and positions of the original unknown words are distorted, which can lead to incorrect results in the long term. Moreover, incrementally updating the learned model with new data may be difficult. This research proposes a novel probabilistic language model that can learn incrementally without limit and does not require a fixed imitation-word size, yet retains the statistical information of substrings as accurately as possible while keeping the model size tractable. The model is evaluated by classifying E. coli DNA sequences into promoters and non-promoters. The classification accuracy is 96.23%, which is high compared with other standard methods.

Keywords

symbolic sequence classification, probabilistic language, unsegmented string, probabilistic finite automata, n-gram

Abstract (translated from the Thai)

Symbolic sequence classification can be applied in many ways, such as DNA analysis, intrusion detection, and electrocardiogram analysis. Standard methods currently applicable to this problem include probabilistic language models, artificial neural networks, and support vector machines. However, when the input data for learning are very long strings with no word segmentation, such as DNA strings, learning from these data requires splitting them into imitation words of limited length, which distorts the probabilities and positions of the correct words relative to the original data. As a result, applications yield less accurate results than they should. In addition, incremental learning with new data likewise produces inaccurate results. This research therefore proposes a new string learning model that can learn incrementally without limit and does not require splitting into imitation words, yet retains as much correct statistical information about substrings as possible while keeping the data dimensionality within a manageable range. The model uses automatic word segmentation based on unique strings and repeated strings. To test the model's performance, we classified DNA strings of the bacterium E. coli into two groups: promoters and non-promoters. The experiment shows a classification accuracy of 96.23 percent, which is considered high compared with other standard methods.

Keywords (translated from the Thai)

symbolic sequence classification, probabilistic language, unsegmented string, probabilistic automata, n-gram