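Graduate Student: | 鍾允昇 Yun-Sheng Chung |
---|---|
Thesis Title: | 具限制條件的序列比對:考慮範圍與順序資訊 Constrained Alignment with Range and Ordering Semantic Information |
Advisor: | 唐傳義 Chuan Yi Tang |
Committee Members: | |
Degree: | Master |
Department: | College of Electrical Engineering and Computer Science - Department of Computer Science |
Year of Publication: | 2004 |
Academic Year of Graduation: | 92 |
Language: | English |
Number of Pages: | 40 |
Keywords (translated from Chinese): | sequence alignment, computational biology, algorithms, computational complexity |
Keywords (English): | sequence alignment with constraints, computational biology, algorithms, computational complexity |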
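Adding meaningful constraints to the sequence alignment problem is an effective way to incorporate the extra knowledge a user has about the data at hand. For example, if the sequences to be aligned belong to the same protein family, and that family is known to carry certain sequence or structural features, then those features should be preserved and aligned together in the resulting alignment; otherwise the result can be regarded as a semantic violation. This is the main idea behind adding constraints to sequence alignment.
In this thesis, we introduce two new elements into the constraints: distance information between patterns, and whether the order in which the patterns appear must be fixed. Neither kind of information has been considered in the earlier literature on constrained sequence alignment, yet the biological literature has pointed out that the distance and order between binding sites sometimes affect gene regulation, so introducing these two elements is clearly important. In addition, we study how algorithmic efficiency is affected when one of the two sequences being aligned has already been annotated with the exact position of each pattern. This situation arises in practice, especially in protein function classification, so drawing this distinction and proposing more efficient algorithms for it is warranted both in theory and in application. We also generalize the definition of a pattern, which makes applications more natural and flexible.
In terms of results, we first present an algorithm designed for the generalized pattern definition. Next, for the case where one sequence has already been annotated, we give an efficient algorithm and use it as the core to substantially improve the time and space complexity of the 2-approximation algorithm of Chin et al.; in particular, optimal space complexity is achieved. For constraints with distance information, we adopt the elegant data structure proposed by Schmidt and combine it with a divide-and-conquer strategy to develop an efficient algorithm. For the case where the order of the patterns need not be maintained, we prove that the problem cannot be approximated within any function. If the problem definition is further restricted so that it becomes a member of NPO, we also give an NPO-completeness result.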
In this thesis, we study the constrained sequence alignment problem. We introduce two new elements into the problem: ranges between patterns, and unorderedness of the patterns. In addition to the various biological applications of these variants, we investigate the impact of the new elements on the design of efficient algorithms for the problems. We also introduce a new dimension to the problem: whether or not the instance is one-annotated, that is, whether one of the sequences already has the positions of its patterns marked. The one-annotated version, a special case of the original problem, has its own biological applications and often admits more efficient algorithms, so it is worth distinguishing the two cases.
The goal of the constrained alignment problem is to align a set of sequences such that specified patterns are aligned together. This is desirable because one often knows which patterns are necessary for some function to work. When aligning sequences under such knowledge, or when determining whether a query sequence has the function of some protein family, constrained alignment is useful. For the original problem, we propose a 2-approximation algorithm that significantly improves on the efficiency of previous results.
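To make the notion of "patterns aligned together" concrete, the following is a minimal sketch of the classical pairwise formulation in which each constraint is a single character that must occupy a matched column, in order. It is an illustration under stated assumptions, not the thesis's algorithms or its 2-approximation: the function name `constrained_alignment_cost`, the unit `mismatch` and `gap` costs, and the cost-minimizing objective are all hypothetical choices made for this example.

```python
from math import inf


def constrained_alignment_cost(a, b, constraint, mismatch=1, gap=1):
    """Minimum cost of aligning a and b such that every character of
    `constraint` appears, in order, in a column where both sequences
    carry that character. Returns inf if no such alignment exists."""
    n, m, k = len(a), len(b), len(constraint)
    # D[c][i][j]: best cost for a[:i] vs b[:j] with the first c
    # constraint characters already placed in matched columns.
    D = [[[inf] * (m + 1) for _ in range(n + 1)] for _ in range(k + 1)]
    for c in range(k + 1):
        for i in range(n + 1):
            for j in range(m + 1):
                if i == 0 and j == 0:
                    best = 0 if c == 0 else inf
                else:
                    best = inf
                    if i > 0:  # a[i-1] aligned with a gap
                        best = min(best, D[c][i - 1][j] + gap)
                    if j > 0:  # b[j-1] aligned with a gap
                        best = min(best, D[c][i][j - 1] + gap)
                    if i > 0 and j > 0:
                        # ordinary match/mismatch column
                        sub = 0 if a[i - 1] == b[j - 1] else mismatch
                        best = min(best, D[c][i - 1][j - 1] + sub)
                        # column that satisfies the c-th constraint
                        if c > 0 and a[i - 1] == b[j - 1] == constraint[c - 1]:
                            best = min(best, D[c - 1][i - 1][j - 1])
                D[c][i][j] = best
    return D[k][n][m]
```

With these assumed costs, aligning `"AAB"` and `"BAA"` without constraints costs 2, while requiring the `B` characters to share a column (`constraint="B"`) forces gaps around them and raises the cost to 4. The table has `(k+1)(n+1)(m+1)` entries, so this sketch runs in time and space proportional to `k * len(a) * len(b)`.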
Ranges and order of binding elements are sometimes important in determining gene expression, so introducing range and order information is meaningful both theoretically and biologically. For the range information, we require that the range between every two adjacent patterns satisfy the user's specification. We refer to this problem as SARC (sequence alignment with ranged constraint). For the one-annotated case, our algorithm solves the problem in $O(n^{2} \log n)$ time and $O(n^{2})$ space. For the order information, we prove that if the patterns are allowed to appear in any order in the output alignment, then the problem is not approximable within any function computable in polynomial time. We also show that a relative of this problem that is a member of NPO is NPO-complete.
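The range requirement on adjacent patterns can be illustrated with a small feasibility check. This is only a sketch under assumptions that the abstract does not fix: the function name `satisfies_ranges` is hypothetical, pattern occurrences are taken to be half-open `(start, end)` intervals listed in order, and the "range" between two adjacent patterns is assumed to be the number of positions between the end of one and the start of the next, bounded inclusively by `(lo, hi)`.

```python
def satisfies_ranges(occurrences, ranges):
    """Check whether ordered pattern occurrences respect the given ranges.

    occurrences: list of (start, end) half-open intervals, one per pattern.
    ranges: list of (lo, hi) inclusive bounds, one per adjacent pair.
    """
    expected = max(len(occurrences) - 1, 0)
    if len(ranges) != expected:
        raise ValueError("need exactly one range per adjacent pattern pair")
    pairs = zip(occurrences, occurrences[1:], ranges)
    for (_, end_prev), (start_next, _), (lo, hi) in pairs:
        distance = start_next - end_prev  # positions between the two patterns
        if not lo <= distance <= hi:
            return False
    return True
```

For example, occurrences `[(0, 3), (5, 8), (8, 10)]` have distances 2 and 0 between adjacent patterns, so they satisfy the ranges `[(1, 3), (0, 0)]` but not `[(3, 5), (0, 0)]`.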
Another direction of extension in this thesis is to generalize the definition of patterns. We introduce a generalized framework that admits greater flexibility without loss of efficiency.
Abendroth, J., Niefind, K., Schomburg, D.:
X-ray Structure of a Dihydropyrimidinase from Thermus sp. at 1.3A Resolution.
J. Mol. Biol. 320 (2002) 143--156
Ausiello, G., Crescenzi, P., Protasi, M.:
Approximate Solution of NP Optimization Problems.
Theoret. Comput. Sci. 150 (1995) 1--55
Bafna, V., Muthukrishnan, S., Ravi, R.:
Computing Similarity between RNA Strings.
DIMACS Tech. Rep. (1996)
Bonizzoni, P., Vedova, G. D.:
The Complexity of Multiple Sequence Alignment with SP-Score that is a Metric.
Theoret. Comput. Sci. 259 (2001) 63--79
Chin, F. Y. L., Ho, N. L., Lam, T. W., Wong, P. W. H., Chan, M. Y.:
Efficient Constrained Sequence Alignment with Performance Guarantee.
In: Proc. Comput. Syst. Bioinfo. (CSB'03). IEEE (2003) 337--346
Cormen, T.H., Leiserson, C.E., Rivest, R.L., Stein, C.:
Introduction to Algorithms (2nd ed.)
The MIT Press (2001)
Crescenzi, P., Kann, V. (eds.):
A Compendium of NP Optimization Problems.
http://www.nada.kth.se/~viggo/wwwcompendium/wwwcompendium.html
Evans, P. A.:
Algorithms and Complexity for Annotated Sequence Analysis.
Ph.D. Thesis, Department of Computer Science, University of Victoria, Canada (1999)
Faisst, S., Meyer, S.:
Compilation of Vertebrate-encoded Transcription Factors.
Nucleic Acids Research 20 (1992) 3--26
Fessele, S., Maier, H., Zischek, C., Nelson, P.J., Werner, T.:
Regulatory Context is a Crucial Part of Gene Function.
Trends Genet. 18 (2002) 60--63
Garey, M.R., Johnson, D.S.:
Computers and Intractability: A Guide to the Theory of NP-Completeness.
W.H. Freeman & Company (1979)
Gusfield, D.:
Efficient Methods for Multiple Sequence Alignment
with Guaranteed Error Bounds.
Bull. Math. Biol. 30 (1993) 141--154
Gusfield, D.:
Algorithms on Strings, Trees, and Sequences. Cambridge University Press (1997)
Hirschberg, D. S.:
A Linear Space Algorithm for Computing Maximal Common Subsequences.
Comm. ACM. 18 (1975) 341--343
Huang, X.:
A Lower Bound for the Edit-Distance Problem under Arbitrary Cost Function.
Inform. Process. Lett. 27 (1988) 319--321
Jiang, T., Lin, G., Ma, B., Zhang, K.:
A General Edit Distance between RNA Structures.
J. Comput. Biol. 9 (2002) 371--388
Jiang, T., Xu, Y., Zhang, M. Q. (eds.):
Current Topics in Computational Molecular Biology. The MIT Press (2002)
Karp, R. M.:
Reducibility among Combinatorial Problems.
In R. E. Miller and J. W. Thatcher (eds.),
Complexity of Computer Computations.
Plenum Press, New York (1972) 85--103
Katoh, K., Misawa, K., Kuma, K.-I., Miyata, T.:
MAFFT: A Novel Method for Rapid Multiple Sequence Alignment
Based on Fast Fourier Transform.
Nucleic Acids Res. 30 (2002) 3059--3066
Kel, A., Kel-Margoulis, O., Babenko, V., Wingender, E.:
Recognition of NFATp/AP-1 Composite Elements within Genes
Induced upon the Activation of Immune Cells.
J. Mol. Biol. 288 (1999) 353--376
Kel-Margoulis, O., Kel, A.E., Reuter, I., Deineko, I.V.,
Wingender, E.:
TRANSCompel: a Database on Composite
Regulatory Elements in Eukaryotic Genes.
Nucleic Acids Res. 30 (2002) 332--334
Lin, G.-H., Chen, Z.-Z., Jiang, T., Wen, J.:
The Longest Common Subsequence Problem for Sequences with Nested Arc Annotations.
J. Comput. Syst. Sci. 65 (2002) 465--480
Ma, B., Wang, L., Zhang, K.:
Computing Similarity between RNA Structures.
Theoret. Comput. Sci. 276 (2002) 111--132
Myers, G., Selznick, S., Zhang, Z., Miller, W.:
Progressive Multiple Alignment with Constraints.
J. Comput. Biol. 3 (1996) 563--572
Schmidt, J. P.:
All Highest Scoring Paths in Weighted Grid Graphs
and Their Application to Finding All Approximate Repeats in Strings.
SIAM J. Comput. 27 (1998) 972--992
Tang, C. Y., Lu, C. L., Chang, M. D. T., Sun, Y. J., Tsai, Y. T., Chang, J. M., Chiou, Y. H., Wu, C. M., Chang, H. T., Chou, W. I., Chiang, S. C.:
Constrained Sequence Alignment Tool Development
and its Application to RNase Family Alignment.
J. Bioinfo. Comput. Biol. 1 (2003) 267--287
Taylor, W. R.:
Motif-biased Protein Sequence Alignment.
J. Comput. Biol. 1 (1994) 297--310
Tsai, Y. T., Lu, C. L., Yu, C. T., Huang, Y. P.:
MuSiC: A Tool for Multiple Sequence Alignment with Constraints.
Bioinformatics (to appear)
Wang, L., Jiang, T.:
On the Complexity of Multiple Sequence Alignment.
J. Comput. Biol. 1 (1994) 337--348
Wishart, D.S., Boyko, R.F., Sykes, B.D.:
Constrained Multiple Sequence Alignment Using XALIGN.
Comput. Appl. Biosci. 10 (1994) 687--688
Wu, Q.S., Chao, K.M., Lee, R.C.T.:
The NPO-Completeness of the Longest Hamiltonian Cycle Problem.
Inform. Process. Lett. 65 (1998) 119--123
Zhang, K.:
Computing Similarity between RNA Secondary Structures.
Proc. IEEE Internat. Joint Symp. on Intelligence and Systems (1998) 126--132