具限制條件的序列比對：考慮範圍與順序資訊

簡易檢索 / 詳目顯示

回結果列表

研究生：	鍾允昇 Yun-Sheng Chung
論文名稱：	具限制條件的序列比對：考慮範圍與順序資訊 Constrained Alignment with Range and Ordering Semantic Information
指導教授：	唐傳義 Chuan Yi Tang
口試委員:
學位類別：	碩士 Master
系所名稱：	電機資訊學院 - 資訊工程學系 Computer Science
論文出版年：	2004
畢業學年度：	92
語文別：	英文
論文頁數：	40
中文關鍵詞：	序列比對、計算生物學、演算法、計算複雜度
外文關鍵詞：	sequence alignment with constraints, computational biology, algorithms, computational complexity
相關次數：	點閱：110 下載：0
分享至:	分享至facebook 分享至twitter

查詢本校圖書館目錄查詢臺灣博碩士論文知識加值系統勘誤回報

在序列排比問題（Sequence Alignment Problem）之中，加入有意義的限制條件（Constraints），是一種整合使用者對於所處理的資料所具備的額外知識的有效方法。例如，欲比對的序列如果屬於同一類蛋白質，且已知該類蛋白質必定擁有一些序列上或結構上的特徵，則在序列比對的結果中，這類特徵應該被保持，並且應被排比在一起；否則可視為一種語意上的違背。這是在序列排比中，加入限制條件的主要精神。
在本論文中，我們引入兩項新的元素於限制條件中：pattern之間的距離資訊，以及pattern出現的順序是否要求必須固定。這兩類資訊並沒有被考慮在先前關於具限制條件的序列比對的文獻中，但在生物文獻中，曾指出binding site之間的距離和順序，有時會影響基因的調控。引入這兩項新元素，顯然是十分重要的。此外，我們也探討當兩條序列在比對時，其中一條如果已經被標定各個pattern出現的確定位置，對於演算法效率的影響。這種情形在實際應用上有其需要，尤其是在作蛋白質功能分類上，因此作這樣的區分，並提出更有效率的演算法，在理論和應用上，確有其必要。我們也推廣了pattern的定義，使得應用上得以更加自然而有彈性。
以結果而言，我們首先提出為推廣後的pattern定義所設計的演算法。其次，針對一條序列已經被標定的情形，提出有效率的演算法，並以之為核心，大幅改善了Chin等人的兩倍近似演算法的時間與空間複雜度，尤其可達到最適空間複雜度。對於加入距離資訊的限制條件，我們引入Schmidt所提出十分精緻的資料結構，結合Divide-and-Conquer策略，發展了一個有效率的演算法。至於如果pattern之間的順序不需要維持一定，我們證明了這個問題無法在任何函數之內被近似。如果把問題定義進一步規範成為NPO的成員，我們也提出了NPO-complete的結果。

In this thesis, we study the constrained sequence alignment problem. We introduce two new elements into the problem: ranges between patterns, and unorderedness of the patterns.
In addition to the various biological applications of such variants, we investigate the impacts of the new elements
to the design of efficient algorithms for the problems. Also, we introduce a new dimension to the problem:
one-annotated or not. The one-annotated version, as a special case of the original problem, has its own biological applications, and often admits more efficient algorithms. Hence to clarify the difference is meaningful.

The goal of the constrained alignment problem
is to align a set of sequences such that specified patterns must be aligned together. This is desirable since one often have the knowledge about the patterns that are necessary for some function to work. If one is aligning sequences under such knowledge, or want to determine if a query sequence have the function of some protein family,
constrained alignment turns out to be useful. For the original problem, we proposed a 2-approximation algorithm
which significantly improves the efficiency over previous results.

Ranges and order of binding elements are sometimes important for determining gene expression. The introduction of ranges and order information turns out to be meaningful both theoretically and biologically. As to the range information, we require that ranges between any two adjacent patterns satisfy user's specification. We refer to this problem as SARC (sequence alignment with ranged constraint). For the one-annotated case, our algorithm solves the problem in $O(n^{2} \log n)$ time and $O(n^{2})$ space. As to the order information, we prove that if the patterns are allowed to appear in any order in the output alignment, then the problem becomes not approximable within any function computable in polynomial time. We also show that the a relative to this problem that is a member of NPO turns out to be NPO-complete.

Another direction of extension in this thesis is to generalize the definition of patterns. We introduce a generalized framework to admit higher flexibility without loss of efficiency.

Abstract{i}
{1}Introduction{1}
{2}Preliminaries{6}
{2.1}Definitions and Problem Formulation{6}
{2.2}Annotation of the Patterns{10}
{2.2.1}Annotation of Type-1 Patterns{10}
{2.2.2}Annotation of Type-2 Patterns{11}
{2.2.3}A Note on the Efficiency of the Annotation Process{12}
{3}Constrained Sequence Alignment{13}
{3.1}Constrained Pairwise Sequence Alignment{13}
{3.2}One-Annotated Constrained Sequence Alignment{15}
{3.3}A More Efficient Approximation Algorithm for CMSA{17}
{4}Sequence Alignment with Ranged Constraint{19}
{4.1}The Data Structure{19}
{4.2}The Algorithm for One-Annotated SARC{20}
{4.3}Multiple Sequence Alignment with Ranged-Motif onstraint{24}
{5}Sequence Alignment with Unordered Constraint{26}
{5.1}Terminology{27}
{5.2}Hardness of SAUC{29}
{6}Conclusions and Future Works{34}
{6.1}Conclusions{34}
{6.2}Future Works{34}
{6.2.1}Providing Services on the Web{34}
{6.2.2}A More Efficient Algorithm for the Unannotated SARC{35}
{6.2.3}Lower Bound for the Pairwise Version without Range Constraints{36}
{6.2.4}More Hybrid Annotated Sequence Comparison{36}
Bibliography{38}

                                

Abendroth, J., Niefind, K., Schomburg, D.:
X-ray Structure of a Dihydropyrimidinase from Thermus sp. at 1.3A Resolution.
J. Mol. Biol. 320 (2002) 143--156

Ausiello, G., Crescenzi, P., Protasi, M.:
Approximate Solution of NP Optimization Problems.
Theoret. Comput. Sci. 150 (1995) 1--55

Bafna, V., Muthukrishnan, S., Ravi, R.:
Computing Similarity between RNA Strings.
DIMACS Tech. Rep. (1996)

Bonizzoni, P., Vedova, G. D.:
The Complexity of Multiple Sequence Alignment with SP-score that is a metric.
Theoret. Comput. Sci. 259 (2001) 63--79

Chin, F. Y. L., Ho, N. L., Lam, T. W., Wong, P. W. H., Chan, M. Y.:
Efficient Constrained Sequence Alignment with Performance Guarantee.
In: Proc. Comput. Syst. Bioinfo. (CSB'03). IEEE (2003) 337--346

Cormen, T.H., Leiserson, C.E., Rivest, R.L., Stein, C.:
Introduction to Algorithms (2nd ed.)
The MIT Press (2001)

Crescenzi, P., Kann, V. (eds.):
A compendium of NP optimization problems.
http://www.nada.kth.se/~viggo/wwwcompendium/wwwcompendium.html

Evans, P. A.:
Algorithms and Complexity for Annotated Sequence Analysis.
Ph. D Thesis, Department of Computer Science, University of Victoria, Canada (1999)

Faisst, S., Meyer, S.:
Compilation of Vertebrate-encoded Transcription Factors.
Nucleic Acids Research 20 (1992) 3--26

Fessele, S., Maier, H., Zischek, C., Nelson, P.J., Werner,T.:
Regulatory Context is a Crucial Part of Gene Function.
Trends Genet., 18 (2002) 60-63

Garey, M.R., Johnson, D.S.:
Computers and Intractability: A Guide to the Theory of NP-Completeness.
W.H. Freeman & Company (1979)

Gusfield, D.:
Efficient Methods for Multiple Sequence Alignment
with Guaranteed Error Bounds.
Bul. Math. Biol. 30 (1993) 141--154

Gusfield, D.:
Algorithms on Strings, Trees, and Sequences. Cambridge University Press (1997)

Hirschberg, D. S.:
A Linear Space Algorithm for Computing Maximal Common Subsequences.
Comm. ACM. 18 (1975) 341--343

Huang, X.:
A Lower Bound for the Edit-Distance Problem under Arbitrary Cost Function.
Inform. Process. Lett. 27 (1988) 319--321

Jiang, T., Lin, G., Ma, B., Zhang, K.:
A General Edit Distance between RNA Structures.
J. Comput. Biol. 9 (2002) 371--388

Jiang, T., Xu, Y., Zhang, M. Q. (ed.):
Current Topics in Computational Molecular Biology. The MIT Press (2002)

Karp, R. M.:
Reducibility among Combinatorial Problems.
In R. E. Miller and J. W. Thatcher (eds.),
Complexity of Computer Computations.
Plenum Press, New York (1972) 85--103

Katoh, K., Misawa, K., Kuma, K.-I., Miyata, T.:
MAFFT: A Novel Method for Rapid Multiple Sequence Alignment
Based on Fast Fourier Transform.
Nucleic Acids Res. 30 (2002) 3059--3066

Kel, A., Kel-Margoulis, O., Babenko, V., Wingender, E.:
Recognition of NFATp/AP-1 Composite Elements within Genes
Induced upon the Activation of Immune Cells.
J. Mol. Biol. 288 (1999) 353-376

Kel-Margoulis, O., Kel, A.E., Reuter, I., Deineko, I.V.,
Wingender, E.:
TRANSCompel: a Database on Composite
Regulatory Elements in Eukaryotic Genes.
Nucleic Acids Res. 30 (2002) 332-334

Lin, G.-H., Chen, Z.-Z., Jiang, T., Wen, J.:
The Longest Common Subsequence Problem for Sequences with Nested Arc Annotations.
J. Comput. Syst. Sci. 65 (2002) 465--480

Ma, B., Wang, L., Zhang, K.:
Computing Similarity between RNA Structures.
Theoret. Comput. Sci. 276 (2002) 111--132

Myers, G., Selznick, S. Zhang, Z., Miller, W.:
Progressive Multiple Alignment with Constraints.
J. Comput. Biol. 3 (1996) 563-572

Schmidt, J. P.:
All Highest Scoring Paths in Weighted Grid Graphs
and Their Application to Finding All Approximate Repeats in Strings.
SIAM J. Comput. 27 (1998) 972--992

Tang, C. Y., Lu, C. L., Chang, M. D. T., Sun, Y. J., Tsai, Y. T., Chang, J. M., Chiou, Y. H., Wu, C. M., Chang, H. T., Chou, W. I., Chiang, S. C.:
Constrained Sequence Alignment Tool Development
and its Application to RNase Family Alignment.
J. Bioinfo. Comput. Biol. 1 (2003) 267--287

Taylor, W. R.:
Motif-biased Protein Sequence Alignment.
J. Comput. Biol. 1 (1994) 297--310

Tsai, Y. T., Lu, C. L., Yu, C. T., Huang, Y. P.:
MuSiC: A Tool for Multiple Sequence Alignment with Constraints.
Bioinformatics (to appear)

Wang, L., Jiang, T.:
On the Complexity of Multiple Sequence Alignment.
J. Comput. Biol. 1 (1994) 337--348

Wishart, D.S., Boyko, R.F., Sykes, B.D.:
Constrained multiple sequence alignment using XALIGN.
Comput. Appl. Biosci. 10 (1994) 687--688

Wu, Q.S., Chao, K.M., Lee, R.C.T.:
The NPO-completeness of the longest Hamiltonian cycle problem.
Infor. Proc. Lett. 65 (1998) 119--123

Zhang, K.:
Computing Similarity between RNA Secondary Structures.
Proc. IEEE Internat. Joint Symp. on Intelligence and Systems (1998) 126--132

全文公開日期本全文未授權公開 (校內網路)
全文公開日期本全文未授權公開 (校外網路)

簡易檢索 / 詳目顯示

相關論文