A Study on Efficient Discoveries of Similar and Dissimilar Patterns in Sequence Databases

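Graduate student:	李孝屏 Lee, Hsiao Ping
Thesis title:	基因資料庫中相似與不相似型樣搜尋技術之研究 A Study on Efficient Discoveries of Similar and Dissimilar Patterns in Sequence Databases
Advisor:	唐傳義 Tang, Chuan-Yi
Oral defense committee:
Degree:	博士 Doctor
Department:	College of Electrical Engineering and Computer Science - Department of Computer Science
Year of publication:	2010
Academic year of graduation:	98 (ROC calendar)
Language:	English
Number of pages:	113
Keywords (Chinese):	相似型樣搜尋、不相似型樣搜尋、隱含特徵型樣搜尋
Keywords (English):	similar pattern discovery, dissimilar pattern discovery, implicit signature discovery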

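In this study, we discuss the problems of searching for similar patterns and dissimilar patterns in biological DNA sequence databases, and we design fast, effective search algorithms for these problems.
As the volume of data stored in databases and the number of user queries grow rapidly, filtering algorithms that eliminate, quickly and early, patterns that cannot be answers are receiving increasing attention and becoming ever more important for the similar pattern search problem. However, existing gram-based filtering algorithms do not consider the order in which grams are arranged on a pattern, which can cause an excessively high false-positive rate and poor filtering performance. In this study, we transform the problem of searching for similar patterns that satisfy specified mismatch and coverage conditions into a longest increasing subsequence search problem with range constraints, and we propose a fast filtering algorithm called IDCF. Our experimental data show that, because the IDCF algorithm takes into account the order in which grams appear on a pattern, it markedly reduces the number of candidates for similar patterns and effectively improves search performance.
Dissimilar patterns can serve as signatures that distinguish the different sequences in a database. To search for dissimilar patterns quickly, we designed two signature search algorithms called IMUS and USD. The IMUS algorithm handles small biological sequence data sets that can be loaded entirely into internal memory, while the USD algorithm uses IMUS as its processing kernel and is designed specifically for large databases whose data are too large to be loaded entirely into internal memory. Our experimental results show that the number of character comparisons required by IMUS and USD is clearly lower than that of existing signature search algorithms. On an ordinary personal computer, the IMUS and USD algorithms can finish finding dissimilar signature patterns in the 156 MB human chromosome 11 sequence within one day, while the same task on the human chromosome Y sequence takes only 35 seconds.
Signature search algorithms usually require parameters to be set, including the pattern length l and the mismatch tolerance d, and these settings affect the search results. However, suggestions and guidelines for choosing appropriate parameter values are rarely given, and the problem is more serious when working with an unfamiliar database. In most cases, biologists can only set the parameters based on past experience or even by guessing; if the result is unsatisfactory, they try another combination of parameters and search again, repeating this trial-and-error process until a satisfactory result is found. For a specified search condition (l, d), we call all signatures with length less than or equal to l and mismatch tolerance greater than or equal to d the implicit signatures under that search condition. If a search algorithm could quickly and effectively find all implicit signatures that satisfy the user's needs, it would reduce the repeated trial and error and help biologists; however, existing signature search algorithms were not designed with the need to find implicit signatures in mind, so their performance on this task is unsatisfactory. In this study, we propose two signature search algorithms: Consecutive Multiple Discovery (CMD) and Parallel and Incremental Signature Discovery (PISD). The PISD algorithm is a fast algorithm designed to find signatures under a specified search condition. It adopts the concept of incremental discovery, using existing, known results as candidates and looking for new results among those candidates rather than searching the entire database; it also introduces parallel computing to speed up signature search. CMD is an algorithm designed for finding implicit signatures; it uses PISD as its processing kernel to find all implicit signatures quickly, in an incremental manner, under each specific search condition. The incremental algorithms we propose can indeed find implicit signatures in a database quickly and effectively; the experimental data show that, with 8 processors, the CMD algorithm saves more than 97% of the execution time compared with a traditional sequential search algorithm.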

In this study, the problems of discovering similar and dissimilar patterns in sequence databases are discussed, and efficient algorithms for their discovery are designed.
With database sizes and query volumes increasing exponentially, the filtration approach, which discards impossible patterns early to accelerate similar pattern discovery, is becoming more and more important in bioinformatics. However, most of the known gram-based filtration approaches in the literature do not consider the order of the grams in the sequences, which leads to higher false-positive rates. In this study, the task of extracting similar patterns under a certain coverage level and error tolerance is transformed into a longest increasing subsequence problem with range constraints, and an efficient algorithm, the Incremental Decreasing Cover Filtering (IDCF) algorithm, is designed for the filtration. Experimental results show that the IDCF algorithm significantly reduces the number of candidates for similar pattern discovery.
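The effect of gram order can be illustrated with a small sketch. This is not the IDCF algorithm, whose details and range constraints are not given in this record; it only shows, under assumed definitions, how a longest increasing subsequence over candidate positions counts the grams that match in the same order. The function names and the test strings are hypothetical.

```python
from bisect import bisect_left

def qgrams(s, q):
    """All overlapping substrings of length q, in order of appearance."""
    return [s[i:i + q] for i in range(len(s) - q + 1)]

def lis_length(values):
    """Length of the longest strictly increasing subsequence."""
    tails = []
    for v in values:
        i = bisect_left(tails, v)
        if i == len(tails):
            tails.append(v)
        else:
            tails[i] = v
    return len(tails)

def unordered_gram_matches(query, candidate, q):
    """Order-blind count: query grams that occur anywhere in the candidate."""
    cand = set(qgrams(candidate, q))
    return sum(1 for g in qgrams(query, q) if g in cand)

def ordered_gram_matches(query, candidate, q):
    """Largest number of query grams that occur in the candidate in the same order.

    For each query gram, the candidate positions are listed in descending
    order, so a strictly increasing subsequence can use each query gram
    at most once.
    """
    positions = {}
    for j, g in enumerate(qgrams(candidate, q)):
        positions.setdefault(g, []).append(j)
    seq = []
    for g in qgrams(query, q):
        seq.extend(sorted(positions.get(g, []), reverse=True))
    return lis_length(seq)

# Hypothetical example: the candidate is the query with its two halves swapped.
query, candidate = "ACGTTGCA", "TGCAACGT"
shared = unordered_gram_matches(query, candidate, 3)   # 4 grams are shared
in_order = ordered_gram_matches(query, candidate, 3)   # only 2 of them keep their order
```

An order-blind filter would keep this candidate on the strength of four shared grams, while the order-aware count shows that at most two of them can belong to one alignment.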
Dissimilar patterns can be used as unique signatures to distinguish a sequence from the other sequences in a database. To achieve efficient unique signature discovery in genomic databases, two efficient algorithms, IMUS and USD, are designed in this study. The IMUS algorithm handles a sequence database that can be loaded entirely into internal memory, and the USD algorithm uses the IMUS algorithm as a kernel routine to handle large-scale databases that cannot. Our experimental results show that the number of character comparisons used by the IMUS and USD algorithms is significantly lower than that of the existing discovery algorithms. On a regular PC platform, the IMUS and USD algorithms discover unique signatures from a human chromosome 11 EST database of 156M bases within one day, and take 35 seconds to discover signatures from a human chromosome Y EST database.
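To make the notion of a unique signature concrete, here is a brute-force baseline, not the IMUS or USD algorithm. It assumes one common definition that this record does not spell out: a length-l window of a sequence is a signature with mismatch tolerance d if its Hamming distance to every length-l window of every other sequence is greater than d. All names and the toy database are hypothetical.

```python
def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))

def windows(seq, l):
    """All length-l windows of seq, in order."""
    return [seq[i:i + l] for i in range(len(seq) - l + 1)]

def unique_signatures(db, l, d):
    """Map each sequence index to its windows that differ from every
    window of every other sequence in more than d positions
    (assumed definition, exhaustive comparison)."""
    result = {}
    for i, seq in enumerate(db):
        others = [w for j, s in enumerate(db) if j != i for w in windows(s, l)]
        result[i] = [w for w in windows(seq, l)
                     if all(hamming(w, o) > d for o in others)]
    return result

# Hypothetical toy database of three short sequences.
toy_db = ["AAAAAA", "AAATTT", "CCCCCC"]
signatures = unique_signatures(toy_db, l=4, d=1)
```

Every window is compared with every window of the other sequences, so the character comparisons grow quadratically with database size; that comparison count is the cost the abstract reports IMUS and USD reducing.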
Signature discovery algorithms require some input factors to be set, such as the signature length l and the mismatch tolerance d, and these affect the discovery results. However, suggestions about how to select proper factor values are rare, and the problem is worse when an unfamiliar DNA database is used. In most cases, biologists select factor values based on experience, or even by guessing. If the discovered result is unsatisfactory, biologists change the input factors of the algorithm to obtain a new result. This process is repeated until a proper result is obtained. Implicit signatures under the discovery condition (l, d) are defined as the signatures of length at most l with mismatch tolerance at least d. A discovery algorithm that could discover all implicit signatures, including those that meet the user's requirements, would be more helpful than one that depends on trial and error. However, existing discovery algorithms do not address the need to discover all implicit signatures. Two discovery algorithms, the consecutive multiple discovery (CMD) algorithm and the parallel and incremental signature discovery (PISD) algorithm, are proposed in this study. The PISD algorithm is designed for efficiently discovering signatures under a certain discovery condition. It finds new results by using previously discovered results as candidates, rather than by searching the whole database, and it further increases discovery efficiency by applying parallel computing. The CMD algorithm is designed to discover implicit signatures efficiently. It uses the PISD algorithm as a kernel routine to discover implicit signatures under every feasible discovery condition. In experiments on the discovery of implicit signatures with eight processing cores, the presented CMD algorithm takes up to 97% less execution time than typical sequential discovery algorithms.
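The idea of reusing earlier results as candidates can be sketched as follows. This is a sequential illustration only, not the PISD or CMD algorithm, and it varies the tolerance d while keeping the length l fixed. It assumes that a window is a signature at tolerance d when its Hamming distance to every window of every other sequence exceeds d; under that assumption, any signature at tolerance d + 1 is also a signature at tolerance d, so each round only re-checks the survivors of the previous one. All names are hypothetical.

```python
def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))

def is_signature(db, i, window, d):
    """True if window (taken from sequence i) differs from every
    same-length window of every other sequence in more than d positions."""
    l = len(window)
    for j, s in enumerate(db):
        if j == i:
            continue
        for k in range(len(s) - l + 1):
            if hamming(window, s[k:k + l]) <= d:
                return False
    return True

def incremental_discovery(db, l, d_start, d_max):
    """Return {d: set of (sequence index, offset)} for d_start..d_max.

    The first round scans every window; each later round only re-tests
    the signatures found in the previous round.
    """
    candidates = {(i, k) for i, s in enumerate(db)
                  for k in range(len(s) - l + 1)}
    results = {}
    for d in range(d_start, d_max + 1):
        candidates = {(i, k) for (i, k) in candidates
                      if is_signature(db, i, db[i][k:k + l], d)}
        results[d] = candidates
    return results

# Hypothetical toy database; tolerances 0 through 3.
toy_db = ["AAAAAA", "AAATTT", "CCCCCC"]
by_tolerance = incremental_discovery(toy_db, l=4, d_start=0, d_max=3)
```

Because the candidate set can only shrink from one round to the next, later rounds touch far fewer windows than a fresh scan would. The rounds here run one after another; the parallel scheduling mentioned in the abstract is not modeled.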

Introduction
  1 Finding Similar Patterns
    1.1 Background
    1.2 Homology Search
  2 Finding Dissimilar Patterns
    2.1 Expressed Sequence Tags (ESTs)
    2.2 Unique Signatures on ESTs
    2.3 Discovery Algorithms
  3 Finding Implicit Patterns
    3.1 Background
    3.2 Parallel and Incremental Discoveries
  4 Results
Related Works
  1 Homology Search
    1.1 Substitution Matrices
    1.2 Gap Penalty
    1.3 Homology Search Algorithms
  2 Unique Signature Discoveries
Methods
  1 The IDCF Algorithm
    1.1 Definitions
    1.2 Algorithms
  2 The IMUS and USD Algorithms
    2.1 Problem Definitions
    2.2 The IMUS Algorithm
    2.3 The USD Algorithm
  3 The CMD and PISD Algorithms
    3.1 The PISD Algorithm
    3.2 The Scheduling Heuristic for Parallelism
    3.3 The CMD Algorithm
Results
  1 Materials Used in the Experiments for Testing the IDCF, IMUS and USD Algorithms
  2 Experimental Results of the IDCF Algorithm
  3 The Results of the Experiments for the IMUS and USD Algorithms
    3.1 Performance Analysis
    3.2 Performance Evaluation over Real EST Databases
    3.3 Evaluations on the Frequency-based Filter
  4 Experiments for the CMD and PISD Algorithms
    4.1 Mathematical Analyses
    4.2 Performance Evaluation
Conclusions
Future Works

                                


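Full-text release date: this full text is not authorized for public release (on-campus network)
Full-text release date: this full text is not authorized for public release (off-campus network)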
