Graduate student: | 劉立恆 Liou, Li-Heng |
---|---|
Thesis title: | 近乎線性時間之社群偵測與分群演算法 Nearly-Linear Time Algorithms for Community Detection and Clustering |
Advisor: | 張正尚 Chang, Cheng-Shang |
Oral defense committee: | 李端興 Lee, Duan-Shin; 林華君 Lin, Hwa-Chun; 連卿閔 Lien, Ching-Min; 陳震宇 Chen, Cheng-Yu |
Degree: | Doctor |
Department: | College of Electrical Engineering and Computer Science - Institute of Communications Engineering |
Year of publication: | 2018 |
Academic year of graduation: | 106 |
Language: | English |
Number of pages: | 116 |
Keywords (Chinese): | 社群偵測、分群、網路科學、線性時間演算法 |
Keywords (English): | community detection, clustering, network science, linear time algorithm |
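In network analysis, community detection and clustering are two closely related problems that have received considerable attention. As network data grow rapidly in size, the efficiency and scalability of community detection and clustering algorithms have become increasingly important. In this thesis, we provide several methods for community detection and clustering on diverse, large-scale network data.
The thesis is divided into three parts. In the first part, we pursue two goals: (i) a formal and precise definition of graph clustering and community detection in directed networks, and (ii) algorithm design and analysis in directed networks. To this end, we develop a probabilistic framework for directed networks, built on our earlier framework for undirected networks. Even when the requirement on the bivariate distribution is relaxed from being symmetric to merely having the same marginal distributions, we can still formally define centrality, relative centrality, community, and modularity in directed networks. Through modularity-preserving and sparsity-preserving transformations, we also extend several community detection algorithms commonly used in undirected networks to directed networks, for example the hierarchical agglomerative algorithm, the partitional algorithm, and the fast unfolding algorithm. Using the probabilistic framework, we show that these three algorithms converge in a finite number of steps. Moreover, the partitional algorithm is proved to be a nearly-linear time algorithm, and the outputs of the hierarchical agglomerative algorithm and the fast unfolding algorithm are guaranteed to satisfy the mathematical definition of a community. With minor modifications, these algorithms can be extended to more general bivariate distributions. We experiment with bivariate distributions obtained by two methods: PageRank and random walks with backward jumps.
In the second part of this thesis, we propose a new iterative algorithm, K-sets+, for clustering data in a semi-metric space, where the triangle inequality does not necessarily hold. We prove that K-sets+ converges in a finite number of steps and has the same performance as K-sets. We further extend K-sets+ to similarity measures that are only symmetric. This extension greatly reduces the computational complexity, so that the algorithm has linear time and memory complexity when the similarity matrix is sparse. We also conduct a number of experiments to verify the efficiency of K-sets+, using data from the stochastic block model and from the WonderNetwork website.
In the third part of this thesis, we describe in detail how to build a linear-time fast unfolding algorithm. Because time and memory complexity depend heavily on the data structures, we introduce three data structures that are essential to a linear-time implementation of the fast unfolding algorithm: the adjacency list, disjoint sets, and the array set. The adjacency list is widely used to store the topology of sparse networks, while disjoint sets and the array set are data structures we developed to avoid superlinear operations such as sorting or inserting elements into a hash tree (or binary tree). We also verify the efficiency and scalability of our implementation experimentally. In that experiment, our method is 3.6 times as fast as the competitor and can handle networks with up to one billion edges.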
Community detection and clustering are two closely related problems that have drawn much attention in network analysis. With the rapid growth of networked data, the efficiency and scalability of community detection and clustering algorithms have become increasingly important. In this thesis, we provide several efficient methods for community detection and clustering that can handle diverse, large-scale data.
The thesis is organized into three parts. In the first part, we address two major points: (i) a formal and precise definition of the graph clustering and community detection problem in directed networks, and (ii) the design and evaluation of community detection algorithms in directed networks. Motivated by these, we develop a probabilistic framework for structural analysis and community detection in directed networks, based on our previous work on undirected networks. By relaxing the assumption of symmetric bivariate distributions in our previous work to bivariate distributions that have the same marginal distributions, we can still formally define various notions for structural analysis in directed networks, including centrality, relative centrality, community, and modularity. We also extend three community detection algorithms commonly used in undirected networks to directed networks: the hierarchical agglomerative algorithm, the partitional algorithm, and the fast unfolding algorithm. These extensions are made possible by two modularity-preserving and sparsity-preserving transformations. Using the probabilistic framework, we show that these three algorithms converge in a finite number of steps. In particular, we show that the partitional algorithm is a nearly-linear time algorithm for large sparse graphs. Moreover, the outputs of the hierarchical agglomerative algorithm and the fast unfolding algorithm are guaranteed to be communities. With some minor modifications, these three algorithms can also be extended to general bivariate distributions. We also conduct various experiments using two sampling methods in directed networks: (i) PageRank and (ii) random walks with self-loops and backward jumps.
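As a rough illustration of the sampling idea, the sketch below builds a bivariate distribution from a PageRank chain and evaluates a modularity-style score for a partition. It rests on several assumptions that are not stated in the abstract: the distribution is taken to be p(v, w) = π(v)P(v, w), with π the stationary distribution of the PageRank chain P, and modularity is taken to be the sum over communities S of P(V ∈ S, W ∈ S) − P(V ∈ S)P(W ∈ S). The function names, the damping value, and the toy graph are hypothetical.

```python
import numpy as np

def pagerank_bivariate(adj, damping=0.85, iters=200):
    """Return p[v, w] = pi[v] * P[v, w] for a PageRank chain on `adj` (assumed form)."""
    n = adj.shape[0]
    out_deg = adj.sum(axis=1, keepdims=True)
    # Row-stochastic walk matrix; nodes without out-links jump uniformly.
    walk = np.where(out_deg > 0, adj / np.maximum(out_deg, 1), 1.0 / n)
    P = damping * walk + (1.0 - damping) / n
    pi = np.full(n, 1.0 / n)
    for _ in range(iters):          # power iteration towards the stationary distribution
        pi = pi @ P
    return pi[:, None] * P

def modularity(p, labels):
    """Sum over communities of P(V in S, W in S) - P(V in S) * P(W in S)."""
    row, col = p.sum(axis=1), p.sum(axis=0)
    q = 0.0
    for c in np.unique(labels):
        mask = labels == c
        q += p[np.ix_(mask, mask)].sum() - row[mask].sum() * col[mask].sum()
    return q

# Toy directed graph: two 3-cycles joined by one edge in each direction.
adj = np.zeros((6, 6))
for v, w in [(0, 1), (1, 2), (2, 0), (3, 4), (4, 5), (5, 3), (2, 3), (5, 0)]:
    adj[v, w] = 1.0

p = pagerank_bivariate(adj)
two_communities = np.array([0, 0, 0, 1, 1, 1])
q_split = modularity(p, two_communities)
q_single = modularity(p, np.zeros(6, dtype=int))
```

In this construction the row and column sums of `p` agree even though `p` itself is not symmetric, which is the relaxed condition described above. Placing every node in a single community gives a score of zero.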
In the second part of this thesis, we first propose a new iterative algorithm, called the K-sets+ algorithm, for clustering data points in a semi-metric space, where the distance measure does not necessarily satisfy the triangle inequality. We show that the K-sets+ algorithm converges in a finite number of iterations and retains the same performance guarantee as the K-sets algorithm for clustering data points in a metric space. We then extend the applicability of the K-sets+ algorithm from data points in a semi-metric space to data points that only have a symmetric similarity measure. Such an extension greatly reduces the computational complexity. In particular, for an n × n similarity matrix with m nonzero elements, the computational complexity of the K-sets+ algorithm is O((Kn+m)I), where I is the number of iterations, and the memory complexity needed to achieve it is O(Kn+m). As such, both the computational complexity and the memory complexity are linear in n when the n × n similarity matrix is sparse, i.e., m=O(n). We also conduct various experiments to show the effectiveness of the K-sets+ algorithm, using a synthetic dataset from the stochastic block model and a real network from the WonderNetwork website.
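To make the O(Kn+m) accounting concrete, the sketch below shows one building block that a similarity-based clustering iteration could plausibly use: accumulating, for every point, the total similarity to each of the K clusters with a single pass over the m nonzero entries. This is not the K-sets+ update rule itself, which the abstract does not spell out. The function name, the coordinate-list input format, and the toy data are assumptions made for illustration.

```python
import numpy as np

def cluster_similarity_sums(rows, cols, vals, labels, K):
    """sums[x, k] = total similarity between point x and the points in cluster k.

    The similarity matrix is given by its m nonzeros (rows[i], cols[i], vals[i]).
    Time is O(Kn + m): O(Kn) to allocate the table, O(m) for the single pass.
    """
    n = len(labels)
    sums = np.zeros((n, K))                       # O(Kn) memory
    # Add gamma(x, y) to entry (x, cluster of y); add.at handles repeated indices.
    np.add.at(sums, (rows, labels[cols]), vals)
    return sums

# Toy symmetric similarity matrix on 4 points with 6 nonzero entries.
rows = np.array([0, 1, 0, 2, 2, 3])
cols = np.array([1, 0, 2, 0, 3, 2])
vals = np.array([1.0, 1.0, 0.5, 0.5, 2.0, 2.0])
labels = np.array([0, 0, 1, 1])

sums = cluster_similarity_sums(rows, cols, vals, labels, K=2)
```

When m = O(n) and K is fixed, both the table and the pass over the nonzeros scale linearly in n, which matches the sparse case described above.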
In the third part of this thesis, we detail an implementation of the fast unfolding algorithm that has nearly-linear time complexity and linear memory complexity. Since the time and memory complexity depend heavily on the data structures, we introduce three data structures that are essential to the nearly-linear time implementation: (i) the adjacency list, (ii) disjoint sets, and (iii) the array set. The adjacency list is a commonly used, memory-efficient data structure for storing sparse networks. The disjoint sets and the array set are our newly designed data structures, which allow us to avoid superlinear operations such as sorting and insertions into a hash (or binary) tree. We also conduct an experiment to test the efficiency and scalability of our implementation of the fast unfolding algorithm. With the data structures and techniques we designed, our implementation is 3.6 times faster than the competitor and can cope with networks with one billion edges.
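The abstract does not describe the internals of these data structures, so the following is only a sketch of two textbook techniques in the same spirit. The first is a disjoint-set forest with union by size and path compression. The second is an array-backed accumulator that collects per-community weights without hashing or sorting, offered as one plausible reading of an "array set". Both class names are hypothetical, and the thesis's own designs may differ.

```python
class DisjointSets:
    """Textbook disjoint-set forest with union by size and path compression."""

    def __init__(self, n):
        self.parent = list(range(n))
        self.size = [1] * n

    def find(self, x):
        root = x
        while self.parent[root] != root:
            root = self.parent[root]
        while self.parent[x] != root:        # second pass: compress the path
            self.parent[x], x = root, self.parent[x]
        return root

    def union(self, a, b):
        ra, rb = self.find(a), self.find(b)
        if ra == rb:
            return ra
        if self.size[ra] < self.size[rb]:    # attach the smaller tree to the larger
            ra, rb = rb, ra
        self.parent[rb] = ra
        self.size[ra] += self.size[rb]
        return ra


class ArrayAccumulator:
    """Array-indexed weight accumulator over integer keys in [0, n) (assumed design)."""

    def __init__(self, n):
        self.value = [0.0] * n
        self.present = [False] * n
        self.touched = []                    # keys used since the last clear

    def add(self, key, weight):
        if not self.present[key]:
            self.present[key] = True
            self.touched.append(key)
        self.value[key] += weight

    def items(self):
        return [(k, self.value[k]) for k in self.touched]

    def clear(self):
        # Reset only the touched keys, so the cost is proportional to their number.
        for k in self.touched:
            self.value[k] = 0.0
            self.present[k] = False
        self.touched.clear()


ds = DisjointSets(5)
ds.union(0, 1)
ds.union(3, 4)

acc = ArrayAccumulator(5)
for community, weight in [(2, 1.0), (4, 0.5), (2, 2.0)]:
    acc.add(community, weight)
collected = acc.items()
acc.clear()
```

The accumulator is allocated once and reused across nodes, so processing a node costs time proportional to its degree and never to n.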