I. Introduction
Community detection concerns identifying clusters (communities) of nodes that are more densely connected in graph-structured data, a task central to many important problems in data mining and modern network science (see [1, 2, 3] and references therein). Real-world applications of community detection abound, including social networking [4, 5], biology (e.g., the study of brain connectivity [6]), and machine learning (e.g., recommendation systems [7]). Various techniques have been proposed to this end including, but not limited to, approaches leveraging random walks [8, 9], semidefinite programming [10], matrix decomposition [11, 12, 13], and word embedding [14, 15]. As in much of the prior work, we consider graph-structured data randomly generated from the canonical Stochastic Block Model (SBM), a popular probabilistic generative model for random graphs with latent structure (planted clusters) [16, 17]. Nodes are connected with intra-cluster and inter-cluster probabilities p and q, respectively. Due to partial observability, some of the existing edges are unobserved. The fully observed case, known as correlation clustering, was introduced in [18].
To detect densely connected clusters, the authors in [11] devised a decomposition-based approach [19, 20, 21, 22] to correlation clustering, in which the adjacency matrix of the graph is decomposed as the sum of a low rank component and a sparse component. The low rank component captures the non-overlapping full cliques, and the sparse component captures missing edges within clusters and extra edges across clusters. The validity of the low rank plus sparse structure stems from the fact that p > q, i.e., the higher density of intra-cluster over inter-cluster connections. The use of matrix decomposition was later extended to the partially observed case [12, 13].
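The low rank plus sparse structure can be made concrete with a small numerical sketch (the cluster sizes and probabilities below are illustrative choices of ours, not values from this paper): the block-diagonal component has rank equal to the number of clusters, and the residual A − L is supported only on corrupted entries.

```python
import numpy as np

rng = np.random.default_rng(0)

sizes = [15, 10, 5]          # illustrative cluster sizes
n = sum(sizes)
p, q = 0.9, 0.1              # intra- and inter-cluster edge probabilities

# Ideal low rank component L0: block-diagonal, one all-ones block per
# cluster, so rank(L0) equals the number of clusters.
L0 = np.zeros((n, n))
start = 0
for s in sizes:
    L0[start:start + s, start:start + s] = 1.0
    start += s

# SBM adjacency: keep intra-cluster edges w.p. p, add inter-cluster w.p. q.
U = rng.random((n, n))
U = np.triu(U, 1) + np.triu(U, 1).T        # symmetric uniform draws
A = np.where(L0 == 1, U < p, U < q).astype(float)
np.fill_diagonal(A, 1.0)                   # diagonal set to one, as in the text

# Sparse component S = A - L0: entries of -1 mark missing intra-cluster
# edges, entries of +1 mark extra inter-cluster edges.
S = A - L0
```

The sparser the corruption (p close to 1, q close to 0), the fewer nonzeros S contains, which is what makes the decomposition recoverable.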
Our work is motivated by two main limitations of decompositionbased approaches (and most other community detection algorithms). First, existing algorithms work with the adjacency matrix of the full graph, which limits their scalability to large graph sizes. Second, they often fall short of discovering small clusters in unbalanced graphs (i.e., graphs in which the largest and smallest clusters differ greatly in size).
We build on the previous works of [12, 13] by presenting and analyzing a randomized scalable framework which significantly reduces the complexity of the clustering algorithm. Randomized sketching techniques have been instrumental in devising scalable solutions to many high-dimensional unsupervised learning problems, such as Principal Component Analysis (PCA) [20], matrix decomposition [23, 24, 25, 26, 27], low rank matrix approximation [28, 29, 30], data summarization [31, 28, 32], and clustering [33, 34, 35, 36]. There are major distinctions between the approach proposed herein and existing randomized data clustering algorithms. For example, the approach in [37] uses random node sampling, but considers altogether different scenarios in which the data feature matrices are known. In sharp contrast, our setup only assumes the availability of similarity and dissimilarity information given the partially observed topology; no other data features are available. The work in [35] targets the graph clustering problem, but uses an edge sampling mechanism very different from node sampling; we focus on node sampling to obtain balanced subgraphs. As pointed out earlier, the primary focus of this paper is to devise a scalable solution to the graph clustering problem leveraging the decomposition-based approach presented in [12] along with graph sketching, and to provide robustness to data unbalancedness.
The adjacency matrix is assumed to follow the Planted Partition/Stochastic Block Model [16, 17], with a modification to account for partial observations. This data model, originally introduced in [12], is defined as follows.
Data Model 1.
The graph consists of n nodes partitioned into K clusters. Any two nodes within a cluster are connected with probability p, and two nodes belonging to different clusters are connected with probability q < p. Any given edge is observed with probability ρ.
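Data Model 1 is straightforward to simulate. The sketch below (the parameter names p, q, and rho, and the helper name, are ours) draws an observed adjacency matrix together with its observation mask:

```python
import numpy as np

def sample_data_model(sizes, p, q, rho, seed=0):
    """Draw an adjacency matrix from Data Model 1 (illustrative sketch).

    sizes : cluster sizes; p, q : intra-/inter-cluster edge probabilities;
    rho   : probability that any given edge is observed.
    Returns the observed adjacency A_obs (unobserved entries zeroed) and
    the boolean observation mask Omega.
    """
    rng = np.random.default_rng(seed)
    n = sum(sizes)
    labels = np.repeat(np.arange(len(sizes)), sizes)
    same = labels[:, None] == labels[None, :]   # same-cluster indicator

    U = rng.random((n, n))
    U = np.triu(U, 1); U = U + U.T              # symmetric uniform draws
    A = np.where(same, U < p, U < q).astype(float)
    np.fill_diagonal(A, 1.0)                    # diagonal assumed all ones

    M = rng.random((n, n))
    M = np.triu(M, 1); M = M + M.T
    Omega = M < rho                             # each edge observed w.p. rho
    np.fill_diagonal(Omega, True)
    return A * Omega, Omega

A_obs, Omega = sample_data_model([20, 10], p=0.8, q=0.1, rho=0.7)
```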
Given an adjacency matrix A, the decomposition into the low rank and sparse matrices takes the form A = L + S. The matrix L captures the connectivity of the nodes within each cluster and has no inter-cluster edges, while S indicates the missing intra-cluster and extra inter-cluster edges. The diagonal elements of A are assumed to be all ones. We note that for an ideal adjacency matrix with p = 1 and q = 0 and all edges observed, we have A = L and S = 0.
I-A Summary of contributions
We develop and analyze a scalable graph clustering algorithm in which matrix recovery and completion are applied to the adjacency matrix of a small random subgraph obtained from the original graph through random sampling. We establish performance guarantees for the presented randomized approach to yield exact clustering. Moreover, we present a new node sampling method, dubbed Sparsity-based Sampling, which significantly enhances the performance of the clustering algorithm on unbalanced graphs. By setting the sampling probabilities inversely proportional to the node degrees (equivalently, proportional to the sparsity levels of the columns of the adjacency matrix), Sparsity-based Sampling achieves the twin objectives of capturing smaller clusters and yielding balanced sketches, which enhances the success of the subsequent completion and clustering.
The algorithms in [12, 13] require clusters of a minimum size to guarantee a high probability of success, while at the same time incurring the high computational cost of working with the full matrix. In contrast, our proposed random sampling method maintains similar performance guarantees while only requiring a number of samples linear in n/n_min, where n_min is the size of the smallest cluster (assuming the number of clusters and the SBM parameters are fixed). Furthermore, the use of Sparsity-based Sampling improves the probability of sampling from the smallest clusters, thus allowing the restriction on the minimum cluster size to be relaxed.
II. Proposed Approach
In this section, we propose a randomized framework which provides a scalable solution for the community detection problem with big data. We focus on the convex-optimization-based method analyzed in [12, 13], which translates the problem of clustering a partially observed graph into one of robust low rank matrix completion. As discussed in Section I, the two main shortcomings of existing methods we seek to address are their computational complexity and their inability to detect small clusters in unbalanced graphs. First, the proposed randomized approach is presented and analyzed with uniform random sampling, where it is shown to significantly reduce the computational complexity. Subsequently, Sparsity-based Sampling is presented to handle unbalanced high-dimensional data. The proofs of all stated results are deferred to an extended version of this work.
II-A Randomized graph clustering using random node sampling
The proposed method is detailed in Algorithm 1. It consists of three main steps: node sampling, subgraph clustering, and full data clustering.
II-A.1 Uniform random node sampling
The basic idea underlying the proposed randomized approach is to apply the clustering algorithm to a random sketch rather than the full data. The clustering method is applied to a random subgraph obtained by sampling m nodes. The subgraph is equivalently represented by a submatrix A_s (referred to as the sketch matrix) of the full adjacency matrix A. As with the full adjacency matrix, the sketch can be decomposed into low rank and sparse matrices, A_s = L_s + S_s. The random selection is performed using uniform random sampling. It is important to ensure that at least one node is sampled from each cluster, i.e., that the columns of L_s span the column space of L. The following lemma shows that a sufficient number of random samples is linear in n/n_min, where n_min is the number of nodes in the smallest cluster.
Lemma 1.
Suppose the adjacency matrix A follows Data Model 1. Define L_s as the low rank component of the sketched adjacency matrix, constructed using m randomly sampled (uniformly, without replacement) nodes. If m is sufficiently large (the required number grows linearly in n/n_min), then the rank of L_s is equal to the rank of L with high probability.
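A minimal illustration of the sketching step, using an ideal block adjacency matrix (the sizes below are our own toy choices): the rank of the sketch equals the number of clusters actually hit by the sample, which is precisely the event Lemma 1 controls.

```python
import numpy as np

rng = np.random.default_rng(1)

# Unbalanced toy graph: n = 60 nodes, smallest cluster has n_min = 6 nodes.
labels = np.repeat([0, 1, 2], [34, 20, 6])
n = labels.size

# Uniform random sampling of m nodes without replacement; per Lemma 1,
# m growing linearly in n / n_min suffices to hit every cluster w.h.p.
m = 40
idx = rng.choice(n, size=m, replace=False)

# Sketch matrix: the principal submatrix of A on the sampled nodes.
# A here is the ideal block adjacency (p = 1, q = 0), for illustration only.
A = (labels[:, None] == labels[None, :]).astype(float)
A_s = A[np.ix_(idx, idx)]

# For the ideal matrix, rank(A_s) equals the number of clusters represented
# in the sample, so full rank preservation == every cluster was hit.
clusters_hit = np.unique(labels[idx]).size
rank_sketch = np.linalg.matrix_rank(A_s)
```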
II-A.2 Robust subgraph clustering
In the second step, we apply the robust matrix completion algorithm (3) to cluster the sampled nodes by finding the low rank component L_s of A_s. The following lemma, which is based in part on [12, Theorem 4], provides sufficient conditions for the convex matrix completion algorithm to yield the correct L_s.
Lemma 2.
Let , , and , where is a constant real number. If
(1)  
(2) 
then the optimal point of (3) yields the exact low rank component of A_s with high probability.
The term n/n_min can be thought of as an indicator of the ‘balancedness’ of the clusters, where a larger value indicates less balance. In fact, for uniform random sampling, n/n_min is the inverse of the probability of sampling a node from the smallest cluster.
II-A.3 Full data clustering
In the last step, we compare the connectivity of each node of the full graph to the clusters obtained in the second step from the sketch. For an exact description of this step, we refer the reader to the third step of Algorithm 1. The observation operator returns the observed values of its vector or matrix argument.
Algorithm 1: Randomized Graph Clustering
Input: adjacency matrix A
1. Random Node Sampling:
1.1 Form the set M consisting of the indices of m nodes sampled at random without replacement. Sampling is performed using either uniform Random Sampling or Sparsity-based Sampling.
1.2 Construct A_s as the submatrix of A whose rows and columns correspond to the sampled nodes.
2. Subgraph Clustering:
2.1 Define L_s and S_s as the optimal point of
(3) 
as described in [12]. In the experiments, the problem is solved using .
2.2 Cluster the subgraph corresponding to L_s (we use Spectral Clustering in our experiments).
2.3 With k denoting the number of detected clusters, define a set of k columns of L_s that span the column space of L_s (each such column essentially represents the connectivity within a certain cluster).
3. Full Data Clustering:
Define a_i^M as the vector of elements of a_i (the i-th column of A) indexed by the set M.
Let n_j be the number of elements in the j-th cluster of the sketch (as identified in Step 2 of the algorithm).
For i from 1 to n
Assign the i-th node to the sketch cluster whose sampled members its observed connectivity matches best.
End For
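The third step can be sketched as follows; the average-agreement rule used here is an illustrative stand-in for the exact assignment rule of Algorithm 1, and all names are ours:

```python
import numpy as np

def assign_full_graph(A_obs, Omega, M, sketch_labels):
    """Step 3 sketch: assign each node of the full graph to the sketch
    cluster whose sampled members its observed connectivity matches best.
    A_obs: observed adjacency; Omega: observation mask; M: sampled indices;
    sketch_labels: cluster labels of the sampled nodes from Step 2."""
    clusters = np.unique(sketch_labels)
    out = np.empty(A_obs.shape[0], dtype=int)
    for i in range(A_obs.shape[0]):
        a = A_obs[i, M]          # observed connectivity towards sampled nodes
        o = Omega[i, M]          # which of those entries were observed
        scores = [a[(sketch_labels == c) & o].mean()
                  if ((sketch_labels == c) & o).any() else -1.0
                  for c in clusters]
        out[i] = clusters[int(np.argmax(scores))]
    return out

# Toy usage: two ideal clusters, fully observed, four sampled nodes.
labels_true = np.repeat([0, 1], [6, 4])
A = (labels_true[:, None] == labels_true[None, :]).astype(float)
Omega = np.ones_like(A, dtype=bool)
M = np.array([0, 1, 6, 7])
pred = assign_full_graph(A, Omega, M, labels_true[M])
```

Since this step touches each column of A once, its cost is linear in n, consistent with the complexity discussion in Section III.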
We can readily state the following theorem which establishes a sufficient condition for Algorithm 1 to yield exact clustering with high probability.
Theorem 3.
One can observe that the sufficient number of sampled nodes depends on the size of the graph through the factor n/n_min. However, if the graph is balanced, then n/n_min is equal to the number of clusters. Thus, for balanced graphs, the sufficient number of randomly sampled nodes is almost independent of the size of the graph.
II-B Sparsity-based Sampling
While the clustering algorithm that uses the full-scale data has computational complexity quadratic in the graph size, the complexity of Algorithm 1 is only linear in n when the data is balanced. This affords substantial speedups for high-dimensional graphs. Nevertheless, according to Lemma 1, if the data is unbalanced and uniform random sampling is utilized, the sketch will be similarly unbalanced and a large number of samples will be needed to ensure the small clusters are sufficiently represented in the sketch. At the same time, existing clustering algorithms can hardly capture small clusters when the data is unbalanced, a fact that is also confirmed by the theory developed in [12].
In this section, a new sampling method which can yield a more balanced sketch of unbalanced data is presented. The basic idea is to sample the sparser columns of the adjacency matrix, which represent nodes with fewer connections, with higher probability, hence the appellation ‘Sparsity-based Sampling’. Specifically, the probability of the i-th node being selected is set proportional to 1/||a_i||_1, a factor which serves as a measure of sparsity, where ||·||_1 denotes the ℓ1-norm and a_i is the i-th column of the adjacency matrix. Here, the entries corresponding to unobserved edges are set to zero. Note that ||a_i||_1 ≥ 1 since the diagonal entries of the adjacency matrix are set to one. The sampling probabilities are normalized so that they sum to one.
For ease of exposition, suppose the graph is composed of two clusters with population sizes n_1 and n_2 (n_1 ≤ n_2). In the case where the adjacency matrix is clean and complete, i.e., p = 1, q = 0, and all edges observed, the following lemma shows that the probabilities of sampling from the two clusters are equal under the Sparsity-based Sampling scheme, independent of their relative population sizes.
Lemma 4.
Suppose p = 1, q = 0, and all edges are observed. Then, under Sparsity-based Sampling, the probabilities of sampling from the two clusters are equal.
According to Lemma 4, the new sampling algorithm can efficiently yield perfectly balanced sketches. Next, we show that the Sparsity-based Sampling scheme improves the balancedness of the sketched data even when the graph-structured data is corrupted.
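Lemma 4 is easy to check numerically: in the clean, complete case every column in a cluster of size n_j has ℓ1-norm n_j, so each cluster's total sampling mass is n_j · (1/n_j) after normalization, identical across clusters regardless of the size ratio. The sizes below are illustrative.

```python
import numpy as np

# Clean, complete two-cluster adjacency (p = 1, q = 0, all edges observed).
n1, n2 = 5, 45                      # heavily unbalanced cluster sizes
labels = np.repeat([0, 1], [n1, n2])
A = (labels[:, None] == labels[None, :]).astype(float)

# Sparsity-based Sampling: node i is drawn with probability proportional
# to 1 / ||a_i||_1 (unobserved entries would be zeroed first).
weights = 1.0 / np.abs(A).sum(axis=0)
probs = weights / weights.sum()

# Per-cluster sampling mass: n_j columns, each with ||a_i||_1 = n_j, give
# total mass proportional to n_j * (1 / n_j) = 1 -- equal across clusters,
# exactly as Lemma 4 states.
mass_small = probs[labels == 0].sum()
mass_large = probs[labels == 1].sum()
```

Under uniform sampling the small cluster would receive mass n_1/n = 0.1; here both clusters receive mass 1/2.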
Recall that we sample without replacement. To accomplish this, the probability of sampling each column is held fixed throughout the sampling process; columns may be drawn more than once, but duplicates are discarded and do not count towards the budget of m samples. Now, consider the probability of sampling from the smallest cluster in a single draw. For a randomly generated graph, the following lemma places a lower bound on this probability which holds with high probability.
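The duplicate-discarding scheme just described can be sketched as follows (a minimal illustration; the function name is ours):

```python
import numpy as np

def sample_without_replacement(probs, budget, seed=0):
    """Draw columns with fixed per-column probabilities, discarding
    duplicates, until `budget` distinct columns are collected."""
    rng = np.random.default_rng(seed)
    chosen = set()
    while len(chosen) < budget:
        # Each draw uses the same fixed probability vector; a repeated
        # index is simply absorbed by the set and does not count.
        chosen.add(int(rng.choice(len(probs), p=probs)))
    return sorted(chosen)

# Toy usage: uniform probabilities over 20 columns, budget of m = 8.
idx = sample_without_replacement(np.full(20, 1 / 20), budget=8)
```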
Lemma 5.
If the corruption parameters are small, the lower bound will be close to 1/2, indicating that both clusters have a roughly equal chance of being sampled. Contrast this with uniform random sampling, where the probability of sampling from the smallest cluster is exactly n_1/n. This means that, under proper conditions, Sparsity-based Sampling will tend to produce more balanced sketches than uniform random sampling. In Section III, we show through numerical experiments that Sparsity-based Sampling indeed significantly enhances the performance of the randomized approach, and can even outperform the full-scale decomposition of the full adjacency matrix A.
III. Numerical Experiments
A set of numerical simulations is presented to study the performance of the proposed randomized framework. First, it is shown that the proposed approach brings about substantial speedups. Then, an experiment with unbalanced data is presented to showcase the effectiveness of the Sparsity-based Sampling scheme; we show that in some cases the proposed method with Sparsity-based Sampling can even outperform full-scale decomposition. An experiment is considered successful if the algorithm reconstructs the exact low rank matrix L. We estimate the number of clusters using Spectral Clustering [38] applied to the obtained low rank component. Each point in the presented phase transition results is obtained by averaging over 20 independent runs.
III-A Running time
We compare the running time of the proposed randomized method with that of the full-scale decomposition algorithm. The adjacency matrix follows Data Model 1 with , , , , and . In this experiment, . Table I compares the running times in seconds for different values of n. In all cases, both the full-scale and sketching-based decompositions cluster the data accurately, but the randomized method is substantially faster. The main reason is that the matrix completion algorithm is applied to the much smaller sketch matrix. Since the complexity of the last step of Algorithm 1 is linear in n, it does not have a significant impact on the run time.
Table I: Running time of the randomized approach vs. full-scale decomposition.

n        Randomized approach    Full-scale decomposition
500      1.2 s                  3.3 s
1000     1.2 s                  15.0 s
5000     1.2 s                  174.0 s
10000    1.3 s                  647.0 s
III-B Clustering unbalanced graphs
In this experiment, we study the performance of the proposed randomized approach with Sparsity-based Sampling. The adjacency matrix adheres to Data Model 1 with , , , and . The graph consists of two small clusters and one dominant cluster, with population sizes and . Fig. 1 compares the phase transition plots of the proposed randomized methods in the plane of sample complexity versus minimum cluster size; results with uniform random sampling are shown on the left and with Sparsity-based Sampling on the right. The phase transitions are shown for and . One can observe that the algorithm which uses Sparsity-based Sampling yields exact clustering even when the data is highly unbalanced, a regime in which the algorithm based on uniform random sampling fails. For instance, when , the Sparsity-based Sampling algorithm yields accurate clustering with only randomly sampled nodes (only 4% of the total number of nodes). The main reason is that the probability of sampling from the small clusters using Sparsity-based Sampling is larger than the corresponding probability with uniform random sampling, which results in a more balanced sketch.
Interestingly, in addition to being significantly faster than the full-scale decomposition algorithm, the proposed randomized approach (with Sparsity-based Sampling) can even outperform full-scale decomposition in terms of success rate when the inter-cluster probability q is sufficiently small. We remark that this does not violate the data processing inequality, which indicates that post-processing cannot increase information [39]. Rather, the full-scale decomposition algorithm is not robust to data unbalancedness, in the sense that it often fails to yield accurate clustering with unbalanced data. For example, consider , , , , with a graph composed of three clusters with population sizes and . Fig. 2a shows the probability of success of the proposed randomized approach and of the full-scale decomposition algorithm versus . The proposed method uses 500 sampled nodes (only 10% of the total number of nodes). As shown, the full-scale algorithm fails to yield accurate clustering for . Meanwhile, the randomized approach with Sparsity-based Sampling yields exact clustering even when . Fig. 2b shows the phase transition of the randomized approach with Sparsity-based Sampling under the same setup.
IV. Conclusion
We presented and analyzed a randomized framework for the community detection problem with partially observed graphs. It was shown that the proposed method can substantially reduce the computational complexity, from quadratic to linear in the graph size with balanced data. We also developed a new node sampling scheme, Sparsity-based Sampling, which efficiently obtains a descriptive sketch of unbalanced graphs. It was shown that the new sampling method can improve the performance of the clustering algorithm by enhancing the balancedness of the sketched data.
Acknowledgment: This work was supported in part by NSF CAREER Award CCF-1552497.
References
 [1] S. Fortunato and D. Hric, “Community detection in networks: A user guide,” Phys. Rep., vol. 659, pp. 1 – 44, 2016.
 [2] A. K. Jain, M. N. Murty, and P. J. Flynn, “Data clustering: A review,” ACM Comput. Surv., vol. 31, no. 3, pp. 264–323, Sep. 1999.
 [3] M. A. Porter, J.-P. Onnela, and P. J. Mucha, “Communities in networks,” Not. AMS, vol. 56, no. 9, Oct. 2009.
 [4] P. Bedi and C. Sharma, “Community detection in social networks,” Data Mining Knowl. Discov., vol. 6, no. 3, pp. 115–135.
 [5] N. Mishra, R. Schreiber, I. Stanton, and R. E. Tarjan, “Clustering social networks,” in Proc. 5th Int’l Conf. Algor. Models Web-graph. Berlin, Heidelberg: Springer-Verlag, 2007, pp. 56–67.
 [6] C. Nicolini, C. Bordier, and A. Bifone, “Community detection in weighted brain connectivity networks beyond the resolution limit,” arXiv e-prints, Sep. 2016.
 [7] J. Stan, F. Muhlenbach, and C. Largeron, “Recommender systems using social network analysis: Challenges and future trends,” in Encyclopedia of Social Network Analysis and Mining. Springer, 2014, pp. 1–22.
 [8] P. Pons and M. Latapy, “Computing communities in large networks using random walks,” in Proc. 20th Int. Symp. Comput. Inform. Sci. Springer, 2005, pp. 284–293.
 [9] U. Von Luxburg, “A tutorial on spectral clustering,” Stat. Comput., vol. 17, no. 4, pp. 395–416, 2007.
 [10] B. Hajek, Y. Wu, and J. Xu, “Semidefinite programs for exact recovery of a hidden community,” in J. Mach. Learn. Res., vol. 49, 23–26 Jun 2016, pp. 1051–1095.
 [11] S. Oymak and B. Hassibi, “Finding dense clusters via “low rank + sparse” decomposition,” arXiv e-prints, Apr. 2011.
 [12] Y. Chen, A. Jalali, S. Sanghavi, and H. Xu, “Clustering partially observed graphs via convex optimization,” J. Mach. Learn. Res., vol. 15, no. 1, pp. 2213–2238, Jan. 2014.
 [13] R. Korlakai Vinayak, S. Oymak, and B. Hassibi, “Graph clustering with missing data: Convex algorithms and analysis,” in Proc. Adv. Neural Inf. Process. Syst., 2014, pp. 2996–3004.
 [14] V. W. Zheng, S. Cavallari, H. Cai, K. Chen-Chuan Chang, and E. Cambria, “From node embedding to community embedding,” arXiv e-prints, Oct. 2016.
 [15] W. Ding, C. Lin, and P. Ishwar, “Node embedding via word embedding for network community discovery,” IEEE Trans. Signal Inf. Process. Netw., vol. 3, no. 3, pp. 539–552, Sept 2017.
 [16] P. W. Holland, K. B. Laskey, and S. Leinhardt, “Stochastic blockmodels: First steps,” Social Networks, vol. 5, no. 2, pp. 109 – 137, 1983.
 [17] A. Condon and R. M. Karp, “Algorithms for graph partitioning on the planted partition model,” Random Struct. Algor., vol. 18, no. 2, pp. 116–140, 2001.
 [18] N. Bansal, A. Blum, and S. Chawla, “Correlation clustering,” Mach. Learn., vol. 56, no. 1, pp. 89–113, Jul 2004.
 [19] V. Chandrasekaran, S. Sanghavi, P. A. Parrilo, and A. S. Willsky, “Rank-sparsity incoherence for matrix decomposition,” SIAM J. Optim., vol. 21, no. 2, pp. 572–596, 2011.
 [20] E. J. Candès, X. Li, Y. Ma, and J. Wright, “Robust principal component analysis?” J. ACM, vol. 58, no. 3, pp. 11:1–11:37, Jun. 2011.
 [21] D. Hsu, S. M. Kakade, and T. Zhang, “Robust matrix decomposition with sparse corruptions,” IEEE Trans. Inf. Theory, vol. 57, no. 11, pp. 7221–7234, Nov 2011.
 [22] Y. Chen, A. Jalali, S. Sanghavi, and C. Caramanis, “Low-rank matrix recovery from errors and erasures,” IEEE Trans. Inf. Theory, vol. 59, no. 7, pp. 4324–4337, July 2013.
 [23] M. Rahmani and G. K. Atia, “High dimensional low rank plus sparse matrix decomposition,” IEEE Trans. Signal Process., vol. 65, no. 8, pp. 2004–2019, April 2017.
 [24] M. Rahmani and G. Atia, “A subspace learning approach for high dimensional matrix decomposition with efficient column/row sampling,” in Proc. 33rd Int. Conf. Mach. Learn., 2016, pp. 1206–1214.
 [25] L. W. Mackey, M. I. Jordan, and A. Talwalkar, “Divide-and-conquer matrix factorization,” in Proc. Adv. Neural Inf. Process. Syst., 2011, pp. 1134–1142.
 [26] M. Rahmani and G. Atia, “Randomized robust subspace recovery and outlier detection for high dimensional data matrices,” IEEE Trans. Signal Process., vol. 65, no. 6, March 2017.
 [27] X. Li and J. Haupt, “Identifying outliers in large matrices via randomized adaptive compressive sampling,” IEEE Trans. Signal Process., vol. 63, no. 7, pp. 1792–1807, 2015.
 [28] N. Halko, P.-G. Martinsson, and J. A. Tropp, “Finding structure with randomness: Probabilistic algorithms for constructing approximate matrix decompositions,” SIAM Review, vol. 53, no. 2, pp. 217–288, 2011.
 [29] M. Gu and S. C. Eisenstat, “Efficient algorithms for computing a strong rank-revealing QR factorization,” SIAM J. Sci. Comput., vol. 17, no. 4, pp. 848–869, 1996.
 [30] N. H. Nguyen, T. T. Do, and T. D. Tran, “A fast and efficient algorithm for low-rank approximation of a matrix,” in Proc. 41st ACM Symp. Theory Comput., 2009, pp. 215–224.
 [31] M. Rahmani and G. K. Atia, “Spatial random sampling: A structure-preserving data sketching tool,” IEEE Signal Process. Lett., vol. 24, no. 9, pp. 1398–1402, 2017.
 [32] N. Ailon and B. Chazelle, “Approximate nearest neighbors and the fast Johnson-Lindenstrauss transform,” in Proc. 38th ACM Symp. Theory Comput., 2006, pp. 557–563.

 [33] S. Ben-David, “A framework for statistical clustering with constant time approximation algorithms for k-median and k-means clustering,” Mach. Learn., vol. 66, no. 2-3, pp. 243–257, 2007.
 [34] A. Czumaj and C. Sohler, “Sublinear-time approximation for clustering via random sampling,” in Proc. 31st ICALP. Springer, 2004, pp. 396–407.
 [35] S.-Y. Yun and A. Proutiere, “Community detection via random and adaptive sampling,” in Proc. Conf. Learn. Theory, 2014, pp. 138–175.
 [36] I. Kirianovskii, O. Granichin, and A. Proskurnikov, “A new randomized algorithm for community detection in large networks,” IFAC-PapersOnLine, vol. 49, no. 13, pp. 31–35, 2016.
 [37] K. Voevodski, M.-F. Balcan, H. Röglin, S.-H. Teng, and Y. Xia, “Efficient clustering with limited distance information,” arXiv preprint arXiv:1009.5168, 2010.
 [38] E. Elhamifar and R. Vidal, “Sparse subspace clustering: Algorithm, theory, and applications,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 35, no. 11, pp. 2765–2781, Nov 2013.
 [39] T. M. Cover and J. A. Thomas, Elements of Information Theory. Wiley-Interscience, 2006.