Fast and Highly Scalable Multiresolution Linear Word based Clustering in Multidimensional data

P.Rubi; M.Govindaraj

Abstract

Clustering problems are well known in the database literature for their use in numerous applications, and multidimensional data remains a challenge for clustering algorithms. Halite is a fast and scalable clustering method that looks for clusters in subspaces of multidimensional data. It builds a multiresolution Counting-tree: the tree root corresponds to a hypercube embodying the full data set, the next level divides the space into a set of 2^d hypercubes, and the resulting hypercubes are divided again, generating the tree structure. The bump-hunting task applies, at each level of the Counting-tree, one d-dimensional Laplacian mask over the respective grid to spot bumps at the respective resolution. Specifically, the main contributions of Halite are: Scalability: it is linear in time and space with respect to the data size and the dimensionality of the clusters' subspaces. Usability: it is deterministic, robust to noise, does not take the number of clusters as an input parameter, and detects clusters in subspaces generated by the original axes or by their linear combinations, including space rotation. Effectiveness: it is accurate, providing results of equal or better quality; this is achieved through a word-based approach. Generality: it includes a soft clustering approach.
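The grid counting and bump-spotting steps described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the sample points, the normalization of the data to [0, 1)^d, and the choice of a face-connected Laplacian mask (weight 2d at the center, -1 at each face neighbor) are all assumptions made for illustration.

```python
from collections import Counter


def count_cells(points, level):
    """Count points per grid cell at one resolution level.

    Assumption: points are normalized to [0, 1)^d. Level h splits every
    axis into 2**h intervals, so each cell of level h-1 is divided into
    2**d cells. Only non-empty cells are stored.
    """
    side = 2 ** level
    counts = Counter()
    for p in points:
        cell = tuple(min(int(x * side), side - 1) for x in p)
        counts[cell] += 1
    return counts


def laplacian_response(counts, cell):
    """Apply a face-connected Laplacian mask centered on `cell`.

    Assumed mask: weight 2d at the center and -1 at each of the 2d face
    neighbors. Neighbors outside the grid or empty count as zero.
    """
    d = len(cell)
    total = 2 * d * counts.get(cell, 0)
    for axis in range(d):
        for step in (-1, 1):
            neighbor = cell[:axis] + (cell[axis] + step,) + cell[axis + 1:]
            total -= counts.get(neighbor, 0)
    return total


def spot_bump(points, level):
    """Return the non-empty cell with the largest mask response at `level`."""
    counts = count_cells(points, level)
    # Ties are broken by cell coordinates so the result is deterministic.
    return max(counts, key=lambda c: (laplacian_response(counts, c), c))


# Hypothetical 2-D data: one dense group plus four scattered noise points.
cluster = [(0.55 + 0.01 * i, 0.55 + 0.01 * j) for i in range(5) for j in range(5)]
noise = [(0.1, 0.1), (0.9, 0.2), (0.3, 0.8), (0.1, 0.6)]
points = cluster + noise

for level in (1, 2):
    print(level, spot_bump(points, level))
```

Each level costs one pass over the points plus a constant number of neighbor lookups per non-empty cell, which is consistent with the linear scaling in data size claimed above. The sketch only locates the densest bump per level and does not cover how clusters are then grown or how relevant axes are selected.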

Keywords

Bump Hunting, Correlation Connected Objects, Harp, Spotting Clusters.

Cite this article as

P.Rubi, M.Govindaraj, "Fast and Highly Scalable Multiresolution Linear Word based Clustering in Multidimensional data", International Journal of Innovative Research in Engineering & Management (ijircst), Vol-2, Issue-3, Page No-85-92, 2014. Available from:

Corresponding Author

P.Rubi

Computer Science and Engineering Department, Bharathidasan University, Tiruchirappalli/Tamilnadu, India, 9790534573 (e-mail: rubi.joyce04@gmail.com)
