Subject Code: MA3022 – MA4022 – MA7022
MA3022 / MA4022 / MA7022 Data Mining and Neural Networks
Due by 03.02.2025
100 marks available
Theoretical Background and Two Mini-Research Projects
Theoretical Background (20 marks)
Give a description of classification and clustering (5 marks).
What is the difference between them? (5 marks)
Describe the KNN approach and Hart’s algorithm for data reduction (5 marks).
Describe the K-means algorithm (5 marks).
Project 1: Condensed Nearest Neighbour for Data Reduction in Nearest Neighbour Classifier (40 marks)
Go to the web page:
https://github.com/Mirkes/Data_Mining_Softbook/wiki/KNN-and-potential-ene
Read the text. Download the application:
https://github.com/Mirkes/Data_Mining_Softbook/blob/master/knn/knn.jar
Task 1 (10 marks)
Study how the number of prototypes depends on the number of points for two convex well-separated classes.
Task 2 (10 marks)
Prepare a series of examples with more sophisticated non-convex shapes of well-separated classes. Study how the number of prototypes depends on the number of points in these classes.
Task 3 (10 marks)
Study how the numbers of prototypes and outliers depend on the number of points for two well-separated classes with added uniformly distributed background noise (option: random).
Task 4 (10 marks)
In conclusion, discuss the results and propose a hypothesis for further study.
Do not forget to save and submit the configurations of the classes and prototypes as figures!
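The experiments in Tasks 1 to 3 can also be cross-checked outside the application. The following is a minimal sketch of the textbook form of Hart's condensed nearest neighbour rule, assuming 2D points with Euclidean distance and a 1-NN classifier; it is not necessarily the exact procedure implemented in knn.jar. The helper names (nearest_label, hart_condense, two_blobs) and the two square classes are hypothetical and chosen only for illustration.

```python
import random


def nearest_label(store, p):
    """Return the label of the stored prototype closest to point p (1-NN)."""
    best = min(store, key=lambda q: (q[0] - p[0]) ** 2 + (q[1] - p[1]) ** 2)
    return best[2]


def hart_condense(points):
    """Textbook Hart rule: add every point that 1-NN on the current store
    misclassifies, and repeat passes until a full pass adds nothing."""
    store = [points[0]]      # start from a single arbitrary prototype
    in_store = {0}           # indices of points already kept as prototypes
    changed = True
    while changed:
        changed = False
        for i, p in enumerate(points):
            if i in in_store:
                continue
            if nearest_label(store, p) != p[2]:
                store.append(p)
                in_store.add(i)
                changed = True
    return store


def two_blobs(n_per_class, rng):
    """Hypothetical test data: two well-separated convex (square) classes.
    Each point is a tuple (x, y, label)."""
    a = [(rng.uniform(0, 1), rng.uniform(0, 1), 0) for _ in range(n_per_class)]
    b = [(rng.uniform(3, 4), rng.uniform(0, 1), 1) for _ in range(n_per_class)]
    pts = a + b
    rng.shuffle(pts)         # the result of Hart's rule depends on point order
    return pts


if __name__ == "__main__":
    rng = random.Random(0)
    for n in (10, 50, 250):
        prototypes = hart_condense(two_blobs(n, rng))
        print(n, len(prototypes))   # points per class, number of prototypes
```

Because the outcome depends on the order in which points are visited, it may be worth repeating each run with several shuffles before comparing the prototype counts with those reported by the application.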
Project 2: Dynamics of K-means Clustering (40 marks)
Go to the web page:
https://github.com/Mirkes/Data_Mining_Softbook/wiki/k-means-and-k-medoids
Read the text. Download the application:
https://github.com/Mirkes/Data_Mining_Softbook/blob/master/kmeans/KMeansKMedoids.jar
Task 1 (10 marks) – Exploration
Find the final K-means configurations for a series of datasets and various initialisations of the centroids. How many different configurations did you observe? How frequently did they appear? How many iterations were required?
Task 2 (10 marks)
Formulate a hypothesis about the number of different final K-means configurations and their frequencies. Analyse how they depend on the number of data points. Check this hypothesis on random sets of equidistributed points.
Task 3 (10 marks)
Formulate a hypothesis about the convergence rate of K-means and its dependence on the number of data points. Check this hypothesis on the random sets of equidistributed points (use the same series of experiments as in Task 2).
Task 4 (10 marks)
In conclusion, discuss the results and propose a hypothesis for further study.
Do not forget to save and submit the configurations of the data points and centroids as figures!
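The counting experiments in Tasks 1 to 3 of Project 2 can be sketched in code as a cross-check on what is observed in the application. The example below is a minimal sketch of the standard (Lloyd) K-means iteration, under three assumptions: "equidistributed" means uniformly distributed in the unit square, the initial centroids are drawn at random from the data points, and two final configurations count as the same when their sorted centroids agree to six decimal places. It may differ from the behaviour of KMeansKMedoids.jar. The function names kmeans and experiment are hypothetical.

```python
import random
from collections import Counter


def kmeans(points, centroids, max_iter=1000):
    """Lloyd's iteration on 2D points.
    Returns (final centroids, number of iterations), where the count
    includes the last pass that confirms nothing has changed."""
    for it in range(1, max_iter + 1):
        clusters = [[] for _ in centroids]
        for p in points:
            # assign p to the nearest centroid (squared Euclidean distance)
            j = min(range(len(centroids)),
                    key=lambda c: (p[0] - centroids[c][0]) ** 2
                                  + (p[1] - centroids[c][1]) ** 2)
            clusters[j].append(p)
        new = []
        for c, members in zip(centroids, clusters):
            if members:
                new.append((sum(x for x, _ in members) / len(members),
                            sum(y for _, y in members) / len(members)))
            else:
                new.append(c)   # an empty cluster keeps its old centroid
        if new == centroids:
            return new, it
        centroids = new
    return centroids, max_iter


def experiment(n_points, k, n_runs, rng):
    """Run K-means n_runs times on one random dataset; tally the distinct
    final configurations and record the iteration count of each run."""
    points = [(rng.random(), rng.random()) for _ in range(n_points)]
    configs = Counter()
    iterations = []
    for _ in range(n_runs):
        final, it = kmeans(points, rng.sample(points, k))
        key = tuple(sorted((round(x, 6), round(y, 6)) for x, y in final))
        configs[key] += 1
        iterations.append(it)
    return configs, iterations


if __name__ == "__main__":
    rng = random.Random(0)
    for n in (20, 100, 500):
        configs, iterations = experiment(n, k=3, n_runs=200, rng=rng)
        # number of points, distinct final configurations, mean iterations
        print(n, len(configs), sum(iterations) / len(iterations))
```

The Counter returned by experiment gives the frequency of each final configuration directly, so the same runs can serve both the hypothesis about the number of configurations and the one about the convergence rate.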
