STAT3888 Statistical Machine Learning
Task
Content:
This assessment item evaluates your learning in data collection, feature engineering, and the design, evaluation and execution of data analytical models, including providing appropriate context-based recommendations. The assessment evaluates these skills across three main themes of machine learning: unsupervised learning, supervised learning, and learning by experience. You are provided with a real-world dataset for the assessment.
Using Google Colab, you should analyse the data according to your own analysis plan and propose context-based recommendations. You should also answer some questions using the output of the analytical models you design and execute. This requires you to understand and apply the techniques and skills covered in the subject.
Objective:
Apply the K-means clustering technique to an enterprise dataset. Following on from the initial exploration of use cases, technical management is now advising you to develop a proof of concept for the application of machine learning models.
For this, you are provided with a real-world dataset for the assessment. By using the course analytical software, you should analyse the data through your own analysis plan, compare the different ML techniques, and propose context-based recommendations.
You should also answer some questions using the output of the analytical models you design and execute. This requires you to understand and apply the techniques and skills covered in the subject. As a deliverable, technical management is expecting a handbook outlining your model results and a report that describes and justifies your results.
"IBM employs a network of expert consultants for various projects. To help determine how to distribute its bonuses, IBM wants to cluster groups of employees with similar performance based on key performance metrics.
Each observation corresponds to an employee in the file ‘Big Blue’ and consists of values for:
(1) Usage Rate - which corresponds to the proportion of time in the period that the employee has been actively working on high-priority projects,
(2) Recognition - which is the number of projects for which the employee's involvement was specifically requested, and
(3) Leader - which is the number of projects on which the employee has served as a project leader."
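Because Usage Rate is a proportion while Recognition and Leader are counts, the three variables sit on different scales, and K-means distances can be dominated by the variable with the largest range. The following is a minimal sketch of one common remedy, standardising each variable before clustering. The rows below are hypothetical values invented for illustration, not taken from the ‘Big Blue’ file, and standardisation is only one option you might justify in your report.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical employees (illustrative values only, not the assessment data).
employees = pd.DataFrame({
    "Usage Rate": [0.25, 0.60, 0.90, 0.45],   # proportion of time
    "Recognition": [2, 7, 14, 5],             # count of requested projects
    "Leader": [0, 3, 8, 1],                   # count of projects led
})

# Rescale every column to mean 0 and unit variance so that each
# variable contributes comparably to Euclidean distances.
scaled = pd.DataFrame(
    StandardScaler().fit_transform(employees),
    columns=employees.columns,
)
print(scaled.round(2))
```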
You are provided with a real-world dataset, IBM, to be analysed using unsupervised learning techniques/clustering. Click the link below to download the dataset.
Once you have downloaded the dataset, you are expected to analyse the data by carrying out the following tasks:
Feature engineering task: You should identify any variables that should not be considered for the clustering, and justify your decision in the report. By applying clustering algorithms, you should identify the optimum number of clusters and justify your finding. In this part, you should also explain the clusters linguistically (in words).
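One common way to support a choice of the number of clusters is to compare silhouette scores across candidate values of k. The sketch below illustrates the idea with scikit-learn; it is not a prescribed method. The data are synthetic stand-ins built around three hypothetical groups, not the assessment dataset, and the candidate range of 2 to 6 clusters is an assumption.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data: three hypothetical groups of 50 employees each.
rng = np.random.default_rng(0)
cols = ["Usage Rate", "Recognition", "Leader"]
centres = [(0.2, 2, 1), (0.5, 8, 4), (0.9, 15, 8)]
data = pd.DataFrame(
    np.vstack([rng.normal(c, (0.03, 0.5, 0.5), size=(50, 3)) for c in centres]),
    columns=cols,
)

# Standardise so no single variable dominates the distance calculation.
X = StandardScaler().fit_transform(data)

# Fit K-means for each candidate k and record the mean silhouette score.
scores = {}
for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    scores[k] = silhouette_score(X, labels)

# The candidate with the highest silhouette score is the suggested k.
best_k = max(scores, key=scores.get)
print(pd.Series(scores).round(3))
print("Suggested number of clusters:", best_k)
```

An elbow plot of within-cluster sum of squares (the `inertia_` attribute of a fitted `KMeans`) is a frequently used alternative, and reporting both can make the justification stronger.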
You can consider various stakeholders as your audience and pitch your explanation to them. The provided data is about the performance of a given enterprise. You should now identify which cluster refers to the group of individuals with the best performance, and justify your finding using the Centroid Table values.
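A centroid table lists the mean of each variable within each cluster, which is what makes a "best performance" claim checkable. The sketch below builds such a table under stated assumptions: the data are synthetic, three clusters are used, and "best" is taken to mean the cluster whose standardised centroid values sum highest, which assumes that higher is better on all three metrics and that they are weighted equally. Your own report may justify a different ranking rule.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (hypothetical values, not the assessment dataset).
rng = np.random.default_rng(1)
cols = ["Usage Rate", "Recognition", "Leader"]
centres = [(0.2, 2, 1), (0.5, 8, 4), (0.9, 15, 8)]
data = pd.DataFrame(
    np.vstack([rng.normal(c, (0.03, 0.5, 0.5), size=(50, 3)) for c in centres]),
    columns=cols,
)

# Cluster on standardised values.
scaler = StandardScaler()
X = scaler.fit_transform(data)
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)

# Centroid table: convert the centroids back to the original units
# so that stakeholders can read them directly.
centroid_table = pd.DataFrame(
    scaler.inverse_transform(km.cluster_centers_), columns=cols
).round(2)
centroid_table.index.name = "cluster"
centroid_table["size"] = np.bincount(km.labels_, minlength=3)

# Assumed ranking rule: highest sum of standardised centroid coordinates.
best_cluster = int(km.cluster_centers_.sum(axis=1).argmax())

print(centroid_table)
print("Cluster with the highest overall centroid values:", best_cluster)
```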