Tutoring case: F19 17601 Assignment 1
F19 17601 Data-driven Software Engineering
Assignment 1: Software Engineering Activity Analysis Using ML Techniques
10/31/19 © 2019 Vijay Sai Vadlamudi

IMPORTANT: Please READ the entire assignment carefully; ask if you have questions. Plan your time well. The assignment is due by 11:59 P.M. on Sunday, 11/17/2019. (Total points: 400)

Grading

Question   Points
1          50
2          100
3          50
4          100
5          10
6          10
7          10
8          10
9          15
10         10
11         15
12         20
Total      400

The usability requirements and the “modernized” UI should enable the features represented in UC01 to UC10, with the exception of UC09. The key properties of the system will be to:

1. Design activity
   a. Extract and segregate the client responsibilities of the system
   b. Present it via a new “modernized” GUI
   c. Enable UC01 to UC10 except UC09
   d. Design such that the client “process” is testable in a measurable way
   e. Implement the quality attributes usability and performance as measured by:
2. Navigation ability measured by the time taken to
   a. launch the game
   b. set up and start the game
   c. win the game at the first level
   d. Time taken at the various paths in the game to make it a measurably smooth gameplay experience
   e. Number of errors as the game is tested (consider reaching out to students to test)
3. Develop a CI/CD pipeline using Jenkins or a similar tool
4. Use the tool add-ins (e.g., Hudson global-build-stats plugin) to capture the metrics, statuses, and any other meaningful characteristics of the builds and deployments

Assignment Questions: ML: Clustering Analysis

Cluster analysis is a multivariate method that aims to classify a set of objects, based on a set of measured variables, into a number of different groups such that similar objects are placed into the same group.
An example where this might be useful in the context of software engineering is to identify groups of software components, tasks, builds, etc. that demonstrate similarity based on various parameters such as social factors (people demographics) as well as technical factors (e.g., commits, issues, defects, source of defects, story points, etc.). In this assignment, we are going to perform the task of grouping observations based on their similarity. You will apply clustering analysis techniques. You should select features and apply techniques based on

1. Planning activity
2. Requirements and Design activity
3. Development activity
4. Build and deployment activity
5. Progressive activities, i.e., 1, 1+2, 1+2+3
6. Overall data set that includes all of the above (1+2+3+4)

The goal of this assignment is for you to demonstrate which task features explain the structure or patterns inherent in the data. For example, let us assume that your planning data as captured in JIRA presents information such as epic, issue, issue size (S, M, L, XL, or by story points), estimated time, actual time, dependency (Y, N), number of dependencies, dependency type (Blocked by, etc.), issue start, issue end, component type (client side, server side, DB object, etc.), and so on. Similarly, let us assume you have a systematic way of capturing requirements (i.e., use cases, number of steps in a use case, number of interactions in a use case, number of issues per user story, etc.) and design data (quality attributes, supporting scenario, environment, system elements affected, response metric, etc.). Extending this line of thinking, you should be able to obtain data about the development activity from your testing tools and IDE, and finally build and deploy data from the CI/CD tool.

Section A: Project Activities, Environment and Data Readiness

1. Prepare your environment for
   a. Software development (10 points)
   b. Testing – install tools (10 points)
   c. Commission a cloud instance on AWS (10 points)
   d. Create and implement the database and table design for capturing the data (10 points)
   e. Develop a high-level project plan with milestones (a simple plan in Excel is adequate) (10 points)
2. Develop the GUI for the project as stated in Iteration 1 of 2: Project requirements at the top of this page (100 points)
3. Develop and implement the CI/CD pipeline for deploying the application to AWS (50 points)
4. Collect and preprocess the data from each of the engineering activities listed above and populate the database from 2 and 3 above (100 points)

Section B: Hierarchical Clustering

5. Briefly describe which of the variables from your data are relevant for clustering analysis. (10 points)
6. Before performing the HCA (hierarchical clustering analysis), do you think that the data needs to be normalized or standardized? If so, what method would you employ and why? Show boxplots of the different variables in the dataset before and after data normalization. (10 points)
7. Using the selected data, after normalization and selection of variables, perform hierarchical clustering analysis. Be sure to save the cluster dendrogram and include it with your submission. (10 points)
8. A clustering/grouping of the data objects can be obtained by cutting the dendrogram at the desired level; each connected component then forms a cluster or group. At what level do you find it meaningful to cut the tree to understand your data? Discuss the similarities you discovered and the dissimilarities in the variables that separate each cluster from the others. (10 points)

Section C: K-means Clustering and Principal Component Analysis

Non-hierarchical cluster analysis is generally adopted when large data sets are involved, as it becomes difficult to visualize and interpret the dendrograms resulting from hierarchical methods.
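
The hierarchical clustering steps asked for in Section B (standardize, cluster, cut the dendrogram) can be sketched roughly as follows. This is only an illustration on synthetic data, assuming NumPy and SciPy are available; the four feature names, the three synthetic groups, the Ward linkage, and the cut at 3 clusters are all assumptions made for the example, not requirements of the assignment.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster, dendrogram

# Hypothetical planning-activity features, one row per issue (synthetic data):
# story_points, estimated_hours, actual_hours, n_dependencies.
rng = np.random.default_rng(0)
group_means = np.array([[2, 4, 5, 0], [13, 40, 45, 1], [5, 10, 30, 6]], dtype=float)
noise_sd = np.array([0.5, 1.5, 2.0, 0.3])
X_raw = np.vstack([m + rng.normal(0.0, noise_sd, size=(20, 4)) for m in group_means])

# Z-score standardization, so that no feature dominates the Euclidean
# distances merely because of its units (hours vs. counts).
X = (X_raw - X_raw.mean(axis=0)) / X_raw.std(axis=0)

# Agglomerative clustering with Ward linkage; Z encodes the full merge tree.
Z = linkage(X, method="ward")

# "Cut" the tree so that at most 3 flat clusters remain (labels start at 1).
hca_labels = fcluster(Z, t=3, criterion="maxclust")

# no_plot=True returns the dendrogram layout without drawing it; in a notebook
# with matplotlib installed, dendrogram(Z) draws the tree instead.
tree = dendrogram(Z, no_plot=True)

print("cluster sizes:", np.bincount(hca_labels)[1:])
```

With real project data, `X_raw` would come from the database built in Section A, and the number of clusters would be chosen by inspecting the dendrogram rather than fixed in advance.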
Another reason K-means is preferred over hierarchical methods is that it allows instances to move from one cluster to another as the cluster centers are updated across iterations (this is not possible in hierarchical cluster analysis, where an instance, once assigned, cannot move to a different cluster). There are certain disadvantages associated with the K-means clustering method. It is difficult to know how many clusters you are likely to have, and therefore the analysis may have to be repeated several times. One possible strategy is to use a hierarchical approach initially to determine how many clusters you would like to use after visualizing the dendrogram, and then to use the cluster centers obtained from this as initial cluster centers in the non-hierarchical method (the hierarchical step is something you already performed in the sections above).

Let us recall two major properties of K-means clustering: (1) it fits exactly k clusters, and (2) the cluster assignments depend on the chosen value of k and on the k initial points.

9. Perform k-means clustering on the relevant variables identified earlier, using a k equal to the number of clusters found in the HCA from question 8. Plot the data to visualize each cluster in a different color along with its cluster center. Discuss the clusters and the basis of their similarities. Display the model performance. (15 points)
10. Is it possible to estimate the “correct” or, in other words, “optimal” number of clusters? If so, how many clusters are optimal for the data chosen? Use the knee (elbow) plot to determine the size of k. Re-run the k-means clustering analysis with the identified optimal k value if it differs from the value adopted in question 9. (10 points)
11. Use Principal Component Analysis (PCA) and use the relevant PCs to perform K-means clustering. Provide your analysis. (15 points)

Section D: Reflection

12. Do you observe a change in the structures and patterns as you combine the data for the different activities “progressively”, as suggested on page 1? Explain your observation. (20 points)

Important Notes:

(a) Clustering has no mechanism for differentiating between relevant and irrelevant variables. Therefore, the variables included in the cluster analysis must be chosen carefully. It is critical to note that the formation of clusters or groups can depend heavily on the variables included.
(b) As part of your solution, please provide Jupyter notebook (.ipynb) files and data sets, with your code and answers captured within the notebooks.
(c) Note that you will be required to package your application with its databases and code so that we can deploy it in a different environment at the end of the course.
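
As a rough illustration of Section C, the sketch below seeds k-means with the cluster centers from a hierarchical cut (the strategy described in that section), collects the within-cluster sum of squares for a knee plot, and repeats the clustering on the first two principal components. It assumes NumPy and SciPy are available. The helper names (`standardize`, `hca_seeded_kmeans`, `pca_scores`), the synthetic data, the Ward linkage, and the range of k values are all assumptions made for the example; scikit-learn's `KMeans` and `PCA` are common alternatives.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.cluster.vq import kmeans2

def standardize(data):
    """Z-score each column (mean 0, standard deviation 1)."""
    return (data - data.mean(axis=0)) / data.std(axis=0)

def hca_seeded_kmeans(data, k):
    """K-means whose initial centers come from a Ward hierarchical cut into k groups."""
    cut = fcluster(linkage(data, method="ward"), t=k, criterion="maxclust")
    init = np.array([data[cut == c].mean(axis=0) for c in np.unique(cut)])
    centers, labels = kmeans2(data, init, minit="matrix")
    # Within-cluster sum of squares, used here as a simple performance measure.
    wcss = float(((data - centers[labels]) ** 2).sum())
    return centers, labels, wcss

def pca_scores(data, n_components):
    """Project column-centered data onto its first principal components via SVD."""
    centered = data - data.mean(axis=0)
    u, s, _ = np.linalg.svd(centered, full_matrices=False)
    explained = s**2 / (s**2).sum()  # share of variance per component
    return u[:, :n_components] * s[:n_components], explained[:n_components]

# Synthetic stand-in for real activity data (hypothetical features:
# story_points, estimated_hours, actual_hours, n_dependencies).
rng = np.random.default_rng(0)
group_means = np.array([[2, 4, 5, 0], [13, 40, 45, 1], [5, 10, 30, 6]], dtype=float)
noise_sd = np.array([0.5, 1.5, 2.0, 0.3])
X_raw = np.vstack([m + rng.normal(0.0, noise_sd, size=(20, 4)) for m in group_means])
X = standardize(X_raw)

# Data for the knee plot: WCSS for k = 1..6; look for the k where the drop flattens.
wcss_by_k = {k: hca_seeded_kmeans(X, k)[2] for k in range(1, 7)}

# K-means on all standardized features, then on the first two principal components.
centers, labels, _ = hca_seeded_kmeans(X, 3)
scores, explained = pca_scores(X, 2)
_, pc_labels, _ = hca_seeded_kmeans(scores, 3)

for k, w in wcss_by_k.items():
    print(f"k={k}: WCSS={w:.1f}")
print("variance explained by the first two PCs:", explained.round(3))
```

Plotting `wcss_by_k` against k gives the knee plot for question 10, and a scatter plot of `scores` colored by `pc_labels` is one way to visualize the clusters for questions 9 and 11.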