F19 17601 Data-driven Software Engineering
Assignment 1
Software Engineering Activity Analysis Using ML Techniques
10/31/19 © 2019 Vijay Sai Vadlamudi 1 | P a g e

IMPORTANT: Please READ the entire assignment carefully; ask if you have questions. Plan
your time well. The assignment is due by 11:59 P.M. on Sunday, 11/17/2019.
(Total points: 400)

Grading

Question   Points
1          50
2          100
3          50
4          100
5          10
6          10
7          10
8          10
9          15
10         10
11         15
12         20
Total      400

The usability requirements and the “modernized” UI should enable features represented in UC01
to UC10 with the exception of UC09. The key properties of the system will be to:
1. Design activity
a. Extract and segregate the client responsibilities of the system
b. Present it via a new “modernized” GUI
c. Enable UC01 to UC10 except UC09
d. Design such that the client “process” is testable in a measurable way
e. Implement quality attributes usability and performance as measured by:
2. Navigation ability measured by the time taken to
a. launch the game
b. setup and start the game
c. win the game at the first level
d. Time taken at the various paths in the game to make it a measurably smooth game
play experience
e. Number of errors as the game is tested (consider reaching out to students to test)
3. Develop a CI/CD pipeline using Jenkins or a similar tool
4. Use the tool add-ins (e.g., Hudson global-build-stats plugin) to capture the metrics,
statuses, and any other meaningful characteristics of the builds and deployments
Assignment Questions: ML: Clustering Analysis
Cluster analysis is a multivariate method, which aims to classify a set of objects based on a set
of measured variables into a number of different groups such that similar objects are placed into
the same group. An example where this might be useful in the context of software engineering,
is to identify groups of software components, tasks and builds, etc. which demonstrate similarity
based on various parameters such as social factors (people demographics) as well as technical
factors (e.g., Commits, Issues, defects, source of defects, story points, etc.). In this assignment,
we are going to perform the task of grouping observations based on their similarity. You will
apply clustering analysis techniques. You should select features and apply techniques based on:
1. Planning activity
2. Requirements and Design activity
3. Development activity
4. Build and deployment activity
5. Progressive activities i.e., 1, 1+2, 1+2+3
6. Overall data set that includes all the above 1+2+3+4
The goal of this assignment is for you to demonstrate what task features explain the structure
or patterns inherent in the data. For example, let us assume that your planning data as captured
in JIRA presents information such as Epic, issue, issue size (S, M, L, XL) or (by story points),
estimated time, actual time, dependency (Y, N), number of dependencies, dependency type (Blocked
by, etc.), issue start, issue end, component type (client side, server side, DB object, etc.) and
so on. Similarly, let us assume you have a systematic way of capturing requirements (i.e., use
cases, number of steps in use case, number of interactions in use case, number of issues per
user story, etc.) and design data (quality attributes, supporting scenario, environment, system
elements affected, response metric, etc.). Extending this line of thinking, you should be able to
obtain data about the development activity from your testing tools and IDE, and finally build and
deploy data from the CI/CD tool.
Section A: Project Activities, Environment and Data Readiness
1. Prepare your environment for
a. Software development (10 points)
b. Testing – install tools (10 points)
c. Commission a cloud instance on AWS (10 points)
d. Create and implement the database and table design for capturing the data (10
points)
e. Develop a high-level project plan with milestones (simple plan in Excel is
adequate) (10 points)
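
As one possible starting point for item 1(d), the sketch below creates a single table for planning data with Python's built-in sqlite3 module. The table name, the column names (planning_issue, story_points, and so on) and the sample row are hypothetical, loosely based on the JIRA fields mentioned earlier in this assignment. SQLite is only an assumption made to keep the example self-contained; any relational database on your cloud instance would serve.

```python
import sqlite3

# Hypothetical schema: one row per planning issue (names are illustrative only).
SCHEMA = """
CREATE TABLE IF NOT EXISTS planning_issue (
    issue_id         INTEGER PRIMARY KEY,
    epic             TEXT NOT NULL,
    story_points     INTEGER,
    estimated_hours  REAL,
    actual_hours     REAL,
    num_dependencies INTEGER DEFAULT 0,
    component_type   TEXT
);
"""


def create_db(path=":memory:"):
    """Open (or create) the database and make sure the table exists."""
    conn = sqlite3.connect(path)
    conn.executescript(SCHEMA)
    return conn


def add_issue(conn, epic, story_points, estimated_hours, actual_hours,
              num_dependencies, component_type):
    """Insert one issue and return its generated id."""
    cur = conn.execute(
        "INSERT INTO planning_issue (epic, story_points, estimated_hours, "
        "actual_hours, num_dependencies, component_type) "
        "VALUES (?, ?, ?, ?, ?, ?)",
        (epic, story_points, estimated_hours, actual_hours,
         num_dependencies, component_type),
    )
    conn.commit()
    return cur.lastrowid


# Made-up sample row, stored in an in-memory database.
conn = create_db()
first_id = add_issue(conn, "GUI", 5, 8.0, 10.5, 1, "client side")
rows = conn.execute(
    "SELECT epic, story_points, actual_hours FROM planning_issue"
).fetchall()
```

Similar tables for requirements, design, development, and build data could follow the same pattern, one table per activity.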

2. Develop GUI for the project as stated in Iteration 1 of 2: Project requirements at the top
of this page (100 points)

3. Develop and implement the CI/CD pipeline for deploying the application to AWS
(50 points)

4. Collect and preprocess the data from each of the engineering activities listed above and
populate the database from 2 and 3 above (100 points)

Section B: Hierarchical Clustering

5. Briefly describe which of the variables from your data are relevant for clustering analysis.
(10 points)

6. Before performing the HCA, do you think that the data needs to be normalized or
standardized? If so, what method would you employ, and why? Show boxplots of the
different variables in the dataset before and after data normalization. (10 points)
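
To make the two usual options concrete, here is a minimal sketch of z-score standardization and min-max normalization using only the standard library. The two columns (story_points, actual_hours) and their values are made-up illustrations, not assignment data; which method suits your data is for you to argue.

```python
from statistics import mean, pstdev


def z_score(values):
    """Rescale values to mean 0 and (population) standard deviation 1."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        return [0.0 for _ in values]  # a constant column carries no information
    return [(v - mu) / sigma for v in values]


def min_max(values):
    """Rescale values linearly to the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


# Hypothetical columns measured on very different scales.
story_points = [1, 2, 3, 5, 8]
actual_hours = [4.0, 9.0, 20.0, 35.0, 80.0]

z_points = z_score(story_points)
z_hours = z_score(actual_hours)
mm_hours = min_max(actual_hours)
```

After either transformation the two columns contribute on a comparable scale to a Euclidean distance, which is what distance-based clustering relies on.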

7. Using the data selected, after data-normalization and selection of variables, perform
hierarchical clustering analysis. Be sure to save and submit the cluster dendrogram with
your submission (10 points)

8. A clustering/grouping of the data objects can be obtained by cutting the dendrogram at
the desired level; each connected element then forms a cluster or group. At what level do
you find it meaningful to cut the tree to understand your data? Discuss the similarities that
you discovered within a cluster and the dissimilarities in the variables that separated this
cluster from the others. (10 points)
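
The idea of cutting the tree at a level can be sketched from scratch as follows. This is a simplified single-linkage agglomerative procedure written for illustration, with made-up 2-D points; it is not a prescribed method. In practice a library routine (for example the linkage and dendrogram functions in scipy.cluster.hierarchy) would normally be used, and another linkage criterion may suit your data better.

```python
from math import dist


def single_linkage_cut(points, height):
    """Merge clusters bottom-up until the closest pair is farther apart than
    `height`; equivalent to cutting a single-linkage dendrogram at `height`.
    Returns a list of clusters, each a sorted list of point indices."""
    clusters = [[i] for i in range(len(points))]

    def linkage(a, b):
        # Single linkage: distance between the closest members of a and b.
        return min(dist(points[i], points[j]) for i in a for j in b)

    while len(clusters) > 1:
        # Find the closest pair of clusters.
        d, i, j = min(
            (linkage(clusters[i], clusters[j]), i, j)
            for i in range(len(clusters))
            for j in range(i + 1, len(clusters))
        )
        if d > height:
            break  # every remaining merge lies above the cut
        clusters[i] = sorted(clusters[i] + clusters[j])
        del clusters[j]
    return clusters


# Hypothetical observations: two tight groups and one outlier.
pts = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (25, 0)]
low_cut = single_linkage_cut(pts, 2)    # cut low: many small clusters
high_cut = single_linkage_cut(pts, 15)  # cut higher: groups start to merge
```

A low cut keeps the two groups and the outlier apart, while a higher cut merges the two groups and leaves only the outlier on its own, which is the trade-off the question asks you to discuss.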

Section C: K-means Clustering and Principal Component Analysis
Non-hierarchical cluster analysis is generally adopted when large data sets are involved, as it
becomes difficult to visualize and interpret the dendrograms resulting from hierarchical methods.
Another reason K-means is preferred over hierarchical methods is that it allows an instance to
move from one cluster to another as the algorithm iterates (this is not possible in hierarchical
cluster analysis, where an instance, once assigned, cannot move to a different cluster).
There are certain disadvantages associated with K-means clustering method. It is difficult to
know how many clusters you are likely to have and therefore the analysis may have to be
repeated several times. One possible strategy is to use a hierarchical approach initially to
determine how many clusters you would like to use after visualizing the dendrogram. Then use
the cluster centers obtained from this as initial cluster centers in the non-hierarchical method
(this is something you already performed in the sections above).
Let us recall two major properties of K-means clustering: (1) it fits exactly k clusters and (2)
the cluster assignments depend on the chosen value of k and on the k initial points (cluster centers).
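
The second property can be seen in a small from-scratch sketch of the standard K-means iteration (Lloyd's algorithm). The four points and both sets of initial centers are made up for illustration; the function name k_means is likewise an assumption, not part of any library.

```python
from math import dist


def k_means(points, centers, max_iter=100):
    """Lloyd's algorithm from the given initial centers.
    Returns (labels, centers) once assignments stop changing."""
    centers = [tuple(c) for c in centers]
    labels = None
    for _ in range(max_iter):
        # Assignment step: each point joins its nearest center.
        new_labels = [
            min(range(len(centers)), key=lambda c: dist(p, centers[c]))
            for p in points
        ]
        if new_labels == labels:
            break  # converged: no instance changed cluster
        labels = new_labels
        # Update step: move each center to the mean of its members.
        for c in range(len(centers)):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old center if a cluster is empty
                centers[c] = tuple(sum(x) / len(members) for x in zip(*members))
    return labels, centers


# Hypothetical data: the four corners of a wide rectangle.
points = [(0, 0), (0, 1), (4, 0), (4, 1)]

# Start A: one center on each short side -> left/right split.
labels_a, centers_a = k_means(points, [(0, 0), (4, 0)])

# Start B: both centers in the middle -> top/bottom split (a worse optimum).
labels_b, centers_b = k_means(points, [(2, 0), (2, 1)])
```

Both runs use k = 2 on the same data, yet they converge to different partitions, which is why the choice of initial centers (for instance, centers taken from a hierarchical analysis) matters.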
9. Perform k-means clustering on relevant variables identified earlier using a k of size
identical to the number of clusters in HCA from question 8. Plot the data to visualize each
cluster in a different color along with their cluster centers. Discuss the clusters and the
basis of their similarities. Display the model performance. (15 points)

10. Is it possible to estimate the “correct” or, in other words, “optimal” number of clusters? If so,
how many clusters are optimal for the data chosen? Use the knee plot to determine the
size of k. Re-run the k-means clustering analysis with the identified optimal k value, if
different from the value adopted in question 9 (10 points)
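
A knee (elbow) plot is built from the within-cluster sum of squares (WCSS) for a range of k values. The sketch below computes those values on made-up points. Two simplifications are assumptions made for the sake of a deterministic example: the centers are seeded with a farthest-point rule rather than the random restarts a library would typically use, and the knee is picked by the largest second difference, which is only one simple heuristic. Reading the knee off the plot yourself remains the intended approach.

```python
from math import dist


def farthest_point_init(points, k):
    """Deterministic seeding: start at points[0], then repeatedly add the
    point farthest from all centers chosen so far."""
    centers = [points[0]]
    while len(centers) < k:
        farthest = max(points, key=lambda p: min(dist(p, c) for c in centers))
        centers.append(farthest)
    return centers


def wcss_for_k(points, k, max_iter=100):
    """Run k-means and return the within-cluster sum of squares (WCSS)."""
    centers = farthest_point_init(points, k)
    labels = None
    for _ in range(max_iter):
        new_labels = [
            min(range(k), key=lambda c: dist(p, centers[c])) for p in points
        ]
        if new_labels == labels:
            break  # assignments are stable
        labels = new_labels
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centers[c] = tuple(sum(x) / len(members) for x in zip(*members))
    return sum(dist(p, centers[lab]) ** 2 for p, lab in zip(points, labels))


def knee(wcss):
    """Pick k at the sharpest bend: the largest second difference of the
    WCSS curve. `wcss` maps consecutive k values to their WCSS."""
    ks = sorted(wcss)
    bends = {k: wcss[k - 1] - 2 * wcss[k] + wcss[k + 1] for k in ks[1:-1]}
    return max(bends, key=bends.get)


# Hypothetical data: three well-separated groups of three points each.
data = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10),
        (20, 0), (20, 1), (21, 0)]
wcss = {k: wcss_for_k(data, k) for k in range(1, 6)}
best_k = knee(wcss)
```

Plotting the wcss values against k gives the knee plot; on this toy data the curve drops steeply up to k = 3 and flattens afterwards.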

11. Use Principal Component Analysis (PCA) and use relevant PCs to perform K-means
clustering. Provide your analysis. (15 points)
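
To show what a principal component is, the sketch below extracts the first PC of a tiny made-up data set by power iteration on the covariance matrix, using only the standard library. This is an illustrative simplification, not the expected workflow: a library routine (for example PCA in scikit-learn) would normally compute all components at once. The resulting scores are the values that would be passed to K-means in place of the original variables.

```python
from math import sqrt


def center(rows):
    """Subtract each column's mean so the data is centered at the origin."""
    n = len(rows)
    means = [sum(col) / n for col in zip(*rows)]
    return [[x - m for x, m in zip(row, means)] for row in rows]


def covariance(rows):
    """Sample covariance matrix of already-centered rows."""
    n, d = len(rows), len(rows[0])
    return [[sum(r[i] * r[j] for r in rows) / (n - 1) for j in range(d)]
            for i in range(d)]


def first_component(cov, iterations=200):
    """Power iteration: returns (unit eigenvector, eigenvalue) for the
    largest eigenvalue of the covariance matrix, i.e. the first PC.
    Assumes the start vector is not orthogonal to that eigenvector."""
    d = len(cov)
    v = [1.0 / sqrt(d)] * d
    for _ in range(iterations):
        w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = sqrt(sum(x * x for x in w))
        if norm == 0:
            break  # zero covariance: no direction to find
        v = [x / norm for x in w]
    eigenvalue = sum(v[i] * sum(cov[i][j] * v[j] for j in range(d))
                     for i in range(d))
    return v, eigenvalue


# Hypothetical data in which the second variable is twice the first.
rows = [(1, 2), (2, 4), (3, 6), (4, 8), (5, 10)]
centered = center(rows)
cov = covariance(centered)
pc1, eigenvalue = first_component(cov)

# Share of total variance captured by the first PC.
explained_ratio = eigenvalue / sum(cov[i][i] for i in range(len(cov)))

# Scores: each observation projected onto the first PC.
scores = [sum(x * w for x, w in zip(row, pc1)) for row in centered]
```

Because the two made-up variables are perfectly correlated, one component captures all of the variance here; with real data the explained-variance ratios help decide how many PCs are relevant.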

Section D: Reflection
12. Do you observe a change in the structures and patterns as you progressively combine
the data for the different activities, as suggested on page 1? Explain your
observation. (20 points)
Important Notes:

(a) Clustering has no mechanism for differentiating between relevant and irrelevant variables.
Therefore, the variables included in the cluster analysis must be chosen
carefully. It is critical to note that the formation of clusters or groups could be very much
dependent on the variables included.

(b) As part of your solution, please provide Jupyter notebook (.ipynb) files and data sets, with
your code and answers captured within the notebooks.

(c) Note that you will be required to package your application with its databases and code so that we
can deploy it in a different environment at the end of the course.