Analysis of Kaggle Survey
Jennifer La - CS544 Term Project
December 4, 2017
Dataset Details
The data set was created from a Kaggle survey that examined the state of data science and machine learning through the views of more than
16000 individuals from over 171 different countries. The question bank consisted of approximately 200 questions. Some questions were asked
of all individuals, while others were asked only of particular groups of people. Individuals were grouped into ‘learner’, ‘non-switcher’,
‘non-worker’, ‘worker’, and ‘coding worker’ based on their answers about their current employment status, whether they code for their job,
whether they are learning to code, and whether they are looking to switch careers.
Objective
The objective of this project is to gain further knowledge about the data science environment in the United States.
Separate Dataset Into Their Respective Groups
The data set includes surveys from individuals across 171 countries. This project focuses on the individuals in the United States. As
mentioned previously, the questions asked of each individual were determined by how he or she answered questions about employment, whether
the job requires coding, whether he or she is thinking of switching careers, and whether he or she is currently learning how to code. The
schema file is a csv file containing columns labeled ‘Column’, ‘Question’, and ‘Asked’, which correspond to the column label in the whole
data set, the question asked of the individual, and who was asked, respectively. The data set was broken into the groups (‘learner’,
‘non-switcher’, ‘non-worker’, ‘worker’, and ‘coding worker’), and each subset includes only the questions (columns) that its group was asked.
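The column selection described above can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the schema rows, the group labels in ‘Asked’ (such as `CodingWorker` and `Learners`), and the idea that `All` marks questions asked of everyone are assumptions made for the example.

```python
import csv
import io

# Hypothetical schema excerpt. The real schema file has the same three
# headers, but these rows and the 'Asked' labels are assumed for illustration.
SCHEMA_CSV = """Column,Question,Asked
Age,What is your age?,All
CurrentJobTitleSelect,Select the option that's most similar to your current job/professional title.,CodingWorker
LearningPlatformSelect,Which platforms do you use to learn?,Learners
"""


def columns_for_group(schema_rows, group, everyone_label="All"):
    """Return the column labels that were asked of `group` or of everyone."""
    return [row["Column"] for row in schema_rows
            if row["Asked"] in (everyone_label, group)]


def subset_responses(responses, columns):
    """Keep only the given columns in each response record."""
    return [{col: rec.get(col) for col in columns} for rec in responses]


schema_rows = list(csv.DictReader(io.StringIO(SCHEMA_CSV)))
coding_worker_cols = columns_for_group(schema_rows, "CodingWorker")
learner_cols = columns_for_group(schema_rows, "Learners")
print(coding_worker_cols)
print(learner_cols)
```

Each group's subset is then the responses of its members restricted to that group's columns, for example `subset_responses(records, coding_worker_cols)`.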
Examine the Age Distribution in Each Group
A question that was asked of every respondent was “What is your age”. Below are a box plot and histograms of the age distribution for each
group of people.
Age Distribution of the Groups
[Figure: box plots and histograms of age (x-axis 0 to 100) for the coding_worker, non_switcher, non_worker, worker, and learner groups.]
Central Limit Theorem
The central limit theorem states that the distribution of the means of independent random samples approaches a normal distribution as the
sample size grows, even if the original population is not normally distributed. This is important because many statistical procedures
require normality. As a result, techniques that assume normality can be applied to sample means even when the population is non-normal.
Using the age attribute in this data set, the applicability of the central limit theorem can be shown. As displayed in the box plots and
histograms above, the age distributions of all groups have a positive (right) skew. Since all of the distributions are right-skewed, the
coding workers will be used as the example. The histograms below show that the sample means of 1000 random samples of size 10, 20, 30, and
40 approximately follow a normal distribution, and the output shows that the spread of the sample means shrinks as the sample size increases.
## population mean: 35.68992
## sample size: 10 mean: 35.5954 sd: 3.552371
## sample size: 20 mean: 35.7667 sd: 2.554598
## sample size: 30 mean: 35.6469 sd: 2.034189
## sample size: 40 mean: 35.72738 sd: 1.776585
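The experiment behind this output can be sketched as below. This is an illustrative simulation on a synthetic right-skewed population, not the survey data: the exponential age model, the seed, and the population size are all assumptions. It draws 1000 samples of each size and compares the standard deviation of the sample means with the value sigma / sqrt(n) that the central limit theorem predicts.

```python
import random
import statistics

random.seed(544)  # fixed seed so the sketch is reproducible

# Synthetic right-skewed "ages": 18 plus an exponential tail.
# This is an assumed stand-in for the coding worker ages, not survey data.
population = [18 + random.expovariate(1 / 17) for _ in range(5000)]
pop_mean = statistics.mean(population)
pop_sd = statistics.pstdev(population)


def sample_means(pop, size, draws=1000):
    """Means of `draws` simple random samples (without replacement) of `size`."""
    return [statistics.mean(random.sample(pop, size)) for _ in range(draws)]


results = {}
print(f"population mean: {pop_mean:.2f}")
for size in (10, 20, 30, 40):
    means = sample_means(population, size)
    results[size] = (statistics.mean(means), statistics.stdev(means))
    predicted = pop_sd / size ** 0.5  # spread predicted by the CLT
    print(f"sample size: {size} mean: {results[size][0]:.2f} "
          f"sd: {results[size][1]:.2f} predicted sd: {predicted:.2f}")
```

The mean of the sample means stays close to the population mean for every sample size, while their standard deviation falls roughly in proportion to 1 / sqrt(n), the same pattern seen in the reported output.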
[Figure: histograms of the sample means of coding worker age for sample sizes 10, 20, 30, and 40.]
Sampling of Coding Worker via Simple Random Sample Without Replacement,
Systematic Sampling, and Stratified Sampling
Sampling is a technique for selecting a representative portion of a population on which to perform a study. There are many different
sampling techniques, including simple random sampling, systematic sampling, and stratified sampling. Simple random sampling is a basic
technique in which individual subjects are selected at random from a larger group, so every individual has the same chance of being picked.
Systematic sampling selects subjects at a fixed periodic interval. The interval is calculated by dividing the population size by the desired
sample size, and the first subject is chosen randomly within the first interval. Lastly, stratified sampling takes into account that a
population is heterogeneous. The population is subdivided into sub-populations (strata), and the same percentage of individuals is selected
from each stratum to make up the sample. When the sample means are normally distributed, the sample mean can be used as an estimate of the
population mean. Given a certain confidence level, a confidence interval is defined. The confidence interval is a range of values that
contains the population mean with the given confidence level.
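The three sampling schemes described above can be sketched as follows. This is a minimal illustration under stated assumptions: the population of (id, stratum) pairs is hypothetical, and the function names are chosen for the example rather than taken from the project.

```python
import random
from collections import Counter, defaultdict


def srswor(population, n, rng):
    """Simple random sample without replacement: every unit is equally likely."""
    return rng.sample(population, n)


def systematic_sample(population, n, rng):
    """Pick every k-th unit, with k = N // n and a random start in the first interval."""
    k = len(population) // n
    start = rng.randrange(k)
    return [population[start + i * k] for i in range(n)]


def stratified_sample(population, key, fraction, rng):
    """Draw the same fraction from every stratum defined by `key`."""
    strata = defaultdict(list)
    for unit in population:
        strata[key(unit)].append(unit)
    sample = []
    for members in strata.values():
        take = round(len(members) * fraction)
        sample.extend(rng.sample(members, take))
    return sample


rng = random.Random(544)
# Hypothetical population: 150 units in stratum "A" and 50 in stratum "B".
population = [(i, "B" if i % 4 == 0 else "A") for i in range(200)]

srs = srswor(population, 20, rng)
syst = systematic_sample(population, 20, rng)
strat = stratified_sample(population, key=lambda unit: unit[1], fraction=0.10, rng=rng)
print(Counter(stratum for _, stratum in strat))
```

Because the stratified sample takes 10% of each stratum, its composition matches the population's 3:1 split exactly, whereas the other two methods match it only on average.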
For this project the coding worker population will be analyzed. Simple random sampling without replacement, systematic sampling, and
stratified sampling will be used as the sampling methods.
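A confidence interval for the mean can be computed as the sample mean plus or minus a normal quantile times the standard error. The sketch below assumes that the value printed as `sd` in the report output is the standard error of the mean; that reading is an assumption, but it is consistent with the reported intervals. With mean 35.7 and 0.21, the function gives 35.43 to 35.97 at the 80% level and 35.35 to 36.05 at the 90% level.

```python
from statistics import NormalDist


def confidence_interval(mean, std_error, conf_level):
    """Two-sided normal confidence interval: mean +/- z * standard error."""
    alpha = 1 - conf_level
    z = NormalDist().inv_cdf(1 - alpha / 2)  # e.g. about 1.28 for 80%
    margin = z * std_error
    return mean - margin, mean + margin


# Values taken from the "All Coding Worker" line of the report output;
# 0.21 is treated as the standard error of the mean (an assumption).
for level in (0.80, 0.90):
    low, high = confidence_interval(35.7, 0.21, level)
    print(f"{level:.0%} Conf Level (alpha = {1 - level:.2f}), CI = {low:.2f} - {high:.2f}")
```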
## All Coding Worker: mean = 35.7 and sd = 0.21
## 80% Conf Level (alpha = 0.20), CI = 35.43 - 35.97
## 90% Conf Level (alpha = 0.10), CI = 35.35 - 36.05
## SRSWOR: mean = 35.38 and sd = 0.36
## 80% Conf Level (alpha = 0.20), CI = 34.92 - 35.84
## 90% Conf Level (alpha = 0.10), CI = 34.79 - 35.97
## systematic sampling: mean = 35.5 and sd = 0.37
## 80% Conf Level (alpha = 0.20), CI = 35.03 - 35.97
## 90% Conf Level (alpha = 0.10), CI = 34.89 - 36.11
## stratified sampling: mean = 35.37 and sd = 0.34
## 80% Conf Level (alpha = 0.20), CI = 34.93 - 35.81
## 90% Conf Level (alpha = 0.10), CI = 34.81 - 35.93
[Figure: age density histograms (x-axis 0 to 100) for All Coding Worker, SRSWOR, Systematic Sampling, and Stratified Sampling by Job Title and Gender.]
General Information of Coding Worker
Sampling is a useful way to analyze a representative portion of a population without evaluating the whole population. In many
circumstances it is impossible to obtain data on the whole population, so sampling is the only practical option. Reducing the number of
subjects also lowers the computational cost of the analysis. The data set used for this project can itself be seen as a sample of the whole
coding worker population. However, this sample may be skewed towards certain coding workers, since the data was obtained only through the
Kaggle website. Given that the data set has a manageable number of tuples, the whole coding worker data set, rather than a smaller sample
of it, will be used to analyze coding workers.
Background information on the coding workers is presented to give a general idea of what kinds of people took this survey. Of particular
interest are the following questions:
1. Select the option that’s most similar to your current job/professional title.
2. Which level of formal education have you attained?
3. What programming language would you recommend a new data scientist learn first?
Select the option that’s most similar to your current
job/professional title.
There were many job titles under the umbrella of ‘coding worker’. It is worth knowing the distribution of professions to check whether one
group dominates the survey. If one profession dominates, then the answers to all subsequent questions will be biased towards the views of
that group.
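Checking whether one profession dominates amounts to computing each category's share of the non-missing answers. A minimal sketch follows; the list of titles is hypothetical and only illustrates the calculation.

```python
from collections import Counter


def category_shares(values):
    """Share of each category among non-missing answers, largest first."""
    counts = Counter(v for v in values if v)  # drop None and empty strings
    total = sum(counts.values())
    return {label: count / total for label, count in counts.most_common()}


# Hypothetical answers to the job title question, including two missing values.
titles = ["Data Scientist"] * 3 + ["Data Analyst"] * 2 + ["Statistician", "", None]
shares = category_shares(titles)
for label, share in shares.items():
    print(f"{label}: {share:.1%}")
```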
Level of Formal Education of All Coding Workers
Many people apply for jobs every day. It would be ideal to interview every applicant to increase the chance of finding the best one;
however, this approach is not feasible. Scanning resumes for keywords that pertain to the job at hand is a good way of sifting out those
who may not have the technical expertise for the position. After passing this first screening step, what distinguishes one applicant from
another? Does having more degrees boost your chances of getting a job? Although examining the educational background of these professionals
can only show a correlation between degrees and certain jobs, it is still interesting to see what their educational backgrounds are.
[Figure: pie chart of CurrentJobTitleSelect for all coding workers.]
Data Scientist: 27.6%
Software Developer/Software Engineer: 11.5%
Data Analyst: 10.7%
Other: 9.97%
Scientist/Researcher: 9.5%
Researcher: 5.98%
Business Analyst: 4.66%
Engineer: 4.51%
Machine Learning Engineer: 4.44%
Statistician: 2.86%
Computer Scientist: 2.31%
Predictive Modeler: 1.91%
DBA/Database Engineer: 1.43%
Programmer: 1.28%
Operations Research Practitioner: 0.88%
Data Miner: 0.44%
[Figure: pie chart of FormalEducation for all coding workers.]
Master's degree: 44.9%
Bachelor's degree: 24.9%
Doctoral degree: 24.3%
Some college/university study without earning a bachelor's degree: 3.83%
Professional degree: 1.29%
I did not complete any formal education past high school: 0.479%
Formal Education and Specified Profession
The above pie chart shows the formal education of all the coding workers. However, it is more informative to show the formal education
distribution of each profession. Alternatively, it is also interesting to examine the profession of individuals with a certain formal education.
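Both views, education within each profession and profession within each education level, come from the same cross-tabulation with the roles of the two columns swapped. The sketch below uses the column names CurrentJobTitleSelect and FormalEducation from the survey, while the four records are hypothetical.

```python
from collections import Counter, defaultdict


def crosstab(records, row_key, col_key):
    """Count records for every (row_key value, col_key value) pair."""
    table = defaultdict(Counter)
    for rec in records:
        table[rec[row_key]][rec[col_key]] += 1
    return table


def row_proportions(table):
    """Convert each row of counts into proportions that sum to 1."""
    result = {}
    for row, counts in table.items():
        total = sum(counts.values())
        result[row] = {col: n / total for col, n in counts.items()}
    return result


# Hypothetical records, for illustration only.
records = [
    {"CurrentJobTitleSelect": "Data Scientist", "FormalEducation": "Master's degree"},
    {"CurrentJobTitleSelect": "Data Scientist", "FormalEducation": "Doctoral degree"},
    {"CurrentJobTitleSelect": "Data Scientist", "FormalEducation": "Master's degree"},
    {"CurrentJobTitleSelect": "Data Analyst", "FormalEducation": "Bachelor's degree"},
]

by_job = row_proportions(crosstab(records, "CurrentJobTitleSelect", "FormalEducation"))
by_edu = row_proportions(crosstab(records, "FormalEducation", "CurrentJobTitleSelect"))
print(by_job["Data Scientist"])
print(by_edu["Master's degree"])
```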
[Figure: stacked bar charts of FormalEducation within each CurrentJobTitleSelect category, and of CurrentJobTitleSelect within each FormalEducation category. The education categories include ‘I prefer not to answer’ (labeled 0.295%) in addition to those listed for the pie chart.]
A Little Peek at Data Scientist’s Job Description
Data science was deemed the ‘sexiest job of the 21st century’ in the Harvard Business Review in 2012. It is a fairly new and evolving field,
and this survey provides some information about the new trends and ‘need to knows’ in the data science field. Some questions of interest include:
1. What language do they recommend new data scientists learn?
2. Are there any learning platforms they found useful for their career?
3. What sort of tools and algorithms do they use in their job?
4. What are some important skills that a Data Scientist should have?
What Language do Current Data Scientists Recommend New Data Scientists Learn?
Results: Python > R
[Figure: pie chart of LanguageRecommendationSelect.]
Python: 68.4%
R: 23.6%
SQL: 5%
SAS: 1.03%
C/C++/C#: 0.69%
Other: 0.345%
Scala: 0.345%
Java: 0.172%
Julia: 0.172%
Matlab: 0.172%
Interestingly, the majority of data scientists suggest learning Python over R (68.4% versus 23.6%). Several reasons could explain this large
gap. First, Python may truly be the predominant language in the data science field. Alternatively, there may be sampling bias: the survey
was conducted by Kaggle, whose site MAY be visited largely by data scientists who use Python as their main or only language. In that case
the results would be skewed towards individuals who use Python.
What's the Best Way of Obtaining “Data Science” Skills?
[Figure: bar chart of usefulness ratings (Not Useful, Somewhat useful, Very useful; y-axis 0 to 0.8) for each learning platform: Arxiv, Blogs, College, Communities, Company, Conferences, Courses, Documentation, Friends, Kaggle, Newsletters, Podcasts, Projects, SO, Textbook, TradeBook, Tutoring, YouTube. Group labels shown: coding_worker, non_worker, worker, learner.]
Is doing Projects the way to go?
What Sort of Tools and Algorithms do Data Scientists Use in their Job?
A very simplified description of a data science role includes the collection, analysis, and interpretation of copious amounts of data. Many
data scientists use different tools and algorithms to analyze and interpret their data. The charts below show how frequently data scientists
use some common algorithms and tools.
Algorithms
[Figure: bar chart of how often each algorithm is used (Most of the time, Often, Rarely, Sometimes; y-axis 0 to 0.7): AssociationRules, Bayesian, CNNs, CollaborativeFiltering, DataVisualization, DecisionTrees, EnsembleMethods, EvolutionaryApproaches, GANs, GBM, HMMs, KNN, LiftAnalysis, LogisticRegression, MLN, NaiveBayes, NeuralNetworks, NLP, PCA, PrescriptiveModeling, RandomForests, RecommenderSystems, RNNs, Segmentation, Select1.]
Tools
[Figure: bar chart of how often each tool is used (Most of the time, Often, Rarely, Sometimes; y-axis 0 to 0.7): AmazonML, Angoss, AWS, Azure, C, Cloudera, DataRobot, Excel, Flume, GCP, Hadoop, IBMCognos, IBMSPSSModeler, IBMSPSSStatistics, IBMWatson, Impala, Java, Julia, Jupyter, KNIMECommercial, KNIMEFree, Mathematica, MATLAB, MicrosoftRServer, MicrosoftSQL, Minitab, NoSQL, Oracle, Orange, Perl, Python, Qlik, R, RapidMinerCommercial, RapidMinerFree, Salfrod, SAPBusinessObjects, SASBase, SASEnterprise, SASJMP, Select1, Select2, Spark, SQL, Stan, Statistica, Tableau, TensorFlow, TIBCO, Unix.]
What are Some Important Skills that a Data Scientist Should Have?
[Figure: bar chart of the proportion (0 to 1) of Necessary, Nice to have, and Unnecessary ratings for each job skill: BigData, Degree, EnterpriseTools, KaggleRanking, MOOC, Python, R, SQL, Stats, Visualizations.]
Observations
100 percent of the data scientists who answered these questions said that ‘BigData’, SQL, and visualization are skills a data scientist must
have. However, this seems to contradict what we have seen so far. Percentages can be very deceiving when the number of responses is not
shown. The question is how many people answered these questions. Below is a histogram showing the number of people who actually answered
the job skill questions.
[Figure: histogram of the number of respondents (up to 3) who rated each job skill as Necessary, Nice to have, or Unnecessary.]
As the histogram shows, fewer than 5 of the more than 700 data scientists answered these questions. It is not surprising that many people
did not answer every question, especially given that there were over 200 questions in the question bank. The n for these questions is too
low to draw any real conclusions.
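The pitfall described here, a percentage reported without its denominator, can be guarded against by always computing the number of non-missing answers alongside the shares. A minimal sketch follows; the column names `JobSkillPython` and `JobSkillSQL` and the four rows are hypothetical.

```python
from collections import Counter


def summarize(rows, column):
    """Return (number of non-missing answers, share of each answer)."""
    answers = [r.get(column) for r in rows if r.get(column) not in (None, "")]
    n = len(answers)
    shares = {label: count / n for label, count in Counter(answers).items()} if n else {}
    return n, shares


# Hypothetical responses; empty strings and None stand for skipped questions.
rows = [
    {"JobSkillPython": "Necessary", "JobSkillSQL": ""},
    {"JobSkillPython": "", "JobSkillSQL": ""},
    {"JobSkillPython": "Necessary", "JobSkillSQL": "Necessary"},
    {"JobSkillPython": None, "JobSkillSQL": ""},
]

for column in ("JobSkillPython", "JobSkillSQL"):
    n, shares = summarize(rows, column)
    print(f"{column}: n = {n} of {len(rows)}, shares = {shares}")
```

In this toy data ‘Necessary’ reaches 100% for `JobSkillSQL`, yet that figure rests on a single answer, which is why n should be reported with every percentage.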
Conclusion
Below is a word cloud in which the size of a word reflects how frequently it was rated as used “most of the time” or as “very useful”.
As you can see, words like Python, data visualization, R, SQL, and projects are emphasized in this word cloud.
