Analysis of Kaggle Survey

Jennifer La - CS544 Term Project
December 4, 2017

Dataset Details

The data set was created from information collected in a Kaggle survey that examined the state of data science and machine learning from the views of more than 16,000 individuals from 171 different countries. The question bank consisted of approximately 200 questions; some were asked of all individuals, while others were asked only of particular groups of people. Individuals were grouped into ‘learners’, ‘non-switcher’, ‘non-worker’, ‘worker’, and ‘coding worker’ based on their answers about their current employment state, whether they code for their job, whether they are learning to code, and whether they are looking to switch careers.

Objective

The objective of this project is to gain further knowledge about the data science environment in the United States.

Separate Dataset Into Their Respective Groups

The data set includes surveys from individuals across 171 countries. This project focuses on the individuals in the United States. As mentioned previously, the questions asked of each individual were determined by how he or she answered questions about employment, whether the job requires coding, whether he or she is thinking of switching careers, and whether he or she is currently learning how to code. The schema file is a csv file containing columns labeled ‘Column’, ‘Question’, and ‘Asked’, which correspond to the column label in the whole data set, the question asked of the individual, and who was asked, respectively. The data set was broken into these groups (‘learners’, ‘non-switcher’, ‘non-worker’, ‘worker’, and ‘coding worker’), and each group's subset includes only the questions (columns) that group was asked.

Examine the Age Distribution in Each Group

A question that was asked of every respondent was “What is your age?”. Below is a box plot and histogram of the age distribution for each group of people.
[Figure: Age Distribution of the Groups. Box plots and histograms of age for coding_worker, non_switcher, non_worker, worker, and learner.]

Central Limit Theorem

The central limit theorem states that the distribution of the means of independent random samples approaches a normal distribution as the sample size grows, even if the original population is not normally distributed. This is important because many statistical procedures require normality in the data. As a result, we can apply statistical techniques that assume normality to sample means even when the population is not normal.

The age attribute in this data set can be used to show the central limit theorem at work. As displayed in the box plots and histograms above, the age distributions of all groups have a positive (right) skew. Since all of these distributions are right-skewed, the coding workers are used as the example. The histograms below show that the sample means of 1000 random samples of size 10, 20, 30, and 40 are approximately normally distributed. The mean of the sample means stays close to the population mean, and the standard deviation shrinks as the sample size grows.

## population mean: 35.68992
## sample size: 10 mean: 35.5954 sd: 3.552371
## sample size: 20 mean: 35.7667 sd: 2.554598
## sample size: 30 mean: 35.6469 sd: 2.034189
## sample size: 40 mean: 35.72738 sd: 1.776585

[Figure: histograms of the coding worker sample means for sample sizes 10, 20, 30, and 40.]

Sampling of Coding Worker via Simple Random Sample Without Replacement, Systematic Sampling, and Stratified Sampling

Sampling is a technique for selecting a representative portion of the population to perform a study on. There are many different sampling techniques, including simple random sampling, systematic sampling, and stratified sampling.
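The central limit theorem demonstration described in the previous section can be reproduced in outline with a few lines of Python. This is only a sketch under stated assumptions: the survey ages are not available here, so a synthetic right-skewed population stands in for them, and the function name `sample_mean_stats` is made up for illustration. Each sample is drawn without replacement; the report does not say which scheme it used.

```python
import random
import statistics


def sample_mean_stats(population, sample_size, n_samples=1000, seed=0):
    """Draw n_samples random samples and return (mean, sd) of their means."""
    rng = random.Random(seed)
    means = [
        statistics.fmean(rng.sample(population, sample_size))
        for _ in range(n_samples)
    ]
    return statistics.fmean(means), statistics.stdev(means)


# Synthetic right-skewed "ages" (an assumption, not the survey data).
_rng = random.Random(42)
ages = [18 + _rng.gammavariate(2.0, 9.0) for _ in range(5000)]

pop_mean = statistics.fmean(ages)
pop_sd = statistics.pstdev(ages)
print(f"population mean: {pop_mean:.2f}")

results = {}
for n in (10, 20, 30, 40):
    results[n] = sample_mean_stats(ages, n)
    mean, sd = results[n]
    print(f"sample size: {n} mean: {mean:.2f} sd: {sd:.2f}")
```

The standard deviation of the sample means should come out close to the population standard deviation divided by the square root of the sample size, which is the same pattern the reported values follow (the sd at size 40 is about half the sd at size 10).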
Simple random sampling is a basic sampling technique in which individual subjects are selected from a larger group so that every individual has the same chance of being picked.

Systematic sampling selects individuals at a fixed periodic interval. The interval is calculated by dividing the population size by the desired sample size, and the first individual is chosen at random within the first interval.

Lastly, stratified sampling takes into account that a population is heterogeneous. The population is subdivided into subpopulations (strata), and the same percentage of individuals is selected from each subpopulation to make up the sample.

When the sampling distribution is approximately normal, the sample mean can be used as an estimate of the population mean. Given a certain confidence level, a confidence interval is defined. The confidence interval is a range of values that contains the population mean with the given confidence level.

For this project the coding worker population will be analyzed. Simple random sampling without replacement, systematic sampling, and stratified sampling are used as the sampling methods.

General Information of Coding Worker

Sampling is a great way to analyze a representative portion of the population without needing to evaluate the whole population. In many circumstances it is impossible to obtain data on the whole population, and sampling is then very useful. Decreasing the number of subjects for analysis is also beneficial in that it requires less computational power.

The data set used for this project can be seen as a sample of the whole coding worker population. However, this sample may be skewed towards certain coding workers, since the data was obtained only through the Kaggle website. Given that the data set has a manageable number of tuples, the whole coding worker data set, rather than a smaller sample of it, will be used to analyze coding workers.
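The three sampling schemes and the confidence interval described in the sampling section can be sketched as follows. This is a minimal illustration, not the project's own code: the population is synthetic, the function names are hypothetical, and the interval uses the normal approximation (mean ± z × standard error). The "sd" printed next to each mean in the project's results appears to be this standard error, since mean ± z × sd reproduces the printed intervals (for example, 35.7 ± 1.28 × 0.21 gives roughly 35.43 to 35.97).

```python
import math
import random
import statistics
from collections import defaultdict


def srswor(rows, n, rng):
    """Simple random sample without replacement."""
    return rng.sample(rows, n)


def systematic(rows, n, rng):
    """Every k-th row, starting at a random offset inside the first interval."""
    k = len(rows) // n
    start = rng.randrange(k)
    return [rows[start + i * k] for i in range(n)]


def stratified(rows, n, key, rng):
    """Sample the same fraction from every stratum defined by key(row)."""
    strata = defaultdict(list)
    for row in rows:
        strata[key(row)].append(row)
    fraction = n / len(rows)
    picked = []
    for members in strata.values():
        picked.extend(rng.sample(members, round(fraction * len(members))))
    return picked


def mean_ci(values, conf):
    """Normal-approximation confidence interval for the mean."""
    mean = statistics.fmean(values)
    se = statistics.stdev(values) / math.sqrt(len(values))
    z = statistics.NormalDist().inv_cdf(1 - (1 - conf) / 2)
    return mean, mean - z * se, mean + z * se


# Synthetic population of (job title, gender, age) rows; an assumption.
rng = random.Random(1)
jobs = ["Data Scientist", "Data Analyst", "Engineer"]
genders = ["Female", "Male"]
population = [
    (rng.choice(jobs), rng.choice(genders), 18 + rng.gammavariate(2.0, 9.0))
    for _ in range(2000)
]

n = 200
samples = {
    "SRSWOR": srswor(population, n, rng),
    "systematic sampling": systematic(population, n, rng),
    "stratified sampling": stratified(population, n, lambda r: (r[0], r[1]), rng),
}
for name, rows in samples.items():
    ages = [row[2] for row in rows]
    for conf in (0.80, 0.90):
        mean, low, high = mean_ci(ages, conf)
        print(f"{name}: {conf:.0%} CI = {low:.2f} - {high:.2f} (mean {mean:.2f})")
```

Because each stratum's share is rounded to a whole number of rows, the stratified sample can differ from the requested size by a few rows. The 90% interval is always wider than the 80% interval for the same sample, which matches the pattern in the reported results.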
## All Coding Worker: mean = 35.7 and sd = 0.21
## 80% Conf Level (alpha = 0.20), CI = 35.43 - 35.97
## 90% Conf Level (alpha = 0.10), CI = 35.35 - 36.05
## SRSWOR: mean = 35.38 and sd = 0.36
## 80% Conf Level (alpha = 0.20), CI = 34.92 - 35.84
## 90% Conf Level (alpha = 0.10), CI = 34.79 - 35.97
## systematic sampling: mean = 35.5 and sd = 0.37
## 80% Conf Level (alpha = 0.20), CI = 35.03 - 35.97
## 90% Conf Level (alpha = 0.10), CI = 34.89 - 36.11
## stratified sampling: mean = 35.37 and sd = 0.34
## 80% Conf Level (alpha = 0.20), CI = 34.93 - 35.81
## 90% Conf Level (alpha = 0.10), CI = 34.81 - 35.93

[Figure: age density histograms for All Coding Worker, SRSWOR, Systematic Sampling, and Stratified Sampling by Job Title and Gender.]

The background information of the coding workers is presented to give a general idea of what kinds of people took this survey. Of particular interest are the following questions:

1. Select the option that’s most similar to your current job/professional title.
2. Which level of formal education have you attained?
3. What programming language would you recommend a new data scientist learn first?

Select the option that’s most similar to your current job/professional title.

There were many job titles under the umbrella of ‘coding worker’. It is wise to know the distribution of the different professions in order to check whether one group dominates the survey. If one profession dominates, then the answers to all the subsequent questions would be biased towards the mentality of that group.

Level of Formal Education of All Coding Workers

Many people apply for jobs every day. It would be ideal to interview every applicant to increase the chances of finding the best one; however, this approach is not feasible. Scanning resumes for keywords that pertain to the job at hand is a good way of sifting out those who may not have the technical expertise for the position.
After passing the first step of screening, what distinguishes one applicant from another? Does having more degrees further boost your chances of getting a job? Although examining the educational background of these professionals can only show a correlation between degrees and certain jobs, it is still interesting to see what their educational backgrounds are.

[Figure: CurrentJobTitleSelect pie chart. Data Scientist 27.6%; Software Developer/Software Engineer 11.5%; Data Analyst 10.7%; Other 9.97%; Scientist/Researcher 9.5%; Researcher 5.98%; Business Analyst 4.66%; Engineer 4.51%; Machine Learning Engineer 4.44%; Statistician 2.86%; Computer Scientist 2.31%; Predictive Modeler 1.91%; DBA/Database Engineer 1.43%; Programmer 1.28%; Operations Research Practitioner 0.88%; Data Miner 0.44%.]

[Figure: FormalEducation pie chart. Master's degree 44.9%; Bachelor's degree 24.9%; Doctoral degree 24.3%; Some college/university study without earning a bachelor's degree 3.83%; Professional degree 1.29%; I did not complete any formal education past high school 0.479%; I prefer not to answer 0.295%.]

Formal Education and Specified Profession

The above pie chart shows the formal education of all the coding workers. However, it is more informative to show the formal education distribution of each profession. Conversely, it is also interesting to examine the professions of individuals with a given formal education.

[Figure: bar chart over CurrentJobTitleSelect, broken down by FormalEducation level.]

[Figure: bar chart over FormalEducation, broken down by CurrentJobTitleSelect.]

A Little Peek at Data Scientist’s Job Description

Data scientist was deemed the ‘sexiest job of the 21st century’ in the Harvard Business Review in 2012. Data science is a fairly new and evolving field, and this survey provides some information about the new trends and ‘need to knows’ in it. Some questions of interest include:

1. What language do they recommend new data scientists to learn?
2. Are there any learning platforms they found useful for their career?
3. What sort of tools and algorithms do they use in their job?
4. What are some important skills that a Data Scientist should have?

What Language do Current Data Scientists Recommend New Data Scientists to Learn?

Results: Python > R

Interestingly, the majority of data scientists suggest learning Python over R. Several reasons could explain this large gap between Python and R. First, Python may truly be the predominant language to know in the data science field. Alternatively, there may be a sampling bias.
This survey was run by Kaggle, whose site may be visited largely by data scientists who use Python as their main or only language. As a result, the results may be skewed towards individuals who use Python.

[Figure: LanguageRecommendationSelect pie chart. Python 68.4%; R 23.6%; SQL 5%; SAS 1.03%; C/C++/C# 0.69%; Other 0.345%; Scala 0.345%; Java 0.172%; Julia 0.172%; Matlab 0.172%.]

What’s the Best Way of Obtaining “Data Science” Skills?

[Figure: bar chart of learning platform usefulness (Not Useful, Somewhat useful, Very useful) for Arxiv, Blogs, College, Communities, Company, Conferences, Courses, Documentation, Friends, Kaggle, Newsletters, Podcasts, Projects, SO, Textbook, TradeBook, Tutoring, and YouTube.]

Is doing Projects the way to go?

What Sort of Tools and Algorithms do Data Scientists Use in their Job?

A very simplified description of a data science role includes the collection, analysis, and interpretation of copious amounts of data. Data scientists use many different tools and algorithms to analyze and interpret the data they gather. The charts below show how frequently some common algorithms and tools are used by data scientists.

Algorithms

[Figure: bar chart of how often each algorithm is used (Most of the time, Often, Rarely, Sometimes) for AssociationRules, Bayesian, CNNs, CollaborativeFiltering, DataVisualization, DecisionTrees, EnsembleMethods, EvolutionaryApproaches, GANs, GBM, HMMs, KNN, LiftAnalysis, LogisticRegression, MLN, NaiveBayes, NeuralNetworks, NLP, PCA, PrescriptiveModeling, RandomForests, RecommenderSystems, RNNs, Segmentation, and Select1.]

Tools

What are Some Important Skills that a Data Scientist Should Have?

Observations

100 percent of the data scientists who answered these questions said that ‘BigData’, SQL, and visualization are things a data scientist must know. However, this seems to contradict what we have seen so far. Showing percentage data alone can be very deceiving. The question is: how many people answered these questions?
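One way to guard against this is to report the number of respondents next to every percentage. The sketch below is an illustration only: the column values are hypothetical, and `summarize` is a made-up helper that treats `None` and empty strings as skipped questions.

```python
from collections import Counter


def summarize(responses):
    """Return (number answered, share of each answer among those who answered)."""
    answered = [r for r in responses if r not in (None, "")]
    counts = Counter(answered)
    n = len(answered)
    shares = {answer: count / n for answer, count in counts.items()} if n else {}
    return n, shares


# Hypothetical column: most respondents skipped the question.
column = ["Necessary", None, "", "Necessary", None, None, "Necessary", ""]
n, shares = summarize(column)
print(f"answered: {n} of {len(column)}")
for answer, share in shares.items():
    print(f"{answer}: {share:.0%} (n = {n})")
```

With the count printed alongside it, a 100 percent share is immediately recognizable as resting on only a handful of answers.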
[Figure: Tools. Bar chart of how often each tool is used (Most of the time, Often, Rarely, Sometimes) for AmazonML, Angoss, AWS, Azure, C, Cloudera, DataRobot, Excel, Flume, GCP, Hadoop, IBMCognos, IBMSPSSModeler, IBMSPSSStatistics, IBMWatson, Impala, Java, Julia, Jupyter, KNIMECommercial, KNIMEFree, Mathematica, MATLAB, MicrosoftRServer, MicrosoftSQL, Minitab, NoSQL, Oracle, Orange, Perl, Python, Qlik, R, RapidMinerCommercial, RapidMinerFree, Salfrod, SAPBusinessObjects, SASBase, SASEnterprise, SASJMP, Select1, Select2, Spark, SQL, Stan, Statistica, Tableau, TensorFlow, TIBCO, and Unix.]

[Figure: share of job skill ratings (Necessary, Nice to have, Unnecessary) for BigData, Degree, EnterpriseTools, KaggleRanking, MOOC, Python, R, SQL, Stats, and Visualizations.]

Below is a histogram showing the number of people who actually answered the job skill questions.

[Figure: number of respondents (0 to 3) giving each rating (Necessary, Nice to have, Unnecessary) for BigData, Degree, EnterpriseTools, KaggleRanking, MOOC, Python, R, SQL, Stats, and Visualizations.]

As the histogram above shows, fewer than 5 people out of over 700 data scientists answered these questions. It is not surprising that many people did not answer all the questions, especially given that there were over 200 questions in the question bank. The n for this data set is too low to draw any real conclusions.

Conclusion

Below is a word cloud depicting the importance of a word based on how frequently it was selected as used “most of the time” or rated “very useful”. As you can see, words like python, data visualization, R, SQL, and projects are emphasized in this word cloud.