Data science is a multidisciplinary field that requires knowledge of math, technology, and the business domain.
According to Wikipedia, Data science is an interdisciplinary field of scientific methods, processes, algorithms and systems to extract knowledge or insights from data in various forms.
In simple terms, it is the area of study that involves extracting knowledge from all the data you can gather. People who practice in this field are called data scientists.
It involves working with data in many steps: acquiring the data, cleaning it, analyzing it, drawing insights from the analysis, building models, validating them, selecting the best model and finally deploying that model.
The work doesn't end there, though.
You don't just need technical skills but also a lot of business sense. A good data scientist is full of questions, has the right approach to understanding the data and has the required business acumen.
However, trying to find a data scientist who can do all of these things and complete your data science project on their own is like looking for a unicorn. Everybody wants a unicorn, no doubt, but can they have one?
Based on the business requirements, the types of analysis needed are:
- Exploratory analysis is the process of analyzing the dataset to summarize it or get an overview of it. It is often done with visual methods using libraries like matplotlib and D3.js and applications like Tableau.
- Predictive analysis is the major branch of data science, where models are built from existing data to make predictions on future or unknown data.
- Prescriptive analysis is an extension of predictive analysis in the sense that it not only predicts what will happen but also suggests decision options to change the outcome.
- Interpretative phenomenological analysis (IPA) is a qualitative approach used in psychological research.
Commonly used tools:
- Visualization - For exploratory analysis, Tableau is a popular tool for creating interactive data visualizations. D3.js is an open-source library used to create visualizations inside web pages.
- Programming Languages - Python and R are the languages most used by data scientists. Python is useful for building an end-to-end product, since it can also be used to create websites. R is often preferred for research purposes.
- Big Data - For dealing with large amounts of data, open-source big data tools like Spark, Hive and Hadoop are useful.
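To make the difference between exploratory, predictive and prescriptive analysis more concrete, here is a minimal sketch using only the Python standard library. The numbers are made-up toy data, and the names `summarize`, `fit_line`, `predict` and `recommend` are my own hypothetical helpers, not part of any library. A real project would more likely use pandas, scikit-learn or similar.

```python
from statistics import mean, median, stdev

# Made-up toy data for illustration only: monthly ad spend and sales.
ad_spend = [10.0, 20.0, 30.0, 40.0, 50.0]
sales = [25.0, 45.0, 65.0, 85.0, 105.0]


def summarize(values):
    """Exploratory step: a quick numeric overview of one column."""
    return {
        "min": min(values),
        "max": max(values),
        "mean": mean(values),
        "median": median(values),
        "stdev": stdev(values),
    }


def fit_line(xs, ys):
    """Predictive step: least squares fit of y = slope * x + intercept."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return slope, intercept


def predict(model, x):
    """Apply a fitted (slope, intercept) pair to a new input."""
    slope, intercept = model
    return slope * x + intercept


def recommend(model, options, target):
    """Prescriptive step: smallest option whose predicted outcome meets the target."""
    viable = [x for x in options if predict(model, x) >= target]
    return min(viable) if viable else None


summary = summarize(sales)
model = fit_line(ad_spend, sales)
forecast = predict(model, 60.0)
suggestion = recommend(model, [40.0, 60.0, 80.0], target=120.0)

print(summary)
print(f"forecast for a spend of 60: {forecast:.1f}")
print(f"suggested spend to reach 120 in sales: {suggestion}")
```

The exploratory step only describes the data, the predictive step extrapolates from it, and the prescriptive step uses the prediction to pick an action.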
Data Science Lifecycle
- Business Requirement - The first step is to define the objective by discussing with customers or stakeholders to identify the business problems and define the target metric for the project.
- Collecting the data - The next step is to acquire the relevant data from direct sources like analytics, or from third-party sources if necessary. High-quality data is an important requirement of a data science project.
- Understanding the data - Before training a model, it is important to explore the data first. Most production data has missing values and errors, which should be handled using domain knowledge and the available algorithms. The data may also be normalized and transformed for better model training.
- Creating a model - Out of all the columns available in the dataset, choosing the relevant ones is an important task. This is called feature selection, and together with deriving new columns from existing ones it is part of feature engineering. It takes data exploration and domain expertise to decide which features to use for training the model. Depending on the problem statement of the project, there are different types of models to choose from. The models can be compared with each other using metrics like accuracy.
- Deploying the model - Decide whether the accuracy of the model is sufficient for use in production. If not, try training different models on the data and collect more data if necessary. Once the model is finalized, deploy it to the web so that users can get predictions on their own data. APIs can be used to serve predictions to other applications as well.
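The lifecycle steps above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the records are made-up toy data, the two models (a majority-class baseline and a nearest-centroid classifier) are just simple stand-ins I picked, and `REQUIRED_ACCURACY` is a hypothetical bar you would agree on with stakeholders. In practice you would likely reach for libraries such as pandas and scikit-learn.

```python
from collections import Counter
from statistics import mean

# Made-up toy records for illustration: (weekly_usage_hours, churned).
# None marks a missing value.
train_raw = [(1.0, 1), (7.0, 0), (None, 1), (8.0, 0), (2.0, 1), (9.0, 0)]
test_raw = [(3.0, 1), (5.0, 0), (2.5, 1), (8.5, 0)]


def fit_preprocessor(rows):
    """Learn the fill value and min-max range from the training rows only."""
    known = [x for x, _ in rows if x is not None]
    return {"fill": mean(known), "lo": min(known), "hi": max(known)}


def preprocess(rows, prep):
    """Fill missing values with the training mean, then scale by the training range."""
    span = prep["hi"] - prep["lo"]
    cleaned = []
    for x, label in rows:
        x = prep["fill"] if x is None else x
        cleaned.append(((x - prep["lo"]) / span, label))
    return cleaned


def fit_majority(rows):
    """Baseline model: always predict the most common training label."""
    majority = Counter(label for _, label in rows).most_common(1)[0][0]
    return lambda x: majority


def fit_centroid(rows):
    """Nearest-centroid model: predict the class whose mean feature is closest."""
    labels = {label for _, label in rows}
    centroids = {
        label: mean(x for x, row_label in rows if row_label == label)
        for label in labels
    }
    return lambda x: min(centroids, key=lambda label: abs(x - centroids[label]))


def accuracy(model, rows):
    """Fraction of rows where the model's prediction matches the label."""
    return sum(model(x) == label for x, label in rows) / len(rows)


prep = fit_preprocessor(train_raw)
train = preprocess(train_raw, prep)
test = preprocess(test_raw, prep)

models = {"majority": fit_majority(train), "centroid": fit_centroid(train)}
scores = {name: accuracy(model, test) for name, model in models.items()}
best = max(scores, key=scores.get)

REQUIRED_ACCURACY = 0.7  # hypothetical bar agreed with stakeholders
ready_to_deploy = scores[best] >= REQUIRED_ACCURACY

print(scores, best, ready_to_deploy)
```

Note that the fill value and the scaling range are learned from the training rows and then reused on the test rows, so the evaluation does not peek at the data it is scored on.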
It takes a vast range of roles to make a data science project happen.
- Data Scientist: This role involves mathematicians/statisticians who do predictive modelling, storytelling and visualization. They also clean and organize big data.
- Data Analyst: These people are data junkies. They understand data like no one else. They perform statistical data analysis and are database experts.
- Data Architect: They are the contemporary 'data modellers'. They provide data warehousing solutions and possess in-depth knowledge of database architecture.
- Data Engineer: The world often sees them as 'software engineers'. They are 'jacks of all trades' who develop, construct, test and maintain the architecture.
It is as much about scientific foundations as it is about practical methods. Data science, data analytics, big data (whatever you call it) can be applied to almost every field out there, be it agriculture, sports, the economy, finance or medicine, to name a few.
Statisticians/Mathematicians are the people behind writing and formulating the models.
Interest in Data Science has been increasing over time, as can also be seen in Google Trends.
Responsibilities of a Data Scientist:
- Domain understanding
- Data collection from multiple sources
- Data cleansing, preparation & processing
- Predictive modeling, machine learning
- Asking the right questions, running queries
- Applying mathematical & statistical analysis
- Visualization & Communicating the business results
Though it sounds simple, data science is quite a challenging area because of the complexity involved in combining and applying advanced programming techniques.
I hope my answer helps. You should start learning Data Science if you haven’t already.