Data Science: A High-Level Overview

In an era when huge amounts of information from many fields are gathered and stored, analyzing that information and extracting value from it has become one of the most attractive tasks for companies and for society in general. This overview covers the following topics:

  • What is Data Science in simple words?
  • A brief history of Data Science.
  • Challenges to practising Data Science.
  • Who is a Data Scientist?
  • Data Science tools for Data Scientists.
  • Where are we going? Perspectives.

What is Data Science in simple words?

The term “Data Science” has emerged only recently to specifically designate a new profession that is expected to make sense of the vast stores of big data. But making sense of data has a long history and has been discussed by scientists, statisticians, librarians, computer scientists and others for years.

Data Science as a business field is hard to pin down today. Because of its remarkable popularity, there are numerous descriptions of data science, for example:

Data Science is concerned with analyzing data and extracting useful knowledge from it. Building predictive models is usually the most important activity for a Data Scientist. (Gregory Piatetsky, KDnuggets, https://www.kdnuggets.com/tag/data-science)

Data Science is concerned with analyzing Big Data to extract correlations with estimates of likelihood and error. (Brodie, 2015a)

Data science is an emerging discipline that draws upon knowledge in statistical methodology and computer science to create impactful predictions and insights for a wide range of traditional scholarly fields (Harvard Data Science Initiative https://datascience.harvard.edu)

However, in simple words, data scientists just try to get insights from massive amounts of data that can help companies to make smarter business decisions. We also define Data Science as a methodology by which actionable insights can be inferred from data.

Data science uses a wide array of data-oriented technologies, including SQL, Python, R, and Hadoop. It also makes extensive use of statistical analysis, data visualization, distributed architecture, and more to extract meaning from sets of data. The information extracted through data science applications is used to guide business processes and reach organizational goals.

To complete this section, we will also provide a simple definition of the concepts of data mining, artificial intelligence, machine learning and deep learning, as these are related to data science and to each other.

  • Data mining aims to understand and discover new, previously unseen knowledge in the data.
  • Artificial intelligence (AI) is concerned with making machines smart, with the aim of creating a system that behaves like a human.
  • Machine learning is a subset of Artificial Intelligence. Machine learning aims to develop algorithms that can learn from historical data and improve the system with experience.
  • Deep learning is a subset of machine learning in which data is passed through multiple non-linear transformations to calculate an output.
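To make the machine learning definition above more concrete, here is a minimal sketch of an algorithm that "learns from historical data". It fits a straight line by gradient descent using only the Python standard library. The data set, the function names (`mean_squared_error`, `fit`) and the settings (`steps`, `learning_rate`) are assumptions made up for this illustration; real projects would normally rely on a library such as Scikit-learn instead.

```python
# Hypothetical "historical data": (hours studied, exam score).
# The numbers are made up and follow score = 2 * hours + 50 exactly.
history = [(1.0, 52.0), (2.0, 54.0), (3.0, 56.0), (4.0, 58.0), (5.0, 60.0)]


def mean_squared_error(w, b, data):
    """Average squared gap between the predictions w * x + b and the targets."""
    return sum((w * x + b - y) ** 2 for x, y in data) / len(data)


def fit(data, steps=5000, learning_rate=0.01):
    """Learn the slope w and intercept b by repeated small corrections."""
    w, b = 0.0, 0.0  # start with no knowledge at all
    n = len(data)
    for _ in range(steps):
        # Gradients of the mean squared error with respect to w and b.
        grad_w = sum(2 * (w * x + b - y) * x for x, y in data) / n
        grad_b = sum(2 * (w * x + b - y) for x, y in data) / n
        # Move both parameters a little in the direction that lowers the error.
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b
    return w, b


error_before = mean_squared_error(0.0, 0.0, history)
w, b = fit(history)
error_after = mean_squared_error(w, b, history)

print("error before training:", error_before)
print("error after training:", error_after)
print("prediction for 6 hours:", w * 6.0 + b)
```

Each pass over the data nudges the two parameters toward values that explain the past observations better, which is the sense in which the system "improves with experience".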

 

Fig. 1 Relationship between Artificial Intelligence, Machine Learning, Deep Learning and Data Science

Data science makes use of data mining, machine learning and artificial intelligence techniques.

For example, deep learning workloads often need more computing power than a default Jupyter environment provides. Platforms like Saturn Cloud help users manage the Jupyter development environment: by adjusting the environment's resources, the user can add CPU, GPU and memory capacity only when it is needed. A platform designed for cloud computing therefore keeps the cost of the environment low, because the Data Scientist pays only for the resources actually used.

A brief history of Data Science

Data Science has revolutionized several different aspects of our world. Let's take a look at when and where data science came from.

  • In 1962, the American mathematician John W. Tukey published “The Future of Data Analysis”, which is widely recognized as the first milestone in the history of data science. Tukey's influence on statistics is enormous, but the most famous coinage attributed to him relates to computer science: he was the first to introduce the term "bit" as a contraction of "binary digit".
  • In 1974, Peter Naur published the Concise Survey of Computer Methods, which surveyed data processing methods across a wide variety of applications. In it, the term “data science” became clearer, and Naur gave his own definition: “The science of dealing with data, once they have been established, while the relation of the data to what they represent is delegated to other fields and sciences.”
  • In 1977, the International Association for Statistical Computing (IASC) was founded.
  • In 1989, Gregory Piatetsky-Shapiro organized and chaired the first Knowledge Discovery in Databases (KDD) workshop.
  • In 1994, BusinessWeek published a cover story on “Database Marketing”.
  • In 1996, the term “data science” appeared for the first time in the title of a conference, that of the International Federation of Classification Societies (IFCS): “Data science, classification, and related methods”. In the same year, Usama Fayyad, Gregory Piatetsky-Shapiro and Padhraic Smyth published “From Data Mining to Knowledge Discovery in Databases”.
  • In 1997, during his inaugural lecture as the H. C. Carver Chair in Statistics at the University of Michigan, Jeff Wu called for statistics to be renamed “data science” and statisticians to be renamed “data scientists”.

Fig. 2 History of Data Science

Since the beginning of the 21st century, data stockpiles have expanded exponentially, largely thanks to advances in processing and storage that are both efficient and cost-effective at scale. The capability to collect, process, analyze and display data and information in “real time” gives us an unprecedented opportunity to conduct a new form of knowledge discovery. To process this huge amount of data, Data Scientists need high-performance computing as well as a large portfolio of technologies that can complete tasks and data processing in a matter of seconds.

Disruptive technologies like artificial intelligence, machine learning and deep learning are now available to Data Scientists through powerful platforms like Saturn Cloud.

Challenges to practising Data Science

While the adoption of analytics has increased, it comes with its own set of challenges. A 2017 Kaggle study of a sample of 16,000 data professionals identified the 10 most difficult challenges they face in their profession:

  1. Dirty data (36% reported)
  2. Lack of data science talent (30%)
  3. Company politics (27%)
  4. Lack of clear question (22%)
  5. Data inaccessible (22%)
  6. Results not used by decision-makers (18%)
  7. Explaining data science to others (16%)
  8. Privacy issues (14%)
  9. Lack of domain expertise (14%)
  10. Organization is small and cannot afford a data science team (13%)

These are serious challenges. However, every step forward in a new discipline brings new challenges that need to be addressed. We must embrace transformative changes and trust that they help us improve continuously by acquiring new skills, expanding our knowledge, and exploring new approaches.

Who is a Data Scientist?

As pointed out above, with constantly growing operating data and emerging new technologies, we increasingly need professionals with analytical acumen who can extract valuable information and insights from massive amounts of data and make precise decisions. We call these experts “Data Science teams” or simply “Data Scientists”.

A Data Scientist is an analytical data expert who has mastered the technical skills needed to solve complex problems in the modern world. Today's emerging technologies such as AI, IoT, 5G, robotics and blockchain rely heavily on data, and only those who can work with data and translate it into profitable products will lead the digital business of the near future.

Data Scientists therefore play an essential role in the business development strategy of every company and organization. As Thomas H. Davenport and D.J. Patil put it, Data Scientist is the sexiest job of the 21st century.

Data Science tools for Data Scientists

An extensive collection of software tools is available to help the Data Scientist dive into the world of Data Science. Saturn Cloud enables data scientists to work at scale using the tools they know best: Python, Jupyter, and Dask. It provides a secure and scalable infrastructure for running data science and machine learning workloads within an AWS environment. Data teams can develop and deploy data science models in Python at scale with automated DevOps and ML infrastructure engineering.

Saturn Cloud supports a lot of useful Python libraries.

“Python libraries are collections of functions and methods that allow Data Scientists to perform many actions without writing the code themselves.”

  • NumPy is a library for the Python programming language, adding support for large, multi-dimensional arrays and matrices.
  • Seaborn is a Python data visualization library based on matplotlib.
  • TensorFlow is a free and open-source software library for dataflow and differentiable programming across a range of tasks.
  • PyTorch is an open-source machine learning library based on the Torch library.
  • Numba is an open-source JIT compiler that translates a subset of Python and NumPy into fast machine code using LLVM.
  • SciPy is a free and open-source Python library used for scientific computing and technical computing.
  • Pandas is a software library written for the Python programming language for data manipulation and analysis.
  • Scikit-learn is a free software machine learning library for the Python programming language.
  • Matplotlib is a plotting library for the Python programming language and its numerical mathematics extension NumPy.
  • Bokeh is a data visualization library in Python that provides high-performance interactive charts and plots.

Fig. 3 Python Libraries
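The idea that a library saves you from writing code yourself can be shown with a small sketch. To keep it runnable without installing anything, it uses Python's built-in `statistics` module as a stand-in for the libraries listed above; that substitution, the sample values and the helper name `sample_stdev_by_hand` are assumptions made for this illustration. NumPy offers the same calculation at larger scale through `numpy.std` with `ddof=1`.

```python
import math
import statistics

# Hypothetical measurements, made up for this sketch.
values = [12.0, 15.0, 11.0, 14.0, 18.0]


def sample_stdev_by_hand(xs):
    """Sample standard deviation written out step by step."""
    mean = sum(xs) / len(xs)
    squared_deviations = sum((x - mean) ** 2 for x in xs)
    # Divide by n - 1 because this is a sample, not a whole population.
    return math.sqrt(squared_deviations / (len(xs) - 1))


by_hand = sample_stdev_by_hand(values)

# The same result in one call, with the details handled by the library.
from_library = statistics.stdev(values)

print("by hand:     ", by_hand)
print("from library:", from_library)
```

Both approaches give the same number. The library call is shorter, already tested, and harder to get subtly wrong, which is the main reason data scientists lean on libraries for routine work.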

Saturn Cloud, therefore, offers an end-to-end analytics platform, all in Python on AWS. This includes: 

  • Dask, which allows organizations to scale out Python and dramatically reduce runtime.
  • A suite of collaboration tools, model deployment capabilities, and tools for the machine learning lifecycle.
  • Prefect, which provides a workflow orchestration framework that eliminates manual effort on the part of developers and data scientists.
  • Integration with services like Docker and Kubernetes so that data scientists can build a custom image to meet their best development expectations.
  • Jupyter Notebooks to deploy, manage, and scale the PyData stack.

Where are we going? Perspectives.

As John Tukey predicted: “the future of data analysis can involve great progress, the overcoming of real difficulties and the provision of great service to all fields of science and technology”. In recent years, we have witnessed many data-driven technological innovations: lightning-fast 5G Internet, machine learning, cloud computing and blockchain, and the list is far from exhaustive. The explosion of data, along with growing technological capabilities, is just the beginning. Our lives are becoming “smarter” through technological innovation, which may be integrated into all aspects of human life.
