What is data science and how does it work?
Data science, big data, machine learning: you have probably heard these big words, but how clear was their meaning to you? For some, they are attractive marketing lures. Some think that data science is magic that will make a machine do whatever it is told, for free. Others believe it is an easy way to make a lot of money. Let’s try to explain what it means in simple, understandable language.
I work in natural language processing, one of the applications of data science, and I often see people use these terms incorrectly, so I wanted to bring some clarity. This article is for those who have only a vague idea of what data science is and want to understand its concepts.
Let’s define the terminology
To begin with, no one really knows exactly what data science is, and there is no strict definition: it is a very broad and interdisciplinary concept. So here I will share my own view, which does not necessarily coincide with the opinions of others.
The first part: data
The first component of data science, without which nothing else is possible, is the data itself: how to collect, store and process it, and how to extract useful information from the overall mass of data. Cleaning the data and bringing it into a usable form can take specialists up to 80% of their working time.
An important part of this is how to handle data for which standard storage and processing methods are unsuitable because of its huge volume and/or variety: the so-called big data. By the way, do not be confused: big data and data science are not synonyms; rather, the first is a subfield of the second. At the same time, data analysts do not always have to work with big data in practice; small data can be useful too.
To make this concrete, here is a simple example.
Gather some data
Imagine that we want to know whether there is any relationship between how much coffee your colleagues drink per day and how much they slept the night before. We write down the information available to us: say your colleague Gregory slept 4 hours last night, so he had to drink 3 cups of coffee; Ellina slept 9 hours and drank no coffee at all; Polina slept a full 10 hours but still drank 2.5 cups; and so on.
Let us plot the data (visualization is also an important element of any data science project). Put the sleep time in hours on the X axis and the amount of coffee in milliliters on the Y axis. We get something like this:
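Written down as code, the observations might look like the minimal sketch below. The names `observations`, `CUP_ML` and `to_points` are invented for illustration, and the cup size of 200 ml is an assumption inferred from the fact that the article later equates Polina’s 2.5 cups with 500 ml.

```python
# Assumption: one cup holds 200 ml (2.5 cups correspond to 500 ml in the article).
CUP_ML = 200

# (name, hours slept, cups of coffee), exactly as listed in the text.
observations = [
    ("Gregory", 4, 3),
    ("Ellina", 9, 0),
    ("Polina", 10, 2.5),
]


def to_points(rows, cup_ml=CUP_ML):
    """Convert rows to (hours_slept, coffee_ml) pairs, the units used on the graph."""
    return [(hours, cups * cup_ml) for _, hours, cups in rows]


points = to_points(observations)
for (name, _, _), (hours, ml) in zip(observations, points):
    print(f"{name}: slept {hours} h, drank {ml:.0f} ml")
```

Keeping the raw cups and converting to milliliters in one place makes it easy to change the assumed cup size later.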
The second part: science
We have the data; what can we do with it now? That’s right: analyze it, extract useful patterns and put them to use. Disciplines such as statistics, machine learning and optimization will help us here.
Together they form the next, and possibly the most important, component of data science: data analysis. Machine learning lets you find patterns in existing data and then use them to predict the information you need for new objects.
Analyze the data
Let’s get back to our example. By eye, the two parameters seem to be related: the less a person slept, the more coffee they drink the next day. We also have an example that stands out from this trend: Polina, who loves both sleeping and coffee. Nevertheless, we can try to approximate the pattern with a single straight line that passes as close as possible to all the points:
The green line is our machine learning model: it summarizes the data and can be described mathematically. With its help we can now estimate values for new objects: when we want to predict how much coffee Nikita, who has just walked into the office, will drink today, we ask how much he slept. Given an answer of 7.5 hours, we substitute it into the model, which gives slightly less than 300 ml of coffee. The red dot marks our prediction.
This is how machine learning works. The idea is very simple: find a pattern and extend it to new data. Machine learning also covers another class of problems, in which the task is not to predict a value, as in our example, but to divide the data into groups. We will talk more about that another time.
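One common way to draw such a line is ordinary least squares; the article does not say which method produced its green line, so treat the following as a sketch of the general idea. It uses only the three colleagues named in the text, with coffee converted to milliliters at an assumed 200 ml per cup, so its numbers will not match the figure, which was drawn from more points. The function names `fit_line` and `predict` are invented for illustration.

```python
def fit_line(points):
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    # Spread of x, and how x and y vary together.
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = sxy / sxx  # requires at least two distinct x values
    intercept = mean_y - slope * mean_x
    return slope, intercept


def predict(model, hours):
    """Apply the fitted line to a new observation."""
    slope, intercept = model
    return slope * hours + intercept


# (hours slept, coffee in ml) for Gregory, Ellina and Polina; 200 ml per cup is assumed.
points = [(4, 600), (9, 0), (10, 500)]
model = fit_line(points)

# Nikita slept 7.5 hours; ask the model how much coffee to expect.
print(f"slope = {model[0]:.1f} ml per hour, intercept = {model[1]:.1f} ml")
print(f"prediction for 7.5 h of sleep: {predict(model, 7.5):.0f} ml")
```

With so few points, a single unusual colleague such as Polina shifts the line noticeably, which is one reason real projects collect far more data before trusting a fit.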
Apply the result
However, in my opinion, data science does not end with identifying patterns in data. Any data science project is applied research, where it is important not to forget about formulating a hypothesis, planning an experiment and, of course, evaluating the result and its suitability for the case at hand. The last point is very important in real business problems, when you need to understand whether the data science solution you found will actually benefit your project. What use would the model from our example be? Perhaps it could help us optimize coffee deliveries to the office. In that case, we would need to assess the risks and determine whether our model would do the job better than the existing solution: the office manager Mikhail, who is responsible for buying the coffee.
Of course, our example is as simple as possible. In reality, one could build a more complex model that takes other factors into account, for example whether a person likes coffee at all. Or the model could find a relationship more complex than a straight line.
We could also start by finding outliers in our data: objects that, like Polina, differ greatly from most of the others. In real work, such examples can harm the process of building a model and its quality, so it makes sense to treat them differently. And sometimes such objects are the main point of interest, for example when detecting abnormal banking transactions to prevent fraud.
In addition, Polina illustrates another important idea: machine learning algorithms are imperfect. Our model predicts only 100 ml of coffee for a person who slept 10 hours, while Polina actually drank as much as 500 ml. Customers of data science solutions never want to believe this, but it is still impossible to teach a machine to predict everything perfectly: no matter how well we capture the patterns in the data, there will always be unpredictable elements.
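One simple way to look for outliers, sketched below, is to fit a line and rank the points by how far they fall from it. This is only one heuristic among many and not necessarily what a real project would use. Only Gregory, Ellina and Polina come from the article (at an assumed 200 ml per cup); the four `colleague_*` rows are made up purely so that there is a trend to deviate from, and the function names are invented for illustration.

```python
# Three rows come from the article; the rest are hypothetical colleagues
# added only so that a trend exists for an outlier to stand out from.
data = {
    "Gregory": (4, 600),
    "Ellina": (9, 0),
    "Polina": (10, 500),
    "colleague_a": (5, 500),  # hypothetical
    "colleague_b": (6, 400),  # hypothetical
    "colleague_c": (7, 300),  # hypothetical
    "colleague_d": (8, 200),  # hypothetical
}


def fit_line(points):
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def rank_by_residual(data):
    """Return (name, residual) pairs, farthest from the fitted line first."""
    slope, intercept = fit_line(list(data.values()))
    residuals = {
        name: ml - (slope * hours + intercept)
        for name, (hours, ml) in data.items()
    }
    return sorted(residuals.items(), key=lambda item: abs(item[1]), reverse=True)


ranking = rank_by_residual(data)
for name, residual in ranking:
    print(f"{name}: {residual:+.0f} ml relative to the line")
```

On this toy data Polina ends up at the top of the ranking, well above the line. Note that the line is fitted on all points, including the outlier itself, so the outlier pulls the line toward it; more robust methods exist for exactly that reason.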
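The size of that miss can be worked out from the numbers quoted in the article. The sketch below reconstructs an approximate straight line from the two predictions mentioned in the text, treating "slightly less than 300 ml" at 7.5 hours as 300 ml, which is an assumption, and then compares the prediction for 10 hours with what Polina actually drank. The variable and function names are invented for illustration.

```python
# Two predictions quoted in the article: (hours slept, predicted coffee in ml).
# 300.0 is an approximation of "slightly less than 300 ml".
p1 = (7.5, 300.0)
p2 = (10.0, 100.0)

# A straight line through two points: slope first, then intercept.
slope = (p2[1] - p1[1]) / (p2[0] - p1[0])
intercept = p2[1] - slope * p2[0]


def predict(hours):
    """Approximate reconstruction of the article's line."""
    return slope * hours + intercept


polina_actual_ml = 500.0  # from the article
error = polina_actual_ml - predict(10.0)

print(f"slope = {slope:.0f} ml per hour, intercept = {intercept:.0f} ml")
print(f"Polina: predicted {predict(10.0):.0f} ml, "
      f"actual {polina_actual_ml:.0f} ml, off by {error:.0f} ml")
```

Under these assumptions the line loses about 80 ml of coffee per extra hour of sleep, and for Polina it is off by 400 ml, four times the predicted amount.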
Continue the story
So, data science is a set of methods for processing and analyzing data and applying them to practical tasks. Keep in mind that each specialist has their own view of this field, and opinions may differ.
Data science is based on fairly simple ideas, but in practice many subtleties arise. How data science surrounds us in everyday life, what methods of data analysis exist, who makes up a data science team, and what difficulties can arise during research: we will discuss all of this in the following articles.