Why Is Being a Data Scientist More Painful Than Stepping on a Lego?
Every time you step on a Lego piece barefoot, it hurts like hell, and you wish you had been wearing at least a pair of sandals.
What if I told you that being barefoot when stepping on a Lego is the same as being a Data Scientist nowadays?
Let me start with this quote:
Big data is like teenage sex: everyone talks about it, nobody really knows how to do it, everyone thinks everyone else is doing it, so everyone claims they are doing it — Dan Ariely
Ariely is talking about big data rather than Data Science, but the quote fits just as well!
Everyone is talking about Data Science. Everyone is doing their best to publish tutorials. Everyone is adding more tips on how to become a Data Scientist faster. Everyone turns into some kind of guru when talking about it.
(In fact, I wrote a personal roadmap to becoming a data scientist, hahaha!)
What exactly is happening? What is really going on? And why is it painful?
After doing a lot of research on Data Science and working in the betting market sector a few years ago, I believe Data Science is in some way in a 'bubble.'
Explain the 'Bubble' please
What I do know, sadly, is that most tutorials, articles, podcasts, and everything else seem to focus A LOT on Machine Learning.
In short, they teach you to build fancy deep learning solutions and extravagant algorithms applied to a small dataset where the data is already clean.
Focus on the last thing I said:
A small dataset where the data is already clean
This is why I think the bubble is present nowadays.
Isn't it true that Data Scientists spend the majority of their time cleansing data?
Well... if a Data Scientist spends more than 70% of their time on data cleansing, why does everyone focus on models?
Because cleaning data is not as appealing as creating models.
Because it's painful.
What people are really attracted to is building algorithms that solve complex problems. Unfortunately, data preprocessing is "boring" and meticulous. Indeed, there's a lot of work behind a well-formatted dataset.
Data preprocessing is different every time; you always have new problems to solve because each dataset you work with is unique and depends on where it came from.
As a result, prospective data scientists appear to be exclusively concerned with modeling, and this is a problem because:
- SQL is an underrated skill, even though it's the number one job requirement.
- They don't spend time learning how to manipulate data.
- They have no understanding of data sourcing.
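To give an idea of what manipulating data with SQL looks like in practice, here is a minimal sketch using Python's built-in sqlite3 module. The bets table, its columns, and the sample rows are all made up for illustration; real data will have its own problems.

```python
import sqlite3

# Hypothetical raw data: stray spaces, inconsistent casing,
# a missing amount, and one duplicated row.
raw_rows = [
    (1, " Alice ", "football", 10.0),
    (2, "bob", "Football", None),
    (3, "BOB", "tennis", 5.5),
    (3, "BOB", "tennis", 5.5),
]

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE bets (id INTEGER, player TEXT, sport TEXT, amount REAL)"
)
conn.executemany("INSERT INTO bets VALUES (?, ?, ?, ?)", raw_rows)

# Normalize text, drop rows without an amount, and remove exact duplicates.
clean_rows = conn.execute(
    """
    SELECT DISTINCT id,
           LOWER(TRIM(player)) AS player,
           LOWER(sport) AS sport,
           amount
    FROM bets
    WHERE amount IS NOT NULL
    ORDER BY id
    """
).fetchall()
conn.close()

for row in clean_rows:
    print(row)
```

Nothing here is fancy, and that is the point: a few lines of plain SQL already handle trimming, normalizing, filtering, and deduplicating.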
Suppose you want to become a Data Scientist. In that case, you probably know that you need knowledge of programming, math, visualization, machine learning, etc. Nevertheless, for some reason, time spent on data wrangling is not seen as being as important as modeling.
Data is the foundation of data science. So the data you feed into your Machine Learning model is just as important as the model itself, right?
A data scientist will be unable to create a helpful product without sufficient data.
Ok, now that I know that Data Wrangling is essential, what's next?
We should focus on the future. Data Scientists are somewhat expensive, and big companies are doing their best to keep these costs as low as possible.
Are you aware that Automated Machine Learning (AutoML) has arrived?
The Data Science workflow is becoming increasingly automated. It's strange, but one automation technique is being supplanted by another.
Many cloud providers and tools can already do part of what data scientists do, namely model selection across more than 20 machine learning libraries.
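To make "automated model selection" less abstract, here is a toy sketch of the idea in plain Python: fit several candidate models, score each one on held-out data, and keep the best. The two candidates, the function names, and the data are assumptions chosen for illustration; real AutoML tools search far larger spaces and are not implemented this way.

```python
def fit_mean(xs, ys):
    """Baseline model: always predict the mean of the training targets."""
    mean_y = sum(ys) / len(ys)
    return lambda x: mean_y


def fit_line(xs, ys):
    """Simple least-squares line y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    var_x = sum((x - mean_x) ** 2 for x in xs)
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = cov_xy / var_x
    intercept = mean_y - slope * mean_x
    return lambda x: slope * x + intercept


def mse(model, xs, ys):
    """Mean squared error of a fitted model on the given data."""
    return sum((model(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


def select_model(candidates, train, valid):
    """Fit every candidate on train, score on valid, return the best name."""
    scores = {}
    for name, fit in candidates.items():
        model = fit(*train)
        scores[name] = mse(model, *valid)
    best = min(scores, key=scores.get)
    return best, scores


# Made-up data that roughly follows y = 2x + 1.
train = ([0, 1, 2, 3, 4, 5], [1.1, 2.9, 5.2, 6.8, 9.1, 11.0])
valid = ([6, 7], [13.1, 14.8])

candidates = {"mean_baseline": fit_mean, "linear_regression": fit_line}
best_name, scores = select_model(candidates, train, valid)
print(best_name, scores)
```

The loop in select_model is the part that gets automated. Notice that it assumes the data is already clean, which is exactly the part no tool hands you for free.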
What I'm trying to say is that companies are automating the workflow while most aspiring data scientists are still putting a lot of emphasis on understanding ML and Deep Learning algorithms.
Tools will become more efficient in the future, delivering more precise results in less time.
Also, in the future, Data Science professions will be organized better.
What I mean is, Data Scientists cover a lot of ground today, but the roles will probably be defined more clearly.
Some people could do Natural Language Processing, others could handle deployment, others statistics, etc.
Have you ever found it challenging to keep up with all of the latest Data Science trends and developments?
If you ask me, it feels like a never-ending battle.
On the other hand, focusing on one thing should always win in the long run, not just in data science but in all parts of your life!
Just let me ask you something... If you had a disease, let's say COVID-19, would you like to be treated by a doctor who deals with many problems, or would you prefer a doctor who specializes in COVID-19?
Should you become an expert in one skill or become a jack of all trades?
In my opinion, the knowledge economy rewards expertise disproportionately, now more than ever.
If I want to be an expert at something related to Data Science, what should I do?
Of course, you can aim to become the famous "Unicorn Data Scientist": that rare person whose abilities in machine learning, statistics, and analytics are all superb.
You could also be the person who builds a machine learning model and deploys it correctly. Building the model should not be the last step of a data science workflow, right?
Alternatively, you might be the person who bridges the gap between technology and business.
I'm referring to someone who has a firm grip on a specific subject.
But I want to talk further about the hero without a cape in Data Science: the Data Engineer.
This is the job that most aspiring data scientists reject because it's the boring part. Yet I know that most companies are asking for a data scientist, data analyst, or machine learning engineer even when they have no data pipeline or ETL process in place.
Data is the foundation of everything in Data Science. In reality, the data supplied into your machine learning model is just as important as the model itself.
As a result, a data engineer should do data engineering work. A data scientist, on the other hand, should be able to do both data science and data engineering tasks.
At the end of the day, if you are an aspiring data scientist, you shouldn't be shocked if you end up doing data engineering work like ETL operations. This is because companies are still figuring out what to do with data scientists and what to expect from them.
To put it another way, if you expect to just run machine learning algorithms with ready-to-use data, you will be disappointed.
As I mentioned at the beginning of this article, you will spend more than 70% of your time on data cleaning, processing, and wrangling.
The foundation of excellent data science is data quality. Data scientists must first verify that the data is clean, relevant, and complete before creating models.
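As a rough idea of what that verification can look like before any modeling starts, here is a small sketch in plain Python that counts missing values and duplicated records. The quality_report function, the field names, and the sample rows are hypothetical, and the check only covers completeness and duplicates; whether the data is relevant is still a judgment call.

```python
def quality_report(rows, required_fields):
    """Count missing values per field and fully duplicated records."""
    missing = {field: 0 for field in required_fields}
    seen = set()
    duplicates = 0
    for row in rows:
        for field in required_fields:
            # Treat None and empty strings as missing values.
            if row.get(field) in (None, ""):
                missing[field] += 1
        # A record is a duplicate if all required fields match an earlier one.
        key = tuple(row.get(field) for field in required_fields)
        if key in seen:
            duplicates += 1
        else:
            seen.add(key)
    return {"rows": len(rows), "missing": missing, "duplicates": duplicates}


# Made-up records with one missing age, one empty country, one duplicate.
rows = [
    {"id": 1, "age": 34, "country": "ES"},
    {"id": 2, "age": None, "country": "FR"},
    {"id": 3, "age": 51, "country": ""},
    {"id": 1, "age": 34, "country": "ES"},
]

report = quality_report(rows, ["id", "age", "country"])
print(report)
```

A report like this takes minutes to write and can save days of debugging a model that was trained on broken data.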
Because of data engineers and other technologies/trends, data scientists' responsibilities are evolving.
The data engineering profession has evolved beyond loading and storing data to calculating and extracting data using the right tools and technologies.
In the long run, you will be doing Data Engineer work. That's the most painful part of being a data scientist, and I hope you wouldn't rather step on a Lego.