
AI, machine learning, data science, deep learning, and all of those juicy words you keep hearing make you want to drop everything you're doing and get your hands on some of them, right? Wrong!
According to recent Gartner research, only 15% to 20% of data science projects get completed.
Of the projects that were completed, only about 8% generate value, according to their CEOs.
Put those numbers together (15-20% times 8% is roughly 1.2-1.6%) and data science projects end up with an astonishing success rate of less than 2%.
Having said that, your first project can be part of that 2% if it is kicked off with the right reasons, support, resources, and knowledge, all of which I'll cover below.
A data science project should solve a business problem or answer a difficult question
As a data scientist or analyst, your tasks and projects always start with an assumption, a problem the business has, or an unexplained behavior (in product, marketing, etc.) that is waiting to be solved. Data science projects shouldn't be any different.
How do we start?
Define the business use-case
This is the point where you have been part of a meeting or received a request to solve a problem with an analytical approach, and how to solve it may already have been decided for you.
What do you do?
You start asking questions, running meetings with the stakeholders, and writing a document for the entire project that opens with the reasons for it. Fill it out using these questions:
How will this project help the business?
Is this the first time the business has planned a project like this? If not, what happened?
What are we trying to solve?
Is it the main problem or an extension of it? (You will need to answer that with a root cause analysis.)
Is there existing data I can get for a sanity check? (Without it, we cannot say whether the project is feasible.)
How long has the business been collecting the data? (Say you want to predict Black Friday purchase behavior: you will need at least 3 years of data points to align each Black Friday with the previous and following ones.)
What kind of baseline comparison do we have today?
What is the main hypothesis, and how do you plan to get there?
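The data-history question above can be checked mechanically before any modeling starts. Here is a minimal sketch using only the Python standard library. The function names and the collection window are hypothetical, chosen for illustration, and the "at least 3" threshold simply mirrors the rule of thumb from the Black Friday example.

```python
from datetime import date, timedelta


def black_friday(year: int) -> date:
    """Return Black Friday: the day after the fourth Thursday of November."""
    nov_first = date(year, 11, 1)
    # weekday(): Monday=0 ... Thursday=3
    days_to_first_thursday = (3 - nov_first.weekday()) % 7
    fourth_thursday = nov_first + timedelta(days=days_to_first_thursday + 21)
    return fourth_thursday + timedelta(days=1)


def black_fridays_covered(first_record: date, last_record: date) -> list:
    """List the years whose Black Friday falls inside the collected date range."""
    return [
        year
        for year in range(first_record.year, last_record.year + 1)
        if first_record <= black_friday(year) <= last_record
    ]


# Hypothetical collection window, for illustration only.
covered = black_fridays_covered(date(2020, 3, 1), date(2023, 6, 30))
enough_history = len(covered) >= 3  # rule of thumb from the example above
print(covered, enough_history)
```

If the check fails, that is worth knowing before the first stakeholder meeting, not after.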
Now that you have enough information to get your brain working, you need to organize it and prepare for a pre-kickoff meeting with the stakeholders.
Pre-kickoff meeting
Now that you have done some fieldwork gathering information that will serve you in the near future, you need partnerships that will help you take this project to the finish line.
The stakeholders meeting should be treated as an elevator pitch, a fundraising pitch, or any other form of selling your idea (I know it sounds bad, but it is true): the stakeholders need to believe in what you are offering before they decide whether to fund your project.
Here are a few things you should do at this point:
Understand the company vision and check that the project aligns with it; it is important to emphasize those touch points
Don't start with the solution - spend some time thinking about the problem you aim to solve and look for its impact
Spend some time with the product and marketing teams to get another perspective on the problem and to see whether, once the problem is solved, you'll be able to productize the solution
Start working with sample data and create an MVP that you can showcase to the stakeholders in the kickoff meeting (just label a small amount of data and build something that expresses a potential solution; executives need to understand value, not graphs)
Quantify the potential solution - to assess whether to invest funds and resources in your project, executives need a sense of its value
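One simple way to quantify a potential solution is a back-of-the-envelope value estimate across a few scenarios. The sketch below is an assumption on my part, not a standard formula: the function name, the three scenarios, and every figure are hypothetical placeholders that you would replace with numbers from your own root cause analysis.

```python
def estimated_annual_value(
    revenue_at_stake: float,
    expected_improvement: float,
    annual_cost: float,
) -> float:
    """Net yearly value: the share of at-risk revenue recovered, minus running costs."""
    return revenue_at_stake * expected_improvement - annual_cost


# All figures below are hypothetical placeholders, not benchmarks.
scenarios = {
    "pessimistic": 0.02,
    "expected": 0.05,
    "optimistic": 0.10,
}
for name, improvement in scenarios.items():
    value = estimated_annual_value(
        revenue_at_stake=1_000_000,
        expected_improvement=improvement,
        annual_cost=40_000,
    )
    print(f"{name}: {value:,.0f}")
```

Showing a range instead of a single number makes it easier for executives to see both the upside and the case in which the project does not pay for itself.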
Pitching your project - Kickoff meeting
This is your moment to shine - the moment when all your efforts and preparations meet their purpose.
Handle that meeting in a structured way, as a story: your project's story. It should be engaging, appealing, and trustworthy.
Your main goal is to make clear from the very beginning what your project's objective is and what you aim to solve, the potential value, the resources and funds you will need to complete it, and the time frame.
Once you have covered that, make sure you go over these topics:
Detailed business problem - elaborated in detail (after the root cause analysis) and quantified in revenue or lost potential revenue
Main drivers (assumptions backed with data) of the business problem
Main hypothesis your project is going to tackle
How you plan to solve it - the solution (don't get into details, as it can change due to data issues, structure, etc.)
Projected outcome and how to productize your solution (maintenance, costs, resources, projected ROI, etc.)
Why are those stages crucial?
As I wrote at the beginning of this post, the majority of data science projects fail to reach the finish line, and for only a few reasons!
The main reason is a lack of executive support, ownership, and partnership, and the stages we went through are critical to securing that support.
The second is working without a proper definition and preliminary research to set the project scope, which causes many projects to expend more resources than they needed or planned.
Congratulations - you're good to go. What now?
Until now you did the business preparation that led you to this point, and the data itself wasn't the main focus. Now it's time to dig deep into the company data.
Try to hold several meetings with the company's subject matter experts to gain insights about the behavior, trends, and reasons behind the business results. It will help you in both the feature engineering and the data cleansing parts.
Define the problem (yes, again)
At this point you should have your root cause analysis ready so that you can break the business problem into several smaller ones (or stages). Then follow these steps:
Set the hypothesis and make it clear
Define what kind of problem you are about to solve - is it predicting a behavior, identifying a pattern, finding a group or subset of users, etc.?
Measure the rate of the phenomenon in the population - make sure you have enough data points to work with and set the baseline
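Measuring the rate of the phenomenon and setting a baseline can be as simple as the sketch below. It assumes a binary target, and the labels are made up for illustration (1 = user churned, 0 = user stayed). The "always predict the most common label" baseline is one common choice, not the only one.

```python
from collections import Counter


def base_rate(labels: list, positive: int = 1) -> float:
    """Share of the population in which the phenomenon occurs."""
    return sum(1 for y in labels if y == positive) / len(labels)


def majority_baseline_accuracy(labels: list) -> float:
    """Accuracy of always predicting the most common label."""
    most_common_count = Counter(labels).most_common(1)[0][1]
    return most_common_count / len(labels)


# Hypothetical labels: 1 = user churned, 0 = user stayed.
labels = [0] * 95 + [1] * 5
print(f"base rate: {base_rate(labels):.2%}")
print(f"baseline accuracy to beat: {majority_baseline_accuracy(labels):.2%}")
```

With a phenomenon this rare, a model that never predicts it already looks very accurate, so accuracy alone is a weak yardstick and the number of positive examples matters more than the total row count.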
Getting deeper - data!
Now that you know what you are after and have defined it well, it's time to build the dataset you will use to develop your solution.
Here is a high-level overview of the steps:
Data gathering and collecting
Get data that is operational
Start with the mandatory data and the transactional data
Add info layers about the users
Get as much behavioral data as you can
EDA (Exploratory Data Analysis)
Start exploring your soon-to-be features - get to know your data
Deal with missing values
Deal with outliers
Do a univariate analysis to describe your potential features
Do a bivariate analysis to explore relationships between pairs of features
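The EDA steps above can be sketched with the standard library alone. Median imputation and the 1.5 x IQR rule are assumptions I picked for the example, since the right treatment depends on your data, and the two columns are hypothetical. In practice you would probably reach for a dataframe library instead.

```python
import statistics


def impute_median(values: list) -> list:
    """Replace missing entries (None) with the median of the observed ones."""
    median = statistics.median(v for v in values if v is not None)
    return [median if v is None else v for v in values]


def iqr_outliers(values: list) -> list:
    """Return values further than 1.5 * IQR outside the quartiles."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    spread = q3 - q1
    low, high = q1 - 1.5 * spread, q3 + 1.5 * spread
    return [v for v in values if v < low or v > high]


def pearson(xs: list, ys: list) -> float:
    """Pearson correlation between two equally long numeric columns."""
    mean_x, mean_y = statistics.fmean(xs), statistics.fmean(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


# Hypothetical sample: order values with one gap and one extreme entry.
order_value = [20, 22, None, 25, 21, 23, 24, 500]
items_in_cart = [2, 2, 3, 3, 2, 3, 3, 40]

clean = impute_median(order_value)

# Univariate: describe a single feature.
print("mean:", statistics.fmean(clean), "median:", statistics.median(clean))
print("outliers:", iqr_outliers(clean))

# Bivariate: how two features move together.
print("correlation:", pearson(clean, items_in_cart))
```

Notice that the single extreme row pushes the correlation close to 1, which is one reason to look at outliers before reading too much into bivariate statistics.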
Transform your features
Numeric - change the scale of your values and/or adjust a skewed data distribution toward a Gaussian-like distribution
Categorical - turn your categorical variables into numeric variables (mandatory for most machine learning models, because they can handle only numeric values)
Create your final features
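Here is a minimal sketch of the two kinds of transformation mentioned above, again with the standard library only. The log transform, min-max scaling, and one-hot encoding are just one possible choice for each step, and the column names and values are hypothetical. A log transform reduces right skew but does not guarantee a Gaussian shape.

```python
import math


def log_transform(values: list) -> list:
    """Compress a right-skewed, non-negative feature with log(1 + x)."""
    return [math.log1p(v) for v in values]


def min_max_scale(values: list) -> list:
    """Rescale values to the [0, 1] range (assumes at least two distinct values)."""
    low, high = min(values), max(values)
    return [(v - low) / (high - low) for v in values]


def one_hot(categories: list) -> tuple:
    """Turn a categorical column into one 0/1 column per distinct category."""
    levels = sorted(set(categories))
    rows = [[1 if c == level else 0 for level in levels] for c in categories]
    return rows, levels


# Hypothetical columns, for illustration only.
order_value = [0, 9, 99, 999]
channel = ["email", "ads", "email", "organic"]

scaled = min_max_scale(log_transform(order_value))
encoded, levels = one_hot(channel)
print(scaled)
print(levels, encoded)
```

Whatever transformations you pick, fit them on the training data only and reuse the same parameters (minimum, maximum, category levels) at prediction time.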
Solution - here we go
You have reached a great milestone in your project - at this stage you have the baseline to start developing your solution!
Before you move on, you need to decide, based on your problem, whether you have labeled data (and enough of it, sample-size wise) or whether you're trying to narrow it down using your features alone (like topic discovery, segmentation, etc.).
You are probably going to start with one of the following problems, which I'll explain in more depth in my next posts:
Regression problem - if you are trying to predict a numerical value (like LTV, sale price, etc.)
Classification problem - if you are trying to label your data based on your features (speech tagging, music identification, topic labeling in text, etc.)
Make sure that:
You understand what you are trying to solve
You understand the correlations between your features (if any) - assuming dependence or independence between features can affect the algorithm (classifier) you choose
You avoid data leakage (features that will not be available in production because they are calculated after the predicted action)
Your dataset is balanced and noise-free (normalized and clean)
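Two of the checks above lend themselves to a quick script: flagging leakage suspects and looking at class balance. The sketch below assumes you maintain a small catalog that says, for each feature, whether its value is known at prediction time. That catalog, the feature names, and the labels are all hypothetical.

```python
from collections import Counter

# Hypothetical feature catalog: for each feature, whether its value is known
# at the moment the prediction is made in production.
feature_known_at_prediction_time = {
    "days_since_signup": True,
    "orders_last_30_days": True,
    "refund_issued": False,  # only known after the purchase we try to predict
}


def leakage_suspects(catalog: dict) -> list:
    """Features that would not exist yet when the model has to predict."""
    return sorted(name for name, known in catalog.items() if not known)


def class_shares(labels: list) -> dict:
    """Share of each label, to see how balanced the target is."""
    counts = Counter(labels)
    return {label: count / len(labels) for label, count in counts.items()}


labels = ["buy"] * 10 + ["no_buy"] * 90  # hypothetical target column
print("drop before training:", leakage_suspects(feature_known_at_prediction_time))
print("class shares:", class_shares(labels))
```

Filling in such a catalog is mostly a conversation with the subject matter experts, because the timing of each field is rarely documented in the data itself.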
Presenting your solution - at last !
If you have reached this point without sharing your progress within the company, showing a mock-up of your solution, and communicating the milestones - that's bad!
Make sure you do that throughout your project if you expect to take it to production and productize your solution.
Having said that - this is your moment to shine!
Months of effort are centered around that specific meeting (or meetings), and here your story matters.
This is how you will walk through the company's problem, the assumption and how you tackled it, the solution, and how the company is going to benefit from it.
I'll explain data storytelling in detail in other posts, so stay tuned.
Summary
Keep in mind that a DS project is not a technological project (at least not entirely). It is a pure business initiative (the solution is semi-technological), which means you need the business with you along the way. Make sure to create partnerships, message carriers within the company, future consumers (product, marketing), and vision alignment.
“If you don't know how to say farewell, put on a Kermit and let him say it”
Thanks for the interesting post. I read it all and learned a lot as a Junior DS in a startup company.
However, I didn't get that citation in the end, what does it mean?
Aviv Gelfand