How To Nail Your First Data Science Project!



AI, machine learning, data science, deep learning - all of those juicy words you keep hearing make you want to drop everything you do and just get your hands on some of them, right… wrong!

According to recent Gartner research, only 15%-20% of data science projects get completed.

Of the projects that did complete, according to their CEOs, only about 8% generate value.

Put those numbers together and data science projects show an astonishing success rate of roughly 2%.
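
Here is a quick back-of-the-envelope check of that figure, as a minimal Python sketch. It assumes the two quoted rates can simply be multiplied (the share of projects completed, then the share of completed projects that generate value); the variable names are my own.

```python
# Rates quoted above: 15%-20% of projects get completed,
# and about 8% of the completed ones generate value.
completion_low, completion_high = 0.15, 0.20
value_rate = 0.08

# Overall success rate = share completed * share of completed that generate value.
success_low = completion_low * value_rate
success_high = completion_high * value_rate

print(f"Overall success rate: {success_low:.1%} to {success_high:.1%}")
# prints: Overall success rate: 1.2% to 1.6%
```

So "2%" is the rounded-up, optimistic end of the range.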

Having said that - your first project can be included in that 2% if it is kicked off with the right reasons, support, resources and knowledge - all of which I’ll cover below.


A data science project should solve a business problem or answer a difficult question


As a data scientist / analyst, your tasks and projects always start with an assumption, a problem the business has, or an unknown behavior (product, marketing etc.) that is waiting to be solved, and data science projects shouldn't be any different.

How do we start?


Define the business use-case


This is the point where you took part in a meeting or got a request to solve a problem using an analytical approach, and how to solve it might already have been decided for you.

What do you do?

You start asking questions, running meetings with the stakeholders, and creating a document for the entire project that starts with the reasons behind it. Fill it out using these questions:

  • How will this project help the business?

  • Is this the first time the business is planning a project like this? If not, what happened last time?

  • What are we trying to solve?

  • Is it the main problem or an extension of it? (You will need to answer that yourself - root cause analysis.)

  • Is there existing data I can get for a sanity check? (Without it we cannot say whether the project is possible or not.)

  • How long has the business been collecting the data? (Let’s say you want to predict Black Friday purchase behavior - you will need at least 3 years of data points to line up the previous and following Black Fridays.)

  • What kind of baseline comparison do we have today?

  • What is the main hypothesis and how do you plan to get there?
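
The data-history question is easy to sanity check before the first meeting. Below is a minimal Python sketch for the Black Friday example. It assumes Black Friday is the day after US Thanksgiving (the fourth Thursday of November); the function names and the collection window are hypothetical, for illustration only.

```python
from datetime import date, timedelta

def black_friday(year: int) -> date:
    """Day after US Thanksgiving (the fourth Thursday of November)."""
    nov_first = date(year, 11, 1)
    # weekday(): Monday=0 ... Thursday=3
    days_to_thursday = (3 - nov_first.weekday()) % 7
    fourth_thursday = nov_first + timedelta(days=days_to_thursday + 21)
    return fourth_thursday + timedelta(days=1)

def black_fridays_covered(first_record: date, last_record: date) -> list[date]:
    """Black Fridays that fall inside the span of collected data."""
    return [
        black_friday(year)
        for year in range(first_record.year, last_record.year + 1)
        if first_record <= black_friday(year) <= last_record
    ]

# Hypothetical collection window: replace with the dates of your
# earliest and latest records.
covered = black_fridays_covered(date(2021, 1, 1), date(2023, 12, 31))
enough_history = len(covered) >= 3  # the "at least 3 years" rule of thumb above
```

If `enough_history` comes back `False`, that is worth raising with the stakeholders before anyone commits to a prediction project.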

Now that you have enough information to get your brain working, you need to organize it all and prepare yourself for a pre kick-off meeting with the stakeholders.


Pre kick-off meeting


Now that you have done some field work gathering information that will serve you in the near future, you need partnerships that will help you take this project to the finish line.

The stakeholders meeting should be treated as an elevator pitch, a fundraising pitch or any other form of selling your idea (I know it sounds bad, but it is true): the stakeholders need to believe in what you are offering before they decide whether to fund your project.

Here are a few things you should do at this point:

  • Understand the company vision and see whether the project aligns with it - it is important to emphasize those touch points

  • Don’t start with the solution - spend some time thinking about the problem you aim to solve and look at the impact of that problem

  • Spend some time with product/marketing to get another perspective on the problem and to see whether, once the problem is solved, you’ll be able to productize the solution

  • Start working with sample data and create an MVP that you can showcase to the stakeholders in the kickoff meeting (just label a small amount of data and create something that expresses a potential solution - executives need to understand value, not graphs)

  • Quantify the potential solution - to properly assess whether to invest funds/resources in your project, executives need to get a sense of its value


Pitching your project - Kickoff meeting


This is your moment to shine - the moment all your efforts and preparations were meant for.

The meeting should be structured as a story, your project's story, and it should be engaging, appealing and trustworthy.

Your main goal is to make clear from the very beginning what your project's objective is and what you aim to solve, the potential value, the resources/funds you will need to complete it, and the time frame.

Once you have covered that, make sure you go over these topics:

  • Detailed business problem - it should be described in detail (after root cause analysis) and quantified in revenue / lost potential revenue

  • Main drivers (assumptions backed by data) for the business problem

  • The main hypothesis your project is going to tackle

  • How you plan to solve it - the solution (don’t get into details, as it can change due to data issues/structure etc…)

  • Projected outcome and how to productize your solution (maintenance, costs, resources, projected ROI etc…)


Why are those stages crucial?


As I wrote at the beginning of this post, the majority of data science projects fail to reach the finish line, and not for many reasons!

The main reason is a lack of executive support/ownership/partnership, and the stages we went through are critical for getting that support.

The second is working without a proper definition and the preliminary research needed to set the project scope, which causes lots of projects to expend more resources than they needed or planned.


Congratulations - you’re good to go. What now?


Until now you did the business preparations that led you to this point, and the data itself wasn’t the main focus. Now it’s time to get deep inside the company data…

Try to have several meetings with the subject matter experts in the company to derive some insights about the behavior, trends and reasons behind the business results - it will help you in the feature engineering part and also in the data cleansing part.


Define the problem (yes, again)


At this point you should have your root cause analysis ready so you can break the business problem into several smaller ones (or stages). Then follow these steps:

  • Set the hypothesis and make it clear

  • Define what kind of problem you are about to solve - is it predicting a behavior, identifying a pattern, finding a group/subset of users etc…

  • Measure the rate of the phenomenon in the population - make sure you have enough data points to work with and set the baseline
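
Measuring the rate and setting a baseline can be as simple as counting. Here is a minimal Python sketch; the churn labels and the `MIN_POSITIVES` threshold are hypothetical assumptions for illustration, not a rule from any library.

```python
from collections import Counter

# Hypothetical labels: 1 = the user churned, 0 = the user stayed.
labels = [0, 0, 1, 0, 0, 0, 1, 0, 0, 0]

counts = Counter(labels)
positives = counts[1]

# Rate of the phenomenon in the population.
base_rate = positives / len(labels)

# Naive baseline: always predict the most common class.
majority_class, majority_count = counts.most_common(1)[0]
baseline_accuracy = majority_count / len(labels)

# Assumed rule of thumb (tune it to your problem): require a minimum
# number of positive examples before attempting to model anything.
MIN_POSITIVES = 100
enough_data = positives >= MIN_POSITIVES

print(f"Base rate: {base_rate:.0%}, baseline accuracy: {baseline_accuracy:.0%}")
# prints: Base rate: 20%, baseline accuracy: 80%
```

Any model you build later has to beat that naive baseline to be worth pitching, and a very low base rate is an early warning that accuracy alone will be a misleading metric.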

Getting deeper - data!


Now that you know what you are after and have defined it well, it’s time to build the dataset you will use to develop your solution.

Here is a high-level overview of the steps:

  • Data gathering and collection

    • Get data that is operational

    • Start with the mandatory data and the transactional data

    • Add info layers about the users

    • Get as much behavioral data as you can

  • EDA (Exploratory Data Analysis)

    • Start exploring your soon-to-be features - get to know your data

    • Deal with missing values

    • Deal with outliers

    • Do a univariate analysis to describe your potential features

    • Do a bivariate analysis to explore the relationships between pairs of features

    • Transform your features

      • Numeric - change the scale of your values and/or adjust a skewed data distribution to a Gaussian-like distribution

      • Categorical - turn your categorical variable into a numeric variable (mandatory for most machine learning models, because they can handle only numeric values)

    • Create your final features
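
To make a few of the EDA steps above concrete, here is a minimal sketch using only the Python standard library. The data is made up, and the specific techniques (median imputation, the 1.5 * IQR outlier rule, a log transform and one-hot encoding) are common choices I am assuming for illustration, not the only way to do it.

```python
import math
import statistics

# Hypothetical raw feature: order amounts with a missing value and an outlier.
amounts = [20.0, 25.0, None, 30.0, 22.0, 27.0, 24.0, 5000.0]

# Missing values: impute with the median of the observed values.
observed = [a for a in amounts if a is not None]
median = statistics.median(observed)
imputed = [median if a is None else a for a in amounts]

# Outliers: flag values further than 1.5 * IQR from the quartiles.
q1, _, q3 = statistics.quantiles(imputed, n=4)
iqr = q3 - q1
low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = [a for a in imputed if a < low or a > high]

# Numeric transform: log1p compresses a long right tail,
# pulling a skewed distribution closer to a Gaussian-like shape.
log_amounts = [math.log1p(a) for a in imputed]

# Categorical transform: one-hot encode a hypothetical "channel" feature.
channels = ["web", "app", "web", "store"]
categories = sorted(set(channels))  # ['app', 'store', 'web']
one_hot = [[int(c == cat) for cat in categories] for c in channels]
```

In a real project you would most likely reach for a dataframe library instead, but the logic stays the same. Whether to drop, cap or keep the flagged outliers is a business decision, which is exactly where those meetings with the subject matter experts pay off.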


Solution - here we go


You have reached a great milestone in your project - at this stage you have the baseline to start developing your solution!

Before you move on, you need to decide, based on your problem, whether you have labeled data (and whether there is enough of it, sample-size wise) or you’re trying to narrow it down using only your features (like topic discovery, segmentation etc...).

You are probably going to start with one of the following problems, which I’ll explain in more depth in my next posts:

  • Regression problem - if you are trying to predict a numerical value (like LTV, sale price etc…)

  • Classification problem - if you are trying to label your data based on your features (speech tagging, music identification, topic discovery in text etc…)


Make sure that:

  • You understand what you are trying to solve

  • You understand the correlations between your features (if any) - whether the features are dependent or independent can affect the algorithm (classifier) you’ll choose

  • You avoid data leakage (features that will not be available in production because they are calculated after the predicted action)

  • Your dataset is balanced and noise-free (normalized and clean)
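
Two of these checks, feature correlation and class balance, are quick to run. Below is a minimal Python sketch; the `pearson` helper, the sample features and the 0.9 threshold are my own assumptions for illustration.

```python
import math
from collections import Counter

def pearson(x: list[float], y: list[float]) -> float:
    """Pearson correlation coefficient between two equal-length lists."""
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical features and labels, for illustration only.
visits = [1.0, 2.0, 3.0, 4.0, 5.0]
page_views = [2.0, 4.0, 6.0, 8.0, 10.0]  # perfectly tied to visits
labels = [0, 0, 0, 1, 0]

# Strongly correlated features violate the independence assumption
# some classifiers make (Naive Bayes, for example), so flag such pairs.
r = pearson(visits, page_views)
highly_correlated = abs(r) > 0.9  # assumed threshold

# Class balance: share of the rarest class in the labels.
counts = Counter(labels)
minority_share = min(counts.values()) / len(labels)
```

Here `r` comes out at 1.0, so one of the two features is redundant, and a `minority_share` of 0.2 tells you the classes are not balanced. Data leakage is harder to automate: for every feature, ask whether its value would really be known at the moment of prediction.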


Presenting your solution - at last!


If you reach this point and haven't yet shared your progress within the company, shown a mock-up of your solution and communicated the milestones - that’s bad!

Make sure you do that throughout your project if you expect to take it to production and productize your solution.

Having said that - this is your moment to shine!

Months of effort are centered around that specific meeting (or meetings) - and here, your story matters.

This is where you give an overview of the company problem, the assumption and how you tackled it, the solution, and how the company is going to benefit from it.

I’ll explain Data Storytelling in detail in other posts - so stay tuned.


Summary


Keep in mind that a DS project is not a technological project (at least not entirely) - it is a pure business initiative (the solution is semi-technological). That means you need the business with you along the way, so make sure to create partnerships, message carriers within the company, future consumers (product / marketing) and vision alignment.


“If you don't know how to say farewell, put on a Kermit and let him say it”


