Data Science Methodology
Updated: 03 September 2023
Based on this Cognitive Class Course
The Methodology
The data science methodology described here is as outlined by John Rollins of IBM
The Methodology can be seen in the following steps
The Questions
The Data Science methodology aims to answer ten main questions
From Problem to Approach
- What is the problem you are trying to solve?
- How can you use data to answer the question?
Working with the Data
- What data do you need to answer the question?
- Where is the data coming from and how will you get it?
- Is the data that you collected representative of the problem to be solved?
- What additional work is required to manipulate and work with the data?
Deriving the Answer
- In what way can the data be visualized to get to the answer that is required?
- Does the model used really answer the initial question or does it need to be adjusted?
- Can you put the model into practice?
- Can you get constructive feedback into answering the question?
Problem and Approach
Business Understanding
What is the problem you are trying to solve?
The first step in the methodology involves seeking any needed clarification in order to identify the problem we are trying to solve, as this drives the data we use and the analytical approach we will apply
It is important to seek clarification early on otherwise we can waste time and resources moving in the wrong direction
In order to understand a question, it is important to understand the goal of the person asking the question
Based on this we will break down objectives and prioritize them
Analytic Approach
How can you use data to answer the question?
The second step in the methodology is selecting the correct approach, which depends on the specific problem being addressed; this builds on the business understanding and helps us identify what methods we should use to address the problem
Approach to be Used
When we have a strong understanding of the problem, we can pick an analytical approach to be used
- Descriptive
- Current Status
- Diagnostic (Statistical Analysis)
- What happened?
- Why is this happening?
- Predictive (Forecasting)
- What if these trends continue?
- What will happen next?
- Prescriptive
- How do we solve it?
Question Types
We have a few different types of questions that can direct our modelling
- Question is to determine probabilities of an action
- Predictive Model
- Question is to show relationships
- Descriptive Model
- Question requires a binary answer
- Classification Model
Machine Learning
Machine learning allows us to identify relationships and trends that cannot otherwise be established
Decision Trees
Decision trees are a machine learning algorithm that allow us to classify data while also giving us some insight into how the classification is made
They make use of a tree structure with recursive partitioning to classify data; the predictiveness of a split is based on the decrease in entropy, i.e. the gain in information or reduction in impurity
A decision tree for classifying data can result in leaf nodes of varying purity, which will provide us with different amounts of information
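As a rough illustration of the entropy idea (not taken from the course material), the information gain of a candidate split can be computed as follows:

```python
import math

def entropy(labels):
    """Shannon entropy of a list of class labels"""
    total = len(labels)
    counts = {label: labels.count(label) for label in set(labels)}
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

# A toy parent node and the two children produced by a candidate split
parent = ["yes"] * 5 + ["no"] * 5
left = ["yes"] * 4 + ["no"]
right = ["yes"] + ["no"] * 4

# Information gain = entropy before the split minus the weighted entropy after it
weighted_child_entropy = (len(left) * entropy(left) + len(right) * entropy(right)) / len(parent)
print(entropy(parent) - weighted_child_entropy)  # ~0.28 bits gained by this split
```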
Some of the characteristics of decision trees are summarized below
| Pros | Cons |
|---|---|
| Easy to interpret | Easy to over- or under-fit the model |
| Can handle numeric or categorical features | Cannot model feature interactions |
| Can handle missing data | Large trees can be difficult to interpret |
| Uses only the most important features | |
| Can be used on very large or small datasets | |
Labs
The Lab notebooks have been added in the labs
folder, and are released under the MIT License
The Lab for this section is 1-From-Problem-to-Approach.ipynb
Requirements and Collection
Data Requirements
What data do you need to answer the question?
We need to understand what data is required, how to collect it, and how to transform it to address the problem at hand
It is necessary to identify the data requirements for the initial data collection
We typically make use of the following steps
- Define and select the set of data needed
- Define the content, format, and representation of the data
- Look ahead and plan how to transform the data into the form that will be most suitable to work with
Data Collection
Where is the data coming from and how will you get it?
After the initial data collection has been performed, we look at the data and verify that we have everything we need; the data requirements are then revisited to define what has not been met or needs to be changed
We then make use of descriptive statistics and visualizations to assess the quality and other aspects of the data, and identify how we can fill in any gaps
Collecting data requires that we know the data source and where to find the required data
Labs
The lab documents for this section are in both Python and R, and can be found in the labs
folder as 2-Requirements-to-Collection-py.ipynb
and 2-Requirements-to-Collection-R.ipynb
These labs will simply read in a dataset from a remote source as a CSV and display it
Python
In Python we use Pandas to read the data into a DataFrame
Thereafter we can view the dataframe by looking at the first few rows, as well as the dimensions with
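something along the lines of the following sketch (the URL below is a placeholder rather than the lab's actual dataset link):

```python
import pandas as pd

# Placeholder URL - substitute the dataset link given in the lab notebook
url = "https://example.com/recipes.csv"

# Read the remote CSV into a DataFrame
recipes = pd.read_csv(url)

# First few rows and the dimensions (rows, columns)
print(recipes.head())
print(recipes.shape)
```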
R
We do the same as the above in R as follows
First we download the file from the remote resource
Thereafter we can read this into a variable with
We can then see the first few rows of data as well as the dimensions with
Understanding and Preparation
Data Understanding
Is the data that you collected representative of the problem to be solved?
We make use of descriptive statistics to understand the data
We run statistical analyses to learn about the data using measures such as
- Univariate
- Pairwise
- Histogram
- Mean
- Median
- Min
- Max
- etc.
We also make use of these to understand data quality and values such as Missing values and Invalid or Misleading values
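As a rough illustration (assuming a pandas DataFrame named df, which is purely a placeholder here), these checks might look like:

```python
import matplotlib.pyplot as plt
import pandas as pd

df = pd.read_csv("data.csv")  # placeholder dataset

# Univariate descriptive statistics: mean, min, max, quartiles, etc.
print(df.describe())

# Pairwise relationships between the numeric columns
print(df.corr(numeric_only=True))

# Data-quality checks: missing values per column and duplicated rows
print(df.isnull().sum())
print(df.duplicated().sum())

# Histograms of each numeric column
df.hist()
plt.show()
```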
Data Preparation
What additional work is required to manipulate and work with the data?
Data preparation is similar to cleansing data by removing unwanted elements and imperfections; this can take between 70% and 90% of the project time
Transforming data in this phase is the process of turning data into something that is easier to work with
Some examples of what we need to look out for are
- Invalid values
- Missing data
- Duplicates
- Formatting
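A rough sketch of how these issues might be handled with pandas (the column names below are purely illustrative):

```python
import numpy as np
import pandas as pd

df = pd.read_csv("data.csv")  # placeholder dataset

# Invalid values: treat impossible entries (e.g. a negative age) as missing
df.loc[df["age"] < 0, "age"] = np.nan

# Missing data: fill numeric gaps with the column median
df["age"] = df["age"].fillna(df["age"].median())

# Duplicates: drop repeated rows
df = df.drop_duplicates()

# Formatting: normalise free-text columns
df["city"] = df["city"].str.strip().str.lower()
```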
Another part of data preparation is feature engineering which is when we use domain knowledge to create features for our predictive models
The data preparation will support the remainder of the project
Labs
The lab documents for this section are in both Python and R, and can be found in the labs
folder as 3-Understanding-to-Preparation-py.ipynb
and 3-Understanding-to-Preparation-R.ipynb
These labs will continue to analyze the data that was imported from the previous lab
Python
First, we check if the ingredients exist in our dataframe
Thereafter we can look at our data in order to see if there are any changes that need to be made
From here the following can be seen
- Cuisine is labelled as country
- Cuisine names are not consistent, uppercase, lowercase, etc.
- Some cuisines are duplicates of the country name
- Some cuisines have very few recipes
We can take a few steps to solve these problems
First we fix the Country title to be Cuisine
Then we can make all the names lowercase
Next we correct the mislabelled cuisine names
After that we can remove the cuisines with fewer than 50 recipes
And then view the number of rows we kept/removed
Next we can convert the yes/no fields to be binary
And lastly view our data with
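a call such as `recipes.head()`. Taken together, the cleaning steps above might look roughly like the sketch below; the column names and the exact label fixes are assumptions based on the description rather than the lab's own code:

```python
# Rename the "country" column to "cuisine"
recipes = recipes.rename(columns={"country": "cuisine"})

# Make all cuisine names lowercase
recipes["cuisine"] = recipes["cuisine"].str.lower()

# Correct mislabelled cuisine names (hypothetical example mapping)
recipes["cuisine"] = recipes["cuisine"].replace({"uk-and-irish": "uk-and-ireland"})

# Remove cuisines with fewer than 50 recipes and report how many rows were kept
before = len(recipes)
counts = recipes["cuisine"].value_counts()
recipes = recipes[recipes["cuisine"].isin(counts[counts >= 50].index)]
print(f"Kept {len(recipes)} of {before} rows")

# Convert the Yes/No ingredient columns to binary 1/0
ingredient_cols = recipes.columns.drop("cuisine")
recipes[ingredient_cols] = (recipes[ingredient_cols] == "Yes").astype(int)

# View the cleaned data
print(recipes.head())
```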
Next we can look for recipes that contain rice and soy and wasabi and seaweed
Based on this we can see that not all recipes with those ingredients are Japanese
Now we can look at the frequency of different ingredients in these recipes
We can then sort the dataframe of ingredients in descending order
From this we can see that the most common ingredients are Egg, Wheat, and Butter. However we have a lot more American recipes than the others, indicating that our data is skewed towards American ingredients
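A rough sketch of these checks (the ingredient column names, e.g. soy_sauce, are assumptions based on the description):

```python
# Recipes that contain rice, soy sauce, wasabi and seaweed
check = recipes[
    (recipes["rice"] == 1)
    & (recipes["soy_sauce"] == 1)
    & (recipes["wasabi"] == 1)
    & (recipes["seaweed"] == 1)
]
print(check["cuisine"].value_counts())  # not all of these recipes are Japanese

# Frequency of each ingredient across all recipes, sorted in descending order
ingredient_totals = recipes.drop(columns="cuisine").sum().sort_values(ascending=False)
print(ingredient_totals.head(10))
```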
We can now create a profile for each cuisine in order to see a more representative recipe distribution
We can then print out the top 4 ingredients of every cuisine
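A rough sketch of both steps, where each cuisine's profile is taken as the proportion of its recipes that use each ingredient (an assumption about how the profile is defined):

```python
# Per-cuisine profile: the mean of the binary ingredient columns gives, for each
# cuisine, the fraction of its recipes containing each ingredient
cuisine_profile = recipes.groupby("cuisine").mean()

# Top 4 ingredients for every cuisine
for cuisine, row in cuisine_profile.iterrows():
    top4 = row.sort_values(ascending=False).head(4)
    print(cuisine, ":", ", ".join(top4.index))
```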
R
First, we check if the ingredients exist in our dataframe
Thereafter we can look at our data in order to see if there are any changes that need to be made
From here the following can be seen
- Cuisine is labelled as country
- Cuisine names are not consistent, uppercase, lowercase, etc.
- Some cuisines are duplicates of the country name
- Some cuisines have very few recipes
We can take a few steps to solve these problems
First we fix the Country title to be Cuisine
Then we can make all the names lowercase
Next we correct the mislabelled cuisine names
After that we can remove the cuisines with fewer than 50 recipes
And then view the number of rows we kept/removed
Next we convert all the columns into factors for classification later
We can look at the structure of our dataframe as
Now we can look at which recipes contain rice and soy_sauce and wasabi and seaweed
We can count the ingredients across all recipes with
We can next count the total ingredients and sort that in descending order
We can then create a profile for each cuisine as we did previously
We can then print out the top 4 ingredients for each cuisine with
Modeling and Evaluation
Modeling
In what way can the data be visualized to get to the answer that is required?
Modeling is the stage in which the data scientist develops a model, based on the prepared data, that addresses the question being asked
Data modeling tries to arrive at either a predictive or a descriptive model
Data scientists use a training set for predictive modeling; this is historical data that acts as a way to test that the data we are using is suitable for the problem we are trying to solve
Evaluation
Does the model used really answer the initial question or does it need to be adjusted?
A model evaluation goes hand in hand with model building, model building and evaluation are done iteratively
This is done before the model is deployed in order to verify that the model answers our questions and the quality meets our standard
Two phases are considered when evaluating a model
- Diagnostic Measures
- Predictive
- Descriptive
- Statistical Significance
We can make use of an ROC curve to evaluate binary classification models and determine the optimal one by plotting the True-Positive rate against the False-Positive rate
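A minimal sketch of how such a curve can be produced with scikit-learn, assuming we already have the true labels and the model's predicted scores (the values below are made up):

```python
import matplotlib.pyplot as plt
from sklearn.metrics import auc, roc_curve

# True binary labels and predicted probabilities (placeholder values)
y_true = [0, 0, 1, 1, 0, 1, 1, 0]
y_score = [0.1, 0.4, 0.35, 0.8, 0.2, 0.7, 0.9, 0.5]

# ROC curve: True-Positive rate vs False-Positive rate across thresholds
fpr, tpr, thresholds = roc_curve(y_true, y_score)
print("AUC:", auc(fpr, tpr))

plt.plot(fpr, tpr)
plt.xlabel("False-Positive rate")
plt.ylabel("True-Positive rate")
plt.show()
```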
Labs
The lab documents for this section are in both Python and R, and can be found in the labs
folder as 4-Modeling-to-Evaluation-py.ipynb
and 4-Modeling-to-Evaluation-R.ipynb
These labs will continue from where the last lab left off and build a decision tree model for the recipe data
Python
First we will need to import some libraries for modelling
We will make use of a decision tree called bamboo_tree
which will be used to classify between Korean, Japanese, Chinese, Thai, and Indian Food
The following code will create our decision tree
Thereafter we can plot the decision tree with
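a plotting helper such as scikit-learn's plot_tree (the lab itself may use a different tool). A rough combined sketch of building and plotting bamboo_tree, where the cuisine and ingredient column names are assumed from the earlier cleaning steps:

```python
import matplotlib.pyplot as plt
from sklearn import tree

# Keep only the five cuisines we want to classify between
asian_indian = recipes[recipes["cuisine"].isin(
    ["korean", "japanese", "chinese", "thai", "indian"])]

X = asian_indian.drop(columns="cuisine")   # binary ingredient columns
y = asian_indian["cuisine"]                # cuisine labels

# Build the decision tree (depth limited so the plot stays readable)
bamboo_tree = tree.DecisionTreeClassifier(max_depth=3)
bamboo_tree.fit(X, y)

# Plot the fitted tree
plt.figure(figsize=(20, 10))
tree.plot_tree(bamboo_tree, feature_names=list(X.columns),
               class_names=sorted(y.unique()), filled=True)
plt.show()
```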
Now we can go back and rebuild our model, however this time retaining some data so we can evaluate the model
We can use 30 values as our sample size
We can verify that we have 30 recipes from each cuisine
We can now separate our data into a test and training set
And then train our model again
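A rough sketch of this hold-out split and retraining (the sampling call and tree depth here are assumptions, not the lab's exact code):

```python
# Hold out 30 recipes per cuisine as a test set
test_set = asian_indian.groupby("cuisine").sample(n=30, random_state=1234)
train_set = asian_indian.drop(test_set.index)

# Verify that the test set holds 30 recipes from each cuisine
print(test_set["cuisine"].value_counts())

# Retrain the tree on the training set only
bamboo_train_tree = tree.DecisionTreeClassifier(max_depth=15)  # depth chosen for illustration
bamboo_train_tree.fit(train_set.drop(columns="cuisine"), train_set["cuisine"])
```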
We can then view our tree as before
If you run this you will see that the new tree is more complex than the last one due to it having fewer data points to work with (I did not put it here because it renders very big in the plot)
Next we can test our model based on the Test Data
We can then create a confusion matrix to see how well the tree does
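A minimal sketch using scikit-learn's confusion_matrix (again an approximation of the lab code):

```python
import matplotlib.pyplot as plt
from sklearn.metrics import ConfusionMatrixDisplay, confusion_matrix

# Predict the cuisine of each held-out recipe
y_test = test_set["cuisine"]
y_pred = bamboo_train_tree.predict(test_set.drop(columns="cuisine"))

# Confusion matrix: rows are the actual cuisines, columns the predicted ones
labels = sorted(y_test.unique())
cm = confusion_matrix(y_test, y_pred, labels=labels)
ConfusionMatrixDisplay(cm, display_labels=labels).plot()
plt.show()
```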
The rows of a confusion matrix represent the actual values, and the columns represent the predicted values
The resulting confusion matrix can be seen below
The squares along the top-left to bottom-right diagonal are those that the model correctly classified
R
We can follow a similar process as above using R
First we import the libraries we will need to build our decision trees as follows
Thereafter we can train our model using our data with
And view it with the following
Now we can redefine our dataframe to only include the Asian and Indian cuisine
And take a sample of 30 for our test set from each cuisine
Thereafter we can create our training set with
And verify that we have correctly removed the 30 recipes from each cuisine
Next we can train our tree and plot it
It can be seen that by removing elements we get a more complex decision tree; this is the same as in the Python case
We can then view the confusion matrix as follows
Which will result in
Deployment and Feedback
Deployment
Can you put the model into practice?
The key to making your model relevant is making the stakeholders familiar with the solution developed
Once the model is evaluated and we are confident in it, we deploy it, typically first to a small set of users to put it through practical tests
Deployment also consists of developing a suitable method to enable our users to interact with and use the model, as well as looking at ways to improve the model with a feedback system
Feedback
Can you get constructive feedback into answering the question?
User feedback helps us to refine the model and assess its performance and impact; based on this feedback we make changes to improve the model
Once the model is deployed we can make use of feedback and experience with the model to refine it or incorporate different data that we had not initially considered