# Data Science Methodology

Updated: 03 September 2023

Based on this Cognitive Class Course

# The Methodology

The data science methodology described here is as outlined by John Rollins of IBM

The Methodology can be seen in the following steps

# The Questions

The Data Science methodology aims to answer ten main questions

## From Problem to Approach

- What is the problem you are trying to solve?
- How can you use data to answer the question?

## Working with the Data

- What data do you need to answer the question?
- Where is the data coming from and how will you get it?
- Is the data that you collected representative of the problem to be solved?
- What additional work is required to manipulate and work with the data?

## Deriving the Answer

- In what way can the data be visualized to get to the answer that is required?
- Does the model used really answer the initial question or does it need to be adjusted?
- Can you put the model into practice?
- Can you get constructive feedback into answering the question?

# Problem and Approach

## Business Understanding

What is the problem you are trying to solve?

The first step in the methodology involves seeking any needed clarification in order to identify the problem we are trying to solve, as this drives the data we use and the analytical approach that we apply

It is important to seek clarification early on otherwise we can waste time and resources moving in the wrong direction

In order to understand a question, it is important to understand the goal of the person asking the question

Based on this we will break down objectives and prioritize them

## Analytic Approach

How can you use data to answer the question?

The second step in the methodology is selecting the correct approach, which depends on the specific problem being addressed

This points to the purpose of business understanding, which helps us to identify what methods we should use in order to address the problem

### Approach to be Used

When we have a strong understanding of the problem, we can pick an analytical approach to be used

- Descriptive
  - Current Status
- Diagnostic (Statistical Analysis)
  - What happened?
  - Why is this happening?
- Predictive (Forecasting)
  - What if these trends continue?
  - What will happen next?
- Prescriptive
  - How do we solve it?

### Question Types

We have a few different types of questions that can direct our modelling

- Question is to determine probabilities of an action
  - Predictive Model
- Question is to show relationships
  - Descriptive Model
- Question requires a binary answer
  - Classification Model

### Machine Learning

Machine learning allows us to identify relationships and trends that cannot otherwise be established

### Decision Trees

Decision trees are a machine learning algorithm that allows us to classify data while also giving us some information as to how the data is classified

It makes use of a tree structure with *recursive partitioning* to classify data, where *predictiveness* is based on the decrease in entropy, that is, the gain in information or reduction in impurity

A decision tree for classifying data can result in leaf nodes of varying purity, which will provide us with different amounts of information
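As a rough illustration of how a split is scored, the sketch below computes Shannon entropy and the resulting information gain using only the standard library. The function names and the cuisine labels are made up for this example and are not taken from the course

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())


def information_gain(parent, children):
    """Decrease in entropy after splitting `parent` into the `children` lists."""
    total = len(parent)
    weighted = sum(len(child) / total * entropy(child) for child in children)
    return entropy(parent) - weighted


# Hypothetical labels: four recipes from each of two cuisines
parent = ["japanese"] * 4 + ["korean"] * 4

# A split that separates the classes perfectly gives pure leaf nodes
pure_split = [["japanese"] * 4, ["korean"] * 4]
# A split that leaves both leaves evenly mixed tells us nothing new
mixed_split = [
    ["japanese", "japanese", "korean", "korean"],
    ["japanese", "japanese", "korean", "korean"],
]

print(information_gain(parent, pure_split))   # 1.0
print(information_gain(parent, mixed_split))  # 0.0
```

A pure split recovers the full one bit of uncertainty in the parent node, while the mixed split gains nothing, which is why a tree builder would prefer the first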

Some of the characteristics of decision trees are summarized below

| Pros | Cons |
|---|---|
| Easy to interpret | Easy to over or underfit the model |
| Can handle numeric or categorical features | Cannot model feature interaction |
| Can handle missing data | Large trees can be difficult to interpret |
| Uses only the most important features | |
| Can be used on very large or small datasets | |

## Labs

The Lab notebooks have been added in the `labs` folder, and are released under the MIT License

The Lab for this section is `1-From-Problem-to-Approach.ipynb`

# Requirements and Collection

## Data Requirements

What data do you need to answer the question?

We need to understand what data is required, how to collect it, and how to transform it to address the problem at hand

It is necessary to identify the data requirements for the initial data collection

We typically make use of the following steps

- Define and select the set of data needed
- Define the content, format, and representation of the data
- It is important to look ahead when transforming our data to a form that would be most suitable for us

## Data Collection

Where is the data coming from and how will you get it?

After the initial data collection has been performed, we look at the data and verify that we have all the data that we need, and the data requirements are revisited in order to define what has not been met or needs to be changed

We then make use of descriptive statistics and visuals in order to define the quality and other aspects of the data and then identify how we can fill in these gaps

Collecting data requires that we know the data source and where to find the required data

## Labs

The lab documents for this section are in both Python and R, and can be found in the `labs` folder as `2-Requirements-to-Collection-py.ipynb` and `2-Requirements-to-Collection-R.ipynb`

These labs will simply read in a dataset from a remote source as a CSV and display it

### Python

In Python we will use Pandas to read data as DataFrames

We can use *Pandas* to read data into the data frame

Thereafter we can view the dataframe by looking at the first few rows, as well as the dimensions with
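A minimal sketch of these steps, assuming a small in-memory CSV as a stand-in for the remote file used in the lab. The column names and rows are invented for illustration, and the variable name `recipes` is an assumption

```python
import io

import pandas as pd

# Stand-in for the remote CSV used in the lab; these rows are made up
csv_text = """country,rice,soy_sauce,wasabi
Japanese,Yes,Yes,Yes
Korean,Yes,Yes,No
Indian,Yes,No,No
"""

# pd.read_csv accepts a URL, a file path, or a file-like object
recipes = pd.read_csv(io.StringIO(csv_text))

print(recipes.head())  # first few rows (five by default)
print(recipes.shape)   # (number of rows, number of columns)
```

Passing the URL of the remote CSV to `pd.read_csv` in place of the `StringIO` object would read the real dataset in the same way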

### R

We do the same as the above in R as follows

First we download the file from the remote resource

Thereafter we can read this into a variable with

We can then see the first few rows of data as well as the dimensions with

# Understanding and Preparation

## Data Understanding

Is the data that you collected representative of the problem to be solved?

We make use of descriptive statistics to understand the data

We run statistical analyses to learn about the data with means such as

- Univariate
- Pairwise
- Histogram
- Mean
- Median
- Min
- Max
- etc.

We also make use of these to understand data quality and values such as Missing values and Invalid or Misleading values
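The sketch below shows how these summaries and quality checks might look in Pandas. The dataframe, its column names, and the rule that a serving count must be positive are all assumptions made for this example

```python
import pandas as pd

# Hypothetical numeric data with one missing and one implausible value
df = pd.DataFrame({
    "cook_time_min": [10, 25, 30, None, 45],
    "servings": [2, 4, 4, 6, -1],  # -1 is an invalid value
})

summary = df.describe()                # count, mean, std, min, quartiles, max
medians = df.median()                  # median of each column
missing = df.isna().sum()              # number of missing values per column
invalid = (df["servings"] <= 0).sum()  # rows with an impossible serving count

print(summary)
print(medians)
print(missing)
print(invalid)
```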

## Data Preparation

What additional work is required to manipulate and work with the data?

Data preparation is similar to cleansing data by removing unwanted elements and imperfections

This can take between 70% and 90% of the project time

Transforming data in this phase is the process of turning data into something that would be easier to work with

Some examples of what we need to look out for are

- Invalid values
- Missing data
- Duplicates
- Formatting

Another part of data preparation is feature engineering which is when we use domain knowledge to create features for our predictive models

The data preparation will support the remainder of the project

## Labs

The lab documents for this section are in both Python and R, and can be found in the `labs` folder as `3-Understanding-to-Preparation-py.ipynb` and `3-Understanding-to-Preparation-R.ipynb`

These labs will continue to analyze the data that was imported from the previous lab

### Python

First, we check if the ingredients exist in our dataframe
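One way to do this check is to search the column names for each ingredient keyword. This is only a sketch: the helper `find_columns` and the column names are hypothetical, and the lab notebook may do this differently

```python
import pandas as pd

# Hypothetical columns; the real dataset has many more ingredients
recipes = pd.DataFrame(
    columns=["country", "rice", "brown_rice", "soy_sauce", "wasabi", "egg"]
)


def find_columns(df, keyword):
    """Return the names of all columns that contain `keyword`."""
    return [col for col in df.columns if keyword in col]


for keyword in ["rice", "wasabi", "soy", "seaweed"]:
    print(keyword, find_columns(recipes, keyword))
```

An empty list for a keyword means that the ingredient does not appear as a column under that name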

Thereafter we can look at our data in order to see if there are any changes that need to be made

From here the following can be seen

- Cuisine is labelled as country
- Cuisine names are not consistent, uppercase, lowercase, etc.
- Some cuisines are duplicates of the country name
- Some cuisines have very few recipes

We can take a few steps to solve these problems

First we fix the *Country* title to be *Cuisine*

Then we can make all the names lowercase
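The rename and lowercase steps could be sketched as follows, assuming the column is called `country` and using made-up rows

```python
import pandas as pd

# Hypothetical rows with inconsistent capitalisation
recipes = pd.DataFrame({
    "country": ["Japanese", "KOREAN", "korean", "Japan"],
    "rice": ["Yes", "Yes", "No", "Yes"],
})

# Rename the mislabelled column, then normalise the case of its values
recipes = recipes.rename(columns={"country": "cuisine"})
recipes["cuisine"] = recipes["cuisine"].str.lower()

print(recipes["cuisine"].tolist())  # ['japanese', 'korean', 'korean', 'japan']
```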

Next we correct the mislabelled cuisine names
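A sketch of this correction using a lookup table. The `corrections` mapping here is hypothetical and only covers the made-up rows in the example, so the mapping for the real data would need to be built by inspecting its labels

```python
import pandas as pd

# Hypothetical labels where some cuisines appear under the country name
recipes = pd.DataFrame(
    {"cuisine": ["japan", "japanese", "korea", "korean", "india"]}
)

# Hypothetical mapping from country-style labels to cuisine names
corrections = {"japan": "japanese", "korea": "korean", "india": "indian"}
recipes["cuisine"] = recipes["cuisine"].replace(corrections)

print(recipes["cuisine"].tolist())
```

Values that do not appear as keys in the mapping are left unchanged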

After that we can remove the cuisines with fewer than 50 recipes

And then view the number of rows we kept/removed
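The filtering and the row count could look like the sketch below, which uses invented data with one rare cuisine

```python
import pandas as pd

# Hypothetical data: 60 japanese, 55 korean, and 3 icelandic recipes
recipes = pd.DataFrame(
    {"cuisine": ["japanese"] * 60 + ["korean"] * 55 + ["icelandic"] * 3}
)

counts = recipes["cuisine"].value_counts()
common = counts[counts >= 50].index  # cuisines with at least 50 recipes
filtered = recipes[recipes["cuisine"].isin(common)]

rows_before = len(recipes)
rows_after = len(filtered)
print(f"kept {rows_after} rows, removed {rows_before - rows_after}")
```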

Next we can convert the yes/no fields to be binary
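A minimal sketch of the conversion, assuming the fields hold the strings `Yes` and `No`. Note that with this approach anything other than `Yes`, including a missing value, becomes 0

```python
import pandas as pd

# Hypothetical rows; every column except `cuisine` is a yes/no field
recipes = pd.DataFrame({
    "cuisine": ["japanese", "korean"],
    "rice": ["Yes", "No"],
    "wasabi": ["Yes", "No"],
})

ingredient_cols = recipes.columns.drop("cuisine")
# Compare against "Yes" to get booleans, then cast them to 1/0
recipes[ingredient_cols] = (recipes[ingredient_cols] == "Yes").astype(int)

print(recipes)
```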

And lastly view our data with

Next we can look for recipes that contain rice and soy and wasabi and seaweed
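This lookup can be sketched with a boolean mask over the ingredient columns. The column names and rows below are assumptions, with ingredients already encoded as 1/0

```python
import pandas as pd

# Hypothetical binary-encoded recipes
recipes = pd.DataFrame({
    "cuisine": ["japanese", "east_asian", "korean"],
    "rice": [1, 1, 1],
    "soy_sauce": [1, 1, 1],
    "wasabi": [1, 1, 0],
    "seaweed": [1, 1, 1],
})

wanted = ["rice", "soy_sauce", "wasabi", "seaweed"]
# Keep only the rows where every wanted ingredient is present
matches = recipes[(recipes[wanted] == 1).all(axis=1)]

print(matches["cuisine"].tolist())  # ['japanese', 'east_asian']
```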

Based on this we can see that not all recipes with those ingredients are Japanese

Now we can look at the frequency of different ingredients in these recipes

We can then sort the dataframe of ingredients in descending order
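Counting and sorting could be done as in the sketch below, which sums each binary ingredient column. The data is invented and much smaller than the lab dataset

```python
import pandas as pd

# Hypothetical binary-encoded recipes
recipes = pd.DataFrame({
    "cuisine": ["american", "american", "japanese"],
    "egg": [1, 1, 1],
    "wheat": [1, 1, 0],
    "rice": [0, 0, 1],
})

# Summing a 1/0 column gives the number of recipes that use the ingredient
ingredient_counts = (
    recipes.drop(columns="cuisine").sum().sort_values(ascending=False)
)

print(ingredient_counts)
```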

From this we can see that the most common ingredients are Egg, Wheat, and Butter. However, we have a lot more American recipes than the others, indicating that our data is skewed towards American ingredients

We can now create a profile for each cuisine in order to see a more representative recipe distribution with

We can then print out the top 4 ingredients of every cuisine with the following
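A sketch of the profile and the top ingredients, assuming that a profile is the share of a cuisine's recipes that use each ingredient. The helper `top_ingredients` and the data are hypothetical

```python
import pandas as pd

# Hypothetical binary-encoded recipes
recipes = pd.DataFrame({
    "cuisine": ["japanese", "japanese", "american", "american"],
    "rice": [1, 1, 0, 0],
    "soy_sauce": [1, 0, 0, 0],
    "egg": [0, 1, 1, 1],
    "wheat": [0, 0, 1, 1],
    "butter": [0, 0, 1, 0],
})

# The mean of a 1/0 column is the fraction of recipes using that ingredient
profiles = recipes.groupby("cuisine").mean()


def top_ingredients(profiles, cuisine, n=4):
    """Names of the `n` most frequently used ingredients for one cuisine."""
    return profiles.loc[cuisine].nlargest(n).index.tolist()


for cuisine in profiles.index:
    print(cuisine, top_ingredients(profiles, cuisine))
```

Because each profile is normalized by the number of recipes in that cuisine, a cuisine with many recipes no longer dominates the ranking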

### R

First, we check if the ingredients exist in our dataframe

Thereafter we can look at our data in order to see if there are any changes that need to be made

From here the following can be seen

- Cuisine is labelled as country
- Cuisine names are not consistent, uppercase, lowercase, etc.
- Some cuisines are duplicates of the country name
- Some cuisines have very few recipes

We can take a few steps to solve these problems

First we fix the *Country* title to be *Cuisine*

Then we can make all the names lowercase

Next we correct the mislabelled cuisine names

After that we can remove the cuisines with fewer than 50 recipes

And then view the number of rows we kept/removed

Next we convert all the columns into factors for classification later

We can look at the structure of our dataframe as

Now we can look at which recipes contain rice and soy_sauce and wasabi and seaweed

We can count the ingredients across all recipes with

We can next count the total ingredients and sort that in descending order

We can then create a profile for each cuisine as we did previously

We can then print out the top 4 ingredients for each cuisine with

# Modeling and Evaluation

## Modeling

In what way can the data be visualized to get to the answer that is required?

Modeling is the stage in which the Data Scientist

Data modeling tries to get to either a predictive or a descriptive model

Data scientists use a training set for predictive modeling

This is historical data that acts as a way to test that the data we are using is suitable for the problem we are trying to solve

## Evaluation

Does the model used really answer the initial question or does it need to be adjusted?

Model evaluation goes hand in hand with model building, and the two are done iteratively

This is done before the model is deployed in order to verify that the model answers our questions and the quality meets our standard

Two phases are considered when evaluating a model

- Diagnostic Measures
  - Predictive
  - Descriptive
- Statistical Significance

We can make use of the ROC curve to evaluate models and determine the optimal model for a binary classification model by plotting the True-Positive vs False-Positive rate for the model
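To make the idea concrete, the sketch below computes the points of an ROC curve by hand using only the standard library. The function `roc_points`, the labels, and the scores are all made up for this example

```python
def roc_points(y_true, scores):
    """Return (false positive rate, true positive rate) pairs per threshold."""
    positives = sum(y_true)
    negatives = len(y_true) - positives
    points = [(0.0, 0.0)]
    # Lower the threshold one distinct score at a time, highest first
    for threshold in sorted(set(scores), reverse=True):
        predicted = [score >= threshold for score in scores]
        tp = sum(p and t == 1 for p, t in zip(predicted, y_true))
        fp = sum(p and t == 0 for p, t in zip(predicted, y_true))
        points.append((fp / negatives, tp / positives))
    return points


# Hypothetical true labels (1 = positive) and model scores
y_true = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]

for fpr, tpr in roc_points(y_true, scores):
    print(f"FPR={fpr:.2f}  TPR={tpr:.2f}")
```

Plotting these pairs with the false-positive rate on the x-axis gives the ROC curve, and a model whose curve rises towards the top-left corner more quickly separates the two classes better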

## Labs

The lab documents for this section are in both Python and R, and can be found in the `labs`

folder as `4-Modeling-to-Evaluation-py.ipynb`

and `4-Modeling-to-Evaluation-R.ipynb`

These labs will continue from where the last lab left off and build a decision tree model for the recipe data

### Python

First we will need to import some libraries for modelling

We will make use of a decision tree called `bamboo_tree` which will be used to classify between Korean, Japanese, Chinese, Thai, and Indian Food

The following code will create our decision tree
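A minimal sketch of building such a tree, assuming scikit-learn as the modelling library (the text above does not name it). The tiny dataset, the `max_depth` value, and the `random_state` are choices made for this example only

```python
import pandas as pd
from sklearn import tree

# Hypothetical binary-encoded recipes for three cuisines
recipes = pd.DataFrame({
    "cuisine": ["japanese", "japanese", "indian", "indian", "thai", "thai"],
    "wasabi": [1, 1, 0, 0, 0, 0],
    "cumin": [0, 0, 1, 1, 0, 0],
    "rice": [1, 0, 1, 0, 1, 1],
})

X = recipes.drop(columns="cuisine")  # ingredient features
y = recipes["cuisine"]               # target labels

# max_depth limits how deep the tree may grow; 5 is an assumed value
bamboo_tree = tree.DecisionTreeClassifier(max_depth=5, random_state=0)
bamboo_tree.fit(X, y)

# Text rendering of the learned splits
print(tree.export_text(bamboo_tree, feature_names=list(X.columns)))
```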

Thereafter we can plot the decision tree with

Now we can go back and rebuild our model, however this time retaining some data so we can evaluate the model

We can use 30 values as our sample size

We can verify that we have 30 recipes from each cuisine

We can now separate our data in to a test and training set
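The sampling and the split could be sketched as follows, with invented data and an arbitrary `random_state`. Here the sampled rows form the test set and all remaining rows form the training set

```python
import pandas as pd

# Hypothetical data: 100 recipes for each of three cuisines
recipes = pd.DataFrame({
    "recipe_id": range(300),
    "cuisine": ["korean"] * 100 + ["japanese"] * 100 + ["indian"] * 100,
})

sample_n = 30
# Draw 30 recipes from every cuisine for the test set
test = recipes.groupby("cuisine").sample(n=sample_n, random_state=1234)
# Everything that was not sampled becomes the training set
train = recipes.drop(test.index)

print(test["cuisine"].value_counts())
print(len(train), len(test))
```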

And then train our model again

We can then view our tree as before

If you run this you will see that the new tree is more complex than the last one due to it having fewer data points to work with (I did not put it here because it renders very big in the plot)

Next we can test our model based on the Test Data

We can then create a confusion matrix to see how well the tree does

The rows on a confusion matrix represent the actual values, and the columns are the predicted values

The resulting confusion matrix can be seen below

The squares along the top-left to bottom-right diagonal are those that the model correctly classified
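As a small illustration, the sketch below builds a confusion matrix with `pd.crosstab`, with actual values as rows and predicted values as columns. The labels are made up and do not come from the lab's model

```python
import pandas as pd

# Hypothetical actual and predicted cuisines for five test recipes
actual = pd.Series(
    ["japanese", "japanese", "korean", "korean", "indian"], name="actual"
)
predicted = pd.Series(
    ["japanese", "korean", "korean", "korean", "indian"], name="predicted"
)

# Rows are the actual cuisines, columns are the predicted cuisines
confusion = pd.crosstab(actual, predicted)
print(confusion)

# Correct predictions sit on the diagonal
correct = sum(confusion.loc[label, label] for label in confusion.index)
accuracy = correct / len(actual)
print(accuracy)
```

In this example one Japanese recipe is predicted as Korean, which shows up as an off-diagonal count in the `japanese` row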

### R

We can follow a similar process as above using R

First we import the libraries we will need to build our decision trees as follows

Thereafter we can train our model using our data with

And view it with the following

Now we can redefine our dataframe to only include the Asian and Indian cuisine

And take a sample of 30 for our test set from each cuisine

Thereafter we can create our training set with

And verify that we have correctly removed the 30 elements from each cuisine

Next we can train our tree and plot it

It can be seen that by removing elements we get a more complex decision tree, as in the Python case

We can then view the confusion matrix as follows

Which will result in

# Deployment and Feedback

## Deployment

Can you put the model into practice?

The key to making your model relevant is making the stakeholders familiar with the solution developed

When the model is evaluated and we are confident in the model we deploy it, typically first to a small set of users to put it through practical tests

Deployment also consists of developing a suitable method to enable our users to interact with and use the model as well as looking to ways to improve the model with a feedback system

## Feedback

Can you get constructive feedback into answering the question?

User feedback helps us to assess and refine the model’s performance and impact, and based on this feedback we can make changes to the model

Once the model is deployed we can make use of feedback and experience with the model to refine the model or incorporate different data into it that we had not initially considered