# Data Analysis

Updated: 03 September 2023

# Data Analysis With Python

Based on this Cognitive Class Course

## Labs

The labs for the course are located in the `Labs` folder; they are from CognitiveClass and are licensed under MIT

## Introduction

The data being analyzed is a used-car dataset, and the goal is to estimate the price of a used car based on its characteristics

- Dataset to be analyzed in Python
- Overview of Python Packages
- Importing and Exporting Data
- Basic Insights from the Data

### Understanding the Data

The data being used is the Autos dataset from the UCI Machine Learning Repository at archive.ics.uci.edu

The first few lines of the file are as follows

The description of the data can be found in the accompanying `imports-85.names` file; the attributes are as follows

| Column | Attribute | Attribute Range |
|---|---|---|
| 1 | symboling | -3, -2, -1, 0, 1, 2, 3 |
| 2 | normalized-losses | continuous from 65 to 256 |
| 3 | make | alfa-romero, audi, bmw, chevrolet, dodge, honda, isuzu, jaguar, mazda, mercedes-benz, mercury, mitsubishi, nissan, peugot, plymouth, porsche, renault, saab, subaru, toyota, volkswagen, volvo |
| 4 | fuel-type | diesel, gas |
| 5 | aspiration | std, turbo |
| 6 | num-of-doors | four, two |
| 7 | body-style | hardtop, wagon, sedan, hatchback, convertible |
| 8 | drive-wheels | 4wd, fwd, rwd |
| 9 | engine-location | front, rear |
| 10 | wheel-base | continuous from 86.6 to 120.9 |
| 11 | length | continuous from 141.1 to 208.1 |
| 12 | width | continuous from 60.3 to 72.3 |
| 13 | height | continuous from 47.8 to 59.8 |
| 14 | curb-weight | continuous from 1488 to 4066 |
| 15 | engine-type | dohc, dohcv, l, ohc, ohcf, ohcv, rotor |
| 16 | num-of-cylinders | eight, five, four, six, three, twelve, two |
| 17 | engine-size | continuous from 61 to 326 |
| 18 | fuel-system | 1bbl, 2bbl, 4bbl, idi, mfi, mpfi, spdi, spfi |
| 19 | bore | continuous from 2.54 to 3.94 |
| 20 | stroke | continuous from 2.07 to 4.17 |
| 21 | compression-ratio | continuous from 7 to 23 |
| 22 | horsepower | continuous from 48 to 288 |
| 23 | peak-rpm | continuous from 4150 to 6600 |
| 24 | city-mpg | continuous from 13 to 49 |
| 25 | highway-mpg | continuous from 16 to 54 |
| 26 | price | continuous from 5118 to 45400 |

Missing Attribute Values: (denoted by "?")

The *symboling* value is an indicator of the vehicle's risk level; the lower the value, the lower the risk (from an insurance perspective)

The *normalized-losses* value is the relative average insurance loss payment per insured vehicle year, normalized within the vehicle's size class

The *price* is the value that we would like to predict given the other features

Note that this dataset is from 1985, so the prices may seem a little low

### Python Libraries

#### Scientific Computing Libraries

##### Pandas

Pandas is a library for working with Data Structures, primarily DataFrames

##### NumPy

NumPy is a library for working with Arrays and Matrices

##### SciPy

SciPy includes functions for assisting in mathematical analysis as well as some basic visualizations

#### Visualization

##### Matplotlib

Matplotlib is the most commonly used Python library for data visualization

##### Seaborn

Seaborn is based on Matplotlib and provides functionality such as heat maps, time series, and violin plots

#### Algorithmic Libraries

##### Scikit-learn

Machine learning: regression, classification, etc.

##### Statsmodels

Explore data, estimate statistical models, and perform statistical tests

### Import and Export Data

#### Importing

When importing data we need to take a few things into consideration, such as

- Format
- File source/path

In our case the data is in CSV format (with a `.data` extension) and is located at a remote source

We can import our data as follows
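A minimal sketch of the import with `pd.read_csv`, using an inline two-row sample (truncated to four columns) in place of the remote file; the commented-out call shows the same read against the UCI source used in the course:

```python
import io

import pandas as pd

# Stand-in for the remote imports-85.data file (two sample rows, truncated to 4 columns)
raw = "3,?,alfa-romero,gas\n3,?,alfa-romero,gas\n"

# The .data file is plain CSV with no header row, so we pass header=None
df = pd.read_csv(io.StringIO(raw), header=None)

# Against the real remote source the call looks the same:
# df = pd.read_csv("https://archive.ics.uci.edu/ml/machine-learning-databases/autos/imports-85.data", header=None)
print(df.head())
```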

| | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | ... | 16 | 17 | 18 | 19 | 20 | 21 | 22 | 23 | 24 | 25 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 3 | ? | alfa-romero | gas | std | two | convertible | rwd | front | 88.6 | ... | 130 | mpfi | 3.47 | 2.68 | 9.0 | 111 | 5000 | 21 | 27 | 13495 |
| 1 | 3 | ? | alfa-romero | gas | std | two | convertible | rwd | front | 88.6 | ... | 130 | mpfi | 3.47 | 2.68 | 9.0 | 111 | 5000 | 21 | 27 | 16500 |

2 rows × 26 columns

We can see that our data comes in without headers; we can assign the headers to our data based on the information in the `imports-85.names` file
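A sketch of assigning the header names by setting `df.columns`; the four names shown here are the first of the 26 listed in `imports-85.names`:

```python
import pandas as pd

# First four of the 26 column names listed in imports-85.names
headers = ["symboling", "normalized-losses", "make", "fuel-type"]

df = pd.DataFrame([[3, "?", "alfa-romero", "gas"],
                   [1, "?", "alfa-romero", "gas"]])

# Replaces the default integer column labels with the named headers
df.columns = headers
print(df.columns.tolist())
```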

| | symboling | normalized-losses | make | fuel-type | aspiration | num-of-doors | body-style | drive-wheels | engine-location | wheel-base | ... | engine-size | fuel-system | bore | stroke | compression-ratio | horsepower | peak-rpm | city-mpg | highway-mpg | price |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 3 | ? | alfa-romero | gas | std | two | convertible | rwd | front | 88.6 | ... | 130 | mpfi | 3.47 | 2.68 | 9.0 | 111 | 5000 | 21 | 27 | 13495 |
| 1 | 3 | ? | alfa-romero | gas | std | two | convertible | rwd | front | 88.6 | ... | 130 | mpfi | 3.47 | 2.68 | 9.0 | 111 | 5000 | 21 | 27 | 16500 |
| 2 | 1 | ? | alfa-romero | gas | std | two | hatchback | rwd | front | 94.5 | ... | 152 | mpfi | 2.68 | 3.47 | 9.0 | 154 | 5000 | 19 | 26 | 16500 |
| 3 | 2 | 164 | audi | gas | std | four | sedan | fwd | front | 99.8 | ... | 109 | mpfi | 3.19 | 3.40 | 10.0 | 102 | 5500 | 24 | 30 | 13950 |
| 4 | 2 | 164 | audi | gas | std | four | sedan | 4wd | front | 99.4 | ... | 136 | mpfi | 3.19 | 3.40 | 8.0 | 115 | 5500 | 18 | 22 | 17450 |
| 5 | 2 | ? | audi | gas | std | two | sedan | fwd | front | 99.8 | ... | 136 | mpfi | 3.19 | 3.40 | 8.5 | 110 | 5500 | 19 | 25 | 15250 |
| 6 | 1 | 158 | audi | gas | std | four | sedan | fwd | front | 105.8 | ... | 136 | mpfi | 3.19 | 3.40 | 8.5 | 110 | 5500 | 19 | 25 | 17710 |
| 7 | 1 | ? | audi | gas | std | four | wagon | fwd | front | 105.8 | ... | 136 | mpfi | 3.19 | 3.40 | 8.5 | 110 | 5500 | 19 | 25 | 18920 |
| 8 | 1 | 158 | audi | gas | turbo | four | sedan | fwd | front | 105.8 | ... | 131 | mpfi | 3.13 | 3.40 | 8.3 | 140 | 5500 | 17 | 20 | 23875 |
| 9 | 0 | ? | audi | gas | turbo | two | hatchback | 4wd | front | 99.5 | ... | 131 | mpfi | 3.13 | 3.40 | 7.0 | 160 | 5500 | 16 | 22 | ? |

10 rows × 26 columns

Next we can remove the missing values from the price column as follows
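A sketch of that removal, assuming the "?" placeholders are first converted to NaN so pandas can recognize them as missing (the data here is illustrative):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"make": ["audi", "audi", "bmw"],
                   "price": ["13950", "?", "16430"]})

# Convert the '?' placeholder to NaN, then drop rows that have no price
df = df.replace("?", np.nan)
df = df.dropna(subset=["price"])
print(len(df))
```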

#### Analyzing Data

Pandas has a few different methods to understand the data

We need to do some basic checks such as

- Data Types
  - Pandas automatically assigns data types, which may not necessarily be correct
  - This further allows us to understand what functions we can apply to which columns

We can also view the statistical summary as follows
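The summary comes from `df.describe()`; a sketch on a single illustrative column:

```python
import pandas as pd

df = pd.DataFrame({"city-mpg": [21, 19, 24, 18]})

# describe() reports count, mean, std, min, quartiles and max for numeric columns
summary = df.describe()
print(summary)
```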

| | symboling | wheel-base | length | width | height | curb-weight | engine-size | compression-ratio | city-mpg | highway-mpg |
|---|---|---|---|---|---|---|---|---|---|---|
| count | 205.000000 | 205.000000 | 205.000000 | 205.000000 | 205.000000 | 205.000000 | 205.000000 | 205.000000 | 205.000000 | 205.000000 |
| mean | 0.834146 | 98.756585 | 174.049268 | 65.907805 | 53.724878 | 2555.565854 | 126.907317 | 10.142537 | 25.219512 | 30.751220 |
| std | 1.245307 | 6.021776 | 12.337289 | 2.145204 | 2.443522 | 520.680204 | 41.642693 | 3.972040 | 6.542142 | 6.886443 |
| min | -2.000000 | 86.600000 | 141.100000 | 60.300000 | 47.800000 | 1488.000000 | 61.000000 | 7.000000 | 13.000000 | 16.000000 |
| 25% | 0.000000 | 94.500000 | 166.300000 | 64.100000 | 52.000000 | 2145.000000 | 97.000000 | 8.600000 | 19.000000 | 25.000000 |
| 50% | 1.000000 | 97.000000 | 173.200000 | 65.500000 | 54.100000 | 2414.000000 | 120.000000 | 9.000000 | 24.000000 | 30.000000 |
| 75% | 2.000000 | 102.400000 | 183.100000 | 66.900000 | 55.500000 | 2935.000000 | 141.000000 | 9.400000 | 30.000000 | 34.000000 |
| max | 3.000000 | 120.900000 | 208.100000 | 72.300000 | 59.800000 | 4066.000000 | 326.000000 | 23.000000 | 49.000000 | 54.000000 |

However, if we want to view a summary including fields that are of type object, we can do the following
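A sketch with `include="all"`, which adds the object columns to the summary (with unique/top/freq rows, and NaN where a statistic does not apply):

```python
import pandas as pd

df = pd.DataFrame({"make": ["audi", "audi", "bmw"],
                   "city-mpg": [24, 19, 21]})

# include="all" covers object columns as well as numeric ones
summary = df.describe(include="all")
print(summary)
```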

| | symboling | normalized-losses | make | fuel-type | aspiration | num-of-doors | body-style | drive-wheels | engine-location | wheel-base | ... | engine-size | fuel-system | bore | stroke | compression-ratio | horsepower | peak-rpm | city-mpg | highway-mpg | price |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 205.000000 | 205 | 205 | 205 | 205 | 205 | 205 | 205 | 205 | 205.000000 | ... | 205.000000 | 205 | 205 | 205 | 205.000000 | 205 | 205 | 205.000000 | 205.000000 | 205 |
| unique | NaN | 52 | 22 | 2 | 2 | 3 | 5 | 3 | 2 | NaN | ... | NaN | 8 | 39 | 37 | NaN | 60 | 24 | NaN | NaN | 187 |
| top | NaN | ? | toyota | gas | std | four | sedan | fwd | front | NaN | ... | NaN | mpfi | 3.62 | 3.40 | NaN | 68 | 5500 | NaN | NaN | ? |
| freq | NaN | 41 | 32 | 185 | 168 | 114 | 96 | 120 | 202 | NaN | ... | NaN | 94 | 23 | 20 | NaN | 19 | 37 | NaN | NaN | 4 |
| mean | 0.834146 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 98.756585 | ... | 126.907317 | NaN | NaN | NaN | 10.142537 | NaN | NaN | 25.219512 | 30.751220 | NaN |
| std | 1.245307 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 6.021776 | ... | 41.642693 | NaN | NaN | NaN | 3.972040 | NaN | NaN | 6.542142 | 6.886443 | NaN |
| min | -2.000000 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 86.600000 | ... | 61.000000 | NaN | NaN | NaN | 7.000000 | NaN | NaN | 13.000000 | 16.000000 | NaN |
| 25% | 0.000000 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 94.500000 | ... | 97.000000 | NaN | NaN | NaN | 8.600000 | NaN | NaN | 19.000000 | 25.000000 | NaN |
| 50% | 1.000000 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 97.000000 | ... | 120.000000 | NaN | NaN | NaN | 9.000000 | NaN | NaN | 24.000000 | 30.000000 | NaN |
| 75% | 2.000000 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 102.400000 | ... | 141.000000 | NaN | NaN | NaN | 9.400000 | NaN | NaN | 30.000000 | 34.000000 | NaN |
| max | 3.000000 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 120.900000 | ... | 326.000000 | NaN | NaN | NaN | 23.000000 | NaN | NaN | 49.000000 | 54.000000 | NaN |

11 rows × 26 columns

We can also use `df.info()` to view a concise summary of the dataframe, including each column's data type and non-null count

If we would like a more complete report of the dataset, however, we can make use of the `pandas_profiling` library, which outputs a full overall data visualization

## Preprocessing Data

Data preprocessing is the process of cleaning the data in order to bring it from its raw form into a more usable format

This can include different stages such as

- Missing value handling
- Data formatting
- Data normalization
- Data binning
- Turning categorical values into numerical values

### Missing Values

A value is missing when no data value is stored for a specific feature in an observation

Missing data can be represented in many different ways such as ?, N/A, 0 or blank among other ways

In our dataset, missing values (such as in the normalized-losses column) are represented as "?"

We can deal with missing values in a variety of ways

- Check if the correct data can be found
- Drop the feature
- Drop the data entry
- Replace with average or similar datapoints
- Replace with most common value
- Replace based on knowledge about data
- Leave the data as missing data

Dropping missing data can be done with the `df.dropna()` function; since we have missing values in the price column, those rows will need to be dropped

We also need to make use of the `inplace=True` parameter to modify the actual dataframe

The `df.replace()` function allows us to replace missing values in the data
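A sketch of swapping the "?" placeholders for NaN so that pandas treats them as missing (illustrative data):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"normalized-losses": ["?", "164", "?"]})

# Replace the '?' placeholder with NaN across the whole dataframe
df.replace("?", np.nan, inplace=True)
print(df)
```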

| | symboling | normalized-losses | make | fuel-type | aspiration | num-of-doors | body-style | drive-wheels | engine-location | wheel-base | ... | engine-size | fuel-system | bore | stroke | compression-ratio | horsepower | peak-rpm | city-mpg | highway-mpg | price |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 3 | NaN | alfa-romero | gas | std | two | convertible | rwd | front | 88.6 | ... | 130 | mpfi | 3.47 | 2.68 | 9.0 | 111 | 5000 | 21 | 27 | 13495 |
| 1 | 3 | NaN | alfa-romero | gas | std | two | convertible | rwd | front | 88.6 | ... | 130 | mpfi | 3.47 | 2.68 | 9.0 | 111 | 5000 | 21 | 27 | 16500 |
| 2 | 1 | NaN | alfa-romero | gas | std | two | hatchback | rwd | front | 94.5 | ... | 152 | mpfi | 2.68 | 3.47 | 9.0 | 154 | 5000 | 19 | 26 | 16500 |
| 3 | 2 | 164 | audi | gas | std | four | sedan | fwd | front | 99.8 | ... | 109 | mpfi | 3.19 | 3.40 | 10.0 | 102 | 5500 | 24 | 30 | 13950 |
| 4 | 2 | 164 | audi | gas | std | four | sedan | 4wd | front | 99.4 | ... | 136 | mpfi | 3.19 | 3.40 | 8.0 | 115 | 5500 | 18 | 22 | 17450 |
| 5 | 2 | NaN | audi | gas | std | two | sedan | fwd | front | 99.8 | ... | 136 | mpfi | 3.19 | 3.40 | 8.5 | 110 | 5500 | 19 | 25 | 15250 |
| 6 | 1 | 158 | audi | gas | std | four | sedan | fwd | front | 105.8 | ... | 136 | mpfi | 3.19 | 3.40 | 8.5 | 110 | 5500 | 19 | 25 | 17710 |
| 7 | 1 | NaN | audi | gas | std | four | wagon | fwd | front | 105.8 | ... | 136 | mpfi | 3.19 | 3.40 | 8.5 | 110 | 5500 | 19 | 25 | 18920 |
| 8 | 1 | 158 | audi | gas | turbo | four | sedan | fwd | front | 105.8 | ... | 131 | mpfi | 3.13 | 3.40 | 8.3 | 140 | 5500 | 17 | 20 | 23875 |
| 9 | 0 | NaN | audi | gas | turbo | two | hatchback | 4wd | front | 99.5 | ... | 131 | mpfi | 3.13 | 3.40 | 7.0 | 160 | 5500 | 16 | 22 | NaN |

10 rows × 26 columns

Next, we can count the missing values in each column
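`df.isnull()` yields a boolean frame (as shown below); summing it gives the per-column counts. A sketch on illustrative data:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"normalized-losses": [np.nan, 164.0, np.nan],
                   "price": [13495.0, np.nan, 16500.0]})

# True marks a missing value; summing booleans counts them per column
missing_counts = df.isnull().sum()
print(missing_counts)
```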

| | symboling | normalized-losses | make | fuel-type | aspiration | num-of-doors | body-style | drive-wheels | engine-location | wheel-base | ... | engine-size | fuel-system | bore | stroke | compression-ratio | horsepower | peak-rpm | city-mpg | highway-mpg | price |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | False | True | False | False | False | False | False | False | False | False | ... | False | False | False | False | False | False | False | False | False | False |
| 1 | False | True | False | False | False | False | False | False | False | False | ... | False | False | False | False | False | False | False | False | False | False |
| 2 | False | True | False | False | False | False | False | False | False | False | ... | False | False | False | False | False | False | False | False | False | False |
| 3 | False | False | False | False | False | False | False | False | False | False | ... | False | False | False | False | False | False | False | False | False | False |
| 4 | False | False | False | False | False | False | False | False | False | False | ... | False | False | False | False | False | False | False | False | False | False |

5 rows × 26 columns

We’ll replace the missing data as follows

- Replace by Mean
  - “normalized-losses”: 41 missing values
  - “stroke”: 4 missing values
  - “bore”: 4 missing values
  - “horsepower”: 2 missing values
  - “peak-rpm”: 2 missing values
- Replace by Frequency
  - “num-of-doors”: 2 missing values
    - Because most cars have four doors
- Drop the Row
  - “price”: 4 missing values
    - Because price is what we want to predict, entries without it do not help us

#### Normalized Losses
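A sketch of the mean replacement for normalized-losses (illustrative values); the same pattern applies to the bore, stroke, horsepower, and peak-rpm columns:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"normalized-losses": [np.nan, 164.0, 158.0]})

# Mean of the non-missing values, then fill the gaps with it
avg_loss = df["normalized-losses"].astype("float").mean()
df["normalized-losses"] = df["normalized-losses"].replace(np.nan, avg_loss)
print(df["normalized-losses"].tolist())
```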

#### Bore

#### Stroke

#### Horsepower

#### Peak RPM

#### Number of Doors

First we need to check which door count is the most common

And then replace invalid values with that
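A sketch of both steps on illustrative data, using `value_counts().idxmax()` to find the most common door count:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"num-of-doors": ["four", "two", "four", np.nan]})

# The most frequent category is the replacement value
most_common = df["num-of-doors"].value_counts().idxmax()
df["num-of-doors"] = df["num-of-doors"].replace(np.nan, most_common)
print(df["num-of-doors"].tolist())
```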

#### Price

And then lastly drop the rows with a missing price
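A sketch of the drop, followed by renumbering the index so it stays contiguous (illustrative data):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"make": ["audi", "audi", "bmw"],
                   "price": [13950.0, np.nan, 16430.0]})

# Drop rows with no price, then reset the row index
df.dropna(subset=["price"], axis=0, inplace=True)
df.reset_index(drop=True, inplace=True)
print(df)
```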

#### Clean Data

Then we can view our dataset, which now has no missing values

| | symboling | normalized-losses | make | fuel-type | aspiration | num-of-doors | body-style | drive-wheels | engine-location | wheel-base | ... | engine-size | fuel-system | bore | stroke | compression-ratio | horsepower | peak-rpm | city-mpg | highway-mpg | price |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 3 | 122 | alfa-romero | gas | std | two | convertible | rwd | front | 88.6 | ... | 130 | mpfi | 3.47 | 2.68 | 9.0 | 111 | 5000 | 21 | 27 | 13495 |
| 1 | 3 | 122 | alfa-romero | gas | std | two | convertible | rwd | front | 88.6 | ... | 130 | mpfi | 3.47 | 2.68 | 9.0 | 111 | 5000 | 21 | 27 | 16500 |
| 2 | 1 | 122 | alfa-romero | gas | std | two | hatchback | rwd | front | 94.5 | ... | 152 | mpfi | 2.68 | 3.47 | 9.0 | 154 | 5000 | 19 | 26 | 16500 |
| 3 | 2 | 164 | audi | gas | std | four | sedan | fwd | front | 99.8 | ... | 109 | mpfi | 3.19 | 3.40 | 10.0 | 102 | 5500 | 24 | 30 | 13950 |
| 4 | 2 | 164 | audi | gas | std | four | sedan | 4wd | front | 99.4 | ... | 136 | mpfi | 3.19 | 3.40 | 8.0 | 115 | 5500 | 18 | 22 | 17450 |

5 rows × 26 columns

### Formatting Data

Now we need to ensure that our data is correctly formatted

- Data is usually collected in different places and stored in different formats
- Bringing data into a common standard allows for more meaningful comparison

In order to format data, Pandas comes with some tools for us to use

We can also apply mathematical operations to our data, as well as type conversions and column renames

Sometimes an incorrect data type is assigned by default, so it may be necessary for us to convert our data to the correct type for analysis. We can convert datatypes using the `df.astype()` function, and check types with `df.dtypes`

#### Data Types

#### Convert types

We can easily convert our data to the correct types and then view the corrected types
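A sketch of casting numeric columns that were read in as strings (illustrative data):

```python
import pandas as pd

df = pd.DataFrame({"bore": ["3.47", "2.68"],
                   "price": ["13495", "16500"]})

# Cast numeric columns stored as strings to their proper types
df[["bore"]] = df[["bore"]].astype("float")
df[["price"]] = df[["price"]].astype("float")

# dtypes confirms the conversion
print(df.dtypes)
```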

#### Standardization

Standardization is the process of taking data from one format to another that may be more meaningful for us to use, such as converting our fuel consumption
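The mpg to L/100km conversion divides 235 by the mpg value (an approximate unit conversion); a sketch on illustrative values:

```python
import pandas as pd

df = pd.DataFrame({"city-mpg": [21, 19, 24]})

# L/100km = 235 / mpg (approximate unit-conversion factor)
df["city-L/100km"] = 235 / df["city-mpg"]
print(df.head())
```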

| | city-mpg | city-L/100km |
|---|---|---|
| 0 | 21 | 11.190476 |
| 1 | 21 | 11.190476 |
| 2 | 19 | 12.368421 |
| 3 | 24 | 9.791667 |
| 4 | 18 | 13.055556 |

| | symboling | normalized-losses | make | fuel-type | aspiration | num-of-doors | body-style | drive-wheels | engine-location | wheel-base | ... | fuel-system | bore | stroke | compression-ratio | horsepower | peak-rpm | city-mpg | highway-mpg | price | city-L/100km |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 3 | 122 | alfa-romero | gas | std | two | convertible | rwd | front | 88.6 | ... | mpfi | 3.47 | 2.68 | 9.0 | 111 | 5000.0 | 21 | 27 | 13495.0 | 11.190476 |
| 1 | 3 | 122 | alfa-romero | gas | std | two | convertible | rwd | front | 88.6 | ... | mpfi | 3.47 | 2.68 | 9.0 | 111 | 5000.0 | 21 | 27 | 16500.0 | 11.190476 |
| 2 | 1 | 122 | alfa-romero | gas | std | two | hatchback | rwd | front | 94.5 | ... | mpfi | 2.68 | 3.47 | 9.0 | 154 | 5000.0 | 19 | 26 | 16500.0 | 12.368421 |
| 3 | 2 | 164 | audi | gas | std | four | sedan | fwd | front | 99.8 | ... | mpfi | 3.19 | 3.40 | 10.0 | 102 | 5500.0 | 24 | 30 | 13950.0 | 9.791667 |
| 4 | 2 | 164 | audi | gas | std | four | sedan | 4wd | front | 99.4 | ... | mpfi | 3.19 | 3.40 | 8.0 | 115 | 5500.0 | 18 | 22 | 17450.0 | 13.055556 |

5 rows × 27 columns

| | highway-mpg | highway-L/100km |
|---|---|---|
| 0 | 27 | 8.703704 |
| 1 | 27 | 8.703704 |
| 2 | 26 | 9.038462 |
| 3 | 30 | 7.833333 |
| 4 | 22 | 10.681818 |

| | symboling | normalized-losses | make | fuel-type | aspiration | num-of-doors | body-style | drive-wheels | engine-location | wheel-base | ... | bore | stroke | compression-ratio | horsepower | peak-rpm | city-mpg | highway-mpg | price | city-L/100km | highway-L/100km |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 3 | 122 | alfa-romero | gas | std | two | convertible | rwd | front | 88.6 | ... | 3.47 | 2.68 | 9.0 | 111 | 5000.0 | 21 | 27 | 13495.0 | 11.190476 | 8.703704 |
| 1 | 3 | 122 | alfa-romero | gas | std | two | convertible | rwd | front | 88.6 | ... | 3.47 | 2.68 | 9.0 | 111 | 5000.0 | 21 | 27 | 16500.0 | 11.190476 | 8.703704 |
| 2 | 1 | 122 | alfa-romero | gas | std | two | hatchback | rwd | front | 94.5 | ... | 2.68 | 3.47 | 9.0 | 154 | 5000.0 | 19 | 26 | 16500.0 | 12.368421 | 9.038462 |
| 3 | 2 | 164 | audi | gas | std | four | sedan | fwd | front | 99.8 | ... | 3.19 | 3.40 | 10.0 | 102 | 5500.0 | 24 | 30 | 13950.0 | 9.791667 | 7.833333 |
| 4 | 2 | 164 | audi | gas | std | four | sedan | 4wd | front | 99.4 | ... | 3.19 | 3.40 | 8.0 | 115 | 5500.0 | 18 | 22 | 17450.0 | 13.055556 | 10.681818 |

5 rows × 28 columns

#### Renaming Columns

If there is a need for us to rename columns we can do so with `df.rename()`, for example
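A sketch of the rename; the column mapping here is illustrative:

```python
import pandas as pd

df = pd.DataFrame({"highway-mpg": [27, 26]})

# rename takes a mapping of old name -> new name (mapping shown is illustrative)
df.rename(columns={"highway-mpg": "highway-L/100km"}, inplace=True)
print(df.columns.tolist())
```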

### Data Normalization

This is an important part of data preprocessing

We may need to normalize our variables such that the range of our data is more consistent, allowing us to manage the way that different values will impact our analysis

There are several different ways to normalize data

#### Simple Feature Scaling

$x_{new}=\frac{x_{old}}{x_{max}}$

#### Min-Max

$x_{new}=\frac{x_{old}-x_{min}}{x_{max}-x_{min}}$

#### Z-Score

The resulting values hover around zero, and are expressed in terms of standard deviations from the mean

$x_{new}=\frac{x_{old}-\mu}{\sigma}$

#### Normalization

Next we will normalize our values using the *Simple Feature Scaling Method*

We will apply this to the following features

- ‘length’
- ‘width’
- ‘height’
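A sketch of simple feature scaling applied to one of these columns (illustrative values); each value is divided by the column maximum, so the result lies in (0, 1]:

```python
import pandas as pd

df = pd.DataFrame({"length": [168.8, 171.2, 176.6]})

# Simple feature scaling: divide each value by the column maximum
df["length"] = df["length"] / df["length"].max()
print(df["length"].tolist())
```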

| | length | width | height |
|---|---|---|---|
| 0 | 0.811148 | 0.890278 | 0.816054 |
| 1 | 0.811148 | 0.890278 | 0.816054 |
| 2 | 0.822681 | 0.909722 | 0.876254 |
| 3 | 0.848630 | 0.919444 | 0.908027 |
| 4 | 0.848630 | 0.922222 | 0.908027 |

### Binning

Binning is grouping values together into bins; this can sometimes improve the accuracy of predictive models and help us better understand the data distribution

We’ll arrange our ‘horsepower’ column into bins such that we can label cars as having a ‘Low’, ‘Medium’, or ‘High’ horsepower as follows

We can then use `pd.cut()` to determine which bin each value belongs in
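A sketch of the binning on illustrative values, using `np.linspace` to build three equal-width bins over the horsepower range:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"horsepower": [48.0, 130.0, 170.0, 210.0, 288.0]})

# Four equally spaced cut points give three equal-width bins
bins = np.linspace(df["horsepower"].min(), df["horsepower"].max(), 4)
group_names = ["Low", "Medium", "High"]

# include_lowest=True keeps the minimum value inside the first bin
df["horsepower-binned"] = pd.cut(df["horsepower"], bins,
                                 labels=group_names, include_lowest=True)
print(df)
```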

| | horsepower | horsepower-binned |
|---|---|---|
| 0 | 111.0 | Medium |
| 1 | 111.0 | Medium |
| 2 | 154.0 | Medium |
| 3 | 102.0 | Medium |
| 4 | 115.0 | Medium |
| 5 | 110.0 | Medium |
| 6 | 110.0 | Medium |
| 7 | 110.0 | Medium |
| 8 | 140.0 | Medium |
| 9 | 101.0 | Low |
| 10 | 101.0 | Low |
| 11 | 121.0 | Medium |
| 12 | 121.0 | Medium |
| 13 | 121.0 | Medium |
| 14 | 182.0 | High |
| 15 | 182.0 | High |
| 16 | 182.0 | High |
| 17 | 48.0 | Low |
| 18 | 70.0 | Low |
| 19 | 70.0 | Low |

#### Visualization

We can use `matplotlib` to visualize the number of vehicles in each of our bins with the following
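A sketch of the bin-count visualization as a bar chart (illustrative data; the Agg backend and output file name are assumptions so this runs headless):

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so this runs without a display
import matplotlib.pyplot as plt
import pandas as pd

df = pd.DataFrame({"horsepower-binned": ["Low", "Medium", "Medium", "High"]})

# Number of vehicles per bin
counts = df["horsepower-binned"].value_counts()

plt.bar(counts.index, counts.values)
plt.xlabel("horsepower")
plt.ylabel("count")
plt.title("horsepower bins")
plt.savefig("horsepower_bins.png")
```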

### Categorical to Quantitative

Most statistical models cannot take in strings as inputs. To handle this, we provide dummy variables that encode whether or not an entry takes a certain string value; we can do this in pandas with `pd.get_dummies()`

This method will automatically generate a dummy vector for the provided dataframe, and is used as follows

#### Fuel

We will do this for the ‘fuel-type’ column in our data
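A sketch of the dummy-variable generation on illustrative data; `pd.get_dummies` creates one indicator column per category:

```python
import pandas as pd

df = pd.DataFrame({"fuel-type": ["gas", "gas", "diesel"]})

# One indicator column per category; the set column marks the category of that row
dummies = pd.get_dummies(df["fuel-type"])
print(dummies)
```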

| | diesel | gas |
|---|---|---|
| 0 | 0 | 1 |
| 1 | 0 | 1 |
| 2 | 0 | 1 |
| 3 | 0 | 1 |
| 4 | 0 | 1 |

Then we will rename our columns for clarity

| | fuel-type-diesel | fuel-type-gas |
|---|---|---|
| 0 | 0 | 1 |
| 1 | 0 | 1 |
| 2 | 0 | 1 |
| 3 | 0 | 1 |
| 4 | 0 | 1 |

Then we will add these columns to our dataset
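A sketch of attaching the indicator columns with `pd.concat` and dropping the original categorical column (illustrative data):

```python
import pandas as pd

df = pd.DataFrame({"make": ["audi", "bmw"],
                   "fuel-type": ["gas", "diesel"]})
dummies = pd.get_dummies(df["fuel-type"]).rename(
    columns={"diesel": "fuel-type-diesel", "gas": "fuel-type-gas"})

# Attach the indicator columns and drop the now-redundant categorical column
df = pd.concat([df, dummies], axis=1)
df.drop("fuel-type", axis=1, inplace=True)
print(df.columns.tolist())
```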

| | symboling | normalized-losses | make | aspiration | num-of-doors | body-style | drive-wheels | engine-location | wheel-base | length | ... | horsepower | peak-rpm | city-mpg | highway-mpg | price | city-L/100km | highway-L/100km | horsepower-binned | fuel-type-diesel | fuel-type-gas |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 3 | 122 | alfa-romero | std | two | convertible | rwd | front | 88.6 | 0.811148 | ... | 111.0 | 5000.0 | 21 | 27 | 13495.0 | 11.190476 | 8.703704 | Medium | 0 | 1 |
| 1 | 3 | 122 | alfa-romero | std | two | convertible | rwd | front | 88.6 | 0.811148 | ... | 111.0 | 5000.0 | 21 | 27 | 16500.0 | 11.190476 | 8.703704 | Medium | 0 | 1 |
| 2 | 1 | 122 | alfa-romero | std | two | hatchback | rwd | front | 94.5 | 0.822681 | ... | 154.0 | 5000.0 | 19 | 26 | 16500.0 | 12.368421 | 9.038462 | Medium | 0 | 1 |
| 3 | 2 | 164 | audi | std | four | sedan | fwd | front | 99.8 | 0.848630 | ... | 102.0 | 5500.0 | 24 | 30 | 13950.0 | 9.791667 | 7.833333 | Medium | 0 | 1 |
| 4 | 2 | 164 | audi | std | four | sedan | 4wd | front | 99.4 | 0.848630 | ... | 115.0 | 5500.0 | 18 | 22 | 17450.0 | 13.055556 | 10.681818 | Medium | 0 | 1 |

5 rows × 30 columns

#### Aspiration

We will do the same as above with our aspiration variable

| | aspiration-std | aspiration-turbo |
|---|---|---|
| 0 | 1 | 0 |
| 1 | 1 | 0 |
| 2 | 1 | 0 |
| 3 | 1 | 0 |
| 4 | 1 | 0 |

| | symboling | normalized-losses | make | num-of-doors | body-style | drive-wheels | engine-location | wheel-base | length | width | ... | city-mpg | highway-mpg | price | city-L/100km | highway-L/100km | horsepower-binned | fuel-type-diesel | fuel-type-gas | aspiration-std | aspiration-turbo |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 3 | 122 | alfa-romero | two | convertible | rwd | front | 88.6 | 0.811148 | 0.890278 | ... | 21 | 27 | 13495.0 | 11.190476 | 8.703704 | Medium | 0 | 1 | 1 | 0 |
| 1 | 3 | 122 | alfa-romero | two | convertible | rwd | front | 88.6 | 0.811148 | 0.890278 | ... | 21 | 27 | 16500.0 | 11.190476 | 8.703704 | Medium | 0 | 1 | 1 | 0 |
| 2 | 1 | 122 | alfa-romero | two | hatchback | rwd | front | 94.5 | 0.822681 | 0.909722 | ... | 19 | 26 | 16500.0 | 12.368421 | 9.038462 | Medium | 0 | 1 | 1 | 0 |
| 3 | 2 | 164 | audi | four | sedan | fwd | front | 99.8 | 0.848630 | 0.919444 | ... | 24 | 30 | 13950.0 | 9.791667 | 7.833333 | Medium | 0 | 1 | 1 | 0 |
| 4 | 2 | 164 | audi | four | sedan | 4wd | front | 99.4 | 0.848630 | 0.922222 | ... | 18 | 22 | 17450.0 | 13.055556 | 10.681818 | Medium | 0 | 1 | 1 | 0 |

5 rows × 31 columns

### Exporting

Then we can export our clean data to a CSV as follows
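A sketch of the export with `df.to_csv` (the file name is illustrative); `index=False` keeps the row numbers out of the file:

```python
import os

import pandas as pd

df = pd.DataFrame({"make": ["audi"], "price": [13950.0]})

# Write the cleaned dataframe out; index=False omits the row index column
df.to_csv("clean_df.csv", index=False)
print(os.path.exists("clean_df.csv"))
```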

## Exploratory Data Analysis

Exploratory Data Analysis (EDA) is an approach to analyze data in order to

- Summarize main characteristics of data
- Gain better understanding of a dataset
- Uncover relationships between variables
- Identify important variables that impact our problem

### Descriptive Statistics

When analyzing data it is important to describe the data giving a few short summaries. We have a few different ways to do this, such as

- `df.describe()` to get general statistical information about numeric data
- `df['column'].value_counts()` to get a count of the different values of categorical data
- `sns.boxplot()` to generate box plots
- `plt.scatter()` to generate scatter plots, which show the relationship between the predictor and target variables

### Correlation

Correlation is a measure of the extent to which different variables are interdependent

Correlation does not imply causation

#### Pearson Correlation

We can make use of the Pearson correlation to measure the strength of correlation between two features; it has two values

- Correlation Coefficient
  - Close to 1: Positive relationship
  - Close to -1: Negative relationship
  - Close to 0: No relationship
- P-Value
  - P < 0.001: Strong certainty in the result
  - P < 0.05: Moderate certainty in the result
  - P < 0.1: Weak certainty in the result
  - P > 0.1: No certainty in the result

We would define a strong correlation when the Correlation Coefficient is around 1 or -1, and the P-Value is less than 0.001

We can find our correlation values for two variables with `stats.pearsonr` from `scipy`

##### Wheel Base and Price

##### Horsepower and Price
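A sketch with `scipy.stats.pearsonr` on illustrative values; the lab runs the same call on the wheel-base/price and horsepower/price columns:

```python
from scipy import stats

# Illustrative values standing in for the horsepower and price columns
horsepower = [48, 70, 111, 154, 182]
price = [5118, 7775, 13495, 16500, 23875]

# Returns the correlation coefficient and the p-value
coef, p_value = stats.pearsonr(horsepower, price)
print(coef, p_value)
```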

#### Positive Correlation

We can visualize our data using some of the visualizations we described above such as by comparing engine size to price

We can also view the correlation matrix for these two variables as follows
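A sketch of both steps on illustrative data: a scatter plot of engine size against price, then the pairwise correlation matrix via `df.corr()` (the Agg backend and file name are assumptions so this runs headless):

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend
import matplotlib.pyplot as plt
import pandas as pd

df = pd.DataFrame({"engine-size": [109, 130, 136, 152],
                   "price": [13950.0, 13495.0, 17450.0, 16500.0]})

# Scatter plot of engine size against price
plt.scatter(df["engine-size"], df["price"])
plt.xlabel("engine-size")
plt.ylabel("price")
plt.savefig("engine_size_vs_price.png")

# Pairwise correlation matrix for the two columns
corr = df[["engine-size", "price"]].corr()
print(corr)
```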

| | engine-size | price |
|---|---|---|
| engine-size | 1.000000 | 0.872335 |
| price | 0.872335 | 1.000000 |

From this we can see that engine size is a fairly good predictor of price

#### Negative Correlation

We can also look at the correlation between highway mileage and price

| | highway-mpg | price |
|---|---|---|
| highway-mpg | 1.000000 | -0.704692 |
| price | -0.704692 | 1.000000 |

Where we can note a Negative Linear Relationship

#### Weak Correlation

We can then compare Peak RPM and price

| | peak-rpm | price |
|---|---|---|
| peak-rpm | 1.000000 | -0.101616 |
| price | -0.101616 | 1.000000 |

Where we can see a weak linear relationship

We can also observe the relationships between stroke and price

| | stroke | price |
|---|---|---|
| stroke | 1.000000 | 0.082269 |
| price | 0.082269 | 1.000000 |

Next we can look at boxplots of our different categorical values in order to look at their distribution
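A sketch of one such boxplot with `sns.boxplot` on illustrative data, showing the price distribution within each body style (the Agg backend and file name are assumptions so this runs headless):

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

df = pd.DataFrame({"body-style": ["sedan", "sedan", "hatchback", "hatchback"],
                   "price": [13950.0, 17450.0, 8396.0, 9988.0]})

# One box per body style, showing the spread of prices within each category
sns.boxplot(x="body-style", y="price", data=df)
plt.savefig("price_by_body_style.png")
```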

#### Descriptive Statistics

Next we can find some descriptive statistics about our data by the following

| | symboling | normalized-losses | wheel-base | length | width | height | curb-weight | engine-size | bore | stroke | ... | peak-rpm | city-mpg | highway-mpg | price | city-L/100km | highway-L/100km | fuel-type-diesel | fuel-type-gas | aspiration-std | aspiration-turbo |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 201.000000 | 201.00000 | 201.000000 | 201.000000 | 201.000000 | 201.000000 | 201.000000 | 201.000000 | 201.000000 | 201.000000 | ... | 201.000000 | 201.000000 | 201.000000 | 201.000000 | 201.000000 | 201.000000 | 201.000000 | 201.000000 | 201.000000 | 201.000000 |
| mean | 0.840796 | 122.00000 | 98.797015 | 0.837102 | 0.915126 | 0.899108 | 2555.666667 | 126.875622 | 3.330692 | 3.256874 | ... | 5117.665368 | 25.179104 | 30.686567 | 13207.129353 | 9.944145 | 8.044957 | 0.099502 | 0.900498 | 0.820896 | 0.179104 |
| std | 1.254802 | 31.99625 | 6.066366 | 0.059213 | 0.029187 | 0.040933 | 517.296727 | 41.546834 | 0.268072 | 0.316048 | ... | 478.113805 | 6.423220 | 6.815150 | 7947.066342 | 2.534599 | 1.840739 | 0.300083 | 0.300083 | 0.384397 | 0.384397 |
| min | -2.000000 | 65.00000 | 86.600000 | 0.678039 | 0.837500 | 0.799331 | 1488.000000 | 61.000000 | 2.540000 | 2.070000 | ... | 4150.000000 | 13.000000 | 16.000000 | 5118.000000 | 4.795918 | 4.351852 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
| 25% | 0.000000 | 101.00000 | 94.500000 | 0.801538 | 0.890278 | 0.869565 | 2169.000000 | 98.000000 | 3.150000 | 3.110000 | ... | 4800.000000 | 19.000000 | 25.000000 | 7775.000000 | 7.833333 | 6.911765 | 0.000000 | 1.000000 | 1.000000 | 0.000000 |
| 50% | 1.000000 | 122.00000 | 97.000000 | 0.832292 | 0.909722 | 0.904682 | 2414.000000 | 120.000000 | 3.310000 | 3.290000 | ... | 5125.369458 | 24.000000 | 30.000000 | 10295.000000 | 9.791667 | 7.833333 | 0.000000 | 1.000000 | 1.000000 | 0.000000 |
| 75% | 2.000000 | 137.00000 | 102.400000 | 0.881788 | 0.925000 | 0.928094 | 2926.000000 | 141.000000 | 3.580000 | 3.410000 | ... | 5500.000000 | 30.000000 | 34.000000 | 16500.000000 | 12.368421 | 9.400000 | 0.000000 | 1.000000 | 1.000000 | 0.000000 |
| max | 3.000000 | 256.00000 | 120.900000 | 1.000000 | 1.000000 | 1.000000 | 4066.000000 | 326.000000 | 3.940000 | 4.170000 | ... | 6600.000000 | 49.000000 | 54.000000 | 45400.000000 | 18.076923 | 14.687500 | 1.000000 | 1.000000 | 1.000000 | 1.000000 |

8 rows × 22 columns

| | make | num-of-doors | body-style | drive-wheels | engine-location | engine-type | num-of-cylinders | fuel-system |
|---|---|---|---|---|---|---|---|---|
| count | 201 | 201 | 201 | 201 | 201 | 201 | 201 | 201 |
| unique | 22 | 2 | 5 | 3 | 2 | 6 | 7 | 8 |
| top | toyota | four | sedan | fwd | front | ohc | four | mpfi |
| freq | 32 | 115 | 94 | 118 | 198 | 145 | 157 | 92 |

##### Value Counts

We can get an indication as to the frequencies of our categorical values with the following
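A sketch with `value_counts()` on illustrative data; it returns the frequency of each category, most common first:

```python
import pandas as pd

df = pd.DataFrame({"drive-wheels": ["fwd", "fwd", "rwd", "4wd", "fwd"]})

# Frequency of each category, sorted most common first
counts = df["drive-wheels"].value_counts().to_frame()
print(counts)
```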

| drive-wheels | value_counts |
|---|---|
| fwd | 118 |
| rwd | 75 |
| 4wd | 8 |

| engine-location | value_counts |
|---|---|
| front | 198 |
| rear | 3 |

#### Grouping Data

It can be helpful to group data in order to see if there is a correlation between a certain categorical variable and some other variable

We can get an array of all unique 'drive-wheels' categories with `df['drive-wheels'].unique()`

In Pandas we can use `df.groupby()` to do this, as follows
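A sketch of both steps on illustrative data, taking the mean price per drive-wheels category:

```python
import pandas as pd

df = pd.DataFrame({"drive-wheels": ["rwd", "fwd", "fwd", "4wd"],
                   "price": [16500.0, 9233.0, 10345.0, 7603.0]})

# Unique categories of the grouping column
print(df["drive-wheels"].unique())

# Mean price per drive-wheels category; as_index=False keeps it a flat column
grouped = df[["drive-wheels", "price"]].groupby("drive-wheels", as_index=False).mean()
print(grouped)
```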

| | drive-wheels | price |
|---|---|---|
| 0 | 4wd | 10241.000000 |
| 1 | fwd | 9244.779661 |
| 2 | rwd | 19757.613333 |

Next we can look at the wheels compared to the body style
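Grouping on two columns gives one row per (drive-wheels, body-style) pair; a sketch on illustrative data:

```python
import pandas as pd

df = pd.DataFrame({"drive-wheels": ["rwd", "rwd", "fwd", "fwd"],
                   "body-style": ["sedan", "hatchback", "sedan", "sedan"],
                   "price": [21712.0, 14338.0, 9500.0, 10124.0]})

# Mean price per (drive-wheels, body-style) combination
grouped = df.groupby(["drive-wheels", "body-style"], as_index=False)["price"].mean()
print(grouped)
```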

| | drive-wheels | body-style | price |
|---|---|---|---|
| 0 | 4wd | hatchback | 7603.0 |
| 1 | 4wd | sedan | 12647.0 |
| 2 | 4wd | wagon | 9096.0 |
| 3 | fwd | convertible | 11595.0 |
| 4 | fwd | hardtop | 8249.0 |
| 5 | fwd | hatchback | 8396.0 |
| 6 | fwd | sedan | 9812.0 |
| 7 | fwd | wagon | 9997.0 |
| 8 | rwd | convertible | 23950.0 |
| 9 | rwd | hardtop | 24203.0 |
| 10 | rwd | hatchback | 14338.0 |
| 11 | rwd | sedan | 21712.0 |
| 12 | rwd | wagon | 16994.0 |

We can view this data a bit more conveniently as a pivot table, which we can create with `df.pivot()` so as to display the data as below
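A minimal sketch of the group-then-pivot step, with toy numbers in place of the autos data:

```python
import pandas as pd

# toy data: one price per drive-wheels / body-style pair
df = pd.DataFrame({
    'drive-wheels': ['fwd', 'fwd', 'rwd', 'rwd'],
    'body-style': ['sedan', 'wagon', 'sedan', 'wagon'],
    'price': [9812.0, 9997.0, 21712.0, 16994.0],
})

# mean price per pair, then reshape into a drive-wheels x body-style grid
grouped = df.groupby(['drive-wheels', 'body-style'], as_index=False).mean()
pivot = grouped.pivot(index='drive-wheels', columns='body-style')
print(pivot)
```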

price | |||||
---|---|---|---|---|---|

body-style | convertible | hardtop | hatchback | sedan | wagon |

drive-wheels | |||||

4wd | NaN | NaN | 7603.0 | 12647.0 | 9096.0 |

fwd | 11595.0 | 8249.0 | 8396.0 | 9812.0 | 9997.0 |

rwd | 23950.0 | 24203.0 | 14338.0 | 21712.0 | 16994.0 |

Next we can use a heatmap to visualize the above relationship; the heatmap plots the target variable as the colour with respect to the drive-wheels and body-style variables
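A sketch of such a heatmap using matplotlib's `pcolor` on a small hand-made pivot table (the figure itself isn't reproduced here):

```python
import matplotlib
matplotlib.use('Agg')  # non-interactive backend so this runs headless
import matplotlib.pyplot as plt
import pandas as pd

# toy pivot table of mean price by drive-wheels and body-style
pivot = pd.DataFrame(
    [[7603.0, 12647.0], [8396.0, 9812.0], [14338.0, 21712.0]],
    index=['4wd', 'fwd', 'rwd'],
    columns=['hatchback', 'sedan'],
)

# colour encodes price for each drive-wheels / body-style cell
fig, ax = plt.subplots()
im = ax.pcolor(pivot, cmap='RdBu')
ax.set_xticks([0.5, 1.5])
ax.set_xticklabels(pivot.columns)
ax.set_yticks([0.5, 1.5, 2.5])
ax.set_yticklabels(pivot.index)
fig.colorbar(im)
fig.savefig('heatmap.png')
```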

### Analysis of Variance (ANOVA)

If we want to analyze a categorical variable and see the correlation between its different categories we can use ANOVA

- Statistical comparison of groups

ANOVA is a statistical test that helps find the correlation between different groups of a categorical variable. The ANOVA returns two values

- F-Test score is the variation between sample group means divided by variation within the sample group
- P-Value is a measure of confidence and shows if the obtained result is statistically significant

##### Drive Wheels

drive-wheels | price | |
---|---|---|

0 | rwd | 13495.0 |

1 | rwd | 16500.0 |

3 | fwd | 13950.0 |

4 | 4wd | 17450.0 |

5 | fwd | 15250.0 |

136 | 4wd | 7603.0 |

To obtain the ANOVA values, we use the `stats.f_oneway` function from `scipy` for the correlation between drive wheels and price
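A minimal sketch of the ANOVA call, with made-up price samples per drive-wheels group:

```python
from scipy import stats

# toy price samples for each drive-wheels group
fwd_prices = [9244.0, 8396.0, 9812.0, 9997.0]
rwd_prices = [19757.0, 21712.0, 23950.0, 16994.0]
fourwd_prices = [10241.0, 9096.0, 12647.0, 7603.0]

# one-way ANOVA across the three groups
f_val, p_val = stats.f_oneway(fwd_prices, rwd_prices, fourwd_prices)
print('F =', f_val, ', P =', p_val)
```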

##### Drive Wheels and Price

This shows a large F and small P, meaning that there is a strong correlation between the three drive types and the price. However, does this apply for each individual comparison of drive types?

##### FWD and RWD

##### 4WD and RWD

##### 4WD and FWD

## Model Development

A model can be thought of as a mathematical equation used to predict a value given a specific input

More relevant data will allow us to more accurately predict an outcome

### Linear Regression

#### Simple Linear Regression

SLR helps us to identify the relationship between a single independent (predictor) variable and a dependent (target) variable, in the form

$y=b_0 + b_1x$

We make use of training datapoints to help us find the $b_0$ and $b_1$ values

Our measured datapoints will contain some noise, causing the data to deviate from the model

We make use of our training data to fit a model to our data, we then use the model to make predictions. We denote our predicted values as an estimate with

$\hat y=b_0+b_1x$

If the linear model is correct, we can assume that the offsets are noise; otherwise they may be due to other factors that we have not taken into consideration

We can create a linear regression model by importing it from `sklearn` and creating a new `LinearRegression` object as follows

Then we define the predictor and target variables and use `lm.fit()` to find the parameters $b_0$ and $b_1$

We can then obtain a prediction using `lm.predict()`

The intercept and slope are attributes of the `LinearRegression` object and can be found with `lm.intercept_` and `lm.coef_` respectively

We can train an SLR model for our data as follows
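The notebook code isn't reproduced here; a minimal sketch, with highway-mpg and price values standing in for the real columns:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# toy stand-in: highway-mpg as the predictor, price as the target
X = np.array([[27.0], [26.0], [30.0], [22.0], [25.0]])
Y = np.array([13495.0, 16500.0, 13950.0, 17450.0, 15250.0])

lm = LinearRegression()
lm.fit(X, Y)                    # estimates b0 and b1
Yhat = lm.predict(X)            # predicted prices
print(lm.intercept_, lm.coef_)  # b0 and b1
```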

symboling | normalized-losses | make | num-of-doors | body-style | drive-wheels | engine-location | wheel-base | length | width | ... | city-mpg | highway-mpg | price | city-L/100km | highway-L/100km | horsepower-binned | fuel-type-diesel | fuel-type-gas | aspiration-std | aspiration-turbo | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|

0 | 3 | 122 | alfa-romero | two | convertible | rwd | front | 88.6 | 0.811148 | 0.890278 | ... | 21 | 27 | 13495.0 | 11.190476 | 8.703704 | Medium | 0 | 1 | 1 | 0 |

1 | 3 | 122 | alfa-romero | two | convertible | rwd | front | 88.6 | 0.811148 | 0.890278 | ... | 21 | 27 | 16500.0 | 11.190476 | 8.703704 | Medium | 0 | 1 | 1 | 0 |

2 | 1 | 122 | alfa-romero | two | hatchback | rwd | front | 94.5 | 0.822681 | 0.909722 | ... | 19 | 26 | 16500.0 | 12.368421 | 9.038462 | Medium | 0 | 1 | 1 | 0 |

3 | 2 | 164 | audi | four | sedan | fwd | front | 99.8 | 0.848630 | 0.919444 | ... | 24 | 30 | 13950.0 | 9.791667 | 7.833333 | Medium | 0 | 1 | 1 | 0 |

4 | 2 | 164 | audi | four | sedan | 4wd | front | 99.4 | 0.848630 | 0.922222 | ... | 18 | 22 | 17450.0 | 13.055556 | 10.681818 | Medium | 0 | 1 | 1 | 0 |

5 rows × 31 columns

#### Multiple Linear Regression

MLR is used to explain the relationship between one continuous target variable and two or more predictor variables

The model will then be defined by the function

$\hat y=b_0 + b_1x_1 + b_2x_2 + b_3x_3 + ... + b_nx_n$

The resulting equation represents a surface in $n$ dimensional space that will define $\hat y$ given each component of $x$

We can train this model just as before as follows

And then predict as before, where `X` is in the form of the training data

The intercept and coefficients of the model can once again be found with `lm.intercept_` and `lm.coef_` respectively

We can develop a MLR model as follows
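A minimal sketch of the MLR fit, using a toy DataFrame with the four predictors the conclusion section refers to:

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

# toy stand-in for the autos DataFrame
df = pd.DataFrame({
    'horsepower': [111.0, 154.0, 102.0, 115.0, 110.0],
    'curb-weight': [2548.0, 2823.0, 2337.0, 2824.0, 2507.0],
    'engine-size': [130.0, 152.0, 109.0, 136.0, 136.0],
    'highway-mpg': [27.0, 26.0, 30.0, 22.0, 25.0],
    'price': [13495.0, 16500.0, 13950.0, 17450.0, 15250.0],
})

# predictor matrix Z and target price
Z = df[['horsepower', 'curb-weight', 'engine-size', 'highway-mpg']]
lm = LinearRegression()
lm.fit(Z, df['price'])
Yhat = lm.predict(Z)
```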

### Model Evaluation using Visualization

#### Regression Plot

Regression plots give us a good estimate of:

- The relationship between two variables
- The strength of correlation
- The direction of the relationship

A regression plot is a combination of a scatter plot, and a linear regression line

We create a regression plot using the `sns.regplot()` function in `seaborn`

##### Single Linear Regression

We can visualize SLR as follows
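A minimal sketch with toy highway-mpg/price data (the figure isn't reproduced here):

```python
import matplotlib
matplotlib.use('Agg')  # non-interactive backend
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# toy stand-in for the autos data
df = pd.DataFrame({
    'highway-mpg': [27.0, 26.0, 30.0, 22.0, 25.0, 29.0],
    'price': [13495.0, 16500.0, 13950.0, 17450.0, 15250.0, 12950.0],
})

# scatter plot plus fitted regression line
ax = sns.regplot(data=df, x='highway-mpg', y='price')
ax.set_ylim(0, None)
plt.savefig('regplot.png')
```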

#### Residual Plot

A residual plot represents the error between the actual and predicted values; the residuals should have zero mean if the linear assumption is applicable

If the residual plot is not equally distributed around the mean of 0 throughout, we know that the linear model is not correct for the data we have

We can create a residual plot with `sns.residplot()`
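A minimal sketch with the same toy columns (figure omitted):

```python
import matplotlib
matplotlib.use('Agg')  # non-interactive backend
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# toy stand-in for the autos data
df = pd.DataFrame({
    'highway-mpg': [27.0, 26.0, 30.0, 22.0, 25.0, 29.0],
    'price': [13495.0, 16500.0, 13950.0, 17450.0, 15250.0, 12950.0],
})

# residuals of a simple linear fit of price on highway-mpg
ax = sns.residplot(data=df, x='highway-mpg', y='price')
plt.savefig('residplot.png')
```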

#### Distribution Plot

We use this to look at the distribution of the actual vs predicted values for our model

A distribution plot can be made with `sns.distplot()`

##### Multiple Linear Regression

Using a distribution plot is more valuable when looking at MLR model performance

### Polynomial Regression and Pipelines

Polynomial Regression is a form of regression used to describe curvilinear relationships, in which the relationship between the target and the predictor variable is not linear

- Quadratic $\hat y=b_0 + b_1x_1 + b_2x_1^2$
- Cubic $\hat y=b_0 + b_1x_1 + b_2x_1^2 + b_3x_1^3$
- Higher Order $\hat y=b_0 + b_1x_1 + b_2x_1^2 + ... + b_nx_1^n$

Picking the correct order can make our model more accurate

We can fit a polynomial to our data using `np.polyfit()`, and can print out the resulting model with `np.poly1d()`

For an $n^{th}$ order polynomial we can create and view a model as follows
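A minimal sketch of a 3rd order fit on toy highway-mpg/price values:

```python
import numpy as np

# toy stand-in for highway-mpg and price
x = np.array([27.0, 26.0, 30.0, 22.0, 25.0, 29.0])
y = np.array([13495.0, 16500.0, 13950.0, 17450.0, 15250.0, 12950.0])

# fit a 3rd order polynomial and wrap it in a callable model
f = np.polyfit(x, y, 3)
p = np.poly1d(f)
print(p)  # pretty-prints the fitted polynomial
```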

#### Visualization

Visualizing Polynomial regression plots is a bit more work but can be done with the function below

Next we can do the polynomial fit for our data
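The lab's plotting helper isn't included here; a sketch of one (the `plot_polly` name and toy data are assumptions, figure omitted):

```python
import matplotlib
matplotlib.use('Agg')  # non-interactive backend
import matplotlib.pyplot as plt
import numpy as np

def plot_polly(model, independent_variable, dependent_variable, name):
    # plot the data points and the fitted curve over the variable's range
    x_new = np.linspace(independent_variable.min(), independent_variable.max(), 100)
    plt.plot(independent_variable, dependent_variable, '.', x_new, model(x_new), '-')
    plt.xlabel(name)
    plt.ylabel('Price of Cars')
    plt.savefig('poly_fit.png')

# toy stand-in data and a 3rd order fit
x = np.array([27.0, 26.0, 30.0, 22.0, 25.0, 29.0])
y = np.array([13495.0, 16500.0, 13950.0, 17450.0, 15250.0, 12950.0])
p = np.poly1d(np.polyfit(x, y, 3))
plot_polly(p, x, y, 'highway-mpg')
```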

### Multi Dimension Polynomial Regression

Polynomial regression can also be done in multiple dimensions, for example a second order approximation would look like the following

$\hat y=b_0 + b_1x_1 + b_2x_2 + b_3x_1x_2 + b_4x_1^2 + b_5x_2^2$

If we want to do multi-dimensional polynomial fits, we will need to use `PolynomialFeatures` from `sklearn.preprocessing` to preprocess our features as follows

Before the transformation we have 4 features

After the transformation we have 15 features
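A minimal sketch showing the 4-to-15 feature expansion on a single toy sample:

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

# one sample with 4 features (e.g. horsepower, curb-weight, engine-size, highway-mpg)
Z = np.array([[111.0, 2548.0, 130.0, 27.0]])

# degree-2 expansion: bias, linear, squared, and cross terms
pr = PolynomialFeatures(degree=2)
Z_pr = pr.fit_transform(Z)
print(Z.shape[1], '->', Z_pr.shape[1])  # 4 features before, 15 after
```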

### Pre-Processing

`sklearn` has some preprocessing functionality, such as the following

#### Normalization

We can train a scaler and normalize our data based on that as follows
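A minimal sketch with `StandardScaler` on a toy two-feature matrix:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# toy stand-in: two feature columns
X = np.array([[111.0, 27.0], [154.0, 26.0], [102.0, 30.0]])

scaler = StandardScaler()
scaler.fit(X)                  # learn per-column mean and standard deviation
X_scaled = scaler.transform(X)
print(X_scaled.mean(axis=0))   # each column now has (near) zero mean
```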

There are other normalization functions available with which we can preprocess our data

#### Pipelines

There are many steps to getting a prediction, such as Normalization, Polynomial Transformation, and training a Linear Regression Model

We can use a Pipeline library to help simplify the process

First we import all the libraries we will need as well as the pipeline library

Then we construct our pipeline using a list of tuples defined as `('name of estimator model', ModelConstructor())` and create our pipeline object

We can then fit our pipeline on the data using the `pipe.fit()` method

The above method will normalize the data, then perform a polynomial transform and output a prediction
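A minimal sketch of such a pipeline on toy data:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

# toy predictor matrix and target prices
Z = np.array([[111.0, 2548.0], [154.0, 2823.0], [102.0, 2337.0],
              [115.0, 2824.0], [110.0, 2507.0]])
y = np.array([13495.0, 16500.0, 13950.0, 17450.0, 15250.0])

# each step is a ('name', Constructor()) tuple; the last step is the estimator
steps = [('scale', StandardScaler()),
         ('polynomial', PolynomialFeatures(degree=2)),
         ('model', LinearRegression())]
pipe = Pipeline(steps)
pipe.fit(Z, y)          # normalize, transform, and fit in one call
ypipe = pipe.predict(Z)
```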

### In-Sample Evaluation

Two important measures that determine how well a model fits a specific dataset are

- Mean Squared Error (MSE)
- R-Squared ($R^2$)

#### Mean Squared Error

We find the average of the squared differences between the values predicted by our model and the actual data

To get the MSE in Python we can do the following
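A minimal sketch with `mean_squared_error` from `sklearn.metrics`, on toy actual/predicted prices:

```python
from sklearn.metrics import mean_squared_error

# toy actual prices and toy model predictions
actual = [13495.0, 16500.0, 13950.0]
predicted = [14000.0, 16000.0, 14500.0]

# average of the squared differences
mse = mean_squared_error(actual, predicted)
print(mse)
```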

#### R-Squared

$R^2$ is the coefficient of determination

- Measure of how close the data is to the fitted regression line
- The percentage of variation in Y that is explained by the model
- Like comparing our model to the mean of the data in approximating the data

$R^2$ is usually between 0 and 1

$R^2=1-\frac{\text{MSE of regression line}}{\text{MSE of } \bar y}$

We can get the $R^2$ value with `lm.score()`

A negative $R^2$ can be a sign of overfitting

#### In-Sample Evaluation of Models

We can evaluate our models with the following

Now we can check what the $R^2$ value for our model is
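A minimal sketch of the in-sample check, with toy highway-mpg/price data:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# toy stand-in for highway-mpg and price
X = np.array([[27.0], [26.0], [30.0], [22.0], [25.0]])
y = np.array([13495.0, 16500.0, 13950.0, 17450.0, 15250.0])

lm = LinearRegression()
lm.fit(X, y)
r_squared = lm.score(X, y)  # R^2 on the data the model was fit to
print(r_squared)
```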

### Prediction and Decision Making

We should use visualization, numerical evaluation, and model comparison in order to see if the model values make sense

To compare our model to the data we can simply plot the output of our model over the range of our data

#### Conclusion

From the results above (yes, they’re a mess but it’s all pretty much just from the CC Lab file) we can note the $R^2$ and MSE values are as follows

Simple Linear Regression: Using Highway-mpg as a Predictor Variable of Price.

- R-squared: 0.49659118843391759
- MSE: 3.16 x10^7

Multiple Linear Regression: Using Horsepower, Curb-weight, Engine-size, and Highway-mpg as Predictor Variables of Price.

- R-squared: 0.80896354913783497
- MSE: 1.2 x10^7

Polynomial Fit: Using Highway-mpg as a Predictor Variable of Price.

- R-squared: 0.6741946663906514
- MSE: 2.05 x 10^7

#### SLR vs MLR

Usually having more variables helps the prediction, however if you do not have enough data you can run into trouble or many of the variables may just be noise and not be very useful

The MSE for MLR is Smaller than for SLR, and the $R^2$ for MLR is higher as well, MSE and $R^2$ both seem to indicate that MLR is a better fit than SLR

#### SLR vs Polynomial

The MSE for the Polynomial Fit is less than for the SLR and the $R^2$ is higher meaning that the Polynomial Fit is a better predictor based on ‘highway-mpg’ than the SLR

#### MLR vs Polynomial

The MSE for the MLR is smaller than for the Polynomial Fit, and the MLR also has a higher $R^2$, therefore the MLR is a better fit than the Polynomial in this case

#### Overall

The MLR has the highest $R^2$ and the lowest MSE, meaning that it is the best fit of the three models that have been evaluated

## Model Evaluation

Model Evaluation tells us how our model works in the real world. In-Sample evaluation does not give us an indication as to how our model performs under real life circumstances

### Training and Test Sets

We typically split our data into a training and a testing set, which we use to build and evaluate our model respectively

- Split into
  - 70% Training
  - 30% Testing
- Build and Train with Training Set
- Use Testing Set to evaluate model performance

`sklearn` gives us a function to split our data into a train and test set
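A minimal sketch with `train_test_split` on toy arrays:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# toy predictor and target arrays
x_data = np.arange(10).reshape(-1, 1)
y_data = np.arange(10)

# hold out 30% of the samples for testing
x_train, x_test, y_train, y_test = train_test_split(
    x_data, y_data, test_size=0.3, random_state=0)
print(len(x_train), len(x_test))
```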

### Generalization Performance

Generalization Error is a measure of how well our model does at predicting previously unseen data

The error we obtain using our testing data is an approximation of this error

### Cross Validation

Cross validation involves splitting the data into folds, using one fold for testing and the remainder for training, repeating this for each combination of training and testing folds

We can use `cross_val_score()` to perform out-of-sample evaluation, where `X` is our predictor matrix, `Y` is our target, and `n` is our number of folds

If we want to get the actual predicted values from our model we can do the following
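A minimal sketch of both calls on toy linear data (`cross_val_predict` gives out-of-fold predictions, which is the standard way to get the actual predicted values):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_predict, cross_val_score

# toy, perfectly linear data
x_data = np.arange(20, dtype=float).reshape(-1, 1)
y_data = 3.0 * x_data.ravel() + 2.0

lr = LinearRegression()
scores = cross_val_score(lr, x_data, y_data, cv=4)   # one R^2 score per fold
yhat = cross_val_predict(lr, x_data, y_data, cv=4)   # out-of-fold predictions
print(scores.mean(), scores.std())
```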

### Overfitting, Underfitting, and Model Selection

The goal of model selection is to try to select the best function to fit our data. If our model is too simple it will not appropriately fit our data, whereas if our model is too complex it will perfectly fit our training data but will not be good at approximating new data

We need to be sure to fit the data, not the noise

It is also possible that the data we are trying to approximate cannot be fitted by a polynomial at all, for example in the case of cyclic data

We can make use of the following code to look at the effect of order on our $R^2$ error for a model
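A sketch of such a loop, using synthetic curved data in place of the autos data:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import PolynomialFeatures

# synthetic quadratic data standing in for highway-mpg vs price
rng = np.random.default_rng(0)
x = np.linspace(15.0, 50.0, 40).reshape(-1, 1)
y = 40000.0 - 1500.0 * x.ravel() + 15.0 * x.ravel() ** 2 + rng.normal(0.0, 500.0, 40)

x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.3, random_state=0)

# test-set R^2 for polynomial fits of increasing order
lr = LinearRegression()
Rsqu_test = []
for order in range(1, 5):
    pr = PolynomialFeatures(degree=order)
    x_train_pr = pr.fit_transform(x_train)
    x_test_pr = pr.fit_transform(x_test)
    lr.fit(x_train_pr, y_train)
    Rsqu_test.append(lr.score(x_test_pr, y_test))
print(Rsqu_test)
```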

### Ridge Regression

Ridge Regression prevents overfitting

Ridge regression controls the higher order parameters in our model by using a factor $\alpha$; increasing our $\alpha$ value will help us avoid overfitting, however increasing it too far can lead to us underfitting the data

To use ridge regression we can do the following
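A minimal sketch with `Ridge` from `sklearn.linear_model` on toy data:

```python
import numpy as np
from sklearn.linear_model import Ridge

# toy stand-in for highway-mpg and price
X = np.array([[27.0], [26.0], [30.0], [22.0], [25.0]])
y = np.array([13495.0, 16500.0, 13950.0, 17450.0, 15250.0])

ridge_model = Ridge(alpha=0.1)  # alpha controls the regularization strength
ridge_model.fit(X, y)
Yhat = ridge_model.predict(X)
```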

We will usually start off with a small value of $\alpha$ such as 0.0001 and increase our value in orders of magnitude

Furthermore we can use Cross Validation to identify an optimal $\alpha$

### Grid Search

Grid Search allows us to scan through multiple free parameters

Parameters such as $\alpha$ are not part of the training or fitting process. These parameters are called Hyperparameters

`sklearn` has a means of automatically iterating over these parameters called Grid Search

This allows us to use different hyperparameters to train our model, and select the model that provides the lowest MSE

The parameter values for a grid search are simply a list which contains a dictionary of the parameters we want to modify

We can use Grid Search as follows

The `scores` dictionary will store the results for each hyperparameter combination given our inputs
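A minimal sketch of a grid search over $\alpha$ for a Ridge model, on toy linear data:

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV

# toy, perfectly linear data
X = np.arange(12, dtype=float).reshape(-1, 1)
y = 2.0 * X.ravel() + 1.0

# the parameter grid: a list containing a dictionary of values to try
parameters = [{'alpha': [0.001, 0.1, 1.0, 10.0, 100.0]}]

grid = GridSearchCV(Ridge(), parameters, cv=4)
grid.fit(X, y)

best_rr = grid.best_estimator_   # the Ridge model with the best alpha
scores = grid.cv_results_        # results for every alpha tried
print(best_rr.alpha)
```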

## Conclusion

For more specific examples of the different sections look at the appropriate Lab