Siddhant Shah

A Basic AI Model Implementation


Published on Tuesday, 25 June 2024


This is a demo for AI model development.


Table of Contents

  • The Implementation Process
  • Data Collection & Processing
  • Implement Models and Compare Performance
  • Insights & Analysis
  • Possible topics
  • Resources

The example used here analyses customer churn data using a Random Forest Classifier. It is included for demonstration purposes only; the steps, libraries and data you use may vary.

The Implementation Process

Data Collection & Processing

Example Code

import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

oe = OrdinalEncoder()

# Load the customer churn dataset from its CSV file
data = pd.read_csv('Data/Customer Churn.csv')

# Encode the categorical columns as numeric values the model can work with
data['Gender'] = oe.fit_transform(data[['Gender']])
data['EmailOptIn'] = oe.fit_transform(data[['EmailOptIn']])
data['PromotionResponse'] = oe.fit_transform(data[['PromotionResponse']])

# Separate the features (every column but the last) from the target (the last column)
X = data.iloc[:, :-1]
y = data.iloc[:, -1]
  • Here, we first read the data, which is stored in a CSV (Comma-Separated Values) file.
  • The data is then pre-processed into a format the model can understand; here, OrdinalEncoder converts the categorical text columns into numbers. The exact processing will vary depending on the model and data you use.
  • Finally, we separate the data into the measured values (the features, X) and the goal variable (the target, y). A short sanity check of the result is sketched below.
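
Before splitting the data, it can help to confirm what was loaded. The following is a minimal sketch (assuming the data, X and y objects created above, and that the last column of the CSV is the churn label) that prints the missing-value counts, the shapes of X and y, and the class balance of the target.

print(data.isna().sum())                 # missing values per column
print(X.shape, y.shape)                  # sizes of the feature matrix and target vector
print(y.value_counts(normalize=True))    # class balance of the churn target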

Example Code

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
                                        X,
                                        y,
                                        test_size=0.2,
                                        stratify=X['Gender'],
                                        random_state=1
                                        )
  • test_size=0.2 denotes that we are using 20% of the data for the testing set, stratify=X['Gender'] keeps the proportion of each gender the same in both splits, and random_state=1 makes the split reproducible. A quick check of the split is sketched below.
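
As a quick check, the sketch below (assuming the variables returned by train_test_split above) confirms the 80/20 row split and that the gender proportions are preserved in both sets.

print(len(X_train), len(X_test))                       # roughly an 80/20 split of the rows
print(X_train['Gender'].value_counts(normalize=True))  # gender proportions in the training set
print(X_test['Gender'].value_counts(normalize=True))   # ...and in the test set, kept similar by stratify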

Implement Models and Compare Performance

Example Code

  • Here we choose to implement a Random Forest Classifier.
from sklearn.ensemble import RandomForestClassifier
model = RandomForestClassifier()

grid = { 
    'n_estimators': [25 * i for i in range(1, 10 + 1)], 
    'max_features': ['sqrt', 'log2'], 
    'max_depth': [3 * i for i in range(1, 5 + 1)], 
    'max_leaf_nodes': [2, 5, 10, 20, 50], 
}
  • The dictionary grid contains the hyper-parameters that can be tuned for this particular model, along with the candidate values to try for each one. The size of the resulting search space is counted in the sketch below.
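
Since the grid search tries every combination of these values, it is worth knowing how large the search space is before running it. The short sketch below (assuming the grid dictionary defined above) counts the candidate combinations.

# Count the number of hyper-parameter combinations the grid defines
from math import prod
n_candidates = prod(len(values) for values in grid.values())
print(n_candidates)  # 10 * 2 * 5 * 5 = 500 combinations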

Example Code

from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import RepeatedStratifiedKFold

cv = RepeatedStratifiedKFold(
    n_splits=10,
    n_repeats=3,
    random_state=1
    )
grid_search = GridSearchCV(
    estimator=model,
    param_grid=grid,
    n_jobs=-1,
    cv=cv,
    scoring='accuracy'
    )

grid_result = grid_search.fit(X_train, y_train)
print(f"Highest Accuracy: {grid_result.best_score_:.4f}")
print(f"Hyper Parameter Values: {grid_result.best_params_}")
  • Here we use RepeatedStratifiedKFold together with GridSearchCV. RepeatedStratifiedKFold repeatedly creates validation folds from the training data on its own, without any data leakage, while GridSearchCV tries every combination of hyper-parameter values in grid, trains the model on each one, and evaluates the chosen score on the validation folds. With 500 combinations and 30 folds (10 splits repeated 3 times), that amounts to 15,000 model fits, which is why n_jobs=-1 is used to run them in parallel.
  • At the end, the two print statements report the highest accuracy achieved across all hyper-parameter combinations and the combination of values that achieved it. The fitted search object can also be inspected directly, as sketched below.
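
Beyond the two print statements, the fitted search object exposes its full results. The sketch below (assuming grid_result from above) lists the best-ranked combinations and retrieves the best estimator, which GridSearchCV refits on all of the training data by default.

import pandas as pd

# One row per hyper-parameter combination, ranked by mean validation accuracy
results = pd.DataFrame(grid_result.cv_results_).sort_values('rank_test_score')
print(results[['params', 'mean_test_score', 'std_test_score']].head())

# The best model, already refit on the whole training set (refit=True is the default)
best_model = grid_result.best_estimator_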

Insights & Analysis

Example Code

# Re-fit a Random Forest using a fixed set of hyper-parameter values
# (such as the best combination reported by the grid search above)
rf = RandomForestClassifier(
    n_estimators=200,
    max_features='log2',
    max_leaf_nodes=2,
    max_depth=12,
    random_state=1
    )

rf_model = rf.fit(X_train, y_train)
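
With the model refit on the training data, a natural next step is to measure how well it generalises. The following is a minimal evaluation sketch (assuming rf_model, X_test and y_test from above).

from sklearn.metrics import accuracy_score, classification_report

# Evaluate the refit model on the held-out test set
y_pred = rf_model.predict(X_test)
print(f"Test Accuracy: {accuracy_score(y_test, y_pred):.4f}")
print(classification_report(y_test, y_pred))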

Possible topics

  1. Pneumonia or Diabetes Prediction:
    Here, the measured values will be vitals such as heart rate, body temperature, previous medical history, etc. The goal variable will be a binary value (0 or 1) denoting the absence or presence of the disease.
  2. Wildfire Threat:
    Here, the measured values will include distance from nearby forest fires, settlements, water bodies, etc. The goal variable will be a score denoting how likely a wildfire is.

The values mentioned for each topic are not exhaustive; other values can and should be used. The student should be able to justify the use of every value with some direct or indirect reasoning.

Resources

  1. Intro to Python (Official Python Documentation): https://www.python.org
  2. Scikit-Learn Documentation: https://scikit-learn.org/stable/