Our ML Project: Titanic + CPT

In this blog, we demonstrate our code for the Titanic machine learning model, as well as our own personalized CPT machine learning project.

Titanic ML:

For the Titanic project, we worked on training the model with the Titanic dataset. Using an API, we received data from the frontend, made the prediction, and sent the prediction back to the frontend to display.

import pandas as pd
import seaborn as sns
from sklearn.preprocessing import OneHotEncoder
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
import numpy as np


class TitanicPredictor:
    def __init__(self):
        self.data = None
        self.encoder = None
        self.model_dt = None
        self.model_logreg = None
        self.X_test = None
        self.y_test = None

    def load_data(self):
        # Load the Titanic dataset bundled with seaborn
        self.data = sns.load_dataset('titanic')

    def preprocess_data(self):
        if self.data is None:
            raise ValueError("Data not loaded. Call load_data() first.")

        # Drop columns that duplicate other features or are mostly missing
        self.data.drop(['alive', 'who', 'adult_male', 'class', 'embark_town', 'deck'], axis=1, inplace=True)
        self.data.dropna(inplace=True)
        # Encode the binary categoricals as 0/1
        self.data['sex'] = self.data['sex'].apply(lambda x: 1 if x == 'male' else 0)
        self.data['alone'] = self.data['alone'].apply(lambda x: 1 if x == True else 0)

        # One-hot encode the port of embarkation
        self.encoder = OneHotEncoder(handle_unknown='ignore')
        self.encoder.fit(self.data[['embarked']])
        onehot = self.encoder.transform(self.data[['embarked']]).toarray()
        cols = ['embarked_' + val for val in self.encoder.categories_[0]]
        # Align on self.data's index; a default RangeIndex would misalign with
        # the rows that survived dropna() and fill the new columns with NaN
        self.data[cols] = pd.DataFrame(onehot, index=self.data.index)
        self.data.drop(['embarked'], axis=1, inplace=True)

    def train_models(self):
        X = self.data.drop('survived', axis=1)
        y = self.data['survived']
        X_train, self.X_test, y_train, self.y_test = train_test_split(X, y, test_size=0.3, random_state=42)

        self.model_dt = DecisionTreeClassifier()
        self.model_dt.fit(X_train, y_train)

        # max_iter raised so the lbfgs solver converges on this data
        self.model_logreg = LogisticRegression(max_iter=1000)
        self.model_logreg.fit(X_train, y_train)

    def evaluate_models(self):
        if self.model_dt is None or self.model_logreg is None:
            raise ValueError("Models not trained. Call train_models() first.")

        y_pred_dt = self.model_dt.predict(self.X_test)
        accuracy_dt = accuracy_score(self.y_test, y_pred_dt)
        print('DecisionTreeClassifier Accuracy: {:.2%}'.format(accuracy_dt))

        y_pred_logreg = self.model_logreg.predict(self.X_test)
        accuracy_logreg = accuracy_score(self.y_test, y_pred_logreg)
        print('LogisticRegression Accuracy: {:.2%}'.format(accuracy_logreg))

    def predict_survival_probability(self, new_passenger):
        if self.model_logreg is None:
            raise ValueError("Models not trained. Call train_models() first.")

        # Apply the same preprocessing that was used on the training data
        new_passenger['sex'] = new_passenger['sex'].apply(lambda x: 1 if x == 'male' else 0)
        new_passenger['alone'] = new_passenger['alone'].apply(lambda x: 1 if x == True else 0)

        onehot = self.encoder.transform(new_passenger[['embarked']]).toarray()
        cols = ['embarked_' + val for val in self.encoder.categories_[0]]
        new_passenger[cols] = pd.DataFrame(onehot, index=new_passenger.index)
        new_passenger.drop(['embarked', 'name'], axis=1, inplace=True)

        dead_proba, alive_proba = np.squeeze(self.model_logreg.predict_proba(new_passenger))
        print('Death probability: {:.2%}'.format(dead_proba))
        print('Survival probability: {:.2%}'.format(alive_proba))
        return dead_proba, alive_proba


# Usage
titanic_predictor = TitanicPredictor()
titanic_predictor.load_data()
titanic_predictor.preprocess_data()
titanic_predictor.train_models()
titanic_predictor.evaluate_models()


# Define a new passenger
passenger = pd.DataFrame({
    'name': ['John Mortensen'],
    'pclass': [2],
    'sex': ['male'],
    'age': [64],
    'sibsp': [1],
    'parch': [1],
    'fare': [16.00],
    'embarked': ['S'],
    'alone': [False]
})


titanic_predictor.predict_survival_probability(passenger)

Titanic Model:

Data Loading and Preprocessing: The model begins by loading the Titanic dataset using Seaborn’s load_dataset() function. It then preprocesses the data by dropping irrelevant columns (‘alive’, ‘who’, ‘adult_male’, ‘class’, ‘embark_town’, ‘deck’) and handling missing values. Categorical variables like ‘sex’ and ‘alone’ are converted into numerical format.
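
As a quick sanity check (a minimal sketch, assuming the class above has already been defined), you can confirm that preprocessing left only numeric columns and no missing values:

predictor = TitanicPredictor()
predictor.load_data()
predictor.preprocess_data()
# Every remaining column should be numeric, with no missing values left
print(predictor.data.dtypes)
print(predictor.data.isnull().sum().sum())  # expect 0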

One-Hot Encoding: The model uses one-hot encoding to convert the categorical variable ‘embarked’ into binary vectors.
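
To see what the encoder does in isolation, here is a minimal standalone sketch with a toy column of embarkation ports (toy values, not rows from the real dataset):

from sklearn.preprocessing import OneHotEncoder
import pandas as pd

ports = pd.DataFrame({'embarked': ['S', 'C', 'Q', 'S']})
enc = OneHotEncoder(handle_unknown='ignore')
onehot = enc.fit_transform(ports).toarray()
print(enc.categories_[0])  # ['C' 'Q' 'S'] -- categories are sorted alphabetically
print(onehot)              # each row has a single 1 marking its port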

Model Training: After preprocessing, the data is split into features (X) and the target variable (y), followed by splitting into training and testing sets. Two models are trained: a Decision Tree Classifier (model_dt) and a Logistic Regression model (model_logreg).
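
Once both models are fit, scikit-learn exposes what each one learned. Here is a hedged sketch for inspecting them (the attribute names are standard scikit-learn; the titanic_predictor instance comes from the usage example above):

import pandas as pd

features = titanic_predictor.X_test.columns
# Decision tree: how much each feature contributed to reducing impurity
print(pd.Series(titanic_predictor.model_dt.feature_importances_, index=features))
# Logistic regression: the signed weight assigned to each feature
print(pd.Series(titanic_predictor.model_logreg.coef_[0], index=features))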

Model Evaluation: The trained models are evaluated using accuracy scores on the test data.
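
Accuracy alone can hide how the errors split between classes, so a confusion matrix is a common follow-up. This is a sketch using scikit-learn's confusion_matrix, not part of our original code:

from sklearn.metrics import confusion_matrix

y_pred = titanic_predictor.model_logreg.predict(titanic_predictor.X_test)
# Rows are the true classes (died, survived); columns are the predictions
print(confusion_matrix(titanic_predictor.y_test, y_pred))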

Prediction: The model provides a method predict_survival_probability() to predict the survival probability of a new passenger. The new passenger’s data is preprocessed similarly to the training data. The survival probability is predicted using the trained Logistic Regression model.
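
For a two-class problem, predict_proba returns one [P(died), P(survived)] pair per row, which is why the method unpacks it with np.squeeze. A short check, reusing the test set from above:

proba = titanic_predictor.model_logreg.predict_proba(titanic_predictor.X_test[:1])
print(proba.shape)     # (1, 2): one row, two class probabilities
print(proba[0].sum())  # the pair always sums to 1.0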

Usage Example: An instance of the TitanicPredictor class is created. Data is loaded, preprocessed, models are trained, and then evaluated. A new passenger’s data is defined, and the predict_survival_probability() method is called to estimate their survival probability.

from flask import Blueprint, jsonify, request  # jsonify creates an endpoint response object
from flask_restful import Api, Resource  # used for REST API building
import pandas as pd

from model.jokes import *

joke_api = Blueprint('joke_api', __name__,
                     url_prefix='/api/jokes')

# API generator https://flask-restful.readthedocs.io/en/latest/api.html#id1
api = Api(joke_api)

class TitanicAPI(Resource):
    def post(self):
        # Get passenger data from the API request
        data = request.get_json()  # get the data as JSON
        # Normalize 'alone' to a real boolean; the predictor compares it to True
        data['alone'] = str(data['alone']).lower() == 'true'
        converted_dict = {key: [value] for key, value in data.items()}
        pass_in = pd.DataFrame(converted_dict)  # create a one-row DataFrame from the JSON
        titanic_predictor = TitanicPredictor()
        titanic_predictor.load_data()
        titanic_predictor.preprocess_data()
        titanic_predictor.train_models()
        titanic_predictor.evaluate_models()
        dead_proba, alive_proba = titanic_predictor.predict_survival_probability(pass_in)
        response = {
            'dead_proba': float(dead_proba),  # cast from numpy float so jsonify can serialize it
            'alive_proba': float(alive_proba)
        }
        return jsonify(response)


# Add resource to the API
api.add_resource(TitanicAPI, '/create')

Titanic API:

The TitanicAPI class is a Flask-RESTful Resource representing one endpoint of the API. It handles POST requests to the /api/jokes/create endpoint. In the post method, it extracts passenger data from the JSON request using request.get_json(), normalizes and reshapes it into a one-row DataFrame, and passes it to the TitanicPredictor class (imported from model.jokes) for prediction. After predicting survival probabilities, a JSON response containing the probabilities is returned.
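
From the frontend (or any client), the endpoint expects a JSON body with the passenger fields. Here is a hedged sketch of a client call using the requests library; the host and port are assumptions, and the field names come from the passenger example above:

import requests

payload = {
    'name': 'John Mortensen', 'pclass': 2, 'sex': 'male', 'age': 64,
    'sibsp': 1, 'parch': 1, 'fare': 16.00, 'embarked': 'S', 'alone': False
}
# Assumes the Flask server is running locally on port 8086
response = requests.post('http://localhost:8086/api/jokes/create', json=payload)
print(response.json())  # {'alive_proba': ..., 'dead_proba': ...}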

CPT ML (Depression):

For our CPT project, we decided to build a project centered around mental health and self-care. We looked for a dataset dealing with depression rates, found a few we liked, and even created our own dataset. We then trained a model on data from the dataset to predict how likely a person is to develop depression based on these factors: age, stress level, exercise, and sleep.

import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

data = pd.read_csv('depression_dataset.csv')
# Split the data into features and labels
X = data.drop('Probability of Developing Depression', axis=1)
y = data['Probability of Developing Depression']
# Standardize the features
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X_scaled, y, test_size=0.2, random_state=42)
# Train a linear regression model
model = LinearRegression()
model.fit(X_train, y_train)
# Function to predict the chance of being depressed
def predict_depression(age, stress_level, exercise_hours, sleep_hours):
    input_data = scaler.transform([[age, stress_level, exercise_hours, sleep_hours]])
    chance_of_depression = model.predict(input_data)[0]
    return chance_of_depression

CPT Model:

Data Loading and Preparation: The model begins by loading a dataset containing pertinent information for predicting depression. This dataset typically comprises features such as age, stress level, exercise hours, and sleep hours, alongside a target variable indicating the likelihood of developing depression. Subsequently, the model divides the data into features (X) and the target variable (y). Features represent the input variables utilized for predictions, while the target variable signifies what we aim to predict—in this instance, the probability of experiencing depression.
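
A quick way to confirm the split lines up (a sketch; the exact feature column names depend on our CSV, so treat them as assumptions):

print(X.shape, y.shape)  # features and target should have the same number of rows
print(list(X.columns))   # expected: the age, stress level, exercise, and sleep columns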

Data Preprocessing: Before proceeding with model training, it’s imperative to preprocess the data. Within this model, data preprocessing entails standardization of features using a method known as StandardScaler. Standardization ensures that all features exhibit a mean of 0 and a standard deviation of 1, thereby enhancing the efficacy of certain machine learning algorithms.
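
Concretely, StandardScaler learns each column's mean and standard deviation from the training data and rescales every value as z = (x - mean) / std. A tiny standalone sketch with made-up numbers:

import numpy as np
from sklearn.preprocessing import StandardScaler

ages = np.array([[20.0], [30.0], [40.0]])  # toy column: mean 30, std ~8.16
scaler_demo = StandardScaler().fit(ages)
print(scaler_demo.mean_, scaler_demo.scale_)
print(scaler_demo.transform(ages).ravel())  # roughly [-1.22, 0.0, 1.22]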

Model Training: The model is trained with a linear regression algorithm. Linear regression, although straightforward, is a robust algorithm for establishing relationships between a dependent variable (in this scenario, the likelihood of depression) and one or more independent variables (the features). Training uses the preprocessed training data (X_train, y_train): the model learns the relationship between the input features and the target variable by minimizing the difference between its predicted values and the actual observations, i.e., by minimizing the loss function.
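
After fitting, the learned relationship is just a weighted sum. A short sketch showing that the model's prediction equals the dot product of the scaled inputs with its coefficients plus the intercept (reusing model and X_test from the code above):

import numpy as np

manual = X_test @ model.coef_ + model.intercept_  # w.x + b for every test row
assert np.allclose(manual, model.predict(X_test))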

from flask import Blueprint, request, jsonify
from flask_restful import Api, Resource
from model.depression import *

predict_api = Blueprint("predict_api", __name__,
                        url_prefix='/api/predict')
api = Api(predict_api)

class Predict(Resource):
    def post(self):
        body = request.get_json()
        age = float(body.get("age"))
        stress_level = float(body.get("stress"))
        daily_exercise_hours = float(body.get("exercise"))
        daily_sleep_hours = float(body.get("sleep"))
        chance_of_depression = predict_depression(age, stress_level, daily_exercise_hours, daily_sleep_hours)
        chance_of_depression = max(0, min(chance_of_depression, 1))  # Ensure chance_of_depression is between 0 and 1
        return (jsonify(f"Based on the provided data, the chance of developing depression is: {chance_of_depression * 100:.2f}%"))
api.add_resource(Predict, '/')

CPT API:

Predict Resource Class (Predict): This class defines the behavior for handling POST requests to the /api/predict endpoint. Upon receiving a POST request, it retrieves JSON data from the request body containing age, stress level, daily exercise hours, and daily sleep hours. It then calls the predict_depression function (imported from model.depression) to predict the chance of developing depression from those inputs. Because linear regression can produce values outside [0, 1], the prediction is clamped to that range using max and min. Finally, it returns a JSON response containing the predicted chance of developing depression as a percentage.
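
A hedged sketch of calling this endpoint from a client (the host and port are assumptions; the JSON keys match what the post method reads):

import requests

body = {"age": 25, "stress": 7, "exercise": 1.5, "sleep": 6}
# Assumes the Flask server is running locally on port 8086
resp = requests.post('http://localhost:8086/api/predict/', json=body)
print(resp.json())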

JSON Response: The response returned by the API is a JSON object containing a string message with the predicted chance of developing depression formatted to display two decimal places.