1. Project Overview

Goal: Predict diabetes risk (0-100% probability) using CDC dataset features.
Key Components:

  • Machine Learning: GradientBoostingClassifier with probability calibration.
  • API: Flask RESTful endpoints for single/bulk predictions.
  • Documentation: Jupyter Notebook + GitHub Pages.

2. Model Design

Architecture

Singleton Pattern: Ensures one model instance system-wide.
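
A minimal sketch of the singleton idea, assuming a class-level accessor. The names `DiabetesModel` and `get_instance()` are hypothetical and not taken from the project; the lock is one possible way to keep concurrent requests from creating two instances.

```python
import threading


class DiabetesModel:
    """Hypothetical singleton holder for the trained model."""

    _instance = None
    _lock = threading.Lock()

    def __init__(self):
        # Placeholder: a real implementation would train or load the pipeline here.
        self.model = None

    @classmethod
    def get_instance(cls):
        # Double-checked locking so concurrent callers share one instance.
        if cls._instance is None:
            with cls._lock:
                if cls._instance is None:
                    cls._instance = cls()
        return cls._instance


first = DiabetesModel.get_instance()
second = DiabetesModel.get_instance()
print(first is second)  # True
```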

Pipeline:

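from sklearn.ensemble import GradientBoostingClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

pipeline = Pipeline([
    ('scaler', StandardScaler()),                  # standardize each feature
    ('classifier', GradientBoostingClassifier())   # boosted-tree classifier
])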

Calibration: CalibratedClassifierCV, so that predicted probabilities better match observed risk.
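
One way the calibration step might be wired up is sketched below. The data is synthetic, and the choices `method="sigmoid"` and `cv=3` are assumptions for illustration; the project's actual settings are not stated.

```python
from sklearn.calibration import CalibratedClassifierCV
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption): 15 numeric features, binary target.
X, y = make_classification(n_samples=500, n_features=15, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("classifier", GradientBoostingClassifier(random_state=0)),
])

# Wrap the pipeline so its probabilities are calibrated with 3-fold cross-validation.
calibrated = CalibratedClassifierCV(pipeline, method="sigmoid", cv=3)
calibrated.fit(X_train, y_train)

# Column 1 holds the probability of the positive class.
risk = calibrated.predict_proba(X_test)[:, 1]
print(risk[:5])
```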

Key Methods

  • predict(): Returns probability (0–1) for a patient.
  • save_model() / load_model(): Model persistence via joblib.
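
The persistence methods above could be built on `joblib.dump` and `joblib.load`, roughly as follows. The file name and the synthetic training data are assumptions; only the use of joblib comes from the description.

```python
import os
import tempfile

import joblib
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

# Synthetic stand-in data (assumption); the real model is trained on the CDC features.
X, y = make_classification(n_samples=200, n_features=15, random_state=0)
model = GradientBoostingClassifier(n_estimators=20, random_state=0).fit(X, y)

# Hypothetical file name; the project's actual path is not given.
path = os.path.join(tempfile.mkdtemp(), "diabetes_model.joblib")
joblib.dump(model, path)       # roughly what save_model() might do
restored = joblib.load(path)   # roughly what load_model() might do

# The restored model should reproduce the original probability for the same patient.
original_risk = model.predict_proba(X[:1])[0, 1]
restored_risk = restored.predict_proba(X[:1])[0, 1]
print(original_risk, restored_risk)
```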

3. Data Choices

Dataset

Source: UCI ML Repository (ID: 891)

Features:

['HighBP', 'HighChol', 'BMI', 'Age', 'GenHlth', ...]  # 15 total

Preprocessing:

  • Dropped missing values
  • Engineered BMI_Category (binned categories)
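
import pandas as pd
from ucimlrepo import fetch_ucirepo  # provided by the ucimlrepo package

df = fetch_ucirepo(id=891).data  # container exposing .features and .targets
df.features.sample(5)            # preview five random rows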
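
The BMI_Category step listed under Preprocessing could be done with `pandas.cut`. The cut points below follow the common WHO-style ranges and are an assumption, since the project's actual bins are not stated; the toy values are also illustrative.

```python
import pandas as pd

# Toy BMI values (assumption); the real column comes from the CDC dataset.
df = pd.DataFrame({"BMI": [17.0, 22.5, 27.0, 34.0]})

# Assumed bin edges and labels, not taken from the project.
bins = [0, 18.5, 25, 30, float("inf")]
labels = ["Underweight", "Normal", "Overweight", "Obese"]

# right=False makes each bin closed on the left, e.g. [18.5, 25).
df["BMI_Category"] = pd.cut(df["BMI"], bins=bins, labels=labels, right=False)
print(df)
```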

4. API Documentation

Endpoints

| Endpoint | Method | Description |
|---|---|---|
| /api/diabetes/predict | POST | Predict risk for a single patient |
| /api/diabetes/bulk | POST | Predict risk for multiple patients |
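
A rough sketch of how the two routes might look in Flask. The `predict_risk` stub, the response field names (`probability`, `probabilities`), and the validation rules are assumptions; only the paths and the POST method come from the table. Errors are returned under a `message` key so a client can read `error.message`.

```python
from flask import Flask, jsonify, request

app = Flask(__name__)


def predict_risk(patient):
    # Stub (assumption): a real handler would call the model's predict().
    return 0.5


@app.route("/api/diabetes/predict", methods=["POST"])
def predict_single():
    patient = request.get_json(silent=True)
    if not isinstance(patient, dict):
        return jsonify({"message": "Expected a JSON object"}), 400
    return jsonify({"probability": predict_risk(patient)})


@app.route("/api/diabetes/bulk", methods=["POST"])
def predict_bulk():
    patients = request.get_json(silent=True)
    if not isinstance(patients, list):
        return jsonify({"message": "Expected a JSON array"}), 400
    return jsonify({"probabilities": [predict_risk(p) for p in patients]})


# Exercise both routes with Flask's test client, without starting a server.
client = app.test_client()
single = client.post("/api/diabetes/predict", json={"HighBP": 1, "BMI": 31})
bulk = client.post("/api/diabetes/bulk", json=[{"BMI": 24}, {"BMI": 35}])
print(single.get_json(), bulk.get_json())
```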

Example Request:

```javascript
try {
    const formData = getFormData();
    const response = await fetch(`${pythonURI}/api/diabetes/predict`, {
        ...fetchOptions,
        method: 'POST',
        headers: { 'Content-Type': 'application/json' },
        body: JSON.stringify(formData)
    });

    if (!response.ok) {
        const error = await response.json();
        throw new Error(error.message || 'Prediction failed');
    }

    const result = await response.json();
} catch (error) {
    // Placeholder handling (assumption): log the failure.
    console.error(error);
}
```

SQLite Database

[Image: UI Screenshot]

SQL vs. Pandas

| Category | SQL | Pandas |
|---|---|---|
| Performance | Optimized for large datasets; scalable with indexing and parallelism | Fast for small to medium datasets; limited by memory |
| Readability | Clear for joins and filtering; declarative syntax | Flexible but can get complex; imperative and method chaining syntax |
| Ease of Use | Requires database setup; used in production environments | Easy to set up; works directly in Python with CSVs or APIs |
| Flexibility | Great for structured queries, limited for custom logic | Excellent for data wrangling, transformations, and ML integration |
| Best Use Case | Large-scale structured data in databases | Exploratory analysis, data cleaning, and ML in Python |
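
To make the comparison concrete, the sketch below computes the same aggregate (mean BMI per HighBP group) once in SQL and once in pandas. The table name `patients` and the four toy rows are assumptions for illustration.

```python
import sqlite3

import pandas as pd

# Toy records (assumption); column names mirror two of the dataset features.
rows = [(1, 31.0), (1, 28.0), (0, 22.0), (0, 24.0)]

# SQL: declarative aggregate, run here against an in-memory SQLite database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE patients (HighBP INTEGER, BMI REAL)")
conn.executemany("INSERT INTO patients VALUES (?, ?)", rows)
sql_result = dict(conn.execute(
    "SELECT HighBP, AVG(BMI) FROM patients GROUP BY HighBP"
).fetchall())
conn.close()

# Pandas: the same aggregate expressed with method chaining.
df = pd.DataFrame(rows, columns=["HighBP", "BMI"])
pandas_result = df.groupby("HighBP")["BMI"].mean().to_dict()

print(sql_result)
print(pandas_result)
```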

UI

[Image: UI Screenshot]

User Stories

1. Patient Perspective

Title: Check My Diabetes Risk
As a health-conscious individual,
I want to input my health metrics
So that I can understand my diabetes risk level

Acceptance Criteria:

  • Can submit my health data via simple form
  • Receive clear probability percentage (0-100%)
  • Get categorized risk level (Low/Medium/High)
  • See explanation of factors contributing to risk
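
The percentage and risk-level criteria could be met by a small helper like the one below. The function name and the cut-offs are hypothetical; the 70% boundary is borrowed from the doctor story's flagging rule, and the 30% boundary is purely illustrative.

```python
def describe_risk(probability):
    """Convert a model probability (0-1) into a percentage and a risk label."""
    if not 0.0 <= probability <= 1.0:
        raise ValueError("probability must be between 0 and 1")
    percent = round(probability * 100, 1)
    # Hypothetical cut-offs; the project does not state its thresholds.
    if percent < 30:
        level = "Low"
    elif percent <= 70:
        level = "Medium"
    else:
        level = "High"
    return {"percent": percent, "level": level}


print(describe_risk(0.12))
print(describe_risk(0.55))
print(describe_risk(0.83))
```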

2. Doctor Perspective

Title: Rapid Patient Screening
As a primary care physician,
I want to quickly assess patients’ diabetes risk during checkups
So that I can prioritize high-risk cases

Acceptance Criteria:

  • Bulk prediction for multiple patients
  • Results integrate with electronic health records
  • Flag patients with >70% probability
  • Show trend comparisons with previous visits

3. Clinic Administrator Perspective

Title: Population Health Dashboard
As a clinic manager,
I want to analyze aggregate prediction data
So that I can allocate resources effectively

Acceptance Criteria:

  • API returns anonymized aggregate statistics
  • Visualize risk distribution across patient demographics
  • Export reports for public health reporting
  • Track changes in population risk over time

4. Public Health Researcher Perspective

Title: Model Transparency
As an epidemiology researcher,
I want to understand feature importance
So that I can validate the model’s clinical relevance

Acceptance Criteria:

  • API endpoint for feature weights
  • Documentation of training methodology
  • Access to model performance metrics
  • Comparison with established clinical guidelines
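
For the feature-weights criterion, a fitted GradientBoostingClassifier exposes impurity-based importances through `feature_importances_`. The sketch below uses synthetic data and only the five feature names listed earlier, so the printed weights say nothing about the real model; how the project would expose them through an endpoint is not specified.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

# Five of the listed feature names; the data itself is synthetic (assumption).
feature_names = ["HighBP", "HighChol", "BMI", "Age", "GenHlth"]
X, y = make_classification(
    n_samples=300, n_features=5, n_informative=3, n_redundant=0, random_state=0
)
model = GradientBoostingClassifier(n_estimators=50, random_state=0).fit(X, y)

# feature_importances_ is impurity-based and sums to 1 across features.
importances = dict(zip(feature_names, model.feature_importances_.tolist()))
for name, weight in sorted(importances.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name}: {weight:.3f}")
```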

5. Insurance Provider Perspective

Title: Risk Assessment Integration
As a health insurance underwriter,
I want to incorporate diabetes risk scores
So that I can adjust preventive care offerings

Acceptance Criteria:

  • Audit trail for all predictions
  • Explanation of risk factors for each case
  • Compliance with healthcare data regulations
  • API authentication for authorized use only