Diabetes Prediction Using ML
Diabetes Prediction
- 1. Project Overview
- 2. Model Design
- 3. Data Choices
- 4. API Documentation
- SQLite Database
- SQL vs. Pandas
- UI
- User Stories
1. Project Overview
Goal: Predict diabetes risk (0-100% probability) using CDC dataset features.
Key Components:
- Machine Learning: GradientBoostingClassifier with probability calibration.
- API: Flask RESTful endpoints for single/bulk predictions.
- Documentation: Jupyter Notebook + GitHub Pages.
2. Model Design
Architecture
Singleton Pattern: Ensures one model instance system-wide.
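A minimal sketch of how a singleton could be enforced in Python. The class name `DiabetesModel` and the `get_instance()` accessor are hypothetical names chosen for illustration, and the actual model loading is left out.

```python
import threading


class DiabetesModel:
    """Hypothetical wrapper class; the name is an assumption for illustration."""

    _instance = None
    _lock = threading.Lock()

    def __init__(self):
        # Placeholder for the fitted estimator; loading is omitted in this sketch.
        self.model = None

    @classmethod
    def get_instance(cls):
        # Double-checked locking so concurrent requests share one instance.
        if cls._instance is None:
            with cls._lock:
                if cls._instance is None:
                    cls._instance = cls()
        return cls._instance


first = DiabetesModel.get_instance()
second = DiabetesModel.get_instance()
print(first is second)
```

Routing every caller through `get_instance()` means the model would be loaded once per process rather than once per request.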
Pipeline:
Pipeline([
('scaler', StandardScaler()),
('classifier', GradientBoostingClassifier())
])
Calibration: CalibratedClassifierCV for accurate probabilities
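One way the pipeline and calibration step could fit together is sketched below. The synthetic data, the sigmoid method, the 3-fold setting, and the small `n_estimators` value are all assumptions made to keep the example fast. They are not the project's actual training configuration.

```python
from sklearn.calibration import CalibratedClassifierCV
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the real data (assumption: 15 numeric features).
X, y = make_classification(n_samples=400, n_features=15, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("classifier", GradientBoostingClassifier(n_estimators=30, random_state=0)),
])

# Wrap the whole pipeline so calibration is fitted with cross-validation.
calibrated = CalibratedClassifierCV(pipeline, method="sigmoid", cv=3)
calibrated.fit(X_train, y_train)

# Column 1 holds the calibrated probability of the positive class.
probabilities = calibrated.predict_proba(X_test)[:, 1]
print(probabilities[:5].round(3))
```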
Key Methods
- predict(): Returns probability (0–1) for a patient.
- save_model() / load_model(): Model persistence via joblib.
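The three methods might look roughly like the following. The function signatures, the file name, and the synthetic training data are assumptions. Only the method names and the use of joblib come from the description above.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier


def save_model(model, path):
    # joblib.dump serializes the fitted estimator to disk.
    joblib.dump(model, path)


def load_model(path):
    # joblib.load restores the estimator saved by save_model.
    return joblib.load(path)


def predict(model, features):
    # Reshape one patient's feature vector into a single-row 2-D array.
    row = np.asarray(features, dtype=float).reshape(1, -1)
    return float(model.predict_proba(row)[0, 1])


# Synthetic data standing in for the real training set (assumption).
X, y = make_classification(n_samples=200, n_features=15, random_state=0)
model = GradientBoostingClassifier(n_estimators=20, random_state=0).fit(X, y)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "diabetes_model.joblib")  # hypothetical file name
    save_model(model, path)
    restored = load_model(path)

risk = predict(restored, X[0])
print(f"Predicted probability: {risk:.3f}")
```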
3. Data Choices
Dataset
Source: UCI ML Repository (ID: 891, CDC Diabetes Health Indicators)
Features:
['HighBP', 'HighChol', 'BMI', 'Age', 'GenHlth', ...] # 15 total
Preprocessing:
- Dropped missing values
- Engineered BMI_Category (binned categories)
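import pandas as pd
from ucimlrepo import fetch_ucirepo

# .data bundles the features and targets; preview five random feature rows.
df = fetch_ucirepo(id=891).data
df.features.sample(5)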
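The two preprocessing steps listed above could be sketched as follows. The sample rows are made up, and the bin edges and labels for `BMI_Category` are an assumption based on the common WHO cut-offs, since the project's actual bins are not stated.

```python
import numpy as np
import pandas as pd

# Tiny hand-made frame standing in for the real features (values are illustrative).
df = pd.DataFrame({
    "HighBP": [1, 0, 1, 0],
    "BMI": [17.0, 23.5, np.nan, 33.2],
    "Age": [4, 7, 9, 11],
})

# Drop rows with any missing value.
clean = df.dropna().copy()

# Assumed bin edges following the common WHO BMI cut-offs.
bins = [0, 18.5, 25, 30, np.inf]
labels = ["Underweight", "Normal", "Overweight", "Obese"]

# right=False makes each bin include its lower edge, e.g. [18.5, 25).
clean["BMI_Category"] = pd.cut(clean["BMI"], bins=bins, labels=labels, right=False)

print(clean)
```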
4. API Documentation
Endpoints
Endpoint | Method | Description |
---|---|---|
/api/diabetes/predict | POST | Predict risk for a single patient |
/api/diabetes/bulk | POST | Predict for multiple patients |
Example Request:
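```javascript
// Must run inside an async function; getFormData, pythonURI, and
// fetchOptions are defined elsewhere in the frontend.
try {
  const formData = getFormData();
  const response = await fetch(`${pythonURI}/api/diabetes/predict`, {
    ...fetchOptions,
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify(formData)
  });
  if (!response.ok) {
    // The server is expected to return a JSON body with a message field.
    const error = await response.json();
    throw new Error(error.message || 'Prediction failed');
  }
  const result = await response.json();
} catch (err) {
  console.error(err);
}
```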
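On the server side, the two endpoints might be wired up along these lines. This is a sketch under assumptions: `predict_probability` is a stub standing in for the trained model, and the response keys `probability` and `probabilities` are hypothetical. The `message` key in error bodies matches what the client code above reads.

```python
from flask import Flask, jsonify, request

app = Flask(__name__)


def predict_probability(patient):
    # Stub standing in for the trained model (assumption for this sketch).
    return 0.5


@app.route("/api/diabetes/predict", methods=["POST"])
def predict_single():
    patient = request.get_json(silent=True)
    if not isinstance(patient, dict):
        return jsonify({"message": "Expected a JSON object"}), 400
    return jsonify({"probability": predict_probability(patient)})


@app.route("/api/diabetes/bulk", methods=["POST"])
def predict_bulk():
    patients = request.get_json(silent=True)
    if not isinstance(patients, list):
        return jsonify({"message": "Expected a JSON array"}), 400
    return jsonify({"probabilities": [predict_probability(p) for p in patients]})


# Exercise the routes without starting a server.
client = app.test_client()
single = client.post("/api/diabetes/predict", json={"HighBP": 1, "BMI": 31.0})
bulk = client.post("/api/diabetes/bulk", json=[{"BMI": 22.0}, {"BMI": 35.0}])
print(single.get_json(), bulk.get_json())
```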
SQLite Database
SQL vs. Pandas
Category | SQL | Pandas |
---|---|---|
Performance | Optimized for large datasets; scalable with indexing and parallelism | Fast for small to medium datasets; limited by memory |
Readability | Clear for joins and filtering; declarative syntax | Flexible but can get complex; imperative and method chaining syntax |
Ease of Use | Requires database setup; used in production environments | Easy to set up; works directly in Python with CSVs or APIs |
Flexibility | Great for structured queries, limited for custom logic | Excellent for data wrangling, transformations, and ML integration |
Best Use Case | Large-scale structured data in databases | Exploratory analysis, data cleaning, and ML in Python |
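To make the readability row concrete, here is the same aggregate written both ways. The table name `patients`, the `Diabetes` column, and the rows are made up for illustration. An in-memory SQLite database is used so nothing needs to be set up.

```python
import sqlite3

import pandas as pd

# Illustrative rows; values are made up.
rows = [(1, 31.0, 1), (0, 22.5, 0), (1, 28.4, 1), (0, 35.1, 1)]
columns = ["HighBP", "BMI", "Diabetes"]

# SQL: declarative aggregate over an in-memory SQLite table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE patients (HighBP INTEGER, BMI REAL, Diabetes INTEGER)")
conn.executemany("INSERT INTO patients VALUES (?, ?, ?)", rows)
sql_result = conn.execute(
    "SELECT HighBP, AVG(BMI) FROM patients GROUP BY HighBP ORDER BY HighBP"
).fetchall()
conn.close()

# Pandas: the same aggregate with method chaining.
df = pd.DataFrame(rows, columns=columns)
pandas_result = df.groupby("HighBP")["BMI"].mean().sort_index()

print(sql_result)
print(pandas_result.to_dict())
```

Both versions compute the mean BMI per `HighBP` group. The SQL form states what is wanted, while the Pandas form spells out the steps.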
UI
User Stories
1. Patient Perspective
Title: Check My Diabetes Risk
As a health-conscious individual,
I want to input my health metrics
So that I can understand my diabetes risk level
Acceptance Criteria:
- Can submit my health data via simple form
- Receive clear probability percentage (0-100%)
- Get categorized risk level (Low/Medium/High)
- See explanation of factors contributing to risk
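The percentage and risk-level criteria could be met with a small helper like the one below. The function name and the 30% lower cut-off are assumptions for illustration. The 70% upper cut-off mirrors the flagging threshold in the doctor story. Neither is a clinical threshold.

```python
def risk_report(probability, low_cutoff=0.30, high_cutoff=0.70):
    """Convert a model probability (0-1) into a percentage and a risk label.

    The cut-offs are assumptions for illustration, not clinical thresholds.
    """
    if not 0.0 <= probability <= 1.0:
        raise ValueError("probability must be between 0 and 1")
    if probability > high_cutoff:
        level = "High"
    elif probability >= low_cutoff:
        level = "Medium"
    else:
        level = "Low"
    return {"percent": round(probability * 100, 1), "level": level}


for p in (0.12, 0.45, 0.83):
    print(risk_report(p))
```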
2. Doctor Perspective
Title: Rapid Patient Screening
As a primary care physician,
I want to quickly assess patients’ diabetes risk during checkups
So that I can prioritize high-risk cases
Acceptance Criteria:
- Bulk prediction for multiple patients
- Results integrate with electronic health records
- Flag patients with >70% probability
- Show trend comparisons with previous visits
3. Clinic Administrator Perspective
Title: Population Health Dashboard
As a clinic manager,
I want to analyze aggregate prediction data
So that I can allocate resources effectively
Acceptance Criteria:
- API returns anonymized aggregate statistics
- Visualize risk distribution across patient demographics
- Export reports for public health reporting
- Track changes in population risk over time
4. Public Health Researcher Perspective
Title: Model Transparency
As an epidemiology researcher,
I want to understand feature importance
So that I can validate the model’s clinical relevance
Acceptance Criteria:
- API endpoint for feature weights
- Documentation of training methodology
- Access to model performance metrics
- Comparison with established clinical guidelines
5. Insurance Provider Perspective
Title: Risk Assessment Integration
As a health insurance underwriter,
I want to incorporate diabetes risk scores
So that I can adjust preventive care offerings
Acceptance Criteria:
- Audit trail for all predictions
- Explanation of risk factors for each case
- Compliance with healthcare data regulations
- API authentication for authorized use only