Data Science Project Using Real Dataset

Most beginners learn data science by watching tutorials, but they never build something complete. That becomes a problem during interviews because you are expected to show real work, not just concepts.
This guide is designed so you can go from zero to a fully working project with deployment. Every step is written clearly so that if you follow along, you will end up with a project that runs, looks professional, and is ready for GitHub.

Table of Contents

Project Overview

Language and Tools

Step 1: Setup Project Structure

Step 2: Download Dataset

Step 3: Install Required Libraries (CMD)

Step 4: Complete Model Code (project.py)

Step 5: Run the Model (CMD)

Step 6: Create Web App (app.py)

Step 7: Run App Locally (CMD)

Step 8: Create requirements.txt

Step 9: Upload to GitHub (CMD)

Step 10: Deploy on Streamlit Cloud

Common Mistakes

Conclusion

Frequently Asked Questions

Project Overview

You will build a Student Score Predictor.
The goal:

Take student data (reading and writing scores).
Predict the math score using a machine learning model.

This project is beginner-friendly because:

The data is simple.
The workflow is realistic.
The output is easy to understand.

Language and Tools

1. Language: Python
2. Libraries (with purpose):

Pandas: Loads and processes the data.
NumPy: Handles numerical operations.
Matplotlib & Seaborn: Used for visualization.
Scikit-learn: Provides the machine learning model and the evaluation metrics.
Joblib: Saves the trained model to disk.
Streamlit: Builds the web app.

3. Environment:

Jupyter Notebook (optional)
VS Code / Command Prompt (CMD)

Step 1: Setup Project Structure

Create this structure:
student-score-predictor/
│
├── data/
├── model/
├── project.py
├── app.py
├── requirements.txt
└── README.md
This structure keeps your files organized and avoids path errors. It also makes your project look professional on GitHub.
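
If you would rather not create the folders by hand, the layout above can be sketched in a few lines of Python. The function name create_structure is my own and not part of any library. The script only creates empty placeholder files, so you still fill them in during the later steps.

```python
from pathlib import Path

# Folder and file names taken from the structure shown above.
FOLDERS = ["data", "model"]
FILES = ["project.py", "app.py", "requirements.txt", "README.md"]


def create_structure(root):
    """Create the project folders and empty placeholder files under root."""
    root = Path(root)
    for folder in FOLDERS:
        # parents=True also creates root itself; exist_ok avoids errors on reruns.
        (root / folder).mkdir(parents=True, exist_ok=True)
    for name in FILES:
        # touch() creates the file if it is missing and keeps existing content.
        (root / name).touch(exist_ok=True)
    return root
```

Calling create_structure("student-score-predictor") from the folder where you keep your projects should produce the layout shown above.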

Step 2: Download Dataset

Download the dataset:
1. Search Kaggle for "Students Performance in Exams" and download the CSV file.
2. Rename the file and save it as:

data/students.csv

You are working with real-world data. Do not rename the columns: the code selects 'math score', 'reading score', and 'writing score' by name, so changing them causes a KeyError.
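
Before moving on, it can help to confirm that the file you saved really has the columns the code expects. Here is a minimal sketch using only the standard library. The helper name missing_columns is hypothetical, and the three column names are the ones used later in project.py.

```python
import csv

# Column names the rest of this guide relies on.
REQUIRED_COLUMNS = ["math score", "reading score", "writing score"]


def missing_columns(csv_path):
    """Return the required columns that are absent from the CSV header."""
    # utf-8-sig reads files with or without a byte order mark.
    with open(csv_path, newline="", encoding="utf-8-sig") as f:
        header = next(csv.reader(f), [])
    return [col for col in REQUIRED_COLUMNS if col not in header]
```

An empty list from missing_columns("data/students.csv") means the header matches what the code expects.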

Step 3: Install Required Libraries (CMD)

Open Command Prompt (CMD) and run:

pip install pandas numpy matplotlib seaborn scikit-learn joblib streamlit

These libraries provide all required functionality. If one is missing, Python raises a ModuleNotFoundError as soon as the script tries to import it.
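
To check the installation without running the whole project, you can ask Python whether each module can be found. This is a rough sketch: find_missing is a hypothetical helper, and the mapping exists because the pip name and the import name are not always the same (scikit-learn is imported as sklearn).

```python
import importlib.util

# Maps the pip package name to the module name you import.
PACKAGES = {
    "pandas": "pandas",
    "numpy": "numpy",
    "matplotlib": "matplotlib",
    "seaborn": "seaborn",
    "scikit-learn": "sklearn",
    "joblib": "joblib",
    "streamlit": "streamlit",
}


def find_missing(packages):
    """Return the pip names whose import module cannot be found."""
    return [
        pip_name
        for pip_name, module_name in packages.items()
        if importlib.util.find_spec(module_name) is None
    ]
```

If find_missing(PACKAGES) returns an empty list, every library from the pip command is available. Otherwise the list tells you which names to install.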

Step 4: Complete Model Code (project.py)

Create project.py and paste in this code:

import os

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import joblib

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

# Load dataset
df = pd.read_csv("data/students.csv")

# Keep only the three score columns used in this project
df = df[['math score', 'reading score', 'writing score']]

print(df.head())

# Visualization: distribution of math scores
# (close the plot window to let the script continue)
sns.histplot(df['math score'], kde=True)
plt.show()

# Features (inputs) and target (value to predict)
X = df[['reading score', 'writing score']]
y = df['math score']

# Split data: 80% for training, 20% for testing
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Train model
model = LinearRegression()
model.fit(X_train, y_train)

# Predict on the held-out test set
y_pred = model.predict(X_test)

# Evaluate
print("MSE:", mean_squared_error(y_test, y_pred))
print("R2 Score:", r2_score(y_test, y_pred))

# Save model (create the model/ folder first if it does not exist)
os.makedirs("model", exist_ok=True)
joblib.dump(model, "model/student_model.pkl")

# Plot actual vs. predicted math scores
plt.scatter(y_test, y_pred)
plt.xlabel("Actual")
plt.ylabel("Predicted")
plt.show()

Explanation:
This file performs the full pipeline:

Loads data
Visualizes it
Trains a model
Evaluates performance
Saves the model
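
To make the two evaluation numbers less abstract, here is a plain-Python sketch of the arithmetic behind them. The functions mse and r2 are hypothetical helpers written for illustration, and the four scores are made-up toy values, not results from the dataset. In the project itself you would keep using mean_squared_error and r2_score.

```python
def mse(actual, predicted):
    """Mean squared error: the average of the squared differences."""
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)


def r2(actual, predicted):
    """R2: 1 minus (residual sum of squares / total sum of squares)."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1 - ss_res / ss_tot


# Toy values for illustration only.
actual = [60, 70, 80, 90]
predicted = [62, 68, 83, 87]

print("MSE:", mse(actual, predicted))
print("R2 Score:", r2(actual, predicted))
```

For these toy values the MSE is 6.5, which corresponds to a typical error of about 2.5 points (the square root of 6.5), and R2 is about 0.948. An R2 close to 1 means the predictions explain most of the variation in the actual scores.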

Step 5: Run the Model (CMD)

Open Command Prompt in the project folder and run:

python project.py

Explanation:
Running this file will:

Train the model
Show graphs
Save the model file (model/student_model.pkl)

Step 6: Create Web App (app.py)

Create app.py:

import streamlit as st
import pandas as pd
import joblib

# Load the model saved by project.py
model = joblib.load("model/student_model.pkl")

st.title("Student Score Predictor")

# Sliders: minimum 0, maximum 100, default 50
reading = st.slider("Reading Score", 0, 100, 50)
writing = st.slider("Writing Score", 0, 100, 50)

if st.button("Predict"):
    # Column names and order must match the training features
    input_data = pd.DataFrame({
        'reading score': [reading],
        'writing score': [writing]
    })

    prediction = model.predict(input_data)
    st.success(f"Predicted Math Score: {prediction[0]:.2f}")

Explanation:
This creates a simple interface where users input values and get predictions.
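
One thing the app does not guard against: the output of a linear regression is not bounded, so some slider combinations could in principle give a math score below 0 or above 100. Here is a minimal sketch of a guard, assuming scores are on a 0 to 100 scale. clamp_score is a hypothetical helper, not part of Streamlit or scikit-learn.

```python
def clamp_score(value, low=0.0, high=100.0):
    """Limit a predicted score to the valid range [low, high]."""
    return max(low, min(high, float(value)))
```

In app.py you could then display clamp_score(prediction[0]) instead of prediction[0]. Whether to clamp or show the raw value is a design choice. Clamping hides how far off the model is at the extremes.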

Step 7: Run App Locally (CMD)

Open Command Prompt in the project folder and type:

streamlit run app.py

Explanation:
Your browser should open automatically and show your app. If it does not, open the local URL printed in the terminal (by default http://localhost:8501).

Step 8: Create requirements.txt

Create requirements.txt in the project folder with these lines:

pandas
numpy
matplotlib
seaborn
scikit-learn
joblib
streamlit

Explanation:
Deployment platforms use this file to install dependencies.
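
An unpinned requirements.txt installs whatever versions are current at deployment time, which may differ from the versions you used to train and save the model. One option is to pin the versions you have installed locally. The sketch below assumes Python 3.8 or newer (for importlib.metadata). pinned_line and build_requirements are hypothetical helper names.

```python
from importlib.metadata import version, PackageNotFoundError

PACKAGES = ["pandas", "numpy", "matplotlib", "seaborn",
            "scikit-learn", "joblib", "streamlit"]


def pinned_line(package):
    """Return 'name==version' if the package is installed, else just the name."""
    try:
        return f"{package}=={version(package)}"
    except PackageNotFoundError:
        return package


def build_requirements(packages):
    """Build the text of a requirements.txt with one package per line."""
    return "\n".join(pinned_line(p) for p in packages) + "\n"
```

Writing the result of build_requirements(PACKAGES) to requirements.txt gives you a file that reproduces your local versions on the deployment platform.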

Step 9: Upload to GitHub (CMD)

Open Command Prompt in the project folder and type:

git init
git add .
git commit -m "Data Science Project"
git branch -M main
git remote add origin YOUR_LINK
git push -u origin main

Explanation:
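Your project must be on GitHub before deployment. Replace YOUR_LINK with the URL of your GitHub repository. Run project.py before committing so that model/student_model.pkl is included, because app.py loads that file when it starts.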
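
A quick check before pushing can save a failed deployment. This sketch lists the files the deployed app needs, based on the steps in this guide. missing_files is a hypothetical helper and uses only the standard library.

```python
from pathlib import Path

# Files the deployed app needs, based on the steps in this guide.
REQUIRED_FILES = ["app.py", "requirements.txt", "model/student_model.pkl"]


def missing_files(project_root="."):
    """Return the required files that do not exist under project_root."""
    root = Path(project_root)
    return [name for name in REQUIRED_FILES if not (root / name).is_file()]
```

If missing_files() returns an empty list when run from the project folder, the three files are in place and ready to commit.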

Step 10: Deploy on Streamlit Cloud

Follow the steps below to deploy on Streamlit Cloud:

Go to Streamlit Cloud.
Login with GitHub.
Click New App.
Select your repository.
Choose app.py.
Click Deploy.

Explanation:
This will generate a public link where anyone can use your app.

Common Mistakes

Wrong File Path: If students.csv is not found, your folder structure is incorrect or the file is in the wrong location.
Not Running Model Before App: If the model file is missing, run project.py first to generate it.
Missing Libraries: If you see "ModuleNotFoundError", install the missing dependency with pip.
Column Name Error: If you get a KeyError, check whether the dataset's column names were modified.
Feature Mismatch: If the app crashes on prediction, make sure the input column names and order in app.py match the features used to train the model.
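
The wrong file path mistake often comes from running the script from a different folder, because relative paths like "data/students.csv" are resolved against the folder you launched Python from. One possible way around this is to build paths relative to the script itself. resolve_from is a hypothetical helper, not something the code in this guide already uses.

```python
from pathlib import Path


def resolve_from(script_file, *parts):
    """Build an absolute path relative to the folder containing script_file."""
    return Path(script_file).resolve().parent.joinpath(*parts)


# Example use inside project.py (commented out so this snippet runs anywhere):
# df = pd.read_csv(resolve_from(__file__, "data", "students.csv"))
```

With this approach the script finds data/students.csv no matter which folder Command Prompt is in when you run it.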

Conclusion

This project takes you through the complete journey of a data science workflow. You started with raw data, built a machine learning model, and then deployed it as a working application.
What makes this valuable is not just the model, but the fact that you turned it into something usable. That is what recruiters look for. If you can explain each step clearly, this project alone is enough to demonstrate your fundamentals.
Build one or two more projects like this, and you will have a strong portfolio ready for internships and placements.

Frequently Asked Questions

1. Can I run everything using CMD?

Yes, all commands in this guide are designed for Command Prompt.

2. Do I need Jupyter Notebook?

No, it is optional. You can complete everything using Python files.

3. Can I use a different dataset?

Yes, the same process applies to any tabular dataset, but you will need to update the column names in project.py and app.py to match it.

4. Why is my app not opening?

Make sure Streamlit is installed and that you ran streamlit run app.py from the project folder.

5. Is this enough for a resume project?

Yes, for beginners this is a strong and complete project.

April 16, 2026 12:00 AM
