SMS Spam Detection System with CNN and LSTM Architectures

Project Overview

Spam messages are both a nuisance and a security risk in mobile communication. This project detects spam using Natural Language Processing (NLP) techniques. Using the SMS Spam Collection dataset, it builds hybrid deep learning models that combine Convolutional Neural Networks (CNN) for local feature extraction with Long Short-Term Memory (LSTM) networks for capturing sequential dependencies in text. The primary goal is a robust classifier that stays accurate despite the class imbalance between spam and legitimate messages, using techniques such as class weighting and text augmentation.

System Architecture

Text Tokenizer

Converts raw SMS text into numerical sequences with a 10,000-word vocabulary and OOV handling.
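
As an illustration of the idea, here is a minimal pure-Python sketch of vocabulary building with an out-of-vocabulary (OOV) token. It is not the project's tokenizer: the `<OOV>` name, the regular expression, and the id layout (0 reserved for padding, 1 for OOV) are assumptions chosen for the example.

```python
import re
from collections import Counter

OOV_TOKEN = "<OOV>"  # assumed placeholder name; the text only says OOV is handled
OOV_ID = 1           # assumed layout: id 0 is left free for padding, id 1 is OOV


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())


def build_vocab(texts, num_words=10_000):
    """Assign ids 2..num_words-1 to the most frequent words in `texts`."""
    counts = Counter(word for text in texts for word in tokenize(text))
    vocab = {OOV_TOKEN: OOV_ID}
    for idx, (word, _) in enumerate(counts.most_common(num_words - 2), start=2):
        vocab[word] = idx
    return vocab


def texts_to_sequences(texts, vocab):
    """Replace each word with its id; unknown words map to the OOV id."""
    return [[vocab.get(word, OOV_ID) for word in tokenize(text)] for text in texts]


if __name__ == "__main__":
    train = ["free prize now", "free lunch", "free"]
    vocab = build_vocab(train)
    print(texts_to_sequences(["free unicorn"], vocab))
```

Words never seen during training (here "unicorn") all share the OOV id, so the model still receives a fixed-vocabulary input at inference time.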

Embedding Layer

Maps integer word indices to dense 64-dimensional vectors that serve as a learned semantic representation of each word.

CNN Layer

Applies a 1D convolution with 64 filters to extract local, n-gram-like patterns from the embedded text sequence.

LSTM/BiLSTM Layer

Processes sequential data to capture long-term dependencies, with Bidirectional support for enhanced context understanding.

Dense Output Layer

Uses a Sigmoid activation function to provide a final probability score for binary spam classification.
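
To give a feel for the size of such a stack, the following sketch counts trainable parameters per layer for the CNN-then-LSTM variant. The vocabulary size, embedding width, and filter count come from the description above; the kernel size of 5 and the 64 LSTM units are assumptions made for the example, and the formulas are the standard ones for these layer types.

```python
def embedding_params(vocab_size, embed_dim):
    # One embed_dim-sized vector per vocabulary entry.
    return vocab_size * embed_dim


def conv1d_params(in_channels, filters, kernel_size):
    # Each filter spans kernel_size positions across all input channels, plus a bias.
    return kernel_size * in_channels * filters + filters


def lstm_params(input_dim, units):
    # Four gates, each with input weights, recurrent weights, and a bias.
    return 4 * ((input_dim + units) * units + units)


def dense_params(input_dim, units):
    return input_dim * units + units


VOCAB_SIZE = 10_000  # from the text
EMBED_DIM = 64       # from the text
FILTERS = 64         # from the text
KERNEL_SIZE = 5      # assumption, not stated in the text
LSTM_UNITS = 64      # assumption, not stated in the text

layer_params = {
    "embedding": embedding_params(VOCAB_SIZE, EMBED_DIM),
    "conv1d": conv1d_params(EMBED_DIM, FILTERS, KERNEL_SIZE),
    "lstm": lstm_params(FILTERS, LSTM_UNITS),
    "dense_sigmoid": dense_params(LSTM_UNITS, 1),
}
total_params = sum(layer_params.values())

if __name__ == "__main__":
    for name, count in layer_params.items():
        print(f"{name:>14}: {count:,}")
    print(f"{'total':>14}: {total_params:,}")
```

Under these assumptions the embedding table dominates the parameter count, which is typical for small text classifiers.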

Key Features

Hybrid Model Design

Combines CNN and LSTM layers to capture both local patterns and sequential dependencies in text.

Class Weighting

Applies balanced class weights to mitigate the imbalance between the "ham" and "spam" classes.
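
A common way to compute "balanced" weights is n_samples / (n_classes * class_count), which is the heuristic scikit-learn uses for its `balanced` mode. The sketch below applies it to a toy label list; the label encoding (0 = ham, 1 = spam) and the counts are illustrative assumptions, not the dataset's real distribution.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Weight each class inversely to its frequency."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * count) for cls, count in counts.items()}


if __name__ == "__main__":
    toy_labels = [0] * 8 + [1] * 2  # assumed encoding: 0 = ham, 1 = spam
    print(balanced_class_weights(toy_labels))
```

The rarer class receives the larger weight, so each misclassified spam message contributes more to the loss than a misclassified ham message.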

Text Augmentation

Uses NLTK WordNet for synonym replacement, increasing the diversity of the training data.
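
The mechanics of synonym replacement can be sketched without WordNet. In the example below a small hand-written `SYNONYMS` table stands in for the WordNet lookup, and the replacement probability is an assumed parameter; neither is taken from the project.

```python
import random

# Hypothetical stand-in for WordNet synonym lookups.
SYNONYMS = {
    "free": ["complimentary", "gratis"],
    "win": ["gain", "earn"],
    "call": ["phone", "ring"],
}


def augment(text, replace_prob=0.3, rng=None):
    """Randomly swap words that have known synonyms; other words are kept as-is."""
    rng = rng or random.Random()
    out = []
    for word in text.split():
        candidates = SYNONYMS.get(word.lower())
        if candidates and rng.random() < replace_prob:
            out.append(rng.choice(candidates))
        else:
            out.append(word)
    return " ".join(out)


if __name__ == "__main__":
    print(augment("win a free prize call now", replace_prob=1.0, rng=random.Random(0)))
```

Each augmented message keeps the label of the original, so the technique adds variety to the minority class without new annotation.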

Automated Tuning

Integrates Keras Tuner with Grid Search to systematically search for the best-performing model hyperparameters.
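
Grid search evaluates every combination in a fixed search space. The sketch below shows that enumeration in plain Python rather than through Keras Tuner; the search space and the `score_fn` callback (lower is better, for example a validation loss) are hypothetical.

```python
from itertools import product

# Hypothetical search space; the project's actual ranges are not stated.
SEARCH_SPACE = {
    "lstm_units": [32, 64],
    "dropout": [0.2, 0.5],
    "learning_rate": [1e-3, 1e-4],
}


def grid(space):
    """Yield one dict per combination of hyperparameter values."""
    names = list(space)
    for values in product(*(space[name] for name in names)):
        yield dict(zip(names, values))


def grid_search(space, score_fn):
    """Return the configuration with the lowest score."""
    return min(grid(space), key=score_fn)


if __name__ == "__main__":
    print(sum(1 for _ in grid(SEARCH_SPACE)), "configurations to evaluate")
```

The number of trials is the product of the list lengths, so the cost grows quickly as hyperparameters are added.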

Early Stopping

Monitors validation loss during training and halts when it stops improving, which limits overfitting.
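
The stopping rule can be sketched as a patience counter over the per-epoch validation losses. The `patience` and `min_delta` values below are assumptions for illustration; the project's settings are not stated.

```python
def early_stopping_epoch(val_losses, patience=3, min_delta=0.0):
    """Return (stop_epoch, best_epoch), both zero-based.

    Training stops once `patience` consecutive epochs fail to improve
    the best validation loss by more than `min_delta`.
    """
    best = float("inf")
    best_epoch = -1
    wait = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best - min_delta:
            best, best_epoch, wait = loss, epoch, 0
        else:
            wait += 1
            if wait >= patience:
                return epoch, best_epoch
    return len(val_losses) - 1, best_epoch


if __name__ == "__main__":
    losses = [0.60, 0.45, 0.40, 0.42, 0.41, 0.43, 0.39]
    print(early_stopping_epoch(losses, patience=3))
```

With a patience of 3, the run above stops before reaching the last epoch, which shows the trade-off: a small patience saves time but can miss a late improvement.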

System Flow

Data Preparation

SMS data is downloaded, labeled, and converted into padded numerical sequences of 120 tokens.
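
Padding to a fixed length can be sketched as follows. The length of 120 comes from the description above; padding and truncating at the end of the sequence is an assumption (Keras `pad_sequences` defaults to padding and truncating at the start), as is the pad id of 0.

```python
MAX_LEN = 120  # from the text


def pad_sequence(seq, max_len=MAX_LEN, pad_id=0):
    """Truncate to max_len tokens, then right-pad with pad_id."""
    trimmed = list(seq[:max_len])
    return trimmed + [pad_id] * (max_len - len(trimmed))


if __name__ == "__main__":
    padded = pad_sequence([5, 6, 7])
    print(len(padded), padded[:5])
```

After this step every message has the same length, so a batch can be stacked into a single rectangular array.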

Model Construction

Hybrid architectures (CNN+LSTM or BiLSTM+CNN) are built with dropout layers for regularization.

Training Process

Models are trained with binary cross-entropy loss and the Adam optimizer, with class weights enabled.
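
To show how class weights enter the loss, here is a small sketch of class-weighted binary cross-entropy averaged over a batch. The weight values and the clipping constant `eps` are illustrative assumptions, and the function is a plain-Python stand-in rather than the framework's implementation.

```python
import math


def weighted_bce(y_true, y_prob, class_weight, eps=1e-7):
    """Mean of per-sample binary cross-entropy, scaled by the sample's class weight."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1.0 - eps)  # clip to avoid log(0)
        loss = -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
        total += class_weight[y] * loss
    return total / len(y_true)


if __name__ == "__main__":
    y_true = [1, 0]
    y_prob = [0.5, 0.5]
    print(weighted_bce(y_true, y_prob, {0: 1.0, 1: 1.0}))
    print(weighted_bce(y_true, y_prob, {0: 0.625, 1: 2.5}))
```

With equal weights both samples contribute the same amount; with the larger spam weight, the same prediction error on a spam message costs four times as much as on a ham message.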

Evaluation & Tuning

Performance is assessed via confusion matrices and tuned through automated Grid Search or manual iterations.
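
The quantities read off a binary confusion matrix can be computed as below. The labels and predictions are made-up toy values (1 = spam, 0 = ham), not results from this project.

```python
def confusion_counts(y_true, y_pred):
    """Count true/false positives and negatives, treating 1 (spam) as positive."""
    pairs = list(zip(y_true, y_pred))
    return {
        "tp": sum(1 for t, p in pairs if t == 1 and p == 1),
        "fp": sum(1 for t, p in pairs if t == 0 and p == 1),
        "fn": sum(1 for t, p in pairs if t == 1 and p == 0),
        "tn": sum(1 for t, p in pairs if t == 0 and p == 0),
    }


def summarize(c):
    """Return precision, recall, and false-positive rate (0.0 when undefined)."""
    precision = c["tp"] / (c["tp"] + c["fp"]) if c["tp"] + c["fp"] else 0.0
    recall = c["tp"] / (c["tp"] + c["fn"]) if c["tp"] + c["fn"] else 0.0
    fpr = c["fp"] / (c["fp"] + c["tn"]) if c["fp"] + c["tn"] else 0.0
    return {"precision": precision, "recall": recall, "fpr": fpr}


if __name__ == "__main__":
    y_true = [1, 1, 1, 0, 0, 0, 0, 0]  # toy data
    y_pred = [1, 1, 0, 1, 0, 0, 0, 0]  # toy data
    counts = confusion_counts(y_true, y_pred)
    print(counts, summarize(counts))
```

On imbalanced data these per-class figures are more informative than overall accuracy, since a model that labels everything as ham can still score a high accuracy.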

Inference Phase

New messages are tokenized and processed by the trained model to predict spam probability.
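
The last step of inference, turning the model's output into a label, can be sketched as follows. The sigmoid maps a raw score to a probability, and a decision threshold converts it to "spam" or "ham"; the threshold of 0.5 and the function names are assumptions for this example.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def label_from_probability(p_spam, threshold=0.5):
    """Map a spam probability to a label; 0.5 is an assumed default threshold."""
    return "spam" if p_spam >= threshold else "ham"


if __name__ == "__main__":
    for score in (-2.0, 0.0, 2.0):
        p = sigmoid(score)
        print(f"score={score:+.1f} p_spam={p:.3f} label={label_from_probability(p)}")
```

Raising the threshold trades recall for fewer false positives, which may be preferable when blocking a legitimate message is costly.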

Project Outcome

The project produced an accurate SMS spam classifier, with both the CNN+LSTM and BiLSTM+CNN architectures performing strongly on test data. With automated hyperparameter tuning and text augmentation, the system detected spam with good precision while keeping the false-positive rate low for legitimate messages. The final model can classify previously unseen messages in real time.
