Disaster Response Message Classification Pipelines (NLTK & Flask)

Typhoon victim Marimar Bacolod, smiles after receiving bags of relief goods. (Photo: ROMEO GACAD/ Getty Images)

Project Description

The Figure Eight Disaster Response Messages data set provides thousands of messages that have been sorted into 36 categories, such as Water, Hospitals, and Aid-Related. These categories are specifically aimed at helping emergency personnel target their aid efforts.

The main goal of this project is to build an app (Screenshots below) that can help emergency workers analyze incoming messages and sort them into specific categories to speed up aid and contribute to more efficient distribution of people and other resources.

You can find all the code available on my GitHub page here.


Here’s the disaster response app that I built!


Table of Contents

  1. Libraries
  2. File Description
  3. Analysis
  4. Results
  5. Future Improvements
  6. Licensing, Authors, and Acknowledgements

Libraries

  • pandas
  • numpy
  • sqlalchemy
  • matplotlib
  • plotly
  • NLTK (with the punkt, wordnet, and stopwords corpora)
  • sklearn
  • joblib/pickle
  • flask

File Description

There are three main folders:

  1. Data
    • disaster_categories.csv: dataset including all the categories
    • disaster_messages.csv: dataset including all the messages
    • process_data.py: ETL pipeline scripts to read, clean, and save data into a database
    • DisasterResponse.db: output of the ETL pipeline, i.e. SQLite database containing messages and categories data
  2. Models
    • train_classifier.py: machine learning pipeline scripts to train and export a classifier
    • classifier.pkl: output of the machine learning pipeline, i.e. a trained classifier
  3. App
    • run.py: Flask file to run the web application
    • templates: folder containing the HTML files for the web application

Analysis

Data Preparation

  • Modify the categories CSV: split each category into a separate column
  • Merge the data from the two CSV files (disaster_messages.csv & disaster_categories.csv)
  • Remove duplicates and any non-categorized values
  • Create the SQLite database DisasterResponse.db from the merged data sets
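The ETL steps above can be sketched with pandas and the standard-library sqlite3 module. This is a minimal sketch with toy stand-in rows, assuming the usual Figure Eight layout in which the categories CSV packs all labels into one semicolon-separated string (e.g. "related-1;water-1"); the real pipeline reads the full CSVs and writes DisasterResponse.db.

```python
import sqlite3

import pandas as pd

# Toy stand-ins for disaster_messages.csv / disaster_categories.csv
messages = pd.DataFrame({
    "id": [1, 2],
    "message": ["We need water", "Hospital is full"],
})
categories = pd.DataFrame({
    "id": [1, 2],
    "categories": ["related-1;water-1;hospitals-0",
                   "related-1;water-0;hospitals-1"],
})

# Split the packed category string into one binary column per label
split = categories["categories"].str.split(";", expand=True)
split.columns = [cell.split("-")[0] for cell in split.iloc[0]]
for col in split.columns:
    split[col] = split[col].str[-1].astype(int)

# Merge the two data sets and remove duplicates
df = pd.concat([categories[["id"]], split], axis=1).merge(messages, on="id")
df = df.drop_duplicates()

# Save to SQLite (an in-memory database here; DisasterResponse.db in the project)
conn = sqlite3.connect(":memory:")
df.to_sql("DisasterResponse", conn, index=False, if_exists="replace")
```

pandas can write through a plain sqlite3 connection, so SQLAlchemy is optional for this step.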

Text Preprocessing

  • Tokenize text
  • Remove special characters
  • Lemmatize text
  • Remove stop words

Build Machine Learning Pipeline

  • Build a pipeline with CountVectorizer and TfidfTransformer
  • Complete the pipeline with a MultiOutputClassifier wrapping a random forest
  • Train the pipeline (with a train/test split)
  • Print classification reports and accuracy scores
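A minimal scikit-learn sketch of this pipeline, trained on a hypothetical toy corpus with just two of the 36 labels (water and hospitals) rather than the project's full data:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.multioutput import MultiOutputClassifier
from sklearn.pipeline import Pipeline

# Toy corpus (hypothetical) with two binary labels: [water, hospitals]
X = np.array(["we need water", "send drinking water", "hospital is full",
              "the hospital needs doctors", "water supply is out",
              "no beds at the hospital"] * 5)
y = np.array([[1, 0], [1, 0], [0, 1], [0, 1], [1, 0], [0, 1]] * 5)

pipeline = Pipeline([
    ("vect", CountVectorizer()),     # bag-of-words counts
    ("tfidf", TfidfTransformer()),   # reweight counts by tf-idf
    ("clf", MultiOutputClassifier(   # one random forest per output label
        RandomForestClassifier(n_estimators=50, random_state=0))),
])

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)
pipeline.fit(X_train, y_train)
print(classification_report(y_test, pipeline.predict(X_test),
                            target_names=["water", "hospitals"]))
```

MultiOutputClassifier fits one classifier per category, which is what lets a single pipeline predict all 36 labels at once.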

Improve Model

  • Perform GridSearchCV
  • Find the best parameters
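GridSearchCV can tune both the text features and the classifier through the pipeline's nested parameter names. A sketch on toy data follows; the grid shown is illustrative, not the project's actual search space.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.model_selection import GridSearchCV
from sklearn.multioutput import MultiOutputClassifier
from sklearn.pipeline import Pipeline

# Toy corpus (hypothetical); the real search runs on the full data set
X = np.array(["we need water", "hospital is full"] * 10)
y = np.array([[1, 0], [0, 1]] * 10)

pipeline = Pipeline([
    ("vect", CountVectorizer()),
    ("tfidf", TfidfTransformer()),
    ("clf", MultiOutputClassifier(RandomForestClassifier(random_state=0))),
])

# Nested names use step__param; the MultiOutputClassifier exposes the
# inner random forest's parameters under clf__estimator__*
param_grid = {
    "vect__ngram_range": [(1, 1), (1, 2)],
    "clf__estimator__n_estimators": [10, 20],
}
search = GridSearchCV(pipeline, param_grid, cv=2)
search.fit(X, y)
print(search.best_params_)
```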

Export Model as .pkl File

  • You could also use joblib, as it can be faster for objects that carry large NumPy arrays. Read more here
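Both options look like this. A tiny stand-in model is trained here in place of the real pipeline, just so the dump/load round trip is runnable.

```python
import os
import pickle
import tempfile

from joblib import dump, load
from sklearn.ensemble import RandomForestClassifier

# Stand-in for the trained pipeline (hypothetical toy data)
clf = RandomForestClassifier(n_estimators=5, bootstrap=False, random_state=0)
clf.fit([[0]] * 10 + [[1]] * 10, [0] * 10 + [1] * 10)

# Option 1: standard-library pickle
pkl_path = os.path.join(tempfile.gettempdir(), "classifier.pkl")
with open(pkl_path, "wb") as f:
    pickle.dump(clf, f)
with open(pkl_path, "rb") as f:
    restored = pickle.load(f)

# Option 2: joblib, often faster for estimators holding large NumPy arrays
joblib_path = os.path.join(tempfile.gettempdir(), "classifier.joblib")
dump(clf, joblib_path)
restored2 = load(joblib_path)
```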

Results

  1. Created an ETL pipeline to read data from two csv files, clean data, and save data into a SQLite database.
  2. Created a machine learning pipeline to train a multi-output classifier on the various categories in the dataset.
  3. Created a Flask app to show data visualization and classify any message that users would enter on the web page.

Future Improvements

Much of this data was imbalanced; some labels had only a handful of inputs. Here are some approaches to improve the model in the future. Read more here

  • Change the performance metrics (focus more on recall and the F1-score: the harmonic mean of precision and recall)
  • Generate Synthetic Data
  • Use different algorithms, such as multilabel algorithms that take into account that labels may be connected and may not be mutually exclusive
  • Use Penalized Classification Algorithms
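As one sketch of the last option, scikit-learn's random forest accepts class_weight="balanced", which reweights each class inversely to its frequency so that mistakes on the rare positive classes are penalized more heavily. The data below is a hypothetical imbalanced stand-in, not the project's.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.multioutput import MultiOutputClassifier

# Imbalanced toy labels: positives are rare in each column
rng = np.random.RandomState(0)
X = rng.rand(100, 4)
y = np.zeros((100, 2), dtype=int)
y[:5, 0] = 1   # only 5 positives for the first label
y[:3, 1] = 1   # only 3 positives for the second label

# class_weight="balanced" penalizes errors on the rare class harder
clf = MultiOutputClassifier(
    RandomForestClassifier(class_weight="balanced", random_state=0))
clf.fit(X, y)
```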

Licensing, Authors, and Acknowledgements

Thanks to Udacity for the starter code and Figure Eight for providing the data set used in this project.
