Updated: Feb 6, 2021
Twitter Classification Cumulative Project Part-2
My challenging Part-2 project from the Codecademy Data Scientist Path, Foundations of Machine Learning: Supervised Learning course, Advanced Classification Models section.
In this project, Twitter Classification Cumulative Project, I use real tweets to find patterns in the way people use social media. There are two parts to this project:
Part-1: Viral Tweets, Predict Viral Tweets, using a K-Nearest Neighbors classifier model.
Part-2: Classifying Tweets, or Tweet Locations (this section).
+ Tweets Location Project Goal
Using a Naïve Bayes classifier, classify any tweet (or sentence) and predict whether it came from New York, London, or Paris.
+ Project Requirements
Be familiar with:
Machine Learning: Supervised Learning
The Python Libraries:
Investigate The Data
The provided data:
▪ Exploring the provided data
The columns, or features, of a tweet.
The text of the 12th tweet in the New York dataset.
The number of tweets.
The "text" feature contains useful data for predicting a tweet's location.
Text of 12th tweet: Be best #ThursdayThoughts
+ Number of tweets in new_york_tweets, london_tweets, and paris_tweets:
Number of tweets from New York: 4723
Number of tweets from London: 5341
Number of tweets from Paris: 2510
The paris_tweets DataFrame has roughly half as many tweets as the new_york_tweets and london_tweets DataFrames.
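A sketch of how the data can be loaded and inspected. In the project the tweets come from line-delimited JSON files (the file names are not given here); an inline sample is used below so the sketch is self-contained:

```python
import io
import pandas as pd

# Inline stand-in for a line-delimited JSON file of tweets;
# the project loads the real files the same way with pd.read_json.
new_york_json = io.StringIO(
    '{"text": "Be best #ThursdayThoughts"}\n{"text": "Hello NYC"}'
)
new_york_tweets = pd.read_json(new_york_json, lines=True)

print(new_york_tweets.columns.tolist())   # the features of a tweet
print(new_york_tweets.loc[0, "text"])     # the text of one tweet
print("Number of tweets from New York:", len(new_york_tweets))
```

Repeating the same load for the London and Paris files yields the counts shown above.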
Naïve Bayes Classifier
▪ Defining data and labels
To classify any tweet (or sentence) and predict whether it came from New York, London, or Paris using a Naïve Bayes classifier, I isolated the "text" feature from each city's DataFrame and combined them into one data variable named all_tweets_text.
I defined the labels associated with the new_york_tweets, london_tweets, and paris_tweets locations as follows:
0 represents a New York tweet
1 represents a London tweet
2 represents a Paris tweet
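The combined data and labels can be built as follows (a minimal sketch; tiny stand-in DataFrames replace the real tweet data):

```python
import pandas as pd

# Stand-in DataFrames; in the project these hold the real tweets.
new_york_tweets = pd.DataFrame({"text": ["Be best #ThursdayThoughts"]})
london_tweets = pd.DataFrame({"text": ["Mind the gap"]})
paris_tweets = pd.DataFrame({"text": ["Bonjour tout le monde"]})

# Combine the "text" columns into one list of tweets.
all_tweets_text = (
    new_york_tweets["text"].tolist()
    + london_tweets["text"].tolist()
    + paris_tweets["text"].tolist()
)

# 0 = New York, 1 = London, 2 = Paris
labels = (
    [0] * len(new_york_tweets)
    + [1] * len(london_tweets)
    + [2] * len(paris_tweets)
)
```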
▪ Creating training and test set
To split the data into training and test sets, I used the "train_test_split" function with the argument "random_state = 1", which sets the random seed to 1, to ensure that results are reproducible.
Labels Test Sample:
Data Test Sample:
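The split can be sketched as follows (toy data stands in for all_tweets_text and labels; the 20% test size is an assumption, since the project text only specifies random_state = 1):

```python
from sklearn.model_selection import train_test_split

# Toy data standing in for all_tweets_text and labels.
all_tweets_text = ["tweet %d" % i for i in range(10)]
labels = [0, 1, 2, 0, 1, 2, 0, 1, 2, 0]

# random_state=1 sets the random seed so the split is reproducible.
train_data, test_data, train_labels, test_labels = train_test_split(
    all_tweets_text, labels, test_size=0.2, random_state=1
)
print(len(train_data), len(test_data))
```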
▪ Making the Count Vectors
To use a Naïve Bayes classifier, the lists of words in the data need to be transformed into count vectors.
For example, the sentence "I love New York, New York" will transform into a list that contains:
Two 1s because the words "I" and "love" each appear once.
Two 2s because the words "New" and "York" each appear twice.
Many 0s because every other word in the training set didn't appear at all.
▪ Train and Test the Naïve Bayes Classifier and Predictions
I used the MultinomialNB class from the scikit-learn Python library to create my Naïve Bayes classifier model. I then trained the model and predicted the locations of the tweets in the test data.
+ Tweets test data locations predictions:
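Training and predicting follow the standard scikit-learn fit/predict pattern. A minimal sketch with toy data (the real project fits on the full training count vectors):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Toy training set; 0 = New York, 1 = London, 2 = Paris.
train_data = ["i love new york", "mind the gap on the tube", "bonjour paris"]
train_labels = [0, 1, 2]

# Turn the text into count vectors.
counter = CountVectorizer()
train_counts = counter.fit_transform(train_data)
test_counts = counter.transform(["bonjour from paris"])

# Train the Naïve Bayes classifier and predict the test tweet's location.
classifier = MultinomialNB()
classifier.fit(train_counts, train_labels)
predictions = classifier.predict(test_counts)
print(predictions)
```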
Evaluating The Model
To evaluate the model, I used the test count vectors to predict each tweet's location, and compared the predictions against the test labels using the following evaluation metrics:
More info: 5 Classification Evaluation metrics
Accuracy is the percentage of classifications that the algorithm got correct out of every classification it made.
The model accuracy score is acceptable.
Precision measures the percentage of items the classifier found that were actually relevant.
The model precision scores are acceptable.
Recall measures the percentage of the relevant items the classifier was able to successfully find.
The New York recall score is a little low.
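These scores can be computed with scikit-learn's metric functions. A sketch with hypothetical labels and predictions (the real scores come from the model's test-set predictions):

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score

# Hypothetical test labels and predictions (0 = New York, 1 = London, 2 = Paris).
test_labels = [0, 0, 1, 1, 2, 2]
predictions = [0, 1, 1, 1, 2, 2]

print("Accuracy:", accuracy_score(test_labels, predictions))
# average=None returns one score per class instead of a single average.
print("Precision:", precision_score(test_labels, predictions, average=None))
print("Recall:", recall_score(test_labels, predictions, average=None))
```

In this toy run one New York tweet is mislabeled as London, which lowers New York's recall, mirroring the pattern noted above.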
▪ Confusion Matrix
Another way to evaluate a model is to look at the confusion matrix. A confusion matrix is a table that describes how a classifier made its predictions.
For example, if there were two labels, A and B, a confusion matrix might look like this:
9 1
3 5
In this example, the first row shows how the classifier classified the true A's. It guessed that 9 of them were A's and 1 of them was a B. The second row shows how the classifier did on the true B's. It guessed that 3 of them were A's and 5 of them were B's.
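The 2×2 example above can be reproduced with scikit-learn's confusion_matrix, using labels constructed to match those counts:

```python
from sklearn.metrics import confusion_matrix

# True labels: 10 A's followed by 8 B's.
y_true = ["A"] * 10 + ["B"] * 8
# Predictions: 9 of the A's guessed right, 1 guessed as B;
# 3 of the B's guessed as A, 5 guessed right.
y_pred = ["A"] * 9 + ["B"] + ["A"] * 3 + ["B"] * 5

# Rows are true labels, columns are predicted labels.
print(confusion_matrix(y_true, y_pred, labels=["A", "B"]))
```

The same call on the project's test labels and predictions produces the 3×3 matrix for the New York, London, and Paris classes.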
This project uses three classes: 0 for New York, 1 for London, and 2 for Paris.
The classifier labels tweets that actually came from New York as either New York or London tweets, but almost never as Paris tweets. Similarly, it rarely misclassifies tweets that actually came from Paris. Tweets from two English-speaking cities are harder to distinguish than tweets written in different languages.