WWCode Talks Tech #6: Emoji Predictor with Machine Learning

Written by Alex Gamez

WWCode Talks Tech

Women Who Code Talks Tech 6
Alex Gamez, Jr. Software Engineer at mPulse Mobile, gives a talk entitled “Emoji Predictor with Machine Learning,” in which she discusses using data science and data mining to build a program that can predict which emojis will be used in context.

At the end of my senior year of school, I joined a natural language processing class. It was super interesting, but one thing I did have trouble with was figuring out how to start. This is a machine learning problem, and it's hard to get started because you don't really know what's going on. One of my friends, who had done a similar project, walked me through it. It's actually pretty simple. The project itself is about 10 lines long, which is pretty cool, but you still need to know a couple of concepts.

I chose emoji prediction because I think it's cool, the things you can do with Twitter, data mining, and data science. Let's start with natural language processing. It's basically how computers interact with human language. Some of the problems it covers are text classification, language identification, emoji prediction, and language modeling. If you're texting and your phone suggests a word or a correction for a spelling mistake you've made, that's language modeling, or language prediction. Speech recognition goes from audio to text. Caption generation takes an image and, from there, describes the contents of that scene. There's also machine translation, document summarization, and question answering.

We see text, we see tweets, and we recognize exactly what's happening. It's a little bit harder for computers, because all they see are ones and zeros. Emoji prediction is text classification. Give a model an article and it can determine which genre it belongs to; switch that to emojis, and you give it a tweet and it'll say which emoji it might suggest for you. You can also do sentiment analysis, which is another kind of text classification. This is machine learning, natural language processing, and language all interacting as one.

Let's move on to emoji prediction. We're going to create a model that, when given a tweet, will be able to predict which emoji should be used. There are stages to machine learning in general, not just natural language processing. We will be focusing on pre-processing, which includes feature extraction, and on training. This is a pretty basic application; it's not gonna go super into depth. We'll cover what features are and why they're important to machine learning and training.

The type of training we're gonna be doing is supervised learning. There are a couple of different types of learning when it comes to machine learning: unsupervised learning, supervised learning, and others. In unsupervised learning, you give your training algorithm only a data set. If the data set is pictures of cats and dogs, unsupervised learning would just cluster them; it won't know what each cluster is, because you're just giving it the pictures. With supervised learning, you give the model your data, but you also give it labels. You say, this is a picture, and this is a picture of a cat. The algorithm will learn that, and the next time you give it a picture of a cat, it'll be able to recognize it.
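
As a rough illustration, here's a minimal sketch (not the talk's code) of that difference, using scikit-learn and made-up toy features standing in for the cat and dog pictures:

```python
# A minimal sketch contrasting the two settings; the feature vectors
# below are hypothetical toy data, not the emoji data set.
from sklearn.cluster import KMeans
from sklearn.neural_network import MLPClassifier

X = [[0.9, 0.1], [0.8, 0.2], [0.1, 0.9], [0.2, 0.8]]  # toy features

# Unsupervised: only the data is given. KMeans groups it into clusters,
# but it never knows what the clusters mean.
clusters = KMeans(n_clusters=2, n_init=10).fit_predict(X)

# Supervised: the data comes with labels, so the trained model can
# name the class of an example it hasn't seen before.
labels = ["cat", "cat", "dog", "dog"]
clf = MLPClassifier(max_iter=2000, random_state=0).fit(X, labels)
print(clf.predict([[0.85, 0.15]]))  # expected: ['cat']
```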

This is where we actually see some of the code. The library that my friend used and told me to use was scikit-learn (sklearn). It was pretty easy and straightforward. Its modules are going to be used for feature extraction and pre-processing. Then there are the classifiers; the one we're going to be using is the neural network. These are different algorithms, and some are better for text classification, different types of classes, and different types of machine learning applications. I use neural networks because they're good at this kind of problem.
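
For reference, the relevant scikit-learn imports look something like the following; the talk's exact code isn't reproduced here, so treat this as a sketch:

```python
# Feature extraction / pre-processing modules from scikit-learn.
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

# The neural-network classifier used in this talk.
from sklearn.neural_network import MLPClassifier

# Other classifiers live in sibling modules, e.g.:
from sklearn.naive_bayes import MultinomialNB  # often strong on token counts
from sklearn.svm import LinearSVC              # a common text-classification baseline
```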

We're going to use a data set of 1,000 tweets for this example. The pre-processing step is to organize our data. We will read the documents, the 1,000 tweets and the 1,000 labels, and put the emojis in a list. It reads line by line, and each index, each entry in the list, is a tweet. Same with the emojis: it reads that document line by line and separates it. Index one of one list corresponds to index one of the other list, index two corresponds to index two, and so on until we get to index 999. There are four lines of code where the heavy lifting is happening. We have something called a CountVectorizer, where we're learning the vocabulary dictionary and returning a term-document matrix; the CountVectorizer converts a collection of text documents to a matrix of token counts.
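
A minimal sketch of that step might look like this; the file names tweets.txt and emojis.txt are hypothetical stand-ins for the data set:

```python
# Pre-processing sketch: line i of one file holds tweet i,
# line i of the other holds its emoji label.
with open("tweets.txt", encoding="utf-8") as f:
    tweets = [line.strip() for line in f]   # index i -> tweet i
with open("emojis.txt", encoding="utf-8") as f:
    emojis = [line.strip() for line in f]   # index i -> label for tweet i

from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer()
# Learn the vocabulary dictionary and return the term-document matrix:
# one row per tweet, one column per token, cells holding counts.
counts = vectorizer.fit_transform(tweets)
```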

We're going to take the tweets and use something called the Bag of Words model. The Bag of Words model, in terms of text classification, represents each document by the words it contains and the number of times they appear, and the words will be weighted differently based on originality. The way we're going to represent this to our algorithm is with a term-document matrix. If we just rely on raw term frequency as a feature, a word that appears a million times appears a million times in every type of document, so it tells us nothing; whenever we see that, we really don't wanna take it into account. A way around this is to normalize the data. The word 'the' gets a weight of zero. Why? Because it appears everywhere. Normalization chooses which features should carry more weight.
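
In scikit-learn terms, that normalization is typically done with TF-IDF weighting; here's a sketch, assuming it builds on the counts matrix from the previous step:

```python
# Normalization sketch: TF-IDF downweights words that appear everywhere.
from sklearn.feature_extraction.text import TfidfTransformer

tfidf = TfidfTransformer()
# 'counts' is the token-count matrix from the previous step. A word like
# "the" that occurs in every tweet ends up with the minimum weight
# (scikit-learn's smoothed IDF keeps it slightly above zero).
X = tfidf.fit_transform(counts)
```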

When it comes to text classification and figuring out what should go where, we really want to take into consideration only words that would make sense, original words. The normalized term-document matrix takes the tweets and assigns weights to the words depending on how many times they appeared. The words that appear less often are probably going to get a higher weight, because they're more original. This is where we make our machine learn, and this is where the supervised part of the learning comes in: you're not only giving it the normalized term-document matrix, but you're also giving it its corresponding labels. Import the MLPClassifier, which is just a neural-network type of algorithm. If you think of it in terms of a function, you're using an algorithm, the MLPClassifier, to create a function that, given a tweet, will return an emoji.
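
Putting it together, the training and prediction step might look like this sketch; the example tweet at the end is made up:

```python
# Supervised training sketch: the normalized matrix plus the parallel
# list of emoji labels from the pre-processing step.
from sklearn.neural_network import MLPClassifier

model = MLPClassifier()
model.fit(X, emojis)   # X: normalized term-document matrix, emojis: labels

# The learned "function": push a new tweet through the same pipeline,
# get an emoji back.
new = tfidf.transform(vectorizer.transform(["good morning sunshine"]))
print(model.predict(new))
```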