
Sentify

A model that lets you analyze user sentiment from tweets. It is especially useful for individuals or organizations trying to gauge public sentiment toward their products or services.

The problem Sentify solves

Sentiment analysis of text feedback is used by an ever-growing number of people and organizations to automatically classify feedback in the form of comments or chats. Many companies want to gauge public sentiment toward their products or services and use that feedback to improve them. Since it is impossible for a single person to read everything, and impractical to dedicate human resources to the task, a machine learning solution is very useful.
Automatically knowing whether a review or comment is positive or negative goes a long way when looking for feedback to improve a service or application. Our model currently takes a sentence and classifies its sentiment as either positive or negative.

Explanation of Steps:

  • Pre-Processing - Preprocessing is a standard first stage in any task involving Twitter data because of the language irregularities present in tweets (URLs, @mentions, hashtags, and the like); see the cleaning sketch after this list.
  • Pre-trained word vectors - Learning word representations from massive unannotated text corpora has recently proven useful in many NLP tasks. Leveraging large corpora for unsupervised learning of word representations captures the syntactic and semantic characteristics of words; see the vector-loading sketch after this list.
  • DCNN model - CNNs with a pooling operation deal naturally with variable-length sentences, and they also take into account the ordering of the words and the context in which each word appears; see the model sketch after this list.
  • Tokenization - Tokenization is the process of splitting the text of a document into a series of tokens, identifying each word in the document for further processing, for example to build a term-document matrix; see the tokenization sketch after this list.
  • Train Embedding Layer - A word embedding represents each word in the vocabulary as a real-valued vector in a high-dimensional space. The vectors are learned so that words with similar meanings have similar representations, i.e. they lie close together in the vector space; this is the Embedding layer in the model sketch below.
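
A minimal cleaning sketch (illustrative, not the project's exact pipeline): it lowercases the tweet and strips URLs, @mentions, hashtag symbols, and non-alphabetic characters with plain regular expressions.

```python
import re

def clean_tweet(text):
    """Normalize a raw tweet before tokenization."""
    text = text.lower()
    text = re.sub(r"https?://\S+|www\.\S+", " ", text)  # drop URLs
    text = re.sub(r"@\w+", " ", text)                   # drop @mentions
    text = re.sub(r"#", "", text)                       # keep hashtag word, drop '#'
    text = re.sub(r"[^a-z\s]", " ", text)               # drop digits/punctuation/emoji
    return re.sub(r"\s+", " ", text).strip()            # collapse whitespace

print(clean_tweet("Loving the new update!! https://t.co/xyz @sentify #happy"))
# -> "loving the new update happy"
```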
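
Tokenization could then look like the following sketch using the Keras Tokenizer; the vocabulary size and sequence length are illustrative assumptions, not the project's actual settings.

```python
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

corpus = ["loving the new update happy", "worst customer service ever"]  # toy data

tokenizer = Tokenizer(num_words=10000, oov_token="<unk>")
tokenizer.fit_on_texts(corpus)                    # build word -> index vocabulary

sequences = tokenizer.texts_to_sequences(corpus)  # words -> integer ids
padded = pad_sequences(sequences, maxlen=40, padding="post")
print(padded.shape)                               # (2, 40): fixed-length inputs
```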
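
Loading pre-trained vectors into an embedding matrix might look like this sketch. The file name assumes the publicly available GloVe Twitter vectors (glove.twitter.27B.100d.txt), and `tokenizer` is the one fitted above.

```python
import numpy as np

EMBED_DIM = 100  # must match the dimensionality of the GloVe file used

# Each GloVe line is "word v1 v2 ... v100".
embeddings = {}
with open("glove.twitter.27B.100d.txt", encoding="utf-8") as f:
    for line in f:
        parts = line.rstrip().split(" ")
        embeddings[parts[0]] = np.asarray(parts[1:], dtype="float32")

# Rows of the matrix line up with the tokenizer's word indices;
# words missing from GloVe stay as zero vectors.
vocab_size = min(10000, len(tokenizer.word_index) + 1)
embedding_matrix = np.zeros((vocab_size, EMBED_DIM))
for word, idx in tokenizer.word_index.items():
    if idx < vocab_size and word in embeddings:
        embedding_matrix[idx] = embeddings[word]
```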
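
Finally, a simplified single-branch sketch of the model in Keras (the actual DCNN may use several filter widths; layer sizes here are illustrative). The Embedding layer is initialized from `embedding_matrix` and fine-tuned during training, and global max-pooling is what lets the network handle variable-length sentences.

```python
import tensorflow as tf
from tensorflow.keras import layers, models

model = models.Sequential([
    layers.Embedding(
        vocab_size, EMBED_DIM,
        embeddings_initializer=tf.keras.initializers.Constant(embedding_matrix),
        trainable=True),                       # fine-tune the pre-trained vectors
    layers.Conv1D(128, 5, activation="relu"),  # 5-gram feature detectors
    layers.GlobalMaxPooling1D(),               # pool over all word positions
    layers.Dense(64, activation="relu"),
    layers.Dropout(0.5),
    layers.Dense(1, activation="sigmoid"),     # P(positive sentiment)
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
```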

Challenges we ran into

  • We faced issues choosing a model to train on our data; we settled on a DCNN since we were familiar with how it works.
  • We also faced issues choosing hyperparameters, as a brute-force search took too long, so we used heuristics to pick near-optimal values.
  • We faced issues with the Twitter API, as it requires a verified Twitter developer account. To work around this, we selected tweets from our own large dataset that matched a given search word; a minimal sketch follows this list.
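
A minimal sketch of that workaround, assuming the dataset lives in a CSV with a `text` column (both the file name and the column name are hypothetical):

```python
import pandas as pd

df = pd.read_csv("tweets.csv")  # hypothetical local dataset of tweets

def search_tweets(keyword, limit=50):
    """Return up to `limit` tweets containing `keyword`, case-insensitively."""
    hits = df[df["text"].str.contains(keyword, case=False, na=False)]
    return hits["text"].head(limit).tolist()

print(search_tweets("battery")[:3])
```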

Discussion