There are 500 million tweets per day and 800 million monthly active users on Instagram, 90 percent of whom are younger than 35. Users make 2.8 million Reddit comments per day, and 68% of Americans use Facebook. A staggering amount of data is generated every single day, and it is getting extremely difficult to pull relevant insights out of all that clutter. Is there a way to get a grasp of it for your niche in real time? I will show you one way if you read the rest of this article 🙂 I also deployed a simple real-life example at my social listening website for you to try out…
What is NLP and why is it important?
Natural Language Processing (NLP) is a field at the intersection of computer science, artificial intelligence, and linguistics. The goal is for computers to process or “understand” natural language in order to perform various human-like tasks such as language translation or question answering.
With the rise of voice interfaces and chatbots, NLP is one of the most important technologies of the information age and has become a popular area of AI. There’s a fast-growing collection of useful applications derived from the NLP field, ranging from simple to complex. Below are a few of them:
- Search, spell checking, keyword search, finding synonyms, complex question answering
- Extracting information from websites such as: products, price, dates, locations, people or names
- Machine translation (i.e. Google Translate), speech recognition, personal assistants (think of Amazon Alexa, Apple Siri, Facebook M, Google Assistant or Microsoft Cortana)
- Chat bots/dialog agents for customer support, controlling devices, ordering goods
- Matching online advertisements, sentiment analysis for marketing or finance/trading
- Identifying financial risks or fraud
How are words/sentences represented by NLP?
The genius behind NLP is a concept called word embedding. Word embeddings are representations of words as vectors, learned by exploiting vast amounts of text. Each word is mapped to one vector, and the vector values are learned in a process that resembles training a neural network.
Each word is represented by a real-valued vector, often with tens or hundreds of dimensions. Here a word vector is a row of real-valued numbers where each number is a dimension of the word’s meaning and where semantically similar words have similar vectors, e.g. Queen and Princess would have close vectors.
If we label 4 words (King, Queen, Woman, Princess) with some made up dimensions in a hypothetical word vector, it might look a bit like below:
The numbers in the word vector represent the word’s distributed weight across dimensions. The semantics of the word are embedded across these dimensions of the vector. Another simplified example across 4 dimensions is as below:
These hypothetical vector values represent the abstract ‘meaning’ of a word. The beauty of representing words as vectors is that they lend themselves to mathematical operations, so we can compute with them! They can then be used as inputs to an artificial neural network!
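To make this concrete, here is a minimal sketch in pure Python. The four-dimensional vectors below are made up for illustration (real embeddings are learned from large corpora), but they show the key idea: similarity between words reduces to a simple vector calculation.

```python
import math

# Hypothetical 4-dimensional word vectors; the values are invented for
# illustration, not learned from data.
vectors = {
    "king":     [0.9, 0.8, 0.1, 0.7],
    "queen":    [0.9, 0.1, 0.8, 0.7],
    "woman":    [0.2, 0.1, 0.9, 0.3],
    "princess": [0.7, 0.1, 0.9, 0.6],
}

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: closer to 1.0 means
    the vectors (and hence the words) point in a more similar direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Semantically similar words should have more similar vectors.
print(cosine_similarity(vectors["queen"], vectors["princess"]))  # high
print(cosine_similarity(vectors["king"], vectors["woman"]))      # lower
```

This is exactly why vector representations are so programmable: “how alike are these two words?” becomes a one-line computation.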
We can visualize the learned vectors by projecting them down to 2 dimensions as below, and it becomes apparent that the vectors capture useful semantic information about words and their relationships to one another.
These are distributional vectors, based on the assumption that words appearing in similar contexts possess similar meanings. For example, in the figure below, all the big cats (i.e. cheetah, jaguar, panther, tiger and leopard) are really close together in the vector space.
A word embedding algorithm takes a large corpus of text as input and produces these vector spaces, typically of several hundred dimensions. A neural language model is trained on the corpus, and the output of the network is used to assign each unique word a corresponding vector. The most popular word embedding algorithms are Google’s Word2Vec, Stanford’s GloVe and Facebook’s FastText.
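The training data behind that “similar context” assumption is typically built by sliding a window over the corpus and collecting (target word, context word) pairs. Here is a minimal pure-Python sketch of that windowing step, a simplification of what Word2Vec’s skip-gram mode does internally before any neural network is involved:

```python
def context_pairs(tokens, window=2):
    """Collect (target, context) pairs for every word within `window`
    positions of the target -- the raw material for skip-gram training."""
    pairs = []
    for i, target in enumerate(tokens):
        start = max(0, i - window)
        end = min(len(tokens), i + window + 1)
        for j in range(start, end):
            if j != i:  # a word is not its own context
                pairs.append((target, tokens[j]))
    return pairs

corpus = "the queen and the princess ruled the kingdom".split()
pairs = context_pairs(corpus, window=1)
print(pairs[:4])
# First pairs: ('the', 'queen'), ('queen', 'the'), ('queen', 'and'), ...
```

Words like “queen” and “princess” end up with similar vectors precisely because they tend to appear in pairs with the same context words.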
Word embeddings represent one of the most successful AI applications of unsupervised learning.
Potential shortcomings
There are shortcomings as well, such as the conflation deficiency: the inability to discriminate among different meanings of a word. For example, the word “bat” has at least two distinct meanings: a flying animal and a piece of sporting equipment. Another challenge is that a text may contain multiple sentiments all at once. For instance (source):
“The intent behind the movie was great, but it could have been better”.
The above sentence contains two polarities, Positive and Negative. So how do we conclude whether the review was Positive or Negative?
The good news is that Artificial Intelligence (AI) now delivers accurate understanding of complex human language and its nuances at scale and in (almost) real time. Thanks to pre-trained, deep learning powered algorithms, we have started seeing NLP use cases as part of our daily lives.
Latest and greatest popular news on NLP
Pre-trained NLP models can approach human-like performance and can be deployed much faster using reasonable computing resources. And the race is on!
One recent piece of NLP news is the controversy around OpenAI publishing its new GPT-2 language model while refusing to open source the full model due to its potential dark uses! Trained on 8 million web pages, GPT-2 can generate long paragraphs of human-like, coherent text and has the potential to create fake news or spoof online identities. It was basically found too dangerous to make public. This is just the beginning; we will see a lot more discussion about the dangers of unregulated AI approaches in the NLP field.
There was also recent news that Google has open sourced its natural language processing (NLP) pre-training model called Bidirectional Encoder Representations from Transformers (BERT). Then Baidu (a kind of “Google of China”) announced its own pre-trained NLP model called “ERNIE”.
Lastly, large tech companies and publishers, including Facebook and Google Jigsaw, are trying to find ways to detoxify the abundant abuse and harassment on the Internet. Until AI and NLP catch up, though, thousands of human moderators are still needed to avoid scandals. Stay tuned!
Social media sentiment analysis
How much can one person read, and how many people can one follow? Maybe you are watching the Super Bowl or the Oscars and curious about what everyone else thinks about the latest ad during the breaks. This could help spot a possible social media crisis, reach out to unhappy customers or help run a marketing/political campaign. You can avoid crises and identify the influencers…
Sentiment Analysis (or opinion mining) is a sub-field of NLP that tries to identify and extract opinions within a given text across blogs, reviews, social media, forums, news etc. Sentiment Analysis can help turn all this exponentially growing unstructured text into structured data using NLP and open source tools. For example, Twitter is a treasure trove of sentiment, with users sharing their reactions and opinions on every topic under the sun.
The good news is that in the new world of ML-driven AI, it is possible, and getting better every day, to analyze these text snippets in seconds. There are plenty of similar commercial tools available, though you can build your own DIY app just for fun!
Streaming tweets is a fun exercise in data mining. I used a powerful Python library called tweepy to access public tweets from the web in real time. The simplified idea is that we first (1) generate Twitter credentials online to be able to use the Twitter API, then (2) use tweepy together with our Twitter credentials to stream tweets based on our filter settings. We can then (3) save these streaming tweets in a database so that we can perform our own search queries, NLP operations and online analytics. That is about it!
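Step (3) can be sketched with Python’s built-in sqlite3 module. The table layout and function names below are my own simplified assumptions for illustration, not the exact schema of the deployed app; steps (1)–(2) are omitted since they require tweepy plus live Twitter credentials.

```python
import sqlite3
import time

# In-memory database for the sketch; a real deployment would use a file
# so the data survives restarts.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tweets (unix REAL, tweet TEXT, sentiment REAL)")

def store_tweet(text, sentiment):
    """Hypothetical callback: invoked for each tweet arriving on the stream,
    after the sentiment score has been computed."""
    conn.execute(
        "INSERT INTO tweets VALUES (?, ?, ?)", (time.time(), text, sentiment)
    )
    conn.commit()

store_tweet("loving this new phone", 0.7)
store_tweet("worst customer service ever", -0.8)

# A simple search query, like the one behind a search box on the site.
rows = conn.execute(
    "SELECT tweet, sentiment FROM tweets WHERE tweet LIKE ?", ("%phone%",)
).fetchall()
print(rows)
```

Because the tweets land in an ordinary SQL table, the search, NLP and analytics layers are just queries on top of it.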
What is VADER?
The other good news is that you do not need to be a deep learning or NLP expert to start coding your ideas. One of the readily available pre-trained algorithms is called VADER (Valence Aware Dictionary and sEntiment Reasoner), a lexicon (dictionary of sentiments) and a simple rule-based model for general sentiment analysis. Its algorithms are optimized for sentiment expressed in social media such as Twitter, online news, movie/product reviews etc. VADER gives us positivity and negativity scores that can be standardized to a range of -1 to 1. VADER is able to include sentiment from emoticons (e.g. :-)), sentiment-related acronyms (e.g. LOL) and slang (e.g. meh), where computers typically struggle. Thus VADER is an awesome tool for fresh online text.
VADER has several advantages: it works well on social-media-style text, it doesn’t require any training data as it is based on a valence-based, human-curated gold standard sentiment lexicon, and, important for me, it is fast enough to be used online with streaming data. The developers of VADER used Amazon’s Mechanical Turk to get most of their ratings, and the approach is described fully in an academic paper entitled “VADER: A Parsimonious Rule-based Model for Sentiment Analysis of Social Media Text.”
The incoming sentences are first split up into words via a process called “tokenization”. It is then much easier to look up the sentiment value of each word by comparing it with the sentiment lexicon of the selected model. There is actually no ML going on: the library parses every tokenized word, compares it with its lexicon and returns polarity scores, which are combined into an overall score for the tweet. VADER is also available as an open source Python library and can be installed with a regular pip install. It does not require any training data and works fast enough to be used with almost REAL TIME streaming data, so it was an easy choice for my hands-on example.
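To illustrate the tokenize-then-look-up idea, here is a toy lexicon-based scorer in pure Python. To be clear, the tiny lexicon, values and normalization below are mine, not VADER’s (its real lexicon has thousands of human-rated entries plus rules for negation, punctuation and capitalization), but the mechanics are the same.

```python
# Toy sentiment lexicon on a roughly -4..4 scale, in the spirit of
# VADER's human-rated lexicon (values here are invented).
LEXICON = {"great": 3.1, "good": 1.9, "meh": -0.8, "worst": -3.1, "love": 3.2}

def polarity(sentence):
    """Tokenize, look each word up in the lexicon, average the hits,
    and normalize into the -1 to 1 range."""
    tokens = [t.strip(".,!?") for t in sentence.lower().split()]
    scores = [LEXICON[t] for t in tokens if t in LEXICON]
    if not scores:
        return 0.0  # no lexicon hits -> neutral
    return sum(scores) / (len(scores) * 4.0)  # lexicon values span -4..4

print(polarity("I love this great movie"))  # positive
print(polarity("Meh, the worst sequel"))    # negative
```

Even this crude version shows why the approach is so fast: scoring a tweet is just a handful of dictionary lookups, with no model inference at all.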
Basic Data Clean up
Any DIY code would need to do some real-time cleanup to remove stop words and punctuation marks, lowercase the text and filter tweets based on a language of interest. The Twitter API (via tweepy) has an auto-detect feature for the common languages. There are also some other popular NLP techniques you can apply, including lemmatisation (converting words to their dictionary form) or stemming (reducing words to their root form), to further improve the results.
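A minimal pure-Python version of that cleanup step might look like this. The stop-word list here is a tiny hand-picked sample for illustration; in practice you would load the much fuller list from NLTK.

```python
import string

# Tiny hand-picked stop-word set for the sketch; in a real app you would
# use nltk.corpus.stopwords.words("english") instead.
STOP_WORDS = {"the", "a", "an", "is", "are", "and", "or", "to", "of"}

def clean(tweet):
    """Lowercase, strip punctuation, and drop stop words,
    returning the content-bearing tokens."""
    lowered = tweet.lower()
    no_punct = lowered.translate(str.maketrans("", "", string.punctuation))
    return [w for w in no_punct.split() if w not in STOP_WORDS]

print(clean("The BEST phone, and a great camera!"))
# -> ['best', 'phone', 'great', 'camera']
```

The surviving tokens are what you would feed into the word cloud or any further lemmatisation/stemming step.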
Hands on MVP example using live Twitter data:
Finally, I deployed an example model at my blog to demonstrate the power of pre-trained NLP models using real-time Twitter data with English tweets only. This minimum viable product is built with only free open source tools. The inspiration and the original code come from Python programming YouTuber Sentdex at this link. I added extra functionalities like a Google-like search experience, a US states sentiment map to capture tweets with location metadata, a word cloud for the searched terms, and error handling to avoid breakdowns. You can download the modified code from my GitHub repository and follow these instructions for deployment. The code is messy, as I wrote it in limited time, and I am open to any help to make it look better.
Dependencies: Open source tech & cloud
The majority of the work is getting all these components installed and working together, plus data cleanup and analytics, while the VADER model itself is only a few lines of basic code.
Open source tech: I used Python 3.7 together with various open source libraries. The main ones are (1) Tweepy: Twitter API library to stream public tweets in JSON format; (2) SQLite3: widely used lightweight relational database; (3) Pandas: great for manipulating numerical tables and time series; (4) NLTK: Natural Language Toolkit; (5) wordcloud; (6) Flask: micro web framework for web deployment; (7) Dash: enables you to build awesome dashboards using pure Python; (8) Plotly: the most popular Python graphing library for interactive and online graphs such as line plots, scatter plots, area charts, bar charts… First you need to register for the Twitter API, install the dependencies, write your code and deploy it to your laptop or the cloud.
Cloud: I used DigitalOcean with a low-tier frugal server, so the speed is not super fast, but it works and is relatively secure with SSL. Initially I deployed with AWS Elastic Beanstalk, though I had to pivot as the available Amazon Linux distribution was not compatible with the latest SQLite3 version I wanted to deploy. These are typical challenges when you deal with open source tools, and they require data scientists to be flexible and have a LOT of patience. Interestingly, DigitalOcean gave me more control in this case. Another good option is Heroku. If you use AWS, I recommend leveraging a managed database service such as MySQL or MongoDB (vs. SQLite3). Let me know if you have questions.
The VADER model demonstrated here is not perfect but quite indicative. There are some false negatives and positives, as with any algorithm, though more advanced and accurate ML algorithms are coming our way.
These capabilities could easily be reapplied to emails, Facebook, Twitter, Instagram, YouTube, Reddit, IMDB, eRetailer reviews, news blogs and the public web. The insight could be parsed by location, demographics, popularity, impact… It has never been easier to measure the pulse of the net.
AI/machine learning democratizes critical insights and enables real-time access to them for your niche. Tracking itself isn’t worth it if you’re not going to act on the insights. Future survivors will need to transform their processes and resources to adopt and adapt to the new age of abundant data and algorithms.
Happy learning in 2019! And…
Do not forget to stay in the moment!