Introduction to Flair for NLP: A Simple yet Powerful State-of-the-Art NLP Library

Introduction

The last couple of years have been incredible for Natural Language Processing (NLP) as a domain! We have seen multiple breakthroughs – ULMFiT, ELMo, Facebook’s PyText, and Google’s BERT, among many others. These have rapidly accelerated state-of-the-art research in NLP (and language modeling in particular).

We can now predict the next sentence, given a sequence of preceding words.

What’s even more important is that machines are now beginning to grasp the key element that had eluded them for so long.

Context! Understanding context has broken down the barriers that previously prevented NLP techniques from making headway. And today, we are going to talk about one such library – Flair.

Until now, words were represented either as sparse matrices or as word embeddings such as GloVe, BERT, and ELMo, and the results have been pretty impressive. But there’s always room for improvement, and Flair is ready to fill it.

In this article, we will first understand what Flair is and the concept behind it. Then we’ll dive into implementing NLP tasks using Flair. Get ready to be impressed by its accuracy!

Please note that this article assumes familiarity with NLP concepts, so a quick refresher on the basics of word embeddings may help before diving in.

Table of contents

  1. What is ‘Flair’ Library?
  2. What gives Flair the Edge
  3. Introduction to Contextual String Embeddings for Sequence Labeling
  4. Performing NLP Tasks in Python using Flair
  5. What’s Next for Flair?

What is ‘Flair’ Library?

Flair is a simple natural language processing (NLP) library developed and open-sourced by Zalando Research. Flair’s framework builds directly on PyTorch, one of the best deep learning frameworks out there. The Zalando Research team has also released several pre-trained models for the following NLP tasks:

  1. Named-Entity Recognition (NER): Recognises whether a word in the text represents a person, a location, or an organisation name (a quick sketch follows this list)
  2. Parts-of-Speech Tagging (PoS): Tags each word in the text with the part of speech it belongs to
  3. Text Classification: Classifies text into predefined categories (labels)
  4. Training Custom Models: Lets us build our own custom models
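For instance, here is a minimal sketch of loading and running the pre-trained NER model (the sentence is made up for illustration; we walk through tagging in detail later in this article):

from flair.data import Sentence
from flair.models import SequenceTagger

## load the pre-trained NER tagger and tag a sample sentence ##
tagger = SequenceTagger.load('ner')
sentence = Sentence('George Washington went to Washington .')
tagger.predict(sentence)
print(sentence.to_tagged_string())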

All of this looks promising. But what truly caught my attention was seeing Flair outperform several previous state-of-the-art results on standard NLP benchmarks such as NER and PoS tagging (the comparison table, with F1 scores per task, is available in Flair’s repository).

Note: The F1 score is an evaluation metric primarily used for classification tasks. It is often preferred over plain accuracy when evaluating models because it takes the distribution of the classes into account.
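For reference, the F1 score is the harmonic mean of precision and recall:

F1 = 2 * (precision * recall) / (precision + recall)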

What Gives Flair the Edge?

There are plenty of awesome features packaged into the Flair library. Here’s my pick of the most prominent ones:

  1. It comprises popular and state-of-the-art word embeddings, such as GloVe, BERT, ELMo, and Character Embeddings. They are very easy to use thanks to the Flair API
  2. Flair’s interface allows us to combine different word embeddings and use them to embed documents. This in turn leads to a significant uptick in results
  3. ‘Flair Embedding’ is the signature embedding provided within the Flair library. It is powered by contextual string embeddings. We’ll understand this concept in detail in the next section
  4. Flair supports a number of languages – and is always looking to add new ones

Introduction to Contextual String Embeddings for Sequence Labeling

Context is so vital when working on NLP tasks. Learning to predict the next character based on previous characters forms the basis of sequence modeling.

Contextual String Embeddings leverage the internal states of a trained character-level language model to produce a novel type of word embedding. In simple terms, they use the internal representations of a trained character model, so the same word can have different embeddings in different sentences.

Note: A language model (at the word or character level) is a probability distribution over sequences, such that every new word or character depends on the words or characters that came before it. Have a look here to know more about it.
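Concretely, a language model factorises the probability of a sequence as a product of conditional probabilities:

P(x_1, ..., x_T) = P(x_1) * P(x_2 | x_1) * ... * P(x_T | x_1, ..., x_(T-1))

where each x_t is a word (or a character, in the case of a character-level model).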

There are two primary factors powering contextual string embeddings:

  1. The model is trained on characters (without any notion of words). In other words, it works similarly to character embeddings
  2. The embeddings are contextualised by their surrounding text. This implies that the same word can have different embeddings depending on the context. Quite similar to natural human language, isn’t it? The same word may have different meanings in different situations

Let’s look at an example to understand this:
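The sketch below is a minimal illustration (with made-up sentences): it embeds the same word, ‘Washington’, in two different contexts, and the two vectors it prints will differ because the surrounding text differs.

from flair.data import Sentence
from flair.embeddings import FlairEmbeddings

flair_embedding = FlairEmbeddings('news-forward')

s1 = Sentence('Washington was the first president of the USA .')
s2 = Sentence('He moved to Washington last year .')
flair_embedding.embed(s1)
flair_embedding.embed(s2)

## print the first few dimensions of each 'Washington' vector ##
for sentence in [s1, s2]:
    for token in sentence:
        if token.text == 'Washington':
            print(token.embedding[:5])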

Explanation: because each word’s embedding is built from the characters surrounding it, the two occurrences of ‘Washington’ end up with different vectors, each reflecting its own context.

Language is such a wonderful yet complex thing. You can read more about Contextual String Embeddings in this Research Paper.

Performing NLP Tasks in Python using Flair

It’s time to put Flair to the test! We’ve seen what this awesome library is all about. Now let’s see firsthand how it works on our machines.

We’ll use Flair to perform all the below NLP tasks in Python:

  1. Text Classification using the Flair embeddings
  2. Part of Speech Tagging (PoS) and comparison with the NLTK library

Setting up the Environment

We will be using Google Colaboratory for running our code. One of the best things about Colab is that it provides GPU support for free! It is pretty handy for training deep learning models.

Why use Colab?

All you need is a stable internet connection.
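To confirm that a GPU runtime is enabled (Runtime → Change runtime type in Colab), a quick check:

import torch
print(torch.cuda.is_available())  ## True when a GPU runtime is active ##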

About the Dataset

We’ll be working on the Twitter Sentiment Analysis practice problem. Go ahead and download the dataset from there (you’ll need to register/log in first).

The problem statement posed by this challenge is:

The objective of this task is to detect hate speech in tweets. For the sake of simplicity, we say a tweet contains hate speech if it has a racist or sexist sentiment associated with it. So, the task is to classify racist or sexist tweets from other tweets.

1. Text Classification Using Flair Embeddings

Overview of steps:

Step 1: Import the data into the local Environment of Colab:

Step 2: Installing Flair

Step 3: Preparing text to work with Flair

Step 4: Word Embeddings with Flair

Step 5: Vectorizing the text

Step 6: Partitioning the data for Train and Test Sets

Step 7: Building the model and defining a custom evaluator (for the F1 score)

Step 8: Time for predictions!

Step 1: Import the data into the local Environment of Colab:

Install the PyDrive wrapper & import libraries.

This only needs to be done once per notebook.

!pip install -U -q PyDrive
from pydrive.auth import GoogleAuth
from pydrive.drive import GoogleDrive
from google.colab import auth
from oauth2client.client import GoogleCredentials

Authenticate and create the PyDrive client.

This only needs to be done once per notebook.

auth.authenticate_user()
gauth = GoogleAuth()
gauth.credentials = GoogleCredentials.get_application_default()
drive = GoogleDrive(gauth)

Download a file based on its file ID.

A file ID looks like: laggVyWshwcyP6kEI-y_W3P8D26sz

file_id = '1GhyH4k9C4uPRnMAMKhJYOqa-V9Tqt4q8'  ### file ID ###
data = drive.CreateFile({'id': file_id})
#print('Downloaded content "{}"'.format(data.GetContentString()))

You can find the file ID in the shareable link of the dataset file in the drive.

Importing the dataset into the Colab notebook:

import io
import pandas as pd

data = pd.read_csv(io.StringIO(data.GetContentString()))
data.head()

All the emoticons and symbols have been removed from the data and the characters have been converted to lowercase. Additionally, our dataset has already been divided into train and test sets. You can download this clean dataset from here.

Step 2: Installing Flair

Download the Flair library:

import torch
!pip install flair
import flair

A Brief Look at Flair Data Types

There are two types of objects central to this library – Sentence and Token objects. A Sentence holds a textual sentence and is essentially a list of Tokens:

from flair.data import Sentence

create a sentence

sentence = Sentence('Blogs of Analytics Vidhya are Awesome.')

print the sentence to see what’s in it.

print(sentence)

Step 3: Preparing text to work with Flair

## extracting the tweet part ##
text = data['tweet']

txt is a list of tweets

txt = text.tolist()
print(txt[:10])

Step 4: Word Embeddings with Flair

Feel free to first go through this article if you’re new to word embeddings: An Intuitive Understanding of Word Embeddings.

Importing the Embeddings

from flair.embeddings import WordEmbeddings
from flair.embeddings import CharacterEmbeddings
from flair.embeddings import StackedEmbeddings
from flair.embeddings import FlairEmbeddings
from flair.embeddings import BertEmbeddings
from flair.embeddings import ELMoEmbeddings

Initialising embeddings (un-comment to use others)

#glove_embedding = WordEmbeddings('glove')
#character_embeddings = CharacterEmbeddings()
flair_forward  = FlairEmbeddings('news-forward-fast')
flair_backward = FlairEmbeddings('news-backward-fast')
#bert_embedding = BertEmbeddings()
#elmo_embedding = ELMoEmbeddings()

stacked_embeddings = StackedEmbeddings(embeddings=[flair_forward,
                                                   flair_backward])

You would have noticed we just used some of the most popular word embeddings above. You can remove the ‘#’ comments to use the other embeddings as well.

Now you might be asking – What in the world are “Stacked Embeddings”? Here, we can combine multiple embeddings to build a powerful word representation model without much complexity. Quite like ensembling, isn’t it?

We are using only the stacked Flair embeddings in this article to reduce the computational time. Feel free to play around with this and the other embeddings by using any combination you like.

Testing the stacked embeddings:

create a sentence

sentence = Sentence('Analytics Vidhya blogs are Awesome .')

embed words in sentence

stacked_embeddings.embed(sentence)
for token in sentence:
    print(token.embedding)

data type and size of embedding

print(type(token.embedding))

storing size (length)

z = token.embedding.size()[0]

Step 5: Vectorizing the text

We’ll be showcasing this using two approaches.

Mean of Word Embeddings within a Tweet

We will be calculating the following in this approach:

For each sentence:

  1. Generate word embedding for each word
  2. Calculate the mean of the embeddings of each word to obtain the embedding of the sentence

from tqdm import tqdm ## tracks progress of loop ##

creating a tensor for storing sentence embeddings

s = torch.zeros(0,z)

iterating over the tweets (tqdm tracks progress)

for tweet in tqdm(txt):
    # empty tensor for words #
    w = torch.zeros(0, z)
    sentence = Sentence(tweet)
    stacked_embeddings.embed(sentence)
    # for every word #
    for token in sentence:
        # storing the embedding of each word in the sentence #
        w = torch.cat((w, token.embedding.view(-1, z)), 0)
    # storing the sentence embedding (mean of the word embeddings) #
    s = torch.cat((s, w.mean(dim=0).view(-1, z)), 0)

Document Embedding: Vectorizing the entire Tweet

from flair.embeddings import DocumentPoolEmbeddings

initialize the document embeddings, mode = mean

document_embeddings = DocumentPoolEmbeddings([flair_backward,
                                              flair_forward])

Storing Size of embedding

z = sentence.embedding.size()[1]

Vectorising text

creating a tensor for storing sentence embeddings

s = torch.zeros(0,z)

iterating Sentences

for tweet in tqdm(txt):
    sentence = Sentence(tweet)
    document_embeddings.embed(sentence)
    # adding the document embedding to the tensor #
    s = torch.cat((s, sentence.embedding.view(-1, z)), 0)

You can choose either approach for your model. Now that our text is vectorised, we can feed it to our machine learning model!

Step 6: Partitioning the data for Train and Test Sets

tensor to numpy array

X = s.numpy()   

Test set

test = X[31962:, :]
train = X[:31962, :]

extracting labels of the training set

target = data['label'][data['label'].isnull()==False].values

Step 7: Building the Model and Defining a Custom Evaluator (for the F1 Score)

Defining custom F1 evaluator for XGBoost

def custom_eval(preds, dtrain):
    labels = dtrain.get_label().astype(int)
    preds = (preds >= 0.3).astype(int)
    return [('f1_score', f1_score(labels, preds))]

Building the XGBoost model

import xgboost as xgb
from sklearn.model_selection import train_test_split
from sklearn.metrics import f1_score

Splitting training set

x_train, x_valid, y_train, y_valid = train_test_split(train, target,
                                                      random_state=42,
                                                      test_size=0.3)

XGBoost compatible data

dtrain = xgb.DMatrix(x_train, y_train)
dvalid = xgb.DMatrix(x_valid, label=y_valid)

defining parameters

params = {
    'colsample': 0.9,
    'colsample_bytree': 0.5,
    'eta': 0.1,
    'max_depth': 8,
    'min_child_weight': 6,
    'objective': 'binary:logistic',
    'subsample': 0.9
}

Training the model

xgb_model = xgb.train(
    params,
    dtrain,
    feval=custom_eval,
    num_boost_round=1000,
    maximize=True,
    evals=[(dvalid, "Validation")],
    early_stopping_rounds=30
)

Our model has been trained and is ready for evaluation! Note: The parameters were taken from this Notebook.

Step 8: Time for predictions!

Reformatting test set for XGB

dtest = xgb.DMatrix(test)

Predicting

predict = xgb_model.predict(dtest) # predicting
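The model outputs probabilities, so they need to be converted into 0/1 labels before submission. A minimal sketch (using the 0.2 threshold mentioned below):

## converting predicted probabilities into 0/1 labels ##
pred_labels = (predict >= 0.2).astype(int)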

I uploaded the predictions to the practice problem page with 0.2 as probability threshold:

Word Embedding                                       F1-Score
GloVe                                                0.53
flair-forward-fast                                   0.45
flair-backward-fast                                  0.48
Stacked (flair-forward-fast + flair-backward-fast)   0.54

Note: According to Flair’s official documentation, stacking the Flair embeddings with other embeddings often yields even better results. But there is a catch…

It might take a VERY LONG time to compute on a CPU. I highly recommend leveraging a GPU for faster results. You can use the free one within Colab!

2. Part of Speech (POS) Tagging with Flair

We will be using a subset of the CoNLL-2003 dataset, which is a pre-tagged dataset in English. Download the dataset from here.

Overview of steps:

Step 1: Importing the dataset

Step 2 : Extracting Sentences and PoS Tags from the dataset

Step 3: Tagging the text using NLTK and Flair

Step 4: Evaluating the PoS tags from NLTK and Flair against the tagged dataset

Step 1: Importing the dataset

The file was uploaded manually to the local environment of Colab:

data = open('pos-tagged_corpus.txt', 'r')
txt = data.read()
#print(txt)

The data file contains one word per line, with empty lines representing sentence boundaries.
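Each non-empty line holds a word followed by its PoS tag, a chunk tag, and an NER tag, with -DOCSTART- lines marking document boundaries. Schematically, a line looks like this (an illustrative example, not copied from the file):

U.N. NNP I-NP I-ORG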

Step 2: Extracting Sentences and PoS Tags from the dataset

converting the text into a list of lines (each word with its tags)

txt = txt.split('\n')

removing DOCSTART (document header)

txt = [x for x in txt if x != '-DOCSTART- -X- -X- O']

check

for i in range(10):
    print(txt[i])
    print('-'*10)

Extracting Sentences

Initialize empty list for storing words

words = []

initialize empty list for storing sentences

corpus = []

for i in tqdm(txt):
    ## if a blank line is encountered ##
    if i == '':
        ## the previous words form a sentence ##
        corpus.append(' '.join(words))
        ## refresh the word list ##
        words = []
    else:
        ## the word is at index 0 ##
        words.append(i.split()[0])

did it work?

for i in range(10):
    print(corpus[i])
    print('-'*10)

Extracting POS

Initialize empty list for storing word pos

w_pos = []
## initialize empty list for storing sentence PoS sequences ##
POS = []
for i in tqdm(txt):
    ## a blank line marks a new sentence ##
    if i == '':
        ## the previous tags form the sentence PoS sequence ##
        POS.append(' '.join(w_pos))
        ## refresh the tag list ##
        w_pos = []
    else:
        ## the PoS tag is at index 1 ##
        w_pos.append(i.split()[1])

did it work?

for i in range(10):
    print(corpus[i])
    print(POS[i])

Removing blanks from sentences and PoS

corpus = [x for x in corpus if x != '']
POS = [x for x in POS if x != '']

Check

for i in range(10):
    print(corpus[i])
    print(POS[i])

We have extracted the essential aspects we require from the dataset. Let’s move on to step 3.

Step 3: Tagging the text using NLTK and Flair

First, import the required libraries:

import nltk
nltk.download('tagsets')
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
from nltk import word_tokenize

This will download all the necessary files to tag the text using NLTK.

Tagging the corpus with NLTK

# for storing results #
nltk_pos = []
## for every sentence ##
for i in tqdm(corpus):
    # tokenize sentence #
    text = word_tokenize(i)
    # tag words #
    z = nltk.pos_tag(text)
    # store #
    nltk_pos.append(z)

The PoS tags are in this format:

[('token_1', 'tag_1'), ... , ('token_n', 'tag_n')]

Let’s extract the PoS tags from this:

Extracting the final PoS tags from the NLTK output into a list

tmp = []
nltk_result = []

every tagged sentence

for i in tqdm(nltk_pos):
    tmp = []
    ## for every (word, tag) pair ##
    for j in i:
        ## append the tag (at index 1) ##
        tmp.append(j[1])
    # join the tags of every sentence #
    nltk_result.append(' '.join(tmp))

check

for i in range(10):
    print(nltk_result[i])
    print(corpus[i])

The NLTK tags are ready for business.

Importing the libraries first:

!pip install flair
from flair.data import Sentence
from flair.models import SequenceTagger

Tagging using Flair

initiating object

pos = SequenceTagger.load('pos-fast')

## for storing the PoS-tagged strings ##
f_pos = []

for every sentence

for i in tqdm(corpus):
    sentence = Sentence(i)
    pos.predict(sentence)
    ## append the tagged sentence ##
    f_pos.append(sentence.to_tagged_string())

### check ###
for i in range(10):
    print(f_pos[i])
    print(corpus[i])

The result is in the below format:

token_1 <tag_1> token_2 <tag_2> ... token_n <tag_n>

Note: We can use different taggers available within the Flair library. Feel free to tinker around and experiment. You can find the list here.
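For example, a hypothetical swap to the universal PoS model (assuming the ‘upos-fast’ model name from that list) would only change the load call:

pos = SequenceTagger.load('upos-fast')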

Extract the sentence-wise tags as we did in NLTK

import re

Extracting POS tags

in every sentence by index

for i in tqdm(range(len(f_pos))):
    ## for every word in the ith sentence ##
    for j in corpus[i].split():
        ## remove that word from the ith tagged sentence ##
        f_pos[i] = str(f_pos[i]).replace(j, "", 1)

    ## removing < > symbols ##
    for j in ['<', '>']:
        f_pos[i] = str(f_pos[i]).replace(j, "")

    ## removing redundant spaces ##
    f_pos[i] = re.sub(' +', ' ', str(f_pos[i]))
    f_pos[i] = str(f_pos[i]).lstrip()

check

for i in range(10):
    print(f_pos[i])
    print(corpus[i])

Aha! We have finally tagged the corpus and extracted the tags sentence-wise. We are now free to remove all the punctuation and special symbols.

Removing Symbols and redundant space

in every sentence by index

for i in tqdm(range(len(corpus))):
    # removing symbols #
    corpus[i] = re.sub('[^a-zA-Z]', ' ', str(corpus[i]))
    POS[i] = re.sub('[^a-zA-Z]', ' ', str(POS[i]))
    f_pos[i] = re.sub('[^a-zA-Z]', ' ', str(f_pos[i]))
    nltk_result[i] = re.sub('[^a-zA-Z]', ' ', str(nltk_result[i]))

    ## removing HYPH and SYM tags (they stand for symbols) ##
    f_pos[i] = str(f_pos[i]).replace('HYPH', "")
    f_pos[i] = str(f_pos[i]).replace('SYM', "")
    POS[i] = str(POS[i]).replace('SYM', "")
    POS[i] = str(POS[i]).replace('HYPH', "")
    nltk_result[i] = str(nltk_result[i]).replace('HYPH', '')
    nltk_result[i] = str(nltk_result[i]).replace('SYM', '')

    ## removing redundant spaces ##
    POS[i] = re.sub(' +', ' ', str(POS[i]))
    f_pos[i] = re.sub(' +', ' ', str(f_pos[i]))
    corpus[i] = re.sub(' +', ' ', str(corpus[i]))
    nltk_result[i] = re.sub(' +', ' ', str(nltk_result[i]))

We have tagged the corpus using NLTK and Flair, extracted and removed all the unnecessary elements. Let’s see it for ourselves:

for i in range(1000):
    print('corpus   ' + corpus[i])
    print('actual   ' + POS[i])
    print('nltk     ' + nltk_result[i])
    print('flair    ' + f_pos[i])
    print('-'*50)

OUTPUT:

corpus SOCCER JAPAN GET LUCKY WIN CHINA IN SURPRISE DEFEAT
actual NN NNP VB NNP NNP NNP IN DT NN
nltk NNP NNP NNP NNP NNP NNP NNP NNP NNP
flair NNP NNP VBP JJ NN NNP IN NNP NNP
--------------------------------------------------
corpus Nadim Ladki
actual NNP NNP
nltk NNP NNP
flair NNP NNP
--------------------------------------------------
corpus AL AIN United Arab Emirates
actual NNP NNP NNP NNPS CD
nltk NNP NNP NNP VBZ JJ
flair NNP NNP NNP NNP CD

That looks convincing!

Step 4: Evaluating the PoS tags from NLTK and Flair against the tagged dataset

Here, we are doing word-wise evaluation of the tags with the help of a custom-made evaluator.

corpus Japan coach Shu Kamo said The Syrian own goal proved lucky for us
actual NNP NN NNP NNP VBD POS DT JJ JJ NN VBD JJ IN PRP
nltk NNP VBP NNP NNP VBD DT JJ JJ NN VBD JJ IN PRP
flair NNP NN NNP NNP VBD DT JJ JJ NN VBD JJ IN PRP

Note that in the example above, the actual PoS tags contain an extra tag (the possessive marker POS) that has no counterpart in the NLTK and Flair outputs. Therefore, we will not consider tagged sentences whose tag sequences are of unequal length.

EVALUATION FUNCTION

def eval_pos(x, y):
    # correct matches #
    count = 0
    # total comparisons made #
    comp = 0
    ## for every sentence index in the dataset ##
    for i in range(len(x)):
        x_tags = x[i].split()
        y_tags = y[i].split()
        ## only if the sentence lengths match ##
        if len(x_tags) == len(y_tags):
            ## compare each tag ##
            for j in range(len(x_tags)):
                if x_tags[j] == y_tags[j]:
                    ## match! ##
                    count = count + 1
                comp = comp + 1
    return (count / comp) * 100

Finally we evaluate the POS tags of NLTK and Flair against the POS tags provided by the dataset.

print("nltk Score ", eval2(POS,nltk_result)) print("Flair Score ", eval2(POS,f_pos))

Our Result:

NLTK Score: 85.38654023442645

Flair Score: 90.96172124773179

Well, well, well. I can see why Flair has been getting so much attention in the NLP community.

End Notes

Flair clearly provides an edge in word embeddings and stacked word embeddings. These can be implemented without much hassle thanks to its high-level API. The Flair embedding is something to keep an eye on in the near future.

I love that the Flair library supports multiple languages. The developers are also currently working on “Frame Detection” using Flair. The future looks really bright for this library.

I personally enjoyed working with this library and learning its ins and outs. I hope you found the tutorial useful and will use Flair to your advantage the next time you take up an NLP challenge.