Sentence Correction using Recurrent Neural Networks

Manoj Kumar
6 min read · Aug 1, 2021

Contents:

  1. Introduction
  2. Data Overview
  3. EDA
  4. Preprocessing
  5. Modeling
  6. Streamlit Demo Video
  7. Conclusion

Introduction:

In this modern world everything has become digitalized, and almost everyone owns at least one mobile phone. Messages are now mostly delivered via mobile phones or computers, which is a fast and easy way to communicate. Moreover, most people have started using short forms when conveying messages. So when working on various NLP-based text problems, these words need better text preprocessing to get better performance. In this case study, using various RNN techniques, we correct text containing errors into standard English words, so that ML or DL models can get better performance on NLP tasks.

Data Overview:

The dataset is taken from https://www.comp.nus.edu.sg/~nlp/corpora.html. It contains around 2,000 text messages, each paired as an incorrect sentence and its corrected version. The incorrect sentences are social-media text messages. Let us see what our inputs look like and what the model should return as output.

Example 1:

Input: ’U wan me to “chop” seat 4 u nt?’

Output: ’Do you want me to reserve seat for you or not?’

Example 2:

Input: ’Yup. U reaching. We order some durian pastry already. U come quick.’

Output: ’Yea. You reaching? We ordered some Durian pastry already. You come quick.’

EDA:

EDA is an important stage in the data-analysis process; it refers to performing an initial investigation of the data.

>>Importing necessary Libraries.

>>Separating the source and target points from the text files.

>>Creating a DataFrame using the source and target datapoints retrieved from the text files.

>>Checking the shape of and info about the DataFrame, and checking for null values.
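A minimal sketch of these steps (the file names source.txt and target.txt are my assumption; the corpus may ship in a different layout):

```python
import pandas as pd

# Read one sentence per line from each file (assumed layout).
with open("source.txt", encoding="utf-8") as f:
    source = [line.strip() for line in f if line.strip()]
with open("target.txt", encoding="utf-8") as f:
    target = [line.strip() for line in f if line.strip()]

df = pd.DataFrame({"source": source, "target": target})

print(df.shape)           # number of (incorrect, correct) pairs
df.info()                 # column dtypes and non-null counts
print(df.isnull().sum())  # check for missing values
```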

>>Word count for each sentence in the source.

>>Word count for each sentence in the target.
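A sketch of how these counts can be computed and plotted, reusing the df built above (the new column names are my own):

```python
import matplotlib.pyplot as plt

# Number of words per sentence (split on whitespace).
df["source_words"] = df["source"].str.split().str.len()
df["target_words"] = df["target"].str.split().str.len()

fig, axes = plt.subplots(1, 2, figsize=(12, 4))
axes[0].hist(df["source_words"], bins=50)
axes[0].set_title("Words per source sentence")
axes[1].hist(df["target_words"], bins=50)
axes[1].set_title("Words per target sentence")
plt.show()
```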

→From the above plots we can see that the number of words per source sentence varies between 0 and 50, and per target sentence between 0 and 60; their distributions are mostly similar, since each target is just the corrected version of its source.

>>Character count for each sentence in source.

>>Character count for each sentence in target.
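The character counts follow the same pattern, using str.len() instead of a word split:

```python
# Character counts per sentence, plotted the same way as the word counts.
df["source_chars"] = df["source"].str.len()
df["target_chars"] = df["target"].str.len()

fig, axes = plt.subplots(1, 2, figsize=(12, 4))
axes[0].hist(df["source_chars"], bins=50)
axes[0].set_title("Characters per source sentence")
axes[1].hist(df["target_chars"], bins=50)
axes[1].set_title("Characters per target sentence")
plt.show()
```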

→From the above plots we can see that the number of characters per source sentence varies between 0 and 240, and similarly per target sentence between 0 and 270.

From this we can conclude that, since sentences with many characters would be quite hard to train with RNN-based methods, and since the dataset is quite small we should not remove too many points, keeping source sentences of length less than or equal to 170 characters and target sentences of length less than or equal to 200 characters is a good trade-off.

>>Top 40 frequently occurring words in the source.

>>Top 40 frequently occurring words in the target.

>>Top 10 rare words in the source.

>>Top 10 rare words in the target.
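One way these frequent and rare words can be extracted, sketched with collections.Counter:

```python
from collections import Counter

def word_freq(series):
    """Count lowercase whitespace-separated tokens across all sentences."""
    counts = Counter()
    for sentence in series:
        counts.update(sentence.lower().split())
    return counts

src_counts = word_freq(df["source"])
tgt_counts = word_freq(df["target"])

print(src_counts.most_common(40))      # top 40 frequent words in source
print(tgt_counts.most_common(40))      # top 40 frequent words in target
print(src_counts.most_common()[-10:])  # 10 rarest words in source
print(tgt_counts.most_common()[-10:])  # 10 rarest words in target
```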

Preprocessing:

This refers to removing unwanted data, i.e. data cleaning, which also contributes to improving the performance of the model.

Since the dataset is quite small, we need to preserve most of the points to get better performance, so we remove only the datapoints with the greatest lengths. From the exploratory data analysis we found that source datapoints longer than 170 characters and target datapoints longer than 200 characters can be removed.
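A minimal sketch of this filter:

```python
# Keep only pairs within the limits found during EDA:
# source <= 170 characters, target <= 200 characters.
mask = (df["source"].str.len() <= 170) & (df["target"].str.len() <= 200)
df = df[mask].reset_index(drop=True)
print(df.shape)
```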

Modeling:

Coming to the modeling part, the problem is approached both at the character level and at the word level. At the character level, the maximum length is 170 characters for source sentences and 200 for target sentences. At the word level, the maximum length is 39 words for source sentences and 43 for target sentences.

Simple Character-level and Word-level RNN Model:

(Image source: https://tommytracey.github.io)

Given an input with all the preprocessing done (tokenization followed by padding), the model outputs, at every time step, probabilities over the target vocabulary. Using either greedy search or beam search we predict each word and combine all the words to get the final sentence. Here both a simple RNN and a bidirectional RNN are used. In the character-level model each character is given as input, whereas in the word-level model each word is given as input.
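A sketch of both decoding strategies over the per-step probability matrix the model returns; index_to_word, mapping token ids back to tokens, is assumed to be the inverse of the tokenizer's word index:

```python
import numpy as np

def greedy_decode(probs, index_to_word):
    """probs: (timesteps, vocab_size) per-step probabilities; take the argmax."""
    ids = probs.argmax(axis=-1)
    return " ".join(index_to_word[i] for i in ids)

def beam_search_decode(probs, index_to_word, k=3):
    """Keep the k highest-scoring sequences (sum of log-probs) at each step."""
    beams = [([], 0.0)]  # (token ids so far, cumulative log probability)
    for step in probs:
        candidates = []
        for seq, score in beams:
            for i, p in enumerate(step):
                candidates.append((seq + [i], score + np.log(p + 1e-10)))
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:k]
    best_ids = beams[0][0]
    return " ".join(index_to_word[i] for i in best_ids)
```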

>>Below is the model for the character level.

Single Directional and Bidirectional RNN Models for Character-level Embedding
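A minimal Keras sketch of such a character-level model; the vocabulary sizes are my assumptions, and source and target are assumed to be padded to a common length so the per-step softmax lines up with the target sequence:

```python
from tensorflow.keras.layers import (Input, Embedding, SimpleRNN,
                                     Bidirectional, TimeDistributed, Dense)
from tensorflow.keras.models import Model

maxlen = 200                   # both sides padded to the target max length
src_vocab, tgt_vocab = 60, 60  # assumed character-vocabulary sizes

inputs = Input(shape=(maxlen,))
x = Embedding(src_vocab, 64, mask_zero=True)(inputs)
x = Bidirectional(SimpleRNN(128, return_sequences=True))(x)  # bidirectional variant
outputs = TimeDistributed(Dense(tgt_vocab, activation="softmax"))(x)

char_model = Model(inputs, outputs)
char_model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
char_model.summary()
```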

>>Now we can see the word-level model.

Single Directional and Bidirectional RNN Models for Word-level Embedding
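And a word-level sketch showing the single-directional variant (vocabulary sizes again assumed):

```python
from tensorflow.keras.layers import (Input, Embedding, SimpleRNN,
                                     TimeDistributed, Dense)
from tensorflow.keras.models import Model

maxlen_w = 43                            # both sides padded to the word-level max
src_vocab_w, tgt_vocab_w = 6000, 6000    # assumed word-vocabulary sizes

inputs = Input(shape=(maxlen_w,))
x = Embedding(src_vocab_w, 128, mask_zero=True)(inputs)
x = SimpleRNN(128, return_sequences=True)(x)  # single-directional variant
outputs = TimeDistributed(Dense(tgt_vocab_w, activation="softmax"))(x)

word_model = Model(inputs, outputs)
word_model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
```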

Similarly, we can use pretrained fastText embeddings.

fastText is another word-embedding method that extends the word2vec model. Instead of learning vectors for whole words directly, fastText represents each word as a bag of character n-grams. This helps capture the meaning of shorter words and allows the embeddings to understand suffixes and prefixes.
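A sketch of plugging pretrained fastText vectors into the Embedding layer via gensim; the fasttext-wiki-news-subwords-300 model is one readily downloadable choice, and the post does not say which vectors were actually used:

```python
import numpy as np
import gensim.downloader as api
from tensorflow.keras.layers import Embedding
from tensorflow.keras.preprocessing.text import Tokenizer

# Fit a word tokenizer on the source sentences (df from the EDA above).
tokenizer = Tokenizer()
tokenizer.fit_on_texts(df["source"])
vocab_size = len(tokenizer.word_index) + 1

# Load pretrained 300-d fastText vectors (assumed choice of model).
ft = api.load("fasttext-wiki-news-subwords-300")

# Build an embedding matrix aligned with the tokenizer's word index;
# out-of-vocabulary rows stay zero.
embedding_matrix = np.zeros((vocab_size, 300))
for word, i in tokenizer.word_index.items():
    if word in ft:
        embedding_matrix[i] = ft[word]

# Use the matrix as frozen weights in the Embedding layer.
emb = Embedding(vocab_size, 300, weights=[embedding_matrix], trainable=False)
```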

Encoder-Decoder Model:

Taken from “Sequence to Sequence Learning with Neural Networks,” 2014.

Here the first part, the encoder, loops through the input time steps and produces a vector, which is then passed to the decoder, which loops through the output time steps and produces the required output.

Similar to the basic RNN models, both character-level and word-level embeddings are used here, with and without fastText embeddings, and the output is predicted.
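A minimal Keras sketch of this encoder-decoder setup with teacher forcing; the LSTM cells, hidden size, and vocabulary sizes are my assumptions:

```python
from tensorflow.keras.layers import Input, Embedding, LSTM, Dense
from tensorflow.keras.models import Model

latent_dim = 256                       # assumed hidden size
src_vocab_w, tgt_vocab_w = 6000, 6000  # assumed word vocabularies

# Encoder: read the noisy sentence and keep only the final states.
enc_in = Input(shape=(None,))
enc_emb = Embedding(src_vocab_w, 128, mask_zero=True)(enc_in)
_, state_h, state_c = LSTM(latent_dim, return_state=True)(enc_emb)

# Decoder: generate the corrected sentence, initialised with the encoder
# states (teacher forcing during training: the true previous token is fed in).
dec_in = Input(shape=(None,))
dec_emb = Embedding(tgt_vocab_w, 128, mask_zero=True)(dec_in)
dec_out = LSTM(latent_dim, return_sequences=True)(
    dec_emb, initial_state=[state_h, state_c])
outputs = Dense(tgt_vocab_w, activation="softmax")(dec_out)

seq2seq = Model([enc_in, dec_in], outputs)
seq2seq.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
```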

Encoder-Decoder with Attention Model:

Taken from “Effective Approaches to Attention-based Neural Machine Translation”, 2015.

In the encoder-decoder model, the encoder produces a single vector after encoding, which is passed to the decoder, whereas attention models build a context vector that is filtered specifically for each output time step.

Both character-level and word-level embeddings, with and without fastText embeddings, are trained and their performance is measured.
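A sketch of adding Luong-style dot-product attention on top of the encoder-decoder, using tf.keras.layers.Attention; the exact attention variant in the original code is not specified:

```python
from tensorflow.keras.layers import (Input, Embedding, LSTM, Dense,
                                     Attention, Concatenate)
from tensorflow.keras.models import Model

latent_dim = 256
src_vocab_w, tgt_vocab_w = 6000, 6000  # assumed word vocabularies

# Encoder now returns its full sequence of hidden states, not just the last.
enc_in = Input(shape=(None,))
enc_emb = Embedding(src_vocab_w, 128)(enc_in)
enc_out, state_h, state_c = LSTM(latent_dim, return_sequences=True,
                                 return_state=True)(enc_emb)

dec_in = Input(shape=(None,))
dec_emb = Embedding(tgt_vocab_w, 128)(dec_in)
dec_out = LSTM(latent_dim, return_sequences=True)(
    dec_emb, initial_state=[state_h, state_c])

# Dot-product (Luong-style) attention: every decoder step attends over all
# encoder states, yielding one context vector per output time step.
context = Attention()([dec_out, enc_out])
combined = Concatenate()([dec_out, context])
outputs = Dense(tgt_vocab_w, activation="softmax")(combined)

attn_model = Model([enc_in, dec_in], outputs)
attn_model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
```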

The performance results of the various models used are summarized below.

Streamlit Demo Video:

Conclusion:

Hence the task of correcting sentences is done, but the scores are quite low. This is mostly because the dataset is too small for the models to learn from; deep learning models quite often need large amounts of data.

>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>

For the full code, visit my GitHub repository: click here.

For any queries you can contact me via LinkedIn. My LinkedIn profile: click here.

>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>

Further Improvement:

Here I have tried everything from basic RNN models to seq2seq models. So as a next step I would try Transformers, which are popular for many NLP-based tasks.

Reference:

www.appliedaicourse.com

https://cs224d.stanford.edu/reports/Lewis.pdf

https://machinelearningmastery.com/beam-search-decoder-natural-language-processing/


Manoj Kumar

Undergraduate engineering student interested in data science.