What is Natural Language Processing?
NLP is a way for computers to interpret human language and perform tasks based on it. Voice assistants such as Alexa and Siri are well-known examples.
Let’s start with the Spam Classifier:
The spam classifier predicts whether a received message is ham (a legitimate message) or spam.
Let’s start with the dataset: it consists of 5572 messages and their labels, each of which is either “ham” or “spam”.
import pandas as pd

# Tab-separated file with two columns: the label and the message text
messages = pd.read_csv("SMSSpamClassifier", sep="\t", names=["label", "message"])
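Now the labels need to be converted to 0 and 1, which can be done with the get_dummies() function of the pandas library.
# One indicator column per label; keep only the "spam" column as the target
y = pd.get_dummies(messages["label"])
# Cast to int because newer pandas versions return booleans here
y = y.iloc[:, 1].values.astype(int)
Here, y will contain 0 for “ham” labels and 1 for “spam” labels.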
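To make the encoding concrete, here is a small sketch on a hypothetical three-row frame (the rows are made up for illustration and are not taken from the dataset). get_dummies() orders its columns alphabetically, so column 0 is “ham” and column 1 is “spam”. Recent pandas versions return booleans, which is why the sketch casts to int.

```python
import pandas as pd

# Hypothetical toy data, only to show the shape of the encoding
toy = pd.DataFrame({
    "label": ["ham", "spam", "ham"],
    "message": ["see you at lunch", "win a free prize", "call me later"],
})

# One indicator column per distinct label, in alphabetical order
dummies = pd.get_dummies(toy["label"])
print(list(dummies.columns))

# Column 1 is the "spam" indicator; cast so the values are 0/1 integers
y = dummies.iloc[:, 1].values.astype(int)
print(y.tolist())
```

Keeping a single column is enough because the two indicators are complementary: a message that is not spam is ham.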
Now let’s look at the independent data, i.e. x. First we have to clean the message data: remove stopwords, lowercase the text, group different forms of the same word, and so on. For all of this we will use WordNetLemmatizer. The main reason for using a lemmatizer instead of a stemmer is that it produces meaningful words, whereas a stemmer can cut a word down to a stem that is not a real word.
Now the code for it is:
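import re
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

# Requires the NLTK "stopwords" and "wordnet" data (see nltk.download())
lemmatizer = WordNetLemmatizer()
stop_words = set(stopwords.words("english"))  # build once, not once per word
corpus = []
for i in range(len(messages)):
    # Keep letters only, then lowercase and split into words
    review = re.sub("[^a-zA-Z]", " ", messages["message"][i])
    review = review.lower()
    review = review.split()
    # Drop stopwords and reduce each remaining word to its lemma
    review = [lemmatizer.lemmatize(word) for word in review if word not in stop_words]
    review = " ".join(review)
    corpus.append(review)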
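To see what the cleaning steps do to a single message, here is a minimal sketch that uses only the standard library. The stop word set and the sample message are made up for illustration, and the lemmatization step is left out so that the sketch runs without any NLTK data.

```python
import re

# Hypothetical, tiny stop word set; NLTK's English list is much longer
STOP_WORDS = {"a", "the", "is", "to", "you", "have", "for", "and"}

def clean(message):
    """Keep letters only, lowercase, and drop stop words."""
    letters_only = re.sub(r"[^a-zA-Z]", " ", message)
    words = letters_only.lower().split()
    kept = [word for word in words if word not in STOP_WORDS]
    return " ".join(kept)

sample = "WINNER!! You have won a £1000 prize. Call 09061701461 to claim."
print(clean(sample))  # prints "winner won prize call claim"
```

Digits, punctuation and the currency symbol all become spaces, so only the words themselves survive.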
After the cleaning loop, corpus holds every message in cleaned form: the text is lowercased, the stopwords are removed, and only the words that matter for prediction remain. Next we use Term Frequency-Inverse Document Frequency (TF-IDF), i.e. TfidfVectorizer, to turn the words into vectors. Each TF-IDF vector holds one weight per word, reflecting how important that word is in the message.
from sklearn.feature_extraction.text import TfidfVectorizer

# Keep at most the 5000 most frequent terms as features
cv = TfidfVectorizer(max_features=5000)
x = cv.fit_transform(corpus).toarray()
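To show where the weights come from, here is a rough sketch of TF-IDF computed by hand with the standard library. It assumes the smoothed formula idf = ln((1 + n) / (1 + df)) + 1 followed by L2 normalization, which is what scikit-learn documents as its default. The function name tfidf and the three-message corpus are hypothetical, and tokenization is a plain split() rather than the vectorizer's own tokenizer.

```python
import math
from collections import Counter

def tfidf(corpus):
    """Return (vocabulary, rows) with one L2-normalized weight row per document."""
    docs = [doc.split() for doc in corpus]
    n_docs = len(docs)

    # Document frequency: in how many documents does each word occur?
    df = Counter(word for doc in docs for word in set(doc))
    vocab = sorted(df)

    # Smoothed inverse document frequency (an assumption, see the text above)
    idf = {w: math.log((1 + n_docs) / (1 + df[w])) + 1 for w in vocab}

    rows = []
    for doc in docs:
        counts = Counter(doc)
        raw = [counts[w] * idf[w] for w in vocab]
        norm = math.sqrt(sum(v * v for v in raw)) or 1.0
        rows.append([v / norm for v in raw])
    return vocab, rows

# Hypothetical, already cleaned messages
toy_corpus = ["free prize call", "call later", "free free entry"]
vocab, rows = tfidf(toy_corpus)
for weight_row in rows:
    print([round(v, 2) for v in weight_row])
```

In the first message, “prize” occurs in only one document and therefore gets a higher weight than “call” or “free”, which each occur in two.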
The data is now prepared in ‘x’ and can be used to train our model. Since the Naive Bayes algorithm tends to work well for text classification, we will use it to train our model.
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score

# Hold out 20% of the data for testing
X_train, X_test, y_train, y_test = train_test_split(x, y, test_size=0.20, random_state=0)

# Train the classifier and evaluate it on the held-out messages
spam_detect_model = MultinomialNB().fit(X_train, y_train)
y_pred = spam_detect_model.predict(X_test)
print(accuracy_score(y_test, y_pred))
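For intuition about what a multinomial Naive Bayes classifier does, here is a simplified sketch written with the standard library. It is not scikit-learn's implementation: it works on raw word counts instead of TF-IDF weights, uses add-one (Laplace) smoothing, and the names train_nb and predict_nb as well as the toy messages are hypothetical.

```python
import math
from collections import Counter, defaultdict

def train_nb(docs, labels, alpha=1.0):
    """Estimate log priors and smoothed log word likelihoods per class."""
    class_counts = Counter(labels)
    word_counts = defaultdict(Counter)
    for doc, label in zip(docs, labels):
        word_counts[label].update(doc.split())
    vocab = {word for counts in word_counts.values() for word in counts}

    model = {}
    for label, n_label in class_counts.items():
        total = sum(word_counts[label].values())
        log_prior = math.log(n_label / len(labels))
        log_lik = {
            w: math.log((word_counts[label][w] + alpha) / (total + alpha * len(vocab)))
            for w in vocab
        }
        model[label] = (log_prior, log_lik)
    return model

def predict_nb(model, doc):
    """Pick the class with the highest log posterior; unseen words are skipped."""
    scores = {}
    for label, (log_prior, log_lik) in model.items():
        scores[label] = log_prior + sum(log_lik[w] for w in doc.split() if w in log_lik)
    return max(scores, key=scores.get)

# Hypothetical, already cleaned training messages
docs = ["free prize claim", "win free cash", "see you lunch", "call home later"]
labels = ["spam", "spam", "ham", "ham"]
model = train_nb(docs, labels)

print(predict_nb(model, "free cash prize"))
print(predict_nb(model, "lunch later"))
```

Words seen only in spam messages pull the score toward “spam”, and the smoothing keeps a word that never appeared in a class from driving that class's probability to zero.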
The model reaches an accuracy of around 98%. To predict a new input, we can use spam_detect_model.predict(cv.transform(user_input).toarray()), where user_input is a list of message strings cleaned in the same way as the training data.
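The prediction step can be sketched end to end on a toy corpus. The training messages, labels and new inputs below are made up for illustration. The point is that the already fitted vectorizer is reused through transform() and that it receives a list of strings rather than a single string.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

# Hypothetical, already cleaned training messages (1 = spam, 0 = ham)
train_corpus = [
    "win free prize now",
    "free cash prize claim now",
    "claim free ticket win",
    "see you at lunch",
    "are you coming home",
    "call me when you arrive home",
]
train_labels = [1, 1, 1, 0, 0, 0]

cv = TfidfVectorizer(max_features=5000)
x = cv.fit_transform(train_corpus).toarray()
spam_detect_model = MultinomialNB().fit(x, train_labels)

# New input: a list of strings, cleaned like the training data
user_input = ["free prize", "see you home"]
features = cv.transform(user_input).toarray()  # transform only, no refitting
predictions = spam_detect_model.predict(features)
print(predictions.tolist())
```

Calling fit_transform() on the new messages instead would rebuild the vocabulary, and the resulting columns would no longer line up with the features the model was trained on.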
All resources and code are available at:
For a description of NLP, click here.