NATURAL LANGUAGE PROCESSING ... :)
We are going to learn what Natural Language Processing, or NLP for short, is ..
We will start with the basics like 'sentence tokenization' and 'word tokenization' and work our way up to building a crappy model that can predict whether you are a boy or a girl by analysing your name ...
Installing package :
pip install nltk
Then we have to import nltk and download some stuff ...
The code for this is ... :
import nltk
# we need this for PunktSentenceTokenizer
nltk.download('punkt')
# we need this for removing stop words
nltk.download('stopwords')
# interface for tagging each token in a sentence with supplementary information
nltk.download('averaged_perceptron_tagger')
# for classifying names
nltk.download('names')
But first, let's understand what NLTK is ..
The Natural Language Toolkit (NLTK) is a platform for building Python programs that work with human language data, aimed at statistical natural language processing (NLP).
Simply put, it is a Python package for deciphering human language data in order to predict something ..
WORD TOKENIZER :
# importing word_tokenize from nltk.tokenize
from nltk.tokenize import word_tokenize
# the data
data = 'I am inevitable . I am Ironman . I am Groot . I am Batman . I wanna be your endgame . I am Steve Rogers'
# the tokenizer
word=word_tokenize(data)
# printing
print(word)
And the output looks like this ..:
['I', 'am', 'inevitable', '.', 'I', 'am', 'Ironman', '.', 'I', 'am', 'Groot', '.', 'I', 'am', 'Batman', '.', 'I', 'wan', 'na', 'be', 'your', 'endgame', '.', 'I', 'am', 'Steve', 'Rogers']
SENTENCE TOKENIZER :
# importing sent_tokenize from nltk.tokenize
from nltk.tokenize import sent_tokenize
# the data
data='I am inevitable . I am Ironman . I am Groot . I am Batman . I wanna be your endgame . I am Steve Rogers'
# sentence tokenizer
sentence=sent_tokenize(data)
# printing
print(sentence)
And the output looks like this :
['I am inevitable .', 'I am Ironman .', 'I am Groot .', 'I am Batman .', 'I wanna be your endgame .', 'I am Steve Rogers']
REMOVING STOP WORDS :
What are stop words ? In computing, stop words are words which are filtered out before or after processing of natural language data (text).
# importing stopwords from nltk.corpus
from nltk.corpus import stopwords
# importing word_tokenize from nltk.tokenize
from nltk.tokenize import word_tokenize
# the data
data = 'I am inevitable . I am Ironman . I am Groot . I am Batman . I wanna be your endgame . I am Steve Rogers'
# setting the stop words in English
Stop_Words = set(stopwords.words('english'))
# word tokenizer
words = word_tokenize(data)
# a list for the filtered words
words_filtered = []
# for loop to filter out the stop words
for word in words:
    if word not in Stop_Words:
        words_filtered.append(word)
# printing
# printing
print(words_filtered)
And the output .. :
['I', 'inevitable', '.', 'I', 'Ironman', '.', 'I', 'Groot', '.', 'I', 'Batman', '.', 'I', 'wan', 'na', 'endgame', '.', 'I', 'Steve', 'Rogers']
You can see that 'am', which is a stop word, gets filtered out and only the other words are printed as output ..
PORTER STEMMER :
What is porter stemmer ?
The Porter stemming algorithm (or 'Porter stemmer') is a process for removing the commoner morphological and inflexional endings from words in English. Its main use is as part of a term normalisation process that is usually done when setting up Information Retrieval systems.
# importing PorterStemmer from nltk.stem
from nltk.stem import PorterStemmer
# the words
words=['amazing','amazed','amazer','amazes']
# initializing the PorterStemmer model
ps=PorterStemmer()
# a for loop
for word in words:
    # printing the original word and its stem
    print(word + ':' + ps.stem(word))
And the output .. :
amazing:amaz
amazed:amaz
amazer:amaz
amazes:amaz
You can see that the words it spews out are not meaningful ones. That is stemming ..
What is stemming .. ? Stemming is a technique used to extract the base form of words by removing affixes from them. It is not always correct, but it is faster than the alternative ..
What is the alternative .. ?
It is lemmatization ..
Now what is lemmatization ?
It is the same idea as stemming, but the output words are meaningful, at the expense of being slower ..
PunktSentenceTokenizer :
# importing the tokenizers (sent_tokenize uses a pre-trained PunktSentenceTokenizer under the hood)
import nltk
from nltk.tokenize import sent_tokenize, PunktSentenceTokenizer
# the data
document = 'Whether you\'re new to programming or an experienced developer, it\'s easy to learn and use Python.'
# sentence tokenizer
sentences=sent_tokenize(document)
# for loop
for sent in sentences:
    # printing after pos_tagging the tokenized words ..
    print(nltk.pos_tag(nltk.word_tokenize(sent)))
And the output ..
[('Whether', 'IN'), ('you', 'PRP'), ("'re", 'VBP'), ('new', 'JJ'), ('to', 'TO'), ('programming', 'VBG'), ('or', 'CC'), ('an', 'DT'), ('experienced', 'JJ'), ('developer', 'NN'), (',', ','), ('it', 'PRP'), ("'s", 'VBZ'), ('easy', 'JJ'), ('to', 'TO'), ('learn', 'VB'), ('and', 'CC'), ('use', 'VB'), ('Python', 'NNP'), ('.', '.')]
sent_tokenize splits the text into sentences using a pre-trained PunktSentenceTokenizer model, and pos_tag then tells you whether each word is a noun, adjective, conjunction, etc ..
AND NOW WE ARE GOING TO BUILD A MODEL THAT PREDICTS WHETHER THE GIVEN NAME IS MALE OR FEMALE ...
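You can also use PunktSentenceTokenizer directly. A minimal sketch (the example text is my own; an untrained Punkt tokenizer falls back on default punctuation-based sentence boundaries):

```python
from nltk.tokenize import PunktSentenceTokenizer

# construct a Punkt tokenizer without any training text
tokenizer = PunktSentenceTokenizer()

text = 'I am inevitable. I am Ironman. I am Groot.'
# splits the text into its three sentences
print(tokenizer.tokenize(text))
```

Training it on a sample of your own text (by passing the text to the constructor) lets it learn abbreviations like 'Dr.' so they don't get mistaken for sentence ends ..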
The data :
# importing names from nltk.corpus
from nltk.corpus import names
# the data
# (using a new variable name so we don't shadow the imported names corpus)
labeled_names = ([(name, 'male') for name in names.words('male.txt')] +
                 [(name, 'female') for name in names.words('female.txt')])
# printing ..
print(labeled_names)
It spews out a large number of names along with their gender ..
The model :
import nltk.classify.util
from nltk.classify import NaiveBayesClassifier
from nltk.corpus import names
def gender_features(word):
    return {'last_letter': word[-1]}
# load the data and train
labeled_names = ([(name, 'male') for name in names.words('male.txt')] +
                 [(name, 'female') for name in names.words('female.txt')])
featuresets = [(gender_features(n), g) for (n, g) in labeled_names]
train_set = featuresets
classifier = NaiveBayesClassifier.train(train_set)
# prediction
name = input('Enter a name : ')
print(classifier.classify(gender_features(name)))
And the output ..
If you input Robert Downey Junior, it will predict that the name is male ... which is correct ..
But if you input Taylor Swift, it will predict that the name is male too ... which is absolutely wrong ..
That is because the data we provided is incomplete and way too small, and the model only looks at the last letter of the name ..
But as I said before, it is a crappy gender predictor ..
I hope you learnt something from this blog ..
If you want to learn Machine Learning to predict human population : https://codeddevil-01blogs.blogspot.com/2021/06/machine-learning-in-python.html