## 1. Tools for text processing

What are the most frequent words in Herman Melville's novel Moby Dick and how often do they occur?

In this notebook, we'll scrape the novel Moby Dick from Project Gutenberg (a website that hosts a large corpus of books) using the Python package requests. Then we'll extract the words from this web data using BeautifulSoup. Finally, we'll analyze the distribution of words using the Natural Language Toolkit (nltk).

The data science pipeline we'll build in this notebook can be used to visualize the word frequency distribution of any novel that you can find on Project Gutenberg. The natural language processing tools used here apply to much of the data that data scientists encounter, since a large proportion of the world's data is unstructured and includes a great deal of text.

Let's start by loading the three main Python packages we are going to use.

In [110]:
# Importing requests, BeautifulSoup and nltk
import requests
import nltk
from bs4 import BeautifulSoup

In [111]:
%%nose

import sys

def test_example():
    assert ('requests' in sys.modules and
            'bs4' in sys.modules and
            'nltk' in sys.modules), \
        'The modules requests, BeautifulSoup, and nltk should be imported.'

Out[111]:
1/1 tests passed


## 2. Request Moby Dick

To analyze Moby Dick, we need to get the contents of Moby Dick from somewhere. Luckily, the text is freely available online at Project Gutenberg as an HTML file: https://www.gutenberg.org/files/2701/2701-h/2701-h.htm .

Note that HTML stands for Hypertext Markup Language and is the standard markup language for the web.

To fetch the HTML file with Moby Dick, we're going to use the requests package to make a GET request for the page, which means we're getting data from it. This is what your browser does when you visit a webpage, but here we load the requested page directly into Python instead.
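
To make the idea of a GET request a little more concrete, here is a minimal sketch that builds a request object without sending it. It uses the standard library's urllib.request instead of requests (an assumption made so that the example runs without network access), and it only inspects the parts of the request: the method, the host, and the path on that host.

```python
from urllib.request import Request

# The Project Gutenberg address of the novel, as given above
url = 'https://www.gutenberg.org/files/2701/2701-h/2701-h.htm'

# Building the request object; nothing is sent over the network here
req = Request(url)

# With no data attached, the request method defaults to GET
print(req.get_method())

# The server the request would go to, and the path requested from it
print(req.host)
print(req.selector)
```

The path printed last plays the same role as the path_url value that the test cell below checks on the requests response.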

In [112]:
# Getting the Moby Dick HTML
r = requests.get('https://s3.amazonaws.com/assets.datacamp.com/production/project_147/datasets/2701-h.htm')

# Setting the correct text encoding of the HTML page
r.encoding = 'utf-8'

# Extracting the HTML from the request object
html = r.text

# Printing the first 2000 characters in html
print(html[:2000])

<?xml version="1.0" encoding="utf-8"?>

<!DOCTYPE html
PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd" >

<html xmlns="http://www.w3.org/1999/xhtml" lang="en">
<title>
Moby Dick; Or the Whale, by Herman Melville
</title>
<style type="text/css" xml:space="preserve">

body { background:#faebd0; color:black; margin-left:15%; margin-right:15%; text-align:justify }
P { text-indent: 1em; margin-top: .25em; margin-bottom: .25em; }
H1,H2,H3,H4,H5,H6 { text-align: center; margin-left: 15%; margin-right: 15%; }
hr  { width: 50%; text-align: center;}
.foot { margin-left: 20%; margin-right: 20%; text-align: justify; text-indent: -3em; font-size: 90%; }
blockquote {font-size: 100%; margin-left: 0%; margin-right: 0%;}
.mynote    {background-color: #DDE; color: #000; padding: .5em; margin-left: 10%; margin-right: 10%; font-family: sans-serif; font-size: 95%;}
.toc       { margin-left: 10%; margin-bottom: .75em;}
.toc2      { margin-left: 20%;}
div.fig    { display:block; margin:0 auto; text-align:center; }
div.middle { margin-left: 20%; margin-right: 20%; text-align: justify; }
.figleft   {float: left; margin-left: 0%; margin-right: 1%;}
.figright  {float: right; margin-right: 0%; margin-left: 1%;}
.pagenum   {display:inline; font-size: 70%; font-style:normal;
margin: 0; padding: 0; position: absolute; right: 1%;
text-align: right;}
pre        { font-family: times new roman; font-size: 100%; margin-left: 10%;}

table      {margin-left: 10%;}

text-decoration:none}
text-decoration:none}
a:visited {color:blue;
text-decoration:none}
a:hover {color:red}

</style>
<body>
<pre xml:space="preserve">

The Project Gutenberg EBook of Moby Dick; or The Whale, by Herman Melville

This eBook is for the use of anyone anywh
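
The cell above sets r.encoding to 'utf-8' before reading r.text. The sketch below illustrates why that step can matter, assuming the bytes are UTF-8 but get decoded with a different encoding (Latin-1 is used here for illustration). The sample phrase is taken from the novel; the variable names are made up for this example.

```python
# UTF-8 bytes for a phrase containing a curly apostrophe
raw = 'people’s hats'.encode('utf-8')

# Decoding with the matching encoding restores the original text
right = raw.decode('utf-8')

# Decoding with the wrong encoding turns the apostrophe into three junk characters
wrong = raw.decode('latin-1')

print(right)
print(repr(wrong))
```

Punctuation such as curly quotes and long dashes appears throughout the novel, so a wrong encoding would scatter stray characters across the text.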

In [113]:
%%nose

def test_r_correct():
    assert r.request.path_url == '/assets.datacamp.com/production/project_147/datasets/2701-h.htm', \
        "r should be a get request for 'https://s3.amazonaws.com/assets.datacamp.com/production/project_147/datasets/2701-h.htm'"

    assert len(html) == 1500996, \
        'html should contain the text of the request r.'

Out[113]:
2/2 tests passed


## 3. Get the text from the HTML

This HTML is not quite what we want. However, it does contain what we want: the text of Moby Dick. What we need to do now is wrangle this HTML to extract the text of the novel. For this we'll use the package BeautifulSoup.

Firstly, a word on the name of the package: Beautiful Soup? In web development, the term "tag soup" refers to structurally or syntactically incorrect HTML code written for a web page. What Beautiful Soup does best is to make tag soup beautiful again and to extract information from it with ease! In fact, the main object created and queried when using this package is called BeautifulSoup. After creating the soup, we can use its .get_text() method to extract the text.
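
To give a rough feel for what extracting text from markup involves, here is a small sketch built on the standard library's html.parser module. It is not how BeautifulSoup is implemented and it is far less capable; the class TextExtractor and the function extract_text are hypothetical names used only for this illustration. Note that the sample markup is a bit of "tag soup": the b tag is never closed.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects the text found between tags and ignores the tags themselves."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        # Called by the parser for every run of text outside of a tag
        self.chunks.append(data)


def extract_text(markup):
    parser = TextExtractor()
    parser.feed(markup)
    parser.close()
    return ''.join(parser.chunks)


# Slightly broken HTML: the <b> tag is never closed
sample = '<html><body><h2>CHAPTER 1. Loomings.</h2><p>Call me <b>Ishmael.</p></body>'
sample_text = extract_text(sample)
print(sample_text)
```

The tags are gone and only the text remains, which is the job we hand over to BeautifulSoup for the full novel.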

In [114]:
# Creating a BeautifulSoup object from the HTML
soup = BeautifulSoup(html, 'html.parser')

# Getting the text out of the soup
text = soup.get_text()

# Printing out text between characters 32000 and 34000
print(text[32000:34000])

r which the beech tree
extended its branches.” —Darwin’s Voyage of a Naturalist.

“‘Stern all!’ exclaimed the mate, as upon turning his head, he saw the
distended jaws of a large Sperm Whale close to the head of the boat,
threatening it with instant destruction;—‘Stern all, for your
lives!’” —Wharton the Whale Killer.

“So be cheery, my lads, let your hearts never fail, While the bold
harpooneer is striking the whale!” —Nantucket Song.

“Oh, the rare old Whale, mid storm and gale
In his ocean home will be
A giant in might, where might is right,
And King of the boundless sea.”
—Whale Song.

CHAPTER 1. Loomings.

Call me Ishmael. Some years ago—never mind how long precisely—having
little or no money in my purse, and nothing particular to interest me on
shore, I thought I would sail about a little and see the watery part of
the world. It is a way I have of driving off the spleen and regulating the
circulation. Whenever I find myself growing grim about the mouth; whenever
it is a damp, drizzly November in my soul; whenever I find myself
involuntarily pausing before coffin warehouses, and bringing up the rear
of every funeral I meet; and especially whenever my hypos get such an
upper hand of me, that it requires a strong moral principle to prevent me
from deliberately stepping into the street, and methodically knocking
people’s hats off—then, I account it high time to get to sea as soon
as I can. This is my substitute for pistol and ball. With a philosophical
flourish Cato throws himself upon his sword; I quietly take to the ship.
There is nothing surprising in this. If they but knew it, almost all men
in their degree, some time or other, cherish very nearly the same feelings
towards the ocean with me.

Ther

In [115]:
%%nose

import bs4

def test_text_correct_type():
    assert isinstance(text, str), \
        'text should be a string'

def test_soup_correct_type():
    assert isinstance(soup, bs4.BeautifulSoup), \
        'soup should be a BeautifulSoup object'


Out[115]:
2/2 tests passed


## 4. Extract the words

We now have the text of the novel! There is some unwanted stuff at the start and some unwanted stuff at the end. We could remove it, but this content is so much smaller in amount than the text of Moby Dick that, to a first approximation, it is okay to leave it in.

Now that we have the text of interest, it's time to count how many times each word appears, and for this we'll use nltk, the Natural Language Toolkit. We'll start by tokenizing the text: discarding everything that isn't part of a word (whitespace, punctuation, etc.) and splitting what remains into a list of words.
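
As a quick illustration of what tokenizing on the pattern \w+ does, the sketch below applies the same regular expression with Python's built-in re module to a short sample sentence. With its default settings, nltk's RegexpTokenizer should return the same matches; the sample sentence and variable names are made up for this example.

```python
import re

sample = "Call me Ishmael. Knocking people's hats off!"

# \w+ matches runs of letters, digits and underscores; everything else is dropped
sample_tokens = re.findall(r'\w+', sample)

print(sample_tokens)
# ['Call', 'me', 'Ishmael', 'Knocking', 'people', 's', 'hats', 'off']
```

Note that the apostrophe splits "people's" into 'people' and 's'. For a first look at word frequencies that is acceptable, but it is worth keeping in mind when reading the results.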

In [116]:
# Creating a tokenizer
tokenizer = nltk.tokenize.RegexpTokenizer(r'\w+')

# Tokenizing the text
tokens = tokenizer.tokenize(text)

# Printing out the first 8 words / tokens
print(tokens[0:8])

['Moby', 'Dick', 'Or', 'the', 'Whale', 'by', 'Herman', 'Melville']

In [117]:
%%nose

import nltk

def test_correct_tokenizer():
    correct_tokenizer = nltk.tokenize.RegexpTokenizer(r'\w+')
    assert isinstance(tokenizer, nltk.tokenize.regexp.RegexpTokenizer), \
        'tokenizer should be created using the function nltk.tokenize.RegexpTokenizer .'

def test_correct_tokens():
    correct_tokenizer = nltk.tokenize.RegexpTokenizer(r'\w+')
    correct_tokens = correct_tokenizer.tokenize(text)
    assert isinstance(tokens, list) and len(tokens) > 150000, \
        'tokens should be a list with the words in text'

Out[117]:
2/2 tests passed


## 5. Make the words lowercase

OK! We're nearly there. Note that in the above 'Or' has a capital 'O' and that in other places it may not, but both 'Or' and 'or' should be counted as the same word. For this reason, we should build a list of all words in Moby Dick in which all capital letters have been made lower case.

In [118]:
# A new list to hold the lowercased words
words = []

# Looping through the tokens and making them lower case
for word in tokens:
    words.append(word.lower())

# Printing out the first 8 words / tokens
print(words[0:8])

['moby', 'dick', 'or', 'the', 'whale', 'by', 'herman', 'melville']
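
The same lowercasing step can also be written as a list comprehension, which many Python programmers find more compact than an explicit loop. The sketch below uses a short made-up token list rather than the full novel, so the names sample_tokens and sample_words are only for illustration.

```python
# A handful of tokens standing in for the full token list
sample_tokens = ['Moby', 'Dick', 'Or', 'the', 'Whale']

# Building the lowercased list in a single expression
sample_words = [token.lower() for token in sample_tokens]

print(sample_words)
# ['moby', 'dick', 'or', 'the', 'whale']
```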

In [119]:
%%nose

correct_words = []
for word in tokens:
    correct_words.append(word.lower())

def test_correct_words():
    assert correct_words == words, \
        'words should contain every element in tokens, but lowercased.'

Out[119]:
1/1 tests passed


## 6. Load in stop words

It is common practice to remove words that appear a lot in the English language such as 'the', 'of' and 'a' because they're not so interesting. Such words are known as stop words. The package nltk includes a good list of stop words in English that we can use.

In [120]:
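# Downloading the stop word list (only fetched if it is not already present)
nltk.download('stopwords')

from nltk.corpus import stopwords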

# Getting the English stop words from nltk
sw = stopwords.words('english')

# Printing out the first eight stop words
print(sw[0:8])

[nltk_data] Downloading package stopwords to /home/repl/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves']

In [121]:
%%nose

def test_correct_sw():
    correct_sw = nltk.corpus.stopwords.words('english')
    assert correct_sw == sw, \
        'sw should contain the stop words from nltk.'

Out[121]:
1/1 tests passed


## 7. Remove stop words in Moby Dick

We now want to create a new list with all words in Moby Dick, except those that are stop words (that is, those words listed in sw). One way to get this list is to loop over all elements of words and add each word to a new list if it is not in sw.

In [122]:
# A new list to hold Moby Dick with No Stop words
words_ns = []

# Appending to words_ns all words that are in words but not in sw
for word in words:
    if word not in sw:
        words_ns.append(word)

# Printing the first 5 words_ns to check that stop words are gone
print(words_ns[0:5])

['moby', 'dick', 'whale', 'herman', 'melville']
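
Because sw is a list, every "word not in sw" check scans the list from the start. A common variation, sketched below, converts the stop words to a set first so that each membership check takes roughly constant time. The short lists here are made-up stand-ins for sw and words, and the names sample_sw, sw_set and sample_words_ns are hypothetical.

```python
# A tiny stand-in for nltk's stop word list
sample_sw = ['the', 'or', 'by']

# A few lowercased words standing in for the full novel
sample_words = ['moby', 'dick', 'or', 'the', 'whale', 'by', 'herman', 'melville']

# Converting the stop words to a set makes membership checks fast
sw_set = set(sample_sw)

# Keeping only the words that are not stop words, in their original order
sample_words_ns = [word for word in sample_words if word not in sw_set]

print(sample_words_ns)
# ['moby', 'dick', 'whale', 'herman', 'melville']
```

The result is the same as with the loop; only the running time differs, which starts to matter for a text with a few hundred thousand words.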

In [123]:
%%nose

def test_correct_words_ns():
    correct_words_ns = []
    for word in words:
        if word not in sw:
            correct_words_ns.append(word)
    assert correct_words_ns == words_ns, \
        'words_ns should contain all words of Moby Dick but with the stop words removed.'

Out[123]:
1/1 tests passed


## 8. We have the answer

Our original question was:

What are the most frequent words in Herman Melville's novel Moby Dick and how often do they occur?

We are now ready to answer that! Let's create a word frequency distribution plot using nltk.

In [124]:
# This command displays figures inline
%matplotlib inline

# Creating the word frequency distribution
freqdist = nltk.FreqDist(words_ns)

# Plotting the word frequency distribution
freqdist.plot(25)
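
A plot shows the shape of the distribution, but it can be hard to read exact counts from it. nltk.FreqDist builds on Python's collections.Counter, so the counts can also be listed with most_common. The sketch below demonstrates this with a plain Counter on a tiny made-up word list; the words and counts are illustrative only and are not results from the novel.

```python
from collections import Counter

# A tiny made-up word list standing in for words_ns
sample_words_ns = ['whale', 'ship', 'whale', 'sea', 'ahab', 'whale', 'sea']

# Counting how often each word occurs
counts = Counter(sample_words_ns)

# Listing the two most frequent words together with their counts
top_two = counts.most_common(2)

print(top_two)
# [('whale', 3), ('sea', 2)]
```

Calling most_common on freqdist in the same way should list the words behind the plot along with their exact frequencies.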

In [125]:
%%nose

def test_correct_freqdist():
    correct_freqdist = nltk.FreqDist(words_ns)
    assert correct_freqdist == freqdist, \
        'freqdist should contain the word frequencies of words_ns.'

Out[125]:
1/1 tests passed


## 9. The most common word

Nice! The frequency distribution plot above is the answer to our question.

The natural language processing skills we used in this notebook also apply to much of the data that data scientists encounter, since a large proportion of the world's data is unstructured and includes a great deal of text.

So, which word turned out, not surprisingly, to be the most common word in Moby Dick?

In [126]:
# What's the most common word in Moby Dick?
most_common_word = 'whale'

In [127]:
%%nose

def test_most_common_word():
    assert most_common_word.lower() == 'whale', \
        "That's not the most common word in Moby Dick."

Out[127]:
1/1 tests passed