Web Crawling#

Section Title: Web Crawling

Web crawling involves browsing and extracting data from websites. The basic steps of web crawling are:

  • Sending a request for information to a website

  • Retrieving the content on the website

  • Parsing the retrieved data to extract useful information

In this chapter, we will use Python’s BeautifulSoup library to extract data from HTML and XML files.

  • Extensible Markup Language (XML) is a markup language designed to store, transmit, and reconstruct data.

We will work with examples using the website http://quotes.toscrape.com/, which offers a collection of quotes, authors, and tags for practicing web scraping and crawling.

  • Through web crawling, we will extract the quotes, authors, and tags.

BeautifulSoup#

The BeautifulSoup function processes the raw HTML content retrieved from the website.

  • It converts the raw HTML data (in bytes) into a structured object for easy navigation and search.

  • It organizes the content into a tree-like structure for efficient parsing and manipulation.

urllib Library#

If you use the following code for the URL http://quotes.toscrape.com/ mentioned above, you will encounter an error because:

  • The response from the URL is in HTML format, not JSON.

  • The json.loads() function is designed to parse JSON data, not HTML.

import urllib.request
url = f'http://quotes.toscrape.com/'

response = urllib.request.urlopen(url)
data = json.loads(response.read().decode())

Instead of attempting to load the HTML content as JSON, we can use the BeautifulSoup library to parse and process it.

  • This will return a BeautifulSoup object which is a data structure representing a parsed HTML or XML document.

In the follwoing code:

  • response.read().decode() reads the raw HTML content from the website and decodes it into a string.

  • This string contains the HTML content of the webpage.

import urllib.request
from bs4 import BeautifulSoup

url = 'http://quotes.toscrape.com/'
response = urllib.request.urlopen(url)
soup = BeautifulSoup(response.read().decode(), 'html.parser')

The second argument html.parser specifies the parser to process the HTML.

  • It is Python’s built-in HTML parser.

  • It reads and understands the HTML content.

type(soup)
bs4.BeautifulSoup
soup
<!DOCTYPE html>

<html lang="en">
<head>
<meta charset="utf-8"/>
<title>Quotes to Scrape</title>
<link href="/static/bootstrap.min.css" rel="stylesheet"/>
<link href="/static/main.css" rel="stylesheet"/>
</head>
<body>
<div class="container">
<div class="row header-box">
<div class="col-md-8">
<h1>
<a href="/" style="text-decoration: none">Quotes to Scrape</a>
</h1>
</div>
<div class="col-md-4">
<p>
<a href="/login">Login</a>
</p>
</div>
</div>
<div class="row">
<div class="col-md-8">
<div class="quote" itemscope="" itemtype="http://schema.org/CreativeWork">
<span class="text" itemprop="text">“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”</span>
<span>by <small class="author" itemprop="author">Albert Einstein</small>
<a href="/author/Albert-Einstein">(about)</a>
</span>
<div class="tags">
            Tags:
            <meta class="keywords" content="change,deep-thoughts,thinking,world" itemprop="keywords"/>
<a class="tag" href="/tag/change/page/1/">change</a>
<a class="tag" href="/tag/deep-thoughts/page/1/">deep-thoughts</a>
<a class="tag" href="/tag/thinking/page/1/">thinking</a>
<a class="tag" href="/tag/world/page/1/">world</a>
</div>
</div>
<div class="quote" itemscope="" itemtype="http://schema.org/CreativeWork">
<span class="text" itemprop="text">“It is our choices, Harry, that show what we truly are, far more than our abilities.”</span>
<span>by <small class="author" itemprop="author">J.K. Rowling</small>
<a href="/author/J-K-Rowling">(about)</a>
</span>
<div class="tags">
            Tags:
            <meta class="keywords" content="abilities,choices" itemprop="keywords"/>
<a class="tag" href="/tag/abilities/page/1/">abilities</a>
<a class="tag" href="/tag/choices/page/1/">choices</a>
</div>
</div>
<div class="quote" itemscope="" itemtype="http://schema.org/CreativeWork">
<span class="text" itemprop="text">“There are only two ways to live your life. One is as though nothing is a miracle. The other is as though everything is a miracle.”</span>
<span>by <small class="author" itemprop="author">Albert Einstein</small>
<a href="/author/Albert-Einstein">(about)</a>
</span>
<div class="tags">
            Tags:
            <meta class="keywords" content="inspirational,life,live,miracle,miracles" itemprop="keywords"/>
<a class="tag" href="/tag/inspirational/page/1/">inspirational</a>
<a class="tag" href="/tag/life/page/1/">life</a>
<a class="tag" href="/tag/live/page/1/">live</a>
<a class="tag" href="/tag/miracle/page/1/">miracle</a>
<a class="tag" href="/tag/miracles/page/1/">miracles</a>
</div>
</div>
<div class="quote" itemscope="" itemtype="http://schema.org/CreativeWork">
<span class="text" itemprop="text">“The person, be it gentleman or lady, who has not pleasure in a good novel, must be intolerably stupid.”</span>
<span>by <small class="author" itemprop="author">Jane Austen</small>
<a href="/author/Jane-Austen">(about)</a>
</span>
<div class="tags">
            Tags:
            <meta class="keywords" content="aliteracy,books,classic,humor" itemprop="keywords"/>
<a class="tag" href="/tag/aliteracy/page/1/">aliteracy</a>
<a class="tag" href="/tag/books/page/1/">books</a>
<a class="tag" href="/tag/classic/page/1/">classic</a>
<a class="tag" href="/tag/humor/page/1/">humor</a>
</div>
</div>
<div class="quote" itemscope="" itemtype="http://schema.org/CreativeWork">
<span class="text" itemprop="text">“Imperfection is beauty, madness is genius and it's better to be absolutely ridiculous than absolutely boring.”</span>
<span>by <small class="author" itemprop="author">Marilyn Monroe</small>
<a href="/author/Marilyn-Monroe">(about)</a>
</span>
<div class="tags">
            Tags:
            <meta class="keywords" content="be-yourself,inspirational" itemprop="keywords"/>
<a class="tag" href="/tag/be-yourself/page/1/">be-yourself</a>
<a class="tag" href="/tag/inspirational/page/1/">inspirational</a>
</div>
</div>
<div class="quote" itemscope="" itemtype="http://schema.org/CreativeWork">
<span class="text" itemprop="text">“Try not to become a man of success. Rather become a man of value.”</span>
<span>by <small class="author" itemprop="author">Albert Einstein</small>
<a href="/author/Albert-Einstein">(about)</a>
</span>
<div class="tags">
            Tags:
            <meta class="keywords" content="adulthood,success,value" itemprop="keywords"/>
<a class="tag" href="/tag/adulthood/page/1/">adulthood</a>
<a class="tag" href="/tag/success/page/1/">success</a>
<a class="tag" href="/tag/value/page/1/">value</a>
</div>
</div>
<div class="quote" itemscope="" itemtype="http://schema.org/CreativeWork">
<span class="text" itemprop="text">“It is better to be hated for what you are than to be loved for what you are not.”</span>
<span>by <small class="author" itemprop="author">André Gide</small>
<a href="/author/Andre-Gide">(about)</a>
</span>
<div class="tags">
            Tags:
            <meta class="keywords" content="life,love" itemprop="keywords"/>
<a class="tag" href="/tag/life/page/1/">life</a>
<a class="tag" href="/tag/love/page/1/">love</a>
</div>
</div>
<div class="quote" itemscope="" itemtype="http://schema.org/CreativeWork">
<span class="text" itemprop="text">“I have not failed. I've just found 10,000 ways that won't work.”</span>
<span>by <small class="author" itemprop="author">Thomas A. Edison</small>
<a href="/author/Thomas-A-Edison">(about)</a>
</span>
<div class="tags">
            Tags:
            <meta class="keywords" content="edison,failure,inspirational,paraphrased" itemprop="keywords"/>
<a class="tag" href="/tag/edison/page/1/">edison</a>
<a class="tag" href="/tag/failure/page/1/">failure</a>
<a class="tag" href="/tag/inspirational/page/1/">inspirational</a>
<a class="tag" href="/tag/paraphrased/page/1/">paraphrased</a>
</div>
</div>
<div class="quote" itemscope="" itemtype="http://schema.org/CreativeWork">
<span class="text" itemprop="text">“A woman is like a tea bag; you never know how strong it is until it's in hot water.”</span>
<span>by <small class="author" itemprop="author">Eleanor Roosevelt</small>
<a href="/author/Eleanor-Roosevelt">(about)</a>
</span>
<div class="tags">
            Tags:
            <meta class="keywords" content="misattributed-eleanor-roosevelt" itemprop="keywords"/>
<a class="tag" href="/tag/misattributed-eleanor-roosevelt/page/1/">misattributed-eleanor-roosevelt</a>
</div>
</div>
<div class="quote" itemscope="" itemtype="http://schema.org/CreativeWork">
<span class="text" itemprop="text">“A day without sunshine is like, you know, night.”</span>
<span>by <small class="author" itemprop="author">Steve Martin</small>
<a href="/author/Steve-Martin">(about)</a>
</span>
<div class="tags">
            Tags:
            <meta class="keywords" content="humor,obvious,simile" itemprop="keywords"/>
<a class="tag" href="/tag/humor/page/1/">humor</a>
<a class="tag" href="/tag/obvious/page/1/">obvious</a>
<a class="tag" href="/tag/simile/page/1/">simile</a>
</div>
</div>
<nav>
<ul class="pager">
<li class="next">
<a href="/page/2/">Next <span aria-hidden="true">→</span></a>
</li>
</ul>
</nav>
</div>
<div class="col-md-4 tags-box">
<h2>Top Ten tags</h2>
<span class="tag-item">
<a class="tag" href="/tag/love/" style="font-size: 28px">love</a>
</span>
<span class="tag-item">
<a class="tag" href="/tag/inspirational/" style="font-size: 26px">inspirational</a>
</span>
<span class="tag-item">
<a class="tag" href="/tag/life/" style="font-size: 26px">life</a>
</span>
<span class="tag-item">
<a class="tag" href="/tag/humor/" style="font-size: 24px">humor</a>
</span>
<span class="tag-item">
<a class="tag" href="/tag/books/" style="font-size: 22px">books</a>
</span>
<span class="tag-item">
<a class="tag" href="/tag/reading/" style="font-size: 14px">reading</a>
</span>
<span class="tag-item">
<a class="tag" href="/tag/friendship/" style="font-size: 10px">friendship</a>
</span>
<span class="tag-item">
<a class="tag" href="/tag/friends/" style="font-size: 8px">friends</a>
</span>
<span class="tag-item">
<a class="tag" href="/tag/truth/" style="font-size: 8px">truth</a>
</span>
<span class="tag-item">
<a class="tag" href="/tag/simile/" style="font-size: 6px">simile</a>
</span>
</div>
</div>
</div>
<footer class="footer">
<div class="container">
<p class="text-muted">
                Quotes by: <a href="https://www.goodreads.com/quotes">GoodReads.com</a>
</p>
<p class="copyright">
                Made with <span class="zyte">❤</span> by <a class="zyte" href="https://www.zyte.com">Zyte</a>
</p>
</div>
</footer>
</body>
</html>

Quote#

The find_all() method is used to retrieve specific information from the HTML content. In the following code:

  • span: Refers to the HTML element, commonly used to group and style content within a webpage.

    • The code specifically searches for all elements present in the HTML content.

  • text: Represents the class attribute of the targeted elements.

    • The argument class_=’text’ ensures that only elements with the class name ‘text’ are included in the search.

    • The ‘text’ class likely represents quotes within the webpage.

quotes = soup.find_all('span', class_='text')
quotes
[<span class="text" itemprop="text">“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”</span>,
 <span class="text" itemprop="text">“It is our choices, Harry, that show what we truly are, far more than our abilities.”</span>,
 <span class="text" itemprop="text">“There are only two ways to live your life. One is as though nothing is a miracle. The other is as though everything is a miracle.”</span>,
 <span class="text" itemprop="text">“The person, be it gentleman or lady, who has not pleasure in a good novel, must be intolerably stupid.”</span>,
 <span class="text" itemprop="text">“Imperfection is beauty, madness is genius and it's better to be absolutely ridiculous than absolutely boring.”</span>,
 <span class="text" itemprop="text">“Try not to become a man of success. Rather become a man of value.”</span>,
 <span class="text" itemprop="text">“It is better to be hated for what you are than to be loved for what you are not.”</span>,
 <span class="text" itemprop="text">“I have not failed. I've just found 10,000 ways that won't work.”</span>,
 <span class="text" itemprop="text">“A woman is like a tea bag; you never know how strong it is until it's in hot water.”</span>,
 <span class="text" itemprop="text">“A day without sunshine is like, you know, night.”</span>]

Each element in the quotes list is a Tag object, which represents an HTML element.

  • This Tag object has a text attribute that can be used to access the textual content contained within the tag.

quotes[0]
<span class="text" itemprop="text">“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”</span>
type(quotes[0])
bs4.element.Tag
quotes[0].text
'“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”'

Author#

  • small: Refers to the HTML element, commonly used to style or format text as smaller or less prominent within a webpage.

    • The code specifically searches for all elements present in the HTML content.

  • author: Represents the class attribute of the targeted elements.

    • The argument class_=’text’ ensures that only elements with the class name ‘author’ are included in the search.

    • The ‘author’ class represents the authors of quotes.

authors = soup.find_all('small', class_='author')
authors
[<small class="author" itemprop="author">Albert Einstein</small>,
 <small class="author" itemprop="author">J.K. Rowling</small>,
 <small class="author" itemprop="author">Albert Einstein</small>,
 <small class="author" itemprop="author">Jane Austen</small>,
 <small class="author" itemprop="author">Marilyn Monroe</small>,
 <small class="author" itemprop="author">Albert Einstein</small>,
 <small class="author" itemprop="author">André Gide</small>,
 <small class="author" itemprop="author">Thomas A. Edison</small>,
 <small class="author" itemprop="author">Eleanor Roosevelt</small>,
 <small class="author" itemprop="author">Steve Martin</small>]

Each element in the authors list is a Tag object, which represents an HTML element.

  • This Tag object has a text attribute that can be used to access the textual content contained within the tag.

authors[0]
<small class="author" itemprop="author">Albert Einstein</small>
type(authors[0])
bs4.element.Tag
authors[0].text
'Albert Einstein'

Tag#

tags = soup.find_all('a', class_='tag')
tags
[<a class="tag" href="/tag/change/page/1/">change</a>,
 <a class="tag" href="/tag/deep-thoughts/page/1/">deep-thoughts</a>,
 <a class="tag" href="/tag/thinking/page/1/">thinking</a>,
 <a class="tag" href="/tag/world/page/1/">world</a>,
 <a class="tag" href="/tag/abilities/page/1/">abilities</a>,
 <a class="tag" href="/tag/choices/page/1/">choices</a>,
 <a class="tag" href="/tag/inspirational/page/1/">inspirational</a>,
 <a class="tag" href="/tag/life/page/1/">life</a>,
 <a class="tag" href="/tag/live/page/1/">live</a>,
 <a class="tag" href="/tag/miracle/page/1/">miracle</a>,
 <a class="tag" href="/tag/miracles/page/1/">miracles</a>,
 <a class="tag" href="/tag/aliteracy/page/1/">aliteracy</a>,
 <a class="tag" href="/tag/books/page/1/">books</a>,
 <a class="tag" href="/tag/classic/page/1/">classic</a>,
 <a class="tag" href="/tag/humor/page/1/">humor</a>,
 <a class="tag" href="/tag/be-yourself/page/1/">be-yourself</a>,
 <a class="tag" href="/tag/inspirational/page/1/">inspirational</a>,
 <a class="tag" href="/tag/adulthood/page/1/">adulthood</a>,
 <a class="tag" href="/tag/success/page/1/">success</a>,
 <a class="tag" href="/tag/value/page/1/">value</a>,
 <a class="tag" href="/tag/life/page/1/">life</a>,
 <a class="tag" href="/tag/love/page/1/">love</a>,
 <a class="tag" href="/tag/edison/page/1/">edison</a>,
 <a class="tag" href="/tag/failure/page/1/">failure</a>,
 <a class="tag" href="/tag/inspirational/page/1/">inspirational</a>,
 <a class="tag" href="/tag/paraphrased/page/1/">paraphrased</a>,
 <a class="tag" href="/tag/misattributed-eleanor-roosevelt/page/1/">misattributed-eleanor-roosevelt</a>,
 <a class="tag" href="/tag/humor/page/1/">humor</a>,
 <a class="tag" href="/tag/obvious/page/1/">obvious</a>,
 <a class="tag" href="/tag/simile/page/1/">simile</a>,
 <a class="tag" href="/tag/love/" style="font-size: 28px">love</a>,
 <a class="tag" href="/tag/inspirational/" style="font-size: 26px">inspirational</a>,
 <a class="tag" href="/tag/life/" style="font-size: 26px">life</a>,
 <a class="tag" href="/tag/humor/" style="font-size: 24px">humor</a>,
 <a class="tag" href="/tag/books/" style="font-size: 22px">books</a>,
 <a class="tag" href="/tag/reading/" style="font-size: 14px">reading</a>,
 <a class="tag" href="/tag/friendship/" style="font-size: 10px">friendship</a>,
 <a class="tag" href="/tag/friends/" style="font-size: 8px">friends</a>,
 <a class="tag" href="/tag/truth/" style="font-size: 8px">truth</a>,
 <a class="tag" href="/tag/simile/" style="font-size: 6px">simile</a>]
tags[0]
<a class="tag" href="/tag/change/page/1/">change</a>
type(tags[0])
bs4.element.Tag
tags[0].text
'change'

requests Library#

An alternative approach we can use the requests library instead of urllib.request.

import requests
from bs4 import BeautifulSoup

url = 'http://quotes.toscrape.com/'
response = requests.get(url)

The response status can be checked using the status_code attribute. Here are the possible returns:

  • 200: Success – No issue reaching the website.

  • 404: Not Found – Resource not found.

  • 500: Internal Server Error – Server issue.

  • 403: Forbidden – Access denied.

  • 400: Bad Request – Invalid request.

response.status_code
200

response.content is a bytes object that contains the HTML content of the website.

response.content
b'<!DOCTYPE html>\n<html lang="en">\n<head>\n\t<meta charset="UTF-8">\n\t<title>Quotes to Scrape</title>\n    <link rel="stylesheet" href="/static/bootstrap.min.css">\n    <link rel="stylesheet" href="/static/main.css">\n    \n    \n</head>\n<body>\n    <div class="container">\n        <div class="row header-box">\n            <div class="col-md-8">\n                <h1>\n                    <a href="/" style="text-decoration: none">Quotes to Scrape</a>\n                </h1>\n            </div>\n            <div class="col-md-4">\n                <p>\n                \n                    <a href="/login">Login</a>\n                \n                </p>\n            </div>\n        </div>\n    \n\n<div class="row">\n    <div class="col-md-8">\n\n    <div class="quote" itemscope itemtype="http://schema.org/CreativeWork">\n        <span class="text" itemprop="text">\xe2\x80\x9cThe world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.\xe2\x80\x9d</span>\n        <span>by <small class="author" itemprop="author">Albert Einstein</small>\n        <a href="/author/Albert-Einstein">(about)</a>\n        </span>\n        <div class="tags">\n            Tags:\n            <meta class="keywords" itemprop="keywords" content="change,deep-thoughts,thinking,world" /    > \n            \n            <a class="tag" href="/tag/change/page/1/">change</a>\n            \n            <a class="tag" href="/tag/deep-thoughts/page/1/">deep-thoughts</a>\n            \n            <a class="tag" href="/tag/thinking/page/1/">thinking</a>\n            \n            <a class="tag" href="/tag/world/page/1/">world</a>\n            \n        </div>\n    </div>\n\n    <div class="quote" itemscope itemtype="http://schema.org/CreativeWork">\n        <span class="text" itemprop="text">\xe2\x80\x9cIt is our choices, Harry, that show what we truly are, far more than our abilities.\xe2\x80\x9d</span>\n        <span>by <small class="author" itemprop="author">J.K. Rowling</small>\n        <a href="/author/J-K-Rowling">(about)</a>\n        </span>\n        <div class="tags">\n            Tags:\n            <meta class="keywords" itemprop="keywords" content="abilities,choices" /    > \n            \n            <a class="tag" href="/tag/abilities/page/1/">abilities</a>\n            \n            <a class="tag" href="/tag/choices/page/1/">choices</a>\n            \n        </div>\n    </div>\n\n    <div class="quote" itemscope itemtype="http://schema.org/CreativeWork">\n        <span class="text" itemprop="text">\xe2\x80\x9cThere are only two ways to live your life. One is as though nothing is a miracle. The other is as though everything is a miracle.\xe2\x80\x9d</span>\n        <span>by <small class="author" itemprop="author">Albert Einstein</small>\n        <a href="/author/Albert-Einstein">(about)</a>\n        </span>\n        <div class="tags">\n            Tags:\n            <meta class="keywords" itemprop="keywords" content="inspirational,life,live,miracle,miracles" /    > \n            \n            <a class="tag" href="/tag/inspirational/page/1/">inspirational</a>\n            \n            <a class="tag" href="/tag/life/page/1/">life</a>\n            \n            <a class="tag" href="/tag/live/page/1/">live</a>\n            \n            <a class="tag" href="/tag/miracle/page/1/">miracle</a>\n            \n            <a class="tag" href="/tag/miracles/page/1/">miracles</a>\n            \n        </div>\n    </div>\n\n    <div class="quote" itemscope itemtype="http://schema.org/CreativeWork">\n        <span class="text" itemprop="text">\xe2\x80\x9cThe person, be it gentleman or lady, who has not pleasure in a good novel, must be intolerably stupid.\xe2\x80\x9d</span>\n        <span>by <small class="author" itemprop="author">Jane Austen</small>\n        <a href="/author/Jane-Austen">(about)</a>\n        </span>\n        <div class="tags">\n            Tags:\n            <meta class="keywords" itemprop="keywords" content="aliteracy,books,classic,humor" /    > \n            \n            <a class="tag" href="/tag/aliteracy/page/1/">aliteracy</a>\n            \n            <a class="tag" href="/tag/books/page/1/">books</a>\n            \n            <a class="tag" href="/tag/classic/page/1/">classic</a>\n            \n            <a class="tag" href="/tag/humor/page/1/">humor</a>\n            \n        </div>\n    </div>\n\n    <div class="quote" itemscope itemtype="http://schema.org/CreativeWork">\n        <span class="text" itemprop="text">\xe2\x80\x9cImperfection is beauty, madness is genius and it&#39;s better to be absolutely ridiculous than absolutely boring.\xe2\x80\x9d</span>\n        <span>by <small class="author" itemprop="author">Marilyn Monroe</small>\n        <a href="/author/Marilyn-Monroe">(about)</a>\n        </span>\n        <div class="tags">\n            Tags:\n            <meta class="keywords" itemprop="keywords" content="be-yourself,inspirational" /    > \n            \n            <a class="tag" href="/tag/be-yourself/page/1/">be-yourself</a>\n            \n            <a class="tag" href="/tag/inspirational/page/1/">inspirational</a>\n            \n        </div>\n    </div>\n\n    <div class="quote" itemscope itemtype="http://schema.org/CreativeWork">\n        <span class="text" itemprop="text">\xe2\x80\x9cTry not to become a man of success. Rather become a man of value.\xe2\x80\x9d</span>\n        <span>by <small class="author" itemprop="author">Albert Einstein</small>\n        <a href="/author/Albert-Einstein">(about)</a>\n        </span>\n        <div class="tags">\n            Tags:\n            <meta class="keywords" itemprop="keywords" content="adulthood,success,value" /    > \n            \n            <a class="tag" href="/tag/adulthood/page/1/">adulthood</a>\n            \n            <a class="tag" href="/tag/success/page/1/">success</a>\n            \n            <a class="tag" href="/tag/value/page/1/">value</a>\n            \n        </div>\n    </div>\n\n    <div class="quote" itemscope itemtype="http://schema.org/CreativeWork">\n        <span class="text" itemprop="text">\xe2\x80\x9cIt is better to be hated for what you are than to be loved for what you are not.\xe2\x80\x9d</span>\n        <span>by <small class="author" itemprop="author">Andr\xc3\xa9 Gide</small>\n        <a href="/author/Andre-Gide">(about)</a>\n        </span>\n        <div class="tags">\n            Tags:\n            <meta class="keywords" itemprop="keywords" content="life,love" /    > \n            \n            <a class="tag" href="/tag/life/page/1/">life</a>\n            \n            <a class="tag" href="/tag/love/page/1/">love</a>\n            \n        </div>\n    </div>\n\n    <div class="quote" itemscope itemtype="http://schema.org/CreativeWork">\n        <span class="text" itemprop="text">\xe2\x80\x9cI have not failed. I&#39;ve just found 10,000 ways that won&#39;t work.\xe2\x80\x9d</span>\n        <span>by <small class="author" itemprop="author">Thomas A. Edison</small>\n        <a href="/author/Thomas-A-Edison">(about)</a>\n        </span>\n        <div class="tags">\n            Tags:\n            <meta class="keywords" itemprop="keywords" content="edison,failure,inspirational,paraphrased" /    > \n            \n            <a class="tag" href="/tag/edison/page/1/">edison</a>\n            \n            <a class="tag" href="/tag/failure/page/1/">failure</a>\n            \n            <a class="tag" href="/tag/inspirational/page/1/">inspirational</a>\n            \n            <a class="tag" href="/tag/paraphrased/page/1/">paraphrased</a>\n            \n        </div>\n    </div>\n\n    <div class="quote" itemscope itemtype="http://schema.org/CreativeWork">\n        <span class="text" itemprop="text">\xe2\x80\x9cA woman is like a tea bag; you never know how strong it is until it&#39;s in hot water.\xe2\x80\x9d</span>\n        <span>by <small class="author" itemprop="author">Eleanor Roosevelt</small>\n        <a href="/author/Eleanor-Roosevelt">(about)</a>\n        </span>\n        <div class="tags">\n            Tags:\n            <meta class="keywords" itemprop="keywords" content="misattributed-eleanor-roosevelt" /    > \n            \n            <a class="tag" href="/tag/misattributed-eleanor-roosevelt/page/1/">misattributed-eleanor-roosevelt</a>\n            \n        </div>\n    </div>\n\n    <div class="quote" itemscope itemtype="http://schema.org/CreativeWork">\n        <span class="text" itemprop="text">\xe2\x80\x9cA day without sunshine is like, you know, night.\xe2\x80\x9d</span>\n        <span>by <small class="author" itemprop="author">Steve Martin</small>\n        <a href="/author/Steve-Martin">(about)</a>\n        </span>\n        <div class="tags">\n            Tags:\n            <meta class="keywords" itemprop="keywords" content="humor,obvious,simile" /    > \n            \n            <a class="tag" href="/tag/humor/page/1/">humor</a>\n            \n            <a class="tag" href="/tag/obvious/page/1/">obvious</a>\n            \n            <a class="tag" href="/tag/simile/page/1/">simile</a>\n            \n        </div>\n    </div>\n\n    <nav>\n        <ul class="pager">\n            \n            \n            <li class="next">\n                <a href="/page/2/">Next <span aria-hidden="true">&rarr;</span></a>\n            </li>\n            \n        </ul>\n    </nav>\n    </div>\n    <div class="col-md-4 tags-box">\n        \n            <h2>Top Ten tags</h2>\n            \n            <span class="tag-item">\n            <a class="tag" style="font-size: 28px" href="/tag/love/">love</a>\n            </span>\n            \n            <span class="tag-item">\n            <a class="tag" style="font-size: 26px" href="/tag/inspirational/">inspirational</a>\n            </span>\n            \n            <span class="tag-item">\n            <a class="tag" style="font-size: 26px" href="/tag/life/">life</a>\n            </span>\n            \n            <span class="tag-item">\n            <a class="tag" style="font-size: 24px" href="/tag/humor/">humor</a>\n            </span>\n            \n            <span class="tag-item">\n            <a class="tag" style="font-size: 22px" href="/tag/books/">books</a>\n            </span>\n            \n            <span class="tag-item">\n            <a class="tag" style="font-size: 14px" href="/tag/reading/">reading</a>\n            </span>\n            \n            <span class="tag-item">\n            <a class="tag" style="font-size: 10px" href="/tag/friendship/">friendship</a>\n            </span>\n            \n            <span class="tag-item">\n            <a class="tag" style="font-size: 8px" href="/tag/friends/">friends</a>\n            </span>\n            \n            <span class="tag-item">\n            <a class="tag" style="font-size: 8px" href="/tag/truth/">truth</a>\n            </span>\n            \n            <span class="tag-item">\n            <a class="tag" style="font-size: 6px" href="/tag/simile/">simile</a>\n            </span>\n            \n        \n    </div>\n</div>\n\n    </div>\n    <footer class="footer">\n        <div class="container">\n            <p class="text-muted">\n                Quotes by: <a href="https://www.goodreads.com/quotes">GoodReads.com</a>\n            </p>\n            <p class="copyright">\n                Made with <span class=\'zyte\'>\xe2\x9d\xa4</span> by <a class=\'zyte\' href="https://www.zyte.com">Zyte</a>\n            </p>\n        </div>\n    </footer>\n</body>\n</html>'

The BeautifulSoup function processes the raw HTML content in *response.content into a structured object.

soup = BeautifulSoup(response.content, 'html.parser')
soup
<!DOCTYPE html>

<html lang="en">
<head>
<meta charset="utf-8"/>
<title>Quotes to Scrape</title>
<link href="/static/bootstrap.min.css" rel="stylesheet"/>
<link href="/static/main.css" rel="stylesheet"/>
</head>
<body>
<div class="container">
<div class="row header-box">
<div class="col-md-8">
<h1>
<a href="/" style="text-decoration: none">Quotes to Scrape</a>
</h1>
</div>
<div class="col-md-4">
<p>
<a href="/login">Login</a>
</p>
</div>
</div>
<div class="row">
<div class="col-md-8">
<div class="quote" itemscope="" itemtype="http://schema.org/CreativeWork">
<span class="text" itemprop="text">“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”</span>
<span>by <small class="author" itemprop="author">Albert Einstein</small>
<a href="/author/Albert-Einstein">(about)</a>
</span>
<div class="tags">
            Tags:
            <meta class="keywords" content="change,deep-thoughts,thinking,world" itemprop="keywords"/>
<a class="tag" href="/tag/change/page/1/">change</a>
<a class="tag" href="/tag/deep-thoughts/page/1/">deep-thoughts</a>
<a class="tag" href="/tag/thinking/page/1/">thinking</a>
<a class="tag" href="/tag/world/page/1/">world</a>
</div>
</div>
<div class="quote" itemscope="" itemtype="http://schema.org/CreativeWork">
<span class="text" itemprop="text">“It is our choices, Harry, that show what we truly are, far more than our abilities.”</span>
<span>by <small class="author" itemprop="author">J.K. Rowling</small>
<a href="/author/J-K-Rowling">(about)</a>
</span>
<div class="tags">
            Tags:
            <meta class="keywords" content="abilities,choices" itemprop="keywords"/>
<a class="tag" href="/tag/abilities/page/1/">abilities</a>
<a class="tag" href="/tag/choices/page/1/">choices</a>
</div>
</div>
<div class="quote" itemscope="" itemtype="http://schema.org/CreativeWork">
<span class="text" itemprop="text">“There are only two ways to live your life. One is as though nothing is a miracle. The other is as though everything is a miracle.”</span>
<span>by <small class="author" itemprop="author">Albert Einstein</small>
<a href="/author/Albert-Einstein">(about)</a>
</span>
<div class="tags">
            Tags:
            <meta class="keywords" content="inspirational,life,live,miracle,miracles" itemprop="keywords"/>
<a class="tag" href="/tag/inspirational/page/1/">inspirational</a>
<a class="tag" href="/tag/life/page/1/">life</a>
<a class="tag" href="/tag/live/page/1/">live</a>
<a class="tag" href="/tag/miracle/page/1/">miracle</a>
<a class="tag" href="/tag/miracles/page/1/">miracles</a>
</div>
</div>
<div class="quote" itemscope="" itemtype="http://schema.org/CreativeWork">
<span class="text" itemprop="text">“The person, be it gentleman or lady, who has not pleasure in a good novel, must be intolerably stupid.”</span>
<span>by <small class="author" itemprop="author">Jane Austen</small>
<a href="/author/Jane-Austen">(about)</a>
</span>
<div class="tags">
            Tags:
            <meta class="keywords" content="aliteracy,books,classic,humor" itemprop="keywords"/>
<a class="tag" href="/tag/aliteracy/page/1/">aliteracy</a>
<a class="tag" href="/tag/books/page/1/">books</a>
<a class="tag" href="/tag/classic/page/1/">classic</a>
<a class="tag" href="/tag/humor/page/1/">humor</a>
</div>
</div>
<div class="quote" itemscope="" itemtype="http://schema.org/CreativeWork">
<span class="text" itemprop="text">“Imperfection is beauty, madness is genius and it's better to be absolutely ridiculous than absolutely boring.”</span>
<span>by <small class="author" itemprop="author">Marilyn Monroe</small>
<a href="/author/Marilyn-Monroe">(about)</a>
</span>
<div class="tags">
            Tags:
            <meta class="keywords" content="be-yourself,inspirational" itemprop="keywords"/>
<a class="tag" href="/tag/be-yourself/page/1/">be-yourself</a>
<a class="tag" href="/tag/inspirational/page/1/">inspirational</a>
</div>
</div>
<div class="quote" itemscope="" itemtype="http://schema.org/CreativeWork">
<span class="text" itemprop="text">“Try not to become a man of success. Rather become a man of value.”</span>
<span>by <small class="author" itemprop="author">Albert Einstein</small>
<a href="/author/Albert-Einstein">(about)</a>
</span>
<div class="tags">
            Tags:
            <meta class="keywords" content="adulthood,success,value" itemprop="keywords"/>
<a class="tag" href="/tag/adulthood/page/1/">adulthood</a>
<a class="tag" href="/tag/success/page/1/">success</a>
<a class="tag" href="/tag/value/page/1/">value</a>
</div>
</div>
<div class="quote" itemscope="" itemtype="http://schema.org/CreativeWork">
<span class="text" itemprop="text">“It is better to be hated for what you are than to be loved for what you are not.”</span>
<span>by <small class="author" itemprop="author">André Gide</small>
<a href="/author/Andre-Gide">(about)</a>
</span>
<div class="tags">
            Tags:
            <meta class="keywords" content="life,love" itemprop="keywords"/>
<a class="tag" href="/tag/life/page/1/">life</a>
<a class="tag" href="/tag/love/page/1/">love</a>
</div>
</div>
<div class="quote" itemscope="" itemtype="http://schema.org/CreativeWork">
<span class="text" itemprop="text">“I have not failed. I've just found 10,000 ways that won't work.”</span>
<span>by <small class="author" itemprop="author">Thomas A. Edison</small>
<a href="/author/Thomas-A-Edison">(about)</a>
</span>
<div class="tags">
            Tags:
            <meta class="keywords" content="edison,failure,inspirational,paraphrased" itemprop="keywords"/>
<a class="tag" href="/tag/edison/page/1/">edison</a>
<a class="tag" href="/tag/failure/page/1/">failure</a>
<a class="tag" href="/tag/inspirational/page/1/">inspirational</a>
<a class="tag" href="/tag/paraphrased/page/1/">paraphrased</a>
</div>
</div>
<div class="quote" itemscope="" itemtype="http://schema.org/CreativeWork">
<span class="text" itemprop="text">“A woman is like a tea bag; you never know how strong it is until it's in hot water.”</span>
<span>by <small class="author" itemprop="author">Eleanor Roosevelt</small>
<a href="/author/Eleanor-Roosevelt">(about)</a>
</span>
<div class="tags">
            Tags:
            <meta class="keywords" content="misattributed-eleanor-roosevelt" itemprop="keywords"/>
<a class="tag" href="/tag/misattributed-eleanor-roosevelt/page/1/">misattributed-eleanor-roosevelt</a>
</div>
</div>
<div class="quote" itemscope="" itemtype="http://schema.org/CreativeWork">
<span class="text" itemprop="text">“A day without sunshine is like, you know, night.”</span>
<span>by <small class="author" itemprop="author">Steve Martin</small>
<a href="/author/Steve-Martin">(about)</a>
</span>
<div class="tags">
            Tags:
            <meta class="keywords" content="humor,obvious,simile" itemprop="keywords"/>
<a class="tag" href="/tag/humor/page/1/">humor</a>
<a class="tag" href="/tag/obvious/page/1/">obvious</a>
<a class="tag" href="/tag/simile/page/1/">simile</a>
</div>
</div>
<nav>
<ul class="pager">
<li class="next">
<a href="/page/2/">Next <span aria-hidden="true">→</span></a>
</li>
</ul>
</nav>
</div>
<div class="col-md-4 tags-box">
<h2>Top Ten tags</h2>
<span class="tag-item">
<a class="tag" href="/tag/love/" style="font-size: 28px">love</a>
</span>
<span class="tag-item">
<a class="tag" href="/tag/inspirational/" style="font-size: 26px">inspirational</a>
</span>
<span class="tag-item">
<a class="tag" href="/tag/life/" style="font-size: 26px">life</a>
</span>
<span class="tag-item">
<a class="tag" href="/tag/humor/" style="font-size: 24px">humor</a>
</span>
<span class="tag-item">
<a class="tag" href="/tag/books/" style="font-size: 22px">books</a>
</span>
<span class="tag-item">
<a class="tag" href="/tag/reading/" style="font-size: 14px">reading</a>
</span>
<span class="tag-item">
<a class="tag" href="/tag/friendship/" style="font-size: 10px">friendship</a>
</span>
<span class="tag-item">
<a class="tag" href="/tag/friends/" style="font-size: 8px">friends</a>
</span>
<span class="tag-item">
<a class="tag" href="/tag/truth/" style="font-size: 8px">truth</a>
</span>
<span class="tag-item">
<a class="tag" href="/tag/simile/" style="font-size: 6px">simile</a>
</span>
</div>
</div>
</div>
<footer class="footer">
<div class="container">
<p class="text-muted">
                Quotes by: <a href="https://www.goodreads.com/quotes">GoodReads.com</a>
</p>
<p class="copyright">
                Made with <span class="zyte">❤</span> by <a class="zyte" href="https://www.zyte.com">Zyte</a>
</p>
</div>
</footer>
</body>
</html>

Now, we will parse the quotes, authors, and tags using da different approach.

  • div: Refers to the \(<div>\) HTML element, commonly used to group and organize content within a webpage.

    • The code specifically searches for all \(<div>\) elements present in the HTML content.

  • quote: Represents the class attribute of the targeted \(<div>\) elements.

    • The argument class_=’quote’ ensures that only \(<div>\) elements with the class name ‘quote’ are included in the search.

quotes = soup.find_all('div', class_='quote')
quotes
[<div class="quote" itemscope="" itemtype="http://schema.org/CreativeWork">
 <span class="text" itemprop="text">“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”</span>
 <span>by <small class="author" itemprop="author">Albert Einstein</small>
 <a href="/author/Albert-Einstein">(about)</a>
 </span>
 <div class="tags">
             Tags:
             <meta class="keywords" content="change,deep-thoughts,thinking,world" itemprop="keywords"/>
 <a class="tag" href="/tag/change/page/1/">change</a>
 <a class="tag" href="/tag/deep-thoughts/page/1/">deep-thoughts</a>
 <a class="tag" href="/tag/thinking/page/1/">thinking</a>
 <a class="tag" href="/tag/world/page/1/">world</a>
 </div>
 </div>,
 <div class="quote" itemscope="" itemtype="http://schema.org/CreativeWork">
 <span class="text" itemprop="text">“It is our choices, Harry, that show what we truly are, far more than our abilities.”</span>
 <span>by <small class="author" itemprop="author">J.K. Rowling</small>
 <a href="/author/J-K-Rowling">(about)</a>
 </span>
 <div class="tags">
             Tags:
             <meta class="keywords" content="abilities,choices" itemprop="keywords"/>
 <a class="tag" href="/tag/abilities/page/1/">abilities</a>
 <a class="tag" href="/tag/choices/page/1/">choices</a>
 </div>
 </div>,
 <div class="quote" itemscope="" itemtype="http://schema.org/CreativeWork">
 <span class="text" itemprop="text">“There are only two ways to live your life. One is as though nothing is a miracle. The other is as though everything is a miracle.”</span>
 <span>by <small class="author" itemprop="author">Albert Einstein</small>
 <a href="/author/Albert-Einstein">(about)</a>
 </span>
 <div class="tags">
             Tags:
             <meta class="keywords" content="inspirational,life,live,miracle,miracles" itemprop="keywords"/>
 <a class="tag" href="/tag/inspirational/page/1/">inspirational</a>
 <a class="tag" href="/tag/life/page/1/">life</a>
 <a class="tag" href="/tag/live/page/1/">live</a>
 <a class="tag" href="/tag/miracle/page/1/">miracle</a>
 <a class="tag" href="/tag/miracles/page/1/">miracles</a>
 </div>
 </div>,
 <div class="quote" itemscope="" itemtype="http://schema.org/CreativeWork">
 <span class="text" itemprop="text">“The person, be it gentleman or lady, who has not pleasure in a good novel, must be intolerably stupid.”</span>
 <span>by <small class="author" itemprop="author">Jane Austen</small>
 <a href="/author/Jane-Austen">(about)</a>
 </span>
 <div class="tags">
             Tags:
             <meta class="keywords" content="aliteracy,books,classic,humor" itemprop="keywords"/>
 <a class="tag" href="/tag/aliteracy/page/1/">aliteracy</a>
 <a class="tag" href="/tag/books/page/1/">books</a>
 <a class="tag" href="/tag/classic/page/1/">classic</a>
 <a class="tag" href="/tag/humor/page/1/">humor</a>
 </div>
 </div>,
 <div class="quote" itemscope="" itemtype="http://schema.org/CreativeWork">
 <span class="text" itemprop="text">“Imperfection is beauty, madness is genius and it's better to be absolutely ridiculous than absolutely boring.”</span>
 <span>by <small class="author" itemprop="author">Marilyn Monroe</small>
 <a href="/author/Marilyn-Monroe">(about)</a>
 </span>
 <div class="tags">
             Tags:
             <meta class="keywords" content="be-yourself,inspirational" itemprop="keywords"/>
 <a class="tag" href="/tag/be-yourself/page/1/">be-yourself</a>
 <a class="tag" href="/tag/inspirational/page/1/">inspirational</a>
 </div>
 </div>,
 <div class="quote" itemscope="" itemtype="http://schema.org/CreativeWork">
 <span class="text" itemprop="text">“Try not to become a man of success. Rather become a man of value.”</span>
 <span>by <small class="author" itemprop="author">Albert Einstein</small>
 <a href="/author/Albert-Einstein">(about)</a>
 </span>
 <div class="tags">
             Tags:
             <meta class="keywords" content="adulthood,success,value" itemprop="keywords"/>
 <a class="tag" href="/tag/adulthood/page/1/">adulthood</a>
 <a class="tag" href="/tag/success/page/1/">success</a>
 <a class="tag" href="/tag/value/page/1/">value</a>
 </div>
 </div>,
 <div class="quote" itemscope="" itemtype="http://schema.org/CreativeWork">
 <span class="text" itemprop="text">“It is better to be hated for what you are than to be loved for what you are not.”</span>
 <span>by <small class="author" itemprop="author">André Gide</small>
 <a href="/author/Andre-Gide">(about)</a>
 </span>
 <div class="tags">
             Tags:
             <meta class="keywords" content="life,love" itemprop="keywords"/>
 <a class="tag" href="/tag/life/page/1/">life</a>
 <a class="tag" href="/tag/love/page/1/">love</a>
 </div>
 </div>,
 <div class="quote" itemscope="" itemtype="http://schema.org/CreativeWork">
 <span class="text" itemprop="text">“I have not failed. I've just found 10,000 ways that won't work.”</span>
 <span>by <small class="author" itemprop="author">Thomas A. Edison</small>
 <a href="/author/Thomas-A-Edison">(about)</a>
 </span>
 <div class="tags">
             Tags:
             <meta class="keywords" content="edison,failure,inspirational,paraphrased" itemprop="keywords"/>
 <a class="tag" href="/tag/edison/page/1/">edison</a>
 <a class="tag" href="/tag/failure/page/1/">failure</a>
 <a class="tag" href="/tag/inspirational/page/1/">inspirational</a>
 <a class="tag" href="/tag/paraphrased/page/1/">paraphrased</a>
 </div>
 </div>,
 <div class="quote" itemscope="" itemtype="http://schema.org/CreativeWork">
 <span class="text" itemprop="text">“A woman is like a tea bag; you never know how strong it is until it's in hot water.”</span>
 <span>by <small class="author" itemprop="author">Eleanor Roosevelt</small>
 <a href="/author/Eleanor-Roosevelt">(about)</a>
 </span>
 <div class="tags">
             Tags:
             <meta class="keywords" content="misattributed-eleanor-roosevelt" itemprop="keywords"/>
 <a class="tag" href="/tag/misattributed-eleanor-roosevelt/page/1/">misattributed-eleanor-roosevelt</a>
 </div>
 </div>,
 <div class="quote" itemscope="" itemtype="http://schema.org/CreativeWork">
 <span class="text" itemprop="text">“A day without sunshine is like, you know, night.”</span>
 <span>by <small class="author" itemprop="author">Steve Martin</small>
 <a href="/author/Steve-Martin">(about)</a>
 </span>
 <div class="tags">
             Tags:
             <meta class="keywords" content="humor,obvious,simile" itemprop="keywords"/>
 <a class="tag" href="/tag/humor/page/1/">humor</a>
 <a class="tag" href="/tag/obvious/page/1/">obvious</a>
 <a class="tag" href="/tag/simile/page/1/">simile</a>
 </div>
 </div>]
quotes[0]
<div class="quote" itemscope="" itemtype="http://schema.org/CreativeWork">
<span class="text" itemprop="text">“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”</span>
<span>by <small class="author" itemprop="author">Albert Einstein</small>
<a href="/author/Albert-Einstein">(about)</a>
</span>
<div class="tags">
            Tags:
            <meta class="keywords" content="change,deep-thoughts,thinking,world" itemprop="keywords"/>
<a class="tag" href="/tag/change/page/1/">change</a>
<a class="tag" href="/tag/deep-thoughts/page/1/">deep-thoughts</a>
<a class="tag" href="/tag/thinking/page/1/">thinking</a>
<a class="tag" href="/tag/world/page/1/">world</a>
</div>
</div>

Quote#

quotes[0].find('span')
<span class="text" itemprop="text">“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”</span>
quotes[0].find('span').text
'“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”'

Author#

quotes[0].find('small')
<small class="author" itemprop="author">Albert Einstein</small>
quotes[0].find('small').text
'Albert Einstein'

Tag#

quotes[0].find_all('a')
[<a href="/author/Albert-Einstein">(about)</a>,
 <a class="tag" href="/tag/change/page/1/">change</a>,
 <a class="tag" href="/tag/deep-thoughts/page/1/">deep-thoughts</a>,
 <a class="tag" href="/tag/thinking/page/1/">thinking</a>,
 <a class="tag" href="/tag/world/page/1/">world</a>]
quotes[0].find_all('a')[0].text
'(about)'

We can extract all tags using a list comprehension:

tags = [tag.text for tag in quotes[0].find_all('a', class_='tag')]
tags
['change', 'deep-thoughts', 'thinking', 'world']

Regular Expressions#

In this section, we will extract all links in the href attributes that start with /tag.

  • The href attribute contains links to other websites or specific sections within a website.

  • We will use regular expressions to accomplish this task.

First, we create a BeautifulSoup object using the HTML content.

import requests
from bs4 import BeautifulSoup

url = 'http://quotes.toscrape.com/'
response = requests.get(url)

soup = BeautifulSoup(response.content, 'html.parser')

All <a> HTML element:

a_list = soup.find_all('a')
a_list
[<a href="/" style="text-decoration: none">Quotes to Scrape</a>,
 <a href="/login">Login</a>,
 <a href="/author/Albert-Einstein">(about)</a>,
 <a class="tag" href="/tag/change/page/1/">change</a>,
 <a class="tag" href="/tag/deep-thoughts/page/1/">deep-thoughts</a>,
 <a class="tag" href="/tag/thinking/page/1/">thinking</a>,
 <a class="tag" href="/tag/world/page/1/">world</a>,
 <a href="/author/J-K-Rowling">(about)</a>,
 <a class="tag" href="/tag/abilities/page/1/">abilities</a>,
 <a class="tag" href="/tag/choices/page/1/">choices</a>,
 <a href="/author/Albert-Einstein">(about)</a>,
 <a class="tag" href="/tag/inspirational/page/1/">inspirational</a>,
 <a class="tag" href="/tag/life/page/1/">life</a>,
 <a class="tag" href="/tag/live/page/1/">live</a>,
 <a class="tag" href="/tag/miracle/page/1/">miracle</a>,
 <a class="tag" href="/tag/miracles/page/1/">miracles</a>,
 <a href="/author/Jane-Austen">(about)</a>,
 <a class="tag" href="/tag/aliteracy/page/1/">aliteracy</a>,
 <a class="tag" href="/tag/books/page/1/">books</a>,
 <a class="tag" href="/tag/classic/page/1/">classic</a>,
 <a class="tag" href="/tag/humor/page/1/">humor</a>,
 <a href="/author/Marilyn-Monroe">(about)</a>,
 <a class="tag" href="/tag/be-yourself/page/1/">be-yourself</a>,
 <a class="tag" href="/tag/inspirational/page/1/">inspirational</a>,
 <a href="/author/Albert-Einstein">(about)</a>,
 <a class="tag" href="/tag/adulthood/page/1/">adulthood</a>,
 <a class="tag" href="/tag/success/page/1/">success</a>,
 <a class="tag" href="/tag/value/page/1/">value</a>,
 <a href="/author/Andre-Gide">(about)</a>,
 <a class="tag" href="/tag/life/page/1/">life</a>,
 <a class="tag" href="/tag/love/page/1/">love</a>,
 <a href="/author/Thomas-A-Edison">(about)</a>,
 <a class="tag" href="/tag/edison/page/1/">edison</a>,
 <a class="tag" href="/tag/failure/page/1/">failure</a>,
 <a class="tag" href="/tag/inspirational/page/1/">inspirational</a>,
 <a class="tag" href="/tag/paraphrased/page/1/">paraphrased</a>,
 <a href="/author/Eleanor-Roosevelt">(about)</a>,
 <a class="tag" href="/tag/misattributed-eleanor-roosevelt/page/1/">misattributed-eleanor-roosevelt</a>,
 <a href="/author/Steve-Martin">(about)</a>,
 <a class="tag" href="/tag/humor/page/1/">humor</a>,
 <a class="tag" href="/tag/obvious/page/1/">obvious</a>,
 <a class="tag" href="/tag/simile/page/1/">simile</a>,
 <a href="/page/2/">Next <span aria-hidden="true">→</span></a>,
 <a class="tag" href="/tag/love/" style="font-size: 28px">love</a>,
 <a class="tag" href="/tag/inspirational/" style="font-size: 26px">inspirational</a>,
 <a class="tag" href="/tag/life/" style="font-size: 26px">life</a>,
 <a class="tag" href="/tag/humor/" style="font-size: 24px">humor</a>,
 <a class="tag" href="/tag/books/" style="font-size: 22px">books</a>,
 <a class="tag" href="/tag/reading/" style="font-size: 14px">reading</a>,
 <a class="tag" href="/tag/friendship/" style="font-size: 10px">friendship</a>,
 <a class="tag" href="/tag/friends/" style="font-size: 8px">friends</a>,
 <a class="tag" href="/tag/truth/" style="font-size: 8px">truth</a>,
 <a class="tag" href="/tag/simile/" style="font-size: 6px">simile</a>,
 <a href="https://www.goodreads.com/quotes">GoodReads.com</a>,
 <a class="zyte" href="https://www.zyte.com">Zyte</a>]

The element at index 10 of the list above is:

a_list[10]
<a href="/author/Albert-Einstein">(about)</a>

It can be converted into a string using the decode() method.

a_list[10].decode()
'<a href="/author/Albert-Einstein">(about)</a>'

In the following code, each element of the a_list is iterated over in a loop.

  • The regular expression ‘href=”(/tag[^”]+)”’ is applied. This expression works as follows:

    • It looks for href=” in the HTML content.

    • After locating href=”, it captures the part that starts with /tag.

    • It continues capturing characters until it encounters the closing quotation mark after /tag.

import re
href_list = []
for a_item in a_list:
    href_list += re.findall('href="(/tag[^"]+)"' , a_item.decode())
href_list
['/tag/change/page/1/',
 '/tag/deep-thoughts/page/1/',
 '/tag/thinking/page/1/',
 '/tag/world/page/1/',
 '/tag/abilities/page/1/',
 '/tag/choices/page/1/',
 '/tag/inspirational/page/1/',
 '/tag/life/page/1/',
 '/tag/live/page/1/',
 '/tag/miracle/page/1/',
 '/tag/miracles/page/1/',
 '/tag/aliteracy/page/1/',
 '/tag/books/page/1/',
 '/tag/classic/page/1/',
 '/tag/humor/page/1/',
 '/tag/be-yourself/page/1/',
 '/tag/inspirational/page/1/',
 '/tag/adulthood/page/1/',
 '/tag/success/page/1/',
 '/tag/value/page/1/',
 '/tag/life/page/1/',
 '/tag/love/page/1/',
 '/tag/edison/page/1/',
 '/tag/failure/page/1/',
 '/tag/inspirational/page/1/',
 '/tag/paraphrased/page/1/',
 '/tag/misattributed-eleanor-roosevelt/page/1/',
 '/tag/humor/page/1/',
 '/tag/obvious/page/1/',
 '/tag/simile/page/1/',
 '/tag/love/',
 '/tag/inspirational/',
 '/tag/life/',
 '/tag/humor/',
 '/tag/books/',
 '/tag/reading/',
 '/tag/friendship/',
 '/tag/friends/',
 '/tag/truth/',
 '/tag/simile/']