Web Crawling

Web crawling involves browsing and extracting data from websites. The basic steps of web crawling are:
Sending a request for information to a website
Retrieving the content on the website
Parsing the retrieved data to extract useful information
In this chapter, we will use Python’s BeautifulSoup library to extract data from HTML and XML files.
Extensible Markup Language (XML) is a markup language designed to store, transmit, and reconstruct data.
We will work with examples using the website http://quotes.toscrape.com/, which offers a collection of quotes, authors, and tags for practicing web scraping and crawling.
Through web crawling, we will extract the quotes, authors, and tags.
BeautifulSoup
The BeautifulSoup constructor processes the raw HTML content retrieved from the website.
It converts the raw HTML data (bytes or a string) into a structured object for easy navigation and search.
It organizes the content into a tree-like structure for efficient parsing and manipulation.
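To make the tree-like structure concrete, here is a minimal sketch that parses a small inline HTML string rather than a live page. The snippet and the variable names (html_doc, soup_demo, and so on) are assumptions made for illustration; only the BeautifulSoup calls come from the library.

```python
from bs4 import BeautifulSoup

# A small, hand-written HTML document (illustrative, not fetched from the site).
html_doc = """
<html>
  <head><title>Quotes to Scrape</title></head>
  <body>
    <h1><a href="/">Quotes to Scrape</a></h1>
  </body>
</html>
"""

# Parse the string into a navigable tree.
soup_demo = BeautifulSoup(html_doc, "html.parser")

# Elements can be reached by name, searched for, and walked via parents.
page_title = soup_demo.title.string      # text inside <title>
first_link = soup_demo.find("a")         # first <a> element in the tree
link_target = first_link["href"]         # value of its href attribute
link_parent = first_link.parent.name     # name of the enclosing element

print(page_title, link_target, link_parent)
```

Each lookup moves through the parsed tree instead of searching the raw text, which is what makes later extraction steps short.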
urllib Library
If you use the following code for the URL http://quotes.toscrape.com/ mentioned above, you will encounter an error because:
The response from the URL is in HTML format, not JSON.
The json.loads() function is designed to parse JSON data, not HTML.
import json
import urllib.request
url = 'http://quotes.toscrape.com/'
response = urllib.request.urlopen(url)
# Raises json.JSONDecodeError because the response body is HTML, not JSON
data = json.loads(response.read().decode())
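The failure can be reproduced without a network connection. The sketch below feeds a short, hand-written HTML string (an assumption standing in for the real page body) to json.loads() and catches the resulting exception.

```python
import json

# A tiny HTML snippet standing in for the page body (illustrative only;
# the real page is much longer).
html_text = "<!DOCTYPE html><html><head><title>Quotes to Scrape</title></head></html>"

try:
    json.loads(html_text)
    error_name = None
except json.JSONDecodeError as exc:
    # The parser gives up at the first character, since '<' cannot start JSON.
    error_name = type(exc).__name__
    error_position = exc.pos
    print(f"{error_name}: {exc.msg} (at character {exc.pos})")
```

The error is reported at position 0 because a JSON document must begin with a value such as an object, array, string, or number.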
Instead of attempting to load the HTML content as JSON, we can use the BeautifulSoup library to parse and process it.
This will return a BeautifulSoup object which is a data structure representing a parsed HTML or XML document.
In the following code:
response.read().decode() reads the raw HTML content from the website and decodes it into a string.
This string contains the HTML content of the webpage.
import urllib.request
from bs4 import BeautifulSoup
url = 'http://quotes.toscrape.com/'
response = urllib.request.urlopen(url)
soup = BeautifulSoup(response.read().decode(), 'html.parser')
The second argument html.parser specifies the parser to process the HTML.
It is Python’s built-in HTML parser.
It reads and understands the HTML content.
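The decoding step can also be looked at in isolation. The sketch below uses a short hand-written bytes literal (an assumption standing in for what response.read() returns) to show how UTF-8 bytes become a Python string.

```python
# Hypothetical stand-in for the bytes returned by response.read().
raw_bytes = b'<span class="text">\xe2\x80\x9cA day without sunshine is like, you know, night.\xe2\x80\x9d</span>'

# bytes.decode() uses UTF-8 by default, which matches the page's declared charset.
html_text = raw_bytes.decode()

print(type(raw_bytes).__name__)  # bytes
print(type(html_text).__name__)  # str
print(html_text)
```

The escape sequences \xe2\x80\x9c and \xe2\x80\x9d are the UTF-8 encodings of the curly quotation marks that appear around each quote.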
type(soup)
bs4.BeautifulSoup
soup
<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="utf-8"/>
<title>Quotes to Scrape</title>
<link href="/static/bootstrap.min.css" rel="stylesheet"/>
<link href="/static/main.css" rel="stylesheet"/>
</head>
<body>
<div class="container">
<div class="row header-box">
<div class="col-md-8">
<h1>
<a href="/" style="text-decoration: none">Quotes to Scrape</a>
</h1>
</div>
<div class="col-md-4">
<p>
<a href="/login">Login</a>
</p>
</div>
</div>
<div class="row">
<div class="col-md-8">
<div class="quote" itemscope="" itemtype="http://schema.org/CreativeWork">
<span class="text" itemprop="text">“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”</span>
<span>by <small class="author" itemprop="author">Albert Einstein</small>
<a href="/author/Albert-Einstein">(about)</a>
</span>
<div class="tags">
Tags:
<meta class="keywords" content="change,deep-thoughts,thinking,world" itemprop="keywords"/>
<a class="tag" href="/tag/change/page/1/">change</a>
<a class="tag" href="/tag/deep-thoughts/page/1/">deep-thoughts</a>
<a class="tag" href="/tag/thinking/page/1/">thinking</a>
<a class="tag" href="/tag/world/page/1/">world</a>
</div>
</div>
<div class="quote" itemscope="" itemtype="http://schema.org/CreativeWork">
<span class="text" itemprop="text">“It is our choices, Harry, that show what we truly are, far more than our abilities.”</span>
<span>by <small class="author" itemprop="author">J.K. Rowling</small>
<a href="/author/J-K-Rowling">(about)</a>
</span>
<div class="tags">
Tags:
<meta class="keywords" content="abilities,choices" itemprop="keywords"/>
<a class="tag" href="/tag/abilities/page/1/">abilities</a>
<a class="tag" href="/tag/choices/page/1/">choices</a>
</div>
</div>
<div class="quote" itemscope="" itemtype="http://schema.org/CreativeWork">
<span class="text" itemprop="text">“There are only two ways to live your life. One is as though nothing is a miracle. The other is as though everything is a miracle.”</span>
<span>by <small class="author" itemprop="author">Albert Einstein</small>
<a href="/author/Albert-Einstein">(about)</a>
</span>
<div class="tags">
Tags:
<meta class="keywords" content="inspirational,life,live,miracle,miracles" itemprop="keywords"/>
<a class="tag" href="/tag/inspirational/page/1/">inspirational</a>
<a class="tag" href="/tag/life/page/1/">life</a>
<a class="tag" href="/tag/live/page/1/">live</a>
<a class="tag" href="/tag/miracle/page/1/">miracle</a>
<a class="tag" href="/tag/miracles/page/1/">miracles</a>
</div>
</div>
<div class="quote" itemscope="" itemtype="http://schema.org/CreativeWork">
<span class="text" itemprop="text">“The person, be it gentleman or lady, who has not pleasure in a good novel, must be intolerably stupid.”</span>
<span>by <small class="author" itemprop="author">Jane Austen</small>
<a href="/author/Jane-Austen">(about)</a>
</span>
<div class="tags">
Tags:
<meta class="keywords" content="aliteracy,books,classic,humor" itemprop="keywords"/>
<a class="tag" href="/tag/aliteracy/page/1/">aliteracy</a>
<a class="tag" href="/tag/books/page/1/">books</a>
<a class="tag" href="/tag/classic/page/1/">classic</a>
<a class="tag" href="/tag/humor/page/1/">humor</a>
</div>
</div>
<div class="quote" itemscope="" itemtype="http://schema.org/CreativeWork">
<span class="text" itemprop="text">“Imperfection is beauty, madness is genius and it's better to be absolutely ridiculous than absolutely boring.”</span>
<span>by <small class="author" itemprop="author">Marilyn Monroe</small>
<a href="/author/Marilyn-Monroe">(about)</a>
</span>
<div class="tags">
Tags:
<meta class="keywords" content="be-yourself,inspirational" itemprop="keywords"/>
<a class="tag" href="/tag/be-yourself/page/1/">be-yourself</a>
<a class="tag" href="/tag/inspirational/page/1/">inspirational</a>
</div>
</div>
<div class="quote" itemscope="" itemtype="http://schema.org/CreativeWork">
<span class="text" itemprop="text">“Try not to become a man of success. Rather become a man of value.”</span>
<span>by <small class="author" itemprop="author">Albert Einstein</small>
<a href="/author/Albert-Einstein">(about)</a>
</span>
<div class="tags">
Tags:
<meta class="keywords" content="adulthood,success,value" itemprop="keywords"/>
<a class="tag" href="/tag/adulthood/page/1/">adulthood</a>
<a class="tag" href="/tag/success/page/1/">success</a>
<a class="tag" href="/tag/value/page/1/">value</a>
</div>
</div>
<div class="quote" itemscope="" itemtype="http://schema.org/CreativeWork">
<span class="text" itemprop="text">“It is better to be hated for what you are than to be loved for what you are not.”</span>
<span>by <small class="author" itemprop="author">André Gide</small>
<a href="/author/Andre-Gide">(about)</a>
</span>
<div class="tags">
Tags:
<meta class="keywords" content="life,love" itemprop="keywords"/>
<a class="tag" href="/tag/life/page/1/">life</a>
<a class="tag" href="/tag/love/page/1/">love</a>
</div>
</div>
<div class="quote" itemscope="" itemtype="http://schema.org/CreativeWork">
<span class="text" itemprop="text">“I have not failed. I've just found 10,000 ways that won't work.”</span>
<span>by <small class="author" itemprop="author">Thomas A. Edison</small>
<a href="/author/Thomas-A-Edison">(about)</a>
</span>
<div class="tags">
Tags:
<meta class="keywords" content="edison,failure,inspirational,paraphrased" itemprop="keywords"/>
<a class="tag" href="/tag/edison/page/1/">edison</a>
<a class="tag" href="/tag/failure/page/1/">failure</a>
<a class="tag" href="/tag/inspirational/page/1/">inspirational</a>
<a class="tag" href="/tag/paraphrased/page/1/">paraphrased</a>
</div>
</div>
<div class="quote" itemscope="" itemtype="http://schema.org/CreativeWork">
<span class="text" itemprop="text">“A woman is like a tea bag; you never know how strong it is until it's in hot water.”</span>
<span>by <small class="author" itemprop="author">Eleanor Roosevelt</small>
<a href="/author/Eleanor-Roosevelt">(about)</a>
</span>
<div class="tags">
Tags:
<meta class="keywords" content="misattributed-eleanor-roosevelt" itemprop="keywords"/>
<a class="tag" href="/tag/misattributed-eleanor-roosevelt/page/1/">misattributed-eleanor-roosevelt</a>
</div>
</div>
<div class="quote" itemscope="" itemtype="http://schema.org/CreativeWork">
<span class="text" itemprop="text">“A day without sunshine is like, you know, night.”</span>
<span>by <small class="author" itemprop="author">Steve Martin</small>
<a href="/author/Steve-Martin">(about)</a>
</span>
<div class="tags">
Tags:
<meta class="keywords" content="humor,obvious,simile" itemprop="keywords"/>
<a class="tag" href="/tag/humor/page/1/">humor</a>
<a class="tag" href="/tag/obvious/page/1/">obvious</a>
<a class="tag" href="/tag/simile/page/1/">simile</a>
</div>
</div>
<nav>
<ul class="pager">
<li class="next">
<a href="/page/2/">Next <span aria-hidden="true">→</span></a>
</li>
</ul>
</nav>
</div>
<div class="col-md-4 tags-box">
<h2>Top Ten tags</h2>
<span class="tag-item">
<a class="tag" href="/tag/love/" style="font-size: 28px">love</a>
</span>
<span class="tag-item">
<a class="tag" href="/tag/inspirational/" style="font-size: 26px">inspirational</a>
</span>
<span class="tag-item">
<a class="tag" href="/tag/life/" style="font-size: 26px">life</a>
</span>
<span class="tag-item">
<a class="tag" href="/tag/humor/" style="font-size: 24px">humor</a>
</span>
<span class="tag-item">
<a class="tag" href="/tag/books/" style="font-size: 22px">books</a>
</span>
<span class="tag-item">
<a class="tag" href="/tag/reading/" style="font-size: 14px">reading</a>
</span>
<span class="tag-item">
<a class="tag" href="/tag/friendship/" style="font-size: 10px">friendship</a>
</span>
<span class="tag-item">
<a class="tag" href="/tag/friends/" style="font-size: 8px">friends</a>
</span>
<span class="tag-item">
<a class="tag" href="/tag/truth/" style="font-size: 8px">truth</a>
</span>
<span class="tag-item">
<a class="tag" href="/tag/simile/" style="font-size: 6px">simile</a>
</span>
</div>
</div>
</div>
<footer class="footer">
<div class="container">
<p class="text-muted">
Quotes by: <a href="https://www.goodreads.com/quotes">GoodReads.com</a>
</p>
<p class="copyright">
Made with <span class="zyte">❤</span> by <a class="zyte" href="https://www.zyte.com">Zyte</a>
</p>
</div>
</footer>
</body>
</html>
Quote
The find_all() method retrieves every element that matches the given criteria. In the following code:
span: Refers to the <span> HTML element, an inline container commonly used to group and style content within a webpage.
The code searches for all <span> elements present in the HTML content.
text: Represents the class attribute of the targeted elements.
The argument class_='text' ensures that only elements with the class name 'text' are included in the search. The trailing underscore is needed because class is a reserved word in Python.
On this site, the 'text' class marks the quotes within the webpage.
quotes = soup.find_all('span', class_='text')
quotes
[<span class="text" itemprop="text">“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”</span>,
<span class="text" itemprop="text">“It is our choices, Harry, that show what we truly are, far more than our abilities.”</span>,
<span class="text" itemprop="text">“There are only two ways to live your life. One is as though nothing is a miracle. The other is as though everything is a miracle.”</span>,
<span class="text" itemprop="text">“The person, be it gentleman or lady, who has not pleasure in a good novel, must be intolerably stupid.”</span>,
<span class="text" itemprop="text">“Imperfection is beauty, madness is genius and it's better to be absolutely ridiculous than absolutely boring.”</span>,
<span class="text" itemprop="text">“Try not to become a man of success. Rather become a man of value.”</span>,
<span class="text" itemprop="text">“It is better to be hated for what you are than to be loved for what you are not.”</span>,
<span class="text" itemprop="text">“I have not failed. I've just found 10,000 ways that won't work.”</span>,
<span class="text" itemprop="text">“A woman is like a tea bag; you never know how strong it is until it's in hot water.”</span>,
<span class="text" itemprop="text">“A day without sunshine is like, you know, night.”</span>]
Each element in the quotes list is a Tag object, which represents an HTML element.
This Tag object has a text attribute that can be used to access the textual content contained within the tag.
quotes[0]
<span class="text" itemprop="text">“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”</span>
type(quotes[0])
bs4.element.Tag
quotes[0].text
'“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”'
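To collect the text of every match rather than just the first, a list comprehension over the result is usually enough. The sketch below runs on a hand-written snippet that imitates the page's markup; the sample quotes, author names, and variable names are placeholders, not data from the site.

```python
from bs4 import BeautifulSoup

# Hand-written snippet imitating the page's markup (illustrative only).
html_doc = """
<div class="quote">
  <span class="text">“First sample quote.”</span>
  <span>by <small class="author">Author One</small></span>
</div>
<div class="quote">
  <span class="text">“Second sample quote.”</span>
  <span>by <small class="author">Author Two</small></span>
</div>
"""

soup_demo = BeautifulSoup(html_doc, "html.parser")

# find_all returns a list-like collection of Tag objects ...
quote_tags = soup_demo.find_all("span", class_="text")

# ... and .text pulls the string content out of each one.
quote_texts = [tag.text for tag in quote_tags]

print(quote_texts)
```

The <span> elements that wrap the author names are skipped because they have no class of 'text'.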
Tag
a: Refers to the <a> HTML element, commonly used to create hyperlinks that navigate to other web pages or sections within the same page.
tag: Represents the class attribute of the targeted elements.
tags = soup.find_all('a', class_='tag')
tags
[<a class="tag" href="/tag/change/page/1/">change</a>,
<a class="tag" href="/tag/deep-thoughts/page/1/">deep-thoughts</a>,
<a class="tag" href="/tag/thinking/page/1/">thinking</a>,
<a class="tag" href="/tag/world/page/1/">world</a>,
<a class="tag" href="/tag/abilities/page/1/">abilities</a>,
<a class="tag" href="/tag/choices/page/1/">choices</a>,
<a class="tag" href="/tag/inspirational/page/1/">inspirational</a>,
<a class="tag" href="/tag/life/page/1/">life</a>,
<a class="tag" href="/tag/live/page/1/">live</a>,
<a class="tag" href="/tag/miracle/page/1/">miracle</a>,
<a class="tag" href="/tag/miracles/page/1/">miracles</a>,
<a class="tag" href="/tag/aliteracy/page/1/">aliteracy</a>,
<a class="tag" href="/tag/books/page/1/">books</a>,
<a class="tag" href="/tag/classic/page/1/">classic</a>,
<a class="tag" href="/tag/humor/page/1/">humor</a>,
<a class="tag" href="/tag/be-yourself/page/1/">be-yourself</a>,
<a class="tag" href="/tag/inspirational/page/1/">inspirational</a>,
<a class="tag" href="/tag/adulthood/page/1/">adulthood</a>,
<a class="tag" href="/tag/success/page/1/">success</a>,
<a class="tag" href="/tag/value/page/1/">value</a>,
<a class="tag" href="/tag/life/page/1/">life</a>,
<a class="tag" href="/tag/love/page/1/">love</a>,
<a class="tag" href="/tag/edison/page/1/">edison</a>,
<a class="tag" href="/tag/failure/page/1/">failure</a>,
<a class="tag" href="/tag/inspirational/page/1/">inspirational</a>,
<a class="tag" href="/tag/paraphrased/page/1/">paraphrased</a>,
<a class="tag" href="/tag/misattributed-eleanor-roosevelt/page/1/">misattributed-eleanor-roosevelt</a>,
<a class="tag" href="/tag/humor/page/1/">humor</a>,
<a class="tag" href="/tag/obvious/page/1/">obvious</a>,
<a class="tag" href="/tag/simile/page/1/">simile</a>,
<a class="tag" href="/tag/love/" style="font-size: 28px">love</a>,
<a class="tag" href="/tag/inspirational/" style="font-size: 26px">inspirational</a>,
<a class="tag" href="/tag/life/" style="font-size: 26px">life</a>,
<a class="tag" href="/tag/humor/" style="font-size: 24px">humor</a>,
<a class="tag" href="/tag/books/" style="font-size: 22px">books</a>,
<a class="tag" href="/tag/reading/" style="font-size: 14px">reading</a>,
<a class="tag" href="/tag/friendship/" style="font-size: 10px">friendship</a>,
<a class="tag" href="/tag/friends/" style="font-size: 8px">friends</a>,
<a class="tag" href="/tag/truth/" style="font-size: 8px">truth</a>,
<a class="tag" href="/tag/simile/" style="font-size: 6px">simile</a>]
tags[0]
<a class="tag" href="/tag/change/page/1/">change</a>
type(tags[0])
bs4.element.Tag
tags[0].text
'change'
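Note that the list above ends with ten entries from the "Top Ten tags" sidebar, because those links also carry the class 'tag'. If only the tags attached to each quote are wanted, one option is to search inside each div with class 'quote' rather than the whole document. The sketch below runs on a hand-written snippet that imitates the page's markup; the snippet and variable names are illustrative assumptions.

```python
from bs4 import BeautifulSoup

# Hand-written snippet: one quote block plus a sidebar link (illustrative only).
html_doc = """
<div class="quote">
  <span class="text">“Sample quote.”</span>
  <div class="tags">
    <a class="tag" href="/tag/life/page/1/">life</a>
    <a class="tag" href="/tag/love/page/1/">love</a>
  </div>
</div>
<div class="col-md-4 tags-box">
  <span class="tag-item"><a class="tag" href="/tag/love/">love</a></span>
</div>
"""

soup_demo = BeautifulSoup(html_doc, "html.parser")

# Searching the whole document also picks up the sidebar link.
all_tags = [a.text for a in soup_demo.find_all("a", class_="tag")]

# Searching within each quote block keeps only that quote's tags.
tags_per_quote = []
for quote in soup_demo.find_all("div", class_="quote"):
    links = quote.find_all("a", class_="tag")
    # Attribute values such as href are read with dictionary-style access.
    tags_per_quote.append([(a.text, a["href"]) for a in links])

print(all_tags)
print(tags_per_quote)
```

Calling find_all() on a Tag restricts the search to that element's descendants, which is why the sidebar entry drops out of the second result.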
requests Library
Alternatively, we can use the requests library instead of urllib.request.
import requests
from bs4 import BeautifulSoup
url = 'http://quotes.toscrape.com/'
response = requests.get(url)
The response status can be checked using the status_code attribute. Some common values are:
200: Success – No issue reaching the website.
404: Not Found – Resource not found.
500: Internal Server Error – Server issue.
403: Forbidden – Access denied.
400: Bad Request – Invalid request.
response.status_code
200
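The codes listed above are standard HTTP status codes, and Python's standard library ships a table of them. The sketch below looks up their reason phrases and adds a small helper, is_success, which is a hypothetical name introduced here for illustration.

```python
from http import HTTPStatus

# The status codes listed above, looked up in the standard library's table.
codes = [200, 404, 500, 403, 400]
phrases = {code: HTTPStatus(code).phrase for code in codes}

for code, phrase in phrases.items():
    print(code, phrase)


def is_success(status_code: int) -> bool:
    """Return True for 2xx codes, the range that signals success."""
    return 200 <= status_code < 300


print(is_success(200), is_success(404))
```

In a scraper, a check like is_success(response.status_code) can guard the parsing step. The requests library also provides response.raise_for_status(), which raises an exception for 4xx and 5xx responses.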
response.content is a bytes object that contains the HTML content of the website.
response.content
b'<!DOCTYPE html>\n<html lang="en">\n<head>\n\t<meta charset="UTF-8">\n\t<title>Quotes to Scrape</title>\n <link rel="stylesheet" href="/static/bootstrap.min.css">\n <link rel="stylesheet" href="/static/main.css">\n \n \n</head>\n<body>\n <div class="container">\n <div class="row header-box">\n <div class="col-md-8">\n <h1>\n <a href="/" style="text-decoration: none">Quotes to Scrape</a>\n </h1>\n </div>\n <div class="col-md-4">\n <p>\n \n <a href="/login">Login</a>\n \n </p>\n </div>\n </div>\n \n\n<div class="row">\n <div class="col-md-8">\n\n <div class="quote" itemscope itemtype="http://schema.org/CreativeWork">\n <span class="text" itemprop="text">\xe2\x80\x9cThe world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.\xe2\x80\x9d</span>\n <span>by <small class="author" itemprop="author">Albert Einstein</small>\n <a href="/author/Albert-Einstein">(about)</a>\n </span>\n <div class="tags">\n Tags:\n <meta class="keywords" itemprop="keywords" content="change,deep-thoughts,thinking,world" / > \n \n <a class="tag" href="/tag/change/page/1/">change</a>\n \n <a class="tag" href="/tag/deep-thoughts/page/1/">deep-thoughts</a>\n \n <a class="tag" href="/tag/thinking/page/1/">thinking</a>\n \n <a class="tag" href="/tag/world/page/1/">world</a>\n \n </div>\n </div>\n\n <div class="quote" itemscope itemtype="http://schema.org/CreativeWork">\n <span class="text" itemprop="text">\xe2\x80\x9cIt is our choices, Harry, that show what we truly are, far more than our abilities.\xe2\x80\x9d</span>\n <span>by <small class="author" itemprop="author">J.K. 
Rowling</small>\n <a href="/author/J-K-Rowling">(about)</a>\n </span>\n <div class="tags">\n Tags:\n <meta class="keywords" itemprop="keywords" content="abilities,choices" / > \n \n <a class="tag" href="/tag/abilities/page/1/">abilities</a>\n \n <a class="tag" href="/tag/choices/page/1/">choices</a>\n \n </div>\n </div>\n\n <div class="quote" itemscope itemtype="http://schema.org/CreativeWork">\n <span class="text" itemprop="text">\xe2\x80\x9cThere are only two ways to live your life. One is as though nothing is a miracle. The other is as though everything is a miracle.\xe2\x80\x9d</span>\n <span>by <small class="author" itemprop="author">Albert Einstein</small>\n <a href="/author/Albert-Einstein">(about)</a>\n </span>\n <div class="tags">\n Tags:\n <meta class="keywords" itemprop="keywords" content="inspirational,life,live,miracle,miracles" / > \n \n <a class="tag" href="/tag/inspirational/page/1/">inspirational</a>\n \n <a class="tag" href="/tag/life/page/1/">life</a>\n \n <a class="tag" href="/tag/live/page/1/">live</a>\n \n <a class="tag" href="/tag/miracle/page/1/">miracle</a>\n \n <a class="tag" href="/tag/miracles/page/1/">miracles</a>\n \n </div>\n </div>\n\n <div class="quote" itemscope itemtype="http://schema.org/CreativeWork">\n <span class="text" itemprop="text">\xe2\x80\x9cThe person, be it gentleman or lady, who has not pleasure in a good novel, must be intolerably stupid.\xe2\x80\x9d</span>\n <span>by <small class="author" itemprop="author">Jane Austen</small>\n <a href="/author/Jane-Austen">(about)</a>\n </span>\n <div class="tags">\n Tags:\n <meta class="keywords" itemprop="keywords" content="aliteracy,books,classic,humor" / > \n \n <a class="tag" href="/tag/aliteracy/page/1/">aliteracy</a>\n \n <a class="tag" href="/tag/books/page/1/">books</a>\n \n <a class="tag" href="/tag/classic/page/1/">classic</a>\n \n <a class="tag" href="/tag/humor/page/1/">humor</a>\n \n </div>\n </div>\n\n <div class="quote" itemscope 
itemtype="http://schema.org/CreativeWork">\n <span class="text" itemprop="text">\xe2\x80\x9cImperfection is beauty, madness is genius and it's better to be absolutely ridiculous than absolutely boring.\xe2\x80\x9d</span>\n <span>by <small class="author" itemprop="author">Marilyn Monroe</small>\n <a href="/author/Marilyn-Monroe">(about)</a>\n </span>\n <div class="tags">\n Tags:\n <meta class="keywords" itemprop="keywords" content="be-yourself,inspirational" / > \n \n <a class="tag" href="/tag/be-yourself/page/1/">be-yourself</a>\n \n <a class="tag" href="/tag/inspirational/page/1/">inspirational</a>\n \n </div>\n </div>\n\n <div class="quote" itemscope itemtype="http://schema.org/CreativeWork">\n <span class="text" itemprop="text">\xe2\x80\x9cTry not to become a man of success. Rather become a man of value.\xe2\x80\x9d</span>\n <span>by <small class="author" itemprop="author">Albert Einstein</small>\n <a href="/author/Albert-Einstein">(about)</a>\n </span>\n <div class="tags">\n Tags:\n <meta class="keywords" itemprop="keywords" content="adulthood,success,value" / > \n \n <a class="tag" href="/tag/adulthood/page/1/">adulthood</a>\n \n <a class="tag" href="/tag/success/page/1/">success</a>\n \n <a class="tag" href="/tag/value/page/1/">value</a>\n \n </div>\n </div>\n\n <div class="quote" itemscope itemtype="http://schema.org/CreativeWork">\n <span class="text" itemprop="text">\xe2\x80\x9cIt is better to be hated for what you are than to be loved for what you are not.\xe2\x80\x9d</span>\n <span>by <small class="author" itemprop="author">Andr\xc3\xa9 Gide</small>\n <a href="/author/Andre-Gide">(about)</a>\n </span>\n <div class="tags">\n Tags:\n <meta class="keywords" itemprop="keywords" content="life,love" / > \n \n <a class="tag" href="/tag/life/page/1/">life</a>\n \n <a class="tag" href="/tag/love/page/1/">love</a>\n \n </div>\n </div>\n\n <div class="quote" itemscope itemtype="http://schema.org/CreativeWork">\n <span class="text" itemprop="text">\xe2\x80\x9cI have 
not failed. I've just found 10,000 ways that won't work.\xe2\x80\x9d</span>\n <span>by <small class="author" itemprop="author">Thomas A. Edison</small>\n <a href="/author/Thomas-A-Edison">(about)</a>\n </span>\n <div class="tags">\n Tags:\n <meta class="keywords" itemprop="keywords" content="edison,failure,inspirational,paraphrased" / > \n \n <a class="tag" href="/tag/edison/page/1/">edison</a>\n \n <a class="tag" href="/tag/failure/page/1/">failure</a>\n \n <a class="tag" href="/tag/inspirational/page/1/">inspirational</a>\n \n <a class="tag" href="/tag/paraphrased/page/1/">paraphrased</a>\n \n </div>\n </div>\n\n <div class="quote" itemscope itemtype="http://schema.org/CreativeWork">\n <span class="text" itemprop="text">\xe2\x80\x9cA woman is like a tea bag; you never know how strong it is until it's in hot water.\xe2\x80\x9d</span>\n <span>by <small class="author" itemprop="author">Eleanor Roosevelt</small>\n <a href="/author/Eleanor-Roosevelt">(about)</a>\n </span>\n <div class="tags">\n Tags:\n <meta class="keywords" itemprop="keywords" content="misattributed-eleanor-roosevelt" / > \n \n <a class="tag" href="/tag/misattributed-eleanor-roosevelt/page/1/">misattributed-eleanor-roosevelt</a>\n \n </div>\n </div>\n\n <div class="quote" itemscope itemtype="http://schema.org/CreativeWork">\n <span class="text" itemprop="text">\xe2\x80\x9cA day without sunshine is like, you know, night.\xe2\x80\x9d</span>\n <span>by <small class="author" itemprop="author">Steve Martin</small>\n <a href="/author/Steve-Martin">(about)</a>\n </span>\n <div class="tags">\n Tags:\n <meta class="keywords" itemprop="keywords" content="humor,obvious,simile" / > \n \n <a class="tag" href="/tag/humor/page/1/">humor</a>\n \n <a class="tag" href="/tag/obvious/page/1/">obvious</a>\n \n <a class="tag" href="/tag/simile/page/1/">simile</a>\n \n </div>\n </div>\n\n <nav>\n <ul class="pager">\n \n \n <li class="next">\n <a href="/page/2/">Next <span aria-hidden="true">→</span></a>\n </li>\n \n 
</ul>\n </nav>\n </div>\n <div class="col-md-4 tags-box">\n \n <h2>Top Ten tags</h2>\n \n <span class="tag-item">\n <a class="tag" style="font-size: 28px" href="/tag/love/">love</a>\n </span>\n \n <span class="tag-item">\n <a class="tag" style="font-size: 26px" href="/tag/inspirational/">inspirational</a>\n </span>\n \n <span class="tag-item">\n <a class="tag" style="font-size: 26px" href="/tag/life/">life</a>\n </span>\n \n <span class="tag-item">\n <a class="tag" style="font-size: 24px" href="/tag/humor/">humor</a>\n </span>\n \n <span class="tag-item">\n <a class="tag" style="font-size: 22px" href="/tag/books/">books</a>\n </span>\n \n <span class="tag-item">\n <a class="tag" style="font-size: 14px" href="/tag/reading/">reading</a>\n </span>\n \n <span class="tag-item">\n <a class="tag" style="font-size: 10px" href="/tag/friendship/">friendship</a>\n </span>\n \n <span class="tag-item">\n <a class="tag" style="font-size: 8px" href="/tag/friends/">friends</a>\n </span>\n \n <span class="tag-item">\n <a class="tag" style="font-size: 8px" href="/tag/truth/">truth</a>\n </span>\n \n <span class="tag-item">\n <a class="tag" style="font-size: 6px" href="/tag/simile/">simile</a>\n </span>\n \n \n </div>\n</div>\n\n </div>\n <footer class="footer">\n <div class="container">\n <p class="text-muted">\n Quotes by: <a href="https://www.goodreads.com/quotes">GoodReads.com</a>\n </p>\n <p class="copyright">\n Made with <span class=\'zyte\'>\xe2\x9d\xa4</span> by <a class=\'zyte\' href="https://www.zyte.com">Zyte</a>\n </p>\n </div>\n </footer>\n</body>\n</html>'
The BeautifulSoup constructor processes the raw HTML content in response.content into a structured object.
soup = BeautifulSoup(response.content, 'html.parser')
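Unlike the urllib example, no explicit decode() call is needed here, because BeautifulSoup accepts bytes as well as strings and works out the encoding itself. The sketch below compares both inputs using hand-written UTF-8 bytes (an assumption standing in for response.content) that declare their charset the same way the real page does.

```python
from bs4 import BeautifulSoup

# Hand-written UTF-8 bytes standing in for response.content (illustrative).
content_bytes = (
    b'<html><head><meta charset="UTF-8"></head>'
    b'<body><small class="author">Andr\xc3\xa9 Gide</small></body></html>'
)

# Parse the bytes directly, and parse the decoded string for comparison.
soup_from_bytes = BeautifulSoup(content_bytes, "html.parser")
soup_from_text = BeautifulSoup(content_bytes.decode("utf-8"), "html.parser")

author_from_bytes = soup_from_bytes.find("small", class_="author").text
author_from_text = soup_from_text.find("small", class_="author").text

print(author_from_bytes, author_from_text)
```

Both routes give the same author name with the accented character intact. The requests library also exposes response.text, which holds the body already decoded to a string.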
soup
<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="utf-8"/>
<title>Quotes to Scrape</title>
<link href="/static/bootstrap.min.css" rel="stylesheet"/>
<link href="/static/main.css" rel="stylesheet"/>
</head>
<body>
<div class="container">
<div class="row header-box">
<div class="col-md-8">
<h1>
<a href="/" style="text-decoration: none">Quotes to Scrape</a>
</h1>
</div>
<div class="col-md-4">
<p>
<a href="/login">Login</a>
</p>
</div>
</div>
<div class="row">
<div class="col-md-8">
<div class="quote" itemscope="" itemtype="http://schema.org/CreativeWork">
<span class="text" itemprop="text">“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”</span>
<span>by <small class="author" itemprop="author">Albert Einstein</small>
<a href="/author/Albert-Einstein">(about)</a>
</span>
<div class="tags">
Tags:
<meta class="keywords" content="change,deep-thoughts,thinking,world" itemprop="keywords"/>
<a class="tag" href="/tag/change/page/1/">change</a>
<a class="tag" href="/tag/deep-thoughts/page/1/">deep-thoughts</a>
<a class="tag" href="/tag/thinking/page/1/">thinking</a>
<a class="tag" href="/tag/world/page/1/">world</a>
</div>
</div>
<div class="quote" itemscope="" itemtype="http://schema.org/CreativeWork">
<span class="text" itemprop="text">“It is our choices, Harry, that show what we truly are, far more than our abilities.”</span>
<span>by <small class="author" itemprop="author">J.K. Rowling</small>
<a href="/author/J-K-Rowling">(about)</a>
</span>
<div class="tags">
Tags:
<meta class="keywords" content="abilities,choices" itemprop="keywords"/>
<a class="tag" href="/tag/abilities/page/1/">abilities</a>
<a class="tag" href="/tag/choices/page/1/">choices</a>
</div>
</div>
<div class="quote" itemscope="" itemtype="http://schema.org/CreativeWork">
<span class="text" itemprop="text">“There are only two ways to live your life. One is as though nothing is a miracle. The other is as though everything is a miracle.”</span>
<span>by <small class="author" itemprop="author">Albert Einstein</small>
<a href="/author/Albert-Einstein">(about)</a>
</span>
<div class="tags">
Tags:
<meta class="keywords" content="inspirational,life,live,miracle,miracles" itemprop="keywords"/>
<a class="tag" href="/tag/inspirational/page/1/">inspirational</a>
<a class="tag" href="/tag/life/page/1/">life</a>
<a class="tag" href="/tag/live/page/1/">live</a>
<a class="tag" href="/tag/miracle/page/1/">miracle</a>
<a class="tag" href="/tag/miracles/page/1/">miracles</a>
</div>
</div>
<div class="quote" itemscope="" itemtype="http://schema.org/CreativeWork">
<span class="text" itemprop="text">“The person, be it gentleman or lady, who has not pleasure in a good novel, must be intolerably stupid.”</span>
<span>by <small class="author" itemprop="author">Jane Austen</small>
<a href="/author/Jane-Austen">(about)</a>
</span>
<div class="tags">
Tags:
<meta class="keywords" content="aliteracy,books,classic,humor" itemprop="keywords"/>
<a class="tag" href="/tag/aliteracy/page/1/">aliteracy</a>
<a class="tag" href="/tag/books/page/1/">books</a>
<a class="tag" href="/tag/classic/page/1/">classic</a>
<a class="tag" href="/tag/humor/page/1/">humor</a>
</div>
</div>
<div class="quote" itemscope="" itemtype="http://schema.org/CreativeWork">
<span class="text" itemprop="text">“Imperfection is beauty, madness is genius and it's better to be absolutely ridiculous than absolutely boring.”</span>
<span>by <small class="author" itemprop="author">Marilyn Monroe</small>
<a href="/author/Marilyn-Monroe">(about)</a>
</span>
<div class="tags">
Tags:
<meta class="keywords" content="be-yourself,inspirational" itemprop="keywords"/>
<a class="tag" href="/tag/be-yourself/page/1/">be-yourself</a>
<a class="tag" href="/tag/inspirational/page/1/">inspirational</a>
</div>
</div>
<div class="quote" itemscope="" itemtype="http://schema.org/CreativeWork">
<span class="text" itemprop="text">“Try not to become a man of success. Rather become a man of value.”</span>
<span>by <small class="author" itemprop="author">Albert Einstein</small>
<a href="/author/Albert-Einstein">(about)</a>
</span>
<div class="tags">
Tags:
<meta class="keywords" content="adulthood,success,value" itemprop="keywords"/>
<a class="tag" href="/tag/adulthood/page/1/">adulthood</a>
<a class="tag" href="/tag/success/page/1/">success</a>
<a class="tag" href="/tag/value/page/1/">value</a>
</div>
</div>
<div class="quote" itemscope="" itemtype="http://schema.org/CreativeWork">
<span class="text" itemprop="text">“It is better to be hated for what you are than to be loved for what you are not.”</span>
<span>by <small class="author" itemprop="author">André Gide</small>
<a href="/author/Andre-Gide">(about)</a>
</span>
<div class="tags">
Tags:
<meta class="keywords" content="life,love" itemprop="keywords"/>
<a class="tag" href="/tag/life/page/1/">life</a>
<a class="tag" href="/tag/love/page/1/">love</a>
</div>
</div>
<div class="quote" itemscope="" itemtype="http://schema.org/CreativeWork">
<span class="text" itemprop="text">“I have not failed. I've just found 10,000 ways that won't work.”</span>
<span>by <small class="author" itemprop="author">Thomas A. Edison</small>
<a href="/author/Thomas-A-Edison">(about)</a>
</span>
<div class="tags">
Tags:
<meta class="keywords" content="edison,failure,inspirational,paraphrased" itemprop="keywords"/>
<a class="tag" href="/tag/edison/page/1/">edison</a>
<a class="tag" href="/tag/failure/page/1/">failure</a>
<a class="tag" href="/tag/inspirational/page/1/">inspirational</a>
<a class="tag" href="/tag/paraphrased/page/1/">paraphrased</a>
</div>
</div>
<div class="quote" itemscope="" itemtype="http://schema.org/CreativeWork">
<span class="text" itemprop="text">“A woman is like a tea bag; you never know how strong it is until it's in hot water.”</span>
<span>by <small class="author" itemprop="author">Eleanor Roosevelt</small>
<a href="/author/Eleanor-Roosevelt">(about)</a>
</span>
<div class="tags">
Tags:
<meta class="keywords" content="misattributed-eleanor-roosevelt" itemprop="keywords"/>
<a class="tag" href="/tag/misattributed-eleanor-roosevelt/page/1/">misattributed-eleanor-roosevelt</a>
</div>
</div>
<div class="quote" itemscope="" itemtype="http://schema.org/CreativeWork">
<span class="text" itemprop="text">“A day without sunshine is like, you know, night.”</span>
<span>by <small class="author" itemprop="author">Steve Martin</small>
<a href="/author/Steve-Martin">(about)</a>
</span>
<div class="tags">
Tags:
<meta class="keywords" content="humor,obvious,simile" itemprop="keywords"/>
<a class="tag" href="/tag/humor/page/1/">humor</a>
<a class="tag" href="/tag/obvious/page/1/">obvious</a>
<a class="tag" href="/tag/simile/page/1/">simile</a>
</div>
</div>
<nav>
<ul class="pager">
<li class="next">
<a href="/page/2/">Next <span aria-hidden="true">→</span></a>
</li>
</ul>
</nav>
</div>
<div class="col-md-4 tags-box">
<h2>Top Ten tags</h2>
<span class="tag-item">
<a class="tag" href="/tag/love/" style="font-size: 28px">love</a>
</span>
<span class="tag-item">
<a class="tag" href="/tag/inspirational/" style="font-size: 26px">inspirational</a>
</span>
<span class="tag-item">
<a class="tag" href="/tag/life/" style="font-size: 26px">life</a>
</span>
<span class="tag-item">
<a class="tag" href="/tag/humor/" style="font-size: 24px">humor</a>
</span>
<span class="tag-item">
<a class="tag" href="/tag/books/" style="font-size: 22px">books</a>
</span>
<span class="tag-item">
<a class="tag" href="/tag/reading/" style="font-size: 14px">reading</a>
</span>
<span class="tag-item">
<a class="tag" href="/tag/friendship/" style="font-size: 10px">friendship</a>
</span>
<span class="tag-item">
<a class="tag" href="/tag/friends/" style="font-size: 8px">friends</a>
</span>
<span class="tag-item">
<a class="tag" href="/tag/truth/" style="font-size: 8px">truth</a>
</span>
<span class="tag-item">
<a class="tag" href="/tag/simile/" style="font-size: 6px">simile</a>
</span>
</div>
</div>
</div>
<footer class="footer">
<div class="container">
<p class="text-muted">
Quotes by: <a href="https://www.goodreads.com/quotes">GoodReads.com</a>
</p>
<p class="copyright">
Made with <span class="zyte">❤</span> by <a class="zyte" href="https://www.zyte.com">Zyte</a>
</p>
</div>
</footer>
</body>
</html>
Now, we will parse the quotes, authors, and tags using a different approach.
div: Refers to the <div> HTML element, commonly used to group and organize content within a webpage.
Passing 'div' as the first argument restricts the search to <div> elements in the HTML content.
quote: The value of the class attribute of the targeted <div> elements.
The argument class_='quote' ensures that only <div> elements with the class name 'quote' are returned. The keyword is spelled class_ with a trailing underscore because class is a reserved word in Python.
quotes = soup.find_all('div', class_='quote')
quotes
[<div class="quote" itemscope="" itemtype="http://schema.org/CreativeWork">
<span class="text" itemprop="text">“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”</span>
<span>by <small class="author" itemprop="author">Albert Einstein</small>
<a href="/author/Albert-Einstein">(about)</a>
</span>
<div class="tags">
Tags:
<meta class="keywords" content="change,deep-thoughts,thinking,world" itemprop="keywords"/>
<a class="tag" href="/tag/change/page/1/">change</a>
<a class="tag" href="/tag/deep-thoughts/page/1/">deep-thoughts</a>
<a class="tag" href="/tag/thinking/page/1/">thinking</a>
<a class="tag" href="/tag/world/page/1/">world</a>
</div>
</div>,
<div class="quote" itemscope="" itemtype="http://schema.org/CreativeWork">
<span class="text" itemprop="text">“It is our choices, Harry, that show what we truly are, far more than our abilities.”</span>
<span>by <small class="author" itemprop="author">J.K. Rowling</small>
<a href="/author/J-K-Rowling">(about)</a>
</span>
<div class="tags">
Tags:
<meta class="keywords" content="abilities,choices" itemprop="keywords"/>
<a class="tag" href="/tag/abilities/page/1/">abilities</a>
<a class="tag" href="/tag/choices/page/1/">choices</a>
</div>
</div>,
<div class="quote" itemscope="" itemtype="http://schema.org/CreativeWork">
<span class="text" itemprop="text">“There are only two ways to live your life. One is as though nothing is a miracle. The other is as though everything is a miracle.”</span>
<span>by <small class="author" itemprop="author">Albert Einstein</small>
<a href="/author/Albert-Einstein">(about)</a>
</span>
<div class="tags">
Tags:
<meta class="keywords" content="inspirational,life,live,miracle,miracles" itemprop="keywords"/>
<a class="tag" href="/tag/inspirational/page/1/">inspirational</a>
<a class="tag" href="/tag/life/page/1/">life</a>
<a class="tag" href="/tag/live/page/1/">live</a>
<a class="tag" href="/tag/miracle/page/1/">miracle</a>
<a class="tag" href="/tag/miracles/page/1/">miracles</a>
</div>
</div>,
<div class="quote" itemscope="" itemtype="http://schema.org/CreativeWork">
<span class="text" itemprop="text">“The person, be it gentleman or lady, who has not pleasure in a good novel, must be intolerably stupid.”</span>
<span>by <small class="author" itemprop="author">Jane Austen</small>
<a href="/author/Jane-Austen">(about)</a>
</span>
<div class="tags">
Tags:
<meta class="keywords" content="aliteracy,books,classic,humor" itemprop="keywords"/>
<a class="tag" href="/tag/aliteracy/page/1/">aliteracy</a>
<a class="tag" href="/tag/books/page/1/">books</a>
<a class="tag" href="/tag/classic/page/1/">classic</a>
<a class="tag" href="/tag/humor/page/1/">humor</a>
</div>
</div>,
<div class="quote" itemscope="" itemtype="http://schema.org/CreativeWork">
<span class="text" itemprop="text">“Imperfection is beauty, madness is genius and it's better to be absolutely ridiculous than absolutely boring.”</span>
<span>by <small class="author" itemprop="author">Marilyn Monroe</small>
<a href="/author/Marilyn-Monroe">(about)</a>
</span>
<div class="tags">
Tags:
<meta class="keywords" content="be-yourself,inspirational" itemprop="keywords"/>
<a class="tag" href="/tag/be-yourself/page/1/">be-yourself</a>
<a class="tag" href="/tag/inspirational/page/1/">inspirational</a>
</div>
</div>,
<div class="quote" itemscope="" itemtype="http://schema.org/CreativeWork">
<span class="text" itemprop="text">“Try not to become a man of success. Rather become a man of value.”</span>
<span>by <small class="author" itemprop="author">Albert Einstein</small>
<a href="/author/Albert-Einstein">(about)</a>
</span>
<div class="tags">
Tags:
<meta class="keywords" content="adulthood,success,value" itemprop="keywords"/>
<a class="tag" href="/tag/adulthood/page/1/">adulthood</a>
<a class="tag" href="/tag/success/page/1/">success</a>
<a class="tag" href="/tag/value/page/1/">value</a>
</div>
</div>,
<div class="quote" itemscope="" itemtype="http://schema.org/CreativeWork">
<span class="text" itemprop="text">“It is better to be hated for what you are than to be loved for what you are not.”</span>
<span>by <small class="author" itemprop="author">André Gide</small>
<a href="/author/Andre-Gide">(about)</a>
</span>
<div class="tags">
Tags:
<meta class="keywords" content="life,love" itemprop="keywords"/>
<a class="tag" href="/tag/life/page/1/">life</a>
<a class="tag" href="/tag/love/page/1/">love</a>
</div>
</div>,
<div class="quote" itemscope="" itemtype="http://schema.org/CreativeWork">
<span class="text" itemprop="text">“I have not failed. I've just found 10,000 ways that won't work.”</span>
<span>by <small class="author" itemprop="author">Thomas A. Edison</small>
<a href="/author/Thomas-A-Edison">(about)</a>
</span>
<div class="tags">
Tags:
<meta class="keywords" content="edison,failure,inspirational,paraphrased" itemprop="keywords"/>
<a class="tag" href="/tag/edison/page/1/">edison</a>
<a class="tag" href="/tag/failure/page/1/">failure</a>
<a class="tag" href="/tag/inspirational/page/1/">inspirational</a>
<a class="tag" href="/tag/paraphrased/page/1/">paraphrased</a>
</div>
</div>,
<div class="quote" itemscope="" itemtype="http://schema.org/CreativeWork">
<span class="text" itemprop="text">“A woman is like a tea bag; you never know how strong it is until it's in hot water.”</span>
<span>by <small class="author" itemprop="author">Eleanor Roosevelt</small>
<a href="/author/Eleanor-Roosevelt">(about)</a>
</span>
<div class="tags">
Tags:
<meta class="keywords" content="misattributed-eleanor-roosevelt" itemprop="keywords"/>
<a class="tag" href="/tag/misattributed-eleanor-roosevelt/page/1/">misattributed-eleanor-roosevelt</a>
</div>
</div>,
<div class="quote" itemscope="" itemtype="http://schema.org/CreativeWork">
<span class="text" itemprop="text">“A day without sunshine is like, you know, night.”</span>
<span>by <small class="author" itemprop="author">Steve Martin</small>
<a href="/author/Steve-Martin">(about)</a>
</span>
<div class="tags">
Tags:
<meta class="keywords" content="humor,obvious,simile" itemprop="keywords"/>
<a class="tag" href="/tag/humor/page/1/">humor</a>
<a class="tag" href="/tag/obvious/page/1/">obvious</a>
<a class="tag" href="/tag/simile/page/1/">simile</a>
</div>
</div>]
quotes[0]
<div class="quote" itemscope="" itemtype="http://schema.org/CreativeWork">
<span class="text" itemprop="text">“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”</span>
<span>by <small class="author" itemprop="author">Albert Einstein</small>
<a href="/author/Albert-Einstein">(about)</a>
</span>
<div class="tags">
Tags:
<meta class="keywords" content="change,deep-thoughts,thinking,world" itemprop="keywords"/>
<a class="tag" href="/tag/change/page/1/">change</a>
<a class="tag" href="/tag/deep-thoughts/page/1/">deep-thoughts</a>
<a class="tag" href="/tag/thinking/page/1/">thinking</a>
<a class="tag" href="/tag/world/page/1/">world</a>
</div>
</div>
Quote#
quotes[0].find('span')
<span class="text" itemprop="text">“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”</span>
quotes[0].find('span').text
'“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”'
Author#
quotes[0].find('small')
<small class="author" itemprop="author">Albert Einstein</small>
quotes[0].find('small').text
'Albert Einstein'
Tag#
quotes[0].find_all('a')
[<a href="/author/Albert-Einstein">(about)</a>,
<a class="tag" href="/tag/change/page/1/">change</a>,
<a class="tag" href="/tag/deep-thoughts/page/1/">deep-thoughts</a>,
<a class="tag" href="/tag/thinking/page/1/">thinking</a>,
<a class="tag" href="/tag/world/page/1/">world</a>]
quotes[0].find_all('a')[0].text
'(about)'
We can extract all tags using a list comprehension:
tags = [tag.text for tag in quotes[0].find_all('a', class_='tag')]
tags
['change', 'deep-thoughts', 'thinking', 'world']
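The same three lookups can be applied to every quote block rather than only quotes[0]. The following is a minimal sketch, not the only way to do it: the html string is a trimmed, hypothetical copy of two quote blocks from the page, and the names sample_soup and records are introduced here for illustration. It selects by class (class_='text', class_='author') so that it does not depend on the order of the <span> elements.

```python
from bs4 import BeautifulSoup

# Hypothetical sample: a trimmed copy of two quote blocks from the page above.
html = """
<div class="quote">
<span class="text">“Try not to become a man of success. Rather become a man of value.”</span>
<span>by <small class="author">Albert Einstein</small>
<a href="/author/Albert-Einstein">(about)</a>
</span>
<div class="tags">
<a class="tag" href="/tag/adulthood/page/1/">adulthood</a>
<a class="tag" href="/tag/success/page/1/">success</a>
<a class="tag" href="/tag/value/page/1/">value</a>
</div>
</div>
<div class="quote">
<span class="text">“A day without sunshine is like, you know, night.”</span>
<span>by <small class="author">Steve Martin</small>
<a href="/author/Steve-Martin">(about)</a>
</span>
<div class="tags">
<a class="tag" href="/tag/humor/page/1/">humor</a>
<a class="tag" href="/tag/obvious/page/1/">obvious</a>
<a class="tag" href="/tag/simile/page/1/">simile</a>
</div>
</div>
"""

sample_soup = BeautifulSoup(html, 'html.parser')

# Build one dictionary per quote block.
records = []
for quote in sample_soup.find_all('div', class_='quote'):
    records.append({
        'quote': quote.find('span', class_='text').text,
        'author': quote.find('small', class_='author').text,
        'tags': [tag.text for tag in quote.find_all('a', class_='tag')],
    })

for record in records:
    print(record['author'], record['tags'])
```

Replacing sample_soup with the soup object created earlier should yield one record per quote on the page.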
Regular Expressions#
In this section, we will extract all links in the href attributes that start with /tag.
The href attribute of an <a> element holds the link target, which may be another website, another page on the same site, or a specific section within a page.
We will use regular expressions to accomplish this task.
First, we create a BeautifulSoup object using the HTML content.
import requests
from bs4 import BeautifulSoup
url = 'http://quotes.toscrape.com/'
response = requests.get(url)
soup = BeautifulSoup(response.content, 'html.parser')
All <a> HTML elements:
a_list = soup.find_all('a')
a_list
[<a href="/" style="text-decoration: none">Quotes to Scrape</a>,
<a href="/login">Login</a>,
<a href="/author/Albert-Einstein">(about)</a>,
<a class="tag" href="/tag/change/page/1/">change</a>,
<a class="tag" href="/tag/deep-thoughts/page/1/">deep-thoughts</a>,
<a class="tag" href="/tag/thinking/page/1/">thinking</a>,
<a class="tag" href="/tag/world/page/1/">world</a>,
<a href="/author/J-K-Rowling">(about)</a>,
<a class="tag" href="/tag/abilities/page/1/">abilities</a>,
<a class="tag" href="/tag/choices/page/1/">choices</a>,
<a href="/author/Albert-Einstein">(about)</a>,
<a class="tag" href="/tag/inspirational/page/1/">inspirational</a>,
<a class="tag" href="/tag/life/page/1/">life</a>,
<a class="tag" href="/tag/live/page/1/">live</a>,
<a class="tag" href="/tag/miracle/page/1/">miracle</a>,
<a class="tag" href="/tag/miracles/page/1/">miracles</a>,
<a href="/author/Jane-Austen">(about)</a>,
<a class="tag" href="/tag/aliteracy/page/1/">aliteracy</a>,
<a class="tag" href="/tag/books/page/1/">books</a>,
<a class="tag" href="/tag/classic/page/1/">classic</a>,
<a class="tag" href="/tag/humor/page/1/">humor</a>,
<a href="/author/Marilyn-Monroe">(about)</a>,
<a class="tag" href="/tag/be-yourself/page/1/">be-yourself</a>,
<a class="tag" href="/tag/inspirational/page/1/">inspirational</a>,
<a href="/author/Albert-Einstein">(about)</a>,
<a class="tag" href="/tag/adulthood/page/1/">adulthood</a>,
<a class="tag" href="/tag/success/page/1/">success</a>,
<a class="tag" href="/tag/value/page/1/">value</a>,
<a href="/author/Andre-Gide">(about)</a>,
<a class="tag" href="/tag/life/page/1/">life</a>,
<a class="tag" href="/tag/love/page/1/">love</a>,
<a href="/author/Thomas-A-Edison">(about)</a>,
<a class="tag" href="/tag/edison/page/1/">edison</a>,
<a class="tag" href="/tag/failure/page/1/">failure</a>,
<a class="tag" href="/tag/inspirational/page/1/">inspirational</a>,
<a class="tag" href="/tag/paraphrased/page/1/">paraphrased</a>,
<a href="/author/Eleanor-Roosevelt">(about)</a>,
<a class="tag" href="/tag/misattributed-eleanor-roosevelt/page/1/">misattributed-eleanor-roosevelt</a>,
<a href="/author/Steve-Martin">(about)</a>,
<a class="tag" href="/tag/humor/page/1/">humor</a>,
<a class="tag" href="/tag/obvious/page/1/">obvious</a>,
<a class="tag" href="/tag/simile/page/1/">simile</a>,
<a href="/page/2/">Next <span aria-hidden="true">→</span></a>,
<a class="tag" href="/tag/love/" style="font-size: 28px">love</a>,
<a class="tag" href="/tag/inspirational/" style="font-size: 26px">inspirational</a>,
<a class="tag" href="/tag/life/" style="font-size: 26px">life</a>,
<a class="tag" href="/tag/humor/" style="font-size: 24px">humor</a>,
<a class="tag" href="/tag/books/" style="font-size: 22px">books</a>,
<a class="tag" href="/tag/reading/" style="font-size: 14px">reading</a>,
<a class="tag" href="/tag/friendship/" style="font-size: 10px">friendship</a>,
<a class="tag" href="/tag/friends/" style="font-size: 8px">friends</a>,
<a class="tag" href="/tag/truth/" style="font-size: 8px">truth</a>,
<a class="tag" href="/tag/simile/" style="font-size: 6px">simile</a>,
<a href="https://www.goodreads.com/quotes">GoodReads.com</a>,
<a class="zyte" href="https://www.zyte.com">Zyte</a>]
The element at index 10 of the list above is:
a_list[10]
<a href="/author/Albert-Einstein">(about)</a>
It can be converted into a string using the decode() method.
a_list[10].decode()
'<a href="/author/Albert-Einstein">(about)</a>'
In the following code, we loop over each element of a_list.
The regular expression 'href="(/tag[^"]+)"' is applied to each element. This expression works as follows:
It looks for href=" in the HTML content.
After locating href=", it captures the part that starts with /tag.
It continues capturing characters until it reaches the closing quotation mark " of the attribute value.
import re
href_list = []
for a_item in a_list:
    href_list += re.findall(r'href="(/tag[^"]+)"', a_item.decode())
href_list
['/tag/change/page/1/',
'/tag/deep-thoughts/page/1/',
'/tag/thinking/page/1/',
'/tag/world/page/1/',
'/tag/abilities/page/1/',
'/tag/choices/page/1/',
'/tag/inspirational/page/1/',
'/tag/life/page/1/',
'/tag/live/page/1/',
'/tag/miracle/page/1/',
'/tag/miracles/page/1/',
'/tag/aliteracy/page/1/',
'/tag/books/page/1/',
'/tag/classic/page/1/',
'/tag/humor/page/1/',
'/tag/be-yourself/page/1/',
'/tag/inspirational/page/1/',
'/tag/adulthood/page/1/',
'/tag/success/page/1/',
'/tag/value/page/1/',
'/tag/life/page/1/',
'/tag/love/page/1/',
'/tag/edison/page/1/',
'/tag/failure/page/1/',
'/tag/inspirational/page/1/',
'/tag/paraphrased/page/1/',
'/tag/misattributed-eleanor-roosevelt/page/1/',
'/tag/humor/page/1/',
'/tag/obvious/page/1/',
'/tag/simile/page/1/',
'/tag/love/',
'/tag/inspirational/',
'/tag/life/',
'/tag/humor/',
'/tag/books/',
'/tag/reading/',
'/tag/friendship/',
'/tag/friends/',
'/tag/truth/',
'/tag/simile/']
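The list above contains repeated links, for example '/tag/inspirational/page/1/'. As an alternative sketch, BeautifulSoup can apply a compiled regular expression to the href attribute directly, which avoids converting each element to a string first. The html string below is a hypothetical sample of a few <a> elements copied from the page, and the names sample_soup, tag_links, hrefs, and unique_hrefs are introduced here for illustration.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical sample: a few <a> elements copied from the page above.
html = """
<a href="/login">Login</a>
<a href="/author/Albert-Einstein">(about)</a>
<a class="tag" href="/tag/life/page/1/">life</a>
<a class="tag" href="/tag/love/page/1/">love</a>
<a class="tag" href="/tag/love/" style="font-size: 28px">love</a>
<a class="tag" href="/tag/life/page/1/">life</a>
"""

sample_soup = BeautifulSoup(html, 'html.parser')

# Keep only <a> elements whose href value begins with /tag.
tag_links = sample_soup.find_all('a', href=re.compile(r'^/tag'))

# Read the attribute value directly from each element.
hrefs = [a['href'] for a in tag_links]

# Drop duplicates while keeping the order of first appearance.
unique_hrefs = list(dict.fromkeys(hrefs))
print(unique_hrefs)
```

Here the pattern only has to describe the attribute value, so the surrounding href="..." text no longer needs to be part of the expression.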