Extracting text from HTML file using Python

Extracting text from an HTML file is a common task in web scraping and data extraction. Python provides powerful libraries such as BeautifulSoup that make this task straightforward. In this article we will explore the process of extracting text from an HTML file using Python.

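Use the command below to install BeautifulSoup. The requests library is installed with it, but it is only needed when downloading pages over HTTP; the examples in this article read a local file: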

pip install beautifulsoup4 requests

Using BeautifulSoup for Text Extraction

BeautifulSoup parses an HTML document into a tree of tag objects. Calling get_text() on the tree, or on any tag in it, returns the text that it contains.

Example: In the example below, we extract the text from an HTML file.

Python

from bs4 import BeautifulSoup

# Read HTML content from a file
with open('example.html', 'r', encoding='utf-8') as file:
    html_cont = file.read()

# Parse the HTML content
soup = BeautifulSoup(html_cont, 'html.parser')

# Extract all text from the document in a single call.
# (Calling get_text() on every tag from find_all() would repeat nested text.)
print(soup.get_text())

HTML

<!DOCTYPE html>
<html lang="en">
<head>
    <meta charset="UTF-8">
    <title>Extracting HTML text</title>
</head>
<body>
    <h1>Welcome to GeeksForGeeks</h1>
    <div class="content">
        <p>This is a paragraph.</p>
        <p>This is another paragraph.</p>
    </div>
</body>
</html>

Output:

Extracting HTML text


Welcome to GeeksForGeeks

This is a paragraph.
This is another paragraph.
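
The raw output keeps the blank lines of the source file. As a minimal sketch, get_text() also accepts separator and strip arguments that return one trimmed line per piece of text. The html_cont string and the extra <script> tag below are illustrative assumptions standing in for example.html, not part of the original example. Removing <script> and <style> tags first is a common precaution, since their contents may otherwise end up in the extracted text depending on the BeautifulSoup version.

```python
from bs4 import BeautifulSoup

# Illustrative stand-in for the contents of example.html (assumed, not read from disk)
html_cont = """<!DOCTYPE html>
<html lang="en">
<head>
    <title>Extracting HTML text</title>
    <script>console.log("not page text");</script>
</head>
<body>
    <h1>Welcome to GeeksForGeeks</h1>
    <div class="content">
        <p>This is a paragraph.</p>
        <p>This is another paragraph.</p>
    </div>
</body>
</html>"""

soup = BeautifulSoup(html_cont, 'html.parser')

# Drop tags whose contents are code or styling rather than readable text
for tag in soup(['script', 'style']):
    tag.decompose()

# strip=True trims each piece of text and skips whitespace-only pieces;
# separator='\n' puts each remaining piece on its own line
clean_text = soup.get_text(separator='\n', strip=True)
print(clean_text)

# stripped_strings yields the same pieces one at a time
lines = list(soup.stripped_strings)
```

The stripped_strings form is convenient when each piece of text needs further processing instead of being printed as a single string.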

Handling Complex HTML Structures

Sometimes HTML files have complex structures, and we need to extract text only from nested tags or from specific parts of the document. BeautifulSoup provides methods such as find_all() and select() for these cases.

Extracting Text from Nested Tags

To extract text from nested tags, we navigate the parse tree using tag names, attributes, or CSS selectors.

Example: In the example below, we extract text from nested tags.

Python

from bs4 import BeautifulSoup

# Read HTML content from a file
with open('example.html', 'r', encoding='utf-8') as file:
    html_cont = file.read()

# Parse the HTML content
soup = BeautifulSoup(html_cont, 'html.parser')

# Extract text from nested tags
divs = soup.find_all('div', class_='content')  # Find all <div> tags with class 'content'
for div in divs:
    paragraphs = div.find_all('p')  # Find all <p> tags within each <div class='content'>
    for p in paragraphs:
        print(p.get_text())

HTML

<!DOCTYPE html>
<html lang="en">
<head>
    <meta charset="UTF-8">
</head>
<body>
    <h1>Welcome to the Example Page</h1>
    <div class="content">
        <p>This is a paragraph inside content div.</p>
        <p>This is another paragraph inside content div.</p>
    </div>
</body>
</html>

Output:

This is a paragraph inside content div.
This is another paragraph inside content div.
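
Navigation by tag name and by attribute can also be sketched without find_all(). The snippet below assumes the same markup is held in a string (html_cont is an illustrative stand-in for the file) and uses dot access, find(), and attribute lookup. Because find() returns None when nothing matches, the result is checked before use.

```python
from bs4 import BeautifulSoup

# Illustrative stand-in for the contents of example.html (assumed, not read from disk)
html_cont = """<html><body>
<h1>Welcome to the Example Page</h1>
<div class="content">
    <p>This is a paragraph inside content div.</p>
    <p>This is another paragraph inside content div.</p>
</div>
</body></html>"""

soup = BeautifulSoup(html_cont, 'html.parser')

# Dot access returns the first matching descendant tag
heading = soup.body.h1.get_text()

# find() returns the first match or None, so guard before using it
content_div = soup.find('div', attrs={'class': 'content'})
if content_div is not None:
    # class can hold several values in HTML, so BeautifulSoup returns a list
    div_classes = content_div['class']
    first_paragraph = content_div.p.get_text()
    paragraph_texts = [p.get_text() for p in content_div.find_all('p')]
    print(heading, div_classes, first_paragraph, paragraph_texts, sep='\n')
```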

Extracting Text Using CSS Selectors

We can also use CSS selectors with the select() method to target specific elements.

Python

from bs4 import BeautifulSoup

# Read HTML content from a file
with open('example.html', 'r', encoding='utf-8') as file:
    html_cont = file.read()

# Parse the HTML content
soup = BeautifulSoup(html_cont, 'html.parser')

# Extract text using CSS selectors
texts = soup.select('div.content > p')  # Select <p> tags that are direct children of <div class='content'>
for text in texts:
    print(text.get_text())

HTML

<!DOCTYPE html>
<html lang="en">
<head>
    <meta charset="UTF-8">
</head>
<body>
    <h1>Welcome to the Example Page</h1>
    <div class="content">
        <p>This is a paragraph inside content div.</p>
        <p>This is another paragraph inside content div.</p>
    </div>
</body>
</html>

Output:

This is a paragraph inside content div.
This is another paragraph inside content div.
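
The choice of combinator matters once paragraphs are nested more deeply. The sketch below uses hypothetical markup (not from the example file) in which one paragraph sits inside a <section>; it compares the child combinator (>) with the descendant combinator (a space) and shows select_one(), which returns the first match or None.

```python
from bs4 import BeautifulSoup

# Hypothetical markup with one paragraph nested a level deeper
html_cont = """<div class="content">
    <p>Direct child paragraph.</p>
    <section>
        <p>Nested paragraph.</p>
    </section>
</div>"""

soup = BeautifulSoup(html_cont, 'html.parser')

# '>' matches only <p> tags that are direct children of the div
direct = [p.get_text() for p in soup.select('div.content > p')]

# A space matches <p> tags at any depth inside the div
nested_too = [p.get_text() for p in soup.select('div.content p')]

# select_one() returns the first match, or None when nothing matches
first = soup.select_one('div.content p')
missing = soup.select_one('div.sidebar p')

print(direct)
print(nested_too)
print(first.get_text(), missing)
```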

Conclusion:

By following the steps explained in this article, we can parse HTML documents and extract their text. For complex HTML structures, BeautifulSoup offers search methods such as find_all() as well as CSS selectors through select(). With these techniques, we can carry out web scraping and data extraction tasks effectively.

Referred: https://www.geeksforgeeks.org
