Extracting text from an HTML file is a common task in web scraping and data extraction. Third-party Python libraries such as BeautifulSoup make this task straightforward. In this article, we will explore the process of extracting text from an HTML file using Python.
Use the command below to install the BeautifulSoup library:
pip install beautifulsoup4 requests
Using BeautifulSoup for Text Extraction
BeautifulSoup helps us to parse HTML documents and extract data from them.
Example: The example below extracts the text from an HTML file.
Python
from bs4 import BeautifulSoup
# Read HTML content from a file
with open('example.html', 'r', encoding='utf-8') as file:
    html_cont = file.read()
# Parse the HTML content
soup = BeautifulSoup(html_cont, 'html.parser')
# Extract the text of the whole document, one string per line.
# (Looping over soup.find_all() and calling get_text() on every tag
# would print the text of nested tags more than once.)
print(soup.get_text(separator='\n', strip=True))
HTML
<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="UTF-8">
<title>Extracting HTML text</title>
</head>
<body>
<h1>Welcome to GeeksForGeeks</h1>
<div class="content">
<p>This is a paragraph.</p>
<p>This is another paragraph.</p>
</div>
</body>
</html>
Output:
Extracting HTML text
Welcome to GeeksForGeeks
This is a paragraph.
This is another paragraph.
Handling Complex HTML Structures
Sometimes HTML files have complex structures, and we may need to extract text from nested tags or from specific parts of the document. BeautifulSoup provides several methods for such scenarios.
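As a rough illustration of a few of those methods, the sketch below navigates by tag name, searches by attribute, and collects text from nested tags. The inline markup and the variable names are made up for this sketch and are not part of the example files used in this article.

```python
from bs4 import BeautifulSoup

# Inline HTML so the sketch runs without a file on disk (illustrative markup).
html = """
<html><body>
  <h1>Welcome</h1>
  <div class="content" id="main">
    <p>First <b>bold</b> paragraph.</p>
    <ul><li>Item one</li><li>Item two</li></ul>
  </div>
</body></html>
"""

soup = BeautifulSoup(html, "html.parser")

# 1. Navigate by tag name: attribute access returns the first matching tag.
heading = soup.h1.get_text()

# 2. Search by attribute: find() returns the first match, or None if absent.
main_div = soup.find("div", id="main")

# 3. get_text() on a parent collects text from all nested tags;
#    separator and strip control how the pieces are joined.
div_text = main_div.get_text(separator=" ", strip=True)

# 4. Collect the text of repeated nested tags into a list.
items = [li.get_text() for li in main_div.find_all("li")]

print(heading)
print(div_text)
print(items)
```

Calling get_text() once on a parent tag is usually enough when all nested text is wanted. Iterating over find_all() is more useful when each nested element should be handled separately.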
Extracting Text from Nested Tags
To extract text from nested tags, we navigate the parse tree using tag names, attributes, or CSS selectors.
Example: The example below extracts text from nested tags.
Python
from bs4 import BeautifulSoup
# Read HTML content from a file
with open('example.html', 'r', encoding='utf-8') as file:
    html_cont = file.read()
# Parse the HTML content
soup = BeautifulSoup(html_cont, 'html.parser')
# Extract text from nested tags
divs = soup.find_all('div', class_='content')  # Find all <div> tags with class 'content'
for div in divs:
    paragraphs = div.find_all('p')  # Find all <p> tags within each <div class='content'>
    for p in paragraphs:
        print(p.get_text())
HTML
<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="UTF-8">
</head>
<body>
<h1>Welcome to the Example Page</h1>
<div class="content">
<p>This is a paragraph inside content div.</p>
<p>This is another paragraph inside content div.</p>
</div>
</body>
</html>
Output:
This is a paragraph inside content div.
This is another paragraph inside content div.
Extracting Text Using CSS Selectors
We can also use CSS selectors to target specific elements.
Python
from bs4 import BeautifulSoup
# Read HTML content from a file
with open('ex.html', 'r', encoding='utf-8') as file:
    html_cont = file.read()
# Parse the HTML content
soup = BeautifulSoup(html_cont, 'html.parser')
# Extract text using CSS selectors
# 'div.content > p' matches <p> tags that are direct children of <div class='content'>
texts = soup.select('div.content > p')
for text in texts:
    print(text.get_text())
HTML
<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="UTF-8">
</head>
<body>
<h1>Welcome to the Example Page</h1>
<div class="content">
<p>This is a paragraph inside content div.</p>
<p>This is another paragraph inside content div.</p>
</div>
</body>
</html>
Output:
This is a paragraph inside content div.
This is another paragraph inside content div.
Conclusion
By following the steps explained in this article, we can efficiently parse and extract text from HTML documents. For complex HTML structures, BeautifulSoup offers powerful search methods and CSS selector support. With these techniques, we can perform web scraping and data extraction tasks effectively.
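The techniques covered in this article could be combined into one small helper, sketched below. The function name extract_text and its selector parameter are hypothetical names chosen for this illustration, and the sample markup is passed as a string rather than read from a file.

```python
from bs4 import BeautifulSoup


def extract_text(html, selector=None):
    """Return a list of text strings from an HTML document.

    Hypothetical helper for illustration only. With no selector, the
    whole document's text is returned as a single item; otherwise one
    item is returned per element matched by the CSS selector.
    """
    soup = BeautifulSoup(html, "html.parser")
    if selector is None:
        return [soup.get_text(separator="\n", strip=True)]
    return [tag.get_text(strip=True) for tag in soup.select(selector)]


sample = """
<div class="content">
  <p>This is a paragraph inside content div.</p>
  <p>This is another paragraph inside content div.</p>
</div>
"""

paragraphs = extract_text(sample, "div.content > p")
everything = extract_text(sample)
print(paragraphs)
print(everything)
```

Returning a list in both cases keeps the calling code the same whether or not a selector is given.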