Python: Extracting HTML Data from the Web

Overview

This post demonstrates how to use Python, together with the requests and BeautifulSoup libraries, to scrape web data. We’ll extract the Kubernetes release calendar from the official Amazon EKS documentation. The approach can be adapted to a wide range of data extraction tasks, making it a valuable technique for developers who need up-to-date information from websites.

Explanation with Inline Comments

import requests
from bs4 import BeautifulSoup

# Target URL from which to scrape the data
url = 'https://docs.aws.amazon.com/eks/latest/userguide/kubernetes-versions.html'
# Heading text that marks the section containing the release calendar
heading_text = 'Amazon EKS Kubernetes release calendar'

# Sending a GET request to the URL and failing fast on HTTP errors
response = requests.get(url)
response.raise_for_status()
html_content = response.text  # Getting the text content of the response

# Parsing the HTML content
soup = BeautifulSoup(html_content, 'html.parser')

# Finding the section heading by its exact text
heading = soup.find(string=heading_text)

# Navigating to the next table after the heading
table = heading.find_next('table')

# Building an HTML table with borders
html_table = "<table border='1'>"
header_row = table.find('tr')
html_header_row = "<tr>"

# Loop through each header in the table and clean up the text
for header_cell in header_row.find_all('th'):
    html_header_row += f"<th>{header_cell.text.strip()}</th>"
html_header_row += "</tr>"
html_table += html_header_row

# Loop through each remaining row and append its original HTML markup unchanged
for row in table.find_all('tr')[1:]:
    html_table += str(row)
html_table += "</table>"

# Output the HTML table to the console
print(html_table)
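The script prints the assembled table to the console. If you'd rather keep the result, a minimal follow-up sketch that writes it to a local file (the calendar.html filename is just an example) could look like this:

with open('calendar.html', 'w', encoding='utf-8') as f:
    f.write(html_table)  # persist the generated table as a standalone HTML file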

Scraping the Data

We start by importing the necessary libraries: requests for making HTTP requests and BeautifulSoup for parsing HTML content.

import requests
from bs4 import BeautifulSoup
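Both are third-party packages; if they aren't already available in your environment, they can be installed with pip install requests beautifulsoup4.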

Next, we specify the target URL from which we’ll scrape the data and define the heading text that marks the section containing the release calendar.

url = 'https://docs.aws.amazon.com/eks/latest/userguide/kubernetes-versions.html'
heading_text = 'Amazon EKS Kubernetes release calendar'

We send a GET request to the URL, raise an error if the request failed, and parse the HTML content using BeautifulSoup.

response = requests.get(url)
response.raise_for_status()
html_content = response.text
soup = BeautifulSoup(html_content, 'html.parser')
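In practice, a bare GET request can hang indefinitely or silently return an error page. A more defensive variant might set a timeout and a User-Agent header; the values below are illustrative assumptions, not requirements of the EKS documentation site:

# Sketch: a more defensive request (timeout and User-Agent are example values)
response = requests.get(
    url,
    timeout=10,  # assumed limit: give up after 10 seconds instead of hanging
    headers={'User-Agent': 'release-calendar-scraper/1.0'},  # assumed UA string
)
response.raise_for_status()  # raise an HTTPError for 4xx/5xx responses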

After parsing the HTML content, we locate the section of the page containing the release calendar by searching for the heading text. Note that find(string=...) must match the text exactly, and it returns the matching text node rather than a tag.

heading = soup.find(string=heading_text)
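Because the match must be exact, even a small wording change on the page would break the lookup. BeautifulSoup also accepts a compiled regular expression for string, which allows a more forgiving partial, case-insensitive match. A sketch:

import re

# Match the heading by substring instead of exact text
heading = soup.find(string=re.compile(r'release calendar', re.IGNORECASE))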

Once we’ve located the heading, we navigate forward through the document to the first table that follows it; find_next searches everything that appears after the heading in document order.

table = heading.find_next('table')
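If the heading text isn’t found on the page, heading is None and the find_next call raises an AttributeError. A minimal guard, assuming you want the script to stop with a clear message instead, could be:

# Fail with a clear message if the page layout has changed
if heading is None:
    raise SystemExit(f'Heading "{heading_text}" not found; the page may have changed')
table = heading.find_next('table')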

Building the HTML Table

To present the scraped data in a structured format, we construct a new HTML table with borders: the header row is rebuilt with cleaned-up text, and the data rows are appended with their original markup.

html_table = "<table border='1'>"
header_row = table.find('tr')
html_header_row = "<tr>"

# Extracting and cleaning up table headers
for header_cell in header_row.find_all('th'):
    html_header_row += f"<th>{header_cell.text.strip()}</th>"
html_header_row += "</tr>"
html_table += html_header_row

# Extracting the remaining rows and appending their original HTML markup unchanged
for row in table.find_all('tr')[1:]:
    html_table += str(row)
html_table += "</table>"
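As an aside, if pandas is available, its read_html helper can parse the same page into a DataFrame in one step. A sketch, assuming pandas (plus a parser backend such as lxml) is installed, and reusing the html_content fetched earlier:

import pandas as pd
from io import StringIO

# Parse every table on the page; the release calendar is one of them
tables = pd.read_html(StringIO(html_content))
print(tables[0])  # index 0 is an assumption; inspect the list to find the calendar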