
Working with External Data Sources: APIs and Web Scraping

**Course Title:** Modern Python Programming: Best Practices and Trends
**Section Title:** File Handling and Working with External Data
**Topic:** Working with external data sources: APIs, web scraping (using `requests` and `BeautifulSoup`)

In this topic, we'll explore how to work with external data sources using APIs and web scraping. You'll learn how to fetch data from APIs, parse HTML and XML using BeautifulSoup, and leverage the power of the `requests` library.

### APIs and API Requests

Before we dive into the world of APIs, let's first define what an API is:

* An **API** (Application Programming Interface) is a set of rules, protocols, and tools for building software applications. It's a way for different systems to communicate with each other.

To interact with an API, you send an **HTTP request** (GET, POST, PUT, DELETE, etc.) to the URL of an API endpoint. In Python, you can use the `requests` library to send HTTP requests. Here's a simple example using the GitHub API:

```python
import requests

def get_github_user(username):
    """Fetch a GitHub user's public profile from the REST API."""
    url = f"https://api.github.com/users/{username}"
    response = requests.get(url)
    if response.status_code == 200:
        return response.json()
    return None

# Example usage:
username = "your-username"
result = get_github_user(username)
if result:
    print(result["name"])
    print(result["location"])
else:
    print("Failed to fetch user data.")
```

This example sends a GET request to the GitHub API to fetch the user data associated with the specified username.
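Real-world API calls benefit from a few defensive habits that the minimal example above leaves out. The sketch below is one way to harden a similar GitHub call; the helper name `get_github_repos`, the ten-second timeout, and the `per_page` value are illustrative choices, not requirements of the API:

```python
import requests

def get_github_repos(username, per_page=5):
    """Fetch a user's public repositories with a timeout and error handling."""
    url = f"https://api.github.com/users/{username}/repos"
    try:
        # `params` becomes the query string; `timeout` prevents hanging forever.
        response = requests.get(url, params={"per_page": per_page}, timeout=10)
        response.raise_for_status()  # raises on 4xx/5xx responses
        return response.json()
    except requests.RequestException as exc:
        print(f"Request failed: {exc}")
        return None

# Example usage:
repos = get_github_repos("octocat")
if repos:
    for repo in repos:
        print(repo["name"])
```

Catching `requests.RequestException` covers timeouts, connection errors, and the HTTP errors raised by `raise_for_status()` in one place, so callers only need to check for `None`.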
### Web Scraping with BeautifulSoup

Web scraping is the process of automatically extracting data from websites, usually using HTML and CSS selectors. BeautifulSoup is a powerful Python library for parsing HTML and XML documents. Here's a step-by-step guide to using BeautifulSoup for web scraping:

1. **Send an HTTP request** using the `requests` library to fetch the HTML content of the webpage.
2. **Parse the HTML content** using BeautifulSoup to create a parse tree.
3. **Use CSS or HTML selectors** to select and extract specific data from the parse tree.

Here's an example that uses BeautifulSoup to extract the titles of the top 10 movies on IMDB:

```python
import requests
from bs4 import BeautifulSoup

def get_imdb_top_movies():
    """Scrape the first ten titles from the IMDB Top 250 chart."""
    url = "https://www.imdb.com/chart/top/?ref_=nv_mv_250"
    response = requests.get(url)
    if response.status_code == 200:
        soup = BeautifulSoup(response.content, 'html.parser')
        titles = []
        # This selector targets IMDB's classic chart markup; if the site's
        # layout changes, the tag and class names must be updated to match.
        for movie in soup.find_all('td', class_='titleColumn'):
            title = movie.a.text
            titles.append(title)
        return titles[:10]
    return None

# Example usage:
movies = get_imdb_top_movies()
if movies:
    for i, movie in enumerate(movies, start=1):
        print(f"{i}. {movie}")
else:
    print("Failed to fetch movie titles.")
```

This example fetches the HTML content of the IMDB Top 250 page, uses BeautifulSoup to parse the HTML, and extracts the titles of the top 10 movies.
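Before scraping any site, check that its terms of service and `robots.txt` permit it, and pace your requests so you don't overload the server. Here is a minimal sketch using only the standard library's `urllib.robotparser`; the `MyTutorialScraper/1.0` user-agent string and the one-second delay are illustrative assumptions:

```python
import time
from urllib import robotparser

USER_AGENT = "MyTutorialScraper/1.0"  # hypothetical user-agent string

def is_allowed(base_url, path):
    """Check a site's robots.txt before fetching a path."""
    rp = robotparser.RobotFileParser()
    rp.set_url(f"{base_url}/robots.txt")
    rp.read()
    return rp.can_fetch(USER_AGENT, f"{base_url}{path}")

# Example usage:
if is_allowed("https://www.imdb.com", "/chart/top/"):
    # ...fetch the page here, then pause before the next request...
    time.sleep(1)  # a short delay between requests keeps the crawl polite
else:
    print("robots.txt disallows fetching this path.")
```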
### Putting It All Together

Now, let's say we want to fetch data from the GitHub API and the IMDB website, then store the results in a JSON file. Here's an example that combines an API call with web scraping:

```python
import json

import requests
from bs4 import BeautifulSoup

def fetch_and_save_data():
    github_url = "https://api.github.com/users/octocat"
    imdb_url = "https://www.imdb.com/chart/top/?ref_=nv_mv_250"

    # Initialize both payloads so the file can still be written
    # even if one of the fetches fails.
    github_data = None
    imdb_data = None

    github_response = requests.get(github_url)
    if github_response.status_code == 200:
        github_data = github_response.json()
    else:
        print("Failed to fetch GitHub data.")

    imdb_response = requests.get(imdb_url)
    if imdb_response.status_code == 200:
        imdb_soup = BeautifulSoup(imdb_response.content, 'html.parser')
        imdb_titles = []
        for movie in imdb_soup.find_all('td', class_='titleColumn'):
            imdb_titles.append(movie.a.text)
        imdb_data = {"titles": imdb_titles}
    else:
        print("Failed to fetch IMDB data.")

    combined_data = {"github": github_data, "imdb": imdb_data}
    with open("combined_data.json", "w") as f:
        json.dump(combined_data, f, indent=4)

# Call the function
fetch_and_save_data()
```

This example fetches data from the GitHub API, scrapes the IMDB website, combines the results, and stores them in a JSON file.
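Once the data is on disk, reading it back is symmetric. A quick check like the sketch below confirms the file round-trips cleanly; the `login` and `titles` keys follow the response shapes used above and may differ for other endpoints:

```python
import json

# Load the file written by fetch_and_save_data() and inspect it.
with open("combined_data.json") as f:
    data = json.load(f)

if data["github"]:
    print("GitHub user:", data["github"]["login"])
if data["imdb"]:
    print("Scraped titles:", len(data["imdb"]["titles"]))
```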
### Conclusion and Practical Takeaways

Working with external data sources using APIs and web scraping opens up endless possibilities for your Python applications. By using the `requests` library and BeautifulSoup, you can fetch and extract data from various sources.

**External Resources**

* [requests library documentation](https://docs.python-requests.org/en/latest/)
* [BeautifulSoup documentation](https://beautiful-soup-4.readthedocs.io/en/latest/)
* [IMDB website terms of use](https://help.imdb.com/article/imdb/account-information/copyright-notice-G5MS4U6B7SRGCR2Z?pf_rd_m=A2FGELT16U6RP4&pf_rd_p=4d162aeb-3aae-4cdf-88f2-0c8a23c89c5a&pf_rd_r=81R62VFSNW16FG65AW03&pf_rd_s=center-2&pf_rd_t=15051&pf_rd_i=movies&ref_=atv_lang_dt)

**Do you have any questions or feedback about working with external data sources? Leave a comment below!**

Next topic: [Error Handling and Exception Management in File Operations](insert_link)

Modern Python Programming: Best Practices and Trends


Objectives

  • Gain a deep understanding of Python fundamentals and its modern ecosystem.
  • Learn best practices for writing clean, efficient, and scalable Python code.
  • Master popular Python libraries and frameworks for data science, web development, and automation.
  • Develop expertise in version control, testing, packaging, and deploying Python projects.

Introduction to Python and Environment Setup

  • Overview of Python: History, popularity, and use cases.
  • Setting up a Python development environment (Virtualenv, Pipenv, Conda).
  • Introduction to Python's package manager (pip) and virtual environments.
  • Exploring Python's basic syntax: Variables, data types, control structures.
  • Lab: Install Python, set up a virtual environment, and write your first Python script.

Data Structures and Basic Algorithms

  • Understanding Python’s built-in data types: Lists, tuples, dictionaries, sets.
  • Working with iterators and generators for efficient looping.
  • Comprehensions (list, dict, set comprehensions) for concise code.
  • Basic algorithms: Sorting, searching, and common patterns.
  • Lab: Implement data manipulation tasks using lists, dictionaries, and comprehensions.

Functions, Modules, and Best Practices

  • Defining and using functions: Arguments, return values, and scope.
  • Understanding Python’s module system and creating reusable code.
  • Using built-in modules and the Python Standard Library.
  • Best practices: DRY (Don’t Repeat Yourself), writing clean and readable code (PEP 8).
  • Lab: Write modular code by creating functions and organizing them into modules.

Object-Oriented Programming (OOP) in Python

  • Introduction to Object-Oriented Programming: Classes, objects, and methods.
  • Inheritance, polymorphism, encapsulation, and abstraction in Python.
  • Understanding magic methods (dunder methods) and operator overloading.
  • Design patterns in Python: Singleton, Factory, and others.
  • Lab: Implement a class-based system with inheritance and polymorphism.

File Handling and Working with External Data

  • Reading and writing files (text, CSV, JSON) with Python.
  • Introduction to Python’s `pathlib` and `os` modules for file manipulation.
  • Working with external data sources: APIs, web scraping (using `requests` and `BeautifulSoup`).
  • Error handling and exception management in file operations.
  • Lab: Build a script that processes data from files and external APIs.

Testing and Debugging Python Code

  • Importance of testing in modern software development.
  • Unit testing with Python’s `unittest` and `pytest` frameworks.
  • Mocking and patching external dependencies in tests.
  • Debugging techniques: Using `pdb` and logging for error tracking.
  • Lab: Write unit tests for a Python project using `pytest` and practice debugging techniques.

Functional Programming in Python

  • Understanding the functional programming paradigm in Python.
  • Using higher-order functions: `map()`, `filter()`, `reduce()`, and `lambda` functions.
  • Working with immutability and recursion.
  • Introduction to Python’s `functools` and `itertools` libraries for advanced functional techniques.
  • Lab: Solve real-world problems using functional programming principles.

Concurrency and Parallelism

  • Introduction to concurrent programming in Python.
  • Using threading and multiprocessing for parallel tasks.
  • Asynchronous programming with `asyncio` and coroutines.
  • Comparing synchronous vs asynchronous workflows: When to use each.
  • Lab: Build a program that handles multiple tasks concurrently using `asyncio` and threading.

Data Science and Visualization with Python

  • Introduction to NumPy for numerical computing.
  • Pandas for data manipulation and analysis.
  • Visualizing data with Matplotlib and Seaborn.
  • Exploratory data analysis (EDA) using real-world datasets.
  • Lab: Perform data analysis and visualization on a dataset using Pandas and Matplotlib.

Web Development with Python

  • Introduction to web development frameworks: Flask vs Django.
  • Building RESTful APIs with Flask/Django.
  • Connecting to databases using SQLAlchemy (Flask) or Django ORM.
  • Best practices for securing web applications.
  • Lab: Create a RESTful API with Flask/Django and interact with it using Python.

Automation and Scripting

  • Introduction to scripting for automation (shell scripts, cron jobs).
  • Automating repetitive tasks with Python.
  • Interacting with system processes using `subprocess` and `os` modules.
  • Working with Python for network automation and web scraping.
  • Lab: Write scripts to automate tasks like file handling, data extraction, and network operations.

Packaging, Version Control, and Deployment

  • Introduction to Python packaging: `setuptools` and `wheel`.
  • Creating and publishing Python packages (PyPI).
  • Version control with Git: Managing and collaborating on Python projects.
  • Deploying Python applications: Using Docker and cloud platforms.
  • Lab: Package a Python project and deploy it using Docker and Git.
