<Prerequisite knowledge>
・Scraping
・Python
This article explains how to batch download images from the web with Python using scraping.
This is an effective way to collect a large number of images for machine learning.
■How are image files defined in HTML?
In HTML source code, an image is defined with an img tag, and the address of the image is given in that tag's src attribute.
The program therefore finds the img tags and reads the image address from the src attribute of each one.
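As a minimal sketch of where the address sits inside an img tag, the snippet below parses a small HTML fragment and prints each src value. The fragment, the file name sample.png, and the class name ImgSrcCollector are made up for illustration. Python's standard html.parser is used here instead of BeautifulSoup only so that the sketch runs without extra installs.

```python
from html.parser import HTMLParser


class ImgSrcCollector(HTMLParser):
    """Collects the src attribute of every img tag (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.srcs = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the tag
        if tag == "img":
            src = dict(attrs).get("src")
            if src is not None:
                self.srcs.append(src)


# Hypothetical HTML fragment; the second file name is illustrative only
sample_html = """
<p>Sample page</p>
<img src="https://taketake2.com/PID77.jpg" alt="address with domain">
<img src="sample.png" alt="address without domain">
"""

collector = ImgSrcCollector()
collector.feed(sample_html)
for src in collector.srcs:
    print(src)
```

The first src value is a complete address, while the second contains only a file name.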
On some websites, the src attribute does not include the domain (for example, it contains only the file name). The program explained here does not support that case.
(However, if you obtain the domain name and combine it with the image file name, you can still save the image.)
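One possible way to combine the domain with the file name is urljoin from Python's standard urllib.parse module. The sketch below is not part of the program in this article; the src values, including the /img/ path, are hypothetical examples.

```python
from urllib.parse import urljoin

page_url = "https://taketake2.com/test.html"  # page the img tags came from

# Hypothetical src values: file name only, site-root path, and full address
src_values = ["PID77.jpg", "/img/PID77.jpg", "https://taketake2.com/PID77.jpg"]

# urljoin leaves full addresses unchanged and completes the others
absolute_urls = [urljoin(page_url, src) for src in src_values]
for full_url in absolute_urls:
    print(full_url)
```

With this approach, each src value would be passed through urljoin before checking whether it starts with "http".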
■Example of image download using Python
<When downloading one image by specifying its address>
First, here is an example that saves one image with the simplest possible program.
Install the "requests" library (for example, with pip install requests). The following URL exists as a test page, so the program below can be executed as is.
import requests

img_url = "https://taketake2.com/PID77.jpg"  # Address of the image file
with open("PID77.jpg", 'wb') as f:
    f.write(requests.get(img_url).content)  # Download and save the file
<When downloading multiple images automatically>
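The requests library fetches the source code of the specified URL, and the BeautifulSoup library parses it to find the addresses of the images. The program uses the lxml parser, so the lxml package must be installed as well.
Once you know the address of each image, save the image as described above.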
import requests
from bs4 import BeautifulSoup

img_list = []
url = 'https://taketake2.com/test.html'  # Specify any URL
url_cont = BeautifulSoup(requests.get(url).content, 'lxml')  # Fetch and parse the page
img_all = url_cont.find_all("img")  # Get all img tags
for d in img_all:  # Process the img tags one by one
    src = d.get("src")  # Get the src attribute (None if it is missing)
    if src and src.startswith("http") and src.endswith((".jpg", ".png")):
        img_list.append(src)  # Add to list if src ends with .jpg or .png
for img_data in img_list:  # Save each image to a file
    file_name = img_data.split('/')[-1]  # Use the last part of the address as the file name
    with open(file_name, 'wb') as f:
        f.write(requests.get(img_data).content)  # Save to file
    print(file_name)  # Print the saved file name
■Notes: cases where the program does not work well
Some websites do not allow image downloads, either to protect copyright or because bulk downloads may overload the web server.
Check the site's terms of use carefully.
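Where downloading is permitted, one common way to reduce the load on the server is to pause between requests. The following is only a sketch under that assumption: the helper name download_politely, its parameters, and the one-second default are hypothetical and not part of the program above.

```python
import os
import time


def download_politely(img_urls, fetch, out_dir=".", delay_sec=1.0):
    """Save each image, pausing between requests (hypothetical helper).

    fetch is any function that takes a URL and returns the file bytes.
    """
    saved = []
    for i, img_url in enumerate(img_urls):
        if i > 0:
            time.sleep(delay_sec)  # wait before every request after the first
        file_name = img_url.split("/")[-1]
        path = os.path.join(out_dir, file_name)
        with open(path, "wb") as f:
            f.write(fetch(img_url))  # write the downloaded bytes to the file
        saved.append(path)
    return saved
```

With requests, the call could look like download_politely(img_list, lambda u: requests.get(u, timeout=10).content).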
In addition, the page source obtained by the above program may not match the source of the page you see in the browser (for example, when the page inserts images with JavaScript after loading), so the images may not be found.
If the program does not work well, compare the source it actually obtained with the page you expected.
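One way to check the results is to look at the src values the program actually found and see why each one was kept or skipped. The sketch below mirrors the checks in the program above (an address starting with http and ending in .jpg or .png). The function name classify_src and the sample values are hypothetical; in practice the list would be built with d.get("src") for each img tag.

```python
from collections import Counter


def classify_src(src):
    """Report why a src value is kept or skipped (hypothetical helper)."""
    if not src:
        return "missing src"
    if not src.startswith("http"):
        return "no domain (relative address)"
    if not src.endswith((".jpg", ".png")):
        return "other extension"
    return "downloadable"


# Hypothetical src values for illustration
src_values = [
    None,
    "sample.png",
    "https://taketake2.com/PID77.jpg",
    "https://taketake2.com/anim.gif",
]

counts = Counter(classify_src(s) for s in src_values)
for reason, n in counts.items():
    print(reason, n)
```

If the list of src values is empty, the fetched source probably contains no img tags at all, which suggests it differs from the page shown in the browser.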