How to batch download images from the web (Python)




Release date:2023/1/15         

<Prerequisite knowledge>
Scraping
Python


This article explains how to batch download images from the web with Python using web scraping. The method is useful when you need to collect a large number of images, for example as training data for machine learning.

■How are image files defined in HTML?

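In the HTML source code, an image is embedded with an img tag, and the address of the image file is given by the tag's src attribute, for example <img src="https://taketake2.com/PID77.jpg">. To collect images, find the img tags and read the image address from each src attribute.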
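As a rough illustration of reading the src attribute from img tags, here is a minimal sketch that uses only Python's standard html.parser module. The class name ImgSrcCollector and the HTML fragment are made up for this example; the programs later in this article use BeautifulSoup instead.

```python
from html.parser import HTMLParser


class ImgSrcCollector(HTMLParser):
    """Collects the src attribute of every img tag (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; tag names arrive lower-cased
        if tag == "img":
            src = dict(attrs).get("src")
            if src:                      # skip img tags without a src attribute
                self.sources.append(src)


# Made-up HTML fragment, for illustration only
sample_html = """
<p>Test page</p>
<img src="https://taketake2.com/PID77.jpg" alt="sample">
<img alt="no address here">
"""

collector = ImgSrcCollector()
collector.feed(sample_html)
print(collector.sources)   # ['https://taketake2.com/PID77.jpg']
```

The second img tag has no src attribute, so it is skipped rather than causing an error.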



On some websites the src attribute contains only a relative path without the domain (for example, src="PID77.jpg"). The program explained in this article does not support this case. (However, if you obtain the domain name and combine it with the image file name, you can still save the image.)
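One possible way to combine the domain with a relative image path is urljoin from the standard urllib.parse module. The sketch below assumes the page address is the test page used in this article; the helper name to_absolute and the /img/ path are hypothetical.

```python
from urllib.parse import urljoin

# Address of the page on which the img tags were found (assumed for this example)
page_url = "https://taketake2.com/test.html"


def to_absolute(src, base=page_url):
    """Return an absolute image address for a src value (hypothetical helper)."""
    return urljoin(base, src)


print(to_absolute("PID77.jpg"))                        # https://taketake2.com/PID77.jpg
print(to_absolute("/img/PID77.jpg"))                   # https://taketake2.com/img/PID77.jpg
print(to_absolute("https://taketake2.com/PID77.jpg"))  # already absolute, returned unchanged
```

Because urljoin leaves absolute addresses untouched, the same call can be applied to every src value without checking first whether it starts with "http".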



■Example of image download using Python

<Downloading one image by specifying its address>
First, here is the simplest program, which saves a single image. Install the "requests" library beforehand (for example, with pip install requests). The URL below exists as a test page, so the program can be run as is.

import requests

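img_url = "https://taketake2.com/PID77.jpg"  # Address of the image file

response = requests.get(img_url, timeout=10)
response.raise_for_status()                  # Stop here if the download failed

with open("PID77.jpg", "wb") as f:
    f.write(response.content)                # Save the image to a file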


<Downloading multiple images automatically>
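The requests library fetches the source code of the specified URL, and the BeautifulSoup library parses it to find the addresses of the image data. Once the address of each image is known, the image is saved in the same way as above. The "beautifulsoup4" and "lxml" packages must be installed in addition to "requests".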

import requests
from bs4 import BeautifulSoup

img_list = []
url = 'https://taketake2.com/test.html'        # Specify any URL
html = requests.get(url, timeout=10).content   # Fetch the page source
url_cont = BeautifulSoup(html, 'lxml')         # Parse the page source
img_all = url_cont.find_all("img")             # Get all img tags

for tag in img_all:                 # Check the img tags one by one
    src = tag.get("src")            # Get the src attribute (None if it is missing)
    if src and src.startswith("http") and src.endswith((".jpg", ".png")):
        img_list.append(src)        # Keep absolute addresses ending in .jpg or .png

for img_url in img_list:            # Save each image to a file
    file_name = img_url.split('/')[-1]
    with open(file_name, 'wb') as f:
        f.write(requests.get(img_url, timeout=10).content)
    print(file_name)                # Print the saved file name
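Taking the text after the last "/" as the file name works for simple addresses, but it keeps any query string (for example "b.png?size=large") and yields an empty name when the address ends with "/". The following is a minimal sketch of a more tolerant approach using only the standard library. The function name file_name_from_url, the default name, and the example.com addresses are assumptions made for this example.

```python
import posixpath
from urllib.parse import urlparse, unquote


def file_name_from_url(img_url, default="image.jpg"):
    """Derive a local file name from an image address (hypothetical helper)."""
    path = urlparse(img_url).path              # drops the query string and fragment
    name = posixpath.basename(unquote(path))   # last path segment, %xx decoded
    return name or default                     # fall back if the path ends with "/"


print(file_name_from_url("https://taketake2.com/PID77.jpg"))        # PID77.jpg
print(file_name_from_url("https://example.com/a/b.png?size=large")) # b.png
print(file_name_from_url("https://example.com/"))                   # image.jpg
```

Note that two different addresses can still produce the same file name, in which case the later download overwrites the earlier one.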


■Notes and cases where the program does not work well

Some websites do not allow image downloads, either to protect copyright or because automated downloads may overload the web server. Check the site's terms of use carefully. In addition, the page source obtained by the above program may differ from the source of the page you see in the browser (for example, when images are inserted by JavaScript after the page loads), in which case the images cannot be obtained. If the program does not work, compare the source it actually received with the page you expected.
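In addition to reading the terms of use, one common courtesy is to respect the site's robots.txt file and to wait between requests. The sketch below shows one way to do this with the standard urllib.robotparser module. The robots.txt content, the example.com addresses, and the one-second delay are all made up for illustration, and passing this check does not replace checking the terms of use.

```python
import time
from urllib.robotparser import RobotFileParser

# Made-up robots.txt content for illustration. For a real site, the file
# could instead be loaded with robots.set_url(...) followed by robots.read().
sample_robots = """
User-agent: *
Disallow: /private/
""".splitlines()

robots = RobotFileParser()
robots.parse(sample_robots)

# Hypothetical image addresses collected from a page
candidates = [
    "https://example.com/img/a.jpg",
    "https://example.com/private/b.jpg",
]

# Keep only the addresses that robots.txt allows any crawler ("*") to fetch
allowed = [u for u in candidates if robots.can_fetch("*", u)]
print(allowed)   # ['https://example.com/img/a.jpg']

WAIT_SECONDS = 1.0   # assumed delay between requests; adjust to the site
for img_url in allowed:
    # The actual download (requests.get(img_url) and saving the file) would go here
    time.sleep(WAIT_SECONDS)   # pause so the server is not flooded with requests
```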








