Geology and Python

A blog stuffed with easy-to-follow Python recipes for geosciences !

Integrating & Exploring 4: Automated Bulk Downloads of Landsat-8 Data Products in Python



Multispectral and hyperspectral satellites are amazing (I'm a huge fan). I tend to think them as a super human vision. Soviet and American space stations started to be launched with multispectral devices in it's equipments in yearly 70's. From them, space scientists have improved it to a solid earth observation system for all kinds of purposes. Today, with these instruments, scientists can accurately categorize trillion of surface pixels, find not only visible, but invisible information, control illegal actions and plan our future as a civilization.

The Landsat Mission is a joint initiative between the U.S. Geological Survey (USGS) and NASA, officially described as:

The world's longest continuously acquired collection of space-based moderate-resolution land remote sensing data.

The Landsat is a multispectral satellite currently on the eighth of it's series. Landsat-8 data is freely available on the USGS's Earth Explorer website. All we need to do is sign up and find a scene that match our area of study. However, in this tutorial, I will show you how to automate the bulk download of low Cloud Covered Landsat-8 images from the comfort your Python!

This is the PART 4 of a series of posts called Integrating & Exploring. This post is also a consequence of the tweet below. Thanks for votes, retweetes and likes!

Area of Study

Iron River in Michigan, USA


Import pandas and geopandas for easy manipulation and filtering of tables and vector files.folium for interactive map views. os and shutil for file/folder manipulation.

import pandas as pd
import geopandas as gpd
import folium
import os, shutil
from glob import glob

Let's read the vector containing the bounds of the Area of Study. This was processed on previous posts.

bounds = gpd.read_file('./data/processed/area_of_study_bounds.gpkg')

Picking the Best Scenes

The notation used to catalog Landsat-8 images is called Worldwide Reference System 2 (WRS-2). The Landsat follows the same paths imaging the earth every 16 days. Each path is split into multiple rows. So, each scene have a path and a row. 16 days later, another scene will have the same path and row than the previous scene. This is the essence of the WRS-2 system.

USGS provides Shape files of these paths and rows that let us quickly visualize, interact and select the important images.

Let us get the shape files and unpack it. I will first download the file to WRS_PATH and extract it to LANDSAT_PATH.

WRS_PATH = './data/external/Landsat8/'
LANDSAT_PATH = os.path.dirname(WRS_PATH)


shutil.unpack_archive(WRS_PATH, os.path.join(LANDSAT_PATH, 'wrs2'))

Import it to a GeoDataFrame using geopandas. Check the first 5 entries.

wrs = gpd.GeoDataFrame.from_file('./data/external/Landsat8/wrs2/wrs2_descending.shp')

0 15.74326 1 D 13 26.98611 13001 1 1 0 1 1 2233 13001 POLYGON ((-10.80341356392465 80.9888, -8.97406...
1 14.55366 1 D 13 25.84254 13002 2 2 0 1 2 2234 13002 POLYGON ((-29.24250366707619 80.18681161921363...
2 13.37247 1 D 13 24.20303 13003 3 3 0 1 3 2235 13003 POLYGON ((-24.04205646041896 79.12261247629547...
3 12.26691 1 D 13 22.40265 13004 4 4 0 1 4 2236 13004 POLYGON ((-36.66813132081753 77.46094098591608...
4 11.26511 1 D 13 20.64284 13005 5 5 0 1 5 2237 13005 POLYGON ((-44.11209517917457 76.93655561966702...

Now, we can check which Polygons of the WRS-2 system intersects with the bounds vector.

wrs_intersection = wrs[wrs.intersects(bounds.geometry[0])]

Also, let's put the calculated paths and rows in a variable.

paths, rows = wrs_intersection['PATH'].values, wrs_intersection['ROW'].values

Create a map to visualize the paths and rows that intersects.

Click on the Polygon to get the path and row value.
# Get the center of the map
xy = np.asarray(bounds.centroid[0].xy).squeeze()
center = list(xy[::-1])

# Select a zoom
zoom = 6

# Create the most basic OSM folium map
m = folium.Map(location=center, zoom_start=zoom, control_scale=True)

# Add the bounds GeoDataFrame in red
m.add_child(folium.GeoJson(bounds.__geo_interface__, name='Area of Study', 
                           style_function=lambda x: {'color': 'red', 'alpha': 0}))

# Iterate through each Polygon of paths and rows intersecting the area
for i, row in wrs_intersection.iterrows():
    # Create a string for the name containing the path and row of this Polygon
    name = 'path: %03d, row: %03d' % (row.PATH, row.ROW)
    # Create the folium geometry of this Polygon 
    g = folium.GeoJson(row.geometry.__geo_interface__, name=name)
    # Add a folium Popup object with the name string
    # Add the object to the map


Let's use boolean to remove the paths and rows that intersects just a tiny bit with the area. (paths higher than 26 and lower than 23). This could be done by a threshold of intersection area.

b = (paths > 23) & (paths < 26)
paths = paths[b]
rows = rows[b]

We end up with four pairs of path and row values that would give images covering the area of study.

for i, (path, row) in enumerate(zip(paths, rows)):
    print('Image', i+1, ' - path:', path, 'row:', row)
Image 1  - path: 25 row: 27
Image 2  - path: 25 row: 28
Image 3  - path: 24 row: 28
Image 4  - path: 24 row: 27

Checking Available Images on Amazon S3 & Google Storage

Google and Amazon provides public access to Landsat images.

We can get a DataFrame of available scenes to download in each server using the urls below. The Amazon S3 table has ~20 MB of rows describing the existing data while the Google Storage table has ~500 MB.

s3_scenes = pd.read_csv('', compression='gzip')
# google_scenes = pd.read_csv('', compression='gzip')

First 3 entries:

productId entityId acquisitionDate cloudCover processingLevel path row min_lat min_lon max_lat max_lon download_url
0 LC08_L1TP_149039_20170411_20170415_01_T1 LC81490392017101LGN00 2017-04-11 05:36:29.349932 0.00 L1TP 149 39 29.22165 72.41205 31.34742 74.84666
1 LC08_L1TP_012001_20170411_20170415_01_T1 LC80120012017101LGN00 2017-04-11 15:14:40.001201 0.15 L1TP 12 1 79.51504 -22.06995 81.90314 -7.44339
2 LC08_L1TP_012002_20170411_20170415_01_T1 LC80120022017101LGN00 2017-04-11 15:15:03.871058 0.38 L1TP 12 2 78.74882 -29.24387 81.14549 -15.04330

Now, let's use the previous defined paths and rows intersecting the area and see if there are any available Landsat-8 images in the Amazon S3 storage.

Also, we can select only the images that have a percentage of cloud covering the image less than 5%.

We also want to exclude the product that end with _T2 and _RT, since these are files that must go through calibration and pre-processing. More info!

# Empty list to add the images
bulk_list = []

# Iterate through paths and rows
for path, row in zip(paths, rows):

    print('Path:',path, 'Row:', row)

    # Filter the Landsat Amazon S3 table for images matching path, row, cloudcover and processing state.
    scenes = s3_scenes[(s3_scenes.path == path) & (s3_scenes.row == row) & 
                       (s3_scenes.cloudCover <= 5) & 
                       (~s3_scenes.productId.str.contains('_T2')) &
    print(' Found {} images\n'.format(len(scenes)))

    # If any scenes exists, select the one that have the minimum cloudCover.
    if len(scenes):
        scene = scenes.sort_values('cloudCover').iloc[0]

    # Add the selected scene to the bulk download list.
Path: 25 Row: 27
 Found 2 images

Path: 25 Row: 28
 Found 6 images

Path: 24 Row: 28
 Found 3 images

Path: 24 Row: 27
 Found 6 images

Check the four images that were selected.

bulk_frame = pd.concat(bulk_list, 1).T
productId entityId acquisitionDate cloudCover processingLevel path row min_lat min_lon max_lat max_lon download_url
679 LC08_L1TP_025027_20170406_20170414_01_T1 LC80250272017096LGN00 2017-04-06 16:45:24.051124 1.51 L1TP 25 27 46.3107 -91.0755 48.4892 -87.8023
226329 LC08_L1TP_025028_20170913_20170928_01_T1 LC80250282017256LGN00 2017-09-13 16:46:22.160327 0.06 L1TP 25 28 44.8846 -91.5385 47.0651 -88.3262
255166 LC08_L1TP_024028_20171008_20171023_01_T1 LC80240282017281LGN00 2017-10-08 16:40:21.299100 0.23 L1TP 24 28 44.9159 -89.9783 47.0759 -86.858
127668 LC08_L1TP_024027_20170720_20170728_01_T1 LC80240272017201LGN00 2017-07-20 16:39:33.524611 3.11 L1TP 24 27 46.3429 -89.4354 48.5005 -86.2616

Last step is to download all files in the server for the 4 images, including metadata and QA. Each product for it's own folder.

# Import requests and beautiful soup
import requests
from bs4 import BeautifulSoup

# For each row
for i, row in bulk_frame.iterrows():

    # Print some the product ID
    print('\n', 'EntityId:', row.productId, '\n')
    print(' Checking content: ', '\n')

    # Request the html text of the download_url from the amazon server. 
    # download_url example:
    response = requests.get(row.download_url)

    # If the response status code is fine (200)
    if response.status_code == 200:

        # Import the html to beautiful soup
        html = BeautifulSoup(response.content, 'html.parser')

        # Create the dir where we will put this image files.
        entity_dir = os.path.join(LANDSAT_PATH, row.productId)
        os.makedirs(entity_dir, exist_ok=True)

        # Second loop: for each band of this image that we find using the html <li> tag
        for li in html.find_all('li'):

            # Get the href tag
            file = li.find_next('a').get('href')

            print('  Downloading: {}'.format(file))

            # Download the files
            # code from:

            response = requests.get(row.download_url.replace('index.html', file), stream=True)

            with open(os.path.join(entity_dir, file), 'wb') as output:
                shutil.copyfileobj(response.raw, output)
            del response
EntityId: LC08_L1TP_025027_20170406_20170414_01_T1 

 Checking content:  

  Downloading: LC08_L1TP_025027_20170406_20170414_01_T1_B9.TIF.ovr
  Downloading: LC08_L1TP_025027_20170406_20170414_01_T1_B11.TIF
  Downloading: LC08_L1TP_025027_20170406_20170414_01_T1_B11.TIF.ovr

  Downloading: LC08_L1TP_025027_20170406_20170414_01_T1_B1.TIF.ovr

EntityId: LC08_L1TP_025028_20170913_20170928_01_T1 

  Checking content:  

  Downloading: LC08_L1TP_025028_20170913_20170928_01_T1_B4_wrk.IMD

Let's plot the raw images for the thermal band 10.

import rasterio
from rasterio.transform import from_origin
from rasterio.warp import reproject, Resampling
import matplotlib.pyplot as plt
import as ccrs
import numpy as np
%matplotlib inline

Read the rasters using rasterio and extract the bounds.

xmin, xmax, ymin, ymax = [], [], [], []

for image_path in glob(os.path.join(LANDSAT_PATH, '*/*B10.TIF')):
    with as src_raster:

Reproject and plot it.

fig, ax = plt.subplots(1, 1, figsize=(20, 15), subplot_kw={'projection': ccrs.UTM(16)})

ax.set_extent([min(xmin), max(xmax), min(ymin), max(ymax)], ccrs.UTM(16))

bounds.plot(ax=ax, transform=ccrs.PlateCarree())

for image_path in glob(os.path.join(LANDSAT_PATH, '*/*B10.TIF')):

    with as src_raster:

        extent = [src_raster.bounds[i] for i in [0, 2, 1, 3]]  

        dst_transform = from_origin(src_raster.bounds.left,, 250, 250)

        width = np.ceil((src_raster.bounds.right - src_raster.bounds.left) / 250.).astype('uint')
        height = np.ceil(( - src_raster.bounds.bottom) / 250.).astype('uint')

        dst = np.zeros((height, width))

        reproject(, dst, 

        ax.matshow(, 0), extent=extent, transform=ccrs.UTM(16))



That's it, I download all the images that cover the area of study. Next post, I will go step-by-step on how to derivate the Top of Atmosphere (TOA) radiance, reflectance and brightness temperature.

Share on: