Map Sentinel-2 with Soil Properties: Step by Step Guide

7 min readMar 10, 2024

All codes and files are available in https://github.com/arf-themascoteers/satsoc. Throughout the article, this link will be referred to as “my GitHub link”.

Steps:

Download a freely available georeferenced soil dataset.
Prepare a bounding box of the GPS Coordinates.
Download Sentinel-2 data using the bounding box.
Integrate Sentinel-2 band data and soil information through GPS coordinates mapping.

Quick Note:

Although Sentinel-2 Multispectral data is freely available and there is a very nice logical sequence of mapping any georeferenced data with Sentinel-2 images, as of today, there is no commonly available straightforward way to automate all the steps. Most of the researchers follow a mixture of automated and manual process. Although some of the manual works shown in this article could be accomplished with API to some extent, that is not a very accessible way for newcomers. Therefore, I have outlined here the steps I follow most of the time.

Step-1: Download a freely available georeferenced soil dataset

First, we need a georeferenced dataset, e.g., a CSV file with 3 columns: a soil property and the corresponding latitude and longitude.

For example:

Sample Georeferenced Soil Property Dataset

Here, X and Y represent the longitude and latitude of a particular position, respectively, and SOCc represents a particular soil property, the Soil Organic Carbon content, at that position.

We will download a freely available dataset and extract there such columns. I have used here the dataset used in the paper “Multi-site evaluation of stratified and balanced sampling of soil organic carbon stocks in agricultural fields”, which is available in: https://figshare.com/articles/dataset/Data_for_Multi-site_evaluation_of_stratified_and_balanced_sampling_of_soil_organic_carbon_stocks_in_agricultural_fields_/23669304. We need locations.csv and measurements.csv, which are available on this link (or find them on my GitHub link).

So, we need at least SOCc, X, and Y columns in a single CSV file. Therefore, we need to join/merge measurements.csv and locations.csv based on the location_id column. 1_merge_csvs.py on my GitHub link does this work:

import pandas as pd

locations_df = pd.read_csv("locations.csv")
measurements_df = pd.read_csv("measurements.csv")

locations_df_columns = ["location_id","X","Y"]
measurements_df_columns = ["site", "location_id","sample_depth_min", "SOCc"]

locations_df = locations_df[locations_df_columns]
measurements_df = measurements_df[measurements_df_columns]

complete_df = pd.merge(locations_df, measurements_df, on='location_id', how='inner')
complete_df.to_csv("complete.csv",index=False)

Here, first we filter only the columns we need from each files and then on the second last line, we merge them based on the “location_id” column. Finally, we save them in “complete.csv”, which looks like this:

We retained “site” and “sample_depth_min” columns to further shorten the area. This particular dataset includes measurements from different areas at six different soil depths. In this exercise, we will only consider the topsoil (meaning, where sample_depth_min = 0). For Sentinel-2 (or any satellite data in general), you cannot download data/image for a very large area at once (which is the main barrier in automating the end to end mapping processes). Therefore, we need to reduce the area/scope of the samples. For this purpose, we will choose only one site, say “IL-BR”. The filtering process is implemented in 2_filter_region_depth.py on my GitHub link.

import pandas as pd

complete_df = pd.read_csv("complete.csv")
il_br_df = complete_df[(complete_df["site"] == "IL-BR") & (complete_df["sample_depth_min"] == 0)]
il_br_df.to_csv("IL-BR.csv", index=False)

The filtered file “IL-BR.csv”:

The CSV after reducing the area and choosing only topsoil data

Now, let us remove the unnecessary columns, which is done in 3_cleanup.py on my GitHub link:

import pandas as pd

complete_df = pd.read_csv("IL-BR.csv")
filtered_columns = ["location_id", "X", "Y", "SOCc"]
complete_df = complete_df[filtered_columns]
complete_df.to_csv("cleaned.csv", index=False)

The clean up file cleaned.csv:

Step 2: Prepare a bounding box of the GPS Coordinates.

From the paper related to the dataset, I found that they measured the SOCc data in June 2019. Since SOCc usually does not change overnight, we may use satellite data from any cloud-free day in July 2019.

We will download Sentinel-2 multispectral data from Copernicus Browser (https://browser.dataspace.copernicus.eu/). After you register and then login there, you will find this interface:

Choose a date on July 2019 (say, 10th July 2019) in the yellow-marked box. Set cloud-percentage 0 in the blue-marked box. Then choose an available date when zero-cloud image is available (I chose 10th July 2019). At the bottom-left in the green-marked box, choose Sentinel-2 L2A.

ask Google or ChatGPT if you are curious about the difference between L1C and L2A.

Now, you could manually draw a rectangle to target an area to download image for, or at the upper-right corner in the red-marked box, click the button that looks like upload to upload a geospatial file (for example, kml file) to select the area of interest. We will follow the second approach.

You could manually build a KML file creating the bounding box on a GIS software by identifying the boundary points (top-most, bottom-most, left-most, right-most). Or, we can do that with Python, which is done in 4_make_kml.py on my GitHub link:

import pandas as pd
import simplekml

df = pd.read_csv("cleaned.csv")
lats = list(df["Y"])
longs = list(df["X"])
coordinates = []

for i in range(len(lats)):
    coordinates.append((lats[i], longs[i]))

min_lat, min_lon = min(coordinates, key=lambda x: x[0])[0], min(coordinates, key=lambda x: x[1])[1]
max_lat, max_lon = max(coordinates, key=lambda x: x[0])[0], max(coordinates, key=lambda x: x[1])[1]

kml = simplekml.Kml()
polygon = kml.newpolygon(name="Bounding Rectangle",
                         outerboundaryis=[(min_lon, min_lat), (max_lon, min_lat), (max_lon, max_lat),
                                          (min_lon, max_lat), (min_lon, min_lat)])

kml.save("kml.kml")

It will create a file named kml.kml with the bounding box. If you open the file on a text editor (you don not have to for this exercise), you will see:

<?xml version="1.0" encoding="UTF-8"?>
<kml xmlns="http://www.opengis.net/kml/2.2" xmlns:gx="http://www.google.com/kml/ext/2.2">
    <Document id="1">
        <Placemark id="3">
            <name>Bounding Rectangle</name>
            <Polygon id="2">
                <outerBoundaryIs>
                    <LinearRing id="4">
                        <coordinates>-90.19070581290642,39.05418703227648,0.0 -90.18350060015084,39.05418703227648,0.0 -90.18350060015084,39.057445340783104,0.0 -90.19070581290642,39.057445340783104,0.0 -90.19070581290642,39.05418703227648,0.0</coordinates>
                    </LinearRing>
                </outerBoundaryIs>
            </Polygon>
        </Placemark>
    </Document>
</kml>

Step 3: Download Sentinel-2 data using the bounding box

Let us go back to Copernicus Browser, click on the upload button, and upload kml.kml. You will see:

Copernicus Browser: Region of Interest Selection with KML File

As shown in the yellow box, it selected a small area of only 0.23 square kilometers, which is okay for downloading. Click the download image button in the red-marked box. You will find this dialog box:

Go to “Analytical” tab.

From Image format drop-down, choose “Tiff (32-bit float)”.

From Image resolution drop-down, choose “Custom” (most of the time, selecting “High” is fine provided that the size of the selected area is reasonable).

Leave “10” for Resolution X (m/px) and Resolution X (m/px).

Leave “WGS 84 (EPSG:4326)” for Coordinate systems.

Check all the Raw bands at the right side.

Optionally, check a few “Visualized” items at the left side only for visualization purposes.

Click the download button at the bottom.

Copernicus Browser: Analytical Data Selection

A zip file named Browser_image.zip swill be downloaded. Extract the zip.

I had trouble extracting with default windows software. With 7-zip, it worked fine.

You should see these files in extracted Browser_images folder.

The 12 grayscale images are the RAW band data (that you selected at the right side) and the colour images represent different visualizations.

Step 4: Integration of Sentinel-2 band data and soil information through GPS coordinates mapping

Now, for each sample in cleaned.csv, we have to:

Take the latitude and longitude
For a particular band, say B01, locate the corresponding file, which is 2019–06–10–00_00_2019–06–10–23_59_Sentinel-2_L2A_B01_(Raw).tiff.
Calculate the x (row) and y (col) coordinate of the pixel in that tiff file for that latitude and longitude.
Save the value of that pixel for that particular sample as “B01” band value.
Repeat steps 2–4 for 11 other raw bands.
Save everything on a csv file.

It has been done in 5_map_soc_bands.py on my GitHub link. Before running the it, I renamed Browser_images to sat, and placed in my project home folder. 5_map_soc_bands.py:

import pandas as pd
import rasterio as rio
import os

SCENE_PATH = "sat"
SOURCE_PATH = "cleaned.csv"
FINAL_CSV = "final.csv"
BANDS = ["B01",
         "B02",
         "B03",
         "B04",
         "B05",
         "B06",
         "B07",
         "B8A",
         "B08",
         "B09",
         "B11",
         "B12"
         ]


def find_tiff(band):
    for tiff in os.listdir(SCENE_PATH):
        if band in tiff:
            return tiff
    return None


def get_soc_rows():
    df = pd.read_csv(SOURCE_PATH)
    coordinates = []
    for index, row in df.iterrows():
        point_id = row["location_id"]
        oc = row["SOCc"]
        lon = row["X"]
        lat = row["Y"]
        coordinates.append((point_id, lon, lat, oc))
    return coordinates


def get_band_values(band):
    print(f"Processing {band}")
    tiff = find_tiff(band)
    tiff_path = os.path.join(SCENE_PATH, tiff)
    soc_rows = get_soc_rows()
    pixels = []
    with rio.open(tiff_path) as src:
        for point_id, lon, lat, oc in soc_rows:
            row, col = src.index(lon, lat)
            pixel_value = src.read(1, window=((row, row + 1), (col, col + 1)))
            if pixel_value.shape[0] != 0 and pixel_value.shape[1] != 0:
                pixels.append([point_id, oc, pixel_value[0][0]])
    return pixels


def merge_bands(all_bands_pixels_df, band, band_pixels):
    band_df = pd.DataFrame(band_pixels, columns=["location_id","SOCc",band])
    if all_bands_pixels_df is None:
        return band_df
    band_df = band_df[["location_id",band]]
    all_bands_pixels_df = pd.merge(all_bands_pixels_df, band_df, on="location_id")
    return all_bands_pixels_df


def process_scene():
    all_bands_pixels_df = None
    for band in BANDS:
        band_pixels = get_band_values(band)
        all_bands_pixels_df = merge_bands(all_bands_pixels_df, band, band_pixels)

    all_bands_pixels_df.to_csv(FINAL_CSV, index=False)


if __name__ == "__main__":
    process_scene()

It will produce the ultimate mapped file final.csv, where for each sample, you have SOCc ground truth and 12 raw band values:

final.csv: Mapped Sentinel-2 Band and SOCc data

Now, you can use this file for statistical analyses or process it with machine learning algorithms.

Map Sentinel-2 with Soil Properties: Step by Step Guide

Written by Arif