Using podaacpy to subset Level-2P satellite swath data
This post is a quick demonstration of how to use the awesome podaacpy python package to access Level-2 satellite data on PO.DAAC. Note that only a few datasets have been updated to include the information necessary for subsetting. There is a list somewhere but I cannot find it anymore.
The approach here is:
- Query the server for all files in a dataset within a given bounding box;
- Parse the server response for URLs;
- Save those URLs to a file;
- Use xargs and wget to download the files in parallel.
1. Preliminaries
First, the required imports and initializing a Drive instance. You will need PO.DAAC credentials for this; they can be specified either in podaac.ini or passed explicitly as done here.
Docs here: https://podaacpy.readthedocs.io/en/latest/drive.html#drive.Drive
```python
import pdb
import podaac as po
import podaac.podaac as podaac
import podaac.podaac_utils as utils
from podaac import drive as drive
import tqdm
import warnings
from xml.dom import minidom

d = drive.Drive(None, username=YOUR_USERNAME, password=YOUR_PASSWORD)
p = podaac.Podaac()
u = utils.PodaacUtils()
```
2. Asking the server for results
Now we tell podaacpy the dataset and bounding box for the subset we want. The dataset_id can be found by looking for “Persistent ID” at the bottom of the PO.DAAC description page for the dataset. For example, see https://podaac.jpl.nasa.gov/dataset/MODIS_A-JPL-L2P-v2019.0
This dictionary specifies the dataset and the space/time bounds for the subset we want.
```python
# dataset_id is "Persistent ID" at bottom of PO.DAAC page
# if bbox is specified, need the T...Z in start_time, end_time
kwargs = dict(
    dataset_id="PODAAC-GHMDA-2PJ19",  # Aqua v2019
    start_time="2018-09-20T00:00:00Z",
    end_time="2018-10-01T00:00:00Z",
    bbox="-125.5,40,-123.5,50",
    items_per_page="400",  # 400 seems to be max
)
```
p.granule_search asks PO.DAAC to search for all files matching the request.
```python
result = p.granule_search(**kwargs, start_index="1")

try:
    doc = minidom.parseString(result)
    num_granules = int(
        doc.getElementsByTagName("opensearch:totalResults")[0].firstChild.nodeValue
    )
except Exception:
    # Print the raw response so we can see what went wrong, then bail out.
    print(result)
    raise ValueError("Decoding response failed. Printing response above...")

print(f"Found {num_granules} granules.")
```
If all goes well, this should print the total number of granules (or files) the request found.
The response from the server is in XML, which can be painful to deal with (if you don’t know what you’re doing, like me) and looks like this: https://podaac.jpl.nasa.gov/ws/search/granule/?datasetId=PODAAC-GHMDA-2PJ19. For each file, the response contains an OPeNDAP URL, which is what we want.
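If you want to look at the raw response without leaving Python, one optional way is to pretty-print it with minidom. This snippet is just an illustration and is not needed for the rest of the workflow.

```python
# Optional: pretty-print the start of the raw XML response for inspection.
# Truncated to keep the output manageable.
print(minidom.parseString(result).toprettyxml(indent="  ")[:2000])
```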
3. Parse the XML response to find OPeNDAP URLs
Luckily podaacpy has two very useful functions to deal with the XML response: mine_granules_from_granule_search and mine_opendap_urls_from_granule_search.
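As a minimal sketch, here is what mining a single page of results looks like, using the result from the granule_search call above; the full paginated loop follows below.

```python
# Minimal sketch: mine granule names and OPeNDAP URLs from one page of results.
names = u.mine_granules_from_granule_search(result)
urls = u.mine_opendap_urls_from_granule_search(result)
print(names[0], urls[0])
```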
Now we loop through and parse the response page by page. This is because the server will only respond with one page of results at a time, so we call granule_search in a loop and pass it a different start_index each time. We then generate file names and get OPeNDAP URLs.
I am only interested in nighttime passes, so I added a simple function filter_night to pull these out based on the file name.
```python
def filter_night(string):
    return "MODIS_A-N-" in string


nitems = int(kwargs["items_per_page"])

name_list = []
url_list = []
for start in range(1, num_granules + 1, nitems):
    for attempt in range(1, 11):
        print(f"{start}, {attempt}")
        result = p.granule_search(**kwargs, start_index=str(start))

        names = u.mine_granules_from_granule_search(result)
        urls = u.mine_opendap_urls_from_granule_search(result)

        if start + len(names) != num_granules:
            try:
                assert len(names) == nitems
                assert len(urls) == nitems
            except AssertionError:
                print(
                    f"\n{len(names)} < {nitems} items returned. retrying attempt {attempt}..."
                )
            else:
                break
        else:
            break
    else:
        warnings.warn("Invalid data returned even after 10 attempts.", UserWarning)
        pdb.set_trace()

    names = list(filter(filter_night, names))
    urls = list(filter(filter_night, urls))

    name_list += names
    url_list += urls
```
Now we have a list of file names name_list and a list of URLs url_list. We build a download command from these two and write that to a text file. Note that we can subset the file to keep the variables we want at this point. I found it useful to actually open one of the OPeNDAP URLs in a browser and click around to figure out what variables are available.
```python
def make_wget_str(url, name):
    """Chooses netCDF4 and subsets to needed variables."""
    return (
        f"{url[:-5]}.nc4?lat,lon,time,sea_surface_temperature,quality_level,l2p_flags"
        f" -O 'modis/{name}'"
    )


with open("url-list.txt", "w") as f:
    f.write("\n".join(map(make_wget_str, sorted(url_list), sorted(name_list))))
```
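Before downloading anything, it can be worth sanity-checking one of the generated lines; this quick check is my addition, not part of the original workflow.

```python
# Optional sanity check: inspect the first generated wget argument string
# to confirm the OPeNDAP subset URL and output path look right.
print(make_wget_str(sorted(url_list)[0], sorted(name_list)[0]))
```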
4. Download!
Finally, we can download the URLs in this file in parallel using this xargs command. -P 4 specifies 4 parallel processes.
```bash
xargs -P 4 --replace --verbose --arg-file=url-list.txt /bin/sh -c "wget {}"
```
Each line is read and inserted at the location of {}, so there is one wget command per line in url-list.txt.