If you would like to work with AIS Data on NetApp Ocean for Apache Spark, please request your credentials as soon as possible using this form.

You will need to request and receive credentials via the form before you can access the resources.

Please check back soon as additional documentation will be added!

You can use the following entry point for the AIS Data on NetApp Ocean for Apache Spark platform during the Datathon.



You should have filled in the form above. Once you have, drop a message in our Discord tech support channel so that someone can create your account in Spot.

Access overview

The AIS data is accessible within a JupyterHub environment running in a Kubernetes cluster. When you log into the JupyterHub platform and start a Jupyter Notebook session, you are allocated an individual workspace called a "pod", which runs inside that cluster.

Every pod is provisioned with its own set of resources, which includes CPU and memory. This means that each user's pod has its own dedicated resources that aren't shared with other users. This architecture ensures that one user's activities don't impact another user's performance, even if both are operating simultaneously.

To access the environment, navigate to Spot | Console (spotinst.com) and attempt to reset your password. You should receive login credentials via email. 

Launching a notebook

To launch a notebook, navigate to Spot | Console (spotinst.com) and click "Create workspace". After a short wait, you should see the option to connect to your workspace in JupyterHub:

This will launch your Jupyter notebook. Select the kernel "extra". This kernel creates a Spark driver pod that powers your notebook environment, and each new notebook spawns its own driver pod. Please be judicious about the number of notebooks you open, as each Spark driver incurs a usage cost for the compute resources allocated.

Occasionally, the initial notebook launch fails to create a kernel. If this happens, it will raise a 504 error; restart the kernel and select "extra" again. This is a known issue with the Jupyter environment and typically resolves itself on the second or third try.

Saving data

You have access to a temporary bucket for saving data. The bucket follows a write-once-read-many model, so all data in the bucket can be read by anyone. Here's an example of how to do this.

Save Data to S3 Bucket
# be sure to run !pip install boto3 in your notebook

import os
import tempfile
import boto3

BUCKET_NAME = 'datathon-user-bucket-147546773234'

def create_fake_file():
    with tempfile.NamedTemporaryFile(delete=False) as temp_file:
        temp_file.write(b'This is some fake data for testing.')
        return temp_file.name

def upload_file_to_s3(file_path, bucket_name):
    s3 = boto3.client('s3')
    file_name = os.path.basename(file_path)
    try:
        s3.upload_file(file_path, bucket_name, file_name)
        print(f"File {file_name} uploaded to {bucket_name}.")
    except Exception as e:
        print(f"Error uploading file {file_name} to {bucket_name}: {e}")

def main():
    fake_file_path = create_fake_file()
    upload_file_to_s3(fake_file_path, BUCKET_NAME)

main()


Clean Up Unused Kernels after Failed Notebook Launches

Failed launch attempts are common when creating a notebook. This is not a problem in itself; however, each launch attempt creates a Spark driver pod that can become "orphaned". If this happens, navigate to the applications dashboard and look for the jobs under your name, then delete the older applications. The number of Spark applications running should equal the number of Jupyter notebooks you have open.

Accessing the AIS data

Below you will find various example notebooks on how to access the AIS data. Things to note overall: 

  1. The data are available in an s3 bucket that is only accessible from within the UNGP network. Using the notebook environment, you will be able to read data from this bucket without using AWS credentials. 
  2. In the example notebooks, you will find Python code snippets to access and read the data directly from our s3 buckets.
  3. Simply run the cells with these code snippets to load the data into your Jupyter Notebook.
  4. The total volume of s3 data exceeds 15 TB. To analyze this properly, you will need to use Spark. 
  5. Attempting to load all of this data into your notebook environment, either naively or inadvertently by insufficiently aggregating results on the executor nodes, will cause your notebook to fail and require a restart. This is not a bug; it is a fundamental constraint that is typically faced when using big data in a Spark environment.

Stopping your Spark driver to reduce overall resource usage

When you are done with your analysis, please run spark.stop() to terminate the Spark driver and reduce your resource usage. Efficient use of resources will be a consideration in the overall judging criteria.

Frequent Issues

Limitations on data uploads in Spot Workspaces

There appears to be a limit on upload size in the Spot Workspace, so uploading anything beyond trivially small files is difficult. If you need to upload data, upload it to our legacy JupyterHub environment and use the methods above to push it to s3; once it is in s3, you can pull it back into the Spot Workspace environment.

Do not use the hackathon-notebooks.officialstatistics.org endpoint

Due to technical issues, we encourage all users to use the Spot Workspaces rather than the legacy Jupyter notebook environment maintained for the hackathon.