If you would like to work with AIS Data on NetApp Ocean for Apache Spark, please request your credentials as soon as possible using this form.
You will need to request and receive credentials via the form before you can access the resources.
Please check back soon as additional documentation will be added!
You can use the following entry point for the AIS Data on NetApp Ocean for Apache Spark platform during the Datathon.
You should have filled in the form above. Once you have, drop a message in our Discord tech support channel to have someone create your account in Spot.
The AIS data is accessible within a JupyterHub environment running in a Kubernetes cluster. When you log into the JupyterHub platform and start a Jupyter Notebook session, you are allocated an individual workspace called a "pod", which runs inside that cluster.
Every pod is provisioned with its own set of resources, which includes CPU and memory. This means that each user's pod has its own dedicated resources that aren't shared with other users. This architecture ensures that one user's activities don't impact another user's performance, even if both are operating simultaneously.
To access the environment, navigate to the Spot | Console (spotinst.com) and reset your password. You should then receive login credentials via email.
Launching a notebook
To launch a notebook, navigate to the Spot | Console (spotinst.com) and click "Create workspace". This will launch a workspace. After a short wait, you should see the option to connect to your workspace in JupyterHub.
This will launch your Jupyter notebook. Select the kernel "extra". This kernel creates a Spark driver pod that powers your notebook environment. Each notebook you create starts a new Spark driver pod, so please be judicious in the number of notebooks you open: each Spark driver incurs a usage cost for the compute resources allocated to it.
Occasionally, the initial notebook launch fails to create a kernel. If this happens, it raises a 504 error, and you should restart the kernel and select "extra" again. This is a known issue with the Jupyter environment and typically resolves itself on the second or third try.
You have access to a temp bucket to save data. The bucket follows a write-once-read-many model, so all data in the bucket can be read by anyone. Here's an example of how to do this.
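A minimal sketch of saving a result to the temp bucket, assuming the notebook's Spark session can write to it over the s3a:// scheme. The bucket name TEMP_BUCKET, the per-user prefix layout, and the helper names temp_path and save_once are hypothetical placeholders, not the actual values for this platform, and Parquet is chosen only for illustration.

```python
import posixpath

# Hypothetical placeholder: replace with the temp bucket name you were given.
TEMP_BUCKET = "your-temp-bucket"


def temp_path(user: str, dataset: str, bucket: str = TEMP_BUCKET) -> str:
    """Build an s3a:// path under a per-user prefix (an assumed layout)."""
    for part in (user, dataset):
        if not part or "/" in part:
            raise ValueError(f"invalid path component: {part!r}")
    return "s3a://" + posixpath.join(bucket, user, dataset)


def save_once(df, path: str) -> None:
    """Write a Spark DataFrame as Parquet, failing if the path already exists.

    mode("errorifexists") makes the write-once intent explicit: a second
    write to the same path raises instead of replacing the data.
    """
    df.write.mode("errorifexists").parquet(path)


# Example usage in a notebook, where result_df is a Spark DataFrame:
# save_once(result_df, temp_path("alice", "port_calls"))
```

Because anyone with access can read the bucket, avoid writing anything sensitive to it.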
Clean Up Unused Kernels after Failed Notebook Launches
Notebook launches often fail on the first attempt. This is not a problem in itself; however, each launch attempt creates a Spark driver node that can become "orphaned". If this happens, navigate to the applications dashboard and look for the jobs with your name. Delete the older applications created under your name. The number of Spark applications running should equal the number of Jupyter notebooks you've created.
Accessing the AIS data
Below you will find various example notebooks on how to access the AIS data. Things to note overall:
- The data are available in an S3 bucket that is only accessible from within the UNGP network. Using the notebook environment, you will be able to read data from this bucket without using AWS credentials.
- In the example notebooks, you will find Python code snippets to access and read the data directly from our S3 buckets.
- Simply run the cells with these code snippets to load the data into your Jupyter Notebook.
- The total volume of S3 data exceeds 15 TB. To analyze this properly, you will need to use Spark.
- Attempting to load all this data into your notebook environment, either naively or inadvertently by not aggregating results sufficiently on the executor nodes, will cause your notebook to fail, and you will need to restart it. This is not a bug; it is a fundamental feature of Spark and one of the constraints typically faced when working with big data in a Spark environment.
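The last point can be sketched as follows: reduce the data on the executors first, then bring only a bounded result back to the notebook. This is an illustration under assumptions, not the method used in the example notebooks. The helper names top_counts and collect_bounded, the column name mmsi, and the row cap are all hypothetical; check the example notebooks for the actual schema and paths.

```python
def top_counts(df, column: str, n: int = 20):
    """Count rows per value of `column` on the executors and keep the top n.

    The result has at most n rows, so it is cheap to bring to the driver.
    """
    return df.groupBy(column).count().orderBy("count", ascending=False).limit(n)


def collect_bounded(df, max_rows: int = 10_000):
    """Collect a DataFrame to the driver only if it has at most max_rows rows.

    Fetching max_rows + 1 rows is enough to detect an oversized result
    without pulling the whole dataset into the notebook's memory.
    """
    rows = df.limit(max_rows + 1).collect()
    if len(rows) > max_rows:
        raise ValueError(
            f"result exceeds {max_rows} rows; aggregate or filter further "
            "before collecting"
        )
    return rows


# Example usage in a notebook, where ais_df is a Spark DataFrame of AIS
# records and "mmsi" is an assumed column name:
# busiest = collect_bounded(top_counts(ais_df, "mmsi", n=20))
```

The guard turns an accidental full collect into an immediate, readable error instead of a notebook that dies and has to be restarted.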
Stopping your Spark driver to reduce overall resource usage
When you are done with your analysis, please run spark.stop() to terminate this Spark driver and reduce your resource usage. Efficient use of resources will be a consideration in overall judging criteria.
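One way to make sure spark.stop() runs even when a cell raises an error is to wrap the analysis in a small context manager. This is a sketch, assuming spark is the session object available in your notebook; stopping is a hypothetical helper name, not part of PySpark.

```python
from contextlib import contextmanager


@contextmanager
def stopping(spark):
    """Yield the session and always call spark.stop() on exit."""
    try:
        yield spark
    finally:
        # Runs on normal completion and when the body raises.
        spark.stop()


# Example usage in a notebook, where `spark` is the active session:
# with stopping(spark) as s:
#     s.sql("SELECT 1").show()
```

Once the session is stopped, the notebook can no longer run Spark operations on it, so use this only around the final stage of your analysis.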
Limitations on data uploads in Spot Workspaces
There appears to be a limit on upload size in the Spot Workspace, which makes it hard to upload anything beyond trivially small files. If you want to upload data, you can upload it to our legacy JupyterHub environment and then use the methods above to push it to S3; once it's in S3, you should be able to pull it back into the Spot Workspace environment.
Do not use the hackathon-notebooks.officialstatistics.org endpoint
Due to technical issues, we are encouraging all users to use the Spot Workspaces and not the legacy Jupyter notebook environment maintained for the hackathon.