The following code is to be used in a Jupyter notebook

Port Polygons are important for generating statistics based on geospatial filtering e.g. port visits, anchoring patterns, etc. There exists no standard for generating polygons at the microdetail needed by most applications. This project aims at building a standard algorithm/method for self generating berth (an area where the cargo is loaded or discharged on and off the ships) polygons by using AIS positions and port location data.

Disclaimer: the designations used in the geographical data in this codebook do not imply official endorsement or acceptance by the United Nations.

Method develop by the ML Group.

Correspondence: -Gabriel Fuentes. Norwegian School of Economics

1) Setup

1.1) Set up the kernel

The kernel is the part of the backend responsible for executing code written by the user in the web application. It is set to communicate with the cloud machinery and hold the settings for the Python interpreter and the virtual environment to be used.

For this cookbook, we choose the kernel "ais-tt" which contains necessary packages with pre-defined settings and has additional Spark configuration, including Amazon Web Services (AWS) credentials.

1.2) Connect to GitLab and install pre-defined modules for AIS handling

Handling of AIS data involves some repetitive tasks, such as reading AIS based on filtered selection and performing spatial joins. We can use modules created by the UN Committee of Experts on Big Data and Data Science for Official Statistics (CEBD) AIS task team (developed by Cherryl Chico and the AIS Task Team) to simplify this process.

To connect to their GitLab repository, you'll need both a unique User and Token. In the example below, we connect to the AIS Task Team via this method and install the "AIS" module stored in their repository.

In [1]:

import sys
import subprocess

In [2]:
# Set the GitLab username; this is a read-only user
GITLAB_USER = "read_aistt"

# Set the GitLab token for authentication

# Construct the Git package URL using the GitLab user and token
git_package = f"git+https://{GITLAB_USER}:{GITLAB_TOKEN}"

# Use subprocess to run the pip install command for the Git package; this simulates running a pip install in the terminal
std_out =[sys.executable, "-m", "pip", "install", git_package], capture_output=True, text=True).stdout

# Print the output of the pip install command
Collecting git+https://read_aistt:****
  Cloning https://read_aistt:**** to /tmp/pip-req-build-s06m2kcn
  Resolved https://read_aistt:**** to commit b326480f684c057cf11f95b9a1ff6c17cbca495e
  Preparing metadata ( started
  Preparing metadata ( finished with status 'done'
Building wheels for collected packages: ais
  Building wheel for ais ( started
  Building wheel for ais ( finished with status 'done'
  Created wheel for ais: filename=ais-2.8.1-py3-none-any.whl size=11240 sha256=c3e9909e68ed3ad9156b897cfdce8e1d0ff2017e5ca66b43c88cb4382335e2e9
  Stored in directory: /tmp/pip-ephem-wheel-cache-hcjb1o78/wheels/6d/8c/5e/19898a2b930f8efa2ef2e6ecc8ef48797422e3ec7e0114b312
Successfully built ais
Installing collected packages: ais
Successfully installed ais-2.8.1

In [3]:

# Set the GitLab username; this is a read-only user
GITLAB_USER = "read_aistt"

# Set the GitLab token for authentication

# Construct the Git package URL using the GitLab user and token
git_package = f"git+https://{GITLAB_USER}:{GITLAB_TOKEN}"

# Use subprocess to run the pip install command for the Git package; this simulates running a pip install in the terminal
std_out =[sys.executable, "-m", "pip", "install", git_package], capture_output=True, text=True).stdout

# Print the output of the pip install command
Collecting git+https://read_aistt:****
  Cloning https://read_aistt:**** to /tmp/pip-req-build-8leg80yo
  Resolved https://read_aistt:**** to commit b326480f684c057cf11f95b9a1ff6c17cbca495e
  Preparing metadata ( started
  Preparing metadata ( finished with status 'done'

In [4]:

# Define the GitLab username associated with a read-only group for machine learning projects
GITLAB_USER = 'ml_group_read_only' 

# Define the GitLab authentication token; this token grants access rights (according to its permissions) to the GitLab repository
GITLAB_TOKEN = 'eac7ZwiseRdeLwmBsrsm'

# Construct the full URL to access the Git repository on GitLab.
# The format uses the defined user and token for authentication, ensuring secure access to the repository.
# This URL points to the 'ml-group-polygons' repository under the 'mlpolygonsalgorithm' project or group.
git_package = f"git+https://{GITLAB_USER}:{GITLAB_TOKEN}"

# Execute a subprocess command to install the package from the constructed Git URL.
# sys.executable ensures we use the same Python interpreter that's currently running this script.
# "-m pip install" is equivalent to running "pip install" from the command line.
# capture_output=True captures the standard output and errors from the command.
# text=True ensures that the output is interpreted as a string.
std_out =[sys.executable, "-m", "pip", "install", git_package], capture_output=True, text=True).stdout

# Display the result (either success or error messages) of the installation process.
Collecting git+https://ml_group_read_only:****
  Cloning https://ml_group_read_only:**** to /tmp/pip-req-build-jnxq10yc
  Resolved https://ml_group_read_only:**** to commit 89f1aab64fee28c2f86e86d6fa7b55118882b1e8
  Preparing metadata ( started
  Preparing metadata ( finished with status 'done'
Building wheels for collected packages: unece-ais
  Building wheel for unece-ais ( started
  Building wheel for unece-ais ( finished with status 'done'
  Created wheel for unece-ais: filename=unece_ais-0.0.4-py3-none-any.whl size=12529 sha256=4d92f3dc43bfe1d9b7a2d784fe184b6c2688e59ce4d2878f10baf28f3388ce76
  Stored in directory: /tmp/pip-ephem-wheel-cache-h8qg8n9f/wheels/61/b5/f9/bcf024b104169c32950c03a4605d2d07ea9da07cae7bed5e3e
Successfully built unece-ais
Installing collected packages: unece-ais
Successfully installed unece-ais-0.0.4

1.3) Install packages

Now, we'll install the necessary packages.

In [5]:

# Importing essential libraries for data manipulation and computation
import numpy as np  # For numerical operations
import pandas as pd  # For data manipulation and analysis

# Libraries for geospatial data handling and visualization
import geopandas as gpd  # Extends the datatypes used by pandas to allow spatial operations on geometric types
import matplotlib.pyplot as plt  # For plotting and visualization

# For hexagonal grid handling
import h3  # Library for working with hexagonal hierarchical spatial index

# Shapely is used for manipulation and analysis of planar geometric objects
from shapely.geometry import Polygon  # For creating and working with polygons
from shapely.ops import transform  # For geometric object transformations
from shapely.geometry import MultiPoint  # For creating and working with multiple point geometries

# Libraries for working with dates and collections
from datetime import datetime  # For handling date and time
from collections import defaultdict  # For creating dictionaries with default values

# For map visualization
import folium  # For creating interactive maps

# Import specific functions from custom AIS modules
from ais import functions as af  # Custom functions related to AIS
from unece_ais import unece_ais as un  # Functions from the UNECE AIS module

# For clustering geospatial data
from sklearn.cluster import DBSCAN  # Density-Based Spatial Clustering of Applications with Noise

Packages below are needed for importing AIS Data (see Section 4. AIS Data)

In [6]:

#Sedona Imports
import sedona.sql
from sedona.register import SedonaRegistrator
from sedona.utils import SedonaKryoRegistrator, KryoSerializer
from sedona.core.SpatialRDD import PolygonRDD, PointRDD
from sedona.core.enums import FileDataSplitter

# Pyspark Imports
import pyspark.sql.functions as F
import pyspark.sql.types as pst
from pyspark import StorageLevel
from pyspark.sql import SparkSession 
from pyspark.sql.functions import pandas_udf, PandasUDFType

2) Port data

2.1) World Port Index general information

The World Port Index (WPI) provides information on the location (longitude and latitude) as well as the physical characteristics, facilities, and services offered by major ports and terminals worldwide. It covers approximately 3,700 entries. For the purposes of this cookbook, we utilized the 2019 Version.

2.2) Read port data

Using the WPI data, bounding boxes were designed by Justin McGurk to delineate geographic buffers around each port location. The process involves reprojecting each feature to a Dynamic Equal Distance projection. Buffers are then created, which are subsequently reprojected to align with the original coordinate system of the input features.

These bounding boxes are available in three formats:

  • csv_WKT: A CSV file containing the Well Known Text (WKT) representation of geometries as a field.
  • geo-json: A format for encoding geographic data structures.
  • shp: A zipped version of the shapefile.

For this exercise, we will retrieve data from CSV files for both ports and countries from the following GitHub repository: UNECE/AIS on GitHub. Please note that the Geojson and shp formats are exclusively accessible via the UN Global Platform GitLab, which necessitates a login.

In [7]:

# get the port polygons
pd_ports = pd.read_csv('')

pd_ports = pd_ports.assign(geometry=gpd.GeoSeries.from_wkt(pd_ports['geom_WKT']))
pd_ports = gpd.GeoDataFrame(pd_ports, geometry="geometry")

countries = gpd.read_file(gpd.datasets.get_path("naturalearth_lowres"))

The dataset includes the name of country, ISO country code and bounding boxes (geometry).

In [8]:



0920938OceaniaFijiFJI8374.0MULTIPOLYGON (((180.00000 -16.06713, 180.00000...
153950935AfricaTanzaniaTZA150600.0POLYGON ((33.90371 -0.95000, 34.07262 -1.05982...
2603253AfricaW. SaharaESH906.5POLYGON ((-8.66559 27.65643, -8.66512 27.58948...
335623680North AmericaCanadaCAN1674000.0MULTIPOLYGON (((-122.84000 49.00000, -122.9742...
4326625791North AmericaUnited States of AmericaUSA18560000.0MULTIPOLYGON (((-122.84000 49.00000, -120.0000...

In this cookbook, we will focus on the area around Port Richards (South Africa) as an example.

In [9]:

pd_richards_bay = pd_ports.loc[pd_ports.PORT_NAME=="RICHARDS BAY"]


196146855RICHARDS BAYZA-28.81666732.083333POLYGON ((31.85565339417423 -29.01718802110112...POLYGON ((31.85565 -29.01719, 31.85565 -28.616...

2.3) Visualize port data

To understand the spatial distribution of ports, visualizing the data is essential. Here, we will:

  1. Plot all ports around the world.
  2. Highlight ports specifically in Africa.
  3. Zoom in on the ports in the vicinity of Richards Bay.

We'll utilize the plot function from geopandas to achieve this.

In [10]:

fig, ax = plt.subplots(figsize=(8,6))
countries.to_crs(epsg=4326).plot(ax=ax, color='lightgrey')
pd_ports.plot(ax=ax, color='red')
ax.set_xlim(-180, 180)
ax.set_ylim(-90, 90)

(-90.0, 90.0)

We can also have a closer look by adjusting plot setting.

In [11]:

fig, (ax2, ax3)= plt.subplots(1,2,figsize=(8,6))

countries.to_crs(epsg=4326).plot(ax=ax2, color='lightgrey')
pd_ports.plot(ax=ax2, color='red')
ax2.set_xlim(-20, 55)
ax2.set_ylim(-40, 40)

countries.to_crs(epsg=4326).plot(ax=ax3, color='lightgrey')
pd_ports.plot(ax=ax3, color='black')
pd_richards_bay.plot(ax=ax3, color='red')  # Location of Haldia
ax3.set_xlim(85, 90)
ax3.set_ylim(20, 23)

(20.0, 23.0)

3) H3 data

3.1) H3 general information

H3 is a module developed by Uber that discretizes the world in hexagons. The indexing standardization makes it very efficient to index or map a position anywhere. Also, for our purposes, it helps in reducing the search area for potential polygons.

The module is quite robust and is made for spatial developments. But be aware that we are in the mid of a transition among versions (version 3 and version 4) with important changes, mostly on the syntax. The example below is a repsentation of H3 hexagons with different resolutions. The resolutions are standardized as presented here. For instance the highest resolution (15) has an area of 0.9m2 and the lowest resolution (0) has an area of 4 million km2, naturally there are more resolution 15 hexagons than resolution 0 hexagons.


3.2) Use of H3 to reduce the search space for Richards Bay Port

As AIS is a big data, it is highly recommended to first filter out the AIS data for the area of interest, and H3 is a useful to reduce the search space.

To get a list of hexagons of the area surrounding Richards Bay Port:

  • Get the hexagon index for the area around the point (red square on the image above in 2.3)
  • Get the neighboring hexagon indeces. An easy method to do such is by using k_ring method (grid_disk in h3 version 4) which retrieves all h3 hexagons around a given point
h3.k_ring(origin_index, k)

The parameter "k" indicates how many distances away, from your origin hexagon index, would you like to retrieve. Figure below illustrates a k_ring with k=1:

Pic k_ring

For our case, we can use a shortcut with un.k_ring(H3_INDEX, k=K)

In [12]:

set_of_indices = set()   # Sets inherently removes duplicates, the h3 indices for the position and k niehgbors at level 2 are stored in a set. 

## we are going to iterate over our pandas data frame as this will work if you want to work with with several ports at one
for index, row in pd_richards_bay.iterrows():
    lat = row['LATITUDE']
    lng = row['LONGITUDE']
    k_ring = un.k_ring(un.get_h3_index(lat, lng, 5), k=2) # get a k ring of set of h3 indices for each record and append set_of_indices

3.3) Visualize H3 hexagons around Richards Bay Port

To visualise the filter area around Richards Bay Port created by H3 hexagons, we first need to create a geometrics from H3 index.

Note that, for h3 to geo boundary function, the standard polygon shape is in the form of ((lat,lon),(lat,lon)) while visualisation packages (e.g., Geopandas, Shapely) works with ((lon,lat),(lon,lat)) structure. Therefore, in the next example, we flip the coordinates with "flip" function.

In [13]:

## Flips the x and y coordinate values. h3_to_geo_boundary returns ((lat,lon),(lat,lon)) Polygons while Geopandas works with ((lon,lat),(lon,lat)) 

def flip(x, y):   
    return y, x

## Create Polygon geometries from the h3 boundary. Then pass to a GeoPandas DataFrame and plot.

geom_h3 = []
for element in set_of_indices:
    geom_h3.append(transform(flip,Polygon(h3.h3_to_geo_boundary(element, geo_json=False))))
geom_h3 = gpd.GeoDataFrame(geometry=geom_h3)

fig, ax4 = plt.subplots(figsize=(8,6))

countries.to_crs(epsg=4326).plot(ax=ax4, color='lightgrey')
pd_richards_bay.plot(ax=ax4, color='mistyrose')
ax4.scatter(pd_richards_bay.LONGITUDE, pd_richards_bay.LATITUDE, zorder=1, c='red', s=10)

geom_h3.boundary.plot(ax=ax4, color='black')
ax4.set_xlim(85, 90)
ax4.set_ylim(20, 23)

(20.0, 23.0)

4) AIS data

4.1) AIS general information

AIS represents "automatic identification system" which is a tracking system for ships, originally developed for collision avoidance. In the recent years, it is also used for analyses from various fields. The data is automatically transmitted every few seconds over very high frequency (VHF) radios from approximately 100,000 vessels worldwide. With its high temporal frequency and spatial coverage, AIS data has a great potential to be used to create official statistics and/or develop experimental data fit-for-purpose in various areas (e.g., traffic within harbours, economic trade indicators, CO2 emission, fishery).

For full introduction to AIS data, see AIS Handbook from the UN CEBD AIS Task Team.

4.2) Read AIS data

AIS data is made available by several private providers. However, for the purpose of official statistics a standardized data base is crucial. Thus, the AIS data provided by the UN Global Platform.

We start a Spark Session before running anything with the AIS data.

In [14]:

spark = SparkSession. \
    builder. \
    appName('MLGroup_Demonstration'). \
    config("spark.serializer", KryoSerializer.getName). \
    config("spark.kryo.registrator", SedonaKryoRegistrator.getName). \
    config('spark.jars.packages'). \
    config("spark.sql.parquet.enableVectorizedReader", "false").\



4.3) Reduce AIS data

The following cell uses the ais.get_ais function to filter our desired area within a time period. Note that

  • Start_date and End_Date need to be declared as timestamp by datetime.fromisoformat("YYYY-MM-DD")
  • Columns list must match the ais-dataframe column names.

In below, we use time stamp from 1 to 7 of March 2022 and get a sample with the hex list plotted earlier.

In [15]:

start_date = datetime.fromisoformat("2022-03-01")
end_date = datetime.fromisoformat("2022-03-07")

the_h3_list = [h3.string_to_h3(element) for element in set_of_indices] # get_ais function requires numeric h3 indexes 

ais_sample = af.get_ais(spark,start_date, end_date = end_date, h3_list = the_h3_list)

Handling big data is based on a very simple concept!

  • Perform simple operations on the large sample and reduce as much as possible.
  • If in need of writing your results, aim to reduce as much as possible, then export your results. Plus, as per the agrrement with the AIS supplier, raw data cannot be exported. It is not efficient to do so, the big data cluster is in this environment.

Note that AIS signal contains information on vessel type and navigation status with fixed codelist."vessel_type").distinct().show()"nav_status").distinct().show()
TankerRestricted Manoeu...
MilitaryEngaged In Fishing
TowingUnderway Sailing
OtherAt Anchor
SARUnder Way Using E...
Law Enforcement
Pleasure Craft
Port Tender
Not Available

Our assumption for detecting a berthed vessel (engaged in commercial activities) is that it must be classified as a Cargo vessel and declared as a moored vessel. Therefore, we filter out all positions reported from "Cargo" with navigation stats as "Moored".

In [16]:

ais_sample = ais_sample.filter(((F.col("nav_status")=="Moored")&(F.col("vessel_type").isin(["Cargo","Tanker","Passenger"]))))

To add additional vessels which might also signal activities in important ports (i.e., Tanker, Passenger), we could use filter:

ais_sample = ais_sample.filter(((F.col("nav_status")=="Moored")&(F.col("vessel_type").isin(["Cargo","Tanker","Passenger"])))

But note that some of vessel types may introduce erratic behavior in the data set (e.g., vessels engaging law enforcemenet activity might move differently from ordinary cargo vessels).

5) Berth polygons

5.1) Create candidate polygons

The example below makes a count per H3 index on resolution 8 with approximately 0.73km20.73��2 and edges of 370m370�.

Ideally, the hexagon with higher density (i.e. large number of AIS signals within the hexagon) is a berth polygon.

In [17]:

moored_areas = ais_sample.groupBy([size_h3]).count()

With a quick count we can verify how big is the sample.

  • Note that any transformation to Pandas is very expensive. So, if in need of exporting yur data be aware that the process is limited to 2gb which is the PySpark Driver (controller of executors) limit.
  • Of course, the smaller the better

by running


We can verify that sample moored_areas is reduced from our big data sample to 14 rows.

In [18]:

moored_areas = moored_areas.toPandas()

<bound method NDFrame._add_numeric_operations.<locals>.sum of H3_int_index_11    90
count              90
dtype: int64>

5.2) Visualise candidate polygons

We can check if there are hexagons with high density through bar chart:

In [19]:

moored_areas.sort_values(by=["count"], ascending = False), y='count')


Similar to before, we can transform our hex indexes to polygons

In [20]:

geom_h3_moored =[]
for ind,row in moored_areas.iterrows():
    geom = transform(flip,Polygon(h3.h3_to_geo_boundary(h3.h3_to_string(row[size_h3]), geo_json=False)))
moored_areas = moored_areas.assign(geometry=geom_h3_moored)
moored_areas = gpd.GeoDataFrame(moored_areas,geometry="geometry")

For recognizing cluster areas we can create a heatmap with matplotlib

In [21]:

fig, ax5 = plt.subplots(figsize=(8,6))

ax5.scatter(pd_richards_bay.LONGITUDE, pd_richards_bay.LATITUDE, zorder=1, c='red', s=10)
geom_h3.boundary.plot(ax=ax5, color='black')

## Heat map plot from geopandas. Pass the count column.
moored_areas.plot(ax=ax5,column="count",legend = True,legend_kwds = {'label': "Moored vessels density map"})

ax5.set_xlim(88.0, 88.3)
ax5.set_ylim(22.0, 22.075)

(22.0, 22.075)

In [22]:

moored_areas = moored_areas.assign(H3_str = moored_areas[size_h3].apply(lambda x: h3.h3_to_string(x)) )

We can also overlay the Port Polygons with OpenStreetMap via folium

In [23]:

## Create GeoJson for folium Choropleth
feature_list = [] 
for i in range(moored_areas.shape[0]):
    coordinates = list( h3.h3_to_geo_boundary(moored_areas["H3_str"][i], geo_json = True) )
    feature = {'geometry': {'coordinates': [[list(x) for x in coordinates]], 'type': 'Polygon' }, 
             'ID_H3_str': moored_areas["H3_str"][i] 
mygeojson = {"type":"FeatureCollection", "features": feature_list}

In [24]:
map_richards_bay = folium.Map( location= [pd_richards_bay.LATITUDE, pd_richards_bay.LONGITUDE], zoom_start=13, width=1000, height=1000)

    columns=["H3_str", "count"],



Make this Notebook Trusted to load map: File -> Trust Notebook

5.3) Create berth polygons

Based on our previous example, we refine our filtering by using the Speed Over Ground (SOG) column.

In [25]:

moored_areas_sog = ais_sample.groupBy([size_h3]).count()

We select hexagons with counts more than 5 quantile.

In [26]:

moored_areas_sog = moored_areas_sog.toPandas()

In [27]:
moored_areas_sog = moored_areas_sog[moored_areas_sog["count"]>moored_areas_sog["count"].quantile(0.01)]

5.4) Visualise berth polygons

In [28]:

geom_h3_moored = []
for ind,row in moored_areas_sog.iterrows():
    geom = transform(flip,Polygon(h3.h3_to_geo_boundary(h3.h3_to_string(row[size_h3]), geo_json=False)))
moored_areas_sog = moored_areas_sog.assign(geometry=geom_h3_moored)
moored_areas_sog = gpd.GeoDataFrame(moored_areas_sog,geometry="geometry")

Recognizing clusters of hexagons,

  • First we assign a node number (index) to every hexagon id (unique)
  • In an iteration for every hexagon we get the k_ring k=1 and check if they are within the remaining hexagons at the dataframe
  • Store the pair of indeces (nodes) as if they were declared as a valid connection

In [30]:

moored_areas_sog = moored_areas_sog.assign(alpha_num=moored_areas_sog[size_h3].apply(lambda x : h3.h3.h3_to_string(x)))
dic_hex={y: x for x, y in dic_hex.items()}

for ind,row in moored_areas_sog.iterrows():
    ls_neighbors= un.k_ring(row[size_h3],k=1)
    valid_nb=[h3.string_to_h3(x) for x in ls_neighbors if h3.string_to_h3(x)\
              in moored_areas_sog[size_h3].tolist()]

    for x in valid_nb:
        if x != row[size_h3]:

The next is an extract of networks for recognizing the separated clusters.

With the "nodes" and "edges" defined in the previous cell we can create a graph and test which nodes belong together.

In [31]:

class Graph:
    # init function to declare class variables
    def __init__(self, V):
        self.V = V
        self.adj = [[] for i in range(V)]
    def DFSUtil(self, temp, v, visited):
        # Mark the current vertex as visited
        visited[v] = True
        # Store the vertex to list
        # Repeat for all vertices adjacent
        # to this vertex v
        for i in self.adj[v]:
            if visited[i] == False:
                # Update the list
                temp = self.DFSUtil(temp, i, visited)
        return temp
    # method to add an undirected edge
    def addEdge(self, v, w):
    # Method to retrieve connected components
    # in an undirected graph
    def connectedComponents(self):
        visited = []
        cc = []
        for i in range(self.V):
        for v in range(self.V):
            if visited[v] == False:
                temp = []
                cc.append(self.DFSUtil(temp, v, visited))
        return cc
g = Graph(moored_areas_sog.shape[0])

for x,y in ls_hex:
    g.addEdge(x, y)

cc = g.connectedComponents()

for num in range(len(cc)):
    for x in cc[num]:

In [32]:
moored_areas_sog = pd.merge(moored_areas_sog,pair_num,on=[size_h3])

Given that the clusters are groups of neighboring hexagons it makes and ideal setting where a polygon can be generated from the hexagons connected by the external borders. In Shapely this is called unary_union.

In [33]:

new_gdf = []
for ind,group in moored_areas_sog.groupby("label"):
    geo = group.unary_union
new_gdf = gpd.GeoDataFrame(new_gdf,columns=["label","geom"],geometry="geom")

In [35]:
from shapely.geometry import mapping
from shapely.ops import unary_union,transform
from functools import partial

def flip(x, y):
    """Flips the x and y coordinate values"""
    return y, x

# Assuming you have a GeoDataFrame 'new_gdf' with polygons

# Create a Folium map
map_richards_bay = folium.Map(location=[pd_richards_bay.LATITUDE, pd_richards_bay.LONGITUDE], zoom_start=13, width=1000, height=1000)

# Convert the geometry to GeoJSON
geojson_vals = new_gdf['geom'].apply(mapping)

# Iterate over the GeoJSON geometries and add them to the map
for geojson in geojson_vals:

# Display the map

Make this Notebook Trusted to load map: File -> Trust Notebook

5.6) Spatial Join

With our defined polygons, we are set to filter all AIS records that are within the polygons represented by new_gdf_sais_sample_s.

A few things to note:

  • We need to convert our geoPandas DataFrame into a Spark DataFrame.
  • The AIS sample data's longitude and latitude will be transformed into Point Geometry using the function:
    un.latlong2geom(spark, ais_sample)
    un.ais2areas(spark, ais_sample_s, new_gdf_s)

In [36]:

new_gdf_s = spark.createDataFrame(new_gdf)
ais_sample_s = un.latlong2geom(spark,ais_sample)
ais_sample_s = un.ais2areas(spark,ais_sample_s,new_gdf_s)

In [37]:
export_data = ais_sample_s.groupBy(["label"]).count()
export_data = export_data.toPandas()

In [38]:



5.7) Export values

Feasible with small dataframes.

  • From Pandas to Html to CSV method
  • We can also save to AIS Task Team AWS S3 bucket in parquet format. But, that will remain for a future tutorial

In [35]:

from IPython.display import HTML
import base64  

def create_download_link(df, title = "Download CSV file", filename = "data.csv"):  
    csv = df.to_csv(index=False)
    b64 = base64.b64encode(csv.encode())
    payload = b64.decode()
    html = '<a download="{filename}" href="data:text/csv;base64,{payload}" target="_blank">{title}</a>'
    html = html.format(payload=payload,title=title,filename=filename)
    return HTML(html)

In [36]:

Download CSV file

6) Berth polygons with DBSCAN

6.1) DBSCAN general information

Density-Based Spatial Clustering of Applications with Noise (DBSCAN) is an unsupervised machine learning algorithm which finds core samples of high density and expands clusters from them. Unlike other popular clustering algorithms such as K-means, DBSCAN does not require to predefine the number of clusters and works well with arbitrary cluster shapes which is beneficial for our case as the shape of birth areas could be diverse and unpredictable. Below example of DBSCAN shows two clusters (red and blue) which might have been difficult to detect with k-means (source: Mathworks)

This section is presented to build more refined polygons based on DBSCAN which follows 3 steps as below (more details can be found in corresponding sub-section):

Please note that for the purpose of demonstration, hyperparameters are set as 7e-3 for Epsilon (approx. 700 meters) and 3 for minPts. However, hyper-parameter search could be refined per berth group by, for instance, Silhoutte coefficient.

6.2) Apply DBSCAN on hexagons (Step 1)

Note that direct application of DBSCAN on the raw AIS signals is costly as AIS is big data.

To limit the computation, we again use H3 hexagons as unit of analysis (rather than AIS signal), but at a much higher level, apply DBSCAN on the center points of these hexagons.

In [39]:

h3_resol_DBSCAN = 13   # h3 resolution to be used for DBSCAN
h3_resol_berth = 10   # h3 resolution to be used for construction of final berth polygons

In [40]:
moored_areas_raw = ais_sample.groupBy(["H3_int_index_" + str(h3_resol_DBSCAN)]).count()
moored_areas_raw = moored_areas_raw.toPandas()
# moored_areas.sort_values(by=["count"], ascending = True).head(5)

<bound method NDFrame._add_numeric_operations.<locals>.sum of H3_int_index_13    386
count              386
dtype: int64>

We remove hexagons that have signal counts less than than 0.01 quantile.

In [41]:

moored_areas = moored_areas_raw[moored_areas_raw["count"]>moored_areas_raw["count"].quantile(0.01)]
moored_areas.reset_index(drop = True, inplace=True)

In [42]:
moored_areas.columns = [ 'H3_int_index', 'count']

<bound method NDFrame._add_numeric_operations.<locals>.sum of H3_int_index    277
count           277
dtype: int64>

In [43]:
geom_h3_moored =[]
for ind,row in moored_areas.iterrows():
    geom = transform(flip,Polygon(h3.h3_to_geo_boundary(h3.h3_to_string(row["H3_int_index"]), geo_json=False)))
moored_areas = moored_areas.assign(geometry=geom_h3_moored)
moored_areas = gpd.GeoDataFrame(moored_areas,geometry="geometry")

In [44]:
moored_areas = moored_areas.assign(H3_str = moored_areas["H3_int_index"].apply(lambda x: h3.h3_to_string(x)) )
moored_areas = moored_areas.assign(H3_center = moored_areas["H3_str"].apply(lambda x: h3.h3_to_geo(x)) )
moored_areas = moored_areas.assign(H3_center_lat = moored_areas["H3_center"].apply(lambda x: x[0]))
moored_areas = moored_areas.assign(H3_center_long = moored_areas["H3_center"].apply(lambda x: x[1]))

In [45]:
Cluster = DBSCAN(eps=4.5e-3, min_samples=2).fit_predict(moored_areas[["H3_center_lat", "H3_center_long"]])
n_clusters = len(set(Cluster)) - (1 if -1 in set(Cluster) else 0)
moored_areas["cluster"] = Cluster


063833167447844684732POLYGON ((32.02994 -28.79167, 32.02994 -28.791...8dbcf46a22b08ff(-28.791650876159604, 32.02990010997048)-28.79165132.0299000
163833167447662195142POLYGON ((32.03515 -28.79404, 32.03515 -28.794...8dbcf46a20f307f(-28.7940226813856, 32.03510925601086)-28.79402332.0351090
2638331674542945471212POLYGON ((32.04949 -28.81632, 32.04949 -28.816...8dbcf46a60334bf(-28.816301596870474, 32.04944778529296)-28.81630232.0494481
363833167448027283195POLYGON ((32.03960 -28.79205, 32.03960 -28.792...8dbcf46a246e5bf(-28.792028388585145, 32.03955947811133)-28.79202832.0395590
4638331674805937407417POLYGON ((32.05669 -28.80501, 32.05669 -28.804...8dbcf46b5b024ff(-28.80499064497774, 32.056648159807)-28.80499132.0566482

In [46]:
# check number of h3 polygons for each cluster

0    180
1     85
2      7
3      5
Name: cluster, dtype: int64

Visualise the polygons:

In [47]:

# Initialize a folium map centered on Richards Bay using its latitude and longitude
# Set the zoom level and dimensions of the map
map_richards_bay = folium.Map(location=[pd_richards_bay.LATITUDE, pd_richards_bay.LONGITUDE], 

# Create a list to hold the GeoJSON features for high-resolution polygons
feature_list = []

# Loop through the moored_areas DataFrame to convert H3 hexagons to GeoJSON polygons
for i in range(moored_areas.shape[0]):
    # Convert H3 hexagon string to its boundary coordinates
    coordinates = list(h3.h3_to_geo_boundary(moored_areas["H3_str"][i], geo_json=True))
    # Define a GeoJSON feature for the current H3 hexagon
    feature = {
        'geometry': {
            'coordinates': [[list(x) for x in coordinates]],
            'type': 'Polygon'
        'type': 'Feature',
        'ID_H3_str': moored_areas["H3_str"][i],
        "properties": {
            "cluster": int(moored_areas["cluster"][i])
    # Append the current GeoJSON feature to the feature list

# Construct a GeoJSON object using the list of features
mygeojson = {
    "type": "FeatureCollection",
    "features": feature_list

# Define a dictionary for cluster colors
cluster_colors = {
    0: '#F00',
    1: '#0F0',
    2: '#00F'

# Define a styling function to assign colors to different clusters
def style_function(feature):
    cluster = feature['properties']['cluster']
    fillColor = cluster_colors.get(cluster, '#808080')
    return {
        'fillColor': fillColor,
        'color': '#000',
        'weight': 1,
        'fillOpacity': 0.3,

# Add the styled GeoJSON polygons to the folium map

# Display the map

Make this Notebook Trusted to load map: File -> Trust Notebook

6.3) Obtain a set of lower resolution H3 hexagons (Step 2)

For each hexagon, we find the paraent hexagon to construct the berth polygons (the high-resolution hexagons are too detailed).

In [48]:

moored_areas = moored_areas.assign(H3_parent = moored_areas["H3_str"].apply(lambda x: h3.h3_to_parent(x, h3_resol_berth)) )

In [49]:
moored_areas = moored_areas.assign(H3_parent_center = moored_areas["H3_parent"].apply(lambda x: h3.h3_to_geo(x)) )
moored_areas = moored_areas.assign(H3_parent_center_lat = moored_areas["H3_parent_center"].apply(lambda x: x[0]))
moored_areas = moored_areas.assign(H3_parent_center_long = moored_areas["H3_parent_center"].apply(lambda x: x[1]))


063833167447844684732POLYGON ((32.02994 -28.79167, 32.02994 -28.791...8dbcf46a22b08ff(-28.791650876159604, 32.02990010997048)-28.79165132.02990008abcf46a22b7fff(-28.79153801241071, 32.029905411108736)-28.79153832.029905
163833167447662195142POLYGON ((32.03515 -28.79404, 32.03515 -28.794...8dbcf46a20f307f(-28.7940226813856, 32.03510925601086)-28.79402332.03510908abcf46a20f7fff(-28.79442515720381, 32.035437352464946)-28.79442532.035437
2638331674542945471212POLYGON ((32.04949 -28.81632, 32.04949 -28.816...8dbcf46a60334bf(-28.816301596870474, 32.04944778529296)-28.81630232.04944818abcf46a6037fff(-28.81686742274209, 32.04949836734247)-28.81686732.049498
363833167448027283195POLYGON ((32.03960 -28.79205, 32.03960 -28.792...8dbcf46a246e5bf(-28.792028388585145, 32.03955947811133)-28.79202832.03955908abcf46a246ffff(-28.79212468572867, 32.03878388229827)-28.79212532.038784
4638331674805937407417POLYGON ((32.05669 -28.80501, 32.05669 -28.804...8dbcf46b5b024ff(-28.80499064497774, 32.056648159807)-28.80499132.05664828abcf46b5b07fff(-28.805546653579768, 32.05623656078254)-28.80554732.056237

In [50]:
moored_areas.drop_duplicates(subset=["H3_parent"], inplace = True, ignore_index = True)

Visualise the polygons:

In [51]:

map_richards_bay = folium.Map( location= [pd_richards_bay.LATITUDE, pd_richards_bay.LONGITUDE], zoom_start=13, width=1000, height=1000)

## Create GeoJson to plot high-resolution polygons for folium Choropleth
feature_list = [] 
for i in range(moored_areas.shape[0]):
    coordinates = list( h3.h3_to_geo_boundary(moored_areas["H3_parent"][i], geo_json = True) )
    feature = {'geometry': {'coordinates': [[list(x) for x in coordinates]], 'type': 'Polygon' }, 
             'ID_H3_str': moored_areas["H3_str"][i],
            "properties": {"cluster": int(moored_areas["cluster"][i])}
mygeojson = {"type":"FeatureCollection", "features": feature_list}

cluster_colors = {
    0: '#F00',
    1: '#0F0',
    2: '#00F'

def style_function(feature):
    cluster = feature['properties']['cluster']
    fillColor = cluster_colors.get(cluster, '#808080') 
    return {
        'fillColor': fillColor,
        'color': '#000',
        'weight': 1,
        'fillOpacity': 0.3,



Make this Notebook Trusted to load map: File -> Trust Notebook

6.4) Connect polygons (Step 3)

In [52]:

# Initialize an empty dictionary to store hexagons that are connected within a cluster
h3_connected = {}

# Loop through each cluster defined by 'n_clusters'
for i_cluster in range(n_clusters):

    # Filter the 'moored_areas' DataFrame for the current cluster
    h3_set = moored_areas[moored_areas.cluster == i_cluster]
    # Find the reference hexagon index by selecting the one with the minimum latitude
    reference_index = h3_set[h3_set.H3_center_lat == min(h3_set.H3_center_lat)].iloc[0]['H3_parent']
    # Get a list of unique parent hexagons for the current cluster
    h3_set = list(h3_set['H3_parent'].unique())
    # Initialize a list to store hexagons that are connected, starting with the reference index
    h3_set_connected = [reference_index]

    # Iterate through the hexagons in the cluster
    for i in range(len(h3_set)):
        # Calculate the distance from the reference hexagon to all other hexagons in the cluster
        distances = {h3_index: h3.h3_distance(reference_index, h3_index) for h3_index in h3_set}
        # Identify the closest hexagon to the reference hexagon
        closest_index = min(distances, key=distances.get)
        min_val = h3.h3_distance(reference_index, closest_index)

        # Loop through the already connected hexagons to ensure the shortest path
        for j in range(len(h3_set_connected)):
            # Calculate distances from the current connected hexagon to all other hexagons
            distances_2 = {h3_index: h3.h3_distance(h3_set_connected[j], h3_index) for h3_index in h3_set}
            # Update the minimum distance and index if a closer hexagon is found
            if min(distances_2.values()) < min_val:
                min_val = min(distances_2.values())
                min_j = j
        # If the minimum distance is not from the reference index, update the path
        if min_val < h3.h3_distance(reference_index, closest_index):
            distances_2 = {h3_index: h3.h3_distance(h3_set_connected[min_j], h3_index) for h3_index in h3_set}
            closest_index = min(distances_2, key=distances_2.get)
            line_hexagons = h3.h3_line(h3_set_connected[min_j], closest_index)
            line_hexagons = h3.h3_line(reference_index, closest_index)
        # Add the identified path to the list of connected hexagons
        h3_set_connected = list(set(h3_set_connected).union(set(line_hexagons)))
        # Update the reference hexagon for the next iteration
        reference_index = closest_index
        # Remove the hexagon that was just processed to avoid processing it again
        # Optional: Verification code to visualize connected hexagons using folium (currently commented out)
        # from folium import DivIcon
        # temp = moored_areas[moored_areas.H3_parent == closest_index]
        # folium.Marker(
        #     location=[temp.iloc[0]["H3_parent_center_lat"], temp.iloc[0]["H3_parent_center_long"]],
        #     icon=folium.DivIcon(html=f"""<div style="font-size: 24pt">{i}</div>""")
        # ).add_to(map_richards_bay)
    # Store the list of connected hexagons for the current cluster in the 'h3_connected' dictionary
    h3_connected[str(i_cluster)] = h3_set_connected

Visualise the polygons:

In [53]:

map_richards_bay = folium.Map( location= [pd_richards_bay.LATITUDE, pd_richards_bay.LONGITUDE], zoom_start=13, width=1000, height=1000)

for i_cluster in range(n_clusters): 
    h3_set_connected = h3_connected[str(i_cluster)]
    feature_list = [] 
    for i in range( len( h3_set_connected) ):
        coordinates = list( h3.h3_to_geo_boundary(h3_set_connected[i], geo_json = True) )
        feature = {'geometry': {'coordinates': [[list(x) for x in coordinates]], 'type': 'Polygon' }, 
                 'ID_H3_str': h3_set_connected[i],
                "properties": {"cluster": i_cluster}
    mygeojson = {"type":"FeatureCollection", "features": feature_list}

    for feature in mygeojson['features']:
        feature['properties']['cluster'] = int(feature['properties']['cluster'])


# Display the map

Make this Notebook Trusted to load map: File -> Trust Notebook
In [54]:
from shapely.geometry import mapping
from shapely.ops import unary_union

map_richards_bay = folium.Map(location=[pd_richards_bay.LATITUDE, pd_richards_bay.LONGITUDE], zoom_start=13, width=1000, height=1000)

pols_hulls = []
for i_cluster in range(n_clusters): 
    h3_set_connected = h3_connected[str(i_cluster)]
    pol_in = []
    for hex_i in h3_set_connected:
        geo_hex = Polygon(h3.h3_to_geo_boundary(hex_i, geo_json=True))
    multipol = unary_union(pol_in)
    # Convert the Shapely geometry of the convex hull back into a GeoJSON-like dictionary
    geojson_geom = mapping(multipol.convex_hull)
    # Create the GeoJSON Feature for this convex hull
    feature = {
        'type': 'Feature',
        'geometry': geojson_geom,
        'properties': {
            'ID_H3_str': hex_i,  # Updated line
            'cluster': i_cluster,
    # Add this feature to the map

Make this Notebook Trusted to load map: File -> Trust Notebook