This chapter describes the methodology of the proposed shipping activity indicators as well as the necessary data cleaning, preparation and configuration steps. The indicators are produced by the provided sample scripts using the AIS data available on the UN Global Platform. In addition, the chapter gives guidelines on how to use Location Services on the UN Global Platform.
Overview of Location Services in UN Global Platform
Location Services on the UN Global Platform offers two kinds of data services: 1) AIS data and 2) ADS-B (flight) data, both as real-time and as historical data. It also features an interactive visualization tool called Optix.Viz, which is accessible at UNGP Optix.Viz. It is a browser-based dynamic geospatial visualization client that can represent many layers of geospatial data along with their relationships, often with animation of up to millions of entities in your browser. Core applications include tracking, social media analysis, situational awareness, forensic track analysis, and spatial data visualization. The pluggable architecture makes it simple to add new analytical workflows and utilities. In addition to the data visualization platform, users can run code directly on UNGP Optix.Analyze, which is powered by Spark and JupyterHub. See Optix.Analyze and Optix.Viz guidelines below:
Computing the shipping indicators on the UNGP platform
The proposed shipping indicators are based on aggregations of historical ship positions as reported in AIS data. Currently, the UNGP platform contains tens of terabytes of global AIS data for the period from October 2018 to the present. Due to the large size of the dataset, the computation is done through the location service of the platform, which is based on the Apache Spark/Hadoop distributed environment. This enables processing the data in a reasonable timeframe. Additionally, the location service provides the GeoMesa open source library to facilitate large-scale geospatial querying and analytics of AIS data. This simplifies code development by providing the user with convenient high-level functions for the most common geospatial tasks. Two programming language options are available, Python and Scala. The sample scripts are provided only in Python, due to its higher popularity, but converting them to Scala is considered relatively straightforward for an experienced developer.
First, the user must define a geo-bounding box for each port of interest. The algorithm uses these bounding boxes to determine whether a reported ship position is in any of the ports of interest. Specifically, a ship is considered to be in port for the period until the next AIS message arrives if its last reported position is inside the port bounding box. However, due to anomalies in the AIS data, this definition has shortcomings that require special measures to be taken in the algorithm to limit the effect of irregular AIS messages. For example, the practice of switching off the AIS equipment after a ship docks could result in an undesired bias in the indicators when the equipment is not turned back on before the ship leaves the observed area. In that case the ship will be counted as in port until the next AIS message from it is seen.
In certain cases, when a port is located close to a busy shipping lane, some of the passing ships may be captured within the bounding box. As this can introduce significant bias in the generated port-based indicators, the user should make every effort to minimise the problem by carefully selecting the port bounding box parameters. Although the bias caused by passing traffic can often be eliminated by defining the bounding box very carefully, in some cases it is particularly challenging to include all of the port's berths due to their geographical distribution. In such cases, using a union of multiple bounding boxes instead of a single large bounding box might be a better solution that provides sufficient accuracy and convenience. Alternatively, using a more complex shape than a bounding box, such as a polygon, could further improve the separation of port traffic from passing traffic. However, it should be kept in mind that testing whether a ship position lies inside a polygon is much more computationally demanding. Especially with large-scale datasets like AIS, this may slow the computation down significantly, as every single ship position is evaluated against every single port.
Additionally, the significant amount of noise present in the AIS data should be taken into consideration. This noise can originate from inaccurate coordinates reported by ships, corrupted AIS messages or even spurious messages from non-ship sources. Different measures can be deployed to reduce the undesirable effect of noise on the computed indicators, depending on the noise characteristics in the specific location. These include filtering out messages with an MMSI that does not conform to the AIS standard, using probabilistic filtering methods such as the Kalman filter to smooth the reported ship positions, or checking the ship tracks with more complex algorithms to verify the validity of the source. The noise filtering in the provided sample scripts is based on excluding MMSIs that do not conform to the standard and excluding messages from ships that do not travel more than a predefined distance threshold over a certain period of time. In addition to suppressing the noise present in AIS data, this filtering method also reduces the indicator bias introduced by port-based vessels such as tug boats, dredgers and pilot boats. These port-based vessels are not considered to be directly related to economic trade, but they spend a considerable proportion of their time inside the port, which results in considerable unwanted bias in the output indicators. Therefore, in the developed "moving ships" filter it is possible to set the minimal travel distance threshold higher than the port size and so eliminate the effect both of one-off spurious messages and of port-based vessels. It should be noted that, conversely, setting the minimal travel distance too high will start excluding valid ships that do not travel very far.
Filtering of the data could also be achieved by linking to an up-to-date shipping register that acts as a form of validation data for the messages in the AIS stream. However, as such a global shipping register is not currently available on UNGP, this method has not been explored.
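The union-of-boxes idea can be illustrated with a minimal plain-Python sketch. The `BBox` type, the `in_port` helper and the coordinates are hypothetical names and values chosen for illustration; they are not taken from the sample scripts, which run on Spark/GeoMesa.

```python
from typing import NamedTuple, Sequence


class BBox(NamedTuple):
    """Axis-aligned box in decimal degrees (hypothetical helper type)."""
    min_lon: float
    min_lat: float
    max_lon: float
    max_lat: float

    def contains(self, lon: float, lat: float) -> bool:
        # Simple range test; boxes crossing the antimeridian are not handled.
        return (self.min_lon <= lon <= self.max_lon
                and self.min_lat <= lat <= self.max_lat)


def in_port(boxes: Sequence[BBox], lon: float, lat: float) -> bool:
    """True if the position falls inside any box of the port's union."""
    return any(box.contains(lon, lat) for box in boxes)


# Illustrative coordinates only, not a real port definition.
port_boxes = [
    BBox(10.00, 50.00, 10.10, 50.05),  # main basin
    BBox(10.12, 50.02, 10.20, 50.08),  # outlying terminal
]

print(in_port(port_boxes, 10.15, 50.05))  # position inside the second box
print(in_port(port_boxes, 10.11, 50.01))  # position in the gap between boxes
```

Two small boxes leave the gap between them (for example a shipping lane) outside the port, which a single enclosing box would not.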
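As an illustration of the MMSI check, the sketch below keeps only identifiers that have nine digits and start with a digit from 2 to 7, which is the pattern used for ship stations. This rule is an assumption made for the example; the exact conformity rule applied in the sample scripts is not stated here, and `is_plausible_ship_mmsi` is a hypothetical name.

```python
import re

_NINE_DIGITS = re.compile(r"\d{9}")


def is_plausible_ship_mmsi(mmsi) -> bool:
    """Assumed rule: nine digits, first digit between 2 and 7."""
    text = str(mmsi).strip()
    return _NINE_DIGITS.fullmatch(text) is not None and text[0] in "234567"


# Illustrative identifiers only.
candidates = [366123456, 12345, "000000000", 991234567]
valid = [m for m in candidates if is_plausible_ship_mmsi(m)]
print(valid)
```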
Conceptual overview of the methods
A high-level conceptual overview of the methods is given below. It provides a general description of how the indicators are computed and introduces the key stages of the computation process.
The "Moving Ships" filter algorithm
Before the actual computation, the first step is to filter out noise and port-based ships, as both can introduce significant bias in the output indicators. This is achieved by the "Moving Ships" filter algorithm, described below.
The validation process is based on evaluation of the distance travelled over a certain predefined period of time, normally 6 months. If the travelled distance is bigger than the preset threshold the ship is considered to be a valid ship and included in the list of ships used in all subsequent computations. The key stages of the method are:
The distance travelled is calculated in the "Calculate the amount of motion" block by computing the minimum and maximum latitude and longitude over all ship positions in the selected period. The differences in latitude and longitude, the deltas, are compared against predefined threshold values. Finally, the ships that comply with the minimal motion requirement are saved in the Moving Ships list to be used in the computation of the indicators.
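The filter described above can be sketched in plain Python as follows. The function name, the input layout and the choice to accept a ship when either delta exceeds its threshold are assumptions for illustration; the sample scripts implement the same idea on Spark DataFrames and may combine the two deltas differently.

```python
def moving_ships(positions, min_delta_lat, min_delta_lon):
    """Return the MMSIs whose latitude or longitude extent exceeds a threshold.

    positions: iterable of (mmsi, lat, lon) tuples for the selected period.
    """
    extent = {}  # mmsi -> [min_lat, max_lat, min_lon, max_lon]
    for mmsi, lat, lon in positions:
        if mmsi not in extent:
            extent[mmsi] = [lat, lat, lon, lon]
        else:
            e = extent[mmsi]
            e[0] = min(e[0], lat)
            e[1] = max(e[1], lat)
            e[2] = min(e[2], lon)
            e[3] = max(e[3], lon)
    # Assumption: one sufficiently large delta is enough to count as moving.
    return {
        mmsi
        for mmsi, (lo_lat, hi_lat, lo_lon, hi_lon) in extent.items()
        if hi_lat - lo_lat > min_delta_lat or hi_lon - lo_lon > min_delta_lon
    }


# Illustrative data: "A" stays within a harbour, "B" makes a voyage.
sample = [
    ("A", 50.00, 10.00), ("A", 50.00, 10.01),
    ("B", 50.00, 10.00), ("B", 52.00, 14.00),
]
print(moving_ships(sample, min_delta_lat=0.5, min_delta_lon=0.5))
```

A ship that reports only a single position has zero deltas and is therefore dropped, which is how one-off spurious messages are removed.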
The "Time In Port" indicator
The "Time in Port" indicator measures the total time spent by all ships within the boundaries of the port, aggregated monthly over the defined period. It is based on the summation of all time differences between eligible messages, i.e. messages that originate from inside the port bounding box. The indicator is not considered very sensitive to random noise but can be affected severely by ships that spend long periods within the port, e.g. pilot boats. As these ships are not considered to be directly involved in trade, it is desirable to exclude them from the indicator.
The start and end dates of the period for which the Time in Port indicator is calculated are entered first and used in a filter to reduce the amount of AIS data that is read into the memory of the cluster. An inner join between the incoming AIS data and the "moving ships" list then reduces the data further to only valid "moving" ships operating within the area of interest. Next, the messages of each MMSI are sorted by time, and the time differences between consecutive messages are calculated and saved as time deltas. Subsequently, each message is labelled, based on the reported ship coordinates, as originating either from one of the predefined port areas or from outside any port area. Each port has a unique index number, which is saved with each message and forms the port index. A time period index is also created by applying an appropriate time function to the reported timestamp. Both indices are then used to group the messages and sum the time differences to produce the cumulative time spent by ships in the specific port, i.e. the "Time in Port" indicator.
The result is exported to a file in CSV format, which makes it more accessible to external tools outside the UNGP.
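The aggregation can be sketched in plain Python under a few assumptions: messages are `(mmsi, timestamp, lat, lon)` tuples, each port is a single box keyed by its port index, the time period index is the calendar month, and each time delta is attributed to the earlier of the two messages. All names are hypothetical, and the sample scripts perform the equivalent steps with Spark joins and window functions.

```python
from collections import defaultdict
from datetime import datetime, timedelta


def port_index(ports, lat, lon):
    """Return the index of the first port box containing the position, else None."""
    for idx, (min_lat, min_lon, max_lat, max_lon) in ports.items():
        if min_lat <= lat <= max_lat and min_lon <= lon <= max_lon:
            return idx
    return None


def time_in_port(messages, ports, moving):
    """Sum time deltas per (port index, (year, month)) for ships in `moving`."""
    by_ship = defaultdict(list)
    for mmsi, ts, lat, lon in messages:
        if mmsi in moving:  # plays the role of the inner join with the list
            by_ship[mmsi].append((ts, lat, lon))

    totals = defaultdict(timedelta)
    for track in by_ship.values():
        track.sort(key=lambda m: m[0])  # sort each ship's messages by time
        for (ts, lat, lon), (next_ts, _, _) in zip(track, track[1:]):
            idx = port_index(ports, lat, lon)
            if idx is not None:
                totals[(idx, (ts.year, ts.month))] += next_ts - ts
    return dict(totals)


# Illustrative data only.
ports = {1: (50.0, 10.0, 50.1, 10.1)}
messages = [
    ("A", datetime(2019, 5, 1, 2), 50.05, 10.05),  # in port
    ("A", datetime(2019, 5, 1, 0), 50.05, 10.05),  # in port (out of order)
    ("A", datetime(2019, 5, 1, 5), 51.00, 11.00),  # left the port
    ("C", datetime(2019, 5, 1, 0), 50.05, 10.05),  # not a moving ship
    ("C", datetime(2019, 5, 1, 9), 50.05, 10.05),
]
result = time_in_port(messages, ports, moving={"A"})
print(result)
```

The last message of each ship has no successor and therefore contributes no time delta in this sketch.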
The "Port Traffic" indicator
The "Port Traffic" indicator captures how many unique ships have been observed in port, based on their reported MMSI. By design, the indicator is not sensitive to high-frequency visits. For example, if a ferry has multiple sailings per day and the indicator is calculated on a monthly basis, that ferry contributes only one count to the total monthly port traffic. However, the indicator is by its nature very sensitive to one-off spurious messages and to misplaced coordinates, as a single wrong message can bias the result. Careful cleaning and filtering of the data is therefore required to produce accurate outputs.
Computation of the "Port Traffic" indicator bears many similarities to that of the "Time in Port" indicator. The main difference is that the aggregation is not done on the time differences between individual messages but on a count of unique MMSIs in each group of messages originating from the same port and the same time period index. It should be noted that, due to this type of aggregation, a single message from an MMSI within a port carries the same weight as multiple reports, which prevents the indicator from capturing the effect of high-frequency visits such as ferries with multiple daily sailings.
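A minimal sketch of the distinct-count aggregation is given below. It assumes the messages have already been labelled with a port index (or `None` when outside any port) and uses the calendar month as the time period index; the function name and input layout are hypothetical.

```python
from collections import defaultdict
from datetime import datetime


def port_traffic(labelled):
    """Count unique MMSIs per (port index, (year, month)).

    labelled: iterable of (mmsi, timestamp, port_idx) with port_idx None
    for messages outside every port.
    """
    ships = defaultdict(set)
    for mmsi, ts, idx in labelled:
        if idx is not None:
            ships[(idx, (ts.year, ts.month))].add(mmsi)
    return {key: len(members) for key, members in ships.items()}


# Illustrative data: a ferry calling three times in May and once in June.
labelled = [
    ("FERRY", datetime(2019, 5, 1, 8), 1),
    ("FERRY", datetime(2019, 5, 1, 12), 1),
    ("FERRY", datetime(2019, 5, 2, 8), 1),
    ("CARGO", datetime(2019, 5, 3, 8), 1),
    ("CARGO", datetime(2019, 5, 4, 8), None),
    ("FERRY", datetime(2019, 6, 1, 8), 1),
]
traffic = port_traffic(labelled)
print(traffic)
```

The ferry counts once for May despite its three in-port messages, which is the insensitivity to high-frequency visits described above.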
Data cleaning and preparation
The data available on the UN Global Platform is pre-cleaned by the service provider for advanced analytical techniques. However, gaps in the messages can still occur. The information taken from the AIS data should therefore be checked for irregular patterns. For example, the number of messages or MMSIs should not drop significantly from one hour to the next.
Figure 1: Number of MMSI (left) and messages (right) by hour from day 112 until day 217 in 2019 of the two available AIS data sets at the UN Global Platform.
Figure 1 highlights that the Orbcomm data has large drops in the last days shown. The data looks better in the beginning of the time series because the values were already back-filled by the data provider. Thus, it is reasonable to check drops in the numbers and to ask the data provider to back-fill the data if possible.
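One simple way to automate this check is to flag every hour whose count falls by more than a chosen fraction relative to the previous hour. The sketch below is an illustration only: the function name and the 50% default threshold are assumptions, and a suitable threshold depends on the dataset.

```python
def flag_drops(hourly_counts, max_drop=0.5):
    """Return the positions of hours whose count fell by more than `max_drop`
    (as a fraction) compared with the preceding hour."""
    flagged = []
    for i in range(1, len(hourly_counts)):
        prev, cur = hourly_counts[i - 1], hourly_counts[i]
        if prev > 0 and (prev - cur) / prev > max_drop:
            flagged.append(i)
    return flagged


# Illustrative hourly message counts.
counts = [1000, 980, 400, 420, 1000]
print(flag_drops(counts))
```

The same function can be applied to the hourly number of distinct MMSIs.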
Because the full global dataset does not fit into the memory of the cluster, an area of interest and a period of interest must be selected to extract the relevant AIS data. Then, using the extracted data, a bounding box is built around the reported coordinates of each ship and used to evaluate its amount of motion over this period. If the bounding box is bigger than the predefined minimum size, the ship is included in the further computation of the indicators.
Step-by-step guidelines for executing the sample scripts on UNGP
For all applications that use AIS data provided by the UN Global Platform, the first steps are the same. At the beginning of the Jupyter notebook, the user needs to establish the connection to the distributed data. For this, the Spark/pyspark environment parameters need to be configured appropriately. The first part of the scripts imports the libraries used, among others from the pyspark.sql module:
pyspark.sql.SparkSession: Main entry point for DataFrame and SQL functionality
pyspark.sql.types: List of data types available
pyspark.sql.functions: List of built-in functions available for DataFrame
pyspark.sql.window: For working with window functions
geomesa_pyspark: Provides integration with the Spark Python API for accessing data in GeoMesa data stores
Because of the way the geomesa_pyspark library interacts with the underlying Java libraries, a GeoMesa configuration must be set up before referencing the pyspark library. Spark is accessed using a Yarn master by default. Using the geomesa_pyspark configurations, a Spark session is created. Then, the parameters for reading in the data need to be specified. These are used to read the data by SparkSession.read.
Steps for reading in AIS data in the location service of UNGP.
- Import necessary libraries
- Set configuration by geomesa_pyspark.configure
- Create SparkSession
- Set parameters for reading the data
- Read the data
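The steps above can be sketched as a single function. The calls follow the pattern in the GeoMesa PySpark documentation, but everything passed in (the runtime jar path, the Spark home, the data store parameters and the feature name) is a placeholder assumption that depends on the UNGP environment; the notebook readingAISData.ipynb contains the actual settings. The imports are placed inside the function so that geomesa_pyspark is configured before pyspark is referenced.

```python
def read_ais(jars, spark_home, store_params, feature, app_name="ais-reader"):
    """Create a Spark session via geomesa_pyspark and load one feature type.

    jars         -- list of paths to the GeoMesa Spark runtime jar (placeholder)
    spark_home   -- path of the Spark installation (placeholder)
    store_params -- dict of GeoMesa data store parameters (environment specific)
    feature      -- name of the feature type holding the AIS data (placeholder)
    """
    # Steps 1-2: import and configure geomesa_pyspark before touching pyspark.
    import geomesa_pyspark

    conf = geomesa_pyspark.configure(
        jars=jars,
        packages=["geomesa_pyspark", "pytz"],
        spark_home=spark_home,
    ).setAppName(app_name)

    # Step 3: create the Spark session from that configuration.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.config(conf=conf).getOrCreate()

    # Steps 4-5: pass the read parameters and load the data as a DataFrame.
    df = (
        spark.read.format("geomesa")
        .options(**store_params)
        .option("geomesa.feature", feature)
        .load()
    )
    return spark, df
```

The returned DataFrame can then be filtered by time period and area of interest before any aggregation, so that only the relevant part of the data is read into cluster memory.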
On the right, the code to read in the AIS data from the UN Global Platform can be downloaded as a notebook (readingAISData.ipynb).
A more complete code example, which computes the above indicators from the Orbcomm dataset, is also provided as a notebook (documented_orbcomm_HB.ipynb). The comments in the notebook provide additional guidance and explain how the code works. The best way to open and test the notebook is to load it in the UNGP location service environment.
One of the visualization tools available on the UN Global Platform is Stealth, developed by CCRi. It is a browser-based dynamic spatial visualization tool that can represent many layers of geospatial data along with their relationships, often with animation of up to millions of entities in your browser.
Stealth on UNGP: how does it work?
- Choose data source. It is possible to choose satellite AIS data from the provider exactEarth or Orbcomm. For all data sources, historical and live data are available.
- Select area (e.g. rectangle, polygon, circle)
- See and set options (available attributes from dynamic and static data and the ability to apply a filter).
- Result: We get an overview of all available vessels in the selected area.
By selecting a ship you can get detailed information.
There are also functions that allow in-depth analysis, e.g. distance measurement.
|File readingAISData.ipynb||Oct 31, 2019 by Ann-Kristin Kreutzmann|
|File documented_orbcomm_HB.ipynb||Feb 10, 2020 by Alexandre Noyvirt|
|File tradeIndicatorsClass_Indonesia.ipynb||Feb 11, 2020 by Ann-Kristin Kreutzmann|