In Chapter 1, High-Performance Computing Fundamentals, we introduced the concepts of HPC applications, why we need HPC, and its use cases across different industries. Before we begin developing HPC applications, we need to migrate the required data into the cloud. In this chapter, we will uncover some of the challenges in managing and transferring data to the cloud and the ways to mitigate them. We will dive deeper into the AWS online and offline data transfer services that you can use to securely transfer data to the AWS cloud while maintaining data integrity and consistency. We will cover different data transfer scenarios and provide guidance on how to select the right service for each one.
We will cover the following topics in this chapter:
These topics will help you understand how you can transfer Gigabytes (GB), Terabytes (TB), or Petabytes (PB) of data onto the cloud with minimal disruption, cost, and time.
Let’s get started with data management and its role in HPC applications.
Data management is the process of effectively capturing, storing, and collating data created by different applications in your company to make sure it’s accurate, consistent, and available when needed. It includes developing policies and procedures for managing your end-to-end data life cycle. The following elements of the data life cycle are specific to HPC applications and are the reason it’s important to have data management policies in place:
Now that we have understood the significance of data management in HPC applications, let’s see some of the challenges of transferring large amounts of data into the cloud.
To start building HPC applications on the cloud, you need to have your data on the cloud, and you also need to think about the various elements of your data life cycle so that you can manage that data effectively. One option is to write custom code for transferring data, but this is time-consuming and might involve the following challenges:
The data transfer and migration services offered by AWS can assist you in securely transferring data to the cloud without having to write and manage code, helping you overcome the aforementioned challenges. To select the right service for your business requirements, you first need to build a data transfer strategy. We will discuss the AWS data transfer services in a subsequent section of this chapter. Let’s first understand the items that you need to consider while building your strategy.
In a nutshell, your data transfer strategy needs to take the following into account in order to move data with minimal disruption, time, and cost:
Based on the answers to these questions, you can create your data strategy and select the appropriate AWS services that will help you transfer data with ease and mitigate the challenges mentioned in the preceding list. To understand this better, let’s move on to the next topic and see how to securely transfer large amounts of data into the cloud with a simple example.
To understand this topic, let’s start with a simple example where you want to build and train a computer vision deep learning model to detect product defects in your manufacturing production line. You have cameras installed on each production line, which capture hundreds of images each day. Each image can be up to 5 MB in size, and you have about 1 TB of data, which is currently stored on-premises in a NAS filesystem that you want to use to train your machine learning model. You have about 1 Gbps of network bandwidth and need to start training your model in 2-4 weeks. There is no impact on other applications if the data is moved to the cloud and no structured data is needed for building the computer vision model. Let’s rearrange this information into the following structure, which will become part of your data strategy document:
Since you have 1 TB of data with a maximum file size of 5 MB to transfer securely to Amazon S3, you can use the AWS DataSync service. It is an AWS online data transfer service that migrates data securely through a Virtual Private Cloud (VPC) endpoint, which avoids your data traveling over the open internet. We will discuss all the AWS data transfer services in detail in later sections of this chapter.
The following architecture visually depicts how the transfer will take place:
Figure 2.1 – Data transfer using AWS DataSync with a VPC endpoint
The AWS DataSync agent transfers the data between your local storage (NAS, in this case) and AWS. You deploy the agent in a Virtual Machine (VM) in the on-premises network where your data source resides. With this approach, you can minimize the network overhead while transferring data using the Network File System (NFS) and Server Message Block (SMB) protocols.
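To make this concrete, the following is a minimal sketch of activating a deployed DataSync agent with the AWS SDK for Python (boto3), assuming the agent VM is already running on-premises; the activation key comes from the agent VM’s local console, and all IDs and ARNs shown here are hypothetical placeholders:

```python
import boto3

datasync = boto3.client("datasync", region_name="us-east-1")

# Activate the on-premises agent and associate it with your AWS account.
# The VPC endpoint parameters keep agent traffic off the public internet,
# matching the architecture shown in Figure 2.1.
response = datasync.create_agent(
    ActivationKey="EXAMPLE-ACTIVATION-KEY",  # hypothetical key from the agent VM
    AgentName="on-prem-nas-agent",
    VpcEndpointId="vpce-0123456789abcdef0",
    SubnetArns=["arn:aws:ec2:us-east-1:111122223333:subnet/subnet-0123456789abcdef0"],
    SecurityGroupArns=["arn:aws:ec2:us-east-1:111122223333:security-group/sg-0123456789abcdef0"],
)
print(response["AgentArn"])
```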
Let’s take a deeper look into AWS DataSync in the next section.
Online data transfer services are out-of-the-box solutions built by AWS for transferring data between on-premises systems and the AWS cloud via the internet. They include the following services:
Let’s look at each of these services in detail to understand the scenarios in which we can use the relevant services.
AWS DataSync helps you overcome the challenges of transferring data from on-premises storage to AWS storage services, and between AWS storage services, in a fast and secure fashion. It also enables you to automate or schedule data transfers to optimize your use of network bandwidth, which might be shared with other applications. You can monitor the data transfer task, add data integrity checks to make sure that the transfer was successful, and validate that data was not corrupted during the transfer, all while preserving file permissions and associated metadata. DataSync integrates with multiple filesystems and enables you to transfer data between the following resources:
Important note
We will discuss AWS storage services in detail in Chapter 4, Data Storage.
As discussed, AWS DataSync is used for transferring data to the cloud over the network. Let’s now see some of the specific use cases for which you can use DataSync:
We will use architecture diagrams to show how DataSync can transfer data from on-premises self-managed storage systems to AWS storage services, as well as between AWS storage resources.
We will start with on-premises storage to AWS storage services.
The architecture in Figure 2.2 depicts the data transfer from on-premises to AWS storage resources:
Figure 2.2 – Data transfer from on-premises to AWS storage services using AWS DataSync
The DataSync agent is a VM that reads data from and writes data to your on-premises storage. You can configure and activate your agent using the DataSync console or API; this process associates your agent with your AWS account. Once the agent is activated, you can create a data transfer task from the console or API to kick-start the data transfer. DataSync encrypts data and performs a data integrity check during transfer to make sure that data is transferred securely. You can also enable additional checks to verify that the data copied to the destination is the same as the data read at the source, and you can monitor your data transfer task. The time that DataSync takes to transfer data depends on your network bandwidth, the amount of data, and the network traffic; however, a single data transfer task is capable of fully utilizing a 10 Gbps network link.
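The following boto3 sketch outlines that workflow for our manufacturing example: it creates a source location for the on-premises NFS share, a destination location for an S3 bucket, and a transfer task with additional integrity verification enabled. The hostname, bucket, and ARNs are hypothetical placeholders:

```python
import boto3

datasync = boto3.client("datasync", region_name="us-east-1")

# Source: the on-premises NFS share, reached through the activated agent.
nfs_location = datasync.create_location_nfs(
    ServerHostname="nas.example.internal",  # hypothetical NAS hostname
    Subdirectory="/exports/camera-images",
    OnPremConfig={
        "AgentArns": ["arn:aws:datasync:us-east-1:111122223333:agent/agent-0123456789abcdef0"]
    },
)

# Destination: an S3 bucket, accessed via an IAM role that DataSync assumes.
s3_location = datasync.create_location_s3(
    S3BucketArn="arn:aws:s3:::my-training-data-bucket",
    S3Config={"BucketAccessRoleArn": "arn:aws:iam::111122223333:role/DataSyncS3Role"},
)

# Create the transfer task with full end-to-end data verification, then start it.
task = datasync.create_task(
    SourceLocationArn=nfs_location["LocationArn"],
    DestinationLocationArn=s3_location["LocationArn"],
    Options={"VerifyMode": "POINT_IN_TIME_CONSISTENT"},
)
execution = datasync.start_task_execution(TaskArn=task["TaskArn"])
print(execution["TaskExecutionArn"])
```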
Let’s take a deeper look to understand the data transfer between AWS storage resources using AWS DataSync.
The architecture in Figure 2.3 depicts the data transfer between AWS storage resources using DataSync in the same AWS account. The same architecture applies for data transfers within the same region as well as cross-region:
Figure 2.3 – Data transfer between AWS storage resources using AWS DataSync
As shown in the architecture, DataSync does not use an agent for transferring data between AWS resources in the same account. However, if you want to transfer data between different AWS accounts, you need to deploy and activate the DataSync agent on an Amazon EC2 instance in an AWS Region.
In summary, you can use AWS DataSync for online data transfer from on-premises to AWS storage services, and between AWS storage resources. AWS DataSync transfers data quickly, safely, and in a cost-effective manner while ensuring data integrity and consistency, without the need to write and manage data-copy scripts.
Now, let’s move on to another AWS data transfer service, AWS Transfer Family, which is used for scaling your recurring business-to-business file transfers to Amazon S3 and Amazon EFS.
File transfer protocols such as File Transfer Protocol (FTP), SSH File Transfer Protocol (SFTP), and File Transfer Protocol Secure (FTPS) are commonly used in business-to-business data exchange workflows across different industries, including financial services, healthcare, manufacturing, and retail. AWS Transfer Family helps you scale and migrate these file workflows to the AWS cloud. It uses the FTP, SFTP, and FTPS protocols for data transfer and enables you to transfer files to and from Amazon EFS and Amazon S3.
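As an illustration, the following boto3 sketch creates a service-managed SFTP endpoint backed by Amazon S3 and adds a user to it; the role ARN, bucket path, and public key are hypothetical placeholders:

```python
import boto3

transfer = boto3.client("transfer", region_name="us-east-1")

# Create a public, service-managed SFTP endpoint backed by Amazon S3.
server = transfer.create_server(
    Protocols=["SFTP"],
    IdentityProviderType="SERVICE_MANAGED",
    Domain="S3",
    EndpointType="PUBLIC",
)

# Add a user whose home directory maps to a path in an S3 bucket.
# The IAM role must grant access to that bucket.
transfer.create_user(
    ServerId=server["ServerId"],
    UserName="alice",
    Role="arn:aws:iam::111122223333:role/TransferS3AccessRole",  # hypothetical role
    HomeDirectory="/my-transfer-bucket/home/alice",              # hypothetical path
    SshPublicKeyBody="ssh-rsa AAAAB3NzaC1yc2E...",               # truncated example key
)
```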
As discussed, AWS Transfer Family uses protocols such as FTP, SFTP, and FTPS for data exchange workflows in business-to-business contexts. So, let’s understand some of the common use cases for transferring data to and from Amazon S3 and Amazon EFS using AWS Transfer Family:
Now that we have gone over some of the use cases for AWS Transfer Family, let’s see how it works.
The architecture in Figure 2.4 shows how files are transferred using AWS Transfer Family from on-premises file servers to Amazon S3 or Amazon EFS, which can then be used for downstream file processing workflows such as content distribution, machine learning, and data analysis:
Figure 2.4 – File transfer workflow using AWS Transfer Family
You can configure any standard file transfer protocol client, such as WinSCP, FileZilla, or OpenSSH, to transfer files to Amazon S3 or Amazon EFS using AWS Transfer Family. The service first authenticates the user against the identity provider type that you have configured, and once the user is authenticated, it initiates the file transfer.
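For example, here is a sketch of what such a client-side upload might look like using the third-party paramiko library in Python; the server endpoint, username, and key path are hypothetical placeholders:

```python
import paramiko

# Connect to a hypothetical AWS Transfer Family SFTP endpoint with key-based auth.
client = paramiko.SSHClient()
client.set_missing_host_key_policy(paramiko.AutoAddPolicy())
client.connect(
    hostname="s-0123456789abcdef0.server.transfer.us-east-1.amazonaws.com",
    username="alice",
    key_filename="/home/alice/.ssh/transfer-key",
)

# Upload a file; the remote path maps to the user's home directory in S3 or EFS.
sftp = client.open_sftp()
sftp.put("report.csv", "report.csv")
sftp.close()
client.close()
```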
So far, we have seen how we can transfer data over the network using AWS DataSync and AWS Transfer Family, and we have understood their use cases and how these services transfer data securely and cost-effectively while reducing the operational burden. Let’s now see how we can accelerate data transfers to S3 using Amazon S3 Transfer Acceleration.
Amazon S3 Transfer Acceleration (S3TA) is a feature of Amazon S3 buckets that lets you speed up data transfers to an S3 bucket over long distances, regardless of internet traffic conditions and without the need for any special clients or proprietary network protocols. You can speed up transfers to and from Amazon S3 by 50-500% using the transfer acceleration feature.
Some of the use cases include the following:
It uses Amazon CloudFront’s globally distributed edge locations, the AWS backbone network, and network protocol optimizations to route traffic, which speeds up the transfer, reduces the variability of internet traffic, and logically shortens the distance to S3 for remote applications.
Important note
There is an additional charge to use Amazon S3 Transfer Acceleration.
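As a sketch, enabling and using S3TA from boto3 might look like the following; the bucket and file names are hypothetical placeholders:

```python
import boto3
from botocore.config import Config

s3 = boto3.client("s3")

# Enable Transfer Acceleration on the bucket (a one-time configuration change).
s3.put_bucket_accelerate_configuration(
    Bucket="my-training-data-bucket",  # hypothetical bucket name
    AccelerateConfiguration={"Status": "Enabled"},
)

# Create a client that routes requests through the accelerate endpoint
# (bucketname.s3-accelerate.amazonaws.com) instead of the regional endpoint.
s3_accelerated = boto3.client("s3", config=Config(s3={"use_accelerate_endpoint": True}))
s3_accelerated.upload_file("large-dataset.tar", "my-training-data-bucket", "large-dataset.tar")
```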
We have discussed using online data transfer services such as AWS DataSync, AWS Transfer Family, and S3TA for moving data from on-premises storage to AWS storage resources over the network. There might be scenarios where you want to transfer streaming data in real time to the AWS cloud, for example, telemetry data from IoT sensors, video for online streaming applications, and so on. For this, we will go deeper into Amazon Kinesis, which is a fully managed streaming service built by AWS.
Amazon Kinesis is a fully managed service used to collect, process, and analyze streaming data in real time at any scale. Streaming data can include ingesting application logs, audio, video, website clickstreams, or IoT sensor data for deep learning, machine learning, analytics, and other applications. It allows you to perform data analysis as the data arrives in real time, instead of waiting for all the data to be transferred before processing.
Amazon Kinesis includes the following services:
Figure 2.5 – Capture, process, and store video streams for machine learning, analytics, and playback
Figure 2.6 – Capture data from different sources into Amazon Kinesis Data Streams
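As a brief illustration, a producer can write records to a Kinesis data stream with a single boto3 call; the stream name and payload below are hypothetical placeholders:

```python
import json
import boto3

kinesis = boto3.client("kinesis", region_name="us-east-1")

# Send one IoT-style telemetry record into a stream. Records with the same
# partition key are routed to the same shard, preserving their ordering.
record = {"sensor_id": "line-3-camera-1", "temperature_c": 41.7}
kinesis.put_record(
    StreamName="sensor-telemetry",  # hypothetical stream name
    Data=json.dumps(record).encode("utf-8"),
    PartitionKey=record["sensor_id"],
)
```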
Figure 2.7 shows how Amazon Kinesis Data Firehose can be used for ETL use cases without having to write custom code or manage your own infrastructure at scale:
Figure 2.7 – ETL using Amazon Kinesis Data Firehose
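By way of illustration, a producer writes to a Firehose delivery stream in much the same way; the delivery stream name and payload are hypothetical, and the buffering, transformation, and destination (for example, S3) are configured on the stream itself rather than in producer code:

```python
import json
import boto3

firehose = boto3.client("firehose", region_name="us-east-1")

# Firehose buffers, optionally transforms, and delivers records to the
# destination configured on the delivery stream.
firehose.put_record(
    DeliveryStreamName="clickstream-to-s3",  # hypothetical delivery stream
    Record={"Data": json.dumps({"page": "/checkout", "user": "u-123"}).encode("utf-8") + b"\n"},
)
```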
Figure 2.8 shows how a typical Kinesis Data Analytics application works. It has three main components:
Figure 2.8 – Real-time processing of streaming data using Amazon Kinesis Data Analytics
In this section, we discussed how to transfer streaming data to AWS storage and process it using Amazon Kinesis. There are some use cases, such as edge computing and edge storage, for which you can use AWS Snowcone, a portable, rugged, and secure device for edge computing, storage, and data transfer. Next, let’s see how we can transfer data online from AWS Snowcone to AWS.
AWS Snowcone is a small, rugged, and portable device, used for running edge computing workloads, edge storage, and data transfer. The device weighs about 4.5 lbs (2.1 kg) and has multiple layers of security and encryption. It has 8 TB of storage, while the AWS Snowcone Solid State Drive (SSD) version provides 14 TB. Some of the common use cases for Snowcone are as follows:
You can transfer terabytes of data from various AWS Snowcone devices over the network using AWS DataSync, as discussed in the AWS DataSync section.
AWS online data transfer services are helpful when you have to transfer up to terabytes of data over the network to AWS. The time taken to transfer data is dependent on your available network bandwidth and internet traffic. When you have to transfer data from remote locations, or when your network bandwidth is heavily used by existing applications, you would need an alternative mechanism to transfer data offline. Let’s discuss the AWS offline data transfer services in the next section.
For transferring up to petabytes of data offline, in a secure and cost-effective fashion, you can use AWS Snow Family devices. Sometimes, your applications may require enhanced performance at the edge, where you want to process and analyze data close to its source in order to deliver meaningful insights in real time. This means having AWS-managed hardware and software services beyond the AWS cloud. AWS Snow Family can help you run operations outside of your data center, as well as in remote locations with limited network connectivity.
It consists of the following devices:
Now that we understand the various offline data transfer options offered by AWS, let’s understand the process for ordering the device.
To order AWS Snowmobile, you need to contact AWS sales support. For Snowcone or Snowball devices, you can follow these steps:
Figure 2.9 – AWS Snow Family console
Figure 2.10 – AWS Snow Family – Create new job
Figure 2.11 – Getting started with import to S3 job
Figure 2.12 – Choose your shipping preferences
Figure 2.13 – Choose your job details
On this screen, you can select your Snow device, power supply, wireless options for Snowcone, S3 bucket, compute using EC2 instances, and the option to install the AWS IoT Greengrass validated AMI.
Please note that the S3 bucket will appear as directories on your device, and the data in those directories will be transferred back to S3. If you have selected the AWS IoT Greengrass AMI to run IoT workloads on the device, you also need to select the Remote device management option so that you can manage the device remotely with AWS OpsHub or the Snowball client.
Important note
Not all of the options mentioned in step 4 are shown in Figure 2.13 – Choose your job details, but they will be present on your console screen.
Figure 2.14 – Choose your security preferences
Figure 2.15 – Choose your notification preferences
Figure 2.16 – Review and create your job
Figure 2.17 – Snow Family jobs
Important note
On the Actions drop-down menu, you also have options to cancel a job, edit a job name, and clone a job.
In this section, we learned about AWS Snow Family devices to transfer data offline based on our application requirements, network connectivity, available bandwidth, and the location of our data sources. We also discussed how we can use these devices not only for transferring data but also for edge computing.
One of the most frequently asked questions on this topic is, how do we calculate the time taken to move data to the cloud based on the network speed and available bandwidth? For this, AWS provides a simple formula based on the best-case scenario, which is given as follows:

Number of days = (Total bytes) / (Network speed in Mbps × 125 × 1,000 × Network utilization × 60 seconds × 60 minutes × 24 hours)

For example, if we have a network connection of 1.544 Mbps and we want to move 1 TB of data into or out of the AWS cloud, then theoretically the minimum time it would take to transfer the data over the network connection at 80% network utilization is 82 days.
Important note
Please note that this formula only gives a high-level estimate; the actual time taken might differ based on the variability of network traffic and available bandwidth.
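As a quick sanity check, the following short Python snippet implements the preceding formula and reproduces the 82-day figure for a 1.544 Mbps link, taking 1 TB as 2^40 bytes:

```python
def transfer_days(total_bytes: float, mbps: float, utilization: float = 0.8) -> float:
    """Best-case transfer time in days for a given link speed and utilization."""
    bytes_per_second = mbps * 125 * 1000 * utilization  # 1 Mbps = 125,000 bytes/s
    return total_bytes / (bytes_per_second * 60 * 60 * 24)

# 1 TB over a 1.544 Mbps (T1) link at 80% utilization is roughly 82 days.
print(round(transfer_days(2**40, 1.544)))  # -> 82
```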
Let’s now take a brief look at all the topics that we have covered in this chapter.
In this chapter, we talked about various aspects of data management, including data governance and compliance with the legal requirements of federal and regional authorities of the country where the data resides. We also discussed that in order to build HPC applications on the cloud, we need to have data on the cloud, and looked at the challenges of transferring this data to the cloud. In order to mitigate these challenges, we can use the managed AWS data transfer services, and in order to select which service to use for your application, we then discussed the elements of building a data strategy.
We then took an example of how we can transfer terabyte-scale data to the cloud in order to understand the concepts involved in a data transfer strategy. Finally, we did a deep dive on various AWS data transfer services for both online and offline data transfer, based on your network bandwidth, connectivity, type of application, speed of data transfer, and location of your data source.

Now that we understand the mechanisms for transferring data to the cloud, the challenges involved, and how to mitigate them, in the next chapter we will focus on understanding the various compute options provided by AWS for running HPC applications, and how to optimize them based on application requirements.
The following are some additional resources for this chapter: