Skip to main content
HomeTutorialsData Science

AWS Storage Tutorial: A Hands-on Introduction to S3 and EFS

The complete guide to file storage on AWS with S3 & EFS.
May 2024  · 16 min read

Effective data storage is a critical component to the operational success of any business across all sectors. The more data is collected, the more crucial the need for scalable, reliable, and cost-effective storage solutions.

Choosing the right data storage can be challenging due to the vast number of options, whether in the cloud or on-premises, each with its advantages and limitations.

This article doesn't aim to cover all the possible options but focuses instead on scalable, reliable, and cost-effective services: Amazon Simple Storage Service (Amazon S3) and Amazon Elastic File System (EFS).

We’ll start by explaining why businesses should adopt storage services and continue by covering the fundamentals of storage services. We’ll then offer guidance on integrating storage services into an operational process. The last section is a deep dive into the advanced concepts, specifically focusing on performance and cost optimizations.

If you’re looking for a comprehensive introductory resource on Amazon Web Services (AWS), check out this Introduction to AWS course.

Why AWS for File Storage?

With all the storage options out there, it's important to provide the reasons why a business should use AWS for file storage, and that's the purpose of this section.

It starts by conducting a comparative analysis between AWS storage services and traditional ones, along with brief examples of real-world applications. This will hopefully reinforce the relevance of AWS S3 and EFS in various scenarios.

Comparing storage options

Traditional storage services and AWS storage services provide different sets of features and benefits. We do a comparative analysis in the table below:

Features

Traditional storage

AWS S3 & EFS

Cost

Capital expenditure on hardware and ongoing costs for maintenance—can be unpredictable over time.

The pay-as-you-go pricing model reduces upfront costs and helps with better cost management based on usage.

Compliance

Achieved through internal controls and audits and may require additional resources to maintain standards.

AWS complies with a vast range of industry standards such as HIPAA and GDPR, simplifying user compliance.

Data Accessibility

Mainly accessible from specific locations or networks. Accessing remotely can be less secure and complex.

Globally accessible over the internet, with high availability and data redundancy.

Data durability

Prompt to physical damage, theft, or natural disasters.

Offer 99.99999999999% (11 nines) durability by automatically replicating data across multiple availability zones.

Management & Maintenance

Requires a dedicated team for hardware maintenance, software updates, and troubleshooting.

Handled by AWS. This reduces the administration burden on users.

Performance

Depends on the specific hardware and network infrastructure. Also, updates can be costly.

Provide customizable performance options. S3 provides storage classes and acceleration features, while EFS supports bursting workloads and throughput models.

Scalability

Limited by physical infrastructure and requires manual interventions to scale.

Highly scalable, allowing for easy adjustment to storage without upfront provisioning.

Security

Depends on local security measures. This requires a significant effort to ensure data safety and compliance.

Offer comprehensive security controls, including data encryption (at rest and in transit) access management, and compliance certifications.

Both S3 and EFS have a broad spectrum of applications across industries. Below are some examples of real-world use cases for both services, highlighting their relevance and utility in different scenarios.

Amazon S3 use cases

Let’s first consider the Amazon Simple Storage Service (S3) use cases:

  • Website hosting: S3 hosts static websites such as blogs, marketing websites, and portfolios. Its global reach and high durability ensure that website content is quickly and reliably delivered worldwide.
  • Backup and disaster recovery: S3 can be used for backup, data archiving and disaster recovery due to its high durability and availability.
  • Data lakes for analytics: Vast amounts of structured and unstructured data can be aggregated by companies using S3 for big data analytics.
  • Content distribution and delivery: Media companies use S3 to store and distribute large video, audio, and image files for further analytics purposes. When used with CloudFront and Content Delivery Network services, S3 can efficiently deliver content to users with low latency.

EFS use cases

Let’s now consider the use cases of Amazon Elastic File System (EFS):

  • Shared file storage for EC2 instances: EFS serves as a highly available and scalable file storage for EC2 instances. Applications running on multiple instances can share data through EFS, making it ideal for workloads that require file system features like hierarchical storage, directory navigation, and file locking.
  • Container storage: EFS can be used with containerized applications, including those orchestrated by Amazon ECS and Kubernetes on AWS. It offers a persistent storage layer for containers, allowing data to persist beyond the life of a single container instance.
  • Big data and analytics: EFS is adapted for Big Data Analytics applications requiring shared file storage. It can support complex analytic workloads by providing the necessary throughput, IOPS, and low latency for processing large datasets.

AWS Storage Fundamentals

With all these insights about the S3 and EFS storage services, it's essential to know when to use which service, and that's the purpose of this section. We’ll first introduce block, file, and object storage and then dive into the comparative aspect of S3 and EFS.

Introduction to block, file, and object storage

Block storage, file storage, and object storage are three fundamental types of data storage systems. Each serves different purposes in cloud computing and data management.

Block storage

Block storage segments data into blocks, each with a unique identifier, allowing for flexible and efficient data management. Amazon Elastic Block Store (EBS) and Instance Store Volumes are examples of block storage in the AWS ecosystem.

Instance Store provides ephemeral storage directly attached to a host computer, ideal for temporary data. In contrast, EBS offers persistent block-level storage, making it suitable for databases, applications, and filesystems that require durability and incremental backups through snapshots.

File storage

File storage organizes data into a hierarchical structure of files and folders, similar to a traditional file system on a computer. This intuitive method simplifies data access and management across various applications and services.

Amazon Elastic File System (EFS) is an example of file storage in the cloud. It provides a shared file system that automatically scales and supports simultaneous access from multiple instances.

EFS is particularly beneficial for applications that require shared access to files or for serving content that several users or systems need to access and edit.

Object storage

Object storage, like Amazon S3, is designed to store vast amounts of unstructured data. Unlike block and file storage, object storage manages data as objects within buckets.

Each object includes the data, metadata, and a unique identifier, allowing for efficient data retrieval and management at any scale. S3 is versatile, supporting a wide range of use cases from static website hosting to data archiving with its various storage classes, including Standard, Standard-Infrequent Access (IA), Glacier Flexible Retrieval, and Glacier Deep Archive.

S3 vs. EFS

Choosing between Amazon S3 and Amazon Elastic File System (EFS) depends on the specific requirements of your application or workload. Let’s look at the differences between each:

Feature

Amazon S3

Amazon EFS

Data type

Unstructured data like images, videos, logs, and backups.

Shared files that need to be accessed by multiple EC2 instances simultaneously.

Storage scale

Designed for vast amounts of data, scalable to exabytes.

Automatically scales with your storage needs without manual intervention.

Access pattern

Ideal for data with varying access patterns, from frequently accessed to archival.

Suited for data that requires consistent and simultaneous access by multiple applications.

Integration

Integrates with AWS services for data processing, analytics, and content delivery.

Supports applications requiring a traditional file system interface.

Storage classes

Multiple storage classes.

Not applicable.

Cost considerations

Storage costs vary by class, with options for infrequent access to reduce costs.

Cost based on storage capacity used with no upfront cost and automatic scaling.

Scalability and flexibility

Highly scalable for both storage capacity and performance.

Flexibility in scaling without needing to provision storage, handling peak loads easily.

Getting Started With Amazon S3

Now that we have a better understanding of the file storage concept, it's time to get our hands dirty with some technical implementation.

This section covers the process of managing data with Amazon S3. Specifically, we’ll learn how to create a bucket for storage, organize data into folders, upload files, adjust permissions for public access when needed, apply metadata for management, and finally, delete the bucket and its contents to clear resources and avoid extra costs.

The main prerequisite to successfully completing this section is having an AWS account, which can be created from the AWS website.

Create a bucket

The starting point of this section is the Amazon Web Services Management Console, as shown below:

AWS Management Console

This console helps manage all AWS resources, from compute instances like EC2 to message queuing services like SQS, including security management.

Once on the console, the first step is to create an S3 instance. This can be done using the following steps (you can follow the numbers in the image below):

Creating an S3 instance

We now continue with a few more steps:

Creating an S3 instance

Let’s add a few notes to some of the steps we’ve covered:

  • Step 4: We created the bucket in the region US East (Ohio) us-east-2.
  • Step 5: Bucket names are globally unique, so it's important to choose one that isn't already in use. This way, no one else, regardless of location, can create a bucket with the same name.
  • Step 6: The ACLs enabled option allows us to set detailed permissions on who can access the bucket’s files and what they can do with it. This is beneficial for managing access at the individual file level. The ACLs disabled option doesn’t give us the possibility of setting permissions on individual files.
  • Step 8: The Bucket owner preferred option ensures we own all the files uploaded to the bucket, no matter who uploaded them. This makes it easier to manage permissions and access.

Let’s create the bucket by taking three more steps:

Creating an S3 instance

Creating an S3 instance

Upload data into the bucket

After creating the bucket, we can see the details below. A bucket is a container that can hold resources. Uploading resources starts with creating a folder. Let’s start by taking these two steps:

Upload data into the bucket - steps 1 to 2

Let’s now create the folder:

Upload data into the bucket - steps 3 to 4

Notice that the folder has been created!

Upload data into the bucket - steps 5

Now it's time to upload data in the folder. Let’s try uploading an image:

Upload data into the bucket - steps 6 to 7

We uploaded the image successfully!

Upload data into the bucket - final result

Configure access control

By default, all the files uploaded to the bucket are private and owned by the creator, and only the owner can view, edit, or download them. However, we may need to configure the bucket so anyone can access the file—this can be useful when everyone needs to view the photos on a website.

We can do this configuration by allowing all public access to the image. This should enable us to access the Object URL with public access, but unfortunately, we get an Access Denied message.

Access control configuration

To fix this, we need to navigate to the Permissions tab at the buckets level and edit the Bucket Policy:Access control configuration - steps 1 to 3

Now let’s edit the statement:

Access control configuration - steps 4 to 6Now that we’ve made these changes, the image is finally displayed!

AWS logo after configuration of the access control

AWS logo after configuration of the access control

Introducing Amazon EFS

This section discusses creating an EFS file system, emphasizing performance and security settings.

Create a security group

We first create a security group by choosing the “Security Groups” option under the Security tab on the left banner.

Security groups tab

Then, we:

  • Select Create security group.
  • Use a meaningful name—MyDataCampEFSGroup in our case.
  • Write a brief description of the security group.
  • Leave VPC with the default value.

Creation of the security group - steps 1 to 3

We don’t add any inbound rules:

Inbound rules left by default.

The type of Outbound rule 1 should be All traffic:

Creation of the security group - steps 4 to 5 showing Outbound rules with “All traffic”

Now that we’re done with our configuration, we can click the Create security group button.

Creation of the security group - step 5: Security group creation after the configuration

The newly created security group is displayed in the Security Groups section.

Details of the “MyDataCampEFSGroup”

Set up your first EFS file system

Next, we create an EFS file system. We start by searching and selecting EFS:

Find EFS from the search bar

We now click on Create file system.

Button to start creating EFS file system.

We use a meaningful name, leave the VPC section as it is, and then click on Customize.

Setting up an EFS system

To access the network configuration page, we click Next.

Step to access the Network configuration page.

We now fill out the security groups sections of each availability zone with MyDataCampEFSGroup.

Setting up an EFS instance

Once we’re done, we click on Next again.

Setting up an EFS instance

The final step enables us to review the entire configuration. Once we’re satisfied with everything, we click on Create.

Final step to create the EFS file system after review.

Once everything is successfully executed, we should see the newly created EFS file system in the File systems tab.

Setting up an EFS system

Create an EC2 instance

The next step is to create an EC2 instance, which provides EFS with the necessary computing environment and simultaneously accesses the same data.

Let’s take everything step-by-step! We start by selecting an EC2 instance for the console. Then, we click on Launch instance. We’ll choose Amazon Linux 2, which comes eligible for a free tier.

Amazon Linux 2 AMI

We now create a key pair to securely connect to the instance. The one we make in this section has a .pem extension with the name efs_ec2_key_pair. This file is then downloaded and saved locally for further use.

Keypair configuration

The default storage capacity is 8GiB, which is enough for our scenario, and the number of instances is increased to 3. The final step is to launch the instance by clicking the Launch instance button, and the final result shows the newly created instances.

Newly created instances

Before moving forward, let’s make sure connecting to the instances is possible. We’ll use the following SSH (Secure Shell) command:

ssh -i [path_to_the_key] ec2-user@[Public IPv4 address]

Let’s break down this command step by step:

  • ssh: This is the command that initiates the SSH connection.
  • -i [path_to_the_key]: This option specifies the path to our private SSH key file, which is required for authentication on the remote server.
  • ec2-user: This is the default username for EC2 Linux instances. We can use a different username if configured.
  • @: This symbol separates the username from the server's address.
  • [Public IPv4 address]: This is the public IP address of our EC2 instance. We can find this address in the AWS Management Console.

Let’s see the results of the connection to each instance through SSH.

SSH connection result for the first instance.

SSH connection result for the second instance.

SSH connection result for the third instance.

Configure EFS

This final step allows the mounting of the EC2 instances created above. We can do this from the EFS general interface.

For simplicity’s sake, the mounting process is shown for only one instance, but the same approach applies to the remaining instances.

We start by installing the EFS mount helper for each instance by using the following command:

sudo yum install -y amazon-efs-utils
 

Installation of the amazon efs utils trace log

Installation of the Amazon EFS utilities—trace log.

We then create an efs folder in our instances at the root using sudo, making sure to use the same folder across all the instances.

sudo mkdir /efs
 

Creation of the “efs” folder at the root of the system

After successfully creating the folder, the next step is to mount it. The file system ID is the ID of the EFS we created in the EFS section, and we use it inside the sudo mount -t efs [file system ID]:/ /efs command:

sudo mount -t efs fs-088faeea6372efa83:/ /efs
 

Failure to mount message

This timeout happens because the security group attached to the EFS system we created doesn’t allow any inbound traffic from our EC2 instances. We can fix this by enabling the security group to accept these inbound connections:

Inbound rule creation - step 1 to 4

We now create a new inbound rule:

A more secure option is to create a separate security group for the EC2 instances, and only allow connection from that security group. That configuration is outside the scope of this article.

After successfully completing this process on each EC2 instance, the EFS system can accept connections from anywhere (including our EC2 instances). We don’t get any timeouts this time:

Successful mounting

To make sure the mounting process was successful, we should be able to:

  • Create a folder within the shared efs folder from one EC2 instance
  • Access that folder from another EC2 instance.

The illustration below shows how the my_DataCamp_data folder is created from the first EC2 instance, along with a README.txt file. The same README file can be seen across the remaining EC2 instances.

EFS file sharing illustration

EFS file-sharing.

Thanks to the power of the EFS system configuration, a file created by one instance is accessible by the remaining instances.

To get a broader overview of the AWS ecosystem, check out this course on AWS Cloud Technology and Services.

Advanced Storage Strategies

Effective storage strategies are essential for managing data efficiently. Two key aspects of these strategies are performance optimization and cost optimization.

These strategies are particularly relevant to Amazon’s Simple Storage Service (S3) and Elastic File System (EFS).

Performance optimization in S3 and EFS involves selecting appropriate storage classes and throughput modes to enhance speed and efficiency.

Cost optimization involves using different storage tiers based on data access frequency and utilizing AWS monitoring tools for cost management. These strategies help in efficient data management and cost reduction in cloud storage.

Conclusion

Congratulations on getting this far! We hope this deep dive equips you to start using AWS storage.

One efficient way to master AWS services is to apply them to real-world scenarios—not only standalone but also by combining multiple services. If that sort of exercise sounds interesting to you, check out this course on Streaming Data with AWS Kinesis and Lambda.

If you’re looking to get a job in the AWS space, make sure to check out the relevant AWS certifications in 2024. Also, it could be very helpful to go through the top AWS interview questions and answers.


Photo of Zoumana Keita
Author
Zoumana Keita

Zoumana develops LLM AI tools to help companies conduct sustainability due diligence and risk assessments. He previously worked as a data scientist and machine learning engineer at Axionable and IBM. Zoumana is the founder of the peer learning education technology platform ETP4Africa. He has written over 20 tutorials for DataCamp.

Topics

Learn more about AWS with these courses!

Course

Introduction to AWS

2 hr
6.1K
Discover the world of Amazon Web Services (AWS) and understand why it's at the forefront of cloud computing.
See DetailsRight Arrow
Start Course
See MoreRight Arrow
Related

blog

Fundamentals of Container Orchestration With AWS Elastic Kubernetes Service (EKS)

Unlock the full potential of container orchestration with AWS Elastic Kubernetes Service (EKS). Learn the fundamentals, explore real-world applications in data science, and discover how to optimize costs and scalability.
Gary Alway's photo

Gary Alway

13 min

blog

The Path to Becoming a Data Engineer

The definitive guide to help you become a data engineer.
Vincent Vankrunkelsven's photo

Vincent Vankrunkelsven

17 min

tutorial

AWS EC2 Tutorial For Beginners

Discover why you should use Amazon Web Services Elastic Compute Cloud (EC2) and how you can set up a basic data science environment on a Windows instance.
DataCamp Team's photo

DataCamp Team

7 min

tutorial

Mastering AWS Step Functions: A Comprehensive Guide for Beginners

This article serves as an in-depth guide that introduces AWS Step Functions, their key features, and how to use them effectively.
Zoumana Keita 's photo

Zoumana Keita

tutorial

Cloudera Hadoop Tutorial

Learn about Hadoop ecosystem, the architectures and how to start with Cloudera.
DataCamp Team's photo

DataCamp Team

27 min

tutorial

Deep Learning with Jupyter Notebooks in the Cloud

This step-by-step tutorial will show you how to set up and use Jupyter Notebook on Amazon Web Services (AWS) EC2 GPU for deep learning.
Dan Becker's photo

Dan Becker

10 min

See MoreSee More