Buckets
Naming
There are two valid URL formats that can be used:
- Path-style URL: https://s3.${aws_region}.amazonaws.com/${bucket_name}/${object_key}
- Virtual-hosted-style URL: https://${bucket_name}.s3.${aws_region}.amazonaws.com/${object_key}
Example values:
bucket_name = "my-test-bucket"
aws_region = "eu-north-1"
object_prefix = "images/"
object_key = "images/home.jpg"
In terms of implementation, buckets and objects are AWS resources, and Amazon S3 provides APIs for you to manage them.
Amazon S3 supports global buckets, which means that each bucket name must be unique across all AWS accounts in all the AWS Regions within a partition. You should not depend on specific bucket naming conventions for availability or security verification purposes.
There is a soft limit of 100 buckets per account and a hard limit of 1000 buckets per account.
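As a sketch of those management APIs, a bucket with the example name and Region above could be created via the AWS CLI (name and Region are placeholders):
$ aws s3api create-bucket \
    --bucket my-test-bucket \
    --region eu-north-1 \
    --create-bucket-configuration LocationConstraint=eu-north-1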
Naming:
- 3-63 characters
- lowercase letters, numbers, dots (discouraged), hyphens
- must begin and end with a letter or number
- no consecutive dots (..)
- must not be formatted as an IP address
- forbidden prefixes:
  - xn--
  - sthree-
  - sthree-configurator
- forbidden suffixes:
  - -s3alias
  - --ol-s3
- unique within a partition (partitions: aws, aws-cn, aws-us-gov)
- no dots if using S3 Transfer Acceleration
Configuration:
- website
- versioning
- transfer acceleration
- tagging
- requestPayment (Requester Pays)
- replication
- policy and ACL (access control list)
- object locking
- logging
- location
- lifecycle
- event notification
- CORS (cross-origin resource sharing)
Access options:
- Console, AWS CLI, AWS SDKs or the S3 API (directly)
- Static websites
- S3 Access Points / Multi-Region Access Points (for shared datasets)
- Mountpoint for Amazon S3 (mount locally, high throughput)
- SFTP via AWS Transfer Family
Security
The resource owner is the AWS account that creates the resource. By default, objects created by the account's root user, IAM roles, or IAM users are owned by that AWS account.
A bucket owner can grant cross-account permissions to another AWS account (or users in another account) to upload objects. In this case, the AWS account that uploads objects owns those objects. The bucket owner does not have permissions on the objects that other accounts own, with some exceptions:
- The bucket owner pays the bills. The bucket owner can deny access to any objects, or delete any objects in the bucket, regardless of who owns them.
- The bucket owner can archive any objects or restore archived objects regardless of who owns them.
Amazon S3 considers a bucket or object ACL public if it grants any permissions to members of the predefined AllUsers or AuthenticatedUsers groups.
The anonymous user is identified by a specific canonical user ID: 65a011a29cdf8ec533ec3d1ccaae921c.
All Users group: predefined group.
If an object is uploaded to a bucket through an unauthenticated request, the anonymous user owns the object. The default object ACL grants FULL_CONTROL to the anonymous user as the object’s owner. Therefore, Amazon S3 allows unauthenticated requests to retrieve the object or modify its ACL.
S3 Block Public Access
A helper setting for managing public access. The default for new buckets is to block public access.
It overrides policies and permissions: it must be disabled for other access mechanisms to work, and it is evaluated before policies.
Settings (see the CLI sketch after this list):
- BlockPublicAcls: rejects PUT calls that attach a public ACL to the bucket or its objects. ACLs cannot be made public.
- IgnorePublicAcls: causes S3 to ignore all public ACLs on the bucket and its objects.
- BlockPublicPolicy: rejects PUT calls that attach a public bucket policy to the bucket or its access points.
- RestrictPublicBuckets: restricts access to a bucket with a public policy to AWS service principals and authorized users within the bucket owner's account.
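A minimal sketch of enabling all four settings on a bucket via the CLI (bucket name is a placeholder):
$ aws s3api put-public-access-block \
    --bucket my-test-bucket \
    --public-access-block-configuration \
      BlockPublicAcls=true,IgnorePublicAcls=true,BlockPublicPolicy=true,RestrictPublicBuckets=true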
S3 Object Ownership
S3 Object Ownership is an Amazon S3 bucket-level setting that you can use to control ownership of objects uploaded to your bucket and to disable or enable access control lists (ACLs).
Owner: the AWS account in which the bucket was created.
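As a sketch, the setting can be changed with the CLI; the bucket name is a placeholder and BucketOwnerEnforced is the value that disables ACLs:
$ aws s3api put-bucket-ownership-controls \
    --bucket my-test-bucket \
    --ownership-controls 'Rules=[{ObjectOwnership=BucketOwnerEnforced}]'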
Encryption
S3 uses AES-256 for encryption.
Available options (a CLI sketch follows this list):
- Server-Side Encryption (SSE)
  - SSE-S3: uses S3-managed keys, no calls to KMS. Keys are managed by S3.
    For API calls set the "x-amz-server-side-encryption" header to "AES256".
  - (D)SSE-KMS: uses KMS keys to encrypt/decrypt. KMS APIs are called for each encryption/decryption operation (KMS limits and costs will have an impact).
    With DSSE you get dual-layer SSE.
    For API calls set the "x-amz-server-side-encryption" header to "aws:kms" (for SSE-KMS, not DSSE-KMS).
  - Bucket Keys: KMS generates a bucket-level key that S3 keeps with the bucket, so an external KMS call is not made on every request. Works with replication even though KMS is regional, but the ETag of the destination object will not be the same as the source.
  - SSE-C: server-side encryption with customer-provided keys.
    The key is provided in a header of the request and is not stored by AWS after use.
    HTTPS is mandatory in this case.
- Client-Side Encryption: AWS has no role in encrypting. The client encrypts the payload before storing it and decrypts it upon retrieval.
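A minimal sketch of setting SSE-KMS with a Bucket Key as the bucket default and uploading one object with SSE-S3 (bucket name, KMS key ARN, and file names are placeholders):
$ aws s3api put-bucket-encryption \
    --bucket my-test-bucket \
    --server-side-encryption-configuration '{
      "Rules": [{
        "ApplyServerSideEncryptionByDefault": {
          "SSEAlgorithm": "aws:kms",
          "KMSMasterKeyID": "arn:aws:kms:eu-north-1:123456789012:key/example-key-id"
        },
        "BucketKeyEnabled": true
      }]
    }'
$ aws s3api put-object \
    --bucket my-test-bucket \
    --key images/home.jpg \
    --body home.jpg \
    --server-side-encryption AES256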
Note: with Bucket Keys enabled, only the bucket (not the individual objects) will show up in CloudTrail KMS events.
To encrypt existing objects, S3 Batch Operations or the copy-object CLI command can be used.
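A sketch of re-encrypting an existing object in place by copying it onto itself (names are placeholders):
$ aws s3api copy-object \
    --bucket my-test-bucket \
    --key images/home.jpg \
    --copy-source my-test-bucket/images/home.jpg \
    --server-side-encryption aws:kms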
Encryption in transit:
- S3 exposes both HTTP and HTTPS endpoints. The latter is recommended, and it is mandatory when using SSE-C.
- To force encryption in transit, a bucket policy with a Deny effect and a condition of "aws:SecureTransport": "false" can be used (see the example below). No plain HTTP requests will be accepted.
- Bucket policies are evaluated before default encryption.
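A minimal sketch of such a deny policy (the bucket name is a placeholder):
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "DenyInsecureTransport",
            "Effect": "Deny",
            "Principal": "*",
            "Action": "s3:*",
            "Resource": [
                "arn:aws:s3:::Bucket-Name",
                "arn:aws:s3:::Bucket-Name/*"
            ],
            "Condition": {
                "Bool": { "aws:SecureTransport": "false" }
            }
        }
    ]
}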
Storage Management
Object Versioning
States of a bucket:
- Unversioned: versioning has never been active.
- Versioning-enabled: versioning is enabled.
- Versioning-suspended: versioning has been suspended. Once versioning has been activated it cannot be deactivated, only suspended.
Enabling versioning does not assign a version ID to existing objects: their version ID is null, and only new writes (including new versions of those objects) get a version ID.
If you delete an object, instead of removing the object permanently, Amazon S3 inserts a delete marker, which becomes the current object version (performing a GET Object request when the current version is a delete marker returns a 404 Not Found error). If your DELETE operation specifies the versionId, that object version is permanently deleted, and Amazon S3 doesn't insert a delete marker.
You can permanently delete an object by specifying the version that you want to delete. Only the owner of an Amazon S3 bucket or an authorized IAM user can permanently delete a version.
Only PUT calls create a new version. Some actions that modify the current object don't create a new version because they don't PUT a new object; this includes actions such as changing the tags on an object. A short CLI walkthrough follows.
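A sketch of these operations via the CLI (bucket, key, and the version ID are placeholders):
# Enable versioning on the bucket
$ aws s3api put-bucket-versioning \
    --bucket my-test-bucket \
    --versioning-configuration Status=Enabled
# List all versions and delete markers under a prefix
$ aws s3api list-object-versions \
    --bucket my-test-bucket \
    --prefix images/
# Permanently delete one specific version (no delete marker is created)
$ aws s3api delete-object \
    --bucket my-test-bucket \
    --key images/home.jpg \
    --version-id EXAMPLE-VERSION-ID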
Replication
Replication enables automatic, asynchronous copying of objects across Amazon S3 buckets within the same account or across accounts. Replication also works with multiple destinations and different regions (Cross-Region Replication, CRR).
You can use a prefix or tags to limit its scope.
Replication can be:
- Live replication: replicates objects as soon as they arrive in the bucket.
- Batch replication: on-demand.
When replication is enabled, existing objects are not replicated; you must use batch replication to replicate them.
Replication options:
- Which objects to replicate (all, prefix, tags)
- Destination storage class (default = same)
- Destination object ownership: by default, in the case of cross-account replication, the source account is the owner of the replicated objects
Requirements (a configuration sketch follows this list):
- Destination bucket(s)
- An IAM role that can read from the source and write to the destination: in the case of cross-account replication, the destination bucket must allow writes by the source account's role with a bucket policy, because the role is outside the destination account's IAM scope.
- Both source and destination buckets must have versioning enabled.
- If the owner of the source bucket doesn't own the object in the bucket, the object owner must grant the bucket owner READ and READ_ACP permissions with the object access control list (ACL).
- If the source bucket has S3 Object Lock enabled, the destination buckets must also have S3 Object Lock enabled.
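A sketch of a replication configuration; the role ARN, rule ID, prefix, and destination bucket are placeholders:
$ aws s3api put-bucket-replication \
    --bucket my-test-bucket \
    --replication-configuration '{
      "Role": "arn:aws:iam::123456789012:role/s3-replication-role",
      "Rules": [{
        "ID": "replicate-images",
        "Status": "Enabled",
        "Priority": 1,
        "Filter": { "Prefix": "images/" },
        "DeleteMarkerReplication": { "Status": "Disabled" },
        "Destination": {
          "Bucket": "arn:aws:s3:::my-test-bucket-replica",
          "StorageClass": "STANDARD_IA"
        }
      }]
    }'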
What is NOT replicated
- Replicas. They need a batch replication to be themselves replicated; there's no replication chaining A ⇒ B ⇒ C.
- Objects in the source bucket that have already been replicated to a different destination. For example, if you change the destination bucket in an existing replication configuration, Amazon S3 won't replicate the objects again. To replicate previously replicated objects, use Batch Replication.
- Delete markers are not replicated by default, but you can opt in.
- Deletes that specify a version ID: the object version is removed from the source bucket, but the deletion is not replicated to the destination bucket.
- By default, when replicating from a different AWS account, delete markers added to the source bucket are not replicated.
- Objects that are stored in the S3 Glacier Flexible Retrieval, S3 Glacier Deep Archive, S3 Intelligent-Tiering Archive Access, or S3 Intelligent-Tiering Deep Archive Access storage classes or tiers.
- Objects in the source bucket that the bucket owner doesn't have sufficient permissions to replicate.
- Updates to bucket-level subresources. For example, if you change the lifecycle configuration or add a notification configuration to your source bucket, these changes are not applied to the destination bucket. This feature makes it possible to have different configurations on source and destination buckets.
- System actions performed by lifecycle configuration.
Default bucket encryption and replication
- If objects in the source bucket are not encrypted, the replica objects in the destination bucket are encrypted by using the default encryption settings of the destination bucket (as a result, the entity tags (ETags) of the source objects differ from the ETags of the replica objects).
- If objects in the source bucket are encrypted by using SSE-S3, SSE-KMS or DSSE-KMS, the replica objects in the destination bucket use the same type of encryption as the source objects. The default encryption settings of the destination bucket are not used.
Replication use cases
- Replicate objects while retaining metadata – important if you must ensure that your replica is identical to the source object.
- Replicate objects into different storage classes.
- Maintain object copies under different ownership – regardless of who owns the source object, you can tell Amazon S3 to change replica ownership to the AWS account that owns the destination bucket. This is referred to as the owner override option.
- Keep objects stored over multiple AWS Regions.
- Replicate objects within 15 minutes – to replicate your data in the same AWS Region or across different Regions within a predictable time frame, you can use S3 Replication Time Control (S3 RTC). S3 RTC replicates 99.99 percent of new objects stored in Amazon S3 within 15 minutes (backed by a service-level agreement).
- Replicate previously failed or replicated objects.
- Replicate objects and fail over to a bucket in another AWS Region – to keep all metadata and objects in sync across buckets during data replication, use two-way replication (also known as bi-directional replication) rules before configuring Amazon S3 Multi-Region Access Point failover controls. Two-way replication rules help ensure that when data is written to the S3 bucket that traffic fails over to, that data is then replicated back to the source bucket.
CRR (Cross-Region Replication):
- Compliance
- Lower-latency access
- Replication across accounts
SRR (Same-Region Replication):
- Log aggregation
- Live replication between production and test accounts/environments
When to use Cross-Region Replication (CRR)
- Meet compliance requirements
- Minimize latency
- Increase operational efficiency
When to use Same-Region Replication (SRR)
- Aggregate logs into a single bucket
- Configure live replication between production and test accounts
- Abide by data sovereignty laws
When to use two-way replication (bi-directional replication)
- Build shared datasets across multiple AWS Regions
- Keep data synchronized across Regions during failover
- Make your application highly available
When to use S3 Batch Replication
- Replicate existing objects: replication is not retroactive
- Replicate objects that previously failed to replicate
- Replicate objects that were already replicated – you might be required to store multiple copies of your data in separate AWS accounts or AWS Regions. Batch Replication can replicate existing objects to newly added destinations.
- Replicate replicas of objects that were created from a replication rule. Replicas of objects can be replicated only with Batch Replication.
Transfer Acceleration
Amazon S3 Transfer Acceleration is a bucket-level feature that enables fast, easy, and secure transfers of files over long distances between your client and an S3 bucket. It can speed up content transfers to and from Amazon S3 by as much as 50–500 percent for long-distance transfer of larger objects.
Transfer Acceleration takes advantage of the globally distributed edge locations in Amazon CloudFront. As the data arrives at an edge location, the data is routed to Amazon S3 over an optimized network path.
It supports multi-part uploads.
The bucket name MUST be DNS-compatible and it MUST NOT contain dots.
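A sketch of enabling acceleration and uploading through the accelerate endpoint (bucket and file names are placeholders):
$ aws s3api put-bucket-accelerate-configuration \
    --bucket my-test-bucket \
    --accelerate-configuration Status=Enabled
# Uploads then go through the accelerate endpoint
$ aws s3 cp home.jpg s3://my-test-bucket/images/home.jpg \
    --endpoint-url https://s3-accelerate.amazonaws.com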
Requester Pays
With Requester Pays buckets, the requester instead of the bucket owner pays the cost of the request and the data download from the bucket. The bucket owner always pays the cost of storing data.
The requester must be authenticated to AWS.
Useful when sharing data among accounts.
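A sketch of enabling Requester Pays and downloading as an authenticated requester (names are placeholders):
$ aws s3api put-bucket-request-payment \
    --bucket my-test-bucket \
    --request-payment-configuration Payer=Requester
# The requester must acknowledge the charges on each request
$ aws s3api get-object \
    --bucket my-test-bucket \
    --key images/home.jpg \
    --request-payer requester \
    home.jpg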
S3 Inventory
Amazon S3 Inventory provides comma-separated values (CSV), Apache optimized row columnar (ORC) or Apache Parquet output files that list your objects and their corresponding metadata on a daily or weekly basis for an S3 bucket or objects with a shared prefix.
You can use it to audit and report on the replication and encryption status of your objects for business, compliance, and regulatory needs.
Configuration (a CLI sketch follows this list):
- Which metadata to include per object
- Which versions (all, current)
- Destination
- Schedule: daily or weekly
- Inventory list file encryption
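A sketch of a weekly CSV inventory configuration; bucket names, the configuration ID, and the optional fields chosen here are placeholders:
$ aws s3api put-bucket-inventory-configuration \
    --bucket my-test-bucket \
    --id weekly-inventory \
    --inventory-configuration '{
      "Id": "weekly-inventory",
      "IsEnabled": true,
      "IncludedObjectVersions": "All",
      "Destination": {
        "S3BucketDestination": {
          "Bucket": "arn:aws:s3:::my-inventory-destination",
          "Format": "CSV",
          "Prefix": "inventory"
        }
      },
      "Schedule": { "Frequency": "Weekly" },
      "OptionalFields": ["Size", "LastModifiedDate", "ReplicationStatus", "EncryptionStatus"]
    }'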
Note: all of your objects might not appear in each inventory list. The inventory list provides eventual consistency for PUT requests (of both new objects and overwrites) and for DELETE requests. Each inventory list for a bucket is a snapshot of bucket items. These lists are eventually consistent (that is, a list might not include recently added or deleted objects).
Amazon S3 Event Notifications can be used to get notified when an inventory list is delivered.
The inventory can be queried through Athena.
The source bucket:
- contains the objects
- has the inventory configuration
The destination bucket:
- contains the inventory list files
- contains the inventory manifest files that list all the inventory list files that are stored in the destination bucket
- must have a bucket policy to give Amazon S3 permission to verify ownership of the bucket and permission to write files to the bucket
- must be in the same AWS Region as the source bucket
- can be the same as the source bucket
- can be owned by a different AWS account than the account that owns the source bucket
Inventory formats:
- Gzip-compressed CSV
- Zlib-compressed ORC (Apache Optimized Row Columnar)
- Snappy-compressed Apache Parquet
Columns:
- Bucket name
- Object owner
- Key name – when you're using the CSV file format, the key name is URL-encoded and must be decoded before you can use it.
- Version ID – this field is not included if the list is configured only for the current version of the objects.
- IsLatest – set to True if the object is the current version of the object. (This field is not included if the list is configured only for the current version of the objects.)
- Delete marker – set to True if the object is a delete marker.
- Size – not including the size of incomplete multipart uploads, object metadata, and delete markers.
- Last modified date
- ETag – the entity tag (ETag) is a hash of the object. The ETag reflects changes only to the contents of an object, not to its metadata. The ETag can be an MD5 digest of the object data; whether it is depends on how the object was created and how it is encrypted.
- Storage class – set to STANDARD, REDUCED_REDUNDANCY, STANDARD_IA, ONEZONE_IA, INTELLIGENT_TIERING, GLACIER, DEEP_ARCHIVE, OUTPOSTS, GLACIER_IR, or SNOW.
- Multipart upload flag
- Replication status – set to PENDING, COMPLETED, FAILED, or REPLICA.
- Encryption status – set to SSE-S3, SSE-C, SSE-KMS, or NOT-SSE.
- S3 Object Lock retain until date
- S3 Object Lock retention mode – Governance or Compliance.
- S3 Object Lock legal hold status – On or Off.
- S3 Intelligent-Tiering access tier – set to FREQUENT, INFREQUENT, ARCHIVE_INSTANT_ACCESS, ARCHIVE, or DEEP_ARCHIVE.
- S3 Bucket Key status – set to ENABLED or DISABLED. Indicates whether the object uses an S3 Bucket Key for SSE-KMS.
- Checksum algorithm
- Object access control list – an access control list (ACL) for each object that defines which AWS accounts or groups are granted access to this object and the type of access that is granted. The Object ACL field is defined in JSON format.
Website hosting
Website endpoints:
- s3-website dash (-) Region: http://${bucket_name}.s3-website-${aws_region}.amazonaws.com
- s3-website dot (.) Region: http://${bucket_name}.s3-website.${aws_region}.amazonaws.com
If you want to configure an existing bucket as a static website that has public access, you must edit Block Public Access settings for that bucket. You might also have to edit your account-level Block Public Access settings. Additionally, you must add a bucket policy!
In order to use a custom hostname for a Route 53 alias record the bucket must be named the same: for example, to access the static site at http://mysite.com the bucket name must be mysite.com. Note that the website endpoint does not support HTTPS.
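A minimal sketch of enabling website hosting on a bucket (bucket and document names are placeholders):
$ aws s3 website s3://my-test-bucket \
    --index-document index.html \
    --error-document error.html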
Use cases
- Static web content serving.
- Offloading big files from, for example, EC2 instances (see CORS).
- Out-of-band pages: storing "under maintenance" or "out of service" pages so that while the service hosted elsewhere undergoes maintenance the page still shows, because it's in a different place (an S3 bucket).
Public website policy
{
"Version": "2012-10-17",
"Statement": [
{
"Sid": "PublicReadGetObject",
"Effect": "Allow",
"Principal": "*",
"Action": [
"s3:GetObject"
],
"Resource": [
"arn:aws:s3:::Bucket-Name/*"
]
}
]
}
The bucket policy applies only to objects that are owned by the bucket owner. If your bucket contains objects that aren’t owned by the bucket owner, the bucket owner should use the object access control list (ACL) to grant public READ permission on those objects:
Public website ACL
<Grant>
<Grantee xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:type="Group">
<URI>http://acs.amazonaws.com/groups/global/AllUsers</URI>
</Grantee>
<Permission>READ</Permission>
</Grant>
You can optionally enable Amazon S3 server access logging for a bucket that is configured as a static website.
CORS
Least configuration
[
{
"AllowedMethods": [
"METHOD1"
],
"AllowedOrigins": [
"http://www.example1.com"
]
}
]
Complete configuration
[
{
"AllowedHeaders": [
"*"
],
"AllowedMethods": [
"METHOD1",
"METHOD2"
],
"AllowedOrigins": [
"http://www.example1.com"
],
"ExposeHeaders": []
},
{
"AllowedHeaders": [
"*"
],
"AllowedMethods": [
"METHOD1",
"METHOD2",
"METHOD3"
],
"AllowedOrigins": [
"http://www.example2.com"
],
"ExposeHeaders": []
},
{
"AllowedHeaders": [],
"AllowedMethods": [
"GET"
],
"AllowedOrigins": [
"*"
],
"ExposeHeaders": []
}
]
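A sketch of applying a CORS configuration like the ones above from a local file (bucket and file names are placeholders):
# cors.json must wrap the rules shown above in a "CORSRules" key, e.g.
# { "CORSRules": [ { "AllowedMethods": ["GET"], "AllowedOrigins": ["*"] } ] }
$ aws s3api put-bucket-cors \
    --bucket my-test-bucket \
    --cors-configuration file://cors.json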
Data Access
S3 Access Points
Access points are named network endpoints that are attached to buckets that you can use to perform S3 object operations.
Each access point has distinct permissions and network controls that S3 applies for any request that is made through that access point. Each access point enforces a customized access point policy that works in conjunction with the bucket policy: each allow permission on an access point must have a counterpart in the bucket policy. However, you can delegate: the bucket policy grants broad access to the account's access points, and the granular permission definitions move into the access point policies.
Access point bucket-style alias: When you create an access point, Amazon S3 automatically generates an alias that you can use instead of an Amazon S3 bucket name for data access. You can use this access point alias instead of an Amazon Resource Name (ARN) for access point data plane operations.
You can configure any access point to accept requests only from a virtual private cloud (VPC) to restrict Amazon S3 data access to a private network. You can also configure custom block public access settings for each access point.
All block public access settings are enabled by default for access points. You must explicitly disable any settings that you don’t want to apply to an access point.
Creation command:
$ aws s3control create-access-point \
--name partialbucket \
--bucket mybucket230ifde \
--account-id 123456789012
Access point endpoint format: https://${access_point_name}-${account_id}.s3-accesspoint.${aws_region}.amazonaws.com
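A sketch of reading an object through the access point created above by passing its ARN in place of a bucket name (the Region and key shown are assumptions):
$ aws s3api get-object \
    --bucket arn:aws:s3:us-west-2:123456789012:accesspoint/partialbucket \
    --key images/home.jpg \
    home.jpg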
Access point policy:
{
"Version": "2012-10-17",
"Statement": [
{
"Effect": "Allow",
"Principal": {
"AWS": "arn:aws:iam::123456789012:user/Jane"
},
"Action": [
"s3:GetObject",
"s3:PutObject"
],
"Resource": "arn:aws:s3:us-west-2:123456789012:accesspoint/my-access-point/object/Jane/*"
}
]
}
In conjunction with a bucket policy:
{
"Version": "2012-10-17",
"Statement": [
{
"Effect": "Allow",
"Principal": {
"AWS": "arn:aws:iam::123456789012:user/Jane"
},
"Action": [
"s3:GetObject",
"s3:PutObject"
],
"Resource": "arn:aws:s3:::DOC-EXAMPLE-BUCKET1/Jane/*"
}
]
}
Or the bucket can delegate (in this case, to any access point owned by the bucket owner's account):
{
"Version": "2012-10-17",
"Statement" : [
{
"Effect": "Allow",
"Principal" : { "AWS": "*" },
"Action" : "*",
"Resource" : [ "Bucket ARN", "Bucket ARN/*"],
"Condition": {
"StringEquals" : { "s3:DataAccessPointAccount" : "Bucket owner's account ID" }
}
}]
}
Multi-Region Access Points (MRAP)
An MRAP is a single access point that is backed by multiple S3 buckets. You can enable two-way replication between them.
Creation takes a while; it's not immediate.
An interesting use case: you have a bucket replicated in n Regions and allow people to use a single endpoint so that their traffic goes to the lowest-latency bucket. It's an active-active failover configuration. If you try to get, from the MRAP, a file that was uploaded in a far Region and that hasn't yet been replicated to a closer Region, you'll get a 404. S3 won't serve you the file from the origin bucket :(
S3 Select and Glacier Select
You can use structured query language (SQL) statements to filter the contents of an Amazon S3 object and retrieve only the subset of data that you need. You can reduce the amount of data that Amazon S3 transfers, which reduces the cost and latency to retrieve this data.
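A sketch of an S3 Select call against a CSV object with a header row (bucket, key, and query are placeholders):
$ aws s3api select-object-content \
    --bucket my-test-bucket \
    --key data/users.csv \
    --expression "SELECT s.name FROM S3Object s WHERE CAST(s.age AS INT) > 30" \
    --expression-type SQL \
    --input-serialization '{"CSV": {"FileHeaderInfo": "USE"}, "CompressionType": "NONE"}' \
    --output-serialization '{"CSV": {}}' \
    result.csv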
File formats:
- CSV
- JSON
- Apache Parquet
- Gzip, Bzip2 ⇒ CSV and JSON only + Server-Side Encryption
Limitations
- One object at a time
- The maximum length of a SQL expression is 256 KB.
- The maximum length of a record in the input or result is 1 MB.
- Can only emit nested data by using the JSON output format.
- You cannot query an object stored in the S3 Glacier Flexible Retrieval, S3 Glacier Deep Archive, or Reduced Redundancy Storage (RRS) storage classes, or the S3 Intelligent-Tiering Archive Access and Deep Archive Access tiers.
- Apache Parquet objects:
  - Amazon S3 Select supports only columnar compression using GZIP or Snappy
  - Amazon S3 Select doesn't support Parquet output
  - The maximum uncompressed row group size is 512 MB.
  - You must use the data types that are specified in the object's schema.
  - Selecting on a repeated field returns only the last value.