
AWS S3 Multi-Part Uploads


My name is Pulkit, and I am a seasoned Data Engineer. Along with my expertise in Spark / Hadoop applications, I am deeply fond of the AWS Cloud. I love to learn new tech and broaden my horizons every single day.

Once you’ve created an S3 bucket, you’ll likely need to transfer large files — sometimes in the order of gigabytes or even terabytes.

AWS S3 allows object sizes of up to 5 TB, but a single PUT upload is limited to 5 GB. To handle larger uploads, S3 offers a Multi-Part Upload mechanism.

The idea is simple:

  • Split the large file into smaller parts
  • Upload each part individually (possibly in parallel)
  • S3 reassembles these parts into the final object

This method provides two key advantages:

  1. Fault tolerance: If an upload fails midway, you can resume from the failed part.
  2. Parallelism: Parts can be uploaded concurrently for faster throughput.

Rules to Remember

Before jumping into implementation, note the following S3 Multi-Part Upload constraints:

  • A file can be split into a maximum of 10,000 parts.
  • Each part must be between 5 MB and 5 GB in size; only the last part may be smaller than 5 MB.
  • You can use S3 lifecycle policies to automatically abort unfinished multi-part uploads that exceed a time limit.
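These limits interact: for very large files, the minimum viable part size is driven by the 10,000-part cap rather than the 5 MB floor. A small sketch (the choose_part_size helper is my own illustration, not part of the AWS SDK) shows one way to pick a compliant size:

```python
MB = 1024 * 1024
MAX_PARTS = 10_000
MIN_PART_SIZE = 5 * MB

def choose_part_size(file_size, preferred=8 * MB):
    """Pick a part size that satisfies S3's multi-part constraints:
    at least 5 MB per part and at most 10,000 parts overall."""
    # Smallest part size that keeps the part count at or under 10,000
    required = -(-file_size // MAX_PARTS)  # ceiling division
    return max(MIN_PART_SIZE, preferred, required)
```

For a 1 TB file this returns roughly 105 MB per part, because 10,000 parts of 8 MB would only cover 80 GB.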

Steps for a Multi-Part Upload

The process involves three main stages:

  1. Initiate the upload: call the CreateMultipartUpload API. It returns an UploadId used to reference the ongoing upload.

  2. Upload each part: call the UploadPart API. Each call returns an ETag (a checksum for that part).

  3. Complete the upload: call CompleteMultipartUpload with the UploadId, part numbers, and their ETags. S3 then assembles the final object from these parts.

You can even overwrite a part while the upload is in progress: re-uploading with the same part number replaces the earlier data for that part.


Hands-On Implementation

We’ll walk through creating a Python script using boto3 to perform a multi-part upload.

Step 1 — Get Upload ID

import boto3

def start_upload(bucket, key):
    """Returns the UploadId for multi-part upload"""
    client = boto3.client("s3")
    response = client.create_multipart_upload(Bucket=bucket, Key=key)
    return response["UploadId"]

This initializes the upload and returns the UploadId.


Step 2 — Upload One Part

def upload_part(bucket, key, part_num, upload_id, data):
    """Upload a part to S3"""
    client = boto3.client("s3")
    response = client.upload_part(
        Bucket=bucket,
        Key=key,
        PartNumber=part_num,
        UploadId=upload_id,
        Body=data
    )
    print(f"Uploaded part {part_num} with ETag {response['ETag']}")
    return {'PartNumber': part_num, 'ETag': response['ETag']}

Each part requires the same UploadId and its sequence number. The response contains an ETag that must be passed during final assembly.
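This is also what makes resuming possible: the ListParts API reports which part numbers S3 already holds, so a restarted run can skip them. The helper below only reshapes list_parts response pages; its name and the surrounding wiring are my own sketch, not part of this article's script:

```python
def uploaded_part_numbers(parts_pages):
    """Given response pages from the S3 list_parts paginator, return the
    set of part numbers that were already uploaded successfully."""
    done = set()
    for page in parts_pages:
        for part in page.get("Parts", []):
            done.add(part["PartNumber"])
    return done

# With boto3 you would obtain the pages roughly like this:
# paginator = boto3.client("s3").get_paginator("list_parts")
# pages = paginator.paginate(Bucket=bucket, Key=key, UploadId=upload_id)
```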


Step 3 — Putting It Together

You can parallelize uploads using the concurrent.futures module:

from concurrent.futures import ProcessPoolExecutor, as_completed
import boto3

# Start upload
upload_id = start_upload(bucket, key)

# Upload parts in parallel
futures = []
with ProcessPoolExecutor(max_workers=10) as executor:
    with open(file_name, "rb") as f:
        i = 1
        chunk = f.read(chunk_size_bytes)
        while len(chunk) > 0:
            future = executor.submit(
                upload_part,
                bucket=bucket,
                key=key,
                part_num=i,
                upload_id=upload_id,
                data=chunk
            )
            futures.append(future)
            i += 1
            chunk = f.read(chunk_size_bytes)

# Collect results
results = [f.result() for f in as_completed(futures)]

# Complete upload
boto3.client("s3").complete_multipart_upload(
    Bucket=bucket,
    Key=key,
    UploadId=upload_id,
    MultipartUpload={'Parts': sorted(results, key=lambda e: e["PartNumber"])}
)
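One caveat: if CompleteMultipartUpload is never reached, S3 silently keeps the uploaded parts and bills for them until the upload is aborted. A defensive wrapper along these lines (complete_or_abort is a hypothetical helper, not part of the script) pairs completion with AbortMultipartUpload:

```python
def complete_or_abort(client, bucket, key, upload_id, parts):
    """Try to finish the multi-part upload; on any failure, abort it
    so S3 does not retain (and bill for) the orphaned parts."""
    try:
        return client.complete_multipart_upload(
            Bucket=bucket,
            Key=key,
            UploadId=upload_id,
            MultipartUpload={"Parts": sorted(parts, key=lambda p: p["PartNumber"])},
        )
    except Exception:
        client.abort_multipart_upload(Bucket=bucket, Key=key, UploadId=upload_id)
        raise
```

The lifecycle policy mentioned in the constraints section is the server-side safety net for the same problem, catching uploads that die before even this cleanup can run.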

Step 4 — Testing the Program

# Create a test bucket
aws s3 mb s3://test-3224-random --region us-east-1

# Run the program
python3 upload.py --file app.msi --bucket test-3224-random --key app.msi --chunk_size 6

That’s it — the script uploads your file in multiple parallel parts, then assembles them in S3.
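As a quick sanity check, the number of parts the script prints should match the ceiling of file size over chunk size (expected_parts is just illustrative arithmetic, not part of the script):

```python
MB = 1024 * 1024

def expected_parts(file_size, chunk_size_mb):
    """How many UploadPart calls the script makes for a given file."""
    return -(-file_size // (chunk_size_mb * MB))  # ceiling division
```

A 50 MB app.msi uploaded with --chunk_size 6 should therefore produce 9 parts: eight full 6 MB chunks plus one 2 MB tail (the under-5 MB size is allowed only because it is the last part).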


Full Solution Code

Here’s the complete working script:

import boto3
import argparse
import json
from concurrent.futures import ProcessPoolExecutor, as_completed

def start_upload(bucket, key):
    """Returns the UploadId for multi-part upload"""
    client = boto3.client("s3")
    response = client.create_multipart_upload(Bucket=bucket, Key=key)
    return response["UploadId"]

def upload_part(bucket, key, part_num, upload_id, data):
    """Upload a part to S3"""
    client = boto3.client("s3")
    response = client.upload_part(
        Bucket=bucket,
        Key=key,
        PartNumber=part_num,
        UploadId=upload_id,
        Body=data
    )
    print(f"Uploaded part {part_num} and received ETag {response['ETag']}")
    return {'PartNumber': part_num, 'ETag': response['ETag']}

if __name__ == '__main__':
    MB = 1024 * 1024

    parser = argparse.ArgumentParser()
    parser.add_argument("--file", required=True)
    parser.add_argument("--bucket", required=True)
    parser.add_argument("--key", required=True)
    parser.add_argument("--chunk_size", required=True, help="Size of each part in MB.")
    args = parser.parse_args()

    file_name = args.file
    bucket = args.bucket
    key = args.key
    chunk_size_bytes = int(args.chunk_size) * MB

    upload_id = start_upload(bucket, key)

    futures = []
    with ProcessPoolExecutor(max_workers=10) as executor:
        with open(file_name, "rb") as f:
            i = 1
            chunk = f.read(chunk_size_bytes)
            while len(chunk) > 0:
                future = executor.submit(upload_part, bucket, key, i, upload_id, chunk)
                futures.append(future)
                i += 1
                chunk = f.read(chunk_size_bytes)

    results = [f.result() for f in as_completed(futures)]

    response = boto3.client("s3").complete_multipart_upload(
        Bucket=bucket,
        Key=key,
        UploadId=upload_id,
        MultipartUpload={'Parts': sorted(results, key=lambda e: e["PartNumber"])}
    )

    print(json.dumps(response))

Summary

  • Multi-part upload splits large files for parallel and resumable uploads.
  • Each part is individually uploaded and later combined by S3.
  • This method enhances reliability and performance for massive datasets.