# AWS S3 Multi-Part Uploads

Once you’ve created an S3 bucket, you’ll likely need to transfer large files, sometimes on the order of gigabytes or even terabytes.

AWS S3 allows object sizes of up to **5 TB**, but a single `PUT` upload is limited to **5 GB**.
To handle larger uploads, S3 offers a **Multi-Part Upload** mechanism.

The idea is simple:

* Split the large file into smaller **parts**
* Upload each part individually (possibly in parallel)
* S3 reassembles these parts into the final object

This method provides two key advantages:

1. **Fault tolerance:** If a part fails to upload, you only need to retry that part rather than restart the whole file.
2. **Parallelism:** Parts can be uploaded concurrently for faster throughput.

---

## Rules to Remember

Before jumping into implementation, note the following S3 Multi-Part Upload constraints:

* A file can be split into a **maximum of 10,000 parts**.
* Each part must be between **5 MiB and 5 GiB** in size; only the last part may be smaller than 5 MiB.
* You can use **S3 lifecycle policies** to automatically abort unfinished multi-part uploads that exceed a time limit.
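
To make the part-count limit concrete, here is a minimal sketch of picking a part size for a given file. It assumes the limits of at most 10,000 parts and at least 5 MiB per part (except the last). `choose_part_size`, `count_parts`, and the 8 MiB default are hypothetical choices for illustration, not part of `boto3`.

```python
MIB = 1024 * 1024
MIN_PART_SIZE = 5 * MIB   # minimum size for every part except the last
MAX_PARTS = 10_000        # maximum number of parts in one upload


def choose_part_size(file_size, preferred=8 * MIB):
    """Return a part size in bytes that keeps the upload within MAX_PARTS."""
    if file_size < 0:
        raise ValueError("file_size must be non-negative")
    # Smallest part size that still fits the file into MAX_PARTS parts
    # (integer ceiling division, so large sizes stay exact).
    needed = (file_size + MAX_PARTS - 1) // MAX_PARTS
    return max(MIN_PART_SIZE, preferred, needed)


def count_parts(file_size, part_size):
    """Number of parts needed to cover file_size bytes."""
    return (file_size + part_size - 1) // part_size
```

For a 100 MiB file the preferred 8 MiB size is kept, giving 13 parts. For a 1 TiB file the part size grows to roughly 105 MiB so that the upload fits in exactly 10,000 parts.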

---

## Steps for a Multi-Part Upload

The process involves three main stages:

1. **Initiate the upload**
   Use the `CreateMultipartUpload` API. It returns an `UploadId` that identifies the upload in every later call.

2. **Upload each part**
   Use the `UploadPart` API. Each call returns an `ETag` that identifies the content of the uploaded part.

3. **Complete the upload**
   Call `CompleteMultipartUpload` with the `UploadId`, part numbers, and their ETags.
   S3 then assembles the file from these parts.

> Uploading a part with a part number that was already used overwrites the earlier part, so you can replace individual parts while the upload is still in progress.
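
Because S3 tracks uploaded parts under the `UploadId`, an interrupted upload can be resumed by asking S3 which parts it already holds. The sketch below assumes a `boto3` S3 client is passed in as `client`. `uploaded_parts` and `missing_part_numbers` are hypothetical helper names. It relies on the `list_parts` call and its `IsTruncated` and `NextPartNumberMarker` pagination fields.

```python
def uploaded_parts(client, bucket, key, upload_id):
    """Return {part_number: etag} for the parts S3 has already received."""
    found = {}
    kwargs = {"Bucket": bucket, "Key": key, "UploadId": upload_id}
    while True:
        page = client.list_parts(**kwargs)
        for part in page.get("Parts", []):
            found[part["PartNumber"]] = part["ETag"]
        if not page.get("IsTruncated"):
            return found
        # Continue listing after the last part number seen on this page.
        kwargs["PartNumberMarker"] = page["NextPartNumberMarker"]


def missing_part_numbers(total_parts, already_uploaded):
    """Part numbers (starting at 1) that still need to be uploaded."""
    return [n for n in range(1, total_parts + 1) if n not in already_uploaded]
```

With that result, only the missing part numbers need to be read from the file and sent through `UploadPart` again before completing the upload.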

---

## Hands-On Implementation

We’ll walk through creating a Python script using `boto3` to perform a multi-part upload.

### Step 1 — Get Upload ID

```python
import boto3

def start_upload(bucket, key):
    """Returns the UploadId for multi-part upload"""
    client = boto3.client("s3")
    response = client.create_multipart_upload(Bucket=bucket, Key=key)
    return response["UploadId"]
```

This initializes the upload and returns the `UploadId`.

---

### Step 2 — Upload One Part

```python
def upload_part(bucket, key, part_num, upload_id, data):
    """Upload a part to S3"""
    client = boto3.client("s3")
    response = client.upload_part(
        Bucket=bucket,
        Key=key,
        PartNumber=part_num,
        UploadId=upload_id,
        Body=data
    )
    print(f"Uploaded part {part_num} with ETag {response['ETag']}")
    return {'PartNumber': part_num, 'ETag': response['ETag']}
```

Every part is uploaded under the same `UploadId` and carries its own part number (1 to 10,000), which determines its position in the final object.
The response contains an `ETag` that must be passed during final assembly.
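
If you want S3 to verify each part on arrival, `upload_part` also accepts a `ContentMD5` argument holding the base64-encoded MD5 digest of the body. Here is a minimal sketch, assuming a `boto3` S3 client is passed in as `client`. `content_md5` and `upload_part_checked` are hypothetical names used only for this example.

```python
import base64
import hashlib


def content_md5(data):
    """Base64-encoded MD5 digest, the format the Content-MD5 header expects."""
    return base64.b64encode(hashlib.md5(data).digest()).decode("ascii")


def upload_part_checked(client, bucket, key, part_num, upload_id, data):
    """Upload one part and send its MD5 so S3 can reject a corrupted body."""
    response = client.upload_part(
        Bucket=bucket,
        Key=key,
        PartNumber=part_num,
        UploadId=upload_id,
        Body=data,
        ContentMD5=content_md5(data),
    )
    return {"PartNumber": part_num, "ETag": response["ETag"]}
```

The return value has the same shape as in the step above, so it can be collected and passed to the final assembly call unchanged.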

---

### Step 3 — Putting It Together

You can parallelize uploads using the `concurrent.futures` module:

```python
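from concurrent.futures import ProcessPoolExecutor, as_completed
import boto3

# Assumes bucket, key, file_name and chunk_size_bytes are already defined,
# along with start_upload() and upload_part() from the previous steps.
# On platforms that start worker processes with "spawn" (Windows, macOS),
# this code must run under `if __name__ == "__main__":`, as in the full script.

# Start upload
upload_id = start_upload(bucket, key)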

# Upload parts in parallel
futures = []
with ProcessPoolExecutor(max_workers=10) as executor:
    with open(file_name, "rb") as f:
        i = 1
        chunk = f.read(chunk_size_bytes)
        while len(chunk) > 0:
            future = executor.submit(
                upload_part,
                bucket=bucket,
                key=key,
                part_num=i,
                upload_id=upload_id,
                data=chunk
            )
            futures.append(future)
            i += 1
            chunk = f.read(chunk_size_bytes)

# Collect results
results = [f.result() for f in as_completed(futures)]

# Complete upload
boto3.client("s3").complete_multipart_upload(
    Bucket=bucket,
    Key=key,
    UploadId=upload_id,
    MultipartUpload={'Parts': sorted(results, key=lambda e: e["PartNumber"])}
)
```
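
The snippet above leaves the already uploaded parts behind if any part fails. One way to handle that, sketched here under the assumption that a `boto3` S3 client is passed in as `client`, is to call `abort_multipart_upload` before re-raising the error. `upload_with_cleanup` is a hypothetical name, and parts are uploaded sequentially to keep the example short.

```python
def upload_with_cleanup(client, bucket, key, chunks):
    """Upload an iterable of byte chunks as one object; abort on any failure."""
    upload_id = client.create_multipart_upload(Bucket=bucket, Key=key)["UploadId"]
    parts = []
    try:
        for part_num, data in enumerate(chunks, start=1):
            response = client.upload_part(
                Bucket=bucket,
                Key=key,
                PartNumber=part_num,
                UploadId=upload_id,
                Body=data,
            )
            parts.append({"PartNumber": part_num, "ETag": response["ETag"]})
        return client.complete_multipart_upload(
            Bucket=bucket,
            Key=key,
            UploadId=upload_id,
            MultipartUpload={"Parts": parts},
        )
    except Exception:
        # Discard the parts uploaded so far so they do not linger in the bucket.
        client.abort_multipart_upload(Bucket=bucket, Key=key, UploadId=upload_id)
        raise
```

The same `try`/`except` shape can be wrapped around the parallel version: collect the futures inside the `try` block and abort if any `result()` call raises.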

---

### Step 4 — Testing the Program

```bash
# Create a test bucket
aws s3 mb s3://test-3224-random --region us-east-1

# Run the program
python3 upload.py --file app.msi --bucket test-3224-random --key app.msi --chunk_size 6
```

That’s it — the script uploads your file in multiple parallel parts, then assembles them in S3.
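
An interrupted test run leaves its parts in the bucket, where they keep counting toward storage until the upload is completed or aborted. The lifecycle rule mentioned under Rules to Remember can clean these up automatically. The sketch below assumes a `boto3` S3 client is passed in as `client`. The function names, the rule ID, and the 7-day window are arbitrary illustrative choices.

```python
def abort_rule(days=7, rule_id="abort-incomplete-multipart-uploads"):
    """Build a lifecycle rule that aborts uploads left unfinished for `days` days."""
    return {
        "ID": rule_id,
        "Status": "Enabled",
        "Filter": {"Prefix": ""},  # empty prefix: apply to the whole bucket
        "AbortIncompleteMultipartUpload": {"DaysAfterInitiation": days},
    }


def apply_abort_rule(client, bucket, days=7):
    """Install the rule. This call replaces the bucket's existing lifecycle rules."""
    return client.put_bucket_lifecycle_configuration(
        Bucket=bucket,
        LifecycleConfiguration={"Rules": [abort_rule(days)]},
    )
```

Because `put_bucket_lifecycle_configuration` overwrites the current configuration, a bucket that already has lifecycle rules would need them merged into the `Rules` list first.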

---

## Full Solution Code

Here’s the complete working script:

```python
import boto3
import argparse
import json
from concurrent.futures import ProcessPoolExecutor, as_completed

def start_upload(bucket, key):
    """Returns the UploadId for multi-part upload"""
    client = boto3.client("s3")
    response = client.create_multipart_upload(Bucket=bucket, Key=key)
    return response["UploadId"]

def upload_part(bucket, key, part_num, upload_id, data):
    """Upload a part to S3"""
    client = boto3.client("s3")
    response = client.upload_part(
        Bucket=bucket,
        Key=key,
        PartNumber=part_num,
        UploadId=upload_id,
        Body=data
    )
    print(f"Uploaded part {part_num} and received ETag {response['ETag']}")
    return {'PartNumber': part_num, 'ETag': response['ETag']}

if __name__ == '__main__':
    MB = 1024 * 1024

    parser = argparse.ArgumentParser()
    parser.add_argument("--file", required=True)
    parser.add_argument("--bucket", required=True)
    parser.add_argument("--key", required=True)
    parser.add_argument("--chunk_size", required=True, type=int,
                        help="Size of each part in MB (at least 5; only the last part may be smaller).")
    args = parser.parse_args()

    file_name = args.file
    bucket = args.bucket
    key = args.key
    chunk_size_bytes = int(args.chunk_size) * MB

    upload_id = start_upload(bucket, key)

    futures = []
    with ProcessPoolExecutor(max_workers=10) as executor:
        with open(file_name, "rb") as f:
            i = 1
            chunk = f.read(chunk_size_bytes)
            while len(chunk) > 0:
                future = executor.submit(upload_part, bucket, key, i, upload_id, chunk)
                futures.append(future)
                i += 1
                chunk = f.read(chunk_size_bytes)

    results = [f.result() for f in as_completed(futures)]

    response = boto3.client("s3").complete_multipart_upload(
        Bucket=bucket,
        Key=key,
        UploadId=upload_id,
        MultipartUpload={'Parts': sorted(results, key=lambda e: e["PartNumber"])}
    )

    # default=str guards against values that json cannot serialize directly.
    print(json.dumps(response, indent=2, default=str))
```

---

## Summary

* Multi-part upload splits large files for parallel and resumable uploads.
* Each part is individually uploaded and later combined by S3.
* This method enhances reliability and performance for massive datasets.


