Contact S3 reps about the access times and their stddev and find out if these can be lowered.

Description

Here’s an email chain with lots of information about the high standard deviations we see when accessing S3. Unlike a chain in a mail reader, this starts with the oldest message and newer messages follow.


Joe,

It was good talking with you at ESIP last week.

We have been working with NASA/ESDIS on cloud-optimized data access and direct access and subsetting of data stored on S3 (using HTTP Range-GET). In the course of our work we have noticed that S3 has unpredictable response latency with a standard deviation of 3 or more. Is there a way to minimize this?

Thanks,
James
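
For reference, the range accesses under discussion look roughly like the following. This is a minimal libcurl sketch, not our actual test harness; the object URL and byte range are placeholders, and it assumes the object is readable without extra authentication (e.g., public or via a presigned URL).

    // Minimal sketch of an HTTP Range GET against S3 using libcurl.
    // The URL and byte range are placeholders.
    #include <curl/curl.h>
    #include <cstdio>
    #include <string>

    static size_t write_cb(char *ptr, size_t size, size_t nmemb, void *userdata) {
        auto *buf = static_cast<std::string *>(userdata);
        buf->append(ptr, size * nmemb);
        return size * nmemb;
    }

    int main() {
        curl_global_init(CURL_GLOBAL_DEFAULT);
        CURL *curl = curl_easy_init();
        std::string body;

        curl_easy_setopt(curl, CURLOPT_URL,
                         "https://cloudydap.s3.amazonaws.com/some-object.h5");
        curl_easy_setopt(curl, CURLOPT_RANGE, "0-1023");   // first 1 KiB of the object
        curl_easy_setopt(curl, CURLOPT_WRITEFUNCTION, write_cb);
        curl_easy_setopt(curl, CURLOPT_WRITEDATA, &body);

        CURLcode rc = curl_easy_perform(curl);
        if (rc != CURLE_OK)
            fprintf(stderr, "curl: %s\n", curl_easy_strerror(rc));
        else
            printf("read %zu bytes\n", body.size());

        curl_easy_cleanup(curl);
        curl_global_cleanup();
        return rc == CURLE_OK ? 0 : 1;
    }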


Joe sends us onward to Kyle…

Hi James

Can you help with the below and I can escalate this?

  1. Bucket name/region?

  2. Latency from where you are testing from to the S3 endpoint for that region?

  3. Is testing directly on the internet or through a government TIC (or from an EC2 instance in the same region as the bucket?)

  4. Do these results appear to be limited to the ranged gets or repeatable on S3 generally?

  5. Can you share the specific data you have captured on this?

  6. What are your timeout and retry values set to in your code, if any?

Thanks

Kyle Hart

Sr. Solutions Architect

m: 571-278-4195 e: awskyle@amazon.com

a: 13200 Woodland Park Rd, Herndon, VA 20171


I reply…

On Jul 24, 2019, at 18:49, Hart, Kyle <awskyle@amazon.com> wrote:


Hi James, can you help with the below and I can escalate this?

Bucket name/region?

US East (N Virginia). Bucket name ‘cloudydap’

Latency from where you are testing from to the S3 endpoint for that region?

I’m not sure about this.

Is testing directly on the internet or through a government TIC (or from an EC2 instance in the same region as the bucket?)

We are running tests using EC2 instances also in the same region as the bucket.

Do these results appear to be limited to the ranged gets or repeatable on S3 generally?

We are testing only range gets.

Can you share the specific data you have captured on this?

Yes, although it will take some time; we have the data on the accesses in a Google sheet.

What are your timeout and retry values set to in your code, if any?

We are using libcurl for the accesses. I’m not sure about the timeout values (there may be none), but the retry limit is 10 (which we never hit).

Thanks,

James
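
For reference, the “timeout and retry values” being asked about look roughly like this in libcurl terms. This is only a sketch of the shape of our setup, not the actual code; the 5 s / 30 s timeouts are illustrative, and fetch_range_once() is a hypothetical helper that assumes the handle’s URL is already configured.

    // Sketch of a capped retry loop around a libcurl ranged transfer.
    #include <curl/curl.h>

    // Hypothetical helper: perform one ranged request on an already-configured handle.
    CURLcode fetch_range_once(CURL *curl, const char *range) {
        curl_easy_setopt(curl, CURLOPT_RANGE, range);
        curl_easy_setopt(curl, CURLOPT_CONNECTTIMEOUT, 5L); // seconds to establish the connection
        curl_easy_setopt(curl, CURLOPT_TIMEOUT, 30L);       // seconds for the whole transfer
        return curl_easy_perform(curl);
    }

    // Retry up to max_retries times; 10 matches the limit mentioned above.
    CURLcode fetch_range_with_retries(CURL *curl, const char *range, int max_retries = 10) {
        CURLcode rc = CURLE_OK;
        for (int attempt = 0; attempt <= max_retries; ++attempt) {
            rc = fetch_range_once(curl, range);
            if (rc == CURLE_OK)
                break;
        }
        return rc;
    }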


Then…


On Jul 28, 2019, at 10:46, Hart, Kyle <awskyle@amazon.com> wrote:


Thanks, James. If you can share the performance data on the various access times (that you have in the Google spreadsheet), I can pass that on to the service team to see if those values are expected.

My apologies for the delay (from the end of July to the end of August); summer vacations and the need to collect clean data took their toll ;-)

Here’s one sheet that shows times based on using curl to access data using range gets. In each case, the same kind of objects is used (364 HDF5 files, not that it matters much since we’re just reading bytes from them). Each of the 1k, ..., ~300M tabs shows the results for 100 accesses each over the 364 files.

We have lots of other data, but these isolate the accesses from other things our data server does with the data (like decompressing it, copying it, etc.), and our other tests use just one chunk/block size for the transfers. Here we have four sizes. What we see is a transfer rate of about 600 Mb/s. The accesses all use HTTPS and all are done in this sheet using curl 7.27, which does not default to HTTP/2 nor does it employ keep-alive (we understand the benefits of these). Given that, the 1k chunks come nowhere near 600 Mb/s.

NB: The plots are log/linear.

Questions: How can we drop the RSD? If we measure the average response time and then kill/restart any request that takes, say, twice that value, will there be a benefit, or will the resulting restart take as long as the original request?

If we do decide to kill requests after some time, is there something we can set in/with/for the S3 bucket or should we do that using TCP/IP (or HTTP/S) timeouts?

I may have some questions about parallel accesses to the same object (but using range gets to extract information from different parts), but those will require that we use our code and will require more experiments.

Thanks,

James
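
A note on the statistic being cited: the spreadsheet values are per-request transfer times from curl, and the figure of merit is the relative standard deviation (RSD = stddev / mean). A small sketch of that arithmetic, using made-up sample times; real values would come from curl timing output such as CURLINFO_TOTAL_TIME.

    // Sketch: relative standard deviation (RSD) over per-request transfer times.
    #include <cmath>
    #include <cstdio>
    #include <vector>

    int main() {
        // Illustrative samples only (seconds); one slow outlier inflates the RSD.
        std::vector<double> seconds = {0.021, 0.019, 0.140, 0.022, 0.018};

        double sum = 0.0;
        for (double t : seconds) sum += t;
        double mean = sum / seconds.size();

        double ss = 0.0;
        for (double t : seconds) ss += (t - mean) * (t - mean);
        double stddev = std::sqrt(ss / (seconds.size() - 1)); // sample standard deviation

        printf("mean=%.4f s  stddev=%.4f s  RSD=%.2f\n", mean, stddev, stddev / mean);
        return 0;
    }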


And…


On Sep 3, 2019, at 16:49, Hart, Kyle <awskyle@amazon.com> wrote:


Hi James, I have submitted a request for specialist review. In the meantime, can you tell me what instance sizes you were testing from, and whether you were using an Internet Gateway or an S3 VPC Endpoint?

m4.xlarge for the VM running curl. We are not using VPC and I don’t think an Internet Gateway was involved either. The timing data - collected using curl to eliminate data processing time and isolate just the access behavior - was from a script that called curl using an https URL. The timing data collected this way mirrors the behavior of our code, so using curl seems reasonable. And our code is C++ and uses libcurl (although we set Keep-Alive and so on to boost performance, the data you’re looking at were collected with the curl 7.2x.x executable, which does not use Keep-Alive by default).

Thanks,

James

PS. We have data for our own code, but because it spends time doing things like decompressing data blocks and copying them into an ‘assembled’ response, the wall time is longer and the Relative Standard Deviation is lower (those operations are effectively constant-time and, for decompression, scale linearly with the number of threads/processors, so the code’s performance favors more CPUs even when that might not actually move data out of S3 faster or more uniformly).
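
For reference, the Keep-Alive difference mentioned above amounts to reusing one connection across consecutive range GETs. A hedged sketch of how that is typically done with libcurl by reusing a single easy handle; the URL and ranges are placeholders, and this is not our production code.

    // Sketch: reusing one libcurl easy handle so the TCP/TLS connection is kept
    // alive across consecutive range GETs (what the curl-executable runs did not do).
    #include <curl/curl.h>

    static size_t discard(char *, size_t size, size_t nmemb, void *) {
        return size * nmemb;   // ignore the payload; only the connection reuse matters here
    }

    int main() {
        curl_global_init(CURL_GLOBAL_DEFAULT);
        CURL *curl = curl_easy_init();

        curl_easy_setopt(curl, CURLOPT_URL,
                         "https://cloudydap.s3.amazonaws.com/some-object.h5");
        curl_easy_setopt(curl, CURLOPT_TCP_KEEPALIVE, 1L);     // enable TCP keep-alive probes
        curl_easy_setopt(curl, CURLOPT_WRITEFUNCTION, discard);

        const char *ranges[] = {"0-1023", "1024-2047", "2048-4095"};
        for (const char *r : ranges) {
            curl_easy_setopt(curl, CURLOPT_RANGE, r);
            // Reusing the same handle lets libcurl reuse the open connection,
            // avoiding a new TCP+TLS handshake for every chunk.
            curl_easy_perform(curl);
        }

        curl_easy_cleanup(curl);
        curl_global_cleanup();
        return 0;
    }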


Useful info:

I suspect you were just using an EC2 instance with a Public IP Address and an internet gateway (or VPN to on-premises and to the internet from there). That or a VPC Endpoint (which is not there by default). One of those options is how to get to S3. In theory a VPC Gateway endpoint could be a bit faster. But it would not account significantly for some of the outliers you are seeing.

I’m still waiting to hear back from an S3 SME on if the data you shared is in range for what would be expected.

S3 as an object store of course has many advantages – in-region durability, low-cost, etc. But it is not the same as a high-performance file system. And there are connection overhead issues to account for that traditional file systems would not encounter.

We used to offer guidance to customers to randomize the prefix of objects so they’d land in different partitions, but that guidance was superseded a year ago by an announcement increasing the supported request rate to 5,500 RPS. https://aws.amazon.com/about-aws/whats-new/2018/07/amazon-s3-announces-increased-request-rate-performance/. It does not appear you are exceeding this – though if you are, we should dig deeper on it. Beyond that, what you may be seeing is just inherent variability with S3. It is highly distributed within a region, so each request for s3.us-east-1.amazonaws.com is typically fielded by a different server (you can see this with repeated nslookup requests for that host).
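
A sketch of the repeated-lookup check Kyle suggests: resolve the regional endpoint several times and print the address returned each time (POSIX getaddrinfo; equivalent to running nslookup repeatedly).

    // Sketch: repeated DNS lookups of the regional S3 endpoint typically return
    // different addresses, illustrating how distributed the service front end is.
    #include <arpa/inet.h>
    #include <netdb.h>
    #include <netinet/in.h>
    #include <cstdio>

    int main() {
        const char *host = "s3.us-east-1.amazonaws.com";
        for (int i = 0; i < 5; ++i) {
            addrinfo hints{}, *res = nullptr;
            hints.ai_family = AF_INET;
            hints.ai_socktype = SOCK_STREAM;
            if (getaddrinfo(host, "443", &hints, &res) == 0 && res) {
                char ip[INET_ADDRSTRLEN];
                auto *sin = reinterpret_cast<sockaddr_in *>(res->ai_addr);
                inet_ntop(AF_INET, &sin->sin_addr, ip, sizeof ip);
                printf("lookup %d: %s\n", i + 1, ip);
                freeaddrinfo(res);
            }
        }
        return 0;
    }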

Here's some good guidance you may be aware of: https://docs.aws.amazon.com/AmazonS3/latest/dev/optimizing-performance.html

And this page: https://docs.aws.amazon.com/AmazonS3/latest/dev/optimizing-performance-design-patterns.html

Note this section in particular:

Timeouts and Retries for Latency-Sensitive Applications

Amazon S3 automatically scales in response to sustained new request rates, dynamically optimizing performance. While Amazon S3 is internally optimizing for a new request rate, you will receive HTTP 503 request responses temporarily until the optimization completes. After Amazon S3 internally optimizes performance for the new request rate, all requests are generally served without retries.

For latency-sensitive applications, Amazon S3 advises tracking and aggressively retrying slower operations. When you retry a request, we recommend using a new connection to Amazon S3 and performing a fresh DNS lookup.
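
A hedged sketch of that advice in libcurl terms: give each attempt a tight time budget, and on retry force a brand-new connection and a fresh DNS lookup. The 2-second budget and 3-attempt cap are illustrative only, not recommended values.

    // Sketch of the "retry on a new connection with a fresh DNS lookup" pattern.
    #include <curl/curl.h>

    CURLcode ranged_get_latency_sensitive(const char *url, const char *range,
                                          int max_attempts = 3) {
        CURLcode rc = CURLE_OPERATION_TIMEDOUT;
        for (int attempt = 0; attempt < max_attempts && rc != CURLE_OK; ++attempt) {
            CURL *curl = curl_easy_init();                          // new handle => new connection
            curl_easy_setopt(curl, CURLOPT_URL, url);
            curl_easy_setopt(curl, CURLOPT_RANGE, range);
            curl_easy_setopt(curl, CURLOPT_TIMEOUT_MS, 2000L);      // kill slow requests early
            curl_easy_setopt(curl, CURLOPT_FRESH_CONNECT, 1L);      // do not reuse a cached connection
            curl_easy_setopt(curl, CURLOPT_DNS_CACHE_TIMEOUT, 0L);  // re-resolve the endpoint
            rc = curl_easy_perform(curl);
            curl_easy_cleanup(curl);
        }
        return rc;
    }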

And separately, depending on the nature of your application and what you are trying to do, we can look at hydrating a file system with data from S3 and processing against the file system (like our FSx for Lustre or EFS offerings). These would increase costs, but they are designed to be stood up and torn down when you are done, with the persistent data remaining in S3. And these systems are designed to serve access to a fleet of parallel hosts. But it may be preferable just to pull the data directly from S3 with your application and handle any potential latency variances.

I’ll let you know when I hear further.

Thanks

Kyle Hart

Sr. Solutions Architect

m: 571-278-4195 e: awskyle@amazon.com

a: 13200 Woodland Park Rd, Herndon, VA 20171


And more useful info:

James,

There are a couple of things you can do to help reduce the RSD for range GETs:

1. Aggressive timeouts and retries (exactly what you and Kyle discussed below). The general thought is that the retry is likely to take a different path than the initial, latent request. This will help remove the large outliers where you’re (likely) waiting for the default timeout to abort the connection before retrying. As for how to kill and retry, it would be best to kill the TCP connection, do a fresh DNS lookup, and reconnect. S3 DNS returns a single-record response with a 5s TTL, so you are more likely to get a different path on the next attempt. AWS SDKs provide configurable timeouts, but I’m sure the same could be accomplished with libcurl and curl. You can set per-request and per-connection timeouts for a more holistic approach.

2. Align byte-range requests to part boundaries. Depending on the size of the parts that were uploaded, crossing part boundaries can increase latency. So, for example, if you have parts of 8MB on upload and you do a range GET on 16MB, then you could be requesting 2 or 3 parts depending on the boundary. This may be a non-issue if the objects weren’t uploaded using multipart, or if you’re using the first N bytes of an object and the part size is >300MB. You can figure out the part size by specifying a part number on the GET request as a query string and checking the content-length on the response.

You mentioned you haven’t hit the 10 retry max, are you seeing any retries at all?

Out of curiosity, have you considered using the SDK to help with any of this logic?

Luke Wells

Solutions Architect

Amazon Web Services

cell: +1-540-272-2751 | desk: +1-703-326-3346
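
A sketch of Luke’s second point above: probe the part size by adding partNumber=1 as a query string and reading the Content-Length of the response. It uses a HEAD request so nothing is downloaded, needs libcurl 7.55+ for the _T getinfo, and assumes the URL is already authorized (public object or presigned URL); the object key is a placeholder.

    // Sketch: discover the multipart part size of an S3 object via partNumber=1.
    #include <curl/curl.h>
    #include <cstdio>

    int main() {
        curl_global_init(CURL_GLOBAL_DEFAULT);
        CURL *curl = curl_easy_init();

        curl_easy_setopt(curl, CURLOPT_URL,
                         "https://cloudydap.s3.amazonaws.com/some-object.h5?partNumber=1");
        curl_easy_setopt(curl, CURLOPT_NOBODY, 1L);   // HEAD: headers only, no body download

        if (curl_easy_perform(curl) == CURLE_OK) {
            curl_off_t len = -1;
            curl_easy_getinfo(curl, CURLINFO_CONTENT_LENGTH_DOWNLOAD_T, &len);
            printf("part 1 size: %lld bytes\n", (long long)len);
        }

        curl_easy_cleanup(curl);
        curl_global_cleanup();
        return 0;
    }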


My reply to Luke’s questions:


On Sep 6, 2019, at 01:10, Wells, Luke <lawells@amazon.com> wrote:


James, There are a couple of things you can do to help reduce the RSD for range GETs:

1. Aggressive timeouts and retries (exactly what you and Kyle discussed below). The general thought is that the retry is likely to take a different path than the initial, latent request. This will help remove the large outliers where you’re (likely) waiting for the default timeout to abort the connection before retrying. As for how to kill and retry, it would be best to kill the TCP connection, do a fresh DNS lookup and reconnect. S3 has single record response with a 5s TTL so you are more likely to get a different path on the next attempt. AWS SDKs provide configurable timeouts but I’m sure the same could be accomplished with libcurl and curl. You can set per-request and per-connection timeouts for a more holistic approach.

Right, we can set timeouts with libcurl.

2. Align byte-range requests to part boundaries. Depending on the size of the parts that were uploaded, crossing part boundaries can increase latency.

Our objects are all single part objects (HDF5 files, originally written for spinning disk).

So, for example, if you have parts of 8MB on upload and you do a range GET on 16MB, then you could be requesting 2 or 3 parts depending on the boundary. This may be a non-issue if the objects weren’t uploaded using multipart, or if you’re using the first N bytes of an object and the part size is >300MB. You can figure out the part size by specifying a part number on the GET request as a query string and checking the content-length on the response.

You mentioned you haven’t hit the 10 retry max, are you seeing any retries at all?

Yes, we see about a 0.1% error rate (1 per 1,000 requests).

Out of curiosity, have you considered using the SDK to help with any of this logic?

Is there a C++ SDK?

Thanks,

James
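
For the record: yes, there is an AWS SDK for C++ (aws-sdk-cpp). A hedged sketch of what a ranged GetObject with explicit connect/request timeouts might look like with it; the bucket, key, and timeout values are placeholders, and the SDK also supplies its own configurable retry strategy.

    // Sketch of a ranged GetObject using the AWS SDK for C++ (aws-sdk-cpp).
    #include <aws/core/Aws.h>
    #include <aws/s3/S3Client.h>
    #include <aws/s3/model/GetObjectRequest.h>
    #include <iostream>

    int main() {
        Aws::SDKOptions options;
        Aws::InitAPI(options);
        {
            Aws::Client::ClientConfiguration config;
            config.region = "us-east-1";
            config.connectTimeoutMs = 2000;   // per-connection timeout (illustrative)
            config.requestTimeoutMs = 5000;   // per-request timeout (illustrative)

            Aws::S3::S3Client client(config);

            Aws::S3::Model::GetObjectRequest request;
            request.WithBucket("cloudydap").WithKey("some-object.h5");
            request.SetRange("bytes=0-1023");  // same Range-GET semantics as the curl tests

            auto outcome = client.GetObject(request);
            if (outcome.IsSuccess())
                std::cout << "read " << outcome.GetResult().GetContentLength() << " bytes\n";
            else
                std::cerr << outcome.GetError().GetMessage() << "\n";
        }
        Aws::ShutdownAPI(options);
        return 0;
    }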

Status

Assignee: James Gallagher

Reporter: James Gallagher

Priority: Medium

Labels: None

Story Points: None

Fix versions: None