AWS Consistent S3 & Apache Hadoop's S3A connector

AWS Consistent S3 & Apache Hadoop's S3A connector

Steve Loughran-2
as sent to hadoop-general.

TL;DR: S3 is consistent; S3A now works perfectly with S3Guard turned off. If not, file a JIRA. Rename still isn't real, so don't rely on it, or on create(path, overwrite=false), for atomic operations.

-------

If you've missed the announcement, AWS S3 storage is now strongly consistent: https://aws.amazon.com/s3/consistency/

That's full CRUD consistency, consistent listing, and no 404 caching.

You don't get: rename, or an atomic create-no-overwrite. Applications need to know that and code for it.
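
To make that concrete: here is a minimal Java sketch (bucket name and paths are purely illustrative) of a pattern that is NOT safe on S3A, because the existence check and the eventual PUT are separate requests rather than one atomic operation. Anything needing mutual exclusion or an atomic commit has to come from outside S3 itself.

    import java.net.URI;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class NotALock {
      public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(
            URI.create("s3a://example-bucket/"), new Configuration());
        Path lock = new Path("/locks/job.lock");
        // create(path, overwrite=false) probes for an existing object first,
        // then uploads the data when the stream is closed. Two clients can
        // both pass the probe and both "succeed", so this must not be relied
        // on for mutual exclusion or atomic commit.
        try (FSDataOutputStream out = fs.create(lock, false)) {
          out.writeUTF("owner=worker-1");
        }
      }
    }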

This is enabled for all S3 buckets; no need to change endpoints or any other settings. No extra cost, no performance impact. This is the biggest change in S3 semantics since it launched.

What does this mean for the Hadoop S3A connector?

  1. We've been testing it for a while; no problems have surfaced.
  2. There's no need for S3Guard; leave the default settings alone. If you were using it, turn it off, restart *everything*, and then you can delete the DDB table (see the configuration sketch after this list).
  3. Without S3Guard, listings may get a bit slower.
  4. There's been a lot of work in branch-3.3 on speeding up listings against raw S3, especially for code which uses listStatusIterator() and listFiles (HADOOP-17400).
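
If you had S3Guard wired up through per-bucket options, here's a minimal Java sketch of what "turn it off" amounts to; in practice you'd make the same change in core-site.xml everywhere and restart, rather than setting it in code, and the bucket name is illustrative.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class S3GuardOff {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // The default: no metadata store at all, i.e. S3Guard disabled.
        conf.set("fs.s3a.metadatastore.impl",
            "org.apache.hadoop.fs.s3a.s3guard.NullMetadataStore");
        // Clear any per-bucket override too, e.g. for "example-bucket".
        conf.set("fs.s3a.bucket.example-bucket.metadatastore.impl",
            "org.apache.hadoop.fs.s3a.s3guard.NullMetadataStore");
        FileSystem fs = new Path("s3a://example-bucket/").getFileSystem(conf);
        System.out.println("metadata store: "
            + fs.getConf().get("fs.s3a.metadatastore.impl"));
      }
    }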

It'll be time to get Hadoop 3.3.1 out the door for people to play with; it's got a fair few other s3a-side enhancements.

People are still using S3Guard and it needs to be maintained for now, but we'll have to be fairly ruthless about what isn't going to get closed as WONTFIX. I'm worried here about anyone using S3Guard against non-AWS consistent stores. If you are, send me an email.

And so for releases/PRs, doing test runs with and without S3Guard is important. I've recently added an optional, backwards-incompatible change for better scalability: HADOOP-13230, "S3A to optionally retain directory markers", which adds markers=keep/delete to the test matrix. This is a pain, though as you can choose two options at a time, it's manageable.

Apache HBase
============

You still need the HBoss extension in front of the S3A connector to use Zookeeper to lock files during compaction.


Apache Spark
============

Any workflows which chained together reads directly after writes/overwrites of files should now work reliably with raw S3.

  • The classic FileOutputCommitter commit-by-rename algorithms aren't going to fail with FileNotFoundException during task commit.
  • They will still use COPY to rename work, so take O(data) time to commit files.
  • Without atomic directory rename, the v1 commit algorithm can't isolate the commit operations of two task attempts, so it's unsafe as well as very slow.
  • The v2 commit algorithm is slow and has no isolation between task-attempt commits against any filesystem. If different task attempts are generating unique filenames (possibly to work around S3 update inconsistencies), it's not safe; turn that option off.
  • The S3A committers' algorithms are happy talking directly to S3. But: SPARK-33402 is needed to fix a race condition in the staging committer. 
  • The "Magic" committer, which has relied on a consistent store, is safe. There's a fix in HADOOP-17318 for the staging committer; hadoop-aws builds with that in will work safely with older spark versions.

Any formats which commit work by writing a file with a unique name & updating a reference to it in a consistent store (iceberg &c) are still going to work great. Naming is irrelevant and commit-by-writing-a-file is S3's best story.

(+ SPARK-33135 and other uses of incremental listing will get the benefits of async prefetching of the next page of list results)
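
A minimal Java sketch of that incremental-listing pattern (bucket and path are illustrative):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.LocatedFileStatus;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.fs.RemoteIterator;

    public class IncrementalListing {
      public static void main(String[] args) throws Exception {
        Path dir = new Path("s3a://example-bucket/tables/events/");
        FileSystem fs = dir.getFileSystem(new Configuration());
        // listFiles() hands back a RemoteIterator which pages through the S3
        // LIST results as they are consumed, rather than materialising the
        // whole tree up front the way listStatus() does.
        RemoteIterator<LocatedFileStatus> it = fs.listFiles(dir, true);
        long files = 0;
        long bytes = 0;
        while (it.hasNext()) {
          LocatedFileStatus status = it.next();
          files++;
          bytes += status.getLen();
        }
        System.out.println(files + " files, " + bytes + " bytes");
      }
    }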

Distcp
======

There'll be no cached 404s to break uploads, even if you don't have the relevant fixes to stop HEAD requests before creating files (HADOOP-16932 and the revert of HADOOP-8143) or to handle update inconsistency (HADOOP-16775).
  • If your distcp version supports -direct, use it to avoid rename performance penalties.
  • If your distcp version doesn't have HADOOP-15209 it can issue needless DELETE calls to S3 after a big update, and end up being throttled badly. Upgrade if you can.
  • If people are seeing problems: issues.apache.org + component HADOOP is where to file JIRAs; please tag the version of hadoop libraries you've been running with.

thanks,

-Steve

Re: AWS Consistent S3 & Apache Hadoop's S3A connector

Chang Chen
Since S3A now works perfectly with S3Guard turned off, could the Magic Committer work with S3Guard off? If yes, will performance degrade? Or, if HADOOP-17400 is fixed, will it have comparable performance?


Re: AWS Consistent S3 & Apache Hadoop's S3A connector

Steve Loughran-2


On Mon, 7 Dec 2020 at 07:36, Chang Chen <[hidden email]> wrote:
Since S3A now works perfectly with S3Guard turned off, could the Magic Committer work with S3Guard off? If yes, will performance degrade? Or, if HADOOP-17400 is fixed, will it have comparable performance?

Yes, works really well.

* It doesn't have problems with race conditions in job IDs (SPARK-3320) because it does all its work under the dest dir and only supports one job @ a time there.


Performance wise:

* Expect no degradation if you are not working with directories marked as authoritative (hive does that for managed tables). Indeed, you will save on DDB writes.
* HADOOP-17400 speeds up all listing code, but for maximum directory-listing performance you need to use the (existing) incremental listing APIs. See SPARK-33135 for matching work there.

The list performance enhancements will only ship in hadoop-3.3.1. If you use the incremental list APIs today (listStatusIterator, listFiles) everything is lined up, HDFS scales better, and it helps motivate the ABFS dev team to do the same.

There are some extra fixes coming in related to this; credit to Dongjoon for contributing and/or reviewing this work.

HADOOP-17258. Magic S3Guard Committer to overwrite existing pendingSet file on task commit
HADOOP-17318. Support concurrent S3A commit jobs with same app attempt ID (for staging; for magic you can disable aborting all uploads under the dest dir and so have more than one job use the same dest dir; see the sketch after this list)
HADOOP-16798. S3A Committer thread pool shutdown problems.
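
To illustrate the magic-committer half of the HADOOP-17318 note above: the switch is, as far as I recall, fs.s3a.committer.abort.pending.uploads, so treat the property name in this sketch as an assumption to check against your hadoop-aws version.

    import org.apache.hadoop.conf.Configuration;

    public class SharedDestDir {
      public static void main(String[] args) {
        Configuration conf = new Configuration();
        conf.set("fs.s3a.committer.name", "magic");
        // Assumed property name: when false, a committing job no longer aborts
        // every pending multipart upload under the destination directory, so
        // more than one job can write to the same dest dir; each job still
        // cleans up its own uploads.
        conf.setBoolean("fs.s3a.committer.abort.pending.uploads", false);
        System.out.println(conf.get("fs.s3a.committer.abort.pending.uploads"));
      }
    }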

I'm also actively working on HADOOP-17414, Magic committer files don't have the count of bytes written collected by spark: 
https://github.com/apache/hadoop/pull/2530

Spark doesn't track bytes written as it is only measuring the 0-byte marker file.

The Hadoop-side patch

* Returns all S3 object headers as XAttr attributes prefixed "header."
* Sets the custom header x-hadoop-s3a-magic-data-length to the length of the data in the marker file.

There's a matching Spark change which looks for the header through the getXAttr API when the reported length of the output file is 0 bytes. If the attribute is present and parses to a positive long, that value is used as the declared output size.
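
That probe is roughly this shape; a Java sketch against the FileSystem XAttr API using the attribute name surfaced by the Hadoop-side patch (the helper and its fallback behaviour are illustrative, not the actual Spark code):

    import java.nio.charset.StandardCharsets;

    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class MagicMarkerLength {
      /** Bytes written for a magic-committer output file, best effort. */
      static long bytesWritten(FileSystem fs, Path file) throws Exception {
        FileStatus st = fs.getFileStatus(file);
        if (st.getLen() > 0) {
          return st.getLen();   // a real file: trust the reported length
        }
        try {
          // 0-byte magic marker: the custom S3 object header is surfaced as
          // an XAttr with the "header." prefix.
          byte[] raw = fs.getXAttr(file, "header.x-hadoop-s3a-magic-data-length");
          if (raw != null) {
            long len = Long.parseLong(
                new String(raw, StandardCharsets.UTF_8).trim());
            if (len > 0) {
              return len;
            }
          }
        } catch (Exception e) {
          // Attribute missing or unsupported on this store: fall through.
        }
        return 0;
      }
    }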

Hadoop branch-3.3 also has a very leading-edge patch to stop deleting superfluous directory markers when files are created; see https://github.com/apache/hadoop/blob/trunk/hadoop-tools/hadoop-aws/src/site/markdown/tools/hadoop-aws/directory_markers.md for details.
This will avoid throttling when many files are being written to the same part of an S3 bucket, and will stop creating tombstone markers in versioned S3 buckets. Those tombstones were slowing down subsequent LIST calls, and over time listings get slower. This is new: it needs a patch on older clients to stop them mistaking a marker for an empty directory, and it needs broader testing. It is in all maintained Hadoop 3.x branches, but not yet shipped other than in hadoop-3.3.2.
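
The switch itself is the fs.s3a.directory.marker.retention option described in that document; a minimal Java sketch (bucket name illustrative), only worth flipping once every client touching the bucket understands retained markers:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class KeepDirectoryMarkers {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // "delete" is the classic, backwards-compatible default; "keep" skips
        // the DELETE requests issued under a path when a file is created there,
        // cutting throttling and versioned-bucket tombstones.
        conf.set("fs.s3a.directory.marker.retention", "keep");
        FileSystem fs = new Path("s3a://example-bucket/").getFileSystem(conf);
        System.out.println("marker policy: "
            + fs.getConf().get("fs.s3a.directory.marker.retention"));
      }
    }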

If you do want leading-edge performance, yes, grab those latest patches in your own build. I plan to cut a new 3.3.x release soon to get it into people's hands; it will be the one with Arm/M1 binary support in the libs and codecs. Building and testing now means that the problems you find get fixed before that release. Hey, you even have an excuse for the new MacBooks: "I wanted to test Spark on it".

-Steve