Skip to content

Fix worker report all block locations to master for async restore#16897

Open
fffanyang wants to merge 7 commits intoAlluxio:master-2.xfrom
fffanyang:fix-pagestore-asyncrestore
Open

Fix worker report all block locations to master for async restore#16897
fffanyang wants to merge 7 commits intoAlluxio:master-2.xfrom
fffanyang:fix-pagestore-asyncrestore

Conversation

@fffanyang
Copy link
Copy Markdown
Contributor

What changes are proposed in this pull request?

Using BlockHeartbeat to report all block locations to master when worker scans pages asynchronously.

Why are the changes needed?

If not,worker may have some blocks unrecognized.

Does this PR introduce any user facing changes?

N/A

@alluxio-bot
Copy link
Copy Markdown
Contributor

Thank you for your pull request.
In order for us to evaluate and accept your PR, we ask that you sign a contribution license agreement (CLA).
It's all electronic and will take just a few minutes. Please download CLA form here, sign, and e-mail back to cla@alluxio.org

@fffanyang
Copy link
Copy Markdown
Contributor Author

@dbw9580 @maobaolong PTAL thanks!

Copy link
Copy Markdown
Contributor

@maobaolong maobaolong left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@fffanyang Thanks for your contribution. Left a couple of comments. @dbw9580 Would you like to take a look at this PR?

BTW, please fix the style issue in the next commit.

2023-02-17T01:08:29.7892989Z [ERROR] /usr/src/alluxio/core/client/fs/src/main/java/alluxio/client/file/cache/PageMetaStore.java:58:3: Missing a Javadoc comment. [MissingJavadocMethod] 
return false;
}
if (mInitService.isPresent()) {
mPageMetaStore.reportBlocks(pageStoreDir);
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reference to this BlockMasterSync, The original behavior is that worker register to master with its all blocks as a full block report during the register stage.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The changes are used to fix blocks report partly during worker register stage when we set alluxio.worker.page.store.async.restore.enabled=true (default, to accelerate worker restarts)

for (PagedBlockMeta blockMeta : blockMetas) {
for (BlockStoreEventListener listener : mBlockStoreEventListeners) {
synchronized (listener) {
listener.onMoveBlockByWorker(blockMeta.getBlockId(),
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a tricky way to report the blocks generated by pageStore to master by Incrementally, but it need a full block report during register stage.

*/
void addPage(PageId pageId, PageInfo pageInfo);

default void reportBlocks(PageStoreDir pageStoreDir) {}
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Missing a Javadoc comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The page cache implementation is shared by the client side local cache, as well as the worker side paged block store. So PageMetaStore is also used by the client side local cache, adding a method specifically for doing worker-master sync is not appropriate here.

@alluxio-bot
Copy link
Copy Markdown
Contributor

Automated checks report:

  • Commits associated with Github account: FAIL
    • It looks like your commits can't be linked to a valid Github account.
      Your commits are made with the email daisyfyang@tencent.com, which does not allow your contribution to be tracked by Github.
      See this link for possible reasons this might be happening.
      To change the author email address that your most recent commit was made under, you can run:
      git -c user.name="Name" -c user.email="Email" commit --amend --reset-author
      See this answer for more details about how to change commit email addresses.
      Once the author email address has been updated, update the pull request by running:
      git push --force https://github.com/fffanyang/alluxio.git fix-pagestore-asyncrestore
  • PR title follows the conventions: PASS

Some checks failed. Please fix the reported issues and reply 'alluxio-bot, check this please' to re-run checks.

@fffanyang fffanyang force-pushed the fix-pagestore-asyncrestore branch from 19df170 to d3218e4 Compare February 17, 2023 07:10
@alluxio-bot
Copy link
Copy Markdown
Contributor

Automated checks report:

  • Commits associated with Github account: PASS
  • PR title follows the conventions: PASS

All checks passed!

@fffanyang
Copy link
Copy Markdown
Contributor Author

@maobaolong Already fix the comments, please take a look again, thanks a lot!

@fffanyang fffanyang requested review from maobaolong and removed request for dbw9580 February 20, 2023 02:43
Copy link
Copy Markdown
Contributor

@maobaolong maobaolong left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM @fffanyang Thanks for your contribution!

@maobaolong
Copy link
Copy Markdown
Contributor

@dbw9580 Would you like to take a double look?

@elega elega requested a review from dbw9580 February 21, 2023 03:12
Copy link
Copy Markdown
Contributor

@dbw9580 dbw9580 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If the worker registers itself with the master before the cache store is restored, will the async restoration have a race with the actual read requests from clients?

I wonder what's pros and cons of turning off async restoration by setting alluxio.worker.page.store.async.restore.enabled to false?

*/
void addPage(PageId pageId, PageInfo pageInfo);

default void reportBlocks(PageStoreDir pageStoreDir) {}
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The page cache implementation is shared by the client side local cache, as well as the worker side paged block store. So PageMetaStore is also used by the client side local cache, adding a method specifically for doing worker-master sync is not appropriate here.

@fffanyang
Copy link
Copy Markdown
Contributor Author

If the worker registers itself with the master before the cache store is restored, will the async restoration have a race with the actual read requests from clients?

I wonder what's pros and cons of turning off async restoration by setting alluxio.worker.page.store.async.restore.enabled to false?

@dbw9580
The actual read requests from clients may have a race with the async restoration. The update requests such as put, append and delete, will wait until mState in LocalCacheManager become READ_WRITE, which is set only after restoration. While read requests only need the state be not NOT_IN_USE, so clients may acquire blocks without location before async restoration finished.

I already tested that if there are 100w pages, the time spend on scaning pages will be 16s on NVME. Setting alluxio.worker.page.store.async.restore.enabled to true defaultly may wants to accelerate workers restart I think.

@fffanyang fffanyang requested a review from dbw9580 February 22, 2023 08:19
@jiacheliu3 jiacheliu3 added the area-worker Alluxio worker label Mar 3, 2023
@dbw9580
Copy link
Copy Markdown
Contributor

dbw9580 commented Mar 3, 2023

I'm thinking of adding the block store event here:

public void addBlock(PagedBlockMeta blockMeta) {
mBlocks.add(blockMeta);
}

this should be the sole place that gets called whenever a block is added to the paged block store, whether it's added during restoration, or added by client activity.

There are a few places where mBlocks is manipulated directly instead of calling this function, e.g.

mBlocks.add(new PagedBlockMeta(blockId, blockSize, dir));

they also need to be changed to call the addBlock method.

@fffanyang fffanyang force-pushed the fix-pagestore-asyncrestore branch from 12c9f0a to 16bb004 Compare March 21, 2023 02:31
@fffanyang
Copy link
Copy Markdown
Contributor Author

@dbw9580 I have added the callback function as you suggested, please take a look again, thanks a lot!

Copy link
Copy Markdown
Contributor

@dbw9580 dbw9580 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the update! I like the idea of a callback.

* @param conf the Alluxio configuration
* @param options the options for local cache manager
* @param pageMetaStore meta store for pages
* @param report the callback to report blocks for async pagedBlock restore
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We'd like to keep the callback abstract, so that they may be used for purposes other than block report.

Suggested change
* @param report the callback to report blocks for async pagedBlock restore
* @param callback the callback that will be invoked when the page store finishes restoring pages after start up.
Comment on lines 117 to 122
try (LockResource r = new LockResource(pageMetaStore.getLock().readLock())) {
report.reportBlocks();
}
} catch (IOException e) {
LOG.error("Failed to restore LocalCacheManager", e);
}
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the call to reportBlocks should be outside of this try catch block. Failing to report blocks should not share a common catch block with the case where restoreOrInit fails.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you think another callback would be helpful when restoreOrInit fails? The page store would be read only and cannot handle cache requests. In this case, the upper level things like PagedBlockStore might need to know that.
Something like this:

try { manager.restoreOrInit(pageStoreDirs); try (LockResource r = new LockResource(pageMetaStore.getLock().readLock())) { callback.onSuccess(); } } catch (IOException e) { LOG.error("Failed to restore LocalCacheManager", e); callback.onFailure(); }

Just a thought, we don't need to address this in this PR.

Comment on lines +14 to +19
public interface PagedBlockReport {
/**
* Report all block locations in a pageStoreDirs to master using BlockHeartbeat.
*/
void reportBlocks();
}
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think a more general name like CacheRestorationCallback would be good. And the method name can be onCacheRestorationSuccess.

public static CacheManager create(AlluxioConfiguration conf) throws IOException {
CacheManagerOptions options = CacheManagerOptions.create(conf);
return create(conf, options, PageMetaStore.create(options));
return create(conf, options, PageMetaStore.create(options), () -> { });
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

() -> { } looks funny. Can we provide a no-op implementation for PagedBlockReport interface, and make it a named singleton e.g. PagedBlockReport.NO_OP?


private void testNewCache() throws Exception {
mCacheManager = LocalCacheManager.create(mCacheManagerOptions, mPageMetaStore);
mCacheManager = LocalCacheManager.create(mCacheManagerOptions, mPageMetaStore, () -> { });
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you added a overloaded version of create that provides the default argument for the callback, so this change here is not necessary.

@fffanyang fffanyang requested a review from dbw9580 March 27, 2023 08:45
@fffanyang
Copy link
Copy Markdown
Contributor Author

@dbw9580 I have fixed according to your comments. Please take a look, thanks a lot!

@fffanyang fffanyang force-pushed the fix-pagestore-asyncrestore branch from e274e00 to 73038a1 Compare March 30, 2023 02:49
Copy link
Copy Markdown
Contributor

@dbw9580 dbw9580 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the improvements! Implementation-wise these are the last batch from me. Can you please add some tests to ensure the feature works as expected?

Comment on lines +20 to +22
static void NO_OP() {
//NOOP
}
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can add a static field on the interface:

Suggested change
static void NO_OP() {
//NOOP
}
CacheRestorationCallback NOOP = () -> { };
Comment on lines +407 to +411
/**
* Report all block locations in a pageStoreDirs to master using BlockHeartbeat.
*/
@GuardedBy("getLock().readLock()")
public void reportBlocks() {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The actual reporting does not happen inside this method, so may be rename this to reflect what it actually does?

Suggested change
/**
* Report all block locations in a pageStoreDirs to master using BlockHeartbeat.
*/
@GuardedBy("getLock().readLock()")
public void reportBlocks() {
/**
* Callback that gets called when the underlying page store finishes async
* restoration.
* This method notifies all registered block store event listeners
* about the newly added blocks. In particular, these blocks will be reported
* as added blocks to master via heartbeat.
*/
@GuardedBy("getLock().readLock()")
public void onCacheRestorationSuccess() {
Comment on lines +416 to +420
for (PagedBlockMeta blockMeta : blockMetas) {
for (BlockStoreEventListener listener : mBlockStoreEventListeners) {
synchronized (listener) {
listener.onMoveBlockByWorker(blockMeta.getBlockId(),
blockMeta.getBlockLocation(), blockMeta.getBlockLocation());
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Presumably there are far more blocks than event listeners. It's more efficient to only acquire the lock once per listener.

Suggested change
for (PagedBlockMeta blockMeta : blockMetas) {
for (BlockStoreEventListener listener : mBlockStoreEventListeners) {
synchronized (listener) {
listener.onMoveBlockByWorker(blockMeta.getBlockId(),
blockMeta.getBlockLocation(), blockMeta.getBlockLocation());
for (BlockStoreEventListener listener : mBlockStoreEventListeners) {
synchronized (listener) {
for (PagedBlockMeta blockMeta : blockMetas) {
listener.onMoveBlockByWorker(blockMeta.getBlockId(),
blockMeta.getBlockLocation(), blockMeta.getBlockLocation());

package alluxio.client.file.cache;

public interface CacheRestorationCallback {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

2023-04-10T07:51:57.2198086Z [ERROR] /usr/src/alluxio/core/client/fs/src/main/java/alluxio/client/file/cache/CacheRestorationCallback.java:14:1: Missing a Javadoc comment. [MissingJavadocType] 
@fffanyang fffanyang requested review from dbw9580 and maobaolong April 18, 2023 11:44
@maobaolong
Copy link
Copy Markdown
Contributor

@fffanyang It seems you use a new email ?
image

@github-actions
Copy link
Copy Markdown

github-actions bot commented Jun 6, 2023

This pull request has been automatically marked as stale because it has not had recent activity. It will be closed in two weeks if no further activity occurs. Thank you for your contributions.

@github-actions github-actions bot added the stale The PR/Issue does not have recent activities and will be closed automatically label Jun 6, 2023
@github-actions github-actions bot removed the stale The PR/Issue does not have recent activities and will be closed automatically label Jun 26, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

area-worker Alluxio worker

7 participants