Fix worker report all block locations to master for async restore#16897
Fix worker report all block locations to master for async restore#16897fffanyang wants to merge 7 commits intoAlluxio:master-2.xfrom
Conversation
| Thank you for your pull request. |
| @dbw9580 @maobaolong PTAL thanks! |
There was a problem hiding this comment.
@fffanyang Thanks for your contribution. Left a couple of comments. @dbw9580 Would you like to take a look at this PR?
BTW, please fix the style issue in the next commit.
2023-02-17T01:08:29.7892989Z [ERROR] /usr/src/alluxio/core/client/fs/src/main/java/alluxio/client/file/cache/PageMetaStore.java:58:3: Missing a Javadoc comment. [MissingJavadocMethod] | return false; | ||
| } | ||
| if (mInitService.isPresent()) { | ||
| mPageMetaStore.reportBlocks(pageStoreDir); |
There was a problem hiding this comment.
Reference to this BlockMasterSync, The original behavior is that worker register to master with its all blocks as a full block report during the register stage.
There was a problem hiding this comment.
The changes are used to fix blocks report partly during worker register stage when we set alluxio.worker.page.store.async.restore.enabled=true (default, to accelerate worker restarts)
| for (PagedBlockMeta blockMeta : blockMetas) { | ||
| for (BlockStoreEventListener listener : mBlockStoreEventListeners) { | ||
| synchronized (listener) { | ||
| listener.onMoveBlockByWorker(blockMeta.getBlockId(), |
There was a problem hiding this comment.
This is a tricky way to report the blocks generated by pageStore to master by Incrementally, but it need a full block report during register stage.
| */ | ||
| void addPage(PageId pageId, PageInfo pageInfo); | ||
| | ||
| default void reportBlocks(PageStoreDir pageStoreDir) {} |
There was a problem hiding this comment.
Missing a Javadoc comment
There was a problem hiding this comment.
The page cache implementation is shared by the client side local cache, as well as the worker side paged block store. So PageMetaStore is also used by the client side local cache, adding a method specifically for doing worker-master sync is not appropriate here.
| Automated checks report:
Some checks failed. Please fix the reported issues and reply 'alluxio-bot, check this please' to re-run checks. |
19df170 to d3218e4 Compare | Automated checks report:
All checks passed! |
| @maobaolong Already fix the comments, please take a look again, thanks a lot! |
maobaolong left a comment
There was a problem hiding this comment.
LGTM @fffanyang Thanks for your contribution!
| @dbw9580 Would you like to take a double look? |
dbw9580 left a comment
There was a problem hiding this comment.
If the worker registers itself with the master before the cache store is restored, will the async restoration have a race with the actual read requests from clients?
I wonder what's pros and cons of turning off async restoration by setting alluxio.worker.page.store.async.restore.enabled to false?
| */ | ||
| void addPage(PageId pageId, PageInfo pageInfo); | ||
| | ||
| default void reportBlocks(PageStoreDir pageStoreDir) {} |
There was a problem hiding this comment.
The page cache implementation is shared by the client side local cache, as well as the worker side paged block store. So PageMetaStore is also used by the client side local cache, adding a method specifically for doing worker-master sync is not appropriate here.
@dbw9580 I already tested that if there are 100w pages, the time spend on scaning pages will be 16s on NVME. Setting alluxio.worker.page.store.async.restore.enabled to true defaultly may wants to accelerate workers restart I think. |
| I'm thinking of adding the block store event here: this should be the sole place that gets called whenever a block is added to the paged block store, whether it's added during restoration, or added by client activity. There are a few places where they also need to be changed to call the |
12c9f0a to 16bb004 Compare | @dbw9580 I have added the callback function as you suggested, please take a look again, thanks a lot! |
dbw9580 left a comment
There was a problem hiding this comment.
Thanks for the update! I like the idea of a callback.
| * @param conf the Alluxio configuration | ||
| * @param options the options for local cache manager | ||
| * @param pageMetaStore meta store for pages | ||
| * @param report the callback to report blocks for async pagedBlock restore |
There was a problem hiding this comment.
We'd like to keep the callback abstract, so that they may be used for purposes other than block report.
| * @param report the callback to report blocks for async pagedBlock restore | |
| * @param callback the callback that will be invoked when the page store finishes restoring pages after start up. |
| try (LockResource r = new LockResource(pageMetaStore.getLock().readLock())) { | ||
| report.reportBlocks(); | ||
| } | ||
| } catch (IOException e) { | ||
| LOG.error("Failed to restore LocalCacheManager", e); | ||
| } |
There was a problem hiding this comment.
the call to reportBlocks should be outside of this try catch block. Failing to report blocks should not share a common catch block with the case where restoreOrInit fails.
There was a problem hiding this comment.
Do you think another callback would be helpful when restoreOrInit fails? The page store would be read only and cannot handle cache requests. In this case, the upper level things like PagedBlockStore might need to know that.
Something like this:
try { manager.restoreOrInit(pageStoreDirs); try (LockResource r = new LockResource(pageMetaStore.getLock().readLock())) { callback.onSuccess(); } } catch (IOException e) { LOG.error("Failed to restore LocalCacheManager", e); callback.onFailure(); }Just a thought, we don't need to address this in this PR.
| public interface PagedBlockReport { | ||
| /** | ||
| * Report all block locations in a pageStoreDirs to master using BlockHeartbeat. | ||
| */ | ||
| void reportBlocks(); | ||
| } |
There was a problem hiding this comment.
I think a more general name like CacheRestorationCallback would be good. And the method name can be onCacheRestorationSuccess.
| public static CacheManager create(AlluxioConfiguration conf) throws IOException { | ||
| CacheManagerOptions options = CacheManagerOptions.create(conf); | ||
| return create(conf, options, PageMetaStore.create(options)); | ||
| return create(conf, options, PageMetaStore.create(options), () -> { }); |
There was a problem hiding this comment.
() -> { } looks funny. Can we provide a no-op implementation for PagedBlockReport interface, and make it a named singleton e.g. PagedBlockReport.NO_OP?
| | ||
| private void testNewCache() throws Exception { | ||
| mCacheManager = LocalCacheManager.create(mCacheManagerOptions, mPageMetaStore); | ||
| mCacheManager = LocalCacheManager.create(mCacheManagerOptions, mPageMetaStore, () -> { }); |
There was a problem hiding this comment.
I think you added a overloaded version of create that provides the default argument for the callback, so this change here is not necessary.
| @dbw9580 I have fixed according to your comments. Please take a look, thanks a lot! |
e274e00 to 73038a1 Compare
dbw9580 left a comment
There was a problem hiding this comment.
Thanks for the improvements! Implementation-wise these are the last batch from me. Can you please add some tests to ensure the feature works as expected?
| static void NO_OP() { | ||
| //NOOP | ||
| } |
There was a problem hiding this comment.
You can add a static field on the interface:
| static void NO_OP() { | |
| //NOOP | |
| } | |
| CacheRestorationCallback NOOP = () -> { }; |
| /** | ||
| * Report all block locations in a pageStoreDirs to master using BlockHeartbeat. | ||
| */ | ||
| @GuardedBy("getLock().readLock()") | ||
| public void reportBlocks() { |
There was a problem hiding this comment.
The actual reporting does not happen inside this method, so may be rename this to reflect what it actually does?
| /** | |
| * Report all block locations in a pageStoreDirs to master using BlockHeartbeat. | |
| */ | |
| @GuardedBy("getLock().readLock()") | |
| public void reportBlocks() { | |
| /** | |
| * Callback that gets called when the underlying page store finishes async | |
| * restoration. | |
| * This method notifies all registered block store event listeners | |
| * about the newly added blocks. In particular, these blocks will be reported | |
| * as added blocks to master via heartbeat. | |
| */ | |
| @GuardedBy("getLock().readLock()") | |
| public void onCacheRestorationSuccess() { |
| for (PagedBlockMeta blockMeta : blockMetas) { | ||
| for (BlockStoreEventListener listener : mBlockStoreEventListeners) { | ||
| synchronized (listener) { | ||
| listener.onMoveBlockByWorker(blockMeta.getBlockId(), | ||
| blockMeta.getBlockLocation(), blockMeta.getBlockLocation()); |
There was a problem hiding this comment.
Presumably there are far more blocks than event listeners. It's more efficient to only acquire the lock once per listener.
| for (PagedBlockMeta blockMeta : blockMetas) { | |
| for (BlockStoreEventListener listener : mBlockStoreEventListeners) { | |
| synchronized (listener) { | |
| listener.onMoveBlockByWorker(blockMeta.getBlockId(), | |
| blockMeta.getBlockLocation(), blockMeta.getBlockLocation()); | |
| for (BlockStoreEventListener listener : mBlockStoreEventListeners) { | |
| synchronized (listener) { | |
| for (PagedBlockMeta blockMeta : blockMetas) { | |
| listener.onMoveBlockByWorker(blockMeta.getBlockId(), | |
| blockMeta.getBlockLocation(), blockMeta.getBlockLocation()); |
| | ||
| package alluxio.client.file.cache; | ||
| | ||
| public interface CacheRestorationCallback { |
There was a problem hiding this comment.
2023-04-10T07:51:57.2198086Z [ERROR] /usr/src/alluxio/core/client/fs/src/main/java/alluxio/client/file/cache/CacheRestorationCallback.java:14:1: Missing a Javadoc comment. [MissingJavadocType] | @fffanyang It seems you use a new email ? |
| This pull request has been automatically marked as stale because it has not had recent activity. It will be closed in two weeks if no further activity occurs. Thank you for your contributions. |

What changes are proposed in this pull request?
Using BlockHeartbeat to report all block locations to master when worker scans pages asynchronously.
Why are the changes needed?
If not,worker may have some blocks unrecognized.
Does this PR introduce any user facing changes?
N/A