[libc][libm][GPU] Added missing vendor entry points to libm for GPUs #66031
Closed
There are a number of mathematical functions where no target agnostic implementations exist and the compiler builtins are not correctly lowered. This patch adds inlined wrappers for those functions to the GPU version of `libm` for AMDGPU and NVPTX targets.
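A minimal sketch of what such an inlined wrapper might look like, written for illustration only and not taken from the patch. It assumes the vendor entry points `__nv_sin` (NVIDIA libdevice) and `__ocml_sin_f64` (ROCm device libraries) and the clang target macros `__NVPTX__` and `__AMDGPU__`. The `vendor` and `sketch` namespaces and the `sin_impl` / `sin_entry` names are hypothetical, and the host fallback exists only so the sketch compiles without a GPU toolchain.

```cpp
#include <cmath>

// Sketch only: namespace and function names here are hypothetical.
#if defined(__NVPTX__)
// Entry point provided by NVIDIA's libdevice bitcode library.
extern "C" double __nv_sin(double);
namespace vendor {
inline double sin_impl(double x) { return __nv_sin(x); }
} // namespace vendor
#elif defined(__AMDGPU__)
// Entry point provided by the ROCm device libraries (OCML).
extern "C" double __ocml_sin_f64(double);
namespace vendor {
inline double sin_impl(double x) { return __ocml_sin_f64(x); }
} // namespace vendor
#else
// Host fallback so the sketch builds and runs without a GPU target.
namespace vendor {
inline double sin_impl(double x) { return std::sin(x); }
} // namespace vendor
#endif

namespace sketch {
// Public entry point: forwards to whichever implementation was selected above.
inline double sin_entry(double x) { return vendor::sin_impl(x); }
} // namespace sketch
```

Because the wrapper is `inline` and only forwards its argument, the choice of vendor function is made once at compile time per target.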
This doesn't change the current behavior of the function, but the explicit declaration looks cleaner.
…m#65674) Add a DAG combine to form a masked.load from a masked_strided_load intrinsic with stride equal to element size. This covers a couple of extra test cases, and allows us to simplify and common some existing code on the concat_vector(load, ...) to strided load transform. This is the first in a mini-patch series to try and generalize our strided load and gather matching to handle more cases, and common up different approaches to the same problems in different places.
COMPILER_RT_DEBUG was just added to sanitizer-ppc64le-linux, and this test is already broken there.
Instead of eagerly creating a diagnostic that will be discarded in the normal case, switch to lazy initialization on error.
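The idea can be illustrated with a small sketch. This is not the code from the commit: `Diagnostic`, `make_diagnostic`, `check_positive`, and the `diagnostics_built` counter are all hypothetical names, chosen only to show that nothing is constructed on the success path.

```cpp
#include <optional>
#include <string>

// Hypothetical stand-in for a diagnostic object that is costly to build.
struct Diagnostic {
  std::string message;
};

// Counts constructions so the effect of lazy creation is observable.
inline int diagnostics_built = 0;

inline Diagnostic make_diagnostic(const std::string &what) {
  ++diagnostics_built;
  return Diagnostic{"error: " + what};
}

// Returns a diagnostic only on failure; the success path builds nothing.
inline std::optional<Diagnostic> check_positive(int value) {
  if (value > 0)
    return std::nullopt; // normal case: no diagnostic is created
  return make_diagnostic("value must be positive");
}
```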
…ool (llvm#65991) To match other internal symbolizer functions. This makes it harder to distinguish a too-small buffer from a different failure, but we have the same problem in the rest of the library. Still, we use a 16k buffer, so it should be enough most of the time. We can fix all the functions together in the future, if needed.
Deprecate the `gpu-to-cubin` & `gpu-to-hsaco` passes in favor of the `TargetAttr` workflow. This patch removes remaining upstream uses of the aforementioned passes, including the option to use them in `mlir-opt`. A future patch will remove these passes entirely. The passes can be re-enabled in `mlir-opt` by adding the CMake flag: `-DMLIR_ENABLE_DEPRECATED_GPU_SERIALIZATION=1`.
…cc.loop (llvm#65521) The `cache` directive may appear at the top of (inside of) a loop. It specifies array elements or subarrays that should be fetched into the highest level of the cache for the body of the loop. The `cache` directive is modeled as data entry operands attached to the acc.loop operation.
The cache directive is attached directly to the acc.loop operation when the directive appears in the loop. When it appears before a loop, the OpenACCCacheConstruct is saved and attached when the acc.loop is created. Directives that cannot be attached to a loop are silently discarded. Depends on llvm#65521
…` canonicalization patterns. (llvm#66002) This pattern fits better with the other canonicalization patterns that exist for `linalg.fill`.
Otherwise they are dangling if lldMain is called more than once.
llvm#65776) Since the OpenACC atomics specification is a subset of OpenMP atomics, the same lowering implementation can be used. This change extracts out the necessary pieces from the OpenMP lowering and puts them in a shared spot. The shared spot is a header file so that each implementation can template specialize directly. After putting the OpenMP implementation in a common spot, the following changes were needed to make it work for OpenACC:
* Ensure parsing works correctly by avoiding hardcoded offsets.
* Templatize based on atomic type.
* The checking whether it is OpenMP or OpenACC is done by checking for OmpAtomicClauseList (OpenACC does not implement this so we just templatize with void). It was preferable to check this instead of atomic type because in some cases, like atomic capture, the read/write/update implementations are called - and we want compile time evaluation of these conditional parts.
* The memory order and hint are used only for OpenMP.
* Generate acc dialect operations instead of omp dialect operations.
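The "templatize with void" approach described above can be sketched as follows. This is an illustrative assumption about the shape of the technique, not the code in the commit: `OmpClauseListSketch` and `lower_atomic_read` are hypothetical names, and the returned strings merely stand in for generating an operation in the corresponding dialect.

```cpp
#include <string>
#include <type_traits>

// Hypothetical stand-in for the OpenMP clause list type.
// OpenACC has no equivalent, so it instantiates the template with `void`.
struct OmpClauseListSketch {};

// Shared lowering routine, templated on the clause list type.
// The branch is resolved at compile time, so only one path is instantiated.
template <typename ClauseList>
std::string lower_atomic_read() {
  if constexpr (std::is_void_v<ClauseList>) {
    return "acc.atomic.read"; // OpenACC path: no memory order or hint
  } else {
    return "omp.atomic.read"; // OpenMP path: clause list is available
  }
}
```

Checking the clause list type rather than a runtime flag keeps the OpenMP-only parts out of the OpenACC instantiation entirely.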
… configurable (llvm#65687) "descriptive summaries" should only be used for small to medium binaries because of the performance penalty they cause when completing types. I'm defaulting it to false. Besides that, the "raw child" for synthetics should be optional as well. I'm defaulting it to false. Both options can be set via a launch or attach config, following the pattern of most settings. JavaScript extension wrappers can set these settings on their own as well.
Fixes: llvm#65806 Currently, clang puts extern shared variables that are ODR-used by host device functions in the global variable `__clang_gpu_used_external`. This behavior was due to https://reviews.llvm.org/D123441. However, clang should not do that for extern shared variables, since their addresses are per warp and therefore cannot be accessed by host code.
…/macos triples" This reverts commit 9f77fac. The change unintentionally changed lots of codegen, see llvm#47698 (comment) Also revert a follow-up: This reverts commit b40a5be.
…yped memory. (llvm#66009) Exposes the existing `get(ShapedType, StringRef, AsmResourceBlob)` builder publicly (was protected) and adds a CAPI `mlirUnmanagedDenseBlobResourceElementsAttrGet`. While such a generic construction interface is a big help when it comes to interop, it is also necessary for creating resources that don't have a standard C type (i.e. f16, the f8s, etc). Previously reviewed/approved as part of https://reviews.llvm.org/D157064
This should fix the warning seen in https://lab.llvm.org/buildbot/#/builders/13/builds/39980/steps/6/logs/stdio
Make sure every conditional branch constructed by `LoopUnrollRuntime` code sets branch weights.
- Add new 1:127 weights for the conditional jumps checking whether the whole (unrolled) loop should be skipped in the generated prolog or epilog code.
- Remove `updateLatchBranchWeightsForRemainderLoop` function and just add weights immediately when constructing the relevant branches. This leads to simpler code and makes the code more obvious as every call to `CreateCondBr` now has a `BranchWeights` parameter.
- Rework formula for epilogue latch weights, to assume equal distribution of remainders and remove `assert` (as I was able to reach this code when forcing small unroll factors on the command line).
Differential Revision: https://reviews.llvm.org/D158642
With this, check-llvm passes on an arm mac if x86 isn't in LLVM_TARGETS_TO_BUILD. This pattern to skip the tests if x86 isn't enabled is used in every other test in this file.
According to 7.5.6.3 point 3, finalization occurs when
> A nonpointer, nonallocatable object that is not a dummy argument or function result is finalized immediately before it would become undefined due to execution of a RETURN or END statement (19.6.6, item (3)).
We were not calling finalization for empty derived types. There is no such restriction, so this patch updates the code so that finalization is called for empty types as well.
The test is flaky after Kernel upgrade from 6.0 to 6.5.
The issue is uncovered by llvm#47698: for IR files without a target triple, -mtriple= specifies the full target triple while -march= merely sets the architecture part of the default target triple, leaving a target triple which may not make sense, e.g. riscv64-apple-darwin. Therefore, -march= is error-prone and not recommended for tests without a target triple. The issue has been benign as we recognize $unknown-apple-darwin as ELF instead of rejecting it outright.
The issue is uncovered by llvm#47698: for assembly files, -triple= specifies the full target triple while -arch= merely sets the architecture part of the default target triple, leaving a target triple which may not make sense, e.g. riscv64-apple-darwin. Therefore, -arch= is error-prone and not recommended for tests. The issue has been benign as we recognize $unknown-apple-darwin as ELF instead of rejecting it outright. Due to the nature of the issue, we don't see the issue in tests using architectures that any of Mach-O/COFF/XCOFF supports.
The test only applies to ELF. On Linux, when a default target triple is, say, Mach-O, the test should be excluded as well.
Reviewed By: #bolt, maksfb Differential Revision: https://reviews.llvm.org/D154120
Reduce YAML profile processing times:
- preprocessProfile: speed up buildNameMaps by replacing ProfileNameToProfile mapping with ProfileFunctionNames set and ProfileBFs vector. Pre-look up YamlBF->BF correspondence, memoize in ProfileBFs.
- readProfile: replace iteration over all functions in the binary by iteration over profile functions (strict match and LTO name match).
On a large binary (1.9M functions) and large YAML profile (121MB, 30k functions) reduces profile steps runtime:
pre-process profile data: 12.4953s -> 10.7123s
process profile data: 9.8195s -> 5.6639s
Compared to fdata profile reading:
pre-process profile data: 8.0268s
process profile data: 1.0265s
process profile data pre-CFG: 0.1644s
Reviewed By: #bolt, maksfb Differential Revision: https://reviews.llvm.org/D159460
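The second change, iterating over profile functions rather than over every function in the binary, can be sketched as below. This is a simplified illustration and not BOLT's implementation: `BinaryFunctionSketch`, `ProfileEntrySketch`, and `apply_profile` are hypothetical names, and only a strict name match is shown (no LTO name matching).

```cpp
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical, simplified stand-ins for the binary's functions and profile entries.
struct BinaryFunctionSketch {
  std::string name;
  int profile_count = 0;
};

struct ProfileEntrySketch {
  std::string name;
  int count;
};

// Visits only the profile entries and looks each one up by name,
// instead of scanning every function in the binary.
// Returns the number of profile entries that matched a function.
inline int apply_profile(
    std::unordered_map<std::string, BinaryFunctionSketch> &functions,
    const std::vector<ProfileEntrySketch> &profile) {
  int matched = 0;
  for (const ProfileEntrySketch &entry : profile) {
    auto it = functions.find(entry.name);
    if (it == functions.end())
      continue; // profile entry has no counterpart in the binary
    it->second.profile_count = entry.count;
    ++matched;
  }
  return matched;
}
```

With 30k profile functions against 1.9M binary functions, as in the measurements above, the loop count is bounded by the smaller of the two sets.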
Currently, dimlvlmap with identity affine map will be treated as empty affine map. But the new syntax would treat it as an actual identity affine map such as {d0} -> {d0}. This mismatch could raise an error when we are comparing sparse encodings. This is information that the compiler already has, and should be exposed so that the library doesn't need to reimplement the exact same functionality. Differential Revision: https://reviews.llvm.org/D135341
Fixed formatting of the section violating 80-char line limit.
For some reason, enable_aliases is not set when we use LLVM_ENABLE_RUNTIMES=compiler-rt instead of LLVM_ENABLE_PROJECTS.
Member
@llvm/pr-subscribers-libc
Changes: There are a number of mathematical functions where no target-agnostic implementations exist, and the compiler built-ins are not correctly lowered on all GPU targets. This patch adds inlined wrappers for those functions to the GPU version of …
Member
@llvm/pr-subscribers-backend-amdgpu
Changes: There are a number of mathematical functions where no target-agnostic implementations exist, and the compiler built-ins are not correctly lowered on all GPU targets. This patch adds inlined wrappers for those functions to the GPU version of …
There are a number of mathematical functions where no target-agnostic implementations exist, and the compiler built-ins are not correctly lowered on all GPU targets. This patch adds inlined wrappers for those functions to the GPU version of `libm` for AMDGPU and NVPTX targets.