- Notifications
You must be signed in to change notification settings - Fork 66
Add bounds_t with pre-aligned mins and maxs #1482
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Conversation
| For now there is a bug somewhere I have not found yet. The bug can be seen by loading the In my first iteration of the branch I made a stupid mistakes by passing the |
| Note: the percentage of time spent is not reliable in those screenshots because I use a Before (C), a lot of time is spent in After (C), the first loads are just completely skipped: Before (Asm), we see three After (Asm), se see only two |
| So, as I said, this is not urgent. But I would appreciate people re-reading what I did in hope someone finds the stupid mistake I may have done. The code change should be straightforward because it's basically rewriting with a different data layout, there is no algorithm change to be expected. Unfortunately this data struct is used in many places so the patch is massive. There are many functions in |
| I also added some constness to some function input, which may help the compiler to optimize a bit more some other places of the code. I doubt this extra-constness is the cause of the bug I'm facing because the compiler doesn't complain I'm writing to some |
da96dc5 to b9fbbab Compare | out->bounds.mins[ 1 ] = LittleLong( in->mins[ 1 ] ); | ||
| out->bounds.mins[ 2 ] = LittleLong( in->mins[ 2 ] ); | ||
| out->bounds.maxs[ 1 ] = LittleLong( in->maxs[ 1 ] ); | ||
| out->bounds.maxs[ 2 ] = LittleLong( in->maxs[ 2 ] ); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Here I forgot:
out->bounds.maxs[ 0 ] = LittleLong( in->maxs[ 0 ] );There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That was the bug I was looking for.
b9fbbab to 3176224 Compare | So I reread everything and found the bug I was looking for. I missed the conversion of one line (I just did not re-added the new one, that's why there was no compile type error). It works. |
4bea319 to e99d1aa Compare | I found another bug affecting BSP movers like doors. One can test it in the Vega map. The doors disappear according to the viewing angle / point of view. |
92be21c to d7d9d1e Compare | Any fps difference? |
d7d9d1e to 3239c6e Compare | I think you can make a drop-in aligned replacement for vec3_t like this: This can be used in arrays so then you don't need |
src/engine/qcommon/q_shared.h Outdated
| byte type; // for fast side tests: 0,1,2 = axial, 3 = nonaxial | ||
| byte signbits; // signx + (signy<<1) + (signz<<2), used as lookup during collision | ||
| byte pad[ 2 ]; | ||
| byte pad[ 12 ]; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks wrong. The previous variables in the struct add up to 6 bytes.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That was because of:
But maybe I'm wrong with the values.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure, but it should be byte pad[ 14 ], rather than byte pad[ 12 ].
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(small correction to my previous comment: they add up to 18 bytes, not 6)
3239c6e to 881fb23 Compare
Interesting, how such type can be used and where? |
| As a direct replacement for vec3_t. |
| I like the idea of having have an I just rewrote |
ce2df22 to b30fef6 Compare
I'm suggesting that no |
| But I like having a explicit |
| set_cxx_flag(${FLAG} ${ARGN}) | ||
| endmacro() | ||
| | ||
| macro(set_kind_linker_flag KIND FLAG) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There are random cmake changes in here by mistake
src/engine/qcommon/q_shared.h Outdated
| | ||
| vec3_t& at( bool index ) | ||
| { | ||
| return ( &mins + ( index * 16 ) )[ 0 ]; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's undefined behavior, like the (&v.x)[i] thing with GLM.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
But how can this be undefined if the struct layout is guaranteed? And if it isn't guaranteed, how can we use structs for defining file storage format or network packet formats if the layout of the data isn't guaranteed?
Is there a type in C/C++ that guarantees that alignas(16) vec4_t, alignas(16) vec4_t has an offset of sizeof(vec4_t) or 16?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can this be solved with static assertion (if the compiler doesn't do as expected, don't let the code be compiled?).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It doesn't matter that the layout is guaranteed. It's illegal to use an index that goes outside the bounds of an array. (A non-array object is treated like an array of length 1.)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
But this is not an index, this is a pointer. Pointer arithmetic isn't undefined.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK I didn't explain it precisely. Let's cite people who can. https://en.cppreference.com/w/cpp/language/operator_arithmetic#Pointer_arithmetic
Or one would say that if you have float array[3], doing &array[0] + 2 isn't defined because that goes outside of the object (array[0]).
This is the case Otherwise, if P points to the ith element of an array object x with n elements, given the value of J as j, P is added or subtracted as follows: . J can be from 0 to 3.
If you do &array instead of &a[0], it's the case Otherwise, if P points to a complete object, a base class subobject or a member subobject y, given the value of J as j, P is added or subtracted as follows: . J can be 0 or 1. Note that if it is 1, the resulting pointer is a "pointer past the end" and is illegal to dereference, even if it would happen to have the same numerical value as a valid object.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So, is this defined?
vec_t* at( bool index ) { return (vec_t*) this + ( index * 16 ); }The pointer doesn't go past the end of the object this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Well, that doesn't work, actually only the if (index) return maxs; return mins work:
https://godbolt.org/z/ETEqnG9Gv
What's curious is that the compilers (GCC, Clang) seem to produce the same code for all three functions, but produce garbage on the functions doing pointer arithmetic. Edit: In fact the computed offsets are wrong.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Edit: In fact the computed offsets are wrong.
The wrong offsets are the ones in your at() functions, i. e. it should be index * 4, rather than index * 16.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've seen claims that, essentially, doing this would not be UB:
return (vec_t*) ( (char*) &mins[0] + ( index ? offsetof( boundsA_t, maxs ) : offsetof( boundsA_t, mins ) ) );
However, don't quote me on that, & it produces the same assembly anyway.
73a5d78 to bf2896f Compare 



This is not urgent at all. This is complementary to:
BoxOnPlaneSidefunction #1142It also improves over:
plane_tstruct withnormalanddistmembers #1043The
BoxOnPlaneSide()function is known to be a hot spot, being called by multiple recursive functions. Right now we spend a lot of time in_mm_loadu_ps, we have to callsseLoadVec3Unsafe()explicitly because we can't guess if the data is aligned or not. This comes with multiple downside:_mm_loadu_psis said to be slower than_mm_load_ps, and that fits my experience¹._mm_loadu_psand will always call it if we ask for it explicitely, even if the data is already aligned, and by experience, even if no copy is needed.So the idea is to introduce some nice
bounds_tstruct that uses alignedminsandmaxs, same forcplane_twith an alignednormal. When doing that, we can write an explicit_mm_load_psthat is faster than_mm_loadu_ps, and most of the cases, because the compiler knows the data isconstand already aligned, the compiler just removes the_mm_load_psand process the data without any copy.See also:
bounds_twith pre-alignedminsandmaxsUnvanquished/Unvanquished#3265¹ Some times ago I tried to write optimized SSE code for some other functions, but the code was slower because of the explicit
_mm_loadu_pscall. I even noticed copying an unaligned vec3 to an aligned vec3 before calling_mm_load_pscould make the code faster when the compiler notices the data is already aligned and skips the copy and calls_mm_load_ps(or even doesn't need to call it at all).