Ticket #39565

Incorrect alignment of SIMD vectors returned by value from functions

Date d'ouverture: 2019-09-12 15:47 Dernière mise à jour: 2019-09-17 19:00

Rapporteur:
Propriétaire:
(Aucun)
Type:
État:
Ouvert
Composant:
Jalon:
(Aucun)
Priorité:
5 - moyen
Sévérité:
5 - moyen
Résolution:
Aucun
Fichier:
Aucun
Vote
Score: 0
No votes
0.0% (0/0)
0.0% (0/0)

Détails

Summary of the Issue

Working with AVX2 SIMD vectors. When a __m256i value is returned from a function, the SIMD register ymm0 is copied to memory using the aligned move instruction vmovdqa, so the destination memory address must be aligned to 32 bytes. This requirement is not met in my test case and the program crashes with SIGSEGV. More info follows in the "Minimal Test Case" section below.

I tested it also on Ubuntu's gcc, but it didn't crash there, neither did with clang++ on Windows and Ubuntu, so therefore I am submitting this issue to the MinGW tracker. But I might be wrong and the issue may exist in gcc as well.

Host Operating System Information and Version

Windows 10 Home, Version 10.0.17134 Build 17134

with latest updated MSYS2.

GCC Version

$ gcc -v
Using built-in specs.
COLLECT_GCC=D:\msys64_latest\mingw64\bin\gcc.exe
COLLECT_LTO_WRAPPER=D:/msys64_latest/mingw64/bin/../lib/gcc/x86_64-w64-mingw32/9.2.0/lto-wrapper.exe
Target: x86_64-w64-mingw32
Configured with: ../gcc-9.2.0/configure --prefix=/mingw64 --with-local-prefix=/mingw64/local --build=x86_64-w64-mingw32 --host=x86_64-w64-mingw32 --target=x86_64-w64-mingw32 --with-native-system-header-dir=/mingw64/x86_64-w64-mingw32/include --libexecdir=/mingw64/lib --enable-bootstrap --with-arch=x86-64 --with-tune=generic --enable-languages=c,lto,c++,fortran,ada,objc,obj-c++ --enable-shared --enable-static --enable-libatomic --enable-threads=posix --enable-graphite --enable-fully-dynamic-string --enable-libstdcxx-filesystem-ts=yes --enable-libstdcxx-time=yes --disable-libstdcxx-pch --disable-libstdcxx-debug --disable-isl-version-check --enable-lto --enable-libgomp --disable-multilib --enable-checking=release --disable-rpath --disable-win32-registry --disable-nls --disable-werror --disable-symvers --enable-plugin --with-libiconv --with-system-zlib --with-gmp=/mingw64 --with-mpfr=/mingw64 --with-mpc=/mingw64 --with-isl=/mingw64 --with-pkgversion='Rev2, Built by MSYS2 project' --with-bugurl=https://sourceforge.net/projects/msys2 --with-gnu-as --with-gnu-ld
Thread model: posix
gcc version 9.2.0 (Rev2, Built by MSYS2 project)

Binutils Version

$ ld -v
GNU ld (GNU Binutils) 2.32

MinGW Version

I don't know what should I look for in the mingw/include/_mingw.h file.

Build Environment

$ uname -a
MINGW64_NT-10.0-17134 MISO-PC 3.0.7-338.x86_64 2019-07-11 10:58 UTC x86_64 Msys

Minimal Self-Contained Test Case

main.cpp:

  1. #include <iostream>
  2. #include <immintrin.h>
  3. #define DBG(var_name) std::cout<<#var_name": "<<(var_name)<<std::endl
  4. // Output operator for vector
  5. std::ostream& operator<<(std::ostream& oss, const __m256i& v)
  6. {
  7. constexpr size_t length_bytes = 32;
  8. unsigned char a[length_bytes];
  9. _mm256_storeu_si256(reinterpret_cast<__m256i*>(a), v);
  10. oss << "[";
  11. std::string sep = "";
  12. for (size_t i=0; i<length_bytes; i++) {
  13. oss << sep << int(a[i]);
  14. sep = " ";
  15. }
  16. return oss << "]";
  17. }
  18. __m256i __attribute__ ((noinline)) crash_vector_ret(uint8_t* a)
  19. {
  20. __m256i v;
  21. v = _mm256_loadu_si256(reinterpret_cast<const __m256i*>(a));
  22. // Crash
  23. return v;
  24. }
  25. int main(int argc, char** argv)
  26. {
  27. // Setup memory from which the vector will be loaded
  28. const int a_size = ARRAY_SIZE;
  29. uint8_t a[a_size];
  30. for (volatile int i=0; i<a_size; i++) {
  31. a[i] = i;
  32. }
  33. DBG(alignof(__m256i));
  34. __m256i vr;
  35. vr = crash_vector_ret(a);
  36. DBG(vr);
  37. return 0;
  38. }

Makefile:

  1. CXXFLAGS += -mavx2
  2. CXXFLAGS += -std=c++17
  3. CXXFLAGS += -g
  4. CXX = g++
  5. main.exe: main.cpp
  6. $(CXX) $(CXXFLAGS) -O1 -DARRAY_SIZE=32 -o $@ $<
  7. test: main.cpp
  8. for COMPILER in g++; do \
  9. ASM_FLAGS="-S -fverbose-asm -masm=intel"; \
  10. for ARRAY_SIZE in 32 36; do \
  11. for OPT in 0 1 2 3; do \
  12. NAME="main_O$${OPT}_$${COMPILER}_s$${ARRAY_SIZE}"; \
  13. echo "--------------------------------------------------"; \
  14. echo "NAME:$${NAME}"; \
  15. rm -f $${NAME}.exe; \
  16. $${COMPILER} $(CXXFLAGS) -DARRAY_SIZE=$${ARRAY_SIZE} -O$${OPT} -o $${NAME}.exe main.cpp; \
  17. $${COMPILER} $(CXXFLAGS) -DARRAY_SIZE=$${ARRAY_SIZE} -O$${OPT} $${ASM_FLAGS} -o $${NAME}.s main.cpp; \
  18. ./$${NAME}.exe; \
  19. done; \
  20. done; \
  21. done

Trying various compilation options:

$ make test
for COMPILER in g++; do \
        ASM_FLAGS="-S -fverbose-asm -masm=intel"; \
        for ARRAY_SIZE in 32 36; do \
                for OPT in 0 1 2 3; do \
                        NAME="main_O${OPT}_${COMPILER}_s${ARRAY_SIZE}"; \
                        echo "--------------------------------------------------"; \
                        echo "NAME:${NAME}"; \
                        rm -f ${NAME}.exe; \
                        ${COMPILER} -mavx2 -std=c++17 -g -DARRAY_SIZE=${ARRAY_SIZE} -O${OPT}               -o ${NAME}.exe main.cpp; \
                        ${COMPILER} -mavx2 -std=c++17 -g -DARRAY_SIZE=${ARRAY_SIZE} -O${OPT} ${ASM_FLAGS} -o ${NAME}.s   main.cpp; \
                        ./${NAME}.exe; \
                done; \
        done; \
done
--------------------------------------------------
NAME:main_O0_g++_s32
alignof(__m256i): 32
/bin/sh: line 3: 15627 Segmentation fault      ./${NAME}.exe
--------------------------------------------------
NAME:main_O1_g++_s32
alignof(__m256i): 32
/bin/sh: line 3: 15631 Segmentation fault      ./${NAME}.exe
--------------------------------------------------
NAME:main_O2_g++_s32
alignof(__m256i): 32
/bin/sh: line 3: 15635 Segmentation fault      ./${NAME}.exe
--------------------------------------------------
NAME:main_O3_g++_s32
alignof(__m256i): 32
/bin/sh: line 3: 15639 Segmentation fault      ./${NAME}.exe
--------------------------------------------------
NAME:main_O0_g++_s36
alignof(__m256i): 32
/bin/sh: line 3: 15643 Segmentation fault      ./${NAME}.exe
--------------------------------------------------
NAME:main_O1_g++_s36
alignof(__m256i): 32
vr: [0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31]
--------------------------------------------------
NAME:main_O2_g++_s36
alignof(__m256i): 32
vr: [0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31]
--------------------------------------------------
NAME:main_O3_g++_s36
alignof(__m256i): 32
vr: [0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31]

GDB session

Program compiled by

g++ -mavx2 -std=c++17 -g -DARRAY_SIZE=32 -O0 -o main_O0_g++_s32.exe main.cpp

$ gdb main_O0_g++_s32
GNU gdb (GDB) 8.3
Copyright (C) 2019 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.
Type "show copying" and "show warranty" for details.
This GDB was configured as "x86_64-w64-mingw32".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>.
Find the GDB manual and other documentation resources online at:
    <http://www.gnu.org/software/gdb/documentation/>.

For help, type "help".
Type "apropos word" to search for commands related to "word"...
Reading symbols from main_O0_g++_s32...
(gdb) r
Starting program: D:\SourceCode\cpp_simdpp_crash_loadu\main_O0_g++_s32.exe
[New Thread 10112.0x4c8]
[New Thread 10112.0x409c]
[New Thread 10112.0x6bf0]
alignof(__m256i): 32

Thread 1 received signal SIGSEGV, Segmentation fault.
0x00000000004016ea in crash_vector_ret (a=0x66fe10 "") at main.cpp:24
24              v = _mm256_loadu_si256(reinterpret_cast<const __m256i*>(a));
(gdb) bt
#0  0x00000000004016ea in crash_vector_ret (a=0x66fe10 "") at main.cpp:24
#1  0x000000000040179f in main (argc=1, argv=0xe64a60) at main.cpp:40
(gdb) set disassembly-flavor intel
(gdb) disas
Dump of assembler code for function crash_vector_ret(unsigned char*):
   0x00000000004016b7 <+0>:     push   rbp
   0x00000000004016b8 <+1>:     mov    rbp,rsp
   0x00000000004016bb <+4>:     sub    rsp,0x10
   0x00000000004016bf <+8>:     mov    QWORD PTR [rbp+0x10],rcx
   0x00000000004016c3 <+12>:    mov    QWORD PTR [rbp+0x18],rdx
   0x00000000004016c7 <+16>:    mov    rax,QWORD PTR [rbp+0x18]
   0x00000000004016cb <+20>:    mov    QWORD PTR [rbp-0x8],rax
   0x00000000004016cf <+24>:    mov    rax,QWORD PTR [rbp-0x8]
   0x00000000004016d3 <+28>:    vmovdqu xmm0,XMMWORD PTR [rax]
   0x00000000004016d7 <+32>:    vinserti128 ymm0,ymm0,XMMWORD PTR [rax+0x10],0x1
   0x00000000004016de <+39>:    vmovdqa ymm1,ymm0
   0x00000000004016e2 <+43>:    vmovdqa ymm0,ymm1
   0x00000000004016e6 <+47>:    mov    rax,QWORD PTR [rbp+0x10]
=> 0x00000000004016ea <+51>:    vmovdqa YMMWORD PTR [rax],ymm0
   0x00000000004016ee <+55>:    nop
   0x00000000004016ef <+56>:    mov    rax,QWORD PTR [rbp+0x10]
   0x00000000004016f3 <+60>:    add    rsp,0x10
   0x00000000004016f7 <+64>:    pop    rbp
   0x00000000004016f8 <+65>:    ret
End of assembler dump.
(gdb) p $rax
$1 = 6749616
(gdb) p $rax/32*32
$2 = 6749600

As you can see, the $rax address is not aligned to 32 bytes and therefore the vmovdqa crashes.

I thought that because of the alignof(__m256i) == 32, the compiler should guarantee that also the return value is aligned to 32 bytes.

I might be wrong. Could you please shed more light on this issue?

Thanks for any hints.

Ticket History (2/2 Histories)

2019-09-12 15:47 Updated by: michal_fapso
  • New Ticket "Incorrect alignment of SIMD vectors returned by value from functions" created
2019-09-17 19:00 Updated by: michal_fapso
Commentaire

I thought, that force-inlining all functions which return a SIMD vector would be the solution, but I was wrong. The same issue reveals when storing a SIMD vector in a local variable. Here is the disassembly of such SIGSEGV crash:

   0x00000000005520db <+101>:   vmovdqa ymm0,ymm1
=> 0x00000000005520df <+105>:   vmovdqa YMMWORD PTR [rbp-0x40],ymm0
   ...
(gdb) p $rbp
$1 = (void *) 0xf7a750
(gdb) p $rbp-0x40
$2 = (void *) 0xf7a710

The local variable's address 0xf7a710 is not properly aligned to 32 bytes, so vmovdqa crashes.

This was compiled with -O0. When compiled with -O3, it doesn't crash, probably because it keeps all computations in registers and doesn't store intermediate results in local variables at all. However, there are cases when storing vector registers to local variables is inevitable.

Attachment File List

No attachments

Modifier

Please login to add comment to this ticket » Connexion