Tuesday, November 25, 2014

Portability woes: Endianness and Alignment (Part 2)


In the previous part, endianness and 32/64-bit detection were detailed. Now we'll have a look at the more complex issue of memory alignment.

Memory access alignment is a lesser-known issue, but its impact can be huge: it can crash your program or result in a disproportionate slowdown.

Let's first summarize the problem. When accessing a single byte in memory, there is no restriction. But when trying to access 2 bytes at a time (a short), or 4 bytes at a time (an int), alignment gets in the way.
The alignment property means a 2-byte field must always be accessed at an even address (a multiple of 2), a 4-byte field at an address that is a multiple of 4, and so on.
From a performance perspective, accessing several bytes at a time is a win, as it makes better use of the memory bus. But when a multi-byte field is accessed at a non-aligned memory address, all sorts of problems can happen: bus width or addressing limitations, cache line overlap, memory segment boundary crossing, and so on.
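As a minimal standalone illustration (not LZ4 code, names invented for the example), the read below looks innocent, yet it performs an unaligned access whenever the offset is not a multiple of 4:

#include <stddef.h>
#include <stdint.h>

/* Hypothetical sketch : read a 32-bit value at an arbitrary offset inside a byte buffer.
 * Whenever (offset % 4) != 0, the dereference below is an unaligned access :
 * fine on x86/x64, potentially a bus error (crash) or a slow trap on stricter CPUs. */
static uint32_t read32_at(const unsigned char* buffer, size_t offset)
{
    return *(const uint32_t*)(buffer + offset);
}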

All these issues can be solved, and indeed, on the most widely known programming environment, x86/x64, they have been solved for a long time.
But it has a cost: it makes the CPU more complex and consumes some precious transistor space. As a consequence, several CPU vendors chose to be a bit lazy on this front, deciding not to address the issue and leaving the problem in the hands of software developers. In such cases, if an unaligned memory access is nonetheless performed, the CPU raises an exception, typically resulting in a program crash.
To make matters more complex, some CPUs handle unaligned accesses, but in an inefficient manner, resulting in undesirably slow performance. Others handle them correctly for short (2-byte) or int (4-byte) fields, but not for long long (8-byte) ones.

The data alignment issue is well described, with many sources throughout the Internet. Unfortunately, finding a proper portable solution for it is not so easy, and many "advisers" simply tell you to avoid unaligned accesses altogether. Thanks, really.
But this advice cannot be followed in every circumstance. Consider how a compression algorithm works: it looks for similar patterns in past data. Such patterns can appear anywhere, not just at "aligned" addresses.

For portable code, this situation is a nightmare. The "safe" approach would be to always access data byte by byte, but then the impact on performance is huge, and for a speed-oriented application such as LZ4, this trade-off is unacceptable.

The way LZ4 has handled it up to now relied on the compiler. The basic idea is: the compiler should be fully aware of whether its target CPU can, or cannot, handle unaligned memory accesses.
This is achieved through the "pack" instruction, which, in a nutshell, tells the compiler "do not align these data", and therefore to generate suitably cautious assembly code when accessing them.

It gives the following result:

#if defined(__GNUC__)  && !defined(LZ4_FORCE_UNALIGNED_ACCESS)
#  define _PACKED __attribute__ ((packed))
#else
#  define _PACKED
#endif

#if !defined(LZ4_FORCE_UNALIGNED_ACCESS) && !defined(__GNUC__)
#  if defined(__IBMC__) || defined(__SUNPRO_C) || defined(__SUNPRO_CC)
#    pragma pack(1)
#  else
#    pragma pack(push, 1)
#  endif
#endif

typedef struct { U16 v; }  _PACKED U16_S;
typedef struct { U32 v; }  _PACKED U32_S;
typedef struct { U64 v; }  _PACKED U64_S;
typedef struct {size_t v;} _PACKED size_t_S;

#if !defined(LZ4_FORCE_UNALIGNED_ACCESS) && !defined(__GNUC__)
#  if defined(__SUNPRO_C) || defined(__SUNPRO_CC)
#    pragma pack(0)
#  else
#    pragma pack(pop)
#  endif
#endif

#define A16(x)   (((U16_S *)(x))->v)
#define A32(x)   (((U32_S *)(x))->v)
#define A64(x)   (((U64_S *)(x))->v)
#define AARCH(x) (((size_t_S *)(x))->v)
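
These macros are then used wherever LZ4 needs to read or write a multi-byte field at an arbitrary position. A minimal hypothetical usage (invented for the example, not the exact LZ4 code) would be:

/* Hypothetical sketch : copy 4 bytes between two arbitrary addresses,
 * relying on the packed-struct access defined above */
static void LZ4_copy4(void* dstPtr, const void* srcPtr)
{
    A32(dstPtr) = A32(srcPtr);   /* 4-byte copy, whatever the alignment of both pointers */
}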

If it looks a bit complex, that's because it is.
The first issue is that issuing the "pack" instruction must be done in a variety of ways, depending on the compiler and its version. This translates into the monstrous macro above, which tries to cover all the situations reported so far. As you can guess, I regularly receive notice of new situations the macro cannot cope with.
This is about as bad as the previous stories regarding 32/64 bits and endianness.

But there is more.
Compilers are not that clever.
In many circumstances, the compiler will issue "safe" but slow code to access unaligned data, even though the target CPU is able to handle the situation efficiently, resulting in a large speed drop. This is especially true for recent ARM CPUs.
To counterbalance this effect, the "pack" instruction has to be manually "turned off", using in the above example the #define LZ4_FORCE_UNALIGNED_ACCESS.
Unfortunately, this manual switch is almost never used. Source code will most of the time be compiled "as is", which is no surprise.

So we have two issues: issuing the "pack" instruction is complex and not future-proof, and compilers don't automatically make the best choice.

To save the day, maybe a new runtime check will help, as it did for the previous issues?
Alas, there is no portable runtime test available to check alignment properties.
(Note: of course, one could imagine running an external program/process just for this purpose, but it's outside the scope of a small single-file library.)

So we are stuck, aren't we?

Well, that's the difficult part. To make some progress on the current situation, I'm going to change the logic: instead of relying on the compiler, the code will take explicit charge of handling unaligned accesses.

The core foundation of the new solution is the function below, already used within lz4frame:

static U32 LZ4_readLE32 (const BYTE* srcPtr)
{
    U32 value32 = srcPtr[0];
    value32 += (srcPtr[1]<<8);
    value32 += (srcPtr[2]<<16);
    value32 += ((U32)srcPtr[3]<<24);   /* cast keeps the shift in unsigned arithmetic when the top bit is set */
    return value32;
}

What's good with this function?
It handles both endianness and alignment in a safe way. The code is portable.

What's wrong with it?
It's the safe approach, and as such it is slower than necessary when the CPU can correctly handle unaligned memory accesses.

So we will now special-case CPUs which do support efficient unaligned access.

Some of them are easily detectable, thanks to widely supported standard macros: __i386__, __x86_64__ and __ARM_ARCH_7__ identify architectures known for their good support of unaligned memory accesses. __ARM_ARCH_6__ is also able to handle them, but in a less efficient manner, so it's unclear whether it's really faster than the portable version.
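As a hypothetical sketch (the grouping is an assumption, not the final LZ4 code), these detection macros could feed the manual switch discussed further below:

/* Sketch : group architectures known to handle unaligned accesses efficiently.
 * The predefined macros are the ones listed above ; the list is expected to grow over time. */
#if !defined(CPU_HAS_EFFICIENT_UNALIGNED_MEMORY_ACCESS)
#  if defined(__i386__) || defined(__x86_64__) || defined(__ARM_ARCH_7__)
#    define CPU_HAS_EFFICIENT_UNALIGNED_MEMORY_ACCESS 1
#  endif
#endif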

Finding a list of CPUs with efficient support of unaligned memory accesses (and their related detection macros) has proven an impossible task so far. One may have in mind that Linux faces a similar challenge, which is supposedly solved thanks to the macro CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS. Unfortunately, I couldn't find a single place where this macro is defined. It seems to be a decentralized methodology, where each architecture independently declares whether it is compatible or not. For the Linux kernel, it's likely the correct method. But it also means there is no central repository where this property is listed.

So I'm a bit stuck right now.
My expectation is that external contributors interested in LZ4 performance may benchmark the speed of the new version, tentatively enabling/disabling the prominent switch at the top of lz4.c when they see fit:

/*
 * CPU_HAS_EFFICIENT_UNALIGNED_MEMORY_ACCESS :
 * You can force the code to use unaligned memory access, should you know your CPU can handle it efficiently.
 * If it effectively results in better speed (up to 50% improvement can be expected)
 * please report your configuration to upstream (https://groups.google.com/forum/#!forum/lz4c)
 * so that an automatic detection macro can be added to mainline.
 */
/* #define CPU_HAS_EFFICIENT_UNALIGNED_MEMORY_ACCESS 1 */

Each time a CPU is known to efficiently handle unaligned memory accesses, its standard detection macro can be added to the list, in order to automatically use the faster code path.
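As a minimal sketch of how the switch is meant to be consumed (assumed function name, not necessarily the exact LZ4 code):

/* Sketch : select the access method at compile time.
 * The direct load is only valid on little-endian CPUs which tolerate unaligned accesses ;
 * otherwise fall back on the safe byte-by-byte LZ4_readLE32() shown earlier. */
static U32 LZ4_readLE32_fast (const BYTE* srcPtr)
{
#if defined(CPU_HAS_EFFICIENT_UNALIGNED_MEMORY_ACCESS)
    if (LZ4_isLittleEndian())           /* runtime test described in Part 1, folded at compile time */
        return *(const U32*)srcPtr;     /* direct unaligned load on capable CPUs */
#endif
    return LZ4_readLE32(srcPtr);        /* safe, portable path */
}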

Is it any better than the previous method?
Well, sort of.

To begin with, there is no longer any need to fiddle with the different "pack" instructions depending on compilers and versions. This is one less complexity to manage, and one less dependency to worry about.
Being formally in charge of unaligned accesses allows the introduction of dedicated functions. For example, a more adapted "string comparison" function could be introduced, as sketched below.
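Such a helper could look like this (a rough sketch with invented names, not the actual LZ4 function):

/* Sketch : count how many bytes are identical between two positions,
 * comparing 4 bytes at a time through the safe read, then finishing byte by byte. */
static size_t LZ4_commonLength (const BYTE* p1, const BYTE* p2, const BYTE* p1Limit)
{
    const BYTE* const p1Start = p1;
    while (p1+4 <= p1Limit)
    {
        if (LZ4_readLE32(p1) != LZ4_readLE32(p2)) break;
        p1 += 4; p2 += 4;
    }
    while ((p1 < p1Limit) && (*p1 == *p2)) { p1++; p2++; }
    return (size_t)(p1 - p1Start);
}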

More importantly, instead of crashing when detection fails, the library will now run slower, but still run correctly. This introduces a new risk though: many users may simply not notice the slowdown, and just use the library "as is", unaware of the latent performance improvement that could be unleashed. The hope is that, as long as a few contributors detect and report the performance issue, the situation can be solved for everybody with a similar setup.


The latest version of LZ4 using these new detection routines is accessible in the feature branch "AlignEndian": https://github.com/Cyan4973/lz4/tree/AlignEndian

It's possible to compare it with the latest "official" release, r124. On x86/x64 CPUs, it was benchmarked and proved to provide similar performance. On other CPUs though, it's still worthwhile to check.

Monday, November 24, 2014

Portability woes: Endianness and Alignment (Part 1)

In creating ultra-portable code, able to be compiled on (almost) every platform, there are some unusual problems to take into consideration. Listing them in order of increasing complexity, this post will review in detail address space, endianness, and alignment restrictions (the latter covered in Part 2).

Detecting 32 vs 64 bits
This is the easiest part, now practiced by many programmers. It's very common for a program to be designed with a single target environment in mind. But since PC programmers had to deal with the 32->64 bit transition, many solutions are available, easily found throughout the Internet. However, they are not all equivalent.
Consider the initial solution adopted by LZ4:

/* 32 or 64 bits ? */
#if (defined(__x86_64__) || defined(_M_X64) || defined(_WIN64) \
  || defined(__64BIT__)  || defined(__mips64) \
  || defined(__powerpc64__) || defined(__powerpc64le__) \
  || defined(__ppc64__) || defined(__ppc64le__) \
  || defined(__PPC64__) || defined(__PPC64LE__) \
  || defined(__ia64) || defined(__itanium__) || defined(_M_IA64) \
  || defined(__s390x__) )   /* Detects 64 bits mode */
#  define LZ4_ARCH64 1
#else
#  define LZ4_ARCH64 0
#endif

It works. But you can easily spot the weakness: this is a long macro, with many special cases, and there is no guarantee that additional cases won't appear in the future. And by the way, this is exactly what happened: the list was completed thanks to contributors adding each target platform, one by one, as they were discovered.
So it's complex, and not completely future-proof.
Let's compare it to the new method:

static unsigned LZ4_64bits(void) { return sizeof(void*)==8; }

Yes, a single trivial line, and it is future-proof. It could be written as a macro too, but I prefer a static function, since it gives the compiler a chance to do something clever about it.

The initial feeling is that the macro is "runtime free", while the function will cost a small comparison test every time it's called, and will thus be slower.
But that's not what a modern compiler is expected to do. Clever compilers will realize this function always returns the same value, and will replace it with its result. When branches depend on the result, the compiler is also expected to resolve the test automatically and remove the useless branch through dead-code elimination.
And in practice, it works well.
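For example, a branch like the following one (a hypothetical illustration, not taken from lz4.c) compiles down to a single constant, since LZ4_64bits() folds at compile time:

/* Sketch : choose how many bytes to process per iteration.
 * With optimizations enabled, the test and the dead branch disappear entirely. */
static size_t LZ4_stepSize(void)
{
    if (LZ4_64bits()) return 8;   /* 64-bit CPU : handle 8 bytes at a time */
    return 4;                     /* 32-bit CPU : handle 4 bytes at a time */
}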

It doesn't solve everything though.
For example, with Visual Studio, the intrinsic function _BitScanForward64() is only accessible during x64 compilation. Compiling source code mentioning this function in 32-bit mode will fail at the link stage, even if the program never calls the function. That's a situation a runtime test cannot solve.
For this special case, it's still necessary to restrict the compiled source code through macro selection, as illustrated below.
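A trimmed illustration of that constraint (invented names, not the actual LZ4 code):

/* The 64-bit intrinsic must be excluded at preprocessing time :
 * in a 32-bit build, the mere reference to _BitScanForward64() fails at link stage,
 * even if the call is protected by a runtime test. */
#if defined(_MSC_VER) && defined(_M_X64)
#  include <intrin.h>
static unsigned trailingZeros64(unsigned long long v)
{
    unsigned long r = 0;
    _BitScanForward64(&r, v);
    return (unsigned)r;
}
#else
static unsigned trailingZeros64(unsigned long long v)   /* portable fallback ; v != 0 expected */
{
    unsigned r = 0;
    while (!(v & 1)) { v >>= 1; r++; }
    return r;
}
#endif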

Detecting Endianness
While 32/64 bits is (mostly) a question of performance, endianness impacts result correctness. So it's damn important to detect it correctly.
Code able to handle different endianness is less common, but it's still relatively easy to find several solutions over the Internet. And once again, they are not all equivalent.

Initially, LZ4 adopted the macro detection approach:

/*
 * Little Endian or Big Endian ?
 * Overwrite the #define below if you know your architecture endianess
 */
#include <stdlib.h>   /* Apparently required to detect endianess */
#if defined (__GLIBC__)
#  include <endian.h>
#  if (__BYTE_ORDER == __BIG_ENDIAN)
#     define LZ4_BIG_ENDIAN 1
#  endif
#elif (defined(__BIG_ENDIAN__) || defined(__BIG_ENDIAN) || defined(_BIG_ENDIAN)) && !(defined(__LITTLE_ENDIAN__) || defined(__LITTLE_ENDIAN) || defined(_LITTLE_ENDIAN))
#  define LZ4_BIG_ENDIAN 1
#elif defined(__sparc) || defined(__sparc__) \
   || defined(__powerpc__) || defined(__ppc__) || defined(__PPC__) \
   || defined(__hpux)  || defined(__hppa) \
   || defined(_MIPSEB) || defined(__s390__)
#  define LZ4_BIG_ENDIAN 1
#else
/* Little Endian assumed. PDP Endian and other very rare endian format are unsupported. */
#endif

As you can see, it is a big mess. It depends on so many different parameters that it's hard to maintain, and it's difficult to guarantee it will always work. Indeed, it does not. Quite regularly, I received external contributions regarding specific platforms which would fail the test, in both directions (some little-endian platforms declared as big-endian, and the other way round).
Add to this already complex situation the case of bi-endian CPUs, a growing list of hardware which can choose to run little-endian or big-endian at will. That makes using the architecture as an endianness hint an unsustainable proposition.

As previously, the intention is to replace this list of macros with a guaranteed runtime test. Here is the new method within LZ4:

static unsigned LZ4_isLittleEndian(void)
{
    const union { U32 i; BYTE c[4]; } one = { 1 };   /* don't use static : performance detrimental */
    return one.c[0];
}

Once again, the runtime method is not just much shorter and more readable, it's also guaranteed to always produce the right result, whatever the CPU and its local mode.

However, this time, it's a little more difficult for the compiler to make the test "runtime free".

Here, unfortunately, your mileage may vary. For example, in the version above, I only succeeded in making the compiler realize the function always produces the same result after removing the "static" property from the variable "one". Other possibilities exist, such as filling a 32-bit value and then accessing it through a char* pointer. They all work. At the end of the day, the question is: which version is most likely to be reduced to a constant value by as many compilers as possible?
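For the record, the char*-based alternative mentioned above could look like this (a sketch, not the version retained in LZ4):

/* Alternative sketch : fill a 32-bit value, then inspect its first byte through a char* */
static unsigned LZ4_isLittleEndian_alt(void)
{
    U32 one = 1;
    return (unsigned)(*(const char*)&one);   /* first byte is 1 on little-endian, 0 on big-endian */
}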

The union-based version above proved successful so far. I can only hope it will remain as successful for other compiler/target combinations. At least, it's no longer the correctness of the test which is at stake, only its performance.


In the next article, we'll review the memory access alignment restriction, which is, by far, the most complex issue.
In the meantime, should you wish to review and test the new detection methods, you can grab them in the latest LZ4 feature branch named "AlignEndian": https://github.com/Cyan4973/lz4/tree/AlignEndian

It's possible to compare it with the latest "official" release, r124. On x86/x64 CPUs, it was benchmarked and proved to provide similar performance. On other CPUs though, especially big-endian ones, it would deserve to be checked.