John the Ripper 1.7.2 (a "development" version) adds bitslice DES assembly code for x86-64, making use of the extended SSE2 available in 64-bit mode with its 16 XMM registers. You can download it at the usual location: John the Ripper password cracker.

The SSE2 code introduced in 1.7.1 was only available in 32-bit x86 builds of John. John 1.7.2 can use SSE2 in 64-bit builds as well, which in practice brings their performance on current 64-bit capable AMD processors up to the level that 1.7.1 achieved with 32-bit builds on those same processors. So if your operating system is 64-bit only (no 32-bit compatibility) and you were thus unable to make use of the SSE2 acceleration in 1.7.1, you should be able to do so now. :-)

There's no performance improvement for current 64-bit capable Intel processors with this version. Those were already happy running native x86-64 code, which achieved the same level of performance as the SSE2 code. There is, however, one advantage to using this version on Intel processors as well: the performance should be more stable, regardless of C compiler version and options.

Some technical detail for the curious:

Yes, the new 64-bit mode SSE2 code is also different. The DES S-box implementations are not derived from the 32-bit mode MMX/SSE2 ones; rather, they have been generated anew from Matthew Kwan's optimized S-box expressions, making use of all 16 XMM registers. A Perl script was written to compile the expressions into a virtual 3-operand architecture, convert that to 2-operand form and allocate virtual registers, allocate real registers and spill whatever didn't fit into registers to memory locations, and do some instruction scheduling based on published SSE2 instruction latencies for the Pentium 4 (which are larger than those for current AMD64 processors, and so should be safe for both camps). The script's output then required some manual work to resolve the few occurrences of unsupported operand combinations. Finally, the individual S-boxes were subjected to "brute-force instruction scheduling", which tested thousands of different orderings in which the instructions could be placed, measuring the actual execution time of each in clock cycles (yes, the code was actually being run on the CPU multiple times for each of the thousands of code versions being tested).
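The 3-operand to 2-operand conversion step can be sketched roughly as follows. This is a hypothetical simplification in Python, not the actual Perl script: x86 SSE2 instructions such as pxor overwrite their first source operand, so when the virtual destination differs from the first source, a movdqa-style copy has to be inserted - unless the operation is commutative and the operands can simply be swapped.

```python
def lower_to_two_operand(ops):
    """Lower virtual 3-operand ops (dst, op, src1, src2) to 2-operand
    x86-style instructions, where the destination doubles as the first
    source (e.g. "pxor xmm0, xmm1" computes xmm0 ^= xmm1)."""
    COMMUTATIVE = {"xor", "and", "or"}
    out = []
    for dst, op, a, b in ops:
        if dst != a:
            if dst == b and op in COMMUTATIVE:
                a, b = b, a  # swap so dst is already the first source
            elif dst == b:
                # "mov dst, a" would clobber src2 before it is read
                raise ValueError("need a temporary: dst aliases src2")
            else:
                out.append(("mov", dst, a))  # movdqa dst, a
        out.append((op, dst, b))
    return out
```

For example, lowering t0 = a ^ b yields a mov followed by an xor, while t0 = a ^ t0 needs no copy at all thanks to the commutativity swap.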

Now, you would expect excellent performance from the above, wouldn't you? Well, it's not so great in practice. The 8-register MMX code from which the non-64-bit-mode SSE2 code was derived is excellent, making it very hard to do better - even with more registers. Yes, the availability of 16 registers helps save some moves and reduces the instruction count in the S-boxes by 10%, but it turns out that accessing the additional 8 XMM registers may in some cases be slower than accessing memory (actually, the L1 cache). I think that this has to do with limitations of the instruction decoder (accessing XMM8-XMM15 requires a REX prefix, which makes those instructions one byte longer). For example, blindly converting the "old" 8-register SSE2 code to use the extra 8 registers instead of temporaries in memory results in an 8% slowdown on an Athlon 64, but makes no difference on a Xeon - which is consistent with the "decoder theory". Brute-force re-scheduling that code does win back some of the lost 8%, but it does not restore the original 8-register code's performance.

So for the time being, I am satisfied with the brand new 16-register code being about as fast as the old 8-register code. In future versions, I might further improve the Perl script to do smarter register allocation, taking into account the fact that consecutive instructions accessing registers XMM8-XMM15 have a higher cost on AMD64 processors. (In my testing, the slowdown appears with 4 or more consecutive instructions like that, but real-world code has more constraints.) I might also generate some hybrid native x86-64 + SSE2 code, which might make this issue irrelevant... or it might not. :-)
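As a rough illustration of the kind of cost heuristic such smarter register allocation might use - a hypothetical sketch, not code from John the Ripper - one could penalize runs of 4 or more consecutive instructions that touch XMM8-XMM15, since that is where the slowdown was observed in testing:

```python
# Hypothetical cost model: each instruction is represented as the set of
# register names it accesses; runs of >= 4 consecutive instructions that
# touch the "high" registers XMM8-XMM15 are penalized, matching the
# observation that 4+ such instructions in a row trigger the AMD64 slowdown.
HIGH_XMM = {f"xmm{i}" for i in range(8, 16)}

def high_reg_penalty(instructions, threshold=4):
    """Return the number of instructions that fall inside runs of
    >= threshold consecutive instructions accessing XMM8-XMM15."""
    penalty = run = 0
    for regs in instructions:
        if regs & HIGH_XMM:
            run += 1
        else:
            if run >= threshold:
                penalty += run
            run = 0
    if run >= threshold:  # a run may end at the last instruction
        penalty += run
    return penalty
```

A register allocator could then compare candidate allocations by this penalty (among other costs) and prefer ones that break up long runs of high-register accesses.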