On RHEL (CentOS) 6.6 with a kernel older than 2.6.32-504.16.2.el6, systems with Haswell CPUs can hang because of a futex bug.

The bug is fixed as of kernel 2.6.32-504.16.2.el6.
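
As a rough way to apply the version rule above, the sketch below compares a kernel release string (the output of `uname -r`) against the fixed version 2.6.32-504.16.2.el6. The function names are made up for illustration. The lower bound 2.6.32-504, assumed here to be the first RHEL 6.6 kernel, is not stated in this post. Only el6 release strings are handled.

```python
import re

# Fixed kernel named in this post: 2.6.32-504.16.2.el6
FIXED = (2, 6, 32, 504, 16, 2)
# Assumption (not from this post): RHEL 6.6 first shipped kernel 2.6.32-504.el6
FIRST_6_6 = (2, 6, 32, 504)


def parse_el6_release(release):
    """Turn '2.6.32-504.12.2.el6.x86_64' into (2, 6, 32, 504, 12, 2)."""
    if ".el6" not in release:
        raise ValueError("not an el6 kernel release: %r" % release)
    head = release.split(".el6")[0]
    return tuple(int(part) for part in re.split(r"[.-]", head))


def is_affected_el6_kernel(release):
    """True if the release falls in the 6.6 range that predates the fix."""
    version = parse_el6_release(release)
    return FIRST_6_6 <= version < FIXED
```

On a live system the string would come from `platform.release()`. This checks only the kernel version and says nothing about whether the CPU is a Haswell part.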

What is a futex?

 

Serious Red Hat Linux Bug Affects Haswell-based Servers

http://www.infoq.com/news/2015/05/redhat-futex

A recent post by Gil Tene highlights an important, little-known patch to the Linux kernel that all users and administrators of Linux systems should review, especially those running on Haswell processors. Tene reports that users of Red Hat-based distributions in particular (including CentOS 6.6 and Scientific Linux 6.6) should apply the patch as soon as possible. Even if your instance of Linux is running in a VM, that VM is most likely hosted on a Haswell machine if it runs on one of the popular cloud providers (Azure, Amazon, etc.) and would benefit from the patch.

Tene describes the flaw as follows:

“The impact of this kernel bug is very simple: user processes can deadlock and hang in seemingly impossible situations. A futex wait call (and anything using a futex wait) can stay blocked forever, even though it had been properly woken up by someone. Thread.park() in Java may stay parked. Etc. If you are lucky you may also find soft lockup messages in your dmesg logs. If you are not that lucky (like us, for example), you'll spend a couple of months of someone's time trying to find the fault in your code, when there is nothing there to find.”

Tene goes on to explain how the flawed code behaved (it boils down to a switch block missing a default case). The main reason the problem matters today is that, while the code in question was fixed in January 2014, the flaw was backported into the Red Hat 6.6 family around October 2014. Other distributions (SLES, Ubuntu, Debian, etc.) are probably affected as well.

The fix for those systems is only now being distributed and it could be overlooked.  Red Hat users should look for RHEL 6.6.z or newer.  A key point made by Tene is that the fix has been unevenly distributed as different distributions make specific choices on what goes into their kernel. 

For example, on RHEL 7.1: “The upstream 3.10 didn't have the bug. But RHEL 7's version is different from the pure upstream version.  Unfortunately, RHEL 7.1 (much like RHEL 6.6) backported the change that included the bug…  I expect that some other distros may have also done the same.”

For RHEL based distributions, Tene produced a quick table for reference (emphasis in the original):

RHEL 6 (and CentOS 6, and SL 6): 6.0-6.5 are good. 6.6 is BAD. 6.6.z is good.

RHEL 7 (and CentOS 7, and SL 7): 7.1 is BAD. As of yesterday, there does not yet appear to be a 7.x fix. [May 13, 2015]

RHEL 5 (and CentOS 5, and SL 5): All versions are good (including 5.11).

A conversation about this discovery at Hacker News saw some disputing the number of affected systems, but it provides some context for checking whether or not your system may need a patch.
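
To make the symptom Tene describes concrete, here is a small sketch of a park-and-wake pair with a bounded wait. It does not reproduce the kernel bug. It only shows the expected behavior (a woken waiter returns) and how a timeout gives a program the chance to report a wait that never came back instead of hanging silently. The helper name `park_and_unpark` and the 2-second default are illustrative choices, and the idea that Python's `threading.Event` ends up in a futex wait is an assumption about CPython on Linux with glibc.

```python
import threading


def park_and_unpark(timeout_s=2.0, wake=True):
    """Park a thread on an event, optionally wake it, report the outcome.

    Returns True if the waiter observed the wake-up, False if its
    bounded wait timed out instead.
    """
    woken = threading.Event()
    result = {}

    def waiter():
        # Bounded wait: returns False on timeout rather than blocking forever.
        result["woken"] = woken.wait(timeout_s)

    thread = threading.Thread(target=waiter)
    thread.start()
    if wake:
        woken.set()  # the "unpark" side
    thread.join(timeout_s + 1.0)
    return result.get("woken", False)
```

On a healthy system `park_and_unpark()` returns True. Calling it with `wake=False` shows the timeout path that a watchdog would log.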

 

See the related Red Hat document below.

https://access.redhat.com/solutions/1386323
