
Linux Technical Report: From 3.10 to 4.0

Written by Wu Zhangjin on 2015/05/25

By Falcon of TinyLab.org 2015/05/24


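Android 5.x still uses Linux 3.10 to this day. Will it migrate to a newer kernel in the future? Let's look at which changes that may affect the user experience have been introduced into the Linux kernel since 3.11. The summary below covers four areas: power management, performance optimization, reliability, and security.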

Power Management

3.17: Improved power management features enabled for more Radeon GPUs

Dynamic power management (dpm) has been re-enabled by default on Cayman and BTC devices.

Also, a new module parameter (radeon.bapm=1) has been added to enable bidirectional application power management (bapm) on APUs where it’s disabled by default due to stability issues.

3.17: scripts/analyze_suspend.py: update to v3.0

Version 3.0 includes back-to-back suspend testing, device filters to reduce the HTML size, the inclusion of device_prepare and device_complete callbacks, a USB topography list, and the ability to control USB device autosuspend.

3.13: Power capping framework

This release includes a framework that allows setting power consumption limits on devices that support it. It has been designed around the Intel RAPL (Running Average Power Limit) mechanism available in recent Intel processors (Sandy Bridge and later; RAPL support is expected to be added to many more devices in the future). This framework provides a consistent interface between the kernel and user space that allows power capping drivers to expose their settings to user space in a uniform way.
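As a rough illustration of that uniform interface, the sketch below walks the powercap sysfs class directory and reads a few per-zone attributes. It assumes the usual `/sys/class/powercap` location and the `name`, `energy_uj` and `constraint_0_power_limit_uw` attribute files; the helper names (`read_powercap_zones`, `_read_attr`, `_to_int`) are hypothetical and chosen only for this example. On machines without a powercap driver it simply returns an empty list.

```python
from pathlib import Path


def _read_attr(path):
    """Return the stripped contents of a sysfs attribute, or None if unreadable."""
    try:
        return path.read_text().strip()
    except OSError:
        return None


def _to_int(text):
    """Convert a decimal sysfs value to int; None for missing or odd values."""
    return int(text) if text is not None and text.isdigit() else None


def read_powercap_zones(root="/sys/class/powercap"):
    """Return one dict per power zone found directly under `root`."""
    zones = []
    base = Path(root)
    if not base.is_dir():
        return zones  # no powercap support exposed on this system
    for zone in sorted(base.iterdir()):
        name = _read_attr(zone / "name")
        if name is None:
            continue  # skip entries that do not look like a power zone
        zones.append({
            "zone": zone.name,
            "name": name,
            # Energy counter in microjoules (may need root to read).
            "energy_uj": _to_int(_read_attr(zone / "energy_uj")),
            # First power limit of the zone, in microwatts.
            "power_limit_uw": _to_int(
                _read_attr(zone / "constraint_0_power_limit_uw")),
        })
    return zones


if __name__ == "__main__":
    for zone_info in read_powercap_zones():
        print(zone_info)
```

Taking the root directory as a parameter keeps the function usable against a copy of the sysfs tree, which is convenient when experimenting on a machine without RAPL.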

3.12: Improved timerless multitasking: allow the timekeeping CPU to go idle

Linux 3.10 added support for timerless multitasking, that is, the ability to run processes without needing to fire the timer interrupt that is traditionally used to implement multitasking. This support, however, had a caveat: it could turn off the timer interrupt on all CPUs except one, which is used to track timer information for the other CPUs. That CPU kept its timer running even when all CPUs were idle, which was pointless. This release allows the timer of the timekeeping CPU to be disabled when all CPUs are idle.


Performance Optimization

4.0: DAX – Direct Access, for persistent memory storage

DAX removes the extra copy incurred by the page cache by performing reads and writes directly to the persistent-memory storage device.

4.0: “lazytime” option for better update of file timestamps

Lazytime causes access, modification and change time updates to be made only in the cache. The times are written to disk only if the inode needs to be updated anyway for some non-time-related change, if fsync(), syncfs() or sync() is called, or just before an undeleted inode is evicted from memory. This is POSIX compliant while also improving performance.
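The flush conditions listed above can be captured in a small toy model. This is only a sketch of the bookkeeping, not kernel code: the class `LazytimeInode` and its methods are hypothetical names, timestamps are plain integers, and an "inode write" is just a counter.

```python
class LazytimeInode:
    """Toy model of an inode whose timestamp updates are kept in memory."""

    def __init__(self):
        self.mem_times = {"atime": 0, "mtime": 0, "ctime": 0}
        self.disk_times = dict(self.mem_times)
        self.time_dirty = False  # True when only the timestamps differ from disk
        self.disk_writes = 0     # number of inode writes issued so far

    def touch(self, field, now):
        """Timestamp-only update: change the cached copy and defer the write."""
        self.mem_times[field] = now
        self.time_dirty = True

    def _write_inode(self):
        """Write the whole inode, which carries the cached timestamps with it."""
        self.disk_times = dict(self.mem_times)
        self.time_dirty = False
        self.disk_writes += 1

    def change_metadata(self):
        """A non-time change (for example the size) forces an inode write."""
        self._write_inode()

    def sync(self):
        """Models fsync()/syncfs()/sync(): flush pending timestamps, if any."""
        if self.time_dirty:
            self._write_inode()

    def evict(self):
        """Evicting an undeleted inode flushes pending timestamps first."""
        self.sync()
```

In this model a hundred consecutive `touch("atime", ...)` calls cost no inode writes at all until `sync()`, `change_metadata()` or `evict()` is called, which is the saving the option is after.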

4.0: rcu: Optionally run grace-period kthreads at real-time priority

Recent testing has shown that under heavy load, running RCU’s grace-period kthreads at real-time priority can improve performance and reduce the incidence of RCU CPU stall warnings.

4.0: slub: optimize memory alloc/free fastpath by removing preemption on/off

4.0: memcontrol cgroup: a clearer model and improved workload performance

Introduce the basic control files to account, partition, and limit memory using cgroups in default hierarchy mode. The old interface will be maintained, but a clearer model and improved workload performance should encourage existing users to switch over to the new one eventually.

4.0: F2FS: Introduce a batched trim

3.17: perf timechart adds I/O mode

Currently, perf timechart records only scheduler and CPU events (task switches, running times, CPU power states, etc); this release adds I/O mode which makes it possible to record IO (disk, network) activity. In this mode perf timechart will generate SVG with I/O charts (writes, reads, tx, rx, polls).

3.16: cpufreq: stable frequency and cpuidle issue

Add support for intermediate (stable) frequencies for platforms that may temporarily switch to a stable frequency while transitioning between frequencies.

governor: Improve performance of latency-sensitive bursty workloads.

3.15: Faster erasing and zeroing of parts of a file

This release adds two new fallocate(2) mode flags:

  • FALLOC_FL_COLLAPSE_RANGE: Removes a range of a file without leaving a hole, improving the performance of operations that previously needed workarounds.

  • FALLOC_FL_ZERO_RANGE: Sets a range of a file to zero much faster than doing it manually (this functionality was previously available in XFS through the XFS_IOC_ZERO_RANGE ioctl).
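To make the difference between the two flags concrete, here is a minimal sketch that models their effect on a file's contents using plain byte strings. It does not call fallocate(2) at all; `collapse_range` and `zero_range` are hypothetical helper names, and the model ignores the real call's requirement that a collapsed range be aligned to the filesystem block size.

```python
def collapse_range(data: bytes, offset: int, length: int) -> bytes:
    """Model of FALLOC_FL_COLLAPSE_RANGE.

    The bytes in [offset, offset + length) are removed and the tail is
    shifted down, so the file shrinks by `length` bytes and no hole is left.
    """
    if offset < 0 or length <= 0 or offset + length >= len(data):
        # The real call also fails when the range reaches or passes
        # end of file; truncating is the right tool in that case.
        raise ValueError("range must lie strictly inside the file")
    return data[:offset] + data[offset + length:]


def zero_range(data: bytes, offset: int, length: int) -> bytes:
    """Model of FALLOC_FL_ZERO_RANGE (without FALLOC_FL_KEEP_SIZE).

    The bytes in [offset, offset + length) read back as zero. Data outside
    the range is untouched, and the file grows if the range extends past
    its current end.
    """
    if offset < 0 or length <= 0:
        raise ValueError("offset must be >= 0 and length > 0")
    end = offset + length
    padded = data.ljust(end, b"\0")  # extend with zeros if the range passes EOF
    return padded[:offset] + b"\0" * length + padded[end:]
```

For a 12-byte file `b"AAAABBBBCCCC"`, collapsing the middle four bytes yields an 8-byte file, while zeroing them keeps the size at 12 bytes and only replaces the `B`s with zero bytes.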

3.15: zram: LZ4 compression support, improved performance

Zram is a memory compression mechanism added in Linux 3.14 that is used in Android, CyanogenMod, Chrome OS, Lubuntu and other projects. In this release zram gains support for the LZ4 compression algorithm, which is better than the currently available LZO in some cases.

3.15: FUSE: improved write performance

FUSE can now use writeback caching, which improves write throughput.

3.15: Introduce cancelable MCS lock

An MCS lock is a simple spinlock with the desirable properties of being fair and of having each CPU that tries to acquire the lock spin on a local variable. It avoids the expensive cache-line bouncing that common test-and-set spinlock implementations incur.

3.15: Per-thread VMA caching

Each thread caches its most recently used VMAs to improve the VMA cache hit rate; for more details see the recommended LWN article.
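The idea can be sketched with a toy address space and a small per-thread cache placed in front of the slow lookup. Everything here is an assumption made for illustration: the four-slot cache size, the slot choice by page number, the 4 KiB page size, and the names `AddressSpace` and `Thread`. The model also returns only a VMA that contains the address, and it omits the cache invalidation that is needed when mappings change.

```python
import bisect

PAGE_SHIFT = 12   # 4 KiB pages (assumption)
CACHE_SLOTS = 4   # size of the per-thread cache (assumption)


class AddressSpace:
    """Shared set of VMAs; a sorted list stands in for the kernel's tree."""

    def __init__(self, vmas):
        # vmas: non-overlapping half-open (start, end) address ranges
        self.vmas = sorted(vmas)
        self.starts = [start for start, _ in self.vmas]
        self.slow_lookups = 0

    def find_vma_slow(self, addr):
        """Full lookup: binary search over all VMAs."""
        self.slow_lookups += 1
        i = bisect.bisect_right(self.starts, addr) - 1
        if i >= 0 and self.vmas[i][0] <= addr < self.vmas[i][1]:
            return self.vmas[i]
        return None


class Thread:
    """A thread with its own small cache of recently used VMAs."""

    def __init__(self, mm):
        self.mm = mm
        self.cache = [None] * CACHE_SLOTS
        self.hits = 0

    def find_vma(self, addr):
        # Fast path: check the few cached VMAs first.
        for vma in self.cache:
            if vma is not None and vma[0] <= addr < vma[1]:
                self.hits += 1
                return vma
        # Slow path: full lookup, then remember the result in one slot.
        vma = self.mm.find_vma_slow(addr)
        if vma is not None:
            slot = (addr >> PAGE_SHIFT) % CACHE_SLOTS
            self.cache[slot] = vma
        return vma
```

Because each thread owns its cache, two threads that work in different mappings of the same address space no longer evict each other's cached entry.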

3.15: Speed up resume

  • Faster resume from power suspend in systems with hard disk drives

  • Speed up resume by resuming runtime-suspended devices later during system suspend

  • Speed up resume by using asynchronous threads for resume_early, resume_noirq, suspend_late, suspend_noirq and acpi_thermal_check

  • tools/power turbostat: Run on Intel Broadwell

3.15: ext4/ext3: Speedup sync

In the patch author's test script, sync(1) takes around 6 minutes when two ext4 filesystems are mounted on a standard SATA drive. With this patch, sync takes a couple of seconds, an improvement of about two orders of magnitude.

3.14: Deadline scheduling class for better real-time scheduling

Deadline scheduling does away with the notion of process priorities. Instead, processes provide three parameters: runtime, period, and deadline. A SCHED_DEADLINE task is guaranteed to receive “runtime” microseconds of execution time every “period” microseconds, and these “runtime” microseconds are available within “deadline” microseconds from the beginning of the period. The task scheduler uses that information to run the process with the earliest deadline, a behavior closer to the requirements needed by real-time systems.
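The earliest-deadline-first rule described above can be tried out with a small single-CPU simulation. This is a sketch under simplifying assumptions, not the kernel's scheduler: time advances in abstract ticks, every task starts its first period at tick 0, and the admission check is a plain utilisation sum. The names `DlTask`, `admissible` and `simulate_edf` are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class DlTask:
    name: str
    runtime: int   # execution budget per period, in ticks
    deadline: int  # relative deadline from the start of each period
    period: int    # length of one period

    def __post_init__(self):
        if not (0 < self.runtime <= self.deadline <= self.period):
            raise ValueError("need 0 < runtime <= deadline <= period")


def admissible(tasks, cpus=1):
    """Simplified admission test: total utilisation must fit on the CPUs."""
    return sum(t.runtime / t.period for t in tasks) <= cpus


def simulate_edf(tasks, ticks):
    """Schedule `tasks` on one CPU for `ticks` ticks.

    Returns a list with the name of the task run at each tick, or None
    when the CPU is idle.
    """
    remaining = {t.name: 0 for t in tasks}
    abs_deadline = {t.name: 0 for t in tasks}
    timeline = []
    for now in range(ticks):
        for t in tasks:
            if now % t.period == 0:  # a new period starts: refill the budget
                remaining[t.name] = t.runtime
                abs_deadline[t.name] = now + t.deadline
        ready = [t for t in tasks if remaining[t.name] > 0]
        if not ready:
            timeline.append(None)
            continue
        # Earliest deadline first: run the ready task whose deadline is nearest.
        chosen = min(ready, key=lambda t: abs_deadline[t.name])
        remaining[chosen.name] -= 1
        timeline.append(chosen.name)
    return timeline
```

With two tasks such as `DlTask("A", 1, 4, 4)` and `DlTask("B", 2, 6, 6)`, the simulation gives each task its full budget in every period without any notion of priority, which is the guarantee the paragraph describes.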

3.14: scripts/analyze_suspend.py


Tool for suspend/resume performance analysis and optimization

3.14: futexes: Increase hash table size for better performance

3.13: fuse: Implement writepages callback, improving mmapped writeout

3.13: slab improvement

Changes have been made to the slab allocator to improve its memory usage and performance. kmem_caches consisting of objects of 128 bytes or less now hold one more object per slab, and a change to the management of free objects improves the locality of accesses, which improves performance in some microbenchmarks.

3.12: Improved tty layer locking

The tty layer locking got cleaned up and in the process a lot of locking became per-tty, which actually shows up on some odd loads.

3.12: New lockref locking scheme, VFS locking improvements

This release adds a new locking scheme, called “lockref”. The “lockref” structure is a combination “spinlock and reference count” that allows optimized reference count accesses. In particular, it guarantees that the reference count will be updated as if the spinlock were held but, by using atomic accesses that cover both the reference count and the spinlock words, it can often do the update without actually having to take the lock. This avoids the nastiest cases of spinlock contention on large machines. When updating the reference counts on a large system, it will still end up with the cache line bouncing around, but that’s much less noticeable than actually having to spin waiting for the lock. This release already uses lockref to improve the scalability of heavy pathname lookup in large systems.
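The fast path and slow path described above can be sketched in user space. This is only a model: Python has no compare-and-exchange instruction, so `_cmpxchg` emulates one with an internal mutex, and the word layout (lock bit at bit 32, count in the low 32 bits) is an assumption made for the example. The class name `Lockref` mirrors the kernel structure, but none of this is the kernel's code.

```python
import threading

LOCK_BIT = 1 << 32  # position of the lock bit in the combined word (assumption)


class Lockref:
    """Spinlock and reference count packed into one word."""

    def __init__(self):
        self._word = 0
        # Stands in for hardware atomicity only; it is not the lock
        # that callers contend on.
        self._atomic = threading.Lock()

    def _cmpxchg(self, old, new):
        """Emulated atomic compare-and-exchange on the combined word."""
        with self._atomic:
            if self._word != old:
                return False
            self._word = new
            return True

    @property
    def count(self):
        return self._word & (LOCK_BIT - 1)

    def spin_lock(self):
        while True:
            old = self._word & ~LOCK_BIT  # expected value: same count, unlocked
            if self._cmpxchg(old, old | LOCK_BIT):
                return

    def spin_unlock(self):
        with self._atomic:
            self._word &= ~LOCK_BIT

    def get(self):
        """Take a reference. Returns True if the lock-free path was used."""
        old = self._word
        while not old & LOCK_BIT:
            # Lock looks free: bump the count and re-check the lock bit
            # in a single atomic step.
            if self._cmpxchg(old, old + 1):
                return True
            old = self._word
        # Somebody holds the lock: fall back to taking it.
        self.spin_lock()
        with self._atomic:
            self._word += 1
        self.spin_unlock()
        return False
```

The key point is that the compare-and-exchange covers the lock bit and the count together, so an increment can only succeed while the lock is observed to be free.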

3.12: IPC locking improvements

This release reduces the amount of contention imposed on the ipc lock (kern_ipc_perm.lock). These changes mostly deal with shared memory; previous work had already been done for semaphores in 3.10 and message queues in 3.11. With these changes, a custom shm microbenchmark that stresses shmctl by doing IPC_STAT with 4 threads a million times runs in 50% less time. A similar run, this time with IPC_SET, reduces the execution time from 3 minutes and 35 seconds to 27 seconds.

3.11: Zswap: A compressed swap cache

Zswap is a lightweight, write-behind compressed cache for swap pages. It takes pages that are in the process of being swapped out and attempts to compress them into a dynamically allocated RAM-based memory pool. If this process is successful, the writeback to the swap device is deferred and, in many cases, avoided completely. This results in a significant I/O reduction and performance gains for systems that are swapping.
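The store path described above can be modelled in a few lines. This sketch makes several simplifying assumptions: zlib stands in for the kernel's compressor, a dictionary stands in for the swap device, and a page that does not fit in the pool goes straight to swap instead of older entries being written back to make room. `ZswapModel` and its methods are hypothetical names.

```python
import zlib

PAGE_SIZE = 4096


class ZswapModel:
    """Toy compressed cache that sits in front of a swap device."""

    def __init__(self, pool_limit):
        self.pool_limit = pool_limit  # max bytes of compressed data kept in RAM
        self.pool = {}                # swap offset -> compressed page
        self.pool_bytes = 0
        self.swap_device = {}         # swap offset -> page written to "disk"

    def store(self, offset, page):
        """Swap out one page. Returns True if the device write was avoided."""
        if len(page) != PAGE_SIZE:
            raise ValueError("page must be exactly PAGE_SIZE bytes")
        blob = zlib.compress(page)
        fits = self.pool_bytes + len(blob) <= self.pool_limit
        if len(blob) < PAGE_SIZE and fits:
            self.pool[offset] = blob
            self.pool_bytes += len(blob)
            return True
        # Poorly compressible page or full pool: fall back to the device.
        self.swap_device[offset] = page
        return False

    def load(self, offset):
        """Swap one page back in, from the pool if it is cached there."""
        if offset in self.pool:
            return zlib.decompress(self.pool[offset])
        return self.swap_device[offset]
```

A page of zeros compresses to a few dozen bytes and stays in the pool, while a page of random data does not shrink and ends up on the swap device, so only the second one costs any I/O.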

3.11: Add support for LZ4 compressed kernels

Add support for LZ4 decompression in the Linux Kernel. LZ4 Decompression APIs for kernel are based on LZ4 implementation by Yann Collet.

3.11: Kswapd and page reclaim behaviour

Kswapd and page reclaim behaviour has been screwy in one way or another for a long time. One example is reports of large copy operations or backups causing the machine to grind to a halt or applications to be pushed to swap. Sometimes in low-memory situations a large percentage of memory suddenly gets reclaimed. In other cases an application starts and kswapd hits 100% CPU usage for prolonged periods of time, and so on. This patch series aims at addressing some of the worst of these problems.


Reliability

4.0: kasan, kernel address sanitizer

Kernel Address Sanitizer (KASan) is a dynamic memory error detector. It provides a fast and comprehensive solution for finding use-after-free and out-of-bounds bugs. Linux already has the kmemcheck feature, but unlike kmemcheck, KASan uses compile-time instrumentation, which makes it significantly faster than kmemcheck.

4.0: GDB scripts for debugging the kernel

If you load vmlinux into gdb with the CONFIG_GDB_SCRIPTS option enabled, the helper scripts will be automatically imported by gdb as well, and additional functions are available to analyze a Linux kernel instance.

3.14: stackprotector: Introduce CONFIG_CC_STACKPROTECTOR_STRONG

“Strong” is a new mode introduced by this patch. With “Strong” the kernel is built with -fstack-protector-strong (available in gcc 4.9 and later). This option increases the coverage of the stack protector without the heavy performance hit of -fstack-protector-all.

3.12: Better Out-Of-Memory handling

The Out-Of-Memory state happens when the computer runs out of RAM and swap memory. When Linux gets into this state, it kills a process in order to free memory. This release includes important changes to how Out-Of-Memory states are handled, to the number of out-of-memory errors sent to userspace, and to reliability.


Security

4.0: Live patching: a feature for live patching the kernel code

This release introduces “livepatch”, a feature for live patching the kernel code, aimed primarily at systems that want to get security updates without needing to reboot. This feature was born as a result of merging kgraft and kpatch, two attempts by SUSE and Red Hat that were started to replace the now proprietary ksplice. It’s relatively simple and minimalistic, as it makes use of existing kernel infrastructure (namely ftrace) as much as possible. It’s also self-contained and doesn’t hook itself into any other kernel subsystems.

4.0: Add security hooks to the Android Binder

Add security hooks to the Android Binder that enable security modules such as SELinux to implement controls over Binder IPC. The security hooks include support for controlling which process can become the Binder context manager, invoke a binder transaction/IPC to another process, transfer a binder reference to another process, or transfer an open file to another process. These hooks have been included in the Android kernel trees since Android 4.3.

