Sunday, February 24, 2013

LINUX Performance Tuning

Presenter's blog
dtrace.org/blogs/brendan

JOYENT
@brendangregg

System metrics
-- iostat

uptime -- load averages: 1 / 5 / 15 minutes


Only a clue; don't stress over load averages, use other tools
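As a quick aside, the load averages uptime reports come straight from the kernel; this sketch reads the same source (/proc/loadavg on Linux) without running uptime at all:

```shell
# The first three fields of /proc/loadavg are the 1/5/15-minute load averages
read one five fifteen _ _ < /proc/loadavg
echo "load averages: 1min=$one 5min=$five 15min=$fifteen"
```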


TOP

-- top itself can consume CPU
-- does top show all CPU consumers?
-- it can miss short-lived processes, and may not list kernel threads unless included

-- a process has high CPU: what are the next steps?
-- identify why: profile the code path
-- identify what: execution or stall cycles


htop - super top

MPSTAT --- check for hot threads on multi-core/multi-processor systems

iostat -- disk I/O


vmstat - virtual memory statistics (dates back to the BSD days)
-- vmstat 1
- first line shows summary-since-boot values

free
- buffers/cached: memory used by the filesystem cache

ping
-- the network might treat ICMP packets differently

hping - can test using TCP instead of ICMP

nicstat - network statistics tool
-- nicstat -z 1

The presenter created this tool; Tim Cook ported and enhanced it on Linux

dstat - a better vmstat-style tool (adds coloring), FWIW
-- dstat 1



=== INTERMEDIATE TOOLS ===

-- SAR
- pretty good on Linux, but has a reputation for being buggy
- configure it to archive stats from cron
- -d: block device stats
- -q: run queue statistics

-- NETSTAT
-- netstat -s
- active connection openings
- passive connection openings
- failed connection attempts
- calculate the retransmission rate from these counters
- retransmission rates can be high for mobile clients
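The retransmission-rate calculation above can be sketched as below; the counter values are invented for illustration (on a live system they would be parsed from the segments-sent and segments-retransmitted lines of `netstat -s`):

```shell
# Hypothetical counters, standing in for values parsed from `netstat -s`
segs_out=100000     # total TCP segments sent
retrans=1500        # TCP segments retransmitted
# Retransmit rate = retransmitted / sent, as a percentage
rate=$(awk -v o="$segs_out" -v r="$retrans" 'BEGIN { printf "%.2f", r/o*100 }')
echo "TCP retransmit rate: ${rate}%"
```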

-- PIDSTAT
- Very useful process breakdowns:
-- pidstat 1

-- pidstat -d 1
- shows disk i/o

-- STRACE
- system call tracer (attach to a process, or run a command)
-- strace -tttT -p 12670
- traces resource I/O
- memory-mapped I/O can't be seen via strace, as there is no syscall

-- strace -c dd if=/dev/zero of=/dev/null bs=512 count=1024k
- worst case: syscall-heavy workloads slow dramatically under strace

-- time strace -c dd if=/dev/zero of=/dev/null bs=512 count=1024k


-- TCPDUMP
-- tcpdump -i eth4 -w /tmp/out.tcpdump
- has overhead in terms of CPU and storage
- should use socket ring buffers
- use filter expressions to reduce overhead
- can still be problematic for busy interfaces
- tricky for 10 GbE

BLKTRACE - block device I/O event tracing
-- btrace /dev/sdb


IOTOP - for disk I/O ***author's preferred***
-- iotop -bod5
- kbytes is not a universal currency; time is.


SLABTOP
-- kernel slab allocator usage
-- slabtop -sc
- shows where kernel memory is consumed (a tool used by kernel developers)
- cache names give a good hint of why the memory is in use

SYSCTL - static performance tuning: check the config of the system (even without a workload)
-- sysctl -a
- check this even with no load; there might be a setting you would otherwise overlook
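As one concrete static check: sysctl values are backed by /proc/sys, so any single tunable can be read either way. The tunable chosen here (net.core.somaxconn, the listen() backlog cap) is just an example of a default that is easy to overlook:

```shell
# Read one tunable the same way `sysctl net.core.somaxconn` would
somaxconn=$(cat /proc/sys/net/core/somaxconn)
echo "net.core.somaxconn = $somaxconn"
```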


=== ADVANCED TOOLS ===

* PERF
- began as Performance Counters for Linux (PCL), focusing on CPU performance counters (programmable registers)
- now a collection of profiling and tracing tools, with numerous subcommands, including:

kmem - trace/measure kernel memory
kvm - trace KVM guest OS
list - list available events (targets of instrumentation)

- key performance counter summary:
# perf stat gzip file1
- low IPC (<0.2) means stall cycles (likely memory); look for ways to reduce memory I/O and improve locality (NUMA)
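The IPC arithmetic behind that threshold is just instructions divided by cycles; the counter values below are invented to land right at the <0.2 stall-cycle boundary mentioned above (on a real system they come from `perf stat` output):

```shell
# Hypothetical counter values, standing in for `perf stat` output
instructions=1200000000
cycles=6000000000
# IPC = instructions per cycle
ipc=$(awk -v i="$instructions" -v c="$cycles" 'BEGIN { printf "%.2f", i/c }')
echo "IPC = $ipc"   # at or below 0.2, suspect memory stall cycles
```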

# perf list | grep Hardware

# perf stat -e instructions,cycles,L1-dcache

(see the BIOS and Kernel Developer's Guide for counter details)

(perf also profiles CPU activity)
- profiling (sampling):
# perf record -a -g -F 997 sleep 10

- reporting in non-interactive mode:
# perf report --stdio
- profile visualization: flame graphs
- generates an .svg file, interactive in the browser
- (important for a performance engineer: quickly exonerate the areas that are not the culprit)
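The flame graph steps were not written out in the notes; a plausible pipeline, assuming the FlameGraph scripts (github.com/brendangregg/FlameGraph) are cloned into ./FlameGraph, would be:

```shell
# Guarded sketch: only runs if perf and a FlameGraph checkout are present
if command -v perf >/dev/null 2>&1 && [ -d FlameGraph ]; then
    perf record -a -g -F 997 sleep 10                     # sample all CPUs at 997 Hz
    perf script > out.perf                                # dump stacks as text
    ./FlameGraph/stackcollapse-perf.pl out.perf > out.folded
    ./FlameGraph/flamegraph.pl out.folded > flame.svg     # open in a browser
    result="flame.svg written"
else
    result="skipped: needs perf and a FlameGraph checkout"
fi
echo "$result"
```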


- Static Tracing
# perf list | grep block:
- shows block I/O tracepoints
- (slide shows where each tracepoint instruments the kernel)

- Dynamic Tracing
# perf probe --add='tcp_sendmsg'

# perf record -e probe:tcp_sendmsg -aR -g sleep 5

# perf report --stdio

- traces call stacks from arbitrary kernel locations

= fills in kernel observability gaps
- awesome capability
- takes some effort to use

** DTRACE **
- programmable, real-time, dynamic and static tracing
- perf analysis and troubleshooting without restarting anything
- used on Solaris, illumos/SmartOS, Mac OS X, FreeBSD
- two ports in development for Linux:
1. dtrace4linux - Paul Fox (Ubuntu, Fedora, CentOS)
2. Oracle Enterprise Linux DTrace
- steady progress

- dtrace4linux version:
- github.com/dtrace4linux/dtrace
-- make load

- warning: still a prototype, can panic the system

- programming capabilities
# dtrace -n 'fbt::tcp_sendmsg:entry /execname == "sshd"/ { @["bytes"] = quantize(arg3); }'

- Multiple GUIs use DTrace for real-time statistics.
E.g., Joyent Cloud Analytics, showing real-time cloud-wide syscall latency:

- advanced capabilities, but not difficult to use:
- use one-liners (google "DTrace one-liners")
- use scripts (DTraceToolkit; DTrace book; google)
- tweak one-liners or scripts a little
- or ask someone else to write them

EXAMPLES
fbt::vfs_read:entry
{ self->start = timestamp; }

CLI syntax: dtrace.org/guide

- Providers: io, cpc, vminfo, ip, fbt (function boundary tracing: kernel dynamic tracing)

- the presenter has DTrace scripts for this on GitHub
- ext4slower.d 10 (lists processes with I/O latency greater than the given threshold); clearly exonerates or blames the filesystem (ext4)

Show me:
- TCP retransmits (trace only the retransmits, not a full packet dump)
- tcpretransmit.d (on GitHub soon)


illumos/SmartOS (the ongoing fork of the OpenSolaris kernel)

IOSNOOP - shows disk I/O (Apple ships it with OS X)


SYSTEMTAP
- created when there were no DTrace ports for Linux
- the presenter avoids it on Linux production systems due to reliability concerns



METHODOLOGIES
- four selected:

- picking observability tools and seeing what they show
- it's inefficient (an anti-method)

- Workload Characterization Method
- who, what, why, and when: identifies issues of load
- the best performance wins come from eliminating unnecessary load

- Drill-Down Analysis
- start from the top, work downward, and find the root cause

- USE Method
- for every resource, check:
- Utilization
- Saturation
- Errors

- also applies to software resources: mutex locks, thread pools
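One quick USE-style spot check that needs no extra tools: on Linux, the fourth field of /proc/loadavg is the "runnable/total" task count, a rough CPU-saturation indicator (this reading is my own illustration, not from the talk):

```shell
# Field 4 of /proc/loadavg is "running/total", e.g. "1/123"
read _ _ _ procs _ < /proc/loadavg
running=${procs%/*}    # currently runnable tasks
total=${procs#*/}      # total tasks on the system
echo "runnable tasks: $running of $total"
```

Sustained runnable counts well above the CPU count suggest CPU saturation, which the run-queue columns of vmstat and sar -q show in more detail.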
