Tuning Linux for your database Colin Charles,Team MariaDB, MariaDB Corporation colin@mariadb.com / byte@bytebot.net http://bytebot.net/blog/ | @bytebot on Twitter OSDC, Berlin, Germany 28 April 2016 1
whoami • Work on MariaDB at MariaDB Corporation (SkySQL Ab) • Merged with Monty Program Ab, makers of MariaDB • Formerly MySQL AB (exit: Sun Microsystems) • Past lives include Fedora Project (FESCO), OpenOffice.org • MySQL Community Contributor of theYear Award winner 2014 2
Agenda • Hardware or cloud • I/O • Filesystems • Memory • CPU • Network • Resources 3
Operating System • Focus of this talk & tools: Linux • But distributions make a difference too • like RHEL6 has tuned but Ubuntu doesn’t • Distribution versions can make a difference too • RHEL5: ktune; RHEL6: tuned; RHEL7: tuned throughput-performance 4
Servers • When you buy servers... do you think in terms of CPU, then memory, then I/O? • Database servers clearly love I/O more than memory more than CPU • queries! scans of large tables? index scans with random I/O? BI? • Durability requires each transaction commit to be flushed to disk - i.e. call fsync() 5
I/O • What kind of storage do you use? • SAS or SATA? • Ethernet (NAS, DRBD) • SSD • Fusion IO (NVMFS) 6
Schedulers (I/O Elevators) • cfq: default, great for slower storage (SATA) • noop: low latency storage (SSD) • deadline: multi-process apps • Most database workloads benefit from the deadline scheduler (goal is after all to minimise seeks, prioritise process I/O) 7
Elevators continued • Boot time: elevator=deadline •echo “deadline” > /sys/class/ block/sdaN/queue/scheduler • http://dimitrik.free.fr/blog/archives/2012/01/ mysql-performance-linux-io.html • http://www.mysqlperformanceblog.com/ 2009/01/30/linux-schedulers-in-tpcc-like- benchmark/ 8
Sample elevator results 9 1 thread tps 1 thread ioutil% 8 threads tps 8 threads ioutil% deadline* 15000 <1% 19100 <1% deadline 6850 32% 15250 45% cfq 5880 45% 1240 94%
File systems • Choices: ext4, XFS • mount options: • ext4: rw, nosuid, noatime, data=ordered, nobarrier • xfs: nobarrier • Turn off I/O barriers • http://lwn.net/Articles/ 283161/ • Separate files (data/logs/ undo/etc.) or not? • Don’t forget flash backed write caches or battery backed write cache on RAID cards (way more fsync/s) • iostat -dmx <interval> is your friend 10
SSD & Flash • SSD • much more IOPS than traditional disks • lower latency • make sure you have a new RAID controller • Flash • more IOPS than SSD • very low latency • InnoDB doublewrite buffer -- MariaDB 10.0, Percona Server 5.6 (Fusion IO), MySQL 5.7 11
RAID • RAID0 - fast writes, reads, no redundancy • RAID1 - slower writes, fast reads • RAID5 - slow for random writes, fast for sequential writes, fast reads too but slow recovery • RAID10 - fast reads & writes 12
LVM • Why? • easily expand disk • snapshot backups • Why not? • 2-3% performance penalty • Snapshot penalties 13
Memory • The more RAM the better -> reduces I/O requirements • Use ECC RAM • NUMA - “swap insanity” • http://blog.jcole.us/2010/09/28/mysql-swap-insanity-and-the- numa-architecture/ •numactl --interleave=all mysqld & • Twitter patch/Percona Server 5.6 • numa_interleave option • innodb_buffer_pool_populate - NUMA decisions to be made when buffer cache is clean 14
Why does MySQL need memory? • Per-session buffers • sort buffer, temp tables • Metadata/locking • index statistics, table definitions • Caching data • decrease reads: faster response time • decrease random writes: queues write in cache; perform more sequential writes 15
Where is memory used? • filesystem cache • binary / relay logs • MyISAM data relies on filesystem cache • MySQL cache: • innodb_buffer_pool_size (pages), MyISAM’s key_buffer_size (indexes) • Per session buffers •sort_buffer_size, join_buffer_size, read_buffer_size, tmp_table_size •free -m / SHOW ENGINE INNODB STATUS 16
Transparent Huge pages • 2M (4M even) pages vs 4K standard Linux page •vm.nr_hugepages=8192 • MySQL can benefit from transparent huge pages, MongoDB doesn’t (neither does Redis) 17
Swappiness • Controls how aggressively the system reclaims “mapped” memory • default: 60, vm.swappiness=0 is ok (but not for newer kernels!) • decrease: aggressive reclaiming of unmapped pagecache memory • increase: aggressive swapping of mapped memory • 3.5 kernels change behaviour (and this was backported to RHEL 2.6.32-303) 18
CPU • default is ondemand, you should turn it to performance generally •echo "performance" > /sys/devices/ system/cpu/cpu0/cpufreq/ scaling_governor • disable powersave mode • cpufrequtils is where it’s at • look at BIOS: turbo mode/C-state/OS power control 19
Network • Tools: tc, dropwatch • Use gigabit Ethernet (more bandwidth/ throughput, less latency) • Can also trunk/bond for HA • net.ipv4.tcp_no_metrics_save=1 20
Network II • ARP filter - prevents ARP flux •echo 1 > /proc/sys/net/ipv4/ conf/all/arp_filter • MTU size at 9,000 - “jumbo frames” • sar -n DEV 21
Network III • allow more connections to queue up: echo 4096 > / proc/sys/net/ipv4/tcp_max_syn_backlog • /etc/sysctl.conf - net.ipv4.tcp_max_syn_backlog = 4096 • increase kernel packet buffer: net.core.netdev_max_backlog = 4096 • increase socket connection wait queue: net.core.somaxconn = 4096 • Reduce TCP timeout that comes after closing socket (default 60s): net.ipv4.tcp_fin_timeout = 30 22
Network IV • Fully synchronous replication, ala, Galera Cluster? (NDBCluster too) • Fastest as your slowest node • evs.send_window + evs.user_send_window (something large e.g. 512) 23
Resource Limits • ulimits / /etc/security/limits.conf - they need to be high/sensible over the defaults • -f (file size): unlimited • -t (cpu time): unlimited • -v (virtmem): unlimited • -n (open files): 64000 • -m (memsize): unlimited • -u (processes/threads): 32000 24
Tune that database • Focus on database design - find slow, crappy queries and fix them (eg. mysql: pt- query-digest) • Reduce locking or time waiting - always benchmark your application continually • Don’t forget /etc/my.cnf (and appropriate friends) 25
KVM/Xen/Virtualization • Most of your raw database performance comes from being hardware-centric • KVM at device level you can turn off caching • Native AIO used over threaded nowadays • VMWare: http://kb.vmware.com/selfservice/ microsites/search.do? language=en_US&cmd=displayKC&externalI d=1008542 26
Containers • Kubernetes - vitess.io rolls out w/o issue, CoreOS, Docker • Remember when benchmarking, you have to look at local vs remote benchmarking (networking can add overhead) • 1 thread (up to 25% difference), but average seems to be 15% loss for r/w workloads; remotely? 32% is possible 27
EC2 • EC2 - instance limitations • EBS - unpredictable I/O performance, use RAID10 or RAID0 • RDS - similar performance between EBS RAID10 MySQL & RDS • choice of MySQL 5.7 (and older) or MariaDB Server 10.0 or PostgreSQL 28
Why MariaDB/Percona Server • Threadpool • NUMA interleaving • Numerous performance fixes in XtraDB like fast InnoDB restarts, parallel doublewrite buffer, etc. • Can run (supported) MariaDB Galera Cluster/Percona XtraDB Cluster 29
Hardware overall • Test everything • Sometimes NIC’s are a bad batch or the driver is buggy (RHEL5 + some Broadcom NICs drop packets under high load - bnx2 before 2.6.33) • Sometimes you get a bad batch of disk 30
Hadoop/Hbase • running this in EC2 is problematic • make sure name resolution works • enable bonding • networking: hadoop uses big buffers (64MB blocks by default) • net.ipv4.tcp_rmem = 32768 436600 4194304 • net.ipv4.tcp_wmem = 32768 436600 4194304 • http://docs.hortonworks.com/HDPDocuments/HDP2/ HDP-2.1.3/bk_cluster-planning-guide/content/ch_hardware- recommendations.html 31
Other databases • MongoDB: http://docs.mongodb.org/manual/ administration/production-notes/ • PostgreSQL: http://wiki.postgresql.org/wiki/ Tuning_Your_PostgreSQL_Server • Neo4j: http://docs.neo4j.org/chunked/stable/ configuration-linux-notes.html • Riak: http://docs.basho.com/riak/1.4.0/ cookbooks/Linux-Performance-Tuning/ 32
SELinux, firewalls, security • The easy way is to “disable SELinux” • The reality is you should learn it • Its just like a firewall that you really should configure 33
Tools • ps • vmstat 1 • iostat • top • free • sar (sar -n DEV 1) • gdb • tcpdump • strace • oprofile • htop • MySQL only: Percona Toolkit (pt-diskstats, pt- summary, pt-query-digest, etc.) 34
Benchmarking • sysbench • OLTP test, use tables with 20M rows and 20M transactions, check 1-128 threads/run • LinkBench • Yahoo! Cloud Serving Benchmark • https://github.com/ brianfrankcooper/ YCSB • Google’s PerfKit Benchmarker • https://github.com/ GoogleCloudPlatfor m/ PerfKitBenchmarke r 35
Books 36
Resources • http://doc.opensuse.org/products/draft/SLES/ SLES-tuning_sd_draft/book.sle.tuning.html • https://access.redhat.com/site/documentation/ en-US/Red_Hat_Enterprise_Linux/6/html-single/ Performance_Tuning_Guide/index.html • https://access.redhat.com/site/documentation/ en-US/Red_Hat_Enterprise_Linux/6/html-single/ Virtualization_Tuning_and_Optimization_Guide /index.html 37
A word from your sponsors • MariaDB Corporation sponsored my T&E to come to OSDC • Remember you can buy a MariaDB Enterprise subscription to help you with your cloud deployments, MariaDB Server usage, etc. • Services include: 24x7 support, consulting, training, remote DBA services, non-recurring engineering and more for all variants of MySQL
Thanks/Q&A Colin Charles, colin@mariadb.com | byte@bytebot.net http://bytebot.net/blog/ | @bytebot slides: slideshare.net/bytebot/ Download MariaDB and give it a try: http://mariadb.org/ 39

OSDC 2016 - Tuning Linux for your Database by Colin Charles

  • 1.
    Tuning Linux foryour database Colin Charles,Team MariaDB, MariaDB Corporation colin@mariadb.com / byte@bytebot.net http://bytebot.net/blog/ | @bytebot on Twitter OSDC, Berlin, Germany 28 April 2016 1
  • 2.
    whoami • Work onMariaDB at MariaDB Corporation (SkySQL Ab) • Merged with Monty Program Ab, makers of MariaDB • Formerly MySQL AB (exit: Sun Microsystems) • Past lives include Fedora Project (FESCO), OpenOffice.org • MySQL Community Contributor of theYear Award winner 2014 2
  • 3.
    Agenda • Hardware orcloud • I/O • Filesystems • Memory • CPU • Network • Resources 3
  • 4.
    Operating System • Focusof this talk & tools: Linux • But distributions make a difference too • like RHEL6 has tuned but Ubuntu doesn’t • Distribution versions can make a difference too • RHEL5: ktune; RHEL6: tuned; RHEL7: tuned throughput-performance 4
  • 5.
    Servers • When youbuy servers... do you think in terms of CPU, then memory, then I/O? • Database servers clearly love I/O more than memory more than CPU • queries! scans of large tables? index scans with random I/O? BI? • Durability requires each transaction commit to be flushed to disk - i.e. call fsync() 5
  • 6.
    I/O • What kindof storage do you use? • SAS or SATA? • Ethernet (NAS, DRBD) • SSD • Fusion IO (NVMFS) 6
  • 7.
    Schedulers (I/O Elevators) • cfq:default, great for slower storage (SATA) • noop: low latency storage (SSD) • deadline: multi-process apps • Most database workloads benefit from the deadline scheduler (goal is after all to minimise seeks, prioritise process I/O) 7
  • 8.
    Elevators continued • Boottime: elevator=deadline •echo “deadline” > /sys/class/ block/sdaN/queue/scheduler • http://dimitrik.free.fr/blog/archives/2012/01/ mysql-performance-linux-io.html • http://www.mysqlperformanceblog.com/ 2009/01/30/linux-schedulers-in-tpcc-like- benchmark/ 8
  • 9.
    Sample elevator results 9 1thread tps 1 thread ioutil% 8 threads tps 8 threads ioutil% deadline* 15000 <1% 19100 <1% deadline 6850 32% 15250 45% cfq 5880 45% 1240 94%
  • 10.
    File systems • Choices:ext4, XFS • mount options: • ext4: rw, nosuid, noatime, data=ordered, nobarrier • xfs: nobarrier • Turn off I/O barriers • http://lwn.net/Articles/ 283161/ • Separate files (data/logs/ undo/etc.) or not? • Don’t forget flash backed write caches or battery backed write cache on RAID cards (way more fsync/s) • iostat -dmx <interval> is your friend 10
  • 11.
    SSD & Flash •SSD • much more IOPS than traditional disks • lower latency • make sure you have a new RAID controller • Flash • more IOPS than SSD • very low latency • InnoDB doublewrite buffer -- MariaDB 10.0, Percona Server 5.6 (Fusion IO), MySQL 5.7 11
  • 12.
    RAID • RAID0 -fast writes, reads, no redundancy • RAID1 - slower writes, fast reads • RAID5 - slow for random writes, fast for sequential writes, fast reads too but slow recovery • RAID10 - fast reads & writes 12
  • 13.
    LVM • Why? • easilyexpand disk • snapshot backups • Why not? • 2-3% performance penalty • Snapshot penalties 13
  • 14.
    Memory • The moreRAM the better -> reduces I/O requirements • Use ECC RAM • NUMA - “swap insanity” • http://blog.jcole.us/2010/09/28/mysql-swap-insanity-and-the- numa-architecture/ •numactl --interleave=all mysqld & • Twitter patch/Percona Server 5.6 • numa_interleave option • innodb_buffer_pool_populate - NUMA decisions to be made when buffer cache is clean 14
  • 15.
    Why does MySQLneed memory? • Per-session buffers • sort buffer, temp tables • Metadata/locking • index statistics, table definitions • Caching data • decrease reads: faster response time • decrease random writes: queues write in cache; perform more sequential writes 15
  • 16.
    Where is memory used? •filesystem cache • binary / relay logs • MyISAM data relies on filesystem cache • MySQL cache: • innodb_buffer_pool_size (pages), MyISAM’s key_buffer_size (indexes) • Per session buffers •sort_buffer_size, join_buffer_size, read_buffer_size, tmp_table_size •free -m / SHOW ENGINE INNODB STATUS 16
  • 17.
    Transparent Huge pages •2M (4M even) pages vs 4K standard Linux page •vm.nr_hugepages=8192 • MySQL can benefit from transparent huge pages, MongoDB doesn’t (neither does Redis) 17
  • 18.
    Swappiness • Controls howaggressively the system reclaims “mapped” memory • default: 60, vm.swappiness=0 is ok (but not for newer kernels!) • decrease: aggressive reclaiming of unmapped pagecache memory • increase: aggressive swapping of mapped memory • 3.5 kernels change behaviour (and this was backported to RHEL 2.6.32-303) 18
  • 19.
    CPU • default isondemand, you should turn it to performance generally •echo "performance" > /sys/devices/ system/cpu/cpu0/cpufreq/ scaling_governor • disable powersave mode • cpufrequtils is where it’s at • look at BIOS: turbo mode/C-state/OS power control 19
  • 20.
    Network • Tools: tc,dropwatch • Use gigabit Ethernet (more bandwidth/ throughput, less latency) • Can also trunk/bond for HA • net.ipv4.tcp_no_metrics_save=1 20
  • 21.
    Network II • ARPfilter - prevents ARP flux •echo 1 > /proc/sys/net/ipv4/ conf/all/arp_filter • MTU size at 9,000 - “jumbo frames” • sar -n DEV 21
  • 22.
    Network III • allowmore connections to queue up: echo 4096 > / proc/sys/net/ipv4/tcp_max_syn_backlog • /etc/sysctl.conf - net.ipv4.tcp_max_syn_backlog = 4096 • increase kernel packet buffer: net.core.netdev_max_backlog = 4096 • increase socket connection wait queue: net.core.somaxconn = 4096 • Reduce TCP timeout that comes after closing socket (default 60s): net.ipv4.tcp_fin_timeout = 30 22
  • 23.
    Network IV • Fullysynchronous replication, ala, Galera Cluster? (NDBCluster too) • Fastest as your slowest node • evs.send_window + evs.user_send_window (something large e.g. 512) 23
  • 24.
    Resource Limits • ulimits/ /etc/security/limits.conf - they need to be high/sensible over the defaults • -f (file size): unlimited • -t (cpu time): unlimited • -v (virtmem): unlimited • -n (open files): 64000 • -m (memsize): unlimited • -u (processes/threads): 32000 24
  • 25.
    Tune that database •Focus on database design - find slow, crappy queries and fix them (eg. mysql: pt- query-digest) • Reduce locking or time waiting - always benchmark your application continually • Don’t forget /etc/my.cnf (and appropriate friends) 25
  • 26.
    KVM/Xen/Virtualization • Most ofyour raw database performance comes from being hardware-centric • KVM at device level you can turn off caching • Native AIO used over threaded nowadays • VMWare: http://kb.vmware.com/selfservice/ microsites/search.do? language=en_US&cmd=displayKC&externalI d=1008542 26
  • 27.
    Containers • Kubernetes -vitess.io rolls out w/o issue, CoreOS, Docker • Remember when benchmarking, you have to look at local vs remote benchmarking (networking can add overhead) • 1 thread (up to 25% difference), but average seems to be 15% loss for r/w workloads; remotely? 32% is possible 27
  • 28.
    EC2 • EC2 -instance limitations • EBS - unpredictable I/O performance, use RAID10 or RAID0 • RDS - similar performance between EBS RAID10 MySQL & RDS • choice of MySQL 5.7 (and older) or MariaDB Server 10.0 or PostgreSQL 28
  • 29.
    Why MariaDB/Percona Server • Threadpool •NUMA interleaving • Numerous performance fixes in XtraDB like fast InnoDB restarts, parallel doublewrite buffer, etc. • Can run (supported) MariaDB Galera Cluster/Percona XtraDB Cluster 29
  • 30.
    Hardware overall • Testeverything • Sometimes NIC’s are a bad batch or the driver is buggy (RHEL5 + some Broadcom NICs drop packets under high load - bnx2 before 2.6.33) • Sometimes you get a bad batch of disk 30
  • 31.
    Hadoop/Hbase • running thisin EC2 is problematic • make sure name resolution works • enable bonding • networking: hadoop uses big buffers (64MB blocks by default) • net.ipv4.tcp_rmem = 32768 436600 4194304 • net.ipv4.tcp_wmem = 32768 436600 4194304 • http://docs.hortonworks.com/HDPDocuments/HDP2/ HDP-2.1.3/bk_cluster-planning-guide/content/ch_hardware- recommendations.html 31
  • 32.
    Other databases • MongoDB:http://docs.mongodb.org/manual/ administration/production-notes/ • PostgreSQL: http://wiki.postgresql.org/wiki/ Tuning_Your_PostgreSQL_Server • Neo4j: http://docs.neo4j.org/chunked/stable/ configuration-linux-notes.html • Riak: http://docs.basho.com/riak/1.4.0/ cookbooks/Linux-Performance-Tuning/ 32
  • 33.
    SELinux, firewalls, security • Theeasy way is to “disable SELinux” • The reality is you should learn it • Its just like a firewall that you really should configure 33
  • 34.
    Tools • ps • vmstat1 • iostat • top • free • sar (sar -n DEV 1) • gdb • tcpdump • strace • oprofile • htop • MySQL only: Percona Toolkit (pt-diskstats, pt- summary, pt-query-digest, etc.) 34
  • 35.
    Benchmarking • sysbench • OLTPtest, use tables with 20M rows and 20M transactions, check 1-128 threads/run • LinkBench • Yahoo! Cloud Serving Benchmark • https://github.com/ brianfrankcooper/ YCSB • Google’s PerfKit Benchmarker • https://github.com/ GoogleCloudPlatfor m/ PerfKitBenchmarke r 35
  • 36.
  • 37.
  • 38.
    A word fromyour sponsors • MariaDB Corporation sponsored my T&E to come to OSDC • Remember you can buy a MariaDB Enterprise subscription to help you with your cloud deployments, MariaDB Server usage, etc. • Services include: 24x7 support, consulting, training, remote DBA services, non-recurring engineering and more for all variants of MySQL
  • 39.
    Thanks/Q&A Colin Charles, colin@mariadb.com| byte@bytebot.net http://bytebot.net/blog/ | @bytebot slides: slideshare.net/bytebot/ Download MariaDB and give it a try: http://mariadb.org/ 39