[00:00:55] (03CR) 10Bstorm: ">" [puppet] - 10https://gerrit.wikimedia.org/r/415384 (https://phabricator.wikimedia.org/T181650) (owner: 10Bstorm) [00:36:21] PROBLEM - HHVM rendering on mw2113 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [00:37:20] RECOVERY - HHVM rendering on mw2113 is OK: HTTP OK: HTTP/1.1 200 OK - 74555 bytes in 0.468 second response time [00:41:35] 10Operations, 10DNS, 10Release-Engineering-Team, 10Traffic, 10WMF-Communications: Move Foundation Wiki to new URL when new Wikimedia Foundation website launches - https://phabricator.wikimedia.org/T188776#4020038 (10greg) [00:59:48] 10Operations, 10DBA, 10MediaWiki-General-or-Unknown, 10MW-1.31-release-notes (WMF-deploy-2018-02-20 (1.31.0-wmf.22)), and 2 others: Regularly purge expired temporary userrights from DB tables - https://phabricator.wikimedia.org/T176754#4020080 (10greg) Sorry, I still haven't found someone to do this deploy... [01:04:41] (03PS1) 10Chad: WIP: Add git::config{} for calling `git config` on repositories. [puppet] - 10https://gerrit.wikimedia.org/r/416200 [01:05:11] (03PS1) 10Chad: Abstract Gerrit's public key out of gerrit::jetty [puppet] - 10https://gerrit.wikimedia.org/r/416201 [01:05:17] (03CR) 10jerkins-bot: [V: 04-1] WIP: Add git::config{} for calling `git config` on repositories. [puppet] - 10https://gerrit.wikimedia.org/r/416200 (owner: 10Chad) [01:09:21] 10Operations, 10DBA, 10MediaWiki-General-or-Unknown, 10MW-1.31-release-notes (WMF-deploy-2018-02-20 (1.31.0-wmf.22)), and 2 others: Regularly purge expired temporary userrights from DB tables - https://phabricator.wikimedia.org/T176754#3635603 (10Reedy) I'm guessing this is going to be mostly a noop... en... [01:13:40] (03PS2) 10Chad: WIP: Add git::config{} for calling `git config` on repositories. [puppet] - 10https://gerrit.wikimedia.org/r/416200 [01:16:28] 10Operations, 10DBA, 10MediaWiki-General-or-Unknown, 10MW-1.31-release-notes (WMF-deploy-2018-02-20 (1.31.0-wmf.22)), and 2 others: Regularly purge expired temporary userrights from DB tables - https://phabricator.wikimedia.org/T176754#4020127 (10EddieGP) >>! In T176754#4020097, @Reedy wrote: > I'm guessin... [01:21:33] !log labnodepool1001:~# service nodepool stop [01:22:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:26:30] PROBLEM - Check systemd state on labnodepool1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
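The nodepool stop !logged at 01:21 is immediately followed by the "Check systemd state ... degraded" alert above, and a "nodepoold running ... 0 processes" alert follows. A minimal sketch of how one might inspect that state by hand on labnodepool1001 — plain systemd/procps commands, which the Icinga checks are assumed to roughly mirror:

    # "degraded" means at least one systemd unit is in the failed state
    systemctl is-system-running        # prints running / degraded / ...
    systemctl --failed --no-legend     # shows which unit(s) caused the degraded state
    # the process check looks for the daemon itself, roughly:
    pgrep -u nodepool -f '/usr/bin/nodepoold -d'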
[01:27:11] PROBLEM - nodepoold running on labnodepool1001 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 113 (nodepool), regex args ^/usr/bin/python /usr/bin/nodepoold -d [01:30:05] !log root@labnet1001:~# service nova-fullstack restart [01:31:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:36:22] PROBLEM - nova instance creation test on labnet1001 is CRITICAL: PROCS CRITICAL: 0 processes with command name python, args nova-fullstack [01:48:30] PROBLEM - HHVM rendering on mw2223 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [01:49:20] RECOVERY - HHVM rendering on mw2223 is OK: HTTP OK: HTTP/1.1 200 OK - 74660 bytes in 0.304 second response time [02:06:20] PROBLEM - PyBal backends health check on lvs1003 is CRITICAL: PYBAL CRITICAL - CRITICAL - labweb_80: Servers labweb1001.wikimedia.org are marked down but pooled [02:07:00] PROBLEM - PyBal backends health check on lvs1010 is CRITICAL: PYBAL CRITICAL - CRITICAL - labweb_80: Servers labweb1001.wikimedia.org are marked down but pooled [02:07:01] PROBLEM - LVS HTTP IPv4 on labweb.svc.eqiad.wmnet is CRITICAL: connect to address 10.2.2.40 and port 80: Connection refused [02:07:10] PROBLEM - PyBal backends health check on lvs1006 is CRITICAL: PYBAL CRITICAL - CRITICAL - labweb_80: Servers labweb1001.wikimedia.org are marked down but pooled [02:07:40] PROBLEM - PyBal IPVS diff check on lvs1006 is CRITICAL: CRITICAL: Hosts in IPVS but unknown to PyBal: set([labweb1001.wikimedia.org]) [02:08:32] RECOVERY - nova instance creation test on labnet1001 is OK: PROCS OK: 1 process with command name python, args nova-fullstack [02:08:50] PROBLEM - PyBal IPVS diff check on lvs1003 is CRITICAL: CRITICAL: Hosts in IPVS but unknown to PyBal: set([labweb1001.wikimedia.org]) [02:09:30] PROBLEM - PyBal IPVS diff check on lvs1010 is CRITICAL: CRITICAL: Hosts in IPVS but unknown to PyBal: set([labweb1001.wikimedia.org]) [02:12:19] 10Operations, 10Cloud-Services, 10DBA, 10Patch-For-Review: db1009 overloaded by idle connections to the nova database - https://phabricator.wikimedia.org/T188589#4020208 (10bd808) Traffic to db1009 spiked like crazy again around 2018-03-03T00:30Z. This time things backed up so far that OpenStack started fa... [02:14:30] RECOVERY - nodepoold running on labnodepool1001 is OK: PROCS OK: 1 process with UID = 113 (nodepool), regex args ^/usr/bin/python /usr/bin/nodepoold -d [02:14:31] !log labnodepool1001:~# service nodepool start [02:14:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:14:50] RECOVERY - Check systemd state on labnodepool1001 is OK: OK - running: The system is fully operational [02:28:12] 10Operations, 10Cloud-Services, 10DBA, 10Patch-For-Review: db1009 overloaded by idle connections to the nova database - https://phabricator.wikimedia.org/T188589#4020221 (10chasemp) https://phabricator.wikimedia.org/P6785 [02:28:30] 10Operations, 10Cloud-Services, 10DBA, 10Patch-For-Review: db1009 overloaded by idle connections to the nova database - https://phabricator.wikimedia.org/T188589#4020222 (10chasemp) >>! In T188589#4020208, @bd808 wrote: > Traffic to db1009 spiked like crazy again around 2018-03-03T00:30Z. This time things... 
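For the labweb_80 alerts above ("marked down but pooled", "Hosts in IPVS but unknown to PyBal"), a rough way to compare PyBal's view of the pool with the kernel IPVS table on the affected LVS hosts; this assumes PyBal's HTTP instrumentation interface is enabled on its default port 9090:

    # what PyBal thinks about each backend (enabled/up/pooled flags)
    curl -s http://localhost:9090/pools/labweb_80
    # what is actually configured in IPVS on the load balancer
    sudo ipvsadm -L -n | grep -A5 10.2.2.40
    # "marked down but pooled" usually means the health checks fail but PyBal keeps the
    # server pooled so the pool does not drop below its configured depool threshold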
[03:27:50] PROBLEM - MariaDB Slave Lag: s1 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 771.30 seconds [04:14:10] RECOVERY - MariaDB Slave Lag: s1 on dbstore1002 is OK: OK slave_sql_lag Replication lag: 164.86 seconds [04:19:04] 10Operations, 10Ops-Access-Requests, 10Analytics-Kanban, 10Patch-For-Review: Add Tilman to analytics-admins - https://phabricator.wikimedia.org/T178802#4020311 (10Tbayer) Yes, still need them (at least through this month, probably a bit longer). [04:27:43] (03PS1) 10Andrew Bogott: wikitech: remove smw update cron [puppet] - 10https://gerrit.wikimedia.org/r/416209 [04:27:45] (03PS1) 10Andrew Bogott: wikitech: remove crons to backup the mediawiki database [puppet] - 10https://gerrit.wikimedia.org/r/416210 (https://phabricator.wikimedia.org/T188029) [04:27:47] (03PS1) 10Andrew Bogott: labweb: disable wikitech-static sync crons [puppet] - 10https://gerrit.wikimedia.org/r/416211 [04:28:36] (03CR) 10jerkins-bot: [V: 04-1] labweb: disable wikitech-static sync crons [puppet] - 10https://gerrit.wikimedia.org/r/416211 (owner: 10Andrew Bogott) [04:28:38] (03CR) 10Andrew Bogott: [C: 032] wikitech: remove smw update cron [puppet] - 10https://gerrit.wikimedia.org/r/416209 (owner: 10Andrew Bogott) [04:30:58] (03PS2) 10Andrew Bogott: labweb: disable wikitech-static sync crons [puppet] - 10https://gerrit.wikimedia.org/r/416211 [04:31:22] (03CR) 10Andrew Bogott: [C: 032] wikitech: remove crons to backup the mediawiki database [puppet] - 10https://gerrit.wikimedia.org/r/416210 (https://phabricator.wikimedia.org/T188029) (owner: 10Andrew Bogott) [04:33:26] (03CR) 10Andrew Bogott: [C: 032] labweb: disable wikitech-static sync crons [puppet] - 10https://gerrit.wikimedia.org/r/416211 (owner: 10Andrew Bogott) [04:59:00] PROBLEM - toolschecker: Make sure enwiki dumps are not empty on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 SERVICE UNAVAILABLE - string OK not found on http://checker.tools.wmflabs.org:80/dumps - 288 bytes in 0.015 second response time [05:27:34] 10Operations, 10Analytics, 10Traffic: Update documentation for "https" field in X-Analytics - https://phabricator.wikimedia.org/T188807#4020348 (10Tbayer) [05:28:31] PROBLEM - Disk space on labtestnet2001 is CRITICAL: DISK CRITICAL - free space: / 350 MB (3% inode=60%) [05:44:30] PROBLEM - HHVM rendering on mw2212 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [05:45:20] RECOVERY - HHVM rendering on mw2212 is OK: HTTP OK: HTTP/1.1 200 OK - 74567 bytes in 0.292 second response time [06:06:20] RECOVERY - toolschecker: Make sure enwiki dumps are not empty on checker.tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 166 bytes in 0.014 second response time [06:43:10] RECOVERY - Disk space on labtestnet2001 is OK: DISK OK [07:42:02] (03PS1) 10Nemo bis: Switch public wikis to explicit Flow usage definition [mediawiki-config] - 10https://gerrit.wikimedia.org/r/416217 (https://phabricator.wikimedia.org/T188812) [07:43:18] (03CR) 10jerkins-bot: [V: 04-1] Switch public wikis to explicit Flow usage definition [mediawiki-config] - 10https://gerrit.wikimedia.org/r/416217 (https://phabricator.wikimedia.org/T188812) (owner: 10Nemo bis) [07:49:14] (03PS2) 10Nemo bis: Switch public wikis to explicit Flow usage definition [mediawiki-config] - 10https://gerrit.wikimedia.org/r/416217 (https://phabricator.wikimedia.org/T188812) [07:50:41] (03CR) 10jerkins-bot: [V: 04-1] Switch public wikis to explicit Flow usage definition [mediawiki-config] - 10https://gerrit.wikimedia.org/r/416217 
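The dbstore1002 lag alert at 03:27 cleared on its own by 04:14. A quick manual check of replication lag on a MariaDB replica looks roughly like this (dbstore1002 is multi-source, hence SHOW ALL SLAVES STATUS; the production check may rely on pt-heartbeat rather than Seconds_Behind_Master):

    sudo mysql -e "SHOW ALL SLAVES STATUS\G" \
      | grep -E 'Connection_name|Seconds_Behind_Master|Slave_(IO|SQL)_Running'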
(https://phabricator.wikimedia.org/T188812) (owner: 10Nemo bis) [07:52:43] (03PS3) 10Nemo bis: Switch public wikis to explicit Flow usage definition [mediawiki-config] - 10https://gerrit.wikimedia.org/r/416217 (https://phabricator.wikimedia.org/T188812) [08:04:30] PROBLEM - Host bohrium is DOWN: PING CRITICAL - Packet loss = 100% [08:04:50] PROBLEM - Host kafkamon1001 is DOWN: PING CRITICAL - Packet loss = 100% [08:04:51] PROBLEM - Host logstash1008 is DOWN: PING CRITICAL - Packet loss = 100% [08:05:00] PROBLEM - Host sca1004 is DOWN: PING CRITICAL - Packet loss = 100% [08:05:31] PROBLEM - SSH on ganeti1005 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [08:06:20] RECOVERY - Host kafkamon1001 is UP: PING OK - Packet loss = 0%, RTA = 3.60 ms [08:06:30] RECOVERY - SSH on ganeti1005 is OK: SSH OK - OpenSSH_6.7p1 Debian-5+deb8u4 (protocol 2.0) [08:06:30] RECOVERY - Host sca1004 is UP: PING OK - Packet loss = 0%, RTA = 4.01 ms [08:08:00] PROBLEM - Misc HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 10.00% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=All&var-cache_type=misc&var-status_type=5 [08:10:40] RECOVERY - Host logstash1008 is UP: PING OK - Packet loss = 0%, RTA = 2.63 ms [08:10:50] RECOVERY - Host bohrium is UP: PING OK - Packet loss = 0%, RTA = 2.54 ms [08:17:04] (03PS1) 10MaxSem: Remove $wgBrowserBlacklist, does nothing [mediawiki-config] - 10https://gerrit.wikimedia.org/r/416219 [08:23:10] RECOVERY - Misc HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=All&var-cache_type=misc&var-status_type=5 [10:06:09] (03PS1) 10Jayprakash12345: Enable Rollbacker User right at arwikiversity [mediawiki-config] - 10https://gerrit.wikimedia.org/r/416224 (https://phabricator.wikimedia.org/T188633) [11:18:58] (03CR) 10MarcoAurelio: Enable Rollbacker User right at arwikiversity (032 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/416224 (https://phabricator.wikimedia.org/T188633) (owner: 10Jayprakash12345) [11:51:43] 10Operations, 10DBA, 10MediaWiki-General-or-Unknown, 10MW-1.31-release-notes (WMF-deploy-2018-02-20 (1.31.0-wmf.22)), and 2 others: Regularly purge expired temporary userrights from DB tables - https://phabricator.wikimedia.org/T176754#4020618 (10MarcoAurelio) I tested this script on deploymentwiki. It wor... [12:12:40] 10Operations, 10DBA, 10MediaWiki-General-or-Unknown, 10MW-1.31-release-notes (WMF-deploy-2018-02-20 (1.31.0-wmf.22)), and 2 others: Regularly purge expired temporary userrights from DB tables - https://phabricator.wikimedia.org/T176754#4020626 (10Marostegui) >>! In T176754#4020080, @greg wrote: > nor do I... [12:51:10] Reedy: you around? I need help to unlock a rename. [13:02:30] !log stopping nodepool for a bit while investigating openstack issues [13:02:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:03:41] andrewbogott: you looking into T188820 ? 
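bohrium, kafkamon1001, logstash1008 and sca1004 dropping together at 08:04, along with SSH on ganeti1005, points at a single Ganeti node carrying all four VMs (confirmed later in the log). From the cluster master one could list the instances whose primary node is ganeti1005, for example:

    # run on the Ganeti master (ganeti1004, per the discussion below)
    sudo gnt-instance list -o name,pnode,snodes,status | grep ganeti1005
    sudo gnt-node list ganeti1005.eqiad.wmnet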
:) [13:03:41] T188820: CI is broken - https://phabricator.wikimedia.org/T188820 [13:03:52] yes [13:04:00] thanks for that [13:05:22] (03PS30) 10ArielGlenn: [WIP] php7 manifests for mediawiki on stretch [puppet] - 10https://gerrit.wikimedia.org/r/394977 [13:05:57] !log restarting nova-conductor [13:06:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:07:30] PROBLEM - nodepoold running on labnodepool1001 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 113 (nodepool), regex args ^/usr/bin/python /usr/bin/nodepoold -d [13:08:00] PROBLEM - Check systemd state on labnodepool1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [13:08:57] !log restarting nodepool [13:09:00] RECOVERY - Check systemd state on labnodepool1001 is OK: OK - running: The system is fully operational [13:09:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:09:30] RECOVERY - nodepoold running on labnodepool1001 is OK: PROCS OK: 1 process with UID = 113 (nodepool), regex args ^/usr/bin/python /usr/bin/nodepoold -d [13:10:05] !log restarting rabbitmq-server on labcontrol1001 [13:10:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:12:45] (03PS2) 10Jayprakash12345: Enable rollbacker user right at arwikiversity [mediawiki-config] - 10https://gerrit.wikimedia.org/r/416224 (https://phabricator.wikimedia.org/T188633) [13:22:02] there are three gate-and-submit jobs running right now with -jessie test on them that are passing :) [13:23:31] !log forcing quota update in nova with update quota_usages set reserved='-1' where project_id='contintcloud'; [13:23:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:25:08] !log forced quota update in admin-monitoring as well; the reserved fixed_ip value was incorrect [13:25:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:35:41] PROBLEM - Host kafkamon1001 is DOWN: PING CRITICAL - Packet loss = 100% [13:35:50] PROBLEM - Host bohrium is DOWN: PING CRITICAL - Packet loss = 100% [13:35:51] PROBLEM - Host sca1004 is DOWN: PING CRITICAL - Packet loss = 100% [13:35:51] PROBLEM - Host logstash1008 is DOWN: PING CRITICAL - Packet loss = 100% [13:37:01] PROBLEM - SSH on ganeti1005 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:40:41] PROBLEM - Misc HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=All&var-cache_type=misc&var-status_type=5 [13:41:31] PROBLEM - etcd request latencies on chlorine is CRITICAL: CRITICAL - etcd_request_latencies is 89603 https://grafana.wikimedia.org/dashboard/db/kubernetes-api [13:43:31] RECOVERY - etcd request latencies on chlorine is OK: OK - etcd_request_latencies is 3631 https://grafana.wikimedia.org/dashboard/db/kubernetes-api [13:44:21] uh, bohrium down [13:44:38] and others, ganeti issues apparently [13:45:11] PROBLEM - Esams HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=esams&var-cache_type=All&var-status_type=5 [13:46:00] calling akosiaris [13:48:56] ok he said that only bohrium should be affected really, taking a look [13:49:59] akosiaris is gonna be online in ~15 minutes FTR
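The quota fix !logged at 13:23 is a direct SQL update against the nova database. A sketch of what that looks like, assuming the standard nova quota_usages schema on MariaDB; the SELECT to inspect the rows first is an addition here, only the UPDATE is taken from the log:

    sudo mysql nova -e "
      SELECT project_id, resource, in_use, reserved
        FROM quota_usages WHERE project_id='contintcloud';
      UPDATE quota_usages SET reserved='-1' WHERE project_id='contintcloud';"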
[13:51:23] so ganeti1005 is down and the master is ganeti1004 [13:56:20] !log powercycle ganeti1005 [13:56:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:57:40] PROBLEM - etcd request latencies on chlorine is CRITICAL: CRITICAL - etcd_request_latencies is 84639 https://grafana.wikimedia.org/dashboard/db/kubernetes-api [13:58:40] RECOVERY - etcd request latencies on chlorine is OK: OK - etcd_request_latencies is 3041 https://grafana.wikimedia.org/dashboard/db/kubernetes-api [13:59:11] ganeti1005 booting up [13:59:20] PROBLEM - Host ganeti1005 is DOWN: PING CRITICAL - Packet loss = 100% [13:59:50] RECOVERY - Host ganeti1005 is UP: PING OK - Packet loss = 0%, RTA = 0.33 ms [14:00:20] RECOVERY - SSH on ganeti1005 is OK: SSH OK - OpenSSH_6.7p1 Debian-5+deb8u4 (protocol 2.0) [14:00:40] RECOVERY - Host logstash1008 is UP: PING OK - Packet loss = 0%, RTA = 4.98 ms [14:00:41] RECOVERY - Host bohrium is UP: PING OK - Packet loss = 0%, RTA = 4.27 ms [14:01:10] RECOVERY - Host kafkamon1001 is UP: PING OK - Packet loss = 0%, RTA = 6.15 ms [14:02:20] RECOVERY - Host sca1004 is UP: PING OK - Packet loss = 0%, RTA = 6.84 ms [14:03:10] I guess an alternative to rebooting 1005 would have been to run `gnt-node migrate -f ganeti1005.eqiad.wmnet`, but I'll wait for more expert eyes to see this first :) [14:03:31] PROBLEM - Check systemd state on kafkamon1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [14:04:15] cache_misc recovering now that bohrium is back: https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&from=1520083164099&to=1520085834590&var-site=All&var-cache_type=misc&var-status_type=5 [14:09:49] I am around [14:11:24] an InnoDB on bohrium recovered fine (there was a time it did not have) [14:11:28] and* [14:11:31] nice [14:12:21] RECOVERY - Esams HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=esams&var-cache_type=All&var-status_type=5 [14:13:28] oh, while I'm here [14:13:40] andrewbogott: labweb100[12] are both down [14:13:51] RECOVERY - Misc HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=All&var-cache_type=misc&var-status_type=5 [14:14:02] and lvs is complaining I saw that too [14:14:40] 10Operations, 10Analytics, 10ContentTranslation, 10ContentTranslation-Analytics, and 3 others: schedule a daily run of ContentTranslation analytics scripts - https://phabricator.wikimedia.org/T122479#4020776 (10Petar.petkovic) [14:14:50] That's fallout from an outage last night... I'll clean them up shortly [14:15:16] so ema: so overall all you had to do was powercycle the box, right ? [14:15:38] akosiaris: correct. Nothing meaningful in console, I've just powercycled it. [14:16:05] yeah good call. from the logs I see it was the usual crappy issue about page allocation stalls T181121 [14:16:05] T181121: Kernels errors on ganeti1005- ganeti1008 under high I/O - https://phabricator.wikimedia.org/T181121 [14:16:34] akosiaris: if the state of ganeti1005 is unstable these days we might perhaps want to move bohrium to another ganeti host? 
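The `gnt-node migrate -f` alternative mentioned at 14:03 would have moved the instances off ganeti1005 instead of rebooting it; it only works while the node is still responsive, which was not the case this time. A sketch, run from the cluster master:

    # live-migrate all primary instances away from the node
    sudo gnt-node migrate -f ganeti1005.eqiad.wmnet
    # if the node were completely dead, failing instances over to their
    # secondaries would be the other option (not what was done here)
    sudo gnt-node failover --ignore-consistency ganeti1005.eqiad.wmnet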
!log 13:56:20 ema: powercycle ganeti1005 T181121 [14:17:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:17:11] RECOVERY - LVS HTTP IPv4 on labweb.svc.eqiad.wmnet is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 453 bytes in 0.001 second response time [14:17:22] ema: it's not just ganeti1005. It's all the 4 hosts in that nodegroup (ganeti1005-ganeti1008) that have the problem [14:17:31] RECOVERY - PyBal backends health check on lvs1003 is OK: PYBAL OK - All pools are healthy [14:17:31] RECOVERY - PyBal backends health check on lvs1010 is OK: PYBAL OK - All pools are healthy [14:17:47] and up to now (that's anecdotal) it's almost always the host that runs bohrium that exhibits the problem [14:17:50] RECOVERY - PyBal backends health check on lvs1006 is OK: PYBAL OK - All pools are healthy [14:17:59] oh I see [14:18:00] due to bohrium being THE IO heavy VM [14:18:00] RECOVERY - PyBal IPVS diff check on lvs1006 is OK: OK: no difference between hosts in IPVS/PyBal [14:18:19] fair enough, so moving the vm to another host might just move the problem there instead [14:18:31] in fact the issues started on 08:03 today per the logs, which is just 3 minutes after the piwik archiver starts [14:19:00] RECOVERY - PyBal IPVS diff check on lvs1003 is OK: OK: no difference between hosts in IPVS/PyBal [14:19:12] it's very inconsistent to reproduce. I've been running IO heavy stuff on various boxes for days and I do not yet have a clear reproduction case [14:19:41] RECOVERY - PyBal IPVS diff check on lvs1010 is OK: OK: no difference between hosts in IPVS/PyBal [14:21:43] elukey: we've rebooted ganeti1005; kafkamon1001, which runs there, now has 3 failed units after booting up (burrow-analytics.service, burrow-jumbo-eqiad.service, burrow-main-eqiad.service) [14:24:50] RECOVERY - Work requests waiting in Zuul Gearman server on contint1001 is OK: OK: Less than 30.00% above the threshold [90.0] https://grafana.wikimedia.org/dashboard/db/zuul-gearman?panelId=10&fullscreen&orgId=1 [14:25:01] akosiaris: ok, nothing more to do now on the ganeti side I guess? [14:25:06] nope [14:25:16] unfortunately... I would love to pinpoint the problem [14:25:25] it's been messing with us > 3 months now [14:25:46] it's only present in that nodegroup btw (we got 4) [14:26:08] but anyway I digress, nothing more to do for now [14:28:42] alright! o/ [14:28:47] o/ [14:47:06] 10Operations, 10DBA, 10MediaWiki-General-or-Unknown, 10MW-1.31-release-notes (WMF-deploy-2018-02-20 (1.31.0-wmf.22)), and 2 others: Regularly purge expired temporary userrights from DB tables - https://phabricator.wikimedia.org/T176754#4020798 (10Reedy) I don't think there's even any point doing that. Very... [15:18:20] ema thanks!
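T181121 ("Kernels errors on ganeti1005-ganeti1008 under high I/O") is about page allocation stalls, which is what akosiaris refers to above. A quick way to check whether a node hit them before a hang, assuming a persistent journal or a populated kern.log:

    sudo journalctl -k -b -1 | grep -i 'page allocation stall'
    sudo grep -i 'page allocation stall' /var/log/kern.log*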
Just fixed them, there was an issue with stale pid files [15:18:49] will need to add process alerts for those [15:18:51] RECOVERY - Check systemd state on kafkamon1001 is OK: OK - running: The system is fully operational [15:21:55] akosiaris: this time for bohrium it might have been a special case since my team forced the re-process of the past week of data for all the smallish websites (the archiver by default does not retry on data older than 24h unless explicitly told so) [15:23:02] iirc the archiver was manually run after this, but it might have not done all the work in one go and started this morning [15:39:57] (03PS1) 10Elukey: burrow: ensure pid file is created under /var/run/burrow [puppet] - 10https://gerrit.wikimedia.org/r/416232 (https://phabricator.wikimedia.org/T180442) [15:42:12] (03PS2) 10Elukey: burrow: ensure pid file is created under /var/run/burrow [puppet] - 10https://gerrit.wikimedia.org/r/416232 (https://phabricator.wikimedia.org/T180442) [15:45:54] (03PS3) 10Elukey: burrow: ensure pid file is created under /var/run/burrow [puppet] - 10https://gerrit.wikimedia.org/r/416232 (https://phabricator.wikimedia.org/T180442) [15:49:34] (03CR) 10Elukey: [C: 032] "https://puppet-compiler.wmflabs.org/compiler02/10244/kafkamon2001.codfw.wmnet/" [puppet] - 10https://gerrit.wikimedia.org/r/416232 (https://phabricator.wikimedia.org/T180442) (owner: 10Elukey) [15:49:40] (03PS4) 10Elukey: burrow: ensure pid file is created under /var/run/burrow [puppet] - 10https://gerrit.wikimedia.org/r/416232 (https://phabricator.wikimedia.org/T180442) [15:54:10] PROBLEM - Check systemd state on kafkamon1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [15:54:17] lovely [15:56:05] createPidFile(appContext.Config.General.LogDir + "/" + appContext.Config.General.PIDFile) [15:56:18] ah so apparently it was meant to go in var/log [15:56:22] mmmm [15:56:35] (03PS1) 10Elukey: Revert "burrow: ensure pid file is created under /var/run/burrow" [puppet] - 10https://gerrit.wikimedia.org/r/416233 [15:57:05] (03CR) 10Elukey: [C: 032] Revert "burrow: ensure pid file is created under /var/run/burrow" [puppet] - 10https://gerrit.wikimedia.org/r/416233 (owner: 10Elukey) [15:57:46] will add monitoring on Monday, it is clearly not the right day for me :D [15:59:10] RECOVERY - Check systemd state on kafkamon1001 is OK: OK - running: The system is fully operational [15:59:18] * elukey is looking forward to upgrade Burrow to 1.0 [16:26:25] (03CR) 10Reedy: [C: 031] "[DNM] can be removed now" [puppet] - 10https://gerrit.wikimedia.org/r/382631 (https://phabricator.wikimedia.org/T176754) (owner: 10EddieGP) [16:46:51] (03PS8) 10EddieGP: Add cron job for expired userrights maintenance script [puppet] - 10https://gerrit.wikimedia.org/r/382631 (https://phabricator.wikimedia.org/T176754) [16:49:02] (03CR) 10EddieGP: "Sorry, wanted to re-add as reviewer, not set as assignee. But yes, it seems like this'd be ready now." 
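The kafkamon1001 cleanup described at 15:18 (stale pid files after the ungraceful reboot) and the reverted puppet change both come down to where Burrow writes its pid file: per the source line quoted at 15:56 it goes under the configured log directory, not /var/run. A sketch of the manual fix; the pid file path is an assumption based on that quoted code, not taken from the log:

    sudo systemctl --failed --no-legend | grep burrow
    # assumed location: general.logdir + "/" + general.pidfile, per the quoted Go line
    sudo rm -f /var/log/burrow/*/burrow.pid
    sudo systemctl restart burrow-analytics burrow-jumbo-eqiad burrow-main-eqiad
    sudo systemctl is-system-running   # should return to "running"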
[puppet] - 10https://gerrit.wikimedia.org/r/382631 (https://phabricator.wikimedia.org/T176754) (owner: 10EddieGP) [17:20:36] (03PS1) 10Andrew Bogott: labweb wikitech: add a few more apache confs [puppet] - 10https://gerrit.wikimedia.org/r/416235 (https://phabricator.wikimedia.org/T168470) [17:21:42] (03CR) 10Andrew Bogott: [C: 032] labweb wikitech: add a few more apache confs [puppet] - 10https://gerrit.wikimedia.org/r/416235 (https://phabricator.wikimedia.org/T168470) (owner: 10Andrew Bogott) [17:29:40] 10Operations, 10ops-codfw: attach furud's new arrays (furud-array[3-7]) - https://phabricator.wikimedia.org/T185153#4020984 (10Papaul) Connection complete [18:33:51] 10Operations, 10Analytics-Kanban, 10Patch-For-Review, 10Performance-Team (Radar), and 2 others: Deprecation of mw.errors.* metrics - https://phabricator.wikimedia.org/T188749#4021055 (10elukey) Thanks @Krinkle! @fgiunchedi I think we are ready to go, what do you think? [19:42:10] (03PS6) 10Sau226: Disable main page deletion on enwiktionary [mediawiki-config] - 10https://gerrit.wikimedia.org/r/414509 (https://phabricator.wikimedia.org/T184959) [20:12:32] (03PS1) 10Gilles: Point private Thumbor Swift user to existing user for now [puppet] - 10https://gerrit.wikimedia.org/r/416240 (https://phabricator.wikimedia.org/T188834) [20:29:02] 10Operations, 10ops-codfw: attach furud's new arrays (furud-array[3-7]) - https://phabricator.wikimedia.org/T185153#4021169 (10Papaul) [20:29:17] 10Operations, 10ops-codfw: attach furud's new arrays (furud-array[3-7]) - https://phabricator.wikimedia.org/T185153#3907805 (10Papaul) a:05Papaul>03faidon [20:31:59] 10Operations, 10ops-codfw: rack/setup/install mw2259-mw2290 - https://phabricator.wikimedia.org/T188301#4021171 (10Papaul) [20:47:36] 10Operations, 10ops-codfw: rack/setup/install mw2259-mw2290 - https://phabricator.wikimedia.org/T188301#4021180 (10Papaul) [21:13:23] 10Operations, 10ops-codfw: rack/setup/install mw2259-mw2290 - https://phabricator.wikimedia.org/T188301#4021198 (10Papaul) [23:55:07] (03PS1) 10EddieGP: Article counts: Change 'comma' method to 'any' [mediawiki-config] - 10https://gerrit.wikimedia.org/r/416330 (https://phabricator.wikimedia.org/T188472) [23:57:07] (03CR) 10EddieGP: "To be deployed after 1.31wmf23 (326d655 specifically)." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/416330 (https://phabricator.wikimedia.org/T188472) (owner: 10EddieGP)
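The cron reviewed in gerrit change 382631 (T176754) would run the expired-userrights purge regularly. What such a job boils down to is roughly the following; the script name is the 1.31 maintenance script the task added, and the foreachwiki/mwscript wrappers are the usual maintenance-host tooling — all assumed here rather than taken from the patch itself:

    # across all wikis, from a maintenance host
    /usr/local/bin/foreachwiki maintenance/purgeExpiredUserrights.php
    # or for a single wiki
    mwscript maintenance/purgeExpiredUserrights.php --wiki=enwiki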