[00:00:55] (03CR) 10Bstorm: ">" [puppet] - 10https://gerrit.wikimedia.org/r/415384 (https://phabricator.wikimedia.org/T181650) (owner: 10Bstorm) [00:36:21] PROBLEM - HHVM rendering on mw2113 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [00:37:20] RECOVERY - HHVM rendering on mw2113 is OK: HTTP OK: HTTP/1.1 200 OK - 74555 bytes in 0.468 second response time [00:41:35] 10Operations, 10DNS, 10Release-Engineering-Team, 10Traffic, 10WMF-Communications: Move Foundation Wiki to new URL when new Wikimedia Foundation website launches - https://phabricator.wikimedia.org/T188776#4020038 (10greg) [00:59:48] 10Operations, 10DBA, 10MediaWiki-General-or-Unknown, 10MW-1.31-release-notes (WMF-deploy-2018-02-20 (1.31.0-wmf.22)), and 2 others: Regularly purge expired temporary userrights from DB tables - https://phabricator.wikimedia.org/T176754#4020080 (10greg) Sorry, I still haven't found someone to do this deploy... [01:04:41] (03PS1) 10Chad: WIP: Add git::config{} for calling `git config` on repositories. [puppet] - 10https://gerrit.wikimedia.org/r/416200 [01:05:11] (03PS1) 10Chad: Abstract Gerrit's public key out of gerrit::jetty [puppet] - 10https://gerrit.wikimedia.org/r/416201 [01:05:17] (03CR) 10jerkins-bot: [V: 04-1] WIP: Add git::config{} for calling `git config` on repositories. [puppet] - 10https://gerrit.wikimedia.org/r/416200 (owner: 10Chad) [01:09:21] 10Operations, 10DBA, 10MediaWiki-General-or-Unknown, 10MW-1.31-release-notes (WMF-deploy-2018-02-20 (1.31.0-wmf.22)), and 2 others: Regularly purge expired temporary userrights from DB tables - https://phabricator.wikimedia.org/T176754#3635603 (10Reedy) I'm guessing this is going to be mostly a noop... en... [01:13:40] (03PS2) 10Chad: WIP: Add git::config{} for calling `git config` on repositories. [puppet] - 10https://gerrit.wikimedia.org/r/416200 [01:16:28] 10Operations, 10DBA, 10MediaWiki-General-or-Unknown, 10MW-1.31-release-notes (WMF-deploy-2018-02-20 (1.31.0-wmf.22)), and 2 others: Regularly purge expired temporary userrights from DB tables - https://phabricator.wikimedia.org/T176754#4020127 (10EddieGP) >>! In T176754#4020097, @Reedy wrote: > I'm guessin... [01:21:33] !log labnodepool1001:~# service nodepool stop [01:22:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:26:30] PROBLEM - Check systemd state on labnodepool1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
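The nodepool stop !logged at 01:21 is immediately followed by the "Check systemd state ... degraded" alert above, and a "nodepoold running ... 0 processes" alert follows. A minimal sketch of how one might inspect that state by hand on labnodepool1001 — plain systemd/procps commands, which the Icinga checks are assumed to roughly mirror:

    # "degraded" means at least one systemd unit is in the failed state
    systemctl is-system-running        # prints running / degraded / ...
    systemctl --failed --no-legend     # shows which unit(s) caused the degraded state
    # the process check looks for the daemon itself, roughly:
    pgrep -u nodepool -f '/usr/bin/nodepoold -d'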
[01:27:11] PROBLEM - nodepoold running on labnodepool1001 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 113 (nodepool), regex args ^/usr/bin/python /usr/bin/nodepoold -d [01:30:05] !log root@labnet1001:~# service nova-fullstack restart [01:31:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:36:22] PROBLEM - nova instance creation test on labnet1001 is CRITICAL: PROCS CRITICAL: 0 processes with command name python, args nova-fullstack [01:48:30] PROBLEM - HHVM rendering on mw2223 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [01:49:20] RECOVERY - HHVM rendering on mw2223 is OK: HTTP OK: HTTP/1.1 200 OK - 74660 bytes in 0.304 second response time [02:06:20] PROBLEM - PyBal backends health check on lvs1003 is CRITICAL: PYBAL CRITICAL - CRITICAL - labweb_80: Servers labweb1001.wikimedia.org are marked down but pooled [02:07:00] PROBLEM - PyBal backends health check on lvs1010 is CRITICAL: PYBAL CRITICAL - CRITICAL - labweb_80: Servers labweb1001.wikimedia.org are marked down but pooled [02:07:01] PROBLEM - LVS HTTP IPv4 on labweb.svc.eqiad.wmnet is CRITICAL: connect to address 10.2.2.40 and port 80: Connection refused [02:07:10] PROBLEM - PyBal backends health check on lvs1006 is CRITICAL: PYBAL CRITICAL - CRITICAL - labweb_80: Servers labweb1001.wikimedia.org are marked down but pooled [02:07:40] PROBLEM - PyBal IPVS diff check on lvs1006 is CRITICAL: CRITICAL: Hosts in IPVS but unknown to PyBal: set([labweb1001.wikimedia.org]) [02:08:32] RECOVERY - nova instance creation test on labnet1001 is OK: PROCS OK: 1 process with command name python, args nova-fullstack [02:08:50] PROBLEM - PyBal IPVS diff check on lvs1003 is CRITICAL: CRITICAL: Hosts in IPVS but unknown to PyBal: set([labweb1001.wikimedia.org]) [02:09:30] PROBLEM - PyBal IPVS diff check on lvs1010 is CRITICAL: CRITICAL: Hosts in IPVS but unknown to PyBal: set([labweb1001.wikimedia.org]) [02:12:19] 10Operations, 10Cloud-Services, 10DBA, 10Patch-For-Review: db1009 overloaded by idle connections to the nova database - https://phabricator.wikimedia.org/T188589#4020208 (10bd808) Traffic to db1009 spiked like crazy again around 2018-03-03T00:30Z. This time things backed up so far that OpenStack started fa... [02:14:30] RECOVERY - nodepoold running on labnodepool1001 is OK: PROCS OK: 1 process with UID = 113 (nodepool), regex args ^/usr/bin/python /usr/bin/nodepoold -d [02:14:31] !log labnodepool1001:~# service nodepool start [02:14:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:14:50] RECOVERY - Check systemd state on labnodepool1001 is OK: OK - running: The system is fully operational [02:28:12] 10Operations, 10Cloud-Services, 10DBA, 10Patch-For-Review: db1009 overloaded by idle connections to the nova database - https://phabricator.wikimedia.org/T188589#4020221 (10chasemp) https://phabricator.wikimedia.org/P6785 [02:28:30] 10Operations, 10Cloud-Services, 10DBA, 10Patch-For-Review: db1009 overloaded by idle connections to the nova database - https://phabricator.wikimedia.org/T188589#4020222 (10chasemp) >>! In T188589#4020208, @bd808 wrote: > Traffic to db1009 spiked like crazy again around 2018-03-03T00:30Z. This time things... 
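For the labweb_80 alerts above ("marked down but pooled", "Hosts in IPVS but unknown to PyBal"), a rough way to compare PyBal's view of the pool with the kernel IPVS table on the affected LVS hosts; this assumes PyBal's HTTP instrumentation interface is enabled on its default port 9090:

    # what PyBal thinks about each backend (enabled/up/pooled flags)
    curl -s http://localhost:9090/pools/labweb_80
    # what is actually configured in IPVS on the load balancer
    sudo ipvsadm -L -n | grep -A5 10.2.2.40
    # "marked down but pooled" usually means the health checks fail but PyBal keeps the
    # server pooled so the pool does not drop below its configured depool threshold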
[03:27:50] PROBLEM - MariaDB Slave Lag: s1 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 771.30 seconds [04:14:10] RECOVERY - MariaDB Slave Lag: s1 on dbstore1002 is OK: OK slave_sql_lag Replication lag: 164.86 seconds [04:19:04] 10Operations, 10Ops-Access-Requests, 10Analytics-Kanban, 10Patch-For-Review: Add Tilman to analytics-admins - https://phabricator.wikimedia.org/T178802#4020311 (10Tbayer) Yes, still need them (at least through this month, probably a bit longer). [04:27:43] (03PS1) 10Andrew Bogott: wikitech: remove smw update cron [puppet] - 10https://gerrit.wikimedia.org/r/416209 [04:27:45] (03PS1) 10Andrew Bogott: wikitech: remove crons to backup the mediawiki database [puppet] - 10https://gerrit.wikimedia.org/r/416210 (https://phabricator.wikimedia.org/T188029) [04:27:47] (03PS1) 10Andrew Bogott: labweb: disable wikitech-static sync crons [puppet] - 10https://gerrit.wikimedia.org/r/416211 [04:28:36] (03CR) 10jerkins-bot: [V: 04-1] labweb: disable wikitech-static sync crons [puppet] - 10https://gerrit.wikimedia.org/r/416211 (owner: 10Andrew Bogott) [04:28:38] (03CR) 10Andrew Bogott: [C: 032] wikitech: remove smw update cron [puppet] - 10https://gerrit.wikimedia.org/r/416209 (owner: 10Andrew Bogott) [04:30:58] (03PS2) 10Andrew Bogott: labweb: disable wikitech-static sync crons [puppet] - 10https://gerrit.wikimedia.org/r/416211 [04:31:22] (03CR) 10Andrew Bogott: [C: 032] wikitech: remove crons to backup the mediawiki database [puppet] - 10https://gerrit.wikimedia.org/r/416210 (https://phabricator.wikimedia.org/T188029) (owner: 10Andrew Bogott) [04:33:26] (03CR) 10Andrew Bogott: [C: 032] labweb: disable wikitech-static sync crons [puppet] - 10https://gerrit.wikimedia.org/r/416211 (owner: 10Andrew Bogott) [04:59:00] PROBLEM - toolschecker: Make sure enwiki dumps are not empty on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 SERVICE UNAVAILABLE - string OK not found on http://checker.tools.wmflabs.org:80/dumps - 288 bytes in 0.015 second response time [05:27:34] 10Operations, 10Analytics, 10Traffic: Update documentation for "https" field in X-Analytics - https://phabricator.wikimedia.org/T188807#4020348 (10Tbayer) [05:28:31] PROBLEM - Disk space on labtestnet2001 is CRITICAL: DISK CRITICAL - free space: / 350 MB (3% inode=60%) [05:44:30] PROBLEM - HHVM rendering on mw2212 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [05:45:20] RECOVERY - HHVM rendering on mw2212 is OK: HTTP OK: HTTP/1.1 200 OK - 74567 bytes in 0.292 second response time [06:06:20] RECOVERY - toolschecker: Make sure enwiki dumps are not empty on checker.tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 166 bytes in 0.014 second response time [06:43:10] RECOVERY - Disk space on labtestnet2001 is OK: DISK OK [07:42:02] (03PS1) 10Nemo bis: Switch public wikis to explicit Flow usage definition [mediawiki-config] - 10https://gerrit.wikimedia.org/r/416217 (https://phabricator.wikimedia.org/T188812) [07:43:18] (03CR) 10jerkins-bot: [V: 04-1] Switch public wikis to explicit Flow usage definition [mediawiki-config] - 10https://gerrit.wikimedia.org/r/416217 (https://phabricator.wikimedia.org/T188812) (owner: 10Nemo bis) [07:49:14] (03PS2) 10Nemo bis: Switch public wikis to explicit Flow usage definition [mediawiki-config] - 10https://gerrit.wikimedia.org/r/416217 (https://phabricator.wikimedia.org/T188812) [07:50:41] (03CR) 10jerkins-bot: [V: 04-1] Switch public wikis to explicit Flow usage definition [mediawiki-config] - 10https://gerrit.wikimedia.org/r/416217 
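The dbstore1002 lag alert at 03:27 cleared on its own by 04:14. A quick manual check of replication lag on a MariaDB replica looks roughly like this (dbstore1002 is multi-source, hence SHOW ALL SLAVES STATUS; the production check may rely on pt-heartbeat rather than Seconds_Behind_Master):

    sudo mysql -e "SHOW ALL SLAVES STATUS\G" \
      | grep -E 'Connection_name|Seconds_Behind_Master|Slave_(IO|SQL)_Running'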
(https://phabricator.wikimedia.org/T188812) (owner: 10Nemo bis) [07:52:43] (03PS3) 10Nemo bis: Switch public wikis to explicit Flow usage definition [mediawiki-config] - 10https://gerrit.wikimedia.org/r/416217 (https://phabricator.wikimedia.org/T188812) [08:04:30] PROBLEM - Host bohrium is DOWN: PING CRITICAL - Packet loss = 100% [08:04:50] PROBLEM - Host kafkamon1001 is DOWN: PING CRITICAL - Packet loss = 100% [08:04:51] PROBLEM - Host logstash1008 is DOWN: PING CRITICAL - Packet loss = 100% [08:05:00] PROBLEM - Host sca1004 is DOWN: PING CRITICAL - Packet loss = 100% [08:05:31] PROBLEM - SSH on ganeti1005 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [08:06:20] RECOVERY - Host kafkamon1001 is UP: PING OK - Packet loss = 0%, RTA = 3.60 ms [08:06:30] RECOVERY - SSH on ganeti1005 is OK: SSH OK - OpenSSH_6.7p1 Debian-5+deb8u4 (protocol 2.0) [08:06:30] RECOVERY - Host sca1004 is UP: PING OK - Packet loss = 0%, RTA = 4.01 ms [08:08:00] PROBLEM - Misc HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 10.00% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=All&var-cache_type=misc&var-status_type=5 [08:10:40] RECOVERY - Host logstash1008 is UP: PING OK - Packet loss = 0%, RTA = 2.63 ms [08:10:50] RECOVERY - Host bohrium is UP: PING OK - Packet loss = 0%, RTA = 2.54 ms [08:17:04] (03PS1) 10MaxSem: Remove $wgBrowserBlacklist, does nothing [mediawiki-config] - 10https://gerrit.wikimedia.org/r/416219 [08:23:10] RECOVERY - Misc HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=All&var-cache_type=misc&var-status_type=5 [10:06:09] (03PS1) 10Jayprakash12345: Enable Rollbacker User right at arwikiversity [mediawiki-config] - 10https://gerrit.wikimedia.org/r/416224 (https://phabricator.wikimedia.org/T188633) [11:18:58] (03CR) 10MarcoAurelio: Enable Rollbacker User right at arwikiversity (032 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/416224 (https://phabricator.wikimedia.org/T188633) (owner: 10Jayprakash12345) [11:51:43] 10Operations, 10DBA, 10MediaWiki-General-or-Unknown, 10MW-1.31-release-notes (WMF-deploy-2018-02-20 (1.31.0-wmf.22)), and 2 others: Regularly purge expired temporary userrights from DB tables - https://phabricator.wikimedia.org/T176754#4020618 (10MarcoAurelio) I tested this script on deploymentwiki. It wor... [12:12:40] 10Operations, 10DBA, 10MediaWiki-General-or-Unknown, 10MW-1.31-release-notes (WMF-deploy-2018-02-20 (1.31.0-wmf.22)), and 2 others: Regularly purge expired temporary userrights from DB tables - https://phabricator.wikimedia.org/T176754#4020626 (10Marostegui) >>! In T176754#4020080, @greg wrote: > nor do I... [12:51:10] Reedy: you around? I need help to unlock a rename. [13:02:30] !log stopping nodepool for a bit while investigating openstack issues [13:02:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:03:41] andrewbogott: you looking into T188820 ? 
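bohrium, kafkamon1001, logstash1008 and sca1004 dropping together at 08:04, along with SSH on ganeti1005, points at a single Ganeti node carrying all four VMs (confirmed later in the log). From the cluster master one could list the instances whose primary node is ganeti1005, for example:

    # run on the Ganeti master (ganeti1004, per the discussion below)
    sudo gnt-instance list -o name,pnode,snodes,status | grep ganeti1005
    sudo gnt-node list ganeti1005.eqiad.wmnet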
:) [13:03:41] T188820: CI is broken - https://phabricator.wikimedia.org/T188820 [13:03:52] yes [13:04:00] thanks for that [13:05:22] (03PS30) 10ArielGlenn: [WIP] php7 manifests for mediawiki on stretch [puppet] - 10https://gerrit.wikimedia.org/r/394977 [13:05:57] !log restarting nova-conductor [13:06:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:07:30] PROBLEM - nodepoold running on labnodepool1001 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 113 (nodepool), regex args ^/usr/bin/python /usr/bin/nodepoold -d [13:08:00] PROBLEM - Check systemd state on labnodepool1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [13:08:57] !log restarting nodepool [13:09:00] RECOVERY - Check systemd state on labnodepool1001 is OK: OK - running: The system is fully operational [13:09:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:09:30] RECOVERY - nodepoold running on labnodepool1001 is OK: PROCS OK: 1 process with UID = 113 (nodepool), regex args ^/usr/bin/python /usr/bin/nodepoold -d [13:10:05] !log restarting rabbitmq-server on labcontrol1001 [13:10:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:12:45] (03PS2) 10Jayprakash12345: Enable rollbacker user right at arwikiversity [mediawiki-config] - 10https://gerrit.wikimedia.org/r/416224 (https://phabricator.wikimedia.org/T188633) [13:22:02] there are three gate-and-submit jobs running right now with -jessie test on them that are passing :) [13:23:31] !log forcing quota update in nova with update quota_usages set reserved='-1' where project_id='contintcloud'; [13:23:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:25:08] !log forced quota update in admin-monitoring as well; the reserved fixed_ip value was incorrect [13:25:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:35:41] PROBLEM - Host kafkamon1001 is DOWN: PING CRITICAL - Packet loss = 100% [13:35:50] PROBLEM - Host bohrium is DOWN: PING CRITICAL - Packet loss = 100% [13:35:51] PROBLEM - Host sca1004 is DOWN: PING CRITICAL - Packet loss = 100% [13:35:51] PROBLEM - Host logstash1008 is DOWN: PING CRITICAL - Packet loss = 100% [13:37:01] PROBLEM - SSH on ganeti1005 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:40:41] PROBLEM - Misc HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=All&var-cache_type=misc&var-status_type=5 [13:41:31] PROBLEM - etcd request latencies on chlorine is CRITICAL: CRITICAL - etcd_request_latencies is 89603 https://grafana.wikimedia.org/dashboard/db/kubernetes-api [13:43:31] RECOVERY - etcd request latencies on chlorine is OK: OK - etcd_request_latencies is 3631 https://grafana.wikimedia.org/dashboard/db/kubernetes-api [13:44:21] uh, bohrium down [13:44:38] and others, ganeti issues apparently [13:45:11] PROBLEM - Esams HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=esams&var-cache_type=All&var-status_type=5 [13:46:00] calling akosiaris [13:48:56] ok he said that only bohrium should be affected really, taking a look [13:49:59] akosiaris is gonna be online in ~15 minutes FTR
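The quota fix !logged at 13:23 is a direct SQL update against the nova database. A sketch of what that looks like, assuming the standard nova quota_usages schema on MariaDB; the SELECT to inspect the rows first is an addition here, only the UPDATE is taken from the log:

    sudo mysql nova -e "
      SELECT project_id, resource, in_use, reserved
        FROM quota_usages WHERE project_id='contintcloud';
      UPDATE quota_usages SET reserved='-1' WHERE project_id='contintcloud';"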
[13:51:23] so ganeti1005 is down and the master is ganeti1004 [13:56:20] !log powercycle ganeti1005 [13:56:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:57:40] PROBLEM - etcd request latencies on chlorine is CRITICAL: CRITICAL - etcd_request_latencies is 84639 https://grafana.wikimedia.org/dashboard/db/kubernetes-api [13:58:40] RECOVERY - etcd request latencies on chlorine is OK: OK - etcd_request_latencies is 3041 https://grafana.wikimedia.org/dashboard/db/kubernetes-api [13:59:11] ganeti1005 booting up [13:59:20] PROBLEM - Host ganeti1005 is DOWN: PING CRITICAL - Packet loss = 100% [13:59:50] RECOVERY - Host ganeti1005 is UP: PING OK - Packet loss = 0%, RTA = 0.33 ms [14:00:20] RECOVERY - SSH on ganeti1005 is OK: SSH OK - OpenSSH_6.7p1 Debian-5+deb8u4 (protocol 2.0) [14:00:40] RECOVERY - Host logstash1008 is UP: PING OK - Packet loss = 0%, RTA = 4.98 ms [14:00:41] RECOVERY - Host bohrium is UP: PING OK - Packet loss = 0%, RTA = 4.27 ms [14:01:10] RECOVERY - Host kafkamon1001 is UP: PING OK - Packet loss = 0%, RTA = 6.15 ms [14:02:20] RECOVERY - Host sca1004 is UP: PING OK - Packet loss = 0%, RTA = 6.84 ms [14:03:10] I guess an alternative to rebooting 1005 would have been to run `gnt-node migrate -f ganeti1005.eqiad.wmnet`, but I'll wait for more expert eyes to see this first :) [14:03:31] PROBLEM - Check systemd state on kafkamon1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [14:04:15] cache_misc recovering now that bohrium is back: https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&from=1520083164099&to=1520085834590&var-site=All&var-cache_type=misc&var-status_type=5 [14:09:49] I am around [14:11:24] an InnoDB on bohrium recovered fine (there was a time it did not have) [14:11:28] and* [14:11:31] nice [14:12:21] RECOVERY - Esams HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=esams&var-cache_type=All&var-status_type=5 [14:13:28] oh, while I'm here [14:13:40] andrewbogott: labweb100[12] are both down [14:13:51] RECOVERY - Misc HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=All&var-cache_type=misc&var-status_type=5 [14:14:02] and lvs is complaining I saw that too [14:14:40] 10Operations, 10Analytics, 10ContentTranslation, 10ContentTranslation-Analytics, and 3 others: schedule a daily run of ContentTranslation analytics scripts - https://phabricator.wikimedia.org/T122479#4020776 (10Petar.petkovic) [14:14:50] That's fallout from an outage last night... I'll clean them up shortly [14:15:16] so ema: so overall all you had to do was powercycle the box, right ? [14:15:38] akosiaris: correct. Nothing meaningful in console, I've just powercycled it. [14:16:05] yeah good call. from the logs I see it was the usual crappy issue about page allocation stalls T181121 [14:16:05] T181121: Kernels errors on ganeti1005- ganeti1008 under high I/O - https://phabricator.wikimedia.org/T181121 [14:16:34] akosiaris: if the state of ganeti1005 is unstable these days we might perhaps want to move bohrium to another ganeti host? 
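The `gnt-node migrate -f` alternative mentioned at 14:03 would have moved the instances off ganeti1005 instead of rebooting it; it only works while the node is still responsive, which was not the case this time. A sketch, run from the cluster master:

    # live-migrate all primary instances away from the node
    sudo gnt-node migrate -f ganeti1005.eqiad.wmnet
    # if the node were completely dead, failing instances over to their
    # secondaries would be the other option (not what was done here)
    sudo gnt-node failover --ignore-consistency ganeti1005.eqiad.wmnet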
!log 13:56:20 ema: powercycle ganeti1005 T181121 [14:17:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:17:11] RECOVERY - LVS HTTP IPv4 on labweb.svc.eqiad.wmnet is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 453 bytes in 0.001 second response time [14:17:22] ema: it's not just ganeti1005. It's all the 4 hosts in that nodegroup (ganeti1005-ganeti1008) that have the problem [14:17:31] RECOVERY - PyBal backends health check on lvs1003 is OK: PYBAL OK - All pools are healthy [14:17:31] RECOVERY - PyBal backends health check on lvs1010 is OK: PYBAL OK - All pools are healthy [14:17:47] and up to now (that's anecdotal) it's almost always the host that runs bohrium that exhibits the problem [14:17:50] RECOVERY - PyBal backends health check on lvs1006 is OK: PYBAL OK - All pools are healthy [14:17:59] oh I see [14:18:00] due to bohrium being THE IO heavy VM [14:18:00] RECOVERY - PyBal IPVS diff check on lvs1006 is OK: OK: no difference between hosts in IPVS/PyBal [14:18:19] fair enough, so moving the vm to another host might just move the problem there instead [14:18:31] in fact the issues started on 08:03 today per the logs, which is just 3 minutes after the piwik archiver starts [14:19:00] RECOVERY - PyBal IPVS diff check on lvs1003 is OK: OK: no difference between hosts in IPVS/PyBal [14:19:12] it's very inconsistent to reproduce. I've been running IO heavy stuff on various boxes for days and I do not yet have a clear reproduction case [14:19:41] RECOVERY - PyBal IPVS diff check on lvs1010 is OK: OK: no difference between hosts in IPVS/PyBal [14:21:43] elukey: we've rebooted ganeti1005; kafkamon1001, which runs there, now has 3 failed units after booting up (burrow-analytics.service, burrow-jumbo-eqiad.service, burrow-main-eqiad.service) [14:24:50] RECOVERY - Work requests waiting in Zuul Gearman server on contint1001 is OK: OK: Less than 30.00% above the threshold [90.0] https://grafana.wikimedia.org/dashboard/db/zuul-gearman?panelId=10&fullscreen&orgId=1 [14:25:01] akosiaris: ok, nothing more to do now on the ganeti side I guess? [14:25:06] nope [14:25:16] unfortunately... I would love to pinpoint the problem [14:25:25] it's been messing with us > 3 months now [14:25:46] it's only present in that nodegroup btw (we got 4) [14:26:08] but anyway I digress, nothing more to do for now [14:28:42] alright! o/ [14:28:47] o/ [14:47:06] 10Operations, 10DBA, 10MediaWiki-General-or-Unknown, 10MW-1.31-release-notes (WMF-deploy-2018-02-20 (1.31.0-wmf.22)), and 2 others: Regularly purge expired temporary userrights from DB tables - https://phabricator.wikimedia.org/T176754#4020798 (10Reedy) I don't think there's even any point doing that. Very... [15:18:20] ema thanks!
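T181121 ("Kernels errors on ganeti1005-ganeti1008 under high I/O") is about page allocation stalls, which is what akosiaris refers to above. A quick way to check whether a node hit them before a hang, assuming a persistent journal or a populated kern.log:

    sudo journalctl -k -b -1 | grep -i 'page allocation stall'
    sudo grep -i 'page allocation stall' /var/log/kern.log*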
Just fixed them, there was an issue with stale pid files [15:18:49] will need to add process alerts for those [15:18:51] RECOVERY - Check systemd state on kafkamon1001 is OK: OK - running: The system is fully operational [15:21:55] akosiaris: this time for bohrium it might have been a special case since my team forced the re-process of the past week of data for all the smallish websites (the archiver by default does not retry on data older than 24h unless explicitly told so) [15:23:02] iirc the archiver was manually run after this, but it might have not done all the work in one go and started this morning [15:39:57] (03PS1) 10Elukey: burrow: ensure pid file is created under /var/run/burrow [puppet] - 10https://gerrit.wikimedia.org/r/416232 (https://phabricator.wikimedia.org/T180442) [15:42:12] (03PS2) 10Elukey: burrow: ensure pid file is created under /var/run/burrow [puppet] - 10https://gerrit.wikimedia.org/r/416232 (https://phabricator.wikimedia.org/T180442) [15:45:54] (03PS3) 10Elukey: burrow: ensure pid file is created under /var/run/burrow [puppet] - 10https://gerrit.wikimedia.org/r/416232 (https://phabricator.wikimedia.org/T180442) [15:49:34] (03CR) 10Elukey: [C: 032] "https://puppet-compiler.wmflabs.org/compiler02/10244/kafkamon2001.codfw.wmnet/" [puppet] - 10https://gerrit.wikimedia.org/r/416232 (https://phabricator.wikimedia.org/T180442) (owner: 10Elukey) [15:49:40] (03PS4) 10Elukey: burrow: ensure pid file is created under /var/run/burrow [puppet] - 10https://gerrit.wikimedia.org/r/416232 (https://phabricator.wikimedia.org/T180442) [15:54:10] PROBLEM - Check systemd state on kafkamon1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [15:54:17] lovely [15:56:05] createPidFile(appContext.Config.General.LogDir + "/" + appContext.Config.General.PIDFile) [15:56:18] ah so apparently it was meant to go in var/log [15:56:22] mmmm [15:56:35] (03PS1) 10Elukey: Revert "burrow: ensure pid file is created under /var/run/burrow" [puppet] - 10https://gerrit.wikimedia.org/r/416233 [15:57:05] (03CR) 10Elukey: [C: 032] Revert "burrow: ensure pid file is created under /var/run/burrow" [puppet] - 10https://gerrit.wikimedia.org/r/416233 (owner: 10Elukey) [15:57:46] will add monitoring on Monday, it is clearly not the right day for me :D [15:59:10] RECOVERY - Check systemd state on kafkamon1001 is OK: OK - running: The system is fully operational [15:59:18] * elukey is looking forward to upgrade Burrow to 1.0 [16:26:25] (03CR) 10Reedy: [C: 031] "[DNM] can be removed now" [puppet] - 10https://gerrit.wikimedia.org/r/382631 (https://phabricator.wikimedia.org/T176754) (owner: 10EddieGP) [16:46:51] (03PS8) 10EddieGP: Add cron job for expired userrights maintenance script [puppet] - 10https://gerrit.wikimedia.org/r/382631 (https://phabricator.wikimedia.org/T176754) [16:49:02] (03CR) 10EddieGP: "Sorry, wanted to re-add as reviewer, not set as assignee. But yes, it seems like this'd be ready now." 
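The kafkamon1001 cleanup described at 15:18 (stale pid files after the ungraceful reboot) and the reverted puppet change both come down to where Burrow writes its pid file: per the source line quoted at 15:56 it goes under the configured log directory, not /var/run. A sketch of the manual fix; the pid file path is an assumption based on that quoted code, not taken from the log:

    sudo systemctl --failed --no-legend | grep burrow
    # assumed location: general.logdir + "/" + general.pidfile, per the quoted Go line
    sudo rm -f /var/log/burrow/*/burrow.pid
    sudo systemctl restart burrow-analytics burrow-jumbo-eqiad burrow-main-eqiad
    sudo systemctl is-system-running   # should return to "running"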
[puppet] - 10https://gerrit.wikimedia.org/r/382631 (https://phabricator.wikimedia.org/T176754) (owner: 10EddieGP) [17:20:36] (03PS1) 10Andrew Bogott: labweb wikitech: add a few more apache confs [puppet] - 10https://gerrit.wikimedia.org/r/416235 (https://phabricator.wikimedia.org/T168470) [17:21:42] (03CR) 10Andrew Bogott: [C: 032] labweb wikitech: add a few more apache confs [puppet] - 10https://gerrit.wikimedia.org/r/416235 (https://phabricator.wikimedia.org/T168470) (owner: 10Andrew Bogott) [17:29:40] 10Operations, 10ops-codfw: attach furud's new arrays (furud-array[3-7]) - https://phabricator.wikimedia.org/T185153#4020984 (10Papaul) Connection complete [18:33:51] 10Operations, 10Analytics-Kanban, 10Patch-For-Review, 10Performance-Team (Radar), and 2 others: Deprecation of mw.errors.* metrics - https://phabricator.wikimedia.org/T188749#4021055 (10elukey) Thanks @Krinkle! @fgiunchedi I think we are ready to go, what do you think? [19:42:10] (03PS6) 10Sau226: Disable main page deletion on enwiktionary [mediawiki-config] - 10https://gerrit.wikimedia.org/r/414509 (https://phabricator.wikimedia.org/T184959) [20:12:32] (03PS1) 10Gilles: Point private Thumbor Swift user to existing user for now [puppet] - 10https://gerrit.wikimedia.org/r/416240 (https://phabricator.wikimedia.org/T188834) [20:29:02] 10Operations, 10ops-codfw: attach furud's new arrays (furud-array[3-7]) - https://phabricator.wikimedia.org/T185153#4021169 (10Papaul) [20:29:17] 10Operations, 10ops-codfw: attach furud's new arrays (furud-array[3-7]) - https://phabricator.wikimedia.org/T185153#3907805 (10Papaul) a:05Papaul>03faidon [20:31:59] 10Operations, 10ops-codfw: rack/setup/install mw2259-mw2290 - https://phabricator.wikimedia.org/T188301#4021171 (10Papaul) [20:47:36] 10Operations, 10ops-codfw: rack/setup/install mw2259-mw2290 - https://phabricator.wikimedia.org/T188301#4021180 (10Papaul) [21:13:23] 10Operations, 10ops-codfw: rack/setup/install mw2259-mw2290 - https://phabricator.wikimedia.org/T188301#4021198 (10Papaul) [23:55:07] (03PS1) 10EddieGP: Article counts: Change 'comma' method to 'any' [mediawiki-config] - 10https://gerrit.wikimedia.org/r/416330 (https://phabricator.wikimedia.org/T188472) [23:57:07] (03CR) 10EddieGP: "To be deployed after 1.31wmf23 (326d655 specifically)." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/416330 (https://phabricator.wikimedia.org/T188472) (owner: 10EddieGP)
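The cron reviewed in gerrit change 382631 (T176754) would run the expired-userrights purge regularly. What such a job boils down to is roughly the following; the script name is the 1.31 maintenance script the task added, and the foreachwiki/mwscript wrappers are the usual maintenance-host tooling — all assumed here rather than taken from the patch itself:

    # across all wikis, from a maintenance host
    /usr/local/bin/foreachwiki maintenance/purgeExpiredUserrights.php
    # or for a single wiki
    mwscript maintenance/purgeExpiredUserrights.php --wiki=enwiki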