[00:07:25] <icinga-wm>	 PROBLEM - puppet last run on elastic1017 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 7 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/usr/local/bin/apt2xml] https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[00:17:22] <wikibugs>	 (PS1) Dzahn: mediawiki::php::restarts: try to avoid including LVS but still get pools [puppet] - https://gerrit.wikimedia.org/r/527285
[00:29:00] <wikibugs>	 (CR) Dzahn: [C: -1] "WIP" [puppet] - https://gerrit.wikimedia.org/r/527285 (owner: Dzahn)
[00:34:09] <wikibugs>	 (PS1) Dzahn: add scandium as an app test server to conftool data [puppet] - https://gerrit.wikimedia.org/r/527291 (https://phabricator.wikimedia.org/T228069)
[00:35:19] <icinga-wm>	 RECOVERY - puppet last run on elastic1017 is OK: OK: Puppet is currently enabled, last run 5 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[00:36:17] <wikibugs>	 (CR) Dzahn: "Does it make sense to add it like a test server?" [puppet] - https://gerrit.wikimedia.org/r/527291 (https://phabricator.wikimedia.org/T228069) (owner: Dzahn)
[00:45:02] <wikibugs>	 (CR) Ayounsi: "> Patch Set 2:" [puppet] - https://gerrit.wikimedia.org/r/526819 (https://phabricator.wikimedia.org/T226331) (owner: Ayounsi)
[01:18:15] <icinga-wm>	 PROBLEM - puppet last run on mx2001 is CRITICAL: CRITICAL: Failed to apply catalog, zero resources tracked by Puppet. Could be an interrupted request or a dependency cycle. https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[01:22:35] <icinga-wm>	 PROBLEM - puppet last run on alsafi is CRITICAL: CRITICAL: Failed to apply catalog, zero resources tracked by Puppet. Could be an interrupted request or a dependency cycle. https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[01:40:35] <icinga-wm>	 RECOVERY - puppet last run on mx2001 is OK: OK: Puppet is currently enabled, last run 23 seconds ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[01:44:57] <icinga-wm>	 RECOVERY - puppet last run on alsafi is OK: OK: Puppet is currently enabled, last run 23 seconds ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[03:46:33] <icinga-wm>	 PROBLEM - High CPU load on API appserver on mw1225 is CRITICAL: CRITICAL - load average: 56.93, 29.25, 16.57 https://wikitech.wikimedia.org/wiki/Application_servers
[03:47:09] <icinga-wm>	 PROBLEM - High CPU load on API appserver on mw1222 is CRITICAL: CRITICAL - load average: 59.33, 35.98, 20.09 https://wikitech.wikimedia.org/wiki/Application_servers
[03:56:15] <icinga-wm>	 PROBLEM - High CPU load on API appserver on mw1225 is CRITICAL: CRITICAL - load average: 52.38, 37.43, 27.03 https://wikitech.wikimedia.org/wiki/Application_servers
[03:57:09] <icinga-wm>	 PROBLEM - High CPU load on API appserver on mw1226 is CRITICAL: CRITICAL - load average: 49.54, 34.70, 23.28 https://wikitech.wikimedia.org/wiki/Application_servers
[03:58:49] <icinga-wm>	 PROBLEM - High CPU load on API appserver on mw1233 is CRITICAL: CRITICAL - load average: 56.88, 36.93, 25.98 https://wikitech.wikimedia.org/wiki/Application_servers
[04:01:57] <icinga-wm>	 RECOVERY - High CPU load on API appserver on mw1226 is OK: OK - load average: 15.00, 23.64, 21.77 https://wikitech.wikimedia.org/wiki/Application_servers
[04:05:51] <icinga-wm>	 RECOVERY - High CPU load on API appserver on mw1225 is OK: OK - load average: 9.42, 18.47, 23.87 https://wikitech.wikimedia.org/wiki/Application_servers
[04:06:49] <icinga-wm>	 RECOVERY - High CPU load on API appserver on mw1233 is OK: OK - load average: 9.01, 19.98, 23.39 https://wikitech.wikimedia.org/wiki/Application_servers
[04:11:37] <wikibugs>	 (CR) Vgutierrez: [C: +2] acme_chief: Save the certificate using the 3 save modes on issuance time [software/acme-chief] - https://gerrit.wikimedia.org/r/526840 (https://phabricator.wikimedia.org/T229096) (owner: Vgutierrez)
[04:12:53] <icinga-wm>	 RECOVERY - High CPU load on API appserver on mw1222 is OK: OK - load average: 10.21, 12.15, 22.78 https://wikitech.wikimedia.org/wiki/Application_servers
[04:14:34] <wikibugs>	 (CR) jenkins-bot: acme_chief: Save the certificate using the 3 save modes on issuance time [software/acme-chief] - https://gerrit.wikimedia.org/r/526840 (https://phabricator.wikimedia.org/T229096) (owner: Vgutierrez)
[04:15:47] <icinga-wm>	 PROBLEM - Disk space on elastic1017 is CRITICAL: DISK CRITICAL - free space: /srv 28693 MB (5% inode=99%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=elastic1017&var-datasource=eqiad+prometheus/ops
[04:19:09] <icinga-wm>	 PROBLEM - PHP opcache health on mwdebug1001 is CRITICAL: CRITICAL: opcache free space is below 50 MB https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health
[04:19:36] <wikibugs>	 (PS1) Vgutierrez: Release 0.20 [software/acme-chief] - https://gerrit.wikimedia.org/r/527364 (https://phabricator.wikimedia.org/T229096)
[04:22:27] <icinga-wm>	 PROBLEM - MediaWiki exceptions and fatals per minute on graphite1004 is CRITICAL: CRITICAL: 70.00% of data above the critical threshold [50.0] https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=2&fullscreen
[04:24:03] <icinga-wm>	 RECOVERY - MediaWiki exceptions and fatals per minute on graphite1004 is OK: OK: Less than 70.00% above the threshold [25.0] https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=2&fullscreen
[04:25:27] <icinga-wm>	 RECOVERY - Disk space on elastic1017 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=elastic1017&var-datasource=eqiad+prometheus/ops
[04:29:57] <wikibugs>	 (PS4) Vgutierrez: fifo-log-demux: Keep attempting to read the FIFO after EOF [software/fifo-log-demux] - https://gerrit.wikimedia.org/r/527013
[04:30:12] <wikibugs>	 (CR) Vgutierrez: fifo-log-demux: Keep attempting to read the FIFO after EOF (1 comment) [software/fifo-log-demux] - https://gerrit.wikimedia.org/r/527013 (owner: Vgutierrez)
[04:30:49] <wikibugs>	 (CR) Vgutierrez: [C: +2] Release 0.20 [software/acme-chief] - https://gerrit.wikimedia.org/r/527364 (https://phabricator.wikimedia.org/T229096) (owner: Vgutierrez)
[04:33:45] <wikibugs>	 (CR) jenkins-bot: Release 0.20 [software/acme-chief] - https://gerrit.wikimedia.org/r/527364 (https://phabricator.wikimedia.org/T229096) (owner: Vgutierrez)
[04:35:17] <icinga-wm>	 PROBLEM - MediaWiki exceptions and fatals per minute on graphite1004 is CRITICAL: CRITICAL: 70.00% of data above the critical threshold [50.0] https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=2&fullscreen
[04:38:03] <wikibugs>	 (PS1) Vgutierrez: acme_chief: Save the certificate using the 3 save modes on issuance time [software/acme-chief] (debian) - https://gerrit.wikimedia.org/r/527373 (https://phabricator.wikimedia.org/T229096)
[04:38:05] <wikibugs>	 (PS1) Vgutierrez: Release 0.20 [software/acme-chief] (debian) - https://gerrit.wikimedia.org/r/527374 (https://phabricator.wikimedia.org/T229096)
[04:38:07] <wikibugs>	 (PS1) Vgutierrez: debian: Add release 0.20 to changelog [software/acme-chief] (debian) - https://gerrit.wikimedia.org/r/527375 (https://phabricator.wikimedia.org/T229096)
[04:38:31] <icinga-wm>	 RECOVERY - MediaWiki exceptions and fatals per minute on graphite1004 is OK: OK: Less than 70.00% above the threshold [25.0] https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=2&fullscreen
[04:44:33] <wikibugs>	 (CR) Vgutierrez: [C: +2] acme_chief: Save the certificate using the 3 save modes on issuance time [software/acme-chief] (debian) - https://gerrit.wikimedia.org/r/527373 (https://phabricator.wikimedia.org/T229096) (owner: Vgutierrez)
[04:44:37] <wikibugs>	 (CR) Vgutierrez: [C: +2] Release 0.20 [software/acme-chief] (debian) - https://gerrit.wikimedia.org/r/527374 (https://phabricator.wikimedia.org/T229096) (owner: Vgutierrez)
[04:47:16] <wikibugs>	 (CR) Vgutierrez: [C: +2] debian: Add release 0.20 to changelog [software/acme-chief] (debian) - https://gerrit.wikimedia.org/r/527375 (https://phabricator.wikimedia.org/T229096) (owner: Vgutierrez)
[04:47:19] <wikibugs>	 (Merged) jenkins-bot: acme_chief: Save the certificate using the 3 save modes on issuance time [software/acme-chief] (debian) - https://gerrit.wikimedia.org/r/527373 (https://phabricator.wikimedia.org/T229096) (owner: Vgutierrez)
[04:47:22] <wikibugs>	 (Merged) jenkins-bot: Release 0.20 [software/acme-chief] (debian) - https://gerrit.wikimedia.org/r/527374 (https://phabricator.wikimedia.org/T229096) (owner: Vgutierrez)
[04:51:05] <wikibugs>	 (CR) jenkins-bot: acme_chief: Save the certificate using the 3 save modes on issuance time [software/acme-chief] (debian) - https://gerrit.wikimedia.org/r/527373 (https://phabricator.wikimedia.org/T229096) (owner: Vgutierrez)
[04:51:09] <wikibugs>	 (Merged) jenkins-bot: debian: Add release 0.20 to changelog [software/acme-chief] (debian) - https://gerrit.wikimedia.org/r/527375 (https://phabricator.wikimedia.org/T229096) (owner: Vgutierrez)
[04:51:30] <wikibugs>	 (CR) jenkins-bot: Release 0.20 [software/acme-chief] (debian) - https://gerrit.wikimedia.org/r/527374 (https://phabricator.wikimedia.org/T229096) (owner: Vgutierrez)
[04:54:54] <wikibugs>	 (CR) jenkins-bot: debian: Add release 0.20 to changelog [software/acme-chief] (debian) - https://gerrit.wikimedia.org/r/527375 (https://phabricator.wikimedia.org/T229096) (owner: Vgutierrez)
[05:01:24] <wikibugs>	 (PS3) Marostegui: wmnet: Point m2-master.codfw to dbproxy2002 [dns] - https://gerrit.wikimedia.org/r/527114 (https://phabricator.wikimedia.org/T176532)
[05:02:28] <wikibugs>	 (CR) Marostegui: [C: +2] wmnet: Point m2-master.codfw to dbproxy2002 [dns] - https://gerrit.wikimedia.org/r/527114 (https://phabricator.wikimedia.org/T176532) (owner: Marostegui)
[05:06:55] <marostegui>	 !log Remove db2058 from tendril and zarcillo T229543
[05:07:04] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:07:05] <stashbot>	 T229543: Decommission db2058.codfw.wmnet - https://phabricator.wikimedia.org/T229543
[05:07:31] <wikibugs>	 (PS1) Marostegui: mariadb: Decommission db2058 [puppet] - https://gerrit.wikimedia.org/r/527382 (https://phabricator.wikimedia.org/T229543)
[05:10:07] <wikibugs>	 (CR) Marostegui: [C: +2] mariadb: Decommission db2058 [puppet] - https://gerrit.wikimedia.org/r/527382 (https://phabricator.wikimedia.org/T229543) (owner: Marostegui)
[05:10:42] <marostegui>	 !log Stop MySQL on db2058 for decommissioning T229543
[05:10:50] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:21:06] <vgutierrez>	 !log uploaded acme-chief 0.20 to apt.wikimedia.org (buster) - T229096
[05:21:15] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:21:16] <stashbot>	 T229096: Provide the three cert types (chain-only, cert only and chained) as soon as we get the certificate issued - https://phabricator.wikimedia.org/T229096
[05:23:16] <wikibugs>	 (PS1) Marostegui: mariadb: Specify candidate masters [puppet] - https://gerrit.wikimedia.org/r/527383
[05:24:32] <wikibugs>	 (PS2) Marostegui: mariadb: Specify candidate masters (codfw) [puppet] - https://gerrit.wikimedia.org/r/527383
[05:25:42] <wikibugs>	 (CR) Marostegui: [C: +2] mariadb: Specify candidate masters (codfw) [puppet] - https://gerrit.wikimedia.org/r/527383 (owner: Marostegui)
[05:26:03] <icinga-wm>	 PROBLEM - Nginx local proxy to apache on mw1285 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 1308 bytes in 1.914 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[05:27:15] <icinga-wm>	 RECOVERY - Nginx local proxy to apache on mw1285 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 618 bytes in 0.117 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[05:37:16] <wikibugs>	 (PS1) Marostegui: mariadb: Provision db2124 into s6 [puppet] - https://gerrit.wikimedia.org/r/527386 (https://phabricator.wikimedia.org/T228969)
[05:39:10] <wikibugs>	 (CR) Marostegui: [C: +2] mariadb: Provision db2124 into s6 [puppet] - https://gerrit.wikimedia.org/r/527386 (https://phabricator.wikimedia.org/T228969) (owner: Marostegui)
[06:11:12] <wikibugs>	 (PS3) Ema: misc-common: piwik cookies should not block caching either [puppet] - https://gerrit.wikimedia.org/r/473299 (owner: BBlack)
[06:21:42] <wikibugs>	 (CR) Ema: [C: +1] "LGTM but elukey should confirm!" [puppet] - https://gerrit.wikimedia.org/r/473299 (owner: BBlack)
[06:23:10] <wikibugs>	 (CR) Giuseppe Lavagetto: [C: -2] "scandium has no lvs, and thus should not be in conftool-data as it won't serve live traffic, we need to add it to the dsh list manually in" [puppet] - https://gerrit.wikimedia.org/r/527291 (https://phabricator.wikimedia.org/T228069) (owner: Dzahn)
[06:28:57] <icinga-wm>	 PROBLEM - puppet last run on acmechief1001 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 3 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/usr/local/bin/prometheus-puppet-agent-stats] https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[06:30:24] <vgutierrez>	 uh?
[06:32:39] <icinga-wm>	 PROBLEM - puppet last run on elastic2054 is CRITICAL: CRITICAL: Failed to apply catalog, zero resources tracked by Puppet. Could be an interrupted request or a dependency cycle. https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[06:38:41] <wikibugs>	 (CR) Ema: [C: +2] misc-common: piwik cookies should not block caching either [puppet] - https://gerrit.wikimedia.org/r/473299 (owner: BBlack)
[06:40:03] <icinga-wm>	 RECOVERY - puppet last run on acmechief1001 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[06:46:52] <vgutierrez>	 !log upgrading acme-chief to version 0.20 in acme-chief test instances - T229096
[06:47:01] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:47:02] <stashbot>	 T229096: Provide the three cert types (chain-only, cert only and chained) as soon as we get the certificate issued - https://phabricator.wikimedia.org/T229096
[06:48:43] <wikibugs>	 (PS1) Giuseppe Lavagetto: role::parsoid::testing: remove unnecessary php additions [puppet] - https://gerrit.wikimedia.org/r/527414
[06:50:23] <wikibugs>	 (PS2) Giuseppe Lavagetto: role::parsoid::testing: remove unnecessary php additions [puppet] - https://gerrit.wikimedia.org/r/527414 (https://phabricator.wikimedia.org/T228069)
[06:51:19] <wikibugs>	 (CR) Giuseppe Lavagetto: [C: +2] role::parsoid::testing: remove unnecessary php additions [puppet] - https://gerrit.wikimedia.org/r/527414 (https://phabricator.wikimedia.org/T228069) (owner: Giuseppe Lavagetto)
[06:51:28] <_joe_>	 is it me or ci is horribly slow?
[06:54:41] <wikibugs>	 (CR) Giuseppe Lavagetto: [C: -2] "see  I61cf3f for the correct way to include scandium in dsh." [puppet] - https://gerrit.wikimedia.org/r/527291 (https://phabricator.wikimedia.org/T228069) (owner: Dzahn)
[07:00:09] <wikibugs>	 (PS1) Marostegui: db-eqiad,db-codfw.php: Pool db2124 [mediawiki-config] - https://gerrit.wikimedia.org/r/527422 (https://phabricator.wikimedia.org/T228969)
[07:00:13] <_joe_>	 !log running systemd-tmpfiles --create nutcracker.conf on scandium
[07:00:20] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:00:35] <icinga-wm>	 RECOVERY - puppet last run on elastic2054 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[07:03:23] <wikibugs>	 (CR) Ema: [C: +1] "Looks good!" [software/fifo-log-demux] - https://gerrit.wikimedia.org/r/527013 (owner: Vgutierrez)
[07:05:57] <icinga-wm>	 PROBLEM - High CPU load on API appserver on mw1222 is CRITICAL: CRITICAL - load average: 49.31, 23.83, 14.54 https://wikitech.wikimedia.org/wiki/Application_servers
[07:10:21] <icinga-wm>	 PROBLEM - High CPU load on API appserver on mw1225 is CRITICAL: CRITICAL - load average: 51.13, 31.67, 19.18 https://wikitech.wikimedia.org/wiki/Application_servers
[07:11:09] <icinga-wm>	 PROBLEM - High CPU load on API appserver on mw1231 is CRITICAL: CRITICAL - load average: 53.36, 36.73, 22.70 https://wikitech.wikimedia.org/wiki/Application_servers
[07:11:20] <wikibugs>	 (PS1) Giuseppe Lavagetto: systemd::tmpfile: apply changes when we change the files. [puppet] - https://gerrit.wikimedia.org/r/527430 (https://phabricator.wikimedia.org/T204450)
[07:11:21] <_joe_>	 sigh
[07:12:53] <wikibugs>	 (PS1) Elukey: cdh::hive: add hive.server2.logging.operation.enabled [puppet] - https://gerrit.wikimedia.org/r/527433 (https://phabricator.wikimedia.org/T227257)
[07:13:18] <wikibugs>	 (PS2) Elukey: cdh::hive: add hive.server2.logging.operation.enabled [puppet] - https://gerrit.wikimedia.org/r/527433 (https://phabricator.wikimedia.org/T227257)
[07:14:27] <wikibugs>	 (CR) Elukey: [C: +2] cdh::hive: add hive.server2.logging.operation.enabled [puppet] - https://gerrit.wikimedia.org/r/527433 (https://phabricator.wikimedia.org/T227257) (owner: Elukey)
[07:17:53] <icinga-wm>	 PROBLEM - High CPU load on API appserver on mw1234 is CRITICAL: CRITICAL - load average: 54.86, 37.48, 26.53 https://wikitech.wikimedia.org/wiki/Application_servers
[07:17:55] <icinga-wm>	 PROBLEM - High CPU load on API appserver on mw1226 is CRITICAL: CRITICAL - load average: 64.57, 38.85, 24.49 https://wikitech.wikimedia.org/wiki/Application_servers
[07:19:41] <wikibugs>	 (PS1) Vgutierrez: fifo-log-demux: Fix EPIPE check [software/fifo-log-demux] - https://gerrit.wikimedia.org/r/527435
[07:21:27] <marostegui>	 !log Add db2124 to tendril and zarcillo T228969
[07:21:35] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:21:36] <stashbot>	 T228969: Productionize db21[21-30} - https://phabricator.wikimedia.org/T228969
[07:21:48] <_joe_>	 ok, those appservers
[07:22:02] <wikibugs>	 (PS2) Marostegui: db-eqiad,db-codfw.php: Pool db2129 [mediawiki-config] - https://gerrit.wikimedia.org/r/527422 (https://phabricator.wikimedia.org/T228969)
[07:22:07] <_joe_>	 can someone look? I'm looking at another problem rn
[07:22:20] <marostegui>	 _joe_: I can take a look
[07:24:50] <elukey>	 marostegui: very interesting https://grafana.wikimedia.org/d/000000002/api-backend-summary?refresh=5m&orgId=1
[07:25:14] <wikibugs>	 (CR) Vgutierrez: [C: +1] db-eqiad,db-codfw.php: Pool db2129 [mediawiki-config] - https://gerrit.wikimedia.org/r/527422 (https://phabricator.wikimedia.org/T228969) (owner: Marostegui)
[07:25:41] <_joe_>	 it's parsoid-batch again
[07:25:59] <elukey>	 now I am wondering - is it something that hits hard hhvm for some reason, but not php-fpm?
[07:26:01] <marostegui>	 elukey: yeah I was checking that https://grafana.wikimedia.org/d/000000377/host-overview?refresh=5m&orgId=1&var-server=mw1226&var-datasource=eqiad%20prometheus%2Fops&var-cluster=api_appserver kinda matches the SAL entry from _joe_ but it is probably a coincidence
[07:26:02] <_joe_>	 we need to open a task and involve core platform in the investigation
[07:26:09] <_joe_>	 elukey: based on what?
[07:26:10] <marostegui>	 vgutierrez: thanks
[07:26:22] <vgutierrez>	 np <3
[07:26:41] <_joe_>	 yeah marostegui my log entry is for something not in production
[07:26:51] <marostegui>	 yep
[07:26:54] <elukey>	 _joe_ I am wondering out loud, not based on anything. It could be good to check. If so, we are migrating slowly to php7 only..
[07:27:05] <_joe_>	 elukey: mw1347 is php only
[07:27:09] <_joe_>	 so is mw1348
[07:27:15] <_joe_>	 you can check if they're affected
[07:27:50] <elukey>	 not this time afaics from icinga, and not the last one too
[07:28:03] <marostegui>	 mw1226 has hhvm
[07:28:19] <icinga-wm>	 PROBLEM - High CPU load on API appserver on mw1232 is CRITICAL: CRITICAL - load average: 50.90, 33.95, 25.52 https://wikitech.wikimedia.org/wiki/Application_servers
[07:32:13] <icinga-wm>	 PROBLEM - High CPU load on API appserver on mw1234 is CRITICAL: CRITICAL - load average: 43.70, 36.93, 30.41 https://wikitech.wikimedia.org/wiki/Application_servers
[07:36:22] <_joe_>	 https://grafana.wikimedia.org/d/RIA1lzDZk/xxx-joe-appserver?orgId=1&from=1564728500616&to=1564731349240&var-datasource=eqiad%20prometheus%2Fops&var-cluster=api_appserver&var-instance=mw1347:3903&var-method=GET&var-code=200 response times are degrading
[07:36:29] <_joe_>	 but nothing too horrible
[07:38:39] <_joe_>	 !log disabling puppet on mw1270 for testing of different php settings
[07:38:46] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:40:12] <_joe_>	 !log restarting php-fpm on mw1270, with 80 pms - static, apc 6 GB no ttl
[07:40:19] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:41:05] <icinga-wm>	 PROBLEM - HHVM rendering on mw1227 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers
[07:41:20] <wikibugs>	 (CR) ArielGlenn: [C: +1] db-eqiad,db-codfw.php: Pool db2129 [mediawiki-config] - https://gerrit.wikimedia.org/r/527422 (https://phabricator.wikimedia.org/T228969) (owner: Marostegui)
[07:42:07] <marostegui>	 _joe_: what can we do to help?
[07:42:16] <marostegui>	 thanks apergos 
[07:42:25] <_joe_>	 marostegui: help with what?
[07:42:32] <marostegui>	 _joe_: with the app servers
[07:42:33] <icinga-wm>	 RECOVERY - HHVM rendering on mw1227 is OK: HTTP OK: HTTP/1.1 200 OK - 81035 bytes in 0.134 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[07:42:37] <_joe_>	 the api issue? restarting hhvm 
[07:42:41] <_joe_>	 on those servers
[07:42:45] <marostegui>	 ok!
[07:43:07] <_joe_>	 sorry I'm trying to understand what went wrong with mw1270 last night
[07:43:12] <marostegui>	 !log Restart hhvm on mw1226
[07:43:19] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:43:19] <marostegui>	 _joe_: No worries, I just wanted to help :)
[07:43:29] <_joe_>	 being the only appserver (non-api) fully on php7, it's worrisome
[07:43:42] <_joe_>	 marostegui: you have too much free time now!
[07:43:48] <apergos>	 hahahahaha
[07:43:55] <marostegui>	 I actually have to push a change to mwconfig! :)
[07:45:44] <marostegui>	 mw1226 looking good now
[07:46:41] <wikibugs>	 (PS1) Marostegui: db2129: Enable notifications [puppet] - https://gerrit.wikimedia.org/r/527441 (https://phabricator.wikimedia.org/T228969)
[07:48:57] <wikibugs>	 (CR) Marostegui: [C: +2] db2129: Enable notifications [puppet] - https://gerrit.wikimedia.org/r/527441 (https://phabricator.wikimedia.org/T228969) (owner: Marostegui)
[07:48:59] <wikibugs>	 (CR) Ema: [C: -1] "Tested on traffic-upload-stretch.traffic.eqiad.wmflabs, error is still there." [software/fifo-log-demux] - https://gerrit.wikimedia.org/r/527435 (owner: Vgutierrez)
[07:49:32] <wikibugs>	 (CR) Marostegui: [C: +2] db-eqiad,db-codfw.php: Pool db2129 [mediawiki-config] - https://gerrit.wikimedia.org/r/527422 (https://phabricator.wikimedia.org/T228969) (owner: Marostegui)
[07:51:00] <wikibugs>	 (Merged) jenkins-bot: db-eqiad,db-codfw.php: Pool db2129 [mediawiki-config] - https://gerrit.wikimedia.org/r/527422 (https://phabricator.wikimedia.org/T228969) (owner: Marostegui)
[07:51:17] <wikibugs>	 (CR) jenkins-bot: db-eqiad,db-codfw.php: Pool db2129 [mediawiki-config] - https://gerrit.wikimedia.org/r/527422 (https://phabricator.wikimedia.org/T228969) (owner: Marostegui)
[07:52:09] <icinga-wm>	 RECOVERY - High CPU load on API appserver on mw1232 is OK: OK - load average: 8.14, 14.67, 23.11 https://wikitech.wikimedia.org/wiki/Application_servers
[07:52:10] <logmsgbot>	 !log marostegui@deploy1001 Synchronized wmf-config/db-codfw.php: Add db2129 to the config T228969 (duration: 00m 47s)
[07:52:15] <icinga-wm>	 PROBLEM - Mobileapps LVS codfw on mobileapps.svc.codfw.wmnet is CRITICAL: /{domain}/v1/page/most-read/{year}/{month}/{day} (retrieve the most read articles for January 1, 2016) is CRITICAL: Test retrieve the most read articles for January 1, 2016 returned the unexpected status 504 (expecting: 200) https://wikitech.wikimedia.org/wiki/Mobileapps_%28service%29
[07:52:25] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:52:26] <stashbot>	 T228969: Productionize db21[21-30} - https://phabricator.wikimedia.org/T228969
[07:53:02] <logmsgbot>	 !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Add db2129 to the config T228969 (duration: 00m 47s)
[07:53:10] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:53:49] <icinga-wm>	 PROBLEM - mobileapps endpoints health on scb2003 is CRITICAL: /{domain}/v1/page/most-read/{year}/{month}/{day} (retrieve the most read articles for January 1, 2016) is CRITICAL: Test retrieve the most read articles for January 1, 2016 returned the unexpected status 504 (expecting: 200): /{domain}/v1/page/summary/{title} (Get summary for test page) is CRITICAL: Test Get summary for test page returned the unexpected status 504 (exp
[07:53:49] <icinga-wm>	 s://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps
[07:53:51] <icinga-wm>	 RECOVERY - Mobileapps LVS codfw on mobileapps.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Mobileapps_%28service%29
[07:55:25] <icinga-wm>	 RECOVERY - mobileapps endpoints health on scb2003 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps
[07:55:49] <logmsgbot>	 !log marostegui@cumin2001 dbctl commit of MediaWiki config (dc=all), diff saved to 'https://phabricator.wikimedia.org/P8852', previous config saved to /var/cache/conftool/dbconfig/20190802-075548-marostegui.json
[07:56:03] <icinga-wm>	 RECOVERY - High CPU load on API appserver on mw1234 is OK: OK - load average: 8.22, 12.18, 22.84 https://wikitech.wikimedia.org/wiki/Application_servers
[07:56:06] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:04:02] <wikibugs>	 (PS2) Vgutierrez: fifo-log-demux: Fix EPIPE check [software/fifo-log-demux] - https://gerrit.wikimedia.org/r/527435
[08:04:07] <icinga-wm>	 RECOVERY - High CPU load on API appserver on mw1226 is OK: OK - load average: 15.48, 15.83, 23.71 https://wikitech.wikimedia.org/wiki/Application_servers
[08:07:07] <icinga-wm>	 PROBLEM - Disk space on elastic1017 is CRITICAL: DISK CRITICAL - free space: /srv 27833 MB (5% inode=99%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=elastic1017&var-datasource=eqiad+prometheus/ops
[08:08:00] <wikibugs>	 (PS1) Marostegui: dbctl_client.pp: Remove dbctl diff alerts [puppet] - https://gerrit.wikimedia.org/r/527451 (https://phabricator.wikimedia.org/T197126)
[08:08:31] <icinga-wm>	 RECOVERY - High CPU load on API appserver on mw1231 is OK: OK - load average: 9.26, 13.00, 22.72 https://wikitech.wikimedia.org/wiki/Application_servers
[08:09:17] <icinga-wm>	 RECOVERY - High CPU load on API appserver on mw1225 is OK: OK - load average: 8.81, 11.28, 23.34 https://wikitech.wikimedia.org/wiki/Application_servers
[08:09:43] <icinga-wm>	 PROBLEM - dbctl differs from mediawiki-config in codfw- did you forget to update both- on cumin2001 is CRITICAL: Mismatched loads for section s6: diff {(db2129, 400)} -- PHP {db2053: 50, db2060: 100, db2046: 0, db2076: 400, db2089:3316: 100, db2087:3316: 100, db2067: 100, db2114: 400, db2117: 400} vs dbctl {db2129: 400, db2053: 50, db2060: 100, db2046: 0, db2076: 400, db2089:3316: 100, db2087:3316: 100, db2067: 100, db2114: 400, 
[08:09:43] <icinga-wm>	 s://wikitech.wikimedia.org/wiki/Dbctl%23Configuration_deltas_vs_PHP
[08:09:52] <marostegui>	 I will ack that alert for now
[08:10:07] <marostegui>	 Or actually I can just commit the change to clear it, it is just one line
[08:11:17] <icinga-wm>	 RECOVERY - High CPU load on API appserver on mw1222 is OK: OK - load average: 7.63, 10.13, 23.07 https://wikitech.wikimedia.org/wiki/Application_servers
[08:12:51] <wikibugs>	 (PS1) Marostegui: db-codfw.php: Add db2129 to s6 [mediawiki-config] - https://gerrit.wikimedia.org/r/527453
[08:14:34] <wikibugs>	 (CR) Marostegui: [C: +2] db-codfw.php: Add db2129 to s6 [mediawiki-config] - https://gerrit.wikimedia.org/r/527453 (owner: Marostegui)
[08:15:29] <wikibugs>	 (Merged) jenkins-bot: db-codfw.php: Add db2129 to s6 [mediawiki-config] - https://gerrit.wikimedia.org/r/527453 (owner: Marostegui)
[08:15:35] <wikibugs>	 (PS1) Marostegui: db2129: Clarify it will be the candidate master for s6 [puppet] - https://gerrit.wikimedia.org/r/527454
[08:16:28] <wikibugs>	 (CR) jenkins-bot: db-codfw.php: Add db2129 to s6 [mediawiki-config] - https://gerrit.wikimedia.org/r/527453 (owner: Marostegui)
[08:16:33] <logmsgbot>	 !log marostegui@deploy1001 Synchronized wmf-config/db-codfw.php: Add db2129 to s6 (duration: 00m 46s)
[08:16:40] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:19:51] <icinga-wm>	 RECOVERY - Disk space on elastic1017 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=elastic1017&var-datasource=eqiad+prometheus/ops
[08:22:19] <wikibugs>	 (CR) Volans: [C: -1] "This is already taken care in Ie407f61f6cb09deb9311d0d5cb4b18e0aca5eacf ;)" [puppet] - https://gerrit.wikimedia.org/r/527451 (https://phabricator.wikimedia.org/T197126) (owner: Marostegui)
[08:22:53] <wikibugs>	 (CR) Marostegui: "<3" [puppet] - https://gerrit.wikimedia.org/r/527451 (https://phabricator.wikimedia.org/T197126) (owner: Marostegui)
[08:22:59] <icinga-wm>	 RECOVERY - dbctl differs from mediawiki-config in codfw- did you forget to update both- on cumin2001 is OK: OK: configurations match https://wikitech.wikimedia.org/wiki/Dbctl%23Configuration_deltas_vs_PHP
[08:23:02] <wikibugs>	 (Abandoned) Marostegui: dbctl_client.pp: Remove dbctl diff alerts [puppet] - https://gerrit.wikimedia.org/r/527451 (https://phabricator.wikimedia.org/T197126) (owner: Marostegui)
[08:26:40] <wikibugs>	 (CR) Marostegui: [C: +2] db2129: Clarify it will be the candidate master for s6 [puppet] - https://gerrit.wikimedia.org/r/527454 (owner: Marostegui)
[08:31:23] <wikibugs>	 (PS1) Marostegui: wmnet: Add m1-master for codfw [dns] - https://gerrit.wikimedia.org/r/527462 (https://phabricator.wikimedia.org/T202367)
[08:36:19] <wikibugs>	 (PS1) Filippo Giunchedi: monitoring: fix HTTP availability dashboard links [puppet] - https://gerrit.wikimedia.org/r/527465 (https://phabricator.wikimedia.org/T228878)
[08:40:23] <icinga-wm>	 PROBLEM - Disk space on analytics1043 is CRITICAL: DISK CRITICAL - /sys/kernel/debug/tracing is not accessible: Permission denied https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=analytics1043&var-datasource=eqiad+prometheus/ops
[08:41:03] <icinga-wm>	 PROBLEM - MediaWiki exceptions and fatals per minute on graphite1004 is CRITICAL: CRITICAL: 80.00% of data above the critical threshold [50.0] https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=2&fullscreen
[08:41:42] <wikibugs>	 (03PS3) 10Vgutierrez: fifo-log-demux: Fix EPIPE check [software/fifo-log-demux] - 10https://gerrit.wikimedia.org/r/527435
[08:42:37] <icinga-wm>	 RECOVERY - MediaWiki exceptions and fatals per minute on graphite1004 is OK: OK: Less than 70.00% above the threshold [25.0] https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=2&fullscreen
[08:45:39] <elukey>	 elukey@analytics1043:~$ sudo ls -ld /sys/kernel/debug/tracing
[08:45:40] <elukey>	 drwx------ 6 root root 0 Jun  7 09:13 /sys/kernel/debug/tracing
[08:45:41] <elukey>	 mmmmm
[08:46:40] <wikibugs>	 (03CR) 10Alexandros Kosiaris: [V: 03+2 C: 03+2] restrouter: Add helmfile stanzas [deployment-charts] - 10https://gerrit.wikimedia.org/r/526719 (https://phabricator.wikimedia.org/T223953) (owner: 10Alexandros Kosiaris)
[08:47:02] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C: 03+2] "PCC https://puppet-compiler.wmflabs.org/compiler1001/17718/" [puppet] - 10https://gerrit.wikimedia.org/r/527465 (https://phabricator.wikimedia.org/T228878) (owner: 10Filippo Giunchedi)
[08:47:12] <wikibugs>	 (03PS5) 10Alexandros Kosiaris: restrouter: Add kubernetes stanzas [puppet] - 10https://gerrit.wikimedia.org/r/526632 (https://phabricator.wikimedia.org/T223953)
[08:56:58] <logmsgbot>	 !log @ helmfile [STAGING] Ran 'apply' command on namespace 'kube-system' for release 'calico-policy-controller' .
[08:57:07] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:57:11] <logmsgbot>	 !log akosiaris@ helmfile [STAGING] Ran 'apply' command on namespace 'kube-system' for release 'calico-policy-controller' .
[08:57:18] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:05:32] <wikibugs>	 (03PS1) 10Ladsgroup: Revert "Revert "Revert "Switch property terms migration to WRITE_NEW on production wikidata""" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/527477
[09:05:37] <wikibugs>	 (03CR) 10Ladsgroup: [C: 03+2] Revert "Revert "Revert "Switch property terms migration to WRITE_NEW on production wikidata""" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/527477 (owner: 10Ladsgroup)
[09:12:44] <icinga-wm>	 RECOVERY - Disk space on analytics1043 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=analytics1043&var-datasource=eqiad+prometheus/ops
[09:12:48] <elukey>	 !log umount /sys/kernel/debug/tracing on analytics1043
[09:12:54] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:13:07] <wikibugs>	 (03PS1) 10Alexandros Kosiaris: restrouter: Fix typo in suffixes in admin [deployment-charts] - 10https://gerrit.wikimedia.org/r/527480
[09:14:40] <wikibugs>	 (03PS2) 10Ladsgroup: Revert "Revert "Revert "Switch property terms migration to WRITE_NEW on production wikidata""" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/527477
[09:14:42] <wikibugs>	 (03CR) 10Ladsgroup: [C: 03+2] Revert "Revert "Revert "Switch property terms migration to WRITE_NEW on production wikidata""" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/527477 (owner: 10Ladsgroup)
[09:15:02] <wikibugs>	 (03Merged) 10jenkins-bot: Revert "Revert "Revert "Switch property terms migration to WRITE_NEW on production wikidata""" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/527477 (owner: 10Ladsgroup)
[09:16:40] <wikibugs>	 (03CR) 10jenkins-bot: Revert "Revert "Revert "Switch property terms migration to WRITE_NEW on production wikidata""" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/527477 (owner: 10Ladsgroup)
[09:17:05] <logmsgbot>	 !log ladsgroup@deploy1001 Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:526657|Revert: Switch property terms migration to WRITE_NEW on production wikidata (T225053)]] (duration: 00m 48s)
[09:17:12] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:17:13] <stashbot>	 T225053: Switch `tmpPropertyTermsMigrationStage` to MIGRATION_WRITE_NEW - https://phabricator.wikimedia.org/T225053
[09:22:05] <marostegui>	 !log Compress s7 on labsdb1010 - T222978
[09:22:13] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:22:14] <stashbot>	 T222978: Compress and defragment tables on labsdb hosts - https://phabricator.wikimedia.org/T222978
[09:23:21] <wikibugs>	 (03CR) 10Alexandros Kosiaris: [C: 03+2] restrouter: Add kubernetes stanzas [puppet] - 10https://gerrit.wikimedia.org/r/526632 (https://phabricator.wikimedia.org/T223953) (owner: 10Alexandros Kosiaris)
[09:25:49] <wikibugs>	 (03CR) 10Filippo Giunchedi: "> Patch Set 2:" [puppet] - 10https://gerrit.wikimedia.org/r/526819 (https://phabricator.wikimedia.org/T226331) (owner: 10Ayounsi)
[09:34:16] <wikibugs>	 (03PS1) 10Giuseppe Lavagetto: Add the mediawiki.restart_appservers cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/527487
[09:35:55] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] Add the mediawiki.restart_appservers cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/527487 (owner: 10Giuseppe Lavagetto)
[09:37:17] <wikibugs>	 (03PS4) 10Filippo Giunchedi: prometheus: query pdu resources based on model [puppet] - 10https://gerrit.wikimedia.org/r/526634 (https://phabricator.wikimedia.org/T148541)
[09:38:11] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C: 03+2] prometheus: query pdu resources based on model [puppet] - 10https://gerrit.wikimedia.org/r/526634 (https://phabricator.wikimedia.org/T148541) (owner: 10Filippo Giunchedi)
[09:41:34] <wikibugs>	 (03PS3) 10Filippo Giunchedi: prometheus: generate targets for sentry4 PDUs too [puppet] - 10https://gerrit.wikimedia.org/r/526640 (https://phabricator.wikimedia.org/T148541)
[09:41:45] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C: 03+2] prometheus: generate targets for sentry4 PDUs too [puppet] - 10https://gerrit.wikimedia.org/r/526640 (https://phabricator.wikimedia.org/T148541) (owner: 10Filippo Giunchedi)
[09:56:01] <icinga-wm>	 PROBLEM - High CPU load on API appserver on mw1225 is CRITICAL: CRITICAL - load average: 53.14, 25.33, 17.41 https://wikitech.wikimedia.org/wiki/Application_servers
[09:56:23] <wikibugs>	 (03PS1) 10Filippo Giunchedi: prometheus: skip duplicates when generating pdu configuration [puppet] - 10https://gerrit.wikimedia.org/r/527498 (https://phabricator.wikimedia.org/T148541)
[09:56:53] <icinga-wm>	 PROBLEM - High CPU load on API appserver on mw1233 is CRITICAL: CRITICAL - load average: 67.69, 39.26, 24.95 https://wikitech.wikimedia.org/wiki/Application_servers
[09:57:19] <icinga-wm>	 PROBLEM - High CPU load on API appserver on mw1222 is CRITICAL: CRITICAL - load average: 74.15, 38.58, 23.35 https://wikitech.wikimedia.org/wiki/Application_servers
[10:01:37] <icinga-wm>	 RECOVERY - High CPU load on API appserver on mw1233 is OK: OK - load average: 11.81, 23.16, 21.92 https://wikitech.wikimedia.org/wiki/Application_servers
[10:01:51] <logmsgbot>	 !log @ helmfile [STAGING] Ran 'apply' command on namespace 'restrouter' for release 'staging' .
[10:01:59] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:02:03] <icinga-wm>	 RECOVERY - High CPU load on API appserver on mw1222 is OK: OK - load average: 13.19, 23.84, 21.24 https://wikitech.wikimedia.org/wiki/Application_servers
[10:02:21] <icinga-wm>	 RECOVERY - High CPU load on API appserver on mw1225 is OK: OK - load average: 13.78, 22.59, 20.14 https://wikitech.wikimedia.org/wiki/Application_servers
[10:02:38] <wikibugs>	 (03PS2) 10Giuseppe Lavagetto: Add the mediawiki.restart_appservers cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/527487
[10:03:17] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C: 03+2] prometheus: skip duplicates when generating pdu configuration [puppet] - 10https://gerrit.wikimedia.org/r/527498 (https://phabricator.wikimedia.org/T148541) (owner: 10Filippo Giunchedi)
[10:03:25] <wikibugs>	 (03PS2) 10Filippo Giunchedi: prometheus: skip duplicates when generating pdu configuration [puppet] - 10https://gerrit.wikimedia.org/r/527498 (https://phabricator.wikimedia.org/T148541)
[10:07:17] <wikibugs>	 (03CR) 10Ema: [C: 03+1] fifo-log-demux: Fix EPIPE check [software/fifo-log-demux] - 10https://gerrit.wikimedia.org/r/527435 (owner: 10Vgutierrez)
[10:12:45] <wikibugs>	 (03CR) 10Vgutierrez: [V: 03+2 C: 03+2] fifo-log-demux: Remove socket activation [software/fifo-log-demux] - 10https://gerrit.wikimedia.org/r/527039 (owner: 10Vgutierrez)
[10:13:13] <wikibugs>	 (03CR) 10Vgutierrez: [V: 03+2 C: 03+2] fifo-log-demux: Keep attempting to read the FIFO after EOF [software/fifo-log-demux] - 10https://gerrit.wikimedia.org/r/527013 (owner: 10Vgutierrez)
[10:14:01] <wikibugs>	 (03CR) 10Vgutierrez: [V: 03+2 C: 03+2] fifo-log-demux: Fix EPIPE check [software/fifo-log-demux] - 10https://gerrit.wikimedia.org/r/527435 (owner: 10Vgutierrez)
[10:41:00] <Amir1>	 Jenkins is slow
[10:41:09] <Amir1>	 I need to deploy for a UBN right now :/
[10:42:36] <marostegui>	 Amir1: this doesn't look too bad at first sight https://integration.wikimedia.org/zuul/
[10:42:46] <wikibugs>	 (03PS1) 10ArielGlenn: add more public tables for xml/sql dumps [puppet] - 10https://gerrit.wikimedia.org/r/527505 (https://phabricator.wikimedia.org/T226167)
[10:43:19] <Amir1>	 marostegui: gate-and-submit-swat: 32 minutes and due to a flaky browser test I need to redo it 
[10:43:21] <_joe_>	 Amir1: if that's a rollback of a patch, you can quick-fix it on deploy1001 in the meantime
[10:43:30] <_joe_>	 wow
[10:43:43] <marostegui>	 32 minutes woot
[10:43:46] <_joe_>	 what's the commit?
[10:43:50] <Amir1>	 _joe_: yeah
[10:43:56] <Amir1>	 https://gerrit.wikimedia.org/r/c/mediawiki/extensions/Wikibase/+/527501
[10:44:04] <Amir1>	 branch backports are always slow
[10:44:42] <_joe_>	 Amir1: if you feel confident it works, just remove the -1 from jenkins bot, add v+2 and hit submit
[10:44:44] <_joe_>	 :P
[10:45:24] <Amir1>	 _joe_: yeah, the error is flaky + the not-submit tests all passed (look at jenkins)
[10:45:46] <Amir1>	 Thanks
[10:45:47] <_joe_>	 Amir1: go for it then
[10:47:37] <Amir1>	 now on wmf.16
[10:51:25] <logmsgbot>	 !log ladsgroup@deploy1001 Synchronized php-1.34.0-wmf.16/extensions/Wikibase: [[gerrit:527501|Revert "fix eslint errors in lib after moving submodule files into lib"]] (duration: 01m 08s)
[10:51:32] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:34:29] <logmsgbot>	 !log ladsgroup@deploy1001 Started scap: [[phab:T229604|Rebuilding l10n cache]]
[11:34:37] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:34:38] <stashbot>	 T229604: Several selectors/experts are broken - https://phabricator.wikimedia.org/T229604
[11:39:35] <logmsgbot>	 !log ladsgroup@deploy1001 Finished scap: [[phab:T229604|Rebuilding l10n cache]] (duration: 05m 06s)
[11:39:46] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:39:47] <stashbot>	 T229604: Several selectors/experts are broken - https://phabricator.wikimedia.org/T229604
[11:40:58] <Amir1>	 It's done in five minutes and it didn't rebuild l10n cache :/
[11:41:16] <Lucas_WMDE>	 weird
[11:44:11] <Amir1>	 How can I rebuild the l10n cache?
[11:44:18] <Amir1>	 extensions/LocalisationUpdate/update.php ?
[11:45:17] <Amir1>	 nope that's something else
[11:47:48] <Amir1>	 scap sync-l10n 1.34.0-wmf.16 '[[phab:T229604|Rebuilding l10n cache]]' would be it
[11:47:49] <stashbot>	 T229604: Several selectors/experts are broken - https://phabricator.wikimedia.org/T229604
[11:48:07] <logmsgbot>	 !log ladsgroup@deploy1001 scap sync-l10n completed (1.34.0-wmf.16) (duration: 00m 44s)
[11:48:13] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:52:13] <Amir1>	 Still not working. zeljkof hey, do you know why l10n cache is not getting rebuilt?
[11:52:30] <Amir1>	 e.g. This https://www.wikidata.org/wiki/MediaWiki:Valueview-expertextender-languageselector-label
[11:54:30] <Amir1>	 !log start of l10nupdate
[11:54:37] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:56:08] <Amir1>	 I'm not sure if I'm doing it right. The changed logs are too big
[11:56:25] <Amir1>	 !log aborted l10nupdate
[11:56:31] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:10:06] <wikibugs>	 (03CR) 10Lucas Werkmeister (WMDE): "I also checked the other redirects generated by Varnish to see if it makes sense to add the header to them as well (`git grep -F 'synth(3'" [puppet] - 10https://gerrit.wikimedia.org/r/526627 (https://phabricator.wikimedia.org/T229385) (owner: 10Lucas Werkmeister (WMDE))
[12:10:39] <Lucas_WMDE>	 brennen: you’re train conductor this week, do you know why Amir1’s l10n rebuild might not be working as expected?
[12:11:22] <Amir1>	 I think he's asleep now 
[12:14:48] <liw>	 I'd expect brennen to not wake up until at least two hours from now
[12:15:37] <Amir1>	 If anyone knows better how to rebuild l10n cache for wmf.16, please do
[12:15:55] <Amir1>	 This should show up https://www.wikidata.org/wiki/MediaWiki:Valueview-expertextender-languageselector-label
[12:18:12] <wikibugs>	 (03PS2) 10CDanis: Revert "dbctl: diff PHP vs dbctl configs" [puppet] - 10https://gerrit.wikimedia.org/r/527245 (https://phabricator.wikimedia.org/T229070)
[12:19:37] <wikibugs>	 (03CR) 10CDanis: [C: 03+2] Revert "dbctl: diff PHP vs dbctl configs" [puppet] - 10https://gerrit.wikimedia.org/r/527245 (https://phabricator.wikimedia.org/T229070) (owner: 10CDanis)
[12:31:28] <wikibugs>	 10Operations, 10DBA, 10cloud-services-team, 10wikitech.wikimedia.org: Switchover m5 primary master: db1073 to db1133 - https://phabricator.wikimedia.org/T229657 (10Marostegui)
[12:33:01] <marostegui>	 !log Restarted wikibugs a few minutes ago as it was not sending anything on IRC
[12:33:07] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:33:22] <wikibugs>	 10Operations, 10DBA, 10wikitech.wikimedia.org, 10cloud-services-team (Kanban): Switchover m5 primary master: db1073 to db1133 - https://phabricator.wikimedia.org/T229657 (10aborrero) I think we could either do this next week or wait until september because the WMCS team we will be traveling for Wikimania +...
[12:35:01] <wikibugs>	 10Operations, 10DBA, 10wikitech.wikimedia.org, 10cloud-services-team (Kanban): Switchover m5 primary master: db1073 to db1133 - https://phabricator.wikimedia.org/T229657 (10Marostegui) >>! In T229657#5387587, @aborrero wrote: > I think we could either do this next week or wait until september because the W...
[12:37:36] <wikibugs>	 10Operations, 10DBA, 10wikitech.wikimedia.org, 10cloud-services-team (Kanban): Switchover m5 primary master: db1073 to db1133 - https://phabricator.wikimedia.org/T229657 (10CDanis) FYI I'll be on vacation and without a work laptop approx Sept 10th - Sept 20th, and possibly Sept 9th as well.  Outside of tha...
[12:42:08] <wikibugs>	 10Operations, 10DBA, 10wikitech.wikimedia.org, 10cloud-services-team (Kanban): Switchover m5 primary master: db1073 to db1133 - https://phabricator.wikimedia.org/T229657 (10Marostegui) Let me pick a tentative.....Tuesday 3rd Sept at 13:00 UTC? @aborrero @CDanis ?
[12:42:27] <wikibugs>	 10Operations, 10DBA, 10wikitech.wikimedia.org, 10cloud-services-team (Kanban): Switchover m5 primary master: db1073 to db1133 - https://phabricator.wikimedia.org/T229657 (10aborrero) Ok, so I'm proposing two dates:  * 2019-10-03 -- I'm unavailable, but I think both @JHedden and @Andrew will be around. Also...
[12:42:45] <icinga-wm>	 PROBLEM - High CPU load on API appserver on mw1233 is CRITICAL: CRITICAL - load average: 75.67, 40.14, 23.56 https://wikitech.wikimedia.org/wiki/Application_servers
[12:43:03] <icinga-wm>	 PROBLEM - High CPU load on API appserver on mw1231 is CRITICAL: CRITICAL - load average: 71.10, 35.33, 21.14 https://wikitech.wikimedia.org/wiki/Application_servers
[12:43:25] <icinga-wm>	 PROBLEM - High CPU load on API appserver on mw1223 is CRITICAL: CRITICAL - load average: 66.86, 32.58, 18.70 https://wikitech.wikimedia.org/wiki/Application_servers
[12:43:25] <icinga-wm>	 PROBLEM - High CPU load on API appserver on mw1222 is CRITICAL: CRITICAL - load average: 76.09, 38.17, 22.10 https://wikitech.wikimedia.org/wiki/Application_servers
[12:43:46] <wikibugs>	 10Operations, 10DBA, 10wikitech.wikimedia.org, 10cloud-services-team (Kanban): Switchover m5 primary master: db1073 to db1133 - https://phabricator.wikimedia.org/T229657 (10CDanis) >>! In T229657#5387609, @Marostegui wrote: > Let me pick a tentative.....Tuesday 3rd Sept at 13:00 UTC? @aborrero @CDanis ?  L...
[12:44:32] <wikibugs>	 10Operations, 10DBA, 10wikitech.wikimedia.org, 10cloud-services-team (Kanban): Switchover m5 primary master: db1073 to db1133 - https://phabricator.wikimedia.org/T229657 (10Marostegui) @aborrero are you proposing October?
[12:45:54] <wikibugs>	 10Operations, 10DBA, 10wikitech.wikimedia.org, 10cloud-services-team (Kanban): Switchover m5 primary master: db1073 to db1133 - https://phabricator.wikimedia.org/T229657 (10aborrero) Ok, **2019-10-03**, work for us. Will let my team know, since I won't be around.  >>! In T229657#5387615, @Marostegui wrote:...
[12:46:21] <icinga-wm>	 PROBLEM - High CPU load on API appserver on mw1225 is CRITICAL: CRITICAL - load average: 37.84, 36.45, 23.15 https://wikitech.wikimedia.org/wiki/Application_servers
[12:47:03] <wikibugs>	 10Operations, 10DBA, 10wikitech.wikimedia.org, 10cloud-services-team (Kanban): Switchover m5 primary master: db1073 to db1133 - https://phabricator.wikimedia.org/T229657 (10Marostegui) Let's try to go for the 3rd of September at 13:00 UTC if @Andrew and/or @JHedden can confirm they'll be available to suppo...
[12:50:59] <icinga-wm>	 RECOVERY - High CPU load on API appserver on mw1225 is OK: OK - load average: 18.94, 25.51, 21.89 https://wikitech.wikimedia.org/wiki/Application_servers
[12:52:56] <Amir1>	 I have a feeling someone is hitting Wikidata's API too hard every hour or every two hours. The load errors ^ and this: https://grafana.wikimedia.org/d/000000548/wikibase-wb_terms?refresh=30s&orgId=1&from=now-24h&to=now
[12:53:22] <marostegui>	 Amir1: could that be also related to what we saw on the DBs?
[12:53:53] <Amir1>	 yeah, it's the same minute but exactly one hour off
[12:54:10] <Amir1>	 (I've got a headache so I'm not sure my measurements are correct)
[12:59:11] <icinga-wm>	 RECOVERY - toolschecker: showmount succeeds on a labs instance on checker.tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 158 bytes in 0.064 second response time https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Toolschecker
[13:04:59] <icinga-wm>	 PROBLEM - High CPU load on API appserver on mw1234 is CRITICAL: CRITICAL - load average: 62.91, 35.35, 26.42 https://wikitech.wikimedia.org/wiki/Application_servers
[13:05:15] <icinga-wm>	 PROBLEM - High CPU load on API appserver on mw1223 is CRITICAL: CRITICAL - load average: 56.00, 34.54, 28.58 https://wikitech.wikimedia.org/wiki/Application_servers
[13:06:43] <icinga-wm>	 PROBLEM - High CPU load on API appserver on mw1226 is CRITICAL: CRITICAL - load average: 79.30, 44.48, 29.63 https://wikitech.wikimedia.org/wiki/Application_servers
[13:06:51] <icinga-wm>	 PROBLEM - High CPU load on API appserver on mw1232 is CRITICAL: CRITICAL - load average: 66.11, 38.01, 26.16 https://wikitech.wikimedia.org/wiki/Application_servers
[13:06:51] <icinga-wm>	 PROBLEM - High CPU load on API appserver on mw1225 is CRITICAL: CRITICAL - load average: 71.35, 43.09, 30.54 https://wikitech.wikimedia.org/wiki/Application_servers
[13:08:01] <icinga-wm>	 PROBLEM - recommendation_api endpoints health on scb2002 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[13:09:31] <icinga-wm>	 RECOVERY - recommendation_api endpoints health on scb2002 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[13:18:47] <wikibugs>	 (03PS1) 10Arturo Borrero Gonzalez: toolforge: k8s: ingress: nginx-ingress listen on 8082/tcp [puppet] - 10https://gerrit.wikimedia.org/r/527541 (https://phabricator.wikimedia.org/T228500)
[13:18:49] <wikibugs>	 (03PS1) 10Arturo Borrero Gonzalez: toolforge: k8s: ingress: add frontend service [puppet] - 10https://gerrit.wikimedia.org/r/527542 (https://phabricator.wikimedia.org/T228500)
[13:19:25] <icinga-wm>	 RECOVERY - High CPU load on API appserver on mw1226 is OK: OK - load average: 9.80, 14.75, 22.98 https://wikitech.wikimedia.org/wiki/Application_servers
[13:19:33] <icinga-wm>	 RECOVERY - High CPU load on API appserver on mw1223 is OK: OK - load average: 11.54, 15.25, 22.86 https://wikitech.wikimedia.org/wiki/Application_servers
[13:19:33] <icinga-wm>	 RECOVERY - High CPU load on API appserver on mw1232 is OK: OK - load average: 9.90, 16.04, 22.81 https://wikitech.wikimedia.org/wiki/Application_servers
[13:22:43] <icinga-wm>	 RECOVERY - High CPU load on API appserver on mw1225 is OK: OK - load average: 11.54, 14.53, 22.86 https://wikitech.wikimedia.org/wiki/Application_servers
[13:24:03] <icinga-wm>	 RECOVERY - High CPU load on API appserver on mw1234 is OK: OK - load average: 10.39, 13.77, 22.95 https://wikitech.wikimedia.org/wiki/Application_servers
[13:25:32] <paravoid>	 what were these alerts about? :)
[13:27:09] <marostegui>	 paravoid: we saw them in the morning too, they recovered after a while, we restarted some hhvm but I am not sure if the root cause was found
[13:27:28] <paravoid>	 _joe_, jijiki ^
[13:27:56] <_joe_>	 this morning it was a flood of parsoid_batch requests
[13:28:19] <_joe_>	 but given it's always the same servers, lemme take a deeper look at what's going on there
[13:28:56] <wikibugs>	 (03PS1) 10Arturo Borrero Gonzalez: toolforge: k8s: haproxy: add proxy redirection for nginx-ingress [puppet] - 10https://gerrit.wikimedia.org/r/527544 (https://phabricator.wikimedia.org/T228500)
[13:29:18] <cdanis>	 the hosts in question have basically no thermal headroom
[13:29:28] <cdanis>	 ugh, the graph of thermal throttling events on grafana has broken
[13:29:33] <cdanis>	 but it's very much happening on these hosts, and often
[13:29:38] <cdanis>	 Aug  2 13:24:01 mw1223 kernel: [5086950.120019] CPU29: Package temperature above threshold, cpu clock throttled (total events = 176490976)
[13:30:18] <marostegui>	 cdanis: yeah, I think it has been happening for a while already
[13:30:33] <_joe_>	 those servers will be replaced *this quarter* btw
[13:30:58] <cdanis>	 maybe we want to de-weight them some?
[13:31:18] <_joe_>	 lemme first take a better look at what's going on
[13:31:50] <_joe_>	 so the root issue is a flow of parsoid-batch requests (again)
[13:32:00] <_joe_>	 that raised the cpu usage on all api servers
[13:32:03] <cdanis>	 this looks to me like a spike of traffic that gets spread across the apiservers but these 5 servers in question are much less able to handle it
[13:32:48] <_joe_>	 but yeah we could tweak the weights a bit I concur
[13:34:32] <cdanis>	 would it be possible to make parsoid_batch traffic less bursty?
[13:34:43] <icinga-wm>	 RECOVERY - High CPU load on API appserver on mw1233 is OK: OK - load average: 10.21, 12.85, 23.95 https://wikitech.wikimedia.org/wiki/Application_servers
[13:34:56] <_joe_>	 I didn't have time to look more in depth into what is causing that
[13:35:03] <icinga-wm>	 RECOVERY - High CPU load on API appserver on mw1231 is OK: OK - load average: 9.60, 10.80, 22.54 https://wikitech.wikimedia.org/wiki/Application_servers
[13:35:19] <_joe_>	 but yes, parsoid-batch should go via changepropagation
[13:35:36] <_joe_>	 so it's mediated by kafka and an http pusher that has concurrency limits
[13:35:38] <_joe_>	 but
[13:35:55] <_joe_>	 I don't see a spike in the number of requests 
[13:35:57] <_joe_>	 to the api
[13:36:16] <cdanis>	 there is a slight increase in requests that correlates with a large increase in median execution time https://grafana.wikimedia.org/d/000000559/api-requests-breakdown?refresh=5m&orgId=1&var-metric=p50&var-module=parsoid_batch&panelId=19&fullscreen
[13:36:34] <cdanis>	 so I think we are occasionally getting a group of very expensive queries?
[13:37:16] <wikibugs>	 (03PS1) 10Filippo Giunchedi: prometheus: don't snmp-poll st4InputCordNotifications [puppet] - 10https://gerrit.wikimedia.org/r/527548 (https://phabricator.wikimedia.org/T148541)
[13:37:24] <_joe_>	 cdanis: I trust these other data more: https://grafana.wikimedia.org/d/RIA1lzDZk/xxx-joe-appserver?orgId=1&from=now-1h&to=now&var-datasource=eqiad%20prometheus%2Fops&var-cluster=api_appserver&var-instance=mw1276:3903&var-method=GET&var-code=200&panelId=17&fullscreen
[13:37:34] <_joe_>	 the increase is small though
[13:37:40] <_joe_>	 and it seems to be persisting
[13:38:01] <_joe_>	 I think it's the kind of requests that are somehow very cpu-expensive
[13:38:10] <cdanis>	 but that's not specific to parsoid-batch nor is it aggregated across apiservers
[13:38:35] <cdanis>	 or am I misunderstanding the header row?
[13:38:52] <cdanis>	 oh, I see, these are global graphs, the instance chooser up top only affects the bottom row
[13:39:17] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] prometheus: don't snmp-poll st4InputCordNotifications [puppet] - 10https://gerrit.wikimedia.org/r/527548 (https://phabricator.wikimedia.org/T148541) (owner: 10Filippo Giunchedi)
[13:39:35] <_joe_>	 it is aggregated across the api servers
[13:39:50] <_joe_>	 but yes, it's not specific, it's the overall rate
[13:40:13] <icinga-wm>	 RECOVERY - High CPU load on API appserver on mw1222 is OK: OK - load average: 12.40, 13.01, 23.16 https://wikitech.wikimedia.org/wiki/Application_servers
[13:40:58] <_joe_>	 now to know what happened here, welcome to api.log :P
[13:41:54] <cdanis>	 heh, some of these servers already have lower weights
[13:42:24] <_joe_>	 they do yes
[13:42:58] <_joe_>	 oh I found one problem though
[13:43:00] <_joe_>	 fixing it
[13:44:17] <cdanis>	 there is something that doesn't make sense to me about the apache2 weights vs the nginx weights
[13:45:28] <logmsgbot>	 !log oblivian@puppetmaster1001 conftool action : set/weight=10; selector: cluster=api_appserver,dc=eqiad,service=nginx,name=mw12[23].*
[13:45:35] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:45:42] <_joe_>	 cdanis: ^^
[13:45:57] <cdanis>	 _joe_: LGTM
[13:45:59] <icinga-wm>	 PROBLEM - toolschecker: All k8s worker nodes are healthy on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 SERVICE UNAVAILABLE - string OK not found on http://checker.tools.wmflabs.org:80/k8s/nodes/ready - 177 bytes in 0.154 second response time https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Toolschecker
[13:46:05] <_joe_>	 the nginx weights were modified at some point during some emergency, and no one brought them back to normalcy
[13:46:27] <cdanis>	 _joe_: I still don't understand why mw1276-mw1297 have weight:10 for nginx, but have weight:25 for apache
[13:47:33] <_joe_>	 the reason for this was exactly preserving some api servers for whenever there was a parsoid-batch storm
[13:47:42] <_joe_>	 we did that at the time as a protective measure
[13:47:48] <_joe_>	 as parsoid uses TLS
[13:48:04] <_joe_>	 while most things are still reaching mediawiki unencrypted
[13:48:46] <_joe_>	 those storms were gone for... years now?
[13:48:50] <_joe_>	 but they seem to be back
[13:48:52] <wikibugs>	 10Operations, 10Traffic, 10netops, 10IPv6: Fix IPv6 autoconf issues once and for all, across the fleet. - https://phabricator.wikimedia.org/T102099 (10akosiaris) > This still leaves all the servers currently installed which have a MAC based SLAAC address i.e. they do not have interface::add_ip6_mapped. It...
[13:49:45] <wikibugs>	 (03PS2) 10Filippo Giunchedi: prometheus: don't snmp-poll st4InputCordNotifications [puppet] - 10https://gerrit.wikimedia.org/r/527548 (https://phabricator.wikimedia.org/T148541)
[13:50:42] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] prometheus: don't snmp-poll st4InputCordNotifications [puppet] - 10https://gerrit.wikimedia.org/r/527548 (https://phabricator.wikimedia.org/T148541) (owner: 10Filippo Giunchedi)
[13:51:11] <wikibugs>	 (03CR) 10Filippo Giunchedi: [V: 03+2 C: 03+2] "CI is choking on the commit message too long, which is a url and I'm going to override" [puppet] - 10https://gerrit.wikimedia.org/r/527548 (https://phabricator.wikimedia.org/T148541) (owner: 10Filippo Giunchedi)
[13:52:39] <icinga-wm>	 ACKNOWLEDGEMENT - toolschecker: All k8s worker nodes are healthy on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 SERVICE UNAVAILABLE - string OK not found on http://checker.tools.wmflabs.org:80/k8s/nodes/ready - 177 bytes in 0.184 second response time andrew bogott no idea what this is yet https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Toolschecker
[13:54:22] <logmsgbot>	 !log @ helmfile [STAGING] Ran 'sync' command on namespace 'restrouter' for release 'staging' .
[13:54:29] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:57:47] <logmsgbot>	 !log mforns@deploy1001 Started deploy [analytics/refinery@b50a939]: deploying refinery up to b50a93955952ed863d5ef7703a91ab59f5d979cf (rollback of cassandra and edit_hourly hive2 actions to unbreak production)
[13:57:53] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:02:31] <icinga-wm>	 PROBLEM - mobileapps endpoints health on scb2001 is CRITICAL: /{domain}/v1/page/metadata/{title} (retrieve extended metadata for Video article on English Wikipedia) is CRITICAL: Test retrieve extended metadata for Video article on English Wikipedia returned the unexpected status 504 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps
[14:04:09] <icinga-wm>	 RECOVERY - mobileapps endpoints health on scb2001 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps
[14:08:27] <icinga-wm>	 RECOVERY - toolschecker: All k8s worker nodes are healthy on checker.tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 158 bytes in 0.204 second response time https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Toolschecker
[14:11:58] <wikibugs>	 10Operations, 10Traffic, 10netops, 10IPv6: Fix IPv6 autoconf issues once and for all, across the fleet. - https://phabricator.wikimedia.org/T102099 (10BBlack) Re: transitioning away from SLAAC for the current fleet/setup (which I think is probably a good incremental idea, and could happen ahead of the futu...
[14:14:34] <logmsgbot>	 !log mforns@deploy1001 Finished deploy [analytics/refinery@b50a939]: deploying refinery up to b50a93955952ed863d5ef7703a91ab59f5d979cf (rollback of cassandra and edit_hourly hive2 actions to unbreak production) (duration: 16m 47s)
[14:14:41] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:22:03] <wikibugs>	 (PS1) Giuseppe Lavagetto: Remove the service object for the default schema [software/conftool] - https://gerrit.wikimedia.org/r/527564
[14:22:05] <wikibugs>	 (PS1) Giuseppe Lavagetto: kvobject: fix some class property ordering [software/conftool] - https://gerrit.wikimedia.org/r/527565
[14:27:10] <wikibugs>	 (CR) 20after4: [C: +1] "> Patch Set 1:" [puppet] - https://gerrit.wikimedia.org/r/520433 (owner: Aklapper)
[14:28:45] <wikibugs>	 (CR) 20after4: [C: +1] "D1145 is visible to WMD-NDA and Security" [puppet] - https://gerrit.wikimedia.org/r/520433 (owner: Aklapper)
[14:29:30] <wikibugs>	 (CR) Filippo Giunchedi: "See also upstream issue https://github.com/prometheus/snmp_exporter/issues/443" [puppet] - https://gerrit.wikimedia.org/r/527548 (https://phabricator.wikimedia.org/T148541) (owner: Filippo Giunchedi)
[14:29:32] <wikibugs>	 (PS2) 20after4: Phab: Allow Greg and Andre to roll back specific user actions [puppet] - https://gerrit.wikimedia.org/r/520433 (owner: Aklapper)
[14:29:50] <wikibugs>	 (CR) 20after4: [C: +1] "It's been merged now, no need for security protection" [puppet] - https://gerrit.wikimedia.org/r/520433 (owner: Aklapper)
[14:35:10] <wikibugs>	 (CR) jerkins-bot: [V: -1] Phab: Allow Greg and Andre to roll back specific user actions [puppet] - https://gerrit.wikimedia.org/r/520433 (owner: Aklapper)
[14:40:25] <wikibugs>	 (Abandoned) Tarrow: Assign termbox-test.svc.{eqiad,codfw}.wmnet LVS IPs [dns] - https://gerrit.wikimedia.org/r/521456 (https://phabricator.wikimedia.org/T226814) (owner: Tarrow)
[14:43:28] <wikibugs>	 (PS3) 20after4: Phab: Allow Greg and Andre to roll back specific user actions [puppet] - https://gerrit.wikimedia.org/r/520433 (owner: Aklapper)
[14:43:47] <wikibugs>	 (PS4) 20after4: Phab: Allow Greg and Andre to roll back specific user actions [puppet] - https://gerrit.wikimedia.org/r/520433 (owner: Aklapper)
[14:44:40] <wikibugs>	 (CR) CDanis: [C: +1] Remove the service object for the default schema [software/conftool] - https://gerrit.wikimedia.org/r/527564 (owner: Giuseppe Lavagetto)
[14:48:00] <wikibugs>	 (CR) Alexandros Kosiaris: Add the mediawiki.restart_appservers cookbook (2 comments) [cookbooks] - https://gerrit.wikimedia.org/r/527487 (owner: Giuseppe Lavagetto)
[14:51:05] <icinga-wm>	 PROBLEM - High CPU load on API appserver on mw1227 is CRITICAL: CRITICAL - load average: 52.73, 23.32, 14.72 https://wikitech.wikimedia.org/wiki/Application_servers
[14:51:35] <icinga-wm>	 PROBLEM - High CPU load on API appserver on mw1317 is CRITICAL: CRITICAL - load average: 74.39, 35.84, 23.37 https://wikitech.wikimedia.org/wiki/Application_servers
[14:51:49] <icinga-wm>	 PROBLEM - High CPU load on API appserver on mw1231 is CRITICAL: CRITICAL - load average: 53.88, 24.11, 14.85 https://wikitech.wikimedia.org/wiki/Application_servers
[14:51:59] <icinga-wm>	 PROBLEM - High CPU load on API appserver on mw1234 is CRITICAL: CRITICAL - load average: 58.53, 25.45, 14.65 https://wikitech.wikimedia.org/wiki/Application_servers
[14:52:07] <icinga-wm>	 PROBLEM - High CPU load on API appserver on mw1226 is CRITICAL: CRITICAL - load average: 50.71, 27.27, 15.43 https://wikitech.wikimedia.org/wiki/Application_servers
[14:52:13] <icinga-wm>	 PROBLEM - High CPU load on API appserver on mw1223 is CRITICAL: CRITICAL - load average: 49.43, 24.31, 14.47 https://wikitech.wikimedia.org/wiki/Application_servers
[14:52:15] <icinga-wm>	 PROBLEM - High CPU load on API appserver on mw1232 is CRITICAL: CRITICAL - load average: 56.41, 24.70, 14.37 https://wikitech.wikimedia.org/wiki/Application_servers
[14:52:19] <icinga-wm>	 PROBLEM - High CPU load on API appserver on mw1284 is CRITICAL: CRITICAL - load average: 63.90, 30.16, 19.51 https://wikitech.wikimedia.org/wiki/Application_servers
[14:52:23] <icinga-wm>	 PROBLEM - High CPU load on API appserver on mw1281 is CRITICAL: CRITICAL - load average: 61.84, 27.78, 18.51 https://wikitech.wikimedia.org/wiki/Application_servers
[14:52:39] <icinga-wm>	 PROBLEM - High CPU load on API appserver on mw1283 is CRITICAL: CRITICAL - load average: 62.79, 28.31, 18.62 https://wikitech.wikimedia.org/wiki/Application_servers
[14:52:57] <icinga-wm>	 PROBLEM - High CPU load on API appserver on mw1279 is CRITICAL: CRITICAL - load average: 76.58, 35.20, 21.47 https://wikitech.wikimedia.org/wiki/Application_servers
[14:52:57] <icinga-wm>	 PROBLEM - High CPU load on API appserver on mw1280 is CRITICAL: CRITICAL - load average: 64.05, 31.33, 20.19 https://wikitech.wikimedia.org/wiki/Application_servers
[14:53:01] <icinga-wm>	 PROBLEM - High CPU load on API appserver on mw1230 is CRITICAL: CRITICAL - load average: 57.37, 26.79, 15.08 https://wikitech.wikimedia.org/wiki/Application_servers
[14:53:01] <icinga-wm>	 PROBLEM - High CPU load on API appserver on mw1314 is CRITICAL: CRITICAL - load average: 75.15, 45.95, 27.76 https://wikitech.wikimedia.org/wiki/Application_servers
[14:53:01] <icinga-wm>	 PROBLEM - High CPU load on API appserver on mw1286 is CRITICAL: CRITICAL - load average: 68.23, 37.28, 24.08 https://wikitech.wikimedia.org/wiki/Application_servers
[14:53:05] <icinga-wm>	 PROBLEM - High CPU load on API appserver on mw1278 is CRITICAL: CRITICAL - load average: 65.01, 32.33, 20.93 https://wikitech.wikimedia.org/wiki/Application_servers
[14:53:05] <icinga-wm>	 PROBLEM - High CPU load on API appserver on mw1290 is CRITICAL: CRITICAL - load average: 61.60, 35.71, 22.41 https://wikitech.wikimedia.org/wiki/Application_servers
[14:53:07] <icinga-wm>	 PROBLEM - High CPU load on API appserver on mw1233 is CRITICAL: CRITICAL - load average: 51.01, 30.89, 17.82 https://wikitech.wikimedia.org/wiki/Application_servers
[14:53:13] <icinga-wm>	 PROBLEM - PHP7 rendering on mw1347 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[14:53:13] <icinga-wm>	 PROBLEM - PHP7 rendering on mw1348 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[14:53:13] <icinga-wm>	 RECOVERY - High CPU load on API appserver on mw1317 is OK: OK - load average: 40.12, 37.01, 25.22 https://wikitech.wikimedia.org/wiki/Application_servers
[14:53:13] <icinga-wm>	 PROBLEM - High CPU load on API appserver on mw1276 is CRITICAL: CRITICAL - load average: 82.48, 38.90, 22.80 https://wikitech.wikimedia.org/wiki/Application_servers
[14:53:13] <icinga-wm>	 PROBLEM - High CPU load on API appserver on mw1288 is CRITICAL: CRITICAL - load average: 66.16, 37.14, 23.06 https://wikitech.wikimedia.org/wiki/Application_servers
[14:53:21] <icinga-wm>	 PROBLEM - Apache HTTP on mw1348 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers
[14:53:37] <icinga-wm>	 PROBLEM - High CPU load on API appserver on mw1289 is CRITICAL: CRITICAL - load average: 71.53, 38.16, 22.80 https://wikitech.wikimedia.org/wiki/Application_servers
[14:53:39] <icinga-wm>	 PROBLEM - HTTP availability for Varnish at eqiad on icinga1001 is CRITICAL: job=varnish-text site=eqiad https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 https://logstash.wikimedia.org/goto/60aa05b6e1129b475fbf4e7be868c67d
[14:53:45] <icinga-wm>	 RECOVERY - High CPU load on API appserver on mw1226 is OK: OK - load average: 18.93, 23.17, 15.19 https://wikitech.wikimedia.org/wiki/Application_servers
[14:53:51] <icinga-wm>	 RECOVERY - High CPU load on API appserver on mw1223 is OK: OK - load average: 25.08, 23.15, 15.09 https://wikitech.wikimedia.org/wiki/Application_servers
[14:54:01] <icinga-wm>	 PROBLEM - High CPU load on API appserver on mw1282 is CRITICAL: CRITICAL - load average: 78.60, 41.54, 24.42 https://wikitech.wikimedia.org/wiki/Application_servers
[14:54:01] <icinga-wm>	 PROBLEM - High CPU load on API appserver on mw1277 is CRITICAL: CRITICAL - load average: 84.55, 45.03, 25.83 https://wikitech.wikimedia.org/wiki/Application_servers
[14:54:07] <icinga-wm>	 PROBLEM - recommendation_api endpoints health on scb1003 is CRITICAL: /{domain}/v1/caption/addition/{target} (Caption addition suggestions) timed out before a response was received: /{domain}/v1/article/creation/translation/{source}{/seed} (article.creation.translation - bad seed) timed out before a response was received: /{domain}/v1/caption/translation/from/{source}/to/{target} (Caption translation suggestions) timed out before
[14:54:07] <icinga-wm>	 eceived: /{domain}/v1/description/translation/from/{source}/to/{target} (Description translation suggestions) timed out before a response was received: /{domain}/v1/description/addition/{target} (Description addition suggestions) timed out before a response was received: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received: /{domain}/v1/article/cre
[14:54:07] <icinga-wm>	 eed} (article.creation.morelike - bad article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[14:54:09] <icinga-wm>	 PROBLEM - High CPU load on API appserver on mw1287 is CRITICAL: CRITICAL - load average: 71.35, 41.42, 24.58 https://wikitech.wikimedia.org/wiki/Application_servers
[14:54:11] <icinga-wm>	 PROBLEM - recommendation_api endpoints health on scb2003 is CRITICAL: /{domain}/v1/caption/addition/{target} (Caption addition suggestions) timed out before a response was received: /{domain}/v1/article/creation/translation/{source}{/seed} (article.creation.translation - bad seed) timed out before a response was received: /{domain}/v1/caption/translation/from/{source}/to/{target} (Caption translation suggestions) timed out before
[14:54:11] <icinga-wm>	 eceived: /{domain}/v1/description/translation/from/{source}/to/{target} (Description translation suggestions) timed out before a response was received: /{domain}/v1/description/addition/{target} (Description addition suggestions) timed out before a response was received: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received: /{domain}/v1/article/cre
[14:54:11] <icinga-wm>	 eed} (article.creation.morelike - bad article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[14:54:11] <icinga-wm>	 PROBLEM - recommendation_api endpoints health on scb2004 is CRITICAL: /{domain}/v1/caption/addition/{target} (Caption addition suggestions) timed out before a response was received: /{domain}/v1/caption/translation/from/{source}/to/{target} (Caption translation suggestions) timed out before a response was received: /{domain}/v1/description/translation/from/{source}/to/{target} (Description translation suggestions) timed out befor
[14:54:11] <icinga-wm>	 received: /{domain}/v1/description/addition/{target} (Description addition suggestions) timed out before a response was received: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - bad article title) timed out before a response was received https://wikitech.wikimedia.org/
[14:54:11] <icinga-wm>	 itoring/recommendation_api
[14:54:11] <icinga-wm>	 PROBLEM - recommendation_api endpoints health on scb2006 is CRITICAL: /{domain}/v1/caption/translation/from/{source}/to/{target} (Caption translation suggestions) timed out before a response was received: /{domain}/v1/description/translation/from/{source}/to/{target} (Description translation suggestions) timed out before a response was received: /{domain}/v1/description/addition/{target} (Description addition suggestions) timed o
[14:54:12] <icinga-wm>	 nse was received: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - bad article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[14:54:12] <icinga-wm>	 PROBLEM - High CPU load on API appserver on mw1285 is CRITICAL: CRITICAL - load average: 70.11, 42.90, 25.15 https://wikitech.wikimedia.org/wiki/Application_servers
[14:54:20] <wikibugs>	 (CR) CDanis: [C: +1] kvobject: fix some class property ordering [software/conftool] - https://gerrit.wikimedia.org/r/527565 (owner: Giuseppe Lavagetto)
[14:54:25] <icinga-wm>	 PROBLEM - Apache HTTP on mw1346 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers
[14:54:31] <icinga-wm>	 PROBLEM - recommendation_api endpoints health on scb2005 is CRITICAL: /{domain}/v1/caption/addition/{target} (Caption addition suggestions) timed out before a response was received: /{domain}/v1/article/creation/translation/{source}{/seed} (article.creation.translation - bad seed) timed out before a response was received: /{domain}/v1/caption/translation/from/{source}/to/{target} (Caption translation suggestions) timed out before
[14:54:31] <icinga-wm>	 eceived: /{domain}/v1/description/translation/from/{source}/to/{target} (Description translation suggestions) timed out before a response was received: /{domain}/v1/description/addition/{target} (Description addition suggestions) timed out before a response was received: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received https://wikitech.wikimedi
[14:54:31] <icinga-wm>	 es/Monitoring/recommendation_api
[14:54:31] <icinga-wm>	 PROBLEM - HTTP availability for Nginx -SSL terminators- at esams on icinga1001 is CRITICAL: cluster=cache_text site=esams https://wikitech.wikimedia.org/wiki/Cache_TLS_termination https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[14:54:31] <icinga-wm>	 PROBLEM - HTTP availability for Varnish at esams on icinga1001 is CRITICAL: job=varnish-text site=esams https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 https://logstash.wikimedia.org/goto/60aa05b6e1129b475fbf4e7be868c67d
[14:54:33] <icinga-wm>	 PROBLEM - termbox eqiad on termbox.svc.eqiad.wmnet is CRITICAL: /termbox (get rendered termbox) is CRITICAL: Test get rendered termbox returned the unexpected status 500 (expecting: 200) https://wikitech.wikimedia.org/wiki/WMDE/Wikidata/SSR_Service
[14:54:33] <icinga-wm>	 PROBLEM - HHVM rendering on mw1339 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers
[14:54:35] <icinga-wm>	 PROBLEM - Apache HTTP on mw1339 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers
[14:54:35] <icinga-wm>	 PROBLEM - Apache HTTP on mw1340 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers
[14:54:43] <icinga-wm>	 RECOVERY - PHP7 rendering on mw1347 is OK: HTTP OK: HTTP/1.1 200 OK - 81418 bytes in 2.133 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[14:54:43] <icinga-wm>	 PROBLEM - HTTP availability for Nginx -SSL terminators- at codfw on icinga1001 is CRITICAL: cluster=cache_text site=codfw https://wikitech.wikimedia.org/wiki/Cache_TLS_termination https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[14:54:44] <cdanis>	 uhm
[14:54:49] <icinga-wm>	 PROBLEM - recommendation_api endpoints health on scb2001 is CRITICAL: /{domain}/v1/caption/addition/{target} (Caption addition suggestions) timed out before a response was received: /{domain}/v1/caption/translation/from/{source}/to/{target} (Caption translation suggestions) timed out before a response was received: /{domain}/v1/description/translation/from/{source}/to/{target} (Description translation suggestions) timed out befor
[14:54:49] <icinga-wm>	 received: /{domain}/v1/description/addition/{target} (Description addition suggestions) timed out before a response was received: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[14:54:52] <elukey>	 wow
[14:54:55] <icinga-wm>	 PROBLEM - recommendation_api endpoints health on scb1002 is CRITICAL: /{domain}/v1/caption/addition/{target} (Caption addition suggestions) timed out before a response was received: /{domain}/v1/caption/translation/from/{source}/to/{target} (Caption translation suggestions) timed out before a response was received: /{domain}/v1/description/translation/from/{source}/to/{target} (Description translation suggestions) timed out befor
[14:54:55] <icinga-wm>	 received: /{domain}/v1/description/addition/{target} (Description addition suggestions) timed out before a response was received: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - bad article title) timed out before a response was received https://wikitech.wikimedia.org/
[14:54:55] <icinga-wm>	 itoring/recommendation_api
[14:55:01] <icinga-wm>	 PROBLEM - recommendation_api endpoints health on scb1001 is CRITICAL: /{domain}/v1/caption/addition/{target} (Caption addition suggestions) timed out before a response was received: /{domain}/v1/description/translation/from/{source}/to/{target} (Description translation suggestions) timed out before a response was received: /{domain}/v1/description/addition/{target} (Description addition suggestions) timed out before a response wa
[14:55:01] <icinga-wm>	 ain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - bad article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[14:55:01] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1022 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[14:55:02] <icinga-wm>	 PROBLEM - HTTP availability for Nginx -SSL terminators- at eqiad on icinga1001 is CRITICAL: cluster=cache_text site=eqiad https://wikitech.wikimedia.org/wiki/Cache_TLS_termination https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[14:55:02] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase2018 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received: /en.wikipedia.org/v1/page/graph/png/{title}/{revision}/{graph_id} (Get a graph from Graphoid) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[14:55:03] <icinga-wm>	 PROBLEM - Nginx local proxy to apache on mw1344 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers
[14:55:05] <icinga-wm>	 PROBLEM - HTTP availability for Nginx -SSL terminators- at eqsin on icinga1001 is CRITICAL: cluster=cache_text site=eqsin https://wikitech.wikimedia.org/wiki/Cache_TLS_termination https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[14:55:07] <icinga-wm>	 RECOVERY - High CPU load on API appserver on mw1231 is OK: OK - load average: 18.98, 24.14, 17.01 https://wikitech.wikimedia.org/wiki/Application_servers
[14:55:09] <icinga-wm>	 PROBLEM - recommendation_api endpoints health on scb1004 is CRITICAL: /{domain}/v1/caption/addition/{target} (Caption addition suggestions) timed out before a response was received: /{domain}/v1/article/creation/translation/{source}{/seed} (article.creation.translation - bad seed) timed out before a response was received: /{domain}/v1/caption/translation/from/{source}/to/{target} (Caption translation suggestions) timed out before
[14:55:09] <icinga-wm>	 eceived: /{domain}/v1/description/addition/{target} (Description addition suggestions) timed out before a response was received: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - bad article title) timed out before a response was received https://wikitech.wikimedia.org/w
[14:55:09] <icinga-wm>	 toring/recommendation_api
[14:55:19] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase-dev1004 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received: /en.wikipedia.org/v1/page/graph/png/{title}/{revision}/{graph_id} (Get a graph from Graphoid) is CRITICAL: Test Get a graph from Graphoid returned the unexpected status 400 (expecting: 200) https://wikitech.wikimedia
[14:55:19] <icinga-wm>	 s/Monitoring/restbase
[14:55:19] <icinga-wm>	 PROBLEM - mobileapps endpoints health on scb2005 is CRITICAL: /{domain}/v1/page/metadata/{title} (retrieve extended metadata for Video article on English Wikipedia) timed out before a response was received: /{domain}/v1/media/image/featured/{year}/{month}/{day} (retrieve featured image data for April 29, 2016) timed out before a response was received: /{domain}/v1/page/mobile-sections/{title} (retrieve test page via mobile-sectio
[14:55:19] <icinga-wm>	 ore a response was received: /{domain}/v1/page/mobile-html/{title} (Get page content HTML for test page) timed out before a response was received: /{domain}/v1/page/random/title (retrieve a random article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps
[14:55:21] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1017 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received: /en.wikipedia.org/v1/page/graph/png/{title}/{revision}/{graph_id} (Get a graph from Graphoid) is CRITICAL: Test Get a graph from Graphoid returned the unexpected status 400 (expecting: 200) https://wikitech.wikimedia.org
[14:55:21] <icinga-wm>	 nitoring/restbase
[14:55:21] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase2013 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received: /en.wikipedia.org/v1/page/title/{title} (Get rev by title from storage) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[14:55:22] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase2020 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[14:55:22] <icinga-wm>	 PROBLEM - Nginx local proxy to apache on mw1339 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers
[14:55:22] <icinga-wm>	 PROBLEM - mobileapps endpoints health on scb2006 is CRITICAL: /{domain}/v1/page/media/{title} (Get media in test page) timed out before a response was received: /{domain}/v1/media/image/featured/{year}/{month}/{day} (retrieve featured image data for April 29, 2016) timed out before a response was received: /{domain}/v1/page/mobile-html/{title} (Get page content HTML for test page) timed out before a response was received https://
[14:55:22] <icinga-wm>	 a.org/wiki/Services/Monitoring/mobileapps
[14:55:22] <icinga-wm>	 PROBLEM - Restbase LVS codfw on restbase.svc.codfw.wmnet is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received: /en.wikipedia.org/v1/page/title/{title} (Get rev by title from storage) timed out before a response was received https://wikitech.wikimedia.org/wiki/RESTBase
[14:55:27] <icinga-wm>	 PROBLEM - mobileapps endpoints health on scb1003 is CRITICAL: /{domain}/v1/page/media/{title} (Get media in test page) timed out before a response was received: /{domain}/v1/page/metadata/{title} (retrieve extended metadata for Video article on English Wikipedia) timed out before a response was received: /{domain}/v1/media/image/featured/{year}/{month}/{day} (retrieve featured image data for April 29, 2016) timed out before a res
[14:55:27] <icinga-wm>	 d: /{domain}/v1/page/mobile-html/{title} (Get page content HTML for test page) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps
[14:55:29] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1025 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received: /en.wikipedia.org/v1/transform/wikitext/to/html/{title} (Transform wikitext to html) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[14:55:29] <icinga-wm>	 PROBLEM - mobileapps endpoints health on scb2004 is CRITICAL: /{domain}/v1/page/media/{title} (Get media in test page) timed out before a response was received: /{domain}/v1/page/summary/{title} (Get summary for test page) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps
[14:55:29] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1023 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received: /en.wikipedia.org/v1/transform/wikitext/to/html/{title} (Transform wikitext to html) timed out before a response was received: /en.wikipedia.org/v1/page/title/{title} (Get rev by title from storage) timed out before a re
[14:55:30] <icinga-wm>	 ed https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[14:55:30] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1021 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received: /en.wikipedia.org/v1/page/title/{title} (Get rev by title from storage) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[14:55:32] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1019 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received: /en.wikipedia.org/v1/page/title/{title} (Get rev by title from storage) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[14:55:33] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase2012 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[14:55:33] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase2009 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[14:55:33] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase2017 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received: /en.wikipedia.org/v1/page/title/{title} (Get rev by title from storage) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[14:55:33] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase2015 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[14:55:43] <icinga-wm>	 PROBLEM - mobileapps endpoints health on scb1001 is CRITICAL: /{domain}/v1/page/media/{title} (Get media in test page) timed out before a response was received: /{domain}/v1/media/image/featured/{year}/{month}/{day} (retrieve featured image data for April 29, 2016) timed out before a response was received: /{domain}/v1/page/mobile-html/{title} (Get page content HTML for test page) timed out before a response was received: /{domai
[14:55:43] <icinga-wm>	 y/{title} (Get summary for test page) timed out before a response was received: /{domain}/v1/transform/html/to/mobile-html/{title} (Get preview mobile HTML for test page) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps
[14:55:43] <icinga-wm>	 PROBLEM - Cxserver LVS eqiad on cxserver.svc.eqiad.wmnet is CRITICAL: /v2/translate/{from}/{to}{/provider} (Machine translate an HTML fragment using TestClient, adapt the links to target language wiki.) timed out before a response was received https://wikitech.wikimedia.org/wiki/CX
[14:55:45] <icinga-wm>	 PROBLEM - HHVM rendering on mw1226 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers
[14:55:45] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs1015 is CRITICAL: PYBAL CRITICAL - CRITICAL - api_80: Servers mw1346.eqiad.wmnet, mw1344.eqiad.wmnet, mw1226.eqiad.wmnet, mw1233.eqiad.wmnet, mw1340.eqiad.wmnet, mw1343.eqiad.wmnet, mw1225.eqiad.wmnet, mw1345.eqiad.wmnet, mw1221.eqiad.wmnet, mw1317.eqiad.wmnet, mw1235.eqiad.wmnet, mw1342.eqiad.wmnet, mw1315.eqiad.wmnet, mw1339.eqiad.wmnet are marked down but pooled: api-https_443: Serve
[14:55:45] <icinga-wm>	 mnet, mw1346.eqiad.wmnet, mw1344.eqiad.wmnet, mw1227.eqiad.wmnet, mw1226.eqiad.wmnet, mw1233.eqiad.wmnet, mw1222.eqiad.wmnet, mw1340.eqiad.wmnet, mw1343.eqiad.wmnet, mw1225.eqiad.wmnet, mw1347.eqiad.wmnet, mw1345.eqiad.wmnet, mw1221.eqiad.wmnet, mw1235.eqiad.wmnet, mw1342.eqiad.wmnet, mw1315.eqiad.wmnet, mw1341.eqiad.wmnet, mw1339.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[14:55:45] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase2010 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[14:55:47] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase-dev1005 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[14:55:49] <icinga-wm>	 PROBLEM - Apache HTTP on mw1315 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers
[14:55:51] <icinga-wm>	 PROBLEM - Apache HTTP on mw1342 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers
[14:55:51] <icinga-wm>	 PROBLEM - HTTP availability for Nginx -SSL terminators- at ulsfo on icinga1001 is CRITICAL: cluster=cache_text site=ulsfo https://wikitech.wikimedia.org/wiki/Cache_TLS_termination https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[14:55:55] <icinga-wm>	 PROBLEM - HHVM rendering on mw1343 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers
[14:55:57] <icinga-wm>	 PROBLEM - mobileapps endpoints health on scb2001 is CRITICAL: /{domain}/v1/page/media/{title} (Get media in test page) timed out before a response was received: /{domain}/v1/page/metadata/{title} (retrieve extended metadata for Video article on English Wikipedia) timed out before a response was received: /{domain}/v1/media/image/featured/{year}/{month}/{day} (retrieve featured image data for April 29, 2016) timed out before a res
[14:55:57] <icinga-wm>	 d: /{domain}/v1/transform/html/to/mobile-html/{title} (Get preview mobile HTML for test page) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps
[14:55:58] <icinga-wm>	 PROBLEM - HHVM rendering on mw1340 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers
[14:55:59] <icinga-wm>	 PROBLEM - recommendation_api endpoints health on scb2002 is CRITICAL: /{domain}/v1/caption/translation/from/{source}/to/{target} (Caption translation suggestions) timed out before a response was received: /{domain}/v1/description/translation/from/{source}/to/{target} (Description translation suggestions) timed out before a response was received: /{domain}/v1/description/addition/{target} (Description addition suggestions) timed o
[14:55:59] <icinga-wm>	 nse was received: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[14:55:59] <icinga-wm>	 PROBLEM - Nginx local proxy to apache on mw1345 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers
[14:55:59] <icinga-wm>	 PROBLEM - Nginx local proxy to apache on mw1342 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers
[14:56:01] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1026 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received: /en.wikipedia.org/v1/page/title/{title} (Get rev by title from storage) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[14:56:01] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase-dev1006 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[14:56:02] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase2016 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[14:56:02] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase2019 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[14:56:03] <icinga-wm>	 PROBLEM - Apache HTTP on mw1343 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers
[14:56:03] <icinga-wm>	 PROBLEM - Mobileapps LVS eqiad on mobileapps.svc.eqiad.wmnet is CRITICAL: /{domain}/v1/page/media/{title} (Get media in test page) timed out before a response was received: /{domain}/v1/page/metadata/{title} (retrieve extended metadata for Video article on English Wikipedia) timed out before a response was received: /{domain}/v1/media/image/featured/{year}/{month}/{day} (retrieve featured image data for April 29, 2016) timed out 
[14:56:03] <icinga-wm>	  was received https://wikitech.wikimedia.org/wiki/Mobileapps_%28service%29
[14:56:07] <icinga-wm>	 PROBLEM - PyBal IPVS diff check on lvs1016 is CRITICAL: CRITICAL: Hosts in IPVS but unknown to PyBal: set([mw1226.eqiad.wmnet, mw1348.eqiad.wmnet, mw1233.eqiad.wmnet, mw1346.eqiad.wmnet, mw1315.eqiad.wmnet, mw1221.eqiad.wmnet, mw1342.eqiad.wmnet, mw1340.eqiad.wmnet, mw1345.eqiad.wmnet, mw1344.eqiad.wmnet, mw1343.eqiad.wmnet, mw1225.eqiad.wmnet, mw1341.eqiad.wmnet, mw1317.eqiad.wmnet, mw1347.eqiad.wmnet, mw1339.eqiad.wmnet, mw1313
[14:56:07] <icinga-wm>	 tps://wikitech.wikimedia.org/wiki/PyBal
[14:56:11] <icinga-wm>	 PROBLEM - Nginx local proxy to apache on mw1340 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers
[14:56:11] <icinga-wm>	 PROBLEM - HTTP availability for Varnish at ulsfo on icinga1001 is CRITICAL: job=varnish-text site=ulsfo https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 https://logstash.wikimedia.org/goto/60aa05b6e1129b475fbf4e7be868c67d
[14:56:11] <icinga-wm>	 PROBLEM - Mobileapps LVS codfw on mobileapps.svc.codfw.wmnet is CRITICAL: /{domain}/v1/page/media/{title} (Get media in test page) timed out before a response was received: /{domain}/v1/media/image/featured/{year}/{month}/{day} (retrieve featured image data for April 29, 2016) timed out before a response was received: /{domain}/v1/page/mobile-sections/{title} (retrieve test page via mobile-sections) timed out before a response wa
[14:56:11] <icinga-wm>	 ain}/v1/page/summary/{title} (Get summary for test page) timed out before a response was received: /{domain}/v1/transform/html/to/mobile-html/{title} (Get preview mobile HTML for test page) timed out before a response was received https://wikitech.wikimedia.org/wiki/Mobileapps_%28service%29
[14:56:11] <icinga-wm>	 PROBLEM - Nginx local proxy to apache on mw1343 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers
[14:56:13] <icinga-wm>	 RECOVERY - High CPU load on API appserver on mw1227 is OK: OK - load average: 18.04, 23.25, 17.85 https://wikitech.wikimedia.org/wiki/Application_servers
[14:56:19] <icinga-wm>	 PROBLEM - HHVM rendering on mw1315 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers
[14:56:19] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1020 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[14:56:20] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1027 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received: /en.wikipedia.org/v1/transform/wikitext/to/html/{title} (Transform wikitext to html) timed out before a response was received: /en.wikipedia.org/v1/page/title/{title} (Get rev by title from storage) timed out before a re
[14:56:20] <icinga-wm>	 ed https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[14:56:25] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1024 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[14:56:25] <icinga-wm>	 PROBLEM - Nginx local proxy to apache on mw1346 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers
[14:56:25] <icinga-wm>	 PROBLEM - mobileapps endpoints health on scb2002 is CRITICAL: /{domain}/v1/page/media/{title} (Get media in test page) timed out before a response was received: /{domain}/v1/media/image/featured/{year}/{month}/{day} (retrieve featured image data for April 29, 2016) timed out before a response was received: /{domain}/v1/page/summary/{title} (Get summary for test page) timed out before a response was received https://wikitech.wikim
[14:56:25] <icinga-wm>	 vices/Monitoring/mobileapps
[14:56:25] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1018 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[14:56:27] <icinga-wm>	 PROBLEM - Restbase edge esams on text-lb.esams.wikimedia.org is CRITICAL: /api/rest_v1/transform/wikitext/to/html/{title} (Transform wikitext to html) timed out before a response was received https://wikitech.wikimedia.org/wiki/RESTBase
[14:56:27] <icinga-wm>	 PROBLEM - proton endpoints health on proton1002 is CRITICAL: /{domain}/v1/pdf/{title}/{format}/{type} (Print the Foo page from en.wp.org in letter format) timed out before a response was received: /{domain}/v1/pdf/{title}/{format}/{type} (Respond file not found for a nonexistent title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/proton
[14:56:29] <_joe_>	 and indeed
[14:56:37] <icinga-wm>	 PROBLEM - PyBal IPVS diff check on lvs1015 is CRITICAL: CRITICAL: Hosts in IPVS but unknown to PyBal: set([mw1224.eqiad.wmnet, mw1226.eqiad.wmnet, mw1348.eqiad.wmnet, mw1233.eqiad.wmnet, mw1346.eqiad.wmnet, mw1221.eqiad.wmnet, mw1342.eqiad.wmnet, mw1340.eqiad.wmnet, mw1344.eqiad.wmnet, mw1343.eqiad.wmnet, mw1225.eqiad.wmnet, mw1341.eqiad.wmnet, mw1315.eqiad.wmnet, mw1347.eqiad.wmnet, mw1345.eqiad.wmnet, mw1339.eqiad.wmnet]) https
[14:56:37] <icinga-wm>	 edia.org/wiki/PyBal
[14:56:37] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs1016 is CRITICAL: PYBAL CRITICAL - CRITICAL - api-https_443: Servers mw1346.eqiad.wmnet, mw1344.eqiad.wmnet, mw1226.eqiad.wmnet, mw1348.eqiad.wmnet, mw1233.eqiad.wmnet, mw1340.eqiad.wmnet, mw1343.eqiad.wmnet, mw1225.eqiad.wmnet, mw1347.eqiad.wmnet, mw1345.eqiad.wmnet, mw1223.eqiad.wmnet, mw1221.eqiad.wmnet, mw1317.eqiad.wmnet, mw1224.eqiad.wmnet, mw1342.eqiad.wmnet, mw1341.eqiad.wmnet, 
[14:56:37] <icinga-wm>	 t are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[14:56:37] <icinga-wm>	 PROBLEM - HHVM rendering on mw1344 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers
[14:56:37] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase2011 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received: /en.wikipedia.org/v1/transform/wikitext/to/html/{title} (Transform wikitext to html) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[14:56:37] <icinga-wm>	 PROBLEM - HHVM rendering on mw1346 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers
[14:56:39] <icinga-wm>	 RECOVERY - High CPU load on API appserver on mw1233 is OK: OK - load average: 19.59, 23.26, 17.38 https://wikitech.wikimedia.org/wiki/Application_servers
[14:56:41] <icinga-wm>	 PROBLEM - HTTP availability for Varnish at eqsin on icinga1001 is CRITICAL: job=varnish-text site=eqsin https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 https://logstash.wikimedia.org/goto/60aa05b6e1129b475fbf4e7be868c67d
[14:56:45] <icinga-wm>	 PROBLEM - proton endpoints health on proton1001 is CRITICAL: /{domain}/v1/pdf/{title}/{format}/{type} (Print the Foo page from en.wp.org in letter format) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/proton
[14:56:49] <icinga-wm>	 PROBLEM - mobileapps endpoints health on scb1002 is CRITICAL: /{domain}/v1/media/image/featured/{year}/{month}/{day} (retrieve featured image data for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps
[14:56:51] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1022 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[14:56:51] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase2018 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[14:57:05] <icinga-wm>	 RECOVERY - Nginx local proxy to apache on mw1339 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 618 bytes in 0.666 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[14:57:09] <icinga-wm>	 PROBLEM - High CPU load on API appserver on mw1234 is CRITICAL: CRITICAL - load average: 55.00, 37.27, 22.62 https://wikitech.wikimedia.org/wiki/Application_servers
[14:57:11] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase-dev1004 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[14:57:13] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase2013 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[14:57:13] <icinga-wm>	 RECOVERY - mobileapps endpoints health on scb2006 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps
[14:57:15] <icinga-wm>	 PROBLEM - mobileapps endpoints health on scb2003 is CRITICAL: /{domain}/v1/page/media/{title} (Get media in test page) timed out before a response was received: /{domain}/v1/page/summary/{title} (Get summary for test page) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps
[14:57:15] <icinga-wm>	 RECOVERY - mobileapps endpoints health on scb1003 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps
[14:57:17] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase2009 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[14:57:17] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1021 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[14:57:17] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase2015 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[14:57:19] <icinga-wm>	 RECOVERY - mobileapps endpoints health on scb2004 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps
[14:57:20] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1025 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[14:57:20] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase2017 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[14:57:20] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1023 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[14:57:21] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1019 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[14:57:23] <icinga-wm>	 RECOVERY - HHVM rendering on mw1226 is OK: HTTP OK: HTTP/1.1 200 OK - 81370 bytes in 0.810 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[14:57:25] <icinga-wm>	 RECOVERY - Cxserver LVS eqiad on cxserver.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/CX
[14:57:25] <icinga-wm>	 PROBLEM - puppet last run on db1131 is CRITICAL: CRITICAL: Failed to apply catalog, zero resources tracked by Puppet. Could be an interrupted request or a dependency cycle. https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[14:57:25] <icinga-wm>	 RECOVERY - mobileapps endpoints health on scb1001 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps
[14:57:27] <icinga-wm>	 RECOVERY - Apache HTTP on mw1315 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 617 bytes in 0.393 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[14:57:29] <icinga-wm>	 PROBLEM - High CPU load on API appserver on mw1232 is CRITICAL: CRITICAL - load average: 60.02, 39.12, 23.38 https://wikitech.wikimedia.org/wiki/Application_servers
[14:57:29] <icinga-wm>	 PROBLEM - HTTP availability for Varnish at codfw on icinga1001 is CRITICAL: job=varnish-text site=codfw https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 https://logstash.wikimedia.org/goto/60aa05b6e1129b475fbf4e7be868c67d
[14:57:29] <icinga-wm>	 RECOVERY - Apache HTTP on mw1342 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 617 bytes in 0.558 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[14:57:33] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase-dev1005 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[14:57:33] <icinga-wm>	 RECOVERY - HHVM rendering on mw1343 is OK: HTTP OK: HTTP/1.1 200 OK - 81370 bytes in 0.758 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[14:57:33] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs1015 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[14:57:33] <icinga-wm>	 RECOVERY - HHVM rendering on mw1340 is OK: HTTP OK: HTTP/1.1 200 OK - 81370 bytes in 0.149 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[14:57:37] <icinga-wm>	 RECOVERY - Nginx local proxy to apache on mw1345 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 618 bytes in 0.502 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[14:57:38] <paravoid>	 3~/win 30
[14:57:39] <icinga-wm>	 RECOVERY - Nginx local proxy to apache on mw1342 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 618 bytes in 0.872 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[14:57:39] <icinga-wm>	 RECOVERY - mobileapps endpoints health on scb2001 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps
[14:57:41] <icinga-wm>	 RECOVERY - Apache HTTP on mw1343 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 617 bytes in 0.344 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[14:57:45] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1026 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[14:57:45] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase2016 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[14:57:47] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase-dev1006 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[14:57:47] <icinga-wm>	 RECOVERY - Nginx local proxy to apache on mw1340 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 617 bytes in 0.076 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[14:57:47] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase2019 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[14:57:47] <icinga-wm>	 RECOVERY - Mobileapps LVS eqiad on mobileapps.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Mobileapps_%28service%29
[14:57:49] <icinga-wm>	 RECOVERY - Nginx local proxy to apache on mw1343 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 618 bytes in 0.408 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[14:57:53] <icinga-wm>	 RECOVERY - Mobileapps LVS codfw on mobileapps.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Mobileapps_%28service%29
[14:57:53] <icinga-wm>	 RECOVERY - Apache HTTP on mw1346 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 616 bytes in 0.092 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[14:57:55] <icinga-wm>	 RECOVERY - High CPU load on API appserver on mw1283 is OK: OK - load average: 17.51, 30.81, 23.89 https://wikitech.wikimedia.org/wiki/Application_servers
[14:57:57] <icinga-wm>	 RECOVERY - HHVM rendering on mw1315 is OK: HTTP OK: HTTP/1.1 200 OK - 81370 bytes in 0.701 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[14:58:01] <icinga-wm>	 RECOVERY - Nginx local proxy to apache on mw1346 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 618 bytes in 0.225 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[14:58:03] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1027 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[14:58:03] <icinga-wm>	 RECOVERY - Apache HTTP on mw1339 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 616 bytes in 0.049 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[14:58:03] <icinga-wm>	 RECOVERY - Apache HTTP on mw1340 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 616 bytes in 0.047 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[14:58:03] <icinga-wm>	 RECOVERY - HHVM rendering on mw1339 is OK: HTTP OK: HTTP/1.1 200 OK - 81370 bytes in 0.348 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[14:58:05] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1024 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[14:58:05] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1018 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[14:58:05] <icinga-wm>	 RECOVERY - mobileapps endpoints health on scb2002 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps
[14:58:07] <icinga-wm>	 RECOVERY - proton endpoints health on proton1002 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/proton
[14:58:07] <icinga-wm>	 RECOVERY - recommendation_api endpoints health on scb2005 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[14:58:07] <icinga-wm>	 RECOVERY - Restbase edge esams on text-lb.esams.wikimedia.org is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/RESTBase
[14:58:11] <icinga-wm>	 RECOVERY - termbox eqiad on termbox.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/WMDE/Wikidata/SSR_Service
[14:58:11] <icinga-wm>	 RECOVERY - HHVM rendering on mw1344 is OK: HTTP OK: HTTP/1.1 200 OK - 81370 bytes in 0.213 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[14:58:11] <icinga-wm>	 RECOVERY - HHVM rendering on mw1346 is OK: HTTP OK: HTTP/1.1 200 OK - 81370 bytes in 0.304 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[14:58:17] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase2011 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[14:58:17] <icinga-wm>	 PROBLEM - High CPU load on API appserver on mw1230 is CRITICAL: CRITICAL - load average: 71.13, 43.74, 24.78 https://wikitech.wikimedia.org/wiki/Application_servers
[14:58:17] <icinga-wm>	 PROBLEM - High CPU load on API appserver on mw1229 is CRITICAL: CRITICAL - load average: 53.84, 32.40, 19.93 https://wikitech.wikimedia.org/wiki/Application_servers
[14:58:17] <icinga-wm>	 PROBLEM - High CPU load on API appserver on mw1286 is CRITICAL: CRITICAL - load average: 81.75, 51.28, 32.74 https://wikitech.wikimedia.org/wiki/Application_servers
[14:58:19] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs1016 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[14:58:21] <icinga-wm>	 PROBLEM - High CPU load on API appserver on mw1290 is CRITICAL: CRITICAL - load average: 80.01, 52.55, 32.40 https://wikitech.wikimedia.org/wiki/Application_servers
[14:58:21] <icinga-wm>	 RECOVERY - PHP7 rendering on mw1348 is OK: HTTP OK: HTTP/1.1 200 OK - 81411 bytes in 0.395 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[14:58:23] <icinga-wm>	 PROBLEM - High CPU load on API appserver on mw1235 is CRITICAL: CRITICAL - load average: 53.41, 31.35, 19.29 https://wikitech.wikimedia.org/wiki/Application_servers
[14:58:25] <icinga-wm>	 RECOVERY - proton endpoints health on proton1001 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/proton
[14:58:25] <icinga-wm>	 RECOVERY - mobileapps endpoints health on scb1002 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps
[14:58:27] <icinga-wm>	 RECOVERY - recommendation_api endpoints health on scb1002 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[14:58:29] <icinga-wm>	 PROBLEM - puppet last run on cp3030 is CRITICAL: CRITICAL: Failed to apply catalog, zero resources tracked by Puppet. Could be an interrupted request or a dependency cycle. https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[14:58:29] <icinga-wm>	 RECOVERY - Apache HTTP on mw1348 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 658 bytes in 0.041 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[14:58:31] <icinga-wm>	 PROBLEM - High CPU load on API appserver on mw1288 is CRITICAL: CRITICAL - load average: 66.89, 49.08, 31.52 https://wikitech.wikimedia.org/wiki/Application_servers
[14:58:33] <icinga-wm>	 RECOVERY - Nginx local proxy to apache on mw1344 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 617 bytes in 0.055 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[14:58:33] <icinga-wm>	 RECOVERY - recommendation_api endpoints health on scb1001 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[14:58:43] <icinga-wm>	 RECOVERY - recommendation_api endpoints health on scb1004 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[14:58:43] <icinga-wm>	 PROBLEM - High CPU load on API appserver on mw1231 is CRITICAL: CRITICAL - load average: 55.51, 34.38, 22.10 https://wikitech.wikimedia.org/wiki/Application_servers
[14:58:53] <icinga-wm>	 RECOVERY - mobileapps endpoints health on scb2005 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps
[14:58:53] <icinga-wm>	 PROBLEM - High CPU load on API appserver on mw1313 is CRITICAL: CRITICAL - load average: 80.33, 49.34, 32.47 https://wikitech.wikimedia.org/wiki/Application_servers
[14:58:53] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1017 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[14:58:53] <icinga-wm>	 RECOVERY - Restbase LVS codfw on restbase.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/RESTBase
[14:58:55] <icinga-wm>	 RECOVERY - mobileapps endpoints health on scb2003 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps
[14:59:01] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase2012 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[14:59:07] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase2010 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[14:59:09] <icinga-wm>	 PROBLEM - High CPU load on API appserver on mw1223 is CRITICAL: CRITICAL - load average: 55.92, 34.70, 21.54 https://wikitech.wikimedia.org/wiki/Application_servers
[14:59:13] <icinga-wm>	 PROBLEM - High CPU load on API appserver on mw1284 is CRITICAL: CRITICAL - load average: 45.09, 44.60, 30.00 https://wikitech.wikimedia.org/wiki/Application_servers
[14:59:17] <icinga-wm>	 PROBLEM - puppet last run on dbproxy1003 is CRITICAL: CRITICAL: Failed to apply catalog, zero resources tracked by Puppet. Could be an interrupted request or a dependency cycle. https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[14:59:19] <icinga-wm>	 RECOVERY - recommendation_api endpoints health on scb1003 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[14:59:21] <icinga-wm>	 RECOVERY - recommendation_api endpoints health on scb2002 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[14:59:23] <icinga-wm>	 RECOVERY - recommendation_api endpoints health on scb2003 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[14:59:23] <icinga-wm>	 RECOVERY - recommendation_api endpoints health on scb2004 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[14:59:25] <icinga-wm>	 RECOVERY - recommendation_api endpoints health on scb2006 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[14:59:39] <icinga-wm>	 PROBLEM - High CPU load on API appserver on mw1227 is CRITICAL: CRITICAL - load average: 56.93, 35.67, 23.47 https://wikitech.wikimedia.org/wiki/Application_servers
[14:59:41] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1020 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[14:59:51] <icinga-wm>	 PROBLEM - puppet last run on ms-be1033 is CRITICAL: CRITICAL: Failed to apply catalog, zero resources tracked by Puppet. Could be an interrupted request or a dependency cycle. https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[15:00:01] <icinga-wm>	 RECOVERY - High CPU load on API appserver on mw1278 is OK: OK - load average: 14.01, 31.85, 27.47 https://wikitech.wikimedia.org/wiki/Application_servers
[15:00:03] <icinga-wm>	 RECOVERY - recommendation_api endpoints health on scb2001 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[15:00:09] <icinga-wm>	 RECOVERY - High CPU load on API appserver on mw1276 is OK: OK - load average: 12.87, 27.41, 25.05 https://wikitech.wikimedia.org/wiki/Application_servers
[15:00:33] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase2020 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[15:00:47] <icinga-wm>	 RECOVERY - HTTP availability for Varnish at codfw on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 https://logstash.wikimedia.org/goto/60aa05b6e1129b475fbf4e7be868c67d
[15:00:53] <icinga-wm>	 RECOVERY - High CPU load on API appserver on mw1281 is OK: OK - load average: 14.40, 27.61, 26.00 https://wikitech.wikimedia.org/wiki/Application_servers
[15:00:55] <icinga-wm>	 RECOVERY - High CPU load on API appserver on mw1282 is OK: OK - load average: 14.49, 29.14, 26.60 https://wikitech.wikimedia.org/wiki/Application_servers
[15:00:55] <icinga-wm>	 RECOVERY - High CPU load on API appserver on mw1277 is OK: OK - load average: 13.83, 28.51, 26.75 https://wikitech.wikimedia.org/wiki/Application_servers
[15:01:03] <icinga-wm>	 RECOVERY - High CPU load on API appserver on mw1287 is OK: OK - load average: 13.54, 28.14, 25.99 https://wikitech.wikimedia.org/wiki/Application_servers
[15:01:27] <icinga-wm>	 RECOVERY - High CPU load on API appserver on mw1279 is OK: OK - load average: 13.75, 28.14, 27.18 https://wikitech.wikimedia.org/wiki/Application_servers
[15:01:27] <icinga-wm>	 RECOVERY - High CPU load on API appserver on mw1280 is OK: OK - load average: 15.60, 30.46, 28.08 https://wikitech.wikimedia.org/wiki/Application_servers
[15:01:31] <icinga-wm>	 RECOVERY - High CPU load on API appserver on mw1314 is OK: OK - load average: 19.28, 35.43, 31.35 https://wikitech.wikimedia.org/wiki/Application_servers
[15:01:39] <icinga-wm>	 RECOVERY - HTTP availability for Varnish at eqsin on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 https://logstash.wikimedia.org/goto/60aa05b6e1129b475fbf4e7be868c67d
[15:01:45] <icinga-wm>	 RECOVERY - PyBal IPVS diff check on lvs1016 is OK: OK: no difference between hosts in IPVS/PyBal https://wikitech.wikimedia.org/wiki/PyBal
[15:01:55] <icinga-wm>	 RECOVERY - HTTP availability for Nginx -SSL terminators- at eqiad on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Cache_TLS_termination https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[15:01:57] <icinga-wm>	 RECOVERY - HTTP availability for Nginx -SSL terminators- at eqsin on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Cache_TLS_termination https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[15:02:07] <icinga-wm>	 RECOVERY - High CPU load on API appserver on mw1313 is OK: OK - load average: 20.57, 34.31, 29.72 https://wikitech.wikimedia.org/wiki/Application_servers
[15:02:07] <icinga-wm>	 RECOVERY - High CPU load on API appserver on mw1289 is OK: OK - load average: 14.89, 30.39, 28.46 https://wikitech.wikimedia.org/wiki/Application_servers
[15:02:07] <icinga-wm>	 RECOVERY - High CPU load on API appserver on mw1234 is OK: OK - load average: 10.37, 23.24, 21.12 https://wikitech.wikimedia.org/wiki/Application_servers
[15:02:09] <icinga-wm>	 RECOVERY - HTTP availability for Varnish at eqiad on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 https://logstash.wikimedia.org/goto/60aa05b6e1129b475fbf4e7be868c67d
[15:02:13] <icinga-wm>	 RECOVERY - PyBal IPVS diff check on lvs1015 is OK: OK: no difference between hosts in IPVS/PyBal https://wikitech.wikimedia.org/wiki/PyBal
[15:02:25] <icinga-wm>	 RECOVERY - High CPU load on API appserver on mw1284 is OK: OK - load average: 14.48, 29.92, 26.99 https://wikitech.wikimedia.org/wiki/Application_servers
[15:02:41] <icinga-wm>	 RECOVERY - High CPU load on API appserver on mw1285 is OK: OK - load average: 14.11, 31.37, 29.96 https://wikitech.wikimedia.org/wiki/Application_servers
[15:02:47] <icinga-wm>	 RECOVERY - HTTP availability for Varnish at ulsfo on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 https://logstash.wikimedia.org/goto/60aa05b6e1129b475fbf4e7be868c67d
[15:03:01] <icinga-wm>	 RECOVERY - HTTP availability for Nginx -SSL terminators- at esams on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Cache_TLS_termination https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[15:03:01] <icinga-wm>	 RECOVERY - HTTP availability for Varnish at esams on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 https://logstash.wikimedia.org/goto/60aa05b6e1129b475fbf4e7be868c67d
[15:03:07] <icinga-wm>	 RECOVERY - High CPU load on API appserver on mw1230 is OK: OK - load average: 10.42, 22.92, 20.91 https://wikitech.wikimedia.org/wiki/Application_servers
[15:03:07] <icinga-wm>	 RECOVERY - High CPU load on API appserver on mw1229 is OK: OK - load average: 11.91, 22.30, 19.47 https://wikitech.wikimedia.org/wiki/Application_servers
[15:03:07] <icinga-wm>	 RECOVERY - High CPU load on API appserver on mw1286 is OK: OK - load average: 14.95, 31.95, 29.98 https://wikitech.wikimedia.org/wiki/Application_servers
[15:03:13] <icinga-wm>	 RECOVERY - HTTP availability for Nginx -SSL terminators- at codfw on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Cache_TLS_termination https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[15:03:15] <icinga-wm>	 RECOVERY - High CPU load on API appserver on mw1235 is OK: OK - load average: 10.40, 24.56, 20.56 https://wikitech.wikimedia.org/wiki/Application_servers
[15:03:15] <icinga-wm>	 PROBLEM - puppet last run on analytics1048 is CRITICAL: CRITICAL: Failed to apply catalog, zero resources tracked by Puppet. Could be an interrupted request or a dependency cycle. https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[15:03:21] <icinga-wm>	 RECOVERY - High CPU load on API appserver on mw1288 is OK: OK - load average: 15.38, 30.12, 28.42 https://wikitech.wikimedia.org/wiki/Application_servers
[15:04:09] <icinga-wm>	 RECOVERY - HTTP availability for Nginx -SSL terminators- at ulsfo on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Cache_TLS_termination https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[15:04:27] <icinga-wm>	 RECOVERY - High CPU load on API appserver on mw1227 is OK: OK - load average: 13.23, 25.51, 22.91 https://wikitech.wikimedia.org/wiki/Application_servers
[15:04:49] <icinga-wm>	 RECOVERY - High CPU load on API appserver on mw1290 is OK: OK - load average: 14.21, 27.53, 28.14 https://wikitech.wikimedia.org/wiki/Application_servers
[15:05:11] <icinga-wm>	 RECOVERY - High CPU load on API appserver on mw1231 is OK: OK - load average: 9.52, 21.35, 21.44 https://wikitech.wikimedia.org/wiki/Application_servers
[15:05:35] <icinga-wm>	 RECOVERY - High CPU load on API appserver on mw1223 is OK: OK - load average: 9.93, 22.97, 22.15 https://wikitech.wikimedia.org/wiki/Application_servers
[15:08:47] <icinga-wm>	 RECOVERY - High CPU load on API appserver on mw1232 is OK: OK - load average: 8.29, 17.41, 22.78 https://wikitech.wikimedia.org/wiki/Application_servers
[15:12:39] <logmsgbot>	 !log @ helmfile [STAGING] Ran 'apply' command on namespace 'restrouter' for release 'staging' .
[15:12:47] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:14:13] <mdholloway>	 greg-g: OK to do a bugfix deploy to fix a few MCS production crashers? (T229521 and T229630)
[15:14:13] <stashbot>	 T229630: NOT_FOUND_ERR (8): the object can not be found here - https://phabricator.wikimedia.org/T229630
[15:14:14] <stashbot>	 T229521: Cannot read property 'type' of undefined in MCS - https://phabricator.wikimedia.org/T229521
[15:14:22] <logmsgbot>	 !log @ helmfile [STAGING] Ran 'apply' command on namespace 'restrouter' for release 'staging' .
[15:14:28] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:19:03] <wikibugs>	 (03PS1) 10CRusnov: netbox: Fix additional swift parameters. [puppet] - 10https://gerrit.wikimedia.org/r/527576 (https://phabricator.wikimedia.org/T209182)
[15:19:58] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] netbox: Fix additional swift parameters. [puppet] - 10https://gerrit.wikimedia.org/r/527576 (https://phabricator.wikimedia.org/T209182) (owner: 10CRusnov)
[15:21:46] <wikibugs>	 (03PS2) 10CRusnov: netbox: Fix additional swift parameters [puppet] - 10https://gerrit.wikimedia.org/r/527576 (https://phabricator.wikimedia.org/T209182)
[15:22:42] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] netbox: Fix additional swift parameters [puppet] - 10https://gerrit.wikimedia.org/r/527576 (https://phabricator.wikimedia.org/T209182) (owner: 10CRusnov)
[15:25:23] <icinga-wm>	 RECOVERY - puppet last run on db1131 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[15:25:32] <wikibugs>	 (03PS3) 10CRusnov: netbox: Fix additional swift parameters [puppet] - 10https://gerrit.wikimedia.org/r/527576 (https://phabricator.wikimedia.org/T209182)
[15:25:39] <icinga-wm>	 RECOVERY - puppet last run on analytics1048 is OK: OK: Puppet is currently enabled, last run 24 seconds ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[15:26:29] <icinga-wm>	 RECOVERY - puppet last run on cp3030 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[15:27:15] <icinga-wm>	 RECOVERY - puppet last run on dbproxy1003 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[15:27:51] <icinga-wm>	 RECOVERY - puppet last run on ms-be1033 is OK: OK: Puppet is currently enabled, last run 5 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[15:30:59] <wikibugs>	 10Operations, 10DBA, 10Gerrit, 10Release-Engineering-Team-TODO, and 2 others: Gerrit is failing to connect to db on gerrit2001 thus preventing systemd from working - https://phabricator.wikimedia.org/T176532 (10Dzahn) a:03Dzahn I'll add it. Thanks Manuel!
[15:31:10] <greg-g>	 mdholloway: yeah, those look reasonable.
[15:31:22] <mdholloway>	 greg-g: cool, thanks!
[15:31:30] <greg-g>	 (sorry, was in a meeting)
[15:31:35] <mdholloway>	 no prob
[15:36:49] <wikibugs>	 10Operations, 10ops-eqiad, 10netops: (Need By: Sept 30) upgrade msw1-eqiad from EX4200 to EX4300 - https://phabricator.wikimedia.org/T225121 (10ayounsi) a:05ayounsi→03Papaul codfw is done. @papaul let me know if you need help to prepare the eqiad one.
[15:46:47] <Lucas_WMDE>	 is someone online by now who knows their way around scap and i18n/l10n rebuilds?
[15:46:55] <Lucas_WMDE>	 we have some missing messages on Wikidata, e. g. https://www.wikidata.org/wiki/MediaWiki:Valueview-expertextender-languageselector-label
[15:47:11] <Lucas_WMDE>	 Amir1 tried to fix it earlier today but didn’t succeed, and now he’s already started his weekend so I’m trying to take over
[15:47:29] <Lucas_WMDE>	 not sure how to fix it though… scap sync-l10n? or full sync?
[15:47:56] <logmsgbot>	 !log mholloway-shell@deploy1001 Started deploy [mobileapps/deploy@250f711]: Fix MCS production crashers (T229521, T229630)
[15:48:05] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:48:06] <stashbot>	 T229630: NOT_FOUND_ERR (8): the object can not be found here - https://phabricator.wikimedia.org/T229630
[15:48:06] <stashbot>	 T229521: Cannot read property 'type' of undefined in MCS - https://phabricator.wikimedia.org/T229521
[15:49:13] <Lucas_WMDE>	 cc brennen as train conductor this week
[15:50:31] <brennen>	 hrm.  i do not know, but will attempt to find out.
[15:50:53] <brennen>	 well - thcipriani, any thoughts?
[15:51:36] <Lucas_WMDE>	 from the SAL, it looks like `scap sync-l10n` is what Amir tried
[15:52:37] <logmsgbot>	 !log mholloway-shell@deploy1001 Finished deploy [mobileapps/deploy@250f711]: Fix MCS production crashers (T229521, T229630) (duration: 04m 41s)
[15:52:41] * thcipriani looks
[15:52:46] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:55:36] <wikibugs>	 (03PS1) 10Jbond: CAS: Initial cas module [puppet] - 10https://gerrit.wikimedia.org/r/527591
[15:56:26] <wikibugs>	 (03PS2) 10Jbond: CAS: Initial cas module [puppet] - 10https://gerrit.wikimedia.org/r/527591
[15:56:49] <chaomodus>	 see, i highlight 'cas' for obvious reasons, and also it's annoying that the module is called 'CAS'
[15:57:37] <thcipriani>	 Lucas_WMDE: https://phabricator.wikimedia.org/P8853 looks like https://phabricator.wikimedia.org/T227814 with a different message. Where is that message?
[15:57:39] <wikibugs>	 (03PS3) 10Jbond: CAS: Initial cas module [puppet] - 10https://gerrit.wikimedia.org/r/527591
[15:58:15] <Lucas_WMDE>	 thcipriani: in the WikibaseView extension, I believe
[15:58:30] <thcipriani>	 WikibaseView
[15:58:31] <thcipriani>	 yep
[15:58:50] <thcipriani>	 https://phabricator.wikimedia.org/P8853#53268
[15:59:04] <jbond42>	 chaomodus: I'm happy to rename to apereo_cas, would that still trigger your highlight?
[15:59:08] <Lucas_WMDE>	 hrm
[15:59:21] <Lucas_WMDE>	 it should still be pulled in by WikibaseRepo, no?
[15:59:22] <chaomodus>	 ahah yah :)
[15:59:24] <Lucas_WMDE>	 let me check what that looks like
[15:59:50] * thcipriani does as well
[16:00:07] <Lucas_WMDE>	 okay, the WikibaseView PHP entry point does not have the i18n files compatibility thingy
[16:00:36] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] CAS: Initial cas module [puppet] - 10https://gerrit.wikimedia.org/r/527591 (owner: 10Jbond)
[16:00:52] <Lucas_WMDE>	 so let’s do the same thing as in https://gerrit.wikimedia.org/r/524539, I guess?
[16:01:14] <Lucas_WMDE>	 no idea why it would have broken between .15 and .16 though
[16:01:56] <Lucas_WMDE>	 can I somehow build ExtensionMessages-*.php locally, to figure out if that adds WikibaseView back to it?
[16:02:31] * thcipriani finds command
[16:02:54] <thcipriani>	 also, yeah, I think the same fix as 524539 should work
[16:03:14] <Lucas_WMDE>	 okay I’ll prepare the patch then
[16:03:39] <thcipriani>	 running mergeMessageFileList.php should do the trick
[16:05:32] <thcipriani>	 I think it fails randomly due to either (a) the wiki that is used for mergeMessageFileList.php, which is pretty much a random wiki from group0, OR (b) wfLoadExtension ordering -- I would guess (a) is the problem, but it hasn't been a problem since we have been using wgMessagesDir. tl;dr: I don't know why it broke in wmf.16 vs wmf.15, but I also don't know why it worked in wmf.15 :)
[16:08:55] <Lucas_WMDE>	 when I run mergeMessageFileList.php --extensions-dir extensions/ locally, I get WikibaseView in both cases
[16:09:09] <Lucas_WMDE>	 just in one case the value is a string and in the other an array with a single string
[16:09:12] <Lucas_WMDE>	 oh wait
[16:09:27] <Lucas_WMDE>	 gah, the string version (master) is missing a slash in the middle, I think
[16:09:54] <Lucas_WMDE>	 https://phabricator.wikimedia.org/P8854
[16:10:00] <Lucas_WMDE>	 “viewlib”
[16:10:05] <Lucas_WMDE>	 but anyways, I’ll upload the patch to add it to PHP
[16:10:11] <Lucas_WMDE>	 even though it looks like there might just be a bug in the JSON
[16:11:43] <Lucas_WMDE>	 no, sorry, I’m an idiot, the slash is missing from the *new* version because I made the patch wrong
[16:11:46] <Lucas_WMDE>	 nevermind
[16:12:37] <wikibugs>	 (03PS1) 10Paladox: profile::mariadb::ferm_misc: Add gerrit2001.wikimedia.org to the firewall for port 3306 [puppet] - 10https://gerrit.wikimedia.org/r/527595 (https://phabricator.wikimedia.org/T176532)
[16:13:18] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] profile::mariadb::ferm_misc: Add gerrit2001.wikimedia.org to the firewall for port 3306 [puppet] - 10https://gerrit.wikimedia.org/r/527595 (https://phabricator.wikimedia.org/T176532) (owner: 10Paladox)
[16:13:43] <wikibugs>	 (03PS2) 10Paladox: profile::mariadb::ferm_misc: Add gerrit2001.wikimedia.org to the firewall [puppet] - 10https://gerrit.wikimedia.org/r/527595 (https://phabricator.wikimedia.org/T176532)
[16:14:58] <wikibugs>	 (03PS4) 10Jbond: apereo_cas: Initial module [puppet] - 10https://gerrit.wikimedia.org/r/527591
[16:15:27] <jbond42>	 chaomodus: hopefully that's ^^^ better :)
[16:17:29] <wikibugs>	 (03PS1) 10Paladox: gerrit: Re-enable the use of HTTP auth tokens [puppet] - 10https://gerrit.wikimedia.org/r/527596 (https://phabricator.wikimedia.org/T225308)
[16:18:26] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] apereo_cas: Initial module [puppet] - 10https://gerrit.wikimedia.org/r/527591 (owner: 10Jbond)
[16:21:26] <wikibugs>	 (03PS5) 10Jbond: apereo_cas: Initial module [puppet] - 10https://gerrit.wikimedia.org/r/527591
[16:23:19] <Lucas_WMDE>	 anyone up for reviewing https://gerrit.wikimedia.org/r/527594 ? I’d like to backport+deploy it before the weekend
[16:23:48] <Lucas_WMDE>	 ok nevermind Amir got it :)
[16:24:34] * thcipriani backports
[16:24:42] <Lucas_WMDE>	 I already cherry-picked it
[16:24:55] <thcipriani>	 ah, just noticed :)
[16:24:58] <Lucas_WMDE>	 or did you mean deploy it before it goes through?
[16:25:00] <Lucas_WMDE>	 ok :)
[16:25:05] <Lucas_WMDE>	 gate-and-submit will take a while :/
[16:26:48] <Lucas_WMDE>	 and afterwards, I assume do the usual `git fetch`, `git submodule update` etc on the deployment host (same as SWAT), and then `scap sync` instead of `sync-file`?
[16:26:52] <Lucas_WMDE>	 or is it more special than that?
[16:27:02] <thcipriani>	 nope: that's exactly correct
[16:27:05] <Lucas_WMDE>	 yay
[16:27:16] <wikibugs>	 10Operations, 10Traffic, 10conftool, 10Patch-For-Review, and 2 others: Figure out a security model for etcd - https://phabricator.wikimedia.org/T97972 (10CDanis)
[16:29:11] <icinga-wm>	 PROBLEM - puppet last run on dbstore1004 is CRITICAL: CRITICAL: Failed to apply catalog, zero resources tracked by Puppet. Could be an interrupted request or a dependency cycle. https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[16:35:42] <Lucas_WMDE>	 noo, one of the gate-and-submit-swat builds failed
[16:35:47] <Lucas_WMDE>	 so I’ll have to repeat it :(
[16:36:11] <Lucas_WMDE>	 there’s no way to cancel the others, right? have to wait until the whole job has failed and then restart it
[16:38:56] <thcipriani>	 if you push up another patchset I believe it cancels the current running jobs. i.e. tweak the commit message
[16:41:14] <Lucas_WMDE>	 great idea, thanks
[16:42:25] <XioNoX>	 !log replace rhenium with netflow1001 netflow target + iBGP peer on all routers
[16:42:31] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:43:47] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C: 04-1] "LGTM overall, see inline" (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/527576 (https://phabricator.wikimedia.org/T209182) (owner: 10CRusnov)
[16:44:55] <icinga-wm>	 PROBLEM - puppet last run on hassium is CRITICAL: CRITICAL: Failed to apply catalog, zero resources tracked by Puppet. Could be an interrupted request or a dependency cycle. https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[16:45:31] <icinga-wm>	 RECOVERY - BGP status on cr1-codfw is OK: BGP OK - up: 30, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[16:50:39] <wikibugs>	 10Operations, 10Traffic: rhenium [spare] server still receiving flow data - https://phabricator.wikimedia.org/T224477 (10ayounsi) 05Open→03Resolved This also caused the BGP sessions between rhenium (netflow) and the routers to alert.  Updated to use the netflow1001 IP instead.
[16:50:42] <wikibugs>	 10Operations, 10netops: migrate netinsights from rhenium to sulfur - https://phabricator.wikimedia.org/T212011 (10ayounsi)
[16:50:44] <wikibugs>	 10Operations, 10ops-codfw, 10ops-eqiad, 10DC-Ops, and 2 others: Triage and resolve all outstanding Netbox report errors - https://phabricator.wikimedia.org/T223450 (10wiki_willy) @faidon - The majority of the influx in Netbox errors looks like it's from the new PDUs.  Some of the info was updated into Netb...
[16:51:43] <wikibugs>	 10Operations, 10netops: migrate netinsights from rhenium to sulfur - https://phabricator.wikimedia.org/T212011 (10ayounsi) 05Open→03Resolved a:05MoritzMuehlenhoff→03ayounsi We created a VM (netflow1001) to replace rhenium, everything has been migrated.
[16:51:45] <wikibugs>	 10Operations, 10ops-eqiad: rack/setup/install sulfur.wikimedia.org - https://phabricator.wikimedia.org/T201364 (10ayounsi)
[16:52:24] <wikibugs>	 10Operations, 10MediaWiki-Configuration, 10conftool: noc.wm.o/db.php: remove hosts information, or fetch it from etcd somehow - https://phabricator.wikimedia.org/T229631 (10CDanis)
[16:53:02] <wikibugs>	 (03PS1) 10Elukey: [WIP] profile::netbox: allow /metrics to be polled via http [puppet] - 10https://gerrit.wikimedia.org/r/527601
[16:55:37] <wikibugs>	 (03PS2) 10Elukey: [WIP] profile::netbox: allow /metrics to be polled via http [puppet] - 10https://gerrit.wikimedia.org/r/527601
[16:56:37] <wikibugs>	 (03PS4) 10CRusnov: netbox: Fix additional swift parameters [puppet] - 10https://gerrit.wikimedia.org/r/527576 (https://phabricator.wikimedia.org/T209182)
[16:56:47] <wikibugs>	 (03PS3) 10Elukey: [WIP] profile::netbox: allow /metrics to be polled via http [puppet] - 10https://gerrit.wikimedia.org/r/527601
[16:57:07] <icinga-wm>	 RECOVERY - puppet last run on dbstore1004 is OK: OK: Puppet is currently enabled, last run 5 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[17:00:15] <wikibugs>	 10Operations, 10MediaWiki-Configuration, 10conftool: noc.wm.o/db.php: remove hosts information, or fetch it from etcd somehow - https://phabricator.wikimedia.org/T229631 (10Krinkle) > First question is: does db.php have known users that require this information?  Unless you, a DBA, or someone else in SRE ans...
[17:05:34] <wikibugs>	 (03PS1) 10Krinkle: noc: Fix mw-api links on db.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/527604
[17:05:58] <wikibugs>	 (03CR) 10Krinkle: [C: 03+2] noc: Add cross-dc navigation links to db.php footer [mediawiki-config] - 10https://gerrit.wikimedia.org/r/527137 (https://phabricator.wikimedia.org/T197126) (owner: 10Krinkle)
[17:06:30] <wikibugs>	 (03CR) 10Ayounsi: "> Patch Set 2:" [puppet] - 10https://gerrit.wikimedia.org/r/526819 (https://phabricator.wikimedia.org/T226331) (owner: 10Ayounsi)
[17:07:23] <wikibugs>	 (03Merged) 10jenkins-bot: noc: Add cross-dc navigation links to db.php footer [mediawiki-config] - 10https://gerrit.wikimedia.org/r/527137 (https://phabricator.wikimedia.org/T197126) (owner: 10Krinkle)
[17:09:20] <wikibugs>	 10Operations, 10MediaWiki-Configuration, 10conftool: noc.wm.o/db.php: remove hosts information, or fetch it from etcd somehow - https://phabricator.wikimedia.org/T229631 (10CDanis) >>! In T229631#5388248, @Krinkle wrote: >> First question is: does db.php have known users that require this information? >  > U...
[17:09:39] <wikibugs>	 10Operations, 10MediaWiki-Configuration, 10conftool: noc.wm.o/db.php: remove hosts information, or fetch it from etcd somehow - https://phabricator.wikimedia.org/T229631 (10Marostegui) From the DBA point of view, we do use db-eqiad.php (or db.php) to quickly check what is and what isn't pooled from a browser...
[17:10:03] <logmsgbot>	 !log krinkle@deploy1001 Synchronized docroot/noc/db.php: ee528e886268c08e9377fbd764ec861b09adfc73 (duration: 00m 48s)
[17:10:09] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:10:49] <wikibugs>	 (03CR) 10Krinkle: [C: 03+2] noc: Fix mw-api links on db.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/527604 (owner: 10Krinkle)
[17:11:39] <wikibugs>	 10Operations, 10ops-eqiad, 10DC-Ops, 10netbox: Missing Netbox Info for New PDUs - https://phabricator.wikimedia.org/T229680 (10wiki_willy)
[17:12:51] <icinga-wm>	 RECOVERY - puppet last run on hassium is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[17:13:18] <wikibugs>	 (03CR) 10jenkins-bot: noc: Add cross-dc navigation links to db.php footer [mediawiki-config] - 10https://gerrit.wikimedia.org/r/527137 (https://phabricator.wikimedia.org/T197126) (owner: 10Krinkle)
[17:15:26] <wikibugs>	 10Operations, 10ops-codfw, 10ops-eqiad, 10DC-Ops, and 2 others: Triage and resolve all outstanding Netbox report errors - https://phabricator.wikimedia.org/T223450 (10wiki_willy)
[17:16:08] <wikibugs>	 (03CR) 10CDanis: [C: 03+1] noc: Fix mw-api links on db.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/527604 (owner: 10Krinkle)
[17:17:20] <wikibugs>	 (03PS2) 10Krinkle: noc: Fix mw-api links on db.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/527604
[17:17:38] <wikibugs>	 (03CR) 10Krinkle: [C: 03+2] noc: Fix mw-api links on db.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/527604 (owner: 10Krinkle)
[17:18:35] <wikibugs>	 (03Merged) 10jenkins-bot: noc: Fix mw-api links on db.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/527604 (owner: 10Krinkle)
[17:18:49] <Lucas_WMDE>	 Krinkle: can you let me know when you’re done on deploy1001?
[17:18:51] <wikibugs>	 (03CR) 10jenkins-bot: noc: Fix mw-api links on db.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/527604 (owner: 10Krinkle)
[17:19:11] <Krinkle>	 Lucas_WMDE: k, 1min
[17:19:15] <Lucas_WMDE>	 ok thanks
[17:19:47] <logmsgbot>	 !log krinkle@deploy1001 Synchronized docroot/noc/db.php: a75d23ecb1b (duration: 00m 47s)
[17:19:51] <Krinkle>	 Lucas_WMDE: done
[17:19:54] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:20:12] <Lucas_WMDE>	 thanks!
[17:20:38] <Lucas_WMDE>	 I’m backporting a Wikibase fix that needs a full scap sync
[17:25:13] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy1001 Started scap: Fix WikibaseView i18n globals (T229604)
[17:25:26] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:25:27] <stashbot>	 T229604: Several selectors/experts are broken - https://phabricator.wikimedia.org/T229604
[17:26:44] <XioNoX>	 !log add avoid_path to cr1/2-eqsin
[17:26:50] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:29:35] <Lucas_WMDE>	 WikibaseView appears in ExtensionMessages-1.34.0-wmf.16.php now, so far so good…
[17:38:27] <icinga-wm>	 RECOVERY - PHP opcache health on mwdebug2002 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health
[17:42:04] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy1001 Finished scap: Fix WikibaseView i18n globals (T229604) (duration: 16m 51s)
[17:42:12] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:42:12] <stashbot>	 T229604: Several selectors/experts are broken - https://phabricator.wikimedia.org/T229604
[17:44:10] <Lucas_WMDE>	 well… https://www.wikidata.org/wiki/MediaWiki:Valueview-expertextender-languageselector-label exists now
[17:44:19] <Lucas_WMDE>	 but when editing items I still don’t see the message
[17:44:32] <Lucas_WMDE>	 e. g. https://www.wikidata.org/wiki/Q66058247, try to add another title
[17:45:09] <Lucas_WMDE>	 but it works if I enable X-Wikimedia-Debug, wat
[17:46:12] <XioNoX>	 !log flap NTT link in eqsin
[17:46:18] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:55:11] <Lucas_WMDE>	 ok nevermind it’s fixed itself now, somehow
[17:55:24] <Lucas_WMDE>	 thanks a lot for the help thcipriani!
[17:55:38] <thcipriani>	 Lucas_WMDE: happy to help :)
[17:59:00] <Bsadowski1>	 Is it known that pages like the new user contributions take a while to load?
[17:59:10] <librenms-wmf>	 08Warning Alert for device cr1-eqsin.wikimedia.org - Processor usage over 85%
[17:59:29] <Bsadowski1>	 https://en.wikipedia.org/w/index.php?title=Special:Contributions&offset=20190802171826&contribs=newbie&target=newbies for example
[18:01:17] <James_F>	 Bsadowski1: We're thinking about getting rid of that page because it's so broken.
[18:01:23] <Bsadowski1>	 :O
[18:01:26] <Bsadowski1>	 Nooo!
[18:01:27] <Bsadowski1>	 xD
[18:01:36] <Bsadowski1>	 It took around 30 seconds for it to load.
[18:01:40] <James_F>	 Bsadowski1: There are much better, faster alternatives
[18:01:42] <Bsadowski1>	 er 49*
[18:01:45] <Bsadowski1>	 er 39*
[18:02:14] <Bsadowski1>	 Like what, James_F? :)
[18:02:37] <James_F>	 Bsadowski1: Well, https://en.wikipedia.org/wiki/Special:RecentChanges?userExpLevel=newcomer&hidelog=1 obviously. :-)
[18:02:59] <James_F>	 That loads in a couple of seconds and has a "live updates" mode.
[18:03:15] <Bsadowski1>	 oh nice :D thanks
[18:03:22] <Bsadowski1>	 Why does that other one load so slowly?
[18:03:34] <Bsadowski1>	 Lots of weird queries?
[18:03:44] <James_F>	 Because RecentChanges is built on a bunch of technology to make it fast (dedicated table, etc.)
[18:04:12] <James_F>	 And Special:Contributions/newbies is just a ghastly hack based on (I think) 1% of the maximum user ID.
[18:04:18] <wikibugs>	 (03CR) 10CRusnov: netbox: Fix additional swift parameters (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/527576 (https://phabricator.wikimedia.org/T209182) (owner: 10CRusnov)
[18:04:48] <James_F>	 So it essentially makes very expensive queries against our most expensive database.
[18:05:25] <Bsadowski1>	 Jeez..
[18:05:30] <Bsadowski1>	 :P
[18:05:36] <James_F>	 Yeah. Bad feature, badly implemented.
[18:05:48] <Bsadowski1>	 It did work well for a while though
[18:06:00] <Bsadowski1>	 (no?)
[18:06:02] <James_F>	 Oh, yes, but then December 2005 happened and it stopped being so good. ;-)
[18:06:41] <James_F>	 For very small wikis it's OK, but once you've got more than ~10k registered accounts it's not great at focussing reviewers' time.
[18:06:51] <James_F>	 And volunteers' time is a precious resource.
[18:14:32] <Lucas_WMDE>	 !log recached all WikibaseView messages in ResourceLoader for T229604, cf. https://w.wiki/6kc
[18:14:40] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:14:41] <stashbot>	 T229604: Several selectors/experts are broken - https://phabricator.wikimedia.org/T229604
[18:19:10] <librenms-wmf>	 08̶W̶a̶r̶n̶i̶n̶g Device cr1-eqsin.wikimedia.org recovered from Processor usage over 85%
[18:21:43] <Lucas_WMDE>	 I’m going home now, my contact info is somewhere in the ops-l archives if that Wikibase deploy broke something horribly
[18:22:02] <Lucas_WMDE>	 (and possibly on officewiki, I’m not allowed to know that ^^)
[18:34:07] <wikibugs>	 (03CR) 10Dzahn: [C: 03+2] "https://puppet-compiler.wmflabs.org/compiler1001/17720/" [puppet] - 10https://gerrit.wikimedia.org/r/527595 (https://phabricator.wikimedia.org/T176532) (owner: 10Paladox)
[18:36:28] <mutante>	 !log adding gerrit2001 to ferm rules on dbproxy for misc
[18:36:35] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:37:40] <mutante>	 !log gerrit2001 - disabling puppet, stopping gerrit service
[18:37:46] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:44:39] <wikibugs>	 (03PS1) 10Jdlrobson: Restore RelatedArticles config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/527611 (https://phabricator.wikimedia.org/T229644)
[18:45:20] <Krinkle>	 James_F: oh wow, I didn't know contribs/newbies was still a thing. Yeah, revision table queries without a page or user ID filter are... not great.
[18:48:14] <wikibugs>	 (03CR) 10Jdlrobson: "Lego,Krinkle,Reedy adding you as reviewers as I admire their knowledge of this part of the stack which is much better than my own." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/527611 (https://phabricator.wikimedia.org/T229644) (owner: 10Jdlrobson)
[18:49:49] <wikibugs>	 (03CR) 10Krinkle: [C: 04-1] Restore RelatedArticles config (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/527611 (https://phabricator.wikimedia.org/T229644) (owner: 10Jdlrobson)
[18:53:12] <James_F>	 Krinkle: I filed a task to kill it somewhere.
[18:53:37] <James_F>	 T220447
[18:53:38] <stashbot>	 T220447: Split out or remove Special:Contributions/newbies functionality - https://phabricator.wikimedia.org/T220447
[18:57:35] <icinga-wm>	 PROBLEM - puppet last run on db1063 is CRITICAL: CRITICAL: Failed to apply catalog, zero resources tracked by Puppet. Could be an interrupted request or a dependency cycle. https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[19:00:38] <wikibugs>	 (03PS1) 10Jdlrobson: Remove unused remnant from old menu click tracking [mediawiki-config] - 10https://gerrit.wikimedia.org/r/527615 (https://phabricator.wikimedia.org/T228681)
[19:01:39] <wikibugs>	 (03PS2) 10Jdlrobson: Remove unused remnant from old menu click tracking [mediawiki-config] - 10https://gerrit.wikimedia.org/r/527615 (https://phabricator.wikimedia.org/T228681)
[19:06:21] <wikibugs>	 (03PS2) 10Jdlrobson: Restore RelatedArticles config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/527611 (https://phabricator.wikimedia.org/T229644)
[19:08:06] <wikibugs>	 (03CR) 10Dzahn: [C: 03+2] wmnet: Add m1-master for codfw [dns] - 10https://gerrit.wikimedia.org/r/527462 (https://phabricator.wikimedia.org/T202367) (owner: 10Marostegui)
[19:10:32] <wikibugs>	 (03CR) 10Jdlrobson: Restore RelatedArticles config (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/527611 (https://phabricator.wikimedia.org/T229644) (owner: 10Jdlrobson)
[19:13:46] <wikibugs>	 (03CR) 10Krinkle: [C: 04-1] Restore RelatedArticles config (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/527611 (https://phabricator.wikimedia.org/T229644) (owner: 10Jdlrobson)
[19:14:25] <jdlrobson>	 @krinkle the default is a non-empty []
[19:14:53] <jdlrobson>	 https://github.com/wikimedia/mediawiki-extensions-RelatedArticles/blob/master/extension.json#L150
[19:15:12] <jdlrobson>	 which means "turn RelatedArticles on for all skins"
[19:15:34] <jdlrobson>	 So I'm a bit lost why 'default' => [ 'minerva' ], (which should be running for English Wikipedia) is being treated as an empty array
[19:16:33] <Krinkle>	 "RelatedArticlesFooterWhitelistedSkins": []
[19:16:36] <Krinkle>	 that is an empty []
[19:17:23] <jdlrobson>	 sorry typo - i meant to say the default is empty :)
[19:17:23] <Krinkle>	 which will be the value of $wgRelatedArticlesFooterWhitelistedSkins before that wmf-config function runs. Unless something else modifies it before then?
[19:17:27] <jdlrobson>	 https://en.wikivoyage.org is respecting its array: [ 'minerva', 'vector' ] and only showing on those skins (it doesn't show on Timeless)
[19:17:45] <jdlrobson>	 de.wikipedia.org is showing because it's being set as 'related-articles-footer-blacklisted-skins' => [], so that's explainable
[19:17:47] <Krinkle>	 eval.php enwiki
[19:17:48] <Krinkle>	 > var_dump($wgRelatedArticlesFooterWhitelistedSkins);
[19:17:48] <Krinkle>	 array(1) {
[19:17:48] <Krinkle>	   [0]=>
[19:17:48] <Krinkle>	   string(7) "minerva"
[19:17:48] <Krinkle>	 }
[19:18:08] <Krinkle>	 > var_dump($wmgRelatedArticlesFooterWhitelistedSkins);
[19:18:08] <Krinkle>	 array(1) { [0]=> string(7) "minerva" }
[19:19:29] <wikibugs>	 (03CR) 10Brian Wolff: "So looking at SiteConfiguration.php (around line 220), it seems like 'tag' settings are always handled as if they have a + in front of the" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/527611 (https://phabricator.wikimedia.org/T229644) (owner: 10Jdlrobson)
[19:19:59] <jdlrobson>	 okay so that's weird.. that's the correct value
[19:20:00] <Isarra>	 jdlrobson: We found the problem. It's just de and ru wikipedias because they for some reason tried to disable it for all their skins, so that's being interpreted now as 'no whitelist; enable for everything'.
[19:20:14] <jdlrobson>	 wait so this is not a problem on English Wikipedia?
[19:20:19] <Isarra>	 But at that point why is the extension even enabled there to begin with?!
[19:20:23] <Isarra>	 Right.
[19:20:28] <Isarra>	 It's just those two, as far as I can tell.
[19:20:29] <jdlrobson>	 arrgggh
[19:20:32] <jdlrobson>	 https://en.wikipedia.org/wiki/User:Jdlrobson/vector.js < stupid user script
[19:20:36] <jdlrobson>	 completely derailed me
[19:20:37] <Isarra>	 XD
[19:20:39] <jdlrobson>	 okay yeh that makes total sense
[19:20:43] <jdlrobson>	 'related-articles-footer-blacklisted-skins' => [], is messed up
[19:20:50] <jdlrobson>	 and that previously wasn't working
[19:21:12] <jdlrobson>	 as the previous default was ['minerva'] so that was additive
[19:21:23] <jdlrobson>	 ['minerva'] + [] = ['minerva']
[19:21:30] <Isarra>	 Wait, so it wasn't even disabling it anyway?!
[19:21:31] <wikibugs>	 (03Abandoned) 10Jdlrobson: Restore RelatedArticles config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/527611 (https://phabricator.wikimedia.org/T229644) (owner: 10Jdlrobson)
[19:21:40] <Isarra>	 On those?
[19:21:51] <jdlrobson>	 yeh whoever configured those obviously didn't check :)
[19:22:18] <jdlrobson>	 it needs to be 'related-articles-footer-blacklisted-skins' => ['fallbackskin'] or something if they want to disable on all skins in preferences
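A minimal sketch of the behaviour discussed above, in Python rather than the PHP of wmf-config. It is not the real SiteConfiguration.php or RelatedArticles code: merge_additive and footer_enabled are hypothetical helpers, and the additive merge only models what jdlrobson describes at this point in the conversation. The one rule taken directly from the chat is that an empty whitelist means "enabled on all skins".

```python
def merge_additive(default, override):
    """Hypothetical additive merge: append override entries to the default."""
    return default + [skin for skin in override if skin not in default]


def footer_enabled(skin, whitelist):
    """Assumed check: an empty whitelist places no restriction on skins."""
    return len(whitelist) == 0 or skin in whitelist


# Old situation as described: default ['minerva'] combined with [] stays ['minerva'].
old_value = merge_additive(["minerva"], [])

# New situation: the empty list is used as-is, so every skin qualifies.
new_value = []

# Suggested workaround: a dummy skin name keeps the list non-empty.
dummy_value = ["fallbackskin"]

for label, value in [("old", old_value), ("new", new_value), ("dummy", dummy_value)]:
    print(label, {skin: footer_enabled(skin, value) for skin in ("minerva", "vector")})
```

Under these assumptions the empty list is the only value that turns the footer on for vector, which is why either removing the per-wiki line or putting a placeholder skin in it would stop the cards from showing on every skin.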
[19:22:29] <jdlrobson>	 Isarra: you able to take care of that?
[19:22:39] <Isarra>	 Maybe. Huh.
[19:22:44] <jdlrobson>	 or remove the line  'related-articles-footer-blacklisted-skins' => [], and restore it to Minerva
[19:23:08] <Isarra>	 I mean, yeah, the global default is minerva, and if it's not been totally disabled all this time, why... start now? >.>
[19:23:27] <jdlrobson>	 running git blame to work out why this happened
[19:23:36] <jdlrobson>	 but git blaming that file takes a long time ;-)
[19:23:58] <jdlrobson>	 looks like it was me. I did it wrong for some reason
[19:24:02] <jdlrobson>	 so I think you can just remove that line
[19:24:55] <mutante>	 !log gerrit2001 - re-enabling puppet, starting as slave for the first time ever, thanks to codfw dbproxy, gerrit service running  (T176532)
[19:25:05] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:25:06] <stashbot>	 T176532: Gerrit is failing to connect to db on gerrit2001 thus preventing systemd from working - https://phabricator.wikimedia.org/T176532
[19:25:29] <Isarra>	 Cool, trying to find the right files...
[19:25:35] <icinga-wm>	 RECOVERY - puppet last run on db1063 is OK: OK: Puppet is currently enabled, last run 5 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[19:26:06] <jdlrobson>	 i think it's rm dblists/related-articles-footer-blacklisted-skins.dblist
[19:26:13] <jdlrobson>	 and remove the line in wmf-config/InitialiseSettings.php
[19:26:22] <jdlrobson>	 i can help if necessary after lunch (afk right now)
[19:27:06] <jdlrobson>	 you can also enable timeless while there i guess :)
[19:27:10] <jdlrobson>	 (in defaults)
[19:27:26] <jdlrobson>	 thanks Krinkle for the help
[19:27:26] <Isarra>	 Hee. For everything?
[19:27:31] <jdlrobson>	 why not
[19:27:33] <jdlrobson>	 your skin
[19:27:35] <jdlrobson>	 your rules
[19:27:51] <Isarra>	 Good point. It's randomly changing all the time, and where better to throw a random at people?
[19:28:07] <jdlrobson>	 unless someone's asked specifically not to have it - but I know certain English/German Wikipedias use it or want to use it on desktop
[19:30:33] <wikibugs>	 (03CR) 10Jdlrobson: "Thanks bawolff. Turned out the problem is only Russian and German wiki which is very explainable." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/527611 (https://phabricator.wikimedia.org/T229644) (owner: 10Jdlrobson)
[19:30:43] <wikibugs>	 (03PS1) 10Paladox: gerrit: Enable --enable-httpd if slave mode is enabled [puppet] - 10https://gerrit.wikimedia.org/r/527621
[19:31:01] <wikibugs>	 (03PS2) 10Paladox: gerrit: Enable --enable-httpd if slave mode is enabled [puppet] - 10https://gerrit.wikimedia.org/r/527621
[19:41:17] <wikibugs>	 (03CR) 10Thcipriani: [C: 03+1] gerrit: Enable --enable-httpd if slave mode is enabled [puppet] - 10https://gerrit.wikimedia.org/r/527621 (owner: 10Paladox)
[19:41:22] <wikibugs>	 (03CR) 10Dzahn: [C: 03+2] gerrit: Enable --enable-httpd if slave mode is enabled [puppet] - 10https://gerrit.wikimedia.org/r/527621 (owner: 10Paladox)
[19:51:49] <wikibugs>	 (03PS1) 10Paladox: Gerrit: Enable sshd for gerrit on slaves [puppet] - 10https://gerrit.wikimedia.org/r/527631
[19:53:16] <wikibugs>	 (03PS2) 10Paladox: Gerrit: Enable sshd for gerrit on slaves [puppet] - 10https://gerrit.wikimedia.org/r/527631
[19:53:51] <wikibugs>	 (03PS3) 10Paladox: Gerrit: Enable sshd for gerrit on slaves [puppet] - 10https://gerrit.wikimedia.org/r/527631
[19:53:56] <wikibugs>	 (03CR) 10Paladox: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/527631 (owner: 10Paladox)
[19:54:24] <wikibugs>	 (03PS1) 10Isarra: Remove related-articles-footer-blacklisted-skins.dblist [mediawiki-config] - 10https://gerrit.wikimedia.org/r/527632 (https://phabricator.wikimedia.org/T229644)
[19:57:01] <wikibugs>	 (03CR) 10Isarra: "Might want to double check I did this right..." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/527632 (https://phabricator.wikimedia.org/T229644) (owner: 10Isarra)
[19:57:41] <wikibugs>	 (03CR) 10Dzahn: [C: 03+2] Gerrit: Enable sshd for gerrit on slaves [puppet] - 10https://gerrit.wikimedia.org/r/527631 (owner: 10Paladox)
[20:03:09] <wikibugs>	 (03CR) 10Brian Wolff: "Ah, that makes sense." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/527611 (https://phabricator.wikimedia.org/T229644) (owner: 10Jdlrobson)
[20:05:45] <wikibugs>	 (03CR) 10Brian Wolff: [C: 04-1] "This is based on something I said where I was wrong. (sorry!). The use of the tag here is not the issue" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/527632 (https://phabricator.wikimedia.org/T229644) (owner: 10Isarra)
[20:07:30] <wikibugs>	 (03CR) 10Brian Wolff: "To clarify, on a different patch, i said that tags behaved differently than direct settings. I misunderstood the code, and that conclusion" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/527632 (https://phabricator.wikimedia.org/T229644) (owner: 10Isarra)
[20:09:38] <wikibugs>	 (03CR) 10Isarra: "> Patch Set 1: -Code-Review" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/527632 (https://phabricator.wikimedia.org/T229644) (owner: 10Isarra)
[20:14:37] <Urbanecm>	 !log Run mwscript deleteEqualMessages.php --wiki=cswiki --delete
[20:14:44] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:14:45] <Isarra>	 Ow, my head.
[20:19:53] <wikibugs>	 (03Abandoned) 10Isarra: Remove related-articles-footer-blacklisted-skins.dblist [mediawiki-config] - 10https://gerrit.wikimedia.org/r/527632 (https://phabricator.wikimedia.org/T229644) (owner: 10Isarra)
[20:20:30] * bd808 hands Isarra an aspirin
[20:21:42] <wikibugs>	 (03PS1) 10Isarra: Set a dummy skin to 'disable' Related Article cards on blacklisted projects [mediawiki-config] - 10https://gerrit.wikimedia.org/r/527636 (https://phabricator.wikimedia.org/T229644)
[20:22:19] <Isarra>	 Yay drugs.
[20:23:17] <sbassett>	 Hey - was going to sec-deploy patch for UBN T229541 now, cc: jdlrobson MaxSem
[20:37:30] <wikibugs>	 (03PS1) 10Dzahn: gerrit: fix sshd listen address if on a slave [puppet] - 10https://gerrit.wikimedia.org/r/527638 (https://phabricator.wikimedia.org/T176532)
[20:38:00] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] gerrit: fix sshd listen address if on a slave [puppet] - 10https://gerrit.wikimedia.org/r/527638 (https://phabricator.wikimedia.org/T176532) (owner: 10Dzahn)
[20:41:28] <jdlrobson>	 Isarra: so what have I missed?
[20:41:32] <wikibugs>	 (03CR) 10Isarra: "I THINK THIS ACTUALLY FIXES IT, AS INTENDED, ETC WHATEVER." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/527636 (https://phabricator.wikimedia.org/T229644) (owner: 10Isarra)
[20:41:42] <jdlrobson>	 https://gerrit.wikimedia.org/r/#/c/operations/mediawiki-config/+/527632/ looks like the right change to me.
[20:41:48] <jdlrobson>	 why abandoned?
[20:42:09] <wikibugs>	 (03PS2) 10Dzahn: gerrit: fix sshd listen address if on a slave [puppet] - 10https://gerrit.wikimedia.org/r/527638 (https://phabricator.wikimedia.org/T176532)
[20:42:20] <Isarra>	 jdlrobson: bawolff says we're probably, it probably did work previously, I just broke it.
[20:42:26] <jdlrobson>	 No it's my mistake
[20:42:37] <Isarra>	 I have no idea, but looking closer at the logic I'm pretty sure just adding a dummy will also fix it.
[20:42:52] <jdlrobson>	 I did a git blame. Russian and German have never asked for RelatedArticles to be disabled on Minerva
[20:42:58] <Isarra>	 I don't actually know, though, so, uh, I guess pick which patch you prefer? 
[20:43:02] <jdlrobson>	 and it has never been. 
[20:43:05] <Isarra>	 Okay.
[20:43:11] <Isarra>	 I dunno!
[20:43:21] <jdlrobson>	 I would advise restoring https://gerrit.wikimedia.org/r/#/c/operations/mediawiki-config/+/527632/ and swatting :)
[20:43:38] <jdlrobson>	 The config dates back to an older version of the config when related articles only worked on mobile
[20:43:39] <wikibugs>	 (03Restored) 10Isarra: Remove related-articles-footer-blacklisted-skins.dblist [mediawiki-config] - 10https://gerrit.wikimedia.org/r/527632 (https://phabricator.wikimedia.org/T229644) (owner: 10Isarra)
[20:43:58] <Isarra>	 Okay, restored, but I don't know how to swat things.
[20:44:18] <greg-g>	 wait until Monday :)
[20:44:23] <jdlrobson>	 I'm not a SWATer right now. I don't know if any are around today so yeh that ^
[20:44:24] <jdlrobson>	  :)
[20:44:24] <Isarra>	 All I know is either of these should actually fix the immediate problem, so if you're sure about this one, I'm all for it because it's the neater solution overall regardless.
[20:44:29] <Isarra>	 Okay.
[20:44:40] <wikibugs>	 (03CR) 10Jdlrobson: [C: 03+1] Remove related-articles-footer-blacklisted-skins.dblist [mediawiki-config] - 10https://gerrit.wikimedia.org/r/527632 (https://phabricator.wikimedia.org/T229644) (owner: 10Isarra)
[20:47:28] <jdlrobson>	 thanks for looking into this today Isarra 
[20:47:38] <jdlrobson>	 we'll get it fixed Monday. The wikis can wait.
[20:48:16] <sbassett>	 !log Deployed security patch for T229541
[20:48:22] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:49:30] <Isarra>	 Looks like ru and de have already disabled it locally, so... yeah. :P
[20:49:58] <wikibugs>	 (03CR) 10Dzahn: [C: 03+1] "lgtm https://puppet-compiler.wmflabs.org/compiler1001/17721/" [puppet] - 10https://gerrit.wikimedia.org/r/527638 (https://phabricator.wikimedia.org/T176532) (owner: 10Dzahn)
[20:51:16] <wikibugs>	 (03CR) 10Paladox: [C: 03+1] gerrit: fix sshd listen address if on a slave [puppet] - 10https://gerrit.wikimedia.org/r/527638 (https://phabricator.wikimedia.org/T176532) (owner: 10Dzahn)
[20:51:21] <wikibugs>	 (03CR) 10Jdlrobson: [C: 04-1] "https://gerrit.wikimedia.org/r/#/c/operations/mediawiki-config/+/527632/" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/527636 (https://phabricator.wikimedia.org/T229644) (owner: 10Isarra)
[20:51:46] <jdlrobson>	 Isarra: did you also want to enable on timeless anywhere on Monday while we do that?
[20:52:12] <Isarra>	 Eh, no need to swat that.
[20:52:37] <Isarra>	 The real question is... should we also enable it on monobook on the three wikis that already have it on vector?
[20:52:55] <Isarra>	 wikis/wiki groups
[20:55:01] <wikibugs>	 (03CR) 10Dzahn: [C: 03+2] gerrit: fix sshd listen address if on a slave [puppet] - 10https://gerrit.wikimedia.org/r/527638 (https://phabricator.wikimedia.org/T176532) (owner: 10Dzahn)
[21:02:34] <wikibugs>	 10Operations, 10DBA, 10Gerrit, 10Release-Engineering-Team-TODO, and 2 others: Gerrit is failing to connect to db on gerrit2001 thus preventing systemd from working - https://phabricator.wikimedia.org/T176532 (10Dzahn) 05Open→03Resolved gerrit, gerrit's httpd and gerrit's sshd are now all running and li...
[21:02:39] <wikibugs>	 10Operations, 10Gerrit, 10Patch-For-Review, 10Release-Engineering-Team (Backlog): Reimage gerrit2001 as stretch - https://phabricator.wikimedia.org/T168562 (10Dzahn)
[21:03:00] <wikibugs>	 10Operations, 10DBA, 10Gerrit, 10Release-Engineering-Team-TODO, 10Release-Engineering-Team (Development services): Gerrit is failing to connect to db on gerrit2001 thus preventing systemd from working - https://phabricator.wikimedia.org/T176532 (10Dzahn)
[21:15:37] <wikibugs>	 (03Abandoned) 10Isarra: Set a dummy skin to 'disable' Related Article cards on blacklisted projects [mediawiki-config] - 10https://gerrit.wikimedia.org/r/527636 (https://phabricator.wikimedia.org/T229644) (owner: 10Isarra)
[21:15:54] <wikibugs>	 (03PS2) 10Isarra: Remove related-articles-footer-blacklisted-skins.dblist [mediawiki-config] - 10https://gerrit.wikimedia.org/r/527632 (https://phabricator.wikimedia.org/T229644)
[21:19:12] <Isarra>	 jdlrobson: So now I can't even figure out how to submit a patch to enable timeless at all because gerrit keeps rejecting it, at which point the entire commit gets undone locally too, so... I'll just come back to this later. Yes. >.>
[21:19:51] <Isarra>	 But as long as all this actually gets fixed to maintain what people actually expect normally, that's the most important bit regardless.
[21:30:18] <wikibugs>	 (03PS1) 10BryanDavis: Vagrantfile: embiggen the vm's disk [docker-images/toollabs-images] - 10https://gerrit.wikimedia.org/r/527650
[21:30:20] <wikibugs>	 (03PS1) 10BryanDavis: run-image: give error message if type is not passed [docker-images/toollabs-images] - 10https://gerrit.wikimedia.org/r/527651
[21:30:22] <wikibugs>	 (03PS1) 10BryanDavis: jessie: Work around removal of jessie-backports [docker-images/toollabs-images] - 10https://gerrit.wikimedia.org/r/527652
[21:30:24] <wikibugs>	 (03PS1) 10BryanDavis: locales-extended: Add support for Korean [docker-images/toollabs-images] - 10https://gerrit.wikimedia.org/r/527653 (https://phabricator.wikimedia.org/T130532)
[21:30:36] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] Vagrantfile: embiggen the vm's disk [docker-images/toollabs-images] - 10https://gerrit.wikimedia.org/r/527650 (owner: 10BryanDavis)
[21:41:31] <wikibugs>	 (03PS1) 10Paladox: Gerrit: Rename gerrit-slave to gerrit-replica [puppet] - 10https://gerrit.wikimedia.org/r/527656
[21:42:19] <wikibugs>	 (03PS2) 10Paladox: Gerrit: Rename gerrit-slave to gerrit-replica [puppet] - 10https://gerrit.wikimedia.org/r/527656
[21:42:40] <wikibugs>	 (03PS1) 10Paladox: Rename gerrit-slave to gerrit-replica [dns] - 10https://gerrit.wikimedia.org/r/527657
[21:43:23] <wikibugs>	 (03PS2) 10Paladox: Rename gerrit-slave to gerrit-replica [dns] - 10https://gerrit.wikimedia.org/r/527657
[22:00:27] <wikibugs>	 (03CR) 10BryanDavis: "recheck" [docker-images/toollabs-images] - 10https://gerrit.wikimedia.org/r/527650 (owner: 10BryanDavis)
[22:03:06] <wikibugs>	 (03CR) 10BryanDavis: [C: 03+2] "local dev only" [docker-images/toollabs-images] - 10https://gerrit.wikimedia.org/r/527650 (owner: 10BryanDavis)
[22:03:17] <wikibugs>	 (03CR) 10BryanDavis: [C: 03+2] "local dev only" [docker-images/toollabs-images] - 10https://gerrit.wikimedia.org/r/527651 (owner: 10BryanDavis)
[22:03:31] <wikibugs>	 (03Merged) 10jenkins-bot: Vagrantfile: embiggen the vm's disk [docker-images/toollabs-images] - 10https://gerrit.wikimedia.org/r/527650 (owner: 10BryanDavis)
[22:03:41] <wikibugs>	 (03Merged) 10jenkins-bot: run-image: give error message if type is not passed [docker-images/toollabs-images] - 10https://gerrit.wikimedia.org/r/527651 (owner: 10BryanDavis)
[22:04:13] <icinga-wm>	 PROBLEM - puppet last run on dbmonitor2001 is CRITICAL: CRITICAL: Failed to apply catalog, zero resources tracked by Puppet. Could be an interrupted request or a dependency cycle. https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[22:07:00] <wikibugs>	 (03CR) 10Bstorm: [C: 03+1] "Looks like a marvelous hack up to me." [docker-images/toollabs-images] - 10https://gerrit.wikimedia.org/r/527652 (owner: 10BryanDavis)
[22:10:35] <wikibugs>	 (03PS2) 10Paladox: Merge branch 'stable-2.15' into wmf/stable-2.15 [software/gerrit] (wmf/stable-2.15) - 10https://gerrit.wikimedia.org/r/525865
[22:11:04] <wikibugs>	 (03PS4) 10Paladox: Testing: Do not merge [software/gerrit] (wmf/stable-2.15) - 10https://gerrit.wikimedia.org/r/525867
[22:11:26] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] Merge branch 'stable-2.15' into wmf/stable-2.15 [software/gerrit] (wmf/stable-2.15) - 10https://gerrit.wikimedia.org/r/525865 (owner: 10Paladox)
[22:11:34] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] Testing: Do not merge [software/gerrit] (wmf/stable-2.15) - 10https://gerrit.wikimedia.org/r/525867 (owner: 10Paladox)
[22:13:12] <wikibugs>	 (03CR) 10Dzahn: "why? and that would first require DNS changes" [puppet] - 10https://gerrit.wikimedia.org/r/527656 (owner: 10Paladox)
[22:16:55] <wikibugs>	 (03CR) 10Greg Grossmeier: [C: 03+1] "> why? and that would first require DNS changes" [puppet] - 10https://gerrit.wikimedia.org/r/527656 (owner: 10Paladox)
[22:17:24] <wikibugs>	 (03CR) 10Greg Grossmeier: [C: 03+1] "Also, the DNS bits: https://gerrit.wikimedia.org/r/#/c/operations/dns/+/527657/" [puppet] - 10https://gerrit.wikimedia.org/r/527656 (owner: 10Paladox)
[22:18:06] <wikibugs>	 (03CR) 10Greg Grossmeier: [C: 03+1] "Yes, please. This fits with our (RelEng's) other efforts to reduce the occurrence of this language, see also: https://phabricator.wikimedi" [dns] - 10https://gerrit.wikimedia.org/r/527657 (owner: 10Paladox)
[22:20:08] <wikibugs>	 (03CR) 10Dzahn: "this is a technical term chosen by upstream. it's like mysql slave-lag" [puppet] - 10https://gerrit.wikimedia.org/r/527656 (owner: 10Paladox)
[22:23:59] <wikibugs>	 (03CR) 10Dzahn: "would require a lot more changes, acme_chief, puppet in various places, hiera.. fwiw the term appears 842 times in the repo" [puppet] - 10https://gerrit.wikimedia.org/r/527656 (owner: 10Paladox)
[22:26:39] <icinga-wm>	 RECOVERY - puppet last run on dbmonitor2001 is OK: OK: Puppet is currently enabled, last run 41 seconds ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[22:27:21] <wikibugs>	 (03PS1) 10Paladox: gerrit: Add gerrit-replica to acme [puppet] - 10https://gerrit.wikimedia.org/r/527664
[22:27:35] <wikibugs>	 (03CR) 10Greg Grossmeier: [C: 03+1] "> would require a lot more changes, acme_chief, puppet in various" [puppet] - 10https://gerrit.wikimedia.org/r/527656 (owner: 10Paladox)
[22:27:46] <wikibugs>	 (03PS2) 10Paladox: gerrit: Add gerrit-replica to acme [puppet] - 10https://gerrit.wikimedia.org/r/527664
[22:33:02] <wikibugs>	 (03CR) 10Krinkle: [C: 03+1] "Beware of the scap trap in deploying this to avoid intermediary fatal errors (which mwdebug won't catch). Takes three separate sync-files," [mediawiki-config] - 10https://gerrit.wikimedia.org/r/527632 (https://phabricator.wikimedia.org/T229644) (owner: 10Isarra)
[22:34:04] <wikibugs>	 (03CR) 10Greg Grossmeier: [C: 03+1] "> this is a technical term chosen by upstream. it's like mysql" [puppet] - 10https://gerrit.wikimedia.org/r/527656 (owner: 10Paladox)
[22:39:26] <wikibugs>	 (03CR) 10Krinkle: "(fwiw, in MW core we've adopted this change as well. Rdbms uses replica terminology consistently wherever possible.)" [puppet] - 10https://gerrit.wikimedia.org/r/527656 (owner: 10Paladox)
[22:41:29] <wikibugs>	 (03CR) 10Dzahn: [C: 04-1] gerrit: Add gerrit-replica to acme (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/527664 (owner: 10Paladox)
[22:42:45] <wikibugs>	 (03CR) 10Paladox: gerrit: Add gerrit-replica to acme (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/527664 (owner: 10Paladox)
[22:45:33] <icinga-wm>	 PROBLEM - puppet last run on kubernetes1003 is CRITICAL: CRITICAL: Failed to apply catalog, zero resources tracked by Puppet. Could be an interrupted request or a dependency cycle. https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[22:48:01] <wikibugs>	 (03CR) 10Dzahn: "yea, it would probably fail. but that's because there would be no httpd vhost for the LE challenge to work. as i said, this is not a trivi" [puppet] - 10https://gerrit.wikimedia.org/r/527664 (owner: 10Paladox)
[22:59:34] <wikibugs>	 10Operations, 10ops-eqiad: helium.mgmt down - https://phabricator.wikimedia.org/T229706 (10Dzahn)
[23:00:19] <icinga-wm>	 ACKNOWLEDGEMENT - SSH helium.mgmt on helium.mgmt is CRITICAL: Server answer: daniel_zahn https://phabricator.wikimedia.org/T229706 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[23:02:29] <wikibugs>	 10Operations, 10ops-eqiad: helium.mgmt down - https://phabricator.wikimedia.org/T229706 (10wiki_willy) a:03Cmjohnson
[23:06:17] <mutante>	 !log mwdebug1001/mwdebug1002 - restart-php7.2-fpm - low opcache
[23:06:17] <icinga-wm>	 RECOVERY - PHP opcache health on mwdebug1001 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health
[23:06:25] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:07:23] <icinga-wm>	 RECOVERY - PHP opcache health on mwdebug1002 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health
[23:08:01] <icinga-wm>	 RECOVERY - puppet last run on kubernetes1003 is OK: OK: Puppet is currently enabled, last run 3 seconds ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[23:16:32] <XioNoX>	 !log Make the Level3 link between eqiad-knams primary - T228827
[23:16:40] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:16:41] <stashbot>	 T228827: Instability of the Level3 link between cr2-eqiad and cr2-esams - https://phabricator.wikimedia.org/T228827
[23:20:07] <wikibugs>	 10Operations, 10netops: Instability of the Level3 link between cr2-eqiad and cr2-esams - https://phabricator.wikimedia.org/T228827 (10ayounsi) Talked to Faidon, using the backup link for a long amount of time is costing us money (see overusage on https://librenms.wikimedia.org/bill/bill_id=17/). I made the Lev...
[23:58:46] <mutante>	 !log scandium - apt-get remove --purge prometheus-hhvm-exporter - not needed here, no HHVM (T228069)
[23:58:54] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:58:55] <stashbot>	 T228069: Deploy Parsoid-PHP with Mediawiki to scandium for RT and performance testing - https://phabricator.wikimedia.org/T228069