[00:07:25] PROBLEM - puppet last run on elastic1017 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 7 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/usr/local/bin/apt2xml] https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[00:17:22] (PS1) Dzahn: mediawiki::php::restarts: try to avoid including LVS but still get pools [puppet] - https://gerrit.wikimedia.org/r/527285
[00:29:00] (CR) Dzahn: [C: -1] "WIP" [puppet] - https://gerrit.wikimedia.org/r/527285 (owner: Dzahn)
[00:34:09] (PS1) Dzahn: add scandium as an app test server to conftool data [puppet] - https://gerrit.wikimedia.org/r/527291 (https://phabricator.wikimedia.org/T228069)
[00:35:19] RECOVERY - puppet last run on elastic1017 is OK: OK: Puppet is currently enabled, last run 5 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[00:36:17] (CR) Dzahn: "Does it make sense to add it like a test server?" [puppet] - https://gerrit.wikimedia.org/r/527291 (https://phabricator.wikimedia.org/T228069) (owner: Dzahn)
[00:45:02] (CR) Ayounsi: "> Patch Set 2:" [puppet] - https://gerrit.wikimedia.org/r/526819 (https://phabricator.wikimedia.org/T226331) (owner: Ayounsi)
[01:18:15] PROBLEM - puppet last run on mx2001 is CRITICAL: CRITICAL: Failed to apply catalog, zero resources tracked by Puppet. Could be an interrupted request or a dependency cycle. https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[01:22:35] PROBLEM - puppet last run on alsafi is CRITICAL: CRITICAL: Failed to apply catalog, zero resources tracked by Puppet. Could be an interrupted request or a dependency cycle.
https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[01:40:35] RECOVERY - puppet last run on mx2001 is OK: OK: Puppet is currently enabled, last run 23 seconds ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[01:44:57] RECOVERY - puppet last run on alsafi is OK: OK: Puppet is currently enabled, last run 23 seconds ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[03:46:33] PROBLEM - High CPU load on API appserver on mw1225 is CRITICAL: CRITICAL - load average: 56.93, 29.25, 16.57 https://wikitech.wikimedia.org/wiki/Application_servers
[03:47:09] PROBLEM - High CPU load on API appserver on mw1222 is CRITICAL: CRITICAL - load average: 59.33, 35.98, 20.09 https://wikitech.wikimedia.org/wiki/Application_servers
[03:56:15] PROBLEM - High CPU load on API appserver on mw1225 is CRITICAL: CRITICAL - load average: 52.38, 37.43, 27.03 https://wikitech.wikimedia.org/wiki/Application_servers
[03:57:09] PROBLEM - High CPU load on API appserver on mw1226 is CRITICAL: CRITICAL - load average: 49.54, 34.70, 23.28 https://wikitech.wikimedia.org/wiki/Application_servers
[03:58:49] PROBLEM - High CPU load on API appserver on mw1233 is CRITICAL: CRITICAL - load average: 56.88, 36.93, 25.98 https://wikitech.wikimedia.org/wiki/Application_servers
[04:01:57] RECOVERY - High CPU load on API appserver on mw1226 is OK: OK - load average: 15.00, 23.64, 21.77 https://wikitech.wikimedia.org/wiki/Application_servers
[04:05:51] RECOVERY - High CPU load on API appserver on mw1225 is OK: OK - load average: 9.42, 18.47, 23.87 https://wikitech.wikimedia.org/wiki/Application_servers
[04:06:49] RECOVERY - High CPU load on API appserver on mw1233 is OK: OK - load average: 9.01, 19.98, 23.39 https://wikitech.wikimedia.org/wiki/Application_servers
[04:11:37] (CR) Vgutierrez: [C: +2] acme_chief: Save the certificate using the 3 save modes on issuance time [software/acme-chief] -
https://gerrit.wikimedia.org/r/526840 (https://phabricator.wikimedia.org/T229096) (owner: Vgutierrez)
[04:12:53] RECOVERY - High CPU load on API appserver on mw1222 is OK: OK - load average: 10.21, 12.15, 22.78 https://wikitech.wikimedia.org/wiki/Application_servers
[04:14:34] (CR) jenkins-bot: acme_chief: Save the certificate using the 3 save modes on issuance time [software/acme-chief] - https://gerrit.wikimedia.org/r/526840 (https://phabricator.wikimedia.org/T229096) (owner: Vgutierrez)
[04:15:47] PROBLEM - Disk space on elastic1017 is CRITICAL: DISK CRITICAL - free space: /srv 28693 MB (5% inode=99%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=elastic1017&var-datasource=eqiad+prometheus/ops
[04:19:09] PROBLEM - PHP opcache health on mwdebug1001 is CRITICAL: CRITICAL: opcache free space is below 50 MB https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health
[04:19:36] (PS1) Vgutierrez: Release 0.20 [software/acme-chief] - https://gerrit.wikimedia.org/r/527364 (https://phabricator.wikimedia.org/T229096)
[04:22:27] PROBLEM - MediaWiki exceptions and fatals per minute on graphite1004 is CRITICAL: CRITICAL: 70.00% of data above the critical threshold [50.0] https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=2&fullscreen
[04:24:03] RECOVERY - MediaWiki exceptions and fatals per minute on graphite1004 is OK: OK: Less than 70.00% above the threshold [25.0] https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=2&fullscreen
[04:25:27] RECOVERY - Disk space on elastic1017 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=elastic1017&var-datasource=eqiad+prometheus/ops
[04:29:57]
(PS4) Vgutierrez: fifo-log-demux: Keep attempting to read the FIFO after EOF [software/fifo-log-demux] - https://gerrit.wikimedia.org/r/527013
[04:30:12] (CR) Vgutierrez: fifo-log-demux: Keep attempting to read the FIFO after EOF (1 comment) [software/fifo-log-demux] - https://gerrit.wikimedia.org/r/527013 (owner: Vgutierrez)
[04:30:49] (CR) Vgutierrez: [C: +2] Release 0.20 [software/acme-chief] - https://gerrit.wikimedia.org/r/527364 (https://phabricator.wikimedia.org/T229096) (owner: Vgutierrez)
[04:33:45] (CR) jenkins-bot: Release 0.20 [software/acme-chief] - https://gerrit.wikimedia.org/r/527364 (https://phabricator.wikimedia.org/T229096) (owner: Vgutierrez)
[04:35:17] PROBLEM - MediaWiki exceptions and fatals per minute on graphite1004 is CRITICAL: CRITICAL: 70.00% of data above the critical threshold [50.0] https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=2&fullscreen
[04:38:03] (PS1) Vgutierrez: acme_chief: Save the certificate using the 3 save modes on issuance time [software/acme-chief] (debian) - https://gerrit.wikimedia.org/r/527373 (https://phabricator.wikimedia.org/T229096)
[04:38:05] (PS1) Vgutierrez: Release 0.20 [software/acme-chief] (debian) - https://gerrit.wikimedia.org/r/527374 (https://phabricator.wikimedia.org/T229096)
[04:38:07] (PS1) Vgutierrez: debian: Add release 0.20 to changelog [software/acme-chief] (debian) - https://gerrit.wikimedia.org/r/527375 (https://phabricator.wikimedia.org/T229096)
[04:38:31] RECOVERY - MediaWiki exceptions and fatals per minute on graphite1004 is OK: OK: Less than 70.00% above the threshold [25.0] https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=2&fullscreen
[04:44:33] (CR) Vgutierrez: [C: +2] acme_chief: Save the certificate using the 3 save modes on issuance time
[software/acme-chief] (debian) - https://gerrit.wikimedia.org/r/527373 (https://phabricator.wikimedia.org/T229096) (owner: Vgutierrez)
[04:44:37] (CR) Vgutierrez: [C: +2] Release 0.20 [software/acme-chief] (debian) - https://gerrit.wikimedia.org/r/527374 (https://phabricator.wikimedia.org/T229096) (owner: Vgutierrez)
[04:47:16] (CR) Vgutierrez: [C: +2] debian: Add release 0.20 to changelog [software/acme-chief] (debian) - https://gerrit.wikimedia.org/r/527375 (https://phabricator.wikimedia.org/T229096) (owner: Vgutierrez)
[04:47:19] (Merged) jenkins-bot: acme_chief: Save the certificate using the 3 save modes on issuance time [software/acme-chief] (debian) - https://gerrit.wikimedia.org/r/527373 (https://phabricator.wikimedia.org/T229096) (owner: Vgutierrez)
[04:47:22] (Merged) jenkins-bot: Release 0.20 [software/acme-chief] (debian) - https://gerrit.wikimedia.org/r/527374 (https://phabricator.wikimedia.org/T229096) (owner: Vgutierrez)
[04:51:05] (CR) jenkins-bot: acme_chief: Save the certificate using the 3 save modes on issuance time [software/acme-chief] (debian) - https://gerrit.wikimedia.org/r/527373 (https://phabricator.wikimedia.org/T229096) (owner: Vgutierrez)
[04:51:09] (Merged) jenkins-bot: debian: Add release 0.20 to changelog [software/acme-chief] (debian) - https://gerrit.wikimedia.org/r/527375 (https://phabricator.wikimedia.org/T229096) (owner: Vgutierrez)
[04:51:30] (CR) jenkins-bot: Release 0.20 [software/acme-chief] (debian) - https://gerrit.wikimedia.org/r/527374 (https://phabricator.wikimedia.org/T229096) (owner: Vgutierrez)
[04:54:54] (CR) jenkins-bot: debian: Add release 0.20 to changelog [software/acme-chief] (debian) - https://gerrit.wikimedia.org/r/527375 (https://phabricator.wikimedia.org/T229096) (owner: Vgutierrez)
[05:01:24] (PS3) Marostegui: wmnet: Point m2-master.codfw to dbproxy2002 [dns] - https://gerrit.wikimedia.org/r/527114
(https://phabricator.wikimedia.org/T176532)
[05:02:28] (CR) Marostegui: [C: +2] wmnet: Point m2-master.codfw to dbproxy2002 [dns] - https://gerrit.wikimedia.org/r/527114 (https://phabricator.wikimedia.org/T176532) (owner: Marostegui)
[05:06:55] !log Remove db2058 from tendril and zarcillo T229543
[05:07:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:07:05] T229543: Decommission db2058.codfw.wmnet - https://phabricator.wikimedia.org/T229543
[05:07:31] (PS1) Marostegui: mariadb: Decommission db2058 [puppet] - https://gerrit.wikimedia.org/r/527382 (https://phabricator.wikimedia.org/T229543)
[05:10:07] (CR) Marostegui: [C: +2] mariadb: Decommission db2058 [puppet] - https://gerrit.wikimedia.org/r/527382 (https://phabricator.wikimedia.org/T229543) (owner: Marostegui)
[05:10:42] !log Stop MySQL on db2058 for decommissioning T229543
[05:10:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:21:06] !log uploaded acme-chief 0.20 to apt.wikimedia.org (buster) - T229096
[05:21:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:21:16] T229096: Provide the three cert types (chain-only, cert only and chained) as soon as we get the certificate issued - https://phabricator.wikimedia.org/T229096
[05:23:16] (PS1) Marostegui: mariadb: Specify candidate masters [puppet] - https://gerrit.wikimedia.org/r/527383
[05:24:32] (PS2) Marostegui: mariadb: Specify candidate masters (codfw) [puppet] - https://gerrit.wikimedia.org/r/527383
[05:25:42] (CR) Marostegui: [C: +2] mariadb: Specify candidate masters (codfw) [puppet] - https://gerrit.wikimedia.org/r/527383 (owner: Marostegui)
[05:26:03] PROBLEM - Nginx local proxy to apache on mw1285 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 1308 bytes in 1.914 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[05:27:15] RECOVERY - Nginx local proxy to apache on
mw1285 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 618 bytes in 0.117 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[05:37:16] (PS1) Marostegui: mariadb: Provision db2124 into s6 [puppet] - https://gerrit.wikimedia.org/r/527386 (https://phabricator.wikimedia.org/T228969)
[05:39:10] (CR) Marostegui: [C: +2] mariadb: Provision db2124 into s6 [puppet] - https://gerrit.wikimedia.org/r/527386 (https://phabricator.wikimedia.org/T228969) (owner: Marostegui)
[06:11:12] (PS3) Ema: misc-common: piwik cookies should not block caching either [puppet] - https://gerrit.wikimedia.org/r/473299 (owner: BBlack)
[06:21:42] (CR) Ema: [C: +1] "LGTM but elukey should confirm!" [puppet] - https://gerrit.wikimedia.org/r/473299 (owner: BBlack)
[06:23:10] (CR) Giuseppe Lavagetto: [C: -2] "scandium has no lvs, and thus should not be in conftool-data as it won't serve live traffic, we need to add it to the dsh list manually in" [puppet] - https://gerrit.wikimedia.org/r/527291 (https://phabricator.wikimedia.org/T228069) (owner: Dzahn)
[06:28:57] PROBLEM - puppet last run on acmechief1001 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 3 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/usr/local/bin/prometheus-puppet-agent-stats] https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[06:30:24] uh?
[06:32:39] PROBLEM - puppet last run on elastic2054 is CRITICAL: CRITICAL: Failed to apply catalog, zero resources tracked by Puppet. Could be an interrupted request or a dependency cycle.
https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[06:38:41] (CR) Ema: [C: +2] misc-common: piwik cookies should not block caching either [puppet] - https://gerrit.wikimedia.org/r/473299 (owner: BBlack)
[06:40:03] RECOVERY - puppet last run on acmechief1001 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[06:46:52] !log upgrading acme-chief to version 0.20 in acme-chief test instances - T229096
[06:47:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:47:02] T229096: Provide the three cert types (chain-only, cert only and chained) as soon as we get the certificate issued - https://phabricator.wikimedia.org/T229096
[06:48:43] (PS1) Giuseppe Lavagetto: role::parsoid::testing: remove unnecessary php additions [puppet] - https://gerrit.wikimedia.org/r/527414
[06:50:23] (PS2) Giuseppe Lavagetto: role::parsoid::testing: remove unnecessary php additions [puppet] - https://gerrit.wikimedia.org/r/527414 (https://phabricator.wikimedia.org/T228069)
[06:51:19] (CR) Giuseppe Lavagetto: [C: +2] role::parsoid::testing: remove unnecessary php additions [puppet] - https://gerrit.wikimedia.org/r/527414 (https://phabricator.wikimedia.org/T228069) (owner: Giuseppe Lavagetto)
[06:51:28] <_joe_> is it me or ci is horribly slow?
[06:54:41] (CR) Giuseppe Lavagetto: [C: -2] "see I61cf3f for the correct way to include scandium in dsh."
[puppet] - https://gerrit.wikimedia.org/r/527291 (https://phabricator.wikimedia.org/T228069) (owner: Dzahn)
[07:00:09] (PS1) Marostegui: db-eqiad,db-codfw.php: Pool db2124 [mediawiki-config] - https://gerrit.wikimedia.org/r/527422 (https://phabricator.wikimedia.org/T228969)
[07:00:13] <_joe_> !log running systemd-tmpfiles --create nutcracker.conf on scandium
[07:00:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:00:35] RECOVERY - puppet last run on elastic2054 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[07:03:23] (CR) Ema: [C: +1] "Looks good!" [software/fifo-log-demux] - https://gerrit.wikimedia.org/r/527013 (owner: Vgutierrez)
[07:05:57] PROBLEM - High CPU load on API appserver on mw1222 is CRITICAL: CRITICAL - load average: 49.31, 23.83, 14.54 https://wikitech.wikimedia.org/wiki/Application_servers
[07:10:21] PROBLEM - High CPU load on API appserver on mw1225 is CRITICAL: CRITICAL - load average: 51.13, 31.67, 19.18 https://wikitech.wikimedia.org/wiki/Application_servers
[07:11:09] PROBLEM - High CPU load on API appserver on mw1231 is CRITICAL: CRITICAL - load average: 53.36, 36.73, 22.70 https://wikitech.wikimedia.org/wiki/Application_servers
[07:11:20] (PS1) Giuseppe Lavagetto: systemd::tmpfile: apply changes when we change the files.
[puppet] - https://gerrit.wikimedia.org/r/527430 (https://phabricator.wikimedia.org/T204450)
[07:11:21] <_joe_> sigh
[07:12:53] (PS1) Elukey: cdh::hive: add hive.server2.logging.operation.enabled [puppet] - https://gerrit.wikimedia.org/r/527433 (https://phabricator.wikimedia.org/T227257)
[07:13:18] (PS2) Elukey: cdh::hive: add hive.server2.logging.operation.enabled [puppet] - https://gerrit.wikimedia.org/r/527433 (https://phabricator.wikimedia.org/T227257)
[07:14:27] (CR) Elukey: [C: +2] cdh::hive: add hive.server2.logging.operation.enabled [puppet] - https://gerrit.wikimedia.org/r/527433 (https://phabricator.wikimedia.org/T227257) (owner: Elukey)
[07:17:53] PROBLEM - High CPU load on API appserver on mw1234 is CRITICAL: CRITICAL - load average: 54.86, 37.48, 26.53 https://wikitech.wikimedia.org/wiki/Application_servers
[07:17:55] PROBLEM - High CPU load on API appserver on mw1226 is CRITICAL: CRITICAL - load average: 64.57, 38.85, 24.49 https://wikitech.wikimedia.org/wiki/Application_servers
[07:19:41] (PS1) Vgutierrez: fifo-log-demux: Fix EPIPE check [software/fifo-log-demux] - https://gerrit.wikimedia.org/r/527435
[07:21:27] !log Add db2124 to tendril and zarcillo T228969
[07:21:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:21:36] T228969: Productionize db21[21-30} - https://phabricator.wikimedia.org/T228969
[07:21:48] <_joe_> ok, those appservers
[07:22:02] (PS2) Marostegui: db-eqiad,db-codfw.php: Pool db2129 [mediawiki-config] - https://gerrit.wikimedia.org/r/527422 (https://phabricator.wikimedia.org/T228969)
[07:22:07] <_joe_> can someone look?
I'm looking at another problem rn
[07:22:20] _joe_: I can take a look
[07:24:50] marostegui: very interesting https://grafana.wikimedia.org/d/000000002/api-backend-summary?refresh=5m&orgId=1
[07:25:14] (CR) Vgutierrez: [C: +1] db-eqiad,db-codfw.php: Pool db2129 [mediawiki-config] - https://gerrit.wikimedia.org/r/527422 (https://phabricator.wikimedia.org/T228969) (owner: Marostegui)
[07:25:41] <_joe_> it's parsoid-batch again
[07:25:59] now I am wondering - is it something that hits hard hhvm for some reason, but not php-fpm?
[07:26:01] elukey: yeah I was checking that https://grafana.wikimedia.org/d/000000377/host-overview?refresh=5m&orgId=1&var-server=mw1226&var-datasource=eqiad%20prometheus%2Fops&var-cluster=api_appserver kinda matches the SAL entry from _joe_ but it is probably a coincidence
[07:26:02] <_joe_> we need to open a task and involve core platform in the investigation
[07:26:09] <_joe_> elukey: based on what?
[07:26:10] vgutierrez: thanks
[07:26:22] np <3
[07:26:41] <_joe_> yeah marostegui my log entry is for something not in production
[07:26:51] yep
[07:26:54] _joe_ I am wondering out loud, not based on anything. It could be good to check. If so, we are migrating slowly to php7 only..
[07:27:05] <_joe_> elukey: mw1347 is php only
[07:27:09] <_joe_> so is mw1348
[07:27:15] <_joe_> you can check if they're affected
[07:27:50] not this time afaics from icinga, and not the last one too
[07:28:03] mw1226 has hhvm
[07:28:19] PROBLEM - High CPU load on API appserver on mw1232 is CRITICAL: CRITICAL - load average: 50.90, 33.95, 25.52 https://wikitech.wikimedia.org/wiki/Application_servers
[07:32:13] PROBLEM - High CPU load on API appserver on mw1234 is CRITICAL: CRITICAL - load average: 43.70, 36.93, 30.41 https://wikitech.wikimedia.org/wiki/Application_servers
[07:36:22] <_joe_> https://grafana.wikimedia.org/d/RIA1lzDZk/xxx-joe-appserver?orgId=1&from=1564728500616&to=1564731349240&var-datasource=eqiad%20prometheus%2Fops&var-cluster=api_appserver&var-instance=mw1347:3903&var-method=GET&var-code=200 response times are degrading
[07:36:29] <_joe_> but nothing too horrible
[07:38:39] <_joe_> !log disabling puppet on mw1270 for testing of different php settings
[07:38:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:40:12] <_joe_> !log restarting php-fpm on mw1270, with 80 pms - static, apc 6 GB no ttl
[07:40:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:41:05] PROBLEM - HHVM rendering on mw1227 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers
[07:41:20] (CR) ArielGlenn: [C: +1] db-eqiad,db-codfw.php: Pool db2129 [mediawiki-config] - https://gerrit.wikimedia.org/r/527422 (https://phabricator.wikimedia.org/T228969) (owner: Marostegui)
[07:42:07] _joe_: what can we do to help?
[07:42:16] thanks apergos
[07:42:25] <_joe_> marostegui: help with what?
[07:42:32] _joe_: with the app servers
[07:42:33] RECOVERY - HHVM rendering on mw1227 is OK: HTTP OK: HTTP/1.1 200 OK - 81035 bytes in 0.134 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[07:42:37] <_joe_> the api issue?
restarting hhvm
[07:42:41] <_joe_> on those servers
[07:42:45] ok!
[07:43:07] <_joe_> sorry I'm trying to understand what went wrong with mw1270 last night
[07:43:12] !log Restart hhvm on mw1226
[07:43:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:43:19] _joe_: No worries, I just wanted to help :)
[07:43:29] <_joe_> being the only appserver (non-api) fully on php7, it's worrisome
[07:43:42] <_joe_> marostegui: you have too much free time now!
[07:43:48] hahahahaha
[07:43:55] I actually have to push a change to mwconfig! :)
[07:45:44] mw1226 looking good now
[07:46:41] (PS1) Marostegui: db2129: Enable notifications [puppet] - https://gerrit.wikimedia.org/r/527441 (https://phabricator.wikimedia.org/T228969)
[07:48:57] (CR) Marostegui: [C: +2] db2129: Enable notifications [puppet] - https://gerrit.wikimedia.org/r/527441 (https://phabricator.wikimedia.org/T228969) (owner: Marostegui)
[07:48:59] (CR) Ema: [C: -1] "Tested on traffic-upload-stretch.traffic.eqiad.wmflabs, error is still there."
[software/fifo-log-demux] - https://gerrit.wikimedia.org/r/527435 (owner: Vgutierrez)
[07:49:32] (CR) Marostegui: [C: +2] db-eqiad,db-codfw.php: Pool db2129 [mediawiki-config] - https://gerrit.wikimedia.org/r/527422 (https://phabricator.wikimedia.org/T228969) (owner: Marostegui)
[07:51:00] (Merged) jenkins-bot: db-eqiad,db-codfw.php: Pool db2129 [mediawiki-config] - https://gerrit.wikimedia.org/r/527422 (https://phabricator.wikimedia.org/T228969) (owner: Marostegui)
[07:51:17] (CR) jenkins-bot: db-eqiad,db-codfw.php: Pool db2129 [mediawiki-config] - https://gerrit.wikimedia.org/r/527422 (https://phabricator.wikimedia.org/T228969) (owner: Marostegui)
[07:52:09] RECOVERY - High CPU load on API appserver on mw1232 is OK: OK - load average: 8.14, 14.67, 23.11 https://wikitech.wikimedia.org/wiki/Application_servers
[07:52:10] !log marostegui@deploy1001 Synchronized wmf-config/db-codfw.php: Add db2129 to the config T228969 (duration: 00m 47s)
[07:52:15] PROBLEM - Mobileapps LVS codfw on mobileapps.svc.codfw.wmnet is CRITICAL: /{domain}/v1/page/most-read/{year}/{month}/{day} (retrieve the most read articles for January 1, 2016) is CRITICAL: Test retrieve the most read articles for January 1, 2016 returned the unexpected status 504 (expecting: 200) https://wikitech.wikimedia.org/wiki/Mobileapps_%28service%29
[07:52:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:52:26] T228969: Productionize db21[21-30} - https://phabricator.wikimedia.org/T228969
[07:53:02] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Add db2129 to the config T228969 (duration: 00m 47s)
[07:53:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:53:49] PROBLEM - mobileapps endpoints health on scb2003 is CRITICAL: /{domain}/v1/page/most-read/{year}/{month}/{day} (retrieve the most read articles for January 1, 2016) is CRITICAL: Test retrieve the most read articles for January 1, 2016 returned
the unexpected status 504 (expecting: 200): /{domain}/v1/page/summary/{title} (Get summary for test page) is CRITICAL: Test Get summary for test page returned the unexpected status 504 (exp
[07:53:49] s://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps
[07:53:51] RECOVERY - Mobileapps LVS codfw on mobileapps.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Mobileapps_%28service%29
[07:55:25] RECOVERY - mobileapps endpoints health on scb2003 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps
[07:55:49] !log marostegui@cumin2001 dbctl commit of MediaWiki config (dc=all), diff saved to 'https://phabricator.wikimedia.org/P8852', previous config saved to /var/cache/conftool/dbconfig/20190802-075548-marostegui.json
[07:56:03] RECOVERY - High CPU load on API appserver on mw1234 is OK: OK - load average: 8.22, 12.18, 22.84 https://wikitech.wikimedia.org/wiki/Application_servers
[07:56:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:04:02] (PS2) Vgutierrez: fifo-log-demux: Fix EPIPE check [software/fifo-log-demux] - https://gerrit.wikimedia.org/r/527435
[08:04:07] RECOVERY - High CPU load on API appserver on mw1226 is OK: OK - load average: 15.48, 15.83, 23.71 https://wikitech.wikimedia.org/wiki/Application_servers
[08:07:07] PROBLEM - Disk space on elastic1017 is CRITICAL: DISK CRITICAL - free space: /srv 27833 MB (5% inode=99%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=elastic1017&var-datasource=eqiad+prometheus/ops
[08:08:00] (PS1) Marostegui: dbctl_client.pp: Remove dbctl diff alerts [puppet] - https://gerrit.wikimedia.org/r/527451 (https://phabricator.wikimedia.org/T197126)
[08:08:31] RECOVERY - High CPU load on API appserver on mw1231 is OK: OK - load average: 9.26, 13.00, 22.72 https://wikitech.wikimedia.org/wiki/Application_servers
[08:09:17]
RECOVERY - High CPU load on API appserver on mw1225 is OK: OK - load average: 8.81, 11.28, 23.34 https://wikitech.wikimedia.org/wiki/Application_servers
[08:09:43] PROBLEM - dbctl differs from mediawiki-config in codfw- did you forget to update both- on cumin2001 is CRITICAL: Mismatched loads for section s6: diff {(db2129, 400)} -- PHP {db2053: 50, db2060: 100, db2046: 0, db2076: 400, db2089:3316: 100, db2087:3316: 100, db2067: 100, db2114: 400, db2117: 400} vs dbctl {db2129: 400, db2053: 50, db2060: 100, db2046: 0, db2076: 400, db2089:3316: 100, db2087:3316: 100, db2067: 100, db2114: 400,
[08:09:43] s://wikitech.wikimedia.org/wiki/Dbctl%23Configuration_deltas_vs_PHP
[08:09:52] I will ack that alert for now
[08:10:07] Or actually I can just commit the change to clear it, it is just one line
[08:11:17] RECOVERY - High CPU load on API appserver on mw1222 is OK: OK - load average: 7.63, 10.13, 23.07 https://wikitech.wikimedia.org/wiki/Application_servers
[08:12:51] (PS1) Marostegui: db-codfw.php: Add db2129 to s6 [mediawiki-config] - https://gerrit.wikimedia.org/r/527453
[08:14:34] (CR) Marostegui: [C: +2] db-codfw.php: Add db2129 to s6 [mediawiki-config] - https://gerrit.wikimedia.org/r/527453 (owner: Marostegui)
[08:15:29] (Merged) jenkins-bot: db-codfw.php: Add db2129 to s6 [mediawiki-config] - https://gerrit.wikimedia.org/r/527453 (owner: Marostegui)
[08:15:35] (PS1) Marostegui: db2129: Clarify it will be the candidate master for s6 [puppet] - https://gerrit.wikimedia.org/r/527454
[08:16:28] (CR) jenkins-bot: db-codfw.php: Add db2129 to s6 [mediawiki-config] - https://gerrit.wikimedia.org/r/527453 (owner: Marostegui)
[08:16:33] !log marostegui@deploy1001 Synchronized wmf-config/db-codfw.php: Add db2129 to s6 (duration: 00m 46s)
[08:16:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:19:51] RECOVERY - Disk space on elastic1017 is OK: DISK OK
https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=elastic1017&var-datasource=eqiad+prometheus/ops
[08:22:19] (CR) Volans: [C: -1] "This is already taken care in Ie407f61f6cb09deb9311d0d5cb4b18e0aca5eacf ;)" [puppet] - https://gerrit.wikimedia.org/r/527451 (https://phabricator.wikimedia.org/T197126) (owner: Marostegui)
[08:22:53] (CR) Marostegui: "<3" [puppet] - https://gerrit.wikimedia.org/r/527451 (https://phabricator.wikimedia.org/T197126) (owner: Marostegui)
[08:22:59] RECOVERY - dbctl differs from mediawiki-config in codfw- did you forget to update both- on cumin2001 is OK: OK: configurations match https://wikitech.wikimedia.org/wiki/Dbctl%23Configuration_deltas_vs_PHP
[08:23:02] (Abandoned) Marostegui: dbctl_client.pp: Remove dbctl diff alerts [puppet] - https://gerrit.wikimedia.org/r/527451 (https://phabricator.wikimedia.org/T197126) (owner: Marostegui)
[08:26:40] (CR) Marostegui: [C: +2] db2129: Clarify it will be the candidate master for s6 [puppet] - https://gerrit.wikimedia.org/r/527454 (owner: Marostegui)
[08:31:23] (PS1) Marostegui: wmnet: Add m1-master for codfw [dns] - https://gerrit.wikimedia.org/r/527462 (https://phabricator.wikimedia.org/T202367)
[08:36:19] (PS1) Filippo Giunchedi: monitoring: fix HTTP availability dashboard links [puppet] - https://gerrit.wikimedia.org/r/527465 (https://phabricator.wikimedia.org/T228878)
[08:40:23] PROBLEM - Disk space on analytics1043 is CRITICAL: DISK CRITICAL - /sys/kernel/debug/tracing is not accessible: Permission denied https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=analytics1043&var-datasource=eqiad+prometheus/ops
[08:41:03] PROBLEM - MediaWiki exceptions and fatals per minute on graphite1004 is CRITICAL: CRITICAL: 80.00% of data above the critical threshold [50.0]
https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=2&fullscreen
[08:41:42] (PS3) Vgutierrez: fifo-log-demux: Fix EPIPE check [software/fifo-log-demux] - https://gerrit.wikimedia.org/r/527435
[08:42:37] RECOVERY - MediaWiki exceptions and fatals per minute on graphite1004 is OK: OK: Less than 70.00% above the threshold [25.0] https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=2&fullscreen
[08:45:39] elukey@analytics1043:~$ sudo ls -ld /sys/kernel/debug/tracing
[08:45:40] drwx------ 6 root root 0 Jun 7 09:13 /sys/kernel/debug/tracing
[08:45:41] mmmmm
[08:46:40] (CR) Alexandros Kosiaris: [V: +2 C: +2] restrouter: Add helmfile stanzas [deployment-charts] - https://gerrit.wikimedia.org/r/526719 (https://phabricator.wikimedia.org/T223953) (owner: Alexandros Kosiaris)
[08:47:02] (CR) Filippo Giunchedi: [C: +2] "PCC https://puppet-compiler.wmflabs.org/compiler1001/17718/" [puppet] - https://gerrit.wikimedia.org/r/527465 (https://phabricator.wikimedia.org/T228878) (owner: Filippo Giunchedi)
[08:47:12] (PS5) Alexandros Kosiaris: restrouter: Add kubernetes stanzas [puppet] - https://gerrit.wikimedia.org/r/526632 (https://phabricator.wikimedia.org/T223953)
[08:56:58] !log @ helmfile [STAGING] Ran 'apply' command on namespace 'kube-system' for release 'calico-policy-controller' .
[08:57:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:57:11] !log akosiaris@ helmfile [STAGING] Ran 'apply' command on namespace 'kube-system' for release 'calico-policy-controller' .
[08:57:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:05:32] (PS1) Ladsgroup: Revert "Revert "Revert "Switch property terms migration to WRITE_NEW on production wikidata""" [mediawiki-config] - https://gerrit.wikimedia.org/r/527477
[09:05:37] (CR) Ladsgroup: [C: +2] Revert "Revert "Revert "Switch property terms migration to WRITE_NEW on production wikidata""" [mediawiki-config] - https://gerrit.wikimedia.org/r/527477 (owner: Ladsgroup)
[09:12:44] RECOVERY - Disk space on analytics1043 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=analytics1043&var-datasource=eqiad+prometheus/ops
[09:12:48] !log umount /sys/kernel/debug/tracing on analytics1043
[09:12:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:13:07] (PS1) Alexandros Kosiaris: restrouter: Fix typo in suffixes in admin [deployment-charts] - https://gerrit.wikimedia.org/r/527480
[09:14:40] (PS2) Ladsgroup: Revert "Revert "Revert "Switch property terms migration to WRITE_NEW on production wikidata""" [mediawiki-config] - https://gerrit.wikimedia.org/r/527477
[09:14:42] (CR) Ladsgroup: [C: +2] Revert "Revert "Revert "Switch property terms migration to WRITE_NEW on production wikidata""" [mediawiki-config] - https://gerrit.wikimedia.org/r/527477 (owner: Ladsgroup)
[09:15:02] (Merged) jenkins-bot: Revert "Revert "Revert "Switch property terms migration to WRITE_NEW on production wikidata""" [mediawiki-config] - https://gerrit.wikimedia.org/r/527477 (owner: Ladsgroup)
[09:16:40] (CR) jenkins-bot: Revert "Revert "Revert "Switch property terms migration to WRITE_NEW on production wikidata""" [mediawiki-config] - https://gerrit.wikimedia.org/r/527477 (owner: Ladsgroup)
[09:17:05] !log ladsgroup@deploy1001 Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:526657|Revert: Switch property terms
migration to WRITE_NEW on production wikidata (T225053)]] (duration: 00m 48s) [09:17:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:17:13] T225053: Switch `tmpPropertyTermsMigrationStage` to MIGRATION_WRITE_NEW - https://phabricator.wikimedia.org/T225053 [09:22:05] !log Compress s7 on labsdb1010 - T222978 [09:22:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:22:14] T222978: Compress and defragment tables on labsdb hosts - https://phabricator.wikimedia.org/T222978 [09:23:21] (03CR) 10Alexandros Kosiaris: [C: 03+2] restrouter: Add kubernetes stanzas [puppet] - 10https://gerrit.wikimedia.org/r/526632 (https://phabricator.wikimedia.org/T223953) (owner: 10Alexandros Kosiaris) [09:25:49] (03CR) 10Filippo Giunchedi: "> Patch Set 2:" [puppet] - 10https://gerrit.wikimedia.org/r/526819 (https://phabricator.wikimedia.org/T226331) (owner: 10Ayounsi) [09:34:16] (03PS1) 10Giuseppe Lavagetto: Add the mediawiki.restart_appservers cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/527487 [09:35:55] (03CR) 10jerkins-bot: [V: 04-1] Add the mediawiki.restart_appservers cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/527487 (owner: 10Giuseppe Lavagetto) [09:37:17] (03PS4) 10Filippo Giunchedi: prometheus: query pdu resources based on model [puppet] - 10https://gerrit.wikimedia.org/r/526634 (https://phabricator.wikimedia.org/T148541) [09:38:11] (03CR) 10Filippo Giunchedi: [C: 03+2] prometheus: query pdu resources based on model [puppet] - 10https://gerrit.wikimedia.org/r/526634 (https://phabricator.wikimedia.org/T148541) (owner: 10Filippo Giunchedi) [09:41:34] (03PS3) 10Filippo Giunchedi: prometheus: generate targets for sentry4 PDUs too [puppet] - 10https://gerrit.wikimedia.org/r/526640 (https://phabricator.wikimedia.org/T148541) [09:41:45] (03CR) 10Filippo Giunchedi: [C: 03+2] prometheus: generate targets for sentry4 PDUs too [puppet] - 10https://gerrit.wikimedia.org/r/526640 
(https://phabricator.wikimedia.org/T148541) (owner: 10Filippo Giunchedi) [09:56:01] PROBLEM - High CPU load on API appserver on mw1225 is CRITICAL: CRITICAL - load average: 53.14, 25.33, 17.41 https://wikitech.wikimedia.org/wiki/Application_servers [09:56:23] (03PS1) 10Filippo Giunchedi: prometheus: skip duplicates when generating pdu configuration [puppet] - 10https://gerrit.wikimedia.org/r/527498 (https://phabricator.wikimedia.org/T148541) [09:56:53] PROBLEM - High CPU load on API appserver on mw1233 is CRITICAL: CRITICAL - load average: 67.69, 39.26, 24.95 https://wikitech.wikimedia.org/wiki/Application_servers [09:57:19] PROBLEM - High CPU load on API appserver on mw1222 is CRITICAL: CRITICAL - load average: 74.15, 38.58, 23.35 https://wikitech.wikimedia.org/wiki/Application_servers [10:01:37] RECOVERY - High CPU load on API appserver on mw1233 is OK: OK - load average: 11.81, 23.16, 21.92 https://wikitech.wikimedia.org/wiki/Application_servers [10:01:51] !log @ helmfile [STAGING] Ran 'apply' command on namespace 'restrouter' for release 'staging' . 
[10:01:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:02:03] RECOVERY - High CPU load on API appserver on mw1222 is OK: OK - load average: 13.19, 23.84, 21.24 https://wikitech.wikimedia.org/wiki/Application_servers [10:02:21] RECOVERY - High CPU load on API appserver on mw1225 is OK: OK - load average: 13.78, 22.59, 20.14 https://wikitech.wikimedia.org/wiki/Application_servers [10:02:38] (03PS2) 10Giuseppe Lavagetto: Add the mediawiki.restart_appservers cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/527487 [10:03:17] (03CR) 10Filippo Giunchedi: [C: 03+2] prometheus: skip duplicates when generating pdu configuration [puppet] - 10https://gerrit.wikimedia.org/r/527498 (https://phabricator.wikimedia.org/T148541) (owner: 10Filippo Giunchedi) [10:03:25] (03PS2) 10Filippo Giunchedi: prometheus: skip duplicates when generating pdu configuration [puppet] - 10https://gerrit.wikimedia.org/r/527498 (https://phabricator.wikimedia.org/T148541) [10:07:17] (03CR) 10Ema: [C: 03+1] fifo-log-demux: Fix EPIPE check [software/fifo-log-demux] - 10https://gerrit.wikimedia.org/r/527435 (owner: 10Vgutierrez) [10:12:45] (03CR) 10Vgutierrez: [V: 03+2 C: 03+2] fifo-log-demux: Remove socket activation [software/fifo-log-demux] - 10https://gerrit.wikimedia.org/r/527039 (owner: 10Vgutierrez) [10:13:13] (03CR) 10Vgutierrez: [V: 03+2 C: 03+2] fifo-log-demux: Keep attempting to read the FIFO after EOF [software/fifo-log-demux] - 10https://gerrit.wikimedia.org/r/527013 (owner: 10Vgutierrez) [10:14:01] (03CR) 10Vgutierrez: [V: 03+2 C: 03+2] fifo-log-demux: Fix EPIPE check [software/fifo-log-demux] - 10https://gerrit.wikimedia.org/r/527435 (owner: 10Vgutierrez) [10:41:00] Jenkins it slow [10:41:09] I need to deploy for a UBN right now :/ [10:42:36] Amir1: this doesn't look too bad from a first sight https://integration.wikimedia.org/zuul/ [10:42:46] (03PS1) 10ArielGlenn: add more public tables for xml/sql dumps [puppet] - 
10https://gerrit.wikimedia.org/r/527505 (https://phabricator.wikimedia.org/T226167) [10:43:19] marostegui: gate-and-submit-swat: 32 minutes and due to a flaky browser test I need to redo it [10:43:21] <_joe_> Amir1: if that's a rollback of a patch, you can quick-fix it on deploy1001 in the meantime [10:43:30] <_joe_> wow [10:43:43] 32 minutes woot [10:43:46] <_joe_> what's the commit? [10:43:50] _joe_: yeah [10:43:56] https://gerrit.wikimedia.org/r/c/mediawiki/extensions/Wikibase/+/527501 [10:44:04] branch backports are always slow [10:44:42] <_joe_> Amir1: if you feel confident it works, just remove the -1 from jenkins bot, add v+2 and hit submit [10:44:44] <_joe_> :P [10:45:24] _joe_: yeah, the error is flaky + the not-submit tests all passed (look at jenkins) [10:45:46] Thanks [10:45:47] <_joe_> Amir1: go for it then [10:47:37] now on wmf.16 [10:51:25] !log ladsgroup@deploy1001 Synchronized php-1.34.0-wmf.16/extensions/Wikibase: [[gerrit:527501|Revert "fix eslint errors in lib after moving submodule files into lib"]] (duration: 01m 08s) [10:51:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:34:29] !log ladsgroup@deploy1001 Started scap: [[phab:T229604|Rebuilding l10n cache]] [11:34:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:34:38] T229604: Several selectors/experts are broken - https://phabricator.wikimedia.org/T229604 [11:39:35] !log ladsgroup@deploy1001 Finished scap: [[phab:T229604|Rebuilding l10n cache]] (duration: 05m 06s) [11:39:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:39:47] T229604: Several selectors/experts are broken - https://phabricator.wikimedia.org/T229604 [11:40:58] It's done in five minutes and it didn't rebuild l10n cache :/ [11:41:16] weird [11:44:11] How can I rebuild the l10n cache? [11:44:18] extensions/LocalisationUpdate/update.php ? 
[11:45:17] nope that's something else [11:47:48] scap sync-l10n 1.34.0-wmf.16 '[[phab:T229604|Rebuilding l10n cache]]' would be it [11:47:49] T229604: Several selectors/experts are broken - https://phabricator.wikimedia.org/T229604 [11:48:07] !log ladsgroup@deploy1001 scap sync-l10n completed (1.34.0-wmf.16) (duration: 00m 44s) [11:48:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:52:13] Still not working. zeljkof hey, do you know why l10n cache is not getting rebuilt? [11:52:30] e.g. This https://www.wikidata.org/wiki/MediaWiki:Valueview-expertextender-languageselector-label [11:54:30] !log start of l10nupdate [11:54:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:56:08] I'm not sure if I'm doing it right. The changed logs are too big [11:56:25] !log aborted l10nupdate [11:56:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:10:06] (03CR) 10Lucas Werkmeister (WMDE): "I also checked the other redirects generated by Varnish to see if it makes sense to add the header to them as well (`git grep -F 'synth(3'" [puppet] - 10https://gerrit.wikimedia.org/r/526627 (https://phabricator.wikimedia.org/T229385) (owner: 10Lucas Werkmeister (WMDE)) [12:10:39] brennen: you’re train conductor this week, do you know why Amir1’s l10n rebuild might not be working as expected? 
[12:11:22] I think he's asleep now [12:14:48] I'd expect brennen to not wake up until at least two hours from now [12:15:37] If anyone knows better how to rebuild l10n cache for wmf.16, please do [12:15:55] This should show up https://www.wikidata.org/wiki/MediaWiki:Valueview-expertextender-languageselector-label [12:18:12] (03PS2) 10CDanis: Revert "dbctl: diff PHP vs dbctl configs" [puppet] - 10https://gerrit.wikimedia.org/r/527245 (https://phabricator.wikimedia.org/T229070) [12:19:37] (03CR) 10CDanis: [C: 03+2] Revert "dbctl: diff PHP vs dbctl configs" [puppet] - 10https://gerrit.wikimedia.org/r/527245 (https://phabricator.wikimedia.org/T229070) (owner: 10CDanis) [12:31:28] 10Operations, 10DBA, 10cloud-services-team, 10wikitech.wikimedia.org: Switchover m5 primary master: db1073 to db1133 - https://phabricator.wikimedia.org/T229657 (10Marostegui) [12:33:01] !log Restarted wikibugs a few minutes ago as it was not sending anything on IRC [12:33:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:33:22] 10Operations, 10DBA, 10wikitech.wikimedia.org, 10cloud-services-team (Kanban): Switchover m5 primary master: db1073 to db1133 - https://phabricator.wikimedia.org/T229657 (10aborrero) I think we could either do this next week or wait until september because the WMCS team we will be traveling for Wikimania +... [12:35:01] 10Operations, 10DBA, 10wikitech.wikimedia.org, 10cloud-services-team (Kanban): Switchover m5 primary master: db1073 to db1133 - https://phabricator.wikimedia.org/T229657 (10Marostegui) >>! In T229657#5387587, @aborrero wrote: > I think we could either do this next week or wait until september because the W... [12:37:36] 10Operations, 10DBA, 10wikitech.wikimedia.org, 10cloud-services-team (Kanban): Switchover m5 primary master: db1073 to db1133 - https://phabricator.wikimedia.org/T229657 (10CDanis) FYI I'll be on vacation and without a work laptop approx Sept 10th - Sept 20th, and possibly Sept 9th as well. 
Outside of tha... [12:42:08] 10Operations, 10DBA, 10wikitech.wikimedia.org, 10cloud-services-team (Kanban): Switchover m5 primary master: db1073 to db1133 - https://phabricator.wikimedia.org/T229657 (10Marostegui) Let me pick a tentative.....Tuesday 3rd Sept at 13:00 UTC? @aborrero @CDanis ? [12:42:27] 10Operations, 10DBA, 10wikitech.wikimedia.org, 10cloud-services-team (Kanban): Switchover m5 primary master: db1073 to db1133 - https://phabricator.wikimedia.org/T229657 (10aborrero) Ok, so I'm proposing two dates: * 2019-10-03 -- I'm unavailable, but I think both @JHedden and @Andrew will be around. Also... [12:42:45] PROBLEM - High CPU load on API appserver on mw1233 is CRITICAL: CRITICAL - load average: 75.67, 40.14, 23.56 https://wikitech.wikimedia.org/wiki/Application_servers [12:43:03] PROBLEM - High CPU load on API appserver on mw1231 is CRITICAL: CRITICAL - load average: 71.10, 35.33, 21.14 https://wikitech.wikimedia.org/wiki/Application_servers [12:43:25] PROBLEM - High CPU load on API appserver on mw1223 is CRITICAL: CRITICAL - load average: 66.86, 32.58, 18.70 https://wikitech.wikimedia.org/wiki/Application_servers [12:43:25] PROBLEM - High CPU load on API appserver on mw1222 is CRITICAL: CRITICAL - load average: 76.09, 38.17, 22.10 https://wikitech.wikimedia.org/wiki/Application_servers [12:43:46] 10Operations, 10DBA, 10wikitech.wikimedia.org, 10cloud-services-team (Kanban): Switchover m5 primary master: db1073 to db1133 - https://phabricator.wikimedia.org/T229657 (10CDanis) >>! In T229657#5387609, @Marostegui wrote: > Let me pick a tentative.....Tuesday 3rd Sept at 13:00 UTC? @aborrero @CDanis ? L... [12:44:32] 10Operations, 10DBA, 10wikitech.wikimedia.org, 10cloud-services-team (Kanban): Switchover m5 primary master: db1073 to db1133 - https://phabricator.wikimedia.org/T229657 (10Marostegui) @aborrero are you proposing October? 
[12:45:54] 10Operations, 10DBA, 10wikitech.wikimedia.org, 10cloud-services-team (Kanban): Switchover m5 primary master: db1073 to db1133 - https://phabricator.wikimedia.org/T229657 (10aborrero) Ok, **2019-10-03**, work for us. Will let my team know, since I won't be around. >>! In T229657#5387615, @Marostegui wrote:... [12:46:21] PROBLEM - High CPU load on API appserver on mw1225 is CRITICAL: CRITICAL - load average: 37.84, 36.45, 23.15 https://wikitech.wikimedia.org/wiki/Application_servers [12:47:03] 10Operations, 10DBA, 10wikitech.wikimedia.org, 10cloud-services-team (Kanban): Switchover m5 primary master: db1073 to db1133 - https://phabricator.wikimedia.org/T229657 (10Marostegui) Let's try to go for the 3rd of September at 13:00 UTC if @Andrew and/or @JHedden can confirm they'll be available to suppo... [12:50:59] RECOVERY - High CPU load on API appserver on mw1225 is OK: OK - load average: 18.94, 25.51, 21.89 https://wikitech.wikimedia.org/wiki/Application_servers [12:52:56] I have a feeling someone is hitting Wikidata's API too hard every hour or every two hour. The load errors ^ and this: https://grafana.wikimedia.org/d/000000548/wikibase-wb_terms?refresh=30s&orgId=1&from=now-24h&to=now [12:53:22] Amir1: could that be also related to what we saw on the DBs? 
[12:53:53] yeah, it's the same minute but exactly one hour off [12:54:10] (I've got a headache so I'm not sure my measurements are correct) [12:59:11] RECOVERY - toolschecker: showmount succeeds on a labs instance on checker.tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 158 bytes in 0.064 second response time https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Toolschecker [13:04:59] PROBLEM - High CPU load on API appserver on mw1234 is CRITICAL: CRITICAL - load average: 62.91, 35.35, 26.42 https://wikitech.wikimedia.org/wiki/Application_servers [13:05:15] PROBLEM - High CPU load on API appserver on mw1223 is CRITICAL: CRITICAL - load average: 56.00, 34.54, 28.58 https://wikitech.wikimedia.org/wiki/Application_servers [13:06:43] PROBLEM - High CPU load on API appserver on mw1226 is CRITICAL: CRITICAL - load average: 79.30, 44.48, 29.63 https://wikitech.wikimedia.org/wiki/Application_servers [13:06:51] PROBLEM - High CPU load on API appserver on mw1232 is CRITICAL: CRITICAL - load average: 66.11, 38.01, 26.16 https://wikitech.wikimedia.org/wiki/Application_servers [13:06:51] PROBLEM - High CPU load on API appserver on mw1225 is CRITICAL: CRITICAL - load average: 71.35, 43.09, 30.54 https://wikitech.wikimedia.org/wiki/Application_servers [13:08:01] PROBLEM - recommendation_api endpoints health on scb2002 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [13:09:31] RECOVERY - recommendation_api endpoints health on scb2002 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [13:18:47] (03PS1) 10Arturo Borrero Gonzalez: toolforge: k8s: ingress: nginx-ingress listen on 8082/tcp [puppet] - 10https://gerrit.wikimedia.org/r/527541 (https://phabricator.wikimedia.org/T228500) [13:18:49] (03PS1) 10Arturo Borrero Gonzalez: toolforge: k8s: 
ingress: add frontend service [puppet] - 10https://gerrit.wikimedia.org/r/527542 (https://phabricator.wikimedia.org/T228500) [13:19:25] RECOVERY - High CPU load on API appserver on mw1226 is OK: OK - load average: 9.80, 14.75, 22.98 https://wikitech.wikimedia.org/wiki/Application_servers [13:19:33] RECOVERY - High CPU load on API appserver on mw1223 is OK: OK - load average: 11.54, 15.25, 22.86 https://wikitech.wikimedia.org/wiki/Application_servers [13:19:33] RECOVERY - High CPU load on API appserver on mw1232 is OK: OK - load average: 9.90, 16.04, 22.81 https://wikitech.wikimedia.org/wiki/Application_servers [13:22:43] RECOVERY - High CPU load on API appserver on mw1225 is OK: OK - load average: 11.54, 14.53, 22.86 https://wikitech.wikimedia.org/wiki/Application_servers [13:24:03] RECOVERY - High CPU load on API appserver on mw1234 is OK: OK - load average: 10.39, 13.77, 22.95 https://wikitech.wikimedia.org/wiki/Application_servers [13:25:32] what were these alerts about? :) [13:27:09] paravoid: we saw them in the morning too, they recovered after a while, we restarted some hhvm but I am not sure if the root cause was found [13:27:28] _joe_, jijiki ^ [13:27:56] <_joe_> this morning it was a flood of parsoid_batch requests [13:28:19] <_joe_> but given it's always the same servers, lemme take a deeper look at what's going on there [13:28:56] (03PS1) 10Arturo Borrero Gonzalez: toolforge: k8s: haproxy: add proxy redirection for nginx-ingress [puppet] - 10https://gerrit.wikimedia.org/r/527544 (https://phabricator.wikimedia.org/T228500) [13:29:18] the hosts in question have basically no thermal headroom [13:29:28] ugh, the graph of thermal throttling events on grafana has broken [13:29:33] but it's very much happening on these hosts, and often [13:29:38] Aug 2 13:24:01 mw1223 kernel: [5086950.120019] CPU29: Package temperature above threshold, cpu clock throttled (total events = 176490976) [13:30:18] cdanis: yeah, I think it has been happening for a while already 
[13:30:33] <_joe_> those servers will be replaced *this quarter* btw [13:30:58] maybe we want to de-weight them some? [13:31:18] <_joe_> lemme first take a better look at what's going on [13:31:50] <_joe_> so the root issue is a flow of parsoid-batch requests (again) [13:32:00] <_joe_> that raised the cpu usage on all api servers [13:32:03] this looks to me like a spike of traffic that gets spread across the apiservers but these 5 servers in question are much less able to handle it [13:32:48] <_joe_> but yeah we could tweak the weights a bit I concur [13:34:32] would it be possible to make parsoid_batch traffic less bursty? [13:34:43] RECOVERY - High CPU load on API appserver on mw1233 is OK: OK - load average: 10.21, 12.85, 23.95 https://wikitech.wikimedia.org/wiki/Application_servers [13:34:56] <_joe_> I didn't have time to look more in depth to what is causing that [13:35:03] RECOVERY - High CPU load on API appserver on mw1231 is OK: OK - load average: 9.60, 10.80, 22.54 https://wikitech.wikimedia.org/wiki/Application_servers [13:35:19] <_joe_> but yes, parsoid-batch should go via changepropagation [13:35:36] <_joe_> so it's mediated by kafka and an http pusher that has concurrency limits [13:35:38] <_joe_> but [13:35:55] <_joe_> I don't see a spike in the number of requests [13:35:57] <_joe_> to the api [13:36:16] there is a slight increase in requests that correlates with a large increase in median execution time https://grafana.wikimedia.org/d/000000559/api-requests-breakdown?refresh=5m&orgId=1&var-metric=p50&var-module=parsoid_batch&panelId=19&fullscreen [13:36:34] so I think we are occasionally getting a group of very expensive queries? 
[13:37:16] (03PS1) 10Filippo Giunchedi: prometheus: don't snmp-poll st4InputCordNotifications [puppet] - 10https://gerrit.wikimedia.org/r/527548 (https://phabricator.wikimedia.org/T148541) [13:37:24] <_joe_> cdanis: I trust these other data more: https://grafana.wikimedia.org/d/RIA1lzDZk/xxx-joe-appserver?orgId=1&from=now-1h&to=now&var-datasource=eqiad%20prometheus%2Fops&var-cluster=api_appserver&var-instance=mw1276:3903&var-method=GET&var-code=200&panelId=17&fullscreen [13:37:34] <_joe_> the increase is small though [13:37:40] <_joe_> and it seems to be persisting [13:38:01] <_joe_> I think it's the kind of requests that are somehow very cpu-expensive [13:38:10] but that's not specific to parsoid-batch nor is it aggregated across apiservers [13:38:35] or am I misunderstanding the header row? [13:38:52] oh, I see, these are global graphs, the instance chooser up top only affects the bottom row [13:39:17] (03CR) 10jerkins-bot: [V: 04-1] prometheus: don't snmp-poll st4InputCordNotifications [puppet] - 10https://gerrit.wikimedia.org/r/527548 (https://phabricator.wikimedia.org/T148541) (owner: 10Filippo Giunchedi) [13:39:35] <_joe_> it is aggregated across the api servers [13:39:50] <_joe_> but yes, it's not specific, it's the overall rate [13:40:13] RECOVERY - High CPU load on API appserver on mw1222 is OK: OK - load average: 12.40, 13.01, 23.16 https://wikitech.wikimedia.org/wiki/Application_servers [13:40:58] <_joe_> now to know what happened here, welcome to api.log :P [13:41:54] heh, some of these servers already have lower weights [13:42:24] <_joe_> they do yes [13:42:58] <_joe_> oh I found one problem though [13:43:00] <_joe_> fixing it [13:44:17] there is something that doesn't make sense to me about the apache2 weights vs the nginx weights [13:45:28] !log oblivian@puppetmaster1001 conftool action : set/weight=10; selector: cluster=api_appserver,dc=eqiad,service=nginx,name=mw12[23].* [13:45:35] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:45:42] <_joe_> cdanis: ^^ [13:45:57] _joe_: LGTM [13:45:59] PROBLEM - toolschecker: All k8s worker nodes are healthy on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 SERVICE UNAVAILABLE - string OK not found on http://checker.tools.wmflabs.org:80/k8s/nodes/ready - 177 bytes in 0.154 second response time https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Toolschecker [13:46:05] <_joe_> the nginx weights were modified at some point during some emergency, and no one brought them back to normalcy [13:46:27] _joe_: I still don't understand why mw1276-mw1297 have weight:10 for nginx, but have weight:25 for apache [13:47:33] <_joe_> the reason for this was exactly preserving some api servers for whenever therre was a parsoid-batch storm [13:47:42] <_joe_> we did that at the time as a protective measure [13:47:48] <_joe_> as parsoid uses TLS [13:48:04] <_joe_> while most things are still reaching mediawiki unencrypted [13:48:46] <_joe_> those storms were gone for... years now? [13:48:50] <_joe_> but they seem to be back [13:48:52] 10Operations, 10Traffic, 10netops, 10IPv6: Fix IPv6 autoconf issues once and for all, across the fleet. - https://phabricator.wikimedia.org/T102099 (10akosiaris) > This still leaves all the servers currently installed which have a MAC based SLAAC address i.e. they do not have interface::add_ip6_mapped. It... 
[13:49:45] (03PS2) 10Filippo Giunchedi: prometheus: don't snmp-poll st4InputCordNotifications [puppet] - 10https://gerrit.wikimedia.org/r/527548 (https://phabricator.wikimedia.org/T148541) [13:50:42] (03CR) 10jerkins-bot: [V: 04-1] prometheus: don't snmp-poll st4InputCordNotifications [puppet] - 10https://gerrit.wikimedia.org/r/527548 (https://phabricator.wikimedia.org/T148541) (owner: 10Filippo Giunchedi) [13:51:11] (03CR) 10Filippo Giunchedi: [V: 03+2 C: 03+2] "CI is choking on the commit message too long, which is a url and I'm going to override" [puppet] - 10https://gerrit.wikimedia.org/r/527548 (https://phabricator.wikimedia.org/T148541) (owner: 10Filippo Giunchedi) [13:52:39] ACKNOWLEDGEMENT - toolschecker: All k8s worker nodes are healthy on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 SERVICE UNAVAILABLE - string OK not found on http://checker.tools.wmflabs.org:80/k8s/nodes/ready - 177 bytes in 0.184 second response time andrew bogott no idea what this is yet https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Toolschecker [13:54:22] !log @ helmfile [STAGING] Ran 'sync' command on namespace 'restrouter' for release 'staging' . 
[13:54:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:57:47] !log mforns@deploy1001 Started deploy [analytics/refinery@b50a939]: deploying refinery up to b50a93955952ed863d5ef7703a91ab59f5d979cf (rollback of cassandra and edit_hourly hive2 actions to unbreak production) [13:57:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:02:31] PROBLEM - mobileapps endpoints health on scb2001 is CRITICAL: /{domain}/v1/page/metadata/{title} (retrieve extended metadata for Video article on English Wikipedia) is CRITICAL: Test retrieve extended metadata for Video article on English Wikipedia returned the unexpected status 504 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps [14:04:09] RECOVERY - mobileapps endpoints health on scb2001 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps [14:08:27] RECOVERY - toolschecker: All k8s worker nodes are healthy on checker.tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 158 bytes in 0.204 second response time https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Toolschecker [14:11:58] 10Operations, 10Traffic, 10netops, 10IPv6: Fix IPv6 autoconf issues once and for all, across the fleet. - https://phabricator.wikimedia.org/T102099 (10BBlack) Re: transitioning away from SLAAC for the current fleet/setup (which I think is probably a good incremental idea, and could happen ahead of the futu... 
[14:14:34] !log mforns@deploy1001 Finished deploy [analytics/refinery@b50a939]: deploying refinery up to b50a93955952ed863d5ef7703a91ab59f5d979cf (rollback of cassandra and edit_hourly hive2 actions to unbreak production) (duration: 16m 47s) [14:14:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:22:03] (03PS1) 10Giuseppe Lavagetto: Remove the service object for the default schema [software/conftool] - 10https://gerrit.wikimedia.org/r/527564 [14:22:05] (03PS1) 10Giuseppe Lavagetto: kvobject: fix some class property ordering [software/conftool] - 10https://gerrit.wikimedia.org/r/527565 [14:27:10] (03CR) 1020after4: [C: 03+1] "> Patch Set 1:" [puppet] - 10https://gerrit.wikimedia.org/r/520433 (owner: 10Aklapper) [14:28:45] (03CR) 1020after4: [C: 03+1] "D1145 is visible to WMD-NDA and Security" [puppet] - 10https://gerrit.wikimedia.org/r/520433 (owner: 10Aklapper) [14:29:30] (03CR) 10Filippo Giunchedi: "See also upstream issue https://github.com/prometheus/snmp_exporter/issues/443" [puppet] - 10https://gerrit.wikimedia.org/r/527548 (https://phabricator.wikimedia.org/T148541) (owner: 10Filippo Giunchedi) [14:29:32] (03PS2) 1020after4: Phab: Allow Greg and Andre to roll back specific user actions [puppet] - 10https://gerrit.wikimedia.org/r/520433 (owner: 10Aklapper) [14:29:50] (03CR) 1020after4: [C: 03+1] "It's been merged now, no need for security protection" [puppet] - 10https://gerrit.wikimedia.org/r/520433 (owner: 10Aklapper) [14:35:10] (03CR) 10jerkins-bot: [V: 04-1] Phab: Allow Greg and Andre to roll back specific user actions [puppet] - 10https://gerrit.wikimedia.org/r/520433 (owner: 10Aklapper) [14:40:25] (03Abandoned) 10Tarrow: Assign termbox-test.svc.{eqiad,codfw}.wmnet LVS IPs [dns] - 10https://gerrit.wikimedia.org/r/521456 (https://phabricator.wikimedia.org/T226814) (owner: 10Tarrow) [14:43:28] (03PS3) 1020after4: Phab: Allow Greg and Andre to roll back specific user actions [puppet] - 10https://gerrit.wikimedia.org/r/520433 
(owner: 10Aklapper) [14:43:47] (03PS4) 1020after4: Phab: Allow Greg and Andre to roll back specific user actions [puppet] - 10https://gerrit.wikimedia.org/r/520433 (owner: 10Aklapper) [14:44:40] (03CR) 10CDanis: [C: 03+1] Remove the service object for the default schema [software/conftool] - 10https://gerrit.wikimedia.org/r/527564 (owner: 10Giuseppe Lavagetto) [14:48:00] (03CR) 10Alexandros Kosiaris: Add the mediawiki.restart_appservers cookbook (032 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/527487 (owner: 10Giuseppe Lavagetto) [14:51:05] PROBLEM - High CPU load on API appserver on mw1227 is CRITICAL: CRITICAL - load average: 52.73, 23.32, 14.72 https://wikitech.wikimedia.org/wiki/Application_servers [14:51:35] PROBLEM - High CPU load on API appserver on mw1317 is CRITICAL: CRITICAL - load average: 74.39, 35.84, 23.37 https://wikitech.wikimedia.org/wiki/Application_servers [14:51:49] PROBLEM - High CPU load on API appserver on mw1231 is CRITICAL: CRITICAL - load average: 53.88, 24.11, 14.85 https://wikitech.wikimedia.org/wiki/Application_servers [14:51:59] PROBLEM - High CPU load on API appserver on mw1234 is CRITICAL: CRITICAL - load average: 58.53, 25.45, 14.65 https://wikitech.wikimedia.org/wiki/Application_servers [14:52:07] PROBLEM - High CPU load on API appserver on mw1226 is CRITICAL: CRITICAL - load average: 50.71, 27.27, 15.43 https://wikitech.wikimedia.org/wiki/Application_servers [14:52:13] PROBLEM - High CPU load on API appserver on mw1223 is CRITICAL: CRITICAL - load average: 49.43, 24.31, 14.47 https://wikitech.wikimedia.org/wiki/Application_servers [14:52:15] PROBLEM - High CPU load on API appserver on mw1232 is CRITICAL: CRITICAL - load average: 56.41, 24.70, 14.37 https://wikitech.wikimedia.org/wiki/Application_servers [14:52:19] PROBLEM - High CPU load on API appserver on mw1284 is CRITICAL: CRITICAL - load average: 63.90, 30.16, 19.51 https://wikitech.wikimedia.org/wiki/Application_servers [14:52:23] PROBLEM - High CPU load on API 
appserver on mw1281 is CRITICAL: CRITICAL - load average: 61.84, 27.78, 18.51 https://wikitech.wikimedia.org/wiki/Application_servers [14:52:39] PROBLEM - High CPU load on API appserver on mw1283 is CRITICAL: CRITICAL - load average: 62.79, 28.31, 18.62 https://wikitech.wikimedia.org/wiki/Application_servers [14:52:57] PROBLEM - High CPU load on API appserver on mw1279 is CRITICAL: CRITICAL - load average: 76.58, 35.20, 21.47 https://wikitech.wikimedia.org/wiki/Application_servers [14:52:57] PROBLEM - High CPU load on API appserver on mw1280 is CRITICAL: CRITICAL - load average: 64.05, 31.33, 20.19 https://wikitech.wikimedia.org/wiki/Application_servers [14:53:01] PROBLEM - High CPU load on API appserver on mw1230 is CRITICAL: CRITICAL - load average: 57.37, 26.79, 15.08 https://wikitech.wikimedia.org/wiki/Application_servers [14:53:01] PROBLEM - High CPU load on API appserver on mw1314 is CRITICAL: CRITICAL - load average: 75.15, 45.95, 27.76 https://wikitech.wikimedia.org/wiki/Application_servers [14:53:01] PROBLEM - High CPU load on API appserver on mw1286 is CRITICAL: CRITICAL - load average: 68.23, 37.28, 24.08 https://wikitech.wikimedia.org/wiki/Application_servers [14:53:05] PROBLEM - High CPU load on API appserver on mw1278 is CRITICAL: CRITICAL - load average: 65.01, 32.33, 20.93 https://wikitech.wikimedia.org/wiki/Application_servers [14:53:05] PROBLEM - High CPU load on API appserver on mw1290 is CRITICAL: CRITICAL - load average: 61.60, 35.71, 22.41 https://wikitech.wikimedia.org/wiki/Application_servers [14:53:07] PROBLEM - High CPU load on API appserver on mw1233 is CRITICAL: CRITICAL - load average: 51.01, 30.89, 17.82 https://wikitech.wikimedia.org/wiki/Application_servers [14:53:13] PROBLEM - PHP7 rendering on mw1347 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [14:53:13] PROBLEM - PHP7 rendering on mw1348 is CRITICAL: CRITICAL - Socket timeout after 10 
seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [14:53:13] RECOVERY - High CPU load on API appserver on mw1317 is OK: OK - load average: 40.12, 37.01, 25.22 https://wikitech.wikimedia.org/wiki/Application_servers [14:53:13] PROBLEM - High CPU load on API appserver on mw1276 is CRITICAL: CRITICAL - load average: 82.48, 38.90, 22.80 https://wikitech.wikimedia.org/wiki/Application_servers [14:53:13] PROBLEM - High CPU load on API appserver on mw1288 is CRITICAL: CRITICAL - load average: 66.16, 37.14, 23.06 https://wikitech.wikimedia.org/wiki/Application_servers [14:53:21] PROBLEM - Apache HTTP on mw1348 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [14:53:37] PROBLEM - High CPU load on API appserver on mw1289 is CRITICAL: CRITICAL - load average: 71.53, 38.16, 22.80 https://wikitech.wikimedia.org/wiki/Application_servers [14:53:39] PROBLEM - HTTP availability for Varnish at eqiad on icinga1001 is CRITICAL: job=varnish-text site=eqiad https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 https://logstash.wikimedia.org/goto/60aa05b6e1129b475fbf4e7be868c67d [14:53:45] RECOVERY - High CPU load on API appserver on mw1226 is OK: OK - load average: 18.93, 23.17, 15.19 https://wikitech.wikimedia.org/wiki/Application_servers [14:53:51] RECOVERY - High CPU load on API appserver on mw1223 is OK: OK - load average: 25.08, 23.15, 15.09 https://wikitech.wikimedia.org/wiki/Application_servers [14:54:01] PROBLEM - High CPU load on API appserver on mw1282 is CRITICAL: CRITICAL - load average: 78.60, 41.54, 24.42 https://wikitech.wikimedia.org/wiki/Application_servers [14:54:01] PROBLEM - High CPU load on API appserver on mw1277 is CRITICAL: CRITICAL - load average: 84.55, 45.03, 25.83 https://wikitech.wikimedia.org/wiki/Application_servers [14:54:07] PROBLEM - 
recommendation_api endpoints health on scb1003 is CRITICAL: /{domain}/v1/caption/addition/{target} (Caption addition suggestions) timed out before a response was received: /{domain}/v1/article/creation/translation/{source}{/seed} (article.creation.translation - bad seed) timed out before a response was received: /{domain}/v1/caption/translation/from/{source}/to/{target} (Caption translation suggestions) timed out before [14:54:07] eceived: /{domain}/v1/description/translation/from/{source}/to/{target} (Description translation suggestions) timed out before a response was received: /{domain}/v1/description/addition/{target} (Description addition suggestions) timed out before a response was received: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received: /{domain}/v1/article/cre [14:54:07] eed} (article.creation.morelike - bad article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [14:54:09] PROBLEM - High CPU load on API appserver on mw1287 is CRITICAL: CRITICAL - load average: 71.35, 41.42, 24.58 https://wikitech.wikimedia.org/wiki/Application_servers [14:54:11] PROBLEM - recommendation_api endpoints health on scb2003 is CRITICAL: /{domain}/v1/caption/addition/{target} (Caption addition suggestions) timed out before a response was received: /{domain}/v1/article/creation/translation/{source}{/seed} (article.creation.translation - bad seed) timed out before a response was received: /{domain}/v1/caption/translation/from/{source}/to/{target} (Caption translation suggestions) timed out before [14:54:11] eceived: /{domain}/v1/description/translation/from/{source}/to/{target} (Description translation suggestions) timed out before a response was received: /{domain}/v1/description/addition/{target} (Description addition suggestions) timed out before a response was received: /{domain}/v1/article/creation/morelike/{seed} 
(article.creation.morelike - good article title) timed out before a response was received: /{domain}/v1/article/cre [14:54:11] eed} (article.creation.morelike - bad article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [14:54:11] PROBLEM - recommendation_api endpoints health on scb2004 is CRITICAL: /{domain}/v1/caption/addition/{target} (Caption addition suggestions) timed out before a response was received: /{domain}/v1/caption/translation/from/{source}/to/{target} (Caption translation suggestions) timed out before a response was received: /{domain}/v1/description/translation/from/{source}/to/{target} (Description translation suggestions) timed out befor [14:54:11] received: /{domain}/v1/description/addition/{target} (Description addition suggestions) timed out before a response was received: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - bad article title) timed out before a response was received https://wikitech.wikimedia.org/ [14:54:11] itoring/recommendation_api [14:54:11] PROBLEM - recommendation_api endpoints health on scb2006 is CRITICAL: /{domain}/v1/caption/translation/from/{source}/to/{target} (Caption translation suggestions) timed out before a response was received: /{domain}/v1/description/translation/from/{source}/to/{target} (Description translation suggestions) timed out before a response was received: /{domain}/v1/description/addition/{target} (Description addition suggestions) timed o [14:54:12] nse was received: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - bad article title) timed out before a response was received 
https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [14:54:12] PROBLEM - High CPU load on API appserver on mw1285 is CRITICAL: CRITICAL - load average: 70.11, 42.90, 25.15 https://wikitech.wikimedia.org/wiki/Application_servers [14:54:20] (03CR) 10CDanis: [C: 03+1] kvobject: fix some class property ordering [software/conftool] - 10https://gerrit.wikimedia.org/r/527565 (owner: 10Giuseppe Lavagetto) [14:54:25] PROBLEM - Apache HTTP on mw1346 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [14:54:31] PROBLEM - recommendation_api endpoints health on scb2005 is CRITICAL: /{domain}/v1/caption/addition/{target} (Caption addition suggestions) timed out before a response was received: /{domain}/v1/article/creation/translation/{source}{/seed} (article.creation.translation - bad seed) timed out before a response was received: /{domain}/v1/caption/translation/from/{source}/to/{target} (Caption translation suggestions) timed out before [14:54:31] eceived: /{domain}/v1/description/translation/from/{source}/to/{target} (Description translation suggestions) timed out before a response was received: /{domain}/v1/description/addition/{target} (Description addition suggestions) timed out before a response was received: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received https://wikitech.wikimedi [14:54:31] es/Monitoring/recommendation_api [14:54:31] PROBLEM - HTTP availability for Nginx -SSL terminators- at esams on icinga1001 is CRITICAL: cluster=cache_text site=esams https://wikitech.wikimedia.org/wiki/Cache_TLS_termination https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [14:54:31] PROBLEM - HTTP availability for Varnish at esams on icinga1001 is CRITICAL: job=varnish-text site=esams https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts 
https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 https://logstash.wikimedia.org/goto/60aa05b6e1129b475fbf4e7be868c67d [14:54:33] PROBLEM - termbox eqiad on termbox.svc.eqiad.wmnet is CRITICAL: /termbox (get rendered termbox) is CRITICAL: Test get rendered termbox returned the unexpected status 500 (expecting: 200) https://wikitech.wikimedia.org/wiki/WMDE/Wikidata/SSR_Service [14:54:33] PROBLEM - HHVM rendering on mw1339 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [14:54:35] PROBLEM - Apache HTTP on mw1339 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [14:54:35] PROBLEM - Apache HTTP on mw1340 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [14:54:43] RECOVERY - PHP7 rendering on mw1347 is OK: HTTP OK: HTTP/1.1 200 OK - 81418 bytes in 2.133 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [14:54:43] PROBLEM - HTTP availability for Nginx -SSL terminators- at codfw on icinga1001 is CRITICAL: cluster=cache_text site=codfw https://wikitech.wikimedia.org/wiki/Cache_TLS_termination https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [14:54:44] uhm [14:54:49] PROBLEM - recommendation_api endpoints health on scb2001 is CRITICAL: /{domain}/v1/caption/addition/{target} (Caption addition suggestions) timed out before a response was received: /{domain}/v1/caption/translation/from/{source}/to/{target} (Caption translation suggestions) timed out before a response was received: /{domain}/v1/description/translation/from/{source}/to/{target} (Description translation suggestions) timed out befor [14:54:49] received: /{domain}/v1/description/addition/{target} (Description addition suggestions) timed out before a response was received: 
/{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [14:54:52] wow [14:54:55] PROBLEM - recommendation_api endpoints health on scb1002 is CRITICAL: /{domain}/v1/caption/addition/{target} (Caption addition suggestions) timed out before a response was received: /{domain}/v1/caption/translation/from/{source}/to/{target} (Caption translation suggestions) timed out before a response was received: /{domain}/v1/description/translation/from/{source}/to/{target} (Description translation suggestions) timed out befor [14:54:55] received: /{domain}/v1/description/addition/{target} (Description addition suggestions) timed out before a response was received: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - bad article title) timed out before a response was received https://wikitech.wikimedia.org/ [14:54:55] itoring/recommendation_api [14:55:01] PROBLEM - recommendation_api endpoints health on scb1001 is CRITICAL: /{domain}/v1/caption/addition/{target} (Caption addition suggestions) timed out before a response was received: /{domain}/v1/description/translation/from/{source}/to/{target} (Description translation suggestions) timed out before a response was received: /{domain}/v1/description/addition/{target} (Description addition suggestions) timed out before a response wa [14:55:01] ain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - bad article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [14:55:01] PROBLEM - restbase endpoints health 
on restbase1022 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [14:55:02] PROBLEM - HTTP availability for Nginx -SSL terminators- at eqiad on icinga1001 is CRITICAL: cluster=cache_text site=eqiad https://wikitech.wikimedia.org/wiki/Cache_TLS_termination https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [14:55:02] PROBLEM - restbase endpoints health on restbase2018 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received: /en.wikipedia.org/v1/page/graph/png/{title}/{revision}/{graph_id} (Get a graph from Graphoid) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [14:55:03] PROBLEM - Nginx local proxy to apache on mw1344 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [14:55:05] PROBLEM - HTTP availability for Nginx -SSL terminators- at eqsin on icinga1001 is CRITICAL: cluster=cache_text site=eqsin https://wikitech.wikimedia.org/wiki/Cache_TLS_termination https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [14:55:07] RECOVERY - High CPU load on API appserver on mw1231 is OK: OK - load average: 18.98, 24.14, 17.01 https://wikitech.wikimedia.org/wiki/Application_servers [14:55:09] PROBLEM - recommendation_api endpoints health on scb1004 is CRITICAL: /{domain}/v1/caption/addition/{target} (Caption addition suggestions) timed out before a response was received: /{domain}/v1/article/creation/translation/{source}{/seed} (article.creation.translation - bad seed) timed out before a response was received: /{domain}/v1/caption/translation/from/{source}/to/{target} (Caption translation 
suggestions) timed out before [14:55:09] eceived: /{domain}/v1/description/addition/{target} (Description addition suggestions) timed out before a response was received: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - bad article title) timed out before a response was received https://wikitech.wikimedia.org/w [14:55:09] toring/recommendation_api [14:55:19] PROBLEM - restbase endpoints health on restbase-dev1004 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received: /en.wikipedia.org/v1/page/graph/png/{title}/{revision}/{graph_id} (Get a graph from Graphoid) is CRITICAL: Test Get a graph from Graphoid returned the unexpected status 400 (expecting: 200) https://wikitech.wikimedia [14:55:19] s/Monitoring/restbase [14:55:19] PROBLEM - mobileapps endpoints health on scb2005 is CRITICAL: /{domain}/v1/page/metadata/{title} (retrieve extended metadata for Video article on English Wikipedia) timed out before a response was received: /{domain}/v1/media/image/featured/{year}/{month}/{day} (retrieve featured image data for April 29, 2016) timed out before a response was received: /{domain}/v1/page/mobile-sections/{title} (retrieve test page via mobile-sectio [14:55:19] ore a response was received: /{domain}/v1/page/mobile-html/{title} (Get page content HTML for test page) timed out before a response was received: /{domain}/v1/page/random/title (retrieve a random article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps [14:55:21] PROBLEM - restbase endpoints health on restbase1017 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received: 
/en.wikipedia.org/v1/page/graph/png/{title}/{revision}/{graph_id} (Get a graph from Graphoid) is CRITICAL: Test Get a graph from Graphoid returned the unexpected status 400 (expecting: 200) https://wikitech.wikimedia.org [14:55:21] nitoring/restbase [14:55:21] PROBLEM - restbase endpoints health on restbase2013 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received: /en.wikipedia.org/v1/page/title/{title} (Get rev by title from storage) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [14:55:22] PROBLEM - restbase endpoints health on restbase2020 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [14:55:22] PROBLEM - Nginx local proxy to apache on mw1339 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [14:55:22] PROBLEM - mobileapps endpoints health on scb2006 is CRITICAL: /{domain}/v1/page/media/{title} (Get media in test page) timed out before a response was received: /{domain}/v1/media/image/featured/{year}/{month}/{day} (retrieve featured image data for April 29, 2016) timed out before a response was received: /{domain}/v1/page/mobile-html/{title} (Get page content HTML for test page) timed out before a response was received https:// [14:55:22] a.org/wiki/Services/Monitoring/mobileapps [14:55:22] PROBLEM - Restbase LVS codfw on restbase.svc.codfw.wmnet is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received: /en.wikipedia.org/v1/page/title/{title} (Get rev by title from storage) timed out before a response was received https://wikitech.wikimedia.org/wiki/RESTBase 
[14:55:27] PROBLEM - mobileapps endpoints health on scb1003 is CRITICAL: /{domain}/v1/page/media/{title} (Get media in test page) timed out before a response was received: /{domain}/v1/page/metadata/{title} (retrieve extended metadata for Video article on English Wikipedia) timed out before a response was received: /{domain}/v1/media/image/featured/{year}/{month}/{day} (retrieve featured image data for April 29, 2016) timed out before a res [14:55:27] d: /{domain}/v1/page/mobile-html/{title} (Get page content HTML for test page) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps [14:55:29] PROBLEM - restbase endpoints health on restbase1025 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received: /en.wikipedia.org/v1/transform/wikitext/to/html/{title} (Transform wikitext to html) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [14:55:29] PROBLEM - mobileapps endpoints health on scb2004 is CRITICAL: /{domain}/v1/page/media/{title} (Get media in test page) timed out before a response was received: /{domain}/v1/page/summary/{title} (Get summary for test page) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps [14:55:29] PROBLEM - restbase endpoints health on restbase1023 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received: /en.wikipedia.org/v1/transform/wikitext/to/html/{title} (Transform wikitext to html) timed out before a response was received: /en.wikipedia.org/v1/page/title/{title} (Get rev by title from storage) timed out before a re [14:55:30] ed https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [14:55:30] PROBLEM - restbase endpoints health on restbase1021 is CRITICAL: 
/en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received: /en.wikipedia.org/v1/page/title/{title} (Get rev by title from storage) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [14:55:32] PROBLEM - restbase endpoints health on restbase1019 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received: /en.wikipedia.org/v1/page/title/{title} (Get rev by title from storage) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [14:55:33] PROBLEM - restbase endpoints health on restbase2012 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [14:55:33] PROBLEM - restbase endpoints health on restbase2009 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [14:55:33] PROBLEM - restbase endpoints health on restbase2017 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received: /en.wikipedia.org/v1/page/title/{title} (Get rev by title from storage) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [14:55:33] PROBLEM - restbase endpoints health on restbase2015 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase 
[14:55:43] PROBLEM - mobileapps endpoints health on scb1001 is CRITICAL: /{domain}/v1/page/media/{title} (Get media in test page) timed out before a response was received: /{domain}/v1/media/image/featured/{year}/{month}/{day} (retrieve featured image data for April 29, 2016) timed out before a response was received: /{domain}/v1/page/mobile-html/{title} (Get page content HTML for test page) timed out before a response was received: /{domai [14:55:43] y/{title} (Get summary for test page) timed out before a response was received: /{domain}/v1/transform/html/to/mobile-html/{title} (Get preview mobile HTML for test page) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps [14:55:43] PROBLEM - Cxserver LVS eqiad on cxserver.svc.eqiad.wmnet is CRITICAL: /v2/translate/{from}/{to}{/provider} (Machine translate an HTML fragment using TestClient, adapt the links to target language wiki.) timed out before a response was received https://wikitech.wikimedia.org/wiki/CX [14:55:45] PROBLEM - HHVM rendering on mw1226 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [14:55:45] PROBLEM - PyBal backends health check on lvs1015 is CRITICAL: PYBAL CRITICAL - CRITICAL - api_80: Servers mw1346.eqiad.wmnet, mw1344.eqiad.wmnet, mw1226.eqiad.wmnet, mw1233.eqiad.wmnet, mw1340.eqiad.wmnet, mw1343.eqiad.wmnet, mw1225.eqiad.wmnet, mw1345.eqiad.wmnet, mw1221.eqiad.wmnet, mw1317.eqiad.wmnet, mw1235.eqiad.wmnet, mw1342.eqiad.wmnet, mw1315.eqiad.wmnet, mw1339.eqiad.wmnet are marked down but pooled: api-https_443: Serve [14:55:45] mnet, mw1346.eqiad.wmnet, mw1344.eqiad.wmnet, mw1227.eqiad.wmnet, mw1226.eqiad.wmnet, mw1233.eqiad.wmnet, mw1222.eqiad.wmnet, mw1340.eqiad.wmnet, mw1343.eqiad.wmnet, mw1225.eqiad.wmnet, mw1347.eqiad.wmnet, mw1345.eqiad.wmnet, mw1221.eqiad.wmnet, mw1235.eqiad.wmnet, mw1342.eqiad.wmnet, mw1315.eqiad.wmnet, mw1341.eqiad.wmnet, mw1339.eqiad.wmnet are marked 
down but pooled https://wikitech.wikimedia.org/wiki/PyBal [14:55:45] PROBLEM - restbase endpoints health on restbase2010 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [14:55:47] PROBLEM - restbase endpoints health on restbase-dev1005 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [14:55:49] PROBLEM - Apache HTTP on mw1315 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [14:55:51] PROBLEM - Apache HTTP on mw1342 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [14:55:51] PROBLEM - HTTP availability for Nginx -SSL terminators- at ulsfo on icinga1001 is CRITICAL: cluster=cache_text site=ulsfo https://wikitech.wikimedia.org/wiki/Cache_TLS_termination https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [14:55:55] PROBLEM - HHVM rendering on mw1343 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [14:55:57] PROBLEM - mobileapps endpoints health on scb2001 is CRITICAL: /{domain}/v1/page/media/{title} (Get media in test page) timed out before a response was received: /{domain}/v1/page/metadata/{title} (retrieve extended metadata for Video article on English Wikipedia) timed out before a response was received: /{domain}/v1/media/image/featured/{year}/{month}/{day} (retrieve featured image data for April 29, 2016) timed out before a res [14:55:57] d: /{domain}/v1/transform/html/to/mobile-html/{title} (Get preview mobile HTML for test page) timed out before a response was received 
https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps [14:55:58] PROBLEM - HHVM rendering on mw1340 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [14:55:59] PROBLEM - recommendation_api endpoints health on scb2002 is CRITICAL: /{domain}/v1/caption/translation/from/{source}/to/{target} (Caption translation suggestions) timed out before a response was received: /{domain}/v1/description/translation/from/{source}/to/{target} (Description translation suggestions) timed out before a response was received: /{domain}/v1/description/addition/{target} (Description addition suggestions) timed o [14:55:59] nse was received: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [14:55:59] PROBLEM - Nginx local proxy to apache on mw1345 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [14:55:59] PROBLEM - Nginx local proxy to apache on mw1342 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [14:56:01] PROBLEM - restbase endpoints health on restbase1026 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received: /en.wikipedia.org/v1/page/title/{title} (Get rev by title from storage) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [14:56:01] PROBLEM - restbase endpoints health on restbase-dev1006 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [14:56:02] PROBLEM - restbase endpoints health on 
restbase2016 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [14:56:02] PROBLEM - restbase endpoints health on restbase2019 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [14:56:03] PROBLEM - Apache HTTP on mw1343 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [14:56:03] PROBLEM - Mobileapps LVS eqiad on mobileapps.svc.eqiad.wmnet is CRITICAL: /{domain}/v1/page/media/{title} (Get media in test page) timed out before a response was received: /{domain}/v1/page/metadata/{title} (retrieve extended metadata for Video article on English Wikipedia) timed out before a response was received: /{domain}/v1/media/image/featured/{year}/{month}/{day} (retrieve featured image data for April 29, 2016) timed out [14:56:03] was received https://wikitech.wikimedia.org/wiki/Mobileapps_%28service%29 [14:56:07] PROBLEM - PyBal IPVS diff check on lvs1016 is CRITICAL: CRITICAL: Hosts in IPVS but unknown to PyBal: set([mw1226.eqiad.wmnet, mw1348.eqiad.wmnet, mw1233.eqiad.wmnet, mw1346.eqiad.wmnet, mw1315.eqiad.wmnet, mw1221.eqiad.wmnet, mw1342.eqiad.wmnet, mw1340.eqiad.wmnet, mw1345.eqiad.wmnet, mw1344.eqiad.wmnet, mw1343.eqiad.wmnet, mw1225.eqiad.wmnet, mw1341.eqiad.wmnet, mw1317.eqiad.wmnet, mw1347.eqiad.wmnet, mw1339.eqiad.wmnet, mw1313 [14:56:07] tps://wikitech.wikimedia.org/wiki/PyBal [14:56:11] PROBLEM - Nginx local proxy to apache on mw1340 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [14:56:11] PROBLEM - HTTP availability for Varnish at ulsfo on icinga1001 is CRITICAL: job=varnish-text site=ulsfo 
https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 https://logstash.wikimedia.org/goto/60aa05b6e1129b475fbf4e7be868c67d [14:56:11] PROBLEM - Mobileapps LVS codfw on mobileapps.svc.codfw.wmnet is CRITICAL: /{domain}/v1/page/media/{title} (Get media in test page) timed out before a response was received: /{domain}/v1/media/image/featured/{year}/{month}/{day} (retrieve featured image data for April 29, 2016) timed out before a response was received: /{domain}/v1/page/mobile-sections/{title} (retrieve test page via mobile-sections) timed out before a response wa [14:56:11] ain}/v1/page/summary/{title} (Get summary for test page) timed out before a response was received: /{domain}/v1/transform/html/to/mobile-html/{title} (Get preview mobile HTML for test page) timed out before a response was received https://wikitech.wikimedia.org/wiki/Mobileapps_%28service%29 [14:56:11] PROBLEM - Nginx local proxy to apache on mw1343 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [14:56:13] RECOVERY - High CPU load on API appserver on mw1227 is OK: OK - load average: 18.04, 23.25, 17.85 https://wikitech.wikimedia.org/wiki/Application_servers [14:56:19] PROBLEM - HHVM rendering on mw1315 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [14:56:19] PROBLEM - restbase endpoints health on restbase1020 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [14:56:20] PROBLEM - restbase endpoints health on restbase1027 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received: 
/en.wikipedia.org/v1/transform/wikitext/to/html/{title} (Transform wikitext to html) timed out before a response was received: /en.wikipedia.org/v1/page/title/{title} (Get rev by title from storage) timed out before a re [14:56:20] ed https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [14:56:25] PROBLEM - restbase endpoints health on restbase1024 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [14:56:25] PROBLEM - Nginx local proxy to apache on mw1346 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [14:56:25] PROBLEM - mobileapps endpoints health on scb2002 is CRITICAL: /{domain}/v1/page/media/{title} (Get media in test page) timed out before a response was received: /{domain}/v1/media/image/featured/{year}/{month}/{day} (retrieve featured image data for April 29, 2016) timed out before a response was received: /{domain}/v1/page/summary/{title} (Get summary for test page) timed out before a response was received https://wikitech.wikim [14:56:25] vices/Monitoring/mobileapps [14:56:25] PROBLEM - restbase endpoints health on restbase1018 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [14:56:27] PROBLEM - Restbase edge esams on text-lb.esams.wikimedia.org is CRITICAL: /api/rest_v1/transform/wikitext/to/html/{title} (Transform wikitext to html) timed out before a response was received https://wikitech.wikimedia.org/wiki/RESTBase [14:56:27] PROBLEM - proton endpoints health on proton1002 is CRITICAL: /{domain}/v1/pdf/{title}/{format}/{type} (Print the Foo page from en.wp.org in letter format) timed out before a response was received: 
/{domain}/v1/pdf/{title}/{format}/{type} (Respond file not found for a nonexistent title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/proton [14:56:29] <_joe_> and indeed [14:56:37] PROBLEM - PyBal IPVS diff check on lvs1015 is CRITICAL: CRITICAL: Hosts in IPVS but unknown to PyBal: set([mw1224.eqiad.wmnet, mw1226.eqiad.wmnet, mw1348.eqiad.wmnet, mw1233.eqiad.wmnet, mw1346.eqiad.wmnet, mw1221.eqiad.wmnet, mw1342.eqiad.wmnet, mw1340.eqiad.wmnet, mw1344.eqiad.wmnet, mw1343.eqiad.wmnet, mw1225.eqiad.wmnet, mw1341.eqiad.wmnet, mw1315.eqiad.wmnet, mw1347.eqiad.wmnet, mw1345.eqiad.wmnet, mw1339.eqiad.wmnet]) https [14:56:37] edia.org/wiki/PyBal [14:56:37] PROBLEM - PyBal backends health check on lvs1016 is CRITICAL: PYBAL CRITICAL - CRITICAL - api-https_443: Servers mw1346.eqiad.wmnet, mw1344.eqiad.wmnet, mw1226.eqiad.wmnet, mw1348.eqiad.wmnet, mw1233.eqiad.wmnet, mw1340.eqiad.wmnet, mw1343.eqiad.wmnet, mw1225.eqiad.wmnet, mw1347.eqiad.wmnet, mw1345.eqiad.wmnet, mw1223.eqiad.wmnet, mw1221.eqiad.wmnet, mw1317.eqiad.wmnet, mw1224.eqiad.wmnet, mw1342.eqiad.wmnet, mw1341.eqiad.wmnet, [14:56:37] t are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [14:56:37] PROBLEM - HHVM rendering on mw1344 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [14:56:37] PROBLEM - restbase endpoints health on restbase2011 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received: /en.wikipedia.org/v1/transform/wikitext/to/html/{title} (Transform wikitext to html) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [14:56:37] PROBLEM - HHVM rendering on mw1346 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [14:56:39] RECOVERY - High CPU load on API 
appserver on mw1233 is OK: OK - load average: 19.59, 23.26, 17.38 https://wikitech.wikimedia.org/wiki/Application_servers [14:56:41] PROBLEM - HTTP availability for Varnish at eqsin on icinga1001 is CRITICAL: job=varnish-text site=eqsin https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 https://logstash.wikimedia.org/goto/60aa05b6e1129b475fbf4e7be868c67d [14:56:45] PROBLEM - proton endpoints health on proton1001 is CRITICAL: /{domain}/v1/pdf/{title}/{format}/{type} (Print the Foo page from en.wp.org in letter format) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/proton [14:56:49] PROBLEM - mobileapps endpoints health on scb1002 is CRITICAL: /{domain}/v1/media/image/featured/{year}/{month}/{day} (retrieve featured image data for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps [14:56:51] RECOVERY - restbase endpoints health on restbase1022 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [14:56:51] RECOVERY - restbase endpoints health on restbase2018 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [14:57:05] RECOVERY - Nginx local proxy to apache on mw1339 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 618 bytes in 0.666 second response time https://wikitech.wikimedia.org/wiki/Application_servers [14:57:09] PROBLEM - High CPU load on API appserver on mw1234 is CRITICAL: CRITICAL - load average: 55.00, 37.27, 22.62 https://wikitech.wikimedia.org/wiki/Application_servers [14:57:11] RECOVERY - restbase endpoints health on restbase-dev1004 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [14:57:13] RECOVERY - restbase endpoints health on restbase2013 is OK: All endpoints are 
healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [14:57:13] RECOVERY - mobileapps endpoints health on scb2006 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps [14:57:15] PROBLEM - mobileapps endpoints health on scb2003 is CRITICAL: /{domain}/v1/page/media/{title} (Get media in test page) timed out before a response was received: /{domain}/v1/page/summary/{title} (Get summary for test page) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps [14:57:15] RECOVERY - mobileapps endpoints health on scb1003 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps [14:57:17] RECOVERY - restbase endpoints health on restbase2009 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [14:57:17] RECOVERY - restbase endpoints health on restbase1021 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [14:57:17] RECOVERY - restbase endpoints health on restbase2015 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [14:57:19] RECOVERY - mobileapps endpoints health on scb2004 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps [14:57:20] RECOVERY - restbase endpoints health on restbase1025 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [14:57:20] RECOVERY - restbase endpoints health on restbase2017 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [14:57:20] RECOVERY - restbase endpoints health on restbase1023 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [14:57:21] RECOVERY - restbase endpoints health on restbase1019 is OK: All endpoints are healthy 
https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [14:57:23] RECOVERY - HHVM rendering on mw1226 is OK: HTTP OK: HTTP/1.1 200 OK - 81370 bytes in 0.810 second response time https://wikitech.wikimedia.org/wiki/Application_servers [14:57:25] RECOVERY - Cxserver LVS eqiad on cxserver.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/CX [14:57:25] PROBLEM - puppet last run on db1131 is CRITICAL: CRITICAL: Failed to apply catalog, zero resources tracked by Puppet. Could be an interrupted request or a dependency cycle. https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [14:57:25] RECOVERY - mobileapps endpoints health on scb1001 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps [14:57:27] RECOVERY - Apache HTTP on mw1315 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 617 bytes in 0.393 second response time https://wikitech.wikimedia.org/wiki/Application_servers [14:57:29] PROBLEM - High CPU load on API appserver on mw1232 is CRITICAL: CRITICAL - load average: 60.02, 39.12, 23.38 https://wikitech.wikimedia.org/wiki/Application_servers [14:57:29] PROBLEM - HTTP availability for Varnish at codfw on icinga1001 is CRITICAL: job=varnish-text site=codfw https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 https://logstash.wikimedia.org/goto/60aa05b6e1129b475fbf4e7be868c67d [14:57:29] RECOVERY - Apache HTTP on mw1342 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 617 bytes in 0.558 second response time https://wikitech.wikimedia.org/wiki/Application_servers [14:57:33] RECOVERY - restbase endpoints health on restbase-dev1005 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [14:57:33] RECOVERY - HHVM rendering on mw1343 is OK: HTTP OK: HTTP/1.1 200 OK - 81370 bytes in 0.758 second response time 
https://wikitech.wikimedia.org/wiki/Application_servers [14:57:33] RECOVERY - PyBal backends health check on lvs1015 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [14:57:33] RECOVERY - HHVM rendering on mw1340 is OK: HTTP OK: HTTP/1.1 200 OK - 81370 bytes in 0.149 second response time https://wikitech.wikimedia.org/wiki/Application_servers [14:57:37] RECOVERY - Nginx local proxy to apache on mw1345 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 618 bytes in 0.502 second response time https://wikitech.wikimedia.org/wiki/Application_servers [14:57:38] 3~/win 30 [14:57:39] RECOVERY - Nginx local proxy to apache on mw1342 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 618 bytes in 0.872 second response time https://wikitech.wikimedia.org/wiki/Application_servers [14:57:39] RECOVERY - mobileapps endpoints health on scb2001 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps [14:57:41] RECOVERY - Apache HTTP on mw1343 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 617 bytes in 0.344 second response time https://wikitech.wikimedia.org/wiki/Application_servers [14:57:45] RECOVERY - restbase endpoints health on restbase1026 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [14:57:45] RECOVERY - restbase endpoints health on restbase2016 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [14:57:47] RECOVERY - restbase endpoints health on restbase-dev1006 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [14:57:47] RECOVERY - Nginx local proxy to apache on mw1340 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 617 bytes in 0.076 second response time https://wikitech.wikimedia.org/wiki/Application_servers [14:57:47] RECOVERY - restbase endpoints health on restbase2019 is OK: All endpoints are healthy 
https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [14:57:47] RECOVERY - Mobileapps LVS eqiad on mobileapps.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Mobileapps_%28service%29 [14:57:49] RECOVERY - Nginx local proxy to apache on mw1343 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 618 bytes in 0.408 second response time https://wikitech.wikimedia.org/wiki/Application_servers [14:57:53] RECOVERY - Mobileapps LVS codfw on mobileapps.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Mobileapps_%28service%29 [14:57:53] RECOVERY - Apache HTTP on mw1346 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 616 bytes in 0.092 second response time https://wikitech.wikimedia.org/wiki/Application_servers [14:57:55] RECOVERY - High CPU load on API appserver on mw1283 is OK: OK - load average: 17.51, 30.81, 23.89 https://wikitech.wikimedia.org/wiki/Application_servers [14:57:57] RECOVERY - HHVM rendering on mw1315 is OK: HTTP OK: HTTP/1.1 200 OK - 81370 bytes in 0.701 second response time https://wikitech.wikimedia.org/wiki/Application_servers [14:58:01] RECOVERY - Nginx local proxy to apache on mw1346 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 618 bytes in 0.225 second response time https://wikitech.wikimedia.org/wiki/Application_servers [14:58:03] RECOVERY - restbase endpoints health on restbase1027 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [14:58:03] RECOVERY - Apache HTTP on mw1339 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 616 bytes in 0.049 second response time https://wikitech.wikimedia.org/wiki/Application_servers [14:58:03] RECOVERY - Apache HTTP on mw1340 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 616 bytes in 0.047 second response time https://wikitech.wikimedia.org/wiki/Application_servers [14:58:03] RECOVERY - HHVM rendering on mw1339 is OK: HTTP OK: HTTP/1.1 200 OK - 81370 bytes in 0.348 second response 
time https://wikitech.wikimedia.org/wiki/Application_servers [14:58:05] RECOVERY - restbase endpoints health on restbase1024 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [14:58:05] RECOVERY - restbase endpoints health on restbase1018 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [14:58:05] RECOVERY - mobileapps endpoints health on scb2002 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps [14:58:07] RECOVERY - proton endpoints health on proton1002 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/proton [14:58:07] RECOVERY - recommendation_api endpoints health on scb2005 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [14:58:07] RECOVERY - Restbase edge esams on text-lb.esams.wikimedia.org is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/RESTBase [14:58:11] RECOVERY - termbox eqiad on termbox.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/WMDE/Wikidata/SSR_Service [14:58:11] RECOVERY - HHVM rendering on mw1344 is OK: HTTP OK: HTTP/1.1 200 OK - 81370 bytes in 0.213 second response time https://wikitech.wikimedia.org/wiki/Application_servers [14:58:11] RECOVERY - HHVM rendering on mw1346 is OK: HTTP OK: HTTP/1.1 200 OK - 81370 bytes in 0.304 second response time https://wikitech.wikimedia.org/wiki/Application_servers [14:58:17] RECOVERY - restbase endpoints health on restbase2011 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [14:58:17] PROBLEM - High CPU load on API appserver on mw1230 is CRITICAL: CRITICAL - load average: 71.13, 43.74, 24.78 https://wikitech.wikimedia.org/wiki/Application_servers [14:58:17] PROBLEM - High CPU load on API appserver on mw1229 is CRITICAL: CRITICAL - load average: 53.84, 32.40, 
19.93 https://wikitech.wikimedia.org/wiki/Application_servers [14:58:17] PROBLEM - High CPU load on API appserver on mw1286 is CRITICAL: CRITICAL - load average: 81.75, 51.28, 32.74 https://wikitech.wikimedia.org/wiki/Application_servers [14:58:19] RECOVERY - PyBal backends health check on lvs1016 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [14:58:21] PROBLEM - High CPU load on API appserver on mw1290 is CRITICAL: CRITICAL - load average: 80.01, 52.55, 32.40 https://wikitech.wikimedia.org/wiki/Application_servers [14:58:21] RECOVERY - PHP7 rendering on mw1348 is OK: HTTP OK: HTTP/1.1 200 OK - 81411 bytes in 0.395 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [14:58:23] PROBLEM - High CPU load on API appserver on mw1235 is CRITICAL: CRITICAL - load average: 53.41, 31.35, 19.29 https://wikitech.wikimedia.org/wiki/Application_servers [14:58:25] RECOVERY - proton endpoints health on proton1001 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/proton [14:58:25] RECOVERY - mobileapps endpoints health on scb1002 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps [14:58:27] RECOVERY - recommendation_api endpoints health on scb1002 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [14:58:29] PROBLEM - puppet last run on cp3030 is CRITICAL: CRITICAL: Failed to apply catalog, zero resources tracked by Puppet. Could be an interrupted request or a dependency cycle. 
https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [14:58:29] RECOVERY - Apache HTTP on mw1348 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 658 bytes in 0.041 second response time https://wikitech.wikimedia.org/wiki/Application_servers [14:58:31] PROBLEM - High CPU load on API appserver on mw1288 is CRITICAL: CRITICAL - load average: 66.89, 49.08, 31.52 https://wikitech.wikimedia.org/wiki/Application_servers [14:58:33] RECOVERY - Nginx local proxy to apache on mw1344 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 617 bytes in 0.055 second response time https://wikitech.wikimedia.org/wiki/Application_servers [14:58:33] RECOVERY - recommendation_api endpoints health on scb1001 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [14:58:43] RECOVERY - recommendation_api endpoints health on scb1004 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [14:58:43] PROBLEM - High CPU load on API appserver on mw1231 is CRITICAL: CRITICAL - load average: 55.51, 34.38, 22.10 https://wikitech.wikimedia.org/wiki/Application_servers [14:58:53] RECOVERY - mobileapps endpoints health on scb2005 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps [14:58:53] PROBLEM - High CPU load on API appserver on mw1313 is CRITICAL: CRITICAL - load average: 80.33, 49.34, 32.47 https://wikitech.wikimedia.org/wiki/Application_servers [14:58:53] RECOVERY - restbase endpoints health on restbase1017 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [14:58:53] RECOVERY - Restbase LVS codfw on restbase.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/RESTBase [14:58:55] RECOVERY - mobileapps endpoints health on scb2003 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps [14:59:01] RECOVERY - 
restbase endpoints health on restbase2012 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [14:59:07] RECOVERY - restbase endpoints health on restbase2010 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [14:59:09] PROBLEM - High CPU load on API appserver on mw1223 is CRITICAL: CRITICAL - load average: 55.92, 34.70, 21.54 https://wikitech.wikimedia.org/wiki/Application_servers [14:59:13] PROBLEM - High CPU load on API appserver on mw1284 is CRITICAL: CRITICAL - load average: 45.09, 44.60, 30.00 https://wikitech.wikimedia.org/wiki/Application_servers [14:59:17] PROBLEM - puppet last run on dbproxy1003 is CRITICAL: CRITICAL: Failed to apply catalog, zero resources tracked by Puppet. Could be an interrupted request or a dependency cycle. https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [14:59:19] RECOVERY - recommendation_api endpoints health on scb1003 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [14:59:21] RECOVERY - recommendation_api endpoints health on scb2002 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [14:59:23] RECOVERY - recommendation_api endpoints health on scb2003 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [14:59:23] RECOVERY - recommendation_api endpoints health on scb2004 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [14:59:25] RECOVERY - recommendation_api endpoints health on scb2006 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [14:59:39] PROBLEM - High CPU load on API appserver on mw1227 is CRITICAL: CRITICAL - load average: 56.93, 35.67, 23.47 https://wikitech.wikimedia.org/wiki/Application_servers [14:59:41] RECOVERY - 
restbase endpoints health on restbase1020 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [14:59:51] PROBLEM - puppet last run on ms-be1033 is CRITICAL: CRITICAL: Failed to apply catalog, zero resources tracked by Puppet. Could be an interrupted request or a dependency cycle. https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [15:00:01] RECOVERY - High CPU load on API appserver on mw1278 is OK: OK - load average: 14.01, 31.85, 27.47 https://wikitech.wikimedia.org/wiki/Application_servers [15:00:03] RECOVERY - recommendation_api endpoints health on scb2001 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [15:00:09] RECOVERY - High CPU load on API appserver on mw1276 is OK: OK - load average: 12.87, 27.41, 25.05 https://wikitech.wikimedia.org/wiki/Application_servers [15:00:33] RECOVERY - restbase endpoints health on restbase2020 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [15:00:47] RECOVERY - HTTP availability for Varnish at codfw on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 https://logstash.wikimedia.org/goto/60aa05b6e1129b475fbf4e7be868c67d [15:00:53] RECOVERY - High CPU load on API appserver on mw1281 is OK: OK - load average: 14.40, 27.61, 26.00 https://wikitech.wikimedia.org/wiki/Application_servers [15:00:55] RECOVERY - High CPU load on API appserver on mw1282 is OK: OK - load average: 14.49, 29.14, 26.60 https://wikitech.wikimedia.org/wiki/Application_servers [15:00:55] RECOVERY - High CPU load on API appserver on mw1277 is OK: OK - load average: 13.83, 28.51, 26.75 https://wikitech.wikimedia.org/wiki/Application_servers [15:01:03] RECOVERY - High CPU load on API appserver on mw1287 is OK: OK - load average: 13.54, 28.14, 25.99 https://wikitech.wikimedia.org/wiki/Application_servers [15:01:27] RECOVERY - High CPU load on API appserver on mw1279 is OK: OK - load average: 13.75, 28.14, 27.18 https://wikitech.wikimedia.org/wiki/Application_servers [15:01:27] RECOVERY - High CPU load on API appserver on mw1280 is OK: OK - load average: 15.60, 30.46, 28.08 https://wikitech.wikimedia.org/wiki/Application_servers [15:01:31] RECOVERY - High CPU load on API appserver on mw1314 is OK: OK - load average: 19.28, 35.43, 31.35 https://wikitech.wikimedia.org/wiki/Application_servers [15:01:39] RECOVERY - HTTP availability for Varnish at eqsin on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 https://logstash.wikimedia.org/goto/60aa05b6e1129b475fbf4e7be868c67d [15:01:45] RECOVERY - PyBal IPVS diff check on lvs1016 is OK: OK: no difference between hosts in IPVS/PyBal https://wikitech.wikimedia.org/wiki/PyBal [15:01:55] RECOVERY - HTTP availability for Nginx -SSL terminators- at eqiad on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Cache_TLS_termination https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [15:01:57] RECOVERY - HTTP availability for Nginx -SSL terminators- at eqsin on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Cache_TLS_termination https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [15:02:07] RECOVERY - High CPU load on API appserver on mw1313 is OK: OK - load average: 20.57, 34.31, 29.72 https://wikitech.wikimedia.org/wiki/Application_servers [15:02:07] RECOVERY - High CPU load on API appserver on mw1289 is OK: OK - load average: 14.89, 30.39, 28.46 https://wikitech.wikimedia.org/wiki/Application_servers [15:02:07] RECOVERY - High CPU load on API appserver on mw1234 is OK: OK - load average: 10.37, 23.24, 21.12 https://wikitech.wikimedia.org/wiki/Application_servers [15:02:09] RECOVERY - HTTP availability for Varnish at eqiad on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 https://logstash.wikimedia.org/goto/60aa05b6e1129b475fbf4e7be868c67d [15:02:13] RECOVERY - PyBal IPVS diff check on lvs1015 is OK: OK: no difference between hosts in IPVS/PyBal https://wikitech.wikimedia.org/wiki/PyBal [15:02:25] RECOVERY - High CPU load on API appserver on mw1284 is OK: OK - load average: 14.48, 29.92, 26.99 https://wikitech.wikimedia.org/wiki/Application_servers [15:02:41] RECOVERY - High CPU load on API appserver on mw1285 is OK: OK - load average: 14.11, 31.37, 29.96 https://wikitech.wikimedia.org/wiki/Application_servers [15:02:47] RECOVERY - HTTP availability for Varnish at ulsfo on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 https://logstash.wikimedia.org/goto/60aa05b6e1129b475fbf4e7be868c67d [15:03:01] RECOVERY - HTTP availability for Nginx -SSL terminators- at esams on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Cache_TLS_termination https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [15:03:01] RECOVERY - HTTP availability for Varnish at esams on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 https://logstash.wikimedia.org/goto/60aa05b6e1129b475fbf4e7be868c67d [15:03:07] RECOVERY - High CPU load on API appserver on mw1230 is OK: OK - load average: 10.42, 22.92, 20.91 https://wikitech.wikimedia.org/wiki/Application_servers [15:03:07] RECOVERY - High CPU load on API appserver on mw1229 is OK: OK - load average: 11.91, 22.30, 19.47 https://wikitech.wikimedia.org/wiki/Application_servers [15:03:07] RECOVERY - High CPU load on API appserver on mw1286 is OK: OK - load average: 14.95, 31.95, 29.98 https://wikitech.wikimedia.org/wiki/Application_servers [15:03:13] RECOVERY - HTTP availability for Nginx -SSL terminators- at codfw on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Cache_TLS_termination https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [15:03:15] RECOVERY - High CPU load on API appserver on mw1235 is OK: OK - load average: 10.40, 24.56, 20.56 https://wikitech.wikimedia.org/wiki/Application_servers [15:03:15] PROBLEM - puppet last run on analytics1048 is CRITICAL: CRITICAL: Failed to apply catalog, zero resources tracked by Puppet. Could be an interrupted request or a dependency cycle. https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [15:03:21] RECOVERY - High CPU load on API appserver on mw1288 is OK: OK - load average: 15.38, 30.12, 28.42 https://wikitech.wikimedia.org/wiki/Application_servers [15:04:09] RECOVERY - HTTP availability for Nginx -SSL terminators- at ulsfo on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Cache_TLS_termination https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [15:04:27] RECOVERY - High CPU load on API appserver on mw1227 is OK: OK - load average: 13.23, 25.51, 22.91 https://wikitech.wikimedia.org/wiki/Application_servers [15:04:49] RECOVERY - High CPU load on API appserver on mw1290 is OK: OK - load average: 14.21, 27.53, 28.14 https://wikitech.wikimedia.org/wiki/Application_servers [15:05:11] RECOVERY - High CPU load on API appserver on mw1231 is OK: OK - load average: 9.52, 21.35, 21.44 https://wikitech.wikimedia.org/wiki/Application_servers [15:05:35] RECOVERY - High CPU load on API appserver on mw1223 is OK: OK - load average: 9.93, 22.97, 22.15 https://wikitech.wikimedia.org/wiki/Application_servers [15:08:47] RECOVERY - High CPU load on API appserver on mw1232 is OK: OK - load average: 8.29, 17.41, 22.78 https://wikitech.wikimedia.org/wiki/Application_servers [15:12:39] !log @ helmfile [STAGING] Ran 'apply' command on namespace 'restrouter' for release 'staging' . [15:12:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:14:13] greg-g: OK to do a bugfix deploy to fix a few MCS production crashers? (T229521 and T229630) [15:14:13] T229630: NOT_FOUND_ERR (8): the object can not be found here - https://phabricator.wikimedia.org/T229630 [15:14:14] T229521: Cannot read property 'type' of undefined in MCS - https://phabricator.wikimedia.org/T229521 [15:14:22] !log @ helmfile [STAGING] Ran 'apply' command on namespace 'restrouter' for release 'staging' . [15:14:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:19:03] (03PS1) 10CRusnov: netbox: Fix additional swift parameters. [puppet] - 10https://gerrit.wikimedia.org/r/527576 (https://phabricator.wikimedia.org/T209182) [15:19:58] (03CR) 10jerkins-bot: [V: 04-1] netbox: Fix additional swift parameters. 
[puppet] - 10https://gerrit.wikimedia.org/r/527576 (https://phabricator.wikimedia.org/T209182) (owner: 10CRusnov) [15:21:46] (03PS2) 10CRusnov: netbox: Fix additional swift parameters [puppet] - 10https://gerrit.wikimedia.org/r/527576 (https://phabricator.wikimedia.org/T209182) [15:22:42] (03CR) 10jerkins-bot: [V: 04-1] netbox: Fix additional swift parameters [puppet] - 10https://gerrit.wikimedia.org/r/527576 (https://phabricator.wikimedia.org/T209182) (owner: 10CRusnov) [15:25:23] RECOVERY - puppet last run on db1131 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [15:25:32] (03PS3) 10CRusnov: netbox: Fix additional swift parameters [puppet] - 10https://gerrit.wikimedia.org/r/527576 (https://phabricator.wikimedia.org/T209182) [15:25:39] RECOVERY - puppet last run on analytics1048 is OK: OK: Puppet is currently enabled, last run 24 seconds ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [15:26:29] RECOVERY - puppet last run on cp3030 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [15:27:15] RECOVERY - puppet last run on dbproxy1003 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [15:27:51] RECOVERY - puppet last run on ms-be1033 is OK: OK: Puppet is currently enabled, last run 5 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [15:30:59] 10Operations, 10DBA, 10Gerrit, 10Release-Engineering-Team-TODO, and 2 others: Gerrit is failing to connect to db on gerrit2001 thus preventing systemd from working - https://phabricator.wikimedia.org/T176532 (10Dzahn) a:03Dzahn I'll add it. Thanks Manuel! [15:31:10] mdholloway: yeah, those look reasonable. [15:31:22] greg-g: cool, thanks! 
[15:31:30] (sorry, was in a meeting) [15:31:35] no prob [15:36:49] 10Operations, 10ops-eqiad, 10netops: (Need By: Sept 30) upgrade msw1-eqiad from EX4200 to EX4300 - https://phabricator.wikimedia.org/T225121 (10ayounsi) a:05ayounsi→03Papaul codfw is done. @papaul let me know if you need help to prepare the eqiad one. [15:46:47] is someone online by now who knows their way around scap and i18n/l10n rebuilds? [15:46:55] we have some missing messages on Wikidata, e. g. https://www.wikidata.org/wiki/MediaWiki:Valueview-expertextender-languageselector-label [15:47:11] Amir1 tried to fix it earlier today but didn’t succeed, and now he’s already started his weekend so I’m trying to take over [15:47:29] not sure how to fix it though… scap sync-l10n? or full sync? [15:47:56] !log mholloway-shell@deploy1001 Started deploy [mobileapps/deploy@250f711]: Fix MCS production crashers (T229521, T229630) [15:48:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:48:06] T229630: NOT_FOUND_ERR (8): the object can not be found here - https://phabricator.wikimedia.org/T229630 [15:48:06] T229521: Cannot read property 'type' of undefined in MCS - https://phabricator.wikimedia.org/T229521 [15:49:13] cc brennen as train conductor this week [15:50:31] hrm. i do not know, but will attempt to find out. [15:50:53] well - thcipriani, any thoughts? 
[15:51:36] from the SAL, it looks like `scap sync-l10n` is what Amir tried [15:52:37] !log mholloway-shell@deploy1001 Finished deploy [mobileapps/deploy@250f711]: Fix MCS production crashers (T229521, T229630) (duration: 04m 41s) [15:52:41] * thcipriani looks [15:52:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:55:36] (03PS1) 10Jbond: CAS: Initial cas module [puppet] - 10https://gerrit.wikimedia.org/r/527591 [15:56:26] (03PS2) 10Jbond: CAS: Initial cas module [puppet] - 10https://gerrit.wikimedia.org/r/527591 [15:56:49] see, i highlight 'cas' for obvious reasons, and also it's annoying that the module is called 'CAS' [15:57:37] Lucas_WMDE: https://phabricator.wikimedia.org/P8853 looks like https://phabricator.wikimedia.org/T227814 with a different message. Where is that message? [15:57:39] (03PS3) 10Jbond: CAS: Initial cas module [puppet] - 10https://gerrit.wikimedia.org/r/527591 [15:58:15] thcipriani: in the WikibaseView extension, I believe [15:58:30] WikibaseView [15:58:31] yep [15:58:50] https://phabricator.wikimedia.org/P8853#53268 [15:59:04] chaomodus: im happy to rename to apereo_cas would that still trigger your hlight [15:59:08] hrm [15:59:21] it should still be pulled in by WikibaseRepo, no? [15:59:22] ahah yah :) [15:59:24] let me check what that looks like [15:59:50] * thcipriani does as well [16:00:07] okay, the WikibaseView PHP entry point does not have the i18n files compatibility thingy [16:00:36] (03CR) 10jerkins-bot: [V: 04-1] CAS: Initial cas module [puppet] - 10https://gerrit.wikimedia.org/r/527591 (owner: 10Jbond) [16:00:52] so let’s do the same thing as in https://gerrit.wikimedia.org/r/524539, I guess? [16:01:14] no idea why it would have broken between .15 and .16 though [16:01:56] can I somehow build ExtensionMessages-*.php locally, to figure out if that adds WikibaseView back to it? 
[16:02:31] * thcipriani finds command [16:02:54] also, yeah, I think the same fix as 524539 should work [16:03:14] okay I’ll prepare the patch then [16:03:39] running mergeMessageFileList.php should do the trick [16:05:32] I think it fails randomly due to either (a) the wiki that is used for mergeMessageFileList.php which is pretty much a random wiki from group0 OR (b) wfLoadExtension ordering -- I would guess (a) is the problem, but hasn't been a problem since we have been using wgMessagesDir. tl;dr: I don't know why it broke in wmf.16 vs wmf.15, but I also don't know why it worked in wmf.15 :) [16:08:55] when I run mergeMessageFileList.php --extensions-dir extensions/ locally, I get WikibaseView in both cases [16:09:09] just in one case the value is a string and in the other an array with a single string [16:09:12] oh wait [16:09:27] gah, the string version (master) is missing a slash in the middle, I think [16:09:54] https://phabricator.wikimedia.org/P8854 [16:10:00] “viewlib” [16:10:05] but anyways, I’ll upload the patch to add it to PHP [16:10:11] even though it looks like there might just be a bug in the JSON [16:11:43] no, sorry, I’m an idiot, the slash is missing from the *new* version because I made the patch wrong [16:11:46] nevermin [16:11:48] d [16:12:37] (03PS1) 10Paladox: profile::mariadb::ferm_misc: Add gerrit2001.wikimedia.org to the firewall for port 3306 [puppet] - 10https://gerrit.wikimedia.org/r/527595 (https://phabricator.wikimedia.org/T176532) [16:13:18] (03CR) 10jerkins-bot: [V: 04-1] profile::mariadb::ferm_misc: Add gerrit2001.wikimedia.org to the firewall for port 3306 [puppet] - 10https://gerrit.wikimedia.org/r/527595 (https://phabricator.wikimedia.org/T176532) (owner: 10Paladox) [16:13:43] (03PS2) 10Paladox: profile::mariadb::ferm_misc: Add gerrit2001.wikimedia.org to the firewall [puppet] - 10https://gerrit.wikimedia.org/r/527595 (https://phabricator.wikimedia.org/T176532) [16:14:58] (03PS4) 10Jbond: apereo_cas: Initial module [puppet] - 
10https://gerrit.wikimedia.org/r/527591 [16:15:27] chaomodus: hopefully thats ^^^ better :) [16:17:29] (03PS1) 10Paladox: gerrit: Re-enable the use of HTTP auth tokens [puppet] - 10https://gerrit.wikimedia.org/r/527596 (https://phabricator.wikimedia.org/T225308) [16:18:26] (03CR) 10jerkins-bot: [V: 04-1] apereo_cas: Initial module [puppet] - 10https://gerrit.wikimedia.org/r/527591 (owner: 10Jbond) [16:21:26] (03PS5) 10Jbond: apereo_cas: Initial module [puppet] - 10https://gerrit.wikimedia.org/r/527591 [16:23:19] anyone up for reviewing https://gerrit.wikimedia.org/r/527594 ? I’d like to backport+deploy it before the weekend [16:23:48] ok nevermind Amir got it :) [16:24:34] * thcipriani backports [16:24:42] I already cherry-picked it [16:24:55] ah, just noticed :) [16:24:58] or did you mean deploy it before it goes through? [16:25:00] ok :) [16:25:05] gate-and-submit will take a while :/ [16:26:48] and afterwards, I assume do the usual `git fetch`, `git submodule update` etc on the deployment host (same as SWAT), and then `scap sync` instead of `sync-file`? [16:26:52] or is it more special than that? [16:27:02] nope: that's exactly correct [16:27:05] yay [16:27:16] 10Operations, 10Traffic, 10conftool, 10Patch-For-Review, and 2 others: Figure out a security model for etcd - https://phabricator.wikimedia.org/T97972 (10CDanis) [16:29:11] PROBLEM - puppet last run on dbstore1004 is CRITICAL: CRITICAL: Failed to apply catalog, zero resources tracked by Puppet. Could be an interrupted request or a dependency cycle. https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [16:35:42] noo, one of the gate-and-submit-swat builds failed [16:35:47] so I’ll have to repeat it :( [16:36:11] there’s no way to cancel the others, right? have to wait until the whole job has failed and then restart it [16:38:56] if you push up another patchset I believe it cancels the current running jobs. i.e. 
tweak the commit message [16:41:14] great idea, thanks [16:42:25] !log replace rhenium with netflow1001 netflow target + iBGP peer on all routers [16:42:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:43:47] (03CR) 10Filippo Giunchedi: [C: 04-1] "LGTM overall, see inline" (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/527576 (https://phabricator.wikimedia.org/T209182) (owner: 10CRusnov) [16:44:55] PROBLEM - puppet last run on hassium is CRITICAL: CRITICAL: Failed to apply catalog, zero resources tracked by Puppet. Could be an interrupted request or a dependency cycle. https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [16:45:31] RECOVERY - BGP status on cr1-codfw is OK: BGP OK - up: 30, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [16:50:39] 10Operations, 10Traffic: rhenium [spare] server still receiving flow data - https://phabricator.wikimedia.org/T224477 (10ayounsi) 05Open→03Resolved This also caused the BGP sessions between rhenium (netflow) and the routers to alert. Updated to use the netflow1001 IP instead. [16:50:42] 10Operations, 10netops: migrate netinsights from rhenium to sulfur - https://phabricator.wikimedia.org/T212011 (10ayounsi) [16:50:44] 10Operations, 10ops-codfw, 10ops-eqiad, 10DC-Ops, and 2 others: Triage and resolve all outstanding Netbox report errors - https://phabricator.wikimedia.org/T223450 (10wiki_willy) @faidon - The majority of the influx in Netbox errors looks like it's from the new PDUs. Some of the info was updated into Netb... [16:51:43] 10Operations, 10netops: migrate netinsights from rhenium to sulfur - https://phabricator.wikimedia.org/T212011 (10ayounsi) 05Open→03Resolved a:05MoritzMuehlenhoff→03ayounsi We created a VM (netflow1001) to replace rhenium, everything has been migrated. 
[16:51:45] 10Operations, 10ops-eqiad: rack/setup/install sulfur.wikimedia.org - https://phabricator.wikimedia.org/T201364 (10ayounsi) [16:52:24] 10Operations, 10MediaWiki-Configuration, 10conftool: noc.wm.o/db.php: remove hosts information, or fetch it from etcd somehow - https://phabricator.wikimedia.org/T229631 (10CDanis) [16:53:02] (03PS1) 10Elukey: [WIP] profile::netbox: allow /metrics to be polled via http [puppet] - 10https://gerrit.wikimedia.org/r/527601 [16:55:37] (03PS2) 10Elukey: [WIP] profile::netbox: allow /metrics to be polled via http [puppet] - 10https://gerrit.wikimedia.org/r/527601 [16:56:37] (03PS4) 10CRusnov: netbox: Fix additional swift parameters [puppet] - 10https://gerrit.wikimedia.org/r/527576 (https://phabricator.wikimedia.org/T209182) [16:56:47] (03PS3) 10Elukey: [WIP] profile::netbox: allow /metrics to be polled via http [puppet] - 10https://gerrit.wikimedia.org/r/527601 [16:57:07] RECOVERY - puppet last run on dbstore1004 is OK: OK: Puppet is currently enabled, last run 5 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [17:00:15] 10Operations, 10MediaWiki-Configuration, 10conftool: noc.wm.o/db.php: remove hosts information, or fetch it from etcd somehow - https://phabricator.wikimedia.org/T229631 (10Krinkle) > First question is: does db.php have known users that require this information? Unless you, a DBA, or someone else in SRE ans... 
[17:05:34] (03PS1) 10Krinkle: noc: Fix mw-api links on db.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/527604 [17:05:58] (03CR) 10Krinkle: [C: 03+2] noc: Add cross-dc navigation links to db.php footer [mediawiki-config] - 10https://gerrit.wikimedia.org/r/527137 (https://phabricator.wikimedia.org/T197126) (owner: 10Krinkle) [17:06:30] (03CR) 10Ayounsi: "> Patch Set 2:" [puppet] - 10https://gerrit.wikimedia.org/r/526819 (https://phabricator.wikimedia.org/T226331) (owner: 10Ayounsi) [17:07:23] (03Merged) 10jenkins-bot: noc: Add cross-dc navigation links to db.php footer [mediawiki-config] - 10https://gerrit.wikimedia.org/r/527137 (https://phabricator.wikimedia.org/T197126) (owner: 10Krinkle) [17:09:20] 10Operations, 10MediaWiki-Configuration, 10conftool: noc.wm.o/db.php: remove hosts information, or fetch it from etcd somehow - https://phabricator.wikimedia.org/T229631 (10CDanis) >>! In T229631#5388248, @Krinkle wrote: >> First question is: does db.php have known users that require this information? > > U... [17:09:39] 10Operations, 10MediaWiki-Configuration, 10conftool: noc.wm.o/db.php: remove hosts information, or fetch it from etcd somehow - https://phabricator.wikimedia.org/T229631 (10Marostegui) From the DBA point of view, we do use db-eqiad.php (or db.php) to quickly check what is and what isn't pooled from a browser... 
[17:10:03] !log krinkle@deploy1001 Synchronized docroot/noc/db.php: ee528e886268c08e9377fbd764ec861b09adfc73 (duration: 00m 48s) [17:10:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:10:49] (03CR) 10Krinkle: [C: 03+2] noc: Fix mw-api links on db.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/527604 (owner: 10Krinkle) [17:11:39] 10Operations, 10ops-eqiad, 10DC-Ops, 10netbox: Missing Netbox Info for New PDUs - https://phabricator.wikimedia.org/T229680 (10wiki_willy) [17:12:51] RECOVERY - puppet last run on hassium is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [17:13:18] (03CR) 10jenkins-bot: noc: Add cross-dc navigation links to db.php footer [mediawiki-config] - 10https://gerrit.wikimedia.org/r/527137 (https://phabricator.wikimedia.org/T197126) (owner: 10Krinkle) [17:15:26] 10Operations, 10ops-codfw, 10ops-eqiad, 10DC-Ops, and 2 others: Triage and resolve all outstanding Netbox report errors - https://phabricator.wikimedia.org/T223450 (10wiki_willy) [17:16:08] (03CR) 10CDanis: [C: 03+1] noc: Fix mw-api links on db.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/527604 (owner: 10Krinkle) [17:17:20] (03PS2) 10Krinkle: noc: Fix mw-api links on db.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/527604 [17:17:38] (03CR) 10Krinkle: [C: 03+2] noc: Fix mw-api links on db.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/527604 (owner: 10Krinkle) [17:18:35] (03Merged) 10jenkins-bot: noc: Fix mw-api links on db.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/527604 (owner: 10Krinkle) [17:18:49] Krinkle: can you let me know when you’re done on deploy1001? 
[17:18:51] (03CR) 10jenkins-bot: noc: Fix mw-api links on db.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/527604 (owner: 10Krinkle) [17:19:11] Lucas_WMDE: k, 1min [17:19:15] ok thanks [17:19:47] !log krinkle@deploy1001 Synchronized docroot/noc/db.php: a75d23ecb1b (duration: 00m 47s) [17:19:51] Lucas_WMDE: done [17:19:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:20:12] thanks! [17:20:38] I’m backporting a Wikibase fix that needs a full scap sync [17:25:13] !log lucaswerkmeister-wmde@deploy1001 Started scap: Fix WikibaseView i18n globals (T229604) [17:25:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:25:27] T229604: Several selectors/experts are broken - https://phabricator.wikimedia.org/T229604 [17:26:44] !log add avoid_path to cr1/2-eqsin [17:26:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:29:35] WikibaseView appears in ExtensionMessages-1.34.0-wmf.16.php now, so far so good… [17:38:27] RECOVERY - PHP opcache health on mwdebug2002 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [17:42:04] !log lucaswerkmeister-wmde@deploy1001 Finished scap: Fix WikibaseView i18n globals (T229604) (duration: 16m 51s) [17:42:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:42:12] T229604: Several selectors/experts are broken - https://phabricator.wikimedia.org/T229604 [17:44:10] well… https://www.wikidata.org/wiki/MediaWiki:Valueview-expertextender-languageselector-label exists now [17:44:19] but when editing items I still don’t see the message [17:44:32] e. g. 
https://www.wikidata.org/wiki/Q66058247, try to add another title [17:45:09] but it works if I enable X-Wikimedia-Debug, wat [17:46:12] !log flap NTT link in eqsin [17:46:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:55:11] ok nevermind it’s fixed itself now, somehow [17:55:24] thanks a lot for the help thcipriani! [17:55:38] Lucas_WMDE: happy to help :) [17:59:00] Is it known that pages like the new user contributions take a while to load? [17:59:10] 08Warning Alert for device cr1-eqsin.wikimedia.org - Processor usage over 85% [17:59:29] https://en.wikipedia.org/w/index.php?title=Special:Contributions&offset=20190802171826&contribs=newbie&target=newbies for example [18:01:17] Bsadowski1: We're thinking about getting rid of that page because it's so broken. [18:01:23] :O [18:01:26] Nooo! [18:01:27] xD [18:01:36] It took around 30 seconds for it to load. [18:01:40] Bsadowski1: There are much better, faster alternatives [18:01:42] er 49* [18:01:45] er 39* [18:02:14] Like what, James_F? :) [18:02:37] Bsadowski1: Well, https://en.wikipedia.org/wiki/Special:RecentChanges?userExpLevel=newcomer&hidelog=1 obviously. :-) [18:02:59] That loads in a couple of seconds and has a "live updates" mode. [18:03:15] oh nice :D thanks [18:03:22] Why does that other load slow? [18:03:34] Lots of weird queries? [18:03:44] Because RecentChanges is built on a bunch of technology to make it fast (dedicated table, etc.) [18:04:12] And Special:Contributions/newbies is just a ghastly hack based on (I think) 1% of the maximum user ID. [18:04:18] (03CR) 10CRusnov: netbox: Fix additional swift parameters (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/527576 (https://phabricator.wikimedia.org/T209182) (owner: 10CRusnov) [18:04:48] So it essentially makes very expensive queries against our most expensive database. [18:05:25] Jeez.. [18:05:30] :P [18:05:36] Yeah. Bad feature, badly implemented. 
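Editorial note: James_F describes Special:Contributions/newbies above as a hack based on "(I think) 1% of the maximum user ID". A minimal Python sketch of that idea follows, assuming the rule is "an account counts as a newbie if its ID falls in the newest 1% of IDs". The function names are hypothetical and this is not taken from the MediaWiki source.

```python
def newbie_user_id_cutoff(max_user_id: int) -> int:
    """Return the ID above which an account falls in the newest ~1% of IDs.

    Assumption for illustration: the cutoff is the maximum user ID minus
    1% of that maximum, using integer arithmetic.
    """
    return max_user_id - max_user_id // 100


def is_newbie(user_id: int, max_user_id: int) -> bool:
    """True if user_id lies strictly above the 1% cutoff."""
    return user_id > newbie_user_id_cutoff(max_user_id)
```

Under this assumption the window is defined only by account ID, not by how recently or how often an account has edited, so its size grows with the total number of registered accounts.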
[18:05:48] It did work well for a while though [18:06:00] (no?) [18:06:02] Oh, yes, but then December 2005 happened and it stopped being so good. ;-) [18:06:41] For very small wikis it's OK, but once you've got more than ~10k registered accounts it's not great at focussing reviewer's time. [18:06:51] And volunteers' time is a precious resource. [18:14:32] !log recached all WikibaseView messages in ResourceLoader for T229604, cf. https://w.wiki/6kc [18:14:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:14:41] T229604: Several selectors/experts are broken - https://phabricator.wikimedia.org/T229604 [18:19:10] 08̶W̶a̶r̶n̶i̶n̶g Device cr1-eqsin.wikimedia.org recovered from Processor usage over 85% [18:21:43] I’m going home now, my contact info is somewhere in the ops-l archives if that Wikibase deploy broke something horribly [18:22:02] (and possibly on officewiki, I’m not allowed to know that ^^) [18:34:07] (03CR) 10Dzahn: [C: 03+2] "https://puppet-compiler.wmflabs.org/compiler1001/17720/" [puppet] - 10https://gerrit.wikimedia.org/r/527595 (https://phabricator.wikimedia.org/T176532) (owner: 10Paladox) [18:36:28] !log adding gerrit2001 to ferm rules on dbproxy for misc [18:36:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:37:40] !log gerrit2001 - disabling puppet, stopping gerrit service [18:37:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:44:39] (03PS1) 10Jdlrobson: Restore RelatedArticles config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/527611 (https://phabricator.wikimedia.org/T229644) [18:45:20] James_F: oh wow, I didn't know contribs/newbies was still a thing. Yeah, revision table queries without page or user ID filter is... not great. [18:48:14] (03CR) 10Jdlrobson: "Lego,Krinkle,Reedy adding you as reviewers as I admire their knowledge of this part of the stack which is much better than my own." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/527611 (https://phabricator.wikimedia.org/T229644) (owner: 10Jdlrobson) [18:49:49] (03CR) 10Krinkle: [C: 04-1] Restore RelatedArticles config (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/527611 (https://phabricator.wikimedia.org/T229644) (owner: 10Jdlrobson) [18:53:12] Krinkle: I filed a task to kill it somewhere. [18:53:37] T220447 [18:53:38] T220447: Split out or remove Special:Contributions/newbies functionality - https://phabricator.wikimedia.org/T220447 [18:57:35] PROBLEM - puppet last run on db1063 is CRITICAL: CRITICAL: Failed to apply catalog, zero resources tracked by Puppet. Could be an interrupted request or a dependency cycle. https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [19:00:38] (03PS1) 10Jdlrobson: Remove unused remnant from old menu click tracking [mediawiki-config] - 10https://gerrit.wikimedia.org/r/527615 (https://phabricator.wikimedia.org/T228681) [19:01:39] (03PS2) 10Jdlrobson: Remove unused remnant from old menu click tracking [mediawiki-config] - 10https://gerrit.wikimedia.org/r/527615 (https://phabricator.wikimedia.org/T228681) [19:06:21] (03PS2) 10Jdlrobson: Restore RelatedArticles config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/527611 (https://phabricator.wikimedia.org/T229644) [19:08:06] (03CR) 10Dzahn: [C: 03+2] wmnet: Add m1-master for codfw [dns] - 10https://gerrit.wikimedia.org/r/527462 (https://phabricator.wikimedia.org/T202367) (owner: 10Marostegui) [19:10:32] (03CR) 10Jdlrobson: Restore RelatedArticles config (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/527611 (https://phabricator.wikimedia.org/T229644) (owner: 10Jdlrobson) [19:13:46] (03CR) 10Krinkle: [C: 04-1] Restore RelatedArticles config (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/527611 (https://phabricator.wikimedia.org/T229644) (owner: 10Jdlrobson) [19:14:25] @krinkle the default is a non-empty [] [19:14:53] 
https://github.com/wikimedia/mediawiki-extensions-RelatedArticles/blob/master/extension.json#L150 [19:15:12] which means "turn RelatedArticles on for all skins" [19:15:34] So I'm a bit lost why 'default' => [ 'minerva' ], (which should be running for English Wikipedia) is being treated as an empty array [19:16:33] "RelatedArticlesFooterWhitelistedSkins": [] [19:16:36] that is an empty [] [19:17:23] sorry typo - i meant to say the default is empty :) [19:17:23] which will be the value of $wgRelatedArticlesFooterWhitelistedSkins before that wmf-config function runs. Unless something else modifies it before then? [19:17:27] https://en.wikivoyage.org is respecting its array: [ 'minerva', 'vector' ] and only showing on those skins (it doesn't show on Timeless) [19:17:45] de.wikipedia.org is showing because it's being set as 'related-articles-footer-blacklisted-skins' => [], so that's explainable [19:17:47] eval.php enwiki [19:17:48] > var_dump($wgRelatedArticlesFooterWhitelistedSkins); [19:17:48] array(1) { [19:17:48] [0]=> [19:17:48] string(7) "minerva" [19:17:48] } [19:18:08] > var_dump($wmgRelatedArticlesFooterWhitelistedSkins); [19:18:08] array(1) { [0]=> string(7) "minerva" } [19:19:29] (03CR) 10Brian Wolff: "So looking at SiteConfiguration.php (around line 220), it seems like 'tag' settings are always handled as if they have a + in front of the" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/527611 (https://phabricator.wikimedia.org/T229644) (owner: 10Jdlrobson) [19:19:59] okay so that's weird.. that's the correct value [19:20:00] jdlrobson: We found the problem. It's just de and ru wikipedias because they for some reason tried to disable it for all their skins, so that's being interpreted now as 'no whitelist; enable for everything'. [19:20:14] wait so this is not a problem on English Wikipedia? [19:20:19] But at that point why is the extension even enabled there to begin with?! [19:20:23] Right. [19:20:28] It's just those two, as far as I can tell. 
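Editorial note: the behaviour being untangled above is that an empty skin whitelist is read as "no whitelist; enable for everything", while a non-empty one restricts the footer to the listed skins. A minimal Python sketch of that rule follows, assuming it is the whole decision. The function name is hypothetical and this is not the extension's actual code.

```python
from typing import Sequence


def related_articles_footer_enabled(skin: str, whitelisted_skins: Sequence[str]) -> bool:
    """Decide whether the RelatedArticles footer shows for a skin.

    Assumption for illustration: an empty whitelist means "no restriction",
    so the footer is enabled on every skin.
    """
    if not whitelisted_skins:
        return True
    return skin in whitelisted_skins
```

With the values quoted in the log, `['minerva']` (enwiki) enables the footer on Minerva only, `['minerva', 'vector']` (en.wikivoyage) leaves Timeless without it, and `[]` (the de and ru Wikipedia setting) turns it on for every skin, which is the opposite of what that setting was apparently meant to do.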
[19:20:29] arrgggh [19:20:32] https://en.wikipedia.org/wiki/User:Jdlrobson/vector.js < stupid user script [19:20:36] completely derailed me [19:20:37] XD [19:20:39] okay yeh that makes totally sense [19:20:43] 'related-articles-footer-blacklisted-skins' => [], is messed up [19:20:50] and that previously wasn't working [19:21:12] as the previous default was ['minerva'] so that was additive [19:21:23] ['minerva'] + [] = ['minerva'] [19:21:30] Wait, so it wasn't even disabling it anyway?! [19:21:31] (03Abandoned) 10Jdlrobson: Restore RelatedArticles config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/527611 (https://phabricator.wikimedia.org/T229644) (owner: 10Jdlrobson) [19:21:40] On those? [19:21:51] yeh whoever configured those obviously didn't check :) [19:22:18] it needs to be 'related-articles-footer-blacklisted-skins' = ['fallbackskin'] or something if they want to disable on all skins in preferences [19:22:29] Isarra: you able to take care of that? [19:22:39] Maybe. Huh. [19:22:44] or remove the line 'related-articles-footer-blacklisted-skins' => [], and restore it to Minerva [19:23:08] I mean, yeah, the global default is minerva, and if it's not been totally disabled all this time, why... start now? >.> [19:23:27] running git blame to work out why this happened [19:23:36] but git blaming that file takes a long time ;-) [19:23:58] looks like it was me. I did it wrong for some reason [19:24:02] so I think you can just remove that line [19:24:55] !log gerrit2001 - re-enabling puppet, starting as slave for the first time ever, thanks to codfw dbproxy, gerrit service running (T176532) [19:25:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:25:06] T176532: Gerrit is failing to connect to db on gerrit2001 thus preventing systemd from working - https://phabricator.wikimedia.org/T176532 [19:25:29] Cool, trying to find the right files... 
[19:25:35] RECOVERY - puppet last run on db1063 is OK: OK: Puppet is currently enabled, last run 5 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [19:26:06] i think it's rm dblists/related-articles-footer-blacklisted-skins.dblist [19:26:13] and remove the line in wmf-config/InitialiseSettings.php [19:26:22] i can help if necessary after lunch (afk right now) [19:27:06] you can also enable timeless while there i guess :) [19:27:10] (in defaults) [19:27:26] thanks Krinkle for the help [19:27:26] Hee. For everything? [19:27:31] why not [19:27:33] your skin [19:27:35] your rules [19:27:51] Good point. It's randomly changing all the time, and where better to throw a random at people? [19:28:07] unless someone's asked specifically not to have it - but I know certain English/German Wikipedia's use it or want to use it on desktop [19:30:33] (03CR) 10Jdlrobson: "Thanks bawolff. Turned out the problem is only Russian and German wiki which is very explainable." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/527611 (https://phabricator.wikimedia.org/T229644) (owner: 10Jdlrobson) [19:30:43] (03PS1) 10Paladox: gerrit: Enable --enable-httpd if slave mode is enabled [puppet] - 10https://gerrit.wikimedia.org/r/527621 [19:31:01] (03PS2) 10Paladox: gerrit: Enable --enable-httpd if slave mode is enabled [puppet] - 10https://gerrit.wikimedia.org/r/527621 [19:41:17] (03CR) 10Thcipriani: [C: 03+1] gerrit: Enable --enable-httpd if slave mode is enabled [puppet] - 10https://gerrit.wikimedia.org/r/527621 (owner: 10Paladox) [19:41:22] (03CR) 10Dzahn: [C: 03+2] gerrit: Enable --enable-httpd if slave mode is enabled [puppet] - 10https://gerrit.wikimedia.org/r/527621 (owner: 10Paladox) [19:51:49] (03PS1) 10Paladox: Gerrit: Enable sshd for gerrit on slaves [puppet] - 10https://gerrit.wikimedia.org/r/527631 [19:53:16] (03PS2) 10Paladox: Gerrit: Enable sshd for gerrit on slaves [puppet] - 10https://gerrit.wikimedia.org/r/527631 [19:53:51] (03PS3) 10Paladox: Gerrit: Enable sshd for gerrit on slaves [puppet] - 10https://gerrit.wikimedia.org/r/527631 [19:53:56] (03CR) 10Paladox: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/527631 (owner: 10Paladox) [19:54:24] (03PS1) 10Isarra: Remove related-articles-footer-blacklisted-skins.dblist [mediawiki-config] - 10https://gerrit.wikimedia.org/r/527632 (https://phabricator.wikimedia.org/T229644) [19:57:01] (03CR) 10Isarra: "Might want to double check I did this right..." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/527632 (https://phabricator.wikimedia.org/T229644) (owner: 10Isarra) [19:57:41] (03CR) 10Dzahn: [C: 03+2] Gerrit: Enable sshd for gerrit on slaves [puppet] - 10https://gerrit.wikimedia.org/r/527631 (owner: 10Paladox) [20:03:09] (03CR) 10Brian Wolff: "Ah, that makes sense." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/527611 (https://phabricator.wikimedia.org/T229644) (owner: 10Jdlrobson) [20:05:45] (03CR) 10Brian Wolff: [C: 04-1] "This is based on something I said where I was wrong. (sorry!). The use of the tag here is not the issue" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/527632 (https://phabricator.wikimedia.org/T229644) (owner: 10Isarra) [20:07:30] (03CR) 10Brian Wolff: "To clarify, on a different patch, i said that tags behaved differently than direct settings. I misunderstood the code, and that conclusion" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/527632 (https://phabricator.wikimedia.org/T229644) (owner: 10Isarra) [20:09:38] (03CR) 10Isarra: "> Patch Set 1: -Code-Review" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/527632 (https://phabricator.wikimedia.org/T229644) (owner: 10Isarra) [20:14:37] !log Run mwscript deleteEqualMessages.php --wiki=cswiki --delete [20:14:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:14:45] Ow, my head. [20:19:53] (03Abandoned) 10Isarra: Remove related-articles-footer-blacklisted-skins.dblist [mediawiki-config] - 10https://gerrit.wikimedia.org/r/527632 (https://phabricator.wikimedia.org/T229644) (owner: 10Isarra) [20:20:30] * bd808 hands Isarra an aspirin [20:21:42] (03PS1) 10Isarra: Set a dummy skin to 'disable' Related Article cards on blacklisted projects [mediawiki-config] - 10https://gerrit.wikimedia.org/r/527636 (https://phabricator.wikimedia.org/T229644) [20:22:19] Yay drugs. 
[20:23:17] Hey - was going to sec-deploy patch for UBN T229541 now, cc: jdlrobson MaxSem [20:37:30] (03PS1) 10Dzahn: gerrit: fix sshd listen address if on a slave [puppet] - 10https://gerrit.wikimedia.org/r/527638 (https://phabricator.wikimedia.org/T176532) [20:38:00] (03CR) 10jerkins-bot: [V: 04-1] gerrit: fix sshd listen address if on a slave [puppet] - 10https://gerrit.wikimedia.org/r/527638 (https://phabricator.wikimedia.org/T176532) (owner: 10Dzahn) [20:41:28] Isarra: so what have I missed? [20:41:32] (03CR) 10Isarra: "I THINK THIS ACTUALLY FIXES IT, AS INTENDED, ETC WHATEVER." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/527636 (https://phabricator.wikimedia.org/T229644) (owner: 10Isarra) [20:41:42] https://gerrit.wikimedia.org/r/#/c/operations/mediawiki-config/+/527632/ looks like the right change to me. [20:41:48] why abandoned? [20:42:09] (03PS2) 10Dzahn: gerrit: fix sshd listen address if on a slave [puppet] - 10https://gerrit.wikimedia.org/r/527638 (https://phabricator.wikimedia.org/T176532) [20:42:20] jdlrobson: bawolff says we're probably, it probably did work previously, I just broke it. [20:42:26] No it's my mistake [20:42:37] I have no idea, but looking closer at the logic I'm pretty sure just adding a dummy will also fix it. [20:42:52] I did a git blame. Russian and German have never asked for RelatedArticles to be disabled on Minerva [20:42:58] I don't actually know, though, so, uh, I guess pick which patch you prefer? [20:43:02] and it has never been. [20:43:05] Okay. [20:43:11] I dunno! 
[20:43:21] I would advise restoring https://gerrit.wikimedia.org/r/#/c/operations/mediawiki-config/+/527632/ and swatting :) [20:43:38] The config dates back to an older version of the config when related articles only worked on mobile [20:43:39] (03Restored) 10Isarra: Remove related-articles-footer-blacklisted-skins.dblist [mediawiki-config] - 10https://gerrit.wikimedia.org/r/527632 (https://phabricator.wikimedia.org/T229644) (owner: 10Isarra) [20:43:58] Okay, restored, but I don't know how to swat things. [20:44:18] wait until Monday :) [20:44:23] I'm not a SWATer right now. I don't know if any are around today so yeh that ^ [20:44:24] :) [20:44:24] All I know is either of these should actually fix the immediate problem, so if you're sure about this one, I'm all for it because it's the neater solution overall regardless. [20:44:29] Okay. [20:44:40] (03CR) 10Jdlrobson: [C: 03+1] Remove related-articles-footer-blacklisted-skins.dblist [mediawiki-config] - 10https://gerrit.wikimedia.org/r/527632 (https://phabricator.wikimedia.org/T229644) (owner: 10Isarra) [20:47:28] thanks for looking into this today Isarra [20:47:38] we'll get it fixed Monday. The wikis can wait. [20:48:16] !log Deployed security patch for T229541 [20:48:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:49:30] Looks like ru and de have already disabled it locally, so... yeah. 
:P [20:49:58] (03CR) 10Dzahn: [C: 03+1] "lgtm https://puppet-compiler.wmflabs.org/compiler1001/17721/" [puppet] - 10https://gerrit.wikimedia.org/r/527638 (https://phabricator.wikimedia.org/T176532) (owner: 10Dzahn) [20:51:16] (03CR) 10Paladox: [C: 03+1] gerrit: fix sshd listen address if on a slave [puppet] - 10https://gerrit.wikimedia.org/r/527638 (https://phabricator.wikimedia.org/T176532) (owner: 10Dzahn) [20:51:21] (03CR) 10Jdlrobson: [C: 04-1] "https://gerrit.wikimedia.org/r/#/c/operations/mediawiki-config/+/527632/" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/527636 (https://phabricator.wikimedia.org/T229644) (owner: 10Isarra) [20:51:46] Isarra: did you also want to enable on timeless anywhere on Monday while we do that? [20:52:12] Eh, no need to swat that. [20:52:37] The real question is... should we also enable it on monobook on the three wikis that already have it on vector? [20:52:55] wikis/wiki groups [20:55:01] (03CR) 10Dzahn: [C: 03+2] gerrit: fix sshd listen address if on a slave [puppet] - 10https://gerrit.wikimedia.org/r/527638 (https://phabricator.wikimedia.org/T176532) (owner: 10Dzahn) [21:02:34] 10Operations, 10DBA, 10Gerrit, 10Release-Engineering-Team-TODO, and 2 others: Gerrit is failing to connect to db on gerrit2001 thus preventing systemd from working - https://phabricator.wikimedia.org/T176532 (10Dzahn) 05Open→03Resolved gerrit, gerrit's httpd and gerrit's sshd are now all running and li... 
[21:02:39] 10Operations, 10Gerrit, 10Patch-For-Review, 10Release-Engineering-Team (Backlog): Reimage gerrit2001 as stretch - https://phabricator.wikimedia.org/T168562 (10Dzahn) [21:03:00] 10Operations, 10DBA, 10Gerrit, 10Release-Engineering-Team-TODO, 10Release-Engineering-Team (Development services): Gerrit is failing to connect to db on gerrit2001 thus preventing systemd from working - https://phabricator.wikimedia.org/T176532 (10Dzahn) [21:15:37] (03Abandoned) 10Isarra: Set a dummy skin to 'disable' Related Article cards on blacklisted projects [mediawiki-config] - 10https://gerrit.wikimedia.org/r/527636 (https://phabricator.wikimedia.org/T229644) (owner: 10Isarra) [21:15:54] (03PS2) 10Isarra: Remove related-articles-footer-blacklisted-skins.dblist [mediawiki-config] - 10https://gerrit.wikimedia.org/r/527632 (https://phabricator.wikimedia.org/T229644) [21:19:12] jdlrobson: So now I can't even figure out how to submit a patch to enable timeless at all because gerrit keeps rejecting it, at which point the entire commit gets undone locally too, so... I'll just come back to this later. Yes. >.> [21:19:51] But as long as all this actually gets fixed to maintain what people actually expect normally, that's the most important bit regardless. 
[21:30:18] (03PS1) 10BryanDavis: Vagrantfile: embiggen the vm's disk [docker-images/toollabs-images] - 10https://gerrit.wikimedia.org/r/527650 [21:30:20] (03PS1) 10BryanDavis: run-image: give error message if type is not passed [docker-images/toollabs-images] - 10https://gerrit.wikimedia.org/r/527651 [21:30:22] (03PS1) 10BryanDavis: jessie: Work around removal of jessie-backports [docker-images/toollabs-images] - 10https://gerrit.wikimedia.org/r/527652 [21:30:24] (03PS1) 10BryanDavis: locales-extended: Add support for Korean [docker-images/toollabs-images] - 10https://gerrit.wikimedia.org/r/527653 (https://phabricator.wikimedia.org/T130532) [21:30:36] (03CR) 10jerkins-bot: [V: 04-1] Vagrantfile: embiggen the vm's disk [docker-images/toollabs-images] - 10https://gerrit.wikimedia.org/r/527650 (owner: 10BryanDavis) [21:41:31] (03PS1) 10Paladox: Gerrit: Rename gerrit-slave to gerrit-replica [puppet] - 10https://gerrit.wikimedia.org/r/527656 [21:42:19] (03PS2) 10Paladox: Gerrit: Rename gerrit-slave to gerrit-replica [puppet] - 10https://gerrit.wikimedia.org/r/527656 [21:42:40] (03PS1) 10Paladox: Rename gerrit-slave to gerrit-replica [dns] - 10https://gerrit.wikimedia.org/r/527657 [21:43:23] (03PS2) 10Paladox: Rename gerrit-slave to gerrit-replica [dns] - 10https://gerrit.wikimedia.org/r/527657 [22:00:27] (03CR) 10BryanDavis: "recheck" [docker-images/toollabs-images] - 10https://gerrit.wikimedia.org/r/527650 (owner: 10BryanDavis) [22:03:06] (03CR) 10BryanDavis: [C: 03+2] "local dev only" [docker-images/toollabs-images] - 10https://gerrit.wikimedia.org/r/527650 (owner: 10BryanDavis) [22:03:17] (03CR) 10BryanDavis: [C: 03+2] "local dev only" [docker-images/toollabs-images] - 10https://gerrit.wikimedia.org/r/527651 (owner: 10BryanDavis) [22:03:31] (03Merged) 10jenkins-bot: Vagrantfile: embiggen the vm's disk [docker-images/toollabs-images] - 10https://gerrit.wikimedia.org/r/527650 (owner: 10BryanDavis) [22:03:41] (03Merged) 10jenkins-bot: run-image: give error message if 
type is not passed [docker-images/toollabs-images] - https://gerrit.wikimedia.org/r/527651 (owner: BryanDavis) [22:04:13] PROBLEM - puppet last run on dbmonitor2001 is CRITICAL: CRITICAL: Failed to apply catalog, zero resources tracked by Puppet. Could be an interrupted request or a dependency cycle. https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [22:07:00] (CR) Bstorm: [C: +1] "Looks like a marvelous hack up to me." [docker-images/toollabs-images] - https://gerrit.wikimedia.org/r/527652 (owner: BryanDavis) [22:10:35] (PS2) Paladox: Merge branch 'stable-2.15' into wmf/stable-2.15 [software/gerrit] (wmf/stable-2.15) - https://gerrit.wikimedia.org/r/525865 [22:11:04] (PS4) Paladox: Testing: Do not merge [software/gerrit] (wmf/stable-2.15) - https://gerrit.wikimedia.org/r/525867 [22:11:26] (CR) jerkins-bot: [V: -1] Merge branch 'stable-2.15' into wmf/stable-2.15 [software/gerrit] (wmf/stable-2.15) - https://gerrit.wikimedia.org/r/525865 (owner: Paladox) [22:11:34] (CR) jerkins-bot: [V: -1] Testing: Do not merge [software/gerrit] (wmf/stable-2.15) - https://gerrit.wikimedia.org/r/525867 (owner: Paladox) [22:13:12] (CR) Dzahn: "why? and that would first require DNS changes" [puppet] - https://gerrit.wikimedia.org/r/527656 (owner: Paladox) [22:16:55] (CR) Greg Grossmeier: [C: +1] "> why? and that would first require DNS changes" [puppet] - https://gerrit.wikimedia.org/r/527656 (owner: Paladox) [22:17:24] (CR) Greg Grossmeier: [C: +1] "Also, the DNS bits: https://gerrit.wikimedia.org/r/#/c/operations/dns/+/527657/" [puppet] - https://gerrit.wikimedia.org/r/527656 (owner: Paladox) [22:18:06] (CR) Greg Grossmeier: [C: +1] "Yes, please. 
This fits with our (RelEng's) other efforts to reduce the occurrence of this language, see also: https://phabricator.wikimedi" [dns] - https://gerrit.wikimedia.org/r/527657 (owner: Paladox) [22:20:08] (CR) Dzahn: "this is a technical term chosen by upstream. it's like mysql slave-lag" [puppet] - https://gerrit.wikimedia.org/r/527656 (owner: Paladox) [22:23:59] (CR) Dzahn: "would require a lot more changes, acme_chief, puppet in various places, hiera.. fwiw the term appears 842 times in the repo" [puppet] - https://gerrit.wikimedia.org/r/527656 (owner: Paladox) [22:26:39] RECOVERY - puppet last run on dbmonitor2001 is OK: OK: Puppet is currently enabled, last run 41 seconds ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [22:27:21] (PS1) Paladox: gerrit: Add gerrit-replica to acme [puppet] - https://gerrit.wikimedia.org/r/527664 [22:27:35] (CR) Greg Grossmeier: [C: +1] "> would require a lot more changes, acme_chief, puppet in various" [puppet] - https://gerrit.wikimedia.org/r/527656 (owner: Paladox) [22:27:46] (PS2) Paladox: gerrit: Add gerrit-replica to acme [puppet] - https://gerrit.wikimedia.org/r/527664 [22:33:02] (CR) Krinkle: [C: +1] "Beware of the scap trap in deploying this to avoid intermediary fatal errors (which mwdebug won't catch). Takes three separate sync-files," [mediawiki-config] - https://gerrit.wikimedia.org/r/527632 (https://phabricator.wikimedia.org/T229644) (owner: Isarra) [22:34:04] (CR) Greg Grossmeier: [C: +1] "> this is a technical term chosen by upstream. it's like mysql" [puppet] - https://gerrit.wikimedia.org/r/527656 (owner: Paladox) [22:39:26] (CR) Krinkle: "(fwiw, in MW core we've adopted this change as well. 
Rdbms uses replica terminology consistently wherever possible.)" [puppet] - https://gerrit.wikimedia.org/r/527656 (owner: Paladox) [22:41:29] (CR) Dzahn: [C: -1] gerrit: Add gerrit-replica to acme (1 comment) [puppet] - https://gerrit.wikimedia.org/r/527664 (owner: Paladox) [22:42:45] (CR) Paladox: gerrit: Add gerrit-replica to acme (1 comment) [puppet] - https://gerrit.wikimedia.org/r/527664 (owner: Paladox) [22:45:33] PROBLEM - puppet last run on kubernetes1003 is CRITICAL: CRITICAL: Failed to apply catalog, zero resources tracked by Puppet. Could be an interrupted request or a dependency cycle. https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [22:48:01] (CR) Dzahn: "yea, it would probably fail. but that's because there would be no httpd vhost for the LE challenge to work. as i said, this is not a trivi" [puppet] - https://gerrit.wikimedia.org/r/527664 (owner: Paladox) [22:59:34] Operations, ops-eqiad: helium.mgmt down - https://phabricator.wikimedia.org/T229706 (Dzahn) [23:00:19] ACKNOWLEDGEMENT - SSH helium.mgmt on helium.mgmt is CRITICAL: Server answer: daniel_zahn https://phabricator.wikimedia.org/T229706 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [23:02:29] Operations, ops-eqiad: helium.mgmt down - https://phabricator.wikimedia.org/T229706 (wiki_willy) a:Cmjohnson [23:06:17] !log mwdebug1001/mwdebug1002 - restart-php7.2-fpm - low opcache [23:06:17] RECOVERY - PHP opcache health on mwdebug1001 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [23:06:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:07:23] RECOVERY - PHP opcache health on mwdebug1002 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [23:08:01] RECOVERY - puppet last run on kubernetes1003 is OK: OK: Puppet is currently 
enabled, last run 3 seconds ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [23:16:32] !log Make the Level3 link between eqiad-knams primary - T228827 [23:16:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:16:41] T228827: Instability of the Level3 link between cr2-eqiad and cr2-esams - https://phabricator.wikimedia.org/T228827 [23:20:07] Operations, netops: Instability of the Level3 link between cr2-eqiad and cr2-esams - https://phabricator.wikimedia.org/T228827 (ayounsi) Talked to Faidon, using the backup link for a long amount of time is costing us money (see overusage on https://librenms.wikimedia.org/bill/bill_id=17/). I made the Lev... [23:58:46] !log scandium - apt-get remove --purge prometheus-hhvm-exporter - not needed here, no HHVM (T228069) [23:58:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:58:55] T228069: Deploy Parsoid-PHP with Mediawiki to scandium for RT and performance testing - https://phabricator.wikimedia.org/T228069