[06:20:16] (CR) Volans: "It seems quite suboptimal to me to have to specify a notes_url parameter when absenting a resource." [puppet] - https://gerrit.wikimedia.org/r/524951 (https://phabricator.wikimedia.org/T197873) (owner: Dzahn)
[06:24:45] (CR) Volans: [C: -1] "fixed typo in comment" (1 comment) [puppet] - https://gerrit.wikimedia.org/r/523013 (https://phabricator.wikimedia.org/T197126) (owner: CDanis)
[06:25:43] (PS5) Muehlenhoff: xhgui: Read LDAP servers from Hiera and switch to read-only replicas [puppet] - https://gerrit.wikimedia.org/r/524791 (https://phabricator.wikimedia.org/T227650)
[06:27:32] (PS3) Ema: ATS: split the cache for beta variant of the mobile site [puppet] - https://gerrit.wikimedia.org/r/524789 (https://phabricator.wikimedia.org/T227432)
[06:27:53] (CR) Muehlenhoff: [C: +2] xhgui: Read LDAP servers from Hiera and switch to read-only replicas [puppet] - https://gerrit.wikimedia.org/r/524791 (https://phabricator.wikimedia.org/T227650) (owner: Muehlenhoff)
[06:29:06] PROBLEM - puppet last run on mc2035 is CRITICAL: CRITICAL: Failed to apply catalog, zero resources tracked by Puppet. Could be an interrupted request or a dependency cycle. https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[06:39:48] Operations, ops-eqiad, DC-Ops: a2-eqiad pdu refresh - https://phabricator.wikimedia.org/T227138 (akosiaris)
[06:41:44] Operations, ops-eqiad, DC-Ops: a2-eqiad pdu refresh - https://phabricator.wikimedia.org/T227138 (akosiaris) conf1001 is fine to powerdown (no depool necessary), perform all wanted actions and then poweron as it will repool itself automatically
[06:43:49] Operations, ops-eqiad, DC-Ops: a7-eqiad pdu refresh - https://phabricator.wikimedia.org/T227143 (elukey)
[06:44:08] Operations, ops-eqiad, DC-Ops: a7-eqiad pdu refresh - https://phabricator.wikimedia.org/T227143 (elukey) Ok for the analytics nodes, hadoop workers that can go down without horrible consequences.
[06:44:47] Operations, ops-eqiad, DC-Ops: a3-eqiad pdu refresh - https://phabricator.wikimedia.org/T227139 (akosiaris)
[06:49:04] Operations, ops-eqiad, DC-Ops: a3-eqiad pdu refresh - https://phabricator.wikimedia.org/T227139 (akosiaris) emptying ganeti1006 will require some 15-30 mins of advance notice. Emptying it (in case me or @MoritzMuehlenhoff are not around) can be done per https://wikitech.wikimedia.org/wiki/Ganeti#Node...
[06:51:50] Operations, ops-eqiad, DC-Ops: a4-eqiad pdu refresh - https://phabricator.wikimedia.org/T227140 (akosiaris)
[06:53:42] Operations, ops-eqiad, DC-Ops: a4-eqiad pdu refresh - https://phabricator.wikimedia.org/T227140 (akosiaris) * emptying ganeti1005 will require some 15-30 mins of advance notice. Emptying it (in case me or @MoritzMuehlenhoff are not around) can be done per https://wikitech.wikimedia.org/wiki/Ganeti#No...
[06:56:58] Operations, ops-eqiad, DC-Ops: a5-eqiad pdu refresh - https://phabricator.wikimedia.org/T227141 (akosiaris)
[06:57:24] RECOVERY - puppet last run on mc2035 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[06:58:02] Operations, ops-eqiad, DC-Ops: a5-eqiad pdu refresh - https://phabricator.wikimedia.org/T227141 (akosiaris) * emptying ganeti1008 will require some 15-30 mins of advance notice. Emptying it (in case me or @MoritzMuehlenhoff are not around) can be done per https://wikitech.wikimedia.org/wiki/Ganeti#No...
[06:58:42] (PS2) Fsero: Add termbox-test release [deployment-charts] - https://gerrit.wikimedia.org/r/524817 (https://phabricator.wikimedia.org/T226814) (owner: Tarrow)
[07:00:07] (PS3) Fsero: Add termbox-test release [deployment-charts] - https://gerrit.wikimedia.org/r/524817 (https://phabricator.wikimedia.org/T226814) (owner: Tarrow)
[07:00:29] (CR) Fsero: [V: +2 C: +2] Add termbox-test release [deployment-charts] - https://gerrit.wikimedia.org/r/524817 (https://phabricator.wikimedia.org/T226814) (owner: Tarrow)
[07:01:00] Operations, ops-eqiad, DC-Ops: a6-eqiad pdu refresh - https://phabricator.wikimedia.org/T227142 (akosiaris)
[07:01:06] Operations, ops-eqiad, DC-Ops: a5-eqiad pdu refresh - https://phabricator.wikimedia.org/T227141 (Marostegui) From the DB side of things, this rack should be done **before** Thursday 30th 05:30AM UTC, as at that time db1128 will become phabricator master {T228243}
[07:02:04] Operations, ops-eqiad, DC-Ops: a6-eqiad pdu refresh - https://phabricator.wikimedia.org/T227142 (akosiaris) * emptying ganeti1006 will require some 15-30 mins of advance notice. Emptying it (in case me or @MoritzMuehlenhoff are not around) can be done per https://wikitech.wikimedia.org/wiki/Ganeti#No...
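The mc2035 PROBLEM/RECOVERY pair above comes from a "puppet last run" freshness check. A minimal Python sketch of the same idea, assuming a summary already parsed out of Puppet's `/var/lib/puppet/state/last_run_summary.yaml`; the function name and thresholds here are illustrative, not the production check:

```python
"""Sketch of a 'puppet last run' check in the Nagios convention.

Assumes the caller has already read last_run_summary.yaml; the
max_age threshold is an illustrative value, not the real one.
"""
import time

OK, WARNING, CRITICAL = 0, 1, 2  # Nagios plugin exit codes


def check_puppet_run(last_run_epoch, failures, resources_total,
                     now=None, max_age=3600):
    """Return a (status, message) pair for the most recent Puppet run."""
    now = now if now is not None else time.time()
    if resources_total == 0:
        # Zero resources tracked usually means the catalog failed to apply,
        # matching the mc2035 alert text above.
        return CRITICAL, "Failed to apply catalog, zero resources tracked"
    if failures > 0:
        return CRITICAL, f"Puppet run had {failures} failures"
    age = now - last_run_epoch
    if age > max_age:
        return WARNING, f"Last run {age:.0f}s ago"
    return OK, f"Last run {age:.0f}s ago with 0 failures"
```

A wrapper script would print the message and `sys.exit()` with the status so Icinga can pick it up.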
[07:02:15] Operations, ops-eqiad, DC-Ops: a1-eqiad pdu refresh - https://phabricator.wikimedia.org/T226782 (Marostegui) Good to go from the DB side
[07:02:57] Operations, ops-eqiad, DC-Ops: a7-eqiad pdu refresh - https://phabricator.wikimedia.org/T227143 (akosiaris)
[07:03:38] Operations, ops-eqiad, DC-Ops: a8-eqiad pdu refresh - https://phabricator.wikimedia.org/T227133 (akosiaris)
[07:04:32] Operations, ops-eqiad, DC-Ops: a5-eqiad pdu refresh - https://phabricator.wikimedia.org/T227141 (Marostegui)
[07:05:03] Operations, ops-eqiad, DC-Ops: a4-eqiad pdu refresh - https://phabricator.wikimedia.org/T227140 (akosiaris) Sigh, this was already done. I just hope the info added will be useful at some point in the future as a guide
[07:05:48] Operations, ops-eqiad, DC-Ops: a6-eqiad pdu refresh - https://phabricator.wikimedia.org/T227142 (Marostegui) This rack contains an active primary db master: db1066, this would need to be failed over if we are not confident about not losing power.
[07:06:16] Operations, ops-eqiad, DC-Ops: a6-eqiad pdu refresh - https://phabricator.wikimedia.org/T227142 (Marostegui)
[07:06:53] Operations, ops-eqiad, DC-Ops: a7-eqiad pdu refresh - https://phabricator.wikimedia.org/T227143 (Marostegui) From the DB side this rack is good to go
[07:06:59] Operations, ops-eqiad, DC-Ops: a5-eqiad pdu refresh - https://phabricator.wikimedia.org/T227141 (Joe)
[07:07:06] (PS1) Fsero: Revert "Add termbox-test release" [deployment-charts] - https://gerrit.wikimedia.org/r/525033
[07:07:22] (Abandoned) Fsero: Revert "Add termbox-test release" [deployment-charts] - https://gerrit.wikimedia.org/r/525033 (owner: Fsero)
[07:07:46] Operations, ops-eqiad, DC-Ops: a8-eqiad pdu refresh - https://phabricator.wikimedia.org/T227133 (Marostegui) From the DB side, this rack is good to go
[07:09:08] Operations, ops-eqiad, DC-Ops: a6-eqiad pdu refresh - https://phabricator.wikimedia.org/T227142 (Joe)
[07:09:35] Operations, ops-eqiad, DC-Ops: b1-eqiad pdu refresh - https://phabricator.wikimedia.org/T227536 (Marostegui)
[07:11:07] Operations, ops-eqiad, DC-Ops: b1-eqiad pdu refresh - https://phabricator.wikimedia.org/T227536 (Marostegui) From the DB side this can be done anytime
[07:12:30] Operations, ops-eqiad, DC-Ops: a7-eqiad pdu refresh - https://phabricator.wikimedia.org/T227143 (Joe)
[07:13:28] (PS4) Ema: ATS: split the cache for beta variant of the mobile site [puppet] - https://gerrit.wikimedia.org/r/524789 (https://phabricator.wikimedia.org/T227432)
[07:14:05] Operations, ops-eqiad, DC-Ops: b2-eqiad pdu refresh - https://phabricator.wikimedia.org/T227538 (Marostegui) From the DB side, this can be done **after** Thursday 25th as db1072 will no longer be a master
[07:14:26] Operations, ops-eqiad, DC-Ops: a5-eqiad pdu refresh - https://phabricator.wikimedia.org/T227141 (Marostegui)
[07:14:40] Operations, ops-eqiad, DC-Ops: a5-eqiad pdu refresh - https://phabricator.wikimedia.org/T227141 (Marostegui) >>! In T227141#5356279, @Marostegui wrote: > From the DB side of things, this rack should be done **before** Thursday 30th 05:30AM UTC, as at that time db1128 will become phabricator master {T...
[07:27:20] (PS2) Marostegui: mariadb: Productionize dbproxy2002 into m2-codfw [puppet] - https://gerrit.wikimedia.org/r/524963 (https://phabricator.wikimedia.org/T202367)
[07:29:20] PROBLEM - HHVM rendering on mw1328 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers
[07:30:52] RECOVERY - HHVM rendering on mw1328 is OK: HTTP OK: HTTP/1.1 200 OK - 77352 bytes in 0.103 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[07:31:23] !log Deploy grants for dbproxy2002 on m2 - T202367
[07:31:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:31:34] T202367: Productionize dbproxy101[2-7].eqiad.wmnet and dbproxy200[1-4] - https://phabricator.wikimedia.org/T202367
[07:38:11] (PS3) Marostegui: mariadb: Productionize dbproxy2002 into m2-codfw [puppet] - https://gerrit.wikimedia.org/r/524963 (https://phabricator.wikimedia.org/T202367)
[07:42:12] RECOVERY - Work requests waiting in Zuul Gearman server on contint1001 is OK: OK: Less than 30.00% above the threshold [90.0] https://www.mediawiki.org/wiki/Continuous_integration/Zuul https://grafana.wikimedia.org/dashboard/db/zuul-gearman?panelId=10&fullscreen&orgId=1
[07:42:27] ^^ I should probably make that alarm less sensitive
[07:44:36] PROBLEM - Eqsin HTTP 5xx reqs/min on graphite1004 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/d/myRmf1Pik/varnish-aggregate-client-status-codes?var-site=eqsin&var-status_type=5
[07:46:29] (CR) Hashar: [V: +1 C: -1] "NodeJS is still used :-( T228639" [puppet] - https://gerrit.wikimedia.org/r/524221 (https://phabricator.wikimedia.org/T225735) (owner: Hashar)
[07:46:56] PROBLEM - Text HTTP 5xx reqs/min on graphite1004 is CRITICAL: CRITICAL: 10.00% of data above the critical threshold [1000.0] https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/d/myRmf1Pik/varnish-aggregate-client-status-codes?var-cache_type=varnish-text&var-status_type=5
[07:50:38] (CR) Marostegui: "Puppet looks good: https://puppet-compiler.wmflabs.org/compiler1002/17571/" [puppet] - https://gerrit.wikimedia.org/r/524963 (https://phabricator.wikimedia.org/T202367) (owner: Marostegui)
[07:50:42] (CR) Marostegui: [C: +2] mariadb: Productionize dbproxy2002 into m2-codfw [puppet] - https://gerrit.wikimedia.org/r/524963 (https://phabricator.wikimedia.org/T202367)
[07:51:14] RECOVERY - Eqsin HTTP 5xx reqs/min on graphite1004 is OK: OK: Less than 1.00% above the threshold [250.0] https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/d/myRmf1Pik/varnish-aggregate-client-status-codes?var-site=eqsin&var-status_type=5
[07:51:47] seemed cp5011 related
[07:52:18] https://grafana.wikimedia.org/d/000000352/varnish-failed-fetches?orgId=1&from=now-3h&to=now&var-datasource=eqsin%20prometheus%2Fops&var-cache_type=text&var-server=All&var-layer=backend
[07:52:22] ema --^
[07:53:36] RECOVERY - Text HTTP 5xx reqs/min on graphite1004 is OK: OK: Less than 1.00% above the threshold [250.0] https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/d/myRmf1Pik/varnish-aggregate-client-status-codes?var-cache_type=varnish-text&var-status_type=5
[07:53:51] (PS1) Elukey: profile::tlsproxy::service: add more granularity in monitoring [puppet] - https://gerrit.wikimedia.org/r/525039 (https://phabricator.wikimedia.org/T227860)
[07:55:57] (PS2) Elukey: profile::tlsproxy::service: add more granularity in monitoring [puppet] - https://gerrit.wikimedia.org/r/525039 (https://phabricator.wikimedia.org/T227860)
[07:56:52] (PS3) Elukey: profile::tlsproxy::service: add more granularity in monitoring [puppet] - https://gerrit.wikimedia.org/r/525039 (https://phabricator.wikimedia.org/T227860)
[07:58:38] (CR) Alexandros Kosiaris: [V: +2 C: +2] Fix bug in scaffold configmap.yaml and deployment.yaml [deployment-charts] - https://gerrit.wikimedia.org/r/524938 (owner: Jeena Huneidi)
[07:58:48] (PS2) Hashar: contint: no more include ::contint::packages::ruby by default [puppet] - https://gerrit.wikimedia.org/r/524224 (https://phabricator.wikimedia.org/T225735)
[07:58:50] (PS2) Hashar: contint: remove contint::php [puppet] - https://gerrit.wikimedia.org/r/524225 (https://phabricator.wikimedia.org/T225735)
[07:58:52] (PS2) Hashar: contint: no more include ::packages::javascript by default [puppet] - https://gerrit.wikimedia.org/r/524221 (https://phabricator.wikimedia.org/T225735)
[07:58:54] (PS4) Alexandros Kosiaris: Fix bug in scaffold configmap.yaml and deployment.yaml [deployment-charts] - https://gerrit.wikimedia.org/r/524938 (owner: Jeena Huneidi)
[07:58:56] (CR) Alexandros Kosiaris: [V: +2 C: +2] Fix bug in scaffold configmap.yaml and deployment.yaml [deployment-charts] - https://gerrit.wikimedia.org/r/524938 (owner: Jeena Huneidi)
[08:00:05] (CR) Hashar: [C: -1] "Due to T228639" [puppet] - https://gerrit.wikimedia.org/r/524221 (https://phabricator.wikimedia.org/T225735) (owner: Hashar)
[08:00:21] (PS1) Muehlenhoff: Add poolcounter1004 [mediawiki-config] - https://gerrit.wikimedia.org/r/525040 (https://phabricator.wikimedia.org/T224572)
[08:00:45] Operations, Traffic: TLS config issue for nginx on Buster - https://phabricator.wikimedia.org/T228730 (elukey)
[08:01:18] Operations, Analytics, Analytics-Kanban, Traffic, and 2 others: TLS certificates for Analytics origin servers - https://phabricator.wikimedia.org/T227860 (elukey) Open→Stalled Stalled due to https://phabricator.wikimedia.org/T228730
[08:06:01] (PS4) Elukey: profile::tlsproxy::service: add more granularity in monitoring [puppet] - https://gerrit.wikimedia.org/r/525039 (https://phabricator.wikimedia.org/T227860)
[08:08:06] (CR) Elukey: "https://puppet-compiler.wmflabs.org/compiler1002/17574/" [puppet] - https://gerrit.wikimedia.org/r/525039 (https://phabricator.wikimedia.org/T227860) (owner: Elukey)
[08:08:19] (Abandoned) Hashar: contint: no more include ::packages::javascript by default [puppet] - https://gerrit.wikimedia.org/r/524221 (https://phabricator.wikimedia.org/T225735) (owner: Hashar)
[08:08:29] !log Stop MySQL on db2044 to test dbproxy2002 notifications - T202367
[08:08:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:08:36] T202367: Productionize dbproxy101[2-7].eqiad.wmnet and dbproxy200[1-4] - https://phabricator.wikimedia.org/T202367
[08:12:45] PROBLEM - haproxy failover on dbproxy2002 is CRITICAL: CRITICAL check_failover servers up 1 down 1 https://wikitech.wikimedia.org/wiki/HAProxy
[08:13:08] ^ that is me
[08:16:16] (PS1) Marostegui: dbproxy2001: Enable notifications [puppet] - https://gerrit.wikimedia.org/r/525042 (https://phabricator.wikimedia.org/T202367)
[08:16:23] RECOVERY - haproxy failover on dbproxy2002 is OK: OK check_failover servers up 2 down 0 https://wikitech.wikimedia.org/wiki/HAProxy
[08:17:35] (CR) Marostegui: [C: +2] dbproxy2001: Enable notifications [puppet] - https://gerrit.wikimedia.org/r/525042 (https://phabricator.wikimedia.org/T202367) (owner: Marostegui)
[08:17:58] (CR) Ema: [C: +1] profile::tlsproxy::service: add more granularity in monitoring [puppet] - https://gerrit.wikimedia.org/r/525039 (https://phabricator.wikimedia.org/T227860) (owner: Elukey)
[08:22:09] (PS9) Hashar: releases: inline php packages installation [puppet] - https://gerrit.wikimedia.org/r/523147 (https://phabricator.wikimedia.org/T225735)
[08:22:11] (PS5) Hashar: contint: remove php packages [puppet] - https://gerrit.wikimedia.org/r/523148 (https://phabricator.wikimedia.org/T225735)
[08:22:14] (PS5) Hashar: contint: apply apt::unattend_upgrade at role level [puppet] - https://gerrit.wikimedia.org/r/523150 (https://phabricator.wikimedia.org/T225735)
[08:22:16] (PS2) Hashar: contint: remove sqlite3 debian package [puppet] - https://gerrit.wikimedia.org/r/524219 (https://phabricator.wikimedia.org/T225735)
[08:22:18] (PS3) Hashar: contint: no more include ::contint::packages::ruby by default [puppet] - https://gerrit.wikimedia.org/r/524224 (https://phabricator.wikimedia.org/T225735)
[08:22:20] (PS3) Hashar: contint: remove contint::php [puppet] - https://gerrit.wikimedia.org/r/524225 (https://phabricator.wikimedia.org/T225735)
[08:29:33] (PS1) Marostegui: db-eqiad.php: Depool db1100 [mediawiki-config] - https://gerrit.wikimedia.org/r/525043
[08:31:40] (CR) Muehlenhoff: [C: -1] releases: inline php packages installation (1 comment) [puppet] - https://gerrit.wikimedia.org/r/523147 (https://phabricator.wikimedia.org/T225735) (owner: Hashar)
[08:31:45] (CR) Marostegui: [C: +2] db-eqiad.php: Depool db1100 [mediawiki-config] - https://gerrit.wikimedia.org/r/525043 (owner: Marostegui)
[08:32:42] (Merged) jenkins-bot: db-eqiad.php: Depool db1100 [mediawiki-config] - https://gerrit.wikimedia.org/r/525043 (owner: Marostegui)
[08:33:06] (CR) jenkins-bot: db-eqiad.php: Depool db1100 [mediawiki-config] - https://gerrit.wikimedia.org/r/525043 (owner: Marostegui)
[08:33:59] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Depool db1100 for upgrade (duration: 00m 53s)
[08:34:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:34:46] !log Upgrade db1100
[08:34:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:36:35] (CR) Hashar: releases: inline php packages installation (1 comment) [puppet] - https://gerrit.wikimedia.org/r/523147 (https://phabricator.wikimedia.org/T225735) (owner: Hashar)
[08:36:53] (PS10) Hashar: releases: inline php packages installation [puppet] - https://gerrit.wikimedia.org/r/523147 (https://phabricator.wikimedia.org/T225735)
[08:37:17] (CR) Hashar: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/523147 (https://phabricator.wikimedia.org/T225735) (owner: Hashar)
[08:38:10] (CR) Filippo Giunchedi: [C: +1] hiera: deploy varnishkafka exporter to ulsfo [puppet] - https://gerrit.wikimedia.org/r/524930 (https://phabricator.wikimedia.org/T196066) (owner: Cwhite)
[08:38:31] (CR) Filippo Giunchedi: [C: +1] hiera: deploy varnishkafka exporter to esams [puppet] - https://gerrit.wikimedia.org/r/524931 (https://phabricator.wikimedia.org/T196066) (owner: Cwhite)
[08:39:34] (CR) Hashar: [V: +1] "Puppet compiler for production host releases1001.eqiad.wmnet:" [puppet] - https://gerrit.wikimedia.org/r/523147 (https://phabricator.wikimedia.org/T225735) (owner: Hashar)
[08:39:57] (CR) Filippo Giunchedi: [C: +1] hiera: deploy varnishkafka exporter to eqiad [puppet] - https://gerrit.wikimedia.org/r/524933 (https://phabricator.wikimedia.org/T196066) (owner: Cwhite)
[08:40:01] (CR) Filippo Giunchedi: [C: +1] hiera: deploy varnishkafka exporter to codfw [puppet] - https://gerrit.wikimedia.org/r/524932 (https://phabricator.wikimedia.org/T196066) (owner: Cwhite)
[08:43:39] (CR) Muehlenhoff: [C: +2] releases: inline php packages installation [puppet] - https://gerrit.wikimedia.org/r/523147 (https://phabricator.wikimedia.org/T225735) (owner: Hashar)
[08:45:38] (PS6) Muehlenhoff: contint: remove php packages [puppet] - https://gerrit.wikimedia.org/r/523148 (https://phabricator.wikimedia.org/T225735) (owner: Hashar)
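The dbproxy2002 test above (stopping MySQL on db2044 to trigger "CRITICAL check_failover servers up 1 down 1") exercises a check_failover-style HAProxy check. A rough Python sketch of such a check: it parses the CSV that HAProxy's stats endpoint emits (the `svname`/`status` field names follow HAProxy's real stats format, but the wiring, thresholds, and sample data below are illustrative):

```python
"""Sketch of a check_failover-style HAProxy check.

Counts real backend servers that report UP vs DOWN in HAProxy's
CSV stats output and returns a Nagios-style status.
"""
import csv
import io

OK, CRITICAL = 0, 2  # Nagios plugin exit codes


def check_failover(stats_csv):
    up = down = 0
    # HAProxy prefixes the CSV header with "# "; strip it so the
    # column names line up for DictReader.
    reader = csv.DictReader(io.StringIO(stats_csv.lstrip("# ")))
    for row in reader:
        if row["svname"] in ("FRONTEND", "BACKEND"):
            continue  # aggregate rows, not real servers
        if row["status"].startswith("UP"):
            up += 1
        elif row["status"].startswith("DOWN"):
            down += 1
    if down > 0:
        return CRITICAL, f"CRITICAL check_failover servers up {up} down {down}"
    return OK, f"OK check_failover servers up {up} down {down}"


# Illustrative stats snippet: one backend down (as during the db2044 test).
sample = "# pxname,svname,status\nm2,srv1,DOWN\nm2,srv2,UP\nm2,BACKEND,UP\n"
```

With the sample input this reproduces the alert text seen above; once both backends report UP again the check recovers, matching the [08:16:23] RECOVERY line.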
[08:47:35] (CR) Muehlenhoff: [C: +2] contint: remove php packages [puppet] - https://gerrit.wikimedia.org/r/523148 (https://phabricator.wikimedia.org/T225735) (owner: Hashar)
[08:49:17] (PS6) Muehlenhoff: contint: apply apt::unattend_upgrade at role level [puppet] - https://gerrit.wikimedia.org/r/523150 (https://phabricator.wikimedia.org/T225735) (owner: Hashar)
[08:50:22] (CR) Muehlenhoff: [C: +2] contint: apply apt::unattend_upgrade at role level [puppet] - https://gerrit.wikimedia.org/r/523150 (https://phabricator.wikimedia.org/T225735) (owner: Hashar)
[08:51:29] (PS3) Muehlenhoff: contint: remove sqlite3 debian package [puppet] - https://gerrit.wikimedia.org/r/524219 (https://phabricator.wikimedia.org/T225735) (owner: Hashar)
[08:52:54] (CR) Filippo Giunchedi: "> Patch Set 4:" [puppet] - https://gerrit.wikimedia.org/r/524625 (https://phabricator.wikimedia.org/T227364) (owner: EBernhardson)
[08:53:26] (CR) Muehlenhoff: [C: +2] contint: remove sqlite3 debian package [puppet] - https://gerrit.wikimedia.org/r/524219 (https://phabricator.wikimedia.org/T225735) (owner: Hashar)
[08:54:20] (PS4) Muehlenhoff: contint: no more include ::contint::packages::ruby by default [puppet] - https://gerrit.wikimedia.org/r/524224 (https://phabricator.wikimedia.org/T225735) (owner: Hashar)
[08:56:08] Operations, serviceops, Core Platform Team Workboards (Green): Keys from MediaWiki Redis Instances - https://phabricator.wikimedia.org/T228703 (Joe)
[08:57:20] Operations, ops-eqiad, DBA: Upgrade db1100 firmware and BIOS - https://phabricator.wikimedia.org/T228732 (Marostegui)
[08:58:18] (CR) Muehlenhoff: [C: +2] contint: no more include ::contint::packages::ruby by default [puppet] - https://gerrit.wikimedia.org/r/524224 (https://phabricator.wikimedia.org/T225735) (owner: Hashar)
[08:59:20] (PS4) Muehlenhoff: contint: remove contint::php [puppet] - https://gerrit.wikimedia.org/r/524225 (https://phabricator.wikimedia.org/T225735) (owner: Hashar)
[08:59:22] Operations, serviceops, PHP 7.2 support, Performance-Team (Radar), and 2 others: PHP 7 corruption during deployment (was: PHP 7 fatals on mw1262) - https://phabricator.wikimedia.org/T224491 (Joe) >>! In T224491#5354568, @Krinkle wrote: > Logstash query for the error in question: > >
Operations, LDAP-Access-Requests, Release-Engineering-Team: Add more SREs to gerritadmin LDAP group - https://phabricator.wikimedia.org/T228733 (akosiaris)
[08:59:57] (CR) Filippo Giunchedi: [C: -1] "IMHO we should remove the flag altogether once the rollout is complete, it has a different semantic that the existing profile::cache::kafk" [puppet] - https://gerrit.wikimedia.org/r/524934 (https://phabricator.wikimedia.org/T196066) (owner: Cwhite)
[09:00:09] (PS1) Marostegui: db-eqiad.php: Slowly repool db1100 [mediawiki-config] - https://gerrit.wikimedia.org/r/525045
[09:00:14] Operations, DBA: Reboot, upgrade firmware and kernel of db1096-db1106, db2071-db2092 - https://phabricator.wikimedia.org/T216240 (Marostegui)
[09:01:31] (CR) Marostegui: [C: +2] db-eqiad.php: Slowly repool db1100 [mediawiki-config] - https://gerrit.wikimedia.org/r/525045 (owner: Marostegui)
[09:02:15] (CR) Alexandros Kosiaris: [C: +2] Add poolcounter1004 [mediawiki-config] - https://gerrit.wikimedia.org/r/525040 (https://phabricator.wikimedia.org/T224572) (owner: Muehlenhoff)
[09:02:17] (CR) Muehlenhoff: [C: +2] contint: remove contint::php [puppet] - https://gerrit.wikimedia.org/r/524225 (https://phabricator.wikimedia.org/T225735) (owner: Hashar)
[09:02:22] (Merged) jenkins-bot: db-eqiad.php: Slowly repool db1100 [mediawiki-config] - https://gerrit.wikimedia.org/r/525045 (owner: Marostegui)
[09:02:29] (Merged) jenkins-bot: Add poolcounter1004 [mediawiki-config] - https://gerrit.wikimedia.org/r/525040 (https://phabricator.wikimedia.org/T224572) (owner: Muehlenhoff)
[09:02:40] (CR) jenkins-bot: db-eqiad.php: Slowly repool db1100 [mediawiki-config] - https://gerrit.wikimedia.org/r/525045 (owner: Marostegui)
[09:02:43] Operations, LDAP-Access-Requests, Release-Engineering-Team: Add more SREs to gerritadmin LDAP group - https://phabricator.wikimedia.org/T228733 (akosiaris) p: Triage→Normal
[09:03:28] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Slowly repool db1100 after upgrade (duration: 00m 46s)
[09:03:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:04:35] (PS1) Marostegui: db-eqiad.php: Repool db1100 into API [mediawiki-config] - https://gerrit.wikimedia.org/r/525047
[09:04:44] (CR) jenkins-bot: Add poolcounter1004 [mediawiki-config] - https://gerrit.wikimedia.org/r/525040 (https://phabricator.wikimedia.org/T224572) (owner: Muehlenhoff)
[09:09:04] !log akosiaris@deploy1001 Synchronized wmf-config/ProductionServices.php: (no justification provided) (duration: 00m 47s)
[09:09:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:09:11] (Abandoned) Volans: Netbox: set media directory [puppet] - https://gerrit.wikimedia.org/r/463820 (https://phabricator.wikimedia.org/T190184) (owner: Volans)
[09:09:28] Operations, LDAP-Access-Requests, Release-Engineering-Team: Add more SREs to gerritadmin LDAP group - https://phabricator.wikimedia.org/T228733 (hashar) The change has been done after T218761 (private). The Gerrit Administrators group is now tied to the `gerritadmin` LDAP group ( https://gerrit.wiki...
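The db1100 maintenance above follows a staged-repool pattern: depool, upgrade, "slowly repool", then repool into API, syncing wmf-config/db-eqiad.php between steps. A hypothetical sketch of generating such a ramp; the step fractions and the weight number are invented for illustration (the real weights live in db-eqiad.php and are not shown in this log):

```python
"""Sketch of a staged repool: restore a database server's traffic
weight in increasing steps rather than all at once.

The host name, full_weight value, and step fractions are illustrative.
"""


def repool_steps(host, full_weight, steps=(0.1, 0.5, 1.0)):
    """Yield (description, weight) pairs for a gradual repool."""
    for fraction in steps:
        weight = int(full_weight * fraction)
        yield (f"Repool {host} at {int(fraction * 100)}%", weight)


plan = list(repool_steps("db1100", 200))
# Each step would be committed and synced to the app servers
# (e.g. via scap sync-file, as in the !log lines above) and the
# host watched for lag/errors before moving to the next step.
```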
[09:10:16] (CR) Filippo Giunchedi: [C: +1] "LGTM, indeed bsd-mailx will be pulled in by icinga" [puppet] - https://gerrit.wikimedia.org/r/494464 (owner: Volans)
[09:10:46] (CR) Marostegui: [C: +2] db-eqiad.php: Repool db1100 into API [mediawiki-config] - https://gerrit.wikimedia.org/r/525047 (owner: Marostegui)
[09:11:49] Operations, serviceops: Migrate pool counters to Stretch/Buster - https://phabricator.wikimedia.org/T224572 (akosiaris) poolcounter1004 has just been added
[09:12:32] (Merged) jenkins-bot: db-eqiad.php: Repool db1100 into API [mediawiki-config] - https://gerrit.wikimedia.org/r/525047 (owner: Marostegui)
[09:12:47] (CR) jenkins-bot: db-eqiad.php: Repool db1100 into API [mediawiki-config] - https://gerrit.wikimedia.org/r/525047 (owner: Marostegui)
[09:12:57] Operations, ops-eqiad, DC-Ops: a2-eqiad pdu refresh - https://phabricator.wikimedia.org/T227138 (fgiunchedi) For ms-be same as {T227140}
[09:13:06] (PS4) Volans: icinga: set Reply-To header to email notifications [puppet] - https://gerrit.wikimedia.org/r/494464
[09:13:08] Operations, LDAP-Access-Requests, Release-Engineering-Team: Add more SREs to gerritadmin LDAP group - https://phabricator.wikimedia.org/T228733 (Peachey88)
[09:13:24] Operations, LDAP-Access-Requests, Release-Engineering-Team: Add more SREs to gerritadmin LDAP group - https://phabricator.wikimedia.org/T228733 (Peachey88)
[09:13:36] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Slowly repool into API db1100 after upgrade (duration: 00m 47s)
[09:13:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:14:46] Operations, ops-eqiad, DC-Ops: a3-eqiad pdu refresh - https://phabricator.wikimedia.org/T227139 (fgiunchedi)
[09:16:51] Operations, ops-eqiad, DC-Ops: a3-eqiad pdu refresh - https://phabricator.wikimedia.org/T227139 (Marostegui)
[09:17:55] Operations, ops-eqiad, DC-Ops: b1-eqiad pdu refresh - https://phabricator.wikimedia.org/T227536 (Marostegui)
[09:18:06] (PS3) Alexandros Kosiaris: k8s: introducing termbox-test.staging.svc.eqiad.wmnet [dns] - https://gerrit.wikimedia.org/r/524797 (https://phabricator.wikimedia.org/T226814) (owner: Fsero)
[09:18:21] (CR) jerkins-bot: [V: -1] k8s: introducing termbox-test.staging.svc.eqiad.wmnet [dns] - https://gerrit.wikimedia.org/r/524797 (https://phabricator.wikimedia.org/T226814) (owner: Fsero)
[09:18:52] (PS1) Hashar: gerrit: remove a no more existing group [puppet] - https://gerrit.wikimedia.org/r/525048
[09:19:50] (CR) Hashar: "The repository.ownerGroup is looked up by name, however groups can be renamed and a new one could use a previously used name." [puppet] - https://gerrit.wikimedia.org/r/525048 (owner: Hashar)
[09:20:27] (PS1) Marostegui: db-eqiad.php: Fully repool db1100 into API [mediawiki-config] - https://gerrit.wikimedia.org/r/525049
[09:20:38] (CR) Volans: [C: +2] icinga: set Reply-To header to email notifications [puppet] - https://gerrit.wikimedia.org/r/494464 (owner: Volans)
[09:21:41] (CR) Marostegui: [C: +2] db-eqiad.php: Fully repool db1100 into API [mediawiki-config] - https://gerrit.wikimedia.org/r/525049 (owner: Marostegui)
[09:22:29] Operations, ops-eqiad, DC-Ops: a3-eqiad pdu refresh - https://phabricator.wikimedia.org/T227139 (fgiunchedi) restbase / logstash / graphite / prometheus hosts should be fine in an event of power loss, if feeling nice restbase and prometheus should be depooled. for the logstash host we could disable e...
[09:22:33] (03Merged) 10jenkins-bot: db-eqiad.php: Fully repool db1100 into API [mediawiki-config] - 10https://gerrit.wikimedia.org/r/525049 (owner: 10Marostegui) [09:22:48] (03CR) 10jenkins-bot: db-eqiad.php: Fully repool db1100 into API [mediawiki-config] - 10https://gerrit.wikimedia.org/r/525049 (owner: 10Marostegui) [09:23:37] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Fully repool into API db1100 after upgrade (duration: 00m 46s) [09:23:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:24:56] 10Operations, 10ops-eqiad, 10DC-Ops: a2-eqiad pdu refresh - https://phabricator.wikimedia.org/T227138 (10fgiunchedi) [09:25:14] (03Restored) 10Fsero: Revert "Add termbox-test release" [deployment-charts] - 10https://gerrit.wikimedia.org/r/525033 (owner: 10Fsero) [09:25:47] (03PS2) 10Fsero: Revert "Add termbox-test release" [deployment-charts] - 10https://gerrit.wikimedia.org/r/525033 [09:27:53] 10Operations, 10ops-eqiad, 10DC-Ops: a3-eqiad pdu refresh - https://phabricator.wikimedia.org/T227139 (10fgiunchedi) [09:28:10] (03CR) 10Fsero: [V: 03+2 C: 03+2] "this needs more changes and was merged prematurely" [deployment-charts] - 10https://gerrit.wikimedia.org/r/525033 (owner: 10Fsero) [09:28:33] 10Operations, 10ops-eqiad, 10DC-Ops: a5-eqiad pdu refresh - https://phabricator.wikimedia.org/T227141 (10fgiunchedi) [09:29:15] 10Operations, 10ops-eqiad, 10DC-Ops: a6-eqiad pdu refresh - https://phabricator.wikimedia.org/T227142 (10fgiunchedi) [09:30:05] 10Operations, 10ops-eqiad, 10DC-Ops: a7-eqiad pdu refresh - https://phabricator.wikimedia.org/T227143 (10fgiunchedi) [09:30:54] 10Operations, 10Wikimedia-Mailing-lists: New Mailing lists for AzWiki sysops - https://phabricator.wikimedia.org/T228542 (10Eldarado) p:05Triage→03Normal [09:33:04] 10Operations, 10ops-eqiad, 10DC-Ops: b1-eqiad pdu refresh - https://phabricator.wikimedia.org/T227536 (10fgiunchedi) [09:33:54] 10Operations, 10ops-eqiad, 10DC-Ops: b2-eqiad pdu refresh - 
https://phabricator.wikimedia.org/T227538 (10fgiunchedi) [09:40:18] hashar: thanks a lot for the projectviews task! [09:40:46] (03PS1) 10Elukey: Set async replication for mcrouter on mw api/appserv canaries [puppet] - 10https://gerrit.wikimedia.org/r/525053 (https://phabricator.wikimedia.org/T225642) [09:43:33] (03PS1) 10Fsero: Add termbox-test release [deployment-charts] - 10https://gerrit.wikimedia.org/r/525054 (https://phabricator.wikimedia.org/T226814) [09:43:51] (03PS2) 10Fsero: Add termbox-test release [deployment-charts] - 10https://gerrit.wikimedia.org/r/525054 (https://phabricator.wikimedia.org/T226814) [09:44:23] (03PS3) 10Fsero: Add termbox-test release [deployment-charts] - 10https://gerrit.wikimedia.org/r/525054 (https://phabricator.wikimedia.org/T226814) [09:45:20] (03CR) 10Elukey: "https://puppet-compiler.wmflabs.org/compiler1001/17575/" [puppet] - 10https://gerrit.wikimedia.org/r/525053 (https://phabricator.wikimedia.org/T225642) (owner: 10Elukey) [09:46:16] (03PS1) 10Muehlenhoff: Enable poolcounter1005, disable poolcounter1001 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/525055 (https://phabricator.wikimedia.org/T224572) [09:46:28] (03PS4) 10Alexandros Kosiaris: k8s: introducing termbox-test.staging.svc.eqiad.wmnet [dns] - 10https://gerrit.wikimedia.org/r/524797 (https://phabricator.wikimedia.org/T226814) (owner: 10Fsero) [09:47:15] elukey: you are welcome. 
Though I have absolutely no idea about what is going on :-\ [09:47:41] (03CR) 10Alexandros Kosiaris: [C: 03+2] Enable poolcounter1005, disable poolcounter1001 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/525055 (https://phabricator.wikimedia.org/T224572) (owner: 10Muehlenhoff) [09:47:54] (03PS4) 10Fsero: Add termbox-test release [deployment-charts] - 10https://gerrit.wikimedia.org/r/525054 (https://phabricator.wikimedia.org/T226814) [09:48:34] (03Merged) 10jenkins-bot: Enable poolcounter1005, disable poolcounter1001 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/525055 (https://phabricator.wikimedia.org/T224572) (owner: 10Muehlenhoff) [09:48:55] (03CR) 10jenkins-bot: Enable poolcounter1005, disable poolcounter1001 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/525055 (https://phabricator.wikimedia.org/T224572) (owner: 10Muehlenhoff) [09:51:03] !log akosiaris@deploy1001 Synchronized wmf-config/ProductionServices.php: (no justification provided) (duration: 00m 47s) [09:51:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:51:38] !log enable poolcounter1005, disable poolcounter1001 T224572 [09:51:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:51:45] T224572: Migrate pool counters to Stretch/Buster - https://phabricator.wikimedia.org/T224572 [09:53:57] !log Drop abuse_filter_log.afl_log_id from s6 codfw with replication (this will cause lag in s6 codfw) - T226851 [09:54:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:54:04] T226851: Drop abuse_filter_log.afl_log_id in production - https://phabricator.wikimedia.org/T226851 [09:54:10] hashar: should be fixed now [09:54:53] ah no not yet, weird [09:55:18] (was looking at cached content, it is indeed fixed) [09:55:40] (03PS5) 10Fsero: Add termbox-test release [deployment-charts] - 10https://gerrit.wikimedia.org/r/525054 (https://phabricator.wikimedia.org/T226814) [09:55:58] (03CR) 10Fsero: [V: 03+2 
C: 03+2] Add termbox-test release [deployment-charts] - 10https://gerrit.wikimedia.org/r/525054 (https://phabricator.wikimedia.org/T226814) (owner: 10Fsero) [09:56:42] elukey: seems good yes. At least for /other/pageviews/ but some others might have been affected :-] [09:57:26] !log jmm@cumin2001 START - Cookbook sre.hosts.downtime [09:57:27] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [09:57:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:57:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:57:48] hashar: the other one that we changed was pageviews, already checked and we should be ok [09:58:22] elukey: cool, feel free to flag https://phabricator.wikimedia.org/T228731 as resolved, unless there is a followup action to monitor those jobs (though that can be made another standalone task) [09:59:04] !log fsero@ helmfile [STAGING] Ran 'apply' command on namespace 'termbox' for release 'test' . [09:59:04] !log fsero@ helmfile [STAGING] Ran 'apply' command on namespace 'termbox' for release 'staging' . 
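A side note on the bot mechanics visible throughout this log: whenever a !log message mentions a bare Phabricator task ID, the channel bot follows up with a "Txxxxxx: <title> - <url>" line (see the T224572 and T226851 echoes above). A minimal sketch of the ID-extraction step that behaviour implies, with hypothetical function names and no claim about the real bot's code:

```python
import re

# Hedged illustration: pull Phabricator task IDs (a "T" followed by digits)
# out of a "!log" message, as the SAL bot above appears to do before echoing
# back "Txxxxxx: <title> - <url>" lines. Not the actual bot implementation.
TASK_RE = re.compile(r"\bT\d+\b")

def extract_task_ids(log_message):
    """Return the unique task IDs mentioned in a message, in order."""
    seen = []
    for task in TASK_RE.findall(log_message):
        if task not in seen:
            seen.append(task)
    return seen

def task_url(task_id):
    """Build the canonical Phabricator URL for a task ID."""
    return "https://phabricator.wikimedia.org/" + task_id
```

The real bot additionally resolves each ID to its task title via the Phabricator API, which is omitted here.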
[09:59:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:59:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:00:40] PROBLEM - Mediawiki Cirrussearch update rate - eqiad on icinga1001 is CRITICAL: CRITICAL: 20.00% of data under the critical threshold [50.0] https://wikitech.wikimedia.org/wiki/Search%23No_updates https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?panelId=44&fullscreen&orgId=1 [10:01:03] (03CR) 10Fsero: [C: 03+2] k8s: introducing termbox-test.staging.svc.eqiad.wmnet [dns] - 10https://gerrit.wikimedia.org/r/524797 (https://phabricator.wikimedia.org/T226814) (owner: 10Fsero) [10:01:14] PROBLEM - Mediawiki Cirrussearch update rate - codfw on icinga1001 is CRITICAL: CRITICAL: 20.00% of data under the critical threshold [50.0] https://wikitech.wikimedia.org/wiki/Search%23No_updates https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?panelId=44&fullscreen&orgId=1 [10:01:22] (03CR) 10Fsero: [C: 03+2] "ty for the changes :)" [dns] - 10https://gerrit.wikimedia.org/r/524797 (https://phabricator.wikimedia.org/T226814) (owner: 10Fsero) [10:02:46] !log installing Java security updates on notebook/stat hosts [10:02:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:06:57] 10Operations, 10Wikidata, 10Wikidata-Termbox-Hike, 10serviceops, and 4 others: New Service Request: Wikidata Termbox SSR - https://phabricator.wikimedia.org/T212189 (10fsero) [10:09:18] (03PS1) 10Muehlenhoff: kibana: Read LDAP servers from standard Hiera and switch to read-only replicas [puppet] - 10https://gerrit.wikimedia.org/r/525057 (https://phabricator.wikimedia.org/T227650) [10:10:14] (03CR) 10jerkins-bot: [V: 04-1] kibana: Read LDAP servers from standard Hiera and switch to read-only replicas [puppet] - 10https://gerrit.wikimedia.org/r/525057 (https://phabricator.wikimedia.org/T227650) (owner: 10Muehlenhoff) [10:15:11] (03PS2) 10Muehlenhoff: kibana: Read LDAP servers 
from standard Hiera and switch to read-only replicas [puppet] - 10https://gerrit.wikimedia.org/r/525057 (https://phabricator.wikimedia.org/T227650) [10:16:06] (03CR) 10jerkins-bot: [V: 04-1] kibana: Read LDAP servers from standard Hiera and switch to read-only replicas [puppet] - 10https://gerrit.wikimedia.org/r/525057 (https://phabricator.wikimedia.org/T227650) (owner: 10Muehlenhoff) [10:16:12] jouncebot: next [10:16:12] In 0 hour(s) and 43 minute(s): European Mid-day SWAT(Max 6 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190723T1100) [10:17:21] !log Drop abuse_filter_log.afl_log_id from db1096:3316, db1139:3316 and dbstore1005:3316 T226851 [10:17:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:17:30] T226851: Drop abuse_filter_log.afl_log_id in production - https://phabricator.wikimedia.org/T226851 [10:18:07] (03PS5) 10Jbond: lookup checks: add checks to warn against using hiera and advice lookup [puppet-lint/wmf_styleguide-check] - 10https://gerrit.wikimedia.org/r/522526 (https://phabricator.wikimedia.org/T220820) [10:20:17] (03CR) 10Jbond: "thanks see responses inline" (033 comments) [puppet-lint/wmf_styleguide-check] - 10https://gerrit.wikimedia.org/r/522526 (https://phabricator.wikimedia.org/T220820) (owner: 10Jbond) [10:24:22] (03CR) 10Volans: [C: 03+1] "LGTM" [puppet-lint/wmf_styleguide-check] - 10https://gerrit.wikimedia.org/r/522526 (https://phabricator.wikimedia.org/T220820) (owner: 10Jbond) [10:24:49] 10Operations, 10ops-eqiad, 10DC-Ops: a1-eqiad pdu refresh - https://phabricator.wikimedia.org/T226782 (10Marostegui) [10:24:54] RECOVERY - Mediawiki Cirrussearch update rate - codfw on icinga1001 is OK: OK: Less than 1.00% under the threshold [80.0] https://wikitech.wikimedia.org/wiki/Search%23No_updates https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?panelId=44&fullscreen&orgId=1 [10:25:48] RECOVERY - Mediawiki Cirrussearch update rate - eqiad on icinga1001 is OK: OK: Less 
than 1.00% under the threshold [80.0] https://wikitech.wikimedia.org/wiki/Search%23No_updates https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?panelId=44&fullscreen&orgId=1 [10:27:40] I'll start cutting the branch momentarily for this week's train [10:42:22] RECOVERY - Check systemd state on puppetmaster1003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:44:43] (03CR) 10Volans: [C: 03+1] "The change looks formally ok to me. I can't speak for the values in the RWStore.properties though or the effects of loading a different co" [puppet] - 10https://gerrit.wikimedia.org/r/524954 (https://phabricator.wikimedia.org/T228122) (owner: 10Smalyshev) [10:46:04] (03CR) 10Volans: [C: 03+1] "LGTM" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/524825 (owner: 10CDanis) [10:55:34] (03PS1) 10Tarrow: Enable termbox on testwikidatawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/525062 (https://phabricator.wikimedia.org/T227459) [10:56:40] (03CR) 10Tarrow: "This change is ready for review." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/525062 (https://phabricator.wikimedia.org/T227459) (owner: 10Tarrow) [10:56:44] (03CR) 10jerkins-bot: [V: 04-1] Enable termbox on testwikidatawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/525062 (https://phabricator.wikimedia.org/T227459) (owner: 10Tarrow) [10:58:19] (03PS2) 10Tarrow: Enable termbox on testwikidatawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/525062 (https://phabricator.wikimedia.org/T227459) [11:00:04] Amir1, Lucas_WMDE, awight, and Urbanecm: I, the Bot under the Fountain, allow thee, The Deployer, to do European Mid-day SWAT(Max 6 patches) deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190723T1100). [11:00:04] No GERRIT patches in the queue for this window AFAICS. 
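The Cirrussearch update-rate alerts above fire with messages like "CRITICAL: 20.00% of data under the critical threshold [50.0]" and recover with "Less than 1.00% under the threshold [80.0]". A rough Python sketch of that fraction-below-threshold logic, illustrative only and not the actual check code (which also distinguishes warning and critical thresholds):

```python
# Hedged sketch: given a window of datapoints, alert when too large a
# fraction of them falls below a threshold, in the spirit of the
# "% of data under the critical threshold" messages above.

def fraction_below(datapoints, threshold):
    """Fraction of datapoints strictly below the threshold."""
    if not datapoints:
        return 0.0
    return sum(1 for v in datapoints if v < threshold) / len(datapoints)

def check_update_rate(datapoints, critical_threshold=50.0, critical_fraction=0.20):
    """Return (status, message) shaped like the alert text above."""
    frac = fraction_below(datapoints, critical_threshold)
    if frac >= critical_fraction:
        return ("CRITICAL",
                f"{frac:.2%} of data under the critical threshold [{critical_threshold}]")
    return ("OK",
            f"Less than {critical_fraction:.2%} under the threshold [{critical_threshold}]")
```

With five datapoints of which one sits below 50, the fraction is exactly 20%, which is why the alerts above trip at "20.00%".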
[11:04:13] I just stuck in a patch for SWAT: https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/525062 anyone object to me doing it now? [11:06:44] tarrow: go on! let me know if you have any questions [11:07:13] Amir1: Awesome! I shall :) [11:08:54] !log enable puppet on jobrunners [11:09:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:12:55] (03CR) 10Jakob: [C: 03+1] Enable termbox on testwikidatawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/525062 (https://phabricator.wikimedia.org/T227459) (owner: 10Tarrow) [11:13:32] (03CR) 10Tarrow: [C: 03+2] Enable termbox on testwikidatawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/525062 (https://phabricator.wikimedia.org/T227459) (owner: 10Tarrow) [11:14:34] (03Merged) 10jenkins-bot: Enable termbox on testwikidatawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/525062 (https://phabricator.wikimedia.org/T227459) (owner: 10Tarrow) [11:16:24] (03CR) 10jenkins-bot: Enable termbox on testwikidatawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/525062 (https://phabricator.wikimedia.org/T227459) (owner: 10Tarrow) [11:17:22] (03PS1) 10Hashar: contint: remove arcanist [puppet] - 10https://gerrit.wikimedia.org/r/525063 (https://phabricator.wikimedia.org/T225735) [11:17:58] (03CR) 10Hashar: "I don't think we still use arcanist anywhere on CI do we?" 
[puppet] - 10https://gerrit.wikimedia.org/r/525063 (https://phabricator.wikimedia.org/T225735) (owner: 10Hashar) [11:24:21] Live and working on debug; sync-fileing now [11:25:05] !log tarrow@deploy1001 Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:525062|T214902 Enable termbox on testwikidatawiki]] (duration: 01m 37s) [11:25:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:25:12] T214902: Show mobile termbox on Wikidata test wiki - https://phabricator.wikimedia.org/T214902 [11:27:23] (03PS2) 10Arturo Borrero Gonzalez: toolforge: k8s: add nginx-ingress configuration. [puppet] - 10https://gerrit.wikimedia.org/r/524759 (https://phabricator.wikimedia.org/T228500) [11:29:05] Looks like it's client side rendering fine but I see "Wikibase\View\Termbox\Renderer\TermboxRemoteRenderer: encountered a bad response from the remote renderer" in logstash [11:33:01] (03PS1) 10Tarrow: Fix missing /termbox in SSRTermboxServerUrl [mediawiki-config] - 10https://gerrit.wikimedia.org/r/525065 [11:33:35] Amir1: looks like I messed up a bit. Am I good to just merge and deploy the fix or do I need to revert first? [11:33:57] since it's test, I think it's fine [11:34:02] (03PS3) 10Arturo Borrero Gonzalez: toolforge: k8s: add nginx-ingress configuration. [puppet] - 10https://gerrit.wikimedia.org/r/524759 (https://phabricator.wikimedia.org/T228500) [11:34:03] unless it's exploding logs [11:34:19] It's not crazy, just a little [11:34:30] (03CR) 10Tarrow: "This change is ready for review." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/525065 (owner: 10Tarrow) [11:34:35] so I think we can deploy the fix then [11:34:55] great [11:35:40] (03CR) 10Jakob: [C: 03+1] Fix missing /termbox in SSRTermboxServerUrl [mediawiki-config] - 10https://gerrit.wikimedia.org/r/525065 (owner: 10Tarrow) [11:37:33] (03CR) 10Tarrow: [C: 03+2] Fix missing /termbox in SSRTermboxServerUrl [mediawiki-config] - 10https://gerrit.wikimedia.org/r/525065 (owner: 10Tarrow) [11:38:32] (03Merged) 10jenkins-bot: Fix missing /termbox in SSRTermboxServerUrl [mediawiki-config] - 10https://gerrit.wikimedia.org/r/525065 (owner: 10Tarrow) [11:38:47] (03CR) 10jenkins-bot: Fix missing /termbox in SSRTermboxServerUrl [mediawiki-config] - 10https://gerrit.wikimedia.org/r/525065 (owner: 10Tarrow) [11:39:03] (03PS4) 10Arturo Borrero Gonzalez: toolforge: k8s: add nginx-ingress configuration. [puppet] - 10https://gerrit.wikimedia.org/r/524759 (https://phabricator.wikimedia.org/T228500) [11:40:49] (03PS1) 10Muehlenhoff: Disable poolcounter1003, also switch over pool counters in codfw [mediawiki-config] - 10https://gerrit.wikimedia.org/r/525066 (https://phabricator.wikimedia.org/T224572) [11:42:20] PROBLEM - PHP opcache health on mwdebug1002 is CRITICAL: CRITICAL: opcache free space is below 50 MB https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [11:43:18] !log restart php-fpm on mwdebug* [11:43:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:45:08] (03PS1) 10Lars Wirzenius: Group0 to 1.34.0-wmf.15 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/525068 [11:46:42] hey [11:46:46] :) [11:47:08] I guess we're currently conflicting because I just got "Failed to acquire lock "/var/lock/scap.operations_mediawiki-config.lock"; owner is "liw"; reason is "Pruned MediaWiki: 1.34.0-wmf.10"" [11:47:55] Amir1: I guess I just hold fire? 
[11:48:32] (03PS1) 10Muehlenhoff: kibana: Switch to read-only LDAP replicas [puppet] - 10https://gerrit.wikimedia.org/r/525069 (https://phabricator.wikimedia.org/T227650) [11:48:37] tarrow, eek. I'm in the middle of cutting this week's train branch. [11:48:52] liw: eek! [11:49:16] sorry about that [11:50:07] tarrow, well, hopefully everything works. what's the worst that could happen? I take down all wikis and set fire to all data centres? [11:50:08] liw: So I was mid way through SWAT-ing a patch. It's on mwdebug1002 but I didn't run scap sync-file yet [11:50:27] :P [11:51:00] cool, I won't do any more typing until I know what to do :) [11:51:13] FYI it was https://gerrit.wikimedia.org/r/525065 [11:52:09] tarrow, I'm new to this, so I have little clue to what I'm doing - hopefully I haven't broken anything [11:52:21] liw: hehehe! me too :) [11:52:32] (03PS2) 10Alexandros Kosiaris: Disable poolcounter1003, also switch over pool counters in codfw [mediawiki-config] - 10https://gerrit.wikimedia.org/r/525066 (https://phabricator.wikimedia.org/T224572) (owner: 10Muehlenhoff) [11:52:36] (03CR) 10Alexandros Kosiaris: [C: 03+2] Disable poolcounter1003, also switch over pool counters in codfw [mediawiki-config] - 10https://gerrit.wikimedia.org/r/525066 (https://phabricator.wikimedia.org/T224572) (owner: 10Muehlenhoff) [11:52:38] currently running scap clean to delete an old branch [11:53:01] and daydreaming of the day when all of this is fully automated ;) [11:53:06] (03PS5) 10Arturo Borrero Gonzalez: toolforge: k8s: add nginx-ingress configuration. 
[puppet] - 10https://gerrit.wikimedia.org/r/524759 (https://phabricator.wikimedia.org/T228500) [11:53:10] :D [11:53:31] (03Merged) 10jenkins-bot: Disable poolcounter1003, also switch over pool counters in codfw [mediawiki-config] - 10https://gerrit.wikimedia.org/r/525066 (https://phabricator.wikimedia.org/T224572) (owner: 10Muehlenhoff) [11:53:46] (03CR) 10jenkins-bot: Disable poolcounter1003, also switch over pool counters in codfw [mediawiki-config] - 10https://gerrit.wikimedia.org/r/525066 (https://phabricator.wikimedia.org/T224572) (owner: 10Muehlenhoff) [11:54:16] !log liw@deploy1001 Pruned MediaWiki: 1.34.0-wmf.10 (duration: 07m 55s) [11:54:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:54:48] liw: mind if I finish my deployment now? [11:54:56] Or is there more for you to do? [11:55:30] tarrow, go ahead [11:55:35] cool! [11:56:24] tarrow, tell me when you're done, please [11:56:28] sure! [11:56:37] !log tarrow@deploy1001 Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:525065|T214902 Fix missing /termbox in SSRTermboxServerUrl]] (duration: 00m 44s) [11:56:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:56:44] T214902: Show mobile termbox on Wikidata test wiki - https://phabricator.wikimedia.org/T214902 [11:57:11] (03PS6) 10Arturo Borrero Gonzalez: toolforge: k8s: add nginx-ingress configuration. 
[puppet] - 10https://gerrit.wikimedia.org/r/524759 (https://phabricator.wikimedia.org/T228500) [11:58:41] !log akosiaris@deploy1001 Synchronized wmf-config/ProductionServices.php: (no justification provided) (duration: 00m 46s) [11:58:42] !log EU SWAT finished [11:58:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:58:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:58:57] liw: all done :) [11:59:31] !log disable poolcounter1003, switchover codfw poolcounters T224572 [11:59:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:59:38] T224572: Migrate pool counters to Stretch/Buster - https://phabricator.wikimedia.org/T224572 [12:00:04] Deploy window Pre MediaWiki train sanity break (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190723T1200) [12:00:48] tarrow, thanks [12:01:07] !log empty ganeti1007 from running instances. T227139 [12:01:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:01:14] T227139: a3-eqiad pdu refresh - https://phabricator.wikimedia.org/T227139 [12:02:11] !log drain kubernetes1001. T227139 [12:02:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:04:05] !log liw@deploy1001 Started scap: testwiki to php-1.34.0-wmf.15 and rebuild l10n cache [12:04:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:04:56] 10Operations, 10ops-eqiad, 10DC-Ops: a3-eqiad pdu refresh - https://phabricator.wikimedia.org/T227139 (10RobH) FYI: I pinged both Alex and Filippo to drain the respective servers they mention above in anticipation of swapping the PDUs in this rack at 10:00 Eastern time. A3 was originally a DB rack, and has... 
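"drain kubernetes1001" above means two ordered steps: cordon the node so the scheduler places nothing new on it, then evict its pods so they are recreated elsewhere before the PDU swap. A toy model of that ordering only; real drains go through the Kubernetes eviction API and honour PodDisruptionBudgets, and the `Node`/`drain` names here are illustrative:

```python
# Hedged sketch of drain semantics: cordon first, evict second, so no new
# work lands on the node while its existing pods are being moved away.

class Node:
    def __init__(self, name, pods):
        self.name = name
        self.pods = list(pods)
        self.schedulable = True

def drain(node, evict):
    """Cordon the node, then evict each pod via the provided callback."""
    node.schedulable = False          # cordon: no new pods scheduled here
    for pod in list(node.pods):
        evict(pod)                    # real code: eviction API, honours PDBs
        node.pods.remove(pod)
    return node
```

The same cordon-then-evacuate ordering applies to emptying a Ganeti node of its VMs, as logged for ganeti1007 above.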
[12:05:58] 10Operations, 10ops-eqiad, 10DC-Ops: a3-eqiad pdu refresh - https://phabricator.wikimedia.org/T227139 (10RobH) [12:07:24] (03PS1) 10Arturo Borrero Gonzalez: secrets: toolforge: add default k8s nginx-ingress key pair [labs/private] - 10https://gerrit.wikimedia.org/r/525074 (https://phabricator.wikimedia.org/T228500) [12:08:23] 10Operations, 10ops-eqiad, 10DC-Ops: a3-eqiad pdu refresh - https://phabricator.wikimedia.org/T227139 (10RobH) p:05Triage→03High [12:09:16] (03CR) 10Arturo Borrero Gonzalez: [V: 03+2 C: 03+2] secrets: toolforge: add default k8s nginx-ingress key pair [labs/private] - 10https://gerrit.wikimedia.org/r/525074 (https://phabricator.wikimedia.org/T228500) (owner: 10Arturo Borrero Gonzalez) [12:10:45] (03PS1) 10Muehlenhoff: Add net-tools to standard packages [puppet] - 10https://gerrit.wikimedia.org/r/525075 [12:11:52] 10Operations, 10Gerrit, 10LDAP-Access-Requests, 10Release-Engineering-Team-TODO, 10Release-Engineering-Team (Development services): Add more SREs to gerritadmin LDAP group - https://phabricator.wikimedia.org/T228733 (10greg) [12:13:43] (03PS1) 10Alexandros Kosiaris: Redeploy on stream-config changes [deployment-charts] - 10https://gerrit.wikimedia.org/r/525076 (https://phabricator.wikimedia.org/T228700) [12:17:29] (03CR) 10Alexandros Kosiaris: [C: 03+1] "Ouch. I know I 'd be annoyed if ifconfig or netstat wasn't around on boxes while under the pressure of debugging something, so many +1s." 
[puppet] - 10https://gerrit.wikimedia.org/r/525075 (owner: 10Muehlenhoff) [12:25:15] 10Operations, 10serviceops, 10Core Platform Team Workboards (Green): Keys from MediaWiki Redis Instances - https://phabricator.wikimedia.org/T228703 (10WDoranWMF) p:05Triage→03High [12:29:04] PROBLEM - High CPU load on API appserver on mw1341 is CRITICAL: CRITICAL - load average: 75.26, 34.31, 22.76 https://wikitech.wikimedia.org/wiki/Application_servers [12:29:48] PROBLEM - High CPU load on API appserver on mw1233 is CRITICAL: CRITICAL - load average: 53.32, 23.31, 14.90 https://wikitech.wikimedia.org/wiki/Application_servers [12:30:20] PROBLEM - High CPU load on API appserver on mw1234 is CRITICAL: CRITICAL - load average: 50.54, 23.43, 14.46 https://wikitech.wikimedia.org/wiki/Application_servers [12:30:42] RECOVERY - High CPU load on API appserver on mw1341 is OK: OK - load average: 28.39, 29.28, 22.12 https://wikitech.wikimedia.org/wiki/Application_servers [12:31:36] PROBLEM - High CPU load on API appserver on mw1289 is CRITICAL: CRITICAL - load average: 64.69, 32.86, 20.82 https://wikitech.wikimedia.org/wiki/Application_servers [12:32:00] RECOVERY - High CPU load on API appserver on mw1234 is OK: OK - load average: 19.16, 20.16, 14.20 https://wikitech.wikimedia.org/wiki/Application_servers [12:32:54] 10Operations, 10Wikimedia-Mailing-lists: New Mailing lists for AzWiki sysops - https://phabricator.wikimedia.org/T228542 (10MarcoAurelio) There's a bit of discussion on Meta about whether this should be approved or not based on the results of an RfC. May I advise ops to wait before the status is clarified? Tha... 
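The "Redeploy on stream-config changes" change above points at a common Helm pitfall: editing a ConfigMap alone does not alter the Deployment's pod template, so Kubernetes sees nothing to roll out and running pods keep the stale config. The usual fix is to stamp a checksum of the rendered config into the pod template's annotations, so any config edit changes the template and triggers a rollout. A sketch of that pattern with illustrative names, not the actual termbox chart:

```python
import hashlib
import json

# Hedged sketch of the Helm "checksum/config" annotation trick: a config
# change produces a new checksum, hence a new pod template, hence a rollout.

def config_checksum(config):
    """Stable SHA-256 over a canonical rendering of the config."""
    rendered = json.dumps(config, sort_keys=True)
    return hashlib.sha256(rendered.encode()).hexdigest()

def pod_template(config):
    """Build a pod template whose annotations embed the config checksum."""
    return {
        "metadata": {
            "annotations": {
                # editing config => new checksum => changed template => rollout
                "checksum/config": config_checksum(config),
            }
        },
        "spec": {"containers": [{"name": "termbox", "image": "termbox:latest"}]},
    }
```

In a real chart this is typically written as a template expression like `checksum/config: {{ include ... | sha256sum }}` inside the Deployment's pod annotations.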
[12:32:54] PROBLEM - High CPU load on API appserver on mw1232 is CRITICAL: CRITICAL - load average: 49.12, 25.74, 15.39 https://wikitech.wikimedia.org/wiki/Application_servers [12:33:06] RECOVERY - High CPU load on API appserver on mw1233 is OK: OK - load average: 14.16, 21.12, 16.03 https://wikitech.wikimedia.org/wiki/Application_servers [12:33:18] RECOVERY - High CPU load on API appserver on mw1289 is OK: OK - load average: 22.90, 27.43, 20.11 https://wikitech.wikimedia.org/wiki/Application_servers [12:33:51] !log liw@deploy1001 Finished scap: testwiki to php-1.34.0-wmf.15 and rebuild l10n cache (duration: 29m 46s) [12:33:52] 10Operations, 10ops-eqiad, 10DC-Ops: a3-eqiad pdu refresh - https://phabricator.wikimedia.org/T227139 (10MoritzMuehlenhoff) >>! In T227139#5356540, @fgiunchedi wrote: > restbase / logstash / graphite / prometheus hosts should be fine in an event of power loss, This is graphite1003, the old server pending de... [12:33:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:34:34] RECOVERY - High CPU load on API appserver on mw1232 is OK: OK - load average: 19.05, 21.84, 15.05 https://wikitech.wikimedia.org/wiki/Application_servers [12:34:55] 10Operations, 10ops-eqiad, 10DC-Ops: a3-eqiad pdu refresh - https://phabricator.wikimedia.org/T227139 (10MoritzMuehlenhoff) [12:39:26] I've finished cutting this week's train branch [12:45:44] 10Operations, 10Traffic: TLS config issue for nginx on Buster - https://phabricator.wikimedia.org/T228730 (10BBlack) If we need this to work ASAP, probably the most-expedient thing to do would be to patch our puppetization to exclude the patched features from config on buster only, and use the vendor package.... [12:47:19] (03CR) 10Ottomata: [C: 03+2] "AH! 
So the stream-config is rendered properly via configmap.yaml, but the deployment config didn't know it had changed, since it was only" [deployment-charts] - 10https://gerrit.wikimedia.org/r/525076 (https://phabricator.wikimedia.org/T228700) (owner: 10Alexandros Kosiaris) [12:47:22] (03CR) 10Ottomata: [V: 03+2 C: 03+2] Redeploy on stream-config changes [deployment-charts] - 10https://gerrit.wikimedia.org/r/525076 (https://phabricator.wikimedia.org/T228700) (owner: 10Alexandros Kosiaris) [12:52:45] (03CR) 10Filippo Giunchedi: [C: 03+1] kibana: Switch to read-only LDAP replicas [puppet] - 10https://gerrit.wikimedia.org/r/525069 (https://phabricator.wikimedia.org/T227650) (owner: 10Muehlenhoff) [12:53:49] (03CR) 10Filippo Giunchedi: [C: 03+1] "LGTM!" [puppet] - 10https://gerrit.wikimedia.org/r/525075 (owner: 10Muehlenhoff) [12:55:03] (03CR) 10Muehlenhoff: "PCC: https://puppet-compiler.wmflabs.org/compiler1002/17578/" [puppet] - 10https://gerrit.wikimedia.org/r/525069 (https://phabricator.wikimedia.org/T227650) (owner: 10Muehlenhoff) [12:55:18] (03CR) 10Muehlenhoff: [C: 03+2] kibana: Switch to read-only LDAP replicas [puppet] - 10https://gerrit.wikimedia.org/r/525069 (https://phabricator.wikimedia.org/T227650) (owner: 10Muehlenhoff) [12:55:34] 10Operations, 10serviceops, 10Core Platform Team Backlog (Watching / External), 10Core Platform Team Workboards (Clinic Duty Team), and 4 others: Use PHP7 to run all async jobs - https://phabricator.wikimedia.org/T219148 (10jijiki) [13:00:04] liw: #bothumor I � Unicode. All rise for MediaWiki train - European version deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190723T1300). [13:03:30] (03PS7) 10Arturo Borrero Gonzalez: toolforge: k8s: add nginx-ingress configuration. 
[puppet] - 10https://gerrit.wikimedia.org/r/524759 (https://phabricator.wikimedia.org/T228500) [13:03:55] I've filed https://phabricator.wikimedia.org/T228746 and https://phabricator.wikimedia.org/T228749 but hoping neither of them is a blocker [13:04:05] starting group0 deployment now [13:04:27] (03CR) 10jerkins-bot: [V: 04-1] toolforge: k8s: add nginx-ingress configuration. [puppet] - 10https://gerrit.wikimedia.org/r/524759 (https://phabricator.wikimedia.org/T228500) (owner: 10Arturo Borrero Gonzalez) [13:05:14] (03PS1) 10Ema: prometheus: add ats_backend_requests_seconds_count rules [puppet] - 10https://gerrit.wikimedia.org/r/525081 (https://phabricator.wikimedia.org/T227668) [13:06:09] (03PS8) 10Arturo Borrero Gonzalez: toolforge: k8s: add nginx-ingress configuration. [puppet] - 10https://gerrit.wikimedia.org/r/524759 (https://phabricator.wikimedia.org/T228500) [13:06:17] !log Drop abuse_filter_log.afl_log_id from s8 codfw (lag will happen on codfw s8) - T226851 [13:06:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:06:25] T226851: Drop abuse_filter_log.afl_log_id in production - https://phabricator.wikimedia.org/T226851 [13:06:43] !log liw@deploy1001 rebuilt and synchronized wikiversions files: group0 to 1.34.0-wmf.15 [13:06:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:07:44] (03PS4) 10CDanis: conftool: update schemata for dbctl [puppet] - 10https://gerrit.wikimedia.org/r/523943 (https://phabricator.wikimedia.org/T197126) [13:07:44] !log jmm@cumin2001 START - Cookbook sre.hosts.downtime [13:07:45] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [13:07:46] (03PS14) 10CDanis: dbctl: monitor for uncommitted changes [puppet] - 10https://gerrit.wikimedia.org/r/523013 (https://phabricator.wikimedia.org/T197126) [13:07:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:07:50] 10Operations, 10Traffic: TLS config issue for nginx on Buster - 
https://phabricator.wikimedia.org/T228730 (10elukey) @BBlack thanks for the info! Not in a real rush, I was working on https://phabricator.wikimedia.org/T227860 to add TLS capabilities to the Analytics UIs, the only delay will be for Traffic :) [13:07:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:08:34] (03PS5) 10Elukey: profile::tlsproxy::service: add more granularity in monitoring [puppet] - 10https://gerrit.wikimedia.org/r/525039 (https://phabricator.wikimedia.org/T227860) [13:08:51] 10Operations, 10Traffic: TLS config issue for nginx on Buster - https://phabricator.wikimedia.org/T228730 (10ema) p:05Triage→03Normal [13:09:15] (03CR) 10CDanis: [C: 03+2] conftool: update schemata for dbctl [puppet] - 10https://gerrit.wikimedia.org/r/523943 (https://phabricator.wikimedia.org/T197126) (owner: 10CDanis) [13:09:56] 10Operations, 10ops-eqiad, 10DC-Ops: a3-eqiad pdu refresh - https://phabricator.wikimedia.org/T227139 (10jijiki) [13:09:58] (03PS9) 10Arturo Borrero Gonzalez: toolforge: k8s: add nginx-ingress configuration. 
[puppet] - 10https://gerrit.wikimedia.org/r/524759 (https://phabricator.wikimedia.org/T228500) [13:10:37] 10Operations, 10ops-eqiad, 10DC-Ops: a3-eqiad pdu refresh - https://phabricator.wikimedia.org/T227139 (10jijiki) [13:13:00] (03CR) 10Elukey: [C: 03+2] profile::tlsproxy::service: add more granularity in monitoring [puppet] - 10https://gerrit.wikimedia.org/r/525039 (https://phabricator.wikimedia.org/T227860) (owner: 10Elukey) [13:13:05] (03CR) 10Lars Wirzenius: [C: 03+2] Group0 to 1.34.0-wmf.15 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/525068 (owner: 10Lars Wirzenius) [13:13:07] (03PS6) 10Elukey: profile::tlsproxy::service: add more granularity in monitoring [puppet] - 10https://gerrit.wikimedia.org/r/525039 (https://phabricator.wikimedia.org/T227860) [13:13:55] (03Merged) 10jenkins-bot: Group0 to 1.34.0-wmf.15 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/525068 (owner: 10Lars Wirzenius) [13:14:11] (03CR) 10jenkins-bot: Group0 to 1.34.0-wmf.15 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/525068 (owner: 10Lars Wirzenius) [13:15:11] (03PS3) 10Muehlenhoff: maintain_dbusers: Read LDAP servers from Hiera and switch to read-only replicas [puppet] - 10https://gerrit.wikimedia.org/r/524810 (https://phabricator.wikimedia.org/T227650) [13:17:33] !log liw@deploy1001 rebuilt and synchronized wikiversions files: group0 to 1.34.0-wmf.15 [13:17:36] PROBLEM - puppet last run on mw2249 is CRITICAL: CRITICAL: Puppet has 3 failures. Last run 6 minutes ago with 3 failures. 
Failed resources (up to 3 shown): File[/etc/conftool/json-schema/dbconfig/instance.schema],File[/etc/conftool/json-schema/dbconfig/section.schema],File[/etc/conftool/json-schema/mediawiki-config/dbconfig.schema] https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [13:17:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:17:40] (03CR) 10Muehlenhoff: [C: 03+2] maintain_dbusers: Read LDAP servers from Hiera and switch to read-only replicas [puppet] - 10https://gerrit.wikimedia.org/r/524810 (https://phabricator.wikimedia.org/T227650) (owner: 10Muehlenhoff) [13:18:50] PROBLEM - puppet last run on elastic1042 is CRITICAL: CRITICAL: Puppet has 3 failures. Last run 8 minutes ago with 3 failures. Failed resources (up to 3 shown): File[/etc/conftool/json-schema/dbconfig/instance.schema],File[/etc/conftool/json-schema/dbconfig/section.schema],File[/etc/conftool/json-schema/mediawiki-config/dbconfig.schema] https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [13:19:08] cdanis: is that related to your change? 
^ [13:19:16] sigh [13:19:18] (03CR) 10Effie Mouzeli: [C: 03+1] Set async replication for mcrouter on mw api/appserv canaries [puppet] - 10https://gerrit.wikimedia.org/r/525053 (https://phabricator.wikimedia.org/T225642) (owner: 10Elukey) [13:19:24] yes [13:19:36] I will revert [13:19:40] group0 is at 1.34.0-wmf.15 now [13:19:50] cdanis: happy tuesday :p [13:20:00] (03PS2) 10Ema: prometheus: add trafficserver_backend_requests_seconds_count rules [puppet] - 10https://gerrit.wikimedia.org/r/525081 (https://phabricator.wikimedia.org/T227668) [13:20:02] (03PS1) 10Ema: prometheus: rename trafficserver metrics [puppet] - 10https://gerrit.wikimedia.org/r/525085 (https://phabricator.wikimedia.org/T227668) [13:20:57] (03PS1) 10CDanis: Revert "conftool: update schemata for dbctl" [puppet] - 10https://gerrit.wikimedia.org/r/525086 [13:22:19] (03CR) 10CDanis: [C: 03+2] Revert "conftool: update schemata for dbctl" [puppet] - 10https://gerrit.wikimedia.org/r/525086 (owner: 10CDanis) [13:22:59] oh [13:23:01] dammit [13:23:14] Jul 23 13:10:14 elastic1042 puppet-agent[18320]: Could not set 'file' on ensure: Error 404 on SERVER: {"message":"Not Found: Could not find file_content modules/profile/conftool/json-schema/ [13:23:17] dbconfig/instance.schema","issue_kind":"RESOURCE_NOT_FOUND"} [13:23:23] this is just the usual stupid puppet race condition [13:24:14] I need more coffee 😖 [13:24:47] (03CR) 10Ema: [C: 03+1] "Commit message OCD that does not want to discourage an otherwise fantastic idea." 
(2 comments) [puppet] - https://gerrit.wikimedia.org/r/525075 (owner: Muehlenhoff)
[13:26:20] (PS4) Jhedden: dumps dist: switch active VPS to labstore1006 [puppet] - https://gerrit.wikimedia.org/r/524804 (https://phabricator.wikimedia.org/T224228)
[13:26:29] (PS1) CDanis: Revert "Revert "conftool: update schemata for dbctl"" [puppet] - https://gerrit.wikimedia.org/r/525088
[13:26:39] "Fatal error: entire web request took longer than 60 seconds and timed out in /srv/mediawiki/php-1.34.0-wmf.14/includes/parser/Preprocessor_Hash.php on line 187" - is this something I should worry about?
[13:26:52] .14, not .15, which I just deployed, but still
[13:27:16] !log dumps switching active vps to labstore1006 T224228
[13:27:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:27:35] (CR) Jhedden: [C: +2] dumps dist: switch active VPS to labstore1006 [puppet] - https://gerrit.wikimedia.org/r/524804 (https://phabricator.wikimedia.org/T224228) (owner: Jhedden)
[13:27:42] (CR) CDanis: [C: +2] Revert "Revert "conftool: update schemata for dbctl"" [puppet] - https://gerrit.wikimedia.org/r/525088 (owner: CDanis)
[13:27:46] PROBLEM - puppet last run on mw2150 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 4 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/etc/conftool/json-schema/dbconfig/section.schema] https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[13:28:37] (CR) Filippo Giunchedi: [C: +1] prometheus: rename trafficserver metrics [puppet] - https://gerrit.wikimedia.org/r/525085 (https://phabricator.wikimedia.org/T227668) (owner: Ema)
[13:28:49] (CR) Filippo Giunchedi: [C: +1] prometheus: add trafficserver_backend_requests_seconds_count rules [puppet] - https://gerrit.wikimedia.org/r/525081 (https://phabricator.wikimedia.org/T227668) (owner: Ema)
[13:30:34] liw: such errors are an unfortunate but I think fairly frequent problem. See for example https://logstash.wikimedia.org/goto/341af04420e264ef299b2d53b69abd2a
[13:31:32] PROBLEM - puppet last run on elastic2045 is CRITICAL: CRITICAL: Puppet has 3 failures. Last run 8 minutes ago with 3 failures. Failed resources (up to 3 shown): File[/etc/conftool/json-schema/dbconfig/instance.schema],File[/etc/conftool/json-schema/dbconfig/section.schema],File[/etc/conftool/json-schema/mediawiki-config/dbconfig.schema] https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[13:31:47] ema, check, thanks
[13:32:01] (PS5) Jhedden: dumps dist: switch active VPS to labstore1006 [puppet] - https://gerrit.wikimedia.org/r/524804 (https://phabricator.wikimedia.org/T224228)
[13:32:32] I now see: ErrorException from line 100 of /srv/mediawiki/php-1.34.0-wmf.15/includes/libs/HttpStatus.php: PHP Warning: Unknown HTTP status code default
[13:33:22] RECOVERY - puppet last run on mw2150 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[13:35:02] PROBLEM - puppet last run on elastic1026 is CRITICAL: CRITICAL: Puppet has 3 failures. Last run 5 minutes ago with 3 failures.
Failed resources (up to 3 shown): File[/etc/conftool/json-schema/dbconfig/instance.schema],File[/etc/conftool/json-schema/dbconfig/section.schema],File[/etc/conftool/json-schema/mediawiki-config/dbconfig.schema] https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[13:35:48] RECOVERY - puppet last run on elastic1042 is OK: OK: Puppet is currently enabled, last run 5 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[13:37:04] Operations, ops-eqiad, DC-Ops: (July 22-26) install new PDUs in rows A/B (Top level tracking task) - https://phabricator.wikimedia.org/T226778 (RobH)
[13:37:46] (PS1) Muehlenhoff: Remove unused Apache config [puppet] - https://gerrit.wikimedia.org/r/525090
[13:39:38] PROBLEM - Mediawiki Cirrussearch update rate - eqiad on icinga1001 is CRITICAL: CRITICAL: 30.00% of data under the critical threshold [50.0] https://wikitech.wikimedia.org/wiki/Search%23No_updates https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?panelId=44&fullscreen&orgId=1
[13:40:42] RECOVERY - puppet last run on elastic1026 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[13:41:24] PROBLEM - Mediawiki Cirrussearch update rate - codfw on icinga1001 is CRITICAL: CRITICAL: 50.00% of data under the critical threshold [50.0] https://wikitech.wikimedia.org/wiki/Search%23No_updates https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?panelId=44&fullscreen&orgId=1
[13:41:52] (PS1) Ottomata: eventgate - fix 'wrong number of args for include: want 2 got 1' [deployment-charts] - https://gerrit.wikimedia.org/r/525091 (https://phabricator.wikimedia.org/T228700)
[13:42:34] (CR) Ottomata: [V: +2 C: +2] eventgate - fix 'wrong number of args for include: want 2 got 1' [deployment-charts] - https://gerrit.wikimedia.org/r/525091 (https://phabricator.wikimedia.org/T228700) (owner: Ottomata)
[13:42:50] Operations, LDAP, Patch-For-Review: Migrate web services using LDAP authentication towards the readonly LDAP replicas - https://phabricator.wikimedia.org/T227650 (MoritzMuehlenhoff) a: MoritzMuehlenhoff The following services have been converted to use the read-only replicas: - DB users sync...
[13:43:36] !log otto@ helmfile [STAGING] Ran 'apply' command on namespace 'eventgate-main' for release 'main' .
[13:43:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:44:27] (CR) Elukey: [C: +1] Remove unused Apache config [puppet] - https://gerrit.wikimedia.org/r/525090 (owner: Muehlenhoff)
[13:44:45] !log installing Java security updates on furud/flerovium
[13:44:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:44:50] PROBLEM - Work requests waiting in Zuul Gearman server on contint1001 is CRITICAL: CRITICAL: 42.86% of data above the critical threshold [140.0] https://www.mediawiki.org/wiki/Continuous_integration/Zuul https://grafana.wikimedia.org/dashboard/db/zuul-gearman?panelId=10&fullscreen&orgId=1
[13:45:28] !log otto@ helmfile [CODFW] Ran 'apply' command on namespace 'eventgate-main' for release 'main' .
[13:45:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:45:39] !log depool restbase1016 restbase1019 restbase1011 restbase1010 prometheus1003 ahead of PDU work - T227139
[13:45:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:45:46] T227139: a3-eqiad pdu refresh - https://phabricator.wikimedia.org/T227139
[13:45:48] RECOVERY - puppet last run on mw2249 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[13:47:00] Operations, Wikimedia-Mailing-lists: New Mailing lists for AzWiki sysops - https://phabricator.wikimedia.org/T228542 (Mardetanha) >>!
In T228542#5356900, @MarcoAurelio wrote: > There's a bit of discussion on Meta about whether this should be approved or not based on the results of an RfC. May I advice op...
[13:47:29] !log otto@ helmfile [EQIAD] Ran 'apply' command on namespace 'eventgate-main' for release 'main' .
[13:47:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:50:00] https://phabricator.wikimedia.org/T228758
[13:50:03] Operations, Analytics, Core Platform Team Backlog (Watching / External), Patch-For-Review, and 2 others: Replace and expand codfw kafka main hosts (kafka200[123]) with kafka-main200[12345] - https://phabricator.wikimedia.org/T225005 (Ottomata)
[13:51:56] RECOVERY - Work requests waiting in Zuul Gearman server on contint1001 is OK: OK: Less than 30.00% above the threshold [90.0] https://www.mediawiki.org/wiki/Continuous_integration/Zuul https://grafana.wikimedia.org/dashboard/db/zuul-gearman?panelId=10&fullscreen&orgId=1
[13:52:31] (PS1) Muehlenhoff: Remove old jessie-based pool counters [puppet] - https://gerrit.wikimedia.org/r/525093 (https://phabricator.wikimedia.org/T224572)
[13:52:55] Operations, PoolCounter: Migrate pool counters to stretch - https://phabricator.wikimedia.org/T199876 (MoritzMuehlenhoff) Open→Resolved Duplicate of T224572
[13:53:12] Operations, serviceops, Patch-For-Review: Migrate pool counters to Stretch/Buster - https://phabricator.wikimedia.org/T224572 (MoritzMuehlenhoff) a: MoritzMuehlenhoff
[13:53:14] RECOVERY - puppet last run on elastic2045 is OK: OK: Puppet is currently enabled, last run 8 seconds ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[13:54:25] (CR) Hashar: zuul: stop zuul-merger gracefully (1 comment) [puppet] - https://gerrit.wikimedia.org/r/524180 (owner: Hashar)
[13:54:50] PROBLEM - puppet last run on db1080 is CRITICAL: CRITICAL: Failed to apply catalog, zero resources tracked by Puppet. Could be an interrupted request or a dependency cycle. https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[13:57:37] Operations, ops-eqiad, DBA: Upgrade db1100 firmware and BIOS - https://phabricator.wikimedia.org/T228732 (herron) p: Triage→Normal
[13:59:09] https://phabricator.wikimedia.org/T228760
[14:00:47] (PS2) Muehlenhoff: Add net-tools to standard packages [puppet] - https://gerrit.wikimedia.org/r/525075
[14:01:12] (CR) Muehlenhoff: Add net-tools to standard packages (2 comments) [puppet] - https://gerrit.wikimedia.org/r/525075 (owner: Muehlenhoff)
[14:01:36] Operations, Gerrit, LDAP-Access-Requests, Release-Engineering-Team-TODO, Release-Engineering-Team (Development services): Add more SREs to gerritadmin LDAP group - https://phabricator.wikimedia.org/T228733 (Joe) I am happy to help.
[14:01:40] Operations, netops: AS63541's session down reported by cr1-eqsin - https://phabricator.wikimedia.org/T228617 (herron) p: Triage→Normal
[14:02:03] (CR) Ottomata: "Others will also use this swift oozie upload job, so I'd rather not have to deploy more new credentials for them if we don't have to.
I'l" [puppet] - https://gerrit.wikimedia.org/r/524625 (https://phabricator.wikimedia.org/T227364) (owner: EBernhardson)
[14:03:03] (CR) Hashar: zuul: fix systemd Service/TimeoutStopSec (1 comment) [puppet] - https://gerrit.wikimedia.org/r/524174 (https://phabricator.wikimedia.org/T228381) (owner: Hashar)
[14:03:16] (PS2) Hashar: zuul: fix systemd Service/TimeoutStopSec [puppet] - https://gerrit.wikimedia.org/r/524174 (https://phabricator.wikimedia.org/T228381)
[14:03:18] (PS2) Hashar: zuul: stop zuul-merger gracefully [puppet] - https://gerrit.wikimedia.org/r/524180
[14:03:33] Operations, Puppet: puppetdb prometheus metrics per-host metrics - https://phabricator.wikimedia.org/T228395 (herron) p: Triage→Normal
[14:03:37] (CR) Hashar: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/524174 (https://phabricator.wikimedia.org/T228381) (owner: Hashar)
[14:03:47] (CR) Hashar: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/524180 (owner: Hashar)
[14:05:42] RECOVERY - Mediawiki Cirrussearch update rate - eqiad on icinga1001 is OK: OK: Less than 1.00% under the threshold [80.0] https://wikitech.wikimedia.org/wiki/Search%23No_updates https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?panelId=44&fullscreen&orgId=1
[14:05:48] RECOVERY - Mediawiki Cirrussearch update rate - codfw on icinga1001 is OK: OK: Less than 1.00% under the threshold [80.0] https://wikitech.wikimedia.org/wiki/Search%23No_updates https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?panelId=44&fullscreen&orgId=1
[14:12:10] (PS3) Muehlenhoff: Add net-tools to standard packages [puppet] - https://gerrit.wikimedia.org/r/525075
[14:12:22] herron: is it ok to reenable puppet on kafka1001 already?
[14:12:43] yep, will do that now
[14:14:22] PROBLEM - Host ps1-a3-eqiad is DOWN: PING CRITICAL - Packet loss = 100%
[14:14:28] !log a3-eqiad pdu swap taking place now via T227139
[14:14:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:14:35] T227139: a3-eqiad pdu refresh - https://phabricator.wikimedia.org/T227139
[14:14:41] that is expected!
[14:14:48] also mgmt will be offline for all the hosts at some point
[14:14:51] but not the hosts themselves
[14:14:58] can we set up downtimes for these?
[14:15:43] we want to see them come back
[14:15:49] and its not paging
[14:15:52] so i rather not
[14:16:06] (we didnt yesterday for the same reason)
[14:16:10] ok
[14:16:15] that makes sense
[14:16:19] =]
[14:16:31] there is a bit of a confusion about impact, that's why I was asking
[14:16:57] you mentioned no unexpected power loss in the email, but then I saw we lost two servers after all?
[14:17:10] it turns out yes i didnt realize those lost power
[14:17:27] though im having issues finding which backlog channel it was mentioned in
[14:17:27] I don't have any bright ideas, but would like for everyone to be on the same page around (expected or unexpected) impact :)
[14:18:12] be aware i have a number of private messages where folks are unhappy about this so im quite aware that just mentioning this in the meeting isnt enough
[14:18:14] =]
[14:18:33] haha
[14:20:14] Operations, ops-eqiad, DC-Ops: a4-eqiad pdu refresh - https://phabricator.wikimedia.org/T227140 (RobH) Please note that netmon and kubestage both powered off yesterday (irc update about this) so we didn't have a flawless migration.
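(The RESOURCE_NOT_FOUND 404s at 13:23 are the transient puppet-merge window: an agent can request file content from a master before the new commit has finished syncing there, and the next agent run succeeds. The recovery pattern is a plain bounded retry; the sketch below is illustrative only, not any actual WMF tooling:)

```python
import time

def run_with_retry(fn, attempts=3, delay=0.0):
    """Call fn(), retrying on any exception up to `attempts` times."""
    last_exc = None
    for _ in range(attempts):
        try:
            return fn()
        except Exception as exc:  # e.g. the transient 404 from the master
            last_exc = exc
            time.sleep(delay)
    raise last_exc

# Simulate an agent run that fails twice (master still syncing), then succeeds.
calls = {"n": 0}
def agent_run():
    calls["n"] += 1
    if calls["n"] < 3:
        raise IOError("RESOURCE_NOT_FOUND")
    return "catalog applied"

assert run_with_retry(agent_run) == "catalog applied"
assert calls["n"] == 3
```

(Operationally, re-running `puppet agent -t` or just waiting for the next scheduled run amounts to one more attempt, which is why the alerts above recover on their own.)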
[14:21:45] (CR) Giuseppe Lavagetto: [C: +1] Set async replication for mcrouter on mw api/appserv canaries [puppet] - https://gerrit.wikimedia.org/r/525053 (https://phabricator.wikimedia.org/T225642) (owner: Elukey)
[14:22:30] (CR) Muehlenhoff: [C: +2] Add net-tools to standard packages [puppet] - https://gerrit.wikimedia.org/r/525075 (owner: Muehlenhoff)
[14:24:12] (PS15) CDanis: dbctl: monitor for uncommitted changes [puppet] - https://gerrit.wikimedia.org/r/523013 (https://phabricator.wikimedia.org/T197126)
[14:24:14] (CR) Giuseppe Lavagetto: dbctl: monitor for uncommitted changes (1 comment) [puppet] - https://gerrit.wikimedia.org/r/523013 (https://phabricator.wikimedia.org/T197126) (owner: CDanis)
[14:25:35] (CR) CDanis: dbctl: monitor for uncommitted changes (2 comments) [puppet] - https://gerrit.wikimedia.org/r/523013 (https://phabricator.wikimedia.org/T197126) (owner: CDanis)
[14:26:41] (CR) Volans: [C: +1] "LGTM, at most check the compiler." [puppet] - https://gerrit.wikimedia.org/r/523013 (https://phabricator.wikimedia.org/T197126) (owner: CDanis)
[14:28:17] Operations, Analytics, LDAP-Access-Requests, wikimediafoundation.org: Access to WikimediaFoundation.org analytics for Deb - https://phabricator.wikimedia.org/T227496 (herron) Open→Resolved I wasn't able to find an ldap account with shell username `Deb_Zierten`, but I do see shell username...
[14:28:30] (CR) Giuseppe Lavagetto: dbctl: monitor for uncommitted changes (1 comment) [puppet] - https://gerrit.wikimedia.org/r/523013 (https://phabricator.wikimedia.org/T197126) (owner: CDanis)
[14:28:44] (CR) Giuseppe Lavagetto: [C: +1] lookup checks: add checks to warn against using hiera and advice lookup [puppet-lint/wmf_styleguide-check] - https://gerrit.wikimedia.org/r/522526 (https://phabricator.wikimedia.org/T220820) (owner: Jbond)
[14:29:40] (CR) Giuseppe Lavagetto: "I think this is correct but please add a reference task." [puppet] - https://gerrit.wikimedia.org/r/524925 (owner: Jforrester)
[14:29:48] (CR) CDanis: "PCC looks good https://puppet-compiler.wmflabs.org/compiler1002/17579/" (1 comment) [puppet] - https://gerrit.wikimedia.org/r/523013 (https://phabricator.wikimedia.org/T197126) (owner: CDanis)
[14:31:00] (PS7) Giuseppe Lavagetto: Stop installing pear packages on MW Application Servers [puppet] - https://gerrit.wikimedia.org/r/434710 (https://phabricator.wikimedia.org/T195364) (owner: Reedy)
[14:31:26] (PS2) Ema: prometheus: rename trafficserver metrics [puppet] - https://gerrit.wikimedia.org/r/525085 (https://phabricator.wikimedia.org/T227668)
[14:31:53] PROBLEM - puppet last run on auth1002 is CRITICAL: CRITICAL: Failed to apply catalog, zero resources tracked by Puppet. Could be an interrupted request or a dependency cycle. https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[14:33:08] (CR) CDanis: "Will wait to commit until we have some data in etcd (otherwise this will fail)" [puppet] - https://gerrit.wikimedia.org/r/523013 (https://phabricator.wikimedia.org/T197126) (owner: CDanis)
[14:33:14] (CR) Ema: [C: +2] prometheus: rename trafficserver metrics [puppet] - https://gerrit.wikimedia.org/r/525085 (https://phabricator.wikimedia.org/T227668) (owner: Ema)
[14:33:19] Operations, Wikimedia-General-or-Unknown, serviceops, Patch-For-Review: Remove pear/mail packages from WMF MW app servers - https://phabricator.wikimedia.org/T195364 (Joe) @Tgr do you see any reason not to uninstall those packages? I will for now just remove them from puppet, and uninstall them o...
[14:33:26] (PS3) Ema: prometheus: add trafficserver_backend_requests_seconds_count rules [puppet] - https://gerrit.wikimedia.org/r/525081 (https://phabricator.wikimedia.org/T227668)
[14:33:37] (CR) Giuseppe Lavagetto: [C: +2] Stop installing pear packages on MW Application Servers [puppet] - https://gerrit.wikimedia.org/r/434710 (https://phabricator.wikimedia.org/T195364) (owner: Reedy)
[14:33:41] woo
[14:33:48] (PS8) Giuseppe Lavagetto: Stop installing pear packages on MW Application Servers [puppet] - https://gerrit.wikimedia.org/r/434710 (https://phabricator.wikimedia.org/T195364) (owner: Reedy)
[14:34:08] <_joe_> Reedy: I plan to remove the packages from a few appservers as soon as gergo confirms all the bugs are actually solved :)
[14:34:18] (CR) Giuseppe Lavagetto: [V: +2 C: +2] Stop installing pear packages on MW Application Servers [puppet] - https://gerrit.wikimedia.org/r/434710 (https://phabricator.wikimedia.org/T195364) (owner: Reedy)
[14:34:19] heh
[14:34:54] (PS4) Ema: prometheus: add trafficserver_backend_requests_seconds_count rules [puppet] - https://gerrit.wikimedia.org/r/525081 (https://phabricator.wikimedia.org/T227668)
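(For context on the trafficserver patches under review here: a Prometheus recording rule pre-aggregates an expensive query into a new stored series. The fragment below is a hypothetical sketch of what such a rule generally looks like; the metric name is taken from the commit subject, but the rule name, expression, and labels are assumptions, not the actual rules merged in the puppet repo:)

```yaml
groups:
  - name: trafficserver
    rules:
      # Pre-compute a per-backend request rate so dashboards query one
      # cheap series instead of re-aggregating raw counters on every load.
      - record: backend:trafficserver_backend_requests_seconds_count:rate5m
        expr: sum by (backend) (rate(trafficserver_backend_requests_seconds_count[5m]))
```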
[14:36:25] _joe_: are your changes safe to puppet-merge?
[14:36:35] <_joe_> ema: yes
[14:36:40] (CR) Ema: [C: +2] prometheus: add trafficserver_backend_requests_seconds_count rules [puppet] - https://gerrit.wikimedia.org/r/525081 (https://phabricator.wikimedia.org/T227668) (owner: Ema)
[14:36:49] _joe_: ack, merging!
[14:36:58] <_joe_> thanks
[14:38:37] Operations, Wikimedia-General-or-Unknown, serviceops, Patch-For-Review: Remove pear/mail packages from WMF MW app servers - https://phabricator.wikimedia.org/T195364 (Tgr) >>! In T195364#5357382, @Joe wrote: > @Tgr do you see any reason not to uninstall those packages? I will for now just remove...
[14:39:34] !log ppchelko@deploy1001 Started deploy [restbase/deploy@ea10fa5]: Switch event production to eventgate T211248, attempt 2
[14:39:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:39:42] T211248: Modern Event Platform: Stream Intake Service: Migrate eventlogging-service-eventbus events to eventgate-main - https://phabricator.wikimedia.org/T211248
[14:41:15] RECOVERY - puppet last run on db1080 is OK: OK: Puppet is currently enabled, last run 21 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[14:41:33] (PS2) Elukey: Set async replication for mcrouter on mw api/appserv canaries [puppet] - https://gerrit.wikimedia.org/r/525053 (https://phabricator.wikimedia.org/T225642)
[14:42:40] (CR) Elukey: [C: +2] Set async replication for mcrouter on mw api/appserv canaries [puppet] - https://gerrit.wikimedia.org/r/525053 (https://phabricator.wikimedia.org/T225642) (owner: Elukey)
[14:43:16] jijiki: ---^
[14:43:22] (ping as requested :)
[14:43:30] tx :D
[14:46:56] please note all a3-eqiad mgmt is about to complain
[14:46:58] it is expected
[14:47:12] as side a pdu is being swapped for the rack and the mgmt switch is single infeed
[14:49:18] Operations, Analytics, Discovery, Research-Backlog, Patch-For-Review: Workflow to be able to move data files computed in jobs from analytics cluster to production - https://phabricator.wikimedia.org/T213976 (Ottomata)
[14:52:40] (PS3) Hashar: zuul: fix systemd Service/TimeoutStopSec [puppet] - https://gerrit.wikimedia.org/r/524174 (https://phabricator.wikimedia.org/T228381)
[14:52:42] (CR) Ottomata: "Alright! I've made search_glent readable. I've also merged https://gerrit.wikimedia.org/r/c/analytics/refinery/+/525106 to do this by de" [puppet] - https://gerrit.wikimedia.org/r/524625 (https://phabricator.wikimedia.org/T227364) (owner: EBernhardson)
[14:52:42] !log ppchelko@deploy1001 Finished deploy [restbase/deploy@ea10fa5]: Switch event production to eventgate T211248, attempt 2 (duration: 13m 08s)
[14:52:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:52:50] T211248: Modern Event Platform: Stream Intake Service: Migrate eventlogging-service-eventbus events to eventgate-main - https://phabricator.wikimedia.org/T211248
[14:52:52] (CR) Hashar: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/524174 (https://phabricator.wikimedia.org/T228381) (owner: Hashar)
[14:53:53] (PS3) Hashar: zuul: stop zuul-merger gracefully [puppet] - https://gerrit.wikimedia.org/r/524180
[14:54:04] (CR) Hashar: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/524180 (owner: Hashar)
[14:54:17] RECOVERY - puppet last run on auth1002 is OK: OK: Puppet is currently enabled, last run 27 seconds ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[14:54:43] PROBLEM - Host dbproxy1003 is DOWN: PING CRITICAL - Packet loss = 100%
[14:54:58] robh: ^
[14:55:06] Pchelolo: FYI I did depool some restbase hosts as a precaution for T227139 and the deploy pooled them back
[14:55:07] T227139: a3-eqiad pdu refresh - https://phabricator.wikimedia.org/T227139
[14:55:23] godog: oh damn sorry about that
[14:55:42] did I break
something for you?
[14:56:04] i think that wouldn't break anything, unless the pdu replacement caused a power outage
[14:56:07] Pchelolo: no nothing broken :) I'm wondering if we could do that better
[14:56:10] it isn't supposed to
[14:56:15] but it could!
[14:56:34] that == the pool/depool behaviour on deploy
[14:57:29] (CR) Hashar: [V: +1] "Better https://puppet-compiler.wmflabs.org/compiler1001/241/contint1001.wikimedia.org/" [puppet] - https://gerrit.wikimedia.org/r/524174 (https://phabricator.wikimedia.org/T228381) (owner: Hashar)
[14:57:38] godog: ye, like scap checking whether the reason for depool is deploy or not
[14:57:59] yeah exactly, sth like that
[14:58:16] (meeting)
[14:58:25] (CR) Hashar: [V: +1] "https://puppet-compiler.wmflabs.org/compiler1002/242/contint1001.wikimedia.org/ looks good." [puppet] - https://gerrit.wikimedia.org/r/524180 (owner: Hashar)
[14:59:10] that sounds like a scap bug to me, and like something that is going to seriously bite someone some day
[14:59:59] RECOVERY - Host dbproxy1003 is UP: PING OK - Packet loss = 0%, RTA = 0.24 ms
[15:00:39] oh, so we lost one in the plugging in of tower a
[15:00:40] sucks
[15:00:48] luckily marostegui had depooled that
[15:01:05] oh wait, that was 1001
[15:01:12] marostegui: ^
[15:01:19] yeah, 1003 isn't active
[15:01:23] (PS2) Giuseppe Lavagetto: mediawiki::webserver: add mtail to gather latency, error rate metrics [puppet] - https://gerrit.wikimedia.org/r/520502 (https://phabricator.wikimedia.org/T226815)
[15:01:24] ok, whew
[15:01:29] (CR) Giuseppe Lavagetto: mediawiki::webserver: add mtail to gather latency, error rate metrics (4 comments) [puppet] - https://gerrit.wikimedia.org/r/520502 (https://phabricator.wikimedia.org/T226815) (owner: Giuseppe Lavagetto)
[15:01:39] robh: yeah, I checked when I checked the rack earlier, so not a big deal
[15:01:41] it was more a FYI
[15:02:18] Operations, ops-eqiad, DC-Ops: a3-eqiad pdu refresh -
https://phabricator.wikimedia.org/T227139 (RobH) All of the power has been migrated, and we are now setting up the networking for the new pdus
[15:03:35] PROBLEM - PHP opcache health on mwdebug1001 is CRITICAL: CRITICAL: opcache free space is below 50 MB https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health
[15:03:35] (CR) Giuseppe Lavagetto: mediawiki::webserver: add mtail to gather latency, error rate metrics (3 comments) [puppet] - https://gerrit.wikimedia.org/r/520502 (https://phabricator.wikimedia.org/T226815) (owner: Giuseppe Lavagetto)
[15:03:54] (PS3) Giuseppe Lavagetto: mediawiki::webserver: add mtail to gather latency, error rate metrics [puppet] - https://gerrit.wikimedia.org/r/520502 (https://phabricator.wikimedia.org/T226815)
[15:04:17] (PS1) Jhedden: icinga: update toolschecker webservice interval [puppet] - https://gerrit.wikimedia.org/r/525108 (https://phabricator.wikimedia.org/T221301)
[15:05:45] PROBLEM - IPMI Sensor Status on elastic1031 is CRITICAL: Sensor Type(s) Temperature, Power_Supply Status: Critical [PS Redundancy = Critical, Status = Critical, Status = Critical] https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures
[15:08:33] <_joe_> !log uninstalling php-pear, php-mail, php-mail-mime from mw1267 T195364
[15:08:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:08:41] T195364: Remove pear/mail packages from WMF MW app servers - https://phabricator.wikimedia.org/T195364
[15:10:08] Operations, ops-eqiad: Degraded RAID on cloudvirt1015 - https://phabricator.wikimedia.org/T223237 (Andrew)
[15:10:11] Operations, ops-eqiad, DC-Ops, User-Zppix, cloud-services-team (Kanban): VMs on cloudvirt1015 crashing - bad mainboard/memory - https://phabricator.wikimedia.org/T220853 (Andrew)
[15:10:17] Operations, ops-eqiad, DC-Ops, Patch-For-Review, cloud-services-team (Kanban): cloudvirt1015: apparent hardware errors in CPU/Memory - https://phabricator.wikimedia.org/T215012 (Andrew)
[15:11:47] RECOVERY - Host ps1-a3-eqiad is UP: PING OK - Packet loss = 0%, RTA = 3.11 ms
[15:11:55] (PS1) Ottomata: Use proper main-codfw Kafka cluster for eventgate-main in codfw [deployment-charts] - https://gerrit.wikimedia.org/r/525110 (https://phabricator.wikimedia.org/T211248)
[15:12:30] (PS2) Ottomata: Use proper main-codfw Kafka cluster for eventgate-main in codfw [deployment-charts] - https://gerrit.wikimedia.org/r/525110 (https://phabricator.wikimedia.org/T211248)
[15:13:08] (CR) Ottomata: [V: +2 C: +2] Use proper main-codfw Kafka cluster for eventgate-main in codfw [deployment-charts] - https://gerrit.wikimedia.org/r/525110 (https://phabricator.wikimedia.org/T211248) (owner: Ottomata)
[15:13:15] Operations, ops-eqiad, DC-Ops, cloud-services-team (Kanban): labvirt1015 crashes - https://phabricator.wikimedia.org/T171473 (Andrew)
[15:13:50] Operations, Analytics, Analytics-EventLogging: Decommission m4 proxies (dbproxy1004 and dbproxy1008) - https://phabricator.wikimedia.org/T228768 (Marostegui)
[15:14:16] !log otto@ helmfile [CODFW] Ran 'apply' command on namespace 'eventgate-main' for release 'main' .
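(The depool/repool collision discussed at 14:55-14:59 — the restbase deploy repooled hosts that had been manually depooled for the PDU work — matches the fix floated in-channel: tag every depool with a reason, and have the deploy tool repool only hosts it depooled itself. A hypothetical Python sketch of that idea, not actual conftool or scap behaviour:)

```python
# Hypothetical pooled-state store where each depool records a reason;
# a deploy only repools hosts whose depool reason is "deploy".
class PoolState:
    def __init__(self):
        self._depooled = {}  # host -> reason

    def depool(self, host, reason):
        self._depooled[host] = reason

    def repool_after_deploy(self, host):
        # Only repool if this host was depooled by the deploy itself.
        if self._depooled.get(host) == "deploy":
            del self._depooled[host]
            return True
        return False  # leave manually-depooled hosts alone

state = PoolState()
state.depool("restbase1016", "pdu-work")  # manual depool, e.g. for T227139
state.depool("restbase1021", "deploy")    # depooled by the deploy tool
assert state.repool_after_deploy("restbase1021") is True
assert state.repool_after_deploy("restbase1016") is False
```

(With this scheme a maintenance depool survives any number of deploys until the operator repools it explicitly.)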
[15:14:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:15:09] PROBLEM - ps1-a3-eqiad-infeed-load-tower-B-phase-X on ps1-a3-eqiad is CRITICAL: CRITICAL - Plugin timed out while executing system call https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[15:15:46] Operations, ops-eqiad, DC-Ops: elastic1031 failed PSU 2 fan - https://phabricator.wikimedia.org/T228769 (Cmjohnson)
[15:15:51] PROBLEM - ps1-a3-eqiad-infeed-load-tower-B-phase-Y on ps1-a3-eqiad is CRITICAL: CRITICAL - Plugin timed out while executing system call https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[15:16:19] PROBLEM - ps1-a3-eqiad-infeed-load-tower-B-phase-Z on ps1-a3-eqiad is CRITICAL: CRITICAL - Plugin timed out while executing system call https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[15:16:44] Operations, ops-eqiad, DC-Ops: a3-eqiad pdu refresh - https://phabricator.wikimedia.org/T227139 (RobH)
[15:17:12] Operations, ops-eqiad, DC-Ops: a3-eqiad pdu refresh - https://phabricator.wikimedia.org/T227139 (RobH) Open→Resolved All done. Elastic1031 has a PSU issue, and we lost power to dbproxy1003 (it was not in service) during this migration.
[15:17:14] Operations, ops-eqiad, DC-Ops: (July 22-26) install new PDUs in rows A/B (Top level tracking task) - https://phabricator.wikimedia.org/T226778 (RobH)
[15:18:16] Operations, ops-eqiad, DC-Ops, User-Zppix, cloud-services-team (Kanban): VMs on cloudvirt1015 crashing - bad mainboard/memory - https://phabricator.wikimedia.org/T220853 (Andrew)
[15:22:53] PROBLEM - Host ps1-a5-eqiad is DOWN: PING CRITICAL - Packet loss = 100%
[15:24:33] (CR) Alexandros Kosiaris: [C: +1] Add mediawiki development chart. [deployment-charts] - https://gerrit.wikimedia.org/r/522584 (https://phabricator.wikimedia.org/T224935) (owner: Jeena Huneidi)
[15:24:39] (PS1) Arturo Borrero Gonzalez: toolforge: k8s: kubadm: calico requires ipset [puppet] - https://gerrit.wikimedia.org/r/525112 (https://phabricator.wikimedia.org/T215531)
[15:25:23] (CR) Arturo Borrero Gonzalez: [C: +2] toolforge: k8s: kubadm: calico requires ipset [puppet] - https://gerrit.wikimedia.org/r/525112 (https://phabricator.wikimedia.org/T215531) (owner: Arturo Borrero Gonzalez)
[15:26:12] (PS2) Jhedden: icinga: update toolschecker webservice interval [puppet] - https://gerrit.wikimedia.org/r/525108 (https://phabricator.wikimedia.org/T221301)
[15:27:09] (CR) Jhedden: [C: +2] icinga: update toolschecker webservice interval [puppet] - https://gerrit.wikimedia.org/r/525108 (https://phabricator.wikimedia.org/T221301) (owner: Jhedden)
[15:27:33] (CR) Bstorm: toolforge: k8s: add nginx-ingress configuration. (1 comment) [puppet] - https://gerrit.wikimedia.org/r/524759 (https://phabricator.wikimedia.org/T228500) (owner: Arturo Borrero Gonzalez)
[15:32:19] (Abandoned) Ema: ATS: split the cache for beta variant of the mobile site [puppet] - https://gerrit.wikimedia.org/r/524789 (https://phabricator.wikimedia.org/T227432) (owner: Ema)
[15:33:31] (PS1) Ppchelko: Clean up eventlogging_service_uri from RESTBase profile. [puppet] - https://gerrit.wikimedia.org/r/525114 (https://phabricator.wikimedia.org/T211248)
[15:34:29] (CR) jerkins-bot: [V: -1] Clean up eventlogging_service_uri from RESTBase profile.
[puppet] - https://gerrit.wikimedia.org/r/525114 (https://phabricator.wikimedia.org/T211248) (owner: Ppchelko)
[15:35:51] (PS1) Muehlenhoff: Enable seccomp-based hardening for apt [puppet] - https://gerrit.wikimedia.org/r/525115
[15:36:01] PROBLEM - Host wtp2013 is DOWN: PING CRITICAL - Packet loss = 100%
[15:36:18] ^ expected
[15:36:21] PROBLEM - puppet last run on db1121 is CRITICAL: CRITICAL: Failed to apply catalog, zero resources tracked by Puppet. Could be an interrupted request or a dependency cycle. https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[15:36:54] (PS2) Ppchelko: Clean up eventlogging_service_uri from RESTBase profile. [puppet] - https://gerrit.wikimedia.org/r/525114 (https://phabricator.wikimedia.org/T211248)
[15:38:04] (PS10) Arturo Borrero Gonzalez: toolforge: k8s: add nginx-ingress configuration. [puppet] - https://gerrit.wikimedia.org/r/524759 (https://phabricator.wikimedia.org/T228500)
[15:38:40] (PS1) Arturo Borrero Gonzalez: Revert "secrets: toolforge: add default k8s nginx-ingress key pair" [labs/private] - https://gerrit.wikimedia.org/r/525116
[15:38:47] (CR) Arturo Borrero Gonzalez: [V: +2 C: +2] Revert "secrets: toolforge: add default k8s nginx-ingress key pair" [labs/private] - https://gerrit.wikimedia.org/r/525116 (owner: Arturo Borrero Gonzalez)
[15:40:41] (CR) Bstorm: [C: +1] "I like the commented prometheus scrape bits, since it's a good reminder that we haven't really thought about that piece yet :-D" [puppet] - https://gerrit.wikimedia.org/r/524759 (https://phabricator.wikimedia.org/T228500) (owner: Arturo Borrero Gonzalez)
[15:46:39] !log side b of a5-eqiad swapping pdu via T227141
[15:47:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:47:01] T227141: a5-eqiad pdu refresh - https://phabricator.wikimedia.org/T227141
[15:49:08] correction was side a (they werent labeled on old pdu towers)
[15:49:15] so that is changing instead of b first
[15:49:55] RECOVERY - Host wtp2013 is UP: PING OK - Packet loss = 0%, RTA = 37.49 ms
[15:53:35] Operations, SRE-Access-Requests, Release-Engineering-Team (Deployment services): Request access to deployment cluster for Jakob_WMDE - https://phabricator.wikimedia.org/T227193 (WMDE-leszek) Thanks gentlemen! @Jakob_WMDE has noticed today he does not have +2 rights on operations/mediawiki-config. We...
[15:54:01] PROBLEM - Host mw2159.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[15:54:01] PROBLEM - Host mw2160.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[15:54:47] mgmt down on mw2159 and mw2160 that's me
[15:55:11] Operations, ops-eqiad, DC-Ops, Epic, cloud-services-team (Kanban): Move cloudvirt hosts to 10Gb ethernet - https://phabricator.wikimedia.org/T216195 (Andrew)
[15:55:39] (PS4) Krinkle: noc db.php: include readonly status & group loads [mediawiki-config] - https://gerrit.wikimedia.org/r/524825 (owner: CDanis)
[15:56:08] (CR) Krinkle: [C: +1] "readonly=>readOnly, for consistency. And some spacing issue fixed (phpcs should have caught that, will look at that later)." [mediawiki-config] - https://gerrit.wikimedia.org/r/524825 (owner: CDanis)
[15:56:49] ok side a done doing side b in a5-eqiad
[15:57:08] mgmt may flap
[15:57:23] (PS1) Ppchelko: Clean up eventlogging_service_uri from maps. [puppet] - https://gerrit.wikimedia.org/r/525121 (https://phabricator.wikimedia.org/T211248)
[15:58:04] (CR) Krinkle: [C: +1] "non-blocking issue for later improvement perhaps." (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/524825 (owner: CDanis)
[15:58:18] (CR) jerkins-bot: [V: -1] Clean up eventlogging_service_uri from maps. [puppet] - https://gerrit.wikimedia.org/r/525121 (https://phabricator.wikimedia.org/T211248) (owner: Ppchelko)
[15:58:21] Operations, SRE-Access-Requests: Requesting access to machines [stat1004, stat1007, stat1006, notebook1003, and notebook1004] and groups for Mayakpwiki - https://phabricator.wikimedia.org/T227633 (Mayakp.wiki) @fsero : Please advise if access is provided. Thanks!
[15:58:39] PROBLEM - Text HTTP 5xx reqs/min on graphite1004 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [1000.0] https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/d/myRmf1Pik/varnish-aggregate-client-status-codes?var-cache_type=varnish-text&var-status_type=5
[15:58:46] (CR) Ppchelko: "According to @Mholloway there are no plans to use it now, so this can safely be removed." [puppet] - https://gerrit.wikimedia.org/r/525121 (https://phabricator.wikimedia.org/T211248) (owner: Ppchelko)
[15:58:57] PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 71, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[16:00:01] PROBLEM - Esams HTTP 5xx reqs/min on graphite1004 is CRITICAL: CRITICAL: 30.00% of data above the critical threshold [1000.0] https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/d/myRmf1Pik/varnish-aggregate-client-status-codes?var-site=esams&var-status_type=5
[16:00:04] _joe_: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) Puppet SWAT(Max 6 patches) deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190723T1600).
[16:00:04] No GERRIT patches in the queue for this window AFAICS.
[16:00:21] (PS2) Ppchelko: Clean up eventlogging_service_uri from maps.
[puppet] - 10https://gerrit.wikimedia.org/r/525121 (https://phabricator.wikimedia.org/T211248) [16:00:52] (03CR) 10jerkins-bot: [V: 04-1] Clean up eventlogging_service_uri from maps. [puppet] - 10https://gerrit.wikimedia.org/r/525121 (https://phabricator.wikimedia.org/T211248) (owner: 10Ppchelko) [16:01:29] 10Operations, 10SRE-Access-Requests, 10Release-Engineering-Team (Deployment services): Request access to deployment cluster for Jakob_WMDE - https://phabricator.wikimedia.org/T227193 (10hashar) [[ https://gerrit.wikimedia.org/r/#/admin/projects/operations/mediawiki-config,access mediawiki-config access ]] ar... [16:01:33] 10Operations, 10serviceops, 10Core Platform Team Workboards (Green): Keys from MediaWiki Redis Instances - https://phabricator.wikimedia.org/T228703 (10jijiki) 05Open→03Resolved @holger.knust I copied a gzipped dump to a server you have access to, please reopen when you need newer one:) [16:01:39] PROBLEM - HTTP availability for Nginx -SSL terminators- at esams on icinga1001 is CRITICAL: cluster=cache_text site=esams https://wikitech.wikimedia.org/wiki/Cache_TLS_termination https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [16:02:27] (03PS3) 10Ppchelko: Clean up eventlogging_service_uri from maps. [puppet] - 10https://gerrit.wikimedia.org/r/525121 (https://phabricator.wikimedia.org/T211248) [16:02:30] is this related to a5? [16:02:31] PROBLEM - HTTP availability for Varnish at esams on icinga1001 is CRITICAL: job=varnish-text site=esams https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 [16:02:47] 10Operations, 10SRE-Access-Requests, 10Release-Engineering-Team (Deployment services): Request access to deployment cluster for Jakob_WMDE - https://phabricator.wikimedia.org/T227193 (10hashar) Confirmed to me by @Jakob_WMDE ! 
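The Text/Esams 5xx alerts above ("30.00% of data above the critical threshold [1000.0]") come from a graphite-backed check: it fetches recent datapoints for a metric and fires when the fraction of points sitting above a fixed value crosses a warning or critical percentage. A minimal sketch of that logic in Python — function names and the exact warn/crit percentages are illustrative, not the actual check_graphite implementation:

```python
def percent_above(datapoints, threshold):
    """Percentage of non-null datapoints strictly above a threshold."""
    values = [v for v in datapoints if v is not None]
    if not values:
        return 0.0
    over = sum(1 for v in values if v > threshold)
    return 100.0 * over / len(values)

def alert_state(datapoints, warn_pct, crit_pct, threshold):
    """Map a series to OK/WARNING/CRITICAL by percent-above-threshold."""
    pct = percent_above(datapoints, threshold)
    if pct >= crit_pct:
        return "CRITICAL"
    if pct >= warn_pct:
        return "WARNING"
    return "OK"
```

Graphite returns null for intervals with no data, which is why the sketch drops `None` before computing the percentage.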
[16:02:51] RECOVERY - puppet last run on db1121 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [16:02:53] RECOVERY - Host ps1-a5-eqiad is UP: PING OK - Packet loss = 0%, RTA = 3.23 ms [16:02:59] PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 229, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [16:03:03] 10Operations, 10SRE-Access-Requests, 10Release-Engineering-Team (Deployment services): Request access to deployment cluster for Jakob_WMDE - https://phabricator.wikimedia.org/T227193 (10WMDE-leszek) merci beaucoup @hashar! [16:03:24] 10Operations, 10ops-codfw: (OoW) wtp2013 memory correctable errors - https://phabricator.wikimedia.org/T194174 (10Papaul) 05Open→03Resolved - Replace DIMM B2 - Clear log - Upgrade BIOS from 2.3 to 2.6 - Upgrade IDRAC from 1.57 to 2.61 All looks good now. Resolving this task [16:03:40] 10Operations, 10SRE-Access-Requests, 10Release-Engineering-Team (Deployment services): Request access to deployment cluster for Jakob_WMDE - https://phabricator.wikimedia.org/T227193 (10hashar) bitte schon `\o/` [16:03:53] 10Operations, 10Wikimedia-Mailing-lists: New Mailing lists for AzWiki sysops - https://phabricator.wikimedia.org/T228542 (10Aklapper) Any link to share to a discussion on Meta? [16:05:17] RECOVERY - Host mw2160.mgmt is UP: PING WARNING - Packet loss = 61%, RTA = 36.81 ms [16:05:21] RECOVERY - HTTP availability for Varnish at esams on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 [16:05:59] RECOVERY - HTTP availability for Nginx -SSL terminators- at esams on icinga1001 is OK: All metrics within thresholds.
https://wikitech.wikimedia.org/wiki/Cache_TLS_termination https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [16:09:39] ah no ok this is related to the link between esams and eqiad [16:09:55] (cr2-eqiad <-> cr2-esams seems down) [16:10:12] again? [16:10:13] RECOVERY - Text HTTP 5xx reqs/min on graphite1004 is OK: OK: Less than 1.00% above the threshold [250.0] https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/d/myRmf1Pik/varnish-aggregate-client-status-codes?var-cache_type=varnish-text&var-status_type=5 [16:10:17] RECOVERY - Esams HTTP 5xx reqs/min on graphite1004 is OK: OK: Less than 1.00% above the threshold [250.0] https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/d/myRmf1Pik/varnish-aggregate-client-status-codes?var-site=esams&var-status_type=5 [16:10:25] bblack: I am checking now, just noticed in icinga :( [16:12:18] bblack: I can see that cr2-eqiad now routes traffic to cr1-eqiad and then knams, so it seems so [16:13:37] PROBLEM - puppet last run on snapshot1007 is CRITICAL: CRITICAL: Failed to apply catalog, zero resources tracked by Puppet. Could be an interrupted request or a dependency cycle. 
https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [16:13:44] https://librenms.wikimedia.org/device/device=66/tab=port/port=16577/ [16:13:56] different than the last time though IIRC [16:15:37] no same link sigh [16:16:55] RECOVERY - Host mw2159.mgmt is UP: PING OK - Packet loss = 0%, RTA = 36.74 ms [16:18:01] it's a different one than the one I was thinking, I think [16:18:08] 10Operations, 10Wikimedia-Mailing-lists: New Mailing lists for AzWiki sysops - https://phabricator.wikimedia.org/T228542 (10Force_Radical) @Aklapper [[https://meta.wikimedia.org/wiki/Requests_for_comment/Do_something_about_azwiki | this RFC on metawiki]] [16:20:14] 10Operations, 10ops-eqiad, 10DC-Ops: a5-eqiad pdu refresh - https://phabricator.wikimedia.org/T227141 (10RobH) Both sides are swapped, and all items appear online. [16:22:11] !log pool prometheus1003 - T227139 [16:22:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:22:18] T227139: a3-eqiad pdu refresh - https://phabricator.wikimedia.org/T227139 [16:23:02] 10Operations, 10Wikimedia-Mailing-lists: New Mailing lists for AzWiki sysops - https://phabricator.wikimedia.org/T228542 (10Eldarado) >>! In T228542#5357749, @Force_Radical wrote: > @Aklapper [[https://meta.wikimedia.org/wiki/Requests_for_comment/Do_something_about_azwiki | this RFC on metawiki]] This discuss... [16:24:32] oh no, same one [16:26:30] so, GTT has been stable lately (cr1-eqiad xe-4/2/2.13 <-> cr2-knams xe-1/1/0.13), but Level3 not so much (cr2-eqiad xe-4/1/3 <-> cr2-esams xe-0/1/3) [16:26:37] yeah [16:26:49] and every time we have some impact [16:27:18] the last time there was some maintenance scheduled, but this time it seems not for eqiad? [16:27:22] the GTT one is MPLS [16:27:43] I need to go find the other wave out of eqord [16:28:09] PROBLEM - puppet last run on dbstore1004 is CRITICAL: CRITICAL: Failed to apply catalog, zero resources tracked by Puppet. Could be an interrupted request or a dependency cycle.
https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [16:29:25] I thought we had one anyways, looking again [16:29:42] bblack: (ignorant qs) GTT is MPLS on their side right? (trying to parse what you were writing :) [16:29:46] 10Operations, 10ops-codfw, 10DBA, 10Goal: rack/setup/install db21[21-30].codfw.wmnet - https://phabricator.wikimedia.org/T227113 (10Papaul) [16:30:12] yeah I was wrong about eqord, that was I guess some past plan that never materialized [16:30:44] So we have GTT MPLS and L3 wave listed in my comment earlier (and a tunnel backup) [16:31:31] the L3 wave means we have a physical fiber path, the GTT MPLS means it looks like fiber on each side to us, but it's really just a virtual circuit of sorts in GTT's network (generally these should have less availability and more latency variance than a real wave). [16:32:31] ah okok so my understanding was kinda good [16:32:33] thanks :) [16:33:58] 10Operations, 10Wikimedia-Mailing-lists: New Mailing lists for AzWiki sysops - https://phabricator.wikimedia.org/T228542 (10Force_Radical) @Eldarado A private admin-only mailing list is almost equivalent to having an AzWiki FB group, something that was criticized over at the RFC. Further, there have been discu... [16:34:15] the circuit is still dead I think [16:34:40] looks like it https://librenms.wikimedia.org/device/device=66/tab=port/port=16577/ [16:34:45] (and our traffic is now using the GTT MPLS, but there's always a disruption with 5xx alerts and such on the transition due to loss/reordering etc) [16:35:09] and I can see only Level3 maintenance for Tx, not Virginia [16:35:13] (03PS11) 10Jeena Huneidi: Add mediawiki development chart. 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/522584 (https://phabricator.wikimedia.org/T224935) [16:35:30] other than that, I don't see any alert from Level3 telling us that the circuit is broken [16:35:46] (03PS2) 10Cwhite: profile: cleanup per-site varnishkafka deploy flags [puppet] - 10https://gerrit.wikimedia.org/r/524934 (https://phabricator.wikimedia.org/T196066) [16:36:01] I'm compiling up some data from recent event logs on the L3 circuit [16:36:32] (03CR) 10Jeena Huneidi: [V: 03+2 C: 03+2] Add mediawiki development chart. [deployment-charts] - 10https://gerrit.wikimedia.org/r/522584 (https://phabricator.wikimedia.org/T224935) (owner: 10Jeena Huneidi) [16:39:11] (03CR) 10Dzahn: "ok, yea, fair enough. i just didn't have time for that yesterday. i had merged my previous change to make it required and then a couple pu" [puppet] - 10https://gerrit.wikimedia.org/r/524951 (https://phabricator.wikimedia.org/T197873) (owner: 10Dzahn) [16:41:49] RECOVERY - puppet last run on snapshot1007 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [16:41:59] https://phabricator.wikimedia.org/P8785 [16:42:12] ^ recent history on this L3 wave from librenms event logs for the port statuses [16:42:40] really nice [16:43:22] should we open a task with those info and let Arzhel contact L3 to figure out what's wrong? [16:43:44] I'll let him open one, he may have a different pov, and maybe multiple of those were planned maint on L3's end, I donno. [16:43:58] ack :) [16:44:32] XioNoX: ping - https://phabricator.wikimedia.org/P8785 - is there a problem here we should do something about? Seems like a lot of link outages lately. Could be xcvr issue, some of it I think is planned maint, I donno. It just seems like a recurrent disruption lately... [16:51:22] bblack: last qs - you mentioned the MPLS vs fiber quality of service, may I ask also info about the GRE tunnel? 
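As described above, the esams<->eqiad path has three transports of decreasing preference: the Level3 wave (a physical fiber path), the GTT MPLS circuit (looks like fiber on each side, but is a virtual circuit inside GTT's network), and a GRE tunnel as last resort. A toy Python model of that fallback order — the transport names and selection logic here are purely illustrative, not how the routers actually choose paths:

```python
# Most-preferred transport first; a real router does this with route
# metrics/preferences, not a Python list (illustrative only).
TRANSPORTS = ["level3-wave", "gtt-mpls", "gre-tunnel"]

def active_transport(link_up):
    """Return the most-preferred transport whose link is up, else None.

    link_up maps transport name -> bool (is the link currently usable).
    """
    for transport in TRANSPORTS:
        if link_up.get(transport):
            return transport
    return None
```

With the Level3 wave down, this model picks the GTT MPLS circuit, matching the observed behaviour in the log (traffic rerouted via cr1-eqiad and knams, with a brief 5xx blip during the transition).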
[16:51:52] last hope so probably not the best performer of the group [16:53:02] 10Operations, 10ops-codfw, 10DBA, 10Goal: rack/setup/install db21[21-30].codfw.wmnet - https://phabricator.wikimedia.org/T227113 (10Papaul) [16:56:29] RECOVERY - puppet last run on dbstore1004 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [16:58:46] !log ppchelko@deploy1001 Started deploy [changeprop/deploy@894f735]: Switch internal events to the new schema T226522 [16:58:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:58:54] T226522: Modern Event Platform: Stream Intake Service: Migrate change-prop events to new (EventGate) style schemas - https://phabricator.wikimedia.org/T226522 [16:59:09] PROBLEM - puppet last run on mwmaint2001 is CRITICAL: CRITICAL: Failed to apply catalog, zero resources tracked by Puppet. Could be an interrupted request or a dependency cycle. https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [17:00:04] cscott, arlolra, subbu, and halfak: Your horoscope predicts another unfortunate Services – Graphoid / Parsoid / Citoid / ORES deploy. May Zuul be (nice) with you. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190723T1700). [17:00:17] !log ppchelko@deploy1001 Finished deploy [changeprop/deploy@894f735]: Switch internal events to the new schema T226522 (duration: 01m 30s) [17:00:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:02:27] (03CR) 10Mholloway: [C: 04-1] Clean up eventlogging_service_uri from maps. 
(031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/525121 (https://phabricator.wikimedia.org/T211248) (owner: 10Ppchelko) [17:03:20] (03CR) 10Dzahn: [C: 03+2] Remove unused Apache config [puppet] - 10https://gerrit.wikimedia.org/r/525090 (owner: 10Muehlenhoff) [17:03:28] (03PS2) 10Dzahn: Remove unused Apache config [puppet] - 10https://gerrit.wikimedia.org/r/525090 (owner: 10Muehlenhoff) [17:04:46] (03CR) 10Cwhite: [C: 03+2] hiera: deploy varnishkafka exporter to ulsfo [puppet] - 10https://gerrit.wikimedia.org/r/524930 (https://phabricator.wikimedia.org/T196066) (owner: 10Cwhite) [17:04:54] (03PS2) 10Cwhite: hiera: deploy varnishkafka exporter to ulsfo [puppet] - 10https://gerrit.wikimedia.org/r/524930 (https://phabricator.wikimedia.org/T196066) [17:08:22] (03PS3) 10Dzahn: Remove unused Apache config [puppet] - 10https://gerrit.wikimedia.org/r/525090 (owner: 10Muehlenhoff) [17:14:51] (03PS1) 10Ppchelko: Revert "Add variables for map tile invalidation" [puppet] - 10https://gerrit.wikimedia.org/r/525131 (https://phabricator.wikimedia.org/T211248) [17:15:18] (03CR) 10jerkins-bot: [V: 04-1] Revert "Add variables for map tile invalidation" [puppet] - 10https://gerrit.wikimedia.org/r/525131 (https://phabricator.wikimedia.org/T211248) (owner: 10Ppchelko) [17:15:30] (03Abandoned) 10Ppchelko: Clean up eventlogging_service_uri from maps. [puppet] - 10https://gerrit.wikimedia.org/r/525121 (https://phabricator.wikimedia.org/T211248) (owner: 10Ppchelko) [17:15:49] (03CR) 10BBlack: [C: 03+1] "We just had some of these alerts, and confirm the availability alerts mirrored these removed ones (a minute or two later, but the one bein" [puppet] - 10https://gerrit.wikimedia.org/r/523891 (https://phabricator.wikimedia.org/T184942) (owner: 10Filippo Giunchedi) [17:15:52] (03CR) 10Ppchelko: "Heh, it's actually easier to just revert the commit that added these." 
[puppet] - 10https://gerrit.wikimedia.org/r/525121 (https://phabricator.wikimedia.org/T211248) (owner: 10Ppchelko) [17:16:24] 10Operations, 10ops-codfw, 10DBA, 10Goal: rack/setup/install db21[21-30].codfw.wmnet - https://phabricator.wikimedia.org/T227113 (10Papaul) [17:17:58] (03Restored) 10Ppchelko: Clean up eventlogging_service_uri from maps. [puppet] - 10https://gerrit.wikimedia.org/r/525121 (https://phabricator.wikimedia.org/T211248) (owner: 10Ppchelko) [17:18:19] (03CR) 10Ppchelko: "Or not :) Too many conflicts." [puppet] - 10https://gerrit.wikimedia.org/r/525121 (https://phabricator.wikimedia.org/T211248) (owner: 10Ppchelko) [17:18:29] (03Abandoned) 10Ppchelko: Revert "Add variables for map tile invalidation" [puppet] - 10https://gerrit.wikimedia.org/r/525131 (https://phabricator.wikimedia.org/T211248) (owner: 10Ppchelko) [17:20:16] (03PS6) 10Thcipriani: blubberoid: Add policy file [deployment-charts] - 10https://gerrit.wikimedia.org/r/517573 (https://phabricator.wikimedia.org/T215319) [17:20:47] (03CR) 10Thcipriani: [V: 03+2 C: 03+2] blubberoid: Add policy file [deployment-charts] - 10https://gerrit.wikimedia.org/r/517573 (https://phabricator.wikimedia.org/T215319) (owner: 10Thcipriani) [17:21:15] (03PS3) 10Ottomata: Clean up eventlogging_service_uri from RESTBase profile. [puppet] - 10https://gerrit.wikimedia.org/r/525114 (https://phabricator.wikimedia.org/T211248) (owner: 10Ppchelko) [17:22:15] (03PS4) 10Ppchelko: Clean up eventlogging_service_uri from maps. [puppet] - 10https://gerrit.wikimedia.org/r/525121 (https://phabricator.wikimedia.org/T109776) [17:22:50] (03CR) 10Ottomata: [C: 03+2] Clean up eventlogging_service_uri from RESTBase profile. [puppet] - 10https://gerrit.wikimedia.org/r/525114 (https://phabricator.wikimedia.org/T211248) (owner: 10Ppchelko) [17:23:16] (03CR) 10jerkins-bot: [V: 04-1] Clean up eventlogging_service_uri from maps. 
[puppet] - 10https://gerrit.wikimedia.org/r/525121 (https://phabricator.wikimedia.org/T109776) (owner: 10Ppchelko) [17:24:12] 10Operations, 10ops-codfw, 10decommission: Decommission db2042 - https://phabricator.wikimedia.org/T225090 (10Papaul) [17:24:18] (03PS5) 10Ppchelko: Clean up eventlogging_service_uri from maps. [puppet] - 10https://gerrit.wikimedia.org/r/525121 (https://phabricator.wikimedia.org/T109776) [17:25:18] (03CR) 10jerkins-bot: [V: 04-1] Clean up eventlogging_service_uri from maps. [puppet] - 10https://gerrit.wikimedia.org/r/525121 (https://phabricator.wikimedia.org/T109776) (owner: 10Ppchelko) [17:26:14] (03PS6) 10Ppchelko: Clean up eventlogging_service_uri from maps. [puppet] - 10https://gerrit.wikimedia.org/r/525121 (https://phabricator.wikimedia.org/T109776) [17:27:25] RECOVERY - puppet last run on mwmaint2001 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [17:29:53] (03CR) 10Mholloway: [C: 03+1] Clean up eventlogging_service_uri from maps. [puppet] - 10https://gerrit.wikimedia.org/r/525121 (https://phabricator.wikimedia.org/T109776) (owner: 10Ppchelko) [17:32:01] (03CR) 10Ottomata: [C: 03+2] Clean up eventlogging_service_uri from maps. 
[puppet] - 10https://gerrit.wikimedia.org/r/525121 (https://phabricator.wikimedia.org/T109776) (owner: 10Ppchelko) [17:36:35] !log ppchelko@deploy1001 Started deploy [changeprop/deploy@6c5c0a3]: Switch internal events to the new schema T226522, step 2 [17:36:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:36:45] T226522: Modern Event Platform: Stream Intake Service: Migrate change-prop events to new (EventGate) style schemas - https://phabricator.wikimedia.org/T226522 [17:38:11] !log ppchelko@deploy1001 Finished deploy [changeprop/deploy@6c5c0a3]: Switch internal events to the new schema T226522, step 2 (duration: 01m 37s) [17:38:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:38:37] (03PS3) 10Dzahn: Phabricator: Remove support for jessie [puppet] - 10https://gerrit.wikimedia.org/r/524718 (owner: 10Muehlenhoff) [17:39:14] (03CR) 10Alexandros Kosiaris: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/524525 (owner: 10Alexandros Kosiaris) [17:40:17] (03PS5) 10CDanis: noc db.php: include readonly status & group loads [mediawiki-config] - 10https://gerrit.wikimedia.org/r/524825 [17:40:41] (03CR) 10CDanis: "> Note that `$x ?? $y` is equivalent to `isset( $x ) ? $x : $y`." 
(031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/524825 (owner: 10CDanis) [17:41:17] jouncebot: next [17:41:18] In 5 hour(s) and 18 minute(s): Evening SWAT (Max 6 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190723T2300) [17:51:11] (03CR) 10Dzahn: [C: 03+2] Phabricator: Remove support for jessie [puppet] - 10https://gerrit.wikimedia.org/r/524718 (owner: 10Muehlenhoff) [17:52:30] !log installing Java security updates on kafka/main and Logstash servers [17:52:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:54:54] (03PS1) 10Elukey: profile::kerberos::kadminserver: add script to create principals [puppet] - 10https://gerrit.wikimedia.org/r/525137 (https://phabricator.wikimedia.org/T226104) [17:59:25] (03CR) 10CDanis: [C: 03+2] "thanks for the reviews!" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/524825 (owner: 10CDanis) [18:00:21] (03Merged) 10jenkins-bot: noc db.php: include readonly status & group loads [mediawiki-config] - 10https://gerrit.wikimedia.org/r/524825 (owner: 10CDanis) [18:00:36] (03CR) 10jenkins-bot: noc db.php: include readonly status & group loads [mediawiki-config] - 10https://gerrit.wikimedia.org/r/524825 (owner: 10CDanis) [18:03:16] !log cdanis@deploy1001 Synchronized docroot/noc/db.php: 8def4af1d noc db.php: include readonly status & group loads (duration: 00m 55s) [18:03:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:04:13] ehm [18:04:16] https://phabricator.wikimedia.org/P8786$55 [18:04:19] !log depool cp1077 + cp1088 - T227143 [18:04:20] ImportError: No module named concurrent.futures [18:04:22] during a scap run [18:04:24] that's new to me [18:04:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:04:26] T227143: a7-eqiad pdu refresh - https://phabricator.wikimedia.org/T227143 [18:05:34] !log lvs1013 - disable puppet and stop pybal - T227143 [18:05:40] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:05:59] !log jforrester@deploy1001 Synchronized php-1.34.0-wmf.14/includes/export/XmlDumpWriter.php: T228720 Make XmlDumpwriter resilient to blob store corruption (duration: 00m 57s) [18:06:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:06:07] T228720: stub for enwiki broken, attempt to load content for bad rev during sha1 retrieval - https://phabricator.wikimedia.org/T228720 [18:06:24] James_F: are you having issues with scap'ing to mw1267 as well? [18:06:28] !log Sync error on mw1314.eqiad.wmnet, No module named concurrent.futures [18:06:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:06:39] 1314 wtf [18:06:48] James_F: apergosI have that same error on mw1267 [18:06:56] Oh, hmm, no, 1267. [18:07:04] !log Belay that, error on mw1267. [18:07:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:07:09] ah hm [18:07:26] (Syncing wmf.15 too.) [18:07:38] also have it when I ssh to mw1267 and attempt scap pull [18:07:41] "concurrent.futures" sounds familiar. [18:07:45] 10Operations, 10ops-eqiad, 10DC-Ops: b3-eqiad pdu refresh - https://phabricator.wikimedia.org/T227539 (10RobH) [18:07:56] Aha! [18:07:57] https://phabricator.wikimedia.org/T228482 [18:08:07] !log jforrester@deploy1001 Synchronized php-1.34.0-wmf.15/includes/export/XmlDumpWriter.php: T228720 Make XmlDumpwriter resilient to blob store corruption (duration: 00m 54s) [18:08:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:08:25] Did that box not get the python module or whatever? [18:08:50] dpkg -s scap | grep Version --> Version: 3.11.0-1 [18:08:56] so it has the version that doesn't have the dependency [18:09:08] unless anyone objects I am going to apt-get install the dependency there by hand [18:09:13] Oh dear. Yeah, please do. 
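The `ImportError: No module named concurrent.futures` above arises because newer scap parallelizes work with the `concurrent.futures` module, which is built into Python 3 but on Python 2 hosts only exists via the `futures` backport (Debian package `python-concurrent.futures`) — the package mw1267 was missing. An illustrative sketch of that dependency pattern; `sync_host`/`sync_all` are hypothetical stand-ins, not scap's actual code:

```python
# concurrent.futures ships with Python 3; on Python 2 it must come from
# the "futures" backport, otherwise the import fails exactly as in the
# scap traceback above.
try:
    from concurrent.futures import ThreadPoolExecutor
except ImportError:  # Python 2 without python-concurrent.futures installed
    raise SystemExit("install python-concurrent.futures (the 'futures' backport)")

def sync_host(host):
    # Placeholder for per-host work (e.g. fetching code to one appserver).
    return (host, "ok")

def sync_all(hosts, workers=4):
    """Run sync_host across hosts in parallel, returning {host: status}."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return dict(pool.map(sync_host, hosts))
```

This is also why simply upgrading the scap package everywhere (which declares the dependency) fixes the problem, as discussed later in the log.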
[18:09:21] PROBLEM - pybal on lvs1013 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 0 (root), args /usr/sbin/pybal https://wikitech.wikimedia.org/wiki/PyBal [18:09:51] (And then do a scap pull on the box?) [18:09:54] !log cdanis@mw1267.eqiad.wmnet ~ ☕ sudo apt install python-concurrent.futures [18:10:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:10:07] !log cdanis@mw1267.eqiad.wmnet /srv/mediawiki ☕ scap pull [18:10:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:10:17] ehm [18:10:20] sudo: /usr/local/bin/mwscript: command not found [18:10:22] 18:10:09 pull failed: Command '/usr/local/bin/mwscript extensions/WikimediaMaintenance/refreshMessageBlobs.php' returned non-zero exit status 1 [18:10:31] PROBLEM - PyBal backends health check on lvs1013 is CRITICAL: PYBAL CRITICAL - Bad Response from pybal: 500 Cant connect to localhost:9090 https://wikitech.wikimedia.org/wiki/PyBal [18:10:35] cdanis: there is an open bug for this [18:10:44] it will soon be fixed [18:10:53] hrm, scap should probably be updated everywhere...that would fix it [18:10:59] jijiki: okay, but, if we can't scap to this appserver, shouldn't it be depooled? 
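The `dpkg -s scap | grep Version` check quoted earlier can also be done by parsing dpkg's control-format output directly instead of grepping; a small illustrative helper (hypothetical, not part of scap or debmonitor):

```python
def dpkg_field(status_output, field="Version"):
    """Extract a field (e.g. Version) from `dpkg -s <pkg>` output.

    dpkg -s prints RFC-822-style "Field: value" lines; returns the
    value of the requested field, or None if it is absent.
    """
    prefix = field + ":"
    for line in status_output.splitlines():
        if line.startswith(prefix):
            return line[len(prefix):].strip()
    return None
```

For example, feeding it the output seen on mw1267 (`Version: 3.11.0-1`) returns the version string that identified the host as still running the scap release without the declared dependency.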
[18:11:05] PROBLEM - Host ms-be1029 is DOWN: PING CRITICAL - Packet loss = 100% [18:11:19] https://phabricator.wikimedia.org/T228482 [18:11:22] !log depool mw1267 [18:11:22] ACKNOWLEDGEMENT - PyBal backends health check on lvs1013 is CRITICAL: PYBAL CRITICAL - Bad Response from pybal: 500 Cant connect to localhost:9090 Brandon Black T227143 https://wikitech.wikimedia.org/wiki/PyBal [18:11:22] ACKNOWLEDGEMENT - PyBal connections to etcd on lvs1013 is CRITICAL: CRITICAL: 0 connections established with conf1004.eqiad.wmnet:4001 (min=8) Brandon Black T227143 https://wikitech.wikimedia.org/wiki/PyBal [18:11:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:11:55] PROBLEM - Host ms-be1030 is DOWN: PING CRITICAL - Packet loss = 100% [18:12:01] PROBLEM - Host ms-be1028 is DOWN: PING CRITICAL - Packet loss = 100% [18:12:38] I assume, at least, that if you can't scap somewhere, there's easily the potential for config or code to get out of date on that server, and it shouldn't be pooled [18:12:41] sorry, I was completely checked out already...
[18:12:49] PROBLEM - Host ms-be1040 is DOWN: PING CRITICAL - Packet loss = 100% [18:13:16] those are a7 and intended aiui [18:13:16] !log started depooling servers in a7-eqiad for pdu work via T227143 [18:13:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:13:23] T227143: a7-eqiad pdu refresh - https://phabricator.wikimedia.org/T227143 [18:13:50] cdanis yes, no scap = depool it for sure [18:14:10] jouncebot: next [18:14:10] In 4 hour(s) and 45 minute(s): Evening SWAT (Max 6 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190723T2300) [18:14:18] i think this is a good time to upgrade scap [18:14:43] mutante: lgtm ;) [18:14:56] ok :) [18:17:33] PROBLEM - Host ps1-a7-eqiad is DOWN: PING CRITICAL - Packet loss = 100% [18:18:12] I have not seen /usr/local/bin/mwscript: command not found before, but I assume there is some good explanation for it [18:18:25] cdanis: yes, that is why i want to upgrade scap [18:18:34] ah ok [18:18:36] where do you see it? 
[18:18:40] mw1267 [18:18:46] (now depooled) [18:18:51] https://phabricator.wikimedia.org/T228328 [18:18:55] that is the background [18:19:07] ok, ack [18:19:17] (03PS1) 10Volans: Fix extras_require key for use in console_scripts [software/conftool] - 10https://gerrit.wikimedia.org/r/525140 [18:19:23] ahh I see, I thought the issue was just the missing python concurrent dependency [18:19:25] ty mutante [18:19:27] cdanis: ^^^ [18:19:49] (03CR) 10CDanis: [C: 03+2] Fix extras_require key for use in console_scripts [software/conftool] - 10https://gerrit.wikimedia.org/r/525140 (owner: 10Volans) [18:20:02] volans: 𝓪𝓹𝓹𝓻𝓸𝓿𝓮𝓭 [18:20:05] rotfl [18:20:29] pretty :) [18:20:35] yes ms-be 1028,29,30,40 are me [18:20:46] i left them in monitoring because i wanted to see them come back when we finish [18:20:57] and ps1-a7-eqiad ping loss is also epected [18:21:00] expected even [18:22:51] (03Merged) 10jenkins-bot: Fix extras_require key for use in console_scripts [software/conftool] - 10https://gerrit.wikimedia.org/r/525140 (owner: 10Volans) [18:23:13] 10Operations, 10ops-eqiad, 10DC-Ops: (July 22-26) install new PDUs in rows A/B (Top level tracking task) - https://phabricator.wikimedia.org/T226778 (10aborrero) The more "easy" racks for us in row B are `B3` and `B6`. I propose we start with these. rack `B3` contains cloudvirt1027 and we would like to real... [18:24:01] oh, wow "generate-debdeploy-spec". i remember writing those files manually last time i used debdeploy for something :) [18:24:05] that's nice [18:25:33] yeah it's pretty good [18:25:41] mutante: I have in my home dir [18:25:44] wait. somebody else already upgrade it? it looks like it heh [18:25:46] from the previous scap updgrade [18:25:58] jijiki: did you upgrade scap by any chance? [18:26:04] about a month ago [18:26:10] I see 3.11.0-1 on mw1267, not 3.11.1-1 [18:26:17] hmm. 
no, i meant like yesterday, heh [18:26:21] not at all [18:26:40] oh, nevermind, i am just reading the output of debmonitor the wrong way [18:26:45] it's upgrade-able to that [18:26:47] alright [18:28:23] PROBLEM - Work requests waiting in Zuul Gearman server on contint1001 is CRITICAL: CRITICAL: 57.14% of data above the critical threshold [140.0] https://www.mediawiki.org/wiki/Continuous_integration/Zuul https://grafana.wikimedia.org/dashboard/db/zuul-gearman?panelId=10&fullscreen&orgId=1 [18:29:29] PROBLEM - Host asw2-a-eqiad is DOWN: PING CRITICAL - Packet loss = 100% [18:30:35] RECOVERY - Host asw2-a-eqiad is UP: PING OK - Packet loss = 0%, RTA = 1.31 ms [18:30:37] PROBLEM - Host ms-be1040 is DOWN: PING CRITICAL - Packet loss = 100% [18:30:51] PROBLEM - Host ms-be1028 is DOWN: PING CRITICAL - Packet loss = 100% [18:30:51] PROBLEM - Host ms-be1029 is DOWN: PING CRITICAL - Packet loss = 100% [18:30:51] PROBLEM - Host ms-be1030 is DOWN: PING CRITICAL - Packet loss = 100% [18:33:09] ok, expected [18:33:17] not the asw2-a [18:33:20] but the ms-be was [18:33:46] we are going to kill power on one of the two sides in a7 now [18:33:48] mgmt may flap [18:34:26] 10Operations, 10cloud-services-team: Migrate remaining cloudvirt hosts to Stretch/Mitaka - https://phabricator.wikimedia.org/T224561 (10Krenair) [18:34:37] PROBLEM - MediaWiki memcached error rate on graphite1004 is CRITICAL: CRITICAL: 80.00% of data above the critical threshold [5000.0] https://wikitech.wikimedia.org/wiki/Memcached https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen [18:38:35] PROBLEM - puppet last run on restbase-dev1004 is CRITICAL: CRITICAL: Failed to apply catalog, zero resources tracked by Puppet. Could be an interrupted request or a dependency cycle. 
https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [18:42:42] side a done, doing side b [18:42:45] mgmt may flap in a7 [18:43:43] !log rolling out scap 3.11.1-1 on mw canary servers (T228328) [18:43:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:43:51] T228328: 'scap pull' stopped working on appservers ? - https://phabricator.wikimedia.org/T228328 [18:44:41] RECOVERY - MediaWiki memcached error rate on graphite1004 is OK: OK: Less than 40.00% above the threshold [1000.0] https://wikitech.wikimedia.org/wiki/Memcached https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen [18:45:36] !log rolling out scap 3.11.1-1 on all mw codfw servers (T228328) [18:45:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:46:45] RECOVERY - Work requests waiting in Zuul Gearman server on contint1001 is OK: OK: Less than 30.00% above the threshold [90.0] https://www.mediawiki.org/wiki/Continuous_integration/Zuul https://grafana.wikimedia.org/dashboard/db/zuul-gearman?panelId=10&fullscreen&orgId=1 [18:50:27] PROBLEM - Host mw1271 is DOWN: PING CRITICAL - Packet loss = 100% [18:50:53] RECOVERY - Host mw1271 is UP: PING OK - Packet loss = 0%, RTA = 1.12 ms [18:50:57] PROBLEM - PHP7 rendering on mw1271 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [18:51:03] PROBLEM - Apache HTTP on mw1271 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [18:51:03] ok.. 
was about to say that host is up [18:52:21] we are changing the pdu [18:52:24] but that seems odd for mw1271 [18:52:29] RECOVERY - PHP7 rendering on mw1271 is OK: HTTP OK: HTTP/1.1 200 OK - 77329 bytes in 0.208 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [18:52:33] RECOVERY - Apache HTTP on mw1271 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 617 bytes in 0.243 second response time https://wikitech.wikimedia.org/wiki/Application_servers [18:52:36] 2min [18:52:38] it rebooted [18:52:39] mutante: [18:52:48] uptime of 2 minutes [18:52:56] it was a casualty in our a7-pdu swap [18:52:59] seems to be the only one [18:53:18] !log mw1271 had power loss event due to pdu swap via T227143 [18:53:20] robh: ah! ok, it will recover in a minute then. cool [18:53:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:53:26] T227143: a7-eqiad pdu refresh - https://phabricator.wikimedia.org/T227143 [18:53:58] we're finishing the cabling before having folks return things to service [18:54:39] PROBLEM - PHP7 rendering on mw1270 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [18:54:47] PROBLEM - HHVM rendering on mw1312 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [18:55:17] PROBLEM - PyBal backends health check on lvs1015 is CRITICAL: PYBAL CRITICAL - CRITICAL - logstash-json-tcp_11514: Servers logstash1009.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [18:56:17] PROBLEM - PHP7 rendering on mw1269 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [18:56:25] PROBLEM - PyBal backends health check on lvs1016 is CRITICAL: PYBAL CRITICAL - CRITICAL - logstash-json-tcp_11514: Servers logstash1009.eqiad.wmnet are marked down but 
pooled: logstash-syslog-tcp_10514: Servers logstash1007.eqiad.wmnet, logstash1009.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [18:56:43] mw 1270, 1312, 1269? [18:56:49] PROBLEM - HHVM rendering on mw1274 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [18:56:51] also logstash1009 ... [18:56:53] looking [18:56:57] PROBLEM - PHP7 rendering on mw1312 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [18:57:11] logstash is on ganeti [18:57:15] mw1312 confirmed up [18:57:29] and uptime 49 days [18:57:49] mw1312 is over in A6, not A7 (the rendering alerts) [18:57:50] unlike 1271 above which rebooted [18:57:51] RECOVERY - PHP7 rendering on mw1269 is OK: HTTP OK: HTTP/1.1 200 OK - 77329 bytes in 0.486 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [18:57:57] RECOVERY - PHP7 rendering on mw1270 is OK: HTTP OK: HTTP/1.1 200 OK - 77329 bytes in 4.026 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [18:58:01] PROBLEM - MediaWiki memcached error rate on graphite1004 is CRITICAL: CRITICAL: 80.00% of data above the critical threshold [5000.0] https://wikitech.wikimedia.org/wiki/Memcached https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen [18:58:05] RECOVERY - HHVM rendering on mw1312 is OK: HTTP OK: HTTP/1.1 200 OK - 77284 bytes in 3.122 second response time https://wikitech.wikimedia.org/wiki/Application_servers [18:58:09] PROBLEM - PHP7 rendering on mw1268 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [18:58:27] some of these showing rendering socket timeouts are in A7 though [18:58:35] PROBLEM - recommendation_api endpoints health on scb2004 is
CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [18:58:37] PROBLEM - PHP7 rendering on mw1277 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [18:58:58] 1268, 1269, 1277 [18:59:00] all A7 [18:59:14] win 30 [18:59:41] RECOVERY - PHP7 rendering on mw1268 is OK: HTTP OK: HTTP/1.1 200 OK - 77329 bytes in 0.645 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [18:59:46] for mw hosts, it's 1267 - 1283 that are in A7 [19:00:09] RECOVERY - HHVM rendering on mw1274 is OK: HTTP OK: HTTP/1.1 200 OK - 77284 bytes in 6.987 second response time https://wikitech.wikimedia.org/wiki/Application_servers [19:00:09] RECOVERY - PHP7 rendering on mw1277 is OK: HTTP OK: HTTP/1.1 200 OK - 77329 bytes in 0.140 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [19:00:11] RECOVERY - PHP7 rendering on mw1312 is OK: HTTP OK: HTTP/1.1 200 OK - 77331 bytes in 2.111 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [19:00:17] RECOVERY - recommendation_api endpoints health on scb2004 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [19:00:41] PROBLEM - logstash syslog TCP port on logstash1009 is CRITICAL: connect to address 127.0.0.1 and port 10514: Connection refused https://wikitech.wikimedia.org/wiki/Logstash [19:01:12] logstash seems unrelated [19:02:17] seems like java is maxing out a cpu [19:02:25] PROBLEM - logstash JSON linesTCP port on logstash1007 is CRITICAL: connect to address 127.0.0.1 and port 11514: Connection refused https://wikitech.wikimedia.org/wiki/Logstash [19:02:53] so, semi-related I think [19:03:05] 
PROBLEM - logstash syslog TCP port on logstash1007 is CRITICAL: connect to address 127.0.0.1 and port 10514: Connection refused https://wikitech.wikimedia.org/wiki/Logstash [19:03:18] 1277 - "GET /w/api.php?format=json&action=opensearch&namespace=14&limit=30&search=Category:Och") executing too slow (19.283281 sec) [19:03:23] probably the other stuff going on is throwing too much log traffic at logstash to handle [19:03:37] PROBLEM - puppet last run on tungsten is CRITICAL: CRITICAL: Failed to apply catalog, zero resources tracked by Puppet. Could be an interrupted request or a dependency cycle. https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [19:03:45] hmm yeah [19:04:10] https://grafana.wikimedia.org/d/000000561/logstash?orgId=1 [19:04:15] kafka piled up a bunch of consumer lag, but appears to be flattening out or dropping now [19:04:41] but the input rate is going down instead of up? [19:04:45] RECOVERY - logstash syslog TCP port on logstash1007 is OK: TCP OK - 0.000 second response time on 127.0.0.1 port 10514 https://wikitech.wikimedia.org/wiki/Logstash [19:04:46] ah [19:04:47] https://grafana.wikimedia.org/d/000000102/production-logging [19:04:53] because it's failing to handle input well and losing inputs [19:05:01] gotcha [19:05:22] MW seems to be sending a bunch of memcached errors to logstash [19:05:28] there was a memcached alert further up somewhere [19:05:43] RECOVERY - logstash JSON linesTCP port on logstash1007 is OK: TCP OK - 0.000 second response time on 127.0.0.1 port 11514 https://wikitech.wikimedia.org/wiki/Logstash [19:05:50] 18:34 <+icinga-wm> PROBLEM - MediaWiki memcached error rate on graphite1004 is CRITICAL: CRITICAL: 80.00% ...
[19:06:02] big spike of memcached logs [19:06:04] !log restarting logstash on logstash100[789] [19:06:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:06:13] PROBLEM - logstash JSON linesTCP port on logstash1008 is CRITICAL: connect to address 127.0.0.1 and port 11514: Connection refused https://wikitech.wikimedia.org/wiki/Logstash [19:06:47] https://grafana.wikimedia.org/d/000000316/memcache?orgId=1 [19:06:55] RECOVERY - puppet last run on restbase-dev1004 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [19:07:05] PROBLEM - PHP7 rendering on mwdebug1002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [19:07:10] lots of connections-yielded recently there for memcached [19:07:11] PROBLEM - logstash JSON linesTCP port on logstash1009 is CRITICAL: connect to address 127.0.0.1 and port 11514: Connection refused https://wikitech.wikimedia.org/wiki/Logstash [19:07:19] PROBLEM - Too many messages in kafka logging-eqiad on icinga1001 is CRITICAL: cluster=misc exported_cluster=logging-eqiad group={logstash,logstash-codfw} instance=kafkamon1001:9501 job=burrow partition={0,1,2} site=eqiad topic={rsyslog-info,rsyslog-notice,udp_localhost-err,udp_localhost-info,udp_localhost-warning} https://wikitech.wikimedia.org/wiki/Logstash%23Kafka_consumer_lag https://grafana.wikimedia.org/d/000000484/kafka-con [19:07:19] w-3h&to=now&orgId=1&var-datasource=eqiad+prometheus/ops&var-cluster=logging-eqiad&var-topic=All&var-consumer_group=All [19:07:21] PROBLEM - puppet last run on db1105 is CRITICAL: CRITICAL: Failed to apply catalog, zero resources tracked by Puppet. Could be an interrupted request or a dependency cycle. 
https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [19:07:21] https://grafana.wikimedia.org/d/000000316/memcache?panelId=41&fullscreen&orgId=1 [19:07:33] PROBLEM - HHVM rendering on mw1312 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [19:07:43] PROBLEM - logstash syslog TCP port on logstash1008 is CRITICAL: connect to address 127.0.0.1 and port 10514: Connection refused https://wikitech.wikimedia.org/wiki/Logstash [19:07:46] ^ command rate has been elevated for a while now... since spiking up around 18:03 [19:07:50] (~hr ago) [19:07:59] PROBLEM - HHVM rendering on mw1270 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [19:08:01] PROBLEM - PHP7 rendering on mw1312 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [19:08:19] mc1030 thru mc1034 and mc1036 seem the affected ones [19:08:19] bblack: checking mcrouter [19:08:27] https://grafana.wikimedia.org/d/000000549/mcrouter?orgId=1&var-source=eqiad%20prometheus%2Fops&var-cluster=All&var-instance=mw1261&var-memcached_server=All [19:08:27] from the "connections yielded" graph [19:08:34] there were some deploys around that time [19:08:41] also a java security update for logstash just before [19:08:45] so many balls in the air! [19:08:49] seems mw1261? [19:09:16] PROBLEM - puppet last run on mc1021 is CRITICAL: CRITICAL: Failed to apply catalog, zero resources tracked by Puppet. Could be an interrupted request or a dependency cycle. 
https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [19:09:18] !log depool mw1261 for investigation [19:09:23] RECOVERY - logstash syslog TCP port on logstash1008 is OK: TCP OK - 0.000 second response time on 127.0.0.1 port 10514 https://wikitech.wikimedia.org/wiki/Logstash [19:09:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:09:30] eh, i just ran "scap pull" on that mw1261 [19:09:33] RECOVERY - logstash JSON linesTCP port on logstash1008 is OK: TCP OK - 0.000 second response time on 127.0.0.1 port 11514 https://wikitech.wikimedia.org/wiki/Logstash [19:09:39] RECOVERY - HHVM rendering on mw1270 is OK: HTTP OK: HTTP/1.1 200 OK - 77284 bytes in 8.389 second response time https://wikitech.wikimedia.org/wiki/Application_servers [19:10:06] mutante: ah sorry! [19:10:19] errors are going down [19:10:28] elukey: no, i'm just saying that is a coincidence that you name that specific host.. hmm [19:10:36] !log repool cp1077 + cp1078 - T227143 [19:10:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:10:42] T227143: a7-eqiad pdu refresh - https://phabricator.wikimedia.org/T227143 [19:10:47] RECOVERY - Host ps1-a7-eqiad is UP: PING OK - Packet loss = 0%, RTA = 2.25 ms [19:10:57] PROBLEM - LVS HTTP IPv4 on logstash.svc.eqiad.wmnet is CRITICAL: connect to address 10.2.2.36 and port 10514: Connection refused https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [19:11:09] paged [19:11:27] PROBLEM - logstash syslog TCP port on logstash1007 is CRITICAL: connect to address 127.0.0.1 and port 10514: Connection refused https://wikitech.wikimedia.org/wiki/Logstash [19:11:27] !log repool lvs1013 - T227143 [19:11:29] PROBLEM - restbase endpoints health on restbase1021 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase 
[19:11:31] * cdanis reading scrollback [19:11:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:11:37] that was logstash which got restarted [19:11:47] 10Operations, 10ops-eqiad, 10DC-Ops: a5-eqiad pdu refresh - https://phabricator.wikimedia.org/T227141 (10RobH) p:05Triage→03Normal [19:11:55] 10Operations, 10ops-eqiad, 10DC-Ops: a5-eqiad pdu refresh - https://phabricator.wikimedia.org/T227141 (10RobH) 05Open→03Resolved [19:11:57] 10Operations, 10ops-eqiad, 10DC-Ops: (July 22-26) install new PDUs in rows A/B (Top level tracking task) - https://phabricator.wikimedia.org/T226778 (10RobH) [19:12:02] <_joe_> hey I'm around [19:12:11] RECOVERY - logstash JSON linesTCP port on logstash1009 is OK: TCP OK - 0.000 second response time on 127.0.0.1 port 11514 https://wikitech.wikimedia.org/wiki/Logstash [19:12:14] cdanis: starts circa 18:03, not related to A7 power work. Some issue with memcached error rates for mediawiki, spilling over into excessive logstash load, etc [19:12:19] _joe_: ^ [19:12:21] 10Operations, 10ops-eqiad, 10DC-Ops: (July 22-26) install new PDUs in rows A/B (Top level tracking task) - https://phabricator.wikimedia.org/T226778 (10RobH) [19:12:21] RECOVERY - PyBal backends health check on lvs1013 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [19:12:23] RECOVERY - logstash syslog TCP port on logstash1009 is OK: TCP OK - 0.000 second response time on 127.0.0.1 port 10514 https://wikitech.wikimedia.org/wiki/Logstash [19:12:27] RECOVERY - Host ms-be1040 is UP: PING OK - Packet loss = 0%, RTA = 2.25 ms [19:12:32] bblack: the memcached errors are unrelated? no mc hosts in A7? [19:12:33] PROBLEM - PHP7 rendering on mw1268 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [19:12:36] <_joe_> bblack: did anyone look at those logs? 
[19:12:42] not yet, just graphs [19:12:46] <_joe_> also what's up with all the php7 rendering alerts? [19:12:49] RECOVERY - pybal on lvs1013 is OK: PROCS OK: 1 process with UID = 0 (root), args /usr/sbin/pybal https://wikitech.wikimedia.org/wiki/PyBal [19:13:07] RECOVERY - restbase endpoints health on restbase1021 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [19:13:08] 04Critical Alert for device ps1-a7-eqiad.mgmt.eqiad.wmnet - Device rebooted [19:13:09] _joe_: many of the recent rendering alerts were A7 hosts, likely some network blip... [19:13:11] RECOVERY - Host ms-be1029 is UP: PING OK - Packet loss = 0%, RTA = 0.25 ms [19:13:13] affected mc hosts are in C5 [19:13:13] <_joe_> did we lose the mc hosts? [19:13:21] _joe_: mw1267 through mw1283 are on A7, but no mc* hosts [19:13:21] RECOVERY - Host ms-be1028 is UP: PING WARNING - Packet loss = 86%, RTA = 0.19 ms [19:13:27] RECOVERY - Host ms-be1030 is UP: PING OK - Packet loss = 0%, RTA = 0.18 ms [19:13:27] <_joe_> ok [19:13:30] but there was elevated MC command rates going back to ~18:03, way before that [19:13:31] PROBLEM - HHVM rendering on mwdebug1002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [19:13:34] we only lost power on a single mw host im aware of [19:13:40] and D4 [19:13:40] D4: iscap 'project level' commands - https://phabricator.wikimedia.org/D4 [19:13:40] and then MC went a bit crazier on other graphs more-recently [19:13:59] there was some deploy traffic shortly before the elevated MC rates [19:14:09] https://grafana.wikimedia.org/d/000000316/memcache?panelId=41&fullscreen&orgId=1 [19:14:11] PROBLEM - PHP7 rendering on mw1273 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [19:14:14] lots of memcached SERVER ERROR logs [19:14:15] PROBLEM - logstash JSON linesTCP port on logstash2006 is 
CRITICAL: connect to address 127.0.0.1 and port 11514: Connection refused https://wikitech.wikimedia.org/wiki/Logstash [19:14:19] RECOVERY - LVS HTTP IPv4 on logstash.svc.eqiad.wmnet is OK: TCP OK - 0.000 second response time on 10.2.2.36 port 10514 https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [19:14:21] https://grafana.wikimedia.org/d/000000316/memcache?orgId=1 [19:14:22] robh: did we do A6? [19:14:27] memcache traffic patterns have shifted [19:14:45] elukey: no, it has a db master [19:14:47] https://grafana.wikimedia.org/d/000000316/memcache?panelId=38&fullscreen&orgId=1 [19:14:49] what is this graph? [19:14:51] RECOVERY - logstash syslog TCP port on logstash1007 is OK: TCP OK - 0.000 second response time on 127.0.0.1 port 10514 https://wikitech.wikimedia.org/wiki/Logstash [19:14:51] back at 18:03 when that increase first happened, nothing should've been happening that mattered with A7 power yet [19:14:51] connections yielded? [19:14:57] no ok sorry my link was wrong [19:14:57] https://grafana.wikimedia.org/d/000000549/mcrouter?orgId=1&var-source=eqiad%20prometheus%2Fops&var-cluster=All&var-instance=All&var-memcached_server=All [19:15:03] more than one appserver [19:15:33] cdanis: those are memcached threads reaching the max conns to process in a row, and yielding the tcp conn to process other ones [19:15:34] (and A7 doesn't have MC hosts, but does have kafka-main1001) [19:15:40] <_joe_> https://grafana.wikimedia.org/d/000000316/memcache?orgId=1&panelId=38&fullscreen does this correspond to any release?
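The "connections yielded" metric being explained above comes from memcached's `conn_yields` counter: it increments each time a worker thread hits the per-event request cap on one busy connection (memcached's `-R` option, default 20) and hands that socket back to the event loop so other connections get served. A hedged sketch of pulling the counter out of raw `stats` output — the sample values are invented for illustration, not real numbers from the mc10xx hosts:

```python
# Parse memcached "stats" output (text protocol) into a dict, to read the
# counters behind the "connections yielded" graph. conn_yields rises when
# a thread yields a connection after hitting the -R request-per-event cap.
def parse_stats(raw: str) -> dict:
    """Turn 'STAT <name> <value>' lines into {name: value} (ints where possible)."""
    stats = {}
    for line in raw.splitlines():
        parts = line.split()
        if len(parts) == 3 and parts[0] == "STAT":
            stats[parts[1]] = int(parts[2]) if parts[2].isdigit() else parts[2]
    return stats

# Invented sample "stats" response for illustration:
SAMPLE = """\
STAT curr_connections 512
STAT conn_yields 10431
STAT cmd_get 73219876
END"""

stats = parse_stats(SAMPLE)
print(stats["conn_yields"])  # -> 10431
```

Against a live server the same text would come from the `stats` command over port 11211 (e.g. via `nc`); a rising conn_yields rate is the shape being discussed in the Grafana memcache dashboard linked above.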
[19:15:45] RECOVERY - PHP7 rendering on mw1273 is OK: HTTP OK: HTTP/1.1 200 OK - 77330 bytes in 1.027 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [19:15:49] RECOVERY - PHP7 rendering on mw1268 is OK: HTTP OK: HTTP/1.1 200 OK - 77329 bytes in 0.454 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [19:15:59] RECOVERY - logstash JSON linesTCP port on logstash2006 is OK: TCP OK - 0.000 second response time on 127.0.0.1 port 11514 https://wikitech.wikimedia.org/wiki/Logstash [19:16:15] _joe_: no, but that does correspond with the rendering errors cropping up in mostly-A7 MW hosts [19:16:28] (the ~18:50-onwards anomalies there) [19:16:28] mc1033 through mc1036 are on D4, was there a problem with that rack? [19:16:49] <_joe_> bblack: and when did maintenace happen? [19:16:51] PROBLEM - logstash JSON linesTCP port on logstash1009 is CRITICAL: connect to address 127.0.0.1 and port 11514: Connection refused https://wikitech.wikimedia.org/wiki/Logstash [19:17:11] the maint started ~18:00, but was mostly prep-work and depoolings of ms-fe, shutdowns of ms-be, etc [19:17:13] <_joe_> 20 minutes ago or so? [19:17:13] PROBLEM - logstash JSON linesTCP port on logstash2004 is CRITICAL: connect to address 127.0.0.1 and port 11514: Connection refused https://wikitech.wikimedia.org/wiki/Logstash [19:17:29] PROBLEM - puppet last run on ms-be1040 is CRITICAL: CRITICAL: Puppet has 3 failures. Last run 3 minutes ago with 3 failures. Failed resources (up to 3 shown): Exec[mkfs-/dev/sdc1],Exec[xfs_label-/dev/sdb3],Exec[xfs_label-/dev/sdb4] https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [19:17:46] * akosiaris around [19:17:51] PROBLEM - puppet last run on mc1022 is CRITICAL: CRITICAL: Failed to apply catalog, zero resources tracked by Puppet. Could be an interrupted request or a dependency cycle. 
https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [19:18:06] looks like 18:33 for the first power cut on one leg [19:18:18] <_joe_> ok, I think it's time to look at the memcacheds [19:18:25] PROBLEM - LVS HTTP IPv4 on logstash.svc.eqiad.wmnet is CRITICAL: connect to address 10.2.2.36 and port 10514: Connection refused https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [19:18:30] 18:42 for the second leg of power work [19:18:33] PROBLEM - Check systemd state on cloudcontrol1004 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:18:47] looking [19:18:47] PROBLEM - logstash syslog TCP port on logstash2004 is CRITICAL: connect to address 127.0.0.1 and port 10514: Connection refused https://wikitech.wikimedia.org/wiki/Logstash [19:18:53] I'm still here but it was a very long day, poke me if I can be of help, otherwise lurking [19:19:29] PROBLEM - HTTP availability for Nginx -SSL terminators- at eqsin on icinga1001 is CRITICAL: cluster=cache_text site=eqsin https://wikitech.wikimedia.org/wiki/Cache_TLS_termination https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [19:19:33] PROBLEM - logstash syslog TCP port on logstash1008 is CRITICAL: connect to address 127.0.0.1 and port 10514: Connection refused https://wikitech.wikimedia.org/wiki/Logstash [19:19:35] from mw1271 [19:19:37] Jul 23 19:19:03 mw1271 mcrouter[919]: I0723 19:19:03.442854 1051 AsyncMcClientImpl.cpp:751] Failed to write into socket with remote endpoint "10.64.0.83:11211:ascii:plain:notcompressed", wrote 39782 bytes. 
Exception: AsyncSocketException: write timed out after 1000ms, type = Timed out [19:19:39] PROBLEM - PHP7 rendering on mw1275 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [19:20:01] PROBLEM - HTTP availability for Nginx -SSL terminators- at ulsfo on icinga1001 is CRITICAL: cluster=cache_text site=ulsfo https://wikitech.wikimedia.org/wiki/Cache_TLS_termination https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [19:20:02] <_joe_> can we please suspend all maintenance for the day? [19:20:03] PROBLEM - HTTP availability for Nginx -SSL terminators- at esams on icinga1001 is CRITICAL: cluster=cache_text site=esams https://wikitech.wikimedia.org/wiki/Cache_TLS_termination https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [19:20:09] PROBLEM - logstash syslog TCP port on logstash1009 is CRITICAL: connect to address 127.0.0.1 and port 10514: Connection refused https://wikitech.wikimedia.org/wiki/Logstash [19:20:11] PROBLEM - HTTP availability for Varnish at ulsfo on icinga1001 is CRITICAL: job=varnish-text site=ulsfo https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 [19:20:23] (03PS1) 10Ladsgroup: varnish: Do not strip the cache out of Special:EntityData if revision is set [puppet] - 10https://gerrit.wikimedia.org/r/525142 (https://phabricator.wikimedia.org/T85499) [19:20:25] PROBLEM - HTTP availability for Nginx -SSL terminators- at codfw on icinga1001 is CRITICAL: cluster=cache_text site=codfw https://wikitech.wikimedia.org/wiki/Cache_TLS_termination https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [19:20:25] PROBLEM - HTTP availability for Nginx -SSL terminators- at eqiad on icinga1001 is CRITICAL: 
cluster=cache_text site=eqiad https://wikitech.wikimedia.org/wiki/Cache_TLS_termination https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [19:20:35] RECOVERY - PHP7 rendering on mw1275 is OK: HTTP OK: HTTP/1.1 200 OK - 77329 bytes in 0.249 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [19:20:39] RECOVERY - Check systemd state on cloudcontrol1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:20:45] _joe_: power maint for A7 is done, I'm not sure if they had further plans today, but +1 should hold everything for now [19:20:57] PROBLEM - Eqiad HTTP 5xx reqs/min on graphite1004 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [1000.0] https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/d/myRmf1Pik/varnish-aggregate-client-status-codes?var-site=eqiad&var-status_type=5 [19:21:01] I think they wanted to start with row B [19:21:01] <_joe_> we're having network issues on the memcached I would say [19:21:06] but probably better to stop [19:21:10] in the middle of all of this A7-ish timeframe, we also had some issue with a singular mw server with scap issues, and a scap upgrade too. [19:21:20] <_joe_> Jul 23 19:13:05 mc1022 puppet-agent[18767]: Could not retrieve catalog from remote server: Broken pipe [19:21:24] removing some of the noise that is unrelated. like ms-be host has a disk issue. stopping any scap upgrades. 
[19:21:26] _joe_ yes take a look to the per-server metrics in https://grafana.wikimedia.org/d/000000549/mcrouter?orgId=1&var-source=eqiad%20prometheus%2Fops&var-cluster=All&var-instance=All&var-memcached_server=All [19:21:35] it is not only one shard [19:21:36] (03PS1) 10Halfak: Add accraze to team-scoring [puppet] - 10https://gerrit.wikimedia.org/r/525143 (https://phabricator.wikimedia.org/T226417) [19:21:46] <_joe_> elukey: what do you mean? [19:22:13] PROBLEM - HTTP availability for Varnish at eqiad on icinga1001 is CRITICAL: job=varnish-text site=eqiad https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 [19:22:19] RECOVERY - logstash syslog TCP port on logstash1009 is OK: TCP OK - 0.000 second response time on 127.0.0.1 port 10514 https://wikitech.wikimedia.org/wiki/Logstash [19:22:29] PROBLEM - Ulsfo HTTP 5xx reqs/min on graphite1004 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/d/myRmf1Pik/varnish-aggregate-client-status-codes?var-site=ulsfo&var-status_type=5 [19:22:29] ACKNOWLEDGEMENT - puppet last run on ms-be1040 is CRITICAL: CRITICAL: Puppet has 3 failures. Last run 3 minutes ago with 3 failures. Failed resources (up to 3 shown): Exec[mkfs-/dev/sdc1],Exec[xfs_label-/dev/sdb3],Exec[xfs_label-/dev/sdb4] daniel_zahn disk issue xfs - /dev/sdb4 https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [19:22:35] _joe_ there are metrics now for each single shard, it might help, that's it. 
Plus it seems that not only one shard is affected [19:22:36] <_joe_> elukey: so that's all the servers in A6 [19:22:45] PROBLEM - Esams HTTP 5xx reqs/min on graphite1004 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [1000.0] https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/d/myRmf1Pik/varnish-aggregate-client-status-codes?var-site=esams&var-status_type=5 [19:22:45] PROBLEM - Text HTTP 5xx reqs/min on graphite1004 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [1000.0] https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/d/myRmf1Pik/varnish-aggregate-client-status-codes?var-cache_type=varnish-text&var-status_type=5 [19:22:49] RECOVERY - logstash syslog TCP port on logstash1008 is OK: TCP OK - 0.000 second response time on 127.0.0.1 port 10514 https://wikitech.wikimedia.org/wiki/Logstash [19:22:52] <_joe_> the servers in A6 are experiencing timeouts [19:23:03] PROBLEM - logstash syslog TCP port on logstash2005 is CRITICAL: connect to address 127.0.0.1 and port 10514: Connection refused https://wikitech.wikimedia.org/wiki/Logstash [19:23:04] I am not sure if they worked on it [19:23:17] <_joe_> elukey: still, looking at the graphs you posted [19:23:19] <_joe_> also [19:23:22] <_joe_> Jul 23 19:13:05 mc1022 puppet-agent[18767]: Could not retrieve catalog from remote server: Broken pipe [19:23:29] <_joe_> this is from a server in that rack [19:23:29] PROBLEM - PHP7 rendering on mw1273 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [19:23:31] RECOVERY - HTTP availability for Varnish at ulsfo on icinga1001 is OK: All metrics within thresholds.
https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 [19:23:43] RECOVERY - logstash JSON linesTCP port on logstash2004 is OK: TCP OK - 0.000 second response time on 127.0.0.1 port 11514 https://wikitech.wikimedia.org/wiki/Logstash [19:23:45] RECOVERY - HTTP availability for Nginx -SSL terminators- at eqiad on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Cache_TLS_termination https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [19:23:45] RECOVERY - HTTP availability for Nginx -SSL terminators- at codfw on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Cache_TLS_termination https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [19:23:48] RECOVERY - LVS HTTP IPv4 on logstash.svc.eqiad.wmnet is OK: TCP OK - 0.000 second response time on 10.2.2.36 port 10514 https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [19:23:52] A6 hasn't been touched on the pdu maintenance from what I can see [19:23:58] RECOVERY - HTTP availability for Nginx -SSL terminators- at eqsin on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Cache_TLS_termination https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [19:24:02] <_joe_> ok, this is a serious incident. I'll be the coordinator [19:24:08] ack [19:24:16] RECOVERY - logstash syslog TCP port on logstash2005 is OK: TCP OK - 0.000 second response time on 127.0.0.1 port 10514 https://wikitech.wikimedia.org/wiki/Logstash [19:24:18] <_joe_> mutante: can you look at the php/hhvm rendering alerts? 
[19:24:18] RECOVERY - logstash syslog TCP port on logstash2004 is OK: TCP OK - 0.000 second response time on 127.0.0.1 port 10514 https://wikitech.wikimedia.org/wiki/Logstash [19:24:24] _joe_: all hands on deck? [19:24:32] RECOVERY - PHP7 rendering on mw1273 is OK: HTTP OK: HTTP/1.1 200 OK - 77329 bytes in 0.400 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [19:24:34] RECOVERY - HTTP availability for Nginx -SSL terminators- at ulsfo on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Cache_TLS_termination https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [19:24:36] RECOVERY - HTTP availability for Nginx -SSL terminators- at esams on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Cache_TLS_termination https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [19:24:36] RECOVERY - HTTP availability for Varnish at eqiad on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 [19:24:55] _joe_: can we move discussion into #-sre ? 
[19:24:59] +1 [19:25:01] <_joe_> yes [19:25:02] indeed [19:25:06] ack [19:25:38] RECOVERY - logstash JSON linesTCP port on logstash1009 is OK: TCP OK - 0.000 second response time on 127.0.0.1 port 11514 https://wikitech.wikimedia.org/wiki/Logstash [19:25:42] PROBLEM - PHP7 rendering on mw1270 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [19:25:56] PROBLEM - logstash JSON linesTCP port on logstash1007 is CRITICAL: connect to address 127.0.0.1 and port 11514: Connection refused https://wikitech.wikimedia.org/wiki/Logstash [19:26:06] ACKNOWLEDGEMENT - Check systemd state on cloudcontrol1004 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. andrew bogott seems to be hanging -- I'm investigating. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:26:30] PROBLEM - recommendation_api endpoints health on scb2002 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [19:26:48] RECOVERY - PHP7 rendering on mw1270 is OK: HTTP OK: HTTP/1.1 200 OK - 77329 bytes in 0.428 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [19:27:06] PROBLEM - restbase endpoints health on restbase1021 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [19:27:24] PROBLEM - logstash JSON linesTCP port on logstash2004 is CRITICAL: connect to address 127.0.0.1 and port 11514: Connection refused https://wikitech.wikimedia.org/wiki/Logstash [19:27:28] PROBLEM - LVS HTTP IPv4 on logstash.svc.eqiad.wmnet is CRITICAL: connect to address 10.2.2.36 and port 10514:
Connection refused https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [19:27:40] RECOVERY - recommendation_api endpoints health on scb2002 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [19:27:48] PROBLEM - puppet last run on mw1280 is CRITICAL: CRITICAL: Failed to apply catalog, zero resources tracked by Puppet. Could be an interrupted request or a dependency cycle. https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [19:27:48] PROBLEM - HHVM rendering on mw1269 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [19:28:00] PROBLEM - logstash syslog TCP port on logstash2005 is CRITICAL: connect to address 127.0.0.1 and port 10514: Connection refused https://wikitech.wikimedia.org/wiki/Logstash [19:28:08] PROBLEM - Check systemd state on cloudcontrol1003 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:28:08] RECOVERY - Eqiad HTTP 5xx reqs/min on graphite1004 is OK: OK: Less than 1.00% above the threshold [250.0] https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/d/myRmf1Pik/varnish-aggregate-client-status-codes?var-site=eqiad&var-status_type=5 [19:28:20] RECOVERY - puppet last run on mc1022 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [19:28:34] !log depool all appservers in eqiad A7 cdanis@cumin1001.eqiad.wmnet ~ 🍵 sudo cumin 'mw12[67-83]*' 'depool' [19:28:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:28:50] RECOVERY - LVS HTTP IPv4 on logstash.svc.eqiad.wmnet is OK: TCP OK - 0.000 second response time on 10.2.2.36 port 10514 https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [19:28:50] PROBLEM - logstash JSON linesTCP port on logstash2006 is CRITICAL: connect to address 127.0.0.1 and port 11514: Connection refused https://wikitech.wikimedia.org/wiki/Logstash [19:29:00] RECOVERY - HHVM rendering on mw1269 is OK: HTTP OK: HTTP/1.1 200 OK - 77282 bytes in 0.383 second response time https://wikitech.wikimedia.org/wiki/Application_servers [19:29:20] RECOVERY - logstash syslog TCP port on logstash2005 is OK: TCP OK - 0.000 second response time on 127.0.0.1 port 10514 https://wikitech.wikimedia.org/wiki/Logstash [19:29:50] PROBLEM - logstash syslog TCP port on logstash1009 is CRITICAL: connect to address 127.0.0.1 and port 10514: Connection refused https://wikitech.wikimedia.org/wiki/Logstash [19:29:52] RECOVERY - logstash JSON linesTCP port on logstash1007 is OK: TCP OK - 0.000 second response time on 127.0.0.1 port 11514 https://wikitech.wikimedia.org/wiki/Logstash [19:30:04] RECOVERY - Ulsfo HTTP 5xx reqs/min on graphite1004 is OK: OK: Less than 1.00% above the threshold [250.0] 
https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/d/myRmf1Pik/varnish-aggregate-client-status-codes?var-site=ulsfo&var-status_type=5 [19:30:08] PROBLEM - logstash syslog TCP port on logstash2006 is CRITICAL: connect to address 127.0.0.1 and port 10514: Connection refused https://wikitech.wikimedia.org/wiki/Logstash [19:30:14] RECOVERY - HHVM rendering on mw1312 is OK: HTTP OK: HTTP/1.1 200 OK - 77284 bytes in 7.133 second response time https://wikitech.wikimedia.org/wiki/Application_servers [19:30:20] RECOVERY - Esams HTTP 5xx reqs/min on graphite1004 is OK: OK: Less than 1.00% above the threshold [250.0] https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/d/myRmf1Pik/varnish-aggregate-client-status-codes?var-site=esams&var-status_type=5 [19:30:20] RECOVERY - Text HTTP 5xx reqs/min on graphite1004 is OK: OK: Less than 1.00% above the threshold [250.0] https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/d/myRmf1Pik/varnish-aggregate-client-status-codes?var-cache_type=varnish-text&var-status_type=5 [19:30:32] RECOVERY - puppet last run on tungsten is OK: OK: Puppet is currently enabled, last run 45 seconds ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [19:30:50] RECOVERY - Check systemd state on cloudcontrol1003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:30:54] PROBLEM - logstash JSON linesTCP port on logstash1009 is CRITICAL: connect to address 127.0.0.1 and port 11514: Connection refused https://wikitech.wikimedia.org/wiki/Logstash [19:31:12] RECOVERY - logstash syslog TCP port on logstash1009 is OK: TCP OK - 0.002 second response time on 127.0.0.1 port 10514 https://wikitech.wikimedia.org/wiki/Logstash [19:31:28] RECOVERY - logstash syslog TCP port on logstash2006 is OK: TCP OK - 0.000 second 
response time on 127.0.0.1 port 10514 https://wikitech.wikimedia.org/wiki/Logstash [19:31:28] RECOVERY - logstash JSON linesTCP port on logstash2004 is OK: TCP OK - 0.000 second response time on 127.0.0.1 port 11514 https://wikitech.wikimedia.org/wiki/Logstash [19:32:18] RECOVERY - logstash JSON linesTCP port on logstash1009 is OK: TCP OK - 0.000 second response time on 127.0.0.1 port 11514 https://wikitech.wikimedia.org/wiki/Logstash [19:32:58] PROBLEM - LVS HTTP IPv4 on logstash.svc.eqiad.wmnet is CRITICAL: connect to address 10.2.2.36 and port 10514: Connection refused https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [19:33:00] RECOVERY - logstash JSON linesTCP port on logstash2006 is OK: TCP OK - 0.000 second response time on 127.0.0.1 port 11514 https://wikitech.wikimedia.org/wiki/Logstash [19:34:02] RECOVERY - restbase endpoints health on restbase1021 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [19:34:02] RECOVERY - puppet last run on db1105 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [19:34:08] PROBLEM - HHVM rendering on mw1312 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [19:34:27] RECOVERY - LVS HTTP IPv4 on logstash.svc.eqiad.wmnet is OK: TCP OK - 0.000 second response time on 10.2.2.36 port 10514 https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [19:35:50] PROBLEM - logstash JSON linesTCP port on logstash2005 is CRITICAL: connect to address 127.0.0.1 and port 11514: Connection refused https://wikitech.wikimedia.org/wiki/Logstash [19:36:46] PROBLEM - logstash JSON linesTCP port on logstash1009 is CRITICAL: connect to address 127.0.0.1 and port 11514: Connection refused https://wikitech.wikimedia.org/wiki/Logstash [19:36:50] RECOVERY - PHP7 rendering on mwdebug1002 is OK: HTTP OK: HTTP/1.1 200 OK - 77367 bytes 
in 2.104 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [19:36:52] RECOVERY - PHP7 rendering on mw1312 is OK: HTTP OK: HTTP/1.1 200 OK - 77357 bytes in 4.587 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [19:37:26] PROBLEM - logstash syslog TCP port on logstash2006 is CRITICAL: connect to address 127.0.0.1 and port 10514: Connection refused https://wikitech.wikimedia.org/wiki/Logstash [19:37:32] PROBLEM - puppet last run on actinium is CRITICAL: CRITICAL: Failed to apply catalog, zero resources tracked by Puppet. Could be an interrupted request or a dependency cycle. https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [19:37:36] PROBLEM - Mobileapps LVS eqiad on mobileapps.svc.eqiad.wmnet is CRITICAL: /{domain}/v1/page/summary/{title} (Get summary for test page) timed out before a response was received https://wikitech.wikimedia.org/wiki/Mobileapps_%28service%29 [19:38:24] RECOVERY - HHVM rendering on mw1312 is OK: HTTP OK: HTTP/1.1 200 OK - 77310 bytes in 2.784 second response time https://wikitech.wikimedia.org/wiki/Application_servers [19:38:26] (03PS2) 10Ladsgroup: varnish: Do not strip the cache out of Special:EntityData if revision is set [puppet] - 10https://gerrit.wikimedia.org/r/525142 (https://phabricator.wikimedia.org/T85499) [19:39:06] PROBLEM - logstash JSON linesTCP port on logstash2006 is CRITICAL: connect to address 127.0.0.1 and port 11514: Connection refused https://wikitech.wikimedia.org/wiki/Logstash [19:39:26] PROBLEM - Check systemd state on cloudcontrol1004 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:40:08] PROBLEM - OSPF status on cr2-eqdfw is CRITICAL: OSPFv2: 4/5 UP : OSPFv3: 5/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [19:40:16] PROBLEM - logstash syslog TCP port on logstash1009 is CRITICAL: connect to address 127.0.0.1 and port 10514: Connection refused https://wikitech.wikimedia.org/wiki/Logstash [19:40:38] PROBLEM - CirrusSearch eqiad 95th percentile latency on graphite1004 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [1000.0] https://wikitech.wikimedia.org/wiki/Search%23Health/Activity_Monitoring https://grafana.wikimedia.org/dashboard/db/elasticsearch-percentiles?panelId=19&fullscreen&orgId=1&var-cluster=eqiad&var-smoothing=1 [19:40:39] !log restarting hhvm on mw1312 [19:40:42] RECOVERY - Mobileapps LVS eqiad on mobileapps.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Mobileapps_%28service%29 [19:40:42] RECOVERY - logstash JSON linesTCP port on logstash2006 is OK: TCP OK - 0.000 second response time on 127.0.0.1 port 11514 https://wikitech.wikimedia.org/wiki/Logstash [19:40:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:41:02] PROBLEM - logstash syslog TCP port on logstash1008 is CRITICAL: connect to address 127.0.0.1 and port 10514: Connection refused https://wikitech.wikimedia.org/wiki/Logstash [19:41:04] RECOVERY - Check systemd state on cloudcontrol1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:41:10] (03PS1) 10Aaron Schulz: Use GTIDs for master position queries for external DB when possible [mediawiki-config] - 10https://gerrit.wikimedia.org/r/525147 [19:41:42] RECOVERY - HHVM rendering on mwdebug1002 is OK: HTTP OK: HTTP/1.1 200 OK - 77319 bytes in 1.066 second response time https://wikitech.wikimedia.org/wiki/Application_servers [19:41:42] RECOVERY - puppet last run on 
mc1021 is OK: OK: Puppet is currently enabled, last run 5 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [19:41:50] RECOVERY - OSPF status on cr2-eqdfw is OK: OSPFv2: 5/5 UP : OSPFv3: 5/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [19:41:58] RECOVERY - logstash syslog TCP port on logstash1009 is OK: TCP OK - 0.000 second response time on 127.0.0.1 port 10514 https://wikitech.wikimedia.org/wiki/Logstash [19:42:12] RECOVERY - logstash JSON linesTCP port on logstash2005 is OK: TCP OK - 0.000 second response time on 127.0.0.1 port 11514 https://wikitech.wikimedia.org/wiki/Logstash [19:42:18] RECOVERY - logstash syslog TCP port on logstash2006 is OK: TCP OK - 0.000 second response time on 127.0.0.1 port 10514 https://wikitech.wikimedia.org/wiki/Logstash [19:42:50] PROBLEM - HHVM rendering on mw1312 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [19:43:38] PROBLEM - logstash JSON linesTCP port on logstash1007 is CRITICAL: connect to address 127.0.0.1 and port 11514: Connection refused https://wikitech.wikimedia.org/wiki/Logstash [19:44:01] !log restarting rabbitmq-server on cloudcontrol1003 and 1004 [19:44:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:44:20] RECOVERY - logstash syslog TCP port on logstash1008 is OK: TCP OK - 0.000 second response time on 127.0.0.1 port 10514 https://wikitech.wikimedia.org/wiki/Logstash [19:44:28] PROBLEM - PHP7 rendering on mw1312 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [19:44:44] PROBLEM - logstash syslog TCP port on logstash2005 is CRITICAL: connect to address 127.0.0.1 and port 10514: Connection refused https://wikitech.wikimedia.org/wiki/Logstash [19:44:48] PROBLEM - logstash syslog TCP port on logstash2004 is CRITICAL: connect to address 127.0.0.1 and port 
10514: Connection refused https://wikitech.wikimedia.org/wiki/Logstash [19:44:56] RECOVERY - logstash JSON linesTCP port on logstash1009 is OK: TCP OK - 0.000 second response time on 127.0.0.1 port 11514 https://wikitech.wikimedia.org/wiki/Logstash [19:45:18] RECOVERY - logstash JSON linesTCP port on logstash1007 is OK: TCP OK - 0.000 second response time on 127.0.0.1 port 11514 https://wikitech.wikimedia.org/wiki/Logstash [19:46:26] RECOVERY - logstash syslog TCP port on logstash2005 is OK: TCP OK - 0.000 second response time on 127.0.0.1 port 10514 https://wikitech.wikimedia.org/wiki/Logstash [19:46:30] RECOVERY - logstash syslog TCP port on logstash2004 is OK: TCP OK - 0.000 second response time on 127.0.0.1 port 10514 https://wikitech.wikimedia.org/wiki/Logstash [19:48:20] PROBLEM - Check systemd state on cloudcontrol1004 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:48:29] PROBLEM - puppet last run on mc1019 is CRITICAL: CRITICAL: Failed to apply catalog, zero resources tracked by Puppet. Could be an interrupted request or a dependency cycle. 
https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [19:49:27] RECOVERY - puppet last run on mw1280 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [19:49:29] PROBLEM - IPMI Sensor Status on dbprov1001 is CRITICAL: Sensor Type(s) Temperature, Power_Supply Status: Critical [PS Redundancy = Critical, Status = Critical, Status = Critical] https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures [19:49:45] PROBLEM - logstash syslog TCP port on logstash2005 is CRITICAL: connect to address 127.0.0.1 and port 10514: Connection refused https://wikitech.wikimedia.org/wiki/Logstash [19:49:49] !log mwdebug1002 - restarting hhvm - mw1312 - restarted apache [19:49:50] PROBLEM - logstash syslog TCP port on logstash2004 is CRITICAL: connect to address 127.0.0.1 and port 10514: Connection refused https://wikitech.wikimedia.org/wiki/Logstash [19:49:57] PROBLEM - logstash JSON linesTCP port on logstash1009 is CRITICAL: connect to address 127.0.0.1 and port 11514: Connection refused https://wikitech.wikimedia.org/wiki/Logstash [19:50:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:50:11] PROBLEM - recommendation_api endpoints health on scb2001 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [19:50:23] RECOVERY - Check systemd state on cloudcontrol1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:50:33] PROBLEM - recommendation_api endpoints health on scb2004 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received 
https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [19:51:27] PROBLEM - logstash syslog TCP port on logstash1008 is CRITICAL: connect to address 127.0.0.1 and port 10514: Connection refused https://wikitech.wikimedia.org/wiki/Logstash [19:52:05] PROBLEM - logstash syslog TCP port on logstash1009 is CRITICAL: connect to address 127.0.0.1 and port 10514: Connection refused https://wikitech.wikimedia.org/wiki/Logstash [19:52:15] RECOVERY - recommendation_api endpoints health on scb2001 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [19:52:43] RECOVERY - recommendation_api endpoints health on scb2004 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [19:53:15] RECOVERY - logstash JSON linesTCP port on logstash1009 is OK: TCP OK - 0.000 second response time on 127.0.0.1 port 11514 https://wikitech.wikimedia.org/wiki/Logstash [19:53:41] RECOVERY - logstash syslog TCP port on logstash1008 is OK: TCP OK - 0.000 second response time on 127.0.0.1 port 10514 https://wikitech.wikimedia.org/wiki/Logstash [19:53:43] PROBLEM - Check systemd state on cloudcontrol1004 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:54:09] RECOVERY - logstash syslog TCP port on logstash2005 is OK: TCP OK - 0.000 second response time on 127.0.0.1 port 10514 https://wikitech.wikimedia.org/wiki/Logstash [19:54:13] RECOVERY - logstash syslog TCP port on logstash2004 is OK: TCP OK - 0.000 second response time on 127.0.0.1 port 10514 https://wikitech.wikimedia.org/wiki/Logstash [19:54:23] RECOVERY - logstash syslog TCP port on logstash1009 is OK: TCP OK - 0.000 second response time on 127.0.0.1 port 10514 https://wikitech.wikimedia.org/wiki/Logstash [19:54:39] PROBLEM - recommendation_api endpoints health on scb2006 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [19:55:11] PROBLEM - HHVM rendering on mwdebug1002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [19:55:17] PROBLEM - logstash JSON linesTCP port on logstash1008 is CRITICAL: connect to address 127.0.0.1 and port 11514: Connection refused https://wikitech.wikimedia.org/wiki/Logstash [19:55:47] RECOVERY - recommendation_api endpoints health on scb2006 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [19:56:17] PROBLEM - recommendation_api endpoints health on scb2004 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [19:57:17] PROBLEM - logstash syslog TCP port on logstash1008 is CRITICAL: connect to address 127.0.0.1 and port 10514: Connection refused https://wikitech.wikimedia.org/wiki/Logstash [19:57:27] RECOVERY - recommendation_api endpoints health on scb2004 is OK: All endpoints are healthy 
https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [19:57:41] RECOVERY - logstash JSON linesTCP port on logstash1008 is OK: TCP OK - 0.000 second response time on 127.0.0.1 port 11514 https://wikitech.wikimedia.org/wiki/Logstash [19:58:02] wikibugs: [19:58:33] RECOVERY - logstash syslog TCP port on logstash1008 is OK: TCP OK - 0.000 second response time on 127.0.0.1 port 10514 https://wikitech.wikimedia.org/wiki/Logstash [19:58:37] RECOVERY - MediaWiki memcached error rate on graphite1004 is OK: OK: Less than 40.00% above the threshold [1000.0] https://wikitech.wikimedia.org/wiki/Memcached https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen [19:58:47] RECOVERY - HHVM rendering on mwdebug1002 is OK: HTTP OK: HTTP/1.1 200 OK - 77449 bytes in 1.493 second response time https://wikitech.wikimedia.org/wiki/Application_servers [20:00:37] PROBLEM - logstash syslog TCP port on logstash2006 is CRITICAL: connect to address 127.0.0.1 and port 10514: Connection refused https://wikitech.wikimedia.org/wiki/Logstash [20:01:15] RECOVERY - Check systemd state on cloudcontrol1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:01:53] PROBLEM - logstash syslog TCP port on logstash2004 is CRITICAL: connect to address 127.0.0.1 and port 10514: Connection refused https://wikitech.wikimedia.org/wiki/Logstash [20:01:59] RECOVERY - logstash syslog TCP port on logstash2006 is OK: TCP OK - 0.000 second response time on 127.0.0.1 port 10514 https://wikitech.wikimedia.org/wiki/Logstash [20:04:05] PROBLEM - restbase endpoints health on restbase1021 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [20:04:07] RECOVERY - puppet last run on actinium is OK: OK: Puppet is currently 
enabled, last run 4 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [20:04:17] RECOVERY - PHP7 rendering on mw1312 is OK: HTTP OK: HTTP/1.1 200 OK - 77486 bytes in 2.366 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [20:04:53] PROBLEM - logstash syslog TCP port on logstash1009 is CRITICAL: connect to address 127.0.0.1 and port 10514: Connection refused https://wikitech.wikimedia.org/wiki/Logstash [20:04:53] PROBLEM - logstash JSON linesTCP port on logstash1009 is CRITICAL: connect to address 127.0.0.1 and port 11514: Connection refused https://wikitech.wikimedia.org/wiki/Logstash [20:05:29] PROBLEM - Check systemd state on cloudcontrol1004 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:05:29] RECOVERY - restbase endpoints health on restbase1021 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [20:05:51] PROBLEM - PHP7 rendering on mwdebug1002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [20:06:04] (03PS1) 10Elukey: Remove mc1019->23 from the mcrouter config [puppet] - 10https://gerrit.wikimedia.org/r/525148 [20:06:19] PROBLEM - logstash syslog TCP port on logstash2006 is CRITICAL: connect to address 127.0.0.1 and port 10514: Connection refused https://wikitech.wikimedia.org/wiki/Logstash [20:07:19] RECOVERY - HHVM rendering on mw1312 is OK: HTTP OK: HTTP/1.1 200 OK - 77439 bytes in 3.483 second response time https://wikitech.wikimedia.org/wiki/Application_servers [20:07:43] RECOVERY - logstash syslog TCP port on logstash2004 is OK: TCP OK - 0.000 second response time on 127.0.0.1 port 10514 https://wikitech.wikimedia.org/wiki/Logstash [20:08:25] !log asw2-a-eqiad: request virtual-chassis vc-port set interface 
member 6 vcp-255/1/0 disable [20:08:29] PROBLEM - logstash JSON linesTCP port on logstash2006 is CRITICAL: connect to address 127.0.0.1 and port 11514: Connection refused https://wikitech.wikimedia.org/wiki/Logstash [20:08:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:08:45] RECOVERY - PHP7 rendering on mwdebug1002 is OK: HTTP OK: HTTP/1.1 200 OK - 77494 bytes in 0.253 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [20:08:56] (03CR) 10Ladsgroup: "Hopefully it should also be cached in VCL for logged-in users as well but I don't know how that can be done or this is enough for it." [puppet] - 10https://gerrit.wikimedia.org/r/525142 (https://phabricator.wikimedia.org/T85499) (owner: 10Ladsgroup) [20:09:06] (03PS1) 10Andrew Bogott: puppet: add facter.conf and cache some facts [puppet] - 10https://gerrit.wikimedia.org/r/525149 [20:09:46] (03CR) 10jerkins-bot: [V: 04-1] puppet: add facter.conf and cache some facts [puppet] - 10https://gerrit.wikimedia.org/r/525149 (owner: 10Andrew Bogott) [20:10:05] PROBLEM - logstash syslog TCP port on logstash1008 is CRITICAL: connect to address 127.0.0.1 and port 10514: Connection refused https://wikitech.wikimedia.org/wiki/Logstash [20:10:07] RECOVERY - Check systemd state on cloudcontrol1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:10:59] RECOVERY - logstash syslog TCP port on logstash1009 is OK: TCP OK - 0.000 second response time on 127.0.0.1 port 10514 https://wikitech.wikimedia.org/wiki/Logstash [20:11:01] RECOVERY - logstash JSON linesTCP port on logstash1009 is OK: TCP OK - 0.000 second response time on 127.0.0.1 port 11514 https://wikitech.wikimedia.org/wiki/Logstash [20:11:31] (03PS2) 10Andrew Bogott: puppet: add facter.conf and cache some facts [puppet] - 10https://gerrit.wikimedia.org/r/525149 [20:11:47] PROBLEM - MediaWiki memcached error rate on 
graphite1004 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [5000.0] https://wikitech.wikimedia.org/wiki/Memcached https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen [20:12:05] PROBLEM - Juniper virtual chassis ports on asw2-a-eqiad is CRITICAL: CRIT: Down: 1 Unknown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23VCP_status [20:12:19] PROBLEM - logstash JSON linesTCP port on logstash1008 is CRITICAL: connect to address 127.0.0.1 and port 11514: Connection refused https://wikitech.wikimedia.org/wiki/Logstash [20:13:21] RECOVERY - logstash JSON linesTCP port on logstash2006 is OK: TCP OK - 0.000 second response time on 127.0.0.1 port 11514 https://wikitech.wikimedia.org/wiki/Logstash [20:13:27] RECOVERY - MediaWiki memcached error rate on graphite1004 is OK: OK: Less than 40.00% above the threshold [1000.0] https://wikitech.wikimedia.org/wiki/Memcached https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen [20:13:45] 10Operations, 10ops-eqiad, 10DC-Ops, 10User-Zppix, 10cloud-services-team (Kanban): VMs on cloudvirt1015 crashing - bad mainboard/memory - https://phabricator.wikimedia.org/T220853 (10RobH) ` /admin1-> racadm getsel Record: 1 Date/Time: 05/30/2019 17:38:49 Source: system Severity: Ok Descri... [20:14:08] 10Operations, 10ops-eqiad, 10DC-Ops, 10User-Zppix, 10cloud-services-team (Kanban): VMs on cloudvirt1015 crashing - bad mainboard/memory - https://phabricator.wikimedia.org/T220853 (10RobH) Bad dimm, Chris moved it from B3 to A3 on >>! In T220853#5224397, @Cmjohnson wrote: > Swapped DIMM B3 with DIMM A3... [20:15:17] PROBLEM - logstash JSON linesTCP port on logstash2005 is CRITICAL: connect to address 127.0.0.1 and port 11514: Connection refused https://wikitech.wikimedia.org/wiki/Logstash [20:15:17] PROBLEM - puppet last run on mc1021 is CRITICAL: CRITICAL: Failed to apply catalog, zero resources tracked by Puppet. 
Could be an interrupted request or a dependency cycle. https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [20:15:31] PROBLEM - logstash syslog TCP port on logstash1007 is CRITICAL: connect to address 127.0.0.1 and port 10514: Connection refused https://wikitech.wikimedia.org/wiki/Logstash [20:15:35] RECOVERY - puppet last run on mc1019 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [20:15:51] PROBLEM - logstash syslog TCP port on logstash2004 is CRITICAL: connect to address 127.0.0.1 and port 10514: Connection refused https://wikitech.wikimedia.org/wiki/Logstash [20:15:59] RECOVERY - logstash syslog TCP port on logstash2006 is OK: TCP OK - 0.000 second response time on 127.0.0.1 port 10514 https://wikitech.wikimedia.org/wiki/Logstash [20:16:01] PROBLEM - logstash JSON linesTCP port on logstash1009 is CRITICAL: connect to address 127.0.0.1 and port 11514: Connection refused https://wikitech.wikimedia.org/wiki/Logstash [20:17:22] RECOVERY - logstash JSON linesTCP port on logstash1009 is OK: TCP OK - 0.000 second response time on 127.0.0.1 port 11514 https://wikitech.wikimedia.org/wiki/Logstash [20:18:07] RECOVERY - logstash syslog TCP port on logstash2004 is OK: TCP OK - 0.000 second response time on 127.0.0.1 port 10514 https://wikitech.wikimedia.org/wiki/Logstash [20:19:04] (03PS3) 10Andrew Bogott: puppet: add facter.conf and cache some facts [puppet] - 10https://gerrit.wikimedia.org/r/525149 [20:19:17] PROBLEM - logstash syslog TCP port on logstash2006 is CRITICAL: connect to address 127.0.0.1 and port 10514: Connection refused https://wikitech.wikimedia.org/wiki/Logstash [20:19:36] (03CR) 10jerkins-bot: [V: 04-1] puppet: add facter.conf and cache some facts [puppet] - 10https://gerrit.wikimedia.org/r/525149 (owner: 10Andrew Bogott) [20:20:19] PROBLEM - logstash syslog TCP port on logstash1009 is CRITICAL: connect to address 127.0.0.1 and port 
10514: Connection refused https://wikitech.wikimedia.org/wiki/Logstash [20:20:21] PROBLEM - logstash JSON linesTCP port on logstash1009 is CRITICAL: connect to address 127.0.0.1 and port 11514: Connection refused https://wikitech.wikimedia.org/wiki/Logstash [20:21:03] (03PS4) 10Andrew Bogott: puppet: add facter.conf and cache some facts [puppet] - 10https://gerrit.wikimedia.org/r/525149 [20:22:37] PROBLEM - logstash syslog TCP port on logstash2005 is CRITICAL: connect to address 127.0.0.1 and port 10514: Connection refused https://wikitech.wikimedia.org/wiki/Logstash [20:23:13] RECOVERY - logstash JSON linesTCP port on logstash2005 is OK: TCP OK - 0.000 second response time on 127.0.0.1 port 11514 https://wikitech.wikimedia.org/wiki/Logstash [20:23:33] PROBLEM - logstash syslog TCP port on logstash2004 is CRITICAL: connect to address 127.0.0.1 and port 10514: Connection refused https://wikitech.wikimedia.org/wiki/Logstash [20:23:41] RECOVERY - logstash syslog TCP port on logstash2006 is OK: TCP OK - 0.000 second response time on 127.0.0.1 port 10514 https://wikitech.wikimedia.org/wiki/Logstash [20:23:47] RECOVERY - logstash syslog TCP port on logstash2005 is OK: TCP OK - 0.000 second response time on 127.0.0.1 port 10514 https://wikitech.wikimedia.org/wiki/Logstash [20:24:44] RECOVERY - logstash syslog TCP port on logstash2004 is OK: TCP OK - 0.000 second response time on 127.0.0.1 port 10514 https://wikitech.wikimedia.org/wiki/Logstash [20:24:53] RECOVERY - logstash JSON linesTCP port on logstash1009 is OK: TCP OK - 0.000 second response time on 127.0.0.1 port 11514 https://wikitech.wikimedia.org/wiki/Logstash [20:26:05] RECOVERY - logstash syslog TCP port on logstash1009 is OK: TCP OK - 0.000 second response time on 127.0.0.1 port 10514 https://wikitech.wikimedia.org/wiki/Logstash [20:27:41] PROBLEM - logstash JSON linesTCP port on logstash1007 is CRITICAL: connect to address 127.0.0.1 and port 11514: Connection refused 
https://wikitech.wikimedia.org/wiki/Logstash [20:28:03] 10Operations, 10Puppet: Cache some facter facts - https://phabricator.wikimedia.org/T228805 (10Andrew) [20:28:10] 10Operations, 10Puppet: Cache some facter facts - https://phabricator.wikimedia.org/T228805 (10Andrew) p:05Triage→03Low [20:28:32] (03PS5) 10Andrew Bogott: puppet: add facter.conf and cache some facts [puppet] - 10https://gerrit.wikimedia.org/r/525149 (https://phabricator.wikimedia.org/T228805) [20:29:03] RECOVERY - logstash JSON linesTCP port on logstash1007 is OK: TCP OK - 0.000 second response time on 127.0.0.1 port 11514 https://wikitech.wikimedia.org/wiki/Logstash [20:29:49] (03CR) 10Andrew Bogott: puppet: add facter.conf and cache some facts (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/525149 (https://phabricator.wikimedia.org/T228805) (owner: 10Andrew Bogott) [20:29:57] PROBLEM - logstash syslog TCP port on logstash1009 is CRITICAL: connect to address 127.0.0.1 and port 10514: Connection refused https://wikitech.wikimedia.org/wiki/Logstash [20:29:59] PROBLEM - logstash JSON linesTCP port on logstash1009 is CRITICAL: connect to address 127.0.0.1 and port 11514: Connection refused https://wikitech.wikimedia.org/wiki/Logstash [20:31:03] RECOVERY - logstash syslog TCP port on logstash1007 is OK: TCP OK - 0.000 second response time on 127.0.0.1 port 10514 https://wikitech.wikimedia.org/wiki/Logstash [20:34:03] RECOVERY - CirrusSearch eqiad 95th percentile latency on graphite1004 is OK: OK: Less than 20.00% above the threshold [500.0] https://wikitech.wikimedia.org/wiki/Search%23Health/Activity_Monitoring https://grafana.wikimedia.org/dashboard/db/elasticsearch-percentiles?panelId=19&fullscreen&orgId=1&var-cluster=eqiad&var-smoothing=1 [20:34:21] PROBLEM - logstash syslog TCP port on logstash2005 is CRITICAL: connect to address 127.0.0.1 and port 10514: Connection refused https://wikitech.wikimedia.org/wiki/Logstash [20:35:40] herron, shdubsh: are these logstash alerts expected? 
[20:35:49] RECOVERY - logstash syslog TCP port on logstash2005 is OK: TCP OK - 0.000 second response time on 127.0.0.1 port 10514 https://wikitech.wikimedia.org/wiki/Logstash [20:36:05] paravoid: logstash is choking on the logs built up over the course of the incident. we're looking at unclogging it now [20:36:14] ah cool, thanks [20:36:21] RECOVERY - puppet last run on mc1021 is OK: OK: Puppet is currently enabled, last run 43 seconds ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [20:36:28] (03Abandoned) 10Elukey: Remove mc1019->23 from the mcrouter config [puppet] - 10https://gerrit.wikimedia.org/r/525148 (owner: 10Elukey) [20:37:01] PROBLEM - logstash syslog TCP port on logstash2004 is CRITICAL: connect to address 127.0.0.1 and port 10514: Connection refused https://wikitech.wikimedia.org/wiki/Logstash [20:37:09] RECOVERY - logstash syslog TCP port on logstash1009 is OK: TCP OK - 0.000 second response time on 127.0.0.1 port 10514 https://wikitech.wikimedia.org/wiki/Logstash [20:37:39] PROBLEM - logstash JSON linesTCP port on logstash1007 is CRITICAL: connect to address 127.0.0.1 and port 11514: Connection refused https://wikitech.wikimedia.org/wiki/Logstash [20:37:51] RECOVERY - logstash JSON linesTCP port on logstash1008 is OK: TCP OK - 0.000 second response time on 127.0.0.1 port 11514 https://wikitech.wikimedia.org/wiki/Logstash [20:38:07] PROBLEM - logstash JSON linesTCP port on logstash2006 is CRITICAL: connect to address 127.0.0.1 and port 11514: Connection refused https://wikitech.wikimedia.org/wiki/Logstash [20:38:07] PROBLEM - logstash JSON linesTCP port on logstash2004 is CRITICAL: connect to address 127.0.0.1 and port 11514: Connection refused https://wikitech.wikimedia.org/wiki/Logstash [20:40:00] shdubsh: (and please !log actions, and also that you're investigating an issue etc., to avoid being asked by annoying people like me :) [20:41:21] RECOVERY - logstash syslog TCP port on logstash1008 is OK: TCP OK - 
0.000 second response time on 127.0.0.1 port 10514 https://wikitech.wikimedia.org/wiki/Logstash [20:42:59] RECOVERY - logstash JSON linesTCP port on logstash2006 is OK: TCP OK - 0.000 second response time on 127.0.0.1 port 11514 https://wikitech.wikimedia.org/wiki/Logstash [20:42:59] RECOVERY - logstash JSON linesTCP port on logstash2004 is OK: TCP OK - 0.000 second response time on 127.0.0.1 port 11514 https://wikitech.wikimedia.org/wiki/Logstash [20:43:31] RECOVERY - logstash syslog TCP port on logstash2004 is OK: TCP OK - 0.000 second response time on 127.0.0.1 port 10514 https://wikitech.wikimedia.org/wiki/Logstash [20:43:49] PROBLEM - logstash syslog TCP port on logstash2005 is CRITICAL: connect to address 127.0.0.1 and port 10514: Connection refused https://wikitech.wikimedia.org/wiki/Logstash [20:44:05] RECOVERY - logstash JSON linesTCP port on logstash1007 is OK: TCP OK - 0.000 second response time on 127.0.0.1 port 11514 https://wikitech.wikimedia.org/wiki/Logstash [20:44:17] PROBLEM - logstash JSON linesTCP port on logstash1008 is CRITICAL: connect to address 127.0.0.1 and port 11514: Connection refused https://wikitech.wikimedia.org/wiki/Logstash [20:46:59] RECOVERY - logstash JSON linesTCP port on logstash1009 is OK: TCP OK - 0.000 second response time on 127.0.0.1 port 11514 https://wikitech.wikimedia.org/wiki/Logstash [20:46:59] PROBLEM - logstash syslog TCP port on logstash2006 is CRITICAL: connect to address 127.0.0.1 and port 10514: Connection refused https://wikitech.wikimedia.org/wiki/Logstash [20:47:07] RECOVERY - logstash syslog TCP port on logstash2005 is OK: TCP OK - 0.000 second response time on 127.0.0.1 port 10514 https://wikitech.wikimedia.org/wiki/Logstash [20:47:57] PROBLEM - logstash syslog TCP port on logstash1008 is CRITICAL: connect to address 127.0.0.1 and port 10514: Connection refused https://wikitech.wikimedia.org/wiki/Logstash [20:48:39] RECOVERY - logstash syslog TCP port on logstash2006 is OK: TCP OK - 0.000 second response 
time on 127.0.0.1 port 10514 https://wikitech.wikimedia.org/wiki/Logstash [20:49:17] RECOVERY - logstash JSON linesTCP port on logstash1008 is OK: TCP OK - 0.000 second response time on 127.0.0.1 port 11514 https://wikitech.wikimedia.org/wiki/Logstash [20:49:35] PROBLEM - logstash JSON linesTCP port on logstash2006 is CRITICAL: connect to address 127.0.0.1 and port 11514: Connection refused https://wikitech.wikimedia.org/wiki/Logstash [20:49:35] PROBLEM - logstash JSON linesTCP port on logstash2004 is CRITICAL: connect to address 127.0.0.1 and port 11514: Connection refused https://wikitech.wikimedia.org/wiki/Logstash [20:49:37] PROBLEM - logstash JSON linesTCP port on logstash2005 is CRITICAL: connect to address 127.0.0.1 and port 11514: Connection refused https://wikitech.wikimedia.org/wiki/Logstash [20:50:01] (03PS1) 10EBernhardson: Increase size of cirrus curl pools [puppet] - 10https://gerrit.wikimedia.org/r/525156 [20:50:07] PROBLEM - logstash syslog TCP port on logstash2004 is CRITICAL: connect to address 127.0.0.1 and port 10514: Connection refused https://wikitech.wikimedia.org/wiki/Logstash [20:50:43] PROBLEM - logstash JSON linesTCP port on logstash1007 is CRITICAL: connect to address 127.0.0.1 and port 11514: Connection refused https://wikitech.wikimedia.org/wiki/Logstash [20:51:13] RECOVERY - logstash JSON linesTCP port on logstash2006 is OK: TCP OK - 0.000 second response time on 127.0.0.1 port 11514 https://wikitech.wikimedia.org/wiki/Logstash [20:51:15] RECOVERY - logstash JSON linesTCP port on logstash2004 is OK: TCP OK - 0.000 second response time on 127.0.0.1 port 11514 https://wikitech.wikimedia.org/wiki/Logstash [20:51:17] RECOVERY - logstash JSON linesTCP port on logstash2005 is OK: TCP OK - 0.000 second response time on 127.0.0.1 port 11514 https://wikitech.wikimedia.org/wiki/Logstash [20:51:47] RECOVERY - logstash syslog TCP port on logstash2004 is OK: TCP OK - 0.000 second response time on 127.0.0.1 port 10514 
https://wikitech.wikimedia.org/wiki/Logstash [20:51:55] PROBLEM - logstash syslog TCP port on logstash1009 is CRITICAL: connect to address 127.0.0.1 and port 10514: Connection refused https://wikitech.wikimedia.org/wiki/Logstash [20:52:35] PROBLEM - MediaWiki memcached error rate on graphite1004 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [5000.0] https://wikitech.wikimedia.org/wiki/Memcached https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen [20:53:33] RECOVERY - logstash syslog TCP port on logstash1009 is OK: TCP OK - 0.000 second response time on 127.0.0.1 port 10514 https://wikitech.wikimedia.org/wiki/Logstash [20:53:45] PROBLEM - logstash syslog TCP port on logstash2005 is CRITICAL: connect to address 127.0.0.1 and port 10514: Connection refused https://wikitech.wikimedia.org/wiki/Logstash [20:54:35] RECOVERY - logstash syslog TCP port on logstash1008 is OK: TCP OK - 0.000 second response time on 127.0.0.1 port 10514 https://wikitech.wikimedia.org/wiki/Logstash [20:55:15] PROBLEM - logstash syslog TCP port on logstash2006 is CRITICAL: connect to address 127.0.0.1 and port 10514: Connection refused https://wikitech.wikimedia.org/wiki/Logstash [20:55:23] RECOVERY - logstash syslog TCP port on logstash2005 is OK: TCP OK - 0.000 second response time on 127.0.0.1 port 10514 https://wikitech.wikimedia.org/wiki/Logstash [20:55:39] RECOVERY - logstash JSON linesTCP port on logstash1007 is OK: TCP OK - 0.000 second response time on 127.0.0.1 port 11514 https://wikitech.wikimedia.org/wiki/Logstash [20:56:33] PROBLEM - logstash syslog TCP port on logstash1007 is CRITICAL: connect to address 127.0.0.1 and port 10514: Connection refused https://wikitech.wikimedia.org/wiki/Logstash [20:57:51] PROBLEM - logstash JSON linesTCP port on logstash2006 is CRITICAL: connect to address 127.0.0.1 and port 11514: Connection refused https://wikitech.wikimedia.org/wiki/Logstash [20:57:53] PROBLEM - logstash JSON 
linesTCP port on logstash2004 is CRITICAL: connect to address 127.0.0.1 and port 11514: Connection refused https://wikitech.wikimedia.org/wiki/Logstash [20:57:55] PROBLEM - logstash JSON linesTCP port on logstash2005 is CRITICAL: connect to address 127.0.0.1 and port 11514: Connection refused https://wikitech.wikimedia.org/wiki/Logstash [20:58:23] PROBLEM - logstash syslog TCP port on logstash2004 is CRITICAL: connect to address 127.0.0.1 and port 10514: Connection refused https://wikitech.wikimedia.org/wiki/Logstash [20:58:33] RECOVERY - logstash syslog TCP port on logstash2006 is OK: TCP OK - 0.000 second response time on 127.0.0.1 port 10514 https://wikitech.wikimedia.org/wiki/Logstash [20:59:31] RECOVERY - logstash JSON linesTCP port on logstash2004 is OK: TCP OK - 0.000 second response time on 127.0.0.1 port 11514 https://wikitech.wikimedia.org/wiki/Logstash [20:59:33] RECOVERY - logstash JSON linesTCP port on logstash2005 is OK: TCP OK - 0.000 second response time on 127.0.0.1 port 11514 https://wikitech.wikimedia.org/wiki/Logstash [20:59:41] !log temporarily disable input-kafka-rsyslog-shipper and drop memcached logs on logstash nodes [20:59:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:01:09] RECOVERY - logstash JSON linesTCP port on logstash2006 is OK: TCP OK - 0.000 second response time on 127.0.0.1 port 11514 https://wikitech.wikimedia.org/wiki/Logstash [21:01:41] RECOVERY - logstash syslog TCP port on logstash2004 is OK: TCP OK - 0.000 second response time on 127.0.0.1 port 10514 https://wikitech.wikimedia.org/wiki/Logstash [21:01:57] (03PS1) 10Legoktm: Enable SecureLinkFixer on group0 wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/525157 (https://phabricator.wikimedia.org/T200751) [21:02:01] PROBLEM - logstash syslog TCP port on logstash2005 is CRITICAL: connect to address 127.0.0.1 and port 10514: Connection refused https://wikitech.wikimedia.org/wiki/Logstash [21:02:27] RECOVERY - MediaWiki memcached 
error rate on graphite1004 is OK: OK: Less than 40.00% above the threshold [1000.0] https://wikitech.wikimedia.org/wiki/Memcached https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen [21:04:49] RECOVERY - logstash syslog TCP port on logstash1007 is OK: TCP OK - 0.000 second response time on 127.0.0.1 port 10514 https://wikitech.wikimedia.org/wiki/Logstash [21:05:21] RECOVERY - logstash syslog TCP port on logstash2005 is OK: TCP OK - 0.000 second response time on 127.0.0.1 port 10514 https://wikitech.wikimedia.org/wiki/Logstash [21:05:31] RECOVERY - PyBal backends health check on lvs1015 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [21:06:11] RECOVERY - PyBal backends health check on lvs1016 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [21:06:11] PROBLEM - logstash JSON linesTCP port on logstash2005 is CRITICAL: connect to address 127.0.0.1 and port 11514: Connection refused https://wikitech.wikimedia.org/wiki/Logstash [21:07:49] PROBLEM - logstash JSON linesTCP port on logstash2004 is CRITICAL: connect to address 127.0.0.1 and port 11514: Connection refused https://wikitech.wikimedia.org/wiki/Logstash [21:08:21] PROBLEM - logstash syslog TCP port on logstash2004 is CRITICAL: connect to address 127.0.0.1 and port 10514: Connection refused https://wikitech.wikimedia.org/wiki/Logstash [21:08:52] chaomodus: trying to use the netbox API. i see the example on https://netbox.readthedocs.io/en/stable/api/overview/ and see that we use 8001 instead of 8000. 
but i always get Bad Request (400) when doing something curl -s http://localhost:8001/api/ what am i missing [21:09:12] mutante: authentication [21:09:23] mutante: it's far easier to use the shell than the api [21:09:29] RECOVERY - logstash JSON linesTCP port on logstash2004 is OK: TCP OK - 0.000 second response time on 127.0.0.1 port 11514 https://wikitech.wikimedia.org/wiki/Logstash [21:09:31] RECOVERY - logstash JSON linesTCP port on logstash2005 is OK: TCP OK - 0.000 second response time on 127.0.0.1 port 11514 https://wikitech.wikimedia.org/wiki/Logstash [21:09:31] chaomodus: even from netmon1002? i see. but shouldn't that be a different return code [21:09:43] chaomodus: oh! looking into the shell [21:09:55] mutante: if you join my tmux on there i'm already in it :) [21:10:01] RECOVERY - logstash syslog TCP port on logstash2004 is OK: TCP OK - 0.000 second response time on 127.0.0.1 port 10514 https://wikitech.wikimedia.org/wiki/Logstash [21:10:19] (tmux as root) [21:11:59] there are 2 but i am in one :) [21:12:20] oic you're in the correct one afaict [21:12:47] so you did ./manage.py nbshell ? [21:12:59] just shell [21:13:09] it's a standard python interpreter but you have access to the internal models and stuff [21:13:20] alright [21:13:59] (fwiw you have to . the activate in the venv, prior to running it) [21:14:05] i guess i want non-interactive though [21:14:19] what query are you trying to run exactly? [21:14:35] i wanted to see about getting the rack for a hostname [21:14:43] ah that's easy [21:15:10] dcim.models.Device.objects.filter(name='whatever').rack [21:15:15] err [21:15:20] dcim.models.Device.objects.get(name='whatever').rack [21:15:48] nice! works [21:15:55] yep!
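(A minimal sketch of the rack-for-hostname lookup discussed above, done over Netbox's REST API instead of the Django shell. Netbox's API rejects unauthenticated requests, which is why the bare curl call failed; this illustration assumes the localhost:8001 endpoint from the conversation and a hypothetical API token.)

```python
# Sketch: look up a device's rack via the Netbox REST API.
# Assumptions: Netbox listening on localhost:8001 (per the discussion)
# and a hypothetical API token; the plain curl call failed because it
# sent no Authorization header.
import json
import urllib.parse
import urllib.request

NETBOX = "http://localhost:8001"
TOKEN = "hypothetical-token"  # placeholder, not a real credential

def device_request(name):
    """Build an authenticated device-by-name request."""
    qs = urllib.parse.urlencode({"name": name})
    return urllib.request.Request(
        f"{NETBOX}/api/dcim/devices/?{qs}",
        headers={"Authorization": f"Token {TOKEN}"},
    )

def rack_from_payload(payload):
    """Extract the rack name from a device-list response body."""
    results = json.loads(payload)["results"]
    return results[0]["rack"]["name"] if results else None

# Against a live Netbox this would be:
#   with urllib.request.urlopen(device_request("mw1267")) as resp:
#       print(rack_from_payload(resp.read()))
```

(This is roughly what a non-interactive host2rack script could do without needing the manage.py shell.)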
[21:16:19] you could also look in the ui, unless you're trying to automate this then we have to go to api country [21:16:33] now i just want to send that without the interactive shell [21:16:47] host2rack.sh or something [21:16:59] yah [21:17:25] PROBLEM - MediaWiki memcached error rate on graphite1004 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [5000.0] https://wikitech.wikimedia.org/wiki/Memcached https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen [21:17:33] sec [21:18:11] 100% memcached error rate sounds bad but the graph looks like it's not special [21:18:25] i see you. shared screen/tmux is the best [21:22:26] ideally we'd make a little tool to stick in scripts/ [21:22:36] that could do various queries based on hostname [21:23:31] yes, that :) [21:23:49] PROBLEM - puppet last run on analytics1047 is CRITICAL: CRITICAL: Failed to apply catalog, zero resources tracked by Puppet. Could be an interrupted request or a dependency cycle. https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [21:23:54] and soonish you'll be able to use cumin to do some similar query [21:24:44] chaomodus: you mean it becomes a spicerack recipe? [21:25:52] it could yes [21:26:17] PROBLEM - puppet last run on serpens is CRITICAL: CRITICAL: Failed to apply catalog, zero resources tracked by Puppet. Could be an interrupted request or a dependency cycle. https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [21:32:44] (03PS3) 10Holger Knust: table-properties: Initial commit [software/cassandra-table-properties] - 10https://gerrit.wikimedia.org/r/524921 (https://phabricator.wikimedia.org/T220246) [21:32:48] (03CR) 10CRusnov: "I feel as though there was implicit agreement to merge this, still it'd be nice to have a +1."
[software/netbox-deploy] - 10https://gerrit.wikimedia.org/r/518166 (https://phabricator.wikimedia.org/T223292) (owner: 10CRusnov) [21:46:29] RECOVERY - puppet last run on analytics1047 is OK: OK: Puppet is currently enabled, last run 29 seconds ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [21:49:18] 10Operations, 10ops-eqiad, 10DC-Ops, 10User-Zppix, 10cloud-services-team (Kanban): VMs on cloudvirt1015 crashing - bad mainboard/memory - https://phabricator.wikimedia.org/T220853 (10wiki_willy) @Cmjohnson - are those errors for DIMM A3 enough info to get Dell to RMA a part to us? If not, let me know, a... [21:52:35] !log logstash - temporarily dropping logs matching [message] =~ /^SlowTimer/ due to UTF-8 parsing errors that are stopping the logstash processing pipeline. will re-enable after logstash has caught up with the backlog [21:52:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:54:33] RECOVERY - puppet last run on serpens is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [21:58:21] (03PS1) 10CRusnov: nbdeviceinfo.py: Add simple command-line host dump [software/netbox-deploy] - 10https://gerrit.wikimedia.org/r/525165 [21:59:55] (03CR) 10CRusnov: "This is a simple script to dump host information on Netbox. This script outputs as yaml to stdout for maximum machine readability. It depe" [software/netbox-deploy] - 10https://gerrit.wikimedia.org/r/525165 (owner: 10CRusnov) [22:00:39] (03PS2) 10CRusnov: nbdeviceinfo.py: Add simple command-line host dump [software/netbox-deploy] - 10https://gerrit.wikimedia.org/r/525165 [22:06:19] !log puppet temporarily disabled on eqiad/codfw logstash collectors while catching up with backlog. 
see /etc/logstash/conf.d/01-filter_temp_drops.conf [22:06:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:14:19] !log continuing rollout of new scap version 3.11.1-1, starting with kafka-all followed by other cumin-alias groups (T228328) [22:14:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:14:26] T228328: 'scap pull' stopped working on appservers ? - https://phabricator.wikimedia.org/T228328 [22:36:22] !log rolling out scap 3.11.1-1 on mw-eqiad servers [22:36:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:40:27] PROBLEM - Logstash rate of ingestion percent change compared to yesterday on icinga1001 is CRITICAL: 449.6 ge 130 https://phabricator.wikimedia.org/T202307 https://grafana.wikimedia.org/dashboard/db/logstash?orgId=1&panelId=2&fullscreen [22:46:19] 10Operations, 10ops-eqiad, 10Cassandra, 10DC-Ops, and 4 others: restbase-dev1006 has a broken disk - https://phabricator.wikimedia.org/T224260 (10Dzahn) This host pops up because it's the only one where i can't upgrade scap with debdeploy. I tried to ssh to it manually and it asks me for a password. So th... [22:53:01] 10Operations, 10Analytics, 10Analytics-Kanban, 10Discovery, 10Research-Backlog: Make oozie swift upload emit event to Kafka about swift object upload complete - https://phabricator.wikimedia.org/T227896 (10Ottomata) Ok, from discussions with Erik today, we are going with an event like: `lang=json { "$... [22:53:37] PROBLEM - puppet last run on cloudvirt1021 is CRITICAL: CRITICAL: Failed to apply catalog, zero resources tracked by Puppet. Could be an interrupted request or a dependency cycle. https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [23:00:04] MaxSem, RoanKattouw, and Niharika: It is that lovely time of the day again! You are hereby commanded to deploy Evening SWAT (Max 6 patches). (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190723T2300). 
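(The temporary drop filter !logged at 21:52 and 22:06 above — /etc/logstash/conf.d/01-filter_temp_drops.conf — would look roughly like this in Logstash's filter DSL. This is a hypothetical reconstruction from the `[message] =~ /^SlowTimer/` pattern quoted in the log, not the actual file contents.)

```
filter {
  # Temporary mitigation: drop MediaWiki SlowTimer messages whose UTF-8
  # parsing errors were stalling the pipeline; remove once the backlog
  # has been processed and puppet is re-enabled.
  if [message] =~ /^SlowTimer/ {
    drop { }
  }
}
```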
[23:00:05] No GERRIT patches in the queue for this window AFAICS. [23:00:34] Okie dokie. [23:12:41] 10Operations, 10Gerrit, 10LDAP-Access-Requests, 10Release-Engineering-Team-TODO, 10Release-Engineering-Team (Development services): Add more SREs to gerritadmin LDAP group - https://phabricator.wikimedia.org/T228733 (10Dzahn) I would volunteer as well. [23:16:57] 10Operations, 10Analytics, 10Analytics-Kanban, 10Discovery, 10Research-Backlog: Make oozie swift upload emit event to Kafka about swift object upload complete - https://phabricator.wikimedia.org/T227896 (10Nuria) Let me catch up here, seems that urls should have versions and not only be defined by a loca... [23:18:38] mutante: Did you repool mw1267? [23:20:21] RECOVERY - MediaWiki memcached error rate on graphite1004 is OK: OK: Less than 40.00% above the threshold [1000.0] https://wikitech.wikimedia.org/wiki/Memcached https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen [23:20:45] 10Operations, 10Analytics, 10Analytics-Kanban, 10Discovery, 10Research-Backlog: Make oozie swift upload emit event to Kafka about swift object upload complete - https://phabricator.wikimedia.org/T227896 (10Ottomata) The object URLs are totally up to the user, the script just uploads whatever is in the hd... [23:21:53] RECOVERY - puppet last run on cloudvirt1021 is OK: OK: Puppet is currently enabled, last run 5 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [23:22:40] James_F: no, but i can. i don't know about the history besides SAL though [23:23:24] the module was missing? hmm.. that sounds familiar [23:23:29] mutante: c.danis depooled it because it was scap erroring due to the module. [23:23:32] Yeah. [23:23:41] yea, but there was a reason before scap [23:23:45] That theoretically is now fixed with the new scap version rollout? 
[23:24:13] i still wonder what made it different from all other appservers then [23:24:22] because scap was broken on all of them [23:24:43] I guess the Python module dependency happened to be installed on the rest but not on that one? [23:25:28] normally we don't install stuff manually though.. so a one-off is always weird [23:25:34] alright. scap pulled. it works [23:25:38] repooling. looks fine [23:25:43] !log dzahn@cumin1001 conftool action : set/pooled=yes; selector: name=mw1267.eqiad.wmnet [23:25:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:28:06] I'll give it a real test with a scap in a bit. [23:28:34] cool! [23:28:47] btw.. on every scap pull i see these: [23:28:57] cannot delete non-empty directory: php-1.33.0-wmf.3 [23:29:00] cannot delete non-empty directory: php-1.33.0-wmf.23 [23:29:19] Oh, failed clean ups from when servers were depooled? Fun. [23:29:21] and then sometimes we need to manually delete old versions for disk space [23:29:30] ah, yea [23:29:39] You can safely `rm -rf php-1.33.0*` on the whole fleet. [23:30:08] We only want php-1.34.0-wmf.11–php-1.34.0-wmf.15 in production at most. [23:31:25] !log mw1267 - rm -rf /srv/mediawiki/php-1.33.0-wmf.23 ; rm -rf /srv/mediawiki/php-1.32.0-wmf.3 ; scap pull [23:31:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:31:43] 23:31:08 Finished rsync common (duration: 00m 04s) [23:31:51] 4 seconds..nothing to do [23:32:13] * James_F nods. 
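(The branch cleanup agreed above — "rm -rf php-1.33.0* … we only want php-1.34.0-wmf.11–php-1.34.0-wmf.15 in production" — could be scripted along these lines. A sketch only: the keep-list and the /srv/mediawiki path come from the conversation, and the actual deletion is left commented out.)

```python
# Sketch: find stale MediaWiki branch checkouts so only the versions
# still wanted in production (per the conversation above) remain.
import re
import shutil  # used in the commented-out deletion below
from pathlib import Path

KEEP = {f"php-1.34.0-wmf.{n}" for n in range(11, 16)}  # wmf.11..wmf.15
BRANCH_RE = re.compile(r"^php-\d+\.\d+\.\d+-wmf\.\d+$")

def stale_branches(root, keep=KEEP):
    """Return names of php-*-wmf.* checkout dirs under root not in keep."""
    return sorted(
        p.name
        for p in Path(root).iterdir()
        if p.is_dir() and BRANCH_RE.match(p.name) and p.name not in keep
    )

# On a host one would then, carefully:
#   for name in stale_branches("/srv/mediawiki"):
#       shutil.rmtree(Path("/srv/mediawiki") / name)
```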
[23:38:27] (03PS1) 10Jeena Huneidi: Package mediawiki-dev and add to index [deployment-charts] - 10https://gerrit.wikimedia.org/r/525173 [23:42:23] !log jforrester@deploy1001 Synchronized php-1.34.0-wmf.15/includes/diff/DifferenceEngine.php: T228766 Don't double wrap rollback links (duration: 00m 56s) [23:42:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:42:30] T228766: Rollback links in wmf.15 now have two square brackets around them, not one - https://phabricator.wikimedia.org/T228766 [23:43:16] !log reverting logstash mitigations and re-enable puppet [23:43:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:44:05] (03CR) 10Jeena Huneidi: Package mediawiki-dev and add to index (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/525173 (owner: 10Jeena Huneidi) [23:49:27] 10Operations, 10ops-codfw, 10DBA, 10Goal: rack/setup/install db21[21-30].codfw.wmnet - https://phabricator.wikimedia.org/T227113 (10Papaul)