[06:20:16] (CR) Volans: "It seems quite suboptimal to me to have to specify a notes_url parameter when absenting a resource." [puppet] - https://gerrit.wikimedia.org/r/524951 (https://phabricator.wikimedia.org/T197873) (owner: Dzahn)
[06:24:45] (CR) Volans: [C: -1] "fixed typo in comment" (1 comment) [puppet] - https://gerrit.wikimedia.org/r/523013 (https://phabricator.wikimedia.org/T197126) (owner: CDanis)
[06:25:43] (PS5) Muehlenhoff: xhgui: Read LDAP servers from Hiera and switch to read-only replicas [puppet] - https://gerrit.wikimedia.org/r/524791 (https://phabricator.wikimedia.org/T227650)
[06:27:32] (PS3) Ema: ATS: split the cache for beta variant of the mobile site [puppet] - https://gerrit.wikimedia.org/r/524789 (https://phabricator.wikimedia.org/T227432)
[06:27:53] (CR) Muehlenhoff: [C: +2] xhgui: Read LDAP servers from Hiera and switch to read-only replicas [puppet] - https://gerrit.wikimedia.org/r/524791 (https://phabricator.wikimedia.org/T227650) (owner: Muehlenhoff)
[06:29:06] PROBLEM - puppet last run on mc2035 is CRITICAL: CRITICAL: Failed to apply catalog, zero resources tracked by Puppet. Could be an interrupted request or a dependency cycle. https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[06:39:48] Operations, ops-eqiad, DC-Ops: a2-eqiad pdu refresh - https://phabricator.wikimedia.org/T227138 (akosiaris)
[06:41:44] Operations, ops-eqiad, DC-Ops: a2-eqiad pdu refresh - https://phabricator.wikimedia.org/T227138 (akosiaris) conf1001 is fine to powerdown (no depool necessary), perform all wanted actions and then poweron as it will repool itself automatically
[06:43:49] Operations, ops-eqiad, DC-Ops: a7-eqiad pdu refresh - https://phabricator.wikimedia.org/T227143 (elukey)
[06:44:08] Operations, ops-eqiad, DC-Ops: a7-eqiad pdu refresh - https://phabricator.wikimedia.org/T227143 (elukey) Ok for the analytics nodes, hadoop workers that can go down without horrible consequences.
[06:44:47] Operations, ops-eqiad, DC-Ops: a3-eqiad pdu refresh - https://phabricator.wikimedia.org/T227139 (akosiaris)
[06:49:04] Operations, ops-eqiad, DC-Ops: a3-eqiad pdu refresh - https://phabricator.wikimedia.org/T227139 (akosiaris) emptying ganeti1006 will require some 15-30 mins of advance notice. Emptying it (in case me or @MoritzMuehlenhoff are not around) can be done per https://wikitech.wikimedia.org/wiki/Ganeti#Node...
[06:51:50] Operations, ops-eqiad, DC-Ops: a4-eqiad pdu refresh - https://phabricator.wikimedia.org/T227140 (akosiaris)
[06:53:42] Operations, ops-eqiad, DC-Ops: a4-eqiad pdu refresh - https://phabricator.wikimedia.org/T227140 (akosiaris) * emptying ganeti1005 will require some 15-30 mins of advance notice. Emptying it (in case me or @MoritzMuehlenhoff are not around) can be done per https://wikitech.wikimedia.org/wiki/Ganeti#No...
[06:56:58] Operations, ops-eqiad, DC-Ops: a5-eqiad pdu refresh - https://phabricator.wikimedia.org/T227141 (akosiaris)
[06:57:24] RECOVERY - puppet last run on mc2035 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[06:58:02] Operations, ops-eqiad, DC-Ops: a5-eqiad pdu refresh - https://phabricator.wikimedia.org/T227141 (akosiaris) * emptying ganeti1008 will require some 15-30 mins of advance notice. Emptying it (in case me or @MoritzMuehlenhoff are not around) can be done per https://wikitech.wikimedia.org/wiki/Ganeti#No...
[06:58:42] (PS2) Fsero: Add termbox-test release [deployment-charts] - https://gerrit.wikimedia.org/r/524817 (https://phabricator.wikimedia.org/T226814) (owner: Tarrow)
[07:00:07] (PS3) Fsero: Add termbox-test release [deployment-charts] - https://gerrit.wikimedia.org/r/524817 (https://phabricator.wikimedia.org/T226814) (owner: Tarrow)
[07:00:29] (CR) Fsero: [V: +2 C: +2] Add termbox-test release [deployment-charts] - https://gerrit.wikimedia.org/r/524817 (https://phabricator.wikimedia.org/T226814) (owner: Tarrow)
[07:01:00] Operations, ops-eqiad, DC-Ops: a6-eqiad pdu refresh - https://phabricator.wikimedia.org/T227142 (akosiaris)
[07:01:06] Operations, ops-eqiad, DC-Ops: a5-eqiad pdu refresh - https://phabricator.wikimedia.org/T227141 (Marostegui) From the DB side of things, this rack should be done **before** Thursday 30th 05:30AM UTC, as at that time db1128 will become phabricator master {T228243}
[07:02:04] Operations, ops-eqiad, DC-Ops: a6-eqiad pdu refresh - https://phabricator.wikimedia.org/T227142 (akosiaris) * emptying ganeti1006 will require some 15-30 mins of advance notice. Emptying it (in case me or @MoritzMuehlenhoff are not around) can be done per https://wikitech.wikimedia.org/wiki/Ganeti#No...
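The mc2035 PROBLEM/RECOVERY pair above comes from a "puppet last run" freshness check. A minimal Python sketch of the same idea, assuming a summary already parsed out of Puppet's `/var/lib/puppet/state/last_run_summary.yaml`; the function name and thresholds here are illustrative, not the production check:

```python
"""Sketch of a 'puppet last run' check in the Nagios convention.

Assumes the caller has already read last_run_summary.yaml; the
max_age threshold is an illustrative value, not the real one.
"""
import time

OK, WARNING, CRITICAL = 0, 1, 2  # Nagios plugin exit codes


def check_puppet_run(last_run_epoch, failures, resources_total,
                     now=None, max_age=3600):
    """Return a (status, message) pair for the most recent Puppet run."""
    now = now if now is not None else time.time()
    if resources_total == 0:
        # Zero resources tracked usually means the catalog failed to apply,
        # matching the mc2035 alert text above.
        return CRITICAL, "Failed to apply catalog, zero resources tracked"
    if failures > 0:
        return CRITICAL, f"Puppet run had {failures} failures"
    age = now - last_run_epoch
    if age > max_age:
        return WARNING, f"Last run {age:.0f}s ago"
    return OK, f"Last run {age:.0f}s ago with 0 failures"
```

A wrapper script would print the message and `sys.exit()` with the status so Icinga can pick it up.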
[07:02:15] Operations, ops-eqiad, DC-Ops: a1-eqiad pdu refresh - https://phabricator.wikimedia.org/T226782 (Marostegui) Good to go from the DB side
[07:02:57] Operations, ops-eqiad, DC-Ops: a7-eqiad pdu refresh - https://phabricator.wikimedia.org/T227143 (akosiaris)
[07:03:38] Operations, ops-eqiad, DC-Ops: a8-eqiad pdu refresh - https://phabricator.wikimedia.org/T227133 (akosiaris)
[07:04:32] Operations, ops-eqiad, DC-Ops: a5-eqiad pdu refresh - https://phabricator.wikimedia.org/T227141 (Marostegui)
[07:05:03] Operations, ops-eqiad, DC-Ops: a4-eqiad pdu refresh - https://phabricator.wikimedia.org/T227140 (akosiaris) Sigh, this was already done. I just hope the info added will be useful at some point in the future as a guide
[07:05:48] Operations, ops-eqiad, DC-Ops: a6-eqiad pdu refresh - https://phabricator.wikimedia.org/T227142 (Marostegui) This rack contains an active primary db master: db1066, this would need to be failed over if we are not confident about not losing power.
[07:06:16] Operations, ops-eqiad, DC-Ops: a6-eqiad pdu refresh - https://phabricator.wikimedia.org/T227142 (Marostegui)
[07:06:53] Operations, ops-eqiad, DC-Ops: a7-eqiad pdu refresh - https://phabricator.wikimedia.org/T227143 (Marostegui) From the DB side this rack is good to go
[07:06:59] Operations, ops-eqiad, DC-Ops: a5-eqiad pdu refresh - https://phabricator.wikimedia.org/T227141 (Joe)
[07:07:06] (PS1) Fsero: Revert "Add termbox-test release" [deployment-charts] - https://gerrit.wikimedia.org/r/525033
[07:07:22] (Abandoned) Fsero: Revert "Add termbox-test release" [deployment-charts] - https://gerrit.wikimedia.org/r/525033 (owner: Fsero)
[07:07:46] Operations, ops-eqiad, DC-Ops: a8-eqiad pdu refresh - https://phabricator.wikimedia.org/T227133 (Marostegui) From the DB side, this rack is good to go
[07:09:08] Operations, ops-eqiad, DC-Ops: a6-eqiad pdu refresh - https://phabricator.wikimedia.org/T227142 (Joe)
[07:09:35] Operations, ops-eqiad, DC-Ops: b1-eqiad pdu refresh - https://phabricator.wikimedia.org/T227536 (Marostegui)
[07:11:07] Operations, ops-eqiad, DC-Ops: b1-eqiad pdu refresh - https://phabricator.wikimedia.org/T227536 (Marostegui) From the DB side this can be done anytime
[07:12:30] Operations, ops-eqiad, DC-Ops: a7-eqiad pdu refresh - https://phabricator.wikimedia.org/T227143 (Joe)
[07:13:28] (PS4) Ema: ATS: split the cache for beta variant of the mobile site [puppet] - https://gerrit.wikimedia.org/r/524789 (https://phabricator.wikimedia.org/T227432)
[07:14:05] Operations, ops-eqiad, DC-Ops: b2-eqiad pdu refresh - https://phabricator.wikimedia.org/T227538 (Marostegui) From the DB side, this can be done **after** Thursday 25th as db1072 will no longer be a master
[07:14:26] Operations, ops-eqiad, DC-Ops: a5-eqiad pdu refresh - https://phabricator.wikimedia.org/T227141 (Marostegui)
[07:14:40] Operations, ops-eqiad, DC-Ops: a5-eqiad pdu refresh - https://phabricator.wikimedia.org/T227141 (Marostegui) >>! In T227141#5356279, @Marostegui wrote: > From the DB side of things, this rack should be done **before** Thursday 30th 05:30AM UTC, as at that time db1128 will become phabricator master {T...
[07:27:20] (PS2) Marostegui: mariadb: Productionize dbproxy2002 into m2-codfw [puppet] - https://gerrit.wikimedia.org/r/524963 (https://phabricator.wikimedia.org/T202367)
[07:29:20] PROBLEM - HHVM rendering on mw1328 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers
[07:30:52] RECOVERY - HHVM rendering on mw1328 is OK: HTTP OK: HTTP/1.1 200 OK - 77352 bytes in 0.103 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[07:31:23] !log Deploy grants for dbproxy2002 on m2 - T202367
[07:31:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:31:34] T202367: Productionize dbproxy101[2-7].eqiad.wmnet and dbproxy200[1-4] - https://phabricator.wikimedia.org/T202367
[07:38:11] (PS3) Marostegui: mariadb: Productionize dbproxy2002 into m2-codfw [puppet] - https://gerrit.wikimedia.org/r/524963 (https://phabricator.wikimedia.org/T202367)
[07:42:12] RECOVERY - Work requests waiting in Zuul Gearman server on contint1001 is OK: OK: Less than 30.00% above the threshold [90.0] https://www.mediawiki.org/wiki/Continuous_integration/Zuul https://grafana.wikimedia.org/dashboard/db/zuul-gearman?panelId=10&fullscreen&orgId=1
[07:42:27] ^^ I should probably make that alarm less sensitive
[07:44:36] PROBLEM - Eqsin HTTP 5xx reqs/min on graphite1004 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/d/myRmf1Pik/varnish-aggregate-client-status-codes?var-site=eqsin&var-status_type=5
[07:46:29] (CR) Hashar: [V: +1 C: -1] "NodeJS is still used :-( T228639" [puppet] - https://gerrit.wikimedia.org/r/524221 (https://phabricator.wikimedia.org/T225735) (owner: Hashar)
[07:46:56] PROBLEM - Text HTTP 5xx reqs/min on graphite1004 is CRITICAL: CRITICAL: 10.00% of data above the critical threshold [1000.0] https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/d/myRmf1Pik/varnish-aggregate-client-status-codes?var-cache_type=varnish-text&var-status_type=5
[07:50:38] (CR) Marostegui: "Puppet looks good: https://puppet-compiler.wmflabs.org/compiler1002/17571/" [puppet] - https://gerrit.wikimedia.org/r/524963 (https://phabricator.wikimedia.org/T202367) (owner: Marostegui)
[07:50:42] (CR) Marostegui: [C: +2] mariadb: Productionize dbproxy2002 into m2-codfw [puppet] - https://gerrit.wikimedia.org/r/524963 (https://phabricator.wikimedia.org/T202367)
[07:51:14] RECOVERY - Eqsin HTTP 5xx reqs/min on graphite1004 is OK: OK: Less than 1.00% above the threshold [250.0] https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/d/myRmf1Pik/varnish-aggregate-client-status-codes?var-site=eqsin&var-status_type=5
[07:51:47] seemed cp5011 related
[07:52:18] https://grafana.wikimedia.org/d/000000352/varnish-failed-fetches?orgId=1&from=now-3h&to=now&var-datasource=eqsin%20prometheus%2Fops&var-cache_type=text&var-server=All&var-layer=backend
[07:52:22] ema --^
[07:53:36] RECOVERY - Text HTTP 5xx reqs/min on graphite1004 is OK: OK: Less than 1.00% above the threshold [250.0] https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/d/myRmf1Pik/varnish-aggregate-client-status-codes?var-cache_type=varnish-text&var-status_type=5
[07:53:51] (PS1) Elukey: profile::tlsproxy::service: add more granularity in monitoring [puppet] - https://gerrit.wikimedia.org/r/525039 (https://phabricator.wikimedia.org/T227860)
[07:55:57] (PS2) Elukey: profile::tlsproxy::service: add more granularity in monitoring [puppet] - https://gerrit.wikimedia.org/r/525039 (https://phabricator.wikimedia.org/T227860)
[07:56:52] (PS3) Elukey: profile::tlsproxy::service: add more granularity in monitoring [puppet] - https://gerrit.wikimedia.org/r/525039 (https://phabricator.wikimedia.org/T227860)
[07:58:38] (CR) Alexandros Kosiaris: [V: +2 C: +2] Fix bug in scaffold configmap.yaml and deployment.yaml [deployment-charts] - https://gerrit.wikimedia.org/r/524938 (owner: Jeena Huneidi)
[07:58:48] (PS2) Hashar: contint: no more include ::contint::packages::ruby by default [puppet] - https://gerrit.wikimedia.org/r/524224 (https://phabricator.wikimedia.org/T225735)
[07:58:50] (PS2) Hashar: contint: remove contint::php [puppet] - https://gerrit.wikimedia.org/r/524225 (https://phabricator.wikimedia.org/T225735)
[07:58:52] (PS2) Hashar: contint: no more include ::packages::javascript by default [puppet] - https://gerrit.wikimedia.org/r/524221 (https://phabricator.wikimedia.org/T225735)
[07:58:54] (PS4) Alexandros Kosiaris: Fix bug in scaffold configmap.yaml and deployment.yaml [deployment-charts] - https://gerrit.wikimedia.org/r/524938 (owner: Jeena Huneidi)
[07:58:56] (CR) Alexandros Kosiaris: [V: +2 C: +2] Fix bug in scaffold configmap.yaml and deployment.yaml [deployment-charts] - https://gerrit.wikimedia.org/r/524938 (owner: Jeena Huneidi)
[08:00:05] (CR) Hashar: [C: -1] "Due to T228639" [puppet] - https://gerrit.wikimedia.org/r/524221 (https://phabricator.wikimedia.org/T225735) (owner: Hashar)
[08:00:21] (PS1) Muehlenhoff: Add poolcounter1004 [mediawiki-config] - https://gerrit.wikimedia.org/r/525040 (https://phabricator.wikimedia.org/T224572)
[08:00:45] Operations, Traffic: TLS config issue for nginx on Buster - https://phabricator.wikimedia.org/T228730 (elukey)
[08:01:18] Operations, Analytics, Analytics-Kanban, Traffic, and 2 others: TLS certificates for Analytics origin servers - https://phabricator.wikimedia.org/T227860 (elukey) Open→Stalled Stalled due to https://phabricator.wikimedia.org/T228730
[08:06:01] (PS4) Elukey: profile::tlsproxy::service: add more granularity in monitoring [puppet] - https://gerrit.wikimedia.org/r/525039 (https://phabricator.wikimedia.org/T227860)
[08:08:06] (CR) Elukey: "https://puppet-compiler.wmflabs.org/compiler1002/17574/" [puppet] - https://gerrit.wikimedia.org/r/525039 (https://phabricator.wikimedia.org/T227860) (owner: Elukey)
[08:08:19] (Abandoned) Hashar: contint: no more include ::packages::javascript by default [puppet] - https://gerrit.wikimedia.org/r/524221 (https://phabricator.wikimedia.org/T225735) (owner: Hashar)
[08:08:29] !log Stop MySQL on db2044 to test dbproxy2002 notifications - T202367
[08:08:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:08:36] T202367: Productionize dbproxy101[2-7].eqiad.wmnet and dbproxy200[1-4] - https://phabricator.wikimedia.org/T202367
[08:12:45] PROBLEM - haproxy failover on dbproxy2002 is CRITICAL: CRITICAL check_failover servers up 1 down 1 https://wikitech.wikimedia.org/wiki/HAProxy
[08:13:08] ^ that is me
[08:16:16] (PS1) Marostegui: dbproxy2001: Enable notifications [puppet] - https://gerrit.wikimedia.org/r/525042 (https://phabricator.wikimedia.org/T202367)
[08:16:23] RECOVERY - haproxy failover on dbproxy2002 is OK: OK check_failover servers up 2 down 0 https://wikitech.wikimedia.org/wiki/HAProxy
[08:17:35] (CR) Marostegui: [C: +2] dbproxy2001: Enable notifications [puppet] - https://gerrit.wikimedia.org/r/525042 (https://phabricator.wikimedia.org/T202367) (owner: Marostegui)
[08:17:58] (CR) Ema: [C: +1] profile::tlsproxy::service: add more granularity in monitoring [puppet] - https://gerrit.wikimedia.org/r/525039 (https://phabricator.wikimedia.org/T227860) (owner: Elukey)
[08:22:09] (PS9) Hashar: releases: inline php packages installation [puppet] - https://gerrit.wikimedia.org/r/523147 (https://phabricator.wikimedia.org/T225735)
[08:22:11] (PS5) Hashar: contint: remove php packages [puppet] - https://gerrit.wikimedia.org/r/523148 (https://phabricator.wikimedia.org/T225735)
[08:22:14] (PS5) Hashar: contint: apply apt::unattend_upgrade at role level [puppet] - https://gerrit.wikimedia.org/r/523150 (https://phabricator.wikimedia.org/T225735)
[08:22:16] (PS2) Hashar: contint: remove sqlite3 debian package [puppet] - https://gerrit.wikimedia.org/r/524219 (https://phabricator.wikimedia.org/T225735)
[08:22:18] (PS3) Hashar: contint: no more include ::contint::packages::ruby by default [puppet] - https://gerrit.wikimedia.org/r/524224 (https://phabricator.wikimedia.org/T225735)
[08:22:20] (PS3) Hashar: contint: remove contint::php [puppet] - https://gerrit.wikimedia.org/r/524225 (https://phabricator.wikimedia.org/T225735)
[08:29:33] (PS1) Marostegui: db-eqiad.php: Depool db1100 [mediawiki-config] - https://gerrit.wikimedia.org/r/525043
[08:31:40] (CR) Muehlenhoff: [C: -1] releases: inline php packages installation (1 comment) [puppet] - https://gerrit.wikimedia.org/r/523147 (https://phabricator.wikimedia.org/T225735) (owner: Hashar)
[08:31:45] (CR) Marostegui: [C: +2] db-eqiad.php: Depool db1100 [mediawiki-config] - https://gerrit.wikimedia.org/r/525043 (owner: Marostegui)
[08:32:42] (Merged) jenkins-bot: db-eqiad.php: Depool db1100 [mediawiki-config] - https://gerrit.wikimedia.org/r/525043 (owner: Marostegui)
[08:33:06] (CR) jenkins-bot: db-eqiad.php: Depool db1100 [mediawiki-config] - https://gerrit.wikimedia.org/r/525043 (owner: Marostegui)
[08:33:59] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Depool db1100 for upgrade (duration: 00m 53s)
[08:34:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:34:46] !log Upgrade db1100
[08:34:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:36:35] (CR) Hashar: releases: inline php packages installation (1 comment) [puppet] - https://gerrit.wikimedia.org/r/523147 (https://phabricator.wikimedia.org/T225735) (owner: Hashar)
[08:36:53] (PS10) Hashar: releases: inline php packages installation [puppet] - https://gerrit.wikimedia.org/r/523147 (https://phabricator.wikimedia.org/T225735)
[08:37:17] (CR) Hashar: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/523147 (https://phabricator.wikimedia.org/T225735) (owner: Hashar)
[08:38:10] (CR) Filippo Giunchedi: [C: +1] hiera: deploy varnishkafka exporter to ulsfo [puppet] - https://gerrit.wikimedia.org/r/524930 (https://phabricator.wikimedia.org/T196066) (owner: Cwhite)
[08:38:31] (CR) Filippo Giunchedi: [C: +1] hiera: deploy varnishkafka exporter to esams [puppet] - https://gerrit.wikimedia.org/r/524931 (https://phabricator.wikimedia.org/T196066) (owner: Cwhite)
[08:39:34] (CR) Hashar: [V: +1] "Puppet compiler for production host releases1001.eqiad.wmnet:" [puppet] - https://gerrit.wikimedia.org/r/523147 (https://phabricator.wikimedia.org/T225735) (owner: Hashar)
[08:39:57] (CR) Filippo Giunchedi: [C: +1] hiera: deploy varnishkafka exporter to eqiad [puppet] - https://gerrit.wikimedia.org/r/524933 (https://phabricator.wikimedia.org/T196066) (owner: Cwhite)
[08:40:01] (CR) Filippo Giunchedi: [C: +1] hiera: deploy varnishkafka exporter to codfw [puppet] - https://gerrit.wikimedia.org/r/524932 (https://phabricator.wikimedia.org/T196066) (owner: Cwhite)
[08:43:39] (CR) Muehlenhoff: [C: +2] releases: inline php packages installation [puppet] - https://gerrit.wikimedia.org/r/523147 (https://phabricator.wikimedia.org/T225735) (owner: Hashar)
[08:45:38] (PS6) Muehlenhoff: contint: remove php packages [puppet] - https://gerrit.wikimedia.org/r/523148 (https://phabricator.wikimedia.org/T225735) (owner: Hashar)
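The dbproxy2002 test above (stopping MySQL on db2044 to trigger "CRITICAL check_failover servers up 1 down 1") exercises a check_failover-style HAProxy check. A rough Python sketch of such a check: it parses the CSV that HAProxy's stats endpoint emits (the `svname`/`status` field names follow HAProxy's real stats format, but the wiring, thresholds, and sample data below are illustrative):

```python
"""Sketch of a check_failover-style HAProxy check.

Counts real backend servers that report UP vs DOWN in HAProxy's
CSV stats output and returns a Nagios-style status.
"""
import csv
import io

OK, CRITICAL = 0, 2  # Nagios plugin exit codes


def check_failover(stats_csv):
    up = down = 0
    # HAProxy prefixes the CSV header with "# "; strip it so the
    # column names line up for DictReader.
    reader = csv.DictReader(io.StringIO(stats_csv.lstrip("# ")))
    for row in reader:
        if row["svname"] in ("FRONTEND", "BACKEND"):
            continue  # aggregate rows, not real servers
        if row["status"].startswith("UP"):
            up += 1
        elif row["status"].startswith("DOWN"):
            down += 1
    if down > 0:
        return CRITICAL, f"CRITICAL check_failover servers up {up} down {down}"
    return OK, f"OK check_failover servers up {up} down {down}"


# Illustrative stats snippet: one backend down (as during the db2044 test).
sample = "# pxname,svname,status\nm2,srv1,DOWN\nm2,srv2,UP\nm2,BACKEND,UP\n"
```

With the sample input this reproduces the alert text seen above; once both backends report UP again the check recovers, matching the [08:16:23] RECOVERY line.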
[08:47:35] (CR) Muehlenhoff: [C: +2] contint: remove php packages [puppet] - https://gerrit.wikimedia.org/r/523148 (https://phabricator.wikimedia.org/T225735) (owner: Hashar)
[08:49:17] (PS6) Muehlenhoff: contint: apply apt::unattend_upgrade at role level [puppet] - https://gerrit.wikimedia.org/r/523150 (https://phabricator.wikimedia.org/T225735) (owner: Hashar)
[08:50:22] (CR) Muehlenhoff: [C: +2] contint: apply apt::unattend_upgrade at role level [puppet] - https://gerrit.wikimedia.org/r/523150 (https://phabricator.wikimedia.org/T225735) (owner: Hashar)
[08:51:29] (PS3) Muehlenhoff: contint: remove sqlite3 debian package [puppet] - https://gerrit.wikimedia.org/r/524219 (https://phabricator.wikimedia.org/T225735) (owner: Hashar)
[08:52:54] (CR) Filippo Giunchedi: "> Patch Set 4:" [puppet] - https://gerrit.wikimedia.org/r/524625 (https://phabricator.wikimedia.org/T227364) (owner: EBernhardson)
[08:53:26] (CR) Muehlenhoff: [C: +2] contint: remove sqlite3 debian package [puppet] - https://gerrit.wikimedia.org/r/524219 (https://phabricator.wikimedia.org/T225735) (owner: Hashar)
[08:54:20] (PS4) Muehlenhoff: contint: no more include ::contint::packages::ruby by default [puppet] - https://gerrit.wikimedia.org/r/524224 (https://phabricator.wikimedia.org/T225735) (owner: Hashar)
[08:56:08] Operations, serviceops, Core Platform Team Workboards (Green): Keys from MediaWiki Redis Instances - https://phabricator.wikimedia.org/T228703 (Joe)
[08:57:20] Operations, ops-eqiad, DBA: Upgrade db1100 firmware and BIOS - https://phabricator.wikimedia.org/T228732 (Marostegui)
[08:58:18] (CR) Muehlenhoff: [C: +2] contint: no more include ::contint::packages::ruby by default [puppet] - https://gerrit.wikimedia.org/r/524224 (https://phabricator.wikimedia.org/T225735) (owner: Hashar)
[08:59:20] (PS4) Muehlenhoff: contint: remove contint::php [puppet] - https://gerrit.wikimedia.org/r/524225 (https://phabricator.wikimedia.org/T225735) (owner: Hashar)
[08:59:22] Operations, serviceops, PHP 7.2 support, Performance-Team (Radar), and 2 others: PHP 7 corruption during deployment (was: PHP 7 fatals on mw1262) - https://phabricator.wikimedia.org/T224491 (Joe) >>! In T224491#5354568, @Krinkle wrote: > Logstash query for the error in question: > >
Operations, LDAP-Access-Requests, Release-Engineering-Team: Add more SREs to gerritadmin LDAP group - https://phabricator.wikimedia.org/T228733 (akosiaris)
[08:59:57] (CR) Filippo Giunchedi: [C: -1] "IMHO we should remove the flag altogether once the rollout is complete, it has a different semantic that the existing profile::cache::kafk" [puppet] - https://gerrit.wikimedia.org/r/524934 (https://phabricator.wikimedia.org/T196066) (owner: Cwhite)
[09:00:09] (PS1) Marostegui: db-eqiad.php: Slowly repool db1100 [mediawiki-config] - https://gerrit.wikimedia.org/r/525045
[09:00:14] Operations, DBA: Reboot, upgrade firmware and kernel of db1096-db1106, db2071-db2092 - https://phabricator.wikimedia.org/T216240 (Marostegui)
[09:01:31] (CR) Marostegui: [C: +2] db-eqiad.php: Slowly repool db1100 [mediawiki-config] - https://gerrit.wikimedia.org/r/525045 (owner: Marostegui)
[09:02:15] (CR) Alexandros Kosiaris: [C: +2] Add poolcounter1004 [mediawiki-config] - https://gerrit.wikimedia.org/r/525040 (https://phabricator.wikimedia.org/T224572) (owner: Muehlenhoff)
[09:02:17] (CR) Muehlenhoff: [C: +2] contint: remove contint::php [puppet] - https://gerrit.wikimedia.org/r/524225 (https://phabricator.wikimedia.org/T225735) (owner: Hashar)
[09:02:22] (Merged) jenkins-bot: db-eqiad.php: Slowly repool db1100 [mediawiki-config] - https://gerrit.wikimedia.org/r/525045 (owner: Marostegui)
[09:02:29] (Merged) jenkins-bot: Add poolcounter1004 [mediawiki-config] - https://gerrit.wikimedia.org/r/525040 (https://phabricator.wikimedia.org/T224572) (owner: Muehlenhoff)
[09:02:40] (CR) jenkins-bot: db-eqiad.php: Slowly repool db1100 [mediawiki-config] - https://gerrit.wikimedia.org/r/525045 (owner: Marostegui)
[09:02:43] Operations, LDAP-Access-Requests, Release-Engineering-Team: Add more SREs to gerritadmin LDAP group - https://phabricator.wikimedia.org/T228733 (akosiaris) p: Triage→Normal
[09:03:28] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Slowly repool db1100 after upgrade (duration: 00m 46s)
[09:03:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:04:35] (PS1) Marostegui: db-eqiad.php: Repool db1100 into API [mediawiki-config] - https://gerrit.wikimedia.org/r/525047
[09:04:44] (CR) jenkins-bot: Add poolcounter1004 [mediawiki-config] - https://gerrit.wikimedia.org/r/525040 (https://phabricator.wikimedia.org/T224572) (owner: Muehlenhoff)
[09:09:04] !log akosiaris@deploy1001 Synchronized wmf-config/ProductionServices.php: (no justification provided) (duration: 00m 47s)
[09:09:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:09:11] (Abandoned) Volans: Netbox: set media directory [puppet] - https://gerrit.wikimedia.org/r/463820 (https://phabricator.wikimedia.org/T190184) (owner: Volans)
[09:09:28] Operations, LDAP-Access-Requests, Release-Engineering-Team: Add more SREs to gerritadmin LDAP group - https://phabricator.wikimedia.org/T228733 (hashar) The change has been done after T218761 (private). The Gerrit Administrators group is now tied to the `gerritadmin` LDAP group ( https://gerrit.wiki...
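The db1100 maintenance above follows a staged-repool pattern: depool, upgrade, "slowly repool", then repool into API, syncing wmf-config/db-eqiad.php between steps. A hypothetical sketch of generating such a ramp; the step fractions and the weight number are invented for illustration (the real weights live in db-eqiad.php and are not shown in this log):

```python
"""Sketch of a staged repool: restore a database server's traffic
weight in increasing steps rather than all at once.

The host name, full_weight value, and step fractions are illustrative.
"""


def repool_steps(host, full_weight, steps=(0.1, 0.5, 1.0)):
    """Yield (description, weight) pairs for a gradual repool."""
    for fraction in steps:
        weight = int(full_weight * fraction)
        yield (f"Repool {host} at {int(fraction * 100)}%", weight)


plan = list(repool_steps("db1100", 200))
# Each step would be committed and synced to the app servers
# (e.g. via scap sync-file, as in the !log lines above) and the
# host watched for lag/errors before moving to the next step.
```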
[09:10:16] (CR) Filippo Giunchedi: [C: +1] "LGTM, indeed bsd-mailx will be pulled in by icinga" [puppet] - https://gerrit.wikimedia.org/r/494464 (owner: Volans)
[09:10:46] (CR) Marostegui: [C: +2] db-eqiad.php: Repool db1100 into API [mediawiki-config] - https://gerrit.wikimedia.org/r/525047 (owner: Marostegui)
[09:11:49] Operations, serviceops: Migrate pool counters to Stretch/Buster - https://phabricator.wikimedia.org/T224572 (akosiaris) poolcounter1004 has just been added
[09:12:32] (Merged) jenkins-bot: db-eqiad.php: Repool db1100 into API [mediawiki-config] - https://gerrit.wikimedia.org/r/525047 (owner: Marostegui)
[09:12:47] (CR) jenkins-bot: db-eqiad.php: Repool db1100 into API [mediawiki-config] - https://gerrit.wikimedia.org/r/525047 (owner: Marostegui)
[09:12:57] Operations, ops-eqiad, DC-Ops: a2-eqiad pdu refresh - https://phabricator.wikimedia.org/T227138 (fgiunchedi) For ms-be same as {T227140}
[09:13:06] (PS4) Volans: icinga: set Reply-To header to email notifications [puppet] - https://gerrit.wikimedia.org/r/494464
[09:13:08] Operations, LDAP-Access-Requests, Release-Engineering-Team: Add more SREs to gerritadmin LDAP group - https://phabricator.wikimedia.org/T228733 (Peachey88)
[09:13:24] Operations, LDAP-Access-Requests, Release-Engineering-Team: Add more SREs to gerritadmin LDAP group - https://phabricator.wikimedia.org/T228733 (Peachey88)
[09:13:36] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Slowly repool into API db1100 after upgrade (duration: 00m 47s)
[09:13:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:14:46] Operations, ops-eqiad, DC-Ops: a3-eqiad pdu refresh - https://phabricator.wikimedia.org/T227139 (fgiunchedi)
[09:16:51] Operations, ops-eqiad, DC-Ops: a3-eqiad pdu refresh - https://phabricator.wikimedia.org/T227139 (Marostegui)
[09:17:55] Operations, ops-eqiad, DC-Ops: b1-eqiad pdu refresh - https://phabricator.wikimedia.org/T227536 (Marostegui)
[09:18:06] (PS3) Alexandros Kosiaris: k8s: introducing termbox-test.staging.svc.eqiad.wmnet [dns] - https://gerrit.wikimedia.org/r/524797 (https://phabricator.wikimedia.org/T226814) (owner: Fsero)
[09:18:21] (CR) jerkins-bot: [V: -1] k8s: introducing termbox-test.staging.svc.eqiad.wmnet [dns] - https://gerrit.wikimedia.org/r/524797 (https://phabricator.wikimedia.org/T226814) (owner: Fsero)
[09:18:52] (PS1) Hashar: gerrit: remove a no more existing group [puppet] - https://gerrit.wikimedia.org/r/525048
[09:19:50] (CR) Hashar: "The repository.ownerGroup is looked up by name, however groups can be renamed and a new one could use a previously used name." [puppet] - https://gerrit.wikimedia.org/r/525048 (owner: Hashar)
[09:20:27] (PS1) Marostegui: db-eqiad.php: Fully repool db1100 into API [mediawiki-config] - https://gerrit.wikimedia.org/r/525049
[09:20:38] (CR) Volans: [C: +2] icinga: set Reply-To header to email notifications [puppet] - https://gerrit.wikimedia.org/r/494464 (owner: Volans)
[09:21:41] (CR) Marostegui: [C: +2] db-eqiad.php: Fully repool db1100 into API [mediawiki-config] - https://gerrit.wikimedia.org/r/525049 (owner: Marostegui)
[09:22:29] Operations, ops-eqiad, DC-Ops: a3-eqiad pdu refresh - https://phabricator.wikimedia.org/T227139 (fgiunchedi) restbase / logstash / graphite / prometheus hosts should be fine in an event of power loss, if feeling nice restbase and prometheus should be depooled. for the logstash host we could disable e...
[09:22:33] (03Merged) 10jenkins-bot: db-eqiad.php: Fully repool db1100 into API [mediawiki-config] - 10https://gerrit.wikimedia.org/r/525049 (owner: 10Marostegui) [09:22:48] (03CR) 10jenkins-bot: db-eqiad.php: Fully repool db1100 into API [mediawiki-config] - 10https://gerrit.wikimedia.org/r/525049 (owner: 10Marostegui) [09:23:37] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Fully repool into API db1100 after upgrade (duration: 00m 46s) [09:23:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:24:56] 10Operations, 10ops-eqiad, 10DC-Ops: a2-eqiad pdu refresh - https://phabricator.wikimedia.org/T227138 (10fgiunchedi) [09:25:14] (03Restored) 10Fsero: Revert "Add termbox-test release" [deployment-charts] - 10https://gerrit.wikimedia.org/r/525033 (owner: 10Fsero) [09:25:47] (03PS2) 10Fsero: Revert "Add termbox-test release" [deployment-charts] - 10https://gerrit.wikimedia.org/r/525033 [09:27:53] 10Operations, 10ops-eqiad, 10DC-Ops: a3-eqiad pdu refresh - https://phabricator.wikimedia.org/T227139 (10fgiunchedi) [09:28:10] (03CR) 10Fsero: [V: 03+2 C: 03+2] "this needs more changes and was merged prematurely" [deployment-charts] - 10https://gerrit.wikimedia.org/r/525033 (owner: 10Fsero) [09:28:33] 10Operations, 10ops-eqiad, 10DC-Ops: a5-eqiad pdu refresh - https://phabricator.wikimedia.org/T227141 (10fgiunchedi) [09:29:15] 10Operations, 10ops-eqiad, 10DC-Ops: a6-eqiad pdu refresh - https://phabricator.wikimedia.org/T227142 (10fgiunchedi) [09:30:05] 10Operations, 10ops-eqiad, 10DC-Ops: a7-eqiad pdu refresh - https://phabricator.wikimedia.org/T227143 (10fgiunchedi) [09:30:54] 10Operations, 10Wikimedia-Mailing-lists: New Mailing lists for AzWiki sysops - https://phabricator.wikimedia.org/T228542 (10Eldarado) p:05Triage→03Normal [09:33:04] 10Operations, 10ops-eqiad, 10DC-Ops: b1-eqiad pdu refresh - https://phabricator.wikimedia.org/T227536 (10fgiunchedi) [09:33:54] 10Operations, 10ops-eqiad, 10DC-Ops: b2-eqiad pdu refresh - 
https://phabricator.wikimedia.org/T227538 (10fgiunchedi) [09:40:18] hashar: thanks a lot for the projectviews task! [09:40:46] (03PS1) 10Elukey: Set async replication for mcrouter on mw api/appserv canaries [puppet] - 10https://gerrit.wikimedia.org/r/525053 (https://phabricator.wikimedia.org/T225642) [09:43:33] (03PS1) 10Fsero: Add termbox-test release [deployment-charts] - 10https://gerrit.wikimedia.org/r/525054 (https://phabricator.wikimedia.org/T226814) [09:43:51] (03PS2) 10Fsero: Add termbox-test release [deployment-charts] - 10https://gerrit.wikimedia.org/r/525054 (https://phabricator.wikimedia.org/T226814) [09:44:23] (03PS3) 10Fsero: Add termbox-test release [deployment-charts] - 10https://gerrit.wikimedia.org/r/525054 (https://phabricator.wikimedia.org/T226814) [09:45:20] (03CR) 10Elukey: "https://puppet-compiler.wmflabs.org/compiler1001/17575/" [puppet] - 10https://gerrit.wikimedia.org/r/525053 (https://phabricator.wikimedia.org/T225642) (owner: 10Elukey) [09:46:16] (03PS1) 10Muehlenhoff: Enable poolcounter1005, disable poolcounter1001 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/525055 (https://phabricator.wikimedia.org/T224572) [09:46:28] (03PS4) 10Alexandros Kosiaris: k8s: introducing termbox-test.staging.svc.eqiad.wmnet [dns] - 10https://gerrit.wikimedia.org/r/524797 (https://phabricator.wikimedia.org/T226814) (owner: 10Fsero) [09:47:15] elukey: you are welcome. 
Though I have absolutely no idea about what is going on :-\ [09:47:41] (03CR) 10Alexandros Kosiaris: [C: 03+2] Enable poolcounter1005, disable poolcounter1001 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/525055 (https://phabricator.wikimedia.org/T224572) (owner: 10Muehlenhoff) [09:47:54] (03PS4) 10Fsero: Add termbox-test release [deployment-charts] - 10https://gerrit.wikimedia.org/r/525054 (https://phabricator.wikimedia.org/T226814) [09:48:34] (03Merged) 10jenkins-bot: Enable poolcounter1005, disable poolcounter1001 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/525055 (https://phabricator.wikimedia.org/T224572) (owner: 10Muehlenhoff) [09:48:55] (03CR) 10jenkins-bot: Enable poolcounter1005, disable poolcounter1001 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/525055 (https://phabricator.wikimedia.org/T224572) (owner: 10Muehlenhoff) [09:51:03] !log akosiaris@deploy1001 Synchronized wmf-config/ProductionServices.php: (no justification provided) (duration: 00m 47s) [09:51:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:51:38] !log enable poolcounter1005, disable poolcounter1001 T224572 [09:51:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:51:45] T224572: Migrate pool counters to Stretch/Buster - https://phabricator.wikimedia.org/T224572 [09:53:57] !log Drop abuse_filter_log.afl_log_id from s6 codfw with replication (this will cause lag in s6 codfw) - T226851 [09:54:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:54:04] T226851: Drop abuse_filter_log.afl_log_id in production - https://phabricator.wikimedia.org/T226851 [09:54:10] hashar: should be fixed now [09:54:53] ah no not yet, weird [09:55:18] (was looking at cached content, it is indeed fixed) [09:55:40] (03PS5) 10Fsero: Add termbox-test release [deployment-charts] - 10https://gerrit.wikimedia.org/r/525054 (https://phabricator.wikimedia.org/T226814) [09:55:58] (03CR) 10Fsero: [V: 03+2 
C: 03+2] Add termbox-test release [deployment-charts] - 10https://gerrit.wikimedia.org/r/525054 (https://phabricator.wikimedia.org/T226814) (owner: 10Fsero) [09:56:42] elukey: seems good yes. At least for /other/pageviews/ but some others might have been affected :-] [09:57:26] !log jmm@cumin2001 START - Cookbook sre.hosts.downtime [09:57:27] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [09:57:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:57:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:57:48] hashar: the other one that we changed was pageviews, already checked and we should be ok [09:58:22] elukey: cool, feel free to flag https://phabricator.wikimedia.org/T228731 as resolved, unless there is a followup action to monitor those jobs (though that can be made another standalone task) [09:59:04] !log fsero@ helmfile [STAGING] Ran 'apply' command on namespace 'termbox' for release 'test' . [09:59:04] !log fsero@ helmfile [STAGING] Ran 'apply' command on namespace 'termbox' for release 'staging' . 
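A side note on the bot mechanics visible throughout this log: whenever a !log message mentions a bare Phabricator task ID, the channel bot follows up with a "Txxxxxx: <title> - <url>" line (see the T224572 and T226851 echoes above). A minimal sketch of the ID-extraction step that behaviour implies, with hypothetical function names and no claim about the real bot's code:

```python
import re

# Hedged illustration: pull Phabricator task IDs (a "T" followed by digits)
# out of a "!log" message, as the SAL bot above appears to do before echoing
# back "Txxxxxx: <title> - <url>" lines. Not the actual bot implementation.
TASK_RE = re.compile(r"\bT\d+\b")

def extract_task_ids(log_message):
    """Return the unique task IDs mentioned in a message, in order."""
    seen = []
    for task in TASK_RE.findall(log_message):
        if task not in seen:
            seen.append(task)
    return seen

def task_url(task_id):
    """Build the canonical Phabricator URL for a task ID."""
    return "https://phabricator.wikimedia.org/" + task_id
```

The real bot additionally resolves each ID to its task title via the Phabricator API, which is omitted here.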
[09:59:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:59:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:00:40] PROBLEM - Mediawiki Cirrussearch update rate - eqiad on icinga1001 is CRITICAL: CRITICAL: 20.00% of data under the critical threshold [50.0] https://wikitech.wikimedia.org/wiki/Search%23No_updates https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?panelId=44&fullscreen&orgId=1 [10:01:03] (03CR) 10Fsero: [C: 03+2] k8s: introducing termbox-test.staging.svc.eqiad.wmnet [dns] - 10https://gerrit.wikimedia.org/r/524797 (https://phabricator.wikimedia.org/T226814) (owner: 10Fsero) [10:01:14] PROBLEM - Mediawiki Cirrussearch update rate - codfw on icinga1001 is CRITICAL: CRITICAL: 20.00% of data under the critical threshold [50.0] https://wikitech.wikimedia.org/wiki/Search%23No_updates https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?panelId=44&fullscreen&orgId=1 [10:01:22] (03CR) 10Fsero: [C: 03+2] "ty for the changes :)" [dns] - 10https://gerrit.wikimedia.org/r/524797 (https://phabricator.wikimedia.org/T226814) (owner: 10Fsero) [10:02:46] !log installing Java security updates on notebook/stat hosts [10:02:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:06:57] 10Operations, 10Wikidata, 10Wikidata-Termbox-Hike, 10serviceops, and 4 others: New Service Request: Wikidata Termbox SSR - https://phabricator.wikimedia.org/T212189 (10fsero) [10:09:18] (03PS1) 10Muehlenhoff: kibana: Read LDAP servers from standard Hiera and switch to read-only replicas [puppet] - 10https://gerrit.wikimedia.org/r/525057 (https://phabricator.wikimedia.org/T227650) [10:10:14] (03CR) 10jerkins-bot: [V: 04-1] kibana: Read LDAP servers from standard Hiera and switch to read-only replicas [puppet] - 10https://gerrit.wikimedia.org/r/525057 (https://phabricator.wikimedia.org/T227650) (owner: 10Muehlenhoff) [10:15:11] (03PS2) 10Muehlenhoff: kibana: Read LDAP servers 
from standard Hiera and switch to read-only replicas [puppet] - 10https://gerrit.wikimedia.org/r/525057 (https://phabricator.wikimedia.org/T227650) [10:16:06] (03CR) 10jerkins-bot: [V: 04-1] kibana: Read LDAP servers from standard Hiera and switch to read-only replicas [puppet] - 10https://gerrit.wikimedia.org/r/525057 (https://phabricator.wikimedia.org/T227650) (owner: 10Muehlenhoff) [10:16:12] jouncebot: next [10:16:12] In 0 hour(s) and 43 minute(s): European Mid-day SWAT(Max 6 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190723T1100) [10:17:21] !log Drop abuse_filter_log.afl_log_id from db1096:3316, db1139:3316 and dbstore1005:3316 T226851 [10:17:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:17:30] T226851: Drop abuse_filter_log.afl_log_id in production - https://phabricator.wikimedia.org/T226851 [10:18:07] (03PS5) 10Jbond: lookup checks: add checks to warn against using hiera and advice lookup [puppet-lint/wmf_styleguide-check] - 10https://gerrit.wikimedia.org/r/522526 (https://phabricator.wikimedia.org/T220820) [10:20:17] (03CR) 10Jbond: "thanks see responses inline" (033 comments) [puppet-lint/wmf_styleguide-check] - 10https://gerrit.wikimedia.org/r/522526 (https://phabricator.wikimedia.org/T220820) (owner: 10Jbond) [10:24:22] (03CR) 10Volans: [C: 03+1] "LGTM" [puppet-lint/wmf_styleguide-check] - 10https://gerrit.wikimedia.org/r/522526 (https://phabricator.wikimedia.org/T220820) (owner: 10Jbond) [10:24:49] 10Operations, 10ops-eqiad, 10DC-Ops: a1-eqiad pdu refresh - https://phabricator.wikimedia.org/T226782 (10Marostegui) [10:24:54] RECOVERY - Mediawiki Cirrussearch update rate - codfw on icinga1001 is OK: OK: Less than 1.00% under the threshold [80.0] https://wikitech.wikimedia.org/wiki/Search%23No_updates https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?panelId=44&fullscreen&orgId=1 [10:25:48] RECOVERY - Mediawiki Cirrussearch update rate - eqiad on icinga1001 is OK: OK: Less 
than 1.00% under the threshold [80.0] https://wikitech.wikimedia.org/wiki/Search%23No_updates https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?panelId=44&fullscreen&orgId=1 [10:27:40] I'll start cutting the branch momentarily for this week's train [10:42:22] RECOVERY - Check systemd state on puppetmaster1003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:44:43] (03CR) 10Volans: [C: 03+1] "The change looks formally ok to me. I can't speak for the values in the RWStore.properties though or the effects of loading a different co" [puppet] - 10https://gerrit.wikimedia.org/r/524954 (https://phabricator.wikimedia.org/T228122) (owner: 10Smalyshev) [10:46:04] (03CR) 10Volans: [C: 03+1] "LGTM" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/524825 (owner: 10CDanis) [10:55:34] (03PS1) 10Tarrow: Enable termbox on testwikidatawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/525062 (https://phabricator.wikimedia.org/T227459) [10:56:40] (03CR) 10Tarrow: "This change is ready for review." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/525062 (https://phabricator.wikimedia.org/T227459) (owner: 10Tarrow) [10:56:44] (03CR) 10jerkins-bot: [V: 04-1] Enable termbox on testwikidatawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/525062 (https://phabricator.wikimedia.org/T227459) (owner: 10Tarrow) [10:58:19] (03PS2) 10Tarrow: Enable termbox on testwikidatawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/525062 (https://phabricator.wikimedia.org/T227459) [11:00:04] Amir1, Lucas_WMDE, awight, and Urbanecm: I, the Bot under the Fountain, allow thee, The Deployer, to do European Mid-day SWAT(Max 6 patches) deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190723T1100). [11:00:04] No GERRIT patches in the queue for this window AFAICS. 
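The Cirrussearch update-rate alerts above fire with messages like "CRITICAL: 20.00% of data under the critical threshold [50.0]" and recover with "Less than 1.00% under the threshold [80.0]". A rough Python sketch of that fraction-below-threshold logic, illustrative only and not the actual check code (which also distinguishes warning and critical thresholds):

```python
# Hedged sketch: given a window of datapoints, alert when too large a
# fraction of them falls below a threshold, in the spirit of the
# "% of data under the critical threshold" messages above.

def fraction_below(datapoints, threshold):
    """Fraction of datapoints strictly below the threshold."""
    if not datapoints:
        return 0.0
    return sum(1 for v in datapoints if v < threshold) / len(datapoints)

def check_update_rate(datapoints, critical_threshold=50.0, critical_fraction=0.20):
    """Return (status, message) shaped like the alert text above."""
    frac = fraction_below(datapoints, critical_threshold)
    if frac >= critical_fraction:
        return ("CRITICAL",
                f"{frac:.2%} of data under the critical threshold [{critical_threshold}]")
    return ("OK",
            f"Less than {critical_fraction:.2%} under the threshold [{critical_threshold}]")
```

With five datapoints of which one sits below 50, the fraction is exactly 20%, which is why the alerts above trip at "20.00%".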
[11:04:13] I just stuck in a patch for SWAT: https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/525062 anyone object to me doing it now? [11:06:44] tarrow: go on! let me know if you have any questions [11:07:13] Amir1: Awesome! I shall :) [11:08:54] !log enable puppet on jobrunners [11:09:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:12:55] (03CR) 10Jakob: [C: 03+1] Enable termbox on testwikidatawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/525062 (https://phabricator.wikimedia.org/T227459) (owner: 10Tarrow) [11:13:32] (03CR) 10Tarrow: [C: 03+2] Enable termbox on testwikidatawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/525062 (https://phabricator.wikimedia.org/T227459) (owner: 10Tarrow) [11:14:34] (03Merged) 10jenkins-bot: Enable termbox on testwikidatawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/525062 (https://phabricator.wikimedia.org/T227459) (owner: 10Tarrow) [11:16:24] (03CR) 10jenkins-bot: Enable termbox on testwikidatawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/525062 (https://phabricator.wikimedia.org/T227459) (owner: 10Tarrow) [11:17:22] (03PS1) 10Hashar: contint: remove arcanist [puppet] - 10https://gerrit.wikimedia.org/r/525063 (https://phabricator.wikimedia.org/T225735) [11:17:58] (03CR) 10Hashar: "I don't think we still use arcanist anywhere on CI do we?" 
[puppet] - 10https://gerrit.wikimedia.org/r/525063 (https://phabricator.wikimedia.org/T225735) (owner: 10Hashar) [11:24:21] Live and working on debug; sync-fileing now [11:25:05] !log tarrow@deploy1001 Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:525062|T214902 Enable termbox on testwikidatawiki]] (duration: 01m 37s) [11:25:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:25:12] T214902: Show mobile termbox on Wikidata test wiki - https://phabricator.wikimedia.org/T214902 [11:27:23] (03PS2) 10Arturo Borrero Gonzalez: toolforge: k8s: add nginx-ingress configuration. [puppet] - 10https://gerrit.wikimedia.org/r/524759 (https://phabricator.wikimedia.org/T228500) [11:29:05] Looks like it's client side rendering fine but I see "Wikibase\View\Termbox\Renderer\TermboxRemoteRenderer: encountered a bad response from the remote renderer" in logstash [11:33:01] (03PS1) 10Tarrow: Fix missing /termbox in SSRTermboxServerUrl [mediawiki-config] - 10https://gerrit.wikimedia.org/r/525065 [11:33:35] Amir1: looks like I messed up a bit. Am I good to just merge and deploy the fix or do I need to revert first? [11:33:57] since it's test, I think it's fine [11:34:02] (03PS3) 10Arturo Borrero Gonzalez: toolforge: k8s: add nginx-ingress configuration. [puppet] - 10https://gerrit.wikimedia.org/r/524759 (https://phabricator.wikimedia.org/T228500) [11:34:03] unless it's exploding logs [11:34:19] It's not crazy, just a little [11:34:30] (03CR) 10Tarrow: "This change is ready for review." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/525065 (owner: 10Tarrow) [11:34:35] so I think we can deploy the fix then [11:34:55] great [11:35:40] (03CR) 10Jakob: [C: 03+1] Fix missing /termbox in SSRTermboxServerUrl [mediawiki-config] - 10https://gerrit.wikimedia.org/r/525065 (owner: 10Tarrow) [11:37:33] (03CR) 10Tarrow: [C: 03+2] Fix missing /termbox in SSRTermboxServerUrl [mediawiki-config] - 10https://gerrit.wikimedia.org/r/525065 (owner: 10Tarrow) [11:38:32] (03Merged) 10jenkins-bot: Fix missing /termbox in SSRTermboxServerUrl [mediawiki-config] - 10https://gerrit.wikimedia.org/r/525065 (owner: 10Tarrow) [11:38:47] (03CR) 10jenkins-bot: Fix missing /termbox in SSRTermboxServerUrl [mediawiki-config] - 10https://gerrit.wikimedia.org/r/525065 (owner: 10Tarrow) [11:39:03] (03PS4) 10Arturo Borrero Gonzalez: toolforge: k8s: add nginx-ingress configuration. [puppet] - 10https://gerrit.wikimedia.org/r/524759 (https://phabricator.wikimedia.org/T228500) [11:40:49] (03PS1) 10Muehlenhoff: Disable poolcounter1003, also switch over pool counters in codfw [mediawiki-config] - 10https://gerrit.wikimedia.org/r/525066 (https://phabricator.wikimedia.org/T224572) [11:42:20] PROBLEM - PHP opcache health on mwdebug1002 is CRITICAL: CRITICAL: opcache free space is below 50 MB https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [11:43:18] !log restart php-fpm on mwdebug* [11:43:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:45:08] (03PS1) 10Lars Wirzenius: Group0 to 1.34.0-wmf.15 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/525068 [11:46:42] hey [11:46:46] :) [11:47:08] I guess we're currently conflicting because I just got "Failed to acquire lock "/var/lock/scap.operations_mediawiki-config.lock"; owner is "liw"; reason is "Pruned MediaWiki: 1.34.0-wmf.10"" [11:47:55] Amir1: I guess I just hold fire? 
[11:48:32] (03PS1) 10Muehlenhoff: kibana: Switch to read-only LDAP replicas [puppet] - 10https://gerrit.wikimedia.org/r/525069 (https://phabricator.wikimedia.org/T227650) [11:48:37] tarrow, eek. I'm in the middle of cutting this week's train branch. [11:48:52] liw: eek! [11:49:16] sorry about that [11:50:07] tarrow, well, hopefully everything works. what's the worst that could happen? I take down all wikis and set fire to all data centres? [11:50:08] liw: So I was mid way through SWAT-ing a patch. It's on mwdebug1002 but I didn't run scap sync-file yet [11:50:27] :P [11:51:00] cool, I won't do any more typing until I know what to do :) [11:51:13] FYI it was https://gerrit.wikimedia.org/r/525065 [11:52:09] tarrow, I'm new to this, so I have little clue to what I'm doing - hopefully I haven't broken anything [11:52:21] liw: hehehe! me too :) [11:52:32] (03PS2) 10Alexandros Kosiaris: Disable poolcounter1003, also switch over pool counters in codfw [mediawiki-config] - 10https://gerrit.wikimedia.org/r/525066 (https://phabricator.wikimedia.org/T224572) (owner: 10Muehlenhoff) [11:52:36] (03CR) 10Alexandros Kosiaris: [C: 03+2] Disable poolcounter1003, also switch over pool counters in codfw [mediawiki-config] - 10https://gerrit.wikimedia.org/r/525066 (https://phabricator.wikimedia.org/T224572) (owner: 10Muehlenhoff) [11:52:38] currently running scap clean to delete an old branch [11:53:01] and daydreaming of the day when all of this is fully automated ;) [11:53:06] (03PS5) 10Arturo Borrero Gonzalez: toolforge: k8s: add nginx-ingress configuration. 
[puppet] - 10https://gerrit.wikimedia.org/r/524759 (https://phabricator.wikimedia.org/T228500) [11:53:10] :D [11:53:31] (03Merged) 10jenkins-bot: Disable poolcounter1003, also switch over pool counters in codfw [mediawiki-config] - 10https://gerrit.wikimedia.org/r/525066 (https://phabricator.wikimedia.org/T224572) (owner: 10Muehlenhoff) [11:53:46] (03CR) 10jenkins-bot: Disable poolcounter1003, also switch over pool counters in codfw [mediawiki-config] - 10https://gerrit.wikimedia.org/r/525066 (https://phabricator.wikimedia.org/T224572) (owner: 10Muehlenhoff) [11:54:16] !log liw@deploy1001 Pruned MediaWiki: 1.34.0-wmf.10 (duration: 07m 55s) [11:54:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:54:48] liw: mind if I finish my deployment now? [11:54:56] Or is there more for you to do? [11:55:30] tarrow, go ahead [11:55:35] cool! [11:56:24] tarrow, tell me when you're done, please [11:56:28] sure! [11:56:37] !log tarrow@deploy1001 Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:525065|T214902 Fix missing /termbox in SSRTermboxServerUrl]] (duration: 00m 44s) [11:56:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:56:44] T214902: Show mobile termbox on Wikidata test wiki - https://phabricator.wikimedia.org/T214902 [11:57:11] (03PS6) 10Arturo Borrero Gonzalez: toolforge: k8s: add nginx-ingress configuration. 
[puppet] - 10https://gerrit.wikimedia.org/r/524759 (https://phabricator.wikimedia.org/T228500) [11:58:41] !log akosiaris@deploy1001 Synchronized wmf-config/ProductionServices.php: (no justification provided) (duration: 00m 46s) [11:58:42] !log EU SWAT finished [11:58:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:58:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:58:57] liw: all done :) [11:59:31] !log disable poolcounter1003, switchover codfw poolcounters T224572 [11:59:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:59:38] T224572: Migrate pool counters to Stretch/Buster - https://phabricator.wikimedia.org/T224572 [12:00:04] Deploy window Pre MediaWiki train sanity break (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190723T1200) [12:00:48] tarrow, thanks [12:01:07] !log empty ganeti1007 from running instances. T227139 [12:01:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:01:14] T227139: a3-eqiad pdu refresh - https://phabricator.wikimedia.org/T227139 [12:02:11] !log drain kubernetes1001. T227139 [12:02:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:04:05] !log liw@deploy1001 Started scap: testwiki to php-1.34.0-wmf.15 and rebuild l10n cache [12:04:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:04:56] 10Operations, 10ops-eqiad, 10DC-Ops: a3-eqiad pdu refresh - https://phabricator.wikimedia.org/T227139 (10RobH) FYI: I pinged both Alex and Filippo to drain the respective servers they mention above in anticipation of swapping the PDUs in this rack at 10:00 Eastern time. A3 was originally a DB rack, and has... 
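"drain kubernetes1001" above means two ordered steps: cordon the node so the scheduler places nothing new on it, then evict its pods so they are recreated elsewhere before the PDU swap. A toy model of that ordering only; real drains go through the Kubernetes eviction API and honour PodDisruptionBudgets, and the `Node`/`drain` names here are illustrative:

```python
# Hedged sketch of drain semantics: cordon first, evict second, so no new
# work lands on the node while its existing pods are being moved away.

class Node:
    def __init__(self, name, pods):
        self.name = name
        self.pods = list(pods)
        self.schedulable = True

def drain(node, evict):
    """Cordon the node, then evict each pod via the provided callback."""
    node.schedulable = False          # cordon: no new pods scheduled here
    for pod in list(node.pods):
        evict(pod)                    # real code: eviction API, honours PDBs
        node.pods.remove(pod)
    return node
```

The same cordon-then-evacuate ordering applies to emptying a Ganeti node of its VMs, as logged for ganeti1007 above.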
[12:05:58] 10Operations, 10ops-eqiad, 10DC-Ops: a3-eqiad pdu refresh - https://phabricator.wikimedia.org/T227139 (10RobH) [12:07:24] (03PS1) 10Arturo Borrero Gonzalez: secrets: toolforge: add default k8s nginx-ingress key pair [labs/private] - 10https://gerrit.wikimedia.org/r/525074 (https://phabricator.wikimedia.org/T228500) [12:08:23] 10Operations, 10ops-eqiad, 10DC-Ops: a3-eqiad pdu refresh - https://phabricator.wikimedia.org/T227139 (10RobH) p:05Triage→03High [12:09:16] (03CR) 10Arturo Borrero Gonzalez: [V: 03+2 C: 03+2] secrets: toolforge: add default k8s nginx-ingress key pair [labs/private] - 10https://gerrit.wikimedia.org/r/525074 (https://phabricator.wikimedia.org/T228500) (owner: 10Arturo Borrero Gonzalez) [12:10:45] (03PS1) 10Muehlenhoff: Add net-tools to standard packages [puppet] - 10https://gerrit.wikimedia.org/r/525075 [12:11:52] 10Operations, 10Gerrit, 10LDAP-Access-Requests, 10Release-Engineering-Team-TODO, 10Release-Engineering-Team (Development services): Add more SREs to gerritadmin LDAP group - https://phabricator.wikimedia.org/T228733 (10greg) [12:13:43] (03PS1) 10Alexandros Kosiaris: Redeploy on stream-config changes [deployment-charts] - 10https://gerrit.wikimedia.org/r/525076 (https://phabricator.wikimedia.org/T228700) [12:17:29] (03CR) 10Alexandros Kosiaris: [C: 03+1] "Ouch. I know I 'd be annoyed if ifconfig or netstat wasn't around on boxes while under the pressure of debugging something, so many +1s." 
[puppet] - 10https://gerrit.wikimedia.org/r/525075 (owner: 10Muehlenhoff) [12:25:15] 10Operations, 10serviceops, 10Core Platform Team Workboards (Green): Keys from MediaWiki Redis Instances - https://phabricator.wikimedia.org/T228703 (10WDoranWMF) p:05Triage→03High [12:29:04] PROBLEM - High CPU load on API appserver on mw1341 is CRITICAL: CRITICAL - load average: 75.26, 34.31, 22.76 https://wikitech.wikimedia.org/wiki/Application_servers [12:29:48] PROBLEM - High CPU load on API appserver on mw1233 is CRITICAL: CRITICAL - load average: 53.32, 23.31, 14.90 https://wikitech.wikimedia.org/wiki/Application_servers [12:30:20] PROBLEM - High CPU load on API appserver on mw1234 is CRITICAL: CRITICAL - load average: 50.54, 23.43, 14.46 https://wikitech.wikimedia.org/wiki/Application_servers [12:30:42] RECOVERY - High CPU load on API appserver on mw1341 is OK: OK - load average: 28.39, 29.28, 22.12 https://wikitech.wikimedia.org/wiki/Application_servers [12:31:36] PROBLEM - High CPU load on API appserver on mw1289 is CRITICAL: CRITICAL - load average: 64.69, 32.86, 20.82 https://wikitech.wikimedia.org/wiki/Application_servers [12:32:00] RECOVERY - High CPU load on API appserver on mw1234 is OK: OK - load average: 19.16, 20.16, 14.20 https://wikitech.wikimedia.org/wiki/Application_servers [12:32:54] 10Operations, 10Wikimedia-Mailing-lists: New Mailing lists for AzWiki sysops - https://phabricator.wikimedia.org/T228542 (10MarcoAurelio) There's a bit of discussion on Meta about whether this should be approved or not based on the results of an RfC. May I advise ops to wait before the status is clarified? Tha... 
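The "Redeploy on stream-config changes" change above points at a common Helm pitfall: editing a ConfigMap alone does not alter the Deployment's pod template, so Kubernetes sees nothing to roll out and running pods keep the stale config. The usual fix is to stamp a checksum of the rendered config into the pod template's annotations, so any config edit changes the template and triggers a rollout. A sketch of that pattern with illustrative names, not the actual termbox chart:

```python
import hashlib
import json

# Hedged sketch of the Helm "checksum/config" annotation trick: a config
# change produces a new checksum, hence a new pod template, hence a rollout.

def config_checksum(config):
    """Stable SHA-256 over a canonical rendering of the config."""
    rendered = json.dumps(config, sort_keys=True)
    return hashlib.sha256(rendered.encode()).hexdigest()

def pod_template(config):
    """Build a pod template whose annotations embed the config checksum."""
    return {
        "metadata": {
            "annotations": {
                # editing config => new checksum => changed template => rollout
                "checksum/config": config_checksum(config),
            }
        },
        "spec": {"containers": [{"name": "termbox", "image": "termbox:latest"}]},
    }
```

In a real chart this is typically written as a template expression like `checksum/config: {{ include ... | sha256sum }}` inside the Deployment's pod annotations.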
[12:32:54] PROBLEM - High CPU load on API appserver on mw1232 is CRITICAL: CRITICAL - load average: 49.12, 25.74, 15.39 https://wikitech.wikimedia.org/wiki/Application_servers [12:33:06] RECOVERY - High CPU load on API appserver on mw1233 is OK: OK - load average: 14.16, 21.12, 16.03 https://wikitech.wikimedia.org/wiki/Application_servers [12:33:18] RECOVERY - High CPU load on API appserver on mw1289 is OK: OK - load average: 22.90, 27.43, 20.11 https://wikitech.wikimedia.org/wiki/Application_servers [12:33:51] !log liw@deploy1001 Finished scap: testwiki to php-1.34.0-wmf.15 and rebuild l10n cache (duration: 29m 46s) [12:33:52] 10Operations, 10ops-eqiad, 10DC-Ops: a3-eqiad pdu refresh - https://phabricator.wikimedia.org/T227139 (10MoritzMuehlenhoff) >>! In T227139#5356540, @fgiunchedi wrote: > restbase / logstash / graphite / prometheus hosts should be fine in an event of power loss, This is graphite1003, the old server pending de... [12:33:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:34:34] RECOVERY - High CPU load on API appserver on mw1232 is OK: OK - load average: 19.05, 21.84, 15.05 https://wikitech.wikimedia.org/wiki/Application_servers [12:34:55] 10Operations, 10ops-eqiad, 10DC-Ops: a3-eqiad pdu refresh - https://phabricator.wikimedia.org/T227139 (10MoritzMuehlenhoff) [12:39:26] I've finished cutting this week's train branch [12:45:44] 10Operations, 10Traffic: TLS config issue for nginx on Buster - https://phabricator.wikimedia.org/T228730 (10BBlack) If we need this to work ASAP, probably the most-expedient thing to do would be to patch our puppetization to exclude the patched features from config on buster only, and use the vendor package.... [12:47:19] (03CR) 10Ottomata: [C: 03+2] "AH! 
So the stream-config is rendered properly via configmap.yaml, but the deployment config didn't know it had changed, since it was only" [deployment-charts] - 10https://gerrit.wikimedia.org/r/525076 (https://phabricator.wikimedia.org/T228700) (owner: 10Alexandros Kosiaris) [12:47:22] (03CR) 10Ottomata: [V: 03+2 C: 03+2] Redeploy on stream-config changes [deployment-charts] - 10https://gerrit.wikimedia.org/r/525076 (https://phabricator.wikimedia.org/T228700) (owner: 10Alexandros Kosiaris) [12:52:45] (03CR) 10Filippo Giunchedi: [C: 03+1] kibana: Switch to read-only LDAP replicas [puppet] - 10https://gerrit.wikimedia.org/r/525069 (https://phabricator.wikimedia.org/T227650) (owner: 10Muehlenhoff) [12:53:49] (03CR) 10Filippo Giunchedi: [C: 03+1] "LGTM!" [puppet] - 10https://gerrit.wikimedia.org/r/525075 (owner: 10Muehlenhoff) [12:55:03] (03CR) 10Muehlenhoff: "PCC: https://puppet-compiler.wmflabs.org/compiler1002/17578/" [puppet] - 10https://gerrit.wikimedia.org/r/525069 (https://phabricator.wikimedia.org/T227650) (owner: 10Muehlenhoff) [12:55:18] (03CR) 10Muehlenhoff: [C: 03+2] kibana: Switch to read-only LDAP replicas [puppet] - 10https://gerrit.wikimedia.org/r/525069 (https://phabricator.wikimedia.org/T227650) (owner: 10Muehlenhoff) [12:55:34] 10Operations, 10serviceops, 10Core Platform Team Backlog (Watching / External), 10Core Platform Team Workboards (Clinic Duty Team), and 4 others: Use PHP7 to run all async jobs - https://phabricator.wikimedia.org/T219148 (10jijiki) [13:00:04] liw: #bothumor I � Unicode. All rise for MediaWiki train - European version deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190723T1300). [13:03:30] (03PS7) 10Arturo Borrero Gonzalez: toolforge: k8s: add nginx-ingress configuration. 
[puppet] - 10https://gerrit.wikimedia.org/r/524759 (https://phabricator.wikimedia.org/T228500) [13:03:55] I've filed https://phabricator.wikimedia.org/T228746 and https://phabricator.wikimedia.org/T228749 but hoping neither of them is a blocker [13:04:05] starting group0 deployment now [13:04:27] (03CR) 10jerkins-bot: [V: 04-1] toolforge: k8s: add nginx-ingress configuration. [puppet] - 10https://gerrit.wikimedia.org/r/524759 (https://phabricator.wikimedia.org/T228500) (owner: 10Arturo Borrero Gonzalez) [13:05:14] (03PS1) 10Ema: prometheus: add ats_backend_requests_seconds_count rules [puppet] - 10https://gerrit.wikimedia.org/r/525081 (https://phabricator.wikimedia.org/T227668) [13:06:09] (03PS8) 10Arturo Borrero Gonzalez: toolforge: k8s: add nginx-ingress configuration. [puppet] - 10https://gerrit.wikimedia.org/r/524759 (https://phabricator.wikimedia.org/T228500) [13:06:17] !log Drop abuse_filter_log.afl_log_id from s8 codfw (lag will happen on codfw s8) - T226851 [13:06:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:06:25] T226851: Drop abuse_filter_log.afl_log_id in production - https://phabricator.wikimedia.org/T226851 [13:06:43] !log liw@deploy1001 rebuilt and synchronized wikiversions files: group0 to 1.34.0-wmf.15 [13:06:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:07:44] (03PS4) 10CDanis: conftool: update schemata for dbctl [puppet] - 10https://gerrit.wikimedia.org/r/523943 (https://phabricator.wikimedia.org/T197126) [13:07:44] !log jmm@cumin2001 START - Cookbook sre.hosts.downtime [13:07:45] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [13:07:46] (03PS14) 10CDanis: dbctl: monitor for uncommitted changes [puppet] - 10https://gerrit.wikimedia.org/r/523013 (https://phabricator.wikimedia.org/T197126) [13:07:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:07:50] 10Operations, 10Traffic: TLS config issue for nginx on Buster - 
https://phabricator.wikimedia.org/T228730 (10elukey) @BBlack thanks for the info! Not in a real rush, I was working on https://phabricator.wikimedia.org/T227860 to add TLS capabilities to the Analytics UIs, the only delay will be for Traffic :) [13:07:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:08:34] (03PS5) 10Elukey: profile::tlsproxy::service: add more granularity in monitoring [puppet] - 10https://gerrit.wikimedia.org/r/525039 (https://phabricator.wikimedia.org/T227860) [13:08:51] 10Operations, 10Traffic: TLS config issue for nginx on Buster - https://phabricator.wikimedia.org/T228730 (10ema) p:05Triage→03Normal [13:09:15] (03CR) 10CDanis: [C: 03+2] conftool: update schemata for dbctl [puppet] - 10https://gerrit.wikimedia.org/r/523943 (https://phabricator.wikimedia.org/T197126) (owner: 10CDanis) [13:09:56] 10Operations, 10ops-eqiad, 10DC-Ops: a3-eqiad pdu refresh - https://phabricator.wikimedia.org/T227139 (10jijiki) [13:09:58] (03PS9) 10Arturo Borrero Gonzalez: toolforge: k8s: add nginx-ingress configuration. 
[puppet] - 10https://gerrit.wikimedia.org/r/524759 (https://phabricator.wikimedia.org/T228500) [13:10:37] 10Operations, 10ops-eqiad, 10DC-Ops: a3-eqiad pdu refresh - https://phabricator.wikimedia.org/T227139 (10jijiki) [13:13:00] (03CR) 10Elukey: [C: 03+2] profile::tlsproxy::service: add more granularity in monitoring [puppet] - 10https://gerrit.wikimedia.org/r/525039 (https://phabricator.wikimedia.org/T227860) (owner: 10Elukey) [13:13:05] (03CR) 10Lars Wirzenius: [C: 03+2] Group0 to 1.34.0-wmf.15 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/525068 (owner: 10Lars Wirzenius) [13:13:07] (03PS6) 10Elukey: profile::tlsproxy::service: add more granularity in monitoring [puppet] - 10https://gerrit.wikimedia.org/r/525039 (https://phabricator.wikimedia.org/T227860) [13:13:55] (03Merged) 10jenkins-bot: Group0 to 1.34.0-wmf.15 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/525068 (owner: 10Lars Wirzenius) [13:14:11] (03CR) 10jenkins-bot: Group0 to 1.34.0-wmf.15 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/525068 (owner: 10Lars Wirzenius) [13:15:11] (03PS3) 10Muehlenhoff: maintain_dbusers: Read LDAP servers from Hiera and switch to read-only replicas [puppet] - 10https://gerrit.wikimedia.org/r/524810 (https://phabricator.wikimedia.org/T227650) [13:17:33] !log liw@deploy1001 rebuilt and synchronized wikiversions files: group0 to 1.34.0-wmf.15 [13:17:36] PROBLEM - puppet last run on mw2249 is CRITICAL: CRITICAL: Puppet has 3 failures. Last run 6 minutes ago with 3 failures. 
Failed resources (up to 3 shown): File[/etc/conftool/json-schema/dbconfig/instance.schema],File[/etc/conftool/json-schema/dbconfig/section.schema],File[/etc/conftool/json-schema/mediawiki-config/dbconfig.schema] https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [13:17:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:17:40] (03CR) 10Muehlenhoff: [C: 03+2] maintain_dbusers: Read LDAP servers from Hiera and switch to read-only replicas [puppet] - 10https://gerrit.wikimedia.org/r/524810 (https://phabricator.wikimedia.org/T227650) (owner: 10Muehlenhoff) [13:18:50] PROBLEM - puppet last run on elastic1042 is CRITICAL: CRITICAL: Puppet has 3 failures. Last run 8 minutes ago with 3 failures. Failed resources (up to 3 shown): File[/etc/conftool/json-schema/dbconfig/instance.schema],File[/etc/conftool/json-schema/dbconfig/section.schema],File[/etc/conftool/json-schema/mediawiki-config/dbconfig.schema] https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [13:19:08] cdanis: is that related to your change? 
^ [13:19:16] sigh [13:19:18] (03CR) 10Effie Mouzeli: [C: 03+1] Set async replication for mcrouter on mw api/appserv canaries [puppet] - 10https://gerrit.wikimedia.org/r/525053 (https://phabricator.wikimedia.org/T225642) (owner: 10Elukey) [13:19:24] yes [13:19:36] I will revert [13:19:40] group0 is at 1.34.0-wmf.15 now [13:19:50] cdanis: happy tuesday :p [13:20:00] (03PS2) 10Ema: prometheus: add trafficserver_backend_requests_seconds_count rules [puppet] - 10https://gerrit.wikimedia.org/r/525081 (https://phabricator.wikimedia.org/T227668) [13:20:02] (03PS1) 10Ema: prometheus: rename trafficserver metrics [puppet] - 10https://gerrit.wikimedia.org/r/525085 (https://phabricator.wikimedia.org/T227668) [13:20:57] (03PS1) 10CDanis: Revert "conftool: update schemata for dbctl" [puppet] - 10https://gerrit.wikimedia.org/r/525086 [13:22:19] (03CR) 10CDanis: [C: 03+2] Revert "conftool: update schemata for dbctl" [puppet] - 10https://gerrit.wikimedia.org/r/525086 (owner: 10CDanis) [13:22:59] oh [13:23:01] dammit [13:23:14] Jul 23 13:10:14 elastic1042 puppet-agent[18320]: Could not set 'file' on ensure: Error 404 on SERVER: {"message":"Not Found: Could not find file_content modules/profile/conftool/json-schema/ [13:23:17] dbconfig/instance.schema","issue_kind":"RESOURCE_NOT_FOUND"} [13:23:23] this is just the usual stupid puppet race condition [13:24:14] I need more coffee 😖 [13:24:47] (03CR) 10Ema: [C: 03+1] "Commit message OCD that does not want to discourage an otherwise fantastic idea." 
(2 comments) [puppet] - https://gerrit.wikimedia.org/r/525075 (owner: Muehlenhoff)
[13:26:20] (PS4) Jhedden: dumps dist: switch active VPS to labstore1006 [puppet] - https://gerrit.wikimedia.org/r/524804 (https://phabricator.wikimedia.org/T224228)
[13:26:29] (PS1) CDanis: Revert "Revert "conftool: update schemata for dbctl"" [puppet] - https://gerrit.wikimedia.org/r/525088
[13:26:39] "Fatal error: entire web request took longer than 60 seconds and timed out in /srv/mediawiki/php-1.34.0-wmf.14/includes/parser/Preprocessor_Hash.php on line 187" - is this something I should worry about?
[13:26:52] .14, not .15, which I just deployed, but still
[13:27:16] !log dumps switching active vps to labstore1006 T224228
[13:27:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:27:35] (CR) Jhedden: [C: +2] dumps dist: switch active VPS to labstore1006 [puppet] - https://gerrit.wikimedia.org/r/524804 (https://phabricator.wikimedia.org/T224228) (owner: Jhedden)
[13:27:42] (CR) CDanis: [C: +2] Revert "Revert "conftool: update schemata for dbctl"" [puppet] - https://gerrit.wikimedia.org/r/525088 (owner: CDanis)
[13:27:46] PROBLEM - puppet last run on mw2150 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 4 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/etc/conftool/json-schema/dbconfig/section.schema] https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[13:28:37] (CR) Filippo Giunchedi: [C: +1] prometheus: rename trafficserver metrics [puppet] - https://gerrit.wikimedia.org/r/525085 (https://phabricator.wikimedia.org/T227668) (owner: Ema)
[13:28:49] (CR) Filippo Giunchedi: [C: +1] prometheus: add trafficserver_backend_requests_seconds_count rules [puppet] - https://gerrit.wikimedia.org/r/525081 (https://phabricator.wikimedia.org/T227668) (owner: Ema)
[13:30:34] liw: such errors are an unfortunate but I think fairly frequent problem. See for example https://logstash.wikimedia.org/goto/341af04420e264ef299b2d53b69abd2a
[13:31:32] PROBLEM - puppet last run on elastic2045 is CRITICAL: CRITICAL: Puppet has 3 failures. Last run 8 minutes ago with 3 failures. Failed resources (up to 3 shown): File[/etc/conftool/json-schema/dbconfig/instance.schema],File[/etc/conftool/json-schema/dbconfig/section.schema],File[/etc/conftool/json-schema/mediawiki-config/dbconfig.schema] https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[13:31:47] ema, check, thanks
[13:32:01] (PS5) Jhedden: dumps dist: switch active VPS to labstore1006 [puppet] - https://gerrit.wikimedia.org/r/524804 (https://phabricator.wikimedia.org/T224228)
[13:32:32] I now see: ErrorException from line 100 of /srv/mediawiki/php-1.34.0-wmf.15/includes/libs/HttpStatus.php: PHP Warning: Unknown HTTP status code default
[13:33:22] RECOVERY - puppet last run on mw2150 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[13:35:02] PROBLEM - puppet last run on elastic1026 is CRITICAL: CRITICAL: Puppet has 3 failures. Last run 5 minutes ago with 3 failures.
Failed resources (up to 3 shown): File[/etc/conftool/json-schema/dbconfig/instance.schema],File[/etc/conftool/json-schema/dbconfig/section.schema],File[/etc/conftool/json-schema/mediawiki-config/dbconfig.schema] https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[13:35:48] RECOVERY - puppet last run on elastic1042 is OK: OK: Puppet is currently enabled, last run 5 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[13:37:04] Operations, ops-eqiad, DC-Ops: (July 22-26) install new PDUs in rows A/B (Top level tracking task) - https://phabricator.wikimedia.org/T226778 (RobH)
[13:37:46] (PS1) Muehlenhoff: Remove unused Apache config [puppet] - https://gerrit.wikimedia.org/r/525090
[13:39:38] PROBLEM - Mediawiki Cirrussearch update rate - eqiad on icinga1001 is CRITICAL: CRITICAL: 30.00% of data under the critical threshold [50.0] https://wikitech.wikimedia.org/wiki/Search%23No_updates https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?panelId=44&fullscreen&orgId=1
[13:40:42] RECOVERY - puppet last run on elastic1026 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[13:41:24] PROBLEM - Mediawiki Cirrussearch update rate - codfw on icinga1001 is CRITICAL: CRITICAL: 50.00% of data under the critical threshold [50.0] https://wikitech.wikimedia.org/wiki/Search%23No_updates https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?panelId=44&fullscreen&orgId=1
[13:41:52] (PS1) Ottomata: eventgate - fix 'wrong number of args for include: want 2 got 1' [deployment-charts] - https://gerrit.wikimedia.org/r/525091 (https://phabricator.wikimedia.org/T228700)
[13:42:34] (CR) Ottomata: [V: +2 C: +2] eventgate - fix 'wrong number of args for include: want 2 got 1' [deployment-charts] - https://gerrit.wikimedia.org/r/525091 (https://phabricator.wikimedia.org/T228700) (owner: Ottomata)
[13:42:50] Operations, LDAP, Patch-For-Review: Migrate web services using LDAP authentication towards the readonly LDAP replicas - https://phabricator.wikimedia.org/T227650 (MoritzMuehlenhoff) a: MoritzMuehlenhoff The following services have been converted to use the read-only replicas: - DB users sync...
[13:43:36] !log otto@ helmfile [STAGING] Ran 'apply' command on namespace 'eventgate-main' for release 'main' .
[13:43:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:44:27] (CR) Elukey: [C: +1] Remove unused Apache config [puppet] - https://gerrit.wikimedia.org/r/525090 (owner: Muehlenhoff)
[13:44:45] !log installing Java security updates on furud/flerovium
[13:44:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:44:50] PROBLEM - Work requests waiting in Zuul Gearman server on contint1001 is CRITICAL: CRITICAL: 42.86% of data above the critical threshold [140.0] https://www.mediawiki.org/wiki/Continuous_integration/Zuul https://grafana.wikimedia.org/dashboard/db/zuul-gearman?panelId=10&fullscreen&orgId=1
[13:45:28] !log otto@ helmfile [CODFW] Ran 'apply' command on namespace 'eventgate-main' for release 'main' .
[13:45:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:45:39] !log depool restbase1016 restbase1019 restbase1011 restbase1010 prometheus1003 ahead of PDU work - T227139
[13:45:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:45:46] T227139: a3-eqiad pdu refresh - https://phabricator.wikimedia.org/T227139
[13:45:48] RECOVERY - puppet last run on mw2249 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[13:47:00] Operations, Wikimedia-Mailing-lists: New Mailing lists for AzWiki sysops - https://phabricator.wikimedia.org/T228542 (Mardetanha) >>!
In T228542#5356900, @MarcoAurelio wrote: > There's a bit of discussion on Meta about whether this should be approved or not based on the results of an RfC. May I advice op...
[13:47:29] !log otto@ helmfile [EQIAD] Ran 'apply' command on namespace 'eventgate-main' for release 'main' .
[13:47:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:50:00] https://phabricator.wikimedia.org/T228758
[13:50:03] Operations, Analytics, Core Platform Team Backlog (Watching / External), Patch-For-Review, and 2 others: Replace and expand codfw kafka main hosts (kafka200[123]) with kafka-main200[12345] - https://phabricator.wikimedia.org/T225005 (Ottomata)
[13:51:56] RECOVERY - Work requests waiting in Zuul Gearman server on contint1001 is OK: OK: Less than 30.00% above the threshold [90.0] https://www.mediawiki.org/wiki/Continuous_integration/Zuul https://grafana.wikimedia.org/dashboard/db/zuul-gearman?panelId=10&fullscreen&orgId=1
[13:52:31] (PS1) Muehlenhoff: Remove old jessie-based pool counters [puppet] - https://gerrit.wikimedia.org/r/525093 (https://phabricator.wikimedia.org/T224572)
[13:52:55] Operations, PoolCounter: Migrate pool counters to stretch - https://phabricator.wikimedia.org/T199876 (MoritzMuehlenhoff) Open→Resolved Duplicate of T224572
[13:53:12] Operations, serviceops, Patch-For-Review: Migrate pool counters to Stretch/Buster - https://phabricator.wikimedia.org/T224572 (MoritzMuehlenhoff) a: MoritzMuehlenhoff
[13:53:14] RECOVERY - puppet last run on elastic2045 is OK: OK: Puppet is currently enabled, last run 8 seconds ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[13:54:25] (CR) Hashar: zuul: stop zuul-merger gracefully (1 comment) [puppet] - https://gerrit.wikimedia.org/r/524180 (owner: Hashar)
[13:54:50] PROBLEM - puppet last run on db1080 is CRITICAL: CRITICAL: Failed to apply catalog, zero resources tracked by Puppet. Could be an interrupted request or a dependency cycle. https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[13:57:37] Operations, ops-eqiad, DBA: Upgrade db1100 firmware and BIOS - https://phabricator.wikimedia.org/T228732 (herron) p: Triage→Normal
[13:59:09] https://phabricator.wikimedia.org/T228760
[14:00:47] (PS2) Muehlenhoff: Add net-tools to standard packages [puppet] - https://gerrit.wikimedia.org/r/525075
[14:01:12] (CR) Muehlenhoff: Add net-tools to standard packages (2 comments) [puppet] - https://gerrit.wikimedia.org/r/525075 (owner: Muehlenhoff)
[14:01:36] Operations, Gerrit, LDAP-Access-Requests, Release-Engineering-Team-TODO, Release-Engineering-Team (Development services): Add more SREs to gerritadmin LDAP group - https://phabricator.wikimedia.org/T228733 (Joe) I am happy to help.
[14:01:40] Operations, netops: AS63541's session down reported by cr1-eqsin - https://phabricator.wikimedia.org/T228617 (herron) p: Triage→Normal
[14:02:03] (CR) Ottomata: "Others will also use this swift oozie upload job, so I'd rather not have to deploy more new credentials for them if we don't have to.
I'l" [puppet] - https://gerrit.wikimedia.org/r/524625 (https://phabricator.wikimedia.org/T227364) (owner: EBernhardson)
[14:03:03] (CR) Hashar: zuul: fix systemd Service/TimeoutStopSec (1 comment) [puppet] - https://gerrit.wikimedia.org/r/524174 (https://phabricator.wikimedia.org/T228381) (owner: Hashar)
[14:03:16] (PS2) Hashar: zuul: fix systemd Service/TimeoutStopSec [puppet] - https://gerrit.wikimedia.org/r/524174 (https://phabricator.wikimedia.org/T228381)
[14:03:18] (PS2) Hashar: zuul: stop zuul-merger gracefully [puppet] - https://gerrit.wikimedia.org/r/524180
[14:03:33] Operations, Puppet: puppetdb prometheus metrics per-host metrics - https://phabricator.wikimedia.org/T228395 (herron) p: Triage→Normal
[14:03:37] (CR) Hashar: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/524174 (https://phabricator.wikimedia.org/T228381) (owner: Hashar)
[14:03:47] (CR) Hashar: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/524180 (owner: Hashar)
[14:05:42] RECOVERY - Mediawiki Cirrussearch update rate - eqiad on icinga1001 is OK: OK: Less than 1.00% under the threshold [80.0] https://wikitech.wikimedia.org/wiki/Search%23No_updates https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?panelId=44&fullscreen&orgId=1
[14:05:48] RECOVERY - Mediawiki Cirrussearch update rate - codfw on icinga1001 is OK: OK: Less than 1.00% under the threshold [80.0] https://wikitech.wikimedia.org/wiki/Search%23No_updates https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?panelId=44&fullscreen&orgId=1
[14:12:10] (PS3) Muehlenhoff: Add net-tools to standard packages [puppet] - https://gerrit.wikimedia.org/r/525075
[14:12:22] herron: is it ok to reenable puppet on kafka1001 already?
[14:12:43] yep, will do that now
[14:14:22] PROBLEM - Host ps1-a3-eqiad is DOWN: PING CRITICAL - Packet loss = 100%
[14:14:28] !log a3-eqiad pdu swap taking place now via T227139
[14:14:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:14:35] T227139: a3-eqiad pdu refresh - https://phabricator.wikimedia.org/T227139
[14:14:41] that is expected!
[14:14:48] also mgmt will be offline for all the hosts at some point
[14:14:51] but not the hosts themselves
[14:14:58] can we set up downtimes for these?
[14:15:43] we want to see them come back
[14:15:49] and its not paging
[14:15:52] so i rather not
[14:16:06] (we didnt yesterday for the same reason)
[14:16:10] ok
[14:16:15] that makes sense
[14:16:19] =]
[14:16:31] there is a bit of a confusion about impact, that's why I was asking
[14:16:57] you mentioned no unexpected power loss in the email, but then I saw we lost two servers after all?
[14:17:10] it turns out yes i didnt realize those lost power
[14:17:27] though im having issues finding which backlog channel it was mentioned in
[14:17:27] I don't have any bright ideas, but would like for everyone to be on the same page around (expected or unexpected) impact :)
[14:18:12] be aware i have a number of private messages where folks are unhappy about this so im quite aware that just mentioning this in the meeting isnt enough
[14:18:14] =]
[14:18:33] haha
[14:20:14] Operations, ops-eqiad, DC-Ops: a4-eqiad pdu refresh - https://phabricator.wikimedia.org/T227140 (RobH) Please note that netmon and kubestage both powered off yesterday (irc update about this) so we didn't have a flawless migration.
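(The RESOURCE_NOT_FOUND 404s at 13:23 are the transient puppet-merge window: an agent can request file content from a master before the new commit has finished syncing there, and the next agent run succeeds. The recovery pattern is a plain bounded retry; the sketch below is illustrative only, not any actual WMF tooling:)

```python
import time

def run_with_retry(fn, attempts=3, delay=0.0):
    """Call fn(), retrying on any exception up to `attempts` times."""
    last_exc = None
    for _ in range(attempts):
        try:
            return fn()
        except Exception as exc:  # e.g. the transient 404 from the master
            last_exc = exc
            time.sleep(delay)
    raise last_exc

# Simulate an agent run that fails twice (master still syncing), then succeeds.
calls = {"n": 0}
def agent_run():
    calls["n"] += 1
    if calls["n"] < 3:
        raise IOError("RESOURCE_NOT_FOUND")
    return "catalog applied"

assert run_with_retry(agent_run) == "catalog applied"
assert calls["n"] == 3
```

(Operationally, re-running `puppet agent -t` or just waiting for the next scheduled run amounts to one more attempt, which is why the alerts above recover on their own.)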
[14:21:45] (CR) Giuseppe Lavagetto: [C: +1] Set async replication for mcrouter on mw api/appserv canaries [puppet] - https://gerrit.wikimedia.org/r/525053 (https://phabricator.wikimedia.org/T225642) (owner: Elukey)
[14:22:30] (CR) Muehlenhoff: [C: +2] Add net-tools to standard packages [puppet] - https://gerrit.wikimedia.org/r/525075 (owner: Muehlenhoff)
[14:24:12] (PS15) CDanis: dbctl: monitor for uncommitted changes [puppet] - https://gerrit.wikimedia.org/r/523013 (https://phabricator.wikimedia.org/T197126)
[14:24:14] (CR) Giuseppe Lavagetto: dbctl: monitor for uncommitted changes (1 comment) [puppet] - https://gerrit.wikimedia.org/r/523013 (https://phabricator.wikimedia.org/T197126) (owner: CDanis)
[14:25:35] (CR) CDanis: dbctl: monitor for uncommitted changes (2 comments) [puppet] - https://gerrit.wikimedia.org/r/523013 (https://phabricator.wikimedia.org/T197126) (owner: CDanis)
[14:26:41] (CR) Volans: [C: +1] "LGTM, at most check the compiler." [puppet] - https://gerrit.wikimedia.org/r/523013 (https://phabricator.wikimedia.org/T197126) (owner: CDanis)
[14:28:17] Operations, Analytics, LDAP-Access-Requests, wikimediafoundation.org: Access to WikimediaFoundation.org analytics for Deb - https://phabricator.wikimedia.org/T227496 (herron) Open→Resolved I wasn't able to find an ldap account with shell username `Deb_Zierten`, but I do see shell username...
[14:28:30] (CR) Giuseppe Lavagetto: dbctl: monitor for uncommitted changes (1 comment) [puppet] - https://gerrit.wikimedia.org/r/523013 (https://phabricator.wikimedia.org/T197126) (owner: CDanis)
[14:28:44] (CR) Giuseppe Lavagetto: [C: +1] lookup checks: add checks to warn against using hiera and advice lookup [puppet-lint/wmf_styleguide-check] - https://gerrit.wikimedia.org/r/522526 (https://phabricator.wikimedia.org/T220820) (owner: Jbond)
[14:29:40] (CR) Giuseppe Lavagetto: "I think this is correct but please add a reference task." [puppet] - https://gerrit.wikimedia.org/r/524925 (owner: Jforrester)
[14:29:48] (CR) CDanis: "PCC looks good https://puppet-compiler.wmflabs.org/compiler1002/17579/" (1 comment) [puppet] - https://gerrit.wikimedia.org/r/523013 (https://phabricator.wikimedia.org/T197126) (owner: CDanis)
[14:31:00] (PS7) Giuseppe Lavagetto: Stop installing pear packages on MW Application Servers [puppet] - https://gerrit.wikimedia.org/r/434710 (https://phabricator.wikimedia.org/T195364) (owner: Reedy)
[14:31:26] (PS2) Ema: prometheus: rename trafficserver metrics [puppet] - https://gerrit.wikimedia.org/r/525085 (https://phabricator.wikimedia.org/T227668)
[14:31:53] PROBLEM - puppet last run on auth1002 is CRITICAL: CRITICAL: Failed to apply catalog, zero resources tracked by Puppet. Could be an interrupted request or a dependency cycle. https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[14:33:08] (CR) CDanis: "Will wait to commit until we have some data in etcd (otherwise this will fail)" [puppet] - https://gerrit.wikimedia.org/r/523013 (https://phabricator.wikimedia.org/T197126) (owner: CDanis)
[14:33:14] (CR) Ema: [C: +2] prometheus: rename trafficserver metrics [puppet] - https://gerrit.wikimedia.org/r/525085 (https://phabricator.wikimedia.org/T227668) (owner: Ema)
[14:33:19] Operations, Wikimedia-General-or-Unknown, serviceops, Patch-For-Review: Remove pear/mail packages from WMF MW app servers - https://phabricator.wikimedia.org/T195364 (Joe) @Tgr do you see any reason not to uninstall those packages? I will for now just remove them from puppet, and uninstall them o...
[14:33:26] (PS3) Ema: prometheus: add trafficserver_backend_requests_seconds_count rules [puppet] - https://gerrit.wikimedia.org/r/525081 (https://phabricator.wikimedia.org/T227668)
[14:33:37] (CR) Giuseppe Lavagetto: [C: +2] Stop installing pear packages on MW Application Servers [puppet] - https://gerrit.wikimedia.org/r/434710 (https://phabricator.wikimedia.org/T195364) (owner: Reedy)
[14:33:41] woo
[14:33:48] (PS8) Giuseppe Lavagetto: Stop installing pear packages on MW Application Servers [puppet] - https://gerrit.wikimedia.org/r/434710 (https://phabricator.wikimedia.org/T195364) (owner: Reedy)
[14:34:08] <_joe_> Reedy: I plan to remove the packages from a few appservers as soon as gergo confirms all the bugs are actually solved :)
[14:34:18] (CR) Giuseppe Lavagetto: [V: +2 C: +2] Stop installing pear packages on MW Application Servers [puppet] - https://gerrit.wikimedia.org/r/434710 (https://phabricator.wikimedia.org/T195364) (owner: Reedy)
[14:34:19] heh
[14:34:54] (PS4) Ema: prometheus: add trafficserver_backend_requests_seconds_count rules [puppet] - https://gerrit.wikimedia.org/r/525081 (https://phabricator.wikimedia.org/T227668)
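(For context on the trafficserver patches under review here: a Prometheus recording rule pre-aggregates an expensive query into a new stored series. The fragment below is a hypothetical sketch of what such a rule generally looks like; the metric name is taken from the commit subject, but the rule name, expression, and labels are assumptions, not the actual rules merged in the puppet repo:)

```yaml
groups:
  - name: trafficserver
    rules:
      # Pre-compute a per-backend request rate so dashboards query one
      # cheap series instead of re-aggregating raw counters on every load.
      - record: backend:trafficserver_backend_requests_seconds_count:rate5m
        expr: sum by (backend) (rate(trafficserver_backend_requests_seconds_count[5m]))
```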
[14:36:25] _joe_: are your changes safe to puppet-merge?
[14:36:35] <_joe_> ema: yes
[14:36:40] (CR) Ema: [C: +2] prometheus: add trafficserver_backend_requests_seconds_count rules [puppet] - https://gerrit.wikimedia.org/r/525081 (https://phabricator.wikimedia.org/T227668) (owner: Ema)
[14:36:49] _joe_: ack, merging!
[14:36:58] <_joe_> thanks
[14:38:37] Operations, Wikimedia-General-or-Unknown, serviceops, Patch-For-Review: Remove pear/mail packages from WMF MW app servers - https://phabricator.wikimedia.org/T195364 (Tgr) >>! In T195364#5357382, @Joe wrote: > @Tgr do you see any reason not to uninstall those packages? I will for now just remove...
[14:39:34] !log ppchelko@deploy1001 Started deploy [restbase/deploy@ea10fa5]: Switch event production to eventgate T211248, attempt 2
[14:39:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:39:42] T211248: Modern Event Platform: Stream Intake Service: Migrate eventlogging-service-eventbus events to eventgate-main - https://phabricator.wikimedia.org/T211248
[14:41:15] RECOVERY - puppet last run on db1080 is OK: OK: Puppet is currently enabled, last run 21 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[14:41:33] (PS2) Elukey: Set async replication for mcrouter on mw api/appserv canaries [puppet] - https://gerrit.wikimedia.org/r/525053 (https://phabricator.wikimedia.org/T225642)
[14:42:40] (CR) Elukey: [C: +2] Set async replication for mcrouter on mw api/appserv canaries [puppet] - https://gerrit.wikimedia.org/r/525053 (https://phabricator.wikimedia.org/T225642) (owner: Elukey)
[14:43:16] jijiki: ---^
[14:43:22] (ping as requested :)
[14:43:30] tx :D
[14:46:56] please note all a3-eqiad mgmt is about to complain
[14:46:58] it is expected
[14:47:12] as side a pdu is being swapped for the rack and the mgmt switch is single infeed
[14:49:18] Operations, Analytics, Discovery, Research-Backlog, Patch-For-Review: Workflow to be able to move data files computed in jobs from analytics cluster to production - https://phabricator.wikimedia.org/T213976 (Ottomata)
[14:52:40] (PS3) Hashar: zuul: fix systemd Service/TimeoutStopSec [puppet] - https://gerrit.wikimedia.org/r/524174 (https://phabricator.wikimedia.org/T228381)
[14:52:42] (CR) Ottomata: "Alright! I've made search_glent readable. I've also merged https://gerrit.wikimedia.org/r/c/analytics/refinery/+/525106 to do this by de" [puppet] - https://gerrit.wikimedia.org/r/524625 (https://phabricator.wikimedia.org/T227364) (owner: EBernhardson)
[14:52:42] !log ppchelko@deploy1001 Finished deploy [restbase/deploy@ea10fa5]: Switch event production to eventgate T211248, attempt 2 (duration: 13m 08s)
[14:52:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:52:50] T211248: Modern Event Platform: Stream Intake Service: Migrate eventlogging-service-eventbus events to eventgate-main - https://phabricator.wikimedia.org/T211248
[14:52:52] (CR) Hashar: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/524174 (https://phabricator.wikimedia.org/T228381) (owner: Hashar)
[14:53:53] (PS3) Hashar: zuul: stop zuul-merger gracefully [puppet] - https://gerrit.wikimedia.org/r/524180
[14:54:04] (CR) Hashar: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/524180 (owner: Hashar)
[14:54:17] RECOVERY - puppet last run on auth1002 is OK: OK: Puppet is currently enabled, last run 27 seconds ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[14:54:43] PROBLEM - Host dbproxy1003 is DOWN: PING CRITICAL - Packet loss = 100%
[14:54:58] robh: ^
[14:55:06] Pchelolo: FYI I did depool some restbase hosts as a precaution for T227139 and the deploy pooled them back
[14:55:07] T227139: a3-eqiad pdu refresh - https://phabricator.wikimedia.org/T227139
[14:55:23] godog: oh damn sorry about that
[14:55:42] did I break
something for you?
[14:56:04] i think that wouldn't break anything, unless the pdu replacement caused a power outage
[14:56:07] Pchelolo: no nothing broken :) I'm wondering if we could do that better
[14:56:10] it isn't supposed to
[14:56:15] but it could!
[14:56:34] that == the pool/depool behaviour on deploy
[14:57:29] (CR) Hashar: [V: +1] "Better https://puppet-compiler.wmflabs.org/compiler1001/241/contint1001.wikimedia.org/" [puppet] - https://gerrit.wikimedia.org/r/524174 (https://phabricator.wikimedia.org/T228381) (owner: Hashar)
[14:57:38] godog: ye, like scap checking whether the reason for depool is deploy or not
[14:57:59] yeah exactly, sth like that
[14:58:16] (meeting)
[14:58:25] (CR) Hashar: [V: +1] "https://puppet-compiler.wmflabs.org/compiler1002/242/contint1001.wikimedia.org/ looks good." [puppet] - https://gerrit.wikimedia.org/r/524180 (owner: Hashar)
[14:59:10] that sounds like a scap bug to me, and like something that is going to seriously bite someone some day
[14:59:59] RECOVERY - Host dbproxy1003 is UP: PING OK - Packet loss = 0%, RTA = 0.24 ms
[15:00:39] oh, so we lost one in the plugging in of tower a
[15:00:40] sucks
[15:00:48] luckily marostegui had depooled that
[15:01:05] oh wait, that was 1001
[15:01:12] marostegui: ^
[15:01:19] yeah, 1003 isn't active
[15:01:23] (PS2) Giuseppe Lavagetto: mediawiki::webserver: add mtail to gather latency, error rate metrics [puppet] - https://gerrit.wikimedia.org/r/520502 (https://phabricator.wikimedia.org/T226815)
[15:01:24] ok, whew
[15:01:29] (CR) Giuseppe Lavagetto: mediawiki::webserver: add mtail to gather latency, error rate metrics (4 comments) [puppet] - https://gerrit.wikimedia.org/r/520502 (https://phabricator.wikimedia.org/T226815) (owner: Giuseppe Lavagetto)
[15:01:39] robh: yeah, I checked when I checked the rack earlier, so not a big deal
[15:01:41] it was more a FYI
[15:02:18] Operations, ops-eqiad, DC-Ops: a3-eqiad pdu refresh -
https://phabricator.wikimedia.org/T227139 (RobH) All of the power has been migrated, and we are now setting up the networking for the new pdus
[15:03:35] PROBLEM - PHP opcache health on mwdebug1001 is CRITICAL: CRITICAL: opcache free space is below 50 MB https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health
[15:03:35] (CR) Giuseppe Lavagetto: mediawiki::webserver: add mtail to gather latency, error rate metrics (3 comments) [puppet] - https://gerrit.wikimedia.org/r/520502 (https://phabricator.wikimedia.org/T226815) (owner: Giuseppe Lavagetto)
[15:03:54] (PS3) Giuseppe Lavagetto: mediawiki::webserver: add mtail to gather latency, error rate metrics [puppet] - https://gerrit.wikimedia.org/r/520502 (https://phabricator.wikimedia.org/T226815)
[15:04:17] (PS1) Jhedden: icinga: update toolschecker webservice interval [puppet] - https://gerrit.wikimedia.org/r/525108 (https://phabricator.wikimedia.org/T221301)
[15:05:45] PROBLEM - IPMI Sensor Status on elastic1031 is CRITICAL: Sensor Type(s) Temperature, Power_Supply Status: Critical [PS Redundancy = Critical, Status = Critical, Status = Critical] https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures
[15:08:33] <_joe_> !log uninstalling php-pear, php-mail, php-mail-mime from mw1267 T195364
[15:08:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:08:41] T195364: Remove pear/mail packages from WMF MW app servers - https://phabricator.wikimedia.org/T195364
[15:10:08] Operations, ops-eqiad: Degraded RAID on cloudvirt1015 - https://phabricator.wikimedia.org/T223237 (Andrew)
[15:10:11] Operations, ops-eqiad, DC-Ops, User-Zppix, cloud-services-team (Kanban): VMs on cloudvirt1015 crashing - bad mainboard/memory - https://phabricator.wikimedia.org/T220853 (Andrew)
[15:10:17] Operations, ops-eqiad, DC-Ops, Patch-For-Review, cloud-services-team (Kanban): cloudvirt1015: apparent hardware errors in CPU/Memory - https://phabricator.wikimedia.org/T215012 (Andrew)
[15:11:47] RECOVERY - Host ps1-a3-eqiad is UP: PING OK - Packet loss = 0%, RTA = 3.11 ms
[15:11:55] (PS1) Ottomata: Use proper main-codfw Kafka cluster for eventgate-main in codfw [deployment-charts] - https://gerrit.wikimedia.org/r/525110 (https://phabricator.wikimedia.org/T211248)
[15:12:30] (PS2) Ottomata: Use proper main-codfw Kafka cluster for eventgate-main in codfw [deployment-charts] - https://gerrit.wikimedia.org/r/525110 (https://phabricator.wikimedia.org/T211248)
[15:13:08] (CR) Ottomata: [V: +2 C: +2] Use proper main-codfw Kafka cluster for eventgate-main in codfw [deployment-charts] - https://gerrit.wikimedia.org/r/525110 (https://phabricator.wikimedia.org/T211248) (owner: Ottomata)
[15:13:15] Operations, ops-eqiad, DC-Ops, cloud-services-team (Kanban): labvirt1015 crashes - https://phabricator.wikimedia.org/T171473 (Andrew)
[15:13:50] Operations, Analytics, Analytics-EventLogging: Decommission m4 proxies (dbproxy1004 and dbproxy1008) - https://phabricator.wikimedia.org/T228768 (Marostegui)
[15:14:16] !log otto@ helmfile [CODFW] Ran 'apply' command on namespace 'eventgate-main' for release 'main' .
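(The depool/repool collision discussed at 14:55-14:59 — the restbase deploy repooled hosts that had been manually depooled for the PDU work — matches the fix floated in-channel: tag every depool with a reason, and have the deploy tool repool only hosts it depooled itself. A hypothetical Python sketch of that idea, not actual conftool or scap behaviour:)

```python
# Hypothetical pooled-state store where each depool records a reason;
# a deploy only repools hosts whose depool reason is "deploy".
class PoolState:
    def __init__(self):
        self._depooled = {}  # host -> reason

    def depool(self, host, reason):
        self._depooled[host] = reason

    def repool_after_deploy(self, host):
        # Only repool if this host was depooled by the deploy itself.
        if self._depooled.get(host) == "deploy":
            del self._depooled[host]
            return True
        return False  # leave manually-depooled hosts alone

state = PoolState()
state.depool("restbase1016", "pdu-work")  # manual depool, e.g. for T227139
state.depool("restbase1021", "deploy")    # depooled by the deploy tool
assert state.repool_after_deploy("restbase1021") is True
assert state.repool_after_deploy("restbase1016") is False
```

(With this scheme a maintenance depool survives any number of deploys until the operator repools it explicitly.)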
[15:14:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:15:09] PROBLEM - ps1-a3-eqiad-infeed-load-tower-B-phase-X on ps1-a3-eqiad is CRITICAL: CRITICAL - Plugin timed out while executing system call https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[15:15:46] Operations, ops-eqiad, DC-Ops: elastic1031 failed PSU 2 fan - https://phabricator.wikimedia.org/T228769 (Cmjohnson)
[15:15:51] PROBLEM - ps1-a3-eqiad-infeed-load-tower-B-phase-Y on ps1-a3-eqiad is CRITICAL: CRITICAL - Plugin timed out while executing system call https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[15:16:19] PROBLEM - ps1-a3-eqiad-infeed-load-tower-B-phase-Z on ps1-a3-eqiad is CRITICAL: CRITICAL - Plugin timed out while executing system call https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[15:16:44] Operations, ops-eqiad, DC-Ops: a3-eqiad pdu refresh - https://phabricator.wikimedia.org/T227139 (RobH)
[15:17:12] Operations, ops-eqiad, DC-Ops: a3-eqiad pdu refresh - https://phabricator.wikimedia.org/T227139 (RobH) Open→Resolved All done. Elastic1031 has a PSU issue, and we lost power to dbproxy1003 (it was not in service) during this migration.
[15:17:14] Operations, ops-eqiad, DC-Ops: (July 22-26) install new PDUs in rows A/B (Top level tracking task) - https://phabricator.wikimedia.org/T226778 (RobH)
[15:18:16] Operations, ops-eqiad, DC-Ops, User-Zppix, cloud-services-team (Kanban): VMs on cloudvirt1015 crashing - bad mainboard/memory - https://phabricator.wikimedia.org/T220853 (Andrew)
[15:22:53] PROBLEM - Host ps1-a5-eqiad is DOWN: PING CRITICAL - Packet loss = 100%
[15:24:33] (CR) Alexandros Kosiaris: [C: +1] Add mediawiki development chart. [deployment-charts] - https://gerrit.wikimedia.org/r/522584 (https://phabricator.wikimedia.org/T224935) (owner: Jeena Huneidi)
[15:24:39] (PS1) Arturo Borrero Gonzalez: toolforge: k8s: kubadm: calico requires ipset [puppet] - https://gerrit.wikimedia.org/r/525112 (https://phabricator.wikimedia.org/T215531)
[15:25:23] (CR) Arturo Borrero Gonzalez: [C: +2] toolforge: k8s: kubadm: calico requires ipset [puppet] - https://gerrit.wikimedia.org/r/525112 (https://phabricator.wikimedia.org/T215531) (owner: Arturo Borrero Gonzalez)
[15:26:12] (PS2) Jhedden: icinga: update toolschecker webservice interval [puppet] - https://gerrit.wikimedia.org/r/525108 (https://phabricator.wikimedia.org/T221301)
[15:27:09] (CR) Jhedden: [C: +2] icinga: update toolschecker webservice interval [puppet] - https://gerrit.wikimedia.org/r/525108 (https://phabricator.wikimedia.org/T221301) (owner: Jhedden)
[15:27:33] (CR) Bstorm: toolforge: k8s: add nginx-ingress configuration. (1 comment) [puppet] - https://gerrit.wikimedia.org/r/524759 (https://phabricator.wikimedia.org/T228500) (owner: Arturo Borrero Gonzalez)
[15:32:19] (Abandoned) Ema: ATS: split the cache for beta variant of the mobile site [puppet] - https://gerrit.wikimedia.org/r/524789 (https://phabricator.wikimedia.org/T227432) (owner: Ema)
[15:33:31] (PS1) Ppchelko: Clean up eventlogging_service_uri from RESTBase profile. [puppet] - https://gerrit.wikimedia.org/r/525114 (https://phabricator.wikimedia.org/T211248)
[15:34:29] (CR) jerkins-bot: [V: -1] Clean up eventlogging_service_uri from RESTBase profile.
[puppet] - https://gerrit.wikimedia.org/r/525114 (https://phabricator.wikimedia.org/T211248) (owner: Ppchelko)
[15:35:51] (PS1) Muehlenhoff: Enable seccomp-based hardening for apt [puppet] - https://gerrit.wikimedia.org/r/525115
[15:36:01] PROBLEM - Host wtp2013 is DOWN: PING CRITICAL - Packet loss = 100%
[15:36:18] ^ expected
[15:36:21] PROBLEM - puppet last run on db1121 is CRITICAL: CRITICAL: Failed to apply catalog, zero resources tracked by Puppet. Could be an interrupted request or a dependency cycle. https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[15:36:54] (PS2) Ppchelko: Clean up eventlogging_service_uri from RESTBase profile. [puppet] - https://gerrit.wikimedia.org/r/525114 (https://phabricator.wikimedia.org/T211248)
[15:38:04] (PS10) Arturo Borrero Gonzalez: toolforge: k8s: add nginx-ingress configuration. [puppet] - https://gerrit.wikimedia.org/r/524759 (https://phabricator.wikimedia.org/T228500)
[15:38:40] (PS1) Arturo Borrero Gonzalez: Revert "secrets: toolforge: add default k8s nginx-ingress key pair" [labs/private] - https://gerrit.wikimedia.org/r/525116
[15:38:47] (CR) Arturo Borrero Gonzalez: [V: +2 C: +2] Revert "secrets: toolforge: add default k8s nginx-ingress key pair" [labs/private] - https://gerrit.wikimedia.org/r/525116 (owner: Arturo Borrero Gonzalez)
[15:40:41] (CR) Bstorm: [C: +1] "I like the commented prometheus scrape bits, since it's a good reminder that we haven't really thought about that piece yet :-D" [puppet] - https://gerrit.wikimedia.org/r/524759 (https://phabricator.wikimedia.org/T228500) (owner: Arturo Borrero Gonzalez)
[15:46:39] !log side b of a5-eqiad swapping pdu via T227141
[15:47:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:47:01] T227141: a5-eqiad pdu refresh - https://phabricator.wikimedia.org/T227141
[15:49:08] correction was side a (they werent labeled on old pdu towers)
[15:49:15] so that is changing instead of b first
[15:49:55] RECOVERY - Host wtp2013 is UP: PING OK - Packet loss = 0%, RTA = 37.49 ms
[15:53:35] Operations, SRE-Access-Requests, Release-Engineering-Team (Deployment services): Request access to deployment cluster for Jakob_WMDE - https://phabricator.wikimedia.org/T227193 (WMDE-leszek) Thanks gentlemen! @Jakob_WMDE has noticed today he does not have +2 rights on operations/mediawiki-config. We...
[15:54:01] PROBLEM - Host mw2159.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[15:54:01] PROBLEM - Host mw2160.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[15:54:47] mgmt down on mw2159 and mw2160 that's me
[15:55:11] Operations, ops-eqiad, DC-Ops, Epic, cloud-services-team (Kanban): Move cloudvirt hosts to 10Gb ethernet - https://phabricator.wikimedia.org/T216195 (Andrew)
[15:55:39] (PS4) Krinkle: noc db.php: include readonly status & group loads [mediawiki-config] - https://gerrit.wikimedia.org/r/524825 (owner: CDanis)
[15:56:08] (CR) Krinkle: [C: +1] "readonly=>readOnly, for consistency. And some spacing issue fixed (phpcs should have caught that, will look at that later)." [mediawiki-config] - https://gerrit.wikimedia.org/r/524825 (owner: CDanis)
[15:56:49] ok side a done doing side b in a5-eqiad
[15:57:08] mgmt may flap
[15:57:23] (PS1) Ppchelko: Clean up eventlogging_service_uri from maps. [puppet] - https://gerrit.wikimedia.org/r/525121 (https://phabricator.wikimedia.org/T211248)
[15:58:04] (CR) Krinkle: [C: +1] "non-blocking issue for later improvement perhaps." (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/524825 (owner: CDanis)
[15:58:18] (CR) jerkins-bot: [V: -1] Clean up eventlogging_service_uri from maps. [puppet] - https://gerrit.wikimedia.org/r/525121 (https://phabricator.wikimedia.org/T211248) (owner: Ppchelko)
[15:58:21] Operations, SRE-Access-Requests: Requesting access to machines [stat1004, stat1007, stat1006, notebook1003, and notebook1004] and groups for Mayakpwiki - https://phabricator.wikimedia.org/T227633 (Mayakp.wiki) @fsero : Please advise if access is provided. Thanks!
[15:58:39] PROBLEM - Text HTTP 5xx reqs/min on graphite1004 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [1000.0] https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/d/myRmf1Pik/varnish-aggregate-client-status-codes?var-cache_type=varnish-text&var-status_type=5
[15:58:46] (CR) Ppchelko: "According to @Mholloway there are no plans to use it now, so this can safely be removed." [puppet] - https://gerrit.wikimedia.org/r/525121 (https://phabricator.wikimedia.org/T211248) (owner: Ppchelko)
[15:58:57] PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 71, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[16:00:01] PROBLEM - Esams HTTP 5xx reqs/min on graphite1004 is CRITICAL: CRITICAL: 30.00% of data above the critical threshold [1000.0] https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/d/myRmf1Pik/varnish-aggregate-client-status-codes?var-site=esams&var-status_type=5
[16:00:04] _joe_: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) Puppet SWAT(Max 6 patches) deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190723T1600).
[16:00:04] No GERRIT patches in the queue for this window AFAICS.
[16:00:21] (PS2) Ppchelko: Clean up eventlogging_service_uri from maps.
[puppet] - 10https://gerrit.wikimedia.org/r/525121 (https://phabricator.wikimedia.org/T211248) [16:00:52] (03CR) 10jerkins-bot: [V: 04-1] Clean up eventlogging_service_uri from maps. [puppet] - 10https://gerrit.wikimedia.org/r/525121 (https://phabricator.wikimedia.org/T211248) (owner: 10Ppchelko) [16:01:29] 10Operations, 10SRE-Access-Requests, 10Release-Engineering-Team (Deployment services): Request access to deployment cluster for Jakob_WMDE - https://phabricator.wikimedia.org/T227193 (10hashar) [[ https://gerrit.wikimedia.org/r/#/admin/projects/operations/mediawiki-config,access mediawiki-config access ]] ar... [16:01:33] 10Operations, 10serviceops, 10Core Platform Team Workboards (Green): Keys from MediaWiki Redis Instances - https://phabricator.wikimedia.org/T228703 (10jijiki) 05Open→03Resolved @holger.knust I copied a gzipped dump to a server you have access to, please reopen when you need newer one:) [16:01:39] PROBLEM - HTTP availability for Nginx -SSL terminators- at esams on icinga1001 is CRITICAL: cluster=cache_text site=esams https://wikitech.wikimedia.org/wiki/Cache_TLS_termination https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [16:02:27] (03PS3) 10Ppchelko: Clean up eventlogging_service_uri from maps. [puppet] - 10https://gerrit.wikimedia.org/r/525121 (https://phabricator.wikimedia.org/T211248) [16:02:30] is this related to a5? [16:02:31] PROBLEM - HTTP availability for Varnish at esams on icinga1001 is CRITICAL: job=varnish-text site=esams https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 [16:02:47] 10Operations, 10SRE-Access-Requests, 10Release-Engineering-Team (Deployment services): Request access to deployment cluster for Jakob_WMDE - https://phabricator.wikimedia.org/T227193 (10hashar) Confirmed to me by @Jakob_WMDE ! 
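The Text/Esams 5xx alerts above ("30.00% of data above the critical threshold [1000.0]") come from a graphite-backed check: it fetches recent datapoints for a metric and fires when the fraction of points sitting above a fixed value crosses a warning or critical percentage. A minimal sketch of that logic in Python — function names and the exact warn/crit percentages are illustrative, not the actual check_graphite implementation:

```python
def percent_above(datapoints, threshold):
    """Percentage of non-null datapoints strictly above a threshold."""
    values = [v for v in datapoints if v is not None]
    if not values:
        return 0.0
    over = sum(1 for v in values if v > threshold)
    return 100.0 * over / len(values)

def alert_state(datapoints, warn_pct, crit_pct, threshold):
    """Map a series to OK/WARNING/CRITICAL by percent-above-threshold."""
    pct = percent_above(datapoints, threshold)
    if pct >= crit_pct:
        return "CRITICAL"
    if pct >= warn_pct:
        return "WARNING"
    return "OK"
```

Graphite returns null for intervals with no data, which is why the sketch drops `None` before computing the percentage.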
[16:02:51] RECOVERY - puppet last run on db1121 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [16:02:53] RECOVERY - Host ps1-a5-eqiad is UP: PING OK - Packet loss = 0%, RTA = 3.23 ms [16:02:59] PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 229, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [16:03:03] 10Operations, 10SRE-Access-Requests, 10Release-Engineering-Team (Deployment services): Request access to deployment cluster for Jakob_WMDE - https://phabricator.wikimedia.org/T227193 (10WMDE-leszek) merci beaucoup @hashar! [16:03:24] 10Operations, 10ops-codfw: (OoW) wtp2013 memory correctable errors - https://phabricator.wikimedia.org/T194174 (10Papaul) 05Open→03Resolved - Replace DIMM B2 - Clear log - Upgrade BIOS from 2.3 to 2.6 - Upgrade IDRAC from 1.57 to 2.61 All looks good now. Resolving this task [16:03:40] 10Operations, 10SRE-Access-Requests, 10Release-Engineering-Team (Deployment services): Request access to deployment cluster for Jakob_WMDE - https://phabricator.wikimedia.org/T227193 (10hashar) bitte schon `\o/` [16:03:53] 10Operations, 10Wikimedia-Mailing-lists: New Mailing lists for AzWiki sysops - https://phabricator.wikimedia.org/T228542 (10Aklapper) Any link to share to a discussion on Meta? [16:05:17] RECOVERY - Host mw2160.mgmt is UP: PING WARNING - Packet loss = 61%, RTA = 36.81 ms [16:05:21] RECOVERY - HTTP availability for Varnish at esams on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 [16:05:59] RECOVERY - HTTP availability for Nginx -SSL terminators- at esams on icinga1001 is OK: All metrics within thresholds.
https://wikitech.wikimedia.org/wiki/Cache_TLS_termination https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [16:09:39] ah no ok this is related to the link between esams and eqiad [16:09:55] (cr2-eqiad <-> cr2-esams seems down) [16:10:12] again? [16:10:13] RECOVERY - Text HTTP 5xx reqs/min on graphite1004 is OK: OK: Less than 1.00% above the threshold [250.0] https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/d/myRmf1Pik/varnish-aggregate-client-status-codes?var-cache_type=varnish-text&var-status_type=5 [16:10:17] RECOVERY - Esams HTTP 5xx reqs/min on graphite1004 is OK: OK: Less than 1.00% above the threshold [250.0] https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/d/myRmf1Pik/varnish-aggregate-client-status-codes?var-site=esams&var-status_type=5 [16:10:25] bblack: I am checking now, just noticed in icinga :( [16:12:18] bblack: I can see that cr2-eqiad now routes traffic to cr1-eqiad and then knams, so it seems so [16:13:37] PROBLEM - puppet last run on snapshot1007 is CRITICAL: CRITICAL: Failed to apply catalog, zero resources tracked by Puppet. Could be an interrupted request or a dependency cycle. 
https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [16:13:44] https://librenms.wikimedia.org/device/device=66/tab=port/port=16577/ [16:13:56] different than the last time though IIRC [16:15:37] no same link sigh [16:16:55] RECOVERY - Host mw2159.mgmt is UP: PING OK - Packet loss = 0%, RTA = 36.74 ms [16:18:01] it's a different one than the one I was thinking, I think [16:18:08] 10Operations, 10Wikimedia-Mailing-lists: New Mailing lists for AzWiki sysops - https://phabricator.wikimedia.org/T228542 (10Force_Radical) @Aklapper [[https://meta.wikimedia.org/wiki/Requests_for_comment/Do_something_about_azwiki | this RFC on metawiki]] [16:20:14] 10Operations, 10ops-eqiad, 10DC-Ops: a5-eqiad pdu refresh - https://phabricator.wikimedia.org/T227141 (10RobH) Both sides are swapped, and all items appear online. [16:22:11] !log pool prometheus1003 - T227139 [16:22:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:22:18] T227139: a3-eqiad pdu refresh - https://phabricator.wikimedia.org/T227139 [16:23:02] 10Operations, 10Wikimedia-Mailing-lists: New Mailing lists for AzWiki sysops - https://phabricator.wikimedia.org/T228542 (10Eldarado) >>! In T228542#5357749, @Force_Radical wrote: > @Aklapper [[https://meta.wikimedia.org/wiki/Requests_for_comment/Do_something_about_azwiki | this RFC on metawiki]] This discuss... [16:24:32] oh no, same one [16:26:30] so, GTT has been stable lately (cr1-eqiad xe-4/2/2.13 <-> cr2-knams xe-1/1/0.13), but Level3 not so much (cr2-eqiad xe-4/1/3 <-> cr2-esams xe-0/1/3) [16:26:37] yeah [16:26:49] and every time we have some impact [16:27:18] the last time there was some maintenance scheduled, but this time it seems not for eqiad? [16:27:22] the GTT one is MPLS [16:27:43] I need to go find the other wave out of eqord [16:28:09] PROBLEM - puppet last run on dbstore1004 is CRITICAL: CRITICAL: Failed to apply catalog, zero resources tracked by Puppet. Could be an interrupted request or a dependency cycle.
https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [16:29:25] I thought we had one anyways, looking again [16:29:42] bblack: (ignorant qs) GTT is MPLS on their side right? (trying to parse what you were writing :) [16:29:46] 10Operations, 10ops-codfw, 10DBA, 10Goal: rack/setup/install db21[21-30].codfw.wmnet - https://phabricator.wikimedia.org/T227113 (10Papaul) [16:30:12] yeah I was wrong about eqord, that was I guess some past plan that never materialized [16:30:44] So we have GTT MPLS and L3 wave listed in my comment earlier (and a tunnel backup) [16:31:31] the L3 wave means we have a physical fiber path, the GTT MPLS means it looks like fiber on each side to us, but it's really just a virtual circuit of sorts in GTT's network (generally these should have less availability and more latency variance than a real wave). [16:32:31] ah okok so my understanding was kinda good [16:32:33] thanks :) [16:33:58] 10Operations, 10Wikimedia-Mailing-lists: New Mailing lists for AzWiki sysops - https://phabricator.wikimedia.org/T228542 (10Force_Radical) @Eldarado A private admin-only mailing list is almost equivalent to having an AzWiki FB group, something that was criticized over at the RFC. Further, there have been discu... [16:34:15] the circuit is still dead I think [16:34:40] looks like it https://librenms.wikimedia.org/device/device=66/tab=port/port=16577/ [16:34:45] (and our traffic is now using the GTT MPLS, but there's always a disruption with 5xx alerts and such on the transition due to loss/reordering etc) [16:35:09] and I can see only Level3 maintenance for Tx, not Virginia [16:35:13] (03PS11) 10Jeena Huneidi: Add mediawiki development chart. 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/522584 (https://phabricator.wikimedia.org/T224935) [16:35:30] other than that, I don't see any alert from Level3 telling us that the circuit is broken [16:35:46] (03PS2) 10Cwhite: profile: cleanup per-site varnishkafka deploy flags [puppet] - 10https://gerrit.wikimedia.org/r/524934 (https://phabricator.wikimedia.org/T196066) [16:36:01] I'm compiling up some data from recent event logs on the L3 circuit [16:36:32] (03CR) 10Jeena Huneidi: [V: 03+2 C: 03+2] Add mediawiki development chart. [deployment-charts] - 10https://gerrit.wikimedia.org/r/522584 (https://phabricator.wikimedia.org/T224935) (owner: 10Jeena Huneidi) [16:39:11] (03CR) 10Dzahn: "ok, yea, fair enough. i just didn't have time for that yesterday. i had merged my previous change to make it required and then a couple pu" [puppet] - 10https://gerrit.wikimedia.org/r/524951 (https://phabricator.wikimedia.org/T197873) (owner: 10Dzahn) [16:41:49] RECOVERY - puppet last run on snapshot1007 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [16:41:59] https://phabricator.wikimedia.org/P8785 [16:42:12] ^ recent history on this L3 wave from librenms event logs for the port statuses [16:42:40] really nice [16:43:22] should we open a task with those info and let Arzhel contact L3 to figure out what's wrong? [16:43:44] I'll let him open one, he may have a different pov, and maybe multiple of those were planned maint on L3's end, I donno. [16:43:58] ack :) [16:44:32] XioNoX: ping - https://phabricator.wikimedia.org/P8785 - is there a problem here we should do something about? Seems like a lot of link outages lately. Could be xcvr issue, some of it I think is planned maint, I donno. It just seems like a recurrent disruption lately... [16:51:22] bblack: last qs - you mentioned the MPLS vs fiber quality of service, may I ask also info about the GRE tunnel? 
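As described above, the esams<->eqiad path has three transports of decreasing preference: the Level3 wave (a physical fiber path), the GTT MPLS circuit (looks like fiber on each side, but is a virtual circuit inside GTT's network), and a GRE tunnel as last resort. A toy Python model of that fallback order — the transport names and selection logic here are purely illustrative, not how the routers actually choose paths:

```python
# Most-preferred transport first; a real router does this with route
# metrics/preferences, not a Python list (illustrative only).
TRANSPORTS = ["level3-wave", "gtt-mpls", "gre-tunnel"]

def active_transport(link_up):
    """Return the most-preferred transport whose link is up, else None.

    link_up maps transport name -> bool (is the link currently usable).
    """
    for transport in TRANSPORTS:
        if link_up.get(transport):
            return transport
    return None
```

With the Level3 wave down, this model picks the GTT MPLS circuit, matching the observed behaviour in the log (traffic rerouted via cr1-eqiad and knams, with a brief 5xx blip during the transition).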
[16:51:52] last hope so probably not the best performer of the group [16:53:02] 10Operations, 10ops-codfw, 10DBA, 10Goal: rack/setup/install db21[21-30].codfw.wmnet - https://phabricator.wikimedia.org/T227113 (10Papaul) [16:56:29] RECOVERY - puppet last run on dbstore1004 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [16:58:46] !log ppchelko@deploy1001 Started deploy [changeprop/deploy@894f735]: Switch internal events to the new schema T226522 [16:58:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:58:54] T226522: Modern Event Platform: Stream Intake Service: Migrate change-prop events to new (EventGate) style schemas - https://phabricator.wikimedia.org/T226522 [16:59:09] PROBLEM - puppet last run on mwmaint2001 is CRITICAL: CRITICAL: Failed to apply catalog, zero resources tracked by Puppet. Could be an interrupted request or a dependency cycle. https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [17:00:04] cscott, arlolra, subbu, and halfak: Your horoscope predicts another unfortunate Services – Graphoid / Parsoid / Citoid / ORES deploy. May Zuul be (nice) with you. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190723T1700). [17:00:17] !log ppchelko@deploy1001 Finished deploy [changeprop/deploy@894f735]: Switch internal events to the new schema T226522 (duration: 01m 30s) [17:00:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:02:27] (03CR) 10Mholloway: [C: 04-1] Clean up eventlogging_service_uri from maps. 
(031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/525121 (https://phabricator.wikimedia.org/T211248) (owner: 10Ppchelko) [17:03:20] (03CR) 10Dzahn: [C: 03+2] Remove unused Apache config [puppet] - 10https://gerrit.wikimedia.org/r/525090 (owner: 10Muehlenhoff) [17:03:28] (03PS2) 10Dzahn: Remove unused Apache config [puppet] - 10https://gerrit.wikimedia.org/r/525090 (owner: 10Muehlenhoff) [17:04:46] (03CR) 10Cwhite: [C: 03+2] hiera: deploy varnishkafka exporter to ulsfo [puppet] - 10https://gerrit.wikimedia.org/r/524930 (https://phabricator.wikimedia.org/T196066) (owner: 10Cwhite) [17:04:54] (03PS2) 10Cwhite: hiera: deploy varnishkafka exporter to ulsfo [puppet] - 10https://gerrit.wikimedia.org/r/524930 (https://phabricator.wikimedia.org/T196066) [17:08:22] (03PS3) 10Dzahn: Remove unused Apache config [puppet] - 10https://gerrit.wikimedia.org/r/525090 (owner: 10Muehlenhoff) [17:14:51] (03PS1) 10Ppchelko: Revert "Add variables for map tile invalidation" [puppet] - 10https://gerrit.wikimedia.org/r/525131 (https://phabricator.wikimedia.org/T211248) [17:15:18] (03CR) 10jerkins-bot: [V: 04-1] Revert "Add variables for map tile invalidation" [puppet] - 10https://gerrit.wikimedia.org/r/525131 (https://phabricator.wikimedia.org/T211248) (owner: 10Ppchelko) [17:15:30] (03Abandoned) 10Ppchelko: Clean up eventlogging_service_uri from maps. [puppet] - 10https://gerrit.wikimedia.org/r/525121 (https://phabricator.wikimedia.org/T211248) (owner: 10Ppchelko) [17:15:49] (03CR) 10BBlack: [C: 03+1] "We just had some of these alerts, and confirm the availability alerts mirrored these removed ones (a minute or two later, but the one bein" [puppet] - 10https://gerrit.wikimedia.org/r/523891 (https://phabricator.wikimedia.org/T184942) (owner: 10Filippo Giunchedi) [17:15:52] (03CR) 10Ppchelko: "Heh, it's actually easier to just revert the commit that added these." 
[puppet] - 10https://gerrit.wikimedia.org/r/525121 (https://phabricator.wikimedia.org/T211248) (owner: 10Ppchelko) [17:16:24] 10Operations, 10ops-codfw, 10DBA, 10Goal: rack/setup/install db21[21-30].codfw.wmnet - https://phabricator.wikimedia.org/T227113 (10Papaul) [17:17:58] (03Restored) 10Ppchelko: Clean up eventlogging_service_uri from maps. [puppet] - 10https://gerrit.wikimedia.org/r/525121 (https://phabricator.wikimedia.org/T211248) (owner: 10Ppchelko) [17:18:19] (03CR) 10Ppchelko: "Or not :) Too many conflicts." [puppet] - 10https://gerrit.wikimedia.org/r/525121 (https://phabricator.wikimedia.org/T211248) (owner: 10Ppchelko) [17:18:29] (03Abandoned) 10Ppchelko: Revert "Add variables for map tile invalidation" [puppet] - 10https://gerrit.wikimedia.org/r/525131 (https://phabricator.wikimedia.org/T211248) (owner: 10Ppchelko) [17:20:16] (03PS6) 10Thcipriani: blubberoid: Add policy file [deployment-charts] - 10https://gerrit.wikimedia.org/r/517573 (https://phabricator.wikimedia.org/T215319) [17:20:47] (03CR) 10Thcipriani: [V: 03+2 C: 03+2] blubberoid: Add policy file [deployment-charts] - 10https://gerrit.wikimedia.org/r/517573 (https://phabricator.wikimedia.org/T215319) (owner: 10Thcipriani) [17:21:15] (03PS3) 10Ottomata: Clean up eventlogging_service_uri from RESTBase profile. [puppet] - 10https://gerrit.wikimedia.org/r/525114 (https://phabricator.wikimedia.org/T211248) (owner: 10Ppchelko) [17:22:15] (03PS4) 10Ppchelko: Clean up eventlogging_service_uri from maps. [puppet] - 10https://gerrit.wikimedia.org/r/525121 (https://phabricator.wikimedia.org/T109776) [17:22:50] (03CR) 10Ottomata: [C: 03+2] Clean up eventlogging_service_uri from RESTBase profile. [puppet] - 10https://gerrit.wikimedia.org/r/525114 (https://phabricator.wikimedia.org/T211248) (owner: 10Ppchelko) [17:23:16] (03CR) 10jerkins-bot: [V: 04-1] Clean up eventlogging_service_uri from maps. 
[puppet] - 10https://gerrit.wikimedia.org/r/525121 (https://phabricator.wikimedia.org/T109776) (owner: 10Ppchelko) [17:24:12] 10Operations, 10ops-codfw, 10decommission: Decommission db2042 - https://phabricator.wikimedia.org/T225090 (10Papaul) [17:24:18] (03PS5) 10Ppchelko: Clean up eventlogging_service_uri from maps. [puppet] - 10https://gerrit.wikimedia.org/r/525121 (https://phabricator.wikimedia.org/T109776) [17:25:18] (03CR) 10jerkins-bot: [V: 04-1] Clean up eventlogging_service_uri from maps. [puppet] - 10https://gerrit.wikimedia.org/r/525121 (https://phabricator.wikimedia.org/T109776) (owner: 10Ppchelko) [17:26:14] (03PS6) 10Ppchelko: Clean up eventlogging_service_uri from maps. [puppet] - 10https://gerrit.wikimedia.org/r/525121 (https://phabricator.wikimedia.org/T109776) [17:27:25] RECOVERY - puppet last run on mwmaint2001 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [17:29:53] (03CR) 10Mholloway: [C: 03+1] Clean up eventlogging_service_uri from maps. [puppet] - 10https://gerrit.wikimedia.org/r/525121 (https://phabricator.wikimedia.org/T109776) (owner: 10Ppchelko) [17:32:01] (03CR) 10Ottomata: [C: 03+2] Clean up eventlogging_service_uri from maps. 
[puppet] - 10https://gerrit.wikimedia.org/r/525121 (https://phabricator.wikimedia.org/T109776) (owner: 10Ppchelko) [17:36:35] !log ppchelko@deploy1001 Started deploy [changeprop/deploy@6c5c0a3]: Switch internal events to the new schema T226522, step 2 [17:36:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:36:45] T226522: Modern Event Platform: Stream Intake Service: Migrate change-prop events to new (EventGate) style schemas - https://phabricator.wikimedia.org/T226522 [17:38:11] !log ppchelko@deploy1001 Finished deploy [changeprop/deploy@6c5c0a3]: Switch internal events to the new schema T226522, step 2 (duration: 01m 37s) [17:38:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:38:37] (03PS3) 10Dzahn: Phabricator: Remove support for jessie [puppet] - 10https://gerrit.wikimedia.org/r/524718 (owner: 10Muehlenhoff) [17:39:14] (03CR) 10Alexandros Kosiaris: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/524525 (owner: 10Alexandros Kosiaris) [17:40:17] (03PS5) 10CDanis: noc db.php: include readonly status & group loads [mediawiki-config] - 10https://gerrit.wikimedia.org/r/524825 [17:40:41] (03CR) 10CDanis: "> Note that `$x ?? $y` is equivalent to `isset( $x ) ? $x : $y`." 
(031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/524825 (owner: 10CDanis) [17:41:17] jouncebot: next [17:41:18] In 5 hour(s) and 18 minute(s): Evening SWAT (Max 6 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190723T2300) [17:51:11] (03CR) 10Dzahn: [C: 03+2] Phabricator: Remove support for jessie [puppet] - 10https://gerrit.wikimedia.org/r/524718 (owner: 10Muehlenhoff) [17:52:30] !log installing Java security updates on kafka/main and Logstash servers [17:52:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:54:54] (03PS1) 10Elukey: profile::kerberos::kadminserver: add script to create principals [puppet] - 10https://gerrit.wikimedia.org/r/525137 (https://phabricator.wikimedia.org/T226104) [17:59:25] (03CR) 10CDanis: [C: 03+2] "thanks for the reviews!" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/524825 (owner: 10CDanis) [18:00:21] (03Merged) 10jenkins-bot: noc db.php: include readonly status & group loads [mediawiki-config] - 10https://gerrit.wikimedia.org/r/524825 (owner: 10CDanis) [18:00:36] (03CR) 10jenkins-bot: noc db.php: include readonly status & group loads [mediawiki-config] - 10https://gerrit.wikimedia.org/r/524825 (owner: 10CDanis) [18:03:16] !log cdanis@deploy1001 Synchronized docroot/noc/db.php: 8def4af1d noc db.php: include readonly status & group loads (duration: 00m 55s) [18:03:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:04:13] ehm [18:04:16] https://phabricator.wikimedia.org/P8786$55 [18:04:19] !log depool cp1077 + cp1088 - T227143 [18:04:20] ImportError: No module named concurrent.futures [18:04:22] during a scap run [18:04:24] that's new to me [18:04:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:04:26] T227143: a7-eqiad pdu refresh - https://phabricator.wikimedia.org/T227143 [18:05:34] !log lvs1013 - disable puppet and stop pybal - T227143 [18:05:40] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:05:59] !log jforrester@deploy1001 Synchronized php-1.34.0-wmf.14/includes/export/XmlDumpWriter.php: T228720 Make XmlDumpwriter resilient to blob store corruption (duration: 00m 57s) [18:06:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:06:07] T228720: stub for enwiki broken, attempt to load content for bad rev during sha1 retrieval - https://phabricator.wikimedia.org/T228720 [18:06:24] James_F: are you having issues with scap'ing to mw1267 as well? [18:06:28] !log Sync error on mw1314.eqiad.wmnet, No module named concurrent.futures [18:06:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:06:39] 1314 wtf [18:06:48] James_F: apergosI have that same error on mw1267 [18:06:56] Oh, hmm, no, 1267. [18:07:04] !log Belay that, error on mw1267. [18:07:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:07:09] ah hm [18:07:26] (Syncing wmf.15 too.) [18:07:38] also have it when I ssh to mw1267 and attempt scap pull [18:07:41] "concurrent.futures" sounds familiar. [18:07:45] 10Operations, 10ops-eqiad, 10DC-Ops: b3-eqiad pdu refresh - https://phabricator.wikimedia.org/T227539 (10RobH) [18:07:56] Aha! [18:07:57] https://phabricator.wikimedia.org/T228482 [18:08:07] !log jforrester@deploy1001 Synchronized php-1.34.0-wmf.15/includes/export/XmlDumpWriter.php: T228720 Make XmlDumpwriter resilient to blob store corruption (duration: 00m 54s) [18:08:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:08:25] Did that box not get the python module or whatever? [18:08:50] dpkg -s scap | grep Version --> Version: 3.11.0-1 [18:08:56] so it has the version that doesn't have the dependency [18:09:08] unless anyone objects I am going to apt-get install the dependency there by hand [18:09:13] Oh dear. Yeah, please do. 
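The `ImportError: No module named concurrent.futures` above arises because newer scap parallelizes work with the `concurrent.futures` module, which is built into Python 3 but on Python 2 hosts only exists via the `futures` backport (Debian package `python-concurrent.futures`) — the package mw1267 was missing. An illustrative sketch of that dependency pattern; `sync_host`/`sync_all` are hypothetical stand-ins, not scap's actual code:

```python
# concurrent.futures ships with Python 3; on Python 2 it must come from
# the "futures" backport, otherwise the import fails exactly as in the
# scap traceback above.
try:
    from concurrent.futures import ThreadPoolExecutor
except ImportError:  # Python 2 without python-concurrent.futures installed
    raise SystemExit("install python-concurrent.futures (the 'futures' backport)")

def sync_host(host):
    # Placeholder for per-host work (e.g. fetching code to one appserver).
    return (host, "ok")

def sync_all(hosts, workers=4):
    """Run sync_host across hosts in parallel, returning {host: status}."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return dict(pool.map(sync_host, hosts))
```

This is also why simply upgrading the scap package everywhere (which declares the dependency) fixes the problem, as discussed later in the log.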
[18:09:21] PROBLEM - pybal on lvs1013 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 0 (root), args /usr/sbin/pybal https://wikitech.wikimedia.org/wiki/PyBal [18:09:51] (And then do a scap pull on the box?) [18:09:54] !log cdanis@mw1267.eqiad.wmnet ~ ☕ sudo apt install python-concurrent.futures [18:10:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:10:07] !log cdanis@mw1267.eqiad.wmnet /srv/mediawiki ☕ scap pull [18:10:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:10:17] ehm [18:10:20] sudo: /usr/local/bin/mwscript: command not found [18:10:22] 18:10:09 pull failed: Command '/usr/local/bin/mwscript extensions/WikimediaMaintenance/refreshMessageBlobs.php' returned non-zero exit status 1 [18:10:31] PROBLEM - PyBal backends health check on lvs1013 is CRITICAL: PYBAL CRITICAL - Bad Response from pybal: 500 Cant connect to localhost:9090 https://wikitech.wikimedia.org/wiki/PyBal [18:10:35] cdanis: there is an open bug for this [18:10:44] it will soon be fixed [18:10:53] hrm, scap should probably be updated everywhere...that would fix it [18:10:59] jijiki: okay, but, if we can't scap to this appserver, shouldn't it be depooled? 
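The `dpkg -s scap | grep Version` check quoted earlier can also be done by parsing dpkg's control-format output directly instead of grepping; a small illustrative helper (hypothetical, not part of scap or debmonitor):

```python
def dpkg_field(status_output, field="Version"):
    """Extract a field (e.g. Version) from `dpkg -s <pkg>` output.

    dpkg -s prints RFC-822-style "Field: value" lines; returns the
    value of the requested field, or None if it is absent.
    """
    prefix = field + ":"
    for line in status_output.splitlines():
        if line.startswith(prefix):
            return line[len(prefix):].strip()
    return None
```

For example, feeding it the output seen on mw1267 (`Version: 3.11.0-1`) returns the version string that identified the host as still running the scap release without the declared dependency.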
[18:11:05] PROBLEM - Host ms-be1029 is DOWN: PING CRITICAL - Packet loss = 100% [18:11:19] https://phabricator.wikimedia.org/T228482 [18:11:22] !log depool mw1267 [18:11:22] ACKNOWLEDGEMENT - PyBal backends health check on lvs1013 is CRITICAL: PYBAL CRITICAL - Bad Response from pybal: 500 Cant connect to localhost:9090 Brandon Black T227143 https://wikitech.wikimedia.org/wiki/PyBal [18:11:22] ACKNOWLEDGEMENT - PyBal connections to etcd on lvs1013 is CRITICAL: CRITICAL: 0 connections established with conf1004.eqiad.wmnet:4001 (min=8) Brandon Black T227143 https://wikitech.wikimedia.org/wiki/PyBal [18:11:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:11:55] PROBLEM - Host ms-be1030 is DOWN: PING CRITICAL - Packet loss = 100% [18:12:01] PROBLEM - Host ms-be1028 is DOWN: PING CRITICAL - Packet loss = 100% [18:12:38] I assume, at least, that if you can't scap somewhere, there's easily the potential for config or code to get out of date on that server, and it shouldn't be pooled [18:12:41] sorry, I was completely checked out already...
[18:12:49] PROBLEM - Host ms-be1040 is DOWN: PING CRITICAL - Packet loss = 100% [18:13:16] those are a7 and intended aiui [18:13:16] !log started depooling servers in a7-eqiad for pdu work via T227143 [18:13:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:13:23] T227143: a7-eqiad pdu refresh - https://phabricator.wikimedia.org/T227143 [18:13:50] cdanis yes, no scap = depool it for sure [18:14:10] jouncebot: next [18:14:10] In 4 hour(s) and 45 minute(s): Evening SWAT (Max 6 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190723T2300) [18:14:18] i think this is a good time to upgrade scap [18:14:43] mutante: lgtm ;) [18:14:56] ok :) [18:17:33] PROBLEM - Host ps1-a7-eqiad is DOWN: PING CRITICAL - Packet loss = 100% [18:18:12] I have not seen /usr/local/bin/mwscript: command not found before, but I assume there is some good explanation for it [18:18:25] cdanis: yes, that is why i want to upgrade scap [18:18:34] ah ok [18:18:36] where do you see it? 
[18:18:40] mw1267 [18:18:46] (now depooled) [18:18:51] https://phabricator.wikimedia.org/T228328 [18:18:55] that is the background [18:19:07] ok, ack [18:19:17] (03PS1) 10Volans: Fix extras_require key for use in console_scripts [software/conftool] - 10https://gerrit.wikimedia.org/r/525140 [18:19:23] ahh I see, I thought the issue was just the missing python concurrent dependency [18:19:25] ty mutante [18:19:27] cdanis: ^^^ [18:19:49] (03CR) 10CDanis: [C: 03+2] Fix extras_require key for use in console_scripts [software/conftool] - 10https://gerrit.wikimedia.org/r/525140 (owner: 10Volans) [18:20:02] volans: 𝓪𝓹𝓹𝓻𝓸𝓿𝓮𝓭 [18:20:05] rotfl [18:20:29] pretty :) [18:20:35] yes ms-be 1028,29,30,40 are me [18:20:46] i left them in monitoring because i wanted to see them come back when we finish [18:20:57] and ps1-a7-eqiad ping loss is also epected [18:21:00] expected even [18:22:51] (03Merged) 10jenkins-bot: Fix extras_require key for use in console_scripts [software/conftool] - 10https://gerrit.wikimedia.org/r/525140 (owner: 10Volans) [18:23:13] 10Operations, 10ops-eqiad, 10DC-Ops: (July 22-26) install new PDUs in rows A/B (Top level tracking task) - https://phabricator.wikimedia.org/T226778 (10aborrero) The more "easy" racks for us in row B are `B3` and `B6`. I propose we start with these. rack `B3` contains cloudvirt1027 and we would like to real... [18:24:01] oh, wow "generate-debdeploy-spec". i remember writing those files manually last time i used debdeploy for something :) [18:24:05] that's nice [18:25:33] yeah it's pretty good [18:25:41] mutante: I have in my home dir [18:25:44] wait. somebody else already upgrade it? it looks like it heh [18:25:46] from the previous scap updgrade [18:25:58] jijiki: did you upgrade scap by any chance? [18:26:04] about a month ago [18:26:10] I see 3.11.0-1 on mw1267, not 3.11.1-1 [18:26:17] hmm. 
no, i meant like yesterday, heh [18:26:21] not at all [18:26:40] oh, nevermind, i am just reading the output of debmonitor the wrong way [18:26:45] it's upgrade-able to that [18:26:47] alright [18:28:23] PROBLEM - Work requests waiting in Zuul Gearman server on contint1001 is CRITICAL: CRITICAL: 57.14% of data above the critical threshold [140.0] https://www.mediawiki.org/wiki/Continuous_integration/Zuul https://grafana.wikimedia.org/dashboard/db/zuul-gearman?panelId=10&fullscreen&orgId=1 [18:29:29] PROBLEM - Host asw2-a-eqiad is DOWN: PING CRITICAL - Packet loss = 100% [18:30:35] RECOVERY - Host asw2-a-eqiad is UP: PING OK - Packet loss = 0%, RTA = 1.31 ms [18:30:37] PROBLEM - Host ms-be1040 is DOWN: PING CRITICAL - Packet loss = 100% [18:30:51] PROBLEM - Host ms-be1028 is DOWN: PING CRITICAL - Packet loss = 100% [18:30:51] PROBLEM - Host ms-be1029 is DOWN: PING CRITICAL - Packet loss = 100% [18:30:51] PROBLEM - Host ms-be1030 is DOWN: PING CRITICAL - Packet loss = 100% [18:33:09] ok, expected [18:33:17] not the asw2-a [18:33:20] but the ms-be was [18:33:46] we are going to kill power on one of the two sides in a7 now [18:33:48] mgmt may flap [18:34:26] 10Operations, 10cloud-services-team: Migrate remaining cloudvirt hosts to Stretch/Mitaka - https://phabricator.wikimedia.org/T224561 (10Krenair) [18:34:37] PROBLEM - MediaWiki memcached error rate on graphite1004 is CRITICAL: CRITICAL: 80.00% of data above the critical threshold [5000.0] https://wikitech.wikimedia.org/wiki/Memcached https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen [18:38:35] PROBLEM - puppet last run on restbase-dev1004 is CRITICAL: CRITICAL: Failed to apply catalog, zero resources tracked by Puppet. Could be an interrupted request or a dependency cycle. 
https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [18:42:42] side a done, doing side b [18:42:45] mgmt may flap in a7 [18:43:43] !log rolling out scap 3.11.1-1 on mw canary servers (T228328) [18:43:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:43:51] T228328: 'scap pull' stopped working on appservers ? - https://phabricator.wikimedia.org/T228328 [18:44:41] RECOVERY - MediaWiki memcached error rate on graphite1004 is OK: OK: Less than 40.00% above the threshold [1000.0] https://wikitech.wikimedia.org/wiki/Memcached https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen [18:45:36] !log rolling out scap 3.11.1-1 on all mw codfw servers (T228328) [18:45:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:46:45] RECOVERY - Work requests waiting in Zuul Gearman server on contint1001 is OK: OK: Less than 30.00% above the threshold [90.0] https://www.mediawiki.org/wiki/Continuous_integration/Zuul https://grafana.wikimedia.org/dashboard/db/zuul-gearman?panelId=10&fullscreen&orgId=1 [18:50:27] PROBLEM - Host mw1271 is DOWN: PING CRITICAL - Packet loss = 100% [18:50:53] RECOVERY - Host mw1271 is UP: PING OK - Packet loss = 0%, RTA = 1.12 ms [18:50:57] PROBLEM - PHP7 rendering on mw1271 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [18:51:03] PROBLEM - Apache HTTP on mw1271 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [18:51:03] ok.. 
was about to say that host is up [18:52:21] we are changing the pdu [18:52:24] but that seems odd for mw1271 [18:52:29] RECOVERY - PHP7 rendering on mw1271 is OK: HTTP OK: HTTP/1.1 200 OK - 77329 bytes in 0.208 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [18:52:33] RECOVERY - Apache HTTP on mw1271 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 617 bytes in 0.243 second response time https://wikitech.wikimedia.org/wiki/Application_servers [18:52:36] 2min [18:52:38] it rebooted [18:52:39] mutante: [18:52:48] uptime of 2 minutes [18:52:56] it was a casualty in our a7-pdu swap [18:52:59] seems to be the only one [18:53:18] !log mw1271 had power loss event due to pdu swap via T227143 [18:53:20] robh: ah! ok, it will recover in a minute then. cool [18:53:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:53:26] T227143: a7-eqiad pdu refresh - https://phabricator.wikimedia.org/T227143 [18:53:58] we're finishing the cabling before having folks return things to service [18:54:39] PROBLEM - PHP7 rendering on mw1270 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [18:54:47] PROBLEM - HHVM rendering on mw1312 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [18:55:17] PROBLEM - PyBal backends health check on lvs1015 is CRITICAL: PYBAL CRITICAL - CRITICAL - logstash-json-tcp_11514: Servers logstash1009.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [18:56:17] PROBLEM - PHP7 rendering on mw1269 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [18:56:25] PROBLEM - PyBal backends health check on lvs1016 is CRITICAL: PYBAL CRITICAL - CRITICAL - logstash-json-tcp_11514: Servers logstash1009.eqiad.wmnet are marked down but 
pooled: logstash-syslog-tcp_10514: Servers logstash1007.eqiad.wmnet, logstash1009.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [18:56:43] mw 1270, 1312, 1269? [18:56:49] PROBLEM - HHVM rendering on mw1274 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [18:56:51] also logstash1009 ... [18:56:53] looking [18:56:57] PROBLEM - PHP7 rendering on mw1312 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [18:57:11] logstash is on ganeti [18:57:15] mw1312 confirmed up [18:57:29] and uptime 49 days [18:57:49] mw1312 is over in A6, not A7 (the rendering alerts) [18:57:50] unlike 1271 above which rebooted [18:57:51] RECOVERY - PHP7 rendering on mw1269 is OK: HTTP OK: HTTP/1.1 200 OK - 77329 bytes in 0.486 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [18:57:57] RECOVERY - PHP7 rendering on mw1270 is OK: HTTP OK: HTTP/1.1 200 OK - 77329 bytes in 4.026 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [18:58:01] PROBLEM - MediaWiki memcached error rate on graphite1004 is CRITICAL: CRITICAL: 80.00% of data above the critical threshold [5000.0] https://wikitech.wikimedia.org/wiki/Memcached https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen [18:58:05] RECOVERY - HHVM rendering on mw1312 is OK: HTTP OK: HTTP/1.1 200 OK - 77284 bytes in 3.122 second response time https://wikitech.wikimedia.org/wiki/Application_servers [18:58:09] PROBLEM - PHP7 rendering on mw1268 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [18:58:27] some of these showing rendering socket timeouts are in A7 though [18:58:35] PROBLEM - recommendation_api endpoints health on scb2004 is
CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [18:58:37] PROBLEM - PHP7 rendering on mw1277 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [18:58:58] 1268, 1269, 1277 [18:59:00] all A7 [18:59:14] win 30 [18:59:41] RECOVERY - PHP7 rendering on mw1268 is OK: HTTP OK: HTTP/1.1 200 OK - 77329 bytes in 0.645 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [18:59:46] for mw hosts, it's 1267 - 1283 that are in A7 [19:00:09] RECOVERY - HHVM rendering on mw1274 is OK: HTTP OK: HTTP/1.1 200 OK - 77284 bytes in 6.987 second response time https://wikitech.wikimedia.org/wiki/Application_servers [19:00:09] RECOVERY - PHP7 rendering on mw1277 is OK: HTTP OK: HTTP/1.1 200 OK - 77329 bytes in 0.140 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [19:00:11] RECOVERY - PHP7 rendering on mw1312 is OK: HTTP OK: HTTP/1.1 200 OK - 77331 bytes in 2.111 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [19:00:17] RECOVERY - recommendation_api endpoints health on scb2004 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [19:00:41] PROBLEM - logstash syslog TCP port on logstash1009 is CRITICAL: connect to address 127.0.0.1 and port 10514: Connection refused https://wikitech.wikimedia.org/wiki/Logstash [19:01:12] logstash seems unrelated [19:02:17] seems like java is maxing out a cpu [19:02:25] PROBLEM - logstash JSON linesTCP port on logstash1007 is CRITICAL: connect to address 127.0.0.1 and port 11514: Connection refused https://wikitech.wikimedia.org/wiki/Logstash [19:02:53] so, semi-related I think [19:03:05] 
PROBLEM - logstash syslog TCP port on logstash1007 is CRITICAL: connect to address 127.0.0.1 and port 10514: Connection refused https://wikitech.wikimedia.org/wiki/Logstash [19:03:18] 1277 - "GET /w/api.php?format=json&action=opensearch&namespace=14&limit=30&search=Category:Och") executing too slow (19.283281 sec) [19:03:23] probably the other stuff going on is throwing too much log traffic at logstash to handle [19:03:37] PROBLEM - puppet last run on tungsten is CRITICAL: CRITICAL: Failed to apply catalog, zero resources tracked by Puppet. Could be an interrupted request or a dependency cycle. https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [19:03:45] hmm yeah [19:04:10] https://grafana.wikimedia.org/d/000000561/logstash?orgId=1 [19:04:15] kafka piled up a bunch of consumer lag, but appears to be flattening out or dropping now [19:04:41] but the input rate is going down instead of up? [19:04:45] RECOVERY - logstash syslog TCP port on logstash1007 is OK: TCP OK - 0.000 second response time on 127.0.0.1 port 10514 https://wikitech.wikimedia.org/wiki/Logstash [19:04:46] ah [19:04:47] https://grafana.wikimedia.org/d/000000102/production-logging [19:04:53] because it's failing to handle input well and losing inputs [19:05:01] gotcha [19:05:22] MW seems to be sending a bunch of memcached errors to logstash [19:05:28] there was a memcached alert further up somewhere [19:05:43] RECOVERY - logstash JSON linesTCP port on logstash1007 is OK: TCP OK - 0.000 second response time on 127.0.0.1 port 11514 https://wikitech.wikimedia.org/wiki/Logstash [19:05:50] 18:34 <+icinga-wm> PROBLEM - MediaWiki memcached error rate on graphite1004 is CRITICAL: CRITICAL: 80.00% ...
[19:06:02] big spike of memcached logs [19:06:04] !log restarting logstash on logstash100[789] [19:06:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:06:13] PROBLEM - logstash JSON linesTCP port on logstash1008 is CRITICAL: connect to address 127.0.0.1 and port 11514: Connection refused https://wikitech.wikimedia.org/wiki/Logstash [19:06:47] https://grafana.wikimedia.org/d/000000316/memcache?orgId=1 [19:06:55] RECOVERY - puppet last run on restbase-dev1004 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [19:07:05] PROBLEM - PHP7 rendering on mwdebug1002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [19:07:10] lots of connections-yielded recently there for memcached [19:07:11] PROBLEM - logstash JSON linesTCP port on logstash1009 is CRITICAL: connect to address 127.0.0.1 and port 11514: Connection refused https://wikitech.wikimedia.org/wiki/Logstash [19:07:19] PROBLEM - Too many messages in kafka logging-eqiad on icinga1001 is CRITICAL: cluster=misc exported_cluster=logging-eqiad group={logstash,logstash-codfw} instance=kafkamon1001:9501 job=burrow partition={0,1,2} site=eqiad topic={rsyslog-info,rsyslog-notice,udp_localhost-err,udp_localhost-info,udp_localhost-warning} https://wikitech.wikimedia.org/wiki/Logstash%23Kafka_consumer_lag https://grafana.wikimedia.org/d/000000484/kafka-con [19:07:19] w-3h&to=now&orgId=1&var-datasource=eqiad+prometheus/ops&var-cluster=logging-eqiad&var-topic=All&var-consumer_group=All [19:07:21] PROBLEM - puppet last run on db1105 is CRITICAL: CRITICAL: Failed to apply catalog, zero resources tracked by Puppet. Could be an interrupted request or a dependency cycle. 
https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [19:07:21] https://grafana.wikimedia.org/d/000000316/memcache?panelId=41&fullscreen&orgId=1 [19:07:33] PROBLEM - HHVM rendering on mw1312 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [19:07:43] PROBLEM - logstash syslog TCP port on logstash1008 is CRITICAL: connect to address 127.0.0.1 and port 10514: Connection refused https://wikitech.wikimedia.org/wiki/Logstash [19:07:46] ^ command rate has been elevated for a while now... since spiking up around 18:03 [19:07:50] (~hr ago) [19:07:59] PROBLEM - HHVM rendering on mw1270 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [19:08:01] PROBLEM - PHP7 rendering on mw1312 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [19:08:19] mc1030 thru mc1034 and mc1036 seem the affected ones [19:08:19] bblack: checking mcrouter [19:08:27] https://grafana.wikimedia.org/d/000000549/mcrouter?orgId=1&var-source=eqiad%20prometheus%2Fops&var-cluster=All&var-instance=mw1261&var-memcached_server=All [19:08:27] from the "connections yielded" graph [19:08:34] there were some deploys around that time [19:08:41] also a java security update for logstash just before [19:08:45] so many balls in the air! [19:08:49] seems mw1261? [19:09:16] PROBLEM - puppet last run on mc1021 is CRITICAL: CRITICAL: Failed to apply catalog, zero resources tracked by Puppet. Could be an interrupted request or a dependency cycle. 
https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [19:09:18] !log depool mw1261 for investigation [19:09:23] RECOVERY - logstash syslog TCP port on logstash1008 is OK: TCP OK - 0.000 second response time on 127.0.0.1 port 10514 https://wikitech.wikimedia.org/wiki/Logstash [19:09:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:09:30] eh, i just ran "scap pull" on that mw1261 [19:09:33] RECOVERY - logstash JSON linesTCP port on logstash1008 is OK: TCP OK - 0.000 second response time on 127.0.0.1 port 11514 https://wikitech.wikimedia.org/wiki/Logstash [19:09:39] RECOVERY - HHVM rendering on mw1270 is OK: HTTP OK: HTTP/1.1 200 OK - 77284 bytes in 8.389 second response time https://wikitech.wikimedia.org/wiki/Application_servers [19:10:06] mutante: ah sorry! [19:10:19] errors are going down [19:10:28] elukey: no, i'm just saying that is a coincidence that you name that specific host.. hmm [19:10:36] !log repool cp1077 + cp1078 - T227143 [19:10:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:10:42] T227143: a7-eqiad pdu refresh - https://phabricator.wikimedia.org/T227143 [19:10:47] RECOVERY - Host ps1-a7-eqiad is UP: PING OK - Packet loss = 0%, RTA = 2.25 ms [19:10:57] PROBLEM - LVS HTTP IPv4 on logstash.svc.eqiad.wmnet is CRITICAL: connect to address 10.2.2.36 and port 10514: Connection refused https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [19:11:09] paged [19:11:27] PROBLEM - logstash syslog TCP port on logstash1007 is CRITICAL: connect to address 127.0.0.1 and port 10514: Connection refused https://wikitech.wikimedia.org/wiki/Logstash [19:11:27] !log repool lvs1013 - T227143 [19:11:29] PROBLEM - restbase endpoints health on restbase1021 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase 
[19:11:31] * cdanis reading scrollback [19:11:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:11:37] that was logstash which got restarted [19:11:47] 10Operations, 10ops-eqiad, 10DC-Ops: a5-eqiad pdu refresh - https://phabricator.wikimedia.org/T227141 (10RobH) p:05Triage→03Normal [19:11:55] 10Operations, 10ops-eqiad, 10DC-Ops: a5-eqiad pdu refresh - https://phabricator.wikimedia.org/T227141 (10RobH) 05Open→03Resolved [19:11:57] 10Operations, 10ops-eqiad, 10DC-Ops: (July 22-26) install new PDUs in rows A/B (Top level tracking task) - https://phabricator.wikimedia.org/T226778 (10RobH) [19:12:02] <_joe_> hey I'm around [19:12:11] RECOVERY - logstash JSON linesTCP port on logstash1009 is OK: TCP OK - 0.000 second response time on 127.0.0.1 port 11514 https://wikitech.wikimedia.org/wiki/Logstash [19:12:14] cdanis: starts circa 18:03, not related to A7 power work. Some issue with memcached error rates for mediawiki, spilling over into excessive logstash load, etc [19:12:19] _joe_: ^ [19:12:21] 10Operations, 10ops-eqiad, 10DC-Ops: (July 22-26) install new PDUs in rows A/B (Top level tracking task) - https://phabricator.wikimedia.org/T226778 (10RobH) [19:12:21] RECOVERY - PyBal backends health check on lvs1013 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [19:12:23] RECOVERY - logstash syslog TCP port on logstash1009 is OK: TCP OK - 0.000 second response time on 127.0.0.1 port 10514 https://wikitech.wikimedia.org/wiki/Logstash [19:12:27] RECOVERY - Host ms-be1040 is UP: PING OK - Packet loss = 0%, RTA = 2.25 ms [19:12:32] bblack: the memcached errors are unrelated? no mc hosts in A7? [19:12:33] PROBLEM - PHP7 rendering on mw1268 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [19:12:36] <_joe_> bblack: did anyone look at those logs? 
[19:12:42] not yet, just graphs [19:12:46] <_joe_> also what's up with all the php7 rendering alerts? [19:12:49] RECOVERY - pybal on lvs1013 is OK: PROCS OK: 1 process with UID = 0 (root), args /usr/sbin/pybal https://wikitech.wikimedia.org/wiki/PyBal [19:13:07] RECOVERY - restbase endpoints health on restbase1021 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [19:13:08] 04Critical Alert for device ps1-a7-eqiad.mgmt.eqiad.wmnet - Device rebooted [19:13:09] _joe_: many of the recent rendering alerts were A7 hosts, likely some network blip... [19:13:11] RECOVERY - Host ms-be1029 is UP: PING OK - Packet loss = 0%, RTA = 0.25 ms [19:13:13] affected mc hosts are in C5 [19:13:13] <_joe_> did we lose the mc hosts? [19:13:21] _joe_: mw1267 through mw1283 are on A7, but no mc* hosts [19:13:21] RECOVERY - Host ms-be1028 is UP: PING WARNING - Packet loss = 86%, RTA = 0.19 ms [19:13:27] RECOVERY - Host ms-be1030 is UP: PING OK - Packet loss = 0%, RTA = 0.18 ms [19:13:27] <_joe_> ok [19:13:30] but there was elevated MC command rates going back to ~18:03, way before that [19:13:31] PROBLEM - HHVM rendering on mwdebug1002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [19:13:34] we only lost power on a single mw host im aware of [19:13:40] and D4 [19:13:40] D4: iscap 'project level' commands - https://phabricator.wikimedia.org/D4 [19:13:40] and then MC went a bit crazier on other graphs more-recently [19:13:59] there was some deploy traffic shortly before the elevated MC rates [19:14:09] https://grafana.wikimedia.org/d/000000316/memcache?panelId=41&fullscreen&orgId=1 [19:14:11] PROBLEM - PHP7 rendering on mw1273 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [19:14:14] lots of memcached SERVER ERROR logs [19:14:15] PROBLEM - logstash JSON linesTCP port on logstash2006 is 
CRITICAL: connect to address 127.0.0.1 and port 11514: Connection refused https://wikitech.wikimedia.org/wiki/Logstash [19:14:19] RECOVERY - LVS HTTP IPv4 on logstash.svc.eqiad.wmnet is OK: TCP OK - 0.000 second response time on 10.2.2.36 port 10514 https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [19:14:21] https://grafana.wikimedia.org/d/000000316/memcache?orgId=1 [19:14:22] robh: did we do A6? [19:14:27] memcache traffic patterns have shifted [19:14:45] elukey: no, it has a db master [19:14:47] https://grafana.wikimedia.org/d/000000316/memcache?panelId=38&fullscreen&orgId=1 [19:14:49] what is this graph? [19:14:51] RECOVERY - logstash syslog TCP port on logstash1007 is OK: TCP OK - 0.000 second response time on 127.0.0.1 port 10514 https://wikitech.wikimedia.org/wiki/Logstash [19:14:51] back at 18:03 when that increase first happened, nothing should've been happening that mattered with A7 power yet [19:14:51] connections yielded? [19:14:57] no ok sorry my link was wrong [19:14:57] https://grafana.wikimedia.org/d/000000549/mcrouter?orgId=1&var-source=eqiad%20prometheus%2Fops&var-cluster=All&var-instance=All&var-memcached_server=All [19:15:03] more than one appserver [19:15:33] cdanis: those are memcached threads reaching the max conns to process in a row, and yielding the tcp conn to process other ones [19:15:34] (and A7 doesn't have MC hosts, but does have kafka-main1001) [19:15:40] <_joe_> https://grafana.wikimedia.org/d/000000316/memcache?orgId=1&panelId=38&fullscreen does this correspond to any release?
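The "connections yielded" metric being explained above comes from memcached's `conn_yields` counter: it increments each time a worker thread hits the per-event request cap on one busy connection (memcached's `-R` option, default 20) and hands that socket back to the event loop so other connections get served. A hedged sketch of pulling the counter out of raw `stats` output — the sample values are invented for illustration, not real numbers from the mc10xx hosts:

```python
# Parse memcached "stats" output (text protocol) into a dict, to read the
# counters behind the "connections yielded" graph. conn_yields rises when
# a thread yields a connection after hitting the -R request-per-event cap.
def parse_stats(raw: str) -> dict:
    """Turn 'STAT <name> <value>' lines into {name: value} (ints where possible)."""
    stats = {}
    for line in raw.splitlines():
        parts = line.split()
        if len(parts) == 3 and parts[0] == "STAT":
            stats[parts[1]] = int(parts[2]) if parts[2].isdigit() else parts[2]
    return stats

# Invented sample "stats" response for illustration:
SAMPLE = """\
STAT curr_connections 512
STAT conn_yields 10431
STAT cmd_get 73219876
END"""

stats = parse_stats(SAMPLE)
print(stats["conn_yields"])  # -> 10431
```

Against a live server the same text would come from the `stats` command over port 11211 (e.g. via `nc`); a rising conn_yields rate is the shape being discussed in the Grafana memcache dashboard linked above.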
[19:15:45] RECOVERY - PHP7 rendering on mw1273 is OK: HTTP OK: HTTP/1.1 200 OK - 77330 bytes in 1.027 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [19:15:49] RECOVERY - PHP7 rendering on mw1268 is OK: HTTP OK: HTTP/1.1 200 OK - 77329 bytes in 0.454 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [19:15:59] RECOVERY - logstash JSON linesTCP port on logstash2006 is OK: TCP OK - 0.000 second response time on 127.0.0.1 port 11514 https://wikitech.wikimedia.org/wiki/Logstash [19:16:15] _joe_: no, but that does correspond with the rendering errors cropping up in mostly-A7 MW hosts [19:16:28] (the ~18:50-onwards anomalies there) [19:16:28] mc1033 through mc1036 are on D4, was there a problem with that rack? [19:16:49] <_joe_> bblack: and when did maintenace happen? [19:16:51] PROBLEM - logstash JSON linesTCP port on logstash1009 is CRITICAL: connect to address 127.0.0.1 and port 11514: Connection refused https://wikitech.wikimedia.org/wiki/Logstash [19:17:11] the maint started ~18:00, but was mostly prep-work and depoolings of ms-fe, shutdowns of ms-be, etc [19:17:13] <_joe_> 20 minutes ago or so? [19:17:13] PROBLEM - logstash JSON linesTCP port on logstash2004 is CRITICAL: connect to address 127.0.0.1 and port 11514: Connection refused https://wikitech.wikimedia.org/wiki/Logstash [19:17:29] PROBLEM - puppet last run on ms-be1040 is CRITICAL: CRITICAL: Puppet has 3 failures. Last run 3 minutes ago with 3 failures. Failed resources (up to 3 shown): Exec[mkfs-/dev/sdc1],Exec[xfs_label-/dev/sdb3],Exec[xfs_label-/dev/sdb4] https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [19:17:46] * akosiaris around [19:17:51] PROBLEM - puppet last run on mc1022 is CRITICAL: CRITICAL: Failed to apply catalog, zero resources tracked by Puppet. Could be an interrupted request or a dependency cycle. 
https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [19:18:06] looks like 18:33 for the first power cut on one leg [19:18:18] <_joe_> ok, I think it's time to look at the memcacheds [19:18:25] PROBLEM - LVS HTTP IPv4 on logstash.svc.eqiad.wmnet is CRITICAL: connect to address 10.2.2.36 and port 10514: Connection refused https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [19:18:30] 18:42 for the second leg of power work [19:18:33] PROBLEM - Check systemd state on cloudcontrol1004 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:18:47] looking [19:18:47] PROBLEM - logstash syslog TCP port on logstash2004 is CRITICAL: connect to address 127.0.0.1 and port 10514: Connection refused https://wikitech.wikimedia.org/wiki/Logstash [19:18:53] I'm still here but it was a very long day, poke me if I can be of help, otherwise lurking [19:19:29] PROBLEM - HTTP availability for Nginx -SSL terminators- at eqsin on icinga1001 is CRITICAL: cluster=cache_text site=eqsin https://wikitech.wikimedia.org/wiki/Cache_TLS_termination https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [19:19:33] PROBLEM - logstash syslog TCP port on logstash1008 is CRITICAL: connect to address 127.0.0.1 and port 10514: Connection refused https://wikitech.wikimedia.org/wiki/Logstash [19:19:35] from mw1271 [19:19:37] Jul 23 19:19:03 mw1271 mcrouter[919]: I0723 19:19:03.442854 1051 AsyncMcClientImpl.cpp:751] Failed to write into socket with remote endpoint "10.64.0.83:11211:ascii:plain:notcompressed", wrote 39782 bytes. 
Exception: AsyncSocketException: write timed out after 1000ms, type = Timed out [19:19:39] PROBLEM - PHP7 rendering on mw1275 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [19:20:01] PROBLEM - HTTP availability for Nginx -SSL terminators- at ulsfo on icinga1001 is CRITICAL: cluster=cache_text site=ulsfo https://wikitech.wikimedia.org/wiki/Cache_TLS_termination https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [19:20:02] <_joe_> can we please suspend all maintenance for the day? [19:20:03] PROBLEM - HTTP availability for Nginx -SSL terminators- at esams on icinga1001 is CRITICAL: cluster=cache_text site=esams https://wikitech.wikimedia.org/wiki/Cache_TLS_termination https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [19:20:09] PROBLEM - logstash syslog TCP port on logstash1009 is CRITICAL: connect to address 127.0.0.1 and port 10514: Connection refused https://wikitech.wikimedia.org/wiki/Logstash [19:20:11] PROBLEM - HTTP availability for Varnish at ulsfo on icinga1001 is CRITICAL: job=varnish-text site=ulsfo https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 [19:20:23] (03PS1) 10Ladsgroup: varnish: Do not strip the cache out of Special:EntityData if revision is set [puppet] - 10https://gerrit.wikimedia.org/r/525142 (https://phabricator.wikimedia.org/T85499) [19:20:25] PROBLEM - HTTP availability for Nginx -SSL terminators- at codfw on icinga1001 is CRITICAL: cluster=cache_text site=codfw https://wikitech.wikimedia.org/wiki/Cache_TLS_termination https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [19:20:25] PROBLEM - HTTP availability for Nginx -SSL terminators- at eqiad on icinga1001 is CRITICAL: 
cluster=cache_text site=eqiad https://wikitech.wikimedia.org/wiki/Cache_TLS_termination https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [19:20:35] RECOVERY - PHP7 rendering on mw1275 is OK: HTTP OK: HTTP/1.1 200 OK - 77329 bytes in 0.249 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [19:20:39] RECOVERY - Check systemd state on cloudcontrol1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:20:45] _joe_: power maint for A7 is done, I'm not sure if they had further plans today, but +1 should hold everything for now [19:20:57] PROBLEM - Eqiad HTTP 5xx reqs/min on graphite1004 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [1000.0] https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/d/myRmf1Pik/varnish-aggregate-client-status-codes?var-site=eqiad&var-status_type=5 [19:21:01] I think they wanted to start with row B [19:21:01] <_joe_> we're having network issues on the memcached I would say [19:21:06] but probably better to stop [19:21:10] in the middle of all of this A7-ish timeframe, we also had some issue with a singular mw server with scap issues, and a scap upgrade too. [19:21:20] <_joe_> Jul 23 19:13:05 mc1022 puppet-agent[18767]: Could not retrieve catalog from remote server: Broken pipe [19:21:24] removing some of the noise that is unrelated. like ms-be host has a disk issue. stopping any scap upgrades. 
[19:21:26] _joe_ yes take a look to the per-server metrics in https://grafana.wikimedia.org/d/000000549/mcrouter?orgId=1&var-source=eqiad%20prometheus%2Fops&var-cluster=All&var-instance=All&var-memcached_server=All [19:21:35] it is not only one shard [19:21:36] (03PS1) 10Halfak: Add accraze to team-scoring [puppet] - 10https://gerrit.wikimedia.org/r/525143 (https://phabricator.wikimedia.org/T226417) [19:21:46] <_joe_> elukey: what do you mean? [19:22:13] PROBLEM - HTTP availability for Varnish at eqiad on icinga1001 is CRITICAL: job=varnish-text site=eqiad https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 [19:22:19] RECOVERY - logstash syslog TCP port on logstash1009 is OK: TCP OK - 0.000 second response time on 127.0.0.1 port 10514 https://wikitech.wikimedia.org/wiki/Logstash [19:22:29] PROBLEM - Ulsfo HTTP 5xx reqs/min on graphite1004 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/d/myRmf1Pik/varnish-aggregate-client-status-codes?var-site=ulsfo&var-status_type=5 [19:22:29] ACKNOWLEDGEMENT - puppet last run on ms-be1040 is CRITICAL: CRITICAL: Puppet has 3 failures. Last run 3 minutes ago with 3 failures. Failed resources (up to 3 shown): Exec[mkfs-/dev/sdc1],Exec[xfs_label-/dev/sdb3],Exec[xfs_label-/dev/sdb4] daniel_zahn disk issue xfs - /dev/sdb4 https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [19:22:35] _joe_ there are metrics now for each single shard, it might help, that's it. 
Plus it seems that not only one shard is affected [19:22:36] <_joe_> elukey: so that's all the servers in A6 [19:22:45] PROBLEM - Esams HTTP 5xx reqs/min on graphite1004 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [1000.0] https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/d/myRmf1Pik/varnish-aggregate-client-status-codes?var-site=esams&var-status_type=5 [19:22:45] PROBLEM - Text HTTP 5xx reqs/min on graphite1004 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [1000.0] https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/d/myRmf1Pik/varnish-aggregate-client-status-codes?var-cache_type=varnish-text&var-status_type=5 [19:22:49] RECOVERY - logstash syslog TCP port on logstash1008 is OK: TCP OK - 0.000 second response time on 127.0.0.1 port 10514 https://wikitech.wikimedia.org/wiki/Logstash [19:22:52] <_joe_> the servers in A6 are experiencing timeouts [19:23:03] PROBLEM - logstash syslog TCP port on logstash2005 is CRITICAL: connect to address 127.0.0.1 and port 10514: Connection refused https://wikitech.wikimedia.org/wiki/Logstash [19:23:04] I am not sure if they worked on it [19:23:17] <_joe_> elukey: still, looking at the graphs you posted [19:23:19] <_joe_> also [19:23:22] <_joe_> Jul 23 19:13:05 mc1022 puppet-agent[18767]: Could not retrieve catalog from remote server: Broken pipe [19:23:29] <_joe_> this is from a server in that rack [19:23:29] PROBLEM - PHP7 rendering on mw1273 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [19:23:31] RECOVERY - HTTP availability for Varnish at ulsfo on icinga1001 is OK: All metrics within thresholds.
https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 [19:23:43] RECOVERY - logstash JSON linesTCP port on logstash2004 is OK: TCP OK - 0.000 second response time on 127.0.0.1 port 11514 https://wikitech.wikimedia.org/wiki/Logstash [19:23:45] RECOVERY - HTTP availability for Nginx -SSL terminators- at eqiad on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Cache_TLS_termination https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [19:23:45] RECOVERY - HTTP availability for Nginx -SSL terminators- at codfw on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Cache_TLS_termination https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [19:23:48] RECOVERY - LVS HTTP IPv4 on logstash.svc.eqiad.wmnet is OK: TCP OK - 0.000 second response time on 10.2.2.36 port 10514 https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [19:23:52] A6 hasn't been touched on the pdu maintenance from what I can see [19:23:58] RECOVERY - HTTP availability for Nginx -SSL terminators- at eqsin on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Cache_TLS_termination https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [19:24:02] <_joe_> ok, this is a serious incident. I'll be the coordinator [19:24:08] ack [19:24:16] RECOVERY - logstash syslog TCP port on logstash2005 is OK: TCP OK - 0.000 second response time on 127.0.0.1 port 10514 https://wikitech.wikimedia.org/wiki/Logstash [19:24:18] <_joe_> mutante: can you look at the php/hhvm rendering alerts? 
[19:24:18] RECOVERY - logstash syslog TCP port on logstash2004 is OK: TCP OK - 0.000 second response time on 127.0.0.1 port 10514 https://wikitech.wikimedia.org/wiki/Logstash [19:24:24] _joe_: all hands on deck? [19:24:32] RECOVERY - PHP7 rendering on mw1273 is OK: HTTP OK: HTTP/1.1 200 OK - 77329 bytes in 0.400 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [19:24:34] RECOVERY - HTTP availability for Nginx -SSL terminators- at ulsfo on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Cache_TLS_termination https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [19:24:36] RECOVERY - HTTP availability for Nginx -SSL terminators- at esams on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Cache_TLS_termination https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [19:24:36] RECOVERY - HTTP availability for Varnish at eqiad on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 [19:24:55] _joe_: can we move discussion into #-sre ? 
[19:24:59] +1 [19:25:01] <_joe_> yes [19:25:02] indeed [19:25:06] ack [19:25:38] RECOVERY - logstash JSON linesTCP port on logstash1009 is OK: TCP OK - 0.000 second response time on 127.0.0.1 port 11514 https://wikitech.wikimedia.org/wiki/Logstash [19:25:42] PROBLEM - PHP7 rendering on mw1270 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [19:25:56] PROBLEM - logstash JSON linesTCP port on logstash1007 is CRITICAL: connect to address 127.0.0.1 and port 11514: Connection refused https://wikitech.wikimedia.org/wiki/Logstash [19:26:06] ACKNOWLEDGEMENT - Check systemd state on cloudcontrol1004 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. andrew bogott seems to be hanging -- I'm investigating. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:26:30] PROBLEM - recommendation_api endpoints health on scb2002 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [19:26:48] RECOVERY - PHP7 rendering on mw1270 is OK: HTTP OK: HTTP/1.1 200 OK - 77329 bytes in 0.428 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [19:27:06] PROBLEM - restbase endpoints health on restbase1021 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [19:27:24] PROBLEM - logstash JSON linesTCP port on logstash2004 is CRITICAL: connect to address 127.0.0.1 and port 11514: Connection refused https://wikitech.wikimedia.org/wiki/Logstash [19:27:28] PROBLEM - LVS HTTP IPv4 on logstash.svc.eqiad.wmnet is CRITICAL: connect to address 10.2.2.36 and port 10514:
Connection refused https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [19:27:40] RECOVERY - recommendation_api endpoints health on scb2002 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [19:27:48] PROBLEM - puppet last run on mw1280 is CRITICAL: CRITICAL: Failed to apply catalog, zero resources tracked by Puppet. Could be an interrupted request or a dependency cycle. https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [19:27:48] PROBLEM - HHVM rendering on mw1269 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [19:28:00] PROBLEM - logstash syslog TCP port on logstash2005 is CRITICAL: connect to address 127.0.0.1 and port 10514: Connection refused https://wikitech.wikimedia.org/wiki/Logstash [19:28:08] PROBLEM - Check systemd state on cloudcontrol1003 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:28:08] RECOVERY - Eqiad HTTP 5xx reqs/min on graphite1004 is OK: OK: Less than 1.00% above the threshold [250.0] https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/d/myRmf1Pik/varnish-aggregate-client-status-codes?var-site=eqiad&var-status_type=5 [19:28:20] RECOVERY - puppet last run on mc1022 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [19:28:34] !log depool all appservers in eqiad A7 cdanis@cumin1001.eqiad.wmnet ~ 🍵 sudo cumin 'mw12[67-83]*' 'depool' [19:28:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:28:50] RECOVERY - LVS HTTP IPv4 on logstash.svc.eqiad.wmnet is OK: TCP OK - 0.000 second response time on 10.2.2.36 port 10514 https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [19:28:50] PROBLEM - logstash JSON linesTCP port on logstash2006 is CRITICAL: connect to address 127.0.0.1 and port 11514: Connection refused https://wikitech.wikimedia.org/wiki/Logstash [19:29:00] RECOVERY - HHVM rendering on mw1269 is OK: HTTP OK: HTTP/1.1 200 OK - 77282 bytes in 0.383 second response time https://wikitech.wikimedia.org/wiki/Application_servers [19:29:20] RECOVERY - logstash syslog TCP port on logstash2005 is OK: TCP OK - 0.000 second response time on 127.0.0.1 port 10514 https://wikitech.wikimedia.org/wiki/Logstash [19:29:50] PROBLEM - logstash syslog TCP port on logstash1009 is CRITICAL: connect to address 127.0.0.1 and port 10514: Connection refused https://wikitech.wikimedia.org/wiki/Logstash [19:29:52] RECOVERY - logstash JSON linesTCP port on logstash1007 is OK: TCP OK - 0.000 second response time on 127.0.0.1 port 11514 https://wikitech.wikimedia.org/wiki/Logstash [19:30:04] RECOVERY - Ulsfo HTTP 5xx reqs/min on graphite1004 is OK: OK: Less than 1.00% above the threshold [250.0] 
https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/d/myRmf1Pik/varnish-aggregate-client-status-codes?var-site=ulsfo&var-status_type=5 [19:30:08] PROBLEM - logstash syslog TCP port on logstash2006 is CRITICAL: connect to address 127.0.0.1 and port 10514: Connection refused https://wikitech.wikimedia.org/wiki/Logstash [19:30:14] RECOVERY - HHVM rendering on mw1312 is OK: HTTP OK: HTTP/1.1 200 OK - 77284 bytes in 7.133 second response time https://wikitech.wikimedia.org/wiki/Application_servers [19:30:20] RECOVERY - Esams HTTP 5xx reqs/min on graphite1004 is OK: OK: Less than 1.00% above the threshold [250.0] https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/d/myRmf1Pik/varnish-aggregate-client-status-codes?var-site=esams&var-status_type=5 [19:30:20] RECOVERY - Text HTTP 5xx reqs/min on graphite1004 is OK: OK: Less than 1.00% above the threshold [250.0] https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/d/myRmf1Pik/varnish-aggregate-client-status-codes?var-cache_type=varnish-text&var-status_type=5 [19:30:32] RECOVERY - puppet last run on tungsten is OK: OK: Puppet is currently enabled, last run 45 seconds ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [19:30:50] RECOVERY - Check systemd state on cloudcontrol1003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:30:54] PROBLEM - logstash JSON linesTCP port on logstash1009 is CRITICAL: connect to address 127.0.0.1 and port 11514: Connection refused https://wikitech.wikimedia.org/wiki/Logstash [19:31:12] RECOVERY - logstash syslog TCP port on logstash1009 is OK: TCP OK - 0.002 second response time on 127.0.0.1 port 10514 https://wikitech.wikimedia.org/wiki/Logstash [19:31:28] RECOVERY - logstash syslog TCP port on logstash2006 is OK: TCP OK - 0.000 second 
response time on 127.0.0.1 port 10514 https://wikitech.wikimedia.org/wiki/Logstash [19:31:28] RECOVERY - logstash JSON linesTCP port on logstash2004 is OK: TCP OK - 0.000 second response time on 127.0.0.1 port 11514 https://wikitech.wikimedia.org/wiki/Logstash [19:32:18] RECOVERY - logstash JSON linesTCP port on logstash1009 is OK: TCP OK - 0.000 second response time on 127.0.0.1 port 11514 https://wikitech.wikimedia.org/wiki/Logstash [19:32:58] PROBLEM - LVS HTTP IPv4 on logstash.svc.eqiad.wmnet is CRITICAL: connect to address 10.2.2.36 and port 10514: Connection refused https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [19:33:00] RECOVERY - logstash JSON linesTCP port on logstash2006 is OK: TCP OK - 0.000 second response time on 127.0.0.1 port 11514 https://wikitech.wikimedia.org/wiki/Logstash [19:34:02] RECOVERY - restbase endpoints health on restbase1021 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [19:34:02] RECOVERY - puppet last run on db1105 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [19:34:08] PROBLEM - HHVM rendering on mw1312 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [19:34:27] RECOVERY - LVS HTTP IPv4 on logstash.svc.eqiad.wmnet is OK: TCP OK - 0.000 second response time on 10.2.2.36 port 10514 https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [19:35:50] PROBLEM - logstash JSON linesTCP port on logstash2005 is CRITICAL: connect to address 127.0.0.1 and port 11514: Connection refused https://wikitech.wikimedia.org/wiki/Logstash [19:36:46] PROBLEM - logstash JSON linesTCP port on logstash1009 is CRITICAL: connect to address 127.0.0.1 and port 11514: Connection refused https://wikitech.wikimedia.org/wiki/Logstash [19:36:50] RECOVERY - PHP7 rendering on mwdebug1002 is OK: HTTP OK: HTTP/1.1 200 OK - 77367 bytes 
in 2.104 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [19:36:52] RECOVERY - PHP7 rendering on mw1312 is OK: HTTP OK: HTTP/1.1 200 OK - 77357 bytes in 4.587 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [19:37:26] PROBLEM - logstash syslog TCP port on logstash2006 is CRITICAL: connect to address 127.0.0.1 and port 10514: Connection refused https://wikitech.wikimedia.org/wiki/Logstash [19:37:32] PROBLEM - puppet last run on actinium is CRITICAL: CRITICAL: Failed to apply catalog, zero resources tracked by Puppet. Could be an interrupted request or a dependency cycle. https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [19:37:36] PROBLEM - Mobileapps LVS eqiad on mobileapps.svc.eqiad.wmnet is CRITICAL: /{domain}/v1/page/summary/{title} (Get summary for test page) timed out before a response was received https://wikitech.wikimedia.org/wiki/Mobileapps_%28service%29 [19:38:24] RECOVERY - HHVM rendering on mw1312 is OK: HTTP OK: HTTP/1.1 200 OK - 77310 bytes in 2.784 second response time https://wikitech.wikimedia.org/wiki/Application_servers [19:38:26] (03PS2) 10Ladsgroup: varnish: Do not strip the cache out of Special:EntityData if revision is set [puppet] - 10https://gerrit.wikimedia.org/r/525142 (https://phabricator.wikimedia.org/T85499) [19:39:06] PROBLEM - logstash JSON linesTCP port on logstash2006 is CRITICAL: connect to address 127.0.0.1 and port 11514: Connection refused https://wikitech.wikimedia.org/wiki/Logstash [19:39:26] PROBLEM - Check systemd state on cloudcontrol1004 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:40:08] PROBLEM - OSPF status on cr2-eqdfw is CRITICAL: OSPFv2: 4/5 UP : OSPFv3: 5/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [19:40:16] PROBLEM - logstash syslog TCP port on logstash1009 is CRITICAL: connect to address 127.0.0.1 and port 10514: Connection refused https://wikitech.wikimedia.org/wiki/Logstash [19:40:38] PROBLEM - CirrusSearch eqiad 95th percentile latency on graphite1004 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [1000.0] https://wikitech.wikimedia.org/wiki/Search%23Health/Activity_Monitoring https://grafana.wikimedia.org/dashboard/db/elasticsearch-percentiles?panelId=19&fullscreen&orgId=1&var-cluster=eqiad&var-smoothing=1 [19:40:39] !log restarting hhvm on mw1312 [19:40:42] RECOVERY - Mobileapps LVS eqiad on mobileapps.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Mobileapps_%28service%29 [19:40:42] RECOVERY - logstash JSON linesTCP port on logstash2006 is OK: TCP OK - 0.000 second response time on 127.0.0.1 port 11514 https://wikitech.wikimedia.org/wiki/Logstash [19:40:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:41:02] PROBLEM - logstash syslog TCP port on logstash1008 is CRITICAL: connect to address 127.0.0.1 and port 10514: Connection refused https://wikitech.wikimedia.org/wiki/Logstash [19:41:04] RECOVERY - Check systemd state on cloudcontrol1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:41:10] (03PS1) 10Aaron Schulz: Use GTIDs for master position queries for external DB when possible [mediawiki-config] - 10https://gerrit.wikimedia.org/r/525147 [19:41:42] RECOVERY - HHVM rendering on mwdebug1002 is OK: HTTP OK: HTTP/1.1 200 OK - 77319 bytes in 1.066 second response time https://wikitech.wikimedia.org/wiki/Application_servers [19:41:42] RECOVERY - puppet last run on 
mc1021 is OK: OK: Puppet is currently enabled, last run 5 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [19:41:50] RECOVERY - OSPF status on cr2-eqdfw is OK: OSPFv2: 5/5 UP : OSPFv3: 5/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [19:41:58] RECOVERY - logstash syslog TCP port on logstash1009 is OK: TCP OK - 0.000 second response time on 127.0.0.1 port 10514 https://wikitech.wikimedia.org/wiki/Logstash [19:42:12] RECOVERY - logstash JSON linesTCP port on logstash2005 is OK: TCP OK - 0.000 second response time on 127.0.0.1 port 11514 https://wikitech.wikimedia.org/wiki/Logstash [19:42:18] RECOVERY - logstash syslog TCP port on logstash2006 is OK: TCP OK - 0.000 second response time on 127.0.0.1 port 10514 https://wikitech.wikimedia.org/wiki/Logstash [19:42:50] PROBLEM - HHVM rendering on mw1312 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [19:43:38] PROBLEM - logstash JSON linesTCP port on logstash1007 is CRITICAL: connect to address 127.0.0.1 and port 11514: Connection refused https://wikitech.wikimedia.org/wiki/Logstash [19:44:01] !log restarting rabbitmq-server on cloudcontrol1003 and 1004 [19:44:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:44:20] RECOVERY - logstash syslog TCP port on logstash1008 is OK: TCP OK - 0.000 second response time on 127.0.0.1 port 10514 https://wikitech.wikimedia.org/wiki/Logstash [19:44:28] PROBLEM - PHP7 rendering on mw1312 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [19:44:44] PROBLEM - logstash syslog TCP port on logstash2005 is CRITICAL: connect to address 127.0.0.1 and port 10514: Connection refused https://wikitech.wikimedia.org/wiki/Logstash [19:44:48] PROBLEM - logstash syslog TCP port on logstash2004 is CRITICAL: connect to address 127.0.0.1 and port 
10514: Connection refused https://wikitech.wikimedia.org/wiki/Logstash [19:44:56] RECOVERY - logstash JSON linesTCP port on logstash1009 is OK: TCP OK - 0.000 second response time on 127.0.0.1 port 11514 https://wikitech.wikimedia.org/wiki/Logstash [19:45:18] RECOVERY - logstash JSON linesTCP port on logstash1007 is OK: TCP OK - 0.000 second response time on 127.0.0.1 port 11514 https://wikitech.wikimedia.org/wiki/Logstash [19:46:26] RECOVERY - logstash syslog TCP port on logstash2005 is OK: TCP OK - 0.000 second response time on 127.0.0.1 port 10514 https://wikitech.wikimedia.org/wiki/Logstash [19:46:30] RECOVERY - logstash syslog TCP port on logstash2004 is OK: TCP OK - 0.000 second response time on 127.0.0.1 port 10514 https://wikitech.wikimedia.org/wiki/Logstash [19:48:20] PROBLEM - Check systemd state on cloudcontrol1004 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:48:29] PROBLEM - puppet last run on mc1019 is CRITICAL: CRITICAL: Failed to apply catalog, zero resources tracked by Puppet. Could be an interrupted request or a dependency cycle. 
https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [19:49:27] RECOVERY - puppet last run on mw1280 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [19:49:29] PROBLEM - IPMI Sensor Status on dbprov1001 is CRITICAL: Sensor Type(s) Temperature, Power_Supply Status: Critical [PS Redundancy = Critical, Status = Critical, Status = Critical] https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures [19:49:45] PROBLEM - logstash syslog TCP port on logstash2005 is CRITICAL: connect to address 127.0.0.1 and port 10514: Connection refused https://wikitech.wikimedia.org/wiki/Logstash [19:49:49] !log mwdebug1002 - restarting hhvm - mw1312 - restarted apache [19:49:50] PROBLEM - logstash syslog TCP port on logstash2004 is CRITICAL: connect to address 127.0.0.1 and port 10514: Connection refused https://wikitech.wikimedia.org/wiki/Logstash [19:49:57] PROBLEM - logstash JSON linesTCP port on logstash1009 is CRITICAL: connect to address 127.0.0.1 and port 11514: Connection refused https://wikitech.wikimedia.org/wiki/Logstash [19:50:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:50:11] PROBLEM - recommendation_api endpoints health on scb2001 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [19:50:23] RECOVERY - Check systemd state on cloudcontrol1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:50:33] PROBLEM - recommendation_api endpoints health on scb2004 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received 
https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [19:51:27] PROBLEM - logstash syslog TCP port on logstash1008 is CRITICAL: connect to address 127.0.0.1 and port 10514: Connection refused https://wikitech.wikimedia.org/wiki/Logstash [19:52:05] PROBLEM - logstash syslog TCP port on logstash1009 is CRITICAL: connect to address 127.0.0.1 and port 10514: Connection refused https://wikitech.wikimedia.org/wiki/Logstash [19:52:15] RECOVERY - recommendation_api endpoints health on scb2001 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [19:52:43] RECOVERY - recommendation_api endpoints health on scb2004 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [19:53:15] RECOVERY - logstash JSON linesTCP port on logstash1009 is OK: TCP OK - 0.000 second response time on 127.0.0.1 port 11514 https://wikitech.wikimedia.org/wiki/Logstash [19:53:41] RECOVERY - logstash syslog TCP port on logstash1008 is OK: TCP OK - 0.000 second response time on 127.0.0.1 port 10514 https://wikitech.wikimedia.org/wiki/Logstash [19:53:43] PROBLEM - Check systemd state on cloudcontrol1004 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:54:09] RECOVERY - logstash syslog TCP port on logstash2005 is OK: TCP OK - 0.000 second response time on 127.0.0.1 port 10514 https://wikitech.wikimedia.org/wiki/Logstash [19:54:13] RECOVERY - logstash syslog TCP port on logstash2004 is OK: TCP OK - 0.000 second response time on 127.0.0.1 port 10514 https://wikitech.wikimedia.org/wiki/Logstash [19:54:23] RECOVERY - logstash syslog TCP port on logstash1009 is OK: TCP OK - 0.000 second response time on 127.0.0.1 port 10514 https://wikitech.wikimedia.org/wiki/Logstash [19:54:39] PROBLEM - recommendation_api endpoints health on scb2006 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [19:55:11] PROBLEM - HHVM rendering on mwdebug1002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [19:55:17] PROBLEM - logstash JSON linesTCP port on logstash1008 is CRITICAL: connect to address 127.0.0.1 and port 11514: Connection refused https://wikitech.wikimedia.org/wiki/Logstash [19:55:47] RECOVERY - recommendation_api endpoints health on scb2006 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [19:56:17] PROBLEM - recommendation_api endpoints health on scb2004 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [19:57:17] PROBLEM - logstash syslog TCP port on logstash1008 is CRITICAL: connect to address 127.0.0.1 and port 10514: Connection refused https://wikitech.wikimedia.org/wiki/Logstash [19:57:27] RECOVERY - recommendation_api endpoints health on scb2004 is OK: All endpoints are healthy 
https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [19:57:41] RECOVERY - logstash JSON linesTCP port on logstash1008 is OK: TCP OK - 0.000 second response time on 127.0.0.1 port 11514 https://wikitech.wikimedia.org/wiki/Logstash [19:58:02] wikibugs: [19:58:33] RECOVERY - logstash syslog TCP port on logstash1008 is OK: TCP OK - 0.000 second response time on 127.0.0.1 port 10514 https://wikitech.wikimedia.org/wiki/Logstash [19:58:37] RECOVERY - MediaWiki memcached error rate on graphite1004 is OK: OK: Less than 40.00% above the threshold [1000.0] https://wikitech.wikimedia.org/wiki/Memcached https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen [19:58:47] RECOVERY - HHVM rendering on mwdebug1002 is OK: HTTP OK: HTTP/1.1 200 OK - 77449 bytes in 1.493 second response time https://wikitech.wikimedia.org/wiki/Application_servers [20:00:37] PROBLEM - logstash syslog TCP port on logstash2006 is CRITICAL: connect to address 127.0.0.1 and port 10514: Connection refused https://wikitech.wikimedia.org/wiki/Logstash [20:01:15] RECOVERY - Check systemd state on cloudcontrol1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:01:53] PROBLEM - logstash syslog TCP port on logstash2004 is CRITICAL: connect to address 127.0.0.1 and port 10514: Connection refused https://wikitech.wikimedia.org/wiki/Logstash [20:01:59] RECOVERY - logstash syslog TCP port on logstash2006 is OK: TCP OK - 0.000 second response time on 127.0.0.1 port 10514 https://wikitech.wikimedia.org/wiki/Logstash [20:04:05] PROBLEM - restbase endpoints health on restbase1021 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [20:04:07] RECOVERY - puppet last run on actinium is OK: OK: Puppet is currently 
enabled, last run 4 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [20:04:17] RECOVERY - PHP7 rendering on mw1312 is OK: HTTP OK: HTTP/1.1 200 OK - 77486 bytes in 2.366 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [20:04:53] PROBLEM - logstash syslog TCP port on logstash1009 is CRITICAL: connect to address 127.0.0.1 and port 10514: Connection refused https://wikitech.wikimedia.org/wiki/Logstash [20:04:53] PROBLEM - logstash JSON linesTCP port on logstash1009 is CRITICAL: connect to address 127.0.0.1 and port 11514: Connection refused https://wikitech.wikimedia.org/wiki/Logstash [20:05:29] PROBLEM - Check systemd state on cloudcontrol1004 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:05:29] RECOVERY - restbase endpoints health on restbase1021 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [20:05:51] PROBLEM - PHP7 rendering on mwdebug1002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [20:06:04] (03PS1) 10Elukey: Remove mc1019->23 from the mcrouter config [puppet] - 10https://gerrit.wikimedia.org/r/525148 [20:06:19] PROBLEM - logstash syslog TCP port on logstash2006 is CRITICAL: connect to address 127.0.0.1 and port 10514: Connection refused https://wikitech.wikimedia.org/wiki/Logstash [20:07:19] RECOVERY - HHVM rendering on mw1312 is OK: HTTP OK: HTTP/1.1 200 OK - 77439 bytes in 3.483 second response time https://wikitech.wikimedia.org/wiki/Application_servers [20:07:43] RECOVERY - logstash syslog TCP port on logstash2004 is OK: TCP OK - 0.000 second response time on 127.0.0.1 port 10514 https://wikitech.wikimedia.org/wiki/Logstash [20:08:25] !log asw2-a-eqiad: request virtual-chassis vc-port set interface 
member 6 vcp-255/1/0 disable [20:08:29] PROBLEM - logstash JSON linesTCP port on logstash2006 is CRITICAL: connect to address 127.0.0.1 and port 11514: Connection refused https://wikitech.wikimedia.org/wiki/Logstash [20:08:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:08:45] RECOVERY - PHP7 rendering on mwdebug1002 is OK: HTTP OK: HTTP/1.1 200 OK - 77494 bytes in 0.253 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [20:08:56] (03CR) 10Ladsgroup: "Hopefully it should also be cached in VCL for logged-in users as well but I don't know how that can be done or this is enough for it." [puppet] - 10https://gerrit.wikimedia.org/r/525142 (https://phabricator.wikimedia.org/T85499) (owner: 10Ladsgroup) [20:09:06] (03PS1) 10Andrew Bogott: puppet: add facter.conf and cache some facts [puppet] - 10https://gerrit.wikimedia.org/r/525149 [20:09:46] (03CR) 10jerkins-bot: [V: 04-1] puppet: add facter.conf and cache some facts [puppet] - 10https://gerrit.wikimedia.org/r/525149 (owner: 10Andrew Bogott) [20:10:05] PROBLEM - logstash syslog TCP port on logstash1008 is CRITICAL: connect to address 127.0.0.1 and port 10514: Connection refused https://wikitech.wikimedia.org/wiki/Logstash [20:10:07] RECOVERY - Check systemd state on cloudcontrol1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:10:59] RECOVERY - logstash syslog TCP port on logstash1009 is OK: TCP OK - 0.000 second response time on 127.0.0.1 port 10514 https://wikitech.wikimedia.org/wiki/Logstash [20:11:01] RECOVERY - logstash JSON linesTCP port on logstash1009 is OK: TCP OK - 0.000 second response time on 127.0.0.1 port 11514 https://wikitech.wikimedia.org/wiki/Logstash [20:11:31] (03PS2) 10Andrew Bogott: puppet: add facter.conf and cache some facts [puppet] - 10https://gerrit.wikimedia.org/r/525149 [20:11:47] PROBLEM - MediaWiki memcached error rate on 
graphite1004 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [5000.0] https://wikitech.wikimedia.org/wiki/Memcached https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen [20:12:05] PROBLEM - Juniper virtual chassis ports on asw2-a-eqiad is CRITICAL: CRIT: Down: 1 Unknown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23VCP_status [20:12:19] PROBLEM - logstash JSON linesTCP port on logstash1008 is CRITICAL: connect to address 127.0.0.1 and port 11514: Connection refused https://wikitech.wikimedia.org/wiki/Logstash [20:13:21] RECOVERY - logstash JSON linesTCP port on logstash2006 is OK: TCP OK - 0.000 second response time on 127.0.0.1 port 11514 https://wikitech.wikimedia.org/wiki/Logstash [20:13:27] RECOVERY - MediaWiki memcached error rate on graphite1004 is OK: OK: Less than 40.00% above the threshold [1000.0] https://wikitech.wikimedia.org/wiki/Memcached https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen [20:13:45] 10Operations, 10ops-eqiad, 10DC-Ops, 10User-Zppix, 10cloud-services-team (Kanban): VMs on cloudvirt1015 crashing - bad mainboard/memory - https://phabricator.wikimedia.org/T220853 (10RobH) ` /admin1-> racadm getsel Record: 1 Date/Time: 05/30/2019 17:38:49 Source: system Severity: Ok Descri... [20:14:08] 10Operations, 10ops-eqiad, 10DC-Ops, 10User-Zppix, 10cloud-services-team (Kanban): VMs on cloudvirt1015 crashing - bad mainboard/memory - https://phabricator.wikimedia.org/T220853 (10RobH) Bad dimm, Chris moved it from B3 to A3 on >>! In T220853#5224397, @Cmjohnson wrote: > Swapped DIMM B3 with DIMM A3... [20:15:17] PROBLEM - logstash JSON linesTCP port on logstash2005 is CRITICAL: connect to address 127.0.0.1 and port 11514: Connection refused https://wikitech.wikimedia.org/wiki/Logstash [20:15:17] PROBLEM - puppet last run on mc1021 is CRITICAL: CRITICAL: Failed to apply catalog, zero resources tracked by Puppet. 
Could be an interrupted request or a dependency cycle. https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [20:15:31] PROBLEM - logstash syslog TCP port on logstash1007 is CRITICAL: connect to address 127.0.0.1 and port 10514: Connection refused https://wikitech.wikimedia.org/wiki/Logstash [20:15:35] RECOVERY - puppet last run on mc1019 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [20:15:51] PROBLEM - logstash syslog TCP port on logstash2004 is CRITICAL: connect to address 127.0.0.1 and port 10514: Connection refused https://wikitech.wikimedia.org/wiki/Logstash [20:15:59] RECOVERY - logstash syslog TCP port on logstash2006 is OK: TCP OK - 0.000 second response time on 127.0.0.1 port 10514 https://wikitech.wikimedia.org/wiki/Logstash [20:16:01] PROBLEM - logstash JSON linesTCP port on logstash1009 is CRITICAL: connect to address 127.0.0.1 and port 11514: Connection refused https://wikitech.wikimedia.org/wiki/Logstash [20:17:22] RECOVERY - logstash JSON linesTCP port on logstash1009 is OK: TCP OK - 0.000 second response time on 127.0.0.1 port 11514 https://wikitech.wikimedia.org/wiki/Logstash [20:18:07] RECOVERY - logstash syslog TCP port on logstash2004 is OK: TCP OK - 0.000 second response time on 127.0.0.1 port 10514 https://wikitech.wikimedia.org/wiki/Logstash [20:19:04] (03PS3) 10Andrew Bogott: puppet: add facter.conf and cache some facts [puppet] - 10https://gerrit.wikimedia.org/r/525149 [20:19:17] PROBLEM - logstash syslog TCP port on logstash2006 is CRITICAL: connect to address 127.0.0.1 and port 10514: Connection refused https://wikitech.wikimedia.org/wiki/Logstash [20:19:36] (03CR) 10jerkins-bot: [V: 04-1] puppet: add facter.conf and cache some facts [puppet] - 10https://gerrit.wikimedia.org/r/525149 (owner: 10Andrew Bogott) [20:20:19] PROBLEM - logstash syslog TCP port on logstash1009 is CRITICAL: connect to address 127.0.0.1 and port 
10514: Connection refused https://wikitech.wikimedia.org/wiki/Logstash [20:20:21] PROBLEM - logstash JSON linesTCP port on logstash1009 is CRITICAL: connect to address 127.0.0.1 and port 11514: Connection refused https://wikitech.wikimedia.org/wiki/Logstash [20:21:03] (03PS4) 10Andrew Bogott: puppet: add facter.conf and cache some facts [puppet] - 10https://gerrit.wikimedia.org/r/525149 [20:22:37] PROBLEM - logstash syslog TCP port on logstash2005 is CRITICAL: connect to address 127.0.0.1 and port 10514: Connection refused https://wikitech.wikimedia.org/wiki/Logstash [20:23:13] RECOVERY - logstash JSON linesTCP port on logstash2005 is OK: TCP OK - 0.000 second response time on 127.0.0.1 port 11514 https://wikitech.wikimedia.org/wiki/Logstash [20:23:33] PROBLEM - logstash syslog TCP port on logstash2004 is CRITICAL: connect to address 127.0.0.1 and port 10514: Connection refused https://wikitech.wikimedia.org/wiki/Logstash [20:23:41] RECOVERY - logstash syslog TCP port on logstash2006 is OK: TCP OK - 0.000 second response time on 127.0.0.1 port 10514 https://wikitech.wikimedia.org/wiki/Logstash [20:23:47] RECOVERY - logstash syslog TCP port on logstash2005 is OK: TCP OK - 0.000 second response time on 127.0.0.1 port 10514 https://wikitech.wikimedia.org/wiki/Logstash [20:24:44] RECOVERY - logstash syslog TCP port on logstash2004 is OK: TCP OK - 0.000 second response time on 127.0.0.1 port 10514 https://wikitech.wikimedia.org/wiki/Logstash [20:24:53] RECOVERY - logstash JSON linesTCP port on logstash1009 is OK: TCP OK - 0.000 second response time on 127.0.0.1 port 11514 https://wikitech.wikimedia.org/wiki/Logstash [20:26:05] RECOVERY - logstash syslog TCP port on logstash1009 is OK: TCP OK - 0.000 second response time on 127.0.0.1 port 10514 https://wikitech.wikimedia.org/wiki/Logstash [20:27:41] PROBLEM - logstash JSON linesTCP port on logstash1007 is CRITICAL: connect to address 127.0.0.1 and port 11514: Connection refused 
https://wikitech.wikimedia.org/wiki/Logstash [20:28:03] 10Operations, 10Puppet: Cache some facter facts - https://phabricator.wikimedia.org/T228805 (10Andrew) [20:28:10] 10Operations, 10Puppet: Cache some facter facts - https://phabricator.wikimedia.org/T228805 (10Andrew) p:05Triage→03Low [20:28:32] (03PS5) 10Andrew Bogott: puppet: add facter.conf and cache some facts [puppet] - 10https://gerrit.wikimedia.org/r/525149 (https://phabricator.wikimedia.org/T228805) [20:29:03] RECOVERY - logstash JSON linesTCP port on logstash1007 is OK: TCP OK - 0.000 second response time on 127.0.0.1 port 11514 https://wikitech.wikimedia.org/wiki/Logstash [20:29:49] (03CR) 10Andrew Bogott: puppet: add facter.conf and cache some facts (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/525149 (https://phabricator.wikimedia.org/T228805) (owner: 10Andrew Bogott) [20:29:57] PROBLEM - logstash syslog TCP port on logstash1009 is CRITICAL: connect to address 127.0.0.1 and port 10514: Connection refused https://wikitech.wikimedia.org/wiki/Logstash [20:29:59] PROBLEM - logstash JSON linesTCP port on logstash1009 is CRITICAL: connect to address 127.0.0.1 and port 11514: Connection refused https://wikitech.wikimedia.org/wiki/Logstash [20:31:03] RECOVERY - logstash syslog TCP port on logstash1007 is OK: TCP OK - 0.000 second response time on 127.0.0.1 port 10514 https://wikitech.wikimedia.org/wiki/Logstash [20:34:03] RECOVERY - CirrusSearch eqiad 95th percentile latency on graphite1004 is OK: OK: Less than 20.00% above the threshold [500.0] https://wikitech.wikimedia.org/wiki/Search%23Health/Activity_Monitoring https://grafana.wikimedia.org/dashboard/db/elasticsearch-percentiles?panelId=19&fullscreen&orgId=1&var-cluster=eqiad&var-smoothing=1 [20:34:21] PROBLEM - logstash syslog TCP port on logstash2005 is CRITICAL: connect to address 127.0.0.1 and port 10514: Connection refused https://wikitech.wikimedia.org/wiki/Logstash [20:35:40] herron, shdubsh: are these logstash alerts expected? 
[20:35:49] RECOVERY - logstash syslog TCP port on logstash2005 is OK: TCP OK - 0.000 second response time on 127.0.0.1 port 10514 https://wikitech.wikimedia.org/wiki/Logstash [20:36:05] paravoid: logstash is choking on the logs built up over the course of the incident. we're looking at unclogging it now [20:36:14] ah cool, thanks [20:36:21] RECOVERY - puppet last run on mc1021 is OK: OK: Puppet is currently enabled, last run 43 seconds ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [20:36:28] (03Abandoned) 10Elukey: Remove mc1019->23 from the mcrouter config [puppet] - 10https://gerrit.wikimedia.org/r/525148 (owner: 10Elukey) [20:37:01] PROBLEM - logstash syslog TCP port on logstash2004 is CRITICAL: connect to address 127.0.0.1 and port 10514: Connection refused https://wikitech.wikimedia.org/wiki/Logstash [20:37:09] RECOVERY - logstash syslog TCP port on logstash1009 is OK: TCP OK - 0.000 second response time on 127.0.0.1 port 10514 https://wikitech.wikimedia.org/wiki/Logstash [20:37:39] PROBLEM - logstash JSON linesTCP port on logstash1007 is CRITICAL: connect to address 127.0.0.1 and port 11514: Connection refused https://wikitech.wikimedia.org/wiki/Logstash [20:37:51] RECOVERY - logstash JSON linesTCP port on logstash1008 is OK: TCP OK - 0.000 second response time on 127.0.0.1 port 11514 https://wikitech.wikimedia.org/wiki/Logstash [20:38:07] PROBLEM - logstash JSON linesTCP port on logstash2006 is CRITICAL: connect to address 127.0.0.1 and port 11514: Connection refused https://wikitech.wikimedia.org/wiki/Logstash [20:38:07] PROBLEM - logstash JSON linesTCP port on logstash2004 is CRITICAL: connect to address 127.0.0.1 and port 11514: Connection refused https://wikitech.wikimedia.org/wiki/Logstash [20:40:00] shdubsh: (and please !log actions, and also that you're investigating an issue etc., to avoid being asked by annoying people like me :) [20:41:21] RECOVERY - logstash syslog TCP port on logstash1008 is OK: TCP OK - 
0.000 second response time on 127.0.0.1 port 10514 https://wikitech.wikimedia.org/wiki/Logstash [20:42:59] RECOVERY - logstash JSON linesTCP port on logstash2006 is OK: TCP OK - 0.000 second response time on 127.0.0.1 port 11514 https://wikitech.wikimedia.org/wiki/Logstash [20:42:59] RECOVERY - logstash JSON linesTCP port on logstash2004 is OK: TCP OK - 0.000 second response time on 127.0.0.1 port 11514 https://wikitech.wikimedia.org/wiki/Logstash [20:43:31] RECOVERY - logstash syslog TCP port on logstash2004 is OK: TCP OK - 0.000 second response time on 127.0.0.1 port 10514 https://wikitech.wikimedia.org/wiki/Logstash [20:43:49] PROBLEM - logstash syslog TCP port on logstash2005 is CRITICAL: connect to address 127.0.0.1 and port 10514: Connection refused https://wikitech.wikimedia.org/wiki/Logstash [20:44:05] RECOVERY - logstash JSON linesTCP port on logstash1007 is OK: TCP OK - 0.000 second response time on 127.0.0.1 port 11514 https://wikitech.wikimedia.org/wiki/Logstash [20:44:17] PROBLEM - logstash JSON linesTCP port on logstash1008 is CRITICAL: connect to address 127.0.0.1 and port 11514: Connection refused https://wikitech.wikimedia.org/wiki/Logstash [20:46:59] RECOVERY - logstash JSON linesTCP port on logstash1009 is OK: TCP OK - 0.000 second response time on 127.0.0.1 port 11514 https://wikitech.wikimedia.org/wiki/Logstash [20:46:59] PROBLEM - logstash syslog TCP port on logstash2006 is CRITICAL: connect to address 127.0.0.1 and port 10514: Connection refused https://wikitech.wikimedia.org/wiki/Logstash [20:47:07] RECOVERY - logstash syslog TCP port on logstash2005 is OK: TCP OK - 0.000 second response time on 127.0.0.1 port 10514 https://wikitech.wikimedia.org/wiki/Logstash [20:47:57] PROBLEM - logstash syslog TCP port on logstash1008 is CRITICAL: connect to address 127.0.0.1 and port 10514: Connection refused https://wikitech.wikimedia.org/wiki/Logstash [20:48:39] RECOVERY - logstash syslog TCP port on logstash2006 is OK: TCP OK - 0.000 second response 
time on 127.0.0.1 port 10514 https://wikitech.wikimedia.org/wiki/Logstash [20:49:17] RECOVERY - logstash JSON linesTCP port on logstash1008 is OK: TCP OK - 0.000 second response time on 127.0.0.1 port 11514 https://wikitech.wikimedia.org/wiki/Logstash [20:49:35] PROBLEM - logstash JSON linesTCP port on logstash2006 is CRITICAL: connect to address 127.0.0.1 and port 11514: Connection refused https://wikitech.wikimedia.org/wiki/Logstash [20:49:35] PROBLEM - logstash JSON linesTCP port on logstash2004 is CRITICAL: connect to address 127.0.0.1 and port 11514: Connection refused https://wikitech.wikimedia.org/wiki/Logstash [20:49:37] PROBLEM - logstash JSON linesTCP port on logstash2005 is CRITICAL: connect to address 127.0.0.1 and port 11514: Connection refused https://wikitech.wikimedia.org/wiki/Logstash [20:50:01] (03PS1) 10EBernhardson: Increase size of cirrus curl pools [puppet] - 10https://gerrit.wikimedia.org/r/525156 [20:50:07] PROBLEM - logstash syslog TCP port on logstash2004 is CRITICAL: connect to address 127.0.0.1 and port 10514: Connection refused https://wikitech.wikimedia.org/wiki/Logstash [20:50:43] PROBLEM - logstash JSON linesTCP port on logstash1007 is CRITICAL: connect to address 127.0.0.1 and port 11514: Connection refused https://wikitech.wikimedia.org/wiki/Logstash [20:51:13] RECOVERY - logstash JSON linesTCP port on logstash2006 is OK: TCP OK - 0.000 second response time on 127.0.0.1 port 11514 https://wikitech.wikimedia.org/wiki/Logstash [20:51:15] RECOVERY - logstash JSON linesTCP port on logstash2004 is OK: TCP OK - 0.000 second response time on 127.0.0.1 port 11514 https://wikitech.wikimedia.org/wiki/Logstash [20:51:17] RECOVERY - logstash JSON linesTCP port on logstash2005 is OK: TCP OK - 0.000 second response time on 127.0.0.1 port 11514 https://wikitech.wikimedia.org/wiki/Logstash [20:51:47] RECOVERY - logstash syslog TCP port on logstash2004 is OK: TCP OK - 0.000 second response time on 127.0.0.1 port 10514 
https://wikitech.wikimedia.org/wiki/Logstash [20:51:55] PROBLEM - logstash syslog TCP port on logstash1009 is CRITICAL: connect to address 127.0.0.1 and port 10514: Connection refused https://wikitech.wikimedia.org/wiki/Logstash [20:52:35] PROBLEM - MediaWiki memcached error rate on graphite1004 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [5000.0] https://wikitech.wikimedia.org/wiki/Memcached https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen [20:53:33] RECOVERY - logstash syslog TCP port on logstash1009 is OK: TCP OK - 0.000 second response time on 127.0.0.1 port 10514 https://wikitech.wikimedia.org/wiki/Logstash [20:53:45] PROBLEM - logstash syslog TCP port on logstash2005 is CRITICAL: connect to address 127.0.0.1 and port 10514: Connection refused https://wikitech.wikimedia.org/wiki/Logstash [20:54:35] RECOVERY - logstash syslog TCP port on logstash1008 is OK: TCP OK - 0.000 second response time on 127.0.0.1 port 10514 https://wikitech.wikimedia.org/wiki/Logstash [20:55:15] PROBLEM - logstash syslog TCP port on logstash2006 is CRITICAL: connect to address 127.0.0.1 and port 10514: Connection refused https://wikitech.wikimedia.org/wiki/Logstash [20:55:23] RECOVERY - logstash syslog TCP port on logstash2005 is OK: TCP OK - 0.000 second response time on 127.0.0.1 port 10514 https://wikitech.wikimedia.org/wiki/Logstash [20:55:39] RECOVERY - logstash JSON linesTCP port on logstash1007 is OK: TCP OK - 0.000 second response time on 127.0.0.1 port 11514 https://wikitech.wikimedia.org/wiki/Logstash [20:56:33] PROBLEM - logstash syslog TCP port on logstash1007 is CRITICAL: connect to address 127.0.0.1 and port 10514: Connection refused https://wikitech.wikimedia.org/wiki/Logstash [20:57:51] PROBLEM - logstash JSON linesTCP port on logstash2006 is CRITICAL: connect to address 127.0.0.1 and port 11514: Connection refused https://wikitech.wikimedia.org/wiki/Logstash [20:57:53] PROBLEM - logstash JSON 
linesTCP port on logstash2004 is CRITICAL: connect to address 127.0.0.1 and port 11514: Connection refused https://wikitech.wikimedia.org/wiki/Logstash [20:57:55] PROBLEM - logstash JSON linesTCP port on logstash2005 is CRITICAL: connect to address 127.0.0.1 and port 11514: Connection refused https://wikitech.wikimedia.org/wiki/Logstash [20:58:23] PROBLEM - logstash syslog TCP port on logstash2004 is CRITICAL: connect to address 127.0.0.1 and port 10514: Connection refused https://wikitech.wikimedia.org/wiki/Logstash [20:58:33] RECOVERY - logstash syslog TCP port on logstash2006 is OK: TCP OK - 0.000 second response time on 127.0.0.1 port 10514 https://wikitech.wikimedia.org/wiki/Logstash [20:59:31] RECOVERY - logstash JSON linesTCP port on logstash2004 is OK: TCP OK - 0.000 second response time on 127.0.0.1 port 11514 https://wikitech.wikimedia.org/wiki/Logstash [20:59:33] RECOVERY - logstash JSON linesTCP port on logstash2005 is OK: TCP OK - 0.000 second response time on 127.0.0.1 port 11514 https://wikitech.wikimedia.org/wiki/Logstash [20:59:41] !log temporarily disable input-kafka-rsyslog-shipper and drop memcached logs on logstash nodes [20:59:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:01:09] RECOVERY - logstash JSON linesTCP port on logstash2006 is OK: TCP OK - 0.000 second response time on 127.0.0.1 port 11514 https://wikitech.wikimedia.org/wiki/Logstash [21:01:41] RECOVERY - logstash syslog TCP port on logstash2004 is OK: TCP OK - 0.000 second response time on 127.0.0.1 port 10514 https://wikitech.wikimedia.org/wiki/Logstash [21:01:57] (03PS1) 10Legoktm: Enable SecureLinkFixer on group0 wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/525157 (https://phabricator.wikimedia.org/T200751) [21:02:01] PROBLEM - logstash syslog TCP port on logstash2005 is CRITICAL: connect to address 127.0.0.1 and port 10514: Connection refused https://wikitech.wikimedia.org/wiki/Logstash [21:02:27] RECOVERY - MediaWiki memcached 
error rate on graphite1004 is OK: OK: Less than 40.00% above the threshold [1000.0] https://wikitech.wikimedia.org/wiki/Memcached https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen [21:04:49] RECOVERY - logstash syslog TCP port on logstash1007 is OK: TCP OK - 0.000 second response time on 127.0.0.1 port 10514 https://wikitech.wikimedia.org/wiki/Logstash [21:05:21] RECOVERY - logstash syslog TCP port on logstash2005 is OK: TCP OK - 0.000 second response time on 127.0.0.1 port 10514 https://wikitech.wikimedia.org/wiki/Logstash [21:05:31] RECOVERY - PyBal backends health check on lvs1015 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [21:06:11] RECOVERY - PyBal backends health check on lvs1016 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [21:06:11] PROBLEM - logstash JSON linesTCP port on logstash2005 is CRITICAL: connect to address 127.0.0.1 and port 11514: Connection refused https://wikitech.wikimedia.org/wiki/Logstash [21:07:49] PROBLEM - logstash JSON linesTCP port on logstash2004 is CRITICAL: connect to address 127.0.0.1 and port 11514: Connection refused https://wikitech.wikimedia.org/wiki/Logstash [21:08:21] PROBLEM - logstash syslog TCP port on logstash2004 is CRITICAL: connect to address 127.0.0.1 and port 10514: Connection refused https://wikitech.wikimedia.org/wiki/Logstash [21:08:52] chaomodus: trying to use the netbox API. i see the example on https://netbox.readthedocs.io/en/stable/api/overview/ and see that we use 8001 instead of 8000. 
but i always get Bad Request (400) when doing something curl -s http://localhost:8001/api/ what am i missing [21:09:12] mutante: authentication [21:09:23] mutante: it's far easier to use the shell than the api [21:09:29] RECOVERY - logstash JSON linesTCP port on logstash2004 is OK: TCP OK - 0.000 second response time on 127.0.0.1 port 11514 https://wikitech.wikimedia.org/wiki/Logstash [21:09:31] RECOVERY - logstash JSON linesTCP port on logstash2005 is OK: TCP OK - 0.000 second response time on 127.0.0.1 port 11514 https://wikitech.wikimedia.org/wiki/Logstash [21:09:31] chaomodus: even from netmon1002? i see. but shouldn't that be a different return code [21:09:43] chaomodus: oh! looking into the shell [21:09:55] mutante: if you join my tmux on there i'm already in it :) [21:10:01] RECOVERY - logstash syslog TCP port on logstash2004 is OK: TCP OK - 0.000 second response time on 127.0.0.1 port 10514 https://wikitech.wikimedia.org/wiki/Logstash [21:10:19] (tmux as root) [21:11:59] there are 2 but i am in one :) [21:12:20] oic you're in the correct one afaict [21:12:47] so you did ./manage.py nbshell ? [21:12:59] just shell [21:13:09] it's a standard python interpreter but you have access to the internal models and stuff [21:13:20] alright [21:13:59] (fwiw you have to . the activate in the venv, prior to running it) [21:14:05] i guess i want non-interactive though [21:14:19] what query are you trying to run exactly? [21:14:35] i wanted to see about getting the rack for a hostname [21:14:43] ah that's easy [21:15:10] dcim.models.Device.objects.filter(name='whatever').rack [21:15:15] err [21:15:20] dcim.models.Device.objects.get(name='whatever').rack [21:15:48] nice! works [21:15:55] yep!
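(A minimal sketch of the rack-for-hostname lookup discussed above, done over Netbox's REST API instead of the Django shell. Netbox's API rejects unauthenticated requests, which is why the bare curl call failed; this illustration assumes the localhost:8001 endpoint from the conversation and a hypothetical API token.)

```python
# Sketch: look up a device's rack via the Netbox REST API.
# Assumptions: Netbox listening on localhost:8001 (per the discussion)
# and a hypothetical API token; the plain curl call failed because it
# sent no Authorization header.
import json
import urllib.parse
import urllib.request

NETBOX = "http://localhost:8001"
TOKEN = "hypothetical-token"  # placeholder, not a real credential

def device_request(name):
    """Build an authenticated device-by-name request."""
    qs = urllib.parse.urlencode({"name": name})
    return urllib.request.Request(
        f"{NETBOX}/api/dcim/devices/?{qs}",
        headers={"Authorization": f"Token {TOKEN}"},
    )

def rack_from_payload(payload):
    """Extract the rack name from a device-list response body."""
    results = json.loads(payload)["results"]
    return results[0]["rack"]["name"] if results else None

# Against a live Netbox this would be:
#   with urllib.request.urlopen(device_request("mw1267")) as resp:
#       print(rack_from_payload(resp.read()))
```

(This is roughly what a non-interactive host2rack script could do without needing the manage.py shell.)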
[21:16:19] you could also look in the ui, unless you're trying to automate this then we have to go to api country [21:16:33] now i just want to send that without the interactive shell [21:16:47] host2rack.sh or something [21:16:59] yah [21:17:25] PROBLEM - MediaWiki memcached error rate on graphite1004 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [5000.0] https://wikitech.wikimedia.org/wiki/Memcached https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen [21:17:33] sec [21:18:11] 100% memcached error rate sounds bad but the graph looks like it's not special [21:18:25] i see you. shared screen/tmux is the best [21:22:26] ideally we'd make a little tool to stick in scripts/ [21:22:36] that could do various queries based on hostname [21:23:31] yes, that :) [21:23:49] PROBLEM - puppet last run on analytics1047 is CRITICAL: CRITICAL: Failed to apply catalog, zero resources tracked by Puppet. Could be an interrupted request or a dependency cycle. https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [21:23:54] and soonish you'll be able to use cumin to do some similar query [21:24:44] chaomodus: you mean it becomes a spicerack recipe? [21:25:52] it could yes [21:26:17] PROBLEM - puppet last run on serpens is CRITICAL: CRITICAL: Failed to apply catalog, zero resources tracked by Puppet. Could be an interrupted request or a dependency cycle. https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [21:32:44] (03PS3) 10Holger Knust: table-properties: Initial commit [software/cassandra-table-properties] - 10https://gerrit.wikimedia.org/r/524921 (https://phabricator.wikimedia.org/T220246) [21:32:48] (03CR) 10CRusnov: "I feel as though there was implicit agreement to merge this, still it'd be nice to have a +1."
[software/netbox-deploy] - 10https://gerrit.wikimedia.org/r/518166 (https://phabricator.wikimedia.org/T223292) (owner: 10CRusnov) [21:46:29] RECOVERY - puppet last run on analytics1047 is OK: OK: Puppet is currently enabled, last run 29 seconds ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [21:49:18] 10Operations, 10ops-eqiad, 10DC-Ops, 10User-Zppix, 10cloud-services-team (Kanban): VMs on cloudvirt1015 crashing - bad mainboard/memory - https://phabricator.wikimedia.org/T220853 (10wiki_willy) @Cmjohnson - are those errors for DIMM A3 enough info to get Dell to RMA a part to us? If not, let me know, a... [21:52:35] !log logstash - temporarily dropping logs matching [message] =~ /^SlowTimer/ due to UTF-8 parsing errors that are stopping the logstash processing pipeline. will re-enable after logstash has caught up with the backlog [21:52:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:54:33] RECOVERY - puppet last run on serpens is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [21:58:21] (03PS1) 10CRusnov: nbdeviceinfo.py: Add simple command-line host dump [software/netbox-deploy] - 10https://gerrit.wikimedia.org/r/525165 [21:59:55] (03CR) 10CRusnov: "This is a simple script to dump host information on Netbox. This script outputs as yaml to stdout for maximum machine readability. It depe" [software/netbox-deploy] - 10https://gerrit.wikimedia.org/r/525165 (owner: 10CRusnov) [22:00:39] (03PS2) 10CRusnov: nbdeviceinfo.py: Add simple command-line host dump [software/netbox-deploy] - 10https://gerrit.wikimedia.org/r/525165 [22:06:19] !log puppet temporarily disabled on eqiad/codfw logstash collectors while catching up with backlog. 
see /etc/logstash/conf.d/01-filter_temp_drops.conf [22:06:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:14:19] !log continuing rollout of new scap version 3.11.1-1, starting with kafka-all followed by other cumin-alias groups (T228328) [22:14:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:14:26] T228328: 'scap pull' stopped working on appservers ? - https://phabricator.wikimedia.org/T228328 [22:36:22] !log rolling out scap 3.11.1-1 on mw-eqiad servers [22:36:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:40:27] PROBLEM - Logstash rate of ingestion percent change compared to yesterday on icinga1001 is CRITICAL: 449.6 ge 130 https://phabricator.wikimedia.org/T202307 https://grafana.wikimedia.org/dashboard/db/logstash?orgId=1&panelId=2&fullscreen [22:46:19] 10Operations, 10ops-eqiad, 10Cassandra, 10DC-Ops, and 4 others: restbase-dev1006 has a broken disk - https://phabricator.wikimedia.org/T224260 (10Dzahn) This host pops up because it's the only one where i can't upgrade scap with debdeploy. I tried to ssh to it manually and it asks me for a password. So th... [22:53:01] 10Operations, 10Analytics, 10Analytics-Kanban, 10Discovery, 10Research-Backlog: Make oozie swift upload emit event to Kafka about swift object upload complete - https://phabricator.wikimedia.org/T227896 (10Ottomata) Ok, from discussions with Erik today, we are going with an event like: `lang=json { "$... [22:53:37] PROBLEM - puppet last run on cloudvirt1021 is CRITICAL: CRITICAL: Failed to apply catalog, zero resources tracked by Puppet. Could be an interrupted request or a dependency cycle. https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [23:00:04] MaxSem, RoanKattouw, and Niharika: It is that lovely time of the day again! You are hereby commanded to deploy Evening SWAT (Max 6 patches). (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190723T2300). 
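(The temporary drop filter !logged at 21:52 and 22:06 above — /etc/logstash/conf.d/01-filter_temp_drops.conf — would look roughly like this in Logstash's filter DSL. This is a hypothetical reconstruction from the `[message] =~ /^SlowTimer/` pattern quoted in the log, not the actual file contents.)

```
filter {
  # Temporary mitigation: drop MediaWiki SlowTimer messages whose UTF-8
  # parsing errors were stalling the pipeline; remove once the backlog
  # has been processed and puppet is re-enabled.
  if [message] =~ /^SlowTimer/ {
    drop { }
  }
}
```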
[23:00:05] No GERRIT patches in the queue for this window AFAICS. [23:00:34] Okie dokie. [23:12:41] 10Operations, 10Gerrit, 10LDAP-Access-Requests, 10Release-Engineering-Team-TODO, 10Release-Engineering-Team (Development services): Add more SREs to gerritadmin LDAP group - https://phabricator.wikimedia.org/T228733 (10Dzahn) I would volunteer as well. [23:16:57] 10Operations, 10Analytics, 10Analytics-Kanban, 10Discovery, 10Research-Backlog: Make oozie swift upload emit event to Kafka about swift object upload complete - https://phabricator.wikimedia.org/T227896 (10Nuria) Let me catch up here, seems that urls should have versions and not only be defined by a loca... [23:18:38] mutante: Did you repool mw1267? [23:20:21] RECOVERY - MediaWiki memcached error rate on graphite1004 is OK: OK: Less than 40.00% above the threshold [1000.0] https://wikitech.wikimedia.org/wiki/Memcached https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen [23:20:45] 10Operations, 10Analytics, 10Analytics-Kanban, 10Discovery, 10Research-Backlog: Make oozie swift upload emit event to Kafka about swift object upload complete - https://phabricator.wikimedia.org/T227896 (10Ottomata) The object URLs are totally up to the user, the script just uploads whatever is in the hd... [23:21:53] RECOVERY - puppet last run on cloudvirt1021 is OK: OK: Puppet is currently enabled, last run 5 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [23:22:40] James_F: no, but i can. i don't know about the history besides SAL though [23:23:24] the module was missing? hmm.. that sounds familiar [23:23:29] mutante: c.danis depooled it because it was scap erroring due to the module. [23:23:32] Yeah. [23:23:41] yea, but there was a reason before scap [23:23:45] That theoretically is now fixed with the new scap version rollout? 
[23:24:13] i still wonder what made it different from all other appservers then [23:24:22] because scap was broken on all of them [23:24:43] I guess the Python module dependency happened to be installed on the rest but not on that one? [23:25:28] normally we don't install stuff manually though.. so a one-off is always weird [23:25:34] alright. scap pulled. it works [23:25:38] repooling. looks fine [23:25:43] !log dzahn@cumin1001 conftool action : set/pooled=yes; selector: name=mw1267.eqiad.wmnet [23:25:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:28:06] I'll give it a real test with a scap in a bit. [23:28:34] cool! [23:28:47] btw.. on every scap pull i see these: [23:28:57] cannot delete non-empty directory: php-1.33.0-wmf.3 [23:29:00] cannot delete non-empty directory: php-1.33.0-wmf.23 [23:29:19] Oh, failed clean ups from when servers were depooled? Fun. [23:29:21] and then sometimes we need to manually delete old versions for disk space [23:29:30] ah, yea [23:29:39] You can safely `rm -rf php-1.33.0*` on the whole fleet. [23:30:08] We only want php-1.34.0-wmf.11–php-1.34.0-wmf.15 in production at most. [23:31:25] !log mw1267 - rm -rf /srv/mediawiki/php-1.33.0-wmf.23 ; rm -rf /srv/mediawiki/php-1.32.0-wmf.3 ; scap pull [23:31:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:31:43] 23:31:08 Finished rsync common (duration: 00m 04s) [23:31:51] 4 seconds..nothing to do [23:32:13] * James_F nods. 
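(The branch cleanup agreed above — "rm -rf php-1.33.0* … we only want php-1.34.0-wmf.11–php-1.34.0-wmf.15 in production" — could be scripted along these lines. A sketch only: the keep-list and the /srv/mediawiki path come from the conversation, and the actual deletion is left commented out.)

```python
# Sketch: find stale MediaWiki branch checkouts so only the versions
# still wanted in production (per the conversation above) remain.
import re
import shutil  # used in the commented-out deletion below
from pathlib import Path

KEEP = {f"php-1.34.0-wmf.{n}" for n in range(11, 16)}  # wmf.11..wmf.15
BRANCH_RE = re.compile(r"^php-\d+\.\d+\.\d+-wmf\.\d+$")

def stale_branches(root, keep=KEEP):
    """Return names of php-*-wmf.* checkout dirs under root not in keep."""
    return sorted(
        p.name
        for p in Path(root).iterdir()
        if p.is_dir() and BRANCH_RE.match(p.name) and p.name not in keep
    )

# On a host one would then, carefully:
#   for name in stale_branches("/srv/mediawiki"):
#       shutil.rmtree(Path("/srv/mediawiki") / name)
```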
[23:38:27] (03PS1) 10Jeena Huneidi: Package mediawiki-dev and add to index [deployment-charts] - 10https://gerrit.wikimedia.org/r/525173 [23:42:23] !log jforrester@deploy1001 Synchronized php-1.34.0-wmf.15/includes/diff/DifferenceEngine.php: T228766 Don't double wrap rollback links (duration: 00m 56s) [23:42:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:42:30] T228766: Rollback links in wmf.15 now have two square brackets around them, not one - https://phabricator.wikimedia.org/T228766 [23:43:16] !log reverting logstash mitigations and re-enable puppet [23:43:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:44:05] (03CR) 10Jeena Huneidi: Package mediawiki-dev and add to index (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/525173 (owner: 10Jeena Huneidi) [23:49:27] 10Operations, 10ops-codfw, 10DBA, 10Goal: rack/setup/install db21[21-30].codfw.wmnet - https://phabricator.wikimedia.org/T227113 (10Papaul)