[00:15:23] <icinga-wm>	 RECOVERY - Too many messages in kafka logging-eqiad on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Logstash%23Kafka_consumer_lag https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?from=now-3h&to=now&orgId=1&var-datasource=eqiad+prometheus/ops&var-cluster=logging-eqiad&var-topic=All&var-consumer_group=All
[00:21:29] <icinga-wm>	 PROBLEM - puppet last run on lvs1016 is CRITICAL: CRITICAL: Failed to apply catalog, zero resources tracked by Puppet. Could be an interrupted request or a dependency cycle. https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[00:30:05] <icinga-wm>	 RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 231, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[00:31:01] <icinga-wm>	 RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 73, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[00:34:03] <icinga-wm>	 PROBLEM - HTTP availability for Nginx -SSL terminators- at esams on icinga1001 is CRITICAL: cluster=cache_text site=esams https://wikitech.wikimedia.org/wiki/Cache_TLS_termination https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[00:34:29] <icinga-wm>	 PROBLEM - HTTP availability for Varnish at esams on icinga1001 is CRITICAL: job=varnish-text site=esams https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1
[00:36:43] <icinga-wm>	 PROBLEM - Text HTTP 5xx reqs/min on graphite1004 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/d/myRmf1Pik/varnish-aggregate-client-status-codes?var-cache_type=varnish-text&var-status_type=5
[00:36:55] <icinga-wm>	 PROBLEM - Esams HTTP 5xx reqs/min on graphite1004 is CRITICAL: CRITICAL: 10.00% of data above the critical threshold [1000.0] https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/d/myRmf1Pik/varnish-aggregate-client-status-codes?var-site=esams&var-status_type=5
[00:37:49] <icinga-wm>	 RECOVERY - HTTP availability for Varnish at esams on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1
[00:39:03] <icinga-wm>	 RECOVERY - HTTP availability for Nginx -SSL terminators- at esams on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Cache_TLS_termination https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[00:41:07] <icinga-wm>	 RECOVERY - Logstash rate of ingestion percent change compared to yesterday on icinga1001 is OK: (C)130 ge (W)110 ge 101.2 https://phabricator.wikimedia.org/T202307 https://grafana.wikimedia.org/dashboard/db/logstash?orgId=1&panelId=2&fullscreen
[00:43:35] <icinga-wm>	 RECOVERY - Esams HTTP 5xx reqs/min on graphite1004 is OK: OK: Less than 1.00% above the threshold [250.0] https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/d/myRmf1Pik/varnish-aggregate-client-status-codes?var-site=esams&var-status_type=5
[00:44:09] <icinga-wm>	 RECOVERY - puppet last run on lvs1016 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[00:45:01] <icinga-wm>	 RECOVERY - Text HTTP 5xx reqs/min on graphite1004 is OK: OK: Less than 1.00% above the threshold [250.0] https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/d/myRmf1Pik/varnish-aggregate-client-status-codes?var-cache_type=varnish-text&var-status_type=5
[00:50:23] <wikibugs>	 Operations, netops: AS63541's session down reported by cr1-eqsin - https://phabricator.wikimedia.org/T228617 (ayounsi) p:Normal→Lowest Email sent to Chinacache. If no replies in ~1w then we will remove the session.
[01:11:00] <wikibugs>	 (PS1) Tim Starling: Add logging for DeferredUpdates [mediawiki-config] - https://gerrit.wikimedia.org/r/525185
[01:18:31] <wikibugs>	 (PS2) Tim Starling: Add logging for DeferredUpdates [mediawiki-config] - https://gerrit.wikimedia.org/r/525185 (https://phabricator.wikimedia.org/T228462)
[01:23:52] <wikibugs>	 (CR) Tim Starling: "Please give CR+1 and I will deploy it" [mediawiki-config] - https://gerrit.wikimedia.org/r/525185 (https://phabricator.wikimedia.org/T228462) (owner: Tim Starling)
[01:57:09] <wikibugs>	 Operations, ops-eqiad: Faulty A6/A7 VC link - https://phabricator.wikimedia.org/T228823 (ayounsi) p:Triage→High
[02:04:41] <icinga-wm>	 PROBLEM - puppet last run on cp1075 is CRITICAL: CRITICAL: Failed to apply catalog, zero resources tracked by Puppet. Could be an interrupted request or a dependency cycle. https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[02:12:23] <wikibugs>	 Operations, netops, observability: Add VCP stats monitoring - https://phabricator.wikimedia.org/T228824 (ayounsi) p:Triage→Normal
[02:31:31] <icinga-wm>	 PROBLEM - puppet last run on cloudcontrol1004 is CRITICAL: CRITICAL: Failed to apply catalog, zero resources tracked by Puppet. Could be an interrupted request or a dependency cycle. https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[02:32:55] <icinga-wm>	 RECOVERY - puppet last run on cp1075 is OK: OK: Puppet is currently enabled, last run 27 seconds ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[02:37:12] <wikibugs>	 Operations, ops-eqiad, DC-Ops: a1-eqiad pdu refresh - https://phabricator.wikimedia.org/T226782 (ayounsi) Seems like only 1 interface is `master` on cr1 the following is needed to fail it over ` [edit interfaces ae2 unit 1202 family inet6 address 2620:0:861:202:fe00::1/64 vrrp-inet6-group 2] +...
[02:54:43] <wikibugs>	 (CR) MaxSem: [C: +1] Add logging for DeferredUpdates [mediawiki-config] - https://gerrit.wikimedia.org/r/525185 (https://phabricator.wikimedia.org/T228462) (owner: Tim Starling)
[02:58:53] <icinga-wm>	 PROBLEM - dump of s5 in codfw on db1115 is CRITICAL: dump for s5 at codfw taken more than 8 days ago: Most recent backup 2019-07-16 02:25:48 https://wikitech.wikimedia.org/wiki/MariaDB/Backups
[02:59:45] <icinga-wm>	 RECOVERY - puppet last run on cloudcontrol1004 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[03:17:49] <icinga-wm>	 PROBLEM - puppet last run on etcd1004 is CRITICAL: CRITICAL: Failed to apply catalog, zero resources tracked by Puppet. Could be an interrupted request or a dependency cycle. https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[03:23:07] <icinga-wm>	 PROBLEM - Mediawiki Cirrussearch update rate - eqiad on icinga1001 is CRITICAL: CRITICAL: 30.00% of data under the critical threshold [50.0] https://wikitech.wikimedia.org/wiki/Search%23No_updates https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?panelId=44&fullscreen&orgId=1
[03:23:27] <icinga-wm>	 PROBLEM - Mediawiki Cirrussearch update rate - codfw on icinga1001 is CRITICAL: CRITICAL: 30.00% of data under the critical threshold [50.0] https://wikitech.wikimedia.org/wiki/Search%23No_updates https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?panelId=44&fullscreen&orgId=1
[03:34:54] <wikibugs>	 (CR) Tim Starling: [C: +2] "Thanks MaxSem" [mediawiki-config] - https://gerrit.wikimedia.org/r/525185 (https://phabricator.wikimedia.org/T228462) (owner: Tim Starling)
[03:35:51] <wikibugs>	 (Merged) jenkins-bot: Add logging for DeferredUpdates [mediawiki-config] - https://gerrit.wikimedia.org/r/525185 (https://phabricator.wikimedia.org/T228462) (owner: Tim Starling)
[03:36:32] <wikibugs>	 (CR) jenkins-bot: Add logging for DeferredUpdates [mediawiki-config] - https://gerrit.wikimedia.org/r/525185 (https://phabricator.wikimedia.org/T228462) (owner: Tim Starling)
[03:39:19] <logmsgbot>	 !log tstarling@deploy1001 Synchronized wmf-config/InitialiseSettings.php: Adding DeferredUpdates log channel (T228462) (duration: 00m 56s)
[03:39:27] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[03:40:27] <icinga-wm>	 RECOVERY - puppet last run on etcd1004 is OK: OK: Puppet is currently enabled, last run 57 seconds ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[03:41:09] <logmsgbot>	 !log tstarling@deploy1001 Synchronized w/fatal-error.php: Adding post-send exception test for T228462 (duration: 00m 54s)
[03:41:15] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[03:42:45] <wikibugs>	 (PS1) Zoranzoki21: Correct a typo on the label newarticle in the help panel on arwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/525190 (https://phabricator.wikimedia.org/T228820)
[03:43:10] <wikibugs>	 (PS2) Zoranzoki21: Correct a typo on the label newarticle in the help panel on arwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/525190 (https://phabricator.wikimedia.org/T228820)
[03:44:51] <icinga-wm>	 RECOVERY - Mediawiki Cirrussearch update rate - eqiad on icinga1001 is OK: OK: Less than 1.00% under the threshold [80.0] https://wikitech.wikimedia.org/wiki/Search%23No_updates https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?panelId=44&fullscreen&orgId=1
[03:46:30] <wikibugs>	 (PS3) Zoranzoki21: Correct a typo on the label newarticle in the help panel for arwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/525190 (https://phabricator.wikimedia.org/T228820)
[03:46:51] <icinga-wm>	 RECOVERY - Mediawiki Cirrussearch update rate - codfw on icinga1001 is OK: OK: Less than 1.00% under the threshold [80.0] https://wikitech.wikimedia.org/wiki/Search%23No_updates https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?panelId=44&fullscreen&orgId=1
[03:56:41] <icinga-wm>	 PROBLEM - puppet last run on cp3030 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 4 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/usr/local/bin/apt2xml] https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[04:07:13] <icinga-wm>	 PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 229, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[04:08:11] <icinga-wm>	 PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 71, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[04:24:55] <icinga-wm>	 RECOVERY - puppet last run on cp3030 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[04:40:56] <logmsgbot>	 !log tstarling@deploy1001 Synchronized php-1.34.0-wmf.15/includes/specials/SpecialGoToInterwiki.php: (no justification provided) (duration: 00m 56s)
[04:41:02] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:41:19] <icinga-wm>	 PROBLEM - PHP opcache health on mwdebug2002 is CRITICAL: CRITICAL: opcache free space is below 50 MB https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health
[04:42:07] <logmsgbot>	 !log tstarling@deploy1001 Synchronized php-1.34.0-wmf.15/includes/MediaWiki.php: T227700 (duration: 00m 54s)
[04:42:13] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:42:14] <stashbot>	 T227700: Fatal on some Special:MyLanguage urls: MWException "Can't determine talk page associated with interwiki link" - https://phabricator.wikimedia.org/T227700
[04:43:51] <icinga-wm>	 PROBLEM - PHP opcache health on mwdebug2001 is CRITICAL: CRITICAL: opcache free space is below 50 MB https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health
[04:45:20] <logmsgbot>	 !log tstarling@deploy1001 Synchronized php-1.34.0-wmf.15/includes/specials/SpecialGoToInterwiki.php: T227700 (duration: 00m 54s)
[04:45:26] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:46:15] <logmsgbot>	 !log tstarling@deploy1001 Synchronized php-1.34.0-wmf.15/includes/MediaWiki.php: T227700 (duration: 00m 54s)
[04:46:22] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:49:10] <logmsgbot>	 !log tstarling@deploy1001 Synchronized php-1.34.0-wmf.14/includes/specials/SpecialGoToInterwiki.php: T227700 (duration: 00m 54s)
[04:49:17] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:49:23] <stashbot>	 T227700: Fatal on some Special:MyLanguage urls: MWException "Can't determine talk page associated with interwiki link" - https://phabricator.wikimedia.org/T227700
[04:50:04] <logmsgbot>	 !log tstarling@deploy1001 Synchronized php-1.34.0-wmf.14/includes/MediaWiki.php: T227700 (duration: 00m 53s)
[04:50:10] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:50:17] <icinga-wm>	 RECOVERY - PHP opcache health on mwdebug1001 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health
[04:51:29] <logmsgbot>	 !log tstarling@deploy1001 Synchronized php-1.34.0-wmf.14/includes/specials/SpecialGoToInterwiki.php: T227700 (duration: 00m 54s)
[04:51:35] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:52:24] <logmsgbot>	 !log tstarling@deploy1001 Synchronized php-1.34.0-wmf.14/includes/MediaWiki.php: T227700 (duration: 00m 54s)
[04:52:41] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:00:48] <marostegui>	 !log Stop puppet on dbprov2001 to generate s5 mysqldump manually
[05:00:54] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:21:51] <wikibugs>	 Operations, Gerrit, LDAP-Access-Requests, Release-Engineering-Team-TODO, Release-Engineering-Team (Development services): Add more SREs to gerritadmin LDAP group - https://phabricator.wikimedia.org/T228733 (Jony) I would like become a volunteer to help.
[05:24:04] <wikibugs>	 (CR) Muehlenhoff: profile::kerberos::kadminserver: add script to create principals (4 comments) [puppet] - https://gerrit.wikimedia.org/r/525137 (https://phabricator.wikimedia.org/T226104) (owner: Elukey)
[05:38:00] <wikibugs>	 (PS7) Giuseppe Lavagetto: Configure forensic logging of Apache requests; enable on beta [puppet] - https://gerrit.wikimedia.org/r/511751 (owner: Ori.livneh)
[05:39:01] <wikibugs>	 (CR) jerkins-bot: [V: -1] Configure forensic logging of Apache requests; enable on beta [puppet] - https://gerrit.wikimedia.org/r/511751 (owner: Ori.livneh)
[05:46:13] <icinga-wm>	 PROBLEM - OSPF status on cr1-codfw is CRITICAL: OSPFv2: 5/6 UP : OSPFv3: 5/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[05:47:55] <icinga-wm>	 RECOVERY - OSPF status on cr1-codfw is OK: OSPFv2: 6/6 UP : OSPFv3: 6/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[05:50:45] <wikibugs>	 (PS1) Ayounsi: Routinator set refresh to 10min (instead of 1h) [puppet] - https://gerrit.wikimedia.org/r/525204 (https://phabricator.wikimedia.org/T220669)
[06:06:07] <wikibugs>	 (PS4) Fsero: zuul: stop zuul-merger gracefully [puppet] - https://gerrit.wikimedia.org/r/524180 (owner: Hashar)
[06:06:20] <wikibugs>	 (PS4) Fsero: zuul: fix systemd Service/TimeoutStopSec [puppet] - https://gerrit.wikimedia.org/r/524174 (https://phabricator.wikimedia.org/T228381) (owner: Hashar)
[06:07:18] <wikibugs>	 (CR) Fsero: [C: +2] zuul: stop zuul-merger gracefully [puppet] - https://gerrit.wikimedia.org/r/524180 (owner: Hashar)
[06:09:04] <wikibugs>	 (PS5) Fsero: zuul: fix systemd Service/TimeoutStopSec [puppet] - https://gerrit.wikimedia.org/r/524174 (https://phabricator.wikimedia.org/T228381) (owner: Hashar)
[06:10:01] <wikibugs>	 (CR) Fsero: [C: +2] zuul: fix systemd Service/TimeoutStopSec [puppet] - https://gerrit.wikimedia.org/r/524174 (https://phabricator.wikimedia.org/T228381) (owner: Hashar)
[06:18:50] <wikibugs>	 (PS3) Fsero: registry: introducing read only mode for maintenances [puppet] - https://gerrit.wikimedia.org/r/523100 (https://phabricator.wikimedia.org/T227570)
[06:21:37] <wikibugs>	 (PS4) Fsero: registry: introducing read only mode for maintenances [puppet] - https://gerrit.wikimedia.org/r/523100 (https://phabricator.wikimedia.org/T227570)
[06:25:02] <wikibugs>	 (PS5) Fsero: registry: introducing read only mode for maintenances [puppet] - https://gerrit.wikimedia.org/r/523100 (https://phabricator.wikimedia.org/T227570)
[06:25:10] <wikibugs>	 Operations, netops: Instability of the Level3 link between cr2-eqiad and cr2-esams - https://phabricator.wikimedia.org/T228827 (elukey) p:Triage→High
[06:26:22] <wikibugs>	 (CR) Muehlenhoff: profile::kerberos::kadminserver: add script to create principals (4 comments) [puppet] - https://gerrit.wikimedia.org/r/525137 (https://phabricator.wikimedia.org/T226104) (owner: Elukey)
[06:27:10] <wikibugs>	 (CR) Smalyshev: "compiler result looks good" [puppet] - https://gerrit.wikimedia.org/r/524954 (https://phabricator.wikimedia.org/T228122) (owner: Smalyshev)
[06:28:02] <wikibugs>	 (CR) Fsero: [C: +2] "PCC is happy https://puppet-compiler.wmflabs.org/compiler1002/17583/" [puppet] - https://gerrit.wikimedia.org/r/523100 (https://phabricator.wikimedia.org/T227570) (owner: Fsero)
[06:28:06] <wikibugs>	 (CR) Elukey: profile::kerberos::kadminserver: add script to create principals (4 comments) [puppet] - https://gerrit.wikimedia.org/r/525137 (https://phabricator.wikimedia.org/T226104) (owner: Elukey)
[06:28:31] <elukey>	 ah other comments :D
[06:31:32] <moritzm>	 sorry, slight race there :-)
[06:32:08] <moritzm>	 I went down the rabbit hole of looking for options to hide the process name from within Python
[06:32:10] <wikibugs>	 (CR) Elukey: profile::kerberos::kadminserver: add script to create principals (3 comments) [puppet] - https://gerrit.wikimedia.org/r/525137 (https://phabricator.wikimedia.org/T226104) (owner: Elukey)
[06:32:24] <moritzm>	  but there's only support on Win32 for that and the kernel approach is much cleaner anyway
[06:33:20] <elukey>	 let's use windows for the kadmin host!
[06:33:23] * elukey runs away
[06:33:27] <icinga-wm>	 PROBLEM - puppet last run on mw1308 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 7 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/usr/local/bin/cgroup-mediawiki-clean] https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[06:33:29] <icinga-wm>	 PROBLEM - puppet last run on db2098 is CRITICAL: CRITICAL: Failed to apply catalog, zero resources tracked by Puppet. Could be an interrupted request or a dependency cycle. https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[06:33:39] <icinga-wm>	 PROBLEM - Docker registry HTTP interface on registry1002 is CRITICAL: connect to address 10.64.32.139 and port 81: Connection refused https://wikitech.wikimedia.org/wiki/Docker
[06:33:48] <wikibugs>	 (CR) Muehlenhoff: profile::kerberos::kadminserver: add script to create principals (1 comment) [puppet] - https://gerrit.wikimedia.org/r/525137 (https://phabricator.wikimedia.org/T226104) (owner: Elukey)
[06:34:13] <fsero>	 acknowledged those 
[06:34:15] <fsero>	 from registry
[06:34:20] <moritzm>	 elukey: ReactOS, given our FLOSS policy
[06:35:15] <wikibugs>	 (PS1) Fsero: registry: bug: disabling read only mode [puppet] - https://gerrit.wikimedia.org/r/525207
[06:35:33] <wikibugs>	 (CR) Fsero: [C: +2] registry: bug: disabling read only mode [puppet] - https://gerrit.wikimedia.org/r/525207 (owner: Fsero)
[06:38:41] <icinga-wm>	 RECOVERY - Docker registry HTTP interface on registry1002 is OK: HTTP OK: Status line output matched HTTP/1.1 403 - 407 bytes in 0.001 second response time https://wikitech.wikimedia.org/wiki/Docker
[06:43:17] <wikibugs>	 (CR) Elukey: profile::kerberos::kadminserver: add script to create principals (1 comment) [puppet] - https://gerrit.wikimedia.org/r/525137 (https://phabricator.wikimedia.org/T226104) (owner: Elukey)
[06:47:57] <icinga-wm>	 RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 73, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[06:48:27] <icinga-wm>	 RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 231, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[06:50:20] <wikibugs>	 (CR) Muehlenhoff: profile::kerberos::kadminserver: add script to create principals (1 comment) [puppet] - https://gerrit.wikimedia.org/r/525137 (https://phabricator.wikimedia.org/T226104) (owner: Elukey)
[06:51:37] <icinga-wm>	 PROBLEM - Esams HTTP 5xx reqs/min on graphite1004 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/d/myRmf1Pik/varnish-aggregate-client-status-codes?var-site=esams&var-status_type=5
[06:51:39] <icinga-wm>	 PROBLEM - Text HTTP 5xx reqs/min on graphite1004 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/d/myRmf1Pik/varnish-aggregate-client-status-codes?var-cache_type=varnish-text&var-status_type=5
[06:54:45] <wikibugs>	 (PS1) Marostegui: mariadb: Provision dbproxy1013 [puppet] - https://gerrit.wikimedia.org/r/525213 (https://phabricator.wikimedia.org/T202367)
[06:55:15] <icinga-wm>	 RECOVERY - dump of s5 in codfw on db1115 is OK: dump for s5 at codfw taken less than 8 days ago and larger than 10 GB: Last one 2019-07-24 05:00:17 from db2099.codfw.wmnet:3315 (98 GB) https://wikitech.wikimedia.org/wiki/MariaDB/Backups
[06:56:01] <icinga-wm>	 RECOVERY - puppet last run on mw1308 is OK: OK: Puppet is currently enabled, last run 14 seconds ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[06:56:03] <icinga-wm>	 RECOVERY - puppet last run on db2098 is OK: OK: Puppet is currently enabled, last run 19 seconds ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[06:58:15] <icinga-wm>	 RECOVERY - Esams HTTP 5xx reqs/min on graphite1004 is OK: OK: Less than 1.00% above the threshold [250.0] https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/d/myRmf1Pik/varnish-aggregate-client-status-codes?var-site=esams&var-status_type=5
[06:58:17] <icinga-wm>	 RECOVERY - Text HTTP 5xx reqs/min on graphite1004 is OK: OK: Less than 1.00% above the threshold [250.0] https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/d/myRmf1Pik/varnish-aggregate-client-status-codes?var-cache_type=varnish-text&var-status_type=5
[07:05:05] <elukey>	 Level3 link seems to have recovered --^ 
[07:08:09] <wikibugs>	 (CR) Marostegui: "Puppet looks good: https://puppet-compiler.wmflabs.org/compiler1002/17585/" [puppet] - https://gerrit.wikimedia.org/r/525213 (https://phabricator.wikimedia.org/T202367) (owner: Marostegui)
[07:10:00] <marostegui>	 !log Deploy grants for dbproxy1013 in m2 - T202367
[07:10:08] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:10:11] <stashbot>	 T202367: Productionize dbproxy101[2-7].eqiad.wmnet and dbproxy200[1-4] - https://phabricator.wikimedia.org/T202367
[07:11:23] <wikibugs>	 (CR) Marostegui: [C: +2] mariadb: Provision dbproxy1013 [puppet] - https://gerrit.wikimedia.org/r/525213 (https://phabricator.wikimedia.org/T202367) (owner: Marostegui)
[07:21:20] <marostegui>	 !log Stop MySQL on db1117:3322 to check dbproxy1013 notifications - T202367
[07:21:26] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:21:27] <stashbot>	 T202367: Productionize dbproxy101[2-7].eqiad.wmnet and dbproxy200[1-4] - https://phabricator.wikimedia.org/T202367
[07:24:53] <icinga-wm>	 PROBLEM - haproxy failover on dbproxy1007 is CRITICAL: CRITICAL check_failover servers up 1 down 1 https://wikitech.wikimedia.org/wiki/HAProxy
[07:25:05] <icinga-wm>	 PROBLEM - haproxy failover on dbproxy1002 is CRITICAL: CRITICAL check_failover servers up 1 down 1 https://wikitech.wikimedia.org/wiki/HAProxy
[07:25:13] <icinga-wm>	 PROBLEM - haproxy failover on dbproxy1013 is CRITICAL: CRITICAL check_failover servers up 1 down 1 https://wikitech.wikimedia.org/wiki/HAProxy
[07:25:38] <marostegui>	 ^ me
[07:25:39] <marostegui>	 expected
[07:26:07] <icinga-wm>	 RECOVERY - haproxy failover on dbproxy1007 is OK: OK check_failover servers up 2 down 0 https://wikitech.wikimedia.org/wiki/HAProxy
[07:26:14] <wikibugs>	 (PS2) Muehlenhoff: Enable seccomp-based hardening for apt on Buster and later [puppet] - https://gerrit.wikimedia.org/r/525115
[07:26:23] <icinga-wm>	 RECOVERY - haproxy failover on dbproxy1002 is OK: OK check_failover servers up 2 down 0 https://wikitech.wikimedia.org/wiki/HAProxy
[07:26:29] <icinga-wm>	 RECOVERY - haproxy failover on dbproxy1013 is OK: OK check_failover servers up 2 down 0 https://wikitech.wikimedia.org/wiki/HAProxy
[07:34:58] <wikibugs>	 (PS1) Ema: ATS: do not cache responses to cookies [puppet] - https://gerrit.wikimedia.org/r/525219 (https://phabricator.wikimedia.org/T227432)
[07:37:45] <wikibugs>	 (CR) Ema: [C: +2] ATS: do not cache responses to cookies [puppet] - https://gerrit.wikimedia.org/r/525219 (https://phabricator.wikimedia.org/T227432) (owner: Ema)
[07:37:48] <wikibugs>	 (PS2) Elukey: profile::kerberos::kadminserver: add script to create principals [puppet] - https://gerrit.wikimedia.org/r/525137 (https://phabricator.wikimedia.org/T226104)
[07:43:27] <wikibugs>	 (PS1) Muehlenhoff: Switch back Cloud VPS instances to the read-only replicas [puppet] - https://gerrit.wikimedia.org/r/525220 (https://phabricator.wikimedia.org/T46722)
[07:51:13] <hashar>	 fsero: thank you for the merge of the Zuul systemd tweaks :]
[07:52:39] <wikibugs>	 (PS1) Ema: ATS: gracefully fail request coalescing [puppet] - https://gerrit.wikimedia.org/r/525222 (https://phabricator.wikimedia.org/T227432)
[07:56:49] <wikibugs>	 (CR) Ema: [C: +2] ATS: gracefully fail request coalescing [puppet] - https://gerrit.wikimedia.org/r/525222 (https://phabricator.wikimedia.org/T227432) (owner: Ema)
[07:57:59] <marostegui>	 !log Drop abuse_filter_log.afl_log_id from wikidata in eqiad - T226851
[07:58:05] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:58:06] <stashbot>	 T226851: Drop abuse_filter_log.afl_log_id in production - https://phabricator.wikimedia.org/T226851
[08:02:50] <wikibugs>	 (PS1) Fsero: registry: setting eqiad registries in read_only_mode [puppet] - https://gerrit.wikimedia.org/r/525223 (https://phabricator.wikimedia.org/T227570)
[08:04:52] <wikibugs>	 (CR) Fsero: [C: +2] "PCC is happy https://puppet-compiler.wmflabs.org/compiler1002/17587/registry1001.eqiad.wmnet/" [puppet] - https://gerrit.wikimedia.org/r/525223 (https://phabricator.wikimedia.org/T227570) (owner: Fsero)
[08:05:01] <wikibugs>	 (PS2) Fsero: registry: setting eqiad registries in read_only_mode [puppet] - https://gerrit.wikimedia.org/r/525223 (https://phabricator.wikimedia.org/T227570)
[08:09:20] <wikibugs>	 (PS1) Elukey: profile::mediawiki::mcrouter_wancache: set async behavior as default [puppet] - https://gerrit.wikimedia.org/r/525224 (https://phabricator.wikimedia.org/T225642)
[08:15:13] <wikibugs>	 Operations, Prod-Kubernetes, serviceops, Kubernetes, User-fsero: Set up a local redis proxy since docker-registry can only connect to one redis instance for caching - https://phabricator.wikimedia.org/T215809 (fsero) a:fsero→None
[08:15:41] <wikibugs>	 Operations, Prod-Kubernetes, serviceops, Kubernetes, and 2 others: Package envoy 1.9.X for stretch and use it as redis proxy on docker registry - https://phabricator.wikimedia.org/T215810 (fsero) Open→Resolved package is done and uploaded long time ago.
[08:15:44] <wikibugs>	 Operations, Prod-Kubernetes, serviceops, Kubernetes, User-fsero: Set up a local redis proxy since docker-registry can only connect to one redis instance for caching - https://phabricator.wikimedia.org/T215809 (fsero)
[08:16:27] <wikibugs>	 Operations, Prod-Kubernetes, serviceops, Kubernetes, and 2 others: improve docker registry architecture - https://phabricator.wikimedia.org/T209271 (fsero) Keeping this task opened, but we can mark iteration 1 as completed
[08:21:05] <wikibugs>	 (CR) Elukey: "https://puppet-compiler.wmflabs.org/compiler1001/17588/" [puppet] - https://gerrit.wikimedia.org/r/525224 (https://phabricator.wikimedia.org/T225642) (owner: Elukey)
[08:21:27] <wikibugs>	 Operations, netops: Instability of the Level3 link between cr2-eqiad and cr2-esams - https://phabricator.wikimedia.org/T228827 (Peachey88)
[08:34:59] <marostegui>	 !log Drop abuse_filter_log.afl_log_id in s2 codfw (lag will appear on codfw) - T226851
[08:35:05] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:35:06] <stashbot>	 T226851: Drop abuse_filter_log.afl_log_id in production - https://phabricator.wikimedia.org/T226851
[08:35:12] <wikibugs>	 (CR) Muehlenhoff: "A few nits inline, LGTM otherwise" (3 comments) [puppet] - https://gerrit.wikimedia.org/r/525137 (https://phabricator.wikimedia.org/T226104) (owner: Elukey)
[08:37:22] <wikibugs>	 (PS1) Marostegui: db-eqiad.php: Depool db1082 [mediawiki-config] - https://gerrit.wikimedia.org/r/525226
[08:38:19] <wikibugs>	 (CR) Marostegui: [C: +2] db-eqiad.php: Depool db1082 [mediawiki-config] - https://gerrit.wikimedia.org/r/525226 (owner: Marostegui)
[08:38:50] <wikibugs>	 Operations, Prod-Kubernetes, serviceops, Kubernetes, and 2 others: improve docker registry architecture - https://phabricator.wikimedia.org/T209271 (fsero) Open→Resolved a:fsero→None
[08:38:56] <wikibugs>	 (CR) Filippo Giunchedi: "> Patch Set 5:" (1 comment) [puppet] - https://gerrit.wikimedia.org/r/524037 (owner: Ayounsi)
[08:38:56] <wikibugs>	 Operations, Kubernetes: Evaluate VMWare's Harbour as a docker registry - https://phabricator.wikimedia.org/T202504 (fsero)
[08:39:15] <wikibugs>	 (Merged) jenkins-bot: db-eqiad.php: Depool db1082 [mediawiki-config] - https://gerrit.wikimedia.org/r/525226 (owner: Marostegui)
[08:39:30] <wikibugs>	 (CR) jenkins-bot: db-eqiad.php: Depool db1082 [mediawiki-config] - https://gerrit.wikimedia.org/r/525226 (owner: Marostegui)
[08:39:58] <wikibugs>	 Operations, Prod-Kubernetes, serviceops, Kubernetes, and 2 others: improve docker registry architecture - https://phabricator.wikimedia.org/T209271 (fsero) Resolved→Open
[08:40:03] <wikibugs>	 Operations, Kubernetes: Evaluate VMWare's Harbour as a docker registry - https://phabricator.wikimedia.org/T202504 (fsero)
[08:40:37] <logmsgbot>	 !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Depool db1082 for upgrade (duration: 00m 57s)
[08:40:42] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:40:51] <marostegui>	 !log Stop MySQL on db1082 for upgrade
[08:40:56] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:43:03] <wikibugs>	 (03CR) 10Elukey: profile::kerberos::kadminserver: add script to create principals (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/525137 (https://phabricator.wikimedia.org/T226104) (owner: 10Elukey)
[08:45:41] <wikibugs>	 (03PS3) 10Elukey: profile::kerberos::kadminserver: add script to create principals [puppet] - 10https://gerrit.wikimedia.org/r/525137 (https://phabricator.wikimedia.org/T226104)
[08:48:43] <wikibugs>	 (03PS1) 10Muehlenhoff: profile::mediawiki::nutcracker: Remove support for jessie [puppet] - 10https://gerrit.wikimedia.org/r/525229
[08:50:10] <wikibugs>	 (03PS1) 10Marostegui: db-eqiad.php: Slowly repool db1082 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/525230
[08:51:28] <wikibugs>	 (03CR) 10Marostegui: [C: 03+2] db-eqiad.php: Slowly repool db1082 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/525230 (owner: 10Marostegui)
[08:51:32] <wikibugs>	 (03PS1) 10Muehlenhoff: role::alerting_host: Remove support for jessie [puppet] - 10https://gerrit.wikimedia.org/r/525232
[08:52:20] <wikibugs>	 (03Merged) 10jenkins-bot: db-eqiad.php: Slowly repool db1082 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/525230 (owner: 10Marostegui)
[08:52:44] <wikibugs>	 (03CR) 10jenkins-bot: db-eqiad.php: Slowly repool db1082 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/525230 (owner: 10Marostegui)
[08:53:34] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/525137 (https://phabricator.wikimedia.org/T226104) (owner: 10Elukey)
[08:53:37] <logmsgbot>	 !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Slowly repool db1082 after upgrade (duration: 00m 54s)
[08:53:42] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:54:53] <wikibugs>	 (03PS1) 10Marostegui: db-eqiad.php: Repool db1082 into API [mediawiki-config] - 10https://gerrit.wikimedia.org/r/525233
[08:56:04] <wikibugs>	 (03CR) 10Marostegui: [C: 03+2] db-eqiad.php: Repool db1082 into API [mediawiki-config] - 10https://gerrit.wikimedia.org/r/525233 (owner: 10Marostegui)
[08:56:56] <wikibugs>	 (03Merged) 10jenkins-bot: db-eqiad.php: Repool db1082 into API [mediawiki-config] - 10https://gerrit.wikimedia.org/r/525233 (owner: 10Marostegui)
[08:58:02] <wikibugs>	 (03CR) 10Elukey: [C: 03+2] profile::kerberos::kadminserver: add script to create principals [puppet] - 10https://gerrit.wikimedia.org/r/525137 (https://phabricator.wikimedia.org/T226104) (owner: 10Elukey)
[08:58:04] <wikibugs>	 (03CR) 10jenkins-bot: db-eqiad.php: Repool db1082 into API [mediawiki-config] - 10https://gerrit.wikimedia.org/r/525233 (owner: 10Marostegui)
[08:58:07] <logmsgbot>	 !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Repool db1082 into API after upgrade (duration: 00m 55s)
[08:58:10] <wikibugs>	 (03PS4) 10Elukey: profile::kerberos::kadminserver: add script to create principals [puppet] - 10https://gerrit.wikimedia.org/r/525137 (https://phabricator.wikimedia.org/T226104)
[08:58:13] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:03:12] <wikibugs>	 10Operations, 10Analytics, 10LDAP-Access-Requests, 10wikimediafoundation.org: Access to WikimediaFoundation.org analytics for Deb - https://phabricator.wikimedia.org/T227496 (10MoritzMuehlenhoff) 05Resolved→03Open @herron: If you add an account to a PII-relevant LDAP group which does not have shell acc...
[09:03:20] <wikibugs>	 (03PS3) 10Marostegui: mariadb: Promote db1128 as master for m3 [puppet] - 10https://gerrit.wikimedia.org/r/523941 (https://phabricator.wikimedia.org/T228243)
[09:10:14] <wikibugs>	 (03PS1) 10Elukey: profile::kerberos::kadminserver: highlight 'test' when creating users [puppet] - 10https://gerrit.wikimedia.org/r/525235 (https://phabricator.wikimedia.org/T226104)
[09:16:00] <wikibugs>	 (03CR) 10Elukey: [C: 03+2] profile::kerberos::kadminserver: highlight 'test' when creating users [puppet] - 10https://gerrit.wikimedia.org/r/525235 (https://phabricator.wikimedia.org/T226104) (owner: 10Elukey)
[09:18:34] <wikibugs>	 (03PS1) 10Fsero: k8s: deploy users should be able to get and list any resource. [deployment-charts] - 10https://gerrit.wikimedia.org/r/525236
[09:19:03] <wikibugs>	 (03PS1) 10Volans: setup.py: re-include tests in the distribution [software/conftool] - 10https://gerrit.wikimedia.org/r/525237
[09:19:28] <wikibugs>	 (03CR) 10Fsero: [V: 03+2 C: 03+2] k8s: deploy users should be able to get and list any resource. [deployment-charts] - 10https://gerrit.wikimedia.org/r/525236 (owner: 10Fsero)
[09:21:28] <wikibugs>	 (03CR) 10Giuseppe Lavagetto: [C: 03+1] setup.py: re-include tests in the distribution [software/conftool] - 10https://gerrit.wikimedia.org/r/525237 (owner: 10Volans)
[09:21:56] <wikibugs>	 (03CR) 10Volans: [C: 03+2] setup.py: re-include tests in the distribution [software/conftool] - 10https://gerrit.wikimedia.org/r/525237 (owner: 10Volans)
[09:22:09] <wikibugs>	 (03PS1) 10Fsero: Revert "k8s: deploy users should be able to get and list any resource." [deployment-charts] - 10https://gerrit.wikimedia.org/r/525239
[09:22:15] <wikibugs>	 (03CR) 10Fsero: [V: 03+2 C: 03+2] Revert "k8s: deploy users should be able to get and list any resource." [deployment-charts] - 10https://gerrit.wikimedia.org/r/525239 (owner: 10Fsero)
[09:24:35] <wikibugs>	 (03Merged) 10jenkins-bot: setup.py: re-include tests in the distribution [software/conftool] - 10https://gerrit.wikimedia.org/r/525237 (owner: 10Volans)
[09:26:16] <wikibugs>	 (03PS1) 10Fsero: k8s: deploy user should be able to list any resource [deployment-charts] - 10https://gerrit.wikimedia.org/r/525240
[09:26:53] <wikibugs>	 (03PS1) 10Volans: Release 1.1.3 [software/conftool] - 10https://gerrit.wikimedia.org/r/525241
[09:27:20] <wikibugs>	 (03PS2) 10Fsero: k8s: deploy user should be able to list any resource [deployment-charts] - 10https://gerrit.wikimedia.org/r/525240
[09:28:01] <wikibugs>	 (03PS1) 10Elukey: profile::kerberos::kadminserver: properly indent emails to send [puppet] - 10https://gerrit.wikimedia.org/r/525242 (https://phabricator.wikimedia.org/T226104)
[09:28:44] <wikibugs>	 (03PS2) 10Filippo Giunchedi: varnish: remove varnishreqstats-based alerts [puppet] - 10https://gerrit.wikimedia.org/r/523891 (https://phabricator.wikimedia.org/T184942)
[09:28:56] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C: 03+2] "PCC https://puppet-compiler.wmflabs.org/compiler1002/17591/" [puppet] - 10https://gerrit.wikimedia.org/r/523891 (https://phabricator.wikimedia.org/T184942) (owner: 10Filippo Giunchedi)
[09:29:46] <wikibugs>	 (03CR) 10Volans: [C: 03+2] Release 1.1.3 [software/conftool] - 10https://gerrit.wikimedia.org/r/525241 (owner: 10Volans)
[09:29:51] <wikibugs>	 (03CR) 10Elukey: [C: 03+2] profile::kerberos::kadminserver: properly indent emails to send [puppet] - 10https://gerrit.wikimedia.org/r/525242 (https://phabricator.wikimedia.org/T226104) (owner: 10Elukey)
[09:30:00] <wikibugs>	 (03PS2) 10Elukey: profile::kerberos::kadminserver: properly indent emails to send [puppet] - 10https://gerrit.wikimedia.org/r/525242 (https://phabricator.wikimedia.org/T226104)
[09:32:21] <wikibugs>	 (03Merged) 10jenkins-bot: Release 1.1.3 [software/conftool] - 10https://gerrit.wikimedia.org/r/525241 (owner: 10Volans)
[09:33:58] <wikibugs>	 (03CR) 10Giuseppe Lavagetto: Add the LBRemoteCluster class. (0311 comments) [software/spicerack] - 10https://gerrit.wikimedia.org/r/503947 (owner: 10Giuseppe Lavagetto)
[09:34:09] <wikibugs>	 (03PS1) 10Marostegui: db-eqiad.php: More traffic to db1082 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/525243
[09:37:18] <wikibugs>	 (03CR) 10Marostegui: [C: 03+2] db-eqiad.php: More traffic to db1082 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/525243 (owner: 10Marostegui)
[09:37:44] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C: 03+1] "PCC https://puppet-compiler.wmflabs.org/compiler1001/17593/" [puppet] - 10https://gerrit.wikimedia.org/r/523892 (https://phabricator.wikimedia.org/T184942) (owner: 10Filippo Giunchedi)
[09:38:18] <wikibugs>	 (03Merged) 10jenkins-bot: db-eqiad.php: More traffic to db1082 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/525243 (owner: 10Marostegui)
[09:38:34] <wikibugs>	 (03CR) 10jenkins-bot: db-eqiad.php: More traffic to db1082 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/525243 (owner: 10Marostegui)
[09:39:28] <logmsgbot>	 !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: More traffic to db1082 (duration: 00m 55s)
[09:39:35] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:45:05] <wikibugs>	 (03PS2) 10Filippo Giunchedi: varnish: ensure varnishreqstats is absent [puppet] - 10https://gerrit.wikimedia.org/r/523892 (https://phabricator.wikimedia.org/T184942)
[09:46:47] <wikibugs>	 10Operations, 10observability, 10User-fgiunchedi: Include apache_exporter in puppet module apache - https://phabricator.wikimedia.org/T187434 (10fgiunchedi) Parent task is a goal instead
[09:46:57] <icinga-wm>	 PROBLEM - HHVM rendering on mw1229 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers
[09:47:00] <wikibugs>	 10Operations, 10observability, 10Goal, 10Technical-Debt, and 2 others: Reduce technical debt in metrics monitoring - https://phabricator.wikimedia.org/T177195 (10fgiunchedi)
[09:47:47] <icinga-wm>	 RECOVERY - HHVM rendering on mw1229 is OK: HTTP OK: HTTP/1.1 200 OK - 76100 bytes in 0.169 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[09:49:18] <wikibugs>	 (03CR) 10Ema: [C: 03+1] "LGTM!" [puppet] - 10https://gerrit.wikimedia.org/r/523892 (https://phabricator.wikimedia.org/T184942) (owner: 10Filippo Giunchedi)
[09:49:21] <wikibugs>	 10Operations, 10Cloud-VPS, 10Toolforge, 10LDAP, and 2 others: LDAP server running out of memory frequently and disrupting Cloud VPS clients - https://phabricator.wikimedia.org/T217280 (10aborrero)
[09:59:31] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C: 03+2] varnish: ensure varnishreqstats is absent [puppet] - 10https://gerrit.wikimedia.org/r/523892 (https://phabricator.wikimedia.org/T184942) (owner: 10Filippo Giunchedi)
[10:00:38] <wikibugs>	 (03PS3) 10Filippo Giunchedi: varnish: ensure varnishreqstats is absent [puppet] - 10https://gerrit.wikimedia.org/r/523892 (https://phabricator.wikimedia.org/T184942)
[10:04:23] <marostegui>	 !log Drop abuse_filter_log.afl_log_id from labswiki (wikitech) and labtestwiki - T226851
[10:04:41] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:04:41] <stashbot>	 T226851: Drop abuse_filter_log.afl_log_id in production - https://phabricator.wikimedia.org/T226851
[10:06:54] <wikibugs>	 10Operations, 10Analytics, 10Analytics-EventLogging: Decommission m4 proxies (dbproxy1004 and dbproxy1008) - https://phabricator.wikimedia.org/T228768 (10Marostegui) p:05Triage→03Normal
[10:07:20] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C: 03+1] "LGTM!" [puppet] - 10https://gerrit.wikimedia.org/r/524934 (https://phabricator.wikimedia.org/T196066) (owner: 10Cwhite)
[10:08:27] <icinga-wm>	 PROBLEM - Varnish traffic logger - varnishreqstats on cp2014 is CRITICAL: NRPE: Command check_varnishreqstats not defined https://wikitech.wikimedia.org/wiki/Varnish
[10:08:35] <icinga-wm>	 PROBLEM - Varnish traffic logger - varnishreqstats on cp1077 is CRITICAL: NRPE: Command check_varnishreqstats not defined https://wikitech.wikimedia.org/wiki/Varnish
[10:08:40] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C: 03+1] "LGTM, see inline" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/525115 (owner: 10Muehlenhoff)
[10:09:21] <icinga-wm>	 PROBLEM - Varnish traffic logger - varnishreqstats on cp3043 is CRITICAL: NRPE: Command check_varnishreqstats not defined https://wikitech.wikimedia.org/wiki/Varnish
[10:09:39] <icinga-wm>	 PROBLEM - Varnish traffic logger - varnishreqstats on cp4031 is CRITICAL: NRPE: Command check_varnishreqstats not defined https://wikitech.wikimedia.org/wiki/Varnish
[10:10:27] <icinga-wm>	 PROBLEM - Varnish traffic logger - varnishreqstats on cp1076 is CRITICAL: NRPE: Command check_varnishreqstats not defined https://wikitech.wikimedia.org/wiki/Varnish
[10:10:55] <wikibugs>	 10Operations, 10Analytics, 10Analytics-EventLogging: Decommission m4 proxies (dbproxy1004 and dbproxy1008) - https://phabricator.wikimedia.org/T228768 (10elukey) +2 from Analytics
[10:11:33] <godog>	 the varnishreqstats alerts are expected, I'll silence
[10:11:41] <icinga-wm>	 PROBLEM - Varnish traffic logger - varnishreqstats on cp3042 is CRITICAL: NRPE: Command check_varnishreqstats not defined https://wikitech.wikimedia.org/wiki/Varnish
[10:11:43] <icinga-wm>	 PROBLEM - Varnish traffic logger - varnishreqstats on cp2016 is CRITICAL: NRPE: Command check_varnishreqstats not defined https://wikitech.wikimedia.org/wiki/Varnish
[10:11:43] <icinga-wm>	 PROBLEM - Varnish traffic logger - varnishreqstats on cp2018 is CRITICAL: NRPE: Command check_varnishreqstats not defined https://wikitech.wikimedia.org/wiki/Varnish
[10:11:47] <godog>	 sorry about the spam
[10:11:49] <wikibugs>	 10Operations, 10ops-codfw, 10DBA: pc2010 possibly broken memory - https://phabricator.wikimedia.org/T227552 (10Marostegui) 05Resolved→03Open a:05Marostegui→03Papaul Looks like this happened again and mysql crashed: @Papaul could this be the memory slot? Should we swap the DIMM with another existing D...
[10:12:03] <icinga-wm>	 PROBLEM - Varnish traffic logger - varnishreqstats on cp2008 is CRITICAL: NRPE: Command check_varnishreqstats not defined https://wikitech.wikimedia.org/wiki/Varnish
[10:12:49] <wikibugs>	 10Operations, 10Analytics, 10Analytics-EventLogging, 10DBA: Decommission m4 proxies (dbproxy1004 and dbproxy1008) - https://phabricator.wikimedia.org/T228768 (10Marostegui) a:03Marostegui Great - thanks. I will get them decommissioned
[10:15:47] <wikibugs>	 10Operations, 10Analytics, 10Analytics-EventLogging, 10DBA: Decommission dbproxy1004 and dbproxy1009 - https://phabricator.wikimedia.org/T228768 (10Marostegui)
[10:16:01] <wikibugs>	 10Operations, 10Analytics, 10Analytics-EventLogging, 10DBA, 10decommission: Decommission dbproxy1004 and dbproxy1009 - https://phabricator.wikimedia.org/T228768 (10Marostegui)
[10:17:59] <wikibugs>	 10Operations, 10ops-eqiad, 10DC-Ops: b3-eqiad pdu refresh - https://phabricator.wikimedia.org/T227539 (10aborrero)
[10:18:18] <wikibugs>	 10Operations, 10ops-eqiad, 10DC-Ops: b3-eqiad pdu refresh - https://phabricator.wikimedia.org/T227539 (10aborrero) This is ready to go on our side, hopefully today :-)
[10:20:11] <wikibugs>	 (03CR) 10Giuseppe Lavagetto: Add the LBRemoteCluster class. (031 comment) [software/spicerack] - 10https://gerrit.wikimedia.org/r/503947 (owner: 10Giuseppe Lavagetto)
[10:20:12] <wikibugs>	 10Operations, 10ops-eqiad, 10DC-Ops: b3-eqiad pdu refresh - https://phabricator.wikimedia.org/T227539 (10Marostegui)
[10:20:23] <wikibugs>	 (03PS11) 10Giuseppe Lavagetto: Add the LBRemoteCluster class. [software/spicerack] - 10https://gerrit.wikimedia.org/r/503947
[10:21:04] <wikibugs>	 10Operations, 10ops-eqiad, 10DC-Ops: b3-eqiad pdu refresh - https://phabricator.wikimedia.org/T227539 (10Marostegui) From the DBA side, it is good to go. db1073 is a master for m5 (wikitech, nova...) #cloud-services-team needs to decide if they can afford a downtime there.
[10:22:26] <wikibugs>	 (03PS12) 10Giuseppe Lavagetto: Add the LBRemoteCluster class. [software/spicerack] - 10https://gerrit.wikimedia.org/r/503947
[10:23:27] <wikibugs>	 10Operations, 10ops-eqiad, 10DC-Ops: b6-eqiad pdu refresh - https://phabricator.wikimedia.org/T227541 (10aborrero)
[10:28:15] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] Add the LBRemoteCluster class. [software/spicerack] - 10https://gerrit.wikimedia.org/r/503947 (owner: 10Giuseppe Lavagetto)
[10:29:05] <volans>	 I know I know jenkins... working on it :)
[10:31:51] <wikibugs>	 (03CR) 10Filippo Giunchedi: "See inline" (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/520502 (https://phabricator.wikimedia.org/T226815) (owner: 10Giuseppe Lavagetto)
[10:37:30] <icinga-wm>	 PROBLEM - puppet last run on cp1089 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 4 minutes ago with 1 failures. Failed resources (up to 3 shown): Service[varnishreqstats-frontend] https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[10:38:36] <icinga-wm>	 PROBLEM - puppet last run on cp1077 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 5 minutes ago with 1 failures. Failed resources (up to 3 shown): Service[varnishreqstats-frontend] https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[10:40:53] <godog>	 mhh looking
[10:41:20] <icinga-wm>	 PROBLEM - puppet last run on cp2014 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 6 minutes ago with 1 failures. Failed resources (up to 3 shown): Service[varnishreqstats-frontend] https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[10:41:38] <icinga-wm>	 PROBLEM - puppet last run on cp3043 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 6 minutes ago with 1 failures. Failed resources (up to 3 shown): Service[varnishreqstats-frontend] https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[10:41:38] <icinga-wm>	 PROBLEM - puppet last run on cp2008 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 4 minutes ago with 1 failures. Failed resources (up to 3 shown): Service[varnishreqstats-frontend] https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[10:42:09] <godog>	 I'll stop puppet on cp hosts, fails on second run after my merge /o\
[10:42:09] <godog>	 cc ema ^
[10:51:02] <wikibugs>	 (03PS1) 10Filippo Giunchedi: varnish: fix varnishreqstats systemd::service usage [puppet] - 10https://gerrit.wikimedia.org/r/525252 (https://phabricator.wikimedia.org/T184942)
[10:53:29] <ema>	 godog: thanks
[10:53:52] <wikibugs>	 (03CR) 10Ema: [C: 03+1] varnish: fix varnishreqstats systemd::service usage [puppet] - 10https://gerrit.wikimedia.org/r/525252 (https://phabricator.wikimedia.org/T184942) (owner: 10Filippo Giunchedi)
[10:54:32] <wikibugs>	 (03CR) 10Filippo Giunchedi: "PCC's happy https://puppet-compiler.wmflabs.org/compiler1002/17594/cp1077.eqiad.wmnet/" [puppet] - 10https://gerrit.wikimedia.org/r/525252 (https://phabricator.wikimedia.org/T184942) (owner: 10Filippo Giunchedi)
[10:54:49] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C: 03+2] varnish: fix varnishreqstats systemd::service usage [puppet] - 10https://gerrit.wikimedia.org/r/525252 (https://phabricator.wikimedia.org/T184942) (owner: 10Filippo Giunchedi)
[10:54:59] <wikibugs>	 (03CR) 10Filippo Giunchedi: [V: 03+2 C: 03+2] varnish: fix varnishreqstats systemd::service usage [puppet] - 10https://gerrit.wikimedia.org/r/525252 (https://phabricator.wikimedia.org/T184942) (owner: 10Filippo Giunchedi)
[10:58:52] <paravoid>	 in 30
[10:59:36] <icinga-wm>	 RECOVERY - puppet last run on cp1077 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[11:00:04] <jouncebot>	 Amir1, Lucas_WMDE, and Urbanecm: #bothumor My software never has bugs. It just develops random features. Rise for European Mid-day SWAT(Max 6 patches). (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190724T1100).
[11:00:04] <jouncebot>	 Zoranzoki21 and kart_: A patch you scheduled for European Mid-day SWAT(Max 6 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[11:00:18] <godog>	 ok puppet's back on cp1077, reenabling
[11:00:42] <icinga-wm>	 PROBLEM - puppet last run on cp2018 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 4 minutes ago with 1 failures. Failed resources (up to 3 shown): Service[varnishreqstats-frontend] https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[11:00:45] <awight>	 kart_: I can deploy your patch, if that's helpful?
[11:01:10] <kart_>	 awight: cool
[11:01:14] <kart_>	 awight: go ahead.
[11:01:32] <icinga-wm>	 PROBLEM - puppet last run on cp2013 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 59 seconds ago with 1 failures. Failed resources (up to 3 shown): Service[varnishreqstats-frontend] https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[11:01:36] <icinga-wm>	 PROBLEM - puppet last run on cp3032 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 5 minutes ago with 1 failures. Failed resources (up to 3 shown): Service[varnishreqstats-frontend] https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[11:01:46] <awight>	 Zoranzoki21: ping me if you're around during this window, otherwise I'll leave your patch for another day.
[11:01:48] <icinga-wm>	 PROBLEM - puppet last run on cp2014 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 5 minutes ago with 1 failures. Failed resources (up to 3 shown): Service[varnishreqstats-frontend] https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[11:01:49] <awight>	 kart_: will do
[11:01:50] <icinga-wm>	 PROBLEM - puppet last run on cp1081 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 5 minutes ago with 1 failures. Failed resources (up to 3 shown): Service[varnishreqstats-frontend] https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[11:02:00] <icinga-wm>	 PROBLEM - puppet last run on cp1087 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): Service[varnishreqstats-frontend] https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[11:02:02] <icinga-wm>	 PROBLEM - puppet last run on cp1079 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 4 minutes ago with 1 failures. Failed resources (up to 3 shown): Service[varnishreqstats-frontend] https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[11:02:06] <icinga-wm>	 PROBLEM - puppet last run on cp3042 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 4 minutes ago with 1 failures. Failed resources (up to 3 shown): Service[varnishreqstats-frontend] https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[11:02:08] <godog>	 oh ffs, sorry about that!
[11:02:54] <wikibugs>	 (03PS4) 10Awight: Remove Content Translation event logging config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/514672 (owner: 10Petar.petkovic)
[11:03:02] <wikibugs>	 (03CR) 10Awight: [C: 03+2] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/514672 (owner: 10Petar.petkovic)
[11:03:13] <godog>	 despite the noise we're fine btw
[11:03:18] <awight>	 :)
[11:04:05] <wikibugs>	 (03Merged) 10jenkins-bot: Remove Content Translation event logging config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/514672 (owner: 10Petar.petkovic)
[11:04:23] <wikibugs>	 (03CR) 10jenkins-bot: Remove Content Translation event logging config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/514672 (owner: 10Petar.petkovic)
[11:05:08] <awight>	 kart_: Looks like this has to be pushed in two steps, with CommonSettings.php going first...
[11:05:46] <awight>	 kart_: Patch is ready to test on mwdebug1002
[11:06:41] <kart_>	 OK. Checking.
[11:07:28] <icinga-wm>	 RECOVERY - puppet last run on cp1079 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[11:07:30] <awight>	 I can confirm the ContentTranslation pages load, at least.
[11:07:33] <kart_>	 Nothing much to test except checking that it doesn't break stuff.
[11:07:37] <kart_>	 awight: yeah.
[11:07:45] <awight>	 kk, deploying
[11:10:55] <logmsgbot>	 !log awight@deploy1001 Synchronized wmf-config/CommonSettings.php: [[gerrit:514672|Remove Content Translation event logging config]] (part 1/2) (duration: 00m 59s)
[11:11:04] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:12:03] <logmsgbot>	 !log awight@deploy1001 Synchronized wmf-config/InitialiseSettings.php: [[gerrit:514672|Remove Content Translation event logging config]] (part 2/2) (duration: 00m 54s)
[11:12:08] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:12:35] <awight>	 done :)
[11:12:44] <kart_>	 Thank you awight !
[11:12:54] <awight>	 any time
[11:15:03] <godog>	 awight: please LMK when swat is done, I'm holding off reenabling puppet in case spam comes up again
[11:15:29] <awight>	 godog: Okay, I have one more patch to push then will ping you.
[11:15:46] <godog>	 awesome, thanks
[11:19:37] <wikibugs>	 (03PS1) 10Awight: Disable FileImporter source wiki edits [mediawiki-config] - 10https://gerrit.wikimedia.org/r/525254 (https://phabricator.wikimedia.org/T228851)
[11:20:17] <wikibugs>	 (03CR) 10Awight: [C: 03+2] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/525254 (https://phabricator.wikimedia.org/T228851) (owner: 10Awight)
[11:21:59] <wikibugs>	 (03Merged) 10jenkins-bot: Disable FileImporter source wiki edits [mediawiki-config] - 10https://gerrit.wikimedia.org/r/525254 (https://phabricator.wikimedia.org/T228851) (owner: 10Awight)
[11:22:38] <wikibugs>	 (03CR) 10jenkins-bot: Disable FileImporter source wiki edits [mediawiki-config] - 10https://gerrit.wikimedia.org/r/525254 (https://phabricator.wikimedia.org/T228851) (owner: 10Awight)
[11:23:46] <logmsgbot>	 !log awight@deploy1001 Synchronized wmf-config/CommonSettings.php: [[gerrit:525254|Disable FileImporter source wiki edits (T228851)]] (duration: 00m 54s)
[11:23:52] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:23:53] <stashbot>	 T228851: Source wiki editing and deletion always fails - https://phabricator.wikimedia.org/T228851
[11:26:26] <awight>	 godog: Take it away!
[11:26:38] * godog grabs mic
[11:26:43] <awight>	 :-D
[11:26:51] <godog>	 thanks! will do
[11:28:28] <icinga-wm>	 RECOVERY - puppet last run on cp2008 is OK: OK: Puppet is currently enabled, last run 7 seconds ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[11:28:40] <icinga-wm>	 RECOVERY - puppet last run on cp2018 is OK: OK: Puppet is currently enabled, last run 18 seconds ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[11:29:34] <icinga-wm>	 RECOVERY - puppet last run on cp2013 is OK: OK: Puppet is currently enabled, last run 59 seconds ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[11:29:48] <icinga-wm>	 RECOVERY - puppet last run on cp2014 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[11:32:00] <icinga-wm>	 RECOVERY - puppet last run on cp1089 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[11:33:00] <icinga-wm>	 RECOVERY - puppet last run on cp3043 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[11:34:50] <icinga-wm>	 PROBLEM - MegaRAID on cloudvirt1024 is CRITICAL: CRITICAL: 1 failed LD(s) (Degraded) https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[11:34:58] <icinga-wm>	 ACKNOWLEDGEMENT - MegaRAID on cloudvirt1024 is CRITICAL: CRITICAL: 1 failed LD(s) (Degraded) nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T228853 https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[11:35:04] <wikibugs>	 10Operations, 10ops-eqiad: Degraded RAID on cloudvirt1024 - https://phabricator.wikimedia.org/T228853 (10ops-monitoring-bot)
[11:35:24] <icinga-wm>	 RECOVERY - puppet last run on cp3032 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[11:35:36] <icinga-wm>	 RECOVERY - puppet last run on cp1081 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[11:35:46] <icinga-wm>	 RECOVERY - puppet last run on cp1087 is OK: OK: Puppet is currently enabled, last run 5 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[11:35:52] <icinga-wm>	 RECOVERY - puppet last run on cp3042 is OK: OK: Puppet is currently enabled, last run 5 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[11:39:31] <wikibugs>	 (03PS3) 10Muehlenhoff: Enable seccomp-based hardening for apt on Buster and later [puppet] - 10https://gerrit.wikimedia.org/r/525115
[11:40:13] <wikibugs>	 10Operations, 10Puppet, 10observability: Use git commit id as "configuration version" for puppet - https://phabricator.wikimedia.org/T228854 (10fgiunchedi)
[11:42:47] <wikibugs>	 (03CR) 10Muehlenhoff: "PCC: https://puppet-compiler.wmflabs.org/compiler1002/17595/" [puppet] - 10https://gerrit.wikimedia.org/r/525115 (owner: 10Muehlenhoff)
[11:43:05] <wikibugs>	 (03PS4) 10Muehlenhoff: Enable seccomp-based hardening for apt on Buster and later [puppet] - 10https://gerrit.wikimedia.org/r/525115
[11:44:16] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+2] Enable seccomp-based hardening for apt on Buster and later [puppet] - 10https://gerrit.wikimedia.org/r/525115 (owner: 10Muehlenhoff)
[11:49:09] <wikibugs>	 (03PS1) 10Filippo Giunchedi: varnish: remove varnishreqstats and varnishstatsd [puppet] - 10https://gerrit.wikimedia.org/r/525259 (https://phabricator.wikimedia.org/T184942)
[12:00:04] <jouncebot>	 Deploy window Pre MediaWiki train sanity break (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190724T1200)
[12:07:19] <wikibugs>	 10Operations, 10ops-eqiad, 10DC-Ops: dbproxy1012 and dbpro1001 alerting on PS Redundancy - https://phabricator.wikimedia.org/T228859 (10Marostegui)
[12:13:31] <wikibugs>	 10Operations, 10Traffic: Do not cache the beta version of the mobile site - https://phabricator.wikimedia.org/T228861 (10ema)
[12:13:42] <wikibugs>	 10Operations, 10Traffic: Do not cache the beta version of the mobile site - https://phabricator.wikimedia.org/T228861 (10ema) p:05Triage→03Normal
[12:17:33] <wikibugs>	 (03PS1) 10Ema: vcl: do not cache the beta variant of the mobile site [puppet] - 10https://gerrit.wikimedia.org/r/525268 (https://phabricator.wikimedia.org/T228861)
[12:19:49] <marostegui>	 !log Stop haproxy on dbproxy1004 and dbproxy1009 (m4 - eventlogging) - T228768
[12:19:55] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:19:56] <stashbot>	 T228768: Decommission dbproxy1004 and dbproxy1009 - https://phabricator.wikimedia.org/T228768
[12:20:04] <marostegui>	 elukey: fyi ^
[12:21:01] <elukey>	 gogogogo
[12:21:25] <wikibugs>	 10Operations, 10Analytics, 10Analytics-EventLogging, 10DBA, 10decommission: Decommission dbproxy1004 and dbproxy1009 - https://phabricator.wikimedia.org/T228768 (10Marostegui)
[12:21:53] <wikibugs>	 10Operations, 10Analytics, 10Analytics-EventLogging, 10DBA, 10decommission: Decommission dbproxy1004 and dbproxy1009 - https://phabricator.wikimedia.org/T228768 (10Marostegui) I have stopped haproxy on both hosts, and will leave it like that for 24h, just to be fully sure nothing uses it.
[12:26:45] <wikibugs>	 10Operations, 10ops-eqiad, 10DC-Ops: dbproxy1012 and dbprov1001 alerting on PS Redundancy - https://phabricator.wikimedia.org/T228859 (10Marostegui)
[12:32:22] <wikibugs>	 10Operations, 10MobileFrontend, 10Traffic, 10Patch-For-Review: Do not cache the beta version of the mobile site - https://phabricator.wikimedia.org/T228861 (10ema)
[12:42:38] <icinga-wm>	 PROBLEM - High CPU load on API appserver on mw1227 is CRITICAL: CRITICAL - load average: 55.15, 35.35, 26.35 https://wikitech.wikimedia.org/wiki/Application_servers
[12:48:26] <wikibugs>	 10Operations, 10Traffic: Separate Traffic layer caches for PHP7/HHVM - https://phabricator.wikimedia.org/T206339 (10ema)
[12:55:48] <icinga-wm>	 PROBLEM - High CPU load on API appserver on mw1227 is CRITICAL: CRITICAL - load average: 47.86, 36.30, 32.42 https://wikitech.wikimedia.org/wiki/Application_servers
[13:00:04] <jouncebot>	 liw: (Dis)respected human, time to deploy MediaWiki train - European version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190724T1300). Please do the needful.
[13:00:35] <liw>	 I'll start promoting 1.34.0-wmf.15 to group1
[13:00:59] <wikibugs>	 (03PS1) 10Ema: WIP: ATS: Vary-slotting for PHP7 [puppet] - 10https://gerrit.wikimedia.org/r/525274 (https://phabricator.wikimedia.org/T206339)
[13:01:23] <wikibugs>	 (03PS1) 10Lars Wirzenius: group1 wikis to 1.34.0-wmf.15 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/525275
[13:01:25] <wikibugs>	 (03CR) 10Lars Wirzenius: [C: 03+2] group1 wikis to 1.34.0-wmf.15 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/525275 (owner: 10Lars Wirzenius)
[13:02:27] <wikibugs>	 (03Merged) 10jenkins-bot: group1 wikis to 1.34.0-wmf.15 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/525275 (owner: 10Lars Wirzenius)
[13:02:52] <wikibugs>	 (03CR) 10jenkins-bot: group1 wikis to 1.34.0-wmf.15 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/525275 (owner: 10Lars Wirzenius)
[13:05:11] <logmsgbot>	 !log liw@deploy1001 rebuilt and synchronized wikiversions files: group1 wikis to 1.34.0-wmf.15
[13:05:39] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:06:06] <logmsgbot>	 !log liw@deploy1001 Synchronized php: group1 wikis to 1.34.0-wmf.15 (duration: 00m 54s)
[13:06:11] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:07:08] <wikibugs>	 (03PS1) 10Papaul: DNS: Add mgmt and production DNS for db21[21-30] [dns] - 10https://gerrit.wikimedia.org/r/525277
[13:08:33] <liw>	 eek, there's a couple of thousand gzinflate data errors
[13:08:46] <icinga-wm>	 PROBLEM - termbox codfw on termbox.svc.codfw.wmnet is CRITICAL: /termbox (get rendered termbox) is CRITICAL: Test get rendered termbox returned the unexpected status 500 (expecting: 200) https://wikitech.wikimedia.org/wiki/WMDE/Wikidata/SSR_Service
[13:09:21] <liw>	 but not in .15
[13:09:36] <icinga-wm>	 PROBLEM - High CPU load on API appserver on mw1222 is CRITICAL: CRITICAL - load average: 48.09, 32.86, 23.87 https://wikitech.wikimedia.org/wiki/Application_servers
[13:14:32] <icinga-wm>	 RECOVERY - High CPU load on API appserver on mw1222 is OK: OK - load average: 16.74, 23.15, 22.19 https://wikitech.wikimedia.org/wiki/Application_servers
[13:14:44] <wikibugs>	 (03CR) 10Marostegui: [C: 03+2] DNS: Add mgmt and production DNS for db21[21-30] [dns] - 10https://gerrit.wikimedia.org/r/525277 (owner: 10Papaul)
[13:18:03] <wikibugs>	 10Operations, 10ops-eqiad, 10DC-Ops, 10User-Zppix, 10cloud-services-team (Kanban): VMs on cloudvirt1015 crashing - bad mainboard/memory - https://phabricator.wikimedia.org/T220853 (10RobH) >>! In T220853#5359421, @wiki_willy wrote: > @Cmjohnson - are those errors for DIMM A3 enough info to get Dell to RM...
[13:18:40] <icinga-wm>	 RECOVERY - termbox codfw on termbox.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/WMDE/Wikidata/SSR_Service
[13:23:44] <icinga-wm>	 PROBLEM - termbox codfw on termbox.svc.codfw.wmnet is CRITICAL: /termbox (get rendered termbox) is CRITICAL: Test get rendered termbox returned the unexpected status 500 (expecting: 200) https://wikitech.wikimedia.org/wiki/WMDE/Wikidata/SSR_Service
[13:25:39] <robh>	 !log rebooting cloudvirt1015 into memtest for dell support repair via T220853
[13:25:46] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:25:47] <stashbot>	 T220853: VMs on cloudvirt1015 crashing - bad mainboard/memory - https://phabricator.wikimedia.org/T220853
[13:26:14] <wikibugs>	 (03PS1) 10Fsero: k8s: adding PodSecurityPolicies [deployment-charts] - 10https://gerrit.wikimedia.org/r/525281
[13:27:06] <icinga-wm>	 RECOVERY - High CPU load on API appserver on mw1227 is OK: OK - load average: 13.57, 15.76, 23.61 https://wikitech.wikimedia.org/wiki/Application_servers
[13:28:38] <icinga-wm>	 RECOVERY - termbox codfw on termbox.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/WMDE/Wikidata/SSR_Service
[13:30:29] <wikibugs>	 (03PS1) 10Elukey: role::analytics_test_cluster::hadoop::master: add missing option [puppet] - 10https://gerrit.wikimedia.org/r/525282
[13:30:35] <liw>	 group1: so far, so good
[13:31:33] <marostegui>	 !log Drop abuse_filter_log.afl_log_id in s2 eqiad - T226851
[13:31:45] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:31:48] <stashbot>	 T226851: Drop abuse_filter_log.afl_log_id in production - https://phabricator.wikimedia.org/T226851
[13:33:38] <icinga-wm>	 PROBLEM - termbox codfw on termbox.svc.codfw.wmnet is CRITICAL: /termbox (get rendered termbox) is CRITICAL: Test get rendered termbox returned the unexpected status 500 (expecting: 200) https://wikitech.wikimedia.org/wiki/WMDE/Wikidata/SSR_Service
[13:35:14] <icinga-wm>	 RECOVERY - termbox codfw on termbox.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/WMDE/Wikidata/SSR_Service
[13:36:07] <wikibugs>	 (03CR) 10Elukey: [C: 03+2] role::analytics_test_cluster::hadoop::master: add missing option [puppet] - 10https://gerrit.wikimedia.org/r/525282 (owner: 10Elukey)
[13:40:18] <icinga-wm>	 PROBLEM - termbox codfw on termbox.svc.codfw.wmnet is CRITICAL: /termbox (get rendered termbox) is CRITICAL: Test get rendered termbox returned the unexpected status 500 (expecting: 200) https://wikitech.wikimedia.org/wiki/WMDE/Wikidata/SSR_Service
[13:41:54] <icinga-wm>	 RECOVERY - termbox codfw on termbox.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/WMDE/Wikidata/SSR_Service
[13:47:56] <wikibugs>	 10Operations, 10ops-eqiad, 10DC-Ops, 10User-Zppix, 10cloud-services-team (Kanban): VMs on cloudvirt1015 crashing - bad mainboard/memory - https://phabricator.wikimedia.org/T220853 (10RobH) Ok, this failed with another memory error in the SEL for dimm A3 (the one in question this entire time).  I've enter...
[13:48:32] <icinga-wm>	 PROBLEM - High CPU load on API appserver on mw1227 is CRITICAL: CRITICAL - load average: 51.78, 31.01, 23.70 https://wikitech.wikimedia.org/wiki/Application_servers
[13:49:08] <robh>	 !log rebooting cloudvirt1015 into OS, memory error confirmed.  new memory replacement dispatch entered via T220853
[13:49:23] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:49:24] <stashbot>	 T220853: VMs on cloudvirt1015 crashing - bad mainboard/memory - https://phabricator.wikimedia.org/T220853
[13:51:56] <icinga-wm>	 PROBLEM - termbox codfw on termbox.svc.codfw.wmnet is CRITICAL: /termbox (get rendered termbox) is CRITICAL: Test get rendered termbox returned the unexpected status 500 (expecting: 200) https://wikitech.wikimedia.org/wiki/WMDE/Wikidata/SSR_Service
[13:53:12] <wikibugs>	 10Operations, 10Analytics, 10Analytics-Kanban, 10Discovery, 10Research-Backlog: Make oozie swift upload emit event to Kafka about swift object upload complete - https://phabricator.wikimedia.org/T227896 (10Ottomata) But hm, I get your point.  It might be nice if the upload script automated some versionin...
[13:53:34] <icinga-wm>	 RECOVERY - termbox codfw on termbox.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/WMDE/Wikidata/SSR_Service
[13:53:40] <wikibugs>	 10Operations, 10ops-eqiad, 10netops: (Need By: Sept 30) update RE-S-X6-64G-S in cr[12]-eqiad - https://phabricator.wikimedia.org/T226424 (10RobH) @ayounsi:  This now has a need by date of September 30th (I assume you and @wiki_willy came up with that as he added it?)  This is basically blocked on #netops tel...
[14:04:53] <wikibugs>	 (03PS1) 10Tarrow: Increase termbox version in production to match staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/525288
[14:05:12] <wikibugs>	 (03CR) 10Tarrow: "This change is ready for review." [deployment-charts] - 10https://gerrit.wikimedia.org/r/525288 (owner: 10Tarrow)
[14:11:44] <wikibugs>	 (03PS2) 10Ema: ATS: Vary-slotting for PHP7 [puppet] - 10https://gerrit.wikimedia.org/r/525274 (https://phabricator.wikimedia.org/T206339)
[14:16:58] <icinga-wm>	 RECOVERY - Check the Netbox report-s- librenms for fail status. on netmon1002 is OK: librenms.LibreNMS OK https://wikitech.wikimedia.org/wiki/Netbox%23Reports
[14:23:20] <wikibugs>	 (03CR) 10Jakob: [C: 03+2] Increase termbox version in production to match staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/525288 (owner: 10Tarrow)
[14:26:54] <wikibugs>	 10Operations, 10ops-eqiad, 10netops: rack spare switches in c1-eqiad - https://phabricator.wikimedia.org/T185337 (10faidon) These could be racked in any rack, including in row A. It would be useful to have a working lab out of our spares - this came up yesterday/today when we were wondering if we had QSFPs t...
[14:28:56] <wikibugs>	 (03CR) 10Jakob: [V: 03+2 C: 03+2] Increase termbox version in production to match staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/525288 (owner: 10Tarrow)
[14:31:01] <wikibugs>	 (03PS1) 10Volans: pep257: fix newly reported issues [software/spicerack] - 10https://gerrit.wikimedia.org/r/525293
[14:31:59] <logmsgbot>	 !log tarrow@ helmfile [CODFW] Ran 'apply' command on namespace 'termbox' for release 'production' .
[14:32:01] <wikibugs>	 (03PS1) 10Herron: admin: add dz1 to ldap_only_users [puppet] - 10https://gerrit.wikimedia.org/r/525294 (https://phabricator.wikimedia.org/T227496)
[14:32:06] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:32:22] <wikibugs>	 (03PS1) 10Muehlenhoff: Configure unconditional flushes of the L1 cache during VMENTER for cloudvirt servers [puppet] - 10https://gerrit.wikimedia.org/r/525295 (https://phabricator.wikimedia.org/T228870)
[14:33:15] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] Configure unconditional flushes of the L1 cache during VMENTER for cloudvirt servers [puppet] - 10https://gerrit.wikimedia.org/r/525295 (https://phabricator.wikimedia.org/T228870) (owner: 10Muehlenhoff)
[14:35:54] <wikibugs>	 (03PS2) 10Muehlenhoff: Configure unconditional flushes of the L1 cache during VMENTER [puppet] - 10https://gerrit.wikimedia.org/r/525295 (https://phabricator.wikimedia.org/T228870)
[14:36:42] <wikibugs>	 10Operations, 10netops: Instability of the Level3 link between cr2-eqiad and cr2-esams - https://phabricator.wikimedia.org/T228827 (10ayounsi) Level3/CenturyLink opened a ticket for that circuit and completed an emergency maintenance. I also see some planned maintenance in the last few days. And have at least...
[14:36:44] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] pep257: fix newly reported issues [software/spicerack] - 10https://gerrit.wikimedia.org/r/525293 (owner: 10Volans)
[14:37:02] <icinga-wm>	 PROBLEM - High CPU load on API appserver on mw1222 is CRITICAL: CRITICAL - load average: 54.12, 30.75, 24.29 https://wikitech.wikimedia.org/wiki/Application_servers
[14:40:16] <wikibugs>	 (03CR) 10Volans: [V: 03+2 C: 03+2] "Overriding CI as mypy is a new failure, I'll fix it in a separate patch. And then I'll look into probably freezing some deps, a bit too no" [software/spicerack] - 10https://gerrit.wikimedia.org/r/525293 (owner: 10Volans)
[14:40:46] <marostegui>	 !log Drop abuse_filter_log.afl_log_id in s5 codfw (lag will appear on codfw) - T226851
[14:40:52] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:40:53] <stashbot>	 T226851: Drop abuse_filter_log.afl_log_id in production - https://phabricator.wikimedia.org/T226851
[14:41:18] <icinga-wm>	 PROBLEM - High CPU load on API appserver on mw1227 is CRITICAL: CRITICAL - load average: 66.62, 44.26, 33.92 https://wikitech.wikimedia.org/wiki/Application_servers
[14:41:24] <icinga-wm>	 PROBLEM - High CPU load on API appserver on mw1233 is CRITICAL: CRITICAL - load average: 62.82, 36.35, 27.23 https://wikitech.wikimedia.org/wiki/Application_servers
[14:41:48] <wikibugs>	 (03CR) 10Ayounsi: [C: 03+2] Anycast: move bird::neighbors_list from role/site to site in codfw [puppet] - 10https://gerrit.wikimedia.org/r/524076 (owner: 10Ayounsi)
[14:41:58] <wikibugs>	 (03PS2) 10Ayounsi: Anycast: move bird::neighbors_list from role/site to site in codfw [puppet] - 10https://gerrit.wikimedia.org/r/524076
[14:42:18] <wikibugs>	 (03CR) 10jenkins-bot: pep257: fix newly reported issues [software/spicerack] - 10https://gerrit.wikimedia.org/r/525293 (owner: 10Volans)
[14:43:24] <marostegui>	 !log Drop abuse_filter_log.afl_log_id in s5 eqiad - T226851
[14:43:30] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:44:57] <wikibugs>	 (03CR) 10Ayounsi: [C: 03+2] "https://puppet-compiler.wmflabs.org/compiler1001/17596/dns2002.wikimedia.org/" [puppet] - 10https://gerrit.wikimedia.org/r/524076 (owner: 10Ayounsi)
[14:45:10] <wikibugs>	 10Operations, 10Analytics, 10Analytics-Kanban, 10netops: Allow analytics VLAN to reach eventgate-analytics.discovery.wmnet:31192 - https://phabricator.wikimedia.org/T228882 (10Ottomata)
[14:46:14] <icinga-wm>	 RECOVERY - PHP opcache health on mwdebug1002 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health
[14:47:44] <wikibugs>	 (03PS8) 10Ori.livneh: Configure forensic logging of Apache requests; enable on beta [puppet] - 10https://gerrit.wikimedia.org/r/511751
[14:53:34] <wikibugs>	 (03PS1) 10Volans: elasticsearch_cluster: fix mypy newly reported bug [software/spicerack] - 10https://gerrit.wikimedia.org/r/525300
[14:54:28] <XioNoX>	 !log cleared vc ports stats on asw2-a-eqiad - T228823
[14:54:35] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:54:36] <stashbot>	 T228823: Faulty A6/A7 VC link - https://phabricator.wikimedia.org/T228823
[14:54:38] <icinga-wm>	 PROBLEM - High CPU load on API appserver on mw1233 is CRITICAL: CRITICAL - load average: 67.63, 40.77, 33.08 https://wikitech.wikimedia.org/wiki/Application_servers
[14:56:10] <icinga-wm>	 PROBLEM - High CPU load on API appserver on mw1227 is CRITICAL: CRITICAL - load average: 52.81, 37.13, 34.22 https://wikitech.wikimedia.org/wiki/Application_servers
[14:57:37] <wikibugs>	 (03PS1) 10Elukey: profile::kerberos::kdcserver: add +requires_preauth to new users [puppet] - 10https://gerrit.wikimedia.org/r/525301 (https://phabricator.wikimedia.org/T226104)
[14:58:26] <jeh>	 !log unmounting dumps NFS clients from labstore1007.wikimedia.org T224228
[14:58:33] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:59:54] <icinga-wm>	 RECOVERY - Juniper virtual chassis ports on asw2-a-eqiad is OK: OK: UP: 22 https://wikitech.wikimedia.org/wiki/Network_monitoring%23VCP_status
[15:00:06] <icinga-wm>	 RECOVERY - High CPU load on API appserver on mw1222 is OK: OK - load average: 19.20, 21.48, 23.59 https://wikitech.wikimedia.org/wiki/Application_servers
[15:01:04] <wikibugs>	 10Operations, 10Analytics, 10Analytics-Kanban, 10Discovery, 10Research-Backlog: Make oozie swift upload emit event to Kafka about swift object upload complete - https://phabricator.wikimedia.org/T227896 (10Nuria) @Ottomata I think for users sake it is easier to do it the other way around maybe? Provide v...
[15:02:38] <XioNoX>	 !log re-enable vc link between asw2-a6 and asw2-a7 - T228823
[15:02:45] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:02:46] <stashbot>	 T228823: Faulty A6/A7 VC link - https://phabricator.wikimedia.org/T228823
[15:05:19] <wikibugs>	 10Operations, 10ops-eqiad: Faulty A6/A7 VC link - https://phabricator.wikimedia.org/T228823 (10ayounsi)
[15:07:42] <wikibugs>	 10Operations, 10ops-eqiad: Faulty A6/A7 VC link - https://phabricator.wikimedia.org/T228823 (10ayounsi) 05Open→03Resolved All done, no more errors or packet loss.
[15:07:51] <wikibugs>	 (03PS4) 10Zoranzoki21: Correct a typo on the label newarticle in the help panel for arwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/525190 (https://phabricator.wikimedia.org/T228820)
[15:09:08] <icinga-wm>	 RECOVERY - Check systemd state on netmon1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:09:20] <icinga-wm>	 PROBLEM - High CPU load on API appserver on mw1233 is CRITICAL: CRITICAL - load average: 70.89, 41.77, 34.91 https://wikitech.wikimedia.org/wiki/Application_servers
[15:09:21] <wikibugs>	 (03PS2) 10Elukey: profile::kerberos::kdcserver: add +requires_preauth to new users [puppet] - 10https://gerrit.wikimedia.org/r/525301 (https://phabricator.wikimedia.org/T226104)
[15:09:40] <wikibugs>	 10Operations, 10serviceops, 10Core Platform Team Workboards (Green): Keys from MediaWiki Redis Instances - https://phabricator.wikimedia.org/T228703 (10jijiki) @holger.knust I accidentally copied the wrong dump to your directory yesterday, I uploaded a new dump today. Sorry for the confusion.
[15:11:53] <herron>	 !log resume ingesting [message] =~ /^SlowTimer/ logs on logstash1007 (as a canary)
[15:11:59] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:12:38] <wikibugs>	 (03PS1) 10Ayounsi: Anycast move bird::neighbors_list from role/site for all sites [puppet] - 10https://gerrit.wikimedia.org/r/525303
[15:13:10] <icinga-wm>	 PROBLEM - High CPU load on API appserver on mw1222 is CRITICAL: CRITICAL - load average: 49.84, 34.86, 28.47 https://wikitech.wikimedia.org/wiki/Application_servers
[15:14:04] <icinga-wm>	 PROBLEM - Check systemd state on netmon1002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:15:42] <wikibugs>	 (03PS3) 10Muehlenhoff: Configure unconditional flushes of the L1 cache during VMENTER [puppet] - 10https://gerrit.wikimedia.org/r/525295 (https://phabricator.wikimedia.org/T228870)
[15:15:48] <icinga-wm>	 PROBLEM - High CPU load on API appserver on mw1227 is CRITICAL: CRITICAL - load average: 52.44, 32.98, 30.06 https://wikitech.wikimedia.org/wiki/Application_servers
[15:16:33] <wikibugs>	 10Operations, 10ops-eqiad, 10DC-Ops: a1-eqiad pdu refresh - https://phabricator.wikimedia.org/T226782 (10MoritzMuehlenhoff)
[15:19:14] <wikibugs>	 10Operations, 10ops-eqiad, 10DC-Ops: a1-eqiad pdu refresh - https://phabricator.wikimedia.org/T226782 (10RobH) >>! In T227139#5336328, @elukey wrote: > All the analytics nodes are hadoop workers, not a big deal if they lose power.  the above was on another task, but referenced same role as  analytics1058
[15:19:56] <hashar>	 !sal
[15:19:56] <wm-bot>	 https://wikitech.wikimedia.org/wiki/Server_Admin_Log  https://tools.wmflabs.org/sal/production   See it and you will know all you need.
[15:20:53] <wikibugs>	 10Operations, 10ops-eqiad, 10DC-Ops: a1-eqiad pdu refresh - https://phabricator.wikimedia.org/T226782 (10elukey) +1 for analytics1058, kafka-jumbo1001 is also ok, just please ping me or ottomata when starting so we can monitor.
[15:21:28] <wikibugs>	 10Operations, 10ops-eqiad, 10DC-Ops: a1-eqiad pdu refresh - https://phabricator.wikimedia.org/T226782 (10RobH)
[15:22:10] <wikibugs>	 (03CR) 10Volans: [C: 03+2] elasticsearch_cluster: fix mypy newly reported bug [software/spicerack] - 10https://gerrit.wikimedia.org/r/525300 (owner: 10Volans)
[15:25:14] <hashar>	 awight: note how the split of selenium and qunit makes your patch slightly easier to follow now :-]]]
[15:25:22] <hashar>	 that was worth the 3 merge effort ;-]
[15:27:51] <wikibugs>	 (03Merged) 10jenkins-bot: elasticsearch_cluster: fix mypy newly reported bug [software/spicerack] - 10https://gerrit.wikimedia.org/r/525300 (owner: 10Volans)
[15:28:21] <wikibugs>	 10Operations, 10netops: Instability of the Level3 link between cr2-eqiad and cr2-esams - https://phabricator.wikimedia.org/T228827 (10elukey) We usually see impact in 50x and/or nginx availability when the link goes down, so if that could be avoided I'd be +1.
[15:29:25] <wikibugs>	 (03CR) 10jenkins-bot: elasticsearch_cluster: fix mypy newly reported bug [software/spicerack] - 10https://gerrit.wikimedia.org/r/525300 (owner: 10Volans)
[15:30:20] <wikibugs>	 (03CR) 10Arturo Borrero Gonzalez: Configure unconditional flushes of the L1 cache during VMENTER (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/525295 (https://phabricator.wikimedia.org/T228870) (owner: 10Muehlenhoff)
[15:32:23] <wikibugs>	 (03CR) 10Elukey: [C: 04-1] Allow the use of Ipv6 in the Hadoop Analytics cluster (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/523229 (https://phabricator.wikimedia.org/T225296) (owner: 10Elukey)
[15:32:24] <icinga-wm>	 PROBLEM - High CPU load on API appserver on mw1233 is CRITICAL: CRITICAL - load average: 40.05, 30.55, 32.10 https://wikitech.wikimedia.org/wiki/Application_servers
[15:35:16] <wikibugs>	 10Operations, 10ops-eqiad, 10DC-Ops: dbprov1001 alerting on PS Redundancy - https://phabricator.wikimedia.org/T228891 (10Cmjohnson)
[15:35:36] <icinga-wm>	 PROBLEM - High CPU load on API appserver on mw1227 is CRITICAL: CRITICAL - load average: 63.97, 40.29, 34.23 https://wikitech.wikimedia.org/wiki/Application_servers
[15:35:45] <wikibugs>	 (03PS1) 10Effie Mouzeli: jobrunners: Convert all jobrunners to server PHP7 only [puppet] - 10https://gerrit.wikimedia.org/r/525306 (https://phabricator.wikimedia.org/T219148)
[15:35:51] <wikibugs>	 (03PS9) 10Ori.livneh: Configure forensic logging of Apache requests; enable on beta [puppet] - 10https://gerrit.wikimedia.org/r/511751
[15:36:28] <wikibugs>	 10Operations, 10ops-eqiad, 10DC-Ops: dbproxy1012 alerting on PS Redundancy - https://phabricator.wikimedia.org/T228892 (10Cmjohnson)
[15:36:58] <wikibugs>	 10Operations, 10ops-eqiad, 10DC-Ops: dbproxy1012 and dbprov1001 alerting on PS Redundancy - https://phabricator.wikimedia.org/T228859 (10Cmjohnson) 05Open→03Invalid Created two separate tickets for each server.
[15:37:00] <wikibugs>	 10Operations, 10ops-eqiad, 10DC-Ops: (July 22-26) install new PDUs in rows A/B (Top level tracking task) - https://phabricator.wikimedia.org/T226778 (10Cmjohnson)
[15:37:07] <wikibugs>	 (03PS1) 10Ema: vcl: update Vary:XFP fixup comment [puppet] - 10https://gerrit.wikimedia.org/r/525308 (https://phabricator.wikimedia.org/T51700)
[15:37:19] <wikibugs>	 (03PS2) 10Effie Mouzeli: jobrunners: Migrate all jobrunners to serve only via PHP7 [puppet] - 10https://gerrit.wikimedia.org/r/525306 (https://phabricator.wikimedia.org/T219148)
[15:38:03] <wikibugs>	 10Operations, 10ops-eqiad, 10DC-Ops: dbprov1001 alerting on PS Redundancy - https://phabricator.wikimedia.org/T228891 (10Cmjohnson) The idrac interface is showing errors with the power supply. It is confirmed to be up and running. Most likely a fan went out. The server is still in warranty.
[15:38:14] <wikibugs>	 10Operations, 10ops-eqiad, 10DC-Ops: dbproxy1012 alerting on PS Redundancy - https://phabricator.wikimedia.org/T228892 (10Cmjohnson) The idrac interface is showing errors with the power supply. It is confirmed to be up and running. Most likely a fan went out. The server is still in warranty.
[15:39:13] <wikibugs>	 (03PS4) 10Elukey: Allow the use of Ipv6 in the Hadoop Analytics cluster [puppet] - 10https://gerrit.wikimedia.org/r/523229 (https://phabricator.wikimedia.org/T225296)
[15:40:12] <wikibugs>	 (03CR) 10Effie Mouzeli: [V: 03+1] "LGTM https://puppet-compiler.wmflabs.org/compiler1002/17598/" [puppet] - 10https://gerrit.wikimedia.org/r/525306 (https://phabricator.wikimedia.org/T219148) (owner: 10Effie Mouzeli)
[15:41:23] <wikibugs>	 (03CR) 10Ppchelko: [C: 03+1] jobrunners: Migrate all jobrunners to serve only via PHP7 [puppet] - 10https://gerrit.wikimedia.org/r/525306 (https://phabricator.wikimedia.org/T219148) (owner: 10Effie Mouzeli)
[15:41:45] <wikibugs>	 (03PS5) 10Elukey: Allow the use of Ipv6 in the Hadoop Analytics cluster [puppet] - 10https://gerrit.wikimedia.org/r/523229 (https://phabricator.wikimedia.org/T225296)
[15:41:47] <wikibugs>	 (03PS1) 10Elukey: profile::prometheus::jmx_exporter: allow IPv6 polling [puppet] - 10https://gerrit.wikimedia.org/r/525309 (https://phabricator.wikimedia.org/T225296)
[15:41:54] <icinga-wm>	 PROBLEM - High CPU load on API appserver on mw1231 is CRITICAL: CRITICAL - load average: 52.61, 29.83, 22.75 https://wikitech.wikimedia.org/wiki/Application_servers
[15:42:24] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] profile::prometheus::jmx_exporter: allow IPv6 polling [puppet] - 10https://gerrit.wikimedia.org/r/525309 (https://phabricator.wikimedia.org/T225296) (owner: 10Elukey)
[15:42:49] <jijiki>	 !log Disable puppet on jobrunners for 525306
[15:42:54] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:43:29] <wikibugs>	 (03PS2) 10Elukey: profile::prometheus::jmx_exporter: allow IPv6 polling [puppet] - 10https://gerrit.wikimedia.org/r/525309 (https://phabricator.wikimedia.org/T225296)
[15:43:31] <wikibugs>	 (03PS6) 10Elukey: Allow the use of Ipv6 in the Hadoop Analytics cluster [puppet] - 10https://gerrit.wikimedia.org/r/523229 (https://phabricator.wikimedia.org/T225296)
[15:44:20] <jeh>	 !log rebooting labstore1007.wikimedia.org for updates T224228
[15:44:26] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:44:35] <wikibugs>	 (03CR) 10Effie Mouzeli: [V: 03+1 C: 03+2] jobrunners: Migrate all jobrunners to serve only via PHP7 [puppet] - 10https://gerrit.wikimedia.org/r/525306 (https://phabricator.wikimedia.org/T219148) (owner: 10Effie Mouzeli)
[15:44:50] <icinga-wm>	 PROBLEM - Check the Netbox report-s- librenms for fail status. on netmon1002 is CRITICAL: librenms.LibreNMS CRITICAL https://wikitech.wikimedia.org/wiki/Netbox%23Reports
[15:45:00] <chaomodus>	 wyd
[15:45:03] <wikibugs>	 (03PS3) 10Ema: ATS: Vary-slotting for PHP7 [puppet] - 10https://gerrit.wikimedia.org/r/525274 (https://phabricator.wikimedia.org/T206339)
[15:45:05] <wikibugs>	 (03PS1) 10Ema: ATS: Vary-slotting for X-Forwarded-Proto [puppet] - 10https://gerrit.wikimedia.org/r/525310 (https://phabricator.wikimedia.org/T227432)
[15:45:20] <chaomodus>	 oic
[15:47:10] <icinga-wm>	 RECOVERY - IPMI Sensor Status on dbprov1001 is OK: Sensor Type(s) Temperature, Power_Supply Status: OK https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures
[15:47:13] <wikibugs>	 10Operations, 10Gerrit, 10LDAP-Access-Requests, 10Release-Engineering-Team-TODO, 10Release-Engineering-Team (Development services): Add more SREs to gerritadmin LDAP group - https://phabricator.wikimedia.org/T228733 (10herron) @Joe @Dzahn adding you both to gerritadmin would satisfy "at least 1 person fr...
[15:47:42] <jijiki>	 !log Rolling puppet-enable and apache reload of jobrunners in eqiad
[15:47:48] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:48:13] <wikibugs>	 10Operations, 10serviceops, 10PHP 7.2 support, 10Performance-Team (Radar), and 2 others: PHP 7 corruption during deployment (was: PHP 7 fatals on mw1262) - https://phabricator.wikimedia.org/T224491 (10cscott) Worth noting that we have a known GC error in PHP 7.2, which is also 100% reproducible: {T228346}....
[15:48:31] <James_F>	 jijiki: Yay.
[15:48:44] <jijiki>	 haha
[15:49:22] <wikibugs>	 10Operations, 10ops-eqiad, 10DC-Ops: a1-eqiad pdu refresh - https://phabricator.wikimedia.org/T226782 (10RobH)
[15:49:52] <icinga-wm>	 PROBLEM - puppet last run on elastic1049 is CRITICAL: CRITICAL: Failed to apply catalog, zero resources tracked by Puppet. Could be an interrupted request or a dependency cycle. https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[15:50:14] <icinga-wm>	 RECOVERY - High CPU load on API appserver on mw1231 is OK: OK - load average: 16.84, 22.86, 23.58 https://wikitech.wikimedia.org/wiki/Application_servers
[15:52:39] <elukey>	 SMalyshev: o/ - if you are online, would you mind joining #wikimedia-dcops ?
[15:53:31] <ottomata>	 wow there are so many chat rooms
[15:54:51] <bblack>	 don't forget the SRE Slack channel
[15:55:18] <elukey>	 bblack: well played :D
[15:56:05] <chaomodus>	 that doesn't actually exist does it ?
[15:56:05] <bblack>	 !log depooling recdns on dns1001 via confctl
[15:56:13] <bblack>	 !log depooling recdns on dns1001 via confctl - T226782
[15:56:21] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:56:28] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:56:29] <stashbot>	 T226782: a1-eqiad pdu refresh - https://phabricator.wikimedia.org/T226782
[15:56:33] <logmsgbot>	 !log bblack@cumin1001 conftool action : set/pooled=no; selector: name=dns1001.wikimedia.org
[15:56:37] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:56:39] <marostegui>	 cmjohnson1: the alert for dbprov1001 recovered, did you do some magic?
[15:58:03] <XioNoX>	 !log failover master VIP of ae2.1202 inet6 away from cr1-eqiad - T226782
[15:58:09] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:59:06] <bblack>	 !log lvs1014 - puppet disable, remove dns1001 from resolv.conf, restart pybal - T226782
[15:59:13] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:59:38] <wikibugs>	 10Operations, 10serviceops, 10PHP 7.2 support, 10Performance-Team (Radar), and 2 others: PHP 7 corruption during deployment (was: PHP 7 fatals on mw1262) - https://phabricator.wikimedia.org/T224491 (10Krinkle) >>! In T224491#5356481, @Joe wrote: >>>! In T224491#5354568, @Krinkle wrote: >> […] >> Only seen...
[15:59:39] <wikibugs>	 (03PS1) 10Jhedden: Revert "dumps dist: switch active VPS to labstore1006" [puppet] - 10https://gerrit.wikimedia.org/r/525313
[15:59:58] <bblack>	 !log dns1001 - puppet disable, stop recursor service to kill anycast advert - T226782
[16:00:04] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:00:04] <jouncebot>	 MaxSem, RoanKattouw, Niharika, and Urbanecm: I, the Bot under the Fountain, allow thee, The Deployer, to do Morning SWAT (Max 6 patches) deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190724T1600).
[16:00:04] <jouncebot>	 No GERRIT patches in the queue for this window AFAICS.
[16:00:49] <wikibugs>	 (PS2) Jhedden: Revert "dumps dist: switch active VPS to labstore1006" [puppet] - https://gerrit.wikimedia.org/r/525313
[16:01:34] <wikibugs>	 (CR) Jhedden: [C: +2] Revert "dumps dist: switch active VPS to labstore1006" [puppet] - https://gerrit.wikimedia.org/r/525313 (owner: Jhedden)
[16:02:07] <cmjohnson1>	 marostegui: I cleared the log but I did pull the report for Dell.  I will check on it again later before resolving ticket
[16:02:22] <icinga-wm>	 PROBLEM - Blazegraph Port for wdqs-blazegraph on wdqs1006 is CRITICAL: connect to address 127.0.0.1 and port 9999: Connection refused https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook
[16:02:44] <icinga-wm>	 PROBLEM - WDQS HTTP Port on wdqs1006 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 380 bytes in 0.001 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service
[16:03:08] <icinga-wm>	 PROBLEM - Blazegraph process -wdqs-blazegraph- on wdqs1006 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 499 (blazegraph), regex args ^java .* --port 9999 .* blazegraph-service-.*war https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook
[16:03:44] <icinga-wm>	 PROBLEM - Check systemd state on wdqs1006 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[16:03:48] <icinga-wm>	 PROBLEM - Recursive DNS on 208.80.154.10 is CRITICAL: CRITICAL - Plugin timed out while executing system call https://wikitech.wikimedia.org/wiki/DNS
[16:04:04] <marostegui>	 cmjohnson1: good idea yeah, it might fire again later. thanks
[16:04:05] <icinga-wm>	 ACKNOWLEDGEMENT - Blazegraph Port for wdqs-blazegraph on wdqs1006 is CRITICAL: connect to address 127.0.0.1 and port 9999: Connection refused Stas Malychev PDU replacement work https://phabricator.wikimedia.org/T226782 https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook
[16:04:05] <icinga-wm>	 ACKNOWLEDGEMENT - Blazegraph process -wdqs-blazegraph- on wdqs1006 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 499 (blazegraph), regex args ^java .* --port 9999 .* blazegraph-service-.*war Stas Malychev PDU replacement work https://phabricator.wikimedia.org/T226782 https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook
[16:04:05] <icinga-wm>	 ACKNOWLEDGEMENT - Check systemd state on wdqs1006 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. Stas Malychev PDU replacement work https://phabricator.wikimedia.org/T226782 https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[16:04:06] <icinga-wm>	 ACKNOWLEDGEMENT - WDQS HTTP Port on wdqs1006 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 380 bytes in 0.001 second response time Stas Malychev PDU replacement work https://phabricator.wikimedia.org/T226782 https://wikitech.wikimedia.org/wiki/Wikidata_query_service
[16:04:16] <icinga-wm>	 PROBLEM - Recursive DNS on 2620:0:861:1:208:80:154:10 is CRITICAL: CRITICAL - Plugin timed out while executing system call https://wikitech.wikimedia.org/wiki/DNS
[16:06:05] <icinga-wm>	 ACKNOWLEDGEMENT - Recursive DNS on 208.80.154.10 is CRITICAL: CRITICAL - Plugin timed out while executing system call Brandon Black Stopped for A1 PDU work - T226782 https://wikitech.wikimedia.org/wiki/DNS
[16:06:05] <icinga-wm>	 ACKNOWLEDGEMENT - Recursive DNS on 2620:0:861:1:208:80:154:10 is CRITICAL: CRITICAL - Plugin timed out while executing system call Brandon Black Stopped for A1 PDU work - T226782 https://wikitech.wikimedia.org/wiki/DNS
[16:06:43] <logmsgbot>	 !log jforrester@deploy1001 Synchronized php-1.34.0-wmf.14/includes/export/XmlDumpWriter.php: T228720 make XmlDumpwriter more resilient to blob store corruption (duration: 00m 55s)
[16:06:50] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:06:50] <stashbot>	 T228720: stub for enwiki broken, attempt to load content for bad rev during sha1 retrieval - https://phabricator.wikimedia.org/T228720
[16:06:58] <apergos>	 \o/
[16:07:46] <logmsgbot>	 !log jforrester@deploy1001 Synchronized php-1.34.0-wmf.15/includes/export/XmlDumpWriter.php: T228720 make XmlDumpwriter more resilient to blob store corruption (duration: 00m 55s)
[16:07:53] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:08:00] * apergos heaves a sigh of relief
[16:08:04] <apergos>	 hopefully that's it this time
[16:09:19] <wikibugs>	 Operations, Analytics, Analytics-Kanban, netops: Allow analytics VLAN to reach eventgate-analytics.discovery.wmnet:31192 - https://phabricator.wikimedia.org/T228882 (Ottomata) a:Ottomata→None
[16:09:22] <liw>	 "Trying to get property 'gb_expiry' of non-object" - should I worry about that?
[16:10:22] <apergos>	 maybe
[16:10:36] <icinga-wm>	 RECOVERY - Blazegraph Port for wdqs-blazegraph on wdqs1006 is OK: TCP OK - 0.000 second response time on 127.0.0.1 port 9999 https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook
[16:10:44] <icinga-wm>	 RECOVERY - Recursive DNS on 2620:0:861:1:208:80:154:10 is OK: DNS OK: 0.018 seconds response time. www.wikipedia.org returns 208.80.154.224 https://wikitech.wikimedia.org/wiki/DNS
[16:10:52] <bblack>	 !log dns1001 - restart recursor and re-enable puppet - T226782
[16:10:58] <icinga-wm>	 RECOVERY - High CPU load on API appserver on mw1222 is OK: OK - load average: 17.96, 21.36, 23.73 https://wikitech.wikimedia.org/wiki/Application_servers
[16:10:59] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:11:00] <icinga-wm>	 RECOVERY - WDQS HTTP Port on wdqs1006 is OK: HTTP OK: HTTP/1.1 200 OK - 448 bytes in 0.029 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service
[16:11:00] <stashbot>	 T226782: a1-eqiad pdu refresh - https://phabricator.wikimedia.org/T226782
[16:11:13] <liw>	 ten times in logstash, for .15, on metawiki, though not in the past few minutes
[16:11:22] <icinga-wm>	 RECOVERY - Blazegraph process -wdqs-blazegraph- on wdqs1006 is OK: PROCS OK: 1 process with UID = 499 (blazegraph), regex args ^java .* --port 9999 .* blazegraph-service-.*war https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook
[16:11:47] <apergos>	 do we have the request URLs?
[16:11:48] <bblack>	 !log lvs1014 - restore puppet and resolv.conf contents, restart pybal
[16:11:54] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:11:58] <icinga-wm>	 RECOVERY - Recursive DNS on 208.80.154.10 is OK: DNS OK: 0.010 seconds response time. www.wikipedia.org returns 208.80.154.224 https://wikitech.wikimedia.org/wiki/DNS
[16:12:00] <icinga-wm>	 RECOVERY - Check systemd state on wdqs1006 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[16:12:30] <liw>	 apergos, meta.wikimedia.org and /w/api.php, but that's all (reqId XTiAGQpAIDwAAIe1-vMAAAAJ)
[16:12:35] <bblack>	 !log re-pooling recdns on dns1001 via confctl - T226782
[16:12:38] <logmsgbot>	 !log bblack@cumin1001 conftool action : set/pooled=yes; selector: name=dns1001.wikimedia.org
[16:12:39] <apergos>	 api. uh. meh
[16:12:42] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:12:47] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:14:14] <apergos>	 I guess it's not preventing the user from getting things done
[16:14:35] <apergos>	 so it would be nice to know the cause but unless global blocks are suddenly broken it's not a huge deal
[16:14:47] <apergos>	 disclaimer: not a mw dev
[16:15:36] <apergos>	 "PHP Notice: Trying to get property 'gb_anon_only' of non-object" (that too, same request)
[16:15:38] <apergos>	 hm
[16:16:17] <James_F>	 Ah, GlobalBlocks.
[16:16:31] <apergos>	 $someone will have to look at the code 
[16:16:32] <James_F>	 Very likely a regression there; that area of code has been changing recently.
[16:16:46] <James_F>	 liw: Have you filed a Phab task?
[16:17:30] <James_F>	 Not necessarily a train blocker, but we should file it and throw it over to the Anti-Harassment team
[16:17:34] <liw>	 James_F, in the process of doing that
[16:17:37] <elukey>	 I have noticed 2/3 api appservers with high CPU load (sustained), still need to check but afaik they started after the deployment
[16:17:46] <icinga-wm>	 RECOVERY - puppet last run on elastic1049 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[16:18:06] <elukey>	 might be completely unrelated, just wanted to raise the concern
[16:18:21] * James_F nods.
[16:18:32] <liw>	 https://phabricator.wikimedia.org/T228899
[16:18:40] <wikibugs>	 Operations, Puppet, observability: Use git commit id as "configuration version" for puppet - https://phabricator.wikimedia.org/T228854 (fgiunchedi)
[16:21:01] <apergos>	 added the other error
[16:34:25] <wikibugs>	 (PS1) Papaul: DHCP: Add MAC address entries for db21[21-30] [puppet] - https://gerrit.wikimedia.org/r/525318 (https://phabricator.wikimedia.org/T227113)
[16:39:55] <wikibugs>	 (PS4) Muehlenhoff: Configure unconditional flushes of the L1 cache during VMENTER [puppet] - https://gerrit.wikimedia.org/r/525295 (https://phabricator.wikimedia.org/T228870)
[16:40:00] <wikibugs>	 Operations, ops-codfw, DBA, Goal, Patch-For-Review: rack/setup/install db21[21-30].codfw.wmnet - https://phabricator.wikimedia.org/T227113 (Papaul)
[16:40:04] <wikibugs>	 (CR) Muehlenhoff: Configure unconditional flushes of the L1 cache during VMENTER (2 comments) [puppet] - https://gerrit.wikimedia.org/r/525295 (https://phabricator.wikimedia.org/T228870) (owner: Muehlenhoff)
[16:40:50] <wikibugs>	 Operations, ops-codfw, DBA, Goal, Patch-For-Review: rack/setup/install db21[21-30].codfw.wmnet - https://phabricator.wikimedia.org/T227113 (Papaul)
[16:41:41] <wikibugs>	 (PS1) Arturo Borrero Gonzalez: openstack: fullstack: use a readable-friendly name for VMs [puppet] - https://gerrit.wikimedia.org/r/525320
[16:41:43] <wikibugs>	 (CR) Muehlenhoff: [C: +1] admin: add dz1 to ldap_only_users [puppet] - https://gerrit.wikimedia.org/r/525294 (https://phabricator.wikimedia.org/T227496) (owner: Herron)
[16:41:52] <icinga-wm>	 PROBLEM - High CPU load on API appserver on mw1233 is CRITICAL: CRITICAL - load average: 51.82, 36.13, 30.02 https://wikitech.wikimedia.org/wiki/Application_servers
[16:42:13] <wikibugs>	 (CR) jerkins-bot: [V: -1] openstack: fullstack: use a readable-friendly name for VMs [puppet] - https://gerrit.wikimedia.org/r/525320 (owner: Arturo Borrero Gonzalez)
[16:42:20] <icinga-wm>	 PROBLEM - High CPU load on API appserver on mw1227 is CRITICAL: CRITICAL - load average: 57.73, 37.77, 32.77 https://wikitech.wikimedia.org/wiki/Application_servers
[16:43:57] <wikibugs>	 (PS2) Arturo Borrero Gonzalez: openstack: fullstack: use a readable-friendly name for VMs [puppet] - https://gerrit.wikimedia.org/r/525320
[16:44:15] <jijiki>	 !log Rolling puppet-enable and apache reload of jobrunners in codfw
[16:44:22] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:47:48] <icinga-wm>	 PROBLEM - High CPU load on API appserver on mw1222 is CRITICAL: CRITICAL - load average: 79.01, 44.86, 30.76 https://wikitech.wikimedia.org/wiki/Application_servers
[16:47:54] <wikibugs>	 Operations, ops-codfw, DBA, Goal, Patch-For-Review: rack/setup/install db21[21-30].codfw.wmnet - https://phabricator.wikimedia.org/T227113 (Papaul)
[16:56:55] <wikibugs>	 Operations, ops-codfw, DBA, Goal, Patch-For-Review: rack/setup/install db21[21-30].codfw.wmnet - https://phabricator.wikimedia.org/T227113 (Papaul)
[16:58:06] <wikibugs>	 (PS2) Herron: admin: add dz1 to ldap_only_users [puppet] - https://gerrit.wikimedia.org/r/525294 (https://phabricator.wikimedia.org/T227496)
[16:59:12] <wikibugs>	 (CR) Herron: [C: +2] admin: add dz1 to ldap_only_users [puppet] - https://gerrit.wikimedia.org/r/525294 (https://phabricator.wikimedia.org/T227496) (owner: Herron)
[17:02:34] <icinga-wm>	 PROBLEM - High CPU load on API appserver on mw1222 is CRITICAL: CRITICAL - load average: 44.73, 35.05, 32.48 https://wikitech.wikimedia.org/wiki/Application_servers
[17:03:38] <icinga-wm>	 PROBLEM - High CPU load on API appserver on mw1227 is CRITICAL: CRITICAL - load average: 40.26, 35.40, 33.17 https://wikitech.wikimedia.org/wiki/Application_servers
[17:07:32] <wikibugs>	 Operations, ops-codfw, DBA, Goal, Patch-For-Review: rack/setup/install db21[21-30].codfw.wmnet - https://phabricator.wikimedia.org/T227113 (Papaul)
[17:07:38] <wikibugs>	 Operations, ops-codfw, netops: Cable mr1-codfw<->cr1/2-codfw through asw-a-codfw - https://phabricator.wikimedia.org/T228112 (ayounsi)
[17:10:27] <XioNoX>	 !log Add mr1-codfw<->cr1/2-codfw vlan/link config on asw-a-codfw - T228112
[17:10:34] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:10:35] <stashbot>	 T228112: Cable mr1-codfw<->cr1/2-codfw through asw-a-codfw - https://phabricator.wikimedia.org/T228112
[17:10:40] <wikibugs>	 Operations, Analytics, LDAP-Access-Requests, wikimediafoundation.org, Patch-For-Review: Access to WikimediaFoundation.org analytics for Deb - https://phabricator.wikimedia.org/T227496 (RStallman-legalteam) I don't actually see the paper work for WMF full time req # employees, so I think havin...
[17:12:19] <wikibugs>	 Operations, ops-codfw, netops: Cable mr1-codfw<->cr1/2-codfw through asw-a-codfw - https://phabricator.wikimedia.org/T228112 (ayounsi)
[17:14:04] <XioNoX>	 !log rollback failover master VIP of ae2.1202 inet6 away from cr1-eqiad - T226782
[17:14:10] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:14:10] <stashbot>	 T226782: a1-eqiad pdu refresh - https://phabricator.wikimedia.org/T226782
[17:14:40] <icinga-wm>	 PROBLEM - High CPU load on API appserver on mw1233 is CRITICAL: CRITICAL - load average: 67.64, 40.09, 33.59 https://wikitech.wikimedia.org/wiki/Application_servers
[17:15:08] <icinga-wm>	 PROBLEM - High CPU load on API appserver on mw1227 is CRITICAL: CRITICAL - load average: 74.15, 45.24, 36.17 https://wikitech.wikimedia.org/wiki/Application_servers
[17:18:27] <wikibugs>	 (CR) Cwhite: [C: +2] hiera: deploy varnishkafka exporter to esams [puppet] - https://gerrit.wikimedia.org/r/524931 (https://phabricator.wikimedia.org/T196066) (owner: Cwhite)
[17:18:37] <wikibugs>	 (PS2) Cwhite: hiera: deploy varnishkafka exporter to esams [puppet] - https://gerrit.wikimedia.org/r/524931 (https://phabricator.wikimedia.org/T196066)
[17:18:54] <wikibugs>	 (CR) Jhedden: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/525320 (owner: Arturo Borrero Gonzalez)
[17:20:05] <wikibugs>	 Operations, Analytics, LDAP-Access-Requests, wikimediafoundation.org: Access to WikimediaFoundation.org analytics for Deb - https://phabricator.wikimedia.org/T227496 (Heather) It doesn't seem like you need it, but this is approved. Let me know if you need something else.
[17:20:38] <wikibugs>	 Operations, ops-codfw, DBA, Goal, Patch-For-Review: rack/setup/install db21[21-30].codfw.wmnet - https://phabricator.wikimedia.org/T227113 (Papaul)
[17:20:56] <wikibugs>	 (CR) CRusnov: "This change is ready for review." [puppet] - https://gerrit.wikimedia.org/r/514395 (https://phabricator.wikimedia.org/T223291) (owner: CRusnov)
[17:22:30] <wikibugs>	 (CR) CDanis: [C: +1] role::alerting_host: Remove support for jessie [puppet] - https://gerrit.wikimedia.org/r/525232 (owner: Muehlenhoff)
[17:27:32] <wikibugs>	 Operations, ops-eqiad, DC-Ops: dbproxy1012 alerting on PS Redundancy - https://phabricator.wikimedia.org/T228892 (Cmjohnson) i created a dispatch with Dell to replace the PSU  You have successfully submitted request SR995054295.
[17:30:07] <wikibugs>	 Operations, Analytics, LDAP-Access-Requests, wikimediafoundation.org: Access to WikimediaFoundation.org analytics for Deb - https://phabricator.wikimedia.org/T227496 (herron) Open→Resolved Great!  Thanks all
[17:30:08] <wikibugs>	 (PS13) Volans: Add the LBRemoteCluster class. [software/spicerack] - https://gerrit.wikimedia.org/r/503947 (owner: Giuseppe Lavagetto)
[17:33:02] <wikibugs>	 (CR) Dzahn: [C: +2] DHCP: Add MAC address entries for db21[21-30] [puppet] - https://gerrit.wikimedia.org/r/525318 (https://phabricator.wikimedia.org/T227113) (owner: Papaul)
[17:33:10] <wikibugs>	 (PS2) Dzahn: DHCP: Add MAC address entries for db21[21-30] [puppet] - https://gerrit.wikimedia.org/r/525318 (https://phabricator.wikimedia.org/T227113) (owner: Papaul)
[17:33:48] <icinga-wm>	 RECOVERY - High CPU load on API appserver on mw1222 is OK: OK - load average: 18.39, 20.40, 23.81 https://wikitech.wikimedia.org/wiki/Application_servers
[17:34:16] <wikibugs>	 (CR) jerkins-bot: [V: -1] Add the LBRemoteCluster class. [software/spicerack] - https://gerrit.wikimedia.org/r/503947 (owner: Giuseppe Lavagetto)
[17:37:49] <wikibugs>	 (PS6) Ayounsi: Add an anycast endpoint to syslog centralservers [puppet] - https://gerrit.wikimedia.org/r/524037
[17:37:54] <liw>	 https://phabricator.wikimedia.org/T228911
[17:38:10] <icinga-wm>	 PROBLEM - High CPU load on API appserver on mw1227 is CRITICAL: CRITICAL - load average: 47.68, 33.82, 32.58 https://wikitech.wikimedia.org/wiki/Application_servers
[17:39:23] <wikibugs>	 Operations, ops-eqiad: Degraded RAID on elastic1046 - https://phabricator.wikimedia.org/T228606 (Cmjohnson) a:Cmjohnson→wiki_willy This server is out of warranty, ended April 2019.   @wiki_willy escalating to you to decide on disks
[17:41:54] <wikibugs>	 Operations, MediaWiki-General, serviceops, Core Platform Team (PHP7 (TEC4)), and 3 others: Some pages will become completely unreachable after PHP7 update due to Unicode changes - https://phabricator.wikimedia.org/T219279 (Anomie) Next steps here:  [ ] 1. Determine the schedule to do these next s...
[17:42:56] <wikibugs>	 (PS14) Volans: Add the LBRemoteCluster class. [software/spicerack] - https://gerrit.wikimedia.org/r/503947 (owner: Giuseppe Lavagetto)
[17:43:14] <wikibugs>	 (PS2) Jeena Huneidi: Package mediawiki-dev and add to index [deployment-charts] - https://gerrit.wikimedia.org/r/525173 (https://phabricator.wikimedia.org/T224935)
[17:44:10] <wikibugs>	 (CR) Ayounsi: "I went the "Set a different profile::bird::advertise_vips for the server to be decom" way." [puppet] - https://gerrit.wikimedia.org/r/524037 (owner: Ayounsi)
[17:48:56] <wikibugs>	 Operations, Analytics, Analytics-Kanban, Discovery, Research-Backlog: Make oozie swift upload emit event to Kafka about swift object upload complete - https://phabricator.wikimedia.org/T227896 (EBernhardson) I hadn't previously thought about re-publishing a new version of the same dataset. It...
[17:49:58] <wikibugs>	 Operations, Analytics, Analytics-Kanban, Discovery, Research-Backlog: Make oozie swift upload emit event to Kafka about swift object upload complete - https://phabricator.wikimedia.org/T227896 (Ottomata) Ya, if you needed to re-run a job due to data backfill, you might want to be able to do s...
[17:52:30] <icinga-wm>	 RECOVERY - High CPU load on API appserver on mw1233 is OK: OK - load average: 17.79, 20.27, 23.70 https://wikitech.wikimedia.org/wiki/Application_servers
[17:54:55] <wikibugs>	 (PS1) Ayounsi: Fastnetmon: disable Graphite, fix notify script path [puppet] - https://gerrit.wikimedia.org/r/525334 (https://phabricator.wikimedia.org/T226810)
[17:56:02] <wikibugs>	 Operations, ops-eqiad: Degraded RAID on cloudvirt1024 - https://phabricator.wikimedia.org/T228853 (Cmjohnson) A new ticket has been created with Dell  You have successfully submitted request SR995055580.
[18:00:41] <wikibugs>	 Operations, ops-eqiad, DBA: Upgrade db1100 firmware and BIOS - https://phabricator.wikimedia.org/T228732 (Cmjohnson) @Marostegui This can be done any day...Let's plan 8/6 @1000EDT /1400UTC
[18:00:47] <wikibugs>	 (CR) Dzahn: "i don't know much about this but i have 2 questions/observations:  a) looking at the repo all the other charts seem to exist as both a .tg" [deployment-charts] - https://gerrit.wikimedia.org/r/525173 (https://phabricator.wikimedia.org/T224935) (owner: Jeena Huneidi)
[18:02:50] <wikibugs>	 (PS2) Ayounsi: Fastnetmon: disable Graphite, fix notify script path [puppet] - https://gerrit.wikimedia.org/r/525334 (https://phabricator.wikimedia.org/T226810)
[18:04:32] <wikibugs>	 Operations, ops-eqiad, Cassandra, DC-Ops, and 4 others: restbase-dev1006 has a broken disk - https://phabricator.wikimedia.org/T224260 (Cmjohnson) @Eevans  is this your server? I think I understand that the server is going to be re-installed anyway so if I pull the wrong disk to replace I won't'...
[18:11:13] * Krinkle staging on mwdebug1002
[18:12:07] <Krinkle>	 !log krinkle@deploy1001: extensions/CheckUser is dirty in php-1.34.0-wmf.15
[18:12:12] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:13:00] <urandom>	 !log creating new restbase keyspaces -- T228804
[18:13:07] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:13:08] <stashbot>	 T228804: Create keyspaces in Cassandra for PCS endpoints - https://phabricator.wikimedia.org/T228804
[18:19:07] <logmsgbot>	 !log krinkle@deploy1001 Synchronized php-1.34.0-wmf.15/includes/cache/localisation/LocalisationCache.php: 31d99eb381bc (duration: 00m 54s)
[18:19:13] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:19:57] <wikibugs>	 (CR) Dzahn: "Yep, and i can't think of a reason why Icinga ever had PHP module installed anyways." [puppet] - https://gerrit.wikimedia.org/r/525232 (owner: Muehlenhoff)
[18:20:35] <wikibugs>	 (PS2) Dzahn: role::alerting_host: Remove support for jessie [puppet] - https://gerrit.wikimedia.org/r/525232 (owner: Muehlenhoff)
[18:20:45] <wikibugs>	 (PS1) Bstorm: toolforge: set kubeadm to use internal registry for pause container [puppet] - https://gerrit.wikimedia.org/r/525339 (https://phabricator.wikimedia.org/T228887)
[18:21:56] <wikibugs>	 Operations, ops-eqiad: Degraded RAID on elastic1046 - https://phabricator.wikimedia.org/T228606 (wiki_willy) @elukey - since elastic1046 is just barely out of warranty (only by a few months), we'll still have to purchase a new disk for this server.  Just double-checking that's the route you want to go, b...
[18:22:57] <wikibugs>	 (CR) Dzahn: [C: +2] "https://puppet-compiler.wmflabs.org/compiler1002/17600/icinga1001.wikimedia.org/" [puppet] - https://gerrit.wikimedia.org/r/525232 (owner: Muehlenhoff)
[18:29:14] <icinga-wm>	 RECOVERY - High CPU load on API appserver on mw1227 is OK: OK - load average: 19.89, 21.33, 23.71 https://wikitech.wikimedia.org/wiki/Application_servers
[18:30:46] <wikibugs>	 Operations, ops-eqiad: replace scs-a8-eqiad - https://phabricator.wikimedia.org/T228919 (RobH) p:Triage→Normal
[18:30:57] <wikibugs>	 Operations, ops-eqiad: replace scs-a8-eqiad - https://phabricator.wikimedia.org/T228919 (RobH)
[18:33:22] <cmjohnson1>	 !log moving cloudvirt1017 to 10G rack T228691
[18:33:29] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:33:29] <stashbot>	 T228691: relocate/reimage cloudvirt1017 with 10G interfaces - https://phabricator.wikimedia.org/T228691
[18:36:20] <wikibugs>	 (CR) Andrew Bogott: [C: +1] "I'm pretty sure this won't break anything, and it sounds like a big improvement!" [puppet] - https://gerrit.wikimedia.org/r/525320 (owner: Arturo Borrero Gonzalez)
[18:52:46] <wikibugs>	 Operations, ops-eqiad, Cassandra, DC-Ops, and 4 others: restbase-dev1006 has a broken disk - https://phabricator.wikimedia.org/T224260 (Eevans) >>! In T224260#5362558, @Cmjohnson wrote: > @Eevans  is this your server? I think I understand that the server is going to be re-installed anyway so if I...
[18:52:50] <wikibugs>	 Operations, ops-eqiad, DC-Ops, Epic, cloud-services-team (Kanban): relocate/reimage cloudvirt1017 with 10G interfaces - https://phabricator.wikimedia.org/T228691 (Cmjohnson)
[18:54:08] <wikibugs>	 Operations, ops-eqiad, DC-Ops, Epic, cloud-services-team (Kanban): relocate/reimage cloudvirt1017 with 10G interfaces - https://phabricator.wikimedia.org/T228691 (Cmjohnson) @Andrew a little luck with this server, it was already in a 10G rack.   Removed the old network info on the switch, add...
[18:56:54] <icinga-wm>	 PROBLEM - High CPU load on API appserver on mw1233 is CRITICAL: CRITICAL - load average: 66.53, 39.23, 28.41 https://wikitech.wikimedia.org/wiki/Application_servers
[18:57:20] <icinga-wm>	 PROBLEM - High CPU load on API appserver on mw1227 is CRITICAL: CRITICAL - load average: 48.90, 40.67, 31.71 https://wikitech.wikimedia.org/wiki/Application_servers
[18:57:32] <icinga-wm>	 RECOVERY - DPKG on restbase-dev1006 is OK: All packages OK
[18:59:00] <legoktm>	 jouncebot: next
[18:59:01] <jouncebot>	 In 1 hour(s) and 0 minute(s): Services – Parsoid / Citoid / Mobileapps / ORES / … (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190724T2000)
[18:59:43] <wikibugs>	 Operations, vm-requests: rack/setup/install ganeti10([09]|1[0-8[).eqiad.wmnet - https://phabricator.wikimedia.org/T228924 (RobH) p:Triage→Normal
[19:00:40] <wikibugs>	 Operations, vm-requests: rack/setup/install ganeti10([09]|1[0-8[).eqiad.wmnet - https://phabricator.wikimedia.org/T228924 (RobH) a:akosiaris @akosiaris,  Can i get your sign off about the racking proposal and planning for these 10 ganeti nodes?  4 were refresh, while 6 were expansion from last years...
[19:01:09] <wikibugs>	 Operations, ops-eqiad, Cassandra, DC-Ops, and 4 others: restbase-dev1006 has a broken disk - https://phabricator.wikimedia.org/T224260 (Cmjohnson) Open→Resolved @eevans the disk has been replaced. I am resolving this task, if you find the problem is not fixed, please re-open and assign to...
[19:01:48] <wikibugs>	 Operations, ops-codfw, serviceops: (OoW) restbase2009 lockup - https://phabricator.wikimedia.org/T227408 (Eevans) >>! In T227408#5349035, @jijiki wrote: > @Eevans Shall we mark restbase2009 as inactive on conftool?  I'm not positive I understand the implications of that.   As far as I know, the host...
[19:02:12] <legoktm>	 jouncebot: reload
[19:02:19] <legoktm>	 jouncebot: now
[19:02:19] <jouncebot>	 No deployments scheduled for the next 0 hour(s) and 57 minute(s)
[19:02:23] <legoktm>	 jouncebot: help
[19:02:24] <jouncebot>	 **** JounceBot Help ****
[19:02:24] <jouncebot>	 JounceBot is a deployment helper bot for the Wikimedia Foundation.
[19:02:24] <jouncebot>	 You can find my source at https://github.com/mattofak/jouncebot
[19:02:24] <jouncebot>	 Available commands:
[19:02:24] <jouncebot>	  HELP    Prints the list of all commands known to the server
[19:02:24] <jouncebot>	  NEXT    Get the next deployment event(s if they happen at the same time)
[19:02:24] <jouncebot>	  NOW     Get the current deployment event(s) or the time until the next
[19:02:25] <jouncebot>	  REFRESH Refresh my knowledge about deployments
[19:02:29] <legoktm>	 jouncebot: refresh
[19:02:30] <jouncebot>	 I refreshed my knowledge about deployments.
[19:02:33] <legoktm>	 jouncebot: now
[19:02:33] <jouncebot>	 For the next 0 hour(s) and 57 minute(s): SecureLinkFixer to group0 (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190724T1900)
[19:02:36] <legoktm>	 yay
[19:02:49] <wikibugs>	 (PS2) Legoktm: Enable SecureLinkFixer on group0 wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/525157 (https://phabricator.wikimedia.org/T200751)
[19:03:35] <wikibugs>	 (CR) Legoktm: [C: +2] Enable SecureLinkFixer on group0 wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/525157 (https://phabricator.wikimedia.org/T200751) (owner: Legoktm)
[19:03:45] <wikibugs>	 Operations, ops-eqiad: rack/setup/instal (4) CI ganeti nodes - https://phabricator.wikimedia.org/T228926 (RobH) p:Triage→Normal
[19:04:20] <wikibugs>	 (CR) Jeena Huneidi: "> Patch Set 2:" [deployment-charts] - https://gerrit.wikimedia.org/r/525173 (https://phabricator.wikimedia.org/T224935) (owner: Jeena Huneidi)
[19:04:40] <wikibugs>	 Operations, ops-eqiad: rack/setup/instal (4) CI ganeti nodes - https://phabricator.wikimedia.org/T228926 (RobH) @akosiaris,  Are you involved in this project, and if so would you be the one to provide details for this?  Please comment and assign back to me for followup, thanks!
[19:04:42] <wikibugs>	 (Merged) jenkins-bot: Enable SecureLinkFixer on group0 wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/525157 (https://phabricator.wikimedia.org/T200751) (owner: Legoktm)
[19:05:03] <wikibugs>	 Operations, ops-eqiad: rack/setup/instal (4) CI ganeti nodes - https://phabricator.wikimedia.org/T228926 (RobH)
[19:05:06] <wikibugs>	 Operations, ops-eqiad, vm-requests: rack/setup/install ganeti10([09]|1[0-8[).eqiad.wmnet - https://phabricator.wikimedia.org/T228924 (RobH)
[19:06:38] <wikibugs>	 (CR) jenkins-bot: Enable SecureLinkFixer on group0 wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/525157 (https://phabricator.wikimedia.org/T200751) (owner: Legoktm)
[19:08:27] <hauskatze>	 legoktm: when you have a minute, may I ask you a question re. libup2?
[19:08:34] <logmsgbot>	 !log legoktm@deploy1001 Synchronized wmf-config/InitialiseSettings.php: Enable SecureLinkFixer on group0 wikis - T200751 (duration: 00m 55s)
[19:08:42] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:08:43] <stashbot>	 T200751: Review and deploy SecureLinkFixer extension - https://phabricator.wikimedia.org/T200751
[19:08:52] <legoktm>	 hauskatze: ask :) I'll answer when I do have a minute :p
[19:09:31] <hauskatze>	 heh okay so legoktm I don't understand the "proposed patch" feature - is it supposed to be the actual fix or what the bot did commit in the past?
[19:10:01] <hauskatze>	 'cause I tried to use it today but ended up using NCU && npm audit (--fix)
[19:10:35] <legoktm>	 both, kind of. 
[19:11:23] <legoktm>	 it's mostly a testing feature for me to check what would have happened since the bot is read-only right now (the pushes I did last week were off of my laptop)
[19:11:38] <legoktm>	 eventually the plan is that libup runs once a day pushing patches as necessary
[19:15:17] <wikibugs>	 (CR) Ayounsi: [C: +2] "https://puppet-compiler.wmflabs.org/compiler1001/17601/netflow1001.eqiad.wmnet/" [puppet] - https://gerrit.wikimedia.org/r/525334 (https://phabricator.wikimedia.org/T226810) (owner: Ayounsi)
[19:15:28] <wikibugs>	 (PS3) Ayounsi: Fastnetmon: disable Graphite, fix notify script path [puppet] - https://gerrit.wikimedia.org/r/525334 (https://phabricator.wikimedia.org/T226810)
[19:16:18] <wikibugs>	 Operations, Security-Team, Trust-and-Safety: Add sguebo_WMF to WMF LDAP group - https://phabricator.wikimedia.org/T228927 (jrbs)
[19:16:22] <wikibugs>	 Operations, ops-eqiad, DC-Ops, Wikimedia-Logstash, and 2 others: Decommission old eqiad logstash hardware hosts logstash100[456] - https://phabricator.wikimedia.org/T217556 (RobH) wipe is running on all 4 internal disks for T217556 and on the external usb disk for T212457.
[19:17:01] <legoktm>	 I'm done
[19:17:26] <wikibugs>	 10Operations, 10LDAP-Access-Requests, 10Security-Team, 10Trust-and-Safety: Add sguebo_WMF to WMF LDAP group - https://phabricator.wikimedia.org/T228927 (10Legoktm)
[19:17:40] <wikibugs>	 10Operations, 10ops-eqiad, 10Cassandra, 10DC-Ops, and 4 others: restbase-dev1006 has a broken disk - https://phabricator.wikimedia.org/T224260 (10Cmjohnson)
[19:18:02] <hauskatze>	 legoktm: alright, thanks. libup pushing patches once a day looks promising. Hopefully they don't get stuck in CI and we have to go with the broom afterwards ;-)
[19:18:43] <legoktm>	 hauskatze: libup won't submit new patches if an open patch in that repo has the topic bump-dev-deps. So we won't have a pileup of just failing patches everywhere
[19:21:32] <hauskatze>	 :)
[19:21:48] <hauskatze>	 k thnx, I won't bother you anymore for the rest of today
[19:23:58] <wikibugs>	 10Operations, 10ops-eqiad, 10DC-Ops: elastic1031 failed PSU 2 fan - https://phabricator.wikimedia.org/T228769 (10RobH)
[19:25:22] <wikibugs>	 (03PS1) 10Legoktm: Add SecureLinkFixer to extension-list [mediawiki-config] - 10https://gerrit.wikimedia.org/r/525346 (https://phabricator.wikimedia.org/T200751)
[19:25:24] <wikibugs>	 (03PS1) 10Legoktm: Enable SecureLinkFixer everywhere [mediawiki-config] - 10https://gerrit.wikimedia.org/r/525347 (https://phabricator.wikimedia.org/T200751)
[19:25:34] <legoktm>	 hauskatze: no worries! feel free to ask anytime :)
[19:25:45] <wikibugs>	 (03CR) 10Legoktm: "Pending wmf.15 rollout everywhere." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/525346 (https://phabricator.wikimedia.org/T200751) (owner: 10Legoktm)
[19:25:49] <hauskatze>	 legoktm: thanks much :)
[19:26:07] <wikibugs>	 10Operations, 10ops-eqiad, 10DC-Ops: elastic1031 failed PSU 2 fan - https://phabricator.wikimedia.org/T228769 (10RobH) 05Open→03Resolved a:03RobH issue seems resolved, no errors reported on host.
[19:28:10] <icinga-wm>	 RECOVERY - High CPU load on API appserver on mw1233 is OK: OK - load average: 19.60, 21.01, 23.53 https://wikitech.wikimedia.org/wiki/Application_servers
[19:28:40] <icinga-wm>	 PROBLEM - High CPU load on API appserver on mw1227 is CRITICAL: CRITICAL - load average: 41.19, 32.18, 32.90 https://wikitech.wikimedia.org/wiki/Application_servers
[19:29:55] <wikibugs>	 (03CR) 10Cwhite: [C: 03+2] hiera: deploy varnishkafka exporter to eqiad [puppet] - 10https://gerrit.wikimedia.org/r/524933 (https://phabricator.wikimedia.org/T196066) (owner: 10Cwhite)
[19:30:05] <wikibugs>	 (03PS2) 10Cwhite: hiera: deploy varnishkafka exporter to eqiad [puppet] - 10https://gerrit.wikimedia.org/r/524933 (https://phabricator.wikimedia.org/T196066)
[19:36:41] <wikibugs>	 (03CR) 10CDanis: "This mostly LGTM, just +1 to Filippo's comments" (037 comments) [puppet] - 10https://gerrit.wikimedia.org/r/520502 (https://phabricator.wikimedia.org/T226815) (owner: 10Giuseppe Lavagetto)
[19:38:18] <wikibugs>	 10Operations, 10ops-eqiad: (Need By: August 31) rack/setup/install (3) new zookeeper nodes - https://phabricator.wikimedia.org/T227025 (10RobH)
[19:39:57] <wikibugs>	 10Operations, 10ops-eqiad, 10Cassandra, 10DC-Ops, and 4 others: restbase-dev1006 has a broken disk - https://phabricator.wikimedia.org/T224260 (10Eevans) 05Resolved→03Open >>! In T224260#5362801, @Cmjohnson wrote: > @eevans the disk has been replaced. I am resolving this task, if you find the problem i...
[19:40:08] <icinga-wm>	 PROBLEM - Logstash rate of ingestion percent change compared to yesterday on icinga1001 is CRITICAL: 940.7 ge 130 https://phabricator.wikimedia.org/T202307 https://grafana.wikimedia.org/dashboard/db/logstash?orgId=1&panelId=2&fullscreen
[19:40:16] <icinga-wm>	 PROBLEM - High CPU load on API appserver on mw1227 is CRITICAL: CRITICAL - load average: 42.65, 32.87, 32.09 https://wikitech.wikimedia.org/wiki/Application_servers
[19:46:58] <wikibugs>	 (03CR) 10CDanis: "The script LGTM modulo one nit." (031 comment) [software/netbox-deploy] - 10https://gerrit.wikimedia.org/r/525165 (owner: 10CRusnov)
[19:48:15] <wikibugs>	 (03PS1) 10Eevans: Switch restbase-dev1006 to Stretch [puppet] - 10https://gerrit.wikimedia.org/r/525351 (https://phabricator.wikimedia.org/T224260)
[19:52:24] <wikibugs>	 10Operations, 10LDAP-Access-Requests, 10Security-Team, 10Trust-and-Safety: Add sguebo_WMF to WMF LDAP group - https://phabricator.wikimedia.org/T228927 (10sbassett) p:05Triage→03Normal This is approved by the #security-team (cc: @JBennett) for the specific request of access to logstash for @sguebo_WMF.
[19:52:26] <icinga-wm>	 PROBLEM - Host dbproxy1021.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[19:55:44] <icinga-wm>	 PROBLEM - High CPU load on API appserver on mw1222 is CRITICAL: CRITICAL - load average: 50.57, 34.32, 27.09 https://wikitech.wikimedia.org/wiki/Application_servers
[19:56:18] <icinga-wm>	 PROBLEM - Host dbproxy1020.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[19:56:35] <marostegui>	 robh: is that dbproxy1021 mgmt down expected?
[19:57:03] <marostegui>	 ah, I guess it is chris moving 1020 and 1021 to a different rack?
[19:58:10] <icinga-wm>	 RECOVERY - Host dbproxy1021.mgmt is UP: PING OK - Packet loss = 0%, RTA = 1.17 ms
[20:00:04] <jouncebot>	 cscott, arlolra, subbu, bearND, and halfak: #bothumor When your hammer is PHP, everything starts looking like a thumb. Rise for Services – Parsoid / Citoid / Mobileapps / ORES / …. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190724T2000).
[20:01:00] <wikibugs>	 10Operations, 10ops-eqiad: (Need By: August 31) rack/setup/install (3) new zookeeper nodes - https://phabricator.wikimedia.org/T227025 (10Cmjohnson)
[20:02:02] <icinga-wm>	 RECOVERY - Host dbproxy1020.mgmt is UP: PING OK - Packet loss = 0%, RTA = 3.40 ms
[20:02:38] <wikibugs>	 10Operations, 10ops-eqiad, 10DC-Ops: Reallocate dbproxy1020 and dbproxy1021 from row D to row C - https://phabricator.wikimedia.org/T228618 (10Cmjohnson) @Marostegui do these need to be in the same rack or separate racks?  1G space is limited to racks C5 and C8. C5 currently has a couple of dbproxy servers.
[20:05:23] <wikibugs>	 10Operations, 10ops-eqiad, 10DC-Ops: Reallocate dbproxy1020 and dbproxy1021 from row D to row C - https://phabricator.wikimedia.org/T228618 (10Marostegui) Separate if possible. If it is really not possible then same rack it is also ok
[20:05:37] <marostegui>	 @cmjohnson1: ^
[20:07:12] <icinga-wm>	 PROBLEM - puppet last run on cp1089 is CRITICAL: CRITICAL: Failed to apply catalog, zero resources tracked by Puppet. Could be an interrupted request or a dependency cycle. https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[20:09:35] <wikibugs>	 (03PS1) 10Jhedden: Revert "dumps distribution: switch dumps to labstore1006" [dns] - 10https://gerrit.wikimedia.org/r/525361
[20:12:12] <jeh>	 !log redirecting dumps.wikimedia.org back to labstore1007.wikimedia.org T224228
[20:12:15] <wikibugs>	 (03CR) 10Jhedden: [C: 03+2] Revert "dumps distribution: switch dumps to labstore1006" [dns] - 10https://gerrit.wikimedia.org/r/525361 (owner: 10Jhedden)
[20:12:17] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:12:30] <wikibugs>	 (03PS2) 10Jhedden: Revert "dumps distribution: switch dumps to labstore1006" [dns] - 10https://gerrit.wikimedia.org/r/525361
[20:12:50] <wikibugs>	 (03CR) 10Jhedden: [V: 03+2 C: 03+2] Revert "dumps distribution: switch dumps to labstore1006" [dns] - 10https://gerrit.wikimedia.org/r/525361 (owner: 10Jhedden)
[20:18:52] <icinga-wm>	 RECOVERY - High CPU load on API appserver on mw1222 is OK: OK - load average: 20.51, 21.61, 23.65 https://wikitech.wikimedia.org/wiki/Application_servers
[20:23:14] <wikibugs>	 10Operations, 10ops-eqiad: Degraded RAID on cloudvirt1024 - https://phabricator.wikimedia.org/T228853 (10wiki_willy) a:03Cmjohnson
[20:24:59] <wikibugs>	 (03CR) 10Jhedden: [C: 03+1] Configure unconditional flushes of the L1 cache during VMENTER [puppet] - 10https://gerrit.wikimedia.org/r/525295 (https://phabricator.wikimedia.org/T228870) (owner: 10Muehlenhoff)
[20:25:26] <wikibugs>	 10Operations, 10ops-eqiad, 10DBA: Upgrade db1100 firmware and BIOS - https://phabricator.wikimedia.org/T228732 (10wiki_willy) a:03Cmjohnson
[20:27:08] <icinga-wm>	 PROBLEM - High CPU load on API appserver on mw1222 is CRITICAL: CRITICAL - load average: 40.82, 36.02, 29.47 https://wikitech.wikimedia.org/wiki/Application_servers
[20:34:06] <wikibugs>	 10Operations, 10ops-codfw, 10netops: update RE-S-X6-64G-S in cr[12]-codfw - https://phabricator.wikimedia.org/T226422 (10ayounsi) Scheduled for the 31st at 15:00UTC (1h total).
[20:35:03] <logmsgbot>	 !log bsitzmann@deploy1001 Started deploy [mobileapps/deploy@2e2ce6c]: Update mobileapps to 1751a2e
[20:35:08] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:35:12] <wikibugs>	 10Operations, 10ops-codfw, 10netops: Cable mr1-codfw<->cr1/2-codfw through asw-a-codfw - https://phabricator.wikimedia.org/T228112 (10ayounsi) Scheduled for the 30th at 15:00UTC (1h total). Let me know if it needs to be rescheduled.
[20:35:24] <icinga-wm>	 RECOVERY - puppet last run on cp1089 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[20:36:16] <wikibugs>	 (03PS1) 10RobH: adding john clark to wmf ldap [puppet] - 10https://gerrit.wikimedia.org/r/525420 (https://phabricator.wikimedia.org/T228935)
[20:36:52] <logmsgbot>	 !log ppchelko@deploy1001 Started deploy [changeprop/deploy@bf28187]: Rerender PCS endpoints T222384
[20:36:58] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:36:58] <stashbot>	 T222384: Enable storage and pre-generation for PCS endpoints - https://phabricator.wikimedia.org/T222384
[20:37:22] <wikibugs>	 (03CR) 10RobH: [C: 03+2] adding john clark to wmf ldap [puppet] - 10https://gerrit.wikimedia.org/r/525420 (https://phabricator.wikimedia.org/T228935) (owner: 10RobH)
[20:38:26] <logmsgbot>	 !log ppchelko@deploy1001 Finished deploy [changeprop/deploy@bf28187]: Rerender PCS endpoints T222384 (duration: 01m 34s)
[20:38:32] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:39:23] <logmsgbot>	 !log bsitzmann@deploy1001 Finished deploy [mobileapps/deploy@2e2ce6c]: Update mobileapps to 1751a2e (duration: 04m 20s)
[20:39:29] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:44:02] <wikibugs>	 (03PS1) 10Muehlenhoff: Extend package-build-deb-src.list with Buster [puppet] - 10https://gerrit.wikimedia.org/r/525422
[20:44:31] <wikibugs>	 (03PS1) 10RobH: Revert "adding john clark to wmf ldap" [puppet] - 10https://gerrit.wikimedia.org/r/525423
[20:45:20] <logmsgbot>	 !log ppchelko@deploy1001 Started deploy [restbase/deploy@7911f65]: Store PCS endpoints T222384
[20:45:26] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+2] Extend package-build-deb-src.list with Buster [puppet] - 10https://gerrit.wikimedia.org/r/525422 (owner: 10Muehlenhoff)
[20:45:26] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:45:27] <stashbot>	 T222384: Enable storage and pre-generation for PCS endpoints - https://phabricator.wikimedia.org/T222384
[20:47:44] <wikibugs>	 (03CR) 10RobH: [C: 03+2] Revert "adding john clark to wmf ldap" [puppet] - 10https://gerrit.wikimedia.org/r/525423 (owner: 10RobH)
[20:47:52] <wikibugs>	 (03PS2) 10RobH: Revert "adding john clark to wmf ldap" [puppet] - 10https://gerrit.wikimedia.org/r/525423
[20:48:16] <cscott>	 fyi, we're about to deploy parsoid
[20:48:47] <wikibugs>	 (03PS2) 10Bstorm: toolforge: set kubeadm to use internal registry for pause container [puppet] - 10https://gerrit.wikimedia.org/r/525339 (https://phabricator.wikimedia.org/T228887)
[20:49:45] <wikibugs>	 (03PS3) 10Bstorm: toolforge: set kubeadm to use internal registry for pause container [puppet] - 10https://gerrit.wikimedia.org/r/525339 (https://phabricator.wikimedia.org/T228887)
[20:50:52] <wikibugs>	 (03CR) 10Bstorm: [C: 03+2] toolforge: set kubeadm to use internal registry for pause container [puppet] - 10https://gerrit.wikimedia.org/r/525339 (https://phabricator.wikimedia.org/T228887) (owner: 10Bstorm)
[20:52:34] <icinga-wm>	 PROBLEM - High CPU load on API appserver on mw1233 is CRITICAL: CRITICAL - load average: 57.84, 35.88, 27.88 https://wikitech.wikimedia.org/wiki/Application_servers
[20:54:39] <wikibugs>	 (03PS1) 10RobH: john clark - adding to ldap [puppet] - 10https://gerrit.wikimedia.org/r/525425 (https://phabricator.wikimedia.org/T228935)
[20:55:16] <icinga-wm>	 RECOVERY - High CPU load on API appserver on mw1222 is OK: OK - load average: 18.60, 21.57, 23.85 https://wikitech.wikimedia.org/wiki/Application_servers
[20:55:52] <wikibugs>	 (03CR) 10RobH: [C: 03+2] john clark - adding to ldap [puppet] - 10https://gerrit.wikimedia.org/r/525425 (https://phabricator.wikimedia.org/T228935) (owner: 10RobH)
[20:56:03] <cscott>	 i got a "The Wikipedia database is temporarily in read-only mode. This is probably due to routine maintenance; if so, you will be able to edit again within a few minutes." on beta just now?
[20:57:47] <cscott>	 that read-only mode complaint on beta might have coincided with that "high cpu load" icinga warning above?
[20:58:38] <wikibugs>	 (03CR) 10Dzahn: [C: 03+2] Switch restbase-dev1006 to Stretch [puppet] - 10https://gerrit.wikimedia.org/r/525351 (https://phabricator.wikimedia.org/T224260) (owner: 10Eevans)
[20:58:46] <wikibugs>	 (03PS2) 10Dzahn: Switch restbase-dev1006 to Stretch [puppet] - 10https://gerrit.wikimedia.org/r/525351 (https://phabricator.wikimedia.org/T224260) (owner: 10Eevans)
[21:00:17] <logmsgbot>	 !log cscott@deploy1001 Started deploy [parsoid/deploy@abd05ab]: Updating Parsoid to df1af404 (T227216, T226523, T226451)
[21:00:28] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:00:29] <stashbot>	 T226523: Template in wikilink target position also returns pipe separated params - https://phabricator.wikimedia.org/T226523
[21:00:31] <stashbot>	 T226451: Possible bug in PHP Tokenizer: Unexpected OOM - https://phabricator.wikimedia.org/T226451
[21:00:31] <stashbot>	 T227216: Adding or editing citations using VisualEditor causes major formatting issues involving pipes, equals signs and nowiki tags - https://phabricator.wikimedia.org/T227216
[21:01:36] <subbu>	 cscott, i doubt those two are correlated. in any case, it looks transient.
[21:02:32] <icinga-wm>	 PROBLEM - High CPU load on API appserver on mw1233 is CRITICAL: CRITICAL - load average: 46.21, 36.17, 31.11 https://wikitech.wikimedia.org/wiki/Application_servers
[21:03:34] <icinga-wm>	 PROBLEM - High CPU load on API appserver on mw1222 is CRITICAL: CRITICAL - load average: 50.74, 34.33, 27.99 https://wikitech.wikimedia.org/wiki/Application_servers
[21:03:38] <logmsgbot>	 !log ppchelko@deploy1001 Finished deploy [restbase/deploy@7911f65]: Store PCS endpoints T222384 (duration: 18m 18s)
[21:03:44] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:03:45] <stashbot>	 T222384: Enable storage and pre-generation for PCS endpoints - https://phabricator.wikimedia.org/T222384
[21:11:02] <wikibugs>	 10Operations, 10ops-eqiad, 10Cassandra, 10DC-Ops, and 4 others: restbase-dev1006 has a broken disk - https://phabricator.wikimedia.org/T224260 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by dzahn on cumin1001.eqiad.wmnet for hosts: ` restbase-dev1006.eqiad.wmnet ` The log can be found in `/...
[21:12:13] <logmsgbot>	 !log nuria@deploy1001 Started deploy [analytics/refinery@58e64c1]: deploying refinery 0.0.95
[21:12:18] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:13:52] <icinga-wm>	 RECOVERY - Device not healthy -SMART- on restbase-dev1006 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=restbase-dev1006&var-datasource=eqiad+prometheus/ops
[21:16:07] <logmsgbot>	 !log nuria@deploy1001 Finished deploy [analytics/refinery@58e64c1]: deploying refinery 0.0.95 (duration: 03m 54s)
[21:16:13] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:18:52] <logmsgbot>	 !log cscott@deploy1001 Finished deploy [parsoid/deploy@abd05ab]: Updating Parsoid to df1af404 (T227216, T226523, T226451) (duration: 18m 35s)
[21:19:00] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:19:01] <stashbot>	 T226523: Template in wikilink target position also returns pipe separated params - https://phabricator.wikimedia.org/T226523
[21:19:01] <stashbot>	 T226451: Possible bug in PHP Tokenizer: Unexpected OOM - https://phabricator.wikimedia.org/T226451
[21:19:01] <stashbot>	 T227216: Adding or editing citations using VisualEditor causes major formatting issues involving pipes, equals signs and nowiki tags - https://phabricator.wikimedia.org/T227216
[21:22:22] <mutante>	 !log <+icinga-wm> RECOVERY - Device not healthy -SMART- on restbase-dev1006 is OK: All metrics within thresholds. (T224260)
[21:22:29] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:22:30] <stashbot>	 T224260: restbase-dev1006 has a broken disk - https://phabricator.wikimedia.org/T224260
[21:22:45] <mutante>	 it's kind of strange wording that "RECOVERY .. NOT healthy" but yea :)
[21:23:01] <icinga-wm>	 RECOVERY - High CPU load on API appserver on mw1222 is OK: OK - load average: 21.45, 21.93, 23.90 https://wikitech.wikimedia.org/wiki/Application_servers
[21:23:04] <mutante>	 nice that it confirmed the disk replacement
[21:26:49] <icinga-wm>	 RECOVERY - High CPU load on API appserver on mw1233 is OK: OK - load average: 17.72, 19.84, 23.56 https://wikitech.wikimedia.org/wiki/Application_servers
[21:30:32] <wikibugs>	 10Operations, 10ops-codfw, 10DBA, 10Goal: rack/setup/install db21[21-30].codfw.wmnet - https://phabricator.wikimedia.org/T227113 (10Papaul)
[21:34:36] <wikibugs>	 10Operations, 10ops-codfw, 10DBA, 10Goal: rack/setup/install db21[21-30].codfw.wmnet - https://phabricator.wikimedia.org/T227113 (10Papaul)
[21:40:43] <icinga-wm>	 RECOVERY - Logstash rate of ingestion percent change compared to yesterday on icinga1001 is OK: (C)130 ge (W)110 ge 25.13 https://phabricator.wikimedia.org/T202307 https://grafana.wikimedia.org/dashboard/db/logstash?orgId=1&panelId=2&fullscreen
[21:41:39] <cscott>	 ok, parsoid deploy done, looks good to us
[21:43:45] <icinga-wm>	 PROBLEM - Check the Netbox report-s- puppetdb for fail status. on netmon1002 is CRITICAL: puppetdb.PuppetDB CRITICAL https://wikitech.wikimedia.org/wiki/Netbox%23Reports
[21:47:14] <mutante>	 ^ happens during reinstall of machines, in this case restbase-dev1006
[21:48:28] <icinga-wm>	 RECOVERY - High CPU load on API appserver on mw1227 is OK: OK - load average: 19.99, 21.55, 23.59 https://wikitech.wikimedia.org/wiki/Application_servers
[21:54:58] <icinga-wm>	 PROBLEM - puppet last run on elastic1019 is CRITICAL: CRITICAL: Failed to apply catalog, zero resources tracked by Puppet. Could be an interrupted request or a dependency cycle. https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[21:57:51] <wikibugs>	 (03PS1) 10Bstorm: toolforge: add internal pause container to all the other kubelets [puppet] - 10https://gerrit.wikimedia.org/r/525434 (https://phabricator.wikimedia.org/T228887)
[21:59:01] <wikibugs>	 (03CR) 10Bstorm: [C: 03+2] toolforge: add internal pause container to all the other kubelets [puppet] - 10https://gerrit.wikimedia.org/r/525434 (https://phabricator.wikimedia.org/T228887) (owner: 10Bstorm)
[21:59:58] <wikibugs>	 (03PS4) 10Thcipriani: Blubberoid: enable policy, bump version, reindex helm repo [deployment-charts] - 10https://gerrit.wikimedia.org/r/522561
[22:02:01] <wikibugs>	 (03PS1) 10Bstorm: toolforge: fix typo kubelet file content [puppet] - 10https://gerrit.wikimedia.org/r/525436 (https://phabricator.wikimedia.org/T228887)
[22:02:13] <wikibugs>	 (03CR) 10Thcipriani: [V: 03+2 C: 03+2] Blubberoid: enable policy, bump version, reindex helm repo [deployment-charts] - 10https://gerrit.wikimedia.org/r/522561 (owner: 10Thcipriani)
[22:04:18] <icinga-wm>	 PROBLEM - puppet last run on analytics1065 is CRITICAL: CRITICAL: Failed to apply catalog, zero resources tracked by Puppet. Could be an interrupted request or a dependency cycle. https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[22:04:32] <wikibugs>	 (03CR) 10Bstorm: [C: 03+2] toolforge: fix typo kubelet file content [puppet] - 10https://gerrit.wikimedia.org/r/525436 (https://phabricator.wikimedia.org/T228887) (owner: 10Bstorm)
[22:15:55] <wikibugs>	 10Operations, 10ops-eqiad, 10Cassandra, 10DC-Ops, and 4 others: restbase-dev1006 has a broken disk - https://phabricator.wikimedia.org/T224260 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['restbase-dev1006.eqiad.wmnet'] `  Of which those **FAILED**: ` ['restbase-dev1006.eqiad.wmnet'] `
[22:22:42] <icinga-wm>	 RECOVERY - puppet last run on elastic1019 is OK: OK: Puppet is currently enabled, last run 51 seconds ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[22:23:43] <wikibugs>	 (03PS1) 10Thcipriani: gerrit: use gerrit-deployers not gerrit-root [puppet] - 10https://gerrit.wikimedia.org/r/525444
[22:28:09] <logmsgbot>	 !log thcipriani@ helmfile [STAGING] Ran 'apply' command on namespace 'blubberoid' for release 'staging' .
[22:28:20] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:32:34] <icinga-wm>	 RECOVERY - puppet last run on analytics1065 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[22:36:00] <logmsgbot>	 !log thcipriani@ helmfile [CODFW] Ran 'apply' command on namespace 'blubberoid' for release 'production' .
[22:36:05] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:40:06] <wikibugs>	 (03CR) 10Jforrester: "Dupe of I798c809317544." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/525346 (https://phabricator.wikimedia.org/T200751) (owner: 10Legoktm)
[22:41:10] <logmsgbot>	 !log thcipriani@ helmfile [EQIAD] Ran 'apply' command on namespace 'blubberoid' for release 'production' .
[22:41:16] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:44:50] <wikibugs>	 (03PS1) 10Aklapper: phabricator weekly project changes email: List cookie-licked tasks [puppet] - 10https://gerrit.wikimedia.org/r/525449 (https://phabricator.wikimedia.org/T228575)
[22:46:57] <RoanKattouw>	 I'm the only customer in the upcoming SWAT, and I can do it myself, but I'll probably be ~5 mins late
[23:00:05] <jouncebot>	 MaxSem, RoanKattouw, and Niharika: #bothumor Q:How do functions break up? A:They stop calling each other. Rise for Evening SWAT (Max 6 patches) deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190724T2300).
[23:00:05] <jouncebot>	 RoanKattouw: A patch you scheduled for Evening SWAT (Max 6 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[23:01:10] <wikibugs>	 (03CR) 10Aklapper: "As usual I'm not sure about performance, but locally the query was surprisingly fast." [puppet] - 10https://gerrit.wikimedia.org/r/525449 (https://phabricator.wikimedia.org/T228575) (owner: 10Aklapper)
[23:05:54] <wikibugs>	 (03CR) 10Cwhite: "Is it in the plan to clean up the files left behind manually?" [puppet] - 10https://gerrit.wikimedia.org/r/525259 (https://phabricator.wikimedia.org/T184942) (owner: 10Filippo Giunchedi)
[23:10:00] <wikibugs>	 (03PS5) 10Catrope: Correct a typo on the label newarticle in the help panel for arwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/525190 (https://phabricator.wikimedia.org/T228820) (owner: 10Zoranzoki21)
[23:10:06] <wikibugs>	 (03CR) 10Catrope: [C: 03+2] Correct a typo on the label newarticle in the help panel for arwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/525190 (https://phabricator.wikimedia.org/T228820) (owner: 10Zoranzoki21)
[23:11:18] <wikibugs>	 (03Merged) 10jenkins-bot: Correct a typo on the label newarticle in the help panel for arwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/525190 (https://phabricator.wikimedia.org/T228820) (owner: 10Zoranzoki21)
[23:11:38] <wikibugs>	 (03CR) 10jenkins-bot: Correct a typo on the label newarticle in the help panel for arwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/525190 (https://phabricator.wikimedia.org/T228820) (owner: 10Zoranzoki21)
[23:12:05] <logmsgbot>	 !log nuria@deploy1001 Started deploy [analytics/refinery@834db0a]: deploying refinery 0.0.96 (skipping 0.0.95 due to some jenkins/archiva issues)
[23:12:11] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:13:40] <logmsgbot>	 !log catrope@deploy1001 Synchronized wmf-config/InitialiseSettings.php: Correct typo in arwiki help panel config (T228820) (duration: 00m 57s)
[23:13:47] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:13:48] <stashbot>	 T228820: Correct typo error in the Help Panel in Arabic Wikipedia - https://phabricator.wikimedia.org/T228820
[23:15:39] <wikibugs>	 (03PS2) 10Catrope: Enable GrowthExperiments homepage on arwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/523362 (https://phabricator.wikimedia.org/T228120)
[23:17:40] <wikibugs>	 (03CR) 10Catrope: [C: 03+2] Enable GrowthExperiments homepage on arwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/523362 (https://phabricator.wikimedia.org/T228120) (owner: 10Catrope)
[23:18:40] <wikibugs>	 (03Merged) 10jenkins-bot: Enable GrowthExperiments homepage on arwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/523362 (https://phabricator.wikimedia.org/T228120) (owner: 10Catrope)
[23:18:55] <wikibugs>	 (03CR) 10jenkins-bot: Enable GrowthExperiments homepage on arwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/523362 (https://phabricator.wikimedia.org/T228120) (owner: 10Catrope)
[23:22:44] <logmsgbot>	 !log catrope@deploy1001 Synchronized wmf-config/InitialiseSettings.php: Enable GrowthExperiments homepage on arwiki (T228120) (duration: 00m 55s)
[23:22:50] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:22:51] <stashbot>	 T228120: Set up and deploy homepage on Arabic Wikipedia - https://phabricator.wikimedia.org/T228120
[23:30:15] <logmsgbot>	 !log nuria@deploy1001 Finished deploy [analytics/refinery@834db0a]: deploying refinery 0.0.96 (skipping 0.0.95 due to some jenkins/archiva issues) (duration: 18m 10s)
[23:30:21] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:32:37] <logmsgbot>	 !log nuria@deploy1001 Started deploy [analytics/refinery@7d93398]: deploying refinery 0.0.96 (skipping 0.0.95 due to some jenkins/archiva issues). Try 2
[23:32:43] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:37:04] <wikibugs>	 (03PS2) 10Catrope: Enable homepage for 50% of new users on arwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/523363 (https://phabricator.wikimedia.org/T228120)
[23:37:10] <wikibugs>	 (03CR) 10Catrope: [C: 03+2] Enable homepage for 50% of new users on arwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/523363 (https://phabricator.wikimedia.org/T228120) (owner: 10Catrope)
[23:38:19] <wikibugs>	 (03Merged) 10jenkins-bot: Enable homepage for 50% of new users on arwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/523363 (https://phabricator.wikimedia.org/T228120) (owner: 10Catrope)
[23:38:34] <wikibugs>	 (03CR) 10jenkins-bot: Enable homepage for 50% of new users on arwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/523363 (https://phabricator.wikimedia.org/T228120) (owner: 10Catrope)
[23:39:38] <logmsgbot>	 !log catrope@deploy1001 Synchronized wmf-config/InitialiseSettings.php: Enable homepage for 50% of new users on arwiki (T228120) (duration: 00m 58s)
[23:39:45] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:39:45] <stashbot>	 T228120: Set up and deploy homepage on Arabic Wikipedia - https://phabricator.wikimedia.org/T228120
[23:42:38] <logmsgbot>	 !log catrope@deploy1001 Synchronized php-1.34.0-wmf.14/extensions/Flow: Fix JS error when saving Flow board descriptions (T228818) (duration: 01m 03s)
[23:42:45] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:42:45] <stashbot>	 T228818: [wmf.14-regression] issues with saving edits to Flow board description - https://phabricator.wikimedia.org/T228818
[23:43:40] <logmsgbot>	 !log catrope@deploy1001 Synchronized php-1.34.0-wmf.15/extensions/Flow: Fix JS error when saving Flow board descriptions (T228818) (duration: 01m 01s)
[23:43:46] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:46:08] <icinga-wm>	 RECOVERY - MegaRAID on cloudvirt1024 is OK: OK: optimal, 1 logical, 8 physical https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[23:46:11] <logmsgbot>	 !log nuria@deploy1001 Finished deploy [analytics/refinery@7d93398]: deploying refinery 0.0.96 (skipping 0.0.95 due to some jenkins/archiva issues). Try 2 (duration: 13m 34s)
[23:46:17] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log