[00:15:23] RECOVERY - Too many messages in kafka logging-eqiad on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Logstash%23Kafka_consumer_lag https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?from=now-3h&to=now&orgId=1&var-datasource=eqiad+prometheus/ops&var-cluster=logging-eqiad&var-topic=All&var-consumer_group=All [00:21:29] PROBLEM - puppet last run on lvs1016 is CRITICAL: CRITICAL: Failed to apply catalog, zero resources tracked by Puppet. Could be an interrupted request or a dependency cycle. https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [00:30:05] RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 231, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [00:31:01] RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 73, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [00:34:03] PROBLEM - HTTP availability for Nginx -SSL terminators- at esams on icinga1001 is CRITICAL: cluster=cache_text site=esams https://wikitech.wikimedia.org/wiki/Cache_TLS_termination https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [00:34:29] PROBLEM - HTTP availability for Varnish at esams on icinga1001 is CRITICAL: job=varnish-text site=esams https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 [00:36:43] PROBLEM - Text HTTP 5xx reqs/min on graphite1004 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/d/myRmf1Pik/varnish-aggregate-client-status-codes?var-cache_type=varnish-text&var-status_type=5 [00:36:55] 
PROBLEM - Esams HTTP 5xx reqs/min on graphite1004 is CRITICAL: CRITICAL: 10.00% of data above the critical threshold [1000.0] https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/d/myRmf1Pik/varnish-aggregate-client-status-codes?var-site=esams&var-status_type=5 [00:37:49] RECOVERY - HTTP availability for Varnish at esams on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 [00:39:03] RECOVERY - HTTP availability for Nginx -SSL terminators- at esams on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Cache_TLS_termination https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [00:41:07] RECOVERY - Logstash rate of ingestion percent change compared to yesterday on icinga1001 is OK: (C)130 ge (W)110 ge 101.2 https://phabricator.wikimedia.org/T202307 https://grafana.wikimedia.org/dashboard/db/logstash?orgId=1&panelId=2&fullscreen [00:43:35] RECOVERY - Esams HTTP 5xx reqs/min on graphite1004 is OK: OK: Less than 1.00% above the threshold [250.0] https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/d/myRmf1Pik/varnish-aggregate-client-status-codes?var-site=esams&var-status_type=5 [00:44:09] RECOVERY - puppet last run on lvs1016 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [00:45:01] RECOVERY - Text HTTP 5xx reqs/min on graphite1004 is OK: OK: Less than 1.00% above the threshold [250.0] https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/d/myRmf1Pik/varnish-aggregate-client-status-codes?var-cache_type=varnish-text&var-status_type=5 [00:50:23] 10Operations, 10netops: AS63541's 
session down reported by cr1-eqsin - https://phabricator.wikimedia.org/T228617 (10ayounsi) p:05Normal→03Lowest Email sent to Chinacache. If no replies in ~1w then we will remove the session. [01:11:00] (03PS1) 10Tim Starling: Add logging for DeferredUpdates [mediawiki-config] - 10https://gerrit.wikimedia.org/r/525185 [01:18:31] (03PS2) 10Tim Starling: Add logging for DeferredUpdates [mediawiki-config] - 10https://gerrit.wikimedia.org/r/525185 (https://phabricator.wikimedia.org/T228462) [01:23:52] (03CR) 10Tim Starling: "Please give CR+1 and I will deploy it" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/525185 (https://phabricator.wikimedia.org/T228462) (owner: 10Tim Starling) [01:57:09] 10Operations, 10ops-eqiad: Faulty A6/A7 VC link - https://phabricator.wikimedia.org/T228823 (10ayounsi) p:05Triage→03High [02:04:41] PROBLEM - puppet last run on cp1075 is CRITICAL: CRITICAL: Failed to apply catalog, zero resources tracked by Puppet. Could be an interrupted request or a dependency cycle. https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [02:12:23] 10Operations, 10netops, 10observability: Add VCP stats monitoring - https://phabricator.wikimedia.org/T228824 (10ayounsi) p:05Triage→03Normal [02:31:31] PROBLEM - puppet last run on cloudcontrol1004 is CRITICAL: CRITICAL: Failed to apply catalog, zero resources tracked by Puppet. Could be an interrupted request or a dependency cycle. 
https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [02:32:55] RECOVERY - puppet last run on cp1075 is OK: OK: Puppet is currently enabled, last run 27 seconds ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [02:37:12] 10Operations, 10ops-eqiad, 10DC-Ops: a1-eqiad pdu refresh - https://phabricator.wikimedia.org/T226782 (10ayounsi) Seems like only 1 interface is `master` on cr1 the following is needed to fail it over ` [edit interfaces ae2 unit 1202 family inet6 address 2620:0:861:202:fe00::1/64 vrrp-inet6-group 2] +... [02:54:43] (03CR) 10MaxSem: [C: 03+1] Add logging for DeferredUpdates [mediawiki-config] - 10https://gerrit.wikimedia.org/r/525185 (https://phabricator.wikimedia.org/T228462) (owner: 10Tim Starling) [02:58:53] PROBLEM - dump of s5 in codfw on db1115 is CRITICAL: dump for s5 at codfw taken more than 8 days ago: Most recent backup 2019-07-16 02:25:48 https://wikitech.wikimedia.org/wiki/MariaDB/Backups [02:59:45] RECOVERY - puppet last run on cloudcontrol1004 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [03:17:49] PROBLEM - puppet last run on etcd1004 is CRITICAL: CRITICAL: Failed to apply catalog, zero resources tracked by Puppet. Could be an interrupted request or a dependency cycle. 
https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [03:23:07] PROBLEM - Mediawiki Cirrussearch update rate - eqiad on icinga1001 is CRITICAL: CRITICAL: 30.00% of data under the critical threshold [50.0] https://wikitech.wikimedia.org/wiki/Search%23No_updates https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?panelId=44&fullscreen&orgId=1 [03:23:27] PROBLEM - Mediawiki Cirrussearch update rate - codfw on icinga1001 is CRITICAL: CRITICAL: 30.00% of data under the critical threshold [50.0] https://wikitech.wikimedia.org/wiki/Search%23No_updates https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?panelId=44&fullscreen&orgId=1 [03:34:54] (03CR) 10Tim Starling: [C: 03+2] "Thanks MaxSem" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/525185 (https://phabricator.wikimedia.org/T228462) (owner: 10Tim Starling) [03:35:51] (03Merged) 10jenkins-bot: Add logging for DeferredUpdates [mediawiki-config] - 10https://gerrit.wikimedia.org/r/525185 (https://phabricator.wikimedia.org/T228462) (owner: 10Tim Starling) [03:36:32] (03CR) 10jenkins-bot: Add logging for DeferredUpdates [mediawiki-config] - 10https://gerrit.wikimedia.org/r/525185 (https://phabricator.wikimedia.org/T228462) (owner: 10Tim Starling) [03:39:19] !log tstarling@deploy1001 Synchronized wmf-config/InitialiseSettings.php: Adding DeferredUpdates log channel (T228462) (duration: 00m 56s) [03:39:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:40:27] RECOVERY - puppet last run on etcd1004 is OK: OK: Puppet is currently enabled, last run 57 seconds ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [03:41:09] !log tstarling@deploy1001 Synchronized w/fatal-error.php: Adding post-send exception test for T228462 (duration: 00m 54s) [03:41:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:42:45] (03PS1) 10Zoranzoki21: Correct a typo on the label newarticle in the help 
panel on arwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/525190 (https://phabricator.wikimedia.org/T228820) [03:43:10] (03PS2) 10Zoranzoki21: Correct a typo on the label newarticle in the help panel on arwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/525190 (https://phabricator.wikimedia.org/T228820) [03:44:51] RECOVERY - Mediawiki Cirrussearch update rate - eqiad on icinga1001 is OK: OK: Less than 1.00% under the threshold [80.0] https://wikitech.wikimedia.org/wiki/Search%23No_updates https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?panelId=44&fullscreen&orgId=1 [03:46:30] (03PS3) 10Zoranzoki21: Correct a typo on the label newarticle in the help panel for arwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/525190 (https://phabricator.wikimedia.org/T228820) [03:46:51] RECOVERY - Mediawiki Cirrussearch update rate - codfw on icinga1001 is OK: OK: Less than 1.00% under the threshold [80.0] https://wikitech.wikimedia.org/wiki/Search%23No_updates https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?panelId=44&fullscreen&orgId=1 [03:56:41] PROBLEM - puppet last run on cp3030 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 4 minutes ago with 1 failures. 
Failed resources (up to 3 shown): File[/usr/local/bin/apt2xml] https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [04:07:13] PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 229, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [04:08:11] PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 71, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [04:24:55] RECOVERY - puppet last run on cp3030 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [04:40:56] !log tstarling@deploy1001 Synchronized php-1.34.0-wmf.15/includes/specials/SpecialGoToInterwiki.php: (no justification provided) (duration: 00m 56s) [04:41:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:41:19] PROBLEM - PHP opcache health on mwdebug2002 is CRITICAL: CRITICAL: opcache free space is below 50 MB https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [04:42:07] !log tstarling@deploy1001 Synchronized php-1.34.0-wmf.15/includes/MediaWiki.php: T227700 (duration: 00m 54s) [04:42:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:42:14] T227700: Fatal on some Special:MyLanguage urls: MWException "Can't determine talk page associated with interwiki link" - https://phabricator.wikimedia.org/T227700 [04:43:51] PROBLEM - PHP opcache health on mwdebug2001 is CRITICAL: CRITICAL: opcache free space is below 50 MB https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [04:45:20] !log tstarling@deploy1001 Synchronized php-1.34.0-wmf.15/includes/specials/SpecialGoToInterwiki.php: T227700 (duration: 00m 54s) [04:45:26] Logged the message 
at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:46:15] !log tstarling@deploy1001 Synchronized php-1.34.0-wmf.15/includes/MediaWiki.php: T227700 (duration: 00m 54s) [04:46:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:49:10] !log tstarling@deploy1001 Synchronized php-1.34.0-wmf.14/includes/specials/SpecialGoToInterwiki.php: T227700 (duration: 00m 54s) [04:49:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:49:23] T227700: Fatal on some Special:MyLanguage urls: MWException "Can't determine talk page associated with interwiki link" - https://phabricator.wikimedia.org/T227700 [04:50:04] !log tstarling@deploy1001 Synchronized php-1.34.0-wmf.14/includes/MediaWiki.php: T227700 (duration: 00m 53s) [04:50:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:50:17] RECOVERY - PHP opcache health on mwdebug1001 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [04:51:29] !log tstarling@deploy1001 Synchronized php-1.34.0-wmf.14/includes/specials/SpecialGoToInterwiki.php: T227700 (duration: 00m 54s) [04:51:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:52:24] !log tstarling@deploy1001 Synchronized php-1.34.0-wmf.14/includes/MediaWiki.php: T227700 (duration: 00m 54s) [04:52:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:00:48] !log Stop puppet on dbprov2001 to generate s5 mysqldump manually [05:00:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:21:51] 10Operations, 10Gerrit, 10LDAP-Access-Requests, 10Release-Engineering-Team-TODO, 10Release-Engineering-Team (Development services): Add more SREs to gerritadmin LDAP group - https://phabricator.wikimedia.org/T228733 (10Jony) I would like become a volunteer to help. 
[05:24:04] (03CR) 10Muehlenhoff: profile::kerberos::kadminserver: add script to create principals (034 comments) [puppet] - 10https://gerrit.wikimedia.org/r/525137 (https://phabricator.wikimedia.org/T226104) (owner: 10Elukey) [05:38:00] (03PS7) 10Giuseppe Lavagetto: Configure forensic logging of Apache requests; enable on beta [puppet] - 10https://gerrit.wikimedia.org/r/511751 (owner: 10Ori.livneh) [05:39:01] (03CR) 10jerkins-bot: [V: 04-1] Configure forensic logging of Apache requests; enable on beta [puppet] - 10https://gerrit.wikimedia.org/r/511751 (owner: 10Ori.livneh) [05:46:13] PROBLEM - OSPF status on cr1-codfw is CRITICAL: OSPFv2: 5/6 UP : OSPFv3: 5/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [05:47:55] RECOVERY - OSPF status on cr1-codfw is OK: OSPFv2: 6/6 UP : OSPFv3: 6/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [05:50:45] (03PS1) 10Ayounsi: Routinator set refresh to 10min (instead of 1h) [puppet] - 10https://gerrit.wikimedia.org/r/525204 (https://phabricator.wikimedia.org/T220669) [06:06:07] (03PS4) 10Fsero: zuul: stop zuul-merger gracefully [puppet] - 10https://gerrit.wikimedia.org/r/524180 (owner: 10Hashar) [06:06:20] (03PS4) 10Fsero: zuul: fix systemd Service/TimeoutStopSec [puppet] - 10https://gerrit.wikimedia.org/r/524174 (https://phabricator.wikimedia.org/T228381) (owner: 10Hashar) [06:07:18] (03CR) 10Fsero: [C: 03+2] zuul: stop zuul-merger gracefully [puppet] - 10https://gerrit.wikimedia.org/r/524180 (owner: 10Hashar) [06:09:04] (03PS5) 10Fsero: zuul: fix systemd Service/TimeoutStopSec [puppet] - 10https://gerrit.wikimedia.org/r/524174 (https://phabricator.wikimedia.org/T228381) (owner: 10Hashar) [06:10:01] (03CR) 10Fsero: [C: 03+2] zuul: fix systemd Service/TimeoutStopSec [puppet] - 10https://gerrit.wikimedia.org/r/524174 (https://phabricator.wikimedia.org/T228381) (owner: 10Hashar) [06:18:50] (03PS3) 10Fsero: registry: introducing read only mode for maintenances [puppet] - 
10https://gerrit.wikimedia.org/r/523100 (https://phabricator.wikimedia.org/T227570) [06:21:37] (03PS4) 10Fsero: registry: introducing read only mode for maintenances [puppet] - 10https://gerrit.wikimedia.org/r/523100 (https://phabricator.wikimedia.org/T227570) [06:25:02] (03PS5) 10Fsero: registry: introducing read only mode for maintenances [puppet] - 10https://gerrit.wikimedia.org/r/523100 (https://phabricator.wikimedia.org/T227570) [06:25:10] 10Operations, 10netops: Instability of the Level3 link between cr2-eqiad and cr2-esams - https://phabricator.wikimedia.org/T228827 (10elukey) p:05Triage→03High [06:26:22] (03CR) 10Muehlenhoff: profile::kerberos::kadminserver: add script to create principals (034 comments) [puppet] - 10https://gerrit.wikimedia.org/r/525137 (https://phabricator.wikimedia.org/T226104) (owner: 10Elukey) [06:27:10] (03CR) 10Smalyshev: "compiler result looks good" [puppet] - 10https://gerrit.wikimedia.org/r/524954 (https://phabricator.wikimedia.org/T228122) (owner: 10Smalyshev) [06:28:02] (03CR) 10Fsero: [C: 03+2] "PCC is happy https://puppet-compiler.wmflabs.org/compiler1002/17583/" [puppet] - 10https://gerrit.wikimedia.org/r/523100 (https://phabricator.wikimedia.org/T227570) (owner: 10Fsero) [06:28:06] (03CR) 10Elukey: profile::kerberos::kadminserver: add script to create principals (034 comments) [puppet] - 10https://gerrit.wikimedia.org/r/525137 (https://phabricator.wikimedia.org/T226104) (owner: 10Elukey) [06:28:31] ah other comments :D [06:31:32] sorry, slight race there :-) [06:32:08] I went down the rabbit hole of looking for options to hide the process name from within Python [06:32:10] (03CR) 10Elukey: profile::kerberos::kadminserver: add script to create principals (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/525137 (https://phabricator.wikimedia.org/T226104) (owner: 10Elukey) [06:32:24] but there's only support on Win32 for that and the kernel approach is much cleaner anyway [06:33:20] let's use windows for the kadmin 
host! [06:33:23] * elukey runs away [06:33:27] PROBLEM - puppet last run on mw1308 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 7 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/usr/local/bin/cgroup-mediawiki-clean] https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [06:33:29] PROBLEM - puppet last run on db2098 is CRITICAL: CRITICAL: Failed to apply catalog, zero resources tracked by Puppet. Could be an interrupted request or a dependency cycle. https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [06:33:39] PROBLEM - Docker registry HTTP interface on registry1002 is CRITICAL: connect to address 10.64.32.139 and port 81: Connection refused https://wikitech.wikimedia.org/wiki/Docker [06:33:48] (03CR) 10Muehlenhoff: profile::kerberos::kadminserver: add script to create principals (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/525137 (https://phabricator.wikimedia.org/T226104) (owner: 10Elukey) [06:34:13] acknowledged those [06:34:15] from registry [06:34:20] elukey: ReactOS, given our FLOSS policy [06:35:15] (03PS1) 10Fsero: registry: bug: disabling read only mode [puppet] - 10https://gerrit.wikimedia.org/r/525207 [06:35:33] (03CR) 10Fsero: [C: 03+2] registry: bug: disabling read only mode [puppet] - 10https://gerrit.wikimedia.org/r/525207 (owner: 10Fsero) [06:38:41] RECOVERY - Docker registry HTTP interface on registry1002 is OK: HTTP OK: Status line output matched HTTP/1.1 403 - 407 bytes in 0.001 second response time https://wikitech.wikimedia.org/wiki/Docker [06:43:17] (03CR) 10Elukey: profile::kerberos::kadminserver: add script to create principals (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/525137 (https://phabricator.wikimedia.org/T226104) (owner: 10Elukey) [06:47:57] RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 73, down: 0, dormant: 0, excluded: 0, unused: 0 
https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [06:48:27] RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 231, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [06:50:20] (03CR) 10Muehlenhoff: profile::kerberos::kadminserver: add script to create principals (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/525137 (https://phabricator.wikimedia.org/T226104) (owner: 10Elukey) [06:51:37] PROBLEM - Esams HTTP 5xx reqs/min on graphite1004 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/d/myRmf1Pik/varnish-aggregate-client-status-codes?var-site=esams&var-status_type=5 [06:51:39] PROBLEM - Text HTTP 5xx reqs/min on graphite1004 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/d/myRmf1Pik/varnish-aggregate-client-status-codes?var-cache_type=varnish-text&var-status_type=5 [06:54:45] (03PS1) 10Marostegui: mariadb: Provision dbproxy1013 [puppet] - 10https://gerrit.wikimedia.org/r/525213 (https://phabricator.wikimedia.org/T202367) [06:55:15] RECOVERY - dump of s5 in codfw on db1115 is OK: dump for s5 at codfw taken less than 8 days ago and larger than 10 GB: Last one 2019-07-24 05:00:17 from db2099.codfw.wmnet:3315 (98 GB) https://wikitech.wikimedia.org/wiki/MariaDB/Backups [06:56:01] RECOVERY - puppet last run on mw1308 is OK: OK: Puppet is currently enabled, last run 14 seconds ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [06:56:03] RECOVERY - puppet last run on db2098 is OK: OK: Puppet is currently enabled, last run 19 seconds ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [06:58:15] 
RECOVERY - Esams HTTP 5xx reqs/min on graphite1004 is OK: OK: Less than 1.00% above the threshold [250.0] https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/d/myRmf1Pik/varnish-aggregate-client-status-codes?var-site=esams&var-status_type=5 [06:58:17] RECOVERY - Text HTTP 5xx reqs/min on graphite1004 is OK: OK: Less than 1.00% above the threshold [250.0] https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/d/myRmf1Pik/varnish-aggregate-client-status-codes?var-cache_type=varnish-text&var-status_type=5 [07:05:05] Level3 link seems to have recovered --^ [07:08:09] (03CR) 10Marostegui: "Puppet looks good: https://puppet-compiler.wmflabs.org/compiler1002/17585/" [puppet] - 10https://gerrit.wikimedia.org/r/525213 (https://phabricator.wikimedia.org/T202367) (owner: 10Marostegui) [07:10:00] !log Deploy grants for dbproxy1013 in m2 - T202367 [07:10:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:10:11] T202367: Productionize dbproxy101[2-7].eqiad.wmnet and dbproxy200[1-4] - https://phabricator.wikimedia.org/T202367 [07:11:23] (03CR) 10Marostegui: [C: 03+2] mariadb: Provision dbproxy1013 [puppet] - 10https://gerrit.wikimedia.org/r/525213 (https://phabricator.wikimedia.org/T202367) (owner: 10Marostegui) [07:21:20] !log Stop MySQL on db1117:3322 to check dbproxy1013 notifications - T202367 [07:21:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:21:27] T202367: Productionize dbproxy101[2-7].eqiad.wmnet and dbproxy200[1-4] - https://phabricator.wikimedia.org/T202367 [07:24:53] PROBLEM - haproxy failover on dbproxy1007 is CRITICAL: CRITICAL check_failover servers up 1 down 1 https://wikitech.wikimedia.org/wiki/HAProxy [07:25:05] PROBLEM - haproxy failover on dbproxy1002 is CRITICAL: CRITICAL check_failover servers up 1 down 1 https://wikitech.wikimedia.org/wiki/HAProxy [07:25:13] PROBLEM - haproxy failover on 
dbproxy1013 is CRITICAL: CRITICAL check_failover servers up 1 down 1 https://wikitech.wikimedia.org/wiki/HAProxy [07:25:38] ^ me [07:25:39] expected [07:26:07] RECOVERY - haproxy failover on dbproxy1007 is OK: OK check_failover servers up 2 down 0 https://wikitech.wikimedia.org/wiki/HAProxy [07:26:14] (03PS2) 10Muehlenhoff: Enable seccomp-based hardening for apt on Buster and later [puppet] - 10https://gerrit.wikimedia.org/r/525115 [07:26:23] RECOVERY - haproxy failover on dbproxy1002 is OK: OK check_failover servers up 2 down 0 https://wikitech.wikimedia.org/wiki/HAProxy [07:26:29] RECOVERY - haproxy failover on dbproxy1013 is OK: OK check_failover servers up 2 down 0 https://wikitech.wikimedia.org/wiki/HAProxy [07:34:58] (03PS1) 10Ema: ATS: do not cache responses to cookies [puppet] - 10https://gerrit.wikimedia.org/r/525219 (https://phabricator.wikimedia.org/T227432) [07:37:45] (03CR) 10Ema: [C: 03+2] ATS: do not cache responses to cookies [puppet] - 10https://gerrit.wikimedia.org/r/525219 (https://phabricator.wikimedia.org/T227432) (owner: 10Ema) [07:37:48] (03PS2) 10Elukey: profile::kerberos::kadminserver: add script to create principals [puppet] - 10https://gerrit.wikimedia.org/r/525137 (https://phabricator.wikimedia.org/T226104) [07:43:27] (03PS1) 10Muehlenhoff: Switch back Cloud VPS instances to the read-only replicas [puppet] - 10https://gerrit.wikimedia.org/r/525220 (https://phabricator.wikimedia.org/T46722) [07:51:13] fsero: thank you for the merge of the Zuul systemd tweaks :] [07:52:39] (03PS1) 10Ema: ATS: gracefully fail request coalescing [puppet] - 10https://gerrit.wikimedia.org/r/525222 (https://phabricator.wikimedia.org/T227432) [07:56:49] (03CR) 10Ema: [C: 03+2] ATS: gracefully fail request coalescing [puppet] - 10https://gerrit.wikimedia.org/r/525222 (https://phabricator.wikimedia.org/T227432) (owner: 10Ema) [07:57:59] !log Drop abuse_filter_log.afl_log_id from wikidata in eqiad - T226851 [07:58:05] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:58:06] T226851: Drop abuse_filter_log.afl_log_id in production - https://phabricator.wikimedia.org/T226851 [08:02:50] (03PS1) 10Fsero: registry: setting eqiad registries in read_only_mode [puppet] - 10https://gerrit.wikimedia.org/r/525223 (https://phabricator.wikimedia.org/T227570) [08:04:52] (03CR) 10Fsero: [C: 03+2] "PCC is happy https://puppet-compiler.wmflabs.org/compiler1002/17587/registry1001.eqiad.wmnet/" [puppet] - 10https://gerrit.wikimedia.org/r/525223 (https://phabricator.wikimedia.org/T227570) (owner: 10Fsero) [08:05:01] (03PS2) 10Fsero: registry: setting eqiad registries in read_only_mode [puppet] - 10https://gerrit.wikimedia.org/r/525223 (https://phabricator.wikimedia.org/T227570) [08:09:20] (03PS1) 10Elukey: profile::mediawiki::mcrouter_wancache: set async behavior as default [puppet] - 10https://gerrit.wikimedia.org/r/525224 (https://phabricator.wikimedia.org/T225642) [08:15:13] 10Operations, 10Prod-Kubernetes, 10serviceops, 10Kubernetes, 10User-fsero: Set up a local redis proxy since docker-registry can only connect to one redis instance for caching - https://phabricator.wikimedia.org/T215809 (10fsero) a:05fsero→03None [08:15:41] 10Operations, 10Prod-Kubernetes, 10serviceops, 10Kubernetes, and 2 others: Package envoy 1.9.X for stretch and use it as redis proxy on docker registry - https://phabricator.wikimedia.org/T215810 (10fsero) 05Open→03Resolved package is done and uploaded long time ago. 
[08:15:44] 10Operations, 10Prod-Kubernetes, 10serviceops, 10Kubernetes, 10User-fsero: Set up a local redis proxy since docker-registry can only connect to one redis instance for caching - https://phabricator.wikimedia.org/T215809 (10fsero) [08:16:27] 10Operations, 10Prod-Kubernetes, 10serviceops, 10Kubernetes, and 2 others: improve docker registry architecture - https://phabricator.wikimedia.org/T209271 (10fsero) Keeping this task opened, but we can mark iteration 1 as completed [08:21:05] (03CR) 10Elukey: "https://puppet-compiler.wmflabs.org/compiler1001/17588/" [puppet] - 10https://gerrit.wikimedia.org/r/525224 (https://phabricator.wikimedia.org/T225642) (owner: 10Elukey) [08:21:27] 10Operations, 10netops: Instability of the Level3 link between cr2-eqiad and cr2-esams - https://phabricator.wikimedia.org/T228827 (10Peachey88) [08:34:59] !log Drop abuse_filter_log.afl_log_id in s2 codfw (lag will appear on codfw) - T226851 [08:35:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:35:06] T226851: Drop abuse_filter_log.afl_log_id in production - https://phabricator.wikimedia.org/T226851 [08:35:12] (03CR) 10Muehlenhoff: "A few nits inline, LGTM otherwise" (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/525137 (https://phabricator.wikimedia.org/T226104) (owner: 10Elukey) [08:37:22] (03PS1) 10Marostegui: db-eqiad.php: Depool db1082 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/525226 [08:38:19] (03CR) 10Marostegui: [C: 03+2] db-eqiad.php: Depool db1082 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/525226 (owner: 10Marostegui) [08:38:50] 10Operations, 10Prod-Kubernetes, 10serviceops, 10Kubernetes, and 2 others: improve docker registry architecture - https://phabricator.wikimedia.org/T209271 (10fsero) 05Open→03Resolved a:05fsero→03None [08:38:56] (03CR) 10Filippo Giunchedi: "> Patch Set 5:" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/524037 (owner: 10Ayounsi) [08:38:56] 10Operations, 10Kubernetes: 
Evaluate VMWare's Harbour as a docker registry - https://phabricator.wikimedia.org/T202504 (10fsero) [08:39:15] (03Merged) 10jenkins-bot: db-eqiad.php: Depool db1082 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/525226 (owner: 10Marostegui) [08:39:30] (03CR) 10jenkins-bot: db-eqiad.php: Depool db1082 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/525226 (owner: 10Marostegui) [08:39:58] 10Operations, 10Prod-Kubernetes, 10serviceops, 10Kubernetes, and 2 others: improve docker registry architecture - https://phabricator.wikimedia.org/T209271 (10fsero) 05Resolved→03Open [08:40:03] 10Operations, 10Kubernetes: Evaluate VMWare's Harbour as a docker registry - https://phabricator.wikimedia.org/T202504 (10fsero) [08:40:37] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Depool db1082 for upgrade (duration: 00m 57s) [08:40:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:40:51] !log Stop MySQL on db1082 for upgrade [08:40:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:43:03] (03CR) 10Elukey: profile::kerberos::kadminserver: add script to create principals (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/525137 (https://phabricator.wikimedia.org/T226104) (owner: 10Elukey) [08:45:41] (03PS3) 10Elukey: profile::kerberos::kadminserver: add script to create principals [puppet] - 10https://gerrit.wikimedia.org/r/525137 (https://phabricator.wikimedia.org/T226104) [08:48:43] (03PS1) 10Muehlenhoff: profile::mediawiki::nutcracker: Remove support for jessie [puppet] - 10https://gerrit.wikimedia.org/r/525229 [08:50:10] (03PS1) 10Marostegui: db-eqiad.php: Slowly repool db1082 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/525230 [08:51:28] (03CR) 10Marostegui: [C: 03+2] db-eqiad.php: Slowly repool db1082 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/525230 (owner: 10Marostegui) [08:51:32] (03PS1) 10Muehlenhoff: role::alerting_host: Remove support for jessie 
[puppet] - 10https://gerrit.wikimedia.org/r/525232 [08:52:20] (03Merged) 10jenkins-bot: db-eqiad.php: Slowly repool db1082 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/525230 (owner: 10Marostegui) [08:52:44] (03CR) 10jenkins-bot: db-eqiad.php: Slowly repool db1082 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/525230 (owner: 10Marostegui) [08:53:34] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/525137 (https://phabricator.wikimedia.org/T226104) (owner: 10Elukey) [08:53:37] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Slowly repool db1082 after upgrade (duration: 00m 54s) [08:53:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:54:53] (03PS1) 10Marostegui: db-eqiad.php: Repool db1082 into API [mediawiki-config] - 10https://gerrit.wikimedia.org/r/525233 [08:56:04] (03CR) 10Marostegui: [C: 03+2] db-eqiad.php: Repool db1082 into API [mediawiki-config] - 10https://gerrit.wikimedia.org/r/525233 (owner: 10Marostegui) [08:56:56] (03Merged) 10jenkins-bot: db-eqiad.php: Repool db1082 into API [mediawiki-config] - 10https://gerrit.wikimedia.org/r/525233 (owner: 10Marostegui) [08:58:02] (03CR) 10Elukey: [C: 03+2] profile::kerberos::kadminserver: add script to create principals [puppet] - 10https://gerrit.wikimedia.org/r/525137 (https://phabricator.wikimedia.org/T226104) (owner: 10Elukey) [08:58:04] (03CR) 10jenkins-bot: db-eqiad.php: Repool db1082 into API [mediawiki-config] - 10https://gerrit.wikimedia.org/r/525233 (owner: 10Marostegui) [08:58:07] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Repool db1082 into API after upgrade (duration: 00m 55s) [08:58:10] (03PS4) 10Elukey: profile::kerberos::kadminserver: add script to create principals [puppet] - 10https://gerrit.wikimedia.org/r/525137 (https://phabricator.wikimedia.org/T226104) [08:58:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:03:12] 10Operations, 
Analytics, LDAP-Access-Requests, wikimediafoundation.org: Access to WikimediaFoundation.org analytics for Deb - https://phabricator.wikimedia.org/T227496 (MoritzMuehlenhoff) Resolved→Open @herron: If you add an account to a PII-relevant LDAP group which does not have shell acc... [09:03:20] (PS3) Marostegui: mariadb: Promote db1128 as master for m3 [puppet] - https://gerrit.wikimedia.org/r/523941 (https://phabricator.wikimedia.org/T228243) [09:10:14] (PS1) Elukey: profile::kerberos::kadminserver: highlight 'test' when creating users [puppet] - https://gerrit.wikimedia.org/r/525235 (https://phabricator.wikimedia.org/T226104) [09:16:00] (CR) Elukey: [C: +2] profile::kerberos::kadminserver: highlight 'test' when creating users [puppet] - https://gerrit.wikimedia.org/r/525235 (https://phabricator.wikimedia.org/T226104) (owner: Elukey) [09:18:34] (PS1) Fsero: k8s: deploy users should be able to get and list any resource. [deployment-charts] - https://gerrit.wikimedia.org/r/525236 [09:19:03] (PS1) Volans: setup.py: re-include tests in the distribution [software/conftool] - https://gerrit.wikimedia.org/r/525237 [09:19:28] (CR) Fsero: [V: +2 C: +2] k8s: deploy users should be able to get and list any resource. [deployment-charts] - https://gerrit.wikimedia.org/r/525236 (owner: Fsero) [09:21:28] (CR) Giuseppe Lavagetto: [C: +1] setup.py: re-include tests in the distribution [software/conftool] - https://gerrit.wikimedia.org/r/525237 (owner: Volans) [09:21:56] (CR) Volans: [C: +2] setup.py: re-include tests in the distribution [software/conftool] - https://gerrit.wikimedia.org/r/525237 (owner: Volans) [09:22:09] (PS1) Fsero: Revert "k8s: deploy users should be able to get and list any resource." [deployment-charts] - https://gerrit.wikimedia.org/r/525239 [09:22:15] (CR) Fsero: [V: +2 C: +2] Revert "k8s: deploy users should be able to get and list any resource."
[deployment-charts] - https://gerrit.wikimedia.org/r/525239 (owner: Fsero) [09:24:35] (Merged) jenkins-bot: setup.py: re-include tests in the distribution [software/conftool] - https://gerrit.wikimedia.org/r/525237 (owner: Volans) [09:26:16] (PS1) Fsero: k8s: deploy user should be able to list any resource [deployment-charts] - https://gerrit.wikimedia.org/r/525240 [09:26:53] (PS1) Volans: Release 1.1.3 [software/conftool] - https://gerrit.wikimedia.org/r/525241 [09:27:20] (PS2) Fsero: k8s: deploy user should be able to list any resource [deployment-charts] - https://gerrit.wikimedia.org/r/525240 [09:28:01] (PS1) Elukey: profile::kerberos::kadminserver: properly indent emails to send [puppet] - https://gerrit.wikimedia.org/r/525242 (https://phabricator.wikimedia.org/T226104) [09:28:44] (PS2) Filippo Giunchedi: varnish: remove varnishreqstats-based alerts [puppet] - https://gerrit.wikimedia.org/r/523891 (https://phabricator.wikimedia.org/T184942) [09:28:56] (CR) Filippo Giunchedi: [C: +2] "PCC https://puppet-compiler.wmflabs.org/compiler1002/17591/" [puppet] - https://gerrit.wikimedia.org/r/523891 (https://phabricator.wikimedia.org/T184942) (owner: Filippo Giunchedi) [09:29:46] (CR) Volans: [C: +2] Release 1.1.3 [software/conftool] - https://gerrit.wikimedia.org/r/525241 (owner: Volans) [09:29:51] (CR) Elukey: [C: +2] profile::kerberos::kadminserver: properly indent emails to send [puppet] - https://gerrit.wikimedia.org/r/525242 (https://phabricator.wikimedia.org/T226104) (owner: Elukey) [09:30:00] (PS2) Elukey: profile::kerberos::kadminserver: properly indent emails to send [puppet] - https://gerrit.wikimedia.org/r/525242 (https://phabricator.wikimedia.org/T226104) [09:32:21] (Merged) jenkins-bot: Release 1.1.3 [software/conftool] - https://gerrit.wikimedia.org/r/525241 (owner: Volans) [09:33:58] (CR) Giuseppe Lavagetto: Add the LBRemoteCluster class.
(11 comments) [software/spicerack] - https://gerrit.wikimedia.org/r/503947 (owner: Giuseppe Lavagetto) [09:34:09] (PS1) Marostegui: db-eqiad.php: More traffic to db1082 [mediawiki-config] - https://gerrit.wikimedia.org/r/525243 [09:37:18] (CR) Marostegui: [C: +2] db-eqiad.php: More traffic to db1082 [mediawiki-config] - https://gerrit.wikimedia.org/r/525243 (owner: Marostegui) [09:37:44] (CR) Filippo Giunchedi: [C: +1] "PCC https://puppet-compiler.wmflabs.org/compiler1001/17593/" [puppet] - https://gerrit.wikimedia.org/r/523892 (https://phabricator.wikimedia.org/T184942) (owner: Filippo Giunchedi) [09:38:18] (Merged) jenkins-bot: db-eqiad.php: More traffic to db1082 [mediawiki-config] - https://gerrit.wikimedia.org/r/525243 (owner: Marostegui) [09:38:34] (CR) jenkins-bot: db-eqiad.php: More traffic to db1082 [mediawiki-config] - https://gerrit.wikimedia.org/r/525243 (owner: Marostegui) [09:39:28] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: More traffic to db1082 (duration: 00m 55s) [09:39:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:45:05] (PS2) Filippo Giunchedi: varnish: ensure varnishreqstats is absent [puppet] - https://gerrit.wikimedia.org/r/523892 (https://phabricator.wikimedia.org/T184942) [09:46:47] Operations, observability, User-fgiunchedi: Include apache_exporter in puppet module apache - https://phabricator.wikimedia.org/T187434 (fgiunchedi) Parent task is a goal instead [09:46:57] PROBLEM - HHVM rendering on mw1229 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [09:47:00] Operations, observability, Goal, Technical-Debt, and 2 others: Reduce technical debt in metrics monitoring - https://phabricator.wikimedia.org/T177195 (fgiunchedi) [09:47:47] RECOVERY - HHVM rendering on mw1229 is OK: HTTP OK: HTTP/1.1 200 OK - 76100 bytes in 0.169 second response time
https://wikitech.wikimedia.org/wiki/Application_servers [09:49:18] (CR) Ema: [C: +1] "LGTM!" [puppet] - https://gerrit.wikimedia.org/r/523892 (https://phabricator.wikimedia.org/T184942) (owner: Filippo Giunchedi) [09:49:21] Operations, Cloud-VPS, Toolforge, LDAP, and 2 others: LDAP server running out of memory frequently and disrupting Cloud VPS clients - https://phabricator.wikimedia.org/T217280 (aborrero) [09:59:31] (CR) Filippo Giunchedi: [C: +2] varnish: ensure varnishreqstats is absent [puppet] - https://gerrit.wikimedia.org/r/523892 (https://phabricator.wikimedia.org/T184942) (owner: Filippo Giunchedi) [10:00:38] (PS3) Filippo Giunchedi: varnish: ensure varnishreqstats is absent [puppet] - https://gerrit.wikimedia.org/r/523892 (https://phabricator.wikimedia.org/T184942) [10:04:23] !log Drop abuse_filter_log.afl_log_id from labswiki (wikitech) and labtestwiki - T226851 [10:04:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:04:41] T226851: Drop abuse_filter_log.afl_log_id in production - https://phabricator.wikimedia.org/T226851 [10:06:54] Operations, Analytics, Analytics-EventLogging: Decommission m4 proxies (dbproxy1004 and dbproxy1008) - https://phabricator.wikimedia.org/T228768 (Marostegui) p:Triage→Normal [10:07:20] (CR) Filippo Giunchedi: [C: +1] "LGTM!"
[puppet] - https://gerrit.wikimedia.org/r/524934 (https://phabricator.wikimedia.org/T196066) (owner: Cwhite) [10:08:27] PROBLEM - Varnish traffic logger - varnishreqstats on cp2014 is CRITICAL: NRPE: Command check_varnishreqstats not defined https://wikitech.wikimedia.org/wiki/Varnish [10:08:35] PROBLEM - Varnish traffic logger - varnishreqstats on cp1077 is CRITICAL: NRPE: Command check_varnishreqstats not defined https://wikitech.wikimedia.org/wiki/Varnish [10:08:40] (CR) Filippo Giunchedi: [C: +1] "LGTM, see inline" (1 comment) [puppet] - https://gerrit.wikimedia.org/r/525115 (owner: Muehlenhoff) [10:09:21] PROBLEM - Varnish traffic logger - varnishreqstats on cp3043 is CRITICAL: NRPE: Command check_varnishreqstats not defined https://wikitech.wikimedia.org/wiki/Varnish [10:09:39] PROBLEM - Varnish traffic logger - varnishreqstats on cp4031 is CRITICAL: NRPE: Command check_varnishreqstats not defined https://wikitech.wikimedia.org/wiki/Varnish [10:10:27] PROBLEM - Varnish traffic logger - varnishreqstats on cp1076 is CRITICAL: NRPE: Command check_varnishreqstats not defined https://wikitech.wikimedia.org/wiki/Varnish [10:10:55] Operations, Analytics, Analytics-EventLogging: Decommission m4 proxies (dbproxy1004 and dbproxy1008) - https://phabricator.wikimedia.org/T228768 (elukey) +2 from Analytics [10:11:33] the varnishreqstats alerts are expected, I'll silence [10:11:41] PROBLEM - Varnish traffic logger - varnishreqstats on cp3042 is CRITICAL: NRPE: Command check_varnishreqstats not defined https://wikitech.wikimedia.org/wiki/Varnish [10:11:43] PROBLEM - Varnish traffic logger - varnishreqstats on cp2016 is CRITICAL: NRPE: Command check_varnishreqstats not defined https://wikitech.wikimedia.org/wiki/Varnish [10:11:43] PROBLEM - Varnish traffic logger - varnishreqstats on cp2018 is CRITICAL: NRPE: Command check_varnishreqstats not defined https://wikitech.wikimedia.org/wiki/Varnish [10:11:47] sorry about the spam [10:11:49] Operations,
ops-codfw, DBA: pc2010 possibly broken memory - https://phabricator.wikimedia.org/T227552 (Marostegui) Resolved→Open a:Marostegui→Papaul Looks like this happened again and mysql crashed: @Papaul could this be the memory slot? Should we swap the DIMM with another existing D... [10:12:03] PROBLEM - Varnish traffic logger - varnishreqstats on cp2008 is CRITICAL: NRPE: Command check_varnishreqstats not defined https://wikitech.wikimedia.org/wiki/Varnish [10:12:49] Operations, Analytics, Analytics-EventLogging, DBA: Decommission m4 proxies (dbproxy1004 and dbproxy1008) - https://phabricator.wikimedia.org/T228768 (Marostegui) a:Marostegui Great - thanks. I will get them decommissioned [10:15:47] Operations, Analytics, Analytics-EventLogging, DBA: Decommission dbproxy1004 and dbproxy1009 - https://phabricator.wikimedia.org/T228768 (Marostegui) [10:16:01] Operations, Analytics, Analytics-EventLogging, DBA, decommission: Decommission dbproxy1004 and dbproxy1009 - https://phabricator.wikimedia.org/T228768 (Marostegui) [10:17:59] Operations, ops-eqiad, DC-Ops: b3-eqiad pdu refresh - https://phabricator.wikimedia.org/T227539 (aborrero) [10:18:18] Operations, ops-eqiad, DC-Ops: b3-eqiad pdu refresh - https://phabricator.wikimedia.org/T227539 (aborrero) This is ready to go on our side, hopefully today :-) [10:20:11] (CR) Giuseppe Lavagetto: Add the LBRemoteCluster class. (1 comment) [software/spicerack] - https://gerrit.wikimedia.org/r/503947 (owner: Giuseppe Lavagetto) [10:20:12] Operations, ops-eqiad, DC-Ops: b3-eqiad pdu refresh - https://phabricator.wikimedia.org/T227539 (Marostegui) [10:20:23] (PS11) Giuseppe Lavagetto: Add the LBRemoteCluster class. [software/spicerack] - https://gerrit.wikimedia.org/r/503947 [10:21:04] Operations, ops-eqiad, DC-Ops: b3-eqiad pdu refresh - https://phabricator.wikimedia.org/T227539 (Marostegui) From the DBA side, it is good to.
db1073 is a master for m5 (wikitech, nova...) #cloud-services-team needs to decide if they can afford a downtime there. [10:22:26] (PS12) Giuseppe Lavagetto: Add the LBRemoteCluster class. [software/spicerack] - https://gerrit.wikimedia.org/r/503947 [10:23:27] Operations, ops-eqiad, DC-Ops: b6-eqiad pdu refresh - https://phabricator.wikimedia.org/T227541 (aborrero) [10:28:15] (CR) jerkins-bot: [V: -1] Add the LBRemoteCluster class. [software/spicerack] - https://gerrit.wikimedia.org/r/503947 (owner: Giuseppe Lavagetto) [10:29:05] I know I know jenkins... working on it :) [10:31:51] (CR) Filippo Giunchedi: "See inline" (3 comments) [puppet] - https://gerrit.wikimedia.org/r/520502 (https://phabricator.wikimedia.org/T226815) (owner: Giuseppe Lavagetto) [10:37:30] PROBLEM - puppet last run on cp1089 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 4 minutes ago with 1 failures. Failed resources (up to 3 shown): Service[varnishreqstats-frontend] https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [10:38:36] PROBLEM - puppet last run on cp1077 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 5 minutes ago with 1 failures. Failed resources (up to 3 shown): Service[varnishreqstats-frontend] https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [10:40:53] mhh looking [10:41:20] PROBLEM - puppet last run on cp2014 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 6 minutes ago with 1 failures. Failed resources (up to 3 shown): Service[varnishreqstats-frontend] https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [10:41:38] PROBLEM - puppet last run on cp3043 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 6 minutes ago with 1 failures. Failed resources (up to 3 shown): Service[varnishreqstats-frontend] https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [10:41:38] PROBLEM - puppet last run on cp2008 is CRITICAL: CRITICAL: Puppet has 1 failures.
Last run 4 minutes ago with 1 failures. Failed resources (up to 3 shown): Service[varnishreqstats-frontend] https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [10:42:09] I'll stop puppet on cp hosts, fails on second run after my merge /o\ [10:42:09] cc ema ^ [10:51:02] (PS1) Filippo Giunchedi: varnish: fix varnishreqstats systemd::service usage [puppet] - https://gerrit.wikimedia.org/r/525252 (https://phabricator.wikimedia.org/T184942) [10:53:29] godog: thanks [10:53:52] (CR) Ema: [C: +1] varnish: fix varnishreqstats systemd::service usage [puppet] - https://gerrit.wikimedia.org/r/525252 (https://phabricator.wikimedia.org/T184942) (owner: Filippo Giunchedi) [10:54:32] (CR) Filippo Giunchedi: "PCC's happy https://puppet-compiler.wmflabs.org/compiler1002/17594/cp1077.eqiad.wmnet/" [puppet] - https://gerrit.wikimedia.org/r/525252 (https://phabricator.wikimedia.org/T184942) (owner: Filippo Giunchedi) [10:54:49] (CR) Filippo Giunchedi: [C: +2] varnish: fix varnishreqstats systemd::service usage [puppet] - https://gerrit.wikimedia.org/r/525252 (https://phabricator.wikimedia.org/T184942) (owner: Filippo Giunchedi) [10:54:59] (CR) Filippo Giunchedi: [V: +2 C: +2] varnish: fix varnishreqstats systemd::service usage [puppet] - https://gerrit.wikimedia.org/r/525252 (https://phabricator.wikimedia.org/T184942) (owner: Filippo Giunchedi) [10:58:52] in 30 [10:59:36] RECOVERY - puppet last run on cp1077 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [11:00:04] Amir1, Lucas_WMDE, and Urbanecm: #bothumor My software never has bugs. It just develops random features. Rise for European Mid-day SWAT(Max 6 patches). (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190724T1100). [11:00:04] Zoranzoki21 and kart_: A patch you scheduled for European Mid-day SWAT(Max 6 patches) is about to be deployed.
Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [11:00:18] ok puppet's back on cp1077, reenabling [11:00:42] PROBLEM - puppet last run on cp2018 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 4 minutes ago with 1 failures. Failed resources (up to 3 shown): Service[varnishreqstats-frontend] https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [11:00:45] kart_: I can deploy your patch, if that's helpful? [11:01:10] awight: cool [11:01:14] awight: go ahead. [11:01:32] PROBLEM - puppet last run on cp2013 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 59 seconds ago with 1 failures. Failed resources (up to 3 shown): Service[varnishreqstats-frontend] https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [11:01:36] PROBLEM - puppet last run on cp3032 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 5 minutes ago with 1 failures. Failed resources (up to 3 shown): Service[varnishreqstats-frontend] https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [11:01:46] Zoranzoki21: ping me if you're around during this window, otherwise I'll leave your patch for another day. [11:01:48] PROBLEM - puppet last run on cp2014 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 5 minutes ago with 1 failures. Failed resources (up to 3 shown): Service[varnishreqstats-frontend] https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [11:01:49] kart_: will do [11:01:50] PROBLEM - puppet last run on cp1081 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 5 minutes ago with 1 failures. Failed resources (up to 3 shown): Service[varnishreqstats-frontend] https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [11:02:00] PROBLEM - puppet last run on cp1087 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. 
Failed resources (up to 3 shown): Service[varnishreqstats-frontend] https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [11:02:02] PROBLEM - puppet last run on cp1079 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 4 minutes ago with 1 failures. Failed resources (up to 3 shown): Service[varnishreqstats-frontend] https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [11:02:06] PROBLEM - puppet last run on cp3042 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 4 minutes ago with 1 failures. Failed resources (up to 3 shown): Service[varnishreqstats-frontend] https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [11:02:08] oh ffs, sorry about that! [11:02:54] (PS4) Awight: Remove Content Translation event logging config [mediawiki-config] - https://gerrit.wikimedia.org/r/514672 (owner: Petar.petkovic) [11:03:02] (CR) Awight: [C: +2] "SWAT" [mediawiki-config] - https://gerrit.wikimedia.org/r/514672 (owner: Petar.petkovic) [11:03:13] despite the noise we're fine btw [11:03:18] :) [11:04:05] (Merged) jenkins-bot: Remove Content Translation event logging config [mediawiki-config] - https://gerrit.wikimedia.org/r/514672 (owner: Petar.petkovic) [11:04:23] (CR) jenkins-bot: Remove Content Translation event logging config [mediawiki-config] - https://gerrit.wikimedia.org/r/514672 (owner: Petar.petkovic) [11:05:08] kart_: Looks like this has to be pushed in two steps, with CommonSettings.php going first... [11:05:46] kart_: Patch is ready to test on mwdebug1002 [11:06:41] OK. Checking. [11:07:28] RECOVERY - puppet last run on cp1079 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [11:07:30] I can confirm the ContentTranslation pages load, at least. [11:07:33] Nothing much to test except checking if it doesn't breaks stuffs. [11:07:37] awight: yeah.
[11:07:45] kk, deploying [11:10:55] !log awight@deploy1001 Synchronized wmf-config/CommonSettings.php: [[gerrit:514672|Remove Content Translation event logging config]] (part 1/2) (duration: 00m 59s) [11:11:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:12:03] !log awight@deploy1001 Synchronized wmf-config/InitialiseSettings.php: [[gerrit:514672|Remove Content Translation event logging config]] (part 2/2) (duration: 00m 54s) [11:12:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:12:35] done :) [11:12:44] Thank you awight ! [11:12:54] any time [11:15:03] awight: please LMK when swat is done, I'm holding off reenabling puppet in case spam comes up again [11:15:29] godog: Okay, I have one more patch to push then will ping you. [11:15:46] awesome, thanks [11:19:37] (PS1) Awight: Disable FileImporter source wiki edits [mediawiki-config] - https://gerrit.wikimedia.org/r/525254 (https://phabricator.wikimedia.org/T228851) [11:20:17] (CR) Awight: [C: +2] "SWAT" [mediawiki-config] - https://gerrit.wikimedia.org/r/525254 (https://phabricator.wikimedia.org/T228851) (owner: Awight) [11:21:59] (Merged) jenkins-bot: Disable FileImporter source wiki edits [mediawiki-config] - https://gerrit.wikimedia.org/r/525254 (https://phabricator.wikimedia.org/T228851) (owner: Awight) [11:22:38] (CR) jenkins-bot: Disable FileImporter source wiki edits [mediawiki-config] - https://gerrit.wikimedia.org/r/525254 (https://phabricator.wikimedia.org/T228851) (owner: Awight) [11:23:46] !log awight@deploy1001 Synchronized wmf-config/CommonSettings.php: [[gerrit:525254|Disable FileImporter source wiki edits (T228851)]] (duration: 00m 54s) [11:23:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:23:53] T228851: Source wiki editing and deletion always fails - https://phabricator.wikimedia.org/T228851 [11:26:26] godog: Take it away!
[11:26:38] * godog grabs mic [11:26:43] :-D [11:26:51] thanks! will do [11:28:28] RECOVERY - puppet last run on cp2008 is OK: OK: Puppet is currently enabled, last run 7 seconds ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [11:28:40] RECOVERY - puppet last run on cp2018 is OK: OK: Puppet is currently enabled, last run 18 seconds ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [11:29:34] RECOVERY - puppet last run on cp2013 is OK: OK: Puppet is currently enabled, last run 59 seconds ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [11:29:48] RECOVERY - puppet last run on cp2014 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [11:32:00] RECOVERY - puppet last run on cp1089 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [11:33:00] RECOVERY - puppet last run on cp3043 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [11:34:50] PROBLEM - MegaRAID on cloudvirt1024 is CRITICAL: CRITICAL: 1 failed LD(s) (Degraded) https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [11:34:58] ACKNOWLEDGEMENT - MegaRAID on cloudvirt1024 is CRITICAL: CRITICAL: 1 failed LD(s) (Degraded) nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T228853 https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [11:35:04] Operations, ops-eqiad: Degraded RAID on cloudvirt1024 - https://phabricator.wikimedia.org/T228853 (ops-monitoring-bot) [11:35:24] RECOVERY - puppet last run on cp3032 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [11:35:36] RECOVERY - puppet
last run on cp1081 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [11:35:46] RECOVERY - puppet last run on cp1087 is OK: OK: Puppet is currently enabled, last run 5 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [11:35:52] RECOVERY - puppet last run on cp3042 is OK: OK: Puppet is currently enabled, last run 5 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [11:39:31] (PS3) Muehlenhoff: Enable seccomp-based hardening for apt on Buster and later [puppet] - https://gerrit.wikimedia.org/r/525115 [11:40:13] Operations, Puppet, observability: Use git commit id as "configuration version" for puppet - https://phabricator.wikimedia.org/T228854 (fgiunchedi) [11:42:47] (CR) Muehlenhoff: "PCC: https://puppet-compiler.wmflabs.org/compiler1002/17595/" [puppet] - https://gerrit.wikimedia.org/r/525115 (owner: Muehlenhoff) [11:43:05] (PS4) Muehlenhoff: Enable seccomp-based hardening for apt on Buster and later [puppet] - https://gerrit.wikimedia.org/r/525115 [11:44:16] (CR) Muehlenhoff: [C: +2] Enable seccomp-based hardening for apt on Buster and later [puppet] - https://gerrit.wikimedia.org/r/525115 (owner: Muehlenhoff) [11:49:09] (PS1) Filippo Giunchedi: varnish: remove varnishreqstats and varnishstatsd [puppet] - https://gerrit.wikimedia.org/r/525259 (https://phabricator.wikimedia.org/T184942) [12:00:04] Deploy window Pre MediaWiki train sanity break (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190724T1200) [12:07:19] Operations, ops-eqiad, DC-Ops: dbproxy1012 and dbpro1001 alerting on PS Redundancy - https://phabricator.wikimedia.org/T228859 (Marostegui) [12:13:31] Operations, Traffic: Do not cache the beta version of the mobile site - https://phabricator.wikimedia.org/T228861 (ema) [12:13:42]
Operations, Traffic: Do not cache the beta version of the mobile site - https://phabricator.wikimedia.org/T228861 (ema) p:Triage→Normal [12:17:33] (PS1) Ema: vcl: do not cache the beta variant of the mobile site [puppet] - https://gerrit.wikimedia.org/r/525268 (https://phabricator.wikimedia.org/T228861) [12:19:49] !log Stop haproxy on dbproxy1004 and dbproxy1009 (m4 - eventlogging) - T228768 [12:19:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:19:56] T228768: Decommission dbproxy1004 and dbproxy1009 - https://phabricator.wikimedia.org/T228768 [12:20:04] elukey: fyi ^ [12:21:01] gogogogo [12:21:25] Operations, Analytics, Analytics-EventLogging, DBA, decommission: Decommission dbproxy1004 and dbproxy1009 - https://phabricator.wikimedia.org/T228768 (Marostegui) [12:21:53] Operations, Analytics, Analytics-EventLogging, DBA, decommission: Decommission dbproxy1004 and dbproxy1009 - https://phabricator.wikimedia.org/T228768 (Marostegui) I have stopped haproxy on both hosts, and will leave it like that for 24h, just to be fully sure nothing uses it.
[12:26:45] Operations, ops-eqiad, DC-Ops: dbproxy1012 and dbprov1001 alerting on PS Redundancy - https://phabricator.wikimedia.org/T228859 (Marostegui) [12:32:22] Operations, MobileFrontend, Traffic, Patch-For-Review: Do not cache the beta version of the mobile site - https://phabricator.wikimedia.org/T228861 (ema) [12:42:38] PROBLEM - High CPU load on API appserver on mw1227 is CRITICAL: CRITICAL - load average: 55.15, 35.35, 26.35 https://wikitech.wikimedia.org/wiki/Application_servers [12:48:26] Operations, Traffic: Separate Traffic layer caches for PHP7/HHVM - https://phabricator.wikimedia.org/T206339 (ema) [12:55:48] PROBLEM - High CPU load on API appserver on mw1227 is CRITICAL: CRITICAL - load average: 47.86, 36.30, 32.42 https://wikitech.wikimedia.org/wiki/Application_servers [13:00:04] liw: (Dis)respected human, time to deploy MediaWiki train - European version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190724T1300). Please do the needful.
[13:00:35] I'll start promoting 1.34.0-wmf.15 to group1 [13:00:59] (PS1) Ema: WIP: ATS: Vary-slotting for PHP7 [puppet] - https://gerrit.wikimedia.org/r/525274 (https://phabricator.wikimedia.org/T206339) [13:01:23] (PS1) Lars Wirzenius: group1 wikis to 1.34.0-wmf.15 [mediawiki-config] - https://gerrit.wikimedia.org/r/525275 [13:01:25] (CR) Lars Wirzenius: [C: +2] group1 wikis to 1.34.0-wmf.15 [mediawiki-config] - https://gerrit.wikimedia.org/r/525275 (owner: Lars Wirzenius) [13:02:27] (Merged) jenkins-bot: group1 wikis to 1.34.0-wmf.15 [mediawiki-config] - https://gerrit.wikimedia.org/r/525275 (owner: Lars Wirzenius) [13:02:52] (CR) jenkins-bot: group1 wikis to 1.34.0-wmf.15 [mediawiki-config] - https://gerrit.wikimedia.org/r/525275 (owner: Lars Wirzenius) [13:05:11] !log liw@deploy1001 rebuilt and synchronized wikiversions files: group1 wikis to 1.34.0-wmf.15 [13:05:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:06:06] !log liw@deploy1001 Synchronized php: group1 wikis to 1.34.0-wmf.15 (duration: 00m 54s) [13:06:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:07:08] (PS1) Papaul: DNS: Add mgmt and production DNS for db21[21-30] [dns] - https://gerrit.wikimedia.org/r/525277 [13:08:33] eek, there's a couple of thousand gzinflate data errors [13:08:46] PROBLEM - termbox codfw on termbox.svc.codfw.wmnet is CRITICAL: /termbox (get rendered termbox) is CRITICAL: Test get rendered termbox returned the unexpected status 500 (expecting: 200) https://wikitech.wikimedia.org/wiki/WMDE/Wikidata/SSR_Service [13:09:21] but not in .15 [13:09:36] PROBLEM - High CPU load on API appserver on mw1222 is CRITICAL: CRITICAL - load average: 48.09, 32.86, 23.87 https://wikitech.wikimedia.org/wiki/Application_servers [13:14:32] RECOVERY - High CPU load on API appserver on mw1222 is OK: OK - load average: 16.74, 23.15, 22.19
https://wikitech.wikimedia.org/wiki/Application_servers [13:14:44] (CR) Marostegui: [C: +2] DNS: Add mgmt and production DNS for db21[21-30] [dns] - https://gerrit.wikimedia.org/r/525277 (owner: Papaul) [13:18:03] Operations, ops-eqiad, DC-Ops, User-Zppix, cloud-services-team (Kanban): VMs on cloudvirt1015 crashing - bad mainboard/memory - https://phabricator.wikimedia.org/T220853 (RobH) >>! In T220853#5359421, @wiki_willy wrote: > @Cmjohnson - are those errors for DIMM A3 enough info to get Dell to RM... [13:18:40] RECOVERY - termbox codfw on termbox.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/WMDE/Wikidata/SSR_Service [13:23:44] PROBLEM - termbox codfw on termbox.svc.codfw.wmnet is CRITICAL: /termbox (get rendered termbox) is CRITICAL: Test get rendered termbox returned the unexpected status 500 (expecting: 200) https://wikitech.wikimedia.org/wiki/WMDE/Wikidata/SSR_Service [13:25:39] !log rebooting cloudvirt1015 into memtest for dell support repair via T220853 [13:25:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:25:47] T220853: VMs on cloudvirt1015 crashing - bad mainboard/memory - https://phabricator.wikimedia.org/T220853 [13:26:14] (PS1) Fsero: k8s: adding PodSecurityPolicies [deployment-charts] - https://gerrit.wikimedia.org/r/525281 [13:27:06] RECOVERY - High CPU load on API appserver on mw1227 is OK: OK - load average: 13.57, 15.76, 23.61 https://wikitech.wikimedia.org/wiki/Application_servers [13:28:38] RECOVERY - termbox codfw on termbox.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/WMDE/Wikidata/SSR_Service [13:30:29] (PS1) Elukey: role::analytics_test_cluster::hadoop::master: add missing option [puppet] - https://gerrit.wikimedia.org/r/525282 [13:30:35] group1: so far, so good [13:31:33] !log Drop abuse_filter_log.afl_log_id in s2 eqiad - T226851 [13:31:45] Logged the message at
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:31:48] T226851: Drop abuse_filter_log.afl_log_id in production - https://phabricator.wikimedia.org/T226851 [13:33:38] PROBLEM - termbox codfw on termbox.svc.codfw.wmnet is CRITICAL: /termbox (get rendered termbox) is CRITICAL: Test get rendered termbox returned the unexpected status 500 (expecting: 200) https://wikitech.wikimedia.org/wiki/WMDE/Wikidata/SSR_Service [13:35:14] RECOVERY - termbox codfw on termbox.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/WMDE/Wikidata/SSR_Service [13:36:07] (CR) Elukey: [C: +2] role::analytics_test_cluster::hadoop::master: add missing option [puppet] - https://gerrit.wikimedia.org/r/525282 (owner: Elukey) [13:40:18] PROBLEM - termbox codfw on termbox.svc.codfw.wmnet is CRITICAL: /termbox (get rendered termbox) is CRITICAL: Test get rendered termbox returned the unexpected status 500 (expecting: 200) https://wikitech.wikimedia.org/wiki/WMDE/Wikidata/SSR_Service [13:41:54] RECOVERY - termbox codfw on termbox.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/WMDE/Wikidata/SSR_Service [13:47:56] Operations, ops-eqiad, DC-Ops, User-Zppix, cloud-services-team (Kanban): VMs on cloudvirt1015 crashing - bad mainboard/memory - https://phabricator.wikimedia.org/T220853 (RobH) Ok, this failed with another memory error in the SEL for dimm A3 (the one in question this entire time). I've enter... [13:48:32] PROBLEM - High CPU load on API appserver on mw1227 is CRITICAL: CRITICAL - load average: 51.78, 31.01, 23.70 https://wikitech.wikimedia.org/wiki/Application_servers [13:49:08] !log rebooting cloudvirt1015 into OS, memory error confirmed.
new memory replacement dispatch entered via T220853 [13:49:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:49:24] T220853: VMs on cloudvirt1015 crashing - bad mainboard/memory - https://phabricator.wikimedia.org/T220853 [13:51:56] PROBLEM - termbox codfw on termbox.svc.codfw.wmnet is CRITICAL: /termbox (get rendered termbox) is CRITICAL: Test get rendered termbox returned the unexpected status 500 (expecting: 200) https://wikitech.wikimedia.org/wiki/WMDE/Wikidata/SSR_Service [13:53:12] Operations, Analytics, Analytics-Kanban, Discovery, Research-Backlog: Make oozie swift upload emit event to Kafka about swift object upload complete - https://phabricator.wikimedia.org/T227896 (Ottomata) But hm, I get your point. It might be nice if the upload script automated some versionin... [13:53:34] RECOVERY - termbox codfw on termbox.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/WMDE/Wikidata/SSR_Service [13:53:40] Operations, ops-eqiad, netops: (Need By: Sept 30) update RE-S-X6-64G-S in cr[12]-eqiad - https://phabricator.wikimedia.org/T226424 (RobH) @ayounsi: This now has a need by date of September 30th (I assume you and @wiki_willy came up with that as he added it?) This is basically blocked on #netops tel... [14:04:53] (PS1) Tarrow: Increase termbox version in production to match staging [deployment-charts] - https://gerrit.wikimedia.org/r/525288 [14:05:12] (CR) Tarrow: "This change is ready for review." [deployment-charts] - https://gerrit.wikimedia.org/r/525288 (owner: Tarrow) [14:11:44] (PS2) Ema: ATS: Vary-slotting for PHP7 [puppet] - https://gerrit.wikimedia.org/r/525274 (https://phabricator.wikimedia.org/T206339) [14:16:58] RECOVERY - Check the Netbox report-s- librenms for fail status.
on netmon1002 is OK: librenms.LibreNMS OK https://wikitech.wikimedia.org/wiki/Netbox%23Reports [14:23:20] (CR) Jakob: [C: +2] Increase termbox version in production to match staging [deployment-charts] - https://gerrit.wikimedia.org/r/525288 (owner: Tarrow) [14:26:54] Operations, ops-eqiad, netops: rack spare switches in c1-eqiad - https://phabricator.wikimedia.org/T185337 (faidon) These could be racked in any rack, including in row A. It would be useful to have a working lab out of our spares - this came up yesterday/today when we were wondering if we had QSFPs t... [14:28:56] (CR) Jakob: [V: +2 C: +2] Increase termbox version in production to match staging [deployment-charts] - https://gerrit.wikimedia.org/r/525288 (owner: Tarrow) [14:31:01] (PS1) Volans: pep257: fix newly reported issues [software/spicerack] - https://gerrit.wikimedia.org/r/525293 [14:31:59] !log tarrow@ helmfile [CODFW] Ran 'apply' command on namespace 'termbox' for release 'production' .
[14:32:01] (PS1) Herron: admin: add dz1 to ldap_only_users [puppet] - https://gerrit.wikimedia.org/r/525294 (https://phabricator.wikimedia.org/T227496) [14:32:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:32:22] (PS1) Muehlenhoff: Configure unconditional flushes of the L1 cache during VMENTER for cloudvirt servers [puppet] - https://gerrit.wikimedia.org/r/525295 (https://phabricator.wikimedia.org/T228870) [14:33:15] (CR) jerkins-bot: [V: -1] Configure unconditional flushes of the L1 cache during VMENTER for cloudvirt servers [puppet] - https://gerrit.wikimedia.org/r/525295 (https://phabricator.wikimedia.org/T228870) (owner: Muehlenhoff) [14:35:54] (PS2) Muehlenhoff: Configure unconditional flushes of the L1 cache during VMENTER [puppet] - https://gerrit.wikimedia.org/r/525295 (https://phabricator.wikimedia.org/T228870) [14:36:42] Operations, netops: Instability of the Level3 link between cr2-eqiad and cr2-esams - https://phabricator.wikimedia.org/T228827 (ayounsi) Level3/CenturyLink opened a ticket for that circuit and completed an emergency maintenance. I also see some planned maintenance in the last few days. And have at least... [14:36:44] (CR) jerkins-bot: [V: -1] pep257: fix newly reported issues [software/spicerack] - https://gerrit.wikimedia.org/r/525293 (owner: Volans) [14:37:02] PROBLEM - High CPU load on API appserver on mw1222 is CRITICAL: CRITICAL - load average: 54.12, 30.75, 24.29 https://wikitech.wikimedia.org/wiki/Application_servers [14:40:16] (CR) Volans: [V: +2 C: +2] "Overriding CI as mypy is a new failure, I'll fix it in a separate patch.
And then I'll look into probably freezing some deps, a bit too no" [software/spicerack] - https://gerrit.wikimedia.org/r/525293 (owner: Volans) [14:40:46] !log Drop abuse_filter_log.afl_log_id in s5 codfw (lag will appear on codfw) - T226851 [14:40:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:40:53] T226851: Drop abuse_filter_log.afl_log_id in production - https://phabricator.wikimedia.org/T226851 [14:41:18] PROBLEM - High CPU load on API appserver on mw1227 is CRITICAL: CRITICAL - load average: 66.62, 44.26, 33.92 https://wikitech.wikimedia.org/wiki/Application_servers [14:41:24] PROBLEM - High CPU load on API appserver on mw1233 is CRITICAL: CRITICAL - load average: 62.82, 36.35, 27.23 https://wikitech.wikimedia.org/wiki/Application_servers [14:41:48] (CR) Ayounsi: [C: +2] Anycast: move bird::neighbors_list from role/site to site in codfw [puppet] - https://gerrit.wikimedia.org/r/524076 (owner: Ayounsi) [14:41:58] (PS2) Ayounsi: Anycast: move bird::neighbors_list from role/site to site in codfw [puppet] - https://gerrit.wikimedia.org/r/524076 [14:42:18] (CR) jenkins-bot: pep257: fix newly reported issues [software/spicerack] - https://gerrit.wikimedia.org/r/525293 (owner: Volans) [14:43:24] !log Drop abuse_filter_log.afl_log_id in s5 eqiad - T226851 [14:43:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:44:57] (CR) Ayounsi: [C: +2] "https://puppet-compiler.wmflabs.org/compiler1001/17596/dns2002.wikimedia.org/" [puppet] - https://gerrit.wikimedia.org/r/524076 (owner: Ayounsi) [14:45:10] Operations, Analytics, Analytics-Kanban, netops: Allow analytics VLAN to reach eventgate-analytics.discovery.wmnet:31192 - https://phabricator.wikimedia.org/T228882 (Ottomata) [14:46:14] RECOVERY - PHP opcache health on mwdebug1002 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [14:47:44]
(PS8) Ori.livneh: Configure forensic logging of Apache requests; enable on beta [puppet] - https://gerrit.wikimedia.org/r/511751 [14:53:34] (PS1) Volans: elasticsearch_cluster: fix mypy newly reported bug [software/spicerack] - https://gerrit.wikimedia.org/r/525300 [14:54:28] !log cleared vc ports stats on asw2-a-eqiad - T228823 [14:54:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:54:36] T228823: Faulty A6/A7 VC link - https://phabricator.wikimedia.org/T228823 [14:54:38] PROBLEM - High CPU load on API appserver on mw1233 is CRITICAL: CRITICAL - load average: 67.63, 40.77, 33.08 https://wikitech.wikimedia.org/wiki/Application_servers [14:56:10] PROBLEM - High CPU load on API appserver on mw1227 is CRITICAL: CRITICAL - load average: 52.81, 37.13, 34.22 https://wikitech.wikimedia.org/wiki/Application_servers [14:57:37] (PS1) Elukey: profile::kerberos::kdcserver: add +requires_preauth to new users [puppet] - https://gerrit.wikimedia.org/r/525301 (https://phabricator.wikimedia.org/T226104) [14:58:26] !log unmounting dumps NFS clients from labstore1007.wikimedia.org T224228 [14:58:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:59:54] RECOVERY - Juniper virtual chassis ports on asw2-a-eqiad is OK: OK: UP: 22 https://wikitech.wikimedia.org/wiki/Network_monitoring%23VCP_status [15:00:06] RECOVERY - High CPU load on API appserver on mw1222 is OK: OK - load average: 19.20, 21.48, 23.59 https://wikitech.wikimedia.org/wiki/Application_servers [15:01:04] Operations, Analytics, Analytics-Kanban, Discovery, Research-Backlog: Make oozie swift upload emit event to Kafka about swift object upload complete - https://phabricator.wikimedia.org/T227896 (Nuria) @Ottomata I think for users sake it is easier to do it the other way around maybe? Provide v...
[15:02:38] !log re-enable vc link between asw2-a6 and asw2-a7 - T228823 [15:02:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:02:46] T228823: Faulty A6/A7 VC link - https://phabricator.wikimedia.org/T228823 [15:05:19] Operations, ops-eqiad: Faulty A6/A7 VC link - https://phabricator.wikimedia.org/T228823 (ayounsi) [15:07:42] Operations, ops-eqiad: Faulty A6/A7 VC link - https://phabricator.wikimedia.org/T228823 (ayounsi) Open→Resolved All done, no more errors or packet loss. [15:07:51] (PS4) Zoranzoki21: Correct a typo on the label newarticle in the help panel for arwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/525190 (https://phabricator.wikimedia.org/T228820) [15:09:08] RECOVERY - Check systemd state on netmon1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:09:20] PROBLEM - High CPU load on API appserver on mw1233 is CRITICAL: CRITICAL - load average: 70.89, 41.77, 34.91 https://wikitech.wikimedia.org/wiki/Application_servers [15:09:21] (PS2) Elukey: profile::kerberos::kdcserver: add +requires_preauth to new users [puppet] - https://gerrit.wikimedia.org/r/525301 (https://phabricator.wikimedia.org/T226104) [15:09:40] Operations, serviceops, Core Platform Team Workboards (Green): Keys from MediaWiki Redis Instances - https://phabricator.wikimedia.org/T228703 (jijiki) @holger.knust I accidentally copied the wrong dump to your directory yesterday, I uploaded a new dump today. Sorry for the confusion.
[15:11:53] !log resume ingesting [message] =~ /^SlowTimer/ logs on logstash1007 (as a canary) [15:11:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:12:38] (PS1) Ayounsi: Anycast move bird::neighbors_list from role/site for all sites [puppet] - https://gerrit.wikimedia.org/r/525303 [15:13:10] PROBLEM - High CPU load on API appserver on mw1222 is CRITICAL: CRITICAL - load average: 49.84, 34.86, 28.47 https://wikitech.wikimedia.org/wiki/Application_servers [15:14:04] PROBLEM - Check systemd state on netmon1002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:15:42] (PS3) Muehlenhoff: Configure unconditional flushes of the L1 cache during VMENTER [puppet] - https://gerrit.wikimedia.org/r/525295 (https://phabricator.wikimedia.org/T228870) [15:15:48] PROBLEM - High CPU load on API appserver on mw1227 is CRITICAL: CRITICAL - load average: 52.44, 32.98, 30.06 https://wikitech.wikimedia.org/wiki/Application_servers [15:16:33] Operations, ops-eqiad, DC-Ops: a1-eqiad pdu refresh - https://phabricator.wikimedia.org/T226782 (MoritzMuehlenhoff) [15:19:14] Operations, ops-eqiad, DC-Ops: a1-eqiad pdu refresh - https://phabricator.wikimedia.org/T226782 (RobH) >>! In T227139#5336328, @elukey wrote: > All the analytics nodes are hadoop workers, not a big deal if they loose power. the above was on another task, but referenced same role as analytics1058 [15:19:56] !sal [15:19:56] https://wikitech.wikimedia.org/wiki/Server_Admin_Log https://tools.wmflabs.org/sal/production See it and you will know all you need. [15:20:53] Operations, ops-eqiad, DC-Ops: a1-eqiad pdu refresh - https://phabricator.wikimedia.org/T226782 (elukey) +1 for analytics1058, kafka-jumbo1001 is also ok, just please ping me or ottomata when starting so we can monitor.
[15:21:28] Operations, ops-eqiad, DC-Ops: a1-eqiad pdu refresh - https://phabricator.wikimedia.org/T226782 (RobH) [15:22:10] (CR) Volans: [C: +2] elasticsearch_cluster: fix mypy newly reported bug [software/spicerack] - https://gerrit.wikimedia.org/r/525300 (owner: Volans) [15:25:14] awight: note how the split of selenium and qunit makes your patch slightly easier to follow now :-]]] [15:25:22] that was worth the 3 merge effort ;-] [15:27:51] (Merged) jenkins-bot: elasticsearch_cluster: fix mypy newly reported bug [software/spicerack] - https://gerrit.wikimedia.org/r/525300 (owner: Volans) [15:28:21] Operations, netops: Instability of the Level3 link between cr2-eqiad and cr2-esams - https://phabricator.wikimedia.org/T228827 (elukey) We usually see impact in 50x and/or nginx availability when the link goes down, so if that could be avoided I'd be +1. [15:29:25] (CR) jenkins-bot: elasticsearch_cluster: fix mypy newly reported bug [software/spicerack] - https://gerrit.wikimedia.org/r/525300 (owner: Volans) [15:30:20] (CR) Arturo Borrero Gonzalez: Configure unconditional flushes of the L1 cache during VMENTER (3 comments) [puppet] - https://gerrit.wikimedia.org/r/525295 (https://phabricator.wikimedia.org/T228870) (owner: Muehlenhoff) [15:32:23] (CR) Elukey: [C: -1] Allow the use of Ipv6 in the Hadoop Analytics cluster (1 comment) [puppet] - https://gerrit.wikimedia.org/r/523229 (https://phabricator.wikimedia.org/T225296) (owner: Elukey) [15:32:24] PROBLEM - High CPU load on API appserver on mw1233 is CRITICAL: CRITICAL - load average: 40.05, 30.55, 32.10 https://wikitech.wikimedia.org/wiki/Application_servers [15:35:16] Operations, ops-eqiad, DC-Ops: dbprov1001 alerting on PS Redundancy - https://phabricator.wikimedia.org/T228891 (Cmjohnson) [15:35:36] PROBLEM - High CPU load on API appserver on mw1227 is CRITICAL: CRITICAL - load average: 63.97, 40.29, 34.23
https://wikitech.wikimedia.org/wiki/Application_servers [15:35:45] (PS1) Effie Mouzeli: jobrunners: Convert all jobrunners to server PHP7 only [puppet] - https://gerrit.wikimedia.org/r/525306 (https://phabricator.wikimedia.org/T219148) [15:35:51] (PS9) Ori.livneh: Configure forensic logging of Apache requests; enable on beta [puppet] - https://gerrit.wikimedia.org/r/511751 [15:36:28] Operations, ops-eqiad, DC-Ops: dbproxy1012 alerting on PS Redundancy - https://phabricator.wikimedia.org/T228892 (Cmjohnson) [15:36:58] Operations, ops-eqiad, DC-Ops: dbproxy1012 and dbprov1001 alerting on PS Redundancy - https://phabricator.wikimedia.org/T228859 (Cmjohnson) Open→Invalid Created two separate tickets for each server. [15:37:00] Operations, ops-eqiad, DC-Ops: (July 22-26) install new PDUs in rows A/B (Top level tracking task) - https://phabricator.wikimedia.org/T226778 (Cmjohnson) [15:37:07] (PS1) Ema: vcl: update Vary:XFP fixup comment [puppet] - https://gerrit.wikimedia.org/r/525308 (https://phabricator.wikimedia.org/T51700) [15:37:19] (PS2) Effie Mouzeli: jobrunners: Migrate all jobrunners to serve only via PHP7 [puppet] - https://gerrit.wikimedia.org/r/525306 (https://phabricator.wikimedia.org/T219148) [15:38:03] Operations, ops-eqiad, DC-Ops: dbprov1001 alerting on PS Redundancy - https://phabricator.wikimedia.org/T228891 (Cmjohnson) The idrac interface is showing errors with the power supply. It is confirmed to be up and running. Most likely a fan went out. The server is still in warranty. [15:38:14] Operations, ops-eqiad, DC-Ops: dbproxy1012 alerting on PS Redundancy - https://phabricator.wikimedia.org/T228892 (Cmjohnson) The idrac interface is showing errors with the power supply. It is confirmed to be up and running. Most likely a fan went out. The server is still in warranty.
[15:39:13] (PS4) Elukey: Allow the use of Ipv6 in the Hadoop Analytics cluster [puppet] - https://gerrit.wikimedia.org/r/523229 (https://phabricator.wikimedia.org/T225296) [15:40:12] (CR) Effie Mouzeli: [V: +1] "LGTM https://puppet-compiler.wmflabs.org/compiler1002/17598/" [puppet] - https://gerrit.wikimedia.org/r/525306 (https://phabricator.wikimedia.org/T219148) (owner: Effie Mouzeli) [15:41:23] (CR) Ppchelko: [C: +1] jobrunners: Migrate all jobrunners to serve only via PHP7 [puppet] - https://gerrit.wikimedia.org/r/525306 (https://phabricator.wikimedia.org/T219148) (owner: Effie Mouzeli) [15:41:45] (PS5) Elukey: Allow the use of Ipv6 in the Hadoop Analytics cluster [puppet] - https://gerrit.wikimedia.org/r/523229 (https://phabricator.wikimedia.org/T225296) [15:41:47] (PS1) Elukey: profile::prometheus::jmx_exporter: allow IPv6 polling [puppet] - https://gerrit.wikimedia.org/r/525309 (https://phabricator.wikimedia.org/T225296) [15:41:54] PROBLEM - High CPU load on API appserver on mw1231 is CRITICAL: CRITICAL - load average: 52.61, 29.83, 22.75 https://wikitech.wikimedia.org/wiki/Application_servers [15:42:24] (CR) jerkins-bot: [V: -1] profile::prometheus::jmx_exporter: allow IPv6 polling [puppet] - https://gerrit.wikimedia.org/r/525309 (https://phabricator.wikimedia.org/T225296) (owner: Elukey) [15:42:49] !log Disable puppet on jobrunners for 525306 [15:42:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:43:29] (PS2) Elukey: profile::prometheus::jmx_exporter: allow IPv6 polling [puppet] - https://gerrit.wikimedia.org/r/525309 (https://phabricator.wikimedia.org/T225296) [15:43:31] (PS6) Elukey: Allow the use of Ipv6 in the Hadoop Analytics cluster [puppet] - https://gerrit.wikimedia.org/r/523229 (https://phabricator.wikimedia.org/T225296) [15:44:20] !log rebooting labstore1007.wikimedia.org for updates T224228 [15:44:26] Logged the message at
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:44:35] (CR) Effie Mouzeli: [V: +1 C: +2] jobrunners: Migrate all jobrunners to serve only via PHP7 [puppet] - https://gerrit.wikimedia.org/r/525306 (https://phabricator.wikimedia.org/T219148) (owner: Effie Mouzeli) [15:44:50] PROBLEM - Check the Netbox report-s- librenms for fail status. on netmon1002 is CRITICAL: librenms.LibreNMS CRITICAL https://wikitech.wikimedia.org/wiki/Netbox%23Reports [15:45:00] wyd [15:45:03] (PS3) Ema: ATS: Vary-slotting for PHP7 [puppet] - https://gerrit.wikimedia.org/r/525274 (https://phabricator.wikimedia.org/T206339) [15:45:05] (PS1) Ema: ATS: Vary-slotting for X-Forwarded-Proto [puppet] - https://gerrit.wikimedia.org/r/525310 (https://phabricator.wikimedia.org/T227432) [15:45:20] oic [15:47:10] RECOVERY - IPMI Sensor Status on dbprov1001 is OK: Sensor Type(s) Temperature, Power_Supply Status: OK https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures [15:47:13] Operations, Gerrit, LDAP-Access-Requests, Release-Engineering-Team-TODO, Release-Engineering-Team (Development services): Add more SREs to gerritadmin LDAP group - https://phabricator.wikimedia.org/T228733 (herron) @Joe @Dzahn adding you both to gerritadmin would satisfy "at least 1 person fr... [15:47:42] !log Rolling puppet-enable and apache reload of jobrunners in eqiad [15:47:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:48:13] Operations, serviceops, PHP 7.2 support, Performance-Team (Radar), and 2 others: PHP 7 corruption during deployment (was: PHP 7 fatals on mw1262) - https://phabricator.wikimedia.org/T224491 (cscott) Worth noting that we have a known GC error in PHP 7.2, which is also 100% reproducible: {T228346}.... [15:48:31] jijiki: Yay.
[15:48:44] haha [15:49:22] Operations, ops-eqiad, DC-Ops: a1-eqiad pdu refresh - https://phabricator.wikimedia.org/T226782 (RobH) [15:49:52] PROBLEM - puppet last run on elastic1049 is CRITICAL: CRITICAL: Failed to apply catalog, zero resources tracked by Puppet. Could be an interrupted request or a dependency cycle. https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [15:50:14] RECOVERY - High CPU load on API appserver on mw1231 is OK: OK - load average: 16.84, 22.86, 23.58 https://wikitech.wikimedia.org/wiki/Application_servers [15:52:39] SMalyshev: o/ - if you are online, would you mind to join #wikimedia-dcops ? [15:53:31] wow there are so many chat rooms [15:54:51] don't forget the SRE Slack channel [15:55:18] bblack: well played :D [15:56:05] that doesn't actually exist does it ? [15:56:05] !log depooling recdns on dns1001 via confctl [15:56:13] !log depooling recdns on dns1001 via confctl - T226782 [15:56:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:56:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:56:29] T226782: a1-eqiad pdu refresh - https://phabricator.wikimedia.org/T226782 [15:56:33] !log bblack@cumin1001 conftool action : set/pooled=no; selector: name=dns1001.wikimedia.org [15:56:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:56:39] cmjohnson1: the alert for dbprov1001 recovered, did you do some magic?
[15:58:03] !log failover master VIP of ae2.1202 inet6 away from cr1-eqiad - T226782 [15:58:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:59:06] !log lvs1014 - puppet disable, remove dns1001 from resolv.conf, restart pybal - T226782 [15:59:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:59:38] Operations, serviceops, PHP 7.2 support, Performance-Team (Radar), and 2 others: PHP 7 corruption during deployment (was: PHP 7 fatals on mw1262) - https://phabricator.wikimedia.org/T224491 (Krinkle) >>! In T224491#5356481, @Joe wrote: >>>! In T224491#5354568, @Krinkle wrote: >> […] >> Only seen... [15:59:39] (PS1) Jhedden: Revert "dumps dist: switch active VPS to labstore1006" [puppet] - https://gerrit.wikimedia.org/r/525313 [15:59:58] !log dns1001 - puppet disable, stop recursor service to kill anycast advert - T226782 [16:00:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:00:04] MaxSem, RoanKattouw, Niharika, and Urbanecm: I, the Bot under the Fountain, allow thee, The Deployer, to do Morning SWAT (Max 6 patches) deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190724T1600). [16:00:04] No GERRIT patches in the queue for this window AFAICS. [16:00:49] (PS2) Jhedden: Revert "dumps dist: switch active VPS to labstore1006" [puppet] - https://gerrit.wikimedia.org/r/525313 [16:01:34] (CR) Jhedden: [C: +2] Revert "dumps dist: switch active VPS to labstore1006" [puppet] - https://gerrit.wikimedia.org/r/525313 (owner: Jhedden) [16:02:07] marostegui: I cleared the log but I did pull the report for Dell.
I will check on it again later before resolving ticket [16:02:22] PROBLEM - Blazegraph Port for wdqs-blazegraph on wdqs1006 is CRITICAL: connect to address 127.0.0.1 and port 9999: Connection refused https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [16:02:44] PROBLEM - WDQS HTTP Port on wdqs1006 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 380 bytes in 0.001 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service [16:03:08] PROBLEM - Blazegraph process -wdqs-blazegraph- on wdqs1006 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 499 (blazegraph), regex args ^java .* --port 9999 .* blazegraph-service-.*war https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [16:03:44] PROBLEM - Check systemd state on wdqs1006 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:03:48] PROBLEM - Recursive DNS on 208.80.154.10 is CRITICAL: CRITICAL - Plugin timed out while executing system call https://wikitech.wikimedia.org/wiki/DNS [16:04:04] cmjohnson1: good idea yeah, it might fire again later. thanks [16:04:05] ACKNOWLEDGEMENT - Blazegraph Port for wdqs-blazegraph on wdqs1006 is CRITICAL: connect to address 127.0.0.1 and port 9999: Connection refused Stas Malychev PDU replacement work https://phabricator.wikimedia.org/T226782 https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [16:04:05] ACKNOWLEDGEMENT - Blazegraph process -wdqs-blazegraph- on wdqs1006 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 499 (blazegraph), regex args ^java .* --port 9999 .* blazegraph-service-.*war Stas Malychev PDU replacement work https://phabricator.wikimedia.org/T226782 https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [16:04:05] ACKNOWLEDGEMENT - Check systemd state on wdqs1006 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
Stas Malychev PDU replacement work https://phabricator.wikimedia.org/T226782 https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:04:06] ACKNOWLEDGEMENT - WDQS HTTP Port on wdqs1006 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 380 bytes in 0.001 second response time Stas Malychev PDU replacement work https://phabricator.wikimedia.org/T226782 https://wikitech.wikimedia.org/wiki/Wikidata_query_service [16:04:16] PROBLEM - Recursive DNS on 2620:0:861:1:208:80:154:10 is CRITICAL: CRITICAL - Plugin timed out while executing system call https://wikitech.wikimedia.org/wiki/DNS [16:06:05] ACKNOWLEDGEMENT - Recursive DNS on 208.80.154.10 is CRITICAL: CRITICAL - Plugin timed out while executing system call Brandon Black Stopped for A1 PDU work - T226782 https://wikitech.wikimedia.org/wiki/DNS [16:06:05] ACKNOWLEDGEMENT - Recursive DNS on 2620:0:861:1:208:80:154:10 is CRITICAL: CRITICAL - Plugin timed out while executing system call Brandon Black Stopped for A1 PDU work - T226782 https://wikitech.wikimedia.org/wiki/DNS [16:06:43] !log jforrester@deploy1001 Synchronized php-1.34.0-wmf.14/includes/export/XmlDumpWriter.php: T228720 make XmlDumpwriter more resilient to blob store corruption (duration: 00m 55s) [16:06:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:06:50] T228720: stub for enwiki broken, attempt to load content for bad rev during sha1 retrieval - https://phabricator.wikimedia.org/T228720 [16:06:58] \o/ [16:07:46] !log jforrester@deploy1001 Synchronized php-1.34.0-wmf.15/includes/export/XmlDumpWriter.php: T228720 make XmlDumpwriter more resilient to blob store corruption (duration: 00m 55s) [16:07:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:08:00] * apergos heaves a sigh of relief [16:08:04] hopefully that's it this time [16:09:19] Operations, Analytics, Analytics-Kanban, netops: Allow analytics VLAN to reach eventgate-analytics.discovery.wmnet:31192 -
https://phabricator.wikimedia.org/T228882 (Ottomata) a:Ottomata→None [16:09:22] "Trying to get property 'gb_expiry' of non-object" - should I worry about that? [16:10:22] maybe [16:10:36] RECOVERY - Blazegraph Port for wdqs-blazegraph on wdqs1006 is OK: TCP OK - 0.000 second response time on 127.0.0.1 port 9999 https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [16:10:44] RECOVERY - Recursive DNS on 2620:0:861:1:208:80:154:10 is OK: DNS OK: 0.018 seconds response time. www.wikipedia.org returns 208.80.154.224 https://wikitech.wikimedia.org/wiki/DNS [16:10:52] !log dns1001 - restart recursor and re-enable puppet - T226782 [16:10:58] RECOVERY - High CPU load on API appserver on mw1222 is OK: OK - load average: 17.96, 21.36, 23.73 https://wikitech.wikimedia.org/wiki/Application_servers [16:10:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:11:00] RECOVERY - WDQS HTTP Port on wdqs1006 is OK: HTTP OK: HTTP/1.1 200 OK - 448 bytes in 0.029 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service [16:11:00] T226782: a1-eqiad pdu refresh - https://phabricator.wikimedia.org/T226782 [16:11:13] ten times in logstash, for .15, on metawiki, though not in the past few minutes [16:11:22] RECOVERY - Blazegraph process -wdqs-blazegraph- on wdqs1006 is OK: PROCS OK: 1 process with UID = 499 (blazegraph), regex args ^java .* --port 9999 .* blazegraph-service-.*war https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [16:11:47] do we have the request urls? [16:11:48] !log lvs1014 - restore puppet and resolv.conf contents, restart pybal [16:11:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:11:58] RECOVERY - Recursive DNS on 208.80.154.10 is OK: DNS OK: 0.010 seconds response time.
www.wikipedia.org returns 208.80.154.224 https://wikitech.wikimedia.org/wiki/DNS [16:12:00] RECOVERY - Check systemd state on wdqs1006 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:12:30] apergos, meta.wikimedia.org and /w/api.php, but that's all (reqId XTiAGQpAIDwAAIe1-vMAAAAJ) [16:12:35] !log re-pooling recdns on dns1001 via confctl - T226782 [16:12:38] !log bblack@cumin1001 conftool action : set/pooled=yes; selector: name=dns1001.wikimedia.org [16:12:39] api. uh. meh [16:12:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:12:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:14:14] I guess it's not preventing the user from getting things done [16:14:35] so it would be nice to know the cause but unless global blocks are suddenly broken it's not a huge deal [16:14:47] disclaimer: not a mw dev [16:15:36] PHP Notice: Trying to get property 'gb_anon_only' of non-object that too, same request [16:15:38] hm [16:16:17] Ah, GlobalBlocks. [16:16:31] $someone will have to look at the code [16:16:32] Very likely a regression there; that area of code has been changing recently. [16:16:46] liw: Have you filed a Phab task? [16:17:30] Not necessarily a train blocker, but we should file it and throw it over to the Anti-Harassment team [16:17:34] James_F, in the process of doing that [16:17:37] I have noticed 2/3 api appservers with high CPU load (sustained), still need to check but afaik they started after the deployment [16:17:46] RECOVERY - puppet last run on elastic1049 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [16:18:06] might be completely unrelated, just wanted to raise the concern [16:18:21] * James_F nods. 
[16:18:32] https://phabricator.wikimedia.org/T228899 [16:18:40] Operations, Puppet, observability: Use git commit id as "configuration version" for puppet - https://phabricator.wikimedia.org/T228854 (fgiunchedi) [16:21:01] added the other error [16:34:25] (PS1) Papaul: DHCP: Add MAC address entries for db21[21-30] [puppet] - https://gerrit.wikimedia.org/r/525318 (https://phabricator.wikimedia.org/T227113) [16:39:55] (PS4) Muehlenhoff: Configure unconditional flushes of the L1 cache during VMENTER [puppet] - https://gerrit.wikimedia.org/r/525295 (https://phabricator.wikimedia.org/T228870) [16:40:00] Operations, ops-codfw, DBA, Goal, Patch-For-Review: rack/setup/install db21[21-30].codfw.wmnet - https://phabricator.wikimedia.org/T227113 (Papaul) [16:40:04] (CR) Muehlenhoff: Configure unconditional flushes of the L1 cache during VMENTER (2 comments) [puppet] - https://gerrit.wikimedia.org/r/525295 (https://phabricator.wikimedia.org/T228870) (owner: Muehlenhoff) [16:40:50] Operations, ops-codfw, DBA, Goal, Patch-For-Review: rack/setup/install db21[21-30].codfw.wmnet - https://phabricator.wikimedia.org/T227113 (Papaul) [16:41:41] (PS1) Arturo Borrero Gonzalez: openstack: fullstack: use a readable-friendly name for VMs [puppet] - https://gerrit.wikimedia.org/r/525320 [16:41:43] (CR) Muehlenhoff: [C: +1] admin: add dz1 to ldap_only_users [puppet] - https://gerrit.wikimedia.org/r/525294 (https://phabricator.wikimedia.org/T227496) (owner: Herron) [16:41:52] PROBLEM - High CPU load on API appserver on mw1233 is CRITICAL: CRITICAL - load average: 51.82, 36.13, 30.02 https://wikitech.wikimedia.org/wiki/Application_servers [16:42:13] (CR) jerkins-bot: [V: -1] openstack: fullstack: use a readable-friendly name for VMs [puppet] - https://gerrit.wikimedia.org/r/525320 (owner: Arturo Borrero Gonzalez) [16:42:20] PROBLEM - High CPU load on API appserver on mw1227 is CRITICAL: CRITICAL - load average:
57.73, 37.77, 32.77 https://wikitech.wikimedia.org/wiki/Application_servers [16:43:57] (03PS2) 10Arturo Borrero Gonzalez: openstack: fullstack: use a readable-friendly name for VMs [puppet] - 10https://gerrit.wikimedia.org/r/525320 [16:44:15] !log Rolling puppet-enable and apache reload of jobrunners in codfw [16:44:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:47:48] PROBLEM - High CPU load on API appserver on mw1222 is CRITICAL: CRITICAL - load average: 79.01, 44.86, 30.76 https://wikitech.wikimedia.org/wiki/Application_servers [16:47:54] 10Operations, 10ops-codfw, 10DBA, 10Goal, 10Patch-For-Review: rack/setup/install db21[21-30].codfw.wmnet - https://phabricator.wikimedia.org/T227113 (10Papaul) [16:56:55] 10Operations, 10ops-codfw, 10DBA, 10Goal, 10Patch-For-Review: rack/setup/install db21[21-30].codfw.wmnet - https://phabricator.wikimedia.org/T227113 (10Papaul) [16:58:06] (03PS2) 10Herron: admin: add dz1 to ldap_only_users [puppet] - 10https://gerrit.wikimedia.org/r/525294 (https://phabricator.wikimedia.org/T227496) [16:59:12] (03CR) 10Herron: [C: 03+2] admin: add dz1 to ldap_only_users [puppet] - 10https://gerrit.wikimedia.org/r/525294 (https://phabricator.wikimedia.org/T227496) (owner: 10Herron) [17:02:34] PROBLEM - High CPU load on API appserver on mw1222 is CRITICAL: CRITICAL - load average: 44.73, 35.05, 32.48 https://wikitech.wikimedia.org/wiki/Application_servers [17:03:38] PROBLEM - High CPU load on API appserver on mw1227 is CRITICAL: CRITICAL - load average: 40.26, 35.40, 33.17 https://wikitech.wikimedia.org/wiki/Application_servers [17:07:32] 10Operations, 10ops-codfw, 10DBA, 10Goal, 10Patch-For-Review: rack/setup/install db21[21-30].codfw.wmnet - https://phabricator.wikimedia.org/T227113 (10Papaul) [17:07:38] 10Operations, 10ops-codfw, 10netops: Cable mr1-codfw<->cr1/2-codfw through asw-a-codfw - https://phabricator.wikimedia.org/T228112 (10ayounsi) [17:10:27] !log Add mr1-codfw<->cr1/2-codfw vlan/link config on 
asw-a-codfw - T228112 [17:10:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:10:35] T228112: Cable mr1-codfw<->cr1/2-codfw through asw-a-codfw - https://phabricator.wikimedia.org/T228112 [17:10:40] Operations, Analytics, LDAP-Access-Requests, wikimediafoundation.org, Patch-For-Review: Access to WikimediaFoundation.org analytics for Deb - https://phabricator.wikimedia.org/T227496 (RStallman-legalteam) I don't actually see the paper work for WMF full time req # employees, so I think havin... [17:12:19] Operations, ops-codfw, netops: Cable mr1-codfw<->cr1/2-codfw through asw-a-codfw - https://phabricator.wikimedia.org/T228112 (ayounsi) [17:14:04] !log rollback failover master VIP of ae2.1202 inet6 away from cr1-eqiad - T226782 [17:14:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:14:10] T226782: a1-eqiad pdu refresh - https://phabricator.wikimedia.org/T226782 [17:14:40] PROBLEM - High CPU load on API appserver on mw1233 is CRITICAL: CRITICAL - load average: 67.64, 40.09, 33.59 https://wikitech.wikimedia.org/wiki/Application_servers [17:15:08] PROBLEM - High CPU load on API appserver on mw1227 is CRITICAL: CRITICAL - load average: 74.15, 45.24, 36.17 https://wikitech.wikimedia.org/wiki/Application_servers [17:18:27] (CR) Cwhite: [C: +2] hiera: deploy varnishkafka exporter to esams [puppet] - https://gerrit.wikimedia.org/r/524931 (https://phabricator.wikimedia.org/T196066) (owner: Cwhite) [17:18:37] (PS2) Cwhite: hiera: deploy varnishkafka exporter to esams [puppet] - https://gerrit.wikimedia.org/r/524931 (https://phabricator.wikimedia.org/T196066) [17:18:54] (CR) Jhedden: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/525320 (owner: Arturo Borrero Gonzalez) [17:20:05] Operations, Analytics, LDAP-Access-Requests, wikimediafoundation.org: Access to WikimediaFoundation.org analytics for Deb - https://phabricator.wikimedia.org/T227496 
(10Heather) It doesn't seem like you need it, but this is approved. Let me know if you need something else. [17:20:38] 10Operations, 10ops-codfw, 10DBA, 10Goal, 10Patch-For-Review: rack/setup/install db21[21-30].codfw.wmnet - https://phabricator.wikimedia.org/T227113 (10Papaul) [17:20:56] (03CR) 10CRusnov: "This change is ready for review." [puppet] - 10https://gerrit.wikimedia.org/r/514395 (https://phabricator.wikimedia.org/T223291) (owner: 10CRusnov) [17:22:30] (03CR) 10CDanis: [C: 03+1] role::alerting_host: Remove support for jessie [puppet] - 10https://gerrit.wikimedia.org/r/525232 (owner: 10Muehlenhoff) [17:27:32] 10Operations, 10ops-eqiad, 10DC-Ops: dbproxy1012 alerting on PS Redundancy - https://phabricator.wikimedia.org/T228892 (10Cmjohnson) i created a dispatch with Dell to replace the PSU You have successfully submitted request SR995054295. [17:30:07] 10Operations, 10Analytics, 10LDAP-Access-Requests, 10wikimediafoundation.org: Access to WikimediaFoundation.org analytics for Deb - https://phabricator.wikimedia.org/T227496 (10herron) 05Open→03Resolved Great! Thanks all [17:30:08] (03PS13) 10Volans: Add the LBRemoteCluster class. [software/spicerack] - 10https://gerrit.wikimedia.org/r/503947 (owner: 10Giuseppe Lavagetto) [17:33:02] (03CR) 10Dzahn: [C: 03+2] DHCP: Add MAC address entries for db21[21-30] [puppet] - 10https://gerrit.wikimedia.org/r/525318 (https://phabricator.wikimedia.org/T227113) (owner: 10Papaul) [17:33:10] (03PS2) 10Dzahn: DHCP: Add MAC address entries for db21[21-30] [puppet] - 10https://gerrit.wikimedia.org/r/525318 (https://phabricator.wikimedia.org/T227113) (owner: 10Papaul) [17:33:48] RECOVERY - High CPU load on API appserver on mw1222 is OK: OK - load average: 18.39, 20.40, 23.81 https://wikitech.wikimedia.org/wiki/Application_servers [17:34:16] (03CR) 10jerkins-bot: [V: 04-1] Add the LBRemoteCluster class. 
[software/spicerack] - 10https://gerrit.wikimedia.org/r/503947 (owner: 10Giuseppe Lavagetto) [17:37:49] (03PS6) 10Ayounsi: Add an anycast endpoint to syslog centralservers [puppet] - 10https://gerrit.wikimedia.org/r/524037 [17:37:54] https://phabricator.wikimedia.org/T228911 [17:38:10] PROBLEM - High CPU load on API appserver on mw1227 is CRITICAL: CRITICAL - load average: 47.68, 33.82, 32.58 https://wikitech.wikimedia.org/wiki/Application_servers [17:39:23] 10Operations, 10ops-eqiad: Degraded RAID on elastic1046 - https://phabricator.wikimedia.org/T228606 (10Cmjohnson) a:05Cmjohnson→03wiki_willy This server is out of warranty, ended April 2019. @wiki_willy escalating to you to decide on disks [17:41:54] 10Operations, 10MediaWiki-General, 10serviceops, 10Core Platform Team (PHP7 (TEC4)), and 3 others: Some pages will become completely unreachable after PHP7 update due to Unicode changes - https://phabricator.wikimedia.org/T219279 (10Anomie) Next steps here: [ ] 1. Determine the schedule to do these next s... [17:42:56] (03PS14) 10Volans: Add the LBRemoteCluster class. [software/spicerack] - 10https://gerrit.wikimedia.org/r/503947 (owner: 10Giuseppe Lavagetto) [17:43:14] (03PS2) 10Jeena Huneidi: Package mediawiki-dev and add to index [deployment-charts] - 10https://gerrit.wikimedia.org/r/525173 (https://phabricator.wikimedia.org/T224935) [17:44:10] (03CR) 10Ayounsi: "I went the "Set a different profile::bird::advertise_vips for the server to be decom" way." [puppet] - 10https://gerrit.wikimedia.org/r/524037 (owner: 10Ayounsi) [17:48:56] 10Operations, 10Analytics, 10Analytics-Kanban, 10Discovery, 10Research-Backlog: Make oozie swift upload emit event to Kafka about swift object upload complete - https://phabricator.wikimedia.org/T227896 (10EBernhardson) I hadn't previously thought about re-publishing a new version of the same dataset. It... 
[17:49:58] 10Operations, 10Analytics, 10Analytics-Kanban, 10Discovery, 10Research-Backlog: Make oozie swift upload emit event to Kafka about swift object upload complete - https://phabricator.wikimedia.org/T227896 (10Ottomata) Ya, if you needed to re-run a job due to data backfill, you might want to be able to do s... [17:52:30] RECOVERY - High CPU load on API appserver on mw1233 is OK: OK - load average: 17.79, 20.27, 23.70 https://wikitech.wikimedia.org/wiki/Application_servers [17:54:55] (03PS1) 10Ayounsi: Fastnetmon: disable Graphite, fix notify script path [puppet] - 10https://gerrit.wikimedia.org/r/525334 (https://phabricator.wikimedia.org/T226810) [17:56:02] 10Operations, 10ops-eqiad: Degraded RAID on cloudvirt1024 - https://phabricator.wikimedia.org/T228853 (10Cmjohnson) A new ticket has been created with Dell You have successfully submitted request SR995055580. [18:00:41] 10Operations, 10ops-eqiad, 10DBA: Upgrade db1100 firmware and BIOS - https://phabricator.wikimedia.org/T228732 (10Cmjohnson) @Marostegui This can be done any day...Let's plan 8/6 @1000EDT /1400UTC [18:00:47] (03CR) 10Dzahn: "i don't know much about this but i have 2 questions/observations: a) looking at the repo all the other charts seem to exist as both a .tg" [deployment-charts] - 10https://gerrit.wikimedia.org/r/525173 (https://phabricator.wikimedia.org/T224935) (owner: 10Jeena Huneidi) [18:02:50] (03PS2) 10Ayounsi: Fastnetmon: disable Graphite, fix notify script path [puppet] - 10https://gerrit.wikimedia.org/r/525334 (https://phabricator.wikimedia.org/T226810) [18:04:32] 10Operations, 10ops-eqiad, 10Cassandra, 10DC-Ops, and 4 others: restbase-dev1006 has a broken disk - https://phabricator.wikimedia.org/T224260 (10Cmjohnson) @Eevans is this your server? I think I understand that the server is going to be re-installed anyway so if I pull the wrong disk to replace I won't'... 
[18:11:13] * Krinkle staging on mwdebug1002 [18:12:07] !log krinkle@deploy1001: extensions/CheckUser is dirty in php-1.34.0-wmf.15 [18:12:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:13:00] !log creating new restbase keyspaces -- T228804 [18:13:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:13:08] T228804: Create keyspaces in Cassandra for PCS endpoints - https://phabricator.wikimedia.org/T228804 [18:19:07] !log krinkle@deploy1001 Synchronized php-1.34.0-wmf.15/includes/cache/localisation/LocalisationCache.php: 31d99eb381bc (duration: 00m 54s) [18:19:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:19:57] (CR) Dzahn: "Yep, and i can't think of a reason why Icinga ever had PHP module installed anyways." [puppet] - https://gerrit.wikimedia.org/r/525232 (owner: Muehlenhoff) [18:20:35] (PS2) Dzahn: role::alerting_host: Remove support for jessie [puppet] - https://gerrit.wikimedia.org/r/525232 (owner: Muehlenhoff) [18:20:45] (PS1) Bstorm: toolforge: set kubeadm to use internal registry for pause container [puppet] - https://gerrit.wikimedia.org/r/525339 (https://phabricator.wikimedia.org/T228887) [18:21:56] Operations, ops-eqiad: Degraded RAID on elastic1046 - https://phabricator.wikimedia.org/T228606 (wiki_willy) @elukey - since elastic1046 is just barely out of warranty (only by a few months), we'll still have to purchase a new disk for this server. Just double-checking that's the route you want to go, b... 
[18:22:57] (03CR) 10Dzahn: [C: 03+2] "https://puppet-compiler.wmflabs.org/compiler1002/17600/icinga1001.wikimedia.org/" [puppet] - 10https://gerrit.wikimedia.org/r/525232 (owner: 10Muehlenhoff) [18:29:14] RECOVERY - High CPU load on API appserver on mw1227 is OK: OK - load average: 19.89, 21.33, 23.71 https://wikitech.wikimedia.org/wiki/Application_servers [18:30:46] 10Operations, 10ops-eqiad: replace scs-a8-eqiad - https://phabricator.wikimedia.org/T228919 (10RobH) p:05Triage→03Normal [18:30:57] 10Operations, 10ops-eqiad: replace scs-a8-eqiad - https://phabricator.wikimedia.org/T228919 (10RobH) [18:33:22] !log moving cloudvirt107 to 10G rack T228691 [18:33:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:33:29] T228691: relocate/reimage cloudvirt1017 with 10G interfaces - https://phabricator.wikimedia.org/T228691 [18:36:20] (03CR) 10Andrew Bogott: [C: 03+1] "I'm pretty sure this won't break anything, and it sounds like a big improvement!" [puppet] - 10https://gerrit.wikimedia.org/r/525320 (owner: 10Arturo Borrero Gonzalez) [18:52:46] 10Operations, 10ops-eqiad, 10Cassandra, 10DC-Ops, and 4 others: restbase-dev1006 has a broken disk - https://phabricator.wikimedia.org/T224260 (10Eevans) >>! In T224260#5362558, @Cmjohnson wrote: > @Eevans is this your server? I think I understand that the server is going to be re-installed anyway so if I... [18:52:50] 10Operations, 10ops-eqiad, 10DC-Ops, 10Epic, 10cloud-services-team (Kanban): relocate/reimage cloudvirt1017 with 10G interfaces - https://phabricator.wikimedia.org/T228691 (10Cmjohnson) [18:54:08] 10Operations, 10ops-eqiad, 10DC-Ops, 10Epic, 10cloud-services-team (Kanban): relocate/reimage cloudvirt1017 with 10G interfaces - https://phabricator.wikimedia.org/T228691 (10Cmjohnson) @Andrew a little luck with this server, it was already in a 10G rack. Removed the old network info on the switch, add... 
[18:56:54] PROBLEM - High CPU load on API appserver on mw1233 is CRITICAL: CRITICAL - load average: 66.53, 39.23, 28.41 https://wikitech.wikimedia.org/wiki/Application_servers [18:57:20] PROBLEM - High CPU load on API appserver on mw1227 is CRITICAL: CRITICAL - load average: 48.90, 40.67, 31.71 https://wikitech.wikimedia.org/wiki/Application_servers [18:57:32] RECOVERY - DPKG on restbase-dev1006 is OK: All packages OK [18:59:00] jouncebot: next [18:59:01] In 1 hour(s) and 0 minute(s): Services – Parsoid / Citoid / Mobileapps / ORES / … (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190724T2000) [18:59:43] 10Operations, 10vm-requests: rack/setup/install ganeti10([09]|1[0-8[).eqiad.wmnet - https://phabricator.wikimedia.org/T228924 (10RobH) p:05Triage→03Normal [19:00:40] 10Operations, 10vm-requests: rack/setup/install ganeti10([09]|1[0-8[).eqiad.wmnet - https://phabricator.wikimedia.org/T228924 (10RobH) a:03akosiaris @akosiaris, Can i get your sign off about the racking proposal and planning for these 10 ganeti nodes? 4 were refresh, while 6 were expansion from last years... [19:01:09] 10Operations, 10ops-eqiad, 10Cassandra, 10DC-Ops, and 4 others: restbase-dev1006 has a broken disk - https://phabricator.wikimedia.org/T224260 (10Cmjohnson) 05Open→03Resolved @eevans the disk has been replaced. I am resolving this task, if you find the problem is not fixed, please re-open and assign to... [19:01:48] 10Operations, 10ops-codfw, 10serviceops: (OoW) restbase2009 lockup - https://phabricator.wikimedia.org/T227408 (10Eevans) >>! In T227408#5349035, @jijiki wrote: > @Eevans Shall we mark restbase2009 as inactive on conftool? I'm not positive I understand the implications of that. As far as I know, the host... 
[19:02:12] jouncebot: reload [19:02:19] jouncebot: now [19:02:19] No deployments scheduled for the next 0 hour(s) and 57 minute(s) [19:02:23] jouncebot: help [19:02:24] **** JounceBot Help **** [19:02:24] JounceBot is a deployment helper bot for the Wikimedia Foundation. [19:02:24] You can find my source at https://github.com/mattofak/jouncebot [19:02:24] Available commands: [19:02:24] HELP Prints the list of all commands known to the server [19:02:24] NEXT Get the next deployment event(s if they happen at the same time) [19:02:24] NOW Get the current deployment event(s) or the time until the next [19:02:25] REFRESH Refresh my knowledge about deployments [19:02:29] jouncebot: refresh [19:02:30] I refreshed my knowledge about deployments. [19:02:33] jouncebot: now [19:02:33] For the next 0 hour(s) and 57 minute(s): SecureLinkFixer to group0 (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190724T1900) [19:02:36] yay [19:02:49] (03PS2) 10Legoktm: Enable SecureLinkFixer on group0 wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/525157 (https://phabricator.wikimedia.org/T200751) [19:03:35] (03CR) 10Legoktm: [C: 03+2] Enable SecureLinkFixer on group0 wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/525157 (https://phabricator.wikimedia.org/T200751) (owner: 10Legoktm) [19:03:45] 10Operations, 10ops-eqiad: rack/setup/instal (4) CI ganeti nodes - https://phabricator.wikimedia.org/T228926 (10RobH) p:05Triage→03Normal [19:04:20] (03CR) 10Jeena Huneidi: "> Patch Set 2:" [deployment-charts] - 10https://gerrit.wikimedia.org/r/525173 (https://phabricator.wikimedia.org/T224935) (owner: 10Jeena Huneidi) [19:04:40] 10Operations, 10ops-eqiad: rack/setup/instal (4) CI ganeti nodes - https://phabricator.wikimedia.org/T228926 (10RobH) @akosiaris, Are you involved in this project, and if so would you be the one to provide details for this? Please comment and assign back to me for followup, thanks! 
[19:04:42] (03Merged) 10jenkins-bot: Enable SecureLinkFixer on group0 wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/525157 (https://phabricator.wikimedia.org/T200751) (owner: 10Legoktm) [19:05:03] 10Operations, 10ops-eqiad: rack/setup/instal (4) CI ganeti nodes - https://phabricator.wikimedia.org/T228926 (10RobH) [19:05:06] 10Operations, 10ops-eqiad, 10vm-requests: rack/setup/install ganeti10([09]|1[0-8[).eqiad.wmnet - https://phabricator.wikimedia.org/T228924 (10RobH) [19:06:38] (03CR) 10jenkins-bot: Enable SecureLinkFixer on group0 wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/525157 (https://phabricator.wikimedia.org/T200751) (owner: 10Legoktm) [19:08:27] legoktm: when you have a minute, may I ask you a question re. libup2? [19:08:34] !log legoktm@deploy1001 Synchronized wmf-config/InitialiseSettings.php: Enable SecureLinkFixer on group0 wikis - T200751 (duration: 00m 55s) [19:08:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:08:43] T200751: Review and deploy SecureLinkFixer extension - https://phabricator.wikimedia.org/T200751 [19:08:52] hauskatze: ask :) I'll answer when I do ahve a minute :p [19:09:31] heh okay so legoktm I don't understand the "proposed patch" feature - is it suposed to be the actual fix or what the bot did commit in the past? [19:10:01] 'cause I tried to use it today but ended using NCU && npm audit (--fix) [19:10:35] both, kind of. 
[19:11:23] it's mostly a testing feature for me to check what would have happened since the bot is read-only right now (the pushes I did last week were off of my laptop) [19:11:38] eventually the plan is that libup runs once a day pushing patches as necessary [19:15:17] (03CR) 10Ayounsi: [C: 03+2] "https://puppet-compiler.wmflabs.org/compiler1001/17601/netflow1001.eqiad.wmnet/" [puppet] - 10https://gerrit.wikimedia.org/r/525334 (https://phabricator.wikimedia.org/T226810) (owner: 10Ayounsi) [19:15:28] (03PS3) 10Ayounsi: Fastnetmon: disable Graphite, fix notify script path [puppet] - 10https://gerrit.wikimedia.org/r/525334 (https://phabricator.wikimedia.org/T226810) [19:16:18] 10Operations, 10Security-Team, 10Trust-and-Safety: Add sguebo_WMF to WMF LDAP group - https://phabricator.wikimedia.org/T228927 (10jrbs) [19:16:22] 10Operations, 10ops-eqiad, 10DC-Ops, 10Wikimedia-Logstash, and 2 others: Decommission old eqiad logstash hardware hosts logstash100[456] - https://phabricator.wikimedia.org/T217556 (10RobH) wipe is running on all 4 internal disks for T217556 and on the external usb disk for T212457. [19:17:01] I'm done [19:17:26] 10Operations, 10LDAP-Access-Requests, 10Security-Team, 10Trust-and-Safety: Add sguebo_WMF to WMF LDAP group - https://phabricator.wikimedia.org/T228927 (10Legoktm) [19:17:40] 10Operations, 10ops-eqiad, 10Cassandra, 10DC-Ops, and 4 others: restbase-dev1006 has a broken disk - https://phabricator.wikimedia.org/T224260 (10Cmjohnson) [19:18:02] legoktm: alright, thanks. libup pushing patches once a day looks promising. Hopefully they don't get stuck in CI and we have to go with the broom afterwards ;-) [19:18:43] hauskatze: libup won't submit new patches if an open patch in that repo has the topic bump-dev-deps. 
So we won't have a pileup of just failing patches everywhere [19:21:32] :) [19:21:48] k thnx, I won't bother you anymore for the remaining of today [19:23:58] Operations, ops-eqiad, DC-Ops: elastic1031 failed PSU 2 fan - https://phabricator.wikimedia.org/T228769 (RobH) [19:25:22] (PS1) Legoktm: Add SecureLinkFixer to extension-list [mediawiki-config] - https://gerrit.wikimedia.org/r/525346 (https://phabricator.wikimedia.org/T200751) [19:25:24] (PS1) Legoktm: Enable SecureLinkFixer everywhere [mediawiki-config] - https://gerrit.wikimedia.org/r/525347 (https://phabricator.wikimedia.org/T200751) [19:25:34] hauskatze: no worries! feel free to ask anytime :) [19:25:45] (CR) Legoktm: "Pending wmf.15 rollout everywhere." [mediawiki-config] - https://gerrit.wikimedia.org/r/525346 (https://phabricator.wikimedia.org/T200751) (owner: Legoktm) [19:25:49] legoktm: thanks much :) [19:26:07] Operations, ops-eqiad, DC-Ops: elastic1031 failed PSU 2 fan - https://phabricator.wikimedia.org/T228769 (RobH) Open→Resolved a:RobH issue seems resolved, no errors reported on host. 
[19:28:10] RECOVERY - High CPU load on API appserver on mw1233 is OK: OK - load average: 19.60, 21.01, 23.53 https://wikitech.wikimedia.org/wiki/Application_servers [19:28:40] PROBLEM - High CPU load on API appserver on mw1227 is CRITICAL: CRITICAL - load average: 41.19, 32.18, 32.90 https://wikitech.wikimedia.org/wiki/Application_servers [19:29:55] (CR) Cwhite: [C: +2] hiera: deploy varnishkafka exporter to eqiad [puppet] - https://gerrit.wikimedia.org/r/524933 (https://phabricator.wikimedia.org/T196066) (owner: Cwhite) [19:30:05] (PS2) Cwhite: hiera: deploy varnishkafka exporter to eqiad [puppet] - https://gerrit.wikimedia.org/r/524933 (https://phabricator.wikimedia.org/T196066) [19:36:41] (CR) CDanis: "This mostly LGTM, just +1 to Filippo's comments" (7 comments) [puppet] - https://gerrit.wikimedia.org/r/520502 (https://phabricator.wikimedia.org/T226815) (owner: Giuseppe Lavagetto) [19:38:18] Operations, ops-eqiad: (Need By: August 31) rack/setup/install (3) new zookeeper nodes - https://phabricator.wikimedia.org/T227025 (RobH) [19:39:57] Operations, ops-eqiad, Cassandra, DC-Ops, and 4 others: restbase-dev1006 has a broken disk - https://phabricator.wikimedia.org/T224260 (Eevans) Resolved→Open >>! In T224260#5362801, @Cmjohnson wrote: > @eevans the disk has been replaced. I am resolving this task, if you find the problem i... [19:40:08] PROBLEM - Logstash rate of ingestion percent change compared to yesterday on icinga1001 is CRITICAL: 940.7 ge 130 https://phabricator.wikimedia.org/T202307 https://grafana.wikimedia.org/dashboard/db/logstash?orgId=1&panelId=2&fullscreen [19:40:16] PROBLEM - High CPU load on API appserver on mw1227 is CRITICAL: CRITICAL - load average: 42.65, 32.87, 32.09 https://wikitech.wikimedia.org/wiki/Application_servers [19:46:58] (CR) CDanis: "The script LGTM modulo one nit." 
(031 comment) [software/netbox-deploy] - 10https://gerrit.wikimedia.org/r/525165 (owner: 10CRusnov) [19:48:15] (03PS1) 10Eevans: Switch restbase-dev1006 to Stretch [puppet] - 10https://gerrit.wikimedia.org/r/525351 (https://phabricator.wikimedia.org/T224260) [19:52:24] 10Operations, 10LDAP-Access-Requests, 10Security-Team, 10Trust-and-Safety: Add sguebo_WMF to WMF LDAP group - https://phabricator.wikimedia.org/T228927 (10sbassett) p:05Triage→03Normal This is approved by the #security-team (cc: @JBennett) for the specific request of access to logstash for @sguebo_WMF. [19:52:26] PROBLEM - Host dbproxy1021.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [19:55:44] PROBLEM - High CPU load on API appserver on mw1222 is CRITICAL: CRITICAL - load average: 50.57, 34.32, 27.09 https://wikitech.wikimedia.org/wiki/Application_servers [19:56:18] PROBLEM - Host dbproxy1020.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [19:56:35] robh: that dbproxy1021 mgt down expected? [19:57:03] ah, I guess it is chris moving 1020 and 1021 to a different rack? [19:58:10] RECOVERY - Host dbproxy1021.mgmt is UP: PING OK - Packet loss = 0%, RTA = 1.17 ms [20:00:04] cscott, arlolra, subbu, bearND, and halfak: #bothumor When your hammer is PHP, everything starts looking like a thumb. Rise for Services – Parsoid / Citoid / Mobileapps / ORES / …. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190724T2000). [20:01:00] 10Operations, 10ops-eqiad: (Need By: August 31) rack/setup/install (3) new zookeeper nodes - https://phabricator.wikimedia.org/T227025 (10Cmjohnson) [20:02:02] RECOVERY - Host dbproxy1020.mgmt is UP: PING OK - Packet loss = 0%, RTA = 3.40 ms [20:02:38] 10Operations, 10ops-eqiad, 10DC-Ops: Reallocate dbproxy1020 and dbproxy1021 from row D to row C - https://phabricator.wikimedia.org/T228618 (10Cmjohnson) @Marostegui do these need to be in the same rack or separate racks? 1G space is limited to racks C5 and C8. C5 currently has a couple of dbproxy servers. 
[20:05:23] 10Operations, 10ops-eqiad, 10DC-Ops: Reallocate dbproxy1020 and dbproxy1021 from row D to row C - https://phabricator.wikimedia.org/T228618 (10Marostegui) Separate if possible. If it is really not possible then same rack it is also ok [20:05:37] @cmjohnson1: ^ [20:07:12] PROBLEM - puppet last run on cp1089 is CRITICAL: CRITICAL: Failed to apply catalog, zero resources tracked by Puppet. Could be an interrupted request or a dependency cycle. https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [20:09:35] (03PS1) 10Jhedden: Revert "dumps distribution: switch dumps to labstore1006" [dns] - 10https://gerrit.wikimedia.org/r/525361 [20:12:12] !log redirecting dumps.wikimedia.org back to labstore1007.wikimedia.org T224228 [20:12:15] (03CR) 10Jhedden: [C: 03+2] Revert "dumps distribution: switch dumps to labstore1006" [dns] - 10https://gerrit.wikimedia.org/r/525361 (owner: 10Jhedden) [20:12:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:12:30] (03PS2) 10Jhedden: Revert "dumps distribution: switch dumps to labstore1006" [dns] - 10https://gerrit.wikimedia.org/r/525361 [20:12:50] (03CR) 10Jhedden: [V: 03+2 C: 03+2] Revert "dumps distribution: switch dumps to labstore1006" [dns] - 10https://gerrit.wikimedia.org/r/525361 (owner: 10Jhedden) [20:18:52] RECOVERY - High CPU load on API appserver on mw1222 is OK: OK - load average: 20.51, 21.61, 23.65 https://wikitech.wikimedia.org/wiki/Application_servers [20:23:14] 10Operations, 10ops-eqiad: Degraded RAID on cloudvirt1024 - https://phabricator.wikimedia.org/T228853 (10wiki_willy) a:03Cmjohnson [20:24:59] (03CR) 10Jhedden: [C: 03+1] Configure unconditional flushes of the L1 cache during VMENTER [puppet] - 10https://gerrit.wikimedia.org/r/525295 (https://phabricator.wikimedia.org/T228870) (owner: 10Muehlenhoff) [20:25:26] 10Operations, 10ops-eqiad, 10DBA: Upgrade db1100 firmware and BIOS - https://phabricator.wikimedia.org/T228732 (10wiki_willy) a:03Cmjohnson [20:27:08] 
PROBLEM - High CPU load on API appserver on mw1222 is CRITICAL: CRITICAL - load average: 40.82, 36.02, 29.47 https://wikitech.wikimedia.org/wiki/Application_servers [20:34:06] Operations, ops-codfw, netops: update RE-S-X6-64G-S in cr[12]-codfw - https://phabricator.wikimedia.org/T226422 (ayounsi) Scheduled for the 31st at 15:00UTC (1h total). [20:35:03] !log bsitzmann@deploy1001 Started deploy [mobileapps/deploy@2e2ce6c]: Update mobileapps to 1751a2e [20:35:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:35:12] Operations, ops-codfw, netops: Cable mr1-codfw<->cr1/2-codfw through asw-a-codfw - https://phabricator.wikimedia.org/T228112 (ayounsi) Scheduled for the 30st at 15:00UTC (1h total). Let me know if it needs to be rescheduled. [20:35:24] RECOVERY - puppet last run on cp1089 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [20:36:16] (PS1) RobH: adding john clark to wmf ldap [puppet] - https://gerrit.wikimedia.org/r/525420 (https://phabricator.wikimedia.org/T228935) [20:36:52] !log ppchelko@deploy1001 Started deploy [changeprop/deploy@bf28187]: Rerender PCS endpoints T222384 [20:36:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:36:58] T222384: Enable storage and pre-generation for PCS endpoints - https://phabricator.wikimedia.org/T222384 [20:37:22] (CR) RobH: [C: +2] adding john clark to wmf ldap [puppet] - https://gerrit.wikimedia.org/r/525420 (https://phabricator.wikimedia.org/T228935) (owner: RobH) [20:38:26] !log ppchelko@deploy1001 Finished deploy [changeprop/deploy@bf28187]: Rerender PCS endpoints T222384 (duration: 01m 34s) [20:38:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:39:23] !log bsitzmann@deploy1001 Finished deploy [mobileapps/deploy@2e2ce6c]: Update mobileapps to 1751a2e (duration: 04m 20s) [20:39:29] Logged 
the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:44:02] (03PS1) 10Muehlenhoff: Extend package-build-deb-src.list with Buster [puppet] - 10https://gerrit.wikimedia.org/r/525422 [20:44:31] (03PS1) 10RobH: Revert "adding john clark to wmf ldap" [puppet] - 10https://gerrit.wikimedia.org/r/525423 [20:45:20] !log ppchelko@deploy1001 Started deploy [restbase/deploy@7911f65]: Store PCS endpoints T222384 [20:45:26] (03CR) 10Muehlenhoff: [C: 03+2] Extend package-build-deb-src.list with Buster [puppet] - 10https://gerrit.wikimedia.org/r/525422 (owner: 10Muehlenhoff) [20:45:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:45:27] T222384: Enable storage and pre-generation for PCS endpoints - https://phabricator.wikimedia.org/T222384 [20:47:44] (03CR) 10RobH: [C: 03+2] Revert "adding john clark to wmf ldap" [puppet] - 10https://gerrit.wikimedia.org/r/525423 (owner: 10RobH) [20:47:52] (03PS2) 10RobH: Revert "adding john clark to wmf ldap" [puppet] - 10https://gerrit.wikimedia.org/r/525423 [20:48:16] fyi, we're about to deploy parsoid [20:48:47] (03PS2) 10Bstorm: toolforge: set kubeadm to use internal registry for pause container [puppet] - 10https://gerrit.wikimedia.org/r/525339 (https://phabricator.wikimedia.org/T228887) [20:49:45] (03PS3) 10Bstorm: toolforge: set kubeadm to use internal registry for pause container [puppet] - 10https://gerrit.wikimedia.org/r/525339 (https://phabricator.wikimedia.org/T228887) [20:50:52] (03CR) 10Bstorm: [C: 03+2] toolforge: set kubeadm to use internal registry for pause container [puppet] - 10https://gerrit.wikimedia.org/r/525339 (https://phabricator.wikimedia.org/T228887) (owner: 10Bstorm) [20:52:34] PROBLEM - High CPU load on API appserver on mw1233 is CRITICAL: CRITICAL - load average: 57.84, 35.88, 27.88 https://wikitech.wikimedia.org/wiki/Application_servers [20:54:39] (03PS1) 10RobH: john clark - adding to ldap [puppet] - 10https://gerrit.wikimedia.org/r/525425 
(https://phabricator.wikimedia.org/T228935) [20:55:16] RECOVERY - High CPU load on API appserver on mw1222 is OK: OK - load average: 18.60, 21.57, 23.85 https://wikitech.wikimedia.org/wiki/Application_servers [20:55:52] (CR) RobH: [C: +2] john clark - adding to ldap [puppet] - https://gerrit.wikimedia.org/r/525425 (https://phabricator.wikimedia.org/T228935) (owner: RobH) [20:56:03] i got a "The Wikipedia database is temporarily in read-only mode. This is probably due to routine maintenance; if so, you will be able to edit again within a few minutes." on beta just now? [20:57:47] that read-only mode complaint on beta might have coincided with that "high cpu load" icinga warning above? [20:58:38] (CR) Dzahn: [C: +2] Switch restbase-dev1006 to Stretch [puppet] - https://gerrit.wikimedia.org/r/525351 (https://phabricator.wikimedia.org/T224260) (owner: Eevans) [20:58:46] (PS2) Dzahn: Switch restbase-dev1006 to Stretch [puppet] - https://gerrit.wikimedia.org/r/525351 (https://phabricator.wikimedia.org/T224260) (owner: Eevans) [21:00:17] !log cscott@deploy1001 Started deploy [parsoid/deploy@abd05ab]: Updating Parsoid to df1af404 (T227216, T226523, T226451) [21:00:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:00:29] T226523: Template in wikilink target position also returns pipe separated params - https://phabricator.wikimedia.org/T226523 [21:00:31] T226451: Possible bug in PHP Tokenizer: Unexpected OOM - https://phabricator.wikimedia.org/T226451 [21:00:31] T227216: Adding or editing citations using VisualEditor causes major formatting issues involving pipes, equals signs and nowiki tags - https://phabricator.wikimedia.org/T227216 [21:01:36] cscott, i doubt those two are correlated. in any case, it looks transient. 
[21:02:32] PROBLEM - High CPU load on API appserver on mw1233 is CRITICAL: CRITICAL - load average: 46.21, 36.17, 31.11 https://wikitech.wikimedia.org/wiki/Application_servers [21:03:34] PROBLEM - High CPU load on API appserver on mw1222 is CRITICAL: CRITICAL - load average: 50.74, 34.33, 27.99 https://wikitech.wikimedia.org/wiki/Application_servers [21:03:38] !log ppchelko@deploy1001 Finished deploy [restbase/deploy@7911f65]: Store PCS endpoints T222384 (duration: 18m 18s) [21:03:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:03:45] T222384: Enable storage and pre-generation for PCS endpoints - https://phabricator.wikimedia.org/T222384 [21:11:02] Operations, ops-eqiad, Cassandra, DC-Ops, and 4 others: restbase-dev1006 has a broken disk - https://phabricator.wikimedia.org/T224260 (ops-monitoring-bot) Script wmf-auto-reimage was launched by dzahn on cumin1001.eqiad.wmnet for hosts: ` restbase-dev1006.eqiad.wmnet ` The log can be found in `/... [21:12:13] !log nuria@deploy1001 Started deploy [analytics/refinery@58e64c1]: deploying refinery 0.0.95 [21:12:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:13:52] RECOVERY - Device not healthy -SMART- on restbase-dev1006 is OK: All metrics within thresholds. 
https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=restbase-dev1006&var-datasource=eqiad+prometheus/ops [21:16:07] !log nuria@deploy1001 Finished deploy [analytics/refinery@58e64c1]: deploying refinery 0.0.95 (duration: 03m 54s) [21:16:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:18:52] !log cscott@deploy1001 Finished deploy [parsoid/deploy@abd05ab]: Updating Parsoid to df1af404 (T227216, T226523, T226451) (duration: 18m 35s) [21:19:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:19:01] T226523: Template in wikilink target position also returns pipe separated params - https://phabricator.wikimedia.org/T226523 [21:19:01] T226451: Possible bug in PHP Tokenizer: Unexpected OOM - https://phabricator.wikimedia.org/T226451 [21:19:01] T227216: Adding or editing citations using VisualEditor causes major formatting issues involving pipes, equals signs and nowiki tags - https://phabricator.wikimedia.org/T227216 [21:22:22] !log <+icinga-wm> RECOVERY - Device not healthy -SMART- on restbase-dev1006 is OK: All metrics within thresholds. (T224260) [21:22:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:22:30] T224260: restbase-dev1006 has a broken disk - https://phabricator.wikimedia.org/T224260 [21:22:45] it's kind of strange wording that "RECOVERY .. 
NOT healthy" but yea :) [21:23:01] RECOVERY - High CPU load on API appserver on mw1222 is OK: OK - load average: 21.45, 21.93, 23.90 https://wikitech.wikimedia.org/wiki/Application_servers [21:23:04] nice that it confirmed the disk replacement [21:26:49] RECOVERY - High CPU load on API appserver on mw1233 is OK: OK - load average: 17.72, 19.84, 23.56 https://wikitech.wikimedia.org/wiki/Application_servers [21:30:32] Operations, ops-codfw, DBA, Goal: rack/setup/install db21[21-30].codfw.wmnet - https://phabricator.wikimedia.org/T227113 (Papaul) [21:34:36] Operations, ops-codfw, DBA, Goal: rack/setup/install db21[21-30].codfw.wmnet - https://phabricator.wikimedia.org/T227113 (Papaul) [21:40:43] RECOVERY - Logstash rate of ingestion percent change compared to yesterday on icinga1001 is OK: (C)130 ge (W)110 ge 25.13 https://phabricator.wikimedia.org/T202307 https://grafana.wikimedia.org/dashboard/db/logstash?orgId=1&panelId=2&fullscreen [21:41:39] ok, parsoid deploy done, looks good to us [21:43:45] PROBLEM - Check the Netbox report-s- puppetdb for fail status. on netmon1002 is CRITICAL: puppetdb.PuppetDB CRITICAL https://wikitech.wikimedia.org/wiki/Netbox%23Reports [21:47:14] ^ happens during reinstall of machines, in this case restbase-dev1006 [21:48:28] RECOVERY - High CPU load on API appserver on mw1227 is OK: OK - load average: 19.99, 21.55, 23.59 https://wikitech.wikimedia.org/wiki/Application_servers [21:54:58] PROBLEM - puppet last run on elastic1019 is CRITICAL: CRITICAL: Failed to apply catalog, zero resources tracked by Puppet. Could be an interrupted request or a dependency cycle. 
https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [21:57:51] (PS1) Bstorm: toolforge: add internal pause container to all the other kubelets [puppet] - https://gerrit.wikimedia.org/r/525434 (https://phabricator.wikimedia.org/T228887) [21:59:01] (CR) Bstorm: [C: +2] toolforge: add internal pause container to all the other kubelets [puppet] - https://gerrit.wikimedia.org/r/525434 (https://phabricator.wikimedia.org/T228887) (owner: Bstorm) [21:59:58] (PS4) Thcipriani: Blubberoid: enable policy, bump version, reindex helm repo [deployment-charts] - https://gerrit.wikimedia.org/r/522561 [22:02:01] (PS1) Bstorm: toolforge: fix typo kubelet file content [puppet] - https://gerrit.wikimedia.org/r/525436 (https://phabricator.wikimedia.org/T228887) [22:02:13] (CR) Thcipriani: [V: +2 C: +2] Blubberoid: enable policy, bump version, reindex helm repo [deployment-charts] - https://gerrit.wikimedia.org/r/522561 (owner: Thcipriani) [22:04:18] PROBLEM - puppet last run on analytics1065 is CRITICAL: CRITICAL: Failed to apply catalog, zero resources tracked by Puppet. Could be an interrupted request or a dependency cycle. 
https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [22:04:32] (CR) Bstorm: [C: +2] toolforge: fix typo kubelet file content [puppet] - https://gerrit.wikimedia.org/r/525436 (https://phabricator.wikimedia.org/T228887) (owner: Bstorm) [22:15:55] Operations, ops-eqiad, Cassandra, DC-Ops, and 4 others: restbase-dev1006 has a broken disk - https://phabricator.wikimedia.org/T224260 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['restbase-dev1006.eqiad.wmnet'] ` Of which those **FAILED**: ` ['restbase-dev1006.eqiad.wmnet'] ` [22:22:42] RECOVERY - puppet last run on elastic1019 is OK: OK: Puppet is currently enabled, last run 51 seconds ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [22:23:43] (PS1) Thcipriani: gerrit: use gerrit-deployers not gerrit-root [puppet] - https://gerrit.wikimedia.org/r/525444 [22:28:09] !log thcipriani@ helmfile [STAGING] Ran 'apply' command on namespace 'blubberoid' for release 'staging' . [22:28:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:32:34] RECOVERY - puppet last run on analytics1065 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [22:36:00] !log thcipriani@ helmfile [CODFW] Ran 'apply' command on namespace 'blubberoid' for release 'production' . [22:36:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:40:06] (CR) Jforrester: "Dupe of I798c809317544." [mediawiki-config] - https://gerrit.wikimedia.org/r/525346 (https://phabricator.wikimedia.org/T200751) (owner: Legoktm) [22:41:10] !log thcipriani@ helmfile [EQIAD] Ran 'apply' command on namespace 'blubberoid' for release 'production' . 
[22:41:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:44:50] (PS1) Aklapper: phabricator weekly project changes email: List cookie-licked tasks [puppet] - https://gerrit.wikimedia.org/r/525449 (https://phabricator.wikimedia.org/T228575) [22:46:57] I'm the only customer in the upcoming SWAT, and I can do it myself, but I'll probably be ~5 mins late [23:00:05] MaxSem, RoanKattouw, and Niharika: #bothumor Q:How do functions break up? A:They stop calling each other. Rise for Evening SWAT (Max 6 patches) deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190724T2300). [23:00:05] RoanKattouw: A patch you scheduled for Evening SWAT (Max 6 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [23:01:10] (CR) Aklapper: "As usual I'm not sure about performance, but locally the query was surprisingly fast." [puppet] - https://gerrit.wikimedia.org/r/525449 (https://phabricator.wikimedia.org/T228575) (owner: Aklapper) [23:05:54] (CR) Cwhite: "Is it in the plan to clean up the files left behind manually?" 
[puppet] - https://gerrit.wikimedia.org/r/525259 (https://phabricator.wikimedia.org/T184942) (owner: Filippo Giunchedi) [23:10:00] (PS5) Catrope: Correct a typo on the label newarticle in the help panel for arwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/525190 (https://phabricator.wikimedia.org/T228820) (owner: Zoranzoki21) [23:10:06] (CR) Catrope: [C: +2] Correct a typo on the label newarticle in the help panel for arwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/525190 (https://phabricator.wikimedia.org/T228820) (owner: Zoranzoki21) [23:11:18] (Merged) jenkins-bot: Correct a typo on the label newarticle in the help panel for arwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/525190 (https://phabricator.wikimedia.org/T228820) (owner: Zoranzoki21) [23:11:38] (CR) jenkins-bot: Correct a typo on the label newarticle in the help panel for arwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/525190 (https://phabricator.wikimedia.org/T228820) (owner: Zoranzoki21) [23:12:05] !log nuria@deploy1001 Started deploy [analytics/refinery@834db0a]: deploying refinery 0.0.96 (skipping 0.0.95 due to some jenkins/archiva issues) [23:12:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:13:40] !log catrope@deploy1001 Synchronized wmf-config/InitialiseSettings.php: Correct typo in arwiki help panel config (T228820) (duration: 00m 57s) [23:13:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:13:48] T228820: Correct typo error in the Help Panel in Arabic Wikipedia - https://phabricator.wikimedia.org/T228820 [23:15:39] (PS2) Catrope: Enable GrowthExperiments homepage on arwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/523362 (https://phabricator.wikimedia.org/T228120) [23:17:40] (CR) Catrope: [C: +2] Enable GrowthExperiments homepage on arwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/523362 
(https://phabricator.wikimedia.org/T228120) (owner: Catrope) [23:18:40] (Merged) jenkins-bot: Enable GrowthExperiments homepage on arwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/523362 (https://phabricator.wikimedia.org/T228120) (owner: Catrope) [23:18:55] (CR) jenkins-bot: Enable GrowthExperiments homepage on arwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/523362 (https://phabricator.wikimedia.org/T228120) (owner: Catrope) [23:22:44] !log catrope@deploy1001 Synchronized wmf-config/InitialiseSettings.php: Enable GrowthExperiments homepage on arwiki (T228120) (duration: 00m 55s) [23:22:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:22:51] T228120: Set up and deploy homepage on Arabic Wikipedia - https://phabricator.wikimedia.org/T228120 [23:30:15] !log nuria@deploy1001 Finished deploy [analytics/refinery@834db0a]: deploying refinery 0.0.96 (skipping 0.0.95 due to some jenkins/archiva issues) (duration: 18m 10s) [23:30:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:32:37] !log nuria@deploy1001 Started deploy [analytics/refinery@7d93398]: deploying refinery 0.0.96 (skipping 0.0.95 due to some jenkins/archiva issues). 
Try 2 [23:32:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:37:04] (PS2) Catrope: Enable homepage for 50% of new users on arwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/523363 (https://phabricator.wikimedia.org/T228120) [23:37:10] (CR) Catrope: [C: +2] Enable homepage for 50% of new users on arwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/523363 (https://phabricator.wikimedia.org/T228120) (owner: Catrope) [23:38:19] (Merged) jenkins-bot: Enable homepage for 50% of new users on arwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/523363 (https://phabricator.wikimedia.org/T228120) (owner: Catrope) [23:38:34] (CR) jenkins-bot: Enable homepage for 50% of new users on arwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/523363 (https://phabricator.wikimedia.org/T228120) (owner: Catrope) [23:39:38] !log catrope@deploy1001 Synchronized wmf-config/InitialiseSettings.php: Enable homepage for 50% of new users on arwiki (T228120) (duration: 00m 58s) [23:39:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:39:45] T228120: Set up and deploy homepage on Arabic Wikipedia - https://phabricator.wikimedia.org/T228120 [23:42:38] !log catrope@deploy1001 Synchronized php-1.34.0-wmf.14/extensions/Flow: Fix JS error when saving Flow board descriptions (T228818) (duration: 01m 03s) [23:42:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:42:45] T228818: [wmf.14-regression] issues with saving edits to Flow board description - https://phabricator.wikimedia.org/T228818 [23:43:40] !log catrope@deploy1001 Synchronized php-1.34.0-wmf.15/extensions/Flow: Fix JS error when saving Flow board descriptions (T228818) (duration: 01m 01s) [23:43:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:46:08] RECOVERY - MegaRAID on cloudvirt1024 is OK: OK: optimal, 1 logical, 8 physical 
https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [23:46:11] !log nuria@deploy1001 Finished deploy [analytics/refinery@7d93398]: deploying refinery 0.0.96 (skipping 0.0.95 due to some jenkins/archiva issues). Try 2 (duration: 13m 34s) [23:46:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log