[00:00:41] unrelated comment: we should stop using ruby 2.1.5
[00:00:57] yes
[00:02:17] bast3002 - it was slow to run puppet (99s) but there was no error
[00:02:35] RECOVERY - Esams HTTP 5xx reqs/min on graphite1004 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=esams&var-cache_type=All&var-status_type=5
[00:03:39] RECOVERY - Eqiad HTTP 5xx reqs/min on graphite1004 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=eqiad&var-cache_type=All&var-status_type=5
[00:03:59] RECOVERY - puppet last run on bast3002 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[00:04:21] RECOVERY - Text HTTP 5xx reqs/min on graphite1004 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=All&var-cache_type=text&var-status_type=5
[00:11:23] Operations, Operations-Software-Development, netbox, netops, User-crusnov: Netbox report to validate network equipment data - https://phabricator.wikimedia.org/T221507 (ayounsi) >>! In T221507#5178669, @crusnov wrote: > [ ] test_nb_device_in_librenms: every Staged,Active asw `Device` in Netbo...
[00:18:03] (CR) Herron: "> LGTM. Should it also apply to other fields that are MW specific," [puppet] - https://gerrit.wikimedia.org/r/509924 (https://phabricator.wikimedia.org/T187147) (owner: Herron)
[00:18:22] (CR) Herron: "> > Patch Set 1: Code-Review+1" [puppet] - https://gerrit.wikimedia.org/r/509924 (https://phabricator.wikimedia.org/T187147) (owner: Herron)
[00:28:39] (PS6) Dzahn: rsync: add a bwlimit option for quickdatacopy [puppet] - https://gerrit.wikimedia.org/r/509458 (https://phabricator.wikimedia.org/T209527) (owner: Bstorm)
[00:30:18] wikibugs: reboot
[00:30:29] it quit other channels but is still here
[00:32:25] !log restarting wikibugs because it left some channels
[00:32:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:35:20] mutante, isnt it a bot that only joins channels when it has a message to send to them?
[00:36:17] Operations, Gerrit, serviceops, Patch-For-Review: Convert Gerrit to use H2 as the database - https://phabricator.wikimedia.org/T211139 (Dzahn) say something random to make wikibugs rejoin a channel
[00:36:53] Krenair: maybe:) i tried and it also joined other channels
[00:37:06] after just waiting a while
[00:37:15] but yea, it is also back in -serviceops
[00:37:23] though why it timed out i dont know
[00:37:30] yep its joining channels to post messages to them
[00:38:13] ok
[00:38:21] it looks like when it restarted the last time there was a period when both were online
[00:38:26] which may have caused some issues
[00:38:54] this time i only stopped one of its 3 processes
[00:39:02] the "irc" one .. but not the gerrit and phab ones
[00:39:10] do you normally stop all three?
[00:39:40] yea, last times i did it i would first make sure nothing is running anymore
[00:39:43] and then start again
[00:40:39] hm, ok
[00:40:58] when I've done it I think I've usually just restarted whichever part appeared to be playing up
[00:41:39] Operations, Traffic, Wikidata, Wikidata-Query-Service, and 2 others: Reduce / remove the aggessive cache busting behaviour of wdqs-updater - https://phabricator.wikimedia.org/T217897 (Addshore) So, I did some really crappy analysis of the hit rate in varnish before and after this change, looking...
[00:43:02] (PS7) Dzahn: rsync: add a bwlimit option for quickdatacopy [puppet] - https://gerrit.wikimedia.org/r/509458 (https://phabricator.wikimedia.org/T209527) (owner: Bstorm)
[00:51:43] (PS8) Dzahn: rsync: add a bwlimit option for quickdatacopy [puppet] - https://gerrit.wikimedia.org/r/509458 (https://phabricator.wikimedia.org/T209527) (owner: Bstorm)
[00:55:39] (CR) Dzahn: [C: +2] "now noop on existing servers using it without the new option https://puppet-compiler.wmflabs.org/compiler1001/16502/" [puppet] - https://gerrit.wikimedia.org/r/509458 (https://phabricator.wikimedia.org/T209527) (owner: Bstorm)
[01:11:38] (PS7) Dzahn: mirrors: add Icinga notes_urls [puppet] - https://gerrit.wikimedia.org/r/509477 (https://phabricator.wikimedia.org/T197873)
[01:16:14] (CR) Dzahn: [V: +2 C: +2] mirrors: add Icinga notes_urls [puppet] - https://gerrit.wikimedia.org/r/509477 (https://phabricator.wikimedia.org/T197873) (owner: Dzahn)
[01:20:33] (PS5) Dzahn: Gerrit: Enable gerrit.listProjectsFromIndex [puppet] - https://gerrit.wikimedia.org/r/508892 (https://phabricator.wikimedia.org/T200739) (owner: Paladox)
[01:24:52] (CR) Dzahn: [C: +2] Gerrit: Enable gerrit.listProjectsFromIndex [puppet] - https://gerrit.wikimedia.org/r/508892 (https://phabricator.wikimedia.org/T200739) (owner: Paladox)
[01:27:46] (CR) Dzahn: [C: +2] Gerrit: Disable DNS reverse lookup [puppet] - https://gerrit.wikimedia.org/r/508127 (owner: Paladox)
[01:31:47] (PS9) Dzahn: Gerrit: Disable DNS reverse lookup [puppet] - https://gerrit.wikimedia.org/r/508127 (owner: Paladox)
[01:32:22] (CR) Dzahn: [C: +2] Gerrit: Disable DNS reverse lookup [puppet] - https://gerrit.wikimedia.org/r/508127 (owner: Paladox)
[01:36:59] !log restarting Gerrit to apply 2 config changes - disable DNS reverse lookup (gerrit:508127) & list projects from index (gerrit:508892) - removes blockers for 2.16 upgrade (T200739)
[01:37:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:37:04] T200739: Upgrade to Gerrit 2.16.8 - https://phabricator.wikimedia.org/T200739
[01:41:45] PROBLEM - puppet last run on netmon2001 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 4 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[git_pull_operations/software/netbox-reports]
[01:41:57] PROBLEM - puppet last run on schema2001 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 4 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[git_pull_mediawiki/event-schemas]
[01:42:23] !log cumin -b 6 'R:git::clone' 'run-puppet-agent -q --failed-only'
[01:42:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:44:31] RECOVERY - puppet last run on netmon2001 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[01:44:31] RECOVERY - puppet last run on schema2001 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures
[01:48:01] Operations, netops: eqord - ulsfo Telia link down - IC-313592 - https://phabricator.wikimedia.org/T221259 (Dzahn) Resolved→Open This is down again since about 5 hours. Is it ok to reuse the ticket? Same circuit IC-313592, same interfaces. https://icinga.wikimedia.org/cgi-bin/icinga/extinfo.cgi?t...
[01:48:31] Operations, netops: eqord - ulsfo Telia link down - IC-313592 - https://phabricator.wikimedia.org/T221259 (Dzahn) a:Dzahn→ayounsi
[01:49:02] ACKNOWLEDGEMENT - Router interfaces on cr2-eqord is CRITICAL: CRITICAL: host 208.80.154.198, interfaces up: 53, down: 1, dormant: 0, excluded: 0, unused: 0: daniel_zahn https://phabricator.wikimedia.org/T221259 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[01:49:06] ACKNOWLEDGEMENT - Router interfaces on cr3-ulsfo is CRITICAL: CRITICAL: host 198.35.26.192, interfaces up: 66, down: 1, dormant: 0, excluded: 0, unused: 0: daniel_zahn https://phabricator.wikimedia.org/T221259 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[01:49:20] PROBLEM - Logstash rate of ingestion percent change compared to yesterday on icinga1001 is CRITICAL: 141.2 ge 130 https://grafana.wikimedia.org/dashboard/db/logstash?orgId=1&panelId=2&fullscreen
[01:50:38] PROBLEM - HTTP availability for Nginx -SSL terminators- at ulsfo on icinga1001 is CRITICAL: cluster=cache_text site=ulsfo https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[01:50:46] PROBLEM - HTTP availability for Nginx -SSL terminators- at codfw on icinga1001 is CRITICAL: cluster=cache_text site=codfw https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[01:51:08] PROBLEM - HTTP availability for Nginx -SSL terminators- at esams on icinga1001 is CRITICAL: cluster=cache_text site=esams https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[01:51:14] PROBLEM - HTTP availability for Nginx -SSL terminators- at eqiad on icinga1001 is CRITICAL: cluster=cache_text site=eqiad https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[01:51:26] PROBLEM - Text HTTP 5xx reqs/min on graphite1004 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=All&var-cache_type=text&var-status_type=5
[01:51:30] PROBLEM - HTTP availability for Varnish at eqiad on icinga1001 is CRITICAL: job=varnish-text site=eqiad https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1
[01:52:02] PROBLEM - Esams HTTP 5xx reqs/min on graphite1004 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=esams&var-cache_type=All&var-status_type=5
[01:52:17] Operations, netops: eqord - ulsfo Telia link down - IC-313592 - https://phabricator.wikimedia.org/T221259 (Dzahn) 5 hours ago: "We regret to inform you that we are facing a cable Between Denver and Strasburg in US. We will investigate and update you accordingly" 4 hours ago: "Our provider confirmed th...
[01:53:04] PROBLEM - Eqsin HTTP 5xx reqs/min on graphite1004 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=eqsin&var-cache_type=All&var-status_type=5
[01:54:28] RECOVERY - HTTP availability for Nginx -SSL terminators- at eqiad on icinga1001 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[01:54:46] RECOVERY - HTTP availability for Varnish at eqiad on icinga1001 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1
[01:54:56] RECOVERY - HTTP availability for Nginx -SSL terminators- at ulsfo on icinga1001 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[01:55:00] RECOVERY - HTTP availability for Nginx -SSL terminators- at esams on icinga1001 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[01:55:04] RECOVERY - HTTP availability for Nginx -SSL terminators- at codfw on icinga1001 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[01:55:29] !log re-scheduled nginx / HTTP availability icinga checks
[01:55:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:56:22] PROBLEM - Eqiad HTTP 5xx reqs/min on graphite1004 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=eqiad&var-cache_type=All&var-status_type=5
[01:59:08] RECOVERY - Esams HTTP 5xx reqs/min on graphite1004 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=esams&var-cache_type=All&var-status_type=5
[01:59:08] RECOVERY - Eqiad HTTP 5xx reqs/min on graphite1004 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=eqiad&var-cache_type=All&var-status_type=5
[02:00:20] RECOVERY - Eqsin HTTP 5xx reqs/min on graphite1004 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=eqsin&var-cache_type=All&var-status_type=5
[02:01:32] RECOVERY - Text HTTP 5xx reqs/min on graphite1004 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=All&var-cache_type=text&var-status_type=5
[02:01:37] (CR) Dzahn: "please see https://icinga.wikimedia.org/cgi-bin/icinga/extinfo.cgi?type=2&host=lvs2003&service=PyBal+IPVS+diff+check , https://icinga.wik" [puppet] - https://gerrit.wikimedia.org/r/509106 (https://phabricator.wikimedia.org/T222899) (owner: Ottomata)
[02:03:44] ACKNOWLEDGEMENT - PyBal IPVS diff check on lvs2003 is CRITICAL: CRITICAL: Services known to PyBal but not to IPVS: set([10.2.1.45:32192]) daniel_zahn https://phabricator.wikimedia.org/T222899 https://wikitech.wikimedia.org/wiki/PyBal
[02:03:54] ACKNOWLEDGEMENT - PyBal connections to etcd on lvs2003 is CRITICAL: CRITICAL: 39 connections established with conf2001.codfw.wmnet:2379 (min=40) daniel_zahn https://phabricator.wikimedia.org/T222899 https://wikitech.wikimedia.org/wiki/PyBal
[02:04:35] ok.. afk now
[02:38:32] PROBLEM - Postgres Replication Lag on maps1003 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 32354216 and 1 seconds
[02:39:54] RECOVERY - Postgres Replication Lag on maps1003 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 5840 and 47 seconds
[02:50:34] PROBLEM - HTTP availability for Varnish at eqiad on icinga1001 is CRITICAL: job=varnish-text site=eqiad https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1
[02:50:54] PROBLEM - HTTP availability for Nginx -SSL terminators- at eqiad on icinga1001 is CRITICAL: cluster=cache_text site=eqiad https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[02:50:54] PROBLEM - HTTP availability for Nginx -SSL terminators- at ulsfo on icinga1001 is CRITICAL: cluster=cache_text site=ulsfo https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[02:50:54] PROBLEM - HTTP availability for Nginx -SSL terminators- at esams on icinga1001 is CRITICAL: cluster=cache_text site=esams https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[02:50:54] PROBLEM - HTTP availability for Varnish at esams on icinga1001 is CRITICAL: job=varnish-text site=esams https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1
[02:51:00] PROBLEM - HTTP availability for Nginx -SSL terminators- at codfw on icinga1001 is CRITICAL: cluster=cache_text site=codfw https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[02:53:02] PROBLEM - Esams HTTP 5xx reqs/min on graphite1004 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=esams&var-cache_type=All&var-status_type=5
[02:53:02] PROBLEM - Eqsin HTTP 5xx reqs/min on graphite1004 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=eqsin&var-cache_type=All&var-status_type=5
[02:53:02] PROBLEM - Eqiad HTTP 5xx reqs/min on graphite1004 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=eqiad&var-cache_type=All&var-status_type=5
[02:53:02] PROBLEM - Text HTTP 5xx reqs/min on graphite1004 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=All&var-cache_type=text&var-status_type=5
[02:54:26] PROBLEM - Ulsfo HTTP 5xx reqs/min on graphite1004 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=ulsfo&var-cache_type=All&var-status_type=5
[02:54:42] RECOVERY - HTTP availability for Varnish at eqiad on icinga1001 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1
[02:55:00] RECOVERY - HTTP availability for Nginx -SSL terminators- at eqiad on icinga1001 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[02:55:00] RECOVERY - HTTP availability for Nginx -SSL terminators- at ulsfo on icinga1001 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[02:55:01] RECOVERY - HTTP availability for Nginx -SSL terminators- at esams on icinga1001 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[02:55:01] RECOVERY - HTTP availability for Varnish at esams on icinga1001 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1
[02:55:06] RECOVERY - HTTP availability for Nginx -SSL terminators- at codfw on icinga1001 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[02:58:30] RECOVERY - Esams HTTP 5xx reqs/min on graphite1004 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=esams&var-cache_type=All&var-status_type=5
[02:59:52] RECOVERY - Eqiad HTTP 5xx reqs/min on graphite1004 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=eqiad&var-cache_type=All&var-status_type=5
[03:01:16] RECOVERY - Eqsin HTTP 5xx reqs/min on graphite1004 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=eqsin&var-cache_type=All&var-status_type=5
[03:01:16] RECOVERY - Text HTTP 5xx reqs/min on graphite1004 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=All&var-cache_type=text&var-status_type=5
[03:01:18] RECOVERY - Ulsfo HTTP 5xx reqs/min on graphite1004 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=ulsfo&var-cache_type=All&var-status_type=5
[03:16:54] Puppet, MobileFrontend, Readers-Web-Backlog (Tracking), User-Jdlrobson: Mobile site does not automatically redirect to desktop version (and not possible to use browser "use desktop view") - https://phabricator.wikimedia.org/T60425 (Zillionair1) >>! In T60425#604115, @MaxSem wrote: > This is inten...
[03:30:22] PROBLEM - DPKG on ms-be2017 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.0.137: Connection reset by peer
[03:30:32] PROBLEM - swift-account-replicator on ms-be2017 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.0.137: Connection reset by peer https://wikitech.wikimedia.org/wiki/Swift
[03:30:32] PROBLEM - dhclient process on ms-be2017 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.0.137: Connection reset by peer
[03:31:46] RECOVERY - swift-account-replicator on ms-be2017 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-account-replicator https://wikitech.wikimedia.org/wiki/Swift
[03:31:46] RECOVERY - dhclient process on ms-be2017 is OK: PROCS OK: 0 processes with command name dhclient
[03:37:08] RECOVERY - DPKG on ms-be2017 is OK: All packages OK
[03:49:42] RECOVERY - Logstash rate of ingestion percent change compared to yesterday on icinga1001 is OK: (C)130 ge (W)110 ge 105.7 https://grafana.wikimedia.org/dashboard/db/logstash?orgId=1&panelId=2&fullscreen
[04:01:48] PROBLEM - puppet last run on ms-be1030 is CRITICAL: CRITICAL: Failed to apply catalog, zero resources tracked by Puppet. It might be a dependency cycle.
[04:19:32] PROBLEM - Check systemd state on an-coord1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[04:28:44] RECOVERY - puppet last run on ms-be1030 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[04:34:35] Puppet, MobileFrontend, Readers-Web-Backlog (Tracking), User-Jdlrobson: Mobile site does not automatically redirect to desktop version (and not possible to use browser "use desktop view") - https://phabricator.wikimedia.org/T60425 (MaxSem) Why do you think there aren't such users? After a few mi...
[04:55:31] (PS11) Marostegui: db-eqiad,db-codfw.php: Change second parsercache key [mediawiki-config] - https://gerrit.wikimedia.org/r/508170 (https://phabricator.wikimedia.org/T210725)
[04:55:33] (CR) Marostegui: db-eqiad,db-codfw.php: Change second parsercache key [mediawiki-config] - https://gerrit.wikimedia.org/r/508170 (https://phabricator.wikimedia.org/T210725) (owner: Marostegui)
[05:00:37] (PS1) Marostegui: db2106,db2110,db2119: Enable notifications [puppet] - https://gerrit.wikimedia.org/r/510071 (https://phabricator.wikimedia.org/T222772)
[05:01:12] (CR) Marostegui: [C: +1] m1 proxy: Switch to use dbproxy1001 in preparation for b5-eqiad maint [dns] - https://gerrit.wikimedia.org/r/509894 (https://phabricator.wikimedia.org/T223126) (owner: Jcrespo)
[05:02:33] (CR) Marostegui: [C: +2] db2106,db2110,db2119: Enable notifications [puppet] - https://gerrit.wikimedia.org/r/510071 (https://phabricator.wikimedia.org/T222772) (owner: Marostegui)
[05:21:32] (PS1) Marostegui: mariadb: Prepare to decommission db2034 [puppet] - https://gerrit.wikimedia.org/r/510074 (https://phabricator.wikimedia.org/T219493)
[05:35:52] In 25 minutes I will deploy the parsercache key change, so I will hold a lock on deploy1001, if you need to deploy something from there, please coordinate with me
[05:37:33] RECOVERY - Router interfaces on cr2-eqord is OK: OK: host 208.80.154.198, interfaces up: 55, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[05:37:37] RECOVERY - Router interfaces on cr3-ulsfo is OK: OK: host 198.35.26.192, interfaces up: 68, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[05:58:14] (CR) Marostegui: [C: +2] db-eqiad,db-codfw.php: Change second parsercache key [mediawiki-config] - https://gerrit.wikimedia.org/r/508170 (https://phabricator.wikimedia.org/T210725) (owner: Marostegui)
[05:59:17] (Merged) jenkins-bot: db-eqiad,db-codfw.php: Change second parsercache key [mediawiki-config] - https://gerrit.wikimedia.org/r/508170 (https://phabricator.wikimedia.org/T210725) (owner: Marostegui)
[05:59:31] (CR) jenkins-bot: db-eqiad,db-codfw.php: Change second parsercache key [mediawiki-config] - https://gerrit.wikimedia.org/r/508170 (https://phabricator.wikimedia.org/T210725) (owner: Marostegui)
[05:59:41] (CR) Ema: [C: +2] Make "disable_configuration_modification" work [debs/trafficserver] - https://gerrit.wikimedia.org/r/509871 (owner: Ema)
[06:00:04] marostegui: Dear deployers, time to do the Change parsercache key deploy. Dont look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190514T0600).
[06:00:09] \o/
[06:01:41] !log marostegui@deploy1001 Synchronized wmf-config/db-codfw.php: Change parsercache on codfw T210725 (duration: 00m 54s)
[06:01:43] !log Lock wmf-config deployment on deploy1001 to slowly change parsercache key on eqiad - T210725
[06:01:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:01:47] T210725: Replace parsercache keys to something more meaningful on db-XXXX.php - https://phabricator.wikimedia.org/T210725
[06:01:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:02:11] !log Deploy parsercache change to eqiad canaries - T210725
[06:02:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:07:14] (PS5) Ema: ATS: require explicit Cache-Control/Expires [puppet] - https://gerrit.wikimedia.org/r/509787 (https://phabricator.wikimedia.org/T222937)
[06:08:42] (Abandoned) Ema: ATS: do not cache server errors [puppet] - https://gerrit.wikimedia.org/r/509443 (https://phabricator.wikimedia.org/T222937) (owner: Ema)
[06:09:14] (CR) Vgutierrez: [C: +1] ATS: require explicit Cache-Control/Expires [puppet] - https://gerrit.wikimedia.org/r/509787 (https://phabricator.wikimedia.org/T222937) (owner: Ema)
[06:11:01] (PS10) Vgutierrez: ATS: Provide a unified monitoring define [puppet] - https://gerrit.wikimedia.org/r/506986 (https://phabricator.wikimedia.org/T221217)
[06:11:17] (PS2) Vgutierrez: ATS: Ensure that server's cipher suites preference is being honored [puppet] - https://gerrit.wikimedia.org/r/509771 (https://phabricator.wikimedia.org/T221594)
[06:11:31] (PS37) Vgutierrez: ATS: Provide a TLS terminator profile and backend+TLS role [puppet] - https://gerrit.wikimedia.org/r/506398 (https://phabricator.wikimedia.org/T221594)
[06:15:27] !log upload trafficserver 8.0.3-1wm2 to stretch-wikimedia
[06:15:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:20:10] !log cp4021: upgrade trafficserver to 8.0.3-1wm2
[06:20:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:20:50] Operations, DBA, MediaWiki-Configuration, Patch-For-Review, and 2 others: Create tool to handle the state of database configuration in MediaWiki in etcd - https://phabricator.wikimedia.org/T197126 (Joe) EtcdConfig in MediaWiki has been extensively tested against failures before it was introduced,...
[06:28:05] (CR) Giuseppe Lavagetto: [C: +1] role::mediawiki::canary_appserver: remove nutcracker memcached conf [puppet] - https://gerrit.wikimedia.org/r/509462 (https://phabricator.wikimedia.org/T214275) (owner: Elukey)
[06:34:46] (CR) Ema: [C: +1] flake8 - varnish: update python file extensions so they are detected by CI [puppet] - https://gerrit.wikimedia.org/r/509921 (https://phabricator.wikimedia.org/T144169) (owner: Jbond)
[06:42:32] (PS4) Giuseppe Lavagetto: stdlib: use Stdlib::Port in place of our local definition [puppet] - https://gerrit.wikimedia.org/r/481818
[06:45:10] !log cp-ats: upgrade trafficserver to 8.0.3-1wm2
[06:45:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:18:24] (CR) Muehlenhoff: [C: +1] "One comment inline, but looks good." (1 comment) [puppet] - https://gerrit.wikimedia.org/r/509472 (owner: Dzahn)
[07:19:42] (CR) Marostegui: "@elukey, is the eventlogging url I suggested the best one or there are some others more specific you'd like to add?" [puppet] - https://gerrit.wikimedia.org/r/509552 (https://phabricator.wikimedia.org/T197873) (owner: Dzahn)
[07:20:17] (CR) Elukey: [C: +1] mariadb: set some more Icinga notes URLs for nrpe checks [puppet] - https://gerrit.wikimedia.org/r/509552 (https://phabricator.wikimedia.org/T197873) (owner: Dzahn)
[07:22:01] (CR) Marostegui: [C: +1] mariadb: set some more Icinga notes URLs for nrpe checks [puppet] - https://gerrit.wikimedia.org/r/509552 (https://phabricator.wikimedia.org/T197873) (owner: Dzahn)
[07:43:54] (CR) Jcrespo: [C: +1] mariadb: set some more Icinga notes URLs for nrpe checks [puppet] - https://gerrit.wikimedia.org/r/509552 (https://phabricator.wikimedia.org/T197873) (owner: Dzahn)
[07:44:10] (PS2) Muehlenhoff: Update ssh keys and email for awight [puppet] - https://gerrit.wikimedia.org/r/509804 (owner: Awight)
[07:48:05] !log installing bind security updates for stretch (only client-side tools/libraries in use)
[07:48:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:51:49] RECOVERY - PyBal IPVS diff check on lvs2003 is OK: OK: no difference between hosts in IPVS/PyBal https://wikitech.wikimedia.org/wiki/PyBal
[07:53:13] (CR) Alexandros Kosiaris: [C: +2] Add eventgate-main.discovery.wmnet [dns] - https://gerrit.wikimedia.org/r/509912 (https://phabricator.wikimedia.org/T222899) (owner: Ottomata)
[07:53:45] RECOVERY - PyBal connections to etcd on lvs2003 is OK: OK: 40 connections established with conf2001.codfw.wmnet:2379 (min=40) https://wikitech.wikimedia.org/wiki/PyBal
[07:57:38] (PS1) Fsero: mcrouter: feat(T221346) add icinga check for certs [puppet] - https://gerrit.wikimedia.org/r/510082 (https://phabricator.wikimedia.org/T221346)
[07:57:54] (PS5) Filippo Giunchedi: prometheus: enable metrics relabel [puppet] - https://gerrit.wikimedia.org/r/508809 (https://phabricator.wikimedia.org/T193017) (owner: Mathew.onipe)
[07:58:04] (CR) Filippo Giunchedi: [C: +2] prometheus: enable metrics relabel [puppet] - https://gerrit.wikimedia.org/r/508809 (https://phabricator.wikimedia.org/T193017) (owner: Mathew.onipe)
[07:58:07] (PS2) Fsero: mcrouter: feat(T221346) add icinga check for certs [puppet] - https://gerrit.wikimedia.org/r/510082 (https://phabricator.wikimedia.org/T221346)
[07:59:26] (CR) jerkins-bot: [V: -1] mcrouter: feat(T221346) add icinga check for certs [puppet] - https://gerrit.wikimedia.org/r/510082 (https://phabricator.wikimedia.org/T221346) (owner: Fsero)
[08:05:07] (PS3) Muehlenhoff: Update ssh keys and email for awight [puppet] - https://gerrit.wikimedia.org/r/509804 (owner: Awight)
[08:06:12] (CR) Muehlenhoff: [C: +2] Update ssh keys and email for awight [puppet] - https://gerrit.wikimedia.org/r/509804 (owner: Awight)
[08:07:17] (PS3) Fsero: mcrouter: feat(T221346) add icinga check for certs [puppet] - https://gerrit.wikimedia.org/r/510082 (https://phabricator.wikimedia.org/T221346)
[08:08:35] (CR) jerkins-bot: [V: -1] mcrouter: feat(T221346) add icinga check for certs [puppet] - https://gerrit.wikimedia.org/r/510082 (https://phabricator.wikimedia.org/T221346) (owner: Fsero)
[08:12:33] (PS1) Marostegui: db-eqiad,db-codfw.php: Remove db2034 [mediawiki-config] - https://gerrit.wikimedia.org/r/510083 (https://phabricator.wikimedia.org/T219493)
[08:12:43] (PS4) Fsero: mcrouter: feat(T221346) add icinga check for certs [puppet] - https://gerrit.wikimedia.org/r/510082 (https://phabricator.wikimedia.org/T221346)
[08:12:59] PROBLEM - puppet last run on logstash2002 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 3 minutes ago with 1 failures. Failed resources (up to 3 shown): Package[dnsutils]
[08:15:52] (CR) Giuseppe Lavagetto: "Very good work overall! I have a few questions (and some smaller code comments too)" (5 comments) [puppet] - https://gerrit.wikimedia.org/r/510082 (https://phabricator.wikimedia.org/T221346) (owner: Fsero)
[08:18:31] PROBLEM - puppet last run on mw1273 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 5 minutes ago with 1 failures. Failed resources (up to 3 shown): Package[dnsutils]
[08:20:10] (PS1) Filippo Giunchedi: Revert "prometheus: enable metrics relabel" [puppet] - https://gerrit.wikimedia.org/r/510084
[08:20:26] (PS2) Filippo Giunchedi: Revert "prometheus: enable metrics relabel" [puppet] - https://gerrit.wikimedia.org/r/510084
[08:20:32] (CR) Filippo Giunchedi: [C: +2] Revert "prometheus: enable metrics relabel" [puppet] - https://gerrit.wikimedia.org/r/510084 (owner: Filippo Giunchedi)
[08:20:53] (CR) Filippo Giunchedi: [V: +2 C: +2] Revert "prometheus: enable metrics relabel" [puppet] - https://gerrit.wikimedia.org/r/510084 (owner: Filippo Giunchedi)
[08:26:33] Operations, MediaWiki-extensions-WikibaseClient, MediaWiki-extensions-WikibaseRepository, Wikidata, and 7 others: Fix inefficient CacheAwarePropertyInfoStore memcached access pattern - https://phabricator.wikimedia.org/T97368 (elukey) From a quick look with memkeys on mc1033, these are the top ta...
[08:27:33] addshore: hello there! :)
[08:38:18] (PS2) Jcrespo: m1 proxy: Switch to use dbproxy1001 in preparation for b5-eqiad maint [dns] - https://gerrit.wikimedia.org/r/509894 (https://phabricator.wikimedia.org/T223126)
[08:39:12] (PS5) Giuseppe Lavagetto: stdlib: use Stdlib::Port in place of our local definition [puppet] - https://gerrit.wikimedia.org/r/481818
[08:39:47] RECOVERY - puppet last run on logstash2002 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[08:40:17] (CR) Giuseppe Lavagetto: [C: +2] stdlib: use Stdlib::Port in place of our local definition [puppet] - https://gerrit.wikimedia.org/r/481818 (owner: Giuseppe Lavagetto)
[08:45:19] RECOVERY - puppet last run on mw1273 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures
[08:51:18] (CR) Jcrespo: [C: +2] m1 proxy: Switch to use dbproxy1001 in preparation for b5-eqiad maint [dns] - https://gerrit.wikimedia.org/r/509894 (https://phabricator.wikimedia.org/T223126) (owner: Jcrespo)
[08:52:35] RECOVERY - Check systemd state on an-coord1001 is OK: OK - running: The system is fully operational
[08:53:16] !log failing connections over dbproxy1006 to dbproxy1001
[08:53:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:53:34] Operations, cloud-services-team (Kanban): WMCS-related dashboards using Diamond metrics - https://phabricator.wikimedia.org/T210850 (fgiunchedi) >>! In T210850#5177775, @aborrero wrote: > Sorry folks, there are a couple of things that I don't understand. The nova_fullstack_test.py script is sending colle...
[08:54:12] (CR) Volans: "recheck" [software/cumin] - https://gerrit.wikimedia.org/r/508079 (owner: Volans)
[08:57:42] (CR) Volans: "I'm quite on the fence on this to live in the puppet repo."
(037 comments) [puppet] - 10https://gerrit.wikimedia.org/r/510082 (https://phabricator.wikimedia.org/T221346) (owner: 10Fsero) [09:01:07] (03CR) 10jerkins-bot: [V: 04-1] tox: refactor configuration [software/cumin] - 10https://gerrit.wikimedia.org/r/508079 (owner: 10Volans) [09:02:28] !log statsd_exporter 0.9 upgrade on logstash - T220709 [09:02:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:02:32] T220709: Upgrade statsd_exporter to 0.9 - https://phabricator.wikimedia.org/T220709 [09:02:45] (03CR) 10Fsero: "> Patch Set 4:" (035 comments) [puppet] - 10https://gerrit.wikimedia.org/r/510082 (https://phabricator.wikimedia.org/T221346) (owner: 10Fsero) [09:06:38] (03PS1) 10Filippo Giunchedi: thumbor: statsd_exporter mappings to seconds [puppet] - 10https://gerrit.wikimedia.org/r/510089 (https://phabricator.wikimedia.org/T220709) [09:06:54] 10Operations, 10cloud-services-team (Kanban): WMCS-related dashboards using Diamond metrics - https://phabricator.wikimedia.org/T210850 (10MoritzMuehlenhoff) >>! In T210850#5179448, @fgiunchedi wrote: >>>! In T210850#5177775, @aborrero wrote: >> Sorry folks, there are a couple of things that I don't understand... 
[09:07:30] (03CR) 10Filippo Giunchedi: [C: 03+1] thumbor: statsd_exporter mappings to seconds [puppet] - 10https://gerrit.wikimedia.org/r/510089 (https://phabricator.wikimedia.org/T220709) (owner: 10Filippo Giunchedi) [09:09:37] PROBLEM - DPKG on dbproxy1006 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [09:10:13] (03PS7) 10Mobrovac: Handle application/octet-stream requests properly; release v0.1.5 [software/service-checker] - 10https://gerrit.wikimedia.org/r/507531 (https://phabricator.wikimedia.org/T220401) [09:11:05] (03CR) 10jerkins-bot: [V: 04-1] Handle application/octet-stream requests properly; release v0.1.5 [software/service-checker] - 10https://gerrit.wikimedia.org/r/507531 (https://phabricator.wikimedia.org/T220401) (owner: 10Mobrovac) [09:12:21] RECOVERY - DPKG on dbproxy1006 is OK: All packages OK [09:15:06] 10Operations, 10Patch-For-Review: ferm: Log dropped packets - https://phabricator.wikimedia.org/T116011 (10jbond) [09:15:09] 10Operations, 10Wikimedia-Logstash, 10Patch-For-Review, 10Security: Ferm: send ferm/iptables/ulogd logs to Kafaka/logstash/elasticsearch - https://phabricator.wikimedia.org/T220987 (10jbond) 05Open→03Resolved logs are now been sent to kafka, however we still need to role the profile::firewall::logging... 
[09:15:47] !log statsd_exporter 0.9 upgrade on ores - T220709 [09:15:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:15:53] T220709: Upgrade statsd_exporter to 0.9 - https://phabricator.wikimedia.org/T220709 [09:15:59] PROBLEM - Disk space on rhenium is CRITICAL: DISK CRITICAL - free space: / 93 MB (0% inode=96%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space [09:18:49] (03PS6) 10Ema: ATS: require explicit Cache-Control/Expires [puppet] - 10https://gerrit.wikimedia.org/r/509787 (https://phabricator.wikimedia.org/T222937) [09:19:30] looks like disk jumped from 80 to 100 % on rhenium, "netinsights" host [09:19:37] (03PS3) 10Volans: tox: refactor configuration [software/cumin] - 10https://gerrit.wikimedia.org/r/508079 [09:19:39] (03PS3) 10Volans: flake8: enforce import order and adopt W504 [software/cumin] - 10https://gerrit.wikimedia.org/r/508080 [09:19:41] (03PS3) 10Volans: documentation: fix typo [software/cumin] - 10https://gerrit.wikimedia.org/r/508081 [09:19:43] (03PS1) 10Volans: tests: temporarily force bandit < 1.6.0 [software/cumin] - 10https://gerrit.wikimedia.org/r/510092 [09:20:49] (03PS2) 10Arturo Borrero Gonzalez: icinga: check_eth: fix bridge and tap interface regexp [puppet] - 10https://gerrit.wikimedia.org/r/509888 (https://phabricator.wikimedia.org/T223107) [09:21:01] (03CR) 10Ema: [C: 03+2] ATS: require explicit Cache-Control/Expires [puppet] - 10https://gerrit.wikimedia.org/r/509787 (https://phabricator.wikimedia.org/T222937) (owner: 10Ema) [09:23:24] (03PS3) 10Vgutierrez: ATS: Ensure that server's cipher suites preference is being honored [puppet] - 10https://gerrit.wikimedia.org/r/509771 (https://phabricator.wikimedia.org/T221594) [09:23:26] (03PS38) 10Vgutierrez: ATS: Provide a TLS terminator profile and backend+TLS role [puppet] - 10https://gerrit.wikimedia.org/r/506398 (https://phabricator.wikimedia.org/T221594) [09:23:28] (03PS1) 10Vgutierrez: ATS: Provide support for TLS certificates with 
different SNI [puppet] - 10https://gerrit.wikimedia.org/r/510093 (https://phabricator.wikimedia.org/T221594) [09:23:38] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] icinga: check_eth: fix bridge and tap interface regexp [puppet] - 10https://gerrit.wikimedia.org/r/509888 (https://phabricator.wikimedia.org/T223107) (owner: 10Arturo Borrero Gonzalez) [09:23:54] (03PS3) 10Arturo Borrero Gonzalez: icinga: check_eth: fix bridge and tap interface regexp [puppet] - 10https://gerrit.wikimedia.org/r/509888 (https://phabricator.wikimedia.org/T223107) [09:24:29] (03CR) 10Jcrespo: [C: 03+1] mariadb: Prepare to decommission db2034 [puppet] - 10https://gerrit.wikimedia.org/r/510074 (https://phabricator.wikimedia.org/T219493) (owner: 10Marostegui) [09:24:31] 10Operations, 10cloud-services-team (Kanban): WMCS-related dashboards using Diamond metrics - https://phabricator.wikimedia.org/T210850 (10fgiunchedi) >>! In T210850#5179495, @MoritzMuehlenhoff wrote: >>>! In T210850#5179448, @fgiunchedi wrote: >>>>! In T210850#5177775, @aborrero wrote: >>> Sorry folks, there... 
[09:24:42] (03CR) 10jerkins-bot: [V: 04-1] ATS: Provide a TLS terminator profile and backend+TLS role [puppet] - 10https://gerrit.wikimedia.org/r/506398 (https://phabricator.wikimedia.org/T221594) (owner: 10Vgutierrez) [09:25:39] RECOVERY - Disk space on rhenium is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space [09:27:36] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Deploy second parsercache key change everywhere after deploying it in batches first T210725 (duration: 00m 50s) [09:27:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:27:41] T210725: Replace parsercache keys to something more meaningful on db-XXXX.php - https://phabricator.wikimedia.org/T210725 [09:28:07] (03PS39) 10Vgutierrez: ATS: Provide a TLS terminator profile and backend+TLS role [puppet] - 10https://gerrit.wikimedia.org/r/506398 (https://phabricator.wikimedia.org/T221594) [09:29:14] (03CR) 10jerkins-bot: [V: 04-1] ATS: Provide a TLS terminator profile and backend+TLS role [puppet] - 10https://gerrit.wikimedia.org/r/506398 (https://phabricator.wikimedia.org/T221594) (owner: 10Vgutierrez) [09:29:30] do you even yaml, bro? [09:29:31] !log Parsercache deployment window FINISHED [09:29:33] sigh :( [09:29:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:29:46] hashar: you around? [09:31:00] hashar: jenkins-bot didn't verify my gerrit patch for almost 10 minutes now https://gerrit.wikimedia.org/r/c/operations/puppet/+/509888 [09:31:48] arturo: looks like there is still things being checked (slowly but being checked) https://integration.wikimedia.org/zuul/ [09:32:08] 10Operations, 10Prod-Kubernetes, 10serviceops, 10Kubernetes, 10User-fsero: placeholder task for migration problems - https://phabricator.wikimedia.org/T222210 (10hashar) I can confirm that instances are now able to fetch new containers immediately after they have been published. So that solves it for me.... 
[09:32:10] arturo: yes I am there :) [09:32:24] cool thanks [09:32:58] arturo: i will dig in the CI logs [09:33:14] arturo: meanwhile you might want to try commenting "recheck" to add the change back in CI [09:33:33] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/509888 (https://phabricator.wikimedia.org/T223107) (owner: 10Arturo Borrero Gonzalez) [09:33:35] ok [09:33:44] hmm [09:34:08] ah yes I see where the issue is :-/ [09:34:25] since the change already has a Code-Review +2, it is not tested again [09:34:27] !log restart apache on ununpentium [09:34:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:34:46] since for most repositories it should instead be handled by the gate-and-submit pipeline which would ultimately merge the change [09:34:57] that is to avoid running tests twice [09:35:06] (once as a new patchset and another time because of the CR+2) [09:35:15] but operations/puppet does not have such CR+2 thing bah [09:35:29] arturo: tldr yes some filter is broken in CI I will dig into it [09:35:43] (03PS4) 10Vgutierrez: prometheus: Toggle SSL certificate verification for trafficserver-exporter [puppet] - 10https://gerrit.wikimedia.org/r/508327 (https://phabricator.wikimedia.org/T221217) [09:35:45] (03PS11) 10Vgutierrez: ATS: Provide a unified monitoring define [puppet] - 10https://gerrit.wikimedia.org/r/506986 (https://phabricator.wikimedia.org/T221217) [09:35:47] (03PS4) 10Vgutierrez: ATS: Ensure that server's cipher suites preference is being honored [puppet] - 10https://gerrit.wikimedia.org/r/509771 (https://phabricator.wikimedia.org/T221594) [09:35:49] (03PS2) 10Vgutierrez: ATS: Provide support for TLS certificates with different SNI [puppet] - 10https://gerrit.wikimedia.org/r/510093 (https://phabricator.wikimedia.org/T221594) [09:35:51] (03PS40) 10Vgutierrez: ATS: Provide a TLS terminator profile and backend+TLS role [puppet] - 10https://gerrit.wikimedia.org/r/506398 
(https://phabricator.wikimedia.org/T221594) [09:38:12] (03CR) 10jerkins-bot: [V: 04-1] ATS: Provide a TLS terminator profile and backend+TLS role [puppet] - 10https://gerrit.wikimedia.org/r/506398 (https://phabricator.wikimedia.org/T221594) (owner: 10Vgutierrez) [09:39:24] hashar: ok, I'm glad I pinged you then :-) [09:40:03] and I have filed it as https://phabricator.wikimedia.org/T223209 [09:40:14] but I gotta find a workaround for operations/puppet :-\ [09:40:50] (03CR) 10Arturo Borrero Gonzalez: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/509888 (https://phabricator.wikimedia.org/T223107) (owner: 10Arturo Borrero Gonzalez) [09:42:04] (03PS3) 10Michael Große: Add configuration for EntitySchema ShExSimpleUrl [mediawiki-config] - 10https://gerrit.wikimedia.org/r/509878 [09:42:22] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] icinga: check_eth: fix bridge and tap interface regexp [puppet] - 10https://gerrit.wikimedia.org/r/509888 (https://phabricator.wikimedia.org/T223107) (owner: 10Arturo Borrero Gonzalez) [09:42:23] 10Operations, 10cloud-services-team (Kanban): WMCS-related dashboards using Diamond metrics - https://phabricator.wikimedia.org/T210850 (10aborrero) Ok! I'm fine renaming the metrics. I would need a suggestion though :-) `cloudvps.novafullstack.*`?
[09:42:41] (03PS3) 10Michael Große: Add EntitySchema to extension-list [mediawiki-config] - 10https://gerrit.wikimedia.org/r/509437 (https://phabricator.wikimedia.org/T221650) [09:42:43] (03PS4) 10Michael Große: Add configuration for EntitySchema ShExSimpleUrl [mediawiki-config] - 10https://gerrit.wikimedia.org/r/509878 [09:43:55] (03CR) 10jerkins-bot: [V: 04-1] Add configuration for EntitySchema ShExSimpleUrl [mediawiki-config] - 10https://gerrit.wikimedia.org/r/509878 (owner: 10Michael Große) [09:43:57] (03PS2) 10Filippo Giunchedi: cassandra: add restbase10[19-27] [puppet] - 10https://gerrit.wikimedia.org/r/509422 (https://phabricator.wikimedia.org/T219404) [09:43:59] (03CR) 10jerkins-bot: [V: 04-1] Add configuration for EntitySchema ShExSimpleUrl [mediawiki-config] - 10https://gerrit.wikimedia.org/r/509878 (owner: 10Michael Große) [09:44:04] (03CR) 10Filippo Giunchedi: [C: 03+2] cassandra: add restbase10[19-27] [puppet] - 10https://gerrit.wikimedia.org/r/509422 (https://phabricator.wikimedia.org/T219404) (owner: 10Filippo Giunchedi) [09:45:40] (03PS5) 10Michael Große: Add configuration for EntitySchema ShExSimpleUrl [mediawiki-config] - 10https://gerrit.wikimedia.org/r/509878 [09:47:41] (03CR) 10Ema: [C: 03+1] "Two nitpicks, lgtm!" 
(032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/510093 (https://phabricator.wikimedia.org/T221594) (owner: 10Vgutierrez) [09:48:56] (03CR) 10Marostegui: [C: 03+2] db-eqiad,db-codfw.php: Remove db2034 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/510083 (https://phabricator.wikimedia.org/T219493) (owner: 10Marostegui) [09:50:07] (03Merged) 10jenkins-bot: db-eqiad,db-codfw.php: Remove db2034 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/510083 (https://phabricator.wikimedia.org/T219493) (owner: 10Marostegui) [09:50:19] (03PS2) 10Jbond: flake8 - varnish: update python file extensions so they are detected by CI [puppet] - 10https://gerrit.wikimedia.org/r/509921 (https://phabricator.wikimedia.org/T144169) [09:50:22] (03CR) 10jenkins-bot: db-eqiad,db-codfw.php: Remove db2034 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/510083 (https://phabricator.wikimedia.org/T219493) (owner: 10Marostegui) [09:51:38] !log marostegui@deploy1001 Synchronized wmf-config/db-codfw.php: Remove db2034 from config T219493 (duration: 00m 50s) [09:51:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:51:42] T219493: Prepare to decommission 2 codfw x1 hosts db2033 and db2034 - https://phabricator.wikimedia.org/T219493 [09:51:59] !log Remove db2034 from tendril and zarcillo - T219493 [09:52:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:52:42] (03PS2) 10Marostegui: mariadb: Prepare to decommission db2034 [puppet] - 10https://gerrit.wikimedia.org/r/510074 (https://phabricator.wikimedia.org/T219493) [09:53:39] !log marostegui@deploy1001 scap failed: average error rate on 4/11 canaries increased by 10x (rerun with --force to override this check, see https://logstash.wikimedia.org/goto/db09a36be5ed3e81155041f7d46ad040 for details) [09:53:39] (03CR) 10Marostegui: [C: 03+2] mariadb: Prepare to decommission db2034 [puppet] - 10https://gerrit.wikimedia.org/r/510074 
(https://phabricator.wikimedia.org/T219493) (owner: 10Marostegui) [09:53:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:55:09] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Remove db2034 from config T219493 (duration: 00m 49s) [09:55:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:55:48] (03PS1) 10Ema: cache: reimage cp3036 as upload_ats [puppet] - 10https://gerrit.wikimedia.org/r/510106 (https://phabricator.wikimedia.org/T222937) [09:57:08] (03PS3) 10Vgutierrez: ATS: Provide support for TLS certificates with different SNI [puppet] - 10https://gerrit.wikimedia.org/r/510093 (https://phabricator.wikimedia.org/T221594) [09:57:10] (03PS41) 10Vgutierrez: ATS: Provide a TLS terminator profile and backend+TLS role [puppet] - 10https://gerrit.wikimedia.org/r/506398 (https://phabricator.wikimedia.org/T221594) [09:57:16] (03CR) 10Vgutierrez: "Thanks for the review!" (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/510093 (https://phabricator.wikimedia.org/T221594) (owner: 10Vgutierrez) [09:58:06] (03CR) 10Vgutierrez: [C: 03+1] cache: reimage cp3036 as upload_ats [puppet] - 10https://gerrit.wikimedia.org/r/510106 (https://phabricator.wikimedia.org/T222937) (owner: 10Ema) [09:58:18] (03CR) 10Ema: [C: 03+1] ATS: Provide support for TLS certificates with different SNI [puppet] - 10https://gerrit.wikimedia.org/r/510093 (https://phabricator.wikimedia.org/T221594) (owner: 10Vgutierrez) [09:58:27] (03PS1) 10Jcrespo: mariadb: Depool db1098 & db1131 for maintenance [mediawiki-config] - 10https://gerrit.wikimedia.org/r/510108 (https://phabricator.wikimedia.org/T223126) [09:58:48] (03CR) 10jerkins-bot: [V: 04-1] ATS: Provide a TLS terminator profile and backend+TLS role [puppet] - 10https://gerrit.wikimedia.org/r/506398 (https://phabricator.wikimedia.org/T221594) (owner: 10Vgutierrez) [10:00:04] hashar: My dear minions, it's time we take the moon! Just kidding. 
Time for MediaWiki wmf branch cut deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190514T1000). [10:01:04] !log depool cp3036 and reimage as upload_ats T222937 [10:01:13] RECOVERY - Check systemd state on cloudcontrol1004 is OK: OK - running: The system is fully operational [10:01:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:01:18] T222937: Replace Varnish backends with ATS on cache upload nodes in esams - https://phabricator.wikimedia.org/T222937 [10:01:34] (03PS42) 10Vgutierrez: ATS: Provide a TLS terminator profile and backend+TLS role [puppet] - 10https://gerrit.wikimedia.org/r/506398 (https://phabricator.wikimedia.org/T221594) [10:02:11] 10Operations, 10Release Pipeline, 10Release-Engineering-Team, 10Wikidata, and 5 others: Introduce wikidata termbox SSR to kubernetes - https://phabricator.wikimedia.org/T220402 (10akosiaris) > With respect to the end point checks it would be great to hear what we are trying to achieve with them. Our servic... [10:02:59] (03PS1) 10Jbond: yaml parse error test: re: 509462 [puppet] - 10https://gerrit.wikimedia.org/r/510110 [10:03:35] 10Operations, 10ops-codfw, 10decommission: Decommission db2034 - https://phabricator.wikimedia.org/T223216 (10Marostegui) [10:03:53] 10Operations, 10ops-codfw, 10decommission: Decommission db2034 - https://phabricator.wikimedia.org/T223216 (10Marostegui) p:05Triage→03Normal [10:04:13] (03CR) 10Ema: [C: 03+2] cache: reimage cp3036 as upload_ats [puppet] - 10https://gerrit.wikimedia.org/r/510106 (https://phabricator.wikimedia.org/T222937) (owner: 10Ema) [10:07:07] (03PS1) 10Hashar: Allow tests to run on trivial rebase [puppet] (refs/meta/config) - 10https://gerrit.wikimedia.org/r/510112 (https://phabricator.wikimedia.org/T223209) [10:07:44] (03CR) 10Mathew.onipe: [C: 03+1] "Look good! I assume the guard files will/have been created?" 
[puppet] - 10https://gerrit.wikimedia.org/r/509409 (https://phabricator.wikimedia.org/T214166) (owner: 10Filippo Giunchedi) [10:07:58] 10Operations, 10Traffic, 10Patch-For-Review: Replace Varnish backends with ATS on cache upload nodes in esams - https://phabricator.wikimedia.org/T222937 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by ema on cumin1001.eqiad.wmnet for hosts: ` ['cp3036.esams.wmnet'] ` The log can be found in `... [10:08:46] (03PS1) 10Muehlenhoff: Add a Spicerack cook book to reboot hosts [cookbooks] - 10https://gerrit.wikimedia.org/r/510113 [10:10:37] (03CR) 10jerkins-bot: [V: 04-1] Add a Spicerack cook book to reboot hosts [cookbooks] - 10https://gerrit.wikimedia.org/r/510113 (owner: 10Muehlenhoff) [10:11:41] (03CR) 10Filippo Giunchedi: "> Patch Set 1: Code-Review+1" [puppet] - 10https://gerrit.wikimedia.org/r/509409 (https://phabricator.wikimedia.org/T214166) (owner: 10Filippo Giunchedi) [10:12:00] (03PS4) 10Vgutierrez: ATS: Provide support for TLS certificates with different SNI [puppet] - 10https://gerrit.wikimedia.org/r/510093 (https://phabricator.wikimedia.org/T221594) [10:12:02] (03PS43) 10Vgutierrez: ATS: Provide a TLS terminator profile and backend+TLS role [puppet] - 10https://gerrit.wikimedia.org/r/506398 (https://phabricator.wikimedia.org/T221594) [10:12:11] (03PS1) 10Jbond: testing - ignore [puppet] - 10https://gerrit.wikimedia.org/r/510115 [10:12:31] (03CR) 10Hashar: [C: 04-1] "Actually copyMaxScore default to false and it is not set in operations/puppet or All-Projects.git ..." [puppet] (refs/meta/config) - 10https://gerrit.wikimedia.org/r/510112 (https://phabricator.wikimedia.org/T223209) (owner: 10Hashar) [10:13:05] 10Operations, 10Release Pipeline, 10Release-Engineering-Team, 10Wikidata, and 5 others: Introduce wikidata termbox SSR to kubernetes - https://phabricator.wikimedia.org/T220402 (10akosiaris) >>! In T220402#5177862, @Pablo-WMDE wrote: > Hi @akosiaris - thanks for getting back to us. 
> >> sending a Host: HT... [10:13:14] (03PS2) 10Muehlenhoff: Add a Spicerack cook book to reboot hosts [cookbooks] - 10https://gerrit.wikimedia.org/r/510113 [10:14:42] (03PS2) 10Jbond: flake8: Add python extension so theses scripts can be picked up via CI [puppet] - 10https://gerrit.wikimedia.org/r/509467 (https://phabricator.wikimedia.org/T144169) [10:15:01] (03CR) 10jerkins-bot: [V: 04-1] Add a Spicerack cook book to reboot hosts [cookbooks] - 10https://gerrit.wikimedia.org/r/510113 (owner: 10Muehlenhoff) [10:15:18] (03CR) 10Jbond: [C: 03+2] flake8: Add python extension so theses scripts can be picked up via CI [puppet] - 10https://gerrit.wikimedia.org/r/509467 (https://phabricator.wikimedia.org/T144169) (owner: 10Jbond) [10:16:10] !log Cutting branches for 1.34.0-wmf.5 [10:16:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:17:05] (03CR) 10Jbond: [C: 03+2] flake8: update python file extensions so they are detected by CI [puppet] - 10https://gerrit.wikimedia.org/r/509885 (https://phabricator.wikimedia.org/T144169) (owner: 10Jbond) [10:17:14] (03PS2) 10Jbond: flake8: update python file extensions so they are detected by CI [puppet] - 10https://gerrit.wikimedia.org/r/509885 (https://phabricator.wikimedia.org/T144169) [10:17:53] (03CR) 10Filippo Giunchedi: [C: 03+1] logstash: add logstash-filter-truncate plugin [software/logstash/plugins] - 10https://gerrit.wikimedia.org/r/509880 (https://phabricator.wikimedia.org/T187147) (owner: 10Herron) [10:18:21] (03PS3) 10Muehlenhoff: Add a Spicerack cook book to reboot hosts [cookbooks] - 10https://gerrit.wikimedia.org/r/510113 [10:19:05] (03PS2) 10Jbond: testing - ignore [puppet] - 10https://gerrit.wikimedia.org/r/510115 [10:20:30] PROBLEM - Check systemd state on cloudcontrol1004 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
[10:22:08] (03CR) 10Filippo Giunchedi: Puppet, add RPKI validation daemon (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/508928 (https://phabricator.wikimedia.org/T220669) (owner: 10Ayounsi) [10:22:16] (03CR) 10Filippo Giunchedi: [C: 03+1] "LGTM!" [puppet] - 10https://gerrit.wikimedia.org/r/508956 (https://phabricator.wikimedia.org/T220669) (owner: 10Ayounsi) [10:23:03] (03CR) 10Jbond: [C: 03+2] "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/509885 (https://phabricator.wikimedia.org/T144169) (owner: 10Jbond) [10:23:53] (03CR) 10Filippo Giunchedi: [C: 04-1] "See inline, LGTM other than those comments" (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/508952 (https://phabricator.wikimedia.org/T184086) (owner: 10Paladox) [10:23:58] (03PS3) 10Jbond: testing - ignore [puppet] - 10https://gerrit.wikimedia.org/r/510115 [10:27:23] (03CR) 10Filippo Giunchedi: [C: 03+1] flake8: update python file extensions so they are detected by CI [puppet] - 10https://gerrit.wikimedia.org/r/509885 (https://phabricator.wikimedia.org/T144169) (owner: 10Jbond) [10:30:08] (03PS4) 10Jbond: testing - ignore [puppet] - 10https://gerrit.wikimedia.org/r/510115 [10:31:33] (03PS1) 10Hashar: admin: add pushInsteadOf in my .gitconfig [puppet] - 10https://gerrit.wikimedia.org/r/510119 [10:31:36] (03CR) 10Jbond: [V: 03+2 C: 03+2] flake8: update python file extensions so they are detected by CI [puppet] - 10https://gerrit.wikimedia.org/r/509885 (https://phabricator.wikimedia.org/T144169) (owner: 10Jbond) [10:32:12] ^^ rebase CI not run after 25 mins [10:33:38] (03CR) 10Giuseppe Lavagetto: [C: 03+2] admin: add pushInsteadOf in my .gitconfig [puppet] - 10https://gerrit.wikimedia.org/r/510119 (owner: 10Hashar) [10:33:49] (03PS2) 10Giuseppe Lavagetto: admin: add pushInsteadOf in my .gitconfig [puppet] - 10https://gerrit.wikimedia.org/r/510119 (owner: 10Hashar) [10:33:52] (03CR) 10Giuseppe Lavagetto: [V: 03+2 C: 03+2] admin: add pushInsteadOf in my .gitconfig [puppet] - 
10https://gerrit.wikimedia.org/r/510119 (owner: 10Hashar) [10:34:47] (03CR) 10Filippo Giunchedi: [C: 03+1] "Nits inline, LGTM" (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/508855 (https://phabricator.wikimedia.org/T218544) (owner: 10Jbond) [10:35:50] (03PS1) 10Jbond: role::mediawiki::canary_appserver: remove nutcracker memcached conf [puppet] - 10https://gerrit.wikimedia.org/r/510121 (https://phabricator.wikimedia.org/T214275) [10:36:07] (03PS1) 10Jcrespo: mariadb: Fix typo on multi-instance dbstore role [puppet] - 10https://gerrit.wikimedia.org/r/510122 [10:37:17] (03CR) 10Jbond: [C: 03+2] flake8 - mediawiki: update file extensions so they are detected by CI [puppet] - 10https://gerrit.wikimedia.org/r/509909 (https://phabricator.wikimedia.org/T144169) (owner: 10Jbond) [10:37:21] (03PS2) 10Jbond: flake8 - mediawiki: update file extensions so they are detected by CI [puppet] - 10https://gerrit.wikimedia.org/r/509909 (https://phabricator.wikimedia.org/T144169) [10:39:45] (03PS1) 10Arturo Borrero Gonzalez: openstack: fix weird systemd state in standby controller node [puppet] - 10https://gerrit.wikimedia.org/r/510123 (https://phabricator.wikimedia.org/T215544) [10:40:07] (03PS1) 10Jbond: role::mediawiki::canary_appserver: remove nutcracker memcached conf [puppet] - 10https://gerrit.wikimedia.org/r/510124 (https://phabricator.wikimedia.org/T214275) [10:48:13] (03PS2) 10Elukey: role::mediawiki::canary_appserver: remove nutcracker memcached conf [puppet] - 10https://gerrit.wikimedia.org/r/510124 (https://phabricator.wikimedia.org/T214275) (owner: 10Jbond) [10:49:30] 10Operations, 10Traffic, 10Patch-For-Review: Replace Varnish backends with ATS on cache upload nodes in esams - https://phabricator.wikimedia.org/T222937 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['cp3036.esams.wmnet'] ` and were **ALL** successful. 
[10:50:21] (03PS8) 10Mobrovac: Handle application/octet-stream requests properly; release v0.1.5 [software/service-checker] - 10https://gerrit.wikimedia.org/r/507531 (https://phabricator.wikimedia.org/T220401) [10:50:36] (03PS3) 10Elukey: role::mediawiki::canary_appserver: remove nutcracker memcached conf [puppet] - 10https://gerrit.wikimedia.org/r/510124 (https://phabricator.wikimedia.org/T214275) (owner: 10Jbond) [10:50:48] (03PS5) 10Jbond: raid: update check_raid to detect missing disk [puppet] - 10https://gerrit.wikimedia.org/r/508855 (https://phabricator.wikimedia.org/T218544) [10:50:54] (03CR) 10Jbond: raid: update check_raid to detect missing disk (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/508855 (https://phabricator.wikimedia.org/T218544) (owner: 10Jbond) [10:51:08] (03PS2) 10Jcrespo: mariadb: Fix typo on multi-instance dbstore role [puppet] - 10https://gerrit.wikimedia.org/r/510122 [10:51:10] (03PS1) 10Jcrespo: dbproxy: Switchover labsdb1009 to 11, reorganize weights [puppet] - 10https://gerrit.wikimedia.org/r/510126 (https://phabricator.wikimedia.org/T222978) [10:51:21] (03CR) 10jerkins-bot: [V: 04-1] Handle application/octet-stream requests properly; release v0.1.5 [software/service-checker] - 10https://gerrit.wikimedia.org/r/507531 (https://phabricator.wikimedia.org/T220401) (owner: 10Mobrovac) [10:53:19] (03CR) 10Jbond: [C: 03+2] raid: update check_raid to detect missing disk [puppet] - 10https://gerrit.wikimedia.org/r/508855 (https://phabricator.wikimedia.org/T218544) (owner: 10Jbond) [10:53:47] (03PS6) 10Jbond: raid: update check_raid to detect missing disk [puppet] - 10https://gerrit.wikimedia.org/r/508855 (https://phabricator.wikimedia.org/T218544) [10:53:53] (03PS1) 10Arturo Borrero Gonzalez: openstack: codfw1dev: update nova_controller_standby [puppet] - 10https://gerrit.wikimedia.org/r/510127 [10:54:39] (03PS9) 10Mobrovac: Handle application/octet-stream requests properly; release v0.1.5 [software/service-checker] - 
10https://gerrit.wikimedia.org/r/507531 (https://phabricator.wikimedia.org/T220401) [10:56:19] (03CR) 10jerkins-bot: [V: 04-1] Handle application/octet-stream requests properly; release v0.1.5 [software/service-checker] - 10https://gerrit.wikimedia.org/r/507531 (https://phabricator.wikimedia.org/T220401) (owner: 10Mobrovac) [10:58:46] !log scap prep 1.34.0-wmf.5 # T220730 [10:58:48] (03PS10) 10Mobrovac: Handle application/octet-stream requests properly; release v0.1.5 [software/service-checker] - 10https://gerrit.wikimedia.org/r/507531 (https://phabricator.wikimedia.org/T220401) [10:58:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:58:51] T220730: 1.34.0-wmf.5 deployment blockers - https://phabricator.wikimedia.org/T220730 [10:59:51] (03CR) 10Jbond: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/509909 (https://phabricator.wikimedia.org/T144169) (owner: 10Jbond) [11:00:05] MaxSem, RoanKattouw, and Niharika: How many deployers does it take to do European Mid-day SWAT(Max 6 patches) deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190514T1100). [11:00:05] kart_: A patch you scheduled for European Mid-day SWAT(Max 6 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [11:00:35] OK! [11:00:48] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] "PCC https://puppet-compiler.wmflabs.org/compiler1002/16529/" [puppet] - 10https://gerrit.wikimedia.org/r/510127 (owner: 10Arturo Borrero Gonzalez) [11:01:21] hashar: MW train cut overlap with EU SWAT! 
[11:01:32] (03CR) 10Jbond: [C: 03+2] flake8 - mediawiki: update file extensions so they are detected by CI [puppet] - 10https://gerrit.wikimedia.org/r/509909 (https://phabricator.wikimedia.org/T144169) (owner: 10Jbond) [11:01:42] (03PS3) 10Jbond: flake8 - mediawiki: update file extensions so they are detected by CI [puppet] - 10https://gerrit.wikimedia.org/r/509909 (https://phabricator.wikimedia.org/T144169) [11:01:44] (03CR) 10Jbond: [V: 03+2 C: 03+2] flake8 - mediawiki: update file extensions so they are detected by CI [puppet] - 10https://gerrit.wikimedia.org/r/509909 (https://phabricator.wikimedia.org/T144169) (owner: 10Jbond) [11:02:08] (03CR) 10KartikMistry: [C: 03+2] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/508818 (https://phabricator.wikimedia.org/T222782) (owner: 10Petar.petkovic) [11:02:17] (03PS7) 10Jbond: raid: update check_raid to detect missing disk [puppet] - 10https://gerrit.wikimedia.org/r/508855 (https://phabricator.wikimedia.org/T218544) [11:03:02] (03PS2) 10Arturo Borrero Gonzalez: openstack: fix weird systemd state in standby controller node [puppet] - 10https://gerrit.wikimedia.org/r/510123 (https://phabricator.wikimedia.org/T215544) [11:03:13] (03CR) 10Jbond: [C: 03+2] raid: update check_raid to detect missing disk [puppet] - 10https://gerrit.wikimedia.org/r/508855 (https://phabricator.wikimedia.org/T218544) (owner: 10Jbond) [11:03:48] 10Operations, 10cloud-services-team (Kanban): WMCS-related dashboards using Diamond metrics - https://phabricator.wikimedia.org/T210850 (10MoritzMuehlenhoff) >>! In T210850#5179570, @aborrero wrote: > Ok! I'm fine renaming the metrics. I would need a suggestion though :-) > > `cloudvps.novafullstack.*`? Look... 
[11:04:13] (03PS3) 10Jbond: flake8 - sslcert: update python file extensions so they are detected by CI [puppet] - 10https://gerrit.wikimedia.org/r/509925 (https://phabricator.wikimedia.org/T144169) [11:05:12] (03CR) 10Jbond: [C: 03+2] flake8 - sslcert: update python file extensions so they are detected by CI [puppet] - 10https://gerrit.wikimedia.org/r/509925 (https://phabricator.wikimedia.org/T144169) (owner: 10Jbond) [11:05:41] What patch is not merged yet :/ [11:06:08] (03PS3) 10Jbond: flake8 - rabbitmq: update python file extensions so they are detected by CI [puppet] - 10https://gerrit.wikimedia.org/r/509917 (https://phabricator.wikimedia.org/T144169) [11:06:52] (03PS3) 10KartikMistry: Decrease idwiki MT threshold for publishing [mediawiki-config] - 10https://gerrit.wikimedia.org/r/508818 (https://phabricator.wikimedia.org/T222782) (owner: 10Petar.petkovic) [11:07:09] PROBLEM - puppet last run on mw2246 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 3 minutes ago with 2 failures. Failed resources (up to 3 shown): File[/usr/local/bin/mediawiki-firejail-convert],File[/usr/local/bin/mediawiki-firejail-ghostscript] [11:07:09] 10Operations, 10DC-Ops, 10netops: Juniper network device audit - all sites - https://phabricator.wikimedia.org/T213843 (10faidon) Update from IRC: Juniper's install base is actually missing a whole lot of our devices (e.g. only lists 9 EX4300s, out of... 52). @ayounsi is asking them, but this clearly needs m... [11:07:17] PROBLEM - puppet last run on mw1293 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 3 minutes ago with 2 failures. Failed resources (up to 3 shown): File[/usr/local/bin/mediawiki-firejail-convert],File[/usr/local/bin/mediawiki-firejail-ghostscript] [11:07:31] (03PS1) 10Hashar: Group 0 to 1.34.0-wmf.5 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/510129 (https://phabricator.wikimedia.org/T220730) [11:07:35] PROBLEM - puppet last run on mw2207 is CRITICAL: CRITICAL: Puppet has 2 failures. 
Last run 3 minutes ago with 2 failures. Failed resources (up to 3 shown): File[/usr/local/bin/mediawiki-firejail-convert],File[/usr/local/bin/mediawiki-firejail-ghostscript] [11:07:55] PROBLEM - puppet last run on mw1294 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 4 minutes ago with 2 failures. Failed resources (up to 3 shown): File[/usr/local/bin/mediawiki-firejail-convert],File[/usr/local/bin/mediawiki-firejail-ghostscript] [11:08:13] PROBLEM - puppet last run on mw2218 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 4 minutes ago with 2 failures. Failed resources (up to 3 shown): File[/usr/local/bin/mediawiki-firejail-convert],File[/usr/local/bin/mediawiki-firejail-ghostscript] [11:08:17] ^^looking [11:08:27] PROBLEM - puppet last run on mw1284 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 4 minutes ago with 2 failures. Failed resources (up to 3 shown): File[/usr/local/bin/mediawiki-firejail-convert],File[/usr/local/bin/mediawiki-firejail-ghostscript] [11:08:31] PROBLEM - puppet last run on mw2286 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 3 minutes ago with 2 failures. Failed resources (up to 3 shown): File[/usr/local/bin/mediawiki-firejail-convert],File[/usr/local/bin/mediawiki-firejail-ghostscript] [11:08:31] PROBLEM - puppet last run on mw1255 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 3 minutes ago with 2 failures. Failed resources (up to 3 shown): File[/usr/local/bin/mediawiki-firejail-convert],File[/usr/local/bin/mediawiki-firejail-ghostscript] [11:08:53] PROBLEM - puppet last run on mw2160 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 4 minutes ago with 2 failures. Failed resources (up to 3 shown): File[/usr/local/bin/mediawiki-firejail-convert],File[/usr/local/bin/mediawiki-firejail-ghostscript] [11:09:02] !log disable puppet [11:09:03] PROBLEM - puppet last run on mw2214 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 5 minutes ago with 2 failures. 
Failed resources (up to 3 shown): File[/usr/local/bin/mediawiki-firejail-convert],File[/usr/local/bin/mediawiki-firejail-ghostscript] [11:09:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:09:13] PROBLEM - puppet last run on ganeti1007 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 3 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/usr/local/sbin/x509-bundle] [11:09:17] PROBLEM - puppet last run on mw1332 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 5 minutes ago with 2 failures. Failed resources (up to 3 shown): File[/usr/local/bin/mediawiki-firejail-convert],File[/usr/local/bin/mediawiki-firejail-ghostscript] [11:09:19] PROBLEM - puppet last run on mw1276 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 4 minutes ago with 2 failures. Failed resources (up to 3 shown): File[/usr/local/bin/mediawiki-firejail-convert],File[/usr/local/bin/mediawiki-firejail-ghostscript] [11:09:25] PROBLEM - puppet last run on mw1274 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 5 minutes ago with 2 failures. Failed resources (up to 3 shown): File[/usr/local/bin/mediawiki-firejail-convert],File[/usr/local/bin/mediawiki-firejail-ghostscript] [11:09:29] PROBLEM - puppet last run on mw1245 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 4 minutes ago with 2 failures. Failed resources (up to 3 shown): File[/usr/local/bin/mediawiki-firejail-convert],File[/usr/local/bin/mediawiki-firejail-ghostscript] [11:09:37] PROBLEM - puppet last run on kafka1001 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 3 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/usr/local/sbin/x509-bundle] [11:09:37] PROBLEM - puppet last run on cp2008 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/usr/local/sbin/x509-bundle] [11:09:45] PROBLEM - puppet last run on mw2262 is CRITICAL: CRITICAL: Puppet has 2 failures. 
Last run 3 minutes ago with 2 failures. Failed resources (up to 3 shown): File[/usr/local/bin/mediawiki-firejail-convert],File[/usr/local/bin/mediawiki-firejail-ghostscript] [11:09:47] !log Deleting 1.33.0-wmf.23 from deploy1001 # T220730 [11:09:49] PROBLEM - puppet last run on mw2153 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 3 minutes ago with 2 failures. Failed resources (up to 3 shown): File[/usr/local/bin/mediawiki-firejail-convert],File[/usr/local/bin/mediawiki-firejail-ghostscript] [11:09:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:09:51] T220730: 1.34.0-wmf.5 deployment blockers - https://phabricator.wikimedia.org/T220730 [11:09:57] PROBLEM - IPMI Sensor Status on cp3036 is CRITICAL: Sensor Type(s) Temperature, Power_Supply Status: Critical [PS Redundancy = Critical, Status = Critical] [11:10:16] (03PS3) 10Arturo Borrero Gonzalez: openstack: fix weird systemd state in standby controller node [puppet] - 10https://gerrit.wikimedia.org/r/510123 (https://phabricator.wikimedia.org/T215544) [11:10:25] PROBLEM - puppet last run on wtp1035 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 4 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/usr/local/sbin/x509-bundle] [11:10:27] PROBLEM - puppet last run on mw2259 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 4 minutes ago with 2 failures. Failed resources (up to 3 shown): File[/usr/local/bin/mediawiki-firejail-convert],File[/usr/local/bin/mediawiki-firejail-ghostscript] [11:10:27] PROBLEM - puppet last run on mw2271 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 5 minutes ago with 2 failures. Failed resources (up to 3 shown): File[/usr/local/bin/mediawiki-firejail-convert],File[/usr/local/bin/mediawiki-firejail-ghostscript] [11:10:49] PROBLEM - puppet last run on mw1263 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 7 minutes ago with 2 failures. 
Failed resources (up to 3 shown): File[/usr/local/bin/mediawiki-firejail-convert],File[/usr/local/bin/mediawiki-firejail-ghostscript] [11:11:01] PROBLEM - puppet last run on mw1310 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 7 minutes ago with 2 failures. Failed resources (up to 3 shown): File[/usr/local/bin/mediawiki-firejail-convert],File[/usr/local/bin/mediawiki-firejail-ghostscript] [11:11:05] ACKNOWLEDGEMENT - IPMI Sensor Status on cp3036 is CRITICAL: Sensor Type(s) Temperature, Power_Supply Status: Critical [PS Redundancy = Critical, Status = Critical] Ema known [11:11:09] ACKNOWLEDGEMENT - IPMI Sensor Status on cp3036 is CRITICAL: Sensor Type(s) Temperature, Power_Supply Status: Critical [PS Redundancy = Critical, Status = Critical] Ema known [11:11:11] ACKNOWLEDGEMENT - IPMI Sensor Status on cp3036 is CRITICAL: Sensor Type(s) Temperature, Power_Supply Status: Critical [PS Redundancy = Critical, Status = Critical] Ema known [11:11:18] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] "PCC as expected: https://puppet-compiler.wmflabs.org/compiler1002/16530/" [puppet] - 10https://gerrit.wikimedia.org/r/510123 (https://phabricator.wikimedia.org/T215544) (owner: 10Arturo Borrero Gonzalez) [11:11:45] (03PS1) 10Jbond: flake8 - rename file [puppet] - 10https://gerrit.wikimedia.org/r/510131 [11:12:04] !log pool cp3036 reimaged to ATS T222937 [11:12:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:12:08] T222937: Replace Varnish backends with ATS on cache upload nodes in esams - https://phabricator.wikimedia.org/T222937 [11:12:49] (03CR) 10Jbond: [C: 03+2] flake8 - rename file [puppet] - 10https://gerrit.wikimedia.org/r/510131 (owner: 10Jbond) [11:12:57] (03PS2) 10Jbond: flake8 - rename file [puppet] - 10https://gerrit.wikimedia.org/r/510131 [11:13:39] (03CR) 10Jbond: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/510131 (owner: 10Jbond) [11:14:55] (03CR) 10Jbond: [C: 03+2] flake8 - rename file [puppet] - 
10https://gerrit.wikimedia.org/r/510131 (owner: 10Jbond) [11:15:23] (03CR) 10KartikMistry: [V: 03+2 C: 03+2] Decrease idwiki MT threshold for publishing [mediawiki-config] - 10https://gerrit.wikimedia.org/r/508818 (https://phabricator.wikimedia.org/T222782) (owner: 10Petar.petkovic) [11:15:52] 10Operations, 10DBA, 10Patch-For-Review: correctable memory errors db1068 (commons primary master database) - https://phabricator.wikimedia.org/T213664 (10jcrespo) It now says: `CRITICAL: Devices (12) not equal to PDs (2)` [11:16:14] (03CR) 10jenkins-bot: Decrease idwiki MT threshold for publishing [mediawiki-config] - 10https://gerrit.wikimedia.org/r/508818 (https://phabricator.wikimedia.org/T222782) (owner: 10Petar.petkovic) [11:18:16] !log enable puppet issue fixed https://gerrit.wikimedia.org/r/c/operations/puppet/+/510131 [11:18:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:18:55] PROBLEM - puppet last run on mw2178 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 3 minutes ago with 2 failures. Failed resources (up to 3 shown): File[/usr/local/bin/mediawiki-firejail-convert],File[/usr/local/bin/mediawiki-firejail-ghostscript] [11:19:01] PROBLEM - puppet last run on mw2218 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 1 minute ago with 2 failures. Failed resources (up to 3 shown): File[/usr/local/bin/mediawiki-firejail-convert],File[/usr/local/bin/mediawiki-firejail-ghostscript] [11:19:13] PROBLEM - puppet last run on mw2191 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 1 minute ago with 2 failures. Failed resources (up to 3 shown): File[/usr/local/bin/mediawiki-firejail-convert],File[/usr/local/bin/mediawiki-firejail-ghostscript] [11:19:13] PROBLEM - puppet last run on mw2203 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 4 minutes ago with 2 failures. 
Failed resources (up to 3 shown): File[/usr/local/bin/mediawiki-firejail-convert],File[/usr/local/bin/mediawiki-firejail-ghostscript] [11:19:15] PROBLEM - puppet last run on mw1284 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 2 minutes ago with 2 failures. Failed resources (up to 3 shown): File[/usr/local/bin/mediawiki-firejail-convert],File[/usr/local/bin/mediawiki-firejail-ghostscript] [11:19:15] PROBLEM - puppet last run on mw2236 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 5 minutes ago with 2 failures. Failed resources (up to 3 shown): File[/usr/local/bin/mediawiki-firejail-convert],File[/usr/local/bin/mediawiki-firejail-ghostscript] [11:19:19] PROBLEM - puppet last run on mw2286 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 3 minutes ago with 2 failures. Failed resources (up to 3 shown): File[/usr/local/bin/mediawiki-firejail-convert],File[/usr/local/bin/mediawiki-firejail-ghostscript] [11:19:19] PROBLEM - puppet last run on mw1255 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 3 minutes ago with 2 failures. Failed resources (up to 3 shown): File[/usr/local/bin/mediawiki-firejail-convert],File[/usr/local/bin/mediawiki-firejail-ghostscript] [11:19:23] PROBLEM - puppet last run on mw2190 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 5 minutes ago with 2 failures. Failed resources (up to 3 shown): File[/usr/local/bin/mediawiki-firejail-convert],File[/usr/local/bin/mediawiki-firejail-ghostscript] [11:19:25] PROBLEM - puppet last run on mw2185 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 4 minutes ago with 2 failures. Failed resources (up to 3 shown): File[/usr/local/bin/mediawiki-firejail-convert],File[/usr/local/bin/mediawiki-firejail-ghostscript] [11:19:27] PROBLEM - puppet last run on etcd1005 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 5 minutes ago with 1 failures. 
Failed resources (up to 3 shown): File[/usr/local/sbin/x509-bundle] [11:19:27] PROBLEM - puppet last run on cloudvirtan1005 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 3 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/usr/local/sbin/x509-bundle] [11:19:27] PROBLEM - puppet last run on mw1221 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 5 minutes ago with 2 failures. Failed resources (up to 3 shown): File[/usr/local/bin/mediawiki-firejail-convert],File[/usr/local/bin/mediawiki-firejail-ghostscript] [11:19:29] PROBLEM - puppet last run on mw1222 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 2 minutes ago with 2 failures. Failed resources (up to 3 shown): File[/usr/local/bin/mediawiki-firejail-convert],File[/usr/local/bin/mediawiki-firejail-ghostscript] [11:19:33] PROBLEM - puppet last run on mw1315 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 4 minutes ago with 2 failures. Failed resources (up to 3 shown): File[/usr/local/bin/mediawiki-firejail-convert],File[/usr/local/bin/mediawiki-firejail-ghostscript] [11:19:33] PROBLEM - puppet last run on mw1243 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 1 minute ago with 2 failures. Failed resources (up to 3 shown): File[/usr/local/bin/mediawiki-firejail-convert],File[/usr/local/bin/mediawiki-firejail-ghostscript] [11:19:39] PROBLEM - puppet last run on mw1277 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 3 minutes ago with 2 failures. Failed resources (up to 3 shown): File[/usr/local/bin/mediawiki-firejail-convert],File[/usr/local/bin/mediawiki-firejail-ghostscript] [11:19:39] PROBLEM - puppet last run on mw1234 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 2 minutes ago with 2 failures. Failed resources (up to 3 shown): File[/usr/local/bin/mediawiki-firejail-convert],File[/usr/local/bin/mediawiki-firejail-ghostscript] [11:19:42] OK. Deploying to Production now. 
[11:19:43] PROBLEM - puppet last run on mw2160 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 3 minutes ago with 2 failures. Failed resources (up to 3 shown): File[/usr/local/bin/mediawiki-firejail-convert],File[/usr/local/bin/mediawiki-firejail-ghostscript] [11:19:47] PROBLEM - puppet last run on mw2275 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 3 minutes ago with 2 failures. Failed resources (up to 3 shown): File[/usr/local/bin/mediawiki-firejail-convert],File[/usr/local/bin/mediawiki-firejail-ghostscript] [11:19:52] What's up with these messages? [11:19:53] PROBLEM - puppet last run on mw2214 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 1 minute ago with 2 failures. Failed resources (up to 3 shown): File[/usr/local/bin/mediawiki-firejail-convert],File[/usr/local/bin/mediawiki-firejail-ghostscript] [11:19:55] PROBLEM - puppet last run on mw2174 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 3 minutes ago with 2 failures. Failed resources (up to 3 shown): File[/usr/local/bin/mediawiki-firejail-convert],File[/usr/local/bin/mediawiki-firejail-ghostscript] [11:20:03] PROBLEM - puppet last run on ganeti1007 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 4 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/usr/local/sbin/x509-bundle] [11:20:07] PROBLEM - puppet last run on mw1332 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 5 minutes ago with 2 failures. Failed resources (up to 3 shown): File[/usr/local/bin/mediawiki-firejail-convert],File[/usr/local/bin/mediawiki-firejail-ghostscript] [11:20:09] PROBLEM - puppet last run on mw1276 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 4 minutes ago with 2 failures. 
Failed resources (up to 3 shown): File[/usr/local/bin/mediawiki-firejail-convert],File[/usr/local/bin/mediawiki-firejail-ghostscript] [11:20:13] (03CR) 10Elukey: [C: 03+1] cassandra: check for flag file before service startup [puppet] - 10https://gerrit.wikimedia.org/r/509409 (https://phabricator.wikimedia.org/T214166) (owner: 10Filippo Giunchedi) [11:20:15] PROBLEM - puppet last run on mw1274 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 4 minutes ago with 2 failures. Failed resources (up to 3 shown): File[/usr/local/bin/mediawiki-firejail-convert],File[/usr/local/bin/mediawiki-firejail-ghostscript] [11:20:15] PROBLEM - puppet last run on mw1235 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 1 minute ago with 2 failures. Failed resources (up to 3 shown): File[/usr/local/bin/mediawiki-firejail-convert],File[/usr/local/bin/mediawiki-firejail-ghostscript] [11:20:17] OK to SWAT or not with this? [11:20:17] PROBLEM - puppet last run on mw1297 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 2 minutes ago with 2 failures. Failed resources (up to 3 shown): File[/usr/local/bin/mediawiki-firejail-convert],File[/usr/local/bin/mediawiki-firejail-ghostscript] [11:20:19] PROBLEM - puppet last run on mw1245 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 5 minutes ago with 2 failures. Failed resources (up to 3 shown): File[/usr/local/bin/mediawiki-firejail-convert],File[/usr/local/bin/mediawiki-firejail-ghostscript] [11:20:27] PROBLEM - puppet last run on kafka1001 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 5 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/usr/local/sbin/x509-bundle] [11:20:27] PROBLEM - puppet last run on cp2008 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 3 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/usr/local/sbin/x509-bundle] [11:20:35] PROBLEM - puppet last run on mw2262 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 5 minutes ago with 2 failures. 
Failed resources (up to 3 shown): File[/usr/local/bin/mediawiki-firejail-convert],File[/usr/local/bin/mediawiki-firejail-ghostscript] [11:20:39] PROBLEM - puppet last run on mw2153 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 5 minutes ago with 2 failures. Failed resources (up to 3 shown): File[/usr/local/bin/mediawiki-firejail-convert],File[/usr/local/bin/mediawiki-firejail-ghostscript] [11:20:41] PROBLEM - puppet last run on mw1313 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 2 minutes ago with 2 failures. Failed resources (up to 3 shown): File[/usr/local/bin/mediawiki-firejail-convert],File[/usr/local/bin/mediawiki-firejail-ghostscript] [11:20:48] I'll wait few minutes.. [11:20:50] kart_: I think they're related to ongoing work from jbond42 [11:20:51] PROBLEM - puppet last run on cp1076 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 4 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/usr/local/sbin/x509-bundle] [11:20:53] PROBLEM - puppet last run on mw2155 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 3 minutes ago with 2 failures. Failed resources (up to 3 shown): File[/usr/local/bin/mediawiki-firejail-convert],File[/usr/local/bin/mediawiki-firejail-ghostscript] [11:20:59] PROBLEM - puppet last run on mw2193 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 4 minutes ago with 2 failures. Failed resources (up to 3 shown): File[/usr/local/bin/mediawiki-firejail-convert],File[/usr/local/bin/mediawiki-firejail-ghostscript] [11:21:07] ema: OK to SWAT in this case? [11:21:15] PROBLEM - puppet last run on wtp1035 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 5 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/usr/local/sbin/x509-bundle] [11:21:15] PROBLEM - puppet last run on mw2259 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 3 minutes ago with 2 failures. 
Failed resources (up to 3 shown): File[/usr/local/bin/mediawiki-firejail-convert],File[/usr/local/bin/mediawiki-firejail-ghostscript] [11:21:15] PROBLEM - puppet last run on mw2271 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 5 minutes ago with 2 failures. Failed resources (up to 3 shown): File[/usr/local/bin/mediawiki-firejail-convert],File[/usr/local/bin/mediawiki-firejail-ghostscript] [11:21:22] yes sorry they should start to recover soon [11:21:27] PROBLEM - puppet last run on mw2230 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 5 minutes ago with 2 failures. Failed resources (up to 3 shown): File[/usr/local/bin/mediawiki-firejail-convert],File[/usr/local/bin/mediawiki-firejail-ghostscript] [11:21:29] PROBLEM - puppet last run on mw2145 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 5 minutes ago with 2 failures. Failed resources (up to 3 shown): File[/usr/local/bin/mediawiki-firejail-convert],File[/usr/local/bin/mediawiki-firejail-ghostscript] [11:21:29] kart_: yes IMHO you should go ahead [11:21:35] PROBLEM - puppet last run on mw1256 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 5 minutes ago with 2 failures. Failed resources (up to 3 shown): File[/usr/local/bin/mediawiki-firejail-convert],File[/usr/local/bin/mediawiki-firejail-ghostscript] [11:21:37] cool. Thanks ema [11:21:37] PROBLEM - puppet last run on mw1263 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 5 minutes ago with 2 failures. Failed resources (up to 3 shown): File[/usr/local/bin/mediawiki-firejail-convert],File[/usr/local/bin/mediawiki-firejail-ghostscript] [11:21:49] PROBLEM - puppet last run on mw1310 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 4 minutes ago with 2 failures. Failed resources (up to 3 shown): File[/usr/local/bin/mediawiki-firejail-convert],File[/usr/local/bin/mediawiki-firejail-ghostscript] [11:22:01] PROBLEM - puppet last run on mw2146 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 3 minutes ago with 2 failures. 
Failed resources (up to 3 shown): File[/usr/local/bin/mediawiki-firejail-convert],File[/usr/local/bin/mediawiki-firejail-ghostscript] [11:22:01] PROBLEM - puppet last run on mw2244 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 5 minutes ago with 2 failures. Failed resources (up to 3 shown): File[/usr/local/bin/mediawiki-firejail-convert],File[/usr/local/bin/mediawiki-firejail-ghostscript] [11:22:03] PROBLEM - puppet last run on mw2189 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 5 minutes ago with 2 failures. Failed resources (up to 3 shown): File[/usr/local/bin/mediawiki-firejail-convert],File[/usr/local/bin/mediawiki-firejail-ghostscript] [11:22:07] PROBLEM - puppet last run on cp2016 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 5 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/usr/local/sbin/x509-bundle] [11:22:14] hashar: "11:21:52 sync-file failed: Failed to acquire lock "/var/lock/scap.operations_mediawiki-config.lock"; owner is "hashar"; reason is "Pruned MediaWiki: 1.33.0-wmf.23"" [11:22:15] PROBLEM - puppet last run on mw1267 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 3 minutes ago with 2 failures. Failed resources (up to 3 shown): File[/usr/local/bin/mediawiki-firejail-convert],File[/usr/local/bin/mediawiki-firejail-ghostscript] [11:22:15] PROBLEM - puppet last run on mw1279 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 5 minutes ago with 2 failures. Failed resources (up to 3 shown): File[/usr/local/bin/mediawiki-firejail-convert],File[/usr/local/bin/mediawiki-firejail-ghostscript] [11:22:19] PROBLEM - puppet last run on mw1231 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 5 minutes ago with 2 failures. Failed resources (up to 3 shown): File[/usr/local/bin/mediawiki-firejail-convert],File[/usr/local/bin/mediawiki-firejail-ghostscript] [11:22:25] hashar: What's up? :) [11:22:29] PROBLEM - puppet last run on snapshot1007 is CRITICAL: CRITICAL: Puppet has 2 failures. 
Last run 5 minutes ago with 2 failures. Failed resources (up to 3 shown): File[/usr/local/bin/mediawiki-firejail-convert],File[/usr/local/bin/mediawiki-firejail-ghostscript] [11:22:35] PROBLEM - puppet last run on mw1251 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 5 minutes ago with 2 failures. Failed resources (up to 3 shown): File[/usr/local/bin/mediawiki-firejail-convert],File[/usr/local/bin/mediawiki-firejail-ghostscript] [11:22:49] PROBLEM - puppet last run on mw1228 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 4 minutes ago with 2 failures. Failed resources (up to 3 shown): File[/usr/local/bin/mediawiki-firejail-convert],File[/usr/local/bin/mediawiki-firejail-ghostscript] [11:23:06] kart_: ahhh I have missed the overlap with the swat window sorry :/ [11:23:06] !log cumin1001 ~ % sudo cumin A:all '/usr/local/sbin/run-puppet-agent --failed-only [11:23:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:23:11] PROBLEM - puppet last run on mw2279 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 5 minutes ago with 2 failures. Failed resources (up to 3 shown): File[/usr/local/bin/mediawiki-firejail-convert],File[/usr/local/bin/mediawiki-firejail-ghostscript] [11:23:11] PROBLEM - puppet last run on mw2250 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 5 minutes ago with 2 failures. Failed resources (up to 3 shown): File[/usr/local/bin/mediawiki-firejail-convert],File[/usr/local/bin/mediawiki-firejail-ghostscript] [11:23:13] PROBLEM - puppet last run on mw2238 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 5 minutes ago with 2 failures. Failed resources (up to 3 shown): File[/usr/local/bin/mediawiki-firejail-convert],File[/usr/local/bin/mediawiki-firejail-ghostscript] [11:23:15] PROBLEM - puppet last run on mw1225 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 5 minutes ago with 2 failures. 
Failed resources (up to 3 shown): File[/usr/local/bin/mediawiki-firejail-convert],File[/usr/local/bin/mediawiki-firejail-ghostscript] [11:23:17] PROBLEM - puppet last run on db2087 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 5 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/usr/local/sbin/x509-bundle] [11:23:19] PROBLEM - puppet last run on mw2215 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 5 minutes ago with 2 failures. Failed resources (up to 3 shown): File[/usr/local/bin/mediawiki-firejail-convert],File[/usr/local/bin/mediawiki-firejail-ghostscript] [11:23:23] PROBLEM - puppet last run on mw2246 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 5 minutes ago with 2 failures. Failed resources (up to 3 shown): File[/usr/local/bin/mediawiki-firejail-convert],File[/usr/local/bin/mediawiki-firejail-ghostscript] [11:23:23] PROBLEM - puppet last run on mw1224 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 5 minutes ago with 2 failures. Failed resources (up to 3 shown): File[/usr/local/bin/mediawiki-firejail-convert],File[/usr/local/bin/mediawiki-firejail-ghostscript] [11:23:27] PROBLEM - puppet last run on mw2186 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 5 minutes ago with 2 failures. Failed resources (up to 3 shown): File[/usr/local/bin/mediawiki-firejail-convert],File[/usr/local/bin/mediawiki-firejail-ghostscript] [11:23:28] PROBLEM - puppet last run on mw1226 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 5 minutes ago with 2 failures. Failed resources (up to 3 shown): File[/usr/local/bin/mediawiki-firejail-convert],File[/usr/local/bin/mediawiki-firejail-ghostscript] [11:23:29] PROBLEM - puppet last run on mw1293 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 5 minutes ago with 2 failures. 
Failed resources (up to 3 shown): File[/usr/local/bin/mediawiki-firejail-convert],File[/usr/local/bin/mediawiki-firejail-ghostscript] [11:23:33] PROBLEM - puppet last run on mw2138 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 5 minutes ago with 2 failures. Failed resources (up to 3 shown): File[/usr/local/bin/mediawiki-firejail-convert],File[/usr/local/bin/mediawiki-firejail-ghostscript] [11:23:33] jbond42: I hope you used a batch size [11:23:34] !log hashar@deploy1001 clean aborted: Pruned MediaWiki: 1.33.0-wmf.23 (duration: 14m 31s) [11:23:38] 10Operations, 10DBA, 10Patch-For-Review: correctable memory errors db1068 (commons primary master database) - https://phabricator.wikimedia.org/T213664 (10jcrespo) Ignore the above, that is unrelated. [11:23:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:23:39] PROBLEM - puppet last run on mw1286 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 5 minutes ago with 2 failures. Failed resources (up to 3 shown): File[/usr/local/bin/mediawiki-firejail-convert],File[/usr/local/bin/mediawiki-firejail-ghostscript] [11:23:43] PROBLEM - puppet last run on mw2283 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 5 minutes ago with 2 failures. Failed resources (up to 3 shown): File[/usr/local/bin/mediawiki-firejail-convert],File[/usr/local/bin/mediawiki-firejail-ghostscript] [11:23:44] kart_: should be good now. Sorry [11:23:45] PROBLEM - puppet last run on mw2207 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 5 minutes ago with 2 failures. Failed resources (up to 3 shown): File[/usr/local/bin/mediawiki-firejail-convert],File[/usr/local/bin/mediawiki-firejail-ghostscript] [11:23:56] 10Operations, 10Release Pipeline, 10Release-Engineering-Team, 10Wikidata, and 5 others: Introduce wikidata termbox SSR to kubernetes - https://phabricator.wikimedia.org/T220402 (10Pablo-WMDE) Hi @akosiaris, thanks for taking the time to explain the way the `Host` header is intended to be used. 
If I unders... [11:23:57] it can kill the puppetmasters otherwise if they are too many [11:23:59] PROBLEM - puppet last run on mw2226 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 5 minutes ago with 2 failures. Failed resources (up to 3 shown): File[/usr/local/bin/mediawiki-firejail-convert],File[/usr/local/bin/mediawiki-firejail-ghostscript] [11:24:05] RECOVERY - puppet last run on mw1294 is OK: OK: Puppet is currently enabled, last run 1 second ago with 0 failures [11:24:22] hashar: thanks. [11:24:23] RECOVERY - puppet last run on mw2218 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [11:24:37] RECOVERY - puppet last run on mw1284 is OK: OK: Puppet is currently enabled, last run 50 seconds ago with 0 failures [11:24:41] RECOVERY - puppet last run on mw1255 is OK: OK: Puppet is currently enabled, last run 35 seconds ago with 0 failures [11:24:47] RECOVERY - puppet last run on etcd1005 is OK: OK: Puppet is currently enabled, last run 34 seconds ago with 0 failures [11:24:48] RECOVERY - puppet last run on cloudvirtan1005 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [11:24:48] RECOVERY - puppet last run on mw1221 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [11:24:49] RECOVERY - puppet last run on mw1222 is OK: OK: Puppet is currently enabled, last run 58 seconds ago with 0 failures [11:24:53] RECOVERY - puppet last run on mw1243 is OK: OK: Puppet is currently enabled, last run 50 seconds ago with 0 failures [11:24:53] RECOVERY - puppet last run on mw1315 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [11:24:59] RECOVERY - puppet last run on mw1277 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [11:24:59] RECOVERY - puppet last run on mw1234 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [11:25:03] RECOVERY - puppet last run on mw2160 is OK: OK: Puppet is currently 
enabled, last run 32 seconds ago with 0 failures [11:25:13] RECOVERY - puppet last run on mw2214 is OK: OK: Puppet is currently enabled, last run 15 seconds ago with 0 failures [11:25:15] RECOVERY - puppet last run on mw2174 is OK: OK: Puppet is currently enabled, last run 37 seconds ago with 0 failures [11:25:17] !log kartik@deploy1001 Synchronized wmf-config/InitialiseSettings.php: SWAT [[gerrit:508818|Decrease idwiki MT thresold for publishing]] (T222782) (duration: 00m 51s) [11:25:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:25:23] T222782: Adjust the threshold for Indonesian to prevent publishing when overall unmodified content is higher than 40% - https://phabricator.wikimedia.org/T222782 [11:25:25] RECOVERY - puppet last run on ganeti1007 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [11:25:27] RECOVERY - puppet last run on mw1332 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [11:25:27] RECOVERY - puppet last run on mw1276 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [11:25:35] RECOVERY - puppet last run on mw1235 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [11:25:35] RECOVERY - puppet last run on mw1274 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [11:25:37] RECOVERY - puppet last run on mw1297 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [11:25:39] RECOVERY - puppet last run on mw1245 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [11:25:47] RECOVERY - puppet last run on kafka1001 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [11:25:47] RECOVERY - puppet last run on cp2008 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [11:25:55] RECOVERY - puppet last run on mw2262 is OK: OK: Puppet is currently enabled, last run 48 seconds ago 
with 0 failures [11:25:59] RECOVERY - puppet last run on mw2153 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [11:26:01] RECOVERY - puppet last run on mw1313 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [11:26:11] RECOVERY - puppet last run on cp1076 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [11:26:13] RECOVERY - puppet last run on mw2155 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [11:26:19] RECOVERY - puppet last run on mw2193 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [11:26:35] RECOVERY - puppet last run on wtp1035 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [11:26:35] RECOVERY - puppet last run on mw2259 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [11:26:35] RECOVERY - puppet last run on mw2271 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [11:26:48] RECOVERY - puppet last run on mw2230 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [11:26:51] RECOVERY - puppet last run on mw2145 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [11:26:55] RECOVERY - puppet last run on mw1256 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [11:26:57] RECOVERY - puppet last run on mw1263 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [11:27:09] RECOVERY - puppet last run on mw1310 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [11:27:21] RECOVERY - puppet last run on mw2146 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [11:27:23] RECOVERY - puppet last run on mw2244 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [11:27:23] RECOVERY - puppet last run on mw2189 is OK: OK: Puppet is currently 
enabled, last run 2 minutes ago with 0 failures [11:27:31] RECOVERY - puppet last run on cp2016 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [11:27:37] RECOVERY - puppet last run on mw1279 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [11:27:37] RECOVERY - puppet last run on mw1267 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [11:27:39] (03PS1) 10Jbond: Revert "raid: update check_raid to detect missing disk" [puppet] - 10https://gerrit.wikimedia.org/r/510137 [11:27:41] RECOVERY - puppet last run on mw1231 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [11:27:51] RECOVERY - puppet last run on snapshot1007 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [11:27:55] kart_: does it work now? [11:27:57] RECOVERY - puppet last run on mw1251 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [11:28:03] PROBLEM - Check systemd state on cloudcontrol1003 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [11:28:05] (03PS2) 10Jbond: Revert "raid: update check_raid to detect missing disk" [puppet] - 10https://gerrit.wikimedia.org/r/510137 [11:28:11] RECOVERY - puppet last run on mw1228 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [11:28:22] hashar: yes. Deployed. 
[11:28:33] RECOVERY - puppet last run on mw2279 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [11:28:33] RECOVERY - puppet last run on mw2250 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [11:28:33] RECOVERY - puppet last run on mw2238 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [11:28:35] RECOVERY - puppet last run on mw1225 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [11:28:41] RECOVERY - puppet last run on db2087 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [11:28:41] RECOVERY - puppet last run on mw2215 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [11:28:45] PROBLEM - puppet last run on cloudcontrol1003 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 3 minutes ago with 2 failures. Failed resources (up to 3 shown): Service[ferm] [11:28:45] RECOVERY - puppet last run on mw1224 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [11:28:45] RECOVERY - puppet last run on mw2246 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [11:28:49] RECOVERY - puppet last run on mw1226 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [11:28:51] RECOVERY - puppet last run on mw2186 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [11:28:51] RECOVERY - puppet last run on mw1293 is OK: OK: Puppet is currently enabled, last run 5 minutes ago with 0 failures [11:28:54] !log EU-Mid day SWAT Done. 
[11:28:55] PROBLEM - Disk space on contint1001 is CRITICAL: DISK CRITICAL - free space: / 1891 MB (4% inode=53%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space [11:28:55] RECOVERY - puppet last run on mw2138 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [11:28:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:29:01] RECOVERY - puppet last run on mw1286 is OK: OK: Puppet is currently enabled, last run 5 minutes ago with 0 failures [11:29:05] RECOVERY - puppet last run on mw2283 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [11:29:07] RECOVERY - puppet last run on mw2207 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [11:29:21] RECOVERY - puppet last run on mw2226 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [11:29:26] kart_: \o/ And sorry about the lock/conflict, it is entirely my fault [11:29:41] RECOVERY - puppet last run on mw2178 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [11:29:50] (03CR) 10Jcrespo: [C: 03+1] Revert "raid: update check_raid to detect missing disk" [puppet] - 10https://gerrit.wikimedia.org/r/510137 (owner: 10Jbond) [11:29:57] RECOVERY - puppet last run on mw2191 is OK: OK: Puppet is currently enabled, last run 5 minutes ago with 0 failures [11:29:57] RECOVERY - puppet last run on mw2203 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [11:29:59] hashar: no worries. 
Lock had your name ;) [11:29:59] RECOVERY - puppet last run on mw2236 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [11:29:59] !log Deleting 1.33.0-wmf.24 from deploy1001 # T220730 [11:30:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:30:03] RECOVERY - puppet last run on mw2286 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [11:30:04] T220730: 1.34.0-wmf.5 deployment blockers - https://phabricator.wikimedia.org/T220730 [11:30:05] RECOVERY - puppet last run on mw2190 is OK: OK: Puppet is currently enabled, last run 5 minutes ago with 0 failures [11:30:05] RECOVERY - puppet last run on mw2185 is OK: OK: Puppet is currently enabled, last run 5 minutes ago with 0 failures [11:30:17] (03CR) 10Jbond: [C: 03+2] Revert "raid: update check_raid to detect missing disk" [puppet] - 10https://gerrit.wikimedia.org/r/510137 (owner: 10Jbond) [11:30:31] RECOVERY - puppet last run on mw2275 is OK: OK: Puppet is currently enabled, last run 5 minutes ago with 0 failures [11:31:13] * Krinkle is debugging on mwdebug1001 [11:31:18] (03PS1) 10Jbond: raid: update check_raid to detect missing disk"" [puppet] - 10https://gerrit.wikimedia.org/r/510139 [11:32:11] PROBLEM - MegaRAID on labstore2004 is CRITICAL: CRITICAL: Devices (12) not equal to PDs (2) https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [11:32:13] ACKNOWLEDGEMENT - MegaRAID on labstore2004 is CRITICAL: CRITICAL: Devices (12) not equal to PDs (2) nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T223232 https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [11:32:18] 10Operations, 10ops-codfw: Degraded RAID on labstore2004 - https://phabricator.wikimedia.org/T223232 (10ops-monitoring-bot) [11:32:51] PROBLEM - MegaRAID on db1068 is CRITICAL: CRITICAL: Devices (12) not equal to PDs (2) https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [11:32:53] ACKNOWLEDGEMENT - MegaRAID on db1068 is CRITICAL: 
CRITICAL: Devices (12) not equal to PDs (2) nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T223233 https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [11:32:57] 10Operations, 10ops-eqiad: Degraded RAID on db1068 - https://phabricator.wikimedia.org/T223233 (10ops-monitoring-bot) [11:33:03] 10Operations, 10ops-eqiad: Degraded RAID on labstore1001 - https://phabricator.wikimedia.org/T223234 (10ops-monitoring-bot) [11:33:07] 10Operations, 10ops-eqiad: Degraded RAID on cloudstore1009 - https://phabricator.wikimedia.org/T223235 (10ops-monitoring-bot) [11:33:17] !log hashar@deploy1001 Pruned MediaWiki: 1.33.0-wmf.24 (duration: 03m 20s) [11:33:26] 10Operations, 10ops-eqiad: Degraded RAID on dbstore1001 - https://phabricator.wikimedia.org/T223236 (10ops-monitoring-bot) [11:33:29] ACKNOWLEDGEMENT - MegaRAID on cloudvirt1015 is CRITICAL: CRITICAL: Devices (10) not equal to PDs (8) nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T223237 https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [11:33:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:33:34] PROBLEM - MegaRAID on db1064 is CRITICAL: CRITICAL: Devices (12) not equal to PDs (2) https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [11:33:35] ACKNOWLEDGEMENT - MegaRAID on db1064 is CRITICAL: CRITICAL: Devices (12) not equal to PDs (2) nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T223238 https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [11:33:40] 10Operations, 10ops-eqiad: Degraded RAID on cloudvirt1015 - https://phabricator.wikimedia.org/T223237 (10ops-monitoring-bot) [11:33:43] 10Operations, 10ops-eqiad: Degraded RAID on db1064 - https://phabricator.wikimedia.org/T223238 (10ops-monitoring-bot) [11:33:50] RECOVERY - Check systemd state on cloudcontrol1003 is OK: OK - running: The system is fully operational [11:33:51] (03PS1) 10Arturo Borrero Gonzalez: openstack: base: rabbitmq: fix typo in ferm rule 
[puppet] - 10https://gerrit.wikimedia.org/r/510141 (https://phabricator.wikimedia.org/T215544) [11:34:02] PROBLEM - Mediawiki Cirrussearch update rate - codfw on icinga1001 is CRITICAL: CRITICAL: 10.00% of data under the critical threshold [50.0] https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?panelId=44&fullscreen&orgId=1 [11:34:24] (03CR) 10Noa wmde: [C: 03+1] Add configuration for EntitySchema ShExSimpleUrl [mediawiki-config] - 10https://gerrit.wikimedia.org/r/509878 (owner: 10Michael Große) [11:34:48] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] openstack: base: rabbitmq: fix typo in ferm rule [puppet] - 10https://gerrit.wikimedia.org/r/510141 (https://phabricator.wikimedia.org/T215544) (owner: 10Arturo Borrero Gonzalez) [11:35:14] PROBLEM - Mediawiki Cirrussearch update rate - eqiad on icinga1001 is CRITICAL: CRITICAL: 30.00% of data under the critical threshold [50.0] https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?panelId=44&fullscreen&orgId=1 [11:36:22] PROBLEM - MegaRAID on cloudvirt1024 is CRITICAL: CRITICAL: Devices (10) not equal to PDs (8) https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [11:36:23] ACKNOWLEDGEMENT - MegaRAID on cloudvirt1024 is CRITICAL: CRITICAL: Devices (10) not equal to PDs (8) nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T223241 https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [11:36:28] 10Operations, 10ops-eqiad: Degraded RAID on cloudvirt1024 - https://phabricator.wikimedia.org/T223241 (10ops-monitoring-bot) [11:37:10] PROBLEM - MegaRAID on db1071 is CRITICAL: CRITICAL: Devices (12) not equal to PDs (2) https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [11:37:11] ACKNOWLEDGEMENT - MegaRAID on db1071 is CRITICAL: CRITICAL: Devices (12) not equal to PDs (2) nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T223242 https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [11:37:16] 10Operations, 10ops-eqiad: Degraded RAID on db1071 - 
https://phabricator.wikimedia.org/T223242 (10ops-monitoring-bot) [11:37:22] PROBLEM - MegaRAID on db1066 is CRITICAL: CRITICAL: Devices (12) not equal to PDs (2) https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [11:37:23] ACKNOWLEDGEMENT - MegaRAID on db1066 is CRITICAL: CRITICAL: Devices (12) not equal to PDs (2) nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T223243 https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [11:37:28] 10Operations, 10ops-eqiad: Degraded RAID on db1066 - https://phabricator.wikimedia.org/T223243 (10ops-monitoring-bot) [11:37:48] (03PS9) 10Paladox: Add prometheus server for gerrit javamelody monitoring [puppet] - 10https://gerrit.wikimedia.org/r/508952 (https://phabricator.wikimedia.org/T184086) [11:37:59] (03PS10) 10Paladox: Add prometheus server for gerrit javamelody monitoring [puppet] - 10https://gerrit.wikimedia.org/r/508952 (https://phabricator.wikimedia.org/T184086) [11:38:32] (03CR) 10Paladox: Add prometheus server for gerrit javamelody monitoring (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/508952 (https://phabricator.wikimedia.org/T184086) (owner: 10Paladox) [11:39:02] RECOVERY - MegaRAID on cloudvirt1024 is OK: OK: optimal, 1 logical, 8 physical https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [11:39:02] RECOVERY - MegaRAID on db1066 is OK: OK: optimal, 1 logical, 2 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [11:39:02] PROBLEM - MegaRAID on cloudvirt1018 is CRITICAL: CRITICAL: Devices (10) not equal to PDs (8) https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [11:39:03] ACKNOWLEDGEMENT - MegaRAID on cloudvirt1018 is CRITICAL: CRITICAL: Devices (10) not equal to PDs (8) nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T223244 https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [11:39:04] PROBLEM - MegaRAID on db1063 is CRITICAL: CRITICAL: Devices (12) not equal to PDs (2) 
https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [11:39:05] ACKNOWLEDGEMENT - MegaRAID on db1063 is CRITICAL: CRITICAL: Devices (12) not equal to PDs (2) nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T223245 https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [11:39:06] PROBLEM - MegaRAID on kafka1014 is CRITICAL: CRITICAL: Devices (12) not equal to PDs (0) https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [11:39:07] ACKNOWLEDGEMENT - MegaRAID on kafka1014 is CRITICAL: CRITICAL: Devices (12) not equal to PDs (0) nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T223246 https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [11:39:10] 10Operations, 10ops-eqiad: Degraded RAID on cloudvirt1018 - https://phabricator.wikimedia.org/T223244 (10ops-monitoring-bot) [11:39:12] RECOVERY - Check systemd state on cloudcontrol1004 is OK: OK - running: The system is fully operational [11:39:15] 10Operations, 10ops-eqiad: Degraded RAID on db1063 - https://phabricator.wikimedia.org/T223245 (10ops-monitoring-bot) [11:39:16] 10Operations, 10ops-eqiad: Degraded RAID on kafka1014 - https://phabricator.wikimedia.org/T223246 (10ops-monitoring-bot) [11:39:18] 10Operations, 10ops-eqiad: Degraded RAID on db1068 - https://phabricator.wikimedia.org/T223233 (10Marostegui) 05Open→03Declined This was due to a bad puppet merge https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/508855/ [11:39:26] 10Operations, 10ops-eqiad: Degraded RAID on db1066 - https://phabricator.wikimedia.org/T223243 (10Marostegui) 05Open→03Declined This was due to a bad puppet merge https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/508855/ [11:39:43] 10Operations, 10ops-eqiad: Degraded RAID on dbstore1001 - https://phabricator.wikimedia.org/T223236 (10Marostegui) 05Open→03Declined This was due to a bad puppet merge https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/508855/ [11:39:46] RECOVERY - puppet last run on cloudcontrol1003 is OK: OK: Puppet 
is currently enabled, last run 3 minutes ago with 0 failures [11:40:13] 10Operations, 10ops-eqiad: Degraded RAID on db1063 - https://phabricator.wikimedia.org/T223245 (10Marostegui) 05Open→03Declined This was due to a bad puppet merge https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/508855/ [11:40:29] 10Operations, 10ops-eqiad: Degraded RAID on db1071 - https://phabricator.wikimedia.org/T223242 (10Marostegui) 05Open→03Declined This was due to a bad puppet merge https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/508855/ [11:40:38] 10Operations, 10ops-eqiad: Degraded RAID on db1064 - https://phabricator.wikimedia.org/T223238 (10Marostegui) 05Open→03Declined This was due to a bad puppet merge https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/508855/ [11:41:42] (03CR) 10Marostegui: [C: 03+1] dbproxy: Switchover labsdb1009 to 11, reorganize weights [puppet] - 10https://gerrit.wikimedia.org/r/510126 (https://phabricator.wikimedia.org/T222978) (owner: 10Jcrespo) [11:41:49] 10Operations, 10ops-eqiad: Degraded RAID on cloudvirt1024 - https://phabricator.wikimedia.org/T223241 (10aborrero) 05Open→03Invalid This is a bogus autogenerated task produced by https://gerrit.wikimedia.org/r/c/operations/puppet/+/508855/, closing now. [11:41:59] 10Operations, 10ops-codfw: Degraded RAID on labstore2004 - https://phabricator.wikimedia.org/T223232 (10aborrero) 05Open→03Invalid This is a bogus autogenerated task produced by https://gerrit.wikimedia.org/r/c/operations/puppet/+/508855/, closing now. [11:42:08] 10Operations, 10ops-eqiad: Degraded RAID on cloudstore1009 - https://phabricator.wikimedia.org/T223235 (10aborrero) 05Open→03Invalid This is a bogus autogenerated task produced by https://gerrit.wikimedia.org/r/c/operations/puppet/+/508855/, closing now. 
[11:42:17] 10Operations, 10ops-eqiad: Degraded RAID on cloudvirt1015 - https://phabricator.wikimedia.org/T223237 (10aborrero) 05Open→03Invalid This is a bogus autogenerated task produced by https://gerrit.wikimedia.org/r/c/operations/puppet/+/508855/, closing now. [11:42:34] RECOVERY - MegaRAID on labstore2004 is OK: OK: optimal, 1 logical, 2 physical https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [11:42:36] (03CR) 10Marostegui: [C: 03+1] "Probably worth stopping MySQL too" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/510108 (https://phabricator.wikimedia.org/T223126) (owner: 10Jcrespo) [11:43:10] RECOVERY - MegaRAID on db1068 is OK: OK: optimal, 1 logical, 2 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [11:43:54] RECOVERY - MegaRAID on db1064 is OK: OK: optimal, 1 logical, 2 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [11:43:57] 10Operations, 10ops-eqiad: Degraded RAID on labstore1001 - https://phabricator.wikimedia.org/T223234 (10aborrero) 05Open→03Invalid This is a bogus autogenerated task produced by https://gerrit.wikimedia.org/r/c/operations/puppet/+/508855/, closing now. [11:44:09] 10Operations, 10ops-eqiad: Degraded RAID on cloudvirt1018 - https://phabricator.wikimedia.org/T223244 (10aborrero) 05Open→03Invalid This is a bogus autogenerated task produced by https://gerrit.wikimedia.org/r/c/operations/puppet/+/508855/, closing now. [11:46:13] 10Operations, 10ops-eqiad: Degraded RAID on kafka1014 - https://phabricator.wikimedia.org/T223246 (10aborrero) This is likely a bogus autogenerated task produced by https://gerrit.wikimedia.org/r/c/operations/puppet/+/508855/ I recommend closing as invalid. 
[11:49:22] RECOVERY - MegaRAID on cloudvirt1018 is OK: OK: optimal, 1 logical, 8 physical https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [11:49:22] RECOVERY - MegaRAID on db1071 is OK: OK: optimal, 1 logical, 2 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [11:49:22] PROBLEM - MegaRAID on db1067 is CRITICAL: CRITICAL: Devices (12) not equal to PDs (6) https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [11:49:24] ACKNOWLEDGEMENT - MegaRAID on db1067 is CRITICAL: CRITICAL: Devices (12) not equal to PDs (6) nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T223248 https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [11:49:24] PROBLEM - MegaRAID on backup2001 is CRITICAL: CRITICAL: Devices (12) not equal to PDs (24) https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [11:49:25] ACKNOWLEDGEMENT - MegaRAID on backup2001 is CRITICAL: CRITICAL: Devices (12) not equal to PDs (24) nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T223249 https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [11:49:27] PROBLEM - MegaRAID on kafka1012 is CRITICAL: CRITICAL: Devices (12) not equal to PDs (0) https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [11:49:28] ACKNOWLEDGEMENT - MegaRAID on kafka1012 is CRITICAL: CRITICAL: Devices (12) not equal to PDs (0) nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T223250 https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [11:49:28] 10Operations, 10ops-eqiad: Degraded RAID on db1067 - https://phabricator.wikimedia.org/T223248 (10ops-monitoring-bot) [11:49:29] PROBLEM - MegaRAID on ms-be1043 is CRITICAL: CRITICAL: Devices (14) not equal to PDs (13) https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [11:49:30] ACKNOWLEDGEMENT - MegaRAID on ms-be1043 is CRITICAL: CRITICAL: Devices (14) not equal to PDs (13) nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T223251 
https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [11:49:30] PROBLEM - MegaRAID on db1072 is CRITICAL: CRITICAL: Devices (12) not equal to PDs (2) https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [11:49:31] ACKNOWLEDGEMENT - MegaRAID on db1072 is CRITICAL: CRITICAL: Devices (12) not equal to PDs (2) nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T223252 https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [11:49:32] PROBLEM - MegaRAID on tungsten is CRITICAL: CRITICAL: Devices (12) not equal to PDs (2) https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [11:49:32] 10Operations, 10ops-codfw: Degraded RAID on backup2001 - https://phabricator.wikimedia.org/T223249 (10ops-monitoring-bot) [11:49:33] ACKNOWLEDGEMENT - MegaRAID on tungsten is CRITICAL: CRITICAL: Devices (12) not equal to PDs (2) nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T223253 https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [11:49:33] PROBLEM - MegaRAID on db2096 is CRITICAL: CRITICAL: Devices (10) not equal to PDs (2) https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [11:49:34] 10Operations, 10ops-eqiad: Degraded RAID on kafka1012 - https://phabricator.wikimedia.org/T223250 (10ops-monitoring-bot) [11:49:35] ACKNOWLEDGEMENT - MegaRAID on db2096 is CRITICAL: CRITICAL: Devices (10) not equal to PDs (2) nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T223254 https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [11:49:35] PROBLEM - MegaRAID on labstore2003 is CRITICAL: CRITICAL: Devices (12) not equal to PDs (2) https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [11:49:36] 10Operations, 10ops-eqiad: Degraded RAID on ms-be1043 - https://phabricator.wikimedia.org/T223251 (10ops-monitoring-bot) [11:49:36] ACKNOWLEDGEMENT - MegaRAID on labstore2003 is CRITICAL: CRITICAL: Devices (12) not equal to PDs (2) nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T223255 
https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [11:49:38] 10Operations, 10ops-eqiad: Degraded RAID on db1072 - https://phabricator.wikimedia.org/T223252 (10ops-monitoring-bot) [11:49:40] PROBLEM - MegaRAID on es2002 is CRITICAL: CRITICAL: Devices (12) not equal to PDs (2) https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [11:49:40] 10Operations, 10ops-eqiad: Degraded RAID on tungsten - https://phabricator.wikimedia.org/T223253 (10ops-monitoring-bot) [11:49:41] ACKNOWLEDGEMENT - MegaRAID on es2002 is CRITICAL: CRITICAL: Devices (12) not equal to PDs (2) nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T223257 https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [11:49:42] 10Operations, 10ops-codfw: Degraded RAID on db2096 - https://phabricator.wikimedia.org/T223254 (10ops-monitoring-bot) [11:49:44] 10Operations, 10ops-codfw: Degraded RAID on labstore2003 - https://phabricator.wikimedia.org/T223255 (10ops-monitoring-bot) [11:49:46] 10Operations, 10ops-codfw: Degraded RAID on labstore2001 - https://phabricator.wikimedia.org/T223256 (10ops-monitoring-bot) [11:49:48] 10Operations, 10ops-codfw: Degraded RAID on es2002 - https://phabricator.wikimedia.org/T223257 (10ops-monitoring-bot) [11:49:53] 10Operations, 10ops-codfw: Degraded RAID on labstore2002 - https://phabricator.wikimedia.org/T223258 (10ops-monitoring-bot) [11:50:07] 10Operations, 10ops-eqiad: Degraded RAID on db1072 - https://phabricator.wikimedia.org/T223252 (10Marostegui) 05Open→03Declined This was due to a bad puppet merge https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/508855/ [11:50:21] 10Operations, 10ops-eqiad: Degraded RAID on db1067 - https://phabricator.wikimedia.org/T223248 (10Marostegui) 05Open→03Declined This was due to a bad puppet merge https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/508855/ [11:50:42] 10Operations, 10ops-codfw: Degraded RAID on db2096 - https://phabricator.wikimedia.org/T223254 (10Marostegui) 05Open→03Declined This was due 
to a bad puppet merge https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/508855/ [11:52:53] (03PS1) 10Zppix: Enable SandboxLink on papwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/510142 (https://phabricator.wikimedia.org/T223166) [11:54:14] 10Operations, 10Continuous-Integration-Infrastructure, 10SRE-Access-Requests, 10Release-Engineering-Team (Backlog): Request: add awight to contint-docker - https://phabricator.wikimedia.org/T223262 (10awight) [11:54:34] 10Operations, 10ops-codfw: Degraded RAID on labstore2003 - https://phabricator.wikimedia.org/T223255 (10aborrero) 05Open→03Invalid This is a bogus autogenerated task produced by https://gerrit.wikimedia.org/r/c/operations/puppet/+/508855/, closing now. [11:54:42] 10Operations, 10ops-codfw: Degraded RAID on labstore2001 - https://phabricator.wikimedia.org/T223256 (10aborrero) 05Open→03Invalid This is a bogus autogenerated task produced by https://gerrit.wikimedia.org/r/c/operations/puppet/+/508855/, closing now. [11:54:49] 10Operations, 10ops-codfw: Degraded RAID on labstore2002 - https://phabricator.wikimedia.org/T223258 (10aborrero) 05Open→03Invalid This is a bogus autogenerated task produced by https://gerrit.wikimedia.org/r/c/operations/puppet/+/508855/, closing now. [11:55:07] 10Operations, 10ops-eqiad, 10Operations-Software-Development, 10observability, 10Patch-For-Review: ms-be1043 sdk failed - https://phabricator.wikimedia.org/T218544 (10jbond) The change above failed as the phsicalDrive regex does not take into account the number of drives per span. However further in... [11:55:40] (03CR) 10Ppchelko: "Hm... I think I want to abandon this change.
This idea basically makes the process of what to construct and where to send it cyclic - we n" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/509092 (https://phabricator.wikimedia.org/T222822) (owner: 10Ppchelko) [11:58:28] PROBLEM - MegaRAID on es2004 is CRITICAL: CRITICAL: Devices (12) not equal to PDs (2) https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [11:58:29] ACKNOWLEDGEMENT - MegaRAID on es2004 is CRITICAL: CRITICAL: Devices (12) not equal to PDs (2) nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T223264 https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [11:58:35] 10Operations, 10ops-codfw: Degraded RAID on es2004 - https://phabricator.wikimedia.org/T223264 (10ops-monitoring-bot) [11:59:36] PROBLEM - MegaRAID on db1070 is CRITICAL: CRITICAL: Devices (12) not equal to PDs (2) https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [11:59:37] ACKNOWLEDGEMENT - MegaRAID on db1070 is CRITICAL: CRITICAL: Devices (12) not equal to PDs (2) nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T223265 https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [11:59:42] 10Operations, 10ops-eqiad: Degraded RAID on db1070 - https://phabricator.wikimedia.org/T223265 (10ops-monitoring-bot) [11:59:48] RECOVERY - MegaRAID on db1067 is OK: OK: optimal, 1 logical, 6 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [11:59:48] RECOVERY - MegaRAID on db1063 is OK: OK: optimal, 1 logical, 2 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [11:59:48] RECOVERY - MegaRAID on backup2001 is OK: OK: optimal, 1 logical, 24 physical https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [11:59:52] RECOVERY - MegaRAID on kafka1014 is OK: OK: no disks configured for RAID https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [11:59:52] RECOVERY - MegaRAID on kafka1012 is OK: OK: no disks configured for RAID 
https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [11:59:53] RECOVERY - MegaRAID on db1072 is OK: OK: optimal, 1 logical, 2 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [11:59:54] RECOVERY - MegaRAID on db2096 is OK: OK: optimal, 1 logical, 2 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [11:59:54] RECOVERY - MegaRAID on labstore2003 is OK: OK: optimal, 1 logical, 2 physical https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [11:59:54] RECOVERY - MegaRAID on es2002 is OK: OK: optimal, 1 logical, 2 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [12:00:04] Deploy window Pre MediaWiki train sanity break (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190514T1200) [12:00:48] RECOVERY - Mediawiki Cirrussearch update rate - codfw on icinga1001 is OK: OK: Less than 1.00% under the threshold [80.0] https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?panelId=44&fullscreen&orgId=1 [12:01:16] (I'm deploying a UBN to wmf.4 but it should affect the train, FYI hashar.) [12:01:56] RECOVERY - Mediawiki Cirrussearch update rate - eqiad on icinga1001 is OK: OK: Less than 1.00% under the threshold [80.0] https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?panelId=44&fullscreen&orgId=1 [12:02:07] should or should not? [12:03:11] Should not. :-) Whoops. [12:03:57] 10Operations, 10Release Pipeline, 10Release-Engineering-Team, 10Wikidata, and 5 others: Introduce wikidata termbox SSR to kubernetes - https://phabricator.wikimedia.org/T220402 (10akosiaris) >>! In T220402#5180031, @Pablo-WMDE wrote: > Hi @akosiaris, > > thanks for taking the time to explain the way the `... 
[12:06:37] (03PS3) 10Jbond: flake8: update python file extensions so they are detected by CI [puppet] - 10https://gerrit.wikimedia.org/r/509875 (https://phabricator.wikimedia.org/T144169) [12:07:19] (03CR) 10Jbond: "PCC https://puppet-compiler.wmflabs.org/compiler1002/16532/icinga1001.wikimedia.org/" [puppet] - 10https://gerrit.wikimedia.org/r/509875 (https://phabricator.wikimedia.org/T144169) (owner: 10Jbond) [12:07:23] (03CR) 10Jbond: [C: 03+2] flake8: update python file extensions so they are detected by CI [puppet] - 10https://gerrit.wikimedia.org/r/509875 (https://phabricator.wikimedia.org/T144169) (owner: 10Jbond) [12:08:52] RECOVERY - MegaRAID on es2004 is OK: OK: optimal, 1 logical, 2 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [12:09:58] RECOVERY - MegaRAID on db1070 is OK: OK: optimal, 1 logical, 2 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [12:10:14] RECOVERY - MegaRAID on ms-be1043 is OK: OK: optimal, 13 logical, 13 physical https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [12:10:14] RECOVERY - MegaRAID on tungsten is OK: OK: optimal, 1 logical, 2 physical https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [12:10:53] !log rebooting mw2164 for kernel update [12:10:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:12:26] (03Abandoned) 10Hashar: Allow tests to run on trivial rebase [puppet] (refs/meta/config) - 10https://gerrit.wikimedia.org/r/510112 (https://phabricator.wikimedia.org/T223209) (owner: 10Hashar) [12:13:08] (03PS2) 10Jbond: flake8 - mwgrep: update python file extensions so they are detected by CI [puppet] - 10https://gerrit.wikimedia.org/r/509936 (https://phabricator.wikimedia.org/T144169) [12:13:28] 10Operations, 10Domains, 10Traffic, 10WMF-Legal, 10Patch-For-Review: Move wikimedia.ee under WM-EE - https://phabricator.wikimedia.org/T204056 (10tramm) @Dzahn Just to make everything clear, we are going to use virtual 
hosting of our service provider on IP aadresss 185.7.252.114 to run our Wordpress home... [12:14:34] (03CR) 10Jbond: [C: 03+2] flake8 - mwgrep: update python file extensions so they are detected by CI [puppet] - 10https://gerrit.wikimedia.org/r/509936 (https://phabricator.wikimedia.org/T144169) (owner: 10Jbond) [12:14:56] (03CR) 10Jcrespo: [C: 04-2] "> Patch Set 1: Code-Review+1" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/510108 (https://phabricator.wikimedia.org/T223126) (owner: 10Jcrespo) [12:17:43] (03CR) 10Jbond: [C: 03+2] flake8 - grafana: update python file extensions so they are detected by CI [puppet] - 10https://gerrit.wikimedia.org/r/509927 (https://phabricator.wikimedia.org/T144169) (owner: 10Jbond) [12:17:45] 10Operations, 10ops-eqiad, 10Patch-For-Review, 10cloud-services-team (Kanban): Cloud Services: reallocate workload from rack B5-eqiad - https://phabricator.wikimedia.org/T223148 (10aborrero) Email sent with content: `lines=5 Hi! on 2019-05-16 13:00 UTC there will be a maintenance operation in one of the... 
[12:17:56] (03PS2) 10Jbond: flake8 - grafana: update python file extensions so they are detected by CI [puppet] - 10https://gerrit.wikimedia.org/r/509927 (https://phabricator.wikimedia.org/T144169) [12:19:00] (03CR) 10Jbond: [C: 03+2] flake8 - grafana: update python file extensions so they are detected by CI [puppet] - 10https://gerrit.wikimedia.org/r/509927 (https://phabricator.wikimedia.org/T144169) (owner: 10Jbond) [12:19:40] (03PS2) 10Jbond: flake8 - cassandra: update python file extensions so they are detected by CI [puppet] - 10https://gerrit.wikimedia.org/r/509937 (https://phabricator.wikimedia.org/T144169) [12:20:35] 10Operations, 10Discovery-Search, 10Elasticsearch: Reshard Commons_wiki - https://phabricator.wikimedia.org/T217531 (10Mathew.onipe) 05Open→03Invalid [12:21:00] (03CR) 10Jbond: [C: 03+2] flake8 - cassandra: update python file extensions so they are detected by CI [puppet] - 10https://gerrit.wikimedia.org/r/509937 (https://phabricator.wikimedia.org/T144169) (owner: 10Jbond) [12:23:55] (03PS2) 10Muehlenhoff: Switch deployment-prep to facter 3 / puppet 5 [puppet] - 10https://gerrit.wikimedia.org/r/509852 (https://phabricator.wikimedia.org/T219803) [12:24:10] PROBLEM - Disk space on mwmaint1002 is CRITICAL: DISK CRITICAL - free space: / 4897 MB (3% inode=89%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space [12:24:56] (03CR) 10Muehlenhoff: [C: 03+2] Switch deployment-prep to facter 3 / puppet 5 [puppet] - 10https://gerrit.wikimedia.org/r/509852 (https://phabricator.wikimedia.org/T219803) (owner: 10Muehlenhoff) [12:27:46] PROBLEM - High CPU load on API appserver on mw1227 is CRITICAL: CRITICAL - load average: 59.48, 31.47, 21.50 [12:28:35] (03CR) 10Jbond: "https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/16535/console" [puppet] - 10https://gerrit.wikimedia.org/r/509923 (https://phabricator.wikimedia.org/T144169) (owner: 10Jbond) [12:31:47] (03PS3) 10Jbond: flake8 - arclamp: update python file extensions so they 
are detected by CI [puppet] - 10https://gerrit.wikimedia.org/r/509923 (https://phabricator.wikimedia.org/T144169) [12:31:56] RECOVERY - High CPU load on API appserver on mw1227 is OK: OK - load average: 17.14, 25.03, 21.41 [12:33:03] (03CR) 10Jbond: [C: 03+2] flake8 - arclamp: update python file extensions so they are detected by CI [puppet] - 10https://gerrit.wikimedia.org/r/509923 (https://phabricator.wikimedia.org/T144169) (owner: 10Jbond) [12:37:55] !log jforrester@deploy1001 Synchronized php-1.34.0-wmf.4/extensions/VisualEditor/: Hot-deploy T223023 fix I1b35b28e42 for mobile VE edit section switches (duration: 00m 54s) [12:37:57] jbond42: arturo: the tests not running is fixed. I have reverted a config change that caused the issue :-( [12:38:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:38:08] T223023: Switching from source to visual editing in mobile web removes all other sections - https://phabricator.wikimedia.org/T223023 [12:38:11] hashar: ack [12:38:22] hashar: great thanks [12:43:41] (03CR) 10Jbond: "PCC - https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/16536/console" [puppet] - 10https://gerrit.wikimedia.org/r/509917 (https://phabricator.wikimedia.org/T144169) (owner: 10Jbond) [12:45:34] (03PS4) 10Jbond: flake8 - rabbitmq: update python file extensions so they are detected by CI [puppet] - 10https://gerrit.wikimedia.org/r/509917 (https://phabricator.wikimedia.org/T144169) [12:46:07] (03CR) 10Jbond: [C: 03+2] flake8 - rabbitmq: update python file extensions so they are detected by CI [puppet] - 10https://gerrit.wikimedia.org/r/509917 (https://phabricator.wikimedia.org/T144169) (owner: 10Jbond) [12:49:31] (03PS3) 10Jbond: flake8 - varnish: update python file extensions so they are detected by CI [puppet] - 10https://gerrit.wikimedia.org/r/509921 (https://phabricator.wikimedia.org/T144169) [12:49:56] (03Abandoned) 10Elukey: role::mediawiki::canary_appserver: remove nutcracker memcached conf 
[puppet] - 10https://gerrit.wikimedia.org/r/510124 (https://phabricator.wikimedia.org/T214275) (owner: 10Jbond) [12:50:33] (03Abandoned) 10Elukey: role::mediawiki::canary_appserver: remove nutcracker memcached conf [puppet] - 10https://gerrit.wikimedia.org/r/509462 (https://phabricator.wikimedia.org/T214275) (owner: 10Elukey) [12:52:35] (03CR) 10Lucas Werkmeister (WMDE): [C: 03+1] "> Well, “Endpoint: https://query.wikidata.org/sparql” isn’t going to work correctly for Beta or Test Wikidata" (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/509878 (owner: 10Michael Große) [12:52:38] (03CR) 10Jbond: "https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/16537/console" [puppet] - 10https://gerrit.wikimedia.org/r/509921 (https://phabricator.wikimedia.org/T144169) (owner: 10Jbond) [12:52:53] (03CR) 10Jbond: [C: 03+2] flake8 - varnish: update python file extensions so they are detected by CI [puppet] - 10https://gerrit.wikimedia.org/r/509921 (https://phabricator.wikimedia.org/T144169) (owner: 10Jbond) [12:52:59] 10Operations, 10Availability (MediaWiki-MultiDC), 10codfw-rollout: Add redundancy to IRC recent changes service - https://phabricator.wikimedia.org/T128592 (10Krinkle) [12:53:15] 10Operations, 10Availability (MediaWiki-MultiDC), 10codfw-rollout: Add redundancy to IRC recent changes service - https://phabricator.wikimedia.org/T128592 (10Krenair) People connect via hostname, and we can put multiple A records in for the different VMs? 
[12:53:35] (03PS2) 10Zppix: Enable SandboxLink on papwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/510142 (https://phabricator.wikimedia.org/T223166) [12:54:29] (03CR) 10jerkins-bot: [V: 04-1] Enable SandboxLink on papwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/510142 (https://phabricator.wikimedia.org/T223166) (owner: 10Zppix) [12:55:55] (03PS3) 10Zppix: Enable SandboxLink on papwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/510142 (https://phabricator.wikimedia.org/T223166) [12:58:12] PROBLEM - puppet last run on cp1084 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 4 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/usr/local/bin/varnishstatsd] [12:58:25] 10Operations, 10Availability (MediaWiki-MultiDC), 10codfw-rollout: Add redundancy to IRC recent changes service - https://phabricator.wikimedia.org/T128592 (10faidon) That's an old task! @ottomata et al may have an opinion. Also see T185319, plus I'm sure we've discussed various ideas over time in other tas... [12:58:28] PROBLEM - puppet last run on cp4022 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 4 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/usr/local/bin/varnishreqstats] [12:58:30] (03PS1) 10Elukey: Empty mediawiki_memcached_servers for 3 mw hosts [puppet] - 10https://gerrit.wikimedia.org/r/510153 (https://phabricator.wikimedia.org/T214275) [13:00:04] hashar: How many deployers does it take to do MediaWiki train - European version deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190514T1300). [13:00:36] PROBLEM - puppet last run on cp3033 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 6 minutes ago with 2 failures. Failed resources (up to 3 shown): File[/usr/local/bin/varnishreqstats],File[/usr/local/bin/varnishstatsd] [13:00:50] o/ [13:01:20] PROBLEM - puppet last run on cp3036 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 7 minutes ago with 1 failures. 
Failed resources (up to 3 shown): File[/usr/local/bin/varnishreqstats] [13:01:58] (03CR) 10Elukey: "https://puppet-compiler.wmflabs.org/compiler1001/16538/" [puppet] - 10https://gerrit.wikimedia.org/r/510153 (https://phabricator.wikimedia.org/T214275) (owner: 10Elukey) [13:03:20] May 14 12:53:31 cp1084 puppet-agent[212933]: (/Stage[main]/Varnish::Logging::Backend/Varnish::Logging::Statsd[default]/File[/usr/local/bin/varnishstatsd]) Could not evaluate: Could not retrieve information from environment production source(s) puppet:///modules/varnish/varnishstatsd [13:03:25] ^ ? [13:04:06] (03PS6) 10Jbond: flake8 - misc: update python file extensions so they are detected by CI [puppet] - 10https://gerrit.wikimedia.org/r/509929 [13:04:58] (03PS4) 10Lucas Werkmeister (WMDE): Add EntitySchema to extension-list [mediawiki-config] - 10https://gerrit.wikimedia.org/r/509437 (https://phabricator.wikimedia.org/T221650) (owner: 10Michael Große) [13:05:01] (03PS2) 10Lucas Werkmeister (WMDE): Define wmgUseEntitySchema [mediawiki-config] - 10https://gerrit.wikimedia.org/r/505816 (https://phabricator.wikimedia.org/T221651) [13:05:02] (03PS6) 10Lucas Werkmeister (WMDE): Add configuration for EntitySchema ShExSimpleUrl [mediawiki-config] - 10https://gerrit.wikimedia.org/r/509878 (https://phabricator.wikimedia.org/T223120) (owner: 10Michael Große) [13:05:15] (03CR) 10Jbond: [C: 03+2] flake8 - misc: update python file extensions so they are detected by CI [puppet] - 10https://gerrit.wikimedia.org/r/509929 (owner: 10Jbond) [13:08:28] (03CR) 10Hashar: [C: 03+2] "it is train time!" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/510129 (https://phabricator.wikimedia.org/T220730) (owner: 10Hashar) [13:09:16] 10Operations, 10Availability (MediaWiki-MultiDC), 10codfw-rollout: Add redundancy to IRC recent changes service - https://phabricator.wikimedia.org/T128592 (10Krinkle) >>! In T128592#5180472, @Krenair wrote: > [..] 
Still would be less effort for ops to just alter DNS in such a case instead of having to build... [13:09:30] PROBLEM - puppet last run on mw2236 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 3 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/usr/local/lib/nagios/plugins/check_systemd_unit_state.py] [13:09:38] (03Merged) 10jenkins-bot: Group 0 to 1.34.0-wmf.5 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/510129 (https://phabricator.wikimedia.org/T220730) (owner: 10Hashar) [13:09:44] PROBLEM - puppet last run on oresrdb1002 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 3 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/usr/local/sbin/smart-data-dump] [13:09:44] PROBLEM - puppet last run on elastic1028 is CRITICAL: CRITICAL: Puppet has 3 failures. Last run 3 minutes ago with 3 failures. Failed resources (up to 3 shown): File[/usr/local/bin/phaste],File[/usr/local/sbin/smart-data-dump],File[/usr/local/bin/es-tool] [13:09:56] (03CR) 10jenkins-bot: Group 0 to 1.34.0-wmf.5 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/510129 (https://phabricator.wikimedia.org/T220730) (owner: 10Hashar) [13:10:50] PROBLEM - Keyholder SSH agent on netmon1002 is CRITICAL: CRITICAL: Keyholder is not armed. Run keyholder arm to arm it. https://wikitech.wikimedia.org/wiki/Keyholder [13:11:18] RECOVERY - puppet last run on cp3033 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [13:12:10] PROBLEM - puppet last run on labpuppetmaster1001 is CRITICAL: CRITICAL: Puppet has 3 failures. Last run 5 minutes ago with 3 failures. Failed resources (up to 3 shown): File[/usr/local/bin/phaste],File[/usr/local/sbin/smart-data-dump],File[/usr/local/bin/ssh-agent-proxy] [13:12:10] PROBLEM - puppet last run on mw2146 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 5 minutes ago with 1 failures. 
Failed resources (up to 3 shown): File[/usr/local/sbin/smart-data-dump] [13:12:18] PROBLEM - Keyholder SSH agent on netmon2001 is CRITICAL: CRITICAL: Keyholder is not armed. Run keyholder arm to arm it. https://wikitech.wikimedia.org/wiki/Keyholder [13:12:47] !log hashar@deploy1001 Started scap: testwiki to 1.34.0-wmf.5 and rebuild l10n cache [13:12:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:12:59] !log train delay, I forgot to sync 1.34.0-wmf.5 [13:13:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:14:52] RECOVERY - puppet last run on mw2236 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [13:15:08] RECOVERY - puppet last run on elastic1028 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [13:15:32] PROBLEM - Keyholder SSH agent on cumin2001 is CRITICAL: CRITICAL: Keyholder is not armed. Run keyholder arm to arm it. https://wikitech.wikimedia.org/wiki/Keyholder [13:15:53] (03PS1) 10Hashar: Revert "Group 0 to 1.34.0-wmf.5" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/510157 [13:16:07] (03CR) 10Hashar: [C: 03+2] "I forgot to scap sync." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/510157 (owner: 10Hashar) [13:16:22] RECOVERY - Keyholder SSH agent on netmon1002 is OK: OK: Keyholder is armed with all configured keys. https://wikitech.wikimedia.org/wiki/Keyholder [13:16:28] RECOVERY - Keyholder SSH agent on netmon2001 is OK: OK: Keyholder is armed with all configured keys. https://wikitech.wikimedia.org/wiki/Keyholder [13:17:10] (03Merged) 10jenkins-bot: Revert "Group 0 to 1.34.0-wmf.5" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/510157 (owner: 10Hashar) [13:17:24] (03CR) 10jenkins-bot: Revert "Group 0 to 1.34.0-wmf.5" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/510157 (owner: 10Hashar) [13:18:16] RECOVERY - Keyholder SSH agent on cumin2001 is OK: OK: Keyholder is armed with all configured keys. 
https://wikitech.wikimedia.org/wiki/Keyholder [13:18:37] (03PS2) 10Lucas Werkmeister (WMDE): Remove constraint-suggestions beta feature [mediawiki-config] - 10https://gerrit.wikimedia.org/r/503342 (https://phabricator.wikimedia.org/T220609) [13:18:51] (03PS1) 10Hashar: Group 0 to 1.34.0-wmf.5 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/510158 (https://phabricator.wikimedia.org/T220730) [13:19:08] (03CR) 10Lucas Werkmeister (WMDE): Remove constraint-suggestions beta feature [mediawiki-config] - 10https://gerrit.wikimedia.org/r/503342 (https://phabricator.wikimedia.org/T220609) (owner: 10Lucas Werkmeister (WMDE)) [13:23:57] (03CR) 10Filippo Giunchedi: [C: 03+2] cassandra: check for flag file before service startup [puppet] - 10https://gerrit.wikimedia.org/r/509409 (https://phabricator.wikimedia.org/T214166) (owner: 10Filippo Giunchedi) [13:24:05] (03PS2) 10Filippo Giunchedi: cassandra: check for flag file before service startup [puppet] - 10https://gerrit.wikimedia.org/r/509409 (https://phabricator.wikimedia.org/T214166) [13:25:00] RECOVERY - puppet last run on cp1084 is OK: OK: Puppet is currently enabled, last run 18 seconds ago with 0 failures [13:25:08] (03CR) 10jerkins-bot: [V: 04-1] cassandra: check for flag file before service startup [puppet] - 10https://gerrit.wikimedia.org/r/509409 (https://phabricator.wikimedia.org/T214166) (owner: 10Filippo Giunchedi) [13:25:18] RECOVERY - puppet last run on cp4022 is OK: OK: Puppet is currently enabled, last run 37 seconds ago with 0 failures [13:26:48] 10Operations, 10netops, 10cloud-services-team (Kanban): CloudVPS: evaluate if 10G is working correctly in cloudvirts - https://phabricator.wikimedia.org/T223272 (10aborrero) [13:27:12] 10Operations, 10Availability (MediaWiki-MultiDC), 10codfw-rollout: Add redundancy to IRC recent changes service - https://phabricator.wikimedia.org/T128592 (10Ottomata) It'd be really fun to build a replaceement IRC service based on the mediawiki.recentchange Kafka topic. 
Perhaps one day I will have time... :) [13:27:25] !log hashar@deploy1001 Finished scap: testwiki to 1.34.0-wmf.5 and rebuild l10n cache (duration: 14m 39s) [13:27:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:27:46] PROBLEM - Check Varnish expiry mailbox lag on cp3034 is CRITICAL: CRITICAL: expiry mailbox lag is 2008823 https://wikitech.wikimedia.org/wiki/Varnish [13:28:09] (03PS1) 10Muehlenhoff: Stop installing the obsolete puppet-common transition package [puppet] - 10https://gerrit.wikimedia.org/r/510159 [13:28:10] RECOVERY - puppet last run on cp3036 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [13:29:28] PROBLEM - Keyholder SSH agent on deploy1001 is CRITICAL: CRITICAL: Keyholder is not armed. Run keyholder arm to arm it. https://wikitech.wikimedia.org/wiki/Keyholder [13:30:06] PROBLEM - Keyholder SSH agent on labpuppetmaster1002 is CRITICAL: CRITICAL: Keyholder is not armed. Run keyholder arm to arm it. https://wikitech.wikimedia.org/wiki/Keyholder [13:31:52] PROBLEM - Keyholder SSH agent on deploy2001 is CRITICAL: CRITICAL: Keyholder is not armed. Run keyholder arm to arm it. 
https://wikitech.wikimedia.org/wiki/Keyholder [13:32:09] 10Operations, 10Maps, 10Reading-Infrastructure-Team-Backlog, 10Patch-For-Review: migrate maps servers to stretch with the current style - https://phabricator.wikimedia.org/T198622 (10Mathew.onipe) [13:32:19] (03CR) 10Filippo Giunchedi: [C: 03+2] "Not passing CI because of:" [puppet] - 10https://gerrit.wikimedia.org/r/509409 (https://phabricator.wikimedia.org/T214166) (owner: 10Filippo Giunchedi) [13:33:21] 10Operations, 10Maps, 10Patch-For-Review: Switch to unix socket connections for osmupdater / osmimporter for postgresql on maps - https://phabricator.wikimedia.org/T206639 (10Mathew.onipe) 05Open→03Resolved [13:34:08] Error: Could not retrieve catalog from remote server: Error 500 on SERVER: Server Error: Evaluation Error: Error while evaluating a Function Call, undefined method `fips_enabled?' for Puppet::Util::Platform:Module at /etc/puppet/modules/base/manifests/puppet.pp:14:26 on node deployment-docker-citoid01.deployment-prep.eqiad.wmflabs [13:34:13] Know anything about that one moritzm ? [13:34:43] 13:27:23 sudo -u mwdeploy -n -- /usr/bin/rsync -l deploy1001.eqiad.wmnet::common/wikiversions*.{json,php} /srv/mediawiki on mw1228.eqiad.wmnet returned [255]: Permission denied (publickey,keyboard-interactive). [13:34:44] :( [13:34:59] currently investigating, caused by new facter or puppet [13:35:03] 13:27:21 289 hosts had scap-cdb-rebuild errors [13:35:12] I figured it was that commit, just wanted to check you were on it :) [13:35:19] some how ssh is broken on deploy1001 [13:35:20] 10Operations, 10Epic, 10Maps (Kartotherian): Create blubberfile for deploying kartotherian into docker environment. - https://phabricator.wikimedia.org/T223275 (10Mathew.onipe) [13:35:27] which I guess is due to the Keyholder SSH agent on deploy1001 is CRITICAL: CRITICAL: [13:35:34] 10Operations, 10Epic, 10Maps (Kartotherian): Create blubberfile for deploying kartotherian into docker environment. 
- https://phabricator.wikimedia.org/T223275 (10Mathew.onipe) p:05Triage→03Normal [13:36:13] keyholder-proxy: active [13:36:13] - The agent has no identities. [13:36:14] ;( [13:36:19] yeah why did keyholder suddenly stop everywhere? [13:36:26] deploy[12]001, labpuppetmaster1002 too [13:36:29] filling a bug [13:36:33] maybe not everywhere but still [13:36:36] RECOVERY - puppet last run on oresrdb1002 is OK: OK: Puppet is currently enabled, last run 34 seconds ago with 0 failures [13:37:49] 10Operations: keyholder is no more armed on deplo1001 (train blocker) - https://phabricator.wikimedia.org/T223276 (10hashar) [13:37:59] 10Operations: keyholder is no more armed on deplo1001 (train blocker) - https://phabricator.wikimedia.org/T223276 (10hashar) [13:38:04] PROBLEM - Keyholder SSH agent on cumin1001 is CRITICAL: CRITICAL: Keyholder is not armed. Run keyholder arm to arm it. https://wikitech.wikimedia.org/wiki/Keyholder [13:38:56] RECOVERY - puppet last run on labpuppetmaster1001 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [13:38:58] RECOVERY - puppet last run on mw2146 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [13:39:22] PROBLEM - Keyholder SSH agent on labpuppetmaster1001 is CRITICAL: CRITICAL: Keyholder is not armed. Run keyholder arm to arm it. https://wikitech.wikimedia.org/wiki/Keyholder [13:39:37] 10Operations: keyholder has just disarmed everywhere (train blocker) - https://phabricator.wikimedia.org/T223276 (10Krenair) [13:40:26] 10Operations: keyholder is no more armed on deplo1001 (train blocker) - https://phabricator.wikimedia.org/T223276 (10hashar) [13:40:40] 10Operations: keyholder is no more armed on deplo1001 (train blocker) - https://phabricator.wikimedia.org/T223276 (10Krenair) May 14 14:10:50 PROBLEM - Keyholder SSH agent on netmon1002 is CRITICAL: CRITICAL: Keyholder is not armed. Run keyholder arm to arm it. https://wikitech.wikimedia.org/wiki/Key... 
[13:40:47] <_joe_> May 14 13:25:52 deploy1001 puppet-agent[7145]: (/Stage[main]/Keyholder/Systemd::Service[keyholder-agent]/Service[keyholder-agent]) Triggered 'refresh' from 1 event [13:40:55] <_joe_> something made it refresh [13:41:02] 10Operations: keyholder has just disarmed everywhere (train blocker) - https://phabricator.wikimedia.org/T223276 (10Krenair) [13:41:40] though not on contint1001 :/ [13:41:50] <_joe_> hashar: we're looking into it [13:41:51] maybe puppet didn't run there yet? [13:41:59] <_joe_> probably [13:42:02] acmechief1001 looks good [13:42:07] and I've triggered a puppet run right now [13:42:09] oh I have mis read cumin1001 and contint1001 [13:42:25] keyholder status reports the expected stuff [13:42:48] can we please get it rearmed at least on deploy1001 ? [13:43:01] I need it to run the train ;D [13:43:34] yes, I'm rearming it now [13:43:39] <_joe_> godog: thanks [13:43:49] 10Operations, 10ops-eqiad: Degraded RAID on db1070 - https://phabricator.wikimedia.org/T223265 (10Marostegui) 05Open→03Declined This was due to a bad puppet merge https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/508855/ [13:44:23] 10Operations, 10Move-Files-To-Commons, 10TCB-Team, 10Wikimedia-Extension-setup, and 2 others: Deploying FileExporter and FileImporter - https://phabricator.wikimedia.org/T190716 (10thiemowmde) [13:44:30] RECOVERY - Keyholder SSH agent on deploy1001 is OK: OK: Keyholder is armed with all configured keys. https://wikitech.wikimedia.org/wiki/Keyholder [13:44:31] godog: thank you :) [13:44:38] np! 
please try again [13:44:52] !log rearm keyholder on deploy and cumin hosts [13:44:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:44:55] !log hashar@deploy1001 Started scap: testwiki to 1.34.0-wmf.5 and rebuild l10n cache [13:44:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:45:13] godog: yeah it is armed so it should work just fine [13:45:15] stand by while now I accidentally the ssh key [13:45:21] not sure though why puppet would have cycled it :( [13:45:34] RECOVERY - Keyholder SSH agent on deploy2001 is OK: OK: Keyholder is armed with all configured keys. https://wikitech.wikimedia.org/wiki/Keyholder [13:46:16] RECOVERY - Keyholder SSH agent on cumin1001 is OK: OK: Keyholder is armed with all configured keys. https://wikitech.wikimedia.org/wiki/Keyholder [13:47:38] RECOVERY - Keyholder SSH agent on labpuppetmaster1001 is OK: OK: Keyholder is armed with all configured keys. https://wikitech.wikimedia.org/wiki/Keyholder [13:47:54] RECOVERY - Keyholder SSH agent on labpuppetmaster1002 is OK: OK: Keyholder is armed with all configured keys. https://wikitech.wikimedia.org/wiki/Keyholder [13:48:09] 10Operations: keyholder has just disarmed everywhere (train blocker) - https://phabricator.wikimedia.org/T223276 (10fgiunchedi) Keyholder rearmed on these hosts: deploy* cumin* labpuppetmaster* [13:48:14] not sure either [13:48:21] 10Operations, 10Domains, 10Traffic, 10WMF-Legal, 10Patch-For-Review: Move wikimedia.ee under WM-EE - https://phabricator.wikimedia.org/T204056 (10Dzahn) Hi, @tramm thanks for the update. I'll assign this back to @CRoslof for doing this at the Zone.ee level. For us in SRE the initial comment still stan... [13:48:41] /usr/local/bin/ssh-agent-proxy changed [13:48:48] 10Operations, 10Domains, 10Traffic, 10WMF-Legal, 10Patch-For-Review: Move wikimedia.ee under WM-EE - https://phabricator.wikimedia.org/T204056 (10Dzahn) a:05tramm→03CRoslof [13:49:25] one of those flake8 commits? 
[13:49:37] most likely [13:49:53] jbond42: ^^ [13:50:05] how did some of them recover? [13:50:20] netmon1002, netmon2001, cumin2001 [13:50:47] all shown by icinga as disarmed and then recovered [13:51:33] 10Operations: keyholder has just disarmed everywhere (train blocker) - https://phabricator.wikimedia.org/T223276 (10fgiunchedi) Root cause is `/usr/local/bin/ssh-agent-proxy` changed in https://gerrit.wikimedia.org/r/c/operations/puppet/+/509929 and caused keyholder to reload: ` May 14 13:26:21 labpuppetmaster1... [13:51:35] this was caused by this rename https://gerrit.wikimedia.org/r/c/operations/puppet/+/509929/6/modules/keyholder/manifests/init.pp [13:51:52] i rearmed them [13:52:05] k [13:52:14] sorry should have logged [13:54:08] jbond42: you should also tell it in the WMCS channel, as any VM in WMCS with keyholder would need the re-arm too [13:54:52] volans: ok thanks [13:55:16] * volans goes to re-arm his :) [13:57:10] (03PS1) 10Ema: ATS: add hardening features to systemd unit [puppet] - 10https://gerrit.wikimedia.org/r/510168 [13:58:22] 10Operations: keyholder has just disarmed everywhere (train blocker) - https://phabricator.wikimedia.org/T223276 (10jbond) Confirm that is responsible for the faliure, the other re-arms where done by myself [14:00:00] godog: keyholder works all fine. Confirmed thank you again [14:01:05] np! [14:01:16] 10Operations, 10Maps (Kartotherian): Create blubberfile for deploying kartotherian into docker environment. 
- https://phabricator.wikimedia.org/T223275 (10Mathew.onipe) [14:01:24] (03CR) 10Ema: "pcc looks sane to me: https://puppet-compiler.wmflabs.org/compiler1001/16540/" [puppet] - 10https://gerrit.wikimedia.org/r/510168 (owner: 10Ema) [14:02:58] PROBLEM - MediaWiki exceptions and fatals per minute on graphite1004 is CRITICAL: CRITICAL: 80.00% of data above the critical threshold [50.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=2&fullscreen [14:03:09] * hashar waits for cdb rebuild [14:03:24] PROBLEM - High CPU load on API appserver on mw1282 is CRITICAL: CRITICAL - load average: 68.35, 31.50, 20.76 [14:03:28] (03PS1) 10Muehlenhoff: Revert "Switch deployment-prep to facter 3 / puppet 5" [puppet] - 10https://gerrit.wikimedia.org/r/510169 [14:03:46] 10Operations: keyholder has just disarmed everywhere (train blocker) - https://phabricator.wikimedia.org/T223276 (10fgiunchedi) 05Open→03Resolved a:03fgiunchedi Resolving because we're back, although feel free to reopen if we're missing something. [14:03:56] (03CR) 10Jbond: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/510169 (owner: 10Muehlenhoff) [14:04:02] PROBLEM - High CPU load on API appserver on mw1287 is CRITICAL: CRITICAL - load average: 76.31, 39.49, 23.89 [14:04:17] 10Operations: keyholder has just disarmed everywhere (train blocker) - https://phabricator.wikimedia.org/T223276 (10hashar) Thank you for the quick fix! As jbond stated, that has been caused by the renaming of `ssh-agent-proxy` source: https://gerrit.wikimedia.org/r/c/operations/puppet/+/509929/6/modules/keyhol... [14:04:43] (03CR) 10Muehlenhoff: [C: 03+2] Revert "Switch deployment-prep to facter 3 / puppet 5" [puppet] - 10https://gerrit.wikimedia.org/r/510169 (owner: 10Muehlenhoff) [14:05:37] (03CR) 10BryanDavis: "> I think this will default to false or broken looking at the code." 
[puppet] - 10https://gerrit.wikimedia.org/r/507351 (https://phabricator.wikimedia.org/T219830) (owner: 10BryanDavis) [14:06:48] RECOVERY - High CPU load on API appserver on mw1287 is OK: OK - load average: 18.36, 28.69, 22.27 [14:07:05] <_joe_> !log apt-get lean on mwmaint1002 [14:07:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:07:18] Krenair, mobrovac: puppet errors on deployment-prep should be gone now, reverted for now [14:07:26] yup yup, seen it [14:07:28] thnx moritzm [14:07:32] RECOVERY - High CPU load on API appserver on mw1282 is OK: OK - load average: 19.76, 31.09, 23.95 [14:08:49] (03PS4) 10BryanDavis: striker: Enable developer account creation [puppet] - 10https://gerrit.wikimedia.org/r/507351 (https://phabricator.wikimedia.org/T219830) [14:09:52] RECOVERY - MediaWiki exceptions and fatals per minute on graphite1004 is OK: OK: Less than 70.00% above the threshold [25.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=2&fullscreen [14:10:04] PROBLEM - High CPU load on API appserver on mw1277 is CRITICAL: CRITICAL - load average: 60.27, 31.19, 20.77 [14:10:26] PROBLEM - High CPU load on API appserver on mw1280 is CRITICAL: CRITICAL - load average: 64.65, 31.98, 21.08 [14:11:26] RECOVERY - High CPU load on API appserver on mw1277 is OK: OK - load average: 29.57, 29.06, 20.94 [14:11:46] RECOVERY - High CPU load on API appserver on mw1280 is OK: OK - load average: 30.68, 28.87, 20.89 [14:11:54] PROBLEM - High CPU load on API appserver on mw1283 is CRITICAL: CRITICAL - load average: 62.95, 36.95, 23.93 [14:12:07] (03Abandoned) 10Jbond: yaml parse error test: re: 509462 [puppet] - 10https://gerrit.wikimedia.org/r/510110 (owner: 10Jbond) [14:12:20] PROBLEM - High CPU load on API appserver on mw1279 is CRITICAL: CRITICAL - load average: 65.63, 30.11, 20.47 [14:12:32] (03Abandoned) 10Jbond: testing - ignore [puppet] - 10https://gerrit.wikimedia.org/r/510115 (owner: 10Jbond) [14:13:07] 
(03Abandoned) 10Jbond: role::mediawiki::canary_appserver: remove nutcracker memcached conf [puppet] - 10https://gerrit.wikimedia.org/r/510121 (https://phabricator.wikimedia.org/T214275) (owner: 10Jbond) [14:13:18] RECOVERY - High CPU load on API appserver on mw1283 is OK: OK - load average: 28.15, 31.99, 23.27 [14:13:42] RECOVERY - High CPU load on API appserver on mw1279 is OK: OK - load average: 30.17, 27.94, 20.59 [14:18:58] (03PS1) 10Muehlenhoff: Switch puppetdb1001/1002 to facter 3/puppet 5 [puppet] - 10https://gerrit.wikimedia.org/r/510171 [14:19:28] (03CR) 10Jbond: [C: 03+1] "LGTM however it would be nice if we could pass a list of packages to upgrade before performing the reboot" (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/510113 (owner: 10Muehlenhoff) [14:19:34] (03CR) 10jerkins-bot: [V: 04-1] Switch puppetdb1001/1002 to facter 3/puppet 5 [puppet] - 10https://gerrit.wikimedia.org/r/510171 (owner: 10Muehlenhoff) [14:23:30] (03PS2) 10Muehlenhoff: Switch puppetdb1001/1002 to facter 3/puppet 5 [puppet] - 10https://gerrit.wikimedia.org/r/510171 (https://phabricator.wikimedia.org/T219803) [14:23:53] 10Operations, 10Domains, 10Traffic, 10WMF-Legal, 10Patch-For-Review: Move wikimedia.ee under WM-EE - https://phabricator.wikimedia.org/T204056 (10tramm) @Dzahn I try to express myself clearly. Wikimedia Estonia is okay with taking responsibility of the whole domain and AFAIK you transfer the domain at:... 
[14:28:14] (03PS12) 10Vgutierrez: ATS: Provide a unified monitoring define [puppet] - 10https://gerrit.wikimedia.org/r/506986 (https://phabricator.wikimedia.org/T221217) [14:28:16] (03PS5) 10Vgutierrez: ATS: Ensure that server's cipher suites preference is being honored [puppet] - 10https://gerrit.wikimedia.org/r/509771 (https://phabricator.wikimedia.org/T221594) [14:28:18] (03PS5) 10Vgutierrez: ATS: Provide support for TLS certificates with different SNI [puppet] - 10https://gerrit.wikimedia.org/r/510093 (https://phabricator.wikimedia.org/T221594) [14:28:20] (03PS44) 10Vgutierrez: ATS: Provide a TLS terminator profile and backend+TLS role [puppet] - 10https://gerrit.wikimedia.org/r/506398 (https://phabricator.wikimedia.org/T221594) [14:31:51] (03PS2) 10Ema: ATS: add hardening features to systemd unit [puppet] - 10https://gerrit.wikimedia.org/r/510168 [14:32:01] (03PS45) 10Vgutierrez: ATS: Provide a TLS terminator profile and backend+TLS role [puppet] - 10https://gerrit.wikimedia.org/r/506398 (https://phabricator.wikimedia.org/T221594) [14:33:35] (03CR) 10Jbond: [C: 03+1] "LGTM" [software/cumin] - 10https://gerrit.wikimedia.org/r/508079 (owner: 10Volans) [14:36:30] 10Operations, 10Maps (Kartotherian): Create blubberfile for deploying kartotherian into docker environment. - https://phabricator.wikimedia.org/T223275 (10MSantos) [14:38:56] (03PS46) 10Vgutierrez: ATS: Provide a TLS terminator profile and backend+TLS role [puppet] - 10https://gerrit.wikimedia.org/r/506398 (https://phabricator.wikimedia.org/T221594) [14:39:44] (03CR) 10Vgutierrez: [C: 04-1] ATS: add hardening features to systemd unit (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/510168 (owner: 10Ema) [14:42:14] (03PS1) 10Ottomata: Increase maprep.map.tasks from 25 to 200 for eventlogging camus job [puppet] - 10https://gerrit.wikimedia.org/r/510174 [14:43:29] ottomata: would it be ok to test a more conservative value? 
Like 100 or similar, just to avoid camus to hammer jumbo :) [14:45:39] (03PS3) 10Ema: ATS: add hardening features to systemd unit [puppet] - 10https://gerrit.wikimedia.org/r/510168 [14:45:43] hmm [14:46:11] elukey: i don't think it will hammer jumbo in this case; total el traffic is ~2K msgs / second [14:47:19] but i'd be fine with a lower value [14:47:24] any increase will probably help [14:47:42] !log hashar@deploy1001 Finished scap: testwiki to 1.34.0-wmf.5 and rebuild l10n cache (duration: 62m 47s) [14:47:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:47:48] 1 hour .... [14:48:18] that is until I have pressed ENTER repeatedly to unlock the progress report [14:48:50] (03CR) 10Hashar: [C: 03+2] Group 0 to 1.34.0-wmf.5 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/510158 (https://phabricator.wikimedia.org/T220730) (owner: 10Hashar) [14:49:58] (03Merged) 10jenkins-bot: Group 0 to 1.34.0-wmf.5 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/510158 (https://phabricator.wikimedia.org/T220730) (owner: 10Hashar) [14:50:01] 10Operations, 10ops-eqiad, 10Patch-For-Review: Install new PDUs into b5-eqiad - https://phabricator.wikimedia.org/T223126 (10jcrespo) dbproxy1006 switched over completely. The above patch (plus db1139 shutdown will be done hours before the maintenance). 
[14:50:03] !log rebooting mw1263 for kernel update [14:50:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:50:17] (03CR) 10jenkins-bot: Group 0 to 1.34.0-wmf.5 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/510158 (https://phabricator.wikimedia.org/T220730) (owner: 10Hashar) [14:53:51] !log hashar@deploy1001 rebuilt and synchronized wikiversions files: group0 to 1.34.0-wmf.5 [14:53:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:54:16] (03CR) 10Jbond: "One query about PrivateTmp and another small comment" (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/510168 (owner: 10Ema) [14:54:26] (03CR) 10Jcrespo: "Around?" [puppet] - 10https://gerrit.wikimedia.org/r/505407 (https://phabricator.wikimedia.org/T220894) (owner: 10Alex Monk) [14:56:23] 10Operations, 10Traffic, 10Patch-For-Review, 10Performance-Team (Radar): Refactor public-facing DYNA scheme for primary project hostnames in our DNS - https://phabricator.wikimedia.org/T208263 (10kostajh) > Our analytics seems to indicate the changes above had the intended effect in restoring normal levels... [14:57:56] !sal [14:57:56] https://wikitech.wikimedia.org/wiki/Server_Admin_Log https://tools.wmflabs.org/sal/production See it and you will know all you need. [14:59:32] 10Operations, 10ops-codfw, 10media-storage, 10observability: ms-be2043 'sdd' throwing lots of errors - https://phabricator.wikimedia.org/T222654 (10Papaul) Create Dispatch: Success You have successfully submitted request SR990663287. [14:59:50] (03PS7) 10BBlack: Change CNAME->DYNA TTLs from 1H to 1D [dns] - 10https://gerrit.wikimedia.org/r/507400 (https://phabricator.wikimedia.org/T208263) [15:02:36] (03PS18) 10CRusnov: Gerrit: Set plugin.javamelody.prometheusBearerToken [puppet] - 10https://gerrit.wikimedia.org/r/508621 (https://phabricator.wikimedia.org/T184086) (owner: 10Paladox) [15:02:48] hashar: Looking. [15:06:50] (03CR) 10Bstorm: [C: 03+2] "Weird. 
It looked to me like puppet would completely overwrite that file, but it may be writing to a different file location. I do like t" [puppet] - 10https://gerrit.wikimedia.org/r/507351 (https://phabricator.wikimedia.org/T219830) (owner: 10BryanDavis) [15:07:00] (03PS5) 10Bstorm: striker: Enable developer account creation [puppet] - 10https://gerrit.wikimedia.org/r/507351 (https://phabricator.wikimedia.org/T219830) (owner: 10BryanDavis) [15:10:13] 10Operations, 10serviceops, 10Beta-Feature, 10MW-1.34-notes (1.34.0-wmf.6; 2019-05-21), and 2 others: Remove php7 beta feature - https://phabricator.wikimedia.org/T219128 (10Jdforrester-WMF) For simplicity, we should land https://gerrit.wikimedia.org/r/508177 whenever, and then just let the removal of the... [15:10:16] James_F: only noticed that error a minute or so ago [15:10:59] (03CR) 10Bstorm: "My reading of the ticket suggested that I should be merging this, anyway. If I'm wrong, I can revert and put it back :-p" [puppet] - 10https://gerrit.wikimedia.org/r/507351 (https://phabricator.wikimedia.org/T219830) (owner: 10BryanDavis) [15:12:28] (03PS2) 10Ottomata: Increase maprep.map.tasks from 25 to 100 for eventlogging camus job [puppet] - 10https://gerrit.wikimedia.org/r/510174 [15:12:55] elukey: ^ [15:13:50] (03CR) 10Gilles: [C: 03+1] thumbor: statsd_exporter mappings to seconds [puppet] - 10https://gerrit.wikimedia.org/r/510089 (https://phabricator.wikimedia.org/T220709) (owner: 10Filippo Giunchedi) [15:14:30] (03CR) 10Elukey: [C: 03+1] Increase maprep.map.tasks from 25 to 100 for eventlogging camus job [puppet] - 10https://gerrit.wikimedia.org/r/510174 (owner: 10Ottomata) [15:14:52] !log mw1263: scap pull [15:14:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:15:39] (03CR) 10CRusnov: [C: 03+1] "LGTM" [software/cumin] - 10https://gerrit.wikimedia.org/r/508081 (owner: 10Volans) [15:15:50] 10Operations, 10DC-Ops, 10netops: Juniper network device audit - all sites - 
https://phabricator.wikimedia.org/T213843 (10ayounsi) From Juniper: > I am still in the process of changing the installed base address of the serial numbers given. > Furthermore, for the serial numbers that are not showing in MyJuni... [15:15:55] (03CR) 10Ottomata: [C: 03+2] Increase maprep.map.tasks from 25 to 100 for eventlogging camus job [puppet] - 10https://gerrit.wikimedia.org/r/510174 (owner: 10Ottomata) [15:16:01] (03PS3) 10Ottomata: Increase maprep.map.tasks from 25 to 100 for eventlogging camus job [puppet] - 10https://gerrit.wikimedia.org/r/510174 [15:16:03] (03CR) 10Ottomata: [V: 03+2 C: 03+2] Increase maprep.map.tasks from 25 to 100 for eventlogging camus job [puppet] - 10https://gerrit.wikimedia.org/r/510174 (owner: 10Ottomata) [15:17:05] (03CR) 10Jcrespo: [C: 03+2] mariadb: Fix typo on multi-instance dbstore role [puppet] - 10https://gerrit.wikimedia.org/r/510122 (owner: 10Jcrespo) [15:17:08] (03CR) 10CRusnov: [C: 03+1] "Pretty minor style changes overall, LGTM." [software/cumin] - 10https://gerrit.wikimedia.org/r/508080 (owner: 10Volans) [15:17:13] (03PS3) 10Jcrespo: mariadb: Fix typo on multi-instance dbstore role [puppet] - 10https://gerrit.wikimedia.org/r/510122 [15:17:41] 10Operations, 10Stashbot: #wikimedia-sre is missing stashbot - https://phabricator.wikimedia.org/T222755 (10bd808) [15:17:47] (03CR) 10CRusnov: [C: 03+1] "looks good" [software/cumin] - 10https://gerrit.wikimedia.org/r/510092 (owner: 10Volans) [15:19:00] !log shutting down elastic2038 for memory replacement [15:19:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:19:27] (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] Stop installing the obsolete puppet-common transition package [puppet] - 10https://gerrit.wikimedia.org/r/510159 (owner: 10Muehlenhoff) [15:21:35] 10Operations, 10netops: eqord - ulsfo Telia link down - IC-313592 - https://phabricator.wikimedia.org/T221259 (10ayounsi) 05Open→03Resolved Thanks! Yes it's fine to reuse the ticket. 
Link is back up now. [15:23:21] (03CR) 10Bstorm: "bd808 changed his to the same thing, and I merged that one. Sorry to cause confusion this morning." [puppet] - 10https://gerrit.wikimedia.org/r/509482 (https://phabricator.wikimedia.org/T222844) (owner: 10Reedy) [15:24:07] (03PS47) 10Vgutierrez: ATS: Provide a TLS terminator profile and backend+TLS role [puppet] - 10https://gerrit.wikimedia.org/r/506398 (https://phabricator.wikimedia.org/T221594) [15:25:34] PROBLEM - Host elastic2038.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [15:26:37] jouncebot: now [15:26:37] No deployments scheduled for the next 0 hour(s) and 33 minute(s) [15:26:47] hashar: Should I deploy the fix? [15:27:30] 10Operations, 10Stashbot: #wikimedia-sre is missing stashbot - https://phabricator.wikimedia.org/T222755 (10bd808) 05Open→03Resolved a:03bd808 [15:28:48] Taking silence as assent. :-) [15:30:26] RECOVERY - Host elastic2038 is UP: PING OK - Packet loss = 0%, RTA = 36.10 ms [15:30:26] James_F: yes please ;) [15:30:36] James_F: sorry I forgot you since I am in a conf call :/ [15:31:01] (03PS48) 10Vgutierrez: ATS: Provide a TLS terminator profile and backend+TLS role [puppet] - 10https://gerrit.wikimedia.org/r/506398 (https://phabricator.wikimedia.org/T221594) [15:32:18] (03CR) 10Volans: [C: 03+2] tests: temporarily force bandit < 1.6.0 [software/cumin] - 10https://gerrit.wikimedia.org/r/510092 (owner: 10Volans) [15:33:19] 10Operations, 10Africa-Wikimedia-Developers, 10Stashbot: Add stashbot to #wikimedia-dev-africa - https://phabricator.wikimedia.org/T223289 (10D3r1ck01) [15:33:29] !log deactivate bgp to telia on cr1-codfw - T222967 [15:33:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:33:34] T222967: Interface errors on cr1-codfw: xe-5/3/1 - https://phabricator.wikimedia.org/T222967 [15:34:04] (03CR) 10Jcrespo: [C: 03+2] dbproxy: Switchover labsdb1009 to 11, reorganize weights [puppet] - 10https://gerrit.wikimedia.org/r/510126 
(https://phabricator.wikimedia.org/T222978) (owner: 10Jcrespo) [15:34:16] (03PS2) 10Jcrespo: dbproxy: Switchover labsdb1009 to 11, reorganize weights [puppet] - 10https://gerrit.wikimedia.org/r/510126 (https://phabricator.wikimedia.org/T222978) [15:35:16] (03PS2) 10CRusnov: profile::netbox: Deploy gsheets config for reports [puppet] - 10https://gerrit.wikimedia.org/r/508625 [15:36:10] RECOVERY - Disk space on mwmaint1002 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space [15:36:42] RECOVERY - Host elastic2038.mgmt is UP: PING OK - Packet loss = 0%, RTA = 36.95 ms [15:36:52] (03CR) 10CRusnov: [C: 03+2] profile::netbox: Deploy gsheets config for reports [puppet] - 10https://gerrit.wikimedia.org/r/508625 (owner: 10CRusnov) [15:37:54] (03PS3) 10CRusnov: profile::netbox: Deploy gsheets config for reports [puppet] - 10https://gerrit.wikimedia.org/r/508625 [15:38:00] 10Operations, 10Proton, 10Reading-Infrastructure-Team-Backlog, 10Security-Team: [2 hrs] Decide on handling system updates for Proton - https://phabricator.wikimedia.org/T213366 (10Tgr) [15:38:11] (03Merged) 10jenkins-bot: tests: temporarily force bandit < 1.6.0 [software/cumin] - 10https://gerrit.wikimedia.org/r/510092 (owner: 10Volans) [15:38:11] puppet repo race [15:38:14] wee [15:38:35] !log re-activate bgp to telia on cr1-codfw - T222967 [15:38:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:38:40] T222967: Interface errors on cr1-codfw: xe-5/3/1 - https://phabricator.wikimedia.org/T222967 [15:39:10] (03CR) 10jenkins-bot: tests: temporarily force bandit < 1.6.0 [software/cumin] - 10https://gerrit.wikimedia.org/r/510092 (owner: 10Volans) [15:40:07] (03PS1) 10Jcrespo: mariadb-backups: Reduce frequency of backups to 3 times a week [puppet] - 10https://gerrit.wikimedia.org/r/510184 (https://phabricator.wikimedia.org/T206203) [15:40:38] 10Operations, 10ops-codfw, 10Discovery-Search (Current work): elastic2038 DOWN (CPU/memory errors ) - 
https://phabricator.wikimedia.org/T217398 (10Papaul) 05Open→03Resolved - Memory Replaced on DIMM B2 - clear log Part return tracking information {F29051855} closing this task for now [15:40:43] (03PS2) 10Jcrespo: mariadb-backups: Reduce frequency of snapshots to 3 times a week [puppet] - 10https://gerrit.wikimedia.org/r/510184 (https://phabricator.wikimedia.org/T206203) [15:41:52] (03PS1) 10Bstorm: cloudstore: start syncing data off labstore1003 [puppet] - 10https://gerrit.wikimedia.org/r/510185 (https://phabricator.wikimedia.org/T209527) [15:42:12] 10Operations, 10netops, 10cloud-services-team (Kanban): CloudVPS: evaluate if 10G is working correctly in cloudvirts - https://phabricator.wikimedia.org/T223272 (10aborrero) A quick iperf3 test shows that I'm wrong and this is working actually: ` aborrero@cloudvirt1018:~ 1 $ iperf3 -c cloudvirt1024.eqiad.wm... [15:43:08] hi elukey! got 5 mins re https://phabricator.wikimedia.org/T97368 [15:43:11] ? [15:43:22] 10Operations, 10netops, 10cloud-services-team (Kanban): CloudVPS: evaluate if 10G is working correctly in cloudvirts - https://phabricator.wikimedia.org/T223272 (10Bstorm) Is the Neutron hardware on 10G? [15:43:32] !log reload haproxy config @ dbproxy1010, dbproxy1011 [15:43:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:44:02] 10Operations, 10Traffic, 10Wikidata, 10serviceops, and 4 others: [Task] move wikiba.se webhosting to wikimedia cluster - https://phabricator.wikimedia.org/T99531 (10WMDE-leszek) @BBlack @Dzahn: I have passed the topic of domain ownership transfer to the C-level ranks here at WMDE, and I have to inform that... [15:44:12] 10Operations, 10netops, 10cloud-services-team (Kanban): CloudVPS: evaluate if 10G is working correctly in cloudvirts - https://phabricator.wikimedia.org/T223272 (10Bstorm) Not that it should matter to the virts themselves, but it would matter to the VMs. Just curious. 
[15:45:08] (03Abandoned) 10Reedy: Re-enable striker account creation [puppet] - 10https://gerrit.wikimedia.org/r/509482 (https://phabricator.wikimedia.org/T222844) (owner: 10Reedy) [15:45:44] papaul: Thanks! [15:45:53] addshore: sure! [15:46:31] I'm sure confused about this surfacing in the same way again...... [15:46:52] even though there are thousands of keys now, is there some odd small tiny chance that they all just ended up on the same host? [15:47:10] i'll have a look for code changes again in a second, but i didn't spot any when i looked yesterday [15:47:32] (03CR) 10Volans: [C: 03+2] tox: refactor configuration [software/cumin] - 10https://gerrit.wikimedia.org/r/508079 (owner: 10Volans) [15:49:03] !log crusnov@deploy1001 Started deploy [netbox/deploy@81059c6]: Deploy new reqs for reports [15:49:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:49:22] 10Operations, 10ops-codfw, 10decommission, 10cloud-services-team (Kanban): decommmision: labtestweb2001.wikimedia.org - https://phabricator.wikimedia.org/T218024 (10Papaul) [15:49:59] !log crusnov@deploy1001 Finished deploy [netbox/deploy@81059c6]: Deploy new reqs for reports (duration: 00m 55s) [15:50:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:50:24] (03CR) 10CRusnov: "Just a note on this since it has been extensively reviewed: The gsheets.cfg is in place in private and deployed to the netmon servers. The r" [software/netbox-reports] - 10https://gerrit.wikimedia.org/r/506663 (owner: 10Faidon Liambotis) [15:50:39] addshore: I checked the top talkers on mc1033 (the host with high bandwidth usage) and I listed some in the task, even if I am not sure that they are the culprit. Have you had the chance to see them? 
[15:50:53] yeah, they are all our keys [15:51:04] they are parts of what used to be in the 1 big key [15:51:37] the main theory is though, as we are using thousands of keys now instead of 1, they will be distributed among the machines...... [15:51:42] !log jforrester@deploy1001 Synchronized php-1.34.0-wmf.5/extensions/VisualEditor/includes/ApiVisualEditor.php: Hot-deploy VE unset variable fix T223281 (duration: 00m 57s) [15:51:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:51:47] T223281: [{exception_id}] {exception_url} ErrorException from line 599 of /srv/mediawiki/php-1.34.0-wmf.4/extensions/VisualEditor/includes/ApiVisualEditor.php: PHP Notice: Undefined variable: restbaseHeaders - https://phabricator.wikimedia.org/T223281 [15:51:48] it's almost like they are now all on 1 machine again, or something [15:54:46] addshore: maybe the last deployment triggered a bigger usage of some specific keys? [15:54:48] (03Merged) 10jenkins-bot: tox: refactor configuration [software/cumin] - 10https://gerrit.wikimedia.org/r/508079 (owner: 10Volans) [15:55:22] 10Operations, 10netops, 10cloud-services-team (Kanban): CloudVPS: evaluate if 10G is working correctly in cloudvirts - https://phabricator.wikimedia.org/T223272 (10aborrero) >>! In T223272#5181553, @Bstorm wrote: > Is the Neutron hardware on 10G? yes! And they passed the 1G boundary several times already, s... 
[15:56:05] (03CR) 10jenkins-bot: tox: refactor configuration [software/cumin] - 10https://gerrit.wikimedia.org/r/508079 (owner: 10Volans) [15:56:14] !log jforrester@deploy1001 Synchronized php-1.34.0-wmf.4/extensions/VisualEditor/includes/ApiVisualEditor.php: Hot-deploy VE unset variable fix T223281 (duration: 00m 55s) [15:56:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:56:19] (03CR) 10Volans: [C: 03+2] flake8: enforce import order and adopt W504 [software/cumin] - 10https://gerrit.wikimedia.org/r/508080 (owner: 10Volans) [16:00:05] godog and _joe_: #bothumor Q:Why did functions stop calling each other? A:They had arguments. Rise for Puppet SWAT(Max 6 patches) . (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190514T1600). [16:00:05] No GERRIT patches in the queue for this window AFAICS. [16:04:46] !log gilles@deploy1001 Started deploy [performance/coal@5a32eb2]: T221401 [16:04:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:04:51] T221401: Repopulate missing coal data in Graphite for 2019-04-17 outage - https://phabricator.wikimedia.org/T221401 [16:04:52] !log gilles@deploy1001 Finished deploy [performance/coal@5a32eb2]: T221401 (duration: 00m 06s) [16:04:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:07:08] (03PS49) 10Vgutierrez: ATS: Provide a TLS terminator profile and backend+TLS role [puppet] - 10https://gerrit.wikimedia.org/r/506398 (https://phabricator.wikimedia.org/T221594) [16:08:33] elukey: perhaps :/ [16:08:44] (03CR) 10jerkins-bot: [V: 04-1] ATS: Provide a TLS terminator profile and backend+TLS role [puppet] - 10https://gerrit.wikimedia.org/r/506398 (https://phabricator.wikimedia.org/T221594) (owner: 10Vgutierrez) [16:09:35] addshore: I'll open a task so we can start from scratch in there [16:09:45] elukey: okay [16:10:15] (03PS50) 10Vgutierrez: ATS: Provide a TLS terminator profile and backend+TLS role [puppet] - 
10https://gerrit.wikimedia.org/r/506398 (https://phabricator.wikimedia.org/T221594) [16:10:29] train seems good on group0 so far :] [16:10:31] I am off! [16:10:38] (03CR) 10CRusnov: [C: 03+1] "LGTM. It's nice having an example of this! Minor phrasing niggles but ignorable if you wish." (032 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/506954 (https://phabricator.wikimedia.org/T221212) (owner: 10Volans) [16:11:17] (03Merged) 10jenkins-bot: flake8: enforce import order and adopt W504 [software/cumin] - 10https://gerrit.wikimedia.org/r/508080 (owner: 10Volans) [16:12:09] (03CR) 10Volans: "Big work, thanks for making progress. This is a long overdue review from my side, sorry for the delay." (0364 comments) [software/conftool] - 10https://gerrit.wikimedia.org/r/441396 (https://phabricator.wikimedia.org/T197126) (owner: 10Giuseppe Lavagetto) [16:12:40] elukey: is it possible to see which slab that most called key is in? [16:12:41] (03CR) 10jenkins-bot: flake8: enforce import order and adopt W504 [software/cumin] - 10https://gerrit.wikimedia.org/r/508080 (owner: 10Volans) [16:13:16] (03CR) 10Volans: [C: 03+2] documentation: fix typo [software/cumin] - 10https://gerrit.wikimedia.org/r/508081 (owner: 10Volans) [16:15:41] addshore: yep definitely, I am going to check asap and add in the task (meeting time now) [16:17:07] 10Operations, 10puppet-compiler, 10Cloud-VPS (Quota-requests): Requesting quota increase for 'puppet-diffs' project - https://phabricator.wikimedia.org/T222800 (10Andrew) Approved [16:17:48] ack! [16:18:05] (03CR) 10Filippo Giunchedi: [C: 03+2] "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/509409 (https://phabricator.wikimedia.org/T214166) (owner: 10Filippo Giunchedi) [16:18:49] 10Operations, 10LDAP-Access-Requests, 10WMF-Legal, 10WMF-NDA-Requests: Request to be added to the ldap/wmde group - https://phabricator.wikimedia.org/T222788 (10RStallman-legalteam) Monica's NDA is signed and on file. Thanks! 
[16:18:56] (03CR) 10Filippo Giunchedi: [C: 03+2] thumbor: statsd_exporter mappings to seconds [puppet] - 10https://gerrit.wikimedia.org/r/510089 (https://phabricator.wikimedia.org/T220709) (owner: 10Filippo Giunchedi) [16:19:05] (03PS2) 10Filippo Giunchedi: thumbor: statsd_exporter mappings to seconds [puppet] - 10https://gerrit.wikimedia.org/r/510089 (https://phabricator.wikimedia.org/T220709) [16:20:34] (03Merged) 10jenkins-bot: documentation: fix typo [software/cumin] - 10https://gerrit.wikimedia.org/r/508081 (owner: 10Volans) [16:21:40] (03CR) 10jenkins-bot: documentation: fix typo [software/cumin] - 10https://gerrit.wikimedia.org/r/508081 (owner: 10Volans) [16:22:10] (03PS6) 10Ayounsi: Puppet, add RPKI validation daemon [puppet] - 10https://gerrit.wikimedia.org/r/508928 (https://phabricator.wikimedia.org/T220669) [16:22:30] !log statsd_exporter 0.9 upgrade on thumbor - T220709 [16:22:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:22:35] T220709: Upgrade statsd_exporter to 0.9 - https://phabricator.wikimedia.org/T220709 [16:24:07] (03CR) 10Ayounsi: Puppet, add RPKI validation daemon (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/508928 (https://phabricator.wikimedia.org/T220669) (owner: 10Ayounsi) [16:28:25] (03CR) 10Jforrester: [C: 03+1] "OK, this should now be good to land." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/509437 (https://phabricator.wikimedia.org/T221650) (owner: 10Michael Große) [16:30:04] 10Operations, 10Analytics, 10EventBus, 10observability, and 4 others: Upgrade statsd_exporter to 0.9 - https://phabricator.wikimedia.org/T220709 (10fgiunchedi) All production has been updated ! 
Leaving open for now in case there's still upgrades to be done in k8s (cc @akosiaris ) [16:30:30] !log stop replication and start table recompression on labsdb1009 T222978 [16:30:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:30:34] T222978: Compress and defragment tables on labsdb hosts - https://phabricator.wikimedia.org/T222978 [16:30:50] PROBLEM - Disk space on mwmaint1002 is CRITICAL: DISK CRITICAL - free space: / 4657 MB (3% inode=89%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space [16:34:14] (03PS3) 10Filippo Giunchedi: cassandra: check for flag file before service startup [puppet] - 10https://gerrit.wikimedia.org/r/509409 (https://phabricator.wikimedia.org/T214166) [16:39:07] (03CR) 10Jcrespo: [C: 03+2] "Compiler is happy, but not 100% sure this is the right syntax: https://puppet-compiler.wmflabs.org/compiler1002/16546/cumin1001.eqiad.wmne" [puppet] - 10https://gerrit.wikimedia.org/r/510184 (https://phabricator.wikimedia.org/T206203) (owner: 10Jcrespo) [16:39:17] (03PS3) 10Jcrespo: mariadb-backups: Reduce frequency of snapshots to 3 times a week [puppet] - 10https://gerrit.wikimedia.org/r/510184 (https://phabricator.wikimedia.org/T206203) [16:39:50] (03CR) 10Cwhite: [C: 03+1] flake8 - diffscan: Add python extension so CI is run [puppet] - 10https://gerrit.wikimedia.org/r/509946 (https://phabricator.wikimedia.org/T144169) (owner: 10Jbond) [16:40:31] 10Operations, 10Traffic, 10Patch-For-Review, 10Performance-Team (Radar): Refactor public-facing DYNA scheme for primary project hostnames in our DNS - https://phabricator.wikimedia.org/T208263 (10BBlack) @kostajh - The OONI article you linked ( https://ooni.torproject.org/post/2019-china-wikipedia-blocking... 
[16:41:16] (03CR) 10Cwhite: [C: 03+1] flake8 - letsencrypt: Add python extension so CI is run [puppet] - 10https://gerrit.wikimedia.org/r/509945 (https://phabricator.wikimedia.org/T144169) (owner: 10Jbond) [16:41:44] 10Operations, 10Traffic, 10Patch-For-Review, 10Performance-Team (Radar): Refactor public-facing DYNA scheme for primary project hostnames in our DNS - https://phabricator.wikimedia.org/T208263 (10kostajh) Got it, thanks so much for this clarification, it's a very helpful summary of events. [16:43:22] (03PS1) 10Filippo Giunchedi: cassandra: unquote ConditionPathExists argument [puppet] - 10https://gerrit.wikimedia.org/r/510195 (https://phabricator.wikimedia.org/T214166) [16:43:52] (03CR) 10Jbond: [C: 03+2] flake8 - letsencrypt: Add python extension so CI is run [puppet] - 10https://gerrit.wikimedia.org/r/509945 (https://phabricator.wikimedia.org/T144169) (owner: 10Jbond) [16:44:00] (03PS4) 10Jbond: flake8 - letsencrypt: Add python extension so CI is run [puppet] - 10https://gerrit.wikimedia.org/r/509945 (https://phabricator.wikimedia.org/T144169) [16:44:28] (03CR) 10Filippo Giunchedi: [C: 03+2] cassandra: unquote ConditionPathExists argument [puppet] - 10https://gerrit.wikimedia.org/r/510195 (https://phabricator.wikimedia.org/T214166) (owner: 10Filippo Giunchedi) [16:45:26] (03PS5) 10Jbond: flake8 - letsencrypt: Add python extension so CI is run [puppet] - 10https://gerrit.wikimedia.org/r/509945 (https://phabricator.wikimedia.org/T144169) [16:48:45] (03PS2) 10Jbond: flake8 - diffscan: Add python extension so CI is run [puppet] - 10https://gerrit.wikimedia.org/r/509946 (https://phabricator.wikimedia.org/T144169) [16:49:35] (03CR) 10Jbond: [C: 03+2] flake8 - diffscan: Add python extension so CI is run [puppet] - 10https://gerrit.wikimedia.org/r/509946 (https://phabricator.wikimedia.org/T144169) (owner: 10Jbond) [16:49:59] (03PS2) 10Bstorm: cloudstore: start syncing data off labstore1003 [puppet] - 10https://gerrit.wikimedia.org/r/510185 
(https://phabricator.wikimedia.org/T209527) [16:57:03] (03PS3) 10Dzahn: mariadb: set some more Icinga notes URLs for nrpe checks [puppet] - 10https://gerrit.wikimedia.org/r/509552 (https://phabricator.wikimedia.org/T197873) [16:57:20] (03CR) 10Dzahn: [C: 03+2] "thank you all :)" [puppet] - 10https://gerrit.wikimedia.org/r/509552 (https://phabricator.wikimedia.org/T197873) (owner: 10Dzahn) [16:57:33] (03PS3) 10Jbond: flake8: update python file extensions so they are detected by CI [puppet] - 10https://gerrit.wikimedia.org/r/509444 (https://phabricator.wikimedia.org/T144169) [16:57:52] (03PS3) 10Bstorm: cloudstore: start syncing data off labstore1003 [puppet] - 10https://gerrit.wikimedia.org/r/510185 (https://phabricator.wikimedia.org/T209527) [16:58:33] (03CR) 10jerkins-bot: [V: 04-1] flake8: update python file extensions so they are detected by CI [puppet] - 10https://gerrit.wikimedia.org/r/509444 (https://phabricator.wikimedia.org/T144169) (owner: 10Jbond) [16:59:36] (03CR) 10Dzahn: "you have one limit 4,000 and the other 40,000. Intended?" [puppet] - 10https://gerrit.wikimedia.org/r/510185 (https://phabricator.wikimedia.org/T209527) (owner: 10Bstorm) [16:59:41] (03PS4) 10Jbond: flake8: update python file extensions so they are detected by CI [puppet] - 10https://gerrit.wikimedia.org/r/509444 (https://phabricator.wikimedia.org/T144169) [16:59:46] 10Operations, 10netops, 10cloud-services-team (Kanban): CloudVPS: evaluate if 10G is working correctly in cloudvirts - https://phabricator.wikimedia.org/T223272 (10Andrew) drive-by-comment: I've also been disappointed at transfer speeds when migrating to/from 10G systems but never followed up to figure out... [17:00:05] cscott, arlolra, subbu, and halfak: #bothumor When your hammer is PHP, everything starts looking like a thumb. Rise for Services – Graphoid / Parsoid / Citoid / ORES. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190514T1700). 
[17:00:14] no parsoid deploy today [17:01:04] RECOVERY - Disk space on mwmaint1002 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space [17:01:17] (03PS19) 10CRusnov: Gerrit: Set plugin.javamelody.prometheusBearerToken [puppet] - 10https://gerrit.wikimedia.org/r/508621 (https://phabricator.wikimedia.org/T184086) (owner: 10Paladox) [17:01:50] PROBLEM - PHP7 rendering on mw1304 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [17:02:05] (03CR) 10CRusnov: [C: 03+2] Gerrit: Set plugin.javamelody.prometheusBearerToken [puppet] - 10https://gerrit.wikimedia.org/r/508621 (https://phabricator.wikimedia.org/T184086) (owner: 10Paladox) [17:03:04] RECOVERY - PHP7 rendering on mw1304 is OK: HTTP OK: HTTP/1.1 200 OK - 327 bytes in 0.003 second response time https://wikitech.wikimedia.org/wiki/Application_servers [17:04:34] elukey: I'm leaning toward thinking it is not the wikidata keys, or at least not those ones [17:07:50] (03CR) 10Jbond: "ready for review" [puppet] - 10https://gerrit.wikimedia.org/r/509444 (https://phabricator.wikimedia.org/T144169) (owner: 10Jbond) [17:08:03] (03PS1) 10Dzahn: admins: extend account expiration date for pbj a few more days [puppet] - 10https://gerrit.wikimedia.org/r/510202 [17:08:38] (03PS2) 10Dzahn: admins: extend account expiration date for pbj a few more days [puppet] - 10https://gerrit.wikimedia.org/r/510202 [17:08:51] (03CR) 10Dzahn: [C: 03+2] admins: extend account expiration date for pbj a few more days [puppet] - 10https://gerrit.wikimedia.org/r/510202 (owner: 10Dzahn) [17:11:26] (03PS1) 10Michael Große: Add feature flag config for breaking Wikibase API change [mediawiki-config] - 10https://gerrit.wikimedia.org/r/510204 (https://phabricator.wikimedia.org/T203337) [17:11:51] 10Operations, 10serviceops: SRE FY2019 Q4 goal: complete the transition to PHP7 - https://phabricator.wikimedia.org/T219127 (10Reedy) [17:12:04] 10Operations, 
10MediaWiki-extensions-WikibaseClient, 10MediaWiki-extensions-WikibaseRepository, 10Wikidata, and 7 others: Fix inefficient CacheAwarePropertyInfoStore memcached access pattern - https://phabricator.wikimedia.org/T97368 (10Addshore) That all looks fine. I added a new panel to https://grafana.... [17:15:42] RECOVERY - Check Varnish expiry mailbox lag on cp3034 is OK: OK: expiry mailbox lag is 290708 https://wikitech.wikimedia.org/wiki/Varnish [17:19:57] (03PS1) 10Jbond: flake8 - ldap: add extension so file is detected by CI [puppet] - 10https://gerrit.wikimedia.org/r/510208 [17:20:23] !log mbsantos@deploy1001 Started deploy [proton/deploy@881b22b]: Update chromium-render to 8cc96e7 make timeout handler more robust (T217724) [17:20:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:20:28] T217724: Investigate 2019-03-01 Proton incident - https://phabricator.wikimedia.org/T217724 [17:21:29] (03CR) 10jerkins-bot: [V: 04-1] flake8 - ldap: add extension so file is detected by CI [puppet] - 10https://gerrit.wikimedia.org/r/510208 (owner: 10Jbond) [17:22:46] !log mbsantos@deploy1001 Finished deploy [proton/deploy@881b22b]: Update chromium-render to 8cc96e7 make timeout handler more robust (T217724) (duration: 02m 23s) [17:22:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:22:53] tgr: ^ [17:23:22] thanks! [17:24:30] (03PS2) 10Jbond: flake8 - ldap: add extension so file is detected by CI [puppet] - 10https://gerrit.wikimedia.org/r/510208 [17:25:52] 10Operations, 10Continuous-Integration-Infrastructure, 10Release-Engineering-Team (Backlog): contint1001 store docker images on separate partition or disk - https://phabricator.wikimedia.org/T207707 (10Dzahn) Alright, it's time again for CRIT: https://icinga.wikimedia.org/cgi-bin/icinga/extinfo.cgi?type=2&h... 
[17:26:21] 10Operations, 10Continuous-Integration-Infrastructure, 10Release-Engineering-Team (Kanban): contint1001 store docker images on separate partition or disk - https://phabricator.wikimedia.org/T207707 (10Dzahn) a:05RobH→03hashar [17:27:57] ACKNOWLEDGEMENT - Disk space on contint1001 is CRITICAL: DISK CRITICAL - free space: / 1502 MB (3% inode=52%): daniel_zahn https://phabricator.wikimedia.org/T207707#5135476 https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space [17:30:26] (03CR) 10Jbond: [C: 03+2] flake8 - ldap: add extension so file is detected by CI [puppet] - 10https://gerrit.wikimedia.org/r/510208 (owner: 10Jbond) [17:32:20] RECOVERY - Disk space on contint1001 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space [17:32:46] !log contint1001 - mkdir /srv/zuul-logs ; mv /var/log/zuul/debug.log* /srv/zuul-logs/ to prevent CI running out of disk again (T207707) [17:32:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:32:51] T207707: contint1001 store docker images on separate partition or disk - https://phabricator.wikimedia.org/T207707 [17:33:23] 10Operations: Investigate increase in tx bandwidth usage for mc1033 - https://phabricator.wikimedia.org/T223310 (10elukey) p:05Triage→03Normal [17:33:30] 10Operations, 10Continuous-Integration-Infrastructure, 10Release-Engineering-Team (Kanban): contint1001 store docker images on separate partition or disk - https://phabricator.wikimedia.org/T207707 (10Dzahn) [17:33:42] 10Operations, 10Continuous-Integration-Infrastructure, 10Release-Engineering-Team (Kanban): contint1001 store docker images on separate partition or disk - https://phabricator.wikimedia.org/T207707 (10Dzahn) [17:33:50] 10Operations, 10Continuous-Integration-Infrastructure, 10Patch-For-Review, 10Release-Engineering-Team (Kanban): contint1001: DISK WARNING - free space: /srv 88397 MB (10% inode=94%): - https://phabricator.wikimedia.org/T219850 (10Dzahn) [17:35:30] (03PS2) 10Jbond: flake8: Add 
python extension so theses scripts can be picked up via CI [puppet] - 10https://gerrit.wikimedia.org/r/509476 (https://phabricator.wikimedia.org/T144169) [17:36:48] 10Operations, 10Continuous-Integration-Infrastructure, 10Release-Engineering-Team (Kanban): contint1001 store docker images on separate partition or disk - https://phabricator.wikimedia.org/T207707 (10Dzahn) As before /var/log/zuul is many Gigabytes and a large percentage of / and debug logging is enabled.... [17:36:59] 10Operations, 10Continuous-Integration-Infrastructure, 10Release-Engineering-Team (Kanban): contint1001 store docker images on separate partition or disk - https://phabricator.wikimedia.org/T207707 (10Dzahn) p:05Normal→03High [17:40:13] (03PS3) 10Jbond: flake8: Add python extension so theses scripts can be picked up via CI [puppet] - 10https://gerrit.wikimedia.org/r/509476 (https://phabricator.wikimedia.org/T144169) [17:41:02] (03CR) 10jerkins-bot: [V: 04-1] flake8: Add python extension so theses scripts can be picked up via CI [puppet] - 10https://gerrit.wikimedia.org/r/509476 (https://phabricator.wikimedia.org/T144169) (owner: 10Jbond) [17:42:39] (03PS4) 10Jbond: flake8: Add python extension so theses scripts can be picked up via CI [puppet] - 10https://gerrit.wikimedia.org/r/509476 (https://phabricator.wikimedia.org/T144169) [17:45:34] (03CR) 10Jbond: "ready for review, should be easy to compare changes now" [puppet] - 10https://gerrit.wikimedia.org/r/509476 (https://phabricator.wikimedia.org/T144169) (owner: 10Jbond) [17:48:38] (03PS2) 10Michael Große: Add feature flag config for breaking Wikibase API change [mediawiki-config] - 10https://gerrit.wikimedia.org/r/510204 (https://phabricator.wikimedia.org/T223312) [17:50:38] onimisionipe: is there a reason for puppet disabled on elastic2029? 
it looks like nobody logged in but it's turned off [17:53:35] 10Operations: Investigate increase in tx bandwidth usage for mc1033 - https://phabricator.wikimedia.org/T223310 (10elukey) From the memcached's bytes_written and bytes_read metrics I don't see anything changing dramatically: https://grafana.wikimedia.org/d/000000316/memcache?panelId=44&fullscreen&orgId=1&from=n... [17:54:43] !log elastic2038 - restart nagios-nrpe-server - attempt to fix "CHECK_NRPE STATE UNKNOWN" for a single check [17:54:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:54:54] 10Operations, 10Continuous-Integration-Infrastructure, 10Release-Engineering-Team (Kanban): contint1001 store docker images on separate partition or disk - https://phabricator.wikimedia.org/T207707 (10RobH) >>! In T207707#4937008, @thcipriani wrote: >>>! In T207707#4909130, @Dzahn wrote: >> Let's ask dcops i... [17:55:01] !log elastic2029 - enable puppet agent - was disabled without reason and nobody seems to have logged in recently [17:55:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:55:40] PROBLEM - puppet last run on bast3002 is CRITICAL: CRITICAL: Failed to apply catalog, zero resources tracked by Puppet. It might be a dependency cycle. [17:56:35] 10Operations, 10ops-codfw: Interface errors on cr1-codfw: xe-5/3/1 - https://phabricator.wikimedia.org/T222967 (10Papaul) - Email sent to Telia - Telia open a case - Case information : Telia Carrier Case Reference 00980021 [18:07:15] 08Warning Alert for device asw2-a-eqiad.mgmt.eqiad.wmnet - Access port utilisation over 80% for 1h [18:08:15] 08Warning Alert for device asw-b-codfw.mgmt.codfw.wmnet - Access port utilisation over 80% for 1h [18:09:47] (03CR) 10Volans: [C: 03+1] "LGTM" [software/netbox-reports] - 10https://gerrit.wikimedia.org/r/509934 (owner: 10CRusnov) [18:10:18] mutante: none that I know of. 
[18:10:32] (03CR) 10Volans: [C: 03+1] "LGTM, don't forget to cleanup manually the old one" [puppet] - 10https://gerrit.wikimedia.org/r/509932 (owner: 10CRusnov) [18:10:39] onimisionipe: ok! i just re-enabled it [18:10:51] (03PS2) 10CRusnov: profile::netbox: Move reports config to /etc/netbox [puppet] - 10https://gerrit.wikimedia.org/r/509932 [18:10:53] thanks! [18:13:00] PROBLEM - Eqiad HTTP 5xx reqs/min on graphite1004 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=eqiad&var-cache_type=All&var-status_type=5 [18:14:00] PROBLEM - Esams HTTP 5xx reqs/min on graphite1004 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=esams&var-cache_type=All&var-status_type=5 [18:14:32] PROBLEM - Text HTTP 5xx reqs/min on graphite1004 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=All&var-cache_type=text&var-status_type=5 [18:20:52] RECOVERY - Esams HTTP 5xx reqs/min on graphite1004 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=esams&var-cache_type=All&var-status_type=5 [18:21:16] RECOVERY - Eqiad HTTP 5xx reqs/min on graphite1004 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=eqiad&var-cache_type=All&var-status_type=5 [18:21:24] RECOVERY - Text HTTP 5xx reqs/min on graphite1004 is OK: OK: Less than 1.00% above the threshold [250.0] 
https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=All&var-cache_type=text&var-status_type=5 [18:21:35] !log mwmaint1002 - deleting /root/home-mwmaint2001 to save space - confirmed we have bacula backups of home on mwmaint2001 [18:21:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:23:51] 10Operations, 10ops-codfw: Degraded RAID on es2002 - https://phabricator.wikimedia.org/T223257 (10jbond) 05Open→03Resolved a:03jbond Caused by https://gerrit.wikimedia.org/r/c/operations/puppet/+/508855 which has now been rolled back [18:24:00] 10Operations, 10ops-eqiad: Degraded RAID on tungsten - https://phabricator.wikimedia.org/T223253 (10jbond) Caused by https://gerrit.wikimedia.org/r/c/operations/puppet/+/508855 which has now been rolled back [18:24:07] 10Operations, 10ops-eqiad: Degraded RAID on tungsten - https://phabricator.wikimedia.org/T223253 (10jbond) 05Open→03Resolved a:03jbond [18:24:25] 10Operations, 10ops-eqiad: Degraded RAID on ms-be1043 - https://phabricator.wikimedia.org/T223251 (10jbond) 05Open→03Resolved a:03jbond Caused by https://gerrit.wikimedia.org/r/c/operations/puppet/+/508855 which has now been rolled back [18:24:39] 10Operations, 10ops-eqiad: Degraded RAID on kafka1012 - https://phabricator.wikimedia.org/T223250 (10jbond) 05Open→03Resolved a:03jbond Caused by https://gerrit.wikimedia.org/r/c/operations/puppet/+/508855 which has now been rolled back [18:24:53] 10Operations, 10ops-codfw: Degraded RAID on backup2001 - https://phabricator.wikimedia.org/T223249 (10jbond) 05Open→03Resolved a:03jbond Caused by https://gerrit.wikimedia.org/r/c/operations/puppet/+/508855 which has now been rolled back [18:25:11] 10Operations, 10ops-eqiad: Degraded RAID on kafka1014 - https://phabricator.wikimedia.org/T223246 (10jbond) 05Open→03Invalid Caused by https://gerrit.wikimedia.org/r/c/operations/puppet/+/508855 which has now been rolled back [18:25:24] 10Operations, 
10ops-codfw: Degraded RAID on es2004 - https://phabricator.wikimedia.org/T223264 (10jbond) 05Open→03Invalid Caused by https://gerrit.wikimedia.org/r/c/operations/puppet/+/508855 which has now been rolled back [18:27:38] RECOVERY - puppet last run on bast3002 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [18:27:40] PROBLEM - MariaDB Slave Lag: s3 on db2098 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 839.08 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_slave [18:28:13] (03CR) 10Bstorm: "> Patch Set 3:" [puppet] - 10https://gerrit.wikimedia.org/r/510185 (https://phabricator.wikimedia.org/T209527) (owner: 10Bstorm) [18:28:18] (03CR) 10Bstorm: "https://puppet-compiler.wmflabs.org/compiler1001/16549/" [puppet] - 10https://gerrit.wikimedia.org/r/510185 (https://phabricator.wikimedia.org/T209527) (owner: 10Bstorm) [18:28:27] (03PS1) 10Jbond: CI Tests: add a check to ensure all python files have a py extension [puppet] - 10https://gerrit.wikimedia.org/r/510216 (https://phabricator.wikimedia.org/T144169) [18:29:04] (03PS4) 10Bstorm: cloudstore: start syncing data off labstore1003 [puppet] - 10https://gerrit.wikimedia.org/r/510185 (https://phabricator.wikimedia.org/T209527) [18:29:06] (03CR) 10jerkins-bot: [V: 04-1] CI Tests: add a check to ensure all python files have a py extension [puppet] - 10https://gerrit.wikimedia.org/r/510216 (https://phabricator.wikimedia.org/T144169) (owner: 10Jbond) [18:30:19] (03PS5) 10Bstorm: cloudstore: start syncing data off labstore1003 [puppet] - 10https://gerrit.wikimedia.org/r/510185 (https://phabricator.wikimedia.org/T209527) [18:31:00] (03PS2) 10Jbond: CI Tests: add a check to ensure all python files have a py extension [puppet] - 10https://gerrit.wikimedia.org/r/510216 (https://phabricator.wikimedia.org/T144169) [18:37:15] 08Warning Alert for device asw2-c-eqiad.mgmt.eqiad.wmnet - Access port utilisation over 80% for 1h [18:37:17] jbond42: fun I was about to 
suggest adding such a test ! [18:42:45] 10Operations, 10Operations-Software-Development, 10Patch-For-Review, 10cloud-services-team (Kanban): Flake8 for python files without extension in puppet repo - https://phabricator.wikimedia.org/T144169 (10jbond) As i was on clinic duty last week i decided to have a crack at doing the manual work to rename... [18:44:15] 08Warning Alert for device asw-d-codfw.mgmt.codfw.wmnet - Access port utilisation over 80% for 1h [18:48:41] (03CR) 10Hashar: "Lovely! We might want to support #!/usr/bin/python2.7 eventually. I can imagine it helps for the transition at some point in the futur" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/510216 (https://phabricator.wikimedia.org/T144169) (owner: 10Jbond) [18:54:18] 10Operations, 10Operations-Software-Development, 10Patch-For-Review, 10cloud-services-team (Kanban): Flake8 for python files without extension in puppet repo - https://phabricator.wikimedia.org/T144169 (10hashar) That is a nice sprint @jbond thank you. For the tox work I am not too worried and @volans shou... [18:58:01] (03PS3) 10Jbond: CI Tests: add a check to ensure all python files have a py extension [puppet] - 10https://gerrit.wikimedia.org/r/510216 (https://phabricator.wikimedia.org/T144169) [19:00:01] be back later.. 
trying to rescue some baby birds [19:00:05] heh, it's true [19:01:42] (03CR) 10Jbond: CI Tests: add a check to ensure all python files have a py extension (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/510216 (https://phabricator.wikimedia.org/T144169) (owner: 10Jbond) [19:04:44] (03CR) 10Alex Monk: "was until about half an hour before that message :/" [puppet] - 10https://gerrit.wikimedia.org/r/505407 (https://phabricator.wikimedia.org/T220894) (owner: 10Alex Monk) [19:10:00] PROBLEM - Host mw1294 is DOWN: PING CRITICAL - Packet loss = 100% [19:10:58] RECOVERY - Host mw1294 is UP: PING WARNING - Packet loss = 50%, RTA = 0.31 ms [19:17:15] 08̶W̶a̶r̶n̶i̶n̶g Device asw2-a-eqiad.mgmt.eqiad.wmnet recovered from Access port utilisation over 80% for 1h [19:18:26] 10Operations: URL shortener subdomains for useful Wikimedia infrastructure - https://phabricator.wikimedia.org/T223319 (10CDanis) [19:20:59] 10Operations, 10Analytics, 10Discovery, 10Research: Make hadoop cluster able to push to swift - https://phabricator.wikimedia.org/T219544 (10Ottomata) Found some better docs here: https://docs.openstack.org/sahara/latest/user/hadoop-swift.html So configs will go in core-site.xml. We can probably do this... [19:23:31] 10Operations: URL shortener subdomains for useful Wikimedia infrastructure - https://phabricator.wikimedia.org/T223319 (10greg) Just want to throw out the possibility in the future (future) that some of these underlying tools may change and the unique identifier for that service may no longer align with the uniq... 
[19:25:43] !log ban elastic2038 from elasticsearch cluster for memory replacement - T217398 [19:25:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:25:49] T217398: elastic2038 DOWN (CPU/memory errors ) - https://phabricator.wikimedia.org/T217398 [19:28:11] !log shutting down elastic2038 for memory replacement - T217398 [19:28:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:37:12] !log adding logstash filter truncate plugin to prod logstash collectors [19:37:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:37:15] 08Warning Alert for device asw2-d-eqiad.mgmt.eqiad.wmnet - Access port utilisation over 80% for 1h [19:37:26] (03CR) 10Herron: [V: 03+2 C: 03+2] logstash: add logstash-filter-truncate plugin [software/logstash/plugins] - 10https://gerrit.wikimedia.org/r/509880 (https://phabricator.wikimedia.org/T187147) (owner: 10Herron) [19:38:02] (03PS7) 10Faidon Liambotis: Add "accounting" report [software/netbox-reports] - 10https://gerrit.wikimedia.org/r/506663 [19:38:46] (03CR) 10Faidon Liambotis: [C: 03+2] Add "accounting" report [software/netbox-reports] - 10https://gerrit.wikimedia.org/r/506663 (owner: 10Faidon Liambotis) [19:44:33] moritzm, just noticed this in a puppet run I did earlier [19:44:40] -deployment-puppetdb02.deployment-prep.eqiad.wmflabs,deployment-puppetdb02,172.16.4.104 [19:44:52] +deployment-puppetdb02.deployment-prep.eqiad.wmflabs,deployment-puppetdb02,172.16.4.104,fe80::f816:3eff:fe54:808f [19:45:07] a labs host thinks it has an ipv6 address? [19:45:57] that's a link-local ipv6 address; it's not routable by anything external [19:50:39] 10Operations: URL shortener subdomains for useful Wikimedia infrastructure - https://phabricator.wikimedia.org/T223319 (10CDanis) >>! In T223319#5182402, @greg wrote: > Just want to throw out the possibility in the future (future) that some of these underlying tools may change and the unique identifier for that... 
[19:53:12] cdanis, hm, it probably shouldn't show up in other hosts' known hosts then... [19:54:08] yeah that seems suboptimal. that address is basically a transformation from the host's MAC address [19:54:16] 08̶W̶a̶r̶n̶i̶n̶g Device asw-d-codfw.mgmt.codfw.wmnet recovered from Access port utilisation over 80% for 1h [19:55:40] It seems the following deployment-prep instances have ipv6 entries in known hosts [19:55:54] RECOVERY - MariaDB Slave Lag: s3 on db2098 is OK: OK slave_sql_lag Replication lag: 0.41 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_slave [19:55:54] deployment-acme-chief0[34], which run buster and puppet 5 [19:56:22] deployment-puppetdb02 which showed up today, probably as a result of moritz's puppet 5/facter 3 work? [19:56:41] and deployment-db05 for some reason [19:57:09] other hosts have these addresses on eth0, they just don't make it into known hosts [19:58:11] looking at ssh::server it comes from the 'ipaddress6' fact [19:58:23] (03CR) 10CRusnov: [C: 03+2] profile::netbox: Move reports config to /etc/netbox [puppet] - 10https://gerrit.wikimedia.org/r/509932 (owner: 10CRusnov) [20:00:43] (03PS2) 10CRusnov: Move netbox report config to /etc/netbox [software/netbox-reports] - 10https://gerrit.wikimedia.org/r/509934 [20:03:34] (03CR) 10CRusnov: [C: 03+2] Move netbox report config to /etc/netbox [software/netbox-reports] - 10https://gerrit.wikimedia.org/r/509934 (owner: 10CRusnov) [20:13:40] !log restarting gerrit on cobalt to pick up metrics export changes [20:13:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:21:13] 10Operations, 10Analytics, 10Analytics-EventLogging, 10Analytics-Kanban, and 2 others: Upgrade python-kafka - https://phabricator.wikimedia.org/T221848 (10Gilles) I'd like to understand this bug better before rolling back the package for coal. It's not a big deal per se if coal is a little behind events. W... 
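Note on the 19:45-19:58 exchange above: the `fe80::` address that showed up in known hosts is link-local, and the remark that it is "basically a transformation from the host's MAC address" can be checked by hand. The sketch below is illustrative only and assumes the address was formed with the modified EUI-64 scheme; the helper name `mac_from_eui64_link_local` is hypothetical and is not part of puppet, facter, or any tool mentioned in the log.

```python
import ipaddress


def mac_from_eui64_link_local(addr: str) -> str:
    """Recover the MAC address embedded in an EUI-64 based link-local address.

    Hypothetical helper for illustration; assumes modified EUI-64 was used.
    """
    ip = ipaddress.IPv6Address(addr)
    if not ip.is_link_local:
        raise ValueError(f"{addr} is not in fe80::/10")

    # The interface identifier is the low 64 bits of the address.
    iid = ip.packed[8:]

    # Modified EUI-64 inserts ff:fe between the two halves of the MAC.
    if iid[3:5] != b"\xff\xfe":
        raise ValueError(f"{addr} does not look EUI-64 derived")

    # Flip the universal/local bit back and drop the ff:fe filler.
    mac = bytes([iid[0] ^ 0x02]) + iid[1:3] + iid[5:8]
    return ":".join(f"{octet:02x}" for octet in mac)


if __name__ == "__main__":
    addr = "fe80::f816:3eff:fe54:808f"  # address quoted in the log
    print(ipaddress.IPv6Address(addr).is_link_local)
    print(mac_from_eui64_link_local(addr))
```

A check like `is_link_local` is one way a fact-consuming template could filter such addresses before writing them to known hosts; whether ssh::server should do so is a question for the task above, not something this sketch decides.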
[20:21:13] (03PS11) 10Paladox: Add prometheus server for gerrit javamelody monitoring [puppet] - 10https://gerrit.wikimedia.org/r/508952 (https://phabricator.wikimedia.org/T184086) [20:23:30] (03CR) 10CRusnov: [C: 03+2] Add prometheus server for gerrit javamelody monitoring [puppet] - 10https://gerrit.wikimedia.org/r/508952 (https://phabricator.wikimedia.org/T184086) (owner: 10Paladox) [20:27:15] 08̶W̶a̶r̶n̶i̶n̶g Device asw2-c-eqiad.mgmt.eqiad.wmnet recovered from Access port utilisation over 80% for 1h [20:28:15] 08Warning Alert for device asw-a-codfw.mgmt.codfw.wmnet - Access port utilisation over 80% for 1h [20:31:35] 10Operations, 10cloud-services-team (Kanban): update *.tools.wmflabs.org certificate - https://phabricator.wikimedia.org/T223332 (10RobH) p:05Triage→03High [20:33:53] 10Operations, 10MediaWiki-General-or-Unknown, 10serviceops, 10Core Platform Team (PHP7 (TEC4)), and 4 others: Some pages will become completely unreachable after PHP7 update due to Unicode changes - https://phabricator.wikimedia.org/T219279 (10WDoranWMF) [20:41:24] !log herron@deploy1001 Started deploy [logstash/plugins@7fb8843]: (no justification provided) [20:41:25] !log herron@deploy1001 Finished deploy [logstash/plugins@7fb8843]: (no justification provided) (duration: 00m 01s) [20:41:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:41:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:42:55] 10Operations, 10Puppet, 10Packaging, 10Patch-For-Review: upgrade facter and puppet across the fleet - https://phabricator.wikimedia.org/T219803 (10Krenair) facter upgrade in deployment-prep appears to have set the ipaddress6 fact to link-local addresses [20:43:15] ^ that was a —dry-run [20:43:15] 08̶W̶a̶r̶n̶i̶n̶g Device asw-b-codfw.mgmt.codfw.wmnet recovered from Access port utilisation over 80% for 1h [20:43:47] (03CR) 10Ladsgroup: [C: 03+1] Remove constraint-suggestions beta feature [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/503342 (https://phabricator.wikimedia.org/T220609) (owner: 10Lucas Werkmeister (WMDE)) [20:43:53] !log herron@deploy1001 Started deploy [logstash/plugins@7fb8843]: adding logstash-filter-truncate plugin [20:43:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:44:01] !log herron@deploy1001 Finished deploy [logstash/plugins@7fb8843]: adding logstash-filter-truncate plugin (duration: 00m 07s) [20:44:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:53:45] (03CR) 10Bstorm: [C: 03+2] cloudstore: start syncing data off labstore1003 [puppet] - 10https://gerrit.wikimedia.org/r/510185 (https://phabricator.wikimedia.org/T209527) (owner: 10Bstorm) [20:53:55] (03PS6) 10Bstorm: cloudstore: start syncing data off labstore1003 [puppet] - 10https://gerrit.wikimedia.org/r/510185 (https://phabricator.wikimedia.org/T209527) [20:55:09] (03PS1) 10Brennen Bearnes: local_dev: Add config for dev-images docker-pkg [puppet] - 10https://gerrit.wikimedia.org/r/510249 (https://phabricator.wikimedia.org/T223329) [20:57:15] 08̶W̶a̶r̶n̶i̶n̶g Device asw2-d-eqiad.mgmt.eqiad.wmnet recovered from Access port utilisation over 80% for 1h [20:57:40] (03CR) 10Brennen Bearnes: "Will require private puppet value for docker::registry::dev_password." [puppet] - 10https://gerrit.wikimedia.org/r/510249 (https://phabricator.wikimedia.org/T223329) (owner: 10Brennen Bearnes) [20:59:15] 10Operations, 10Puppet, 10Packaging, 10Patch-For-Review: upgrade facter and puppet across the fleet - https://phabricator.wikimedia.org/T219803 (10jbond) ` git/puppet [ grep -r ipaddress6 ./modules review/jbond/update_python ] 9:57 PM ./modules/base/lib/facter/interface_primary.rb:... 
[20:59:53] (03PS1) 10RobH: updating/renewing star.tools.wmflabs.org cert/keypair [puppet] - 10https://gerrit.wikimedia.org/r/510250 (https://phabricator.wikimedia.org/T223332) [21:01:16] 10Operations, 10cloud-services-team (Kanban): update *.tools.wmflabs.org certificate - https://phabricator.wikimedia.org/T223332 (10RobH) [21:01:29] jbond: I'm trying to use puppet in order to get labstore1003 decommissioned. I just noticed you disabled puppet there? [21:01:58] I need to transfer data off it, which requires an rsync server estup [21:02:01] *setup [21:02:16] 10Operations, 10cloud-services-team (Kanban): update *.tools.wmflabs.org certificate - https://phabricator.wikimedia.org/T223332 (10RobH) a:05RobH→03aborrero @aborrero: Since you were the one to confirm the certificate usage on the #procurement task, would you also be the person to implement the renewed ce... [21:02:16] Sorry, jbond42 ^ [21:02:17] bstorm_: i think it is failing due to the puppet upgrade [21:02:26] It's disabled [21:02:36] Notice: Skipping run of Puppet configuration client; administratively disabled (Reason: 'rsyslog-kafka not built for trusty, server scheduled for decomission (jbond)'); [21:02:40] let me take a look [21:04:22] bstorm_: i remember now, it fails on puppet runs due to a kafka-rsyslog change [21:04:23] that was used for the ferm logging, wasn't it? and the kafka stuff was probably never built for trusty [21:04:40] (03PS2) 10Herron: logstash: enforce max length on "message" and "msg" fields [puppet] - 10https://gerrit.wikimedia.org/r/509924 (https://phabricator.wikimedia.org/T187147) [21:05:13] i have enabled it now and it will run but will fail on anything that depends on Package['rsyslog-kafka'] [21:05:24] we can try to simply scp the rsyslog-kafka deb from jessie to trusty and "dpkg -i", it should be close enough I guess [21:06:48] yes that could work, bstorm_ ?
[21:07:02] (03CR) 10Herron: [C: 03+2] logstash: enforce max length on "message" and "msg" fields [puppet] - 10https://gerrit.wikimedia.org/r/509924 (https://phabricator.wikimedia.org/T187147) (owner: 10Herron) [21:07:09] or we build a dummy deb using "equivs" if only to fulfill the dependency for puppet [21:08:08] not sure what is needed from puppet for the decom; it's possible this is a non-issue [21:08:59] As long as the rsync class works, it'll be fine :) [21:09:13] Sorry, I'd briefly lost internet in a valley [21:09:15] Checking now [21:09:29] (03PS1) 10Paladox: prometheus: Add gerrit.yaml under targets [puppet] - 10https://gerrit.wikimedia.org/r/510251 [21:09:42] let me know if you need the output from the run I did after enabling [21:10:30] jbond42: looks good to me :) [21:10:37] We can probably disable puppet again now [21:11:09] cool done [21:12:03] moritzm: not really sure of the effort involved in 'build a dummy deb using "equivs"' so may be a good exercise for tomorrow regardless :) [21:12:49] (03PS1) 10Herron: logstash: obligatory typo fix for filter-truncate [puppet] - 10https://gerrit.wikimedia.org/r/510252 [21:13:30] PROBLEM - puppet last run on logstash2004 is CRITICAL: CRITICAL: Failed to apply catalog, zero resources tracked by Puppet. It might be a dependency cycle. [21:14:02] yeah, if labstore1003 is still alive tomorrow, we can do that :-) [21:14:10] 10Operations, 10Analytics, 10Analytics-Kanban, 10EventBus, and 5 others: Create schema[12]00[12] (schema.svc.{eqiad,codfw}.wmnet) - https://phabricator.wikimedia.org/T219556 (10Nuria) 05Open→03Resolved [21:14:11] cool [21:14:26] logstash is me, please ignore [21:15:21] I'm pretty sure it will be. There's a fair bit of data to get off there [21:15:28] (03CR) 10Herron: [C: 03+2] logstash: obligatory typo fix for filter-truncate [puppet] - 10https://gerrit.wikimedia.org/r/510252 (owner: 10Herron) [21:17:13] hrm...wait, this rsync config isn't right. This is old stuff. Ugh.
[21:17:37] I can probably manually configure it from this scaffold [21:18:52] RECOVERY - puppet last run on logstash2004 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [21:21:34] it's a bit late here bstorm_ but sounds like mor.tz will help me get puppet working tomorrow if you are still blocked [21:22:15] Great. Thanks! I should be able to fiddle with it manually as well. :) [21:22:36] ok cool i'll ping you tomorrow with an update [21:23:49] (03CR) 10Volans: "Some questions/suggestions inline. Thanks a lot for taking care of this!" (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/510216 (https://phabricator.wikimedia.org/T144169) (owner: 10Jbond) [21:33:22] (03PS2) 10Paladox: prometheus: Add gerrit.yaml under targets [puppet] - 10https://gerrit.wikimedia.org/r/510251 [21:34:19] (03PS3) 10Paladox: prometheus: Add gerrit.yaml under targets [puppet] - 10https://gerrit.wikimedia.org/r/510251 [21:34:54] 10Operations, 10Operations-Software-Development, 10Patch-For-Review, 10cloud-services-team (Kanban): Flake8 for python files without extension in puppet repo - https://phabricator.wikimedia.org/T144169 (10Volans) @jbond thanks a lot for this sprint, really appreciated. I've commented on the CI patch. Regar...
[21:35:19] (03PS4) 10Jbond: CI Tests: add a check to ensure all python files have a py extension [puppet] - 10https://gerrit.wikimedia.org/r/510216 (https://phabricator.wikimedia.org/T144169) [21:35:53] (03CR) 10jerkins-bot: [V: 04-1] CI Tests: add a check to ensure all python files have a py extension [puppet] - 10https://gerrit.wikimedia.org/r/510216 (https://phabricator.wikimedia.org/T144169) (owner: 10Jbond) [21:36:52] (03PS4) 10Paladox: prometheus: Add gerrit.yaml under targets [puppet] - 10https://gerrit.wikimedia.org/r/510251 [21:37:08] (03PS5) 10Jbond: CI Tests: add a check to ensure all python files have a py extension [puppet] - 10https://gerrit.wikimedia.org/r/510216 (https://phabricator.wikimedia.org/T144169) [21:37:40] (03CR) 10jerkins-bot: [V: 04-1] CI Tests: add a check to ensure all python files have a py extension [puppet] - 10https://gerrit.wikimedia.org/r/510216 (https://phabricator.wikimedia.org/T144169) (owner: 10Jbond) [21:39:59] (03PS6) 10Jbond: CI Tests: add a check to ensure all python files have a py extension [puppet] - 10https://gerrit.wikimedia.org/r/510216 (https://phabricator.wikimedia.org/T144169) [21:40:44] (03CR) 10Jbond: CI Tests: add a check to ensure all python files have a py extension (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/510216 (https://phabricator.wikimedia.org/T144169) (owner: 10Jbond) [21:41:15] 10Operations, 10Operations-Software-Development, 10Patch-For-Review: Puppet tox: properly lint both Py2 and Py3 files - https://phabricator.wikimedia.org/T184435 (10Volans) I'm not convinced this is a good solution for few reasons: - quite some overhead to rename all existing py3 files to .py3 - quite some o... [21:47:04] (03CR) 10Volans: "@jbond: I guess that the assumption here is that the upgrades were already been done via debdeploy." 
(032 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/510113 (owner: 10Muehlenhoff) [21:48:15] 08̶W̶a̶r̶n̶i̶n̶g Device asw-a-codfw.mgmt.codfw.wmnet recovered from Access port utilisation over 80% for 1h [21:53:07] (03CR) 10Volans: [C: 03+1] "LGTM but let's merge it tomorrow to avoid side effect, no hurry ;)" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/510216 (https://phabricator.wikimedia.org/T144169) (owner: 10Jbond) [21:53:47] XioNoX: FYI ^^^^ ( librenms-wmf ) [21:54:01] thx [21:54:08] (03PS1) 10CRusnov: Add LibreNMS parity check report. [software/netbox-reports] - 10https://gerrit.wikimedia.org/r/510256 [21:55:09] those are just irc warnings, I don't think they're very useful, at least without the hostname in the alert [21:56:27] yeah [22:07:09] (03PS2) 10CRusnov: Add LibreNMS parity check report. [software/netbox-reports] - 10https://gerrit.wikimedia.org/r/510256 (https://phabricator.wikimedia.org/T221507) [22:07:43] (03CR) 10jerkins-bot: [V: 04-1] Add LibreNMS parity check report. [software/netbox-reports] - 10https://gerrit.wikimedia.org/r/510256 (https://phabricator.wikimedia.org/T221507) (owner: 10CRusnov) [22:08:30] (03PS3) 10CRusnov: Add LibreNMS parity check report. [software/netbox-reports] - 10https://gerrit.wikimedia.org/r/510256 (https://phabricator.wikimedia.org/T221507) [22:12:25] (03PS1) 10Bstorm: labstore: shut down old rsync and export materials for migration [puppet] - 10https://gerrit.wikimedia.org/r/510259 (https://phabricator.wikimedia.org/T209527) [22:15:03] jouncebot: reload [22:15:13] jouncebot: refresh [22:15:14] I refreshed my knowledge about deployments. [22:15:32] 10Operations, 10Operations-Software-Development, 10netbox, 10netops, and 2 others: Netbox report to validate network equipment data - https://phabricator.wikimedia.org/T221507 (10crusnov) The >>! In T221507#5182977, @gerritbot wrote: > Change 510256 had a related patch set uploaded (by CRusnov; owner: CR... 
[22:16:56] (03CR) 10Bstorm: [C: 03+2] labstore: shut down old rsync and export materials for migration [puppet] - 10https://gerrit.wikimedia.org/r/510259 (https://phabricator.wikimedia.org/T209527) (owner: 10Bstorm) [22:24:53] 10Operations, 10Operations-Software-Development, 10netbox, 10netops, and 2 others: Netbox report to validate network equipment data - https://phabricator.wikimedia.org/T221507 (10crusnov) It was pointed out to me that the vendor name in entPhysical is there, so we could hypothetically check that (for inven... [22:26:10] (03PS4) 10CRusnov: Add LibreNMS parity check report. [software/netbox-reports] - 10https://gerrit.wikimedia.org/r/510256 (https://phabricator.wikimedia.org/T221507) [22:30:40] (03PS1) 10RobH: adding in new model SSD drive [software] - 10https://gerrit.wikimedia.org/r/510261 [22:32:36] (03PS5) 10CRusnov: Add LibreNMS parity check report. [software/netbox-reports] - 10https://gerrit.wikimedia.org/r/510256 (https://phabricator.wikimedia.org/T221507) [22:34:04] 10Operations, 10PHP 7.2 support: [Regression] Varnish is replacing the detailed HTTP 500 page from PHP 7 with "503 Service Temporarily Unavailable" - https://phabricator.wikimedia.org/T223336 (10Krinkle) [22:34:12] 10Operations, 10PHP 7.2 support, 10Performance-Team (Radar): [Regression] Varnish is replacing the detailed HTTP 500 page from PHP 7 with "503 Service Temporarily Unavailable" - https://phabricator.wikimedia.org/T223336 (10Krinkle) [22:34:52] 10Operations, 10Patch-For-Review, 10Performance-Team (Radar): PHP fatal error handler not working on mwdebug servers - https://phabricator.wikimedia.org/T217846 (10Krinkle) Not sure if the change went out or if something else related happened, but as of today: {T223336} [22:35:33] (03PS1) 10Bstorm: labstore: fix a couple errors and tidy mounts for migration [puppet] - 10https://gerrit.wikimedia.org/r/510262 (https://phabricator.wikimedia.org/T209527) [22:37:56] (03CR) 10Bstorm: [C: 03+2] labstore: fix a couple errors and 
tidy mounts for migration [puppet] - 10https://gerrit.wikimedia.org/r/510262 (https://phabricator.wikimedia.org/T209527) (owner: 10Bstorm) [22:39:43] 10Operations, 10observability, 10PHP 7.2 support, 10Performance-Team (Radar): [Regression] Varnish is replacing the detailed HTTP 500 page from PHP 7 with "503 Service Temporarily Unavailable" - https://phabricator.wikimedia.org/T223336 (10Krinkle) Tagging monitoring as well because we generally associate... [22:40:01] 10Operations, 10Traffic, 10observability, 10PHP 7.2 support, 10Performance-Team (Radar): [Regression] Varnish is replacing the detailed HTTP 500 page from PHP 7 with "503 Service Temporarily Unavailable" - https://phabricator.wikimedia.org/T223336 (10Krinkle) [22:42:11] 10Operations, 10Traffic, 10observability, 10PHP 7.2 support, and 2 others: [Regression] Varnish is replacing the detailed HTTP 500 page from PHP 7 with "503 Service Temporarily Unavailable" - https://phabricator.wikimedia.org/T223336 (10jijiki) [22:42:15] 08Warning Alert for device asw2-a-eqiad.mgmt.eqiad.wmnet - Access port utilisation over 80% for 1h [22:45:40] (03PS1) 10Bstorm: rsync: equal sign was removed from quickdatacopy bwlimit [puppet] - 10https://gerrit.wikimedia.org/r/510264 (https://phabricator.wikimedia.org/T209527) [22:46:11] librenms-wmf: that's a new bot... at least for me [22:46:41] it's been around for a while [22:46:46] it's fairly quiet [22:48:37] Krenair: I mean, compared to most bots in here anything is quiet xD. Anyway, that's interesting. Are there any docs on-wiki about it? I'm interested in it [22:49:36] https://wikitech.wikimedia.org/wiki/LibreNMS#IRC_Alerting [22:49:57] https://wikitech.wikimedia.org/w/index.php?title=LibreNMS&diff=1785585&oldid=1782875 [22:50:14] Krenair: ah thank you :) [22:50:50] (03CR) 10Bstorm: "This must have the equal sign, or it basically does nothing. 
This is also a noop on servers without the option, but it also is missing a " [puppet] - 10https://gerrit.wikimedia.org/r/510264 (https://phabricator.wikimedia.org/T209527) (owner: 10Bstorm) [22:51:29] (03PS2) 10Bstorm: rsync: equal sign was removed from quickdatacopy bwlimit [puppet] - 10https://gerrit.wikimedia.org/r/510264 (https://phabricator.wikimedia.org/T209527) [22:53:57] (03CR) 10Bstorm: "That's the stuff! https://puppet-compiler.wmflabs.org/compiler1002/16554/" [puppet] - 10https://gerrit.wikimedia.org/r/510264 (https://phabricator.wikimedia.org/T209527) (owner: 10Bstorm) [22:54:09] (03CR) 10Dzahn: [C: 03+1] "oh yea. that was my bad. was reading on the difference between erubis and stdlibs erb and how to remove the whitespace and that should hav" [puppet] - 10https://gerrit.wikimedia.org/r/510264 (https://phabricator.wikimedia.org/T209527) (owner: 10Bstorm) [22:54:35] (03CR) 10Bstorm: [C: 03+2] rsync: equal sign was removed from quickdatacopy bwlimit [puppet] - 10https://gerrit.wikimedia.org/r/510264 (https://phabricator.wikimedia.org/T209527) (owner: 10Bstorm) [22:55:46] bstorm_: i spent too much time on trying to remove the whitespace :) [22:56:19] No worries :) It takes a couple tries to get these right sometimes. By our powers combined, it does the thing right. [22:57:23] :) yea, googled the different tag options multiple times by now [23:00:04] MaxSem, RoanKattouw, and Niharika: #bothumor Q:How do functions break up? A:They stop calling each other. Rise for Evening SWAT (Max 6 patches) deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190514T2300). [23:00:05] No GERRIT patches in the queue for this window AFAICS. 
[23:03:28] (03PS1) 10Faidon Liambotis: coherence: optimize by reducing database queries [software/netbox-reports] - 10https://gerrit.wikimedia.org/r/510267 [23:03:30] (03PS1) 10Faidon Liambotis: Remove the oldhardware report [software/netbox-reports] - 10https://gerrit.wikimedia.org/r/510268 [23:19:15] 08Warning Alert for device asw-d-codfw.mgmt.codfw.wmnet - Access port utilisation over 80% for 1h [23:28:09] Hi, is SWAT still online? [23:28:53] MaxSem, RoanKattuow, Niharika:: ? [23:28:59] *RoanKattouw [23:29:24] Zoranzoki21: You can tab-complete usernames on IRC! :) Type a few letters and press Tab [23:29:26] Sorry, we're in a meeting [23:29:34] I'm in a meeting too, sorry. :( [23:29:40] No problem. [23:29:57] Will you finish in the next 30 minutes? [23:30:00] I need one patch merged [23:30:09] Zoranzoki21: What's the patch? [23:30:15] I'm making it currently [23:30:21] It is for https://phabricator.wikimedia.org/T219617 [23:31:13] (03PS1) 10Zoranzoki21: Enable RCPatrol and some rights on Croatian projects [mediawiki-config] - 10https://gerrit.wikimedia.org/r/510269 (https://phabricator.wikimedia.org/T219617) [23:40:58] Niharika, MaxSem: patch will be ready in ~5 minutes [23:41:03] (03CR) 10RobH: [C: 03+2] adding in new model SSD drive [software] - 10https://gerrit.wikimedia.org/r/510261 (owner: 10RobH) [23:41:39] (03PS2) 10Zoranzoki21: Enable RCPatrol and some rights on Croatian projects [mediawiki-config] - 10https://gerrit.wikimedia.org/r/510269 (https://phabricator.wikimedia.org/T219617) [23:42:27] (03CR) 10jerkins-bot: [V: 04-1] Enable RCPatrol and some rights on Croatian projects [mediawiki-config] - 10https://gerrit.wikimedia.org/r/510269 (https://phabricator.wikimedia.org/T219617) (owner: 10Zoranzoki21) [23:44:38] Niharika, MaxSem: Patch is ready. 
It is 510269 [23:44:42] I will add it in calendar [23:44:51] (03PS3) 10Zoranzoki21: Enable RCPatrol and some rights on Croatian projects [mediawiki-config] - 10https://gerrit.wikimedia.org/r/510269 (https://phabricator.wikimedia.org/T219617) [23:45:36] (03CR) 10jerkins-bot: [V: 04-1] Enable RCPatrol and some rights on Croatian projects [mediawiki-config] - 10https://gerrit.wikimedia.org/r/510269 (https://phabricator.wikimedia.org/T219617) (owner: 10Zoranzoki21) [23:47:53] (03CR) 10Zoranzoki21: "recheck" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/510269 (https://phabricator.wikimedia.org/T219617) (owner: 10Zoranzoki21) [23:52:15] 08̶W̶a̶r̶n̶i̶n̶g Device asw2-a-eqiad.mgmt.eqiad.wmnet recovered from Access port utilisation over 80% for 1h