[00:00:04] twentyafterfour: #bothumor Q:How do functions break up? A:They stop calling each other. Rise for Phabricator update deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190509T0000).
[00:01:41] PROBLEM - Check whether ferm is active by checking the default input chain on ms-be2028 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.0.88: Connection reset by peer https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm
[00:01:53] PROBLEM - swift-object-auditor on ms-be2028 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.0.88: Connection reset by peer https://wikitech.wikimedia.org/wiki/Swift
[00:02:07] PROBLEM - swift-object-replicator on ms-be2028 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.0.88: Connection reset by peer https://wikitech.wikimedia.org/wiki/Swift
[00:02:07] PROBLEM - Check size of conntrack table on ms-be2028 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.0.88: Connection reset by peer https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack
[00:02:17] PROBLEM - very high load average likely xfs on ms-be2028 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.0.88: Connection reset by peer https://wikitech.wikimedia.org/wiki/Swift
[00:02:19] PROBLEM - swift-account-reaper on ms-be2028 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.0.88: Connection reset by peer https://wikitech.wikimedia.org/wiki/Swift
[00:02:29] PROBLEM - DPKG on ms-be2028 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.0.88: Connection reset by peer
[00:02:31] PROBLEM - dhclient process on ms-be2028 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.0.88: Connection reset by peer
[00:02:31] PROBLEM - swift-object-updater on ms-be2028 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.0.88: Connection reset by peer https://wikitech.wikimedia.org/wiki/Swift
[00:02:33] PROBLEM - swift-account-server on ms-be2028 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.0.88: Connection reset by peer https://wikitech.wikimedia.org/wiki/Swift
[00:02:45] PROBLEM - configured eth on ms-be2028 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.0.88: Connection reset by peer
[00:02:53] PROBLEM - swift-container-replicator on ms-be2028 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.0.88: Connection reset by peer https://wikitech.wikimedia.org/wiki/Swift
[00:02:54] uhm, I was about to start the phabricator deployment but it looks like maybe I should wait
[00:03:11] RECOVERY - swift-object-auditor on ms-be2028 is OK: PROCS OK: 3 processes with regex args ^/usr/bin/python /usr/bin/swift-object-auditor https://wikitech.wikimedia.org/wiki/Swift
[00:03:13] PROBLEM - SSH on ms-be2028 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring
[00:03:25] RECOVERY - Check size of conntrack table on ms-be2028 is OK: OK: nf_conntrack is 3 % full https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack
[00:03:25] RECOVERY - swift-object-replicator on ms-be2028 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-object-replicator https://wikitech.wikimedia.org/wiki/Swift
[00:03:33] RECOVERY - very high load average likely xfs on ms-be2028 is OK: OK - load average: 66.38, 62.63, 57.20 https://wikitech.wikimedia.org/wiki/Swift
[00:03:35] RECOVERY - swift-account-reaper on ms-be2028 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-account-reaper https://wikitech.wikimedia.org/wiki/Swift
[00:03:47] RECOVERY - DPKG on ms-be2028 is OK: All packages OK
[00:03:49] RECOVERY - swift-object-updater on ms-be2028 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-object-updater https://wikitech.wikimedia.org/wiki/Swift
[00:03:49] RECOVERY - dhclient process on ms-be2028 is OK: PROCS OK: 0 processes with command name dhclient
[00:03:49] RECOVERY - swift-account-server on ms-be2028 is OK: PROCS OK: 49 processes with regex args ^/usr/bin/python /usr/bin/swift-account-server https://wikitech.wikimedia.org/wiki/Swift
[00:04:01] RECOVERY - configured eth on ms-be2028 is OK: OK - interfaces up
[00:04:05] meh.
[00:04:11] RECOVERY - swift-container-replicator on ms-be2028 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-replicator https://wikitech.wikimedia.org/wiki/Swift
[00:04:27] RECOVERY - SSH on ms-be2028 is OK: SSH OK - OpenSSH_7.4p1 Debian-10+deb9u2 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[00:04:43] !log starting phabricator deployment, momentary downtime expected (~1 minute)
[00:04:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:05:51] PROBLEM - Check whether ferm is active by checking the default input chain on ms-be2028 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.0.88: Connection reset by peer https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm
[00:07:25] PROBLEM - swift-container-updater on ms-be2028 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.0.88: Connection reset by peer https://wikitech.wikimedia.org/wiki/Swift
[00:07:25] PROBLEM - swift-object-auditor on ms-be2028 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.0.88: Connection reset by peer https://wikitech.wikimedia.org/wiki/Swift
[00:07:39] PROBLEM - swift-object-replicator on ms-be2028 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.0.88: Connection reset by peer https://wikitech.wikimedia.org/wiki/Swift
[00:07:39] PROBLEM - Check size of conntrack table on ms-be2028 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.0.88: Connection reset by peer https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack
[00:07:47] PROBLEM - very high load average likely xfs on ms-be2028 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.0.88: Connection reset by peer https://wikitech.wikimedia.org/wiki/Swift
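[Editor's note: the "PROCS OK/CRITICAL" lines above come from NRPE running a check_procs-style plugin that counts processes whose full command line matches a regex. A minimal, self-contained emulation of that logic with pgrep follows; the function name, the demo pattern, and the output wording are illustrative assumptions, not the real monitoring-plugins check_procs.]

```shell
#!/bin/sh
# Hypothetical sketch of a check_procs-style probe: count processes whose
# full command line matches a pattern and report in Nagios plugin style
# (exit 0 = OK, exit 2 = CRITICAL). Not the actual NRPE plugin.
check_procs_sketch() {
    pattern=$1
    # pgrep -f matches against the full command line; -c prints the count.
    count=$(pgrep -fc -- "$pattern" || true)
    if [ "${count:-0}" -ge 1 ]; then
        echo "PROCS OK: $count processes with regex args $pattern"
        return 0
    fi
    echo "PROCS CRITICAL: 0 processes with regex args $pattern"
    return 2
}

# Demo: start a throwaway process, check for it, then clean up.
sleep 300 &
check_procs_sketch 'sleep 300'
kill $!
```

The production checks match patterns like `^/usr/bin/python /usr/bin/swift-object-auditor` instead of the throwaway `sleep` used here.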
[00:08:01] PROBLEM - DPKG on ms-be2028 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.0.88: Connection reset by peer
[00:08:03] PROBLEM - swift-object-updater on ms-be2028 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.0.88: Connection reset by peer https://wikitech.wikimedia.org/wiki/Swift
[00:08:03] PROBLEM - dhclient process on ms-be2028 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.0.88: Connection reset by peer
[00:08:19] PROBLEM - configured eth on ms-be2028 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.0.88: Connection reset by peer
[00:08:19] PROBLEM - Check systemd state on ms-be2028 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.0.88: Connection reset by peer
[00:08:25] PROBLEM - swift-container-replicator on ms-be2028 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.0.88: Connection reset by peer https://wikitech.wikimedia.org/wiki/Swift
[00:08:28] !log phabricator upgrade successful
[00:08:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:08:43] PROBLEM - SSH on ms-be2028 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring
[00:08:49] PROBLEM - swift-account-auditor on ms-be2028 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.0.88: Connection reset by peer https://wikitech.wikimedia.org/wiki/Swift
[00:08:53] PROBLEM - swift-container-server on ms-be2028 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.0.88: Connection reset by peer https://wikitech.wikimedia.org/wiki/Swift
[00:09:11] PROBLEM - swift-account-reaper on ms-be2028 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.0.88: Connection reset by peer https://wikitech.wikimedia.org/wiki/Swift
[00:09:27] PROBLEM - swift-object-server on ms-be2028 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.0.88: Connection reset by peer https://wikitech.wikimedia.org/wiki/Swift
[00:09:31] PROBLEM - Disk space on ms-be2028 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.0.88: Connection reset by peer https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space
[00:10:25] PROBLEM - Check size of conntrack table on ms-be2028 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.0.88: Connection reset by peer https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack
[00:10:47] PROBLEM - DPKG on ms-be2028 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.0.88: Connection reset by peer
[00:10:49] PROBLEM - swift-object-updater on ms-be2028 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.0.88: Connection reset by peer https://wikitech.wikimedia.org/wiki/Swift
[00:10:51] PROBLEM - swift-account-server on ms-be2028 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.0.88: Connection reset by peer https://wikitech.wikimedia.org/wiki/Swift
[00:11:01] twentyafterfour: don't worry about the alerts from ms-be* hosts for that. they're not in the phab request path, they're redundant anyway, and the monitoring noise is a known issue when doing data rebalances (re: which I'm hopeful for some possible fixes)
[00:11:03] PROBLEM - Check systemd state on ms-be2028 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.0.88: Connection reset by peer
[00:11:27] cool, thanks cdanis
[00:11:29] RECOVERY - swift-object-auditor on ms-be2028 is OK: PROCS OK: 3 processes with regex args ^/usr/bin/python /usr/bin/swift-object-auditor https://wikitech.wikimedia.org/wiki/Swift
[00:11:29] RECOVERY - swift-account-auditor on ms-be2028 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-account-auditor https://wikitech.wikimedia.org/wiki/Swift
[00:11:31] RECOVERY - swift-container-server on ms-be2028 is OK: PROCS OK: 49 processes with regex args ^/usr/bin/python /usr/bin/swift-container-server https://wikitech.wikimedia.org/wiki/Swift
[00:11:33] RECOVERY - swift-container-updater on ms-be2028 is OK: PROCS OK: 2 processes with regex args ^/usr/bin/python /usr/bin/swift-container-updater https://wikitech.wikimedia.org/wiki/Swift
[00:11:39] deployment went off without a hitch anyway as far as I can tell
[00:11:39] RECOVERY - Check size of conntrack table on ms-be2028 is OK: OK: nf_conntrack is 0 % full https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack
[00:11:39] RECOVERY - swift-object-replicator on ms-be2028 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-object-replicator https://wikitech.wikimedia.org/wiki/Swift
[00:11:40] transient error?
[00:11:49] RECOVERY - very high load average likely xfs on ms-be2028 is OK: OK - load average: 59.27, 56.80, 56.15 https://wikitech.wikimedia.org/wiki/Swift
[00:11:49] RECOVERY - swift-account-reaper on ms-be2028 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-account-reaper https://wikitech.wikimedia.org/wiki/Swift
[00:12:02] chaomodus: other disk I/O starving NRPE, basically
[00:12:03] RECOVERY - DPKG on ms-be2028 is OK: All packages OK
[00:12:03] RECOVERY - dhclient process on ms-be2028 is OK: PROCS OK: 0 processes with command name dhclient
[00:12:03] RECOVERY - swift-object-updater on ms-be2028 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-object-updater https://wikitech.wikimedia.org/wiki/Swift
[00:12:03] RECOVERY - swift-object-server on ms-be2028 is OK: PROCS OK: 101 processes with regex args ^/usr/bin/python /usr/bin/swift-object-server https://wikitech.wikimedia.org/wiki/Swift
[00:12:05] RECOVERY - swift-account-server on ms-be2028 is OK: PROCS OK: 49 processes with regex args ^/usr/bin/python /usr/bin/swift-account-server https://wikitech.wikimedia.org/wiki/Swift
[00:12:07] RECOVERY - Disk space on ms-be2028 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space
[00:12:13] cdanis: ah rog
[00:12:17] RECOVERY - configured eth on ms-be2028 is OK: OK - interfaces up
[00:12:17] RECOVERY - Check systemd state on ms-be2028 is OK: OK - running: The system is fully operational
[00:12:18] md being checked?
[00:12:23] or just swift being swift
[00:12:25] RECOVERY - swift-container-replicator on ms-be2028 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-replicator https://wikitech.wikimedia.org/wiki/Swift
[00:12:31] no, moving data off of to-be-decommed swift hosts onto the others
[00:12:37] RECOVERY - Check whether ferm is active by checking the default input chain on ms-be2028 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm
[00:12:41] RECOVERY - SSH on ms-be2028 is OK: SSH OK - OpenSSH_7.4p1 Debian-10+deb9u2 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[00:12:43] oic
[00:12:45] cool
[00:12:47] noisy tho
[00:13:00] chaomodus: https://phabricator.wikimedia.org/T221068 for the decom and https://phabricator.wikimedia.org/T221904 for the noise
[00:13:43] ahh hah true enough
[00:13:53] i'll be checking logs, but if you see any hosts ending in a 4 or a 7 (ms-be2*{4,7}.codfw) let me know -- those are ones with tweaked disk scheduler settings. so far so good 🤞
[00:13:56] I'm not complaining it just makes me check when my watch buzzes more than a few times
[00:14:12] you get icinga alerts on your watch? :D
[00:14:25] Yah
[00:15:02] have weechat-android, and then a highlight that matches icinga-wm PROBLEMs
[00:16:57] your watch will be buzzing more than ever :P
[00:36:53] PROBLEM - puppet last run on phab1001 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 3 minutes ago with 1 failures. Failed resources (up to 3 shown): Service[phd]
[00:53:18] twentyafterfour: ^ phd failure on phab1001
[00:55:18] cdanis: looking into it
[00:57:15] PROBLEM - puppet last run on bast3002 is CRITICAL: CRITICAL: Failed to apply catalog, zero resources tracked by Puppet. It might be a dependency cycle.
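[Editor's note: the "highlight that matches icinga-wm PROBLEMs" described above is essentially a substring filter on bot messages. A rough equivalent can be sketched as a grep over a saved log excerpt; the sample lines and file path below are invented for illustration, not the actual weechat configuration.]

```shell
#!/bin/sh
# Keep only the PROBLEM lines from an IRC log excerpt, dropping RECOVERY
# spam and ordinary chatter -- roughly what the weechat highlight above
# does for watch notifications. Sample data is made up.
cat > /tmp/irc-sample.log <<'EOF'
[00:05:51] PROBLEM - ferm on ms-be2028 is CRITICAL
[00:11:29] RECOVERY - swift-object-auditor on ms-be2028 is OK
[00:12:02] other disk I/O starving NRPE, basically
[00:12:17] PROBLEM - configured eth on ms-be2028 is CRITICAL
EOF

# -F: fixed-string match, no regex metacharacters to worry about.
grep -F 'PROBLEM - ' /tmp/irc-sample.log
```

Of the four sample lines, only the two PROBLEM lines survive the filter.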
[00:57:58] !log stopped phd, now running `puppet agent --test` manually on phab1001
[00:58:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:58:21] RECOVERY - puppet last run on phab1001 is OK: OK: Puppet is currently enabled, last run 20 seconds ago with 0 failures
[00:59:13] seems to be ok now, hmm
[01:02:59] PROBLEM - Check systemd state on an-coord1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[01:09:37] RECOVERY - HP RAID on ms-be2036 is OK: OK: Slot 3: OK: 2I:4:1, 2I:4:2, 1I:1:5, 1I:1:6, 1I:1:7, 1I:1:8, 1I:1:1, 1I:1:2, 1I:1:3, 1I:1:4, 2I:2:1, 2I:2:2, 2I:2:3, 2I:2:4 - Controller: OK - Battery/Capacitor: OK https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering
[01:29:19] RECOVERY - puppet last run on bast3002 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures
[01:53:11] PROBLEM - very high load average likely xfs on ms-be2014 is CRITICAL: CRITICAL - load average: 155.95, 101.14, 51.89 https://wikitech.wikimedia.org/wiki/Swift
[02:24:39] RECOVERY - very high load average likely xfs on ms-be2014 is OK: OK - load average: 9.96, 19.02, 74.72 https://wikitech.wikimedia.org/wiki/Swift
[02:28:23] PROBLEM - Text HTTP 5xx reqs/min on graphite1004 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=All&var-cache_type=text&var-status_type=5
[02:29:41] PROBLEM - Eqiad HTTP 5xx reqs/min on graphite1004 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=eqiad&var-cache_type=All&var-status_type=5
[02:33:51] RECOVERY - Eqiad HTTP 5xx reqs/min on graphite1004 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=eqiad&var-cache_type=All&var-status_type=5
[02:35:15] RECOVERY - Text HTTP 5xx reqs/min on graphite1004 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=All&var-cache_type=text&var-status_type=5
[03:06:51] (PS3) Mathew.onipe: prometheus: enable metrics relabel [puppet] - https://gerrit.wikimedia.org/r/508809 (https://phabricator.wikimedia.org/T193017)
[03:07:36] (CR) Mathew.onipe: prometheus: enable metrics relabel (2 comments) [puppet] - https://gerrit.wikimedia.org/r/508809 (https://phabricator.wikimedia.org/T193017) (owner: Mathew.onipe)
[03:11:55] PROBLEM - Check systemd state on ms-be2020 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[03:21:27] RECOVERY - Check systemd state on ms-be2020 is OK: OK - running: The system is fully operational
[03:53:43] PROBLEM - puppet last run on schema1001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[03:55:48] (CR) Mathew.onipe: icinga: add unit test for elastic config check (2 comments) [puppet] - https://gerrit.wikimedia.org/r/508742 (https://phabricator.wikimedia.org/T218932) (owner: Mathew.onipe)
[03:57:40] (PS8) Mathew.onipe: elasticsearch: add new attribute [puppet] - https://gerrit.wikimedia.org/r/507950 (https://phabricator.wikimedia.org/T218932)
[03:57:42] (PS41) Mathew.onipe: icinga: create and apply cirrus config check [puppet] - https://gerrit.wikimedia.org/r/507045 (https://phabricator.wikimedia.org/T218932)
[04:01:53] (CR) Mathew.onipe: "PCC is Ok: https://puppet-compiler.wmflabs.org/compiler1002/16432/" [puppet] - https://gerrit.wikimedia.org/r/507950 (https://phabricator.wikimedia.org/T218932) (owner: Mathew.onipe)
[04:03:12] (CR) Mathew.onipe: icinga: create and apply cirrus config check (1 comment) [puppet] - https://gerrit.wikimedia.org/r/507045 (https://phabricator.wikimedia.org/T218932) (owner: Mathew.onipe)
[04:04:29] PROBLEM - very high load average likely xfs on ms-be2014 is CRITICAL: CRITICAL - load average: 185.78, 122.70, 66.30 https://wikitech.wikimedia.org/wiki/Swift
[04:12:17] PROBLEM - Check systemd state on ms-be2029 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[04:20:33] RECOVERY - puppet last run on schema1001 is OK: OK: Puppet is currently enabled, last run 28 seconds ago with 0 failures
[04:25:03] RECOVERY - very high load average likely xfs on ms-be2014 is OK: OK - load average: 10.24, 42.10, 75.88 https://wikitech.wikimedia.org/wiki/Swift
[04:35:29] PROBLEM - Check systemd state on ms-be2023 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
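[Editor's note: the graphite HTTP 5xx checks earlier in the log fire when some percentage of recent datapoints sits above a threshold (e.g. "11.11% of data above the critical threshold [1000.0]"). The core arithmetic can be sketched with awk; the datapoint values below are invented for illustration, not real traffic data.]

```shell
#!/bin/sh
# Given per-minute 5xx counts, report what fraction of datapoints exceeds
# a critical threshold -- the same arithmetic behind the "N% of data above
# the critical threshold" alerts above. Values are invented.
threshold=1000
printf '%s\n' 120 90 1500 200 180 140 160 110 95 130 |
awk -v t="$threshold" '
    $1 > t { above++ }
    { total++ }
    END { printf "%.2f%% of data above the critical threshold [%s]\n",
                 100 * above / total, t }'
```

With one of ten invented datapoints above 1000, this prints "10.00% of data above the critical threshold [1000]".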
[04:51:08] (PS1) Marostegui: Revert "db-eqiad.php: Depool pc1007" [mediawiki-config] - https://gerrit.wikimedia.org/r/508973
[04:51:13] (PS2) Marostegui: Revert "db-eqiad.php: Depool pc1007" [mediawiki-config] - https://gerrit.wikimedia.org/r/508973
[04:53:07] (CR) Marostegui: [C: +2] Revert "db-eqiad.php: Depool pc1007" [mediawiki-config] - https://gerrit.wikimedia.org/r/508973 (owner: Marostegui)
[04:53:28] (PS2) Marostegui: db2103,db2112,db2116: Enable notifications [puppet] - https://gerrit.wikimedia.org/r/508844 (https://phabricator.wikimedia.org/T222772)
[04:54:16] (Merged) jenkins-bot: Revert "db-eqiad.php: Depool pc1007" [mediawiki-config] - https://gerrit.wikimedia.org/r/508973 (owner: Marostegui)
[04:54:40] (CR) Marostegui: [C: +2] db2103,db2112,db2116: Enable notifications [puppet] - https://gerrit.wikimedia.org/r/508844 (https://phabricator.wikimedia.org/T222772) (owner: Marostegui)
[04:55:34] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Repool pc1007 (duration: 00m 59s)
[04:55:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:57:29] RECOVERY - Check systemd state on ms-be2029 is OK: OK - running: The system is fully operational
[05:01:21] (PS1) Marostegui: db-eqiad,db-codfw.php: Pool db2103,db2112,db2116 [mediawiki-config] - https://gerrit.wikimedia.org/r/508974 (https://phabricator.wikimedia.org/T222772)
[05:02:51] (PS2) Marostegui: db-eqiad,db-codfw.php: Pool db2103,db2112,db2116 [mediawiki-config] - https://gerrit.wikimedia.org/r/508974 (https://phabricator.wikimedia.org/T222772)
[05:04:17] (CR) Marostegui: [C: +2] db-eqiad,db-codfw.php: Pool db2103,db2112,db2116 [mediawiki-config] - https://gerrit.wikimedia.org/r/508974 (https://phabricator.wikimedia.org/T222772) (owner: Marostegui)
[05:05:42] (Merged) jenkins-bot: db-eqiad,db-codfw.php: Pool db2103,db2112,db2116 [mediawiki-config] - https://gerrit.wikimedia.org/r/508974 (https://phabricator.wikimedia.org/T222772) (owner: Marostegui)
[05:07:23] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Pool db2103, db2112 and db2116 into s1 T222772 (duration: 01m 22s)
[05:07:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:07:28] T222772: Productionize db2[103-120] - https://phabricator.wikimedia.org/T222772
[05:09:12] !log marostegui@deploy1001 Synchronized wmf-config/db-codfw.php: Pool db2103, db2112 and db2116 into s1 T222772 (duration: 01m 41s)
[05:09:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:17:58] (PS1) Marostegui: mariadb: Provision several hosts on s2 [puppet] - https://gerrit.wikimedia.org/r/508975 (https://phabricator.wikimedia.org/T222772)
[05:18:32] (PS1) Marostegui: db-eqiad.php: Depool db1076 [mediawiki-config] - https://gerrit.wikimedia.org/r/508976
[05:20:00] (CR) Marostegui: [C: +2] mariadb: Provision several hosts on s2 [puppet] - https://gerrit.wikimedia.org/r/508975 (https://phabricator.wikimedia.org/T222772) (owner: Marostegui)
[05:20:15] (CR) Marostegui: [C: +2] db-eqiad.php: Depool db1076 [mediawiki-config] - https://gerrit.wikimedia.org/r/508976 (owner: Marostegui)
[05:21:22] (Merged) jenkins-bot: db-eqiad.php: Depool db1076 [mediawiki-config] - https://gerrit.wikimedia.org/r/508976 (owner: Marostegui)
[05:22:14] RECOVERY - Check systemd state on ms-be2023 is OK: OK - running: The system is fully operational
[05:22:43] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Depool db1076 (duration: 00m 57s)
[05:22:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:25:13] !log Stop MySQL on db1076
[05:25:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:29:53] !log Stop replication on db2098:s2
[05:29:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:38:57] Operations, Traffic, Patch-For-Review, Performance-Team (Radar): Refactor public-facing DYNA scheme for primary project hostnames in our DNS - https://phabricator.wikimedia.org/T208263 (Cwek) Can stop your hand? login.wikimedia.org is a CNAME of www.wikipedia.org and the wikipedia.org domain is p...
[05:41:02] Operations, Traffic, Core Platform Team Backlog (Next), Services (next): Have Varnish set the `X-Request-Id` header for incoming external requests - https://phabricator.wikimedia.org/T221976 (ema)
[05:41:06] Operations, Traffic, Core Platform Team Backlog (Watching / External), Patch-For-Review, Services (watching): Package libvmod-uuid for Debian - https://phabricator.wikimedia.org/T221977 (ema) Open→Resolved >>! In T221977#5168771, @mobrovac wrote: > @ema since the pkg has been uploaded...
[05:44:51] (PS1) Elukey: profile::analytics::refinery::job::data_purge: remove --verbose [puppet] - https://gerrit.wikimedia.org/r/508977 (https://phabricator.wikimedia.org/T212014)
[05:45:51] (CR) Elukey: [C: +2] profile::analytics::refinery::job::data_purge: remove --verbose [puppet] - https://gerrit.wikimedia.org/r/508977 (https://phabricator.wikimedia.org/T212014) (owner: Elukey)
[05:46:55] PROBLEM - HHVM jobrunner on mw1296 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Jobrunner
[05:47:38] (PS1) Cwek: Revert "Convert most DYNA into 1H CNAME records" [dns] - https://gerrit.wikimedia.org/r/508978
[05:47:40] (CR) Welcome, new contributor!: "Thank you for making your first contribution to Wikimedia! :) To learn how to get your code changes reviewed faster and more likely to get" [dns] - https://gerrit.wikimedia.org/r/508978 (owner: Cwek)
[05:48:01] RECOVERY - HHVM jobrunner on mw1296 is OK: HTTP OK: HTTP/1.1 200 OK - 270 bytes in 0.006 second response time https://wikitech.wikimedia.org/wiki/Jobrunner
[05:48:03] (Abandoned) Cwek: Revert "Convert most DYNA into 1H CNAME records" [dns] - https://gerrit.wikimedia.org/r/508978 (owner: Cwek)
[05:48:04] (PS5) Marostegui: db-eqiad,db-codfw.php: Change second parsercache key [mediawiki-config] - https://gerrit.wikimedia.org/r/508170 (https://phabricator.wikimedia.org/T210725)
[05:50:59] (PS1) Cwek: Revert "Convert most DYNA into 1H CNAME records" [dns] - https://gerrit.wikimedia.org/r/508979
[05:55:00] (PS1) Marostegui: db-codfw.php: Clarify disk space [mediawiki-config] - https://gerrit.wikimedia.org/r/508980
[05:56:06] (CR) Marostegui: [C: +2] db-codfw.php: Clarify disk space [mediawiki-config] - https://gerrit.wikimedia.org/r/508980 (owner: Marostegui)
[05:57:12] (Merged) jenkins-bot: db-codfw.php: Clarify disk space [mediawiki-config] - https://gerrit.wikimedia.org/r/508980 (owner: Marostegui)
[05:58:56] !log marostegui@deploy1001 Synchronized wmf-config/db-codfw.php: Clarify disk status for db2103, db2112, db2116 (duration: 00m 58s)
[05:58:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:07:32] (CR) Elukey: [C: +1] "Any special handling for this or we can merge?" [puppet] - https://gerrit.wikimedia.org/r/508730 (https://phabricator.wikimedia.org/T222765) (owner: Krinkle)
[06:12:05] (PS1) Ema: ATS: SystemTap script to debug request coalescing [puppet] - https://gerrit.wikimedia.org/r/508981
[06:13:22] (CR) Ema: [C: +2] ATS: SystemTap script to debug request coalescing [puppet] - https://gerrit.wikimedia.org/r/508981 (owner: Ema)
[06:15:27] PROBLEM - puppet last run on nescio is CRITICAL: CRITICAL: Failed to apply catalog, zero resources tracked by Puppet. It might be a dependency cycle.
[06:19:40] Operations, ops-eqiad, Analytics, Analytics-EventLogging, DBA: db1107 (eventlogging db master) possibly memory issues - https://phabricator.wikimedia.org/T222050 (elukey) @Cmjohnson I'd need a heads up of ~15 mins before the maintenance to shutdown the host properly, but we can do it anytime!
[06:20:45] PROBLEM - Check systemd state on ms-be2018 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[06:28:29] PROBLEM - puppet last run on mw1319 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/usr/local/bin/cgroup-mediawiki-clean]
[06:29:03] (CR) Giuseppe Lavagetto: [C: +2] mediawiki: Fix statsd reporting of MediaWiki.errors.fatal [puppet] - https://gerrit.wikimedia.org/r/508730 (https://phabricator.wikimedia.org/T222765) (owner: Krinkle)
[06:29:30] (PS2) Giuseppe Lavagetto: mediawiki: Fix statsd reporting of MediaWiki.errors.fatal [puppet] - https://gerrit.wikimedia.org/r/508730 (https://phabricator.wikimedia.org/T222765) (owner: Krinkle)
[06:32:11] PROBLEM - puppet last run on mw1307 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 6 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/etc/profile.d/bash_autologout.sh]
[06:33:19] PROBLEM - puppet last run on cp2024 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[06:36:17] (CR) Elukey: "My first thought was to simply turn off the secondary route, but then it would be really useful to know how mcrouter works with the config" [puppet] - https://gerrit.wikimedia.org/r/492948 (owner: Aaron Schulz)
[06:38:58] (PS1) Ema: ATS: bump max_open_write_retries [puppet] - https://gerrit.wikimedia.org/r/508982
[06:40:50] (CR) Ema: [C: +2] ATS: bump max_open_write_retries [puppet] - https://gerrit.wikimedia.org/r/508982 (owner: Ema)
[06:42:13] RECOVERY - puppet last run on nescio is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures
[06:49:15] RECOVERY - Check systemd state on ms-be2018 is OK: OK - running: The system is fully operational
[06:53:51] <_joe_> !log restarted nagios-nrpe-server on proton1001
[06:53:51] RECOVERY - Check whether ferm is active by checking the default input chain on proton1001 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm
[06:53:53] RECOVERY - Disk space on proton1001 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space
[06:53:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:54:29] RECOVERY - dhclient process on proton1001 is OK: PROCS OK: 0 processes with command name dhclient
[06:54:37] RECOVERY - Check systemd state on proton1001 is OK: OK - running: The system is fully operational
[06:54:39] RECOVERY - Check size of conntrack table on proton1001 is OK: OK: nf_conntrack is 0 % full https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack
[06:54:45] RECOVERY - configured eth on proton1001 is OK: OK - interfaces up
[06:54:47] RECOVERY - DPKG on proton1001 is OK: All packages OK
[06:56:47] RECOVERY - Long running screen/tmux on proton1001 is OK: OK: No SCREEN or tmux processes detected. https://wikitech.wikimedia.org/wiki/Monitoring/Long_running_screens
[06:58:59] RECOVERY - puppet last run on mw1307 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures
[07:00:09] RECOVERY - puppet last run on cp2024 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures
[07:00:33] RECOVERY - puppet last run on mw1319 is OK: OK: Puppet is currently enabled, last run 5 minutes ago with 0 failures
[07:02:27] Operations, Patch-For-Review: uwsgi's logsocket_plugin.so causes segfaults during log rotation - https://phabricator.wikimedia.org/T212697 (elukey) No segfaults today, logrotate ran fine! I don't see any log in logstash though, worth to investigate? (Maybe in a separate task if confirmed by others). I u...
[07:06:38] (CR) Effie Mouzeli: [C: +1] lvs: test php7 on enwiki as well [puppet] - https://gerrit.wikimedia.org/r/508827 (https://phabricator.wikimedia.org/T222705) (owner: Giuseppe Lavagetto)
[07:08:03] Operations, Proton, Reading-Infrastructure-Team-Backlog: proton experienced a period of high CPU usage, busy queue, lockups - https://phabricator.wikimedia.org/T214975 (Joe) So what happened is that the server ran out of available memory and OOM'd. See https://grafana.wikimedia.org/d/000000274/pro...
[07:12:15] RECOVERY - Check the NTP synchronisation status of timesyncd on proton1001 is OK: OK: synced at Thu 2019-05-09 07:12:13 UTC.
[07:14:24] (CR) Elukey: "Looks good, ok to merge Filippo?" [puppet] - https://gerrit.wikimedia.org/r/508809 (https://phabricator.wikimedia.org/T193017) (owner: Mathew.onipe)
[07:17:22] (CR) Giuseppe Lavagetto: [C: +2] lvs: test php7 on enwiki as well [puppet] - https://gerrit.wikimedia.org/r/508827 (https://phabricator.wikimedia.org/T222705) (owner: Giuseppe Lavagetto)
[07:17:31] (PS3) Giuseppe Lavagetto: lvs: test php7 on enwiki as well [puppet] - https://gerrit.wikimedia.org/r/508827 (https://phabricator.wikimedia.org/T222705)
[07:17:55] PROBLEM - Check systemd state on ms-be2033 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[07:23:13] !log installing twitter-bootstrap3 security updates
[07:23:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:36:16] (PS1) Elukey: Improve HDFS accounting [puppet/cdh] - https://gerrit.wikimedia.org/r/508989 (https://phabricator.wikimedia.org/T220702)
[07:36:52] (PS2) Elukey: Improve HDFS accounting [puppet/cdh] - https://gerrit.wikimedia.org/r/508989 (https://phabricator.wikimedia.org/T220702)
[07:37:41] (PS3) Elukey: Improve HDFS accounting [puppet/cdh] - https://gerrit.wikimedia.org/r/508989 (https://phabricator.wikimedia.org/T220702)
[07:38:32] (CR) Elukey: [V: +2 C: +2] Improve HDFS accounting [puppet/cdh] - https://gerrit.wikimedia.org/r/508989 (https://phabricator.wikimedia.org/T220702) (owner: Elukey)
[07:39:43] RECOVERY - Check systemd state on ms-be2033 is OK: OK - running: The system is fully operational
[07:39:51] (PS1) Elukey: Update the cdh module to its latest sha [puppet] - https://gerrit.wikimedia.org/r/508990
[07:40:28] (CR) Elukey: [C: +2] Update the cdh module to its latest sha [puppet] - https://gerrit.wikimedia.org/r/508990 (owner: Elukey)
[07:40:36] (CR) Jcrespo: [C: +1] db-eqiad,db-codfw.php: Change second parsercache key [mediawiki-config] - https://gerrit.wikimedia.org/r/508170 (https://phabricator.wikimedia.org/T210725) (owner: Marostegui)
[07:48:16] Operations, cloud-services-team (Kanban): WMCS-related dashboards using Diamond metrics - https://phabricator.wikimedia.org/T210850 (MoritzMuehlenhoff)
[07:49:27] (PS1) Marostegui: db-eqiad,db-codfw.php: Add new hosts [mediawiki-config] - https://gerrit.wikimedia.org/r/508992 (https://phabricator.wikimedia.org/T222772)
[07:50:04] !log roll restart HDFS masters on an-master100[1,2] to pick up new logging settings
[07:50:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:52:43] RECOVERY - MariaDB Slave Lag: s1 on db1139 is OK: OK slave_sql_lag Replication lag: 0.00 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_slave
[07:54:35] !log installing jquery security updates for stretch
[07:54:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:55:16] (PS1) Fsero: registryha,lvs: feat(T221101) adding missing discovery data [puppet] - https://gerrit.wikimedia.org/r/508994 (https://phabricator.wikimedia.org/T221101)
[07:55:50] (CR) jerkins-bot: [V: -1] registryha,lvs: feat(T221101) adding missing discovery data [puppet] - https://gerrit.wikimedia.org/r/508994 (https://phabricator.wikimedia.org/T221101) (owner: Fsero)
[07:56:55] (PS2) Fsero: registryha,lvs: feat(T221101) adding missing discovery data [puppet] - https://gerrit.wikimedia.org/r/508994 (https://phabricator.wikimedia.org/T221101)
[08:01:06] (PS1) Fsero: registryha: feat(T221101) swifting traffic to new registry [dns] - https://gerrit.wikimedia.org/r/508996
[08:01:25] (CR) jerkins-bot: [V: -1] registryha: feat(T221101) swifting traffic to new registry [dns] - https://gerrit.wikimedia.org/r/508996 (owner: Fsero)
[08:02:48] (CR) Jcrespo: [C: +1] "the ips and syntax is ok, only weights are missing" [mediawiki-config] - https://gerrit.wikimedia.org/r/508992 (https://phabricator.wikimedia.org/T222772) (owner: Marostegui)
[08:03:01] (CR) Marostegui: [C: +2] db-eqiad,db-codfw.php: Add new hosts [mediawiki-config] - https://gerrit.wikimedia.org/r/508992 (https://phabricator.wikimedia.org/T222772) (owner: Marostegui)
[08:04:10] (Merged) jenkins-bot: db-eqiad,db-codfw.php: Add new hosts [mediawiki-config] - https://gerrit.wikimedia.org/r/508992 (https://phabricator.wikimedia.org/T222772) (owner: Marostegui)
[08:06:03] !log marostegui@deploy1001 Synchronized wmf-config/db-codfw.php: Provision db1129, db2104, db2107, db2108 T222772 T222682 (duration: 00m 59s)
[08:06:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:06:10] T222772: Productionize db2[103-120] - https://phabricator.wikimedia.org/T222772
[08:06:10] T222682: Productionize db11[26-38] - https://phabricator.wikimedia.org/T222682
[08:06:43] Operations, Proton, Reading-Infrastructure-Team-Backlog: proton experienced a period of high CPU usage, busy queue, lockups - https://phabricator.wikimedia.org/T214975 (jcrespo) Now it says `proton1001: CRITICAL: Catalog fetch fail.
Either compilation failed or puppetmaster has issues` [08:07:14] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Provision db1129, db2104, db2107, db2108 T222772 T222682 (duration: 00m 57s) [08:07:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:08:42] (03PS2) 10Fsero: registryha: feat(T221101) swifting traffic to new registry [dns] - 10https://gerrit.wikimedia.org/r/508996 (https://phabricator.wikimedia.org/T221101) [08:09:05] (03CR) 10jerkins-bot: [V: 04-1] registryha: feat(T221101) swifting traffic to new registry [dns] - 10https://gerrit.wikimedia.org/r/508996 (https://phabricator.wikimedia.org/T221101) (owner: 10Fsero) [08:09:28] (03PS3) 10Fsero: registryha: feat(T221101) swifting traffic to new registry [dns] - 10https://gerrit.wikimedia.org/r/508996 (https://phabricator.wikimedia.org/T221101) [08:09:49] (03PS1) 10Marostegui: db-eqiad.php: Slowly repool db1076 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/508997 [08:09:51] (03CR) 10jerkins-bot: [V: 04-1] registryha: feat(T221101) swifting traffic to new registry [dns] - 10https://gerrit.wikimedia.org/r/508996 (https://phabricator.wikimedia.org/T221101) (owner: 10Fsero) [08:09:56] (03PS1) 10Marostegui: mariadb: Enable notifications [puppet] - 10https://gerrit.wikimedia.org/r/508998 (https://phabricator.wikimedia.org/T222772) [08:11:03] (03PS4) 10Fsero: registryha: feat(T221101) swifting traffic to new registry [dns] - 10https://gerrit.wikimedia.org/r/508996 (https://phabricator.wikimedia.org/T221101) [08:11:04] (03CR) 10Marostegui: [C: 03+2] mariadb: Enable notifications [puppet] - 10https://gerrit.wikimedia.org/r/508998 (https://phabricator.wikimedia.org/T222772) (owner: 10Marostegui) [08:11:25] (03PS5) 10Fsero: registryha: feat(T221101) swifting traffic to new registry [dns] - 10https://gerrit.wikimedia.org/r/508996 (https://phabricator.wikimedia.org/T221101) [08:11:27] (03CR) 10jerkins-bot: [V: 04-1] registryha: feat(T221101) swifting traffic to new 
registry [dns] - 10https://gerrit.wikimedia.org/r/508996 (https://phabricator.wikimedia.org/T221101) (owner: 10Fsero) [08:11:33] (03CR) 10Marostegui: [C: 03+2] db-eqiad.php: Slowly repool db1076 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/508997 (owner: 10Marostegui) [08:12:42] (03Merged) 10jenkins-bot: db-eqiad.php: Slowly repool db1076 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/508997 (owner: 10Marostegui) [08:12:56] (03PS2) 10Elukey: Remove mediawiki's nutcracker config from deployment-prep [puppet] - 10https://gerrit.wikimedia.org/r/508817 (https://phabricator.wikimedia.org/T214275) [08:13:52] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Slowly repool db1076 (duration: 00m 56s) [08:13:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:14:35] (03PS6) 10Fsero: registryha: feat(T221101) swifting traffic to new registry [dns] - 10https://gerrit.wikimedia.org/r/508996 (https://phabricator.wikimedia.org/T221101) [08:14:55] (03CR) 10Elukey: [C: 03+2] Remove mediawiki's nutcracker config from deployment-prep [puppet] - 10https://gerrit.wikimedia.org/r/508817 (https://phabricator.wikimedia.org/T214275) (owner: 10Elukey) [08:15:02] (03PS3) 10Fsero: registryha,lvs: feat(T221101) adding missing discovery data [puppet] - 10https://gerrit.wikimedia.org/r/508994 (https://phabricator.wikimedia.org/T221101) [08:15:14] (03PS1) 10Marostegui: db-eqiad.php: Pool db1129 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/508999 (https://phabricator.wikimedia.org/T222682) [08:16:11] (03PS2) 10Marostegui: db-eqiad.php: Pool db1129 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/508999 (https://phabricator.wikimedia.org/T222682) [08:16:45] (03PS3) 10Marostegui: db-eqiad.php: Pool db1129 in s2 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/508999 (https://phabricator.wikimedia.org/T222682) [08:17:53] (03CR) 10Marostegui: [C: 03+2] db-eqiad.php: Pool db1129 in s2 [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/508999 (https://phabricator.wikimedia.org/T222682) (owner: 10Marostegui) [08:18:58] (03Merged) 10jenkins-bot: db-eqiad.php: Pool db1129 in s2 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/508999 (https://phabricator.wikimedia.org/T222682) (owner: 10Marostegui) [08:21:16] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Pool db1129 with low weight on s2 - T222682 (duration: 00m 56s) [08:21:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:21:21] T222682: Productionize db11[26-38] - https://phabricator.wikimedia.org/T222682 [08:21:52] (03PS1) 10Marostegui: db-eqiad.php: More traffic to db1076 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/509000 [08:23:23] (03PS1) 10Jcrespo: transfer.py: Fix bug that broke xtrabackup transfers [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/509001 [08:23:47] (03CR) 10jerkins-bot: [V: 04-1] transfer.py: Fix bug that broke xtrabackup transfers [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/509001 (owner: 10Jcrespo) [08:23:56] (03PS1) 10Ema: ATS: Negative Response Caching [puppet] - 10https://gerrit.wikimedia.org/r/509002 [08:23:58] !log upload uwsgi 2.0.14+20161117-3+deb9u2+wmf1 packages to stretch-wikimedia - T212697 [08:24:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:24:02] T212697: uwsgi's logsocket_plugin.so causes segfaults during log rotation - https://phabricator.wikimedia.org/T212697 [08:24:08] (03PS1) 10Jcrespo: mariadb-backups: Fix missing mbstream untaring on destination [puppet] - 10https://gerrit.wikimedia.org/r/509003 [08:25:41] (03PS1) 10Marostegui: mariadb: Provision db2105,db2109 into s3 [puppet] - 10https://gerrit.wikimedia.org/r/509004 (https://phabricator.wikimedia.org/T222772) [08:25:50] (03CR) 10Marostegui: [C: 03+2] db-eqiad.php: More traffic to db1076 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/509000 (owner: 10Marostegui) [08:25:56] (03CR) 10Ema: 
"pcc seems fine https://puppet-compiler.wmflabs.org/compiler1002/16435/cp4021.ulsfo.wmnet/" [puppet] - 10https://gerrit.wikimedia.org/r/509002 (owner: 10Ema) [08:26:21] RECOVERY - HP RAID on ms-be2028 is OK: OK: Slot 3: OK: 2I:4:1, 2I:4:2, 1I:1:5, 1I:1:6, 1I:1:7, 1I:1:8, 1I:1:1, 1I:1:2, 1I:1:3, 1I:1:4, 2I:2:1, 2I:2:2, 2I:2:3, 2I:2:4 - Controller: OK - Battery/Capacitor: OK https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering [08:26:59] (03Merged) 10jenkins-bot: db-eqiad.php: More traffic to db1076 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/509000 (owner: 10Marostegui) [08:27:19] (03CR) 10Marostegui: [C: 03+2] mariadb: Provision db2105,db2109 into s3 [puppet] - 10https://gerrit.wikimedia.org/r/509004 (https://phabricator.wikimedia.org/T222772) (owner: 10Marostegui) [08:27:32] (03CR) 10Jcrespo: [V: 03+2 C: 03+2] transfer.py: Fix bug that broke xtrabackup transfers [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/509001 (owner: 10Jcrespo) [08:27:51] (03PS2) 10Jcrespo: mariadb-backups: Fix missing mbstream untaring on destination [puppet] - 10https://gerrit.wikimedia.org/r/509003 [08:28:07] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: More traffic to db1076 (duration: 00m 55s) [08:28:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:29:54] (03CR) 10Jcrespo: [C: 03+2] mariadb-backups: Fix missing mbstream untaring on destination [puppet] - 10https://gerrit.wikimedia.org/r/509003 (owner: 10Jcrespo) [08:32:21] PROBLEM - DPKG on netmon2001 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [08:32:30] (03PS1) 10Marostegui: db-eqiad.php: More traffic to db1076,db1129 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/509005 [08:33:01] PROBLEM - Check systemd state on netmon2001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
[08:33:11] this is me --^
[08:35:34] (03CR) 10Filippo Giunchedi: [C: 04-1] "See inline, LGTM save for dropping the old metrics" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/508809 (https://phabricator.wikimedia.org/T193017) (owner: 10Mathew.onipe)
[08:36:32] (03CR) 10Marostegui: [C: 03+2] db-eqiad.php: More traffic to db1076,db1129 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/509005 (owner: 10Marostegui)
[08:37:35] (03Merged) 10jenkins-bot: db-eqiad.php: More traffic to db1076,db1129 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/509005 (owner: 10Marostegui)
[08:38:48] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: More traffic to db1076 and db1129 (new host on s2) (duration: 00m 57s)
[08:38:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:42:27] (03PS1) 10Marostegui: db-eqiad.php: More traffic to db1076,db1129 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/509008
[08:42:39] (03CR) 10Gehel: [C: 04-1] "> Patch Set 2:" [puppet] - 10https://gerrit.wikimedia.org/r/508657 (owner: 10Paladox)
[08:42:46] (03PS6) 10Marostegui: db-eqiad,db-codfw.php: Change second parsercache key [mediawiki-config] - 10https://gerrit.wikimedia.org/r/508170 (https://phabricator.wikimedia.org/T210725)
[08:44:45] PROBLEM - puppet last run on netmon2001 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 6 minutes ago with 1 failures. Failed resources (up to 3 shown): Package[uwsgi]
[08:45:58] (03PS1) 10Muehlenhoff: netbox: Mask the default uwsgi service for Netbox [puppet] - 10https://gerrit.wikimedia.org/r/509009 (https://phabricator.wikimedia.org/T212697)
[08:46:38] (03CR) 10jerkins-bot: [V: 04-1] netbox: Mask the default uwsgi service for Netbox [puppet] - 10https://gerrit.wikimedia.org/r/509009 (https://phabricator.wikimedia.org/T212697) (owner: 10Muehlenhoff)
[08:47:27] (03PS2) 10Muehlenhoff: netbox: Mask the default uwsgi service for Netbox [puppet] - 10https://gerrit.wikimedia.org/r/509009 (https://phabricator.wikimedia.org/T212697)
[08:47:34] (03CR) 10Marostegui: [C: 03+2] db-eqiad.php: More traffic to db1076,db1129 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/509008 (owner: 10Marostegui)
[08:47:40] (03CR) 10Elukey: [C: 03+1] "Thanks!" [puppet] - 10https://gerrit.wikimedia.org/r/509009 (https://phabricator.wikimedia.org/T212697) (owner: 10Muehlenhoff)
[08:48:42] (03Merged) 10jenkins-bot: db-eqiad.php: More traffic to db1076,db1129 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/509008 (owner: 10Marostegui)
[08:49:48] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: More traffic to db1076 and db1129 (new host on s2) (duration: 00m 56s)
[08:49:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:57:11] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good, one nit." (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/509002 (owner: 10Ema)
[08:57:31] 10Operations, 10Icinga, 10observability, 10Patch-For-Review: Fix RAID handler alert and puppet facter to work with Gen10 hosts and ssacli tool - https://phabricator.wikimedia.org/T220787 (10jcrespo)
[08:57:41] (03PS1) 10Marostegui: db-eqiad.php: More traffic to db1076,db1129 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/509010
[08:58:34] (03PS4) 10Fsero: registryha,lvs: feat(T221101) adding missing discovery data [puppet] - 10https://gerrit.wikimedia.org/r/508994 (https://phabricator.wikimedia.org/T221101)
[08:58:53] (03CR) 10Marostegui: [C: 03+2] db-eqiad.php: More traffic to db1076,db1129 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/509010 (owner: 10Marostegui)
[08:59:20] RECOVERY - Check systemd state on netmon2001 is OK: OK - running: The system is fully operational
[08:59:56] (03Merged) 10jenkins-bot: db-eqiad.php: More traffic to db1076,db1129 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/509010 (owner: 10Marostegui)
[09:01:00] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: More traffic to db1076 and db1129 (new host on s2) (duration: 00m 56s)
[09:01:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:03:12] PROBLEM - Check systemd state on netmon2001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[09:05:14] (03PS2) 10Ema: ATS: Negative Response Caching [puppet] - 10https://gerrit.wikimedia.org/r/509002
[09:06:31] (03CR) 10Ema: ATS: Negative Response Caching (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/509002 (owner: 10Ema)
[09:08:23] (03PS1) 10Marostegui: db-eqiad.php: More traffic to db1076,db1129 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/509011
[09:08:34] (03CR) 10Muehlenhoff: [C: 03+1] ATS: Negative Response Caching [puppet] - 10https://gerrit.wikimedia.org/r/509002 (owner: 10Ema)
[09:10:17] (03CR) 10Marostegui: [C: 03+2] db-eqiad.php: More traffic to db1076,db1129 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/509011 (owner: 10Marostegui)
[09:10:21] (03CR) 10Ema: [C: 03+2] ATS: Negative Response Caching [puppet] - 10https://gerrit.wikimedia.org/r/509002 (owner: 10Ema)
[09:11:18] (03Merged) 10jenkins-bot: db-eqiad.php: More traffic to db1076,db1129 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/509011 (owner: 10Marostegui)
[09:11:25] (03PS1) 10Jcrespo: mariadb-backups: Setup all metadata sections for daily snapshots [puppet] - 10https://gerrit.wikimedia.org/r/509012 (https://phabricator.wikimedia.org/T206203)
[09:12:22] !log bounce rsyslog on lithium
[09:12:24] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: More traffic to db1076 and db1129 (new host on s2) (duration: 00m 56s)
[09:12:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:12:26] PROBLEM - rsyslog TLS listener on port 6514 on lithium is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/Logs
[09:12:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:12:48] RECOVERY - rsyslog TLS listener on port 6514 on lithium is OK: SSL OK - Certificate lithium.eqiad.wmnet valid until 2021-10-23 19:09:29 +0000 (expires in 898 days) https://wikitech.wikimedia.org/wiki/Logs
[09:14:23] (03PS1) 10Fsero: added initial debianization [debs/helm-diff] (debian/stretch-wikimedia) - 10https://gerrit.wikimedia.org/r/509015
[09:16:04] PROBLEM - swift-container-server on ms-be2017 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.0.137: Connection reset by peer https://wikitech.wikimedia.org/wiki/Swift
[09:16:08] PROBLEM - Check systemd state on ms-be2017 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.0.137: Connection reset by peer
[09:16:10] PROBLEM - swift-object-updater on ms-be2017 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.0.137: Connection reset by peer https://wikitech.wikimedia.org/wiki/Swift
[09:16:12] PROBLEM - swift-container-updater on ms-be2017 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.0.137: Connection reset by peer https://wikitech.wikimedia.org/wiki/Swift
[09:16:17] PROBLEM - MD RAID on ms-be2017 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.0.137: Connection reset by peer https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering
[09:16:32] PROBLEM - swift-object-replicator on ms-be2017 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.0.137: Connection reset by peer https://wikitech.wikimedia.org/wiki/Swift
[09:16:32] PROBLEM - swift-container-replicator on ms-be2017 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.0.137: Connection reset by peer https://wikitech.wikimedia.org/wiki/Swift
[09:17:12] 10Operations, 10ops-eqiad: Bad disk on new host db1133 - https://phabricator.wikimedia.org/T222731 (10Marostegui) {F28979249}
[09:17:20] RECOVERY - swift-container-server on ms-be2017 is OK: PROCS OK: 41 processes with regex args ^/usr/bin/python /usr/bin/swift-container-server https://wikitech.wikimedia.org/wiki/Swift
[09:17:22] RECOVERY - Check systemd state on ms-be2017 is OK: OK - running: The system is fully operational
[09:17:24] RECOVERY - swift-object-updater on ms-be2017 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-object-updater https://wikitech.wikimedia.org/wiki/Swift
[09:17:26] RECOVERY - swift-container-updater on ms-be2017 is OK: PROCS OK: 2 processes with regex args ^/usr/bin/python /usr/bin/swift-container-updater https://wikitech.wikimedia.org/wiki/Swift
[09:17:30] RECOVERY - MD RAID on ms-be2017 is OK: OK: Active: 4, Working: 4, Failed: 0, Spare: 0 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering
[09:17:38] 10Operations, 10ops-eqiad, 10DBA, 10Goal, and 2 others: rack/setup/install db11[26-38].eqiad.wmnet - https://phabricator.wikimedia.org/T211613 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by marostegui on cumin1001.eqiad.wmnet for hosts: ` ['db1133.eqiad.wmnet'] ` The log can be found in `/v...
[09:17:46] RECOVERY - swift-object-replicator on ms-be2017 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-object-replicator https://wikitech.wikimedia.org/wiki/Swift
[09:17:46] RECOVERY - swift-container-replicator on ms-be2017 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-replicator https://wikitech.wikimedia.org/wiki/Swift
[09:18:47] (03PS1) 10Marostegui: db-eqiad.php: Give more traffic to db1129 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/509017
[09:20:10] (03CR) 10Giuseppe Lavagetto: [C: 03+1] "you need this change to be merged and applied on the authdns servers before you switch the record in the DNS. Also I assume conftool objec" [puppet] - 10https://gerrit.wikimedia.org/r/508994 (https://phabricator.wikimedia.org/T221101) (owner: 10Fsero)
[09:27:16] 10Operations, 10ops-eqiad: Bad disk on new host db1133 - https://phabricator.wikimedia.org/T222731 (10Marostegui) @Cmjohnson So I saw this on the idrac: ` /admin1/system1/logs1/log1-> show record44 properties CreationTimestamp = 20190507090745.000000-300 ElementName = System Event Log Entry RecordData...
[09:28:27] (03CR) 10Marostegui: [C: 03+2] db-eqiad.php: Give more traffic to db1129 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/509017 (owner: 10Marostegui)
[09:29:28] (03Merged) 10jenkins-bot: db-eqiad.php: Give more traffic to db1129 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/509017 (owner: 10Marostegui)
[09:29:28] !log fsero@puppetmaster1001 conftool action : set/pooled=true; selector: dnsdisc=docker-registry,name=codfw
[09:29:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:30:25] (03CR) 10Volans: [C: 03+1] "LGTM, thanks for taking care of it" [puppet] - 10https://gerrit.wikimedia.org/r/509009 (https://phabricator.wikimedia.org/T212697) (owner: 10Muehlenhoff)
[09:31:40] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: More traffic to db1129 (new host on s2) (duration: 00m 57s)
[09:31:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:31:48] (03PS3) 10Muehlenhoff: netbox: Mask the default uwsgi service for Netbox [puppet] - 10https://gerrit.wikimedia.org/r/509009 (https://phabricator.wikimedia.org/T212697)
[09:33:06] (03CR) 10Muehlenhoff: [C: 03+2] netbox: Mask the default uwsgi service for Netbox [puppet] - 10https://gerrit.wikimedia.org/r/509009 (https://phabricator.wikimedia.org/T212697) (owner: 10Muehlenhoff)
[09:36:26] RECOVERY - DPKG on netmon2001 is OK: All packages OK
[09:37:36] RECOVERY - puppet last run on netmon2001 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures
[09:46:02] PROBLEM - MariaDB Slave Lag: s1 on db2097 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 49552.37 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_slave
[09:48:32] (03PS1) 10Marostegui: db-eqiad.php: Give API traffic to db1129 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/509024
[09:49:55] (03PS2) 10Marostegui: db-eqiad.php: Give API traffic to db1129 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/509024
[09:50:17] (03PS1) 10ArielGlenn: reduce sleep between adds-changes wiki dumps one more time [dumps] - 10https://gerrit.wikimedia.org/r/509025
[09:51:36] 10Operations, 10SRE-Access-Requests: Requesting access to deloyment for Cormac Parle - https://phabricator.wikimedia.org/T222864 (10Cparle)
[09:53:42] (03CR) 10ArielGlenn: [C: 03+2] reduce sleep between adds-changes wiki dumps one more time [dumps] - 10https://gerrit.wikimedia.org/r/509025 (owner: 10ArielGlenn)
[09:54:17] !log ariel@deploy1001 Started deploy [dumps/dumps@ab56fdd]: reduce sleep time between dumps of adds-changes wikis still more
[09:54:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:54:23] !log ariel@deploy1001 Finished deploy [dumps/dumps@ab56fdd]: reduce sleep time between dumps of adds-changes wikis still more (duration: 00m 06s)
[09:54:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:56:30] PROBLEM - MariaDB Slave Lag: s1 on db1139 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 854.56 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_slave
[09:57:38] !log ariel@deploy1001 Started deploy [dumps/dumps@ab56fdd]: reduce sleep time between dumps of adds-changes wikis still more (retry)
[09:57:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:57:44] !log ariel@deploy1001 Finished deploy [dumps/dumps@ab56fdd]: reduce sleep time between dumps of adds-changes wikis still more (retry) (duration: 00m 06s)
[09:57:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:01:44] RECOVERY - proton endpoints health on proton1001 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/proton
[10:01:53] (03PS1) 10Ema: swift-proxy: add cache_control middleware [puppet] - 10https://gerrit.wikimedia.org/r/509027
[10:03:58] RECOVERY - puppet last run on proton1001 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures
[10:05:48] !log restart proton on proton1001. Host Out of memory T214975
[10:05:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:05:53] T214975: proton experienced a period of high CPU usage, busy queue, lockups - https://phabricator.wikimedia.org/T214975
[10:11:23] 10Operations, 10Traffic, 10Patch-For-Review, 10Performance-Team (Radar): Refactor public-facing DYNA scheme for primary project hostnames in our DNS - https://phabricator.wikimedia.org/T208263 (10lilydjwg) Hi there, thanks for your work but it demonstrates unexpected issues in mainland China. The *.wikipe...
[10:15:00] <_joe_> !log restarting low-traffic pybals in codfw, eqiad
[10:15:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:22:50] 10Operations, 10ops-eqiad, 10DBA, 10Goal, and 2 others: rack/setup/install db11[26-38].eqiad.wmnet - https://phabricator.wikimedia.org/T211613 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['db1133.eqiad.wmnet'] ` Of which those **FAILED**: ` ['db1133.eqiad.wmnet'] `
[10:27:31] (03CR) 10Marostegui: [C: 03+2] db-eqiad.php: Give API traffic to db1129 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/509024 (owner: 10Marostegui)
[10:28:41] (03Merged) 10jenkins-bot: db-eqiad.php: Give API traffic to db1129 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/509024 (owner: 10Marostegui)
[10:29:58] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Give API traffic to db1129 (new host on s2) (duration: 00m 57s)
[10:30:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:35:14] (03PS1) 10Marostegui: mariadb: Provision db1138 in s4 [puppet] - 10https://gerrit.wikimedia.org/r/509032 (https://phabricator.wikimedia.org/T222682)
[10:36:16] (03PS1) 10Marostegui: db-eqiad.php: Depool db1081 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/509033
[10:36:18] (03PS2) 10Marostegui: mariadb: Provision db1138 in s4 [puppet] - 10https://gerrit.wikimedia.org/r/509032 (https://phabricator.wikimedia.org/T222682)
[10:37:29] (03PS3) 10Marostegui: mariadb: Provision db1138 in s4 [puppet] - 10https://gerrit.wikimedia.org/r/509032 (https://phabricator.wikimedia.org/T222682)
[10:37:32] RECOVERY - MariaDB Slave Lag: s1 on db1139 is OK: OK slave_sql_lag Replication lag: 0.31 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_slave
[10:37:43] (03PS4) 10Marostegui: mariadb: Provision db1138 in s4 [puppet] - 10https://gerrit.wikimedia.org/r/509032 (https://phabricator.wikimedia.org/T222682)
[10:38:29] (03CR) 10Marostegui: [C: 03+2] db-eqiad.php: Depool db1081 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/509033 (owner: 10Marostegui)
[10:38:43] (03CR) 10Marostegui: [C: 03+2] mariadb: Provision db1138 in s4 [puppet] - 10https://gerrit.wikimedia.org/r/509032 (https://phabricator.wikimedia.org/T222682) (owner: 10Marostegui)
[10:38:56] (03PS1) 10Jcrespo: transfer.py: Stop slave when calling stop slave [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/509034
[10:39:20] (03PS1) 10Jcrespo: mariadb-backups: Fix bug of stopping slave after the backup [puppet] - 10https://gerrit.wikimedia.org/r/509036 (https://phabricator.wikimedia.org/T206203)
[10:39:29] (03CR) 10jerkins-bot: [V: 04-1] transfer.py: Stop slave when calling stop slave [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/509034 (owner: 10Jcrespo)
[10:39:35] (03Merged) 10jenkins-bot: db-eqiad.php: Depool db1081 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/509033 (owner: 10Marostegui)
[10:40:08] (03CR) 10Marostegui: transfer.py: Stop slave when calling stop slave (032 comments) [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/509034 (owner: 10Jcrespo)
[10:41:25] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Depool db1081 (duration: 00m 57s)
[10:41:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:41:48] (03PS2) 10Jcrespo: transfer.py: Stop slave when calling stop slave [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/509034
[10:41:58] (03PS2) 10Jcrespo: mariadb-backups: Setup all metadata sections for daily snapshots [puppet] - 10https://gerrit.wikimedia.org/r/509012 (https://phabricator.wikimedia.org/T206203)
[10:42:00] (03PS2) 10Jcrespo: mariadb-backups: Fix bug of stopping slave after the backup [puppet] - 10https://gerrit.wikimedia.org/r/509036 (https://phabricator.wikimedia.org/T206203)
[10:42:15] (03CR) 10jerkins-bot: [V: 04-1] transfer.py: Stop slave when calling stop slave [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/509034 (owner: 10Jcrespo)
[10:43:51] !log Stop MySQL on db1081
[10:43:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:44:24] (03CR) 10Jcrespo: [V: 03+2 C: 03+2] transfer.py: Stop slave when calling stop slave [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/509034 (owner: 10Jcrespo)
[10:44:41] (03CR) 10Jcrespo: [C: 03+2] mariadb-backups: Fix bug of stopping slave after the backup [puppet] - 10https://gerrit.wikimedia.org/r/509036 (https://phabricator.wikimedia.org/T206203) (owner: 10Jcrespo)
[10:44:50] (03PS3) 10Jcrespo: mariadb-backups: Fix bug of stopping slave after the backup [puppet] - 10https://gerrit.wikimedia.org/r/509036 (https://phabricator.wikimedia.org/T206203)
[10:50:01] (03PS5) 10Fsero: registryha,lvs: feat(T221101) adding missing discovery data [puppet] - 10https://gerrit.wikimedia.org/r/508994 (https://phabricator.wikimedia.org/T221101)
[10:50:41] (03PS2) 10Jbond: facter3/puppet5: enable puppet5/facter3 eqiad [puppet] - 10https://gerrit.wikimedia.org/r/507303 (https://phabricator.wikimedia.org/T219803)
[10:51:20] (03PS1) 10ArielGlenn: remove ftp link for umu.se at admin request [puppet] - 10https://gerrit.wikimedia.org/r/509037
[10:51:32] (03CR) 10Jbond: [C: 03+2] facter3/puppet5: enable puppet5/facter3 eqiad [puppet] - 10https://gerrit.wikimedia.org/r/507303 (https://phabricator.wikimedia.org/T219803) (owner: 10Jbond)
[10:51:49] (03PS3) 10Jbond: facter3/puppet5: enable puppet5/facter3 eqiad [puppet] - 10https://gerrit.wikimedia.org/r/507303 (https://phabricator.wikimedia.org/T219803)
[10:51:52] (03CR) 10Fsero: [C: 03+2] registryha,lvs: feat(T221101) adding missing discovery data [puppet] - 10https://gerrit.wikimedia.org/r/508994 (https://phabricator.wikimedia.org/T221101) (owner: 10Fsero)
[10:52:04] (03PS6) 10Fsero: registryha,lvs: feat(T221101) adding missing discovery data [puppet] - 10https://gerrit.wikimedia.org/r/508994 (https://phabricator.wikimedia.org/T221101)
[10:54:21] (03PS2) 10ArielGlenn: remove ftp link for umu.se at admin request [puppet] - 10https://gerrit.wikimedia.org/r/509037
[10:55:31] (03CR) 10ArielGlenn: [C: 03+2] remove ftp link for umu.se at admin request [puppet] - 10https://gerrit.wikimedia.org/r/509037 (owner: 10ArielGlenn)
[10:56:26] (03PS7) 10Fsero: registryha,lvs: feat(T221101) adding missing discovery data [puppet] - 10https://gerrit.wikimedia.org/r/508994 (https://phabricator.wikimedia.org/T221101)
[10:59:09] PROBLEM - debmonitor.wikimedia.org on debmonitor1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Debmonitor
[10:59:58] 10Operations: service::uwsgi should mask uwsgi.service - https://phabricator.wikimedia.org/T222874 (10MoritzMuehlenhoff)
[11:00:02] 10Operations: service::uwsgi should mask uwsgi.service - https://phabricator.wikimedia.org/T222874 (10MoritzMuehlenhoff) p:05Triage→03Normal
[11:00:04] MaxSem, RoanKattouw, and Niharika: It is that lovely time of the day again! You are hereby commanded to deploy European Mid-day SWAT(Max 6 patches). (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190509T1100).
[11:00:04] Ammarpad: A patch you scheduled for European Mid-day SWAT(Max 6 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[11:00:11] PROBLEM - DPKG on puppetmaster2001 is CRITICAL: DPKG CRITICAL dpkg reports broken packages
[11:00:11] PROBLEM - puppetmaster https on puppetmaster2001 is CRITICAL: HTTP CRITICAL - Invalid HTTP response received from host on port 8140: HTTP/1.1 404 Not Found https://wikitech.wikimedia.org/wiki/Puppet%23Debugging
[11:00:15] RECOVERY - debmonitor.wikimedia.org on debmonitor1001 is OK: HTTP OK: Status line output matched HTTP/1.1 301 - 274 bytes in 0.002 second response time https://wikitech.wikimedia.org/wiki/Debmonitor
[11:00:23] PROBLEM - puppetmaster backend https on puppetmaster2001 is CRITICAL: HTTP CRITICAL - Invalid HTTP response received from host on port 8141: HTTP/1.1 404 Not Found https://wikitech.wikimedia.org/wiki/Puppet%23Debugging
[11:00:39] PROBLEM - puppet last run on elastic2039 is CRITICAL: CRITICAL: Failed to apply catalog, zero resources tracked by Puppet. It might be a dependency cycle.
[11:00:39] PROBLEM - puppetmaster backend https on puppetmaster2002 is CRITICAL: HTTP CRITICAL - Invalid HTTP response received from host on port 8141: HTTP/1.1 404 Not Found https://wikitech.wikimedia.org/wiki/Puppet%23Debugging
[11:00:45] PROBLEM - puppet last run on elastic2053 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[11:00:45] PROBLEM - puppet last run on db2044 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[11:00:49] PROBLEM - puppet last run on cp4021 is CRITICAL: CRITICAL: Puppet has 40 failures. Last run 3 minutes ago with 40 failures. Failed resources (up to 3 shown)
[11:00:53] PROBLEM - puppet last run on mw2183 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[11:00:53] PROBLEM - puppet last run on mw2152 is CRITICAL: CRITICAL: Failed to apply catalog, zero resources tracked by Puppet. It might be a dependency cycle.
[11:01:33] PROBLEM - puppet last run on mw2240 is CRITICAL: CRITICAL: Failed to apply catalog, zero resources tracked by Puppet. It might be a dependency cycle.
[11:01:39] what's going on?
[11:01:43] PROBLEM - puppet last run on cp4025 is CRITICAL: CRITICAL: Failed to apply catalog, zero resources tracked by Puppet. It might be a dependency cycle.
[11:01:47] PROBLEM - puppet last run on elastic2049 is CRITICAL: CRITICAL: Puppet has 5 failures. Last run 3 minutes ago with 5 failures. Failed resources (up to 3 shown)
[11:01:55] ^^ looking
[11:02:01] PROBLEM - puppet last run on elastic2035 is CRITICAL: CRITICAL: Failed to apply catalog, zero resources tracked by Puppet. It might be a dependency cycle.
[11:02:05] PROBLEM - puppet last run on mw2156 is CRITICAL: CRITICAL: Failed to apply catalog, zero resources tracked by Puppet. It might be a dependency cycle.
[11:02:07] PROBLEM - puppet last run on mw2182 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[11:02:09] PROBLEM - puppet last run on mw2147 is CRITICAL: CRITICAL: Puppet has 74 failures. Last run 4 minutes ago with 74 failures. Failed resources (up to 3 shown)
[11:02:18] PROBLEM - DPKG on puppetmaster2001 is CRITICAL: DPKG CRITICAL dpkg reports broken packages
[11:02:20] jbond42: it seems they are getting 404
[11:02:31] PROBLEM - puppet last run on cp2012 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[11:02:31] PROBLEM - puppet last run on restbase2019 is CRITICAL: CRITICAL: Failed to apply catalog, zero resources tracked by Puppet. It might be a dependency cycle.
[11:02:31] PROBLEM - puppet last run on wtp2010 is CRITICAL: CRITICAL: Failed to apply catalog, zero resources tracked by Puppet. It might be a dependency cycle.
[11:02:31] PROBLEM - puppet last run on db2074 is CRITICAL: CRITICAL: Failed to apply catalog, zero resources tracked by Puppet. It might be a dependency cycle.
[11:02:31] PROBLEM - puppet last run on dbstore2002 is CRITICAL: CRITICAL: Failed to apply catalog, zero resources tracked by Puppet. It might be a dependency cycle. [11:02:31] PROBLEM - puppet last run on mw2265 is CRITICAL: CRITICAL: Failed to apply catalog, zero resources tracked by Puppet. It might be a dependency cycle. [11:02:31] PROBLEM - puppet last run on mw2277 is CRITICAL: CRITICAL: Failed to apply catalog, zero resources tracked by Puppet. It might be a dependency cycle. [11:02:32] PROBLEM - puppet last run on wtp2012 is CRITICAL: CRITICAL: Failed to apply catalog, zero resources tracked by Puppet. It might be a dependency cycle. [11:02:32] PROBLEM - puppet last run on mw2157 is CRITICAL: CRITICAL: Failed to apply catalog, zero resources tracked by Puppet. It might be a dependency cycle. [11:02:33] PROBLEM - puppet last run on mw2248 is CRITICAL: CRITICAL: Puppet has 52 failures. Last run 4 minutes ago with 52 failures. Failed resources (up to 3 shown) [11:02:33] PROBLEM - puppet last run on mw2255 is CRITICAL: CRITICAL: Failed to apply catalog, zero resources tracked by Puppet. It might be a dependency cycle. [11:02:41] PROBLEM - puppet last run on elastic2044 is CRITICAL: CRITICAL: Failed to apply catalog, zero resources tracked by Puppet. It might be a dependency cycle. [11:02:43] PROBLEM - puppet last run on scb2004 is CRITICAL: CRITICAL: Puppet has 39 failures. Last run 4 minutes ago with 39 failures. Failed resources (up to 3 shown) [11:02:47] PROBLEM - puppet last run on ores2007 is CRITICAL: CRITICAL: Failed to apply catalog, zero resources tracked by Puppet. It might be a dependency cycle. [11:02:47] PROBLEM - puppet last run on mw2234 is CRITICAL: CRITICAL: Failed to apply catalog, zero resources tracked by Puppet. It might be a dependency cycle. 
[11:02:55] !log stopped ircecho to avoid spam [11:02:56] <_joe_> uhm something bad happened [11:02:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:03:11] (03PS4) 10Mathew.onipe: prometheus: enable metrics relabel [puppet] - 10https://gerrit.wikimedia.org/r/508809 (https://phabricator.wikimedia.org/T193017) [11:03:21] i updated puppet in codfw to 5 [11:03:50] 10Operations, 10SRE-Access-Requests: Requesting access to deployment for Cormac Parle - https://phabricator.wikimedia.org/T222864 (10Aklapper) [11:03:53] was it supposed to be a temporary interruption or a long one? [11:04:19] the 2001 frontend is logging proxy-server/404 [11:04:23] (03CR) 10Mathew.onipe: prometheus: enable metrics relabel (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/508809 (https://phabricator.wikimedia.org/T193017) (owner: 10Mathew.onipe) [11:04:35] it was not supposed to cause any outage [11:04:37] and the backend is logging the 404 [11:04:40] I get a 404 Not Found and then 405 Method Not Allowed [11:04:50] <_joe_> !log disabling puppet across the fleet [11:04:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:06:02] i think the puppet masters removed passenger when they were updated [11:06:39] jbond42: are you referring to [11:06:39] ii passenger 5.0.30-1+deb9u1 [11:06:39] rc puppet-master-passenger 4.8.2-5 [11:06:40] ? [11:06:52] <_joe_> puppet disabled everywhere [11:06:54] Removing puppet-master (4.8.2-5~bpo8+1) ... [11:06:55] Removing puppet-master-passenger (4.8.2-5~bpo8+1) ... [11:08:26] https://www.irccloud.com/pastebin/frbVGwhe/ [11:08:31] from puppetmaster2001 [11:08:56] jbond42: is it easy to either fix or revert? [11:09:17] <_joe_> I think fsero found the issue [11:09:18] volans: just testing a fix on puppetmaster2001 now [11:09:19] apt.log on puppetmaster2001 shows that dpkg returned an error code, probably some issue with the postinst [11:09:30] <_joe_> maybe the upgrade killed puppetdb? 
[11:09:44] <_joe_> but yes I will check puppetdb [11:09:52] i have run sudo apt-get install facter=2.4.6-1 puppet=4.8.2-5 puppet-master puppet-master-passenger [11:10:02] <_joe_> fsero: can you check eqiad as well? [11:10:04] on puppetmaster2001 and puppet is running ok there now [11:10:06] yes [11:10:45] <_joe_> 2019-05-09 10:57:36,444 INFO [p.t.s.w.jetty9-core] Shutting down web server. [11:10:51] <_joe_> on puppetdb2001 [11:11:04] <_joe_> the service was turned off [11:11:16] eqiad is unaffected [11:11:17] sero@puppetmaster1001:~$ dpkg -l | grep puppet [11:11:17] ii puppet 4.8.2-5 all configuration management system [11:11:17] ii puppet-el 3.8.5-2 all syntax highlighting for puppet manifests in emacs [11:11:17] ii puppet-master 4.8.2-5 all configuration management system, master service [11:11:17] ii puppet-master-passenger 4.8.2-5 all configuration management system, scalable master service [11:11:18] ii puppetdb-termini 4.4.0-1~wmf1 all Termini for puppetdb [11:11:18] ii vim-puppet 3.8.5-2 all syntax highlighting for puppet manifests in vim [11:11:30] fsero: eqiad isn't updated yet [11:11:38] we know moritzm just checking out [11:11:43] <_joe_> May 09 10:57:35 puppetdb2001 systemd[1]: Stopping puppetdb Service... [11:11:45] <_joe_> May 09 10:57:37 puppetdb2001 systemd[1]: Stopped puppetdb Service. 
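The `dpkg -l` pastes above carry the key diagnostic: `ii` marks an installed package, while `rc` marks one that was removed with only its conffiles left behind (the fate of puppet-master-passenger on the codfw masters). A minimal sketch of scanning such output by state; `classify_dpkg` and the sample text are hypothetical illustrations, not a WMF tool:

```python
# Classify lines of `dpkg -l`-style output by package state code.
# "ii" = installed; "rc" = removed, conffiles remain (i.e. the package
# was deinstalled but not purged).

def classify_dpkg(output: str) -> dict:
    """Map dpkg state codes ('ii', 'rc', ...) to lists of package names."""
    states: dict = {}
    for line in output.splitlines():
        parts = line.split()
        # state codes are 2-3 chars; need at least state, name, version
        if len(parts) >= 3 and len(parts[0]) in (2, 3):
            states.setdefault(parts[0], []).append(parts[1])
    return states

sample = """\
ii  puppet                   4.8.2-5       all  configuration management system
rc  puppet-master-passenger  4.8.2-5       all  scalable master service
rc  puppetdb                 4.4.0-1~wmf1  all  Puppet Labs puppetdb
"""
print(classify_dpkg(sample)["rc"])  # → ['puppet-master-passenger', 'puppetdb']
```

Packages in `rc` state explain the symptoms here: the service files and code are gone even though `dpkg -l` still lists the name.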
[11:11:47] the puppet merge only adds the repos, the packages are updated separately [11:11:48] no errors for puppetdb either [11:11:51] <_joe_> fsero: I meant the logs [11:11:55] yeah yeah [11:11:57] <_joe_> ok [11:12:09] <_joe_> moritzm, jbond42 the problem is puppetdb2001 [11:12:16] <_joe_> it is somehow dead [11:12:38] <_joe_> uwsgi-puppetdb-microservice.service is running though [11:12:47] _joe_: the update of puppet on the master caused some master components to be removed [11:13:01] there was Commandline: apt-get install facter=2.4.6-1 puppet=4.8.2-5 on puppetdb2001 too [11:13:04] i have updated (downgraded) them on the masters and they seem to be working again [11:13:08] checking the puppetdb now [11:13:17] volans: that was me just now [11:13:18] <_joe_> so on puppetdb2001 something started puppetdb [11:13:22] <_joe_> then it was shut down [11:13:30] <_joe_> the old puppetdb I mean [11:13:35] <_joe_> not managed by the uwsgi app [11:13:36] fsero@puppetdb2001:~$ dpkg -l | grep puppetdb [11:13:36] rc puppetdb 4.4.0-1~wmf1 all Puppet Labs puppetdb [11:13:42] well i think it was deleted [11:13:49] Removing puppetdb (4.4.0-1~wmf1) [11:14:22] I'll have a look at the apt dependency debugger to see what made it deinstall it [11:14:26] !log ran the following on puppetmaster200{1,2} sudo apt-get install facter=2.4.6-1 puppet=4.8.2-5 puppet-master puppet-master-passenger [11:14:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:14:40] !log ran the following on puppetdb2001 sudo apt-get install facter=2.4.6-1 puppet=4.8.2-5 [11:14:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:15:28] ok im going to reinstall puppetdb on puppetdb2001 [11:15:45] !log run sudo apt-get install puppetdb on puppetdb2001 [11:15:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:16:02] <_joe_> sorry I have some questions about this upgrade. 
Did we test all of our code was compatible with the new puppet version? [11:16:18] <_joe_> every major upgrade for us was a major pain in terms of code changes [11:16:34] <_joe_> I am asking because I didn't follow [11:16:42] let me fix it and we can postmortem after [11:16:47] _joe_: it is a good question but let's focus on fixing it first [11:17:02] <_joe_> no my question regarded the opportunity to rollback or not [11:17:04] jbond42: our puppetdb package has Depends: puppet (<< 5.0.0-1puppetlabs), we need to rebuild it [11:17:22] <_joe_> I guess we need to anyways, ok [11:17:35] jbond42: puppetdb systemd unit is stopped [11:17:54] trying to start now [11:18:02] !log starting puppetdb on puppetdb2001 [11:18:03] fsero: it's starting, but it takes up to a minute... [11:18:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:18:20] the synergy of Java _and_ Clojure [11:18:22] ok puppetdb is running on puppetdb2001 now [11:19:23] !log running sudo apt-get install facter=2.4.6-1 puppet=4.8.2-5 puppet-master puppet-master-passenger on labpuppetmaster1001 [11:19:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:19:32] _joe_: wrt compat of our Puppet code: canaries for all major roles have been running the new puppet 5 for 1-2 weeks [11:19:35] puppetmaster is dead on puppetmaster2001 [11:19:58] <_joe_> moritzm: you mean the puppet client [11:20:15] <_joe_> but I think the puppet package is not disjoint between master and client perfectly [11:20:23] yeah, we're only updating that to 5 in Q [11:20:35] <_joe_> so client at 5 on the puppetmasters also means the server will be moved [11:20:43] <_joe_> unless things have changed lately [11:20:50] same for puppetdb [11:20:54] <_joe_> so you need to pin puppetmasters to the old version [11:21:30] yeah, the pin to the new version only affects the client packages, the puppetmaster packages are not part of the component that gets included [11:22:09] <_joe_> still [11:22:21] <_joe_> 
the puppetmaster uses resources from the client package [11:23:23] !log running sudo apt-get install puppet-master=4.8.2-5~bpo8+1 puppet-master-passenger=4.8.2-5~bpo8+1 on labtestpuppetmaster2001 [11:23:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:23:39] https://etherpad.wikimedia.org/p/incident-20190509-codfwpuppetmasterdown [11:24:58] <_joe_> thanks fsero [11:29:38] (03PS1) 10Jbond: puppet5/facter3: ensure puppet master infrastructre is not upgraded [puppet] - 10https://gerrit.wikimedia.org/r/509040 (https://phabricator.wikimedia.org/T219803) [11:30:19] could i get some review here ^^ this should ensure the puppet master stuff remains on puppet 4 and facter 2 until we can look at rebuilding packages or further dig into the issue [11:30:51] (03CR) 10Muehlenhoff: puppet5/facter3: ensure puppet master infrastructre is not upgraded (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/509040 (https://phabricator.wikimedia.org/T219803) (owner: 10Jbond) [11:31:15] jbond42: it's missing rhodium [11:31:32] let's better go by the roles [11:31:40] <_joe_> yes [11:32:26] <_joe_> role::puppetmaster::{frontend,backend} and role::puppetdb IIRC [11:32:33] puppetmaster::backend, puppetmaster::frontend, wmcs::openstack::eqiad1::puppetmaster::backend, wmcs::openstack::eqiad1::puppetmaster::frontend [11:32:48] and puppetmaster::puppetdb [11:32:55] the problem with the role is that it is the last thing in the hierarchy, so if a parameter is defined anywhere else in the hierarchy then it will not be read [11:33:03] <_joe_> oh right [11:33:12] <_joe_> and you have them defined in common right now [11:33:14] <_joe_> sigh [11:33:17] yes [11:33:17] <_joe_> go this way [11:33:31] <_joe_> you're missing some of the labs puppetmasters and rhodium [11:33:38] <_joe_> also lemme check one more thing [11:34:03] <_joe_> for puppetdb there is a regex already [11:34:14] <_joe_> puppetdb4_servers: [11:34:15] labpuppetmaster1001, labpuppetmaster1002, rhodium and 
labtestpuppetmaster2001 we need [11:34:18] <_joe_> sigh, why. [11:34:36] <_joe_> and also [11:34:38] <_joe_> puppetmaster_puppetdb4: [11:34:44] <_joe_> so use regex.yaml jbond42 [11:35:05] <_joe_> rhodium_puppetdb4 too, why [11:35:17] * _joe_ closes editor [11:35:58] <_joe_> jbond42: anyways, regex.yaml is the right place [11:36:06] <_joe_> we already have stanzas there [11:41:06] (03PS1) 10Jbond: puppet5/facter3: ensure puppet master infrastructre is not upgraded [puppet] - 10https://gerrit.wikimedia.org/r/509042 (https://phabricator.wikimedia.org/T219803) [11:41:14] ok pushing a change now. the labs ones are ok for now as we have the parameters in labs.yaml [11:42:39] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/509042 (https://phabricator.wikimedia.org/T219803) (owner: 10Jbond) [11:43:25] jbond42: after you roll out changes i would really appreciate your collaboration in the incident report [11:43:32] ping if you need help [11:43:49] im around [11:43:49] incident report is not strictly required as this isn't a user affecting outage [11:43:51] fsero: yes of course once i am done i will go through and fill out anything missing [11:44:03] <_joe_> jbond42: what about the labs masters? [11:44:03] but yeah, it is still a good idea to do so [11:44:05] mark any failure is an opportunity to learn and improve [11:44:10] and this is serious [11:44:11] sorry [11:44:13] not that serious [11:44:21] _joe_: those settings are already in labs.yaml [11:44:25] but yes, of course it is an opportunity to learn and improve [11:44:43] <_joe_> jbond42: but those puppetmasters are in the labs realm? [11:44:49] <_joe_> I didn't remember [11:44:51] <_joe_> sorry then [11:45:26] actually you're right, for some reason labtestpuppetmaster was updated. the other two were not. 
either way ill update to have a regex for the labs ones [11:45:32] (central) labs puppetmasters themselves are currently in the prod realm [11:45:38] and also don't use puppetdb [11:45:50] <_joe_> Krenair: ok so my recollection is correct [11:46:01] <_joe_> jbond42: they're in the prod realm so labs.yaml doesn't affect them [11:46:16] ok updating now thanks [11:46:20] (there is a project to move them into the labs realm but it's not done yet) [11:46:33] the labs masters are all in eqiad, the one in codfw is just for testing [11:46:43] <_joe_> yeah I wasn't sure where that project was at :P [11:47:05] moritzm: ahh ok that's why the others didn't get updated [11:49:04] !log updated netbox statuses for decommissioning and spare hosts according to T222352 [11:49:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:49:08] T222352: Server Lifecycle: re-arrage statuses including a decommissioning one - https://phabricator.wikimedia.org/T222352 [11:49:34] (03PS2) 10Jbond: puppet5/facter3: ensure puppet master infrastructre is not upgraded [puppet] - 10https://gerrit.wikimedia.org/r/509042 (https://phabricator.wikimedia.org/T219803) [11:49:53] _joe_, moritzm can you take another pass ^^ [11:49:58] looking [11:51:07] (03CR) 10Giuseppe Lavagetto: [C: 03+1] "nitpick (typo), otherwise lgtm." 
(031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/509042 (https://phabricator.wikimedia.org/T219803) (owner: 10Jbond) [11:51:18] (03CR) 10Muehlenhoff: puppet5/facter3: ensure puppet master infrastructre is not upgraded (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/509042 (https://phabricator.wikimedia.org/T219803) (owner: 10Jbond) [11:51:37] <_joe_> moritzm: lol [11:51:47] <_joe_> we made the exact same comment [11:52:19] :-) [11:52:28] (03PS3) 10Jbond: puppet5/facter3: ensure puppet master infrastructre is not upgraded [puppet] - 10https://gerrit.wikimedia.org/r/509042 (https://phabricator.wikimedia.org/T219803) [11:52:33] ok i have updated the typo now thanks [11:52:39] * mark draws the preliminary management conclusion that we need only one of you two ;-) [11:52:43] will push once CI has done [11:53:13] (03CR) 10Jbond: [C: 03+2] puppet5/facter3: ensure puppet master infrastructre is not upgraded [puppet] - 10https://gerrit.wikimedia.org/r/509042 (https://phabricator.wikimedia.org/T219803) (owner: 10Jbond) [11:53:16] mark: think of the "eliminate bus factor" KD :-) [11:53:23] right right [11:53:37] maybe just not active/active then [11:53:59] * volans pre-links https://wikitech.wikimedia.org/wiki/Cumin#Run_Puppet_only_if_last_run_failed [11:55:15] !log clean up old source files sudo cumin A:puppetmaster 'rm /etc/apt/sources.list.d/component-facter3.list /etc/apt/sources.list.d/component-puppet5.list' [11:55:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:56:04] jbond42: might need to do that on the lab*puppetmaster* too [11:56:07] * _joe_ goes offline [11:56:33] and yes, that's a valid cumin query too :D [11:56:53] volans: thanks running now [11:56:54] (03PS4) 10Volans: Add decommissioning status support to reports [software/netbox-reports] - 10https://gerrit.wikimedia.org/r/508951 (https://phabricator.wikimedia.org/T222353) (owner: 10CRusnov) [11:57:31] <_joe_> lmk when I can reenable puppet across the fleet 
btw [11:57:50] !log all puppetmasters and puppetdbs should be restored [11:57:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:57:55] same for icinga bot [11:58:01] i think we should be good now [11:58:12] <_joe_> ok lemme test [11:58:26] puppet-master.service on 2001 is still down [11:58:41] <_joe_> moritzm: that's good [11:58:49] <_joe_> that needs to be down [11:58:54] <_joe_> it's the standalone puppetmaster [11:59:02] <_joe_> it can work up to like 10 clients [11:59:36] same on 2002, it needs a clear failed? [11:59:37] but it should be something like "active exited", not "active-failed", right? [11:59:40] puppet-master.service loaded failed failed [12:00:01] <_joe_> it might have tried to start when the package was reinstalled, yes [12:00:04] Deploy window Pre MediaWiki train sanity break (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190509T1200) [12:00:05] Could not run: Address already in use - listen(2) [12:00:11] <_joe_> yes [12:00:14] <_joe_> that's it [12:00:24] (03CR) 10Volans: [C: 03+2] Add decommissioning status support to reports [software/netbox-reports] - 10https://gerrit.wikimedia.org/r/508951 (https://phabricator.wikimedia.org/T222353) (owner: 10CRusnov) [12:00:33] <_joe_> ok [12:00:55] I'll restart puppet-master on 2001, then? [12:01:00] <_joe_> no [12:01:01] port is held by apache which in turn spins up the passenger process [12:01:07] <_joe_> it needs to be stopped [12:01:11] <_joe_> passenger is running [12:01:14] ok [12:01:28] i think we can just clear the systemd errors tbh [12:01:30] <_joe_> !log reenabling puppet across the fleet [12:01:31] so just reset-failed? 
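The "Could not run: Address already in use - listen(2)" error above is why the standalone puppet-master.service stayed failed: apache/passenger already held the port, so restarting the unit would only fail again and the right move was `systemctl reset-failed` plus disabling it. A generic Python sketch of the same errno with two plain sockets, unrelated to the actual hosts:

```python
# Demonstrate EADDRINUSE: bind() fails when another socket already
# listens on the port, exactly like the puppet-master start above
# failing because passenger held port 8140.
import errno
import socket

def bind_errno(port: int):
    """Try to bind a TCP socket to `port`; return the errno on failure, None on success."""
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        s.bind(("127.0.0.1", port))
        return None
    except OSError as e:
        return e.errno
    finally:
        s.close()

listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
listener.bind(("127.0.0.1", 0))   # port 0: kernel picks a free port
listener.listen(1)
busy_port = listener.getsockname()[1]

print(bind_errno(busy_port) == errno.EADDRINUSE)  # → True
listener.close()
```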
[12:01:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:01:34] <_joe_> jbond42: +1 [12:01:45] <_joe_> and then disable the service too [12:01:49] <_joe_> which I thought was in puppet [12:02:25] I ran systemctl reset-failed on 2001 [12:03:03] jbond42: I've built an updated puppetdb package for stretch-wikimedia with the correct dependencies, we can test this in labs later [12:03:19] moritzm: awesome thanks [12:05:28] puppet run worked fine again on mw2164 [12:12:19] good job jbond42 [12:12:26] volans: reenable ircbot? [12:13:03] fsero: depends if we want the whole spam of recovery or not [12:13:06] I'm tailing the irc.log [12:13:21] recovery spam is reassuring :P [12:13:22] is my understanding correct that we're not forcing a puppet run? [12:13:32] it can ban icinga-wm from irc though [12:13:36] kick more than ban [12:14:30] so it will take 30m [12:14:32] i assume we are not forcing a puppet run and if there are consequences we will notice after [12:14:37] any reason why not speeding up the process? [12:15:06] <_joe_> I still see a lot of failures though [12:15:31] which hosts are you seeing failures still? [12:15:54] <_joe_> ha nevermind [12:15:59] <_joe_> icinga tricked me [12:16:04] yeah me too [12:16:11] <_joe_> it says "puppet ran 6 minutes ago" [12:16:12] ok :-) [12:16:21] <_joe_> yeah, at the time of first detection :P [12:16:55] <_joe_> every time a puppet 5 run happens, that yellow line reminds me of my sad attempts at implementing PSON in python [12:17:39] RECOVERY - puppet last run on mw2135 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [12:17:45] is puppet enabled everywhere again now? 
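The interim fix above pins the master-side packages to Puppet 4 while clients move to the puppet5 component; the merged change (gerrit 509042) did this via Hiera/regex.yaml, so the fragment below is only an illustrative sketch of the equivalent apt_preferences approach, with a hypothetical file path and the versions quoted in the log:

```text
# /etc/apt/preferences.d/puppet-master-pin  (hypothetical path; the
# actual change was made in hieradata/regex.yaml, not via apt pinning)
Package: puppet puppet-master puppet-master-passenger
Pin: version 4.8.2-5*
Pin-Priority: 1001

Package: facter
Pin: version 2.4.6-1
Pin-Priority: 1001
```

A priority above 1000 makes apt prefer these versions even over newer installed candidates, which is what "ensure puppet master infrastructure is not upgraded" needs until the puppetdb package is rebuilt.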
[12:17:56] I guess puppet has run on icinga and re-enabled icinga-wm [12:17:57] :D [12:18:07] we'll get the recovery slow spam [12:18:22] ok cool thanks all :) [12:18:34] fsero: will go through the report now [12:18:41] I'm running "run-puppet-agent -q --failed-only" on remaining hosts to speed things up [12:18:54] jbond42: take your time, we are not in a hurry. [12:20:33] RECOVERY - puppet last run on mw2141 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [12:20:41] 10Operations: service::uwsgi should mask uwsgi.service - https://phabricator.wikimedia.org/T222874 (10Volans) Debmonitor too is using uwsgi fwiw. +1 for me to do it globally if we don't have special cases. [12:20:45] RECOVERY - puppet last run on db2044 is OK: OK: Puppet is currently enabled, last run 29 seconds ago with 0 failures [12:20:46] (03PS2) 10Ema: swift-proxy: add cache_control middleware [puppet] - 10https://gerrit.wikimedia.org/r/509027 [12:20:51] RECOVERY - puppet last run on mw2183 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [12:20:51] RECOVERY - puppet last run on cp4021 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [12:21:00] * volans bbiab [12:21:53] RECOVERY - puppet last run on mw2182 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [12:22:09] RECOVERY - puppet last run on elastic2049 is OK: OK: Puppet is currently enabled, last run 18 seconds ago with 0 failures [12:22:24] RECOVERY - puppet last run on mw2248 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [12:22:41] RECOVERY - puppet last run on mw2195 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [12:22:51] RECOVERY - puppet last run on mw2202 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [12:23:07] RECOVERY - puppet last run on db2045 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures 
[12:23:09] RECOVERY - puppet last run on db2042 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [12:23:31] RECOVERY - puppet last run on mw2212 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [12:24:47] PROBLEM - puppet last run on multatuli is CRITICAL: CRITICAL: Failed to apply catalog, zero resources tracked by Puppet. It might be a dependency cycle. [12:25:01] PROBLEM - HHVM jobrunner on mw1299 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 473 bytes in 0.001 second response time https://wikitech.wikimedia.org/wiki/Jobrunner [12:25:51] RECOVERY - puppet last run on elastic2053 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [12:26:09] RECOVERY - HHVM jobrunner on mw1299 is OK: HTTP OK: HTTP/1.1 200 OK - 270 bytes in 0.008 second response time https://wikitech.wikimedia.org/wiki/Jobrunner [12:27:27] RECOVERY - puppet last run on cp2012 is OK: OK: Puppet is currently enabled, last run 23 seconds ago with 0 failures [12:27:45] RECOVERY - puppet last run on scb2004 is OK: OK: Puppet is currently enabled, last run 53 seconds ago with 0 failures [12:27:51] RECOVERY - puppet last run on ores2007 is OK: OK: Puppet is currently enabled, last run 6 seconds ago with 0 failures [12:28:03] RECOVERY - puppet last run on restbase2016 is OK: OK: Puppet is currently enabled, last run 1 second ago with 0 failures [12:28:36] (03PS3) 10Ema: swift-proxy: add cache_control middleware [puppet] - 10https://gerrit.wikimedia.org/r/509027 [12:29:03] RECOVERY - puppet last run on mw2288 is OK: OK: Puppet is currently enabled, last run 56 seconds ago with 0 failures [12:29:37] RECOVERY - puppet last run on wtp2007 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [12:29:43] RECOVERY - puppet last run on labstore2003 is OK: OK: Puppet is currently enabled, last run 31 seconds ago with 0 failures [12:29:47] RECOVERY - puppet last run on elastic2030 is OK: 
OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [12:29:47] RECOVERY - puppet last run on es2002 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [12:29:50] (03CR) 10Alexandros Kosiaris: [C: 03+1] "sure" [puppet] - 10https://gerrit.wikimedia.org/r/508582 (https://phabricator.wikimedia.org/T218346) (owner: 10Ottomata) [12:29:53] RECOVERY - puppet last run on es2017 is OK: OK: Puppet is currently enabled, last run 32 seconds ago with 0 failures [12:30:09] RECOVERY - puppet last run on ores2006 is OK: OK: Puppet is currently enabled, last run 35 seconds ago with 0 failures [12:30:29] RECOVERY - puppet last run on ms-be2046 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [12:30:29] RECOVERY - puppet last run on ms-fe2007 is OK: OK: Puppet is currently enabled, last run 18 seconds ago with 0 failures [12:30:29] RECOVERY - puppet last run on db2035 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [12:30:47] RECOVERY - puppet last run on ms-be2032 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [12:30:47] RECOVERY - puppet last run on pybal-test2003 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [12:30:49] RECOVERY - puppet last run on elastic2039 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [12:30:57] RECOVERY - puppet last run on acrux is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [12:31:07] (03CR) 10Alexandros Kosiaris: [C: 03+1] registryha: feat(T221101) swifting traffic to new registry [dns] - 10https://gerrit.wikimedia.org/r/508996 (https://phabricator.wikimedia.org/T221101) (owner: 10Fsero) [12:31:07] RECOVERY - puppet last run on mw2225 is OK: OK: Puppet is currently enabled, last run 11 seconds ago with 0 failures [12:31:07] RECOVERY - puppet last run on mw2152 is OK: OK: Puppet is currently enabled, last run 3 minutes ago 
with 0 failures [12:31:07] RECOVERY - puppet last run on ms-be2021 is OK: OK: Puppet is currently enabled, last run 1 second ago with 0 failures [12:31:11] RECOVERY - puppet last run on mw2280 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [12:31:17] RECOVERY - puppet last run on mw2137 is OK: OK: Puppet is currently enabled, last run 48 seconds ago with 0 failures [12:31:23] RECOVERY - puppet last run on vega is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [12:31:31] RECOVERY - puppet last run on releases2001 is OK: OK: Puppet is currently enabled, last run 58 seconds ago with 0 failures [12:32:05] RECOVERY - puppet last run on elastic2035 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [12:32:11] RECOVERY - puppet last run on mw2156 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [12:32:13] RECOVERY - puppet last run on mw2147 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [12:32:21] RECOVERY - puppet last run on kafka2002 is OK: OK: Puppet is currently enabled, last run 33 seconds ago with 0 failures [12:32:21] RECOVERY - puppet last run on db2039 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [12:32:27] RECOVERY - puppet last run on proton2002 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [12:32:27] RECOVERY - puppet last run on scb2003 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [12:32:33] RECOVERY - puppet last run on mw2144 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [12:32:39] RECOVERY - puppet last run on db2074 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [12:32:39] RECOVERY - puppet last run on dbstore2002 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [12:32:39] RECOVERY - puppet last run on mw2157 is 
OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [12:32:41] RECOVERY - puppet last run on restbase2019 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [12:32:41] RECOVERY - puppet last run on wtp2010 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [12:32:41] RECOVERY - puppet last run on wtp2012 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [12:32:41] RECOVERY - puppet last run on mw2265 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [12:32:41] RECOVERY - puppet last run on mw2277 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [12:32:45] RECOVERY - puppet last run on mw2255 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [12:32:47] RECOVERY - puppet last run on cloudnet2002-dev is OK: OK: Puppet is currently enabled, last run 27 seconds ago with 0 failures [12:32:47] RECOVERY - puppet last run on db2055 is OK: OK: Puppet is currently enabled, last run 27 seconds ago with 0 failures [12:32:49] RECOVERY - puppet last run on elastic2044 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [12:32:57] RECOVERY - puppet last run on ores2002 is OK: OK: Puppet is currently enabled, last run 58 seconds ago with 0 failures [12:32:59] RECOVERY - puppet last run on cp5003 is OK: OK: Puppet is currently enabled, last run 54 seconds ago with 0 failures [12:33:03] RECOVERY - puppet last run on mw2234 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [12:33:07] RECOVERY - puppet last run on wtp2004 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [12:33:11] RECOVERY - puppet last run on mc2019 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [12:33:13] RECOVERY - puppet last run on cp4028 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 
0 failures [12:33:13] RECOVERY - puppet last run on es2004 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [12:33:15] RECOVERY - puppet last run on mw2290 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [12:33:15] RECOVERY - puppet last run on mw2176 is OK: OK: Puppet is currently enabled, last run 53 seconds ago with 0 failures [12:33:15] RECOVERY - puppet last run on lvs4006 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [12:33:21] RECOVERY - puppet last run on wtp2005 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [12:33:37] RECOVERY - puppet last run on sessionstore2001 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [12:33:37] RECOVERY - puppet last run on rdb2005 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [12:33:37] RECOVERY - puppet last run on mw2223 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [12:33:37] RECOVERY - puppet last run on puppetmaster2002 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [12:33:47] RECOVERY - puppet last run on db2099 is OK: OK: Puppet is currently enabled, last run 39 seconds ago with 0 failures [12:33:47] RECOVERY - puppet last run on lvs2004 is OK: OK: Puppet is currently enabled, last run 15 seconds ago with 0 failures [12:33:51] RECOVERY - puppet last run on mw2214 is OK: OK: Puppet is currently enabled, last run 3 seconds ago with 0 failures [12:33:57] RECOVERY - puppet last run on cp5006 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [12:33:59] RECOVERY - puppet last run on ms-be2047 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [12:34:03] RECOVERY - puppet last run on logstash2006 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [12:34:03] RECOVERY - puppet last run on ping2001 
is OK: OK: Puppet is currently enabled, last run 18 seconds ago with 0 failures [12:34:03] RECOVERY - puppet last run on cloudnet2003-dev is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [12:34:13] RECOVERY - puppet last run on db2095 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [12:34:13] RECOVERY - puppet last run on mw2256 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [12:34:13] RECOVERY - puppet last run on mw2287 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [12:34:13] RECOVERY - puppet last run on mw2274 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [12:34:19] RECOVERY - puppet last run on ms-fe2005 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [12:34:21] RECOVERY - puppet last run on mw2245 is OK: OK: Puppet is currently enabled, last run 49 seconds ago with 0 failures [12:34:21] RECOVERY - puppet last run on mw2231 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [12:34:23] RECOVERY - puppet last run on mw2162 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [12:34:24] RECOVERY - puppet last run on mw2172 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [12:34:24] RECOVERY - puppet last run on mw2169 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [12:34:24] RECOVERY - puppet last run on ms-be2023 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [12:34:31] RECOVERY - puppet last run on mw2266 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [12:34:31] RECOVERY - puppet last run on mw2251 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [12:34:31] RECOVERY - puppet last run on mw2252 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures 
[12:34:31] RECOVERY - puppet last run on mw2233 is OK: OK: Puppet is currently enabled, last run 57 seconds ago with 0 failures [12:34:39] RECOVERY - puppet last run on lvs2003 is OK: OK: Puppet is currently enabled, last run 5 minutes ago with 0 failures [12:34:39] RECOVERY - puppet last run on cp2005 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [12:34:39] RECOVERY - puppet last run on db2036 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [12:34:39] RECOVERY - puppet last run on mw2253 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [12:34:45] RECOVERY - puppet last run on restbase1012 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [12:34:47] RECOVERY - puppet last run on cp5012 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [12:34:51] RECOVERY - puppet last run on pc2010 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [12:34:59] RECOVERY - puppet last run on elastic2048 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [12:34:59] RECOVERY - puppet last run on furud is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [12:34:59] RECOVERY - puppet last run on elastic2033 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [12:35:01] RECOVERY - puppet last run on db2081 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [12:35:01] RECOVERY - puppet last run on pybal-test2002 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [12:35:01] RECOVERY - puppet last run on ores2009 is OK: OK: Puppet is currently enabled, last run 37 seconds ago with 0 failures [12:35:07] RECOVERY - puppet last run on mc2022 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [12:35:15] RECOVERY - puppet last run on multatuli is OK: OK: Puppet 
is currently enabled, last run 4 minutes ago with 0 failures [12:35:27] RECOVERY - puppet last run on cp2011 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [12:35:37] RECOVERY - puppet last run on mw2207 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [12:35:47] RECOVERY - puppet last run on ores2008 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [12:35:53] RECOVERY - puppet last run on lvs2006 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [12:36:01] RECOVERY - puppet last run on heze is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [12:36:13] RECOVERY - puppet last run on ms-be2050 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [12:36:17] RECOVERY - puppet last run on pc2009 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [12:36:23] RECOVERY - puppet last run on db2089 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [12:36:27] RECOVERY - puppet last run on wtp2006 is OK: OK: Puppet is currently enabled, last run 5 minutes ago with 0 failures [12:36:27] RECOVERY - puppet last run on kubernetes2005 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [12:36:37] RECOVERY - puppet last run on maps2004 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [12:36:37] RECOVERY - puppet last run on mw2177 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [12:36:39] RECOVERY - puppet last run on mw2220 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [12:36:43] RECOVERY - puppet last run on ms-be2017 is OK: OK: Puppet is currently enabled, last run 5 minutes ago with 0 failures [12:36:47] RECOVERY - puppet last run on db2059 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [12:37:16] 
10Operations, 10Puppet, 10Packaging: update puppetdb and puppet-master packages to be compatible with puppet5 - https://phabricator.wikimedia.org/T222879 (10jbond) p:05Triage→03Normal [12:37:27] RECOVERY - puppet last run on mw2244 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [12:37:27] RECOVERY - puppet last run on mw2186 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [12:37:37] RECOVERY - puppet last run on db2116 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [12:37:37] RECOVERY - puppet last run on mc2023 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [12:37:37] RECOVERY - puppet last run on thumbor2003 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [12:37:37] RECOVERY - puppet last run on mw2201 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [12:37:43] RECOVERY - puppet last run on kubetcd2001 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [12:38:03] RECOVERY - puppet last run on elastic2050 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [12:38:03] RECOVERY - puppet last run on db2071 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [12:38:04] RECOVERY - puppet last run on mw2246 is OK: OK: Puppet is currently enabled, last run 5 minutes ago with 0 failures [12:38:31] RECOVERY - puppet last run on mw2218 is OK: OK: Puppet is currently enabled, last run 5 minutes ago with 0 failures [12:38:33] RECOVERY - puppet last run on db2072 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [12:39:37] RECOVERY - puppet last run on db2069 is OK: OK: Puppet is currently enabled, last run 5 minutes ago with 0 failures [12:39:47] RECOVERY - puppet last run on ms-be2030 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [12:39:53] RECOVERY 
- puppet last run on mw2160 is OK: OK: Puppet is currently enabled, last run 5 minutes ago with 0 failures [12:40:23] RECOVERY - puppet last run on cp4025 is OK: OK: Puppet is currently enabled, last run 12 minutes ago with 0 failures [12:42:32] 10Operations, 10Puppet, 10Packaging: update puppetdb and puppet-master packages to be compatible with puppet5 - https://phabricator.wikimedia.org/T222879 (10MoritzMuehlenhoff) An updated puppetdb package has been built on boron with the following diff: ` diff -Nru puppetdb-4.4.0/debian/changelog puppetdb-... [12:45:36] (03PS1) 10Fsero: registryha: moving docker-registry.w.o to new registry [puppet] - 10https://gerrit.wikimedia.org/r/509049 (https://phabricator.wikimedia.org/T221101) [12:49:00] !log switching traffic from old-registry to new registries registry[12]00[12] - T221101 [12:49:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:49:04] T221101: migrate endpoint from old registry instance to new one - https://phabricator.wikimedia.org/T221101 [12:50:06] (03CR) 10Fsero: [C: 03+2] registryha: feat(T221101) swifting traffic to new registry [dns] - 10https://gerrit.wikimedia.org/r/508996 (https://phabricator.wikimedia.org/T221101) (owner: 10Fsero) [12:53:00] (03PS1) 10Filippo Giunchedi: prometheus: remove v2 feature flag [puppet] - 10https://gerrit.wikimedia.org/r/509052 (https://phabricator.wikimedia.org/T187987) [13:00:05] Deploy window MediaWiki train - European version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190509T1300) [13:12:12] (03PS4) 10Paladox: Add prometheus server for gerrit javamelody monitoring [puppet] - 10https://gerrit.wikimedia.org/r/508952 (https://phabricator.wikimedia.org/T184086) [13:13:55] !log running authdns-update for new docker-registry T221101 [13:13:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:14:00] T221101: migrate endpoint from old registry instance to new one - 
https://phabricator.wikimedia.org/T221101 [13:15:42] (03CR) 10Ema: registryha: moving docker-registry.w.o to new registry (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/509049 (https://phabricator.wikimedia.org/T221101) (owner: 10Fsero) [13:16:08] 10Operations, 10Puppet, 10Packaging: update puppetdb and puppet-master packages to be compatible with puppet5 - https://phabricator.wikimedia.org/T222879 (10MoritzMuehlenhoff) I found the second bug which made update uninstall the puppetmaster packages on puppetmaster2001: "puppet-master-passenger" and "pupp... [13:22:26] (03PS1) 10Ema: upload-frontend: unset Cache-Control [puppet] - 10https://gerrit.wikimedia.org/r/509053 [13:23:07] (03PS2) 10Ema: upload-frontend: unset Cache-Control [puppet] - 10https://gerrit.wikimedia.org/r/509053 [13:27:17] (03PS1) 10BBlack: Create dyna.wikimedia.org for text-addrs target [dns] - 10https://gerrit.wikimedia.org/r/509055 (https://phabricator.wikimedia.org/T208263) [13:27:19] (03PS1) 10BBlack: Switch CNAME->DYNA scheme to dyna.wikimedia.org [dns] - 10https://gerrit.wikimedia.org/r/509056 (https://phabricator.wikimedia.org/T208263) [13:27:21] (03PS2) 10Fsero: registryha: moving docker-registry.w.o to new registry [puppet] - 10https://gerrit.wikimedia.org/r/509049 (https://phabricator.wikimedia.org/T221101) [13:29:24] 10Operations, 10Wikimedia-Mailing-lists, 10Patch-For-Review, 10Security: Disable list subscription via email also for listname-subscribe@ - https://phabricator.wikimedia.org/T219107 (10jbond) > Can this task be made public (via "Edit Task > Visible To")? 
done [13:29:48] (03CR) 10Fsero: "PCC seems happy https://puppet-compiler.wmflabs.org/compiler1001/16439/" [puppet] - 10https://gerrit.wikimedia.org/r/509049 (https://phabricator.wikimedia.org/T221101) (owner: 10Fsero) [13:29:58] (03PS1) 10BBlack: Undo "www.wikipedia.org" direct DYNA [dns] - 10https://gerrit.wikimedia.org/r/509057 (https://phabricator.wikimedia.org/T208263) [13:30:45] (03CR) 10BBlack: [C: 03+2] Create dyna.wikimedia.org for text-addrs target [dns] - 10https://gerrit.wikimedia.org/r/509055 (https://phabricator.wikimedia.org/T208263) (owner: 10BBlack) [13:31:34] (03PS1) 10Ema: varnish: add shell wrapper to run VTC tests [puppet] - 10https://gerrit.wikimedia.org/r/509058 [13:33:02] 10Operations, 10Puppet, 10Packaging: update puppetdb and puppet-master packages to be compatible with puppet5 - https://phabricator.wikimedia.org/T222879 (10MoritzMuehlenhoff) >>! In T222879#5169990, @MoritzMuehlenhoff wrote: > I found the second bug which made update uninstall the puppetmaster packages on p... [13:34:27] !log recdns: wiping dyna.wikimedia.org from pdns-recursors [13:34:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:35:11] ottomata: here ? can you join #wikimedia-sre ? 
I wanted to chat re: https://gerrit.wikimedia.org/r/c/operations/debs/prometheus-varnishkafka-exporter/+/507632 [13:37:13] 10Operations, 10LDAP-Access-Requests, 10WMF-Legal, 10WMF-NDA-Requests: Request to be added to the ldap/wmde group - https://phabricator.wikimedia.org/T222788 (10jbond) Hi @RStallman-legalteam can you confirm NDA status please [13:37:22] 10Operations, 10LDAP-Access-Requests, 10WMF-Legal, 10WMF-NDA-Requests: Request to be added to the ldap/wmde group - https://phabricator.wikimedia.org/T222788 (10jbond) p:05Triage→03Normal [13:40:15] (03CR) 10CDanis: [C: 03+1] Switch CNAME->DYNA scheme to dyna.wikimedia.org [dns] - 10https://gerrit.wikimedia.org/r/509056 (https://phabricator.wikimedia.org/T208263) (owner: 10BBlack) [13:41:18] (03CR) 10Filippo Giunchedi: "LGTM, see also inline" (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/509027 (owner: 10Ema) [13:41:34] (03PS3) 10Fsero: registryha: moving docker-registry.w.o to new registry [puppet] - 10https://gerrit.wikimedia.org/r/509049 (https://phabricator.wikimedia.org/T221101) [13:41:36] (03CR) 10Volans: [C: 03+1] "LGTM" [dns] - 10https://gerrit.wikimedia.org/r/509056 (https://phabricator.wikimedia.org/T208263) (owner: 10BBlack) [13:42:17] (03CR) 10BBlack: [C: 03+2] Switch CNAME->DYNA scheme to dyna.wikimedia.org [dns] - 10https://gerrit.wikimedia.org/r/509056 (https://phabricator.wikimedia.org/T208263) (owner: 10BBlack) [13:42:37] (03PS1) 10Elukey: ores::web: allow uwsgi to dump a core after a segfault [puppet] - 10https://gerrit.wikimedia.org/r/509059 (https://phabricator.wikimedia.org/T222866) [13:43:10] (03CR) 10jerkins-bot: [V: 04-1] ores::web: allow uwsgi to dump a core after a segfault [puppet] - 10https://gerrit.wikimedia.org/r/509059 (https://phabricator.wikimedia.org/T222866) (owner: 10Elukey) [13:43:14] 10Operations, 10SRE-Access-Requests: Requesting access to deployment for Cormac Parle - https://phabricator.wikimedia.org/T222864 (10MarkTraceur) Support, as Cormac's manager - we 
currently have one deployer and will be doing a lot of deployments in the next six months or so. [13:43:34] (03Abandoned) 10Elukey: ores::web: allow uwsgi to dump a core after a segfault [puppet] - 10https://gerrit.wikimedia.org/r/509059 (https://phabricator.wikimedia.org/T222866) (owner: 10Elukey) [13:43:43] 10Operations: Integrate Stretch 9.9 point update - https://phabricator.wikimedia.org/T222053 (10MoritzMuehlenhoff) [13:43:45] (not needed) [13:44:30] (03CR) 10Fsero: "PCC still happy https://puppet-compiler.wmflabs.org/compiler1002/16440/cp1079.eqiad.wmnet/" [puppet] - 10https://gerrit.wikimedia.org/r/509049 (https://phabricator.wikimedia.org/T221101) (owner: 10Fsero) [13:45:27] (03PS2) 10Filippo Giunchedi: prometheus: remove v2 feature flag [puppet] - 10https://gerrit.wikimedia.org/r/509052 (https://phabricator.wikimedia.org/T187987) [13:46:20] (03PS2) 10Ema: varnish: add shell wrapper to run VTC tests [puppet] - 10https://gerrit.wikimedia.org/r/509058 [13:48:51] (03PS1) 10Elukey: ores::worker: allow celery to emit a core dump upon segfault [puppet] - 10https://gerrit.wikimedia.org/r/509060 (https://phabricator.wikimedia.org/T222866) [13:49:28] (03CR) 10Ema: [C: 03+2] varnish: add shell wrapper to run VTC tests [puppet] - 10https://gerrit.wikimedia.org/r/509058 (owner: 10Ema) [13:50:30] (03CR) 10Elukey: "https://puppet-compiler.wmflabs.org/compiler1002/16443/ores1002.eqiad.wmnet/" [puppet] - 10https://gerrit.wikimedia.org/r/509060 (https://phabricator.wikimedia.org/T222866) (owner: 10Elukey) [13:50:54] (03CR) 10Filippo Giunchedi: [C: 03+1] "PCC looks good https://puppet-compiler.wmflabs.org/compiler1002/16441/" [puppet] - 10https://gerrit.wikimedia.org/r/509052 (https://phabricator.wikimedia.org/T187987) (owner: 10Filippo Giunchedi) [13:51:40] 10Operations, 10LDAP-Access-Requests, 10WMF-Legal, 10WMF-NDA-Requests: Request to be added to the ldap/wmde group - https://phabricator.wikimedia.org/T222788 (10RStallman-legalteam) I am out of office the next couple 
days and can attend to this first thing on Monday, May 13. Thanks! [13:54:06] (03CR) 10Marostegui: [C: 03+1] "ok from my side, not touching any of those hosts" [puppet] - 10https://gerrit.wikimedia.org/r/509012 (https://phabricator.wikimedia.org/T206203) (owner: 10Jcrespo) [13:57:13] (03CR) 10Alexandros Kosiaris: [C: 03+1] ores::worker: allow celery to emit a core dump upon segfault [puppet] - 10https://gerrit.wikimedia.org/r/509060 (https://phabricator.wikimedia.org/T222866) (owner: 10Elukey) [13:57:29] (03CR) 10jenkins-bot: Beta Features whitelist: Drop TemplateWizard, now everywhere [mediawiki-config] - 10https://gerrit.wikimedia.org/r/508882 (owner: 10Jforrester) [13:57:31] (03CR) 10jenkins-bot: Beta Features whitelist: Drop AdvancedSearch, now everywhere [mediawiki-config] - 10https://gerrit.wikimedia.org/r/508883 (owner: 10Jforrester) [13:57:33] (03CR) 10jenkins-bot: AdvancedSearch: Stop pretending loading this varies by wiki, it doesn't [mediawiki-config] - 10https://gerrit.wikimedia.org/r/508884 (owner: 10Jforrester) [13:57:35] (03CR) 10jenkins-bot: Beta Features whitelist: Drop RCFilters, now everywhere [mediawiki-config] - 10https://gerrit.wikimedia.org/r/508888 (owner: 10Jforrester) [13:57:37] (03CR) 10jenkins-bot: Revert "db-eqiad.php: Depool pc1007" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/508973 (owner: 10Marostegui) [13:57:39] (03CR) 10jenkins-bot: db-eqiad,db-codfw.php: Pool db2103,db2112,db2116 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/508974 (https://phabricator.wikimedia.org/T222772) (owner: 10Marostegui) [13:57:41] (03CR) 10jenkins-bot: db-eqiad.php: Depool db1076 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/508976 (owner: 10Marostegui) [13:57:43] (03CR) 10jenkins-bot: db-codfw.php: Clarify disk space [mediawiki-config] - 10https://gerrit.wikimedia.org/r/508980 (owner: 10Marostegui) [13:57:45] (03CR) 10jenkins-bot: db-eqiad,db-codfw.php: Add new hosts [mediawiki-config] - 10https://gerrit.wikimedia.org/r/508992 
(https://phabricator.wikimedia.org/T222772) (owner: 10Marostegui) [13:57:47] (03CR) 10jenkins-bot: db-eqiad.php: Slowly repool db1076 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/508997 (owner: 10Marostegui) [13:57:49] (03CR) 10jenkins-bot: db-eqiad.php: Pool db1129 in s2 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/508999 (https://phabricator.wikimedia.org/T222682) (owner: 10Marostegui) [13:57:51] (03CR) 10jenkins-bot: db-eqiad.php: More traffic to db1076 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/509000 (owner: 10Marostegui) [13:57:53] (03CR) 10jenkins-bot: db-eqiad.php: More traffic to db1076,db1129 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/509005 (owner: 10Marostegui) [13:57:55] (03CR) 10jenkins-bot: db-eqiad.php: More traffic to db1076,db1129 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/509008 (owner: 10Marostegui) [13:57:57] (03CR) 10jenkins-bot: db-eqiad.php: More traffic to db1076,db1129 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/509010 (owner: 10Marostegui) [13:57:59] (03CR) 10jenkins-bot: db-eqiad.php: More traffic to db1076,db1129 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/509011 (owner: 10Marostegui) [13:58:01] (03CR) 10jenkins-bot: db-eqiad.php: Give more traffic to db1129 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/509017 (owner: 10Marostegui) [13:58:03] (03CR) 10jenkins-bot: db-eqiad.php: Give API traffic to db1129 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/509024 (owner: 10Marostegui) [13:58:05] (03CR) 10jenkins-bot: db-eqiad.php: Depool db1081 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/509033 (owner: 10Marostegui) [13:58:50] (03CR) 10Alexandros Kosiaris: [C: 03+1] registryha: moving docker-registry.w.o to new registry [puppet] - 10https://gerrit.wikimedia.org/r/509049 (https://phabricator.wikimedia.org/T221101) (owner: 10Fsero) [14:04:02] 10Operations, 10Traffic, 10Patch-For-Review, 10Performance-Team (Radar): Refactor public-facing DYNA scheme 
for primary project hostnames in our DNS - https://phabricator.wikimedia.org/T208263 (10BBlack) @Cwek @lilydjwg - Thanks for the reports! I apologize, this time around the fallout should've been pre... [14:04:47] 10Operations, 10Wikimedia-Mailing-lists: Please create a private mailing list traffic-anomaly-report - https://phabricator.wikimedia.org/T222794 (10jbond) I have created the list and you should have received the admin password. I have set some of the privacy options but the majority of settings have been left... [14:05:11] 10Operations, 10Wikimedia-Mailing-lists: Please create a private mailing list traffic-anomaly-report - https://phabricator.wikimedia.org/T222794 (10jbond) 05Open→03Resolved a:03jbond [14:08:22] 10Operations, 10SRE-Access-Requests: Requesting access to deployment for Cormac Parle - https://phabricator.wikimedia.org/T222864 (10jbond) @greg this is essentially a request to add cparle to the deployment group. As the manager of Release engineering I think you are the relevant person to approve this acces... 
[14:10:05] (03PS4) 10Fsero: registryha: moving docker-registry.w.o to new registry [puppet] - 10https://gerrit.wikimedia.org/r/509049 (https://phabricator.wikimedia.org/T221101) [14:11:51] (03PS5) 10Fsero: registryha: moving docker-registry.w.o to new registry [puppet] - 10https://gerrit.wikimedia.org/r/509049 (https://phabricator.wikimedia.org/T221101) [14:13:19] !log otto@deploy1001 scap-helm eventgate-main upgrade main -f main/staging-values.yaml --reset-values stable/eventgate [namespace: eventgate-main, clusters: staging] [14:13:20] !log otto@deploy1001 scap-helm eventgate-main cluster staging completed [14:13:20] !log otto@deploy1001 scap-helm eventgate-main finished [14:13:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:13:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:13:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:13:31] (03CR) 10CDanis: [C: 03+1] prometheus: remove v2 feature flag (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/509052 (https://phabricator.wikimedia.org/T187987) (owner: 10Filippo Giunchedi) [14:14:04] (03CR) 10Fsero: "PCC still happy https://puppet-compiler.wmflabs.org/compiler1002/16444/cp2013.codfw.wmnet/" [puppet] - 10https://gerrit.wikimedia.org/r/509049 (https://phabricator.wikimedia.org/T221101) (owner: 10Fsero) [14:15:14] hoo: I've read T221774 now, did you have other specific questions for irc besides what's in the task? 
[14:15:14] T221774: Add Wikidata query service lag to Wikidata maxlag - https://phabricator.wikimedia.org/T221774 [14:16:55] (03CR) 10Alexandros Kosiaris: [V: 03+2 C: 03+2] Use prometheus-statsd-exporter 0.9 [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/508803 (https://phabricator.wikimedia.org/T220709) (owner: 10Alexandros Kosiaris) [14:18:30] PROBLEM - Request latencies on neon is CRITICAL: instance=10.64.0.40:6443 verb=CONNECT https://grafana.wikimedia.org/dashboard/db/kubernetes-api [14:18:42] (03CR) 10Filippo Giunchedi: [C: 03+1] prometheus: remove v2 feature flag (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/509052 (https://phabricator.wikimedia.org/T187987) (owner: 10Filippo Giunchedi) [14:19:20] PROBLEM - Check systemd state on ms-be2034 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [14:19:52] RECOVERY - Request latencies on neon is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/kubernetes-api [14:20:00] (03CR) 10Ema: [C: 03+1] registryha: moving docker-registry.w.o to new registry [puppet] - 10https://gerrit.wikimedia.org/r/509049 (https://phabricator.wikimedia.org/T221101) (owner: 10Fsero) [14:21:22] (03CR) 10Fsero: [C: 03+2] registryha: moving docker-registry.w.o to new registry [puppet] - 10https://gerrit.wikimedia.org/r/509049 (https://phabricator.wikimedia.org/T221101) (owner: 10Fsero) [14:28:24] 10Operations, 10Release Pipeline, 10Release-Engineering-Team, 10Wikidata, and 5 others: Introduce wikidata termbox SSR to kubernetes - https://phabricator.wikimedia.org/T220402 (10WMDE-leszek) Hey @akosiaris and @mobrovac we've been wondering if you had a chance to look into our service again. As reported... [14:40:26] (03CR) 10Volans: [C: 03+1] "LGTM, I'll leave it to you and Cas for the merge and deploy due to the pending deps." 
(034 comments) [software/netbox-reports] - 10https://gerrit.wikimedia.org/r/506663 (owner: 10Faidon Liambotis) [14:41:05] 10Operations, 10ops-codfw, 10Patch-For-Review: Broken disk on ms-be2026 - https://phabricator.wikimedia.org/T219854 (10jbond) >>! In T219854#5084464, @fgiunchedi wrote: >>>! In T219854#5076968, @Volans wrote: >> So the `dsa-check-hpssacli` check is happily returning `0` exit code and this output: >> ` >> OK:... [14:41:09] (03CR) 10Mforns: "Ready to review and merge if appropriate." [puppet] - 10https://gerrit.wikimedia.org/r/485063 (https://phabricator.wikimedia.org/T212014) (owner: 10Mforns) [14:45:08] godog: No, the task should have it all. [14:45:14] RECOVERY - Check systemd state on ms-be2034 is OK: OK - running: The system is fully operational [14:47:49] 10Operations, 10ops-codfw, 10Patch-For-Review: Broken disk on ms-be2026 - https://phabricator.wikimedia.org/T219854 (10Volans) I think we have some exception hosts that have 2 controllers but only one is in use, but I'm not sure 100% [14:48:11] !log removing unused uwsgi packages from scb* hosts [14:48:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:48:58] hoo: kk, thanks! 
[14:52:55] !log otto@deploy1001 scap-helm eventgate-main upgrade main -f main/staging-values.yaml --reset-values stable/eventgate [namespace: eventgate-main, clusters: staging] [14:52:56] !log otto@deploy1001 scap-helm eventgate-main cluster staging completed [14:52:56] !log otto@deploy1001 scap-helm eventgate-main finished [14:52:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:53:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:53:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:56:59] 10Operations, 10serviceops: create IRC channel for the Service Operations SRE subteam - https://phabricator.wikimedia.org/T211902 (10jijiki) [14:58:59] (03CR) 10Effie Mouzeli: [C: 03+1] Disable the PHP7 beta feature [mediawiki-config] - 10https://gerrit.wikimedia.org/r/508177 (https://phabricator.wikimedia.org/T219128) (owner: 10Giuseppe Lavagetto) [14:59:17] (03PS1) 10Marostegui: db-eqiad.php: Slowly repool db1081 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/509070 [14:59:50] (03CR) 10Volans: [C: 03+1] "Code looks good, have you already committed the dummy file in the labs/private repo and tested a compilation?" [puppet] - 10https://gerrit.wikimedia.org/r/508625 (owner: 10CRusnov) [15:00:02] (03CR) 10BBlack: [C: 03+1] upload-frontend: unset Cache-Control [puppet] - 10https://gerrit.wikimedia.org/r/509053 (owner: 10Ema) [15:00:38] PROBLEM - Request latencies on neon is CRITICAL: instance=10.64.0.40:6443 verb=CONNECT https://grafana.wikimedia.org/dashboard/db/kubernetes-api [15:03:36] PROBLEM - Request latencies on argon is CRITICAL: instance=10.64.32.133:6443 verb=CONNECT https://grafana.wikimedia.org/dashboard/db/kubernetes-api [15:05:00] RECOVERY - Request latencies on argon is OK: All metrics within thresholds. 
https://grafana.wikimedia.org/dashboard/db/kubernetes-api [15:05:15] 10Operations, 10Analytics, 10Analytics-Kanban, 10Traffic, 10Patch-For-Review: Add prometheus metrics for varnishkafka instances running on caching hosts - https://phabricator.wikimedia.org/T196066 (10fgiunchedi) a:05mforns→03colewhite Reporting here a chat between me and @Ottomata re: metric naming a... [15:07:32] 10Operations, 10ops-eqiad: Bad disk on new host db1133 - https://phabricator.wikimedia.org/T222731 (10Cmjohnson) new disk ordered You have successfully submitted request SR990443425. [15:07:40] 10Operations, 10Traffic, 10Zero, 10Patch-For-Review: Zero VCL removal - https://phabricator.wikimedia.org/T213769 (10jbond) Is this ticket complete? can it be closed, if not what further actions are required? [15:08:18] 10Operations, 10ops-eqiad: Bad disk on new host db1133 - https://phabricator.wikimedia.org/T222731 (10Marostegui) Thank you! [15:08:48] RECOVERY - Request latencies on neon is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/kubernetes-api [15:09:20] 10Operations, 10wikidiff2: Update the php-wikidiff2 package - https://phabricator.wikimedia.org/T222896 (10ori) [15:11:54] 10Operations, 10wikidiff2: Update the php-wikidiff2 package - https://phabricator.wikimedia.org/T222896 (10ori) I just came across https://www.mediawiki.org/wiki/Extension:Wikidiff2/Release_process. @Legoktm, who handles this nowadays? 
[15:12:10] (03PS7) 10Elukey: Adapt saltrotate and EventLoggingSanitization params in data_purge.pp [puppet] - 10https://gerrit.wikimedia.org/r/485063 (https://phabricator.wikimedia.org/T212014) (owner: 10Mforns) [15:12:33] !log shutting down db2114 for main board replacement [15:12:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:13:46] !log otto@deploy1001 scap-helm eventgate-main install -n main -f main/codfw-values.yaml stable/eventgate [namespace: eventgate-main, clusters: codfw] [15:13:48] !log otto@deploy1001 scap-helm eventgate-main cluster codfw completed [15:13:48] !log otto@deploy1001 scap-helm eventgate-main finished [15:13:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:13:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:13:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:14:04] 10Operations, 10wikidiff2: Update the php-wikidiff2 package - https://phabricator.wikimedia.org/T222896 (10MoritzMuehlenhoff) >>! In T222896#5170319, @ori wrote: > I just came across https://www.mediawiki.org/wiki/Extension:Wikidiff2/Release_process. @Legoktm, who handles this nowadays? @WMDE-Fisch did the la... [15:14:28] (03CR) 10Elukey: "Marcel I had to rebase since my earlier change was causing problems, should be good. 
Please triple check then I'll merge :)" [puppet] - 10https://gerrit.wikimedia.org/r/485063 (https://phabricator.wikimedia.org/T212014) (owner: 10Mforns) [15:15:44] PROBLEM - Memory correctable errors -EDAC- on elastic1029 is CRITICAL: 4.001 ge 4 https://grafana.wikimedia.org/dashboard/db/host-overview?orgId=1&var-server=elastic1029&var-datasource=eqiad+prometheus/ops [15:16:29] !log otto@deploy1001 scap-helm eventgate-main install -n main -f main/eqiad-values.yaml stable/eventgate [namespace: eventgate-main, clusters: eqiad] [15:16:30] !log otto@deploy1001 scap-helm eventgate-main cluster eqiad completed [15:16:30] !log otto@deploy1001 scap-helm eventgate-main finished [15:16:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:16:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:16:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:18:08] PROBLEM - Host db2114.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [15:19:06] ^ expected [15:19:12] it is having onsite maintenance [15:20:58] !log Stop mysql on db2112 for onsite work [15:21:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:21:36] PROBLEM - Request latencies on acrux is CRITICAL: instance=10.192.0.93:6443 verb=CONNECT https://grafana.wikimedia.org/dashboard/db/kubernetes-api [15:22:58] RECOVERY - Request latencies on acrux is OK: All metrics within thresholds. 
https://grafana.wikimedia.org/dashboard/db/kubernetes-api [15:23:04] 10Operations, 10ops-codfw, 10DBA: db2112 doesn't show service tag in idrac - https://phabricator.wikimedia.org/T222845 (10RobH) [15:23:07] (03CR) 10Marostegui: [C: 03+2] db-eqiad.php: Slowly repool db1081 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/509070 (owner: 10Marostegui) [15:24:18] (03Merged) 10jenkins-bot: db-eqiad.php: Slowly repool db1081 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/509070 (owner: 10Marostegui) [15:24:32] (03CR) 10jenkins-bot: db-eqiad.php: Slowly repool db1081 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/509070 (owner: 10Marostegui) [15:24:42] 10Operations, 10ops-codfw, 10DBA: db2112 doesn't show service tag in idrac - https://phabricator.wikimedia.org/T222845 (10Marostegui) [15:24:50] 10Operations, 10Traffic, 10Zero, 10Patch-For-Review: Zero VCL removal - https://phabricator.wikimedia.org/T213769 (10Reedy) >>! In T213769#5170303, @jbond wrote: > Is this ticket complete? can it be closed, if not what further actions are required? I guess the removal needs merging? >>! In T213769#4879... [15:25:34] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Slowly repool db1081 (duration: 00m 56s) [15:25:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:25:57] 10Operations, 10Traffic, 10Zero, 10Patch-For-Review: Zero VCL removal - https://phabricator.wikimedia.org/T213769 (10jbond) > I guess the removal needs merging? 
Oh yes missed that :D [15:28:54] 10Operations, 10serviceops: Separate Wikitech cronjobs from production - https://phabricator.wikimedia.org/T222900 (10jijiki) [15:29:02] PROBLEM - Host db2112.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [15:29:07] ^expected [15:29:37] 10Operations, 10cloud-services-team, 10serviceops, 10Core Platform Team Backlog (Watching / External), and 3 others: Switch cronjobs on maintenance hosts to PHP7 - https://phabricator.wikimedia.org/T195392 (10jijiki) [15:29:38] 10Operations, 10serviceops: Separate Wikitech cronjobs from production - https://phabricator.wikimedia.org/T222900 (10jijiki) [15:31:58] RECOVERY - Host db2112.mgmt is UP: PING OK - Packet loss = 0%, RTA = 36.96 ms [15:33:50] (03PS1) 10Marostegui: db-eqiad.php: Give some API traffic back to db1081 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/509083 [15:34:05] 10Operations, 10Analytics, 10Analytics-Kanban, 10Traffic, 10Patch-For-Review: Add prometheus metrics for varnishkafka instances running on caching hosts - https://phabricator.wikimedia.org/T196066 (10Ottomata) I think there are a few more branches: - producer, consumer - global, per broker, per topic, p... 
[15:34:50] (03PS20) 10CDanis: Add a WMF-specific tool for managing db config in MediaWiki [software/conftool] - 10https://gerrit.wikimedia.org/r/441396 (https://phabricator.wikimedia.org/T197126) (owner: 10Giuseppe Lavagetto) [15:35:46] (03PS1) 10Marostegui: db-codfw.php: Depool db2112 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/509084 [15:36:30] (03CR) 10Marostegui: [V: 03+2 C: 03+2] db-codfw.php: Depool db2112 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/509084 (owner: 10Marostegui) [15:36:57] (03PS2) 10Marostegui: db-eqiad.php: Give some API traffic back to db1081 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/509083 [15:37:26] 10Operations, 10Traffic, 10Zero, 10Patch-For-Review: Zero VCL removal - https://phabricator.wikimedia.org/T213769 (10Jdforrester-WMF) I was told by Traffic that the removal is blocked on something in SRE land, but I don't know enough to even give pointers, sorry. [15:37:46] !log marostegui@deploy1001 Synchronized wmf-config/db-codfw.php: Depool db2112 (duration: 00m 59s) [15:37:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:38:07] (03CR) 10Marostegui: [C: 03+2] db-eqiad.php: Give some API traffic back to db1081 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/509083 (owner: 10Marostegui) [15:38:36] (03CR) 10CDanis: "OK, I think this is ready for a review pass." [software/conftool] - 10https://gerrit.wikimedia.org/r/441396 (https://phabricator.wikimedia.org/T197126) (owner: 10Giuseppe Lavagetto) [15:39:09] (03Merged) 10jenkins-bot: db-eqiad.php: Give some API traffic back to db1081 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/509083 (owner: 10Marostegui) [15:40:50] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Slowly repool db1081 (duration: 01m 00s) [15:40:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:42:47] (03PS1) 10Ppchelko: Make EventFactory and event destination configurable. 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/509092 (https://phabricator.wikimedia.org/T222822) [15:42:56] (03PS7) 10Jbond: Switch Thumbor hardening from Firejail to native systemd features (WIP) [puppet] - 10https://gerrit.wikimedia.org/r/482309 (https://phabricator.wikimedia.org/T212941) (owner: 10Muehlenhoff) [15:43:17] (03CR) 10jenkins-bot: db-codfw.php: Depool db2112 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/509084 (owner: 10Marostegui) [15:43:19] (03CR) 10jenkins-bot: db-eqiad.php: Give some API traffic back to db1081 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/509083 (owner: 10Marostegui) [15:43:29] (03CR) 10jerkins-bot: [V: 04-1] Switch Thumbor hardening from Firejail to native systemd features (WIP) [puppet] - 10https://gerrit.wikimedia.org/r/482309 (https://phabricator.wikimedia.org/T212941) (owner: 10Muehlenhoff) [15:44:25] (03CR) 10Jbond: "> Patch Set 5: Verified+1 Code-Review+1" [puppet] - 10https://gerrit.wikimedia.org/r/482309 (https://phabricator.wikimedia.org/T212941) (owner: 10Muehlenhoff) [15:45:23] (03PS4) 10Muehlenhoff: Stop using transitional package names for Icinga plugins [puppet] - 10https://gerrit.wikimedia.org/r/494681 (https://phabricator.wikimedia.org/T213527) [15:45:39] (03PS8) 10Jbond: Switch Thumbor hardening from Firejail to native systemd features (WIP) [puppet] - 10https://gerrit.wikimedia.org/r/482309 (https://phabricator.wikimedia.org/T212941) (owner: 10Muehlenhoff) [15:46:30] (03CR) 10Daimona Eaytoy: [C: 04-1] "I don't really know how things are usually done for quarry, but this change isn't convincing." 
[puppet] - 10https://gerrit.wikimedia.org/r/498773 (https://phabricator.wikimedia.org/T123978) (owner: 10Meshvogel) [15:47:14] 10Operations, 10Traffic, 10Zero, 10Patch-For-Review: Zero VCL removal - https://phabricator.wikimedia.org/T213769 (10BBlack) Yeah, it's mostly just blocked on us making some time to deal with it, and time has been in extremely short supply lately, so we tend not to prioritize anything that doesn't have imm... [15:48:21] (03PS9) 10Jbond: Switch Thumbor hardening from Firejail to native systemd features (WIP) [puppet] - 10https://gerrit.wikimedia.org/r/482309 (https://phabricator.wikimedia.org/T212941) (owner: 10Muehlenhoff) [15:48:39] 10Operations, 10Traffic, 10Zero, 10Patch-For-Review: Zero VCL removal - https://phabricator.wikimedia.org/T213769 (10Jdforrester-WMF) Yeah, this is definitely not urgent, just tech debt clean-up and RelEng (though it will simplify life for Analytics when they don't have to worry about this stuff any more,... [15:48:47] (03CR) 10Muehlenhoff: "This is mostly blocked on a lack of time for some thorough testing, Thumbor is running on Stretch in the mean time." [puppet] - 10https://gerrit.wikimedia.org/r/482309 (https://phabricator.wikimedia.org/T212941) (owner: 10Muehlenhoff) [15:50:07] (03PS1) 10Marostegui: db-eqiad.php: More traffic to db1081 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/509095 [15:50:37] (03CR) 10Ppchelko: "Merged now it will be a no-op since the code changes for this are not ready/deployed. I'm creating this to discuss the approach." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/509092 (https://phabricator.wikimedia.org/T222822) (owner: 10Ppchelko) [15:51:07] (03PS10) 10Jbond: Switch Thumbor hardening from Firejail to native systemd features (WIP) [puppet] - 10https://gerrit.wikimedia.org/r/482309 (https://phabricator.wikimedia.org/T212941) (owner: 10Muehlenhoff) [15:51:36] (03PS4) 10Ema: swift-proxy: add ensure_max_age middleware [puppet] - 10https://gerrit.wikimedia.org/r/509027 [15:51:44] RECOVERY - Host db2114.mgmt is UP: PING OK - Packet loss = 0%, RTA = 36.97 ms [15:51:48] (03CR) 10Jbond: "> Patch Set 8:" [puppet] - 10https://gerrit.wikimedia.org/r/482309 (https://phabricator.wikimedia.org/T212941) (owner: 10Muehlenhoff) [15:52:07] (03CR) 10Ema: swift-proxy: add ensure_max_age middleware (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/509027 (owner: 10Ema) [15:52:28] PROBLEM - Check systemd state on ms-be2014 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.16.32: Connection reset by peer [15:52:32] PROBLEM - swift-object-updater on ms-be2014 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.16.32: Connection reset by peer https://wikitech.wikimedia.org/wiki/Swift [15:52:32] PROBLEM - swift-container-updater on ms-be2014 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.16.32: Connection reset by peer https://wikitech.wikimedia.org/wiki/Swift [15:52:39] PROBLEM - Disk space on ms-be2014 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.16.32: Connection reset by peer https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space [15:52:40] PROBLEM - swift-account-reaper on ms-be2014 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.16.32: Connection reset by peer https://wikitech.wikimedia.org/wiki/Swift [15:52:42] PROBLEM - Check size of conntrack table on ms-be2014 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.16.32: Connection reset by peer https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack 
[15:52:46] PROBLEM - swift-object-auditor on ms-be2014 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.16.32: Connection reset by peer https://wikitech.wikimedia.org/wiki/Swift [15:52:50] PROBLEM - dhclient process on ms-be2014 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.16.32: Connection reset by peer [15:52:54] someone working on ms-be2014 ^^ [15:52:56] PROBLEM - configured eth on ms-be2014 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.16.32: Connection reset by peer [15:53:00] (03CR) 10Marostegui: [C: 03+2] db-eqiad.php: More traffic to db1081 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/509095 (owner: 10Marostegui) [15:53:02] PROBLEM - swift-account-replicator on ms-be2014 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.16.32: Connection reset by peer https://wikitech.wikimedia.org/wiki/Swift [15:53:06] PROBLEM - swift-container-auditor on ms-be2014 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.16.32: Connection reset by peer https://wikitech.wikimedia.org/wiki/Swift [15:53:10] PROBLEM - DPKG on ms-be2014 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.16.32: Connection reset by peer [15:53:16] PROBLEM - swift-object-server on ms-be2014 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.16.32: Connection reset by peer https://wikitech.wikimedia.org/wiki/Swift [15:53:20] PROBLEM - MD RAID on ms-be2014 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.16.32: Connection reset by peer https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering [15:53:20] PROBLEM - Check whether ferm is active by checking the default input chain on ms-be2014 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.16.32: Connection reset by peer https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [15:53:25] siiigh [15:53:30] PROBLEM - very high load average likely xfs on ms-be2014 is CRITICAL: 
CHECK_NRPE: Error - Could not connect to 10.192.16.32: Connection reset by peer https://wikitech.wikimedia.org/wiki/Swift [15:53:32] PROBLEM - swift-container-server on ms-be2014 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.16.32: Connection reset by peer https://wikitech.wikimedia.org/wiki/Swift [15:53:51] 10Operations, 10Traffic, 10Zero, 10Patch-For-Review: Zero VCL removal - https://phabricator.wikimedia.org/T213769 (10jbond) Ok great thanks for the update [15:53:54] PROBLEM - swift-container-replicator on ms-be2014 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.16.32: Connection reset by peer https://wikitech.wikimedia.org/wiki/Swift [15:54:02] cdanis: is this a known issue? [15:54:06] PROBLEM - swift-account-server on ms-be2014 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.16.32: Connection reset by peer https://wikitech.wikimedia.org/wiki/Swift [15:54:08] PROBLEM - swift-object-replicator on ms-be2014 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.16.32: Connection reset by peer https://wikitech.wikimedia.org/wiki/Swift [15:54:09] decoms [15:54:09] (03Merged) 10jenkins-bot: db-eqiad.php: More traffic to db1081 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/509095 (owner: 10Marostegui) [15:54:13] ahh ok thanks [15:54:14] jbond42: yeah, more swift decomms [15:54:22] cheers [15:54:24] (03CR) 10jenkins-bot: db-eqiad.php: More traffic to db1081 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/509095 (owner: 10Marostegui) [15:54:30] RECOVERY - DPKG on ms-be2014 is OK: All packages OK [15:54:32] RECOVERY - swift-object-server on ms-be2014 is OK: PROCS OK: 101 processes with regex args ^/usr/bin/python /usr/bin/swift-object-server https://wikitech.wikimedia.org/wiki/Swift [15:54:34] RECOVERY - Check whether ferm is active by checking the default input chain on ms-be2014 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [15:54:34] RECOVERY - MD 
RAID on ms-be2014 is OK: OK: Active: 4, Working: 4, Failed: 0, Spare: 0 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering [15:54:44] RECOVERY - very high load average likely xfs on ms-be2014 is OK: OK - load average: 14.82, 12.60, 10.00 https://wikitech.wikimedia.org/wiki/Swift [15:54:48] RECOVERY - swift-container-server on ms-be2014 is OK: PROCS OK: 13 processes with regex args ^/usr/bin/python /usr/bin/swift-container-server https://wikitech.wikimedia.org/wiki/Swift [15:55:06] RECOVERY - Check systemd state on ms-be2014 is OK: OK - running: The system is fully operational [15:55:10] RECOVERY - swift-container-replicator on ms-be2014 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-replicator https://wikitech.wikimedia.org/wiki/Swift [15:55:10] RECOVERY - swift-container-updater on ms-be2014 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-updater https://wikitech.wikimedia.org/wiki/Swift [15:55:10] RECOVERY - swift-object-updater on ms-be2014 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-object-updater https://wikitech.wikimedia.org/wiki/Swift [15:55:14] (03PS3) 10Jcrespo: mariadb-backups: Setup all metadata sections for daily snapshots [puppet] - 10https://gerrit.wikimedia.org/r/509012 (https://phabricator.wikimedia.org/T206203) [15:55:16] RECOVERY - Disk space on ms-be2014 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space [15:55:19] RECOVERY - Check size of conntrack table on ms-be2014 is OK: OK: nf_conntrack is 0 % full https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack [15:55:19] RECOVERY - swift-account-reaper on ms-be2014 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-account-reaper https://wikitech.wikimedia.org/wiki/Swift [15:55:22] RECOVERY - swift-account-server on ms-be2014 is OK: PROCS OK: 13 processes with regex args 
^/usr/bin/python /usr/bin/swift-account-server https://wikitech.wikimedia.org/wiki/Swift [15:55:24] RECOVERY - swift-object-replicator on ms-be2014 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-object-replicator https://wikitech.wikimedia.org/wiki/Swift [15:55:24] RECOVERY - swift-object-auditor on ms-be2014 is OK: PROCS OK: 3 processes with regex args ^/usr/bin/python /usr/bin/swift-object-auditor https://wikitech.wikimedia.org/wiki/Swift [15:55:28] RECOVERY - dhclient process on ms-be2014 is OK: PROCS OK: 0 processes with command name dhclient [15:55:36] RECOVERY - configured eth on ms-be2014 is OK: OK - interfaces up [15:55:42] RECOVERY - swift-account-replicator on ms-be2014 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-account-replicator https://wikitech.wikimedia.org/wiki/Swift [15:55:43] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: More traffic to db1081 (duration: 01m 01s) [15:55:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:55:46] RECOVERY - swift-container-auditor on ms-be2014 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-auditor https://wikitech.wikimedia.org/wiki/Swift [15:56:15] (and also, it looks like the disk scheduler changes i was experimenting with did not do much to fix the monitoring noise) [15:56:19] (03CR) 10Jcrespo: [C: 03+2] mariadb-backups: Setup all metadata sections for daily snapshots [puppet] - 10https://gerrit.wikimedia.org/r/509012 (https://phabricator.wikimedia.org/T206203) (owner: 10Jcrespo) [15:59:31] 10Operations, 10ops-eqiad, 10DBA, 10Goal, and 2 others: rack/setup/install db11[26-38].eqiad.wmnet - https://phabricator.wikimedia.org/T211613 (10RobH) [15:59:44] 10Operations, 10ops-eqiad, 10DBA, 10Goal, and 2 others: rack/setup/install db11[26-38].eqiad.wmnet - https://phabricator.wikimedia.org/T211613 (10RobH) [16:00:04] godog and _joe_: That opportune time is upon us again. 
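Editor's note: the flood of "PROCS OK: N processes with regex args ^/usr/bin/python /usr/bin/swift-..." recoveries above comes from a check_procs-style probe that counts running processes whose command line matches an anchored regex. A minimal sketch of that matching logic — `count_matching` is a hypothetical helper for illustration, not the actual monitoring plugin:

```python
import re

# Count command lines matching an anchored regex, in the spirit of
# check_procs' regex-argument matching. count_matching is illustrative only.
def count_matching(cmdlines, pattern):
    rx = re.compile(pattern)
    return sum(1 for cmd in cmdlines if rx.search(cmd))

# Sample process table as it might look on a swift backend host
procs = [
    "/usr/bin/python /usr/bin/swift-object-replicator",
    "/usr/bin/python /usr/bin/swift-object-auditor",
    "/usr/bin/python /usr/bin/swift-object-auditor",
    "sshd: root@pts/0",
]
print(count_matching(procs, r"^/usr/bin/python /usr/bin/swift-object-replicator"))
```

The check goes CRITICAL when the count falls outside the configured range — which is why every swift daemon on a rebooting backend alerts at once and then recovers together.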
Time for a Puppet SWAT(Max 6 patches) deploy. Don't be afraid. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190509T1600). [16:00:04] No GERRIT patches in the queue for this window AFAICS. [16:01:36] (03PS1) 10Marostegui: db-eqiad.php: More traffic to db1081 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/509098 [16:02:31] 10Operations, 10Performance-Team, 10observability: Graphite alert 'MediaWiki.errors.fatal' no longer working - https://phabricator.wikimedia.org/T222765 (10Krinkle) 05Open→03Resolved p:05Triage→03High ... and we're back :) {F28984208} [16:04:50] (03CR) 10Marostegui: [C: 03+2] db-eqiad.php: More traffic to db1081 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/509098 (owner: 10Marostegui) [16:06:01] (03Merged) 10jenkins-bot: db-eqiad.php: More traffic to db1081 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/509098 (owner: 10Marostegui) [16:06:25] (03CR) 10jenkins-bot: db-eqiad.php: More traffic to db1081 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/509098 (owner: 10Marostegui) [16:06:54] 10Operations, 10ops-eqiad, 10DBA, 10Goal, and 2 others: rack/setup/install db11[26-38].eqiad.wmnet - https://phabricator.wikimedia.org/T211613 (10Marostegui) [16:07:23] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: More traffic to db1081 (duration: 01m 13s) [16:07:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:08:10] 10Operations, 10Analytics, 10Analytics-Kanban, 10Traffic, 10Patch-For-Review: Add prometheus metrics for varnishkafka instances running on caching hosts - https://phabricator.wikimedia.org/T196066 (10Ottomata) BTW for reference here's what EventGate is currently exporting: https://gist.github.com/ottomat...
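Editor's note: the repeated "Slowly repool db1081" / "More traffic to db1081" commits above follow the standard pattern of raising a freshly-returned replica's load weight in several small steps (with a config sync between each) rather than all at once, so query load ramps up gradually. A hypothetical illustration of such a ramp — this is not the actual wmf-config/dbctl mechanism:

```python
# Illustrative only: compute a stepped weight ramp for gradually
# repooling a database replica, as in the db1081 sequence above.
def repool_steps(target_weight: int, steps: int = 4) -> list[int]:
    """Return increasing weights ending at target_weight."""
    return [round(target_weight * (i + 1) / steps) for i in range(steps)]

print(repool_steps(100))
```

Each intermediate weight would be committed and synchronized before moving to the next, giving time to watch latency and error dashboards between increases.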
[16:08:19] 10Operations, 10ops-eqiad, 10DBA, 10Goal, and 2 others: rack/setup/install db11[26-38].eqiad.wmnet - https://phabricator.wikimedia.org/T211613 (10Marostegui) [16:08:45] 10Operations, 10ops-codfw, 10DBA: db2114 hardware problem - https://phabricator.wikimedia.org/T222753 (10Papaul) 05Open→03Resolved @Marostegui @jcrespo main board replacement complete on db2114. The problem has been resolved. You can take over now. System Board Fan1A 14% 3480 RPM 840 RPM N/A 480 RPM... [16:10:06] 10Operations, 10ops-codfw, 10DBA: db2114 hardware problem - https://phabricator.wikimedia.org/T222753 (10Marostegui) Thanks Papaul! [16:12:35] (03PS1) 10Marostegui: db-eqiad.php: Restore db1084 weight [mediawiki-config] - 10https://gerrit.wikimedia.org/r/509099 [16:14:03] (03CR) 10Marostegui: [C: 03+2] db-eqiad.php: Restore db1084 weight [mediawiki-config] - 10https://gerrit.wikimedia.org/r/509099 (owner: 10Marostegui) [16:15:37] (03Merged) 10jenkins-bot: db-eqiad.php: Restore db1084 weight [mediawiki-config] - 10https://gerrit.wikimedia.org/r/509099 (owner: 10Marostegui) [16:16:59] (03CR) 10jenkins-bot: db-eqiad.php: Restore db1084 weight [mediawiki-config] - 10https://gerrit.wikimedia.org/r/509099 (owner: 10Marostegui) [16:17:00] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Restore original weight on db1084 (duration: 00m 59s) [16:17:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:21:19] (03PS1) 10Volans: CHANGELOG: add changelogs for release v0.0.24 [software/spicerack] - 10https://gerrit.wikimedia.org/r/509101 [16:21:39] (03PS1) 10Alexandros Kosiaris: kask: Add incubator/cassandra subchart [deployment-charts] - 10https://gerrit.wikimedia.org/r/509102 (https://phabricator.wikimedia.org/T220401) [16:22:25] (03PS5) 10Ema: swift-proxy: add ensure_max_age middleware [puppet] - 10https://gerrit.wikimedia.org/r/509027 [16:25:16] 10Operations, 10Prod-Kubernetes, 10serviceops, 10Kubernetes, and 2 others: migrate endpoint from old 
registry instance to new one - https://phabricator.wikimedia.org/T221101 (10fsero) [16:25:35] (03PS1) 10Ottomata: Add LVS DNS for eventgate-maian [dns] - 10https://gerrit.wikimedia.org/r/509104 (https://phabricator.wikimedia.org/T222899) [16:25:38] (03CR) 10jerkins-bot: [V: 04-1] CHANGELOG: add changelogs for release v0.0.24 [software/spicerack] - 10https://gerrit.wikimedia.org/r/509101 (owner: 10Volans) [16:26:11] 10Operations, 10Prod-Kubernetes, 10serviceops, 10Kubernetes, and 2 others: improve docker registry architecture - https://phabricator.wikimedia.org/T209271 (10fsero) [16:26:17] 10Operations, 10Prod-Kubernetes, 10serviceops, 10Kubernetes, and 2 others: migrate endpoint from old registry instance to new one - https://phabricator.wikimedia.org/T221101 (10fsero) 05Open→03Resolved [16:28:07] (03PS3) 10Ema: upload-frontend: unset Cache-Control for upload.w.o [puppet] - 10https://gerrit.wikimedia.org/r/509053 [16:29:12] (03PS1) 10Ottomata: LVS for eventgate-main [puppet] - 10https://gerrit.wikimedia.org/r/509106 (https://phabricator.wikimedia.org/T222899) [16:29:49] (03CR) 10Dzahn: [C: 03+1] Stop using transitional package names for Icinga plugins [puppet] - 10https://gerrit.wikimedia.org/r/494681 (https://phabricator.wikimedia.org/T213527) (owner: 10Muehlenhoff) [16:30:14] (03CR) 10Filippo Giunchedi: [C: 03+1] "Untested but LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/509027 (owner: 10Ema) [16:32:20] (03CR) 10BBlack: [C: 03+1] upload-frontend: unset Cache-Control for upload.w.o [puppet] - 10https://gerrit.wikimedia.org/r/509053 (owner: 10Ema) [16:32:33] (03CR) 10BBlack: [C: 03+1] swift-proxy: add ensure_max_age middleware [puppet] - 10https://gerrit.wikimedia.org/r/509027 (owner: 10Ema) [16:32:41] 10Operations, 10TechCom, 10Core Platform Team (Session Management Service (CDP2)), 10Core Platform Team Backlog (Later), and 5 others: Establish an SLA for session storage - https://phabricator.wikimedia.org/T211721 (10EvanProdromou) I updated the description so 
it has what seems to be the latest values.... [16:34:56] (03PS1) 10Volans: tests: temporarily force bandit < 1.6.0 [software/spicerack] - 10https://gerrit.wikimedia.org/r/509108 [16:40:13] (03CR) 10Dzahn: "so the query limit for registered users looks to be 5000 (and 500 for anons). With a number of projects of something over 2000 this should" [puppet] - 10https://gerrit.wikimedia.org/r/508892 (owner: 10Paladox) [16:40:59] (03CR) 10CRusnov: [C: 03+1] "LGTM" [software/spicerack] - 10https://gerrit.wikimedia.org/r/509108 (owner: 10Volans) [16:41:13] (03CR) 10Volans: [C: 03+2] tests: temporarily force bandit < 1.6.0 [software/spicerack] - 10https://gerrit.wikimedia.org/r/509108 (owner: 10Volans) [16:41:45] (03CR) 10Dzahn: "fwiw there is a bug that means using Polygerrit i can't actually see the numbers the query limit is set to but Paladox can see it and it's" [puppet] - 10https://gerrit.wikimedia.org/r/508892 (owner: 10Paladox) [16:45:03] 10Operations, 10SRE-Access-Requests: Requesting access to deployment for Cormac Parle - https://phabricator.wikimedia.org/T222864 (10greg) >>! In T222864#5170143, @jbond wrote: > @greg this is essentially a request to add cparle to the deployment group. As the manager of Release engineering are you the releve... [16:45:25] (03Merged) 10jenkins-bot: tests: temporarily force bandit < 1.6.0 [software/spicerack] - 10https://gerrit.wikimedia.org/r/509108 (owner: 10Volans) [16:46:34] (03CR) 10jenkins-bot: tests: temporarily force bandit < 1.6.0 [software/spicerack] - 10https://gerrit.wikimedia.org/r/509108 (owner: 10Volans) [16:47:02] 10Operations, 10wikidiff2: Update the php-wikidiff2 package - https://phabricator.wikimedia.org/T222896 (10WMDE-Fisch) >>! In T222896#5170333, @MoritzMuehlenhoff wrote: >>>! In T222896#5170319, @ori wrote: >> I just came across https://www.mediawiki.org/wiki/Extension:Wikidiff2/Release_process. @Legoktm, who h... 
[16:48:11] (03CR) 10Dzahn: [C: 03+1] Gerrit: Enable gerrit.listProjectsFromIndex [puppet] - 10https://gerrit.wikimedia.org/r/508892 (owner: 10Paladox) [16:48:44] (03CR) 10Dzahn: [C: 03+1] "expected to be noop on current version but needed for 2.16 upgrade" [puppet] - 10https://gerrit.wikimedia.org/r/508892 (owner: 10Paladox) [16:49:54] (03PS2) 10Volans: CHANGELOG: add changelogs for release v0.0.24 [software/spicerack] - 10https://gerrit.wikimedia.org/r/509101 [16:51:23] 10Operations, 10ops-eqiad, 10RESTBase, 10Core Platform Team Backlog (Watching / External), and 2 others: rack/setup/install restbase10[19-27].eqiad.wmnet - https://phabricator.wikimedia.org/T219404 (10RobH) [16:53:29] 10Operations, 10SRE-Access-Requests, 10Security-Team: Requesting access to deployment and analytics-privatedata-users for jfishback - https://phabricator.wikimedia.org/T222910 (10sbassett) [16:53:41] 10Operations, 10SRE-Access-Requests, 10Security-Team: Requesting access to deployment and analytics-privatedata-users for jfishback - https://phabricator.wikimedia.org/T222910 (10sbassett) [16:53:55] 10Operations, 10SRE-Access-Requests, 10Security-Team: Requesting access to deployment and analytics-privatedata-users for jfishback - https://phabricator.wikimedia.org/T222910 (10sbassett) p:05Triage→03Low [16:55:34] 10Operations, 10ops-eqiad, 10RESTBase, 10Core Platform Team Backlog (Watching / External), and 2 others: rack/setup/install restbase10[19-27].eqiad.wmnet - https://phabricator.wikimedia.org/T219404 (10RobH) Ok, all of these systems are now installed and calling into puppet with role::spare so they are in m... [17:00:04] cscott, arlolra, subbu, and halfak: Dear deployers, time to do the Services – Graphoid / Parsoid / Citoid / ORES deploy. Dont look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190509T1700). 
[17:01:40] (03CR) 10Volans: [C: 03+2] CHANGELOG: add changelogs for release v0.0.24 [software/spicerack] - 10https://gerrit.wikimedia.org/r/509101 (owner: 10Volans) [17:07:40] (03Merged) 10jenkins-bot: CHANGELOG: add changelogs for release v0.0.24 [software/spicerack] - 10https://gerrit.wikimedia.org/r/509101 (owner: 10Volans) [17:08:51] (03CR) 10jenkins-bot: CHANGELOG: add changelogs for release v0.0.24 [software/spicerack] - 10https://gerrit.wikimedia.org/r/509101 (owner: 10Volans) [17:09:32] (03PS1) 10MSantos: Fix notify tilerator script [puppet] - 10https://gerrit.wikimedia.org/r/509113 [17:10:33] (03CR) 10Cwhite: [C: 03+1] prometheus: remove v2 feature flag [puppet] - 10https://gerrit.wikimedia.org/r/509052 (https://phabricator.wikimedia.org/T187987) (owner: 10Filippo Giunchedi) [17:11:02] (03CR) 10Gehel: [C: 03+2] Fix notify tilerator script [puppet] - 10https://gerrit.wikimedia.org/r/509113 (owner: 10MSantos) [17:11:12] 10Operations, 10ops-eqiad, 10RESTBase, 10Core Platform Team Backlog (Watching / External), and 2 others: rack/setup/install restbase10[19-27].eqiad.wmnet - https://phabricator.wikimedia.org/T219404 (10RobH) a:05RobH→03fgiunchedi After IRC sync up, previously this was handled by @fgiunchedi to push into... [17:13:42] 10Operations, 10Traffic, 10Patch-For-Review, 10Performance-Team (Radar): Refactor public-facing DYNA scheme for primary project hostnames in our DNS - https://phabricator.wikimedia.org/T208263 (10BBlack) Our analytics seems to indicate the changes above had the intended effect in restoring normal levels of... 
[17:19:27] (03PS1) 10Volans: Upstream release v0.0.24 [software/spicerack] (debian) - 10https://gerrit.wikimedia.org/r/509120 [17:24:12] (03PS1) 10Bstorm: The maps exports on cloudstore1008/9 may just work with this [puppet] - 10https://gerrit.wikimedia.org/r/509122 [17:24:36] (03PS2) 10Ottomata: Add LVS DNS for eventgate-main [dns] - 10https://gerrit.wikimedia.org/r/509104 (https://phabricator.wikimedia.org/T222899) [17:25:38] (03CR) 10Volans: [C: 03+2] Upstream release v0.0.24 [software/spicerack] (debian) - 10https://gerrit.wikimedia.org/r/509120 (owner: 10Volans) [17:29:33] 10Operations, 10serviceops: Separate Wikitech cronjobs from production - https://phabricator.wikimedia.org/T222900 (10Dzahn) cron jobs running as 'www-data' on wikitech (labweb1001/1002) are only 2: ` # Puppet Name: run-jobs * * * * * /usr/local/bin/mwscript maintenance/runJobs.php --wiki=labswiki > /dev/nul... [17:31:14] (03Merged) 10jenkins-bot: Upstream release v0.0.24 [software/spicerack] (debian) - 10https://gerrit.wikimedia.org/r/509120 (owner: 10Volans) [17:34:53] 10Operations, 10MediaWiki-Cache, 10serviceops, 10Core Platform Team (Security, stability, performance and scalability (TEC1)), and 5 others: Use a multi-dc aware store for ObjectCache's MainStash if needed. - https://phabricator.wikimedia.org/T212129 (10EvanProdromou) [17:36:17] 10Operations, 10MediaWiki-Cache, 10serviceops, 10Core Platform Team (Security, stability, performance and scalability (TEC1)), and 5 others: Use a multi-dc aware store for ObjectCache's MainStash if needed. - https://phabricator.wikimedia.org/T212129 (10EvanProdromou) I moved this task so it's a direct chi... 
[17:38:11] (03PS2) 10Bstorm: cloudstore: the maps exports on cloudstore1008/9 may just work with this [puppet] - 10https://gerrit.wikimedia.org/r/509122
[17:46:14] (03PS3) 10Bstorm: cloudstore: the maps exports on cloudstore1008/9 may just work with this [puppet] - 10https://gerrit.wikimedia.org/r/509122
[17:47:07] 10Operations, 10DC-Ops: documented procedure for replacing disks in software RAID servers - https://phabricator.wikimedia.org/T220842 (10RobH) I'm not quite sure what the other dc opsen do these days. Back when I was primary at a site, I tended to send the rebuild commands, and also the manual bootloader inst...
[17:48:13] 10Operations, 10serviceops: Separate Wikitech cronjobs from production - https://phabricator.wikimedia.org/T222900 (10Dzahn) labweb1001/1002 are on PHP 7, but it's 7.0 instead of 7.2 Regarding the TorBlock job, i tried to switch that in the past and it worked fine in production but failed on labweb, so it was...
[17:58:42] (03PS2) 10Ottomata: Bump up refinery version for refine.pp [puppet] - 10https://gerrit.wikimedia.org/r/508863 (https://phabricator.wikimedia.org/T215442) (owner: 10Mforns)
[17:59:15] (03CR) 10jerkins-bot: [V: 04-1] Bump up refinery version for refine.pp [puppet] - 10https://gerrit.wikimedia.org/r/508863 (https://phabricator.wikimedia.org/T215442) (owner: 10Mforns)
[18:00:04] MaxSem, RoanKattouw, and Niharika: Time to snap out of that daydream and deploy Morning SWAT (Max 6 patches). Get on with it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190509T1800).
[18:00:04] alaa_wmde: A patch you scheduled for Morning SWAT (Max 6 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[18:00:37] (03PS3) 10Ottomata: Enable schema aware eventlogging hive refinement [puppet] - 10https://gerrit.wikimedia.org/r/508863 (https://phabricator.wikimedia.org/T215442) (owner: 10Mforns)
[18:00:51] o/
[18:01:10] (03CR) 10jerkins-bot: [V: 04-1] Enable schema aware eventlogging hive refinement [puppet] - 10https://gerrit.wikimedia.org/r/508863 (https://phabricator.wikimedia.org/T215442) (owner: 10Mforns)
[18:01:23] good morning .. anyone around for the SWAT?
[18:01:36] (03PS4) 10Ottomata: Enable schema aware eventlogging hive refinement [puppet] - 10https://gerrit.wikimedia.org/r/508863 (https://phabricator.wikimedia.org/T215442) (owner: 10Mforns)
[18:02:57] (03CR) 10jerkins-bot: [V: 04-1] Enable schema aware eventlogging hive refinement [puppet] - 10https://gerrit.wikimedia.org/r/508863 (https://phabricator.wikimedia.org/T215442) (owner: 10Mforns)
[18:04:35] (03PS3) 10Jforrester: Duplicate …Squid variables into …Cdn ahead of MW renaming [mediawiki-config] - 10https://gerrit.wikimedia.org/r/496847 (https://phabricator.wikimedia.org/T104148)
[18:04:37] (03PS3) 10Jforrester: Stop reading wmgUseClusterSquid, never varied [mediawiki-config] - 10https://gerrit.wikimedia.org/r/496848
[18:04:45] (03CR) 10Jforrester: [C: 04-2] Stop reading wmgUseClusterSquid, never varied (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/496848 (owner: 10Jforrester)
[18:04:49] (03PS5) 10Ottomata: Enable schema aware eventlogging hive refinement [puppet] - 10https://gerrit.wikimedia.org/r/508863 (https://phabricator.wikimedia.org/T215442) (owner: 10Mforns)
[18:12:25] (03CR) 10Ottomata: "So the idea is to first build the event with the stream name, and then have getStream pull that out of the event, and use the returned val" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/509092 (https://phabricator.wikimedia.org/T222822) (owner: 10Ppchelko)
[18:13:24] (03CR) 10Ottomata: [C: 03+2] "https://puppet-compiler.wmflabs.org/compiler1001/16448/an-coord1001.eqiad.wmnet/" [puppet] - 10https://gerrit.wikimedia.org/r/508863 (https://phabricator.wikimedia.org/T215442) (owner: 10Mforns)
[18:15:01] 10Operations, 10serviceops: Separate Wikitech cronjobs from production - https://phabricator.wikimedia.org/T222900 (10Dzahn) The TorBlock job is already separated between wikitech and mwmaint. For mwmaint we are using `profile::mediawiki::periodic_job { 'mediawiki_tor_exit_node':` and for wikitech we are using...
[18:22:13] PROBLEM - Maps tiles generation on icinga1001 is CRITICAL: CRITICAL: 90.35% of data under the critical threshold [5.0] https://grafana.wikimedia.org/dashboard/db/maps-performances?panelId=8&fullscreen&orgId=1
[18:22:49] !log simplify filter analytics-in4 term mysql-dbstore on cr1/2-eqiad
[18:22:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:23:31] (03PS1) 10Nray: Enable AdvancedMobileContributions Overflow menu [mediawiki-config] - 10https://gerrit.wikimedia.org/r/509130
[18:24:32] (03CR) 10jerkins-bot: [V: 04-1] Enable AdvancedMobileContributions Overflow menu [mediawiki-config] - 10https://gerrit.wikimedia.org/r/509130 (owner: 10Nray)
[18:27:01] (03PS2) 10Nray: Enable AdvancedMobileContributions Overflow menu [mediawiki-config] - 10https://gerrit.wikimedia.org/r/509130
[18:27:50] (03CR) 10jerkins-bot: [V: 04-1] Enable AdvancedMobileContributions Overflow menu [mediawiki-config] - 10https://gerrit.wikimedia.org/r/509130 (owner: 10Nray)
[18:29:41] (03CR) 10Nray: "recheck" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/509130 (owner: 10Nray)
[18:30:39] (03CR) 10jerkins-bot: [V: 04-1] Enable AdvancedMobileContributions Overflow menu [mediawiki-config] - 10https://gerrit.wikimedia.org/r/509130 (owner: 10Nray)
[18:32:30] (03Abandoned) 10Nray: Enable AdvancedMobileContributions Overflow menu [mediawiki-config] - 10https://gerrit.wikimedia.org/r/509130 (owner: 10Nray)
[18:33:34] (03PS7) 10Aaron Schulz: db-eqiad,db-codfw.php: Change second parsercache key [mediawiki-config] - 10https://gerrit.wikimedia.org/r/508170 (https://phabricator.wikimedia.org/T210725) (owner: 10Marostegui)
[18:33:50] (03CR) 10Aaron Schulz: [C: 03+1] db-eqiad,db-codfw.php: Change second parsercache key [mediawiki-config] - 10https://gerrit.wikimedia.org/r/508170 (https://phabricator.wikimedia.org/T210725) (owner: 10Marostegui)
[18:35:25] (03Abandoned) 10Ayounsi: Bird-lg [puppet] - 10https://gerrit.wikimedia.org/r/390330 (https://phabricator.wikimedia.org/T106056) (owner: 10Ayounsi)
[18:35:31] (03Abandoned) 10Ayounsi: acme_chief: Issue birdlg certificate [puppet] - 10https://gerrit.wikimedia.org/r/504248 (https://phabricator.wikimedia.org/T106056) (owner: 10Vgutierrez)
[18:35:40] (03Abandoned) 10Ayounsi: Add looking glass CNAMEs [dns] - 10https://gerrit.wikimedia.org/r/504233 (https://phabricator.wikimedia.org/T106056) (owner: 10Ayounsi)
[18:44:13] 10Operations, 10netops, 10Patch-For-Review: set up a looking glass for WMF ASes - https://phabricator.wikimedia.org/T106056 (10ayounsi) 05Open→03Declined The amount of work required to properly deploy a (muti-dc) looking glass is, so far, not worth the benefits of having and maintaining one. - Peering wi...
[18:52:53] 10Operations, 10netops, 10Patch-For-Review: set up a looking glass for WMF ASes - https://phabricator.wikimedia.org/T106056 (10Nemo_bis) Thanks for considering this and for sharing the analysis.
[18:57:29] PROBLEM - HTTP availability for Nginx -SSL terminators- at ulsfo on icinga1001 is CRITICAL: cluster=cache_text site=ulsfo https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[18:58:29] PROBLEM - HTTP availability for Nginx -SSL terminators- at eqiad on icinga1001 is CRITICAL: cluster=cache_text site=eqiad https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[18:58:31] PROBLEM - HTTP availability for Varnish at eqiad on icinga1001 is CRITICAL: job=varnish-text site=eqiad https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1
[18:58:51] PROBLEM - HTTP availability for Nginx -SSL terminators- at codfw on icinga1001 is CRITICAL: cluster=cache_text site=codfw https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[18:59:05] PROBLEM - HTTP availability for Varnish at esams on icinga1001 is CRITICAL: job=varnish-text site=esams https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1
[18:59:09] PROBLEM - HTTP availability for Nginx -SSL terminators- at esams on icinga1001 is CRITICAL: cluster=cache_text site=esams https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[18:59:31] PROBLEM - HTTP availability for Nginx -SSL terminators- at eqsin on icinga1001 is CRITICAL: cluster=cache_text site=eqsin https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[19:00:05] thcipriani: That opportune time is upon us again. Time for a MediaWiki train - Americas version deploy. Don't be afraid. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190509T1900).
[19:00:07] PROBLEM - Esams HTTP 5xx reqs/min on graphite1004 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=esams&var-cache_type=All&var-status_type=5
[19:00:11] PROBLEM - Text HTTP 5xx reqs/min on graphite1004 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=All&var-cache_type=text&var-status_type=5
[19:00:33] PROBLEM - Ulsfo HTTP 5xx reqs/min on graphite1004 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=ulsfo&var-cache_type=All&var-status_type=5
[19:01:07] PROBLEM - Eqiad HTTP 5xx reqs/min on graphite1004 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=eqiad&var-cache_type=All&var-status_type=5
[19:01:33] RECOVERY - HTTP availability for Nginx -SSL terminators- at ulsfo on icinga1001 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[19:01:33] RECOVERY - HTTP availability for Nginx -SSL terminators- at codfw on icinga1001 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[19:01:49] RECOVERY - HTTP availability for Varnish at esams on icinga1001 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1
[19:01:53] RECOVERY - HTTP availability for Nginx -SSL terminators- at esams on icinga1001 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[19:02:15] RECOVERY - HTTP availability for Nginx -SSL terminators- at eqsin on icinga1001 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[19:02:35] RECOVERY - HTTP availability for Nginx -SSL terminators- at eqiad on icinga1001 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[19:02:35] RECOVERY - HTTP availability for Varnish at eqiad on icinga1001 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1
[19:04:54] hrm, well, was just about to report for train, good timing
[19:05:18] * James_F grins.
[19:05:47] PROBLEM - Check systemd state on ms-be2017 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[19:05:51] https://grafana.wikimedia.org/d/000000439/varnish-backend-connections?orgId=1&from=1557427324975&to=1557428738662
[19:06:55] but no matching increase on the public input ide
[19:06:57] *side
[19:06:59] RECOVERY - Esams HTTP 5xx reqs/min on graphite1004 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=esams&var-cache_type=All&var-status_type=5
[19:07:01] RECOVERY - Text HTTP 5xx reqs/min on graphite1004 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=All&var-cache_type=text&var-status_type=5
[19:07:20] tends to mean appservers/api latency -> saturate connection parallelism -> start 503ing random $things
[19:07:25] RECOVERY - Ulsfo HTTP 5xx reqs/min on graphite1004 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=ulsfo&var-cache_type=All&var-status_type=5
[19:07:59] RECOVERY - Eqiad HTTP 5xx reqs/min on graphite1004 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=eqiad&var-cache_type=All&var-status_type=5
[19:08:22] (the fe->be increase that goes with it is because the frontends do a single-retry of any 503 regardless of source, in an attempt to hide one-shot/transient failures)
[19:12:50] so all this is to say: performance issues somewhere in the stack could have caused that bunch of 503s?
[19:14:44] 10Operations, 10wikidiff2: Update the php-wikidiff2 package - https://phabricator.wikimedia.org/T222896 (10ori) SGTM; thank you.
[19:18:15] (03PS1) 10Ladsgroup: dumps: Add clickstream to list of other datasets [puppet] - 10https://gerrit.wikimedia.org/r/509136
[19:21:25] 10Operations, 10Performance-Team (Radar): PHP fatal error handler not working on mwdebug servers - https://phabricator.wikimedia.org/T217846 (10jijiki) What is a bit more weird is that this works as expected with PHP7.
[19:24:19] !log renumber mr1-esams<->cr1-esams link to 91.198.174.240/31 - T211254
[19:24:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:24:24] T211254: Free up 185.15.59.0/24 - https://phabricator.wikimedia.org/T211254
[19:24:57] RECOVERY - Check systemd state on ms-be2017 is OK: OK - running: The system is fully operational
[19:26:26] (03PS5) 10Paladox: beta: Stop cloning over /p/ [puppet] - 10https://gerrit.wikimedia.org/r/507069 (https://phabricator.wikimedia.org/T218844)
[19:28:23] !log renumber mr1-esams<->cr2-knams link to 91.198.174.224/31 - T211254
[19:28:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:31:34] (03CR) 10Ayounsi: [C: 03+2] Move mr1-esams interco links to 91.198.174.0/24 [dns] - 10https://gerrit.wikimedia.org/r/485081 (https://phabricator.wikimedia.org/T211254) (owner: 10Ayounsi)
[19:31:50] (03PS3) 10Ayounsi: Move mr1-esams interco links to 91.198.174.0/24 [dns] - 10https://gerrit.wikimedia.org/r/485081 (https://phabricator.wikimedia.org/T211254)
[19:35:59] Shrink 91.198.174.224/28 to 91.198.174.232/29 on cr1/2-esams - T211254
[19:36:00] T211254: Free up 185.15.59.0/24 - https://phabricator.wikimedia.org/T211254
[19:40:35] 10Operations, 10SRE-Access-Requests, 10Security-Team: Requesting access to deployment and analytics-privatedata-users for jfishback - https://phabricator.wikimedia.org/T222910 (10JFishback_WMF)
[19:41:15] 10Operations, 10SRE-Access-Requests, 10Security-Team: Requesting access to deployment and analytics-privatedata-users for jfishback - https://phabricator.wikimedia.org/T222910 (10JFishback_WMF) Pinging @JBennett for approval.
[19:43:58] 10Operations, 10SRE-Access-Requests, 10Security-Team: Requesting access to deployment and analytics-privatedata-users for jfishback - https://phabricator.wikimedia.org/T222910 (10JFishback_WMF)
[19:49:11] * thcipriani back to running train
[19:51:20] (03PS1) 10Thcipriani: all wikis to 1.34.0-wmf.4 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/509138
[19:51:22] (03CR) 10Thcipriani: [C: 03+2] all wikis to 1.34.0-wmf.4 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/509138 (owner: 10Thcipriani)
[19:52:57] (03Merged) 10jenkins-bot: all wikis to 1.34.0-wmf.4 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/509138 (owner: 10Thcipriani)
[19:53:24] (03CR) 10jenkins-bot: all wikis to 1.34.0-wmf.4 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/509138 (owner: 10Thcipriani)
[19:56:09] !log thcipriani@deploy1001 rebuilt and synchronized wikiversions files: all wikis to 1.34.0-wmf.4
[19:57:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:58:37] PROBLEM - Nginx local proxy to apache on mw1232 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers
[19:58:47] PROBLEM - HHVM rendering on mw1232 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers
[19:59:51] RECOVERY - Nginx local proxy to apache on mw1232 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 617 bytes in 0.051 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[19:59:59] RECOVERY - HHVM rendering on mw1232 is OK: HTTP OK: HTTP/1.1 200 OK - 75838 bytes in 0.194 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[20:05:23] PROBLEM - Eqiad HTTP 5xx reqs/min on graphite1004 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=eqiad&var-cache_type=All&var-status_type=5
[20:05:47] PROBLEM - Esams HTTP 5xx reqs/min on graphite1004 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=esams&var-cache_type=All&var-status_type=5
[20:05:51] PROBLEM - Text HTTP 5xx reqs/min on graphite1004 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=All&var-cache_type=text&var-status_type=5
[20:06:06] 10Operations, 10Puppet, 10User-Joe, 10cloud-services-team (FY2017-18): Upgrade to puppet 4 (4.8 or newer) - https://phabricator.wikimedia.org/T177254 (10herron)
[20:06:13] 10Operations, 10Puppet, 10Patch-For-Review, 10User-herron: custom fact interface_primary breaks under newer versions of facter - https://phabricator.wikimedia.org/T182819 (10herron) 05Open→03Resolved yup!
[20:07:07] PROBLEM - MediaWiki exceptions and fatals per minute on graphite1004 is CRITICAL: CRITICAL: 90.00% of data above the critical threshold [50.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=2&fullscreen
[20:07:29] hrm everything in the log appears to be a 60 second timeout
[20:07:46] but more sustained than I'm used to
[20:08:59] (03CR) 10Andrew Bogott: "I can't find any examples of this on VMs which makes me wonder if either this is dead code or if I'm grepping for the wrong thing. I'm ru" [puppet] - 10https://gerrit.wikimedia.org/r/507069 (https://phabricator.wikimedia.org/T218844) (owner: 10Paladox)
[20:12:35] RECOVERY - Esams HTTP 5xx reqs/min on graphite1004 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=esams&var-cache_type=All&var-status_type=5
[20:12:39] RECOVERY - Text HTTP 5xx reqs/min on graphite1004 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=All&var-cache_type=text&var-status_type=5
[20:13:27] PROBLEM - MariaDB Slave Lag: s4 on db1102 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 802.24 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_slave
[20:13:35] RECOVERY - Eqiad HTTP 5xx reqs/min on graphite1004 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=eqiad&var-cache_type=All&var-status_type=5
[20:14:11] PROBLEM - MariaDB Slave Lag: s8 on db1116 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 844.39 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_slave
[20:18:10] (03CR) 10Andrew Bogott: [C: 03+2] beta: Stop cloning over /p/ [puppet] - 10https://gerrit.wikimedia.org/r/507069 (https://phabricator.wikimedia.org/T218844) (owner: 10Paladox)
[20:19:23] RECOVERY - MediaWiki exceptions and fatals per minute on graphite1004 is OK: OK: Less than 70.00% above the threshold [25.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=2&fullscreen
[20:21:52] (03PS1) 10Ayounsi: Remove unused/untrusted IP ranges from trusted lists [puppet] - 10https://gerrit.wikimedia.org/r/509140 (https://phabricator.wikimedia.org/T222392)
[20:25:23] (03PS2) 10Ayounsi: Remove unused/untrusted IP ranges from trusted lists [puppet] - 10https://gerrit.wikimedia.org/r/509140 (https://phabricator.wikimedia.org/T222392)
[20:26:12] 10Operations, 10Gerrit, 10cloud-services-team, 10serviceops: Change /r/p/ to /r/ on all hosts (where https://gerrit.wikimedia.org/r/p/ exists) - https://phabricator.wikimedia.org/T222093 (10Paladox)
[20:26:46] (03PS6) 10Paladox: statistics: Stop cloning over /p/ [puppet] - 10https://gerrit.wikimedia.org/r/507077 (https://phabricator.wikimedia.org/T218844)
[20:27:26] (03PS1) 10Ottomata: service::docker - add $image_name parameter [puppet] - 10https://gerrit.wikimedia.org/r/509141 (https://phabricator.wikimedia.org/T218346)
[20:28:16] (03PS4) 10Paladox: reportupdater: Stop cloning over /p/ [puppet] - 10https://gerrit.wikimedia.org/r/507073
[20:30:11] (03CR) 10Ottomata: [C: 03+2] reportupdater: Stop cloning over /p/ [puppet] - 10https://gerrit.wikimedia.org/r/507073 (owner: 10Paladox)
[20:31:12] 10Operations, 10ops-eqiad: wmf7622 wont powercycle (cannot be allocated from spares) - https://phabricator.wikimedia.org/T222922 (10RobH) p:05Triage→03Normal
[20:33:41] (03PS8) 10Mforns: Adapt saltrotate and EventLoggingSanitization params in data_purge.pp [puppet] - 10https://gerrit.wikimedia.org/r/485063 (https://phabricator.wikimedia.org/T212014)
[20:33:44] thcipriani: train all clear?
[20:33:51] mdholloway: yep
[20:33:54] 10Operations, 10Gerrit, 10cloud-services-team, 10serviceops: Change /r/p/ to /r/ on all hosts (where https://gerrit.wikimedia.org/r/p/ exists) - https://phabricator.wikimedia.org/T222093 (10Paladox)
[20:34:03] thcipriani: sweet, thanks!
[20:34:07] mdholloway: just monitoring things now, but deploys are done for the time being.
[20:34:37] thcipriani: btw, i was looking at blubber today, and wanted to say thanks to you and marxarelli for the amazing docs!
[20:34:59] nice! glad those docs are useful.
[20:35:21] mdholloway: really glad to hear it. docs were all thcipriani
[20:35:42] i guess i wrote the concepts part
[20:36:05] pretty sure it's the best documented piece of software i've used at wmf
[20:36:34] (03CR) 10Ottomata: "https://puppet-compiler.wmflabs.org/compiler1002/16449/an-coord1001.eqiad.wmnet/" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/485063 (https://phabricator.wikimedia.org/T212014) (owner: 10Mforns)
[20:37:39] lots of stuff written by the time I got there, I split it up a bit, I'm glad to hear that they're not complete nonsense :P
[20:38:25] (03CR) 10Ottomata: [C: 03+2] Adapt saltrotate and EventLoggingSanitization params in data_purge.pp (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/485063 (https://phabricator.wikimedia.org/T212014) (owner: 10Mforns)
[20:38:31] 10Operations, 10Traffic, 10netops, 10Patch-For-Review: Free up 185.15.59.0/24 - https://phabricator.wikimedia.org/T211254 (10ayounsi) The conversation went a bit outside the scope of the task description. Re-focusing on it and with the new info of T222392, I renumbered mr1-esams links (trivial change) so 1...
[20:38:32] (03PS9) 10Ottomata: Adapt saltrotate and EventLoggingSanitization params in data_purge.pp [puppet] - 10https://gerrit.wikimedia.org/r/485063 (https://phabricator.wikimedia.org/T212014) (owner: 10Mforns)
[20:39:14] anyway, i was going to deploy a config change, but i don't think i want to attempt it on this feeble coffee shop connection. i'll run home and get it on sometime later but before evening swat.
[20:45:23] 10Operations, 10Gerrit, 10cloud-services-team, 10serviceops: Change /r/p/ to /r/ on all hosts (where https://gerrit.wikimedia.org/r/p/ exists) - https://phabricator.wikimedia.org/T222093 (10Paladox)
[20:45:33] (03PS4) 10Paladox: analytics: Stop cloning over /p/ [puppet] - 10https://gerrit.wikimedia.org/r/507075 (https://phabricator.wikimedia.org/T218844)
[20:46:34] (03CR) 10Ayounsi: "Adding reviewers based on who added the IPs to the various configs as well as who could be impacted (cloud)." [puppet] - 10https://gerrit.wikimedia.org/r/509140 (https://phabricator.wikimedia.org/T222392) (owner: 10Ayounsi)
[20:50:07] (03PS1) 10Ottomata: Remove unused profile::analytics::refinery::{job::guard,source} [puppet] - 10https://gerrit.wikimedia.org/r/509143 (https://phabricator.wikimedia.org/T218844)
[20:50:25] PROBLEM - MariaDB Slave Lag: s1 on db2097 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 738.58 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_slave
[20:51:21] (03CR) 10Paladox: [C: 03+1] Remove unused profile::analytics::refinery::{job::guard,source} [puppet] - 10https://gerrit.wikimedia.org/r/509143 (https://phabricator.wikimedia.org/T218844) (owner: 10Ottomata)
[20:53:34] 10Operations, 10Gerrit, 10cloud-services-team, 10serviceops: Change /r/p/ to /r/ on all hosts (where https://gerrit.wikimedia.org/r/p/ exists) - https://phabricator.wikimedia.org/T222093 (10Paladox)
[20:54:21] 10Operations, 10ops-eqiad: wmf7622 wont powercycle (cannot be allocated from spares) - https://phabricator.wikimedia.org/T222922 (10RobH) a:05RobH→03Cmjohnson Ok, I flashed the idrac firmware to the newest, and now it says it accepts the power on command, however it doesn't actually power on. So Chris wi...
[20:54:38] (03PS1) 10Ottomata: Remove unused statistics::aggregator [puppet] - 10https://gerrit.wikimedia.org/r/509145 (https://phabricator.wikimedia.org/T218844)
[20:56:04] (03CR) 10Paladox: [C: 03+1] Remove unused statistics::aggregator [puppet] - 10https://gerrit.wikimedia.org/r/509145 (https://phabricator.wikimedia.org/T218844) (owner: 10Ottomata)
[20:57:24] 10Operations, 10ops-eqiad: wmf7622 wont powercycle (cannot be allocated from spares) - https://phabricator.wikimedia.org/T222922 (10RobH)
[20:57:41] 10Operations, 10Electron-PDFs, 10Proton, 10Reading-Infrastructure-Team-Backlog, and 5 others: [EPIC] New service request: chromium-render/deploy - https://phabricator.wikimedia.org/T186748 (10Tgr) >>! In T186748#4854896, @Trizek-WMF wrote: > OK, I leave this ticket as "not ready to announce" until somethin...
[20:59:19] 10Operations, 10Proton, 10Reading-Infrastructure-Team-Backlog: proton experienced a period of high CPU usage, busy queue, lockups - https://phabricator.wikimedia.org/T214975 (10Tgr)
[21:16:38] (03PS1) 10Bstorm: wiki replicas: apply black formatting to maintain-views.py [puppet] - 10https://gerrit.wikimedia.org/r/509147
[21:16:39] (03PS1) 10Bstorm: wiki replicas: Fix misconfiguration in the views [puppet] - 10https://gerrit.wikimedia.org/r/509148 (https://phabricator.wikimedia.org/T212972)
[21:17:28] (03PS2) 10Mholloway: WikimediaEditorTasks: add caption counter config and enable on Commons [mediawiki-config] - 10https://gerrit.wikimedia.org/r/508876
[21:19:08] (03CR) 10Mholloway: [C: 03+2] WikimediaEditorTasks: add caption counter config and enable on Commons [mediawiki-config] - 10https://gerrit.wikimedia.org/r/508876 (owner: 10Mholloway)
[21:20:16] (03Merged) 10jenkins-bot: WikimediaEditorTasks: add caption counter config and enable on Commons [mediawiki-config] - 10https://gerrit.wikimedia.org/r/508876 (owner: 10Mholloway)
[21:20:31] (03CR) 10jenkins-bot: WikimediaEditorTasks: add caption counter config and enable on Commons [mediawiki-config] - 10https://gerrit.wikimedia.org/r/508876 (owner: 10Mholloway)
[21:23:36] !log mholloway-shell@deploy1001 Synchronized wmf-config/InitialiseSettings.php: Enable WikipediaAppCaptionEditCounter (T222211) (duration: 00m 52s)
[21:23:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:23:41] T222211: Add a caption edit counter - https://phabricator.wikimedia.org/T222211
[21:26:35] (03PS1) 10Alex Monk: role::wmcs::services::ntp: Fix standard::ntp [puppet] - 10https://gerrit.wikimedia.org/r/509149
[21:27:12] (03PS5) 10Mobrovac: Handle application/octet-stream requests properly; release v0.1.5 [software/service-checker] - 10https://gerrit.wikimedia.org/r/507531 (https://phabricator.wikimedia.org/T220401)
[21:27:20] !log change user email for Melamrawy (WMF)@global
[21:27:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:28:29] (03CR) 10jerkins-bot: [V: 04-1] Handle application/octet-stream requests properly; release v0.1.5 [software/service-checker] - 10https://gerrit.wikimedia.org/r/507531 (https://phabricator.wikimedia.org/T220401) (owner: 10Mobrovac)
[21:29:33] (03CR) 10Alex Monk: [C: 03+1] Remove unused/untrusted IP ranges from trusted lists [puppet] - 10https://gerrit.wikimedia.org/r/509140 (https://phabricator.wikimedia.org/T222392) (owner: 10Ayounsi)
[21:33:07] (03PS1) 10Mforns: role::common::aqs: update mediawiki's druid datasource to 2019-04 [puppet] - 10https://gerrit.wikimedia.org/r/509150
[21:35:36] (03PS1) 10Volans: documentations: fix Sphinx configuration [software/spicerack] - 10https://gerrit.wikimedia.org/r/509152
[21:35:38] (03PS1) 10Volans: setup.py: fix urllib3 dependency [software/spicerack] - 10https://gerrit.wikimedia.org/r/509153
[21:39:58] RECOVERY - MariaDB Slave Lag: s4 on db1102 is OK: OK slave_sql_lag Replication lag: 0.19 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_slave
[21:57:25] (03CR) 10CRusnov: [C: 03+1] "LGTM does what it says on the tin" [software/spicerack] - 10https://gerrit.wikimedia.org/r/509152 (owner: 10Volans)
[21:57:49] (03CR) 10Volans: [C: 03+2] documentations: fix Sphinx configuration [software/spicerack] - 10https://gerrit.wikimedia.org/r/509152 (owner: 10Volans)
[21:58:47] (03CR) 10CRusnov: [C: 03+1] "lgtm" [software/spicerack] - 10https://gerrit.wikimedia.org/r/509153 (owner: 10Volans)
[22:01:54] (03Merged) 10jenkins-bot: documentations: fix Sphinx configuration [software/spicerack] - 10https://gerrit.wikimedia.org/r/509152 (owner: 10Volans)
[22:02:14] 10Operations, 10Analytics, 10Analytics-Kanban, 10Traffic, 10Patch-For-Review: Add prometheus metrics for varnishkafka instances running on caching hosts - https://phabricator.wikimedia.org/T196066 (10colewhite) I agree with dropping the prefix in favor of "rdkafka". I see the type branch (producer and c...
[22:02:19] (03CR) 10Volans: [C: 03+2] setup.py: fix urllib3 dependency [software/spicerack] - 10https://gerrit.wikimedia.org/r/509153 (owner: 10Volans)
[22:02:53] (03CR) 10jenkins-bot: documentations: fix Sphinx configuration [software/spicerack] - 10https://gerrit.wikimedia.org/r/509152 (owner: 10Volans)
[22:08:06] (03CR) 10CRusnov: cookbook API: add class API (031 comment) [software/spicerack] - 10https://gerrit.wikimedia.org/r/506956 (https://phabricator.wikimedia.org/T221212) (owner: 10Volans)
[22:10:07] (03Merged) 10jenkins-bot: setup.py: fix urllib3 dependency [software/spicerack] - 10https://gerrit.wikimedia.org/r/509153 (owner: 10Volans)
[22:11:15] (03CR) 10jenkins-bot: setup.py: fix urllib3 dependency [software/spicerack] - 10https://gerrit.wikimedia.org/r/509153 (owner: 10Volans)
[22:15:26] RECOVERY - MariaDB Slave Lag: s1 on db2097 is OK: OK slave_sql_lag Replication lag: 0.00 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_slave
[22:18:05] (03PS1) 10Volans: CHANGELOG: add changelogs for release v0.0.25 [software/spicerack] - 10https://gerrit.wikimedia.org/r/509158
[22:21:22] (03CR) 10Bstorm: "This looks right, but I'm confused by the chain of events around the standard module and would like input from @Jbond?" [puppet] - 10https://gerrit.wikimedia.org/r/509149 (owner: 10Alex Monk) [22:24:53] (03CR) 10Volans: [C: 03+2] CHANGELOG: add changelogs for release v0.0.25 [software/spicerack] - 10https://gerrit.wikimedia.org/r/509158 (owner: 10Volans) [22:30:06] (03Merged) 10jenkins-bot: CHANGELOG: add changelogs for release v0.0.25 [software/spicerack] - 10https://gerrit.wikimedia.org/r/509158 (owner: 10Volans) [22:31:13] (03CR) 10jenkins-bot: CHANGELOG: add changelogs for release v0.0.25 [software/spicerack] - 10https://gerrit.wikimedia.org/r/509158 (owner: 10Volans) [22:32:10] RECOVERY - MariaDB Slave Lag: s8 on db1116 is OK: OK slave_sql_lag Replication lag: 0.00 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_slave [22:37:23] (03PS1) 10Volans: Upstream release v0.0.25 [software/spicerack] (debian) - 10https://gerrit.wikimedia.org/r/509161 [22:38:54] (03CR) 10CRusnov: [C: 03+1] "LGTM" [software/spicerack] (debian) - 10https://gerrit.wikimedia.org/r/509161 (owner: 10Volans) [22:38:57] (03PS4) 10Bstorm: cloudstore: the maps exports on cloudstore1008/9 may just work with this [puppet] - 10https://gerrit.wikimedia.org/r/509122 [22:43:52] (03CR) 10Jbond: [C: 03+1] "> Patch Set 1:" [puppet] - 10https://gerrit.wikimedia.org/r/509149 (owner: 10Alex Monk) [22:43:54] (03CR) 10Volans: [C: 03+2] Upstream release v0.0.25 [software/spicerack] (debian) - 10https://gerrit.wikimedia.org/r/509161 (owner: 10Volans) [22:44:04] (03CR) 10Bstorm: [C: 03+2] cloudstore: the maps exports on cloudstore1008/9 may just work with this [puppet] - 10https://gerrit.wikimedia.org/r/509122 (owner: 10Bstorm) [22:45:08] PROBLEM - HHVM rendering on mw1285 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 1308 bytes in 0.001 second response time https://wikitech.wikimedia.org/wiki/Application_servers [22:45:12] PROBLEM - 
Nginx local proxy to apache on mw1285 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 1308 bytes in 0.009 second response time https://wikitech.wikimedia.org/wiki/Application_servers [22:46:30] RECOVERY - HHVM rendering on mw1285 is OK: HTTP OK: HTTP/1.1 200 OK - 75854 bytes in 0.140 second response time https://wikitech.wikimedia.org/wiki/Application_servers [22:46:34] RECOVERY - Nginx local proxy to apache on mw1285 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 617 bytes in 0.050 second response time https://wikitech.wikimedia.org/wiki/Application_servers [22:49:06] (03Merged) 10jenkins-bot: Upstream release v0.0.25 [software/spicerack] (debian) - 10https://gerrit.wikimedia.org/r/509161 (owner: 10Volans) [22:49:41] (03CR) 10Bstorm: "I don't see the renamed profile for standard? It seems to be in the original place." [puppet] - 10https://gerrit.wikimedia.org/r/509149 (owner: 10Alex Monk) [22:50:16] !log labweb1001/labweb1002 - remove "runJob" cron job from www-data's crontab, it is already also a systemd timer and puppet was meant to remove it (T222917) [22:50:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:50:21] T222917: wikitech - duplicate runJobs job (cron vs. systemd) - https://phabricator.wikimedia.org/T222917 [22:52:20] (03CR) 10Bstorm: "Oh! I see it. It's just a pp file, not a folder." 
[puppet] - 10https://gerrit.wikimedia.org/r/509149 (owner: 10Alex Monk) [22:52:35] (03PS2) 10Bstorm: role::wmcs::services::ntp: Fix standard::ntp [puppet] - 10https://gerrit.wikimedia.org/r/509149 (owner: 10Alex Monk) [22:52:49] (03PS1) 10Dzahn: wikitech/labweb: remove code to absent runJob cron [puppet] - 10https://gerrit.wikimedia.org/r/509162 (https://phabricator.wikimedia.org/T222917) [22:54:29] (03CR) 10Dzahn: [C: 03+2] wikitech/labweb: remove code to absent runJob cron [puppet] - 10https://gerrit.wikimedia.org/r/509162 (https://phabricator.wikimedia.org/T222917) (owner: 10Dzahn) [22:56:01] (03PS3) 10Bstorm: role::wmcs::services::ntp: Fix standard::ntp [puppet] - 10https://gerrit.wikimedia.org/r/509149 (owner: 10Alex Monk) [22:57:29] !log Manually cleared extdistributor cache T188692 [22:57:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:57:34] T188692: Special:ExtensionDistributor displays an error - https://phabricator.wikimedia.org/T188692 [22:57:41] (03CR) 10Bstorm: [C: 03+2] role::wmcs::services::ntp: Fix standard::ntp [puppet] - 10https://gerrit.wikimedia.org/r/509149 (owner: 10Alex Monk) [22:58:09] !log uploaded spicerack_0.0.25-1_amd64.deb to apt.wikimedia.org stretch-wikimedia [22:58:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:00:04] MaxSem, RoanKattouw, and Niharika: #bothumor Q:How do functions break up? A:They stop calling each other. Rise for Evening SWAT (Max 6 patches) deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190509T2300). [23:00:04] ebernhardson and Lucas_WMDE: A patch you scheduled for Evening SWAT (Max 6 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. 
[23:00:10] o/ [23:00:21] \o [23:00:24] i suppose i can ship things [23:01:13] these patches are all going to take significant time to get through CI...merging all now and will sync one at a time [23:01:45] ok [23:03:57] (I also have deployment access btw, though it’s probably simpler if you do all three) [23:24:27] !log spicerack upgraded to 0.0.25 on cumin1001 and cumin2001 [23:24:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:29:39] (03CR) 10Dzahn: [C: 03+1] "lgtm afaict, the only ones from these ranges that are actually in DNS is 185.15.56.23 (toolserver.org) and the esams transit networks" [puppet] - 10https://gerrit.wikimedia.org/r/509140 (https://phabricator.wikimedia.org/T222392) (owner: 10Ayounsi) [23:31:39] gah, I just realized this Firefox bug killed my WikimediaDebug extension [23:31:42] (03PS12) 10Dzahn: Gerrit: Redirect /p/(.+)/info/(.+) to /$1/info/$2 [puppet] - 10https://gerrit.wikimedia.org/r/507787 (https://phabricator.wikimedia.org/T218844) (owner: 10Paladox) [23:31:44] let’s see if I can get it reinstalled to test the backport… [23:31:59] (xdebug helper and HTTPS everywhere not affected, strangely enough) [23:32:58] (03PS13) 10Dzahn: Gerrit: Redirect /p/(.+)/info/(.+) to /$1/info/$2 [puppet] - 10https://gerrit.wikimedia.org/r/507787 (https://phabricator.wikimedia.org/T218844) (owner: 10Paladox) [23:33:32] (03CR) 10Dzahn: "> Patch Set 11: Code-Review+1" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/507787 (https://phabricator.wikimedia.org/T218844) (owner: 10Paladox) [23:33:56] (03CR) 10Dzahn: [C: 03+2] Gerrit: Redirect /p/(.+)/info/(.+) to /$1/info/$2 [puppet] - 10https://gerrit.wikimedia.org/r/507787 (https://phabricator.wikimedia.org/T218844) (owner: 10Paladox) [23:34:03] mutante \o/ [23:34:53] paladox: yep, testing looks good and all.
but we still need to double check [23:35:00] yup [23:35:10] i can try cloning over /r/p/ to confirm it still works [23:35:13] i fixed the minor nitpicks he had [23:35:15] and also that it redirects. [23:36:21] paladox: applied [23:36:25] * paladox tests [23:36:37] Lucas_WMDE: patches merged [23:36:39] tested https://gerrit.wikimedia.org/r/p/labs/tools/stashbot [23:36:42] yay [23:36:53] I’m ready to test, I think (had to update Firefox) [23:37:21] mutante curl -I https://gerrit.wikimedia.org/r/p/operations/puppet/info/refs?service=git-upload-pack works [23:37:26] paladox: > GET /r/p/labs/tools/stashbot HTTP/1.1 [23:37:28] Lucas_WMDE: pulled to mwdebug1002 [23:37:34] paladox: < HTTP/1.1 302 Found [23:38:03] yup [23:38:04] expected [23:38:11] as it's a git link (ui does not work) [23:38:18] ie ui is not hosted over /p/ [23:38:40] paladox: confirmed it's a 301 with your example [23:38:46] if you git clone https://gerrit.wikimedia.org/r/p/labs/tools/stashbot [23:38:59] clone works [23:39:06] i get: [23:39:07] warning: redirecting to https://gerrit.wikimedia.org/r/labs/tools/stashbot/ [23:39:11] git clone https://gerrit.wikimedia.org/r/p/labs/tools/stashbot [23:39:11] Cloning into 'stashbot'... [23:39:17] ebernhardson: my backport works as intended [23:39:22] please deploy :) [23:39:22] paladox: i dont see the warning ? [23:39:32] mutante which version of git are you running? [23:39:41] git --version [23:39:41] git version 2.21.0 [23:39:43] paladox: 2.11 [23:39:57] i guess that warning is new in a newer git version :) [23:40:09] ok.. well. it works. 
that is the important part [23:40:14] yup [23:40:35] actually the warning is a reminder then to update URLs [23:40:41] not bad [23:40:46] yup which is good :) [23:40:50] acj [23:40:51] ack [23:41:14] !log ebernhardson@deploy1001 Synchronized php-1.34.0-wmf.4/extensions/Wikibase/view/resources/jquery/wikibase/jquery.wikibase.entityselector.js: T172937 T222346 Revert Close entityselector after selecting exact match (duration: 00m 51s) [23:41:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:41:20] T222346: [Bug] Drop-down doesn't appear when typing in the Wikidata search box - https://phabricator.wikimedia.org/T222346 [23:41:20] T172937: Cursor jumping to next field and overlapping of menus - https://phabricator.wikimedia.org/T172937 [23:42:06] paladox: glad it is done now. i just wanted to be extra careful.. hence the apache-fast-test fixes and slowdown. thanks! [23:42:14] yup :) [23:43:36] !log ebernhardson@deploy1001 Synchronized php-1.34.0-wmf.4/extensions/CirrusSearch/: T220625 Limit the clusters archive index is written to (duration: 00m 59s) [23:43:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:43:41] T220625: Initialize CirrusSearch on cloudelastic - https://phabricator.wikimedia.org/T220625 [23:45:22] paladox: i linked https://phabricator.wikimedia.org/T218844 to https://phabricator.wikimedia.org/T200739 as the parent. we should do that with other tasks as well that are blockers for 2.16.7 [23:45:33] thanks! 
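(The redirect exercised above, Gerrit change 507787, rewrites `/p/(.+)/info/(.+)` to `/$1/info/$2`. As a minimal, hypothetical sketch of just that path mapping in Python — the real rule lives in the puppet-managed Apache config, and the live URLs additionally carry the `/r` prefix seen in the curl test:)

```python
import re

# Sketch of the path rewrite from Gerrit change 507787:
# /p/<project>/info/<rest>  ->  /<project>/info/<rest>
# (illustrative only; the deployed rule is an Apache rewrite, not Python)
PATTERN = re.compile(r"^/p/(.+)/info/(.+)$")

def rewrite(path: str) -> str:
    """Return the redirect target for a /p/ info path, or the path unchanged."""
    return PATTERN.sub(r"/\1/info/\2", path)

print(rewrite("/p/operations/puppet/info/refs"))
# -> /operations/puppet/info/refs
```

(Paths without the `/info/` component, such as the plain `/p/labs/tools/stashbot` clone URL, are handled by a separate redirect and pass through this pattern unchanged.)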
[23:45:56] (03PS3) 10Dzahn: Gerrit: Enable gerrit.listProjectsFromIndex [puppet] - 10https://gerrit.wikimedia.org/r/508892 (https://phabricator.wikimedia.org/T200739) (owner: 10Paladox) [23:46:05] adding ticket to that too [23:46:14] kind of a blocker, correct [23:46:21] we dont want the OOM :P [23:47:12] thanks :) [23:47:13] and yes [23:47:36] (03PS2) 10EBernhardson: cloudelastic: Don't write to private wikis on cloudelastic [mediawiki-config] - 10https://gerrit.wikimedia.org/r/508357 [23:47:43] (03CR) 10EBernhardson: [C: 03+2] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/508357 (owner: 10EBernhardson) [23:47:47] ebernhardson: you don’t need me around anymore, right? [23:47:53] Lucas_WMDE: nope, should be all good [23:47:58] !log ebernhardson@deploy1001 Synchronized php-1.34.0-wmf.4/extensions/CirrusSearch/: T220819 Uniquely identify connections in connection pool (duration: 00m 58s) [23:48:00] alright, thanks for deploying :) [23:48:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:48:03] T220819: Updating a tag with ProveIt wipes the "group" attribute - https://phabricator.wikimedia.org/T220819 [23:49:18] (03PS4) 10Dzahn: Gerrit: Enable gerrit.listProjectsFromIndex [puppet] - 10https://gerrit.wikimedia.org/r/508892 (https://phabricator.wikimedia.org/T200739) (owner: 10Paladox) [23:49:28] (03Merged) 10jenkins-bot: cloudelastic: Don't write to private wikis on cloudelastic [mediawiki-config] - 10https://gerrit.wikimedia.org/r/508357 (owner: 10EBernhardson) [23:49:44] (03CR) 10jenkins-bot: cloudelastic: Don't write to private wikis on cloudelastic [mediawiki-config] - 10https://gerrit.wikimedia.org/r/508357 (owner: 10EBernhardson) [23:50:53] paladox: gerrit.listProjectsFromIndex requires restart, right [23:50:58] yup [23:51:36] jouncebot: now [23:51:36] For the next 0 hour(s) and 8 minute(s): Evening SWAT (Max 6 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190509T2300) [23:51:38] from 
gerrit 2.16 some configs don't require a restart anymore, I'm hoping in the future they add the useful configs to it :) [23:52:11] !log ebernhardson@deploy1001 Synchronized wmf-config/InitialiseSettings.php: T220625: Dont write to private wikis on cloudelastic (duration: 00m 50s) [23:52:14] paladox: we could just merge it.. knowing there will be plenty of restarting until 2.16.7 ..but i dont like to surprise somebody else [23:52:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:52:15] T220625: Initialize CirrusSearch on cloudelastic - https://phabricator.wikimedia.org/T220625 [23:52:24] yup [23:56:27] paladox: ok, i would say let's get all these in shape and then do more than one together: https://gerrit.wikimedia.org/r/c/operations/puppet/+/508391 https://gerrit.wikimedia.org/r/c/operations/puppet/+/508127 https://gerrit.wikimedia.org/r/c/operations/puppet/+/508621 https://gerrit.wikimedia.org/r/c/operations/puppet/+/508657 [23:57:00] there is a balance between restarting it for every single one and not doing TOO much at once [23:57:02] SWAT is complete [23:57:56] paladox: we should resolve ", we have some rsyslog config which parses all log files and send them to logstash. It should instead only collect the main log, otherwise it will probably lead to duplication in logstash." [23:58:10] hmm [23:58:14] how would we fix that? [23:58:28] find the rsyslog config .. [23:58:32] let's see [23:58:36] mutante https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/508621/ can go in whenever (just a private token needs generating) [23:59:10] mutante also that change is merged, just needs a follow-up to fix it i guess. [23:59:32] paladox: ok, but that would be nice to let Filippo +1 it, since he -1 and then there was an update [23:59:41] yup