[00:08:26] (03PS1) 10Catrope: Revert "Disable MessageBlobStore::clear() via hook" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/508476 (https://phabricator.wikimedia.org/T217719) [00:08:39] RECOVERY - Check systemd state on netmon2001 is OK: OK - running: The system is fully operational [00:08:40] (03CR) 10jenkins-bot: [V: 04-1] Revert "Disable MessageBlobStore::clear() via hook" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/508476 (https://phabricator.wikimedia.org/T217719) (owner: 10Catrope) [00:11:27] 10Operations, 10ops-codfw, 10DBA, 10Goal, 10Patch-For-Review: rack/setup/install db2[103-120].codfw.wmnet (18 hosts) - https://phabricator.wikimedia.org/T221532 (10Papaul) [00:26:29] (03CR) 10Volans: cookbook API: add class API (031 comment) [software/spicerack] - 10https://gerrit.wikimedia.org/r/506956 (https://phabricator.wikimedia.org/T221212) (owner: 10Volans) [00:30:54] (03CR) 10Volans: [C: 03+1] "LGTM" [software/netbox-reports] - 10https://gerrit.wikimedia.org/r/508456 (owner: 10CRusnov) [00:34:08] (03PS2) 10Catrope: Revert "Disable MessageBlobStore::clear() via hook" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/508476 (https://phabricator.wikimedia.org/T217719) [00:36:06] (03CR) 10Catrope: "PS2: rebase; unsurprisingly, reverting a patch from 5 1/2 years ago runs into conflicts" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/508476 (https://phabricator.wikimedia.org/T217719) (owner: 10Catrope) [00:38:06] !log jforrester@deploy1001 Synchronized php-1.34.0-wmf.3/extensions/VisualEditor/modules/ve-mw/init/ve.init.mw.ArticleTarget.js: Hot-deploy fix for visual diffs on mobile in non-section mode T222489 (duration: 00m 53s) [00:38:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:38:10] T222489: Visual diffs not showing in mobile VE (when visual section editing is disabled) - https://phabricator.wikimedia.org/T222489 [00:53:16] !log install2002 - disabling puppet, live hacking DHCP config for db2103 
to not serve installer via http to debug install issue for T221532 which seems like T190424#4548003 [00:53:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:53:22] T221532: rack/setup/install db2[103-120].codfw.wmnet (18 hosts) - https://phabricator.wikimedia.org/T221532 [00:53:22] T190424: modify labs-hosts1-vlans for http load of installer kernel - https://phabricator.wikimedia.org/T190424 [01:06:23] 10Operations, 10ops-codfw, 10DBA, 10Goal, 10Patch-For-Review: rack/setup/install db2[103-120].codfw.wmnet (18 hosts) - https://phabricator.wikimedia.org/T221532 (10Papaul) I tried to PXE boot the first server, on the switch side everything looks good since I can see that the switch learned the MAC addres... [01:06:43] 10Operations, 10ops-codfw, 10DBA, 10Goal, 10Patch-For-Review: rack/setup/install db2[103-120].codfw.wmnet (18 hosts) - https://phabricator.wikimedia.org/T221532 (10Dzahn) @ayounsi @RobH These servers have an install issue where they get a DHCP ACK followed by "Serving stretch-installer/debian-installer/a... [01:07:57] 10Operations, 10ops-codfw, 10DBA, 10Goal: rack/setup/install db2[103-120].codfw.wmnet (18 hosts) - https://phabricator.wikimedia.org/T221532 (10Dzahn) [01:10:53] PROBLEM - puppet last run on cp3042 is CRITICAL: CRITICAL: Failed to apply catalog, zero resources tracked by Puppet. It might be a dependency cycle. 
[01:28:40] (03PS3) 10Catrope: Revert "Disable MessageBlobStore::clear() via hook" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/508476 (https://phabricator.wikimedia.org/T222539) [01:28:51] (03PS4) 10Catrope: Revert "Disable MessageBlobStore::clear() via hook" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/508476 (https://phabricator.wikimedia.org/T222539) [01:42:43] RECOVERY - puppet last run on cp3042 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [02:00:06] 10Operations, 10ops-codfw, 10DBA, 10Goal: rack/setup/install db2[103-120].codfw.wmnet (18 hosts) - https://phabricator.wikimedia.org/T221532 (10Papaul) The issue was in the BIOS setting. The boot mode was set to UEFI after changing it to BIOS it works. [02:06:33] (03PS4) 10Tim Starling: Default to Preprocessor_Hash on both PHP7 and HHVM [mediawiki-config] - 10https://gerrit.wikimedia.org/r/502567 (https://phabricator.wikimedia.org/T216664) (owner: 10C. Scott Ananian) [02:06:44] (03CR) 10Tim Starling: [C: 03+2] Default to Preprocessor_Hash on both PHP7 and HHVM [mediawiki-config] - 10https://gerrit.wikimedia.org/r/502567 (https://phabricator.wikimedia.org/T216664) (owner: 10C. Scott Ananian) [02:07:43] (03Merged) 10jenkins-bot: Default to Preprocessor_Hash on both PHP7 and HHVM [mediawiki-config] - 10https://gerrit.wikimedia.org/r/502567 (https://phabricator.wikimedia.org/T216664) (owner: 10C. Scott Ananian) [02:07:59] (03CR) 10jenkins-bot: Default to Preprocessor_Hash on both PHP7 and HHVM [mediawiki-config] - 10https://gerrit.wikimedia.org/r/502567 (https://phabricator.wikimedia.org/T216664) (owner: 10C. 
Scott Ananian) [02:12:00] !log tstarling@deploy1001 Synchronized wmf-config/InitialiseSettings.php: Use Preprocessor_Hash unconditionally (duration: 00m 52s) [02:12:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:29:55] PROBLEM - HTTP availability for Nginx -SSL terminators- at esams on icinga1001 is CRITICAL: cluster=cache_text site=esams https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [02:30:53] PROBLEM - HTTP availability for Nginx -SSL terminators- at ulsfo on icinga1001 is CRITICAL: cluster=cache_text site=ulsfo https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [02:31:15] PROBLEM - HTTP availability for Varnish at eqiad on icinga1001 is CRITICAL: job=varnish-text site=eqiad https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 [02:31:33] PROBLEM - HTTP availability for Nginx -SSL terminators- at eqiad on icinga1001 is CRITICAL: cluster=cache_text site=eqiad https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [02:31:57] PROBLEM - HTTP availability for Nginx -SSL terminators- at codfw on icinga1001 is CRITICAL: cluster=cache_text site=codfw https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [02:32:03] PROBLEM - Text HTTP 5xx reqs/min on graphite1004 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=All&var-cache_type=text&var-status_type=5 [02:32:21] PROBLEM - Eqsin HTTP 5xx reqs/min on graphite1004 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=eqsin&var-cache_type=All&var-status_type=5 
[02:32:33] PROBLEM - Eqiad HTTP 5xx reqs/min on graphite1004 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=eqiad&var-cache_type=All&var-status_type=5 [02:33:29] PROBLEM - Esams HTTP 5xx reqs/min on graphite1004 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=esams&var-cache_type=All&var-status_type=5 [02:34:33] RECOVERY - HTTP availability for Nginx -SSL terminators- at codfw on icinga1001 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [02:34:45] PROBLEM - Ulsfo HTTP 5xx reqs/min on graphite1004 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=ulsfo&var-cache_type=All&var-status_type=5 [02:34:51] RECOVERY - HTTP availability for Nginx -SSL terminators- at ulsfo on icinga1001 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [02:35:13] RECOVERY - HTTP availability for Nginx -SSL terminators- at esams on icinga1001 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [02:35:13] RECOVERY - HTTP availability for Varnish at eqiad on icinga1001 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 [02:35:31] RECOVERY - HTTP availability for Nginx -SSL terminators- at eqiad on icinga1001 is OK: All metrics within thresholds. 
https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [02:36:55] (03PS28) 10Mathew.onipe: icinga: create and apply cirrus config check [puppet] - 10https://gerrit.wikimedia.org/r/507045 (https://phabricator.wikimedia.org/T218932) [02:38:08] (03CR) 10Mathew.onipe: icinga: create and apply cirrus config check (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/507045 (https://phabricator.wikimedia.org/T218932) (owner: 10Mathew.onipe) [02:39:59] RECOVERY - Text HTTP 5xx reqs/min on graphite1004 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=All&var-cache_type=text&var-status_type=5 [02:40:03] RECOVERY - Ulsfo HTTP 5xx reqs/min on graphite1004 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=ulsfo&var-cache_type=All&var-status_type=5 [02:40:29] RECOVERY - Eqiad HTTP 5xx reqs/min on graphite1004 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=eqiad&var-cache_type=All&var-status_type=5 [02:41:27] RECOVERY - Esams HTTP 5xx reqs/min on graphite1004 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=esams&var-cache_type=All&var-status_type=5 [02:41:35] RECOVERY - Eqsin HTTP 5xx reqs/min on graphite1004 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=eqsin&var-cache_type=All&var-status_type=5 [02:49:42] (03PS29) 10Mathew.onipe: icinga: create and apply cirrus config check [puppet] - 
10https://gerrit.wikimedia.org/r/507045 (https://phabricator.wikimedia.org/T218932) [02:56:09] (03PS30) 10Mathew.onipe: icinga: create and apply cirrus config check [puppet] - 10https://gerrit.wikimedia.org/r/507045 (https://phabricator.wikimedia.org/T218932) [03:05:03] (03PS31) 10Mathew.onipe: icinga: create and apply cirrus config check [puppet] - 10https://gerrit.wikimedia.org/r/507045 (https://phabricator.wikimedia.org/T218932) [03:11:19] 10Operations, 10ops-codfw, 10DBA, 10Goal: rack/setup/install db2[103-120].codfw.wmnet (18 hosts) - https://phabricator.wikimedia.org/T221532 (10Papaul) [03:11:34] (03PS32) 10Mathew.onipe: icinga: create and apply cirrus config check [puppet] - 10https://gerrit.wikimedia.org/r/507045 (https://phabricator.wikimedia.org/T218932) [03:15:58] (03PS33) 10Mathew.onipe: icinga: create and apply cirrus config check [puppet] - 10https://gerrit.wikimedia.org/r/507045 (https://phabricator.wikimedia.org/T218932) [05:03:01] (03PS2) 10Marostegui: db-codfw.php: Promote db2045 to master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/508321 (https://phabricator.wikimedia.org/T219493) [05:03:15] (03PS2) 10Marostegui: mariadb: Promote db2045 to codfw x1 master [puppet] - 10https://gerrit.wikimedia.org/r/508168 (https://phabricator.wikimedia.org/T219493) [05:05:49] (03CR) 10Mobrovac: "recheck" [software/service-checker] - 10https://gerrit.wikimedia.org/r/507531 (https://phabricator.wikimedia.org/T220401) (owner: 10Mobrovac) [05:11:22] 10Operations, 10ops-eqiad, 10DBA, 10Goal, and 2 others: rack/setup/install db11[26-38].eqiad.wmnet - https://phabricator.wikimedia.org/T211613 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by marostegui on cumin1001.eqiad.wmnet for hosts: ` ['db1126.eqiad.wmnet'] ` The log can be found in `/v... 
[05:12:47] !log Change topology on x1 codfw to promote db2045 to master T219493 [05:12:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:12:52] T219493: Decommission 2 codfw x1 hosts db2033 and db2034 - https://phabricator.wikimedia.org/T219493 [05:19:53] (03CR) 10Marostegui: [C: 03+2] mariadb: Promote db2045 to codfw x1 master [puppet] - 10https://gerrit.wikimedia.org/r/508168 (https://phabricator.wikimedia.org/T219493) (owner: 10Marostegui) [05:20:40] (03CR) 10Marostegui: [C: 03+2] db-codfw.php: Promote db2045 to master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/508321 (https://phabricator.wikimedia.org/T219493) (owner: 10Marostegui) [05:21:50] (03Merged) 10jenkins-bot: db-codfw.php: Promote db2045 to master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/508321 (https://phabricator.wikimedia.org/T219493) (owner: 10Marostegui) [05:22:04] (03CR) 10jenkins-bot: db-codfw.php: Promote db2045 to master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/508321 (https://phabricator.wikimedia.org/T219493) (owner: 10Marostegui) [05:23:33] !log marostegui@deploy1001 Synchronized wmf-config/db-codfw.php: Promote db2045 to codfw x1 master T219493 (duration: 00m 55s) [05:23:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:23:38] T219493: Prepare to decommission 2 codfw x1 hosts db2033 and db2034 - https://phabricator.wikimedia.org/T219493 [05:27:43] 10Operations, 10ops-eqiad, 10DBA, 10Goal, and 2 others: rack/setup/install db11[26-38].eqiad.wmnet - https://phabricator.wikimedia.org/T211613 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['db1126.eqiad.wmnet'] ` and were **ALL** successful. [05:27:55] 10Operations, 10ops-codfw, 10DBA, 10Goal: rack/setup/install db2[103-120].codfw.wmnet (18 hosts) - https://phabricator.wikimedia.org/T221532 (10Marostegui) I can confirm db2103 looks good. ` root@db2103:~# free -g total used free shared buff/cache available Mem:... 
[05:29:49] 10Operations, 10ops-eqiad, 10DBA, 10Goal, and 2 others: rack/setup/install db11[26-38].eqiad.wmnet - https://phabricator.wikimedia.org/T211613 (10Marostegui) db1126 installed correctly: ` root@db1126:~# free -g ; megacli -LDInfo -Lall -aALL ; df -hT /srv total used free sha... [05:29:59] 10Operations, 10ops-eqiad, 10DBA, 10Goal, and 2 others: rack/setup/install db11[26-38].eqiad.wmnet - https://phabricator.wikimedia.org/T211613 (10Marostegui) [05:31:25] 10Operations, 10ops-eqiad, 10DBA, 10Goal, and 2 others: rack/setup/install db11[26-38].eqiad.wmnet - https://phabricator.wikimedia.org/T211613 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by marostegui on cumin1001.eqiad.wmnet for hosts: ` ['db1128.eqiad.wmnet', 'db1129.eqiad.wmnet', 'db1130... [05:34:49] PROBLEM - BFD status on cr4-ulsfo is CRITICAL: CRIT: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [05:45:05] (03PS1) 10Marostegui: db2103: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/508492 (https://phabricator.wikimedia.org/T221532) [05:46:10] (03CR) 10Marostegui: [C: 03+2] db2103: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/508492 (https://phabricator.wikimedia.org/T221532) (owner: 10Marostegui) [05:46:34] (03CR) 10Elukey: "Joseph everything looks good, thanks a lot! If you still have patience I'd ask you one last change.. 
if we remove the systemd timer code p" [puppet] - 10https://gerrit.wikimedia.org/r/508362 (https://phabricator.wikimedia.org/T222378) (owner: 10Joal) [05:46:51] 10Operations, 10ops-eqiad, 10DBA, 10Goal, and 2 others: rack/setup/install db11[26-38].eqiad.wmnet - https://phabricator.wikimedia.org/T211613 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['db1130.eqiad.wmnet'] ` Of which those **FAILED**: ` ['db1130.eqiad.wmnet'] ` [05:49:04] 10Operations, 10ops-eqiad, 10DBA, 10Goal, and 2 others: rack/setup/install db11[26-38].eqiad.wmnet - https://phabricator.wikimedia.org/T211613 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by marostegui on cumin1001.eqiad.wmnet for hosts: ` ['db1130.eqiad.wmnet'] ` The log can be found in `/v... [05:50:06] 10Operations, 10ops-eqiad, 10DBA, 10Goal, and 2 others: rack/setup/install db11[26-38].eqiad.wmnet - https://phabricator.wikimedia.org/T211613 (10Marostegui) db1128 and db1129 installed correctly: ` root@db1128:~# free -g ; megacli -LDInfo -Lall -aALL ; df -hT /srv total used fr... 
[05:50:38] 10Operations, 10ops-eqiad, 10DBA, 10Goal, and 2 others: rack/setup/install db11[26-38].eqiad.wmnet - https://phabricator.wikimedia.org/T211613 (10Marostegui) [05:50:39] (03CR) 10Giuseppe Lavagetto: "recheck" [software/spicerack] - 10https://gerrit.wikimedia.org/r/503946 (owner: 10Giuseppe Lavagetto) [05:57:58] (03PS1) 10Marostegui: install_server: Add DHCP lease for db1127 [puppet] - 10https://gerrit.wikimedia.org/r/508493 (https://phabricator.wikimedia.org/T211613) [05:59:01] (03CR) 10Marostegui: [C: 03+2] install_server: Add DHCP lease for db1127 [puppet] - 10https://gerrit.wikimedia.org/r/508493 (https://phabricator.wikimedia.org/T211613) (owner: 10Marostegui) [06:04:41] 10Operations, 10ops-eqiad, 10DBA, 10Goal, and 2 others: rack/setup/install db11[26-38].eqiad.wmnet - https://phabricator.wikimedia.org/T211613 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['db1130.eqiad.wmnet'] ` and were **ALL** successful. [06:05:30] 10Operations, 10ops-eqiad, 10DBA, 10Goal, and 2 others: rack/setup/install db11[26-38].eqiad.wmnet - https://phabricator.wikimedia.org/T211613 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by marostegui on cumin1001.eqiad.wmnet for hosts: ` ['db1127.eqiad.wmnet'] ` The log can be found in `/v... [06:06:50] 10Operations, 10ops-eqiad, 10DBA, 10Goal, and 2 others: rack/setup/install db11[26-38].eqiad.wmnet - https://phabricator.wikimedia.org/T211613 (10Marostegui) db1130 has been installed correctly: ` root@db1130:~# free -g ; megacli -LDInfo -Lall -aALL ; df -hT /srv total used fre... 
[06:07:08] 10Operations, 10ops-eqiad, 10DBA, 10Goal, and 2 others: rack/setup/install db11[26-38].eqiad.wmnet - https://phabricator.wikimedia.org/T211613 (10Marostegui) [06:08:58] (03CR) 10Giuseppe Lavagetto: "recheck" [software/spicerack] - 10https://gerrit.wikimedia.org/r/504578 (owner: 10Giuseppe Lavagetto) [06:14:11] (03CR) 10Giuseppe Lavagetto: "recheck" [software/spicerack] - 10https://gerrit.wikimedia.org/r/503947 (owner: 10Giuseppe Lavagetto) [06:21:56] 10Operations, 10ops-eqiad, 10DBA, 10Goal, and 2 others: rack/setup/install db11[26-38].eqiad.wmnet - https://phabricator.wikimedia.org/T211613 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['db1127.eqiad.wmnet'] ` and were **ALL** successful. [06:22:29] 10Operations, 10ops-eqiad, 10DBA, 10Goal, and 2 others: rack/setup/install db11[26-38].eqiad.wmnet - https://phabricator.wikimedia.org/T211613 (10Marostegui) @RobH @Cmjohnson I have seen that the idrac for db1127 was working already so I have grabbed the MAC for the NIC and added the DHCP entry for it. So... [06:23:20] 10Operations, 10ops-eqiad, 10DBA, 10Goal, and 2 others: rack/setup/install db11[26-38].eqiad.wmnet - https://phabricator.wikimedia.org/T211613 (10Marostegui) [06:23:55] 10Operations, 10ops-eqiad, 10DBA, 10Goal, and 2 others: rack/setup/install db11[26-38].eqiad.wmnet - https://phabricator.wikimedia.org/T211613 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by marostegui on cumin1001.eqiad.wmnet for hosts: ` ['db1131.eqiad.wmnet', 'db1132.eqiad.wmnet', 'db1133... [06:24:43] (03CR) 10Vgutierrez: [C: 03+1] "LGTM" [debs/pybal] - 10https://gerrit.wikimedia.org/r/507740 (owner: 10Giuseppe Lavagetto) [06:26:40] (03CR) 10Luca Mauri: "> Presumably this should only get deployed just before I6d0215082f?" 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/508130 (https://phabricator.wikimedia.org/T222516) (owner: 10Luca Mauri) [06:27:52] PROBLEM - netbox HTTPS on netmon1002 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 547 bytes in 0.007 second response time https://wikitech.wikimedia.org/wiki/Netbox [06:28:30] PROBLEM - Check systemd state on netmon1002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [06:31:26] PROBLEM - puppet last run on labmon1002 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 5 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/etc/profile.d/field.sh] [06:32:40] PROBLEM - puppet last run on mw1305 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 6 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/etc/tmpreaper.conf] [06:37:50] PROBLEM - Check the last execution of netbox_ganeti_codfw_sync on netmon1002 is CRITICAL: CRITICAL: Status of the systemd unit netbox_ganeti_codfw_sync [06:39:06] PROBLEM - Check the last execution of netbox_ganeti_eqiad_sync on netmon1002 is CRITICAL: CRITICAL: Status of the systemd unit netbox_ganeti_eqiad_sync [06:44:36] !log restart uwsgi-netbox on netmon1002 after segfault [06:44:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:45:02] RECOVERY - netbox HTTPS on netmon1002 is OK: HTTP OK: HTTP/1.1 302 Found - 348 bytes in 0.602 second response time https://wikitech.wikimedia.org/wiki/Netbox [06:47:29] 10Operations, 10ops-eqiad, 10DBA, 10Goal, and 2 others: rack/setup/install db11[26-38].eqiad.wmnet - https://phabricator.wikimedia.org/T211613 (10Marostegui) db1131 installed correctly: ` root@db1131:~# free -g ; megacli -LDInfo -Lall -aALL ; df -hT /srv total used free sh... 
[06:47:42] 10Operations, 10ops-eqiad, 10DBA, 10Goal, and 2 others: rack/setup/install db11[26-38].eqiad.wmnet - https://phabricator.wikimedia.org/T211613 (10Marostegui) [06:48:01] 10Operations, 10ops-eqiad, 10DBA, 10Goal, and 2 others: rack/setup/install db11[26-38].eqiad.wmnet - https://phabricator.wikimedia.org/T211613 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['db1132.eqiad.wmnet', 'db1133.eqiad.wmnet'] ` Of which those **FAILED**: ` ['db1132.eqiad.wmnet', 'db113... [06:48:34] (03PS34) 10Mathew.onipe: icinga: create and apply cirrus config check [puppet] - 10https://gerrit.wikimedia.org/r/507045 (https://phabricator.wikimedia.org/T218932) [06:49:58] (03PS2) 10Matthias Mullie: SDC: Enable feature flag for depicts in UW on Commons [mediawiki-config] - 10https://gerrit.wikimedia.org/r/507943 (https://phabricator.wikimedia.org/T217024) [06:50:14] 10Operations, 10ops-eqiad, 10DBA, 10Goal, and 2 others: rack/setup/install db11[26-38].eqiad.wmnet - https://phabricator.wikimedia.org/T211613 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by marostegui on cumin1001.eqiad.wmnet for hosts: ` ['db1131.eqiad.wmnet', 'db1132.eqiad.wmnet'] ` The l... 
[06:51:49] RECOVERY - Check systemd state on netmon1002 is OK: OK - running: The system is fully operational [06:57:05] RECOVERY - puppet last run on labmon1002 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [06:58:01] RECOVERY - Check the last execution of netbox_ganeti_codfw_sync on netmon1002 is OK: OK: Status of the systemd unit netbox_ganeti_codfw_sync [06:58:05] RECOVERY - puppet last run on mw1305 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [06:59:19] RECOVERY - Check the last execution of netbox_ganeti_eqiad_sync on netmon1002 is OK: OK: Status of the systemd unit netbox_ganeti_eqiad_sync [06:59:53] !log updating firmware-bnx2x (from stretch point release, this is a NOP, the source package firmware-nonfree was updated for various Wifi chipsets we don't use, doublechecked by comparing check sums for old and new bnx2x firmware) [06:59:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:05:39] (03PS35) 10Mathew.onipe: icinga: create and apply cirrus config check [puppet] - 10https://gerrit.wikimedia.org/r/507045 (https://phabricator.wikimedia.org/T218932) [07:07:31] 10Operations, 10ops-eqiad, 10DBA, 10Goal, and 2 others: rack/setup/install db11[26-38].eqiad.wmnet - https://phabricator.wikimedia.org/T211613 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['db1131.eqiad.wmnet', 'db1132.eqiad.wmnet'] ` and were **ALL** successful. 
[07:12:22] 10Operations, 10ops-eqiad, 10DBA, 10Goal, and 2 others: rack/setup/install db11[26-38].eqiad.wmnet - https://phabricator.wikimedia.org/T211613 (10Marostegui) I am troubleshooting db1133's RAID, which is OFFLINE due to several disks being OFFLINE [07:18:32] 10Operations, 10ops-eqiad, 10DBA, 10Goal, and 2 others: rack/setup/install db11[26-38].eqiad.wmnet - https://phabricator.wikimedia.org/T211613 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by marostegui on cumin1001.eqiad.wmnet for hosts: ` ['db1134.eqiad.wmnet', 'db1135.eqiad.wmnet', 'db1136... [07:18:48] !log mobrovac@deploy1001 Started deploy [restbase/deploy@d91ee4c] (dev-cluster): Remove section functionality from the REST API [07:18:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:20:38] (03PS1) 10Filippo Giunchedi: hieradata: bast3002 to prometheus v2 [puppet] - 10https://gerrit.wikimedia.org/r/508503 (https://phabricator.wikimedia.org/T187987) [07:21:49] !log Optimize tables on pc1010 [07:21:50] !log mobrovac@deploy1001 Finished deploy [restbase/deploy@d91ee4c] (dev-cluster): Remove section functionality from the REST API (duration: 03m 02s) [07:21:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:21:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:22:24] (03CR) 10Filippo Giunchedi: [C: 03+2] hieradata: bast3002 to prometheus v2 [puppet] - 10https://gerrit.wikimedia.org/r/508503 (https://phabricator.wikimedia.org/T187987) (owner: 10Filippo Giunchedi) [07:22:34] (03PS2) 10Filippo Giunchedi: hieradata: bast3002 to prometheus v2 [puppet] - 10https://gerrit.wikimedia.org/r/508503 (https://phabricator.wikimedia.org/T187987) [07:26:38] !log mobrovac@deploy1001 Started deploy [restbase/deploy@d91ee4c]: Remove section functionality from the REST API - T216636 [07:26:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:26:42] T216636: Consider deprecating 
section editing API in RESTBase - https://phabricator.wikimedia.org/T216636 [07:27:03] !log upgrade prometheus on bast3002 - T187987 [07:27:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:27:09] T187987: 100% of Prometheus traffic served by Prometheus v2 - https://phabricator.wikimedia.org/T187987 [07:27:09] PROBLEM - DPKG on ms-be1024 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [07:27:27] 10Operations, 10ops-codfw, 10Traffic: cp2009 down and mgmt console not reachable - https://phabricator.wikimedia.org/T222459 (10MoritzMuehlenhoff) a:03Papaul [07:28:49] (03CR) 10Jcrespo: [C: 04-1] mysql: Fix installing package on stretch [puppet] - 10https://gerrit.wikimedia.org/r/354131 (owner: 10Paladox) [07:29:31] RECOVERY - DPKG on ms-be1024 is OK: All packages OK [07:36:20] 10Operations, 10ops-eqiad, 10DBA, 10Goal, and 2 others: rack/setup/install db11[26-38].eqiad.wmnet - https://phabricator.wikimedia.org/T211613 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['db1136.eqiad.wmnet', 'db1134.eqiad.wmnet', 'db1135.eqiad.wmnet'] ` and were **ALL** successful. [07:38:50] 10Operations, 10ops-eqiad, 10DBA, 10Goal, and 2 others: rack/setup/install db11[26-38].eqiad.wmnet - https://phabricator.wikimedia.org/T211613 (10Marostegui) db1132 installed correctly: ` root@db1132:~# free -g ; megacli -LDInfo -Lall -aALL ; df -hT /srv total used free sha... [07:39:23] 10Operations, 10ops-eqiad, 10DBA, 10Goal, and 2 others: rack/setup/install db11[26-38].eqiad.wmnet - https://phabricator.wikimedia.org/T211613 (10Marostegui) [07:39:56] 10Operations, 10ops-eqiad, 10DBA, 10Goal, and 2 others: rack/setup/install db11[26-38].eqiad.wmnet - https://phabricator.wikimedia.org/T211613 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by marostegui on cumin1001.eqiad.wmnet for hosts: ` ['db1133.eqiad.wmnet'] ` The log can be found in `/v... 
[07:40:33] 10Operations, 10ops-eqiad, 10DBA, 10Goal, and 2 others: rack/setup/install db11[26-38].eqiad.wmnet - https://phabricator.wikimedia.org/T211613 (10Marostegui) >>! In T211613#5163463, @Marostegui wrote: > I am troubleshooting db1133's RAID, which is OFFLINE due to several disks being OFFLINE The RAID is now... [07:47:25] PROBLEM - puppet last run on ms-be2042 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 7 minutes ago with 1 failures. Failed resources (up to 3 shown): Package[initramfs-tools] [07:49:03] PROBLEM - HTTP availability for Nginx -SSL terminators- at codfw on icinga1001 is CRITICAL: cluster=cache_text site=codfw https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [07:51:24] !log mobrovac@deploy1001 Finished deploy [restbase/deploy@d91ee4c]: Remove section functionality from the REST API - T216636 (duration: 24m 46s) [07:51:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:51:28] T216636: Consider deprecating section editing API in RESTBase - https://phabricator.wikimedia.org/T216636 [07:53:23] RECOVERY - HTTP availability for Nginx -SSL terminators- at codfw on icinga1001 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [07:56:23] 10Operations, 10ops-eqiad, 10RESTBase, 10Core Platform Team Backlog (Watching / External), and 2 others: rack/setup/install restbase10[19-27].eqiad.wmnet - https://phabricator.wikimedia.org/T219404 (10mobrovac) >>! In T219404#5160524, @Cmjohnson wrote: > @moborvac I haven't had a chance to get to them unti... 
[07:59:01] !log updating base-files from recent stretch point release [07:59:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:00:01] 10Operations, 10ops-codfw, 10media-storage, 10observability: ms-be2043 'sdd' throwing lots of errors - https://phabricator.wikimedia.org/T222654 (10fgiunchedi) Indeed usually it is a disk replacement, under warranty in this case [08:00:53] 10Operations, 10ops-eqiad, 10Operations-Software-Development, 10observability, 10Patch-For-Review: ms-be1043 sdk failed - https://phabricator.wikimedia.org/T218544 (10fgiunchedi) Any updates on this @Cmjohnson ? [08:09:25] 10Operations, 10cloud-services-team: Remove facter from openstack-mitaka-jessie component - https://phabricator.wikimedia.org/T222685 (10MoritzMuehlenhoff) [08:09:30] (03CR) 10Ema: [C: 03+1] Allow proxyfetch to check more than one url at a time [debs/pybal] - 10https://gerrit.wikimedia.org/r/507740 (owner: 10Giuseppe Lavagetto) [08:11:04] (03PS4) 10Giuseppe Lavagetto: role::deployment_server: reorganize code, step 1 [puppet] - 10https://gerrit.wikimedia.org/r/508335 [08:13:29] RECOVERY - puppet last run on ms-be2042 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [08:14:58] 10Operations: rsyslog's in:imtcp thread stuck on old sockets - https://phabricator.wikimedia.org/T199406 (10MoritzMuehlenhoff) >>! In T199406#5160429, @fgiunchedi wrote: > While annoying, I think before digging further we should focus on upgrading to stretch central syslog servers instead. 
+1 [08:15:31] 10Operations: Integrate Stretch 9.9 point update - https://phabricator.wikimedia.org/T222053 (10MoritzMuehlenhoff) [08:16:05] (03PS2) 10Muehlenhoff: Remove obsolete openstack::nova::compute::audit [puppet] - 10https://gerrit.wikimedia.org/r/508308 [08:17:17] 10Operations, 10ops-eqiad, 10DBA, 10Goal, and 2 others: rack/setup/install db11[26-38].eqiad.wmnet - https://phabricator.wikimedia.org/T211613 (10Marostegui) db1133 had another issue: On reboot to go for an install and while connected on the idrac this is what I get: ` Unified Server Configurator does not... [08:19:03] (03CR) 10Giuseppe Lavagetto: [C: 03+1] "Change seems benign, see https://puppet-compiler.wmflabs.org/compiler1002/16389/" [puppet] - 10https://gerrit.wikimedia.org/r/508335 (owner: 10Giuseppe Lavagetto) [08:22:32] (03CR) 10Muehlenhoff: [C: 03+2] Remove obsolete openstack::nova::compute::audit [puppet] - 10https://gerrit.wikimedia.org/r/508308 (owner: 10Muehlenhoff) [08:24:05] (03CR) 10Giuseppe Lavagetto: [C: 03+2] Allow proxyfetch to check more than one url at a time [debs/pybal] - 10https://gerrit.wikimedia.org/r/507740 (owner: 10Giuseppe Lavagetto) [08:24:56] (03Merged) 10jenkins-bot: Allow proxyfetch to check more than one url at a time [debs/pybal] - 10https://gerrit.wikimedia.org/r/507740 (owner: 10Giuseppe Lavagetto) [08:31:55] 10Operations, 10ops-eqiad, 10Traffic: cp1083 crashed - https://phabricator.wikimedia.org/T222620 (10ema) [08:32:25] 10Operations, 10ops-eqiad, 10Traffic: cp1083 crashed - https://phabricator.wikimedia.org/T222620 (10ema) [08:35:07] 10Operations, 10ops-eqiad, 10DBA, 10Goal, and 2 others: rack/setup/install db11[26-38].eqiad.wmnet - https://phabricator.wikimedia.org/T211613 (10Marostegui) And now db1133 on reboot: ` FW could not sync up config/prop changes for some of the VD's/PD's Press any key to continue, or 'C' to load the configur... 
[08:36:21] 10Operations, 10ops-eqiad, 10DBA, 10Goal, and 2 others: rack/setup/install db11[26-38].eqiad.wmnet - https://phabricator.wikimedia.org/T211613 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by marostegui on cumin1001.eqiad.wmnet for hosts: ` ['db1137.eqiad.wmnet', 'db1138.eqiad.wmnet'] ` The l... [08:39:28] !log repool cp1083 T222620 [08:39:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:39:32] T222620: cp1083 crashed - https://phabricator.wikimedia.org/T222620 [08:42:59] 10Operations, 10ops-eqiad, 10Traffic: cp1083 crashed - https://phabricator.wikimedia.org/T222620 (10ema) Interestingly, there was a memory usage spike right before the host crashed. {F28951427} [08:50:25] (03PS1) 10Arturo Borrero Gonzalez: reprepro: openstack-mitaka-jessie: drop facter [puppet] - 10https://gerrit.wikimedia.org/r/508517 (https://phabricator.wikimedia.org/T222685) [08:51:16] !log T222685 remove facter from jessie-wikimedia/openstack-mitaka-jessie [08:51:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:51:20] T222685: Remove facter from openstack-mitaka-jessie component - https://phabricator.wikimedia.org/T222685 [08:52:35] 10Operations, 10ops-eqiad, 10DBA, 10Goal, and 2 others: rack/setup/install db11[26-38].eqiad.wmnet - https://phabricator.wikimedia.org/T211613 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['db1138.eqiad.wmnet', 'db1137.eqiad.wmnet'] ` and were **ALL** successful. [08:53:34] 10Operations, 10Traffic: false positives in check_trafficserver_config_status - https://phabricator.wikimedia.org/T222642 (10ema) I've ack'ed the warnings in Icinga for the time being. 
[08:53:47] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] reprepro: openstack-mitaka-jessie: drop facter [puppet] - 10https://gerrit.wikimedia.org/r/508517 (https://phabricator.wikimedia.org/T222685) (owner: 10Arturo Borrero Gonzalez) [08:55:48] (03CR) 10Filippo Giunchedi: initial attempt at a varnishkafka exporter (032 comments) [debs/prometheus-varnishkafka-exporter] - 10https://gerrit.wikimedia.org/r/507632 (https://phabricator.wikimedia.org/T196066) (owner: 10Cwhite) [08:55:51] (03PS5) 10Joal: Update analytics sqoop scheduling [puppet] - 10https://gerrit.wikimedia.org/r/508362 (https://phabricator.wikimedia.org/T222378) [08:56:41] elukey: --^ :) [08:58:08] (03CR) 10Vgutierrez: [C: 03+2] trafficserver: Allow disabling caching requests [puppet] - 10https://gerrit.wikimedia.org/r/506390 (https://phabricator.wikimedia.org/T221594) (owner: 10Vgutierrez) [08:58:42] 10Operations, 10Patch-For-Review, 10cloud-services-team (Kanban): Remove facter from openstack-mitaka-jessie component - https://phabricator.wikimedia.org/T222685 (10aborrero) 05Open→03Resolved p:05Triage→03Normal a:03aborrero [08:59:04] 10Operations, 10ops-eqiad, 10DBA, 10Goal, and 2 others: rack/setup/install db11[26-38].eqiad.wmnet - https://phabricator.wikimedia.org/T211613 (10Marostegui) db1137 installed correctly: ` root@db1137:~# free -g ; megacli -LDInfo -Lall -aALL ; df -hT /srv total used free sha... 
[08:59:21] 10Operations, 10ops-eqiad, 10DBA, 10Goal, and 2 others: rack/setup/install db11[26-38].eqiad.wmnet - https://phabricator.wikimedia.org/T211613 (10Marostegui) [08:59:27] (03PS18) 10Vgutierrez: trafficserver: Allow disabling caching requests [puppet] - 10https://gerrit.wikimedia.org/r/506390 (https://phabricator.wikimedia.org/T221594) [09:00:02] (03CR) 1020after4: [C: 03+1] admins: simplify sudo privs for phab-admin group [puppet] - 10https://gerrit.wikimedia.org/r/508373 (owner: 10Dzahn) [09:00:04] 10Operations, 10ops-eqiad, 10DBA, 10Goal, and 2 others: rack/setup/install db11[26-38].eqiad.wmnet - https://phabricator.wikimedia.org/T211613 (10Marostegui) The only pending host to install is db1133 which is having issues and we need on-site help from @Cmjohnson (T211613#5163570) - I have already pinged... [09:00:45] (03CR) 10Volans: "Looks mostly good, minor nitpicks and a question inline." (033 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/504570 (https://phabricator.wikimedia.org/T220946) (owner: 10Mathew.onipe) [09:01:21] (03CR) 10DCausse: icinga: create and apply cirrus config check (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/507045 (https://phabricator.wikimedia.org/T218932) (owner: 10Mathew.onipe) [09:03:03] (03CR) 10DCausse: icinga: create and apply cirrus config check (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/507045 (https://phabricator.wikimedia.org/T218932) (owner: 10Mathew.onipe) [09:03:46] (03CR) 10Elukey: Update analytics sqoop scheduling (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/508362 (https://phabricator.wikimedia.org/T222378) (owner: 10Joal) [09:04:16] joal: --^ [09:08:51] (03PS1) 10Ema: ATS: log debug messages as such [puppet] - 10https://gerrit.wikimedia.org/r/508519 [09:09:10] (03PS1) 10Fsero: Initial WMF debianization [debs/helmfile] (debian/buster-wikimedia) - 10https://gerrit.wikimedia.org/r/508521 [09:09:46] (03CR) 10jerkins-bot: [V: 04-1] ATS: log debug messages as such 
[puppet] - 10https://gerrit.wikimedia.org/r/508519 (owner: 10Ema) [09:10:28] (03PS2) 10Fsero: Initial WMF debianization [debs/helmfile] (debian/buster-wikimedia) - 10https://gerrit.wikimedia.org/r/508521 [09:10:48] (03CR) 10Fsero: [V: 03+2 C: 03+2] Initial WMF debianization [debs/helmfile] (debian/buster-wikimedia) - 10https://gerrit.wikimedia.org/r/508521 (owner: 10Fsero) [09:15:38] (03CR) 10Fsero: [V: 03+2 C: 03+2] "Built on boron without issues" [debs/helmfile] (debian/buster-wikimedia) - 10https://gerrit.wikimedia.org/r/508521 (owner: 10Fsero) [09:17:37] (03PS2) 10Ema: ATS: log debug messages as such [puppet] - 10https://gerrit.wikimedia.org/r/508519 [09:18:56] (03PS5) 10Vgutierrez: prometheus: Support several instances of the ATS exporter [puppet] - 10https://gerrit.wikimedia.org/r/506659 (https://phabricator.wikimedia.org/T221217) [09:19:41] (03CR) 10Vgutierrez: [C: 03+1] ATS: log debug messages as such [puppet] - 10https://gerrit.wikimedia.org/r/508519 (owner: 10Ema) [09:20:24] PROBLEM - pdfrender on scb1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://phabricator.wikimedia.org/T174916 [09:21:44] (03PS1) 10Marostegui: mariadb: Provision db1127 and db1137 on x1 [puppet] - 10https://gerrit.wikimedia.org/r/508523 (https://phabricator.wikimedia.org/T222682) [09:21:46] 10Operations, 10DBA: Decommission db1061-db1073 - https://phabricator.wikimedia.org/T217396 (10Marostegui) [09:23:30] RECOVERY - pdfrender on scb1004 is OK: HTTP OK: HTTP/1.1 200 OK - 275 bytes in 7.381 second response time https://phabricator.wikimedia.org/T174916 [09:32:14] 10Operations, 10Continuous-Integration-Infrastructure, 10Wikimedia-Incident, 10Zuul: Upload zuul_2.5.1-wmf8 to apt.wikimedia.org - https://phabricator.wikimedia.org/T222689 (10hashar) [09:32:36] PROBLEM - pdfrender on scb1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://phabricator.wikimedia.org/T174916 [09:32:52] 10Operations, 10Continuous-Integration-Infrastructure, 10Wikimedia-Incident, 
10Zuul: Upload zuul_2.5.1-wmf8 to apt.wikimedia.org - https://phabricator.wikimedia.org/T222689 (10hashar) [09:33:40] RECOVERY - pdfrender on scb1004 is OK: HTTP OK: HTTP/1.1 200 OK - 275 bytes in 2.733 second response time https://phabricator.wikimedia.org/T174916 [09:34:15] (03PS1) 10Ema: Initial packaging [software/varnish/libvmod-uuid] (debian) - 10https://gerrit.wikimedia.org/r/508524 (https://phabricator.wikimedia.org/T221977) [09:37:40] (03CR) 10Jbond: "> Patch Set 1: Code-Review-1" [puppet] - 10https://gerrit.wikimedia.org/r/507948 (https://phabricator.wikimedia.org/T222443) (owner: 10Jbond) [09:37:57] (03Abandoned) 10Jbond: cumin: update list of cumin_masters [puppet] - 10https://gerrit.wikimedia.org/r/507948 (https://phabricator.wikimedia.org/T222443) (owner: 10Jbond) [09:38:47] (03CR) 10Joal: "> Patch Set 4:" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/508362 (https://phabricator.wikimedia.org/T222378) (owner: 10Joal) [09:38:50] PROBLEM - pdfrender on scb1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://phabricator.wikimedia.org/T174916 [09:39:11] (03PS6) 10Joal: Update analytics sqoop scheduling [puppet] - 10https://gerrit.wikimedia.org/r/508362 (https://phabricator.wikimedia.org/T222378) [09:44:21] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good" (031 comment) [software/varnish/libvmod-uuid] (debian) - 10https://gerrit.wikimedia.org/r/508524 (https://phabricator.wikimedia.org/T221977) (owner: 10Ema) [09:46:38] RECOVERY - pdfrender on scb1004 is OK: HTTP OK: HTTP/1.1 200 OK - 275 bytes in 0.003 second response time https://phabricator.wikimedia.org/T174916 [09:47:18] !log restart pdfrender on scb1004 - T174916 [09:47:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:47:23] T174916: electron/pdfrender hangs - https://phabricator.wikimedia.org/T174916 [09:47:54] (03PS7) 10Elukey: Update analytics sqoop scheduling [puppet] - 10https://gerrit.wikimedia.org/r/508362 
(https://phabricator.wikimedia.org/T222378) (owner: 10Joal) [09:49:13] 10Operations, 10Traffic, 10Goal, 10Patch-For-Review: Replace Varnish backends with ATS on cache upload nodes in ulsfo - https://phabricator.wikimedia.org/T219967 (10ema) 05Open→03Resolved a:03ema All Varnish backends in ulsfo upload replaced with ATS. [09:49:29] (03CR) 10Elukey: [C: 03+2] "https://puppet-compiler.wmflabs.org/compiler1002/16391/an-coord1001.eqiad.wmnet/" [puppet] - 10https://gerrit.wikimedia.org/r/508362 (https://phabricator.wikimedia.org/T222378) (owner: 10Joal) [09:51:25] !log test statsd-exporter 0.9 upgrade on deployment-imagescaler03 - T220709 [09:51:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:51:30] T220709: Upgrade statsd_exporter to 0.9 - https://phabricator.wikimedia.org/T220709 [10:00:04] matthiasmullie: My dear minions, it's time we take the moon! Just kidding. Time for Structured Data on Commons: statements in UploadWizard deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190507T1000). [10:00:04] matthiasmullie: A patch you scheduled for Structured Data on Commons: statements in UploadWizard is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [10:05:57] 10Operations, 10DBA, 10Patch-For-Review: Decommission db1061-db1073 - https://phabricator.wikimedia.org/T217396 (10Marostegui) [10:06:04] 10Operations, 10DBA, 10Patch-For-Review: correctable memory errors db1068 (commons primary master database) - https://phabricator.wikimedia.org/T213664 (10Marostegui) [10:08:24] PROBLEM - Check systemd state on an-coord1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
[10:08:37] !log upload zull_2.5.1-wmf8 package to jessie-wikimedia [10:08:39] elukey: FYI ^^^ [10:08:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:10:00] (03PS1) 10Arturo Borrero Gonzalez: toollabs: drop old compute puppet code [puppet] - 10https://gerrit.wikimedia.org/r/508529 (https://phabricator.wikimedia.org/T219362) [10:11:16] 10Operations, 10Continuous-Integration-Infrastructure, 10Wikimedia-Incident, 10Zuul: Upload zuul_2.5.1-wmf8 to apt.wikimedia.org - https://phabricator.wikimedia.org/T222689 (10jbond) this has been uploaded let me know if there are any issues [10:11:31] 10Operations, 10Continuous-Integration-Infrastructure, 10Wikimedia-Incident, 10Zuul: Upload zuul_2.5.1-wmf8 to apt.wikimedia.org - https://phabricator.wikimedia.org/T222689 (10jbond) 05Open→03Resolved [10:13:00] (03CR) 10Hashar: [C: 03+1] "And apparently apt::pin does not manage the directory so it was still pinned." [puppet] - 10https://gerrit.wikimedia.org/r/502207 (https://phabricator.wikimedia.org/T218559) (owner: 10Hashar) [10:13:21] 10Operations, 10Puppet: Add a CI check for the use of hiera() function - https://phabricator.wikimedia.org/T220820 (10Joe) I would even suggest if we write a puppet-lint plugin for this to add the fix capability. It should allow a relatively quick removal of all hiera() calls. 
[10:13:38] (03PS2) 10Jbond: facter3/puppet5: enable puppet5/facter3 ulsfo [puppet] - 10https://gerrit.wikimedia.org/r/507301 (https://phabricator.wikimedia.org/T219803) [10:14:17] (03CR) 10Jbond: [C: 03+2] facter3/puppet5: enable puppet5/facter3 ulsfo [puppet] - 10https://gerrit.wikimedia.org/r/507301 (https://phabricator.wikimedia.org/T219803) (owner: 10Jbond) [10:14:40] (03PS2) 10Arturo Borrero Gonzalez: toollabs: drop old compute puppet code [puppet] - 10https://gerrit.wikimedia.org/r/508529 (https://phabricator.wikimedia.org/T219362) [10:15:25] (03PS3) 10Arturo Borrero Gonzalez: toollabs: drop old node puppet code [puppet] - 10https://gerrit.wikimedia.org/r/508529 (https://phabricator.wikimedia.org/T219362) [10:16:30] !log contint1001, contint2002: rm /etc/apt/preferences.d/python_pbr.pref /etc/apt/preferences.d/python-pbr.pref # T218559 [10:16:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:16:34] T218559: puppet broken on integration WMCS instances due to openstack Debian packages - https://phabricator.wikimedia.org/T218559 [10:16:56] !log contint1001: upgrading python-pbr from 0.8.2-1 to 1.10.0-1 , no more needed with recent Zuul # T218559 [10:16:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:17:45] (03CR) 10Matthias Mullie: [C: 03+2] SDC: Enable feature flag for depicts in UW on Commons [mediawiki-config] - 10https://gerrit.wikimedia.org/r/507943 (https://phabricator.wikimedia.org/T217024) (owner: 10Matthias Mullie) [10:18:55] (03Merged) 10jenkins-bot: SDC: Enable feature flag for depicts in UW on Commons [mediawiki-config] - 10https://gerrit.wikimedia.org/r/507943 (https://phabricator.wikimedia.org/T217024) (owner: 10Matthias Mullie) [10:19:09] (03CR) 10jenkins-bot: SDC: Enable feature flag for depicts in UW on Commons [mediawiki-config] - 10https://gerrit.wikimedia.org/r/507943 (https://phabricator.wikimedia.org/T217024) (owner: 10Matthias Mullie) [10:20:09] volans: ack thanks! 
[10:20:21] (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] "PCC is NOOP for Toolforge: https://puppet-compiler.wmflabs.org/compiler1002/16393/" [puppet] - 10https://gerrit.wikimedia.org/r/508529 (https://phabricator.wikimedia.org/T219362) (owner: 10Arturo Borrero Gonzalez) [10:23:19] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] toollabs: drop old node puppet code [puppet] - 10https://gerrit.wikimedia.org/r/508529 (https://phabricator.wikimedia.org/T219362) (owner: 10Arturo Borrero Gonzalez) [10:23:35] !log mlitn@deploy1001 Started scap: SDC: Enable Depicts in UploadWizard on Commons [10:23:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:26:40] 10Operations, 10Puppet, 10Packaging, 10Patch-For-Review: upgrade facter and puppet across the fleet - https://phabricator.wikimedia.org/T219803 (10jbond) >>! In T219803#5147762, @fgiunchedi wrote: > FYI the upgrade seems to be generating cronspam, in the form of facter warnings: > > `lines=5 > Subject: Cr... [10:28:31] 10Operations, 10Puppet, 10Packaging, 10Patch-For-Review: upgrade facter and puppet across the fleet - https://phabricator.wikimedia.org/T219803 (10jbond) [10:28:33] 10Operations: cronspam from smart-data-dump due to facter bug - https://phabricator.wikimedia.org/T222326 (10jbond) 05Open→03Resolved a:03jbond Resolving this and will track the root problem in https://phabricator.wikimedia.org/T222356 [10:30:45] (03PS1) 10Elukey: profile::analytics::refinery: use analytics in /var/log/refinery [puppet] - 10https://gerrit.wikimedia.org/r/508534 (https://phabricator.wikimedia.org/T220971) [10:31:10] (03PS2) 10Ema: Initial packaging [software/varnish/libvmod-uuid] (debian) - 10https://gerrit.wikimedia.org/r/508524 (https://phabricator.wikimedia.org/T221977) [10:31:41] (03CR) 10Ema: Initial packaging (031 comment) [software/varnish/libvmod-uuid] (debian) - 10https://gerrit.wikimedia.org/r/508524 (https://phabricator.wikimedia.org/T221977) (owner: 10Ema) [10:33:25] (03CR) 10Muehlenhoff: [C: 
03+1] Initial packaging [software/varnish/libvmod-uuid] (debian) - 10https://gerrit.wikimedia.org/r/508524 (https://phabricator.wikimedia.org/T221977) (owner: 10Ema) [10:34:08] (03CR) 10Elukey: [C: 03+2] profile::analytics::refinery: use analytics in /var/log/refinery [puppet] - 10https://gerrit.wikimedia.org/r/508534 (https://phabricator.wikimedia.org/T220971) (owner: 10Elukey) [10:34:38] (03CR) 10Ema: [V: 03+2 C: 03+2] Initial packaging [software/varnish/libvmod-uuid] (debian) - 10https://gerrit.wikimedia.org/r/508524 (https://phabricator.wikimedia.org/T221977) (owner: 10Ema) [10:40:22] !log libvmod-uuid 1.4-1 uploaded to stretch-wikimedia T221977 [10:40:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:40:31] T221977: Package libvmod-uuid for Debian - https://phabricator.wikimedia.org/T221977 [10:42:56] RECOVERY - Check systemd state on an-coord1001 is OK: OK - running: The system is fully operational [10:42:57] (03PS3) 10Ema: ATS: log debug messages as such [puppet] - 10https://gerrit.wikimedia.org/r/508519 [10:43:51] (03CR) 10Ema: [C: 03+2] ATS: log debug messages as such [puppet] - 10https://gerrit.wikimedia.org/r/508519 (owner: 10Ema) [10:46:20] !log mlitn@deploy1001 Finished scap: SDC: Enable Depicts in UploadWizard on Commons (duration: 22m 45s) [10:46:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:48:15] (03PS1) 10Arturo Borrero Gonzalez: toolforge: refactor static role puppet code [puppet] - 10https://gerrit.wikimedia.org/r/508535 (https://phabricator.wikimedia.org/T219362) [10:49:34] 10Operations, 10Traffic, 10serviceops, 10PHP 7.2 support, 10User-jijiki: Improve Pybal's url checks - https://phabricator.wikimedia.org/T222705 (10jijiki) [10:49:45] 10Operations, 10Traffic, 10serviceops, 10PHP 7.2 support, 10User-jijiki: Improve Pybal's url checks - https://phabricator.wikimedia.org/T222705 (10jijiki) p:05Triage→03High [10:50:22] 10Operations, 10Traffic, 10serviceops, 10PHP 7.2 support, 
10User-jijiki: Improve Pybal's url checks - https://phabricator.wikimedia.org/T222705 (10jijiki) [10:50:25] 10Operations, 10serviceops, 10Patch-For-Review, 10User-jijiki: Ramp up percentage of users on php7.2 to 100% on both API and appserver clusters - https://phabricator.wikimedia.org/T219150 (10jijiki) [10:51:21] !log Gracefully stopping Zuul for upgrade [10:51:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:59:12] Done with deployment [11:00:04] MaxSem, RoanKattouw, and Niharika: #bothumor My software never has bugs. It just develops random features. Rise for European Mid-day SWAT(Max 6 patches). (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190507T1100). [11:00:04] Ammarpad: A patch you scheduled for European Mid-day SWAT(Max 6 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [11:06:56] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] "PCC: https://puppet-compiler.wmflabs.org/compiler1001/16395/" [puppet] - 10https://gerrit.wikimedia.org/r/508535 (https://phabricator.wikimedia.org/T219362) (owner: 10Arturo Borrero Gonzalez) [11:08:46] !log Upgraded Zuul and it is broken. So downgrading back :-( [11:08:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:10:34] 10Operations, 10Puppet, 10Packaging: facter3: Unable to parse routing table - https://phabricator.wikimedia.org/T222356 (10jbond) >I think we should patch facter once your PR is reviewed/merged upstream to address this for good. But I think it's fine to proceed with the facter rollout given that this is harm... [11:11:16] PROBLEM - puppet last run on ms-be1029 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [11:12:42] (03PS1) 10Elukey: Move Analytics' project_namespace_map timer to the analytics user [puppet] - 10https://gerrit.wikimedia.org/r/508536 (https://phabricator.wikimedia.org/T220971) [11:14:05] (03CR) 10Joal: [C: 03+1] "Thanks!" [puppet] - 10https://gerrit.wikimedia.org/r/508536 (https://phabricator.wikimedia.org/T220971) (owner: 10Elukey) [11:16:22] !log Downgraded Zuul back to 2.5.1-wmf7 # T105474 T140297 [11:16:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:16:27] T105474: 'recheck' on a CR+2 patch should trigger gate-and-submit, not test - https://phabricator.wikimedia.org/T105474 [11:18:25] (03CR) 10Elukey: [C: 03+2] Move Analytics' project_namespace_map timer to the analytics user [puppet] - 10https://gerrit.wikimedia.org/r/508536 (https://phabricator.wikimedia.org/T220971) (owner: 10Elukey) [11:21:45] (03PS1) 10Joal: Move import_wikitext_dumps to analytics user [puppet] - 10https://gerrit.wikimedia.org/r/508539 (https://phabricator.wikimedia.org/T220971) [11:21:59] elukey: --^ [11:23:06] (03CR) 10Elukey: [C: 03+2] Move import_wikitext_dumps to analytics user [puppet] - 10https://gerrit.wikimedia.org/r/508539 (https://phabricator.wikimedia.org/T220971) (owner: 10Joal) [11:23:13] (03PS2) 10Elukey: Move import_wikitext_dumps to analytics user [puppet] - 10https://gerrit.wikimedia.org/r/508539 (https://phabricator.wikimedia.org/T220971) (owner: 10Joal) [11:30:15] RECOVERY - Disk space on dbprov1001 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space [11:32:50] (03PS1) 10Arturo Borrero Gonzalez: toolforge: refactor redis role from toollabs [puppet] - 10https://gerrit.wikimedia.org/r/508542 (https://phabricator.wikimedia.org/T219362) [11:37:29] RECOVERY - puppet last run on ms-be1029 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [11:37:58] (03PS4) 10Jcrespo: mariadb: Reenable notifications on db1139 and db1140 
[puppet] - 10https://gerrit.wikimedia.org/r/507925 (https://phabricator.wikimedia.org/T220572) [11:38:00] (03PS3) 10Jcrespo: backups: Decommission dbstore1001, dbstore2001 and dbstore2002 [puppet] - 10https://gerrit.wikimedia.org/r/507944 (https://phabricator.wikimedia.org/T220002) [11:38:02] (03PS1) 10Jcrespo: mariadb: Chmod /srv/backups/dumps o+x so the disk space check works [puppet] - 10https://gerrit.wikimedia.org/r/508543 (https://phabricator.wikimedia.org/T219399) [11:38:51] (03CR) 10Jcrespo: [C: 03+2] mariadb: Reenable notifications on db1139 and db1140 [puppet] - 10https://gerrit.wikimedia.org/r/507925 (https://phabricator.wikimedia.org/T220572) (owner: 10Jcrespo) [11:39:10] (03CR) 10Jcrespo: [C: 03+2] backups: Decommission dbstore1001, dbstore2001 and dbstore2002 [puppet] - 10https://gerrit.wikimedia.org/r/507944 (https://phabricator.wikimedia.org/T220002) (owner: 10Jcrespo) [11:39:25] (03CR) 10Jcrespo: [C: 03+2] mariadb: Chmod /srv/backups/dumps o+x so the disk space check works [puppet] - 10https://gerrit.wikimedia.org/r/508543 (https://phabricator.wikimedia.org/T219399) (owner: 10Jcrespo) [11:42:07] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] "PCC run https://puppet-compiler.wmflabs.org/compiler1002/16396/" [puppet] - 10https://gerrit.wikimedia.org/r/508542 (https://phabricator.wikimedia.org/T219362) (owner: 10Arturo Borrero Gonzalez) [11:42:17] (03PS2) 10Arturo Borrero Gonzalez: toolforge: refactor redis role from toollabs [puppet] - 10https://gerrit.wikimedia.org/r/508542 (https://phabricator.wikimedia.org/T219362) [11:42:21] RECOVERY - Disk space on dbprov2001 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space [11:44:55] PROBLEM - HTTP availability for Varnish at eqiad on icinga1001 is CRITICAL: job=varnish-upload site=eqiad https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 [11:46:11] indeed looks like 503s from upload eqiad [11:47:19] (ended already) [11:48:51] although I'm not 
sure I can pinpoint the cause yet [11:49:09] 10Operations, 10Thumbor, 10serviceops, 10Patch-For-Review, and 2 others: Build Thumbor packages for buster - https://phabricator.wikimedia.org/T221562 (10jijiki) [11:49:45] cc ema vgutierrez bblack [11:52:02] ah looks like a lot of it was for a single client [11:53:51] (03PS2) 10Marostegui: mariadb: Provision db1127 and db1137 on x1 [puppet] - 10https://gerrit.wikimedia.org/r/508523 (https://phabricator.wikimedia.org/T222682) [11:54:29] (03CR) 10Jbond: "running sphinx with py3.5 i get the following error, wonder if you have sen this before Ricardo or have any pointers i cant see anything o" [software/spicerack] - 10https://gerrit.wikimedia.org/r/507561 (owner: 10Jbond) [11:59:12] RECOVERY - HTTP availability for Varnish at eqiad on icinga1001 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 [11:59:36] (03PS1) 10Vgutierrez: Implement kubernetes configuration observer [debs/pybal] (1.15-stretch) - 10https://gerrit.wikimedia.org/r/508551 (https://phabricator.wikimedia.org/T192437) [11:59:38] (03PS1) 10Vgutierrez: Create MonitoringProtocolTestCase base class [debs/pybal] (1.15-stretch) - 10https://gerrit.wikimedia.org/r/508552 [11:59:40] (03PS1) 10Vgutierrez: Add minimal test cases for Skeleton and ProxyFetch [debs/pybal] (1.15-stretch) - 10https://gerrit.wikimedia.org/r/508553 [12:00:05] Deploy window Pre MediaWiki train sanity break (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190507T1200) [12:00:24] (03CR) 10Marostegui: [C: 03+2] mariadb: Provision db1127 and db1137 on x1 [puppet] - 10https://gerrit.wikimedia.org/r/508523 (https://phabricator.wikimedia.org/T222682) (owner: 10Marostegui) [12:01:48] PROBLEM - puppet last run on dbstore2001 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [12:01:52] (03PS1) 10Vgutierrez: Add unit tests for ProxyFetchMonitoringProtocol [debs/pybal] (1.15-stretch) - 10https://gerrit.wikimedia.org/r/508555 [12:02:34] 10Operations, 10Analytics, 10EventBus, 10observability, and 3 others: Upgrade statsd_exporter to 0.9 - https://phabricator.wikimedia.org/T220709 (10fgiunchedi) Looks good so far in deployment-prep, there's a deb on `boron` for testing `/var/cache/pbuilder/result/stretch-amd64/prometheus-statsd-exporter_0.9... [12:04:39] (03PS36) 10Mathew.onipe: icinga: create and apply cirrus config check [puppet] - 10https://gerrit.wikimedia.org/r/507045 (https://phabricator.wikimedia.org/T218932) [12:05:26] (03CR) 10jerkins-bot: [V: 04-1] icinga: create and apply cirrus config check [puppet] - 10https://gerrit.wikimedia.org/r/507045 (https://phabricator.wikimedia.org/T218932) (owner: 10Mathew.onipe) [12:08:02] (03PS37) 10Mathew.onipe: icinga: create and apply cirrus config check [puppet] - 10https://gerrit.wikimedia.org/r/507045 (https://phabricator.wikimedia.org/T218932) [12:08:47] (03CR) 10jerkins-bot: [V: 04-1] icinga: create and apply cirrus config check [puppet] - 10https://gerrit.wikimedia.org/r/507045 (https://phabricator.wikimedia.org/T218932) (owner: 10Mathew.onipe) [12:09:14] !log Stop Replication on db1140:3320 to provision db1127 and db1137 T222682 [12:09:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:09:18] T222682: Productionize db11[26-38] - https://phabricator.wikimedia.org/T222682 [12:10:41] 10Operations, 10ops-eqiad, 10Traffic: cp1083 crashed - https://phabricator.wikimedia.org/T222620 (10CDanis) >>! In T222620#5163577, @ema wrote: > Interestingly, there was a memory usage spike right before the host crashed. > > {F28951427} I think that is just a strange monitoring artifact. If you zoom in... 
[12:11:29] (03CR) 10Effie Mouzeli: [C: 03+1] Upgrade to 2.5 [debs/python-thumbor-wikimedia] - 10https://gerrit.wikimedia.org/r/506116 (https://phabricator.wikimedia.org/T221562) (owner: 10Gilles) [12:12:53] (03PS38) 10Mathew.onipe: icinga: create and apply cirrus config check [puppet] - 10https://gerrit.wikimedia.org/r/507045 (https://phabricator.wikimedia.org/T218932) [12:13:53] chaomodus: akosiaris: you should hopefully now have access to https://gerrit.wikimedia.org/r/monitoring (poke paladox) [12:13:58] PROBLEM - puppet last run on dbstore1001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [12:19:13] (03CR) 10Mathew.onipe: icinga: create and apply cirrus config check (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/507045 (https://phabricator.wikimedia.org/T218932) (owner: 10Mathew.onipe) [12:19:36] hashar: works for me, thanks! [12:24:31] (03PS1) 10Vgutierrez: Avoid Deferred.cancel() induced CancelledErrors [debs/pybal] (1.15-stretch) - 10https://gerrit.wikimedia.org/r/508558 [12:25:34] (03PS1) 10Arturo Borrero Gonzalez: toolforge: refactor proxy role from toollabs Refactor the old tools-proxy-* puppet code into a modern layout. 
Bug: T219362 Signed-off-by: Arturo Borrero Gonzalez [puppet] - 10https://gerrit.wikimedia.org/r/508560 (https://phabricator.wikimedia.org/T219362) [12:26:23] (03PS2) 10Arturo Borrero Gonzalez: toolforge: refactor proxy role from toollabs [puppet] - 10https://gerrit.wikimedia.org/r/508560 (https://phabricator.wikimedia.org/T219362) [12:26:28] (03CR) 10jerkins-bot: [V: 04-1] toolforge: refactor proxy role from toollabs [puppet] - 10https://gerrit.wikimedia.org/r/508560 (https://phabricator.wikimedia.org/T219362) (owner: 10Arturo Borrero Gonzalez) [12:26:40] (03PS39) 10Mathew.onipe: icinga: create and apply cirrus config check [puppet] - 10https://gerrit.wikimedia.org/r/507045 (https://phabricator.wikimedia.org/T218932) [12:26:59] (03CR) 10Arturo Borrero Gonzalez: "This change is a bit more complex. I'm leaving it in review for a while." [puppet] - 10https://gerrit.wikimedia.org/r/508560 (https://phabricator.wikimedia.org/T219362) (owner: 10Arturo Borrero Gonzalez) [12:27:20] (03CR) 10jerkins-bot: [V: 04-1] toolforge: refactor proxy role from toollabs [puppet] - 10https://gerrit.wikimedia.org/r/508560 (https://phabricator.wikimedia.org/T219362) (owner: 10Arturo Borrero Gonzalez) [12:31:40] PROBLEM - HHVM jobrunner on mw1311 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 473 bytes in 0.001 second response time https://wikitech.wikimedia.org/wiki/Jobrunner [12:33:00] RECOVERY - HHVM jobrunner on mw1311 is OK: HTTP OK: HTTP/1.1 200 OK - 270 bytes in 0.007 second response time https://wikitech.wikimedia.org/wiki/Jobrunner [12:36:47] (03PS1) 10Jcrespo: dbstores: Set the right role for dbstores to be spare [puppet] - 10https://gerrit.wikimedia.org/r/508565 (https://phabricator.wikimedia.org/T220002) [12:37:40] (03CR) 10Jcrespo: [C: 03+2] dbstores: Set the right role for dbstores to be spare [puppet] - 10https://gerrit.wikimedia.org/r/508565 (https://phabricator.wikimedia.org/T220002) (owner: 10Jcrespo) [12:40:24] RECOVERY - puppet last run on dbstore1001 is 
OK: OK: Puppet is currently enabled, last run 42 seconds ago with 0 failures [12:41:11] (03CR) 10Nikerabbit: Add publish restrictions config for enwiki (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/495677 (https://phabricator.wikimedia.org/T217237) (owner: 10Petar.petkovic) [12:43:36] RECOVERY - puppet last run on dbstore2001 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [12:45:21] (03CR) 10KartikMistry: Add publish restrictions config for enwiki (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/495677 (https://phabricator.wikimedia.org/T217237) (owner: 10Petar.petkovic) [12:45:26] !log remove dbstore1001, dbstore2001, dbstore2002 from tendril and zarcillo T220002 [12:45:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:45:30] T220002: Decommission dbstore1001, dbstore2001, dbstore2002 - https://phabricator.wikimedia.org/T220002 [12:49:50] (03PS1) 10CDanis: swift: object auditing is less important than replication [puppet] - 10https://gerrit.wikimedia.org/r/508566 [12:50:18] (03CR) 10CDanis: [C: 03+2] swift: object auditing is less important than replication [puppet] - 10https://gerrit.wikimedia.org/r/508566 (owner: 10CDanis) [12:51:47] (03PS1) 10Ema: cp-ats: reimage as test nodes [puppet] - 10https://gerrit.wikimedia.org/r/508567 (https://phabricator.wikimedia.org/T213263) [12:59:17] 10Operations, 10media-storage, 10observability: swift backend decomms / rebalances are noisy - https://phabricator.wikimedia.org/T221904 (10CDanis) a:03CDanis [13:00:04] Deploy window MediaWiki train - European version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190507T1300) [13:02:14] (03Abandoned) 10Jcrespo: mariadb-snapshots: Setup full daily snapshots for all codfw sections [puppet] - 10https://gerrit.wikimedia.org/r/500980 (https://phabricator.wikimedia.org/T206203) (owner: 10Jcrespo) [13:02:45] !log T221904 cdanis@cumin1001.eqiad.wmnet ~ % sudo cumin 
-p95 -m async -b5 'ms-be2*' 'run-puppet-agent -q' 'systemctl restart swift-object-replicator' 'systemctl restart swift-object-auditor' [13:02:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:02:50] T221904: swift backend decomms / rebalances are noisy - https://phabricator.wikimedia.org/T221904 [13:03:57] (03PS5) 10Petar.petkovic: Add publish restrictions config for enwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/495677 (https://phabricator.wikimedia.org/T217237) [13:05:18] 10Operations, 10media-storage, 10observability: swift backend decomms / rebalances are noisy - https://phabricator.wikimedia.org/T221904 (10CDanis) Trying out a few things here: [x] ionice'ing swift-object-replicator lower than everything else, except [x] ionice'ing swift-object-auditor even lower than that... [13:05:49] (03CR) 10Petar.petkovic: Add publish restrictions config for enwiki (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/495677 (https://phabricator.wikimedia.org/T217237) (owner: 10Petar.petkovic) [13:07:28] !log sudo ipmitool -I lanplus -H "cp2009.mgmt.codfw.wmnet" -U root -E chassis power cycle T222459 [13:07:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:07:33] T222459: cp2009 down and mgmt console not reachable - https://phabricator.wikimedia.org/T222459 [13:08:55] !log sudo ipmitool -I lanplus -H cp2009.mgmt.codfw.wmnet -U root mc reset cold T222459 [13:08:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:09:49] RECOVERY - Host cp2009 is UP: PING OK - Packet loss = 0%, RTA = 36.11 ms [13:11:42] 10Operations, 10ops-codfw, 10Traffic: cp2009 down and mgmt console not reachable - https://phabricator.wikimedia.org/T222459 (10ema) 05Open→03Resolved IPMI seems to be working remotely: ` $ sudo ipmitool -I lanplus -H "cp2009.mgmt.codfw.wmnet" -U root -E chassis power status Unable to read password from... 
[13:14:05] (03PS1) 10Vgutierrez: Allow proxyfetch to check more than one url at a time [debs/pybal] (1.15-stretch) - 10https://gerrit.wikimedia.org/r/508569 [13:14:54] (03CR) 10jerkins-bot: [V: 04-1] Allow proxyfetch to check more than one url at a time [debs/pybal] (1.15-stretch) - 10https://gerrit.wikimedia.org/r/508569 (owner: 10Vgutierrez) [13:15:02] 10Operations, 10Analytics, 10EventBus, 10observability, and 3 others: Upgrade statsd_exporter to 0.9 - https://phabricator.wikimedia.org/T220709 (10Ottomata) Great! I guess it just needs to go into the WMF base docker image somehow? [13:15:03] of course :) [13:16:19] (03CR) 10Ema: [C: 03+2] cp-ats: reimage as test nodes [puppet] - 10https://gerrit.wikimedia.org/r/508567 (https://phabricator.wikimedia.org/T213263) (owner: 10Ema) [13:17:17] !log T221904 cdanis@cumin1001.eqiad.wmnet ~ % sudo cumin -p95 -m async -b5 'ms-be1*' 'run-puppet-agent -q' 'systemctl restart swift-object-replicator' 'systemctl restart swift-object-auditor' [13:17:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:17:21] T221904: swift backend decomms / rebalances are noisy - https://phabricator.wikimedia.org/T221904 [13:18:45] (03PS2) 10Vgutierrez: Allow proxyfetch to check more than one url at a time [debs/pybal] (1.15-stretch) - 10https://gerrit.wikimedia.org/r/508569 [13:22:34] (03CR) 10Ottomata: [C: 03+1] profile::analytics::refinery: use analytics in /var/log/refinery [puppet] - 10https://gerrit.wikimedia.org/r/508534 (https://phabricator.wikimedia.org/T220971) (owner: 10Elukey) [13:23:01] (03CR) 10Giuseppe Lavagetto: [C: 03+1] Implement kubernetes configuration observer [debs/pybal] (1.15-stretch) - 10https://gerrit.wikimedia.org/r/508551 (https://phabricator.wikimedia.org/T192437) (owner: 10Vgutierrez) [13:26:14] (03CR) 10Giuseppe Lavagetto: [C: 03+1] Create MonitoringProtocolTestCase base class [debs/pybal] (1.15-stretch) - 10https://gerrit.wikimedia.org/r/508552 (owner: 10Vgutierrez) [13:26:29] (03CR) 
10Giuseppe Lavagetto: [C: 03+1] Add minimal test cases for Skeleton and ProxyFetch [debs/pybal] (1.15-stretch) - 10https://gerrit.wikimedia.org/r/508553 (owner: 10Vgutierrez) [13:28:09] (03CR) 10Giuseppe Lavagetto: [C: 03+1] Add unit tests for ProxyFetchMonitoringProtocol [debs/pybal] (1.15-stretch) - 10https://gerrit.wikimedia.org/r/508555 (owner: 10Vgutierrez) [13:29:31] (03CR) 10Giuseppe Lavagetto: [C: 03+1] Avoid Deferred.cancel() induced CancelledErrors [debs/pybal] (1.15-stretch) - 10https://gerrit.wikimedia.org/r/508558 (owner: 10Vgutierrez) [13:31:36] (03CR) 10Giuseppe Lavagetto: [C: 03+1] Allow proxyfetch to check more than one url at a time [debs/pybal] (1.15-stretch) - 10https://gerrit.wikimedia.org/r/508569 (owner: 10Vgutierrez) [13:31:40] 10Operations, 10Traffic, 10Patch-For-Review: Partial cache_upload traffic switchover to ATS and switchback to Varnish - https://phabricator.wikimedia.org/T213263 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by ema on cumin1001.eqiad.wmnet for hosts: ` ['cp1071.eqiad.wmnet', 'cp1072.eqiad.wmnet... 
[13:34:48] (03CR) 10Vgutierrez: [C: 03+2] Implement kubernetes configuration observer [debs/pybal] (1.15-stretch) - 10https://gerrit.wikimedia.org/r/508551 (https://phabricator.wikimedia.org/T192437) (owner: 10Vgutierrez) [13:35:00] (03CR) 10Vgutierrez: [C: 03+2] Create MonitoringProtocolTestCase base class [debs/pybal] (1.15-stretch) - 10https://gerrit.wikimedia.org/r/508552 (owner: 10Vgutierrez) [13:35:05] (03CR) 10Vgutierrez: [C: 03+2] Add minimal test cases for Skeleton and ProxyFetch [debs/pybal] (1.15-stretch) - 10https://gerrit.wikimedia.org/r/508553 (owner: 10Vgutierrez) [13:35:09] (03CR) 10Vgutierrez: [C: 03+2] Add unit tests for ProxyFetchMonitoringProtocol [debs/pybal] (1.15-stretch) - 10https://gerrit.wikimedia.org/r/508555 (owner: 10Vgutierrez) [13:35:13] (03CR) 10Vgutierrez: [C: 03+2] Avoid Deferred.cancel() induced CancelledErrors [debs/pybal] (1.15-stretch) - 10https://gerrit.wikimedia.org/r/508558 (owner: 10Vgutierrez) [13:35:19] (03CR) 10Vgutierrez: [C: 03+2] Allow proxyfetch to check more than one url at a time [debs/pybal] (1.15-stretch) - 10https://gerrit.wikimedia.org/r/508569 (owner: 10Vgutierrez) [13:35:22] (03Merged) 10jenkins-bot: Implement kubernetes configuration observer [debs/pybal] (1.15-stretch) - 10https://gerrit.wikimedia.org/r/508551 (https://phabricator.wikimedia.org/T192437) (owner: 10Vgutierrez) [13:35:35] (03Merged) 10jenkins-bot: Create MonitoringProtocolTestCase base class [debs/pybal] (1.15-stretch) - 10https://gerrit.wikimedia.org/r/508552 (owner: 10Vgutierrez) [13:35:39] (03Merged) 10jenkins-bot: Add minimal test cases for Skeleton and ProxyFetch [debs/pybal] (1.15-stretch) - 10https://gerrit.wikimedia.org/r/508553 (owner: 10Vgutierrez) [13:35:42] (03Merged) 10jenkins-bot: Add unit tests for ProxyFetchMonitoringProtocol [debs/pybal] (1.15-stretch) - 10https://gerrit.wikimedia.org/r/508555 (owner: 10Vgutierrez) [13:35:47] (03Merged) 10jenkins-bot: Avoid Deferred.cancel() induced CancelledErrors [debs/pybal] (1.15-stretch) 
- 10https://gerrit.wikimedia.org/r/508558 (owner: 10Vgutierrez) [13:35:55] (03Merged) 10jenkins-bot: Allow proxyfetch to check more than one url at a time [debs/pybal] (1.15-stretch) - 10https://gerrit.wikimedia.org/r/508569 (owner: 10Vgutierrez) [13:36:45] (03CR) 10Effie Mouzeli: [C: 03+2] Upgrade to 2.5 [debs/python-thumbor-wikimedia] - 10https://gerrit.wikimedia.org/r/506116 (https://phabricator.wikimedia.org/T221562) (owner: 10Gilles) [13:37:15] (03PS1) 10Jbond: puppet errors: Tests [puppet] - 10https://gerrit.wikimedia.org/r/508573 [13:37:27] !log otto@deploy1001 scap-helm eventgate-analytics upgrade analytics -f analytics/staging-values.yaml --reset-values stable/eventgate [namespace: eventgate-analytics, clusters: staging] [13:37:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:37:31] !log otto@deploy1001 scap-helm eventgate-analytics upgrade analytics -f analytics/staging-values.yaml --reset-values stable/eventgate [namespace: eventgate-analytics, clusters: staging] [13:37:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:37:33] !log otto@deploy1001 scap-helm eventgate-analytics cluster staging completed [13:37:33] !log otto@deploy1001 scap-helm eventgate-analytics finished [13:37:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:37:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:40:19] Thanks hashar ! 
[13:42:01] (03PS1) 10Marostegui: db-eqiad.php: Depoool db1093 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/508574 (https://phabricator.wikimedia.org/T222127) [13:43:11] (03CR) 10Marostegui: [C: 03+2] db-eqiad.php: Depoool db1093 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/508574 (https://phabricator.wikimedia.org/T222127) (owner: 10Marostegui) [13:44:14] (03Merged) 10jenkins-bot: db-eqiad.php: Depoool db1093 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/508574 (https://phabricator.wikimedia.org/T222127) (owner: 10Marostegui) [13:45:26] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Depool db1093 for BBU replacement T222127 (duration: 00m 51s) [13:45:28] !log Stop MySQL and poweroff db1093 for BBU replacement - T222127 [13:45:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:45:32] T222127: db1093 (s6 candidate master) went down - broken BBU - https://phabricator.wikimedia.org/T222127 [13:45:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:46:12] (03CR) 10jenkins-bot: db-eqiad.php: Depoool db1093 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/508574 (https://phabricator.wikimedia.org/T222127) (owner: 10Marostegui) [13:47:44] PROBLEM - PHP7 rendering on mw1256 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 539 bytes in 0.041 second response time https://wikitech.wikimedia.org/wiki/Application_servers [13:48:30] PROBLEM - PHP7 rendering on mw1320 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 106015 bytes in 0.181 second response time https://wikitech.wikimedia.org/wiki/Application_servers [13:49:36] (03PS2) 10Ottomata: Configure eventgate-main for k8s deployment [puppet] - 10https://gerrit.wikimedia.org/r/508371 (https://phabricator.wikimedia.org/T218346) [13:50:31] !log uploaded prometheus-trafficserver-exporter 0.2.3 to apt.wikimedia.org (stretch) - T221217 [13:50:35] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:50:35] T221217: Allow running several ATS instances on the same server - https://phabricator.wikimedia.org/T221217 [13:50:50] PROBLEM - Host db1093.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [13:50:57] ^ expected [13:54:42] (03CR) 10CRusnov: [C: 03+2] puppetdb report: Exclude OFFLINE VMs from report [software/netbox-reports] - 10https://gerrit.wikimedia.org/r/508456 (owner: 10CRusnov) [13:55:02] PROBLEM - MediaWiki exceptions and fatals per minute on graphite1004 is CRITICAL: CRITICAL: 90.00% of data above the critical threshold [50.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=2&fullscreen [13:55:08] (03PS1) 10Vgutierrez: Release 1.15.4 [debs/pybal] (1.15-stretch) - 10https://gerrit.wikimedia.org/r/508576 [13:56:02] RECOVERY - Host db1093.mgmt is UP: PING OK - Packet loss = 0%, RTA = 1.09 ms [13:56:14] (03PS2) 10Giuseppe Lavagetto: role::deployment_server: depend on base role [puppet] - 10https://gerrit.wikimedia.org/r/508334 [13:56:16] (03PS5) 10Giuseppe Lavagetto: role::deployment_server: reorganize code, step 1 [puppet] - 10https://gerrit.wikimedia.org/r/508335 [13:56:18] (03PS1) 10Giuseppe Lavagetto: profile::keyholder::server: profile for keyholder installation [puppet] - 10https://gerrit.wikimedia.org/r/508577 [13:56:20] (03PS1) 10Giuseppe Lavagetto: profile::mediawiki::scap_client: rationalize scap2 installation [puppet] - 10https://gerrit.wikimedia.org/r/508578 [13:56:22] (03PS1) 10Giuseppe Lavagetto: profile::mediawiki::deployment::server: rationalize puppetization. 
[puppet] - 10https://gerrit.wikimedia.org/r/508579 [13:56:24] (03PS1) 10Giuseppe Lavagetto: role::deployment_server: fold in the base class [puppet] - 10https://gerrit.wikimedia.org/r/508580 [13:57:24] RECOVERY - HP RAID on db1093 is OK: OK: Slot 1: OK: 1I:1:1, 1I:1:2, 1I:1:3, 1I:1:4, 1I:1:5, 1I:1:6, 1I:1:7, 1I:1:8, 2I:2:1, 2I:2:2 - Controller: OK - Battery/Capacitor: OK https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering [13:57:47] !log otto@deploy1001 scap-helm eventgate-analytics install -n analytics -f analytics/codfw-values.yaml stable/eventgate [namespace: eventgate-analytics, clusters: codfw] [13:57:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:58:23] 10Operations, 10ops-eqiad, 10DBA, 10Patch-For-Review: db1093 (s6 candidate master) went down - broken BBU - https://phabricator.wikimedia.org/T222127 (10Marostegui) Chris has changed the BBU and I can already see it: ` root@db1093:~# hpssacli controller all show detail | grep -i battery No-Battery Writ... 
[13:58:34] !log otto@deploy1001 scap-helm eventgate-analytics install -n analytics -f analytics/codfw-values.yaml stable/eventgate [namespace: eventgate-analytics, clusters: codfw] [13:58:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:59:15] !log otto@deploy1001 scap-helm eventgate-analytics upgrade analytics -f analytics/codfw-values.yaml --reset-values stable/eventgate [namespace: eventgate-analytics, clusters: codfw] [13:59:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:59:37] (03CR) 10Giuseppe Lavagetto: [C: 03+1] Release 1.15.4 [debs/pybal] (1.15-stretch) - 10https://gerrit.wikimedia.org/r/508576 (owner: 10Vgutierrez) [14:01:04] (03CR) 10Vgutierrez: [C: 03+2] Release 1.15.4 [debs/pybal] (1.15-stretch) - 10https://gerrit.wikimedia.org/r/508576 (owner: 10Vgutierrez) [14:01:04] !log otto@deploy1001 scap-helm eventgate-analytics upgrade analytics -f analytics/codfw-values.yaml --reset-values stable/eventgate [namespace: eventgate-analytics, clusters: codfw] [14:01:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:01:14] !log otto@deploy1001 scap-helm eventgate-analytics install -n analytics -f analytics/codfw-values.yaml stable/eventgate [namespace: eventgate-analytics, clusters: codfw] [14:01:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:02:07] !log otto@deploy1001 scap-helm eventgate-analytics install -n analytics -f analytics/codfw-values.yaml stable/eventgate [namespace: eventgate-analytics, clusters: codfw] [14:02:09] !log otto@deploy1001 scap-helm eventgate-analytics cluster codfw completed [14:02:09] !log otto@deploy1001 scap-helm eventgate-analytics finished [14:02:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:02:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:02:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log 
[14:04:16] PROBLEM - Request latencies on acrab is CRITICAL: instance=10.192.16.26:6443 verb=CONNECT https://grafana.wikimedia.org/dashboard/db/kubernetes-api [14:05:54] (03PS1) 10Vgutierrez: Fix release 1.15.4 changelog entry [debs/pybal] (1.15-stretch) - 10https://gerrit.wikimedia.org/r/508581 [14:05:59] * vgutierrez cries in the corner [14:06:21] (03CR) 10Vgutierrez: [C: 03+2] Fix release 1.15.4 changelog entry [debs/pybal] (1.15-stretch) - 10https://gerrit.wikimedia.org/r/508581 (owner: 10Vgutierrez) [14:06:23] 10Operations, 10Traffic, 10Patch-For-Review: Partial cache_upload traffic switchover to ATS and switchback to Varnish - https://phabricator.wikimedia.org/T213263 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by ema on cumin1001.eqiad.wmnet for hosts: ` ['cp1074.eqiad.wmnet', 'cp2003.codfw.wmnet... [14:07:36] !log otto@deploy1001 scap-helm eventgate-analytics install -n analytics -f analytics/eqiad-values.yaml stable/eventgate [namespace: eventgate-analytics, clusters: eqiad] [14:07:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:07:39] !log otto@deploy1001 scap-helm eventgate-analytics cluster eqiad completed [14:07:39] !log otto@deploy1001 scap-helm eventgate-analytics finished [14:07:40] RECOVERY - BFD status on cr4-ulsfo is OK: OK: UP: 6 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [14:07:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:07:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:08:09] (03PS2) 10Giuseppe Lavagetto: profile::mediawiki::scap_client: rationalize scap2 installation [puppet] - 10https://gerrit.wikimedia.org/r/508578 [14:08:11] (03PS2) 10Giuseppe Lavagetto: profile::mediawiki::deployment::server: rationalize puppetization. 
[puppet] - 10https://gerrit.wikimedia.org/r/508579 [14:08:13] (03PS2) 10Giuseppe Lavagetto: role::deployment_server: fold in the base class [puppet] - 10https://gerrit.wikimedia.org/r/508580 [14:09:20] !log depool mw1320 [14:09:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:09:57] !log cdanis@puppetmaster1001 conftool action : set/pooled=no; selector: name=mw1320.eqiad.wmnet [14:09:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:10:39] (03CR) 10Giuseppe Lavagetto: "https://puppet-compiler.wmflabs.org/compiler1001/16401/deploy1001.eqiad.wmnet/ all the cumulative patches are ok" [puppet] - 10https://gerrit.wikimedia.org/r/508580 (owner: 10Giuseppe Lavagetto) [14:10:42] RECOVERY - Request latencies on acrab is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/kubernetes-api [14:12:03] !log cdanis@puppetmaster1001 conftool action : set/pooled=no; selector: name=mw1271.eqiad.wmnet [14:12:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:12:13] !log cdanis@puppetmaster1001 conftool action : set/pooled=no; selector: name=mw1256.eqiad.wmnet [14:12:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:13:02] !log uploaded pybal 1.15.4 to apt.wikimedia.org (stretch) [14:13:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:13:20] (03PS1) 10Ottomata: Change eventgate-analytics LVS port to 33192 [puppet] - 10https://gerrit.wikimedia.org/r/508582 (https://phabricator.wikimedia.org/T218346) [14:15:32] PROBLEM - toolschecker: All k8s worker nodes are healthy on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 SERVICE UNAVAILABLE - string OK not found on http://checker.tools.wmflabs.org:80/k8s/nodes/ready - 177 bytes in 0.164 second response time https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Toolschecker [14:16:56] ^^^^ looking at that [14:17:09] RECOVERY - 
MediaWiki exceptions and fatals per minute on graphite1004 is OK: OK: Less than 70.00% above the threshold [25.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=2&fullscreen [14:17:24] (03PS1) 10CRusnov: ganeti: Fix RAPI port [software/spicerack] - 10https://gerrit.wikimedia.org/r/508583 [14:22:25] (03PS1) 10Vgutierrez: Revert "Implement kubernetes configuration observer" [debs/pybal] (1.15-stretch) - 10https://gerrit.wikimedia.org/r/508584 [14:22:57] (03CR) 10Volans: [C: 03+2] ganeti: Fix RAPI port [software/spicerack] - 10https://gerrit.wikimedia.org/r/508583 (owner: 10CRusnov) [14:25:01] <_joe_> !log resetting opcache on mw1320 [14:25:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:26:09] <_joe_> !log repooling mw1320 [14:26:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:26:14] RECOVERY - PHP7 rendering on mw1320 is OK: HTTP OK: HTTP/1.1 200 OK - 75404 bytes in 0.146 second response time https://wikitech.wikimedia.org/wiki/Application_servers [14:27:10] (03Merged) 10jenkins-bot: ganeti: Fix RAPI port [software/spicerack] - 10https://gerrit.wikimedia.org/r/508583 (owner: 10CRusnov) [14:28:19] (03CR) 10jenkins-bot: ganeti: Fix RAPI port [software/spicerack] - 10https://gerrit.wikimedia.org/r/508583 (owner: 10CRusnov) [14:28:33] (03PS1) 10Ema: ATS: remove role trafficserver::backend [puppet] - 10https://gerrit.wikimedia.org/r/508587 (https://phabricator.wikimedia.org/T213263) [14:28:35] (03PS1) 10Ema: ATS: update cumin aliases [puppet] - 10https://gerrit.wikimedia.org/r/508588 (https://phabricator.wikimedia.org/T213263) [14:30:44] RECOVERY - toolschecker: All k8s worker nodes are healthy on checker.tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 158 bytes in 0.115 second response time https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Toolschecker [14:33:05] (03CR) 10Vgutierrez: [C: 03+2] Revert "Implement kubernetes configuration observer" 
[debs/pybal] (1.15-stretch) - 10https://gerrit.wikimedia.org/r/508584 (owner: 10Vgutierrez) [14:33:11] (03PS2) 10Vgutierrez: Revert "Implement kubernetes configuration observer" [debs/pybal] (1.15-stretch) - 10https://gerrit.wikimedia.org/r/508584 [14:33:30] 10Operations, 10ops-codfw: Degraded RAID on db2049 - https://phabricator.wikimedia.org/T222526 (10Papaul) a:05Papaul→03Marostegui Try again [14:33:36] (03PS22) 10CRusnov: Netbox module for Spicerack [software/spicerack] - 10https://gerrit.wikimedia.org/r/493138 (https://phabricator.wikimedia.org/T217072) [14:34:01] (03CR) 10CRusnov: Netbox module for Spicerack (031 comment) [software/spicerack] - 10https://gerrit.wikimedia.org/r/493138 (https://phabricator.wikimedia.org/T217072) (owner: 10CRusnov) [14:34:47] (03CR) 10Muehlenhoff: ATS: update cumin aliases (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/508588 (https://phabricator.wikimedia.org/T213263) (owner: 10Ema) [14:35:57] (03CR) 10Reedy: "I don't think it's majorly important... It probably needs to go in around the time it's merged into master (doesn't make much difference t" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/508130 (https://phabricator.wikimedia.org/T222516) (owner: 10Luca Mauri) [14:36:04] (03PS1) 10Vgutierrez: Release 1.15.5 [debs/pybal] (1.15-stretch) - 10https://gerrit.wikimedia.org/r/508589 [14:36:39] (03PS1) 10Giuseppe Lavagetto: php-fpm: revalidate opcache every 60 seconds. 
[puppet] - 10https://gerrit.wikimedia.org/r/508590 (https://phabricator.wikimedia.org/T221347) [14:37:12] (03CR) 10Vgutierrez: [C: 03+2] Release 1.15.5 [debs/pybal] (1.15-stretch) - 10https://gerrit.wikimedia.org/r/508589 (owner: 10Vgutierrez) [14:37:35] (03CR) 10Ema: ATS: update cumin aliases (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/508588 (https://phabricator.wikimedia.org/T213263) (owner: 10Ema) [14:39:25] (03CR) 10Muehlenhoff: [C: 03+1] ATS: update cumin aliases (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/508588 (https://phabricator.wikimedia.org/T213263) (owner: 10Ema) [14:39:46] 10Operations, 10ops-codfw: Degraded RAID on db2049 - https://phabricator.wikimedia.org/T222526 (10Marostegui) Thanks! We'll see! ` physicaldrive 1I:1:5 (port 1I:box 1:bay 5, SAS, 600 GB, Rebuilding) ` [14:40:11] !log uploaded pybal 1.15.5 to apt.wikimedia.org (stretch && jessie) [14:40:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:41:10] (03CR) 10Volans: [C: 04-1] "Partial -1. 
While the syntax is totally correct, it will create ATS aliases for each DC and until we'll have at least 1 ATS host in each D" [puppet] - 10https://gerrit.wikimedia.org/r/508588 (https://phabricator.wikimedia.org/T213263) (owner: 10Ema) [14:41:13] 10Operations, 10ops-codfw, 10DBA, 10Goal, 10Patch-For-Review: rack/setup/install db2[103-120].codfw.wmnet (18 hosts) - https://phabricator.wikimedia.org/T221532 (10Papaul) [14:43:07] (03PS2) 10Ema: ATS: update cumin aliases [puppet] - 10https://gerrit.wikimedia.org/r/508588 (https://phabricator.wikimedia.org/T213263) [14:43:42] !log cdanis@mw1271.eqiad.wmnet ~ % sudo php7adm /opcache-free [14:43:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:44:20] !log cdanis@mw1256.eqiad.wmnet ~ % sudo php7adm /opcache-free [14:44:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:44:30] 10Operations, 10Gerrit, 10Release-Engineering-Team (Watching / External): Add prometheus exporter to Gerrit - https://phabricator.wikimedia.org/T184086 (10crusnov) >>! In T184086#5162119, @Paladox wrote: > @crusnov we could use your help, yup. We need to create a prometheusBearerToken [plugin.javamelody.prom... [14:44:32] 10Operations, 10ops-codfw, 10decommission, 10cloud-services-team (Kanban): decommmision: labtestweb2001.wikimedia.org - https://phabricator.wikimedia.org/T218024 (10Papaul) [14:44:36] RECOVERY - PHP7 rendering on mw1256 is OK: HTTP OK: HTTP/1.1 200 OK - 75405 bytes in 1.031 second response time https://wikitech.wikimedia.org/wiki/Application_servers [14:45:04] 10Operations, 10ops-codfw, 10decommission: decommission: labtestservices2001.wikimedia.org - https://phabricator.wikimedia.org/T218022 (10Papaul) [14:45:21] anyone deploying mw stuff? Like to sneak a maintenance script only change in [14:46:42] (03CR) 10CDanis: [C: 03+1] php-fpm: revalidate opcache every 60 seconds. 
[puppet] - 10https://gerrit.wikimedia.org/r/508590 (https://phabricator.wikimedia.org/T221347) (owner: 10Giuseppe Lavagetto) [14:47:11] 10Operations, 10ops-eqiad: Degraded RAID on ms-be1013 - https://phabricator.wikimedia.org/T220907 (10Cmjohnson) 05Open→03Stalled I drained the flea power and you should not have any issues with the idrac. The server is still out of warranty so not much I can do about the raid at this point in time. I did... [14:47:25] ebernhardson: I think _joe_ is about to deploy an urgent patch [14:48:08] (03PS7) 10Herron: WIP puppetmaster-standalone - add dynamic envs that map to gerrit [puppet] - 10https://gerrit.wikimedia.org/r/507846 [14:48:15] (03PS2) 10Jbond: puppet errors: Tests [puppet] - 10https://gerrit.wikimedia.org/r/508573 [14:48:34] (03CR) 10Ema: "> Partial -1. While the syntax is totally correct, it will create ATS" [puppet] - 10https://gerrit.wikimedia.org/r/508588 (https://phabricator.wikimedia.org/T213263) (owner: 10Ema) [14:48:44] volans: ack [14:48:57] (03CR) 10Volans: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/508588 (https://phabricator.wikimedia.org/T213263) (owner: 10Ema) [14:49:06] RECOVERY - Device not healthy -SMART- on db2049 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=db2049&var-datasource=codfw+prometheus/ops [14:49:39] <_joe_> volans: it's not urgent [14:49:44] <_joe_> go on ebernhardson [14:49:47] _joe_: thanks [14:52:01] (03CR) 10Giuseppe Lavagetto: [C: 03+2] php-fpm: revalidate opcache every 60 seconds. 
[puppet] - 10https://gerrit.wikimedia.org/r/508590 (https://phabricator.wikimedia.org/T221347) (owner: 10Giuseppe Lavagetto) [14:53:11] !log pool mw1256 [14:53:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:53:28] !log pool mw1271 [14:53:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:55:06] 10Operations, 10ops-eqiad, 10DBA, 10Goal, and 2 others: rack/setup/install db11[26-38].eqiad.wmnet - https://phabricator.wikimedia.org/T211613 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by marostegui on cumin1001.eqiad.wmnet for hosts: ` ['db1133.eqiad.wmnet'] ` The log can be found in `/v... [14:55:35] 10Operations, 10Puppet, 10puppet-compiler: Frequent puppet failures - https://phabricator.wikimedia.org/T221529 (10jbond) >! In T221529#5144319, @crusnov wrote: > Has anyone checked if the 5xx errors happen to coincide with puppet-merge happening? I just checked and as far as i can see puppet-merge dose not... [14:55:59] (03Abandoned) 10Jbond: puppet errors: Tests [puppet] - 10https://gerrit.wikimedia.org/r/508573 (owner: 10Jbond) [14:57:13] (03PS1) 10Reedy: Update size dblists [mediawiki-config] - 10https://gerrit.wikimedia.org/r/508595 [14:59:06] (03PS1) 10Vgutierrez: Fix proxy fetch multi URL cherry-pick [debs/pybal] (1.15-stretch) - 10https://gerrit.wikimedia.org/r/508596 [15:00:22] 10Operations, 10Traffic, 10Patch-For-Review: Partial cache_upload traffic switchover to ATS and switchback to Varnish - https://phabricator.wikimedia.org/T213263 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['cp1072.eqiad.wmnet', 'cp1073.eqiad.wmnet', 'cp1071.eqiad.wmnet'] ` and were **ALL** su... 
[15:00:48] 10Operations, 10ops-codfw, 10DBA, 10Goal, 10Patch-For-Review: rack/setup/install db2[103-120].codfw.wmnet (18 hosts) - https://phabricator.wikimedia.org/T221532 (10Papaul) [15:01:02] 10Operations, 10ops-eqiad, 10DBA, 10Goal, and 2 others: rack/setup/install db11[26-38].eqiad.wmnet - https://phabricator.wikimedia.org/T211613 (10Marostegui) @Cmjohnson re-created the RAID on site, but it is still showing up as degraded, so this host might need further troubleshooting. Not a big priority n... [15:03:57] 10Operations, 10Traffic, 10Patch-For-Review: Partial cache_upload traffic switchover to ATS and switchback to Varnish - https://phabricator.wikimedia.org/T213263 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['cp1074.eqiad.wmnet', 'cp2021.codfw.wmnet', 'cp2015.codfw.wmnet', 'cp2003.codfw.wmnet']... [15:07:18] 10Operations, 10Traffic, 10Patch-For-Review: Partial cache_upload traffic switchover to ATS and switchback to Varnish - https://phabricator.wikimedia.org/T213263 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by ema on cumin1001.eqiad.wmnet for hosts: ` ['cp2009.codfw.wmnet'] ` The log can be fo... 
[15:07:41] vgutierrez: If/when you have a few minutes… I have followup questions re: https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/474272/ [15:07:50] (03CR) 10Cwhite: initial attempt at a varnishkafka exporter (032 comments) [debs/prometheus-varnishkafka-exporter] - 10https://gerrit.wikimedia.org/r/507632 (https://phabricator.wikimedia.org/T196066) (owner: 10Cwhite) [15:08:31] andrewbogott: maybe post them in the phab task and I'll take a look ASAP [15:08:39] I'm in the middle of something right now, sorry [15:08:50] 'k [15:09:12] 10Operations, 10ops-eqiad: Bad disk on new host db1133 - https://phabricator.wikimedia.org/T222731 (10Marostegui) [15:13:24] 10Operations, 10ops-eqiad: Bad disk on new host db1133 - https://phabricator.wikimedia.org/T222731 (10Marostegui) [15:16:24] (03CR) 10Ottomata: "BTW, a Confluent engineer responded on my email thread. He said they'd been wanting to build in full on Prometheus support into librdkafk" [debs/prometheus-varnishkafka-exporter] - 10https://gerrit.wikimedia.org/r/507632 (https://phabricator.wikimedia.org/T196066) (owner: 10Cwhite) [15:17:51] (03PS6) 10Cwhite: initial attempt at a varnishkafka exporter [debs/prometheus-varnishkafka-exporter] - 10https://gerrit.wikimedia.org/r/507632 (https://phabricator.wikimedia.org/T196066) [15:19:31] (03CR) 10SBassett: [C: 03+1] "This should be fine, imo, on an ad-hoc basis with references/documentation as to why. 
Though at some point, a better solution or compromi" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/504474 (https://phabricator.wikimedia.org/T207900) (owner: 10Krinkle) [15:21:51] !log ebernhardson@deploy1001 Synchronized php-1.34.0-wmf.3/extensions/CirrusSearch/maintenance/forceSearchIndex.php: T222641: Cirrus maint script handle ancient logging rows (duration: 00m 52s) [15:21:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:21:56] T222641: Create archive indices and delete archive docs from general indices - https://phabricator.wikimedia.org/T222641 [15:26:18] 10Operations, 10ops-codfw, 10DBA, 10Goal, 10Patch-For-Review: rack/setup/install db2[103-120].codfw.wmnet (18 hosts) - https://phabricator.wikimedia.org/T221532 (10Papaul) [15:27:10] (03CR) 10Giuseppe Lavagetto: [C: 03+1] Fix proxy fetch multi URL cherry-pick [debs/pybal] (1.15-stretch) - 10https://gerrit.wikimedia.org/r/508596 (owner: 10Vgutierrez) [15:28:48] (03CR) 10Vgutierrez: [C: 03+2] Fix proxy fetch multi URL cherry-pick [debs/pybal] (1.15-stretch) - 10https://gerrit.wikimedia.org/r/508596 (owner: 10Vgutierrez) [15:29:50] (03CR) 10Volans: "Mostly ok, minor suggestions inline." (033 comments) [software/netbox-reports] - 10https://gerrit.wikimedia.org/r/506008 (https://phabricator.wikimedia.org/T220422) (owner: 10CRusnov) [15:30:09] (03PS1) 10Vgutierrez: Release 1.15.6 [debs/pybal] (1.15-stretch) - 10https://gerrit.wikimedia.org/r/508600 [15:32:55] (03CR) 10Volans: [C: 03+1] "LGTM" [software/netbox-reports] - 10https://gerrit.wikimedia.org/r/507717 (owner: 10CRusnov) [15:32:57] (03PS2) 10Vgutierrez: Release 1.15.6 [debs/pybal] (1.15-stretch) - 10https://gerrit.wikimedia.org/r/508600 [15:33:29] (03PS6) 10CRusnov: Cleanups to the oldhardware report [software/netbox-reports] - 10https://gerrit.wikimedia.org/r/506008 (https://phabricator.wikimedia.org/T220422) [15:34:10] (03CR) 10CRusnov: "Thanks!" 
(033 comments) [software/netbox-reports] - 10https://gerrit.wikimedia.org/r/506008 (https://phabricator.wikimedia.org/T220422) (owner: 10CRusnov) [15:34:15] (03CR) 10Vgutierrez: [C: 03+2] Release 1.15.6 [debs/pybal] (1.15-stretch) - 10https://gerrit.wikimedia.org/r/508600 (owner: 10Vgutierrez) [15:34:41] (03CR) 10CRusnov: [C: 03+2] Add device model/device type parity check [software/netbox-reports] - 10https://gerrit.wikimedia.org/r/507717 (owner: 10CRusnov) [15:34:45] (03Merged) 10jenkins-bot: Release 1.15.6 [debs/pybal] (1.15-stretch) - 10https://gerrit.wikimedia.org/r/508600 (owner: 10Vgutierrez) [15:34:48] (03PS3) 10CRusnov: Add device model/device type parity check [software/netbox-reports] - 10https://gerrit.wikimedia.org/r/507717 [15:34:51] (03CR) 10Volans: [C: 03+1] "LGTM" [software/netbox-reports] - 10https://gerrit.wikimedia.org/r/506017 (https://phabricator.wikimedia.org/T220422) (owner: 10CRusnov) [15:35:21] (03PS3) 10CRusnov: Minor improvements to management console report [software/netbox-reports] - 10https://gerrit.wikimedia.org/r/506017 (https://phabricator.wikimedia.org/T220422) [15:35:59] 10Operations, 10ops-codfw, 10DBA, 10Goal, 10Patch-For-Review: rack/setup/install db2[103-120].codfw.wmnet (18 hosts) - https://phabricator.wikimedia.org/T221532 (10Papaul) [15:38:12] !log uploaded pybal 1.15.6 to apt.wikimedia.org (stretch && jessie) [15:38:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:40:28] andrewbogott: so, what do you need? :) [15:42:08] 10Operations, 10ops-codfw, 10DBA, 10Goal, 10Patch-For-Review: rack/setup/install db2[103-120].codfw.wmnet (18 hosts) - https://phabricator.wikimedia.org/T221532 (10Papaul) @Marostegui @jcrespo please hold on to db2114. It looks like the system has some Hardware issues, I am investigating. 
[15:43:18] 10Operations, 10ops-codfw, 10DBA, 10Goal, 10Patch-For-Review: rack/setup/install db2[103-120].codfw.wmnet (18 hosts) - https://phabricator.wikimedia.org/T221532 (10Marostegui) Thanks Papaul We won't do anything to any of the hosts until we've got the green light from you in this ticket Thanks! [15:43:46] (03PS3) 10Fsero: Configure eventgate-main for k8s deployment [puppet] - 10https://gerrit.wikimedia.org/r/508371 (https://phabricator.wikimedia.org/T218346) (owner: 10Ottomata) [15:46:50] (03PS6) 10Vgutierrez: prometheus: Support several instances of the ATS exporter [puppet] - 10https://gerrit.wikimedia.org/r/506659 (https://phabricator.wikimedia.org/T221217) [15:47:22] !log creating eventgate-main namespace on k8s clusters [15:47:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:47:39] !log creating eventgate-main namespace on k8s clusters - T218346 [15:47:40] T218346: Modern Event Platform: Deploy instance of EventGate service that produces events to kafka main - https://phabricator.wikimedia.org/T218346 [15:47:43] vgutierrez: I applied that patch on cloudvirt1024.eqiad.wmnet and it doesn't seem able to bring the tagged interface up [15:48:04] I haven't dug into it much beyond that actually, thought you might have a quick explanation [15:48:38] 10Operations, 10ops-codfw, 10DBA, 10Goal, 10Patch-For-Review: rack/setup/install db2[103-120].codfw.wmnet (18 hosts) - https://phabricator.wikimedia.org/T221532 (10Papaul) db2114 Critical,Tue 07 May 2019 10:04:30,Fan redundancy is lost., Normal,Tue 07 May 2019 10:03:33,The fans are redundant., Critical,... [15:49:55] 10Operations, 10Traffic, 10Patch-For-Review: Partial cache_upload traffic switchover to ATS and switchback to Varnish - https://phabricator.wikimedia.org/T213263 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['cp2009.codfw.wmnet'] ` and were **ALL** successful. 
[15:50:19] vgutierrez: My first question thought is… how would it know that p175s0f1d1.1105 is associated with enp175s0f1d1 now that the name is different? [15:51:10] 10Operations, 10ops-codfw, 10DBA, 10Goal, 10Patch-For-Review: rack/setup/install db2[103-120].codfw.wmnet (18 hosts) - https://phabricator.wikimedia.org/T221532 (10Papaul) it looks like we are missng FAN 3B i am going to open the server and double check Status Name Type PWM (% of Max) RPM System Board... [15:52:27] (03CR) 10Volans: [C: 03+1] "LGTM, just make sure to test everything once deployed to make sure we're not missing anything." [puppet] - 10https://gerrit.wikimedia.org/r/506579 (owner: 10Dzahn) [15:54:56] andrewbogott: so it looks like we need to provide custom ip link stanzas [15:55:31] something like ip link add link enp175s0f1d1 name p175s0f1d1.1105 type vlan id 1105 should do the trick [15:55:57] (03CR) 10Fsero: [C: 03+2] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/508371 (https://phabricator.wikimedia.org/T218346) (owner: 10Ottomata) [15:57:50] (03CR) 10Dzahn: [C: 03+2] icinga: remove 'system/process command access' for everyone [puppet] - 10https://gerrit.wikimedia.org/r/506579 (owner: 10Dzahn) [15:57:55] vgutierrez: hm, is that easier than renaming enp175s0f1d1 to p175s0f1d1? <- not that I know if that's possible [15:58:40] it's possible but I don't know if we wanna go that way :) [15:58:48] PROBLEM - Host db2114.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [15:58:54] ok, so you mean like [15:58:56] https://www.irccloud.com/pastebin/jdqHek56/ [15:59:27] db2114 can be ignored [16:00:04] godog and _joe_: It is that lovely time of the day again! You are hereby commanded to deploy Puppet SWAT(Max 6 patches). (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190507T1600). [16:00:04] No GERRIT patches in the queue for this window AFAICS. [16:00:17] hm, also tried [16:00:20] https://www.irccloud.com/pastebin/bgGoYzLR/ [16:00:29] to no effect. 
Although maybe I have to force that to reload somehow... [16:00:58] 10Operations, 10ops-eqiad, 10DBA, 10Goal, and 2 others: rack/setup/install db11[26-38].eqiad.wmnet - https://phabricator.wikimedia.org/T211613 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['db1133.eqiad.wmnet'] ` Of which those **FAILED**: ` ['db1133.eqiad.wmnet'] ` [16:01:13] andrewbogott: nope, check https://manpages.debian.org/stretch/ifupdown/interfaces.5.en.html [16:01:46] ok, I'll… read. [16:01:50] gotta run for now [16:02:45] 10Operations, 10Thumbor, 10serviceops, 10Patch-For-Review, and 2 others: Build Thumbor packages for buster - https://phabricator.wikimedia.org/T221562 (10jijiki) [16:03:55] andrewbogott: so.. from https://wiki.debian.org/NetworkConfiguration#Manual_config, "vlan-raw-device enp175s0f1d1" within the iface p175s0f1d1.1105 should fix it [16:04:34] (03CR) 10Filippo Giunchedi: "> Patch Set 5:" (031 comment) [debs/prometheus-varnishkafka-exporter] - 10https://gerrit.wikimedia.org/r/507632 (https://phabricator.wikimedia.org/T196066) (owner: 10Cwhite) [16:04:42] 10Operations, 10Thumbor, 10serviceops, 10User-jijiki: Upgrade Thumbor to Buster - https://phabricator.wikimedia.org/T216815 (10jijiki) [16:04:44] 10Operations, 10Thumbor, 10serviceops, 10Patch-For-Review, and 2 others: Build Thumbor packages for buster - https://phabricator.wikimedia.org/T221562 (10jijiki) 05Open→03Resolved @Gilles All packages have been rebuilt and added to buster-wikimedia main repo. Please reopen if we have any issues. 
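Note on the vlan-raw-device suggestion above: a minimal sketch of the idea, not the puppet change itself. Tagged names such as enp175s0f1d1.1105 exceed the 15-character limit (IFNAMSIZ is 15 plus the trailing NUL), so the interface gets the shortened name p175s0f1d1.1105 and the stanza names the parent device explicitly. The helper names and the "inet manual" method are assumptions made for illustration; only the interface names and the vlan-raw-device option come from the discussion.

```python
# Sketch only: helper names and the "manual" method are hypothetical.
IFNAMSIZ = 16  # Linux limit: 15 visible characters plus the trailing NUL


def fits_ifnamsiz(name: str) -> bool:
    """Return True if the interface name fits within IFNAMSIZ."""
    return len(name.encode()) < IFNAMSIZ


def vlan_stanza(iface: str, raw_device: str, method: str = "manual") -> str:
    """Build an /etc/network/interfaces stanza for a tagged interface.

    vlan-raw-device tells ifupdown which parent device carries the VLAN,
    which is needed once the tagged name no longer starts with the parent name.
    """
    if not fits_ifnamsiz(iface):
        raise ValueError(f"{iface!r} is longer than {IFNAMSIZ - 1} characters")
    lines = [
        f"auto {iface}",
        f"iface {iface} inet {method}",
        f"    vlan-raw-device {raw_device}",
    ]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(fits_ifnamsiz("enp175s0f1d1.1105"))  # full tagged name, 17 characters
    print(vlan_stanza("p175s0f1d1.1105", "enp175s0f1d1"))
```

The length check is there only to show why the name was shortened in the first place; a real stanza would also carry whatever address or bridge settings the host needs.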
[16:05:18] !log created eventgate-main tokens [16:05:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:05:32] !log created eventgate-main tokens - T218346 [16:05:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:05:36] T218346: Modern Event Platform: Deploy instance of EventGate service that produces events to kafka main - https://phabricator.wikimedia.org/T218346 [16:09:07] 10Operations, 10Patch-For-Review: uwsgi's logsocket_plugin.so causes segfaults during log rotation - https://phabricator.wikimedia.org/T212697 (10elukey) @MoritzMuehlenhoff we have an interesting segfault that happens for uwsgi when systemctl restart the netbox unit, but only in production and not in labs. Cas... [16:09:42] RECOVERY - Host db2114.mgmt is UP: PING OK - Packet loss = 0%, RTA = 36.99 ms [16:12:41] 10Operations, 10Patch-For-Review: uwsgi's logsocket_plugin.so causes segfaults during log rotation - https://phabricator.wikimedia.org/T212697 (10crusnov) The patch seems sane and simple. I concur with this plan fwiw. [16:19:38] (03PS1) 10CRusnov: Add reports reqs and rebuild artifacts [software/netbox-deploy] - 10https://gerrit.wikimedia.org/r/508613 [16:20:34] (03PS2) 10Dzahn: icinga: remove 'system/process command access' for everyone [puppet] - 10https://gerrit.wikimedia.org/r/506579 [16:21:40] 10Operations, 10User-fgiunchedi, 10Wikimedia-production-error (Shared Build Failure): PHP Fatal error: The UdpSocket to 127.0.0.1:10514 has been closed (from Monolog/SyslogUdp) - https://phabricator.wikimedia.org/T214734 (10Krinkle) p:05High→03Low Didn't realise it was limited to mwdebug1002. Never chec... 
[16:21:51] 10Operations, 10User-fgiunchedi, 10Wikimedia-production-error: PHP Fatal error: The UdpSocket to 127.0.0.1:10514 has been closed (from Monolog/SyslogUdp) - https://phabricator.wikimedia.org/T214734 (10Krinkle) [16:22:12] 10Operations, 10User-fgiunchedi, 10Wikimedia-production-error: PHP Fatal error: The UdpSocket to 127.0.0.1:10514 has been closed (from Monolog/SyslogUdp on mwdebug1002) - https://phabricator.wikimedia.org/T214734 (10Krinkle) [16:23:23] 10Operations, 10Core Platform Team (PHP7 (TEC4)), 10Core Platform Team Kanban (Doing), 10HHVM, and 3 others: Migrate to PHP 7 in WMF production - https://phabricator.wikimedia.org/T176370 (10Krinkle) [16:23:48] (03PS4) 10CRusnov: Minor improvements to management console report [software/netbox-reports] - 10https://gerrit.wikimedia.org/r/506017 (https://phabricator.wikimedia.org/T220422) [16:23:50] (03CR) 10Volans: "Nice effort! Approach looks good. Few comment/question inline." (0311 comments) [software/netbox-reports] - 10https://gerrit.wikimedia.org/r/506663 (owner: 10Faidon Liambotis) [16:25:02] (03CR) 10CRusnov: [C: 03+2] Minor improvements to management console report [software/netbox-reports] - 10https://gerrit.wikimedia.org/r/506017 (https://phabricator.wikimedia.org/T220422) (owner: 10CRusnov) [16:27:57] (03CR) 10Dzahn: [C: 03+2] icinga: remove 'system/process command access' for everyone [puppet] - 10https://gerrit.wikimedia.org/r/506579 (owner: 10Dzahn) [16:28:05] (03PS3) 10Dzahn: icinga: remove 'system/process command access' for everyone [puppet] - 10https://gerrit.wikimedia.org/r/506579 [16:38:16] !log rebooting cloudvirt1024 to test interfaces configuration [16:38:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:39:39] !log otto@deploy1001 scap-helm eventgate-main install -n main -f main/staging-values.yaml stable/eventgate [namespace: eventgate-main, clusters: staging] [16:39:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log 
[16:41:46] PROBLEM - puppet last run on pybal-test2001 is CRITICAL: CRITICAL: Puppet has 8 failures. Last run 3 minutes ago with 8 failures. Failed resources (up to 3 shown): Exec[eth0_v6_token],Service[rsyslog],Service[nagios-nrpe-server],Exec[ip addr add 2620:0:860:102:10:192:16:139/64 dev eth0] [16:49:18] PROBLEM - Host db2114.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [16:49:59] papaul: ^ [16:50:41] (03PS1) 10Andrew Bogott: Specify vlan-raw-device for truncated, tagged interfaces [puppet] - 10https://gerrit.wikimedia.org/r/508618 [16:51:44] mutante: that is expected, the host is being under maintenance [16:51:47] Going to downtime it [16:52:00] So it doesn't generate noise [16:52:32] (03CR) 10Muehlenhoff: [C: 03+1] ATS: update cumin aliases [puppet] - 10https://gerrit.wikimedia.org/r/508588 (https://phabricator.wikimedia.org/T213263) (owner: 10Ema) [16:52:33] marostegui: gotcha! since it was only mgmt i assumed it was related to installing the new servers. cool [16:52:59] yeah, that's it :) [16:53:19] now i just wonder what is up with analytics1029.. all the checks are just disabled but nothing in SAL or phab... [16:53:52] could be that https://phabricator.wikimedia.org/T178742 is back or could be something new... hard to tell this way [16:56:36] (03PS7) 10Cwhite: initial attempt at a varnishkafka exporter [debs/prometheus-varnishkafka-exporter] - 10https://gerrit.wikimedia.org/r/507632 (https://phabricator.wikimedia.org/T196066) [16:57:48] (03CR) 10Cwhite: "During the rename, I also found a block that was indented too far." (031 comment) [debs/prometheus-varnishkafka-exporter] - 10https://gerrit.wikimedia.org/r/507632 (https://phabricator.wikimedia.org/T196066) (owner: 10Cwhite) [16:58:40] 10Operations, 10ops-eqiad, 10Analytics, 10User-Elukey: Possibly faulty BBU on analytics1029 - https://phabricator.wikimedia.org/T178742 (10Dzahn) Is the issue back or is this known ? 
https://icinga.wikimedia.org/cgi-bin/icinga/extinfo.cgi?type=2&host=analytics1029&service=Device+not+healthy+-SMART- [16:59:09] !log otto@deploy1001 scap-helm eventgate-main install -n main -f main/staging-values.yaml stable/eventgate [namespace: eventgate-main, clusters: staging] [16:59:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:59:49] (03CR) 10Dzahn: "i was still able to ACK a CRIT in the web UI after merging this" [puppet] - 10https://gerrit.wikimedia.org/r/506579 (owner: 10Dzahn) [17:00:04] cscott, arlolra, subbu, and halfak: My dear minions, it's time we take the moon! Just kidding. Time for Services – Graphoid / Parsoid / Citoid / ORES deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190507T1700). [17:00:31] using downtimes instead of disabling notifications gets the same benefits but removes the uncertainty [17:01:01] and it's easy to forget re-enabling [17:01:30] (03CR) 10Volans: [C: 03+1] "LGTM" [software/netbox-deploy] - 10https://gerrit.wikimedia.org/r/508613 (owner: 10CRusnov) [17:02:42] (03PS1) 10Paladox: Gerrit: Set plugin.javamelody.prometheusBearerToken [puppet] - 10https://gerrit.wikimedia.org/r/508621 [17:02:54] ACKNOWLEDGEMENT - Host db2114.mgmt is DOWN: PING CRITICAL - Packet loss = 100% daniel_zahn https://phabricator.wikimedia.org/T221532 [17:03:57] (03CR) 10Volans: [C: 03+1] "LGTM" (031 comment) [software/netbox-reports] - 10https://gerrit.wikimedia.org/r/506008 (https://phabricator.wikimedia.org/T220422) (owner: 10CRusnov) [17:04:24] (03PS7) 10CRusnov: Cleanups to the oldhardware report [software/netbox-reports] - 10https://gerrit.wikimedia.org/r/506008 (https://phabricator.wikimedia.org/T220422) [17:05:42] RECOVERY - Host db2114.mgmt is UP: PING OK - Packet loss = 0%, RTA = 37.15 ms [17:05:51] (03PS3) 10Dzahn: admins: simplify sudo privs for phab-admin group [puppet] - 10https://gerrit.wikimedia.org/r/508373 [17:05:55] 10Operations, 10ops-codfw: Degraded RAID on db2049 - 
https://phabricator.wikimedia.org/T222526 (10Marostegui) 05Open→03Resolved It worked this time! ` root@db2049:~# hpssacli controller all show config Smart Array P420i in Slot 0 (Embedded) (sn: 0014380337DD260) Port Name: 1I Port Name: 2I... [17:06:39] (03CR) 10Dzahn: [C: 03+2] admins: simplify sudo privs for phab-admin group [puppet] - 10https://gerrit.wikimedia.org/r/508373 (owner: 10Dzahn) [17:06:55] (03CR) 10Ottomata: "> If we did this what should we do about node-rdkafka-statsd users like e.g. eventgate and its statsd-exporter mappings?" [debs/prometheus-varnishkafka-exporter] - 10https://gerrit.wikimedia.org/r/507632 (https://phabricator.wikimedia.org/T196066) (owner: 10Cwhite) [17:07:37] (03CR) 10Gehel: Gerrit: Configure logging in json to error_log.json (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/508391 (https://phabricator.wikimedia.org/T141324) (owner: 10Paladox) [17:07:46] (03PS2) 10Paladox: Gerrit: Set plugin.javamelody.prometheusBearerToken [puppet] - 10https://gerrit.wikimedia.org/r/508621 [17:07:56] RECOVERY - puppet last run on pybal-test2001 is OK: OK: Puppet is currently enabled, last run 16 seconds ago with 0 failures [17:08:46] (03PS3) 10Paladox: Gerrit: Set plugin.javamelody.prometheusBearerToken [puppet] - 10https://gerrit.wikimedia.org/r/508621 (https://phabricator.wikimedia.org/T184086) [17:09:37] (03CR) 10jerkins-bot: [V: 04-1] Gerrit: Set plugin.javamelody.prometheusBearerToken [puppet] - 10https://gerrit.wikimedia.org/r/508621 (https://phabricator.wikimedia.org/T184086) (owner: 10Paladox) [17:09:39] (03PS1) 10CRusnov: Add dummy gsheets.cfg for netbox reports. 
[labs/private] - 10https://gerrit.wikimedia.org/r/508624 [17:11:00] (03PS4) 10Paladox: Gerrit: Set plugin.javamelody.prometheusBearerToken [puppet] - 10https://gerrit.wikimedia.org/r/508621 (https://phabricator.wikimedia.org/T184086) [17:11:49] (03CR) 10jerkins-bot: [V: 04-1] Gerrit: Set plugin.javamelody.prometheusBearerToken [puppet] - 10https://gerrit.wikimedia.org/r/508621 (https://phabricator.wikimedia.org/T184086) (owner: 10Paladox) [17:12:27] (03PS5) 10Paladox: Gerrit: Set plugin.javamelody.prometheusBearerToken [puppet] - 10https://gerrit.wikimedia.org/r/508621 (https://phabricator.wikimedia.org/T184086) [17:12:28] RECOVERY - HP RAID on db2049 is OK: OK: Slot 0: OK: 1I:1:1, 1I:1:2, 1I:1:3, 1I:1:4, 1I:1:5, 1I:1:6, 1I:1:7, 1I:1:8, 1I:1:9, 1I:1:10, 1I:1:11, 1I:1:12 - Controller: OK - Battery/Capacitor: OK https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering [17:12:56] PROBLEM - Host db2114.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [17:13:06] 10Operations: tracking task: jessie -> stretch - https://phabricator.wikimedia.org/T168494 (10Dzahn) @Muehlenhoff Does it make sense to reopen? 
[17:14:07] (03PS1) 10CRusnov: profile::netbox: Deploy gsheets config for reports [puppet] - 10https://gerrit.wikimedia.org/r/508625 [17:14:17] (03PS5) 10BBlack: Convert most DYNA into 1H CNAME records [dns] - 10https://gerrit.wikimedia.org/r/507399 (https://phabricator.wikimedia.org/T208263) [17:14:19] (03PS5) 10BBlack: Change CNAME->DYNA TTLs from 1H to 1D [dns] - 10https://gerrit.wikimedia.org/r/507400 (https://phabricator.wikimedia.org/T208263) [17:15:50] (03CR) 10CRusnov: [C: 03+2] Cleanups to the oldhardware report [software/netbox-reports] - 10https://gerrit.wikimedia.org/r/506008 (https://phabricator.wikimedia.org/T220422) (owner: 10CRusnov) [17:16:23] (03CR) 10CRusnov: [V: 03+2 C: 03+2] Add reports reqs and rebuild artifacts [software/netbox-deploy] - 10https://gerrit.wikimedia.org/r/508613 (owner: 10CRusnov) [17:19:42] (03PS1) 10Jdlrobson: Enable content provider on de beta cluster for QA purposes [mediawiki-config] - 10https://gerrit.wikimedia.org/r/508631 [17:19:51] 10Operations, 10Traffic, 10Patch-For-Review: tagged_interface sometimes exceeds IFNAMSIZ - https://phabricator.wikimedia.org/T209707 (10aborrero) After the trimmed interface name, we had to generate a `/etc/network/interface` file like this by hand for the config to survive a reboot: ` auto p175s0f1d1.1105... [17:23:23] RECOVERY - Host db2114.mgmt is UP: PING OK - Packet loss = 0%, RTA = 37.01 ms [17:23:36] 10Operations, 10Traffic, 10Patch-For-Review: tagged_interface sometimes exceeds IFNAMSIZ - https://phabricator.wikimedia.org/T209707 (10Vgutierrez) As discussed on IRC, using vlan-raw-device enp175s0f1d1 should be enough, as recommended in https://wiki.debian.org/NetworkConfiguration#Manual_config [17:25:21] 10Operations, 10ops-codfw, 10DBA, 10Goal, 10Patch-For-Review: rack/setup/install db2[103-120].codfw.wmnet (18 hosts) - https://phabricator.wikimedia.org/T221532 (10Papaul) I swapped FAN 3 with FAN 5 still have the same issue so the problem is not the FAN it has to be on the main board. 
I will contact DE... [17:26:41] (03PS2) 10Jdlrobson: Enable content provider on de beta cluster for QA purposes [mediawiki-config] - 10https://gerrit.wikimedia.org/r/508631 [17:28:30] 10Operations, 10Traffic, 10Patch-For-Review: tagged_interface sometimes exceeds IFNAMSIZ - https://phabricator.wikimedia.org/T209707 (10aborrero) >>! In T209707#5165188, @Vgutierrez wrote: > As discussed on IRC, using vlan-raw-device enp175s0f1d1 should be enough, as recommended in https://wiki.debian.org/Ne... [17:29:16] 10Operations, 10ops-codfw, 10DBA, 10Goal, 10Patch-For-Review: rack/setup/install db2[103-120].codfw.wmnet (18 hosts) - https://phabricator.wikimedia.org/T221532 (10Papaul) [17:30:15] (03PS6) 10Paladox: Gerrit: Set plugin.javamelody.prometheusBearerToken [puppet] - 10https://gerrit.wikimedia.org/r/508621 (https://phabricator.wikimedia.org/T184086) [17:30:27] (03CR) 10Paladox: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/508621 (https://phabricator.wikimedia.org/T184086) (owner: 10Paladox) [17:31:58] !log rebooting cloudvirt1024 to test interfaces configuration [17:32:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:32:35] (03CR) 10Thcipriani: [C: 03+1] "Nice!" 
[puppet] - 10https://gerrit.wikimedia.org/r/508621 (https://phabricator.wikimedia.org/T184086) (owner: 10Paladox) [17:32:44] (03PS7) 10Paladox: Gerrit: Set plugin.javamelody.prometheusBearerToken [puppet] - 10https://gerrit.wikimedia.org/r/508621 (https://phabricator.wikimedia.org/T184086) [17:32:49] (03CR) 10Paladox: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/508621 (https://phabricator.wikimedia.org/T184086) (owner: 10Paladox) [17:33:56] (03Abandoned) 10Paladox: mysql: Fix installing package on stretch [puppet] - 10https://gerrit.wikimedia.org/r/354131 (owner: 10Paladox) [17:35:13] (03PS8) 10Paladox: Gerrit: Set plugin.javamelody.prometheusBearerToken [puppet] - 10https://gerrit.wikimedia.org/r/508621 (https://phabricator.wikimedia.org/T184086) [17:35:18] (03CR) 10Paladox: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/508621 (https://phabricator.wikimedia.org/T184086) (owner: 10Paladox) [17:38:51] (03CR) 10Volans: "> Patch Set 5:" [software/spicerack] - 10https://gerrit.wikimedia.org/r/507561 (owner: 10Jbond) [17:40:22] (03PS1) 10Paladox: Gerrit: Add a dummy token for gerrit::server::prometheus_bearer_token [labs/private] - 10https://gerrit.wikimedia.org/r/508644 [17:42:14] (03PS9) 10Paladox: Gerrit: Set plugin.javamelody.prometheusBearerToken [puppet] - 10https://gerrit.wikimedia.org/r/508621 (https://phabricator.wikimedia.org/T184086) [17:43:13] (03PS2) 10Paladox: Gerrit: Add a dummy token for passwords::gerrit::prometheus_bearer_token [labs/private] - 10https://gerrit.wikimedia.org/r/508644 [17:44:02] (03CR) 10Paladox: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/508621 (https://phabricator.wikimedia.org/T184086) (owner: 10Paladox) [17:44:45] 10Operations, 10ops-codfw, 10Discovery-Search (Current work): elastic2038 CPU/memory errors - https://phabricator.wikimedia.org/T217398 (10Dzahn) a:05Papaul→03Gehel [17:45:15] !log starting branchcut for train (1.34.0-wmf.4) [17:45:18] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:45:28] 10Operations, 10LDAP-Access-Requests, 10WMF-Legal, 10WMF-NDA-Requests: Request to be added to the ldap/wmde group - https://phabricator.wikimedia.org/T222538 (10RStallman-legalteam) Adam has signed an amendment to reflect the WMDE-LDAP access level. So this is all set. [17:46:18] 10Operations, 10LDAP-Access-Requests, 10WMF-Legal, 10WMF-NDA-Requests: Request to be added to the ldap/wmde group - https://phabricator.wikimedia.org/T222538 (10Dzahn) a:03Dzahn [17:49:10] (03PS1) 10Cmjohnson: Adding mgmt dns for restbase1019-27 [dns] - 10https://gerrit.wikimedia.org/r/508650 (https://phabricator.wikimedia.org/T219404) [17:49:33] (03CR) 10jerkins-bot: [V: 04-1] Adding mgmt dns for restbase1019-27 [dns] - 10https://gerrit.wikimedia.org/r/508650 (https://phabricator.wikimedia.org/T219404) (owner: 10Cmjohnson) [17:50:46] 10Operations, 10Patch-For-Review: uwsgi's logsocket_plugin.so causes segfaults during log rotation - https://phabricator.wikimedia.org/T212697 (10elukey) On boron I have built `uwsgi-core_2.0.14+20161117-3+deb9u2~wmf1_amd64.deb` with the following patch: ` elukey@boron:~/uwsgi-2.0.14+20161117$ cat debian/patc... 
[17:50:48] (03PS2) 10Cmjohnson: Adding mgmt dns for restbase1019-27 [dns] - 10https://gerrit.wikimedia.org/r/508650 (https://phabricator.wikimedia.org/T219404) [17:51:10] (03CR) 10jerkins-bot: [V: 04-1] Adding mgmt dns for restbase1019-27 [dns] - 10https://gerrit.wikimedia.org/r/508650 (https://phabricator.wikimedia.org/T219404) (owner: 10Cmjohnson) [17:53:49] (03PS4) 10Paladox: Gerrit: Configure logging in json to gerrit.json [puppet] - 10https://gerrit.wikimedia.org/r/508391 (https://phabricator.wikimedia.org/T141324) [17:53:49] (03CR) 10Paladox: Gerrit: Configure logging in json to gerrit.json (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/508391 (https://phabricator.wikimedia.org/T141324) (owner: 10Paladox) [17:54:09] (03PS1) 10Paladox: Gerrit: Remove additivity from it's log4j file [puppet] - 10https://gerrit.wikimedia.org/r/508657 [17:54:45] (03PS2) 10Paladox: Gerrit: Remove additivity from it's log4j file [puppet] - 10https://gerrit.wikimedia.org/r/508657 [17:59:14] (03PS8) 10Herron: WIP puppetmaster-standalone - add dynamic envs that map to gerrit [puppet] - 10https://gerrit.wikimedia.org/r/507846 [18:00:04] Deploy window Pre MediaWiki train sanity break (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190507T1800) [18:03:20] (03CR) 10Volans: [C: 03+1] "I couldn't spot any obvious typo and code seems to reflect what's in the commit message. So +1 for me with the obvious caution for a chang" [dns] - 10https://gerrit.wikimedia.org/r/507399 (https://phabricator.wikimedia.org/T208263) (owner: 10BBlack) [18:07:16] 08Warning Alert for device asw2-a-eqiad.mgmt.eqiad.wmnet - Access port utilisation over 80% for 1h [18:07:27] 08Warning Alert for device asw2-c-eqiad.mgmt.eqiad.wmnet - Access port utilisation over 80% for 1h [18:07:37] (03CR) 10CRusnov: [V: 03+2 C: 03+2] "LGTM!" 
[labs/private] - 10https://gerrit.wikimedia.org/r/508644 (owner: 10Paladox) [18:07:59] (03CR) 10Paladox: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/508621 (https://phabricator.wikimedia.org/T184086) (owner: 10Paladox) [18:08:01] !log restarting icinga via web UI button [18:08:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:08:13] PROBLEM - Work requests waiting in Zuul Gearman server on contint1001 is CRITICAL: CRITICAL: 35.71% of data above the critical threshold [140.0] https://grafana.wikimedia.org/dashboard/db/zuul-gearman?panelId=10&fullscreen&orgId=1 [18:11:38] (03CR) 10CRusnov: "Compile looks good for this change:" [puppet] - 10https://gerrit.wikimedia.org/r/508621 (https://phabricator.wikimedia.org/T184086) (owner: 10Paladox) [18:12:05] (03CR) 10CRusnov: "> Patch Set 9:" [puppet] - 10https://gerrit.wikimedia.org/r/508621 (https://phabricator.wikimedia.org/T184086) (owner: 10Paladox) [18:13:18] (03CR) 10Dzahn: "confirmed that, even though buttons are still shown, executing commands to "restart" and "shutdown" icinga process are not allowed anymore" [puppet] - 10https://gerrit.wikimedia.org/r/506579 (owner: 10Dzahn) [18:13:53] PROBLEM - puppet last run on pybal-test2001 is CRITICAL: CRITICAL: Puppet has 14 failures. Last run 5 minutes ago with 14 failures. 
Failed resources (up to 3 shown): Package[ntp],Service[systemd-timesyncd],Package[diamond],Package[python-diamond] [18:14:55] RECOVERY - Work requests waiting in Zuul Gearman server on contint1001 is OK: OK: Less than 30.00% above the threshold [90.0] https://grafana.wikimedia.org/dashboard/db/zuul-gearman?panelId=10&fullscreen&orgId=1 [18:15:43] (03PS3) 10Cmjohnson: Adding mgmt dns for restbase1019-27 [dns] - 10https://gerrit.wikimedia.org/r/508650 (https://phabricator.wikimedia.org/T219404) [18:16:45] (03CR) 10Cmjohnson: [C: 03+2] Adding mgmt dns for restbase1019-27 [dns] - 10https://gerrit.wikimedia.org/r/508650 (https://phabricator.wikimedia.org/T219404) (owner: 10Cmjohnson) [18:18:19] 10Operations, 10Patch-For-Review: uwsgi's logsocket_plugin.so causes segfaults during log rotation - https://phabricator.wikimedia.org/T212697 (10MoritzMuehlenhoff) > Does it sound acceptable? Or do you prefer another way? Sounds great! [18:27:42] 10Operations, 10Traffic, 10Patch-For-Review: tagged_interface sometimes exceeds IFNAMSIZ - https://phabricator.wikimedia.org/T209707 (10Vgutierrez) so taking a deeper look into https://manpages.debian.org/jessie/vlan/vlan-interfaces.5.en.html: > vlan-raw-device devicename > Indicates the device to create the... 
[18:28:37] (03PS1) 10Vgutierrez: Revert "lvs: Avoid tagged network interfaces to hit IFNAMSIZ (15+\0) limit" [puppet] - 10https://gerrit.wikimedia.org/r/508668 [18:29:24] (03CR) 10jerkins-bot: [V: 04-1] Revert "lvs: Avoid tagged network interfaces to hit IFNAMSIZ (15+\0) limit" [puppet] - 10https://gerrit.wikimedia.org/r/508668 (owner: 10Vgutierrez) [18:30:10] (03PS2) 10Vgutierrez: Revert "lvs: Avoid tagged network interfaces to hit IFNAMSIZ (15+\0) limit" [puppet] - 10https://gerrit.wikimedia.org/r/508668 [18:37:29] (03PS1) 10RobH: splitting role::spare into staged and decomisssioning [puppet] - 10https://gerrit.wikimedia.org/r/508671 (https://phabricator.wikimedia.org/T222352) [18:38:21] !log LDAP - adding awight to 'wmde' group (T222538) [18:38:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:38:25] T222538: Request to be added to the ldap/wmde group - https://phabricator.wikimedia.org/T222538 [18:39:05] (03PS1) 10Thcipriani: Group0 to 1.34.0-wmf.4 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/508672 [18:39:14] 10Operations, 10ops-eqiad, 10RESTBase, 10Core Platform Team Backlog (Watching / External), and 3 others: rack/setup/install restbase10[19-27].eqiad.wmnet - https://phabricator.wikimedia.org/T219404 (10Cmjohnson) [18:39:56] (03CR) 10Vgutierrez: [C: 03+1] Revert "lvs: Avoid tagged network interfaces to hit IFNAMSIZ (15+\0) limit" [puppet] - 10https://gerrit.wikimedia.org/r/508668 (owner: 10Vgutierrez) [18:40:17] 10Operations, 10ops-eqiad, 10RESTBase, 10Core Platform Team Backlog (Watching / External), and 3 others: rack/setup/install restbase10[19-27].eqiad.wmnet - https://phabricator.wikimedia.org/T219404 (10Cmjohnson) a:05Cmjohnson→03RobH @mobrovac all of the on-site work has been completed, I am assigning t... 
[18:40:31] RECOVERY - puppet last run on pybal-test2001 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [18:40:39] 10Operations, 10ops-eqiad, 10RESTBase, 10Core Platform Team Backlog (Watching / External), and 3 others: rack/setup/install restbase10[19-27].eqiad.wmnet - https://phabricator.wikimedia.org/T219404 (10Cmjohnson) @robh these are currently powered down [18:40:49] 10Operations, 10LDAP-Access-Requests, 10WMF-Legal, 10WMF-NDA-Requests: Request to be added to the ldap/wmde group - https://phabricator.wikimedia.org/T222538 (10Dzahn) Thanks @RStallman-legalteam @awight Done! You have been added. I did not have to add you to the puppet module because you are already an... [18:41:03] 10Operations, 10LDAP-Access-Requests, 10WMF-Legal, 10WMF-NDA-Requests: Request to be added to the ldap/wmde group - https://phabricator.wikimedia.org/T222538 (10Dzahn) 05Open→03Resolved [18:46:43] (03CR) 10Andrew Bogott: [C: 03+1] "Reverting sounds right. I'm hoping you'll write me some kind of sample patch explaining the comments on the phab ticket though, because I" [puppet] - 10https://gerrit.wikimedia.org/r/508668 (owner: 10Vgutierrez) [18:46:52] (03PS8) 10Dzahn: Gerrit: Redirect /p/(.+)/info/(.+) to /$1/info/$2 [puppet] - 10https://gerrit.wikimedia.org/r/507787 (https://phabricator.wikimedia.org/T218844) (owner: 10Paladox) [18:47:28] PROBLEM - HTTP availability for Nginx -SSL terminators- at esams on icinga1001 is CRITICAL: cluster=cache_text site=esams https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [18:47:54] PROBLEM - Text HTTP 5xx reqs/min on graphite1004 is CRITICAL: CRITICAL: 10.00% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=All&var-cache_type=text&var-status_type=5 [18:48:22] PROBLEM - HTTP availability for Varnish at esams on icinga1001 is CRITICAL: job=varnish-text 
site=esams https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 [18:48:35] (03CR) 10BBlack: [C: 03+1] Revert "lvs: Avoid tagged network interfaces to hit IFNAMSIZ (15+\0) limit" [puppet] - 10https://gerrit.wikimedia.org/r/508668 (owner: 10Vgutierrez) [18:48:54] PROBLEM - Eqsin HTTP 5xx reqs/min on graphite1004 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=eqsin&var-cache_type=All&var-status_type=5 [18:49:52] PROBLEM - Eqiad HTTP 5xx reqs/min on graphite1004 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=eqiad&var-cache_type=All&var-status_type=5 [18:50:47] !log thcipriani@deploy1001 Pruned MediaWiki: 1.33.0-wmf.21 (duration: 08m 48s) [18:50:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:51:34] RECOVERY - HTTP availability for Nginx -SSL terminators- at esams on icinga1001 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [18:51:44] afaict the above seems to be timeouts. Why the 60sec timeout thing is triggered from deleting old code: dunno. [18:52:30] RECOVERY - HTTP availability for Varnish at esams on icinga1001 is OK: All metrics within thresholds. 
https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 [18:55:14] RECOVERY - Eqsin HTTP 5xx reqs/min on graphite1004 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=eqsin&var-cache_type=All&var-status_type=5 [18:56:18] RECOVERY - Eqiad HTTP 5xx reqs/min on graphite1004 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=eqiad&var-cache_type=All&var-status_type=5 [18:56:20] RECOVERY - Text HTTP 5xx reqs/min on graphite1004 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=All&var-cache_type=text&var-status_type=5 [18:57:37] (03CR) 10Vgutierrez: [C: 03+1] "ACK, I'll get his merged tomorrow EU morning, just to play on the safe side of things." [puppet] - 10https://gerrit.wikimedia.org/r/508668 (owner: 10Vgutierrez) [18:59:32] 10Operations, 10ops-codfw, 10DBA, 10Goal, 10Patch-For-Review: rack/setup/install db2[103-120].codfw.wmnet (18 hosts) - https://phabricator.wikimedia.org/T221532 (10Papaul) [19:00:04] thcipriani: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) MediaWiki train - Americas version deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190507T1900). [19:00:31] yay train. 
[19:03:15] 08Warning Alert for device asw-d-codfw.mgmt.codfw.wmnet - Access port utilisation over 80% for 1h [19:04:04] !log thcipriani@deploy1001 Pruned MediaWiki: 1.33.0-wmf.22 (duration: 02m 15s) [19:04:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:05:16] 10Operations, 10ops-codfw, 10DBA, 10Goal, 10Patch-For-Review: rack/setup/install db2[103-120].codfw.wmnet (18 hosts) - https://phabricator.wikimedia.org/T221532 (10Papaul) @Marostegui @jcrespo please fell free to take this task. I open T222753 to track down the problem on db2114. [19:10:23] (03Abandoned) 10Andrew Bogott: Specify vlan-raw-device for truncated, tagged interfaces [puppet] - 10https://gerrit.wikimedia.org/r/508618 (owner: 10Andrew Bogott) [19:12:16] 08̶W̶a̶r̶n̶i̶n̶g Device asw2-c-eqiad.mgmt.eqiad.wmnet recovered from Access port utilisation over 80% for 1h [19:13:27] !log thcipriani@deploy1001 Started scap: testwiki to 1.34.0-wmf.4 and rebuild l10n cache [19:13:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:14:06] (03PS4) 10Andrew Bogott: Allow puppet-merge to merge the labs/private repo [puppet] - 10https://gerrit.wikimedia.org/r/506582 (https://phabricator.wikimedia.org/T221888) [19:15:35] (03CR) 10Andrew Bogott: [C: 03+2] Allow puppet-merge to merge the labs/private repo [puppet] - 10https://gerrit.wikimedia.org/r/506582 (https://phabricator.wikimedia.org/T221888) (owner: 10Andrew Bogott) [19:18:07] (03PS3) 10Andrew Bogott: openstack: Remove support for Ubuntu/Upstart [puppet] - 10https://gerrit.wikimedia.org/r/508310 (owner: 10Muehlenhoff) [19:22:52] (03CR) 10Andrew Bogott: [C: 03+2] openstack: Remove support for Ubuntu/Upstart [puppet] - 10https://gerrit.wikimedia.org/r/508310 (owner: 10Muehlenhoff) [19:26:54] 10Operations: #wikimedia-sre is missing stashbot - https://phabricator.wikimedia.org/T222755 (10CDanis) [19:29:18] (03PS1) 10Andrew Bogott: puppet-merge: initialize $LABS_PRIVATE [puppet] - 
10https://gerrit.wikimedia.org/r/508688 [19:30:15] (03CR) 10Andrew Bogott: [C: 03+2] puppet-merge: initialize $LABS_PRIVATE [puppet] - 10https://gerrit.wikimedia.org/r/508688 (owner: 10Andrew Bogott) [19:34:17] (03PS1) 10Andrew Bogott: Minor README change to test patch merging [labs/private] - 10https://gerrit.wikimedia.org/r/508689 [19:34:27] 10Operations, 10Electron-PDFs, 10Proton, 10Reading-Infrastructure-Team-Backlog, and 5 others: [EPIC] New service request: chromium-render/deploy - https://phabricator.wikimedia.org/T186748 (10Johan) [19:35:22] (03CR) 10Andrew Bogott: [V: 03+2 C: 03+2] Minor README change to test patch merging [labs/private] - 10https://gerrit.wikimedia.org/r/508689 (owner: 10Andrew Bogott) [19:36:59] PROBLEM - High CPU load on API appserver on mw1281 is CRITICAL: CRITICAL - load average: 79.66, 40.33, 25.35 [19:37:09] PROBLEM - High CPU load on API appserver on mw1227 is CRITICAL: CRITICAL - load average: 58.95, 29.22, 20.64 [19:37:17] PROBLEM - High CPU load on API appserver on mw1225 is CRITICAL: CRITICAL - load average: 52.23, 25.69, 17.18 [19:37:31] PROBLEM - High CPU load on API appserver on mw1226 is CRITICAL: CRITICAL - load average: 50.73, 24.96, 16.83 [19:38:27] PROBLEM - High CPU load on API appserver on mw1280 is CRITICAL: CRITICAL - load average: 64.25, 30.76, 20.65 [19:38:35] PROBLEM - High CPU load on API appserver on mw1283 is CRITICAL: CRITICAL - load average: 67.72, 32.42, 23.19 [19:38:47] RECOVERY - High CPU load on API appserver on mw1226 is OK: OK - load average: 25.06, 22.86, 16.74 [19:38:59] PROBLEM - High CPU load on API appserver on mw1277 is CRITICAL: CRITICAL - load average: 76.62, 37.81, 23.71 [19:38:59] PROBLEM - High CPU load on API appserver on mw1286 is CRITICAL: CRITICAL - load average: 69.27, 33.24, 22.56 [19:39:11] PROBLEM - High CPU load on API appserver on mw1284 is CRITICAL: CRITICAL - load average: 77.17, 39.81, 25.62 [19:39:18] (03PS1) 10Andrew Bogott: puppet-merge: permit --labsprivate arg [puppet] - 
10https://gerrit.wikimedia.org/r/508690 (https://phabricator.wikimedia.org/T221888) [19:39:21] PROBLEM - High CPU load on API appserver on mw1279 is CRITICAL: CRITICAL - load average: 65.87, 30.79, 20.57 [19:39:32] cdb rebuild happening [19:39:35] for train [19:39:45] RECOVERY - High CPU load on API appserver on mw1280 is OK: OK - load average: 30.82, 28.44, 20.66 [19:39:45] PROBLEM - High CPU load on API appserver on mw1278 is CRITICAL: CRITICAL - load average: 79.62, 38.01, 23.18 [19:39:49] RECOVERY - High CPU load on API appserver on mw1225 is OK: OK - load average: 19.18, 23.79, 17.88 [19:40:17] RECOVERY - High CPU load on API appserver on mw1286 is OK: OK - load average: 34.02, 30.82, 22.62 [19:40:39] (03CR) 10Andrew Bogott: [C: 03+2] puppet-merge: permit --labsprivate arg [puppet] - 10https://gerrit.wikimedia.org/r/508690 (https://phabricator.wikimedia.org/T221888) (owner: 10Andrew Bogott) [19:40:39] RECOVERY - High CPU load on API appserver on mw1279 is OK: OK - load average: 33.33, 29.61, 21.06 [19:41:13] RECOVERY - High CPU load on API appserver on mw1283 is OK: OK - load average: 25.27, 31.71, 24.68 [19:41:37] RECOVERY - High CPU load on API appserver on mw1277 is OK: OK - load average: 22.15, 31.80, 23.81 [19:42:13] RECOVERY - High CPU load on API appserver on mw1281 is OK: OK - load average: 18.49, 28.86, 25.22 [19:42:23] !log thcipriani@deploy1001 Finished scap: testwiki to 1.34.0-wmf.4 and rebuild l10n cache (duration: 28m 55s) [19:42:23] RECOVERY - High CPU load on API appserver on mw1227 is OK: OK - load average: 16.44, 24.94, 21.96 [19:42:25] RECOVERY - High CPU load on API appserver on mw1278 is OK: OK - load average: 19.99, 29.46, 22.34 [19:42:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:43:09] RECOVERY - High CPU load on API appserver on mw1284 is OK: OK - load average: 18.28, 30.65, 25.46 [19:43:51] (03PS1) 10Andrew Bogott: puppet-merge: define $PRIVATE_ARG properly [puppet] - 
10https://gerrit.wikimedia.org/r/508691 (https://phabricator.wikimedia.org/T221888) [19:44:58] (03CR) 10Andrew Bogott: [C: 03+2] puppet-merge: define $PRIVATE_ARG properly [puppet] - 10https://gerrit.wikimedia.org/r/508691 (https://phabricator.wikimedia.org/T221888) (owner: 10Andrew Bogott) [19:47:34] (03CR) 10Thcipriani: [C: 03+2] Group0 to 1.34.0-wmf.4 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/508672 (owner: 10Thcipriani) [19:48:58] (03Merged) 10jenkins-bot: Group0 to 1.34.0-wmf.4 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/508672 (owner: 10Thcipriani) [19:49:14] (03CR) 10jenkins-bot: Group0 to 1.34.0-wmf.4 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/508672 (owner: 10Thcipriani) [19:52:55] !log thcipriani@deploy1001 rebuilt and synchronized wikiversions files: group0 to 1.34.0-wmf.4 [19:52:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:54:13] PROBLEM - puppet last run on cloudservices1003 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [19:54:32] I'm really interested to know what made scap operations so expensive for appservers :) [19:54:46] (03PS1) 10Andrew Bogott: Another no-op text patch to test merging [labs/private] - 10https://gerrit.wikimedia.org/r/508694 (https://phabricator.wikimedia.org/T221888) [19:55:19] (03CR) 10Andrew Bogott: [V: 03+2 C: 03+2] Another no-op text patch to test merging [labs/private] - 10https://gerrit.wikimedia.org/r/508694 (https://phabricator.wikimedia.org/T221888) (owner: 10Andrew Bogott) [19:55:20] The i10n stuff RoanKattouw was looking at yesterday? Or did that not turn into a landed change yet? [19:56:47] James_F: not landed yet afaik [19:57:00] (03PS1) 10Andrew Bogott: Revert "Another no-op text patch to test merging" [labs/private] - 10https://gerrit.wikimedia.org/r/508695 [19:57:08] Hmm. 
[19:57:18] No that didn't land [19:57:21] (03CR) 10Andrew Bogott: [V: 03+2 C: 03+2] Revert "Another no-op text patch to test merging" [labs/private] - 10https://gerrit.wikimedia.org/r/508695 (owner: 10Andrew Bogott) [19:58:43] PROBLEM - puppet last run on cloudservices1004 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [19:59:17] There's nothing very suspicious in git log --oneline --exclude wmf-config/InitialiseSettings.php 637bd300c.. [20:00:30] Err. In `git log --oneline 637bd300c.. wmf-config/CommonSettings.php` even. :-) 8afada66e Send 5% of anonymous users to PHP7.2 possibly? [20:01:29] (03PS1) 10Andrew Bogott: labspuppetmasters: remove git-sync-upstream [puppet] - 10https://gerrit.wikimedia.org/r/508697 [20:02:46] (03CR) 10Andrew Bogott: [C: 03+2] labspuppetmasters: remove git-sync-upstream [puppet] - 10https://gerrit.wikimedia.org/r/508697 (owner: 10Andrew Bogott) [20:06:58] (03PS1) 10Andrew Bogott: no-op README patch for testing [labs/private] - 10https://gerrit.wikimedia.org/r/508698 (https://phabricator.wikimedia.org/T221888) [20:07:11] (03CR) 10Andrew Bogott: [V: 03+2 C: 03+2] no-op README patch for testing [labs/private] - 10https://gerrit.wikimedia.org/r/508698 (https://phabricator.wikimedia.org/T221888) (owner: 10Andrew Bogott) [20:08:00] seems like cdb rebuild is where the appserver load exploded. Although nothing has changed about cdb rebuilding in quite a while. 
[20:08:33] PROBLEM - HTTP availability for Nginx -SSL terminators- at eqiad on icinga1001 is CRITICAL: cluster=cache_text site=eqiad https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [20:08:49] size of cdbs being rebuilt over time may be higher, or it could be a red herring [20:08:57] PROBLEM - HTTP availability for Varnish at eqiad on icinga1001 is CRITICAL: job=varnish-text site=eqiad https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 [20:09:03] PROBLEM - HTTP availability for Nginx -SSL terminators- at codfw on icinga1001 is CRITICAL: cluster=cache_text site=codfw https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [20:09:09] PROBLEM - HTTP availability for Nginx -SSL terminators- at esams on icinga1001 is CRITICAL: cluster=cache_text site=esams https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [20:09:31] PROBLEM - HTTP availability for Varnish at esams on icinga1001 is CRITICAL: job=varnish-text site=esams https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 [20:09:41] PROBLEM - HTTP availability for Nginx -SSL terminators- at ulsfo on icinga1001 is CRITICAL: cluster=cache_text site=ulsfo https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [20:09:45] PROBLEM - HTTP availability for Nginx -SSL terminators- at eqsin on icinga1001 is CRITICAL: cluster=cache_text site=eqsin https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [20:09:45] PROBLEM - Text HTTP 5xx reqs/min on graphite1004 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=All&var-cache_type=text&var-status_type=5 [20:09:45] 
PROBLEM - Eqsin HTTP 5xx reqs/min on graphite1004 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=eqsin&var-cache_type=All&var-status_type=5 [20:10:15] PROBLEM - Esams HTTP 5xx reqs/min on graphite1004 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=esams&var-cache_type=All&var-status_type=5 [20:10:34] another brief 503 spike it seems [20:11:05] PROBLEM - Eqiad HTTP 5xx reqs/min on graphite1004 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=eqiad&var-cache_type=All&var-status_type=5 [20:11:25] PROBLEM - Ulsfo HTTP 5xx reqs/min on graphite1004 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=ulsfo&var-cache_type=All&var-status_type=5 [20:11:29] I'm really wondering if there's any correlation between these and release train actions. I do vaguely recall a bunch of extra 50xs on Tuesdays [20:11:37] RECOVERY - HTTP availability for Varnish at eqiad on icinga1001 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 [20:12:05] well these are correlated with an inflation of backend connection parallelism with mediawiki (varnish -> appservers.svc/api.svc) [20:12:07] confirmed, it does seem to happen close to deployment and then recover [20:12:21] i looked at grafana and already saw it recover.. 
same pattern [20:12:24] which can in turn be caused by high response latencies from MW [20:12:28] bblack: that suggests high latency? [20:12:31] RECOVERY - HTTP availability for Nginx -SSL terminators- at eqiad on icinga1001 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [20:12:32] haha, beaten [20:12:36] but pulling apart causes and effects in this stuff is hard [20:13:03] RECOVERY - HTTP availability for Nginx -SSL terminators- at codfw on icinga1001 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [20:13:09] RECOVERY - HTTP availability for Nginx -SSL terminators- at esams on icinga1001 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [20:13:31] RECOVERY - HTTP availability for Varnish at esams on icinga1001 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 [20:13:41] RECOVERY - HTTP availability for Nginx -SSL terminators- at ulsfo on icinga1001 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [20:13:45] RECOVERY - HTTP availability for Nginx -SSL terminators- at eqsin on icinga1001 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [20:14:05] when the backend connection parallelism seems to cap off at 1.0K, that's a configured limit on the varnish side [20:14:31] we can raise that, have in the past sometimes. it might mitigate smaller versions of this problem from having secondary fallout, maybe. 
Or it just makes it easier to overwhelm the applayer [20:14:39] !log deploy1001 - temp disabled puppet - debugging issue with apache-fast-test script [20:14:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:14:49] so the event at 18:40ish does look like an appserver latency issue [20:14:52] https://grafana.wikimedia.org/d/000000580/apache-backend-timing?orgId=1&from=now-6h&to=now [20:15:04] but not the event just now at 20:10ish [20:15:07] (03PS1) 10Andrew Bogott: Designate: remove dependencies on some obsolete init scripts [puppet] - 10https://gerrit.wikimedia.org/r/508699 [20:15:34] ah wait, no, that is wrong [20:15:43] (03CR) 10Andrew Bogott: [C: 03+2] Designate: remove dependencies on some obsolete init scripts [puppet] - 10https://gerrit.wikimedia.org/r/508699 (owner: 10Andrew Bogott) [20:16:06] there was an appserver latency event at 20:10ish that, while the median was not as affected, the long tail was certainly affected [20:16:25] RECOVERY - Text HTTP 5xx reqs/min on graphite1004 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=All&var-cache_type=text&var-status_type=5 [20:16:25] RECOVERY - Eqiad HTTP 5xx reqs/min on graphite1004 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=eqiad&var-cache_type=All&var-status_type=5 [20:16:28] It would certainly make sense that the cluster-wide CDB re-build would make the appservers all increase their latency. 
[20:16:45] RECOVERY - Ulsfo HTTP 5xx reqs/min on graphite1004 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=ulsfo&var-cache_type=All&var-status_type=5 [20:16:50] given the normal case is supposed to be a high rate of very fast req->resp cycles over a relatively-small pool of reused connections... [20:16:55] RECOVERY - Esams HTTP 5xx reqs/min on graphite1004 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=esams&var-cache_type=All&var-status_type=5 [20:17:00] it doesn't take all that many slow responses stacking up to wreak havoc [20:17:12] (slower than whatever norm-ish thing we're tuned around) [20:17:33] Maybe we should bring forward the idea to replace local CDB builds with single PHP array builds? The numbers Krinkle generated suggested they were faster anyway, and have the benefit of not being built on each machine, right? [20:17:45] RECOVERY - Eqsin HTTP 5xx reqs/min on graphite1004 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=eqsin&var-cache_type=All&var-status_type=5 [20:18:18] why does the CDB build stall up the appservers? [20:18:47] (03PS1) 10Dzahn: apache-fast-test: accept a literal - in host names [puppet] - 10https://gerrit.wikimedia.org/r/508701 [20:19:03] switching to PHP array files is indeed the plan, however, there are still things to be sorted out, in scap and elsewhere, not something to do during an incident. [20:19:11] keep in mind I know almost nothing about this. is it a separate process and it just burns up too much host CPU? or is it in the main mediawiki php process but contends on some lock with everything else for runtime reqs? 
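Note on the connection-parallelism point above: the link between slow responses and the number of open backend connections can be sketched with Little's law (average requests in flight = arrival rate × average latency). This is only an illustration. The request rate is a made-up number and the function name is hypothetical; only the 1.0K cap comes from the discussion.

```python
def in_flight_requests(requests_per_second: float, mean_latency_seconds: float) -> float:
    """Little's law: average concurrency = arrival rate * average time in system."""
    return requests_per_second * mean_latency_seconds


# Hypothetical request rate, chosen only for illustration.
RATE = 5000.0
# Connection cap, matching the "1.0K" limit mentioned in the discussion.
CAP = 1000

for latency in (0.05, 0.5, 1.0):
    needed = in_flight_requests(RATE, latency)
    status = "over cap" if needed > CAP else "within cap"
    print(f"latency {latency:.2f}s -> ~{needed:.0f} connections ({status})")
```

With a fixed request rate, the connections needed scale linearly with latency, so even a modest slowdown in the backend can push a pool tuned for fast responses past its limit.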
[20:19:37] what's the basic mechanism by which it causes impact? [20:20:23] I wouldn't call this an incident exactly. we've served something like 45k 503s today, which is not so bad [20:20:33] (03PS3) 10Andrew Bogott: git-sync-upstream: Rebase on top of prod's copy of the puppet repo [puppet] - 10https://gerrit.wikimedia.org/r/504825 (https://phabricator.wikimedia.org/T219390) (owner: 10Alex Monk) [20:20:51] it is really unclear to me why the old branch cleanup also causes appserver hiccups [20:20:53] RECOVERY - puppet last run on cloudservices1003 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [20:20:58] (03PS2) 10Dzahn: apache-fast-test: accept a literal - in host names [puppet] - 10https://gerrit.wikimedia.org/r/508701 [20:20:59] that looks to be just doing a pile of 'rm' [20:21:34] ionice? :) [20:22:15] "ionice -c Idle rm ...." [20:22:25] bblack: to make ionice effective, we'd have to switch away from using the 'deadline' scheduler everywhere [20:22:27] well -c 3 [20:22:40] ah, yeah that might help heh [20:23:06] (and who knows, maybe the db servers need that; it would probably be a bunch of benchmarking) [20:23:12] deadline seems so appealing when you think of your hosts as singular in purpose, then all the maintenance and update and logging and metrics stuff happens [20:23:57] last week because of wanting to ionice different processes differently on Swift storage hosts I went code-spelunking; using 'deadline' on all hosts is something m.ark did in like 2011 or thereabouts [20:24:01] so, yes :) [20:24:22] if what we're seeing is web requests time out after branch-cleanup that is presumably the usual SNAFU that HHVM suffers from as of 6-8 months ago which is that whenever anything changes in /srv/mediawiki, it trips over itself for a few minutes due unknown reasons (possible relating to compilation cache and filestat notifs) until it's got a hold on itself again. 
This is why every scap deployment we get fatal errors by default for a few [20:24:22] minutes. [20:24:48] I probably would've done the same, esp then :) [20:24:53] thcipriani: is/was there other user impact? [20:24:54] okay, that was my hunch as well -- that our HHVM install uses inotify to populate its opcode cache [20:25:23] RECOVERY - puppet last run on cloudservices1004 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [20:25:32] 10Operations, 10Puppet, 10Cloud-Services, 10Patch-For-Review, 10cloud-services-team (Kanban): Move the main WMCS puppetmaster into the Labs realm - https://phabricator.wikimedia.org/T171188 (10Andrew) [20:27:45] Krinkle: outside of the normal 60 second timeouts, it doesn't seem so. I believe it's a new development that appservers have begun to report high load in addition to the 60 second timeouts though. [20:28:21] where "new development" == "first time I noticed it was last week" -- so may not be a new development at all [20:29:09] https://grafana.wikimedia.org/d/000000607/cluster-overview?orgId=1&var-datasource=eqiad%20prometheus%2Fops&var-cluster=api_appserver&var-instance=All&from=now-3h&to=now [20:29:22] there's noticeable network traffic on *some* of the appservers during scap actions, but not all of them [20:30:19] thcipriani: wanna make sure I understand, you mean it's potentially a new regression that regular syncs now have timeout fatals for more than a minute during deploy. or that it's a new regression that branch clean ups have any timeout fatals as result, and those take longer than for regular deploys? [20:31:08] the high load looks to be during the CDB rebuild step? 
[20:31:29] !log gerrit2001 - temp disabling puppet - testing apache rewrites for T218844 on non-prod host [20:31:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:31:33] T218844: Update Gerrit /r/p/ links to /r/ - https://phabricator.wikimedia.org/T218844 [20:33:39] Krinkle: new regression in that there is high load during CDB rebuild. Touching the anything under /srv/mediawiki causing 60second timeouts seems like the same problem that's been happening. [20:33:47] or possible new regression rather [20:34:07] it may simply be a knock-on effect of hhvm flailing at its cache and monitoring only kicks in during cdb rebuild. [20:34:34] (03CR) 10Dzahn: [C: 04-1] "The "L" rewrite flag means that is the Last rule and to stop doing rewrites after this. But there are more rules following it, so these wo" [puppet] - 10https://gerrit.wikimedia.org/r/507787 (https://phabricator.wikimedia.org/T218844) (owner: 10Paladox) [20:34:47] eh, judging from the graphs it looks like just the CDB rebuild step [20:35:11] (03PS9) 10Paladox: Gerrit: Redirect /p/(.+)/info/(.+) to /$1/info/$2 [puppet] - 10https://gerrit.wikimedia.org/r/507787 (https://phabricator.wikimedia.org/T218844) [20:35:38] (03PS5) 1020after4: Gerrit: Configure logging in json to gerrit.json [puppet] - 10https://gerrit.wikimedia.org/r/508391 (https://phabricator.wikimedia.org/T141324) (owner: 10Paladox) [20:35:51] (03CR) 1020after4: [C: 03+1] Gerrit: Configure logging in json to gerrit.json [puppet] - 10https://gerrit.wikimedia.org/r/508391 (https://phabricator.wikimedia.org/T141324) (owner: 10Paladox) [20:36:01] I'm going to guess that started in eqiad at 19:36? [20:36:23] Do the CDB files get rebuilt "live" in the directory that HHVM monitors, or does it happen in tmp and only get moved when they're all done? 
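Note on the question above: the "build elsewhere, then move into place" pattern being asked about can be sketched as follows. This is a generic illustration, not scap's actual code; the function name and the `.tmp` suffix are assumptions. Readers of the target path see either the old file or the complete new one, but a watcher on the same directory would still observe the temporary file being created and renamed.

```python
import os
import tempfile


def write_atomically(path: str, data: bytes) -> None:
    """Write data to a temp file in the target's directory, then rename it over path."""
    directory = os.path.dirname(os.path.abspath(path))
    # The temp file must live on the same filesystem for the rename to be atomic.
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(data)
            tmp.flush()
            os.fsync(tmp.fileno())
        # os.replace overwrites the destination in a single rename step on POSIX.
        os.replace(tmp_path, path)
    except BaseException:
        # Clean up the partial temp file if anything failed before the rename.
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise
```

Building the temporary file in a separate directory on the same filesystem would keep the intermediate writes out of the watched tree, at the cost of a slightly more involved layout.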
[20:39:48] > 19:34:56 Started scap-cdb-rebuild [20:40:49] (03CR) 10Dzahn: [C: 04-1] "let's just move it to the end after the other rules, because of "You will almost always want to use [R] in conjunction with [L] (that is, " [puppet] - 10https://gerrit.wikimedia.org/r/507787 (https://phabricator.wikimedia.org/T218844) (owner: 10Paladox) [20:41:02] James_F: good question. hopefully they're at least rebuilt as separate files before being moved into place over the old file (instead of overwrite in place as the replacement is being slowly constructed)... but even if so, if the temp destination is same dir, could indeed impact the inotify stuff... [20:41:33] thcipriani: I see at https://grafana.wikimedia.org/d/000000607/cluster-overview?orgId=1&var-datasource=eqiad%20prometheus%2Fops&var-cluster=api_appserver&var-instance=All&from=1555388884216&to=1555501366002&panelId=85&fullscreen [20:41:44] (03PS10) 10Paladox: Gerrit: Redirect /p/(.+)/info/(.+) to /$1/info/$2 [puppet] - 10https://gerrit.wikimedia.org/r/507787 (https://phabricator.wikimedia.org/T218844) [20:41:47] taht there was also a similar CPU spike on April 16 during this event: [20:41:52] > 21:47 twentyafter.four@deploy1001: rebuilt and synchronized wikiversions files: group0 wikis to 1.34.0-wmf.1 refs T220726 [20:41:52] T220726: 1.34.0-wmf.1 deployment blockers - https://phabricator.wikimedia.org/T220726 [20:42:18] I mean - [20:42:21] > 21:21 twe.tyafterfour@deploy1001: Finished scap: testwikis wikis to 1.34.0-wmf.1 refs T220726 (duration: 36m 47s) [20:42:39] so that seems "normal" ish [20:42:40] James_F: they are built as .tmp files in the /srv/mediawiki/php-[version]/cache/l10n -- so under the directory hhvm is watching, but moved into place after rebuild -- "both" is the short answer [20:43:12] !log gerrit2001 - restarting apache.. failed [20:43:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:43:19] thcipriani: Hmm. Does HHVM's opcache looker react to .tmp files? 
I assume not, so it's just triggered on move (which is good)? [20:43:20] the cdb rebuild afaik doesn't have much impact on HHVM in terms of compilation cache. there aren't that many files in total (compared to other syncs), and they're not php files. [20:43:34] Right. [20:43:37] So it would ignore those events pretty quickly. [20:43:51] James_F: I'm not confident of this, but it looks like HHVM is using inotify() to cache *all* calls to stat() under a certain tree. so it might 'care' [20:43:54] We get HHVM timeouts after all deploys, not just those with full scap / l10n cdb [20:44:16] Krinkle: gotcha, it's very possible this is just new to me since we hand off train every couple weeks. [20:44:26] right, it has a general stat() cache as well, so if MW does filemtime() on those files, it'll update that accordingly. [20:44:49] (03PS11) 10Paladox: Gerrit: Redirect /p/(.+)/info/(.+) to /$1/info/$2 [puppet] - 10https://gerrit.wikimedia.org/r/507787 (https://phabricator.wikimedia.org/T218844) [20:45:06] but it's keeping that populated preemptively, which is a possible mechanism for all the rms in the branch prune step to affect appserver performance [20:45:06] OK, never mind my suspicion. [20:45:07] thcipriani: np, so what did you initially see as the concern - was it the CPU spike, or the exception count? [20:46:40] Krinkle: CPU spike since it seemed "new", but as you've found, it's evidently not new. [20:47:36] thcipriani: aye, still concerning indeed, but I guess it's something we can add to the pile together with the 60s timeouts as something really scary and hurting users that we've somehow decided is fine to have for a year because we're switching to php7 soon and they're too complicated to debug.. [20:47:59] it also doesn't help that looking at the last 6 months these spikes are invisible in Grafana [20:48:04] only when you zoom in randomly do they appear [20:48:07] Of course, PHP7 will give us different fun problems instead. 
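Note on the Grafana point above: one likely reason short spikes vanish at wide zoom levels is that each plotted point summarises many raw samples, often as an average. The sketch below uses a made-up load series and hypothetical helper names to show how a 5-minute spike nearly disappears under hourly averaging but survives a per-bucket maximum.

```python
def downsample(samples, bucket_size, reducer):
    """Collapse consecutive buckets of samples into one value each."""
    return [
        reducer(samples[i:i + bucket_size])
        for i in range(0, len(samples), bucket_size)
    ]


def mean(values):
    return sum(values) / len(values)


# Hypothetical load series: one sample per minute for 24 hours,
# flat at 20 except for a 5-minute spike to 80.
load = [20.0] * 1440
load[600:605] = [80.0] * 5

hourly_mean = downsample(load, 60, mean)
hourly_max = downsample(load, 60, max)

print(max(hourly_mean))  # 25.0
print(max(hourly_max))   # 80.0
```

Plotting a maximum or a high percentile per bucket alongside the mean is one common way to keep such events visible over long time ranges.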
[20:48:07] (03CR) 10Dzahn: [C: 04-2] "May 07 20:46:48 gerrit2001 apachectl[10180]: AH00526: Syntax error on line 116 of /etc/apache2/sites-enabled/50-gerrit-slave-wikimedia-org" [puppet] - 10https://gerrit.wikimedia.org/r/507787 (https://phabricator.wikimedia.org/T218844) (owner: 10Paladox) [20:48:10] because averaging/irate [20:49:35] thcipriani: on a related note, assuming the incident is over now, what's the status of scap cdb/static-array switch? Anything I can help with there? [20:50:12] I believe the next step is https://phabricator.wikimedia.org/T105683 [20:50:19] I don't know anything about it, what's the status of MediaWiki? [20:50:21] * thcipriani reads [20:50:52] thcipriani: from the MediaWiki side, the wmf-config setting for it is ready, and the maintenance script (rebuild localisation cache) also has support for it, and it's been benched and tested fairly well. [20:51:14] next step is to make sure scap builds both versions while we try it out on a test wiki. [20:51:22] Krinkle: did you see this morning's PHP7 fun? [20:51:42] cdanis: hm.. don't think so? [20:52:34] https://wikitech.wikimedia.org/wiki/Incident_documentation/20190507-opcache [20:52:52] little actual impact but very worrisome that it happened at all [20:53:06] (but also a bunch of mitigations available) [20:54:06] cdanis: uh, yikes, we saw that before, very similar infact [20:54:20] https://phabricator.wikimedia.org/T221347 [20:54:59] oh, this was around the opcache's string interning? [20:55:22] cdanis: I don't know what it's cause was, but it describes a very similar problem where a string just randomly changed [20:55:45] Today it was wgUseKeyHeader [20:55:49] back then it was wgLogo [20:56:59] _joe_: around? 
:-) [20:58:14] yeah, I think there are different sections of the opcache for code vs interned strings; not sure if we've seen any corruptions in the former before, but it does look like a previous one for the latter [20:59:05] thanks for the link to that ticket Krinkle, I've also linked it in the IR [20:59:06] <_joe_> Krenair: yes it's the exact same issue [20:59:25] <_joe_> err Krinkle even [20:59:41] <_joe_> revi: not really, saw your ping but it's very late here [20:59:52] _joe_: sure, we can get it done tomorrow [20:59:58] <_joe_> ok :) [21:00:01] 7am without sleeping is not a good time to do patch either [21:00:08] oh 6am but still [21:00:22] 10Operations, 10PHP 7.2 support, 10Patch-For-Review, 10Wikimedia-production-error: Investigate why a string literal changed in opcache (Fatal exception of type "ConfigException") - https://phabricator.wikimedia.org/T221347 (10Krinkle) Today's incident at (03PS1) 10CRusnov: fix minor typo in puppetdb report [software/netbox-reports] - 10https://gerrit.wikimedia.org/r/508716 [21:01:07] <_joe_> so on one hand we reactivated opcache validation hoping it will help [21:01:10] <_joe_> on the other [21:01:12] <_joe_> https://phabricator.wikimedia.org/T222705 [21:01:28] <_joe_> we will depool servers with opcache corrupted starting tomorrow morning hopefully [21:02:08] <_joe_> this will leave us time to try to get to the bottom of this with no user impact [21:02:17] did you do the php-fpm restart to enable that _joe_ ? 
[21:02:20] (03CR) 10CRusnov: [C: 03+2] fix minor typo in puppetdb report [software/netbox-reports] - 10https://gerrit.wikimedia.org/r/508716 (owner: 10CRusnov) [21:02:28] <_joe_> cdanis: I did in eqiad [21:02:31] if not I'll file a task so it doesn't get dropped [21:02:35] <_joe_> a reload actually [21:02:55] Krinkle: scap is shelling out to rebuildLocalisationCache, so if we can ensure that that maintenance script is outputting php files in addition to cdb files it may "just work" -- in the sense that the php files will be synced like regular code in that case. [21:03:18] <_joe_> another set of files not under version control? [21:03:19] <_joe_> eek [21:03:21] <_joe_> :P [21:03:32] looking through scap changes for the last time this was tried that seems to be all that changed from the scap side anyway [21:04:12] _joe_: ok I'll just re-use your cumin command in codfw [21:04:18] _joe_: adding a step in the pipeline to generate l10n would be a boon for us beleaguered train deployers :) [21:04:47] <_joe_> cdanis: or I do that tomorrow morning :) [21:05:09] <_joe_> but thanks for taking care of it in case [21:05:34] thcipriani: aye, the maintenance script has only a single wiki context though. so we'd need scap to run it twice. [21:05:43] e.g. once for wiki=aawiki and once for wiki=testwiki or some such. [21:05:59] maybe one wiki in each group, but tricky... [21:06:10] !log cdanis@cumin1001.eqiad.wmnet ~ % sudo cumin -m async -p80 -b10 'C:profile::mediawiki::php and *.codfw.wmnet' 'run-puppet-agent' 'systemctl reload php7.2-fpm.service' [21:06:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:06:29] depending on how temporary it is, might also be able to add a parameter that hardcodes the format in, maybe we can even inject that from wmf-config/scap/something.py directly. 
[21:07:02] that's what we do currently iirc -- once for one wiki for each active wikiversion -- we could maybe expand that to per active wikiversion per group [21:09:46] thcipriani: ah, I see. I was wondering why it already had a wikidb parameter. [21:09:48] it's for wikiversions [21:09:50] neat [21:10:56] thcipriani: yeah, it'll be a bit hacky, as presumably we'd want to be consistent and have both l10n formats on both wiki versions, whereas we'd presumably have only 1 test wiki to apply it to. But maybe it's fine for it to only exist on the wiki version that the test wiki is on. [21:10:58] Should be fine I guess? [21:11:24] In that case, adding a 'test2wiki' to the loop for 'one random wiki per wiki version' (and making sure the latter doesn't pick test2wiki) might suffice. [21:12:17] or maybe 'testwiki' instead so that it also works in beta, and we can then subsequently switch testwiki to 'php' array format in beta first. [21:12:56] yeah, something like that would work, would need to fiddle with it a bit, but that seems like an approach that could work. What's the command line flag to rebuildLocalisationCache needed to generate php? [21:13:28] for performance (not 1/3 needless generations) we may then want to limit that third one to beta for a while as we first prove that HHVM now behaves well with large PHP files. That previously was the reason we aborted the experiment in 2015 due to a translation-cache leak in HHVM's compiler GC. [21:13:32] but that has since been fixed. [21:13:49] thcipriani: there isn't one, it finds it at run-time from Config. [21:15:34] ah ok, based on storeClass? [21:23:28] thcipriani: `store`, not `storeClass`, although either would work. [21:23:30] https://phabricator.wikimedia.org/T99740#5165753 [21:23:40] I've posted a draft deploy plan there just now based on 2015 [21:24:24] Krinkle: great! that's very helpful. 
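Note: the selection idea discussed above ("one random wiki per wiki version", plus the test wiki, while making sure the random pick never lands on the test wiki) could be sketched roughly as follows. This is a hypothetical illustration only, not scap's actual code: the function name `pick_l10n_wikis`, the `wikiversions` mapping shape, and the `test_wiki` default are assumptions made for the example.

```python
import random
from typing import Dict, List, Optional


def pick_l10n_wikis(
    wikiversions: Dict[str, str],
    test_wiki: str = "testwiki",
    rng: Optional[random.Random] = None,
) -> List[str]:
    """Pick the wikis to run the l10n rebuild for (hypothetical helper).

    wikiversions maps wiki db name -> active version (assumed shape).
    Returns one random non-test wiki per active version, followed by
    the test wiki itself if it is present in the mapping.
    """
    rng = rng or random.Random()

    # Group wikis by the version they currently run.
    by_version: Dict[str, List[str]] = {}
    for wiki, version in wikiversions.items():
        by_version.setdefault(version, []).append(wiki)

    picks: List[str] = []
    for version in sorted(by_version):
        # Exclude the test wiki so it is never the "random" pick.
        candidates = sorted(w for w in by_version[version] if w != test_wiki)
        if candidates:
            picks.append(rng.choice(candidates))

    # The test wiki gets its own, separate rebuild run.
    if test_wiki in wikiversions:
        picks.append(test_wiki)
    return picks
```

With two active versions this yields three rebuild runs, which matches the "1/3 needless generations" cost mentioned in the discussion if the third run is not actually needed everywhere.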
[21:26:08] 10Operations, 10Gerrit, 10serviceops, 10Patch-For-Review: Convert Gerrit to use H2 as the database - https://phabricator.wikimedia.org/T211139 (10Paladox) [21:27:40] (03PS1) 10CRusnov: Fix minor typo in oldhardware report [software/netbox-reports] - 10https://gerrit.wikimedia.org/r/508722 [21:29:31] (03CR) 10CRusnov: [C: 03+2] Fix minor typo in oldhardware report [software/netbox-reports] - 10https://gerrit.wikimedia.org/r/508722 (owner: 10CRusnov) [21:29:59] (03PS1) 10Jforrester: [BETA] Enable array format on testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/508724 (https://phabricator.wikimedia.org/T99740) [21:30:38] (03PS2) 10Jforrester: [BETA] Enable array LCStoreStaticArray format on testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/508724 (https://phabricator.wikimedia.org/T99740) [21:39:08] 10Operations: #wikimedia-sre is missing stashbot - https://phabricator.wikimedia.org/T222755 (10Dzahn) The list of maintainers is at https://tools.wmflabs.org/admin/tool/stashbot The repo is at https://gerrit.wikimedia.org/r/q/project:labs%252Ftools%252Fstashbot The README in that repo says there should be a c... [21:41:32] (03PS1) 10Krinkle: Set wgLocalisationCacheConf['storeClass'] explicitly [mediawiki-config] - 10https://gerrit.wikimedia.org/r/508726 (https://phabricator.wikimedia.org/T99740) [21:42:37] (03CR) 10Krinkle: [BETA] Enable array LCStoreStaticArray format on testwiki (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/508724 (https://phabricator.wikimedia.org/T99740) (owner: 10Jforrester) [21:42:52] James_F: thx [21:44:46] mutante: hi! is it ok to use deploy1001 for.. deploys, while puppet's disabled there? would it interfere with what you doing? 
[21:45:11] (03PS3) 10Jforrester: [BETA] Enable array LCStoreStaticArray format on testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/508724 (https://phabricator.wikimedia.org/T99740) [21:45:33] (03CR) 10Jforrester: [BETA] Enable array LCStoreStaticArray format on testwiki (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/508724 (https://phabricator.wikimedia.org/T99740) (owner: 10Jforrester) [21:46:06] Pchelolo: no, it should not have influence anything but .. i re-enabled it just now anyways [21:46:12] running it [21:46:15] to revert my live hack [21:46:34] ok cool, thank you. just wanted to double-check [21:46:36] !log deploy1001 - renabled puppet - deployment can go ahead [21:46:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:46:41] Pchelolo: thanks for checking [21:47:15] !log ppchelko@deploy1001 Started deploy [restbase/deploy@d91ee4c]: Do not cache html if stash was requested T215956 [21:47:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:47:19] T215956: Consider stashing data-parsoid for VE - https://phabricator.wikimedia.org/T215956 [21:47:27] !log ppchelko@deploy1001 deploy aborted: Do not cache html if stash was requested T215956 (duration: 00m 12s) [21:47:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:48:01] !log ppchelko@deploy1001 Started deploy [restbase/deploy@8f5859f]: Do not cache html if stash was requested T215956 [21:48:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:58:15] 08̶W̶a̶r̶n̶i̶n̶g Device asw-d-codfw.mgmt.codfw.wmnet recovered from Access port utilisation over 80% for 1h [22:04:19] PROBLEM - mobileapps endpoints health on scb1004 is CRITICAL: /{domain}/v1/page/most-read/{year}/{month}/{day} (retrieve the most read articles for January 1, 2016) timed out before a response was received: /{domain}/v1/page/most-read/{year}/{month}/{day} (retrieve the most-read articles for 
January 1, 2016 (with aggregated=true)) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/mob [22:05:33] RECOVERY - mobileapps endpoints health on scb1004 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps [22:06:12] !log ppchelko@deploy1001 Finished deploy [restbase/deploy@8f5859f]: Do not cache html if stash was requested T215956 (duration: 18m 12s) [22:06:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:06:17] T215956: Consider stashing data-parsoid for VE - https://phabricator.wikimedia.org/T215956 [22:08:55] (03CR) 10Gehel: "It is unclear what we are trying to achieve here. This will duplicate all log messages into function specific log files and a generic log " [puppet] - 10https://gerrit.wikimedia.org/r/508657 (owner: 10Paladox) [22:10:23] 10Operations, 10DBA, 10Jade, 10TechCom-RFC, and 2 others: Introduce a new namespace for collaborative judgements about wiki entities - https://phabricator.wikimedia.org/T200297 (10Harej) [22:15:36] (03CR) 10Paladox: "> It is unclear what we are trying to achieve here. This will" [puppet] - 10https://gerrit.wikimedia.org/r/508657 (owner: 10Paladox) [22:20:18] (03CR) 10Andrew Bogott: "This looks reasonable to me. If we just did a global hiera lookup in sudo::user and sudo::group then this patch would touch 1/10th as man" [puppet] - 10https://gerrit.wikimedia.org/r/508311 (https://phabricator.wikimedia.org/T221225) (owner: 10Arturo Borrero Gonzalez) [22:22:16] 08̶W̶a̶r̶n̶i̶n̶g Device asw2-a-eqiad.mgmt.eqiad.wmnet recovered from Access port utilisation over 80% for 1h [22:23:14] 10Operations, 10observability: Graphite alert 'MediaWiki.errors.fatal' no longer working - https://phabricator.wikimedia.org/T222765 (10Krinkle) [22:23:28] Krinkle: Should I land the Beta Cluster and default wgLocalisationCacheConf changes then? 
[22:24:59] 10Operations, 10observability: Graphite alert 'MediaWiki.errors.fatal' no longer working - https://phabricator.wikimedia.org/T222765 (10Krinkle) Culprit is pathset 12 here which attempted to rebase the patch to accomodate the now-split statsd Hiera property. (03PS1) 10Krinkle: mediawiki: Fix statsd reporting of MediaWiki.errors.fatal [puppet] - 10https://gerrit.wikimedia.org/r/508730 (https://phabricator.wikimedia.org/T222765) [22:26:55] 10Operations, 10Performance-Team, 10observability, 10Patch-For-Review: Graphite alert 'MediaWiki.errors.fatal' no longer working - https://phabricator.wikimedia.org/T222765 (10Krinkle) a:03Krinkle [22:35:14] (03PS1) 10EBernhardson: Configure wgCirrusSearchPrivateClusters [mediawiki-config] - 10https://gerrit.wikimedia.org/r/508733 (https://phabricator.wikimedia.org/T220625) [22:35:48] (03CR) 10Krinkle: "Verified at https://puppet-compiler.wmflabs.org/compiler1001/16406/mw1234.eqiad.wmnet/" [puppet] - 10https://gerrit.wikimedia.org/r/508730 (https://phabricator.wikimedia.org/T222765) (owner: 10Krinkle) [22:37:17] 10Operations, 10DBA, 10Jade, 10TechCom-RFC, and 2 others: Introduce a new namespace for collaborative judgements about wiki entities - https://phabricator.wikimedia.org/T200297 (10Harej) @jcrespo Yes, it reflects our current understanding. 
[22:45:17] (03PS1) 10RobH: restbase10[19-27].eqiad.wmnet prod dns entries [dns] - 10https://gerrit.wikimedia.org/r/508735 (https://phabricator.wikimedia.org/T219404) [22:45:19] (03CR) 10jerkins-bot: [V: 04-1] restbase10[19-27].eqiad.wmnet prod dns entries [dns] - 10https://gerrit.wikimedia.org/r/508735 (https://phabricator.wikimedia.org/T219404) (owner: 10RobH) [22:47:13] (03PS2) 10RobH: restbase10[19-27].eqiad.wmnet prod dns entries [dns] - 10https://gerrit.wikimedia.org/r/508735 (https://phabricator.wikimedia.org/T219404) [22:48:15] 08Warning Alert for device asw-c-codfw.mgmt.codfw.wmnet - Access port utilisation over 80% for 1h [22:48:26] (03CR) 10RobH: [C: 03+2] restbase10[19-27].eqiad.wmnet prod dns entries [dns] - 10https://gerrit.wikimedia.org/r/508735 (https://phabricator.wikimedia.org/T219404) (owner: 10RobH) [23:00:04] MaxSem, RoanKattouw, and Niharika: #bothumor When your hammer is PHP, everything starts looking like a thumb. Rise for Evening SWAT (Max 6 patches). (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190507T2300). [23:00:04] No GERRIT patches in the queue for this window AFAICS. [23:08:32] (03PS1) 10Paladox: Fix apache-fast-test to support https [puppet] - 10https://gerrit.wikimedia.org/r/508737 [23:09:16] (03CR) 10jerkins-bot: [V: 04-1] Fix apache-fast-test to support https [puppet] - 10https://gerrit.wikimedia.org/r/508737 (owner: 10Paladox) [23:10:25] 10Operations, 10ORES, 10Release Pipeline, 10Scoring-platform-team, and 2 others: Execution of the deployment pipeline should be configurable via .pipeline/config.yaml - https://phabricator.wikimedia.org/T210267 (10dduvall) [23:11:08] (03PS2) 10Paladox: Fix apache-fast-test to support https [puppet] - 10https://gerrit.wikimedia.org/r/508737 [23:11:21] (03PS3) 10Paladox: Fix apache-fast-test to support https [puppet] - 10https://gerrit.wikimedia.org/r/508737 [23:13:13] PROBLEM - puppet last run on pybal-test2001 is CRITICAL: CRITICAL: Puppet has 8 failures. 
Last run 5 minutes ago with 8 failures. Failed resources (up to 3 shown): Exec[eth0_v6_token],Service[rsyslog],Service[nagios-nrpe-server],Exec[ip addr add 2620:0:860:102:10:192:16:139/64 dev eth0] [23:14:06] (03PS4) 10Paladox: Fix apache-fast-test to support https [puppet] - 10https://gerrit.wikimedia.org/r/508737 [23:16:07] going to deploy a noop config patch since swat is empty, goes along with some code currently in review [23:16:38] (03CR) 10EBernhardson: [C: 03+2] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/508733 (https://phabricator.wikimedia.org/T220625) (owner: 10EBernhardson) [23:18:59] (03PS1) 10Papaul: DNS: Remove mgmt DNS for osm-db200[12] and osm-web200[1234] [dns] - 10https://gerrit.wikimedia.org/r/508738 [23:19:23] (03CR) 10jerkins-bot: [V: 04-1] DNS: Remove mgmt DNS for osm-db200[12] and osm-web200[1234] [dns] - 10https://gerrit.wikimedia.org/r/508738 (owner: 10Papaul) [23:23:05] (03Abandoned) 10Papaul: DNS: Remove mgmt DNS for osm-db200[12] and osm-web200[1234] [dns] - 10https://gerrit.wikimedia.org/r/508738 (owner: 10Papaul) [23:23:19] (03PS1) 10RobH: adding restbase10[19-27] mac addresses [puppet] - 10https://gerrit.wikimedia.org/r/508739 (https://phabricator.wikimedia.org/T219404) [23:26:53] (03PS2) 10EBernhardson: Configure wgCirrusSearchPrivateClusters [mediawiki-config] - 10https://gerrit.wikimedia.org/r/508733 (https://phabricator.wikimedia.org/T220625) [23:28:05] (03PS2) 10RobH: adding restbase10[19-27] install params [puppet] - 10https://gerrit.wikimedia.org/r/508739 (https://phabricator.wikimedia.org/T219404) [23:28:25] (03CR) 10EBernhardson: [C: 03+2] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/508733 (https://phabricator.wikimedia.org/T220625) (owner: 10EBernhardson) [23:29:35] (03Merged) 10jenkins-bot: Configure wgCirrusSearchPrivateClusters [mediawiki-config] - 10https://gerrit.wikimedia.org/r/508733 (https://phabricator.wikimedia.org/T220625) (owner: 10EBernhardson) [23:29:39] (03PS3) 10RobH: 
adding restbase10[19-27] install params [puppet] - 10https://gerrit.wikimedia.org/r/508739 (https://phabricator.wikimedia.org/T219404) [23:29:50] (03CR) 10jenkins-bot: Configure wgCirrusSearchPrivateClusters [mediawiki-config] - 10https://gerrit.wikimedia.org/r/508733 (https://phabricator.wikimedia.org/T220625) (owner: 10EBernhardson) [23:30:10] (03CR) 10RobH: [C: 03+2] adding restbase10[19-27] install params [puppet] - 10https://gerrit.wikimedia.org/r/508739 (https://phabricator.wikimedia.org/T219404) (owner: 10RobH) [23:31:03] !log ebernhardson@deploy1001 Synchronized wmf-config/CirrusSearch-production.php: T220625 Configure wgCirrusSearchPrivateClusters (duration: 00m 58s) [23:31:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:31:13] T220625: Initialize CirrusSearch on cloudelastic - https://phabricator.wikimedia.org/T220625 [23:33:54] (03CR) 10Thcipriani: [C: 03+1] Gerrit: Configure logging in json to gerrit.json (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/508391 (https://phabricator.wikimedia.org/T141324) (owner: 10Paladox) [23:39:46] 10Operations, 10ops-eqiad, 10RESTBase, 10Core Platform Team Backlog (Watching / External), and 3 others: rack/setup/install restbase10[19-27].eqiad.wmnet - https://phabricator.wikimedia.org/T219404 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by robh on cumin1001.eqiad.wmnet for hosts: ` ['r... 
[23:39:55] RECOVERY - puppet last run on pybal-test2001 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [23:55:29] 10Operations, 10ops-eqiad, 10RESTBase, 10Core Platform Team Backlog (Watching / External), and 2 others: rack/setup/install restbase10[19-27].eqiad.wmnet - https://phabricator.wikimedia.org/T219404 (10RobH) [23:56:31] 10Operations, 10ops-eqiad, 10RESTBase, 10Core Platform Team Backlog (Watching / External), and 2 others: rack/setup/install restbase10[19-27].eqiad.wmnet - https://phabricator.wikimedia.org/T219404 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by robh on cumin1001.eqiad.wmnet for hosts: ` ['r...