[00:03:16] Warning Device asw-c-codfw.mgmt.codfw.wmnet recovered from Access port utilisation over 80% for 1h
[00:14:17] Operations, ops-eqiad, RESTBase, Core Platform Team Backlog (Watching / External), and 2 others: rack/setup/install restbase10[19-27].eqiad.wmnet - https://phabricator.wikimedia.org/T219404 (RobH) started installing restbase1019 is ready for service handoff. the others started the installer and...
[02:49:59] (PS40) Mathew.onipe: icinga: create and apply cirrus config check [puppet] - https://gerrit.wikimedia.org/r/507045 (https://phabricator.wikimedia.org/T218932)
[03:03:24] PROBLEM - Check systemd state on stat1007 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[03:12:59] (PS1) Mathew.onipe: icinga: add unit test for elastic config check [puppet] - https://gerrit.wikimedia.org/r/508742 (https://phabricator.wikimedia.org/T218932)
[03:13:30] (CR) Mathew.onipe: "PCC output looks good https://puppet-compiler.wmflabs.org/compiler1002/16398/" [puppet] - https://gerrit.wikimedia.org/r/507045 (https://phabricator.wikimedia.org/T218932) (owner: Mathew.onipe)
[03:19:54] PROBLEM - Prometheus bast3002.wikimedia.org/ops was restarted on bast3002 is CRITICAL: 499.4 lt 600 https://grafana.wikimedia.org/d/000000271/prometheus-stats?var-datasource=esams+prometheus/ops
[03:24:14] (PS10) Mathew.onipe: Add postgres slave init cookbook [cookbooks] - https://gerrit.wikimedia.org/r/504570 (https://phabricator.wikimedia.org/T220946)
[03:25:01] (CR) Mathew.onipe: Add postgres slave init cookbook (3 comments) [cookbooks] - https://gerrit.wikimedia.org/r/504570 (https://phabricator.wikimedia.org/T220946) (owner: Mathew.onipe)
[03:42:50] RECOVERY - Prometheus bast3002.wikimedia.org/ops was restarted on bast3002 is OK: (C)600 lt (W)1800 lt 1875 https://grafana.wikimedia.org/d/000000271/prometheus-stats?var-datasource=esams+prometheus/ops
[04:13:44] PROBLEM - puppet last run on ores1002 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[04:35:06] PROBLEM - puppet last run on actinium is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[04:40:13] (PS1) CDanis: prometheus uptime alert: fix scraping of 'global' instance, plus more [puppet] - https://gerrit.wikimedia.org/r/508745
[04:40:26] RECOVERY - puppet last run on ores1002 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[04:45:10] (PS2) CDanis: prometheus uptime alert: fix scraping of 'global' instance, plus more [puppet] - https://gerrit.wikimedia.org/r/508745
[04:48:40] PROBLEM - Disk space on actinium is CRITICAL: DISK CRITICAL - free space: / 339 MB (3% inode=90%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space
[04:57:04] (PS1) Marostegui: db-eqiad.php: Depool pc1007 [mediawiki-config] - https://gerrit.wikimedia.org/r/508747
[04:58:36] (CR) Marostegui: [C: +2] db-eqiad.php: Depool pc1007 [mediawiki-config] - https://gerrit.wikimedia.org/r/508747 (owner: Marostegui)
[04:59:42] (Merged) jenkins-bot: db-eqiad.php: Depool pc1007 [mediawiki-config] - https://gerrit.wikimedia.org/r/508747 (owner: Marostegui)
[04:59:56] (CR) jenkins-bot: db-eqiad.php: Depool pc1007 [mediawiki-config] - https://gerrit.wikimedia.org/r/508747 (owner: Marostegui)
[05:01:18] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Depool pc1007 (duration: 00m 59s)
[05:01:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:01:40] !log Optimize tables on pc1007
[05:01:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:01:48] RECOVERY - puppet last run on actinium is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[05:08:00] (PS1) CDanis: check_prometheus: allow non-grafana links in $dashboard_links [puppet] - https://gerrit.wikimedia.org/r/508748
[05:08:33] (CR) jerkins-bot: [V: -1] check_prometheus: allow non-grafana links in $dashboard_links [puppet] - https://gerrit.wikimedia.org/r/508748 (owner: CDanis)
[05:09:37] (PS2) CDanis: check_prometheus: allow non-grafana links in $dashboard_links [puppet] - https://gerrit.wikimedia.org/r/508748
[05:11:09] (PS1) Marostegui: mariadb: Provision db1131 on s6 [puppet] - https://gerrit.wikimedia.org/r/508749 (https://phabricator.wikimedia.org/T222682)
[05:12:41] (CR) CDanis: "Fails in PCC:" [puppet] - https://gerrit.wikimedia.org/r/508745 (owner: CDanis)
[05:12:54] (PS2) Marostegui: mariadb: Provision db1131 on s6 [puppet] - https://gerrit.wikimedia.org/r/508749 (https://phabricator.wikimedia.org/T222682)
[05:14:52] (PS3) Marostegui: mariadb: Provision db1131 on s6 [puppet] - https://gerrit.wikimedia.org/r/508749 (https://phabricator.wikimedia.org/T222682)
[05:19:14] (CR) Marostegui: [C: +2] mariadb: Provision db1131 on s6 [puppet] - https://gerrit.wikimedia.org/r/508749 (https://phabricator.wikimedia.org/T222682) (owner: Marostegui)
[05:25:04] !log Stop MySQL on db1093
[05:25:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:25:35] (PS1) Marostegui: site.pp: Remove db1131 from spare [puppet] - https://gerrit.wikimedia.org/r/508750
[05:26:18] (CR) Marostegui: [C: +2] site.pp: Remove db1131 from spare [puppet] - https://gerrit.wikimedia.org/r/508750 (owner: Marostegui)
[05:34:06] Operations, ops-codfw, DBA, Goal, Patch-For-Review: rack/setup/install db2[103-120].codfw.wmnet (18 hosts) - https://phabricator.wikimedia.org/T221532 (Marostegui) I have checked that all the hosts have been installed correctly
[05:36:40] Operations, ops-codfw, DBA, Goal, Patch-For-Review: rack/setup/install db2[103-120].codfw.wmnet (18 hosts) - https://phabricator.wikimedia.org/T221532 (Marostegui)
[05:37:48] Operations, ops-codfw, DBA, Goal, Patch-For-Review: rack/setup/install db2[103-120].codfw.wmnet (18 hosts) - https://phabricator.wikimedia.org/T221532 (Marostegui) Open→Resolved I have changed the status to Active on netbox. Will close this task and will create new one for productioni...
[05:48:32] !log upgrading pybal to version 1.15.6 in lvs2006 - T222705
[05:48:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:48:37] T222705: Improve Pybal's url checks - https://phabricator.wikimedia.org/T222705
[05:59:08] !log upgrading pybal to version 1.15.6 in lvs5003 - T222705
[05:59:08] Operations, Traffic, Wikidata, Wikidata-Query-Service, and 2 others: Reduce / remove the aggessive cache busting behaviour of wdqs-updater - https://phabricator.wikimedia.org/T217897 (Smalyshev) Open→Resolved Update now uses revision IDs everywhere for non-lagged fetches.
[05:59:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:59:12] T222705: Improve Pybal's url checks - https://phabricator.wikimedia.org/T222705
[06:02:20] !log upgrading pybal to version 1.15.6 in lvs5002 - T222705
[06:02:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:04:32] (PS1) Marostegui: mariadb: Pool db2115 on x1 [puppet] - https://gerrit.wikimedia.org/r/508755 (https://phabricator.wikimedia.org/T222772)
[06:07:01] !log upgrading pybal to version 1.15.6 in lvs5001 - T222705
[06:07:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:07:06] T222705: Improve Pybal's url checks - https://phabricator.wikimedia.org/T222705
[06:13:08] !log upgrading pybal to version 1.15.6 in lvs4007 - T222705
[06:13:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:13:12] T222705: Improve Pybal's url checks - https://phabricator.wikimedia.org/T222705
[06:14:36] PROBLEM - BGP status on cr1-eqiad is CRITICAL: BGP CRITICAL - AS2914/IPv4: Active, AS2914/IPv6: Active https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[06:14:49] (CR) Marostegui: [C: +2] mariadb: Pool db2115 on x1 [puppet] - https://gerrit.wikimedia.org/r/508755 (https://phabricator.wikimedia.org/T222772) (owner: Marostegui)
[06:15:22] vgutierrez: very relaxing first thing in the morning that you are doing :D
[06:15:25] (morning :)
[06:15:29] unrelated BGP alerts while upgrading pybal... lovely :)
[06:15:41] elukey: oh it's my warming pre-gym routine
[06:15:48] ahahahah
[06:15:55] I upgrade the lvs fleet and I get a nice heart rate
[06:16:32] !log upgrading pybal to version 1.15.6 in lvs4006 - T222705
[06:16:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:17:26] RECOVERY - Check systemd state on stat1007 is OK: OK - running: The system is fully operational
[06:17:31] AS2914 is NTT America BTW
[06:19:57] !log upgrading pybal to version 1.15.6 in lvs4005 - T222705
[06:20:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:20:01] T222705: Improve Pybal's url checks - https://phabricator.wikimedia.org/T222705
[06:20:26] !log Stop MySQL on db2096
[06:20:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:24:13] somebody upgrading pybal, others stopping mysql
[06:24:18] I am still reading emails
[06:24:28] :D
[06:24:55] !log upgrading pybal to version 1.15.6 in lvs3004 - T222705
[06:24:57] * elukey alters some random tables on behalf of marostegui
[06:24:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:25:12] elukey: somehow I find easier to handle pybal than some humans on my inbox
[06:25:34] RECOVERY - Disk space on actinium is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space
[06:27:36] PROBLEM - netbox HTTPS on netmon1002 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 547 bytes in 0.009 second response time https://wikitech.wikimedia.org/wiki/Netbox
[06:28:06] PROBLEM - Check systemd state on netmon1002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[06:28:49] this is netmon's daily breakage, restarting
[06:29:02] I have uwsgi-core package with a patch from upstream to test today --^
[06:29:26] PROBLEM - puppet last run on cp2024 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[06:29:40] !log upgrading pybal to version 1.15.6 in lvs3002 - T222705
[06:29:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:29:46] T222705: Improve Pybal's url checks - https://phabricator.wikimedia.org/T222705
[06:29:52] !log restart uwsgi-netbox on netmon1002 after the daily segfault (upon restart)
[06:29:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:30:22] PROBLEM - puppet last run on mw1307 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 4 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/usr/lib/nagios/plugins/check_ferm]
[06:32:25] (PS1) Marostegui: db-eqiad,db-codfw.php: Pool db1127 and db1137 in x1 [mediawiki-config] - https://gerrit.wikimedia.org/r/508757 (https://phabricator.wikimedia.org/T217396)
[06:32:53] !log upgrading pybal to version 1.15.6 in lvs3003 - T222705
[06:32:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:34:40] (PS2) Marostegui: db-eqiad,db-codfw.php: Pool db1127 and db1137 in x1 [mediawiki-config] - https://gerrit.wikimedia.org/r/508757 (https://phabricator.wikimedia.org/T217396)
[06:35:22] PROBLEM - Router interfaces on cr1-eqiad is CRITICAL: CRITICAL: host 208.80.154.196, interfaces up: 239, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[06:36:06] !log upgrading pybal to version 1.15.6 in lvs3001 - T222705
[06:36:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:36:10] T222705: Improve Pybal's url checks - https://phabricator.wikimedia.org/T222705
[06:36:43] (PS1) Marostegui: db1127,db1137: Enable notifications [puppet] - https://gerrit.wikimedia.org/r/508758 (https://phabricator.wikimedia.org/T222682)
[06:36:44] PROBLEM - Check the last execution of netbox_ganeti_eqiad_sync on netmon1002 is CRITICAL: CRITICAL: Status of the systemd unit netbox_ganeti_eqiad_sync
[06:37:50] PROBLEM - Check the last execution of netbox_ganeti_codfw_sync on netmon1002 is CRITICAL: CRITICAL: Status of the systemd unit netbox_ganeti_codfw_sync
[06:38:11] (CR) Marostegui: [C: +2] db1127,db1137: Enable notifications [puppet] - https://gerrit.wikimedia.org/r/508758 (https://phabricator.wikimedia.org/T222682) (owner: Marostegui)
[06:40:14] RECOVERY - Router interfaces on cr1-eqiad is OK: OK: host 208.80.154.196, interfaces up: 241, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[06:41:57] (PS1) Giuseppe Lavagetto: mediawiki::web::prod_sites: allow forcing engine for testwikidata S:BP [puppet] - https://gerrit.wikimedia.org/r/508759 (https://phabricator.wikimedia.org/T222705)
[06:42:06] !log upgrading pybal to version 1.15.6 in lvs2003 - T222705
[06:42:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:42:11] T222705: Improve Pybal's url checks - https://phabricator.wikimedia.org/T222705
[06:46:39] (CR) Elukey: [C: +1] mediawiki::web::prod_sites: allow forcing engine for testwikidata S:BP [puppet] - https://gerrit.wikimedia.org/r/508759 (https://phabricator.wikimedia.org/T222705) (owner: Giuseppe Lavagetto)
[06:50:16] (CR) Mathew.onipe: wdqs: add WDQS restart cookbook (6 comments) [cookbooks] - https://gerrit.wikimedia.org/r/507347 (https://phabricator.wikimedia.org/T221832) (owner: Mathew.onipe)
[06:51:18] !log upgrading pybal to version 1.15.6 in lvs2005 - T222705
[06:51:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:51:22] T222705: Improve Pybal's url checks - https://phabricator.wikimedia.org/T222705
[06:51:32] (PS3) Mathew.onipe: wdqs: add WDQS restart cookbook [cookbooks] - https://gerrit.wikimedia.org/r/507347 (https://phabricator.wikimedia.org/T221832)
[06:53:11] (CR) jerkins-bot: [V: -1] wdqs: add WDQS restart cookbook [cookbooks] - https://gerrit.wikimedia.org/r/507347 (https://phabricator.wikimedia.org/T221832) (owner: Mathew.onipe)
[06:55:06] (PS4) Mathew.onipe: wdqs: add WDQS restart cookbook [cookbooks] - https://gerrit.wikimedia.org/r/507347 (https://phabricator.wikimedia.org/T221832)
[06:56:42] RECOVERY - puppet last run on mw1307 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:58:11] (CR) Giuseppe Lavagetto: [C: +2] mediawiki::web::prod_sites: allow forcing engine for testwikidata S:BP [puppet] - https://gerrit.wikimedia.org/r/508759 (https://phabricator.wikimedia.org/T222705) (owner: Giuseppe Lavagetto)
[06:58:22] !log upgrading pybal to version 1.15.6 in lvs2002 - T222705
[06:58:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:58:27] T222705: Improve Pybal's url checks - https://phabricator.wikimedia.org/T222705
[06:59:02] RECOVERY - BGP status on cr1-eqiad is OK: BGP OK - up: 34, down: 0, shutdown: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[07:01:10] RECOVERY - puppet last run on cp2024 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures
[07:02:03] !log upgrading pybal to version 1.15.6 in lvs2004 - T222705
[07:02:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:02:56] PROBLEM - Router interfaces on cr1-eqiad is CRITICAL: CRITICAL: host 208.80.154.196, interfaces up: 239, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[07:04:55] !log upgrading pybal to version 1.15.6 in lvs2001 - T222705
[07:04:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:04:59] T222705: Improve Pybal's url checks - https://phabricator.wikimedia.org/T222705
[07:07:00] RECOVERY - Router interfaces on cr1-eqiad is OK: OK: host 208.80.154.196, interfaces up: 241, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[07:09:50] PROBLEM - BGP status on cr1-eqiad is CRITICAL: BGP CRITICAL - AS2914/IPv6: Active, AS2914/IPv4: Active https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[07:10:49] (CR) Marostegui: [C: +2] db-eqiad,db-codfw.php: Pool db1127 and db1137 in x1 [mediawiki-config] - https://gerrit.wikimedia.org/r/508757 (https://phabricator.wikimedia.org/T217396) (owner: Marostegui)
[07:11:47] (Merged) jenkins-bot: db-eqiad,db-codfw.php: Pool db1127 and db1137 in x1 [mediawiki-config] - https://gerrit.wikimedia.org/r/508757 (https://phabricator.wikimedia.org/T217396) (owner: Marostegui)
[07:12:01] (CR) jenkins-bot: db-eqiad,db-codfw.php: Pool db1127 and db1137 in x1 [mediawiki-config] - https://gerrit.wikimedia.org/r/508757 (https://phabricator.wikimedia.org/T217396) (owner: Marostegui)
[07:13:25] !log marostegui@deploy1001 Synchronized wmf-config/db-codfw.php: Pool db1127 and db1137 into x1 T222682 (duration: 01m 03s)
[07:13:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:13:30] T222682: Productionize db11[26-38] - https://phabricator.wikimedia.org/T222682
[07:14:37] !log upgrading pybal to version 1.15.6 in lvs1006 - T222705
[07:14:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:14:41] T222705: Improve Pybal's url checks - https://phabricator.wikimedia.org/T222705
[07:14:48] (yeah, lvs1006 is a secondary LVS)
[07:15:04] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Pool db1127 and db1137 into x1 T222682 (duration: 00m 56s)
[07:15:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:17:01] (PS1) Marostegui: db-eqiad.php: Slowly repool db1093 [mediawiki-config] - https://gerrit.wikimedia.org/r/508761
[07:18:03] (CR) jerkins-bot: [V: -1] db-eqiad.php: Slowly repool db1093 [mediawiki-config] - https://gerrit.wikimedia.org/r/508761 (owner: Marostegui)
[07:19:00] (PS2) Marostegui: db-eqiad.php: Slowly repool db1093 [mediawiki-config] - https://gerrit.wikimedia.org/r/508761
[07:21:53] !log upgrading pybal to version 1.15.6 in lvs1016 - T222705
[07:21:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:21:57] T222705: Improve Pybal's url checks - https://phabricator.wikimedia.org/T222705
[07:22:04] (PS1) Marostegui: db2115: Enable notifications [puppet] - https://gerrit.wikimedia.org/r/508762 (https://phabricator.wikimedia.org/T222772)
[07:22:16] (CR) Marostegui: [C: +2] db-eqiad.php: Slowly repool db1093 [mediawiki-config] - https://gerrit.wikimedia.org/r/508761 (owner: Marostegui)
[07:22:49] (CR) Marostegui: [C: +2] db2115: Enable notifications [puppet] - https://gerrit.wikimedia.org/r/508762 (https://phabricator.wikimedia.org/T222772) (owner: Marostegui)
[07:23:44] (Merged) jenkins-bot: db-eqiad.php: Slowly repool db1093 [mediawiki-config] - https://gerrit.wikimedia.org/r/508761 (owner: Marostegui)
[07:23:58] (CR) jenkins-bot: db-eqiad.php: Slowly repool db1093 [mediawiki-config] - https://gerrit.wikimedia.org/r/508761 (owner: Marostegui)
[07:25:00] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Give some weight to db1093 (duration: 00m 56s)
[07:25:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:25:34] (PS1) Muehlenhoff: Remove obsolete kmod class [puppet] - https://gerrit.wikimedia.org/r/508763
[07:26:32] !log upgrading pybal to version 1.15.6 in lvs1005 - T222705
[07:26:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:28:06] (PS1) Marostegui: db-eqiad,db-codfw.php: Pool db2115 in x1 [mediawiki-config] - https://gerrit.wikimedia.org/r/508764 (https://phabricator.wikimedia.org/T222772)
[07:29:01] PROBLEM - puppet last run on snapshot1008 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[07:29:03] PROBLEM - puppet last run on mw1306 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[07:29:38] (CR) Marostegui: [C: +2] db-eqiad,db-codfw.php: Pool db2115 in x1 [mediawiki-config] - https://gerrit.wikimedia.org/r/508764 (https://phabricator.wikimedia.org/T222772) (owner: Marostegui)
[07:30:25] PROBLEM - puppet last run on mw1336 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[07:30:42] (Merged) jenkins-bot: db-eqiad,db-codfw.php: Pool db2115 in x1 [mediawiki-config] - https://gerrit.wikimedia.org/r/508764 (https://phabricator.wikimedia.org/T222772) (owner: Marostegui)
[07:30:45] PROBLEM - puppet last run on mw1303 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[07:30:56] (CR) jenkins-bot: db-eqiad,db-codfw.php: Pool db2115 in x1 [mediawiki-config] - https://gerrit.wikimedia.org/r/508764 (https://phabricator.wikimedia.org/T222772) (owner: Marostegui)
[07:31:03] PROBLEM - puppet last run on mw1296 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[07:31:35] PROBLEM - puppet last run on snapshot1007 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[07:32:29] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Pool db2115 into x1 T222772 (duration: 01m 09s)
[07:32:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:32:34] T222772: Productionize db2[103-120] - https://phabricator.wikimedia.org/T222772
[07:32:39] PROBLEM - puppet last run on mw1293 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[07:32:49] PROBLEM - puppet last run on mw1310 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[07:33:11] <_joe_> uhm looking
[07:33:15] <_joe_> I'm running puppet
[07:33:31] !log marostegui@deploy1001 Synchronized wmf-config/db-codfw.php: Pool db2115 into x1 T222772 (duration: 00m 56s)
[07:33:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:33:35] !log upgrading pybal to version 1.15.6 in lvs1002 - T222705
[07:33:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:33:39] T222705: Improve Pybal's url checks - https://phabricator.wikimedia.org/T222705
[07:33:43] PROBLEM - puppet last run on mw1261 is CRITICAL: CRITICAL: Failed to apply catalog, zero resources tracked by Puppet. It might be a dependency cycle.
[07:33:58] (PS1) Marostegui: db-eqiad.php: Give API traffic to db1093 [mediawiki-config] - https://gerrit.wikimedia.org/r/508766
[07:33:58] <_joe_> what?
[07:35:18] <_joe_> so on mw1261 I pressed ctrl+c in a cumin-controlled puppet run
[07:35:39] <_joe_> and that's the error message :D
[07:35:41] RECOVERY - puppet last run on mw1336 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures
[07:36:01] RECOVERY - puppet last run on mw1303 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[07:36:09] RECOVERY - BGP status on cr1-eqiad is OK: BGP OK - up: 36, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[07:36:53] RECOVERY - puppet last run on snapshot1007 is OK: OK: Puppet is currently enabled, last run 37 seconds ago with 0 failures
[07:36:58] !log upgrading pybal to version 1.15.6 in lvs1004 - T222705
[07:37:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:37:55] RECOVERY - puppet last run on mw1293 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures
[07:38:03] RECOVERY - netbox HTTPS on netmon1002 is OK: HTTP OK: HTTP/1.1 302 Found - 348 bytes in 0.581 second response time https://wikitech.wikimedia.org/wiki/Netbox
[07:38:07] RECOVERY - puppet last run on mw1310 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures
[07:38:29] (PS1) Muehlenhoff: Update various comments [puppet] - https://gerrit.wikimedia.org/r/508767
[07:39:03] RECOVERY - puppet last run on mw1261 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures
[07:39:35] RECOVERY - puppet last run on snapshot1008 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures
[07:39:35] RECOVERY - puppet last run on mw1306 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures
[07:40:42] !log bounce prometheus on bast3002 to finalize migration
[07:40:44] (CR) Marostegui: [C: +2] db-eqiad.php: Give API traffic to db1093 [mediawiki-config] - https://gerrit.wikimedia.org/r/508766 (owner: Marostegui)
[07:40:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:41:05] !log upgrading pybal to version 1.15.6 in lvs1001 - T222705
[07:41:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:41:09] T222705: Improve Pybal's url checks - https://phabricator.wikimedia.org/T222705
[07:41:39] RECOVERY - puppet last run on mw1296 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[07:41:43] (Merged) jenkins-bot: db-eqiad.php: Give API traffic to db1093 [mediawiki-config] - https://gerrit.wikimedia.org/r/508766 (owner: Marostegui)
[07:41:58] (CR) jenkins-bot: db-eqiad.php: Give API traffic to db1093 [mediawiki-config] - https://gerrit.wikimedia.org/r/508766 (owner: Marostegui)
[07:42:02] (CR) Ema: [C: +1] "Nice catch, thanks." [puppet] - https://gerrit.wikimedia.org/r/508763 (owner: Muehlenhoff)
[07:42:51] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Give some API traffic to db1093 (duration: 00m 57s)
[07:42:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:45:23] !log install uwsgi-core_2.0.14+20161117-3+deb9u2+wmf1 on netmon1002 to test a uwsgi bug fix - T212697
[07:45:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:45:28] T212697: uwsgi's logsocket_plugin.so causes segfaults during log rotation - https://phabricator.wikimedia.org/T212697
[07:45:57] (PS2) Marostegui: db-eqiad,db-codfw.php: Change second parsercache key [mediawiki-config] - https://gerrit.wikimedia.org/r/508170 (https://phabricator.wikimedia.org/T210725)
[07:47:48] PROBLEM - Prometheus bast3002.wikimedia.org/ops was restarted on bast3002 is CRITICAL: 468.3 lt 600 https://grafana.wikimedia.org/d/000000271/prometheus-stats?var-datasource=esams+prometheus/ops
[07:49:38] !log Stop replication s1 on db2102
[07:49:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:50:55] (PS2) Ema: ATS: remove role trafficserver::backend [puppet] - https://gerrit.wikimedia.org/r/508587 (https://phabricator.wikimedia.org/T213263)
[07:51:39] (CR) Ema: [C: +2] ATS: remove role trafficserver::backend [puppet] - https://gerrit.wikimedia.org/r/508587 (https://phabricator.wikimedia.org/T213263) (owner: Ema)
[07:51:56] Operations, Patch-For-Review: uwsgi's logsocket_plugin.so causes segfaults during log rotation - https://phabricator.wikimedia.org/T212697 (elukey) The fix looks very promising, I have restarted 3 times in a row uwsgi-netbox and no trace of the segfault. Let's wait for tomorrow's round of logrotate resta...
[07:51:58] (PS3) Ema: ATS: update cumin aliases [puppet] - https://gerrit.wikimedia.org/r/508588 (https://phabricator.wikimedia.org/T213263)
[07:53:02] (CR) Ema: [C: +2] ATS: update cumin aliases [puppet] - https://gerrit.wikimedia.org/r/508588 (https://phabricator.wikimedia.org/T213263) (owner: Ema)
[07:54:05] Operations, Patch-For-Review: uwsgi's logsocket_plugin.so causes segfaults during log rotation - https://phabricator.wikimedia.org/T212697 (elukey) Side note: I don't find any message in Kibana related to netmon1002 or netbox, not sure if normal or not.
[07:56:19] (PS1) Marostegui: mariadb: db2103,db2112,db2116 into s1 [puppet] - https://gerrit.wikimedia.org/r/508768 (https://phabricator.wikimedia.org/T222772)
[07:57:43] (PS2) Marostegui: mariadb: db2103,db2112,db2116 into s1 [puppet] - https://gerrit.wikimedia.org/r/508768 (https://phabricator.wikimedia.org/T222772)
[07:58:18] RECOVERY - Check the last execution of netbox_ganeti_eqiad_sync on netmon1002 is OK: OK: Status of the systemd unit netbox_ganeti_eqiad_sync
[07:59:26] RECOVERY - Check the last execution of netbox_ganeti_codfw_sync on netmon1002 is OK: OK: Status of the systemd unit netbox_ganeti_codfw_sync
[07:59:40] RECOVERY - Check systemd state on netmon1002 is OK: OK - running: The system is fully operational
[08:00:08] (CR) Filippo Giunchedi: [C: +1] mediawiki: Fix statsd reporting of MediaWiki.errors.fatal [puppet] - https://gerrit.wikimedia.org/r/508730 (https://phabricator.wikimedia.org/T222765) (owner: Krinkle)
[08:00:27] (CR) Marostegui: [C: +2] mariadb: db2103,db2112,db2116 into s1 [puppet] - https://gerrit.wikimedia.org/r/508768 (https://phabricator.wikimedia.org/T222772) (owner: Marostegui)
[08:02:21] (CR) Filippo Giunchedi: [C: +1] "+1, cc'ing Daniel and Riccardo as they worked on this too" [puppet] - https://gerrit.wikimedia.org/r/508748 (owner: CDanis)
[08:04:18] (PS1) Vgutierrez: lvs: Toggle VLAN legacy naming [puppet] - https://gerrit.wikimedia.org/r/508770 (https://phabricator.wikimedia.org/T209707)
[08:04:20] (CR) Jcrespo: "I can deploy this, but I would like to do it gradually on production. Alex please tell me when you are around to proceed." [puppet] - https://gerrit.wikimedia.org/r/505407 (https://phabricator.wikimedia.org/T220894) (owner: Alex Monk)
[08:05:04] (CR) Jcrespo: "Sorry, I meant Alex Monk not Alexandros." [puppet] - https://gerrit.wikimedia.org/r/505407 (https://phabricator.wikimedia.org/T220894) (owner: Alex Monk)
[08:06:15] (PS1) Giuseppe Lavagetto: jobrunner: allow forcing the engine to use via query string [puppet] - https://gerrit.wikimedia.org/r/508771 (https://phabricator.wikimedia.org/T222705)
[08:06:18] (PS1) Giuseppe Lavagetto: lvs: check jobrunners with both rendering engines [puppet] - https://gerrit.wikimedia.org/r/508772 (https://phabricator.wikimedia.org/T222705)
[08:06:20] (CR) Filippo Giunchedi: [C: +1] "LGTM! for the future: I _think_ we should be able to fix labs Prometheus to DTRT without an Host: header" [puppet] - https://gerrit.wikimedia.org/r/508745 (owner: CDanis)
[08:06:22] (PS1) Giuseppe Lavagetto: lvs: check both php7 and hhvm on appserver-https [puppet] - https://gerrit.wikimedia.org/r/508773 (https://phabricator.wikimedia.org/T222705)
[08:06:24] (PS1) Giuseppe Lavagetto: lvs: Check both rendering engines on the api, appservers pools [puppet] - https://gerrit.wikimedia.org/r/508774 (https://phabricator.wikimedia.org/T222705)
[08:07:52] (PS1) Ema: ATS: log origin server timestamps and connection details [puppet] - https://gerrit.wikimedia.org/r/508775
[08:08:10] (CR) Giuseppe Lavagetto: "https://puppet-compiler.wmflabs.org/compiler1002/16412/mw1300.eqiad.wmnet/ LGTM" [puppet] - https://gerrit.wikimedia.org/r/508771 (https://phabricator.wikimedia.org/T222705) (owner: Giuseppe Lavagetto)
[08:10:22] (PS1) Marostegui: db-eqiad.php: More traffic to db1093 [mediawiki-config] - https://gerrit.wikimedia.org/r/508776
[08:10:25] (CR) Vgutierrez: "expected "changes" (almost a NOOP): https://puppet-compiler.wmflabs.org/compiler1002/16411/" [puppet] - https://gerrit.wikimedia.org/r/508770 (https://phabricator.wikimedia.org/T209707) (owner: Vgutierrez)
[08:10:57] RECOVERY - Prometheus bast3002.wikimedia.org/ops was restarted on bast3002 is OK: (C)600 lt (W)1800 lt 1857 https://grafana.wikimedia.org/d/000000271/prometheus-stats?var-datasource=esams+prometheus/ops
[08:12:07] (CR) Marostegui: [C: +2] db-eqiad.php: More traffic to db1093 [mediawiki-config] - https://gerrit.wikimedia.org/r/508776 (owner: Marostegui)
[08:12:14] (CR) Giuseppe Lavagetto: "https://puppet-compiler.wmflabs.org/compiler1002/16414/lvs1006.wikimedia.org/ lgtm" [puppet] - https://gerrit.wikimedia.org/r/508772 (https://phabricator.wikimedia.org/T222705) (owner: Giuseppe Lavagetto)
[08:13:09] (Merged) jenkins-bot: db-eqiad.php: More traffic to db1093 [mediawiki-config] - https://gerrit.wikimedia.org/r/508776 (owner: Marostegui)
[08:13:22] (CR) jenkins-bot: db-eqiad.php: More traffic to db1093 [mediawiki-config] - https://gerrit.wikimedia.org/r/508776 (owner: Marostegui)
[08:14:24] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: More traffic to db1093 (duration: 00m 58s)
[08:14:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:16:55] (PS1) Marostegui: db-eqiad.php: Fully repool db1093 [mediawiki-config] - https://gerrit.wikimedia.org/r/508780 (https://phabricator.wikimedia.org/T222127)
[08:20:50] (PS2) Giuseppe Lavagetto: lvs: Check both rendering engines on the api, appservers pools [puppet] - https://gerrit.wikimedia.org/r/508774 (https://phabricator.wikimedia.org/T222705)
[08:23:45] (CR) Giuseppe Lavagetto: [C: +2] jobrunner: allow forcing the engine to use via query string [puppet] - https://gerrit.wikimedia.org/r/508771 (https://phabricator.wikimedia.org/T222705) (owner: Giuseppe Lavagetto)
[08:25:06] (CR) Gehel: [C: -1] wdqs: add WDQS restart cookbook (2 comments) [cookbooks] - https://gerrit.wikimedia.org/r/507347 (https://phabricator.wikimedia.org/T221832) (owner: Mathew.onipe)
[08:27:22] (CR) Marostegui: [C: +2] db-eqiad.php: Fully repool db1093 [mediawiki-config] - https://gerrit.wikimedia.org/r/508780 (https://phabricator.wikimedia.org/T222127) (owner: Marostegui)
[08:28:29] (Merged) jenkins-bot: db-eqiad.php: Fully repool db1093 [mediawiki-config] - https://gerrit.wikimedia.org/r/508780 (https://phabricator.wikimedia.org/T222127) (owner: Marostegui)
[08:28:42] (CR) jenkins-bot: db-eqiad.php: Fully repool db1093 [mediawiki-config] - https://gerrit.wikimedia.org/r/508780 (https://phabricator.wikimedia.org/T222127) (owner: Marostegui)
[08:29:03] Operations, ops-eqiad, Traffic: cp1083 crashed - https://phabricator.wikimedia.org/T222620 (fgiunchedi) >>! In T222620#5164117, @CDanis wrote: >>>! In T222620#5163577, @ema wrote: >> Interestingly, there was a memory usage spike right before the host crashed. >> >> {F28951427} > > I think that is j...
[08:29:15] (CR) Gehel: [C: -1] "A few minor comments inline." (2 comments) [puppet] - https://gerrit.wikimedia.org/r/508742 (https://phabricator.wikimedia.org/T218932) (owner: Mathew.onipe)
[08:29:44] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Fully repool db1093 (duration: 00m 58s)
[08:29:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:30:30] Operations, ops-eqiad, DBA, Patch-For-Review: db1093 (s6 candidate master) went down - broken BBU - https://phabricator.wikimedia.org/T222127 (Marostegui) Open→Resolved This host has been fully repooled Thanks @Cmjohnson for replacing the BBU
[08:30:33] (PS1) Giuseppe Lavagetto: jobrunner: fix typo in apache configuration [puppet] - https://gerrit.wikimedia.org/r/508781
[08:31:19] (CR) Giuseppe Lavagetto: [C: +2] jobrunner: fix typo in apache configuration [puppet] - https://gerrit.wikimedia.org/r/508781 (owner: Giuseppe Lavagetto)
[08:31:27] <_joe_> oh come on jenkins
[08:32:41] PROBLEM - Check Varnish expiry mailbox lag on cp5008 is CRITICAL: CRITICAL: expiry mailbox lag is 10151792 https://wikitech.wikimedia.org/wiki/Varnish
[08:33:38] (CR) Gehel: [C: -1]
"Still some issues with logstash cluster: https://puppet-compiler.wmflabs.org/compiler1002/16416/" [puppet] - 10https://gerrit.wikimedia.org/r/507950 (https://phabricator.wikimedia.org/T218932) (owner: 10Mathew.onipe) [08:35:21] (03PS1) 10Marostegui: db-eqiad,db.codfw.php: Pool db1131 on s6 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/508782 (https://phabricator.wikimedia.org/T222682) [08:38:28] (03CR) 10Gehel: [C: 04-1] Add postgres slave init cookbook (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/504570 (https://phabricator.wikimedia.org/T220946) (owner: 10Mathew.onipe) [08:40:13] (03PS1) 10Marostegui: db1131: Enable notifications [puppet] - 10https://gerrit.wikimedia.org/r/508784 (https://phabricator.wikimedia.org/T222682) [08:42:28] (03CR) 10Marostegui: [C: 03+2] db1131: Enable notifications [puppet] - 10https://gerrit.wikimedia.org/r/508784 (https://phabricator.wikimedia.org/T222682) (owner: 10Marostegui) [08:45:54] (03CR) 10Marostegui: [C: 03+2] db-eqiad,db.codfw.php: Pool db1131 on s6 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/508782 (https://phabricator.wikimedia.org/T222682) (owner: 10Marostegui) [08:46:34] (03Merged) 10jenkins-bot: db-eqiad,db.codfw.php: Pool db1131 on s6 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/508782 (https://phabricator.wikimedia.org/T222682) (owner: 10Marostegui) [08:46:48] (03CR) 10jenkins-bot: db-eqiad,db.codfw.php: Pool db1131 on s6 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/508782 (https://phabricator.wikimedia.org/T222682) (owner: 10Marostegui) [08:47:26] (03CR) 10Ema: [C: 03+1] lvs: check jobrunners with both rendering engines [puppet] - 10https://gerrit.wikimedia.org/r/508772 (https://phabricator.wikimedia.org/T222705) (owner: 10Giuseppe Lavagetto) [08:47:58] !log marostegui@deploy1001 Synchronized wmf-config/db-codfw.php: Pool db1131 into s6 with low weight T222682 (duration: 00m 53s) [08:48:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log 
[08:48:03] T222682: Productionize db11[26-38] - https://phabricator.wikimedia.org/T222682 [08:48:20] 10Operations, 10Analytics, 10EventBus, 10observability, and 3 others: Upgrade statsd_exporter to 0.9 - https://phabricator.wikimedia.org/T220709 (10fgiunchedi) >>! In T220709#5164240, @Ottomata wrote: > Great! I guess it just needs to go into the WMF base docker image somehow? Indeed, I'm not sure about... [08:49:01] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Pool db1131 into s6 with low weight T222682 (duration: 00m 51s) [08:49:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:49:31] (03CR) 10Giuseppe Lavagetto: [C: 03+2] lvs: check jobrunners with both rendering engines [puppet] - 10https://gerrit.wikimedia.org/r/508772 (https://phabricator.wikimedia.org/T222705) (owner: 10Giuseppe Lavagetto) [08:49:56] (03PS2) 10Giuseppe Lavagetto: lvs: check jobrunners with both rendering engines [puppet] - 10https://gerrit.wikimedia.org/r/508772 (https://phabricator.wikimedia.org/T222705) [08:50:11] (03PS1) 10Marostegui: db-eqiad.php: More traffic to db1131 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/508785 [08:50:25] (03CR) 10Marostegui: [C: 04-2] "Wait a bit" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/508785 (owner: 10Marostegui) [08:50:52] (03PS3) 10Marostegui: db-eqiad,db-codfw.php: Change second parsercache key [mediawiki-config] - 10https://gerrit.wikimedia.org/r/508170 (https://phabricator.wikimedia.org/T210725) [08:52:40] (03PS10) 10Paladox: Gerrit: Set plugin.javamelody.prometheusBearerToken [puppet] - 10https://gerrit.wikimedia.org/r/508621 (https://phabricator.wikimedia.org/T184086) [08:53:35] (03CR) 10jerkins-bot: [V: 04-1] Gerrit: Set plugin.javamelody.prometheusBearerToken [puppet] - 10https://gerrit.wikimedia.org/r/508621 (https://phabricator.wikimedia.org/T184086) (owner: 10Paladox) [08:55:59] !log upload prometheus-statsd-exporter 0.9.0+ds1-1 to stretch-wikimedia T220709 [08:56:03] 
Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:56:04] T220709: Upgrade statsd_exporter to 0.9 - https://phabricator.wikimedia.org/T220709 [08:56:13] (03PS11) 10Paladox: Gerrit: Set plugin.javamelody.prometheusBearerToken [puppet] - 10https://gerrit.wikimedia.org/r/508621 (https://phabricator.wikimedia.org/T184086) [08:57:10] (03CR) 10jerkins-bot: [V: 04-1] Gerrit: Set plugin.javamelody.prometheusBearerToken [puppet] - 10https://gerrit.wikimedia.org/r/508621 (https://phabricator.wikimedia.org/T184086) (owner: 10Paladox) [08:57:11] <_joe_> !log restarting pybal on lvs2006 to pick up changes for T222705 [08:57:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:57:15] T222705: Improve Pybal's url checks - https://phabricator.wikimedia.org/T222705 [08:57:53] (03PS12) 10Paladox: Gerrit: Set plugin.javamelody.prometheusBearerToken [puppet] - 10https://gerrit.wikimedia.org/r/508621 (https://phabricator.wikimedia.org/T184086) [09:00:35] (03CR) 10Paladox: Gerrit: Set plugin.javamelody.prometheusBearerToken (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/508621 (https://phabricator.wikimedia.org/T184086) (owner: 10Paladox) [09:07:26] (03PS2) 10Giuseppe Lavagetto: lvs: check both php7 and hhvm on appserver-https [puppet] - 10https://gerrit.wikimedia.org/r/508773 (https://phabricator.wikimedia.org/T222705) [09:08:37] (03CR) 10Giuseppe Lavagetto: [C: 03+2] lvs: check both php7 and hhvm on appserver-https [puppet] - 10https://gerrit.wikimedia.org/r/508773 (https://phabricator.wikimedia.org/T222705) (owner: 10Giuseppe Lavagetto) [09:12:57] <_joe_> !log restarting pybal on lvs2006 to pick up changes for T222705 (2/3) [09:13:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:13:01] T222705: Improve Pybal's url checks - https://phabricator.wikimedia.org/T222705 [09:21:15] (03CR) 10Giuseppe Lavagetto: [C: 03+2] lvs: Check both rendering engines on the api, appservers pools 
[puppet] - 10https://gerrit.wikimedia.org/r/508774 (https://phabricator.wikimedia.org/T222705) (owner: 10Giuseppe Lavagetto) [09:21:30] 10Operations: #wikimedia-sre is missing stashbot - https://phabricator.wikimedia.org/T222755 (10jbond) p:05Triage→03Normal [09:23:01] (03PS3) 10Giuseppe Lavagetto: lvs: Check both rendering engines on the api, appservers pools [puppet] - 10https://gerrit.wikimedia.org/r/508774 (https://phabricator.wikimedia.org/T222705) [09:24:17] (03CR) 10Volans: [C: 04-1] "Minor final adjustments and we're ready." (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/496873 (https://phabricator.wikimedia.org/T83992) (owner: 10Ayounsi) [09:24:32] !log install uwsgi-core_2.0.14+20161117-3+deb9u2+wmf1 on netmon2001 to test a uwsgi bug fix - T212697 [09:24:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:24:36] T212697: uwsgi's logsocket_plugin.so causes segfaults during log rotation - https://phabricator.wikimedia.org/T212697 [09:26:29] <_joe_> !log restarting pybal on lvs2006 to pick up changes for T222705 (3/3) [09:26:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:26:33] T222705: Improve Pybal's url checks - https://phabricator.wikimedia.org/T222705 [09:26:58] (03CR) 10Marostegui: [C: 03+2] db-eqiad.php: More traffic to db1131 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/508785 (owner: 10Marostegui) [09:28:06] (03Merged) 10jenkins-bot: db-eqiad.php: More traffic to db1131 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/508785 (owner: 10Marostegui) [09:28:20] (03CR) 10jenkins-bot: db-eqiad.php: More traffic to db1131 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/508785 (owner: 10Marostegui) [09:29:51] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Give more traffic to db1131 in s6 T222682 (duration: 01m 07s) [09:29:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:29:56] T222682: Productionize 
db11[26-38] - https://phabricator.wikimedia.org/T222682 [09:37:13] (03PS1) 10Marostegui: install_server: Only allow reimage db1133 and db2114 [puppet] - 10https://gerrit.wikimedia.org/r/508789 [09:38:38] (03CR) 10Jcrespo: [C: 03+1] install_server: Only allow reimage db1133 and db2114 [puppet] - 10https://gerrit.wikimedia.org/r/508789 (owner: 10Marostegui) [09:38:54] (03CR) 10Marostegui: [C: 03+2] install_server: Only allow reimage db1133 and db2114 [puppet] - 10https://gerrit.wikimedia.org/r/508789 (owner: 10Marostegui) [09:45:49] !log Stop replication on db2097:3311 [09:45:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:48:26] (03CR) 10Vgutierrez: [C: 03+1] ATS: log origin server timestamps and connection details [puppet] - 10https://gerrit.wikimedia.org/r/508775 (owner: 10Ema) [09:49:27] <_joe_> !log restarted pybal on lvs2003 to pick up changes for T222705 [09:49:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:49:31] T222705: Improve Pybal's url checks - https://phabricator.wikimedia.org/T222705 [09:50:22] <_joe_> !log restarted pybal on lvs1006 to pick up changes for T222705 [09:50:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:51:51] <_joe_> !log restarted proton on proton1001 [09:51:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:51:59] RECOVERY - proton endpoints health on proton1001 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/proton [09:52:08] <_joe_> it was polluting pybal's logs [09:57:48] solved like that? 
[09:57:54] nice [09:59:48] <_joe_> I'm watching lvs1006 for a few [09:59:54] <_joe_> before switching lvs1016 [10:03:12] (03PS1) 10Marostegui: db-eqiad.php: More traffic to db1131 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/508794 [10:03:35] (03PS2) 10Ema: ATS: log origin server timestamps and connection details [puppet] - 10https://gerrit.wikimedia.org/r/508775 [10:04:24] (03CR) 10Volans: [C: 04-1] "Unfortunately the situation is a bit more complex and this is not enough IMHO." (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/508748 (owner: 10CDanis) [10:04:33] (03CR) 10Ema: [C: 03+2] ATS: log origin server timestamps and connection details [puppet] - 10https://gerrit.wikimedia.org/r/508775 (owner: 10Ema) [10:05:59] (03CR) 10Marostegui: [C: 03+2] db-eqiad.php: More traffic to db1131 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/508794 (owner: 10Marostegui) [10:06:27] (03PS1) 10Vgutierrez: openstack: Disable legacy vlan naming for cloudvirt1024 [puppet] - 10https://gerrit.wikimedia.org/r/508796 (https://phabricator.wikimedia.org/T209707) [10:06:48] (03CR) 10jerkins-bot: [V: 04-1] openstack: Disable legacy vlan naming for cloudvirt1024 [puppet] - 10https://gerrit.wikimedia.org/r/508796 (https://phabricator.wikimedia.org/T209707) (owner: 10Vgutierrez) [10:07:00] (03Merged) 10jenkins-bot: db-eqiad.php: More traffic to db1131 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/508794 (owner: 10Marostegui) [10:07:23] (03CR) 10jenkins-bot: db-eqiad.php: More traffic to db1131 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/508794 (owner: 10Marostegui) [10:07:58] (03PS3) 10Vgutierrez: Revert "lvs: Avoid tagged network interfaces to hit IFNAMSIZ (15+\0) limit" [puppet] - 10https://gerrit.wikimedia.org/r/508668 [10:08:06] (03PS2) 10Vgutierrez: lvs: Toggle VLAN legacy naming [puppet] - 10https://gerrit.wikimedia.org/r/508770 (https://phabricator.wikimedia.org/T209707) [10:08:09] !log marostegui@deploy1001 Synchronized 
wmf-config/db-eqiad.php: Give more traffic to db1131 in s6 T222682 (duration: 00m 57s) [10:08:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:08:14] T222682: Productionize db11[26-38] - https://phabricator.wikimedia.org/T222682 [10:09:30] (03PS4) 10Vgutierrez: Revert "lvs: Avoid tagged network interfaces to hit IFNAMSIZ (15+\0) limit" [puppet] - 10https://gerrit.wikimedia.org/r/508668 [10:09:32] (03PS3) 10Vgutierrez: lvs: Toggle VLAN legacy naming [puppet] - 10https://gerrit.wikimedia.org/r/508770 (https://phabricator.wikimedia.org/T209707) [10:09:34] (03PS2) 10Vgutierrez: openstack: Disable legacy vlan naming for cloudvirt1024 [puppet] - 10https://gerrit.wikimedia.org/r/508796 (https://phabricator.wikimedia.org/T209707) [10:14:10] (03PS4) 10Vgutierrez: lvs: Toggle VLAN legacy naming [puppet] - 10https://gerrit.wikimedia.org/r/508770 (https://phabricator.wikimedia.org/T209707) [10:14:12] (03PS3) 10Vgutierrez: openstack: Disable legacy vlan naming for cloudvirt1024 [puppet] - 10https://gerrit.wikimedia.org/r/508796 (https://phabricator.wikimedia.org/T209707) [10:17:46] <_joe_> !log restarted pybal on lvs1016 to pick up changes for T222705 [10:17:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:17:50] T222705: Improve Pybal's url checks - https://phabricator.wikimedia.org/T222705 [10:21:05] (03CR) 10Volans: wdqs: add WDQS restart cookbook (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/507347 (https://phabricator.wikimedia.org/T221832) (owner: 10Mathew.onipe) [10:21:26] (03CR) 10Vgutierrez: "pcc seems happy: https://puppet-compiler.wmflabs.org/compiler1002/16419/" [puppet] - 10https://gerrit.wikimedia.org/r/508796 (https://phabricator.wikimedia.org/T209707) (owner: 10Vgutierrez) [10:21:59] (03PS1) 10Filippo Giunchedi: hieradata: change swift statsd-exporter units [puppet] - 10https://gerrit.wikimedia.org/r/508798 (https://phabricator.wikimedia.org/T220709) [10:22:05] (03CR) 
10Vgutierrez: [C: 03+1] "> Patch Set 2: Code-Review+1" [puppet] - 10https://gerrit.wikimedia.org/r/508668 (owner: 10Vgutierrez) [10:25:27] (03CR) 10Filippo Giunchedi: "PCC https://puppet-compiler.wmflabs.org/compiler1002/16420/" [puppet] - 10https://gerrit.wikimedia.org/r/508798 (https://phabricator.wikimedia.org/T220709) (owner: 10Filippo Giunchedi) [10:28:59] (03PS2) 10Jbond: facter3/puppet5: enable puppet5/facter3 codfw [puppet] - 10https://gerrit.wikimedia.org/r/507302 (https://phabricator.wikimedia.org/T219803) [10:29:39] (03CR) 10Jbond: [C: 03+2] facter3/puppet5: enable puppet5/facter3 codfw [puppet] - 10https://gerrit.wikimedia.org/r/507302 (https://phabricator.wikimedia.org/T219803) (owner: 10Jbond) [10:30:26] (03PS1) 10Jcrespo: mariadb-backups: Update transfer.py to 3b11a70cc1f356 [puppet] - 10https://gerrit.wikimedia.org/r/508801 (https://phabricator.wikimedia.org/T206203) [10:31:17] (03CR) 10jerkins-bot: [V: 04-1] mariadb-backups: Update transfer.py to 3b11a70cc1f356 [puppet] - 10https://gerrit.wikimedia.org/r/508801 (https://phabricator.wikimedia.org/T206203) (owner: 10Jcrespo) [10:34:39] (03PS2) 10Jcrespo: mariadb-backups: Update transfer.py to 3b11a70cc1f356 [puppet] - 10https://gerrit.wikimedia.org/r/508801 (https://phabricator.wikimedia.org/T206203) [10:35:47] (03PS1) 10Alexandros Kosiaris: Use prometheus-statsd-exporter 0.9 [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/508803 (https://phabricator.wikimedia.org/T220709) [10:38:09] (03PS6) 10Jbond: Prometheus: add timeout parameter to query method [software/spicerack] - 10https://gerrit.wikimedia.org/r/507561 [10:41:26] (03CR) 10Jbond: "updated to remove the Tuple" [software/spicerack] - 10https://gerrit.wikimedia.org/r/507561 (owner: 10Jbond) [10:42:03] PROBLEM - puppet last run on pybal-test2001 is CRITICAL: CRITICAL: Puppet has 8 failures. Last run 3 minutes ago with 8 failures. 
Failed resources (up to 3 shown): Exec[eth0_v6_token],Service[rsyslog],Service[nagios-nrpe-server],Exec[ip addr add 2620:0:860:102:10:192:16:139/64 dev eth0] [10:42:10] (03CR) 10jerkins-bot: [V: 04-1] Prometheus: add timeout parameter to query method [software/spicerack] - 10https://gerrit.wikimedia.org/r/507561 (owner: 10Jbond) [10:42:24] (03CR) 10Volans: [C: 03+1] "LGTM, minor optional nit inline" (032 comments) [software/spicerack] - 10https://gerrit.wikimedia.org/r/503946 (owner: 10Giuseppe Lavagetto) [10:46:16] (03PS7) 10Jbond: Prometheus: add timeout parameter to query method [software/spicerack] - 10https://gerrit.wikimedia.org/r/507561 [10:47:57] hmmm puppet seems happy in pybal-test2001 [10:49:36] (03PS1) 10Jcrespo: mariadb-backups: Setup snapshots of s1 stopping replication [puppet] - 10https://gerrit.wikimedia.org/r/508805 (https://phabricator.wikimedia.org/T206203) [10:50:54] (03CR) 10Volans: "Code looks good, just one main comment/question and a nit." (032 comments) [software/spicerack] - 10https://gerrit.wikimedia.org/r/504578 (owner: 10Giuseppe Lavagetto) [10:52:20] (03CR) 10Jcrespo: [V: 03+2] transfer.py: Allow for a 3rd transfer type: decompression [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/500043 (https://phabricator.wikimedia.org/T219631) (owner: 10Jcrespo) [10:52:26] (03CR) 10Jcrespo: [V: 03+2 C: 03+2] transfer.py: Allow for a 3rd transfer type: decompression [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/500043 (https://phabricator.wikimedia.org/T219631) (owner: 10Jcrespo) [10:52:38] (03CR) 10Jcrespo: [V: 03+2 C: 03+2] mariadb: Allow new option --stop-slave for xtrabackup transfers [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/502828 (https://phabricator.wikimedia.org/T206203) (owner: 10Jcrespo) [10:52:41] RECOVERY - puppet last run on pybal-test2001 is OK: OK: Puppet is currently enabled, last run 5 minutes ago with 0 failures [10:52:59] (03CR) 10Volans: "LGTM, just nits in the docstring" (033 comments) 
[software/spicerack] - 10https://gerrit.wikimedia.org/r/507561 (owner: 10Jbond) [10:53:10] (03PS3) 10Jcrespo: mariadb-backups: Update transfer.py to 3b11a70cc1f356 [puppet] - 10https://gerrit.wikimedia.org/r/508801 (https://phabricator.wikimedia.org/T206203) [10:54:56] (03CR) 10Jcrespo: [C: 03+2] mariadb-backups: Update transfer.py to 3b11a70cc1f356 [puppet] - 10https://gerrit.wikimedia.org/r/508801 (https://phabricator.wikimedia.org/T206203) (owner: 10Jcrespo) [10:58:35] PROBLEM - Check Varnish expiry mailbox lag on cp3035 is CRITICAL: CRITICAL: expiry mailbox lag is 2057292 https://wikitech.wikimedia.org/wiki/Varnish [10:58:36] (03PS2) 10Filippo Giunchedi: hieradata: change swift statsd-exporter units [puppet] - 10https://gerrit.wikimedia.org/r/508798 (https://phabricator.wikimedia.org/T220709) [10:58:49] (03PS8) 10Jbond: Prometheus: add timeout parameter to query method [software/spicerack] - 10https://gerrit.wikimedia.org/r/507561 [10:59:28] (03CR) 10Filippo Giunchedi: [C: 03+2] hieradata: change swift statsd-exporter units [puppet] - 10https://gerrit.wikimedia.org/r/508798 (https://phabricator.wikimedia.org/T220709) (owner: 10Filippo Giunchedi) [11:00:04] MaxSem, RoanKattouw, and Niharika: #bothumor I ♥ Unicode. All rise for European Mid-day SWAT(Max 6 patches) deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190508T1100). [11:00:04] kart_ and duesen: A patch you scheduled for European Mid-day SWAT(Max 6 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [11:01:04] o/ [11:01:22] here [11:01:38] duesen: Do you deploy? [11:01:56] nope. [11:02:07] also waiting for my stuff to be rolled out :) [11:02:08] OK. 
I can start then, with my patch :) [11:02:23] (03PS9) 10Jbond: Prometheus: add timeout parameter to query method [software/spicerack] - 10https://gerrit.wikimedia.org/r/507561 [11:03:16] (03CR) 10Volans: [C: 03+1] "LGTM if CI agrees :)" [software/spicerack] - 10https://gerrit.wikimedia.org/r/507561 (owner: 10Jbond) [11:03:18] (03CR) 10KartikMistry: [C: 03+2] Add publish restrictions config for enwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/495677 (https://phabricator.wikimedia.org/T217237) (owner: 10Petar.petkovic) [11:04:24] (03Merged) 10jenkins-bot: Add publish restrictions config for enwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/495677 (https://phabricator.wikimedia.org/T217237) (owner: 10Petar.petkovic) [11:04:55] (03PS10) 10Jbond: Prometheus: add timeout parameter to query method [software/spicerack] - 10https://gerrit.wikimedia.org/r/507561 [11:04:59] !log mobrovac@deploy1001 Started deploy [cpjobqueue/deploy@abd7fdc]: Prepare the config to allow jobs to be switched to PHP7 individually - T219148 [11:05:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:05:03] T219148: Use PHP7 to run all async jobs - https://phabricator.wikimedia.org/T219148 [11:06:18] (03PS14) 10Jcrespo: mariadb-snapshots: Stop replication during transfer [puppet] - 10https://gerrit.wikimedia.org/r/501546 (https://phabricator.wikimedia.org/T206203) [11:06:22] (03CR) 10jenkins-bot: Add publish restrictions config for enwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/495677 (https://phabricator.wikimedia.org/T217237) (owner: 10Petar.petkovic) [11:06:29] !log mobrovac@deploy1001 Finished deploy [cpjobqueue/deploy@abd7fdc]: Prepare the config to allow jobs to be switched to PHP7 individually - T219148 (duration: 01m 30s) [11:06:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:06:50] i can help with swat if nobody else is around btw [11:07:26] (03CR) 10Jbond: "cheers updated" (033 comments) 
[software/spicerack] - 10https://gerrit.wikimedia.org/r/507561 (owner: 10Jbond) [11:08:25] (03CR) 10Jcrespo: [C: 03+2] mariadb-snapshots: Stop replication during transfer [puppet] - 10https://gerrit.wikimedia.org/r/501546 (https://phabricator.wikimedia.org/T206203) (owner: 10Jcrespo) [11:08:57] mobrovac: the wiki says MaxSem, RoanKattouw, and Niharika are on duty. They have been silent so far, though :) [11:09:01] (03PS2) 10Jcrespo: mariadb-backups: Setup snapshots of s1 stopping replication [puppet] - 10https://gerrit.wikimedia.org/r/508805 (https://phabricator.wikimedia.org/T206203) [11:09:13] I have a fix for a fatal error I'd like to be deployed. [11:11:18] duesen: ok, i have a couple of things to do, so if they don't appear in the next 10 mins or so, i can help [11:11:51] duesen: Let me finish my patch, I can deploy too. [11:12:00] ok cool [11:15:54] duesen: few more minutes.. some testing required for this patch. [11:17:24] (03CR) 10Jbond: [C: 03+2] Prometheus: add timeout parameter to query method [software/spicerack] - 10https://gerrit.wikimedia.org/r/507561 (owner: 10Jbond) [11:22:38] (03Merged) 10jenkins-bot: Prometheus: add timeout parameter to query method [software/spicerack] - 10https://gerrit.wikimedia.org/r/507561 (owner: 10Jbond) [11:23:27] jouncebot: now [11:23:27] For the next 0 hour(s) and 36 minute(s): European Mid-day SWAT(Max 6 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190508T1100) [11:23:38] (03CR) 10jenkins-bot: Prometheus: add timeout parameter to query method [software/spicerack] - 10https://gerrit.wikimedia.org/r/507561 (owner: 10Jbond) [11:23:48] OK. I'm reverting my patch. [11:25:00] 10Operations, 10Patch-For-Review, 10User-Ladsgroup, 10User-Urbanecm, 10Wiki-Setup (Create): Create Wikisource Hindi - https://phabricator.wikimedia.org/T218155 (10Ladsgroup) a:05Ladsgroup→03None The issue still exists and it's impossible to make new wikis without lots of bandage and hacks. I'm doing... [11:25:15] ah. 
[11:26:22] 10Operations: tracking task: jessie -> stretch - https://phabricator.wikimedia.org/T168494 (10MoritzMuehlenhoff) >>! In T168494#5165168, @Dzahn wrote: > @Muehlenhoff Does it make sense to reopen? After we've formally set the new OS policy in place (during the offsite in June), the process will involve that Foun... [11:26:50] kart_: do you need any additional information from me? [11:27:13] duesen: give me one more min. My patch is almost done. [11:27:32] no rush. [11:27:45] actually - i should probably +2 the backport of my patch [11:28:05] or is this not needed? I'm blurry on the SWAT procedure [11:28:53] duesen: yes. +2 please. [11:29:11] With SWAT in message. [11:29:37] !log kartik@deploy1001 Synchronized wmf-config/InitialiseSettings.php: SWAT [[gerrit:495677|Add publish restrictions config for enwiki]] (T217237) (duration: 00m 58s) [11:29:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:29:42] T217237: Add configuration to wikis with publish restrictions - https://phabricator.wikimedia.org/T217237 [11:29:52] (03CR) 10Volans: "Post-merge -1" (035 comments) [puppet] - 10https://gerrit.wikimedia.org/r/505657 (https://phabricator.wikimedia.org/T215378) (owner: 10CRusnov) [11:29:55] duesen: let's go to your patch. [11:30:17] ah, i don't have +2 on deployment branches [11:30:19] i can only give a +1 [11:30:32] https://gerrit.wikimedia.org/r/c/mediawiki/core/+/508559 [11:31:41] also, for some reason I can't cherry-pick this to wmf.4 [11:31:45] * duesen is confused [11:31:55] (03PS1) 10Mathew.onipe: prometheus: enable metrics relabel [puppet] - 10https://gerrit.wikimedia.org/r/508809 [11:32:06] duesen: what! [11:32:16] duesen: Let me +2 on .wmf.3 first. [11:33:14] "Cherry pick failed: identical tree" [11:33:28] duesen: yes. Same for me. [11:33:42] duesen: so, some issue with gerrit? 
[11:33:49] ah, no [11:33:57] it's already in [11:34:09] it was already merged into master when wmf4 was cut [11:34:21] (03PS1) 10Jcrespo: transfer.py: Ignore stopping and starting slave if option is not set [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/508810 (https://phabricator.wikimedia.org/T206203) [11:34:46] (03CR) 10jerkins-bot: [V: 04-1] transfer.py: Ignore stopping and starting slave if option is not set [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/508810 (https://phabricator.wikimedia.org/T206203) (owner: 10Jcrespo) [11:34:54] so i guess the swat is a bit pointless, wmf4 goes to group2 today anyway :) [11:35:01] yep [11:35:53] We've 12 min to wait for CI. [11:36:08] (03CR) 10Jcrespo: [V: 03+2 C: 03+2] transfer.py: Ignore stopping and starting slave if option is not set [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/508810 (https://phabricator.wikimedia.org/T206203) (owner: 10Jcrespo) [11:37:43] kart_: actually, group1 is going to wmf4 today. group2 is due tomorrow. [11:37:55] and pretty late in my time zone [11:38:21] getting it out now would allow me to confirm the fix and close the ticket. 
[11:41:28] (03PS1) 10Jcrespo: mariadb-backups: Update transfer.py to 37c035ae06c54bbb79078c3d84 [puppet] - 10https://gerrit.wikimedia.org/r/508812 (https://phabricator.wikimedia.org/T206203) [11:42:10] (03CR) 10Jcrespo: [C: 03+2] mariadb-backups: Setup snapshots of s1 stopping replication [puppet] - 10https://gerrit.wikimedia.org/r/508805 (https://phabricator.wikimedia.org/T206203) (owner: 10Jcrespo) [11:43:00] (03PS2) 10Jcrespo: mariadb-backups: Update transfer.py to 37c035ae06c54bbb79078c3d84 [puppet] - 10https://gerrit.wikimedia.org/r/508812 (https://phabricator.wikimedia.org/T206203) [11:43:39] (03CR) 10Jbond: [C: 03+1] "DNS side of this looks good" [dns] - 10https://gerrit.wikimedia.org/r/485081 (https://phabricator.wikimedia.org/T211254) (owner: 10Ayounsi) [11:44:21] (03CR) 10Jcrespo: [C: 03+2] mariadb-backups: Update transfer.py to 37c035ae06c54bbb79078c3d84 [puppet] - 10https://gerrit.wikimedia.org/r/508812 (https://phabricator.wikimedia.org/T206203) (owner: 10Jcrespo) [11:45:59] duesen: but yeah, we can't put it in wmf.4 :) [11:47:47] lets hope it passes ci in time. I forgot that we'd have to wait for this. annoying. [11:47:52] (03CR) 10Elukey: [C: 03+1] "LGTM but I'd wait also for Eric and/or Filippo's +1 before merging" [puppet] - 10https://gerrit.wikimedia.org/r/508809 (owner: 10Mathew.onipe) [11:48:15] kart_: merged! [11:48:24] duesen: I should've done that earlier. [11:48:49] PROBLEM - Check Varnish expiry mailbox lag on cp3039 is CRITICAL: CRITICAL: expiry mailbox lag is 2047146 https://wikitech.wikimedia.org/wiki/Varnish [11:49:00] duesen: cool. Let's get started. 
I'll ping once on mwdebug1002 [11:49:33] (03CR) 10Elukey: [C: 03+1] prometheus: enable metrics relabel (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/508809 (owner: 10Mathew.onipe) [11:51:19] PROBLEM - MariaDB Slave Lag: s3 on db2098 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 244513.39 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_slave [11:51:31] duesen: Please test on mwdebug1002 and let me know. [11:52:01] 10Operations, 10Mail, 10Patch-For-Review: Gmail - Multiple destination domains per transaction is unsupported. Please try again. - https://phabricator.wikimedia.org/T222198 (10jbond) 05Open→03Resolved a:03jbond Resolving as this looks fixed, i have checked the logs and the last error i see relating to... [11:52:02] checking db2098 [11:52:40] probably a downtime expired [11:53:10] kart_: localization strings seem to not have made it into the cache yet. i suppose that's normal [11:53:20] duesen: yep. [11:54:09] other than that, it works as expected [11:54:29] !log akosiaris@deploy1001 scap-helm cxserver upgrade -f cxserver-staging-values.yaml staging stable/cxserver [namespace: cxserver, clusters: staging] [11:54:30] !log akosiaris@deploy1001 scap-helm cxserver cluster staging completed [11:54:30] !log akosiaris@deploy1001 scap-helm cxserver finished [11:54:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:54:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:54:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:54:57] !log bump prometheus-statsd-exporter for cxserver to 0.0.5 T220709 [11:55:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:55:02] T220709: Upgrade statsd_exporter to 0.9 - https://phabricator.wikimedia.org/T220709 [11:55:19] duesen: OK. deploying. 
[11:55:43] yay for writing code to work around data corruption caused by a maintenance script run in 2005 :P [11:56:09] !log akosiaris@deploy1001 scap-helm cxserver upgrade -f cxserver-codfw-values.yaml staging stable/cxserver [namespace: cxserver, clusters: codfw] [11:56:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:56:15] !log akosiaris@deploy1001 scap-helm cxserver upgrade -f cxserver-codfw-values.yaml production stable/cxserver [namespace: cxserver, clusters: codfw] [11:56:17] !log akosiaris@deploy1001 scap-helm cxserver cluster codfw completed [11:56:17] !log akosiaris@deploy1001 scap-helm cxserver finished [11:56:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:56:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:56:20] godog: Do you have a second to talk about https://phabricator.wikimedia.org/T221774#5155621 ? tl;dr I need blazegraph_lastupdated from Prometheus in MediaWiki for all WDQS instances [11:56:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:58:18] duesen: wait a minute.. :P [11:59:05] PROBLEM - Request latencies on neon is CRITICAL: instance=10.64.0.40:6443 verb=CONNECT https://grafana.wikimedia.org/dashboard/db/kubernetes-api [11:59:28] duesen: now scap'ng.. [12:00:04] Deploy window Pre MediaWiki train sanity break (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190508T1200) [12:00:16] Operations, LDAP-Access-Requests, WMF-Legal, WMF-NDA-Requests: Request to be added to the ldap/wmde group - https://phabricator.wikimedia.org/T222788 (darthmon_wmde) [12:00:54] why scap is stuck? :/ [12:01:11] at: 11:59:13 Checking for new runtime errors locally [12:01:49] RECOVERY - Request latencies on neon is OK: All metrics within thresholds. 
https://grafana.wikimedia.org/dashboard/db/kubernetes-api [12:02:49] thcipriani: I'm running: scap sync-file php-1.34.0-wmf.3 'SWAT: [[gerrit:508559|Log warning and show error on empty username (T222529)]]' [12:02:50] T222529: Wikimedia\Assert\ParameterAssertionException when rendering a log snippet and log_user_text is empty - https://phabricator.wikimedia.org/T222529 [12:02:56] eh. [12:03:20] Anything wrong with this? [12:03:45] Looks like slow.. [12:05:42] mobrovac: can you help kart_? I don't know squat about scap ;) [12:06:06] kart_: yeah, that may fail...not sure...it's linting everything under php-1.34.0-wmf.3. It has l10n, looks like you need a full scap sync [12:06:26] oh you're doing a full sync [12:06:27] sigh [12:06:43] !log kartik@deploy1001 Synchronized php-1.34.0-wmf.3: SWAT: [[gerrit:508559|Log warning and show error on empty username (T222529)]] (duration: 07m 29s) [12:06:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:06:58] mobrovac: yep. [12:07:08] thcipriani: yep. [12:07:13] kart_: scap sync-dir is your frined in the future :) [12:07:18] Sorry for pinging early in the morning. [12:07:18] *friend [12:07:25] mobrovac: there isn't scap-dir :) [12:07:33] right sync-file [12:07:35] but it does dirs too [12:07:45] !log EU-Midday SWAT done. [12:07:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:08:02] trivia: sync-file and sync-dir are the same command under the hood [12:08:10] :D [12:08:16] Noted again. [12:09:00] duesen: Thanks for cycling and running with SWAT member. [12:09:49] (PS34) Vgutierrez: trafficserver: Provide a TLS terminator profile and backend+TLS role [puppet] - https://gerrit.wikimedia.org/r/506398 (https://phabricator.wikimedia.org/T221594) [12:09:59] kart_: thanks for deploying the fix! 
[12:10:24] (CR) jerkins-bot: [V: -1] trafficserver: Provide a TLS terminator profile and backend+TLS role [puppet] - https://gerrit.wikimedia.org/r/506398 (https://phabricator.wikimedia.org/T221594) (owner: Vgutierrez) [12:11:48] (CR) Jgreen: [C: +1] apache-fast-test: accept a literal - in host names [puppet] - https://gerrit.wikimedia.org/r/508701 (owner: Dzahn) [12:12:13] (PS2) Elukey: profile::mediawiki::nutcracker: make memcached configuration optional [puppet] - https://gerrit.wikimedia.org/r/504831 (https://phabricator.wikimedia.org/T214275) [12:12:26] (PS7) Vgutierrez: prometheus: Support several instances of the trafficserver exporter [puppet] - https://gerrit.wikimedia.org/r/506659 (https://phabricator.wikimedia.org/T221217) [12:12:28] (PS4) Vgutierrez: nagios_common: Provide check_https_hostheader_port_url check [puppet] - https://gerrit.wikimedia.org/r/507006 (https://phabricator.wikimedia.org/T221594) [12:12:30] (PS3) Vgutierrez: prometheus: Toggle SSL certificate verification for trafficserver-exporter [puppet] - https://gerrit.wikimedia.org/r/508327 (https://phabricator.wikimedia.org/T221217) [12:12:32] (PS9) Vgutierrez: trafficserver: Provide a unified monitoring define [puppet] - https://gerrit.wikimedia.org/r/506986 (https://phabricator.wikimedia.org/T221217) [12:12:34] (PS35) Vgutierrez: trafficserver: Provide a TLS terminator profile and backend+TLS role [puppet] - https://gerrit.wikimedia.org/r/506398 (https://phabricator.wikimedia.org/T221594) [12:13:25] (CR) Elukey: [C: +2] profile::mediawiki::nutcracker: make memcached configuration optional [puppet] - https://gerrit.wikimedia.org/r/504831 (https://phabricator.wikimedia.org/T214275) (owner: Elukey) [12:14:51] Operations, Puppet, Icinga, observability, Patch-For-Review: Puppet failing without Icinga alert in case of dependency cycle - https://phabricator.wikimedia.org/T221784 (jbond) Are there any more actions 
required here or can we close this ticket? [12:15:29] duesen: Fatal error: entire web request took longer than 60 seconds and timed out in /srv/mediawiki/php-1.34.0-wmf.3/includes/Linker.php on line 1738 - Is this related? [12:16:23] Operations, LDAP-Access-Requests, WMF-Legal, WMF-NDA-Requests: Request to be added to the ldap/wmde group - https://phabricator.wikimedia.org/T222788 (RazShuty) [12:16:33] seems decreasing.. [12:16:45] PROBLEM - MediaWiki exceptions and fatals per minute on graphite1004 is CRITICAL: CRITICAL: 70.00% of data above the critical threshold [50.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=2&fullscreen [12:16:59] Operations, LDAP-Access-Requests, WMF-Legal, WMF-NDA-Requests: Request to be added to the ldap/wmde group - https://phabricator.wikimedia.org/T222788 (RazShuty) I support this request, @darthmon_wmde is the new Engineering Manager of the Wikidata team. [12:19:27] RECOVERY - MediaWiki exceptions and fatals per minute on graphite1004 is OK: OK: Less than 70.00% above the threshold [25.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=2&fullscreen [12:19:33] kart_: it seems passed, but do we know why this happened? [12:20:36] elukey: I deployed: https://gerrit.wikimedia.org/r/#/c/mediawiki/core/+/508559/ during SWAT. It was on peak during deployment. [12:20:49] duesen can have more idea. [12:28:52] Operations, Puppet, Icinga, observability, Patch-For-Review: Puppet failing without Icinga alert in case of dependency cycle - https://phabricator.wikimedia.org/T221784 (Volans) I think I'll change the error message as it's a bit too specific given that the same error can happen under differe... 
[12:30:11] RECOVERY - Check Varnish expiry mailbox lag on cp3039 is OK: OK: expiry mailbox lag is 0 https://wikitech.wikimedia.org/wiki/Varnish [12:30:55] Operations, DNS, Mail, Traffic, Patch-For-Review: Phabricator SPF record contains internal addressing for phab[12]001 - https://phabricator.wikimedia.org/T221288 (jbond) Is there any further action for this ticket or can we close it? [12:37:14] (PS1) Muehlenhoff: Bump meta package for new ABI in 4.9.168 for jessie Drop meta package fr Linux 4.14, we didn't need it in the end [debs/linux-meta] - https://gerrit.wikimedia.org/r/508815 [12:37:27] (CR) Jbond: "LGTM" [dns] - https://gerrit.wikimedia.org/r/504241 (https://phabricator.wikimedia.org/T220786) (owner: Vgutierrez) [12:37:56] (CR) Jbond: [C: +1] "LGTM" [dns] - https://gerrit.wikimedia.org/r/504242 (https://phabricator.wikimedia.org/T220786) (owner: Vgutierrez) [12:38:18] (CR) Jbond: [C: +1] "LGTM" [dns] - https://gerrit.wikimedia.org/r/504244 (https://phabricator.wikimedia.org/T220786) (owner: Vgutierrez) [12:38:29] oh nice :D [12:39:22] jbond42: this one's related to those CRs and it's been waiting for a long time to be merged: https://gerrit.wikimedia.org/r/c/operations/dns/+/283870 [12:40:22] (CR) Jbond: [C: +1] "LGTM - with all theses spf changes im sure they could be locked down further however they are still an improvement and looks safe to me" [dns] - https://gerrit.wikimedia.org/r/283870 (https://phabricator.wikimedia.org/T220786) (owner: Mschon) [12:40:35] vgutierrez: i just got to that one :) [12:40:42] awesome [12:41:31] Operations, serviceops: SRE FY2019 Q4 goal: complete the transition to PHP7 - https://phabricator.wikimedia.org/T219127 (Joe) [12:45:15] (CR) CDanis: check_prometheus: allow non-grafana links in $dashboard_links (1 comment) [puppet] - https://gerrit.wikimedia.org/r/508748 (owner: CDanis) [12:46:00] (CR) CDanis: "given the ongoing discussion on I355dec29 
I'll remove the added runbook link and merge" [puppet] - https://gerrit.wikimedia.org/r/508745 (owner: CDanis) [12:46:47] Operations, serviceops: SRE FY2019 Q4 goal: complete the transition to PHP7 - https://phabricator.wikimedia.org/T219127 (Joe) [12:48:53] Operations, Patch-For-Review: parsoid-vd on scandium randomly died - https://phabricator.wikimedia.org/T219933 (jbond) did this issue get fixed? i just checked scandium and parsoid-vd has been running for 3 weeks ` root@scandium:~# systemctl status parsoid-vd ● parsoid-vd.service - parsoid-vd: Testredu... [12:49:56] Operations, serviceops: SRE FY2019 Q4 goal: complete the transition to PHP7 - https://phabricator.wikimedia.org/T219127 (Joe) [12:50:37] Operations, serviceops: SRE FY2019 Q4 goal: complete the transition to PHP7 - https://phabricator.wikimedia.org/T219127 (Joe) [12:50:44] Operations, MediaWiki-General-or-Unknown, serviceops, Core Platform Team (PHP7 (TEC4)), and 4 others: Some pages will become completely unreachable after PHP7 update due to Unicode changes - https://phabricator.wikimedia.org/T219279 (Joe) [12:50:46] (PS3) CDanis: prometheus uptime alert: fix scraping of 'global' instance, plus more [puppet] - https://gerrit.wikimedia.org/r/508745 [12:51:12] Operations, serviceops: SRE FY2019 Q4 goal: complete the transition to PHP7 - https://phabricator.wikimedia.org/T219127 (Joe) [12:53:10] (CR) CDanis: "PCC looks good: https://puppet-compiler.wmflabs.org/compiler1002/16424/" [puppet] - https://gerrit.wikimedia.org/r/508745 (owner: CDanis) [12:54:44] (PS4) CDanis: prometheus uptime alert: fix scraping of 'global' instance, plus more [puppet] - https://gerrit.wikimedia.org/r/508745 [12:55:38] (CR) CDanis: [C: +2] prometheus uptime alert: fix scraping of 'global' instance, plus more [puppet] - https://gerrit.wikimedia.org/r/508745 (owner: CDanis) [12:56:21] hoo: I can't ATM but will take a look today/tomorrow [12:56:39] 
Thanks :) [12:56:43] (PS1) Elukey: Remove mediawiki's nutcracker config from deployment-prep [puppet] - https://gerrit.wikimedia.org/r/508817 (https://phabricator.wikimedia.org/T214275) [12:56:51] (PS1) Petar.petkovic: Decrease idwiki MT threshold for publishing [mediawiki-config] - https://gerrit.wikimedia.org/r/508818 (https://phabricator.wikimedia.org/T222782) [12:57:21] PROBLEM - Upload HTTP 5xx reqs/min on graphite1004 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=All&var-cache_type=upload&var-status_type=5 [12:58:05] PROBLEM - Esams HTTP 5xx reqs/min on graphite1004 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=esams&var-cache_type=All&var-status_type=5 [12:58:21] (CR) Volans: [C: -1] "reply inline" (1 comment) [puppet] - https://gerrit.wikimedia.org/r/508748 (owner: CDanis) [12:58:24] single spike, seems going down [13:00:05] Deploy window MediaWiki train - European version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190508T1300) [13:09:18] (PS2) Mathew.onipe: prometheus: enable metrics relabel [puppet] - https://gerrit.wikimedia.org/r/508809 [13:13:55] Operations, DNS, Mail, Traffic, Patch-For-Review: Phabricator SPF record contains internal addressing for phab[12]001 - https://phabricator.wikimedia.org/T221288 (herron) Open→Resolved a:herron Ready to resolve afaict! 
[13:14:33] (CR) Mathew.onipe: prometheus: enable metrics relabel (1 comment) [puppet] - https://gerrit.wikimedia.org/r/508809 (owner: Mathew.onipe) [13:14:44] Operations, DNS, Mail, Traffic, Patch-For-Review: Phabricator SPF record contains internal addressing for phab[12]001 - https://phabricator.wikimedia.org/T221288 (GedHaywood) Do those IPv6 addresses actually send any mail? If not they can be deleted. [13:14:57] (PS1) Jbond: exim mailman: disable email subscriptions [puppet] - https://gerrit.wikimedia.org/r/508820 (https://phabricator.wikimedia.org/T219107) [13:18:28] (CR) Volans: "I agree with the concept but I was thinking if we should invert the thing and move those to system::* so that they are in the same directo" [puppet] - https://gerrit.wikimedia.org/r/508671 (https://phabricator.wikimedia.org/T222352) (owner: RobH) [13:18:38] !log cp3035: restart varnish-be [13:18:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:19:18] (CR) CDanis: "reply inline but if this is gonna be more complicated I need to put it aside for now" (1 comment) [puppet] - https://gerrit.wikimedia.org/r/508748 (owner: CDanis) [13:21:11] (PS4) Marostegui: db-eqiad,db-codfw.php: Change second parsercache key [mediawiki-config] - https://gerrit.wikimedia.org/r/508170 (https://phabricator.wikimedia.org/T210725) [13:21:41] Operations, DNS, Mail, Traffic, Patch-For-Review: Phabricator SPF record contains internal addressing for phab[12]001 - https://phabricator.wikimedia.org/T221288 (herron) >>! In T221288#5167063, @GedHaywood wrote: > Do those IPv6 addresses actually send any mail? Yes, these are the IPv6 addr... 
[13:23:06] RECOVERY - Check Varnish expiry mailbox lag on cp3035 is OK: OK: expiry mailbox lag is 0 https://wikitech.wikimedia.org/wiki/Varnish [13:29:14] RECOVERY - Upload HTTP 5xx reqs/min on graphite1004 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=All&var-cache_type=upload&var-status_type=5 [13:30:52] RECOVERY - Esams HTTP 5xx reqs/min on graphite1004 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=esams&var-cache_type=All&var-status_type=5 [13:37:54] (PS1) Gehel: maps: upgrade to nodejs10 [puppet] - https://gerrit.wikimedia.org/r/508822 (https://phabricator.wikimedia.org/T210704) [13:41:03] (PS1) Papaul: DNS: Remove mgmt DNS for osm-db200[12] and osm-web200[1234] [dns] - https://gerrit.wikimedia.org/r/508823 [13:43:13] Operations, Wikimedia-Mailing-lists: Please create a private mailing traffic-anomaly-report - https://phabricator.wikimedia.org/T222794 (ssingh) [13:43:33] Operations, Wikimedia-Mailing-lists: Please create a private mailing list traffic-anomaly-report - https://phabricator.wikimedia.org/T222794 (ssingh) [13:45:08] (PS1) Giuseppe Lavagetto: mediawiki::web::prod_sites: add the rewrite rules for blankpage [puppet] - https://gerrit.wikimedia.org/r/508826 (https://phabricator.wikimedia.org/T222705) [13:45:11] (PS1) Giuseppe Lavagetto: lvs: test php7 on enwiki as well [puppet] - https://gerrit.wikimedia.org/r/508827 (https://phabricator.wikimedia.org/T222705) [13:47:29] (CR) Vgutierrez: [C: +2] prometheus: Support several instances of the trafficserver exporter [puppet] - https://gerrit.wikimedia.org/r/506659 (https://phabricator.wikimedia.org/T221217) (owner: Vgutierrez) [13:47:31] (CR) Giuseppe Lavagetto: 
"https://puppet-compiler.wmflabs.org/compiler1001/16425/mw1261.eqiad.wmnet/" [puppet] - https://gerrit.wikimedia.org/r/508826 (https://phabricator.wikimedia.org/T222705) (owner: Giuseppe Lavagetto) [13:47:38] (PS8) Vgutierrez: prometheus: Support several instances of the trafficserver exporter [puppet] - https://gerrit.wikimedia.org/r/506659 (https://phabricator.wikimedia.org/T221217) [13:49:52] (CR) Giuseppe Lavagetto: "The change was correct last time, the problem was that the compiler was using this file while production was using the hiera3.yaml version" [puppet] - https://gerrit.wikimedia.org/r/506167 (owner: Jbond) [13:50:04] !log otto@deploy1001 scap-helm eventgate-main install -n main -f main/staging-values.yaml stable/eventgate [namespace: eventgate-main, clusters: staging] [13:50:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:50:48] Operations, ops-codfw, DBA: db2114 hardware problem - https://phabricator.wikimedia.org/T222753 (Reedy) [13:50:56] (CR) Giuseppe Lavagetto: [C: +1] "but this merits a full puppet compiler run." 
[puppet] - https://gerrit.wikimedia.org/r/506167 (owner: Jbond) [13:57:33] Operations, DNS, Mail, Traffic, Patch-For-Review: wiki-mail DKIM failing - https://phabricator.wikimedia.org/T221290 (herron) Open→Resolved [13:59:16] (CR) Effie Mouzeli: "LGTM I think an example in the commit message would help and a comment in the config file as well" [puppet] - https://gerrit.wikimedia.org/r/508826 (https://phabricator.wikimedia.org/T222705) (owner: Giuseppe Lavagetto) [14:00:51] Operations, serviceops: SRE FY2019 Q4 goal: complete the transition to PHP7 - https://phabricator.wikimedia.org/T219127 (Krinkle) [14:01:01] Operations, PHP 7.2 support, Patch-For-Review, Wikimedia-production-error: Investigate why a string literal changed in opcache (Fatal exception of type "ConfigException") - https://phabricator.wikimedia.org/T221347 (Krinkle) [14:01:09] Operations, PHP 7.2 support, Patch-For-Review, Wikimedia-production-error: Investigate why a string literal changed in opcache (Fatal exception of type "ConfigException") - https://phabricator.wikimedia.org/T221347 (Krinkle) [14:01:51] Operations, PHP 7.2 support, Patch-For-Review, Wikimedia-production-error: PHP7 opcache sometimes corrupts (was: Fatal ConfigException, undefined InitialiseSettings variable) - https://phabricator.wikimedia.org/T221347 (Krinkle) [14:02:17] (PS2) Giuseppe Lavagetto: mediawiki::web::prod_sites: add the rewrite rules for blankpage [puppet] - https://gerrit.wikimedia.org/r/508826 (https://phabricator.wikimedia.org/T222705) [14:02:19] (PS2) Giuseppe Lavagetto: lvs: test php7 on enwiki as well [puppet] - https://gerrit.wikimedia.org/r/508827 (https://phabricator.wikimedia.org/T222705) [14:02:34] (CR) MSantos: [C: +1] maps: upgrade to nodejs10 [puppet] - https://gerrit.wikimedia.org/r/508822 (https://phabricator.wikimedia.org/T210704) (owner: Gehel) [14:02:52] Operations, PHP 7.2 support, Patch-For-Review, 
Wikimedia-production-error: PHP7 opcache sometimes corrupts when cleared (was: Fatal ConfigException, undefined InitialiseSettings variable) - https://phabricator.wikimedia.org/T221347 (Joe) [14:03:26] !log starting upgrade to nodejs 10 for maps - T210704 [14:03:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:03:30] T210704: Migrate node-based services in production to node10 - https://phabricator.wikimedia.org/T210704 [14:03:49] (CR) Herron: "Nice catch! One question inline" (1 comment) [puppet] - https://gerrit.wikimedia.org/r/508820 (https://phabricator.wikimedia.org/T219107) (owner: Jbond) [14:04:21] (CR) Gehel: [C: +2] maps: upgrade to nodejs10 [puppet] - https://gerrit.wikimedia.org/r/508822 (https://phabricator.wikimedia.org/T210704) (owner: Gehel) [14:04:31] (PS2) Gehel: maps: upgrade to nodejs10 [puppet] - https://gerrit.wikimedia.org/r/508822 (https://phabricator.wikimedia.org/T210704) [14:04:39] (CR) Herron: [C: +1] Add SPF record to toolserver.org [dns] - https://gerrit.wikimedia.org/r/283870 (https://phabricator.wikimedia.org/T220786) (owner: Mschon) [14:05:01] !log fsero@deploy1001 scap-helm eventgate-main install -n main -f main/staging-values.yaml stable/eventgate [namespace: eventgate-main, clusters: staging] [14:05:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:05:05] Operations, Services, service-runner, serviceops: Re-evaluate service-runner's (ab) of statsd timing metric for nodejs GC stats - https://phabricator.wikimedia.org/T222795 (akosiaris) [14:05:13] Operations, Services, service-runner, serviceops: Re-evaluate service-runner's (ab) of statsd timing metric for nodejs GC stats - https://phabricator.wikimedia.org/T222795 (akosiaris) p:Triage→Low [14:05:44] (CR) Herron: [C: +1] Gerrit: Configure logging in json to gerrit.json [puppet] - https://gerrit.wikimedia.org/r/508391 
(https://phabricator.wikimedia.org/T141324) (owner: Paladox) [14:06:53] (CR) Effie Mouzeli: [C: +1] mediawiki::web::prod_sites: add the rewrite rules for blankpage [puppet] - https://gerrit.wikimedia.org/r/508826 (https://phabricator.wikimedia.org/T222705) (owner: Giuseppe Lavagetto) [14:07:02] (CR) Jbond: "thanks for the review comment inline" (1 comment) [puppet] - https://gerrit.wikimedia.org/r/508820 (https://phabricator.wikimedia.org/T219107) (owner: Jbond) [14:08:04] Operations, User-fgiunchedi, Wikimedia-production-error: PHP Fatal error: The UdpSocket to 127.0.0.1:10514 has been closed (from Monolog/SyslogUdp on mwdebug1002) - https://phabricator.wikimedia.org/T214734 (Krinkle) [14:08:07] Operations, PHP 7.2 support, Patch-For-Review, Wikimedia-production-error: PHP7 opcache sometimes corrupts when cleared (was: Fatal ConfigException, undefined InitialiseSettings variable) - https://phabricator.wikimedia.org/T221347 (Joe) >>! In T221347#5165706, @Krinkle wrote: > Today's incident... [14:08:19] Operations, Core Platform Team (PHP7 (TEC4)), Core Platform Team Kanban (Doing), HHVM, and 3 others: Migrate to PHP 7 in WMF production - https://phabricator.wikimedia.org/T176370 (Krinkle) [14:08:22] (CR) Herron: [C: +1] "Looks good!" 
[puppet] - https://gerrit.wikimedia.org/r/508820 (https://phabricator.wikimedia.org/T219107) (owner: Jbond) [14:11:32] (CR) Giuseppe Lavagetto: [C: +2] mediawiki::web::prod_sites: add the rewrite rules for blankpage [puppet] - https://gerrit.wikimedia.org/r/508826 (https://phabricator.wikimedia.org/T222705) (owner: Giuseppe Lavagetto) [14:11:38] PROBLEM - Request latencies on neon is CRITICAL: instance=10.64.0.40:6443 verb=CONNECT https://grafana.wikimedia.org/dashboard/db/kubernetes-api [14:11:42] (PS3) Giuseppe Lavagetto: mediawiki::web::prod_sites: add the rewrite rules for blankpage [puppet] - https://gerrit.wikimedia.org/r/508826 (https://phabricator.wikimedia.org/T222705) [14:12:53] <_joe_> damn jenkins [14:13:38] RECOVERY - Request latencies on neon is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/kubernetes-api [14:14:07] !log mbsantos@deploy1001 Started deploy [kartotherian/deploy@7774721] (stretch): Deploy kartotherian node 10 build into maps2001 (T215852) [14:14:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:14:12] T215852: Migrate Kartotherian/Tilerator to Node 10 - https://phabricator.wikimedia.org/T215852 [14:14:34] !log mbsantos@deploy1001 Finished deploy [kartotherian/deploy@7774721] (stretch): Deploy kartotherian node 10 build into maps2001 (T215852) (duration: 00m 27s) [14:14:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:16:52] Operations, PHP 7.2 support, Patch-For-Review, Wikimedia-production-error: PHP7 opcache sometimes corrupts when cleared (was: Fatal ConfigException, undefined InitialiseSettings variable) - https://phabricator.wikimedia.org/T221347 (Krinkle) > Pybal detects a server spitting errors way faster tha... 
[14:17:44] !log mbsantos@deploy1001 Started deploy [tilerator/deploy@2736a69] (stretch): Deploy tilerator node 10 build into maps2001 (T215852) [14:17:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:18:12] !log mbsantos@deploy1001 Finished deploy [tilerator/deploy@2736a69] (stretch): Deploy tilerator node 10 build into maps2001 (T215852) (duration: 00m 27s) [14:18:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:21:04] Operations, serviceops: SRE FY2019 Q4 goal: complete the transition to PHP7 - https://phabricator.wikimedia.org/T219127 (Krinkle) [14:21:43] !log otto@deploy1001 scap-helm eventgate-main install -n main -f main/staging-values.yaml stable/eventgate [namespace: eventgate-main, clusters: staging] [14:21:44] !log otto@deploy1001 scap-helm eventgate-main cluster staging completed [14:21:44] !log otto@deploy1001 scap-helm eventgate-main finished [14:21:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:21:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:21:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:22:13] Operations, Core Platform Team Kanban, MediaWiki-General-or-Unknown, serviceops, and 4 others: Some pages will become completely unreachable after PHP7 update due to Unicode changes - https://phabricator.wikimedia.org/T219279 (kchapman) CPT needs to review regarding long term fixes for this. [14:25:04] Operations, DNS, Mail, Traffic, Patch-For-Review: Phabricator SPF record contains internal addressing for phab[12]001 - https://phabricator.wikimedia.org/T221288 (GedHaywood) SPF works on the information in the SMTP envelope. If mx1001.wikimedia.org has only the IP addresses 208.80.154.76 an... 
[14:25:15] (CR) Fsero: [C: +1] "would be slightly more clear if the yamls are moved into tiller folder but besides that LGTM" [deployment-charts] - https://gerrit.wikimedia.org/r/492269 (owner: Alexandros Kosiaris) [14:26:06] PROBLEM - Request latencies on neon is CRITICAL: instance=10.64.0.40:6443 verb=CONNECT https://grafana.wikimedia.org/dashboard/db/kubernetes-api [14:26:24] PROBLEM - puppet last run on maps2001 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 4 minutes ago with 1 failures. Failed resources (up to 3 shown): Package[nodejs-legacy] [14:26:51] ^ looking [14:28:38] RECOVERY - Request latencies on neon is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/kubernetes-api [14:33:01] (PS1) Gehel: maps: remove nodejs-legacy [puppet] - https://gerrit.wikimedia.org/r/508835 (https://phabricator.wikimedia.org/T214153) [14:33:04] Operations, Patch-For-Review: parsoid-vd on scandium randomly died - https://phabricator.wikimedia.org/T219933 (Dzahn) Open→Resolved a:Dzahn >>! In T219933#5167021, @jbond wrote: > did this issue get fixed? i just checked scandium and parsoid-vd has been running for 3 weeks Thanks! Based o... [14:33:59] (CR) MSantos: [C: +1] "LGTM." [puppet] - https://gerrit.wikimedia.org/r/508835 (https://phabricator.wikimedia.org/T214153) (owner: Gehel) [14:34:43] (CR) Gehel: [C: +2] maps: remove nodejs-legacy [puppet] - https://gerrit.wikimedia.org/r/508835 (https://phabricator.wikimedia.org/T214153) (owner: Gehel) [14:34:45] Operations, Patch-For-Review: parsoid-vd on scandium randomly died - https://phabricator.wikimedia.org/T219933 (ssastry) >>! In T219933#5167290, @Dzahn wrote: >>>! In T219933#5167021, @jbond wrote: >> did this issue get fixed? i just checked scandium and parsoid-vd has been running for 3 weeks > > Than... 
[14:35:01] (CR) Alexandros Kosiaris: [C: +1] Change eventgate-analytics LVS port to 33192 [puppet] - https://gerrit.wikimedia.org/r/508582 (https://phabricator.wikimedia.org/T218346) (owner: Ottomata) [14:35:05] (PS11) Ayounsi: Icinga: Add OSPF check to routers [puppet] - https://gerrit.wikimedia.org/r/496873 (https://phabricator.wikimedia.org/T83992) [14:35:55] (CR) Ayounsi: Icinga: Add OSPF check to routers (3 comments) [puppet] - https://gerrit.wikimedia.org/r/496873 (https://phabricator.wikimedia.org/T83992) (owner: Ayounsi) [14:37:28] (PS5) Mathew.onipe: wdqs: add WDQS restart cookbook [cookbooks] - https://gerrit.wikimedia.org/r/507347 (https://phabricator.wikimedia.org/T221832) [14:38:24] (CR) Mathew.onipe: wdqs: add WDQS restart cookbook (2 comments) [cookbooks] - https://gerrit.wikimedia.org/r/507347 (https://phabricator.wikimedia.org/T221832) (owner: Mathew.onipe) [14:39:04] (PS1) BBlack: cache_upload FB ratelimiter: increase by ~3x [puppet] - https://gerrit.wikimedia.org/r/508838 (https://phabricator.wikimedia.org/T192688) [14:39:23] (CR) jerkins-bot: [V: -1] wdqs: add WDQS restart cookbook [cookbooks] - https://gerrit.wikimedia.org/r/507347 (https://phabricator.wikimedia.org/T221832) (owner: Mathew.onipe) [14:40:11] (CR) Volans: [C: -1] "Just a minor typo to fix." 
(1 comment) [puppet] - https://gerrit.wikimedia.org/r/496873 (https://phabricator.wikimedia.org/T83992) (owner: Ayounsi) [14:41:35] (PS6) Mathew.onipe: wdqs: add WDQS restart cookbook [cookbooks] - https://gerrit.wikimedia.org/r/507347 (https://phabricator.wikimedia.org/T221832) [14:42:46] (PS12) Ayounsi: Icinga: Add OSPF check to routers [puppet] - https://gerrit.wikimedia.org/r/496873 (https://phabricator.wikimedia.org/T83992) [14:43:08] (CR) Ayounsi: Icinga: Add OSPF check to routers (1 comment) [puppet] - https://gerrit.wikimedia.org/r/496873 (https://phabricator.wikimedia.org/T83992) (owner: Ayounsi) [14:43:35] (CR) jerkins-bot: [V: -1] wdqs: add WDQS restart cookbook [cookbooks] - https://gerrit.wikimedia.org/r/507347 (https://phabricator.wikimedia.org/T221832) (owner: Mathew.onipe) [14:46:47] (CR) Volans: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/496873 (https://phabricator.wikimedia.org/T83992) (owner: Ayounsi) [14:47:22] (CR) CRusnov: Gerrit: Set plugin.javamelody.prometheusBearerToken (1 comment) [puppet] - https://gerrit.wikimedia.org/r/508621 (https://phabricator.wikimedia.org/T184086) (owner: Paladox) [14:47:34] (PS13) Ayounsi: Icinga: Add OSPF check to routers [puppet] - https://gerrit.wikimedia.org/r/496873 (https://phabricator.wikimedia.org/T83992) [14:47:53] (CR) Volans: [C: +1] "LGTM, thanks! 
:)" [puppet] - https://gerrit.wikimedia.org/r/496873 (https://phabricator.wikimedia.org/T83992) (owner: Ayounsi) [14:48:14] (CR) Jbond: "PCC https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/16427/console (still running)" [puppet] - https://gerrit.wikimedia.org/r/506167 (owner: Jbond) [14:48:16] (PS13) Paladox: Gerrit: Set plugin.javamelody.prometheusBearerToken [puppet] - https://gerrit.wikimedia.org/r/508621 (https://phabricator.wikimedia.org/T184086) [14:48:32] (CR) Paladox: Gerrit: Set plugin.javamelody.prometheusBearerToken (1 comment) [puppet] - https://gerrit.wikimedia.org/r/508621 (https://phabricator.wikimedia.org/T184086) (owner: Paladox) [14:49:09] (CR) Ayounsi: [C: +2] Icinga: Add OSPF check to routers [puppet] - https://gerrit.wikimedia.org/r/496873 (https://phabricator.wikimedia.org/T83992) (owner: Ayounsi) [14:49:23] (PS14) Ayounsi: Icinga: Add OSPF check to routers [puppet] - https://gerrit.wikimedia.org/r/496873 (https://phabricator.wikimedia.org/T83992) [14:50:09] I'm about to merge an Icinga check for OSPF, if you see any OSPF related alerts please ignore them [14:50:10] (CR) Jbond: "> Patch Set 1:" [puppet] - https://gerrit.wikimedia.org/r/508671 (https://phabricator.wikimedia.org/T222352) (owner: RobH) [14:50:52] (PS2) Jbond: exim mailman: disable email subscriptions [puppet] - https://gerrit.wikimedia.org/r/508820 (https://phabricator.wikimedia.org/T219107) [14:51:58] Operations, Phabricator, serviceops, Patch-For-Review, Release-Engineering-Team (Watching / External): Reimage both phab1001 and phab2001 to stretch - https://phabricator.wikimedia.org/T190568 (Dzahn) [14:52:01] (CR) Jbond: [C: +2] exim mailman: disable email subscriptions [puppet] - https://gerrit.wikimedia.org/r/508820 (https://phabricator.wikimedia.org/T219107) (owner: Jbond) [14:52:36] RECOVERY - puppet last run on maps2001 is OK: OK: Puppet is currently enabled, last run 
41 seconds ago with 0 failures [14:53:57] 10Operations, 10ops-codfw, 10Reading-Infrastructure-Team-Backlog, 10decommission: Decommission maps-test cluster - https://phabricator.wikimedia.org/T202898 (10Papaul) [14:57:45] ACKNOWLEDGEMENT - OSPF status on cr4-ulsfo is CRITICAL: OSPFv2: 4/5 UP : OSPFv3: 4/5 UP Ayounsi To be investigated https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [14:57:45] ACKNOWLEDGEMENT - OSPF status on mr1-ulsfo is CRITICAL: OSPFv2: 1/2 UP : OSPFv3: 1/2 UP Ayounsi To be investigated https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [14:58:08] 10Operations, 10netops, 10observability, 10Patch-For-Review: Juniper monitoring - https://phabricator.wikimedia.org/T83992 (10ayounsi) [14:58:23] 10Operations, 10netops, 10observability, 10Patch-For-Review: Juniper monitoring - https://phabricator.wikimedia.org/T83992 (10ayounsi) 05Open→03Resolved https://icinga.wikimedia.org/cgi-bin/icinga/status.cgi?search_string=ospf [15:00:13] (03CR) 10Mathew.onipe: "prospector fails on line 45 for py34 due to `*base_commands`" [cookbooks] - 10https://gerrit.wikimedia.org/r/507347 (https://phabricator.wikimedia.org/T221832) (owner: 10Mathew.onipe) [15:01:29] 10Operations, 10Services, 10service-runner, 10serviceops: Re-evaluate service-runner's (ab)use of statsd timing metric for nodejs GC stats - https://phabricator.wikimedia.org/T222795 (10akosiaris) [15:02:59] (03CR) 10Muehlenhoff: "I like the approach, this is similar to my older proposal in https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/477283/ but now with N" (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/508671 (https://phabricator.wikimedia.org/T222352) (owner: 10RobH) [15:04:08] (03CR) 10Muehlenhoff: splitting role::spare into staged and decomisssioning (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/508671 (https://phabricator.wikimedia.org/T222352) (owner: 10RobH) [15:04:20] 10Operations, 10puppet-compiler, 10Cloud-VPS (Quota-requests): 
Requesting quota increase for 'puppet-diffs' project - https://phabricator.wikimedia.org/T222800 (10herron) p:05Triage→03Normal [15:04:44] !log fix typo on asw2-ulsfo<->cr2-ulsfo interface (Xlink2 instead of Xlink1) [15:04:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:04:49] 10Operations, 10Core Platform Team Kanban, 10MediaWiki-General-or-Unknown, 10serviceops, and 4 others: Some pages will become completely unreachable after PHP7 update due to Unicode changes - https://phabricator.wikimedia.org/T219279 (10Anomie) The long-term plan, as I understand it, is that we'll run main... [15:05:40] 10Operations, 10ops-codfw, 10decommission: decommission: labtestservices2001.wikimedia.org - https://phabricator.wikimedia.org/T218022 (10Papaul) [15:05:52] (03CR) 10Volans: "> Patch Set 6:" [cookbooks] - 10https://gerrit.wikimedia.org/r/507347 (https://phabricator.wikimedia.org/T221832) (owner: 10Mathew.onipe) [15:06:23] allllll green https://icinga.wikimedia.org/cgi-bin/icinga/status.cgi?search_string=ospf [15:07:01] (03PS1) 10Gehel: Revert "maps: remove nodejs-legacy" [puppet] - 10https://gerrit.wikimedia.org/r/508840 [15:10:15] (03CR) 10Ayounsi: [C: 03+1] cache_upload FB ratelimiter: increase by ~3x [puppet] - 10https://gerrit.wikimedia.org/r/508838 (https://phabricator.wikimedia.org/T192688) (owner: 10BBlack) [15:12:16] (03PS1) 10Gehel: Revert "maps: remove nodejs-legacy" [puppet] - 10https://gerrit.wikimedia.org/r/508842 [15:12:53] 10Operations, 10ops-codfw, 10DBA: db2114 hardware problem - https://phabricator.wikimedia.org/T222753 (10jcrespo) Thanks. 
[15:14:31] !log installing rails security updates [15:14:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:17:18] (03PS1) 10Volans: tox: drop Py34 support add Py37 [cookbooks] - 10https://gerrit.wikimedia.org/r/508843 [15:17:21] onimisionipe: ^^^ for you :) [15:18:27] (03PS1) 10Marostegui: db2103,db2112,db2116: Enable notifications [puppet] - 10https://gerrit.wikimedia.org/r/508844 (https://phabricator.wikimedia.org/T222772) [15:18:40] (03CR) 10Marostegui: [C: 04-2] "Do not push yet" [puppet] - 10https://gerrit.wikimedia.org/r/508844 (https://phabricator.wikimedia.org/T222772) (owner: 10Marostegui) [15:19:26] !log installing ruby-i18n security updates [15:19:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:23:50] (03CR) 10Ema: [C: 03+1] nagios_common: Provide check_https_hostheader_port_url check [puppet] - 10https://gerrit.wikimedia.org/r/507006 (https://phabricator.wikimedia.org/T221594) (owner: 10Vgutierrez) [15:24:26] (03CR) 10Vgutierrez: [C: 03+2] nagios_common: Provide check_https_hostheader_port_url check [puppet] - 10https://gerrit.wikimedia.org/r/507006 (https://phabricator.wikimedia.org/T221594) (owner: 10Vgutierrez) [15:24:43] (03PS5) 10Vgutierrez: nagios_common: Provide check_https_hostheader_port_url check [puppet] - 10https://gerrit.wikimedia.org/r/507006 (https://phabricator.wikimedia.org/T221594) [15:27:30] (03CR) 10Alex Monk: "I can be around on Tuesday afternoon if needed. Not sure what difference I can make." 
[puppet] - 10https://gerrit.wikimedia.org/r/505407 (https://phabricator.wikimedia.org/T220894) (owner: 10Alex Monk) [15:27:56] PROBLEM - Text HTTP 5xx reqs/min on graphite1004 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=All&var-cache_type=text&var-status_type=5 [15:28:05] 10Operations, 10Analytics, 10Discovery, 10Research: Make hadoop cluster able to push to swift - https://phabricator.wikimedia.org/T219544 (10CDanis) Some quick notes from today's meeting: - elukey has a cloud-vps hadoop cluster for testing changes like this (although it is kind of flakey / needs poking/re... [15:29:56] (03PS14) 10Paladox: Gerrit: Set plugin.javamelody.prometheusBearerToken and setup prometheus [puppet] - 10https://gerrit.wikimedia.org/r/508621 (https://phabricator.wikimedia.org/T184086) [15:30:08] (03CR) 10CRusnov: "https://puppet-compiler.wmflabs.org/compiler1001/16429/" [puppet] - 10https://gerrit.wikimedia.org/r/508621 (https://phabricator.wikimedia.org/T184086) (owner: 10Paladox) [15:31:14] vgutierrez: shall I merge https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/508770/4/modules/interface/manifests/tagged.pp and set up a test? Seems like it's a no-op until I lower that flag on a particular host. [15:31:19] (thanks for writing that btw) [15:31:54] (03CR) 10Paladox: "Assigning to godog to review" [puppet] - 10https://gerrit.wikimedia.org/r/508621 (https://phabricator.wikimedia.org/T184086) (owner: 10Paladox) [15:31:56] andrewbogott: see https://gerrit.wikimedia.org/r/c/operations/puppet/+/508796/ as well [15:31:57] volans: Thanks! [15:32:16] heh, you're way ahead of me! I'lll try [15:32:53] _joe_: around? 
:) [15:33:10] andrewbogott: err as discussed with arturo, we were thinking in merging all of this on Monday [15:33:14] (03PS5) 10Andrew Bogott: Revert "lvs: Avoid tagged network interfaces to hit IFNAMSIZ (15+\0) limit" [puppet] - 10https://gerrit.wikimedia.org/r/508668 (owner: 10Vgutierrez) [15:33:22] as we are on holidays tomorrow and Friday [15:33:36] * arturo nods [15:33:41] vgutierrez, arturo, that's fine with me, we aren't dying for lack of that host. I just don't want to block anyone. [15:33:55] just playing on the safe side of things :) [15:34:34] ok! [15:36:10] 10Operations: Integrate Stretch 9.9 point update - https://phabricator.wikimedia.org/T222053 (10MoritzMuehlenhoff) [15:36:35] (03PS7) 10Mathew.onipe: wdqs: add WDQS restart cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/507347 (https://phabricator.wikimedia.org/T221832) [15:36:35] <_joe_> revi: what's your patch? [15:36:37] (03PS2) 10Muehlenhoff: Remove obsolete kmod class [puppet] - 10https://gerrit.wikimedia.org/r/508763 [15:36:45] <_joe_> I will review and someone in my team can merge it [15:36:57] (03PS2) 10Revi: Change kr.wikimedia.org destination [puppet] - 10https://gerrit.wikimedia.org/r/506895 (https://phabricator.wikimedia.org/T222033) [15:37:00] ^ [15:37:22] RECOVERY - Text HTTP 5xx reqs/min on graphite1004 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=All&var-cache_type=text&var-status_type=5 [15:38:13] (03CR) 10jerkins-bot: [V: 04-1] wdqs: add WDQS restart cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/507347 (https://phabricator.wikimedia.org/T221832) (owner: 10Mathew.onipe) [15:39:29] (03CR) 10Jcrespo: "> I can be around on Tuesday afternoon if needed. 
Not sure what" [puppet] - 10https://gerrit.wikimedia.org/r/505407 (https://phabricator.wikimedia.org/T220894) (owner: 10Alex Monk) [15:40:24] (03CR) 10Muehlenhoff: [C: 03+2] Remove obsolete kmod class [puppet] - 10https://gerrit.wikimedia.org/r/508763 (owner: 10Muehlenhoff) [15:40:59] !log mforns@deploy1001 Started deploy [analytics/refinery@698f213]: deploying analytics-refinery up to 698f2137aa965b07548ae7565aafaa784628b13c with source=v0.0.89 [15:41:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:42:27] (03PS4) 10Andrew Bogott: git-sync-upstream: Rebase on top of prod's copy of the puppet repo [puppet] - 10https://gerrit.wikimedia.org/r/504825 (https://phabricator.wikimedia.org/T219390) (owner: 10Alex Monk) [15:42:36] <_joe_> revi: uhm I can't see the unicode chars. strange [15:42:38] (03CR) 10Mathew.onipe: [C: 03+1] tox: drop Py34 support add Py37 [cookbooks] - 10https://gerrit.wikimedia.org/r/508843 (owner: 10Volans) [15:42:42] it's Korean [15:43:10] shows up fine to me https://usercontent.irccloud-cdn.com/file/LlkoVFKy/image.png [15:44:06] (03CR) 10Filippo Giunchedi: "See inline, LGTM overall" (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/508809 (owner: 10Mathew.onipe) [15:44:30] (03CR) 10Giuseppe Lavagetto: [C: 03+1] "LGTM." [puppet] - 10https://gerrit.wikimedia.org/r/506895 (https://phabricator.wikimedia.org/T222033) (owner: 10Revi) [15:47:04] (03PS1) 10Jcrespo: mariadb: Productionize db2117 as an s6 replica [puppet] - 10https://gerrit.wikimedia.org/r/508852 (https://phabricator.wikimedia.org/T222772) [15:47:28] (03PS1) 10Andrew Bogott: Revert "no-op README patch for testing" [labs/private] - 10https://gerrit.wikimedia.org/r/508853 [15:47:30] PROBLEM - etcd request latencies on argon is CRITICAL: instance=10.64.32.133:6443 operation={compareAndSwap,create,get,list} https://grafana.wikimedia.org/dashboard/db/kubernetes-api [15:47:39] (03CR) 10Filippo Giunchedi: [C: 04-1] "Good start, see inline." 
(031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/508621 (https://phabricator.wikimedia.org/T184086) (owner: 10Paladox) [15:47:46] (03CR) 10Andrew Bogott: [V: 03+2 C: 03+2] Revert "no-op README patch for testing" [labs/private] - 10https://gerrit.wikimedia.org/r/508853 (owner: 10Andrew Bogott) [15:48:28] (03CR) 10Marostegui: [C: 03+1] mariadb: Productionize db2117 as an s6 replica [puppet] - 10https://gerrit.wikimedia.org/r/508852 (https://phabricator.wikimedia.org/T222772) (owner: 10Jcrespo) [15:49:08] ottomata: around? I wanted to chat re: https://gerrit.wikimedia.org/r/c/operations/debs/prometheus-varnishkafka-exporter/+/507632 could you join #wikimedia-sre ? [15:49:22] PROBLEM - Request latencies on argon is CRITICAL: instance=10.64.32.133:6443 verb=PATCH https://grafana.wikimedia.org/dashboard/db/kubernetes-api [15:49:40] PROBLEM - proton endpoints health on proton1001 is CRITICAL: /{domain}/v1/pdf/{title}/{format}/{type} (Print the Foo page from en.wp.org in letter format) timed out before a response was received: /{domain}/v1/pdf/{title}/{format}/{type} (Print the Bar page from en.wp.org in A4 format using optimized for reading on mobile devices) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/pro [15:49:48] (03PS1) 10Muehlenhoff: httpd::mod_conf: Remove support for Ubuntu [puppet] - 10https://gerrit.wikimedia.org/r/508854 [15:51:02] (03PS5) 10Andrew Bogott: git-sync-upstream: Rebase on top of prod's copy of the puppet repo [puppet] - 10https://gerrit.wikimedia.org/r/504825 (https://phabricator.wikimedia.org/T219390) (owner: 10Alex Monk) [15:52:00] PROBLEM - Request latencies on neon is CRITICAL: instance=10.64.0.40:6443 verb={PATCH,PUT} https://grafana.wikimedia.org/dashboard/db/kubernetes-api [15:52:02] PROBLEM - etcd request latencies on chlorine is CRITICAL: instance=10.64.0.45:6443 operation={compareAndSwap,get,list} https://grafana.wikimedia.org/dashboard/db/kubernetes-api [15:52:33] !log reset 
authentication on cassandra / maps / codfw - T222801 [15:52:34] (03PS11) 10Mathew.onipe: Add postgres slave init cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/504570 (https://phabricator.wikimedia.org/T220946) [15:52:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:52:37] T222801: TileratorUI reports Error: Username and/or password are incorrect - https://phabricator.wikimedia.org/T222801 [15:52:40] PROBLEM - etcd request latencies on neon is CRITICAL: instance=10.64.0.40:6443 operation={compareAndSwap,get,list} https://grafana.wikimedia.org/dashboard/db/kubernetes-api [15:52:46] PROBLEM - Request latencies on chlorine is CRITICAL: instance=10.64.0.45:6443 verb={GET,LIST,PATCH,PUT} https://grafana.wikimedia.org/dashboard/db/kubernetes-api [15:55:13] (03CR) 10Jcrespo: [C: 03+2] mariadb: Productionize db2117 as an s6 replica [puppet] - 10https://gerrit.wikimedia.org/r/508852 (https://phabricator.wikimedia.org/T222772) (owner: 10Jcrespo) [15:56:37] !log mforns@deploy1001 Finished deploy [analytics/refinery@698f213]: deploying analytics-refinery up to 698f2137aa965b07548ae7565aafaa784628b13c with source=v0.0.89 (duration: 15m 38s) [15:56:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:57:26] RECOVERY - Request latencies on neon is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/kubernetes-api [15:57:26] RECOVERY - etcd request latencies on chlorine is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/kubernetes-api [15:57:27] RECOVERY - Request latencies on argon is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/kubernetes-api [15:58:04] RECOVERY - etcd request latencies on neon is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/kubernetes-api [15:58:10] RECOVERY - Request latencies on chlorine is OK: All metrics within thresholds. 
https://grafana.wikimedia.org/dashboard/db/kubernetes-api [15:58:18] RECOVERY - etcd request latencies on argon is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/kubernetes-api [15:58:46] PROBLEM - puppet last run on proton1001 is CRITICAL: connect to address 10.64.0.20 port 5666: Connection refused [15:59:02] PROBLEM - proton endpoints health on proton1001 is CRITICAL: connect to address 10.64.0.20 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Services/Monitoring/proton [15:59:12] PROBLEM - DPKG on proton1001 is CRITICAL: connect to address 10.64.0.20 port 5666: Connection refused [15:59:34] PROBLEM - Check the NTP synchronisation status of timesyncd on proton1001 is CRITICAL: connect to address 10.64.0.20 port 5666: Connection refused [15:59:38] PROBLEM - configured eth on proton1001 is CRITICAL: connect to address 10.64.0.20 port 5666: Connection refused [15:59:44] PROBLEM - Check whether ferm is active by checking the default input chain on proton1001 is CRITICAL: connect to address 10.64.0.20 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [15:59:51] !log restart db2117 after first puppet run [15:59:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:00:04] MaxSem, RoanKattouw, and Niharika: #bothumor My software never has bugs. It just develops random features. Rise for Morning SWAT (Max 6 patches). (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190508T1600). [16:00:04] Ammarpad: A patch you scheduled for Morning SWAT (Max 6 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. 
[16:00:48] PROBLEM - Check systemd state on proton1001 is CRITICAL: connect to address 10.64.0.20 port 5666: Connection refused [16:00:54] PROBLEM - dhclient process on proton1001 is CRITICAL: connect to address 10.64.0.20 port 5666: Connection refused [16:01:00] PROBLEM - Disk space on proton1001 is CRITICAL: connect to address 10.64.0.20 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space [16:01:16] PROBLEM - Check size of conntrack table on proton1001 is CRITICAL: connect to address 10.64.0.20 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack [16:01:18] what is the issue with proton1001? [16:01:20] (03PS1) 10Jbond: raid: update check_raid to detect missing disk [puppet] - 10https://gerrit.wikimedia.org/r/508855 (https://phabricator.wikimedia.org/T218544) [16:01:38] (03PS2) 10BBlack: cache_upload FB ratelimiter: increase by ~3x [puppet] - 10https://gerrit.wikimedia.org/r/508838 (https://phabricator.wikimedia.org/T192688) [16:02:09] (03CR) 10jerkins-bot: [V: 04-1] raid: update check_raid to detect missing disk [puppet] - 10https://gerrit.wikimedia.org/r/508855 (https://phabricator.wikimedia.org/T218544) (owner: 10Jbond) [16:04:32] (03PS2) 10Jbond: raid: update check_raid to detect missing disk [puppet] - 10https://gerrit.wikimedia.org/r/508855 (https://phabricator.wikimedia.org/T218544) [16:05:33] (03CR) 10jerkins-bot: [V: 04-1] raid: update check_raid to detect missing disk [puppet] - 10https://gerrit.wikimedia.org/r/508855 (https://phabricator.wikimedia.org/T218544) (owner: 10Jbond) [16:05:54] I think it overloaded [16:08:04] !log restart tileratorui on maps2001 - T222801 [16:08:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:08:10] T222801: TileratorUI reports Error: Username and/or password are incorrect - https://phabricator.wikimedia.org/T222801 [16:08:20] 10Operations, 10puppet-compiler, 10Cloud-VPS (Quota-requests): Requesting quota increase 
for 'puppet-diffs' project - https://phabricator.wikimedia.org/T222800 (10aborrero) I was expecting this request :-) The standard procedure is that we talk about this request in our next team meeting, which is on 2019-05... [16:08:29] 10Operations, 10Proton, 10Reading-Infrastructure-Team-Backlog: proton experienced a period of high CPU usage, busy queue, lockups - https://phabricator.wikimedia.org/T214975 (10jcrespo) I think it overloaded again today (times are CEST): ` [15:58:46] PROBLEM - puppet last run on proton1001 is... [16:08:33] I've reported it at T214975 [16:08:34] T214975: proton experienced a period of high CPU usage, busy queue, lockups - https://phabricator.wikimedia.org/T214975 [16:10:40] (03PS6) 10Andrew Bogott: git-sync-upstream: Rebase on top of prod's copy of the puppet repo [puppet] - 10https://gerrit.wikimedia.org/r/504825 (https://phabricator.wikimedia.org/T219390) (owner: 10Alex Monk) [16:10:54] (03PS1) 10Mholloway: Enable WikimediaEditorTasks on Commons Beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/508860 [16:12:04] (03PS2) 10Mholloway: Enable WikimediaEditorTasks on Commons Beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/508860 [16:13:50] (03PS1) 10Mforns: Bump up refinery version for refine.pp [puppet] - 10https://gerrit.wikimedia.org/r/508863 (https://phabricator.wikimedia.org/T215442) [16:14:15] (03PS6) 10Paladox: puppetmaster: Stop cloning over /p/ [puppet] - 10https://gerrit.wikimedia.org/r/507079 (https://phabricator.wikimedia.org/T218844) [16:16:10] (03CR) 10BBlack: [C: 03+2] cache_upload FB ratelimiter: increase by ~3x [puppet] - 10https://gerrit.wikimedia.org/r/508838 (https://phabricator.wikimedia.org/T192688) (owner: 10BBlack) [16:20:46] (03CR) 10Mforns: [C: 04-1] "Let's wait to merge this until Andrew can be there."
[puppet] - 10https://gerrit.wikimedia.org/r/508863 (https://phabricator.wikimedia.org/T215442) (owner: 10Mforns) [16:24:14] 10Operations, 10fundraising-tech-ops, 10netops, 10Patch-For-Review: Revoke production prometheus fundraising access - https://phabricator.wikimedia.org/T217355 (10cwdent) 05Open→03Resolved [16:26:40] (03CR) 10Ottomata: "Let's do this Monday (my) morning?" [puppet] - 10https://gerrit.wikimedia.org/r/508582 (https://phabricator.wikimedia.org/T218346) (owner: 10Ottomata) [16:27:02] (03PS7) 10Andrew Bogott: puppetmaster: Stop cloning over /p/ [puppet] - 10https://gerrit.wikimedia.org/r/507079 (https://phabricator.wikimedia.org/T218844) (owner: 10Paladox) [16:28:01] (03CR) 10Andrew Bogott: [C: 03+2] puppetmaster: Stop cloning over /p/ [puppet] - 10https://gerrit.wikimedia.org/r/507079 (https://phabricator.wikimedia.org/T218844) (owner: 10Paladox) [16:28:36] thank you andrewbogott!! [16:31:42] (03PS7) 10Andrew Bogott: git-sync-upstream: Rebase on top of prod's copy of the puppet repo [puppet] - 10https://gerrit.wikimedia.org/r/504825 (https://phabricator.wikimedia.org/T219390) (owner: 10Alex Monk) [16:32:45] (03CR) 10Andrew Bogott: [C: 03+2] git-sync-upstream: Rebase on top of prod's copy of the puppet repo [puppet] - 10https://gerrit.wikimedia.org/r/504825 (https://phabricator.wikimedia.org/T219390) (owner: 10Alex Monk) [16:37:28] 10Operations, 10Puppet, 10Packaging: facter3: Unable to parse routing table - https://phabricator.wikimedia.org/T222356 (10jbond) upstream PR https://github.com/puppetlabs/facter/pull/1464 [16:38:58] 10Operations, 10Puppet, 10Packaging: facter3: Unable to parse routing table - https://phabricator.wikimedia.org/T222356 (10jbond) correct pull request https://github.com/puppetlabs/facter/pull/1775 [16:42:23] (03PS1) 10Jforrester: extension-list: Re-sort [mediawiki-config] - 10https://gerrit.wikimedia.org/r/508864 [16:42:39] (03PS5) 10Paladox: Fix apache-fast-test to support https [puppet] - 
10https://gerrit.wikimedia.org/r/508737 [16:42:47] (03PS6) 10Paladox: Fix apache-fast-test to support https [puppet] - 10https://gerrit.wikimedia.org/r/508737 [16:43:48] (03CR) 10Volans: [C: 03+2] tox: drop Py34 support add Py37 [cookbooks] - 10https://gerrit.wikimedia.org/r/508843 (owner: 10Volans) [16:45:31] (03Merged) 10jenkins-bot: tox: drop Py34 support add Py37 [cookbooks] - 10https://gerrit.wikimedia.org/r/508843 (owner: 10Volans) [16:50:31] PROBLEM - HTTP availability for Nginx -SSL terminators- at ulsfo on icinga1001 is CRITICAL: cluster=cache_text site=ulsfo https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [16:51:33] PROBLEM - HTTP availability for Nginx -SSL terminators- at eqiad on icinga1001 is CRITICAL: cluster=cache_text site=eqiad https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [16:52:04] looks like an already-recovered spike btw [16:52:43] PROBLEM - Esams HTTP 5xx reqs/min on graphite1004 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=esams&var-cache_type=All&var-status_type=5 [16:53:50] (03PS7) 10Paladox: Fix apache-fast-test to support https [puppet] - 10https://gerrit.wikimedia.org/r/508737 [16:53:57] PROBLEM - Eqiad HTTP 5xx reqs/min on graphite1004 is CRITICAL: CRITICAL: 10.00% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=eqiad&var-cache_type=All&var-status_type=5 [16:54:03] PROBLEM - Text HTTP 5xx reqs/min on graphite1004 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=All&var-cache_type=text&var-status_type=5 [16:54:33] 
RECOVERY - HTTP availability for Nginx -SSL terminators- at ulsfo on icinga1001 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [16:55:08] !log otto@deploy1001 scap-helm eventgate-main upgrade main -f main/staging-values.yaml --reset-values stable/eventgate [namespace: eventgate-main, clusters: staging] [16:55:10] !log otto@deploy1001 scap-helm eventgate-main cluster staging completed [16:55:10] !log otto@deploy1001 scap-helm eventgate-main finished [16:55:10] do we know what caused those spikes? [16:55:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:55:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:55:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:55:35] RECOVERY - HTTP availability for Nginx -SSL terminators- at eqiad on icinga1001 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [16:57:56] (03PS3) 10Jbond: raid: update check_raid to detect missing disk [puppet] - 10https://gerrit.wikimedia.org/r/508855 (https://phabricator.wikimedia.org/T218544) [16:59:19] RECOVERY - Eqiad HTTP 5xx reqs/min on graphite1004 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=eqiad&var-cache_type=All&var-status_type=5 [16:59:23] RECOVERY - Text HTTP 5xx reqs/min on graphite1004 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=All&var-cache_type=text&var-status_type=5 [17:00:45] RECOVERY - Esams HTTP 5xx reqs/min on graphite1004 is OK: OK: Less than 1.00% above the threshold [250.0] 
https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=esams&var-cache_type=All&var-status_type=5 [17:02:03] PROBLEM - Request latencies on neon is CRITICAL: instance=10.64.0.40:6443 verb=CONNECT https://grafana.wikimedia.org/dashboard/db/kubernetes-api [17:04:17] (03PS7) 10Paladox: scap: Stop cloning over /p/ [puppet] - 10https://gerrit.wikimedia.org/r/507072 (https://phabricator.wikimedia.org/T218844) [17:04:40] (03CR) 10Mholloway: [C: 03+2] Enable WikimediaEditorTasks on Commons Beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/508860 (owner: 10Mholloway) [17:05:34] https://grafana.wikimedia.org/d/000000439/varnish-backend-connections?orgId=1&from=now-1h&to=now [17:05:42] assuming the dropdowns are set to eqiad and text [17:05:54] (03Merged) 10jenkins-bot: Enable WikimediaEditorTasks on Commons Beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/508860 (owner: 10Mholloway) [17:05:57] text-backends hit their 1K "connections to MW" limits [17:06:18] which causes them to intentionally 503 some of the other such requests while that limit is limiting things [17:06:26] (03CR) 10jenkins-bot: Enable WikimediaEditorTasks on Commons Beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/508860 (owner: 10Mholloway) [17:07:02] one theoretical avenue we're looking at there, is that likely those spikes of connection parallelism to MW are caused by a spike of latent MW responses [17:07:25] RECOVERY - Request latencies on neon is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/kubernetes-api [17:07:47] even in some of our most ideal past worlds, there's always going to be at least some minority latent responses, e.g. re-rendering Barack_Obama in response to a request or whatever [17:08:02] the problem is when there's a lot of slow responses all at once [17:08:58] and we've got easy mitigation knobs for this too. We can tell MW to cap its latency and 5xx.
We can also cap it at varnish and say if no response in < Xs, abort and move on. And arguably for many/most cases, that limit should be pretty small. [17:09:15] but there are useful APIs and re-renderings and so-on that do take longer to execute, and which people would complain about if they were broken [17:09:50] so it's hard to just take the "eh a user won't wait more than ~15s, so let's just abort anything that goes longer than that" [17:10:10] sort of hard-line approach. It sounds like a good idea, but it's probably naive and makes a bunch of people legitimately angry :) [17:10:16] (03PS1) 10Mholloway: WikimediaEditorTasks: Always use local DB on Beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/508866 [17:11:28] but maybe if we had some idea that certain APIs are expected to often have slow responses, we could move those to a separate service endpoint and tune it differently from the rest. kinda like how we have appservers-rw and appservers-ro, we could have an appservers-slow and route certain URI patterns there, and enforce more aggressive timeout limits for everything else [17:11:28] (03CR) 10Mholloway: [C: 03+2] WikimediaEditorTasks: Always use local DB on Beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/508866 (owner: 10Mholloway) [17:11:57] 10Operations, 10observability: Leverage Grafana annotations to show events in graphs - https://phabricator.wikimedia.org/T222826 (10colewhite) [17:12:20] 10Operations, 10Graphite: unused grafana-dashboard indices on elasticsearch / logstash - https://phabricator.wikimedia.org/T174172 (10colewhite) [17:12:22] 10Operations, 10observability: Leverage Grafana annotations to show events in graphs - https://phabricator.wikimedia.org/T222826 (10colewhite) [17:12:34] (03Merged) 10jenkins-bot: WikimediaEditorTasks: Always use local DB on Beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/508866 (owner: 10Mholloway) [17:12:38] 10Operations, 10observability: Leverage Grafana annotations to show events
in graphs - https://phabricator.wikimedia.org/T222826 (10colewhite) p:05Triage→03Normal [17:12:48] (03CR) 10jenkins-bot: WikimediaEditorTasks: Always use local DB on Beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/508866 (owner: 10Mholloway) [17:12:49] 10Operations, 10observability: Leverage Grafana annotations to show events in graphs - https://phabricator.wikimedia.org/T222826 (10colewhite) p:05Normal→03Low [17:13:27] (03CR) 10Dzahn: [C: 03+2] DNS: Remove mgmt DNS for osm-db200[12] and osm-web200[1234] [dns] - 10https://gerrit.wikimedia.org/r/508823 (owner: 10Papaul) [17:15:29] or in more abstract and sensible terms: make up multiple response-latency SLA buckets for categories of MW outputs (e.g. standard pageviews vs certain APIs vs whatever), and make per-SLA service endpoints where we can differ on policies about timeouts and such. [17:15:45] !log mholloway-shell@deploy1001 Synchronized wmf-config/InitialiseSettings-labs.php: Enable WikimediaEditorTasks on Beta commonswiki (duration: 00m 57s) [17:15:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:16:12] MW as a whole is a big bucket with a lot of variance [17:16:19] 10Operations, 10ops-codfw, 10DC-Ops, 10decommission, 10Patch-For-Review: Decommission osm-db200[12] and osm-web200[1234] - https://phabricator.wikimedia.org/T187445 (10Dzahn) [17:16:35] 10Operations, 10Gerrit, 10cloud-services-team, 10serviceops: Change /r/p/ to /r/ on all hosts (where https://gerrit.wikimedia.org/r/p/ exists) - https://phabricator.wikimedia.org/T222093 (10Paladox) [17:16:47] splitting it up by logical SLA policy of various sub-services/APIs makes a certain sense [17:17:11] or SLO I guess is the more-hip thing to say now [17:18:25] PROBLEM - PHP7 rendering on mw1264 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 378 bytes in 0.010 second response time https://wikitech.wikimedia.org/wiki/Application_servers [17:19:27] hmmm [17:19:38] I'm guessing that's icinga hitting a 
URL with the cookie set to ensure PHP7 execution? [17:22:02] !log mholloway-shell@deploy1001 Synchronized wmf-config/InitialiseSettings-labs.php: Retry: Enable WikimediaEditorTasks on Beta commonswiki (duration: 00m 57s) [17:22:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:22:27] RECOVERY - PHP7 rendering on mw1264 is OK: HTTP OK: HTTP/1.1 200 OK - 76741 bytes in 0.135 second response time https://wikitech.wikimedia.org/wiki/Application_servers [17:22:41] (03PS8) 10Paladox: toolforge: update origin URL for integration/composer.git clones [puppet] - 10https://gerrit.wikimedia.org/r/507074 (https://phabricator.wikimedia.org/T218844) [17:28:47] RECOVERY - MariaDB Slave Lag: s3 on db2098 is OK: OK slave_sql_lag Replication lag: 0.12 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_slave [17:33:09] (03CR) 10Jbond: "Really like this, some minor comments" (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/507846 (owner: 10Herron) [17:37:55] jouncebot: now [17:37:55] No deployments scheduled for the next 0 hour(s) and 22 minute(s) [17:37:57] @seen Revi [17:37:57] mutante: revi is in here, right now [17:38:14] (03CR) 10Jforrester: [C: 03+2] extension-list: Re-sort [mediawiki-config] - 10https://gerrit.wikimedia.org/r/508864 (owner: 10Jforrester) [17:38:35] revi: would it be a good time to change the kr.wiki redirect? [17:38:50] (03CR) 10Jforrester: [C: 03+2] composer: Ignore multiversion's vendor, too [mediawiki-config] - 10https://gerrit.wikimedia.org/r/507724 (owner: 10Jforrester) [17:38:53] hi mutante [17:38:53] (03CR) 10Jforrester: [C: 03+2] env: Allow for running outside the cluster for local testing [mediawiki-config] - 10https://gerrit.wikimedia.org/r/507725 (owner: 10Jforrester) [17:39:00] mutante: sure just a moment... 
[17:39:15] need to wake up my laptop [17:39:27] (03Merged) 10jenkins-bot: extension-list: Re-sort [mediawiki-config] - 10https://gerrit.wikimedia.org/r/508864 (owner: 10Jforrester) [17:39:37] PROBLEM - Eqiad HTTP 5xx reqs/min on graphite1004 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=eqiad&var-cache_type=All&var-status_type=5 [17:39:43] PROBLEM - Text HTTP 5xx reqs/min on graphite1004 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=All&var-cache_type=text&var-status_type=5 [17:39:48] (03PS3) 10Dzahn: Change kr.wikimedia.org destination [puppet] - 10https://gerrit.wikimedia.org/r/506895 (https://phabricator.wikimedia.org/T222033) (owner: 10Revi) [17:40:07] (03Merged) 10jenkins-bot: composer: Ignore multiversion's vendor, too [mediawiki-config] - 10https://gerrit.wikimedia.org/r/507724 (owner: 10Jforrester) [17:40:10] (03Merged) 10jenkins-bot: env: Allow for running outside the cluster for local testing [mediawiki-config] - 10https://gerrit.wikimedia.org/r/507725 (owner: 10Jforrester) [17:40:54] (03CR) 10jenkins-bot: extension-list: Re-sort [mediawiki-config] - 10https://gerrit.wikimedia.org/r/508864 (owner: 10Jforrester) [17:41:15] revi: there will be a short window where people could get old or new [17:41:42] that's fine because redirects will be kept [17:42:27] ok [17:42:36] ready to go [17:42:42] (03CR) 10jenkins-bot: composer: Ignore multiversion's vendor, too [mediawiki-config] - 10https://gerrit.wikimedia.org/r/507724 (owner: 10Jforrester) [17:42:51] !log jforrester@deploy1001 Synchronized wmf-config/env.php: Clean-up: Allow for running outside the cluster for local testing (no-op for prod) (duration: 00m 56s) [17:42:53] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:42:58] revi: remove your -1 when ready :) [17:43:21] waiting for my browser to load gerrit ha [17:44:01] `Removed Code-Review-1 by Revi` [17:44:11] 10Operations, 10ops-eqiad, 10RESTBase, 10Core Platform Team Backlog (Watching / External), and 2 others: rack/setup/install restbase10[19-27].eqiad.wmnet - https://phabricator.wikimedia.org/T219404 (10RobH) [17:44:14] (03CR) 10Dzahn: [C: 03+2] Change kr.wikimedia.org destination [puppet] - 10https://gerrit.wikimedia.org/r/506895 (https://phabricator.wikimedia.org/T222033) (owner: 10Revi) [17:44:34] !log jforrester@deploy1001 Synchronized wmf-config/extension-list: Re-sort extension-list (prod no-op) (duration: 00m 56s) [17:44:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:45:01] RECOVERY - Eqiad HTTP 5xx reqs/min on graphite1004 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=eqiad&var-cache_type=All&var-status_type=5 [17:46:27] RECOVERY - Text HTTP 5xx reqs/min on graphite1004 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=All&var-cache_type=text&var-status_type=5 [17:46:37] revi: it has been applied on mwdebug1001 now if that helps. it will be deployed to all others within 30min max [17:46:45] thanks! 
[17:46:47] 10Operations, 10ops-eqiad, 10RESTBase, 10Core Platform Team Backlog (Watching / External), and 2 others: rack/setup/install restbase10[19-27].eqiad.wmnet - https://phabricator.wikimedia.org/T219404 (10RobH) [17:47:08] it's 02:46 AM in South Korea so I don't think there would be enough traffic for that domain as of now [17:48:14] meta.wikimedia.org/wiki/위키미디어_한국 [17:48:17] becomes [17:48:28] meta.wikimedia.org/wiki/%EC%9C%84%ED%82%A4%EB%AF%B8%EB%94%94%EC%96%B4_%ED%95%9C%EA%B5%AD [17:48:30] confirmed via mwdebug1001 as well [17:48:33] hmm loads fine to me [17:48:37] ah, cool. thanks [17:49:27] 10Operations, 10ops-eqiad, 10Operations-Software-Development, 10observability, 10Patch-For-Review: ms-be1043 sdk failed - https://phabricator.wikimedia.org/T218544 (10jbond) >>! In T218544#5031911, @Volans wrote: > @fgiunchedi agree that this is a new issue, and we need to fix two different scripts to ha... [17:49:53] 10Operations, 10Traffic, 10Goal, 10Patch-For-Review: Support TLSv1.3 - https://phabricator.wikimedia.org/T170567 (10BBlack) Putting this here for lack of a better place, for future reference: In the TLSv1.2 (and below) world, we've gone with a static preference on symmetric ciphers of ChaPoly -> AES256 ->... [17:50:26] 10Operations, 10Wikimedia-Apache-configuration, 10Patch-For-Review, 10User-revi: Change kr.wikimedia.org redirection destination - https://phabricator.wikimedia.org/T222033 (10Dzahn) 13:46 < mutante> revi: it has been applied on mwdebug1001 now if that helps. it will be deployed to all others within 30min... 
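(editor's note) The two forms of the meta.wikimedia.org URL quoted in the exchange above are the same page title: the second is the percent-encoding of the title's UTF-8 bytes. A minimal sketch of that mapping, assuming Python's standard urllib.parse; the helper name encode_title is hypothetical and is not part of the redirect change itself:

```python
from urllib.parse import quote, unquote


def encode_title(title: str) -> str:
    """Percent-encode a wiki page title as UTF-8 bytes (hypothetical helper).

    Letters, digits and the characters "_.-~" are left untouched by quote(),
    so the underscore in the title survives as-is.
    """
    return quote(title, safe="")


title = "위키미디어_한국"
encoded = encode_title(title)
print("meta.wikimedia.org/wiki/" + encoded)

# Decoding gives the original title back, so both URLs name the same page.
assert unquote(encoded) == title
```

Either form can be pasted into a browser or passed to a test request against mwdebug1001; the server sees the same path.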
[17:51:04] 10Operations, 10Wikimedia-Apache-configuration, 10User-revi: Change kr.wikimedia.org redirection destination - https://phabricator.wikimedia.org/T222033 (10revi) [17:52:14] (03PS6) 10BBlack: Convert most DYNA into 1H CNAME records [dns] - 10https://gerrit.wikimedia.org/r/507399 (https://phabricator.wikimedia.org/T208263) [17:54:02] (03PS1) 10Mholloway: WikimediaEditorTasks: add caption counter config and enable on Commons [mediawiki-config] - 10https://gerrit.wikimedia.org/r/508876 [17:55:00] (03CR) 10Mholloway: [C: 04-2] "hold until wmf.4 rolls out to testcommonswiki and commonswiki" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/508876 (owner: 10Mholloway) [17:55:32] (03PS3) 10Dzahn: apache-fast-test: accept a literal - in host names [puppet] - 10https://gerrit.wikimedia.org/r/508701 [17:56:28] (03CR) 10Dzahn: [C: 03+2] apache-fast-test: accept a literal - in host names [puppet] - 10https://gerrit.wikimedia.org/r/508701 (owner: 10Dzahn) [17:57:54] (03CR) 10BBlack: [C: 03+2] Convert most DYNA into 1H CNAME records [dns] - 10https://gerrit.wikimedia.org/r/507399 (https://phabricator.wikimedia.org/T208263) (owner: 10BBlack) [17:57:56] (03PS8) 10Paladox: Fix apache-fast-test to support https [puppet] - 10https://gerrit.wikimedia.org/r/508737 [17:58:46] (03PS6) 10Dzahn: Gerrit: Configure logging in json to gerrit.json [puppet] - 10https://gerrit.wikimedia.org/r/508391 (https://phabricator.wikimedia.org/T141324) (owner: 10Paladox) [17:58:48] !log public authdns: deploying the big DYNA/CNAME change in https://gerrit.wikimedia.org/r/c/operations/dns/+/507399 [17:58:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:00:05] Deploy window Pre MediaWiki train sanity break (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190508T1800) [18:01:38] (03PS1) 10Jforrester: Beta Features whitelist: Update dates of expected promotion [mediawiki-config] - 10https://gerrit.wikimedia.org/r/508881 [18:01:40] (03PS1) 
10Jforrester: Beta Features whitelist: Drop TemplateWizard, now everywhere [mediawiki-config] - 10https://gerrit.wikimedia.org/r/508882 [18:01:42] (03PS1) 10Jforrester: Beta Features whitelist: Drop AdvancedSearch, now everywhere [mediawiki-config] - 10https://gerrit.wikimedia.org/r/508883 [18:01:44] (03PS1) 10Jforrester: AdvancedSearch: Stop pretending loading this varies by wiki, it doesn't [mediawiki-config] - 10https://gerrit.wikimedia.org/r/508884 [18:05:02] (03CR) 10Dzahn: [C: 03+2] Gerrit: Configure logging in json to gerrit.json [puppet] - 10https://gerrit.wikimedia.org/r/508391 (https://phabricator.wikimedia.org/T141324) (owner: 10Paladox) [18:06:14] jouncebot: now [18:06:15] For the next 0 hour(s) and 53 minute(s): Pre MediaWiki train sanity break (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190508T1800) [18:09:48] !log restarting gerrit to apply logging changes (gerrit:508391) [18:09:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:12:40] PROBLEM - Gerrit JSON on gerrit.wikimedia.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - page size 1529 too small - 1529 bytes in 0.011 second response time https://wikitech.wikimedia.org/wiki/Gerrit%23Monitoring [18:13:11] Gerrit is down. Is that known? [18:13:29] Back up now, never mind. [18:13:32] anomie: yep, restarting [18:14:00] RECOVERY - Gerrit JSON on gerrit.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 27601 bytes in 0.054 second response time https://wikitech.wikimedia.org/wiki/Gerrit%23Monitoring [18:14:38] PROBLEM - puppet last run on webperf1002 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 4 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[git_pull_operations/software/xhgui] [18:15:44] PROBLEM - puppet last run on db1125 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. 
Failed resources (up to 3 shown): Exec[git_pull_operations/mediawiki-config] [18:16:16] PROBLEM - puppet last run on cumin2001 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 4 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[git_pull_operations/cookbooks] [18:16:28] PROBLEM - puppet last run on stat1006 is CRITICAL: CRITICAL: Puppet has 4 failures. Last run 3 minutes ago with 4 failures. Failed resources (up to 3 shown): Exec[git_pull_operations/mediawiki-config],Exec[git_pull_mediawiki/event-schemas],Exec[git_pull_statistics_mediawiki],Exec[git_pull_analytics/reportupdater] [18:16:30] PROBLEM - puppet last run on webperf2001 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 4 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[git_pull_performance/docroot] [18:17:08] PROBLEM - puppet last run on webperf1001 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 4 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[git_pull_performance/docroot] [18:17:18] PROBLEM - Host ms-fe2008 is DOWN: PING CRITICAL - Packet loss = 100% [18:17:36] these are all caused by gerrit restart and will recover shortly [18:17:48] well, except ms-fe2008 going down of course [18:18:46] PROBLEM - puppet last run on stat1004 is CRITICAL: CRITICAL: Puppet has 3 failures. Last run 7 minutes ago with 3 failures. Failed resources (up to 3 shown): Exec[git_pull_operations/mediawiki-config],Exec[git_pull_mediawiki/event-schemas],Exec[git_pull_statistics_mediawiki] [18:19:00] RECOVERY - Host ms-fe2008 is UP: PING OK - Packet loss = 0%, RTA = 36.15 ms [18:19:16] we should make a cumin alias for "hosts pulling from gerrit" [18:19:23] then we can use that to run puppet on them at once [18:20:36] mutante: all git clone? [18:20:41] volans: yes [18:21:11] ah, you mean running by puppet class? [18:21:13] 'R:git::clone' returns 76 hosts [18:21:18] yes [18:21:21] hmm. 
that's more than expected [18:21:28] RECOVERY - puppet last run on stat1006 is OK: OK: Puppet is currently enabled, last run 44 seconds ago with 0 failures [18:21:33] but yea.. could be.. the ones we see are only the unlucky ones [18:21:36] running during that specific time [18:21:51] anyway you can run puppet on the failed ones only [18:22:02] https://wikitech.wikimedia.org/wiki/Cumin#Run_Puppet_only_if_last_run_failed [18:22:10] if the issue is fixed [18:22:54] ah, right! yes, thanks [18:24:08] doing it..with -b 6 [18:24:40] given that we always hit at least one host, and given that we have done it now at many different times maybe 76 isn't that unexpected. [18:24:42] RECOVERY - puppet last run on webperf1002 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [18:25:10] cumin was really quick to go to like 72 of 76 .. determined they didnt fail [18:25:17] (03PS1) 10Jforrester: Beta Features whitelist: Drop RCFilters, now everywhere [mediawiki-config] - 10https://gerrit.wikimedia.org/r/508888 [18:25:39] icinga-wm: now let us know [18:25:48] RECOVERY - puppet last run on db1125 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [18:26:20] RECOVERY - puppet last run on cumin2001 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [18:26:36] RECOVERY - puppet last run on webperf2001 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [18:27:14] RECOVERY - puppet last run on webperf1001 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [18:28:52] RECOVERY - puppet last run on stat1004 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [18:29:53] (03CR) 10Dzahn: [C: 03+1] "nitpick: "enable disable" -> "disable" :)" [puppet] - 10https://gerrit.wikimedia.org/r/508127 (owner: 10Paladox) [18:30:36] (03PS6) 10Dzahn: Gerrit: Enable gerrit.disableReverseDnsLookup [puppet] - 10https://gerrit.wikimedia.org/r/508127 
(owner: 10Paladox) [18:31:05] (03CR) 10Dzahn: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/508127 (owner: 10Paladox) [18:31:40] (03PS7) 10Paladox: Gerrit: Disable gerrit.disableReverseDnsLookup [puppet] - 10https://gerrit.wikimedia.org/r/508127 [18:34:29] (03PS8) 10Paladox: Gerrit: Disable DNS reverse lookup [puppet] - 10https://gerrit.wikimedia.org/r/508127 [18:36:54] PROBLEM - Long running screen/tmux on lithium is CRITICAL: CRIT: Long running tmux process. (user: fsero PID: 28079, 1732266s 1728000s). https://wikitech.wikimedia.org/wiki/Monitoring/Long_running_screens [18:37:13] (03CR) 10Dzahn: [C: 03+1] "+ Reedy" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/504502 (https://phabricator.wikimedia.org/T167375) (owner: 10Urbanecm) [18:40:56] (03CR) 10Dzahn: [C: 04-2] "additional test results in https://phabricator.wikimedia.org/P8493" [puppet] - 10https://gerrit.wikimedia.org/r/507787 (https://phabricator.wikimedia.org/T218844) (owner: 10Paladox) [18:40:57] (03CR) 10Brennen Bearnes: [C: 03+1] "Echoing that this seems reasonable since upstream is doing it (as long as there aren't some sort of important unforeseen consequences for " [puppet] - 10https://gerrit.wikimedia.org/r/508127 (owner: 10Paladox) [18:43:24] (03CR) 10Dzahn: [C: 03+1] "> Patch Set 11: Code-Review-2" [puppet] - 10https://gerrit.wikimedia.org/r/507787 (https://phabricator.wikimedia.org/T218844) (owner: 10Paladox) [18:44:51] (03CR) 10Dzahn: [C: 03+1] "we made apache-fast-test work on gerrit-slave and paladox' cloud VPS gerrit host with 2 patches and https://phabricator.wikimedia.org/P849" [puppet] - 10https://gerrit.wikimedia.org/r/507787 (https://phabricator.wikimedia.org/T218844) (owner: 10Paladox) [18:47:08] (03CR) 10Dzahn: [C: 03+1] "just means either the checked out dirs have to be deleted and recreated by puppet or needs an edit inside them to reflect the new origin" [puppet] - 10https://gerrit.wikimedia.org/r/507074 (https://phabricator.wikimedia.org/T218844) (owner: 
10Paladox) [18:51:40] 10Operations, 10ops-codfw, 10DC-Ops, 10decommission, 10Patch-For-Review: Decommission osm-db200[12] and osm-web200[1234] - https://phabricator.wikimedia.org/T187445 (10Papaul) 05Open→03Resolved Complete [18:53:44] (03CR) 10Dzahn: "gotta check which hosts exactly are influenced by this" [puppet] - 10https://gerrit.wikimedia.org/r/507072 (https://phabricator.wikimedia.org/T218844) (owner: 10Paladox) [18:54:03] 10Operations, 10ops-codfw, 10DBA: db2114 hardware problem - https://phabricator.wikimedia.org/T222753 (10Papaul) Technician will be on site tomorrow between 9:30 and 10am CT [18:54:59] (03PS1) 10Paladox: Gerrit: Enable gerrit.listProjectsFromIndex [puppet] - 10https://gerrit.wikimedia.org/r/508892 [18:55:44] PROBLEM - Long running screen/tmux on proton1001 is CRITICAL: connect to address 10.64.0.20 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/Long_running_screens [18:55:44] (03PS2) 10Paladox: Gerrit: Enable gerrit.listProjectsFromIndex [puppet] - 10https://gerrit.wikimedia.org/r/508892 [18:56:46] (03CR) 10Dzahn: "I would prefer if we don't change ARG1 and make this ARG2. And then let's add a description further up what it does. And finally nitpick: " [puppet] - 10https://gerrit.wikimedia.org/r/508737 (owner: 10Paladox) [18:57:29] (03CR) 10Dzahn: "it should be an optional second ARG that has a default value (http) so that behaviour doesn't change for people who use it like they did b" [puppet] - 10https://gerrit.wikimedia.org/r/508737 (owner: 10Paladox) [19:00:04] thcipriani: Dear deployers, time to do the MediaWiki train - Americas version deploy. Dont look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190508T1900). [19:00:22] did I though? 
[19:00:28] implicitly perhaps [19:01:04] !log T221904 cdanis@cumin1001.eqiad.wmnet ~ % sudo cumin 'ms-be2*[4,7].codfw.wmnet' 'for DISK in /sys/block/sd*/queue/scheduler ; do echo cfq > $DISK ; done' [19:01:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:01:11] T221904: swift backend decomms / rebalances are noisy - https://phabricator.wikimedia.org/T221904 [19:02:22] (03CR) 10Brennen Bearnes: [C: 03+1] Gerrit: Redirect /p/(.+)/info/(.+) to /$1/info/$2 (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/507787 (https://phabricator.wikimedia.org/T218844) (owner: 10Paladox) [19:02:45] (03PS9) 10Paladox: Fix apache-fast-test to support https [puppet] - 10https://gerrit.wikimedia.org/r/508737 [19:04:26] (03PS10) 10Paladox: Fix apache-fast-test to support https [puppet] - 10https://gerrit.wikimedia.org/r/508737 [19:04:51] (03PS1) 10Papaul: DNS: Remove mgmt DNS for maps-test200[1-4] [dns] - 10https://gerrit.wikimedia.org/r/508896 [19:11:24] 10Operations, 10ops-codfw, 10DBA: db2114 hardware problem - https://phabricator.wikimedia.org/T222753 (10Marostegui) Great news! Thanks! 
[19:13:32] (03CR) 10Dzahn: [C: 03+2] DNS: Remove mgmt DNS for maps-test200[1-4] [dns] - 10https://gerrit.wikimedia.org/r/508896 (owner: 10Papaul) [19:14:11] 10Operations, 10ops-codfw, 10Reading-Infrastructure-Team-Backlog, 10decommission, 10Patch-For-Review: Decommission maps-test cluster - https://phabricator.wikimedia.org/T202898 (10Dzahn) [19:14:29] (03PS11) 10Paladox: Fix apache-fast-test to support https [puppet] - 10https://gerrit.wikimedia.org/r/508737 [19:14:46] (03CR) 10jenkins-bot: env: Allow for running outside the cluster for local testing [mediawiki-config] - 10https://gerrit.wikimedia.org/r/507725 (owner: 10Jforrester) [19:15:49] (03PS9) 10Andrew Bogott: toolforge: update origin URL for integration/composer.git clones [puppet] - 10https://gerrit.wikimedia.org/r/507074 (https://phabricator.wikimedia.org/T218844) (owner: 10Paladox) [19:17:12] (03CR) 10Andrew Bogott: [C: 03+2] toolforge: update origin URL for integration/composer.git clones [puppet] - 10https://gerrit.wikimedia.org/r/507074 (https://phabricator.wikimedia.org/T218844) (owner: 10Paladox) [19:17:15] (03PS1) 10Thcipriani: group1 wikis to 1.34.0-wmf.4 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/508898 [19:17:17] (03CR) 10Thcipriani: [C: 03+2] group1 wikis to 1.34.0-wmf.4 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/508898 (owner: 10Thcipriani) [19:17:19] (03PS1) 10CDanis: swift codfw-prod: decomm ms-be201{3,4,5}: 0 weight [software/swift-ring] - 10https://gerrit.wikimedia.org/r/508899 (https://phabricator.wikimedia.org/T221068) [19:17:37] 10Operations, 10ops-codfw, 10Reading-Infrastructure-Team-Backlog, 10decommission, 10Patch-For-Review: Decommission maps-test cluster - https://phabricator.wikimedia.org/T202898 (10Papaul) 05Open→03Resolved Complete [19:18:23] (03Merged) 10jenkins-bot: group1 wikis to 1.34.0-wmf.4 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/508898 (owner: 10Thcipriani) [19:19:44] (03CR) 10CDanis: [V: 03+2 C: 03+2] swift codfw-prod: decomm 
ms-be201{3,4,5}: 0 weight [software/swift-ring] - 10https://gerrit.wikimedia.org/r/508899 (https://phabricator.wikimedia.org/T221068) (owner: 10CDanis) [19:20:08] (03PS12) 10Paladox: Fix apache-fast-test to support https [puppet] - 10https://gerrit.wikimedia.org/r/508737 [19:20:36] !log thcipriani@deploy1001 rebuilt and synchronized wikiversions files: group1 wikis to 1.34.0-wmf.4 [19:20:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:21:49] !log swift codfw-prod: deploy I59c88aed T221068 [19:22:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:22:04] T221068: decom ms-be201[345] - https://phabricator.wikimedia.org/T221068 [19:22:25] !log thcipriani@deploy1001 Synchronized php: group1 wikis to 1.34.0-wmf.4 (duration: 01m 48s) [19:22:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:24:52] (03CR) 10jenkins-bot: group1 wikis to 1.34.0-wmf.4 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/508898 (owner: 10Thcipriani) [19:26:27] !log continue upgrade to nodejs 10 for maps - T210704 [19:26:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:26:32] T210704: Migrate node-based services in production to node10 - https://phabricator.wikimedia.org/T210704 [19:29:38] (03CR) 10Jgreen: [C: 03+1] Fix apache-fast-test to support https [puppet] - 10https://gerrit.wikimedia.org/r/508737 (owner: 10Paladox) [19:31:07] !log mbsantos@deploy1001 Started deploy [kartotherian/deploy@7774721] (stretch): Deploy tilerator node 10 build into maps[12]002 (T215852) [19:31:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:31:11] T215852: Migrate Kartotherian/Tilerator to Node 10 - https://phabricator.wikimedia.org/T215852 [19:32:04] !log mbsantos@deploy1001 Finished deploy [kartotherian/deploy@7774721] (stretch): Deploy tilerator node 10 build into maps[12]002 (T215852) (duration: 00m 57s) [19:32:07] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:33:53] !log mbsantos@deploy1001 Started deploy [tilerator/deploy@2736a69] (stretch): Deploy tilerator/kartotherian node 10 build into maps[12]002 (T215852) [19:33:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:35:05] !log mbsantos@deploy1001 Finished deploy [tilerator/deploy@2736a69] (stretch): Deploy tilerator/kartotherian node 10 build into maps[12]002 (T215852) (duration: 01m 12s) [19:35:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:42:46] PROBLEM - Check systemd state on ms-be2019 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [19:45:12] !log mbsantos@deploy1001 Started deploy [tilerator/deploy@2736a69] (stretch): Deploy tilerator node 10 build into maps[12]003 (T215852) [19:45:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:45:17] T215852: Migrate Kartotherian/Tilerator to Node 10 - https://phabricator.wikimedia.org/T215852 [19:46:09] !log mbsantos@deploy1001 Finished deploy [tilerator/deploy@2736a69] (stretch): Deploy tilerator node 10 build into maps[12]003 (T215852) (duration: 00m 56s) [19:46:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:46:32] !log mbsantos@deploy1001 Started deploy [kartotherian/deploy@7774721] (stretch): Deploy kartotherian node 10 build into maps[12]003 (T215852) [19:46:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:47:26] !log mbsantos@deploy1001 Finished deploy [kartotherian/deploy@7774721] (stretch): Deploy kartotherian node 10 build into maps[12]003 (T215852) (duration: 00m 54s) [19:47:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:47:56] PROBLEM - HP RAID on ms-be2036 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.32.166: Connection reset by peer 
https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering [19:51:06] note that gerrit is having higher threads [19:53:00] thcipriani ^^ [19:53:10] (it's impacting users per -releng) [19:53:48] https://fastthread.io/my-thread-report.jsp?p=c2hhcmVkLzIwMTkvMDUvOC8tLWpzdGFjay0xOS0wNS0wOC0xOS01Mi01OC5kdW1wLS0xOS01My0yNg== [19:53:59] seems to be the SendEmail thing again [19:54:18] ah [19:54:31] 10Operations, 10serviceops, 10Core Platform Team Backlog (Watching / External), 10Patch-For-Review, and 2 others: Use PHP7 to run all async jobs - https://phabricator.wikimedia.org/T219148 (10jijiki) It looks like deployment-prep has an older php7.2 version than production, which is something we should fix... [19:55:35] !log mbsantos@deploy1001 Started deploy [kartotherian/deploy@7774721] (stretch): Deploy kartotherian node 10 build into maps[12]004 (T215852) [19:55:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:55:39] T215852: Migrate Kartotherian/Tilerator to Node 10 - https://phabricator.wikimedia.org/T215852 [19:56:34] !log mbsantos@deploy1001 Finished deploy [kartotherian/deploy@7774721] (stretch): Deploy kartotherian node 10 build into maps[12]004 (T215852) (duration: 00m 59s) [19:56:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:57:10] !log mbsantos@deploy1001 Started deploy [tilerator/deploy@2736a69] (stretch): Deploy tilerator node 10 build into maps[12]004 (T215852) [19:57:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:58:08] !log mbsantos@deploy1001 Finished deploy [tilerator/deploy@2736a69] (stretch): Deploy tilerator node 10 build into maps[12]004 (T215852) (duration: 00m 58s) [19:58:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:00:04] cscott, arlolra, subbu, bearND, and halfak: #bothumor My software never has bugs. It just develops random features. 
Rise for Services – Parsoid / Citoid / Mobileapps / ORES / …. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190508T2000). [20:02:34] should we restart exim to see if that will do anything? [20:07:31] !log mbsantos@deploy1001 Started deploy [tilerator/deploy@2736a69] (stretch): Deploy tilerator node 10 build into maps1001 (T215852) [20:07:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:07:36] T215852: Migrate Kartotherian/Tilerator to Node 10 - https://phabricator.wikimedia.org/T215852 [20:07:55] !log mbsantos@deploy1001 Finished deploy [tilerator/deploy@2736a69] (stretch): Deploy tilerator node 10 build into maps1001 (T215852) (duration: 00m 24s) [20:07:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:08:20] !log mbsantos@deploy1001 Started deploy [kartotherian/deploy@7774721] (stretch): Deploy kartotherian node 10 build into maps1001 (T215852) [20:08:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:08:40] !log mbsantos@deploy1001 Finished deploy [kartotherian/deploy@7774721] (stretch): Deploy kartotherian node 10 build into maps1001 (T215852) (duration: 00m 20s) [20:08:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:10:42] !log upgrade to nodejs 10 for maps completed - T210704 [20:10:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:10:47] T210704: Migrate node-based services in production to node10 - https://phabricator.wikimedia.org/T210704 [20:11:58] PROBLEM - Check systemd state on ms-be2029 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
[20:12:26] !log restarting gerrit due to threads stuck behind sendemail [20:12:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:14:34] PROBLEM - MariaDB Slave Lag: s1 on db1139 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 862.44 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_slave [20:15:38] the ms-be2029 failure was just a 'usual' debmonitor systemd session going bad, probably because of heavy I/O load [20:16:00] PROBLEM - Gerrit Health Check on gerrit.wikimedia.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 1529 bytes in 0.318 second response time https://gerrit.wikimedia.org/r/config/server/healthcheck%7Estatus [20:16:14] PROBLEM - Gerrit JSON on gerrit.wikimedia.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - page size 1529 too small - 1529 bytes in 0.008 second response time https://wikitech.wikimedia.org/wiki/Gerrit%23Monitoring [20:16:24] while we've only seen 3 spurious ms-be2* failures since I started the rebalances, so far none of them have been on the experiment hosts 🤞 [20:17:10] also... gerrit doesn't seem down? [20:17:24] RECOVERY - Gerrit Health Check on gerrit.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 865 bytes in 0.064 second response time https://gerrit.wikimedia.org/r/config/server/healthcheck%7Estatus [20:17:36] RECOVERY - Gerrit JSON on gerrit.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 26621 bytes in 0.057 second response time https://wikitech.wikimedia.org/wiki/Gerrit%23Monitoring [20:17:44] cdanis: it was, thc.ipriani restarted it a minute ago during which I got 503 as well [20:18:03] oh the SAL log is right there heh [20:18:54] PROBLEM - puppet last run on stat1007 is CRITICAL: CRITICAL: Puppet has 7 failures. Last run 2 minutes ago with 7 failures. 
Failed resources (up to 3 shown): Exec[git_pull_operations/mediawiki-config],Exec[git_pull_wmde/scripts],Exec[git_pull_wmde/toolkit-analyzer-build],Exec[git_pull_mediawiki/event-schemas] [20:20:58] (03PS1) 10Ayounsi: [WIP] Puppet, add RPKI validation [puppet] - 10https://gerrit.wikimedia.org/r/508928 [20:21:54] (03CR) 10jerkins-bot: [V: 04-1] [WIP] Puppet, add RPKI validation [puppet] - 10https://gerrit.wikimedia.org/r/508928 (owner: 10Ayounsi) [20:22:08] RECOVERY - Check Varnish expiry mailbox lag on cp5008 is OK: OK: expiry mailbox lag is 0 https://wikitech.wikimedia.org/wiki/Varnish [20:22:12] 10Operations, 10Fr-CentralNotice-Translation-Bugs, 10Fundraising-Backlog, 10MediaWiki-extensions-CentralNotice, 10MediaWiki-extensions-Translate: Publishing translations for central notice banners fails - https://phabricator.wikimedia.org/T104774 (10DStrine) a:05cwdent→03None [20:23:16] PROBLEM - puppet last run on notebook1003 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 6 minutes ago with 2 failures. Failed resources (up to 3 shown): Exec[git_pull_operations/mediawiki-config],Exec[git_pull_mediawiki/event-schemas] [20:29:18] PROBLEM - Check systemd state on ms-be2015 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [20:29:49] (03PS2) 10Ayounsi: [WIP] Puppet, add RPKI validation [puppet] - 10https://gerrit.wikimedia.org/r/508928 [20:30:18] RECOVERY - Check systemd state on ms-be2019 is OK: OK - running: The system is fully operational [20:30:23] (03CR) 10jerkins-bot: [V: 04-1] [WIP] Puppet, add RPKI validation [puppet] - 10https://gerrit.wikimedia.org/r/508928 (owner: 10Ayounsi) [20:31:36] PROBLEM - Check systemd state on ms-be2032 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
[20:34:31] (PS3) Ayounsi: Puppet, add RPKI validation daemon [puppet] - https://gerrit.wikimedia.org/r/508928
[20:34:44] RECOVERY - Check systemd state on ms-be2015 is OK: OK - running: The system is fully operational
[20:37:18] (PS3) Pmiazga: beta: Enable content provider on de beta cluster for QA purposes [mediawiki-config] - https://gerrit.wikimedia.org/r/508631 (https://phabricator.wikimedia.org/T216961) (owner: Jdlrobson)
[20:38:24] (PS1) Catrope: Remove rcenhancedfilters from beta features whitelist [mediawiki-config] - https://gerrit.wikimedia.org/r/508932 (https://phabricator.wikimedia.org/T196033)
[20:41:52] PROBLEM - Check systemd state on ms-be2020 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[20:42:43] (CR) Pmiazga: [C: -1] "Incorrect config keys, MFContentProviderClass is defined twice, the second key should be MFMwApiContentProviderBaseUri, not MFContentProvi" [mediawiki-config] - https://gerrit.wikimedia.org/r/508631 (https://phabricator.wikimedia.org/T216961) (owner: Jdlrobson)
[20:44:37] (PS4) Pmiazga: beta: Enable content provider on de beta cluster for QA purposes [mediawiki-config] - https://gerrit.wikimedia.org/r/508631 (https://phabricator.wikimedia.org/T216961) (owner: Jdlrobson)
[20:45:56] PROBLEM - Check systemd state on ms-be2029 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[20:49:58] RECOVERY - puppet last run on notebook1003 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures
[20:50:16] (PS5) Pmiazga: beta: Enable content provider on de beta cluster for QA purposes [mediawiki-config] - https://gerrit.wikimedia.org/r/508631 (https://phabricator.wikimedia.org/T216961) (owner: Jdlrobson)
[20:51:02] RECOVERY - puppet last run on stat1007 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures
[20:53:45] (Abandoned) Jforrester: Remove rcenhancedfilters from beta features whitelist [mediawiki-config] - https://gerrit.wikimedia.org/r/508932 (https://phabricator.wikimedia.org/T196033) (owner: Catrope)
[20:55:26] (PS1) RobH: fixing restbase102[7-9] prod dns [dns] - https://gerrit.wikimedia.org/r/508936 (https://phabricator.wikimedia.org/T219404)
[20:56:31] (PS6) Pmiazga: beta: Enable content provider on de beta cluster for QA purposes [mediawiki-config] - https://gerrit.wikimedia.org/r/508631 (https://phabricator.wikimedia.org/T216961) (owner: Jdlrobson)
[20:56:55] (CR) RobH: [C: +2] fixing restbase102[7-9] prod dns [dns] - https://gerrit.wikimedia.org/r/508936 (https://phabricator.wikimedia.org/T219404) (owner: RobH)
[20:57:28] (CR) Pmiazga: "Fixed the incorrect config key (renamed the second wgMFContentProviderClass to wgMFMwApiContentProviderBaseUri), and fixed code style - re" [mediawiki-config] - https://gerrit.wikimedia.org/r/508631 (https://phabricator.wikimedia.org/T216961) (owner: Jdlrobson)
[20:58:10] RECOVERY - Check systemd state on ms-be2029 is OK: OK - running: The system is fully operational
[21:00:04] Niharika: I, the Bot under the Fountain, allow thee, The Deployer, to do Partial blocks bug swat deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190508T2100).
[21:02:36] jouncebot: Thanks, buddy.
[21:02:51] (CR) Jdlrobson: [C: +1] "Thanks for switching " to '" [mediawiki-config] - https://gerrit.wikimedia.org/r/508631 (https://phabricator.wikimedia.org/T216961) (owner: Jdlrobson)
[21:05:55] Niharika: It's not a SWAT deploy, it's a bespoke deploy window. :-P
[21:09:42] RECOVERY - Check systemd state on ms-be2032 is OK: OK - running: The system is fully operational
[21:11:05] (PS4) Ayounsi: Puppet, add RPKI validation daemon [puppet] - https://gerrit.wikimedia.org/r/508928
[21:12:58] (CR) Pmiazga: [C: +2] beta: Enable content provider on de beta cluster for QA purposes [mediawiki-config] - https://gerrit.wikimedia.org/r/508631 (https://phabricator.wikimedia.org/T216961) (owner: Jdlrobson)
[21:14:05] Operations, Traffic, Core Platform Team Backlog (Watching / External), Patch-For-Review, Services (watching): Package libvmod-uuid for Debian - https://phabricator.wikimedia.org/T221977 (mobrovac) @ema since the pkg has been uploaded, are we now good here? Ok to resolve the task or is there s...
[21:14:15] (Merged) jenkins-bot: beta: Enable content provider on de beta cluster for QA purposes [mediawiki-config] - https://gerrit.wikimedia.org/r/508631 (https://phabricator.wikimedia.org/T216961) (owner: Jdlrobson)
[21:14:17] James_F: I exert my right to call this my SWAT window.
[21:14:29] Hey. I just merged a mediawiki-config change - to config on beta cluster
[21:14:41] raynor: Okay.
[21:15:01] we need to change one thing on beta so our QA can test stuff, it's a no-op for prod
[21:15:35] do you want me to sync the prod servers, or is it possible for you to rebase/sync it during morning swat window?
[21:16:15] raynor: Just leave it. I'll fix it when Niharika is done because I have an UBN.
[21:16:21] (CR) jenkins-bot: beta: Enable content provider on de beta cluster for QA purposes [mediawiki-config] - https://gerrit.wikimedia.org/r/508631 (https://phabricator.wikimedia.org/T216961) (owner: Jdlrobson)
[21:16:34] PROBLEM - Check systemd state on ms-be2033 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[21:16:44] James_F: Hey hey, it's my window!
[21:16:55] Niharika, James_F btw - what is the "Partial blocks bug swat" ? First time I see sth like that
[21:17:01] Niharika: UBNs over-ride windows generally, but as you're already going I'm waiting for you.
[21:17:09] raynor: It's because it's not a thing, Niharika is just being confusing.
[21:17:14] James_F: Cool, thanks!
[21:17:21] Niharika: Aka please hurry up. ;-)
[21:17:29] James_F: Waiting on Zuul.
[21:17:37] ok, no problem, thanks for explanation. I thought it's something new
[21:17:39] Story of my life.
[21:17:57] ok, thanks for taking care of my beta cluster config change. I own you James_F
[21:18:28] I guess that was supposed to be owe?
[21:18:30] Yessir! ;-)
[21:18:37] James_F: Questions from the room about your use of "over-ride".
[21:18:38] bawolff: Eh.
[21:19:07] Niharika: I don't use the "t-word" as much because it's triggering. ;-)
[21:20:25] yea, sorry, owe*, it's pretty late here :)
[21:20:25] James_F: We meant the hyphen.
[21:21:20] RECOVERY - Check systemd state on ms-be2020 is OK: OK - running: The system is fully operational
[21:21:35] Niharika: How else to indicate the differentiated phoneme and the non-rolling of the double-r?
[21:23:38] Moriel calls BS. :P
[21:27:41] Operations, ops-eqiad, RESTBase, Core Platform Team Backlog (Watching / External), and 2 others: rack/setup/install restbase10[19-27].eqiad.wmnet - https://phabricator.wikimedia.org/T219404 (RobH)
[21:30:12] Operations, ops-eqiad, RESTBase, Core Platform Team Backlog (Watching / External), and 2 others: rack/setup/install restbase10[19-27].eqiad.wmnet - https://phabricator.wikimedia.org/T219404 (RobH)
[21:30:56] * James_F sighs at how long gate can take.
[21:34:05] (PS1) Reedy: Revert "striker: Disable developer account creation" [puppet] - https://gerrit.wikimedia.org/r/508944 (https://phabricator.wikimedia.org/T222844)
[21:34:24] (CR) jerkins-bot: [V: -1] Revert "striker: Disable developer account creation" [puppet] - https://gerrit.wikimedia.org/r/508944 (https://phabricator.wikimedia.org/T222844) (owner: Reedy)
[21:35:22] chasemp: ^
[21:35:30] (PS2) Reedy: Revert "striker: Disable developer account creation" [puppet] - https://gerrit.wikimedia.org/r/508944 (https://phabricator.wikimedia.org/T222844)
[21:37:24] Niharika: Yay, it's finally landed.
[21:39:54] RECOVERY - Check systemd state on ms-be2033 is OK: OK - running: The system is fully operational
[21:46:46] James_F: On it. Gmme five.
[21:47:13] Always. :-)
[21:49:25] !log niharika29@deploy1001 Started scap: php-1.34.0-wmf.4/includes/Block.php Fix Block::newLoad for IPv6 range blocks - follow-up to Ie8bebd8 T222246
[21:49:28] !log niharika29@deploy1001 sync aborted: php-1.34.0-wmf.4/includes/Block.php Fix Block::newLoad for IPv6 range blocks - follow-up to Ie8bebd8 T222246 (duration: 00m 03s)
[21:49:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:49:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:50:44] !log niharika29@deploy1001 Synchronized php-1.34.0-wmf.4/includes/Block.php: Fix Block::newLoad for IPv6 range blocks - follow-up to Ie8bebd8 T222246 (duration: 00m 59s)
[21:50:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:52:59] !log niharika29@deploy1001 Synchronized php-1.34.0-wmf.4/tests/phpunit/: Fix Block::newLoad for IPv6 range blocks - follow-up to Ie8bebd8 T222246 (duration: 01m 09s)
[21:53:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:53:55] Operations, ops-codfw: db2112 doesn't show service tag in idrac - https://phabricator.wikimedia.org/T222845 (RobH) p:Triage→Normal
[21:54:47] James_F: All yours.
[21:55:09] Niharika: Thanks!
[21:55:21] (CR) Jforrester: [C: +2] Beta Features whitelist: Update dates of expected promotion [mediawiki-config] - https://gerrit.wikimedia.org/r/508881 (owner: Jforrester)
[21:55:24] (CR) Jforrester: [C: +2] Beta Features whitelist: Drop TemplateWizard, now everywhere [mediawiki-config] - https://gerrit.wikimedia.org/r/508882 (owner: Jforrester)
[21:55:28] (CR) Jforrester: [C: +2] Beta Features whitelist: Drop AdvancedSearch, now everywhere [mediawiki-config] - https://gerrit.wikimedia.org/r/508883 (owner: Jforrester)
[21:56:04] James_F: I'm in after you, if time allows before swat.
[21:56:41] (Merged) jenkins-bot: Beta Features whitelist: Update dates of expected promotion [mediawiki-config] - https://gerrit.wikimedia.org/r/508881 (owner: Jforrester)
[21:57:34] Krinkle: Thankfully I've gone one extension patch, already merged and synching now, and then just some quick configs.
[21:58:03] !log jforrester@deploy1001 Synchronized php-1.34.0-wmf.4/extensions/VisualEditor/includes/ApiVisualEditor.php: UBN T209599 ApiVisualEditor: Fix use of getBlockInfo() (duration: 00m 57s)
[21:58:06] just extend the window
[21:58:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:58:08] T209599: Show block notices on mobile visual editor - https://phabricator.wikimedia.org/T209599
[21:58:58] (Merged) jenkins-bot: Beta Features whitelist: Drop TemplateWizard, now everywhere [mediawiki-config] - https://gerrit.wikimedia.org/r/508882 (owner: Jforrester)
[21:59:09] (Merged) jenkins-bot: Beta Features whitelist: Drop AdvancedSearch, now everywhere [mediawiki-config] - https://gerrit.wikimedia.org/r/508883 (owner: Jforrester)
[22:00:28] Krinkle: James_F forgot to snub you for calling this the "swat".
[22:00:51] !log jforrester@deploy1001 Synchronized wmf-config/InitialiseSettings.php: Beta Feature config cleanup: doc change plus drop advancedsearch and templatewizard-betafeature (duration: 00m 57s)
[22:00:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:01:02] Niharika: No no, the one at 16:00 *is* the SWAT.
[22:04:07] Oh! right. :)
[22:05:34] James_F: OK, landing a wmf.3 patch now to mw repos.
[22:05:52] Krinkle: Cool, I'll be fully done in prod ~5 minutes' time.
[22:06:08] (Did wikibugs die?)
[22:06:23] by that time Jenkins will have started thinking about testing my patch, and I'll have completed the dishes and taken out the trash...
[22:06:30] Indeed. :-)
[22:06:34] !log jforrester@deploy1001 Synchronized wmf-config/CommonSettings.php: Unconditionally load AdvancedSearch everywhere, the config is always true (duration: 00m 57s)
[22:06:34] I think so...It didn't post on my phab task for my last deploy.
[22:06:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:06:39] (CR) Jforrester: [C: +2] AdvancedSearch: Stop pretending loading this varies by wiki, it doesn't [mediawiki-config] - https://gerrit.wikimedia.org/r/508884 (owner: Jforrester)
[22:06:44] Oh, now it's awake.
[22:07:31] (CR) jerkins-bot: [V: -1] AdvancedSearch: Stop pretending loading this varies by wiki, it doesn't [mediawiki-config] - https://gerrit.wikimedia.org/r/508884 (owner: Jforrester)
[22:07:35] (PS2) Jforrester: AdvancedSearch: Stop pretending loading this varies by wiki, it doesn't [mediawiki-config] - https://gerrit.wikimedia.org/r/508884
[22:07:37] (CR) Jforrester: [C: +2] "…" [mediawiki-config] - https://gerrit.wikimedia.org/r/508884 (owner: Jforrester)
[22:07:43] (PS2) Jforrester: Beta Features whitelist: Drop RCFilters, now everywhere [mediawiki-config] - https://gerrit.wikimedia.org/r/508888
[22:07:45] (CR) Jforrester: [C: +2] Beta Features whitelist: Drop RCFilters, now everywhere [mediawiki-config] - https://gerrit.wikimedia.org/r/508888 (owner: Jforrester)
[22:07:49] (CR) jenkins-bot: Beta Features whitelist: Update dates of expected promotion [mediawiki-config] - https://gerrit.wikimedia.org/r/508881 (owner: Jforrester)
[22:07:55] (Merged) jenkins-bot: AdvancedSearch: Stop pretending loading this varies by wiki, it doesn't [mediawiki-config] - https://gerrit.wikimedia.org/r/508884 (owner: Jforrester)
[22:08:03] (Merged) jenkins-bot: Beta Features whitelist: Drop RCFilters, now everywhere [mediawiki-config] - https://gerrit.wikimedia.org/r/508888 (owner: Jforrester)
[22:08:13] Krinkle: Also ping again as to whether it's OK to land https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/508724 and https://gerrit.wikimedia.org/r/#/c/508726/ :-)
[22:08:14] !log jforrester@deploy1001 Synchronized wmf-config/InitialiseSettings.php: Stop setting wmgUseAdvancedSearch, no longer read; drop rcenhancedfilters from BF whitelist (duration: 00m 57s)
[22:08:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:08:34] OK, all done. Prod is still up. Over to Krinkle to try to take it down.
[22:08:52] James_F: beta patch, no, prod config yes - but I'll do that myself at some point when I have time to mwdebug test it properly
[22:09:04] OK. Will C-1 the beta patch then.
[22:09:09] beta works, I checked it
[22:09:15] (CR) Jforrester: [C: -1] "Per Krinkle." [mediawiki-config] - https://gerrit.wikimedia.org/r/508724 (https://phabricator.wikimedia.org/T99740) (owner: Jforrester)
[22:09:24] raynor: Yeah, sorry, didn't say, all clear on that front.
[22:09:27] (CR) Krinkle: [C: -1] "Blocked on T105683" [mediawiki-config] - https://gerrit.wikimedia.org/r/508724 (https://phabricator.wikimedia.org/T99740) (owner: Jforrester)
[22:10:17] ok, thx for letting me know
[22:16:42] (PS1) CRusnov: Add decommissioning status support to reports [software/netbox-reports] - https://gerrit.wikimedia.org/r/508951
[22:16:58] (CR) Dzahn: [C: +2] Fix apache-fast-test to support https [puppet] - https://gerrit.wikimedia.org/r/508737 (owner: Paladox)
[22:17:08] (PS13) Dzahn: Fix apache-fast-test to support https [puppet] - https://gerrit.wikimedia.org/r/508737 (owner: Paladox)
[22:18:07] (CR) jerkins-bot: [V: -1] Add decommissioning status support to reports [software/netbox-reports] - https://gerrit.wikimedia.org/r/508951 (owner: CRusnov)
[22:19:33] (PS2) CRusnov: Add decommissioning status support to reports [software/netbox-reports] - https://gerrit.wikimedia.org/r/508951
[22:24:00] (CR) Dzahn: "thanks Paladox and Jeff, yes, it is still useful and on deployment and maintenance servers and also in cloud VPS ;)" [puppet] - https://gerrit.wikimedia.org/r/508737 (owner: Paladox)
[22:24:12] :)
[22:25:43] (PS1) Paladox: Add prometheus server for gerrit javamelody monitoring [puppet] - https://gerrit.wikimedia.org/r/508952
[22:26:22] (PS2) Paladox: Add prometheus server for gerrit javamelody monitoring [puppet] - https://gerrit.wikimedia.org/r/508952
[22:26:28] (PS15) Paladox: Gerrit: Set plugin.javamelody.prometheusBearerToken [puppet] - https://gerrit.wikimedia.org/r/508621 (https://phabricator.wikimedia.org/T184086)
[22:26:36] (PS16) Paladox: Gerrit: Set plugin.javamelody.prometheusBearerToken [puppet] - https://gerrit.wikimedia.org/r/508621 (https://phabricator.wikimedia.org/T184086)
[22:26:39] (PS1) QChris: Add .gitreview [debs/helm-diff] - https://gerrit.wikimedia.org/r/508953
[22:26:41] (CR) QChris: [V: +2 C: +2] Add .gitreview [debs/helm-diff] - https://gerrit.wikimedia.org/r/508953 (owner: QChris)
[22:26:45] PROBLEM - Check systemd state on ms-be2035 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[22:27:53] James_F: all clear?
[22:27:58] (CR) Paladox: "@Filippo Giunchedi i did this https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/508952/ (follow up that creates the prometheus scrapp" (1 comment) [puppet] - https://gerrit.wikimedia.org/r/508621 (https://phabricator.wikimedia.org/T184086) (owner: Paladox)
[22:28:10] Ah, missed it,
[22:28:12] okay, thanks
[22:28:37] * Krinkle staging on mwdebug1002
[22:28:51] AaronSchulz: stand by for wmf.3 testing of the watchlist patch
[22:30:20] ok, live now on mwdebug1002, let me know when it's verified with XWD. Will do https://gerrit.wikimedia.org/r/508889 meanwhile.
[22:30:22] (PS3) Paladox: Add prometheus server for gerrit javamelody monitoring [puppet] - https://gerrit.wikimedia.org/r/508952 (https://phabricator.wikimedia.org/T184086)
[22:32:24] (PS1) QChris: Add .gitreview [debs/helm-secrets] - https://gerrit.wikimedia.org/r/508955
[22:32:26] (CR) QChris: [V: +2 C: +2] Add .gitreview [debs/helm-secrets] - https://gerrit.wikimedia.org/r/508955 (owner: QChris)
[22:36:45] (PS1) Ayounsi: Prometheus, add Routinator endpoint [puppet] - https://gerrit.wikimedia.org/r/508956
[22:41:30] (CR) Dzahn: "How many projects do we have?" [puppet] - https://gerrit.wikimedia.org/r/508892 (owner: Paladox)
[22:41:53] (CR) Paladox: "> How many projects do we have?" [puppet] - https://gerrit.wikimedia.org/r/508892 (owner: Paladox)
[22:42:18] (CR) Volans: "LGTM, just a nit inline. I'll take care of merging it tomorrow after the migration." (4 comments) [software/netbox-reports] - https://gerrit.wikimedia.org/r/508951 (owner: CRusnov)
[22:43:37] (CR) Dzahn: "But would there still be a way to see all projects (in batches / by clicking next page)?" [puppet] - https://gerrit.wikimedia.org/r/508892 (owner: Paladox)
[22:44:45] (CR) Paladox: "> But would there still be a way to see all projects (in batches / by" [puppet] - https://gerrit.wikimedia.org/r/508892 (owner: Paladox)
[22:45:10] (CR) Paladox: "Here's the config url https://gerrit-review.googlesource.com/Documentation/config-gerrit.html#gerrit.listProjectsFromIndex" [puppet] - https://gerrit.wikimedia.org/r/508892 (owner: Paladox)
[22:45:24] (CR) Dzahn: [C: +1] "maybe we can get back to this now" [puppet] - https://gerrit.wikimedia.org/r/505410 (https://phabricator.wikimedia.org/T221499) (owner: Alex Monk)
[22:46:59] (CR) Dzahn: "> Patch Set 2:" [puppet] - https://gerrit.wikimedia.org/r/508892 (owner: Paladox)
[22:47:08] (CR) Thcipriani: "> How many projects do we have?" [puppet] - https://gerrit.wikimedia.org/r/508892 (owner: Paladox)
[22:48:11] (CR) Paladox: "> > Patch Set 2:" [puppet] - https://gerrit.wikimedia.org/r/508892 (owner: Paladox)
[22:48:14] (CR) Dzahn: "I guess we can do this and limit it but the query limit doesn't have to be the default of 500 https://gerrit-review.googlesource.com/Docum" [puppet] - https://gerrit.wikimedia.org/r/508892 (owner: Paladox)
[22:48:46] (CR) Paladox: "> I guess we can do this and limit it but the query limit doesn't" [puppet] - https://gerrit.wikimedia.org/r/508892 (owner: Paladox)
[22:51:37] (CR) Paladox: "See https://gerrit-review.googlesource.com/c/gerrit/+/214472" [puppet] - https://gerrit.wikimedia.org/r/508892 (owner: Paladox)
[22:57:00] (PS3) Jforrester: De-duplicate …Squid variables now MW only uses the …Cdn ones [mediawiki-config] - https://gerrit.wikimedia.org/r/496850 (https://phabricator.wikimedia.org/T104148)
[22:57:07] (CR) Jforrester: [C: -2] De-duplicate …Squid variables now MW only uses the …Cdn ones [mediawiki-config] - https://gerrit.wikimedia.org/r/496850 (https://phabricator.wikimedia.org/T104148) (owner: Jforrester)
[22:57:12] (CR) Jforrester: [C: -2] Stop setting wmgUseClusterSquid, never varied, no longer used [mediawiki-config] - https://gerrit.wikimedia.org/r/496849 (owner: Jforrester)
[22:57:44] Staging the wmf.4 patch now
[22:57:54] ( AaronSchulz: )
[22:58:21] (CR) Jforrester: [C: -2] Stop reading wmgUseClusterSquid, never varied (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/496848 (owner: Jforrester)
[22:59:56] Krinkle: aye
[23:00:04] MaxSem, RoanKattouw, and Niharika: Time to snap out of that daydream and deploy Evening SWAT (Max 6 patches). Get on with it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190508T2300).
[23:00:04] No GERRIT patches in the queue for this window AFAICS.
[23:00:40] !log krinkle@deploy1001 Synchronized php-1.34.0-wmf.4/extensions/CirrusSearch/includes/Hooks.php: T219342 / 164a7c135c800cf73f7fbfc (duration: 00m 59s)
[23:00:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:00:46] T219342: Pageview performance timeline analysis (March 2019) - https://phabricator.wikimedia.org/T219342
[23:02:31] AaronSchulz: https://gerrit.wikimedia.org/r/#/c/mediawiki/core/+/508728/ is staged for wmf.3 on mwdebug1002
[23:03:44] Krinkle: yeah, seems ok
[23:06:22] !log krinkle@deploy1001 Synchronized php-1.34.0-wmf.3/includes/specials/SpecialWatchlist.php: T218511 / I42387498dff0b1 (duration: 00m 57s)
[23:06:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:06:27] T218511: After opening a diff, entry on Special:Watchlist sometimes stays unread (bold) - https://phabricator.wikimedia.org/T218511
[23:09:02] AaronSchulz: ok, in prod now :)
[23:11:57] (PS5) Faidon Liambotis: Add "accounting" report [software/netbox-reports] - https://gerrit.wikimedia.org/r/506663
[23:11:59] (CR) Faidon Liambotis: Add "accounting" report (10 comments) [software/netbox-reports] - https://gerrit.wikimedia.org/r/506663 (owner: Faidon Liambotis)
[23:12:24] (PS6) Faidon Liambotis: Add "accounting" report [software/netbox-reports] - https://gerrit.wikimedia.org/r/506663
[23:13:55] Operations, Gerrit, cloud-services-team, serviceops: Change /r/p/ to /r/ on all hosts (where https://gerrit.wikimedia.org/r/p/ exists) - https://phabricator.wikimedia.org/T222093 (Paladox)
[23:13:59] (PS3) CRusnov: Add decommissioning status support to reports [software/netbox-reports] - https://gerrit.wikimedia.org/r/508951
[23:15:28] (CR) CRusnov: "Roger dodger." (4 comments) [software/netbox-reports] - https://gerrit.wikimedia.org/r/508951 (owner: CRusnov)
[23:18:25] RECOVERY - Check systemd state on ms-be2035 is OK: OK - running: The system is fully operational
[23:25:56] Operations, MediaWiki-Cache, serviceops, Core Platform Team (Security, stability, performance and scalability (TEC1)), and 5 others: Use a multi-dc aware store for ObjectCache's MainStash if needed. - https://phabricator.wikimedia.org/T212129 (Catrope)
[23:59:05] PROBLEM - HP RAID on ms-be2028 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.0.88: Connection reset by peer https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering