[01:32:19] !log tstarling@deploy1001 Synchronized php-1.35.0-wmf.5/includes/parser/Parser.php: deploying REST compare section feature because iOS team need it for a beta release due very soon (duration: 00m 54s)
[01:32:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:33:44] !log tstarling@deploy1001 Synchronized php-1.35.0-wmf.5/includes/Rest/coreRoutes.json: deploying REST compare section feature because iOS team need it for a beta release due very soon (duration: 00m 52s)
[01:33:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:35:49] !log tstarling@deploy1001 Synchronized php-1.35.0-wmf.5/includes/Rest/Handler/CompareHandler.php: deploying REST compare section feature because iOS team need it for a beta release due very soon (duration: 00m 53s)
[01:35:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:10:53] Operations, Cloud-Services: maintain-meta_p hangs on connecting to wikimedia.org.uk - https://phabricator.wikimedia.org/T164490 (Dzahn) duplicate of T168436 ?
[02:13:53] Operations, ops-requests: Redirect uk.wikimedia.org to wikimedia.org.uk - https://phabricator.wikimedia.org/T83509 (Dzahn) a:faidon→None
[02:25:03] PROBLEM - Check systemd state on an-presto1003 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:25:11] PROBLEM - Check systemd state on an-presto1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:25:43] PROBLEM - Check systemd state on an-presto1002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:26:07] PROBLEM - Check systemd state on an-presto1004 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:26:37] PROBLEM - Check systemd state on an-presto1005 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:28:19] RECOVERY - Check systemd state on an-presto1005 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:31:09] Operations, Wikimedia-Mailing-lists, Privacy, Security, User-Josve05a: Stop storing Mailman passwords in plain text - https://phabricator.wikimedia.org/T181803 (Apap04) >>! In T181803#5662528, @Bawolff wrote: > Its kind of obvious when mailman keeps sending people monthly password reminders...
[02:31:11] RECOVERY - Check systemd state on an-presto1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:33:33] RECOVERY - Check systemd state on an-presto1003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:33:41] RECOVERY - Check systemd state on an-presto1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:49:29] RECOVERY - Check systemd state on an-presto1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[03:06:03] (PS3) Vgutierrez: vcl: Use synthetic warning for 1% of TLSv1/TLSv1.1 pageviews [puppet] - https://gerrit.wikimedia.org/r/550856 (https://phabricator.wikimedia.org/T238038)
[03:06:05] (PS2) Vgutierrez: vcl: Bump TLSv1/TLSv1.1 pageview replacement to 4% [puppet] -
https://gerrit.wikimedia.org/r/550868 (https://phabricator.wikimedia.org/T238038)
[03:06:07] (PS2) Vgutierrez: vcl: Bump TLSv1/TLSv1.1 pageview replacement to 10% [puppet] - https://gerrit.wikimedia.org/r/550869 (https://phabricator.wikimedia.org/T238038)
[03:06:09] (PS2) Vgutierrez: vcl: Bump TLSv1/TLSv1.1 pageview replacement to 100% [puppet] - https://gerrit.wikimedia.org/r/550870 (https://phabricator.wikimedia.org/T238038)
[03:16:15] (CR) Vgutierrez: vcl: Use synthetic warning for 1% of TLSv1/TLSv1.1 pageviews (1 comment) [puppet] - https://gerrit.wikimedia.org/r/550856 (https://phabricator.wikimedia.org/T238038) (owner: Vgutierrez)
[03:21:53] (CR) BBlack: [C: +1] vcl: Use synthetic warning for 1% of TLSv1/TLSv1.1 pageviews [puppet] - https://gerrit.wikimedia.org/r/550856 (https://phabricator.wikimedia.org/T238038) (owner: Vgutierrez)
[03:30:31] PROBLEM - Varnish traffic drop between 30min ago and now at esams on icinga1001 is CRITICAL: 55.38 le 60 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1
[03:32:11] RECOVERY - Varnish traffic drop between 30min ago and now at esams on icinga1001 is OK: (C)60 le (W)70 le 83.39 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1
[03:34:23] ^^ spike 30 mins ago :)
[03:38:10] (CR) Vgutierrez: [C: -1] "+1 as soon as you clarify the sock_option_flag_in value" (1 comment) [puppet] - https://gerrit.wikimedia.org/r/550866 (https://phabricator.wikimedia.org/T227432) (owner: Ema)
[03:54:30] (PS1) Vgutierrez: hiera: Move nginx from port 443 to port 4443 on cp3060 [puppet] - https://gerrit.wikimedia.org/r/551004 (https://phabricator.wikimedia.org/T231627)
[03:54:32] (PS1) Vgutierrez: hiera: Move ats-tls from port 8443 to port 443 on cp3060 [puppet] -
https://gerrit.wikimedia.org/r/551005 (https://phabricator.wikimedia.org/T231627)
[04:00:43] !log Move cp3060 from nginx to ats-tls - T231627
[04:00:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:00:49] T231627: Move cache text cluster from nginx to ats-tls - https://phabricator.wikimedia.org/T231627
[04:01:32] (CR) Vgutierrez: [C: +2] hiera: Move nginx from port 443 to port 4443 on cp3060 [puppet] - https://gerrit.wikimedia.org/r/551004 (https://phabricator.wikimedia.org/T231627) (owner: Vgutierrez)
[04:03:15] (CR) Vgutierrez: [C: +2] hiera: Move ats-tls from port 8443 to port 443 on cp3060 [puppet] - https://gerrit.wikimedia.org/r/551005 (https://phabricator.wikimedia.org/T231627) (owner: Vgutierrez)
[04:13:08] Operations, Traffic: Move cache text cluster from nginx to ats-tls - https://phabricator.wikimedia.org/T231627 (Vgutierrez)
[04:16:52] (PS1) Vgutierrez: hiera: Move nginx from port 443 to port 4443 on cp3062 [puppet] - https://gerrit.wikimedia.org/r/551006 (https://phabricator.wikimedia.org/T231627)
[04:16:54] (PS1) Vgutierrez: hiera: Move ats-tls from port 8443 to port 443 on cp3062 [puppet] - https://gerrit.wikimedia.org/r/551007 (https://phabricator.wikimedia.org/T231627)
[04:17:03] !log Move cp3062 from nginx to ats-tls - T231627
[04:17:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:17:08] T231627: Move cache text cluster from nginx to ats-tls - https://phabricator.wikimedia.org/T231627
[04:19:32] (CR) Vgutierrez: [C: +2] hiera: Move nginx from port 443 to port 4443 on cp3062 [puppet] - https://gerrit.wikimedia.org/r/551006 (https://phabricator.wikimedia.org/T231627) (owner: Vgutierrez)
[04:21:38] (CR) Vgutierrez: [C: +2] hiera: Move ats-tls from port 8443 to port 443 on cp3062 [puppet] - https://gerrit.wikimedia.org/r/551007 (https://phabricator.wikimedia.org/T231627) (owner: Vgutierrez)
[04:34:22] Operations,
Traffic, Patch-For-Review: Move cache text cluster from nginx to ats-tls - https://phabricator.wikimedia.org/T231627 (Vgutierrez)
[04:38:39] !log volker-e@deploy1001 Started deploy [design/style-guide@2ad7b1a]: Deploy design/style-guide:
[04:38:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:38:47] !log volker-e@deploy1001 Finished deploy [design/style-guide@2ad7b1a]: Deploy design/style-guide: (duration: 00m 07s)
[04:38:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:44:28] (PS1) Vgutierrez: hiera: Set nginx on port 4443 on esams [puppet] - https://gerrit.wikimedia.org/r/551010 (https://phabricator.wikimedia.org/T231627)
[04:44:30] (PS1) Vgutierrez: hiera: Set ats-tls on port 443 on esams [puppet] - https://gerrit.wikimedia.org/r/551011 (https://phabricator.wikimedia.org/T231627)
[04:53:35] (PS2) Vgutierrez: hiera: Set nginx on port 4443 for the text/text_ats cluster on esams [puppet] - https://gerrit.wikimedia.org/r/551010 (https://phabricator.wikimedia.org/T231627)
[04:53:37] (PS2) Vgutierrez: hiera: Set ats-tls on port 443 on esams [puppet] - https://gerrit.wikimedia.org/r/551011 (https://phabricator.wikimedia.org/T231627)
[05:00:28] (CR) Vgutierrez: "PCC looks happy: https://puppet-compiler.wmflabs.org/compiler1002/19400/" [puppet] - https://gerrit.wikimedia.org/r/551010 (https://phabricator.wikimedia.org/T231627) (owner: Vgutierrez)
[05:01:33] (PS3) Vgutierrez: hiera: Set ats-tls on port 443 for the text/text_ats cluster on esams [puppet] - https://gerrit.wikimedia.org/r/551011 (https://phabricator.wikimedia.org/T231627)
[05:06:33] (CR) Vgutierrez: "https://puppet-compiler.wmflabs.org/compiler1003/19401/ pcc is happy" [puppet] - https://gerrit.wikimedia.org/r/551011 (https://phabricator.wikimedia.org/T231627) (owner: Vgutierrez)
[05:06:49] (CR) Vgutierrez: [C: +2] hiera: Set nginx on port 4443 for the text/text_ats cluster on esams
[puppet] - https://gerrit.wikimedia.org/r/551010 (https://phabricator.wikimedia.org/T231627) (owner: Vgutierrez)
[05:07:26] !log Move cp3064 from nginx to ats-tls - T231627
[05:07:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:07:31] T231627: Move cache text cluster from nginx to ats-tls - https://phabricator.wikimedia.org/T231627
[05:10:13] (CR) Vgutierrez: [C: +2] hiera: Set ats-tls on port 443 for the text/text_ats cluster on esams [puppet] - https://gerrit.wikimedia.org/r/551011 (https://phabricator.wikimedia.org/T231627) (owner: Vgutierrez)
[05:12:40] PROBLEM - HTTPS Unified ECDSA on cp3064 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/HTTPS
[05:12:42] PROBLEM - HTTPS Unified RSA on cp3064 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/HTTPS
[05:14:24] RECOVERY - HTTPS Unified ECDSA on cp3064 is OK: SSL OK - OCSP staple validity for en.wikipedia.org has 345519 seconds left:Certificate *.wikipedia.org contains all required SANs:Certificate *.wikipedia.org (ECDSA) valid until 2020-11-22 07:59:59 +0000 (expires in 373 days) https://wikitech.wikimedia.org/wiki/HTTPS
[05:14:24] RECOVERY - HTTPS Unified RSA on cp3064 is OK: SSL OK - OCSP staple validity for en.wikipedia.org has 345519 seconds left:Certificate *.wikipedia.org contains all required SANs:Certificate *.wikipedia.org (RSA) valid until 2020-11-22 07:59:59 +0000 (expires in 373 days) https://wikitech.wikimedia.org/wiki/HTTPS
[05:16:15] ^^ expected
[05:20:12] Operations, Traffic, Patch-For-Review: Move cache text cluster from nginx to ats-tls - https://phabricator.wikimedia.org/T231627 (Vgutierrez)
[05:41:27] PROBLEM - MariaDB Slave Lag: s6 on db2089 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 86827.24 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_slave
[05:48:57]
uh?
[05:49:02] I will check that
[05:49:31] thanks
[05:49:48] Ah
[05:49:53] It is an expired downtime
[05:49:56] ah ha
[05:50:39] any others due to expire soon?
[05:51:59] nope
[05:53:24] sweet! happy friday
[05:56:54] same to you! :*
[05:57:28] !log Run maintain-views for s5 on labsdb1009, labsdb1010, labsdb1012 (pending labsdb1011 as it is still running the schema change) T233135
[05:57:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:57:34] T233135: Schema change for refactored actor and comment storage - https://phabricator.wikimedia.org/T233135
[06:03:02] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repool db1103:3312 db1082 after schema changes', diff saved to https://phabricator.wikimedia.org/P9641 and previous config saved to /var/cache/conftool/dbconfig/20191115-060300-marostegui.json
[06:03:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:04:26] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db2088:3311 for compression', diff saved to https://phabricator.wikimedia.org/P9642 and previous config saved to /var/cache/conftool/dbconfig/20191115-060425-marostegui.json
[06:04:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:08:08] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1113:3315 for schema change and temporary pool db1082 into vslow,dump', diff saved to https://phabricator.wikimedia.org/P9643 and previous config saved to /var/cache/conftool/dbconfig/20191115-060807-marostegui.json
[06:08:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:20:22] (PS1) Marostegui: db1067: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/551032 (https://phabricator.wikimedia.org/T238297)
[06:26:46] (CR) Marostegui: [C: +2] db1067: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/551032 (https://phabricator.wikimedia.org/T238297) (owner: Marostegui)
[06:36:13] (PS1)
ArielGlenn: make dumpsdata1003 another secondary dumps NFS server along with dumpsdata1002 [puppet] - https://gerrit.wikimedia.org/r/551035 (https://phabricator.wikimedia.org/T224563)
[06:37:43] (PS1) Marostegui: mariadb: Provision db2134 into m3 codfw [puppet] - https://gerrit.wikimedia.org/r/551036 (https://phabricator.wikimedia.org/T238183)
[06:37:49] Operations, Core Platform Team, Traffic, Wikimedia-General-or-Unknown, User-DannyS712: Pages whose title ends with semicolon (;) are intermittently inaccessible - https://phabricator.wikimedia.org/T238285 (eranroz)
[06:39:31] (CR) ArielGlenn: [C: +2] make dumpsdata1003 another secondary dumps NFS server along with dumpsdata1002 [puppet] - https://gerrit.wikimedia.org/r/551035 (https://phabricator.wikimedia.org/T224563) (owner: ArielGlenn)
[06:39:33] (CR) Marostegui: [C: +2] mariadb: Provision db2134 into m3 codfw [puppet] - https://gerrit.wikimedia.org/r/551036 (https://phabricator.wikimedia.org/T238183) (owner: Marostegui)
[06:39:56] apergos: I merged your change
[06:40:05] bah humbug
[06:40:11] I was just typing the ssh command now ;-D
[06:40:20] haha
[06:40:20] (PS2) Marostegui: mariadb: Provision db2134 into m3 codfw [puppet] - https://gerrit.wikimedia.org/r/551036 (https://phabricator.wikimedia.org/T238183)
[06:40:22] * apergos switches tracks and goes right to the dumpsdata servers
[06:40:29] (CR) Marostegui: [V: +2 C: +2] mariadb: Provision db2134 into m3 codfw [puppet] - https://gerrit.wikimedia.org/r/551036 (https://phabricator.wikimedia.org/T238183) (owner: Marostegui)
[06:41:23] !log Stop MySQL on db2065 to clone db2134 (this will trigger an haproxy irc alert) - T238183
[06:41:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:41:28] T238183: Productionize db213[2-5} - https://phabricator.wikimedia.org/T238183
[06:41:57] if icinga whines about dumpsdata1003, ignore.
I'm working on it
[06:43:54] (PS1) ArielGlenn: add per-host configs for new dumps fallback NFS server [puppet] - https://gerrit.wikimedia.org/r/551038 (https://phabricator.wikimedia.org/T224563)
[06:47:54] (CR) ArielGlenn: [C: +2] add per-host configs for new dumps fallback NFS server [puppet] - https://gerrit.wikimedia.org/r/551038 (https://phabricator.wikimedia.org/T224563) (owner: ArielGlenn)
[07:05:02] (PS1) ArielGlenn: buster doesn't have mailx, replace with s-nail [puppet] - https://gerrit.wikimedia.org/r/551039 (https://phabricator.wikimedia.org/T224563)
[07:08:55] (CR) ArielGlenn: [C: +2] buster doesn't have mailx, replace with s-nail [puppet] - https://gerrit.wikimedia.org/r/551039 (https://phabricator.wikimedia.org/T224563) (owner: ArielGlenn)
[07:13:13] there should be no whines at this point, puppet is happy
[07:14:29] (PS1) Marostegui: mariadb: Promote db1131 to s6 master [puppet] - https://gerrit.wikimedia.org/r/551040 (https://phabricator.wikimedia.org/T235469)
[07:15:24] (PS1) Marostegui: wmnet: Update s6-master CNAME [dns] - https://gerrit.wikimedia.org/r/551041 (https://phabricator.wikimedia.org/T235469)
[07:15:48] (CR) Marostegui: [C: -2] "Wait for the failover day" [puppet] - https://gerrit.wikimedia.org/r/551040 (https://phabricator.wikimedia.org/T235469) (owner: Marostegui)
[07:16:41] (CR) Marostegui: [C: -2] "Wait for the failover day" [dns] - https://gerrit.wikimedia.org/r/551041 (https://phabricator.wikimedia.org/T235469) (owner: Marostegui)
[07:17:19] (PS1) ArielGlenn: fix up dump stats script to use either mail or s-nail [puppet] - https://gerrit.wikimedia.org/r/551042 (https://phabricator.wikimedia.org/T224563)
[08:02:06] (CR) Jcrespo: "> Patch Set 4:" [software/wmfmariadbpy] - https://gerrit.wikimedia.org/r/500043 (https://phabricator.wikimedia.org/T219631) (owner: Jcrespo)
[08:23:00] RECOVERY - haproxy failover on dbproxy2003 is OK: OK check_failover servers
up 2 down 0 https://wikitech.wikimedia.org/wiki/HAProxy
[08:27:01] (PS1) Vgutierrez: hiera: Move nginx from port 443 to port 4443 on cp2004 [puppet] - https://gerrit.wikimedia.org/r/551138 (https://phabricator.wikimedia.org/T231627)
[08:27:02] (PS1) Vgutierrez: hiera: Move ats-tls from port 8443 to port 443 on cp2004 [puppet] - https://gerrit.wikimedia.org/r/551139 (https://phabricator.wikimedia.org/T231627)
[08:30:11] !log Move cp2004 from nginx to ats-tls - T231627
[08:30:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:30:17] T231627: Move cache text cluster from nginx to ats-tls - https://phabricator.wikimedia.org/T231627
[08:30:41] (CR) Vgutierrez: [C: +2] hiera: Move nginx from port 443 to port 4443 on cp2004 [puppet] - https://gerrit.wikimedia.org/r/551138 (https://phabricator.wikimedia.org/T231627) (owner: Vgutierrez)
[08:32:13] (CR) Vgutierrez: [C: +2] hiera: Move ats-tls from port 8443 to port 443 on cp2004 [puppet] - https://gerrit.wikimedia.org/r/551139 (https://phabricator.wikimedia.org/T231627) (owner: Vgutierrez)
[08:34:28] (PS1) Marostegui: realm.pp: Add oauth2_access_tokens as a private table [puppet] - https://gerrit.wikimedia.org/r/551140 (https://phabricator.wikimedia.org/T238370)
[08:38:02] Operations, Traffic, Patch-For-Review: Move cache text cluster from nginx to ats-tls - https://phabricator.wikimedia.org/T231627 (Vgutierrez)
[08:40:24] (PS1) Vgutierrez: hiera: Move nginx from port 443 to port 4443 on cp2006 [puppet] - https://gerrit.wikimedia.org/r/551141 (https://phabricator.wikimedia.org/T231627)
[08:40:26] (PS1) Vgutierrez: hiera: Move ats-tls from port 8443 to port 443 on cp2006 [puppet] - https://gerrit.wikimedia.org/r/551142 (https://phabricator.wikimedia.org/T231627)
[08:42:16] (CR) Vgutierrez: [C: +2] hiera: Move nginx from port 443 to port 4443 on cp2006 [puppet] - https://gerrit.wikimedia.org/r/551141
(https://phabricator.wikimedia.org/T231627) (owner: Vgutierrez)
[08:42:44] !log Move cp2006 from nginx to ats-tls - T231627
[08:42:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:42:51] T231627: Move cache text cluster from nginx to ats-tls - https://phabricator.wikimedia.org/T231627
[08:44:19] (CR) Vgutierrez: [C: +2] hiera: Move ats-tls from port 8443 to port 443 on cp2006 [puppet] - https://gerrit.wikimedia.org/r/551142 (https://phabricator.wikimedia.org/T231627) (owner: Vgutierrez)
[08:49:05] (CR) Jcrespo: [C: +2] bacula: Fix calculation of success rate function on bacula check [puppet] - https://gerrit.wikimedia.org/r/550898 (owner: Jcrespo)
[08:49:18] (PS2) Jcrespo: bacula: Fix calculation of success rate function on bacula check [puppet] - https://gerrit.wikimedia.org/r/550898
[08:49:25] PROBLEM - Ensure traffic_manager binds on 8443 and responds to HTTP requests on cp2006 is CRITICAL: connect to address 10.192.0.127 and port 8443: Connection refused https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server
[08:50:12] that's expected
[08:51:41] PROBLEM - netbox HTTPS on netbox1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Netbox
[08:53:06] (CR) Jcrespo: [C: +2] mariadb-client: Install 10.4 on buster, unblock os upgrade [puppet] - https://gerrit.wikimedia.org/r/550647 (https://phabricator.wikimedia.org/T193224) (owner: Jcrespo)
[08:54:57] Operations, Traffic, Patch-For-Review: Move cache text cluster from nginx to ats-tls - https://phabricator.wikimedia.org/T231627 (Vgutierrez)
[08:55:15] ^^ mmm somebody messing with netbox?
[08:56:47] RECOVERY - netbox HTTPS on netbox1001 is OK: HTTP OK: HTTP/1.1 302 Found - 346 bytes in 0.012 second response time https://wikitech.wikimedia.org/wiki/Netbox
[08:57:03] sigh :)
[09:01:02] (PS1) Vgutierrez: hiera: Move nginx from port 443 to port 4443 on cp1077 [puppet] - https://gerrit.wikimedia.org/r/551143 (https://phabricator.wikimedia.org/T231627)
[09:01:04] (PS1) Vgutierrez: hiera: Move ats-tls from port 8443 to port 443 on cp1077 [puppet] - https://gerrit.wikimedia.org/r/551144 (https://phabricator.wikimedia.org/T231627)
[09:02:31] !log Move cp1077 from nginx to ats-tls - T231627
[09:02:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:02:36] T231627: Move cache text cluster from nginx to ats-tls - https://phabricator.wikimedia.org/T231627
[09:03:01] (CR) Vgutierrez: [C: +2] hiera: Move nginx from port 443 to port 4443 on cp1077 [puppet] - https://gerrit.wikimedia.org/r/551143 (https://phabricator.wikimedia.org/T231627) (owner: Vgutierrez)
[09:03:30] jynus: is ok if I merge your CR?
[09:03:36] Jcrespo: mariadb-client: Install 10.4 on buster, unblock os upgrade (003344c5f0)
[09:03:44] yes
[09:03:47] sorry
[09:03:56] np, merging :)
[09:04:10] done
[09:05:03] (CR) Vgutierrez: [C: +2] hiera: Move ats-tls from port 8443 to port 443 on cp1077 [puppet] - https://gerrit.wikimedia.org/r/551144 (https://phabricator.wikimedia.org/T231627) (owner: Vgutierrez)
[09:06:30] Operations, Gerrit: Gerrit loading slowly - https://phabricator.wikimedia.org/T215004 (GuillermoABrunner) Here they have shared some valuable information which is useful for the students who are struggling to prepare a quality assignment [[ https://businessays.net/business-communications-chapter6/ | http...
[09:06:41] (PS13) Jcrespo: bacula: Setup separate pool and defaults for database backups on backup1001 [puppet] - https://gerrit.wikimedia.org/r/550671 (https://phabricator.wikimedia.org/T238048)
[09:06:43] (PS1) Jcrespo: bacula: Add prometheus exporter [puppet] - https://gerrit.wikimedia.org/r/551145 (https://phabricator.wikimedia.org/T234900)
[09:07:44] (CR) Jcrespo: "Quick consultation that this is in the right direction. Also, could you assign me a monitoring port, if that is a thing?" [puppet] - https://gerrit.wikimedia.org/r/551145 (https://phabricator.wikimedia.org/T234900) (owner: Jcrespo)
[09:08:49] (CR) jerkins-bot: [V: -1] bacula: Add prometheus exporter [puppet] - https://gerrit.wikimedia.org/r/551145 (https://phabricator.wikimedia.org/T234900) (owner: Jcrespo)
[09:08:51] (CR) Jcrespo: "$ curl localhost:9133" [puppet] - https://gerrit.wikimedia.org/r/551145 (https://phabricator.wikimedia.org/T234900) (owner: Jcrespo)
[09:13:42] !log gehel@cumin1001 END (ERROR) - Cookbook sre.wdqs.data-reload (exit_code=97)
[09:13:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:14:11] Operations, Traffic: Move cache text cluster from nginx to ats-tls - https://phabricator.wikimedia.org/T231627 (Vgutierrez)
[09:18:20] (PS1) Vgutierrez: hiera: Move nginx from port 443 to port 4443 on cp1079 [puppet] - https://gerrit.wikimedia.org/r/551150 (https://phabricator.wikimedia.org/T231627)
[09:18:35] !log Move cp1079 from nginx to ats-tls - T231627
[09:18:37] (PS1) Vgutierrez: hiera: Move ats-tls from port 8443 to port 443 on cp1079 [puppet] - https://gerrit.wikimedia.org/r/551151 (https://phabricator.wikimedia.org/T231627)
[09:18:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:18:42] T231627: Move cache text cluster from nginx to ats-tls - https://phabricator.wikimedia.org/T231627
[09:20:16] gehel: ^
[09:20:59] (CR) Vgutierrez: [C: +2] hiera: Move
nginx from port 443 to port 4443 on cp1079 [puppet] - https://gerrit.wikimedia.org/r/551150 (https://phabricator.wikimedia.org/T231627) (owner: Vgutierrez)
[09:21:10] onimisionipe: yep, I killed it, it's been stuck for 24h at the same point, something is wrong
[09:22:22] (CR) Vgutierrez: [C: +2] hiera: Move ats-tls from port 8443 to port 443 on cp1079 [puppet] - https://gerrit.wikimedia.org/r/551151 (https://phabricator.wikimedia.org/T231627) (owner: Vgutierrez)
[09:23:49] ok
[09:36:09] Operations, Traffic, Patch-For-Review: Move cache text cluster from nginx to ats-tls - https://phabricator.wikimedia.org/T231627 (Vgutierrez)
[09:42:13] (CR) Ema: [C: +1] "Looks good to me and the vtc tests are all green" [puppet] - https://gerrit.wikimedia.org/r/550856 (https://phabricator.wikimedia.org/T238038) (owner: Vgutierrez)
[09:45:44] (CR) Effie Mouzeli: mediawiki: Remove HHVM references and includes (1 comment) [puppet] - https://gerrit.wikimedia.org/r/550818 (https://phabricator.wikimedia.org/T229792) (owner: Effie Mouzeli)
[09:45:46] (CR) Ema: ATS: network settings for ats-be (1 comment) [puppet] - https://gerrit.wikimedia.org/r/550866 (https://phabricator.wikimedia.org/T227432) (owner: Ema)
[09:45:53] (CR) Vgutierrez: [C: +2] vcl: Use synthetic warning for 1% of TLSv1/TLSv1.1 pageviews [puppet] - https://gerrit.wikimedia.org/r/550856 (https://phabricator.wikimedia.org/T238038) (owner: Vgutierrez)
[09:47:01] !log Use a synthetic warning for 1% of TLSv1/TLSv1.1 pageviews - T238038
[09:47:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:47:07] T238038: Start warning and deprecation process for all legacy TLS - https://phabricator.wikimedia.org/T238038
[09:47:49] (PS2) Ema: ATS: network settings for ats-be [puppet] - https://gerrit.wikimedia.org/r/550866 (https://phabricator.wikimedia.org/T227432)
[09:49:25] (CR) Vgutierrez: [C: +1] ATS: network settings for ats-be
[puppet] - https://gerrit.wikimedia.org/r/550866 (https://phabricator.wikimedia.org/T227432) (owner: Ema)
[09:50:52] (PS2) Ema: cache: reimage cp3062 as text_ats [puppet] - https://gerrit.wikimedia.org/r/550849 (https://phabricator.wikimedia.org/T227432)
[09:50:58] !log depool cp3062 and reimage as text_ats T227432
[09:51:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:51:03] T227432: Replace Varnish backends with ATS on cache text nodes - https://phabricator.wikimedia.org/T227432
[09:52:56] (CR) Ema: [C: +2] cache: reimage cp3062 as text_ats [puppet] - https://gerrit.wikimedia.org/r/550849 (https://phabricator.wikimedia.org/T227432) (owner: Ema)
[09:54:55] Operations, Traffic, Patch-For-Review: Replace Varnish backends with ATS on cache text nodes - https://phabricator.wikimedia.org/T227432 (ops-monitoring-bot) Script wmf-auto-reimage was launched by ema on cumin1001.eqiad.wmnet for hosts: ` ['cp3062.esams.wmnet'] ` The log can be found in `/var/log/wm...
[10:03:00] PROBLEM - Aggregate IPsec Tunnel Status eqiad on icinga1001 is CRITICAL: instance={cp1077:9536,cp1079:9536,cp1081:9536,cp1083:9536,cp1085:9536,cp1087:9536,cp1089:9536} site=eqiad tunnel={cp3062_v4,cp3062_v6} https://wikitech.wikimedia.org/wiki/Monitoring/strongswan https://grafana.wikimedia.org/d/B9JpocKZz/ipsec-tunnel-status
[10:03:14] PROBLEM - Aggregate IPsec Tunnel Status codfw on icinga1001 is CRITICAL: instance={cp2004:9536,cp2006:9536,cp2007:9536,cp2013:9536,cp2016:9536,cp2019:9536} site=codfw tunnel={cp3062_v4,cp3062_v6} https://wikitech.wikimedia.org/wiki/Monitoring/strongswan https://grafana.wikimedia.org/d/B9JpocKZz/ipsec-tunnel-status
[10:04:33] ^^ ema reimaging cp3062
[10:04:41] (and getting rid of the nasty IPSec stuff)
[10:04:57] yes
[10:05:00] ack'ing
[10:05:24] giyeah
[10:05:26] yeah*
[10:05:38] ACKNOWLEDGEMENT - Aggregate IPsec Tunnel Status codfw on icinga1001 is CRITICAL: instance={cp2004:9536,cp2006:9536,cp2007:9536,cp2013:9536,cp2016:9536,cp2019:9536} site=codfw tunnel={cp3062_v4,cp3062_v6} Ema reimaging cp3062 https://wikitech.wikimedia.org/wiki/Monitoring/strongswan https://grafana.wikimedia.org/d/B9JpocKZz/ipsec-tunnel-status
[10:05:38] ACKNOWLEDGEMENT - Aggregate IPsec Tunnel Status eqiad on icinga1001 is CRITICAL: instance={cp1079:9536,cp1081:9536,cp1083:9536,cp1085:9536,cp1087:9536} site=eqiad tunnel={cp3062_v4,cp3062_v6} Ema reimaging cp3062 https://wikitech.wikimedia.org/wiki/Monitoring/strongswan https://grafana.wikimedia.org/d/B9JpocKZz/ipsec-tunnel-status
[10:14:04] PROBLEM - Check the Netbox report puppetdb for fail status.
on netbox1001 is CRITICAL: puppetdb.PuppetDB CRITICAL https://wikitech.wikimedia.org/wiki/Netbox%23Reports
[10:15:33] (CR) Vgutierrez: [C: +2] VCL: Move analytics hooks above beacon synth point [puppet] - https://gerrit.wikimedia.org/r/550826 (owner: BBlack)
[10:15:48] !log ema@cumin1001 START - Cookbook sre.hosts.downtime
[10:15:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:19:12] !log ema@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
[10:19:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:22:25] PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 240, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[10:22:51] PROBLEM - Router interfaces on cr2-eqord is CRITICAL: CRITICAL: host 208.80.154.198, interfaces up: 54, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[10:23:14] (CR) Filippo Giunchedi: [C: +2] logstash: introduce restbase-specific index [puppet] - https://gerrit.wikimedia.org/r/550806 (https://phabricator.wikimedia.org/T238196) (owner: Filippo Giunchedi)
[10:23:35] (PS3) Vgutierrez: VCL: Move analytics hooks above beacon synth point [puppet] - https://gerrit.wikimedia.org/r/550826 (owner: BBlack)
[10:23:37] RECOVERY - Aggregate IPsec Tunnel Status codfw on icinga1001 is OK: All metrics within thresholds.
https://wikitech.wikimedia.org/wiki/Monitoring/strongswan https://grafana.wikimedia.org/d/B9JpocKZz/ipsec-tunnel-status
[10:24:08] !log roll-restart logstash to apply configuration change
[10:24:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:24:15] Operations, Core Platform Team, Traffic, Wikimedia-General-or-Unknown, User-DannyS712: Pages whose title ends with semicolon (;) are intermittently inaccessible - https://phabricator.wikimedia.org/T238285 (ema) >>! In T238285#5665013, @Ladsgroup wrote: > I think this has to do something with...
[10:24:48] there is a TELIA maint on calendar, maybe it is related to cr2-eqiad and cr2-eqord
[10:25:23] RECOVERY - Aggregate IPsec Tunnel Status eqiad on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/strongswan https://grafana.wikimedia.org/d/B9JpocKZz/ipsec-tunnel-status
[10:26:39] Location of work: Chicago/IL, US
[10:26:40] ok it is
[10:29:12] Operations, Traffic, Patch-For-Review: Replace Varnish backends with ATS on cache text nodes - https://phabricator.wikimedia.org/T227432 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['cp3062.esams.wmnet'] ` and were **ALL** successful.
[10:36:05] effie: as quick check I usually either go on the routers or grab the link id and check on netbox (like https://netbox.wikimedia.org/circuits/circuits/31/) [10:36:44] !log pool cp3062 with ATS backend T227432 [10:36:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:36:49] T227432: Replace Varnish backends with ATS on cache text nodes - https://phabricator.wikimedia.org/T227432 [10:37:23] !log restbase - truncated parsoidphp data tables - T229015 [10:37:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:37:28] T229015: Tracking: Direct live production traffic at Parsoid/PHP - https://phabricator.wikimedia.org/T229015 [10:38:41] !log installing ghostscript security updates [10:38:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:39:35] RECOVERY - Check the Netbox report puppetdb for fail status. on netbox1001 is OK: puppetdb.PuppetDB OK https://wikitech.wikimedia.org/wiki/Netbox%23Reports [10:44:14] (03CR) 10Filippo Giunchedi: "> Patch Set 1:" [puppet] - 10https://gerrit.wikimedia.org/r/551145 (https://phabricator.wikimedia.org/T234900) (owner: 10Jcrespo) [10:45:27] !log Run maintain-views for s5 on labsdb1011 T233135 [10:45:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:45:32] T233135: Schema change for refactored actor and comment storage - https://phabricator.wikimedia.org/T233135 [10:51:29] 10Operations, 10serviceops, 10PHP 7.2 support: (euwiki) Mysterious, coordinated slowdowns every ~ 25 minutes on API servers - https://phabricator.wikimedia.org/T231011 (10jijiki) [11:00:21] 10Operations, 10Traffic, 10Wikimedia-General-or-Unknown, 10User-DannyS712: Pages whose title ends with semicolon (;) are intermittently inaccessible - https://phabricator.wikimedia.org/T238285 (10mobrovac) [11:02:29] 10Operations, 10Traffic, 10Wikimedia-General-or-Unknown, 10User-DannyS712: Pages whose title ends with semicolon (;) are 
intermittently inaccessible - https://phabricator.wikimedia.org/T238285 (10ema) p:05Triage→03Normal [11:05:46] (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [software/cas-overlay-template] - 10https://gerrit.wikimedia.org/r/550682 (https://phabricator.wikimedia.org/T233931) (owner: 10Jbond) [11:11:28] 10Operations, 10Traffic, 10Wikimedia-General-or-Unknown, 10User-DannyS712: Pages whose title ends with semicolon (;) are intermittently inaccessible - https://phabricator.wikimedia.org/T238285 (10Vgutierrez) hmm it looks like ATS URL parsing is at fault here. ATS is using a semi colon as a separator betwee... [11:12:47] !log Reboot dbproxy2001 [11:12:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:15:02] !log Reboot dbproxy2004 [11:15:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:16:27] (03CR) 10Jcrespo: "> Patch Set 1:" [puppet] - 10https://gerrit.wikimedia.org/r/551145 (https://phabricator.wikimedia.org/T234900) (owner: 10Jcrespo) [11:18:41] RECOVERY - Check whether microcode mitigations for CPU vulnerabilities are applied on dbproxy2001 is OK: OK - All expected CPU flags found https://wikitech.wikimedia.org/wiki/Microcode [11:18:53] moritzm: ^ \o/ [11:19:26] !log Reboot dbproxy2002 [11:19:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:20:29] great :-) [11:24:09] RECOVERY - Check whether microcode mitigations for CPU vulnerabilities are applied on dbproxy2002 is OK: OK - All expected CPU flags found https://wikitech.wikimedia.org/wiki/Microcode [11:25:21] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repool db1113:3315 into vslow,dump after schema change', diff saved to https://phabricator.wikimedia.org/P9645 and previous config saved to /var/cache/conftool/dbconfig/20191115-112520-marostegui.json [11:25:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:26:47] !log jmm@cumin2001 START - Cookbook sre.hosts.downtime [11:26:49] !log 
jmm@cumin2001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [11:26:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:26:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:27:15] !log reboot ganeti4001-4003 to rectify microcode application [11:27:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:27:35] 10Operations, 10Traffic, 10Wikimedia-General-or-Unknown, 10User-DannyS712: Pages whose title ends with semicolon (;) are intermittently inaccessible - https://phabricator.wikimedia.org/T238285 (10Vgutierrez) BTW, Checking RFC 3986, I'm not sure that `https://ban.wikipedia.org/wiki/Mal:;` is a valid URL whe... [11:30:41] RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 242, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [11:31:33] RECOVERY - Router interfaces on cr2-eqord is OK: OK: host 208.80.154.198, interfaces up: 56, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [11:31:40] (03CR) 10Vgutierrez: [C: 03+2] VCL: Move analytics hooks above beacon synth point [puppet] - 10https://gerrit.wikimedia.org/r/550826 (owner: 10BBlack) [11:35:36] (03PS1) 10Ema: ATS: enable mwdebug routes for noc.wm.org [puppet] - 10https://gerrit.wikimedia.org/r/551154 (https://phabricator.wikimedia.org/T233768) [11:37:29] (03CR) 10Ema: "pcc here https://puppet-compiler.wmflabs.org/compiler1003/19407/cp1075.eqiad.wmnet/" [puppet] - 10https://gerrit.wikimedia.org/r/551154 (https://phabricator.wikimedia.org/T233768) (owner: 10Ema) [11:42:02] (03PS4) 10ArielGlenn: ability to configure a wiki to produce empty abstract files [dumps] - 10https://gerrit.wikimedia.org/r/547197 (https://phabricator.wikimedia.org/T236006) [11:42:43] (03CR) 10ArielGlenn: [C: 03+2] ability to configure a wiki to produce empty 
abstract files [dumps] - 10https://gerrit.wikimedia.org/r/547197 (https://phabricator.wikimedia.org/T236006) (owner: 10ArielGlenn) [11:43:37] !log ariel@deploy1001 Started deploy [dumps/dumps@61090ee]: configuration setting to produce empty abstracts [11:43:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:43:46] !log ariel@deploy1001 Finished deploy [dumps/dumps@61090ee]: configuration setting to produce empty abstracts (duration: 00m 09s) [11:43:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:47:25] (03PS1) 10Jbond: apereo_cas: improve systemd security [puppet] - 10https://gerrit.wikimedia.org/r/551159 (https://phabricator.wikimedia.org/T233951) [11:53:34] (03CR) 10Effie Mouzeli: [V: 03+1] "PCC looks good: https://puppet-compiler.wmflabs.org/compiler1002/19405/" [puppet] - 10https://gerrit.wikimedia.org/r/550818 (https://phabricator.wikimedia.org/T229792) (owner: 10Effie Mouzeli) [11:56:31] (03CR) 10Jbond: [V: 03+2 C: 03+2] build.gradle: add memcached support to cas blob [software/cas-overlay-template] - 10https://gerrit.wikimedia.org/r/550682 (https://phabricator.wikimedia.org/T233931) (owner: 10Jbond) [11:57:32] (03Abandoned) 10Mobrovac: service::node: Also do the Scap fetch phase when refreshing the config [puppet] - 10https://gerrit.wikimedia.org/r/309963 (owner: 10Mobrovac) [11:58:07] (03Abandoned) 10Mobrovac: RESTBase-Cassandra: Add the topk reporter [puppet] - 10https://gerrit.wikimedia.org/r/328660 (https://phabricator.wikimedia.org/T147366) (owner: 10Mobrovac) [11:58:26] (03Abandoned) 10Mobrovac: OCG: Do not use the INFO command as a readiness check [puppet] - 10https://gerrit.wikimedia.org/r/363045 (owner: 10Mobrovac) [11:58:54] (03Abandoned) 10Mobrovac: Kafka: Make all Kafka clients require the same set of packages [puppet] - 10https://gerrit.wikimedia.org/r/377238 (owner: 10Mobrovac) [12:02:24] (03Abandoned) 10Mobrovac: CP-JobQueue: Use the Special:RunSingleJob page to execute jobs 
[puppet] - 10https://gerrit.wikimedia.org/r/385382 (https://phabricator.wikimedia.org/T175146) (owner: 10Mobrovac) [12:02:35] (03PS2) 10Jcrespo: bacula: Add prometheus exporter [puppet] - 10https://gerrit.wikimedia.org/r/551145 (https://phabricator.wikimedia.org/T234900) [12:09:02] (03PS2) 10Effie Mouzeli: admin,mediawiki: Remove hhvm related sudo privileges [puppet] - 10https://gerrit.wikimedia.org/r/550483 (https://phabricator.wikimedia.org/T229792) [12:15:27] 10Operations, 10Education-Program-Dashboard, 10Traffic, 10Programs-and-Events-Dashboard-Sprint 2: Cache education dashboard pages - https://phabricator.wikimedia.org/T120509 (10ema) @awight: is there anything to do here or can we close the task? [12:22:19] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good, two comments inline. Given that this adds a new system user, let's maybe enable profile::base::enable_adduser for role(spare) " (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/550872 (https://phabricator.wikimedia.org/T233951) (owner: 10Jbond) [12:22:54] 10Operations, 10Traffic, 10Wikimedia-Incident: upload.wikimedia.org returns HTTP status code 503 for truncated urls, not 404 - https://phabricator.wikimedia.org/T106517 (10ema) I cannot reproduce with URLs such as https://upload.wikimedia.org/wikipedia/commons/thumb/6/6b/Kitagawa_Utamaro_-_Toji_san_bijin_%28... [12:29:03] (03PS1) 10Effie Mouzeli: prometheus: Remove dead HHVM code [puppet] - 10https://gerrit.wikimedia.org/r/551161 (https://phabricator.wikimedia.org/T229792) [12:29:48] 10Operations, 10Traffic, 10Wikimedia-General-or-Unknown, 10User-DannyS712: Pages whose title ends with semicolon (;) are intermittently inaccessible - https://phabricator.wikimedia.org/T238285 (10Urbanecm) @Vgutierrez I guess what you quoted wouldn't be valid, but https://ban.wikipedia.org/wiki/Mal:%3B sho... 
[12:30:29] (03CR) 10Jbond: "> Patch Set 5: Code-Review+1" (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/550872 (https://phabricator.wikimedia.org/T233951) (owner: 10Jbond) [12:33:43] (03PS1) 10Effie Mouzeli: mediawiki: Update decommission_appserver.sh [puppet] - 10https://gerrit.wikimedia.org/r/551162 (https://phabricator.wikimedia.org/T229792) [12:47:06] 10Operations, 10Traffic, 10Wikimedia-General-or-Unknown, 10User-DannyS712: Pages whose title ends with semicolon (;) are intermittently inaccessible - https://phabricator.wikimedia.org/T238285 (10BBlack) There's some confusion on historical standards interpretation here, I think. There are some ancient st... [12:47:41] (03CR) 10Muehlenhoff: "Looks great! A few comments inline" (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/551159 (https://phabricator.wikimedia.org/T233951) (owner: 10Jbond) [12:49:10] (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/551161 (https://phabricator.wikimedia.org/T229792) (owner: 10Effie Mouzeli) [12:55:42] (03CR) 10Muehlenhoff: mediawiki: Update decommission_appserver.sh (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/551162 (https://phabricator.wikimedia.org/T229792) (owner: 10Effie Mouzeli) [13:01:15] (03PS2) 10Jbond: apereo_cas: improve systemd security [puppet] - 10https://gerrit.wikimedia.org/r/551159 (https://phabricator.wikimedia.org/T233951) [13:01:37] (03CR) 10Jbond: "Thanks updated" (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/551159 (https://phabricator.wikimedia.org/T233951) (owner: 10Jbond) [13:03:09] 10Operations, 10Traffic, 10Patch-For-Review: Create a second text-lb IP address for test purposes - https://phabricator.wikimedia.org/T237492 (10faidon) p:05Normal→03High It looks like there are proposed patches for this, so perhaps we're not too far off? This ties to an exploration we're doing with a ve... 
[13:06:11] !log ariel@deploy1001 Started deploy [dumps/dumps@61090ee]: configuration setting to produce empty abstracts (expecting failure) [13:06:15] !log ariel@deploy1001 Finished deploy [dumps/dumps@61090ee]: configuration setting to produce empty abstracts (expecting failure) (duration: 00m 04s) [13:06:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:06:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:07:04] (03CR) 10Faidon Liambotis: [C: 04-1] "Very minor changes, otherwise LGTM (i.e. feel free to self-merge afterwards)" (032 comments) [homer/public] - 10https://gerrit.wikimedia.org/r/550576 (owner: 10Ayounsi) [13:07:24] (03PS6) 10Jbond: apereo_cas: update systemd to run as a system user [puppet] - 10https://gerrit.wikimedia.org/r/550872 (https://phabricator.wikimedia.org/T233951) [13:07:36] (03CR) 10Faidon Liambotis: [C: 03+1] "+1 assuming there are no diffs right now?" [homer/public] - 10https://gerrit.wikimedia.org/r/549938 (owner: 10Ayounsi) [13:09:55] (03PS1) 10Mathew.onipe: query_service: add updater mode option [puppet] - 10https://gerrit.wikimedia.org/r/551167 (https://phabricator.wikimedia.org/T231411) [13:14:36] (03CR) 10Faidon Liambotis: [C: 04-1] "Minor comment inline. I'm also not sure if we should have special code on the homer end for this, that feels like a stub and thus an overk" (032 comments) [homer/public] - 10https://gerrit.wikimedia.org/r/550376 (owner: 10Ayounsi) [13:16:45] (03CR) 10Faidon Liambotis: "> For some reasons codfw VCs have an explicit VC ID configured, eg." 
(032 comments) [homer/public] - 10https://gerrit.wikimedia.org/r/550370 (owner: 10Ayounsi) [13:17:06] (03PS1) 10Mathew.onipe: Switch wdqs1004 to merging updater mode [puppet] - 10https://gerrit.wikimedia.org/r/551169 (https://phabricator.wikimedia.org/T231411) [13:18:17] (03PS1) 10Cmjohnson: Adding production dns for ms-be105[7-9] [dns] - 10https://gerrit.wikimedia.org/r/551170 (https://phabricator.wikimedia.org/T237438) [13:19:03] (03CR) 10Faidon Liambotis: Add security alg/forwarding-options/screen to mr template (031 comment) [homer/public] - 10https://gerrit.wikimedia.org/r/550356 (owner: 10Ayounsi) [13:20:20] (03PS2) 10Cmjohnson: Adding production dns for ms-be105[7-9] [dns] - 10https://gerrit.wikimedia.org/r/551170 (https://phabricator.wikimedia.org/T237438) [13:20:35] (03CR) 10Faidon Liambotis: msw/asw: use same generic config (031 comment) [homer/public] - 10https://gerrit.wikimedia.org/r/549933 (owner: 10Ayounsi) [13:23:11] (03CR) 10Faidon Liambotis: CR: add apply-groups [ re0 re1 ]; if multiple REs (031 comment) [homer/public] - 10https://gerrit.wikimedia.org/r/549690 (owner: 10Ayounsi) [13:23:18] (03CR) 10Faidon Liambotis: [C: 03+1] CR: add apply-groups [ re0 re1 ]; if multiple REs [homer/public] - 10https://gerrit.wikimedia.org/r/549690 (owner: 10Ayounsi) [13:23:44] (03CR) 10Muehlenhoff: "Two comments inline" (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/550872 (https://phabricator.wikimedia.org/T233951) (owner: 10Jbond) [13:24:21] (03PS2) 10Mathew.onipe: query_service: add updater mode option [puppet] - 10https://gerrit.wikimedia.org/r/551167 (https://phabricator.wikimedia.org/T231411) [13:24:24] (03PS2) 10Mathew.onipe: Switch wdqs1004 to merging updater mode [puppet] - 10https://gerrit.wikimedia.org/r/551169 (https://phabricator.wikimedia.org/T231411) [13:25:41] (03CR) 10Faidon Liambotis: Add PIM stanza for CR devices (032 comments) [homer/public] - 10https://gerrit.wikimedia.org/r/549689 (owner: 10Ayounsi) [13:25:47] (03CR) 10Faidon 
Liambotis: [C: 04-1] Add PIM stanza for CR devices [homer/public] - 10https://gerrit.wikimedia.org/r/549689 (owner: 10Ayounsi) [13:27:32] (03PS1) 10ArielGlenn: move keyholder name for key to global section [dumps/scap] - 10https://gerrit.wikimedia.org/r/551171 [13:28:39] (03CR) 10Mathew.onipe: "PCC is Ok: https://puppet-compiler.wmflabs.org/compiler1003/19409/" [puppet] - 10https://gerrit.wikimedia.org/r/551169 (https://phabricator.wikimedia.org/T231411) (owner: 10Mathew.onipe) [13:28:41] !log ariel@deploy1001 Started deploy [dumps/dumps@61090ee]: configuration setting to produce empty abstracts [13:28:45] !log ariel@deploy1001 Finished deploy [dumps/dumps@61090ee]: configuration setting to produce empty abstracts (duration: 00m 03s) [13:28:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:28:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:30:16] (03CR) 10ArielGlenn: [V: 03+2 C: 03+2] move keyholder name for key to global section [dumps/scap] - 10https://gerrit.wikimedia.org/r/551171 (owner: 10ArielGlenn) [13:32:43] (03CR) 10Faidon Liambotis: Initial templating for CR routing-options (032 comments) [homer/public] - 10https://gerrit.wikimedia.org/r/547587 (owner: 10Ayounsi) [13:33:22] (03PS2) 10ArielGlenn: fix up dump stats script to use either mail or s-nail [puppet] - 10https://gerrit.wikimedia.org/r/551042 (https://phabricator.wikimedia.org/T224563) [13:33:43] (03CR) 10BBlack: [C: 03+2] Add test-lb to DNS for IPv[46] [dns] - 10https://gerrit.wikimedia.org/r/549476 (https://phabricator.wikimedia.org/T237492) (owner: 10BBlack) [13:33:47] (03PS3) 10BBlack: Add test-lb to DNS for IPv[46] [dns] - 10https://gerrit.wikimedia.org/r/549476 (https://phabricator.wikimedia.org/T237492) [13:35:08] (03CR) 10Faidon Liambotis: [C: 04-1] Initial templating for CR routing-options (031 comment) [homer/public] - 10https://gerrit.wikimedia.org/r/547587 (owner: 10Ayounsi) [13:35:23] (03PS3) 10BBlack: LVS text: add 
test-lb IPs [puppet] - 10https://gerrit.wikimedia.org/r/549477 (https://phabricator.wikimedia.org/T237492) [13:35:51] (03CR) 10ArielGlenn: [C: 03+2] fix up dump stats script to use either mail or s-nail [puppet] - 10https://gerrit.wikimedia.org/r/551042 (https://phabricator.wikimedia.org/T224563) (owner: 10ArielGlenn) [13:41:12] (03PS1) 10ArielGlenn: configure wikidata dumps to generate empty abstracts files [puppet] - 10https://gerrit.wikimedia.org/r/551172 (https://phabricator.wikimedia.org/T236006) [13:41:52] (03CR) 10Gehel: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/551167 (https://phabricator.wikimedia.org/T231411) (owner: 10Mathew.onipe) [13:42:00] (03CR) 10Gehel: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/551169 (https://phabricator.wikimedia.org/T231411) (owner: 10Mathew.onipe) [13:43:04] (03CR) 10Cmjohnson: [C: 03+2] Adding production dns for ms-be105[7-9] [dns] - 10https://gerrit.wikimedia.org/r/551170 (https://phabricator.wikimedia.org/T237438) (owner: 10Cmjohnson) [13:43:09] (03PS3) 10Cmjohnson: Adding production dns for ms-be105[7-9] [dns] - 10https://gerrit.wikimedia.org/r/551170 (https://phabricator.wikimedia.org/T237438) [13:43:17] (03CR) 10Cmjohnson: [V: 03+2 C: 03+2] Adding production dns for ms-be105[7-9] [dns] - 10https://gerrit.wikimedia.org/r/551170 (https://phabricator.wikimedia.org/T237438) (owner: 10Cmjohnson) [13:43:38] (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/550818 (https://phabricator.wikimedia.org/T229792) (owner: 10Effie Mouzeli) [13:43:40] (03CR) 10ArielGlenn: [C: 03+2] configure wikidata dumps to generate empty abstracts files [puppet] - 10https://gerrit.wikimedia.org/r/551172 (https://phabricator.wikimedia.org/T236006) (owner: 10ArielGlenn) [13:43:55] (03PS7) 10Jbond: apereo_cas: update systemd to run as a system user [puppet] - 10https://gerrit.wikimedia.org/r/550872 (https://phabricator.wikimedia.org/T233951) [13:44:28] (03CR) 10Jbond: "updated thanks" 
(032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/550872 (https://phabricator.wikimedia.org/T233951) (owner: 10Jbond) [13:48:36] (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/550872 (https://phabricator.wikimedia.org/T233951) (owner: 10Jbond) [13:49:57] (03CR) 10BBlack: [C: 03+2] "https://puppet-compiler.wmflabs.org/compiler1003/19410/" [puppet] - 10https://gerrit.wikimedia.org/r/549477 (https://phabricator.wikimedia.org/T237492) (owner: 10BBlack) [13:53:11] (03PS1) 10ArielGlenn: make dumpsdata primary nfs server rsync to dumpsdata1003 now [puppet] - 10https://gerrit.wikimedia.org/r/551173 (https://phabricator.wikimedia.org/T224563) [13:56:43] (03PS2) 10Effie Mouzeli: mediawiki: Update decommission_appserver.sh [puppet] - 10https://gerrit.wikimedia.org/r/551162 (https://phabricator.wikimedia.org/T229792) [13:58:18] (03PS3) 10Effie Mouzeli: mediawiki: Update decommission_appserver.sh [puppet] - 10https://gerrit.wikimedia.org/r/551162 (https://phabricator.wikimedia.org/T229792) [13:58:47] (03CR) 10Effie Mouzeli: mediawiki: Update decommission_appserver.sh (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/551162 (https://phabricator.wikimedia.org/T229792) (owner: 10Effie Mouzeli) [13:59:21] PROBLEM - PyBal IPVS diff check on lvs1013 is CRITICAL: CRITICAL: Services known to PyBal but not to IPVS: set([208.80.154.225:443, 2620:0:861:ed1a::2:80, 208.80.154.225:80, 2620:0:861:ed1a::2:443]) https://wikitech.wikimedia.org/wiki/PyBal [14:00:35] (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/551162 (https://phabricator.wikimedia.org/T229792) (owner: 10Effie Mouzeli) [14:02:49] PROBLEM - PyBal connections to etcd on lvs1013 is CRITICAL: CRITICAL: 8 connections established with conf1004.eqiad.wmnet:4001 (min=12) https://wikitech.wikimedia.org/wiki/PyBal [14:03:41] PROBLEM - PyBal IPVS diff check on lvs3007 is CRITICAL: CRITICAL: Services known to PyBal but not to IPVS: 
set([91.198.174.193:443, 2620:0:862:ed1a::2:80, 2620:0:862:ed1a::2:443, 91.198.174.193:80]) https://wikitech.wikimedia.org/wiki/PyBal [14:04:21] 10Operations, 10Product-Analytics, 10SRE-Access-Requests: Search Console access for he.wikisource.org - https://phabricator.wikimedia.org/T238090 (10Fuzzy) Let's get one step at a time. Before getting into legal and technicality, WMF should assess and approve the requirement. [14:05:23] PROBLEM - PyBal connections to etcd on lvs3007 is CRITICAL: CRITICAL: 12 connections established with conf1006.eqiad.wmnet:4001 (min=16) https://wikitech.wikimedia.org/wiki/PyBal [14:05:41] PROBLEM - PyBal connections to etcd on lvs1016 is CRITICAL: CRITICAL: 91 connections established with conf1004.eqiad.wmnet:4001 (min=95) https://wikitech.wikimedia.org/wiki/PyBal [14:06:05] PROBLEM - PyBal IPVS diff check on lvs2004 is CRITICAL: CRITICAL: Services known to PyBal but not to IPVS: set([2620:0:860:ed1a::2:443, 2620:0:860:ed1a::2:80, 208.80.153.225:443, 208.80.153.225:80]) https://wikitech.wikimedia.org/wiki/PyBal [14:06:07] PROBLEM - PyBal IPVS diff check on lvs4005 is CRITICAL: CRITICAL: Services known to PyBal but not to IPVS: set([198.35.26.97:443, 2620:0:863:ed1a::2:80, 198.35.26.97:80, 2620:0:863:ed1a::2:443]) https://wikitech.wikimedia.org/wiki/PyBal [14:06:35] PROBLEM - PyBal connections to etcd on lvs5001 is CRITICAL: CRITICAL: 4 connections established with conf2003.codfw.wmnet:2379 (min=8) https://wikitech.wikimedia.org/wiki/PyBal [14:06:49] PROBLEM - PyBal IPVS diff check on lvs3005 is CRITICAL: CRITICAL: Services known to PyBal but not to IPVS: set([2620:0:862:ed1a::2:80, 91.198.174.193:443, 2620:0:862:ed1a::2:443, 91.198.174.193:80]) https://wikitech.wikimedia.org/wiki/PyBal [14:07:00] you can ignore those, it's etcd complaining basically about a mismatch between pybal config and running state [14:07:03] PROBLEM - PyBal IPVS diff check on lvs5001 is CRITICAL: CRITICAL: Services known to PyBal but not to IPVS: 
set([103.102.166.225:443, 2001:df2:e500:ed1a::2:80, 2001:df2:e500:ed1a::2:443, 103.102.166.225:80]) https://wikitech.wikimedia.org/wiki/PyBal [14:07:14] which is pretty much normal after merging any change... [14:07:19] (03PS3) 10Andrew Bogott: Remove tor_exit_node_update cron from wikitech::web [puppet] - 10https://gerrit.wikimedia.org/r/550057 (https://phabricator.wikimedia.org/T156733) (owner: 10Reedy) [14:07:45] PROBLEM - PyBal connections to etcd on lvs5003 is CRITICAL: CRITICAL: 12 connections established with conf2003.codfw.wmnet:2379 (min=16) https://wikitech.wikimedia.org/wiki/PyBal [14:07:55] PROBLEM - PyBal connections to etcd on lvs4005 is CRITICAL: CRITICAL: 4 connections established with conf2003.codfw.wmnet:2379 (min=8) https://wikitech.wikimedia.org/wiki/PyBal [14:08:25] PROBLEM - PyBal connections to etcd on lvs2004 is CRITICAL: CRITICAL: 8 connections established with conf2001.codfw.wmnet:2379 (min=12) https://wikitech.wikimedia.org/wiki/PyBal [14:08:25] PROBLEM - PyBal connections to etcd on lvs2001 is CRITICAL: CRITICAL: 8 connections established with conf2001.codfw.wmnet:2379 (min=12) https://wikitech.wikimedia.org/wiki/PyBal [14:08:35] PROBLEM - PyBal connections to etcd on lvs4007 is CRITICAL: CRITICAL: 12 connections established with conf2003.codfw.wmnet:2379 (min=16) https://wikitech.wikimedia.org/wiki/PyBal [14:08:55] PROBLEM - PyBal IPVS diff check on lvs5003 is CRITICAL: CRITICAL: Services known to PyBal but not to IPVS: set([103.102.166.225:443, 2001:df2:e500:ed1a::2:80, 2001:df2:e500:ed1a::2:443, 103.102.166.225:80]) https://wikitech.wikimedia.org/wiki/PyBal [14:09:21] PROBLEM - PyBal IPVS diff check on lvs1016 is CRITICAL: CRITICAL: Services known to PyBal but not to IPVS: set([208.80.154.225:443, 2620:0:861:ed1a::2:443, 208.80.154.225:80, 2620:0:861:ed1a::2:80]) https://wikitech.wikimedia.org/wiki/PyBal [14:09:23] PROBLEM - PyBal connections to etcd on lvs3005 is CRITICAL: CRITICAL: 4 connections established with 
conf1006.eqiad.wmnet:4001 (min=8) https://wikitech.wikimedia.org/wiki/PyBal [14:09:34] !log lvs1016 - pybal restart for new config [14:09:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:09:43] PROBLEM - PyBal IPVS diff check on lvs2001 is CRITICAL: CRITICAL: Services known to PyBal but not to IPVS: set([2620:0:860:ed1a::2:443, 2620:0:860:ed1a::2:80, 208.80.153.225:443, 208.80.153.225:80]) https://wikitech.wikimedia.org/wiki/PyBal [14:09:43] PROBLEM - PyBal IPVS diff check on lvs4007 is CRITICAL: CRITICAL: Services known to PyBal but not to IPVS: set([198.35.26.97:443, 2620:0:863:ed1a::2:80, 198.35.26.97:80, 2620:0:863:ed1a::2:443]) https://wikitech.wikimedia.org/wiki/PyBal [14:10:25] !log lvs2004 - pybal restart for new config [14:10:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:10:30] (03PS1) 10Physikerwelt: Enable links from math formulae on beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/551180 (https://phabricator.wikimedia.org/T208758) [14:10:46] (03CR) 10Andrew Bogott: [C: 03+2] Remove tor_exit_node_update cron from wikitech::web [puppet] - 10https://gerrit.wikimedia.org/r/550057 (https://phabricator.wikimedia.org/T156733) (owner: 10Reedy) [14:11:01] !log lvs3007 - pybal restart for new config [14:11:03] RECOVERY - PyBal connections to etcd on lvs3007 is OK: OK: 16 connections established with conf1006.eqiad.wmnet:4001 (min=16) https://wikitech.wikimedia.org/wiki/PyBal [14:11:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:11:21] !log lvs4007 - pybal restart for new config [14:11:23] RECOVERY - PyBal connections to etcd on lvs1016 is OK: OK: 95 connections established with conf1004.eqiad.wmnet:4001 (min=95) https://wikitech.wikimedia.org/wiki/PyBal [14:11:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:11:27] !log lvs5003 - pybal restart for new config [14:11:30] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:11:47] RECOVERY - PyBal IPVS diff check on lvs2004 is OK: OK: no difference between hosts in IPVS/PyBal https://wikitech.wikimedia.org/wiki/PyBal [14:12:54] !log lvs3005 - pybal restart for new config [14:12:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:13:04] !log lvs4005 - pybal restart for new config [14:13:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:13:18] !log lvs5001 - pybal restart for new config [14:13:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:13:25] RECOVERY - PyBal connections to etcd on lvs5003 is OK: OK: 16 connections established with conf2003.codfw.wmnet:2379 (min=16) https://wikitech.wikimedia.org/wiki/PyBal [14:13:31] !log lvs2001 - pybal restart for new config [14:13:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:13:35] RECOVERY - PyBal connections to etcd on lvs4005 is OK: OK: 8 connections established with conf2003.codfw.wmnet:2379 (min=8) https://wikitech.wikimedia.org/wiki/PyBal [14:13:42] !log lvs1013 - pybal restart for new config [14:13:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:14:05] RECOVERY - PyBal connections to etcd on lvs2004 is OK: OK: 12 connections established with conf2001.codfw.wmnet:2379 (min=12) https://wikitech.wikimedia.org/wiki/PyBal [14:14:05] RECOVERY - PyBal connections to etcd on lvs2001 is OK: OK: 12 connections established with conf2001.codfw.wmnet:2379 (min=12) https://wikitech.wikimedia.org/wiki/PyBal [14:14:13] RECOVERY - PyBal connections to etcd on lvs1013 is OK: OK: 12 connections established with conf1004.eqiad.wmnet:4001 (min=12) https://wikitech.wikimedia.org/wiki/PyBal [14:14:17] RECOVERY - PyBal connections to etcd on lvs4007 is OK: OK: 16 connections established with conf2003.codfw.wmnet:2379 (min=16) https://wikitech.wikimedia.org/wiki/PyBal [14:14:37] RECOVERY 
- PyBal IPVS diff check on lvs5003 is OK: OK: no difference between hosts in IPVS/PyBal https://wikitech.wikimedia.org/wiki/PyBal [14:15:03] RECOVERY - PyBal IPVS diff check on lvs1016 is OK: OK: no difference between hosts in IPVS/PyBal https://wikitech.wikimedia.org/wiki/PyBal [14:15:05] RECOVERY - PyBal connections to etcd on lvs3005 is OK: OK: 8 connections established with conf1006.eqiad.wmnet:4001 (min=8) https://wikitech.wikimedia.org/wiki/PyBal [14:15:06] 10Operations, 10Traffic: Create a second text-lb IP address for test purposes - https://phabricator.wikimedia.org/T237492 (10BBlack) Should work now! [14:15:07] RECOVERY - PyBal IPVS diff check on lvs3007 is OK: OK: no difference between hosts in IPVS/PyBal https://wikitech.wikimedia.org/wiki/PyBal [14:15:08] (03CR) 10Physikerwelt: [C: 04-1] "need adjustment of property ids" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/551180 (https://phabricator.wikimedia.org/T208758) (owner: 10Physikerwelt) [14:15:25] RECOVERY - PyBal IPVS diff check on lvs2001 is OK: OK: no difference between hosts in IPVS/PyBal https://wikitech.wikimedia.org/wiki/PyBal [14:15:25] RECOVERY - PyBal IPVS diff check on lvs4007 is OK: OK: no difference between hosts in IPVS/PyBal https://wikitech.wikimedia.org/wiki/PyBal [14:16:27] RECOVERY - PyBal IPVS diff check on lvs1013 is OK: OK: no difference between hosts in IPVS/PyBal https://wikitech.wikimedia.org/wiki/PyBal [14:16:42] (03PS1) 10Ema: acme_chief: add dbtree.wm.org to tendril cert SAN [puppet] - 10https://gerrit.wikimedia.org/r/551184 (https://phabricator.wikimedia.org/T210411) [14:17:31] RECOVERY - PyBal IPVS diff check on lvs4005 is OK: OK: no difference between hosts in IPVS/PyBal https://wikitech.wikimedia.org/wiki/PyBal [14:17:36] !log SIGHUP prometheus@ops on prometheus1004 [14:17:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:18:27] RECOVERY - PyBal IPVS diff check on lvs3005 is OK: OK: no difference between hosts in IPVS/PyBal 
https://wikitech.wikimedia.org/wiki/PyBal [14:18:27] RECOVERY - PyBal connections to etcd on lvs5001 is OK: OK: 8 connections established with conf2003.codfw.wmnet:2379 (min=8) https://wikitech.wikimedia.org/wiki/PyBal [14:18:29] RECOVERY - PyBal IPVS diff check on lvs5001 is OK: OK: no difference between hosts in IPVS/PyBal https://wikitech.wikimedia.org/wiki/PyBal [14:24:17] (03PS2) 10Ema: cache: reimage cp3064 as text_ats [puppet] - 10https://gerrit.wikimedia.org/r/550850 (https://phabricator.wikimedia.org/T227432) [14:24:21] (03CR) 10Marostegui: [C: 03+1] "make dbtree great again \o/" [puppet] - 10https://gerrit.wikimedia.org/r/551184 (https://phabricator.wikimedia.org/T210411) (owner: 10Ema) [14:25:03] !log depool cp3064 and reimage as text_ats T227432 [14:25:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:25:10] T227432: Replace Varnish backends with ATS on cache text nodes - https://phabricator.wikimedia.org/T227432 [14:25:33] (03CR) 10Ema: [C: 03+2] acme_chief: add dbtree.wm.org to tendril cert SAN [puppet] - 10https://gerrit.wikimedia.org/r/551184 (https://phabricator.wikimedia.org/T210411) (owner: 10Ema) [14:26:51] (03CR) 10Ema: [C: 03+2] cache: reimage cp3064 as text_ats [puppet] - 10https://gerrit.wikimedia.org/r/550850 (https://phabricator.wikimedia.org/T227432) (owner: 10Ema) [14:28:00] (03CR) 10Filippo Giunchedi: [C: 04-1] "Duplicate ip_block, LGTM otherwise" (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/550922 (https://phabricator.wikimedia.org/T236386) (owner: 10Ottomata) [14:28:40] (03CR) 10Filippo Giunchedi: [C: 03+1] Add discovery for eventgate-logging-external [puppet] - 10https://gerrit.wikimedia.org/r/550923 (https://phabricator.wikimedia.org/T236386) (owner: 10Ottomata) [14:28:50] 10Operations, 10Traffic, 10Patch-For-Review: Replace Varnish backends with ATS on cache text nodes - https://phabricator.wikimedia.org/T227432 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by ema on 
cumin1001.eqiad.wmnet for hosts: ` ['cp3064.esams.wmnet'] ` The log can be found in `/var/log/wm... [14:28:59] (03CR) 10Filippo Giunchedi: [C: 03+1] Add discovery entries for eventgate-logging-external [dns] - 10https://gerrit.wikimedia.org/r/550915 (https://phabricator.wikimedia.org/T236386) (owner: 10Ottomata) [14:30:04] (03CR) 10Filippo Giunchedi: [C: 03+1] Add LVS entries for eventgate-logging-external [dns] - 10https://gerrit.wikimedia.org/r/550914 (https://phabricator.wikimedia.org/T236386) (owner: 10Ottomata) [14:32:31] oh oops! will fix. [14:32:54] godog: what do you thikn about https://phabricator.wikimedia.org/T236386#5664885 [14:33:11] where will clients be configured to POST errors to? [14:33:13] ACKNOWLEDGEMENT - Aggregate IPsec Tunnel Status codfw on icinga1001 is CRITICAL: instance={cp2001:9536,cp2004:9536,cp2006:9536,cp2007:9536,cp2010:9536,cp2012:9536,cp2013:9536,cp2016:9536,cp2019:9536,cp2023:9536} site=codfw tunnel={cp3064_v4,cp3064_v6} Ema reimaging 3064 https://wikitech.wikimedia.org/wiki/Monitoring/strongswan https://grafana.wikimedia.org/d/B9JpocKZz/ipsec-tunnel-status [14:33:13] ACKNOWLEDGEMENT - Aggregate IPsec Tunnel Status eqiad on icinga1001 is CRITICAL: instance={cp1077:9536,cp1079:9536,cp1081:9536,cp1083:9536,cp1085:9536,cp1087:9536,cp1089:9536} site=eqiad tunnel={cp3064_v4,cp3064_v6} Ema reimaging 3064 https://wikitech.wikimedia.org/wiki/Monitoring/strongswan https://grafana.wikimedia.org/d/B9JpocKZz/ipsec-tunnel-status [14:33:28] current_domain/beacon/logging [14:33:29] ? 
[14:33:41] ottomata: taking a look [14:35:16] (03PS1) 10Gehel: wdqs: move wdqs1007 from internal to public cluster [puppet] - 10https://gerrit.wikimedia.org/r/551189 (https://phabricator.wikimedia.org/T238229) [14:38:36] (03PS1) 10Arturo Borrero Gonzalez: toolforge: prometheus: enable scraping for the new k8s cluster [puppet] - 10https://gerrit.wikimedia.org/r/551191 (https://phabricator.wikimedia.org/T237643) [14:39:36] ottomata: yeah I think we should try sth like that first [14:39:46] (03PS2) 10Ottomata: Add LVS for eventgate-logging-external [puppet] - 10https://gerrit.wikimedia.org/r/550922 (https://phabricator.wikimedia.org/T236386) [14:40:13] PROBLEM - Check the Netbox report puppetdb for fail status. on netbox1001 is CRITICAL: puppetdb.PuppetDB CRITICAL https://wikitech.wikimedia.org/wiki/Netbox%23Reports [14:40:51] ema: I think we might want to revert the patch for dbtree as it now points to tendril, which requires auth [14:41:05] So not too useful anyways [14:41:39] marostegui: oh so you're telling me that dbtree is not running on dbmonitor? [14:41:40] We (dbas) need to work on a proper fix for dbtree, jynu.s has some ideas, but we don't have time at the moment [14:41:45] ema: it does [14:41:59] ema: dbmonitor is its frontend [14:42:14] (03PS2) 10Arturo Borrero Gonzalez: toolforge: prometheus: enable scraping for the new k8s cluster [puppet] - 10https://gerrit.wikimedia.org/r/551191 (https://phabricator.wikimedia.org/T237643) [14:42:30] ema: but it gathers the data from tendril [14:43:00] 10Operations, 10SRE-tools, 10Traffic, 10Goal, and 3 others: Automate generation of Management DNS records from Netbox - https://phabricator.wikimedia.org/T233183 (10Volans) >>! In T233183#5665281, @BBlack wrote: > Seems sane! The only thing I'm a little iffy about iis from the "SHA1 written to etcd" onwar... [14:44:48] marostegui: I see, there are two virtual hosts on dbmonitor. 
One is tendril and the other one is dbtree [14:44:56] TLS is configured on tendril only though [14:44:57] yep [14:45:15] that what jynu.s was mentioning earlier, that we mostly need TLS for dbtree [14:45:32] well then let's add a 443 virtualhost for dbtree too, right? [14:45:39] 10Operations, 10DBA, 10Data-Services, 10cloud-services-team (Kanban): Prepare and check storage layer for gcrwiki - https://phabricator.wikimedia.org/T238114 (10Dzahn) p:05Triage→03Normal [14:45:55] 10Operations, 10DBA, 10Data-Services, 10cloud-services-team (Kanban): Prepare and check storage layer for shywiktionary - https://phabricator.wikimedia.org/T238115 (10Dzahn) p:05Triage→03Normal [14:46:08] I don't know what else jynus checked/needs, I haven't looked at it myself [14:46:42] (03CR) 10Filippo Giunchedi: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/550922 (https://phabricator.wikimedia.org/T236386) (owner: 10Ottomata) [14:47:49] (03PS1) 10Volans: homer: move git peer to hiera [puppet] - 10https://gerrit.wikimedia.org/r/551195 [14:49:42] !log ema@cumin1001 START - Cookbook sre.hosts.downtime [14:49:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:49:55] marostegui: ack, I'll leave it to you and jynus then. I personally don't see a reason to revert the TLS cert fix which we're gonna need anyways [14:50:02] ema: marostegui: I haven't tracked this conversation this AM very well, re: dbtree/dbmonitor/tendril [14:50:05] (03CR) 10Volans: "Compiler seems happy, ofc the diff is "fake" as it's already applied in production" [puppet] - 10https://gerrit.wikimedia.org/r/551195 (owner: 10Volans) [14:50:38] but my feeling last time I looked, is we should let dbmonitor service dbtree/tendril directly on its public IP with TLS, and change the DNS for whichever one was going through varnish to just go directly to dbmonitor, and not have these configured in varnish/ats at all anymore. 
[14:51:37] (it's already on public subnet with public listener for 1x service that's not behind #traffic for outage-recovery reasons, and once you're in that boat there's little gain/point in putting the other service "behind traffic") [14:51:48] !log ema@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [14:51:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:53:17] (03CR) 10Volans: CI - python3: first attempt at adding python3 CI (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/510613 (owner: 10Jbond) [14:53:48] (03PS1) 10Giuseppe Lavagetto: rake: Add basic validation for json schema files [puppet] - 10https://gerrit.wikimedia.org/r/551196 [14:54:13] <_joe_> cdanis: ^^ [14:54:31] ahah okay good [14:54:32] <_joe_> completely untested [14:54:34] 10Operations, 10Product-Analytics, 10SRE-Access-Requests: Search Console access for he.wikisource.org - https://phabricator.wikimedia.org/T238090 (10Dzahn) @Fuzzy As far as i know it's the other way around. Approval would be given once the other steps, like having signed an NDA and having a WMF sponsor, are... [14:54:38] bblack: I haven't followed the conversation that much I think Jaime has been a bit more involved and had some ideas on what to do, we can discuss next week [14:54:44] _joe_: I was just starting to write in taskgen.rb, cribbing from setup_dhcp [14:54:49] ema: I think I am going to revert to leave things as they were, if that's ok with you [14:55:25] (03CR) 10jerkins-bot: [V: 04-1] rake: Add basic validation for json schema files [puppet] - 10https://gerrit.wikimedia.org/r/551196 (owner: 10Giuseppe Lavagetto) [14:55:29] marostegui: if you like 502s better that sounds good to me :) [14:55:47] ema: Currently dbtree works for eqiad and codfw, which is better than nothing :) [14:56:03] that change isn't affecting eqiad/codfw at all [14:56:14] Ah! [14:56:16] Right! [14:56:25] I missed that, sorry [14:56:28] np! 
[14:57:27] note that there is one ATS host in eqiad (cp1075), so requests for dbtree going through that backend will fail [14:57:45] right [14:58:02] 10Operations, 10Traffic, 10Patch-For-Review: Replace Varnish backends with ATS on cache text nodes - https://phabricator.wikimedia.org/T227432 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['cp3064.esams.wmnet'] ` Of which those **FAILED**: ` ['cp3064.esams.wmnet'] ` [14:58:17] (03CR) 10Mathew.onipe: [C: 03+1] wdqs: move wdqs1007 from internal to public cluster [puppet] - 10https://gerrit.wikimedia.org/r/551189 (https://phabricator.wikimedia.org/T238229) (owner: 10Gehel) [14:58:46] (03PS1) 10CDanis: taskgen: use filter_files_by for dhcp [puppet] - 10https://gerrit.wikimedia.org/r/551198 [15:00:30] (03CR) 10Herron: logstash: introduce logstash 7 and openjdk-11 support (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/548880 (https://phabricator.wikimedia.org/T217340) (owner: 10Herron) [15:00:32] (03PS2) 10CDanis: rake: Add basic validation for json schema files [puppet] - 10https://gerrit.wikimedia.org/r/551196 (owner: 10Giuseppe Lavagetto) [15:01:37] (03PS5) 10Jforrester: Variant configuration: Generate dblists from YAML [mediawiki-config] - 10https://gerrit.wikimedia.org/r/545411 (https://phabricator.wikimedia.org/T223602) [15:03:12] (03PS8) 10Jbond: CI - python3: first attempt at adding python3 CI [puppet] - 10https://gerrit.wikimedia.org/r/510613 [15:05:19] (03CR) 10Jbond: "thanks, updated" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/510613 (owner: 10Jbond) [15:06:50] PROBLEM - traffic-pool service on cp3064 is CRITICAL: CRITICAL - Expecting active but unit traffic-pool is inactive https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [15:06:59] (03CR) 10jerkins-bot: [V: 04-1] CI - python3: first attempt at adding python3 CI [puppet] - 10https://gerrit.wikimedia.org/r/510613 (owner: 10Jbond) [15:07:29] (03PS3) 10CDanis: rake: Add basic validation for json schema files 
[puppet] - 10https://gerrit.wikimedia.org/r/551196 (owner: 10Giuseppe Lavagetto) [15:07:46] !log reboot cp3064 after reimage [15:07:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:08:34] (03PS4) 10CDanis: rake: Add basic validation for json schema files [puppet] - 10https://gerrit.wikimedia.org/r/551196 (owner: 10Giuseppe Lavagetto) [15:09:52] I hate tabs so much [15:09:56] PROBLEM - Host cp3064 is DOWN: PING CRITICAL - Packet loss = 100% [15:09:56] :) [15:10:42] RECOVERY - Host cp3064 is UP: PING OK - Packet loss = 0%, RTA = 83.47 ms [15:11:10] RECOVERY - traffic-pool service on cp3064 is OK: OK - traffic-pool is active https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [15:11:30] (03CR) 10jerkins-bot: [V: 04-1] rake: Add basic validation for json schema files [puppet] - 10https://gerrit.wikimedia.org/r/551196 (owner: 10Giuseppe Lavagetto) [15:11:55] !log pool cp3064 with ATS backend T227432 [15:11:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:12:00] T227432: Replace Varnish backends with ATS on cache text nodes - https://phabricator.wikimedia.org/T227432 [15:12:21] wow ruby's JSON error messages are garbage :) [15:12:57] for a superfluous trailing comma you get "unexpected token at '{dump of the entire file}'" [15:13:15] cdanis: factually correct tho! [15:13:56] ema: https://media1.giphy.com/media/14bDMRUYVrzOIo/source.gif [15:14:15] heya ema , any thoughts about https://phabricator.wikimedia.org/T236386#5666847? 
we need to set up public endpoints to get events [15:14:22] for client error logging, and also for analytics events [15:14:31] (03PS5) 10CDanis: rake: Add basic validation for json schema files [puppet] - 10https://gerrit.wikimedia.org/r/551196 (owner: 10Giuseppe Lavagetto) [15:15:14] (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/551195 (owner: 10Volans) [15:16:45] 10Operations, 10Product-Analytics, 10SRE-Access-Requests: Search Console access for he.wikisource.org - https://phabricator.wikimedia.org/T238090 (10Ijon) Hi, @Dzahn. Yes, I am happy to sponsor this request. I have followed @Fuzzy 's work for over a decade now, know him personally (met in person multiple ti... [15:17:01] 10Operations, 10Traffic, 10Patch-For-Review: Replace Varnish backends with ATS on cache text nodes - https://phabricator.wikimedia.org/T227432 (10ema) [15:17:52] RECOVERY - Check the Netbox report puppetdb for fail status. on netbox1001 is OK: puppetdb.PuppetDB OK https://wikitech.wikimedia.org/wiki/Netbox%23Reports [15:19:58] (03CR) 10CDanis: [C: 03+2] "Fixed a few issues; tested that it catches some simple errors." [puppet] - 10https://gerrit.wikimedia.org/r/551196 (owner: 10Giuseppe Lavagetto) [15:27:25] (03CR) 10CDanis: [C: 03+2] taskgen: use filter_files_by for dhcp [puppet] - 10https://gerrit.wikimedia.org/r/551198 (owner: 10CDanis) [15:27:38] (03PS2) 10CDanis: taskgen: use filter_files_by for dhcp [puppet] - 10https://gerrit.wikimedia.org/r/551198 [15:29:05] ottomata: I'm not sure I understand, is the question whether we can route certain requests to certain servers at the cache layer? [15:29:22] ema: i guess the question is what should we do [15:29:50] should the endpoint be in a mediawiki domains? or should it have its own domain? 
[15:30:02] it will be POSTed to mostly by MW javascript clients, but also by other remote clients, like mobile apps [15:32:32] ottomata: separate seems cleaner to me [15:33:47] <_joe_> ema: I'm not sure our CORS settings would allow that [15:34:15] <_joe_> but if it's possible there are obvious reasons why it's better [15:34:23] <_joe_> but please let's not use beacon [15:34:49] <_joe_> let's use wespyonyou.w.o - it's more honest and it's also not blocked by all adblockers [15:35:28] (03PS2) 10BBlack: VCL: Remove host regex from TLS redirect and STS [puppet] - 10https://gerrit.wikimedia.org/r/550931 (https://phabricator.wikimedia.org/T133548) [15:35:30] (03PS2) 10BBlack: Add protocol to TLS analytics fields [puppet] - 10https://gerrit.wikimedia.org/r/550932 (https://phabricator.wikimedia.org/T233661) [15:35:32] (03PS2) 10BBlack: Add session to TLS analytics fields [puppet] - 10https://gerrit.wikimedia.org/r/550933 (https://phabricator.wikimedia.org/T233661) [15:35:33] nsa.w.o would fly below most radars I guess [15:35:34] (03PS2) 10BBlack: varnishxcps decom: undefine the mtail program [puppet] - 10https://gerrit.wikimedia.org/r/550934 [15:35:36] (03PS2) 10BBlack: varnishxcps decom: remove xcps ref [puppet] - 10https://gerrit.wikimedia.org/r/550935 [15:35:38] (03PS2) 10BBlack: varnishxcps decom: remove manifest [puppet] - 10https://gerrit.wikimedia.org/r/550936 [15:35:40] (03PS2) 10BBlack: varnishxcps decom: remove mtail prog/tests [puppet] - 10https://gerrit.wikimedia.org/r/550937 [15:35:42] (03PS2) 10BBlack: varnishxcps decom: remove global prom rules [puppet] - 10https://gerrit.wikimedia.org/r/550938 [15:35:44] (03PS2) 10BBlack: varnishxcps decom: remove mtail log outputs from VCL [puppet] - 10https://gerrit.wikimedia.org/r/550939 [15:35:46] (03PS3) 10BBlack: TLS analytics: simplify variable scheme [puppet] - 10https://gerrit.wikimedia.org/r/550940 [15:35:48] (03PS3) 10BBlack: TLS analytics: simplify logic for the present [puppet] - 
10https://gerrit.wikimedia.org/r/550941 [15:35:50] (03PS1) 10BBlack: VCL: Move XCPS logging even earlier in recv [puppet] - 10https://gerrit.wikimedia.org/r/551200 [15:35:52] (03PS1) 10BBlack: Revert "vcl: block requests with Host header set to an IP" [puppet] - 10https://gerrit.wikimedia.org/r/551201 [15:35:54] (03PS1) 10BBlack: VCL: Reject all non-canonical hostnames [puppet] - 10https://gerrit.wikimedia.org/r/551202 [15:36:11] ema: newp! nsa.w.o is reserved for anycast DNS: https://phabricator.wikimedia.org/T98006#5416434 [15:36:44] (at the bottom of that long comment) [15:36:54] ah! [15:37:07] hahhaha [15:37:11] some equivalent then, stasi.w.o/kgb? [15:38:01] we could re-use stream.wikimedia.org? [15:38:06] might be confusing though [15:38:10] /v2/stream is eventstreams [15:38:26] the API for eventgate (POST) is /v1/events [15:38:46] but we need multiple endpoint routing for different eventgates [15:38:50] so not sure what they should be [15:38:59] /v1 (or v2?)/logging/events [15:39:00] ? 
[15:39:29] (brb) [15:41:35] (back) [15:42:01] (03PS2) 10BBlack: VCL: Reject all non-canonical hostnames [puppet] - 10https://gerrit.wikimedia.org/r/551202 [15:42:03] (03PS3) 10BBlack: VCL: Remove host regex from TLS redirect and STS [puppet] - 10https://gerrit.wikimedia.org/r/550931 (https://phabricator.wikimedia.org/T133548) [15:42:05] (03PS3) 10BBlack: Add protocol to TLS analytics fields [puppet] - 10https://gerrit.wikimedia.org/r/550932 (https://phabricator.wikimedia.org/T233661) [15:42:07] (03PS3) 10BBlack: Add session to TLS analytics fields [puppet] - 10https://gerrit.wikimedia.org/r/550933 (https://phabricator.wikimedia.org/T233661) [15:42:09] (03PS3) 10BBlack: varnishxcps decom: undefine the mtail program [puppet] - 10https://gerrit.wikimedia.org/r/550934 [15:42:11] (03PS3) 10BBlack: varnishxcps decom: remove xcps ref [puppet] - 10https://gerrit.wikimedia.org/r/550935 [15:42:13] (03PS3) 10BBlack: varnishxcps decom: remove manifest [puppet] - 10https://gerrit.wikimedia.org/r/550936 [15:42:15] (03PS3) 10BBlack: varnishxcps decom: remove mtail prog/tests [puppet] - 10https://gerrit.wikimedia.org/r/550937 [15:42:17] (03PS3) 10BBlack: varnishxcps decom: remove global prom rules [puppet] - 10https://gerrit.wikimedia.org/r/550938 [15:42:19] (03PS3) 10BBlack: varnishxcps decom: remove mtail log outputs from VCL [puppet] - 10https://gerrit.wikimedia.org/r/550939 [15:42:21] (03PS4) 10BBlack: TLS analytics: simplify variable scheme [puppet] - 10https://gerrit.wikimedia.org/r/550940 [15:42:23] (03PS4) 10BBlack: TLS analytics: simplify logic for the present [puppet] - 10https://gerrit.wikimedia.org/r/550941 [15:42:38] (03CR) 10Herron: [C: 03+1] "looks good!" [puppet] - 10https://gerrit.wikimedia.org/r/551195 (owner: 10Volans) [15:46:18] (03PS1) 10RLazarus: Reflow from 79 to 100 columns. 
[software/httpbb] - 10https://gerrit.wikimedia.org/r/551204 [15:46:56] 10Operations, 10ops-codfw, 10ops-eqiad, 10netbox: Document PDU models - https://phabricator.wikimedia.org/T227632 (10faidon) Now that the PDU migration in eqiad has been completed, all that's left in this task is to record and document the modles for: - eqiad's row D (rows A/B as well as C are all document... [15:48:38] PROBLEM - Work requests waiting in Zuul Gearman server on contint1001 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [150.0] https://www.mediawiki.org/wiki/Continuous_integration/Zuul https://grafana.wikimedia.org/dashboard/db/zuul-gearman?panelId=10&fullscreen&orgId=1 [15:49:34] PROBLEM - SSH on analytics1077 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [15:49:46] The alert seems to be caused by bblack's stack of commits in puppet? Lame. [15:49:55] lol [15:49:59] more incoming! :) [15:50:09] Clearly the threshold is too low, then. 
[15:50:12] (03PS4) 10BBlack: VCL: Remove host regex from TLS redirect [puppet] - 10https://gerrit.wikimedia.org/r/550931 (https://phabricator.wikimedia.org/T133548) [15:50:14] (03PS4) 10BBlack: Add protocol to TLS analytics fields [puppet] - 10https://gerrit.wikimedia.org/r/550932 (https://phabricator.wikimedia.org/T233661) [15:50:16] (03PS4) 10BBlack: Add session to TLS analytics fields [puppet] - 10https://gerrit.wikimedia.org/r/550933 (https://phabricator.wikimedia.org/T233661) [15:50:18] (03PS4) 10BBlack: varnishxcps decom: undefine the mtail program [puppet] - 10https://gerrit.wikimedia.org/r/550934 [15:50:20] (03PS4) 10BBlack: varnishxcps decom: remove xcps ref [puppet] - 10https://gerrit.wikimedia.org/r/550935 [15:50:21] it's probably not common to upload so many at once [15:50:22] (03PS4) 10BBlack: varnishxcps decom: remove manifest [puppet] - 10https://gerrit.wikimedia.org/r/550936 [15:50:24] (03PS4) 10BBlack: varnishxcps decom: remove mtail prog/tests [puppet] - 10https://gerrit.wikimedia.org/r/550937 [15:50:26] (03PS4) 10BBlack: varnishxcps decom: remove global prom rules [puppet] - 10https://gerrit.wikimedia.org/r/550938 [15:50:28] (03PS4) 10BBlack: varnishxcps decom: remove mtail log outputs from VCL [puppet] - 10https://gerrit.wikimedia.org/r/550939 [15:50:30] (03PS5) 10BBlack: TLS analytics: simplify variable scheme [puppet] - 10https://gerrit.wikimedia.org/r/550940 [15:50:32] (03PS5) 10BBlack: TLS analytics: simplify logic for the present [puppet] - 10https://gerrit.wikimedia.org/r/550941 [15:50:32] I happen to be working on a bunch of inter-related changes and rebasing/fixing a bunch :/ [15:50:38] Yeah, but… [15:50:54] 10Operations, 10Product-Analytics, 10SRE-Access-Requests: Search Console access for he.wikisource.org - https://phabricator.wikimedia.org/T238090 (10Dzahn) [15:51:03] * bblack is here to exercise the edge cases for everyone! 
[15:54:16] PROBLEM - Hadoop NodeManager on analytics1077 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Administration [15:54:32] RECOVERY - SSH on analytics1077 is OK: SSH OK - OpenSSH_7.4p1 Debian-10+deb9u3 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [15:54:46] PROBLEM - Hadoop DataNode on analytics1077 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.hdfs.server.datanode.DataNode https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Administration [15:54:59] (03PS2) 10Physikerwelt: Enable links from math formulae on beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/551180 (https://phabricator.wikimedia.org/T208758) [15:56:13] 10Operations, 10Product-Analytics, 10SRE-Access-Requests: Search Console access for he.wikisource.org - https://phabricator.wikimedia.org/T238090 (10Dzahn) Thank you @Ijon! @Fuzzy i copied the check boxes to the top of the ticket. I think the first 3 are checked now. You have a sponsor and a valid reason fo... [15:56:28] checking analytics1077 [15:56:29] (03CR) 10BBlack: [C: 03+2] VCL: Move XCPS logging even earlier in recv [puppet] - 10https://gerrit.wikimedia.org/r/551200 (owner: 10BBlack) [15:56:56] (03CR) 10CDanis: [C: 03+1] Reflow from 79 to 100 columns. [software/httpbb] - 10https://gerrit.wikimedia.org/r/551204 (owner: 10RLazarus) [15:57:03] (03CR) 10CDanis: [C: 03+2] Reflow from 79 to 100 columns. 
[software/httpbb] - 10https://gerrit.wikimedia.org/r/551204 (owner: 10RLazarus) [15:57:16] 10Operations, 10Product-Analytics, 10SRE-Access-Requests: Search Console access for he.wikisource.org - https://phabricator.wikimedia.org/T238090 (10Dzahn) p:05Triage→03Normal [15:58:35] 10Operations, 10Product-Analytics, 10SRE-Access-Requests: Search Console access for he.wikisource.org - https://phabricator.wikimedia.org/T238090 (10Dzahn) Keep in mind we have a rotating clinic duty handling access request. Once the NDA part is done another person will complete this request. [16:00:56] !log jmm@cumin2001 START - Cookbook sre.hosts.downtime [16:00:57] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [16:00:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:01:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:02:00] !log rebooting rpki1001 to rectify microcode loading [16:02:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:02:17] (03PS1) 10BBlack: VCL: Do fe_ip stuff even earlier than TLS [puppet] - 10https://gerrit.wikimedia.org/r/551208 [16:03:14] ACKNOWLEDGEMENT - Check whether microcode mitigations for CPU vulnerabilities are applied on oresrdb2001 is CRITICAL: CRITICAL - Server is missing the following CPU flags: {ssbd} Muehlenhoff T223430 https://wikitech.wikimedia.org/wiki/Microcode [16:03:24] (03CR) 10BBlack: [V: 03+2 C: 03+2] VCL: Do fe_ip stuff even earlier than TLS [puppet] - 10https://gerrit.wikimedia.org/r/551208 (owner: 10BBlack) [16:05:59] 10Operations, 10serviceops, 10PHP 7.2 support: (euwiki) Mysterious, coordinated slowdowns every ~ 25 minutes on API servers - https://phabricator.wikimedia.org/T231011 (10Theklan) Last days I have been heavily merging some templates, and it may be related. 
[16:08:51] !log depool cp2001 for experiments [16:08:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:08:57] 10Operations, 10ops-eqiad: rack/setup/install ms-be105[7-9].eqiad.wmnet - https://phabricator.wikimedia.org/T237438 (10Cmjohnson) [16:09:20] RECOVERY - Check whether microcode mitigations for CPU vulnerabilities are applied on rpki1001 is OK: OK - All expected CPU flags found https://wikitech.wikimedia.org/wiki/Microcode [16:12:28] (03PS1) 10Jbond: promethus: add the metrics overlay to provide prometheus support [software/cas-overlay-template] - 10https://gerrit.wikimedia.org/r/551209 (https://phabricator.wikimedia.org/T233934) [16:14:32] (03Merged) 10jenkins-bot: Reflow from 79 to 100 columns. [software/httpbb] - 10https://gerrit.wikimedia.org/r/551204 (owner: 10RLazarus) [16:15:46] (03PS2) 10Ayounsi: MR: add policy-options and routing-options [homer/public] - 10https://gerrit.wikimedia.org/r/550576 [16:21:07] (03PS1) 10Jbond: apereo_cas: add prometheus actuator [puppet] - 10https://gerrit.wikimedia.org/r/551212 (https://phabricator.wikimedia.org/T233934) [16:25:25] !log repool cp2001 [16:25:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:27:19] 10Operations: Write ulogd logs to a dedicated logfile - https://phabricator.wikimedia.org/T238414 (10ayounsi) p:05Triage→03Normal [16:28:51] (03PS2) 10BBlack: Revert "vcl: block requests with Host header set to an IP" [puppet] - 10https://gerrit.wikimedia.org/r/551201 [16:28:53] (03PS3) 10BBlack: VCL: Reject all non-canonical hostnames [puppet] - 10https://gerrit.wikimedia.org/r/551202 [16:28:54] PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 240, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [16:28:55] (03PS5) 10BBlack: VCL: Remove host regex from TLS redirect [puppet] - 10https://gerrit.wikimedia.org/r/550931 
(https://phabricator.wikimedia.org/T133548) [16:29:00] (03CR) 10jerkins-bot: [V: 04-1] Revert "vcl: block requests with Host header set to an IP" [puppet] - 10https://gerrit.wikimedia.org/r/551201 (owner: 10BBlack) [16:29:02] (03CR) 10jerkins-bot: [V: 04-1] VCL: Reject all non-canonical hostnames [puppet] - 10https://gerrit.wikimedia.org/r/551202 (owner: 10BBlack) [16:29:04] (03CR) 10jerkins-bot: [V: 04-1] VCL: Remove host regex from TLS redirect [puppet] - 10https://gerrit.wikimedia.org/r/550931 (https://phabricator.wikimedia.org/T133548) (owner: 10BBlack) [16:29:18] !log CI slowed down due to a huge spike of internal jobs. Being flushed as of now # T140297 [16:29:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:29:40] PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 82, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [16:30:32] that's odd [16:30:35] 10Operations, 10Product-Analytics, 10SRE-Access-Requests: Search Console access for he.wikisource.org - https://phabricator.wikimedia.org/T238090 (10RStallman-legalteam) Hi @Fuzzy, I can take care of your NDA. Could you write to rstallman@wikimedia.org with your legal name, a mailing address (won't be used,... 
[16:30:40] the "unable to merge" jenkins failures I mean [16:31:05] bblack: yeah I have disabled it :-- [16:31:09] :-| [16:31:15] to ge tthe crazy queue to flush [16:31:19] oh I see [16:31:20] will restore them [16:31:27] for context, I am sitting next to James right now ;] [16:31:38] RECOVERY - Check systemd state on an-coord1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:31:42] I did stop uploading the whole set, just the first few :P [16:31:56] the root cause is an issue in Zuul which just overflow when a large chain of dependant changes is send [16:32:46] upstream has a patch to fix it [16:32:58] but I could not manage to backport it properly on our forked / legacy version :-\ [16:33:19] 10Operations, 10Wikimedia-Mailing-lists, 10Privacy, 10Security, 10User-Josve05a: Stop storing Mailman passwords in plain text - https://phabricator.wikimedia.org/T181803 (10sbassett) >>! In T181803#5666088, @Apap04 wrote: > Yeah, but you wouldn't want someone malicious to look at this issue and target WM... [16:34:57] (03CR) 10CDanis: "> Patch Set 2: Code-Review-1" [puppet] - 10https://gerrit.wikimedia.org/r/547792 (https://phabricator.wikimedia.org/T237134) (owner: 10CDanis) [16:35:05] (03PS3) 10CDanis: bot_blocked_nets: also block blank/unset UA [puppet] - 10https://gerrit.wikimedia.org/r/547792 (https://phabricator.wikimedia.org/T237134) [16:35:13] (03CR) 10jerkins-bot: [V: 04-1] bot_blocked_nets: also block blank/unset UA [puppet] - 10https://gerrit.wikimedia.org/r/547792 (https://phabricator.wikimedia.org/T237134) (owner: 10CDanis) [16:35:18] cdanis: CI is on hold so your patches^ are broken for now [16:35:24] well not the patch, but the CI result is broken [16:35:25] ;D [16:35:46] do I get a new custom t-shirt for breaking CI on a friday? 
:) [16:36:43] (03PS1) 10Muehlenhoff: Add image submission mode to debmonitor client [software/debmonitor] - 10https://gerrit.wikimedia.org/r/551220 (https://phabricator.wikimedia.org/T237978) [16:38:00] bblack: definitely. I should consider some CI t-shirt indeed :-] [16:38:07] but really it is not your fault [16:38:13] it is a flaw in our system :-\\ [16:38:34] I vaguely recall ages ago, gerrit would simply reject a push with over 10 commits, right at the outset [16:38:41] maybe we need that setting back for now :) [16:38:42] oh there is that as well [16:38:51] (03CR) 10jerkins-bot: [V: 04-1] Add image submission mode to debmonitor client [software/debmonitor] - 10https://gerrit.wikimedia.org/r/551220 (https://phabricator.wikimedia.org/T237978) (owner: 10Muehlenhoff) [16:38:54] (it didn't reject my push of 11 or 12 or whatever it was) [16:39:13] I guess that limit got raised so [16:39:38] at least then it didn't break CI, and it's not hard for submitters to batch the submissions to workaround [16:39:40] PROBLEM - Logstash Elasticsearch indexing errors on icinga1001 is CRITICAL: 1.483 ge 1 https://wikitech.wikimedia.org/wiki/Logstash https://logstash.wikimedia.org/goto/1cee1f1b5d4e6c5e06edb3353a2a4b83 https://grafana.wikimedia.org/dashboard/db/logstash [16:40:49] that's known ^ will be fixed on next MW train [16:41:10] RECOVERY - Work requests waiting in Zuul Gearman server on contint1001 is OK: OK: Less than 100.00% above the threshold [90.0] https://www.mediawiki.org/wiki/Continuous_integration/Zuul https://grafana.wikimedia.org/dashboard/db/zuul-gearman?panelId=10&fullscreen&orgId=1 [16:41:14] (03PS2) 10Muehlenhoff: Add image submission mode to debmonitor client [software/debmonitor] - 10https://gerrit.wikimedia.org/r/551220 (https://phabricator.wikimedia.org/T237978) [16:41:20] T238344 for the curious [16:41:20] T238344: MediaWiki Math invalid JSON in logs on Restbase server error - https://phabricator.wikimedia.org/T238344 [16:41:45] almost flushed [16:42:31] 
bblack i added you to the trusted user group (which raises how many changes you can push to 20 from 10) [16:43:04] PROBLEM - Logstash Elasticsearch indexing errors on icinga1001 is CRITICAL: 1.596 ge 1 https://wikitech.wikimedia.org/wiki/Logstash https://logstash.wikimedia.org/goto/1cee1f1b5d4e6c5e06edb3353a2a4b83 https://grafana.wikimedia.org/dashboard/db/logstash [16:43:16] probably trust has nothing to do with this issue though :) [16:43:19] (03CR) 10jerkins-bot: [V: 04-1] Add image submission mode to debmonitor client [software/debmonitor] - 10https://gerrit.wikimedia.org/r/551220 (https://phabricator.wikimedia.org/T237978) (owner: 10Muehlenhoff) [16:43:59] !log Restored zuul-merger / CI for operations/puppet.git [16:44:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:44:31] bblack was only answering " (it didn't reject my push of 11 or 12 or whatever it was)" " I guess that limit got raised so" :) [16:45:30] anyways, once CI is back online, I'll limit to smaller upload sets while I proceed with trying to break production itself :) [16:45:47] (03PS3) 10Muehlenhoff: Add image submission mode to debmonitor client [software/debmonitor] - 10https://gerrit.wikimedia.org/r/551220 (https://phabricator.wikimedia.org/T237978) [16:47:48] (03CR) 10jerkins-bot: [V: 04-1] Add image submission mode to debmonitor client [software/debmonitor] - 10https://gerrit.wikimedia.org/r/551220 (https://phabricator.wikimedia.org/T237978) (owner: 10Muehlenhoff) [16:49:39] (03CR) 10Hashar: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/510613 (owner: 10Jbond) [16:49:43] (03CR) 10Hashar: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/547792 (https://phabricator.wikimedia.org/T237134) (owner: 10CDanis) [16:51:34] RECOVERY - Logstash Elasticsearch indexing errors on icinga1001 is OK: (C)1 ge (W)0.2 ge 0.06667 https://wikitech.wikimedia.org/wiki/Logstash https://logstash.wikimedia.org/goto/1cee1f1b5d4e6c5e06edb3353a2a4b83 
https://grafana.wikimedia.org/dashboard/db/logstash [16:54:25] (03PS1) 10Jhedden: install_server: update cloudcephosd partman config [puppet] - 10https://gerrit.wikimedia.org/r/551226 (https://phabricator.wikimedia.org/T224188) [16:54:27] (03PS4) 10Muehlenhoff: Add image submission mode to debmonitor client [software/debmonitor] - 10https://gerrit.wikimedia.org/r/551220 (https://phabricator.wikimedia.org/T237978) [16:54:53] (03PS1) 10Andrew Bogott: pdns: support api server for pdns4 [puppet] - 10https://gerrit.wikimedia.org/r/551227 (https://phabricator.wikimedia.org/T210715) [16:56:36] (03CR) 10Jhedden: [C: 03+2] install_server: update cloudcephosd partman config [puppet] - 10https://gerrit.wikimedia.org/r/551226 (https://phabricator.wikimedia.org/T224188) (owner: 10Jhedden) [16:57:09] 10Operations, 10observability: Logstash doesn't parse ulogd source and destination ports - https://phabricator.wikimedia.org/T238416 (10ayounsi) p:05Triage→03Lowest [16:57:50] (03CR) 10jerkins-bot: [V: 04-1] pdns: support api server for pdns4 [puppet] - 10https://gerrit.wikimedia.org/r/551227 (https://phabricator.wikimedia.org/T210715) (owner: 10Andrew Bogott) [16:58:55] (03PS1) 10Andrew Bogott: Added dummy pdns api keys [labs/private] - 10https://gerrit.wikimedia.org/r/551228 [16:59:20] akosiaris: OK to merge your ops/puppet changes? `Add system:heapster role to prometheus (c7d82bc)` [17:00:11] jeh: I have to revert it, gimme a sec [17:00:23] ok, no problem [17:01:14] E: failed to lock, another puppet-merge running on this host? 
[17:01:15] locking process tree: systemd---sshd---sshd---sshd(jeh)---bash---sudo(root)---puppet-merge---su---sh(gitpuppet)---git---git-remote-http [17:01:24] jeh: re-run it, it should fetch both now [17:01:56] (03CR) 10Andrew Bogott: [V: 03+2 C: 03+2] Added dummy pdns api keys [labs/private] - 10https://gerrit.wikimedia.org/r/551228 (owner: 10Andrew Bogott) [17:02:20] I merged my change in, didn't see any errors or changes other than mine [17:04:33] (03PS2) 10Andrew Bogott: pdns: support api server for pdns4 [puppet] - 10https://gerrit.wikimedia.org/r/551227 (https://phabricator.wikimedia.org/T210715) [17:07:14] akosiaris: you can merge your changes whenever you're ready, there's a few in queue now [17:07:34] jeh: ok, will do so [17:07:36] (03CR) 10jerkins-bot: [V: 04-1] pdns: support api server for pdns4 [puppet] - 10https://gerrit.wikimedia.org/r/551227 (https://phabricator.wikimedia.org/T210715) (owner: 10Andrew Bogott) [17:08:10] (03PS3) 10Ayounsi: MR: add policy-options and routing-options [homer/public] - 10https://gerrit.wikimedia.org/r/550576 [17:09:37] (03PS3) 10Andrew Bogott: pdns: support api server for pdns4 [puppet] - 10https://gerrit.wikimedia.org/r/551227 (https://phabricator.wikimedia.org/T210715) [17:10:24] (03CR) 10Ayounsi: [V: 03+2 C: 03+2] "Thanks!" (032 comments) [homer/public] - 10https://gerrit.wikimedia.org/r/550576 (owner: 10Ayounsi) [17:10:32] 10Operations, 10Cloud-Services, 10cloud-services-team (Kanban): rack/setup/install (3) new osd ceph nodes - https://phabricator.wikimedia.org/T224188 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by jeh on cumin1001.eqiad.wmnet for hosts: ` ['cloudcephosd1002.wikimedia.org'] ` The log can be fo... 
[17:11:00] !log homer push to management routers (https://gerrit.wikimedia.org/r/550576) [17:11:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:11:18] (03CR) 10jerkins-bot: [V: 04-1] pdns: support api server for pdns4 [puppet] - 10https://gerrit.wikimedia.org/r/551227 (https://phabricator.wikimedia.org/T210715) (owner: 10Andrew Bogott) [17:13:35] (03PS4) 10Andrew Bogott: pdns: support api server for pdns4 [puppet] - 10https://gerrit.wikimedia.org/r/551227 (https://phabricator.wikimedia.org/T210715) [17:14:11] (03CR) 10Anomie: [C: 03+1] "The new table should be treated the same as the existing oauth_accepted_consumer, which seems to be what's going on here." [puppet] - 10https://gerrit.wikimedia.org/r/551140 (https://phabricator.wikimedia.org/T238370) (owner: 10Marostegui) [17:16:33] (03CR) 10jerkins-bot: [V: 04-1] pdns: support api server for pdns4 [puppet] - 10https://gerrit.wikimedia.org/r/551227 (https://phabricator.wikimedia.org/T210715) (owner: 10Andrew Bogott) [17:20:28] (03PS5) 10Andrew Bogott: pdns: support api server for pdns4 [puppet] - 10https://gerrit.wikimedia.org/r/551227 (https://phabricator.wikimedia.org/T210715) [17:23:14] PROBLEM - https://phabricator.wikimedia.org #page on phabricator.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Phabricator [17:23:33] uh oh [17:23:40] indeed, phab doesn't load for me [17:23:47] me too [17:23:48] I'm here too [17:24:01] * apergos peeks in [17:24:17] im here [17:24:22] out to lunch with my partner, following on my phone but I can jump in if needed [17:24:31] its loading for me [17:24:34] same here in atlanta [17:24:35] looking too, but I don't know anything about phab [17:24:42] er, same == not loading [17:24:47] it's back for me [17:24:49] also works from esams/Germany [17:25:18] still not here [17:25:18] loaded once, hung twice [17:25:23] works from my phone [17:25:53] works again for me now [17:25:55] definitely 
seeing success on apache on phab1003 [17:26:14] maybe php-fpm? [17:26:35] this is not good: https://grafana.wikimedia.org/d/000000587/phabricator?panelId=7&fullscreen&orgId=1&from=now-15m&to=now [17:26:37] my phone on the same wifi works, but my laptop (after reconnecting) is still hanging.... [17:27:14] it's very intermittent and slow here in atlanta [17:28:06] (03PS6) 10Andrew Bogott: pdns: support api server for pdns4 [puppet] - 10https://gerrit.wikimedia.org/r/551227 (https://phabricator.wikimedia.org/T210715) [17:28:29] does #page mean that's a paging alert? [17:28:32] yes [17:28:41] oh there it is, I guess my carrier was slow [17:28:44] logs indicating problems pushing repos to mirrors? [17:29:12] maybe the CI chaos also impacted phab, indirectly and latently? [17:29:24] hmm [17:29:27] No, CI is fixed. [17:29:31] there are a lot of phd workers using CPU but I didn't think those used Apache workers [17:29:31] not they are really disconnected [17:29:34] shouldn't have, the only thing connecting the two would be gerrit changes mirroring [17:29:38] Error while pushing "R1958" repository to mirrors. {>} (PhutilAggregateException) Exceptions occurred while mirroring the "tool-versions" repository. [17:29:41] the phabricator dashboard could use some more metrics [17:29:43] Aha, just got a response from Phab [17:29:48] really different infra that do not interact with each others (I mean zuul-mergers and Phabricator have no connections) [17:29:54] RECOVERY - https://phabricator.wikimedia.org #page on phabricator.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 36939 bytes in 0.132 second response time https://wikitech.wikimedia.org/wiki/Phabricator [17:30:06] !log phabricator - -started phd service [17:30:08] guess php-fpm got overloaded somehow [17:30:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:30:20] huh [17:30:21] i restarted php-fpm and started phd service [17:30:24] which was dead [17:30:27] Thanks mutante. 
[17:30:29] (03PS7) 10Andrew Bogott: pdns: support api server for pdns4 [puppet] - 10https://gerrit.wikimedia.org/r/551227 (https://phabricator.wikimedia.org/T210715) [17:30:30] thanks mutante [17:30:31] I see a bunch of "AH00288: scoreboard is full, not at MaxRequestWorkers" in apache2 error.log [17:30:41] I wonder what kills it off [17:30:54] cdanis: saw that too, though from UTC morning [17:30:59] well. this is weird: [17:30:59] Active: inactive (dead) since Wed 2019-10-23 23:02:49 UTC; 3 weeks 1 days ago [17:31:02] since 3 weeks? [17:31:03] godog: ah you are right [17:31:36] is there an icinga check for those? [17:31:42] yes [17:32:27] oh, wait. what i pasted wasn't the right server [17:33:07] (03PS8) 10Andrew Bogott: pdns: support api server for pdns4 [puppet] - 10https://gerrit.wikimedia.org/r/551227 (https://phabricator.wikimedia.org/T210715) [17:33:44] hmm [17:33:48] logstash might have details [17:34:06] apparently phab1003 only has apache logs so... [17:34:08] (03PS3) 10BBlack: Revert "vcl: block requests with Host header set to an IP" [puppet] - 10https://gerrit.wikimedia.org/r/551201 [17:34:10] (03PS4) 10BBlack: VCL: Reject all non-canonical hostnames [puppet] - 10https://gerrit.wikimedia.org/r/551202 [17:34:14] just 2 :) [17:34:19] :]]] [17:34:19] cdanis: that confused me too as they were at 09:23:44, which was exactly my local time... [17:34:23] XioNoX: thx for the restart! [17:34:29] I am off, gotta reach out to folks here IRL [17:34:34] I didn't do anything :) [17:34:35] hah! phabricator_error.log is where the error log is btw [17:35:32] /var/log/phd/daemons.log [17:35:38] and note 1003 is prod, not 1001 [17:36:36] https://grafana.wikimedia.org/d/000000278/mysql-aggregated?orgId=1&var-dc=eqiad%20prometheus%2Fops&var-group=misc&var-shard=m3&var-role=All&from=now-1h&to=now [17:36:40] this is phab's database shard [17:36:44] related to arcanist [17:37:25] what is the issue? 
[17:38:40] phab seemed to want to make a lot of writes [17:38:48] to the point of being unresponsive [17:38:56] (phab, not the DBs, I think?) [17:39:09] https://grafana.wikimedia.org/d/000000273/mysql?panelId=2&fullscreen&orgId=1&var-dc=eqiad%20prometheus%2Fops&var-server=db1128&var-port=9104 [17:39:12] indeed [17:39:36] I have no idea what triggered it [17:40:32] hm that was *after* the page though [17:40:40] so I think that's phd being restarted, if it was indeed not running? [17:40:50] I added some doc on https://wikitech.wikimedia.org/w/index.php?title=Phabricator&type=revision&diff=1844886&oldid=1841083 [17:41:04] phabricator tried to git push to gerrit and got exceptions trying to do so "while mirroring the "R2668" repository" but that is probably a red herring because it happened earlier as well [17:41:06] (03PS5) 10BBlack: VCL: Reject all non-canonical hostnames [puppet] - 10https://gerrit.wikimedia.org/r/551202 [17:41:14] mutante yeh [17:41:18] i see that quite a lot [17:41:22] it's lock failures [17:41:24] there's also nothing that looks meaningful in php7.2-fpm slowlog [17:42:05] (03CR) 10Ayounsi: [V: 03+2 C: 03+2] "Thanks!" [homer/public] - 10https://gerrit.wikimedia.org/r/549690 (owner: 10Ayounsi) [17:42:08] cdanis: phd on the actual prod server is running since 2 weeks. scratch that comment, it was from 1001 not 1003 [17:42:18] INSERT INTO `phabricator_worker`.`worker_taskdata` is what seems to have more activity [17:42:40] 10Operations, 10serviceops, 10PHP 7.2 support: (euwiki) Mysterious, coordinated slowdowns every ~ 25 minutes on API servers - https://phabricator.wikimedia.org/T231011 (10jijiki) @Theklan We suspect it is something on the production side since we have noticed this behaviour in the past. Moreover, this is not... [17:42:47] i don't think it failing to push to gerrit is related. 
[17:43:06] also INSERT INTO `lisk_counter` [17:43:53] {\"commitID\":\"100178\"} [17:44:11] ^check large imports going from git [17:45:13] Wouldn't it cause phabricator to throw a mysql error rather than causing upstream failures if it was mysql? [17:45:44] (03CR) 10BBlack: [C: 03+2] Revert "vcl: block requests with Host header set to an IP" [puppet] - 10https://gerrit.wikimedia.org/r/551201 (owner: 10BBlack) [17:45:45] 10Operations, 10SRE-Access-Requests: Read access for aklapper to Phabricator production database to run SELECT queries - https://phabricator.wikimedia.org/T238425 (10Aklapper) [17:45:46] I am not checking for mysql failures, I am saying what phabricator is doing [17:45:48] (03CR) 10BBlack: [C: 03+2] VCL: Reject all non-canonical hostnames [puppet] - 10https://gerrit.wikimedia.org/r/551202 (owner: 10BBlack) [17:46:42] apparently we have 500000 commits from gerrit on phabricator [17:47:14] request: "POST / source/mediawiki/lastmodified/master/ [17:49:30] and then that php-fpm child process got stopped for tracing [17:50:22] (03PS2) 10ArielGlenn: make dumpsdata primary nfs server rsync to dumpsdata1003 now [puppet] - 10https://gerrit.wikimedia.org/r/551173 (https://phabricator.wikimedia.org/T224563) [17:51:15] (03CR) 10Andrew Bogott: [C: 03+2] Depool labsdb1009 [puppet] - 10https://gerrit.wikimedia.org/r/550878 (https://phabricator.wikimedia.org/T237509) (owner: 10Andrew Bogott) [17:51:57] (03CR) 10Andrew Bogott: [C: 03+2] Remove 'globalblocks' table from maintain-views [puppet] - 10https://gerrit.wikimedia.org/r/550888 (https://phabricator.wikimedia.org/T237509) (owner: 10Andrew Bogott) [17:52:00] i will meet Mukunda in an hour and talk about that [17:53:26] (03PS5) 10Ayounsi: msw/asw: use same generic config [homer/public] - 10https://gerrit.wikimedia.org/r/549933 [17:53:56] (03CR) 10Ayounsi: msw/asw: use same generic config (031 comment) [homer/public] - 10https://gerrit.wikimedia.org/r/549933 (owner: 10Ayounsi) [17:54:00] (03CR) 10ArielGlenn: [C: 03+2] 
make dumpsdata primary nfs server rsync to dumpsdata1003 now [puppet] - 10https://gerrit.wikimedia.org/r/551173 (https://phabricator.wikimedia.org/T224563) (owner: 10ArielGlenn) [17:54:52] 10Operations, 10Cloud-Services, 10cloud-services-team (Kanban): rack/setup/install (3) new osd ceph nodes - https://phabricator.wikimedia.org/T224188 (10JHedden) I'm having an issue on cloudcephosd1002 and 1003. Using the first 2 240GB drives I created a RAID0 virtual disk, and specified that as the boot d... [17:57:54] 10Operations, 10Wikimedia-Mailing-lists, 10Privacy, 10Security, 10User-Josve05a: Stop storing Mailman passwords in plain text - https://phabricator.wikimedia.org/T181803 (10Bawolff) Keep in mind these passwords are (mostly?) randomly chosen by mailman not the user, so they are closer to tokens than tradi... [17:58:41] (03CR) 10Ayounsi: [V: 03+2 C: 03+2] msw/asw: use same generic config [homer/public] - 10https://gerrit.wikimedia.org/r/549933 (owner: 10Ayounsi) [17:59:01] (03PS1) 10BBlack: Support varnishcheck in host header filter [puppet] - 10https://gerrit.wikimedia.org/r/551238 [18:01:16] (03CR) 10BBlack: [C: 03+2] Support varnishcheck in host header filter [puppet] - 10https://gerrit.wikimedia.org/r/551238 (owner: 10BBlack) [18:07:44] PROBLEM - High average GET latency for mw requests on api_appserver in eqiad on icinga1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=api_appserver&var-m [18:07:47] !log homer push on management switches [18:07:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:09:12] RECOVERY - High average GET latency for mw requests on api_appserver in eqiad on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=api_appserver&var-method=GET [18:09:15] that alert seems to be the "normal" every 25m thing [18:18:50] I don't see any other alerts on my irc log [18:19:08] unless my client broke the log history replay again [18:19:29] I was talking about the get latency thing [18:19:36] 18:09 <+icinga-wm> RECOVERY - High average GET latency for mw requests on api_appserver [...] [18:19:43] yeah that one [18:20:14] I am trying to find the critical alert before the 18:07 one [18:20:16] I don't think it always makes it to icinga output here every 25 mins, if that's what you mean [18:20:25] checking icinga [18:20:52] (the alert isn't every 25 minutes, just the actual underlying spike is) [18:20:57] https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad%20prometheus%2Fops&var-cluster=api_appserver&var-method=GET [18:21:08] ok there were 4 more soft critical alerts [18:21:10] there's a ticket about it somewhere [18:21:24] I am not sure it is that one [18:21:43] (03CR) 10Ayounsi: "There is a diff for 1 devices: ['msw1-codfw.mgmt.codfw.wmnet'] leftovers from https://phabricator.wikimedia.org/T228112" [homer/public] - 10https://gerrit.wikimedia.org/r/549938 (owner: 10Ayounsi) [18:22:03] I was looking into the alerts more for the same task :p [18:22:40] ok scratch that [18:22:44] https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-24h&to=now&var-datasource=eqiad%20prometheus%2Fops&var-cluster=api_appserver&var-method=GET [18:22:53] it appears that it gotten worse [18:23:26] PROBLEM - cassandra-c SSL 10.192.32.105:7001 on restbase2015 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection 
refused https://phabricator.wikimedia.org/T120662 [18:23:30] anyway, that is the task [18:23:33] https://phabricator.wikimedia.org/T231011 [18:23:41] but we have not gotten anywhere so far [18:24:02] * effie sighs [18:24:20] PROBLEM - Check systemd state on restbase2015 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:24:40] PROBLEM - cassandra-c service on restbase2015 is CRITICAL: CRITICAL - Expecting active but unit cassandra-c is failed https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [18:24:44] PROBLEM - cassandra-c CQL 10.192.32.105:9042 on restbase2015 is CRITICAL: connect to address 10.192.32.105 and port 9042: Connection refused https://phabricator.wikimedia.org/T93886 [18:25:01] (03PS6) 10BBlack: VCL: Remove host regex from TLS redirect [puppet] - 10https://gerrit.wikimedia.org/r/550931 (https://phabricator.wikimedia.org/T133548) [18:27:00] (03CR) 10BBlack: [C: 03+2] VCL: Remove host regex from TLS redirect [puppet] - 10https://gerrit.wikimedia.org/r/550931 (https://phabricator.wikimedia.org/T133548) (owner: 10BBlack) [18:28:20] ACKNOWLEDGEMENT - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 240, down: 1, dormant: 0, excluded: 0, unused: 0: CDanis emergency maintenance: CenturyLink Ticket #: 17559560 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [18:28:20] ACKNOWLEDGEMENT - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 82, down: 1, dormant: 0, excluded: 0, unused: 0: CDanis emergency maintenance: CenturyLink Ticket #: 17559560 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [18:29:56] (03PS5) 10BBlack: Add prot and sess fields to TLS analytics [puppet] - 10https://gerrit.wikimedia.org/r/550932 (https://phabricator.wikimedia.org/T233661) [18:31:18] (03Abandoned) 10BBlack: Add 
session to TLS analytics fields [puppet] - 10https://gerrit.wikimedia.org/r/550933 (https://phabricator.wikimedia.org/T233661) (owner: 10BBlack) [18:31:30] (03Abandoned) 10BBlack: Add prot and sess fields to TLS analytics [puppet] - 10https://gerrit.wikimedia.org/r/550932 (https://phabricator.wikimedia.org/T233661) (owner: 10BBlack) [18:31:55] (03Restored) 10BBlack: Add prot and sess fields to TLS analytics [puppet] - 10https://gerrit.wikimedia.org/r/550932 (https://phabricator.wikimedia.org/T233661) (owner: 10BBlack) [18:32:30] oops, it's hard to see what to abandon sometimes after squishing and reuploading [18:33:11] heh [18:34:30] I don't usually have that; my vice is forgetting to git add all the things before I push and merge [18:36:03] (03CR) 10BBlack: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/550932 (https://phabricator.wikimedia.org/T233661) (owner: 10BBlack) [18:39:38] RECOVERY - Check systemd state on restbase2015 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:40:00] RECOVERY - cassandra-c service on restbase2015 is OK: OK - cassandra-c is active https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [18:40:30] RECOVERY - cassandra-c SSL 10.192.32.105:7001 on restbase2015 is OK: SSL OK - Certificate restbase2015-c valid until 2020-11-29 09:26:12 +0000 (expires in 379 days) https://phabricator.wikimedia.org/T120662 [18:41:44] RECOVERY - cassandra-c CQL 10.192.32.105:9042 on restbase2015 is OK: TCP OK - 0.037 second response time on 10.192.32.105 port 9042 https://phabricator.wikimedia.org/T93886 [18:46:05] (03CR) 10BBlack: [C: 03+2] Add prot and sess fields to TLS analytics [puppet] - 10https://gerrit.wikimedia.org/r/550932 (https://phabricator.wikimedia.org/T233661) (owner: 10BBlack) [18:47:45] (03PS1) 10RLazarus: Reflow from 79 to 100 columns. 
[software/httpbb] - 10https://gerrit.wikimedia.org/r/551244 [18:49:07] (03CR) 10Andrew Bogott: [C: 03+2] Revert "Depool labsdb1009" [puppet] - 10https://gerrit.wikimedia.org/r/550879 (https://phabricator.wikimedia.org/T237509) (owner: 10Andrew Bogott) [18:50:16] (03CR) 10Andrew Bogott: [C: 03+2] Depool labsdb1010 [puppet] - 10https://gerrit.wikimedia.org/r/550880 (https://phabricator.wikimedia.org/T237509) (owner: 10Andrew Bogott) [18:54:06] (03PS5) 10BBlack: varnishxcps decom: undefine the mtail program [puppet] - 10https://gerrit.wikimedia.org/r/550934 [18:54:26] 10Operations, 10ops-codfw, 10fundraising-tech-ops: rack/setup/install frban2001.frack.codfw.wmnet - https://phabricator.wikimedia.org/T234069 (10Jgreen) [18:54:45] 10Operations, 10ops-eqiad, 10fundraising-tech-ops: rack/setup/install frnetmon1001.frack.eqiad.wmnet - https://phabricator.wikimedia.org/T232137 (10Jgreen) [18:57:09] (03CR) 10BBlack: [C: 03+2] varnishxcps decom: undefine the mtail program [puppet] - 10https://gerrit.wikimedia.org/r/550934 (owner: 10BBlack) [18:57:19] 10Operations, 10Dumps-Generation: Migrate dumpsdata hosts to Stretch/Buster - https://phabricator.wikimedia.org/T224563 (10ArielGlenn) dumpsdata1003 is now receiving all files from dumpsdata1001 via rsync. dumpsdata1002 can be turned into a spare and re-imaged with buster as the next step. [18:59:24] (03CR) 10RLazarus: [C: 03+2] Reflow from 79 to 100 columns. 
[software/httpbb] - 10https://gerrit.wikimedia.org/r/551244 (owner: 10RLazarus) [19:03:00] (03PS1) 10BBlack: varnishxcps decom: remove from varnishmtail itself [puppet] - 10https://gerrit.wikimedia.org/r/551246 [19:03:54] (03PS9) 10Zoranzoki21: IS.php: Add wgProofreadPagePageJoiner, set it per default on '-' and at zhwikisource on __PAGEJOIN__ [mediawiki-config] - 10https://gerrit.wikimedia.org/r/482502 (https://phabricator.wikimedia.org/T205826) [19:07:02] !log remove vlan 1 trunking between msw1-codfw and mr1-codfw, will cause a quick connectivity issue - T228112 [19:07:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:07:07] T228112: Cable mr1-codfw<->cr1/2-codfw through asw-a-codfw - https://phabricator.wikimedia.org/T228112 [19:08:02] (03CR) 10Cwhite: "anything else to add or ready to go?" [puppet] - 10https://gerrit.wikimedia.org/r/542472 (https://phabricator.wikimedia.org/T205870) (owner: 10Cwhite) [19:09:20] (03CR) 10BBlack: [C: 03+2] varnishxcps decom: remove from varnishmtail itself [puppet] - 10https://gerrit.wikimedia.org/r/551246 (owner: 10BBlack) [19:09:25] done [19:11:03] (03CR) 10Ayounsi: [V: 03+2 C: 03+2] msw: ensure no vlans are configured [homer/public] - 10https://gerrit.wikimedia.org/r/549938 (owner: 10Ayounsi) [19:11:10] (03PS3) 10Ayounsi: msw: ensure no vlans are configured [homer/public] - 10https://gerrit.wikimedia.org/r/549938 [19:11:18] (03CR) 10Ayounsi: [V: 03+2 C: 03+2] msw: ensure no vlans are configured [homer/public] - 10https://gerrit.wikimedia.org/r/549938 (owner: 10Ayounsi) [19:12:55] (03PS5) 10BBlack: varnishxcps decom: remove xcps manifest [puppet] - 10https://gerrit.wikimedia.org/r/550935 [19:12:57] (03PS5) 10BBlack: varnishxcps decom: remove mtail rules and code [puppet] - 10https://gerrit.wikimedia.org/r/550937 [19:13:13] (03PS9) 10Cwhite: CI - python3: first attempt at adding python3 CI [puppet] - 10https://gerrit.wikimedia.org/r/510613 (owner: 10Jbond) [19:15:27] 10Operations, 10serviceops, 
10PHP 7.2 support: (euwiki) Mysterious, coordinated slowdowns every ~ 25 minutes on API servers - https://phabricator.wikimedia.org/T231011 (10akosiaris) I think it's in the ~35mins "schedule" now, but other than that, it's still present https://grafana.wikimedia.org/d/RIA1lzDZk/a... [19:15:45] (03CR) 10Ayounsi: Add security alg/forwarding-options/screen to mr template (031 comment) [homer/public] - 10https://gerrit.wikimedia.org/r/550356 (owner: 10Ayounsi) [19:15:47] (03CR) 10BBlack: [C: 03+2] varnishxcps decom: remove xcps manifest [puppet] - 10https://gerrit.wikimedia.org/r/550935 (owner: 10BBlack) [19:17:12] (03CR) 10jerkins-bot: [V: 04-1] CI - python3: first attempt at adding python3 CI [puppet] - 10https://gerrit.wikimedia.org/r/510613 (owner: 10Jbond) [19:18:01] (03CR) 10BBlack: [C: 03+2] varnishxcps decom: remove mtail rules and code [puppet] - 10https://gerrit.wikimedia.org/r/550937 (owner: 10BBlack) [19:19:35] (03CR) 10Andrew Bogott: [C: 03+2] Revert "Depool labsdb1010" [puppet] - 10https://gerrit.wikimedia.org/r/550881 (https://phabricator.wikimedia.org/T237509) (owner: 10Andrew Bogott) [19:19:43] (03PS10) 10Jbond: CI - python3: first attempt at adding python3 CI [puppet] - 10https://gerrit.wikimedia.org/r/510613 [19:19:52] (03CR) 10Andrew Bogott: [C: 03+2] Depool labsdb1011 [puppet] - 10https://gerrit.wikimedia.org/r/550882 (https://phabricator.wikimedia.org/T237509) (owner: 10Andrew Bogott) [19:20:20] 10Operations, 10Analytics, 10Event-Platform, 10Wikimedia-Logstash, 10observability: Move eventgate logs to new logging infrastructure - https://phabricator.wikimedia.org/T225129 (10Ottomata) Ah, I think the deployment-chart for eventgate already has this done. We just need to deploy. Will do soon enoug... 
[19:22:26] (03PS5) 10BBlack: varnishxcps decom: remove mtail log outputs from VCL [puppet] - 10https://gerrit.wikimedia.org/r/550939 [19:23:43] (03CR) 10jerkins-bot: [V: 04-1] CI - python3: first attempt at adding python3 CI [puppet] - 10https://gerrit.wikimedia.org/r/510613 (owner: 10Jbond) [19:25:07] 10Operations, 10serviceops, 10PHP 7.2 support: (euwiki) Mysterious, coordinated slowdowns every ~ 25 minutes on API servers - https://phabricator.wikimedia.org/T231011 (10Theklan) Ok! Now there's nothing special happening at euwiki and there are not new merges happening, so it should be another thing. [19:25:34] (03CR) 10BBlack: [C: 03+2] varnishxcps decom: remove mtail log outputs from VCL [puppet] - 10https://gerrit.wikimedia.org/r/550939 (owner: 10BBlack) [19:26:25] (03PS2) 10Ayounsi: Add virtual-chassis support [homer/public] - 10https://gerrit.wikimedia.org/r/550370 [19:26:38] (03CR) 10BBlack: [C: 03+2] TLS analytics: simplify variable scheme [puppet] - 10https://gerrit.wikimedia.org/r/550940 (owner: 10BBlack) [19:26:56] (03PS6) 10BBlack: TLS analytics: simplify variable scheme [puppet] - 10https://gerrit.wikimedia.org/r/550940 [19:27:08] (03PS1) 10Ottomata: Public cache routing for eventgate-logging-external [puppet] - 10https://gerrit.wikimedia.org/r/551247 (https://phabricator.wikimedia.org/T236386) [19:27:16] (03CR) 10Ayounsi: "> Patch Set 1:" (032 comments) [homer/public] - 10https://gerrit.wikimedia.org/r/550370 (owner: 10Ayounsi) [19:27:37] (03CR) 10BBlack: [V: 03+2 C: 03+2] TLS analytics: simplify variable scheme [puppet] - 10https://gerrit.wikimedia.org/r/550940 (owner: 10BBlack) [19:27:39] 10Operations, 10SRE-Access-Requests: Read access for aklapper to Phabricator production database to run SELECT queries - https://phabricator.wikimedia.org/T238425 (10Dzahn) +1. We talked about this and in the past i have often ran queries for Andre for things like community metrics. 
[19:28:47] (03CR) 10BBlack: [C: 03+2] TLS analytics: simplify logic for the present [puppet] - 10https://gerrit.wikimedia.org/r/550941 (owner: 10BBlack) [19:28:56] (03PS6) 10BBlack: TLS analytics: simplify logic for the present [puppet] - 10https://gerrit.wikimedia.org/r/550941 [19:28:57] (03CR) 10jerkins-bot: [V: 04-1] Public cache routing for eventgate-logging-external [puppet] - 10https://gerrit.wikimedia.org/r/551247 (https://phabricator.wikimedia.org/T236386) (owner: 10Ottomata) [19:29:00] (03CR) 10BBlack: [V: 03+2 C: 03+2] TLS analytics: simplify logic for the present [puppet] - 10https://gerrit.wikimedia.org/r/550941 (owner: 10BBlack) [19:30:49] (03Abandoned) 10BBlack: varnishxcps decom: remove global prom rules [puppet] - 10https://gerrit.wikimedia.org/r/550938 (owner: 10BBlack) [19:31:02] (03Abandoned) 10BBlack: varnishxcps decom: remove manifest [puppet] - 10https://gerrit.wikimedia.org/r/550936 (owner: 10BBlack) [19:31:38] (03PS2) 10Ottomata: Public cache routing for eventgate-logging-external [puppet] - 10https://gerrit.wikimedia.org/r/551247 (https://phabricator.wikimedia.org/T236386) [19:34:20] 10Operations, 10Phabricator: List of recent most active Phab "Priority" field setters - https://phabricator.wikimedia.org/T235153 (10Dzahn) a:03Dzahn [19:36:19] 10Operations, 10Phabricator: List of recent most active Phab "Priority" field setters - https://phabricator.wikimedia.org/T235153 (10Dzahn) @Aklapper here it is: {P9646} [19:36:27] 10Operations, 10Phabricator: List of recent most active Phab "Priority" field setters - https://phabricator.wikimedia.org/T235153 (10Dzahn) 05Open→03Resolved [19:39:55] 10Operations, 10DBA, 10SRE-Access-Requests: Read access for aklapper to Phabricator production database to run SELECT queries - https://phabricator.wikimedia.org/T238425 (10Dzahn) [19:42:53] 10Operations, 10SRE-tools, 10Traffic, 10Goal, and 3 others: Automate generation of Management DNS records from Netbox - https://phabricator.wikimedia.org/T233183 
(10Volans) A simplified version could be to use a cookbook to couple stuff: - have a script on the netbox hosts to generate the snippets from th... [19:44:28] 10Operations, 10DBA, 10SRE-Access-Requests: Read access for aklapper to Phabricator production database to run SELECT queries - https://phabricator.wikimedia.org/T238425 (10Dzahn) @DBA fyi. I suggest i can puppetize that Andre gets a my.cnf written to his home dir somewhere with the existing "metrics_user" f... [19:47:05] (03PS1) 10RLazarus: httpbb: Install python3-requests-toolbelt. [puppet] - 10https://gerrit.wikimedia.org/r/551249 (https://phabricator.wikimedia.org/T236699) [19:47:10] (03PS1) 10RLazarus: Verify SSL certs against the domain in the Host: header. [software/httpbb] - 10https://gerrit.wikimedia.org/r/551250 (https://phabricator.wikimedia.org/T236699) [19:48:07] (03CR) 10CDanis: [C: 03+1] httpbb: Install python3-requests-toolbelt. [puppet] - 10https://gerrit.wikimedia.org/r/551249 (https://phabricator.wikimedia.org/T236699) (owner: 10RLazarus) [19:49:24] (03CR) 10CDanis: [C: 03+1] Verify SSL certs against the domain in the Host: header. 
[software/httpbb] - 10https://gerrit.wikimedia.org/r/551250 (https://phabricator.wikimedia.org/T236699) (owner: 10RLazarus) [19:49:50] (03CR) 10Dzahn: [C: 03+1] "contains: FingerprintAdapter, SSLAdapter, SourceAddressAdapter, SocketOptionsAdapter, TCPKeepAliveAdapter and authenticators: AuthHandler" [puppet] - 10https://gerrit.wikimedia.org/r/551249 (https://phabricator.wikimedia.org/T236699) (owner: 10RLazarus) [19:52:45] 10Operations, 10DBA, 10SRE-Access-Requests: Read access for aklapper to Phabricator production database to run SELECT queries - https://phabricator.wikimedia.org/T238425 (10Dzahn) a:03Dzahn [19:52:52] 10Operations, 10DBA, 10SRE-Access-Requests: Read access for aklapper to Phabricator production database to run SELECT queries - https://phabricator.wikimedia.org/T238425 (10Dzahn) p:05Triage→03Normal [19:54:47] 10Operations, 10Cloud-Services, 10cloud-services-team (Kanban): rack/setup/install (3) new osd ceph nodes - https://phabricator.wikimedia.org/T224188 (10JHedden) Looks like it's an issue with the virtual disk not getting assigned /dev/sda. Checking to see if I can work around this with our installation proce... [19:55:19] (03PS1) 10Alexandros Kosiaris: RBAC: Unify rules into 1 file [deployment-charts] - 10https://gerrit.wikimedia.org/r/551251 [19:57:16] (03PS1) 10BBlack: Depool ulsfo [dns] - 10https://gerrit.wikimedia.org/r/551252 [19:57:53] (03CR) 10Jbond: "The CI errors are occurring because there is an update to tox.ini. 
this means that the tox:update job runs with `tox -r` and the py{2,3}-" [puppet] - https://gerrit.wikimedia.org/r/510613 (owner: Jbond) [19:58:13] (PS1) Ottomata: TLS envoyproxy support for eventgate chart [deployment-charts] - https://gerrit.wikimedia.org/r/551253 (https://phabricator.wikimedia.org/T236386) [19:58:26] (CR) jerkins-bot: [V: -1] TLS envoyproxy support for eventgate chart [deployment-charts] - https://gerrit.wikimedia.org/r/551253 (https://phabricator.wikimedia.org/T236386) (owner: Ottomata) [19:58:41] (PS1) Tpt: Properly configures the Wikisource extension on beta cluster [mediawiki-config] - https://gerrit.wikimedia.org/r/551254 (https://phabricator.wikimedia.org/T236502) [19:59:20] (CR) jerkins-bot: [V: -1] Properly configures the Wikisource extension on beta cluster [mediawiki-config] - https://gerrit.wikimedia.org/r/551254 (https://phabricator.wikimedia.org/T236502) (owner: Tpt) [20:00:47] (PS2) Ottomata: TLS envoyproxy support for eventgate chart [deployment-charts] - https://gerrit.wikimedia.org/r/551253 (https://phabricator.wikimedia.org/T236386) [20:01:00] (CR) jerkins-bot: [V: -1] TLS envoyproxy support for eventgate chart [deployment-charts] - https://gerrit.wikimedia.org/r/551253 (https://phabricator.wikimedia.org/T236386) (owner: Ottomata) [20:01:08] (PS2) Tpt: Properly configures the Wikisource extension on beta cluster [mediawiki-config] - https://gerrit.wikimedia.org/r/551254 (https://phabricator.wikimedia.org/T236502) [20:02:00] !log push pfw policies to pfw3-codfw - T238368 [20:02:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:03:33] (PS3) Tpt: Properly configures the Wikisource extension on beta cluster [mediawiki-config] - https://gerrit.wikimedia.org/r/551254 (https://phabricator.wikimedia.org/T236502) [20:04:00] (CR) Andrew Bogott: [C: +2] Revert "Depool labsdb1011" [puppet] - https://gerrit.wikimedia.org/r/550883
(https://phabricator.wikimedia.org/T237509) (owner: Andrew Bogott) [20:04:06] !log push pfw policies to pfw3-eqiad - T238368 [20:04:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:04:52] (CR) Ottomata: "Not sure why the .fixtures file isn't being used in the test." [deployment-charts] - https://gerrit.wikimedia.org/r/551253 (https://phabricator.wikimedia.org/T236386) (owner: Ottomata) [20:08:27] (PS3) Ottomata: TLS envoyproxy support for eventgate chart [deployment-charts] - https://gerrit.wikimedia.org/r/551253 (https://phabricator.wikimedia.org/T236386) [20:08:40] (CR) jerkins-bot: [V: -1] TLS envoyproxy support for eventgate chart [deployment-charts] - https://gerrit.wikimedia.org/r/551253 (https://phabricator.wikimedia.org/T236386) (owner: Ottomata) [20:17:17] (CR) Giuseppe Lavagetto: [C: -1] "If you set tls.enabled: true in values you also need some values for the certs." (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/551253 (https://phabricator.wikimedia.org/T236386) (owner: Ottomata) [20:18:34] (PS4) Ottomata: TLS envoyproxy support for eventgate chart [deployment-charts] - https://gerrit.wikimedia.org/r/551253 (https://phabricator.wikimedia.org/T236386) [20:20:51] (PS1) Jhedden: install_server: update cloudcephosd root disk [puppet] - https://gerrit.wikimedia.org/r/551256 (https://phabricator.wikimedia.org/T224188) [20:21:44] Operations, ops-codfw, ops-eqiad, netbox: Document PDU models - https://phabricator.wikimedia.org/T227632 (wiki_willy) @faidon - I'll check with the team on this one, and see if we can get an estimate for completion.
Thanks, Willy [20:29:43] (PS5) Ottomata: TLS envoyproxy support for eventgate chart [deployment-charts] - https://gerrit.wikimedia.org/r/551253 (https://phabricator.wikimedia.org/T236386) [20:30:49] (PS1) Ottomata: Enable TLS envoyproxy for eventgate-logging-external instances [deployment-charts] - https://gerrit.wikimedia.org/r/551263 (https://phabricator.wikimedia.org/T236386) [20:37:10] (CR) Jhedden: [C: +2] install_server: update cloudcephosd root disk [puppet] - https://gerrit.wikimedia.org/r/551256 (https://phabricator.wikimedia.org/T224188) (owner: Jhedden) [20:41:57] (PS3) Ottomata: Add LVS for eventgate-logging-external using TLS port [puppet] - https://gerrit.wikimedia.org/r/550922 (https://phabricator.wikimedia.org/T236386) [20:42:06] Operations, Cloud-Services, Patch-For-Review, cloud-services-team (Kanban): rack/setup/install (3) new osd ceph nodes - https://phabricator.wikimedia.org/T224188 (ops-monitoring-bot) Script wmf-auto-reimage was launched by jeh on cumin1001.eqiad.wmnet for hosts: ` ['cloudcephosd1002.wikimedia.org...
[20:43:56] (PS3) Ottomata: Public cache routing for eventgate-logging-external [puppet] - https://gerrit.wikimedia.org/r/551247 (https://phabricator.wikimedia.org/T236386) [20:46:26] (PS4) Ottomata: Public cache routing for eventgate-logging-external [puppet] - https://gerrit.wikimedia.org/r/551247 (https://phabricator.wikimedia.org/T236386) [20:47:20] (CR) Ottomata: Public cache routing for eventgate-logging-external (1 comment) [puppet] - https://gerrit.wikimedia.org/r/551247 (https://phabricator.wikimedia.org/T236386) (owner: Ottomata) [20:57:38] Operations, DBA, SRE-Access-Requests: Read access for phabricator-admins (aklapper) to Phabricator production database to run SELECT queries - https://phabricator.wikimedia.org/T238425 (Dzahn) [20:57:43] (PS1) Alexandros Kosiaris: RBAC: Allow prometheus access to nodes resources [deployment-charts] - https://gerrit.wikimedia.org/r/551266 (https://phabricator.wikimedia.org/T238410) [21:03:12] (PS1) Jhedden: install_server: cloudcephosd update grub bootdev [puppet] - https://gerrit.wikimedia.org/r/551267 (https://phabricator.wikimedia.org/T224188) [21:05:33] (CR) Jhedden: [C: +2] install_server: cloudcephosd update grub bootdev [puppet] - https://gerrit.wikimedia.org/r/551267 (https://phabricator.wikimedia.org/T224188) (owner: Jhedden) [21:07:56] (PS1) Dzahn: phabricator: write my.cnf for db access into each admin home dir [puppet] - https://gerrit.wikimedia.org/r/551268 (https://phabricator.wikimedia.org/T238425) [21:13:37] (PS22) Jforrester: Variant configuration: Pre-calculate config for each wiki and store it in config.git [mediawiki-config] - https://gerrit.wikimedia.org/r/507729 (https://phabricator.wikimedia.org/T223602) [21:14:30] (PS4) Jforrester: Variant configuration: Move some all-wiki configuration from CS to all.yaml [mediawiki-config] - https://gerrit.wikimedia.org/r/539436 [21:15:51] (CR) jerkins-bot: [V: -1] Variant configuration:
Move some all-wiki configuration from CS to all.yaml [mediawiki-config] - https://gerrit.wikimedia.org/r/539436 (owner: Jforrester) [21:17:27] Operations, Cloud-Services, cloud-services-team (Kanban): rack/setup/install (3) new osd ceph nodes - https://phabricator.wikimedia.org/T224188 (ops-monitoring-bot) Script wmf-auto-reimage was launched by jeh on cumin1001.eqiad.wmnet for hosts: ` ['cloudcephosd1002.wikimedia.org'] ` The log can be fo... [21:21:16] <_joe_> !log disabling proxying to ws on phabricator1003 [21:21:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:26:41] (CR) Jforrester: [C: -2] "We maybe aren't going to commit these, but instead generate them in scap during deploys." [mediawiki-config] - https://gerrit.wikimedia.org/r/507729 (https://phabricator.wikimedia.org/T223602) (owner: Jforrester) [21:27:58] (PS5) Jforrester: Variant configuration: Move some all-wiki configuration from CS to all.yaml [mediawiki-config] - https://gerrit.wikimedia.org/r/539436 [21:28:55] (PS1) Herron: logstash: parse DPT and SPT from ulogd events [puppet] - https://gerrit.wikimedia.org/r/551270 (https://phabricator.wikimedia.org/T238416) [21:29:12] (CR) Jforrester: [C: -2] CommonSettings: Switch from getMWConfigForCacheing to getCachableMWConfig to avoid wgConf [mediawiki-config] - https://gerrit.wikimedia.org/r/538342 (owner: Jforrester) [21:29:21] (PS6) Jforrester: CommonSettings: Switch from getMWConfigForCacheing to getCachableMWConfig to avoid wgConf [mediawiki-config] - https://gerrit.wikimedia.org/r/538342 [21:29:36] (PS4) Jforrester: Drop getMWConfigForCacheing [mediawiki-config] - https://gerrit.wikimedia.org/r/538343 [21:29:42] !log jeh@cumin1001 START - Cookbook sre.hosts.downtime [21:29:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:31:21] (PS2) Jforrester: Drop HHVMRequestInit, never called [mediawiki-config] -
https://gerrit.wikimedia.org/r/542184 (https://phabricator.wikimedia.org/T235142) [21:31:49] !log jeh@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [21:31:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:32:01] (CR) Jforrester: [C: -1] "(Still used in modules/profile/manifests/mediawiki/hhvm.pp.)" [mediawiki-config] - https://gerrit.wikimedia.org/r/542184 (https://phabricator.wikimedia.org/T235142) (owner: Jforrester) [21:32:09] Operations, observability, Patch-For-Review: Logstash doesn't parse ulogd source and destination ports - https://phabricator.wikimedia.org/T238416 (herron) https://gerrit.wikimedia.org/r/551270 should do the trick for source/dest ports. I don't recall why these weren't parsed out in the first place.... [21:35:31] (PS1) Dzahn: phabricator: disable aphlict [puppet] - https://gerrit.wikimedia.org/r/551271 [21:35:38] Operations, Cloud-Services, cloud-services-team (Kanban): rack/setup/install (3) new osd ceph nodes - https://phabricator.wikimedia.org/T224188 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['cloudcephosd1002.wikimedia.org'] ` and were **ALL** successful. [21:38:39] (CR) Alexandros Kosiaris: [C: +1] phabricator: disable aphlict [puppet] - https://gerrit.wikimedia.org/r/551271 (owner: Dzahn) [21:39:59] Operations, Cloud-Services, cloud-services-team (Kanban): rack/setup/install (3) new osd ceph nodes - https://phabricator.wikimedia.org/T224188 (ops-monitoring-bot) Script wmf-auto-reimage was launched by jeh on cumin1001.eqiad.wmnet for hosts: ` ['cloudcephosd1003.wikimedia.org'] ` The log can be fo...
[21:41:21] (CR) Ayounsi: [C: +1] logstash: parse DPT and SPT from ulogd events [puppet] - https://gerrit.wikimedia.org/r/551270 (https://phabricator.wikimedia.org/T238416) (owner: Herron) [21:41:53] Operations, observability, Patch-For-Review: Logstash doesn't parse ulogd source and destination ports - https://phabricator.wikimedia.org/T238416 (ayounsi) Not that I can think of for now. Thanks! [21:42:19] PROBLEM - Check the Netbox report puppetdb for fail status. on netbox1001 is CRITICAL: puppetdb.PuppetDB CRITICAL https://wikitech.wikimedia.org/wiki/Netbox%23Reports [21:50:32] PROBLEM - High average GET latency for mw requests on api_appserver in eqiad on icinga1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=api_appserver&var-m [21:50:40] (PS1) Ayounsi: Automatically cast network strings to ipaddress objects [software/homer] - https://gerrit.wikimedia.org/r/551273 [21:51:40] RECOVERY - High average GET latency for mw requests on api_appserver in eqiad on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=api_appserver&var-method=GET [21:52:13] !log jeh@cumin1001 START - Cookbook sre.hosts.downtime [21:52:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:52:36] (PS1) Ayounsi: mr: add DHCP server support + replace all system {} [homer/public] - https://gerrit.wikimedia.org/r/551274 [21:52:58] (CR) Jforrester: [C: +1] "Let's go?"
[mediawiki-config] - https://gerrit.wikimedia.org/r/551180 (https://phabricator.wikimedia.org/T208758) (owner: Physikerwelt) [21:53:08] (CR) jerkins-bot: [V: -1] Automatically cast network strings to ipaddress objects [software/homer] - https://gerrit.wikimedia.org/r/551273 (owner: Ayounsi) [21:54:21] !log jeh@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [21:54:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:56:05] PROBLEM - Disk space on stat1007 is CRITICAL: DISK CRITICAL - free space: /srv 243895 MB (3% inode=98%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=stat1007&var-datasource=eqiad+prometheus/ops [21:58:10] Operations, Cloud-Services, cloud-services-team (Kanban): rack/setup/install (3) new osd ceph nodes - https://phabricator.wikimedia.org/T224188 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['cloudcephosd1003.wikimedia.org'] ` and were **ALL** successful. [21:58:52] Operations, Cloud-Services, cloud-services-team (Kanban): rack/setup/install (3) new osd ceph nodes - https://phabricator.wikimedia.org/T224188 (ops-monitoring-bot) Script wmf-auto-reimage was launched by jeh on cumin1001.eqiad.wmnet for hosts: ` ['cloudcephosd1001.wikimedia.org'] ` The log can be fo...
[22:04:37] (PS8) CDanis: prometheus: export NIC firmware versions [puppet] - https://gerrit.wikimedia.org/r/549683 (https://phabricator.wikimedia.org/T236744) [22:04:39] (PS1) CDanis: systemd::timer::job: fix bug re: On(In)?ActiveUnitSec [puppet] - https://gerrit.wikimedia.org/r/551281 [22:09:35] (PS2) CDanis: systemd::timer::job: fix bug re: On(In)?ActiveUnitSec [puppet] - https://gerrit.wikimedia.org/r/551281 [22:09:37] (PS9) CDanis: prometheus: export NIC firmware versions [puppet] - https://gerrit.wikimedia.org/r/549683 (https://phabricator.wikimedia.org/T236744) [22:10:08] (PS2) Ayounsi: Automatically cast network strings to ipaddress objects [software/homer] - https://gerrit.wikimedia.org/r/551273 [22:10:14] (PS6) Brennen Bearnes: logging: add logspam utilities [puppet] - https://gerrit.wikimedia.org/r/547777 [22:12:08] !log jeh@cumin1001 START - Cookbook sre.hosts.downtime [22:12:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:12:51] (CR) jerkins-bot: [V: -1] Automatically cast network strings to ipaddress objects [software/homer] - https://gerrit.wikimedia.org/r/551273 (owner: Ayounsi) [22:14:17] !log jeh@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [22:14:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:14:59] Operations, Cloud-Services, cloud-services-team (Kanban): rack/setup/install (3) new osd ceph nodes - https://phabricator.wikimedia.org/T224188 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['cloudcephosd1001.wikimedia.org'] ` Of which those **FAILED**: ` ['cloudcephosd1001.wikimedia.org'] ` [22:18:08] (PS1) RLazarus: Refactor the state shared between test cases into a TestHarness class.
[software/httpbb] - https://gerrit.wikimedia.org/r/551283 (https://phabricator.wikimedia.org/T236699) [22:18:36] (PS3) CDanis: systemd::timer::job: fix bug re: On(In)?ActiveUnitSec [puppet] - https://gerrit.wikimedia.org/r/551281 [22:18:38] (PS10) CDanis: prometheus: export NIC firmware versions [puppet] - https://gerrit.wikimedia.org/r/549683 (https://phabricator.wikimedia.org/T236744) [22:18:51] (PS1) Dzahn: re-add phabricator-new to point to caching layer [dns] - https://gerrit.wikimedia.org/r/551284 (https://phabricator.wikimedia.org/T137928) [22:25:21] (PS1) Dzahn: phabricator: use codfw db servers for codfw server [puppet] - https://gerrit.wikimedia.org/r/551285 (https://phabricator.wikimedia.org/T137928) [22:33:02] (PS4) CDanis: systemd::timer::job: fix bug re: On(In)?ActiveUnitSec [puppet] - https://gerrit.wikimedia.org/r/551281 [22:33:04] (PS11) CDanis: prometheus: export NIC firmware versions [puppet] - https://gerrit.wikimedia.org/r/549683 (https://phabricator.wikimedia.org/T236744) [22:37:23] (PS5) CDanis: systemd::timer::job: fix bug re: On(In)?ActiveUnitSec [puppet] - https://gerrit.wikimedia.org/r/551281 [22:37:25] (PS12) CDanis: prometheus: export NIC firmware versions [puppet] - https://gerrit.wikimedia.org/r/549683 (https://phabricator.wikimedia.org/T236744) [22:40:15] (PS1) Dzahn: ATS/varnish: add phabricator-new to point to phab1001 [puppet] - https://gerrit.wikimedia.org/r/551286 (https://phabricator.wikimedia.org/T190568) [22:40:24] (CR) jerkins-bot: [V: -1] systemd::timer::job: fix bug re: On(In)?ActiveUnitSec [puppet] - https://gerrit.wikimedia.org/r/551281 (owner: CDanis) [22:49:38] (PS1) Dzahn: install_server: switch phab2001 to buster [puppet] - https://gerrit.wikimedia.org/r/551287 (https://phabricator.wikimedia.org/T190568) [22:56:05] Operations, Cloud-Services, cloud-services-team (Kanban): rack/setup/install (3) new osd ceph nodes -
https://phabricator.wikimedia.org/T224188 (JHedden) [23:00:06] (PS9) Andrew Bogott: pdns: support api server for pdns4 [puppet] - https://gerrit.wikimedia.org/r/551227 (https://phabricator.wikimedia.org/T210715) [23:01:03] Operations, Cloud-Services, cloud-services-team (Kanban): rack/setup/install (3) new osd ceph nodes - https://phabricator.wikimedia.org/T224188 (JHedden) I've update the task details with the current status. Should I leave the netbox status as staged or set it to active? These systems will be testing... [23:07:07] RECOVERY - Check the Netbox report puppetdb for fail status. on netbox1001 is OK: puppetdb.PuppetDB OK https://wikitech.wikimedia.org/wiki/Netbox%23Reports [23:08:01] (PS6) CDanis: systemd::timer::job: fix bug re: On(In)?ActiveUnitSec [puppet] - https://gerrit.wikimedia.org/r/551281 [23:08:03] (PS13) CDanis: prometheus: export NIC firmware versions [puppet] - https://gerrit.wikimedia.org/r/549683 (https://phabricator.wikimedia.org/T236744) [23:11:18] (CR) jerkins-bot: [V: -1] systemd::timer::job: fix bug re: On(In)?ActiveUnitSec [puppet] - https://gerrit.wikimedia.org/r/551281 (owner: CDanis) [23:28:01] Operations, DBA, User-notice: Switchover s7 primary database master db1062 -> db1086 - 26th Nov 06:00 - 06:30 UTC - https://phabricator.wikimedia.org/T238044 (Agusbou2015) There is a typo on the date: it should be "Tue 26th Nov", not "Tue 24th Nov", despite the correct date is shown in the title. [23:38:47] (CR) Andrew Bogott: [C: +2] pdns: support api server for pdns4 [puppet] - https://gerrit.wikimedia.org/r/551227 (https://phabricator.wikimedia.org/T210715) (owner: Andrew Bogott) [23:51:54] Operations, DBA, User-notice: Switchover s7 primary database master db1062 -> db1086 - 26th Nov 06:00 - 06:30 UTC - https://phabricator.wikimedia.org/T238044 (JJMC89)