[00:02:44] !log catrope@deploy1001 Started scap: GrowthExperiments and MobileFrontend changes SWAT (includes i18n)
[00:02:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:03:49] Operations, DBA, SRE-Access-Requests, Patch-For-Review: Read access for phabricator-admins (aklapper) to Phabricator production database to run SELECT queries - https://phabricator.wikimedia.org/T238425 (Dzahn) ` [phab1001:/home/aklapper] $ mysql ERROR 2002 (HY000): Can't connect to local MySQL s...
[00:05:47] Operations, DBA, SRE-Access-Requests, Patch-For-Review: Read access for phabricator-admins (aklapper) to Phabricator production database to run SELECT queries - https://phabricator.wikimedia.org/T238425 (Dzahn) You can switch to the other databases we identified with "use" after you connected. F...
[00:09:08] Operations, DBA, SRE-Access-Requests, Patch-For-Review: Read access for phabricator-admins (aklapper) to Phabricator production database to run SELECT queries - https://phabricator.wikimedia.org/T238425 (Dzahn) Open→Resolved Let me know if this works for you.
[00:18:41] !log catrope@deploy1001 Finished scap: GrowthExperiments and MobileFrontend changes SWAT (includes i18n) (duration: 15m 57s)
[00:18:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:19:45] (CR) Catrope: [C: +2] GrowthExperiments: Enable suggested edits without opt-in [mediawiki-config] - https://gerrit.wikimedia.org/r/552156 (https://phabricator.wikimedia.org/T227728) (owner: Catrope)
[00:20:28] (Merged) jenkins-bot: GrowthExperiments: Enable suggested edits without opt-in [mediawiki-config] - https://gerrit.wikimedia.org/r/552156 (https://phabricator.wikimedia.org/T227728) (owner: Catrope)
[00:21:19] (PS3) CRusnov: netbox report alerting: Simplify icinga check and cleanup [puppet] - https://gerrit.wikimedia.org/r/552154 (https://phabricator.wikimedia.org/T237803)
[00:24:32] !log catrope@deploy1001 Synchronized wmf-config/InitialiseSettings.php: GrowthExperiments: Enable suggested edits without opt-in (T227728) (duration: 00m 52s)
[00:24:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:24:37] T227728: [EPIC] Growth: Newcomer tasks 1.0 - https://phabricator.wikimedia.org/T227728
[00:30:11] Operations, observability: Clean up ORES metrics - https://phabricator.wikimedia.org/T238807 (colewhite)
[00:30:31] Operations, observability: Clean up ORES metrics - https://phabricator.wikimedia.org/T238807 (colewhite) p:Triage→Low
[00:33:30] Operations, observability: Clean up ORES metrics - https://phabricator.wikimedia.org/T238807 (colewhite) @fgiunchedi, would you mind having a quick look at P9701? I'd like to run it on production.
[00:37:09] Operations, observability, Patch-For-Review, Performance-Team (Radar): Fully migrate producers off statsd - https://phabricator.wikimedia.org/T205870 (colewhite)
[00:38:27] PROBLEM - Varnish traffic drop between 30min ago and now at codfw on icinga1001 is CRITICAL: 59.63 le 60 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1
[00:41:51] RECOVERY - Varnish traffic drop between 30min ago and now at codfw on icinga1001 is OK: (C)60 le (W)70 le 86.93 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1
[00:41:56] !log catrope@deploy1001 Synchronized wmf-config/InitialiseSettings.php: Config does not seem to be applying on half the app servers, resyncing (duration: 00m 52s)
[00:42:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:56:08] (PS1) Ladsgroup: Set correct language for shywiktionary [mediawiki-config] - https://gerrit.wikimedia.org/r/552159 (https://phabricator.wikimedia.org/T238105)
[00:56:59] (CR) jerkins-bot: [V: -1] Set correct language for shywiktionary [mediawiki-config] - https://gerrit.wikimedia.org/r/552159 (https://phabricator.wikimedia.org/T238105) (owner: Ladsgroup)
[00:59:47] (PS2) Ladsgroup: Set correct language for shywiktionary [mediawiki-config] - https://gerrit.wikimedia.org/r/552159 (https://phabricator.wikimedia.org/T238105)
[01:02:45] (CR) Dzahn: "removing the last text node in codfw broke code like this on phab2001:" [puppet] - https://gerrit.wikimedia.org/r/552083 (https://phabricator.wikimedia.org/T227432) (owner: Ema)
[01:05:55] (PS1) Dzahn: phab: avoid failing to load trusted_proxies in codfw after ATS migration [puppet] - https://gerrit.wikimedia.org/r/552161
[01:07:51] (PS2) Dzahn: phab: avoid failing to load trusted_proxies in codfw after ATS migration [puppet] - https://gerrit.wikimedia.org/r/552161
[01:09:28] (CR) Paladox: [C: +1] phab: avoid failing to load trusted_proxies in codfw after ATS migration [puppet] - https://gerrit.wikimedia.org/r/552161 (owner: Dzahn)
[01:09:42] (PS3) Dzahn: phab: avoid failing to load trusted_proxies in codfw after ATS migration [puppet] - https://gerrit.wikimedia.org/r/552161
[01:10:08] (CR) Dzahn: "fixed it per" [puppet] - https://gerrit.wikimedia.org/r/552161 (owner: Dzahn)
[01:12:03] (CR) Dzahn: [C: +2] phab: avoid failing to load trusted_proxies in codfw after ATS migration [puppet] - https://gerrit.wikimedia.org/r/552161 (owner: Dzahn)
[01:12:19] (CR) Dzahn: "https://gerrit.wikimedia.org/r/c/operations/puppet/+/552161" [puppet] - https://gerrit.wikimedia.org/r/552083 (https://phabricator.wikimedia.org/T227432) (owner: Ema)
[01:12:58] (CR) jerkins-bot: [V: -1] phab: avoid failing to load trusted_proxies in codfw after ATS migration [puppet] - https://gerrit.wikimedia.org/r/552161 (owner: Dzahn)
[01:15:38] (PS4) Dzahn: phab: avoid failing to load trusted_proxies in codfw after ATS migration [puppet] - https://gerrit.wikimedia.org/r/552161
[01:19:46] (CR) Dzahn: [C: +2] phab: avoid failing to load trusted_proxies in codfw after ATS migration [puppet] - https://gerrit.wikimedia.org/r/552161 (owner: Dzahn)
[01:25:10] (CR) Vgutierrez: [C: +1] ATS: explicitly skip the cache instead of hiding CC (1 comment) [puppet] - https://gerrit.wikimedia.org/r/552076 (https://phabricator.wikimedia.org/T238494) (owner: Ema)
[01:34:37] Operations, serviceops, Patch-For-Review: dropped packets to phab1003 22280/tcp - https://phabricator.wikimedia.org/T238781 (Vgutierrez) @dzahn that's right, Removing the wss:// -> ws:// from ats-tls doesn't stop the requests from reaching varnish-fe as they are accepted as part of the catch-all rema...
[01:35:06] (CR) Vgutierrez: [C: +1] varnish: remove config for disabled phab_aphlict [puppet] - https://gerrit.wikimedia.org/r/552122 (https://phabricator.wikimedia.org/T238781) (owner: Dzahn)
[02:06:23] PROBLEM - Varnish traffic drop between 30min ago and now at codfw on icinga1001 is CRITICAL: 46.82 le 60 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1
[02:12:09] that's matching a previous spike...
[02:14:51] RECOVERY - Varnish traffic drop between 30min ago and now at codfw on icinga1001 is OK: (C)60 le (W)70 le 75.79 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1
[02:47:24] (CR) Jon Harald Søby: [C: +1] "lgtm" [mediawiki-config] - https://gerrit.wikimedia.org/r/552159 (https://phabricator.wikimedia.org/T238105) (owner: Ladsgroup)
[02:56:35] PROBLEM - Router interfaces on cr1-codfw is CRITICAL: CRITICAL: host 208.80.153.192, interfaces up: 133, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[02:56:45] PROBLEM - Router interfaces on cr4-ulsfo is CRITICAL: CRITICAL: host 198.35.26.193, interfaces up: 68, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[03:06:46] Operations, Traffic, User-DannyS712: 502 errors on ATS/8.0.5 - https://phabricator.wikimedia.org/T237319 (Vgutierrez) nice stap, playing a little bit with debug logging for a specific client ip I've been able to get the same information: `[Nov 21 02:57:06.975] {0x2b12a4d77700} DEBUG: RECOVERY - Router interfaces on cr4-ulsfo is OK: OK: host 198.35.26.193, interfaces up: 70, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[03:08:31] RECOVERY - Router interfaces on cr1-codfw is OK: OK: host 208.80.153.192, interfaces up: 135, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[03:12:25] PROBLEM - Varnish traffic drop between 30min ago and now at codfw on icinga1001 is CRITICAL: 46.31 le 60 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1
[03:15:49] RECOVERY - Varnish traffic drop between 30min ago and now at codfw on icinga1001 is OK: (C)60 le (W)70 le 76.65 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1
[03:36:12] Operations, DBA, SRE-Access-Requests: Read access for phabricator-admins (aklapper) to Phabricator production database to run SELECT queries - https://phabricator.wikimedia.org/T238425 (Aklapper) @dzahn: Uh yay! Works! Thanks so much! <3
[04:12:46] Operations, Traffic, User-DannyS712: 502 errors on ATS/8.0.5 - https://phabricator.wikimedia.org/T237319 (Vgutierrez) it looks like the issue is happening at mw servers as well, nginx (TLS termination for appservers-rw.discovery.wmnet) is screaming a lot with wikidata requests, for the request I've p...
[05:53:42] !log Compress db2073 [05:53:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:54:55] (03PS1) 10Marostegui: dbproxy1011: Depool labsdb1009 [puppet] - 10https://gerrit.wikimedia.org/r/552169 [05:55:58] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1086 for upgrade', diff saved to https://phabricator.wikimedia.org/P9702 and previous config saved to /var/cache/conftool/dbconfig/20191121-055557-marostegui.json [05:56:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:56:14] !log Upgrade db1086 [05:56:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:56:45] (03CR) 10Marostegui: [C: 03+2] dbproxy1011: Depool labsdb1009 [puppet] - 10https://gerrit.wikimedia.org/r/552169 (owner: 10Marostegui) [05:57:21] !log Depool labsdb1009 for upgrade [05:57:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:06:18] !log Compress db2083 [06:06:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:11:21] PROBLEM - haproxy failover on dbproxy1019 is CRITICAL: CRITICAL check_failover servers up 1 down 1 https://wikitech.wikimedia.org/wiki/HAProxy [06:11:26] ^expected [06:13:10] !log Stop MySQL on db1107 T238113 [06:13:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:13:15] T238113: Repurpose db1107 as a generic database - https://phabricator.wikimedia.org/T238113 [06:13:56] (03PS1) 10Marostegui: Revert "dbproxy1011: Depool labsdb1009" [puppet] - 10https://gerrit.wikimedia.org/r/552170 [06:14:43] RECOVERY - haproxy failover on dbproxy1019 is OK: OK check_failover servers up 2 down 0 https://wikitech.wikimedia.org/wiki/HAProxy [06:16:02] !log Compress db2081 [06:16:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:17:12] !log marostegui@cumin1001 dbctl commit (dc=all): 'Slowly repool db1086 after upgrade', diff saved to 
https://phabricator.wikimedia.org/P9703 and previous config saved to /var/cache/conftool/dbconfig/20191121-061711-marostegui.json [06:17:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:20:08] (03CR) 10Marostegui: [C: 03+2] Revert "dbproxy1011: Depool labsdb1009" [puppet] - 10https://gerrit.wikimedia.org/r/552170 (owner: 10Marostegui) [06:24:13] !log marostegui@cumin1001 dbctl commit (dc=all): 'More traffic to db1086 after upgrade', diff saved to https://phabricator.wikimedia.org/P9704 and previous config saved to /var/cache/conftool/dbconfig/20191121-062412-marostegui.json [06:24:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:24:41] 10Operations, 10DBA, 10Data-Services, 10cloud-services-team (Kanban): Prepare and check storage layer for shywiktionary - https://phabricator.wikimedia.org/T238115 (10Marostegui) a:03Marostegui [06:25:07] 10Operations, 10DBA, 10Data-Services, 10cloud-services-team (Kanban): Prepare and check storage layer for gcrwiki - https://phabricator.wikimedia.org/T238114 (10Marostegui) a:03Marostegui [06:25:43] 10Operations, 10DBA, 10Data-Services, 10cloud-services-team (Kanban): Prepare and check storage layer for minwiktionary - https://phabricator.wikimedia.org/T238522 (10Marostegui) a:03Marostegui [06:26:03] 10Operations, 10DBA, 10Data-Services: Prepare and check storage layer for ge.wikimedia.org - https://phabricator.wikimedia.org/T236404 (10Marostegui) a:03Marostegui [06:30:26] !log Sanitize shywiktionary gcrwiki szywiki minwiktionary gewikimedia on db2094:3313 T238115 T238114 T237373 T238522 T236404 [06:30:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:30:36] T236404: Prepare and check storage layer for ge.wikimedia.org - https://phabricator.wikimedia.org/T236404 [06:30:36] T237373: Prepare and check storage layer for szywiki - https://phabricator.wikimedia.org/T237373 [06:30:37] T238115: Prepare and check storage layer for 
shywiktionary - https://phabricator.wikimedia.org/T238115 [06:30:37] T238114: Prepare and check storage layer for gcrwiki - https://phabricator.wikimedia.org/T238114 [06:30:37] T238522: Prepare and check storage layer for minwiktionary - https://phabricator.wikimedia.org/T238522 [06:32:37] !log Sanitize shywiktionary gcrwiki szywiki minwiktionary gewikimedia on db1124:3313 T238115 T238114 T237373 T238522 T236404 [06:32:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:38:54] 10Operations, 10Traffic, 10User-DannyS712: 502 errors on ATS/8.0.5 - https://phabricator.wikimedia.org/T237319 (10Vgutierrez) Interesting enough, on the apache access log, that request looks OK: ` 2019-11-21T02:57:04 73916 10.64.0.62 proxy:unix:/run/php/fpm-www.sock|fcgi://localhost/304 14079 GET http://www.... [06:49:55] 10Operations, 10DBA, 10Data-Services: Prepare and check storage layer for ge.wikimedia.org - https://phabricator.wikimedia.org/T236404 (10Marostegui) a:05Marostegui→03None Database sanitized, `_p` database created and grants done - #cloud-services-team please proceed with the views on the 4 hosts. ` root... [06:51:02] 10Operations, 10DBA, 10Data-Services, 10cloud-services-team (Kanban): Prepare and check storage layer for minwiktionary - https://phabricator.wikimedia.org/T238522 (10Marostegui) a:05Marostegui→03None Database sanitized, `_p` database created and grants done - #cloud-services-team please proceed with t... [06:54:53] 10Operations, 10DBA, 10Data-Services, 10cloud-services-team (Kanban): Prepare and check storage layer for gcrwiki - https://phabricator.wikimedia.org/T238114 (10Marostegui) a:05Marostegui→03None Database sanitized, `_p` database created and grants done - #cloud-services-team please proceed with the vie... 
[06:56:15] 10Operations, 10DBA, 10Data-Services, 10cloud-services-team (Kanban): Prepare and check storage layer for shywiktionary - https://phabricator.wikimedia.org/T238115 (10Marostegui) a:05Marostegui→03None Database sanitized, `_p` database created and grants done - #cloud-services-team please proceed with t... [06:56:58] !log Repool labsdb1009 [06:57:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:57:31] 10Operations, 10Traffic, 10User-DannyS712: 502 errors on ATS/8.0.5 - https://phabricator.wikimedia.org/T237319 (10Joe) Why would a 304 have 14079 bytes of content-length is the first thing I'd ask. It looks like we do something wrong somewhere, but it's not reproducible consistently in my tests. @Vgutierrez... [07:00:37] RECOVERY - Check systemd state on an-coord1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [07:01:57] PROBLEM - Varnish traffic drop between 30min ago and now at codfw on icinga1001 is CRITICAL: 57.01 le 60 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1 [07:03:39] RECOVERY - Varnish traffic drop between 30min ago and now at codfw on icinga1001 is OK: (C)60 le (W)70 le 75.15 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1 [07:03:43] 10Operations, 10Traffic, 10User-DannyS712: 502 errors on ATS/8.0.5 - https://phabricator.wikimedia.org/T237319 (10DannyS712) Do I need to be concerned about my IP being released during discussion // should this task be private? [07:09:45] 10Operations, 10Traffic, 10User-DannyS712: 502 errors on ATS/8.0.5 - https://phabricator.wikimedia.org/T237319 (10Joe) Just confirmed we only get 304s with a non-zero content-length for requests to `Special:EntityData` on wikidata. 
I still don't have a reliable repro case. [07:12:09] (03CR) 10Elukey: "The change looks good, but after reviewing it a second time I am not super excited to have another CNAME to remember/maintain. What benefi" [dns] - 10https://gerrit.wikimedia.org/r/551938 (owner: 10Dzahn) [07:16:05] RECOVERY - Wikitech and wt-static content in sync on labweb1002 is OK: wikitech-static OK - wikitech and wikitech-static in sync (39618 200000s) https://wikitech.wikimedia.org/wiki/Wikitech-static [07:17:59] !log marostegui@cumin1001 dbctl commit (dc=all): 'More traffic to db1086 after upgrade', diff saved to https://phabricator.wikimedia.org/P9705 and previous config saved to /var/cache/conftool/dbconfig/20191121-071758-marostegui.json [07:18:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:18:41] !log Upgrade db1125 (sanitarium) [07:18:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:25:45] !log marostegui@cumin1001 dbctl commit (dc=all): 'Fully repool db1086 after upgrade', diff saved to https://phabricator.wikimedia.org/P9706 and previous config saved to /var/cache/conftool/dbconfig/20191121-072543-marostegui.json [07:25:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:29:09] PROBLEM - Varnish traffic drop between 30min ago and now at codfw on icinga1001 is CRITICAL: 55.55 le 60 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1 [07:30:11] (03PS1) 10Marostegui: db2133: Enable notifications [puppet] - 10https://gerrit.wikimedia.org/r/552173 (https://phabricator.wikimedia.org/T238183) [07:32:21] (03CR) 10Marostegui: [C: 03+2] db2133: Enable notifications [puppet] - 10https://gerrit.wikimedia.org/r/552173 (https://phabricator.wikimedia.org/T238183) (owner: 10Marostegui) [07:34:13] RECOVERY - Varnish traffic drop between 30min ago and now at codfw on icinga1001 is OK: 
(C)60 le (W)70 le 90.08 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1 [07:47:23] 10Operations, 10Traffic, 10User-DannyS712: 502 errors on ATS/8.0.5 - https://phabricator.wikimedia.org/T237319 (10Vgutierrez) @Joe it seems pretty easy to trigger: ` vgutierrez@mw1330:~$ curl --resolve www.wikidata.org:80:10.64.32.32 "www.wikidata.org/wiki/Special:EntityData/Q38646387.json" -o /dev/null -v -... [07:47:39] (03PS1) 10Marostegui: mariadb: Promote db2133 to m2 codfw master [puppet] - 10https://gerrit.wikimedia.org/r/552191 (https://phabricator.wikimedia.org/T238183) [07:53:40] (03CR) 10Marostegui: [C: 03+2] mariadb: Promote db2133 to m2 codfw master [puppet] - 10https://gerrit.wikimedia.org/r/552191 (https://phabricator.wikimedia.org/T238183) (owner: 10Marostegui) [07:56:22] !log Promote db2133 to codfw m2 master - T238183 [07:56:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:56:28] T238183: Productionize db213[2-5} - https://phabricator.wikimedia.org/T238183 [07:57:40] !log upgrade OTRS to 5.0.39 T225925 [07:57:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:58:20] (03CR) 10Urbanecm: [C: 03+1] Set correct language for shywiktionary [mediawiki-config] - 10https://gerrit.wikimedia.org/r/552159 (https://phabricator.wikimedia.org/T238105) (owner: 10Ladsgroup) [08:02:14] (03CR) 10Alexandros Kosiaris: "Ouch, we missed that for so long? Thanks!" 
[puppet] - 10https://gerrit.wikimedia.org/r/552151 (owner: 10CDanis) [08:05:31] (03PS1) 10Mobrovac: VirtualRESTService: Switch to Parsoid/PHP [mediawiki-config] - 10https://gerrit.wikimedia.org/r/552194 (https://phabricator.wikimedia.org/T229015) [08:09:34] (03CR) 10Vgutierrez: [C: 03+1] TLS Analytics: make parsing more robust [puppet] - 10https://gerrit.wikimedia.org/r/551840 (owner: 10BBlack) [08:11:51] 10Operations, 10Release Pipeline, 10serviceops, 10Kubernetes: Identify which parts of the "Add a wiki" procedure can be integrated with the deployment pipeline - https://phabricator.wikimedia.org/T238158 (10akosiaris) p:05Triage→03Low [08:14:46] PROBLEM - Varnish traffic drop between 30min ago and now at codfw on icinga1001 is CRITICAL: 55.1 le 60 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1 [08:17:21] (03CR) 10Alexandros Kosiaris: "Fine with putting it on hold, but I did like the idea of having brendan greg's perftools around" [puppet] - 10https://gerrit.wikimedia.org/r/550833 (https://phabricator.wikimedia.org/T236048) (owner: 10Effie Mouzeli) [08:19:16] RECOVERY - Varnish traffic drop between 30min ago and now at codfw on icinga1001 is OK: (C)60 le (W)70 le 86.68 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1 [08:21:09] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1079 for upgrade', diff saved to https://phabricator.wikimedia.org/P9707 and previous config saved to /var/cache/conftool/dbconfig/20191121-082108-marostegui.json [08:21:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:21:26] !log Upgrade db1079 [08:21:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:21:41] (03CR) 10Alexandros Kosiaris: "@Giuseppe, now that blubberoid supports TLS 
should we followup with migrating this config to the discovery one?" [puppet] - 10https://gerrit.wikimedia.org/r/552111 (owner: 10CDanis) [08:21:59] 10Operations, 10Traffic, 10User-DannyS712: 502 errors on ATS/8.0.5 - https://phabricator.wikimedia.org/T237319 (10Joe) https://bz.apache.org/bugzilla/show_bug.cgi?id=57198 seems relevant, but I think it should be fixed in our version of apache2 (2.4.25). We should try to see what php-fpm answers using `furl... [08:22:10] 10Operations, 10Puppet, 10DBA, 10Patch-For-Review, 10User-jbond: Extend Puppet CA Expiry date - https://phabricator.wikimedia.org/T236277 (10elukey) @jbond thanks a lot for the ping, in my todo list there is the action item of adding cert expiry checks to all the hadoop nodes (haven't had the time to do... [08:23:29] (03PS1) 10Marostegui: db2067: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/552198 (https://phabricator.wikimedia.org/T233185) [08:25:24] (03CR) 10Marostegui: [C: 03+2] db2067: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/552198 (https://phabricator.wikimedia.org/T233185) (owner: 10Marostegui) [08:28:13] 10Operations, 10DBA: Predictive failures on disk S.M.A.R.T. 
status - https://phabricator.wikimedia.org/T208323 (10Marostegui) [08:29:37] (03CR) 10Jcrespo: [C: 03+2] bacula: Add prometheus exporter [puppet] - 10https://gerrit.wikimedia.org/r/551145 (https://phabricator.wikimedia.org/T234900) (owner: 10Jcrespo) [08:31:10] (03CR) 10Jcrespo: [C: 03+2] prometheus-bacula-exporter: Setup service on bacula director [puppet] - 10https://gerrit.wikimedia.org/r/552027 (https://phabricator.wikimedia.org/T234900) (owner: 10Jcrespo) [08:33:24] !log marostegui@cumin1001 dbctl commit (dc=all): 'Slowly repool db1079 after upgrade', diff saved to https://phabricator.wikimedia.org/P9708 and previous config saved to /var/cache/conftool/dbconfig/20191121-083322-marostegui.json [08:33:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:33:55] (03CR) 10Jcrespo: "Failed at step EXEC spawning /usr/local/bin/prometheus-bacula-exporter: No such file or directory" [puppet] - 10https://gerrit.wikimedia.org/r/552027 (https://phabricator.wikimedia.org/T234900) (owner: 10Jcrespo) [08:34:40] (03PS1) 10Ema: ATS: enable reload for global Lua script [puppet] - 10https://gerrit.wikimedia.org/r/552201 (https://phabricator.wikimedia.org/T233274) [08:36:10] PROBLEM - Check systemd state on backup1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:37:59] ^I am about to fix that [08:38:02] (03PS1) 10Jcrespo: prometheus-bacula-exporter: Fix systemd unit and permissions [puppet] - 10https://gerrit.wikimedia.org/r/552204 (https://phabricator.wikimedia.org/T234900) [08:40:53] (03PS2) 10Jcrespo: prometheus-bacula-exporter: Fix systemd unit and permissions [puppet] - 10https://gerrit.wikimedia.org/r/552204 (https://phabricator.wikimedia.org/T234900) [08:42:05] (03PS3) 10Jcrespo: prometheus-bacula-exporter: Fix systemd unit and permissions [puppet] - 10https://gerrit.wikimedia.org/r/552204 (https://phabricator.wikimedia.org/T234900) [08:42:42] PROBLEM - Varnish traffic drop between 30min ago and now at codfw on icinga1001 is CRITICAL: 56.35 le 60 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1 [08:43:41] (03CR) 10Filippo Giunchedi: [C: 03+1] "LGTM overall, see nits inline" (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/552107 (https://phabricator.wikimedia.org/T224585) (owner: 10Phamhi) [08:43:55] (03CR) 10Jcrespo: [C: 03+2] prometheus-bacula-exporter: Fix systemd unit and permissions [puppet] - 10https://gerrit.wikimedia.org/r/552204 (https://phabricator.wikimedia.org/T234900) (owner: 10Jcrespo) [08:45:01] !log marostegui@cumin1001 dbctl commit (dc=all): 'More traffic to db1079 after upgrade', diff saved to https://phabricator.wikimedia.org/P9709 and previous config saved to /var/cache/conftool/dbconfig/20191121-084500-marostegui.json [08:45:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:46:08] RECOVERY - Wikitech and wt-static content in sync on labweb1001 is OK: wikitech-static OK - wikitech and wikitech-static in sync (46102 200000s) https://wikitech.wikimedia.org/wiki/Wikitech-static [08:47:36] backup1001 should recover now [08:47:38] RECOVERY - Check systemd state on backup1001 
is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:48:54] (03PS1) 10ArielGlenn: fix up path for temp tarball of dump status files for rsync [puppet] - 10https://gerrit.wikimedia.org/r/552206 [08:50:05] (03PS1) 10Alexandros Kosiaris: Switch all servcies docker-registries to internal [deployment-charts] - 10https://gerrit.wikimedia.org/r/552207 (https://phabricator.wikimedia.org/T238792) [08:50:39] (03PS2) 10Alexandros Kosiaris: Switch all services' docker-registries to internal [deployment-charts] - 10https://gerrit.wikimedia.org/r/552207 (https://phabricator.wikimedia.org/T238792) [08:50:53] (03PS1) 10Filippo Giunchedi: centrallog: remove quickdatacopy after sync [puppet] - 10https://gerrit.wikimedia.org/r/552208 [08:51:08] (03CR) 10Alexandros Kosiaris: [C: 03+2] Switch all services' docker-registries to internal [deployment-charts] - 10https://gerrit.wikimedia.org/r/552207 (https://phabricator.wikimedia.org/T238792) (owner: 10Alexandros Kosiaris) [08:51:30] (03PS2) 10Filippo Giunchedi: centrallog: remove quickdatacopy after sync [puppet] - 10https://gerrit.wikimedia.org/r/552208 (https://phabricator.wikimedia.org/T224564) [08:51:35] (03CR) 10ArielGlenn: [C: 03+2] fix up path for temp tarball of dump status files for rsync [puppet] - 10https://gerrit.wikimedia.org/r/552206 (owner: 10ArielGlenn) [08:52:34] PROBLEM - Varnish traffic drop between 30min ago and now at codfw on icinga1001 is CRITICAL: 59.65 le 60 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1 [08:53:53] (03CR) 10Filippo Giunchedi: [C: 03+2] "PCC https://puppet-compiler.wmflabs.org/compiler1001/19525/" [puppet] - 10https://gerrit.wikimedia.org/r/552208 (https://phabricator.wikimedia.org/T224564) (owner: 10Filippo Giunchedi) [08:53:55] !log akosiaris@deploy1001 helmfile [STAGING] Ran 'sync' command on namespace 
'wikifeeds' for release 'staging' . [08:53:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:54:27] 10Operations, 10Traffic, 10User-DannyS712: 502 errors on ATS/8.0.5 - https://phabricator.wikimedia.org/T237319 (10Vgutierrez) reproducing it with furl shows that php-fpm returns a 304 with a body, and it looks like apache2 2.4.25-3+deb9u7 is failing to handle that [08:55:58] RECOVERY - Varnish traffic drop between 30min ago and now at codfw on icinga1001 is OK: (C)60 le (W)70 le 74.43 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1 [08:56:46] !log marostegui@cumin1001 dbctl commit (dc=all): 'More traffic to db1079 after upgrade', diff saved to https://phabricator.wikimedia.org/P9710 and previous config saved to /var/cache/conftool/dbconfig/20191121-085644-marostegui.json [08:56:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:58:03] !log akosiaris@deploy1001 helmfile [STAGING] Ran 'sync' command on namespace 'wikifeeds' for release 'staging' . [08:58:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:03:22] 10Operations, 10Traffic, 10User-DannyS712: 502 errors on ATS/8.0.5 - https://phabricator.wikimedia.org/T237319 (10Vgutierrez) The patch seems to be there, see https://salsa.debian.org/apache-team/apache2/blob/debian/2.4.25-3+deb9u7/modules%2Fproxy%2Fmod_proxy_fcgi.c#L663-675: `lang=C... [09:03:30] !log akosiaris@deploy1001 helmfile [STAGING] Ran 'apply' command on namespace 'wikifeeds' for release 'staging' . 
[09:03:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:06:07] (PS1) Alexandros Kosiaris: wikifeeds: Followup to b129b2088dbe [deployment-charts] - https://gerrit.wikimedia.org/r/552212 (https://phabricator.wikimedia.org/T238792)
[09:06:25] !log marostegui@cumin1001 dbctl commit (dc=all): 'Fully repool db1079 after upgrade', diff saved to https://phabricator.wikimedia.org/P9711 and previous config saved to /var/cache/conftool/dbconfig/20191121-090623-marostegui.json
[09:06:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:06:52] (CR) Alexandros Kosiaris: [C: +2] wikifeeds: Followup to b129b2088dbe [deployment-charts] - https://gerrit.wikimedia.org/r/552212 (https://phabricator.wikimedia.org/T238792) (owner: Alexandros Kosiaris)
[09:07:03] Operations, observability: dropped packets to kafkamon 9000/tcp - https://phabricator.wikimedia.org/T238794 (fgiunchedi) Indeed that's prometheus@analytics trying to reach burrow-exporter on port 9000 on kafkamon hosts, burrow-exporter is listening there but clearly no ferm. @Ottomata @elukey is port 900...
[09:07:04] (Merged) jenkins-bot: wikifeeds: Followup to b129b2088dbe [deployment-charts] - https://gerrit.wikimedia.org/r/552212 (https://phabricator.wikimedia.org/T238792) (owner: Alexandros Kosiaris)
[09:08:34] !log akosiaris@deploy1001 helmfile [STAGING] Ran 'apply' command on namespace 'wikifeeds' for release 'staging' .
[09:08:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:11:06] (03PS1) 10Marostegui: mariadb: Prepare to decommission db1067 [puppet] - 10https://gerrit.wikimedia.org/r/552213 (https://phabricator.wikimedia.org/T238297) [09:11:46] 10Operations, 10observability, 10Availability, 10Goal, 10Patch-For-Review: Setup bacula backup monitoring - https://phabricator.wikimedia.org/T234900 (10jcrespo) There is a bug: ` 92]: Traceback (most recent call last): 92]: File "/usr/local/bin/prometheus-bacula-exporter.py", line 326, in 92... [09:12:00] (03CR) 10Marostegui: "jcrespo, for consistency I have changed the exporter to reflect the new master, just to avoid confusion." [puppet] - 10https://gerrit.wikimedia.org/r/552213 (https://phabricator.wikimedia.org/T238297) (owner: 10Marostegui) [09:13:01] 10Operations, 10observability, 10serviceops: dropped packets to conf1004/5/6 2379/tcp - https://phabricator.wikimedia.org/T238791 (10fgiunchedi) Indeed looks like prometheus is trying to fetch `conf1004.eqiad.wmnet:2379/metrics` with no success. Locally on conf1004 even past the firewall the endpoint doesn't... [09:13:31] 10Operations, 10observability, 10User-fgiunchedi: Reimage wezen to Stretch or Buster (and rename to centrallog2001) - https://phabricator.wikimedia.org/T224564 (10fgiunchedi) [09:13:47] 10Operations: Track remaining jessie systems in production - https://phabricator.wikimedia.org/T224549 (10fgiunchedi) [09:13:49] 10Operations, 10observability, 10User-fgiunchedi: Reimage wezen to Stretch or Buster (and rename to centrallog2001) - https://phabricator.wikimedia.org/T224564 (10fgiunchedi) 05Open→03Resolved This is complete! 
[09:14:29] 10Operations: Track remaining jessie systems in production - https://phabricator.wikimedia.org/T224549 (10fgiunchedi) [09:14:33] (03CR) 10Marostegui: [C: 03+2] mariadb: Prepare to decommission db1067 [puppet] - 10https://gerrit.wikimedia.org/r/552213 (https://phabricator.wikimedia.org/T238297) (owner: 10Marostegui) [09:14:39] (03CR) 10Jcrespo: [C: 03+1] mariadb: Prepare to decommission db1067 [puppet] - 10https://gerrit.wikimedia.org/r/552213 (https://phabricator.wikimedia.org/T238297) (owner: 10Marostegui) [09:18:52] _joe_: can https://config-master.wikimedia.org/pybal/eqiad/wdqs generally be relied upon to work and also be up to date? [09:21:23] (03CR) 10Giuseppe Lavagetto: [C: 04-1] "I think there are a few issues with this change:" (038 comments) [puppet] - 10https://gerrit.wikimedia.org/r/550833 (https://phabricator.wikimedia.org/T236048) (owner: 10Effie Mouzeli) [09:21:55] <_joe_> addshore: what do you need to do? [09:22:04] <_joe_> XY problem :P [09:22:20] 10Operations, 10observability: Clean up ORES metrics - https://phabricator.wikimedia.org/T238807 (10fgiunchedi) >>! In T238807#5680476, @colewhite wrote: > @fgiunchedi, would you mind having a quick look at P9701? I'd like to run it on production. LGTM! Thanks for taking care of this We'll also need to temp... 
[09:22:33] <_joe_> (also, for talking, we should probably move to #-sre [09:22:37] <_joe_> it's calmer) [09:23:15] 10Operations, 10Scap, 10User-Addshore: Config scap seemed to update file but not get config changes into mediawiki (until a re scap) - https://phabricator.wikimedia.org/T238816 (10Addshore) [09:24:37] 10Operations, 10Traffic: Request routing to active/passive services active in codfw only stopped working - https://phabricator.wikimedia.org/T238817 (10ema) [09:24:48] 10Operations, 10Traffic: Request routing to active/passive services active in codfw only stopped working - https://phabricator.wikimedia.org/T238817 (10ema) p:05Triage→03High [09:25:51] 10Operations, 10observability, 10serviceops: dropped packets to conf1004/5/6 2379/tcp - https://phabricator.wikimedia.org/T238791 (10Joe) @fgiunchedi you need to use https, and it works locally. If you want to access metrics remotely, either you switch to port 4001 which is publicly exposed. We have diffe... [09:25:55] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1090:3312, db1090:3317 for upgrade', diff saved to https://phabricator.wikimedia.org/P9712 and previous config saved to /var/cache/conftool/dbconfig/20191121-092554-marostegui.json [09:25:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:26:29] (03CR) 10Ema: [C: 04-2] "Smart! 
-2 because of the reasons mentioned in https://phabricator.wikimedia.org/T238817" [puppet] - 10https://gerrit.wikimedia.org/r/552142 (owner: 10CDanis) [09:27:08] PROBLEM - Varnish traffic drop between 30min ago and now at codfw on icinga1001 is CRITICAL: 58.15 le 60 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1 [09:27:21] !log Upgrade db1090:3312, db1090:3317 [09:27:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:28:46] 10Operations, 10Traffic: Request routing to active/passive services active in codfw only stopped working - https://phabricator.wikimedia.org/T238817 (10ema) [09:29:51] 10Operations, 10Commons, 10SRE-swift-storage: File on Commons not found: File:Nl-gegourmet.ogg - https://phabricator.wikimedia.org/T238695 (10fgiunchedi) Indeed, the file is a Nov 2013 upload, we could search for it in archives containers as well in case it got moved there. re: finding all orphan files, my u... [09:33:58] PROBLEM - Varnish traffic drop between 30min ago and now at codfw on icinga1001 is CRITICAL: 52.29 le 60 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1 [09:34:49] 10Operations, 10DNS, 10SRE-tools, 10Traffic: Include zone+subnet checks for DNS validation - https://phabricator.wikimedia.org/T238727 (10fgiunchedi) p:05Normal→03Low >>! In T238727#5679123, @Volans wrote: > @fgiunchedi I think is fair request, but given we're in process of auto-generating all mgmt and... 
[09:34:51] (03PS1) 10Jcrespo: prometheus-bacula-exporter: Exit cleanly on SIGINT [puppet] - 10https://gerrit.wikimedia.org/r/552216 (https://phabricator.wikimedia.org/T234900) [09:36:30] (03PS1) 10ArielGlenn: make dumpsdata1002 a spare dumps server with no special jobs [puppet] - 10https://gerrit.wikimedia.org/r/552217 [09:36:54] (03PS1) 10Ema: Revert "cache: reimage cp2023 as text_ats" [puppet] - 10https://gerrit.wikimedia.org/r/552218 (https://phabricator.wikimedia.org/T238817) [09:38:18] 10Operations, 10DBA, 10User-notice: Switchover s7 primary database master db1062 -> db1086 - 26th Nov 06:00 - 06:30 UTC - https://phabricator.wikimedia.org/T238044 (10Marostegui) All the hosts in s7 have been upgraded to 10.1.43 (including labs and sanitariums) [09:38:36] !log Stop MySQL on db1067 - T238297 [09:38:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:38:41] T238297: decommission db1067.eqiad.wmnet - https://phabricator.wikimedia.org/T238297 [09:39:22] (03CR) 10jerkins-bot: [V: 04-1] make dumpsdata1002 a spare dumps server with no special jobs [puppet] - 10https://gerrit.wikimedia.org/r/552217 (owner: 10ArielGlenn) [09:39:27] !log depool cp2023 and reimage back as varnish-be T238817 T227432 [09:39:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:39:33] T238817: Request routing to active/passive services active in codfw only stopped working - https://phabricator.wikimedia.org/T238817 [09:39:34] T227432: Replace Varnish backends with ATS on cache text nodes - https://phabricator.wikimedia.org/T227432 [09:39:59] !log marostegui@cumin1001 dbctl commit (dc=all): 'Slowly repool db1090:331{2,7} after upgrade', diff saved to https://phabricator.wikimedia.org/P9713 and previous config saved to /var/cache/conftool/dbconfig/20191121-093958-marostegui.json [09:40:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:40:22] PROBLEM - mobileapps endpoints health on scb2005 is CRITICAL: 
/{domain}/v1/page/summary/{title} (Get summary for test page) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps [09:40:46] 10Operations, 10Traffic, 10Patch-For-Review: Request routing to active/passive services active in codfw only stopped working - https://phabricator.wikimedia.org/T238817 (10ema) [09:41:03] (03PS2) 10Jcrespo: prometheus-bacula-exporter: Exit cleanly on SIGINT [puppet] - 10https://gerrit.wikimedia.org/r/552216 (https://phabricator.wikimedia.org/T234900) [09:41:05] (03PS1) 10Jcrespo: prometheus-bacula-exporter: Fix Null check for date fields [puppet] - 10https://gerrit.wikimedia.org/r/552219 (https://phabricator.wikimedia.org/T234900) [09:41:18] RECOVERY - Varnish traffic drop between 30min ago and now at codfw on icinga1001 is OK: (C)60 le (W)70 le 73.83 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1 [09:41:22] (03CR) 10Ema: [C: 03+2] Revert "cache: reimage cp2023 as text_ats" [puppet] - 10https://gerrit.wikimedia.org/r/552218 (https://phabricator.wikimedia.org/T238817) (owner: 10Ema) [09:42:18] RECOVERY - mobileapps endpoints health on scb2005 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps [09:43:14] 10Operations, 10Traffic, 10Patch-For-Review: Request routing to active/passive services active in codfw only stopped working - https://phabricator.wikimedia.org/T238817 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by ema on cumin2001.codfw.wmnet for hosts: ` ['cp2023.codfw.wmnet'] ` The log ca... 
[09:43:15] (03CR) 10Jcrespo: [C: 03+2] prometheus-bacula-exporter: Exit cleanly on SIGINT [puppet] - 10https://gerrit.wikimedia.org/r/552216 (https://phabricator.wikimedia.org/T234900) (owner: 10Jcrespo) [09:43:50] (03CR) 10Jcrespo: [C: 03+2] prometheus-bacula-exporter: Fix Null check for date fields [puppet] - 10https://gerrit.wikimedia.org/r/552219 (https://phabricator.wikimedia.org/T234900) (owner: 10Jcrespo) [09:44:01] (03PS2) 10Jcrespo: prometheus-bacula-exporter: Fix Null check for date fields [puppet] - 10https://gerrit.wikimedia.org/r/552219 (https://phabricator.wikimedia.org/T234900) [09:44:04] (03CR) 10Jcrespo: [V: 03+2 C: 03+2] prometheus-bacula-exporter: Fix Null check for date fields [puppet] - 10https://gerrit.wikimedia.org/r/552219 (https://phabricator.wikimedia.org/T234900) (owner: 10Jcrespo) [09:44:19] (03PS2) 10ArielGlenn: make dumpsdata1002 a spare dumps server with no special jobs [puppet] - 10https://gerrit.wikimedia.org/r/552217 [09:44:46] PROBLEM - Check systemd state on backup1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:45:52] RECOVERY - Check systemd state on backup1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:48:03] 10Operations, 10observability, 10Availability, 10Goal, 10Patch-For-Review: Setup bacula backup monitoring - https://phabricator.wikimedia.org/T234900 (10jcrespo) Working now: ` root@prometheus1003:~$ time curl backup1001.eqiad.wmnet:9133/metrics ... # HELP bacula_job_last_execution_job_missing_files Jo... 
[09:49:19] (03PS9) 10Jcrespo: prometheus-bacula-exporter: Setup bacula collection on prometheus [puppet] - 10https://gerrit.wikimedia.org/r/552033 (https://phabricator.wikimedia.org/T234900) [09:52:03] 10Operations, 10observability, 10serviceops: dropped packets to conf1004/5/6 2379/tcp - https://phabricator.wikimedia.org/T238791 (10fgiunchedi) >>! In T238791#5680931, @Joe wrote: > @fgiunchedi you need to use https, and it works locally. > > If you want to access metrics remotely, either you switch to por... [09:52:47] (03PS2) 10Ema: ATS: explicitly skip the cache instead of hiding CC [puppet] - 10https://gerrit.wikimedia.org/r/552076 (https://phabricator.wikimedia.org/T238494) [09:54:02] !log marostegui@cumin1001 dbctl commit (dc=all): 'Slowly repool db1090:331{2,7} after upgrade', diff saved to https://phabricator.wikimedia.org/P9714 and previous config saved to /var/cache/conftool/dbconfig/20191121-095401-marostegui.json [09:54:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:54:36] 10Operations, 10observability, 10serviceops: dropped packets to conf1004/5/6 2379/tcp - https://phabricator.wikimedia.org/T238791 (10Joe) >>! In T238791#5681024, @fgiunchedi wrote: >>>! In T238791#5680931, @Joe wrote: >> @fgiunchedi you need to use https, and it works locally. >> >> If you want to access me... [09:55:33] (03CR) 10Jcrespo: [C: 03+2] prometheus-bacula-exporter: Setup bacula collection on prometheus [puppet] - 10https://gerrit.wikimedia.org/r/552033 (https://phabricator.wikimedia.org/T234900) (owner: 10Jcrespo) [09:57:02] 10Operations, 10Gerrit-Privilege-Requests, 10Wikidata, 10Wikidata-Query-Service, 10Release-Engineering-Team (Unit & Int & System Tooling): Push rights on https://gerrit.wikimedia.org/r/admin/projects/wikidata/query/blazegraph for onimisionipe - https://phabricator.wikimedia.org/T238733 (10Gehel) @hashar... 
[10:04:01] 10Operations, 10Traffic, 10Wikidata, 10Wikidata-Campsite, 10User-DannyS712: 502 errors on ATS/8.0.5 - https://phabricator.wikimedia.org/T237319 (10Ladsgroup) [10:04:25] 10Operations, 10Traffic, 10Patch-For-Review: Request routing to active/passive services active in codfw only stopped working - https://phabricator.wikimedia.org/T238817 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['cp2023.codfw.wmnet'] ` Of which those **FAILED**: ` ['cp2023.codfw.wmnet'] ` [10:04:58] 10Operations, 10Traffic, 10Wikidata, 10Wikidata-Campsite, 10User-DannyS712: 502 errors on ATS/8.0.5 - https://phabricator.wikimedia.org/T237319 (10Ladsgroup) @addshore @wmde-leszek: it seems we are causing this. We should take a look [10:07:16] (03PS3) 10ArielGlenn: make dumpsdata1002 a spare dumps server with no special jobs [puppet] - 10https://gerrit.wikimedia.org/r/552217 [10:07:59] 10Operations, 10observability, 10serviceops, 10Patch-For-Review: Gather metrics on request status codes, latencies from the MediaWiki appservers - https://phabricator.wikimedia.org/T226815 (10Joe) 05Open→03Resolved a:03Joe [10:08:47] 10Operations, 10Traffic, 10Wikidata, 10Wikidata-Campsite, 10User-DannyS712: 502 errors on ATS/8.0.5 - https://phabricator.wikimedia.org/T237319 (10Vgutierrez) so it looks like the apache fix doesn't fix every scenario. 
This small PoC triggers the issue: `lang=php 10Operations, 10WMF-JobQueue: Dismantle most of the old jobqueue infrastructure - https://phabricator.wikimedia.org/T197003 (10Joe) 05Open→03Resolved a:03Joe [10:11:05] 10Operations: SRE quarterly goal: allow MediaWiki requests to be served by PHP7 alongside HHVM - https://phabricator.wikimedia.org/T203959 (10Joe) 05Open→03Resolved a:03Joe [10:11:20] 10Operations: SRE quarterly goal: Ability to serve a fraction of the production traffic from PHP7 - https://phabricator.wikimedia.org/T206336 (10Joe) 05Open→03Resolved a:03Joe [10:12:52] (03CR) 10ArielGlenn: [C: 03+2] make dumpsdata1002 a spare dumps server with no special jobs [puppet] - 10https://gerrit.wikimedia.org/r/552217 (owner: 10ArielGlenn) [10:17:59] !log update buster-wikimedia thirdparty/kubeadm-k8s packages (newer version will be used to handle T238654) [10:18:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:18:04] T238654: toolforge: new k8s: issues with routing interfering with DNS in the cluster as well as the webhook controllers - https://phabricator.wikimedia.org/T238654 [10:22:37] !log pool cp2023 with Varnish backend T238817 T227432 [10:22:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:22:43] T238817: Request routing to active/passive services active in codfw only stopped working - https://phabricator.wikimedia.org/T238817 [10:22:44] T227432: Replace Varnish backends with ATS on cache text nodes - https://phabricator.wikimedia.org/T227432 [10:26:40] 10Operations, 10Traffic: Request routing to active/passive services active in codfw only stopped working - https://phabricator.wikimedia.org/T238817 (10ema) HTTP routing of docker-registry looks good to me now: ` $ curl -s --resolve docker-registry.wikimedia.org:443:208.80.154.224 -v https://docker-registry.w... 
[10:29:09] (03PS2) 10ArielGlenn: move misc crons to dumpsdata1002 nfs server [puppet] - 10https://gerrit.wikimedia.org/r/551804 (https://phabricator.wikimedia.org/T224563) [10:42:55] 10Operations, 10observability: The "logstash-*" index pattern does not contain any of the following field types: ip - https://phabricator.wikimedia.org/T238795 (10fgiunchedi) Yes we can, if you know the name of the field we can add an explicit mapping to force the type in `modules/profile/files/logstash/elasti... [10:46:58] (03PS3) 10ArielGlenn: move misc crons to dumpsdata1002 nfs server [puppet] - 10https://gerrit.wikimedia.org/r/551804 (https://phabricator.wikimedia.org/T224563) [10:49:04] !log restarting tcpircbot-logmsgbot on icinga1001, has failed to log some messages, no useful log on the host [10:49:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:52:57] PROBLEM - High average GET latency for mw requests on api_appserver in eqiad on icinga1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=api_appserver&var-m [10:54:39] RECOVERY - High average GET latency for mw requests on api_appserver in eqiad on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=api_appserver&var-method=GET [10:58:09] (03CR) 10Volans: "@crusnov: ping as we need this ready by the end of the week because it will be very useful for the work on esams scheduled for next week." 
[software/netbox-reports] - 10https://gerrit.wikimedia.org/r/550052 (https://phabricator.wikimedia.org/T237007) (owner: 10CRusnov) [10:58:16] (03CR) 10Muehlenhoff: "We could create a generic profile which installs the perftools by itself, but I'm not really sold on the idea of pre-installing specific d" [puppet] - 10https://gerrit.wikimedia.org/r/550833 (https://phabricator.wikimedia.org/T236048) (owner: 10Effie Mouzeli) [11:00:04] Amir1, Lucas_WMDE, awight, and Urbanecm: #bothumor I � Unicode. All rise for European Mid-day SWAT(Max 6 patches) deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20191121T1100). [11:00:04] Ammarpad: A patch you scheduled for European Mid-day SWAT(Max 6 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [11:00:20] I can deploy the patch [11:00:43] (03PS6) 10Urbanecm: Restrict editing CNBanner namespace to autoconfirmed on metawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/552024 (https://phabricator.wikimedia.org/T238723) (owner: 10Ammarpad) [11:00:51] (03CR) 10Urbanecm: [C: 03+2] Restrict editing CNBanner namespace to autoconfirmed on metawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/552024 (https://phabricator.wikimedia.org/T238723) (owner: 10Ammarpad) [11:01:39] (03Merged) 10jenkins-bot: Restrict editing CNBanner namespace to autoconfirmed on metawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/552024 (https://phabricator.wikimedia.org/T238723) (owner: 10Ammarpad) [11:02:58] (03CR) 10Urbanecm: [C: 03+2] Set correct language for shywiktionary [mediawiki-config] - 10https://gerrit.wikimedia.org/r/552159 (https://phabricator.wikimedia.org/T238105) (owner: 10Ladsgroup) [11:03:01] (03CR) 10Filippo Giunchedi: bacula: Add prometheus exporter (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/551145 (https://phabricator.wikimedia.org/T234900) (owner: 10Jcrespo) [11:03:05] (03PS3) 
10Urbanecm: Set correct language for shywiktionary [mediawiki-config] - 10https://gerrit.wikimedia.org/r/552159 (https://phabricator.wikimedia.org/T238105) (owner: 10Ladsgroup) [11:03:10] (03CR) 10Urbanecm: [C: 03+2] Set correct language for shywiktionary [mediawiki-config] - 10https://gerrit.wikimedia.org/r/552159 (https://phabricator.wikimedia.org/T238105) (owner: 10Ladsgroup) [11:03:57] (03Merged) 10jenkins-bot: Set correct language for shywiktionary [mediawiki-config] - 10https://gerrit.wikimedia.org/r/552159 (https://phabricator.wikimedia.org/T238105) (owner: 10Ladsgroup) [11:05:22] !log disable puppet on mw[1-2]* [11:05:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:06:10] !log urbanecm@deploy1001 Synchronized wmf-config/InitialiseSettings.php: SWAT: 68d2003: Restrict editing CNBanner namespace to autoconfirmed on metawiki (T238723) (duration: 00m 54s) [11:06:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:06:15] T238723: Restrict anonymous users from editing pages in the CNBanner namespace on Meta-wiki - https://phabricator.wikimedia.org/T238723 [11:08:03] !log urbanecm@deploy1001 Synchronized wmf-config/InitialiseSettings.php: SWAT: e4861ec: Set correct language for shywiktionary (T238105) (duration: 00m 52s) [11:08:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:08:08] T238105: Create Shawiya Wiktionary - https://phabricator.wikimedia.org/T238105 [11:09:03] !log EU SWAT done [11:09:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:09:16] 10Operations, 10Mail, 10Wikimedia-Mailing-lists: [Mailing lists] Received 205 bounce action notification emails from mailman in 20 minutes - https://phabricator.wikimedia.org/T238780 (10mforns) 05Open→03Invalid @Aklapper Yes! They are all either @yahoo.com or @aol.com. And many of them follow the same n... 
[11:13:09] 10Operations, 10Traffic: Request routing to active/passive services active in codfw only stopped working - https://phabricator.wikimedia.org/T238817 (10akosiaris) \o/. Thanks for taking care of this! [11:13:22] (03PS6) 10Muehlenhoff: Add image submission mode to debmonitor client [software/debmonitor] - 10https://gerrit.wikimedia.org/r/551220 (https://phabricator.wikimedia.org/T237978) [11:15:33] (03CR) 10Effie Mouzeli: [C: 03+2] mediawiki: remove all hhvm related files and hieradata [puppet] - 10https://gerrit.wikimedia.org/r/551527 (https://phabricator.wikimedia.org/T229792) (owner: 10Effie Mouzeli) [11:16:12] (03PS2) 10Ema: ATS: enable reload for global Lua script [puppet] - 10https://gerrit.wikimedia.org/r/552201 (https://phabricator.wikimedia.org/T233274) [11:16:19] (03CR) 10Volans: [C: 04-1] "Change looks good apart a typo in the puppet file. The check runs usually within 0.5~1s for all the reports *except* the PuppetDB one, tha" (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/552154 (https://phabricator.wikimedia.org/T237803) (owner: 10CRusnov) [11:20:17] jouncebot: now [11:20:17] For the next 0 hour(s) and 39 minute(s): European Mid-day SWAT(Max 6 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20191121T1100) [11:20:21] jouncebot: next [11:20:21] In 4 hour(s) and 39 minute(s): Puppet SWAT(Max 6 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20191121T1600) [11:24:53] (03PS1) 10ArielGlenn: fix up role name for the misc crons dump worker [puppet] - 10https://gerrit.wikimedia.org/r/552225 [11:26:34] (03PS2) 10ArielGlenn: fix up role name for the misc crons dump worker [puppet] - 10https://gerrit.wikimedia.org/r/552225 [11:27:47] (03PS3) 10Ema: ATS: enable reload for global Lua script [puppet] - 10https://gerrit.wikimedia.org/r/552201 (https://phabricator.wikimedia.org/T233274) [11:28:09] (03CR) 10Alexandros Kosiaris: "> it seems more useful to create a generic script which allows to pull 
in the respective dbgsym packages based on the libraries installed " [puppet] - 10https://gerrit.wikimedia.org/r/550833 (https://phabricator.wikimedia.org/T236048) (owner: 10Effie Mouzeli) [11:29:25] (03CR) 10jerkins-bot: [V: 04-1] ATS: enable reload for global Lua script [puppet] - 10https://gerrit.wikimedia.org/r/552201 (https://phabricator.wikimedia.org/T233274) (owner: 10Ema) [11:33:46] !log reedy@deploy1001 Started scap: T234450 [11:33:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:36:39] (03CR) 10ArielGlenn: [C: 03+2] fix up role name for the misc crons dump worker [puppet] - 10https://gerrit.wikimedia.org/r/552225 (owner: 10ArielGlenn) [11:42:06] (03PS1) 10Reedy: Add PoolCounter configuration for Special:Contributions [mediawiki-config] - 10https://gerrit.wikimedia.org/r/552228 (https://phabricator.wikimedia.org/T234450) [11:42:30] (03PS4) 10Ema: ATS: enable reload for global Lua script [puppet] - 10https://gerrit.wikimedia.org/r/552201 (https://phabricator.wikimedia.org/T233274) [11:42:39] !log enable puppet on all mw hosts [11:42:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:44:36] (03CR) 10Ema: "https://puppet-compiler.wmflabs.org/compiler1001/19532/" [puppet] - 10https://gerrit.wikimedia.org/r/552201 (https://phabricator.wikimedia.org/T233274) (owner: 10Ema) [11:51:25] (03PS1) 10Ema: cache: reimage cp1077 as text_ats [puppet] - 10https://gerrit.wikimedia.org/r/552229 (https://phabricator.wikimedia.org/T227432) [11:53:06] !log reedy@deploy1001 Finished scap: T234450 (duration: 19m 20s) [11:53:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:53:13] (03PS2) 10Ema: cache: reimage cp1077 as text_ats [puppet] - 10https://gerrit.wikimedia.org/r/552229 (https://phabricator.wikimedia.org/T227432) [11:53:19] (03CR) 10jerkins-bot: [V: 04-1] cache: reimage cp1077 as text_ats [puppet] - 10https://gerrit.wikimedia.org/r/552229 
(https://phabricator.wikimedia.org/T227432) (owner: 10Ema) [11:56:26] 10Operations, 10netops, 10User-jbond: Sporatic RST drops in the ulogd logs - https://phabricator.wikimedia.org/T238823 (10jbond) [12:05:31] 10Operations, 10Traffic, 10Wikidata, 10Wikidata-Campsite, 10User-DannyS712: 502 errors on ATS/8.0.5 - https://phabricator.wikimedia.org/T237319 (10elukey) On mwdebug2001 this seems to be the mod proxy fcgi debug: ` Nov 21 12:02:13 mwdebug2001 apache2[4516]: [proxy_fcgi:debug] [pid 4516:tid 1403735935526... [12:06:57] (03Abandoned) 10Effie Mouzeli: dumps: fix hieradata for generation::worker::dumper_misc_crons [puppet] - 10https://gerrit.wikimedia.org/r/552125 (owner: 10Effie Mouzeli) [12:09:52] (03CR) 10Effie Mouzeli: [C: 03+2] Upgrade to 2.6 [debs/python-thumbor-wikimedia] - 10https://gerrit.wikimedia.org/r/531204 (https://phabricator.wikimedia.org/T226707) (owner: 10Gilles) [12:13:39] 10Operations, 10Traffic, 10fixcopyright.wikimedia.org, 10Core Platform Team Workboards (Clinic Duty Team), and 2 others: Retire fixcopyright.wikimedia.org - https://phabricator.wikimedia.org/T238803 (10Bugreporter) >>! In T238803#5680279, @Koavf wrote: > I object to deletion: as long as we still own the do... 
[12:14:13] PROBLEM - Varnish traffic drop between 30min ago and now at codfw on icinga1001 is CRITICAL: 55.32 le 60 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1 [12:19:19] RECOVERY - Varnish traffic drop between 30min ago and now at codfw on icinga1001 is OK: (C)60 le (W)70 le 72.45 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1 [12:22:02] 10Operations, 10DNS, 10Traffic: Create wildcard DNS record for Wikimedia projects - https://phabricator.wikimedia.org/T238825 (10Bugreporter) [12:26:39] (03CR) 10Arturo Borrero Gonzalez: labmon: add compatibility in buster (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/552107 (https://phabricator.wikimedia.org/T224585) (owner: 10Phamhi) [12:29:59] (03CR) 10Muehlenhoff: labmon: add compatibility in buster (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/552107 (https://phabricator.wikimedia.org/T224585) (owner: 10Phamhi) [12:32:17] 10Operations, 10Traffic, 10fixcopyright.wikimedia.org, 10Core Platform Team Workboards (Clinic Duty Team), and 2 others: Retire fixcopyright.wikimedia.org - https://phabricator.wikimedia.org/T238803 (10Jdforrester-WMF) a:03Jdforrester-WMF [12:38:39] PROBLEM - PHP7 rendering on mwdebug2001 is CRITICAL: connect to address 10.192.0.98 and port 80: Connection refused https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [12:38:59] PROBLEM - Apache HTTP on mwdebug2001 is CRITICAL: connect to address 10.192.0.98 and port 80: Connection refused https://wikitech.wikimedia.org/wiki/Application_servers [12:39:07] PROBLEM - Nginx local proxy to apache on mwdebug2001 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 325 bytes in 0.151 second response time https://wikitech.wikimedia.org/wiki/Application_servers [12:42:01] 
RECOVERY - PHP7 rendering on mwdebug2001 is OK: HTTP OK: HTTP/1.1 200 OK - 77975 bytes in 0.343 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [12:42:21] RECOVERY - Apache HTTP on mwdebug2001 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 631 bytes in 0.168 second response time https://wikitech.wikimedia.org/wiki/Application_servers [12:42:27] RECOVERY - Nginx local proxy to apache on mwdebug2001 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 632 bytes in 0.207 second response time https://wikitech.wikimedia.org/wiki/Application_servers [12:50:43] (03CR) 10Jcrespo: "I will work on this on a different patch." (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/551145 (https://phabricator.wikimedia.org/T234900) (owner: 10Jcrespo) [12:55:01] (03CR) 10Giuseppe Lavagetto: [V: 03+2 C: 03+2] envoy-tls-local-proxy: require configuration of the admin endpoint [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/549825 (https://phabricator.wikimedia.org/T237234) (owner: 10Giuseppe Lavagetto) [12:55:17] (03PS1) 10Jcrespo: prometheus-mysqld-exporter: Fix misc bugs/style pointed on review [puppet] - 10https://gerrit.wikimedia.org/r/552237 (https://phabricator.wikimedia.org/T234900) [12:57:25] (03CR) 10jerkins-bot: [V: 04-1] prometheus-mysqld-exporter: Fix misc bugs/style pointed on review [puppet] - 10https://gerrit.wikimedia.org/r/552237 (https://phabricator.wikimedia.org/T234900) (owner: 10Jcrespo) [12:57:49] 10Operations, 10ops-esams, 10netops: Setup esams atlas anchor - https://phabricator.wikimedia.org/T174637 (10faidon) Update: given the upcoming follow-up visit to esams next week, I requested a new image from RIPE. I got it today, and it can be found in the same place, as "anchor.nl-ams-as14907-**v2**.img". 
[12:59:57] PROBLEM - Varnish traffic drop between 30min ago and now at codfw on icinga1001 is CRITICAL: 53.58 le 60 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1 [13:00:10] (03CR) 10Effie Mouzeli: [C: 03+2] Only apply expiry logic to "thumb" zone [puppet] - 10https://gerrit.wikimedia.org/r/519374 (https://phabricator.wikimedia.org/T211661) (owner: 10Gilles) [13:06:43] PROBLEM - Varnish traffic drop between 30min ago and now at codfw on icinga1001 is CRITICAL: 59.01 le 60 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1 [13:09:12] 10Operations, 10Traffic, 10Wikidata, 10Wikidata-Campsite, 10User-DannyS712: 502 errors on ATS/8.0.5 - https://phabricator.wikimedia.org/T237319 (10darthmon_wmde) Is there anything that we can quickly do on wikibase to fix this? if so, please advise what concretely. Thanks! 
[13:12:07] (03PS1) 10Mholloway: Fix typos in docker-registry discovery URLs [deployment-charts] - 10https://gerrit.wikimedia.org/r/552239 [13:12:26] (03CR) 10Mholloway: [C: 03+2] Fix typos in docker-registry discovery URLs [deployment-charts] - 10https://gerrit.wikimedia.org/r/552239 (owner: 10Mholloway) [13:12:40] (03Merged) 10jenkins-bot: Fix typos in docker-registry discovery URLs [deployment-charts] - 10https://gerrit.wikimedia.org/r/552239 (owner: 10Mholloway) [13:15:09] RECOVERY - Varnish traffic drop between 30min ago and now at codfw on icinga1001 is OK: (C)60 le (W)70 le 74.31 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1 [13:15:47] PROBLEM - High average GET latency for mw requests on appserver in eqiad on icinga1001 is CRITICAL: cluster=appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET [13:16:11] (03PS1) 10Jcrespo: prometheus-bacula-exporter: Add expected fresness metrics [puppet] - 10https://gerrit.wikimedia.org/r/552241 (https://phabricator.wikimedia.org/T234900) [13:18:50] (03PS2) 10Jcrespo: prometheus-mysqld-exporter: Fix misc bugs/style pointed on review [puppet] - 10https://gerrit.wikimedia.org/r/552237 (https://phabricator.wikimedia.org/T234900) [13:19:11] RECOVERY - High average GET latency for mw requests on appserver in eqiad on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET [13:20:19] !log depool cp1077 and reimage as text_ats T227432 [13:20:21] 10Operations, 10ops-esams, 10netops: mr1-esams RMA - https://phabricator.wikimedia.org/T238174 (10faidon) [13:20:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:20:25] T227432: Replace Varnish backends with ATS on cache text nodes - https://phabricator.wikimedia.org/T227432 [13:20:46] (03CR) 10jerkins-bot: [V: 04-1] prometheus-mysqld-exporter: Fix misc bugs/style pointed on review [puppet] - 10https://gerrit.wikimedia.org/r/552237 (https://phabricator.wikimedia.org/T234900) (owner: 10Jcrespo) [13:21:28] (03CR) 10Ema: [C: 03+2] cache: reimage cp1077 as text_ats [puppet] - 10https://gerrit.wikimedia.org/r/552229 (https://phabricator.wikimedia.org/T227432) (owner: 10Ema) [13:22:33] PROBLEM - High average GET latency for mw requests on appserver in eqiad on icinga1001 is CRITICAL: cluster=appserver code=205 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET [13:22:52] (03PS2) 10Jcrespo: prometheus-bacula-exporter: Add expected fresness metrics [puppet] - 10https://gerrit.wikimedia.org/r/552241 (https://phabricator.wikimedia.org/T234900) [13:22:59] (03Abandoned) 10Jcrespo: prometheus-mysqld-exporter: Fix misc bugs/style pointed on review [puppet] - 10https://gerrit.wikimedia.org/r/552237 (https://phabricator.wikimedia.org/T234900) (owner: 10Jcrespo) [13:23:38] 10Operations, 10Traffic, 10Patch-For-Review: Replace Varnish backends with ATS on cache text nodes 
- https://phabricator.wikimedia.org/T227432 (10ema) [13:24:17] RECOVERY - High average GET latency for mw requests on appserver in eqiad on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET [13:25:53] 10Operations, 10Traffic, 10Patch-For-Review: Replace Varnish backends with ATS on cache text nodes - https://phabricator.wikimedia.org/T227432 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by ema on cumin1001.eqiad.wmnet for hosts: ` ['cp1077.eqiad.wmnet'] ` The log can be found in `/var/log/wm... [13:26:01] 10Operations, 10netops, 10User-jbond: Sporatic RST drops in the ulogd logs - https://phabricator.wikimedia.org/T238823 (10akosiaris) Could be totally different but with @jijiki we 've seen this behavior elsewhere as well. The latest installment is T238789. Per that logstash dashboard, it's the mediawiki's th... [13:27:47] 10Operations, 10serviceops: dropped packets to echostore.svc.eqiad 8082/tcp - https://phabricator.wikimedia.org/T238789 (10akosiaris) More investigation in T238823. It's quite possibly expected. 
[13:32:13] (03CR) 10Filippo Giunchedi: [C: 03+1] labmon: add compatibility in buster (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/552107 (https://phabricator.wikimedia.org/T224585) (owner: 10Phamhi) [13:32:57] (03PS1) 10Giuseppe Lavagetto: Upgrade envoy images to the latest version (1.12.1) [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/552244 (https://phabricator.wikimedia.org/T237235) [13:38:14] (03CR) 10Giuseppe Lavagetto: [V: 03+2 C: 03+2] Upgrade envoy images to the latest version (1.12.1) [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/552244 (https://phabricator.wikimedia.org/T237235) (owner: 10Giuseppe Lavagetto) [13:39:10] !log ema@cumin1001 START - Cookbook sre.hosts.downtime [13:39:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:41:17] !log ema@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [13:41:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:42:12] (03PS1) 10Ema: otrs/phabricator: do not assume text nodes are defined [puppet] - 10https://gerrit.wikimedia.org/r/552245 (https://phabricator.wikimedia.org/T227432) [13:49:19] 10Operations, 10netbox: Netbox should use CN rather than UID for LDAP login username - https://phabricator.wikimedia.org/T210566 (10faidon) 05Open→03Declined I'll decline, on the basis that this will be converted to use SSO soon-ish, and there's no point in going over two migrations :) [13:49:45] (03CR) 10Ema: "https://puppet-compiler.wmflabs.org/compiler1002/19535/" [puppet] - 10https://gerrit.wikimedia.org/r/552245 (https://phabricator.wikimedia.org/T227432) (owner: 10Ema) [13:50:20] 10Operations, 10Traffic, 10Patch-For-Review: Replace Varnish backends with ATS on cache text nodes - https://phabricator.wikimedia.org/T227432 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['cp1077.eqiad.wmnet'] ` and were **ALL** successful. 
[13:51:02] (03PS3) 10Jcrespo: prometheus-bacula-exporter: Add expected fresness metrics [puppet] - 10https://gerrit.wikimedia.org/r/552241 (https://phabricator.wikimedia.org/T234900) [13:51:04] (03PS1) 10Jcrespo: prometheus-bacula-exporter: Add total and successful backup counts [puppet] - 10https://gerrit.wikimedia.org/r/552249 (https://phabricator.wikimedia.org/T234900) [13:51:07] (03CR) 10Alexandros Kosiaris: [C: 03+1] otrs/phabricator: do not assume text nodes are defined [puppet] - 10https://gerrit.wikimedia.org/r/552245 (https://phabricator.wikimedia.org/T227432) (owner: 10Ema) [13:51:30] (03CR) 10Ema: [C: 03+2] otrs/phabricator: do not assume text nodes are defined [puppet] - 10https://gerrit.wikimedia.org/r/552245 (https://phabricator.wikimedia.org/T227432) (owner: 10Ema) [13:53:08] (03CR) 10jerkins-bot: [V: 04-1] prometheus-bacula-exporter: Add total and successful backup counts [puppet] - 10https://gerrit.wikimedia.org/r/552249 (https://phabricator.wikimedia.org/T234900) (owner: 10Jcrespo) [13:53:12] 10Operations, 10netbox: Netbox: tracking of hardware errors / grouping servers in order/batches - https://phabricator.wikimedia.org/T233774 (10MoritzMuehlenhoff) 05Open→03Invalid [13:53:41] (03CR) 10Giuseppe Lavagetto: [C: 03+2] Also check charts generated by helmfile [deployment-charts] - 10https://gerrit.wikimedia.org/r/549059 (owner: 10Giuseppe Lavagetto) [13:53:45] (03PS5) 10Giuseppe Lavagetto: Also check charts generated by helmfile [deployment-charts] - 10https://gerrit.wikimedia.org/r/549059 [13:53:59] (03PS2) 10Jcrespo: prometheus-bacula-exporter: Add total and successful backup counts [puppet] - 10https://gerrit.wikimedia.org/r/552249 (https://phabricator.wikimedia.org/T234900) [13:58:37] (03PS1) 10Mobrovac: kask-echoseen: Do not report dupes [mediawiki-config] - 10https://gerrit.wikimedia.org/r/552257 (https://phabricator.wikimedia.org/T237143) [13:58:55] PROBLEM - MariaDB Slave Lag: s4 on db2095 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 
29117.52 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_slave [13:59:18] ^ downtime that expired [13:59:50] !log pool cp1077 with ATS backend T227432 [13:59:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:59:55] T227432: Replace Varnish backends with ATS on cache text nodes - https://phabricator.wikimedia.org/T227432 [14:00:47] PROBLEM - High average GET latency for mw requests on api_appserver in eqiad on icinga1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=api_appserver&var-m [14:01:53] RECOVERY - High average GET latency for mw requests on api_appserver in eqiad on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=api_appserver&var-method=GET [14:02:45] _joe_: do we know what's up with the api_appservers latency alerts? [14:03:03] <_joe_> ema: in a meeting [14:03:24] 10Operations, 10Math: Clean up artifacts from LaTeX based math rendering - https://phabricator.wikimedia.org/T195847 (10jijiki) @Dzahn do you know the status of this? can we mark it as resolved? 
[14:03:27] <_joe_> but partly yes, ask effie [14:03:33] _joe_: ack, thanks [14:03:38] effie: ^ :) [14:04:54] ema: I am looking at it, but nothing so far :/ [14:07:15] (03PS1) 10Jbond: cergen: add Icing check to validate the expiry date on certificates [puppet] - 10https://gerrit.wikimedia.org/r/552260 (https://phabricator.wikimedia.org/T236277) [14:08:10] (03PS3) 10Jcrespo: prometheus-bacula-exporter: Add total and successful backup counts [puppet] - 10https://gerrit.wikimedia.org/r/552249 (https://phabricator.wikimedia.org/T234900) [14:08:38] ema: it is that https://phabricator.wikimedia.org/T231011 [14:10:08] (03PS5) 10Ema: ATS: enable reload for global Lua script [puppet] - 10https://gerrit.wikimedia.org/r/552201 (https://phabricator.wikimedia.org/T233274) [14:10:54] (03CR) 10jerkins-bot: [V: 04-1] cergen: add Icing check to validate the expiry date on certificates [puppet] - 10https://gerrit.wikimedia.org/r/552260 (https://phabricator.wikimedia.org/T236277) (owner: 10Jbond) [14:11:57] (03PS2) 10Jbond: cergen: add Icing check to validate the expiry date on certificates [puppet] - 10https://gerrit.wikimedia.org/r/552260 (https://phabricator.wikimedia.org/T236277) [14:12:39] (03PS3) 10Jbond: cergen: add Icing check to validate the expiry date on certificates [puppet] - 10https://gerrit.wikimedia.org/r/552260 (https://phabricator.wikimedia.org/T236277) [14:16:49] (03PS6) 10Ottomata: TLS envoyproxy support for eventgate chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/551253 (https://phabricator.wikimedia.org/T236386) [14:23:13] (03CR) 10Anomie: [C: 03+1] Add PoolCounter configuration for Special:Contributions [mediawiki-config] - 10https://gerrit.wikimedia.org/r/552228 (https://phabricator.wikimedia.org/T234450) (owner: 10Reedy) [14:25:13] (03PS7) 10Ottomata: TLS envoyproxy support for eventgate chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/551253 (https://phabricator.wikimedia.org/T236386) [14:25:54] (03CR) 10Ottomata: [C: 03+2] TLS 
envoyproxy support for eventgate chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/551253 (https://phabricator.wikimedia.org/T236386) (owner: 10Ottomata) [14:27:42] (03PS5) 10Ottomata: Kafka producer TLS support for eventgate charts [deployment-charts] - 10https://gerrit.wikimedia.org/r/551610 (https://phabricator.wikimedia.org/T236386) [14:27:50] (03PS1) 10Filippo Giunchedi: prometheus: aggregate Sentry 4 metrics [puppet] - 10https://gerrit.wikimedia.org/r/552267 (https://phabricator.wikimedia.org/T148541) [14:28:51] !log onimisionipe@deploy1001 Started deploy [wdqs/wdqs@db43901]: Agent filter changes [14:28:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:31:09] (03PS6) 10Ema: ATS: enable reload for global Lua script [puppet] - 10https://gerrit.wikimedia.org/r/552201 (https://phabricator.wikimedia.org/T233274) [14:32:28] (03PS6) 10Ottomata: Kafka producer TLS support for eventgate charts [deployment-charts] - 10https://gerrit.wikimedia.org/r/551610 (https://phabricator.wikimedia.org/T236386) [14:32:42] (03CR) 10Ema: "pcc looks happy now: https://puppet-compiler.wmflabs.org/compiler1003/19537/cp1075.eqiad.wmnet/" [puppet] - 10https://gerrit.wikimedia.org/r/552201 (https://phabricator.wikimedia.org/T233274) (owner: 10Ema) [14:35:29] (03CR) 10CDanis: [C: 03+1] "lgtm, thanks!" (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/552260 (https://phabricator.wikimedia.org/T236277) (owner: 10Jbond) [14:36:48] (03PS4) 10Jcrespo: prometheus-bacula-exporter: Add total and successful backup counts [puppet] - 10https://gerrit.wikimedia.org/r/552249 (https://phabricator.wikimedia.org/T234900) [14:38:21] (03CR) 10Volans: [C: 04-1] "I skipped the python file for now in my review. 
I think there is an error, see inline, plus a general comment here below:" (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/552260 (https://phabricator.wikimedia.org/T236277) (owner: 10Jbond) [14:38:47] (03PS2) 10BBlack: authdns-local-update: non-verbose by default [puppet] - 10https://gerrit.wikimedia.org/r/552085 (https://phabricator.wikimedia.org/T98006) [14:38:49] (03PS4) 10BBlack: Parallelize authdns-update with clush [puppet] - 10https://gerrit.wikimedia.org/r/552081 (https://phabricator.wikimedia.org/T98006) [14:38:51] (03PS1) 10BBlack: authdns-local-update: var rename for clarity [puppet] - 10https://gerrit.wikimedia.org/r/552270 (https://phabricator.wikimedia.org/T98006) [14:39:13] (03CR) 10jerkins-bot: [V: 04-1] prometheus-bacula-exporter: Add total and successful backup counts [puppet] - 10https://gerrit.wikimedia.org/r/552249 (https://phabricator.wikimedia.org/T234900) (owner: 10Jcrespo) [14:39:21] (03PS4) 10Jcrespo: prometheus-bacula-exporter: Add expected fresness metrics [puppet] - 10https://gerrit.wikimedia.org/r/552241 (https://phabricator.wikimedia.org/T234900) [14:39:23] (03PS5) 10Jcrespo: prometheus-bacula-exporter: Add total and successful backup counts [puppet] - 10https://gerrit.wikimedia.org/r/552249 (https://phabricator.wikimedia.org/T234900) [14:41:03] (03CR) 10Volans: [C: 04-1] "Reply to an inline comment (-1 is from previous review)" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/552260 (https://phabricator.wikimedia.org/T236277) (owner: 10Jbond) [14:41:37] !log restarting Jenkins for update [14:41:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:41:52] (03CR) 10Filippo Giunchedi: [C: 03+2] prometheus: aggregate Sentry 4 metrics [puppet] - 10https://gerrit.wikimedia.org/r/552267 (https://phabricator.wikimedia.org/T148541) (owner: 10Filippo Giunchedi) [14:42:22] (03PS1) 10Ottomata: eventgate 0.0.13 - envoyproxy tls support [deployment-charts] - 10https://gerrit.wikimedia.org/r/552271 
(https://phabricator.wikimedia.org/T236386) [14:42:29] (03PS6) 10Jcrespo: prometheus-bacula-exporter: Add total and successful backup counts [puppet] - 10https://gerrit.wikimedia.org/r/552249 (https://phabricator.wikimedia.org/T234900) [14:42:33] (03PS1) 10BBlack: test commit [dns] - 10https://gerrit.wikimedia.org/r/552272 [14:43:10] !log testing deployment software changes on authdns cluster, please hold dns changes for a few! [14:43:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:43:15] (03CR) 10Ottomata: [C: 03+2] eventgate 0.0.13 - envoyproxy tls support [deployment-charts] - 10https://gerrit.wikimedia.org/r/552271 (https://phabricator.wikimedia.org/T236386) (owner: 10Ottomata) [14:43:19] (03PS2) 10Ottomata: eventgate 0.0.13 - envoyproxy tls support [deployment-charts] - 10https://gerrit.wikimedia.org/r/552271 (https://phabricator.wikimedia.org/T236386) [14:43:21] (03CR) 10Ottomata: [V: 03+2 C: 03+2] eventgate 0.0.13 - envoyproxy tls support [deployment-charts] - 10https://gerrit.wikimedia.org/r/552271 (https://phabricator.wikimedia.org/T236386) (owner: 10Ottomata) [14:43:26] (03CR) 10Volans: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/552270 (https://phabricator.wikimedia.org/T98006) (owner: 10BBlack) [14:43:45] (03PS2) 10Ottomata: Enable TLS envoyproxy for eventgate-logging-external instances [deployment-charts] - 10https://gerrit.wikimedia.org/r/551263 (https://phabricator.wikimedia.org/T236386) [14:44:32] (03CR) 10jerkins-bot: [V: 04-1] prometheus-bacula-exporter: Add total and successful backup counts [puppet] - 10https://gerrit.wikimedia.org/r/552249 (https://phabricator.wikimedia.org/T234900) (owner: 10Jcrespo) [14:44:34] (03CR) 10Ottomata: [C: 03+2] Enable TLS envoyproxy for eventgate-logging-external instances [deployment-charts] - 10https://gerrit.wikimedia.org/r/551263 (https://phabricator.wikimedia.org/T236386) (owner: 10Ottomata) [14:44:43] (03CR) 10jerkins-bot: [V: 04-1] 
prometheus-bacula-exporter: Add total and successful backup counts [puppet] - 10https://gerrit.wikimedia.org/r/552249 (https://phabricator.wikimedia.org/T234900) (owner: 10Jcrespo) [14:45:08] (03PS4) 10Jbond: cergen: add Icinga check to validate the expiry date on certificates [puppet] - 10https://gerrit.wikimedia.org/r/552260 (https://phabricator.wikimedia.org/T236277) [14:46:01] (03PS7) 10Jcrespo: prometheus-bacula-exporter: Add total and successful backup counts [puppet] - 10https://gerrit.wikimedia.org/r/552249 (https://phabricator.wikimedia.org/T234900) [14:46:03] (03Abandoned) 10CDanis: varnishbe: there are no other varnishbes! use ats-be [puppet] - 10https://gerrit.wikimedia.org/r/552142 (owner: 10CDanis) [14:46:08] (03PS1) 10Ema: cache: reimage cp1079 as text_ats [puppet] - 10https://gerrit.wikimedia.org/r/552273 (https://phabricator.wikimedia.org/T227432) [14:46:41] (03PS1) 10Ottomata: Adding missed eventgate-0.0.13.tgz [deployment-charts] - 10https://gerrit.wikimedia.org/r/552274 [14:47:07] (03CR) 10Filippo Giunchedi: cergen: add Icinga check to validate the expiry date on certificates (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/552260 (https://phabricator.wikimedia.org/T236277) (owner: 10Jbond) [14:47:25] !log onimisionipe@deploy1001 Finished deploy [wdqs/wdqs@db43901]: Agent filter changes (duration: 18m 33s) [14:47:25] (03CR) 10Ottomata: [C: 03+2] Adding missed eventgate-0.0.13.tgz [deployment-charts] - 10https://gerrit.wikimedia.org/r/552274 (owner: 10Ottomata) [14:47:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:48:11] (03CR) 10Jcrespo: [C: 03+2] prometheus-bacula-exporter: Add expected fresness metrics [puppet] - 10https://gerrit.wikimedia.org/r/552241 (https://phabricator.wikimedia.org/T234900) (owner: 10Jcrespo) [14:48:13] (03CR) 10CDanis: [C: 03+1] "thanks!" 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/552228 (https://phabricator.wikimedia.org/T234450) (owner: 10Reedy) [14:48:24] (03CR) 10Jcrespo: [C: 03+2] prometheus-bacula-exporter: Add total and successful backup counts [puppet] - 10https://gerrit.wikimedia.org/r/552249 (https://phabricator.wikimedia.org/T234900) (owner: 10Jcrespo) [14:48:37] (03PS8) 10Jcrespo: prometheus-bacula-exporter: Add total and successful backup counts [puppet] - 10https://gerrit.wikimedia.org/r/552249 (https://phabricator.wikimedia.org/T234900) [14:49:32] !log depool cp1079 and reimage as text_ats T227432 [14:49:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:49:38] T227432: Replace Varnish backends with ATS on cache text nodes - https://phabricator.wikimedia.org/T227432 [14:51:01] (03CR) 10Ema: [C: 03+2] cache: reimage cp1079 as text_ats [puppet] - 10https://gerrit.wikimedia.org/r/552273 (https://phabricator.wikimedia.org/T227432) (owner: 10Ema) [14:53:19] (03CR) 10Filippo Giunchedi: "too late, but see inline" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/552241 (https://phabricator.wikimedia.org/T234900) (owner: 10Jcrespo) [14:53:22] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good to me. I was looking into whether the module should configure CAS v3, but https://github.com/apereo/mod_auth_cas/issues/133 the" [puppet] - 10https://gerrit.wikimedia.org/r/551811 (owner: 10Jbond) [14:53:48] 10Operations, 10Traffic, 10Patch-For-Review: Replace Varnish backends with ATS on cache text nodes - https://phabricator.wikimedia.org/T227432 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by ema on cumin1001.eqiad.wmnet for hosts: ` ['cp1079.eqiad.wmnet'] ` The log can be found in `/var/log/wm... 
[14:54:03] gehel, addshore ^ [14:54:05] deploy is done [14:54:23] PROBLEM - mobileapps endpoints health on scb2005 is CRITICAL: /{domain}/v1/page/mobile-sections/{title} (retrieve test page via mobile-sections) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps [14:54:29] onimisionipe: thanks! [14:56:04] (03CR) 10Jbond: "> I've some security concerns, in particular permission wise. In order to perform this check we'll need to give RO permission to the NRPE " (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/552260 (https://phabricator.wikimedia.org/T236277) (owner: 10Jbond) [14:56:05] RECOVERY - mobileapps endpoints health on scb2005 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps [14:56:24] (03CR) 10BBlack: [C: 03+2] test commit [dns] - 10https://gerrit.wikimedia.org/r/552272 (owner: 10BBlack) [14:58:31] (03CR) 10Jbond: "> Patch Set 1: Code-Review+1" [puppet] - 10https://gerrit.wikimedia.org/r/551811 (owner: 10Jbond) [14:58:38] (03CR) 10Jbond: [C: 03+2] apereo_cas: use CAS protocol instead of SAML [puppet] - 10https://gerrit.wikimedia.org/r/551811 (owner: 10Jbond) [15:01:36] (03PS1) 10Jcrespo: prometheus-bacula-exporter: Fix job label for last backup dates [puppet] - 10https://gerrit.wikimedia.org/r/552275 (https://phabricator.wikimedia.org/T234900) [15:02:44] (03CR) 10Volans: "Also I don't see in the task a specific design of what we're trying to achieve here. 
This would be a single check that will fire saying sa" [puppet] - 10https://gerrit.wikimedia.org/r/552260 (https://phabricator.wikimedia.org/T236277) (owner: 10Jbond) [15:03:09] (03CR) 10Subramanya Sastry: VirtualRESTService: Switch to Parsoid/PHP (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/552194 (https://phabricator.wikimedia.org/T229015) (owner: 10Mobrovac) [15:03:38] (03CR) 10Jcrespo: [C: 03+2] prometheus-bacula-exporter: Fix job label for last backup dates [puppet] - 10https://gerrit.wikimedia.org/r/552275 (https://phabricator.wikimedia.org/T234900) (owner: 10Jcrespo) [15:04:55] (03PS1) 10BBlack: Revert "test commit" [dns] - 10https://gerrit.wikimedia.org/r/552276 [15:05:01] (03CR) 10BBlack: [V: 03+2 C: 03+2] Revert "test commit" [dns] - 10https://gerrit.wikimedia.org/r/552276 (owner: 10BBlack) [15:07:05] !log ema@cumin1001 START - Cookbook sre.hosts.downtime [15:07:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:07:43] !log DONE testing deployment software changes on authdns cluster, back to normal [15:07:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:09:59] !log ema@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [15:10:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:13:01] !log purge https://releases.wikimedia.org/charts/eventgate-0.0.13.tgz, https://releases.wikimedia.org/charts/ and https://releases.wikimedia.org/charts/index.yaml [15:13:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:17:04] (03Abandoned) 10CDanis: test change for PCC only do not submit [puppet] - 10https://gerrit.wikimedia.org/r/552114 (owner: 10CDanis) [15:18:18] (03CR) 10Volans: [C: 03+1] "LGTM,I didn't tested it though" (031 comment) [software/debmonitor] - 10https://gerrit.wikimedia.org/r/551220 (https://phabricator.wikimedia.org/T237978) (owner: 10Muehlenhoff) [15:18:25] 10Operations, 10Traffic, 
10Patch-For-Review: Replace Varnish backends with ATS on cache text nodes - https://phabricator.wikimedia.org/T227432 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['cp1079.eqiad.wmnet'] ` and were **ALL** successful. [15:18:44] (03CR) 10Andrew Bogott: [C: 03+2] openstack: bootstrap ocata puppet code for servers [puppet] - 10https://gerrit.wikimedia.org/r/552059 (https://phabricator.wikimedia.org/T237749) (owner: 10Arturo Borrero Gonzalez) [15:18:53] (03CR) 10Andrew Bogott: [C: 03+2] "Thanks for this!" [puppet] - 10https://gerrit.wikimedia.org/r/552059 (https://phabricator.wikimedia.org/T237749) (owner: 10Arturo Borrero Gonzalez) [15:19:30] !log mholloway-shell@deploy1001 helmfile [EQIAD] Ran 'apply' command on namespace 'wikifeeds' for release 'production' . [15:19:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:19:34] (03PS2) 10Andrew Bogott: OpenStack Nova: Update config to work with Ocata [puppet] - 10https://gerrit.wikimedia.org/r/550503 (https://phabricator.wikimedia.org/T237749) [15:19:42] (03CR) 10Andrew Bogott: [C: 03+2] Openstack Neutron: Remove no-longer-supported min_l3_agents_per_router [puppet] - 10https://gerrit.wikimedia.org/r/550504 (owner: 10Andrew Bogott) [15:19:50] (03PS2) 10Andrew Bogott: Openstack Neutron: Remove no-longer-supported min_l3_agents_per_router [puppet] - 10https://gerrit.wikimedia.org/r/550504 [15:22:05] !log mholloway-shell@deploy1001 helmfile [CODFW] Ran 'apply' command on namespace 'wikifeeds' for release 'production' . 
[15:22:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:22:56] (03CR) 10Jbond: "> Patch Set 4:" [puppet] - 10https://gerrit.wikimedia.org/r/552260 (https://phabricator.wikimedia.org/T236277) (owner: 10Jbond) [15:25:56] (03CR) 10Jcrespo: prometheus-bacula-exporter: Add expected fresness metrics (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/552241 (https://phabricator.wikimedia.org/T234900) (owner: 10Jcrespo) [15:27:12] (03PS1) 10Mforns: analytics::refinery::job::refine: Bump up refinery jar version [puppet] - 10https://gerrit.wikimedia.org/r/552280 [15:28:23] (03CR) 10BBlack: [C: 03+2] authdns-local-update: var rename for clarity [puppet] - 10https://gerrit.wikimedia.org/r/552270 (https://phabricator.wikimedia.org/T98006) (owner: 10BBlack) [15:29:13] (03PS2) 10BBlack: authdns-local-update: var rename for clarity [puppet] - 10https://gerrit.wikimedia.org/r/552270 (https://phabricator.wikimedia.org/T98006) [15:29:15] (03PS3) 10BBlack: authdns-local-update: non-verbose by default [puppet] - 10https://gerrit.wikimedia.org/r/552085 (https://phabricator.wikimedia.org/T98006) [15:29:17] (03PS5) 10BBlack: Parallelize authdns-update with clush [puppet] - 10https://gerrit.wikimedia.org/r/552081 (https://phabricator.wikimedia.org/T98006) [15:29:53] 10Operations, 10Puppet, 10User-jbond: Create NRPE check to alert when certgen certificates are due to expire - https://phabricator.wikimedia.org/T238833 (10jbond) [15:30:01] marostegui: We forgot to include an index on that new oauth2_access_tokens table, . Should we file that as a schema change task, or since the table is still unused should I just do it? 
[15:30:09] (03CR) 10Andrew Bogott: [C: 03+2] OpenStack Nova: Update config to work with Ocata [puppet] - 10https://gerrit.wikimedia.org/r/550503 (https://phabricator.wikimedia.org/T237749) (owner: 10Andrew Bogott) [15:30:14] !log pool cp1079 with ATS backend T227432 [15:30:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:30:19] T227432: Replace Varnish backends with ATS on cache text nodes - https://phabricator.wikimedia.org/T227432 [15:30:27] (03PS5) 10Jbond: cergen: add Icinga check to validate the expiry date on certificates [puppet] - 10https://gerrit.wikimedia.org/r/552260 (https://phabricator.wikimedia.org/T238833) [15:31:03] anomie: you can probably just drop and recreate the table [15:31:12] marostegui: Ok, thanks [15:31:39] !log mforns@deploy1001 Started deploy [analytics/refinery@7f32472]: deploying analytics refinery (after refinery-source v0.0.107) [15:31:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:33:24] PROBLEM - High average GET latency for mw requests on api_appserver in eqiad on icinga1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=api_appserver&var-m [15:33:26] (03CR) 10BBlack: [C: 03+2] authdns-local-update: non-verbose by default [puppet] - 10https://gerrit.wikimedia.org/r/552085 (https://phabricator.wikimedia.org/T98006) (owner: 10BBlack) [15:33:35] (03PS4) 10BBlack: authdns-local-update: non-verbose by default [puppet] - 10https://gerrit.wikimedia.org/r/552085 (https://phabricator.wikimedia.org/T98006) [15:33:36] Hey all - Daimona and I are going to push T238451 through gerrit and deploy right now. Let me know if we shouldn't. 
[15:34:38] RECOVERY - High average GET latency for mw requests on api_appserver in eqiad on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=api_appserver&var-method=GET [15:34:38] (03CR) 10BBlack: [V: 03+2 C: 03+2] authdns-local-update: non-verbose by default [puppet] - 10https://gerrit.wikimedia.org/r/552085 (https://phabricator.wikimedia.org/T98006) (owner: 10BBlack) [15:36:58] (03PS1) 10Andrew Bogott: codfw1dev: move designate to 'Ocata' [puppet] - 10https://gerrit.wikimedia.org/r/552287 (https://phabricator.wikimedia.org/T237749) [15:38:00] (03CR) 10Elukey: [C: 03+2] analytics::refinery::job::refine: Bump up refinery jar version [puppet] - 10https://gerrit.wikimedia.org/r/552280 (owner: 10Mforns) [15:39:04] (03CR) 10Andrew Bogott: [C: 03+2] codfw1dev: move designate to 'Ocata' [puppet] - 10https://gerrit.wikimedia.org/r/552287 (https://phabricator.wikimedia.org/T237749) (owner: 10Andrew Bogott) [15:40:49] (03CR) 10Alexandros Kosiaris: [C: 04-1] Kafka producer TLS support for eventgate charts (035 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/551610 (https://phabricator.wikimedia.org/T236386) (owner: 10Ottomata) [15:41:04] (03CR) 10BBlack: [C: 03+2] Parallelize authdns-update with clush [puppet] - 10https://gerrit.wikimedia.org/r/552081 (https://phabricator.wikimedia.org/T98006) (owner: 10BBlack) [15:42:30] !log mforns@deploy1001 Finished deploy [analytics/refinery@7f32472]: deploying analytics refinery (after refinery-source v0.0.107) (duration: 10m 50s) [15:42:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:49:33] 10Operations, 10Puppet, 10Patch-For-Review, 10User-jbond: Create NRPE check to alert when certgen certificates are due to expire - 
https://phabricator.wikimedia.org/T238833 (10Volans) It's hard to reply from the description, there is no quote task description button AFAIK. Summarizing: - for the noise, so... [15:54:04] !log otto@deploy1001 helmfile [STAGING] Ran 'apply' command on namespace 'eventgate-logging-external' for release 'logging-external' . [15:54:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:54:53] 10Operations, 10DBA, 10Data-Services: Prepare and check storage layer for ge.wikimedia.org - https://phabricator.wikimedia.org/T236404 (10JHedden) 05Open→03Resolved a:03JHedden `name="Updates on labsdb10{09,10,11,12}" $ sudo /usr/local/sbin/maintain-replica-indexes --database gewikimedia --debug $ sud... [15:56:05] 10Operations, 10DBA, 10Data-Services, 10cloud-services-team (Kanban): Prepare and check storage layer for minwiktionary - https://phabricator.wikimedia.org/T238522 (10JHedden) 05Open→03Resolved a:03JHedden `name="Updates on labsdb10{09,10,11,12}" $ sudo /usr/local/sbin/maintain-replica-indexes --dat... [16:00:04] godog and _joe_: #bothumor Q:Why did functions stop calling each other? A:They had arguments. Rise for Puppet SWAT(Max 6 patches) . (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20191121T1600). [16:00:04] No GERRIT patches in the queue for this window AFAICS. [16:03:40] 10Operations, 10DBA, 10Data-Services, 10cloud-services-team (Kanban): Prepare and check storage layer for gcrwiki - https://phabricator.wikimedia.org/T238114 (10JHedden) 05Open→03Resolved a:03JHedden `name="Updates on labsdb10{09,10,11,12}" $ sudo /usr/local/sbin/maintain-replica-indexes --database... [16:04:53] 10Operations, 10DBA, 10Data-Services, 10cloud-services-team (Kanban): Prepare and check storage layer for shywiktionary - https://phabricator.wikimedia.org/T238115 (10JHedden) 05Open→03Resolved a:03JHedden `name="Updates on labsdb10{09,10,11,12}" $ sudo /usr/local/sbin/maintain-replica-indexes --data... 
[16:06:06] (03PS1) 10Jcrespo: prometheus-bacula-exporter: Fix missing job files metric [puppet] - 10https://gerrit.wikimedia.org/r/552291 (https://phabricator.wikimedia.org/T234900) [16:08:08] (03CR) 10Jcrespo: [C: 03+2] prometheus-bacula-exporter: Fix missing job files metric [puppet] - 10https://gerrit.wikimedia.org/r/552291 (https://phabricator.wikimedia.org/T234900) (owner: 10Jcrespo) [16:08:46] (03PS1) 10Arturo Borrero Gonzalez: openstack: serverpackages: ocata: drop trusted spec from repo declaration [puppet] - 10https://gerrit.wikimedia.org/r/552292 (https://phabricator.wikimedia.org/T238338) [16:10:12] PROBLEM - Check systemd state on netbox1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:10:17] 10Operations, 10Puppet, 10Patch-For-Review, 10User-jbond: Create NRPE check to alert when certgen certificates are due to expire - https://phabricator.wikimedia.org/T238833 (10crusnov) p:05Triage→03Normal [16:10:55] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] openstack: serverpackages: ocata: drop trusted spec from repo declaration [puppet] - 10https://gerrit.wikimedia.org/r/552292 (https://phabricator.wikimedia.org/T238338) (owner: 10Arturo Borrero Gonzalez) [16:13:11] 10Operations, 10ops-esams: apply asset tags to cable managers - https://phabricator.wikimedia.org/T238835 (10RobH) p:05Triage→03Normal [16:16:48] (03PS5) 10CRusnov: cables: detect duplicate cable names, and blank cable names [software/netbox-reports] - 10https://gerrit.wikimedia.org/r/550052 (https://phabricator.wikimedia.org/T237007) [16:17:23] (03CR) 10jerkins-bot: [V: 04-1] cables: detect duplicate cable names, and blank cable names [software/netbox-reports] - 10https://gerrit.wikimedia.org/r/550052 (https://phabricator.wikimedia.org/T237007) (owner: 10CRusnov) [16:17:55] (03CR) 10CRusnov: "> Patch Set 4:" [software/netbox-reports] - 10https://gerrit.wikimedia.org/r/550052 
(https://phabricator.wikimedia.org/T237007) (owner: 10CRusnov) [16:18:14] 10Puppet, 10ORES, 10Scoring-platform-team (Current): Require git-lfs in ORES hosts - https://phabricator.wikimedia.org/T232494 (10Halfak) 05Open→03Resolved [16:21:52] 10Operations, 10Puppet, 10Patch-For-Review, 10User-jbond: Create NRPE check to alert when certgen certificates are due to expire - https://phabricator.wikimedia.org/T238833 (10jbond) >>! In T238833#5681739, @Volans wrote: > Summarizing: > - the case of multiple certs is for example for our global wildcard... [16:21:54] (03PS6) 10CRusnov: cables: detect duplicate cable names, and blank cable names [software/netbox-reports] - 10https://gerrit.wikimedia.org/r/550052 (https://phabricator.wikimedia.org/T237007) [16:22:24] PROBLEM - Varnish traffic drop between 30min ago and now at codfw on icinga1001 is CRITICAL: 55.2 le 60 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1 [16:27:05] (03PS3) 10Jhedden: ceph: preserve docker iptables chains [puppet] - 10https://gerrit.wikimedia.org/r/551677 [16:29:08] RECOVERY - Varnish traffic drop between 30min ago and now at codfw on icinga1001 is OK: (C)60 le (W)70 le 71.58 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1 [16:30:34] (03CR) 10Jhedden: [C: 03+2] ceph: preserve docker iptables chains [puppet] - 10https://gerrit.wikimedia.org/r/551677 (owner: 10Jhedden) [16:31:47] !log sbassett@deploy1001 Started scap: Deploying T238451 (ext:AbuseFilter), running scap sync for i18n issues. 
[16:31:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:35:35] (03PS6) 10Alexandros Kosiaris: Also check charts generated by helmfile [deployment-charts] - 10https://gerrit.wikimedia.org/r/549059 (owner: 10Giuseppe Lavagetto) [16:37:16] RECOVERY - Check systemd state on netbox1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:41:52] (03CR) 10Volans: [C: 03+1] "LGTM if _get_site_slug_for_cable() is ok for Arzhel. Was this tested on af-netbox?" [software/netbox-reports] - 10https://gerrit.wikimedia.org/r/550052 (https://phabricator.wikimedia.org/T237007) (owner: 10CRusnov) [16:43:20] (03PS1) 10Arturo Borrero Gonzalez: openstack: serverpackages: ocata: fix missing thirdparty keyword [puppet] - 10https://gerrit.wikimedia.org/r/552298 (https://phabricator.wikimedia.org/T238338) [16:44:50] (03CR) 10Andrew Bogott: [C: 03+1] openstack: serverpackages: ocata: fix missing thirdparty keyword [puppet] - 10https://gerrit.wikimedia.org/r/552298 (https://phabricator.wikimedia.org/T238338) (owner: 10Arturo Borrero Gonzalez) [16:45:33] (03PS4) 10CRusnov: netbox report alerting: Simplify icinga check and cleanup [puppet] - 10https://gerrit.wikimedia.org/r/552154 (https://phabricator.wikimedia.org/T237803) [16:47:01] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] openstack: serverpackages: ocata: fix missing thirdparty keyword [puppet] - 10https://gerrit.wikimedia.org/r/552298 (https://phabricator.wikimedia.org/T238338) (owner: 10Arturo Borrero Gonzalez) [16:48:20] (03CR) 10CRusnov: "> Patch Set 3: Code-Review-1" (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/552154 (https://phabricator.wikimedia.org/T237803) (owner: 10CRusnov) [16:48:29] !log sbassett@deploy1001 Finished scap: Deploying T238451 (ext:AbuseFilter), running scap sync for i18n issues. 
(duration: 16m 42s) [16:48:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:54:29] !log otto@deploy1001 helmfile [STAGING] Ran 'sync' command on namespace 'eventgate-logging-external' for release 'logging-external' . [16:54:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:55:08] (03CR) 10Volans: [C: 03+1] "LGTM" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/552154 (https://phabricator.wikimedia.org/T237803) (owner: 10CRusnov) [16:57:56] (03PS5) 10CRusnov: netbox report alerting: Simplify icinga check and cleanup [puppet] - 10https://gerrit.wikimedia.org/r/552154 (https://phabricator.wikimedia.org/T237803) [16:57:59] (03CR) 10Subramanya Sastry: VirtualRESTService: Switch to Parsoid/PHP (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/552194 (https://phabricator.wikimedia.org/T229015) (owner: 10Mobrovac) [16:58:10] (03PS1) 10Filippo Giunchedi: facilities: fix esams pdu name [puppet] - 10https://gerrit.wikimedia.org/r/552301 [16:58:17] (03CR) 10CRusnov: [C: 03+2] netbox report alerting: Simplify icinga check and cleanup (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/552154 (https://phabricator.wikimedia.org/T237803) (owner: 10CRusnov) [16:58:39] (03CR) 10Filippo Giunchedi: [C: 03+2] facilities: fix esams pdu name [puppet] - 10https://gerrit.wikimedia.org/r/552301 (owner: 10Filippo Giunchedi) [17:00:04] cscott, arlolra, subbu, halfak, and accraze: How many deployers does it take to do Services – Graphoid / Parsoid / Citoid / ORES deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20191121T1700). 
[17:04:33] (03CR) 10Ayounsi: cables: detect duplicate cable names, and blank cable names (031 comment) [software/netbox-reports] - 10https://gerrit.wikimedia.org/r/550052 (https://phabricator.wikimedia.org/T237007) (owner: 10CRusnov) [17:06:37] !log mobrovac@deploy1001 Started deploy [restbase/deploy@b987068] (dev-cluster): Switch mw.org to Parsoid/PHP [17:06:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:09:15] !log mobrovac@deploy1001 Finished deploy [restbase/deploy@b987068] (dev-cluster): Switch mw.org to Parsoid/PHP (duration: 02m 38s) [17:09:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:10:28] !log mobrovac@deploy1001 Started deploy [restbase/deploy@b987068]: Switch mw.org to Parsoid/PHP - T229015 [17:10:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:10:33] T229015: Tracking: Direct live production traffic at Parsoid/PHP - https://phabricator.wikimedia.org/T229015 [17:11:02] PROBLEM - Check systemd state on netbox1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:11:24] no [17:12:10] (03PS12) 10EBernhardson: airflow: Run webserver and scheduler processes [puppet] - 10https://gerrit.wikimedia.org/r/544990 (https://phabricator.wikimedia.org/T236180) [17:12:12] (03PS1) 10EBernhardson: Allow analytics-search-users to manage search/airflow instances [puppet] - 10https://gerrit.wikimedia.org/r/552304 (https://phabricator.wikimedia.org/T236180) [17:14:50] ebernhardson: o/ --^ will need to go through a SRE meeting [17:15:12] better to open a task for that with the "sre-access-requests" tag [17:16:02] 10Operations, 10Puppet, 10Patch-For-Review, 10User-jbond: Create NRPE check to alert when cergen certificates are due to expire - https://phabricator.wikimedia.org/T238833 (10Ottomata) [17:16:46] (03PS2) 10Mobrovac: VirtualRESTService: Switch private wikis to Parsoid/PHP [mediawiki-config] - 10https://gerrit.wikimedia.org/r/552194 (https://phabricator.wikimedia.org/T229015) [17:17:27] (03CR) 10jerkins-bot: [V: 04-1] VirtualRESTService: Switch private wikis to Parsoid/PHP [mediawiki-config] - 10https://gerrit.wikimedia.org/r/552194 (https://phabricator.wikimedia.org/T229015) (owner: 10Mobrovac) [17:18:05] (03PS1) 10CRusnov: netbox reports: run as root instead of nagios [puppet] - 10https://gerrit.wikimedia.org/r/552305 [17:18:07] oh here it comes [17:19:26] elukey: yup, that's why i pulled it out into a separate patch [17:19:48] PROBLEM - mobileapps endpoints health on scb2004 is CRITICAL: /{domain}/v1/page/mobile-sections/{title} (retrieve test page via mobile-sections) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps [17:19:51] elukey: slowly getting it deployed, just takes a bit with hiring and g being a manager now :) [17:19:55] (03PS3) 10Mobrovac: VirtualRESTService: Switch private wikis to Parsoid/PHP [mediawiki-config] - 10https://gerrit.wikimedia.org/r/552194 
(https://phabricator.wikimedia.org/T229015) [17:20:17] 10Operations, 10Puppet, 10Patch-For-Review, 10User-jbond: Create NRPE check to alert when cergen certificates are due to expire - https://phabricator.wikimedia.org/T238833 (10Ottomata) Why does this script need sudo? Doesn't it just need to read the expiry out of the public certs? [17:20:32] (03CR) 10Subramanya Sastry: [C: 03+1] VirtualRESTService: Switch private wikis to Parsoid/PHP [mediawiki-config] - 10https://gerrit.wikimedia.org/r/552194 (https://phabricator.wikimedia.org/T229015) (owner: 10Mobrovac) [17:20:46] RECOVERY - mobileapps endpoints health on scb2004 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps [17:21:51] (03CR) 10Ayounsi: "> Patch Set 6: Code-Review+1" [software/netbox-reports] - 10https://gerrit.wikimedia.org/r/550052 (https://phabricator.wikimedia.org/T237007) (owner: 10CRusnov) [17:23:18] (03PS2) 10CRusnov: netbox reports: run as netbox instead of nagios [puppet] - 10https://gerrit.wikimedia.org/r/552305 [17:24:39] (03CR) 10CRusnov: cables: detect duplicate cable names, and blank cable names (031 comment) [software/netbox-reports] - 10https://gerrit.wikimedia.org/r/550052 (https://phabricator.wikimedia.org/T237007) (owner: 10CRusnov) [17:24:41] (03CR) 10Volans: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/552305 (owner: 10CRusnov) [17:25:05] (03CR) 10CRusnov: [C: 03+2] netbox reports: run as netbox instead of nagios [puppet] - 10https://gerrit.wikimedia.org/r/552305 (owner: 10CRusnov) [17:26:02] ebernhardson: you have my support if you need any help, feel free to ping.. 
I suggested to open a task for sre access requests because usually it is required [17:27:10] !log mobrovac@deploy1001 Finished deploy [restbase/deploy@b987068]: Switch mw.org to Parsoid/PHP - T229015 (duration: 16m 43s) [17:27:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:27:16] T229015: Tracking: Direct live production traffic at Parsoid/PHP - https://phabricator.wikimedia.org/T229015 [17:28:04] oh, it's done/?? [17:28:10] woo hoo! [17:29:38] RECOVERY - Check systemd state on netbox1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:35:39] (03CR) 10Ayounsi: cables: detect duplicate cable names, and blank cable names (031 comment) [software/netbox-reports] - 10https://gerrit.wikimedia.org/r/550052 (https://phabricator.wikimedia.org/T237007) (owner: 10CRusnov) [17:36:43] 10Operations, 10MediaWiki-REST-API, 10serviceops, 10wikidiff2, and 2 others: Prod compare endpoint missing offset object (with from & to keys) on diff items - https://phabricator.wikimedia.org/T238846 (10Tsevener) [17:38:22] 10Operations, 10MediaWiki-REST-API, 10serviceops, 10wikidiff2, and 2 others: Prod compare endpoint missing offset object (with from & to keys) on diff items - https://phabricator.wikimedia.org/T238846 (10Tsevener) [17:42:11] 10Operations, 10observability: The "logstash-*" index pattern does not contain any of the following field types: ip - https://phabricator.wikimedia.org/T238795 (10ayounsi) From https://logstash.wikimedia.org/app/kibana#/dashboard/69b9fbe0-3c1b-11e8-90f7-4958fd3a62b4 `src-ip` `dst-ip` From https://logstash.wik... [17:45:28] 10Operations, 10MediaWiki-REST-API, 10serviceops, 10wikidiff2, and 2 others: Prod compare endpoint missing offset object (with from & to keys) on diff items - https://phabricator.wikimedia.org/T238846 (10Pchelolo) No surprise. @jijiki has deployed 1.10 to labs machines as indicated in the parent task, but... 
[17:45:39] (03CR) 10Mobrovac: [C: 03+2] VirtualRESTService: Switch private wikis to Parsoid/PHP [mediawiki-config] - 10https://gerrit.wikimedia.org/r/552194 (https://phabricator.wikimedia.org/T229015) (owner: 10Mobrovac) [17:45:58] PROBLEM - Varnish traffic drop between 30min ago and now at codfw on icinga1001 is CRITICAL: 59.94 le 60 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1 [17:46:41] (03Merged) 10jenkins-bot: VirtualRESTService: Switch private wikis to Parsoid/PHP [mediawiki-config] - 10https://gerrit.wikimedia.org/r/552194 (https://phabricator.wikimedia.org/T229015) (owner: 10Mobrovac) [17:46:46] (03PS1) 10Giuseppe Lavagetto: Add fix for tclap position (#9702) [debs/envoyproxy] (wikimedia-stretch) - 10https://gerrit.wikimedia.org/r/552311 [17:48:48] !log mobrovac@deploy1001 Synchronized wmf-config/LabsServices.php: Switch private wikis to Parsoid/PHP; file 1/4 -- T229015 (duration: 00m 53s) [17:48:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:48:54] T229015: Tracking: Direct live production traffic at Parsoid/PHP - https://phabricator.wikimedia.org/T229015 [17:50:05] !log mobrovac@deploy1001 Synchronized wmf-config/ProductionServices.php: Switch private wikis to Parsoid/PHP; file 2/4 -- T229015 (duration: 00m 53s) [17:50:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:51:02] RECOVERY - Varnish traffic drop between 30min ago and now at codfw on icinga1001 is OK: (C)60 le (W)70 le 71.68 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1 [17:51:26] !log mobrovac@deploy1001 Synchronized wmf-config/InitialiseSettings.php: Switch private wikis to Parsoid/PHP; file 3/4 -- T229015 (duration: 00m 51s) [17:51:30] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:52:36] (03CR) 10BBlack: [C: 03+2] TLS Analytics: make parsing more robust [puppet] - 10https://gerrit.wikimedia.org/r/551840 (owner: 10BBlack) [17:52:56] !log mobrovac@deploy1001 Synchronized wmf-config/CommonSettings.php: Switch private wikis to Parsoid/PHP; file 4/4 -- T229015 (duration: 00m 53s) [17:53:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:55:17] (03CR) 10Dzahn: [C: 03+2] varnish: remove config for disabled phab_aphlict [puppet] - 10https://gerrit.wikimedia.org/r/552122 (https://phabricator.wikimedia.org/T238781) (owner: 10Dzahn) [17:55:25] (03PS3) 10Dzahn: varnish: remove config for disabled phab_aphlict [puppet] - 10https://gerrit.wikimedia.org/r/552122 (https://phabricator.wikimedia.org/T238781) [17:59:21] 10Operations, 10serviceops, 10Patch-For-Review: dropped packets to phab1003 22280/tcp - https://phabricator.wikimedia.org/T238781 (10Dzahn) @ayounsi After the merge above this is expected to stop soon, now. [18:00:04] MaxSem, RoanKattouw, Niharika, and Urbanecm: My dear minions, it's time we take the moon! Just kidding. Time for Morning SWAT(Max 6 patches) deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20191121T1800). [18:00:05] No GERRIT patches in the queue for this window AFAICS. 
[18:00:18] (03PS1) 10Mobrovac: Parsoid/PHP Service: Use HTTPS [mediawiki-config] - 10https://gerrit.wikimedia.org/r/552314 (https://phabricator.wikimedia.org/T229015) [18:00:33] (03CR) 10Mobrovac: [C: 03+2] Parsoid/PHP Service: Use HTTPS [mediawiki-config] - 10https://gerrit.wikimedia.org/r/552314 (https://phabricator.wikimedia.org/T229015) (owner: 10Mobrovac) [18:01:13] (03PS1) 10Ottomata: eventgate-0.0.14 [deployment-charts] - 10https://gerrit.wikimedia.org/r/552315 [18:01:25] (03Merged) 10jenkins-bot: Parsoid/PHP Service: Use HTTPS [mediawiki-config] - 10https://gerrit.wikimedia.org/r/552314 (https://phabricator.wikimedia.org/T229015) (owner: 10Mobrovac) [18:01:39] (03CR) 10Ottomata: [C: 03+2] eventgate-0.0.14 [deployment-charts] - 10https://gerrit.wikimedia.org/r/552315 (owner: 10Ottomata) [18:01:43] (03PS2) 10Ottomata: eventgate-0.0.14 [deployment-charts] - 10https://gerrit.wikimedia.org/r/552315 [18:01:45] (03CR) 10Ottomata: [V: 03+2 C: 03+2] eventgate-0.0.14 [deployment-charts] - 10https://gerrit.wikimedia.org/r/552315 (owner: 10Ottomata) [18:02:57] !log mobrovac@deploy1001 Synchronized wmf-config/ProductionServices.php: Use HTTPS for contacting Parsoid/PHP - T229015 (duration: 00m 53s) [18:03:00] !log otto@deploy1001 helmfile [STAGING] Ran 'apply' command on namespace 'eventgate-logging-external' for release 'logging-external' . 
[18:03:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:03:03] T229015: Tracking: Direct live production traffic at Parsoid/PHP - https://phabricator.wikimedia.org/T229015 [18:03:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:06:34] (03CR) 10Dzahn: [C: 03+2] webperf: Remove xhgui profile from webperf::profiling_tools role [puppet] - 10https://gerrit.wikimedia.org/r/552135 (https://phabricator.wikimedia.org/T180761) (owner: 10Krinkle) [18:06:40] 10Operations, 10Traffic, 10Wikidata, 10Wikidata-Campsite, 10User-DannyS712: 502 errors on ATS/8.0.5 - https://phabricator.wikimedia.org/T237319 (10elukey) I have created a Docker image for Debian stretch installing apache2 (same version of the mw app servers) + php7.2-fpm from Sury's repo + the following... [18:10:56] (03PS1) 10Mobrovac: Parsoid VRS: Switch back private wikis to Parsoid/JS [mediawiki-config] - 10https://gerrit.wikimedia.org/r/552317 (https://phabricator.wikimedia.org/T229015) [18:11:48] (03CR) 10Mobrovac: [V: 03+2 C: 03+2] Parsoid VRS: Switch back private wikis to Parsoid/JS [mediawiki-config] - 10https://gerrit.wikimedia.org/r/552317 (https://phabricator.wikimedia.org/T229015) (owner: 10Mobrovac) [18:13:17] !log mobrovac@deploy1001 Synchronized wmf-config/InitialiseSettings.php: Switch private wikis back to Parsoid/JS - T229015 (duration: 00m 52s) [18:13:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:13:23] T229015: Tracking: Direct live production traffic at Parsoid/PHP - https://phabricator.wikimedia.org/T229015 [18:14:18] 10Operations, 10Analytics, 10Patch-For-Review, 10User-Elukey: Archival of home directories on servers with very large homes - https://phabricator.wikimedia.org/T215171 (10mforns) p:05High→03Normal [18:16:36] 10Operations, 10Traffic, 10Inuka-Team (Kanban), 10Patch-For-Review, 10Performance-Team (Radar): Code for InukaPageView instrumentation - 
https://phabricator.wikimedia.org/T238029 (10SBisson) >>! In T238029#5679932, @Neil_P._Quinn_WMF wrote: > @SBisson I looked over the patch and [the schema](https://meta... [18:31:58] 10Operations, 10Traffic, 10Inuka-Team (Kanban), 10Patch-For-Review, 10Performance-Team (Radar): Code for InukaPageView instrumentation - https://phabricator.wikimedia.org/T238029 (10Neil_P._Quinn_WMF) Thank you for the quick responses, @SBisson! >>! In T238029#5682200, @SBisson wrote: >>>! In T238029#5... [18:32:39] 10Operations, 10observability, 10Patch-For-Review, 10Performance-Team (Radar): Upgrade grafana to 6.4.4 - https://phabricator.wikimedia.org/T220838 (10CDanis) I've heard no complaints, and can verify from the logs that it's seen at least some testing by others. Planning to do a final snapshot and move tra... [18:33:25] 10Operations, 10observability, 10Patch-For-Review, 10Performance-Team (Radar): Upgrade grafana to 6.4.4 - https://phabricator.wikimedia.org/T220838 (10CDanis) [18:36:35] (03PS7) 10CRusnov: cables: detect duplicate cable names, and blank cable names [software/netbox-reports] - 10https://gerrit.wikimedia.org/r/550052 (https://phabricator.wikimedia.org/T237007) [18:37:55] (03CR) 10CRusnov: [C: 03+2] cables: detect duplicate cable names, and blank cable names (031 comment) [software/netbox-reports] - 10https://gerrit.wikimedia.org/r/550052 (https://phabricator.wikimedia.org/T237007) (owner: 10CRusnov) [18:38:06] (03PS8) 10CRusnov: cables: detect duplicate cable names, and blank cable names [software/netbox-reports] - 10https://gerrit.wikimedia.org/r/550052 (https://phabricator.wikimedia.org/T237007) [18:39:56] (03PS1) 10Ottomata: eventgate-0.0.15 - Fix services app selector for envoyproxy tls port [deployment-charts] - 10https://gerrit.wikimedia.org/r/552318 (https://phabricator.wikimedia.org/T236386) [18:41:33] (03CR) 10Ottomata: [C: 03+2] eventgate-0.0.15 - Fix services app selector for envoyproxy tls port [deployment-charts] - 
10https://gerrit.wikimedia.org/r/552318 (https://phabricator.wikimedia.org/T236386) (owner: 10Ottomata) [18:42:24] !log otto@deploy1001 helmfile [STAGING] Ran 'apply' command on namespace 'eventgate-logging-external' for release 'logging-external' . [18:42:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:44:10] PROBLEM - High average GET latency for mw requests on api_appserver in eqiad on icinga1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=api_appserver&var-m [18:44:25] (03PS1) 10CRusnov: cables: fix sense of blankness check [software/netbox-reports] - 10https://gerrit.wikimedia.org/r/552319 [18:45:10] !log otto@deploy1001 helmfile [CODFW] Ran 'apply' command on namespace 'eventgate-logging-external' for release 'logging-external' . [18:45:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:45:51] (03CR) 10Volans: [C: 03+1] "LGTM" [software/netbox-reports] - 10https://gerrit.wikimedia.org/r/552319 (owner: 10CRusnov) [18:45:52] RECOVERY - High average GET latency for mw requests on api_appserver in eqiad on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=api_appserver&var-method=GET [18:45:54] 10Operations, 10Analytics-Kanban, 10Better Use Of Data, 10Event-Platform, and 8 others: Set up eventgate-logging-external in production - https://phabricator.wikimedia.org/T236386 (10Ottomata) [18:46:13] (03CR) 10CRusnov: [C: 03+2] cables: fix sense of blankness check [software/netbox-reports] - 10https://gerrit.wikimedia.org/r/552319 (owner: 10CRusnov) [18:49:19] !log otto@deploy1001 helmfile [EQIAD] Ran 'apply' command on namespace 'eventgate-logging-external' for release 'logging-external' . [18:49:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:49:59] 10Operations, 10Analytics-Kanban, 10Better Use Of Data, 10Event-Platform, and 8 others: Set up eventgate-logging-external in production - https://phabricator.wikimedia.org/T236386 (10Ottomata) Ok thanks for the help today @akosiaris and @Joe, HTTPS via envoyproxy is finally working! I will be off tomorrow... 
[18:59:07] (03PS1) 10Andrew Bogott: codfw1dev: move designate to 'Ocata' [puppet] - 10https://gerrit.wikimedia.org/r/552322 (https://phabricator.wikimedia.org/T237749) [18:59:11] (03PS4) 10Dzahn: monitoring: add data types to monitoring::service [puppet] - 10https://gerrit.wikimedia.org/r/551882 [18:59:19] (03PS1) 10CRusnov: cables: blacklist eqiad in blank cable test [software/netbox-reports] - 10https://gerrit.wikimedia.org/r/552323 [19:00:01] (03CR) 10jerkins-bot: [V: 04-1] cables: blacklist eqiad in blank cable test [software/netbox-reports] - 10https://gerrit.wikimedia.org/r/552323 (owner: 10CRusnov) [19:00:48] (03CR) 10Volans: cables: blacklist eqiad in blank cable test (032 comments) [software/netbox-reports] - 10https://gerrit.wikimedia.org/r/552323 (owner: 10CRusnov) [19:01:12] (03PS2) 10CRusnov: cables: blacklist eqiad in blank cable test [software/netbox-reports] - 10https://gerrit.wikimedia.org/r/552323 [19:01:18] !log upgrading designate to 'ocata' on cloudservices1003 and 1004 [19:01:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:01:50] (03CR) 10jerkins-bot: [V: 04-1] cables: blacklist eqiad in blank cable test [software/netbox-reports] - 10https://gerrit.wikimedia.org/r/552323 (owner: 10CRusnov) [19:02:29] (03PS2) 10Andrew Bogott: eqiad1: move designate to 'Ocata' [puppet] - 10https://gerrit.wikimedia.org/r/552322 (https://phabricator.wikimedia.org/T237749) [19:02:44] (03CR) 10jerkins-bot: [V: 04-1] monitoring: add data types to monitoring::service [puppet] - 10https://gerrit.wikimedia.org/r/551882 (owner: 10Dzahn) [19:03:05] (03PS3) 10CRusnov: cables: blacklist eqiad in blank cable test [software/netbox-reports] - 10https://gerrit.wikimedia.org/r/552323 [19:04:23] (03CR) 10Andrew Bogott: [C: 03+2] eqiad1: move designate to 'Ocata' [puppet] - 10https://gerrit.wikimedia.org/r/552322 (https://phabricator.wikimedia.org/T237749) (owner: 10Andrew Bogott) [19:05:05] (03PS4) 10CRusnov: cables: blacklist eqiad in blank 
cable test [software/netbox-reports] - 10https://gerrit.wikimedia.org/r/552323 [19:06:47] 10Operations, 10Traffic, 10Inuka-Team (Kanban), 10Patch-For-Review, 10Performance-Team (Radar): Code for InukaPageView instrumentation - https://phabricator.wikimedia.org/T238029 (10SBisson) >>! In T238029#5682265, @Neil_P._Quinn_WMF wrote: > [...] > ISO-8601 (i.e. `2019-10-02T16:15:30Z`), please. Done [19:07:09] (03CR) 10CRusnov: cables: blacklist eqiad in blank cable test (032 comments) [software/netbox-reports] - 10https://gerrit.wikimedia.org/r/552323 (owner: 10CRusnov) [19:07:50] (03CR) 10CRusnov: [C: 03+2] cables: blacklist eqiad in blank cable test [software/netbox-reports] - 10https://gerrit.wikimedia.org/r/552323 (owner: 10CRusnov) [19:17:43] 10Operations, 10ops-ulsfo: audit cable labels @ ulsfo - https://phabricator.wikimedia.org/T238856 (10RobH) p:05Triage→03Normal [19:18:45] 10Operations, 10ops-esams: apply asset tags to cable managers - https://phabricator.wikimedia.org/T238835 (10Peachey88) [19:36:43] (03PS1) 10Dzahn: install_server: downgrade xhgui servers from buster to stretch [puppet] - 10https://gerrit.wikimedia.org/r/552324 (https://phabricator.wikimedia.org/T238098) [19:37:31] (03CR) 10Dzahn: [C: 03+2] install_server: downgrade xhgui servers from buster to stretch [puppet] - 10https://gerrit.wikimedia.org/r/552324 (https://phabricator.wikimedia.org/T238098) (owner: 10Dzahn) [19:43:01] (03PS1) 10Dzahn: xhgui: add support for stretch/PHP7.2 [puppet] - 10https://gerrit.wikimedia.org/r/552325 (https://phabricator.wikimedia.org/T238098) [19:44:11] (03CR) 10Muehlenhoff: xhgui: add support for stretch/PHP7.2 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/552325 (https://phabricator.wikimedia.org/T238098) (owner: 10Dzahn) [19:49:25] (03PS5) 10Dzahn: monitoring: add data types to monitoring::service [puppet] - 10https://gerrit.wikimedia.org/r/551882 [19:52:30] (03CR) 10jerkins-bot: [V: 04-1] monitoring: add data types to monitoring::service [puppet] - 
10https://gerrit.wikimedia.org/r/551882 (owner: 10Dzahn) [19:55:41] (03CR) 10Dzahn: xhgui: add support for stretch/PHP7.2 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/552325 (https://phabricator.wikimedia.org/T238098) (owner: 10Dzahn) [19:56:07] (03PS2) 10Dzahn: xhgui: add support for stretch/PHP7.2 [puppet] - 10https://gerrit.wikimedia.org/r/552325 (https://phabricator.wikimedia.org/T238098) [19:56:41] (03PS3) 10Dzahn: xhgui: add support for stretch/PHP7.0 [puppet] - 10https://gerrit.wikimedia.org/r/552325 (https://phabricator.wikimedia.org/T238098) [19:57:12] (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/552325 (https://phabricator.wikimedia.org/T238098) (owner: 10Dzahn) [19:57:32] (03CR) 10Dzahn: [C: 03+2] xhgui: add support for stretch/PHP7.0 [puppet] - 10https://gerrit.wikimedia.org/r/552325 (https://phabricator.wikimedia.org/T238098) (owner: 10Dzahn) [20:02:30] 10Operations, 10Math: Clean up artifacts from LaTeX based math rendering - https://phabricator.wikimedia.org/T195847 (10Dzahn) 05Open→03Resolved @jijiki Per our meeting today.. it would have been nice to have to remove the math packages from all but since reinstall is happening really soon now i am closing... 
[20:08:09] !log mforns@deploy1001 Started deploy [analytics/refinery@97015e4]: add new projects to webrequest whitelist
[20:08:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:08:28] PROBLEM - Varnish traffic drop between 30min ago and now at codfw on icinga1001 is CRITICAL: 55.34 le 60 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1
[20:11:50] RECOVERY - Varnish traffic drop between 30min ago and now at codfw on icinga1001 is OK: (C)60 le (W)70 le 70.34 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1
[20:12:58] RECOVERY - Check systemd state on icinga1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[20:14:39] !log icinga1001 - systemctl reset-failed
[20:14:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:15:44] (PS1) ArielGlenn: redact possible password entries in dumps log exceptions emailer [puppet] - https://gerrit.wikimedia.org/r/552328
[20:16:38] !log mforns@deploy1001 Finished deploy [analytics/refinery@97015e4]: add new projects to webrequest whitelist (duration: 08m 29s)
[20:16:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:19:46] PROBLEM - Check systemd state on icinga1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[20:23:44] (PS1) CDanis: icinga contacts: handle [lack of] trailing newlines [puppet] - https://gerrit.wikimedia.org/r/552329
[20:25:00] (CR) CDanis: "00000e30 20 20 20 20 20 20 20 20 20 20 69 72 63 2d 64 63 | irc-dc|" [puppet] - https://gerrit.wikimedia.org/r/552329 (owner: CDanis)
[20:26:40] (CR) Volans: icinga contacts: handle [lack of] trailing newlines (1 comment) [puppet] - https://gerrit.wikimedia.org/r/552329 (owner: CDanis)
[20:27:38] (PS2) CDanis: icinga contacts: handle [lack of] trailing newlines [puppet] - https://gerrit.wikimedia.org/r/552329
[20:28:57] (CR) Volans: [C: +1] "LGTM, thanks" [puppet] - https://gerrit.wikimedia.org/r/552329 (owner: CDanis)
[20:30:57] (CR) CDanis: [C: +2] icinga contacts: handle [lack of] trailing newlines [puppet] - https://gerrit.wikimedia.org/r/552329 (owner: CDanis)
[20:35:02] RECOVERY - Check systemd state on icinga1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[20:36:12] RECOVERY - Check the last execution of sync_check_icinga_contacts on icinga1001 is OK: OK: Status of the systemd unit sync_check_icinga_contacts https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[20:46:57] (PS1) EBernhardson: Enable CirrusSearch log channel on beta cluster [mediawiki-config] - https://gerrit.wikimedia.org/r/552332 (https://phabricator.wikimedia.org/T237560)
[20:47:54] (CR) EBernhardson: [C: +2] Enable CirrusSearch log channel on beta cluster [mediawiki-config] - https://gerrit.wikimedia.org/r/552332 (https://phabricator.wikimedia.org/T237560) (owner: EBernhardson)
[20:48:36] (Merged) jenkins-bot: Enable CirrusSearch log channel on beta cluster [mediawiki-config] - https://gerrit.wikimedia.org/r/552332 (https://phabricator.wikimedia.org/T237560) (owner: EBernhardson)
[20:49:12] PROBLEM - 
Varnish traffic drop between 30min ago and now at codfw on icinga1001 is CRITICAL: 58.94 le 60 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1
[20:49:46] !log ganeti1003 - switching boot order of xhgui1001 to network and reinstalling with stretch (T238098)
[20:49:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:49:51] T238098: vm request for xhgui - https://phabricator.wikimedia.org/T238098
[20:50:56] PROBLEM - Check whether microcode mitigations for CPU vulnerabilities are applied on dumpsdata1002 is CRITICAL: CRITICAL - Server is missing the following CPU flags: {flush_l1d, ssbd, md_clear} https://wikitech.wikimedia.org/wiki/Microcode
[20:51:13] !log puppetmaster1001 - revoking puppet certs for xhgui1001/xhgui2001
[20:51:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:57:42] RECOVERY - Varnish traffic drop between 30min ago and now at codfw on icinga1001 is OK: (C)60 le (W)70 le 78.09 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1
[21:09:54] !log mholloway-shell@deploy1001 Started deploy [mobileapps/deploy@70154b4]: Update mobileapps to c140e88
[21:09:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:14:58] (PS1) BBlack: acme-chief: parallelize gdnsd-sync [puppet] - https://gerrit.wikimedia.org/r/552336 (https://phabricator.wikimedia.org/T98006)
[21:16:23] !log mholloway-shell@deploy1001 Finished deploy [mobileapps/deploy@70154b4]: Update mobileapps to c140e88 (duration: 06m 29s)
[21:16:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:17:00] (CR) jerkins-bot: [V: -1] acme-chief: parallelize gdnsd-sync [puppet] - https://gerrit.wikimedia.org/r/552336 
(https://phabricator.wikimedia.org/T98006) (owner: BBlack)
[21:24:17] (PS2) BBlack: acme-chief: parallelize gdnsd-sync [puppet] - https://gerrit.wikimedia.org/r/552336 (https://phabricator.wikimedia.org/T98006)
[21:24:57] (CR) Dzahn: [C: +2] "https://puppet-compiler.wmflabs.org/compiler1001/19540/xhgui1001.eqiad.wmnet/" [puppet] - https://gerrit.wikimedia.org/r/552124 (owner: Dzahn)
[21:25:19] (PS2) Dzahn: add xhgui::app role on xhgui VMs [puppet] - https://gerrit.wikimedia.org/r/552124 (https://phabricator.wikimedia.org/T238788)
[21:25:30] (PS3) Dzahn: add xhgui::app role on xhgui VMs [puppet] - https://gerrit.wikimedia.org/r/552124 (https://phabricator.wikimedia.org/T238788)
[21:28:05] (CR) Dzahn: [C: +2] add xhgui::app role on xhgui VMs [puppet] - https://gerrit.wikimedia.org/r/552124 (https://phabricator.wikimedia.org/T238788) (owner: Dzahn)
[21:29:38] !log mholloway-shell@deploy1001 Synchronized php-1.35.0-wmf.5/extensions/UploadWizard: Add Machine Vision CTA to final step (T234960) (duration: 00m 59s)
[21:29:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:29:48] T234960: Add call to action on final step of Upload Wizard - https://phabricator.wikimedia.org/T234960
[21:32:24] PROBLEM - High average GET latency for mw requests on appserver in eqiad on icinga1001 is CRITICAL: cluster=appserver code={200,204} handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=
[21:33:50] PROBLEM - PHP7 rendering on mw1262 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[21:34:02] PROBLEM - Nginx local proxy to apache on mw1270 is CRITICAL: CRITICAL - Socket timeout after 10 
seconds https://wikitech.wikimedia.org/wiki/Application_servers
[21:34:06] !log mholloway-shell@deploy1001 Scap failed!: 4/11 canaries failed their endpoint checks(http://en.wikipedia.org)
[21:34:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:34:12] PROBLEM - Apache HTTP on mw1274 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers
[21:34:20] PROBLEM - High average POST latency for mw requests on appserver in eqiad on icinga1001 is CRITICAL: cluster=appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=POST
[21:35:32] RECOVERY - PHP7 rendering on mw1262 is OK: HTTP OK: HTTP/1.1 200 OK - 77997 bytes in 9.681 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[21:35:42] RECOVERY - Nginx local proxy to apache on mw1270 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 629 bytes in 8.697 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[21:35:50] PROBLEM - Apache HTTP on mw1322 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers
[21:36:42] !log mholloway-shell@deploy1001 Scap failed!: 5/11 canaries failed their endpoint checks(http://en.wikipedia.org)
[21:36:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:37:04] RECOVERY - Apache HTTP on mw1322 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 628 bytes in 3.986 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[21:37:08] RECOVERY - Apache HTTP on mw1274 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 628 bytes in 7.824 second response time 
https://wikitech.wikimedia.org/wiki/Application_servers
[21:37:10] PROBLEM - Apache HTTP on mw1265 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers
[21:37:10] PROBLEM - PHP7 rendering on mw1265 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[21:38:44] RECOVERY - Apache HTTP on mw1265 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 626 bytes in 0.073 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[21:38:46] RECOVERY - PHP7 rendering on mw1265 is OK: HTTP OK: HTTP/1.1 200 OK - 77996 bytes in 0.243 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[21:39:25] mdholloway: everything okay?
[21:39:56] cdanis: i deployed a change, then immediately tried to revert, and revert SWAT is failing.
[21:40:09] soliciting help now...
[21:40:13] is it failing on the logstash check?
[21:40:20] PROBLEM - PHP7 rendering on mw1262 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[21:40:36] cdanis: Check 'Check endpoints for mw1261.eqiad.wmnet' failed: WARNING:urllib3.connectionpool:Retrying (Retry(total=2, connect=None, read=None, redirect=None)) after connection broken by 'ReadTimeoutError("HTTPConnectionPool(host='mw1261.eqiad.wmnet', port=80): Read timed out. 
(read timeout=5)",)': /w/api.php
[21:41:19] i would try the revert with --force
[21:41:56] RECOVERY - PHP7 rendering on mw1262 is OK: HTTP OK: HTTP/1.1 200 OK - 77997 bytes in 2.973 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[21:42:14] cdanis: doing now
[21:42:47] !log mholloway-shell@deploy1001 Synchronized php-1.35.0-wmf.5/extensions/UploadWizard: Revert "Add Machine Vision CTA to final step (T234960)", take 2 (duration: 00m 41s)
[21:42:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:42:52] T234960: Add call to action on final step of Upload Wizard - https://phabricator.wikimedia.org/T234960
[21:43:18] PROBLEM - PHP7 rendering on mw1261 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[21:44:44] RECOVERY - PHP7 rendering on mw1261 is OK: HTTP OK: HTTP/1.1 200 OK - 77997 bytes in 7.748 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[21:53:51] RECOVERY - High average POST latency for mw requests on appserver in eqiad on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=POST
[21:53:56] still a fairly large performance hit on the appservers for GET requests
[21:56:53] (PS1) RobH: adding new skus for tool review [software] - https://gerrit.wikimedia.org/r/552344
[21:57:04] i'm not sure what happened here. mdholloway did the last revert run finish successfully?
[21:57:17] cdanis: yes.
[21:57:59] I'm struggling to find a smoking gun in that patch...
[22:03:25] (CR) RobH: [C: +2] adding new skus for tool review [software] - https://gerrit.wikimedia.org/r/552344 (owner: RobH)
[22:07:17] (PS1) RobH: adding in new 2tb sku [software] - https://gerrit.wikimedia.org/r/552345
[22:07:23] (PS1) BBlack: authdns: refactor role/profile/hieradata bits [puppet] - https://gerrit.wikimedia.org/r/552346 (https://phabricator.wikimedia.org/T98006)
[22:07:53] (CR) RobH: [C: +2] adding in new 2tb sku [software] - https://gerrit.wikimedia.org/r/552345 (owner: RobH)
[22:11:00] !log mwscript importImages.php --wiki=commonswiki --overwrite --user=Bürgerentscheid . (T238764)
[22:11:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:11:06] T238764: Upload a large file from www.g9-jetzt-nrw.de to Wikimedia Commons - https://phabricator.wikimedia.org/T238764
[22:13:29] RECOVERY - High average GET latency for mw requests on appserver in eqiad on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET
[22:19:07] cdanis: looks like we're back to normal. thanks for the assist.
[22:33:37] (PS1) Andrew Bogott: wmf_sink: delete instance puppet config from git on instance deletion [puppet] - https://gerrit.wikimedia.org/r/552348 (https://phabricator.wikimedia.org/T238708)
[22:35:25] (CR) jerkins-bot: [V: -1] wmf_sink: delete instance puppet config from git on instance deletion [puppet] - https://gerrit.wikimedia.org/r/552348 (https://phabricator.wikimedia.org/T238708) (owner: Andrew Bogott)
[22:54:29] (PS1) EBernhardson: Remove CirrusSearchEnableSearchLogging [mediawiki-config] - https://gerrit.wikimedia.org/r/552350
[22:58:00] (CR) EBernhardson: "still waiting on train deploy for If77dd1ebd1fa257a74a65f6578c94f70828422e9" [mediawiki-config] - https://gerrit.wikimedia.org/r/548750 (https://phabricator.wikimedia.org/T237363) (owner: DCausse)
[22:59:58] (PS1) Dzahn: xhgui: install unpuppetized apache php module [puppet] - https://gerrit.wikimedia.org/r/552351 (https://phabricator.wikimedia.org/T238788)
[23:00:04] MaxSem, RoanKattouw, Niharika, and Urbanecm: Your horoscope predicts another unfortunate Evening SWAT(Max 6 patches) deploy. May Zuul be (nice) with you. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20191121T2300).
[23:00:04] No GERRIT patches in the queue for this window AFAICS.
[23:00:08] (PS1) EBernhardson: [cirrus] Explicitly choose the search clusters to write to in beta cluster [mediawiki-config] - https://gerrit.wikimedia.org/r/552352
[23:00:15] I will SWAT a config patch but I'm still working on it
[23:00:45] (PS2) EBernhardson: Remove CirrusSearchEnableSearchLogging [mediawiki-config] - https://gerrit.wikimedia.org/r/552350 (https://phabricator.wikimedia.org/T238802)
[23:01:07] RoanKattouw: i have two as well :) Shipping now
[23:01:24] (CR) EBernhardson: [C: +2] "SWAT" [mediawiki-config] - https://gerrit.wikimedia.org/r/552350 (https://phabricator.wikimedia.org/T238802) (owner: EBernhardson)
[23:01:26] (PS1) Catrope: GrowthExperiments: Move newcomer tasks JSON config from mw.org to local wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/552353 (https://phabricator.wikimedia.org/T237301)
[23:02:12] (Merged) jenkins-bot: Remove CirrusSearchEnableSearchLogging [mediawiki-config] - https://gerrit.wikimedia.org/r/552350 (https://phabricator.wikimedia.org/T238802) (owner: EBernhardson)
[23:02:27] (PS2) EBernhardson: [cirrus] Explicitly choose the search clusters to write to in beta cluster [mediawiki-config] - https://gerrit.wikimedia.org/r/552352 (https://phabricator.wikimedia.org/T237560)
[23:02:34] (CR) EBernhardson: [C: +2] "SWAT" [mediawiki-config] - https://gerrit.wikimedia.org/r/552352 (https://phabricator.wikimedia.org/T237560) (owner: EBernhardson)
[23:04:02] (PS1) EBernhardson: Remove wgCirrusSearchEnableSearchLogging from beta cluster aswell [mediawiki-config] - https://gerrit.wikimedia.org/r/552354
[23:04:14] (CR) EBernhardson: [C: +2] "SWAT" [mediawiki-config] - https://gerrit.wikimedia.org/r/552354 (owner: EBernhardson)
[23:04:29] ok that should be all my patches submitted, will deploy in a sec when merge
[23:06:21] (PS2) Dzahn: xhgui: install unpuppetized apache php module [puppet] - https://gerrit.wikimedia.org/r/552351 
(https://phabricator.wikimedia.org/T238788)
[23:06:28] (PS3) EBernhardson: [cirrus] Explicitly choose the search clusters to write to in beta cluster [mediawiki-config] - https://gerrit.wikimedia.org/r/552352 (https://phabricator.wikimedia.org/T237560)
[23:06:30] (CR) EBernhardson: [C: +2] [cirrus] Explicitly choose the search clusters to write to in beta cluster [mediawiki-config] - https://gerrit.wikimedia.org/r/552352 (https://phabricator.wikimedia.org/T237560) (owner: EBernhardson)
[23:07:01] (PS2) EBernhardson: Remove wgCirrusSearchEnableSearchLogging from beta cluster aswell [mediawiki-config] - https://gerrit.wikimedia.org/r/552354
[23:07:08] (CR) EBernhardson: [C: +2] Remove wgCirrusSearchEnableSearchLogging from beta cluster aswell [mediawiki-config] - https://gerrit.wikimedia.org/r/552354 (owner: EBernhardson)
[23:07:15] (Merged) jenkins-bot: [cirrus] Explicitly choose the search clusters to write to in beta cluster [mediawiki-config] - https://gerrit.wikimedia.org/r/552352 (https://phabricator.wikimedia.org/T237560) (owner: EBernhardson)
[23:07:57] (Merged) jenkins-bot: Remove wgCirrusSearchEnableSearchLogging from beta cluster aswell [mediawiki-config] - https://gerrit.wikimedia.org/r/552354 (owner: EBernhardson)
[23:08:10] ebernhardson: Let me know when you're done
[23:08:32] (CR) Gergő Tisza: [C: -1] "Per https://phabricator.wikimedia.org/T237301#5683029" [mediawiki-config] - https://gerrit.wikimedia.org/r/552353 (https://phabricator.wikimedia.org/T237301) (owner: Catrope)
[23:09:54] !log ebernhardson@deploy1001 Synchronized wmf-config/InitialiseSettings.php: Remove unused CirrusSearch config variable (duration: 00m 52s)
[23:09:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:10:02] RoanKattouw: all done
[23:10:30] (PS3) Dzahn: xhgui: install unpuppetized apache php module [puppet] - https://gerrit.wikimedia.org/r/552351 
(https://phabricator.wikimedia.org/T238788)
[23:10:32] (PS2) Catrope: GrowthExperiments: Move newcomer tasks JSON config from mw.org to local wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/552353 (https://phabricator.wikimedia.org/T237301)
[23:14:00] (CR) Dzahn: [C: +2] "https://puppet-compiler.wmflabs.org/compiler1001/19541/" [puppet] - https://gerrit.wikimedia.org/r/552351 (https://phabricator.wikimedia.org/T238788) (owner: Dzahn)
[23:16:08] (PS3) Catrope: GrowthExperiments: Move newcomer tasks JSON config from mw.org to local wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/552353 (https://phabricator.wikimedia.org/T237301)
[23:16:20] (CR) Catrope: [C: +2] GrowthExperiments: Move newcomer tasks JSON config from mw.org to local wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/552353 (https://phabricator.wikimedia.org/T237301) (owner: Catrope)
[23:17:08] (Merged) jenkins-bot: GrowthExperiments: Move newcomer tasks JSON config from mw.org to local wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/552353 (https://phabricator.wikimedia.org/T237301) (owner: Catrope)
[23:35:26] (PS1) Dzahn: webperf: switch xhgui_host from tungsten to xhgui1001 [puppet] - https://gerrit.wikimedia.org/r/552357 (https://phabricator.wikimedia.org/T158837)
[23:35:50] (CR) jerkins-bot: [V: -1] webperf: switch xhgui_host from tungsten to xhgui1001 [puppet] - https://gerrit.wikimedia.org/r/552357 (https://phabricator.wikimedia.org/T158837) (owner: Dzahn)
[23:36:52] (PS2) Dzahn: webperf: switch xhgui_host from tungsten to xhgui1001 [puppet] - https://gerrit.wikimedia.org/r/552357 (https://phabricator.wikimedia.org/T158837)
[23:47:05] (PS1) DannyS712: Remove `wgImportSources` settings for closed wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/552361 (https://phabricator.wikimedia.org/T231178)
[23:48:51] (PS2) DannyS712: Remove `wgImportSources` settings for closed wikis 
[mediawiki-config] - https://gerrit.wikimedia.org/r/552361 (https://phabricator.wikimedia.org/T231178)
[23:49:38] (PS1) Dzahn: xhgui: rsync mongodb data from tungsten to xhgui1001 [puppet] - https://gerrit.wikimedia.org/r/552362 (https://phabricator.wikimedia.org/T158837)
[23:59:58] (CR) Dzahn: [C: +2] "This sets up rsyncd on tungsten so that xhgui1001 can pull from it." [puppet] - https://gerrit.wikimedia.org/r/552362 (https://phabricator.wikimedia.org/T158837) (owner: Dzahn)