[00:00:05] twentyafterfour: #bothumor Q:How do functions break up? A:They stop calling each other. Rise for Phabricator update deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200326T0000). [00:01:20] DannyS712: I would've thought so... [00:01:34] it's like the config didn't sync... ? [00:01:38] I think, like for config, it needs a second touch? [00:01:41] RECOVERY - Disk space on netflow2001 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=netflow2001&var-datasource=codfw+prometheus/ops [00:01:46] The config synced, since the right was added to the global group [00:02:07] [23:42:56] !log catrope@deploy1001 Synchronized php-1.35.0-wmf.25/extensions/CheckUser/: Retry because mw1251 timed out, and it is a proxy (duration: 03m 15s) [00:02:16] [23:49:21] !log catrope@deploy1001 Synchronized wmf-config/CommonSettings.php: Add investigate to $wgAvailableRights (T247645) (duration: 03m 16s) [00:02:17] T247645: CU 2.0: Enable Special:Investigate on testwiki [small] - https://phabricator.wikimedia.org/T247645 [00:02:51] Oh crap I forgot the InitialiseSettings patch [00:02:53] Good catch [00:02:55] heh [00:04:08] but also https://test.wikipedia.org/wiki/Special:Version - the version of CheckUser being run on testwiki hasn't changed (i.e. 
is from before the new right) [00:04:15] Operations, DNS, Technical blog, Traffic, and 2 others: Setup DNS to direct techblog.wikimedia.org to new Wordpress VIP hosting - https://phabricator.wikimedia.org/T246507 (bd808) Open→Resolved [00:05:29] DannyS712: Yeah, that's a known bug with git cache [00:05:58] but I purged my cache and everything ;) [00:06:08] Your cache isn't server side [00:06:37] !log catrope@deploy1001 Synchronized wmf-config/InitialiseSettings.php: Enable Special:Investigate on testwiki (T247645) (duration: 03m 14s) [00:06:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:06:58] yay, permission error at https://test.wikipedia.org/wiki/Special:Investigate [00:07:06] except the messages didn't sync? [00:07:15] full scap wasn't run [00:07:21] there it is! https://test.wikipedia.org/wiki/Special:Investigate [00:07:24] [23:17:28] So it'll be a broken message on Special:UserGroupRights and if anyone tries to use the page without the right [00:07:28] Welcome to 50 minutes ago DannyS712 ;p [00:08:06] oh well; staff has investigate rights now, so enjoy testing it [00:08:20] I'll have to install checkuser locally to try it out [00:08:43] Yeah, if we want we can run a full scap later to fix the messages [00:08:49] Or we can leave it be until the next train runs [00:09:17] it works perfectly for my staff account, personal account throws permission error as expected. [00:09:51] thanks everyone! [00:11:53] RECOVERY - Work requests waiting in Zuul Gearman server on contint1001 is OK: OK: Less than 100.00% above the threshold [90.0] https://www.mediawiki.org/wiki/Continuous_integration/Zuul https://grafana.wikimedia.org/dashboard/db/zuul-gearman?panelId=10&fullscreen&orgId=1 [00:13:29] Thanks RoanKattouw, DannyS712 and davidwbarratt. :) [00:13:38] and Reedy :) [00:16:46] And Reedy of course! :) [00:40:56] >DannyS712 removed a project: Patch-For-Review. [00:41:01] DannyS712: You know we have a bot for this, right? 
;) [00:47:24] yes [01:00:43] Operations, SRE-Access-Requests: Requesting access to analytics for ItamarWMDE - https://phabricator.wikimedia.org/T248482 (KFrancis) a:KFrancis >>! In T248482#5998845, @Volans wrote: > Looping @KFrancis to verify that we have a valid NDA on file. I can see the line in the related spreadsheet but the... [01:34:45] Operations, SRE-Access-Requests: Requesting access to analytics for tarrow - https://phabricator.wikimedia.org/T248498 (KFrancis) @Volans - I'm confirming we have a valid NDA for TArrow. Thanks! [04:32:39] Operations, SRE-Access-Requests: Request Netbox access for user "dubosv10" - https://phabricator.wikimedia.org/T248445 (DubOSv10) Currently a graduate student at University (University of Arizona), deploying Netbox for a research project and wanted to see an example of how a large, disparate would be or... [04:32:56] Operations, SRE-Access-Requests: Request Netbox access for user "dubosv10" - https://phabricator.wikimedia.org/T248445 (DubOSv10) Stalled→Open [05:37:49] (PS1) Thcipriani: CI: add James_F as contint_root [puppet] - https://gerrit.wikimedia.org/r/583512 [06:00:28] (PS1) Marostegui: Revert "install_server: Allow reimage db2115" [puppet] - https://gerrit.wikimedia.org/r/583514 [06:00:37] (PS2) Marostegui: Revert "install_server: Allow reimage db2115" [puppet] - https://gerrit.wikimedia.org/r/583514 [06:02:22] (CR) Marostegui: [C: +2] Revert "install_server: Allow reimage db2115" [puppet] - https://gerrit.wikimedia.org/r/583514 (owner: Marostegui) [06:03:51] hello all [06:03:51] I'm getting some weird errors (and people telling me different results from mine) from this api: https://en.wikipedia.org/api/rest_v1/page/media/C gives me a 200 but [06:03:51] https://en.wikipedia.org/api/rest_v1/page/media/Cat gives me a 404 [06:03:51] who should I talk to about this? 
[06:03:51] wondering if this is something being deployed now [06:16:25] PROBLEM - BGP status on cr1-codfw is CRITICAL: BGP CRITICAL - AS1299/IPv4: Connect - Telia, AS1299/IPv6: Active - Telia https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [06:22:28] !log Rename nova and nova_api tables on db1117:3325 - T248313 [06:22:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:22:35] T248313: Drop nova and nova_api databases from m5 - https://phabricator.wikimedia.org/T248313 [06:24:43] RECOVERY - BGP status on cr1-codfw is OK: BGP OK - up: 31, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [06:26:31] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1096:3316 for schema change', diff saved to https://phabricator.wikimedia.org/P10772 and previous config saved to /var/cache/conftool/dbconfig/20200326-062631-marostegui.json [06:26:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:26:59] !log Deploy schema change on db1096:3316 [06:27:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:36:34] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repool db1096:3316 after schema change', diff saved to https://phabricator.wikimedia.org/P10773 and previous config saved to /var/cache/conftool/dbconfig/20200326-063633-marostegui.json [06:36:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:38:45] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1098:3316 for schema change', diff saved to https://phabricator.wikimedia.org/P10774 and previous config saved to /var/cache/conftool/dbconfig/20200326-063844-marostegui.json [06:38:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:39:33] !log Deploy schema change on db1098:3316 [06:39:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:41:39] PROBLEM - mediawiki originals 
uploads -hourly- for codfw on icinga1001 is CRITICAL: account=mw-media class=originals cluster=swift instance=ms-fe2005:9112 job=statsd_exporter site=codfw https://wikitech.wikimedia.org/wiki/Swift/How_To https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?panelId=26&fullscreen&orgId=1&var-DC=codfw [06:42:23] PROBLEM - mediawiki originals uploads -hourly- for eqiad on icinga1001 is CRITICAL: account=mw-media class=originals cluster=swift instance=ms-fe1005:9112 job=statsd_exporter site=eqiad https://wikitech.wikimedia.org/wiki/Swift/How_To https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?panelId=26&fullscreen&orgId=1&var-DC=eqiad [06:46:49] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repool db1098:3316 after schema change', diff saved to https://phabricator.wikimedia.org/P10775 and previous config saved to /var/cache/conftool/dbconfig/20200326-064648-marostegui.json [06:46:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:47:48] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1088 for schema change', diff saved to https://phabricator.wikimedia.org/P10776 and previous config saved to /var/cache/conftool/dbconfig/20200326-064748-marostegui.json [06:47:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:48:47] v!log Deploy schema change on db1088 [06:48:53] !log Deploy schema change on db1088 [06:48:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:58:14] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repool db1088 after schema change', diff saved to https://phabricator.wikimedia.org/P10777 and previous config saved to /var/cache/conftool/dbconfig/20200326-065814-marostegui.json [06:58:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:59:29] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1093 for schema change', diff saved to https://phabricator.wikimedia.org/P10778 and previous config saved to 
/var/cache/conftool/dbconfig/20200326-065929-marostegui.json [06:59:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:59:49] !log Deploy schema change on db1093 [06:59:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:07:47] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repool db1093 after schema change', diff saved to https://phabricator.wikimedia.org/P10779 and previous config saved to /var/cache/conftool/dbconfig/20200326-070746-marostegui.json [07:07:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:18:10] (CR) Muehlenhoff: [C: +2] Extend Cumin alias for logstash with ELK7 roles [puppet] - https://gerrit.wikimedia.org/r/583386 (owner: Muehlenhoff) [07:20:24] (PS2) Muehlenhoff: Use builder role for deneb [puppet] - https://gerrit.wikimedia.org/r/583373 [07:29:39] (CR) Muehlenhoff: [C: +2] Use builder role for deneb [puppet] - https://gerrit.wikimedia.org/r/583373 (owner: Muehlenhoff) [07:30:49] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1085 for schema change', diff saved to https://phabricator.wikimedia.org/P10780 and previous config saved to /var/cache/conftool/dbconfig/20200326-073048-marostegui.json [07:30:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:31:00] !log Deploy schema change on db1085, lag will appear on s6 on labs [07:31:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:35:00] ACKNOWLEDGEMENT - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=idp site=eqiad Muehlenhoff Known/harmless impact, patches being in progress to address this https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [07:40:33] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repool db1085 after schema change', diff saved to https://phabricator.wikimedia.org/P10781 
and previous config saved to /var/cache/conftool/dbconfig/20200326-074033-marostegui.json [07:40:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:44:32] Operations, Patch-For-Review, Security: envoyproxy: CVE-2020-8664 CVE-2020-8661 CVE-2020-8660 CVE-2020-8659 - https://phabricator.wikimedia.org/T246868 (ayounsi) >>! In T246868#5999850, @Stashbot wrote: > {nav icon=file, name=Mentioned in SAL (#wikimedia-operations), href=https://tools.wmflabs.org/sa... [07:50:06] Operations, ops-eqiad, DC-Ops: (Need by: 2020-03-01) rack/setup/install htmldumper1001.eqiad.wmnet. - https://phabricator.wikimedia.org/T245567 (ayounsi) I don't think it got puppetized properly, https://netbox.wikimedia.org/extras/reports/puppetdb.PhysicalHosts/ alerts as `missing physical device in... [07:58:44] !log remove BGP session to AS8001 in eqiad (down and not replying to email) [07:58:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:08:44] brb [08:10:26] (PS1) Giuseppe Lavagetto: mw: remove decommissioned servers from the scap,mcrouter proxies [puppet] - https://gerrit.wikimedia.org/r/583558 (https://phabricator.wikimedia.org/T248501) [08:11:24] (PS2) Giuseppe Lavagetto: mw: remove decommissioned servers from the scap,mcrouter proxies [puppet] - https://gerrit.wikimedia.org/r/583558 (https://phabricator.wikimedia.org/T248501) [08:11:25] (CR) jerkins-bot: [V: -1] mw: remove decommissioned servers from the scap,mcrouter proxies [puppet] - https://gerrit.wikimedia.org/r/583558 (https://phabricator.wikimedia.org/T248501) (owner: Giuseppe Lavagetto) [08:13:58] (PS1) Ayounsi: Add more precise notes_link for mediawiki originals uploads alerts [puppet] - https://gerrit.wikimedia.org/r/583559 [08:20:21] (CR) Giuseppe Lavagetto: [C: +2] mw: remove decommissioned servers from the scap,mcrouter proxies [puppet] - https://gerrit.wikimedia.org/r/583558 
(https://phabricator.wikimedia.org/T248501) (owner: Giuseppe Lavagetto) [08:23:13] jouncebot: next [08:23:13] In 2 hour(s) and 36 minute(s): European Mid-day SWAT(Max 6 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200326T1100) [08:27:41] !log troubleshot v6 conditional advertisement from cr3-knams - T236785 [08:27:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:27:47] T236785: Configure conditional advertising in eqdfw and knams - https://phabricator.wikimedia.org/T236785 [08:44:41] !log Deploy schema change on s5 codfw, lag will show up on codfw - T248333 [08:44:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:44:47] T248333: Schema change: Make page.page_restrictions column NULL - https://phabricator.wikimedia.org/T248333 [08:51:43] (CR) Filippo Giunchedi: [C: +1] "Thanks!" [puppet] - https://gerrit.wikimedia.org/r/583559 (owner: Ayounsi) [08:54:31] (CR) Filippo Giunchedi: [C: +2] Add more precise notes_link for mediawiki originals uploads alerts [puppet] - https://gerrit.wikimedia.org/r/583559 (owner: Ayounsi) [08:58:21] Operations, netops: Configure conditional advertising in eqdfw and knams - https://phabricator.wikimedia.org/T236785 (ayounsi) Mystery solved. The reason for the route to not be accepted was: > Inactive reason: Unusable path This was due to: `rib inet6.0 aggregate route 2620:0:862:ed1a::/64` Causing th... [09:00:52] !log push v4 conditional advertising on cr3-knams - T236785 [09:00:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:00:58] T236785: Configure conditional advertising in eqdfw and knams - https://phabricator.wikimedia.org/T236785 [09:09:44] Operations, netops: Configure conditional advertising in eqdfw and knams - https://phabricator.wikimedia.org/T236785 (ayounsi) Open→Resolved All done! 
[09:19:32] Operations, MediaWiki-General, serviceops, Service-Architecture: Use envoy for TLS termination on the appservers - https://phabricator.wikimedia.org/T247389 (Joe) p:Triage→Medium [09:19:49] (PS1) Giuseppe Lavagetto: mediawiki: move debug servers to use envoy for TLS termination [puppet] - https://gerrit.wikimedia.org/r/583563 (https://phabricator.wikimedia.org/T247389) [09:23:06] Operations, MediaWiki-Debug-Logger, Traffic, Developer Productivity: noc.wikimedia.org doesn't route to the docroot when WikimediaDebug browser extension is live - https://phabricator.wikimedia.org/T245552 (ema) >>! In T245552#5895210, @Krinkle wrote: > This has regressed last month as well and w... [09:23:47] Operations, SRE-Access-Requests: Requesting access to analytics for ItamarWMDE - https://phabricator.wikimedia.org/T248482 (Addshore) [09:35:17] (CR) Giuseppe Lavagetto: [C: +2] "https://puppet-compiler.wmflabs.org/compiler1003/21579/mwdebug1001.eqiad.wmnet/" [puppet] - https://gerrit.wikimedia.org/r/583563 (https://phabricator.wikimedia.org/T247389) (owner: Giuseppe Lavagetto) [09:38:22] <_joe_> any alert from mwdebug2002 is me [09:40:00] Operations, LDAP-Access-Requests: Add Scardenasmolinar to WMF LDAP group - https://phabricator.wikimedia.org/T248521 (Volans) Resolved→Open Re-opening as I forgot also for `wmf` only group we need to add it to Puppet, doing it now. [09:50:33] !log reboot stat1008 - gpu + drivers in a weird state after multiple tests [09:50:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:55:35] PROBLEM - Host stat1008 is DOWN: PING CRITICAL - Packet loss = 100% [09:56:45] (CR) Ema: profile::tlsproxy::envoy: allow users to override the cluster addr (2 comments) [puppet] - https://gerrit.wikimedia.org/r/583367 (owner: Jbond) [09:57:33] * RhinosF1 here for SWAT. I've got a patch. 
[10:01:35] (PS1) Volans: admin: add suecarmol as LDAP only user [puppet] - https://gerrit.wikimedia.org/r/583567 (https://phabricator.wikimedia.org/T248521) [10:03:20] Operations, SRE-Access-Requests, Patch-For-Review: Requesting access to event logging data in hive for joewalsh - https://phabricator.wikimedia.org/T247636 (Volans) @Nuria any update on this request? [10:04:34] Operations, SRE-Access-Requests: Requesting access to analytics for ItamarWMDE - https://phabricator.wikimedia.org/T248482 (Volans) [10:05:01] (CR) Ema: [C: -1] "I wonder if we should be explicit and set skip_xff_append to false too? What's the default?" (2 comments) [puppet] - https://gerrit.wikimedia.org/r/583366 (owner: Jbond) [10:05:17] ah stat1008 didn't come up [10:05:17] sigh [10:05:19] checking [10:05:31] Operations, SRE-Access-Requests: Requesting access to analytics for ItamarWMDE - https://phabricator.wikimedia.org/T248482 (Volans) a:KFrancis→Nuria @Nuria task description was updated with more details on the reason for access. Over to you for approval. [10:07:09] Operations, SRE-Access-Requests: Requesting access to analytics for tarrow - https://phabricator.wikimedia.org/T248498 (Volans) [10:08:22] (CR) Giuseppe Lavagetto: [C: -1] "I have serious doubts about this patch." [puppet] - https://gerrit.wikimedia.org/r/583367 (owner: Jbond) [10:08:39] Operations, SRE-Access-Requests: Requesting access to analytics for tarrow - https://phabricator.wikimedia.org/T248498 (Tarrow) @WMDE-leszek Any chance you can take a look and preemptively give the WMDE 👍 ? [10:10:02] Operations, SRE-Access-Requests: Requesting access to analytics for tarrow - https://phabricator.wikimedia.org/T248498 (Volans) a:Tarrow @Tarrow similar to T248482, could you please elaborate a bit more on the "Reason for access" in the task description regarding how the data you need to access relat... 
[10:10:38] (PS9) Ayounsi: Initial templating for CR routing-options [homer/public] - https://gerrit.wikimedia.org/r/547587 [10:10:41] RECOVERY - Host stat1008 is UP: PING OK - Packet loss = 0%, RTA = 0.25 ms [10:10:55] Operations, Analytics, Product-Analytics, SRE-Access-Requests: Hive access - https://phabricator.wikimedia.org/T248097 (Volans) a:spatton @spatton: gentle reminder for the above request. [10:10:57] (CR) jerkins-bot: [V: -1] Initial templating for CR routing-options [homer/public] - https://gerrit.wikimedia.org/r/547587 (owner: Ayounsi) [10:13:24] (PS3) Ayounsi: esams/knams: advertise 185.15.58.0/23 instead of 185.15.56.0/22 [homer/public] - https://gerrit.wikimedia.org/r/564564 (https://phabricator.wikimedia.org/T207753) [10:15:34] (CR) Giuseppe Lavagetto: [C: +1] "The change seems correct." [puppet] - https://gerrit.wikimedia.org/r/582048 (https://phabricator.wikimedia.org/T248169) (owner: Jbond) [10:16:02] !log esams/knams: advertise 185.15.58.0/23 instead of 185.15.56.0/22 - T207753 [10:16:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:16:08] T207753: esams/knams: advertise 185.15.58.0/23 instead of 185.15.56.0/22 - https://phabricator.wikimedia.org/T207753 [10:19:36] Operations, SRE-Access-Requests: Requesting access to analytics for tarrow - https://phabricator.wikimedia.org/T248498 (Tarrow) [10:20:19] (PS1) Arturo Borrero Gonzalez: openstack: queens: apt pin systemd harder [puppet] - https://gerrit.wikimedia.org/r/583569 (https://phabricator.wikimedia.org/T247013) [10:20:28] Operations, LDAP-Access-Requests, Patch-For-Review: Add Scardenasmolinar to WMF LDAP group - https://phabricator.wikimedia.org/T248521 (Volans) @Scardenasmolinar: could you also please link your Phabricator account to your official WMF meta account on wiki? See for example my profile on Phabricator u... 
[10:22:10] Operations, SRE-Access-Requests: Requesting access to analytics for tarrow - https://phabricator.wikimedia.org/T248498 (Tarrow) @Volans Thanks for the awesomely quick response! I've added more details of the sort of tasks I'm expecting to undertake [10:22:36] Operations, SRE-Access-Requests: Requesting access to analytics for tarrow - https://phabricator.wikimedia.org/T248498 (Volans) a:Tarrow→Nuria @Tarrow thanks for the detailed explanation. @Nuria over to you for the WMF side approval. [10:23:59] (CR) Gehel: [C: +2] wdqs: added monitoring to data-reload cookbook [cookbooks] - https://gerrit.wikimedia.org/r/582784 (owner: Gehel) [10:24:23] (CR) Ayounsi: [C: +2] esams/knams: advertise 185.15.58.0/23 instead of 185.15.56.0/22 [homer/public] - https://gerrit.wikimedia.org/r/564564 (https://phabricator.wikimedia.org/T207753) (owner: Ayounsi) [10:24:49] Operations, Wikimedia-Mailing-lists: Request for creation: les sans pagEs Mailing List - https://phabricator.wikimedia.org/T245066 (Volans) @Anthere: gentle reminder for the above request if this request is still valid. [10:25:10] (Merged) jenkins-bot: esams/knams: advertise 185.15.58.0/23 instead of 185.15.56.0/22 [homer/public] - https://gerrit.wikimedia.org/r/564564 (https://phabricator.wikimedia.org/T207753) (owner: Ayounsi) [10:25:36] Operations, Wikimedia-Mailing-lists: Request for new mailing list Deutschschweiz - https://phabricator.wikimedia.org/T247737 (Volans) @Lantus: gentle reminder for the pending acknowledge that everything is working as expected. [10:26:40] Operations, netops, Patch-For-Review: esams/knams: advertise 185.15.58.0/23 instead of 185.15.56.0/22 - https://phabricator.wikimedia.org/T207753 (ayounsi) Open→Resolved Got the OK during a previous meeting. 
Done and verified: ROAs are good /23 is advertised as expected WMCS /24s are reachable [10:30:43] (PS2) DCausse: [cirrus] force cloudelastic replica count to 1 [mediawiki-config] - https://gerrit.wikimedia.org/r/583106 (https://phabricator.wikimedia.org/T231517) [10:34:04] Operations, SRE-Access-Requests: Requesting access to analytics for tarrow - https://phabricator.wikimedia.org/T248498 (WMDE-leszek) I hereby approve this request from WMDE side. [10:34:23] Operations, SRE-Access-Requests: Requesting access to analytics for ItamarWMDE - https://phabricator.wikimedia.org/T248482 (WMDE-leszek) I approve this request from WMDE side. [10:35:12] (PS3) Jbond: envoy: introduce use_remote_address parameter [puppet] - https://gerrit.wikimedia.org/r/583366 [10:35:31] (PS4) Jbond: envoy: introduce use_remote_address parameter [puppet] - https://gerrit.wikimedia.org/r/583366 [10:35:36] (CR) jerkins-bot: [V: -1] envoy: introduce use_remote_address parameter [puppet] - https://gerrit.wikimedia.org/r/583366 (owner: Jbond) [10:36:06] (CR) jerkins-bot: [V: -1] envoy: introduce use_remote_address parameter [puppet] - https://gerrit.wikimedia.org/r/583366 (owner: Jbond) [10:36:53] (CR) Muehlenhoff: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/583569 (https://phabricator.wikimedia.org/T247013) (owner: Arturo Borrero Gonzalez) [10:38:48] (CR) Muehlenhoff: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/583567 (https://phabricator.wikimedia.org/T248521) (owner: Volans) [10:40:22] Operations, netops, Wikimedia-Incident: Add linecard diversity to the router-to-router interconnect in codfw - https://phabricator.wikimedia.org/T248506 (ayounsi) As data point, FPC0 got purchased on 2014 and FPC5 in 2013 so it's also time to replace them. 
[10:40:53] (CR) Volans: [C: +2] admin: add suecarmol as LDAP only user [puppet] - https://gerrit.wikimedia.org/r/583567 (https://phabricator.wikimedia.org/T248521) (owner: Volans) [10:40:56] (PS1) Ema: ATS: remove debug HTTP headers if X-Wikimedia-Debug is absent [puppet] - https://gerrit.wikimedia.org/r/583570 (https://phabricator.wikimedia.org/T210484) [10:44:33] (PS2) Ema: ATS: remove debug HTTP headers if X-Wikimedia-Debug is absent [puppet] - https://gerrit.wikimedia.org/r/583570 (https://phabricator.wikimedia.org/T210484) [10:52:21] (CR) Giuseppe Lavagetto: "Overall seems good, but I would like to retain the Server: header there." (1 comment) [puppet] - https://gerrit.wikimedia.org/r/583570 (https://phabricator.wikimedia.org/T210484) (owner: Ema) [10:53:30] jouncebot: next [10:53:30] In 0 hour(s) and 6 minute(s): European Mid-day SWAT(Max 6 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200326T1100) [10:54:25] !log hnowlan@deploy1001 helmfile [EQIAD] Ran 'sync' command on namespace 'changeprop' for release 'production' . [10:54:25] (PS3) RhinosF1: Removed expired throttle.php entries. [mediawiki-config] - https://gerrit.wikimedia.org/r/583325 [10:54:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:54:42] (CR) RhinosF1: [C: +1] "Ready for SWAT!" [mediawiki-config] - https://gerrit.wikimedia.org/r/583325 (owner: RhinosF1) [10:55:21] (PS5) Jbond: envoy: introduce use_remote_address parameter [puppet] - https://gerrit.wikimedia.org/r/583366 [10:56:14] (PS6) Jbond: envoy: introduce use_remote_address parameter [puppet] - https://gerrit.wikimedia.org/r/583366 [10:58:27] !log hnowlan@deploy1001 helmfile [EQIAD] Ran 'sync' command on namespace 'changeprop' for release 'production' . 
[10:58:29] (CR) Jbond: "> Patch Set 2: Code-Review-1" (2 comments) [puppet] - https://gerrit.wikimedia.org/r/583366 (owner: Jbond) [10:58:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:58:41] (PS4) Jbond: profile::tlsproxy::envoy: allow users to override the cluster addr [puppet] - https://gerrit.wikimedia.org/r/583367 [10:59:04] (CR) jerkins-bot: [V: -1] envoy: introduce use_remote_address parameter [puppet] - https://gerrit.wikimedia.org/r/583366 (owner: Jbond) [11:00:04] Amir1, Lucas_WMDE, awight, and Urbanecm: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) European Mid-day SWAT(Max 6 patches) deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200326T1100). [11:00:04] RhinosF1, kart_, dcausse, and awight: A patch you scheduled for European Mid-day SWAT(Max 6 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [11:00:23] o/ [11:00:35] * RhinosF1 here [11:00:45] it was only me this morning! [11:00:53] !log hnowlan@deploy1001 helmfile [EQIAD] Ran 'sync' command on namespace 'changeprop' for release 'production' . [11:00:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:01:05] * kart_ is here [11:01:10] or I didn't read [11:01:15] Anyone to SWAT? [11:01:26] @RhinosF1 Always refresh page :) [11:01:28] (PS1) Dzahn: decom mw1254 through mw1258, remaining rack D5 appservers [puppet] - https://gerrit.wikimedia.org/r/583575 (https://phabricator.wikimedia.org/T247780) [11:01:40] I can SWAT today! [11:01:41] kart_: would normally help [11:01:46] (PS2) Dzahn: decom mw1254 through mw1258, remaining rack D5 appservers [puppet] - https://gerrit.wikimedia.org/r/583575 (https://phabricator.wikimedia.org/T247780) [11:01:46] Urbanecm: thanks! 
[11:02:04] Urbanecm: mine is just removing old throttles so a simple sync file straight into prod pls [11:02:04] (PS3) Ema: ATS: remove debug HTTP headers if X-Wikimedia-Debug is absent [puppet] - https://gerrit.wikimedia.org/r/583570 (https://phabricator.wikimedia.org/T210484) [11:02:14] (CR) jerkins-bot: [V: -1] profile::tlsproxy::envoy: allow users to override the cluster addr [puppet] - https://gerrit.wikimedia.org/r/583367 (owner: Jbond) [11:02:19] kart_: For yours, phan doesn't run on JS... So forcing it, while not ideal, it's not a major issue [11:02:41] (CR) Urbanecm: [C: +2] "This change is ready for review." [mediawiki-config] - https://gerrit.wikimedia.org/r/583325 (owner: RhinosF1) [11:02:43] (CR) Ema: ATS: remove debug HTTP headers if X-Wikimedia-Debug is absent (1 comment) [puppet] - https://gerrit.wikimedia.org/r/583570 (https://phabricator.wikimedia.org/T210484) (owner: Ema) [11:02:55] Reedy: thanks! [11:03:45] (Merged) jenkins-bot: Removed expired throttle.php entries. [mediawiki-config] - https://gerrit.wikimedia.org/r/583325 (owner: RhinosF1) [11:04:03] _joe_: should i do it? the 5 appservers in D5 ? [11:04:03] D5: Ok so I hacked up ssh.py to use mozprocess - https://phabricator.wikimedia.org/D5 [11:04:19] Urbanecm: merged and ready for sync [11:04:26] I see that :) [11:04:28] but thanks [11:04:58] RhinosF1: You really don't need to keep pointing out the obvious to people. 
You're creating unnecessary noise/pings to people [11:05:40] (CR) Ema: "> Also: we should check we don't use any of those headers in" [puppet] - https://gerrit.wikimedia.org/r/583570 (https://phabricator.wikimedia.org/T210484) (owner: Ema) [11:05:46] <_joe_> mutante: not now, no [11:05:50] _joe_: ok [11:05:55] <_joe_> mutante: let's wait for monday and reevaluate [11:06:06] !log urbanecm@deploy1001 Synchronized wmf-config/throttle.php: SWAT: d1bb0b1: Removed expired throttle.php entries (duration: 01m 09s) [11:06:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:06:24] _joe_: yep, sounds good. there is other stuff to do like new VMs [11:06:46] Reedy: So, we just need Jenkins verified +2 on my patch manually or anything else? [11:06:50] <_joe_> mutante: have you seen https://phabricator.wikimedia.org/T248501 ? [11:07:15] kart_: Remove the V+2, manually apply a V+2 and hit submit. I think Martin has already done it though ;) [11:07:19] (CR) Vgutierrez: [C: +1] ATS: remove debug HTTP headers if X-Wikimedia-Debug is absent [puppet] - https://gerrit.wikimedia.org/r/583570 (https://phabricator.wikimedia.org/T210484) (owner: Ema) [11:07:28] yup :) [11:07:38] thanks! [11:07:40] Thanks! [11:08:05] kart_: pulled onto mwdebug1001 :) [11:08:10] (PS5) Jbond: profile::tlsproxy::envoy: allow users to override the cluster addr [puppet] - https://gerrit.wikimedia.org/r/583367 [11:08:12] _joe_: yea, but not the latest updates. i see, if they are scap proxies they need the extra steps, i'll keep that in mind [11:08:16] Urbanecm: testing.. [11:08:29] <_joe_> mutante: maybe add it to the docs if it's not there? [11:08:46] <_joe_> (also the mcrouter proxies) [11:09:08] _joe_: thanks for the fix. i'll write some docs about this [11:10:14] _joe_: i also have this about canaries in dsh groups. i think i have to add these: https://gerrit.wikimedia.org/r/c/operations/puppet/+/574902 [11:10:43] Urbanecm: looks good! 
[11:10:48] thanks, syncing! [11:10:54] (03CR) 10jerkins-bot: [V: 04-1] profile::tlsproxy::envoy: allow users to override the cluster addr [puppet] - 10https://gerrit.wikimedia.org/r/583367 (owner: 10Jbond) [11:11:00] if hosts are using the canary roles then they also need to be listed as such in dsh.yaml .. right [11:11:10] (03PS7) 10Jbond: envoy: introduce use_remote_address parameter [puppet] - 10https://gerrit.wikimedia.org/r/583366 [11:11:23] (03PS6) 10Jbond: profile::tlsproxy::envoy: allow users to override the cluster addr [puppet] - 10https://gerrit.wikimedia.org/r/583367 [11:11:56] PROBLEM - Debian mirror in sync with upstream on sodium is CRITICAL: /srv/mirrors/debian is over 14 hours old. https://wikitech.wikimedia.org/wiki/Mirrors [11:12:44] !log urbanecm@deploy1001 Synchronized php-1.35.0-wmf.25/extensions/ContentTranslation/modules/ui/mw.cx.ui.Categories.js: SWAT: 1ea6bad: Allow publishing to continue even with broken categories (T248302) (duration: 01m 07s) [11:12:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:12:50] T248302: Publishing fails with publish button turning grey without any visible error message - https://phabricator.wikimedia.org/T248302 [11:12:59] taking a look at the sodium alert.. happens sometimes [11:13:01] kart_: should be live! [11:13:21] dcausse: do you want to self-deploy your entry? [11:13:36] Urbanecm: thanks!! 
[11:13:49] Urbanecm: sure I can :)
[11:13:57] go ahead then :)
[11:14:12] (CR) jerkins-bot: [V: -1] profile::tlsproxy::envoy: allow users to override the cluster addr [puppet] - https://gerrit.wikimedia.org/r/583367 (owner: Jbond)
[11:14:40] (CR) DCausse: [C: +2] "SWAT" [mediawiki-config] - https://gerrit.wikimedia.org/r/583106 (https://phabricator.wikimedia.org/T231517) (owner: DCausse)
[11:15:07] (don't forget to sync IS.php twice, due to T236104)
[11:15:07] T236104: Cache of wmf-config/InitialiseSettings often 1 step behind - https://phabricator.wikimedia.org/T236104
[11:15:21] (PS2) Brian Wolff: Add m.wikidata.beta.wmflabs.org to CSP list for beta cluster [mediawiki-config] - https://gerrit.wikimedia.org/r/582633
[11:15:34] (Merged) jenkins-bot: [cirrus] force cloudelastic replica count to 1 [mediawiki-config] - https://gerrit.wikimedia.org/r/583106 (https://phabricator.wikimedia.org/T231517) (owner: DCausse)
[11:17:25] Too late to add something to swat?
(Just a config change for beta cluster only)
[11:17:47] bawolff: i think that would be doable :)
[11:17:59] Its https://gerrit.wikimedia.org/r/#/c/operations/mediawiki-config/+/582633/
[11:18:46] k
[11:19:12] thanks :)
[11:19:29] (CR) Jbond: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/583366 (owner: Jbond)
[11:20:14] (PS3) Urbanecm: Add m.wikidata.beta.wmflabs.org to CSP list for beta cluster [mediawiki-config] - https://gerrit.wikimedia.org/r/582633 (owner: Brian Wolff)
[11:21:12] !log dcausse@deploy1001 Synchronized wmf-config/InitialiseSettings.php: T231517: [cirrus] force cloudelastic replica count to 1 (duration: 01m 06s)
[11:21:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:21:19] T231517: Investigate and fix GC issues on cloudelastic machines - https://phabricator.wikimedia.org/T231517
[11:21:26] Operations, Traffic, netops: BGP: Investigate isolating codfw and eqiad - https://phabricator.wikimedia.org/T246721 (ayounsi) Next steps are: * Check ROAs - DONE * Advertise `208.80.154.0/23` (compete with CF) and `2620:0:861::/48` from eqiad/eqord * Check that they are being advertised as expected (...
[11:21:36] Urbanecm: done (unless I need to sync IS twice, I vaguely remember a bug like that)
[11:21:48] dcausse: yes, you need that
[11:21:50] ok
[11:23:03] (CR) Urbanecm: [C: +2] "beta-only change" [mediawiki-config] - https://gerrit.wikimedia.org/r/582633 (owner: Brian Wolff)
[11:23:07] !log dcausse@deploy1001 Synchronized wmf-config/InitialiseSettings.php: T231517: [cirrus] force cloudelastic replica count to 1 (duration: 01m 05s)
[11:23:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:23:17] Urbanecm: done
[11:23:20] thanks
[11:23:57] (Merged) jenkins-bot: Add m.wikidata.beta.wmflabs.org to CSP list for beta cluster [mediawiki-config] - https://gerrit.wikimedia.org/r/582633 (owner: Brian Wolff)
[11:24:22] bawolff: should land at beta within several minutes :-)
[11:24:27] Thanks
[11:24:31] yw
[11:24:41] It will take a while to actually show up because CSP headers are cached if logged out
[11:25:28] My dashboard is looking beautiful! https://logstash-beta.wmflabs.org/goto/b21d8a906e90014b9777ebf8162face3 :)
[11:25:33] First beta, then the world!
[11:25:56] !log sodium - running ftpsync to get Debian mirror in sync
[11:25:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:26:12] awight: I see you scheduled a patch, but didn't see you so far - if you're here, feel free to deploy it.
[11:26:28] RECOVERY - Debian mirror in sync with upstream on sodium is OK: /srv/mirrors/debian is over 2 hours old. https://wikitech.wikimedia.org/wiki/Mirrors
[11:27:00] there we go
[11:27:15] it should say "under 2 hours old" though :p
[11:27:26] Urbanecm: :-) Thanks, I'll give it a try.
Currently getting CI blocked by T248306, which is not fun
[11:27:26] T248306: CI error on WMF branches: Cannot use the final modifier on an abstract class in vendor/microsoft/tolerant-php-parser/tests/cases/parser/abstractMethodDeclaration7.php on line 3 - https://phabricator.wikimedia.org/T248306
[11:28:03] awight: the only way to workaround that is to bypass jenkins
[11:28:19] * awight puts on hazardous activity suit
[11:29:10] hopefully that didn't break all of CI.
[11:34:37] (PS1) Ayounsi: Shrink eqiad/eqord bgp_out to /23 and /48 [homer/public] - https://gerrit.wikimedia.org/r/583579 (https://phabricator.wikimedia.org/T246721)
[11:37:16] !log awight@deploy1001 Synchronized php-1.35.0-wmf.25/extensions/TwoColConflict: SWAT: [[gerrit:583576|Two hotfixes for guided tour (T248465)]] (duration: 01m 07s)
[11:37:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:37:22] T248465: Textbox tooltip is transparent when left side is not selected - https://phabricator.wikimedia.org/T248465
[11:37:38] Urbanecm: anything else you were going to do, or shall I close this window?
[11:37:46] no, I'm done :)
[11:38:09] !log EU SWAT done
[11:38:11] Thanks!
[11:38:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:43:54] (CR) Arturo Borrero Gonzalez: [C: +2] openstack: queens: apt pin systemd harder [puppet] - https://gerrit.wikimedia.org/r/583569 (https://phabricator.wikimedia.org/T247013) (owner: Arturo Borrero Gonzalez)
[11:50:33] (PS8) Dzahn: noc::site: close port 80 for caching servers [puppet] - https://gerrit.wikimedia.org/r/572337
[11:50:44] Urbanecm: do we have space for 1 more patch in SWAT?
[11:51:14] kart_: awight closed the window, feel free to reopen by a !log entry and do your stuff
[11:51:29] jouncebot: now
[11:51:30] For the next 0 hour(s) and 8 minute(s): European Mid-day SWAT(Max 6 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200326T1100)
[11:51:35] OK. Let me see if we can.. or will do that in next.
[11:51:49] Noting the !log really isn't necessary
[11:52:52] kart_: Oh, sorry to miss that!
[11:53:18] awight: no issue. I'm late :P
[12:03:27] Operations, SRE-Access-Requests, Patch-For-Review: Requesting access to event logging data in hive for joewalsh - https://phabricator.wikimedia.org/T247636 (Volans) a:ArielGlenn→Nuria
[12:07:14] (CR) Dzahn: [C: +2] "https://puppet-compiler.wmflabs.org/compiler1002/21580/mwmaint1002.eqiad.wmnet/" [puppet] - https://gerrit.wikimedia.org/r/572337 (owner: Dzahn)
[12:09:11] (PS7) Jbond: profile::tlsproxy::envoy: allow users to override the cluster addr [puppet] - https://gerrit.wikimedia.org/r/583367
[12:12:51] (CR) jerkins-bot: [V: -1] profile::tlsproxy::envoy: allow users to override the cluster addr [puppet] - https://gerrit.wikimedia.org/r/583367 (owner: Jbond)
[12:14:37] PROBLEM - Check systemd state on mwmaint2001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[12:15:00] (PS1) Dzahn: noc: add missing brackets in ferm rule for cumin masters [puppet] - https://gerrit.wikimedia.org/r/583581
[12:15:26] ACKNOWLEDGEMENT - Check systemd state on mwmaint1002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. daniel_zahn fix incoming https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[12:15:26] ACKNOWLEDGEMENT - Check systemd state on mwmaint2001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
daniel_zahn fix incoming https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[12:15:49] (CR) Dzahn: [C: +2] noc: add missing brackets in ferm rule for cumin masters [puppet] - https://gerrit.wikimedia.org/r/583581 (owner: Dzahn)
[12:17:29] (CR) Jbond: "> Patch Set 3: Code-Review-1" (2 comments) [puppet] - https://gerrit.wikimedia.org/r/583367 (owner: Jbond)
[12:19:00] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1096:3315 for schema change', diff saved to https://phabricator.wikimedia.org/P10782 and previous config saved to /var/cache/conftool/dbconfig/20200326-121859-marostegui.json
[12:19:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:19:18] Operations, cloud-services-team (Kanban): rack/setup/install cloudcontrol2001-dev & cloudvirt200[123]-dev - https://phabricator.wikimedia.org/T214448 (Dzahn) {F31703186} ACKed to handle Icinga alerts
[12:19:45] (PS2) Dzahn: noc: add missing brackets in ferm rule for cumin masters [puppet] - https://gerrit.wikimedia.org/r/583581
[12:21:10] cleaned up Icinga.
4 alerts should be left (not 40)
[12:21:34] 2 of them are cxserver, 2 netbox reports
[12:21:38] (PS8) Jbond: profile::tlsproxy::envoy: allow users to override the cluster addr [puppet] - https://gerrit.wikimedia.org/r/583367
[12:22:01] (PS3) Jbond: idp: update the idp proxy config to use localhost and use_remote_address [puppet] - https://gerrit.wikimedia.org/r/583368
[12:22:13] (CR) Jbond: [C: +2] realm.pp: trusted facts unavailable when performing a lookup or pcc [puppet] - https://gerrit.wikimedia.org/r/582048 (https://phabricator.wikimedia.org/T248169) (owner: Jbond)
[12:23:01] jbond42: multiple, please merge both
[12:23:34] mutante: merging
[12:23:40] thx
[12:24:05] yep, fixed systemd state on mwmaint in a moment
[12:25:40] !log analytics1028 - performing a puppet change on every run (all other hosts doing this were fixed just recently)
[12:25:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:28:41] (CR) Jbond: "> Patch Set 6: Code-Review+1" [puppet] - https://gerrit.wikimedia.org/r/582048 (https://phabricator.wikimedia.org/T248169) (owner: Jbond)
[12:29:02] RECOVERY - Check systemd state on mwmaint2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[12:29:43] (PS1) Dzahn: noc: add another missing bracket in ferm rule syntax [puppet] - https://gerrit.wikimedia.org/r/583583
[12:30:26] (CR) Dzahn: [V: +2 C: +2] noc: add another missing bracket in ferm rule syntax [puppet] - https://gerrit.wikimedia.org/r/583583 (owner: Dzahn)
[12:31:11] ACKNOWLEDGEMENT - Cxserver LVS codfw on cxserver.svc.codfw.wmnet is CRITICAL: /v2/suggest/sections/{title}/{from}/{to} (Suggest source sections to translate) is CRITICAL: Test Suggest source sections to translate returned the unexpected status 404 (expecting: 200) alexandros kosiaris https://phabricator.wikimedia.org/T248578 https://wikitech.wikimedia.org/wiki/CX
[12:31:11]
ACKNOWLEDGEMENT - Cxserver LVS eqiad on cxserver.svc.eqiad.wmnet is CRITICAL: /v2/suggest/sections/{title}/{from}/{to} (Suggest source sections to translate) is CRITICAL: Test Suggest source sections to translate returned the unexpected status 404 (expecting: 200) alexandros kosiaris https://phabricator.wikimedia.org/T248578 https://wikitech.wikimedia.org/wiki/CX
[12:31:30] ah :)
[12:31:55] (PS4) Alexandros Kosiaris: netboot/partman: add new ganeti servers and fix typo in selector [puppet] - https://gerrit.wikimedia.org/r/576887 (https://phabricator.wikimedia.org/T228924) (owner: Dzahn)
[12:31:57] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repool db1096:3315 after schema change', diff saved to https://phabricator.wikimedia.org/P10783 and previous config saved to /var/cache/conftool/dbconfig/20200326-123157-marostegui.json
[12:32:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:33:02] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1097:3315 for schema change', diff saved to https://phabricator.wikimedia.org/P10784 and previous config saved to /var/cache/conftool/dbconfig/20200326-123302-marostegui.json
[12:33:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:35:19] (CR) Dzahn: ATS: directly talk wss:// to aphlict (1 comment) [puppet] - https://gerrit.wikimedia.org/r/569104 (https://phabricator.wikimedia.org/T238593) (owner: Dzahn)
[12:35:22] (PS1) Jbond: trusted['certname']: update the other instance of trusted['certname'] [puppet] - https://gerrit.wikimedia.org/r/583588
[12:38:07] (CR) jerkins-bot: [V: -1] trusted['certname']: update the other instance of trusted['certname'] [puppet] - https://gerrit.wikimedia.org/r/583588 (owner: Jbond)
[12:38:09] (CR) Dzahn: "Looks good to me but please link to a an access request."
[puppet] - https://gerrit.wikimedia.org/r/583512 (owner: Thcipriani)
[12:38:58] (CR) Dzahn: [C: +2] DHCP Partman: Add MAC address and partman for cp2027 to cp2042 [puppet] - https://gerrit.wikimedia.org/r/583381 (https://phabricator.wikimedia.org/T247340) (owner: Papaul)
[12:39:08] (PS2) Dzahn: DHCP Partman: Add MAC address and partman for cp2027 to cp2042 [puppet] - https://gerrit.wikimedia.org/r/583381 (https://phabricator.wikimedia.org/T247340) (owner: Papaul)
[12:40:30] PROBLEM - Rate of JVM GC Old generation-s runs - cloudelastic1001-cloudelastic-chi-eqiad on cloudelastic1001 is CRITICAL: 126.1 gt 100 https://wikitech.wikimedia.org/wiki/Search%23Using_jstack_or_jmap_or_other_similar_tools_to_view_logs https://grafana.wikimedia.org/d/000000462/elasticsearch-memory?orgId=1&var-exported_cluster=cloudelastic-chi-eqiad&var-instance=cloudelastic1001&panelId=37
[12:41:26] (CR) Jbond: [C: +1] "lgtm" [homer/public] - https://gerrit.wikimedia.org/r/583579 (https://phabricator.wikimedia.org/T246721) (owner: Ayounsi)
[12:41:26] Operations, netops: Configure management-instance on routers with Junos > 17.3 - https://phabricator.wikimedia.org/T247073 (ayounsi)
[12:41:29] Operations, netops, Wikimedia-Incident: Juniper HA audit - https://phabricator.wikimedia.org/T191667 (ayounsi)
[12:42:11] (CR) Jbond: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/583367 (owner: Jbond)
[12:45:30] (PS2) Dzahn: Add new cp nodes cp2027 to cp2042 to site.pp [puppet] - https://gerrit.wikimedia.org/r/583469 (https://phabricator.wikimedia.org/T247340) (owner: Papaul)
[12:46:20] PROBLEM - Rate of JVM GC Old generation-s runs - cloudelastic1002-cloudelastic-chi-eqiad on cloudelastic1002 is CRITICAL: 123.1 gt 100 https://wikitech.wikimedia.org/wiki/Search%23Using_jstack_or_jmap_or_other_similar_tools_to_view_logs
https://grafana.wikimedia.org/d/000000462/elasticsearch-memory?orgId=1&var-exported_cluster=cloudelastic-chi-eqiad&var-instance=cloudelastic1002&panelId=37
[12:47:19] (PS1) Dzahn: partman: rename cp2018.cfg to cacheproxy.cfg [puppet] - https://gerrit.wikimedia.org/r/583592
[12:47:27] (CR) Dzahn: [C: +2] Add new cp nodes cp2027 to cp2042 to site.pp [puppet] - https://gerrit.wikimedia.org/r/583469 (https://phabricator.wikimedia.org/T247340) (owner: Papaul)
[12:49:44] (CR) Alexandros Kosiaris: [C: +1] "Reading through https://wikitech.wikimedia.org/wiki/Deployments/Covid-19, between me and Giuseppe, we tick all of the 3 boxes. Plus this i" [mediawiki-config] - https://gerrit.wikimedia.org/r/576009 (https://phabricator.wikimedia.org/T244843) (owner: Giuseppe Lavagetto)
[12:50:02] <_joe_> jouncebot: next
[12:50:02] In 3 hour(s) and 9 minute(s): Puppet SWAT(Max 6 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200326T1600)
[12:50:12] <_joe_> ok, a good time as any
[12:50:22] (CR) Giuseppe Lavagetto: [C: +2] ProductionServices: switch eventgate-main to use envoy [mediawiki-config] - https://gerrit.wikimedia.org/r/576009 (https://phabricator.wikimedia.org/T244843) (owner: Giuseppe Lavagetto)
[12:51:16] (Merged) jenkins-bot: ProductionServices: switch eventgate-main to use envoy [mediawiki-config] - https://gerrit.wikimedia.org/r/576009 (https://phabricator.wikimedia.org/T244843) (owner: Giuseppe Lavagetto)
[12:52:34] (CR) Alexandros Kosiaris: [C: -2] "ganeti1009 and ganeti1012-ganeti1018 are already being (correctly) covered by the rule on line 114, i.e."
[puppet] - https://gerrit.wikimedia.org/r/576887 (https://phabricator.wikimedia.org/T228924) (owner: Dzahn)
[12:53:19] (PS3) Dzahn: Add profile and module for for static HTML dump of CodeReview [puppet] - https://gerrit.wikimedia.org/r/567407 (https://phabricator.wikimedia.org/T243056) (owner: Legoktm)
[12:53:46] <_joe_> akosiaris: I'm testing on mwdebug1001
[12:54:54] <_joe_> ok it works
[12:55:31] <_joe_> scapping
[12:55:48] (CR) Dzahn: "PS3: just replaced an "include" with instantiating a class. This should make jenkins-bot vote +1 now." [puppet] - https://gerrit.wikimedia.org/r/567407 (https://phabricator.wikimedia.org/T243056) (owner: Legoktm)
[12:57:11] <_joe_> https://grafana.wikimedia.org/d/VTCkm29Wz/envoy-telemetry?orgId=1&var-datasource=eqiad%20prometheus%2Fops&var-origin=appserver&var-destination=eventgate-main&from=now-15m&to=now
[12:57:13] !log oblivian@deploy1001 Synchronized wmf-config/ProductionServices.php: eventgate-main to use envoy T244843 (duration: 01m 07s)
[12:57:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:57:19] T244843: Create a service-to-service proxy for handling HTTP calls from services to other entities - https://phabricator.wikimedia.org/T244843
[13:01:22] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repool db1097:3315 after schema change', diff saved to https://phabricator.wikimedia.org/P10785 and previous config saved to /var/cache/conftool/dbconfig/20200326-130122-marostegui.json
[13:01:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:01:42] <_joe_> akosiaris: eventgate-main seems ok to me looking at the dashboard
[13:02:16] <_joe_> akosiaris: also interesting effects on the cpu usage of pods
[13:02:53] _joe_: it's CFS given envoy it's full alloted timeslot
[13:03:03] it's the usual kernel bugs we 've been talking about lately
[13:03:14] <_joe_> no I mean apart from the throttling going away
[13:03:22] <_joe_> the load on the
application went down
[13:04:00] <_joe_> 1.8s -> 1.6s user, 0.6 -> 0.3 sys
[13:04:36] <_joe_> still early to call it as a persistent effect
[13:05:35] <_joe_> also I'm not sure about the numbers I get from envoy
[13:05:39] (Abandoned) Dzahn: netboot/partman: add new ganeti servers and fix typo in selector [puppet] - https://gerrit.wikimedia.org/r/576887 (https://phabricator.wikimedia.org/T228924) (owner: Dzahn)
[13:06:38] Operations, netops, Wikimedia-Incident: Juniper HA audit - https://phabricator.wikimedia.org/T191667 (ayounsi)
[13:07:04] (CR) Dzahn: "i tend to agree with Krinkle's comment. codereview-archive.wikimedia.org ok with you too?" [puppet] - https://gerrit.wikimedia.org/r/567407 (https://phabricator.wikimedia.org/T243056) (owner: Legoktm)
[13:08:00] _joe_: I guess the persistent connections? less connections to open from the app's side?
[13:08:08] well, less churn rate more like it
[13:09:02] funnily enough if you diff the sys between the tls proxy and the app, of course envoy has less sys time.
[13:09:15] probably does less syscalls to start with?
[13:10:43] (PS2) Jbond: trusted['certname']: use facts['fqdn'] instead [puppet] - https://gerrit.wikimedia.org/r/583588
[13:13:08] (CR) Jbond: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/583368 (owner: Jbond)
[13:13:49] (PS3) Jbond: trusted['certname']: use facts['fqdn'] instead [puppet] - https://gerrit.wikimedia.org/r/583588
[13:13:59] (PS1) Arturo Borrero Gonzalez: toolforge: introduce role/profile for legacy URL redirector [puppet] - https://gerrit.wikimedia.org/r/583593 (https://phabricator.wikimedia.org/T247236)
[13:16:42] (CR) jerkins-bot: [V: -1] toolforge: introduce role/profile for legacy URL redirector [puppet] - https://gerrit.wikimedia.org/r/583593 (https://phabricator.wikimedia.org/T247236) (owner: Arturo Borrero Gonzalez)
[13:20:46] PROBLEM - Check no envoy runtime configuration is left persistent on mwdebug1001 is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - string entries: {} not found on http://localhost:9631/runtime - 396 bytes in 0.001 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23Envoy
[13:21:30] (CR) Muehlenhoff: [C: +1] "Looks good, the reference to PUP 5441 is moot, though? Per the ticket is should be fixed in 4.3, so not affect production."
[puppet] - https://gerrit.wikimedia.org/r/583588 (owner: Jbond)
[13:23:26] Operations, Wikimedia-Mailing-lists: Migrate archives of the OKFN-hosted Open-GLAM mailing list to Wikimedia's mailman - https://phabricator.wikimedia.org/T240929 (jcrespo) ping @herron
[13:23:50] Operations, Wikimedia-Mailing-lists: Migrate archives of the OKFN-hosted Open-GLAM mailing list to Wikimedia's mailman - https://phabricator.wikimedia.org/T240929 (jcrespo) a:jcrespo→herron
[13:25:17] (PS4) Jbond: trusted['certname']: use facts['fqdn'] instead [puppet] - https://gerrit.wikimedia.org/r/583588
[13:26:26] (CR) Jbond: "> Patch Set 3: Code-Review+1" [puppet] - https://gerrit.wikimedia.org/r/583588 (owner: Jbond)
[13:29:41] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1110 for schema change', diff saved to https://phabricator.wikimedia.org/P10786 and previous config saved to /var/cache/conftool/dbconfig/20200326-132940-marostegui.json
[13:29:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:32:56] PROBLEM - Ensure traffic_exporter binds on port 9322 and responds to HTTP requests on cp3064 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server
[13:33:31] (CR) Muehlenhoff: [C: +1] "Looks good" [puppet] - https://gerrit.wikimedia.org/r/583588 (owner: Jbond)
[13:40:56] RECOVERY - Ensure traffic_exporter binds on port 9322 and responds to HTTP requests on cp3064 is OK: HTTP OK: HTTP/1.0 200 OK - 22377 bytes in 0.256 second response time https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server
[13:45:16] (PS1) Muehlenhoff: Make image pruning toggleable [puppet] - https://gerrit.wikimedia.org/r/583595
[13:47:51] (CR) jerkins-bot: [V: -1] Make image pruning toggleable [puppet] - https://gerrit.wikimedia.org/r/583595 (owner: Muehlenhoff)
[13:48:13] (CR) CDanis: [C: +1] Shrink eqiad/eqord bgp_out to /23 and /48 [homer/public] -
https://gerrit.wikimedia.org/r/583579 (https://phabricator.wikimedia.org/T246721) (owner: Ayounsi)
[13:49:10] (CR) CDanis: [C: +2] clean up stub routing-options left behind in caf7b4f [homer/public] - https://gerrit.wikimedia.org/r/577563 (owner: CDanis)
[13:49:27] (PS2) Muehlenhoff: Make image pruning toggleable [puppet] - https://gerrit.wikimedia.org/r/583595
[13:49:29] (Merged) jenkins-bot: clean up stub routing-options left behind in caf7b4f [homer/public] - https://gerrit.wikimedia.org/r/577563 (owner: CDanis)
[13:50:55] (CR) CDanis: [C: +1] php-admin: remove dead code for partial opcache invalidation (1 comment) [puppet] - https://gerrit.wikimedia.org/r/577652 (owner: Ori.livneh)
[13:52:06] (PS3) Muehlenhoff: Make image pruning toggleable [puppet] - https://gerrit.wikimedia.org/r/583595
[13:52:11] (CR) jerkins-bot: [V: -1] Make image pruning toggleable [puppet] - https://gerrit.wikimedia.org/r/583595 (owner: Muehlenhoff)
[13:52:46] (Abandoned) Hashar: contint: add acl package for file permissions tweak [puppet] - https://gerrit.wikimedia.org/r/583392 (https://phabricator.wikimedia.org/T210271) (owner: Hashar)
[13:53:25] Operations, ops-codfw, Traffic: (Need by: TBD) rack/setup/install cp202[7-9], cp203[0-9], cp204[0-2] - https://phabricator.wikimedia.org/T247340 (Papaul)
[13:56:28] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repool db1110 after schema change', diff saved to https://phabricator.wikimedia.org/P10787 and previous config saved to /var/cache/conftool/dbconfig/20200326-135625-marostegui.json
[13:56:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:58:15] Operations, SRE-Access-Requests: Requesting access to analytics for tarrow - https://phabricator.wikimedia.org/T248498 (ItamarWMDE) Hello @Nuria , thank you for your review and consideration.
As @Addshore added in the description of the ticket, I need access to wikidata json dumps in hadoop to make vario...
[13:58:21] Operations, ops-codfw, Traffic: (Need by: TBD) rack/setup/install cp202[7-9], cp203[0-9], cp204[0-2] - https://phabricator.wikimedia.org/T247340 (ops-monitoring-bot) Script wmf-auto-reimage was launched by pt1979 on cumin2001.codfw.wmnet for hosts: ` cp2027.codfw.wmnet ` The log can be found in `/var...
[13:58:33] Operations, fundraising-tech-ops, observability, Patch-For-Review, User-fgiunchedi: Icinga latency is skyrocketing and commands ignored - https://phabricator.wikimedia.org/T247538 (fgiunchedi) p:High→Medium Lowering priority as things I believe are better now, pending https://gerrit.w...
[13:59:51] Operations, SRE-Access-Requests: Requesting access to analytics for ItamarWMDE - https://phabricator.wikimedia.org/T248482 (ItamarWMDE) Hello @Nuria, thank you for your review and cosideration. As @addshore added in the description of the ticket, I need access to wikidata json dumps in hadoop to make var...
[14:06:58] Operations, netops: roll out sensible flow-table-sizes to Juniper core routers with sampling enabled - https://phabricator.wikimedia.org/T248394 (CDanis) Druid disk usage is not greatly increased, routers seem happy.
Will reconfigure another router or two, and work on Homer-izing the change, today
[14:09:11] (CR) Jbond: [C: +2] profile::idp::client::httpd: add check for sso redirect [puppet] - https://gerrit.wikimedia.org/r/583078 (https://phabricator.wikimedia.org/T245743) (owner: Jbond)
[14:11:19] (PS1) Arturo Borrero Gonzalez: openstack: queens: don't install libpam-systemd from bpo [puppet] - https://gerrit.wikimedia.org/r/583599 (https://phabricator.wikimedia.org/T242766)
[14:11:59] (CR) Andrew Bogott: [C: +1] openstack: queens: don't install libpam-systemd from bpo [puppet] - https://gerrit.wikimedia.org/r/583599 (https://phabricator.wikimedia.org/T242766) (owner: Arturo Borrero Gonzalez)
[14:12:19] (CR) Arturo Borrero Gonzalez: [C: +2] openstack: queens: don't install libpam-systemd from bpo [puppet] - https://gerrit.wikimedia.org/r/583599 (https://phabricator.wikimedia.org/T242766) (owner: Arturo Borrero Gonzalez)
[14:15:14] (PS1) Andrew Bogott: Horizon: put in maintenance mode for the pike=>queens upgrade [puppet] - https://gerrit.wikimedia.org/r/583600 (https://phabricator.wikimedia.org/T242766)
[14:15:16] (PS1) Andrew Bogott: Openstack: move eqiad1 to version 'queens' [puppet] - https://gerrit.wikimedia.org/r/583601 (https://phabricator.wikimedia.org/T242766)
[14:15:18] (PS1) Andrew Bogott: Revert "Horizon: put in maintenance mode for the pike=>queens upgrade" [puppet] - https://gerrit.wikimedia.org/r/583602
[14:15:52] (CR) Elukey: "LGTM, does pcc looks good?"
[puppet] - https://gerrit.wikimedia.org/r/583588 (owner: Jbond)
[14:16:20] (CR) Jbond: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/583588 (owner: Jbond)
[14:17:36] (PS1) Urbanecm: Enable wmgUseFooterContactLink for cswiki [mediawiki-config] - https://gerrit.wikimedia.org/r/583603 (https://phabricator.wikimedia.org/T248584)
[14:18:17] (PS2) Arturo Borrero Gonzalez: toolforge: introduce role/profile for legacy URL redirector [puppet] - https://gerrit.wikimedia.org/r/583593 (https://phabricator.wikimedia.org/T247236)
[14:20:49] (PS1) Dzahn: add miscweb1002.eqiad.wmnet [dns] - https://gerrit.wikimedia.org/r/583604 (https://phabricator.wikimedia.org/T247887)
[14:22:01] (CR) Alexandros Kosiaris: [C: +1] Make image pruning toggleable [puppet] - https://gerrit.wikimedia.org/r/583595 (owner: Muehlenhoff)
[14:22:59] (PS2) Dzahn: partman: rename cp2018.cfg to cacheproxy.cfg [puppet] - https://gerrit.wikimedia.org/r/583592 (https://phabricator.wikimedia.org/T156955)
[14:23:37] (PS3) Arturo Borrero Gonzalez: toolforge: introduce role/profile for legacy URL redirector [puppet] - https://gerrit.wikimedia.org/r/583593 (https://phabricator.wikimedia.org/T247236)
[14:23:42] !log pt1979@cumin2001 START - Cookbook sre.hosts.downtime
[14:23:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:24:13] (CR) Jbond: "> Patch Set 4:" [puppet] - https://gerrit.wikimedia.org/r/583588 (owner: Jbond)
[14:26:09] !log pt1979@cumin2001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99)
[14:26:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:27:34] Operations, serviceops, Patch-For-Review: decom old appservers in eqiad - https://phabricator.wikimedia.org/T247780 (Dzahn) @Jclark-ctr @Cmjohnson @wiki_willy We (serviceops) are aware that currently there won't be onsite work except for emergencies.
Additionally we also wanted to clarify that in thi...
[14:29:08] Operations, serviceops, Patch-For-Review: decom old appservers in eqiad - https://phabricator.wikimedia.org/T247780 (Dzahn) Open→Stalled Setting to stalled. We are waiting at least until Monday before removing the remaining 5 servers in rack D5.
[14:30:31] PROBLEM - cas-graphite.wikimedia.org requires authentication on graphite2003 is CRITICAL: HTTP CRITICAL: HTTP/1.1 400 Bad Request - header location: https://idp.wiki... not found on https:///:443-e - 66 bytes in 1.153 second response time https://wikitech.wikimedia.org/wiki/CAS-SSO/Administration
[14:30:53] Operations, ops-codfw, Traffic: (Need by: TBD) rack/setup/install cp202[7-9], cp203[0-9], cp204[0-2] - https://phabricator.wikimedia.org/T247340 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['cp2027.codfw.wmnet'] ` and were **ALL** successful.
[14:31:17] (PS2) Alexandros Kosiaris: eventstreams: Switch to service_setup [puppet] - https://gerrit.wikimedia.org/r/583073 (https://phabricator.wikimedia.org/T238658)
[14:31:24] (CR) Alexandros Kosiaris: [C: +2] eventstreams: Switch to service_setup [puppet] - https://gerrit.wikimedia.org/r/583073 (https://phabricator.wikimedia.org/T238658) (owner: Alexandros Kosiaris)
[14:31:45] Operations, ops-codfw, Traffic: (Need by: TBD) rack/setup/install cp202[7-9], cp203[0-9], cp204[0-2] - https://phabricator.wikimedia.org/T247340 (ops-monitoring-bot) Script wmf-auto-reimage was launched by pt1979 on cumin2001.codfw.wmnet for hosts: ` cp2028.codfw.wmnet ` The log can be found in `/var...
[14:34:33] PROBLEM - cas-logstash.wikimedia.org requires authentication on logstash2004 is CRITICAL: HTTP CRITICAL: HTTP/1.1 400 Bad Request - header location: https://idp.wiki...
not found on https:///:443-e - 66 bytes in 1.151 second response time https://wikitech.wikimedia.org/wiki/CAS-SSO/Administration
[14:34:52] (CR) Elukey: [C: +1] "> > Patch Set 4:" [puppet] - https://gerrit.wikimedia.org/r/583588 (owner: Jbond)
[14:36:02] (CR) Jbond: [C: +2] "> Patch Set 4: Code-Review+1" [puppet] - https://gerrit.wikimedia.org/r/583588 (owner: Jbond)
[14:37:04] Operations, Patch-For-Review, Security: envoyproxy: CVE-2020-8664 CVE-2020-8661 CVE-2020-8660 CVE-2020-8659 - https://phabricator.wikimedia.org/T246868 (RLazarus) >>! In T246868#6001036, @ayounsi wrote: > Seems to match the start of: > https://icinga.wikimedia.org/cgi-bin/icinga/extinfo.cgi?type=2&ho...
[14:39:05] PROBLEM - cas-graphite.wikimedia.org requires authentication on graphite1004 is CRITICAL: HTTP CRITICAL: HTTP/1.1 400 Bad Request - header location: https://idp.wiki... not found on https:///:443-e - 66 bytes in 1.005 second response time https://wikitech.wikimedia.org/wiki/CAS-SSO/Administration
[14:40:37] PROBLEM - Ensure traffic_exporter binds on port 9322 and responds to HTTP requests on cp3058 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server
[14:41:25] PROBLEM - cas-icinga.wikimedia.org requires authentication on icinga1001 is CRITICAL: HTTP CRITICAL: HTTP/1.1 400 Bad Request - header location: https://idp.wiki...
not found on https:///icinga:443-e - 587 bytes in 0.005 second response time https://wikitech.wikimedia.org/wiki/CAS-SSO/Administration [14:41:29] (03CR) 10Dzahn: [C: 03+2] add miscweb1002.eqiad.wmnet [dns] - 10https://gerrit.wikimedia.org/r/583604 (https://phabricator.wikimedia.org/T247887) (owner: 10Dzahn) [14:42:49] PROBLEM - PyBal IPVS diff check on lvs2009 is CRITICAL: CRITICAL: Services in IPVS but unknown to PyBal: set([10.2.1.34:8092]) https://wikitech.wikimedia.org/wiki/PyBal [14:46:17] (03CR) 10Papaul: [C: 03+1] partman: rename cp2018.cfg to cacheproxy.cfg [puppet] - 10https://gerrit.wikimedia.org/r/583592 (https://phabricator.wikimedia.org/T156955) (owner: 10Dzahn) [14:46:51] PROBLEM - PyBal IPVS diff check on lvs1016 is CRITICAL: CRITICAL: Services in IPVS but unknown to PyBal: set([10.2.2.34:8092]) https://wikitech.wikimedia.org/wiki/PyBal [14:47:47] !log pt1979@cumin2001 START - Cookbook sre.hosts.downtime [14:47:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:48:21] 10Operations, 10ops-eqiad, 10DC-Ops, 10User-Zppix, 10cloud-services-team (Hardware): VMs on cloudvirt1015 crashing - bad mainboard/memory - https://phabricator.wikimedia.org/T220853 (10JHedden) cloudvirt1015 has crashed again using @Andrew's stress test. Paste with all the kernel oops and panics prior t... [14:49:09] akosiaris: the pybal alert above is related to eventstreams, i think it's kind of normal when adding new things ..for a while? [14:49:50] PROBLEM - cas-puppetboard.wikimedia.org requires authentication on puppetboard1001 is CRITICAL: HTTP CRITICAL: HTTP/1.1 400 Bad Request - header location: https://idp.wiki... 
not found on https:///:443-e - 66 bytes in 1.006 second response time https://wikitech.wikimedia.org/wiki/CAS-SSO/Administration [14:50:13] !log pt1979@cumin2001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) [14:50:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:50:26] (03PS1) 10BBlack: Add IPv6 for cp2027-42 [dns] - 10https://gerrit.wikimedia.org/r/583608 (https://phabricator.wikimedia.org/T247340) [14:50:49] uh? puppetboard, jbond42 anything WIP by any chance? [14:51:08] PROBLEM - cas-puppetboard.wikimedia.org requires authentication on puppetboard2001 is CRITICAL: HTTP CRITICAL: HTTP/1.1 400 Bad Request - header location: https://idp.wiki... not found on https:///:443-e - 66 bytes in 1.152 second response time https://wikitech.wikimedia.org/wiki/CAS-SSO/Administration [14:51:12] volans: ahh thats a new check that just got added ill take a look [14:51:29] seems to be missing the host [14:51:30] mutante: it's not adding, it's removing [14:51:35] https:///:443 [14:51:38] but you are correct otherwise, I have cleanup to do [14:51:59] ahh yes i think i no the issue [14:53:10] PROBLEM - PyBal IPVS diff check on lvs1015 is CRITICAL: CRITICAL: Services in IPVS but unknown to PyBal: set([10.2.2.34:8092]) https://wikitech.wikimedia.org/wiki/PyBal [14:53:54] (03CR) 10Andrew Bogott: [C: 03+2] Horizon: put in maintenance mode for the pike=>queens upgrade [puppet] - 10https://gerrit.wikimedia.org/r/583600 (https://phabricator.wikimedia.org/T242766) (owner: 10Andrew Bogott) [14:55:26] 10Operations, 10DBA, 10MediaWiki-General: Evaluate and decide the future of relational datastore at WMF after the upgrade of MariaDB 10.1 is finished - https://phabricator.wikimedia.org/T193224 (10jcrespo) [14:55:49] (03PS1) 10Jbond: icinga: correct check arguments [puppet] - 10https://gerrit.wikimedia.org/r/583609 [14:56:02] 10Operations, 10ops-codfw, 10Traffic, 10Patch-For-Review: (Need by: TBD) rack/setup/install cp202[7-9], cp203[0-9], cp204[0-2] - 
https://phabricator.wikimedia.org/T247340 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['cp2028.codfw.wmnet'] ` and were **ALL** successful. [14:56:41] 10Operations, 10ops-codfw, 10Traffic, 10Patch-For-Review: (Need by: TBD) rack/setup/install cp202[7-9], cp203[0-9], cp204[0-2] - https://phabricator.wikimedia.org/T247340 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by pt1979 on cumin2001.codfw.wmnet for hosts: ` cp2029.codfw.wmnet ` The log... [14:57:01] (03CR) 10Muehlenhoff: [C: 03+2] Make image pruning toggleable [puppet] - 10https://gerrit.wikimedia.org/r/583595 (owner: 10Muehlenhoff) [14:57:10] (03CR) 10BBlack: [C: 03+2] partman: rename cp2018.cfg to cacheproxy.cfg [puppet] - 10https://gerrit.wikimedia.org/r/583592 (https://phabricator.wikimedia.org/T156955) (owner: 10Dzahn) [14:57:18] PROBLEM - cas-logstash.wikimedia.org requires authentication on logstash1009 is CRITICAL: HTTP CRITICAL: HTTP/1.1 400 Bad Request - header location: https://idp.wiki... not found on https:///:443-e - 66 bytes in 1.009 second response time https://wikitech.wikimedia.org/wiki/CAS-SSO/Administration [14:57:36] (03CR) 10Jbond: [C: 03+2] icinga: correct check arguments [puppet] - 10https://gerrit.wikimedia.org/r/583609 (owner: 10Jbond) [14:57:58] all ok to merge? [14:58:19] yes please [14:58:33] akosiaris: ack. i ran into that too afair [14:58:44] PROBLEM - cas-logstash.wikimedia.org requires authentication on logstash2006 is CRITICAL: HTTP CRITICAL: HTTP/1.1 400 Bad Request - header location: https://idp.wiki... not found on https:///:443-e - 66 bytes in 1.157 second response time https://wikitech.wikimedia.org/wiki/CAS-SSO/Administration [14:58:51] done [14:58:54] thx [14:58:56] PROBLEM - people.wikimedia.org requires authentication on people1001 is CRITICAL: HTTP CRITICAL: HTTP/1.1 400 Bad Request - header location: https://idp.wiki... 
not found on https:///:443-e - 66 bytes in 1.008 second response time https://wikitech.wikimedia.org/wiki/CAS-SSO/Administration [14:59:36] !log dzahn@cumin1001 START - Cookbook sre.ganeti.makevm [14:59:36] !log dzahn@cumin1001 END (FAIL) - Cookbook sre.ganeti.makevm (exit_code=99) [14:59:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:59:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:59:50] !log dzahn@cumin1001 START - Cookbook sre.ganeti.makevm [14:59:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:01:05] !log aborrero@cumin1001 START - Cookbook sre.hosts.downtime [15:01:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:01:15] !log aborrero@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [15:01:17] !log T247887 - create Ganeti VM miscweb1002.eqiad.wmnet in the ganeti01.svc.eqiad.wmnet cluster on row C with 1 vCPUs, 2GB of RAM, 20GB of disk in the private network. [15:01:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:01:21] 10Operations, 10ops-codfw, 10Traffic, 10Patch-For-Review: (Need by: TBD) rack/setup/install cp202[7-9], cp203[0-9], cp204[0-2] - https://phabricator.wikimedia.org/T247340 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by pt1979 on cumin2001.codfw.wmnet for hosts: ` cp2030.codfw.wmnet ` The log... 
[15:01:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:01:24] T247887: Site: eqiad/codfw 2 VM request for miscweb - https://phabricator.wikimedia.org/T247887 [15:01:37] RECOVERY - cas-graphite.wikimedia.org requires authentication on graphite1004 is OK: HTTP OK: Status line output matched HTTP/1.1 302 - 556 bytes in 1.011 second response time https://wikitech.wikimedia.org/wiki/CAS-SSO/Administration [15:01:37] RECOVERY - cas-logstash.wikimedia.org requires authentication on logstash1009 is OK: HTTP OK: Status line output matched HTTP/1.1 302 - 554 bytes in 1.017 second response time https://wikitech.wikimedia.org/wiki/CAS-SSO/Administration [15:01:37] RECOVERY - cas-puppetboard.wikimedia.org requires authentication on puppetboard1001 is OK: HTTP OK: Status line output matched HTTP/1.1 302 - 568 bytes in 1.006 second response time https://wikitech.wikimedia.org/wiki/CAS-SSO/Administration [15:01:37] RECOVERY - cas-logstash.wikimedia.org requires authentication on logstash2004 is OK: HTTP OK: Status line output matched HTTP/1.1 302 - 554 bytes in 1.153 second response time https://wikitech.wikimedia.org/wiki/CAS-SSO/Administration [15:01:37] RECOVERY - cas-graphite.wikimedia.org requires authentication on graphite2003 is OK: HTTP OK: Status line output matched HTTP/1.1 302 - 556 bytes in 1.166 second response time https://wikitech.wikimedia.org/wiki/CAS-SSO/Administration [15:01:37] RECOVERY - cas-logstash.wikimedia.org requires authentication on logstash2006 is OK: HTTP OK: Status line output matched HTTP/1.1 302 - 554 bytes in 1.153 second response time https://wikitech.wikimedia.org/wiki/CAS-SSO/Administration [15:01:38] RECOVERY - cas-puppetboard.wikimedia.org requires authentication on puppetboard2001 is OK: HTTP OK: Status line output matched HTTP/1.1 302 - 568 bytes in 1.150 second response time https://wikitech.wikimedia.org/wiki/CAS-SSO/Administration [15:02:07] !log aborrero@cumin1001 START - Cookbook sre.hosts.downtime 
[15:02:08] !log aborrero@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [15:02:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:02:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:02:21] !log aborrero@cumin1001 START - Cookbook sre.hosts.downtime [15:02:22] !log aborrero@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [15:02:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:02:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:02:30] !log aborrero@cumin1001 START - Cookbook sre.hosts.downtime [15:02:31] !log aborrero@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [15:02:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:02:36] !log aborrero@cumin1001 START - Cookbook sre.hosts.downtime [15:02:36] (03CR) 10Andrew Bogott: [C: 03+2] Openstack: move eqiad1 to version 'queens' [puppet] - 10https://gerrit.wikimedia.org/r/583601 (https://phabricator.wikimedia.org/T242766) (owner: 10Andrew Bogott) [15:02:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:02:40] !log aborrero@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [15:02:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:02:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:03:01] PROBLEM - PyBal IPVS diff check on lvs2010 is CRITICAL: CRITICAL: Services in IPVS but unknown to PyBal: set([10.2.1.34:8092]) https://wikitech.wikimedia.org/wiki/PyBal [15:05:59] PROBLEM - tendril.wikimedia.org requires authentication on dbmonitor1001 is CRITICAL: HTTP CRITICAL: Status line output matched HTTP/1.1 302 - header location: https://idp.wiki... 
not found on https://tendril.wikimedia.org:443/ - 586 bytes in 0.006 second response time https://wikitech.wikimedia.org/wiki/CAS-SSO/Administration [15:06:24] (03CR) 10Dzahn: [C: 04-1] phabricator: close port 80 for caching servers (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/569100 (owner: 10Dzahn) [15:07:36] (03PS1) 10Jbond: icinga: redirect check [puppet] - 10https://gerrit.wikimedia.org/r/583612 [15:08:34] (03PS1) 10BBlack: partman: clean up cacheproxy selectors [puppet] - 10https://gerrit.wikimedia.org/r/583613 (https://phabricator.wikimedia.org/T156955) [15:09:09] RECOVERY - PyBal IPVS diff check on lvs2009 is OK: OK: no difference between hosts in IPVS/PyBal https://wikitech.wikimedia.org/wiki/PyBal [15:10:40] !log dzahn@cumin1001 END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) [15:10:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:10:55] (03CR) 10Jbond: [C: 03+2] icinga: redirect check [puppet] - 10https://gerrit.wikimedia.org/r/583612 (owner: 10Jbond) [15:11:19] 10Operations, 10SRE-Access-Requests: Request Netbox access for user "dubosv10" - https://phabricator.wikimedia.org/T248445 (10Volans) @DubOSv10 thanks for the additional context, but this doesn't meet our current requirements for the need-to-know basis for Netbox, I'm sorry. Have you tried the unofficial publi... 
[15:11:34] (03PS1) 10Dzahn: DHCP: add miscweb1001.eqiad.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/583617 (https://phabricator.wikimedia.org/T247887) [15:12:40] !log pt1979@cumin2001 START - Cookbook sre.hosts.downtime [15:12:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:13:22] (03PS1) 10Muehlenhoff: Make the docker package name configurable and use docker.io on deneb [puppet] - 10https://gerrit.wikimedia.org/r/583619 [15:14:31] RECOVERY - PyBal IPVS diff check on lvs1015 is OK: OK: no difference between hosts in IPVS/PyBal https://wikitech.wikimedia.org/wiki/PyBal [15:14:39] RECOVERY - Ensure traffic_exporter binds on port 9322 and responds to HTTP requests on cp3058 is OK: HTTP OK: HTTP/1.0 200 OK - 22367 bytes in 0.278 second response time https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [15:15:09] !log pt1979@cumin2001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) [15:15:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:15:41] RECOVERY - tendril.wikimedia.org requires authentication on dbmonitor1001 is OK: HTTP OK: Status line output matched HTTP/1.1 302 - 586 bytes in 0.010 second response time https://wikitech.wikimedia.org/wiki/CAS-SSO/Administration [15:15:51] RECOVERY - cas-icinga.wikimedia.org requires authentication on icinga1001 is OK: HTTP OK: Status line output matched HTTP/1.1 302 - 604 bytes in 0.005 second response time https://wikitech.wikimedia.org/wiki/CAS-SSO/Administration [15:16:57] (03PS1) 10BBlack: cp2027-42: define ATS storage as nvme [puppet] - 10https://gerrit.wikimedia.org/r/583623 (https://phabricator.wikimedia.org/T247340) [15:17:17] !log pt1979@cumin2001 START - Cookbook sre.hosts.downtime [15:17:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:18:55] RECOVERY - PyBal IPVS diff check on lvs1016 is OK: OK: no difference between hosts in IPVS/PyBal https://wikitech.wikimedia.org/wiki/PyBal [15:19:27] 
RECOVERY - PyBal IPVS diff check on lvs2010 is OK: OK: no difference between hosts in IPVS/PyBal https://wikitech.wikimedia.org/wiki/PyBal [15:19:36] (03CR) 10Muehlenhoff: "PCC: https://puppet-compiler.wmflabs.org/compiler1003/21583/" [puppet] - 10https://gerrit.wikimedia.org/r/583619 (owner: 10Muehlenhoff) [15:19:55] 10Operations, 10ops-codfw, 10Traffic, 10Patch-For-Review: (Need by: TBD) rack/setup/install cp202[7-9], cp203[0-9], cp204[0-2] - https://phabricator.wikimedia.org/T247340 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['cp2029.codfw.wmnet'] ` and were **ALL** successful. [15:19:58] 10Operations, 10Analytics, 10Analytics-Kanban, 10EventStreams, and 2 others: EventStreams drops the connection after 15 minutes, which makes it unreliable - https://phabricator.wikimedia.org/T242767 (10stjn) 05Open→03Resolved Frequent disconnects stopped after 25th March, 15:30 UTC, so yes. Thank you f... [15:20:28] !log pt1979@cumin2001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) [15:20:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:21:58] (03PS3) 10Alexandros Kosiaris: eventstreams: Remove all conftool data [puppet] - 10https://gerrit.wikimedia.org/r/566773 (https://phabricator.wikimedia.org/T238658) [15:22:12] (03CR) 10Alexandros Kosiaris: [C: 03+2] eventstreams: Remove all conftool data [puppet] - 10https://gerrit.wikimedia.org/r/566773 (https://phabricator.wikimedia.org/T238658) (owner: 10Alexandros Kosiaris) [15:22:25] (03PS2) 10Alexandros Kosiaris: eventstreams: Remove old lvs service [puppet] - 10https://gerrit.wikimedia.org/r/583074 (https://phabricator.wikimedia.org/T238658) [15:22:59] (03CR) 10BBlack: [C: 03+2] cp2027-42: define ATS storage as nvme [puppet] - 10https://gerrit.wikimedia.org/r/583623 (https://phabricator.wikimedia.org/T247340) (owner: 10BBlack) [15:24:13] akosiaris: ok to merge? [15:24:25] bblack: ah, yes, thanks! 
[15:24:43] (03CR) 10Alexandros Kosiaris: [C: 03+2] eventstreams: Remove old lvs service [puppet] - 10https://gerrit.wikimedia.org/r/583074 (https://phabricator.wikimedia.org/T238658) (owner: 10Alexandros Kosiaris) [15:24:50] I was about to also merge https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/583074/ [15:24:58] one less LVS service around [15:25:14] 10Operations, 10ops-codfw, 10Traffic, 10Patch-For-Review: (Need by: TBD) rack/setup/install cp202[7-9], cp203[0-9], cp204[0-2] - https://phabricator.wikimedia.org/T247340 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['cp2030.codfw.wmnet'] ` and were **ALL** successful. [15:26:40] Amir1, awight, Urbanecm: if mid-day swat goes quickly, i'm around & you could do some of my config changes scheduled for the morning SWAT [15:26:41] 10Operations, 10SRE-tools, 10Traffic, 10Goal, and 3 others: Automate generation of Management DNS records from Netbox - https://phabricator.wikimedia.org/T233183 (10Volans) In relation to https://gerrit.wikimedia.org/r/c/operations/cookbooks/+/583109 I've just tested the command on a single host, this is t... [15:27:19] cscott: I don't understand - mid day SWAT is already over for today? 
:) [15:27:41] Urbanecm: oh, sorry, i don't understand time zones apparently ;) [15:28:18] https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200326T1100 has it highlighted in purple still and i didn't sanity check the times [15:28:25] PROBLEM - Confd template for /srv/config-master/pybal/codfw/eventstreams on puppetmaster1001 is CRITICAL: Compilation of file /srv/config-master/pybal/codfw/eventstreams is broken https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [15:28:43] PROBLEM - Confd template for /srv/config-master/pybal/eqiad/eventstreams on puppetmaster2001 is CRITICAL: Compilation of file /srv/config-master/pybal/eqiad/eventstreams is broken https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [15:28:51] PROBLEM - Confd template for /srv/config-master/pybal/codfw/eventstreams on puppetmaster2001 is CRITICAL: Compilation of file /srv/config-master/pybal/codfw/eventstreams is broken https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [15:29:03] PROBLEM - Confd template for /srv/config-master/pybal/eqiad/eventstreams on puppetmaster1001 is CRITICAL: Compilation of file /srv/config-master/pybal/eqiad/eventstreams is broken https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [15:29:41] <_joe_> uh akosiaris, ottomata? [15:29:46] <_joe_> expected right [15:30:06] I was about to ask [15:30:23] ERROR "updating error mtime on /var/run/confd-template/.eventstreams690605534.err\nfailed linting '/usr/local/bin/pybal-eval-check /srv/config-master/pybal/codfw/.eventstreams690605534' with 1 (0.0217859745026s) [invalid]: server pool cannot be empty!\n\n" [15:30:43] 10Operations, 10ops-codfw, 10Traffic, 10Patch-For-Review: (Need by: TBD) rack/setup/install cp202[7-9], cp203[0-9], cp204[0-2] - https://phabricator.wikimedia.org/T247340 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by pt1979 on cumin2001.codfw.wmnet for hosts: ` cp2031.codfw.wmnet ` The log... 
[15:30:56] <_joe_> volans: I think we're not absenting the confd templates when we remove them from puppet or something [15:31:01] it is one of those things that happen when adding/removing services [15:31:09] kind of remember that happening [15:31:13] (03PS1) 10Elukey: cdh::hadoop: allow hadoop daemons to override ipv6 settings [puppet] - 10https://gerrit.wikimedia.org/r/583631 (https://phabricator.wikimedia.org/T240255) [15:31:56] cscott: :-) timezones have been biting me lately, too. [15:32:07] (03CR) 10jerkins-bot: [V: 04-1] cdh::hadoop: allow hadoop daemons to override ipv6 settings [puppet] - 10https://gerrit.wikimedia.org/r/583631 (https://phabricator.wikimedia.org/T240255) (owner: 10Elukey) [15:32:20] (03CR) 10Alexandros Kosiaris: [C: 04-1] "A number of inline comments, most nitpicks but overall I like the idea" (039 comments) [puppet] - 10https://gerrit.wikimedia.org/r/583340 (https://phabricator.wikimedia.org/T233945) (owner: 10Jbond) [15:32:24] 10Operations, 10ops-codfw, 10Traffic, 10Patch-For-Review: (Need by: TBD) rack/setup/install cp202[7-9], cp203[0-9], cp204[0-2] - https://phabricator.wikimedia.org/T247340 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by pt1979 on cumin2001.codfw.wmnet for hosts: ` cp2032.codfw.wmnet ` The log... [15:33:12] (03CR) 10Alexandros Kosiaris: [C: 03+1] profile::base::firewall: add support for abuse_networks [puppet] - 10https://gerrit.wikimedia.org/r/583341 (https://phabricator.wikimedia.org/T233945) (owner: 10Jbond) [15:33:25] this is from SAL when something similar happened before. we had to delete .err files to make Icinga happy again [15:33:28] 23:36 mutante: [puppetmaster2001:/var/run/confd-template] $ sudo rm .cloudceph*.err [15:36:29] mutante: that is probably something we need to document, but this is unrelated. 
This is full removal of a service [15:36:47] puppet just run on puppetmasters, the entire check should disappear soon [15:37:20] alright, there are some eventstream.err files in that same dir there but yea [15:37:21] (03PS1) 10BBlack: cp2027-42: add IPs to cache_hosts data [puppet] - 10https://gerrit.wikimedia.org/r/583634 (https://phabricator.wikimedia.org/T247340) [15:37:22] (03PS1) 10BBlack: acme_chief: expand cp nodes regex [puppet] - 10https://gerrit.wikimedia.org/r/583635 (https://phabricator.wikimedia.org/T247340) [15:37:49] (03PS2) 10Elukey: cdh::hadoop: allow hadoop daemons to override ipv6 settings [puppet] - 10https://gerrit.wikimedia.org/r/583631 (https://phabricator.wikimedia.org/T240255) [15:38:16] (03CR) 10Alexandros Kosiaris: [C: 03+2] eventstreams: Remove from scb role [puppet] - 10https://gerrit.wikimedia.org/r/583076 (https://phabricator.wikimedia.org/T238658) (owner: 10Alexandros Kosiaris) [15:38:20] akosiaris: we did:) https://wikitech.wikimedia.org/wiki/Confd#Compilation_is_broken [15:38:38] mutante: oh great, I did not remember that. thanks! [15:39:06] (03CR) 10Volans: [C: 03+2] sre.dns.netbox: deploy the changes to gdnsd [cookbooks] - 10https://gerrit.wikimedia.org/r/583109 (https://phabricator.wikimedia.org/T233183) (owner: 10Volans) [15:40:01] !log start advertising 2620:0:861::/48 from eqiad - T246721 [15:40:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:40:06] T246721: BGP: Investigate isolating codfw and eqiad - https://phabricator.wikimedia.org/T246721 [15:42:23] 10Operations, 10Traffic, 10netops, 10Patch-For-Review: BGP: Investigate isolating codfw and eqiad - https://phabricator.wikimedia.org/T246721 (10ayounsi) ` Prefix Nexthop MED Lclpref AS path * 2620:0:860::/46 Self I * 2620:0:861::/48 Sel... 
[15:43:12] (03PS3) 10Elukey: cdh::hadoop: allow hadoop daemons to override ipv6 settings [puppet] - 10https://gerrit.wikimedia.org/r/583631 (https://phabricator.wikimedia.org/T240255) [15:44:57] (03PS2) 10Volans: sre.dns.netbox: deploy the changes to gdnsd [cookbooks] - 10https://gerrit.wikimedia.org/r/583109 (https://phabricator.wikimedia.org/T233183) [15:46:39] !log pt1979@cumin2001 START - Cookbook sre.hosts.downtime [15:46:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:47:11] PROBLEM - Rate of JVM GC Old generation-s runs - cloudelastic1003-cloudelastic-chi-eqiad on cloudelastic1003 is CRITICAL: 101.7 gt 100 https://wikitech.wikimedia.org/wiki/Search%23Using_jstack_or_jmap_or_other_similar_tools_to_view_logs https://grafana.wikimedia.org/d/000000462/elasticsearch-memory?orgId=1&var-exported_cluster=cloudelastic-chi-eqiad&var-instance=cloudelastic1003&panelId=37 [15:48:25] !log pt1979@cumin2001 START - Cookbook sre.hosts.downtime [15:48:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:49:03] !log pt1979@cumin2001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) [15:49:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:49:18] !log start advertising 208.80.154.0/23 from eqiad - T246721 [15:49:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:49:23] T246721: BGP: Investigate isolating codfw and eqiad - https://phabricator.wikimedia.org/T246721 [15:49:58] (03CR) 10Volans: "recheck" (031 comment) [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/578623 (https://phabricator.wikimedia.org/T218189) (owner: 10Guozr.im) [15:50:24] (03CR) 10jerkins-bot: [V: 04-1] CuminExecution: Capture Exception cumin.transports.WorkerError [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/578623 (https://phabricator.wikimedia.org/T218189) (owner: 10Guozr.im) [15:51:19] !log installing grub2 updates from Stretch point release 
[15:51:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:51:37] !log pt1979@cumin2001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) [15:51:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:53:27] !log volans@cumin1001 START - Cookbook sre.dns.netbox [15:53:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:53:51] 10Operations, 10ops-codfw, 10Traffic, 10Patch-For-Review: (Need by: TBD) rack/setup/install cp202[7-9], cp203[0-9], cp204[0-2] - https://phabricator.wikimedia.org/T247340 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['cp2031.codfw.wmnet'] ` and were **ALL** successful. [15:54:05] 10Operations, 10ops-codfw, 10Traffic, 10Patch-For-Review: (Need by: TBD) rack/setup/install cp202[7-9], cp203[0-9], cp204[0-2] - https://phabricator.wikimedia.org/T247340 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by pt1979 on cumin2001.codfw.wmnet for hosts: ` cp2033.codfw.wmnet ` The log... [15:54:09] 10Operations, 10Traffic, 10netops, 10Patch-For-Review: BGP: Investigate isolating codfw and eqiad - https://phabricator.wikimedia.org/T246721 (10ayounsi) Both v4 and v6 are seen as expected in LG: https://stat.ripe.net/widget/looking-glass#w.resource=208.80.154.0/23 https://stat.ripe.net/widget/looking-gla... 
[15:55:14] (03PS2) 10Alexandros Kosiaris: lvs: Rename eventstreams-tls to eventstreams [puppet] - 10https://gerrit.wikimedia.org/r/583075 (https://phabricator.wikimedia.org/T238658) [15:55:18] (03CR) 10Alexandros Kosiaris: [V: 03+2 C: 03+2] lvs: Rename eventstreams-tls to eventstreams [puppet] - 10https://gerrit.wikimedia.org/r/583075 (https://phabricator.wikimedia.org/T238658) (owner: 10Alexandros Kosiaris) [15:55:29] (03CR) 10Dzahn: [C: 03+2] DHCP: add miscweb1001.eqiad.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/583617 (https://phabricator.wikimedia.org/T247887) (owner: 10Dzahn) [15:55:32] PROBLEM - Check if active EventStreams endpoint is delivering messages. on icinga1001 is CRITICAL: CRITICAL: No EventStreams message was consumed from https://stream.wikimedia.org/v2/stream/recentchange within 10 seconds. https://wikitech.wikimedia.org/wiki/Event_Platform/EventStreams/Administration [15:56:46] PROBLEM - LVS HTTPS IPv4 on eventstreams.svc.eqiad.wmnet is CRITICAL: connect to address 10.2.2.34 and port 4892: No route to host https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [15:57:07] (03CR) 10BBlack: [C: 03+2] cp2027-42: add IPs to cache_hosts data [puppet] - 10https://gerrit.wikimedia.org/r/583634 (https://phabricator.wikimedia.org/T247340) (owner: 10BBlack) [15:57:18] (03PS3) 10Jbond: network: add new function to return ip lists used in ACLs [puppet] - 10https://gerrit.wikimedia.org/r/583340 (https://phabricator.wikimedia.org/T233945) [15:57:26] (03CR) 10Jbond: "thanks updated" (038 comments) [puppet] - 10https://gerrit.wikimedia.org/r/583340 (https://phabricator.wikimedia.org/T233945) (owner: 10Jbond) [15:57:52] RECOVERY - LVS HTTPS IPv4 on eventstreams.svc.eqiad.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 1100 bytes in 1.008 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [15:58:30] 10Operations, 10ops-codfw, 10Traffic, 10Patch-For-Review: (Need by: TBD) rack/setup/install cp202[7-9], cp203[0-9], cp204[0-2] - 
https://phabricator.wikimedia.org/T247340 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['cp2032.codfw.wmnet'] ` and were **ALL** successful. [15:58:35] !log volans@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [15:58:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:58:57] 10Operations, 10ops-codfw, 10Traffic, 10Patch-For-Review: (Need by: TBD) rack/setup/install cp202[7-9], cp203[0-9], cp204[0-2] - https://phabricator.wikimedia.org/T247340 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by pt1979 on cumin2001.codfw.wmnet for hosts: ` cp2034.codfw.wmnet ` The log... [15:59:51] wom 3 [16:00:04] godog and _joe_: That opportune time is upon us again. Time for a Puppet SWAT(Max 6 patches) deploy. Don't be afraid. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200326T1600). [16:00:04] No GERRIT patches in the queue for this window AFAICS. [16:00:53] (03CR) 10Elukey: [C: 03+2] cdh::hadoop: allow hadoop daemons to override ipv6 settings [puppet] - 10https://gerrit.wikimedia.org/r/583631 (https://phabricator.wikimedia.org/T240255) (owner: 10Elukey) [16:01:03] (03CR) 10Jcrespo: "Translation if not obvious already- all tests pass, there is an unnecessary extra import added only failing." [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/578623 (https://phabricator.wikimedia.org/T218189) (owner: 10Guozr.im) [16:02:07] elukey: yea, do 'multiple', or i can :) [16:02:20] mutante: please go! [16:02:22] thanks :) [16:02:54] ... done! [16:06:01] (03CR) 10Jforrester: "I'd need to be added to mediawiki-releasers to do the same for release-jenkins." 
[puppet] - 10https://gerrit.wikimedia.org/r/583512 (owner: 10Thcipriani) [16:07:13] (03PS1) 10Dzahn: site: add miscweb1002 with insetup role [puppet] - 10https://gerrit.wikimedia.org/r/583662 (https://phabricator.wikimedia.org/T247887) [16:07:44] 10Operations, 10Continuous-Integration-Infrastructure, 10Release-Engineering-Team-TODO, 10SRE-Access-Requests: Grant "contint-roots" and "releasers-mediawiki" to user jforrester - https://phabricator.wikimedia.org/T248597 (10Jdforrester-WMF) [16:09:28] (03PS2) 10Jforrester: CI: Add James_F to contint-roots and releasers-mediawiki [puppet] - 10https://gerrit.wikimedia.org/r/583512 (https://phabricator.wikimedia.org/T248597) (owner: 10Thcipriani) [16:10:05] !log pt1979@cumin2001 START - Cookbook sre.hosts.downtime [16:10:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:10:21] (03CR) 10Muehlenhoff: [C: 03+1] site: add miscweb1002 with insetup role [puppet] - 10https://gerrit.wikimedia.org/r/583662 (https://phabricator.wikimedia.org/T247887) (owner: 10Dzahn) [16:11:37] (03CR) 10BBlack: [C: 03+2] Add IPv6 for cp2027-42 [dns] - 10https://gerrit.wikimedia.org/r/583608 (https://phabricator.wikimedia.org/T247340) (owner: 10BBlack) [16:11:42] (03PS2) 10BBlack: Add IPv6 for cp2027-42 [dns] - 10https://gerrit.wikimedia.org/r/583608 (https://phabricator.wikimedia.org/T247340) [16:11:46] !log jmm@cumin2001 START - Cookbook sre.hosts.downtime [16:11:47] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [16:11:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:11:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:12:33] !log pt1979@cumin2001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) [16:12:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:12:53] !log stop advertising 2620:0:860::/46 from eqiad - T246721 [16:12:58] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:12:58] T246721: BGP: Investigate isolating codfw and eqiad - https://phabricator.wikimedia.org/T246721 [16:14:14] (03CR) 10Dzahn: [C: 03+2] site: add miscweb1002 with insetup role [puppet] - 10https://gerrit.wikimedia.org/r/583662 (https://phabricator.wikimedia.org/T247887) (owner: 10Dzahn) [16:14:45] !log rebooting mw2150 for some tests [16:14:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:14:55] !log pt1979@cumin2001 START - Cookbook sre.hosts.downtime [16:14:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:15:25] !log set cloudelastic-chi wikidatawiki_content to 0 replicas while reindexing [16:15:26] !log signing puppet cert for miscweb1002, installed buster, added insetup role (T247887) [16:15:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:15:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:15:34] T247887: Site: eqiad/codfw 2 VM request for miscweb - https://phabricator.wikimedia.org/T247887 [16:16:04] PROBLEM - Ensure hosts are not performing a change on every puppet run on puppetdb1002 is CRITICAL: CRITICAL: the following (8) node(s) change every puppet run: cp2029.codfw.wmnet, cp2028.codfw.wmnet, cp2027.codfw.wmnet, analytics1039.eqiad.wmnet, cp2030.codfw.wmnet, cp2033.codfw.wmnet, cp2032.codfw.wmnet, cp2031.codfw.wmnet https://wikitech.wikimedia.org/wiki/Puppet%23check_puppet_run_changes [16:17:04] (03PS1) 10Elukey: cdh::hadoop: use line instead of match for file_line [puppet] - 10https://gerrit.wikimedia.org/r/583666 [16:17:19] 10Operations, 10ops-codfw, 10Traffic, 10Patch-For-Review: (Need by: TBD) rack/setup/install cp202[7-9], cp203[0-9], cp204[0-2] - https://phabricator.wikimedia.org/T247340 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['cp2033.codfw.wmnet'] ` and were **ALL** successful. 
[16:17:46] (PS1) Dzahn: add webserver_misc_apps role to miscweb1002 [puppet] - https://gerrit.wikimedia.org/r/583667 (https://phabricator.wikimedia.org/T247887) [16:18:17] (CR) jerkins-bot: [V: -1] add webserver_misc_apps role to miscweb1002 [puppet] - https://gerrit.wikimedia.org/r/583667 (https://phabricator.wikimedia.org/T247887) (owner: Dzahn) [16:18:42] !log stop advertising 208.80.152.0/22 from eqiad - T246721 [16:18:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:18:47] T246721: BGP: Investigate isolating codfw and eqiad - https://phabricator.wikimedia.org/T246721 [16:19:41] !log pt1979@cumin2001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [16:19:43] (CR) Dzahn: [C: -1] "needs buster support (php 7.0 version in package names )" [puppet] - https://gerrit.wikimedia.org/r/583667 (https://phabricator.wikimedia.org/T247887) (owner: Dzahn) [16:19:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:21:08] (CR) Elukey: [C: +2] cdh::hadoop: use line instead of match for file_line [puppet] - https://gerrit.wikimedia.org/r/583666 (owner: Elukey) [16:21:57] (Restored) Hashar: contint: add acl package for file permissions tweak [puppet] - https://gerrit.wikimedia.org/r/583392 (https://phabricator.wikimedia.org/T210271) (owner: Hashar) [16:21:59] (CR) Muehlenhoff: "Given that nothing uses miscweb1002 host yet, it's also an option to apply the role and fix up Puppet errors as they occur?" [puppet] - https://gerrit.wikimedia.org/r/583667 (https://phabricator.wikimedia.org/T247887) (owner: Dzahn) [16:22:10] Operations, Wikimedia-Mailing-lists: Request for creation: les sans pagEs Mailing List - https://phabricator.wikimedia.org/T245066 (Anthere) You do well to send a reminder. I had missed the alert.... So, description "a mailing list for les sans pagEs project" And link on les sans pagEs being : https://f... 
[16:24:29] Operations, ops-codfw, Traffic, Patch-For-Review: (Need by: TBD) rack/setup/install cp202[7-9], cp203[0-9], cp204[0-2] - https://phabricator.wikimedia.org/T247340 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['cp2034.codfw.wmnet'] ` and were **ALL** successful. [16:24:53] (PS5) Guozr.im: CuminExecution: Capture Exception cumin.transports.WorkerError [software/wmfmariadbpy] - https://gerrit.wikimedia.org/r/578623 (https://phabricator.wikimedia.org/T218189) [16:25:14] Operations, ops-codfw, Traffic, Patch-For-Review: (Need by: TBD) rack/setup/install cp202[7-9], cp203[0-9], cp204[0-2] - https://phabricator.wikimedia.org/T247340 (ops-monitoring-bot) Script wmf-auto-reimage was launched by pt1979 on cumin2001.codfw.wmnet for hosts: ` cp2035.codfw.wmnet ` The log... [16:25:23] (CR) Guozr.im: "> Patch Set 4:" [software/wmfmariadbpy] - https://gerrit.wikimedia.org/r/578623 (https://phabricator.wikimedia.org/T218189) (owner: Guozr.im) [16:25:36] RECOVERY - Check if active EventStreams endpoint is delivering messages. on icinga1001 is OK: OK: An EventStreams message was consumed from https://stream.wikimedia.org/v2/stream/recentchange within 10 seconds. 
https://wikitech.wikimedia.org/wiki/Event_Platform/EventStreams/Administration [16:27:25] (PS1) Dzahn: misc_apps/httpd: add support for PHP on buster [puppet] - https://gerrit.wikimedia.org/r/583675 (https://phabricator.wikimedia.org/T247887) [16:28:00] (CR) jerkins-bot: [V: -1] misc_apps/httpd: add support for PHP on buster [puppet] - https://gerrit.wikimedia.org/r/583675 (https://phabricator.wikimedia.org/T247887) (owner: Dzahn) [16:28:59] (CR) Muehlenhoff: [C: -1] misc_apps/httpd: add support for PHP on buster (1 comment) [puppet] - https://gerrit.wikimedia.org/r/583675 (https://phabricator.wikimedia.org/T247887) (owner: Dzahn) [16:29:04] (CR) Ottomata: [C: +1] cdh::hadoop: allow hadoop daemons to override ipv6 settings [puppet] - https://gerrit.wikimedia.org/r/583631 (https://phabricator.wikimedia.org/T240255) (owner: Elukey) [16:29:16] (CR) Alexandros Kosiaris: [C: +1] network: add new function to return ip lists used in ACLs [puppet] - https://gerrit.wikimedia.org/r/583340 (https://phabricator.wikimedia.org/T233945) (owner: Jbond) [16:30:24] (PS1) Volans: sre.dns.netbox: pull the specific SHA1 [cookbooks] - https://gerrit.wikimedia.org/r/583676 (https://phabricator.wikimedia.org/T233183) [16:31:19] (CR) Ayounsi: [C: +2] Shrink eqiad/eqord bgp_out to /23 and /48 [homer/public] - https://gerrit.wikimedia.org/r/583579 (https://phabricator.wikimedia.org/T246721) (owner: Ayounsi) [16:31:38] (Merged) jenkins-bot: Shrink eqiad/eqord bgp_out to /23 and /48 [homer/public] - https://gerrit.wikimedia.org/r/583579 (https://phabricator.wikimedia.org/T246721) (owner: Ayounsi) [16:33:38] (PS6) Jcrespo: CuminExecution: Capture Exception cumin.transports.WorkerError [software/wmfmariadbpy] - https://gerrit.wikimedia.org/r/578623 (https://phabricator.wikimedia.org/T218189) (owner: Guozr.im) [16:34:11] !log stop exchanging full BGP view between eqiad and codfw - T246721 [16:34:16] Logged the message 
at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:34:17] T246721: BGP: Investigate isolating codfw and eqiad - https://phabricator.wikimedia.org/T246721 [16:34:28] (CR) Jcrespo: "I've edited the "removing sys import". You introduced that on a previous version. This patch doesn't remove anything related to that." [software/wmfmariadbpy] - https://gerrit.wikimedia.org/r/578623 (https://phabricator.wikimedia.org/T218189) (owner: Guozr.im) [16:34:51] Operations, ops-codfw, Traffic, Patch-For-Review: (Need by: TBD) rack/setup/install cp202[7-9], cp203[0-9], cp204[0-2] - https://phabricator.wikimedia.org/T247340 (ops-monitoring-bot) Script wmf-auto-reimage was launched by pt1979 on cumin2001.codfw.wmnet for hosts: ` cp2036.codfw.wmnet ` The log... [16:36:13] (PS1) Ottomata: Temporarilty disable webrequest deletion for 1 week [puppet] - https://gerrit.wikimedia.org/r/583678 (https://phabricator.wikimedia.org/T248600) [16:37:17] (CR) Guozr.im: "> Patch Set 6:" (1 comment) [software/wmfmariadbpy] - https://gerrit.wikimedia.org/r/578623 (https://phabricator.wikimedia.org/T218189) (owner: Guozr.im) [16:37:28] (CR) Joal: [C: +1] "+1 :)" [puppet] - https://gerrit.wikimedia.org/r/583678 (https://phabricator.wikimedia.org/T248600) (owner: Ottomata) [16:40:42] !log pt1979@cumin2001 START - Cookbook sre.hosts.downtime [16:40:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:41:16] (CR) Jcrespo: "Sadly, the advice we got didn't properly fixed the issue for the only config setup we care: :-(" [software/wmfmariadbpy] - https://gerrit.wikimedia.org/r/578623 (https://phabricator.wikimedia.org/T218189) (owner: Guozr.im) [16:42:16] (PS1) Arturo Borrero Gonzalez: openstack: queens: drop python2 packages [puppet] - https://gerrit.wikimedia.org/r/583680 (https://phabricator.wikimedia.org/T242766) [16:42:36] RECOVERY - Rate of JVM GC Old generation-s runs - 
cloudelastic1003-cloudelastic-chi-eqiad on cloudelastic1003 is OK: (C)100 gt (W)80 gt 74.24 https://wikitech.wikimedia.org/wiki/Search%23Using_jstack_or_jmap_or_other_similar_tools_to_view_logs https://grafana.wikimedia.org/d/000000462/elasticsearch-memory?orgId=1&var-exported_cluster=cloudelastic-chi-eqiad&var-instance=cloudelastic1003&panelId=37 [16:42:54] PROBLEM - Ensure traffic_exporter binds on port 9322 and responds to HTTP requests on cp3056 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [16:43:14] !log pt1979@cumin2001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) [16:43:16] Operations, serviceops, Patch-For-Review: decom old appservers in eqiad - https://phabricator.wikimedia.org/T247780 (wiki_willy) Thanks for the heads up @Dzahn . @Jclark-ctr has been working on some of the other decom tasks this past week, but as long as this one doesn't show up on the eqiad workboa... [16:43:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:45:11] Operations, netops: IRR updates needed - https://phabricator.wikimedia.org/T235886 (ayounsi) [16:45:18] Operations, Traffic, netops, Patch-For-Review: BGP: Investigate isolating codfw and eqiad - https://phabricator.wikimedia.org/T246721 (ayounsi) [16:45:26] Operations, netops: IRR updates needed - https://phabricator.wikimedia.org/T235886 (ayounsi) [16:45:31] (CR) BBlack: sre.dns.netbox: pull the specific SHA1 (1 comment) [cookbooks] - https://gerrit.wikimedia.org/r/583676 (https://phabricator.wikimedia.org/T233183) (owner: Volans) [16:46:05] Operations, netops: esams/knams: advertise 185.15.58.0/23 instead of 185.15.56.0/22 - https://phabricator.wikimedia.org/T207753 (ayounsi) [16:46:11] RECOVERY - Ensure traffic_exporter binds on port 9322 and responds to HTTP requests on cp3056 is OK: HTTP OK: HTTP/1.0 200 OK - 22392 bytes in 0.257 second response time 
https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [16:47:27] (CR) CRusnov: [C: +1] "LGTM" [cookbooks] - https://gerrit.wikimedia.org/r/583676 (https://phabricator.wikimedia.org/T233183) (owner: Volans) [16:48:13] (CR) Jcrespo: "> Patch Set 6:" [software/wmfmariadbpy] - https://gerrit.wikimedia.org/r/578623 (https://phabricator.wikimedia.org/T218189) (owner: Guozr.im) [16:49:14] Operations, ops-codfw, Traffic, Patch-For-Review: (Need by: TBD) rack/setup/install cp202[7-9], cp203[0-9], cp204[0-2] - https://phabricator.wikimedia.org/T247340 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['cp2035.codfw.wmnet'] ` and were **ALL** successful. [16:49:22] Operations, netops: IRR updates needed - https://phabricator.wikimedia.org/T235886 (ayounsi) p:Low→High a:ayounsi [16:49:42] Operations, Traffic, netops, Patch-For-Review: BGP: Investigate isolating codfw and eqiad - https://phabricator.wikimedia.org/T246721 (ayounsi) This is all done. Last step is to update IRRs, tracked in T235886. [16:50:19] !log pt1979@cumin2001 START - Cookbook sre.hosts.downtime [16:50:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:52:43] !log pt1979@cumin2001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) [16:52:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:58:43] Operations, ops-codfw, Traffic, Patch-For-Review: (Need by: TBD) rack/setup/install cp202[7-9], cp203[0-9], cp204[0-2] - https://phabricator.wikimedia.org/T247340 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['cp2036.codfw.wmnet'] ` and were **ALL** successful. 
[17:02:25] (PS1) Jgreen: nsca_frack.cfg.erb updates [puppet] - https://gerrit.wikimedia.org/r/583685 (https://phabricator.wikimedia.org/T247855) [17:04:36] (CR) 20after4: [C: +1] zuul: provision the scap repository [puppet] - https://gerrit.wikimedia.org/r/579587 (https://phabricator.wikimedia.org/T215458) (owner: Hashar) [17:04:40] (CR) Elukey: "I think that the changes are looking good, the diff is basically the removal of parameters from profile::kibana (the new ones are not disp" [puppet] - https://gerrit.wikimedia.org/r/583414 (https://phabricator.wikimedia.org/T246961) (owner: Mstyles) [17:04:49] (CR) Guozr.im: "> Patch Set 6:" [software/wmfmariadbpy] - https://gerrit.wikimedia.org/r/578623 (https://phabricator.wikimedia.org/T218189) (owner: Guozr.im) [17:04:56] (PS6) 20after4: zuul: provision the scap repository [puppet] - https://gerrit.wikimedia.org/r/579587 (https://phabricator.wikimedia.org/T215458) (owner: Hashar) [17:07:21] Operations, netops: esams/knams: advertise 185.15.58.0/23 instead of 185.15.56.0/22 - https://phabricator.wikimedia.org/T207753 (ayounsi) IRR objects created. [17:11:33] Operations, Continuous-Integration-Infrastructure, Release-Engineering-Team-TODO, SRE-Access-Requests, Patch-For-Review: Grant "contint-roots" and "releasers-mediawiki" to user jforrester - https://phabricator.wikimedia.org/T248597 (Volans) p:Triage→Medium a:thcipriani Over to @th... [17:12:00] (CR) Jcrespo: "> Patch Set 6:" [software/wmfmariadbpy] - https://gerrit.wikimedia.org/r/578623 (https://phabricator.wikimedia.org/T218189) (owner: Guozr.im) [17:12:31] Operations, Continuous-Integration-Infrastructure, Release-Engineering-Team-TODO, SRE-Access-Requests, Patch-For-Review: Grant "contint-roots" and "releasers-mediawiki" to user jforrester - https://phabricator.wikimedia.org/T248597 (thcipriani) approved. 
[17:13:31] (CR) Ladsgroup: CuminExecution: Capture Exception cumin.transports.WorkerError (1 comment) [software/wmfmariadbpy] - https://gerrit.wikimedia.org/r/578623 (https://phabricator.wikimedia.org/T218189) (owner: Guozr.im) [17:15:37] (CR) 20after4: [C: +1] "https://puppet-compiler.wmflabs.org/compiler1001/21591/" [puppet] - https://gerrit.wikimedia.org/r/579587 (https://phabricator.wikimedia.org/T215458) (owner: Hashar) [17:16:43] (CR) Volans: [C: +2] "Approved on task." [puppet] - https://gerrit.wikimedia.org/r/583512 (https://phabricator.wikimedia.org/T248597) (owner: Thcipriani) [17:17:02] (PS3) Volans: CI: Add James_F to contint-roots and releasers-mediawiki [puppet] - https://gerrit.wikimedia.org/r/583512 (https://phabricator.wikimedia.org/T248597) (owner: Thcipriani) [17:18:30] (PS1) Giuseppe Lavagetto: services_proxy: higher timeout for eventgate-main, more retries [puppet] - https://gerrit.wikimedia.org/r/583688 (https://phabricator.wikimedia.org/T248602) [17:19:00] (PS7) 20after4: zuul: provision the scap repository [puppet] - https://gerrit.wikimedia.org/r/579587 (https://phabricator.wikimedia.org/T215458) (owner: Hashar) [17:19:42] (CR) 20after4: [C: +1] "compiled with puppet-compiler: https://puppet-compiler.wmflabs.org/compiler1001/21591/" [puppet] - https://gerrit.wikimedia.org/r/579587 (https://phabricator.wikimedia.org/T215458) (owner: Hashar) [17:20:16] (CR) Jcrespo: "+1" (1 comment) [software/wmfmariadbpy] - https://gerrit.wikimedia.org/r/578623 (https://phabricator.wikimedia.org/T218189) (owner: Guozr.im) [17:20:18] (CR) jerkins-bot: [V: -1] zuul: provision the scap repository [puppet] - https://gerrit.wikimedia.org/r/579587 (https://phabricator.wikimedia.org/T215458) (owner: Hashar) [17:22:13] Operations, Continuous-Integration-Infrastructure, Release-Engineering-Team-TODO, SRE-Access-Requests, Patch-For-Review: Grant "contint-roots" and 
"releasers-mediawiki" to user jforrester - https://phabricator.wikimedia.org/T248597 (Volans) a:thcipriani→Volans Patch merged, changes wi... [17:22:30] volans: Thanks. [17:22:40] yw :) [17:22:54] don't destroy everything :-P [17:23:07] That's the plan. :-) [17:24:38] (CR) Giuseppe Lavagetto: [C: +2] services_proxy: higher timeout for eventgate-main, more retries [puppet] - https://gerrit.wikimedia.org/r/583688 (https://phabricator.wikimedia.org/T248602) (owner: Giuseppe Lavagetto) [17:26:02] just something ;) [17:26:29] very selective and limited destruction [17:26:59] Operations, LDAP-Access-Requests: Add Huei Tan to `wmf` LDAF group - https://phabricator.wikimedia.org/T248605 (hueitan) [17:28:20] (PS2) Volans: sre.dns.netbox: pull the specific SHA1 [cookbooks] - https://gerrit.wikimedia.org/r/583676 (https://phabricator.wikimedia.org/T233183) [17:28:33] !log changing email for "Unicorn17glitter" and "Tameka unicorn" [17:28:55] (CR) Volans: "done" (1 comment) [cookbooks] - https://gerrit.wikimedia.org/r/583676 (https://phabricator.wikimedia.org/T233183) (owner: Volans) [17:29:43] (PS1) Bstorm: wikireplicas: Add wb_terms_no_longer_updated view name [puppet] - https://gerrit.wikimedia.org/r/583693 (https://phabricator.wikimedia.org/T248592) [17:33:43] (CR) Addshore: [C: +1] wikireplicas: Add wb_terms_no_longer_updated view name [puppet] - https://gerrit.wikimedia.org/r/583693 (https://phabricator.wikimedia.org/T248592) (owner: Bstorm) [17:35:15] PROBLEM - Widespread puppet agent failures on icinga1001 is CRITICAL: 0.01633 ge 0.01 https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/yOxVDGvWk/puppet [17:36:09] (CR) Volans: "> Patch Set 6:" [software/wmfmariadbpy] - https://gerrit.wikimedia.org/r/578623 (https://phabricator.wikimedia.org/T218189) (owner: Guozr.im) [17:36:54] Operations, ops-eqiad, netops: Three ports on asw2-d-eqiad are not working as 
expected - https://phabricator.wikimedia.org/T247881 (wiki_willy) a:Cmjohnson [17:38:31] bstorm_: the puppet failures seems to be related to all the cloudvirt hosts for the: Package[nova-compute] failure [17:38:35] it seems it failed to install it [17:40:52] see https://puppetboard.wikimedia.org/nodes?status=failed [17:41:39] (PS2) Ottomata: Temporarilty disable webrequest deletion for 1 week [puppet] - https://gerrit.wikimedia.org/r/583678 (https://phabricator.wikimedia.org/T248600) [17:41:47] andrewbogott and arturo: ^^ that'd be the upgrade I imagine? [17:42:24] I downtimed all the cloudvirts, is that on some different hosts? [17:42:52] Nope. It's the same. I think its just an icinga thing that dodged the downtime :) [17:43:02] "lots of puppet failures" apparently alerts here anyway [17:43:13] If it's known, then nothing to see here [17:43:27] I was just thinking in case you didn't know :) [17:43:48] (CR) Herron: "Thanks for this!" [puppet] - https://gerrit.wikimedia.org/r/583414 (https://phabricator.wikimedia.org/T246961) (owner: Mstyles) [17:44:17] bstorm_: yeah, puppet is likely broken on all the cloudvirts until I get to the end of this [17:44:37] Ok, sounds good. [17:45:14] volans: There's an openstack upgrade in progress, so puppet failure is expected there. Thanks for the heads up. [17:45:43] (CR) Dwisehaupt: [C: +1] "Looks good. Thanks for removing those two hosts in this chunk." [puppet] - https://gerrit.wikimedia.org/r/583685 (https://phabricator.wikimedia.org/T247855) (owner: Jgreen) [17:46:20] (CR) Bstorm: [C: +2] wikireplicas: Add wb_terms_no_longer_updated view name [puppet] - https://gerrit.wikimedia.org/r/583693 (https://phabricator.wikimedia.org/T248592) (owner: Bstorm) [17:46:31] ack, feel free to disable it in that cluster :) [17:48:02] andrewbogott: is the openstack upgrade why a bunch of toolforge irc bots disconnected ~5pm UTC? 
[17:48:17] RhinosF1: likely yes [17:48:48] Thanks, i’ll check my logs tonight (that now work) but looks toolforge-wide [17:48:54] (CR) Guozr.im: "> Patch Set 6:" (1 comment) [software/wmfmariadbpy] - https://gerrit.wikimedia.org/r/578623 (https://phabricator.wikimedia.org/T218189) (owner: Guozr.im) [17:53:13] Operations, LDAP-Access-Requests: LDAP access to the wmf group for Pita - https://phabricator.wikimedia.org/T247722 (Volans) p:Triage→Medium I think I see the issue here, @Jpita was added with the `jpita-ctr@` account before (`uid=josepita`) while the current one is `uid=jpita`. I think we just n... [17:56:12] PROBLEM - Work requests waiting in Zuul Gearman server on contint1001 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [150.0] https://www.mediawiki.org/wiki/Continuous_integration/Zuul https://grafana.wikimedia.org/dashboard/db/zuul-gearman?panelId=10&fullscreen&orgId=1 [17:58:18] RECOVERY - Widespread puppet agent failures on icinga1001 is OK: (C)0.01 ge (W)0.006 ge 0.001256 https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/yOxVDGvWk/puppet [17:59:51] !log updating wikireplica views on labsdb1009/10/11/12 for T248592 [18:00:04] RoanKattouw, Niharika, and Urbanecm: #bothumor My software never has bugs. It just develops random features. Rise for Morning SWAT(Max 6 patches). (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200326T1800). [18:00:04] kart_ and cscott: A patch you scheduled for Morning SWAT(Max 6 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [18:02:01] * kart_ is here [18:02:50] Anyone else or should I self deploy? [18:03:02] Urbanecm Niharika or RoanKattouw ? [18:03:26] kart_: I'm here in case i can be helpful, but feel free to self-deploy! [18:03:35] kart_: I'm in a meeting, sorry. If you feel confident self-deploying, go ahead. 
[18:03:52] OK. Starting my patch.. [18:04:52] RECOVERY - Check no envoy runtime configuration is left persistent on mwdebug1001 is OK: HTTP OK: HTTP/1.1 200 OK - 286 bytes in 0.001 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23Envoy [18:05:55] (PS1) Vgutierrez: Release 8.0.6-1wm4 [debs/trafficserver] - https://gerrit.wikimedia.org/r/583715 (https://phabricator.wikimedia.org/T245616) [18:06:13] Urbanecm: Do we use mwdebug1002 or mwdebug1001? [18:06:20] kart_: doesn't matter which one you use [18:06:28] you can use either [18:06:31] OK. Thanks! [18:10:42] (CR) Andrew Bogott: [C: +2] Revert "Horizon: put in maintenance mode for the pike=>queens upgrade" [puppet] - https://gerrit.wikimedia.org/r/583602 (owner: Andrew Bogott) [18:14:22] (CR) Nuria: [C: +1] Temporarilty disable webrequest deletion for 1 week [puppet] - https://gerrit.wikimedia.org/r/583678 (https://phabricator.wikimedia.org/T248600) (owner: Ottomata) [18:15:02] (PS3) Ottomata: Temporarilty disable webrequest deletion for 1 week [puppet] - https://gerrit.wikimedia.org/r/583678 (https://phabricator.wikimedia.org/T248600) [18:15:11] (CR) Ottomata: [V: +2 C: +2] Temporarilty disable webrequest deletion for 1 week [puppet] - https://gerrit.wikimedia.org/r/583678 (https://phabricator.wikimedia.org/T248600) (owner: Ottomata) [18:17:32] Operations, LDAP-Access-Requests: LDAP access to the wmf group for Pita - https://phabricator.wikimedia.org/T247722 (MoritzMuehlenhoff) >>! In T247722#6002818, @Volans wrote: > I think I see the issue here, @Jpita was added with the `jpita-ctr@` account before (`uid=josepita`) while the current one is `u... 
[18:17:45] Operations, SRE-Access-Requests: Requesting access to analytics for tarrow - https://phabricator.wikimedia.org/T248498 (Nuria) Approved on my end, please read the data access guidelines in detail to understand what is allowed and not in the data environment: https://wikitech.wikimedia.org/wiki/Analytics/... [18:19:18] Operations, SRE-Access-Requests: Requesting access to analytics for ItamarWMDE - https://phabricator.wikimedia.org/T248482 (Nuria) Approved on my end , please read the data access guidelines: https://wikitech.wikimedia.org/wiki/Analytics/Data_Access_Guidelines [18:20:02] Operations, SRE-Access-Requests, Patch-For-Review: Requesting access to event logging data in hive for joewalsh - https://phabricator.wikimedia.org/T247636 (Nuria) Approved on my end, please read https://wikitech.wikimedia.org/wiki/Analytics/Data_Access_Guidelines [18:21:08] Operations, Wikimedia-Mailing-lists: Request for creation: les sans pagEs Mailing List - https://phabricator.wikimedia.org/T245066 (Volans) Open→Resolved a:Volans @Anthere List created, here are the URLs for [[ https://lists.wikimedia.org/mailman/listinfo/lessanspages | listinfo ]] and [[ ht... [18:24:45] Change seems fine. Deploying.. [18:28:45] !log kartik@deploy1001 Synchronized php-1.35.0-wmf.25/extensions/ContentTranslation: SWAT: [[gerrit|583561|Fix handling of user added categories(T248302)]] (duration: 01m 09s) [18:29:56] Looks like I miss space? :/ [18:31:01] Operations, cloud-services-team (Kanban): Migrate remaining self-hosted puppet masters to Puppet 5 / facter 3 - https://phabricator.wikimedia.org/T241719 (Krenair) [18:31:19] I'll manually update task, never mind. [18:31:32] Operations, LDAP-Access-Requests: LDAP access to the wmf group for Pita - https://phabricator.wikimedia.org/T247722 (Volans) a:ArielGlenn→Jpita @Jpita could you confirm that we can "offboard" the previous `josepita` account in favor of the current `jpita` one? 
[18:34:24] cscott: I'm done, if you're around to take over for SWAT. [18:34:38] (PS1) Volans: admin: update jpita account [puppet] - https://gerrit.wikimedia.org/r/583720 (https://phabricator.wikimedia.org/T247722) [18:34:43] (PS1) Cmjohnson: Adding mgmt dns an-druid100[12] and druid100[78] [dns] - https://gerrit.wikimedia.org/r/583721 (https://phabricator.wikimedia.org/T245569) [18:35:21] (CR) Volans: [C: -2] "To be merged only when the changed to LDAP will be performed, see task." [puppet] - https://gerrit.wikimedia.org/r/583720 (https://phabricator.wikimedia.org/T247722) (owner: Volans) [18:35:23] (CR) Jgreen: [C: +2] nsca_frack.cfg.erb updates [puppet] - https://gerrit.wikimedia.org/r/583685 (https://phabricator.wikimedia.org/T247855) (owner: Jgreen) [18:37:48] Operations, SRE-Access-Requests: Requesting access to analytics for ItamarWMDE - https://phabricator.wikimedia.org/T248482 (Volans) a:Nuria→Volans [18:38:17] Operations, SRE-Access-Requests: Requesting access to analytics for tarrow - https://phabricator.wikimedia.org/T248498 (Volans) a:Nuria→Volans [18:39:25] (PS2) Volans: add joewalsh to analytics-privatedata-users and remove from researchers [puppet] - https://gerrit.wikimedia.org/r/580853 (https://phabricator.wikimedia.org/T247636) (owner: ArielGlenn) [18:41:07] (PS1) Faidon Liambotis: reports/accounting: remove LRU caching of output [software/netbox-extras] - https://gerrit.wikimedia.org/r/583723 [18:42:32] Operations, ops-eqiad, Analytics, DC-Ops, Patch-For-Review: (Need by: TBD) rack/setup/install an-druid100[12] and druid100[78] - https://phabricator.wikimedia.org/T245569 (Cmjohnson) a:Jclark-ctr→Cmjohnson [18:42:46] Seems cscott is not around and I need to rush back. I'll consider this SWAT as done then.. 
[18:44:29] Operations, Performance-Team, Traffic: Production load.php spends ~ 10% time doing output compression within PHP - https://phabricator.wikimedia.org/T242478 (Krinkle) Summary from {T247020}: * The fix for this task (T242478) made it so that handle compression at the edge again, instead of at the app... [18:45:05] (CR) Cmjohnson: [C: +2] Adding mgmt dns an-druid100[12] and druid100[78] [dns] - https://gerrit.wikimedia.org/r/583721 (https://phabricator.wikimedia.org/T245569) (owner: Cmjohnson) [18:46:48] (CR) Jcrespo: "Sorry, this is indeed fixed, only the line fix is needed for +2. I tested it on a different branch and works as expected." [software/wmfmariadbpy] - https://gerrit.wikimedia.org/r/578623 (https://phabricator.wikimedia.org/T218189) (owner: Guozr.im) [18:47:50] Operations, ops-codfw, Traffic, Patch-For-Review: (Need by: TBD) rack/setup/install cp202[7-9], cp203[0-9], cp204[0-2] - https://phabricator.wikimedia.org/T247340 (ops-monitoring-bot) Script wmf-auto-reimage was launched by pt1979 on cumin2001.codfw.wmnet for hosts: ` cp2037.codfw.wmnet ` The log... [18:48:29] (CR) Volans: [C: +2] add joewalsh to analytics-privatedata-users and remove from researchers [puppet] - https://gerrit.wikimedia.org/r/580853 (https://phabricator.wikimedia.org/T247636) (owner: ArielGlenn) [18:49:42] Operations, SRE-Access-Requests, Patch-For-Review: Requesting access to event logging data in hive for joewalsh - https://phabricator.wikimedia.org/T247636 (Volans) Open→Resolved a:Nuria→Volans @JoeWalsh the patch was merged, within 30 minutes it will be applied everywhere. Please rea... [18:52:15] Operations, SRE-Access-Requests: Requesting access to analytics for tarrow - https://phabricator.wikimedia.org/T248498 (Volans) If no objections will be raised by Monday evening EU time the related patch could be sent and merged. 
[18:52:33] Operations, SRE-Access-Requests: Requesting access to analytics for ItamarWMDE - https://phabricator.wikimedia.org/T248482 (Volans) If no objections will be raised by Monday afternoon EU time the related patch could be sent and merged. [18:53:54] (PS2) Vgutierrez: Release 8.0.6-1wm4 [debs/trafficserver] - https://gerrit.wikimedia.org/r/583715 (https://phabricator.wikimedia.org/T245616) [18:54:31] (CR) Vgutierrez: "Tested on labs :)" [debs/trafficserver] - https://gerrit.wikimedia.org/r/583715 (https://phabricator.wikimedia.org/T245616) (owner: Vgutierrez) [18:55:02] RECOVERY - Work requests waiting in Zuul Gearman server on contint1001 is OK: OK: Less than 100.00% above the threshold [90.0] https://www.mediawiki.org/wiki/Continuous_integration/Zuul https://grafana.wikimedia.org/dashboard/db/zuul-gearman?panelId=10&fullscreen&orgId=1 [18:55:18] (PS1) Faidon Liambotis: reports/accounting: avoid evaluating formulas [software/netbox-extras] - https://gerrit.wikimedia.org/r/583725 [18:56:08] (CR) Faidon Liambotis: "This is untested - please test :)" [software/netbox-extras] - https://gerrit.wikimedia.org/r/583723 (owner: Faidon Liambotis) [18:57:16] Operations, ops-codfw, Traffic, Patch-For-Review: (Need by: TBD) rack/setup/install cp202[7-9], cp203[0-9], cp204[0-2] - https://phabricator.wikimedia.org/T247340 (ops-monitoring-bot) Script wmf-auto-reimage was launched by pt1979 on cumin2001.codfw.wmnet for hosts: ` cp2038.codfw.wmnet ` The log... [18:57:40] Operations, netops: roll out sensible flow-table-sizes to Juniper core routers with sampling enabled - https://phabricator.wikimedia.org/T248394 (CDanis) Oh, for posterity, it definitely worked -- here's bytes + packets reported by netflow with dst IP == any eqsin loadbalancer IP{: {F31703601} [19:00:04] twentyafterfound and dduvall: #bothumor My software never has bugs. It just develops random features. Rise for Mediawiki train - American Version. 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200326T1900). [19:03:21] !log pt1979@cumin2001 START - Cookbook sre.hosts.downtime [19:05:19] (CR) Volans: [C: +1] "LGTM" [software/netbox-extras] - https://gerrit.wikimedia.org/r/583723 (owner: Faidon Liambotis) [19:05:41] !log pt1979@cumin2001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) [19:10:43] Operations, ops-codfw, Traffic, Patch-For-Review: (Need by: TBD) rack/setup/install cp202[7-9], cp203[0-9], cp204[0-2] - https://phabricator.wikimedia.org/T247340 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['cp2037.codfw.wmnet'] ` and were **ALL** successful. [19:12:45] !log pt1979@cumin2001 START - Cookbook sre.hosts.downtime [19:12:57] Operations, ops-eqiad, DC-Ops, cloud-services-team (Hardware): (Need by: 2020-04-01) rack/setup/install cloudcontrol1005 - https://phabricator.wikimedia.org/T247471 (Jclark-ctr) [19:14:03] Operations, ops-eqiad, DC-Ops, cloud-services-team (Hardware): (Need by: 2020-04-01) rack/setup/install cloudcontrol1005 - https://phabricator.wikimedia.org/T247471 (Jclark-ctr) host racked c5 u31 switchport 30 [19:14:23] Operations, ops-eqiad, DC-Ops, cloud-services-team (Hardware): (Need by: 2020-04-01) rack/setup/install cloudcontrol1005 - https://phabricator.wikimedia.org/T247471 (Jclark-ctr) Open→Resolved a:Jclark-ctr→Cmjohnson [19:14:31] Operations, ops-eqiad, DC-Ops, cloud-services-team (Hardware): (Need by: 2020-04-01) rack/setup/install cloudcontrol1005 - https://phabricator.wikimedia.org/T247471 (Jclark-ctr) Resolved→Open [19:15:15] !log pt1979@cumin2001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) [19:17:09] (PS1) 20after4: all wikis to 1.35.0-wmf.25 refs T233873 [mediawiki-config] - https://gerrit.wikimedia.org/r/583731 [19:17:11] (CR) Volans: [C: +1] "LGTM, possible optional improvement inline" (1 comment) [software/netbox-extras] - 
10https://gerrit.wikimedia.org/r/583725 (owner: 10Faidon Liambotis) [19:17:13] (03CR) 1020after4: [C: 03+2] all wikis to 1.35.0-wmf.25 refs T233873 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/583731 (owner: 1020after4) [19:17:56] (03Merged) 10jenkins-bot: all wikis to 1.35.0-wmf.25 refs T233873 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/583731 (owner: 1020after4) [19:20:09] 10Operations, 10ops-codfw, 10Traffic, 10Patch-For-Review: (Need by: TBD) rack/setup/install cp202[7-9], cp203[0-9], cp204[0-2] - https://phabricator.wikimedia.org/T247340 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['cp2038.codfw.wmnet'] ` and were **ALL** successful. [19:20:17] 10Operations, 10ops-codfw, 10Traffic, 10Patch-For-Review: (Need by: TBD) rack/setup/install cp202[7-9], cp203[0-9], cp204[0-2] - https://phabricator.wikimedia.org/T247340 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by pt1979 on cumin2001.codfw.wmnet for hosts: ` cp2039.codfw.wmnet ` The log... [19:20:31] !log twentyafterfour@deploy1001 rebuilt and synchronized wikiversions files: all wikis to 1.35.0-wmf.25 refs T233873 [19:20:49] 10Operations, 10ops-codfw, 10Traffic, 10Patch-For-Review: (Need by: TBD) rack/setup/install cp202[7-9], cp203[0-9], cp204[0-2] - https://phabricator.wikimedia.org/T247340 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by pt1979 on cumin2001.codfw.wmnet for hosts: ` cp2040.codfw.wmnet ` The log... 
[19:21:06] (03PS7) 10Guozr.im: CuminExecution: Capture Exception cumin.transports.WorkerError [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/578623 (https://phabricator.wikimedia.org/T218189) [19:27:23] 10Operations, 10LDAP-Access-Requests: Add Huei Tan to `wmf` LDAP group - https://phabricator.wikimedia.org/T248605 (10Volans) p:05Triage→03Medium [19:28:49] (03PS1) 10Jgreen: nsca_frack_cfg.erb updates [puppet] - 10https://gerrit.wikimedia.org/r/583732 (https://phabricator.wikimedia.org/T247855) [19:29:27] (03PS1) 10Cmjohnson: Adding production dns an-druid100[12] and druid100[78] [dns] - 10https://gerrit.wikimedia.org/r/583733 (https://phabricator.wikimedia.org/T245569) [19:33:03] PROBLEM - Rate of JVM GC Old generation-s runs - cloudelastic1003-cloudelastic-chi-eqiad on cloudelastic1003 is CRITICAL: 118 gt 100 https://wikitech.wikimedia.org/wiki/Search%23Using_jstack_or_jmap_or_other_similar_tools_to_view_logs https://grafana.wikimedia.org/d/000000462/elasticsearch-memory?orgId=1&var-exported_cluster=cloudelastic-chi-eqiad&var-instance=cloudelastic1003&panelId=37 [19:35:47] !log pt1979@cumin2001 START - Cookbook sre.hosts.downtime [19:37:21] !log pt1979@cumin2001 START - Cookbook sre.hosts.downtime [19:38:14] !log pt1979@cumin2001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) [19:40:40] !log pt1979@cumin2001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) [19:41:28] (03PS2) 10Jgreen: nsca_frack_cfg.erb updates [puppet] - 10https://gerrit.wikimedia.org/r/583732 (https://phabricator.wikimedia.org/T247855) [19:42:27] PROBLEM - Host cp2039 is DOWN: PING CRITICAL - Packet loss = 100% [19:43:05] RECOVERY - Host cp2039 is UP: PING OK - Packet loss = 0%, RTA = 36.20 ms [19:44:12] 10Operations, 10ops-codfw, 10Traffic, 10Patch-For-Review: (Need by: TBD) rack/setup/install cp202[7-9], cp203[0-9], cp204[0-2] - https://phabricator.wikimedia.org/T247340 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['cp2039.codfw.wmnet'] ` and were 
**ALL** successful. [19:45:30] 10Operations, 10ops-codfw, 10Traffic, 10Patch-For-Review: (Need by: TBD) rack/setup/install cp202[7-9], cp203[0-9], cp204[0-2] - https://phabricator.wikimedia.org/T247340 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['cp2040.codfw.wmnet'] ` and were **ALL** successful. [19:48:50] (03PS2) 10Cmjohnson: Adding production dns an-druid100[12] and druid100[78] [dns] - 10https://gerrit.wikimedia.org/r/583733 (https://phabricator.wikimedia.org/T245569) [19:55:04] PROBLEM - rsyslog in eqiad is failing to deliver messages on icinga1001 is CRITICAL: action=fwd_centrallog2001.codfw.wmnet:6514 https://wikitech.wikimedia.org/wiki/Rsyslog https://grafana.wikimedia.org/d/000000596/rsyslog?var-datasource=eqiad+prometheus/ops [19:56:26] RECOVERY - rsyslog in eqiad is failing to deliver messages on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Rsyslog https://grafana.wikimedia.org/d/000000596/rsyslog?var-datasource=eqiad+prometheus/ops [19:59:25] (03PS1) 10CDanis: phased rollout of sensible flow-table-sizes [homer/public] - 10https://gerrit.wikimedia.org/r/583740 (https://phabricator.wikimedia.org/T248394) [20:03:30] PROBLEM - Rate of JVM GC Old generation-s runs - cloudelastic1004-cloudelastic-chi-eqiad on cloudelastic1004 is CRITICAL: 112.9 gt 100 https://wikitech.wikimedia.org/wiki/Search%23Using_jstack_or_jmap_or_other_similar_tools_to_view_logs https://grafana.wikimedia.org/d/000000462/elasticsearch-memory?orgId=1&var-exported_cluster=cloudelastic-chi-eqiad&var-instance=cloudelastic1004&panelId=37 [20:03:57] (03CR) 10Cmjohnson: [C: 03+2] Adding production dns an-druid100[12] and druid100[78] [dns] - 10https://gerrit.wikimedia.org/r/583733 (https://phabricator.wikimedia.org/T245569) (owner: 10Cmjohnson) [20:12:44] PROBLEM - Rate of JVM GC Old generation-s runs - cloudelastic1004-cloudelastic-chi-eqiad on cloudelastic1004 is CRITICAL: 108.8 gt 100 
https://wikitech.wikimedia.org/wiki/Search%23Using_jstack_or_jmap_or_other_similar_tools_to_view_logs https://grafana.wikimedia.org/d/000000462/elasticsearch-memory?orgId=1&var-exported_cluster=cloudelastic-chi-eqiad&var-instance=cloudelastic1004&panelId=37 [20:19:13] (03PS1) 10DannyS712: [Beta cluster] add a fake 'UselessRightForTesting' to available rights [mediawiki-config] - 10https://gerrit.wikimedia.org/r/583745 (https://phabricator.wikimedia.org/T241503) [20:20:41] (03PS2) 10DannyS712: [Beta cluster] add a fake 'UselessRightForTesting' to available rights [mediawiki-config] - 10https://gerrit.wikimedia.org/r/583745 (https://phabricator.wikimedia.org/T241503) [20:32:53] (03PS1) 10CDanis: depool ulsfo for router maintenance [dns] - 10https://gerrit.wikimedia.org/r/583748 (https://phabricator.wikimedia.org/T248394) [20:34:53] (03CR) 10CDanis: [C: 03+2] depool ulsfo for router maintenance [dns] - 10https://gerrit.wikimedia.org/r/583748 (https://phabricator.wikimedia.org/T248394) (owner: 10CDanis) [20:36:29] !log depool ulsfo [20:39:10] uh stashbot_ are you okay [20:52:10] !log cdanis@cr3-ulsfo> request system reboot [21:10:09] 10Operations, 10DC-Ops, 10decommission, 10fundraising-tech-ops: decommission bismuth.frack.eqiad.wmnet - https://phabricator.wikimedia.org/T248516 (10Jgreen) [21:11:17] 10Operations, 10ops-codfw, 10Traffic, 10Patch-For-Review: (Need by: TBD) rack/setup/install cp202[7-9], cp203[0-9], cp204[0-2] - https://phabricator.wikimedia.org/T247340 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by pt1979 on cumin2001.codfw.wmnet for hosts: ` cp2041.codfw.wmnet ` The log... 
[21:11:46] 10Operations, 10DC-Ops, 10decommission: decommission heka.frack.codfw.wmnet - https://phabricator.wikimedia.org/T248627 (10Jgreen) [21:12:20] (03PS1) 10CDanis: Revert "depool ulsfo for router maintenance" [dns] - 10https://gerrit.wikimedia.org/r/583755 (https://phabricator.wikimedia.org/T248394) [21:12:22] (03PS1) 10Jgreen: remove bismuth.frack.eqiad.wmnet and heka.frack.codfw.wmnet [dns] - 10https://gerrit.wikimedia.org/r/583756 (https://phabricator.wikimedia.org/T248516) [21:12:56] !log applied flow-table-size configuration to cr4-ulsfo which did not need a reboot to apply it T248394 [21:13:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:13:02] T248394: roll out sensible flow-table-sizes to Juniper core routers with sampling enabled - https://phabricator.wikimedia.org/T248394 [21:15:10] 10Operations, 10ops-codfw, 10Traffic, 10Patch-For-Review: (Need by: TBD) rack/setup/install cp202[7-9], cp203[0-9], cp204[0-2] - https://phabricator.wikimedia.org/T247340 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by pt1979 on cumin2001.codfw.wmnet for hosts: ` cp2042.codfw.wmnet ` The log... 
[21:15:21] (03CR) 10CDanis: [C: 03+2] Revert "depool ulsfo for router maintenance" [dns] - 10https://gerrit.wikimedia.org/r/583755 (https://phabricator.wikimedia.org/T248394) (owner: 10CDanis) [21:15:57] !log repool ulsfo [21:16:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:16:52] 10Operations, 10netops, 10Patch-For-Review: roll out sensible flow-table-sizes to Juniper core routers with sampling enabled - https://phabricator.wikimedia.org/T248394 (10CDanis) [21:24:04] (03PS2) 10Jgreen: remove bismuth.frack.eqiad.wmnet and heka.frack.codfw.wmnet [dns] - 10https://gerrit.wikimedia.org/r/583756 (https://phabricator.wikimedia.org/T248516) [21:24:37] !log mholloway-shell@deploy1001 Started deploy [mobileapps/deploy@f34260c]: Update mobileapps to 3f30f20c [21:24:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:25:29] (03CR) 10Jgreen: [C: 03+2] remove bismuth.frack.eqiad.wmnet and heka.frack.codfw.wmnet [dns] - 10https://gerrit.wikimedia.org/r/583756 (https://phabricator.wikimedia.org/T248516) (owner: 10Jgreen) [21:27:30] 10Operations, 10DC-Ops, 10decommission, 10Patch-For-Review: decommission heka.frack.codfw.wmnet - https://phabricator.wikimedia.org/T248627 (10Jgreen) a:03Papaul [21:27:44] !log mholloway-shell@deploy1001 Finished deploy [mobileapps/deploy@f34260c]: Update mobileapps to 3f30f20c (duration: 03m 07s) [21:27:47] !log pt1979@cumin2001 START - Cookbook sre.hosts.downtime [21:27:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:27:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:28:14] 10Operations, 10DC-Ops, 10decommission, 10fundraising-tech-ops, 10Patch-For-Review: decommission bismuth.frack.eqiad.wmnet - https://phabricator.wikimedia.org/T248516 (10Jgreen) a:05Dwisehaupt→03Jclark-ctr [21:29:08] 10Operations, 10DC-Ops, 10decommission, 10fundraising-tech-ops: decommission heka.frack.eqiad.wmnet - 
https://phabricator.wikimedia.org/T248628 (10Dwisehaupt) [21:29:26] 10Operations, 10fundraising-tech-ops: rack/setup/install frnetmon1001.frack.eqiad.wmnet - https://phabricator.wikimedia.org/T232137 (10Jgreen) [21:29:45] 10Operations, 10fundraising-tech-ops: rack/setup/install frnetmon1001.frack.eqiad.wmnet - https://phabricator.wikimedia.org/T232137 (10Jgreen) 05Open→03Resolved [21:30:10] !log pt1979@cumin2001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) [21:30:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:31:12] (03PS1) 10CDanis: Prepped depool of eqsin (just in case) [dns] - 10https://gerrit.wikimedia.org/r/583760 [21:31:42] !log pt1979@cumin2001 START - Cookbook sre.hosts.downtime [21:31:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:32:36] !log cdanis@re0.cr1-eqsin# set chassis afeb slot 0 inline-services flex-flow-sizing cdanis@re0.cr1-eqsin# commit comment "flex-flow-sizing T248394" [21:32:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:32:40] T248394: roll out sensible flow-table-sizes to Juniper core routers with sampling enabled - https://phabricator.wikimedia.org/T248394 [21:34:15] !log pt1979@cumin2001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) [21:34:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:37:06] 10Operations, 10netops, 10Patch-For-Review: roll out sensible flow-table-sizes to Juniper core routers with sampling enabled - https://phabricator.wikimedia.org/T248394 (10CDanis) [21:37:10] 10Operations, 10ops-codfw, 10Traffic, 10Patch-For-Review: (Need by: TBD) rack/setup/install cp202[7-9], cp203[0-9], cp204[0-2] - https://phabricator.wikimedia.org/T247340 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['cp2041.codfw.wmnet'] ` and were **ALL** successful. 
[21:38:11] 10Operations, 10ops-codfw, 10Traffic, 10Patch-For-Review: (Need by: TBD) rack/setup/install cp202[7-9], cp203[0-9], cp204[0-2] - https://phabricator.wikimedia.org/T247340 (10Papaul) [21:41:13] 10Operations, 10ops-codfw, 10Traffic, 10Patch-For-Review: (Need by: TBD) rack/setup/install cp202[7-9], cp203[0-9], cp204[0-2] - https://phabricator.wikimedia.org/T247340 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['cp2042.codfw.wmnet'] ` and were **ALL** successful. [21:43:16] !log mholloway-shell@deploy1001 Synchronized php-1.35.0-wmf.25/extensions/MachineVision: Fix: Stop sorting label suggestions by Wikidata ID in ApiQueryImageLabels (duration: 01m 00s) [21:43:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:47:07] 10Operations, 10ops-codfw, 10Traffic, 10Patch-For-Review: (Need by: TBD) rack/setup/install cp202[7-9], cp203[0-9], cp204[0-2] - https://phabricator.wikimedia.org/T247340 (10Papaul) 05Open→03Resolved @BBlack servers are ready for service. [21:51:02] RECOVERY - mediawiki originals uploads -hourly- for eqiad on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Swift/How_To%23mediawiki_originals_uploads https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?panelId=26&fullscreen&orgId=1&var-DC=eqiad [21:51:04] RECOVERY - mediawiki originals uploads -hourly- for codfw on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Swift/How_To%23mediawiki_originals_uploads https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?panelId=26&fullscreen&orgId=1&var-DC=codfw [21:53:14] 10Operations, 10Wikimedia-SVG-rendering, 10Upstream: Update librsvg to ≥2.42.3 - https://phabricator.wikimedia.org/T193352 (10Esanders) > That is a common SVGO problem Is SVGO actually violating the spec, or is this just a bug in librsvg? 
[22:24:52] RECOVERY - Rate of JVM GC Old generation-s runs - cloudelastic1004-cloudelastic-chi-eqiad on cloudelastic1004 is OK: (C)100 gt (W)80 gt 67.24 https://wikitech.wikimedia.org/wiki/Search%23Using_jstack_or_jmap_or_other_similar_tools_to_view_logs https://grafana.wikimedia.org/d/000000462/elasticsearch-memory?orgId=1&var-exported_cluster=cloudelastic-chi-eqiad&var-instance=cloudelastic1004&panelId=37 [22:27:00] (03CR) 10Jdlrobson: "How do you feel about us merging this Monday (or later today if you are up for it) James?" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/583408 (https://phabricator.wikimedia.org/T248500) (owner: 10Jdlrobson) [22:28:33] (03CR) 10Jforrester: "> Patch Set 3:" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/583408 (https://phabricator.wikimedia.org/T248500) (owner: 10Jdlrobson) [22:38:21] (03CR) 10Jdlrobson: "> Patch Set 3:" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/583408 (https://phabricator.wikimedia.org/T248500) (owner: 10Jdlrobson) [22:39:34] (03CR) 10Krinkle: [C: 03+1] "nice :)" (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/583459 (owner: 10Jforrester) [22:39:55] * Krinkle staging on mwdebug1002 [22:43:52] 10Operations, 10DC-Ops, 10decommission, 10fundraising-tech-ops: decommission heka.frack.eqiad.wmnet - https://phabricator.wikimedia.org/T248628 (10Dwisehaupt) a:05Dwisehaupt→03Papaul [22:44:29] !log krinkle@deploy1001 Synchronized php-1.35.0-wmf.25/includes/jobqueue/jobs/RecentChangesUpdateJob.php: I9121f5aae (1/4) (duration: 01m 00s) [22:44:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:47:04] 10Operations, 10Wikimedia-SVG-rendering, 10Upstream: Update librsvg to ≥2.42.3 - https://phabricator.wikimedia.org/T193352 (10JoKalliauer) >>! In T193352#6003571, @Esanders wrote: >> That is a common SVGO problem > > Is SVGO actually violating the spec, or is this just a bug in librsvg? SVGO removes spaces... 
[22:48:46] !log krinkle@deploy1001 Synchronized php-1.35.0-wmf.25/includes/objectcache/SqlBagOStuff.php: I9121f5aae (2/4) (duration: 00m 58s) [22:48:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:50:26] !log krinkle@deploy1001 Synchronized php-1.35.0-wmf.25/includes/search/SearchMySQL.php: I9121f5aae (3/4) (duration: 00m 58s) [22:50:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:51:53] !log krinkle@deploy1001 Synchronized php-1.35.0-wmf.25/includes/user/UserRightsProxy.php: I9121f5aae (4/4) (duration: 00m 58s) [22:51:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:53:50] (03CR) 10Jforrester: Construct wgLogos in CommonSettings so that projects can inherit values (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/583459 (owner: 10Jforrester) [22:58:34] (03PS3) 10Jforrester: Construct wgLogos in CommonSettings so that projects can inherit values [mediawiki-config] - 10https://gerrit.wikimedia.org/r/583459 [22:59:51] (03CR) 10Krinkle: [C: 03+1] Construct wgLogos in CommonSettings so that projects can inherit values (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/583459 (owner: 10Jforrester) [23:00:04] RoanKattouw, Niharika, and Urbanecm: I, the Bot under the Fountain, allow thee, The Deployer, to do Evening SWAT(Max 6 patches) deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200326T2300). [23:00:04] No GERRIT patches in the queue for this window AFAICS. 
[23:01:59] (03CR) 10Urbanecm: [C: 03+2] Enable wmgUseFooterContactLink for cswiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/583603 (https://phabricator.wikimedia.org/T248584) (owner: 10Urbanecm) [23:03:02] (03Merged) 10jenkins-bot: Enable wmgUseFooterContactLink for cswiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/583603 (https://phabricator.wikimedia.org/T248584) (owner: 10Urbanecm) [23:04:15] (03PS4) 10Jforrester: Construct wgLogos in CommonSettings so that projects can inherit values [mediawiki-config] - 10https://gerrit.wikimedia.org/r/583459 [23:05:58] !log urbanecm@deploy1001 Synchronized wmf-config/InitialiseSettings.php: SWAT: ce63a4e: Enable wmgUseFooterContactLink for cswiki (T248584) (duration: 00m 58s) [23:06:00] PROBLEM - Work requests waiting in Zuul Gearman server on contint1001 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [150.0] https://www.mediawiki.org/wiki/Continuous_integration/Zuul https://grafana.wikimedia.org/dashboard/db/zuul-gearman?panelId=10&fullscreen&orgId=1 [23:06:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:06:03] T248584: Enable wmgUseFooterContactLink for cswiki - https://phabricator.wikimedia.org/T248584 [23:07:02] !log urbanecm@deploy1001 Synchronized wmf-config/InitialiseSettings.php: SWAT: ce63a4e: Enable wmgUseFooterContactLink for cswiki (T248584; take II) (duration: 00m 57s) [23:07:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:07:29] * Urbanecm done [23:12:18] RECOVERY - Work requests waiting in Zuul Gearman server on contint1001 is OK: OK: Less than 100.00% above the threshold [90.0] https://www.mediawiki.org/wiki/Continuous_integration/Zuul https://grafana.wikimedia.org/dashboard/db/zuul-gearman?panelId=10&fullscreen&orgId=1 [23:24:17] (03PS6) 10Jforrester: Enable DiscussionTools as a beta feature on four wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/579337 
(https://phabricator.wikimedia.org/T245794) (owner: 10Bartosz Dziewoński) [23:27:54] (03PS7) 10Jforrester: Enable DiscussionTools as a beta feature on four wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/579337 (https://phabricator.wikimedia.org/T245794) (owner: 10Bartosz Dziewoński) [23:36:32] PROBLEM - Rate of JVM GC Old generation-s runs - cloudelastic1004-cloudelastic-chi-eqiad on cloudelastic1004 is CRITICAL: 104.7 gt 100 https://wikitech.wikimedia.org/wiki/Search%23Using_jstack_or_jmap_or_other_similar_tools_to_view_logs https://grafana.wikimedia.org/d/000000462/elasticsearch-memory?orgId=1&var-exported_cluster=cloudelastic-chi-eqiad&var-instance=cloudelastic1004&panelId=37 [23:37:34] PROBLEM - Work requests waiting in Zuul Gearman server on contint1001 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [150.0] https://www.mediawiki.org/wiki/Continuous_integration/Zuul https://grafana.wikimedia.org/dashboard/db/zuul-gearman?panelId=10&fullscreen&orgId=1