[00:00:04] RoanKattouw, Niharika, and Urbanecm: #bothumor Q:Why did functions stop calling each other? A:They had arguments. Rise for Evening SWAT(Max 6 patches) . (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200226T0000). [00:00:04] mooeypoo and niedzielski: A patch you scheduled for Evening SWAT(Max 6 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [00:00:08] o/ [00:00:42] RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 242, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [00:00:57] RECOVERY - Host cr2-esams is UP: PING OK - Packet loss = 0%, RTA = 86.13 ms [00:01:28] RECOVERY - OSPF status on mr1-esams is OK: OSPFv2: 2/2 UP : OSPFv3: 2/2 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [00:01:46] RECOVERY - OSPF status on cr3-knams is OK: OSPFv2: 4/4 UP : OSPFv3: 4/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [00:03:58] I'm here in place of mooeypoo [00:04:04] I can SWAT. [00:04:34] RECOVERY - Host cr2-esams IPv6 is UP: PING OK - Packet loss = 0%, RTA = 86.37 ms [00:04:39] (03PS5) 10Jforrester: Enable password-reset-update on Wikivoyages and Wiktionaries [mediawiki-config] - 10https://gerrit.wikimedia.org/r/573788 (https://phabricator.wikimedia.org/T245792) (owner: 10Samwilson) [00:04:50] Network flap fun. [00:04:57] (03CR) 10Jforrester: [C: 03+2] Enable password-reset-update on Wikivoyages and Wiktionaries [mediawiki-config] - 10https://gerrit.wikimedia.org/r/573788 (https://phabricator.wikimedia.org/T245792) (owner: 10Samwilson) [00:05:30] samwilson1-who-is-not-mooeypoo: Will sync to mwdebug1001 for you to test. 
[00:05:41] James_F: great [00:06:03] (03Merged) 10jenkins-bot: Enable password-reset-update on Wikivoyages and Wiktionaries [mediawiki-config] - 10https://gerrit.wikimedia.org/r/573788 (https://phabricator.wikimedia.org/T245792) (owner: 10Samwilson) [00:06:22] samwilson1: It's live on mwdebug1001. Please test. [00:06:50] James_F: tested. looks good. [00:06:57] Okie-dokie. [00:08:17] !log resume writes from mediawiki to cloudelastic [00:08:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:08:24] !log jforrester@deploy1001 Synchronized wmf-config/InitialiseSettings.php: T245792 Enable password-reset-update on Wikivoyages and Wiktionaries (duration: 01m 04s) [00:08:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:08:30] T245792: PRU: Enable PRU for Wikivoyage & Wiktionary [x-small] -- for MONDAY - https://phabricator.wikimedia.org/T245792 [00:08:35] ccccccktilbeljlejidcdurdndfkvvguibircjngtrgk [00:08:43] thanks, yubikey [00:08:54] * James_F grins. [00:09:00] niedzielski: You're up next. [00:09:06] good morning, cdanis's yubikey! 
[00:09:07] (03PS4) 10Jforrester: [prod] [beta] [Vector] Set skin version defaults [mediawiki-config] - 10https://gerrit.wikimedia.org/r/572039 (https://phabricator.wikimedia.org/T242381) (owner: 10Niedzielski) [00:09:09] yayyyyy [00:09:27] (03CR) 10Jforrester: [C: 03+2] [prod] [beta] [Vector] Set skin version defaults [mediawiki-config] - 10https://gerrit.wikimedia.org/r/572039 (https://phabricator.wikimedia.org/T242381) (owner: 10Niedzielski) [00:09:42] !log jforrester@deploy1001 Synchronized wmf-config/InitialiseSettings.php: Bonus sync for cache clearance (duration: 01m 03s) [00:09:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:10:21] (03CR) 10Dzahn: [C: 03+2] add codfw appservers in racks A3 and A6 as spares [puppet] - 10https://gerrit.wikimedia.org/r/574895 (https://phabricator.wikimedia.org/T241852) (owner: 10Dzahn) [00:10:30] (03Merged) 10jenkins-bot: [prod] [beta] [Vector] Set skin version defaults [mediawiki-config] - 10https://gerrit.wikimedia.org/r/572039 (https://phabricator.wikimedia.org/T242381) (owner: 10Niedzielski) [00:11:17] (03PS2) 10Dzahn: add codfw appservers in racks A3 and A6 as spares [puppet] - 10https://gerrit.wikimedia.org/r/574895 (https://phabricator.wikimedia.org/T241852) [00:11:40] niedzielski: Live on mwdebug1001 (but should be a no-op in prod, right?). [00:11:52] That's correct. I will now test. [00:12:05] Cool. It doesn't fatal for me, so… success. [00:12:41] ^^^ [00:12:46] I think so. Looks good to me. [00:13:30] OK, syncing now. [00:13:38] 10Operations, 10SRE-Access-Requests: Give access to Anti Harassment Tools team to production deployment - https://phabricator.wikimedia.org/T246053 (10Mooeypoo) Thanks @Dzahn ; I signed the document, and my public ED25519 key is: ` ssh-ed25519 AAAAC3NzaC1lZDI1NTE5AAAAIOiHafWCjGy8aC+Xvq1YoqGHkOnEz4HjMhw+Bpo/4p... [00:14:02] niedzielski: BTW, how do I get Web team code review? https://gerrit.wikimedia.org/r/c/wikimedia/portals/+/574070 in particular. 
:-) [00:14:28] !log jforrester@deploy1001 Synchronized wmf-config/InitialiseSettings.php: T242381 Set Vector skin version defaults so they can be changed on Beta Cluster (duration: 01m 04s) [00:14:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:14:34] T242381: Add a Vector skin version preference - https://phabricator.wikimedia.org/T242381 [00:15:11] !log SWAT complete. [00:15:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:15:23] James_F: I will review this to support your effort and add Jan Drewniak (who is the primary maintainer). [00:15:42] Thank you for SWATting! [00:15:49] niedzielski: Always. Thanks! [00:15:58] !log jforrester@deploy1001 Synchronized wmf-config/InitialiseSettings.php: Bonus sync for cache clearance (duration: 01m 03s) [00:16:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:16:44] (03CR) 10Dzahn: "https://puppet-compiler.wmflabs.org/compiler1003/21071/" [puppet] - 10https://gerrit.wikimedia.org/r/574895 (https://phabricator.wikimedia.org/T241852) (owner: 10Dzahn) [00:17:14] someone made an edit at dewiki, and when I'm not logged in I cannot see the edit in the history [00:17:19] but as a logged-in user I can [00:17:43] Sagan: That sounds like FlaggedRevisions working as required by the dewiki community? [00:17:47] is there a caching problem going on currently? action=purge does not help at all [00:17:54] Sagan: Is it different from before? [00:18:06] James_F: if it were flaggedrevs, you would at least see the edit in the history, but unflagged [00:18:10] I cannot see it at all [00:18:29] Oh, interesting, it doesn't hide it from history?
[00:18:36] when I view https://de.wikipedia.org/w/index.php?title=Hans-Werner_Sahm&action=history&uselang=en as logged in user, the last edit is from waithamai [00:19:07] when I view it logged in, it's me (Luke081515) [00:19:21] but if I view it logged out with uselang=de, it's waithamai [00:19:44] I get the same result logged in and logged out. [00:19:44] so normally it should not have a different behaviour because the interface language is different, right? [00:20:02] hm, also with &uselang=de? [00:20:08] It might be that the page is cached oddly. We had a cache hiccup this morning. [00:20:12] or is this maybe a regional problem? [00:20:36] hm, ok. then it's at least a cache problem where purge did not help [00:20:38] Yes, with uselang=fr and uselang=de; I'll be using the SF-based cache, you're probably in the AMS-based one? [00:20:50] yup, germany currently [00:20:50] Yeah, I can try to manually purge the page. One second. [00:21:43] !log Manually purged https://de.wikipedia.org/w/index.php?title=Hans-Werner_Sahm&action=history from mwmaint1002 [00:21:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:21:50] Sagan: What about now? [00:22:25] still the wrong result, even with a private window [00:23:19] Sagan: This is very odd, indeed. Is it just this page, or have you seen it elsewhere too? [00:23:48] not yet elsewhere, but I can take a look if I find things like this in the recent changes [00:23:59] Sagan: Can you file a task please? [00:24:18] ok [00:24:40] James_F: #wikimedia-general-or-unknown? [00:25:13] Sagan: Yeah, and maybe #Traffic? [00:29:22] 10Operations, 10Traffic, 10Wikimedia-General-or-Unknown: Different age of history logged in and out - https://phabricator.wikimedia.org/T246185 (10Luke081515) [00:29:34] James_F: ^, I included also some screenshots [00:29:46] Thank you. 
[00:29:53] 10Operations, 10ops-eqiad, 10serviceops, 10Patch-For-Review: (Need by: 2020-02-28) rack/setup/install mw[1385-1413].eqiad.wmnet - https://phabricator.wikimedia.org/T241849 (10Jclark-ctr) @Cmjohnson Host are racked have started power cables Will have cables finished tomorrow name rack_name position mw13... [00:30:38] 10Operations, 10Traffic, 10Wikimedia-General-or-Unknown: Different age of history logged in and out when from the EU (but not SF office) - https://phabricator.wikimedia.org/T246185 (10Jdforrester-WMF) [00:38:33] (03PS1) 10Dzahn: scap: add codfw canary appservers to dsh group [puppet] - 10https://gerrit.wikimedia.org/r/574902 (https://phabricator.wikimedia.org/T242606) [00:44:19] (03PS3) 10Herron: add load balancing for kibana-next [puppet] - 10https://gerrit.wikimedia.org/r/574862 (https://phabricator.wikimedia.org/T234854) [00:46:57] 10Operations, 10Traffic, 10Wikimedia-General-or-Unknown: Different age of history logged in and out when from the EU (but not SF office) - https://phabricator.wikimedia.org/T246185 (10Aklapper) Hmm, I'm afraid I cannot confirm at 00:44 UTC from Central Europe. Both logged in and private mode window show the... [00:59:29] 10Operations, 10netops: can aggregated netflow data include the router it was sampled from? - https://phabricator.wikimedia.org/T246186 (10CDanis) [00:59:54] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=swagger_check_citoid_cluster_codfw site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [01:01:26] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [01:08:15] (03PS1) 10Dzahn: add mw2291 through mw2300 as API appservers [puppet] - 10https://gerrit.wikimedia.org/r/574907 (https://phabricator.wikimedia.org/T241852) [01:15:56] (03PS4) 10Herron: add load balancing for kibana-next [puppet] - 10https://gerrit.wikimedia.org/r/574862 (https://phabricator.wikimedia.org/T234854) [01:42:48] 10Operations, 10ops-codfw, 10Traffic, 10Patch-For-Review: (Need by: TBD) rack/setup/install LVS200[7-10] - https://phabricator.wikimedia.org/T196560 (10wiki_willy) [01:43:42] 10Operations, 10ops-codfw, 10Discovery-Search (Current work): (Need by: TBD) codfw: rack/setup/install elastic20{55,56,57,58,59,60}.codfw.wmnet - https://phabricator.wikimedia.org/T241337 (10wiki_willy) [01:44:17] 10Operations, 10ops-codfw: (Need by: TBD) codfw: rack/setup/install wdqs200[7-8].codfw.wmnet - https://phabricator.wikimedia.org/T242301 (10wiki_willy) [01:46:49] 10Operations, 10ops-codfw, 10fundraising-tech-ops: (Need by: TBD) codfw: rack/setup/install 3 new payments server for frack - https://phabricator.wikimedia.org/T244169 (10wiki_willy) [01:47:35] 10Operations, 10ops-codfw: (Need by: TBD) codfw: rack/setup/install parse200[1-20].codfw.wmnet - https://phabricator.wikimedia.org/T243112 (10wiki_willy) [01:48:14] 10Operations, 10ops-codfw, 10Wikimedia-Logstash, 10Patch-For-Review: (Need by: TBD) rack/setup/install logstash202[6-9].codfw.wmnet - https://phabricator.wikimedia.org/T240882 (10wiki_willy) [01:49:07] 10Operations, 10ops-codfw, 10fundraising-tech-ops: (Need by: TBD) codfw:fundraising single-cpu misc servers - https://phabricator.wikimedia.org/T244950 (10wiki_willy) [01:49:57] 10Operations, 10ops-codfw, 10DC-Ops, 10fundraising-tech-ops: (Need by: ASAP) rack/setup/install frdb2001 - https://phabricator.wikimedia.org/T245566 (10wiki_willy) [01:51:24] 10Operations, 10ops-codfw, 10serviceops, 
10Patch-For-Review: (Need by: TBD) rack/setup/install new codfw mw systems - https://phabricator.wikimedia.org/T241852 (10wiki_willy) [01:52:06] 10Operations, 10ops-codfw, 10DC-Ops: (Need by: TBD) rack/setup/install ganeti20[19-24] - https://phabricator.wikimedia.org/T244783 (10wiki_willy) [01:52:48] 10Operations, 10ops-codfw, 10fundraising-tech-ops, 10Patch-For-Review: (Need by: TBD) rack/setup/install frpm2001.frack.codfw.wmnet - https://phabricator.wikimedia.org/T242269 (10wiki_willy) [01:56:47] (03PS5) 10Herron: add load balancing for kibana-next [puppet] - 10https://gerrit.wikimedia.org/r/574862 (https://phabricator.wikimedia.org/T234854) [02:06:31] (03PS6) 10Herron: add load balancing for kibana-next [puppet] - 10https://gerrit.wikimedia.org/r/574862 (https://phabricator.wikimedia.org/T234854) [02:27:11] (03PS7) 10Herron: add load balancing for kibana-next [puppet] - 10https://gerrit.wikimedia.org/r/574862 (https://phabricator.wikimedia.org/T234854) [02:30:48] (03CR) 10Herron: "https://puppet-compiler.wmflabs.org/compiler1002/21079/" [puppet] - 10https://gerrit.wikimedia.org/r/574862 (https://phabricator.wikimedia.org/T234854) (owner: 10Herron) [02:53:59] PROBLEM - Old JVM GC check - cloudelastic1002-cloudelastic-chi-eqiad on cloudelastic1002 is CRITICAL: 100.7 gt 100 https://wikitech.wikimedia.org/wiki/Search%23Using_jstack_or_jmap_or_other_similar_tools_to_view_logs https://grafana.wikimedia.org/d/000000462/elasticsearch-memory?orgId=1 [03:00:23] PROBLEM - Old JVM GC check - cloudelastic1002-cloudelastic-chi-eqiad on cloudelastic1002 is CRITICAL: 100.7 gt 100 https://wikitech.wikimedia.org/wiki/Search%23Using_jstack_or_jmap_or_other_similar_tools_to_view_logs https://grafana.wikimedia.org/d/000000462/elasticsearch-memory?orgId=1 [03:46:40] (03CR) 1020after4: [C: 03+1] "lgtm" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/566411 (owner: 10Legoktm) [03:47:56] (03CR) 1020after4: [C: 03+1] scap: Clean up unused build configuration [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/566412 (owner: 10Legoktm) [04:56:39] (03CR) 10Giuseppe Lavagetto: "Overall correct, but see my comments. I think we can do without adding two separated services for TLS and non-TLS and just go TLS-only for" (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/574862 (https://phabricator.wikimedia.org/T234854) (owner: 10Herron) [05:00:20] (03CR) 10Giuseppe Lavagetto: [C: 03+1] add kibana-next service records [dns] - 10https://gerrit.wikimedia.org/r/574861 (owner: 10Herron) [05:01:30] 10Operations: Install private instance of gnomon for greater SRE team - https://phabricator.wikimedia.org/T246062 (10CDanis) p:05Triage→03Low [05:25:44] I plan to deploy https://gerrit.wikimedia.org/r/#/c/operations/deployment-charts/+/574768/ - (cxserver update) anything on deploy1001 at moment? [05:27:25] I guess not.. [05:27:47] (03PS2) 10KartikMistry: Update cxserver to 2020-02-24-110149-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/574768 (https://phabricator.wikimedia.org/T227183) [05:28:33] (03CR) 10KartikMistry: [C: 03+2] Update cxserver to 2020-02-24-110149-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/574768 (https://phabricator.wikimedia.org/T227183) (owner: 10KartikMistry) [05:28:51] (03Merged) 10jenkins-bot: Update cxserver to 2020-02-24-110149-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/574768 (https://phabricator.wikimedia.org/T227183) (owner: 10KartikMistry) [05:29:34] !log kartik@deploy1001 helmfile [STAGING] Ran 'apply' command on namespace 'cxserver' for release 'staging' . [05:29:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:31:26] !log kartik@deploy1001 helmfile [EQIAD] Ran 'apply' command on namespace 'cxserver' for release 'production' . 
[05:31:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:31:40] (03PS1) 10DannyS712: Remove $wgAllowTitlesInSVG, deprecated and unused [mediawiki-config] - 10https://gerrit.wikimedia.org/r/574921 (https://phabricator.wikimedia.org/T246193) [05:32:04] (03PS2) 10DannyS712: Remove $wgAllowTitlesInSVG, deprecated and unused [mediawiki-config] - 10https://gerrit.wikimedia.org/r/574921 (https://phabricator.wikimedia.org/T246193) [05:35:34] !log kartik@deploy1001 helmfile [CODFW] Ran 'apply' command on namespace 'cxserver' for release 'production' . [05:35:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:41:05] !log Updated cxserver to 2020-02-24-110149-production (T227183) [05:41:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:41:12] T227183: Generate template parameter alignments for the selected small wikis - https://phabricator.wikimedia.org/T227183 [06:03:49] PROBLEM - Maps tiles generation on icinga1001 is CRITICAL: CRITICAL: 100.00% of data under the critical threshold [5.0] https://wikitech.wikimedia.org/wiki/Maps/Runbook https://grafana.wikimedia.org/dashboard/db/maps-performances?panelId=8&fullscreen&orgId=1 [06:06:14] (03PS1) 10Marostegui: db1087: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/574922 [06:08:24] 10Operations, 10ops-eqiad, 10DBA: db1095 backup source crashed: broken BBU - https://phabricator.wikimedia.org/T244958 (10Marostegui) I can see the battery - thank you ` root@db1095:~# hpssacli controller all show detail | grep -i battery No-Battery Write Cache: Disabled Battery/Capacitor Count: 1... 
[06:09:07] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1084 for BBU replacement - T245647', diff saved to https://phabricator.wikimedia.org/P10528 and previous config saved to /var/cache/conftool/dbconfig/20200226-060906-marostegui.json [06:09:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:09:14] T245647: Replace broken BBU on db1084 (HP host) - https://phabricator.wikimedia.org/T245647 [06:09:50] (03CR) 10Marostegui: [C: 03+2] db1087: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/574922 (owner: 10Marostegui) [06:14:07] PROBLEM - Old JVM GC check - cloudelastic1003-cloudelastic-chi-eqiad on cloudelastic1003 is CRITICAL: 103.7 gt 100 https://wikitech.wikimedia.org/wiki/Search%23Using_jstack_or_jmap_or_other_similar_tools_to_view_logs https://grafana.wikimedia.org/d/000000462/elasticsearch-memory?orgId=1 [06:15:03] (03PS1) 10Marostegui: db1084: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/574923 (https://phabricator.wikimedia.org/T245621) [06:16:41] !log marostegui@cumin1001 dbctl commit (dc=all): 'Restore es1017 (master) original weight (0) T243963', diff saved to https://phabricator.wikimedia.org/P10529 and previous config saved to /var/cache/conftool/dbconfig/20200226-061640-marostegui.json [06:16:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:16:48] T243963: es1019: reseat IPMI - https://phabricator.wikimedia.org/T243963 [06:17:11] !log marostegui@cumin1001 dbctl commit (dc=all): 'Fully repool es1019 after on-site maintenance T243963', diff saved to https://phabricator.wikimedia.org/P10530 and previous config saved to /var/cache/conftool/dbconfig/20200226-061710-marostegui.json [06:17:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:18:16] 10Operations, 10ops-eqiad, 10DBA, 10DC-Ops: es1019: reseat IPMI - https://phabricator.wikimedia.org/T243963 (10Marostegui) >>! 
In T243963#5916919, @jcrespo wrote: > es1019 is just pending the last config push back to normal traffic weights (and reducing the master's). Done! Thank you for handling this [06:18:24] (03CR) 10Marostegui: [C: 03+2] db1084: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/574923 (https://phabricator.wikimedia.org/T245621) (owner: 10Marostegui) [06:19:13] !log Stop MySQL and poweroff db1084 for BBU replacement - T245647 [06:19:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:19:20] T245647: Replace broken BBU on db1084 (HP host) - https://phabricator.wikimedia.org/T245647 [06:20:46] 10Operations, 10ops-eqiad, 10DC-Ops: Replace broken BBU on db1084 (HP host) - https://phabricator.wikimedia.org/T245647 (10Marostegui) @Jclark-ctr please proceed to replace the BBU, host is now powered off. I will message you on IRC too Thank you [06:24:39] 10Operations, 10Traffic, 10Wikimedia-General-or-Unknown: Different age of history logged in and out when from the EU (but not SF office) - https://phabricator.wikimedia.org/T246185 (10Conny) Got yesterday info for this via Servicehotline [1] for page https://de.wikipedia.org/wiki/Benutzer:Gerhard_Steigerwald... 
[06:24:54] (03PS1) 10Giuseppe Lavagetto: envoy: default to tls 1.2 for now for upstreams [puppet] - 10https://gerrit.wikimedia.org/r/574924 (https://phabricator.wikimedia.org/T246083) [06:25:23] (03CR) 10Giuseppe Lavagetto: [C: 03+2] envoy: default to tls 1.2 for now for upstreams [puppet] - 10https://gerrit.wikimedia.org/r/574924 (https://phabricator.wikimedia.org/T246083) (owner: 10Giuseppe Lavagetto) [06:27:17] PROBLEM - Maps tiles generation on icinga1001 is CRITICAL: CRITICAL: 100.00% of data under the critical threshold [5.0] https://wikitech.wikimedia.org/wiki/Maps/Runbook https://grafana.wikimedia.org/dashboard/db/maps-performances?panelId=8&fullscreen&orgId=1 [06:27:55] <_joe_> uhm that dashboard's empty [06:28:46] (03PS3) 10Marostegui: db-eqiad,db-codfw.php: Add es4 as new ES, for initial testing [mediawiki-config] - 10https://gerrit.wikimedia.org/r/574696 (https://phabricator.wikimedia.org/T246072) [07:01:44] PROBLEM - Old JVM GC check - cloudelastic1003-cloudelastic-chi-eqiad on cloudelastic1003 is CRITICAL: 105.8 gt 100 https://wikitech.wikimedia.org/wiki/Search%23Using_jstack_or_jmap_or_other_similar_tools_to_view_logs https://grafana.wikimedia.org/d/000000462/elasticsearch-memory?orgId=1 [07:17:44] (03PS1) 10Giuseppe Lavagetto: envoy: use TLS 1.2 for local clusters as well [puppet] - 10https://gerrit.wikimedia.org/r/574938 [07:18:44] RECOVERY - rpki grafana alert on icinga1001 is OK: OK: RPKI ( https://grafana.wikimedia.org/d/UwUa77GZk/rpki ) is not alerting. 
https://wikitech.wikimedia.org/wiki/RPKI%23Grafana_alerts https://grafana.wikimedia.org/d/UwUa77GZk/ [07:19:56] PROBLEM - Old JVM GC check - cloudelastic1003-cloudelastic-chi-eqiad on cloudelastic1003 is CRITICAL: 106.8 gt 100 https://wikitech.wikimedia.org/wiki/Search%23Using_jstack_or_jmap_or_other_similar_tools_to_view_logs https://grafana.wikimedia.org/d/000000462/elasticsearch-memory?orgId=1 [07:21:13] (03CR) 10Giuseppe Lavagetto: [C: 03+2] envoy: use TLS 1.2 for local clusters as well [puppet] - 10https://gerrit.wikimedia.org/r/574938 (owner: 10Giuseppe Lavagetto) [07:22:44] (03PS6) 10Effie Mouzeli: WIP hieradata: test streaming apache logs to logstash from mwdebug1001 [puppet] - 10https://gerrit.wikimedia.org/r/572057 (https://phabricator.wikimedia.org/T244472) [07:22:56] PROBLEM - MariaDB Slave IO: s2 on db1140 is CRITICAL: CRITICAL slave_io_state Slave_IO_Running: No, Errno: 2026, Errmsg: error reconnecting to master repl@db1122.eqiad.wmnet:3306 - retry-time: 60 maximum-retries: 86400 message: SSL connection error00000000:lib(0):func(0):reason(0) https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_slave [07:23:17] woot? 
[07:23:19] checking [07:25:06] RECOVERY - MariaDB Slave IO: s2 on db1140 is OK: OK slave_io_state Slave_IO_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_slave [07:25:08] Looks like it was temporary [07:25:15] <_joe_> you scared it [07:25:15] I am going to create a task, cause that error is strange [07:25:25] it is a backup source host, so no user impact [07:27:09] (03PS2) 10Elukey: Add an-launcher1001 to the list of statistics servers [puppet] - 10https://gerrit.wikimedia.org/r/574843 (https://phabricator.wikimedia.org/T243934) [07:29:05] (03CR) 10Elukey: [C: 03+2] Add an-launcher1001 to the list of statistics servers [puppet] - 10https://gerrit.wikimedia.org/r/574843 (https://phabricator.wikimedia.org/T243934) (owner: 10Elukey) [07:30:49] (03PS7) 10Effie Mouzeli: hieradata: test streaming apache logs to logstash from mwdebug1001 [puppet] - 10https://gerrit.wikimedia.org/r/572057 (https://phabricator.wikimedia.org/T244472) [07:31:21] (03PS1) 10Elukey: admin: add kerberos flag for jmorgan [puppet] - 10https://gerrit.wikimedia.org/r/574973 (https://phabricator.wikimedia.org/T246118) [07:34:05] (03CR) 10Elukey: [C: 03+2] admin: add kerberos flag for jmorgan [puppet] - 10https://gerrit.wikimedia.org/r/574973 (https://phabricator.wikimedia.org/T246118) (owner: 10Elukey) [07:35:10] (03CR) 10Giuseppe Lavagetto: [C: 03+1] hieradata: test streaming apache logs to logstash from mwdebug1001 [puppet] - 10https://gerrit.wikimedia.org/r/572057 (https://phabricator.wikimedia.org/T244472) (owner: 10Effie Mouzeli) [07:38:39] 10Operations, 10ops-codfw, 10Traffic, 10Patch-For-Review: (Need by: TBD) rack/setup/install LVS200[7-10] - https://phabricator.wikimedia.org/T196560 (10Vgutierrez) @Papaul I think I can handle that as well. 
I'll let you know, the first one will be lvs2006 [07:41:51] (03CR) 10Effie Mouzeli: "https://puppet-compiler.wmflabs.org/compiler1002/21080/" [puppet] - 10https://gerrit.wikimedia.org/r/572057 (https://phabricator.wikimedia.org/T244472) (owner: 10Effie Mouzeli) [07:54:02] 10Operations, 10Traffic, 10conftool, 10Patch-For-Review, and 2 others: Figure out a security model for etcd - https://phabricator.wikimedia.org/T97972 (10akosiaris) Per @volans recommendation, adding a use case here: * Be able to read conftool data via automated means (i.e. scripts) in order to cross chec... [07:57:59] (03CR) 10Effie Mouzeli: [C: 03+2] hieradata: test streaming apache logs to logstash from mwdebug1001 [puppet] - 10https://gerrit.wikimedia.org/r/572057 (https://phabricator.wikimedia.org/T244472) (owner: 10Effie Mouzeli) [08:01:16] 10Operations, 10netops: PyBal BGP group prefix-limit 50 teardown - https://phabricator.wikimedia.org/T246110 (10akosiaris) For what is worth, the Kubernetes groups (v4, v6) have copy pasted the PyBal group, so it probably makes sense to follow the same approach for those as well. There is one extra interesting... [08:14:06] PROBLEM - rpki grafana alert on icinga1001 is CRITICAL: CRITICAL: RPKI ( https://grafana.wikimedia.org/d/UwUa77GZk/rpki ) is alerting: RRDP status alert. https://wikitech.wikimedia.org/wiki/RPKI%23Grafana_alerts https://grafana.wikimedia.org/d/UwUa77GZk/ [08:38:00] 10Operations, 10ops-eqiad, 10DBA: db1095 backup source crashed: broken BBU - https://phabricator.wikimedia.org/T244958 (10Marostegui) a:05Jclark-ctr→03jcrespo Assigning to Jaime to reflect that this is now pending his follow-up Thank you John! 
[08:38:11] !log upgrade prometheus-mcrouter-exporter on mwdebug1001 to test the new version [08:38:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:43:05] 10Operations, 10ops-eqiad, 10DBA: db1095 backup source crashed: broken BBU - https://phabricator.wikimedia.org/T244958 (10jcrespo) Thanks, I will repopulate this host with some production data, which may help with T246198. [08:51:10] !log upload prometheus-mcrouter-exporter 0.1.0+git20200225-1 to stretch-wikimedia [08:51:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:01:41] (03CR) 10Aaron Schulz: [C: 03+1] db-eqiad,db-codfw.php: Add es4 as new ES, for initial testing [mediawiki-config] - 10https://gerrit.wikimedia.org/r/574696 (https://phabricator.wikimedia.org/T246072) (owner: 10Marostegui) [09:13:16] PROBLEM - puppet last run on elastic1048 is CRITICAL: CRITICAL: Puppet last ran 6 days ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [09:18:07] !log restarting elasticsearch on cloudelastic for JVM upgrade [09:18:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:19:22] RECOVERY - puppet last run on elastic1048 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [09:24:17] 10Operations, 10Release-Engineering-Team, 10serviceops: mcrouter proxies and scap proxies - https://phabricator.wikimedia.org/T245841 (10akosiaris) Fine by me. 
[09:32:41] !log upgrade prometheus-mcrouter-exporter 0.1.0+git20200225-1 to all cumin alias all-mw-codfw [09:32:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:34:36] !log roll restart the Hadoop Analytics workers for openjdk upgrades [09:34:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:37:14] !log elukey@cumin1001 START - Cookbook sre.hadoop.roll-restart-workers [09:37:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:41:30] 10Operations, 10Release-Engineering-Team, 10serviceops: mcrouter proxies and scap proxies - https://phabricator.wikimedia.org/T245841 (10Joe) 05Open→03Declined No please. I didn't notice this task. Scap proxies are under a ton of pressure during train releases, it was a deliberate choice not to mix the... [09:54:45] !log upgrade prometheus-mcrouter-exporter 0.1.0+git20200225-1 to all cumin alias all-mw-eqiad [09:54:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:59:58] (03PS1) 10Giuseppe Lavagetto: mediawiki::common: use envoy for tls termination too in nodes using it [puppet] - 10https://gerrit.wikimedia.org/r/574988 [10:00:45] 10Operations, 10Traffic, 10Wikimedia-General-or-Unknown: Different age of history logged in and out when from the EU (but not SF office) - https://phabricator.wikimedia.org/T246185 (10Aklapper) I'm wondering if this could have anything to do with yesterday's earlier problem at https://wikitech.wikimedia.org...
[10:03:26] !log upgrade prometheus-mcrouter-exporter 0.1.0+git20200225-1 to all cumin alias parsoid/deployment-servers/mw-maintenance [10:03:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:05:57] elukey: if you want to do all of them at once can use 'R:package = prometheus-mcrouter-exporter' ;) [10:06:04] or to verify noone was left behind [10:06:21] https://debmonitor.wikimedia.org/packages/prometheus-mcrouter-exporter helps too :D [10:07:18] yep I was following debmonitor, didn't think about the R:package thanks! [10:08:44] RECOVERY - Old JVM GC check - cloudelastic1001-cloudelastic-chi-eqiad on cloudelastic1001 is OK: (C)100 gt (W)80 gt 78.31 https://wikitech.wikimedia.org/wiki/Search%23Using_jstack_or_jmap_or_other_similar_tools_to_view_logs https://grafana.wikimedia.org/d/000000462/elasticsearch-memory?orgId=1 [10:10:20] RECOVERY - Old JVM GC check - cloudelastic1003-cloudelastic-chi-eqiad on cloudelastic1003 is OK: (C)100 gt (W)80 gt 75.27 https://wikitech.wikimedia.org/wiki/Search%23Using_jstack_or_jmap_or_other_similar_tools_to_view_logs https://grafana.wikimedia.org/d/000000462/elasticsearch-memory?orgId=1 [10:15:28] RECOVERY - Old JVM GC check - cloudelastic1002-cloudelastic-chi-eqiad on cloudelastic1002 is OK: (C)100 gt (W)80 gt 72.2 https://wikitech.wikimedia.org/wiki/Search%23Using_jstack_or_jmap_or_other_similar_tools_to_view_logs https://grafana.wikimedia.org/d/000000462/elasticsearch-memory?orgId=1 [10:16:25] (03PS8) 10Elukey: Move all Report Updater Jobs to an-launcher1001 [puppet] - 10https://gerrit.wikimedia.org/r/574722 (https://phabricator.wikimedia.org/T243934) [10:21:14] (03CR) 10Muehlenhoff: [C: 03+2] Switch to /srv/druid [debs/druid] (debian) - 10https://gerrit.wikimedia.org/r/574803 (owner: 10Muehlenhoff) [10:25:25] (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [software/cas-overlay-template] - 10https://gerrit.wikimedia.org/r/574883 (owner: 10Jbond) [10:25:38] (03CR) 10Elukey: [C: 03+1] "<3" [debs/druid] 
(debian) - 10https://gerrit.wikimedia.org/r/574803 (owner: 10Muehlenhoff) [10:26:59] (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [software/cas-overlay-template] - 10https://gerrit.wikimedia.org/r/574884 (https://phabricator.wikimedia.org/T246010) (owner: 10Jbond) [10:27:34] (03CR) 10Jbond: [V: 03+2 C: 03+2] templates: add initial template file so we have the git history [software/cas-overlay-template] - 10https://gerrit.wikimedia.org/r/574883 (owner: 10Jbond) [10:27:44] (03CR) 10Jbond: [V: 03+2 C: 03+2] templates: update so that CSS and JS files come from CF CDN [software/cas-overlay-template] - 10https://gerrit.wikimedia.org/r/574884 (https://phabricator.wikimedia.org/T246010) (owner: 10Jbond) [10:32:07] (03CR) 10Muehlenhoff: style: remove branding (031 comment) [software/cas-overlay-template] - 10https://gerrit.wikimedia.org/r/574889 (https://phabricator.wikimedia.org/T233939) (owner: 10Jbond) [10:32:24] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=idp site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [10:34:30] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [10:36:56] (03CR) 10Volans: "Replies inline" (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/573963 (owner: 10Alexandros Kosiaris) [10:37:42] (03PS2) 10Jbond: style: remove branding [software/cas-overlay-template] - 10https://gerrit.wikimedia.org/r/574889 (https://phabricator.wikimedia.org/T233939) [10:40:47] (03PS2) 10Jbond: templates: add initial templates to provide git history [software/cas-overlay-template] - 10https://gerrit.wikimedia.org/r/574888 (https://phabricator.wikimedia.org/T233939) [10:41:20] (03PS3) 10Jbond: style: remove branding [software/cas-overlay-template] - 10https://gerrit.wikimedia.org/r/574889 (https://phabricator.wikimedia.org/T233939) [10:42:24] (03CR) 10Kosta Harlan: [C: 03+1] Enable articletopic: search keyword in CirrusSearch [mediawiki-config] - 10https://gerrit.wikimedia.org/r/574634 (https://phabricator.wikimedia.org/T240559) (owner: 10Gergő Tisza) [10:45:33] (03PS3) 10Jbond: templates: add initial templates to provide git history [software/cas-overlay-template] - 10https://gerrit.wikimedia.org/r/574888 (https://phabricator.wikimedia.org/T233939) [10:45:42] (03PS4) 10Jbond: style: remove branding [software/cas-overlay-template] - 10https://gerrit.wikimedia.org/r/574889 (https://phabricator.wikimedia.org/T233939) [10:51:53] 10Operations, 10Traffic, 10Wikimedia-General-or-Unknown: Different age of history logged in and out when from the EU (but not SF office) - https://phabricator.wikimedia.org/T246185 (10Luke081515) As an logged out user I'm now able to see the newest version. However, I've temporary revoked the review status... 
[10:52:25] !log installing clamav security updates on mendelevium (ticket.wikimedia.org [10:52:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:54:52] (03CR) 10Hnowlan: "Good question - I'm not familiar enough with our setup to know whether there will be some adverse effects of us having a service in k8s bu" [puppet] - 10https://gerrit.wikimedia.org/r/574811 (https://phabricator.wikimedia.org/T213193) (owner: 10Hnowlan) [10:56:38] (03PS1) 10Volans: icinga: fix use of stale unpuppetized check files [puppet] - 10https://gerrit.wikimedia.org/r/574993 [10:56:55] (03PS5) 10Jbond: style: remove branding [software/cas-overlay-template] - 10https://gerrit.wikimedia.org/r/574889 (https://phabricator.wikimedia.org/T233939) [10:57:37] !log elukey@cumin1001 END (PASS) - Cookbook sre.hadoop.roll-restart-workers (exit_code=0) [10:57:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:58:53] (03CR) 10Jbond: "updated thanks" (031 comment) [software/cas-overlay-template] - 10https://gerrit.wikimedia.org/r/574889 (https://phabricator.wikimedia.org/T233939) (owner: 10Jbond) [11:01:57] (03CR) 10Volans: "Thanks effie for noticing a discrepancy and nerd-sniping me into this rabbit hole!" 
[puppet] - 10https://gerrit.wikimedia.org/r/574993 (owner: 10Volans) [11:04:04] (03CR) 10Jbond: [C: 03+2] admin: add support for system users and groups [puppet] - 10https://gerrit.wikimedia.org/r/573990 (https://phabricator.wikimedia.org/T235162) (owner: 10Jbond) [11:05:40] !log rolling out remaining PHP 7.0 security updates [11:05:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:06:50] (03CR) 10Effie Mouzeli: [C: 03+1] ":D" [puppet] - 10https://gerrit.wikimedia.org/r/574993 (owner: 10Volans) [11:08:06] (03CR) 10Jbond: [C: 03+1] "thanks" [puppet] - 10https://gerrit.wikimedia.org/r/574993 (owner: 10Volans) [11:09:33] (03CR) 10Jbond: [C: 03+2] admin: add rerprepo system user [puppet] - 10https://gerrit.wikimedia.org/r/573991 (owner: 10Jbond) [11:09:48] (03PS7) 10Jbond: admin: add rerprepo system user [puppet] - 10https://gerrit.wikimedia.org/r/573991 [11:14:40] (03CR) 10Hnowlan: [C: 03+2] changeprop: add hierdata k8s entries and LVS entry [puppet] - 10https://gerrit.wikimedia.org/r/574811 (https://phabricator.wikimedia.org/T213193) (owner: 10Hnowlan) [11:26:34] (03PS1) 10Jbond: reprepo: remove user and group as they are create by the admin module [puppet] - 10https://gerrit.wikimedia.org/r/574995 (https://phabricator.wikimedia.org/T245612) [11:26:37] (03PS1) 10Jcrespo: database-backups: Add s3 to db1095 (backup source) [puppet] - 10https://gerrit.wikimedia.org/r/574996 (https://phabricator.wikimedia.org/T244958) [11:27:51] (03CR) 10Jbond: [C: 03+2] reprepo: remove user and group as they are create by the admin module [puppet] - 10https://gerrit.wikimedia.org/r/574995 (https://phabricator.wikimedia.org/T245612) (owner: 10Jbond) [11:28:06] (03CR) 10Jcrespo: [C: 03+2] database-backups: Add s3 to db1095 (backup source) [puppet] - 10https://gerrit.wikimedia.org/r/574996 (https://phabricator.wikimedia.org/T244958) (owner: 10Jcrespo) [11:32:53] (03CR) 10Hnowlan: changeprop: add hierdata k8s entries and LVS entry [puppet] - 
10https://gerrit.wikimedia.org/r/574811 (https://phabricator.wikimedia.org/T213193) (owner: 10Hnowlan) [11:33:13] (03CR) 10Hnowlan: [C: 03+1] Switch restbase-dev* to standard Partman recipes [puppet] - 10https://gerrit.wikimedia.org/r/574796 (https://phabricator.wikimedia.org/T156955) (owner: 10Muehlenhoff) [11:34:10] I am expecting some disabled notifications to show up on icinga about db1095, will ack them when they get created [11:36:35] (03PS1) 10Volans: CHANGELOG: add changelogs for release v0.0.31 [software/spicerack] - 10https://gerrit.wikimedia.org/r/574999 [11:37:10] jouncebot: next [11:37:10] In 0 hour(s) and 22 minute(s): European Mid-day SWAT(Max 6 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200226T1200) [11:41:01] (03CR) 10Volans: [C: 03+2] CHANGELOG: add changelogs for release v0.0.31 [software/spicerack] - 10https://gerrit.wikimedia.org/r/574999 (owner: 10Volans) [11:41:52] (03PS1) 10Effie Mouzeli: logstash, mediawiki: minor fixes in log streaming [puppet] - 10https://gerrit.wikimedia.org/r/575000 (https://phabricator.wikimedia.org/T244472) [11:44:56] (03Merged) 10jenkins-bot: CHANGELOG: add changelogs for release v0.0.31 [software/spicerack] - 10https://gerrit.wikimedia.org/r/574999 (owner: 10Volans) [11:45:37] (03PS1) 10Jcrespo: database-backups: Productionize db1095 as the backup source of s3 [puppet] - 10https://gerrit.wikimedia.org/r/575002 (https://phabricator.wikimedia.org/T244958) [11:45:42] changing uid/gid of reprepro effects release[12]001/install[12]002 [11:45:48] !log changing uid/gid of reprepro effects release[12]001/install[12]002 [11:45:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:48:15] (03PS1) 10Volans: Upstream release v0.0.31 [software/spicerack] (debian) - 10https://gerrit.wikimedia.org/r/575003 [11:48:45] (03CR) 10Volans: [C: 03+2] Upstream release v0.0.31 [software/spicerack] (debian) - 10https://gerrit.wikimedia.org/r/575003 (owner: 10Volans) [11:50:11] 
(03PS6) 10Vgutierrez: ATS: Support TLS Session tickets [puppet] - 10https://gerrit.wikimedia.org/r/573977 (https://phabricator.wikimedia.org/T245616) [11:50:31] (03CR) 10Vgutierrez: ATS: Support TLS Session tickets (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/573977 (https://phabricator.wikimedia.org/T245616) (owner: 10Vgutierrez) [11:50:33] (03PS2) 10Muehlenhoff: Switch restbase-dev* to standard Partman recipes [puppet] - 10https://gerrit.wikimedia.org/r/574796 (https://phabricator.wikimedia.org/T156955) [11:54:15] (03Merged) 10jenkins-bot: Upstream release v0.0.31 [software/spicerack] (debian) - 10https://gerrit.wikimedia.org/r/575003 (owner: 10Volans) [11:56:14] (03CR) 10Jcrespo: [C: 03+2] database-backups: Productionize db1095 as the backup source of s3 [puppet] - 10https://gerrit.wikimedia.org/r/575002 (https://phabricator.wikimedia.org/T244958) (owner: 10Jcrespo) [12:00:04] Amir1, Lucas_WMDE, awight, and Urbanecm: It is that lovely time of the day again! You are hereby commanded to deploy European Mid-day SWAT(Max 6 patches). (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200226T1200). [12:00:04] kart_: A patch you scheduled for European Mid-day SWAT(Max 6 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [12:00:15] * kart_ is here [12:00:30] @Nikerabbit around too? 
[12:00:44] mw memcached errors jumped up at 11:58 [12:00:58] (03PS2) 10KartikMistry: Enable CX out of beta in eu, sw and ta WPs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/574469 (https://phabricator.wikimedia.org/T245446) [12:01:11] I think it went down back again, all good [12:02:08] (03CR) 10Muehlenhoff: [C: 03+2] Switch restbase-dev* to standard Partman recipes [puppet] - 10https://gerrit.wikimedia.org/r/574796 (https://phabricator.wikimedia.org/T156955) (owner: 10Muehlenhoff) [12:03:18] (03CR) 10KartikMistry: [C: 03+2] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/574469 (https://phabricator.wikimedia.org/T245446) (owner: 10KartikMistry) [12:03:37] Urbanecm: I've rechecked NS_TEMPLATE on Meta and now I see the breadcrumbs [12:03:43] Could it be related to wmf.21? [12:03:47] hauskatze: that's interesting [12:03:56] it can be [12:04:04] let's test at testwiki [12:04:18] PROBLEM - MediaWiki memcached error rate on icinga1001 is CRITICAL: 1.016e+04 gt 5000 https://wikitech.wikimedia.org/wiki/Memcached https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=1&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [12:04:27] hauskatze: https://test.wikipedia.org/wiki/Template:Link_FA/ba/monthly_edits, also seems to work [12:04:29] ^ignore [12:04:41] (03Merged) 10jenkins-bot: Enable CX out of beta in eu, sw and ta WPs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/574469 (https://phabricator.wikimedia.org/T245446) (owner: 10KartikMistry) [12:04:41] (03PS11) 10Jbond: admin: add CI checks to ensure users and group have the correct gid/uid [puppet] - 10https://gerrit.wikimedia.org/r/574412 (https://phabricator.wikimedia.org/T235162) [12:05:26] RECOVERY - MediaWiki memcached error rate on icinga1001 is OK: (C)5000 gt (W)1000 gt 185 https://wikitech.wikimedia.org/wiki/Memcached https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=1&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [12:05:47] !log uploaded 
spicerack_0.0.31-1_amd64.deb to apt.wikimedia.org stretch-wikimedia [12:05:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:08:11] (03CR) 10Jbond: [C: 03+2] admin: add CI checks to ensure users and group have the correct gid/uid [puppet] - 10https://gerrit.wikimedia.org/r/574412 (https://phabricator.wikimedia.org/T235162) (owner: 10Jbond) [12:08:21] kart_: now [12:09:37] cool. Running first 'scap' for 574469 :) [12:10:11] !log kartik@deploy1001 Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit|574469|Enable CX out of beta in eu, sw, and ta Wikipedias (T245446, T245447, T245448)]] (duration: 01m 15s) [12:10:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:10:22] T245446: Enable Content Translation in Swahili Wikipedia as a default tool - https://phabricator.wikimedia.org/T245446 [12:10:22] T245447: Enable Content Translation in Tamil Wikipedia as a default tool - https://phabricator.wikimedia.org/T245447 [12:10:23] T245448: Enable Content Translation in Basque Wikipedia as a default tool - https://phabricator.wikimedia.org/T245448 [12:11:45] !log kartik@deploy1001 Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit|574469|Enable CX out of beta in eu, sw, and ta Wikipedias (T245446, T245447, T245448)]] take II (duration: 01m 05s) [12:11:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:11:57] (03PS5) 10KartikMistry: ContentTranslation: Set cookieDomain for Production [mediawiki-config] - 10https://gerrit.wikimedia.org/r/416973 [12:14:29] (03CR) 10KartikMistry: [C: 03+2] "SWAT." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/416973 (owner: 10KartikMistry) [12:15:25] (03Merged) 10jenkins-bot: ContentTranslation: Set cookieDomain for Production [mediawiki-config] - 10https://gerrit.wikimedia.org/r/416973 (owner: 10KartikMistry) [12:16:52] @Nikerabbit patch is in mwdebug1002 for testing.. 
[12:17:19] kart_: ok, testing [12:19:43] still testing [12:20:50] kart_: as far as I can see, everything is working correctly [12:21:14] mw.cx.siteMapper.siteTemplates.cookieDomain [12:21:14] ".wikipedia.org" [12:21:17] (03PS2) 10Effie Mouzeli: logstash, mediawiki: minor fixes in log streaming [puppet] - 10https://gerrit.wikimedia.org/r/575000 (https://phabricator.wikimedia.org/T244472) [12:21:20] also confirmed the correct value is showing up [12:21:28] cool. [12:21:33] Thanks for testing! [12:21:42] (I did same, but got lost in cookies :)) [12:21:53] Deploying.. [12:22:39] 10Operations, 10Patch-For-Review, 10User-jbond: Wikimedia theme for SSO login page - https://phabricator.wikimedia.org/T233939 (10jbond) initial page load is a bit better {F31631841} [12:23:36] 10Operations, 10Performance Issue: Investigate CAS performance - https://phabricator.wikimedia.org/T246010 (10jbond) initial page load is a bit better {F31631841} [12:23:50] @Nikerabbit AFAIK, we don't need to sync CommonSettings.php twice, right? [12:23:58] @Amir1 ^^ [12:24:18] kart_: AFAIK no [12:24:30] !log kartik@deploy1001 Synchronized wmf-config/CommonSettings.php: SWAT: [[gerrit|416973|ContentTranslation: Set cookieDomain for Production]] (duration: 01m 04s) [12:24:31] OK! [12:24:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:24:41] kart_: doesn't matter this time, since behavior does not change [12:24:50] OK! [12:24:55] We are done with SWAT. [12:25:13] I'm seeing the right value in production, too. 
[12:25:23] \0/ [12:26:09] !log jmm@cumin2001 START - Cookbook sre.hosts.downtime [12:26:11] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [12:26:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:26:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:27:24] 10Operations, 10netops: PyBal BGP group prefix-limit 50 teardown - https://phabricator.wikimedia.org/T246110 (10ayounsi) > [1] I say at least, because in the a bad scenario we experienced, due to operator error (yours truly), summarization failed and we ended up advertising something like 40 /32 prefixes. If t... [12:29:25] (03PS6) 10Jbond: admin: add tests for system users [puppet] - 10https://gerrit.wikimedia.org/r/554302 (https://phabricator.wikimedia.org/T239686) [12:32:34] (03CR) 10Jbond: "pcc: https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/21082/" [puppet] - 10https://gerrit.wikimedia.org/r/566512 (owner: 10Jbond) [12:35:46] (03CR) 10Jbond: [C: 03+2] profile::tlsproxy::envoy: add type checking and defaults [puppet] - 10https://gerrit.wikimedia.org/r/561850 (https://phabricator.wikimedia.org/T240941) (owner: 10Jbond) [12:51:02] (03PS2) 10Hnowlan: changeprop: add hierdata k8s entries and LVS entry [puppet] - 10https://gerrit.wikimedia.org/r/574811 (https://phabricator.wikimedia.org/T213193) [12:57:00] (03PS3) 10Effie Mouzeli: logstash, mediawiki: minor fixes in log streaming [puppet] - 10https://gerrit.wikimedia.org/r/575000 (https://phabricator.wikimedia.org/T244472) [13:04:11] (03PS6) 10Jbond: profile::tlsproxy::envoy: add type checking and defaults [puppet] - 10https://gerrit.wikimedia.org/r/561850 (https://phabricator.wikimedia.org/T240941) [13:05:04] 10Operations, 10netops: can aggregated netflow data include the router it was sampled from? 
- https://phabricator.wikimedia.org/T246186 (10ayounsi) It's possible by adding the keys `in_iface`, `out_iface` and `export_proto_sysid` to `modules/pmacct/templates/nfacctd.conf.erb` See: https://github.com/pmacct/pma... [13:10:37] 10Operations, 10netops: can aggregated netflow data include the router it was sampled from? - https://phabricator.wikimedia.org/T246186 (10ayounsi) About this specific peak, we could also configure pmacct to send BGP events to Kafka using BMP, see https://github.com/pmacct/pmacct/blob/master/QUICKSTART#L2230 [13:13:24] (03PS7) 10Jbond: profile::tlsproxy::envoy: add type checking and defaults [puppet] - 10https://gerrit.wikimedia.org/r/561850 (https://phabricator.wikimedia.org/T240941) [13:13:40] (03CR) 10Jbond: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/561850 (https://phabricator.wikimedia.org/T240941) (owner: 10Jbond) [13:15:46] (03PS2) 10Muehlenhoff: Switch elastic* to standard Partman recipes [puppet] - 10https://gerrit.wikimedia.org/r/570596 (https://phabricator.wikimedia.org/T156955) [13:16:37] (03PS8) 10Jbond: profile::tlsproxy::envoy: add type checking and defaults [puppet] - 10https://gerrit.wikimedia.org/r/561850 (https://phabricator.wikimedia.org/T240941) [13:20:02] (03CR) 10jerkins-bot: [V: 04-1] profile::tlsproxy::envoy: add type checking and defaults [puppet] - 10https://gerrit.wikimedia.org/r/561850 (https://phabricator.wikimedia.org/T240941) (owner: 10Jbond) [13:22:12] (03PS9) 10Jbond: profile::tlsproxy::envoy: add type checking and defaults [puppet] - 10https://gerrit.wikimedia.org/r/561850 (https://phabricator.wikimedia.org/T240941) [13:22:26] (03CR) 10Jbond: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/561850 (https://phabricator.wikimedia.org/T240941) (owner: 10Jbond) [13:24:56] 10Operations, 10Analytics, 10User-Elukey: notebook1003:/srv/ 2% disk space left - https://phabricator.wikimedia.org/T224682 (10ayounsi) 05Resolved→03Open It's at 0% now and alerting. 
[13:26:33] ACKNOWLEDGEMENT - Disk space on notebook1004 is CRITICAL: DISK CRITICAL - free space: / 0 MB (0% inode=93%): Ayounsi https://phabricator.wikimedia.org/T224682#5919599 https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=notebook1004&var-datasource=eqiad+prometheus/ops [13:28:22] (03PS10) 10Jbond: profile::tlsproxy::envoy: add type checking and defaults [puppet] - 10https://gerrit.wikimedia.org/r/561850 (https://phabricator.wikimedia.org/T240941) [13:30:01] (03CR) 10Jbond: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/561850 (https://phabricator.wikimedia.org/T240941) (owner: 10Jbond) [13:33:27] 10Operations, 10Dumps-Generation: (Need By Jan 25) rack/setup/install snapshot1010.eqiad.wmnet - https://phabricator.wikimedia.org/T241794 (10ayounsi) Changed its Netbox status to Active as https://netbox.wikimedia.org/extras/reports/puppetdb.PhysicalHosts/ was alerting and the host looks active. [13:33:50] 10Operations, 10Release-Engineering-Team, 10serviceops: mcrouter proxies and scap proxies - https://phabricator.wikimedia.org/T245841 (10jijiki) 05Declined→03Open (Reopening to discuss this a bit more) I understand your point, automating would be great, but it will not solve the problem where we have ab... 
[13:39:29] (03PS2) 10Muehlenhoff: Enable CASValidateSAML for tendril [puppet] - 10https://gerrit.wikimedia.org/r/574747 [13:40:45] (03CR) 10Volans: [C: 03+2] spicerack: add http_proxy to config file [puppet] - 10https://gerrit.wikimedia.org/r/574155 (owner: 10Volans) [13:41:42] (03PS11) 10Jbond: profile::tlsproxy::envoy: add type checking and defaults [puppet] - 10https://gerrit.wikimedia.org/r/561850 (https://phabricator.wikimedia.org/T240941) [13:42:03] (03CR) 10Jbond: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/561850 (https://phabricator.wikimedia.org/T240941) (owner: 10Jbond) [13:45:53] 10Operations, 10Patch-For-Review: Upgrade install servers to Buster - https://phabricator.wikimedia.org/T224576 (10ayounsi) `apt2001 missing VM from PuppetDB` from https://netbox.wikimedia.org/extras/reports/puppetdb.VirtualMachines/ and https://icinga.wikimedia.org/cgi-bin/icinga/extinfo.cgi?type=2&host=netb... [13:45:59] !log ganeti2001:~$ sudo gnt-instance shutdown apt2001.wikimedia.org - T224576 [13:46:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:46:07] T224576: Upgrade install servers to Buster - https://phabricator.wikimedia.org/T224576 [13:46:44] (03CR) 10Muehlenhoff: "Compared to PS1 I've turned this into a profile option and also changes the validation endpoint (as tested on dbmonitor1001)" [puppet] - 10https://gerrit.wikimedia.org/r/574747 (owner: 10Muehlenhoff) [13:47:53] (03CR) 10Jbond: "> Patch Set 11:" [puppet] - 10https://gerrit.wikimedia.org/r/561850 (https://phabricator.wikimedia.org/T240941) (owner: 10Jbond) [13:52:46] (03CR) 10Jbond: [C: 03+1] Add profile::base::no_firewall [puppet] - 10https://gerrit.wikimedia.org/r/574491 (https://phabricator.wikimedia.org/T104939) (owner: 10Muehlenhoff) [13:53:48] 10Operations, 10netops: RRDP status alert - https://phabricator.wikimedia.org/T245121 (10ayounsi) As it keeps flapping I (temporarily) disabled the alert in eqiad, and we can rely in codfw if there is an 
actual issue. [13:54:08] RECOVERY - rpki grafana alert on icinga1001 is OK: OK: RPKI ( https://grafana.wikimedia.org/d/UwUa77GZk/rpki ) is not alerting. https://wikitech.wikimedia.org/wiki/RPKI%23Grafana_alerts https://grafana.wikimedia.org/d/UwUa77GZk/ [13:54:16] (03PS5) 10Volans: sre.hosts.decommission: improve Ganeti VM support [cookbooks] - 10https://gerrit.wikimedia.org/r/567169 (https://phabricator.wikimedia.org/T231068) [13:54:18] (03PS3) 10Volans: sre.ganeti.makevm: refactor for new spicerack [cookbooks] - 10https://gerrit.wikimedia.org/r/571999 (https://phabricator.wikimedia.org/T231068) [13:56:59] 10Operations, 10ops-eqiad, 10DBA: db1095 backup source crashed: broken BBU - https://phabricator.wikimedia.org/T244958 (10jcrespo) I have updated zarcillo to point prometheus to the right instances (CC @Marostegui to try to have that up to date, because I sometimes forget, and it is totally my fault for not... [13:58:05] 10Operations, 10ops-eqiad, 10DBA: db1095 backup source crashed: broken BBU - https://phabricator.wikimedia.org/T244958 (10jcrespo) I will run a data check on s3 and s2 and then consider this fixed. Maybe moving s2 on codfw to to mirror data distribution? [13:58:54] 10Operations, 10Dumps-Generation: (Need By Jan 25) rack/setup/install snapshot1010.eqiad.wmnet - https://phabricator.wikimedia.org/T241794 (10ArielGlenn) >>! In T241794#5919606, @ayounsi wrote: > Changed its Netbox status to Active as https://netbox.wikimedia.org/extras/reports/puppetdb.PhysicalHosts/ was aler... 
[13:59:50] (03CR) 10Volans: [C: 03+2] "Last PS updated only the module's docstring" [cookbooks] - 10https://gerrit.wikimedia.org/r/567169 (https://phabricator.wikimedia.org/T231068) (owner: 10Volans) [14:00:14] 10Operations, 10Wikidata, 10Wikidata-Query-Service, 10User-Smalyshev, 10cloud-services-team (Kanban): Provide a way to have test servers on real hardware, isolated from production for Wikidata Query Service - https://phabricator.wikimedia.org/T206636 (10Gehel) 05Open→03Resolved a:03Gehel This is ad... [14:00:38] (03PS2) 10Giuseppe Lavagetto: mediawiki::common: use envoy for tls termination too in nodes using it [puppet] - 10https://gerrit.wikimedia.org/r/574988 (https://phabricator.wikimedia.org/T244843) [14:00:40] (03PS1) 10Giuseppe Lavagetto: services_proxy::envoy: specify preferred ciphers [puppet] - 10https://gerrit.wikimedia.org/r/575015 (https://phabricator.wikimedia.org/T244843) [14:00:42] !log run apt-get clean on notebook1004 to free some space - T224682 [14:00:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:00:49] T224682: notebook1003:/srv/ 2% disk space left - https://phabricator.wikimedia.org/T224682 [14:01:23] (03Merged) 10jenkins-bot: sre.hosts.decommission: improve Ganeti VM support [cookbooks] - 10https://gerrit.wikimedia.org/r/567169 (https://phabricator.wikimedia.org/T231068) (owner: 10Volans) [14:02:02] RECOVERY - Disk space on notebook1004 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=notebook1004&var-datasource=eqiad+prometheus/ops [14:03:54] !log deactivate BGP to AS23930 on cr1-eqsin, will re-enable when their technical issues are fixed and they notify us [14:03:55] (03CR) 10jerkins-bot: [V: 04-1] mediawiki::common: use envoy for tls termination too in nodes using it [puppet] - 10https://gerrit.wikimedia.org/r/574988 (https://phabricator.wikimedia.org/T244843) (owner: 10Giuseppe Lavagetto) [14:03:57] Logged 
the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:05:02] !log restart of elasticsearch on cloudelastic for JVM upgrade completed [14:05:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:08:50] 10Operations, 10Release-Engineering-Team, 10serviceops: mcrouter proxies and scap proxies - https://phabricator.wikimedia.org/T245841 (10Joe) >>! In T245841#5919608, @jijiki wrote: > (Reopening to discuss this a bit more) > > I understand your point, automating would be great, but it will not solve the prob... [14:11:57] !log volans@cumin2001 START - Cookbook sre.hosts.decommission [14:12:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:12:18] !log volans@cumin2001 END (FAIL) - Cookbook sre.hosts.decommission (exit_code=1) [14:12:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:13:36] (03CR) 10Giuseppe Lavagetto: [C: 03+2] services_proxy::envoy: specify preferred ciphers [puppet] - 10https://gerrit.wikimedia.org/r/575015 (https://phabricator.wikimedia.org/T244843) (owner: 10Giuseppe Lavagetto) [14:14:57] (03PS1) 10Marostegui: db-eqiad,db-codfw.php: Add es4 to default ES cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/575016 (https://phabricator.wikimedia.org/T246072) [14:15:52] (03CR) 10Marostegui: [C: 04-2] "This is the last step of T246072#5919707" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/575016 (https://phabricator.wikimedia.org/T246072) (owner: 10Marostegui) [14:16:29] (03CR) 10Marostegui: [C: 04-2] "> This is the last step of T246072#5919707" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/575016 (https://phabricator.wikimedia.org/T246072) (owner: 10Marostegui) [14:18:44] (03CR) 10Muehlenhoff: [C: 03+2] Switch elastic* to standard Partman recipes [puppet] - 10https://gerrit.wikimedia.org/r/570596 (https://phabricator.wikimedia.org/T156955) (owner: 10Muehlenhoff) [14:19:09] !log volans@cumin2001 START - Cookbook 
sre.hosts.decommission [14:19:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:19:27] _joe_: can I puppet-merge your envoy patch along? [14:19:51] !log volans@cumin2001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) [14:19:53] <_joe_> moritzm: sure, sorry, I thought I didn't submit it [14:19:53] also fine to wait (or you merge when the time is right), mine isn't urgent at all [14:19:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:19:58] ack, doing [14:27:43] (03CR) 10Giuseppe Lavagetto: "Overall LGTM, small nitpick but it's ok either way." (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/561850 (https://phabricator.wikimedia.org/T240941) (owner: 10Jbond) [14:28:43] (03PS4) 10Volans: sre.ganeti.makevm: refactor for new spicerack [cookbooks] - 10https://gerrit.wikimedia.org/r/571999 (https://phabricator.wikimedia.org/T231068) [14:28:45] (03PS1) 10Volans: sre.hosts.decommission: fix Ganeti VM decom path [cookbooks] - 10https://gerrit.wikimedia.org/r/575017 (https://phabricator.wikimedia.org/T231068) [14:29:15] (03PS3) 10Giuseppe Lavagetto: conftool::scripts: remove compatibility, disable draining [puppet] - 10https://gerrit.wikimedia.org/r/573291 (https://phabricator.wikimedia.org/T245594) [14:29:59] (03CR) 10Volans: "Tested live on cumin2001. Merging for now, happy to adjust later." [cookbooks] - 10https://gerrit.wikimedia.org/r/575017 (https://phabricator.wikimedia.org/T231068) (owner: 10Volans) [14:31:09] (03CR) 10Jhedden: "Will this add duplicate compute node entries for each exec node? Could we use SGE's host_aliases instead? 
http://gridscheduler.sourceforge" [puppet] - 10https://gerrit.wikimedia.org/r/574885 (https://phabricator.wikimedia.org/T245572) (owner: 10Bstorm) [14:31:28] (03CR) 10Volans: [C: 03+2] sre.hosts.decommission: fix Ganeti VM decom path [cookbooks] - 10https://gerrit.wikimedia.org/r/575017 (https://phabricator.wikimedia.org/T231068) (owner: 10Volans) [14:32:56] (03Merged) 10jenkins-bot: sre.hosts.decommission: fix Ganeti VM decom path [cookbooks] - 10https://gerrit.wikimedia.org/r/575017 (https://phabricator.wikimedia.org/T231068) (owner: 10Volans) [14:34:08] (03PS1) 10Muehlenhoff: Unroll analytics Partman configs [puppet] - 10https://gerrit.wikimedia.org/r/575018 (https://phabricator.wikimedia.org/T156955) [14:34:35] (03CR) 10Giuseppe Lavagetto: [C: 03+2] conftool::scripts: remove compatibility, disable draining [puppet] - 10https://gerrit.wikimedia.org/r/573291 (https://phabricator.wikimedia.org/T245594) (owner: 10Giuseppe Lavagetto) [14:36:58] (03CR) 10Giuseppe Lavagetto: conftool::scripts: refuse to pool a server if the weight is 0 (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/573292 (https://phabricator.wikimedia.org/T245594) (owner: 10Giuseppe Lavagetto) [14:39:26] 10Operations: Track remaining jessie systems in production - https://phabricator.wikimedia.org/T224549 (10MoritzMuehlenhoff) [14:40:17] (03PS2) 10Muehlenhoff: Add profile::base::no_firewall [puppet] - 10https://gerrit.wikimedia.org/r/574491 (https://phabricator.wikimedia.org/T104939) [14:41:56] (03PS3) 10Giuseppe Lavagetto: conftool::scripts: refuse to pool a server if the weight is 0 [puppet] - 10https://gerrit.wikimedia.org/r/573292 (https://phabricator.wikimedia.org/T245594) [14:42:39] (03CR) 10Giuseppe Lavagetto: [C: 03+2] conftool::scripts: refuse to pool a server if the weight is 0 [puppet] - 10https://gerrit.wikimedia.org/r/573292 (https://phabricator.wikimedia.org/T245594) (owner: 10Giuseppe Lavagetto) [14:45:19] (03PS1) 10Mvolz: Update citoid to 79245b40b6 [deployment-charts] - 
10https://gerrit.wikimedia.org/r/575019 [14:46:24] (03CR) 10Muehlenhoff: [C: 03+2] Add profile::base::no_firewall [puppet] - 10https://gerrit.wikimedia.org/r/574491 (https://phabricator.wikimedia.org/T104939) (owner: 10Muehlenhoff) [14:46:26] (03PS12) 10Jbond: profile::tlsproxy::envoy: add type checking and defaults [puppet] - 10https://gerrit.wikimedia.org/r/561850 (https://phabricator.wikimedia.org/T240941) [14:46:47] !log volans@cumin2001 START - Cookbook sre.ganeti.makevm [14:46:48] (03CR) 10Jbond: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/561850 (https://phabricator.wikimedia.org/T240941) (owner: 10Jbond) [14:46:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:47:46] (03CR) 10Jbond: "thanks" (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/561850 (https://phabricator.wikimedia.org/T240941) (owner: 10Jbond) [14:48:17] (03PS1) 10DCausse: [cirrus] configure wgCirrusSearchMaxShardsPerNode per cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/575020 [14:50:55] (03CR) 10Gehel: [C: 03+1] "LGTM" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/575020 (owner: 10DCausse) [14:51:57] !log volans@cumin2001 END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) [14:52:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:54:27] !log volans@cumin2001 START - Cookbook sre.hosts.decommission [14:54:27] !log volans@cumin2001 END (FAIL) - Cookbook sre.hosts.decommission (exit_code=99) [14:54:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:54:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:55:58] (03PS1) 10Muehlenhoff: Add profile::base::no_firewall to load balancers [puppet] - 10https://gerrit.wikimedia.org/r/575022 [14:56:10] (03PS5) 10Volans: sre.ganeti.makevm: refactor for new spicerack [cookbooks] - 10https://gerrit.wikimedia.org/r/571999 (https://phabricator.wikimedia.org/T231068) [14:56:25] 
(03CR) 10Elukey: [C: 03+1] Unroll analytics Partman configs [puppet] - 10https://gerrit.wikimedia.org/r/575018 (https://phabricator.wikimedia.org/T156955) (owner: 10Muehlenhoff) [14:57:42] (03CR) 10Volans: [C: 03+2] "Changes tested on cumin2001 with a test VM. Since last review:" [cookbooks] - 10https://gerrit.wikimedia.org/r/571999 (https://phabricator.wikimedia.org/T231068) (owner: 10Volans) [14:59:25] (03Merged) 10jenkins-bot: sre.ganeti.makevm: refactor for new spicerack [cookbooks] - 10https://gerrit.wikimedia.org/r/571999 (https://phabricator.wikimedia.org/T231068) (owner: 10Volans) [15:00:41] (03PS9) 10Elukey: Move all Report Updater Jobs to an-launcher1001 [puppet] - 10https://gerrit.wikimedia.org/r/574722 (https://phabricator.wikimedia.org/T243934) [15:02:14] (03PS13) 10Jbond: profile::tlsproxy::envoy: add type checking and defaults [puppet] - 10https://gerrit.wikimedia.org/r/561850 (https://phabricator.wikimedia.org/T240941) [15:04:11] (03CR) 10Elukey: "https://puppet-compiler.wmflabs.org/compiler1002/21087/" [puppet] - 10https://gerrit.wikimedia.org/r/574722 (https://phabricator.wikimedia.org/T243934) (owner: 10Elukey) [15:06:35] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=swagger_check_cxserver_cluster_eqiad site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [15:07:36] (03CR) 10Jbond: [C: 03+2] profile::tlsproxy::envoy: add type checking and defaults [puppet] - 10https://gerrit.wikimedia.org/r/561850 (https://phabricator.wikimedia.org/T240941) (owner: 10Jbond) [15:08:12] (03PS1) 10Volans: admin: add Brandon's temporary key [puppet] - 10https://gerrit.wikimedia.org/r/575023 [15:08:43] (03CR) 10Muehlenhoff: [C: 03+2] Unroll analytics Partman configs [puppet] - 10https://gerrit.wikimedia.org/r/575018 (https://phabricator.wikimedia.org/T156955) (owner: 10Muehlenhoff) [15:09:01] (03CR) 10RLazarus: [C: 03+1] admin: add Brandon's 
temporary key [puppet] - 10https://gerrit.wikimedia.org/r/575023 (owner: 10Volans) [15:09:07] (03CR) 10BBlack: [C: 03+1] admin: add Brandon's temporary key [puppet] - 10https://gerrit.wikimedia.org/r/575023 (owner: 10Volans) [15:09:52] (03CR) 10Ottomata: [C: 03+1] "Found a typo but +1" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/574722 (https://phabricator.wikimedia.org/T243934) (owner: 10Elukey) [15:10:05] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [15:11:32] (03CR) 10Volans: [C: 03+2] admin: add Brandon's temporary key [puppet] - 10https://gerrit.wikimedia.org/r/575023 (owner: 10Volans) [15:11:56] 10Operations, 10Patch-For-Review: Upgrade install servers to Buster - https://phabricator.wikimedia.org/T224576 (10Dzahn) Thanks @ayounsi For unknown reasons i could never get console on this server so I did not know if it was actually installing an OS. (The exact same thing in eqiad works fine). It does... 
[15:13:13] (03PS1) 10Muehlenhoff: Re-enable CAS authentication after enabling CASValidateSAML [puppet] - 10https://gerrit.wikimedia.org/r/575026 [15:15:18] (03PS4) 10Jbond: profile::tlsproxy::envoy: refactor parameters [puppet] - 10https://gerrit.wikimedia.org/r/574009 (https://phabricator.wikimedia.org/T240941) [15:15:55] (03PS5) 10Jbond: profile::tlsproxy::envoy: refactor parameters [puppet] - 10https://gerrit.wikimedia.org/r/574009 (https://phabricator.wikimedia.org/T240941) [15:17:25] !log volans@cumin1001 START - Cookbook sre.hosts.decommission [15:17:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:18:37] (03PS6) 10Jbond: profile::tlsproxy::envoy: refactor parameters [puppet] - 10https://gerrit.wikimedia.org/r/574009 (https://phabricator.wikimedia.org/T240941) [15:19:02] !log volans@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) [15:19:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:19:42] (03CR) 10Ppchelko: [C: 03+2] Update citoid to 79245b40b6 [deployment-charts] - 10https://gerrit.wikimedia.org/r/575019 (owner: 10Mvolz) [15:19:45] 10Operations: Track remaining jessie systems in production - https://phabricator.wikimedia.org/T224549 (10Volans) [15:20:00] (03Merged) 10jenkins-bot: Update citoid to 79245b40b6 [deployment-charts] - 10https://gerrit.wikimedia.org/r/575019 (owner: 10Mvolz) [15:20:22] (03PS7) 10Jbond: profile::tlsproxy::envoy: refactor parameters [puppet] - 10https://gerrit.wikimedia.org/r/574009 (https://phabricator.wikimedia.org/T240941) [15:20:43] (03PS8) 10Jbond: profile::tlsproxy::envoy: add custom SNI type [puppet] - 10https://gerrit.wikimedia.org/r/574009 (https://phabricator.wikimedia.org/T240941) [15:20:54] (03CR) 10Alexandros Kosiaris: configmaster: Add DNS Discovery discrepancy check (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/573963 (owner: 10Alexandros Kosiaris) [15:21:24] (03CR) 10Jbond: "check experimental" [puppet] - 
10https://gerrit.wikimedia.org/r/574009 (https://phabricator.wikimedia.org/T240941) (owner: 10Jbond) [15:21:34] (03PS7) 10Alexandros Kosiaris: configmaster: Add DNS Discovery discrepancy check [puppet] - 10https://gerrit.wikimedia.org/r/573963 [15:25:01] !log addshore@mwmaint1002:~$ time mwscript extensions/Wikibase/repo/maintenance/rebuildItemTerms.php --wiki=wikidatawiki --batch-size=50 --sleep=1 --file=20to30holes-25feb2229 # T219123 [15:25:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:25:08] T219123: Migrate to and read from new store for item terms - https://phabricator.wikimedia.org/T219123 [15:33:36] (03CR) 10Jbond: [C: 03+2] profile::tlsproxy::envoy: add custom SNI type [puppet] - 10https://gerrit.wikimedia.org/r/574009 (https://phabricator.wikimedia.org/T240941) (owner: 10Jbond) [15:34:30] (03CR) 10Bstorm: [C: 04-1] "> Patch Set 3:" [puppet] - 10https://gerrit.wikimedia.org/r/574885 (https://phabricator.wikimedia.org/T245572) (owner: 10Bstorm) [15:35:14] (03CR) 10Bstorm: [C: 04-1] "> Patch Set 3:" [puppet] - 10https://gerrit.wikimedia.org/r/574885 (https://phabricator.wikimedia.org/T245572) (owner: 10Bstorm) [15:37:13] PROBLEM - Check size of conntrack table on stat1007 is CRITICAL: connect to address 10.64.21.118 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack [15:37:27] PROBLEM - Check systemd state on stat1007 is CRITICAL: connect to address 10.64.21.118 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:37:29] PROBLEM - MD RAID on stat1007 is CRITICAL: connect to address 10.64.21.118 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering [15:37:37] PROBLEM - Check whether ferm is active by checking the default input chain on stat1007 is CRITICAL: connect to address 10.64.21.118 port 5666: Connection refused 
https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [15:37:45] PROBLEM - configured eth on stat1007 is CRITICAL: connect to address 10.64.21.118 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_eth [15:38:09] PROBLEM - DPKG on stat1007 is CRITICAL: connect to address 10.64.21.118 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [15:38:11] PROBLEM - Disk space on stat1007 is CRITICAL: connect to address 10.64.21.118 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=stat1007&var-datasource=eqiad+prometheus/ops [15:38:12] ouch [15:38:15] PROBLEM - dhclient process on stat1007 is CRITICAL: connect to address 10.64.21.118 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_dhclient [15:38:18] host down? [15:38:27] nono OOM [15:38:35] (03CR) 10Dzahn: "> Patch Set 2:" [puppet] - 10https://gerrit.wikimedia.org/r/574751 (https://phabricator.wikimedia.org/T246105) (owner: 10Aklapper) [15:38:42] (03PS3) 10Giuseppe Lavagetto: mediawiki::common: use envoy for tls termination too in nodes using it [puppet] - 10https://gerrit.wikimedia.org/r/574988 (https://phabricator.wikimedia.org/T244843) [15:38:51] (03PS6) 10Jbond: profile::tlsproxy::envoy: add support for acme certs [puppet] - 10https://gerrit.wikimedia.org/r/574010 (https://phabricator.wikimedia.org/T240941) [15:39:19] RECOVERY - Check size of conntrack table on stat1007 is OK: OK: nf_conntrack is 0 % full https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack [15:39:33] RECOVERY - Check systemd state on stat1007 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:39:35] RECOVERY - MD RAID on stat1007 is OK: OK: Active: 8, Working: 8, Failed: 0, Spare: 0 
https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering [15:39:43] RECOVERY - Check whether ferm is active by checking the default input chain on stat1007 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [15:39:51] RECOVERY - configured eth on stat1007 is OK: OK - interfaces up https://wikitech.wikimedia.org/wiki/Monitoring/check_eth [15:40:07] 10Operations, 10netops: PyBal BGP group prefix-limit 50 teardown - https://phabricator.wikimedia.org/T246110 (10akosiaris) >>! In T246110#5919494, @ayounsi wrote: >> [1] I say at least, because in the a bad scenario we experienced, due to operator error (yours truly), summarization failed and we ended up adver... [15:40:15] RECOVERY - DPKG on stat1007 is OK: All packages OK https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [15:40:17] RECOVERY - Disk space on stat1007 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=stat1007&var-datasource=eqiad+prometheus/ops [15:40:21] RECOVERY - dhclient process on stat1007 is OK: PROCS OK: 0 processes with command name dhclient https://wikitech.wikimedia.org/wiki/Monitoring/check_dhclient [15:41:04] (03CR) 10RLazarus: [C: 03+1] add mw2291 through mw2300 as API appservers [puppet] - 10https://gerrit.wikimedia.org/r/574907 (https://phabricator.wikimedia.org/T241852) (owner: 10Dzahn) [15:41:36] (03CR) 10jerkins-bot: [V: 04-1] mediawiki::common: use envoy for tls termination too in nodes using it [puppet] - 10https://gerrit.wikimedia.org/r/574988 (https://phabricator.wikimedia.org/T244843) (owner: 10Giuseppe Lavagetto) [15:42:07] (03PS4) 10Effie Mouzeli: logstash, mediawiki: minor fixes in log streaming [puppet] - 10https://gerrit.wikimedia.org/r/575000 (https://phabricator.wikimedia.org/T244472) [15:42:23] (03PS1) 10Dzahn: mariadb: grant phabricator stats user access to herald 
database [puppet] - 10https://gerrit.wikimedia.org/r/575030 (https://phabricator.wikimedia.org/T246105) [15:43:11] (03PS7) 10Jbond: profile::tlsproxy::envoy: add support for acme certs [puppet] - 10https://gerrit.wikimedia.org/r/574010 (https://phabricator.wikimedia.org/T240941) [15:43:34] (03CR) 10Jbond: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/574010 (https://phabricator.wikimedia.org/T240941) (owner: 10Jbond) [15:43:44] (03CR) 10Dzahn: [C: 04-2] "wrong IPs..." [puppet] - 10https://gerrit.wikimedia.org/r/575030 (https://phabricator.wikimedia.org/T246105) (owner: 10Dzahn) [15:45:57] (03PS2) 10Dzahn: mariadb: grant phabricator stats user access to herald database [puppet] - 10https://gerrit.wikimedia.org/r/575030 (https://phabricator.wikimedia.org/T246105) [15:47:36] (03PS3) 10Dzahn: mariadb: grant phabricator stats user access to herald database [puppet] - 10https://gerrit.wikimedia.org/r/575030 (https://phabricator.wikimedia.org/T246105) [15:48:24] (03CR) 10Mforns: [C: 04-1] "I think there's still one thing that will break. 
On the reportupdater job in puppet, on line 57, $title is used to specify the query folde" [puppet] - 10https://gerrit.wikimedia.org/r/574722 (https://phabricator.wikimedia.org/T243934) (owner: 10Elukey) [15:50:20] (03CR) 10Mforns: [C: 04-1] "3) Would be cool, if possible, because the separation into config-hive.yaml and config-mysql.yaml was made because of the necessity to acc" [puppet] - 10https://gerrit.wikimedia.org/r/574722 (https://phabricator.wikimedia.org/T243934) (owner: 10Elukey) [15:51:09] !log starting s2, s3 eqiad backup source data check; expect increase read traffic on db1095:3313, db1140:3312, db1078, db1090:3312 T244958 [15:51:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:51:16] T244958: db1095 backup source crashed: broken BBU - https://phabricator.wikimedia.org/T244958 [15:51:57] (03CR) 10Elukey: "> 3) Would be cool, if possible, because the separation into" [puppet] - 10https://gerrit.wikimedia.org/r/574722 (https://phabricator.wikimedia.org/T243934) (owner: 10Elukey) [15:53:49] (03CR) 10Vgutierrez: [C: 03+1] profile::tlsproxy::envoy: add support for acme certs (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/574010 (https://phabricator.wikimedia.org/T240941) (owner: 10Jbond) [15:55:13] (03CR) 10Jbond: "thanks" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/574010 (https://phabricator.wikimedia.org/T240941) (owner: 10Jbond) [15:55:29] (03PS8) 10Jbond: profile::tlsproxy::envoy: add support for acme certs [puppet] - 10https://gerrit.wikimedia.org/r/574010 (https://phabricator.wikimedia.org/T240941) [15:56:39] (03CR) 10Jcrespo: [C: 03+1] mariadb: grant phabricator stats user access to herald database [puppet] - 10https://gerrit.wikimedia.org/r/575030 (https://phabricator.wikimedia.org/T246105) (owner: 10Dzahn) [15:56:51] (03CR) 10Dzahn: "needs https://gerrit.wikimedia.org/r/c/operations/puppet/+/575030" [puppet] - 10https://gerrit.wikimedia.org/r/574751 
(https://phabricator.wikimedia.org/T246105) (owner: 10Aklapper) [15:57:34] (03PS2) 10Jbond: Re-enable CAS authentication after enabling CASValidateSAML [puppet] - 10https://gerrit.wikimedia.org/r/575026 (owner: 10Muehlenhoff) [16:00:25] !log deploy new grants to phabricator stats user to database on m3 T246105 [16:00:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:00:31] T246105: Weekly phabricator-reports mail: List active Herald rules authored by users not recently active - https://phabricator.wikimedia.org/T246105 [16:02:10] (03CR) 10Jcrespo: [C: 03+1] "This has already been deploy to production successfully, please merge when you can." [puppet] - 10https://gerrit.wikimedia.org/r/575030 (https://phabricator.wikimedia.org/T246105) (owner: 10Dzahn) [16:05:13] (03PS10) 10Elukey: Move all Report Updater Jobs to an-launcher1001 [puppet] - 10https://gerrit.wikimedia.org/r/574722 (https://phabricator.wikimedia.org/T243934) [16:05:44] (03PS3) 10Jbond: Enable CASValidateSAML for tendril [puppet] - 10https://gerrit.wikimedia.org/r/574747 (owner: 10Muehlenhoff) [16:08:37] (03CR) 10Elukey: "Marcel: I modified the reportupdater::job define, let me know if you like it!" 
[puppet] - 10https://gerrit.wikimedia.org/r/574722 (https://phabricator.wikimedia.org/T243934) (owner: 10Elukey) [16:09:18] (03PS4) 10Jbond: Enable CASValidateSAML for tendril [puppet] - 10https://gerrit.wikimedia.org/r/574747 (owner: 10Muehlenhoff) [16:11:25] (03PS3) 10Jbond: Re-enable CAS authentication after enabling CASValidateSAML [puppet] - 10https://gerrit.wikimedia.org/r/575026 (owner: 10Muehlenhoff) [16:14:36] (03PS5) 10Jbond: Enable CASValidateSAML for tendril [puppet] - 10https://gerrit.wikimedia.org/r/574747 (owner: 10Muehlenhoff) [16:18:26] (03PS4) 10Jbond: Re-enable CAS authentication after enabling CASValidateSAML [puppet] - 10https://gerrit.wikimedia.org/r/575026 (owner: 10Muehlenhoff) [16:20:34] (03PS1) 10Papaul: DHCP: Add mgmt and production DNS for parse200[1-20] [dns] - 10https://gerrit.wikimedia.org/r/575034 [16:20:37] (03PS6) 10Jbond: Enable CASValidateSAML for tendril [puppet] - 10https://gerrit.wikimedia.org/r/574747 (owner: 10Muehlenhoff) [16:20:52] (03PS5) 10Jbond: Re-enable CAS authentication after enabling CASValidateSAML [puppet] - 10https://gerrit.wikimedia.org/r/575026 (owner: 10Muehlenhoff) [16:22:25] (03PS7) 10Jbond: Enable CASValidateSAML for tendril [puppet] - 10https://gerrit.wikimedia.org/r/574747 (owner: 10Muehlenhoff) [16:22:40] (03PS6) 10Jbond: Re-enable CAS authentication after enabling CASValidateSAML [puppet] - 10https://gerrit.wikimedia.org/r/575026 (owner: 10Muehlenhoff) [16:23:18] (03PS1) 10Milimetric: Make normalized request count available in Turnilo [puppet] - 10https://gerrit.wikimedia.org/r/575035 (https://phabricator.wikimedia.org/T241162) [16:24:08] 10Operations, 10Analytics, 10Analytics-Cluster, 10Cloud-Services: notebook1003 failed network mount on boot - https://phabricator.wikimedia.org/T204857 (10elukey) 05Open→03Resolved a:03elukey [16:27:17] (03PS8) 10Jbond: Enable CASValidateSAML for tendril [puppet] - 10https://gerrit.wikimedia.org/r/574747 (owner: 10Muehlenhoff) [16:27:33] (03PS7) 10Jbond: Re-enable CAS 
authentication after enabling CASValidateSAML [puppet] - 10https://gerrit.wikimedia.org/r/575026 (owner: 10Muehlenhoff) [16:33:29] (03PS9) 10Jbond: Enable CASValidateSAML for tendril [puppet] - 10https://gerrit.wikimedia.org/r/574747 (owner: 10Muehlenhoff) [16:33:43] (03PS8) 10Jbond: Re-enable CAS authentication after enabling CASValidateSAML [puppet] - 10https://gerrit.wikimedia.org/r/575026 (owner: 10Muehlenhoff) [16:36:42] (03PS10) 10Jbond: Enable CASValidateSAML for tendril [puppet] - 10https://gerrit.wikimedia.org/r/574747 (owner: 10Muehlenhoff) [16:37:09] (03PS9) 10Jbond: Re-enable CAS authentication after enabling CASValidateSAML [puppet] - 10https://gerrit.wikimedia.org/r/575026 (owner: 10Muehlenhoff) [16:38:14] 10Operations, 10ops-codfw, 10fundraising-tech-ops, 10Patch-For-Review: (Need by: TBD) rack/setup/install frpm2001.frack.codfw.wmnet - https://phabricator.wikimedia.org/T242269 (10Papaul) @Jgreen i already have a task (T244950) to track this down can we declined this and tack it in T244950 or do you want... 
[16:38:37] PROBLEM - Host cr2-esams is DOWN: PING CRITICAL - Packet loss = 100% [16:38:52] KNOWN [16:38:58] ah you beat me [16:39:01] ack [16:39:09] see team channel discussion [16:39:25] PROBLEM - Host cr2-esams IPv6 is DOWN: CRITICAL - Destination Unreachable (2620:0:862:ffff::3) [16:39:27] (03PS11) 10Jbond: Enable CASValidateSAML for tendril [puppet] - 10https://gerrit.wikimedia.org/r/574747 (owner: 10Muehlenhoff) [16:39:38] please !log though robh [16:39:44] (03PS10) 10Jbond: Re-enable CAS authentication after enabling CASValidateSAML [puppet] - 10https://gerrit.wikimedia.org/r/575026 (owner: 10Muehlenhoff) [16:40:00] !log please note cr2-esams work is ongoing via T246009 and its downtime is expected [16:40:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:40:09] godog: done, sorry about that should have done it sooner [16:40:28] (03CR) 10Jbond: [C: 03+1] "lgtm" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/575026 (owner: 10Muehlenhoff) [16:40:34] robh: np! 
caught up with backlog now [16:41:39] PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 240, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [16:42:05] PROBLEM - OSPF status on cr3-knams is CRITICAL: OSPFv2: 3/4 UP : OSPFv3: 3/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [16:42:13] PROBLEM - OSPF status on mr1-esams is CRITICAL: OSPFv2: 1/2 UP : OSPFv3: 1/2 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [16:42:43] anything that tied back to cr2-esams is going to complain [16:42:53] (03CR) 10Jbond: [C: 03+1] "LGTM sorry for the noise" (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/574747 (owner: 10Muehlenhoff) [16:43:00] (03PS1) 10Mvolz: Update Zotero to I5e24a0f18 [deployment-charts] - 10https://gerrit.wikimedia.org/r/575040 [16:43:07] (03CR) 10jerkins-bot: [V: 04-1] Re-enable CAS authentication after enabling CASValidateSAML [puppet] - 10https://gerrit.wikimedia.org/r/575026 (owner: 10Muehlenhoff) [16:44:33] PROBLEM - IPv4 ping to esams on ripe-atlas-esams is CRITICAL: CRITICAL - failed 107 probes of 604 (alerts on 35) - https://atlas.ripe.net/measurements/23449935/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [16:44:53] PROBLEM - IPv6 ping to esams on ripe-atlas-esams IPv6 is CRITICAL: CRITICAL - failed 105 probes of 525 (alerts on 50) - https://atlas.ripe.net/measurements/23449938/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [16:45:07] basically ignore any esams alerts, for now [16:48:43] (03CR) 10Jbond: [C: 03+2] profile::tlsproxy::envoy: add support for acme certs [puppet] - 10https://gerrit.wikimedia.org/r/574010 (https://phabricator.wikimedia.org/T240941) (owner: 10Jbond) [16:50:43] RECOVERY - IPv4 ping to 
esams on ripe-atlas-esams is OK: OK - failed 5 probes of 605 (alerts on 35) - https://atlas.ripe.net/measurements/23449935/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [16:51:01] RECOVERY - IPv6 ping to esams on ripe-atlas-esams IPv6 is OK: OK - failed 32 probes of 525 (alerts on 50) - https://atlas.ripe.net/measurements/23449938/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [16:51:21] (03PS2) 10Jbond: profile::idp: update profile to use tlsproxy::envoy [puppet] - 10https://gerrit.wikimedia.org/r/574020 (https://phabricator.wikimedia.org/T240941) [16:51:23] (03PS3) 10Jbond: profile::idp: update profile to use tlsproxy::envoy [puppet] - 10https://gerrit.wikimedia.org/r/574020 (https://phabricator.wikimedia.org/T240941) [16:55:57] (03CR) 10Jbond: [C: 03+1] "looks ok to me but better wait for filipo to check the grok changes" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/575000 (https://phabricator.wikimedia.org/T244472) (owner: 10Effie Mouzeli) [16:57:41] PROBLEM - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is CRITICAL: /api (bad URL) timed out before a response was received https://wikitech.wikimedia.org/wiki/Citoid [16:59:45] RECOVERY - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Citoid [17:04:23] RECOVERY - Host cr2-esams is UP: PING OK - Packet loss = 0%, RTA = 84.80 ms [17:04:29] <_joe_> hi cr2 [17:04:31] <_joe_> how are you [17:04:49] it just got a hip replacement [17:05:13] RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 242, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [17:05:14] my page ringtone is way too cheerful, gotta fix that [17:05:41] RECOVERY - OSPF status on cr3-knams is OK: OSPFv2: 4/4 UP 
: OSPFv3: 4/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [17:06:13] RECOVERY - OSPF status on mr1-esams is OK: OSPFv2: 2/2 UP : OSPFv3: 2/2 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [17:06:19] apergos: lol [17:07:24] (03CR) 10Mforns: "Yea. 1) is the easiest! Couple comments :]" (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/574722 (https://phabricator.wikimedia.org/T243934) (owner: 10Elukey) [17:07:40] (03PS2) 10Ottomata: Make normalized request count available in Turnilo [puppet] - 10https://gerrit.wikimedia.org/r/575035 (https://phabricator.wikimedia.org/T241162) (owner: 10Milimetric) [17:08:29] PROBLEM - BGP status on cr2-esams is CRITICAL: BGP CRITICAL - AS6939/IPv4: Connect, AS6939/IPv6: Connect https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [17:09:52] !log cr2-esasms work done, cr3-esams linecard swap starting now via T245825 [17:09:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:10:11] (03PS1) 10BBlack: new key for bblack [homer/public] - 10https://gerrit.wikimedia.org/r/575045 [17:10:15] RECOVERY - Host cr2-esams IPv6 is UP: PING OK - Packet loss = 0%, RTA = 83.75 ms [17:10:24] (03CR) 10Ottomata: [C: 03+2] Make normalized request count available in Turnilo [puppet] - 10https://gerrit.wikimedia.org/r/575035 (https://phabricator.wikimedia.org/T241162) (owner: 10Milimetric) [17:12:43] (03CR) 10Elukey: Move all Report Updater Jobs to an-launcher1001 (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/574722 (https://phabricator.wikimedia.org/T243934) (owner: 10Elukey) [17:13:27] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=swagger_check_cxserver_cluster_eqiad site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [17:14:13] PROBLEM - OSPF status on cr3-knams is CRITICAL: OSPFv2: 3/4 UP : OSPFv3: 3/4 UP 
https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [17:14:47] PROBLEM - OSPF status on mr1-esams is CRITICAL: OSPFv2: 1/2 UP : OSPFv3: 1/2 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [17:15:35] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [17:17:49] (03CR) 10Dzahn: [C: 03+2] mariadb: grant phabricator stats user access to herald database [puppet] - 10https://gerrit.wikimedia.org/r/575030 (https://phabricator.wikimedia.org/T246105) (owner: 10Dzahn) [17:20:19] (03PS11) 10Elukey: Move all Report Updater Jobs to an-launcher1001 [puppet] - 10https://gerrit.wikimedia.org/r/574722 (https://phabricator.wikimedia.org/T243934) [17:22:58] (03PS5) 10Effie Mouzeli: logstash, mediawiki: minor fixes in log streaming [puppet] - 10https://gerrit.wikimedia.org/r/575000 (https://phabricator.wikimedia.org/T244472) [17:24:51] RECOVERY - OSPF status on cr3-knams is OK: OSPFv2: 4/4 UP : OSPFv3: 4/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [17:25:23] RECOVERY - OSPF status on mr1-esams is OK: OSPFv2: 2/2 UP : OSPFv3: 2/2 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [17:26:58] 10Operations, 10ops-codfw, 10Patch-For-Review: (Need by: TBD) codfw: rack/setup/install parse200[1-20].codfw.wmnet - https://phabricator.wikimedia.org/T243112 (10Papaul) [17:27:09] (03CR) 10Dzahn: "access confirmed working with prod credentials:" [puppet] - 10https://gerrit.wikimedia.org/r/575030 (https://phabricator.wikimedia.org/T246105) (owner: 10Dzahn) [17:29:47] I seem to be getting 502s on the account eligibility tool on toolforge, would here be the right spot to report that? 
[17:30:12] phuzion: #wikimedia-cloud [17:30:18] Lucas_WMDE: thanks :) [17:30:20] migration of tools is currently in progress [17:30:27] what’s the tool name? [17:31:32] Lucas_WMDE: https://tools.wmflabs.org/meta/accounteligibility/52 it seems to have loaded now [17:31:47] it's the tool to check if you're eligible to vote in steward elections. [17:32:08] (03CR) 10Dzahn: "phab1001:~] $ mysql -u phstats -p -h m3-master.codfw.wmnet phabricator_herald" [puppet] - 10https://gerrit.wikimedia.org/r/574751 (https://phabricator.wikimedia.org/T246105) (owner: 10Aklapper) [17:32:30] (03CR) 10Dzahn: [C: 03+2] phabricator weekly changes email: List Herald rules by inactive users [puppet] - 10https://gerrit.wikimedia.org/r/574751 (https://phabricator.wikimedia.org/T246105) (owner: 10Aklapper) [17:34:05] RECOVERY - BGP status on cr2-esams is OK: BGP OK - up: 407, down: 5, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [17:40:09] !log re-enable transits on cr3-esams [17:40:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:40:19] 10Operations, 10ops-codfw, 10DC-Ops, 10fundraising-tech-ops: (Need by: ASAP) rack/setup/install frdb2001 - https://phabricator.wikimedia.org/T245566 (10Jgreen) a:05Jgreen→03Dwisehaupt Reassigning to Dallas since he's doing all the work! [17:40:42] 10Operations, 10Mail, 10Phabricator, 10Regression: Weekly phabricator-reports mail cronjob broken since January 2020 - https://phabricator.wikimedia.org/T244677 (10Dzahn) a:03Dzahn [17:42:50] !log remove ns2 redirect to eqiad on cr3-knams [17:42:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:43:09] 10Operations, 10ops-codfw, 10DC-Ops, 10fundraising-tech-ops: (Need by: ASAP) rack/setup/install frdb2001 - https://phabricator.wikimedia.org/T245566 (10Papaul) @Jgreen @Dwisehaupt can we resolve this task since it is a rack/setup task for dc-ops and open another task to track the work that left to be done?... 
[17:46:21] (03CR) 10Ppchelko: [C: 03+2] Update Zotero to I5e24a0f18 [deployment-charts] - 10https://gerrit.wikimedia.org/r/575040 (owner: 10Mvolz) [17:46:41] (03Merged) 10jenkins-bot: Update Zotero to I5e24a0f18 [deployment-charts] - 10https://gerrit.wikimedia.org/r/575040 (owner: 10Mvolz) [17:48:27] 10Operations, 10ops-codfw, 10DC-Ops, 10fundraising-tech-ops: (Need by: ASAP) rack/setup/install frdb2001 - https://phabricator.wikimedia.org/T245566 (10Dwisehaupt) @Papaul I'm fine with that. @Jgreen The last bit on this checklist is moving the host to staged in netbox. I just checked and I don't appear t... [17:49:42] !log setting cache type of mwdebug1001 to LCStoreStaticArray, this would break group1 and group2 in that node (T99740) [17:49:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:49:49] T99740: Use static php array files for l10n cache at WMF (instead of CDB) - https://phabricator.wikimedia.org/T99740 [17:51:26] (03PS1) 10Elukey: Add an-launcher1001 to profile::dumps::distribution [puppet] - 10https://gerrit.wikimedia.org/r/575048 (https://phabricator.wikimedia.org/T243934) [17:54:07] 10Operations, 10ops-codfw, 10DC-Ops, 10fundraising-tech-ops: (Need by: ASAP) rack/setup/install frdb2001 - https://phabricator.wikimedia.org/T245566 (10Papaul) @Dwisehaupt the last step is to move the host to online and not staged. staged is after the first puppet run. once the host is live and in producti... 
[17:54:39] (03PS1) 10Dzahn: phabricator: reactivate stats report mails on phab1001 [puppet] - 10https://gerrit.wikimedia.org/r/575049 (https://phabricator.wikimedia.org/T244677) [17:55:58] (03PS2) 10Dzahn: phabricator: reactivate stats report mails on phab1001 [puppet] - 10https://gerrit.wikimedia.org/r/575049 (https://phabricator.wikimedia.org/T244677) [17:56:56] (03CR) 10Jbond: "see if we can use https://github.com/MaxKellermann/ferm/blob/master/test/sort.pl instead" [puppet] - 10https://gerrit.wikimedia.org/r/573335 (https://phabricator.wikimedia.org/T206951) (owner: 10Jbond) [17:57:51] (03CR) 10Dzahn: [C: 03+2] "re-enables:" [puppet] - 10https://gerrit.wikimedia.org/r/575049 (https://phabricator.wikimedia.org/T244677) (owner: 10Dzahn) [18:00:35] ^ It's clean now [18:01:46] 10Operations, 10ops-codfw, 10DC-Ops, 10fundraising-tech-ops: (Need by: ASAP) rack/setup/install frdb2001 - https://phabricator.wikimedia.org/T245566 (10Dwisehaupt) @Papaul Good to know. It can be moved to online at this point. The host has gotten it's puppet runs and is in the process of getting the db res... 
[18:03:54] (03CR) 10Filippo Giunchedi: "See inline, LGTM overall" (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/575000 (https://phabricator.wikimedia.org/T244472) (owner: 10Effie Mouzeli) [18:05:45] !log phab1001 - manually running community_metrics and project_changes scripts (crons) (T244677) [18:05:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:05:52] T244677: Weekly phabricator-reports mail cronjob broken since January 2020 - https://phabricator.wikimedia.org/T244677 [18:09:59] !log downtimed labstore1004/5, cloudstore1008/9 and cloudbackup1001/2 for merging https://gerrit.wikimedia.org/r/c/operations/puppet/+/571821 [18:10:00] 10Operations, 10Mail, 10Phabricator, 10Patch-For-Review, 10Regression: Weekly phabricator-reports mail cronjob broken since January 2020 - https://phabricator.wikimedia.org/T244677 (10Dzahn) After the change above the cron jobs running these have been reactivated. Looks like a simple oversight when we mo... [18:10:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:10:22] (03PS6) 10Effie Mouzeli: logstash, mediawiki: minor fixes in log streaming [puppet] - 10https://gerrit.wikimedia.org/r/575000 (https://phabricator.wikimedia.org/T244472) [18:11:04] 10Operations, 10Mail, 10Phabricator, 10Patch-For-Review, 10Regression: Weekly phabricator-reports mail cronjob broken since January 2020 - https://phabricator.wikimedia.org/T244677 (10Dzahn) 05Open→03Resolved should be resolved. please reopen if any issues in the future when regular crons should run... 
[18:11:42] (03PS9) 10Bstorm: cloudstore: remove dependency on bind mounts [puppet] - 10https://gerrit.wikimedia.org/r/571821 (https://phabricator.wikimedia.org/T224582) [18:14:16] (03CR) 10Bstorm: [V: 03+2 C: 03+2] cloudstore: remove dependency on bind mounts [puppet] - 10https://gerrit.wikimedia.org/r/571821 (https://phabricator.wikimedia.org/T224582) (owner: 10Bstorm) [18:14:25] (03CR) 10Effie Mouzeli: logstash, mediawiki: minor fixes in log streaming (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/575000 (https://phabricator.wikimedia.org/T244472) (owner: 10Effie Mouzeli) [18:15:39] (03CR) 10Dzahn: "There might be some uncertainty about naming here." [dns] - 10https://gerrit.wikimedia.org/r/575034 (owner: 10Papaul) [18:15:40] 10Operations, 10ops-codfw, 10Patch-For-Review: (Need by: TBD) codfw: rack/setup/install parse200[1-20].codfw.wmnet - https://phabricator.wikimedia.org/T243112 (10Dzahn) There might be some uncertainty about naming here. https://wikitech.wikimedia.org/wiki/Infrastructure_naming_conventions says they are call... [18:15:41] (03CR) 10Mforns: [C: 03+1] "LGTM!" [puppet] - 10https://gerrit.wikimedia.org/r/574722 (https://phabricator.wikimedia.org/T243934) (owner: 10Elukey) [18:16:10] (03CR) 10Dzahn: "commit message says DHCP but it's DNS" [dns] - 10https://gerrit.wikimedia.org/r/575034 (owner: 10Papaul) [18:21:40] effie: is it parsoid* (wiki page) or parse* (ticket) for the new servers? [18:37:39] !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime [18:37:40] dzahn@cumin1001: Failed to log message to wiki. Somebody should check the error logs. [18:37:43] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [18:37:44] dzahn@cumin1001: Failed to log message to wiki. Somebody should check the error logs. [18:37:52] !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime [18:37:52] dzahn@cumin1001: Failed to log message to wiki. Somebody should check the error logs. 
[18:37:53] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [18:37:53] dzahn@cumin1001: Failed to log message to wiki. Somebody should check the error logs. [18:43:06] mutante: if it turns out its wrong in ticket [18:43:11] then its wrong on the labels, netbox, etc... [18:43:17] i would assume task > wikitech [18:43:20] as task is updated more often [18:43:35] just echoing the items to change if it needs to for others really [18:43:40] since i know you have renamed things and know ;D [18:43:57] but happy to assist if it needs rename (i can change switch descriptions) [18:44:46] jouncebot and stashbot should be back as soon as NFS related reboots in Toolforge complete [18:45:18] robh: yea, agreed. i already said to papaul to go ahead in the interest of time [18:45:37] cool [18:49:25] !log generating mcrouter certs for mw2291 - mw2299 [18:53:06] robh: wiki page edited to match [18:53:13] \o/ [18:53:15] living docs [18:53:24] well, maybe undead docs, it doesnt get updated a lot really [18:53:28] so you just raised a zombie ;D [18:53:50] not that dead https://wikitech.wikimedia.org/w/index.php?title=Infrastructure_naming_conventions&action=history [18:54:10] oh, yeah thats multiple times a month [18:54:15] living docs \o/ [18:54:23] :) [19:00:04] RoanKattouw, Niharika, and Urbanecm: #bothumor Q:How do functions break up? A:They stop calling each other. Rise for Morning SWAT(Max 6 patches) deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200226T1900). [19:00:05] Amir1: A patch you scheduled for Morning SWAT(Max 6 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [19:00:10] o/ [19:00:13] They are being merged [19:01:25] We are reaching mw2300, nice [19:05:40] Amir1: these are still being installed (right now). 
will be pooled later though [19:05:53] will run scap pull before doing so [19:07:00] the very first puppet run on an appserver takes a long time [19:11:26] bstorm_: merge conflict, go ahead and merge both if you want [19:11:29] or i can [19:11:33] Thanks! I'll do it [19:11:36] ty [19:13:09] PROBLEM - Widespread puppet agent failures- no resources reported on icinga1001 is CRITICAL: 0.013 ge 0.01 https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/yOxVDGvWk/puppet [19:13:24] okay the mwdebug1001 looks fine, deploying [19:13:40] uhmm.. widespread puppet failures could be from me installing new mw2* [19:14:37] yea, widespread can be just 1% [19:14:59] invalid secret [19:15:03] give it a moment until my runs are finished. i'll keep an eye on it [19:15:06] volans: yea, known and fixed [19:15:17] k [19:15:20] (on the current/next run) [19:15:26] added the mcrouter certs [19:15:26] !log ladsgroup@deploy1001 Synchronized php-1.35.0-wmf.20/extensions/Wikibase/lib/includes/Store/Sql/Terms: SWAT: [[gerrit:575055|Do prefetching entity ids on batches of 20 entity per query (T246159)]] (duration: 01m 00s) [19:15:44] you can run this if you don't want to wait [19:15:47] https://wikitech.wikimedia.org/wiki/Cumin#Run_Puppet_only_if_last_run_failed [19:16:04] given you know the cluster you can pick the specific alias or interval [19:17:59] yes, thanks. 
i am using a regex to target them and cumin is currently running [19:18:32] and following the new docs to avoid icinga spam [19:19:33] RECOVERY - Widespread puppet agent failures- no resources reported on icinga1001 is OK: (C)0.01 ge (W)0.006 ge 0.005886 https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/yOxVDGvWk/puppet [19:21:02] !log rsyncing /srv/ from install1002 to apt1001 (APT repo data) (T224576) [19:26:37] !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime [19:28:10] no bot messages for gerrit merges - i think because toollabs is down [19:29:00] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [19:29:22] !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime [19:31:34] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [19:36:58] jouncebot: now [19:36:58] For the next 0 hour(s) and 23 minute(s): Morning SWAT(Max 6 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200226T1900) [19:38:06] (03PS1) 10Ladsgroup: Increase the reads for term store for clients for up to Q4Mio [mediawiki-config] - 10https://gerrit.wikimedia.org/r/575072 (https://phabricator.wikimedia.org/T219123) [19:38:31] (03CR) 10Ladsgroup: [C: 03+2] Increase the reads for term store for clients for up to Q4Mio [mediawiki-config] - 10https://gerrit.wikimedia.org/r/575072 (https://phabricator.wikimedia.org/T219123) (owner: 10Ladsgroup) [19:39:06] !log scap pull mw2291 through mw2300 and then pooling them as new API appservers in codfw (T241852) [19:39:33] (03Merged) 10jenkins-bot: Increase the reads for term store for clients for up to Q4Mio [mediawiki-config] - 10https://gerrit.wikimedia.org/r/575072 (https://phabricator.wikimedia.org/T219123) (owner: 10Ladsgroup) [19:39:52] !log dzahn@cumin1001 conftool action : set/pooled=yes; selector: name=mw2300.codfw.wmnet [19:40:28] !log dzahn@cumin1001 conftool action : set/weight=15; selector: name=mw2300.codfw.wmnet 
[19:40:53] !log dzahn@cumin1001 conftool action : set/weight=15; selector: name=mw229[1-9].codfw.wmnet [19:40:54] I scap pulled deploy1001 on deploy1001 [19:41:00] I'm so facepalm [19:41:32] Amir1: uhm.. did that break anything? [19:41:50] git status [19:41:57] *git status doesn't show anything [19:42:11] and i was actually checking there are no deployments right now [19:42:24] Amir1: alrighty [19:42:39] !log otto@deploy1001 helmfile [STAGING] Ran 'apply' command on namespace 'eventstreams' for release 'production' . [19:42:58] doesn't show anything as in "working tree clean" right [19:43:56] !log dzahn@cumin1001 conftool action : set/pooled=yes; selector: name=mw229[1-9].codfw.wmnet [19:43:59] !log ladsgroup@deploy1001 Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:574454|Increase the reads for term store for clients for up to Q4Mio (T219123)]] (duration: 01m 07s) [19:44:05] jouncebot: now [19:44:05] For the next 0 hour(s) and 15 minute(s): Morning SWAT(Max 6 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200226T1900) [19:44:14] Amir1: isnt it 15 minutes until deploy? ^ [19:44:27] "rsync: [receiver] write error: Broken pipe (32)" I think these are the new hosts [19:44:31] mutante: nope, it's now [19:44:32] i mean.. i checked this specifically to not pool during deployment [19:44:35] yet there is one [19:44:42] 10Operations, 10netops: BFD session alerts due to inconsistent status on cr3-knams - https://phabricator.wikimedia.org/T240659 (10ayounsi) 05Open→03Resolved No more issues. [19:44:43] jouncebot is wrong? [19:44:50] https://wikitech.wikimedia.org/wiki/Deployments [19:45:20] It says "For the next 15 minutes", I mean it's the SWAT window [19:45:21] i don't see anything start at 15 before hour [19:45:38] "Morning SWAT" it's yellowed for me [19:45:42] oh.. not "in 15 minutes" [19:45:54] arg.. i read that wrong then [19:46:27] Amir1: all of them? mw2291 through 2300 ? 
[19:46:44] let me pull one more time [19:46:45] it just gave a number, 32 [19:46:53] that's more than i touched .. [19:47:11] !log ladsgroup@deploy1001 Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:574454|Increase the reads for term store for clients for up to Q4Mio (T219123)]], take II (duration: 01m 03s) [19:47:17] in the second run, it didn't give this [19:47:28] ran scap pull again on the new ones [19:47:31] ok, good! [19:48:52] !log otto@deploy1001 helmfile [STAGING] Ran 'apply' command on namespace 'eventstreams' for release 'production' . [19:50:22] Can I go over the SWAT window? I need to deploy something continuously, otherwise things will go down [19:50:30] mutante: ^ [19:51:28] Amir1: yes [19:51:35] go ahead [19:51:47] that's a good reason to, heh [19:52:02] will this delay the train? [19:52:23] oh there's a train after it [19:52:36] no, I stop at 6Mio for now [19:52:50] (03PS1) 10Ladsgroup: Increase the reads for term store for clients for up to Q6Mio [mediawiki-config] - 10https://gerrit.wikimedia.org/r/575074 (https://phabricator.wikimedia.org/T219123) [19:53:06] (03CR) 10Ladsgroup: [C: 03+2] Increase the reads for term store for clients for up to Q6Mio [mediawiki-config] - 10https://gerrit.wikimedia.org/r/575074 (https://phabricator.wikimedia.org/T219123) (owner: 10Ladsgroup) [19:53:57] 10Operations, 10MediaWiki-General, 10serviceops, 10Core Platform Team Workboards (Clinic Duty Team): siteinfo api calls should be cached for N minutes on the caching layer - https://phabricator.wikimedia.org/T244204 (10Pchelolo) 05Open→03Declined I have verified that services running internally are cac... 
[19:54:15] (03Merged) 10jenkins-bot: Increase the reads for term store for clients for up to Q6Mio [mediawiki-config] - 10https://gerrit.wikimedia.org/r/575074 (https://phabricator.wikimedia.org/T219123) (owner: 10Ladsgroup) [19:56:26] Nice logo btw: https://phabricator.wikimedia.org/tag/scap/ [19:56:51] !log ladsgroup@deploy1001 Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:574454|Increase the reads for term store for clients for up to Q6Mio (T219123)]] (duration: 01m 02s) [19:57:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:57:03] T219123: Migrate to and read from new store for item terms - https://phabricator.wikimedia.org/T219123 [19:58:06] !log ladsgroup@deploy1001 Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:574454|Increase the reads for term store for clients for up to Q6Mio (T219123)]], take II (duration: 01m 04s) [19:58:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:00:04] longma and twentyafterfour: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) Mediawiki train - American Version deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200226T2000). [20:00:04] !log Morning SWAT is done [20:00:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:00:18] Thanks Amir1 [20:01:20] 10Operations, 10netops: PyBal BGP group prefix-limit 50 teardown - https://phabricator.wikimedia.org/T246110 (10ayounsi) Ok, looks like a good one would be: ` set protocols bgp group PyBal family inet unicast prefix-limit maximum 1000 teardown 20 set protocols bgp group PyBal family inet6 unicast prefix-limit... 
[20:01:25] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [20:01:31] Thank you for doing the train, I'm just merging some config change [20:01:40] That might be me [20:02:00] it should recover [20:02:16] I'll hold off for a few to wait for recovery [20:02:23] it's sorta natural because the query patterns are changing in a really large scale [20:02:35] PROBLEM - High average GET latency for mw requests on appserver in eqiad on icinga1001 is CRITICAL: cluster=appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET [20:02:54] thanks for waiting a moment [20:03:04] that last one looks like already past the spike [20:03:37] yea,, expecting recovery [20:04:02] cool [20:04:27] rescheduling checks to speed it up [20:05:37] RECOVERY - High average GET latency for mw requests on appserver in eqiad on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET [20:05:43] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [20:05:45] longma: looks normal again to me. 
i think you can start [20:06:00] (03CR) 10Ayounsi: [C: 03+2] new key for bblack [homer/public] - 10https://gerrit.wikimedia.org/r/575045 (owner: 10BBlack) [20:06:04] proceeding with the train. Thanks mutante [20:06:19] next time I add only one million I guess instead of two [20:06:21] (03Merged) 10jenkins-bot: new key for bblack [homer/public] - 10https://gerrit.wikimedia.org/r/575045 (owner: 10BBlack) [20:08:31] !log add BGP to AS8859 in AMS-IX [20:09:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:10:37] !log add BGP to AS4780 in Equinix Palo-Alot [20:10:51] (03CR) 10Dzahn: [C: 03+2] DHCP: Add MAC address for parse200[1-20] [puppet] - 10https://gerrit.wikimedia.org/r/575071 (https://phabricator.wikimedia.org/T243112) (owner: 10Papaul) [20:10:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:12:37] papaul: ready for OS install (except partman maybe?) [20:17:11] (03PS1) 10Jeena Huneidi: group1 wikis to 1.35.0-wmf.21 refs T233869 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/575077 [20:17:13] (03CR) 10Jeena Huneidi: [C: 03+2] group1 wikis to 1.35.0-wmf.21 refs T233869 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/575077 (owner: 10Jeena Huneidi) [20:18:03] !log otto@deploy1001 helmfile [STAGING] Ran 'apply' command on namespace 'eventstreams' for release 'production' . 
[20:18:03] mutante: thanks [20:18:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:18:18] (03Merged) 10jenkins-bot: group1 wikis to 1.35.0-wmf.21 refs T233869 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/575077 (owner: 10Jeena Huneidi) [20:19:51] !log jhuneidi@deploy1001 rebuilt and synchronized wikiversions files: group1 wikis to 1.35.0-wmf.21 refs T233869 [20:20:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:20:03] T233869: 1.35.0-wmf.21 deployment blockers - https://phabricator.wikimedia.org/T233869 [20:20:56] !log jhuneidi@deploy1001 Synchronized php: group1 wikis to 1.35.0-wmf.21 refs T233869 (duration: 01m 04s) [20:21:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:23:02] (03PS2) 10Jforrester: Move wgULSLanguageDetection to the 'must not' section of CommonSettings [mediawiki-config] - 10https://gerrit.wikimedia.org/r/574794 (https://phabricator.wikimedia.org/T246212) [20:26:21] (03PS1) 10Andrew Bogott: designate.conf: catch up with some deprecation warnings [puppet] - 10https://gerrit.wikimedia.org/r/575079 [20:26:35] (03PS5) 10Jforrester: Add chart for mobileapps [deployment-charts] - 10https://gerrit.wikimedia.org/r/570162 (https://phabricator.wikimedia.org/T218733) (owner: 10Mholloway) [20:27:29] PROBLEM - recommendation_api endpoints health on scb1001 is CRITICAL: /{domain}/v1/caption/addition/{target} (Caption addition suggestions) timed out before a response was received: /{domain}/v1/article/creation/translation/{source}{/seed} (article.creation.translation - bad seed) timed out before a response was received: /{domain}/v1/caption/translation/from/{source}/to/{target} (Caption translation suggestions) timed out before [20:27:29] eceived: /{domain}/v1/description/translation/from/{source}/to/{target} (Description translation suggestions) timed out before a response was received: /{domain}/v1/description/addition/{target} (Description 
addition suggestions) timed out before a response was received: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received: /{domain}/v1/article/cre [20:27:29] eed} (article.creation.morelike - bad article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [20:29:13] PROBLEM - High average GET latency for mw requests on appserver in eqiad on icinga1001 is CRITICAL: cluster=appserver code={200,204} handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method= [20:29:19] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [20:29:23] err? [20:29:24] Should I roll back? 
[20:29:29] PROBLEM - recommendation_api endpoints health on scb2005 is CRITICAL: /{domain}/v1/description/translation/from/{source}/to/{target} (Description translation suggestions) timed out before a response was received: /{domain}/v1/description/addition/{target} (Description addition suggestions) timed out before a response was received: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed [20:29:29] ponse was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [20:29:29] PROBLEM - recommendation_api endpoints health on scb2006 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [20:30:07] longma: i dont think yet [20:30:10] per https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad%20prometheus%2Fops [20:30:47] oh, I will add that to my list of dashboards [20:30:50] it is going down [20:30:57] longma: a rise in timeouts isn't uncommon. seems to be subsiding [20:31:04] it has happened before during deploys [20:31:07] PROBLEM - recommendation_api endpoints health on scb2001 is CRITICAL: /{domain}/v1/description/translation/from/{source}/to/{target} (Description translation suggestions) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [20:31:08] (03CR) 10Andrew Bogott: [C: 03+2] designate.conf: catch up with some deprecation warnings [puppet] - 10https://gerrit.wikimedia.org/r/575079 (owner: 10Andrew Bogott) [20:31:10] it seemed like more than usual lately [20:31:17] the scb thing though.. i dont know [20:31:23] RECOVERY - High average GET latency for mw requests on appserver in eqiad on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET [20:31:27] nice [20:31:29] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [20:31:35] RECOVERY - recommendation_api endpoints health on scb2006 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [20:31:37] RECOVERY - recommendation_api endpoints health on scb2005 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [20:31:43] RECOVERY - recommendation_api endpoints health on scb1001 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [20:31:56] cool [20:32:03] (one of the links in the alerts had "no data" fwiw) [20:35:05] longma: you don't sync everything in this deploy? [20:35:11] only wikiversions? 
[20:35:53] !log ladsgroup@deploy1001 Synchronized php-1.35.0-wmf.21/extensions/Wikibase/lib/includes/Store/Sql/Terms: SWAT: [[gerrit:575055|Do prefetching entity ids on batches of 20 entity per query (T246159)]] (duration: 01m 04s) [20:36:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:36:03] T246159: Setting on read new for items causes increased worst response time and number of open connections - https://phabricator.wikimedia.org/T246159 [20:36:16] better safe than sorry [20:36:20] Amir1: yes i believe only wikiversions [20:36:53] Assuming the train will bring my backport, I didn't sync [20:37:04] that could have caused some issues [20:37:20] oh I see [20:37:34] Sorry [20:38:27] same, I did not realize that. Do we need to do anything now? [20:38:34] such spikes should have been avoided at least mitigated [20:38:37] I just did :D [20:38:43] ah okay [20:40:35] PROBLEM - recommendation_api endpoints health on scb2002 is CRITICAL: /{domain}/v1/caption/addition/{target} (Caption addition suggestions) timed out before a response was received: /{domain}/v1/caption/translation/from/{source}/to/{target} (Caption translation suggestions) timed out before a response was received: /{domain}/v1/description/translation/from/{source}/to/{target} (Description translation suggestions) timed out befor [20:40:35] received: /{domain}/v1/description/addition/{target} (Description addition suggestions) timed out before a response was received: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [20:41:47] PROBLEM - termbox codfw on termbox.svc.codfw.wmnet is CRITICAL: /termbox (get rendered termbox) is CRITICAL: Test get rendered termbox returned the unexpected status 500 (expecting: 200) https://wikitech.wikimedia.org/wiki/WMDE/Wikidata/SSR_Service [20:42:05] PROBLEM - recommendation_api 
endpoints health on scb1004 is CRITICAL: /{domain}/v1/caption/addition/{target} (Caption addition suggestions) timed out before a response was received: /{domain}/v1/article/creation/translation/{source}{/seed} (article.creation.translation - bad seed) timed out before a response was received: /{domain}/v1/caption/translation/from/{source}/to/{target} (Caption translation suggestions) timed out before [20:42:05] eceived: /{domain}/v1/description/translation/from/{source}/to/{target} (Description translation suggestions) timed out before a response was received: /{domain}/v1/description/addition/{target} (Description addition suggestions) timed out before a response was received: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received https://wikitech.wikimedi [20:42:05] es/Monitoring/recommendation_api [20:42:11] PROBLEM - High average GET latency for mw requests on appserver in eqiad on icinga1001 is CRITICAL: cluster=appserver code={200,204} handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method= [20:42:11] PROBLEM - High average GET latency for mw requests on api_appserver in eqiad on icinga1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=api_appserver&var-m [20:42:15] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers 
https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [20:42:29] PROBLEM - recommendation_api endpoints health on scb2006 is CRITICAL: /{domain}/v1/caption/addition/{target} (Caption addition suggestions) timed out before a response was received: /{domain}/v1/article/creation/translation/{source}{/seed} (article.creation.translation - bad seed) timed out before a response was received: /{domain}/v1/caption/translation/from/{source}/to/{target} (Caption translation suggestions) timed out before [20:42:29] eceived: /{domain}/v1/description/translation/from/{source}/to/{target} (Description translation suggestions) timed out before a response was received: /{domain}/v1/description/addition/{target} (Description addition suggestions) timed out before a response was received: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received https://wikitech.wikimedi [20:42:29] es/Monitoring/recommendation_api [20:42:39] PROBLEM - recommendation_api endpoints health on scb1001 is CRITICAL: /{domain}/v1/caption/addition/{target} (Caption addition suggestions) timed out before a response was received: /{domain}/v1/caption/translation/from/{source}/to/{target} (Caption translation suggestions) timed out before a response was received: /{domain}/v1/description/translation/from/{source}/to/{target} (Description translation suggestions) timed out befor [20:42:39] received: /{domain}/v1/description/addition/{target} (Description addition suggestions) timed out before a response was received: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [20:42:45] PROBLEM - recommendation_api endpoints health on scb2003 is CRITICAL: /{domain}/v1/caption/addition/{target} (Caption addition 
suggestions) timed out before a response was received: /{domain}/v1/caption/translation/from/{source}/to/{target} (Caption translation suggestions) timed out before a response was received: /{domain}/v1/description/translation/from/{source}/to/{target} (Description translation suggestions) timed out befor [20:42:45] received: /{domain}/v1/description/addition/{target} (Description addition suggestions) timed out before a response was received: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [20:42:47] looking bad but grafana shows it's another spike as on last deploy [20:42:49] PROBLEM - recommendation_api endpoints health on scb1002 is CRITICAL: /{domain}/v1/description/translation/from/{source}/to/{target} (Description translation suggestions) timed out before a response was received: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [20:42:53] PROBLEM - High average POST latency for mw requests on appserver in eqiad on icinga1001 is CRITICAL: cluster=appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=POST [20:42:53] PROBLEM - High average POST latency for mw requests on api_appserver in eqiad on icinga1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link 
https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=api_appserver&var- [20:42:59] PROBLEM - recommendation_api endpoints health on scb1003 is CRITICAL: /{domain}/v1/description/translation/from/{source}/to/{target} (Description translation suggestions) timed out before a response was received: /{domain}/v1/description/addition/{target} (Description addition suggestions) timed out before a response was received: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed [20:42:59] ponse was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [20:43:09] let me revert my changes again [20:43:12] one of them [20:43:32] why one of them? [20:43:48] to reduce the read on the new term store, I feel it's causing the spikes [20:43:55] RECOVERY - termbox codfw on termbox.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/WMDE/Wikidata/SSR_Service [20:44:08] ok. 
it does look like it is past the spike though [20:44:33] RECOVERY - recommendation_api endpoints health on scb2006 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [20:44:34] both errors and response time [20:44:43] RECOVERY - recommendation_api endpoints health on scb1001 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [20:44:49] RECOVERY - recommendation_api endpoints health on scb2002 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [20:44:49] RECOVERY - recommendation_api endpoints health on scb2003 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [20:44:57] RECOVERY - recommendation_api endpoints health on scb1002 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [20:45:03] RECOVERY - High average POST latency for mw requests on appserver in eqiad on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=POST [20:45:03] RECOVERY - High average POST latency for mw requests on api_appserver in eqiad on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=api_appserver&var-method=POST [20:45:07] RECOVERY - recommendation_api endpoints health on scb1003 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [20:45:18] but the thing is, if any spike can be bad enough to trigger alarms, I don't want to do it [20:45:27] I think spikes like this can happen quite often [20:45:47] let me wait for a couple of hours and see if there's another spike, then I'll revert everything [20:46:27] RECOVERY - High average GET latency for mw requests on api_appserver in eqiad on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=api_appserver&var-method=GET [20:46:27] RECOVERY - High average GET latency for mw requests on appserver in eqiad on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET [20:47:01] RECOVERY - recommendation_api endpoints health on scb2001 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [20:47:05] things looking normal again [20:47:26] Amir1: well that.. 
or the alerts need to be adjusted [20:47:50] to me that looks like the spike always just happens during the actual deploy [20:47:55] and we have had that many times [20:48:12] okay, noted [20:48:16] let's wait for a bit [20:48:19] in an ideal world maybe scap would downtime the alerts for a couple minutes [20:48:33] RECOVERY - recommendation_api endpoints health on scb1004 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [20:48:41] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [20:49:11] well there it is again.. is that another deploy? [20:49:39] let's see what the fatals are [20:49:53] Amir1: did you sync again? [20:50:25] yup [20:50:43] ok.. more alerts coming in from that. always 5 min behind [20:50:47] PROBLEM - High average GET latency for mw requests on appserver in eqiad on icinga1001 is CRITICAL: cluster=appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET [20:50:50] is this the last one for now? [20:51:15] should I sync twice for ordinary files as well? 
I'm not sure [20:51:20] I think it's only IS.php [20:51:23] PROBLEM - recommendation_api endpoints health on scb2003 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [20:52:30] currently we don't even see logs of deploys because the bot is down because toollabs [20:52:49] jouncebot: now [20:52:49] For the next 0 hour(s) and 7 minute(s): Mediawiki train - American Version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200226T2000) [20:53:25] PROBLEM - recommendation_api endpoints health on scb1001 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [20:53:29] RECOVERY - recommendation_api endpoints health on scb2003 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [20:53:41] also kind of missing the dotted lines on the graph that normally show when there are deploys [20:53:50] (03PS1) 10Ladsgroup: Revert read from the new term store again to Q2M [mediawiki-config] - 10https://gerrit.wikimedia.org/r/575082 (https://phabricator.wikimedia.org/T219123) [20:54:05] well i guess that's normal, only the start of the window is shown [20:54:13] Amir1: why revert now? [20:54:41] Isn't another spike? 
[20:55:03] every deploy causes one [20:55:12] so if you revert it will just be another one [20:55:28] yeah but I thought there's another spike happening [20:55:31] RECOVERY - recommendation_api endpoints health on scb1001 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [20:55:38] i can't tell when it's you syncing and when not [20:56:08] but so far it always happened during the actual deploy and then recovered [20:56:17] then it's my bad [20:56:21] let's wait for a bit [20:56:45] https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?orgId=1&from=now-3h&to=now&var-datasource=eqiad%20prometheus%2Fops&var-cluster=appserver&var-method=&fullscreen&panelId=9 This is all empty to me [20:57:00] https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad%20prometheus%2Fops&from=1582747126442&to=1582750488310 [20:57:27] ack. 15:32 < mutante> (one of the links in the alerts had "no data" fwiw) [20:59:21] RECOVERY - High average GET latency for mw requests on appserver in eqiad on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET [20:59:25] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [20:59:43] Amir1: looks good now https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad%20prometheus%2Fops&from=now-1h&to=now [20:59:54] i think that confirms it again [21:00:04] cscott, arlolra, subbu, halfak, and accraze: (Dis)respected human, time to deploy Services – Graphoid / Parsoid / Citoid / ORES (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200226T2100). Please do the needful. [21:00:57] reading scrollback: for clarity for group1 and group2 train we only do a sync wikiversions, not a full scap sync [21:00:57] Amir1: well, good timing for the next window. let's cool it down for now [21:01:21] group0 is a full scap sync, but that's because we have to update l10n [21:01:27] (for the new version) [21:02:08] s/update/build/ [21:02:14] nothing to update when it doesn't exist yet ;) [21:03:43] heh, good clarification [21:04:11] PROBLEM - recommendation_api endpoints health on scb1001 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [21:04:17] PROBLEM - recommendation_api endpoints health on scb2003 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [21:05:42] nope, open connections on s8 are going sky rocket: https://grafana.wikimedia.org/d/000000278/mysql-aggregated?orgId=1&var-dc=eqiad%20prometheus%2Fops&var-group=core&var-shard=s8&var-role=All&from=now-12h&to=now&fullscreen&panelId=9 [21:05:45] PROBLEM - High average GET latency for mw requests on appserver in eqiad on 
icinga1001 is CRITICAL: cluster=appserver code={200,204} handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method= [21:05:51] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [21:06:13] RECOVERY - recommendation_api endpoints health on scb1001 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [21:06:21] RECOVERY - recommendation_api endpoints health on scb2003 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [21:07:03] Amir1: ok, the DB part doesn't look good. agreed [21:07:23] (03PS1) 10Andrew Bogott: Openstack: add 'queens' config files [puppet] - 10https://gerrit.wikimedia.org/r/575083 [21:07:25] (03PS1) 10Andrew Bogott: Openstack: add apt definitions for version 'queens' [puppet] - 10https://gerrit.wikimedia.org/r/575084 (https://phabricator.wikimedia.org/T246287) [21:08:15] Amir1: i guess we should stop the services deploy and revert then ? [21:08:23] (03CR) 10jerkins-bot: [V: 04-1] Openstack: add 'queens' config files [puppet] - 10https://gerrit.wikimedia.org/r/575083 (owner: 10Andrew Bogott) [21:08:39] there is no spike in errors though [21:09:54] yeah, let me revert [21:10:05] RECOVERY - High average GET latency for mw requests on appserver in eqiad on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET [21:10:09] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [21:10:19] expecting more alerts then [21:11:40] nah, this wouldn't be super big [21:12:38] (03CR) 10Ladsgroup: [C: 03+2] Revert read from the new term store again to Q2M [mediawiki-config] - 10https://gerrit.wikimedia.org/r/575082 (https://phabricator.wikimedia.org/T219123) (owner: 10Ladsgroup) [21:13:33] (03Merged) 10jenkins-bot: Revert read from the new term store again to Q2M [mediawiki-config] - 10https://gerrit.wikimedia.org/r/575082 (https://phabricator.wikimedia.org/T219123) (owner: 10Ladsgroup) [21:14:09] yup, thats even more open connections than yesterday when we went to 6 mill [21:14:23] PROBLEM - High average GET latency for mw requests on appserver in eqiad on icinga1001 is CRITICAL: cluster=appserver code={200,204} handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method= [21:14:29] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [21:14:32] there we go.. 
every time [21:15:43] 10Operations, 10MediaWiki-General, 10serviceops, 10Core Platform Team Workboards (Clinic Duty Team): siteinfo api calls should be cached for N minutes on the caching layer - https://phabricator.wikimedia.org/T244204 (10Anomie) >>! In T244204#5921402, @Pchelolo wrote: > The only outstanding question left in... [21:15:43] !log ganeti - re-starting apt2001 which is mysteriously broken and "half up" ..as in you can't ssh to it and don't get console but it does cause icinga alerts [21:15:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:15:53] !log ladsgroup@deploy1001 Synchronized wmf-config/InitialiseSettings.php: [[gerrit:574454|Decrease the reads for term store for clients down to Q2Mio (T219123)]] (duration: 01m 04s) [21:16:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:16:09] T219123: Migrate to and read from new store for item terms - https://phabricator.wikimedia.org/T219123 [21:16:31] RECOVERY - High average GET latency for mw requests on appserver in eqiad on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET [21:16:37] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [21:17:18] 10Operations, 10SRE-Access-Requests: Give access to Anti Harassment Tools team to production deployment - https://phabricator.wikimedia.org/T246053 (10dmaza) Thanks @Dzahn . 
Here is mine ` ssh-ed25519 AAAAC3NzaC1lZDI1NTE5AAAAIMzAnzpZdT/jiyru+B+bUxA/r04r6kPBEl/aBQmuBlmU dmaza@wikimedia.org ` [21:17:27] !log ladsgroup@deploy1001 Synchronized wmf-config/InitialiseSettings.php: [[gerrit:574454|Decrease the reads for term store for clients down to Q2Mio (T219123)]], take II (duration: 01m 04s) [21:17:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:20:16] 10Operations, 10ops-codfw, 10serviceops, 10Patch-For-Review: (Need by: TBD) rack/setup/install new codfw mw systems - https://phabricator.wikimedia.org/T241852 (10Dzahn) mw2291 through mw2300 are in production now, pooled and then set Active in Netbox. [21:20:41] 10Operations, 10MediaWiki-General, 10serviceops, 10Core Platform Team Workboards (Clinic Duty Team): siteinfo api calls should be cached for N minutes on the caching layer - https://phabricator.wikimedia.org/T244204 (10Pchelolo) 05Declined→03Open [21:20:46] 10Operations, 10ops-codfw, 10serviceops, 10Patch-For-Review: (Need by: TBD) rack/setup/install new codfw mw systems - https://phabricator.wikimedia.org/T241852 (10Dzahn) 05Stalled→03Open a:05Dzahn→03None [21:22:23] (03PS9) 10Bstorm: labstore: introduce a firewall for the old primary NFS cluster [puppet] - 10https://gerrit.wikimedia.org/r/571832 [21:25:22] Amir1: done with deploying for now? [21:25:34] (03CR) 10Bstorm: "This should now be a safe patch to put out since the legacy services are not needed in the cluster anymore. 
It is based on the firewall se" [puppet] - 10https://gerrit.wikimedia.org/r/571832 (owner: 10Bstorm) [21:25:35] yup [21:25:40] alright, cool [21:25:40] I think it looks fine [21:25:55] yea, the DB graph looks much better [21:26:34] (03PS1) 10Dzahn: netboot: add parse* to use raid1-2dev partman recipe [puppet] - 10https://gerrit.wikimedia.org/r/575085 (https://phabricator.wikimedia.org/T243112) [21:26:36] (03CR) 10Bstorm: [C: 03+1] Add an-launcher1001 to profile::dumps::distribution [puppet] - 10https://gerrit.wikimedia.org/r/575048 (https://phabricator.wikimedia.org/T243934) (owner: 10Elukey) [21:27:07] i need to go now for a while, cya later Amir1 [21:27:15] o/ [21:27:38] (03CR) 10Dzahn: [C: 03+2] netboot: add parse* to use raid1-2dev partman recipe [puppet] - 10https://gerrit.wikimedia.org/r/575085 (https://phabricator.wikimedia.org/T243112) (owner: 10Dzahn) [21:28:43] !log ganeti - shutting apt2001 down again [21:28:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:29:22] off [21:34:30] (03CR) 10Jhedden: [C: 03+1] labstore: introduce a firewall for the old primary NFS cluster [puppet] - 10https://gerrit.wikimedia.org/r/571832 (owner: 10Bstorm) [21:34:57] (03PS1) 10Ayounsi: Flowspec: ignore any prefixes advertised from routers to gobgp [puppet] - 10https://gerrit.wikimedia.org/r/575088 [21:44:19] is phab being paricularly slow to anyone else or just me? [21:44:23] particularly even [21:45:34] robh: slower but only by a second or so (UK) [21:45:43] yeah, its totally usable [21:45:45] but seems laggy [21:46:08] im uncertain if its caching level stuff or phab level stuff. 
[21:46:36] Yeah [21:47:59] Could be either [21:48:19] PROBLEM - Host db1084.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [21:51:52] !log Password reset for User:Joax (T242941) [21:52:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:52:11] 10Operations, 10ops-eqiad, 10DC-Ops: Replace broken BBU on db1084 (HP host) - https://phabricator.wikimedia.org/T245647 (10Jclark-ctr) Replaced failed BBU [21:52:20] T242941: User:Jaox has forgotten their password, need a reset via CLI - https://phabricator.wikimedia.org/T242941 [21:54:33] RECOVERY - Host db1084.mgmt is UP: PING OK - Packet loss = 0%, RTA = 0.79 ms [21:54:40] (03PS2) 10Ayounsi: Flowspec: ignore any prefixes advertised from routers to gobgp [puppet] - 10https://gerrit.wikimedia.org/r/575088 [21:57:08] 10Operations, 10DBA: db1084 crashed due to BBU failure - https://phabricator.wikimedia.org/T245621 (10Jclark-ctr) [21:57:20] 10Operations, 10ops-eqiad, 10DC-Ops: Replace broken BBU on db1084 (HP host) - https://phabricator.wikimedia.org/T245647 (10Jclark-ctr) 05Open→03Resolved [21:57:33] PROBLEM - Disk space on notebook1004 is CRITICAL: DISK CRITICAL - free space: / 0 MB (0% inode=94%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=notebook1004&var-datasource=eqiad+prometheus/ops [21:57:59] (03CR) 10Ayounsi: [C: 03+2] Flowspec: ignore any prefixes advertised from routers to gobgp [puppet] - 10https://gerrit.wikimedia.org/r/575088 (owner: 10Ayounsi) [22:01:29] 10Operations, 10SRE-Access-Requests: Give access to Anti Harassment Tools team to production deployment - https://phabricator.wikimedia.org/T246053 (10Tchanders) Thanks everyone - here's mine: ` ssh-ed25519 AAAAC3NzaC1lZDI1NTE5AAAAICEeJivjg1q1mDS25ZOvJzaCqBCTidMjXZKoHdJzc3TC tchanders@tchanders-ThinkPad ` Ha... 
[22:01:39] 10Operations, 10ops-codfw, 10Patch-For-Review: (Need by: TBD) codfw: rack/setup/install parse200[1-20].codfw.wmnet - https://phabricator.wikimedia.org/T243112 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by pt1979 on cumin2001.codfw.wmnet for hosts: ` parse2001.codfw.wmnet ` The log can be fou... [22:10:32] 10Operations, 10ops-codfw, 10Patch-For-Review: (Need by: TBD) codfw: rack/setup/install parse200[1-20].codfw.wmnet - https://phabricator.wikimedia.org/T243112 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by pt1979 on cumin2001.codfw.wmnet for hosts: ` parse2002.codfw.wmnet ` The log can be fou... [22:13:26] (03PS2) 10Andrew Bogott: Openstack: add apt definitions for version 'queens' [puppet] - 10https://gerrit.wikimedia.org/r/575084 (https://phabricator.wikimedia.org/T246287) [22:13:28] (03PS2) 10Andrew Bogott: Openstack: add 'queens' config files [puppet] - 10https://gerrit.wikimedia.org/r/575083 [22:16:33] !log pt1979@cumin2001 START - Cookbook sre.hosts.downtime [22:16:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:16:49] (03CR) 10Volans: [C: 03+2] Add cookbook to control CF BGP advertisements [cookbooks] - 10https://gerrit.wikimedia.org/r/572262 (owner: 10Ayounsi) [22:18:06] 10Operations, 10Parsoid-PHP, 10SRE-Access-Requests, 10serviceops: Give all members of the Parsing team production `deployment` access - https://phabricator.wikimedia.org/T245877 (10ssastry) >>! In T245877#5918345, @Dzahn wrote: > @ssastry Yea, though it's not tied to the hostname. It's "the puppet role pa... 
[22:18:21] (03Merged) 10jenkins-bot: Add cookbook to control CF BGP advertisements [cookbooks] - 10https://gerrit.wikimedia.org/r/572262 (owner: 10Ayounsi) [22:18:54] (03CR) 10Andrew Bogott: [C: 03+2] Openstack: add apt definitions for version 'queens' [puppet] - 10https://gerrit.wikimedia.org/r/575084 (https://phabricator.wikimedia.org/T246287) (owner: 10Andrew Bogott) [22:19:18] !log pt1979@cumin2001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) [22:19:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:24:01] 10Operations, 10ops-codfw: (Need by: TBD) codfw: rack/setup/install parse200[1-20].codfw.wmnet - https://phabricator.wikimedia.org/T243112 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['parse2001.codfw.wmnet'] ` and were **ALL** successful. [22:25:33] !log pt1979@cumin2001 START - Cookbook sre.hosts.downtime [22:25:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:26:46] (03PS1) 10Jhedden: nova: add cloudvirt1014 to scheduler pool [puppet] - 10https://gerrit.wikimedia.org/r/575094 (https://phabricator.wikimedia.org/T241494) [22:27:55] !log pt1979@cumin2001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) [22:28:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:28:45] (03CR) 10Jhedden: [C: 03+2] nova: add cloudvirt1014 to scheduler pool [puppet] - 10https://gerrit.wikimedia.org/r/575094 (https://phabricator.wikimedia.org/T241494) (owner: 10Jhedden) [22:29:18] (03CR) 10Andrew Bogott: [C: 03+2] "sign this repo (on a test host) with" [puppet] - 10https://gerrit.wikimedia.org/r/575084 (https://phabricator.wikimedia.org/T246287) (owner: 10Andrew Bogott) [22:32:41] 10Operations, 10ops-eqiad, 10Patch-For-Review, 10cloud-services-team (Hardware): Degraded RAID on cloudvirt1014 - https://phabricator.wikimedia.org/T241494 (10JHedden) 05Open→03Resolved Cloudvirt1014 is fixed and back online. Thanks @Jclark-ctr ! 
[22:32:45] 10Operations, 10ops-codfw: (Need by: TBD) codfw: rack/setup/install parse200[1-20].codfw.wmnet - https://phabricator.wikimedia.org/T243112 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['parse2002.codfw.wmnet'] ` and were **ALL** successful. [22:33:06] 10Operations, 10ops-codfw: (Need by: TBD) codfw: rack/setup/install parse200[1-20].codfw.wmnet - https://phabricator.wikimedia.org/T243112 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by pt1979 on cumin2001.codfw.wmnet for hosts: ` parse2003.codfw.wmnet ` The log can be found in `/var/log/wmf-au... [22:33:20] 10Operations, 10ops-codfw: (Need by: TBD) codfw: rack/setup/install parse200[1-20].codfw.wmnet - https://phabricator.wikimedia.org/T243112 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by pt1979 on cumin2001.codfw.wmnet for hosts: ` parse2004.codfw.wmnet ` The log can be found in `/var/log/wmf-au... [22:34:10] 10Operations, 10ops-eqiad, 10Patch-For-Review, 10cloud-services-team (Hardware): Degraded RAID on cloudvirt1014 - https://phabricator.wikimedia.org/T241494 (10Jclark-ctr) replaced failed bbu [22:39:32] RECOVERY - Disk space on notebook1004 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=notebook1004&var-datasource=eqiad+prometheus/ops [22:44:31] !log removing one file for legal compliance [22:44:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:44:54] (03PS1) 10Dzahn: admins: add Shannon Bailey to parsoid-test groups, upgrade to shell user [puppet] - 10https://gerrit.wikimedia.org/r/575097 (https://phabricator.wikimedia.org/T245877) [22:45:32] 10Operations, 10Parsoid-PHP, 10SRE-Access-Requests, 10serviceops, 10Patch-For-Review: Give all members of the Parsing team production `deployment` access - https://phabricator.wikimedia.org/T245877 (10Dzahn) @Sbailey @ssastry Sounds good. Now we just need an SSH public key from you. 
I started a patch to... [22:47:06] !log pt1979@cumin2001 START - Cookbook sre.hosts.downtime [22:47:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:48:22] !log pt1979@cumin2001 START - Cookbook sre.hosts.downtime [22:48:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:49:00] (03PS2) 10Dzahn: admins: add Shannon Bailey to parsoid-test groups, upgrade to shell user [puppet] - 10https://gerrit.wikimedia.org/r/575097 (https://phabricator.wikimedia.org/T245877) [22:49:23] !log pt1979@cumin2001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) [22:49:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:51:45] !log pt1979@cumin2001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) [22:51:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:53:46] PROBLEM - Host parse2003 is DOWN: PING CRITICAL - Packet loss = 100% [22:54:09] 10Operations, 10Parsoid-PHP, 10SRE-Access-Requests, 10serviceops, 10Patch-For-Review: Give all members of the Parsing team production `deployment` access - https://phabricator.wikimedia.org/T245877 (10Dzahn) >>! In T245877#5921857, @ssastry wrote: > Let us start with granting @sbailey access to just the... [22:54:14] RECOVERY - Host parse2003 is UP: PING OK - Packet loss = 0%, RTA = 36.25 ms [22:55:10] 10Operations, 10ops-codfw: (Need by: TBD) codfw: rack/setup/install parse200[1-20].codfw.wmnet - https://phabricator.wikimedia.org/T243112 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['parse2003.codfw.wmnet'] ` and were **ALL** successful. [22:55:13] papaul: there is now this to avoid the icinga-wm part btw https://wikitech.wikimedia.org/wiki/Icinga#Avoid_Icinga_spam_on_new_server_installs [22:55:26] adding some downtimes [22:56:21] I would really prefer that we instead install only things tracked in site.pp and that this current workflow fails completely. 
my 2 cents [22:56:28] 10Operations, 10ops-codfw: (Need by: TBD) codfw: rack/setup/install parse200[1-20].codfw.wmnet - https://phabricator.wikimedia.org/T243112 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['parse2004.codfw.wmnet'] ` and were **ALL** successful. [22:58:09] !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime [22:58:09] !log dzahn@cumin1001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) [22:58:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:58:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:58:29] !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime [22:58:31] !log dzahn@cumin1001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) [22:58:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:58:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:01:05] 10Operations, 10ops-codfw: (Need by: TBD) codfw: rack/setup/install parse200[1-20].codfw.wmnet - https://phabricator.wikimedia.org/T243112 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by pt1979 on cumin2001.codfw.wmnet for hosts: ` parse2005.codfw.wmnet ` The log can be found in `/var/log/wmf-au... [23:01:15] 10Operations, 10ops-codfw: (Need by: TBD) codfw: rack/setup/install parse200[1-20].codfw.wmnet - https://phabricator.wikimedia.org/T243112 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by pt1979 on cumin2001.codfw.wmnet for hosts: ` parse2006.codfw.wmnet ` The log can be found in `/var/log/wmf-au... 
[23:05:00] mutante: ok looking thanks [23:06:14] ok [23:08:07] mutante: also this is not the case for papaul servers [23:08:21] those are new servers that have no map in site.pp so are not at all in icinga [23:09:30] PROBLEM - IPMI Sensor Status on db1098 is CRITICAL: Sensor Type(s) Temperature, Power_Supply Status: Critical [PS Redundancy = Critical, Status = Critical] https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures [23:11:13] volans: ack, they are new servers [23:11:27] papaul: you can't use the icinga-downtime command until they are in puppet hosts [23:11:37] let's add to site [23:13:05] [icinga1001:~] $ sudo /usr/local/bin/icinga-downtime -h parse2003 -d 3600 -r new_install [23:13:11] did this for 2003 [23:13:33] he's not using it directly, that's the reimage script [23:13:35] that runs it [23:14:01] (03PS1) 10Aaron Schulz: Set "coalesceKeys" in mc.php to minimize host fan-out by WANCache [mediawiki-config] - 10https://gerrit.wikimedia.org/r/575098 [23:15:55] !log pt1979@cumin2001 START - Cookbook sre.hosts.downtime [23:16:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:16:15] !log pt1979@cumin2001 START - Cookbook sre.hosts.downtime [23:16:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:16:58] !log pt1979@cumin2001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) [23:17:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:18:25] 10Operations, 10ops-codfw: (Need by: TBD) codfw: rack/setup/install parse200[1-20].codfw.wmnet - https://phabricator.wikimedia.org/T243112 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['parse2006.codfw.wmnet'] ` Of which those **FAILED**: ` ['parse2006.codfw.wmnet'] ` [23:19:11] !log pt1979@cumin2001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) [23:19:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:23:42] (03PS1) 
10Dzahn: site: add new parsoid nodes with spare role [puppet] - 10https://gerrit.wikimedia.org/r/575100 (https://phabricator.wikimedia.org/T243112) [23:23:59] 10Operations, 10ops-codfw, 10Patch-For-Review: (Need by: TBD) codfw: rack/setup/install parse200[1-20].codfw.wmnet - https://phabricator.wikimedia.org/T243112 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['parse2005.codfw.wmnet'] ` and were **ALL** successful. [23:25:21] (03CR) 10jerkins-bot: [V: 04-1] site: add new parsoid nodes with spare role [puppet] - 10https://gerrit.wikimedia.org/r/575100 (https://phabricator.wikimedia.org/T243112) (owner: 10Dzahn) [23:26:12] (03PS2) 10Dzahn: site: add new parsoid nodes with spare role [puppet] - 10https://gerrit.wikimedia.org/r/575100 (https://phabricator.wikimedia.org/T243112) [23:27:22] (03CR) 10jerkins-bot: [V: 04-1] site: add new parsoid nodes with spare role [puppet] - 10https://gerrit.wikimedia.org/r/575100 (https://phabricator.wikimedia.org/T243112) (owner: 10Dzahn) [23:27:40] (03PS3) 10Dzahn: site: add new parsoid nodes with spare role [puppet] - 10https://gerrit.wikimedia.org/r/575100 (https://phabricator.wikimedia.org/T243112) [23:36:23] (03PS1) 10Dzahn: admins: add tchanders, dmaza and wikigit to deployers [puppet] - 10https://gerrit.wikimedia.org/r/575101 (https://phabricator.wikimedia.org/T246053) [23:40:00] (03CR) 10Cwhite: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/574499 (owner: 10Muehlenhoff) [23:41:05] (03CR) 10Cwhite: [C: 03+1] logstash: add ES 7 compatible logstash template [puppet] - 10https://gerrit.wikimedia.org/r/571622 (https://phabricator.wikimedia.org/T234854) (owner: 10Herron) [23:41:47] (03CR) 10Jforrester: [C: 03+1] admins: add tchanders, dmaza and wikigit to deployers [puppet] - 10https://gerrit.wikimedia.org/r/575101 (https://phabricator.wikimedia.org/T246053) (owner: 10Dzahn) [23:42:06] (03CR) 10Subramanya Sastry: admins: add Shannon Bailey to parsoid-test groups, upgrade to shell user (031 comment) 
[puppet] - https://gerrit.wikimedia.org/r/575097 (https://phabricator.wikimedia.org/T245877) (owner: Dzahn) [23:42:41] mutante: but you already have the hosts in site.pp, no need now to downtime them in icinga, right? [23:47:15] !log pt1979@cumin2001 START - Cookbook sre.hosts.downtime [23:47:15] !log pt1979@cumin2001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) [23:47:43] !log pt1979@cumin2001 START - Cookbook sre.hosts.downtime [23:47:44] !log pt1979@cumin2001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) [23:50:21] I'll commandeer the (empty) SWAT slot for some clean-up. [23:50:37] (PS3) Jforrester: Remove $wgAllowTitlesInSVG, deprecated and unused [mediawiki-config] - https://gerrit.wikimedia.org/r/574921 (https://phabricator.wikimedia.org/T246193) (owner: DannyS712) [23:50:46] (CR) Jforrester: [C: +2] Remove $wgAllowTitlesInSVG, deprecated and unused [mediawiki-config] - https://gerrit.wikimedia.org/r/574921 (https://phabricator.wikimedia.org/T246193) (owner: DannyS712) [23:51:55] (Merged) jenkins-bot: Remove $wgAllowTitlesInSVG, deprecated and unused [mediawiki-config] - https://gerrit.wikimedia.org/r/574921 (https://phabricator.wikimedia.org/T246193) (owner: DannyS712) [23:53:32] !log jforrester@deploy1001 Synchronized wmf-config/InitialiseSettings.php: T246193 Stop setting wgAllowTitlesInSVG, never read (and this was default anyway) (duration: 01m 05s) [23:53:55] Operations, ops-codfw, Patch-For-Review: (Need by: TBD) codfw: rack/setup/install parse200[1-20].codfw.wmnet - https://phabricator.wikimedia.org/T243112 (Papaul) @Volans i am trying the downtime command from cookbook to downtime a host before running the auto-reimage script i am getting the error bel...
[23:54:37] !log jforrester@deploy1001 Synchronized wmf-config/InitialiseSettings.php: T246193 Stop setting wgAllowTitlesInSVG, never read (and this was default anyway) (duration: 01m 05s) [23:54:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:54:46] T246193: Remove wgAllowTitlesInSVG - https://phabricator.wikimedia.org/T246193 [23:55:59] !log jforrester@deploy1001 Synchronized wmf-config/InitialiseSettings.php: Bonus sync for cache clearance (duration: 01m 04s) [23:56:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:56:15] (PS3) Jforrester: Move wgULSLanguageDetection to the 'must not' section of CommonSettings [mediawiki-config] - https://gerrit.wikimedia.org/r/574794 (https://phabricator.wikimedia.org/T246212) [23:56:21] (CR) Jforrester: [C: +2] Move wgULSLanguageDetection to the 'must not' section of CommonSettings [mediawiki-config] - https://gerrit.wikimedia.org/r/574794 (https://phabricator.wikimedia.org/T246212) (owner: Jforrester) [23:57:35] (Merged) jenkins-bot: Move wgULSLanguageDetection to the 'must not' section of CommonSettings [mediawiki-config] - https://gerrit.wikimedia.org/r/574794 (https://phabricator.wikimedia.org/T246212) (owner: Jforrester) [23:59:53] !log jforrester@deploy1001 Synchronized wmf-config/CommonSettings.php: T246212 Set wgULSLanguageDetection false in CS (duration: 01m 04s) [23:59:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log