[00:00:03] Is only a vector issue
[00:00:04] twentyafterfour: Dear deployers, time to do the Phabricator update deploy. Dont look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200319T0000).
[00:03:09] twentyafterfour: ideally the revert can be deployed but it's not the end of the world if it isn't
[00:03:34] Jdlrobson: ok, just noticed it and wondered the status
[00:04:02] it is adding an unneeded HTML comment to all page views but that will only be readable to people looking at the source code
[00:04:06] and mostly swallowed in gzip
[00:04:30] the patch is not in the next branch or master (it was only deployed to that branch)
[00:04:51] Operations, observability, Performance-Team (Radar): Decide on `service-runner` aggregated prometheus metrics and use of `service` label - https://phabricator.wikimedia.org/T247820 (colewhite) >>! In T247820#5979414, @akosiaris wrote: > And of course we can always just add a new one if we feel like i...
[00:04:55] Puppet, Beta-Cluster-Infrastructure: puppetdb on deployment-puppetdb03 keeps getting OOMKilled - https://phabricator.wikimedia.org/T248041 (Krenair)
[00:05:54] well I would deploy it but I can't log in to deploy1001
[00:06:06] and my existing session is hung
[00:06:45] also I couldn't promote wmf.24 to group0 because of broken ci jobs that I have no clue about
[00:10:28] twentyafterfour: do you need help?
[00:11:13] deploy1001 is working for me
[00:15:20] twentyafterfour train should be able to move forward after https://gerrit.wikimedia.org/r/#/c/integration/config/+/581154/
[00:16:00] cdanis: I'm not sure why it isn't working. given that my mediawiki-config change didn't pass tests I guess there is nothing pressing to deploy
[00:16:41] did bast1002.wikimedia.org recently get taken down perhaps?
[00:16:49] nope
[00:17:00] (CR) DannyS712: "Recheck" [mediawiki-config] - https://gerrit.wikimedia.org/r/581137 (owner: DannyS712)
[00:17:27] twentyafterfour, what happens if you try to `ssh -vvv deploy1001.eqiad.wmnet`?
[00:17:32] twentyafterfour: last thing bast1003 sees from your IP is a connection timeout at 00:01 UTC
[00:17:43] bast2001 works for me, hmm
[00:17:45] hang on, bast1002 vs. bast1003?
[00:18:06] cdanis: bast1003?
[00:18:14] https://wikitech.wikimedia.org/wiki/Bastion still shows 1002 a
[00:18:16] twentyafterfour: bast1002
[00:18:26] there is no bast1003
[00:18:32] (CR) DannyS712: [C: +1] "Resubmit" [mediawiki-config] - https://gerrit.wikimedia.org/r/581133 (owner: 20after4)
[00:19:21] (CR) DannyS712: [C: +1] "Recheck" [mediawiki-config] - https://gerrit.wikimedia.org/r/581133 (owner: 20after4)
[00:19:34] debug1: Executing proxy command: exec ssh -a -W deploy1001.eqiad.wmnet:22 bastion
[00:19:41] debug1: Local version string SSH-2.0-OpenSSH_7.9p1 Debian-10+deb10u2
[00:20:04] seems like the network between me and eqiad is down somewhere
[00:20:08] can you access other things? like phab?
[00:20:09] I can reach codfw
[00:20:23] phab wfm
[00:20:31] twentyafterfour: ok so you can override your ssh config to use bast2002 instead, but also, can you: mtr -zw bast1002.wikimedia.org
[00:20:31] and the wikis
[00:20:34] and put it in a paste for me?
[00:20:51] yeah ssh config with bast2002 works
[00:21:03] cdanis: running mtr one sec
[00:21:19] twentyafterfour train should be able to deploy, failing jenkins job was disabled
[00:22:01] Operations, fundraising-tech-ops: rack/setup/install frnetmon1001.frack.eqiad.wmnet - https://phabricator.wikimedia.org/T232137 (Dwisehaupt) Nessus package removed from bismuth and /opt/nessus removed. the tar.gz backup that was used during the transfer is still present on the host, just in case.
[00:22:46] There's a train blocker still.
[00:23:16] It's visual-only, it seems, so maybe going to group0 wouldn't be too bad.
[00:23:20] https://phabricator.wikimedia.org/P10725
[00:23:23] cdanis: ^
[00:23:50] Operations, ops-codfw, Wikimedia-FR-Tech-Systems, fundraising-tech-ops: Fix incongruences between Netbox and DNS repository - https://phabricator.wikimedia.org/T248035 (Dwisehaupt)
[00:25:24] twentyafterfour: that's odd, that shows you with 0% packet loss to bast1002
[00:30:42] cdanis: yeah and now it's working again. shrug, sorry for the distraction
[00:31:11] !log twentyafterfour@deploy1001 Synchronized php-1.35.0-wmf.24/skins/Vector/includes/templates/index.mustache: deploy https://gerrit.wikimedia.org/r/c/mediawiki/skins/Vector/+/581116 which reverts https://gerrit.wikimedia.org/r/c/mediawiki/skins/Vector/+/581054 refs T248010 (duration: 01m 07s)
[00:31:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:31:17] T248010: Vector sidebar missing on test.wikipedia.org and weird footer - https://phabricator.wikimedia.org/T248010
[00:33:14] (CR) Jforrester: [C: -1] Consolidate user rights assignments, part 1 (3 comments) [mediawiki-config] - https://gerrit.wikimedia.org/r/562396 (https://phabricator.wikimedia.org/T239771) (owner: DannyS712)
[00:33:28] (PS4) Jforrester: Consolidate user rights assignments, part 2 [mediawiki-config] - https://gerrit.wikimedia.org/r/580414 (https://phabricator.wikimedia.org/T239771) (owner: DannyS712)
[00:34:18] (CR) Jforrester: "This absolutely can't be merged before 562396, so should depend directly on it.
(Or, per my comments, be split off, but that's distinct.)" [mediawiki-config] - https://gerrit.wikimedia.org/r/580414 (https://phabricator.wikimedia.org/T239771) (owner: DannyS712)
[00:34:40] (CR) DannyS712: Consolidate user rights assignments, part 1 (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/562396 (https://phabricator.wikimedia.org/T239771) (owner: DannyS712)
[00:34:49] twentyafterfour: np, if you see it not working again, please grab an mtr :)
[00:38:39] running `scap sync-wikiversions 'group0 wikiws to 1.35.0-wmf.24 refs T233872'`
[00:38:40] T233872: 1.35.0-wmf.24 deployment blockers - https://phabricator.wikimedia.org/T233872
[00:39:46] !log twentyafterfour@deploy1001 rebuilt and synchronized wikiversions files: group0 wikiws to 1.35.0-wmf.24 refs T233872
[00:39:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:42:59] twentyafterfour: Beyond the section edit breakage, looks fine.
[00:43:17] James_F: the section edit links look normal to me
[00:43:26] twentyafterfour: What skin are you using?
[00:43:55] not logged in so default
[00:44:20] Huh. What browser? Broken for me logged in (multiple accounts, multiple wikis) and out in Firefox and Chrome.
[00:44:47] I can properly debug, but not this late in the day. Tomorrow.
[00:45:12] James_F: firefox. I see the bug logged in to mediawiki.org but not logged out on testwiki
[00:45:19] Fun.
[00:45:20] James_F: yeah, don't worry about it tonight
[00:45:25] So probably caching again. :-(
[00:46:17] I'm tempted to full-sync wmf.24 since I don't know what files got synced and what didn't (took over for hashar and the state of wmf.24 is rather ambiguous)
[00:46:28] Sounds sensible.
[00:46:35] But I'm leaving, so…
[00:46:46] This normally comes from legacy.less in core
[00:46:47] take care James_F, thanks for your help
[00:46:53] so most likely caused by changes in that area
[00:47:10] * Krinkle goes to sleep
[00:47:19] thanks everyone
[00:47:29] I'm gonna let it be for tonight
[00:55:56] PROBLEM - Old JVM GC check - cloudelastic1003-cloudelastic-chi-eqiad on cloudelastic1003 is CRITICAL: 102.7 gt 100 https://wikitech.wikimedia.org/wiki/Search%23Using_jstack_or_jmap_or_other_similar_tools_to_view_logs https://grafana.wikimedia.org/d/000000462/elasticsearch-memory?orgId=1&var-exported_cluster=cloudelastic-chi-eqiad
[00:57:22] (Abandoned) DannyS712: Testing [mediawiki-config] - https://gerrit.wikimedia.org/r/581137 (owner: DannyS712)
[01:25:52] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=icinga site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[01:32:06] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[01:33:34] RECOVERY - Old JVM GC check - cloudelastic1003-cloudelastic-chi-eqiad on cloudelastic1003 is OK: (C)100 gt (W)80 gt 73.52 https://wikitech.wikimedia.org/wiki/Search%23Using_jstack_or_jmap_or_other_similar_tools_to_view_logs https://grafana.wikimedia.org/d/000000462/elasticsearch-memory?orgId=1&var-exported_cluster=cloudelastic-chi-eqiad
[02:07:42] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=icinga site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[02:09:48] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds.
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[02:16:37] (PS1) Andrew Bogott: wmfnovamiddleware: adjust encoding, again [puppet] - https://gerrit.wikimedia.org/r/581210 (https://phabricator.wikimedia.org/T242766)
[02:17:31] (CR) jerkins-bot: [V: -1] wmfnovamiddleware: adjust encoding, again [puppet] - https://gerrit.wikimedia.org/r/581210 (https://phabricator.wikimedia.org/T242766) (owner: Andrew Bogott)
[02:18:19] (PS2) Andrew Bogott: wmfnovamiddleware: adjust encoding, again [puppet] - https://gerrit.wikimedia.org/r/581210 (https://phabricator.wikimedia.org/T242766)
[02:20:27] (CR) Andrew Bogott: [C: +2] wmfnovamiddleware: adjust encoding, again [puppet] - https://gerrit.wikimedia.org/r/581210 (https://phabricator.wikimedia.org/T242766) (owner: Andrew Bogott)
[03:02:10] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=icinga site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[03:04:16] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[03:31:28] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=icinga site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[03:33:34] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds.
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[05:18:24] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=icinga site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[05:20:30] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[06:33:09] !log Upgrade db1132 without restarting T246098
[06:33:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:33:15] T246098: Upgrade and restart m2 primary database master (db1132) - https://phabricator.wikimedia.org/T246098
[06:46:45] !log Deploy schema change on testcommonswiki.globalimagelinks (empty table) on the s4 master T243987
[06:46:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:46:50] T243987: GlobalUsage table `globalimagelinks` lacks a primary key - https://phabricator.wikimedia.org/T243987
[06:49:17] !log execute 'sudo rm /etc/logrotate.d/ceph-common' on cloudvirt-dev and cloudcontrol-dev to stop daily cronspam
[06:49:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:49:21] Cc: arturo: --^
[07:03:10] (PS1) Elukey: Add prometheus jmx agent to all the Presto servers [puppet] - https://gerrit.wikimedia.org/r/581327 (https://phabricator.wikimedia.org/T247884)
[07:03:31] (PS1) Marostegui: db-eqiad.php: Update pc1008 situation [mediawiki-config] - https://gerrit.wikimedia.org/r/581328 (https://phabricator.wikimedia.org/T247787)
[07:04:16] (CR) Elukey: [C: +2] Add prometheus jmx agent to all the Presto servers [puppet] - https://gerrit.wikimedia.org/r/581327 (https://phabricator.wikimedia.org/T247884) (owner: 
Elukey)
[07:05:13] (CR) Marostegui: [C: +2] db-eqiad.php: Update pc1008 situation [mediawiki-config] - https://gerrit.wikimedia.org/r/581328 (https://phabricator.wikimedia.org/T247787) (owner: Marostegui)
[07:06:06] (Merged) jenkins-bot: db-eqiad.php: Update pc1008 situation [mediawiki-config] - https://gerrit.wikimedia.org/r/581328 (https://phabricator.wikimedia.org/T247787) (owner: Marostegui)
[07:07:45] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Update pc1008 spare situation T247787 (duration: 01m 09s)
[07:07:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:07:51] T247787: investigate pc1008 for possible hardware issues / performance under high load - https://phabricator.wikimedia.org/T247787
[07:30:38] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=icinga site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[07:32:44] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[07:40:38] Operations, observability, Performance-Team (Radar): Decide on `service-runner` aggregated prometheus metrics and use of `service` label - https://phabricator.wikimedia.org/T247820 (akosiaris) >>! In T247820#5982448, @colewhite wrote: >>>! In T247820#5979414, @akosiaris wrote: >> And of course we can...
[07:43:51] (PS1) Giuseppe Lavagetto: services_proxy: re-add retries for eventgate-analytics [puppet] - https://gerrit.wikimedia.org/r/581415 (https://phabricator.wikimedia.org/T247484)
[07:48:12] !log installing libjaxen-java security updates from Stretch point release
[07:48:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:54:40] !log installing cups updates from Stretch point release
[07:54:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:01:06] (CR) Giuseppe Lavagetto: [C: +2] services_proxy: re-add retries for eventgate-analytics [puppet] - https://gerrit.wikimedia.org/r/581415 (https://phabricator.wikimedia.org/T247484) (owner: Giuseppe Lavagetto)
[08:08:06] Operations: Integrate Stretch 9.12 point update - https://phabricator.wikimedia.org/T244695 (MoritzMuehlenhoff)
[08:14:40] Operations, vm-requests: eqiad: 1 VM request for OTRS - https://phabricator.wikimedia.org/T248028 (akosiaris) @Dzahn thanks for starting this. We can start working on this and setting it up while in covid-19 mode, I am a bit skeptical about switching over from the current one while that mode is ongoing....
[08:19:29] (PS2) Alexandros Kosiaris: admin: Deduplicate coredns [deployment-charts] - https://gerrit.wikimedia.org/r/580996
[08:19:31] (PS2) Alexandros Kosiaris: admin: Deduplicate rbac more [deployment-charts] - https://gerrit.wikimedia.org/r/581006
[08:31:47] (CR) Alexandros Kosiaris: [C: +2] admin: Deduplicate coredns [deployment-charts] - https://gerrit.wikimedia.org/r/580996 (owner: Alexandros Kosiaris)
[08:32:07] (Merged) jenkins-bot: admin: Deduplicate coredns [deployment-charts] - https://gerrit.wikimedia.org/r/580996 (owner: Alexandros Kosiaris)
[08:39:58] (CR) Alexandros Kosiaris: [C: +2] admin: Deduplicate rbac more [deployment-charts] - https://gerrit.wikimedia.org/r/581006 (owner: Alexandros Kosiaris)
[08:40:21] (Merged) jenkins-bot: admin: Deduplicate rbac more [deployment-charts] - https://gerrit.wikimedia.org/r/581006 (owner: Alexandros Kosiaris)
[08:43:34] !log restarting blazegraph on wdqs1006 (T242453)
[08:43:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:43:44] T242453: wdqs1005 stopped to handle updates - https://phabricator.wikimedia.org/T242453
[08:48:24] !log depooling wdqs1006 to help catching up lag
[08:48:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:49:11] hi, I can't deploy anything this morning.
But if any deployer is around, there is a pending patch for Vector that should fix up some weird style issue (and a train blocker)
[08:49:14] https://gerrit.wikimedia.org/r/#/c/mediawiki/skins/Vector/+/581248/
[08:49:46] We have a maintenance window in 10 minutes that might affect gerrit, for around 30-45 seconds
[08:52:33] sorry for the gratuitous pings maro stegui
[08:52:44] I copy-pasted the announcement without thinking about your name being in it
[08:53:13] marostegui: there is a weird bug which might cause some CI/Zuul connection to Gerrit to block which causes CI to deadlock entirely
[08:53:24] but that is solvable by restarting Gerrit -which clears the stalled connection-
[08:53:30] (PS1) Ema: ATS: add dt to atskafka log format [puppet] - https://gerrit.wikimedia.org/r/581435 (https://phabricator.wikimedia.org/T247497)
[08:53:43] I am more or less floating here, though I am also Mr Teacher this morning
[08:53:45] hashar: Ah, ok good to know. Are you around to do so in case it is needed?
[08:53:48] XD
[08:53:56] hashar: how can we know if the restart is required?
[08:53:57] is it worth just restarting gerrit right afterwards in any case?
[08:53:59] possibly yeah. I am homeschooling in the morning. 10am - 1pm ;)
[08:54:10] marostegui: when _joe_ complains about CI haha
[08:54:18] XDDDDDD
[08:54:43] <_joe_> hashar: ahahahah
[08:59:01] (PS2) Ema: Add sequence number [software/atskafka] - https://gerrit.wikimedia.org/r/580986 (https://phabricator.wikimedia.org/T237993)
[08:59:03] (PS4) Ema: Do not append to stats file [software/atskafka] - https://gerrit.wikimedia.org/r/580987 (https://phabricator.wikimedia.org/T237993)
[09:00:04] marostegui and akosiaris: It is that lovely time of the day again! You are hereby commanded to deploy m2 database master restart. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200319T0900).
[09:00:26] !log Restart m2 primary database master - T246098
[09:00:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:00:31] T246098: Upgrade and restart m2 primary database master (db1132) - https://phabricator.wikimedia.org/T246098
[09:00:31] me too :)
[09:00:32] restarting
[09:01:06] confirmed connections failing on debmonitor, as expected
[09:01:31] all done
[09:01:36] !log restart recommendation-api on scb T246098
[09:01:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:01:50] (included proxies restart)
[09:01:53] debmonitor working again, no action was needed
[09:02:13] !log restart otrs-daemon, apache on mendelevium T246098
[09:02:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:02:30] Gerrit Health Check failing ATM
[09:02:38] slaves reconnecting
[09:02:50] mailman_queue_size too?
[09:02:55] otrs?
[09:03:00] !log restart gerrit on gerrit1001 T246098
[09:03:03] lmk if I can help with anything else
[09:03:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:03:06] jynus: otrs done
[09:03:09] otrs working for me
[09:03:27] volans: thanks :*
[09:03:33] I mean if you think otrs could be the thing bothering mail queue?
[09:03:38] world's fastest ever
[09:03:56] just logged into otrs, works fine
[09:04:16] gerrit restarted and works as well?
[09:04:19] gerrit is back
[09:04:34] (CR) Marostegui: [C: +2] "test" [mediawiki-config] - https://gerrit.wikimedia.org/r/581328 (https://phabricator.wikimedia.org/T247787) (owner: Marostegui)
[09:04:39] yep, looks good
[09:04:44] (PS1) Jcrespo: test [puppet] - https://gerrit.wikimedia.org/r/581447
[09:04:46] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=icinga site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[09:04:49] gerrit working for me
[09:04:57] (CR) Marostegui: "test comment" [puppet] - https://gerrit.wikimedia.org/r/581447 (owner: Jcrespo)
[09:04:58] trying git update puppet
[09:05:00] that works too
[09:05:09] it was a bit slow, but I guess expected from gerrit 0:-D
[09:05:19] (Abandoned) Jcrespo: test [puppet] - https://gerrit.wikimedia.org/r/581447 (owner: Jcrespo)
[09:06:07] Status of the systemd unit git_pull_httpbb was also soft
[09:06:23] lol, mcrouter certs started complaining about expiry dates
[09:06:25] plus some gerrit dependencies
[09:06:50] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[09:06:59] I am rechecking those
[09:07:30] all alerts cleared
[09:07:42] great
[09:07:47] CI seems to be working
[09:07:48] dbproxy1015 recent restart
[09:07:51] expected
[09:07:56] yep
[09:07:58] and 1013
[09:08:06] contint1001 needs to retry puppet
[09:08:17] where's the best place on wikitech to document those actions for each app next times?
[09:08:21] *for next time
[09:08:25] volans: I will do that
[09:08:31] hashar: do you have rights to run puppet there?
[09:08:40] to be fair, most of these are unneeded
[09:08:56] I was just checking dependencies between services
[09:09:02] for better understanding
[09:09:11] it's ok, better to have a checklist that covers all the bases
[09:09:32] even if it says "these are probably more than is needed, but..."
[09:10:08] I like to be super-thorough on changes, because we can only fix the ones we see
[09:10:19] yep
[09:10:23] I bother marostegui a lot during switchovers 0:-D
[09:11:43] there is also gerrit1002, but I wouldn't touch that without daniel
[09:12:07] gerrit1002 has gerrit restarting every few secs for weeks now
[09:12:09] jynus: yes doing
[09:12:16] I have no idea what's going on there
[09:12:22] every few seconds? wut
[09:12:45] yeah systemd trying to restart it, but it failing
[09:12:46] I am guessing systemd + not being able to contact a db, or some other error
[09:12:52] probably
[09:13:02] that's a known issue
[09:13:05] let me look for the task
[09:13:06] ¡log running puppet on contint1001
[09:13:06] (CR) Elukey: [C: +1] "Had a chat with ema about https://phabricator.wikimedia.org/T136314" [puppet] - https://gerrit.wikimedia.org/r/581435 (https://phabricator.wikimedia.org/T247497) (owner: Ema)
[09:13:10] it's part of the buster migration or whatever
[09:13:14] yep
[09:13:15] ah ha
[09:13:17] I honestly don't know much
[09:13:34] but systemd unit probably should be disabled to prevent the hammering
[09:13:59] hashar: how did you type an inverted exclamation mark?
I did not know that french keyboards had that
[09:14:22] me neither
[09:14:27] https://phabricator.wikimedia.org/T243800
[09:14:28] that
[09:14:40] I thought it was a spanish + surrounding languages thing only
[09:14:48] so did I
[09:15:12] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=icinga site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[09:15:13] ¡¿Who would have thought?!
[09:15:27] copy pasted!
[09:15:48] fun fact, those were the 2 types of errors logged by debmonitor (django) during the restart:
[09:15:49] jynus: I don't know;
[09:15:51] _mysql_exceptions.OperationalError: (2013, "Lost connection to MySQL server at 'handshake: reading inital communication packet', system error: 115")
[09:15:51] ^ that seems kafkamon2001:9700
[09:15:54] _mysql_exceptions.OperationalError: (1290, 'The MariaDB server is running with the --read-only option so it cannot execute this statement')
[09:16:09] jynus: note btw the ; Not a semicolon :P
[09:16:09] where is that?, volans
[09:16:14] it's a greek question mark
[09:16:19] debmonitor logs jynus
[09:16:27] volans: mmm, the proxy was failed over correctly
[09:16:28] but not ongoing, right?
[09:16:35] no no, during the restart
[09:16:37] volans: as in restarted, you should not be going to the slave
[09:16:39] ah
[09:16:40] that's expected
[09:16:51] as you would have temporarily hit the slave which is in RO
[09:16:57] ok, then expected, probably not as clean as possible, but it is part of the automatic proxy work
[09:17:11] volans: low prio
[09:17:19] but consider making debmonitor work in read only
[09:17:19] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds.
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[09:17:19] (CR) Ema: [C: +2] ATS: add dt to atskafka log format [puppet] - https://gerrit.wikimedia.org/r/581435 (https://phabricator.wikimedia.org/T247497) (owner: Ema)
[09:17:29] even if with reduced functionality
[09:17:36] well, maybe not you
[09:17:42] whoever developed it
[09:17:50] that's me :D
[09:17:53] he
[09:18:01] the clients send updates that have to be written
[09:18:10] the only thing that could work RO is the web UI
[09:18:16] sure, that cannot be worked around
[09:18:16] but there is the sessions bits
[09:18:22] that might be tricky
[09:18:23] but the dashboard could?
[09:18:39] maybe session should be on a local memcache?
[09:18:41] Idk
[09:18:46] we do not require that
[09:19:01] normally we would kill the original master
[09:19:07] and put the new one on rw fast
[09:19:13] so low prio
[09:19:49] :)
[09:19:53] read only automatic failovers are easy
[09:20:03] read write (state) are more complicated
[09:20:04] https://wikitech.wikimedia.org/w/index.php?title=MariaDB%2Fmisc&type=revision&diff=1860614&oldid=1853986
[09:20:23] we don't have a STONITH method atm, hence the read only
[09:20:27] feel free to change
[09:22:36] +1 thanks
[09:22:47] and I happily noticed there was already good info there
[09:23:05] where's the log and how to restart it and even the RO exception, I guess from last failover
[09:23:46] Operations, Continuous-Integration-Infrastructure, Traffic, Patch-For-Review: Make CI run Varnish VCL tests - https://phabricator.wikimedia.org/T128188 (ema) >>! In T128188#5143265, @ema wrote: > @hashar can we run it in CI? The catalog needs to be compiled against a given hostname. Perhaps this...
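The two debmonitor errors quoted at 09:15 (2013 while the proxy fails over, 1290 while it points at the read-only replica) are both transient during a master restart. A minimal sketch of how a writing client might retry on exactly those two codes follows. This is not debmonitor's actual code: `OperationalError`, `write_with_retry` and `make_flaky_write` are hypothetical names invented for illustration, and the exception class is a stand-in for a real driver's exception so the snippet runs without a database.

```python
import time

# Error codes quoted in the log above.
LOST_CONNECTION = 2013  # "Lost connection to MySQL server ..."
READ_ONLY = 1290        # "... running with the --read-only option ..."
FAILOVER_CODES = {LOST_CONNECTION, READ_ONLY}


class OperationalError(Exception):
    """Hypothetical stand-in for a DB driver's OperationalError(code, message)."""

    def __init__(self, code, message):
        super().__init__(code, message)
        self.code = code


def write_with_retry(write, attempts=5, delay=0.0, sleep=time.sleep):
    """Call write(); retry only while it fails with a failover-related code."""
    for attempt in range(1, attempts + 1):
        try:
            return write()
        except OperationalError as exc:
            # Re-raise anything unrelated to failover, or the final failure.
            if exc.code not in FAILOVER_CODES or attempt == attempts:
                raise
            sleep(delay)


def make_flaky_write(failures):
    """Hypothetical write that raises each code in `failures`, then succeeds."""
    pending = list(failures)

    def write():
        if pending:
            raise OperationalError(pending.pop(0), "simulated failover error")
        return "written"

    return write


if __name__ == "__main__":
    # One lost connection, one read-only rejection, then the write lands.
    print(write_with_retry(make_flaky_write([LOST_CONNECTION, READ_ONLY])))
```

With a downtime of about a minute, as in this window, a real client would want a non-zero `delay` and enough `attempts` to span the restart; the defaults here are only chosen so the sketch runs instantly.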
[09:23:58] Operations, Continuous-Integration-Infrastructure, Traffic, Patch-For-Review: Make CI run Varnish VCL tests - https://phabricator.wikimedia.org/T128188 (ema) a:ema→None
[09:24:44] So, are we good? Can we call it a day?
[09:24:47] what is STONITH ?
[09:25:03] shut the other in the head
[09:25:12] PROBLEM - check_trafficserver_log_fifo_analytics_tls on cp5005 is CRITICAL: CRITICAL: /srv/trafficserver/tls/var/log/analytics.pipe - TS_MAIN not writing to pipe https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server
[09:25:20] the other NODE I think
[09:25:22] but you get the idea
[09:25:34] and shot :D
[09:25:43] basically, once a host takes over, the other one goes down, either entirely (network) or the daemon or whatever
[09:25:47] volans: thanks!
[09:25:59] I am closing the maintenance window
[09:26:08] !log m2 maintenance window done T246098
[09:26:14] Thanks everyone! :)
[09:26:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:26:17] T246098: Upgrade and restart m2 primary database master (db1132) - https://phabricator.wikimedia.org/T246098
[09:26:18] but in practical terms, is usually shut down, at least outside the US :-P
[09:26:29] Operations, DBA, OTRS, Recommendation-API, Research: Upgrade and restart m2 primary database master (db1132) - https://phabricator.wikimedia.org/T246098 (Marostegui) Open→Resolved This was done. MySQL downtime was 60 seconds: ` Starts: 9:00:29 End: 9:01:29 ` Thanks so much everyone wh...
[09:26:32] Operations, Puppet, DBA, User-jbond: DB: perform rolling restart of mariadb daemons to pick up CA changes - https://phabricator.wikimedia.org/T239791 (Marostegui)
[09:26:33] thanks marostegui!
[09:26:54] Operations, Puppet, DBA, User-jbond: DB: perform rolling restart of mariadb daemons to pick up CA changes - https://phabricator.wikimedia.org/T239791 (Marostegui)
[09:26:55] Now time to write an IR, joy!
[09:27:16] not for this I hope
[09:29:04] no no :)
[09:30:33] I can help
[09:32:28] PROBLEM - check_trafficserver_log_fifo_analytics_tls on cp3052 is CRITICAL: CRITICAL: /srv/trafficserver/tls/var/log/analytics.pipe - TS_MAIN not writing to pipe https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server
[09:32:41] known false positive ^
[09:32:51] the previous one too?
[09:32:54] I was about to ping :)
[09:33:00] yup
[09:37:10] thanks for the acronym explanation :-)
[09:37:33] Operations: Setup an Offsite backup infrastructure - https://phabricator.wikimedia.org/T85278 (jcrespo) Open→Stalled a:jcrespo I technically solved this already on the new bacula version, but haven't tested it, as it requires to make bacula1001 unavailable. Other than that, this is done, even if...
[09:53:50] PROBLEM - Check systemd state on an-coord1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:54:44] this is me --^
[09:54:55] I was about to look :-)
[09:55:58] RECOVERY - Check systemd state on an-coord1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:56:36] Operations, Wikimedia-Mailing-lists: Request new mailing list for Myanmar Wikimedia Community User Group - https://phabricator.wikimedia.org/T247647 (ArielGlenn) Open→Resolved a:ArielGlenn
[09:58:01] Operations, Traffic: check_trafficserver_log_fifo: false positives when changing log format - https://phabricator.wikimedia.org/T248067 (ema)
[09:58:10] Operations, Traffic: check_trafficserver_log_fifo: false positives when changing log format - https://phabricator.wikimedia.org/T248067 (ema) p:Triage→Medium
[10:01:02] !log cp: rolling ats-tls-restart to apply log format changes T248067 T237993
[10:01:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:01:08] T237993: Create 
replacement for Varnishkafka - https://phabricator.wikimedia.org/T237993
[10:01:09] T248067: check_trafficserver_log_fifo: false positives when changing log format - https://phabricator.wikimedia.org/T248067
[10:05:40] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=icinga site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[10:08:25] Quickly deploying this: https://gerrit.wikimedia.org/r/c/mediawiki/skins/Vector/+/581248
[10:08:31] Train blocker
[10:09:56] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[10:10:30] RECOVERY - check_trafficserver_log_fifo_analytics_tls on cp3052 is OK: OK: TS_MAIN writing to and fifo-log-demux reading from /srv/trafficserver/tls/var/log/analytics.pipe https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server
[10:11:36] RECOVERY - check_trafficserver_log_fifo_analytics_tls on cp5005 is OK: OK: TS_MAIN writing to and fifo-log-demux reading from /srv/trafficserver/tls/var/log/analytics.pipe https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server
[10:13:21] Operations, observability, Performance-Team (Radar): Decide on `service-runner` aggregated prometheus metrics and use of `service` label - https://phabricator.wikimedia.org/T247820 (fgiunchedi) >>! In T247820#5977711, @colewhite wrote: > Good idea forking the original task. Thanks for that! +1 ! >...
[10:15:27] 10Operations, 10netops, 10observability: Migrate role::netmon to Buster - https://phabricator.wikimedia.org/T247967 (10fgiunchedi) [10:19:40] (03CR) 10Filippo Giunchedi: "> Patch Set 2:" [puppet] - 10https://gerrit.wikimedia.org/r/580985 (https://phabricator.wikimedia.org/T247538) (owner: 10Filippo Giunchedi) [10:20:57] (03PS3) 10Filippo Giunchedi: icinga: relax check interval for selected checks [puppet] - 10https://gerrit.wikimedia.org/r/580985 (https://phabricator.wikimedia.org/T247538) [10:22:03] (03PS1) 10Elukey: profile::presto::monitoring: add first round of metrics [puppet] - 10https://gerrit.wikimedia.org/r/581488 (https://phabricator.wikimedia.org/T247884) [10:22:09] tested on mwdebug1002. works fine, rolling forward [10:22:16] (03CR) 10jerkins-bot: [V: 04-1] icinga: relax check interval for selected checks [puppet] - 10https://gerrit.wikimedia.org/r/580985 (https://phabricator.wikimedia.org/T247538) (owner: 10Filippo Giunchedi) [10:22:27] Amir1: You're deploying that change now? Rad! [10:22:40] Also, o/ [10:22:53] phuedx: yup, it's UBN :) [10:23:04] o/ [10:23:23] (03CR) 10Elukey: "Example from an-coord1001:" [puppet] - 10https://gerrit.wikimedia.org/r/581488 (https://phabricator.wikimedia.org/T247884) (owner: 10Elukey) [10:24:50] !log ladsgroup@deploy1001 Synchronized php-1.35.0-wmf.24/skins/Vector/skin.json: [[gerrit:581248|skins.vector.styles.legacy needs to define legacy feature (T247566)]] (duration: 01m 08s) [10:24:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:24:55] T247566: Broken section edit links styles on Vector - https://phabricator.wikimedia.org/T247566 [10:26:14] Amir1: Is wmf.24 out on group1 wikis? 
[10:26:36] phuedx: nope https://www.wikidata.org/wiki/Special:Version [10:26:49] It should be fixed now [10:27:32] (03PS1) 10Elukey: role::prometheus::analytics: add presto targets [puppet] - 10https://gerrit.wikimedia.org/r/581495 (https://phabricator.wikimedia.org/T247884) [10:28:40] phuedx: but officewiki and mediawiki.org if you want to test [10:28:50] <3 [10:28:52] and test wikis [10:29:14] testwiki and mediawikiwiki look good [10:29:38] I'd forgotten that mediawiki.org is in group0 -- it's been a while :) [10:32:33] (03CR) 10Elukey: [C: 03+1] Add sequence number [software/atskafka] - 10https://gerrit.wikimedia.org/r/580986 (https://phabricator.wikimedia.org/T237993) (owner: 10Ema) [10:34:35] (03CR) 10Elukey: [C: 03+1] Do not append to stats file (031 comment) [software/atskafka] - 10https://gerrit.wikimedia.org/r/580987 (https://phabricator.wikimedia.org/T237993) (owner: 10Ema) [10:38:57] (03CR) 10Ema: [C: 03+2] Add sequence number [software/atskafka] - 10https://gerrit.wikimedia.org/r/580986 (https://phabricator.wikimedia.org/T237993) (owner: 10Ema) [10:39:05] (03CR) 10Ema: [C: 03+2] Do not append to stats file [software/atskafka] - 10https://gerrit.wikimedia.org/r/580987 (https://phabricator.wikimedia.org/T237993) (owner: 10Ema) [10:42:46] mediawiki.org is our guinea pig of mediawiki installations, seems fair :P [10:43:43] Do the people that use mediawiki.org know that? ;) [10:46:34] good morning [10:47:24] !log upload atskafka 0.4 to buster-wikimedia T237993 [10:47:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:47:30] T237993: Create replacement for Varnishkafka - https://phabricator.wikimedia.org/T237993 [10:53:50] (03PS1) 10Alexandros Kosiaris: admin: Deduplicate defaults.yaml [deployment-charts] - 10https://gerrit.wikimedia.org/r/581507 [10:53:58] I am sorry about disrupting the train, all. Krinkle has further improved the TemplateParser test suite. 
I'll be reviewing the changes soon [11:00:04] Amir1, Lucas_WMDE, awight, and Urbanecm: (Dis)respected human, time to deploy European Mid-day SWAT(Max 6 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200319T1100). Please do the needful. [11:00:04] phuedx: A patch you scheduled for European Mid-day SWAT(Max 6 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [11:01:18] there's nothing to deploy ^_^ [11:05:33] * Urbanecm deploys [11:05:45] (03PS7) 10Urbanecm: trwiki: Grant interface editors editprotected & editsemiprotected [mediawiki-config] - 10https://gerrit.wikimedia.org/r/579772 (https://phabricator.wikimedia.org/T247672) (owner: 10DannyS712) [11:06:00] (03CR) 10Urbanecm: [C: 03+2] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/579772 (https://phabricator.wikimedia.org/T247672) (owner: 10DannyS712) [11:07:15] (03Merged) 10jenkins-bot: trwiki: Grant interface editors editprotected & editsemiprotected [mediawiki-config] - 10https://gerrit.wikimedia.org/r/579772 (https://phabricator.wikimedia.org/T247672) (owner: 10DannyS712) [11:07:28] wait what [11:07:52] phuedx: Amir1 thank you for the Vector deployment!!! [11:08:20] jouncebot and deployment calendar call out one change that Amir already merged. Amir says nothing to deploy. meanwhile Urbanecm merges a change that isn’t on the calendar? [11:08:22] I’m very confused [11:09:46] Lucas_WMDE: the vector fix got deployed earlier due to it being a train blocker / UBN [11:10:11] Lucas_WMDE: Amir1 fixed T247566 (an UBN) outside any deployment, and I've did a last-time addition [11:10:12] T247566: Broken section edit links styles on Vector - https://phabricator.wikimedia.org/T247566 [11:10:13] (calendar updated) [11:10:17] sorry for confusing you! 
[11:10:20] for Urbanecm change I would guess they forgot to fill [Deployments] at least it is in the SWAT window :]] [11:10:47] be safe, I am preparing lunch etc. I will be back in a couple hours. [11:10:54] Lucas_WMDE: you need to do this if you add last time stuff otherwise jouncebot would be stupid [11:10:57] jouncebot: refresh [11:10:57] I refreshed my knowledge about deployments. [11:11:24] *last minute [11:12:01] …I know that? I’m not the one who added any last time stuff :D [11:12:03] (03CR) 10Volans: [C: 03+2] "merging as I've personally verified that they are all in Offline status in Netbox." [dns] - 10https://gerrit.wikimedia.org/r/580955 (https://phabricator.wikimedia.org/T233183) (owner: 10Volans) [11:13:35] !log urbanecm@deploy1001 Synchronized wmf-config/InitialiseSettings.php: SWAT: e277d29: trwiki: Grant interface editors editprotected & editsemiprotected (T247672) (duration: 01m 07s) [11:13:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:13:42] T247672: Grant interface editors editprotected & editsemiprotected on Turkish Wikipedia - https://phabricator.wikimedia.org/T247672 [11:15:05] !log urbanecm@deploy1001 Synchronized wmf-config/InitialiseSettings.php: SWAT: e277d29: trwiki: Grant interface editors editprotected & editsemiprotected (T247672; take II) (duration: 01m 08s) [11:15:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:15:16] * Urbanecm done [11:15:57] if you feel brave there is https://gerrit.wikimedia.org/r/#/c/mediawiki/core/+/581245/ :] [11:16:59] (03PS5) 10Jbond: profile::idp: update profile to use tlsproxy::envoy [puppet] - 10https://gerrit.wikimedia.org/r/574020 (https://phabricator.wikimedia.org/T240941) [11:17:09] hashar: that one looks good to me [11:17:29] though I can't baby sit it, I am preparing lunch for the kids [11:17:29] I’ll add it to the calendar and SWAT [11:17:35] thanks ! [11:17:37] Lucas_WMDE: thanks! 
[11:17:45] (03CR) 10Jbond: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/574020 (https://phabricator.wikimedia.org/T240941) (owner: 10Jbond) [11:18:21] hm, not sure which ircnick to put in the calendar :D [11:18:23] I guess my own [11:18:50] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=icinga site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [11:19:32] Lucas_WMDE: whoever's going to test it :-). If you, yours 😉 [11:19:43] yeah ^^ [11:21:12] I think I should be able to test this, even [11:21:48] 10Operations, 10ops-esams, 10netops: 2*10G optics down on cr2-esams - https://phabricator.wikimedia.org/T245520 (10ayounsi) Replacing the optics on both sides didn't help, and light levels are correct. Service Request ID 2020-0319-0197 has been created. [11:22:17] please whoever deploys things do add them to the Deploy windows even after the fact, then we can see what happened later [11:23:04] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [11:23:29] * Urbanecm did that apergos [11:23:37] awesome, thank you! [11:23:47] thank you for the reminder! 
[11:26:51] (03PS1) 10ArielGlenn: no second dumps run this month due to vslow dbs serving live traffic [puppet] - 10https://gerrit.wikimedia.org/r/581522 [11:27:27] (03PS7) 10Jbond: pick_nodes: add ability to pick nodes based on a puppet class [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/579579 (https://phabricator.wikimedia.org/T245288) [11:28:16] (03CR) 10Filippo Giunchedi: "See inline, LGTM overall" (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/581488 (https://phabricator.wikimedia.org/T247884) (owner: 10Elukey) [11:28:32] (03CR) 10Filippo Giunchedi: [C: 03+1] role::prometheus::analytics: add presto targets [puppet] - 10https://gerrit.wikimedia.org/r/581495 (https://phabricator.wikimedia.org/T247884) (owner: 10Elukey) [11:29:43] (03CR) 10Jbond: pick_nodes: add ability to pick nodes based on a puppet class (031 comment) [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/579579 (https://phabricator.wikimedia.org/T245288) (owner: 10Jbond) [11:30:42] (03CR) 10ArielGlenn: [C: 03+2] no second dumps run this month due to vslow dbs serving live traffic [puppet] - 10https://gerrit.wikimedia.org/r/581522 (owner: 10ArielGlenn) [11:33:28] the change was merged, let’s see if I can test it [11:34:14] change is on mwdebug1001 now [11:34:50] (03PS6) 10Jbond: profile::idp: update profile to use tlsproxy::envoy [puppet] - 10https://gerrit.wikimedia.org/r/574020 (https://phabricator.wikimedia.org/T240941) [11:35:28] ok, I don’t understand that JS config variable, it seems to be null whatever I do [11:35:36] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=icinga site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [11:35:40] guess I’ll just look for any suspicious errors [11:37:30] some warnings that Echo is issuing DB writes from GET requests, but I assume that’s unrelated [11:37:36] nothing else looks suspicious in 
logstash [11:37:40] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [11:37:44] syncing [11:39:17] This is strange: https://logstash.wikimedia.org/goto/50f235c972242f548b804b87f5497ea1 CC marostegui [11:40:05] !log lucaswerkmeister-wmde@deploy1001 Synchronized php-1.35.0-wmf.24/includes/OutputPage.php: SWAT: [[gerrit:581245|OutputPage: Fix warning when setting wgUserNewMsgRevisionId (T248049)]] (duration: 01m 08s) [11:40:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:40:11] T248049: wgNewUserMsgRevisionId (in JavaScript) is no longer set when the user has a new message - https://phabricator.wikimedia.org/T248049 [11:42:20] jynus: from a deployment you think? [11:42:42] could be [11:42:54] if it is gone, we can ignore it [11:43:06] but we didn't use to have many of those lately [11:43:34] !log EU SWAT done [11:43:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:49:09] (03CR) 10Volans: pick_nodes: add ability to pick nodes based on a puppet class (031 comment) [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/579579 (https://phabricator.wikimedia.org/T245288) (owner: 10Jbond) [11:55:12] (03CR) 10Jbond: pick_nodes: add ability to pick nodes based on a puppet class (031 comment) [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/579579 (https://phabricator.wikimedia.org/T245288) (owner: 10Jbond) [12:00:04] Deploy window Pre MediaWiki train sanity break (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200319T1200) [12:11:14] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=icinga site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [12:13:20] RECOVERY - 
Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [12:25:13] (03PS1) 10Jbond: CFSSL: add cfssl module [puppet] - 10https://gerrit.wikimedia.org/r/581557 [12:26:40] (03CR) 10Jbond: [C: 03+2] CFSSL: add cfssl module [puppet] - 10https://gerrit.wikimedia.org/r/581557 (owner: 10Jbond) [12:29:30] (03CR) 10Arturo Borrero Gonzalez: [C: 04-1] "the patch in general LGTM. I believe most of the code can be just dropped. However I detected a couple of things that can be addressed in " (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/581056 (https://phabricator.wikimedia.org/T246689) (owner: 10Bstorm) [12:32:04] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=icinga site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [12:33:15] !log push frack fw policies T248004 [12:33:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:34:10] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [12:37:32] PROBLEM - Old JVM GC check - cloudelastic1001-cloudelastic-chi-eqiad on cloudelastic1001 is CRITICAL: 111.9 gt 100 https://wikitech.wikimedia.org/wiki/Search%23Using_jstack_or_jmap_or_other_similar_tools_to_view_logs https://grafana.wikimedia.org/d/000000462/elasticsearch-memory?orgId=1&var-exported_cluster=cloudelastic-chi-eqiad [12:37:40] PROBLEM - Old JVM GC check - cloudelastic1003-cloudelastic-chi-eqiad on cloudelastic1003 is CRITICAL: 121 gt 100 https://wikitech.wikimedia.org/wiki/Search%23Using_jstack_or_jmap_or_other_similar_tools_to_view_logs https://grafana.wikimedia.org/d/000000462/elasticsearch-memory?orgId=1&var-exported_cluster=cloudelastic-chi-eqiad [12:38:00] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] profile::mariadb::cloudinfra: Allow overriding of hostnames [puppet] - 10https://gerrit.wikimedia.org/r/575744 (https://phabricator.wikimedia.org/T242607) (owner: 10Alex Monk) [12:40:36] (03PS1) 10Jbond: cfssl: Ensure CSR exists before we try to sign it [puppet] - 10https://gerrit.wikimedia.org/r/581559 [12:59:03] 10Operations, 10netops, 10cloud-services-team (Kanban): CloudVPS: enable BGP in the neutron transport network - https://phabricator.wikimedia.org/T245606 (10aborrero) I hope what @ayounsi said help clarify the situation @faidon. Some additional info about the setup we tested can be found here: https://wikite... [13:00:04] hashar and twentyafterfour: It is that lovely time of the day again! You are hereby commanded to deploy Mediawiki train - European+American Version. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200319T1300). [13:09:44] so yeah hmm train [13:10:18] though I haven't looked at logstash since yesterday since I just started now. 
[13:10:28] I am going to dig into the logs [13:11:33] !log Rename testwikidatawiki.wb_terms on db1078 - T248086 [13:11:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:11:39] T248086: Drop wb_terms in production from s4 (commonswiki, testcommonswiki), s3 (testwikidatawiki), s8 (wikidatawiki) - https://phabricator.wikimedia.org/T248086 [13:15:38] (03PS2) 10Alexandros Kosiaris: admin: Deduplicate defaults.yaml [deployment-charts] - 10https://gerrit.wikimedia.org/r/581507 [13:17:08] RECOVERY - Old JVM GC check - cloudelastic1001-cloudelastic-chi-eqiad on cloudelastic1001 is OK: (C)100 gt (W)80 gt 67.8 https://wikitech.wikimedia.org/wiki/Search%23Using_jstack_or_jmap_or_other_similar_tools_to_view_logs https://grafana.wikimedia.org/d/000000462/elasticsearch-memory?orgId=1&var-exported_cluster=cloudelastic-chi-eqiad [13:17:58] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=icinga site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [13:19:20] RECOVERY - Old JVM GC check - cloudelastic1003-cloudelastic-chi-eqiad on cloudelastic1003 is OK: (C)100 gt (W)80 gt 78.31 https://wikitech.wikimedia.org/wiki/Search%23Using_jstack_or_jmap_or_other_similar_tools_to_view_logs https://grafana.wikimedia.org/d/000000462/elasticsearch-memory?orgId=1&var-exported_cluster=cloudelastic-chi-eqiad [13:20:02] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [13:23:37] so logs are fine.
I am going to promote 1.35.0-wmf.24 to group 1 [13:28:13] (03PS1) 10Hashar: group1 wikis to 1.35.0-wmf.24 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/581577 [13:28:15] (03CR) 10Hashar: [C: 03+2] group1 wikis to 1.35.0-wmf.24 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/581577 (owner: 10Hashar) [13:29:12] (03Merged) 10jenkins-bot: group1 wikis to 1.35.0-wmf.24 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/581577 (owner: 10Hashar) [13:31:02] !log hashar@deploy1001 rebuilt and synchronized wikiversions files: group1 wikis to 1.35.0-wmf.24 [13:31:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:31:50] 10Operations, 10netops, 10cloud-services-team (Kanban): CloudVPS: IPv6 early PoC - https://phabricator.wikimedia.org/T245495 (10aborrero) [13:32:09] !log hashar@deploy1001 Synchronized php: group1 wikis to 1.35.0-wmf.24 (duration: 01m 07s) [13:32:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:32:32] canaries passed! [13:33:31] ResourceLoaderFileModule::getFileContents: style file not found, or is not a file: "/srv/mediawiki/php-1.35.0-wmf.24/extensions/RelatedArticles/" [13:33:32] sigh [13:33:41] I swear we have mediawiki test for those [13:51:07] group1 looks fine [13:55:45] 10Operations, 10DC-Ops, 10netops: Juniper network device audit - all sites - https://phabricator.wikimedia.org/T213843 (10faidon) >>! In T213843#5913332, @ayounsi wrote: > Ok! From https://wikitech.wikimedia.org/wiki/Server_Lifecycle#States I thought that if a device was not in netbox it was not in our posse... [13:56:05] 10Operations, 10vm-requests: eqiad: 1 VM request for OTRS - https://phabricator.wikimedia.org/T248028 (10Dzahn) otrs1001 sounds good to me. I was also just thinking about starting it, not a switch yet. We can see how it goes later. ACK! And thanks for the comments on resources. Let's do 4GB then and keep th... 
[14:09:31] 10Operations, 10serviceops: Renew certs for mcrouter on all application servers. - https://phabricator.wikimedia.org/T248093 (10fgiunchedi) [14:12:34] (03CR) 10Elukey: "Thanks Filippo!" (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/581488 (https://phabricator.wikimedia.org/T247884) (owner: 10Elukey) [14:14:24] (03CR) 10Bstorm: toolforge: remove the entire toollabs module and all related roles (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/581056 (https://phabricator.wikimedia.org/T246689) (owner: 10Bstorm) [14:18:41] (03PS2) 10Elukey: profile::presto::monitoring: add first round of metrics [puppet] - 10https://gerrit.wikimedia.org/r/581488 (https://phabricator.wikimedia.org/T247884) [14:18:43] (03PS2) 10Elukey: role::prometheus::analytics: add presto targets [puppet] - 10https://gerrit.wikimedia.org/r/581495 (https://phabricator.wikimedia.org/T247884) [14:20:38] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=icinga site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [14:22:46] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [14:24:14] (03CR) 10Filippo Giunchedi: profile::presto::monitoring: add first round of metrics (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/581488 (https://phabricator.wikimedia.org/T247884) (owner: 10Elukey) [14:28:16] 10Operations, 10ops-eqiad, 10DC-Ops, 10decommission: Decommission dbproxy1001.eqiad.wmnet - https://phabricator.wikimedia.org/T244463 (10Papaul) ` [edit interfaces interface-range disabled] member ge-8/0/4 { ... } + member ge-3/0/3; [edit interfaces interface-range vlan-private1-a-eqiad] - membe... 
[14:29:38] (03PS3) 10Elukey: profile::presto::monitoring: add first round of metrics [puppet] - 10https://gerrit.wikimedia.org/r/581488 (https://phabricator.wikimedia.org/T247884) [14:29:40] (03PS3) 10Elukey: role::prometheus::analytics: add presto targets [puppet] - 10https://gerrit.wikimedia.org/r/581495 (https://phabricator.wikimedia.org/T247884) [14:29:42] 10Operations, 10ops-eqiad, 10DC-Ops, 10decommission: Decommission dbproxy1001.eqiad.wmnet - https://phabricator.wikimedia.org/T244463 (10Papaul) [14:29:52] (03CR) 10Jbond: pick_nodes: add ability to pick nodes based on a puppet class (031 comment) [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/579579 (https://phabricator.wikimedia.org/T245288) (owner: 10Jbond) [14:34:52] 10Operations, 10ops-eqiad, 10DC-Ops, 10decommission, 10Patch-For-Review: decommission dbproxy1002.eqiad.wmnet - https://phabricator.wikimedia.org/T245384 (10Papaul) ` [edit interfaces interface-range disabled] member ge-3/0/3 { ... } + member ge-3/0/4; [edit interfaces interface-range vlan-privat... 
[14:35:29] 10Operations, 10ops-eqiad, 10DC-Ops, 10decommission, 10Patch-For-Review: decommission dbproxy1002.eqiad.wmnet - https://phabricator.wikimedia.org/T245384 (10Papaul) [14:36:19] (03PS1) 10KartikMistry: apertium-eo-fr: Fix FTBFS with apertium 3.6 [debs/contenttranslation/apertium-eo-fr] - 10https://gerrit.wikimedia.org/r/581597 (https://phabricator.wikimedia.org/T247585) [14:38:20] (03CR) 10Elukey: "elukey@an-coord1001:~$ curl localhost:10281/metrics -s | grep presto" [puppet] - 10https://gerrit.wikimedia.org/r/581488 (https://phabricator.wikimedia.org/T247884) (owner: 10Elukey) [14:39:28] (03CR) 10Elukey: [C: 03+2] profile::presto::monitoring: add first round of metrics [puppet] - 10https://gerrit.wikimedia.org/r/581488 (https://phabricator.wikimedia.org/T247884) (owner: 10Elukey) [14:39:39] (03CR) 10Elukey: [C: 03+2] role::prometheus::analytics: add presto targets [puppet] - 10https://gerrit.wikimedia.org/r/581495 (https://phabricator.wikimedia.org/T247884) (owner: 10Elukey) [14:45:32] (03CR) 10Muehlenhoff: [C: 03+1] "Good idea, let's give it a shot! We can simply give a short heads on IRC before merging this." [puppet] - 10https://gerrit.wikimedia.org/r/574020 (https://phabricator.wikimedia.org/T240941) (owner: 10Jbond) [14:47:12] (03PS3) 10Bstorm: toolforge: remove almost entire toollabs module and related roles [puppet] - 10https://gerrit.wikimedia.org/r/581056 (https://phabricator.wikimedia.org/T246689) [14:48:46] !log otto@deploy1001 helmfile [STAGING] Ran 'apply' command on namespace 'eventstreams' for release 'production' . [14:48:46] !log otto@deploy1001 helmfile [STAGING] Ran 'apply' command on namespace 'eventstreams' for release 'canary' . [14:48:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:48:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:52:39] !log otto@deploy1001 helmfile [EQIAD] Ran 'apply' command on namespace 'eventstreams' for release 'canary' . 
[14:52:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:52:57] (03CR) 10Bstorm: "I'll put this through a PCC for the relic-stretch.toolserver-legacy.eqiad.wmflabs server and the image builder" (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/581056 (https://phabricator.wikimedia.org/T246689) (owner: 10Bstorm) [14:54:16] !log otto@deploy1001 helmfile [EQIAD] Ran 'apply' command on namespace 'eventstreams' for release 'production' . [14:54:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:56:04] 10Operations, 10netops: Cloud IPv6 subnets - https://phabricator.wikimedia.org/T187929 (10aborrero) [14:56:05] 10Operations, 10netops, 10cloud-services-team (Kanban): CloudVPS: IPv6 early PoC - https://phabricator.wikimedia.org/T245495 (10aborrero) [14:56:16] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=icinga site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [14:56:21] (03CR) 10Bstorm: "https://puppet-compiler.wmflabs.org/compiler1002/21495/ NOOP now :)" [puppet] - 10https://gerrit.wikimedia.org/r/581056 (https://phabricator.wikimedia.org/T246689) (owner: 10Bstorm) [14:57:54] 10Operations, 10ops-eqiad, 10DC-Ops, 10decommission: decommission db1067.eqiad.wmnet - https://phabricator.wikimedia.org/T238297 (10Papaul) ` [edit interfaces interface-range vlan-private1-c-eqiad] - member ge-6/0/32; [edit interfaces interface-range disabled] member ge-7/0/0 { ... } + member ge... [14:58:48] 10Operations, 10ops-eqiad, 10DC-Ops, 10decommission: decommission db1067.eqiad.wmnet - https://phabricator.wikimedia.org/T238297 (10Papaul) [15:00:26] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [15:05:41] 10Operations, 10ops-eqiad, 10DC-Ops, 10decommission: Decommission db1066.eqiad.wmnet - https://phabricator.wikimedia.org/T233071 (10Papaul) ` [edit interfaces interface-range disabled] member ge-3/0/4 { ... } + member ge-6/0/16; [edit interfaces interface-range vlan-private1-a-eqiad] - member ge... [15:06:20] 10Operations, 10ops-eqiad, 10DC-Ops, 10decommission: Decommission db1066.eqiad.wmnet - https://phabricator.wikimedia.org/T233071 (10Papaul) [15:07:21] (03CR) 10Giuseppe Lavagetto: [C: 03+1] "nice, it doesn't even make the code too awkward :)" [software/httpbb] - 10https://gerrit.wikimedia.org/r/576159 (https://phabricator.wikimedia.org/T236699) (owner: 10RLazarus) [15:17:04] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=icinga site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [15:18:16] (03PS1) 10RLazarus: Remove apache-fast-test, now replaced by httpbb. [puppet] - 10https://gerrit.wikimedia.org/r/581616 [15:19:10] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [15:19:18] !log andrew@deploy1001 deploy aborted: modest css change for the hiera editing dialog (take two -- I consistently forget to rebase before doing this) (duration: 00m 00s) [15:19:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:19:41] 10Operations, 10ops-eqiad, 10DC-Ops, 10decommission: Decommission db1061.eqiad.wmnet - https://phabricator.wikimedia.org/T238624 (10Papaul) No interface on any switch showing this server. 
[15:19:54] (03CR) 10Giuseppe Lavagetto: [C: 03+1] "🍾" [puppet] - 10https://gerrit.wikimedia.org/r/581616 (owner: 10RLazarus) [15:20:12] 10Operations, 10ops-eqiad, 10DC-Ops, 10decommission: Decommission db1061.eqiad.wmnet - https://phabricator.wikimedia.org/T238624 (10Papaul) [15:22:22] (03PS1) 10Filippo Giunchedi: smart: stop smartd on Buster + hpsa [puppet] - 10https://gerrit.wikimedia.org/r/581617 (https://phabricator.wikimedia.org/T246997) [15:24:55] 10Operations, 10ops-eqiad, 10DC-Ops, 10decommission: Decommission db1062.eqiad.wmnet - https://phabricator.wikimedia.org/T239188 (10Papaul) No interface on any switch showing this server [15:25:29] 10Operations, 10ops-eqiad, 10DC-Ops, 10decommission: Decommission db1062.eqiad.wmnet - https://phabricator.wikimedia.org/T239188 (10Papaul) [15:29:53] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/581617 (https://phabricator.wikimedia.org/T246997) (owner: 10Filippo Giunchedi) [15:30:15] (03PS4) 10Filippo Giunchedi: icinga: relax check interval for selected checks [puppet] - 10https://gerrit.wikimedia.org/r/580985 (https://phabricator.wikimedia.org/T247538) [15:33:17] (03CR) 10RLazarus: [C: 03+2] Remove apache-fast-test, now replaced by httpbb. [puppet] - 10https://gerrit.wikimedia.org/r/581616 (owner: 10RLazarus) [15:33:50] 10Operations, 10SRE-swift-storage: Sanity check global-multiwrite logs for ConfirmEdit usage - https://phabricator.wikimedia.org/T159830 (10fgiunchedi) a:05fgiunchedi→03None [15:34:46] (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] "LGTM, thanks!" 
[puppet] - 10https://gerrit.wikimedia.org/r/581056 (https://phabricator.wikimedia.org/T246689) (owner: 10Bstorm) [15:35:54] 10Operations, 10Graphite: extend existing graphite whisper files retention to five years - https://phabricator.wikimedia.org/T138821 (10fgiunchedi) a:05fgiunchedi→03None [15:36:29] 10Operations, 10SRE-swift-storage: authoritative copy of 'root' files for upload.wikimedia.org is only in swift - https://phabricator.wikimedia.org/T130709 (10fgiunchedi) a:05fgiunchedi→03None [15:38:58] 10Operations, 10Graphite: Enforce a minimum refresh period for grafana dashboards hitting graphite - https://phabricator.wikimedia.org/T119719 (10fgiunchedi) GH issue is resolved, and the feature will be available in Grafana 6.7: https://github.com/grafana/grafana/blob/master/CHANGELOG.md#670-beta1-2020-03-12 [15:39:16] 10Operations, 10observability: Upgrade Grafana to 6.6 - https://phabricator.wikimedia.org/T244208 (10fgiunchedi) See also T119719 when Grafana 6.7 is released [15:41:59] 10Operations, 10Graphite: Enforce a minimum refresh period for grafana dashboards hitting graphite - https://phabricator.wikimedia.org/T119719 (10fgiunchedi) a:05fgiunchedi→03None [15:42:33] 10Operations, 10Graphite: Make it easier to ban misbehaving dashboards from graphite - https://phabricator.wikimedia.org/T119718 (10fgiunchedi) 05Open→03Declined Declining as we haven't been experiencing this problem anymore (less dashboards on graphite) [15:42:47] 10Operations, 10Documentation, 10Graphite: document graphite failover/backfill procedures - https://phabricator.wikimedia.org/T102575 (10fgiunchedi) a:05fgiunchedi→03None [15:43:47] 10Operations, 10observability, 10Graphite: UDP rcvbuferrors and inerrors on graphite hosts - https://phabricator.wikimedia.org/T101141 (10fgiunchedi) 05Open→03Resolved Resolving since we have significantly lessened the load of udp traffic [15:43:50] 10Operations, 10observability, 10Graphite, 10Patch-For-Review: check_graphite - "UNKNOWN: More than half of 
the datapoints are undefined " - https://phabricator.wikimedia.org/T105218 (10fgiunchedi) [15:44:47] 10Operations, 10Graphite: improve graphite operational documentation - https://phabricator.wikimedia.org/T99234 (10fgiunchedi) 05Open→03Resolved Docs have been expanded and available at https://wikitech.wikimedia.org/wiki/Graphite [15:45:24] 10Operations, 10Graphite: limit the impact of many new metrics being pushed to graphite - https://phabricator.wikimedia.org/T99233 (10fgiunchedi) 05Open→03Declined Not relevant anymore as we're dialing down our graphite usage across the board [15:45:52] 10Operations, 10SRE-swift-storage: rsync errors slowing down object-replicator - https://phabricator.wikimedia.org/T95429 (10fgiunchedi) a:05fgiunchedi→03None [15:46:43] 10Operations, 10Graphite, 10Performance-Team (Radar): Improve graphite failover - https://phabricator.wikimedia.org/T88997 (10fgiunchedi) a:05fgiunchedi→03None [15:46:57] 10Operations, 10Graphite, 10audits-data-retention: graphite-web logs are not rotated - https://phabricator.wikimedia.org/T86546 (10fgiunchedi) a:05fgiunchedi→03None [15:47:19] 10Operations, 10Cloud-VPS, 10Shinken, 10Graphite, 10cloud-services-team (Kanban): Clean up labs graphite datapoints - https://phabricator.wikimedia.org/T111540 (10fgiunchedi) a:05fgiunchedi→03None [15:50:11] 10Operations, 10ops-eqiad, 10DC-Ops, 10decommission: decom silver/WMF3434 - https://phabricator.wikimedia.org/T191357 (10Papaul) [15:51:32] 10Operations, 10ops-eqiad, 10DC-Ops, 10decommission: decom silver/WMF3434 - https://phabricator.wikimedia.org/T191357 (10Papaul) 05Open→03Resolved Complete [15:51:34] 10Operations, 10ops-eqiad, 10DC-Ops, 10decommission: Decommission old and unused/spare servers in eqiad - https://phabricator.wikimedia.org/T187473 (10Papaul) [15:57:56] !log jeh@deploy1001 Started deploy [horizon/deploy@ad60c2b]: update horizon designate-dashboard submodule [15:58:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log 
[16:00:04] godog and _joe_: My dear minions, it's time we take the moon! Just kidding. Time for Puppet SWAT(Max 6 patches) deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200319T1600). [16:00:04] No GERRIT patches in the queue for this window AFAICS. [16:00:07] 10Operations, 10Wikimedia-Apache-configuration, 10serviceops, 10Patch-For-Review: Build a black-box httpd testing framework - https://phabricator.wikimedia.org/T236699 (10RLazarus) 05Open→03Resolved Marking this done: https://gerrit.wikimedia.org/r/581616 deleted apache-fast-test, as httpbb is now comp... [16:01:27] !log jeh@deploy1001 Finished deploy [horizon/deploy@ad60c2b]: update horizon designate-dashboard submodule (duration: 03m 31s) [16:01:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:04:45] 10Operations, 10netops: Cloud IPv6 subnets - https://phabricator.wikimedia.org/T187929 (10aborrero) I like option 2) the most. Are those ranges actual data? Regarding coding the vlan id: I don't think we should do it. We might eventually move away from the prod VLAN thing, or have addresses where the VLAN par... 
[16:18:47] (03PS1) 10Elukey: admin: add kerberos flag for gsingers [puppet] - 10https://gerrit.wikimedia.org/r/581632 (https://phabricator.wikimedia.org/T248014) [16:19:47] (03CR) 10Elukey: [C: 03+2] admin: add kerberos flag for gsingers [puppet] - 10https://gerrit.wikimedia.org/r/581632 (https://phabricator.wikimedia.org/T248014) (owner: 10Elukey) [16:32:16] jouncebot: now [16:32:16] For the next 0 hour(s) and 27 minute(s): Puppet SWAT(Max 6 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200319T1600) [16:32:24] going to roll the hotfix for RelatedArticles / T248090 [16:32:25] T248090: ResourceLoaderFileModule::getFileContents: style file not found, or is not a file: "/srv/mediawiki/php-1.35.0-wmf.24/extensions/RelatedArticles/" - https://phabricator.wikimedia.org/T248090 [16:33:04] James_F: Krinkle: looks like we can deploy the RelatedArticles fix now can't we? [16:33:32] * Krinkle nods [16:33:46] i will take care of it [16:34:20] 10Operations, 10Analytics, 10Product-Analytics, 10SRE-Access-Requests: Hive access - https://phabricator.wikimedia.org/T248097 (10Nuria) [16:35:39] (03PS1) 10MSantos: maps: tweak OSM replication hours [puppet] - 10https://gerrit.wikimedia.org/r/581636 [16:35:41] (03PS1) 10MSantos: maps: enable osm replication cron [puppet] - 10https://gerrit.wikimedia.org/r/581637 [16:35:43] (03PS1) 10Elukey: role::prometheus::analytics: correct presto targets [puppet] - 10https://gerrit.wikimedia.org/r/581638 [16:36:24] 10Operations, 10Analytics, 10Product-Analytics, 10SRE-Access-Requests: Hive access - https://phabricator.wikimedia.org/T248097 (10Nuria) Please take a look at https://wikitech.wikimedia.org/wiki/Analytics/Data_access You need to provide ssh keys [16:36:28] (03CR) 10Bstorm: [C: 03+2] toolforge: remove almost entire toollabs module and related roles [puppet] - 10https://gerrit.wikimedia.org/r/581056 (https://phabricator.wikimedia.org/T246689) (owner: 10Bstorm) [16:37:22] (03PS2) 10MSantos: maps: enable osm 
replication cron [puppet] - 10https://gerrit.wikimedia.org/r/581637 [16:37:42] (03CR) 10Volans: pick_nodes: add ability to pick nodes based on a puppet class (031 comment) [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/579579 (https://phabricator.wikimedia.org/T245288) (owner: 10Jbond) [16:38:06] hashar: Thanks. [16:39:10] (03CR) 10Elukey: [C: 03+2] role::prometheus::analytics: correct presto targets [puppet] - 10https://gerrit.wikimedia.org/r/581638 (owner: 10Elukey) [16:43:31] 10Operations, 10Wikimedia-SVG-rendering, 10Upstream: Update librsvg to ≥2.42.3 - https://phabricator.wikimedia.org/T193352 (10Esanders) Most of the icons in the OOUI library, which are just simple black and white paths compressed by SVGO, won't convert correctly on the live servers: {F31692371} They work fi... [16:45:08] (03PS1) 10Jgreen: clean up temporary pay-lvs2* and payments2* DNS entries [dns] - 10https://gerrit.wikimedia.org/r/581648 (https://phabricator.wikimedia.org/T248035) [16:45:10] (03PS1) 10Bstorm: toolforge: refactor docker builder to remove toollabs module [puppet] - 10https://gerrit.wikimedia.org/r/581647 (https://phabricator.wikimedia.org/T246689) [16:46:13] 10Operations, 10Wikimedia-SVG-rendering, 10Upstream: Update librsvg to ≥2.42.3 - https://phabricator.wikimedia.org/T193352 (10Jdforrester-WMF) AFAICT this entire task stack is the wrong way around? We can't do this upgrade until T216815 is done, and all of the "sub-tasks" are mostly different aspects of dupl... 
[16:47:29] merged [16:51:10] PROBLEM - mediawiki originals uploads -hourly- for codfw on icinga1001 is CRITICAL: account=mw-media class=originals cluster=swift instance=ms-fe2005:9112 job=statsd_exporter site=codfw https://wikitech.wikimedia.org/wiki/Swift/How_To https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?panelId=26&fullscreen&orgId=1&var-DC=codfw [16:52:16] PROBLEM - mediawiki originals uploads -hourly- for eqiad on icinga1001 is CRITICAL: account=mw-media class=originals cluster=swift instance=ms-fe1005:9112 job=statsd_exporter site=eqiad https://wikitech.wikimedia.org/wiki/Swift/How_To https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?panelId=26&fullscreen&orgId=1&var-DC=eqiad [16:53:01] (03CR) 10Volans: "2 question inline" (032 comments) [dns] - 10https://gerrit.wikimedia.org/r/581648 (https://phabricator.wikimedia.org/T248035) (owner: 10Jgreen) [16:53:22] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=icinga site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [16:53:42] 10Operations, 10ops-eqiad, 10DC-Ops, 10decommission: decommission auth1001 - https://phabricator.wikimedia.org/T234909 (10Papaul) ` edit interfaces interface-range vlan-private1-d-eqiad] - member ge-3/0/10; [edit interfaces interface-range disabled] member ge-1/0/6 { ... } + member ge-3/0/10; [e... 
[16:54:27] !log hashar@deploy1001 Synchronized php-1.35.0-wmf.24/extensions/RelatedArticles: Do not register "" as a style path, that breaks ResourceLoader - T248090 (duration: 01m 07s) [16:54:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:54:39] T248090: ResourceLoaderFileModule::getFileContents: style file not found, or is not a file: "/srv/mediawiki/php-1.35.0-wmf.24/extensions/RelatedArticles/" - https://phabricator.wikimedia.org/T248090 [16:54:48] 10Operations, 10ops-eqiad, 10DC-Ops, 10decommission: decommission auth1001 - https://phabricator.wikimedia.org/T234909 (10Papaul) [16:55:23] waiting to confirm logs disappeared [16:55:26] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [16:59:15] (03PS2) 10Jgreen: clean up temporary pay-lvs2* and payments2* DNS entries [dns] - 10https://gerrit.wikimedia.org/r/581648 (https://phabricator.wikimedia.org/T248035) [16:59:24] 10Operations, 10Wikimedia-SVG-rendering, 10Upstream: Update librsvg to ≥2.42.3 - https://phabricator.wikimedia.org/T193352 (10JoKalliauer) @Esanders : Your bug is imho T217990 and can be fixed using https://tools.wmflabs.org/svgworkaroundbot/ (activate "run svgcleaner"). It is related to missing spaces betwe... [17:00:04] halfak and accraze: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) Services – Graphoid / Citoid / ORES deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200319T1700). [17:00:30] solved! [17:00:56] looks like 1.35.0-wmf.24 is fine [17:01:12] Hurrah. 
[17:01:48] (03PS1) 10Bstorm: toolserver: refactor into profile and move off "toollabs" name [puppet] - 10https://gerrit.wikimedia.org/r/581654 (https://phabricator.wikimedia.org/T246689) [17:02:48] 10Operations, 10serviceops: Renew certs for mcrouter on all application servers. - https://phabricator.wikimedia.org/T248093 (10Joe) The procedure to do this is described here: https://wikitech.wikimedia.org/wiki/Mcrouter#Renew_CA_and_certificates it would be nice to make this a script so we don't need to pa... [17:04:07] (03PS3) 10Jgreen: first stage nsca_frack.cfg.erb cleanup, add misc hostgroup, some reformatting [puppet] - 10https://gerrit.wikimedia.org/r/581076 (https://phabricator.wikimedia.org/T247855) [17:04:50] (03PS4) 10Jgreen: first stage nsca_frack.cfg.erb cleanup, add misc hostgroup, some reformatting [puppet] - 10https://gerrit.wikimedia.org/r/581076 (https://phabricator.wikimedia.org/T247855) [17:08:00] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=icinga site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [17:08:15] (03CR) 10Papaul: [C: 03+1] clean up temporary pay-lvs2* and payments2* DNS entries [dns] - 10https://gerrit.wikimedia.org/r/581648 (https://phabricator.wikimedia.org/T248035) (owner: 10Jgreen) [17:08:45] (03PS3) 10Alexandros Kosiaris: admin: Deduplicate defaults.yaml [deployment-charts] - 10https://gerrit.wikimedia.org/r/581507 [17:08:47] (03PS1) 10Alexandros Kosiaris: admin: deduplicate main helmfile.yaml [deployment-charts] - 10https://gerrit.wikimedia.org/r/581656 [17:08:49] (03PS1) 10Alexandros Kosiaris: admin/namespace: Deduplicate all helmfile templates [deployment-charts] - 10https://gerrit.wikimedia.org/r/581657 [17:08:51] (03PS1) 10Alexandros Kosiaris: admin: Default to sensible values for deploUser, namespaceName [deployment-charts] - 10https://gerrit.wikimedia.org/r/581658 [17:10:06] RECOVERY - Prometheus 
jobs reduced availability on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [17:11:48] (03CR) 10Jgreen: [C: 03+2] clean up temporary pay-lvs2* and payments2* DNS entries [dns] - 10https://gerrit.wikimedia.org/r/581648 (https://phabricator.wikimedia.org/T248035) (owner: 10Jgreen) [17:14:12] 10Operations, 10ops-codfw, 10Wikimedia-FR-Tech-Systems, 10fundraising-tech-ops, 10Patch-For-Review: Fix incongruences between Netbox and DNS repository - https://phabricator.wikimedia.org/T248035 (10Jgreen) 05Open→03Resolved a:03Jgreen fixed! [17:15:36] 10Operations, 10Analytics, 10Product-Analytics, 10SRE-Access-Requests: Hive access - https://phabricator.wikimedia.org/T248097 (10mpopov) @spatton: we have some additional information on [[ https://www.mediawiki.org/wiki/Product_Analytics/Onboarding#SSH_Keys,_Stat_machines,_Notebooks,_HUE,_Datagrip,_Groups... [17:16:26] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=icinga site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [17:17:50] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [17:17:58] 10Operations, 10service-runner, 10serviceops, 10CPT Initiatives (RESTBase Split (CDP2)), and 5 others: RESTBase/RESTRouter/service-runner rate limiting plans - https://phabricator.wikimedia.org/T235437 (10WDoranWMF) p:05High→03Low [17:19:50] 10Operations, 10Analytics, 10Product-Analytics, 10SRE-Access-Requests: Hive access - https://phabricator.wikimedia.org/T248097 (10Nuria) @spatton: Can you also explain a little what are you trying to do? 
(this is not needed for access but it help us to understand your use case) [17:23:33] (03CR) 10Vgutierrez: [C: 04-1] "it looks good, please fix lvs200[89] IP addresses" (031 comment) [homer/public] - 10https://gerrit.wikimedia.org/r/576320 (owner: 10Ayounsi) [17:23:36] 10Operations, 10OfflineContentGenerator, 10Readers-Web-Backlog (Tracking), 10Services (watching): Confirm attribution needs - https://phabricator.wikimedia.org/T150875 (10JKatzWMF) 05Open→03Resolved this was resolved during the OCG replacement [17:23:40] 10Operations, 10OCG-General, 10Readers-Community-Engagement, 10Epic, and 2 others: [EPIC] (Proposal) Replicate core OCG features and sunset OCG service - https://phabricator.wikimedia.org/T150871 (10JKatzWMF) [17:23:42] 10Operations, 10OfflineContentGenerator, 10Readers-Community-Engagement, 10Patch-For-Review, and 2 others: Collate wikimedia pages into a single html wikimedia page that can then be rendered into a single pdf - https://phabricator.wikimedia.org/T150874 (10JKatzWMF) [17:26:12] (03PS2) 10RLazarus: Allow regular expressions in assert_headers values. [software/httpbb] - 10https://gerrit.wikimedia.org/r/576159 (https://phabricator.wikimedia.org/T236699) [17:28:20] (03CR) 10RLazarus: [C: 03+2] Allow regular expressions in assert_headers values. [software/httpbb] - 10https://gerrit.wikimedia.org/r/576159 (https://phabricator.wikimedia.org/T236699) (owner: 10RLazarus) [17:30:15] (03Merged) 10jenkins-bot: Allow regular expressions in assert_headers values. [software/httpbb] - 10https://gerrit.wikimedia.org/r/576159 (https://phabricator.wikimedia.org/T236699) (owner: 10RLazarus) [17:31:10] 10Operations, 10netops, 10cloud-services-team (Kanban): CloudVPS: IPv6 early PoC - https://phabricator.wikimedia.org/T245495 (10ayounsi) The steps I have in mind here are: 1/ Setup v6 on the transport network (most likely `2620:0:860:fe0a::/64`) 2/ Assign a /48 for cloud codfw, see T187929 (Here we won't go... 
[17:34:04] (03PS1) 10Andrew Bogott: nova-compute: change virt_type to qemu [puppet] - 10https://gerrit.wikimedia.org/r/581666 (https://phabricator.wikimedia.org/T242766) [17:37:05] (03CR) 10Andrew Bogott: [C: 03+2] nova-compute: change virt_type to qemu [puppet] - 10https://gerrit.wikimedia.org/r/581666 (https://phabricator.wikimedia.org/T242766) (owner: 10Andrew Bogott) [17:40:15] (03CR) 10Vgutierrez: [C: 03+1] Add routers BGP to LVS/Pybal config [homer/public] - 10https://gerrit.wikimedia.org/r/576320 (owner: 10Ayounsi) [17:40:16] (03PS1) 10Hnowlan: calico: add changeprop access to varnish multicast address [deployment-charts] - 10https://gerrit.wikimedia.org/r/581667 (https://phabricator.wikimedia.org/T213193) [17:40:53] (03CR) 10Dzahn: [C: 03+2] releases: close port 80 for caching servers. [puppet] - 10https://gerrit.wikimedia.org/r/572353 (https://phabricator.wikimedia.org/T210411) (owner: 10Dzahn) [17:45:20] (03CR) 10Ppchelko: [C: 03+1] calico: add changeprop access to varnish multicast address [deployment-charts] - 10https://gerrit.wikimedia.org/r/581667 (https://phabricator.wikimedia.org/T213193) (owner: 10Hnowlan) [17:47:10] !log releases/releases-jenkins - closed firewall hole to port 80 for caching servers - kept it open just for envoy from the backends - ATS speaks https to them meanwhile [17:47:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:50:24] (03PS1) 10Bstorm: toolforge: fix file location for grid override.my.cnf [puppet] - 10https://gerrit.wikimedia.org/r/581673 (https://phabricator.wikimedia.org/T246689) [17:52:44] (03PS2) 10Hnowlan: calico: add changeprop access to varnish multicast address [deployment-charts] - 10https://gerrit.wikimedia.org/r/581667 (https://phabricator.wikimedia.org/T213193) [17:55:18] (03CR) 10Bstorm: [C: 03+2] toolforge: fix file location for grid override.my.cnf [puppet] - 10https://gerrit.wikimedia.org/r/581673 (https://phabricator.wikimedia.org/T246689) (owner: 10Bstorm) [17:57:26] 
Stealing the services window to deploy https://gerrit.wikimedia.org/r/c/mediawiki/extensions/Wikibase/+/581674 [17:59:30] (03PS1) 10Cparle: Enable WikibaseQualityConstraints on commons [mediawiki-config] - 10https://gerrit.wikimedia.org/r/581678 (https://phabricator.wikimedia.org/T248117) [18:00:04] RoanKattouw, Niharika, and Urbanecm: May I have your attention please! Morning SWAT(Max 6 patches). (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200319T1800) [18:00:04] No GERRIT patches in the queue for this window AFAICS. [18:02:43] 10Operations, 10netops: Cloud IPv6 subnets - https://phabricator.wikimedia.org/T187929 (10ayounsi) a:03faidon I agree that option 2 is the way to go. The complication is how to subnet them properly for both the short term (T245495 PoC) and the longer term. I couldn't find much subnetting recommendation doc... [18:03:06] (03CR) 10Cparle: [C: 04-1] Enable WikibaseQualityConstraints on commons [mediawiki-config] - 10https://gerrit.wikimedia.org/r/581678 (https://phabricator.wikimedia.org/T248117) (owner: 10Cparle) [18:03:40] (03CR) 10Herron: [C: 03+1] remove elnath.codfw.wmnet [dns] - 10https://gerrit.wikimedia.org/r/581060 (https://phabricator.wikimedia.org/T188544) (owner: 10Dzahn) [18:04:11] (03CR) 10Dzahn: [C: 03+2] remove elnath.codfw.wmnet [dns] - 10https://gerrit.wikimedia.org/r/581060 (https://phabricator.wikimedia.org/T188544) (owner: 10Dzahn) [18:04:15] (03PS3) 10Dzahn: remove elnath.codfw.wmnet [dns] - 10https://gerrit.wikimedia.org/r/581060 (https://phabricator.wikimedia.org/T188544) [18:05:01] 10Operations, 10ops-eqiad, 10decommission: Decommission neodymium - https://phabricator.wikimedia.org/T220503 (10Papaul) ` [edit interfaces interface-range vlan-private1-c-eqiad] - member ge-4/0/8; [edit interfaces interface-range disabled] member ge-6/0/32 { ... } + member ge-4/0/8; [edit interfa... 
[18:05:52] 10Operations, 10ops-eqiad, 10decommission: Decommission neodymium - https://phabricator.wikimedia.org/T220503 (10Papaul) [18:06:54] (03PS1) 10Andrew Bogott: nova-compute: install version-specific config [puppet] - 10https://gerrit.wikimedia.org/r/581682 (https://phabricator.wikimedia.org/T242766) [18:12:10] 10Operations, 10Puppet: compile/diff catalogs between puppetdb v2 (production) and puppetdb v4 - https://phabricator.wikimedia.org/T188544 (10Dzahn) 05Open→03Resolved [18:12:12] 10Operations, 10Puppet: Upgrade PuppetDB to version 4.4 - https://phabricator.wikimedia.org/T177253 (10Dzahn) [18:12:19] (03CR) 10Andrew Bogott: [C: 03+2] "PCC suggests that, in fact, nothing has changed between N and P. One small change for Q." [puppet] - 10https://gerrit.wikimedia.org/r/581682 (https://phabricator.wikimedia.org/T242766) (owner: 10Andrew Bogott) [18:12:56] 10Operations, 10ops-eqiad, 10DC-Ops, 10decommission, 10Patch-For-Review: Decommission dbproxy1006.eqiad.wmnet - https://phabricator.wikimedia.org/T233207 (10Papaul) no switch configuration for this server on any switch [18:13:26] 10Operations, 10netops, 10cloud-services-team (Kanban): CloudVPS: IPv6 early PoC - https://phabricator.wikimedia.org/T245495 (10aborrero) Mostly agree with everything, some inlined comments >>! In T245495#5984709, @ayounsi wrote: > The steps I have in mind here are: > 1/ Setup v6 on the transport network (m... [18:13:32] 10Operations, 10ops-eqiad, 10DC-Ops, 10decommission, 10Patch-For-Review: Decommission dbproxy1006.eqiad.wmnet - https://phabricator.wikimedia.org/T233207 (10Papaul) [18:16:10] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=icinga site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [18:18:18] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [18:22:31] 10Operations, 10netops: eqord - ulsfo Telia link down - IC-313592 - https://phabricator.wikimedia.org/T221259 (10Dzahn) 05Open→03Resolved Wed, 18 Mar 2020 01:09 : Our technician performed cleaning to FMP in Denver and the issue resolved , the traffic is up now . apologize for the inconvenience caused.... [18:22:33] 10Operations, 10netops: eqord - ulsfo Telia link down - IC-313592 - https://phabricator.wikimedia.org/T221259 (10Dzahn) Icinga is all green again. [18:24:31] !log ladsgroup@deploy1001 Synchronized php-1.35.0-wmf.24/extensions/Wikibase/data-access/src/SingleEntitySourceServices.php: [[gerrit:581674|Fix 'max' to Int32EntityId::MAX conversion (T247985)]], part I (duration: 01m 08s) [18:24:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:24:36] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=icinga site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [18:24:37] T247985: Wikidata localisation is broken for units - https://phabricator.wikimedia.org/T247985 [18:26:42] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [18:30:56] !log ladsgroup@deploy1001 Synchronized php-1.35.0-wmf.24/extensions/Wikibase/lib/includes/Store/ByIdDispatchingEntityInfoBuilder.php: [[gerrit:581674|Fix 'max' to Int32EntityId::MAX conversion (T247985)]], part II (duration: 01m 07s) [18:31:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:31:01] T247985: Wikidata localisation is broken for units - https://phabricator.wikimedia.org/T247985 [18:31:22] (03PS1) 10Cwhite: Release 0.7 [debs/prometheus-icinga-exporter] - 10https://gerrit.wikimedia.org/r/581689 [18:33:10] (03CR) 10Cwhite: [C: 03+2] Release 0.7 [debs/prometheus-icinga-exporter] - 10https://gerrit.wikimedia.org/r/581689 (owner: 10Cwhite) [18:33:26] (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] toolforge: refactor docker builder to remove toollabs module [puppet] - 10https://gerrit.wikimedia.org/r/581647 (https://phabricator.wikimedia.org/T246689) (owner: 10Bstorm) [18:33:33] (03PS7) 10Dzahn: gerrit: replace hiera() with lookup() [puppet] - 10https://gerrit.wikimedia.org/r/563546 [18:35:21] (03PS8) 10Dzahn: gerrit: replace hiera() with lookup() [puppet] - 10https://gerrit.wikimedia.org/r/563546 [18:37:54] (03CR) 10Dzahn: [C: 04-2] "https://puppet-compiler.wmflabs.org/compiler1002/21501/gerrit1001.wikimedia.org/change.gerrit1001.wikimedia.org.err" [puppet] - 10https://gerrit.wikimedia.org/r/563546 (owner: 10Dzahn) [18:42:34] (03PS5) 10Arturo Borrero Gonzalez: toolforge: support canonical redirects in urlproxy [puppet] - 10https://gerrit.wikimedia.org/r/579952 (https://phabricator.wikimedia.org/T234617) [18:43:11] (03PS9) 10Dzahn: gerrit: replace hiera() with lookup() [puppet] - 10https://gerrit.wikimedia.org/r/563546 [18:46:05] (03CR) 10Bstorm: [C: 03+2] toolforge: refactor docker builder to remove toollabs module [puppet] - 10https://gerrit.wikimedia.org/r/581647 
(https://phabricator.wikimedia.org/T246689) (owner: 10Bstorm) [18:46:26] Not sure where to ask this, so I'll give it a try. Is it possible to track specific directories' disk usage `du -sh /srv/postgresql/9.6/main/*` in Grafana with Prometheus? (I want to do that in maps' clusters) [18:47:04] (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/580985 (https://phabricator.wikimedia.org/T247538) (owner: 10Filippo Giunchedi) [18:47:52] 10Operations, 10serviceops: Renew certs for mcrouter on all application servers. - https://phabricator.wikimedia.org/T248093 (10RLazarus) a:03RLazarus [18:54:51] (03CR) 10Jgreen: [C: 03+2] first stage nsca_frack.cfg.erb cleanup, add misc hostgroup, some reformatting [puppet] - 10https://gerrit.wikimedia.org/r/581076 (https://phabricator.wikimedia.org/T247855) (owner: 10Jgreen) [18:55:37] 10Operations, 10Discovery, 10Traffic, 10Wikidata, and 3 others: Wikidata maxlag repeatedly over 5s since Jan20, 2020 (primarily caused by the query service) - https://phabricator.wikimedia.org/T243701 (10Vojtech.dostal) This is also extremely important for OpenRefine users. We cannot push our reconciliatio... [18:59:03] 10Operations, 10Discovery, 10Traffic, 10Wikidata, and 3 others: Wikidata maxlag repeatedly over 5s since Jan20, 2020 (primarily caused by the query service) - https://phabricator.wikimedia.org/T243701 (10Ladsgroup) >>! In T243701#5973374, @Dvorapa wrote: > Any news? From possible solutions like T238751, T2... [19:00:04] hashar and twentyafterfour: I, the Bot under the Fountain, allow thee, The Deployer, to do Mediawiki train - European+American Version (secondary timeslot) deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200319T1900). [19:02:16] (03PS10) 10Dzahn: gerrit: replace hiera() with lookup() [puppet] - 10https://gerrit.wikimedia.org/r/563546 [19:10:07] o/ [19:10:51] so train to all! 
[19:11:03] (03PS1) 10Jgreen: fix typo in nsca_frack.cfg.erb [puppet] - 10https://gerrit.wikimedia.org/r/581707 [19:11:38] (03CR) 10Dzahn: [C: 03+1] fix typo in nsca_frack.cfg.erb [puppet] - 10https://gerrit.wikimedia.org/r/581707 (owner: 10Jgreen) [19:11:46] (03PS1) 10Hashar: all wikis to 1.35.0-wmf.24 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/581709 [19:11:48] (03CR) 10Hashar: [C: 03+2] all wikis to 1.35.0-wmf.24 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/581709 (owner: 10Hashar) [19:12:45] (03Merged) 10jenkins-bot: all wikis to 1.35.0-wmf.24 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/581709 (owner: 10Hashar) [19:13:25] (03CR) 10Jgreen: [C: 03+2] fix typo in nsca_frack.cfg.erb [puppet] - 10https://gerrit.wikimedia.org/r/581707 (owner: 10Jgreen) [19:14:30] !log hashar@deploy1001 rebuilt and synchronized wikiversions files: all wikis to 1.35.0-wmf.24 [19:14:31] (03PS6) 10Arturo Borrero Gonzalez: toolforge: support canonical redirects in urlproxy [puppet] - 10https://gerrit.wikimedia.org/r/579952 (https://phabricator.wikimedia.org/T234617) [19:14:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:15:52] hashar: Looks quiet to me. [19:16:06] I have two errors on the dashboard [19:16:10] 10Operations, 10Discovery, 10Traffic, 10Wikidata, and 3 others: Wikidata maxlag repeatedly over 5s since Jan20, 2020 (primarily caused by the query service) - https://phabricator.wikimedia.org/T243701 (10Dvorapa) Is there any short-term temporary solution until RFC gets done? Perhaps hardcode maxlag=4? Or... [19:16:11] this train is boring [19:16:17] James_F: yeah super quiet [19:16:38] Poor hashar, no explodey-train for him. 
[19:16:57] as usual :D [19:17:19] I suspect developers to hold their patches when I am the train conductor [19:21:08] PROBLEM - PHP opcache health on scandium is CRITICAL: CRITICAL: opcache free space is below 50 MB https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [19:33:06] not much happening yeah [19:34:46] (03PS11) 10Dzahn: gerrit: replace hiera() with lookup() [puppet] - 10https://gerrit.wikimedia.org/r/563546 [19:36:06] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=icinga site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [19:38:10] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [19:40:59] (03PS7) 10Arturo Borrero Gonzalez: toolforge: support canonical redirects in urlproxy [puppet] - 10https://gerrit.wikimedia.org/r/579952 (https://phabricator.wikimedia.org/T234617) [19:44:54] (03PS8) 10Arturo Borrero Gonzalez: toolforge: support canonical redirects in urlproxy [puppet] - 10https://gerrit.wikimedia.org/r/579952 (https://phabricator.wikimedia.org/T234617) [19:46:13] (03PS9) 10Arturo Borrero Gonzalez: toolforge: support canonical redirects in urlproxy [puppet] - 10https://gerrit.wikimedia.org/r/579952 (https://phabricator.wikimedia.org/T234617) [19:52:12] (03PS1) 10Andrew Bogott: Revert "nova-compute: change virt_type to qemu" [puppet] - 10https://gerrit.wikimedia.org/r/581721 [19:52:23] (03PS2) 10Andrew Bogott: Revert "nova-compute: change virt_type to qemu" [puppet] - 10https://gerrit.wikimedia.org/r/581721 [19:53:43] (03CR) 10Andrew Bogott: [C: 03+2] Revert "nova-compute: change virt_type to qemu" [puppet] - 10https://gerrit.wikimedia.org/r/581721 (owner: 10Andrew Bogott) [19:55:34] (03PS10) 10Arturo Borrero 
Gonzalez: toolforge: support canonical redirects in urlproxy [puppet] - 10https://gerrit.wikimedia.org/r/579952 (https://phabricator.wikimedia.org/T234617) [19:56:56] ah got a fatal! https://meta.wikimedia.org/wiki/Special:SupportedLanguages/ar [19:58:09] repros for me [19:59:32] 10Operations, 10Wikimedia-Mailing-lists: Please decom reading-wmf mailing list - https://phabricator.wikimedia.org/T248126 (10dr0ptp4kt) [20:02:04] filed as https://phabricator.wikimedia.org/T248125 [20:02:18] (03CR) 10BryanDavis: toolforge: support canonical redirects in urlproxy (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/579952 (https://phabricator.wikimedia.org/T234617) (owner: 10Arturo Borrero Gonzalez) [20:02:24] some of the other languages are slow [20:02:28] /fr times out [20:02:38] but maybe that was the case with wmf.23 as well ... who knows [20:03:17] Nikerabbit: ^ in case you are around https://meta.wikimedia.org/wiki/Special:SupportedLanguages/ar fatals out [20:04:04] maybe it is not a big deal. It is probably not worth a rollback [20:04:50] Yeah, not worth it yet [20:05:37] I haven't made it a blocker [20:06:27] I am claiming that 1.35.0-wmf.24 is a success [20:06:34] * James_F grins. [20:08:24] (03PS12) 10Dzahn: gerrit: replace hiera() with lookup() [puppet] - 10https://gerrit.wikimedia.org/r/563546 [20:08:50] hashar: just commented on the task [20:09:19] Nikerabbit: awesome thank you :] [20:10:45] Nikerabbit: I will be back tomorrow at 1pm UTC and will be happy to deploy a hotfix if needed [20:10:53] I have lowered the prio on the task [20:11:34] hashar: I'll check the fatal next week, or tomorrow if lucky [20:11:44] +1 :) [20:28:50] (03CR) 10CDanis: [C: 03+1] "LGTM, at least as far as I can read Juniper configs." 
[homer/public] - 10https://gerrit.wikimedia.org/r/576320 (owner: 10Ayounsi) [20:30:36] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=icinga site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [20:32:42] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [20:52:20] 10Operations, 10Patch-For-Review, 10Security: envoyproxy: CVE-2020-8664 CVE-2020-8661 CVE-2020-8660 CVE-2020-8659 - https://phabricator.wikimedia.org/T246868 (10RLazarus) All remaining MW hosts are updated. That leaves parsoid, snapshot hosts, and a few other odds and ends. [20:59:22] (03PS13) 10Dzahn: gerrit: replace hiera() with lookup() [puppet] - 10https://gerrit.wikimedia.org/r/563546 [21:00:33] (03PS1) 10RLazarus: Bump versions for envoy and envoy-tls-local-proxy to 1.13.1. [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/581747 (https://phabricator.wikimedia.org/T246868) [21:01:52] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=icinga site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [21:03:01] (03PS1) 10Alexandros Kosiaris: admin: Remove all override files [deployment-charts] - 10https://gerrit.wikimedia.org/r/581748 [21:03:56] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [21:04:13] (03PS2) 10RLazarus: Bump versions for envoy and envoy-tls-local-proxy to 1.13.1. 
[docker-images/production-images] - https://gerrit.wikimedia.org/r/581747 (https://phabricator.wikimedia.org/T246868) [21:25:51] (PS1) Cwhite: profile: set icinga exporter scrape_timeout to 20s [puppet] - https://gerrit.wikimedia.org/r/581762 (https://phabricator.wikimedia.org/T248131) [21:42:50] Operations, Product-Infrastructure-Team-Backlog, Traffic: Increase in 503 responses to POST reqs to api.php since 2020-03-15 - https://phabricator.wikimedia.org/T248132 (Mholloway) [21:48:51] Operations, Discovery, Traffic, Wikidata, and 3 others: Wikidata maxlag repeatedly over 5s since Jan20, 2020 (primarily caused by the query service) - https://phabricator.wikimedia.org/T243701 (Ladsgroup) >>! In T243701#5985113, @Dvorapa wrote: > Is there any short-term temporary solution until R... [21:55:23] Operations, Product-Infrastructure-Team-Backlog, Traffic: Increase in 503 responses to POST reqs to api.php since 2020-03-15 - https://phabricator.wikimedia.org/T248132 (Mholloway) Actually, this appears nearly identical even without filtering for POST or api.php. [21:57:48] Operations, Product-Infrastructure-Team-Backlog, Traffic: Increase in 503 responses since 2020-03-15 - https://phabricator.wikimedia.org/T248132 (Mholloway) [21:58:14] Operations, Product-Infrastructure-Team-Backlog, Traffic: Increase in 503 responses since 2020-03-15 - https://phabricator.wikimedia.org/T248132 (Mholloway) [22:00:42] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job={icinga,swagger_check_eventgate_main_http_cluster_eqiad} site={codfw,eqiad} https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [22:02:48] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds.
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [22:10:18] !log mholloway-shell@deploy1001 Started deploy [mobileapps/deploy@794f099]: Update mobileapps to 99869f45 [22:10:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:15:31] !log mholloway-shell@deploy1001 Finished deploy [mobileapps/deploy@794f099]: Update mobileapps to 99869f45 (duration: 05m 13s) [22:15:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:42:30] RECOVERY - PHP opcache health on scandium is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [23:00:04] RoanKattouw, Niharika, and Urbanecm: #bothumor I ♥ Unicode. All rise for Evening SWAT(Max 6 patches) deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200319T2300). [23:00:04] No GERRIT patches in the queue for this window AFAICS. [23:18:57] Operations, cloud-services-team (Kanban): Migrate remaining self-hosted puppet masters to Puppet 5 / facter 3 - https://phabricator.wikimedia.org/T241719 (Krenair) [23:20:50] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=icinga site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [23:22:54] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets