[00:00:10] https://webkit.org/tracking-prevention/ [00:00:33] Krinkle: ok wow thanks much... So, new summary, the redirect strategy won't work, either? [00:00:43] indeed [00:00:51] ejegg: ^ [00:01:04] it also risks putting us on a list of bad behaving websites so please don't try [00:01:23] ok gotcha [00:01:37] I have some ideas but let's discuss that another time, happy to set up a call [00:01:46] thanks so much for this info, yes! [00:02:09] oh dang [00:02:44] ejegg: Krinkle: I think for now, then the solution is to put a link on the thank you page explicitly asking donors to click on a link if they want to hide fundraising banners [00:03:19] that is in fact what I was going to suggest. email them a thing with a nice message, thanks again for donating, we won't show you etc. [00:03:42] Yeah they do get an e-mail currently [00:04:17] if it requires a user action though, does the close button not suffice upon first seeing it? [00:04:25] hmm, the bounce tracking classification seems to penalize things with multiple redirect destinations, while we would send everyone to the TY page [00:04:57] Krinkle: this is for people who DON'T close the banner [00:04:57] I guess they can hide it already but it's seeing the banner that's confusing [00:05:05] and instead click through it to donate [00:05:17] right, sure but after that upon first seeing it [00:05:35] also, I assume for cases where people donate from a Wikipedia banner there is already no issue since that's first party? [00:05:39] Krinkle: ejegg: the close button also sets a hide cookie and invokes the 3rd party cookie storm, but we're not really worrying about that for now [00:05:41] the donor cookie also lasts a lot longer than the close-banner cookie [00:05:53] I see, that's fair.
[00:05:58] although see about long lasting cookies [00:06:00] ejegg: Krinkle: we could have the cookie set from the banner itself [00:06:02] but I get the theory [00:06:14] when they click to donate [00:06:27] AndyRussG: right, assume they will get through the rest of the flow [00:06:33] the only disadvantage being that they'll still get the cookie even if the donation wasn't successful [00:06:35] yeah [00:06:43] to confirm, enwiki -> banner -> donate.wikipedia -> thanks -> img wikipedia *.wikipedia hidebanner 1 year [00:06:47] that tunnel works? [00:06:55] Krinkle: almost [00:08:10] enwiki banner -> payments wiki -> maybe redirect to payment provider or otherwise iframe -> donate wiki thanks, with background injection of img tags from WP and other domains into DOM and those requests for those images set the cookies [00:08:35] right [00:08:42] and the wikipedia one is working [00:09:08] but the others are set in an alternate universe by Safari specifically only for cross-origin requests between wikipedia and [00:09:12] AndyRussG: man, https://webkit.org/tracking-prevention/ pours cold water on banner history [00:09:53] looks like it gets erased after 7 days of not visiting the site (see section 7-Day Cap on All Script-Writeable Storage ) [00:10:29] Krinkle: the wikipedia one isn't working for Safari users [00:10:40] ok, why :) [00:11:00] because m != p [00:11:08] oh, it's not donate.wikipedia.org anymore? [00:11:09] i.e. donate.wikiMedia.org [00:11:14] maybe never was [00:11:17] vs en.wikiPedia.org [00:11:17] we should fix that? [00:11:23] hehe, that's the rebrand [00:11:30] well, doesn't have to be [00:11:39] and will involve plenty of other fun [00:11:50] but yeah, would at least help us with cookies!
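Editorial sketch of the "m != p" point above, assuming a deliberately naive notion of "site" (the last two DNS labels; real browsers consult the Public Suffix List). The helper names and the cookie name `hidebanner` (borrowed from the shorthand in the chat, not from the CentralNotice code) are hypothetical. It only shows why a cookie scoped to `.wikipedia.org` can be shared by `en.wikipedia.org` and a `donate.wikipedia.org` page but not by `donate.wikimedia.org`, and what a one-year hide cookie string might look like.

```python
from datetime import datetime, timedelta, timezone
from email.utils import format_datetime


def naive_site(host: str) -> str:
    """Return the last two DNS labels of a host name.

    Assumption: this stands in for the registrable domain. It is only
    good enough for names like en.wikipedia.org, not for co.uk etc.
    """
    return ".".join(host.lower().rstrip(".").split(".")[-2:])


def is_same_site(host_a: str, host_b: str) -> bool:
    """True when both hosts fall under the same (naive) site."""
    return naive_site(host_a) == naive_site(host_b)


def hide_banner_cookie(domain: str, days: int, now: datetime) -> str:
    """Build a cookie string for a hypothetical 'hidebanner' cookie.

    `now` must be timezone-aware UTC so the Expires date can be
    rendered in the HTTP date format.
    """
    expires = now + timedelta(days=days)
    return (
        f"hidebanner=1; Domain={domain}; Path=/; "
        f"Expires={format_datetime(expires, usegmt=True)}; Secure; SameSite=Lax"
    )


if __name__ == "__main__":
    print(is_same_site("en.wikipedia.org", "donate.wikipedia.org"))
    print(is_same_site("en.wikipedia.org", "donate.wikimedia.org"))
    print(hide_banner_cookie(".wikipedia.org", 365, datetime.now(timezone.utc)))
```

The sketch only builds the string; it says nothing about how long Safari would actually keep a cookie written from banner JavaScript, which is the open question in the discussion.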
[00:11:54] it's your only option for this though [00:12:06] also wouldn't need the image in that case, can do *.wikipedia.org directly [00:12:18] right [00:13:09] but yeah, I don't see a way out of this apart from a list of links in the thank you page to opt-in to hiding the banner by visiting a link and maybe a simple confirmation on the other hand to feel good with a link back to the thank you page and/or the same list again. [00:13:09] hmm, looks like donate.wikipedia.org does redirect to donate.wikimedia.org [00:13:18] redirecting would risk being rejected again [00:13:50] wmde could embed that same list/chain of links [00:14:11] I recently renamed test.wikimedia.beta.wmflabs to test.wikiPedia.beta.wmflabs [00:14:14] that was quite easy [00:14:48] ejegg: alternatively all you really need is a single HTML file on a wikipedia domain [00:15:04] Krinkle: ejegg: we really do need a deeper investigation into all this [00:15:17] I mean, it could be thanks.wikipedia.org/index.html if we want a quick hack [00:15:40] yeah that would also do it :) [00:15:54] we have a microsite cluster now for simple sites like this [00:16:13] ejegg: for the NL campaign, though, we'll suggest in-banner JS to set the cookie when people click to donate? [00:16:41] We already have facilities for trying to run code when people click to navigate away, if it's a link (not sure if it still is) [00:16:46] AndyRussG: I guess so [00:17:13] the payments-wiki redirect is part of the core banner js [00:17:17] ejegg: probably faster than quickly implementing donate.wikipedia.org thank-you [00:17:36] yeah, if hiding it on first attempt to donate is acceptable that would be even simpler [00:17:40] also works naturally for other projects [00:17:46] e.g. if we run on non-Wikipedia [00:18:04] yeah...
currently only FR on WP [00:18:32] Guessing we'd lose a number of donations from ppl who want to donate, click, lose or close tab, and keep browsing [00:18:44] who would otherwise click again on next banner view [00:20:00] or, (sorry), realistically find it again two weeks later :) [00:20:56] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [00:21:29] ejegg: I'm just thinking out loud about stuff I'm not an expert in, but I suppose it might also be possible to e.g. let the on-click cookie only reduce the severity of the banner [00:21:37] and still have the thank you opt-out thing [00:21:57] Krinkle: hey also an idea [00:21:58] e.g. bucket them into something lighter where they can pick it up or say "I already donated" [00:22:15] yeah [00:22:42] Krinkle: ejegg: apologies, my kids have been waiting for me to eat with them, I should run... [00:22:53] that might even remove the need for automatic hiding if it's subtle/respectable enough [00:23:05] ok AndyRussG, buen provecho [00:23:06] Krinkle: thanks so so much for all the info and advice on this [00:23:29] Krinkle: we'll definitely take you up on the offer to talk through this live!!! thanks again :) [00:23:32] ejegg: ¡gracias! [00:23:39] yes, thanks very much Krinkle! [00:23:51] k, going afk myself. if there's more cookie brain storming ahead, feel free to pull me in [00:23:52] ejegg: and also thanks 4 sticking around and helping with this :) [00:24:11] :) Krinkle quick last question: pointer to the task for CentralAuth stuff? [00:24:49] https://phabricator.wikimedia.org/T252236 [00:26:36] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds.
https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [00:28:17] Krinkle: thx!! [00:31:58] PROBLEM - rsyslog in codfw is failing to deliver messages on icinga1001 is CRITICAL: action={fwd_centrallog1001.eqiad.wmnet:6514,fwd_centrallog2001.codfw.wmnet:6514} https://wikitech.wikimedia.org/wiki/Rsyslog https://grafana.wikimedia.org/d/000000596/rsyslog?var-datasource=codfw+prometheus/ops [00:32:05] Operations, Fundraising-Backlog, MediaWiki-extensions-CentralNotice, Traffic, and 2 others: Implement redirect for hide banner cookie issue - https://phabricator.wikimedia.org/T251780 (Ejegg) @Krinkle explained in IRC that this approach will probably not work long term, and risks Wikipedia being... [00:34:40] PROBLEM - rsyslog in eqiad is failing to deliver messages on icinga1001 is CRITICAL: action={fwd_centrallog1001.eqiad.wmnet:6514,fwd_centrallog2001.codfw.wmnet:6514} https://wikitech.wikimedia.org/wiki/Rsyslog https://grafana.wikimedia.org/d/000000596/rsyslog?var-datasource=eqiad+prometheus/ops [00:45:36] PROBLEM - rsyslog TLS listener on port 6514 on centrallog1001 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/Logs [00:47:30] RECOVERY - rsyslog TLS listener on port 6514 on centrallog1001 is OK: SSL OK - Certificate centrallog1001.eqiad.wmnet valid until 2024-06-25 15:42:33 +0000 (expires in 1446 days) https://wikitech.wikimedia.org/wiki/Logs [00:47:38] !log truncated labswiki.interwiki table (outdated and unnecessary) [00:47:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:54:36] PROBLEM - Too many messages in kafka logging-eqiad on icinga1001 is CRITICAL: cluster=misc exported_cluster=logging-eqiad group={logstash,logstash-codfw,logstash7-codfw,logstash7-eqiad} instance=kafkamon1001 job=burrow partition={0,1,2,3,4,5} site=eqiad
topic=rsyslog-notice https://wikitech.wikimedia.org/wiki/Logstash%23Kafka_consumer_lag https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?from=now-3h&to=now&orgId=1&var- [00:54:37] prometheus/ops&var-cluster=logging-eqiad&var-topic=All&var-consumer_group=All [00:56:56] (PS1) Bstorm: kubeadm: If using a stacked control plane, expose etcd metrics [puppet] - https://gerrit.wikimedia.org/r/610980 (https://phabricator.wikimedia.org/T256361) [00:57:34] RECOVERY - rsyslog in eqiad is failing to deliver messages on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Rsyslog https://grafana.wikimedia.org/d/000000596/rsyslog?var-datasource=eqiad+prometheus/ops [00:57:46] RECOVERY - rsyslog in codfw is failing to deliver messages on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Rsyslog https://grafana.wikimedia.org/d/000000596/rsyslog?var-datasource=codfw+prometheus/ops [00:59:37] (CR) Bstorm: [C: -1] "Just realized this config cannot be generally applied because it will only include the IP of the bootstrapping node." [puppet] - https://gerrit.wikimedia.org/r/610980 (https://phabricator.wikimedia.org/T256361) (owner: Bstorm) [01:05:15] (PS2) Bstorm: kubeadm: If using a stacked control plane, expose etcd metrics [puppet] - https://gerrit.wikimedia.org/r/610980 (https://phabricator.wikimedia.org/T256361) [01:08:13] (CR) Bstorm: "Ok that corrects it." [puppet] - https://gerrit.wikimedia.org/r/610980 (https://phabricator.wikimedia.org/T256361) (owner: Bstorm) [01:09:28] Operations, Fundraising-Backlog, MediaWiki-extensions-CentralNotice, Traffic, and 2 others: Implement redirect for hide banner cookie issue - https://phabricator.wikimedia.org/T251780 (AndyRussG) > He suggested we put the thank you page on a *.wikipedia.org domain rather than *.wikimedia.org, so...
[01:11:52] PROBLEM - rsyslog in eqiad is failing to deliver messages on icinga1001 is CRITICAL: action={fwd_centrallog1001.eqiad.wmnet:6514,fwd_centrallog2001.codfw.wmnet:6514} https://wikitech.wikimedia.org/wiki/Rsyslog https://grafana.wikimedia.org/d/000000596/rsyslog?var-datasource=eqiad+prometheus/ops [01:12:04] PROBLEM - rsyslog in codfw is failing to deliver messages on icinga1001 is CRITICAL: action={fwd_centrallog1001.eqiad.wmnet:6514,fwd_centrallog2001.codfw.wmnet:6514} https://wikitech.wikimedia.org/wiki/Rsyslog https://grafana.wikimedia.org/d/000000596/rsyslog?var-datasource=codfw+prometheus/ops [01:21:13] Operations, ops-codfw, netops: (Need by: End of July-2020 ) codfw:rack/setup/new management switches - https://phabricator.wikimedia.org/T253154 (Papaul) [01:34:31] Operations, Fundraising-Backlog, MediaWiki-extensions-CentralNotice, Traffic, and 2 others: Implement redirect for hide banner cookie issue - https://phabricator.wikimedia.org/T251780 (AndyRussG) Just another idea for the shortest-term solution: in the banner, we could try to determine the user's... [01:34:48] RECOVERY - rsyslog in eqiad is failing to deliver messages on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Rsyslog https://grafana.wikimedia.org/d/000000596/rsyslog?var-datasource=eqiad+prometheus/ops [01:34:54] RECOVERY - rsyslog in codfw is failing to deliver messages on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Rsyslog https://grafana.wikimedia.org/d/000000596/rsyslog?var-datasource=codfw+prometheus/ops [01:41:20] RECOVERY - Too many messages in kafka logging-eqiad on icinga1001 is OK: All metrics within thresholds.
https://wikitech.wikimedia.org/wiki/Logstash%23Kafka_consumer_lag https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?from=now-3h&to=now&orgId=1&var-datasource=eqiad+prometheus/ops&var-cluster=logging-eqiad&var-topic=All&var-consumer_group=All [01:43:34] (PS1) Dzahn: admins: add Conny Kawohl to ldap_only admins (wmde/nda) [puppet] - https://gerrit.wikimedia.org/r/611002 (https://phabricator.wikimedia.org/T257038) [01:44:42] !log LDAP - adding coka to wmde and nda (T257038) [01:44:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:44:47] T257038: Add Conny Kawohl to the ldap/wmde and ldap/nda group - https://phabricator.wikimedia.org/T257038 [01:46:07] (CR) Dzahn: [C: +2] admins: add Conny Kawohl to ldap_only admins (wmde/nda) [puppet] - https://gerrit.wikimedia.org/r/611002 (https://phabricator.wikimedia.org/T257038) (owner: Dzahn) [01:47:19] Operations, LDAP-Access-Requests, Patch-For-Review: Add Conny Kawohl to the ldap/wmde and ldap/nda group - https://phabricator.wikimedia.org/T257038 (Dzahn) [01:47:45] Operations, LDAP-Access-Requests, Patch-For-Review: Add Conny Kawohl to the ldap/wmde and ldap/nda group - https://phabricator.wikimedia.org/T257038 (Dzahn) Open→Resolved @conny-kawohl_WMDE This is done, you have been added to the wmde and nda LDAP groups.
[01:49:10] PROBLEM - rsyslog in eqiad is failing to deliver messages on icinga1001 is CRITICAL: action={fwd_centrallog1001.eqiad.wmnet:6514,fwd_centrallog2001.codfw.wmnet:6514} https://wikitech.wikimedia.org/wiki/Rsyslog https://grafana.wikimedia.org/d/000000596/rsyslog?var-datasource=eqiad+prometheus/ops [01:49:14] PROBLEM - rsyslog in codfw is failing to deliver messages on icinga1001 is CRITICAL: action={fwd_centrallog1001.eqiad.wmnet:6514,fwd_centrallog2001.codfw.wmnet:6514} https://wikitech.wikimedia.org/wiki/Rsyslog https://grafana.wikimedia.org/d/000000596/rsyslog?var-datasource=codfw+prometheus/ops [02:00:12] PROBLEM - rsyslog TLS listener on port 6514 on centrallog1001 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection reset by peer https://wikitech.wikimedia.org/wiki/Logs [02:00:37] PROBLEM - rsyslog in codfw is failing to deliver messages on icinga1001 is CRITICAL: action={fwd_centrallog1001.eqiad.wmnet:6514,fwd_centrallog2001.codfw.wmnet:6514} https://wikitech.wikimedia.org/wiki/Rsyslog https://grafana.wikimedia.org/d/000000596/rsyslog?var-datasource=codfw+prometheus/ops [02:02:34] RECOVERY - rsyslog TLS listener on port 6514 on centrallog1001 is OK: SSL OK - Certificate centrallog1001.eqiad.wmnet valid until 2024-06-25 15:42:33 +0000 (expires in 1446 days) https://wikitech.wikimedia.org/wiki/Logs [02:26:17] RECOVERY - rsyslog in eqiad is failing to deliver messages on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Rsyslog https://grafana.wikimedia.org/d/000000596/rsyslog?var-datasource=eqiad+prometheus/ops [02:29:14] RECOVERY - rsyslog in codfw is failing to deliver messages on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Rsyslog https://grafana.wikimedia.org/d/000000596/rsyslog?var-datasource=codfw+prometheus/ops [02:43:32] PROBLEM - rsyslog in eqiad is failing to deliver messages on icinga1001 is CRITICAL: action={fwd_centrallog1001.eqiad.wmnet:6514,fwd_centrallog2001.codfw.wmnet:6514} https://wikitech.wikimedia.org/wiki/Rsyslog https://grafana.wikimedia.org/d/000000596/rsyslog?var-datasource=eqiad+prometheus/ops [02:43:36] PROBLEM - rsyslog in codfw is failing to deliver messages on icinga1001 is CRITICAL: action={fwd_centrallog1001.eqiad.wmnet:6514,fwd_centrallog2001.codfw.wmnet:6514} https://wikitech.wikimedia.org/wiki/Rsyslog https://grafana.wikimedia.org/d/000000596/rsyslog?var-datasource=codfw+prometheus/ops [03:20:22] PROBLEM - Too many messages in kafka logging-eqiad on icinga1001 is CRITICAL: cluster=misc exported_cluster=logging-eqiad group={logstash,logstash-codfw} instance=kafkamon1001 job=burrow partition={0,1,2,3,4,5} site=eqiad topic=rsyslog-notice https://wikitech.wikimedia.org/wiki/Logstash%23Kafka_consumer_lag https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?from=now-3h&to=now&orgId=1&var-datasource=eqiad+prometheus/ops& [03:20:22] ng-eqiad&var-topic=All&var-consumer_group=All [03:25:12] PROBLEM - rsyslog TLS listener on port 6514 on centrallog2001 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection reset by peer https://wikitech.wikimedia.org/wiki/Logs [03:27:10] RECOVERY - rsyslog TLS listener on port 6514 on centrallog2001 is OK: SSL OK - Certificate centrallog2001.codfw.wmnet valid until 2024-11-16 16:04:24 +0000 (expires in 1590 days) https://wikitech.wikimedia.org/wiki/Logs [03:35:06] RECOVERY - rsyslog in eqiad is failing to deliver messages on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Rsyslog https://grafana.wikimedia.org/d/000000596/rsyslog?var-datasource=eqiad+prometheus/ops [03:35:08] RECOVERY - rsyslog in codfw is failing to deliver messages on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Rsyslog https://grafana.wikimedia.org/d/000000596/rsyslog?var-datasource=codfw+prometheus/ops [03:42:44] RECOVERY - Too many messages in kafka logging-eqiad on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Logstash%23Kafka_consumer_lag https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?from=now-3h&to=now&orgId=1&var-datasource=eqiad+prometheus/ops&var-cluster=logging-eqiad&var-topic=All&var-consumer_group=All [03:49:20] PROBLEM - rsyslog in eqiad is failing to deliver messages on icinga1001 is CRITICAL: action={fwd_centrallog1001.eqiad.wmnet:6514,fwd_centrallog2001.codfw.wmnet:6514} https://wikitech.wikimedia.org/wiki/Rsyslog https://grafana.wikimedia.org/d/000000596/rsyslog?var-datasource=eqiad+prometheus/ops [03:49:24] PROBLEM - rsyslog in codfw is failing to deliver messages on icinga1001 is CRITICAL: action={fwd_centrallog1001.eqiad.wmnet:6514,fwd_centrallog2001.codfw.wmnet:6514} https://wikitech.wikimedia.org/wiki/Rsyslog https://grafana.wikimedia.org/d/000000596/rsyslog?var-datasource=codfw+prometheus/ops [04:18:22] PROBLEM - Too many messages in kafka logging-eqiad on icinga1001 is CRITICAL: cluster=misc exported_cluster=logging-eqiad group={logstash,logstash-codfw,logstash7-codfw,logstash7-eqiad} instance=kafkamon1001 job=burrow partition={0,1,2,3,4,5} site=eqiad topic=rsyslog-notice https://wikitech.wikimedia.org/wiki/Logstash%23Kafka_consumer_lag https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?from=now-3h&to=now&orgId=1&var- [04:18:22] prometheus/ops&var-cluster=logging-eqiad&var-topic=All&var-consumer_group=All [04:20:12] PROBLEM - rsyslog TLS listener on port 6514 on centrallog2001 is CRITICAL: SSL CRITICAL - 
failed to connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/Logs [04:22:56] RECOVERY - rsyslog TLS listener on port 6514 on centrallog2001 is OK: SSL OK - Certificate centrallog2001.codfw.wmnet valid until 2024-11-16 16:04:24 +0000 (expires in 1590 days) https://wikitech.wikimedia.org/wiki/Logs [04:35:12] PROBLEM - rsyslog TLS listener on port 6514 on centrallog1001 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection reset by peer https://wikitech.wikimedia.org/wiki/Logs [04:36:04] RECOVERY - rsyslog TLS listener on port 6514 on centrallog1001 is OK: SSL OK - Certificate centrallog1001.eqiad.wmnet valid until 2024-06-25 15:42:33 +0000 (expires in 1446 days) https://wikitech.wikimedia.org/wiki/Logs [04:40:12] PROBLEM - rsyslog TLS listener on port 6514 on centrallog2001 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/Logs [04:42:52] (PS1) Marostegui: dbproxy1017: Enable notifications [puppet] - https://gerrit.wikimedia.org/r/611067 (https://phabricator.wikimedia.org/T255408) [04:44:29] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1134', diff saved to https://phabricator.wikimedia.org/P11839 and previous config saved to /var/cache/conftool/dbconfig/20200710-044428-marostegui.json [04:44:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:45:10] (CR) Marostegui: [C: +2] dbproxy1017: Enable notifications [puppet] - https://gerrit.wikimedia.org/r/611067 (https://phabricator.wikimedia.org/T255408) (owner: Marostegui) [04:45:16] RECOVERY - rsyslog TLS listener on port 6514 on centrallog2001 is OK: SSL OK - Certificate centrallog2001.codfw.wmnet valid until 2024-11-16 16:04:24 +0000 (expires in 1590 days) https://wikitech.wikimedia.org/wiki/Logs [04:52:10] Operations, ops-eqiad, DBA: Degraded RAID on db1131 - https://phabricator.wikimedia.org/T257253 (Marostegui) I will failover db1131
to db1093 on Tuesday 14th at 05:00 AM UTC [04:55:12] PROBLEM - rsyslog TLS listener on port 6514 on centrallog2001 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/Logs [04:57:57] (PS1) Marostegui: db1107: Enable notifications [puppet] - https://gerrit.wikimedia.org/r/611075 (https://phabricator.wikimedia.org/T254462) [04:58:38] (CR) Marostegui: [C: +2] db1107: Enable notifications [puppet] - https://gerrit.wikimedia.org/r/611075 (https://phabricator.wikimedia.org/T254462) (owner: Marostegui) [05:05:14] RECOVERY - rsyslog TLS listener on port 6514 on centrallog2001 is OK: SSL OK - Certificate centrallog2001.codfw.wmnet valid until 2024-11-16 16:04:24 +0000 (expires in 1590 days) https://wikitech.wikimedia.org/wiki/Logs [05:10:44] PROBLEM - Logstash rate of ingestion percent change compared to yesterday on icinga1001 is CRITICAL: 252.1 ge 210 https://phabricator.wikimedia.org/T202307 https://grafana.wikimedia.org/dashboard/db/logstash?orgId=1&panelId=2&fullscreen [05:20:12] PROBLEM - rsyslog TLS listener on port 6514 on centrallog2001 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection reset by peer https://wikitech.wikimedia.org/wiki/Logs [05:21:00] RECOVERY - rsyslog TLS listener on port 6514 on centrallog2001 is OK: SSL OK - Certificate centrallog2001.codfw.wmnet valid until 2024-11-16 16:04:24 +0000 (expires in 1590 days) https://wikitech.wikimedia.org/wiki/Logs [05:33:04] RECOVERY - rsyslog in codfw is failing to deliver messages on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Rsyslog https://grafana.wikimedia.org/d/000000596/rsyslog?var-datasource=codfw+prometheus/ops [05:35:20] RECOVERY - rsyslog in eqiad is failing to deliver messages on icinga1001 is OK: All metrics within thresholds.
https://wikitech.wikimedia.org/wiki/Rsyslog https://grafana.wikimedia.org/d/000000596/rsyslog?var-datasource=eqiad+prometheus/ops [06:00:13] RECOVERY - Too many messages in kafka logging-eqiad on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Logstash%23Kafka_consumer_lag https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?from=now-3h&to=now&orgId=1&var-datasource=eqiad+prometheus/ops&var-cluster=logging-eqiad&var-topic=All&var-consumer_group=All [06:19:35] PROBLEM - OSPF status on cr3-knams is CRITICAL: OSPFv2: 3/4 UP : OSPFv3: 3/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [06:19:57] PROBLEM - OSPF status on cr2-eqdfw is CRITICAL: OSPFv2: 4/5 UP : OSPFv3: 4/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [06:20:39] PROBLEM - BFD status on cr3-knams is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [06:20:44] should be GTT maintenance --^ [06:24:25] PROBLEM - OSPF status on cr1-eqiad is CRITICAL: OSPFv2: 4/5 UP : OSPFv3: 4/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [06:24:27] RECOVERY - BFD status on cr3-knams is OK: OK: UP: 8 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [06:25:15] RECOVERY - OSPF status on cr3-knams is OK: OSPFv2: 4/4 UP : OSPFv3: 4/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [06:25:37] RECOVERY - OSPF status on cr2-eqdfw is OK: OSPFv2: 5/5 UP : OSPFv3: 5/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [06:26:17] RECOVERY - OSPF status on cr1-eqiad is OK: OSPFv2: 5/5 UP : OSPFv3: 5/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [06:29:47] PROBLEM - ores on ores2003 is CRITICAL: connect to address 10.192.16.63 and port 8081: Connection refused https://wikitech.wikimedia.org/wiki/Services/Monitoring/ores [06:32:05] Operations,
Wikimedia-Mailing-lists, Cloud-VPS (Project-requests), cloud-services-team (Kanban): Request creation of mailman VPS project - https://phabricator.wikimedia.org/T257270 (Ladsgroup) >>! In T257270#6293740, @Andrew wrote: > approved! Bryan will take care of this shortly Thanks! I req... [06:33:35] Operations, Fundraising-Backlog, MediaWiki-extensions-CentralNotice, Traffic, and 2 others: Implement redirect for hide banner cookie issue - https://phabricator.wikimedia.org/T251780 (AndyRussG) @Ejegg what about this option? We set a special cookie on donate.wikimedia.org when people go to the... [06:34:21] (PS1) Marostegui: db1124: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/611165 (https://phabricator.wikimedia.org/T254462) [06:35:05] !log Compress InnoDB on db1124:3311 (Sanitarium - lag will appear on s1 on labsdb) - T254462 [06:35:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:35:11] T254462: Compress enwiki InnoDB tables - https://phabricator.wikimedia.org/T254462 [06:35:20] (CR) Marostegui: [C: +2] db1124: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/611165 (https://phabricator.wikimedia.org/T254462) (owner: Marostegui) [06:37:47] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repool db1134', diff saved to https://phabricator.wikimedia.org/P11840 and previous config saved to /var/cache/conftool/dbconfig/20200710-063746-marostegui.json [06:37:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:38:18] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1099:3311', diff saved to https://phabricator.wikimedia.org/P11841 and previous config saved to /var/cache/conftool/dbconfig/20200710-063818-marostegui.json [06:38:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:40:55] RECOVERY - ores on ores2003 is OK: HTTP OK: HTTP/1.0 200 OK - 6397 bytes in 0.089 second response time
https://wikitech.wikimedia.org/wiki/Services/Monitoring/ores [06:45:37] (CR) Ayounsi: [C: +2] Reports, add new cloudsw role [software/netbox-extras] - https://gerrit.wikimedia.org/r/610820 (https://phabricator.wikimedia.org/T251632) (owner: Ayounsi) [06:50:32] (PS1) Elukey: Add custom ferm srange to Kafka Jumbo brokers [puppet] - https://gerrit.wikimedia.org/r/611168 [06:51:38] (CR) jerkins-bot: [V: -1] Add custom ferm srange to Kafka Jumbo brokers [puppet] - https://gerrit.wikimedia.org/r/611168 (owner: Elukey) [06:51:59] PROBLEM - Router interfaces on cr3-ulsfo is CRITICAL: CRITICAL: host 198.35.26.192, interfaces up: 76, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [06:52:45] PROBLEM - Router interfaces on cr2-eqord is CRITICAL: CRITICAL: host 208.80.154.198, interfaces up: 52, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [06:53:32] (PS2) Elukey: Add custom ferm srange to Kafka Jumbo brokers [puppet] - https://gerrit.wikimedia.org/r/611168 [06:55:47] RECOVERY - Router interfaces on cr3-ulsfo is OK: OK: host 198.35.26.192, interfaces up: 78, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [06:56:31] RECOVERY - Router interfaces on cr2-eqord is OK: OK: host 208.80.154.198, interfaces up: 54, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [06:57:51] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repool db1099:3311', diff saved to https://phabricator.wikimedia.org/P11843 and previous config saved to /var/cache/conftool/dbconfig/20200710-065751-marostegui.json [06:57:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:00:04] Deploy window No deploys all day!
See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200710T0700) [07:01:54] (CR) Elukey: [C: -1] "From a first pcc it looks good: https://puppet-compiler.wmflabs.org/compiler1002/23808/" [puppet] - https://gerrit.wikimedia.org/r/611168 (owner: Elukey) [07:05:07] PROBLEM - Router interfaces on cr3-ulsfo is CRITICAL: CRITICAL: host 198.35.26.192, interfaces up: 76, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [07:05:51] PROBLEM - Router interfaces on cr2-eqord is CRITICAL: CRITICAL: host 208.80.154.198, interfaces up: 52, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [07:08:28] (CR) Alexandros Kosiaris: [C: +2] "thanks!" [deployment-charts] - https://gerrit.wikimedia.org/r/610855 (https://phabricator.wikimedia.org/T225680) (owner: Alexandros Kosiaris) [07:09:28] (Merged) jenkins-bot: proton: Amend prometheus-statsd config [deployment-charts] - https://gerrit.wikimedia.org/r/610855 (https://phabricator.wikimedia.org/T225680) (owner: Alexandros Kosiaris) [07:11:33] RECOVERY - Logstash rate of ingestion percent change compared to yesterday on icinga1001 is OK: (C)210 ge (W)150 ge 99.97 https://phabricator.wikimedia.org/T202307 https://grafana.wikimedia.org/dashboard/db/logstash?orgId=1&panelId=2&fullscreen [07:13:50] !log akosiaris@deploy1001 helmfile [STAGING] Ran 'sync' command on namespace 'proton' for release 'production' . [07:13:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:14:25] !log akosiaris@deploy1001 helmfile [EQIAD] Ran 'sync' command on namespace 'proton' for release 'production' .
[07:14:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:15:33] !log akosiaris@deploy1001 helmfile [CODFW] Ran 'sync' command on namespace 'proton' for release 'production' . [07:15:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:27:03] Operations, Release Pipeline, Release-Engineering-Team-TODO, Epic, and 2 others: Migrate production services to kubernetes using the pipeline - https://phabricator.wikimedia.org/T198901 (akosiaris) [07:28:03] Operations, Prod-Kubernetes, serviceops, Kubernetes, Patch-For-Review: Move proton to use TLS only - https://phabricator.wikimedia.org/T255877 (akosiaris) [07:29:05] (CR) Muehlenhoff: [C: +2] lxc: Remove jessie compat code [puppet] - https://gerrit.wikimedia.org/r/610707 (owner: Muehlenhoff) [07:30:48] (CR) Hashar: "I did a quick and dirty audit for the CI image which is captured at T257553. Basically for all containers used by the Jenkins job, I ran:" [puppet] - https://gerrit.wikimedia.org/r/610050 (https://phabricator.wikimedia.org/T256877) (owner: Muehlenhoff) [07:31:05] RECOVERY - BGP status on cr2-esams is OK: BGP OK - up: 424, down: 2, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [07:32:09] !log installing e2fsprogs security updates on jessie (stretch/buster already fixed) [07:32:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:41:12] !log jbond@deploy1001 Started deploy [librenms/librenms@0a88d64]: redeplopy to [try and] fix php errors [07:41:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:41:17] !log jbond@deploy1001 Finished deploy [librenms/librenms@0a88d64]: redeplopy to [try and] fix php errors (duration: 00m 05s) [07:41:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:41:39] (PS3) Elukey: Add custom ferm srange to Kafka Jumbo brokers [puppet] -
https://gerrit.wikimedia.org/r/611168
[07:43:27] !log kormat@cumin1001 dbctl commit (dc=all): 'Add weight to es1020, reduce weight on es1021 T257284', diff saved to https://phabricator.wikimedia.org/P11844 and previous config saved to /var/cache/conftool/dbconfig/20200710-074326-kormat.json
[07:43:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:43:32] T257284: Upgrade es4 to debian buster + mariadb 10.4 - https://phabricator.wikimedia.org/T257284
[07:43:55] ACKNOWLEDGEMENT - Stale file for node-exporter textfile in eqiad on icinga1001 is CRITICAL: cluster=analytics file=nic_firmware.prom instance=analytics1030 job=node site=eqiad Ema elukey running some tests on the hadoop test cluster https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile
[07:44:14] !log reimaging es1021 to buster T257284
[07:44:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:45:43] RECOVERY - Stale file for node-exporter textfile in eqiad on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile
[07:45:45] (PS1) Kormat: es1021: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/611183 (https://phabricator.wikimedia.org/T257284)
[07:47:19] PROBLEM - Host ganeti1007 is DOWN: PING CRITICAL - Packet loss = 100%
[07:47:37] PROBLEM - Host etcd1003 is DOWN: PING CRITICAL - Packet loss = 100%
[07:48:09] are we doing maintenance on --^ ?
[07:49:57] Operations, serviceops: docker-reporter-releng-images failed on deneb - https://phabricator.wikimedia.org/T251918 (ema) The service failed 3 days ago due to another image this time: ` root@deneb:~# journalctl -u docker-reporter-releng-images.service | grep FAIL Jul 06 16:54:38 deneb docker-report-releng...
[07:50:40] ACKNOWLEDGEMENT - Check systemd state on deneb is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. Ema This is a known issue: T251918 https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[07:51:53] (CR) Marostegui: [C: +1] es1021: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/611183 (https://phabricator.wikimedia.org/T257284) (owner: Kormat)
[07:52:13] yeah, that's expired downtime for a RAM expansion
[07:52:24] moritzm: have you tried downloading more ram?
[07:52:53] kormat: we'd need to buy Ganeti Enterprise for that...
[07:53:00] haha
[07:53:58] ahhaah
[07:54:16] akosiaris: https://phabricator.wikimedia.org/T244530 doesn't mention if John added the RAM yesterday, do you know more? should the downtime be extended until the next week
[07:54:31] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repool db1079', diff saved to https://phabricator.wikimedia.org/P11845 and previous config saved to /var/cache/conftool/dbconfig/20200710-075431-marostegui.json
[07:54:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:55:01] !log marostegui@cumin1001 dbctl commit (dc=all): 'Slowly repool db1107', diff saved to https://phabricator.wikimedia.org/P11846 and previous config saved to /var/cache/conftool/dbconfig/20200710-075500-marostegui.json
[07:55:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:55:20] Operations, serviceops: docker-reporter-releng-images failed on deneb - https://phabricator.wikimedia.org/T251918 (hashar) That `releng/ci-common` image is a scratch image containing scripts shared by our base images ci-jessie, ci-stretch, ci-buster. It does not have any Debian OS layer, thus if the repo...
[07:56:09] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1119', diff saved to https://phabricator.wikimedia.org/P11847 and previous config saved to /var/cache/conftool/dbconfig/20200710-075608-marostegui.json
[07:56:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:00:45] !log installing cron security updates on jessie (stretch/buster already fixed)
[08:00:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:01:33] !log kormat@cumin1001 dbctl commit (dc=all): 'Reset es2020/es2021 to correct weights after master switch T257284', diff saved to https://phabricator.wikimedia.org/P11848 and previous config saved to /var/cache/conftool/dbconfig/20200710-080133-kormat.json
[08:01:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:01:38] T257284: Upgrade es4 to debian buster + mariadb 10.4 - https://phabricator.wikimedia.org/T257284
[08:02:28] (CR) Kormat: [C: +2] es1021: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/611183 (https://phabricator.wikimedia.org/T257284) (owner: Kormat)
[08:04:05] Operations, serviceops: docker-reporter-releng-images failed on deneb - https://phabricator.wikimedia.org/T251918 (JMeybohm) >>! In T251918#6296048, @hashar wrote: > That `releng/ci-common` image is a scratch image containing scripts shared by our base images ci-jessie, ci-stretch, ci-buster. It does not...
[08:06:46] (PS1) Kormat: install_server: Switch es1021 to buster [puppet] - https://gerrit.wikimedia.org/r/611193 (https://phabricator.wikimedia.org/T257284)
[08:07:01] Operations, serviceops: docker-reporter-releng-images failed on deneb - https://phabricator.wikimedia.org/T251918 (MoritzMuehlenhoff) >>! In T251918#6296059, @JMeybohm wrote: > Maybe we can just skip images that are not debian based? Sounds good, we could simply test for the presence of /etc/debian_vers...
[08:08:43] !log kormat@cumin1001 dbctl commit (dc=all): 'Depool es1021 for reimaging T257284', diff saved to https://phabricator.wikimedia.org/P11849 and previous config saved to /var/cache/conftool/dbconfig/20200710-080843-kormat.json
[08:08:50] Operations, serviceops: docker-reporter-releng-images failed on deneb - https://phabricator.wikimedia.org/T251918 (JMeybohm) a:JMeybohm >>! In T251918#6296062, @MoritzMuehlenhoff wrote: > > Sounds good, we could simply test for the presence of /etc/debian_version which is owned by the base-files pack...
[08:08:54] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repool db1119', diff saved to https://phabricator.wikimedia.org/P11850 and previous config saved to /var/cache/conftool/dbconfig/20200710-080854-marostegui.json
[08:09:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:09:05] T257284: Upgrade es4 to debian buster + mariadb 10.4 - https://phabricator.wikimedia.org/T257284
[08:09:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:09:13] !log marostegui@cumin1001 dbctl commit (dc=all): 'Slowly repool db1107', diff saved to https://phabricator.wikimedia.org/P11851 and previous config saved to /var/cache/conftool/dbconfig/20200710-080912-marostegui.json
[08:09:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:12:03] (CR) Marostegui: [C: +1] install_server: Switch es1021 to buster [puppet] - https://gerrit.wikimedia.org/r/611193 (https://phabricator.wikimedia.org/T257284) (owner: Kormat)
[08:13:18] (CR) Kormat: [C: +2] install_server: Switch es1021 to buster [puppet] - https://gerrit.wikimedia.org/r/611193 (https://phabricator.wikimedia.org/T257284) (owner: Kormat)
[08:17:15] (PS1) DCausse: [wcqs] update logo URL [puppet] - https://gerrit.wikimedia.org/r/611196 (https://phabricator.wikimedia.org/T251514)
[08:17:50] (PS1) Effie Mouzeli: hieradata: improve description of ncredir [puppet] - 
https://gerrit.wikimedia.org/r/611197
[08:20:02] moritzm: I have no more information after powering off the host. I guess we can keep it downtime for some more days and ping John on the task
[08:20:16] PROBLEM - DPKG on mc2029 is CRITICAL: DPKG CRITICAL dpkg reports broken packages https://wikitech.wikimedia.org/wiki/Monitoring/dpkg
[08:21:21] ^ fixing mc2029
[08:22:34] !log jmm@cumin2001 START - Cookbook sre.hosts.downtime
[08:22:35] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
[08:22:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:22:42] !log jmm@cumin2001 START - Cookbook sre.hosts.downtime
[08:22:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:22:43] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
[08:22:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:22:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:23:18] Operations, ops-eqiad: upgrade memory in ganeti100[5-8].eqiad.wmnet - https://phabricator.wikimedia.org/T244530 (MoritzMuehlenhoff) >>! In T244530#6287429, @Jclark-ctr wrote: > @akosiaris I will be on site tomorrow also if host is available to do 1 day earlier Did you plug in the new DIMMs yesterday?
[08:23:29] !log marostegui@cumin1001 dbctl commit (dc=all): 'Slowly repool db1107', diff saved to https://phabricator.wikimedia.org/P11852 and previous config saved to /var/cache/conftool/dbconfig/20200710-082329-marostegui.json
[08:23:30] akosiaris: followed up on task and extended downtime until Tuesday
[08:23:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:23:47] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1106', diff saved to https://phabricator.wikimedia.org/P11853 and previous config saved to /var/cache/conftool/dbconfig/20200710-082346-marostegui.json
[08:23:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:26:47] moritzm: I was about to do that, thanks!
[08:27:08] RECOVERY - Router interfaces on cr2-eqord is OK: OK: host 208.80.154.198, interfaces up: 54, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[08:33:37] (CR) Jbond: [C: +2] SSHFP: add a text file with the SSHFB of all hosts [puppet] - https://gerrit.wikimedia.org/r/609796 (https://phabricator.wikimedia.org/T257219) (owner: Jbond)
[08:39:50] RECOVERY - Router interfaces on cr3-ulsfo is OK: OK: host 198.35.26.192, interfaces up: 78, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[08:42:14] (CR) Filippo Giunchedi: "LGTM (haven't tried building the package myself)" (1 comment) [debs/grafana-loki] (debian/sid) - https://gerrit.wikimedia.org/r/610864 (owner: Cwhite)
[08:44:41] Operations, SRE-Access-Requests: Requesting access to analytics-privatedata-users for mooeypoo - https://phabricator.wikimedia.org/T257502 (jcrespo) a:Nuria→jcrespo
[08:46:08] (PS5) Jcrespo: admin: Add Jgiannelos production access [puppet] - https://gerrit.wikimedia.org/r/609752 (https://phabricator.wikimedia.org/T257187)
[08:47:29] Operations, serviceops, 
Patch-For-Review, User-Elukey: Reimage one memcached shard to Buster - https://phabricator.wikimedia.org/T252391 (elukey) @Krinkle @aaron if you have time, let's follow up on the question that I asked about what happens if a Redis shard disappears. It would be really nice...
[08:48:03] (CR) Filippo Giunchedi: "> Patch Set 1: Code-Review+1" [puppet] - https://gerrit.wikimedia.org/r/610832 (https://phabricator.wikimedia.org/T247968) (owner: Filippo Giunchedi)
[08:48:23] (PS1) Ema: ATS: add log_set_cookie_response(), reduce noise, log Host [puppet] - https://gerrit.wikimedia.org/r/611227 (https://phabricator.wikimedia.org/T256395)
[08:48:30] (CR) Jcrespo: [C: +2] admin: Add Jgiannelos production access [puppet] - https://gerrit.wikimedia.org/r/609752 (https://phabricator.wikimedia.org/T257187) (owner: Jcrespo)
[08:49:49] (CR) Alexandros Kosiaris: [C: -1] Kask: Use Releng Cassandra Image (3 comments) [deployment-charts] - https://gerrit.wikimedia.org/r/609894 (https://phabricator.wikimedia.org/T224041) (owner: Jeena Huneidi)
[08:50:18] RECOVERY - DPKG on mc2029 is OK: All packages OK https://wikitech.wikimedia.org/wiki/Monitoring/dpkg
[08:50:41] !log marostegui@cumin1001 dbctl commit (dc=all): 'Fully repool db1107', diff saved to https://phabricator.wikimedia.org/P11855 and previous config saved to /var/cache/conftool/dbconfig/20200710-085040-marostegui.json
[08:50:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:51:01] (CR) Ema: [C: +2] ATS: add log_set_cookie_response(), reduce noise, log Host [puppet] - https://gerrit.wikimedia.org/r/611227 (https://phabricator.wikimedia.org/T256395) (owner: Ema)
[08:51:12] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repool db1106', diff saved to https://phabricator.wikimedia.org/P11856 and previous config saved to /var/cache/conftool/dbconfig/20200710-085112-marostegui.json
[08:51:16] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:51:58] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1110', diff saved to https://phabricator.wikimedia.org/P11857 and previous config saved to /var/cache/conftool/dbconfig/20200710-085157-marostegui.json
[08:52:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:53:33] Operations, SRE-Access-Requests, Patch-For-Review: Requesting access to production infrastructure services for jgiannelos - https://phabricator.wikimedia.org/T257187 (jcrespo) Open→Resolved a:jcrespo Access request has been merged: ` Notice: /Stage[main]/Admin/Admin::Hashuser[jgiannelos]/...
[08:57:42] (PS1) Jcrespo: admin: Add Mooeypoo (wikigit) to the analytics-privatedata-users group [puppet] - https://gerrit.wikimedia.org/r/611228 (https://phabricator.wikimedia.org/T257502)
[09:01:24] Operations, CAS-SSO, Patch-For-Review, User-jbond: mod_auth_cas segfaulting on netmon - https://phabricator.wikimedia.org/T257587 (MoritzMuehlenhoff) A few initial findings, still investigating further: This all boils down to curl and OpenSSL: We're not seeing this issue on jessie (which only ha...
[09:02:18] !log kormat@cumin1001 START - Cookbook sre.hosts.downtime
[09:02:36] Operations, SRE-Access-Requests, Patch-For-Review: Requesting access to analytics-privatedata-users for mooeypoo - https://phabricator.wikimedia.org/T257502 (jcrespo) Patch is prepared, will be deployed on Monday following procedures. Kerberos access will also be granted then.
[09:02:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:04:54] !log kormat@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
[09:04:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:05:51] Operations, LDAP-Access-Requests, WMF-Legal: Add Guergana Tzatchkova to the ldap/wmde and ldap/nda group - https://phabricator.wikimedia.org/T256201 (jcrespo) a:jcrespo Thank you very much for the heads up! Will proceed now with the group granting.
[09:06:05] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[09:07:24] Operations, LDAP-Access-Requests, WMF-Legal: Add Guergana Tzatchkova to the ldap/wmde and ldap/nda group - https://phabricator.wikimedia.org/T256201 (jcrespo)
[09:08:15] (CR) Hashar: [C: +1] "So the status is:" [puppet] - https://gerrit.wikimedia.org/r/608296 (https://phabricator.wikimedia.org/T256575) (owner: Hashar)
[09:09:15] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[09:10:58] (CR) Hashar: [C: +1] "The regex example from above: https://regex101.com/r/L5GNMY/1" [puppet] - https://gerrit.wikimedia.org/r/608296 (https://phabricator.wikimedia.org/T256575) (owner: Hashar)
[09:11:21] Operations, LDAP-Access-Requests: Add Conny Kawohl to the ldap/wmde and ldap/nda group - https://phabricator.wikimedia.org/T257038 (jcrespo) Thanks Dzahn for taking over, as that sped up the group addition!
[09:15:19] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[09:18:30] (CR) Arturo Borrero Gonzalez: [C: +1] kubeadm: If using a stacked control plane, expose etcd metrics [puppet] - https://gerrit.wikimedia.org/r/610980 (https://phabricator.wikimedia.org/T256361) (owner: Bstorm)
[09:22:06] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[09:23:36] Operations, LDAP-Access-Requests, WMF-Legal: Add Guergana Tzatchkova to the ldap/wmde and ldap/nda group - https://phabricator.wikimedia.org/T256201 (jcrespo) For future reference, UID for cn=Guergana Tzatchkova is gtzatchkova. @guergana.tzatchkova This is not required for access request, but consid...
[09:29:56] Operations, Wikimedia-Mailing-lists, Cloud-VPS (Project-requests), cloud-services-team (Kanban): Request creation of mailman VPS project - https://phabricator.wikimedia.org/T257270 (Ladsgroup) >>! In T257270#6296182, @Stashbot wrote: > {nav icon=file, name=Mentioned in SAL (#wikimedia-cloud), hre...
[09:31:00] Operations, Cloud-VPS (Project-requests), cloud-services-team (Kanban): Request creation of 'sre-sandbox' VPS project - https://phabricator.wikimedia.org/T247517 (jbond) is to possible to get more quota in this project. I Just tried to create a machine and we have 1 x m1.xlarge which seems to have t...
[09:31:02] (CR) Jcrespo: [C: +1] mariadb: remove ferm firewall hole for gerrit servers [puppet] - https://gerrit.wikimedia.org/r/609884 (https://phabricator.wikimedia.org/T239151) (owner: Dzahn)
[09:34:13] (PS1) Jcrespo: admin: Add Guergana Tzatchkova (gtzatchkova) to the list of privileged ldap groups [puppet] - https://gerrit.wikimedia.org/r/611232 (https://phabricator.wikimedia.org/T256201)
[09:35:01] (PS1) Kormat: Revert "es1021: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/611215
[09:35:04] (CR) jerkins-bot: [V: -1] admin: Add Guergana Tzatchkova (gtzatchkova) to the list of privileged ldap groups [puppet] - https://gerrit.wikimedia.org/r/611232 (https://phabricator.wikimedia.org/T256201) (owner: Jcrespo)
[09:36:50] (CR) Kormat: [C: +2] Revert "es1021: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/611215 (owner: Kormat)
[09:37:47] (PS2) Jcrespo: admin: Add gtzatchkova to the list of privileged ldap groups [puppet] - https://gerrit.wikimedia.org/r/611232 (https://phabricator.wikimedia.org/T256201)
[09:43:59] (CR) Effie Mouzeli: [C: -1] charts for push-notification service (4 comments) [deployment-charts] - https://gerrit.wikimedia.org/r/602390 (https://phabricator.wikimedia.org/T250491) (owner: MSantos)
[09:47:27] (CR) Jcrespo: [C: +2] admin: Add gtzatchkova to the list of privileged ldap groups [puppet] - https://gerrit.wikimedia.org/r/611232 (https://phabricator.wikimedia.org/T256201) (owner: Jcrespo)
[09:49:55] !log kormat@cumin1001 dbctl commit (dc=all): 'Start repooling es1021 after reimage @ 50% T257284', diff saved to https://phabricator.wikimedia.org/P11858 and previous config saved to /var/cache/conftool/dbconfig/20200710-094954-kormat.json
[09:49:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:50:00] T257284: Upgrade es4 to debian buster + mariadb 10.4 - https://phabricator.wikimedia.org/T257284
[09:52:59] 
(PS1) Jbond: mariadb::farm_misc add netmon1002/2001 access [puppet] - https://gerrit.wikimedia.org/r/611243
[09:53:42] (CR) Jbond: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/611243 (owner: Jbond)
[09:55:53] Operations, LDAP-Access-Requests, WMF-Legal, Patch-For-Review: Add Guergana Tzatchkova to the ldap/wmde and ldap/nda group - https://phabricator.wikimedia.org/T256201 (jcrespo)
[09:56:56] Operations, LDAP-Access-Requests, WMF-Legal, Patch-For-Review: Add Guergana Tzatchkova to the ldap/wmde and ldap/nda group - https://phabricator.wikimedia.org/T256201 (jcrespo) Open→Resolved Change has been deployed https://ldap.toolforge.org/user/gtzatchkova Please test your privileged...
[09:59:17] Operations, LDAP-Access-Requests, observability, serviceops, Patch-For-Review: Grant Access to Logstash to Peter(peter.ovchyn@speedandfunction.com) - https://phabricator.wikimedia.org/T249037 (jcrespo) I will get notified when this can move forward and https://gerrit.wikimedia.org/r/c/operati...
[09:59:25] (PS2) Jbond: mariadb::farm_misc add netmon1002/2001 access [puppet] - https://gerrit.wikimedia.org/r/611243
[10:02:57] (PS4) Elukey: Add custom ferm srange to Kafka Jumbo brokers [puppet] - https://gerrit.wikimedia.org/r/611168
[10:18:41] (CR) Alexandros Kosiaris: [C: +1] Add custom ferm srange to Kafka Jumbo brokers [puppet] - https://gerrit.wikimedia.org/r/611168 (owner: Elukey)
[10:20:59] (PS1) Jbond: role::grafana: allow embedding [puppet] - https://gerrit.wikimedia.org/r/611250 (https://phabricator.wikimedia.org/T250792)
[10:21:34] (CR) Jbond: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/611250 (https://phabricator.wikimedia.org/T250792) (owner: Jbond)
[10:21:48] !log kormat@cumin1001 dbctl commit (dc=all): 'Finish repooling es1021, and remove weight from es1010 T257284', diff saved to https://phabricator.wikimedia.org/P11859 and previous config saved to /var/cache/conftool/dbconfig/20200710-102147-kormat.json
[10:21:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:21:53] T257284: Upgrade es4 to debian buster + mariadb 10.4 - https://phabricator.wikimedia.org/T257284
[10:26:31] (PS1) Jbond: Revert "librenms: convert back to ldap config" [puppet] - https://gerrit.wikimedia.org/r/611216
[10:26:54] Operations, serviceops, Patch-For-Review, User-Elukey: Reimage one memcached shard to Buster - https://phabricator.wikimedia.org/T252391 (elukey) Another idea to add in here - recently John and Moritz needed TLS for memcached and imported memcached 1.6.6 (latest upstream) into out buster reposito...
[10:34:24] (PS1) JMeybohm: Check if images are debian based before generating report [docker-images/docker-report] - https://gerrit.wikimedia.org/r/611251 (https://phabricator.wikimedia.org/T251918)
[10:34:26] (PS1) JMeybohm: New package version [docker-images/docker-report] - https://gerrit.wikimedia.org/r/611252
[10:38:35] Operations, Desktop Improvements, Traffic, Performance-Team (Radar): CDN cache revalidation on several wikis for desktop improvements deployment - https://phabricator.wikimedia.org/T256750 (ovasileva) @ema - apologies for the late response - we had some blockers arise this week with the deploymen...
[10:38:49] Operations, Desktop Improvements, Traffic, Performance-Team (Radar): CDN cache revalidation on several wikis for desktop improvements deployment - https://phabricator.wikimedia.org/T256750 (ovasileva)
[10:50:13] (PS1) Arturo Borrero Gonzalez: cloud: add prometheus neutron conntrack collector [puppet] - https://gerrit.wikimedia.org/r/611262 (https://phabricator.wikimedia.org/T257552)
[10:50:30] (PS1) Jbond: profile::grafana: add types and convert to lookup [puppet] - https://gerrit.wikimedia.org/r/611263
[10:51:05] (CR) jerkins-bot: [V: -1] cloud: add prometheus neutron conntrack collector [puppet] - https://gerrit.wikimedia.org/r/611262 (https://phabricator.wikimedia.org/T257552) (owner: Arturo Borrero Gonzalez)
[10:51:43] (CR) jerkins-bot: [V: -1] profile::grafana: add types and convert to lookup [puppet] - https://gerrit.wikimedia.org/r/611263 (owner: Jbond)
[10:52:08] (PS2) Jbond: profile::grafana: add types and convert to lookup [puppet] - https://gerrit.wikimedia.org/r/611263
[10:54:47] (PS4) Hnowlan: changeprop-jobqueue: add beta configuration skeleton [deployment-charts] - https://gerrit.wikimedia.org/r/604425 (https://phabricator.wikimedia.org/T220399)
[10:56:21] (PS3) Jbond: profile::grafana: add types and convert to lookup [puppet] - 
https://gerrit.wikimedia.org/r/611263
[11:01:59] (CR) Muehlenhoff: [C: +2] Revert "librenms: convert back to ldap config" [puppet] - https://gerrit.wikimedia.org/r/611216 (owner: Jbond)
[11:02:05] (PS4) Jbond: profile::grafana: add types and convert to lookup [puppet] - https://gerrit.wikimedia.org/r/611263
[11:08:26] (PS5) Jbond: profile::grafana: add types and convert to lookup [puppet] - https://gerrit.wikimedia.org/r/611263
[11:10:49] (PS2) Arturo Borrero Gonzalez: cloud: add prometheus neutron conntrack collector [puppet] - https://gerrit.wikimedia.org/r/611262 (https://phabricator.wikimedia.org/T257552)
[11:13:07] (PS6) Jbond: profile::grafana: add types and convert to lookup [puppet] - https://gerrit.wikimedia.org/r/611263
[11:13:09] (CR) Hnowlan: [C: +2] changeprop-jobqueue: add beta configuration skeleton [deployment-charts] - https://gerrit.wikimedia.org/r/604425 (https://phabricator.wikimedia.org/T220399) (owner: Hnowlan)
[11:23:15] (Merged) jenkins-bot: changeprop-jobqueue: add beta configuration skeleton [deployment-charts] - https://gerrit.wikimedia.org/r/604425 (https://phabricator.wikimedia.org/T220399) (owner: Hnowlan)
[11:23:15] (CR) Jbond: "PCC: https://puppet-compiler.wmflabs.org/compiler1001/23813/" [puppet] - https://gerrit.wikimedia.org/r/611263 (owner: Jbond)
[11:23:16] Operations: Network access to Wikipedia blocked - https://phabricator.wikimedia.org/T257664 (Olem)
[11:23:16] (PS3) Jbond: mariadb::ferm_misc add netmon1002/2001 access [puppet] - https://gerrit.wikimedia.org/r/611243
[11:23:16] (PS3) Arturo Borrero Gonzalez: cloud: add prometheus neutron conntrack collector [puppet] - https://gerrit.wikimedia.org/r/611262 (https://phabricator.wikimedia.org/T257552)
[11:23:16] (CR) jerkins-bot: [V: -1] cloud: add prometheus neutron conntrack collector [puppet] - https://gerrit.wikimedia.org/r/611262 (https://phabricator.wikimedia.org/T257552) (owner: 
Arturo Borrero Gonzalez)
[11:23:17] (PS4) Arturo Borrero Gonzalez: cloud: add prometheus neutron conntrack collector [puppet] - https://gerrit.wikimedia.org/r/611262 (https://phabricator.wikimedia.org/T257552)
[11:23:17] Operations, Traffic: Network access to Wikipedia blocked - https://phabricator.wikimedia.org/T257664 (Majavah)
[11:23:18] (CR) Arturo Borrero Gonzalez: [C: +2] cloud: add prometheus neutron conntrack collector [puppet] - https://gerrit.wikimedia.org/r/611262 (https://phabricator.wikimedia.org/T257552) (owner: Arturo Borrero Gonzalez)
[11:29:00] (PS1) Arturo Borrero Gonzalez: cloud: prometheus neutron collector: add "" characters for label values [puppet] - https://gerrit.wikimedia.org/r/611278 (https://phabricator.wikimedia.org/T257552)
[11:30:13] Operations, SRE-Access-Requests, Patch-For-Review: Requesting access to analytics-privatedata-users and nda groups for edtadros - https://phabricator.wikimedia.org/T256435 (jcrespo)
[11:30:37] (CR) Arturo Borrero Gonzalez: [C: +2] cloud: prometheus neutron collector: add "" characters for label values [puppet] - https://gerrit.wikimedia.org/r/611278 (https://phabricator.wikimedia.org/T257552) (owner: Arturo Borrero Gonzalez)
[11:33:00] (PS14) Hnowlan: api-gateway: Basic envoy chart WIP [deployment-charts] - https://gerrit.wikimedia.org/r/609808 (https://phabricator.wikimedia.org/T254906)
[11:35:46] (PS1) Arturo Borrero Gonzalez: cloud: prometheus neutron exporter: cleanup log messages [puppet] - https://gerrit.wikimedia.org/r/611285 (https://phabricator.wikimedia.org/T257552)
[11:37:00] (CR) Arturo Borrero Gonzalez: [C: +2] cloud: prometheus neutron exporter: cleanup log messages [puppet] - https://gerrit.wikimedia.org/r/611285 (https://phabricator.wikimedia.org/T257552) (owner: Arturo Borrero Gonzalez)
[11:58:40] (PS1) Reedy: Make Score errors use a specific css class [extensions/Score] (wmf/1.35.0-wmf.40) - 
https://gerrit.wikimedia.org/r/611217 (https://phabricator.wikimedia.org/T257623)
[11:59:07] (CR) Reedy: [C: +2] Make Score errors use a specific css class [extensions/Score] (wmf/1.35.0-wmf.40) - https://gerrit.wikimedia.org/r/611217 (https://phabricator.wikimedia.org/T257623) (owner: Reedy)
[12:03:55] (PS3) RhinosF1: Add NamespaceAliases for kowikiquote [mediawiki-config] - https://gerrit.wikimedia.org/r/604515 (https://phabricator.wikimedia.org/T255031)
[12:06:11] Reedy: any update on https://phabricator.wikimedia.org/T257066#6277787? It's past Monday 6th?
[12:06:24] No
[12:08:30] Reedy: any updated timescale?
[12:08:36] No
[12:08:46] The comment said at least
[12:08:53] So it could be for any amount of time afterwards
[12:08:56] ok
[12:09:38] (CR) Marostegui: [C: +1] mariadb::ferm_misc add netmon1002/2001 access [puppet] - https://gerrit.wikimedia.org/r/611243 (owner: Jbond)
[12:17:12] (CR) jerkins-bot: [V: -1] Make Score errors use a specific css class [extensions/Score] (wmf/1.35.0-wmf.40) - https://gerrit.wikimedia.org/r/611217 (https://phabricator.wikimedia.org/T257623) (owner: Reedy)
[12:17:55] (CR) Reedy: [V: +2 C: +2] "Unrelated failure, task already filed" [extensions/Score] (wmf/1.35.0-wmf.40) - https://gerrit.wikimedia.org/r/611217 (https://phabricator.wikimedia.org/T257623) (owner: Reedy)
[12:20:24] !log reedy@deploy1001 Synchronized php-1.35.0-wmf.40/extensions/Score/: Make Score errors use a specific css class (duration: 00m 58s)
[12:20:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:36:05] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repool db1110', diff saved to https://phabricator.wikimedia.org/P11860 and previous config saved to /var/cache/conftool/dbconfig/20200710-123604-marostegui.json
[12:36:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:51:20] (CR) Alexandros Kosiaris: [C: -1] "> Patch Set 2:" [puppet] - 
https://gerrit.wikimedia.org/r/552515 (https://phabricator.wikimedia.org/T239340) (owner: Alexandros Kosiaris)
[12:54:51] (PS9) MSantos: charts for push-notification service [deployment-charts] - https://gerrit.wikimedia.org/r/602390 (https://phabricator.wikimedia.org/T250493)
[12:55:34] (CR) Alexandros Kosiaris: Add recommendation-api helmfile stanzas (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/602527 (https://phabricator.wikimedia.org/T241230) (owner: Bmansurov)
[12:57:15] (CR) MSantos: charts for push-notification service (4 comments) [deployment-charts] - https://gerrit.wikimedia.org/r/602390 (https://phabricator.wikimedia.org/T250493) (owner: MSantos)
[13:03:19] RECOVERY - MariaDB Replica SQL: matomo on db1108 is OK: OK slave_sql_state not a slave https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_slave
[13:04:39] RECOVERY - MariaDB Replica Lag: matomo on db1108 is OK: OK slave_sql_lag not a slave https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_slave
[13:04:53] RECOVERY - MariaDB Replica IO: matomo on db1108 is OK: OK slave_io_state not a slave https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_slave
[13:05:59] RECOVERY - MariaDB Replica Lag: analytics_meta on db1108 is OK: OK slave_sql_lag not a slave https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_slave
[13:10:35] ^ elukey :)
[13:11:35] RECOVERY - Check systemd state on db1108 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[13:12:35] PROBLEM - Host ganeti1007.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[13:12:50] ah!!!
[13:12:52] <3 [13:14:24] 10Operations, 10ops-eqiad: upgrade memory in ganeti100[5-8].eqiad.wmnet - https://phabricator.wikimedia.org/T244530 (10Jclark-ctr) [13:14:51] 10Operations, 10ops-eqiad: upgrade memory in ganeti100[5-8].eqiad.wmnet - https://phabricator.wikimedia.org/T244530 (10Jclark-ctr) finished with memory upgrade [13:15:29] RECOVERY - Host ganeti1007 is UP: PING OK - Packet loss = 0%, RTA = 0.24 ms [13:15:32] (03PS1) 10Hashar: Fix repository name in .gitreview [software/acme-chief] - 10https://gerrit.wikimedia.org/r/611309 [13:18:27] RECOVERY - Host ganeti1007.mgmt is UP: PING OK - Packet loss = 0%, RTA = 0.84 ms [13:18:41] (03PS1) 10Ema: ATS: add SyslogIdentifier to systemd unit [puppet] - 10https://gerrit.wikimedia.org/r/611311 [13:19:15] (03PS2) 10Ema: ATS: add SyslogIdentifier to systemd unit [puppet] - 10https://gerrit.wikimedia.org/r/611311 (https://phabricator.wikimedia.org/T256395) [13:19:40] (03CR) 10jerkins-bot: [V: 04-1] ATS: add SyslogIdentifier to systemd unit [puppet] - 10https://gerrit.wikimedia.org/r/611311 (https://phabricator.wikimedia.org/T256395) (owner: 10Ema) [13:20:04] (03PS3) 10Ema: ATS: add SyslogIdentifier to systemd unit [puppet] - 10https://gerrit.wikimedia.org/r/611311 (https://phabricator.wikimedia.org/T256395) [13:20:46] 10Operations, 10Analytics-Clusters: Segfault for systemd-sysusers.service on stat1007 - https://phabricator.wikimedia.org/T256098 (10elukey) The redhat bug report leads to https://github.com/systemd/systemd/issues/6512, I followed the steps outlined in there: ` elukey@stat1007:~$ sudo gdb systemd-sysusers [..... 
[13:21:59] (03CR) 10Ema: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/611311 (https://phabricator.wikimedia.org/T256395) (owner: 10Ema) [13:31:12] 10Operations, 10Analytics-Clusters: Segfault for systemd-sysusers.service on stat1007 - https://phabricator.wikimedia.org/T256098 (10elukey) Coreos applied a patch to libc: https://github.com/mischief/coreos-overlay/commit/19d5f42d8208334ef8581ba90e01161e00dede71 [13:34:27] (03Abandoned) 10Elukey: sre.dns.netbox: print some suggestions in case the diff is wrong [cookbooks] - 10https://gerrit.wikimedia.org/r/609390 (owner: 10Elukey) [13:34:42] (03Abandoned) 10Elukey: profile::mediawiki::alerts: tune mediawiki-errors to be more lenient [puppet] - 10https://gerrit.wikimedia.org/r/608708 (https://phabricator.wikimedia.org/T256459) (owner: 10Elukey) [13:35:54] 10Operations, 10CAS-SSO, 10Patch-For-Review, 10User-jbond: mod_auth_cas segfaulting on netmon - https://phabricator.wikimedia.org/T257587 (10MoritzMuehlenhoff) The underlying Debian bug is https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=844018 and specifically https://bugs.debian.org/cgi-bin/bugreport.cg... [13:36:47] Hi, I created https://phabricator.wikimedia.org/T257664 (Network access to Wikipedia blocked) a few hours ago, but I don't have the permissions to view it now (Access Denied: Restricted Task). May I be granted permission to view this task in order to check its status and respond to any questions if needed? [13:37:20] olem: yeah, let me fix it for you [13:37:40] Thanks legoktm [13:38:32] olem: try now? [13:39:26] Thanks, I now have access. [13:39:52] 10Operations, 10Analytics-Clusters: Segfault for systemd-sysusers.service on stat1007 - https://phabricator.wikimedia.org/T256098 (10elukey) ` elukey@stat1006:~$ sudo systemd-sysusers Creating group systemd-coredump with gid 490. Creating user systemd-coredump (systemd Core Dumper) with uid 490 and gid 490. Se... 
[13:40:15] great [13:40:43] 10Operations, 10CAS-SSO, 10Patch-For-Review, 10User-jbond: mod_auth_cas segfaulting on netmon - https://phabricator.wikimedia.org/T257587 (10MoritzMuehlenhoff) Also adding @CDanis and @ayounsi for comments on a updating to Buster (potential blockers etc.) [13:41:13] 10Operations, 10Fundraising-Backlog, 10MediaWiki-extensions-CentralNotice, 10Traffic, and 2 others: Implement redirect for hide banner cookie issue - https://phabricator.wikimedia.org/T251780 (10Ejegg) @AndyRussG Hmm, that cross-site cookie check would require at least one more web request to complete befo... [13:41:24] 10Operations, 10Fundraising-Backlog, 10MediaWiki-extensions-CentralNotice, 10Traffic, and 2 others: Implement redirect for hide banner cookie issue - https://phabricator.wikimedia.org/T251780 (10spatton) I like that last idea (special cookie on donate wiki that we check in banners), @AndyRussG! I would be... [13:41:26] !log bounce ms-be1037, not quite responsive [13:41:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:42:41] PROBLEM - very high load average likely xfs on ms-be1037 is CRITICAL: connect to address 10.64.48.142 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Swift [13:42:51] PROBLEM - Check size of conntrack table on ms-be1037 is CRITICAL: connect to address 10.64.48.142 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack [13:42:51] PROBLEM - Check systemd state on ms-be1037 is CRITICAL: connect to address 10.64.48.142 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:46:45] PROBLEM - Host ms-be1037 is DOWN: PING CRITICAL - Packet loss = 100% [13:47:58] 10Operations, 10observability, 10Patch-For-Review, 10Performance-Team (Radar): Revisit Grafana/Icinga notification strategy - https://phabricator.wikimedia.org/T203485 (10ema) This came up again today. 
Due to my very short memory I forgot all about the performance team alerts and started complaining about... [13:48:15] RECOVERY - very high load average likely xfs on ms-be1037 is OK: OK - load average: 20.15, 5.02, 1.68 https://wikitech.wikimedia.org/wiki/Swift [13:48:17] RECOVERY - Host ms-be1037 is UP: PING OK - Packet loss = 0%, RTA = 0.20 ms [13:48:23] RECOVERY - Check size of conntrack table on ms-be1037 is OK: OK: nf_conntrack is 0 % full https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack [13:54:31] (03PS16) 10Privacybatm: transferpy: Generate checksum parallel to the data transfer [software/transferpy] - 10https://gerrit.wikimedia.org/r/605851 (https://phabricator.wikimedia.org/T254979) [13:54:43] (03PS1) 10Ema: ATS: send Set-Cookie syslog output to logstash [puppet] - 10https://gerrit.wikimedia.org/r/611315 (https://phabricator.wikimedia.org/T256395) [13:54:58] (03CR) 10jerkins-bot: [V: 04-1] transferpy: Generate checksum parallel to the data transfer [software/transferpy] - 10https://gerrit.wikimedia.org/r/605851 (https://phabricator.wikimedia.org/T254979) (owner: 10Privacybatm) [13:55:17] (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/609840 (owner: 10Legoktm) [13:55:48] PROBLEM - Check systemd state on ms-be1037 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:56:27] (03CR) 10Ema: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/611315 (https://phabricator.wikimedia.org/T256395) (owner: 10Ema) [13:58:13] (03PS17) 10Privacybatm: transferpy: Generate checksum parallel to the data transfer [software/transferpy] - 10https://gerrit.wikimedia.org/r/605851 (https://phabricator.wikimedia.org/T254979) [13:59:40] (03CR) 10Vgutierrez: [C: 03+1] ATS: add SyslogIdentifier to systemd unit [puppet] - 10https://gerrit.wikimedia.org/r/611311 (https://phabricator.wikimedia.org/T256395) (owner: 10Ema) [14:00:02] (03CR) 10Privacybatm: transferpy: Generate checksum parallel to the data transfer (031 comment) [software/transferpy] - 10https://gerrit.wikimedia.org/r/605851 (https://phabricator.wikimedia.org/T254979) (owner: 10Privacybatm) [14:00:27] (03CR) 10Ema: [C: 03+2] ATS: add SyslogIdentifier to systemd unit [puppet] - 10https://gerrit.wikimedia.org/r/611311 (https://phabricator.wikimedia.org/T256395) (owner: 10Ema) [14:02:39] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good (to the extent this brackets maze in the srange can look good :-)" [puppet] - 10https://gerrit.wikimedia.org/r/611168 (owner: 10Elukey) [14:03:11] PROBLEM - Check systemd state on kubernetes1004 is CRITICAL: connect to address 10.64.48.52 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:03:43] PROBLEM - Check size of conntrack table on kubernetes1004 is CRITICAL: connect to address 10.64.48.52 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack [14:04:01] 10Operations, 10netops: Investigate Juniper storm control - https://phabricator.wikimedia.org/T245192 (10Krinkle) [14:04:55] (03PS2) 10Ema: ATS: send Set-Cookie syslog output to logstash [puppet] - 10https://gerrit.wikimedia.org/r/611315 (https://phabricator.wikimedia.org/T256395) [14:05:24] (03CR) 10Vgutierrez: ATS: send Set-Cookie syslog 
output to logstash (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/611315 (https://phabricator.wikimedia.org/T256395) (owner: 10Ema) [14:05:44] (03PS1) 10Filippo Giunchedi: role: port netmon to Buster [puppet] - 10https://gerrit.wikimedia.org/r/611317 (https://phabricator.wikimedia.org/T247967) [14:05:46] (03PS1) 10Filippo Giunchedi: role: install fcgid package on netmon [puppet] - 10https://gerrit.wikimedia.org/r/611318 (https://phabricator.wikimedia.org/T247967) [14:05:55] 10Operations, 10Cloud-Services, 10Traffic, 10SRE-OnFire-Incident-Docs, 10cloud-services-team (Kanban): Requests to production are sometimes timing out or giving empty response - https://phabricator.wikimedia.org/T249035 (10Krinkle) Looks like this is no longer an active incident. Re-tagging as such. Are... [14:05:57] PROBLEM - puppet last run on kubernetes1004 is CRITICAL: connect to address 10.64.48.52 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [14:06:19] hm...that again. Looking at kubernetes1004 [14:07:00] (03CR) 10jerkins-bot: [V: 04-1] role: install fcgid package on netmon [puppet] - 10https://gerrit.wikimedia.org/r/611318 (https://phabricator.wikimedia.org/T247967) (owner: 10Filippo Giunchedi) [14:07:32] 10Operations, 10Analytics-Clusters: Segfault for systemd-sysusers.service on stat1007 - https://phabricator.wikimedia.org/T256098 (10elukey) I checked `/usr/lib/sysusers.d/*.conf` and the last user listed is `systemd-coredump`, plus we still don't use systemd-sysusers in analytics (yet). 
[14:07:49] PROBLEM - DPKG on kubernetes1004 is CRITICAL: connect to address 10.64.48.52 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [14:08:57] PROBLEM - configured eth on kubernetes1004 is CRITICAL: connect to address 10.64.48.52 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_eth [14:09:59] PROBLEM - dhclient process on kubernetes1004 is CRITICAL: connect to address 10.64.48.52 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_dhclient [14:10:31] PROBLEM - Ensure hosts are not performing a change on every puppet run on puppetdb1002 is CRITICAL: CRITICAL: the following (5) node(s) change every puppet run: releases1002.eqiad.wmnet, ms-be1037.eqiad.wmnet, wdqs1010.eqiad.wmnet, releases2002.codfw.wmnet, wdqs1009.eqiad.wmnet https://wikitech.wikimedia.org/wiki/Puppet%23check_puppet_run_changes [14:12:53] PROBLEM - Check whether ferm is active by checking the default input chain on kubernetes1004 is CRITICAL: connect to address 10.64.48.52 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [14:13:03] PROBLEM - Disk space on kubernetes1004 is CRITICAL: connect to address 10.64.48.52 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=kubernetes1004&var-datasource=eqiad+prometheus/ops [14:14:18] (03PS1) 10Elukey: Set BigTop for the Hadoop test cluster [puppet] - 10https://gerrit.wikimedia.org/r/611319 [14:14:19] RECOVERY - Check systemd state on kubernetes1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:14:49] RECOVERY - Check size of conntrack table on kubernetes1004 is OK: OK: nf_conntrack is 0 % full https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack [14:15:44] (03PS3) 10Ema: ATS: send Set-Cookie syslog output to logstash [puppet] - 
10https://gerrit.wikimedia.org/r/611315 (https://phabricator.wikimedia.org/T256395) [14:16:11] (03CR) 10Ema: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/611315 (https://phabricator.wikimedia.org/T256395) (owner: 10Ema) [14:17:20] (03CR) 10Ema: ATS: send Set-Cookie syslog output to logstash (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/611315 (https://phabricator.wikimedia.org/T256395) (owner: 10Ema) [14:17:39] RECOVERY - puppet last run on kubernetes1004 is OK: OK: Puppet is currently enabled, last run 46 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [14:18:21] (03CR) 10Vgutierrez: [C: 03+1] ATS: send Set-Cookie syslog output to logstash [puppet] - 10https://gerrit.wikimedia.org/r/611315 (https://phabricator.wikimedia.org/T256395) (owner: 10Ema) [14:18:37] (03CR) 10Ema: [C: 03+2] ATS: send Set-Cookie syslog output to logstash [puppet] - 10https://gerrit.wikimedia.org/r/611315 (https://phabricator.wikimedia.org/T256395) (owner: 10Ema) [14:20:11] (03CR) 10Elukey: [C: 03+2] Set BigTop for the Hadoop test cluster [puppet] - 10https://gerrit.wikimedia.org/r/611319 (owner: 10Elukey) [14:25:56] 10Operations, 10ops-eqiad: Suspected network troubles on ms-be1037 - https://phabricator.wikimedia.org/T257675 (10fgiunchedi) [14:26:04] 10Operations, 10ops-eqiad: Suspected network troubles on ms-be1037 - https://phabricator.wikimedia.org/T257675 (10fgiunchedi) p:05Triage→03High [14:29:23] PROBLEM - Check systemd state on ms-be1037 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:30:32] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good to me, two nits inline (feel free to ignore)" (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/611317 (https://phabricator.wikimedia.org/T247967) (owner: 10Filippo Giunchedi) [14:30:42] 10Operations, 10observability, 10User-fgiunchedi: Port Prometheus dashboards to Thanos - https://phabricator.wikimedia.org/T256954 (10jcrespo) [14:30:44] 10Operations, 10DBA, 10User-Kormat: Port DBA dashboards to thanos - https://phabricator.wikimedia.org/T256730 (10jcrespo) [14:30:46] !log elukey@cumin1001 START - Cookbook sre.hadoop.stop-cluster [14:30:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:31:24] (03CR) 10jerkins-bot: [V: 04-1] role: port netmon to Buster [puppet] - 10https://gerrit.wikimedia.org/r/611317 (https://phabricator.wikimedia.org/T247967) (owner: 10Filippo Giunchedi) [14:33:04] 10Operations, 10Puppet: Missing dependency on bacula-fd Puppet setup - https://phabricator.wikimedia.org/T256454 (10jcrespo) p:05Triage→03Medium a:03jcrespo [14:33:53] RECOVERY - Disk space on kubernetes1004 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=kubernetes1004&var-datasource=eqiad+prometheus/ops [14:34:16] (03PS9) 10Andrew Bogott: Prometheus: gather db stats from wmcs galera db hosts [puppet] - 10https://gerrit.wikimedia.org/r/610420 [14:35:04] (03CR) 10Muehlenhoff: role: install fcgid package on netmon (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/611318 (https://phabricator.wikimedia.org/T247967) (owner: 10Filippo Giunchedi) [14:35:22] 10Operations, 10Fundraising-Backlog, 10MediaWiki-extensions-CentralNotice, 10Traffic, and 2 others: Implement redirect for hide banner cookie issue - https://phabricator.wikimedia.org/T251780 (10Pcoombe) Some quick thoughts on proposed solutions **TY page on *.wikipedia.org** 
`-` need a new restricted acc... [14:37:23] !log elukey@cumin1001 END (PASS) - Cookbook sre.hadoop.stop-cluster (exit_code=0) [14:37:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:38:35] 10Operations, 10Fundraising-Backlog, 10MediaWiki-extensions-CentralNotice, 10Traffic, and 2 others: Implement redirect for hide banner cookie issue - https://phabricator.wikimedia.org/T251780 (10MBeat33) > Put a link in the thank-you page and/or thank-you e-mail for users to click I'd like to advocate gen... [14:38:41] RECOVERY - DPKG on kubernetes1004 is OK: All packages OK https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [14:38:51] 10Operations, 10ops-eqiad: Suspected network troubles on ms-be1037 - https://phabricator.wikimedia.org/T257675 (10fgiunchedi) In case it is useful and assuming my theory is correct, host is on asw2-d-eqiad ` Physical interface: xe-7/0/0 Laser bias current : 45.788 mA Laser outp... [14:39:34] !log elukey@cumin1001 START - Cookbook sre.hadoop.change-distro [14:39:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:39:47] RECOVERY - configured eth on kubernetes1004 is OK: OK - interfaces up https://wikitech.wikimedia.org/wiki/Monitoring/check_eth [14:40:47] RECOVERY - dhclient process on kubernetes1004 is OK: PROCS OK: 0 processes with command name dhclient https://wikitech.wikimedia.org/wiki/Monitoring/check_dhclient [14:40:53] (03PS2) 10Cwhite: debianization [debs/grafana-loki] (debian/sid) - 10https://gerrit.wikimedia.org/r/610864 [14:42:38] (03CR) 10Cwhite: debianization (031 comment) [debs/grafana-loki] (debian/sid) - 10https://gerrit.wikimedia.org/r/610864 (owner: 10Cwhite) [14:43:30] (03PS10) 10Andrew Bogott: cloudmetrics: gather db stats from wmcs galera db hosts [puppet] - 10https://gerrit.wikimedia.org/r/610420 [14:43:41] RECOVERY - Check whether ferm is active by checking the default input chain on kubernetes1004 is OK: OK ferm input default policy is set 
https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [14:47:30] (03CR) 10Filippo Giunchedi: [C: 03+1] cloudmetrics: gather db stats from wmcs galera db hosts [puppet] - 10https://gerrit.wikimedia.org/r/610420 (owner: 10Andrew Bogott) [14:47:54] (03CR) 10Andrew Bogott: [C: 03+2] cloudmetrics: gather db stats from wmcs galera db hosts [puppet] - 10https://gerrit.wikimedia.org/r/610420 (owner: 10Andrew Bogott) [14:52:27] (03PS9) 10ZPapierski: Correct url and path for nginx OAuth 1.0a [puppet] - 10https://gerrit.wikimedia.org/r/609909 (https://phabricator.wikimedia.org/T251498) [14:57:51] PROBLEM - Host ms-be1037 is DOWN: PING CRITICAL - Packet loss = 100% [14:59:35] RECOVERY - Host ms-be1037 is UP: PING OK - Packet loss = 0%, RTA = 0.22 ms [15:02:24] 10Operations, 10ops-eqiad: Suspected network troubles on ms-be1037 - https://phabricator.wikimedia.org/T257675 (10Jclark-ctr) a:03Jclark-ctr [15:03:26] !log elukey@cumin1001 END (PASS) - Cookbook sre.hadoop.change-distro (exit_code=0) [15:03:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:04:21] PROBLEM - Check systemd state on ms-be1037 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:06:11] 10Operations, 10ops-eqiad: Suspected network troubles on ms-be1037 - https://phabricator.wikimedia.org/T257675 (10Jclark-ctr) replaced sfp on host Finisar model FTLX1471D3BCL S/N AQR0HM6 [15:08:43] (03PS10) 10ZPapierski: Correct url and path for nginx OAuth 1.0a [puppet] - 10https://gerrit.wikimedia.org/r/609909 (https://phabricator.wikimedia.org/T251498) [15:08:59] (03PS1) 10Elukey: Revert "Set BigTop for the Hadoop test cluster" [puppet] - 10https://gerrit.wikimedia.org/r/611218 [15:09:16] (03CR) 10Bstorm: [C: 03+2] kubeadm: If using a stacked control plane, expose etcd metrics [puppet] - 10https://gerrit.wikimedia.org/r/610980 (https://phabricator.wikimedia.org/T256361) (owner: 10Bstorm) [15:10:32] (03CR) 10Elukey: [C: 03+2] Revert "Set BigTop for the Hadoop test cluster" [puppet] - 10https://gerrit.wikimedia.org/r/611218 (owner: 10Elukey) [15:11:15] bstorm: o/ [15:11:20] ok to puppet-merge? [15:11:38] Oh sure! I was about it [15:11:46] *about to [15:12:39] elukey ^ [15:12:47] ack! 
[15:13:21] done :) [15:14:11] (03PS1) 10Andrew Bogott: wmcs prometheus: correct role name for galera metrics [puppet] - 10https://gerrit.wikimedia.org/r/611343 [15:16:09] (03CR) 10Andrew Bogott: [C: 03+2] wmcs prometheus: correct role name for galera metrics [puppet] - 10https://gerrit.wikimedia.org/r/611343 (owner: 10Andrew Bogott) [15:17:17] RECOVERY - Check systemd state on ms-be1037 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:19:08] (03PS1) 10Muehlenhoff: Remove cas-icinga server alias [puppet] - 10https://gerrit.wikimedia.org/r/611344 [15:19:10] (03PS1) 10Muehlenhoff: Remove cas-icinga from ACME config [puppet] - 10https://gerrit.wikimedia.org/r/611345 [15:19:53] !log elukey@cumin1001 START - Cookbook sre.hadoop.stop-cluster [15:19:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:23:27] 10Operations, 10MediaWiki-extensions-Score, 10Security-Team, 10Wikimedia-General-or-Unknown, and 3 others: Extension:Score / Lilypond is disabled on all wikis - https://phabricator.wikimedia.org/T257066 (10Tgr) Have we made an effort to reach out to non-Wikimedia MediaWiki users? Given the severity, warnin... 
[15:24:45] (03CR) 10Muehlenhoff: [C: 03+2] Icinga: Add permissions also for ayounsi [puppet] - 10https://gerrit.wikimedia.org/r/610699 (owner: 10Muehlenhoff) [15:29:31] !log milimetric@deploy1001 Started deploy [analytics/refinery@4d40145]: Update EventLogging refine whitelist [15:29:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:29:49] !log elukey@cumin1001 END (PASS) - Cookbook sre.hadoop.stop-cluster (exit_code=0) [15:29:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:30:26] !log elukey@cumin1001 START - Cookbook sre.hadoop.change-distro [15:30:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:31:30] (03CR) 10Cwhite: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/607075 (owner: 10Hashar) [15:33:30] 10Operations, 10ops-eqiad: Suspected network troubles on ms-be1037 - https://phabricator.wikimedia.org/T257675 (10fgiunchedi) 05Open→03Resolved Host is back and network works as expected now, thanks for the quick action @Jclark-ctr ! 
[15:35:31] (03CR) 10Cwhite: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/607076 (https://phabricator.wikimedia.org/T149924) (owner: 10Hashar) [15:36:57] (03CR) 10Cwhite: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/607524 (owner: 10Hashar) [15:37:49] (03CR) 10Cwhite: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/607525 (owner: 10Hashar) [15:44:29] (03CR) 10Cwhite: [C: 03+2] ci: remove Apache config for nightlies [puppet] - 10https://gerrit.wikimedia.org/r/607075 (owner: 10Hashar) [15:44:35] (03PS4) 10Cwhite: ci: remove Apache config for nightlies [puppet] - 10https://gerrit.wikimedia.org/r/607075 (owner: 10Hashar) [15:44:48] !log milimetric@deploy1001 Finished deploy [analytics/refinery@4d40145]: Update EventLogging refine whitelist (duration: 15m 17s) [15:44:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:49:31] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [15:49:46] (03CR) 10Cwhite: [C: 03+2] ci: switch integration.wikimedia.org to scap DocumentRoot [puppet] - 10https://gerrit.wikimedia.org/r/607076 (https://phabricator.wikimedia.org/T149924) (owner: 10Hashar) [15:49:53] (03PS5) 10Cwhite: ci: switch integration.wikimedia.org to scap DocumentRoot [puppet] - 10https://gerrit.wikimedia.org/r/607076 (https://phabricator.wikimedia.org/T149924) (owner: 10Hashar) [15:51:17] 10Operations, 10Cloud-VPS (Project-requests), 10cloud-services-team (Kanban): Request creation of 'sre-sandbox' VPS project - https://phabricator.wikimedia.org/T247517 (10bd808) >>! In T247517#6296240, @jbond wrote: > is to possible to get more quota in this project. I Just tried to create a machine and we... 
[15:52:41] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [15:55:50] (03CR) 10Cwhite: [C: 03+2] contint: move Apache config to flat file [puppet] - 10https://gerrit.wikimedia.org/r/607524 (owner: 10Hashar) [15:55:57] (03PS3) 10Cwhite: contint: move Apache config to flat file [puppet] - 10https://gerrit.wikimedia.org/r/607524 (owner: 10Hashar) [15:56:52] 10Operations, 10Fundraising-Backlog, 10MediaWiki-extensions-CentralNotice, 10Traffic, and 2 others: Implement redirect for hide banner cookie issue - https://phabricator.wikimedia.org/T251780 (10DStrine) Here is another option. We make a subdomain with a name similar to fundraising.wikipedia.org or donate.... [15:56:58] !log milimetric@deploy1001 Started deploy [analytics/refinery@4d40145] (thin): Update EventLogging refine whitelist (THIN) [15:57:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:57:07] !log milimetric@deploy1001 Finished deploy [analytics/refinery@4d40145] (thin): Update EventLogging refine whitelist (THIN) (duration: 00m 08s) [15:57:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:58:01] (03PS1) 10Cwhite: Revert "ci: switch integration.wikimedia.org to scap DocumentRoot" [puppet] - 10https://gerrit.wikimedia.org/r/611219 [15:58:34] (03CR) 10Cwhite: [V: 03+2 C: 03+2] Revert "ci: switch integration.wikimedia.org to scap DocumentRoot" [puppet] - 10https://gerrit.wikimedia.org/r/611219 (owner: 10Cwhite) [16:02:38] (03CR) 10Elukey: [C: 04-1] "After a chat with EBernardson I found another use case that I didn't know, namely all ES Search nodes have a daemon that pulls from kafka " [puppet] - 10https://gerrit.wikimedia.org/r/611168 (owner: 10Elukey) [16:04:54] (03PS4) 10Hashar: contint: move Apache 
config to flat file [puppet] - 10https://gerrit.wikimedia.org/r/607524 [16:04:56] (03PS3) 10Hashar: doc: move Apache config to flat file [puppet] - 10https://gerrit.wikimedia.org/r/607525 [16:04:58] (03PS1) 10Hashar: ci: switch integration.wikimedia.org to scap DocumentRoot [puppet] - 10https://gerrit.wikimedia.org/r/611369 (https://phabricator.wikimedia.org/T149924) [16:07:15] (03CR) 10Hashar: [C: 04-1] "When deploying this we had https://integration.wikimedia.org/ broken, the HTML containing:" [puppet] - 10https://gerrit.wikimedia.org/r/611369 (https://phabricator.wikimedia.org/T149924) (owner: 10Hashar) [16:07:43] (03PS1) 10Bstorm: tools-prometheus: Add the paws etcd exports [puppet] - 10https://gerrit.wikimedia.org/r/611370 (https://phabricator.wikimedia.org/T256361) [16:11:59] (03CR) 10Krinkle: "That's because the DocumentRoot is not a real directory but a symlink with Scap3." [puppet] - 10https://gerrit.wikimedia.org/r/611369 (https://phabricator.wikimedia.org/T149924) (owner: 10Hashar) [16:12:50] Krinkle: my hero ;) [16:16:06] !log elukey@cumin1001 END (FAIL) - Cookbook sre.hadoop.change-distro (exit_code=99) [16:16:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:16:12] hashar: I think adding another realpath() there should fix it, want to try that? [16:17:43] Krinkle: in strpos( $realPath, $_SERVER['DOCUMENT_ROOT'] ) ? [16:18:01] cause that variable would be whatever is configured in Apache I guess [16:18:07] hashar: right now it resolves the request path, and confirms it exists in the doc root [16:18:13] you'll want to resolve docroot itself also [16:18:44] so `strpos( $realPath, $realDocRoot )` instead of `strpos( $realPath, $_SERVER['DOCUMENT_ROOT'] )` [16:18:51] and define realDocRoot [16:18:56] but isn't Apache supposed to prevent leaking from outside the DocumentRoot anyway? [16:19:08] no, I've never heard of such a rule.
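The fix discussed above (resolve the document root as well as the request path before comparing them) can be sketched roughly as follows. This is a Python analogue for illustration only, not the PHP code from integration/docroot; the helper names `is_inside_docroot` and `demo`, and the temporary directory layout standing in for a Scap3-style symlinked DocumentRoot, are assumptions.

```python
import os
import tempfile


def is_inside_docroot(request_path, doc_root):
    """Return True if request_path resolves to a location under doc_root.

    Both sides go through realpath(), so a doc_root that is itself a
    symlink (hypothetically, a Scap3-style "current" link) still matches.
    """
    real_doc_root = os.path.realpath(doc_root)
    real_path = os.path.realpath(request_path)
    # commonpath compares whole path components, which avoids the prefix
    # pitfall of a plain string check (/srv/docroot-other vs /srv/docroot).
    return os.path.commonpath([real_path, real_doc_root]) == real_doc_root


def demo():
    with tempfile.TemporaryDirectory() as tmp:
        # Hypothetical layout: <tmp>/current -> <tmp>/releases/rev1
        target = os.path.join(tmp, "releases", "rev1")
        os.makedirs(target)
        link = os.path.join(tmp, "current")
        os.symlink(target, link)

        page = os.path.join(link, "index.html")
        open(page, "w").close()
        outside = os.path.join(tmp, "secret.txt")
        open(outside, "w").close()

        # Resolved request path compared against the unresolved docroot:
        # this is the shape of the check that broke behind the symlink.
        naive = os.path.realpath(page).startswith(link)
        fixed = is_inside_docroot(page, link)
        escaped = is_inside_docroot(outside, link)
        return naive, fixed, escaped
```

Calling `demo()` compares the one-sided check with the two-sided one for a file inside the symlinked root and for a file outside it. Resolving both sides keeps the containment guard meaningful even though, as noted in the conversation, symlinks can point outside the configured root.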
[16:19:15] symlinks can escape it [16:19:19] ah yeah [16:19:22] given this is PHP reading files [16:19:36] it is mainly for doc.wm.o [16:19:46] but should be harmless here for now [16:20:31] for doc.wm.o I want to move the generated files to /srv/doc outside of the DocumentRoot, but I guess I will need a full install to make sure everything works fine [16:20:36] so I will deal with it later [16:28:33] (03PS1) 10DCausse: [wdqs] overrides default blazegraph ns [puppet] - 10https://gerrit.wikimedia.org/r/611373 [16:31:01] (03CR) 10DCausse: "should be used in conjunction with https://gerrit.wikimedia.org/r/c/wikidata/query/rdf/+/611348" [puppet] - 10https://gerrit.wikimedia.org/r/611373 (owner: 10DCausse) [16:38:53] (03PS2) 10Krinkle: Enable wgForceHTTPS and wgCookieSameSite='None' (Phase 2) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/610148 (https://phabricator.wikimedia.org/T256095) [16:38:59] (03CR) 10jerkins-bot: [V: 04-1] Enable wgForceHTTPS and wgCookieSameSite='None' (Phase 2) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/610148 (https://phabricator.wikimedia.org/T256095) (owner: 10Krinkle) [16:40:46] (03PS3) 10Krinkle: Enable wgForceHTTPS and wgCookieSameSite='None' (Phase 2) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/610148 (https://phabricator.wikimedia.org/T256095) [16:41:34] (03CR) 10Bstorm: [C: 03+2] tools-prometheus: Add the paws etcd exports [puppet] - 10https://gerrit.wikimedia.org/r/611370 (https://phabricator.wikimedia.org/T256361) (owner: 10Bstorm) [16:42:24] (03PS2) 10Hashar: ci: switch integration.wikimedia.org to scap DocumentRoot [puppet] - 10https://gerrit.wikimedia.org/r/611369 (https://phabricator.wikimedia.org/T149924) [16:43:20] (03CR) 10Hashar: [C: 04-1] "Thank you so much Timo. The fix should be https://gerrit.wikimedia.org/r/c/integration/docroot/+/611377" [puppet] - 10https://gerrit.wikimedia.org/r/611369 (https://phabricator.wikimedia.org/T149924) (owner: 10Hashar) [16:43:53] Krinkle: perfect thanks. 
And on this last patch, I am closing and heading off on vacation [16:43:58] those can wait anyway [16:44:15] LGTM, have a good one! [16:45:11] thank you for saving my night :] [16:45:26] I would probably never have found that the root cause was the scap symlink hehe [16:46:23] (03PS2) 10Lucas Werkmeister (WMDE): Load WikibaseClient using extension registration in beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/610265 [16:47:01] (03CR) 10Krinkle: "Per diff, this also changes test2wiki back to match enwiki (since test2wiki is not in group0). This is fine and might also be useful for t" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/610148 (https://phabricator.wikimedia.org/T256095) (owner: 10Krinkle) [16:48:49] (03CR) 10Lucas Werkmeister (WMDE): [C: 04-2] Load WikibaseClient using extension registration in beta (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/610265 (owner: 10Lucas Werkmeister (WMDE)) [16:50:48] (03CR) 10Thcipriani: [C: 03+1] Enable wgForceHTTPS and wgCookieSameSite='None' (Phase 2) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/610148 (https://phabricator.wikimedia.org/T256095) (owner: 10Krinkle) [16:52:01] (03CR) 10Krinkle: [C: 03+2] Enable wgForceHTTPS and wgCookieSameSite='None' (Phase 2) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/610148 (https://phabricator.wikimedia.org/T256095) (owner: 10Krinkle) [16:53:00] (03Merged) 10jenkins-bot: Enable wgForceHTTPS and wgCookieSameSite='None' (Phase 2) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/610148 (https://phabricator.wikimedia.org/T256095) (owner: 10Krinkle) [16:53:11] * Krinkle staging on mwdebug1002 [17:03:53] (03PS1) 10Greg Grossmeier: admin: update matrix.py to add color [puppet] - 10https://gerrit.wikimedia.org/r/611388 [17:04:42] (03CR) 10Greg Grossmeier: "Output used here: https://www.mediawiki.org/w/index.php?title=Wikimedia_Release_Engineering_Team/Access_list&diff=3957120&oldid=3957111&di" [puppet] - 
10https://gerrit.wikimedia.org/r/611388 (owner: 10Greg Grossmeier) [17:04:46] (03CR) 10jerkins-bot: [V: 04-1] admin: update matrix.py to add color [puppet] - 10https://gerrit.wikimedia.org/r/611388 (owner: 10Greg Grossmeier) [17:05:00] !log krinkle@deploy1001 Synchronized wmf-config/InitialiseSettings.php: I63fcea7737 (duration: 00m 57s) [17:05:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:05:46] (03PS1) 10Jcrespo: bacula: Merge prometheus exporter and icinga check into a single file [puppet] - 10https://gerrit.wikimedia.org/r/611390 (https://phabricator.wikimedia.org/T234900) [17:06:55] 10Operations, 10ops-eqsin: update power ports for ps[12]-603-eqiad - https://phabricator.wikimedia.org/T255812 (10RobH) 05Open→03Resolved Ok, this is now fully done in both netbox and on the PDU software directly. [17:07:03] (03CR) 10jerkins-bot: [V: 04-1] bacula: Merge prometheus exporter and icinga check into a single file [puppet] - 10https://gerrit.wikimedia.org/r/611390 (https://phabricator.wikimedia.org/T234900) (owner: 10Jcrespo) [17:07:30] (03CR) 10Krinkle: "The current comment format as deployed is what the new Gerrit 3 plugin uses to decide what label to use in the extracted test table." 
[puppet] - 10https://gerrit.wikimedia.org/r/608296 (https://phabricator.wikimedia.org/T256575) (owner: 10Hashar) [17:07:32] (03CR) 10Hnowlan: api-gateway: Basic envoy chart WIP (034 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/609808 (https://phabricator.wikimedia.org/T254906) (owner: 10Hnowlan) [17:09:19] (03PS1) 10Elukey: sre.hadoop.change-distro.py: change logic for JN roll restart [cookbooks] - 10https://gerrit.wikimedia.org/r/611392 (https://phabricator.wikimedia.org/T244499) [17:11:02] (03PS2) 10Jcrespo: bacula: Merge prometheus exporter and icinga check into a single file [puppet] - 10https://gerrit.wikimedia.org/r/611390 (https://phabricator.wikimedia.org/T234900) [17:11:11] (03CR) 10Elukey: [C: 03+2] sre.hadoop.change-distro.py: change logic for JN roll restart [cookbooks] - 10https://gerrit.wikimedia.org/r/611392 (https://phabricator.wikimedia.org/T244499) (owner: 10Elukey) [17:11:16] (03PS1) 10Lucas Werkmeister (WMDE): extension-list: Load WikibaseClient via JSON [mediawiki-config] - 10https://gerrit.wikimedia.org/r/611393 [17:14:50] (03CR) 10Lucas Werkmeister (WMDE): [C: 04-2] "Test plan:" (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/611393 (owner: 10Lucas Werkmeister (WMDE)) [17:15:48] (03CR) 10Lucas Werkmeister (WMDE): [C: 04-2] "> extension-list is just for harvesting the i18n, so you could switch it now, if you wished." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/610265 (owner: 10Lucas Werkmeister (WMDE)) [17:17:26] (03PS3) 10Lucas Werkmeister (WMDE): Load WikibaseClient using extension registration in beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/610265 (https://phabricator.wikimedia.org/T257435) [17:19:01] (03CR) 10Jeena Huneidi: Kask: Use Releng Cassandra Image (033 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/609894 (https://phabricator.wikimedia.org/T224041) (owner: 10Jeena Huneidi) [17:21:32] (03PS3) 10Jcrespo: bacula: Merge prometheus exporter and icinga check into a single file [puppet] - 10https://gerrit.wikimedia.org/r/611390 (https://phabricator.wikimedia.org/T234900) [17:23:02] (03PS4) 10Jcrespo: bacula: Merge prometheus exporter and icinga check into a single file [puppet] - 10https://gerrit.wikimedia.org/r/611390 (https://phabricator.wikimedia.org/T234900) [17:25:58] (03CR) 10Dzahn: "looks like pep8 doesn't like that the line is long now" [puppet] - 10https://gerrit.wikimedia.org/r/611388 (owner: 10Greg Grossmeier) [17:26:39] (03PS3) 10Dzahn: releases: remove duplicate rsync code from blubber and parsoid classes [puppet] - 10https://gerrit.wikimedia.org/r/610402 [17:28:07] (03CR) 10Jcrespo: "This was mostly a cleanup before implementing a denylist for jobs that are configured on bacula, but are known to be failing for a long ti" [puppet] - 10https://gerrit.wikimedia.org/r/611390 (https://phabricator.wikimedia.org/T234900) (owner: 10Jcrespo) [17:28:37] 10Operations, 10Fundraising-Backlog, 10MediaWiki-extensions-CentralNotice, 10Traffic, and 2 others: Implement redirect for hide banner cookie issue - https://phabricator.wikimedia.org/T251780 (10AndyRussG) **One more option** (thanks to @Jgreen for this idea): we could create a **thank-you banner on Wikipe... 
[17:30:52] (PS2) Greg Grossmeier: admin: update matrix.py to add color [puppet] - https://gerrit.wikimedia.org/r/611388 [17:31:44] (CR) jerkins-bot: [V: -1] admin: update matrix.py to add color [puppet] - https://gerrit.wikimedia.org/r/611388 (owner: Greg Grossmeier) [17:34:39] getting Failed to poll mysqli connection! on phab [17:34:56] mutante, andre__: ^ [17:35:11] WFM [17:35:23] RhinosF1: why me? [17:35:26] working here as well [17:35:47] jynus: use the search to search for '-riggle-2' [17:35:52] (PS3) Greg Grossmeier: admin: update matrix.py to add color [puppet] - https://gerrit.wikimedia.org/r/611388 [17:35:58] andre__: in case you have an idea on why [17:36:14] * RhinosF1 is checking for an account that might need disabling if i find it [17:36:35] Operations, MediaWiki-General, serviceops-radar, Performance-Team (Radar), and 3 others: Move MainStash out of Redis to a simpler multi-dc aware solution - https://phabricator.wikimedia.org/T212129 (Krinkle) [17:36:35] RhinosF1: no, I don't do databases at all [17:37:08] andre__: you might be able to help look for an LTA's phab account though? [17:37:26] RhinosF1: phabricator is known to sometimes overwhelm the db on connections, but it happens rarely enough that it wasn't worth deep research [17:37:31] RhinosF1: i can't confirm it either [17:37:31] RhinosF1: if it's search related, that may be due to the change twentyafterfour pushed out recently. [17:37:41] RhinosF1: yepp :) [17:37:48] greg-g: yeah it's search although a specific one [17:37:50] as in, it is not like a huge issue, more of a rare one [17:37:54] "-riggle-2" [17:37:59] RhinosF1, but you too as you're a member of https://phab-ban.toolforge.org/ [17:38:05] I got it when searching with that search term [17:38:12] I didn't [17:38:13] andre__: I need to find it first! 
[17:39:01] andre__: it'll be linked to https://meta.wikimedia.org/wiki/Special:CentralAuth/ZAR2020SKAYTEC or https://meta.wikimedia.org/wiki/Special:CentralAuth?target=-riggle-2 - I can find the email they always use soon [17:39:02] RhinosF1, find what? Plus no idea what "-riggle-2" means [17:39:13] (CR) Greg Grossmeier: "> 10:31:14 modules/admin/data/matrix.py:29:13: E741 ambiguous variable name 'l'" [puppet] - https://gerrit.wikimedia.org/r/611388 (owner: Greg Grossmeier) [17:39:17] andre__: a username [17:39:35] RhinosF1, errm, why do you think that people have a Phab account? Plus this is really the wrong channel for it. [17:39:52] andre__: because he nearly always does [17:40:01] Where's best [17:40:06] Operations, serviceops, Patch-For-Review, User-Elukey: Reimage one memcached shard to Buster - https://phabricator.wikimedia.org/T252391 (Krinkle) CentralAuth and ChronologyProtector are both still high-profile consumers of main stash. Both are scheduled for migration, but currently only with rel... [17:40:06] -releng [17:48:08] (CR) Thcipriani: [C: +1] admin: update matrix.py to add color (1 comment) [puppet] - https://gerrit.wikimedia.org/r/611388 (owner: Greg Grossmeier) [17:55:29] (CR) Dzahn: "python matrix.py hashar thcipriani|column -t" [puppet] - https://gerrit.wikimedia.org/r/611388 (owner: Greg Grossmeier) [17:57:03] !log change loginwiki password for Cindy-the-browser-test-bot, no email account was associated to allow for normal reset. 
[17:57:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:00:35] (03CR) 10Dzahn: "> Patch Set 3:" [puppet] - 10https://gerrit.wikimedia.org/r/611388 (owner: 10Greg Grossmeier) [18:02:59] 10Operations, 10Fundraising-Backlog, 10MediaWiki-extensions-CentralNotice, 10Traffic, and 2 others: Implement redirect for hide banner cookie issue - https://phabricator.wikimedia.org/T251780 (10Pcoombe) **Thank you banner** This certainly seems like it could solve the problem, but it would need a not insi... [18:28:38] 10Operations, 10Fundraising-Backlog, 10MediaWiki-extensions-CentralNotice, 10Traffic, and 2 others: Implement redirect for hide banner cookie issue - https://phabricator.wikimedia.org/T251780 (10AndyRussG) >>! In T251780#6297530, @Pcoombe wrote: > **Thank you banner** > This certainly seems like it could s... [18:37:35] PROBLEM - Check systemd state on kubernetes1004 is CRITICAL: connect to address 10.64.48.52 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:37:37] PROBLEM - Check size of conntrack table on kubernetes1004 is CRITICAL: connect to address 10.64.48.52 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack [18:38:27] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=swagger_check_proton_cluster_eqiad site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [18:40:19] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [18:40:51] PROBLEM - MD RAID on kubernetes1004 is CRITICAL: connect to address 10.64.48.52 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering [18:41:37] PROBLEM - puppet last run on kubernetes1004 is CRITICAL: connect to address 10.64.48.52 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [18:43:47] (03CR) 10Dzahn: [C: 03+2] mariadb: remove ferm firewall hole for gerrit servers [puppet] - 10https://gerrit.wikimedia.org/r/609884 (https://phabricator.wikimedia.org/T239151) (owner: 10Dzahn) [18:43:54] (03PS2) 10Dzahn: mariadb: remove ferm firewall hole for gerrit servers [puppet] - 10https://gerrit.wikimedia.org/r/609884 (https://phabricator.wikimedia.org/T239151) [18:44:53] !log kubernetes1004 - started nagios-nrpe-server [18:44:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:45:01] RECOVERY - Check systemd state on kubernetes1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:45:03] RECOVERY - Check size of conntrack table on kubernetes1004 is OK: OK: nf_conntrack is 0 % full https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack [18:47:28] RECOVERY - puppet last run on kubernetes1004 is OK: OK: Puppet is currently enabled, last run 16 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [18:51:41] RECOVERY - MD RAID on kubernetes1004 is OK: OK: Active: 4, Working: 4, Failed: 0, Spare: 0 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering [19:02:02] !log removing firewall hole for gerrit -> mysql servers on dbproxy servers for misc db's [19:02:05] Logged 
the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:02:11] (03PS4) 10Greg Grossmeier: admin: update matrix.py to add color [puppet] - 10https://gerrit.wikimedia.org/r/611388 [19:02:55] (03CR) 10Greg Grossmeier: "> Patch Set 3:" [puppet] - 10https://gerrit.wikimedia.org/r/611388 (owner: 10Greg Grossmeier) [19:05:32] (03PS5) 10Greg Grossmeier: admin: update matrix.py to add color [puppet] - 10https://gerrit.wikimedia.org/r/611388 [19:06:34] (03CR) 10Greg Grossmeier: "Addressed color choice (ohai bikeshed) ;)" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/611388 (owner: 10Greg Grossmeier) [19:07:19] J [19:08:56] (03CR) 10Dzahn: [C: 03+2] "python3 matrix.py --wikitext hashar thcipriani|column -t works" [puppet] - 10https://gerrit.wikimedia.org/r/611388 (owner: 10Greg Grossmeier) [19:09:15] sorry for all the patchsets :) [19:09:30] multitasking while in meetings (I know I know, it's Friday) [19:09:51] heh, no worries. my comments were also just because of python vs python3 [19:10:58] so gerrit still works even after the dbproxy servers closed their firewall holes for them now. guaranteed to not use mysql anymore. [19:11:17] awesome! [19:11:34] running that on all dbproxy* was giving me minimal concern, but it had plenty of +1s [19:12:16] (03PS4) 10Dzahn: releases: remove duplicate rsync code from blubber and parsoid classes [puppet] - 10https://gerrit.wikimedia.org/r/610402 [19:13:10] ^ all this time we are rsyncing stuff multiple times. one cron does /srv/org/wikimedia/releases and others do some subdirs of that.. that are already included anyways [19:13:50] this is part of replacing releases* backends with buster. 
needed the option to sync to multiple secondary servers, not just one [19:15:01] ooooh bustre [19:15:03] good good [19:18:12] Operations, Fundraising-Backlog, MediaWiki-extensions-CentralNotice, Traffic, and 2 others: Implement redirect for hide banner cookie issue - https://phabricator.wikimedia.org/T251780 (DStrine) It's great we are considering alternatives. I really want to highlight the effort needed to set this up... [19:27:39] (CR) Dzahn: "gerrit is still working after this ran on all dbproxy for misc databases. you can remove the GRANTs next" [puppet] - https://gerrit.wikimedia.org/r/609884 (https://phabricator.wikimedia.org/T239151) (owner: Dzahn) [19:37:00] Krinkle ema bblack mutante around? [19:37:29] Trying to figure out if it's possible to vary a Varnish response based on cookie [19:41:35] AndyRussG: possible - yes, i think so. based on comments like "Vary:Cookie" "Cookie:Token=1 value for Vary purposes" in modules/varnish/templates/text-frontend.inc.vcl.erb but i am not on the traffic team and don't know much about VCL. try asking for more details in the -traffic channel [19:42:18] mutante: ahhhh thanks I didn't know about that channel! [19:42:42] sure, yw [19:43:31] what about ATS? [19:43:43] (this question may not make sense, just checking) [19:52:39] apergos: ? (or unrelated to what I was asking?) [19:55:04] eventually are we not moving off of varnish entirely? and ats is deployed in part... so maybe the same question applies? or is that only the back ends? [19:57:56] apergos: oh thanks! any ideas about timelines for that? [19:58:10] well I don't know how true my representation of it is [19:58:15] it could be crap :-D [19:58:24] I mean I know there is some back end work [19:58:44] https://phabricator.wikimedia.org/T227432 like this [19:59:22] but what the plans are for the front end instances, I dunno [20:02:42] there might even be zero plans... 
[20:03:43] (03CR) 10Dzahn: [C: 03+2] releases: remove duplicate rsync code from blubber and parsoid classes [puppet] - 10https://gerrit.wikimedia.org/r/610402 (owner: 10Dzahn) [20:10:33] apergos: k thanks! [20:10:47] prolly shouldn't have asked the question in the first place :-D [20:11:01] oh well back to my slow progress on this heisenbug... [20:13:01] (03CR) 10Dzahn: "remove the additional crontab entries from releases2001 manually. the one syncing all of /srv/org/wikimedia/releases is still there and wo" [puppet] - 10https://gerrit.wikimedia.org/r/610402 (owner: 10Dzahn) [20:15:01] apergos: :) [20:18:45] 10Operations, 10Fundraising-Backlog, 10MediaWiki-extensions-CentralNotice, 10Traffic, and 2 others: Implement redirect for hide banner cookie issue - https://phabricator.wikimedia.org/T251780 (10Ejegg) @Pcoombe We started diving into this solution, and then realized that the Special:BannerLoader page is he... [20:22:01] (03PS1) 10Andrew Bogott: Openstack Nova: move database access to galera on cloudcontrol nodes [puppet] - 10https://gerrit.wikimedia.org/r/611421 (https://phabricator.wikimedia.org/T242455) [20:24:51] (03PS2) 10Dzahn: releases: move rsync code for all releases from mediawiki to common [puppet] - 10https://gerrit.wikimedia.org/r/610403 [20:27:38] (03CR) 10Andrew Bogott: [C: 03+2] Openstack Nova: move database access to galera on cloudcontrol nodes [puppet] - 10https://gerrit.wikimedia.org/r/611421 (https://phabricator.wikimedia.org/T242455) (owner: 10Andrew Bogott) [20:43:55] PROBLEM - Widespread puppet agent failures on icinga1001 is CRITICAL: 0.0101 ge 0.01 https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/yOxVDGvWk/puppet [20:44:37] (03PS1) 10Andrew Bogott: openstack nova: point eqiad1 to the eqiad1 galera [puppet] - 10https://gerrit.wikimedia.org/r/611422 [20:54:21] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR 
site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [20:55:18] (03PS1) 10Urbanecm: Add ary language [dns] - 10https://gerrit.wikimedia.org/r/611426 (https://phabricator.wikimedia.org/T257674) [21:01:48] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [21:13:35] 10Operations, 10MediaWiki-extensions-Score, 10Security-Team, 10Wikimedia-General-or-Unknown, and 3 others: Extension:Score / Lilypond is disabled on all wikis - https://phabricator.wikimedia.org/T257066 (10sbassett) >>! In T257066#6297061, @Tgr wrote: > Have we made an effort to reach out to non-Wikimedia... [21:23:18] PROBLEM - nova-compute proc minimum on cloudvirt1008 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [21:25:09] RECOVERY - nova-compute proc minimum on cloudvirt1008 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [21:31:23] (03CR) 10Dzahn: [C: 03+2] "https://puppet-compiler.wmflabs.org/compiler1003/23818/" [puppet] - 10https://gerrit.wikimedia.org/r/610403 (owner: 10Dzahn) [21:32:28] (03PS3) 10C. 
Scott Ananian: VisualEditor: Explicitly set visualeditor-enable to 0 when non-default [mediawiki-config] - 10https://gerrit.wikimedia.org/r/610156 (https://phabricator.wikimedia.org/T248343) [21:32:30] (03CR) 10Andrew Bogott: [C: 03+2] openstack nova: point eqiad1 to the eqiad1 galera [puppet] - 10https://gerrit.wikimedia.org/r/611422 (owner: 10Andrew Bogott) [21:36:50] (03PS2) 10Dzahn: releases: move more common code out of the mediawiki class [puppet] - 10https://gerrit.wikimedia.org/r/610404 [21:38:05] (03CR) 10jerkins-bot: [V: 04-1] releases: move more common code out of the mediawiki class [puppet] - 10https://gerrit.wikimedia.org/r/610404 (owner: 10Dzahn) [21:48:45] RECOVERY - Widespread puppet agent failures on icinga1001 is OK: (C)0.01 ge (W)0.006 ge 0.005682 https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/yOxVDGvWk/puppet [21:52:46] !log Started long-running reindex of Elasticsearch indices in `eqiad`, `codfw`, and `dewiki` on `mwmaint1002` under tmux session `reindex` for user `ryankemper` [21:52:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:54:56] (03PS1) 10RhinosF1: create lijwikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/611435 [21:55:18] Amir1, Urbanecm: ^ [21:55:48] (03PS2) 10RhinosF1: create lijwikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/611435 (https://phabricator.wikimedia.org/T257672) [21:55:53] (03CR) 10jerkins-bot: [V: 04-1] create lijwikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/611435 (https://phabricator.wikimedia.org/T257672) (owner: 10RhinosF1) [21:56:40] (03CR) 10jerkins-bot: [V: 04-1] create lijwikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/611435 (https://phabricator.wikimedia.org/T257672) (owner: 10RhinosF1) [21:57:09] * RhinosF1 knew his mac hated IS.php [21:57:36] https://usercontent.irccloud-cdn.com/file/YeL7xrJw/Screenshot%202020-07-10%20at%2022.57.30.png [21:57:48] 
paladox: gerrit UI won't load it either [22:06:25] @RhinosF1 https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/611435 loads for me [22:06:41] mooeypoo: try editing the patch [22:07:11] I never used that feature, but I see the patch still with "Stop editing" button now instead of "edit" [22:08:30] The edit URL is different for me than in your screenshot though https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/611435,edit [22:08:44] yours seem to have the trailing /2 [22:09:25] If I manually add the /2, then click "edit", it redirects me to the /611435,edit link [22:14:37] (03CR) 10Dzahn: [C: 04-1] "https://puppet-compiler.wmflabs.org/compiler1002/23819/releases1001.eqiad.wmnet/change.releases1001.eqiad.wmnet.err" [puppet] - 10https://gerrit.wikimedia.org/r/610404 (owner: 10Dzahn) [22:16:43] mooeypoo: click on IS.php [22:19:53] (03PS3) 10Dzahn: releases: move more common code out of the mediawiki class [puppet] - 10https://gerrit.wikimedia.org/r/610404 [22:21:01] RhinosF1: i have no issue loading https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/611435/2/wmf-config/InitialiseSettings.php not even slow [22:21:24] mutante: no in edit mode [22:22:07] https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/611435/2/wmf-config/InitialiseSettings.php,edit [22:22:13] RhinosF1: yea, that makes my fan start for a moment [22:22:43] IS just became too large [22:23:00] mutante: yeah xcode was extremely slow at surviving it and somehow still messed it up [22:23:03] it does work eventually though [22:23:07] just needed to wait a bit [22:23:40] * RhinosF1 wonders how long "a bit" is [22:23:55] RhinosF1: i don't know xcode but if that's an IDE then why use the browser edit mode? 
[22:24:17] mutante: because it messed it up [22:24:25] that's when jenkins -1'd [22:24:29] and was super slow [22:24:46] (03PS9) 10Ryan Kemper: Scale largest shards to be closer to 30GB [mediawiki-config] - 10https://gerrit.wikimedia.org/r/608965 (https://phabricator.wikimedia.org/T256928) [22:25:38] (03CR) 10Ryan Kemper: [V: 03+2 C: 03+2] "shipping it" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/608965 (https://phabricator.wikimedia.org/T256928) (owner: 10Ryan Kemper) [22:25:38] RhinosF1: "a bit" = 1 minute for me on my hardware [22:26:12] it's loading but still unusably slow to actually scroll/edit [22:26:14] jouncebot: now [22:26:14] For the next 8 hour(s) and 33 minute(s): No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200710T0700) [22:26:31] is this an emergency deploy? [22:26:40] no [22:26:44] it's a new wiki [22:26:56] i am talking about the mw-config change [22:27:01] that just got merged [22:27:06] oh [22:27:41] ryankemper: ^ [22:28:53] I kicked off a reindex job and then realized I hadn't merged the corresponding config changes yet [22:30:18] gotcha [22:30:50] (These changes just touch our cirrussearch/elasticsearch shard replica counts) [22:33:30] alright, ack [22:40:48] ryankemper: are you going to sync your config change? 
🙂 [22:43:22] Urbanecm: my bad, is there documentation somewhere on how to do that [22:44:35] ryankemper: there's https://wikitech.wikimedia.org/wiki/Heterogeneous_deployment#Change_wiki_configuration [22:45:10] thanks, taking a look at the steps now [22:45:30] hmm, do you really want to do this the first time ever on a Friday afternoon outside deployment windows [22:45:49] ftr, unless it's synced, it doesn't take effect, and it probably will confuse whoever will make changes after you ryankemper [22:45:54] +1 to what mutante says [22:46:10] this is normally only for emergencies [22:46:19] yeah, that's a good point [22:46:28] should be simple as me opening up a revert commit in gerrit right? [22:46:31] since the changes haven't been synced yet [22:46:56] yup, and fetching the commits to deploy1001 to not confuse people [22:49:00] (03PS1) 10Ryan Kemper: Revert "Scale largest shards to be closer to 30GB" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/611447 [22:50:40] (03CR) 10Ryan Kemper: [V: 03+2 C: 03+2] "going to self-approve given this reverts a patch that hasn't been deployed" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/611447 (owner: 10Ryan Kemper) [22:51:34] (03Merged) 10jenkins-bot: Revert "Scale largest shards to be closer to 30GB" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/611447 (owner: 10Ryan Kemper) [22:54:06] ryankemper: sorry, by "fetching", I meant fetching and rebasing [22:54:28] ack [22:58:42] Urbanecm: okay, `/srv/mediawiki-staging` is fetched and is on the head of `origin/master` [22:58:53] thanks all for helping me sort this out [22:59:01] and sorry for generating noise :x [22:59:08] Thanks ryankemper :) [23:06:23] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers 
https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [23:10:05] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [23:10:06] (03PS1) 10Ahmon Dancy: Moved a comment to a better place [puppet] - 10https://gerrit.wikimedia.org/r/611455 [23:11:43] (03PS1) 10Ahmon Dancy: Allow aptly::repo commands to run as alternate user [puppet] - 10https://gerrit.wikimedia.org/r/611457 (https://phabricator.wikimedia.org/T250157) [23:13:03] (03CR) 10jerkins-bot: [V: 04-1] Allow aptly::repo commands to run as alternate user [puppet] - 10https://gerrit.wikimedia.org/r/611457 (https://phabricator.wikimedia.org/T250157) (owner: 10Ahmon Dancy) [23:23:48] 10Operations, 10Phabricator, 10Security-Team: Can't access phabricator from my server - https://phabricator.wikimedia.org/T257507 (10Legoktm) Can we exempt Isarra's IP from the blocklist in the meantime? [23:49:06] 10Operations, 10Phabricator, 10Security-Team: Can't access phabricator from my server - https://phabricator.wikimedia.org/T257507 (10Isarra) If the IP block exempt right onwiki also works (at least some of the time) per T254568, maybe I should just go request that? I never bothered because I so rarely actual... [23:58:36] 10Operations, 10Phabricator, 10Security-Team: Can't access phabricator from my server - https://phabricator.wikimedia.org/T257507 (10Reedy) >>! In T257507#6298076, @Isarra wrote: > If the IP block exempt right onwiki also works (at least some of the time) per T254568, maybe I should just go request that? I n...