[00:00:10] <AndyRussG>	 https://webkit.org/tracking-prevention/
[00:00:33] <AndyRussG>	 Krinkle: ok wow thanks much... So, new summary, the redirect strategy won't work, either?
[00:00:43] <Krinkle>	 indeed
[00:00:51] <AndyRussG>	 ejegg: ^
[00:01:04] <Krinkle>	 it also risks putting us on a list of badly behaving websites, so please don't try
[00:01:23] <AndyRussG>	 ok gotcha
[00:01:37] <Krinkle>	 I have some ideas but let's discuss that another time, happy to set up a call
[00:01:46] <AndyRussG>	 thanks so much for this info, yes!
[00:02:09] <ejegg>	 oh dang
[00:02:44] <AndyRussG>	 ejegg: Krinkle: I think for now, then, the solution is to put a link on the thank you page explicitly asking donors to click on it if they want to hide fundraising banners
[00:03:19] <Krinkle>	 that is in fact what I was going to suggest. email them a thing with a nice message, thanks again for donating, we won't show you etc.
[00:03:42] <AndyRussG>	 Yeah they do get an e-mail currently
[00:04:17] <Krinkle>	 if it requires a user action though, does the close button not suffice upon first seeing it?
[00:04:25] <ejegg>	 hmm, the bounce tracking classification seems to penalize things with multiple redirect destinations, while we would send everyone to the TY page
[00:04:57] <ejegg>	 Krinkle: this is for people who DON'T close the banner
[00:04:57] <Krinkle>	 I guess they can hide it already but it's seeing the banner that's confusing
[00:05:05] <ejegg>	 and instead click through it to donate
[00:05:17] <Krinkle>	 right, sure but after that upon first seeing it
[00:05:35] <Krinkle>	 also, I assume for cases where people donate from a Wikipedia banner there is already no issue since that's first party?
[00:05:39] <AndyRussG>	 Krinkle: ejegg: the close button also sets a hide cookie and invokes the 3rd party cookie storm, but we're not really worrying about that for now
[00:05:41] <ejegg>	 the donor cookie also lasts a lot longer than the close-banner cookie
[00:05:53] <Krinkle>	 I see, that's fair.
[00:05:58] <Krinkle>	 although see about long lasting cookies
[00:06:00] <AndyRussG>	 ejegg: Krinkle: we could have the cookie set from the banner itself
[00:06:02] <Krinkle>	 but I get the theory
[00:06:14] <AndyRussG>	 when they click to donate
[00:06:27] <ejegg>	 AndyRussG: right, assume they will get through the rest of the flow
[00:06:33] <AndyRussG>	 the only disadvantage being that they'll still get the cookie even if the donation wasn't successful
[00:06:35] <AndyRussG>	 yeah
[00:06:43] <Krinkle>	 to confirm, enwiki -> banner -> donate.wikipedia -> thanks -> img wikipedia *.wikipedia hidebanner 1 year
[00:06:47] <Krinkle>	 that tunnel works?
[00:06:55] <AndyRussG>	 Krinkle: almost
[00:08:10] <AndyRussG>	 enwiki banner -> payments wiki -> maybe redirect to payment provider or otherwise iframe -> donate wiki thanks, with background injection of img tags from WP and other domains into DOM and those requests for those images set the cookies
[00:08:35] <Krinkle>	 right
[00:08:42] <Krinkle>	 and the wikipedia one is working
[00:09:08] <Krinkle>	 but the others are set in an alternate universe by Safari specifically only for cross-origin requests between wikipedia and <other domain>
[00:09:12] <ejegg>	 AndyRussG: man, https://webkit.org/tracking-prevention/ pours cold water on banner history
[00:09:53] <ejegg>	 looks like it gets erased after 7 days of not visiting the site (see section 7-Day Cap on All Script-Writeable Storage )
[00:10:29] <ejegg>	 Krinkle: the wikipedia one isn't working for Safari users
[00:10:40] <Krinkle>	 ok, why :)
[00:11:00] <ejegg>	 because m != p
[00:11:08] <Krinkle>	 oh, it's not donate.wikipedia.org anymore?
[00:11:09] <ejegg>	 i.e. donate.wikiMedia.org
[00:11:14] <Krinkle>	 maybe never was
[00:11:17] <ejegg>	 vs en.wikiPedia.org
[00:11:17] <Krinkle>	 we should fix that?
[00:11:23] <ejegg>	 hehe, that's the rebrand
[00:11:30] <Krinkle>	 well, doesn't have to be
[00:11:39] <ejegg>	 and will involve plenty of other fun
[00:11:50] <ejegg>	 but yeah, would at least help us with cookies!
[00:11:54] <Krinkle>	 it's your only option for this though
[00:12:06] <Krinkle>	 also wouldn't need the image in that case, can do *.wikipedia.org directly
[00:12:18] <ejegg>	 right
[00:13:09] <Krinkle>	 but yeah, I don't see a way out of this apart from a list of links in the thank you page to opt-in to hiding the banner by visiting a link and maybe a simple confirmation on the other hand to feel good with a link back to the thank you page and/or the same list again.
[00:13:09] <ejegg>	 hmm, looks like donate.wikipedia.org does redirect to donate.wikimedia.org
[00:13:18] <Krinkle>	 redirecting would risk being rejected again
[00:13:50] <Krinkle>	 wmde could embed that same list/chain of links
[00:14:11] <Krinkle>	 I recently renamed test.wikimedia.beta.wmflabs to test.wikiPedia.beta.wmflabs
[00:14:14] <Krinkle>	 that was quite easy
[00:14:48] <Krinkle>	 ejegg: alternatively all you really need is a single HTML file on a wikipedia domain
[00:15:04] <AndyRussG>	 Krinkle: ejegg: we really do need a deeper investigation into all this
[00:15:17] <Krinkle>	 I mean, it could be thanks.wikipedia.org/index.html if we want a quick hack
[00:15:40] <AndyRussG>	 yeah that would also do it :)
[00:15:54] <Krinkle>	 we have a microsite cluster now for simple sites like this
[00:16:13] <AndyRussG>	 ejegg: for the NL campaign, though, we'll suggest in-banner JS to set the cookie when people click to donate?
[00:16:41] <AndyRussG>	 We already have facilities for trying to run code when people click to navigate away, if it's a link (not sure if it still is)
[00:16:46] <ejegg>	 AndyRussG: I guess so
[00:17:13] <ejegg>	 the payments-wiki redirect is part of the core banner js
[00:17:17] <AndyRussG>	 ejegg: probably faster than quickly implementing donate.wikipedia.org thank-you
[00:17:36] <Krinkle>	 yeah, if hiding it on first attempt to donate is acceptable that would be even simpler
[00:17:40] <Krinkle>	 also works naturally for other projects
[00:17:46] <Krinkle>	 e.g. if we run on non-Wikipedia
[00:18:04] <AndyRussG>	 yeah... currently only FR on WP
[00:18:32] <ejegg>	 Guessing we'd lose a number of donations from ppl who want to donate, click, lose or close tab, and keep browsing
[00:18:44] <ejegg>	 who would otherwise click again on next banner view
[00:20:00] <Krinkle>	 or, (sorry), realistically find it again two weeks later :)
[00:20:56] <icinga-wm>	 PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[00:21:29] <Krinkle>	 ejegg: I'm just thinking out loud about stuff I'm not an expert in, but I suppose it might also be possible to e.g. let the on-click cookie only reduce the severity of the banner
[00:21:37] <Krinkle>	 and still have the thank you opt-out thing
[00:21:57] <AndyRussG>	 Krinkle: hey also an idea
[00:21:58] <Krinkle>	 e.g. bucket them into something lighter where they can pick it up or say "I already donated"
[00:22:15] <AndyRussG>	 yeah
[00:22:42] <AndyRussG>	 Krinkle: ejegg: apologies, my kids have been waiting for me to eat with them, I should run...
[00:22:53] <Krinkle>	 that might even remove the need for automatic hiding if it's subtle/respectable enough
[00:23:05] <ejegg>	 ok AndyRussG, buen provecho
[00:23:06] <AndyRussG>	 Krinkle: thanks so so much for all the info and advice on this
[00:23:29] <AndyRussG>	 Krinkle: we'll definitely take you up on the offer to talk through this live!!! thanks again :)
[00:23:32] <AndyRussG>	 ejegg: ¡gracias!
[00:23:39] <ejegg>	 yes, thanks very much Krinkle!
[00:23:51] <Krinkle>	 k, going afk myself. if there's more cookie brainstorming ahead, feel free to pull me in
[00:23:52] <AndyRussG>	 ejegg: and also thanks 4 sticking around and helping with this :)
[00:24:11] <AndyRussG>	 :) Krinkle quick last question: pointer to the task for CentralAuth stuff?
[00:24:49] <Krinkle>	 https://phabricator.wikimedia.org/T252236
[00:26:36] <icinga-wm>	 RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[00:28:17] <AndyRussG>	 Krinkle: thx!!
[00:31:58] <icinga-wm>	 PROBLEM - rsyslog in codfw is failing to deliver messages on icinga1001 is CRITICAL: action={fwd_centrallog1001.eqiad.wmnet:6514,fwd_centrallog2001.codfw.wmnet:6514} https://wikitech.wikimedia.org/wiki/Rsyslog https://grafana.wikimedia.org/d/000000596/rsyslog?var-datasource=codfw+prometheus/ops
[00:32:05] <wikibugs>	 10Operations, 10Fundraising-Backlog, 10MediaWiki-extensions-CentralNotice, 10Traffic, and 2 others: Implement redirect for hide banner cookie issue - https://phabricator.wikimedia.org/T251780 (10Ejegg) @Krinkle explained in IRC that this approach will probably not work long term, and risks Wikipedia being...
[00:34:40] <icinga-wm>	 PROBLEM - rsyslog in eqiad is failing to deliver messages on icinga1001 is CRITICAL: action={fwd_centrallog1001.eqiad.wmnet:6514,fwd_centrallog2001.codfw.wmnet:6514} https://wikitech.wikimedia.org/wiki/Rsyslog https://grafana.wikimedia.org/d/000000596/rsyslog?var-datasource=eqiad+prometheus/ops
[00:45:36] <icinga-wm>	 PROBLEM - rsyslog TLS listener on port 6514 on centrallog1001 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/Logs
[00:47:30] <icinga-wm>	 RECOVERY - rsyslog TLS listener on port 6514 on centrallog1001 is OK: SSL OK - Certificate centrallog1001.eqiad.wmnet valid until 2024-06-25 15:42:33 +0000 (expires in 1446 days) https://wikitech.wikimedia.org/wiki/Logs
[00:47:38] <Reedy>	 !log truncated labswiki.interwiki table (outdated and unnecessary)
[00:47:42] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:54:36] <icinga-wm>	 PROBLEM - Too many messages in kafka logging-eqiad on icinga1001 is CRITICAL: cluster=misc exported_cluster=logging-eqiad group={logstash,logstash-codfw,logstash7-codfw,logstash7-eqiad} instance=kafkamon1001 job=burrow partition={0,1,2,3,4,5} site=eqiad topic=rsyslog-notice https://wikitech.wikimedia.org/wiki/Logstash%23Kafka_consumer_lag https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?from=now-3h&to=now&orgId=1&var-
[00:54:37] <icinga-wm>	 prometheus/ops&var-cluster=logging-eqiad&var-topic=All&var-consumer_group=All
[00:56:56] <wikibugs>	 (03PS1) 10Bstorm: kubeadm: If using a stacked control plane, expose etcd metrics [puppet] - 10https://gerrit.wikimedia.org/r/610980 (https://phabricator.wikimedia.org/T256361)
[00:57:34] <icinga-wm>	 RECOVERY - rsyslog in eqiad is failing to deliver messages on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Rsyslog https://grafana.wikimedia.org/d/000000596/rsyslog?var-datasource=eqiad+prometheus/ops
[00:57:46] <icinga-wm>	 RECOVERY - rsyslog in codfw is failing to deliver messages on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Rsyslog https://grafana.wikimedia.org/d/000000596/rsyslog?var-datasource=codfw+prometheus/ops
[00:59:37] <wikibugs>	 (03CR) 10Bstorm: [C: 04-1] "Just realized this config cannot be generally applied because it will only include the IP of the bootstrapping node." [puppet] - 10https://gerrit.wikimedia.org/r/610980 (https://phabricator.wikimedia.org/T256361) (owner: 10Bstorm)
[01:05:15] <wikibugs>	 (03PS2) 10Bstorm: kubeadm: If using a stacked control plane, expose etcd metrics [puppet] - 10https://gerrit.wikimedia.org/r/610980 (https://phabricator.wikimedia.org/T256361)
[01:08:13] <wikibugs>	 (03CR) 10Bstorm: "Ok that corrects it." [puppet] - 10https://gerrit.wikimedia.org/r/610980 (https://phabricator.wikimedia.org/T256361) (owner: 10Bstorm)
[01:09:28] <wikibugs>	 10Operations, 10Fundraising-Backlog, 10MediaWiki-extensions-CentralNotice, 10Traffic, and 2 others: Implement redirect for hide banner cookie issue - https://phabricator.wikimedia.org/T251780 (10AndyRussG) > He suggested we put the thank you page on a *.wikipedia.org domain rather than *.wikimedia.org, so...
[01:11:52] <icinga-wm>	 PROBLEM - rsyslog in eqiad is failing to deliver messages on icinga1001 is CRITICAL: action={fwd_centrallog1001.eqiad.wmnet:6514,fwd_centrallog2001.codfw.wmnet:6514} https://wikitech.wikimedia.org/wiki/Rsyslog https://grafana.wikimedia.org/d/000000596/rsyslog?var-datasource=eqiad+prometheus/ops
[01:12:04] <icinga-wm>	 PROBLEM - rsyslog in codfw is failing to deliver messages on icinga1001 is CRITICAL: action={fwd_centrallog1001.eqiad.wmnet:6514,fwd_centrallog2001.codfw.wmnet:6514} https://wikitech.wikimedia.org/wiki/Rsyslog https://grafana.wikimedia.org/d/000000596/rsyslog?var-datasource=codfw+prometheus/ops
[01:21:13] <wikibugs>	 10Operations, 10ops-codfw, 10netops: (Need by: End of July-2020 ) codfw:rack/setup/new management switches - https://phabricator.wikimedia.org/T253154 (10Papaul)
[01:34:31] <wikibugs>	 10Operations, 10Fundraising-Backlog, 10MediaWiki-extensions-CentralNotice, 10Traffic, and 2 others: Implement redirect for hide banner cookie issue - https://phabricator.wikimedia.org/T251780 (10AndyRussG) Just another idea for the shortest-term solution: in the banner, we could try to determine the user's...
[01:34:48] <icinga-wm>	 RECOVERY - rsyslog in eqiad is failing to deliver messages on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Rsyslog https://grafana.wikimedia.org/d/000000596/rsyslog?var-datasource=eqiad+prometheus/ops
[01:34:54] <icinga-wm>	 RECOVERY - rsyslog in codfw is failing to deliver messages on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Rsyslog https://grafana.wikimedia.org/d/000000596/rsyslog?var-datasource=codfw+prometheus/ops
[01:41:20] <icinga-wm>	 RECOVERY - Too many messages in kafka logging-eqiad on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Logstash%23Kafka_consumer_lag https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?from=now-3h&to=now&orgId=1&var-datasource=eqiad+prometheus/ops&var-cluster=logging-eqiad&var-topic=All&var-consumer_group=All
[01:43:34] <wikibugs>	 (03PS1) 10Dzahn: admins: add Conny Kawohl to ldap_only admins (wmde/nda) [puppet] - 10https://gerrit.wikimedia.org/r/611002 (https://phabricator.wikimedia.org/T257038)
[01:44:42] <mutante>	 !log LDAP - adding coka to wmde and nda (T257038)
[01:44:46] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:44:47] <stashbot>	 T257038: Add Conny Kawohl to the ldap/wmde and ldap/nda group - https://phabricator.wikimedia.org/T257038
[01:46:07] <wikibugs>	 (03CR) 10Dzahn: [C: 03+2] admins: add Conny Kawohl to ldap_only admins (wmde/nda) [puppet] - 10https://gerrit.wikimedia.org/r/611002 (https://phabricator.wikimedia.org/T257038) (owner: 10Dzahn)
[01:47:19] <wikibugs>	 10Operations, 10LDAP-Access-Requests, 10Patch-For-Review: Add Conny Kawohl to the ldap/wmde and ldap/nda group - https://phabricator.wikimedia.org/T257038 (10Dzahn)
[01:47:45] <wikibugs>	 10Operations, 10LDAP-Access-Requests, 10Patch-For-Review: Add Conny Kawohl to the ldap/wmde and ldap/nda group - https://phabricator.wikimedia.org/T257038 (10Dzahn) 05Open→03Resolved @conny-kawohl_WMDE This is done, you have been added to the wmde and nda LDAP groups.
[01:49:10] <icinga-wm>	 PROBLEM - rsyslog in eqiad is failing to deliver messages on icinga1001 is CRITICAL: action={fwd_centrallog1001.eqiad.wmnet:6514,fwd_centrallog2001.codfw.wmnet:6514} https://wikitech.wikimedia.org/wiki/Rsyslog https://grafana.wikimedia.org/d/000000596/rsyslog?var-datasource=eqiad+prometheus/ops
[01:49:14] <icinga-wm>	 PROBLEM - rsyslog in codfw is failing to deliver messages on icinga1001 is CRITICAL: action={fwd_centrallog1001.eqiad.wmnet:6514,fwd_centrallog2001.codfw.wmnet:6514} https://wikitech.wikimedia.org/wiki/Rsyslog https://grafana.wikimedia.org/d/000000596/rsyslog?var-datasource=codfw+prometheus/ops
[02:00:12] <icinga-wm>	 PROBLEM - rsyslog TLS listener on port 6514 on centrallog1001 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection reset by peer https://wikitech.wikimedia.org/wiki/Logs
[02:00:37] <icinga-wm>	 PROBLEM - rsyslog in codfw is failing to deliver messages on icinga1001 is CRITICAL: action={fwd_centrallog1001.eqiad.wmnet:6514,fwd_centrallog2001.codfw.wmnet:6514} https://wikitech.wikimedia.org/wiki/Rsyslog https://grafana.wikimedia.org/d/000000596/rsyslog?var-datasource=codfw+prometheus/ops
[02:02:34] <icinga-wm>	 RECOVERY - rsyslog TLS listener on port 6514 on centrallog1001 is OK: SSL OK - Certificate centrallog1001.eqiad.wmnet valid until 2024-06-25 15:42:33 +0000 (expires in 1446 days) https://wikitech.wikimedia.org/wiki/Logs
[02:26:17] <icinga-wm>	 RECOVERY - rsyslog in eqiad is failing to deliver messages on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Rsyslog https://grafana.wikimedia.org/d/000000596/rsyslog?var-datasource=eqiad+prometheus/ops
[02:29:14] <icinga-wm>	 RECOVERY - rsyslog in codfw is failing to deliver messages on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Rsyslog https://grafana.wikimedia.org/d/000000596/rsyslog?var-datasource=codfw+prometheus/ops
[02:43:32] <icinga-wm>	 PROBLEM - rsyslog in eqiad is failing to deliver messages on icinga1001 is CRITICAL: action={fwd_centrallog1001.eqiad.wmnet:6514,fwd_centrallog2001.codfw.wmnet:6514} https://wikitech.wikimedia.org/wiki/Rsyslog https://grafana.wikimedia.org/d/000000596/rsyslog?var-datasource=eqiad+prometheus/ops
[02:43:36] <icinga-wm>	 PROBLEM - rsyslog in codfw is failing to deliver messages on icinga1001 is CRITICAL: action={fwd_centrallog1001.eqiad.wmnet:6514,fwd_centrallog2001.codfw.wmnet:6514} https://wikitech.wikimedia.org/wiki/Rsyslog https://grafana.wikimedia.org/d/000000596/rsyslog?var-datasource=codfw+prometheus/ops
[03:20:22] <icinga-wm>	 PROBLEM - Too many messages in kafka logging-eqiad on icinga1001 is CRITICAL: cluster=misc exported_cluster=logging-eqiad group={logstash,logstash-codfw} instance=kafkamon1001 job=burrow partition={0,1,2,3,4,5} site=eqiad topic=rsyslog-notice https://wikitech.wikimedia.org/wiki/Logstash%23Kafka_consumer_lag https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?from=now-3h&to=now&orgId=1&var-datasource=eqiad+prometheus/ops&
[03:20:22] <icinga-wm>	 ng-eqiad&var-topic=All&var-consumer_group=All
[03:25:12] <icinga-wm>	 PROBLEM - rsyslog TLS listener on port 6514 on centrallog2001 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection reset by peer https://wikitech.wikimedia.org/wiki/Logs
[03:27:10] <icinga-wm>	 RECOVERY - rsyslog TLS listener on port 6514 on centrallog2001 is OK: SSL OK - Certificate centrallog2001.codfw.wmnet valid until 2024-11-16 16:04:24 +0000 (expires in 1590 days) https://wikitech.wikimedia.org/wiki/Logs
[03:35:06] <icinga-wm>	 RECOVERY - rsyslog in eqiad is failing to deliver messages on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Rsyslog https://grafana.wikimedia.org/d/000000596/rsyslog?var-datasource=eqiad+prometheus/ops
[03:35:08] <icinga-wm>	 RECOVERY - rsyslog in codfw is failing to deliver messages on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Rsyslog https://grafana.wikimedia.org/d/000000596/rsyslog?var-datasource=codfw+prometheus/ops
[03:42:44] <icinga-wm>	 RECOVERY - Too many messages in kafka logging-eqiad on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Logstash%23Kafka_consumer_lag https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?from=now-3h&to=now&orgId=1&var-datasource=eqiad+prometheus/ops&var-cluster=logging-eqiad&var-topic=All&var-consumer_group=All
[03:49:20] <icinga-wm>	 PROBLEM - rsyslog in eqiad is failing to deliver messages on icinga1001 is CRITICAL: action={fwd_centrallog1001.eqiad.wmnet:6514,fwd_centrallog2001.codfw.wmnet:6514} https://wikitech.wikimedia.org/wiki/Rsyslog https://grafana.wikimedia.org/d/000000596/rsyslog?var-datasource=eqiad+prometheus/ops
[03:49:24] <icinga-wm>	 PROBLEM - rsyslog in codfw is failing to deliver messages on icinga1001 is CRITICAL: action={fwd_centrallog1001.eqiad.wmnet:6514,fwd_centrallog2001.codfw.wmnet:6514} https://wikitech.wikimedia.org/wiki/Rsyslog https://grafana.wikimedia.org/d/000000596/rsyslog?var-datasource=codfw+prometheus/ops
[04:18:22] <icinga-wm>	 PROBLEM - Too many messages in kafka logging-eqiad on icinga1001 is CRITICAL: cluster=misc exported_cluster=logging-eqiad group={logstash,logstash-codfw,logstash7-codfw,logstash7-eqiad} instance=kafkamon1001 job=burrow partition={0,1,2,3,4,5} site=eqiad topic=rsyslog-notice https://wikitech.wikimedia.org/wiki/Logstash%23Kafka_consumer_lag https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?from=now-3h&to=now&orgId=1&var-
[04:18:22] <icinga-wm>	 prometheus/ops&var-cluster=logging-eqiad&var-topic=All&var-consumer_group=All
[04:20:12] <icinga-wm>	 PROBLEM - rsyslog TLS listener on port 6514 on centrallog2001 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/Logs
[04:22:56] <icinga-wm>	 RECOVERY - rsyslog TLS listener on port 6514 on centrallog2001 is OK: SSL OK - Certificate centrallog2001.codfw.wmnet valid until 2024-11-16 16:04:24 +0000 (expires in 1590 days) https://wikitech.wikimedia.org/wiki/Logs
[04:35:12] <icinga-wm>	 PROBLEM - rsyslog TLS listener on port 6514 on centrallog1001 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection reset by peer https://wikitech.wikimedia.org/wiki/Logs
[04:36:04] <icinga-wm>	 RECOVERY - rsyslog TLS listener on port 6514 on centrallog1001 is OK: SSL OK - Certificate centrallog1001.eqiad.wmnet valid until 2024-06-25 15:42:33 +0000 (expires in 1446 days) https://wikitech.wikimedia.org/wiki/Logs
[04:40:12] <icinga-wm>	 PROBLEM - rsyslog TLS listener on port 6514 on centrallog2001 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/Logs
[04:42:52] <wikibugs>	 (03PS1) 10Marostegui: dbproxy1017: Enable notifications [puppet] - 10https://gerrit.wikimedia.org/r/611067 (https://phabricator.wikimedia.org/T255408)
[04:44:29] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1134', diff saved to https://phabricator.wikimedia.org/P11839 and previous config saved to /var/cache/conftool/dbconfig/20200710-044428-marostegui.json
[04:44:34] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:45:10] <wikibugs>	 (03CR) 10Marostegui: [C: 03+2] dbproxy1017: Enable notifications [puppet] - 10https://gerrit.wikimedia.org/r/611067 (https://phabricator.wikimedia.org/T255408) (owner: 10Marostegui)
[04:45:16] <icinga-wm>	 RECOVERY - rsyslog TLS listener on port 6514 on centrallog2001 is OK: SSL OK - Certificate centrallog2001.codfw.wmnet valid until 2024-11-16 16:04:24 +0000 (expires in 1590 days) https://wikitech.wikimedia.org/wiki/Logs
[04:52:10] <wikibugs>	 10Operations, 10ops-eqiad, 10DBA: Degraded RAID on db1131 - https://phabricator.wikimedia.org/T257253 (10Marostegui) I will failover db1131 to db1093 on Tuesday 14th at 05:00 AM UTC
[04:55:12] <icinga-wm>	 PROBLEM - rsyslog TLS listener on port 6514 on centrallog2001 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/Logs
[04:57:57] <wikibugs>	 (03PS1) 10Marostegui: db1107: Enable notifications [puppet] - 10https://gerrit.wikimedia.org/r/611075 (https://phabricator.wikimedia.org/T254462)
[04:58:38] <wikibugs>	 (03CR) 10Marostegui: [C: 03+2] db1107: Enable notifications [puppet] - 10https://gerrit.wikimedia.org/r/611075 (https://phabricator.wikimedia.org/T254462) (owner: 10Marostegui)
[05:05:14] <icinga-wm>	 RECOVERY - rsyslog TLS listener on port 6514 on centrallog2001 is OK: SSL OK - Certificate centrallog2001.codfw.wmnet valid until 2024-11-16 16:04:24 +0000 (expires in 1590 days) https://wikitech.wikimedia.org/wiki/Logs
[05:10:44] <icinga-wm>	 PROBLEM - Logstash rate of ingestion percent change compared to yesterday on icinga1001 is CRITICAL: 252.1 ge 210 https://phabricator.wikimedia.org/T202307 https://grafana.wikimedia.org/dashboard/db/logstash?orgId=1&panelId=2&fullscreen
[05:20:12] <icinga-wm>	 PROBLEM - rsyslog TLS listener on port 6514 on centrallog2001 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection reset by peer https://wikitech.wikimedia.org/wiki/Logs
[05:21:00] <icinga-wm>	 RECOVERY - rsyslog TLS listener on port 6514 on centrallog2001 is OK: SSL OK - Certificate centrallog2001.codfw.wmnet valid until 2024-11-16 16:04:24 +0000 (expires in 1590 days) https://wikitech.wikimedia.org/wiki/Logs
[05:33:04] <icinga-wm>	 RECOVERY - rsyslog in codfw is failing to deliver messages on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Rsyslog https://grafana.wikimedia.org/d/000000596/rsyslog?var-datasource=codfw+prometheus/ops
[05:35:20] <icinga-wm>	 RECOVERY - rsyslog in eqiad is failing to deliver messages on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Rsyslog https://grafana.wikimedia.org/d/000000596/rsyslog?var-datasource=eqiad+prometheus/ops
[06:00:13] <icinga-wm>	 RECOVERY - Too many messages in kafka logging-eqiad on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Logstash%23Kafka_consumer_lag https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?from=now-3h&to=now&orgId=1&var-datasource=eqiad+prometheus/ops&var-cluster=logging-eqiad&var-topic=All&var-consumer_group=All
[06:19:35] <icinga-wm>	 PROBLEM - OSPF status on cr3-knams is CRITICAL: OSPFv2: 3/4 UP : OSPFv3: 3/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[06:19:57] <icinga-wm>	 PROBLEM - OSPF status on cr2-eqdfw is CRITICAL: OSPFv2: 4/5 UP : OSPFv3: 4/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[06:20:39] <icinga-wm>	 PROBLEM - BFD status on cr3-knams is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[06:20:44] <elukey>	 should be GTT maintenance --^
[06:24:25] <icinga-wm>	 PROBLEM - OSPF status on cr1-eqiad is CRITICAL: OSPFv2: 4/5 UP : OSPFv3: 4/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[06:24:27] <icinga-wm>	 RECOVERY - BFD status on cr3-knams is OK: OK: UP: 8 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[06:25:15] <icinga-wm>	 RECOVERY - OSPF status on cr3-knams is OK: OSPFv2: 4/4 UP : OSPFv3: 4/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[06:25:37] <icinga-wm>	 RECOVERY - OSPF status on cr2-eqdfw is OK: OSPFv2: 5/5 UP : OSPFv3: 5/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[06:26:17] <icinga-wm>	 RECOVERY - OSPF status on cr1-eqiad is OK: OSPFv2: 5/5 UP : OSPFv3: 5/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[06:29:47] <icinga-wm>	 PROBLEM - ores on ores2003 is CRITICAL: connect to address 10.192.16.63 and port 8081: Connection refused https://wikitech.wikimedia.org/wiki/Services/Monitoring/ores
[06:32:05] <wikibugs>	 10Operations, 10Wikimedia-Mailing-lists, 10Cloud-VPS (Project-requests), 10cloud-services-team (Kanban): Request creation of mailman VPS project - https://phabricator.wikimedia.org/T257270 (10Ladsgroup) >>! In T257270#6293740, @Andrew wrote: > approved!  Bryan will take care of this shortly  Thanks!  I req...
[06:33:35] <wikibugs>	 10Operations, 10Fundraising-Backlog, 10MediaWiki-extensions-CentralNotice, 10Traffic, and 2 others: Implement redirect for hide banner cookie issue - https://phabricator.wikimedia.org/T251780 (10AndyRussG) @Ejegg what about this option? We set a special cookie on donate.wikimedia.org when people go to the...
[06:34:21] <wikibugs>	 (03PS1) 10Marostegui: db1124: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/611165 (https://phabricator.wikimedia.org/T254462)
[06:35:05] <marostegui>	 !log Compress InnoDB on db1124:3311 (Sanitarium - lag will appear on s1 on labsdb) - T254462
[06:35:10] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:35:11] <stashbot>	 T254462: Compress enwiki InnoDB tables - https://phabricator.wikimedia.org/T254462
[06:35:20] <wikibugs>	 (03CR) 10Marostegui: [C: 03+2] db1124: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/611165 (https://phabricator.wikimedia.org/T254462) (owner: 10Marostegui)
[06:37:47] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repool db1134', diff saved to https://phabricator.wikimedia.org/P11840 and previous config saved to /var/cache/conftool/dbconfig/20200710-063746-marostegui.json
[06:37:50] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:38:18] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1099:3311', diff saved to https://phabricator.wikimedia.org/P11841 and previous config saved to /var/cache/conftool/dbconfig/20200710-063818-marostegui.json
[06:38:21] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:40:55] <icinga-wm>	 RECOVERY - ores on ores2003 is OK: HTTP OK: HTTP/1.0 200 OK - 6397 bytes in 0.089 second response time https://wikitech.wikimedia.org/wiki/Services/Monitoring/ores
[06:45:37] <wikibugs>	 (03CR) 10Ayounsi: [C: 03+2] Reports, add new cloudsw role [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/610820 (https://phabricator.wikimedia.org/T251632) (owner: 10Ayounsi)
[06:50:32] <wikibugs>	 (03PS1) 10Elukey: Add custom ferm srange to Kafka Jumbo brokers [puppet] - 10https://gerrit.wikimedia.org/r/611168
[06:51:38] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] Add custom ferm srange to Kafka Jumbo brokers [puppet] - 10https://gerrit.wikimedia.org/r/611168 (owner: 10Elukey)
[06:51:59] <icinga-wm>	 PROBLEM - Router interfaces on cr3-ulsfo is CRITICAL: CRITICAL: host 198.35.26.192, interfaces up: 76, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[06:52:45] <icinga-wm>	 PROBLEM - Router interfaces on cr2-eqord is CRITICAL: CRITICAL: host 208.80.154.198, interfaces up: 52, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[06:53:32] <wikibugs>	 (03PS2) 10Elukey: Add custom ferm srange to Kafka Jumbo brokers [puppet] - 10https://gerrit.wikimedia.org/r/611168
[06:55:47] <icinga-wm>	 RECOVERY - Router interfaces on cr3-ulsfo is OK: OK: host 198.35.26.192, interfaces up: 78, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[06:56:31] <icinga-wm>	 RECOVERY - Router interfaces on cr2-eqord is OK: OK: host 208.80.154.198, interfaces up: 54, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[06:57:51] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repool db1099:3311', diff saved to https://phabricator.wikimedia.org/P11843 and previous config saved to /var/cache/conftool/dbconfig/20200710-065751-marostegui.json
[06:57:55] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:00:04] <jouncebot>	 Deploy window No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200710T0700)
[07:01:54] <wikibugs>	 (03CR) 10Elukey: [C: 04-1] "From a first pcc it looks good: https://puppet-compiler.wmflabs.org/compiler1002/23808/" [puppet] - 10https://gerrit.wikimedia.org/r/611168 (owner: 10Elukey)
[07:05:07] <icinga-wm>	 PROBLEM - Router interfaces on cr3-ulsfo is CRITICAL: CRITICAL: host 198.35.26.192, interfaces up: 76, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[07:05:51] <icinga-wm>	 PROBLEM - Router interfaces on cr2-eqord is CRITICAL: CRITICAL: host 208.80.154.198, interfaces up: 52, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[07:08:28] <wikibugs>	 (03CR) 10Alexandros Kosiaris: [C: 03+2] "thanks!" [deployment-charts] - 10https://gerrit.wikimedia.org/r/610855 (https://phabricator.wikimedia.org/T225680) (owner: 10Alexandros Kosiaris)
[07:09:28] <wikibugs>	 (03Merged) 10jenkins-bot: proton: Amend prometheus-statsd config [deployment-charts] - 10https://gerrit.wikimedia.org/r/610855 (https://phabricator.wikimedia.org/T225680) (owner: 10Alexandros Kosiaris)
[07:11:33] <icinga-wm>	 RECOVERY - Logstash rate of ingestion percent change compared to yesterday on icinga1001 is OK: (C)210 ge (W)150 ge 99.97 https://phabricator.wikimedia.org/T202307 https://grafana.wikimedia.org/dashboard/db/logstash?orgId=1&panelId=2&fullscreen
[07:13:50] <logmsgbot>	 !log akosiaris@deploy1001 helmfile [STAGING] Ran 'sync' command on namespace 'proton' for release 'production' .
[07:13:54] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:14:25] <logmsgbot>	 !log akosiaris@deploy1001 helmfile [EQIAD] Ran 'sync' command on namespace 'proton' for release 'production' .
[07:14:28] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:15:33] <logmsgbot>	 !log akosiaris@deploy1001 helmfile [CODFW] Ran 'sync' command on namespace 'proton' for release 'production' .
[07:15:36] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:27:03] <wikibugs>	 10Operations, 10Release Pipeline, 10Release-Engineering-Team-TODO, 10Epic, and 2 others: Migrate production services to kubernetes using the pipeline - https://phabricator.wikimedia.org/T198901 (10akosiaris)
[07:28:03] <wikibugs>	 10Operations, 10Prod-Kubernetes, 10serviceops, 10Kubernetes, 10Patch-For-Review: Move proton to use TLS only - https://phabricator.wikimedia.org/T255877 (10akosiaris)
[07:29:05] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+2] lxc: Remove jessie compat code [puppet] - 10https://gerrit.wikimedia.org/r/610707 (owner: 10Muehlenhoff)
[07:30:48] <wikibugs>	 (03CR) 10Hashar: "I did a quick and dirty audit for the CI image which is captured at T257553.  Basically for all containers used by the Jenkins job, I ran:" [puppet] - 10https://gerrit.wikimedia.org/r/610050 (https://phabricator.wikimedia.org/T256877) (owner: 10Muehlenhoff)
[07:31:05] <icinga-wm>	 RECOVERY - BGP status on cr2-esams is OK: BGP OK - up: 424, down: 2, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[07:32:09] <moritzm>	 !log installing e2fsprogs security updates on jessie (stretch/buster already fixed)
[07:32:12] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:41:12] <logmsgbot>	 !log jbond@deploy1001 Started deploy [librenms/librenms@0a88d64]: redeploy to [try and] fix php errors
[07:41:15] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:41:17] <logmsgbot>	 !log jbond@deploy1001 Finished deploy [librenms/librenms@0a88d64]: redeploy to [try and] fix php errors (duration: 00m 05s)
[07:41:20] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:41:39] <wikibugs>	 (03PS3) 10Elukey: Add custom ferm srange to Kafka Jumbo brokers [puppet] - 10https://gerrit.wikimedia.org/r/611168
[07:43:27] <logmsgbot>	 !log kormat@cumin1001 dbctl commit (dc=all): 'Add weight to es1020, reduce weight on es1021 T257284', diff saved to https://phabricator.wikimedia.org/P11844 and previous config saved to /var/cache/conftool/dbconfig/20200710-074326-kormat.json
[07:43:31] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:43:32] <stashbot>	 T257284: Upgrade es4 to debian buster + mariadb 10.4 - https://phabricator.wikimedia.org/T257284
[07:43:55] <icinga-wm>	 ACKNOWLEDGEMENT - Stale file for node-exporter textfile in eqiad on icinga1001 is CRITICAL: cluster=analytics file=nic_firmware.prom instance=analytics1030 job=node site=eqiad Ema elukey running some tests on the hadoop test cluster https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile
[07:44:14] <kormat>	 !log reimaging es1021 to buster T257284
[07:44:18] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:45:43] <icinga-wm>	 RECOVERY - Stale file for node-exporter textfile in eqiad on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile
[07:45:45] <wikibugs>	 (03PS1) 10Kormat: es1021: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/611183 (https://phabricator.wikimedia.org/T257284)
[07:47:19] <icinga-wm>	 PROBLEM - Host ganeti1007 is DOWN: PING CRITICAL - Packet loss = 100%
[07:47:37] <icinga-wm>	 PROBLEM - Host etcd1003 is DOWN: PING CRITICAL - Packet loss = 100%
[07:48:09] <elukey>	 are we doing maintenance on --^ ?
[07:49:57] <wikibugs>	 10Operations, 10serviceops: docker-reporter-releng-images failed on deneb - https://phabricator.wikimedia.org/T251918 (10ema) The service failed 3 days ago due to another image this time:  ` root@deneb:~# journalctl -u docker-reporter-releng-images.service | grep FAIL Jul 06 16:54:38 deneb docker-report-releng...
[07:50:40] <icinga-wm>	 ACKNOWLEDGEMENT - Check systemd state on deneb is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. Ema This is a known issue: T251918 https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[07:51:53] <wikibugs>	 (03CR) 10Marostegui: [C: 03+1] es1021: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/611183 (https://phabricator.wikimedia.org/T257284) (owner: 10Kormat)
[07:52:13] <moritzm>	 yeah, that's expired downtime for a RAM expansion
[07:52:24] <kormat>	 moritzm: have you tried downloading more ram?
[07:52:53] <moritzm>	 kormat: we'd need to buy Ganeti Enterprise for that...
[07:53:00] <kormat>	 haha
[07:53:58] <elukey>	 ahhaah
[07:54:16] <moritzm>	 akosiaris: https://phabricator.wikimedia.org/T244530 doesn't mention if John added the RAM yesterday, do you know more? should the downtime be extended until next week?
[07:54:31] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repool db1079', diff saved to https://phabricator.wikimedia.org/P11845 and previous config saved to /var/cache/conftool/dbconfig/20200710-075431-marostegui.json
[07:54:35] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:55:01] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Slowly repool db1107', diff saved to https://phabricator.wikimedia.org/P11846 and previous config saved to /var/cache/conftool/dbconfig/20200710-075500-marostegui.json
[07:55:04] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:55:20] <wikibugs>	 10Operations, 10serviceops: docker-reporter-releng-images failed on deneb - https://phabricator.wikimedia.org/T251918 (10hashar) That `releng/ci-common` image is a scratch image containing scripts shared by our base images ci-jessie, ci-stretch, ci-buster. It does not have any Debian OS layer, thus if the repo...
[07:56:09] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1119', diff saved to https://phabricator.wikimedia.org/P11847 and previous config saved to /var/cache/conftool/dbconfig/20200710-075608-marostegui.json
[07:56:12] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:00:45] <moritzm>	 !log installing cron security updates on jessie (stretch/buster already fixed)
[08:00:48] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:01:33] <logmsgbot>	 !log kormat@cumin1001 dbctl commit (dc=all): 'Reset es2020/es2021 to correct weights after master switch T257284', diff saved to https://phabricator.wikimedia.org/P11848 and previous config saved to /var/cache/conftool/dbconfig/20200710-080133-kormat.json
[08:01:37] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:01:38] <stashbot>	 T257284: Upgrade es4 to debian buster + mariadb 10.4 - https://phabricator.wikimedia.org/T257284
[08:02:28] <wikibugs>	 (03CR) 10Kormat: [C: 03+2] es1021: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/611183 (https://phabricator.wikimedia.org/T257284) (owner: 10Kormat)
[08:04:05] <wikibugs>	 10Operations, 10serviceops: docker-reporter-releng-images failed on deneb - https://phabricator.wikimedia.org/T251918 (10JMeybohm) >>! In T251918#6296048, @hashar wrote: > That `releng/ci-common` image is a scratch image containing scripts shared by our base images ci-jessie, ci-stretch, ci-buster. It does not...
[08:06:46] <wikibugs>	 (03PS1) 10Kormat: install_server: Switch es1021 to buster [puppet] - 10https://gerrit.wikimedia.org/r/611193 (https://phabricator.wikimedia.org/T257284)
[08:07:01] <wikibugs>	 10Operations, 10serviceops: docker-reporter-releng-images failed on deneb - https://phabricator.wikimedia.org/T251918 (10MoritzMuehlenhoff) >>! In T251918#6296059, @JMeybohm wrote: > Maybe we can just skip images that are not debian based?  Sounds good, we could simply test for the presence of /etc/debian_vers...
[08:08:43] <logmsgbot>	 !log kormat@cumin1001 dbctl commit (dc=all): 'Depool es1021 for reimaging T257284', diff saved to https://phabricator.wikimedia.org/P11849 and previous config saved to /var/cache/conftool/dbconfig/20200710-080843-kormat.json
[08:08:50] <wikibugs>	 10Operations, 10serviceops: docker-reporter-releng-images failed on deneb - https://phabricator.wikimedia.org/T251918 (10JMeybohm) a:03JMeybohm >>! In T251918#6296062, @MoritzMuehlenhoff wrote: > > Sounds good, we could simply test for the presence of /etc/debian_version which is owned by the base-files pack...
[08:08:54] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repool db1119', diff saved to https://phabricator.wikimedia.org/P11850 and previous config saved to /var/cache/conftool/dbconfig/20200710-080854-marostegui.json
[08:09:04] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:09:05] <stashbot>	 T257284: Upgrade es4 to debian buster + mariadb 10.4 - https://phabricator.wikimedia.org/T257284
[08:09:07] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:09:13] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Slowly repool db1107', diff saved to https://phabricator.wikimedia.org/P11851 and previous config saved to /var/cache/conftool/dbconfig/20200710-080912-marostegui.json
[08:09:16] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:12:03] <wikibugs>	 (03CR) 10Marostegui: [C: 03+1] install_server: Switch es1021 to buster [puppet] - 10https://gerrit.wikimedia.org/r/611193 (https://phabricator.wikimedia.org/T257284) (owner: 10Kormat)
[08:13:18] <wikibugs>	 (03CR) 10Kormat: [C: 03+2] install_server: Switch es1021 to buster [puppet] - 10https://gerrit.wikimedia.org/r/611193 (https://phabricator.wikimedia.org/T257284) (owner: 10Kormat)
[08:17:15] <wikibugs>	 (03PS1) 10DCausse: [wcqs] update logo URL [puppet] - 10https://gerrit.wikimedia.org/r/611196 (https://phabricator.wikimedia.org/T251514)
[08:17:50] <wikibugs>	 (03PS1) 10Effie Mouzeli: hieradata: improve description of ncredir [puppet] - 10https://gerrit.wikimedia.org/r/611197
[08:20:02] <akosiaris>	 moritzm: I have no more information after powering off the host. I guess we can keep it in downtime for a few more days and ping John on the task
[08:20:16] <icinga-wm>	 PROBLEM - DPKG on mc2029 is CRITICAL: DPKG CRITICAL dpkg reports broken packages https://wikitech.wikimedia.org/wiki/Monitoring/dpkg
[08:21:21] <moritzm>	 ^ fixing mc2029
[08:22:34] <logmsgbot>	 !log jmm@cumin2001 START - Cookbook sre.hosts.downtime
[08:22:35] <logmsgbot>	 !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
[08:22:38] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:22:42] <logmsgbot>	 !log jmm@cumin2001 START - Cookbook sre.hosts.downtime
[08:22:42] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:22:43] <logmsgbot>	 !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
[08:22:46] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:22:50] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:23:18] <wikibugs>	 10Operations, 10ops-eqiad: upgrade memory in ganeti100[5-8].eqiad.wmnet - https://phabricator.wikimedia.org/T244530 (10MoritzMuehlenhoff) >>! In T244530#6287429, @Jclark-ctr wrote: > @akosiaris  I will be on site tomorrow also if host is available to do 1 day earlier  Did you plug in the new DIMMs yesterday?
[08:23:29] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Slowly repool db1107', diff saved to https://phabricator.wikimedia.org/P11852 and previous config saved to /var/cache/conftool/dbconfig/20200710-082329-marostegui.json
[08:23:30] <moritzm>	 akosiaris: followed up on task and extended downtime until Tuesday
[08:23:33] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:23:47] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1106', diff saved to https://phabricator.wikimedia.org/P11853 and previous config saved to /var/cache/conftool/dbconfig/20200710-082346-marostegui.json
[08:23:50] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:26:47] <akosiaris>	 moritzm: I was about to do that, thanks!
[08:27:08] <icinga-wm>	 RECOVERY - Router interfaces on cr2-eqord is OK: OK: host 208.80.154.198, interfaces up: 54, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[08:33:37] <wikibugs>	 (03CR) 10Jbond: [C: 03+2] SSHFP: add a text file with the SSHFB of all hosts [puppet] - 10https://gerrit.wikimedia.org/r/609796 (https://phabricator.wikimedia.org/T257219) (owner: 10Jbond)
[08:39:50] <icinga-wm>	 RECOVERY - Router interfaces on cr3-ulsfo is OK: OK: host 198.35.26.192, interfaces up: 78, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[08:42:14] <wikibugs>	 (03CR) 10Filippo Giunchedi: "LGTM (haven't tried building the package myself)" (031 comment) [debs/grafana-loki] (debian/sid) - 10https://gerrit.wikimedia.org/r/610864 (owner: 10Cwhite)
[08:44:41] <wikibugs>	 10Operations, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for mooeypoo - https://phabricator.wikimedia.org/T257502 (10jcrespo) a:05Nuria→03jcrespo
[08:46:08] <wikibugs>	 (03PS5) 10Jcrespo: admin: Add Jgiannelos production access [puppet] - 10https://gerrit.wikimedia.org/r/609752 (https://phabricator.wikimedia.org/T257187)
[08:47:29] <wikibugs>	 10Operations, 10serviceops, 10Patch-For-Review, 10User-Elukey: Reimage one memcached shard to Buster - https://phabricator.wikimedia.org/T252391 (10elukey) @Krinkle @aaron if you have time, let's follow up on the question that I asked about what happens if a Redis shard disappears. It would be really nice...
[08:48:03] <wikibugs>	 (03CR) 10Filippo Giunchedi: "> Patch Set 1: Code-Review+1" [puppet] - 10https://gerrit.wikimedia.org/r/610832 (https://phabricator.wikimedia.org/T247968) (owner: 10Filippo Giunchedi)
[08:48:23] <wikibugs>	 (03PS1) 10Ema: ATS: add log_set_cookie_response(), reduce noise, log Host [puppet] - 10https://gerrit.wikimedia.org/r/611227 (https://phabricator.wikimedia.org/T256395)
[08:48:30] <wikibugs>	 (03CR) 10Jcrespo: [C: 03+2] admin: Add Jgiannelos production access [puppet] - 10https://gerrit.wikimedia.org/r/609752 (https://phabricator.wikimedia.org/T257187) (owner: 10Jcrespo)
[08:49:49] <wikibugs>	 (03CR) 10Alexandros Kosiaris: [C: 04-1] Kask: Use Releng Cassandra Image (033 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/609894 (https://phabricator.wikimedia.org/T224041) (owner: 10Jeena Huneidi)
[08:50:18] <icinga-wm>	 RECOVERY - DPKG on mc2029 is OK: All packages OK https://wikitech.wikimedia.org/wiki/Monitoring/dpkg
[08:50:41] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Fully repool db1107', diff saved to https://phabricator.wikimedia.org/P11855 and previous config saved to /var/cache/conftool/dbconfig/20200710-085040-marostegui.json
[08:50:44] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:51:01] <wikibugs>	 (03CR) 10Ema: [C: 03+2] ATS: add log_set_cookie_response(), reduce noise, log Host [puppet] - 10https://gerrit.wikimedia.org/r/611227 (https://phabricator.wikimedia.org/T256395) (owner: 10Ema)
[08:51:12] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repool db1106', diff saved to https://phabricator.wikimedia.org/P11856 and previous config saved to /var/cache/conftool/dbconfig/20200710-085112-marostegui.json
[08:51:16] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:51:58] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1110', diff saved to https://phabricator.wikimedia.org/P11857 and previous config saved to /var/cache/conftool/dbconfig/20200710-085157-marostegui.json
[08:52:01] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:53:33] <wikibugs>	 10Operations, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to production infrastructure services for jgiannelos - https://phabricator.wikimedia.org/T257187 (10jcrespo) 05Open→03Resolved a:03jcrespo Access request has been merged: ` Notice: /Stage[main]/Admin/Admin::Hashuser[jgiannelos]/...
[08:57:42] <wikibugs>	 (03PS1) 10Jcrespo: admin: Add Mooeypoo (wikigit) to the analytics-privatedata-users group [puppet] - 10https://gerrit.wikimedia.org/r/611228 (https://phabricator.wikimedia.org/T257502)
[09:01:24] <wikibugs>	 10Operations, 10CAS-SSO, 10Patch-For-Review, 10User-jbond: mod_auth_cas segfaulting on netmon - https://phabricator.wikimedia.org/T257587 (10MoritzMuehlenhoff) A few initial findings, still investigating further:  This all boils down to curl and OpenSSL: We're not seeing this issue on jessie (which only ha...
[09:02:18] <logmsgbot>	 !log kormat@cumin1001 START - Cookbook sre.hosts.downtime
[09:02:36] <wikibugs>	 10Operations, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to analytics-privatedata-users for mooeypoo - https://phabricator.wikimedia.org/T257502 (10jcrespo) Patch is prepared, will be deployed on Monday following procedures. Kerberos access will also be granted then.
[09:02:43] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:04:54] <logmsgbot>	 !log kormat@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
[09:04:57] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:05:51] <wikibugs>	 10Operations, 10LDAP-Access-Requests, 10WMF-Legal: Add Guergana Tzatchkova to the ldap/wmde and ldap/nda group - https://phabricator.wikimedia.org/T256201 (10jcrespo) a:03jcrespo Thank you very much for the heads up! Will proceed now with the group granting.
[09:06:05] <icinga-wm>	 PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[09:07:24] <wikibugs>	 10Operations, 10LDAP-Access-Requests, 10WMF-Legal: Add Guergana Tzatchkova to the ldap/wmde and ldap/nda group - https://phabricator.wikimedia.org/T256201 (10jcrespo)
[09:08:15] <wikibugs>	 (03CR) 10Hashar: [C: 03+1] "So the status is:" [puppet] - 10https://gerrit.wikimedia.org/r/608296 (https://phabricator.wikimedia.org/T256575) (owner: 10Hashar)
[09:09:15] <icinga-wm>	 RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[09:10:58] <wikibugs>	 (03CR) 10Hashar: [C: 03+1] "The regex example from above: https://regex101.com/r/L5GNMY/1" [puppet] - 10https://gerrit.wikimedia.org/r/608296 (https://phabricator.wikimedia.org/T256575) (owner: 10Hashar)
[09:11:21] <wikibugs>	 10Operations, 10LDAP-Access-Requests: Add Conny Kawohl to the ldap/wmde and ldap/nda group - https://phabricator.wikimedia.org/T257038 (10jcrespo) Thanks Dzahn for taking over, as that sped up the group addition!
[09:15:19] <icinga-wm>	 PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[09:18:30] <wikibugs>	 (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] kubeadm: If using a stacked control plane, expose etcd metrics [puppet] - 10https://gerrit.wikimedia.org/r/610980 (https://phabricator.wikimedia.org/T256361) (owner: 10Bstorm)
[09:22:06] <icinga-wm>	 RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[09:23:36] <wikibugs>	 10Operations, 10LDAP-Access-Requests, 10WMF-Legal: Add Guergana Tzatchkova to the ldap/wmde and ldap/nda group - https://phabricator.wikimedia.org/T256201 (10jcrespo) For future reference, UID for cn=Guergana Tzatchkova is gtzatchkova.  @guergana.tzatchkova This is not required for access request, but consid...
[09:29:56] <wikibugs>	 10Operations, 10Wikimedia-Mailing-lists, 10Cloud-VPS (Project-requests), 10cloud-services-team (Kanban): Request creation of mailman VPS project - https://phabricator.wikimedia.org/T257270 (10Ladsgroup) >>! In T257270#6296182, @Stashbot wrote: > {nav icon=file, name=Mentioned in SAL (#wikimedia-cloud), hre...
[09:31:00] <wikibugs>	 10Operations, 10Cloud-VPS (Project-requests), 10cloud-services-team (Kanban): Request creation of 'sre-sandbox' VPS project - https://phabricator.wikimedia.org/T247517 (10jbond) Is it possible to get more quota in this project? I just tried to create a machine and we have 1 x m1.xlarge which seems to have t...
[09:31:02] <wikibugs>	 (03CR) 10Jcrespo: [C: 03+1] mariadb: remove ferm firewall hole for gerrit servers [puppet] - 10https://gerrit.wikimedia.org/r/609884 (https://phabricator.wikimedia.org/T239151) (owner: 10Dzahn)
[09:34:13] <wikibugs>	 (03PS1) 10Jcrespo: admin: Add Guergana Tzatchkova (gtzatchkova) to the list of privileged ldap groups [puppet] - 10https://gerrit.wikimedia.org/r/611232 (https://phabricator.wikimedia.org/T256201)
[09:35:01] <wikibugs>	 (03PS1) 10Kormat: Revert "es1021: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/611215
[09:35:04] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] admin: Add Guergana Tzatchkova (gtzatchkova) to the list of privileged ldap groups [puppet] - 10https://gerrit.wikimedia.org/r/611232 (https://phabricator.wikimedia.org/T256201) (owner: 10Jcrespo)
[09:36:50] <wikibugs>	 (03CR) 10Kormat: [C: 03+2] Revert "es1021: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/611215 (owner: 10Kormat)
[09:37:47] <wikibugs>	 (03PS2) 10Jcrespo: admin: Add gtzatchkova to the list of privileged ldap groups [puppet] - 10https://gerrit.wikimedia.org/r/611232 (https://phabricator.wikimedia.org/T256201)
[09:43:59] <wikibugs>	 (03CR) 10Effie Mouzeli: [C: 04-1] charts for push-notification service (034 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/602390 (https://phabricator.wikimedia.org/T250491) (owner: 10MSantos)
[09:47:27] <wikibugs>	 (03CR) 10Jcrespo: [C: 03+2] admin: Add gtzatchkova to the list of privileged ldap groups [puppet] - 10https://gerrit.wikimedia.org/r/611232 (https://phabricator.wikimedia.org/T256201) (owner: 10Jcrespo)
[09:49:55] <logmsgbot>	 !log kormat@cumin1001 dbctl commit (dc=all): 'Start repooling es1021 after reimage @ 50% T257284', diff saved to https://phabricator.wikimedia.org/P11858 and previous config saved to /var/cache/conftool/dbconfig/20200710-094954-kormat.json
[09:49:59] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:50:00] <stashbot>	 T257284: Upgrade es4 to debian buster + mariadb 10.4 - https://phabricator.wikimedia.org/T257284
[09:52:59] <wikibugs>	 (03PS1) 10Jbond: mariadb::farm_misc add netmon1002/2001 access [puppet] - 10https://gerrit.wikimedia.org/r/611243
[09:53:42] <wikibugs>	 (03CR) 10Jbond: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/611243 (owner: 10Jbond)
[09:55:53] <wikibugs>	 10Operations, 10LDAP-Access-Requests, 10WMF-Legal, 10Patch-For-Review: Add Guergana Tzatchkova to the ldap/wmde and ldap/nda group - https://phabricator.wikimedia.org/T256201 (10jcrespo)
[09:56:56] <wikibugs>	 10Operations, 10LDAP-Access-Requests, 10WMF-Legal, 10Patch-For-Review: Add Guergana Tzatchkova to the ldap/wmde and ldap/nda group - https://phabricator.wikimedia.org/T256201 (10jcrespo) 05Open→03Resolved Change has been deployed https://ldap.toolforge.org/user/gtzatchkova  Please test your privileged...
[09:59:17] <wikibugs>	 10Operations, 10LDAP-Access-Requests, 10observability, 10serviceops, 10Patch-For-Review: Grant Access to Logstash to Peter(peter.ovchyn@speedandfunction.com) - https://phabricator.wikimedia.org/T249037 (10jcrespo) I will get notified when this can move forward and https://gerrit.wikimedia.org/r/c/operati...
[09:59:25] <wikibugs>	 (03PS2) 10Jbond: mariadb::farm_misc add netmon1002/2001 access [puppet] - 10https://gerrit.wikimedia.org/r/611243
[10:02:57] <wikibugs>	 (03PS4) 10Elukey: Add custom ferm srange to Kafka Jumbo brokers [puppet] - 10https://gerrit.wikimedia.org/r/611168
[10:18:41] <wikibugs>	 (03CR) 10Alexandros Kosiaris: [C: 03+1] Add custom ferm srange to Kafka Jumbo brokers [puppet] - 10https://gerrit.wikimedia.org/r/611168 (owner: 10Elukey)
[10:20:59] <wikibugs>	 (03PS1) 10Jbond: role::grafana: allow embedding [puppet] - 10https://gerrit.wikimedia.org/r/611250 (https://phabricator.wikimedia.org/T250792)
[10:21:34] <wikibugs>	 (03CR) 10Jbond: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/611250 (https://phabricator.wikimedia.org/T250792) (owner: 10Jbond)
[10:21:48] <logmsgbot>	 !log kormat@cumin1001 dbctl commit (dc=all): 'Finish repooling es1021, and remove weight from es1010 T257284', diff saved to https://phabricator.wikimedia.org/P11859 and previous config saved to /var/cache/conftool/dbconfig/20200710-102147-kormat.json
[10:21:52] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:21:53] <stashbot>	 T257284: Upgrade es4 to debian buster + mariadb 10.4 - https://phabricator.wikimedia.org/T257284
[10:26:31] <wikibugs>	 (03PS1) 10Jbond: Revert "librenms: convert back to ldap config" [puppet] - 10https://gerrit.wikimedia.org/r/611216
[10:26:54] <wikibugs>	 10Operations, 10serviceops, 10Patch-For-Review, 10User-Elukey: Reimage one memcached shard to Buster - https://phabricator.wikimedia.org/T252391 (10elukey) Another idea to add in here - recently John and Moritz needed TLS for memcached and imported memcached 1.6.6 (latest upstream) into our buster reposito...
[10:34:24] <wikibugs>	 (03PS1) 10JMeybohm: Check if images are debian based before generating report [docker-images/docker-report] - 10https://gerrit.wikimedia.org/r/611251 (https://phabricator.wikimedia.org/T251918)
[10:34:26] <wikibugs>	 (03PS1) 10JMeybohm: New package version [docker-images/docker-report] - 10https://gerrit.wikimedia.org/r/611252
[10:38:35] <wikibugs>	 10Operations, 10Desktop Improvements, 10Traffic, 10Performance-Team (Radar): CDN cache revalidation on several wikis for desktop improvements deployment - https://phabricator.wikimedia.org/T256750 (10ovasileva) @ema - apologies for the late response - we had some blockers arise this week with the deploymen...
[10:38:49] <wikibugs>	 10Operations, 10Desktop Improvements, 10Traffic, 10Performance-Team (Radar): CDN cache revalidation on several wikis for desktop improvements deployment - https://phabricator.wikimedia.org/T256750 (10ovasileva)
[10:50:13] <wikibugs>	 (03PS1) 10Arturo Borrero Gonzalez: cloud: add prometheus neutron conntrack collector [puppet] - 10https://gerrit.wikimedia.org/r/611262 (https://phabricator.wikimedia.org/T257552)
[10:50:30] <wikibugs>	 (03PS1) 10Jbond: profile::grafana: add types and convert to lookup [puppet] - 10https://gerrit.wikimedia.org/r/611263
[10:51:05] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] cloud: add prometheus neutron conntrack collector [puppet] - 10https://gerrit.wikimedia.org/r/611262 (https://phabricator.wikimedia.org/T257552) (owner: 10Arturo Borrero Gonzalez)
[10:51:43] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] profile::grafana: add types and convert to lookup [puppet] - 10https://gerrit.wikimedia.org/r/611263 (owner: 10Jbond)
[10:52:08] <wikibugs>	 (03PS2) 10Jbond: profile::grafana: add types and convert to lookup [puppet] - 10https://gerrit.wikimedia.org/r/611263
[10:54:47] <wikibugs>	 (03PS4) 10Hnowlan: changeprop-jobqueue: add beta configuration skeleton [deployment-charts] - 10https://gerrit.wikimedia.org/r/604425 (https://phabricator.wikimedia.org/T220399)
[10:56:21] <wikibugs>	 (03PS3) 10Jbond: profile::grafana: add types and convert to lookup [puppet] - 10https://gerrit.wikimedia.org/r/611263
[11:01:59] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+2] Revert "librenms: convert back to ldap config" [puppet] - 10https://gerrit.wikimedia.org/r/611216 (owner: 10Jbond)
[11:02:05] <wikibugs>	 (03PS4) 10Jbond: profile::grafana: add types and convert to lookup [puppet] - 10https://gerrit.wikimedia.org/r/611263
[11:08:26] <wikibugs>	 (03PS5) 10Jbond: profile::grafana: add types and convert to lookup [puppet] - 10https://gerrit.wikimedia.org/r/611263
[11:10:49] <wikibugs>	 (03PS2) 10Arturo Borrero Gonzalez: cloud: add prometheus neutron conntrack collector [puppet] - 10https://gerrit.wikimedia.org/r/611262 (https://phabricator.wikimedia.org/T257552)
[11:13:07] <wikibugs>	 (03PS6) 10Jbond: profile::grafana: add types and convert to lookup [puppet] - 10https://gerrit.wikimedia.org/r/611263
[11:13:09] <wikibugs>	 (03CR) 10Hnowlan: [C: 03+2] changeprop-jobqueue: add beta configuration skeleton [deployment-charts] - 10https://gerrit.wikimedia.org/r/604425 (https://phabricator.wikimedia.org/T220399) (owner: 10Hnowlan)
[11:23:15] <wikibugs>	 (03Merged) 10jenkins-bot: changeprop-jobqueue: add beta configuration skeleton [deployment-charts] - 10https://gerrit.wikimedia.org/r/604425 (https://phabricator.wikimedia.org/T220399) (owner: 10Hnowlan)
[11:23:15] <wikibugs>	 (03CR) 10Jbond: "PCC: https://puppet-compiler.wmflabs.org/compiler1001/23813/" [puppet] - 10https://gerrit.wikimedia.org/r/611263 (owner: 10Jbond)
[11:23:16] <wikibugs>	 10Operations: Network access to Wikipedia blocked - https://phabricator.wikimedia.org/T257664 (10Olem)
[11:23:16] <wikibugs>	 (03PS3) 10Jbond: mariadb::ferm_misc add netmon1002/2001 access [puppet] - 10https://gerrit.wikimedia.org/r/611243
[11:23:16] <wikibugs>	 (03PS3) 10Arturo Borrero Gonzalez: cloud: add prometheus neutron conntrack collector [puppet] - 10https://gerrit.wikimedia.org/r/611262 (https://phabricator.wikimedia.org/T257552)
[11:23:16] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] cloud: add prometheus neutron conntrack collector [puppet] - 10https://gerrit.wikimedia.org/r/611262 (https://phabricator.wikimedia.org/T257552) (owner: 10Arturo Borrero Gonzalez)
[11:23:17] <wikibugs>	 (03PS4) 10Arturo Borrero Gonzalez: cloud: add prometheus neutron conntrack collector [puppet] - 10https://gerrit.wikimedia.org/r/611262 (https://phabricator.wikimedia.org/T257552)
[11:23:17] <wikibugs>	 10Operations, 10Traffic: Network access to Wikipedia blocked - https://phabricator.wikimedia.org/T257664 (10Majavah)
[11:23:18] <wikibugs>	 (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] cloud: add prometheus neutron conntrack collector [puppet] - 10https://gerrit.wikimedia.org/r/611262 (https://phabricator.wikimedia.org/T257552) (owner: 10Arturo Borrero Gonzalez)
[11:29:00] <wikibugs>	 (03PS1) 10Arturo Borrero Gonzalez: cloud: prometheus neutron collector: add "" characters for label values [puppet] - 10https://gerrit.wikimedia.org/r/611278 (https://phabricator.wikimedia.org/T257552)
[11:30:13] <wikibugs>	 10Operations, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to analytics-privatedata-users and nda groups for edtadros - https://phabricator.wikimedia.org/T256435 (10jcrespo)
[11:30:37] <wikibugs>	 (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] cloud: prometheus neutron collector: add "" characters for label values [puppet] - 10https://gerrit.wikimedia.org/r/611278 (https://phabricator.wikimedia.org/T257552) (owner: 10Arturo Borrero Gonzalez)
[11:33:00] <wikibugs>	 (03PS14) 10Hnowlan: api-gateway: Basic envoy chart WIP [deployment-charts] - 10https://gerrit.wikimedia.org/r/609808 (https://phabricator.wikimedia.org/T254906)
[11:35:46] <wikibugs>	 (03PS1) 10Arturo Borrero Gonzalez: cloud: prometheus neutron exporter: cleanup log messages [puppet] - 10https://gerrit.wikimedia.org/r/611285 (https://phabricator.wikimedia.org/T257552)
[11:37:00] <wikibugs>	 (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] cloud: prometheus neutron exporter: cleanup log messages [puppet] - 10https://gerrit.wikimedia.org/r/611285 (https://phabricator.wikimedia.org/T257552) (owner: 10Arturo Borrero Gonzalez)
[11:58:40] <wikibugs>	 (03PS1) 10Reedy: Make Score errors use a specific css class [extensions/Score] (wmf/1.35.0-wmf.40) - 10https://gerrit.wikimedia.org/r/611217 (https://phabricator.wikimedia.org/T257623)
[11:59:07] <wikibugs>	 (03CR) 10Reedy: [C: 03+2] Make Score errors use a specific css class [extensions/Score] (wmf/1.35.0-wmf.40) - 10https://gerrit.wikimedia.org/r/611217 (https://phabricator.wikimedia.org/T257623) (owner: 10Reedy)
[12:03:55] <wikibugs>	 (03PS3) 10RhinosF1: Add NamespaceAliases for kowikiquote [mediawiki-config] - 10https://gerrit.wikimedia.org/r/604515 (https://phabricator.wikimedia.org/T255031)
[12:06:11] <RhinosF1>	 Reedy: any update on https://phabricator.wikimedia.org/T257066#6277787? It's past Monday 6th?
[12:06:24] <Reedy>	 No
[12:08:30] <RhinosF1>	 Reedy: any updated timescale?
[12:08:36] <Reedy>	 No
[12:08:46] <Reedy>	 The comment said at least
[12:08:53] <Reedy>	 So it could be for any amount of time afterwards
[12:08:56] <RhinosF1>	 ok
[12:09:38] <wikibugs>	 (03CR) 10Marostegui: [C: 03+1] mariadb::ferm_misc add netmon1002/2001 access [puppet] - 10https://gerrit.wikimedia.org/r/611243 (owner: 10Jbond)
[12:17:12] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] Make Score errors use a specific css class [extensions/Score] (wmf/1.35.0-wmf.40) - 10https://gerrit.wikimedia.org/r/611217 (https://phabricator.wikimedia.org/T257623) (owner: 10Reedy)
[12:17:55] <wikibugs>	 (03CR) 10Reedy: [V: 03+2 C: 03+2] "Unrelated failure, task already filed" [extensions/Score] (wmf/1.35.0-wmf.40) - 10https://gerrit.wikimedia.org/r/611217 (https://phabricator.wikimedia.org/T257623) (owner: 10Reedy)
[12:20:24] <logmsgbot>	 !log reedy@deploy1001 Synchronized php-1.35.0-wmf.40/extensions/Score/: Make Score errors use a specific css class (duration: 00m 58s)
[12:20:28] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:36:05] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repool db1110', diff saved to https://phabricator.wikimedia.org/P11860 and previous config saved to /var/cache/conftool/dbconfig/20200710-123604-marostegui.json
[12:36:08] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:51:20] <wikibugs>	 (03CR) 10Alexandros Kosiaris: [C: 04-1] "> Patch Set 2:" [puppet] - 10https://gerrit.wikimedia.org/r/552515 (https://phabricator.wikimedia.org/T239340) (owner: 10Alexandros Kosiaris)
[12:54:51] <wikibugs>	 (03PS9) 10MSantos: charts for push-notification service [deployment-charts] - 10https://gerrit.wikimedia.org/r/602390 (https://phabricator.wikimedia.org/T250493)
[12:55:34] <wikibugs>	 (03CR) 10Alexandros Kosiaris: Add recommendation-api helmfile stanzas (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/602527 (https://phabricator.wikimedia.org/T241230) (owner: 10Bmansurov)
[12:57:15] <wikibugs>	 (03CR) 10MSantos: charts for push-notification service (034 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/602390 (https://phabricator.wikimedia.org/T250493) (owner: 10MSantos)
[13:03:19] <icinga-wm>	 RECOVERY - MariaDB Replica SQL: matomo on db1108 is OK: OK slave_sql_state not a slave https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_slave
[13:04:39] <icinga-wm>	 RECOVERY - MariaDB Replica Lag: matomo on db1108 is OK: OK slave_sql_lag not a slave https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_slave
[13:04:53] <icinga-wm>	 RECOVERY - MariaDB Replica IO: matomo on db1108 is OK: OK slave_io_state not a slave https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_slave
[13:05:59] <icinga-wm>	 RECOVERY - MariaDB Replica Lag: analytics_meta on db1108 is OK: OK slave_sql_lag not a slave https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_slave
[13:10:35] <marostegui>	 ^ elukey :)
[13:11:35] <icinga-wm>	 RECOVERY - Check systemd state on db1108 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[13:12:35] <icinga-wm>	 PROBLEM - Host ganeti1007.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[13:12:50] <elukey>	 ah!!!
[13:12:52] <elukey>	 <3
[13:14:24] <wikibugs>	 10Operations, 10ops-eqiad: upgrade memory in ganeti100[5-8].eqiad.wmnet - https://phabricator.wikimedia.org/T244530 (10Jclark-ctr)
[13:14:51] <wikibugs>	 10Operations, 10ops-eqiad: upgrade memory in ganeti100[5-8].eqiad.wmnet - https://phabricator.wikimedia.org/T244530 (10Jclark-ctr) finished with memory upgrade
[13:15:29] <icinga-wm>	 RECOVERY - Host ganeti1007 is UP: PING OK - Packet loss = 0%, RTA = 0.24 ms
[13:15:32] <wikibugs>	 (03PS1) 10Hashar: Fix repository name in .gitreview [software/acme-chief] - 10https://gerrit.wikimedia.org/r/611309
[13:18:27] <icinga-wm>	 RECOVERY - Host ganeti1007.mgmt is UP: PING OK - Packet loss = 0%, RTA = 0.84 ms
[13:18:41] <wikibugs>	 (03PS1) 10Ema: ATS: add SyslogIdentifier to systemd unit [puppet] - 10https://gerrit.wikimedia.org/r/611311
[13:19:15] <wikibugs>	 (03PS2) 10Ema: ATS: add SyslogIdentifier to systemd unit [puppet] - 10https://gerrit.wikimedia.org/r/611311 (https://phabricator.wikimedia.org/T256395)
[13:19:40] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] ATS: add SyslogIdentifier to systemd unit [puppet] - 10https://gerrit.wikimedia.org/r/611311 (https://phabricator.wikimedia.org/T256395) (owner: 10Ema)
[13:20:04] <wikibugs>	 (03PS3) 10Ema: ATS: add SyslogIdentifier to systemd unit [puppet] - 10https://gerrit.wikimedia.org/r/611311 (https://phabricator.wikimedia.org/T256395)
[13:20:46] <wikibugs>	 10Operations, 10Analytics-Clusters: Segfault for systemd-sysusers.service on stat1007 - https://phabricator.wikimedia.org/T256098 (10elukey) The redhat bug report leads to https://github.com/systemd/systemd/issues/6512, I followed the steps outlined in there:  ` elukey@stat1007:~$ sudo gdb systemd-sysusers [.....
[13:21:59] <wikibugs>	 (03CR) 10Ema: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/611311 (https://phabricator.wikimedia.org/T256395) (owner: 10Ema)
[13:31:12] <wikibugs>	 10Operations, 10Analytics-Clusters: Segfault for systemd-sysusers.service on stat1007 - https://phabricator.wikimedia.org/T256098 (10elukey) Coreos applied a patch to libc: https://github.com/mischief/coreos-overlay/commit/19d5f42d8208334ef8581ba90e01161e00dede71
[13:34:27] <wikibugs>	 (03Abandoned) 10Elukey: sre.dns.netbox: print some suggestions in case the diff is wrong [cookbooks] - 10https://gerrit.wikimedia.org/r/609390 (owner: 10Elukey)
[13:34:42] <wikibugs>	 (03Abandoned) 10Elukey: profile::mediawiki::alerts: tune mediawiki-errors to be more lenient [puppet] - 10https://gerrit.wikimedia.org/r/608708 (https://phabricator.wikimedia.org/T256459) (owner: 10Elukey)
[13:35:54] <wikibugs>	 10Operations, 10CAS-SSO, 10Patch-For-Review, 10User-jbond: mod_auth_cas segfaulting on netmon - https://phabricator.wikimedia.org/T257587 (10MoritzMuehlenhoff) The underlying Debian bug is https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=844018 and specifically https://bugs.debian.org/cgi-bin/bugreport.cg...
[13:36:47] <olem>	 Hi, I created https://phabricator.wikimedia.org/T257664 (Network access to Wikipedia blocked) a few hours ago, but I don't have the permissions to view it now (Access Denied: Restricted Task). May I be granted permission to view this task in order to check its status and respond to any questions if needed?
[13:37:20] <legoktm>	 olem: yeah, let me fix it for you
[13:37:40] <olem>	 Thanks legoktm
[13:38:32] <legoktm>	 olem: try now?
[13:39:26] <olem>	 Thanks, I now have access.
[13:39:52] <wikibugs>	 10Operations, 10Analytics-Clusters: Segfault for systemd-sysusers.service on stat1007 - https://phabricator.wikimedia.org/T256098 (10elukey) ` elukey@stat1006:~$ sudo systemd-sysusers Creating group systemd-coredump with gid 490. Creating user systemd-coredump (systemd Core Dumper) with uid 490 and gid 490. Se...
[13:40:15] <legoktm>	 great
[13:40:43] <wikibugs>	 10Operations, 10CAS-SSO, 10Patch-For-Review, 10User-jbond: mod_auth_cas segfaulting on netmon - https://phabricator.wikimedia.org/T257587 (10MoritzMuehlenhoff) Also adding @CDanis and @ayounsi for comments on updating to Buster (potential blockers etc.)
[13:41:13] <wikibugs>	 10Operations, 10Fundraising-Backlog, 10MediaWiki-extensions-CentralNotice, 10Traffic, and 2 others: Implement redirect for hide banner cookie issue - https://phabricator.wikimedia.org/T251780 (10Ejegg) @AndyRussG Hmm, that cross-site cookie check would require at least one more web request to complete befo...
[13:41:24] <wikibugs>	 10Operations, 10Fundraising-Backlog, 10MediaWiki-extensions-CentralNotice, 10Traffic, and 2 others: Implement redirect for hide banner cookie issue - https://phabricator.wikimedia.org/T251780 (10spatton) I like that last idea (special cookie on donate wiki that we check in banners), @AndyRussG! I would be...
[13:41:26] <godog>	 !log bounce ms-be1037, not quite responsive
[13:41:29] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:42:41] <icinga-wm>	 PROBLEM - very high load average likely xfs on ms-be1037 is CRITICAL: connect to address 10.64.48.142 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Swift
[13:42:51] <icinga-wm>	 PROBLEM - Check size of conntrack table on ms-be1037 is CRITICAL: connect to address 10.64.48.142 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack
[13:42:51] <icinga-wm>	 PROBLEM - Check systemd state on ms-be1037 is CRITICAL: connect to address 10.64.48.142 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[13:46:45] <icinga-wm>	 PROBLEM - Host ms-be1037 is DOWN: PING CRITICAL - Packet loss = 100%
[13:47:58] <wikibugs>	 10Operations, 10observability, 10Patch-For-Review, 10Performance-Team (Radar): Revisit Grafana/Icinga notification strategy - https://phabricator.wikimedia.org/T203485 (10ema) This came up again today. Due to my very short memory I forgot all about the performance team alerts and started complaining about...
[13:48:15] <icinga-wm>	 RECOVERY - very high load average likely xfs on ms-be1037 is OK: OK - load average: 20.15, 5.02, 1.68 https://wikitech.wikimedia.org/wiki/Swift
[13:48:17] <icinga-wm>	 RECOVERY - Host ms-be1037 is UP: PING OK - Packet loss = 0%, RTA = 0.20 ms
[13:48:23] <icinga-wm>	 RECOVERY - Check size of conntrack table on ms-be1037 is OK: OK: nf_conntrack is 0 % full https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack
[13:54:31] <wikibugs>	 (03PS16) 10Privacybatm: transferpy: Generate checksum parallel to the data transfer [software/transferpy] - 10https://gerrit.wikimedia.org/r/605851 (https://phabricator.wikimedia.org/T254979)
[13:54:43] <wikibugs>	 (03PS1) 10Ema: ATS: send Set-Cookie syslog output to logstash [puppet] - 10https://gerrit.wikimedia.org/r/611315 (https://phabricator.wikimedia.org/T256395)
[13:54:58] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] transferpy: Generate checksum parallel to the data transfer [software/transferpy] - 10https://gerrit.wikimedia.org/r/605851 (https://phabricator.wikimedia.org/T254979) (owner: 10Privacybatm)
[13:55:17] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/609840 (owner: 10Legoktm)
[13:55:48] <icinga-wm>	 PROBLEM - Check systemd state on ms-be1037 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[13:56:27] <wikibugs>	 (03CR) 10Ema: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/611315 (https://phabricator.wikimedia.org/T256395) (owner: 10Ema)
[13:58:13] <wikibugs>	 (03PS17) 10Privacybatm: transferpy: Generate checksum parallel to the data transfer [software/transferpy] - 10https://gerrit.wikimedia.org/r/605851 (https://phabricator.wikimedia.org/T254979)
[13:59:40] <wikibugs>	 (03CR) 10Vgutierrez: [C: 03+1] ATS: add SyslogIdentifier to systemd unit [puppet] - 10https://gerrit.wikimedia.org/r/611311 (https://phabricator.wikimedia.org/T256395) (owner: 10Ema)
[14:00:02] <wikibugs>	 (03CR) 10Privacybatm: transferpy: Generate checksum parallel to the data transfer (031 comment) [software/transferpy] - 10https://gerrit.wikimedia.org/r/605851 (https://phabricator.wikimedia.org/T254979) (owner: 10Privacybatm)
[14:00:27] <wikibugs>	 (03CR) 10Ema: [C: 03+2] ATS: add SyslogIdentifier to systemd unit [puppet] - 10https://gerrit.wikimedia.org/r/611311 (https://phabricator.wikimedia.org/T256395) (owner: 10Ema)
[14:02:39] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+1] "Looks good (to the extent this brackets maze in the srange can look good :-)" [puppet] - 10https://gerrit.wikimedia.org/r/611168 (owner: 10Elukey)
[14:03:11] <icinga-wm>	 PROBLEM - Check systemd state on kubernetes1004 is CRITICAL: connect to address 10.64.48.52 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:03:43] <icinga-wm>	 PROBLEM - Check size of conntrack table on kubernetes1004 is CRITICAL: connect to address 10.64.48.52 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack
[14:04:01] <wikibugs>	 10Operations, 10netops: Investigate Juniper storm control - https://phabricator.wikimedia.org/T245192 (10Krinkle)
[14:04:55] <wikibugs>	 (03PS2) 10Ema: ATS: send Set-Cookie syslog output to logstash [puppet] - 10https://gerrit.wikimedia.org/r/611315 (https://phabricator.wikimedia.org/T256395)
[14:05:24] <wikibugs>	 (03CR) 10Vgutierrez: ATS: send Set-Cookie syslog output to logstash (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/611315 (https://phabricator.wikimedia.org/T256395) (owner: 10Ema)
[14:05:44] <wikibugs>	 (03PS1) 10Filippo Giunchedi: role: port netmon to Buster [puppet] - 10https://gerrit.wikimedia.org/r/611317 (https://phabricator.wikimedia.org/T247967)
[14:05:46] <wikibugs>	 (03PS1) 10Filippo Giunchedi: role: install fcgid package on netmon [puppet] - 10https://gerrit.wikimedia.org/r/611318 (https://phabricator.wikimedia.org/T247967)
[14:05:55] <wikibugs>	 10Operations, 10Cloud-Services, 10Traffic, 10SRE-OnFire-Incident-Docs, 10cloud-services-team (Kanban): Requests to production are sometimes timing out or giving empty response - https://phabricator.wikimedia.org/T249035 (10Krinkle) Looks like this is no longer an active incident. Re-tagging as such. Are...
[14:05:57] <icinga-wm>	 PROBLEM - puppet last run on kubernetes1004 is CRITICAL: connect to address 10.64.48.52 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[14:06:19] <jayme>	 hm...that again. Looking at kubernetes1004
[14:07:00] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] role: install fcgid package on netmon [puppet] - 10https://gerrit.wikimedia.org/r/611318 (https://phabricator.wikimedia.org/T247967) (owner: 10Filippo Giunchedi)
[14:07:32] <wikibugs>	 10Operations, 10Analytics-Clusters: Segfault for systemd-sysusers.service on stat1007 - https://phabricator.wikimedia.org/T256098 (10elukey) I checked `/usr/lib/sysusers.d/*.conf` and the last user listed is `systemd-coredump`, plus we still don't use systemd-sysusers in analytics (yet).
[14:07:49] <icinga-wm>	 PROBLEM - DPKG on kubernetes1004 is CRITICAL: connect to address 10.64.48.52 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/dpkg
[14:08:57] <icinga-wm>	 PROBLEM - configured eth on kubernetes1004 is CRITICAL: connect to address 10.64.48.52 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_eth
[14:09:59] <icinga-wm>	 PROBLEM - dhclient process on kubernetes1004 is CRITICAL: connect to address 10.64.48.52 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_dhclient
[14:10:31] <icinga-wm>	 PROBLEM - Ensure hosts are not performing a change on every puppet run on puppetdb1002 is CRITICAL: CRITICAL: the following (5) node(s) change every puppet run: releases1002.eqiad.wmnet, ms-be1037.eqiad.wmnet, wdqs1010.eqiad.wmnet, releases2002.codfw.wmnet, wdqs1009.eqiad.wmnet https://wikitech.wikimedia.org/wiki/Puppet%23check_puppet_run_changes
[14:12:53] <icinga-wm>	 PROBLEM - Check whether ferm is active by checking the default input chain on kubernetes1004 is CRITICAL: connect to address 10.64.48.52 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm
[14:13:03] <icinga-wm>	 PROBLEM - Disk space on kubernetes1004 is CRITICAL: connect to address 10.64.48.52 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=kubernetes1004&var-datasource=eqiad+prometheus/ops
[14:14:18] <wikibugs>	 (03PS1) 10Elukey: Set BigTop for the Hadoop test cluster [puppet] - 10https://gerrit.wikimedia.org/r/611319
[14:14:19] <icinga-wm>	 RECOVERY - Check systemd state on kubernetes1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:14:49] <icinga-wm>	 RECOVERY - Check size of conntrack table on kubernetes1004 is OK: OK: nf_conntrack is 0 % full https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack
[14:15:44] <wikibugs>	 (03PS3) 10Ema: ATS: send Set-Cookie syslog output to logstash [puppet] - 10https://gerrit.wikimedia.org/r/611315 (https://phabricator.wikimedia.org/T256395)
[14:16:11] <wikibugs>	 (03CR) 10Ema: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/611315 (https://phabricator.wikimedia.org/T256395) (owner: 10Ema)
[14:17:20] <wikibugs>	 (03CR) 10Ema: ATS: send Set-Cookie syslog output to logstash (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/611315 (https://phabricator.wikimedia.org/T256395) (owner: 10Ema)
[14:17:39] <icinga-wm>	 RECOVERY - puppet last run on kubernetes1004 is OK: OK: Puppet is currently enabled, last run 46 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[14:18:21] <wikibugs>	 (03CR) 10Vgutierrez: [C: 03+1] ATS: send Set-Cookie syslog output to logstash [puppet] - 10https://gerrit.wikimedia.org/r/611315 (https://phabricator.wikimedia.org/T256395) (owner: 10Ema)
[14:18:37] <wikibugs>	 (03CR) 10Ema: [C: 03+2] ATS: send Set-Cookie syslog output to logstash [puppet] - 10https://gerrit.wikimedia.org/r/611315 (https://phabricator.wikimedia.org/T256395) (owner: 10Ema)
[14:20:11] <wikibugs>	 (03CR) 10Elukey: [C: 03+2] Set BigTop for the Hadoop test cluster [puppet] - 10https://gerrit.wikimedia.org/r/611319 (owner: 10Elukey)
[14:25:56] <wikibugs>	 10Operations, 10ops-eqiad: Suspected network troubles on ms-be1037 - https://phabricator.wikimedia.org/T257675 (10fgiunchedi)
[14:26:04] <wikibugs>	 10Operations, 10ops-eqiad: Suspected network troubles on ms-be1037 - https://phabricator.wikimedia.org/T257675 (10fgiunchedi) p:05Triage→03High
[14:29:23] <icinga-wm>	 PROBLEM - Check systemd state on ms-be1037 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:30:32] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+1] "Looks good to me, two nits inline (feel free to ignore)" (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/611317 (https://phabricator.wikimedia.org/T247967) (owner: 10Filippo Giunchedi)
[14:30:42] <wikibugs>	 10Operations, 10observability, 10User-fgiunchedi: Port Prometheus dashboards to Thanos - https://phabricator.wikimedia.org/T256954 (10jcrespo)
[14:30:44] <wikibugs>	 10Operations, 10DBA, 10User-Kormat: Port DBA dashboards to thanos - https://phabricator.wikimedia.org/T256730 (10jcrespo)
[14:30:46] <logmsgbot>	 !log elukey@cumin1001 START - Cookbook sre.hadoop.stop-cluster
[14:30:49] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:31:24] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] role: port netmon to Buster [puppet] - 10https://gerrit.wikimedia.org/r/611317 (https://phabricator.wikimedia.org/T247967) (owner: 10Filippo Giunchedi)
[14:33:04] <wikibugs>	 10Operations, 10Puppet: Missing dependency on bacula-fd Puppet setup - https://phabricator.wikimedia.org/T256454 (10jcrespo) p:05Triage→03Medium a:03jcrespo
[14:33:53] <icinga-wm>	 RECOVERY - Disk space on kubernetes1004 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=kubernetes1004&var-datasource=eqiad+prometheus/ops
[14:34:16] <wikibugs>	 (03PS9) 10Andrew Bogott: Prometheus: gather db stats from wmcs galera db hosts [puppet] - 10https://gerrit.wikimedia.org/r/610420
[14:35:04] <wikibugs>	 (03CR) 10Muehlenhoff: role: install fcgid package on netmon (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/611318 (https://phabricator.wikimedia.org/T247967) (owner: 10Filippo Giunchedi)
[14:35:22] <wikibugs>	 10Operations, 10Fundraising-Backlog, 10MediaWiki-extensions-CentralNotice, 10Traffic, and 2 others: Implement redirect for hide banner cookie issue - https://phabricator.wikimedia.org/T251780 (10Pcoombe) Some quick thoughts on proposed solutions  **TY page on *.wikipedia.org** `-` need a new restricted acc...
[14:37:23] <logmsgbot>	 !log elukey@cumin1001 END (PASS) - Cookbook sre.hadoop.stop-cluster (exit_code=0)
[14:37:26] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:38:35] <wikibugs>	 10Operations, 10Fundraising-Backlog, 10MediaWiki-extensions-CentralNotice, 10Traffic, and 2 others: Implement redirect for hide banner cookie issue - https://phabricator.wikimedia.org/T251780 (10MBeat33) > Put a link in the thank-you page and/or thank-you e-mail for users to click  I'd like to advocate gen...
[14:38:41] <icinga-wm>	 RECOVERY - DPKG on kubernetes1004 is OK: All packages OK https://wikitech.wikimedia.org/wiki/Monitoring/dpkg
[14:38:51] <wikibugs>	 10Operations, 10ops-eqiad: Suspected network troubles on ms-be1037 - https://phabricator.wikimedia.org/T257675 (10fgiunchedi) In case it is useful and assuming my theory is correct, host is on asw2-d-eqiad  ` Physical interface: xe-7/0/0     Laser bias current                        :  45.788 mA     Laser outp...
[14:39:34] <logmsgbot>	 !log elukey@cumin1001 START - Cookbook sre.hadoop.change-distro
[14:39:37] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:39:47] <icinga-wm>	 RECOVERY - configured eth on kubernetes1004 is OK: OK - interfaces up https://wikitech.wikimedia.org/wiki/Monitoring/check_eth
[14:40:47] <icinga-wm>	 RECOVERY - dhclient process on kubernetes1004 is OK: PROCS OK: 0 processes with command name dhclient https://wikitech.wikimedia.org/wiki/Monitoring/check_dhclient
[14:40:53] <wikibugs>	 (03PS2) 10Cwhite: debianization [debs/grafana-loki] (debian/sid) - 10https://gerrit.wikimedia.org/r/610864
[14:42:38] <wikibugs>	 (03CR) 10Cwhite: debianization (031 comment) [debs/grafana-loki] (debian/sid) - 10https://gerrit.wikimedia.org/r/610864 (owner: 10Cwhite)
[14:43:30] <wikibugs>	 (03PS10) 10Andrew Bogott: cloudmetrics: gather db stats from wmcs galera db hosts [puppet] - 10https://gerrit.wikimedia.org/r/610420
[14:43:41] <icinga-wm>	 RECOVERY - Check whether ferm is active by checking the default input chain on kubernetes1004 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm
[14:47:30] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C: 03+1] cloudmetrics: gather db stats from wmcs galera db hosts [puppet] - 10https://gerrit.wikimedia.org/r/610420 (owner: 10Andrew Bogott)
[14:47:54] <wikibugs>	 (03CR) 10Andrew Bogott: [C: 03+2] cloudmetrics: gather db stats from wmcs galera db hosts [puppet] - 10https://gerrit.wikimedia.org/r/610420 (owner: 10Andrew Bogott)
[14:52:27] <wikibugs>	 (03PS9) 10ZPapierski: Correct url and path for nginx OAuth 1.0a [puppet] - 10https://gerrit.wikimedia.org/r/609909 (https://phabricator.wikimedia.org/T251498)
[14:57:51] <icinga-wm>	 PROBLEM - Host ms-be1037 is DOWN: PING CRITICAL - Packet loss = 100%
[14:59:35] <icinga-wm>	 RECOVERY - Host ms-be1037 is UP: PING OK - Packet loss = 0%, RTA = 0.22 ms
[15:02:24] <wikibugs>	 10Operations, 10ops-eqiad: Suspected network troubles on ms-be1037 - https://phabricator.wikimedia.org/T257675 (10Jclark-ctr) a:03Jclark-ctr
[15:03:26] <logmsgbot>	 !log elukey@cumin1001 END (PASS) - Cookbook sre.hadoop.change-distro (exit_code=0)
[15:03:30] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:04:21] <icinga-wm>	 PROBLEM - Check systemd state on ms-be1037 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:06:11] <wikibugs>	 10Operations, 10ops-eqiad: Suspected network troubles on ms-be1037 - https://phabricator.wikimedia.org/T257675 (10Jclark-ctr) replaced SFP on host: Finisar model FTLX1471D3BCL, S/N AQR0HM6
[15:08:43] <wikibugs>	 (03PS10) 10ZPapierski: Correct url and path for nginx OAuth 1.0a [puppet] - 10https://gerrit.wikimedia.org/r/609909 (https://phabricator.wikimedia.org/T251498)
[15:08:59] <wikibugs>	 (03PS1) 10Elukey: Revert "Set BigTop for the Hadoop test cluster" [puppet] - 10https://gerrit.wikimedia.org/r/611218
[15:09:16] <wikibugs>	 (03CR) 10Bstorm: [C: 03+2] kubeadm: If using a stacked control plane, expose etcd metrics [puppet] - 10https://gerrit.wikimedia.org/r/610980 (https://phabricator.wikimedia.org/T256361) (owner: 10Bstorm)
[15:10:32] <wikibugs>	 (03CR) 10Elukey: [C: 03+2] Revert "Set BigTop for the Hadoop test cluster" [puppet] - 10https://gerrit.wikimedia.org/r/611218 (owner: 10Elukey)
[15:11:15] <elukey>	 bstorm: o/
[15:11:20] <elukey>	 ok to puppet-merge?
[15:11:38] <bstorm>	 Oh sure! I was about it
[15:11:46] <bstorm>	 *about to
[15:12:39] <bstorm>	 elukey ^
[15:12:47] <elukey>	 ack!
[15:13:21] <elukey>	 done :)
[15:14:11] <wikibugs>	 (03PS1) 10Andrew Bogott: wmcs prometheus: correct role name for galera metrics [puppet] - 10https://gerrit.wikimedia.org/r/611343
[15:16:09] <wikibugs>	 (03CR) 10Andrew Bogott: [C: 03+2] wmcs prometheus: correct role name for galera metrics [puppet] - 10https://gerrit.wikimedia.org/r/611343 (owner: 10Andrew Bogott)
[15:17:17] <icinga-wm>	 RECOVERY - Check systemd state on ms-be1037 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:19:08] <wikibugs>	 (03PS1) 10Muehlenhoff: Remove cas-icinga server alias [puppet] - 10https://gerrit.wikimedia.org/r/611344
[15:19:10] <wikibugs>	 (03PS1) 10Muehlenhoff: Remove cas-icinga from ACME config [puppet] - 10https://gerrit.wikimedia.org/r/611345
[15:19:53] <logmsgbot>	 !log elukey@cumin1001 START - Cookbook sre.hadoop.stop-cluster
[15:19:56] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:23:27] <wikibugs>	 10Operations, 10MediaWiki-extensions-Score, 10Security-Team, 10Wikimedia-General-or-Unknown, and 3 others: Extension:Score / Lilypond is disabled on all wikis - https://phabricator.wikimedia.org/T257066 (10Tgr) Have we made an effort to reach out to non-Wikimedia MediaWiki users? Given the severity, warnin...
[15:24:45] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+2] Icinga: Add permissions also for ayounsi [puppet] - 10https://gerrit.wikimedia.org/r/610699 (owner: 10Muehlenhoff)
[15:29:31] <logmsgbot>	 !log milimetric@deploy1001 Started deploy [analytics/refinery@4d40145]: Update EventLogging refine whitelist
[15:29:34] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:29:49] <logmsgbot>	 !log elukey@cumin1001 END (PASS) - Cookbook sre.hadoop.stop-cluster (exit_code=0)
[15:29:52] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:30:26] <logmsgbot>	 !log elukey@cumin1001 START - Cookbook sre.hadoop.change-distro
[15:30:29] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:31:30] <wikibugs>	 (03CR) 10Cwhite: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/607075 (owner: 10Hashar)
[15:33:30] <wikibugs>	 10Operations, 10ops-eqiad: Suspected network troubles on ms-be1037 - https://phabricator.wikimedia.org/T257675 (10fgiunchedi) 05Open→03Resolved Host is back and network works as expected now, thanks for the quick action @Jclark-ctr !
[15:35:31] <wikibugs>	 (03CR) 10Cwhite: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/607076 (https://phabricator.wikimedia.org/T149924) (owner: 10Hashar)
[15:36:57] <wikibugs>	 (03CR) 10Cwhite: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/607524 (owner: 10Hashar)
[15:37:49] <wikibugs>	 (03CR) 10Cwhite: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/607525 (owner: 10Hashar)
[15:44:29] <wikibugs>	 (03CR) 10Cwhite: [C: 03+2] ci: remove Apache config for nightlies [puppet] - 10https://gerrit.wikimedia.org/r/607075 (owner: 10Hashar)
[15:44:35] <wikibugs>	 (03PS4) 10Cwhite: ci: remove Apache config for nightlies [puppet] - 10https://gerrit.wikimedia.org/r/607075 (owner: 10Hashar)
[15:44:48] <logmsgbot>	 !log milimetric@deploy1001 Finished deploy [analytics/refinery@4d40145]: Update EventLogging refine whitelist (duration: 15m 17s)
[15:44:51] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:49:31] <icinga-wm>	 PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[15:49:46] <wikibugs>	 (03CR) 10Cwhite: [C: 03+2] ci: switch integration.wikimedia.org to scap DocumentRoot [puppet] - 10https://gerrit.wikimedia.org/r/607076 (https://phabricator.wikimedia.org/T149924) (owner: 10Hashar)
[15:49:53] <wikibugs>	 (03PS5) 10Cwhite: ci: switch integration.wikimedia.org to scap DocumentRoot [puppet] - 10https://gerrit.wikimedia.org/r/607076 (https://phabricator.wikimedia.org/T149924) (owner: 10Hashar)
[15:51:17] <wikibugs>	 10Operations, 10Cloud-VPS (Project-requests), 10cloud-services-team (Kanban): Request creation of 'sre-sandbox' VPS project - https://phabricator.wikimedia.org/T247517 (10bd808) >>! In T247517#6296240, @jbond wrote: > is to possible to get more quota in this project.  I Just tried to create a machine and we...
[15:52:41] <icinga-wm>	 RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[15:55:50] <wikibugs>	 (03CR) 10Cwhite: [C: 03+2] contint: move Apache config to flat file [puppet] - 10https://gerrit.wikimedia.org/r/607524 (owner: 10Hashar)
[15:55:57] <wikibugs>	 (03PS3) 10Cwhite: contint: move Apache config to flat file [puppet] - 10https://gerrit.wikimedia.org/r/607524 (owner: 10Hashar)
[15:56:52] <wikibugs>	 10Operations, 10Fundraising-Backlog, 10MediaWiki-extensions-CentralNotice, 10Traffic, and 2 others: Implement redirect for hide banner cookie issue - https://phabricator.wikimedia.org/T251780 (10DStrine) Here is another option. We make a subdomain with a name similar to fundraising.wikipedia.org or donate....
[15:56:58] <logmsgbot>	 !log milimetric@deploy1001 Started deploy [analytics/refinery@4d40145] (thin): Update EventLogging refine whitelist (THIN)
[15:57:02] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:57:07] <logmsgbot>	 !log milimetric@deploy1001 Finished deploy [analytics/refinery@4d40145] (thin): Update EventLogging refine whitelist (THIN) (duration: 00m 08s)
[15:57:10] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:58:01] <wikibugs>	 (03PS1) 10Cwhite: Revert "ci: switch integration.wikimedia.org to scap DocumentRoot" [puppet] - 10https://gerrit.wikimedia.org/r/611219
[15:58:34] <wikibugs>	 (03CR) 10Cwhite: [V: 03+2 C: 03+2] Revert "ci: switch integration.wikimedia.org to scap DocumentRoot" [puppet] - 10https://gerrit.wikimedia.org/r/611219 (owner: 10Cwhite)
[16:02:38] <wikibugs>	 (03CR) 10Elukey: [C: 04-1] "After a chat with EBernardson I found another use case that I didn't know, namely all ES Search nodes have a daemon that pulls from kafka " [puppet] - 10https://gerrit.wikimedia.org/r/611168 (owner: 10Elukey)
[16:04:54] <wikibugs>	 (03PS4) 10Hashar: contint: move Apache config to flat file [puppet] - 10https://gerrit.wikimedia.org/r/607524
[16:04:56] <wikibugs>	 (03PS3) 10Hashar: doc: move Apache config to flat file [puppet] - 10https://gerrit.wikimedia.org/r/607525
[16:04:58] <wikibugs>	 (03PS1) 10Hashar: ci: switch integration.wikimedia.org to scap DocumentRoot [puppet] - 10https://gerrit.wikimedia.org/r/611369 (https://phabricator.wikimedia.org/T149924)
[16:07:15] <wikibugs>	 (03CR) 10Hashar: [C: 04-1] "When deploying this we had https://integration.wikimedia.org/ broken, the HTML containing:" [puppet] - 10https://gerrit.wikimedia.org/r/611369 (https://phabricator.wikimedia.org/T149924) (owner: 10Hashar)
[16:07:43] <wikibugs>	 (03PS1) 10Bstorm: tools-prometheus: Add the paws etcd exports [puppet] - 10https://gerrit.wikimedia.org/r/611370 (https://phabricator.wikimedia.org/T256361)
[16:11:59] <wikibugs>	 (03CR) 10Krinkle: "That's because the DocumentRoot is not a real directory but a symlink with Scap3." [puppet] - 10https://gerrit.wikimedia.org/r/611369 (https://phabricator.wikimedia.org/T149924) (owner: 10Hashar)
[16:12:50] <hashar>	 Krinkle: my hero ;)
[16:16:06] <logmsgbot>	 !log elukey@cumin1001 END (FAIL) - Cookbook sre.hadoop.change-distro (exit_code=99)
[16:16:09] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:16:12] <Krinkle>	 hashar: I think adding another realpath() there should fix it, want to try that?
[16:17:43] <hashar>	 Krinkle: in strpos( $realPath, $_SERVER['DOCUMENT_ROOT'] )   ?
[16:18:01] <hashar>	 cause that variable would be whatever is configured in Apache I guess
[16:18:07] <Krinkle>	 hashar: right now it resolves the request path, and confirms it exists in the doc root
[16:18:13] <Krinkle>	 you'll want to resolve docroot itself also
[16:18:44] <Krinkle>	 so `strpos( $realPath, $realDocRoot )` instead of `strpos( $realPath, $_SERVER['DOCUMENT_ROOT'] )`
[16:18:51] <Krinkle>	 and define $realDocRoot
[16:18:56] <hashar>	 but isn't Apache supposed to prevent leaking from outside the DocumentRoot anyway?
[16:19:08] <Krinkle>	 no, I've never heard of such a rule.
[16:19:15] <Krinkle>	 symlinks can escape it
[16:19:19] <hashar>	 ah yeah
[16:19:22] <Krinkle>	 given this is PHP reading files
[16:19:36] <Krinkle>	 it is mainly for doc.wm.o
[16:19:46] <Krinkle>	 but should be harmless here for now
[16:20:31] <hashar>	 for doc.wm.o I want to move the generated files to /srv/doc outside of the DocumentRoot, but I guess I will need a full install to make sure everything works fine
[16:20:36] <hashar>	 so I will deal with it later
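The fix discussed above (resolve the document root as well as the request path before comparing them) can be sketched roughly as follows. This is an illustration in Python, not the PHP from integration/docroot; the function names and the directory layout in `demo()` are hypothetical. It also swaps the `strpos` prefix test for a path-component comparison via `os.path.commonpath`, which is a deliberate substitution rather than what was proposed in the chat.

```python
import os
import tempfile


def naive_check(request_path: str, doc_root: str) -> bool:
    # Resolves only the request path, like the check described in the chat.
    # If doc_root is itself a symlink, the resolved path no longer starts with it.
    return os.path.realpath(request_path).startswith(doc_root)


def resolved_check(request_path: str, doc_root: str) -> bool:
    # Resolves both sides before comparing.
    real_path = os.path.realpath(request_path)
    real_doc_root = os.path.realpath(doc_root)
    # commonpath compares whole path components, so "/srv/docroot-old"
    # is not mistaken for something inside "/srv/docroot".
    return os.path.commonpath([real_path, real_doc_root]) == real_doc_root


def demo() -> tuple:
    # Hypothetical layout: "docroot" is a symlink to a release directory,
    # similar in spirit to a deploy tool pointing a symlink at the current revision.
    base = tempfile.mkdtemp()
    release = os.path.join(base, "releases", "rev-1")
    os.makedirs(release)
    doc_root = os.path.join(base, "docroot")
    os.symlink(release, doc_root)
    page = os.path.join(doc_root, "index.html")
    with open(page, "w") as handle:
        handle.write("ok")
    return naive_check(page, doc_root), resolved_check(page, doc_root)
```

Calling `demo()` on a POSIX system returns the results of the one-sided and two-sided checks for a file that really does live under the symlinked document root, which is the situation that broke integration.wikimedia.org in the conversation.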
[16:28:33] <wikibugs>	 (03PS1) 10DCausse: [wdqs] overrides default blazegraph ns [puppet] - 10https://gerrit.wikimedia.org/r/611373
[16:31:01] <wikibugs>	 (03CR) 10DCausse: "should be used in conjunction with https://gerrit.wikimedia.org/r/c/wikidata/query/rdf/+/611348" [puppet] - 10https://gerrit.wikimedia.org/r/611373 (owner: 10DCausse)
[16:38:53] <wikibugs>	 (03PS2) 10Krinkle: Enable wgForceHTTPS and wgCookieSameSite='None' (Phase 2) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/610148 (https://phabricator.wikimedia.org/T256095)
[16:38:59] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] Enable wgForceHTTPS and wgCookieSameSite='None' (Phase 2) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/610148 (https://phabricator.wikimedia.org/T256095) (owner: 10Krinkle)
[16:40:46] <wikibugs>	 (03PS3) 10Krinkle: Enable wgForceHTTPS and wgCookieSameSite='None' (Phase 2) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/610148 (https://phabricator.wikimedia.org/T256095)
[16:41:34] <wikibugs>	 (03CR) 10Bstorm: [C: 03+2] tools-prometheus: Add the paws etcd exports [puppet] - 10https://gerrit.wikimedia.org/r/611370 (https://phabricator.wikimedia.org/T256361) (owner: 10Bstorm)
[16:42:24] <wikibugs>	 (03PS2) 10Hashar: ci: switch integration.wikimedia.org to scap DocumentRoot [puppet] - 10https://gerrit.wikimedia.org/r/611369 (https://phabricator.wikimedia.org/T149924)
[16:43:20] <wikibugs>	 (03CR) 10Hashar: [C: 04-1] "Thank you so much Timo. The fix should be https://gerrit.wikimedia.org/r/c/integration/docroot/+/611377" [puppet] - 10https://gerrit.wikimedia.org/r/611369 (https://phabricator.wikimedia.org/T149924) (owner: 10Hashar)
[16:43:53] <hashar>	 Krinkle: perfect thanks. And with this last patch, I am closing up and heading off on vacation
[16:43:58] <hashar>	 those can wait anyway
[16:44:15] <Krinkle>	 LGTM, have a good one!
[16:45:11] <hashar>	 thank you for saving my night :]
[16:45:26] <hashar>	 I would probably never have found that the root cause was the scap symlink hehe
[16:46:23] <wikibugs>	 (03PS2) 10Lucas Werkmeister (WMDE): Load WikibaseClient using extension registration in beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/610265
[16:47:01] <wikibugs>	 (03CR) 10Krinkle: "Per diff, this also changes test2wiki back to match enwiki (since test2wiki is not in group0). This is fine and might also be useful for t" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/610148 (https://phabricator.wikimedia.org/T256095) (owner: 10Krinkle)
[16:48:49] <wikibugs>	 (03CR) 10Lucas Werkmeister (WMDE): [C: 04-2] Load WikibaseClient using extension registration in beta (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/610265 (owner: 10Lucas Werkmeister (WMDE))
[16:50:48] <wikibugs>	 (03CR) 10Thcipriani: [C: 03+1] Enable wgForceHTTPS and wgCookieSameSite='None' (Phase 2) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/610148 (https://phabricator.wikimedia.org/T256095) (owner: 10Krinkle)
[16:52:01] <wikibugs>	 (03CR) 10Krinkle: [C: 03+2] Enable wgForceHTTPS and wgCookieSameSite='None' (Phase 2) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/610148 (https://phabricator.wikimedia.org/T256095) (owner: 10Krinkle)
[16:53:00] <wikibugs>	 (03Merged) 10jenkins-bot: Enable wgForceHTTPS and wgCookieSameSite='None' (Phase 2) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/610148 (https://phabricator.wikimedia.org/T256095) (owner: 10Krinkle)
[16:53:11] * Krinkle staging on mwdebug1002
[17:03:53] <wikibugs>	 (03PS1) 10Greg Grossmeier: admin: update matrix.py to add color [puppet] - 10https://gerrit.wikimedia.org/r/611388
[17:04:42] <wikibugs>	 (03CR) 10Greg Grossmeier: "Output used here: https://www.mediawiki.org/w/index.php?title=Wikimedia_Release_Engineering_Team/Access_list&diff=3957120&oldid=3957111&di" [puppet] - 10https://gerrit.wikimedia.org/r/611388 (owner: 10Greg Grossmeier)
[17:04:46] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] admin: update matrix.py to add color [puppet] - 10https://gerrit.wikimedia.org/r/611388 (owner: 10Greg Grossmeier)
[17:05:00] <logmsgbot>	 !log krinkle@deploy1001 Synchronized wmf-config/InitialiseSettings.php: I63fcea7737 (duration: 00m 57s)
[17:05:04] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:05:46] <wikibugs>	 (03PS1) 10Jcrespo: bacula: Merge prometheus exporter and icinga check into a single file [puppet] - 10https://gerrit.wikimedia.org/r/611390 (https://phabricator.wikimedia.org/T234900)
[17:06:55] <wikibugs>	 10Operations, 10ops-eqsin: update power ports for ps[12]-603-eqiad - https://phabricator.wikimedia.org/T255812 (10RobH) 05Open→03Resolved Ok, this is now fully done in both netbox and on the PDU software directly.
[17:07:03] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] bacula: Merge prometheus exporter and icinga check into a single file [puppet] - 10https://gerrit.wikimedia.org/r/611390 (https://phabricator.wikimedia.org/T234900) (owner: 10Jcrespo)
[17:07:30] <wikibugs>	 (03CR) 10Krinkle: "The current comment format as deployed is what the new Gerrit 3 plugin uses to decide what label to use in the extracted test table." [puppet] - 10https://gerrit.wikimedia.org/r/608296 (https://phabricator.wikimedia.org/T256575) (owner: 10Hashar)
[17:07:32] <wikibugs>	 (03CR) 10Hnowlan: api-gateway: Basic envoy chart WIP (034 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/609808 (https://phabricator.wikimedia.org/T254906) (owner: 10Hnowlan)
[17:09:19] <wikibugs>	 (03PS1) 10Elukey: sre.hadoop.change-distro.py: change logic for JN roll restart [cookbooks] - 10https://gerrit.wikimedia.org/r/611392 (https://phabricator.wikimedia.org/T244499)
[17:11:02] <wikibugs>	 (03PS2) 10Jcrespo: bacula: Merge prometheus exporter and icinga check into a single file [puppet] - 10https://gerrit.wikimedia.org/r/611390 (https://phabricator.wikimedia.org/T234900)
[17:11:11] <wikibugs>	 (03CR) 10Elukey: [C: 03+2] sre.hadoop.change-distro.py: change logic for JN roll restart [cookbooks] - 10https://gerrit.wikimedia.org/r/611392 (https://phabricator.wikimedia.org/T244499) (owner: 10Elukey)
[17:11:16] <wikibugs>	 (03PS1) 10Lucas Werkmeister (WMDE): extension-list: Load WikibaseClient via JSON [mediawiki-config] - 10https://gerrit.wikimedia.org/r/611393
[17:14:50] <wikibugs>	 (03CR) 10Lucas Werkmeister (WMDE): [C: 04-2] "Test plan:" (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/611393 (owner: 10Lucas Werkmeister (WMDE))
[17:15:48] <wikibugs>	 (03CR) 10Lucas Werkmeister (WMDE): [C: 04-2] "> extension-list is just for harvesting the i18n, so you could switch it now, if you wished." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/610265 (owner: 10Lucas Werkmeister (WMDE))
[17:17:26] <wikibugs>	 (03PS3) 10Lucas Werkmeister (WMDE): Load WikibaseClient using extension registration in beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/610265 (https://phabricator.wikimedia.org/T257435)
[17:19:01] <wikibugs>	 (03CR) 10Jeena Huneidi: Kask: Use Releng Cassandra Image (033 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/609894 (https://phabricator.wikimedia.org/T224041) (owner: 10Jeena Huneidi)
[17:21:32] <wikibugs>	 (03PS3) 10Jcrespo: bacula: Merge prometheus exporter and icinga check into a single file [puppet] - 10https://gerrit.wikimedia.org/r/611390 (https://phabricator.wikimedia.org/T234900)
[17:23:02] <wikibugs>	 (03PS4) 10Jcrespo: bacula: Merge prometheus exporter and icinga check into a single file [puppet] - 10https://gerrit.wikimedia.org/r/611390 (https://phabricator.wikimedia.org/T234900)
[17:25:58] <wikibugs>	 (03CR) 10Dzahn: "looks like pep8 doesn't like that the line is long now" [puppet] - 10https://gerrit.wikimedia.org/r/611388 (owner: 10Greg Grossmeier)
[17:26:39] <wikibugs>	 (03PS3) 10Dzahn: releases: remove duplicate rsync code from blubber and parsoid classes [puppet] - 10https://gerrit.wikimedia.org/r/610402
[17:28:07] <wikibugs>	 (03CR) 10Jcrespo: "This was mostly a cleanup before implementing a denylist for jobs that are configured on bacula, but are known to be failing for a long ti" [puppet] - 10https://gerrit.wikimedia.org/r/611390 (https://phabricator.wikimedia.org/T234900) (owner: 10Jcrespo)
[17:28:37] <wikibugs>	 10Operations, 10Fundraising-Backlog, 10MediaWiki-extensions-CentralNotice, 10Traffic, and 2 others: Implement redirect for hide banner cookie issue - https://phabricator.wikimedia.org/T251780 (10AndyRussG) **One more option** (thanks to @Jgreen for this idea): we could create a **thank-you banner on Wikipe...
[17:30:52] <wikibugs>	 (03PS2) 10Greg Grossmeier: admin: update matrix.py to add color [puppet] - 10https://gerrit.wikimedia.org/r/611388
[17:31:44] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] admin: update matrix.py to add color [puppet] - 10https://gerrit.wikimedia.org/r/611388 (owner: 10Greg Grossmeier)
[17:34:39] <RhinosF1>	 getting Failed to poll mysqli connection! on phab
[17:34:56] <RhinosF1>	 mutante, andre__: ^
[17:35:11] <jynus>	 WFM
[17:35:23] <andre__>	 RhinosF1: why me?
[17:35:26] <herron>	 working here as well
[17:35:47] <RhinosF1>	 jynus: use the search to search for '-riggle-2'
[17:35:52] <wikibugs>	 (03PS3) 10Greg Grossmeier: admin: update matrix.py to add color [puppet] - 10https://gerrit.wikimedia.org/r/611388
[17:35:58] <RhinosF1>	 andre__: in case you have an idea on why
[17:36:14] * RhinosF1 is checking for an account that might need disabling if i find it
[17:36:35] <wikibugs>	 10Operations, 10MediaWiki-General, 10serviceops-radar, 10Performance-Team (Radar), and 3 others: Move MainStash out of Redis to a simpler multi-dc aware solution - https://phabricator.wikimedia.org/T212129 (10Krinkle)
[17:36:35] <andre__>	 RhinosF1: no, I don't do databases at all
[17:37:08] <RhinosF1>	 andre__: you might be able to help look for an LTA's phab account though?
[17:37:26] <jynus>	 RhinosF1: phabricator is known to sometimes overwhelm the db on connections, but it happens rarely enough that it wasn't worth deep research
[17:37:31] <mutante>	 RhinosF1: i can't confirm it either
[17:37:31] <greg-g>	 RhinosF1: if it's search related, that may be due to the change twentyafterfour pushed out recently.
[17:37:41] <andre__>	 RhinosF1: yepp :)
[17:37:48] <RhinosF1>	 greg-g: yeah it's search although a specific one
[17:37:50] <jynus>	 as in, it is not like a huge issue, more of a rare one
[17:37:54] <RhinosF1>	 "-riggle-2"
[17:37:59] <andre__>	 RhinosF1, but you too as you're member of https://phab-ban.toolforge.org/
[17:38:05] <greg-g>	 I got it when searching with that search term
[17:38:12] <jynus>	 I didn't
[17:38:13] <RhinosF1>	 andre__: I need to find it first!
[17:39:01] <RhinosF1>	 andre__: it'll be linked to https://meta.wikimedia.org/wiki/Special:CentralAuth/ZAR2020SKAYTEC or https://meta.wikimedia.org/wiki/Special:CentralAuth?target=-riggle-2 - I can find the email they always use soon
[17:39:02] <andre__>	 RhinosF1, find what? Plus no idea what "-riggle-2" means
[17:39:13] <wikibugs>	 (03CR) 10Greg Grossmeier: "> 10:31:14 modules/admin/data/matrix.py:29:13: E741 ambiguous variable name 'l'" [puppet] - 10https://gerrit.wikimedia.org/r/611388 (owner: 10Greg Grossmeier)
[17:39:17] <RhinosF1>	 andre__: a username
[17:39:35] <andre__>	 RhinosF1, errm, why do you think that people have a Phab account? Plus this is really the wrong channel for it.
[17:39:52] <RhinosF1>	 andre__: because he nearly always does
[17:40:01] <RhinosF1>	 Where's best?
[17:40:06] <wikibugs>	 10Operations, 10serviceops, 10Patch-For-Review, 10User-Elukey: Reimage one memcached shard to Buster - https://phabricator.wikimedia.org/T252391 (10Krinkle) CentralAuth and ChronologyProtector are both still high-profile consumers of main stash. Both are scheduled for migration, but currently only with rel...
[17:40:06] <greg-g>	 -releng
[17:48:08] <wikibugs>	 (03CR) 10Thcipriani: [C: 03+1] admin: update matrix.py to add color (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/611388 (owner: 10Greg Grossmeier)
[17:55:29] <wikibugs>	 (03CR) 10Dzahn: "python matrix.py hashar thcipriani|column -t" [puppet] - 10https://gerrit.wikimedia.org/r/611388 (owner: 10Greg Grossmeier)
[17:57:03] <ebernhardson>	 !log change loginwiki password for Cindy-the-browser-test-bot, no email account was associated to allow for normal reset.
[17:57:06] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:00:35] <wikibugs>	 (03CR) 10Dzahn: "> Patch Set 3:" [puppet] - 10https://gerrit.wikimedia.org/r/611388 (owner: 10Greg Grossmeier)
[18:02:59] <wikibugs>	 10Operations, 10Fundraising-Backlog, 10MediaWiki-extensions-CentralNotice, 10Traffic, and 2 others: Implement redirect for hide banner cookie issue - https://phabricator.wikimedia.org/T251780 (10Pcoombe) **Thank you banner** This certainly seems like it could solve the problem, but it would need a not insi...
[18:28:38] <wikibugs>	 10Operations, 10Fundraising-Backlog, 10MediaWiki-extensions-CentralNotice, 10Traffic, and 2 others: Implement redirect for hide banner cookie issue - https://phabricator.wikimedia.org/T251780 (10AndyRussG) >>! In T251780#6297530, @Pcoombe wrote: > **Thank you banner** > This certainly seems like it could s...
[18:37:35] <icinga-wm>	 PROBLEM - Check systemd state on kubernetes1004 is CRITICAL: connect to address 10.64.48.52 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[18:37:37] <icinga-wm>	 PROBLEM - Check size of conntrack table on kubernetes1004 is CRITICAL: connect to address 10.64.48.52 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack
[18:38:27] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=swagger_check_proton_cluster_eqiad site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[18:40:19] <icinga-wm>	 RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[18:40:51] <icinga-wm>	 PROBLEM - MD RAID on kubernetes1004 is CRITICAL: connect to address 10.64.48.52 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering
[18:41:37] <icinga-wm>	 PROBLEM - puppet last run on kubernetes1004 is CRITICAL: connect to address 10.64.48.52 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[18:43:47] <wikibugs>	 (03CR) 10Dzahn: [C: 03+2] mariadb: remove ferm firewall hole for gerrit servers [puppet] - 10https://gerrit.wikimedia.org/r/609884 (https://phabricator.wikimedia.org/T239151) (owner: 10Dzahn)
[18:43:54] <wikibugs>	 (03PS2) 10Dzahn: mariadb: remove ferm firewall hole for gerrit servers [puppet] - 10https://gerrit.wikimedia.org/r/609884 (https://phabricator.wikimedia.org/T239151)
[18:44:53] <mutante>	 !log kubernetes1004 - started nagios-nrpe-server
[18:44:56] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:45:01] <icinga-wm>	 RECOVERY - Check systemd state on kubernetes1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[18:45:03] <icinga-wm>	 RECOVERY - Check size of conntrack table on kubernetes1004 is OK: OK: nf_conntrack is 0 % full https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack
[18:47:28] <icinga-wm>	 RECOVERY - puppet last run on kubernetes1004 is OK: OK: Puppet is currently enabled, last run 16 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[18:51:41] <icinga-wm>	 RECOVERY - MD RAID on kubernetes1004 is OK: OK: Active: 4, Working: 4, Failed: 0, Spare: 0 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering
[19:02:02] <mutante>	 !log removing firewall hole for gerrit -> mysql servers on dbproxy servers for misc db's
[19:02:05] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:02:11] <wikibugs>	 (03PS4) 10Greg Grossmeier: admin: update matrix.py to add color [puppet] - 10https://gerrit.wikimedia.org/r/611388
[19:02:55] <wikibugs>	 (03CR) 10Greg Grossmeier: "> Patch Set 3:" [puppet] - 10https://gerrit.wikimedia.org/r/611388 (owner: 10Greg Grossmeier)
[19:05:32] <wikibugs>	 (03PS5) 10Greg Grossmeier: admin: update matrix.py to add color [puppet] - 10https://gerrit.wikimedia.org/r/611388
[19:06:34] <wikibugs>	 (03CR) 10Greg Grossmeier: "Addressed color choice (ohai bikeshed) ;)" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/611388 (owner: 10Greg Grossmeier)
[19:07:19] <mutante>	 J
[19:08:56] <wikibugs>	 (03CR) 10Dzahn: [C: 03+2] "python3 matrix.py --wikitext hashar thcipriani|column -t  works" [puppet] - 10https://gerrit.wikimedia.org/r/611388 (owner: 10Greg Grossmeier)
[19:09:15] <greg-g>	 sorry for all the patchsets :)
[19:09:30] <greg-g>	 multitasking while in meetings (I know I know, it's Friday)
[19:09:51] <mutante>	 heh, no worries. my comments were also just because of python vs python3 
[19:10:58] <mutante>	 so gerrit still works even after the dbproxy servers closed their firewall holes for them now. guaranteed to not use mysql anymore.
[19:11:17] <apergos>	 awesome!
[19:11:34] <mutante>	 running that on all dbproxy* was giving me minimal concern, but it had plenty of +1s
[19:12:16] <wikibugs>	 (03PS4) 10Dzahn: releases: remove duplicate rsync code from blubber and parsoid classes [puppet] - 10https://gerrit.wikimedia.org/r/610402
[19:13:10] <mutante>	 ^ all this time we have been rsyncing stuff multiple times. one cron does /srv/org/wikimedia/releases and others do some subdirs of that.. that are already included anyway
[19:13:50] <mutante>	 this is part of replacing releases* backends with buster. needed the option to sync to multiple secondary servers, not just one
[19:15:01] <apergos>	 ooooh buster
[19:15:03] <apergos>	 good good
[19:18:12] <wikibugs>	 10Operations, 10Fundraising-Backlog, 10MediaWiki-extensions-CentralNotice, 10Traffic, and 2 others: Implement redirect for hide banner cookie issue - https://phabricator.wikimedia.org/T251780 (10DStrine) It's great we are considering alternatives. I really want to highlight the effort needed to set this up...
[19:27:39] <wikibugs>	 (03CR) 10Dzahn: "gerrit is still working after this ran on all dbproxy for misc databases. you can remove the GRANTs next" [puppet] - 10https://gerrit.wikimedia.org/r/609884 (https://phabricator.wikimedia.org/T239151) (owner: 10Dzahn)
[19:37:00] <AndyRussG>	 Krinkle ema bblack mutante around?
[19:37:29] <AndyRussG>	 Trying to figure out if it's possible to vary a Varnish response based on cookie
[19:41:35] <mutante>	 AndyRussG: possible - yes, i think so. based on comments like "Vary:Cookie" "Cookie:Token=1 value for Vary purposes" in modules/varnish/templates/text-frontend.inc.vcl.erb but i am not on the traffic team and don't know much about VCL. try asking for more details in the -traffic channel
[19:42:18] <AndyRussG>	 mutante: ahhhh thanks I didn't know about that channel!
[19:42:42] <mutante>	 sure,yw
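For context on the question above: under standard HTTP caching, a response carrying `Vary: Cookie` is stored per distinct value of the request's `Cookie` header, so caches usually normalize that header first to avoid one variant per visitor. The sketch below shows that idea in Python under stated assumptions; it is not the Wikimedia VCL, and the function names and the `hide_banner` cookie name are hypothetical.

```python
from http.cookies import SimpleCookie


def normalize_cookie(cookie_header: str, relevant: frozenset) -> str:
    # Keep only the cookies that should influence the cached response,
    # in sorted order, so unrelated cookies do not split the cache.
    jar = SimpleCookie()
    jar.load(cookie_header)
    kept = sorted(
        (name, morsel.value) for name, morsel in jar.items() if name in relevant
    )
    return "; ".join(f"{name}={value}" for name, value in kept)


def cache_key(url: str, request_headers: dict, vary: list, relevant: frozenset) -> tuple:
    # Build a lookup key from the URL plus every request header named in Vary.
    headers = {name.lower(): value for name, value in request_headers.items()}
    parts = [url]
    for header in vary:
        value = headers.get(header.lower(), "")
        if header.lower() == "cookie":
            value = normalize_cookie(value, relevant)
        parts.append((header.lower(), value))
    return tuple(parts)
```

With this key, two requests that differ only in an unrelated session cookie share one cached object, while a request with and a request without the hypothetical `hide_banner` cookie get separate ones.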
[19:43:31] <apergos>	 what about ATS? 
[19:43:43] <apergos>	 (this question may not make sense, just checking)
[19:52:39] <AndyRussG>	 apergos: ? (or unrelated to what I was asking?)
[19:55:04] <apergos>	 eventually are we not moving off of varnish entirely? and ats is deployed in part... so maybe the same question applies? or is that only the back ends?
[19:57:56] <AndyRussG>	 apergos: oh thanks! any ideas about timelines for that?
[19:58:10] <apergos>	 well I don't know how true my representation of it is
[19:58:15] <apergos>	 it could be crap :-D
[19:58:24] <apergos>	 I mean I know there is some back end work
[19:58:44] <apergos>	 https://phabricator.wikimedia.org/T227432 like this
[19:59:22] <apergos>	 but what the plans are for the front end instances, I dunno
[20:02:42] <apergos>	 there might even be zero plans...
[20:03:43] <wikibugs>	 (03CR) 10Dzahn: [C: 03+2] releases: remove duplicate rsync code from blubber and parsoid classes [puppet] - 10https://gerrit.wikimedia.org/r/610402 (owner: 10Dzahn)
[20:10:33] <AndyRussG>	 apergos: k thanks!
[20:10:47] <apergos>	 prolly shouldn't have asked the question in the first place :-D
[20:11:01] <apergos>	 oh well back to my slow progress on this heisenbug...
[20:13:01] <wikibugs>	 (03CR) 10Dzahn: "remove the additional crontab entries from releases2001 manually. the one syncing all of /srv/org/wikimedia/releases is still there and wo" [puppet] - 10https://gerrit.wikimedia.org/r/610402 (owner: 10Dzahn)
[20:15:01] <AndyRussG>	 apergos: :)
[20:18:45] <wikibugs>	 10Operations, 10Fundraising-Backlog, 10MediaWiki-extensions-CentralNotice, 10Traffic, and 2 others: Implement redirect for hide banner cookie issue - https://phabricator.wikimedia.org/T251780 (10Ejegg) @Pcoombe We started diving into this solution, and then realized that the Special:BannerLoader page is he...
[20:22:01] <wikibugs>	 (03PS1) 10Andrew Bogott: Openstack Nova: move database access to galera on cloudcontrol nodes [puppet] - 10https://gerrit.wikimedia.org/r/611421 (https://phabricator.wikimedia.org/T242455)
[20:24:51] <wikibugs>	 (03PS2) 10Dzahn: releases: move rsync code for all releases from mediawiki to common [puppet] - 10https://gerrit.wikimedia.org/r/610403
[20:27:38] <wikibugs>	 (03CR) 10Andrew Bogott: [C: 03+2] Openstack Nova: move database access to galera on cloudcontrol nodes [puppet] - 10https://gerrit.wikimedia.org/r/611421 (https://phabricator.wikimedia.org/T242455) (owner: 10Andrew Bogott)
[20:43:55] <icinga-wm>	 PROBLEM - Widespread puppet agent failures on icinga1001 is CRITICAL: 0.0101 ge 0.01 https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/yOxVDGvWk/puppet
[20:44:37] <wikibugs>	 (03PS1) 10Andrew Bogott: openstack nova: point eqiad1 to the eqiad1 galera [puppet] - 10https://gerrit.wikimedia.org/r/611422
[20:54:21] <icinga-wm>	 PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[20:55:18] <wikibugs>	 (03PS1) 10Urbanecm: Add ary language [dns] - 10https://gerrit.wikimedia.org/r/611426 (https://phabricator.wikimedia.org/T257674)
[21:01:48] <icinga-wm>	 RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[21:13:35] <wikibugs>	 10Operations, 10MediaWiki-extensions-Score, 10Security-Team, 10Wikimedia-General-or-Unknown, and 3 others: Extension:Score / Lilypond is disabled on all wikis - https://phabricator.wikimedia.org/T257066 (10sbassett) >>! In T257066#6297061, @Tgr wrote: > Have we made an effort to reach out to non-Wikimedia...
[21:23:18] <icinga-wm>	 PROBLEM - nova-compute proc minimum on cloudvirt1008 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[21:25:09] <icinga-wm>	 RECOVERY - nova-compute proc minimum on cloudvirt1008 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[21:31:23] <wikibugs>	 (03CR) 10Dzahn: [C: 03+2] "https://puppet-compiler.wmflabs.org/compiler1003/23818/" [puppet] - 10https://gerrit.wikimedia.org/r/610403 (owner: 10Dzahn)
[21:32:28] <wikibugs>	 (03PS3) 10C. Scott Ananian: VisualEditor: Explicitly set visualeditor-enable to 0 when non-default [mediawiki-config] - 10https://gerrit.wikimedia.org/r/610156 (https://phabricator.wikimedia.org/T248343)
[21:32:30] <wikibugs>	 (03CR) 10Andrew Bogott: [C: 03+2] openstack nova: point eqiad1 to the eqiad1 galera [puppet] - 10https://gerrit.wikimedia.org/r/611422 (owner: 10Andrew Bogott)
[21:36:50] <wikibugs>	 (03PS2) 10Dzahn: releases: move more common code out of the mediawiki class [puppet] - 10https://gerrit.wikimedia.org/r/610404
[21:38:05] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] releases: move more common code out of the mediawiki class [puppet] - 10https://gerrit.wikimedia.org/r/610404 (owner: 10Dzahn)
[21:48:45] <icinga-wm>	 RECOVERY - Widespread puppet agent failures on icinga1001 is OK: (C)0.01 ge (W)0.006 ge 0.005682 https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/yOxVDGvWk/puppet
[21:52:46] <ryankemper>	 !log Started long-running reindex of Elasticsearch indices in `eqiad`, `codfw`, and `dewiki` on `mwmaint1002` under tmux session `reindex` for user `ryankemper`
[21:52:50] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:54:56] <wikibugs>	 (03PS1) 10RhinosF1: create lijwikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/611435
[21:55:18] <RhinosF1>	 Amir1, Urbanecm: ^
[21:55:48] <wikibugs>	 (03PS2) 10RhinosF1: create lijwikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/611435 (https://phabricator.wikimedia.org/T257672)
[21:55:53] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] create lijwikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/611435 (https://phabricator.wikimedia.org/T257672) (owner: 10RhinosF1)
[21:56:40] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] create lijwikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/611435 (https://phabricator.wikimedia.org/T257672) (owner: 10RhinosF1)
[21:57:09] * RhinosF1 knew his mac hated IS.php
[21:57:36] <RhinosF1>	 https://usercontent.irccloud-cdn.com/file/YeL7xrJw/Screenshot%202020-07-10%20at%2022.57.30.png
[21:57:48] <RhinosF1>	 paladox: gerrit UI won't load it either
[22:06:25] <mooeypoo>	 @RhinosF1 https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/611435 loads for me
[22:06:41] <RhinosF1>	 mooeypoo: try editing the patch
[22:07:11] <mooeypoo>	 I never used that feature, but I now see the patch with a "Stop editing" button instead of "edit"
[22:08:30] <mooeypoo>	 The edit URL is different for me than in your screenshot though https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/611435,edit
[22:08:44] <mooeypoo>	 yours seems to have the trailing /2
[22:09:25] <mooeypoo>	 If I manually add the /2, then click "edit", it redirects me to the /611435,edit link
[22:14:37] <wikibugs>	 (03CR) 10Dzahn: [C: 04-1] "https://puppet-compiler.wmflabs.org/compiler1002/23819/releases1001.eqiad.wmnet/change.releases1001.eqiad.wmnet.err" [puppet] - 10https://gerrit.wikimedia.org/r/610404 (owner: 10Dzahn)
[22:16:43] <RhinosF1>	 mooeypoo: click on IS.php
[22:19:53] <wikibugs>	 (03PS3) 10Dzahn: releases: move more common code out of the mediawiki class [puppet] - 10https://gerrit.wikimedia.org/r/610404
[22:21:01] <mutante>	 RhinosF1: i have no issue loading https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/611435/2/wmf-config/InitialiseSettings.php   not even slow
[22:21:24] <RhinosF1>	 mutante: no, in edit mode
[22:22:07] <RhinosF1>	 https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/611435/2/wmf-config/InitialiseSettings.php,edit
[22:22:13] <mutante>	 RhinosF1: yea, that makes my fan start for a moment
[22:22:43] <mutante>	 IS just became too large
[22:23:00] <RhinosF1>	 mutante: yeah xcode was extremely slow at surviving it and somehow still messed it up
[22:23:03] <mutante>	 it does work eventually though
[22:23:07] <mutante>	 just needed to wait a bit
[22:23:40] * RhinosF1 wonders how long "a bit" is
[22:23:55] <mutante>	 RhinosF1: i don't know xcode but if that's an IDE then why use the browser edit mode?
[22:24:17] <RhinosF1>	 mutante: because it messed it up
[22:24:25] <RhinosF1>	 that's when jenkins -1'd
[22:24:29] <RhinosF1>	 and was super slow
[22:24:46] <wikibugs>	 (03PS9) 10Ryan Kemper: Scale largest shards to be closer to 30GB [mediawiki-config] - 10https://gerrit.wikimedia.org/r/608965 (https://phabricator.wikimedia.org/T256928)
[22:25:38] <wikibugs>	 (03CR) 10Ryan Kemper: [V: 03+2 C: 03+2] "shipping it" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/608965 (https://phabricator.wikimedia.org/T256928) (owner: 10Ryan Kemper)
[22:25:38] <mutante>	 RhinosF1: "a bit" = 1 minute for me on my hardware
[22:26:12] <RhinosF1>	 it's loading but still unusably slow to actually scroll/edit
[22:26:14] <mutante>	 jouncebot: now
[22:26:14] <jouncebot>	 For the next 8 hour(s) and 33 minute(s): No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200710T0700)
[22:26:31] <mutante>	 is this an emergency deploy?
[22:26:40] <RhinosF1>	 no
[22:26:44] <RhinosF1>	 it's a new wiki
[22:26:56] <mutante>	 i am talking about the mw-config change
[22:27:01] <mutante>	 that just got merged
[22:27:06] <RhinosF1>	 oh
[22:27:41] <RhinosF1>	 ryankemper: ^
[22:28:53] <ryankemper>	 I kicked off a reindex job and then realized I hadn't merged the corresponding config changes yet
[22:30:18] <mutante>	 gotcha
[22:30:50] <ryankemper>	 (These changes just touch our cirrussearch/elasticsearch shard replica counts)
[22:33:30] <mutante>	 alright, ack
[22:40:48] <Urbanecm>	 ryankemper: are you going to sync your config change? 🙂
[22:43:22] <ryankemper>	 Urbanecm: my bad, is there documentation somewhere on how to do that?
[22:44:35] <Urbanecm>	 ryankemper: there's https://wikitech.wikimedia.org/wiki/Heterogeneous_deployment#Change_wiki_configuration
[22:45:10] <ryankemper>	 thanks, taking a look at the steps now
[22:45:30] <mutante>	 hmm, do you really want to do this for the first time ever on a Friday afternoon, outside deployment windows?
[22:45:49] <Urbanecm>	 ftr, unless it's synced, it doesn't take effect, and it will probably confuse whoever makes changes after you, ryankemper
[22:45:54] <Urbanecm>	 +1 to what mutante says
[22:46:10] <mutante>	 this is normally only for emergencies
[22:46:19] <ryankemper>	 yeah, that's a good point
[22:46:28] <ryankemper>	 should be as simple as me opening up a revert commit in gerrit, right?
[22:46:31] <ryankemper>	 since the changes haven't been synced yet
[22:46:56] <Urbanecm>	 yup, and fetching the commits to deploy1001 to not confuse people
[22:49:00] <wikibugs>	 (03PS1) 10Ryan Kemper: Revert "Scale largest shards to be closer to 30GB" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/611447
[22:50:40] <wikibugs>	 (03CR) 10Ryan Kemper: [V: 03+2 C: 03+2] "going to self-approve given this reverts a patch that hasn't been deployed" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/611447 (owner: 10Ryan Kemper)
[22:51:34] <wikibugs>	 (03Merged) 10jenkins-bot: Revert "Scale largest shards to be closer to 30GB" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/611447 (owner: 10Ryan Kemper)
[22:54:06] <Urbanecm>	 ryankemper: sorry, by "fetching", I meant fetching and rebasing
[22:54:28] <ryankemper>	 ack
[22:58:42] <ryankemper>	 Urbanecm: okay, `/srv/mediawiki-staging` is fetched and is on the head of `origin/master`
[22:58:53] <ryankemper>	 thanks all for helping me sort this out
[22:59:01] <ryankemper>	 and sorry for generating noise :x
[22:59:08] <Urbanecm>	 Thanks ryankemper :)
[23:06:23] <icinga-wm>	 PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[23:10:05] <icinga-wm>	 RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[23:10:06] <wikibugs>	 (03PS1) 10Ahmon Dancy: Moved a comment to a better place [puppet] - 10https://gerrit.wikimedia.org/r/611455
[23:11:43] <wikibugs>	 (03PS1) 10Ahmon Dancy: Allow aptly::repo commands to run as alternate user [puppet] - 10https://gerrit.wikimedia.org/r/611457 (https://phabricator.wikimedia.org/T250157)
[23:13:03] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] Allow aptly::repo commands to run as alternate user [puppet] - 10https://gerrit.wikimedia.org/r/611457 (https://phabricator.wikimedia.org/T250157) (owner: 10Ahmon Dancy)
[23:23:48] <wikibugs>	 10Operations, 10Phabricator, 10Security-Team: Can't access phabricator from my server - https://phabricator.wikimedia.org/T257507 (10Legoktm) Can we exempt Isarra's IP from the blocklist in the meantime?
[23:49:06] <wikibugs>	 10Operations, 10Phabricator, 10Security-Team: Can't access phabricator from my server - https://phabricator.wikimedia.org/T257507 (10Isarra) If the IP block exempt right onwiki also works (at least some of the time) per T254568, maybe I should just go request that? I never bothered because I so rarely actual...
[23:58:36] <wikibugs>	 10Operations, 10Phabricator, 10Security-Team: Can't access phabricator from my server - https://phabricator.wikimedia.org/T257507 (10Reedy) >>! In T257507#6298076, @Isarra wrote: > If the IP block exempt right onwiki also works (at least some of the time) per T254568, maybe I should just go request that? I n...