[00:33:43] PROBLEM - OSPF status on cr2-eqdfw is CRITICAL: OSPFv2: 3/5 UP : OSPFv3: 3/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [00:33:57] PROBLEM - BFD status on cr2-eqdfw is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [00:34:29] PROBLEM - OSPF status on cr1-eqiad is CRITICAL: OSPFv2: 5/6 UP : OSPFv3: 5/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [00:35:05] PROBLEM - OSPF status on cr3-knams is CRITICAL: OSPFv2: 3/4 UP : OSPFv3: 3/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [01:00:33] RECOVERY - OSPF status on cr3-knams is OK: OSPFv2: 4/4 UP : OSPFv3: 4/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [01:00:57] RECOVERY - OSPF status on cr2-eqdfw is OK: OSPFv2: 5/5 UP : OSPFv3: 5/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [01:01:11] RECOVERY - BFD status on cr2-eqdfw is OK: OK: UP: 10 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [01:01:41] RECOVERY - OSPF status on cr1-eqiad is OK: OSPFv2: 6/6 UP : OSPFv3: 6/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [01:21:22] 10Operations: Audit all cumin queries in switchdc scripts - https://phabricator.wikimedia.org/T243935 (10RLazarus) [01:21:45] 10Operations, 10SRE-tools: Audit all cumin queries in switchdc scripts - https://phabricator.wikimedia.org/T243935 (10RLazarus) [02:26:06] (03PS1) 10Ori.livneh: arclamp-log: abort if no message received in 30 minutes [puppet] - 10https://gerrit.wikimedia.org/r/568732 [02:27:25] (03PS2) 10Ori.livneh: arclamp-log: abort if no message received in 30 minutes [puppet] - 10https://gerrit.wikimedia.org/r/568732 [02:29:37] (03PS3) 10Ori.livneh: arclamp-log: abort if no message received in 30 minutes [puppet] - 10https://gerrit.wikimedia.org/r/568732 [03:18:25] RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 
208.80.154.197, interfaces up: 242, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [03:20:15] RECOVERY - OSPF status on cr2-codfw is OK: OSPFv2: 5/5 UP : OSPFv3: 5/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [04:55:38] Is something wrong with cp4029. Someone was just complaining about a 503, and when I put that servername into logstash, there are a lot of "no backend" errors in the last little bit (but it seems to be only varnish, and only very recently) [04:55:43] ? [04:57:11] I'm also getting 503s on cp4029 [04:58:11] Seems like a very noticeable spike starting at around 4:20 UTC [04:58:48] checking [05:00:06] !log depooling cp4029 [05:00:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:12:00] !log restarting varnish-frontend and repooling cp4029 - T243634 [05:12:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:12:04] T243634: cp4029 varnish-fe freakout - https://phabricator.wikimedia.org/T243634 [05:37:05] ah thanks vgutierrez [05:37:29] no problem :) [06:13:35] PROBLEM - BGP status on cr1-codfw is CRITICAL: BGP CRITICAL - AS1299/IPv4: Active, AS1299/IPv6: Active https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [06:21:29] !log depool cp4032 - T243634 [06:21:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:21:34] T243634: cp4029 varnish-fe freakout - https://phabricator.wikimedia.org/T243634 [06:23:15] !log restarting varnish-frontend on cp4030 before it crashes - T243634 [06:23:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:48:49] (03PS1) 10Legoktm: Run Python tests using pytest, not nose [puppet] - 10https://gerrit.wikimedia.org/r/568856 [06:48:51] (03PS1) 10Legoktm: admin: Fix data_test.py on Python 3.9+ [puppet] - 10https://gerrit.wikimedia.org/r/568857 [06:51:35] RECOVERY - BGP status on cr1-codfw is OK: 
BGP OK - up: 33, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [07:12:31] 10Operations, 10Traffic: cp4029 varnish-fe freakout - https://phabricator.wikimedia.org/T243634 (10Vgutierrez) after depooling cp4029, the issue moved to cp4030, and upon the restart of varnish-fe on cp4030, now the number of fds is increasing on cp4031 (22k right now) [09:00:11] PROBLEM - HTTPS non-canonical-redirect-3 on ncredir4001 is CRITICAL: SSL CRITICAL - OCSP staple validity for www.wikipedia.bg has 86391 seconds left https://wikitech.wikimedia.org/wiki/Ncredir [09:00:45] PROBLEM - HTTPS non-canonical-redirect-3 on ncredir2001 is CRITICAL: SSL CRITICAL - OCSP staple validity for www.wikipedia.bg has 86357 seconds left https://wikitech.wikimedia.org/wiki/Ncredir [09:00:49] PROBLEM - HTTPS non-canonical-redirect-3 on ncredir3001 is CRITICAL: SSL CRITICAL - OCSP staple validity for www.wikipedia.bg has 86352 seconds left https://wikitech.wikimedia.org/wiki/Ncredir [09:00:51] PROBLEM - HTTPS non-canonical-redirect-3 on ncredir2002 is CRITICAL: SSL CRITICAL - OCSP staple validity for www.wikipedia.bg has 86350 seconds left https://wikitech.wikimedia.org/wiki/Ncredir [09:00:57] PROBLEM - HTTPS non-canonical-redirect-3 on ncredir4002 is CRITICAL: SSL CRITICAL - OCSP staple validity for www.wikipedia.bg has 86344 seconds left https://wikitech.wikimedia.org/wiki/Ncredir [09:01:05] PROBLEM - HTTPS non-canonical-redirect-3 on ncredir3002 is CRITICAL: SSL CRITICAL - OCSP staple validity for www.wikipedia.bg has 86336 seconds left https://wikitech.wikimedia.org/wiki/Ncredir [09:01:05] PROBLEM - HTTPS non-canonical-redirect-3 on ncredir1002 is CRITICAL: SSL CRITICAL - OCSP staple validity for www.wikipedia.bg has 86336 seconds left https://wikitech.wikimedia.org/wiki/Ncredir [09:01:27] PROBLEM - HTTPS non-canonical-redirect-3 on ncredir5002 is CRITICAL: SSL CRITICAL - OCSP staple validity for www.wikipedia.bg has 86315 seconds left 
https://wikitech.wikimedia.org/wiki/Ncredir [09:01:31] PROBLEM - HTTPS non-canonical-redirect-3 on ncredir1001 is CRITICAL: SSL CRITICAL - OCSP staple validity for www.wikipedia.bg has 86311 seconds left https://wikitech.wikimedia.org/wiki/Ncredir [09:01:33] PROBLEM - HTTPS non-canonical-redirect-3 on ncredir5001 is CRITICAL: SSL CRITICAL - OCSP staple validity for www.wikipedia.bg has 86308 seconds left https://wikitech.wikimedia.org/wiki/Ncredir [09:04:39] so 23 hours or something [09:17:20] all the redirect-3 ones are from Jan 26 and the rest are from Jan 27 or later. these are /etc/acmecerts/non-canonical-redirect-/live//.ocsp [09:36:10] I do not know what to do with this alert [09:36:26] I will take a look at previous issues [09:40:21] i am slowly wading my way through the acme-chief stuff but don't expect anything soon... on the acmechief1001 instance anyways, the OCSPs are out of date there, the 'live' ones, but just for that one (non-canonical-redirect-3), the rest are fine [09:40:30] I would check if the domain is in use, sometimes domains are decom'ed and some expiration checks alert incorrectly [09:41:02] basically, what apergos was doing [09:41:29] there are a bunch of domains for that one though, some are definitely still in use [09:42:00] it also covers, for example, wikipedia.gr ( :-P ) [09:49:18] there are 86000 secs left [09:49:24] I think we can create a task [09:49:29] and let valentin take a look [10:40:51] PROBLEM - Ensure acme-chief-backend is running only in the active node on acmechief1001 is CRITICAL: PROCS CRITICAL: 2 processes with args acme-chief-backend https://wikitech.wikimedia.org/wiki/Acme-chief [10:41:03] sigh [10:42:42] 10Operations, 10Traffic: SSL CRITICAL - OCSP staple validity for www.wikipedia.bg has X seconds left - https://phabricator.wikimedia.org/T243948 (10jijiki) [10:43:32] ACKNOWLEDGEMENT - HTTPS non-canonical-redirect-3 on ncredir1001 is CRITICAL: SSL CRITICAL - OCSP staple validity for www.wikipedia.bg has 80238 seconds left 
Effie Mouzeli Opened task T243948 https://wikitech.wikimedia.org/wiki/Ncredir [10:43:32] ACKNOWLEDGEMENT - HTTPS non-canonical-redirect-3 on ncredir1002 is CRITICAL: SSL CRITICAL - OCSP staple validity for www.wikipedia.bg has 80262 seconds left Effie Mouzeli Opened task T243948 https://wikitech.wikimedia.org/wiki/Ncredir [10:43:32] ACKNOWLEDGEMENT - HTTPS non-canonical-redirect-3 on ncredir2001 is CRITICAL: SSL CRITICAL - OCSP staple validity for www.wikipedia.bg has 80281 seconds left Effie Mouzeli Opened task T243948 https://wikitech.wikimedia.org/wiki/Ncredir [10:43:32] ACKNOWLEDGEMENT - HTTPS non-canonical-redirect-3 on ncredir2002 is CRITICAL: SSL CRITICAL - OCSP staple validity for www.wikipedia.bg has 80274 seconds left Effie Mouzeli Opened task T243948 https://wikitech.wikimedia.org/wiki/Ncredir [10:43:32] ACKNOWLEDGEMENT - HTTPS non-canonical-redirect-3 on ncredir3001 is CRITICAL: SSL CRITICAL - OCSP staple validity for www.wikipedia.bg has 80266 seconds left Effie Mouzeli Opened task T243948 https://wikitech.wikimedia.org/wiki/Ncredir [10:43:32] ACKNOWLEDGEMENT - HTTPS non-canonical-redirect-3 on ncredir3002 is CRITICAL: SSL CRITICAL - OCSP staple validity for www.wikipedia.bg has 80250 seconds left Effie Mouzeli Opened task T243948 https://wikitech.wikimedia.org/wiki/Ncredir [10:43:32] ACKNOWLEDGEMENT - HTTPS non-canonical-redirect-3 on ncredir4001 is CRITICAL: SSL CRITICAL - OCSP staple validity for www.wikipedia.bg has 80197 seconds left Effie Mouzeli Opened task T243948 https://wikitech.wikimedia.org/wiki/Ncredir [10:43:33] ACKNOWLEDGEMENT - HTTPS non-canonical-redirect-3 on ncredir4002 is CRITICAL: SSL CRITICAL - OCSP staple validity for www.wikipedia.bg has 80258 seconds left Effie Mouzeli Opened task T243948 https://wikitech.wikimedia.org/wiki/Ncredir [10:43:34] ACKNOWLEDGEMENT - HTTPS non-canonical-redirect-3 on ncredir5001 is CRITICAL: SSL CRITICAL - OCSP staple validity for www.wikipedia.bg has 80209 seconds left Effie Mouzeli Opened task 
T243948 https://wikitech.wikimedia.org/wiki/Ncredir [10:43:34] ACKNOWLEDGEMENT - HTTPS non-canonical-redirect-3 on ncredir5002 is CRITICAL: SSL CRITICAL - OCSP staple validity for www.wikipedia.bg has 80216 seconds left Effie Mouzeli Opened task T243948 https://wikitech.wikimedia.org/wiki/Ncredir [10:44:27] RECOVERY - Ensure acme-chief-backend is running only in the active node on acmechief1001 is OK: PROCS OK: 1 process with args acme-chief-backend https://wikitech.wikimedia.org/wiki/Acme-chief [10:46:37] (03PS1) 10Arturo Borrero Gonzalez: prometheus: wmcs_scripts: refresh package requirements [puppet] - 10https://gerrit.wikimedia.org/r/568953 (https://phabricator.wikimedia.org/T238096) [10:50:39] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] "PCC: https://puppet-compiler.wmflabs.org/compiler1001/20578/" [puppet] - 10https://gerrit.wikimedia.org/r/568953 (https://phabricator.wikimedia.org/T238096) (owner: 10Arturo Borrero Gonzalez) [10:51:43] PROBLEM - Ensure acme-chief-backend is running only in the active node on acmechief1001 is CRITICAL: PROCS CRITICAL: 2 processes with args acme-chief-backend https://wikitech.wikimedia.org/wiki/Acme-chief [10:52:48] ^ looking [10:53:31] RECOVERY - Ensure acme-chief-backend is running only in the active node on acmechief1001 is OK: PROCS OK: 1 process with args acme-chief-backend https://wikitech.wikimedia.org/wiki/Acme-chief [10:54:02] ok [11:00:43] PROBLEM - Ensure acme-chief-backend is running only in the active node on acmechief1001 is CRITICAL: PROCS CRITICAL: 2 processes with args acme-chief-backend https://wikitech.wikimedia.org/wiki/Acme-chief [11:04:56] ^ did any of you run it manually? [11:05:21] no, I think there is an issue with the check [11:06:01] the process is running only once [11:06:16] interesting [11:06:22] will force a recheck [11:06:32] 10Operations, 10Citoid: Request took down both zotero and citoid (exceeding memory) - https://phabricator.wikimedia.org/T243444 (10Mvolz) >>! 
In T243444#5829834, @akosiaris wrote: >>>! In T243444#5829606, @Mvolz wrote: >>>>! In T243444#5826816, @akosiaris wrote: >>>> Thanks. That's working now, but I've downlo... [11:07:06] jynus: no no [11:07:15] let me check the check [11:07:36] well, if it fails again it will do nothing [11:10:01] it is finding pids 22912 28439 [11:10:26] one of them a grep, indeed the check is bad [11:10:46] that is another ticket [11:10:58] should I file it? [11:11:21] ^ effie [11:11:40] no I did not touch any processes there, just looking [11:12:08] yeah, I asked because it seemed related to the other issue [11:12:20] jynus: I am looking, I will give it another 5' and open a task [11:12:27] and let it be [11:12:55] RECOVERY - Ensure acme-chief-backend is running only in the active node on acmechief1001 is OK: PROCS OK: 1 process with args acme-chief-backend https://wikitech.wikimedia.org/wiki/Acme-chief [11:13:10] grep most likely finished [11:13:12] the check worked this time [11:13:23] grep was trying to write somewhere [11:13:26] and it was failing [11:13:31] nah, the check is flawed [11:13:42] Jan 30 11:00:01 acmechief1001 acme-chief-backend[22912]: SIGHUP received [11:13:42] this is the hourly restart I guess [11:14:03] it is not a huge issue, but probably there is a better way to check the process [11:14:31] * apergos goes back to looking at how challenge validation works... prolly leave off soon, getting too into the bowels of this thing [11:14:53] effie: I have the verbose output of the check, but will not do anything without your ok [11:15:23] is there anything interesting ? 
[11:15:40] sending output to you [11:15:44] one sec [11:16:08] tx [11:17:29] I don't think there is anything private here: https://phabricator.wikimedia.org/P10292 [11:17:52] my theory is there is some failure on acme, which makes it slower and produced the race condition [11:18:07] so a fallout of the previous issue [11:19:00] I didn't get the runner of the process as it had stopped when I looked at the ppid [11:21:29] in any case, the check I think is flawed, but not a huge issue [11:23:11] the check itself no, something is making grep stay there lingering [11:23:24] and then the check finds 2 processes [11:23:33] sure, but what I meant is that the check should not check args [11:23:39] the check itself is the standard nagios check [11:23:49] no, it checks for arguments [11:23:57] it should check for actual execution [11:24:05] so it doesn't catch the grep [11:24:10] (in my opinion) [11:24:25] ofc the grep getting blocked is an issue [11:27:00] probably related to T243948 [11:27:00] T243948: SSL CRITICAL - OCSP staple validity for www.wikipedia.bg has X seconds left - https://phabricator.wikimedia.org/T243948 [11:27:43] the real issue is grep being blocked, the check has been fine since forever [11:28:17] sure, I just think the check is improbable [11:28:41] *improvable [11:28:50] the check is also correct, in a way [11:29:38] my point is that if I run "sleep 100 acme-chief-backend" it will wrongly fail [11:29:59] but that is a minor issue, compared to the real one [12:15:29] jynus: since it might be related, I am adding that info to the task [12:15:59] +1 [12:17:34] 10Operations, 10SRE-Access-Requests: Requesting access to deployment for niedzielski - https://phabricator.wikimedia.org/T243924 (10jijiki) p:05Triage→03Normal [12:20:08] 10Operations, 10Traffic: SSL CRITICAL - OCSP staple validity for www.wikipedia.bg has X seconds left - https://phabricator.wikimedia.org/T243948 (10jijiki) [12:20:29] thanks [12:22:08] !log add prometheus 2.7.1+ds-3+k8s+buster to 
buster-wikimedia T238096 (basically a rebuild from stretch) [12:22:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:22:12] T238096: Toolforge: prometheus: refresh setup - https://phabricator.wikimedia.org/T238096 [12:40:28] 10Operations, 10ops-codfw: codfw: rack/setup/install parse200[1-20].codfw.wmnet - https://phabricator.wikimedia.org/T243112 (10jijiki) >>! In T243112#5830999, @Dzahn wrote: > @jijiki Are these going to be parsoid/PHP appservers? But we don't want to call them mw? Let's add the new name on https://wikitech.w... [13:07:37] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [13:09:25] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [13:12:45] "The requested page title contains an invalid UTF-8 sequence" on zhwiki [13:23:13] PROBLEM - Ensure acme-chief-backend is running only in the active node on acmechief1001 is CRITICAL: PROCS CRITICAL: 2 processes with args acme-chief-backend https://wikitech.wikimedia.org/wiki/Acme-chief [13:35:51] RECOVERY - Ensure acme-chief-backend is running only in the active node on acmechief1001 is OK: PROCS OK: 1 process with args acme-chief-backend https://wikitech.wikimedia.org/wiki/Acme-chief [14:06:12] 10Operations, 10Traffic: SSL CRITICAL - OCSP staple validity for www.wikipedia.bg has X seconds left - https://phabricator.wikimedia.org/T243948 (10ArielGlenn) I can see that the challenges get set on the dns hosts by e.g. 
dig @208.80.154.238 -t txt _acme-challenge.wiki-pedia.org a little past the hour and get... [14:12:57] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [14:13:38] (03PS1) 10Marostegui: control-mariadb-*: Change version [software] - 10https://gerrit.wikimedia.org/r/568978 (https://phabricator.wikimedia.org/T242702) [14:16:35] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [14:32:51] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [14:36:29] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [14:41:13] (03CR) 10Jcrespo: [C: 03+1] control-mariadb-*: Change version [software] - 10https://gerrit.wikimedia.org/r/568978 (https://phabricator.wikimedia.org/T242702) (owner: 10Marostegui) [15:04:26] 10Operations, 10Traffic: SSL CRITICAL - OCSP staple validity for www.wikipedia.bg has X seconds left - https://phabricator.wikimedia.org/T243948 (10Vgutierrez) we have several bugs here: 1. 
acme-chief should refresh the OCSP stapling response even if he is unable to renew the certificate 2. acme-chief should i... [15:05:53] hey vgutierrez. I didn't even get past the first validation step :-D [15:22:19] 10Operations, 10Traffic: SSL CRITICAL - OCSP staple validity for www.wikipedia.bg has X seconds left - https://phabricator.wikimedia.org/T243948 (10jcrespo) If I can add a 4th and 5th, with lower priority, and feel free to disagree- "Ensure acme-chief-backend is running only in the active node" check should no... [15:28:10] 10Operations, 10Domains, 10Traffic: nameserver change for wikimedia.sk - https://phabricator.wikimedia.org/T241084 (10Luky001) 05Open→03Resolved Everything worked out well. Thanks a lot for the help! [15:38:13] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [15:40:01] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [15:40:26] 10Operations, 10Traffic: SSL CRITICAL - OCSP staple validity for www.wikipedia.bg has X seconds left - https://phabricator.wikimedia.org/T243948 (10Vgutierrez) hmm actually I'm wrong, the prevalidation works as expected for wiki-pedia.org, it's the actual DNS challenge validation that fails on acme-chief side... 
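Editorial note on the 11:10 to 11:29 exchange about the "Ensure acme-chief-backend is running only in the active node" check: the point being argued is that counting processes by a string found anywhere in their arguments also counts unrelated processes (a lingering grep, or "sleep 100 acme-chief-backend"), while counting by the program actually being executed does not. The sketch below only illustrates that difference. The `Proc` type, the `exe` field, both helper functions, PID 30001, and the command lines are hypothetical and are not taken from the real check; only PIDs 22912 and 28439 come from the log.

```python
from typing import NamedTuple


class Proc(NamedTuple):
    pid: int
    exe: str   # program being executed (hypothetical, simplified field)
    args: str  # full command line as one string


def count_by_args(procs, needle):
    """Count processes whose command line contains `needle` anywhere."""
    return sum(1 for p in procs if needle in p.args)


def count_by_command(procs, name):
    """Count processes whose executed program is exactly `name`."""
    return sum(1 for p in procs if p.exe == name)


# Invented snapshot: the daemon, a lingering grep, and the
# "sleep 100 acme-chief-backend" case mentioned in the channel.
snapshot = [
    Proc(22912, "acme-chief-backend", "acme-chief-backend"),
    Proc(28439, "grep", "grep acme-chief-backend"),
    Proc(30001, "sleep", "sleep 100 acme-chief-backend"),
]

print(count_by_args(snapshot, "acme-chief-backend"))     # args match: all three
print(count_by_command(snapshot, "acme-chief-backend"))  # command match: daemon only
```

Matching on the executed program would not explain why the grep was blocked, which the channel treats as the real problem; it would only stop the check from counting it.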
[15:49:45] 10Operations, 10Wikimedia-General-or-Unknown, 10serviceops, 10Performance-Team (Radar), 10Wikimedia-Incident: Investigate recurrent GET latency spikes on MediaWiki appservers (Oct 2019) - https://phabricator.wikimedia.org/T235872 (10jijiki) I think we should close this task for the time being. What we ha... [16:04:13] (03PS1) 10Arturo Borrero Gonzalez: prometheus-labs-targets: use python-keystoneauth1 for sessions [puppet] - 10https://gerrit.wikimedia.org/r/569019 (https://phabricator.wikimedia.org/T238096) [16:11:42] (03PS1) 10Arturo Borrero Gonzalez: prometheus: wmcs_scripts: drop package requirements [puppet] - 10https://gerrit.wikimedia.org/r/569021 (https://phabricator.wikimedia.org/T238096) [16:12:04] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] prometheus-labs-targets: use python-keystoneauth1 for sessions [puppet] - 10https://gerrit.wikimedia.org/r/569019 (https://phabricator.wikimedia.org/T238096) (owner: 10Arturo Borrero Gonzalez) [16:14:22] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] prometheus: wmcs_scripts: drop package requirements [puppet] - 10https://gerrit.wikimedia.org/r/569021 (https://phabricator.wikimedia.org/T238096) (owner: 10Arturo Borrero Gonzalez) [16:15:31] PROBLEM - Host blog.wikimedia.org is DOWN: check_ping: Invalid hostname/address - blog.wikimedia.org [16:16:27] RECOVERY - Host blog.wikimedia.org is UP: PING OK - Packet loss = 0%, RTA = 0.54 ms [16:26:56] !log manually refreshing OCSP stapling response for non-canonical-redirects-3 - T243948 [16:26:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:27:01] T243948: SSL CRITICAL - OCSP staple validity for www.wikipedia.bg has X seconds left - https://phabricator.wikimedia.org/T243948 [16:30:05] RECOVERY - HTTPS non-canonical-redirect-3 on ncredir1001 is OK: SSL OK - OCSP staple validity for www.wikipedia.bg has 577797 seconds left:Certificate *.wikipedia.bg valid until 2020-02-26 08:01:36 +0000 (expires in 26 days) 
https://wikitech.wikimedia.org/wiki/Ncredir [16:31:45] RECOVERY - HTTPS non-canonical-redirect-3 on ncredir2001 is OK: SSL OK - OCSP staple validity for www.wikipedia.bg has 577696 seconds left:Certificate *.wikipedia.bg valid until 2020-02-26 08:01:36 +0000 (expires in 26 days) https://wikitech.wikimedia.org/wiki/Ncredir [16:31:47] RECOVERY - HTTPS non-canonical-redirect-3 on ncredir4001 is OK: SSL OK - OCSP staple validity for www.wikipedia.bg has 577694 seconds left:Certificate *.wikipedia.bg valid until 2020-02-26 08:01:36 +0000 (expires in 26 days) https://wikitech.wikimedia.org/wiki/Ncredir [16:31:57] RECOVERY - HTTPS non-canonical-redirect-3 on ncredir1002 is OK: SSL OK - OCSP staple validity for www.wikipedia.bg has 577685 seconds left:Certificate *.wikipedia.bg valid until 2020-02-26 08:01:36 +0000 (expires in 26 days) https://wikitech.wikimedia.org/wiki/Ncredir [16:31:57] RECOVERY - HTTPS non-canonical-redirect-3 on ncredir2002 is OK: SSL OK - OCSP staple validity for www.wikipedia.bg has 577684 seconds left:Certificate *.wikipedia.bg valid until 2020-02-26 08:01:36 +0000 (expires in 26 days) https://wikitech.wikimedia.org/wiki/Ncredir [16:32:17] RECOVERY - HTTPS non-canonical-redirect-3 on ncredir5002 is OK: SSL OK - OCSP staple validity for www.wikipedia.bg has 577665 seconds left:Certificate *.wikipedia.bg valid until 2020-02-26 08:01:36 +0000 (expires in 26 days) https://wikitech.wikimedia.org/wiki/Ncredir [16:32:25] RECOVERY - HTTPS non-canonical-redirect-3 on ncredir5001 is OK: SSL OK - OCSP staple validity for www.wikipedia.bg has 577657 seconds left:Certificate *.wikipedia.bg valid until 2020-02-26 08:01:36 +0000 (expires in 26 days) https://wikitech.wikimedia.org/wiki/Ncredir [16:32:31] RECOVERY - HTTPS non-canonical-redirect-3 on ncredir3001 is OK: SSL OK - OCSP staple validity for www.wikipedia.bg has 577651 seconds left:Certificate *.wikipedia.bg valid until 2020-02-26 08:01:36 +0000 (expires in 26 days) 
https://wikitech.wikimedia.org/wiki/Ncredir [16:32:37] RECOVERY - HTTPS non-canonical-redirect-3 on ncredir4002 is OK: SSL OK - OCSP staple validity for www.wikipedia.bg has 577646 seconds left:Certificate *.wikipedia.bg valid until 2020-02-26 08:01:36 +0000 (expires in 26 days) https://wikitech.wikimedia.org/wiki/Ncredir [16:32:49] RECOVERY - HTTPS non-canonical-redirect-3 on ncredir3002 is OK: SSL OK - OCSP staple validity for www.wikipedia.bg has 577632 seconds left:Certificate *.wikipedia.bg valid until 2020-02-26 08:01:36 +0000 (expires in 26 days) https://wikitech.wikimedia.org/wiki/Ncredir [16:37:50] 10Operations, 10Traffic: SSL CRITICAL - OCSP staple validity for www.wikipedia.bg has X seconds left - https://phabricator.wikimedia.org/T243948 (10Vgutierrez) I've ran a manual OCSP refresh for non-canonical-redirects-3 running: ` sudo http_proxy=http://webproxy.eqiad.wmnet:8080 python3 ~vgutierrez/ocsp.py no... [17:02:56] !log restarting varnish-frontend on cp4031 before it crashes - T243634 [17:02:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:02:59] T243634: cp4029 varnish-fe freakout - https://phabricator.wikimedia.org/T243634 [17:03:25] !log repooling cp4032 - T243634 [17:03:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:15:13] PROBLEM - Host ms-be1034 is DOWN: PING CRITICAL - Packet loss = 100% [17:23:07] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [17:24:55] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [17:30:04] Amir1: I'm around if you need support. [17:31:58] James_F: Thanks! [17:33:08] (03Abandoned) 10C. Scott Ananian: Allow OCG machines in Beta to be jenkins slaves. [puppet] - 10https://gerrit.wikimedia.org/r/170130 (owner: 10C. Scott Ananian) [17:33:30] (03CR) 10C. Scott Ananian: "OCG is dead. Long live electron." [puppet] - 10https://gerrit.wikimedia.org/r/170130 (owner: 10C. Scott Ananian) [17:34:34] (03PS1) 10WMDE-leszek: Wikibase: added config variables to configure entity sources [mediawiki-config] - 10https://gerrit.wikimedia.org/r/569031 (https://phabricator.wikimedia.org/T242087) [17:35:20] 10Operations, 10ops-eqiad, 10DBA: es1019: reseat IPMI - https://phabricator.wikimedia.org/T243963 (10Marostegui) [17:35:35] (03CR) 10WMDE-leszek: [C: 04-1] "Not ready yet, awaiting ongoing work in Wikibase" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/569031 (https://phabricator.wikimedia.org/T242087) (owner: 10WMDE-leszek) [17:36:17] 10Operations, 10ops-eqiad, 10DBA: es1019: reseat IPMI - https://phabricator.wikimedia.org/T243963 (10Marostegui) p:05Triage→03Normal [17:37:49] 10Operations, 10ops-eqiad, 10DBA: es1019: reseat IPMI - https://phabricator.wikimedia.org/T243963 (10jcrespo) Strange, not as if it has happened before! 
T120689 T155691 T187530 T201132 T213422 T233698 [17:38:59] 10Operations, 10ops-eqiad, 10DBA: es1019: reseat IPMI - https://phabricator.wikimedia.org/T243963 (10Marostegui) [17:49:44] !log ladsgroup@deploy1001 Synchronized php-1.35.0-wmf.16/extensions/Wikibase/repo/maintenance/rebuildItemTerms.php: wbterms: Write only to the new term store in rebuildItemTerms (T243944) (duration: 01m 09s) [17:49:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:49:48] T243944: Really large holes in the new term store (again) - https://phabricator.wikimedia.org/T243944 [17:51:23] !log ladsgroup@deploy1001 Synchronized php-1.35.0-wmf.16/extensions/Wikibase/lib/includes/Store/Sql/Terms/FingerprintableEntityTermStoreTrait.php: wbterms: Fix incorrect deletion of rows in findActuallyUnusedTermIds (T243944) (duration: 01m 06s) [17:51:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:00:41] PROBLEM - High average GET latency for mw requests on appserver in eqiad on icinga1001 is CRITICAL: cluster=appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET [18:02:26] That doesn't look like us: https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-1h&to=now&var-datasource=eqiad%20prometheus%2Fops&var-cluster=appserver&var-method=GET [18:05:11] Something is happening [18:07:55] !log depool cp4032 and perform a rolling restart of varnish-fe at cp4027-cp4031 - T243634 [18:07:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:07:58] T243634: cp4029 varnish-fe freakout - https://phabricator.wikimedia.org/T243634 [18:08:48] it's recovering [18:11:31] RECOVERY - High average GET latency 
for mw requests on appserver in eqiad on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET [18:11:57] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [18:13:47] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [18:16:17] These are not related to us but someone needs to take a look, it looks like wtp* are having fun with zhwiki [18:16:49] 10Operations, 10Traffic: cp4029 varnish-fe freakout - https://phabricator.wikimedia.org/T243634 (10Vgutierrez) ulsfo is the only DC where we are seeing this issue, and at the same time it's the DC where we are testing the cache nodes buster upgrades (T242093). To discard that cp4032 (buster text node) could b... [18:40:04] 10Operations, 10serviceops: Upgrade and improve our application object caching service (memcached) - https://phabricator.wikimedia.org/T240684 (10Joe) @RLazarus has taken a look at 0.41 some time ago, not sure if he remembers something in that direction. 
[18:42:30] 10Operations, 10Traffic: cp4029 varnish-fe freakout - https://phabricator.wikimedia.org/T243634 (10Vgutierrez) after finishing the rolling restart, this is current amount of fds on varnish-frontend per node: ` ===== NODE GROUP ===== (1) cp4028.ulsfo.wmnet ----- OUTPUT of 'ls -1 /proc/$(ps...2 }')/fd | wc -l' -... [18:47:55] RECOVERY - Host ms-be1034 is UP: PING OK - Packet loss = 0%, RTA = 0.18 ms [19:12:54] 10Operations, 10ops-codfw: PDUs with Infeed < 0.5Amps - https://phabricator.wikimedia.org/T222464 (10Papaul) 05Open→03Resolved I think we don't need alerting when the readings are low since some racks in codfw are not fully populated. For example rack d8 has only 7 servers. Let us only alert when the val... [19:16:23] 10Operations, 10ops-codfw, 10DC-Ops, 10fundraising-tech-ops: hw troubleshooting: hardware RAID predictive failure for bellatrix.frack.codfw.wmnet - https://phabricator.wikimedia.org/T240876 (10Papaul) 05Open→03Declined No need for this task since we have a replacing server setup on T237440 [19:17:40] 10Operations, 10Traffic: cp4029 varnish-fe freakout - https://phabricator.wikimedia.org/T243634 (10Vgutierrez) in ~30 minutes cp4029 has gone from 1400 to ~8600 so it doesn't look like cp4032 is at fault here: `(1) cp4029.ulsfo.wmnet ----- OUTPUT of 'ls -1 /proc/$(ps...2 }')/fd | wc -l' ----- 8570 ============... [19:33:15] (03PS1) 10Elukey: admin: add krb flag for user kartik [puppet] - 10https://gerrit.wikimedia.org/r/569055 (https://phabricator.wikimedia.org/T243929) [19:34:01] (03CR) 10Dzahn: "It should be restart." 
[puppet] - 10https://gerrit.wikimedia.org/r/567189 (owner: 10Legoktm) [19:34:29] (03CR) 10Dzahn: [C: 03+2] codesearch: Restart hound_proxy if port configuration changes [puppet] - 10https://gerrit.wikimedia.org/r/567189 (owner: 10Legoktm) [19:35:46] (03CR) 10Dzahn: [C: 03+1] admin: add krb flag for user kartik [puppet] - 10https://gerrit.wikimedia.org/r/569055 (https://phabricator.wikimedia.org/T243929) (owner: 10Elukey) [19:37:24] !log copying /var/log/apache2 to /root on all eqiad mw appservers to preserve logs [19:37:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:38:11] PROBLEM - High average GET latency for mw requests on appserver in eqiad on icinga1001 is CRITICAL: cluster=appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET [19:41:23] that spike looks like it was caused by me but it's going away (we talked about doing this) hrmm [19:42:53] PROBLEM - High average POST latency for mw requests on appserver in eqiad on icinga1001 is CRITICAL: cluster=appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=POST [19:43:52] (03CR) 10Elukey: [C: 03+2] admin: add krb flag for user kartik [puppet] - 10https://gerrit.wikimedia.org/r/569055 (https://phabricator.wikimedia.org/T243929) (owner: 10Elukey) [19:44:35] RECOVERY - High average POST latency for mw requests on appserver in eqiad on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=POST [19:44:53] and .. the cumin action is over [19:46:12] 10Operations, 10Gerrit, 10Release-Engineering-Team-TODO, 10Release-Engineering-Team (Development services): Create Gerrit Administrator right policy - https://phabricator.wikimedia.org/T218686 (10JZ19775) [19:47:24] 10Operations, 10Gerrit, 10Release-Engineering-Team-TODO, 10Release-Engineering-Team (Development services): Create Gerrit Administrator right policy - https://phabricator.wikimedia.org/T218686 (10Peachey88) [19:48:27] RECOVERY - High average GET latency for mw requests on appserver in eqiad on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET [19:55:44] 10Operations, 10ops-eqiad: Degraded RAID on analytics1030 - https://phabricator.wikimedia.org/T243971 (10ops-monitoring-bot) [20:00:11] 10Operations, 10Wikimedia-Mailing-lists: Loss of HTML formatting in email to or from Wikimedia-l - https://phabricator.wikimedia.org/T243809 (10Pine) Thanks for the information, everyone. 
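Note on the T243634 updates above: the file descriptor counts reported for varnish-frontend come from listing /proc/<pid>/fd and counting the entries. The exact command is truncated in the log and is not reconstructed here. The following is only a minimal sketch of the same idea in Python, assuming a Linux host; the names count_open_fds, fd_growth and proc_root are hypothetical and are not taken from any Wikimedia tooling.

```python
import os
from pathlib import Path


def count_open_fds(pid: int, proc_root: str = "/proc") -> int:
    """Count the entries in <proc_root>/<pid>/fd (Linux procfs layout).

    proc_root is a hypothetical parameter so the function can be pointed
    at a fake directory tree when testing.
    """
    fd_dir = Path(proc_root) / str(pid) / "fd"
    return sum(1 for _ in fd_dir.iterdir())


def fd_growth(before: int, after: int, minutes: float) -> float:
    """Average number of file descriptors gained per minute."""
    return (after - before) / minutes


if __name__ == "__main__":
    # Descriptors currently held by this Python process itself.
    print(count_open_fds(os.getpid()))
    # Figures quoted for cp4029 in the log: 1400 rising to 8570 in about 30 minutes.
    print(fd_growth(1400, 8570, 30))
```

With the cp4029 figures quoted above this works out to roughly 239 descriptors per minute, which is the kind of steady growth that suggests a leak rather than a load spike. Reading another user's /proc/<pid>/fd normally requires root or the same uid as the process.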
[20:29:01] PROBLEM - Disk space on notebook1004 is CRITICAL: DISK CRITICAL - free space: /srv 2134 MB (1% inode=78%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=notebook1004&var-datasource=eqiad+prometheus/ops [20:36:33] (03PS1) 10Andrew Bogott: Buster vms: include python3 versions of openstack clients [puppet] - 10https://gerrit.wikimedia.org/r/569084 [21:57:42] (03CR) 10Volans: [C: 03+2] spicerack: add getter for the Netbox master host [software/spicerack] - 10https://gerrit.wikimedia.org/r/567164 (https://phabricator.wikimedia.org/T231068) (owner: 10Volans) [21:58:45] (03CR) 10Volans: [C: 03+2] ganeti: add cluster to instance() [software/spicerack] - 10https://gerrit.wikimedia.org/r/567168 (https://phabricator.wikimedia.org/T231068) (owner: 10Volans) [21:59:05] (03CR) 10Volans: [C: 03+2] netbox: rename injected property in host details [software/spicerack] - 10https://gerrit.wikimedia.org/r/567175 (https://phabricator.wikimedia.org/T231068) (owner: 10Volans) [22:01:29] (03CR) 10jerkins-bot: [V: 04-1] spicerack: add getter for the Netbox master host [software/spicerack] - 10https://gerrit.wikimedia.org/r/567164 (https://phabricator.wikimedia.org/T231068) (owner: 10Volans) [22:01:31] (03CR) 10jerkins-bot: [V: 04-1] ganeti: add cluster to instance() [software/spicerack] - 10https://gerrit.wikimedia.org/r/567168 (https://phabricator.wikimedia.org/T231068) (owner: 10Volans) [22:01:33] (03CR) 10jerkins-bot: [V: 04-1] netbox: rename injected property in host details [software/spicerack] - 10https://gerrit.wikimedia.org/r/567175 (https://phabricator.wikimedia.org/T231068) (owner: 10Volans) [22:03:12] (03CR) 10Volans: [C: 03+2] "recheck" [software/spicerack] - 10https://gerrit.wikimedia.org/r/567164 (https://phabricator.wikimedia.org/T231068) (owner: 10Volans) [22:07:34] (03Merged) 10jenkins-bot: spicerack: add getter for the Netbox master host [software/spicerack] - 
10https://gerrit.wikimedia.org/r/567164 (https://phabricator.wikimedia.org/T231068) (owner: 10Volans) [22:07:36] (03Merged) 10jenkins-bot: ganeti: add cluster to instance() [software/spicerack] - 10https://gerrit.wikimedia.org/r/567168 (https://phabricator.wikimedia.org/T231068) (owner: 10Volans) [22:11:38] (03Merged) 10jenkins-bot: netbox: rename injected property in host details [software/spicerack] - 10https://gerrit.wikimedia.org/r/567175 (https://phabricator.wikimedia.org/T231068) (owner: 10Volans) [22:21:18] (03Abandoned) 10Brion VIBBER: List deps for MIDI to Ogg/MP3 conversion for video scalers [puppet] - 10https://gerrit.wikimedia.org/r/514962 (https://phabricator.wikimedia.org/T135597) (owner: 10Brion VIBBER) [23:19:25] 10Operations, 10ops-eqiad, 10DBA, 10DC-Ops: es1019: reseat IPMI - https://phabricator.wikimedia.org/T243963 (10wiki_willy) a:03Cmjohnson [23:40:41] 10Operations, 10Phabricator, 10Traffic, 10serviceops, 10Release-Engineering-Team-TODO (2020-01 to 2020-03 (Q3)): Phabricator downtime due to aphlict and websockets (aphlict current disabled) - https://phabricator.wikimedia.org/T238593 (10mmodell) questions: why are there 2 yaml files for apache traffic... [23:41:58] (03PS1) 10Dzahn: phabricator: remove firewall holes for port 80 [puppet] - 10https://gerrit.wikimedia.org/r/569100 [23:51:51] 10Operations, 10ops-eqiad, 10vm-requests: rack/setup/install ganeti10([09]|1[0-8]).eqiad.wmnet - https://phabricator.wikimedia.org/T228924 (10Cmjohnson) [23:59:20] 10Operations, 10ops-eqiad, 10vm-requests: rack/setup/install ganeti10([09]|1[0-8]).eqiad.wmnet - https://phabricator.wikimedia.org/T228924 (10Cmjohnson) All but ganeti1017 are ready for handoff, I am not sure what is going on with this server, I cannot get any output on the console. This needs to be checked...