[00:12:18] PROBLEM - Work requests waiting in Zuul Gearman server https://grafana.wikimedia.org/dashboard/db/zuul-gearman on contint1001 is CRITICAL: CRITICAL: 35.71% of data above the critical threshold [140.0]
[00:17:19] PROBLEM - nodepoold running on labnodepool1001 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 113 (nodepool), regex args ^/usr/bin/python /usr/bin/nodepoold -d
[00:19:08] PROBLEM - Host cp3007 is DOWN: PING CRITICAL - Packet loss = 100%
[00:19:09] PROBLEM - Host cp3005 is DOWN: PING CRITICAL - Packet loss = 100%
[00:19:09] PROBLEM - Host cp3008 is DOWN: PING CRITICAL - Packet loss = 100%
[00:19:09] PROBLEM - Host cp3006 is DOWN: PING CRITICAL - Packet loss = 100%
[00:19:09] PROBLEM - Host cp3010 is DOWN: PING CRITICAL - Packet loss = 100%
[00:19:38] RECOVERY - Host cp3008 is UP: PING OK - Packet loss = 0%, RTA = 83.84 ms
[00:19:38] RECOVERY - Host cp3010 is UP: PING OK - Packet loss = 0%, RTA = 84.06 ms
[00:19:38] RECOVERY - Host cp3007 is UP: PING OK - Packet loss = 0%, RTA = 84.08 ms
[00:19:38] RECOVERY - Host cp3005 is UP: PING OK - Packet loss = 0%, RTA = 83.89 ms
[00:19:38] RECOVERY - Host cp3006 is UP: PING OK - Packet loss = 0%, RTA = 83.85 ms
[00:34:44] ...
[00:36:55] !log wmcs: purged rabbitmq ceilometer_notifications.* queues
[00:37:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:37:12] !log labnodepool1001:~# service nodepool restart
[00:37:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:37:38] RECOVERY - nodepoold running on labnodepool1001 is OK: PROCS OK: 1 process with UID = 113 (nodepool), regex args ^/usr/bin/python /usr/bin/nodepoold -d
[01:32:28] RECOVERY - Work requests waiting in Zuul Gearman server https://grafana.wikimedia.org/dashboard/db/zuul-gearman on contint1001 is OK: OK: Less than 30.00% above the threshold [90.0]
[03:28:49] PROBLEM - MariaDB Slave Lag: s1 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 820.86 seconds
[03:32:58] PROBLEM - puppet last run on mw2135 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/usr/share/GeoIP/GeoIPCity.dat.gz]
[03:33:08] PROBLEM - puppet last run on mw2222 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/usr/share/GeoIP/GeoIP2-City.mmdb.gz]
[03:57:38] PROBLEM - puppet last run on db2016 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/etc/mysql/grcat.config]
[04:00:39] RECOVERY - puppet last run on mw2135 is OK: OK: Puppet is currently enabled, last run 0 seconds ago with 0 failures
[04:00:58] RECOVERY - puppet last run on mw2222 is OK: OK: Puppet is currently enabled, last run 24 seconds ago with 0 failures
[04:22:28] PROBLEM - HHVM jobrunner on mw1300 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 473 bytes in 0.001 second response time
[04:23:28] RECOVERY - HHVM jobrunner on mw1300 is OK: HTTP OK: HTTP/1.1 200 OK - 202 bytes in 0.001 second response time
[04:25:19] RECOVERY - puppet last run on db2016 is OK: OK: Puppet is currently enabled, last run 25 seconds ago with 0 failures
[04:41:48] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0]
[04:46:48] RECOVERY - Text HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[05:13:59] I am having a memory blank, what is the name of the search function that we use?
[05:14:58] ask and one remembers "CirrusSearch"
[05:31:19] RECOVERY - are wikitech and wt-static in sync on silver is OK: wikitech-static OK - wikitech and wikitech-static in sync (17510 200000s)
[05:32:08] RECOVERY - are wikitech and wt-static in sync on labtestweb2001 is OK: wikitech-static OK - wikitech and wikitech-static in sync (17510 200000s)
[05:34:15] time to leave the hotel... will be on the road today so see youse all back on line this evening late
[05:48:58] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0]
[05:55:59] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 10.00% of data above the critical threshold [1000.0]
[05:58:08] RECOVERY - Text HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[06:27:58] PROBLEM - graphite.wikimedia.org on graphite1003 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 398 bytes in 0.002 second response time
[06:28:58] RECOVERY - graphite.wikimedia.org on graphite1003 is OK: HTTP OK: HTTP/1.1 200 OK - 1547 bytes in 0.010 second response time
[07:13:13] (PS1) BryanDavis: rabbitmq: Add drain_queue utility script [puppet] - https://gerrit.wikimedia.org/r/377039 (https://phabricator.wikimedia.org/T170492)
[07:13:15] (PS1) BryanDavis: rabbitmq: remove orphan files [puppet] - https://gerrit.wikimedia.org/r/377040
[07:14:17] (CR) BryanDavis: "I'm not 100% on this being the right module for this script. It is very OpenStack specific." [puppet] - https://gerrit.wikimedia.org/r/377039 (https://phabricator.wikimedia.org/T170492) (owner: BryanDavis)
[08:53:58] PROBLEM - Esams HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 10.00% of data above the critical threshold [1000.0]
[08:54:38] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0]
[08:55:42] (CR) Volans: [C: 1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/376702 (owner: Muehlenhoff)
[09:03:38] RECOVERY - Text HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[09:03:59] RECOVERY - Esams HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[09:27:32] (CR) Volans: "See some comments inline." (11 comments) [debs/debdeploy] - https://gerrit.wikimedia.org/r/376715 (owner: Muehlenhoff)
[09:58:28] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0]
[09:58:49] PROBLEM - Esams HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 50.00% of data above the critical threshold [1000.0]
[09:59:48] PROBLEM - Eqiad HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 44.44% of data above the critical threshold [1000.0]
[10:00:38] PROBLEM - Ulsfo HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 44.44% of data above the critical threshold [1000.0]
[10:01:47] Getting a lot of errors of the "Request from via cp1053 cp1053, Varnish XID 883949803
[10:01:49] Error: 503, Backend fetch failed at Sun, 10 Sep 2017 10:00:53 GMT" kind
[10:01:55] Replication running?
[10:02:04] or was there planned maintenance?
[10:02:58] Wikidata has just died (normally I'd be grateful, but as I'm trying to sort out stuff there, it's a nuisance right now)
[10:04:25] same error, cp1053, hitting refresh fixes it
[10:04:38] same here
[10:04:44] cp1053
[10:04:54] and no css
[10:05:01] (from esams)
[10:05:58] me too. x-cache:cp1053 int, cp2004 miss, cp4028 pass, cp4027 miss
[10:06:37] I'd recently updated a template at Commons?
[10:06:44] Would that cause meltdown? I hope not
[10:06:46] cp1053 looks OK in icinga, loads of 503s from 09:50 UTC onwards
[10:07:58] Also https://ganglia.wikimedia.org/latest seems to be giving 403's
[10:08:19] Presumably it's intended only appropriate staff can access it
[10:08:28] 401's, sorry
[10:08:54] getting tons of such errors: Caused by: java.io.IOException: Server returned HTTP response code: 503 for URL: https://commons.wikimedia.org/w/api.php?format=xml&rawcontinue=1&maxlag=6&action=query&prop=categories&cllimit=max&titles=File%3A%C3%89glise+St+%C3%89tienne+Montcet+6.jpg
[10:09:47] * Ops Clinic Duty: ema
[10:10:00] Welp, cp1053 does have a critical (Varnish HTTP text-backend - port 3128)
[10:10:18] PROBLEM - MediaWiki exceptions and fatals per minute on graphite1001 is CRITICAL: CRITICAL: 70.00% of data above the critical threshold [50.0]
[10:12:02] ema seems to be out of channel right now
[10:12:05] :(
[10:12:18] RECOVERY - MediaWiki exceptions and fatals per minute on graphite1001 is OK: OK: Less than 70.00% above the threshold [25.0]
[10:13:11] Operations: Multiple 503 Erros - https://phabricator.wikimedia.org/T175473#3594350 (Steinsplitter)
[10:18:58] RECOVERY - Eqiad HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[10:21:08] Operations: Multiple 503 Erros - https://phabricator.wikimedia.org/T175473#3594364 (Samtar) Varnish is having a heck of a lot of 503s since 09:50 UTC {F9432897} cp1053 did flap `Varnish HTTP text-backend - port 3128` but has since recovered Stack from a [[ https://logstash.wikimedia.org/app/kibana#/doc/l...
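The x-cache values quoted above are how the failing layer was narrowed down to cp1053: the header lists every Varnish host that handled the request, and "int" marks a synthetic response generated by that host rather than a page fetched from the backends. A minimal sketch of the same check done in Python; the URL and User-Agent below are illustrative assumptions, not anything taken from the incident:

    # Sketch: fetch a page and print which cache hosts handled it, mirroring the
    # manual X-Cache debugging above. URL and User-Agent are illustrative only.
    import requests

    def cache_path(url):
        resp = requests.get(url,
                            headers={"User-Agent": "x-cache-check/0.1 (example)"},
                            timeout=10)
        # e.g. "cp1053 int, cp2004 miss, cp4028 pass, cp4027 miss" during the
        # incident above; "int" pointed at cp1053 as the host producing the 503s.
        print(resp.status_code, resp.headers.get("X-Cache", "<no X-Cache header>"))

    if __name__ == "__main__":
        cache_path("https://www.wikidata.org/wiki/Special:BlankPage")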
[10:21:48] RECOVERY - Text HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[10:21:48] RECOVERY - Ulsfo HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[10:22:08] PROBLEM - Esams HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 10.00% of data above the critical threshold [1000.0]
[10:23:08] RECOVERY - Esams HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[11:06:49] PROBLEM - MediaWiki exceptions and fatals per minute on graphite1001 is CRITICAL: CRITICAL: 90.00% of data above the critical threshold [50.0]
[11:13:58] RECOVERY - MediaWiki exceptions and fatals per minute on graphite1001 is OK: OK: Less than 70.00% above the threshold [25.0]
[11:31:00] bd808; it needed more memory -- it is working fine now.
[12:19:41] Operations: Multiple 503 Erros - https://phabricator.wikimedia.org/T175473#3594479 (Samtar) Appears to be resolved now {F9435260}
[12:58:49] Operations, Traffic: Multiple 503 Erros - https://phabricator.wikimedia.org/T175473#3594540 (elukey)
[13:09:58] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [1000.0]
[13:11:19] PROBLEM - Esams HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [1000.0]
[13:12:08] PROBLEM - Ulsfo HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 55.56% of data above the critical threshold [1000.0]
[13:12:29] PROBLEM - HHVM rendering on mw1295 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[13:13:18] PROBLEM - Eqiad HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [1000.0]
[13:13:19] RECOVERY - HHVM rendering on mw1295 is OK: HTTP OK: HTTP/1.1 200 OK - 73700 bytes in 0.140 second response time
[13:13:29] !log restart cp1053's varnish backend for mailbox expiry lag and 503s - T175473
[13:13:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:13:44] T175473: Multiple 503 Erros - https://phabricator.wikimedia.org/T175473
[13:13:58] PROBLEM - PyBal backends health check on lvs1006 is CRITICAL: PYBAL CRITICAL - logstash-syslog-udp_10514_udp - Could not depool server logstash1001.eqiad.wmnet because of too many down!: logstash-log4j_4560 - Could not depool server logstash1002.eqiad.wmnet because of too many down!: logstash-json-udp_11514_udp - Could not depool server logstash1001.eqiad.wmnet because of too many down!: logstash-syslog-tcp_10514 - Could not de
[13:13:59] h1002.eqiad.wmnet because of too many down!: logstash-json-tcp_11514 - Could not depool server logstash1002.eqiad.wmnet because of too many down!
[13:14:09] PROBLEM - PyBal backends health check on lvs1009 is CRITICAL: PYBAL CRITICAL - logstash-json-tcp_11514 - Could not depool server logstash1001.eqiad.wmnet because of too many down!: logstash-log4j_4560 - Could not depool server logstash1002.eqiad.wmnet because of too many down!: logstash-json-udp_11514_udp - Could not depool server logstash1002.eqiad.wmnet because of too many down!: logstash-syslog-tcp_10514 - Could not depool s
[13:14:09] eqiad.wmnet because of too many down!: logstash-syslog-udp_10514_udp - Could not depool server logstash1002.eqiad.wmnet because of too many down!
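For bot operators caught in spikes like this one (the maxlag'd Commons API call above, and the Wikidata login failures reported just below), the usual client-side mitigation is to back off on 5xx responses rather than retry immediately. A rough sketch under that assumption; the helper name, delays, and the example query are made up for illustration and are not anyone's actual bot code:

    # Sketch: back off and retry API requests that hit 5xx errors from the cache
    # layer. Names, delays, and the example query are illustrative only.
    import time
    import requests

    def get_with_backoff(url, params=None, max_attempts=5, base_delay=5):
        for attempt in range(max_attempts):
            resp = requests.get(url, params=params, timeout=30)
            if resp.status_code < 500:
                return resp
            # Honour Retry-After when present, otherwise back off exponentially.
            retry_after = resp.headers.get("Retry-After", "")
            delay = int(retry_after) if retry_after.isdigit() else base_delay * 2 ** attempt
            time.sleep(delay)
        resp.raise_for_status()  # still failing after max_attempts

    # The kind of categories query that was erroring above; maxlag=6 additionally
    # asks MediaWiki itself to refuse work while replication lag is high.
    r = get_with_backoff("https://commons.wikimedia.org/w/api.php",
                         params={"format": "json", "action": "query",
                                 "prop": "categories", "maxlag": 6,
                                 "titles": "File:Example.jpg"})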
[13:14:19] PROBLEM - PyBal backends health check on lvs1003 is CRITICAL: PYBAL CRITICAL - logstash-syslog-udp_10514_udp - Could not depool server logstash1001.eqiad.wmnet because of too many down!: logstash-json-tcp_11514 - Could not depool server logstash1002.eqiad.wmnet because of too many down!: logstash-syslog-tcp_10514 - Could not depool server logstash1001.eqiad.wmnet because of too many down!
[13:14:46] lovely
[13:14:59] RECOVERY - PyBal backends health check on lvs1006 is OK: PYBAL OK - All pools are healthy
[13:15:03] Currently getting this when my script tries to log in to Wikidata: Request from 67.188.39.159 via cp1053 cp1053, Varnish XID 223740244
Error: 503, Backend fetch failed at Sun, 10 Sep 2017 13:14:09 GMT
[13:15:09] RECOVERY - PyBal backends health check on lvs1009 is OK: PYBAL OK - All pools are healthy
[13:15:19] PROBLEM - Codfw HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0]
[13:15:19] RECOVERY - PyBal backends health check on lvs1003 is OK: PYBAL OK - All pools are healthy
[13:15:21] harej: known issue, T175473
[13:15:30] thank you
[13:15:44] elukey is doing the needful :)
[13:15:45] I think that cp1053 might be the culprit as indicated in the task
[13:15:49] :)
[13:16:10] will check logstash asap
[13:16:36] God I would hate if my script caused it. I doubt it, but can't help but feel guilty :)
[13:17:46] harej: I doubt it, it seems more a pre-existing issue, but can you give us a bit more detail about the script?
[13:18:22] It makes lots and lots of edits to Wikidata. It operated several times before, including several times just the past few days, without taking down Wikidata, but maybe there's a first time for everything.
[13:18:48] PROBLEM - MediaWiki exceptions and fatals per minute on graphite1001 is CRITICAL: CRITICAL: 70.00% of data above the critical threshold [50.0]
[13:19:34] harej: can you stop it just in case?
[13:19:44] Oh, it literally can't run right now because it can't log in :)
[13:19:48] RECOVERY - MediaWiki exceptions and fatals per minute on graphite1001 is OK: OK: Less than 70.00% above the threshold [25.0]
[13:20:03] harej: hahahah okok thanks :D
[13:20:09] elukey: last in logstash was 13:14:08, around the time you restarted?
[13:22:21] TheresNoTime: more or less, but I am pretty sure it was cp1053's backend since it was alarming for mailbox expiry lag.. We have been seeing this issue occasionally for Varnish in upload, but there is no easy solution (there should also be a task if you are interested)
[13:24:08] as I noted on the task, I saw cp1053's `Varnish HTTP text-backend - port 3128` alarm earlier today, which suggests that's what it was?
[13:24:28] RECOVERY - Codfw HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[13:25:18] RECOVERY - Ulsfo HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[13:26:08] RECOVERY - Text HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[13:26:10] TheresNoTime: ah sorry didn't see it, the alert that i saw is a different one but if you saw that it meant that Varnish backend was not healthy, so you were definitely right
[13:26:29] RECOVERY - Esams HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[13:28:28] https://usercontent.irccloud-cdn.com/file/GIX06ot2/image.png
[13:28:28] RECOVERY - Eqiad HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[13:28:38] elukey: if at all helpful ^ :)
[13:30:26] thanks :)
[13:32:27] I am a bit worried that cp1073 is showing up the same alarm now
[13:34:45] for 7 mins :/ all the `Varnish HTTP upload-backend`s have been stable on that host for 4+ days though..
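The "HTTP 5xx reqs/min" alerts above (and the graphs in the linked screenshot) are threshold checks against Graphite series, so the same data can be pulled straight from the Graphite render API to see whether a spike is still ongoing. A minimal sketch; the metric path below is a placeholder assumption and almost certainly not the exact series behind the alert:

    # Sketch: pull the last hour of a Graphite series and list datapoints above
    # the alert threshold. The metric path below is a placeholder, not the real
    # series used by the "Text HTTP 5xx reqs/min" check.
    import requests

    GRAPHITE = "https://graphite.wikimedia.org/render"
    METRIC = "reqstats.text.5xx.rate"   # placeholder metric path
    THRESHOLD = 1000.0                  # same critical threshold as the alert

    def recent_breaches(metric=METRIC, window="-1h"):
        resp = requests.get(GRAPHITE,
                            params={"target": metric, "from": window, "format": "json"},
                            timeout=15)
        resp.raise_for_status()
        data = resp.json()  # list of {"target": ..., "datapoints": [[value, ts], ...]}
        if not data:
            return []
        return [(ts, val) for val, ts in data[0]["datapoints"]
                if val is not None and val > THRESHOLD]

    if __name__ == "__main__":
        for ts, val in recent_breaches():
            print(ts, val)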
[13:35:03] so cp1053.eqiad.wmnet is cache::text (I thought it was upload) meanwhile cp1073 is upload, so they are completely separate clusters
[13:36:40] Operations, Traffic: Multiple 503 Errors - https://phabricator.wikimedia.org/T175473#3594564 (Aklapper) p:Triage>High
[13:41:54] !log restart Varnish backend on cp1073 (cache::upload) for mailbox expiry lag
[13:42:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:44:28] (PS1) Ladsgroup: Use new logo of WMF for gerrit [puppet] - https://gerrit.wikimedia.org/r/377049 (https://phabricator.wikimedia.org/T174576)
[13:47:20] ok now everything looks better
[13:47:56] \o/
[13:49:50] Operations, Traffic: Multiple 503 Errors - https://phabricator.wikimedia.org/T175473#3594587 (elukey) p:High>Normal The 503s seems to be down to zero now, lowering down the priority to Normal since we are aware of this issue (https://phabricator.wikimedia.org/T174932) and everything looks good at...
[13:51:18] thanks all for the help!
[13:51:38] Glad I could be any help :)
[13:52:04] T174932 looks "fun"
[13:52:04] T174932: Recurrent 'mailbox lag' critical alerts and 500s - https://phabricator.wikimedia.org/T174932
[13:55:00] gehel: Ccing you because ~30 mins ago lvs100[69] were unable to contact logstash100[123] and pybal was not happy (you can see the alarms if you scroll up)
[13:55:26] after a minute everything recovered
[13:56:49] elukey: thanks! Quite busy right now, but let me know if I need to connect
[13:57:15] gehel: hello! nono it was just a note for tomorrow, all alarms are green now
[13:57:51] all right going afk! ttl people!
[13:58:12] elukey: thanks
[15:18:19] PROBLEM - Nginx local proxy to apache on mw1288 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 1308 bytes in 0.006 second response time
[15:19:18] RECOVERY - Nginx local proxy to apache on mw1288 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 613 bytes in 0.059 second response time
[17:54:46] (PS1) BryanDavis: user homes: Allow git to control +x for $HOME files [puppet] - https://gerrit.wikimedia.org/r/377056
[18:15:59] Question: huwiktionary wants to use Commons to host their logo; is this possible by just using the upload link?
[18:19:27] disregard
[18:29:30] (PS1) Zppix: Change logo for huwiktonary [mediawiki-config] - https://gerrit.wikimedia.org/r/377059 (https://phabricator.wikimedia.org/T175483)
[18:35:52] (Draft1) Paladox: Logstash: Add filebeat support [puppet] - https://gerrit.wikimedia.org/r/377058
[18:35:54] (PS2) Paladox: Logstash: Add beats support [puppet] - https://gerrit.wikimedia.org/r/377058 (https://phabricator.wikimedia.org/T141324)
[18:36:36] (CR) Paladox: "@Gehel hi, im not sure if the beats plugin is installed. If not could it be installed please? :)" [puppet] - https://gerrit.wikimedia.org/r/377058 (https://phabricator.wikimedia.org/T141324) (owner: Paladox)
[18:53:20] (Draft1) Paladox: Gerrit: Install and configure filebeats [puppet] - https://gerrit.wikimedia.org/r/377060
[18:53:23] (PS2) Paladox: Gerrit: Install and configure filebeats [puppet] - https://gerrit.wikimedia.org/r/377060 (https://phabricator.wikimedia.org/T141324)
[19:12:56] (PS3) Paladox: Gerrit: Install and configure filebeats [puppet] - https://gerrit.wikimedia.org/r/377060 (https://phabricator.wikimedia.org/T141324)
[19:20:12] (PS4) Paladox: Gerrit: Install and configure filebeats [puppet] - https://gerrit.wikimedia.org/r/377060 (https://phabricator.wikimedia.org/T141324)
[19:42:32] (PS5) Paladox: Gerrit: Install and configure filebeats [puppet] - https://gerrit.wikimedia.org/r/377060 (https://phabricator.wikimedia.org/T141324)
[20:26:31] Operations, Traffic: Multiple 503 Errors - https://phabricator.wikimedia.org/T175473#3595030 (Samtar) Open>Resolved a:elukey Overarching issue tracked in T174932, still minimal 503s in logstash and no further reports. Closing resolved (503s "resolved" by restart of varnish backends)
[20:54:07] (PS1) ArielGlenn: revoke James F's ssh key [puppet] - https://gerrit.wikimedia.org/r/377170
[20:54:38] (CR) Reedy: [C: 1] revoke James F's ssh key [puppet] - https://gerrit.wikimedia.org/r/377170 (owner: ArielGlenn)
[20:56:24] (CR) ArielGlenn: [C: 2] revoke James F's ssh key [puppet] - https://gerrit.wikimedia.org/r/377170 (owner: ArielGlenn)
[21:00:04] !log disabled Jforrester's gerrit account and flush sessions
[21:00:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:37:50] :o