[00:03:24] PROBLEM - very high load average likely xfs on ms-be1011 is CRITICAL: CRITICAL - load average: 272.42, 142.60, 65.13
[00:20:53] RECOVERY - puppet last run on mw2107 is OK: OK: Puppet is currently enabled, last run 5 seconds ago with 0 failures
[00:23:13] PROBLEM - Host ms-be1011 is DOWN: PING CRITICAL - Packet loss = 100%
[02:24:46] !log mwdeploy@tin sync-l10n completed (1.27.0-wmf.9) (duration: 10m 39s)
[02:24:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[02:31:39] !log l10nupdate@tin ResourceLoader cache refresh completed at Sat Dec 26 02:31:39 UTC 2015 (duration 6m 54s)
[02:31:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[03:59:43] PROBLEM - cassandra CQL 10.64.32.159:9042 on restbase1003 is CRITICAL: Connection refused
[04:01:02] PROBLEM - cassandra service on restbase1003 is CRITICAL: CRITICAL - Expecting active but unit cassandra is failed
[04:06:42] RECOVERY - cassandra service on restbase1003 is OK: OK - cassandra is active
[04:06:50] RECOVERY - Disk space on restbase1003 is OK: DISK OK
[04:08:01] RECOVERY - cassandra CQL 10.64.32.159:9042 on restbase1003 is OK: TCP OK - 0.004 second response time on port 9042
[04:22:37] so, did anything do something to restbase1003, or did it somehow fix itself?
[04:22:49] s/anything/anyone/
[04:23:21] urandom: 2.1.12 handles out of disk better by aborting compactions
[04:24:06] yeah, but cql stopped responding
[04:24:21] how does that happen without the daemon dying?
[04:35:28] I would guess that it hangs for a bit, which causes other operations to back up
[04:35:46] hence causing the "cql stopped responding" bit
[04:36:09] then things clear & operation resumes
[04:36:41] before 2.1.12 it would often remain in a messed-up state
[04:37:41] I was wondering why the node was so low on disk space, as there weren't many major compactions nor streams going on
[04:49:13] urandom: did you restart cassandra?
[04:58:19] the process hasn't been running for that long, so perhaps systemd respawned it
[05:38:05] !log rolling restart of hhvm jobrunners (T122069)
[05:38:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[06:30:14] PROBLEM - puppet last run on wtp2015 is CRITICAL: CRITICAL: puppet fail
[06:30:45] PROBLEM - puppet last run on wtp2017 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:30:55] PROBLEM - puppet last run on mw1120 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:30:55] PROBLEM - puppet last run on cp3048 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:31:05] PROBLEM - puppet last run on cp2013 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:31:53] PROBLEM - puppet last run on cp2001 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:31:53] PROBLEM - puppet last run on db2056 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:32:11] PROBLEM - puppet last run on mw2158 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:32:13] PROBLEM - puppet last run on mw2126 is CRITICAL: CRITICAL: Puppet has 2 failures
[06:32:32] PROBLEM - puppet last run on mw2050 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:33:22] PROBLEM - puppet last run on mw1135 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:34:51] PROBLEM - puppet last run on mw2073 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:56:21] RECOVERY - puppet last run on wtp2017 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:56:32] RECOVERY - puppet last run on db2056 is OK: OK: Puppet is currently enabled, last run 48 seconds ago with 0 failures
[06:56:51] RECOVERY - puppet last run on mw1120 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:56:51] RECOVERY - puppet last run on mw1135 is OK: OK: Puppet is currently enabled, last run 15 seconds ago with 0 failures
[06:56:52] RECOVERY - puppet last run on cp3048 is OK: OK: Puppet is currently enabled, last run 39 seconds ago with 0 failures
[06:57:02] RECOVERY - puppet last run on mw2158 is OK: OK: Puppet is currently enabled, last run 43 seconds ago with 0 failures
[06:57:21] RECOVERY - puppet last run on mw2126 is OK: OK: Puppet is currently enabled, last run 54 seconds ago with 0 failures
[06:57:23] RECOVERY - puppet last run on cp2013 is OK: OK: Puppet is currently enabled, last run 58 seconds ago with 0 failures
[06:57:42] RECOVERY - puppet last run on mw2050 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:58:31] RECOVERY - puppet last run on cp2001 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:58:31] RECOVERY - puppet last run on mw2073 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:58:52] RECOVERY - puppet last run on wtp2015 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[07:32:16] PROBLEM - puppet last run on cp4010 is CRITICAL: CRITICAL: puppet fail
[07:57:27] RECOVERY - puppet last run on cp4010 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[08:26:58] 6operations, 5Patch-For-Review: Remove secure.wikimedia.org - https://phabricator.wikimedia.org/T120790#1904487 (10Liuxinyu970226)
[08:46:45] PROBLEM - puppet last run on cp2025 is CRITICAL: CRITICAL: puppet fail
[09:06:09] PROBLEM - puppet last run on db2053 is CRITICAL: CRITICAL: puppet fail
[09:12:38] RECOVERY - puppet last run on cp2025 is OK: OK: Puppet is currently enabled, last run 4 seconds ago with 0 failures
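For the restbase1003 exchange above, the open questions (did systemd respawn Cassandra, and was an out-of-disk compaction the trigger?) can be answered on the host itself. A minimal sketch of that check, assuming the default data directory; these are not the commands actually run that night:
    systemctl status cassandra                                          # current unit state and time of the last (re)start
    journalctl -u cassandra --since "2015-12-26 03:50" --until "2015-12-26 04:10"   # failure and respawn messages around the alert window
    nodetool compactionstats                                            # pending compactions after the out-of-disk abort
    df -h /var/lib/cassandra                                            # free space, matching the "Disk space ... OK" recovery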
[09:15:18] PROBLEM - check_mysql on db1008 is CRITICAL: SLOW_SLAVE CRITICAL: Slave IO: Yes Slave SQL: Yes Seconds Behind Master: 638
[09:30:08] RECOVERY - check_mysql on db1008 is OK: Uptime: 406402 Threads: 113 Questions: 17676170 Slow queries: 5005 Opens: 30873 Flush tables: 2 Open tables: 409 Queries per second avg: 43.494 Slave IO: Yes Slave SQL: Yes Seconds Behind Master: 0
[09:33:43] RECOVERY - puppet last run on db2053 is OK: OK: Puppet is currently enabled, last run 36 seconds ago with 0 failures
[10:30:17] PROBLEM - check_mysql on db1008 is CRITICAL: SLOW_SLAVE CRITICAL: Slave IO: Yes Slave SQL: Yes Seconds Behind Master: 702
[10:45:09] PROBLEM - check_mysql on db1008 is CRITICAL: SLOW_SLAVE CRITICAL: Slave IO: Yes Slave SQL: Yes Seconds Behind Master: 622
[10:50:19] PROBLEM - check_mysql on db1008 is CRITICAL: SLOW_SLAVE CRITICAL: Slave IO: Yes Slave SQL: Yes Seconds Behind Master: 698
[10:55:09] PROBLEM - check_mysql on db1008 is CRITICAL: SLOW_SLAVE CRITICAL: Slave IO: Yes Slave SQL: Yes Seconds Behind Master: 997
[11:00:09] PROBLEM - check_mysql on db1008 is CRITICAL: SLOW_SLAVE CRITICAL: Slave IO: Yes Slave SQL: Yes Seconds Behind Master: 1298
[11:05:13] PROBLEM - check_mysql on db1008 is CRITICAL: SLOW_SLAVE CRITICAL: Slave IO: Yes Slave SQL: Yes Seconds Behind Master: 950
[11:10:13] PROBLEM - check_mysql on db1008 is CRITICAL: SLOW_SLAVE CRITICAL: Slave IO: Yes Slave SQL: Yes Seconds Behind Master: 1249
[11:15:13] PROBLEM - check_mysql on db1008 is CRITICAL: SLOW_SLAVE CRITICAL: Slave IO: Yes Slave SQL: Yes Seconds Behind Master: 1549
[11:20:14] PROBLEM - check_mysql on db1008 is CRITICAL: SLOW_SLAVE CRITICAL: Slave IO: Yes Slave SQL: Yes Seconds Behind Master: 1546
[11:25:13] RECOVERY - check_mysql on db1008 is OK: Uptime: 413302 Threads: 113 Questions: 17809785 Slow queries: 5184 Opens: 32285 Flush tables: 2 Open tables: 410 Queries per second avg: 43.091 Slave IO: Yes Slave SQL: Yes Seconds Behind Master: 1
[14:23:33] PROBLEM - puppet last run on ms-fe3002 is CRITICAL: CRITICAL: Puppet has 1 failures
[14:41:53] PROBLEM - puppet last run on ms-be3004 is CRITICAL: CRITICAL: puppet fail
[14:43:13] PROBLEM - HHVM rendering on mw1139 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[14:44:54] RECOVERY - HHVM rendering on mw1139 is OK: HTTP OK: HTTP/1.1 200 OK - 68139 bytes in 0.719 second response time
[14:50:13] RECOVERY - puppet last run on ms-fe3002 is OK: OK: Puppet is currently enabled, last run 46 seconds ago with 0 failures
[15:09:48] RECOVERY - puppet last run on ms-be3004 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[15:29:19] PROBLEM - puppet last run on mw2018 is CRITICAL: CRITICAL: puppet fail
[15:57:40] RECOVERY - puppet last run on mw2018 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[16:19:49] PROBLEM - High load average on labstore1001 is CRITICAL: CRITICAL: 50.00% of data above the critical threshold [24.0]
[16:23:40] RECOVERY - High load average on labstore1001 is OK: OK: Less than 50.00% above the threshold [16.0]
[16:41:17] PROBLEM - salt-minion processes on tin is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/salt-minion
[16:41:47] PROBLEM - salt-minion processes on bohrium is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/salt-minion
[16:51:47] RECOVERY - salt-minion processes on bohrium is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion
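The check_mysql alerts above read replication health from the replica itself; "Seconds Behind Master" is the lag the slave reports. A minimal sketch of checking the same values by hand on db1008 (the exact Icinga plugin invocation is not shown in this log):
    mysql -e 'SHOW SLAVE STATUS\G' | grep -E 'Slave_IO_Running|Slave_SQL_Running|Seconds_Behind_Master'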
[16:57:17] PROBLEM - Esams HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0]
[17:08:13] RECOVERY - salt-minion processes on tin is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion
[17:08:32] PROBLEM - Confd template for /etc/varnish/directors.backend.vcl on cp3048 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[17:08:33] PROBLEM - IPsec on cp3048 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[17:08:42] PROBLEM - salt-minion processes on cp3048 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[17:09:53] PROBLEM - confd service on cp3048 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[17:10:02] PROBLEM - HTTPS on cp3048 is CRITICAL: Return code of 255 is out of bounds
[17:10:02] PROBLEM - Freshness of OCSP Stapling files on cp3048 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[17:10:12] PROBLEM - configured eth on cp3048 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[17:10:23] PROBLEM - puppet last run on cp3048 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[17:10:53] PROBLEM - RAID on cp3048 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[17:11:42] RECOVERY - confd service on cp3048 is OK: OK - confd is active
[17:11:43] RECOVERY - Freshness of OCSP Stapling files on cp3048 is OK: OK
[17:12:02] RECOVERY - HTTPS on cp3048 is OK: SSLXNN OK - 36 OK
[17:12:02] RECOVERY - configured eth on cp3048 is OK: OK - interfaces up
[17:12:13] RECOVERY - puppet last run on cp3048 is OK: OK: Puppet is currently enabled, last run 15 minutes ago with 0 failures
[17:12:22] RECOVERY - Confd template for /etc/varnish/directors.backend.vcl on cp3048 is OK: No errors detected
[17:12:23] RECOVERY - IPsec on cp3048 is OK: Strongswan OK - 38 ESP OK
[17:12:24] RECOVERY - salt-minion processes on cp3048 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion
[17:12:43] RECOVERY - RAID on cp3048 is OK: OK: Active: 2, Working: 2, Failed: 0, Spare: 0
[17:22:23] PROBLEM - puppet last run on mw1078 is CRITICAL: CRITICAL: Puppet has 1 failures
[17:22:29] ehm
[17:22:37] did the image hosts just crash ?
[17:22:43] RECOVERY - Esams HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[17:23:57] bblack: ping, i have no images anymore here in the netherlands
[17:24:46] Request from 10.20.0.183 via cp3048 frontend ([10.20.0.183]:80), Varnish XID 1427324235
[17:24:49] Forwarded for: 84.80.97.16, 10.20.0.183
[17:24:52] Error: 429, Request Rate Exceeded
[17:31:45] hmm, seems pretty local to me...
[17:34:04] weird. i don't think my parents are on a shared ip or anything.. works fine from my home and work pcs...
[17:34:13] 6operations, 10MediaWiki-extensions-MultimediaViewer, 10Wikimedia-General-or-Unknown: Wikimedia servers won't serve images - https://phabricator.wikimedia.org/T122453#1904581 (10Glaisher) Works for me too but someone else also did report this error on #wikimedia. He said he was located at the UK.
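The error page pasted at 17:24 is what eventually pins the 429s on a single host: its footer names the cache that served the request ("via cp3048 frontend"). A rough sketch of checking which cache hosts handle a given image request from the client side; the header names and the placeholder URL are assumptions, not taken from this log:
    curl -sI 'https://upload.wikimedia.org/<some-image-path>' | grep -iE '^(x-cache|x-varnish)'   # which cache hosts and XIDs handled the request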
[17:34:35] 6operations, 10MediaWiki-extensions-MultimediaViewer, 10Wikimedia-General-or-Unknown: Wikimedia servers won't serve images - https://phabricator.wikimedia.org/T122453#1904584 (10TheDJ) Same for me. netherlands.
[17:34:40] thedj: ^
[17:34:52] 6operations, 10MediaWiki-extensions-MultimediaViewer, 10Wikimedia-General-or-Unknown: Wikimedia servers won't serve images - https://phabricator.wikimedia.org/T122453#1904585 (10TheDJ) Request from 10.20.0.183 via cp3048 frontend ([10.20.0.183]:80), Varnish XID 1428481750 Forwarded for: 84.80.97.16, 10.20.0....
[17:35:27] yeah. just added my error output
[17:35:36] 6operations, 10MediaWiki-extensions-MultimediaViewer, 10Wikimedia-General-or-Unknown: Wikimedia servers won't serve images - https://phabricator.wikimedia.org/T122453#1904586 (10Glaisher) By this error, I meant that images do only load intermittently.
[17:36:16] oh, then maybe we have different issues...
[17:37:27] thedj: (s)he also did see the varnish error page
[17:38:54] also a lot of esams 5xx errors at 18:00 according to the log
[17:41:32] https://grafana.wikimedia.org/dashboard/db/varnish-http-errors
[17:42:37] 6operations, 10MediaWiki-extensions-MultimediaViewer, 10Wikimedia-General-or-Unknown: Wikimedia servers won't serve images - https://phabricator.wikimedia.org/T122453#1904587 (10TheDJ) Also quite some 5xx 45mins ago.. and later a big group of 4xx. Maybe something triggered an internal limit ? https://grafan...
[17:44:12] 6operations, 10MediaWiki-extensions-MultimediaViewer, 10Wikimedia-General-or-Unknown: Wikimedia servers won't serve images - https://phabricator.wikimedia.org/T122453#1904589 (10Ciencia_Al_Poder) https://www.mediawiki.org/wiki/Manual:$wgRateLimits ?
[17:48:28] 6operations, 10MediaWiki-extensions-MultimediaViewer, 10Wikimedia-General-or-Unknown: Wikimedia servers won't serve images - https://phabricator.wikimedia.org/T122453#1904591 (10Glaisher) p:5Triage>3Unbreak! Others are also reporting this. https://commons.wikimedia.org/wiki/Special:NewFiles no im...
[17:48:56] RECOVERY - puppet last run on mw1078 is OK: OK: Puppet is currently enabled, last run 33 seconds ago with 0 failures
[17:50:20] still down here
[17:50:24] the details below.
[17:50:24] 39
[17:50:24] 183
[17:50:47] 6operations, 10MediaWiki-extensions-MultimediaViewer, 10Wikimedia-General-or-Unknown: Wikimedia servers won't serve images - https://phabricator.wikimedia.org/T122453#1904603 (10Yann) Broken from France with this ISP: https://en.wikipedia.org/wiki/Free_(ISP)
[17:52:02] hey
[17:52:20] can one of you confirm?
[17:52:24] that it's back now
[17:52:37] 6operations, 10MediaWiki-extensions-MultimediaViewer, 10Wikimedia-General-or-Unknown: Wikimedia servers won't serve images - https://phabricator.wikimedia.org/T122453#1904604 (10Glaisher) Probably related to PROBLEM - Esams HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data a...
[17:52:42] paravoid: not here
[17:52:56] is it fixed or not?
[17:53:06] getting 429 for every request
[17:53:10] still?
[17:53:14] yes
[17:53:33] how about now?
[17:53:40] loads
[17:53:45] from which server?
[17:54:06] sec. on a phone, hard to tell
[17:55:00] 3049
[17:55:35] !log cp3048: service varnish-frontend stop (sending 429 to lots of people, T122453)
[17:55:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[17:59:41] matanya: still here?
[17:59:44] yes
[17:59:54] seems to recover
[18:00:26] can you confirm that you're getting proper responses from cp304*8* now?
[18:01:02] !log cp3048: cleaned up /run/vmod_tbf/tbf.db/, kept a backup copy under ~faidon
[18:01:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
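The remediation paravoid logged above amounts, roughly, to the following on cp3048; this is a reconstruction from the two !log entries, and the exact backup and cleanup commands are assumptions:
    service varnish-frontend stop          # stop the frontend that was answering 429 to everyone (logged at 17:55)
    cp -a /run/vmod_tbf/tbf.db ~faidon/    # keep a backup copy of the vmod_tbf rate-limiter state (logged at 18:01)
    rm -rf /run/vmod_tbf/tbf.db/*          # clean up the token-bucket database itself
    # puppet (or a manual service start) then brings varnish-frontend back up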
[18:01:20] paravoid: how can i trigger that one to serve me ?
[18:01:42] I'm talking about the frontend
[18:01:58] so if you're coming from the same IP, you should be reaching it
[18:02:13] then yes, 200
[18:02:26] I restarted a couple of caches yesterday, but not that one
[18:02:32] 6operations, 10Traffic: Wikimedia servers won't serve images - https://phabricator.wikimedia.org/T122453#1904606 (10faidon)
[18:02:53] looks like images are back
[18:05:04] cool, thanks for confirming
[18:05:44] I can put a word on the phab if needed
[18:05:47] 6operations, 10Traffic: Wikimedia servers won't serve images - https://phabricator.wikimedia.org/T122453#1904610 (10faidon) This is all very preliminary but this appears to have happened: - cp3048 ran out of memory due to what it looks like a memory leak - The OOM killer killed the varnish-frontend - puppet st...
[18:05:50] just did :)
[18:05:54] nice
[18:06:50] mmm, maybe related to the other incidents?
[18:06:56] which ones?
[18:07:04] varnish leaks memory, I'm opening a separate task now
[18:07:08] already texted brandon
[18:07:09] https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:07:23] cp3010 cp4007
[18:07:39] yeah possible
[18:07:52] but those were 100% dead
[18:08:38] back for me now
[18:08:45] 6operations, 10Traffic: Wikimedia servers won't serve images - https://phabricator.wikimedia.org/T122453#1904612 (10PierreSelim) Thanks faidon, it looks like the images are back on the wikis. Good luck on the investigation part.
[18:09:27] let me see which type are those to see if there is any connection
[18:10:48] amsterdam one text, but sf one upload
[18:10:53] 6operations, 10Traffic: Wikimedia servers won't serve images - https://phabricator.wikimedia.org/T122453#1904613 (10Yann) Seems to be back...
[18:11:15] I will put a comment, even if it is not related, more info is better than nothing
[18:11:22] wait
[18:11:28] ok
[18:11:35] paravoid: thank you
[18:13:28] I would discard an OOM on those, based on grafana
[18:14:16] 6operations, 10Traffic: Varnish leaks memory - https://phabricator.wikimedia.org/T122455#1904615 (10faidon) 3NEW
[18:14:55] PROBLEM - MariaDB Slave SQL: s6 on db1022 is CRITICAL: CRITICAL slave_sql_state Slave_SQL_Running: No, Errno: 1062, Errmsg: Error Duplicate entry 116393982 for key rev_id on query. Default database: frwiki. Query: INSERT /* Revision::insertOn NB80 */ INTO revision (rev_id,rev_page,rev_text_id,rev_comment,rev_minor_edit,rev_user,rev_user_text,rev_timestamp,rev_deleted,rev_len,rev_parent_id,rev_sha1,rev_content_model,rev_content_format)
[18:15:11] double whammy
[18:15:40] jynus: ^^^
[18:16:18] yes, it is ok, no outage
[18:16:29] as in, no service affected
[18:17:05] automatic depooling should handle it, trying to fix it now
[18:17:49] 6operations, 6Performance-Team: jobrunner memory leaks - https://phabricator.wikimedia.org/T122069#1904624 (10faidon) I've restarted HHVM on jobrunners thrice now, to avoid further OOMs (one cut it real close too). I'd like to revert the two commits that we identified above as possible suspects of this leak. I...
[18:18:45] brb
[18:19:34] paravoid: i am moving on, unless you need anything on my part ?
[18:26:29] 6operations, 10DBA, 5Patch-For-Review: db1022 duplicate key errors - https://phabricator.wikimedia.org/T105879#1904625 (10jcrespo) 5Resolved>3Open
[18:28:17] !log disabling lag notifications for codfw (s6)
[18:28:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
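Faidon's preliminary write-up above (18:05) blames an OOM kill of varnish-frontend on cp3048, with puppet restarting it afterwards. A sketch of how that kind of diagnosis is typically confirmed from the kernel log; these are not the commands recorded in this log:
    dmesg -T | grep -i 'out of memory'             # which process the OOM killer selected, and when
    journalctl -k | grep -iE 'oom|killed process'  # the same information via the journal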
[18:29:18] jynus: around ?
[18:29:23] yes
[18:29:40] this is new thing, and actually affecting production
[18:29:44] what's up with frwiki and db1022 ? you are already investigating ?
[18:29:50] good thing there is redundancy
[18:30:07] I already know, I am waiting to disable all further notifications
[18:30:16] go back, I can handle this
[18:30:33] db1022 has bigger problems than just the replication ? I go the idea it's frwiki specifically
[18:30:37] got*
[18:31:17] this is not the s2 lag, this is data consistency problem, it has to be depooled
[18:31:27] jynus: argh
[18:31:33] I need to do an emergency mediawiki commit
[18:31:42] I hope it is justified enough
[18:31:58] for depooling a server ? obviously
[18:32:11] :-)
[18:32:41] ok. I am happy you are handling it. This is the 4th or 5th week in a row something happens on a saturday ?
[18:32:53] are they truncating again?
[18:33:01] I am very suspicious, it happened when I was online
[18:33:02] I assume today you got the page fine though
[18:33:23] it was by chance I was here, but I was not even connected to the cluster
[18:33:56] I was evaluating some issue about caches (supporting faidon, although he had handled that on his own)
[18:34:16] this is unrelated, this is a production issue
[18:35:49] (03PS1) 10Jcrespo: Emergency depool of db1022 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/261042
[18:36:26] the others I think they are not a coincidence, I think it is related to wikidata load, and probably someone is regularly importing on weekends
[18:36:26] ok. I am happy you are handling it. I assume you will send an email and let us know what happened. jynus: thanks!!!!
[18:36:38] yes, go away
[18:36:45] heh, yeah, that would make sense
[18:37:05] I take this one for you, the same way you took multiple for us
[18:38:23] (03CR) 10Jcrespo: [C: 032] Emergency depool of db1022 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/261042 (owner: 10Jcrespo)
[18:38:47] (03Merged) 10jenkins-bot: Emergency depool of db1022 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/261042 (owner: 10Jcrespo)
[18:40:00] !log jynus@tin Synchronized wmf-config/db-eqiad.php: Emergency depool of db1022 (duration: 00m 30s)
[18:40:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[18:41:37] 6operations, 10DBA, 5Patch-For-Review: db1022 duplicate key errors - https://phabricator.wikimedia.org/T105879#1904634 (10jcrespo) We need to reimage db1022, something is very wrong with it.
[19:06:14] !log setting db1030 as the new master of db2028
[19:06:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[19:12:31] !log restarting varnish-frontend on cp3042
[19:12:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[19:14:01] oh
[19:14:06] all taken care of now, I suppose?
[19:14:10] <3 jynus thanks
[19:14:21] yes
[19:14:22] two different outages
[19:14:24] both taken care of, yes
[19:14:51] mine wasn't a service outage
[19:15:12] 6operations, 10Traffic: Varnish leaks memory - https://phabricator.wikimedia.org/T122455#1904639 (10faidon) The memory usage appears to be growing only for the frontend instance, as far as I can see. I wonder if this is coming from one of the frontend-specific vmods (tbf?)
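The T122455 comment above suspects the leak is confined to the frontend varnishd instance. A simple sketch for watching per-instance resident memory on a cache host; the sampling interval and output path are arbitrary choices, not anything from this log:
    while sleep 300; do date; ps -C varnishd -o pid,rss,etime,args --sort=-rss; done >> /tmp/varnishd-rss.log   # compare RSS growth of the frontend vs backend varnishd over time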
[19:24:15] RECOVERY - MariaDB Slave SQL: s6 on db1022 is OK: OK slave_sql_state Slave_SQL_Running: Yes
[19:25:01] ^that is false, replication there is not reliable
[19:31:07] PROBLEM - puppet last run on mw2145 is CRITICAL: CRITICAL: Puppet has 1 failures
[19:33:51] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [1000.0]
[19:34:29] text Esams HTTP 5xx reqs/min
[19:35:22] seems only a spike
[19:37:10] PROBLEM - Esams HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [1000.0]
[19:45:10] RECOVERY - Esams HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[19:45:50] RECOVERY - Text HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[19:55:41] RECOVERY - puppet last run on mw2145 is OK: OK: Puppet is currently enabled, last run 40 seconds ago with 0 failures
[20:32:13] (03PS13) 10Hashar: contint: Put mysql db on tmpfs for role::ci::slave::labs [puppet] - 10https://gerrit.wikimedia.org/r/204528 (https://phabricator.wikimedia.org/T96230) (owner: 10coren)
[20:32:41] (03CR) 10Hashar: "rebased, cherry picked again on CI puppetmaster" [puppet] - 10https://gerrit.wikimedia.org/r/204528 (https://phabricator.wikimedia.org/T96230) (owner: 10coren)
[20:33:55] (03CR) 10Hashar: "This was missing from CI puppetmaster, I restored it." [puppet] - 10https://gerrit.wikimedia.org/r/208024 (owner: 10Dduvall)
[20:55:53] !log
[20:56:11] !sal
[20:56:11] https://wikitech.wikimedia.org/wiki/Server_Admin_Log https://tools.wmflabs.org/sal/production See it and you will know all you need.
[21:17:30] 6operations, 10Continuous-Integration-Infrastructure: Webproxy on carbon unreachable from labs instances since Dec 24 roughly 1am - https://phabricator.wikimedia.org/T122461#1904975 (10hashar) 3NEW
[21:29:16] 6operations, 10Continuous-Integration-Infrastructure: Webproxy on carbon unreachable from labs instances since Dec 24 roughly 1am - https://phabricator.wikimedia.org/T122461#1904987 (10faidon) 5Open>3Invalid a:3faidon See T122368. Why do you need to use the webproxy? Labs instances have Internet connect...
[21:37:40] 6operations, 10Fundraising-Backlog, 10Traffic, 10Unplanned-Sprint-Work, 3Fundraising Sprint Zapp: Firefox SPDY-coalesces requests to geoiplookup over text-lb, causing GeoIP IPv6 failures - https://phabricator.wikimedia.org/T121922#1904994 (10faidon)
[21:45:24] 6operations, 10Continuous-Integration-Infrastructure: Webproxy on carbon unreachable from labs instances since Dec 24 roughly 1am - https://phabricator.wikimedia.org/T122461#1904999 (10hashar) We have pointed the MediaWiki configuration on CI to a proxy because we had some hosts that had no direct access to i...
[22:10:07] 6operations, 10Traffic: Varnish Assert error in VGZ_Ibuf() - https://phabricator.wikimedia.org/T122462#1905008 (10faidon) 3NEW
[22:19:13] 7Puppet, 6operations, 10Continuous-Integration-Infrastructure, 6Labs: compiler02.puppet3-diffs.eqiad.wmflabs out of disk space - https://phabricator.wikimedia.org/T122346#1905033 (10hashar) ``` -h, --human-numeric-sort compare human readable numbers (e.g., 2K 1G) ``` @scfc `5/5`. I wi...
[22:52:38] 6operations, 6Performance-Team: jobrunner memory leaks - https://phabricator.wikimedia.org/T122069#1905040 (10aaron) >>! In T122069#1904624, @faidon wrote: > I've restarted HHVM on jobrunners thrice now, to avoid further OOMs (one cut it real close too). I'd like to revert the two commits that we identified ab...
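The jobrunner thread (T122069) above concerns HHVM growing until the OOM killer steps in, hence the repeated restarts. A minimal per-host sketch for checking how close a given jobrunner is to that point; thresholds and host selection are deliberately left out:
    ps -C hhvm -o pid,rss,etime --sort=-rss | head   # HHVM resident memory and how long it has been up
    free -m                                          # remaining headroom before the OOM killer would fire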