[00:59:48] !log legoktm@terbium:~$ mwscript extensions/WikimediaMaintenance/filebackend/setZoneAccess.php --wiki=hiwikiversity --backend=local-multiwrite # T174859
[01:00:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:00:06] T174859: Image File is not working on hi.wikiversity - https://phabricator.wikimedia.org/T174859
[01:04:31] PROBLEM - Check health of redis instance on 6479 on rdb2005 is CRITICAL: CRITICAL ERROR - Redis Library - can not ping 127.0.0.1 on port 6479
[01:05:31] RECOVERY - Check health of redis instance on 6479 on rdb2005 is OK: OK: REDIS 2.8.17 on 127.0.0.1:6479 has 1 databases (db0) with 4943133 keys, up 5 minutes 20 seconds - replication_delay is 0
[01:06:42] PROBLEM - Check systemd state on conf2002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[01:07:11] PROBLEM - Etcd replication lag on conf2002 is CRITICAL: connect to address 10.192.32.141 and port 8000: Connection refused
[01:07:22] PROBLEM - etcdmirror-conftool-eqiad-wmnet service on conf2002 is CRITICAL: CRITICAL - Expecting active but unit etcdmirror-conftool-eqiad-wmnet is failed
[01:39:21] PROBLEM - Check whether ferm is active by checking the default input chain on kubernetes1002 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly
[01:41:22] RECOVERY - Check whether ferm is active by checking the default input chain on kubernetes1002 is OK: OK ferm input default policy is set
[02:59:01] PROBLEM - Check whether ferm is active by checking the default input chain on kubernetes2002 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly
[03:01:01] RECOVERY - Check whether ferm is active by checking the default input chain on kubernetes2002 is OK: OK ferm input default policy is set
[03:06:52] PROBLEM - Check whether ferm is active by checking the default input chain on kubernetes2001 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly
[03:07:52] RECOVERY - Check whether ferm is active by checking the default input chain on kubernetes2001 is OK: OK ferm input default policy is set
[03:13:31] PROBLEM - Check whether ferm is active by checking the default input chain on kubernetes2003 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly
[03:14:32] RECOVERY - Check whether ferm is active by checking the default input chain on kubernetes2003 is OK: OK ferm input default policy is set
[03:28:22] PROBLEM - MariaDB Slave Lag: s1 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 706.43 seconds
[03:32:52] PROBLEM - puppet last run on mw2240 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/usr/share/GeoIP/GeoIPCity.dat.gz]
[03:33:02] PROBLEM - puppet last run on mw2164 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/usr/share/GeoIP/GeoIP2-City.mmdb.gz]
[03:33:41] PROBLEM - puppet last run on cp3032 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/usr/share/GeoIP/GeoIPCity.dat.gz]
[03:58:07] (03CR) 10Jayprakash12345: [C: 04-1] "In InitialiseSettings.php, on 2695 line. It will be wikivoyage instead of hiwkivoyage." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/371109 (https://phabricator.wikimedia.org/T173013) (owner: 10MarcoAurelio)
[04:00:42] RECOVERY - puppet last run on mw2164 is OK: OK: Puppet is currently enabled, last run 2 seconds ago with 0 failures
[04:01:12] RECOVERY - puppet last run on cp3032 is OK: OK: Puppet is currently enabled, last run 36 seconds ago with 0 failures
[04:01:41] RECOVERY - puppet last run on mw2240 is OK: OK: Puppet is currently enabled, last run 45 seconds ago with 0 failures
[04:22:45] (03CR) 10Jayprakash12345: [C: 031] Update logo for sr.wikibooks.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/375409 (https://phabricator.wikimedia.org/T172284) (owner: 10Urbanecm)
[04:23:46] (03PS13) 10MarcoAurelio: Initial configuration for hi.wikivoyage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/371109 (https://phabricator.wikimedia.org/T173013)
[04:26:49] (03CR) 10MarcoAurelio: [C: 04-1] "Some weird chars appeared after the last fix. Will fix later among other things." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/371109 (https://phabricator.wikimedia.org/T173013) (owner: 10MarcoAurelio)
[04:38:22] RECOVERY - MariaDB Slave Lag: s1 on dbstore1002 is OK: OK slave_sql_lag Replication lag: 233.42 seconds
[05:07:14] PROBLEM - Etcd replication lag on conf2002 is CRITICAL: connect to address 10.192.32.141 and port 8000: Connection refused
[05:19:34] PROBLEM - Host text-lb.eqiad.wikimedia.org is DOWN: PING CRITICAL - Packet loss = 100%
[05:20:44] RECOVERY - Host text-lb.eqiad.wikimedia.org is UP: PING OK - Packet loss = 0%, RTA = 0.28 ms
[05:37:32] RECOVERY - Check systemd state on chlorine is OK: OK - running: The system is fully operational
[05:40:41] PROBLEM - Check systemd state on chlorine is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[09:07:26] PROBLEM - Etcd replication lag on conf2002 is CRITICAL: connect to address 10.192.32.141 and port 8000: Connection refused
[09:07:54] <_joe_> looking
[09:08:53] _joe_: I'm here too, LMK if I can help
[09:09:30] <_joe_> godog: yeah I think this is pretty serious
[09:09:46] <_joe_> let me test something, but we might need to reload the etcd cluster in codfw
[09:10:01] _joe_: ok!
[09:11:09] * volans|off around too
[09:13:12] RECOVERY - etcdmirror-conftool-eqiad-wmnet service on conf2002 is OK: OK - etcdmirror-conftool-eqiad-wmnet is active
[09:13:31] RECOVERY - Check systemd state on conf2002 is OK: OK - running: The system is fully operational
[09:13:36] RECOVERY - Etcd replication lag on conf2002 is OK: HTTP OK: HTTP/1.1 200 OK - 148 bytes in 0.073 second response time
[09:13:47] <_joe_> !log manually fixed value of /conftool/v1/pools/eqiad/api_appserver/apache2/mw1208.eqiad.wmnet in codfw to allow replication to restart
[09:14:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:14:01] _joe_: I am around too
[09:14:03] <_joe_> so, what happened here:
[09:14:18] <_joe_> - the etcd cluster in codfw farted when writing a key
[09:14:37] <_joe_> - etcdmirror thus failed and was restarted a couple of times, failing to write again
[09:15:29] <_joe_> - the third time, it was able to write the value but not the new replication index reached
[09:16:06] <_joe_> - so when trying to restart replication, since etcdmirror plays it safe and performs a compareAndSwap, it found that the value on the codfw cluster didn't correspond to what it expected to be
[09:16:11] <_joe_> so it failed again
[09:16:49] <_joe_> I also had the possibility of just moving the etcd replication index to +1, but I preferred to let etcdmirror replay that transaction too
[09:17:22] _joe_: it seems to me that we could also add an additional check in the code to avoid the: unsupported operand type(s) for +: 'NoneType' and 'int' error I see in the logs
[09:17:30] <_joe_> yes
[09:17:48] oh I see that cp4024 died yesterday, cleaning up the icinga ipsec spam
[09:17:48] just to have a better error handling :)
[09:17:48] <_joe_> that's a logging error btw
[09:17:58] <_joe_> yup
[09:18:26] _joe_: and the fart is the: found conflict at index 37940750 [existing term: 17261, conflicting term: 17268]?
[09:18:33] I am here too if needed
[09:19:02] <_joe_> volans: let me look, I think that's already a consequence
[09:20:14] _joe_: yeah, that's too late, I'm going up in the logs
[09:20:22] <_joe_> ok, no I got it reversed
[09:20:51] <_joe_> it failed to write the key the first time
[09:21:18] <_joe_> but somehow it managed to write the index
[09:22:09] <_joe_> I'll need to check the code
[09:22:14] <_joe_> but we're ok now
[09:22:21] <_joe_> you can all go on with your lives :P
[09:22:44] yeah it timed out the first time, found it now
[09:23:22] <_joe_> volans: the problematic thing is that the last valid index is the one of the timeout
[09:23:38] also it would be nice to avoid the 5k log lines from etcd-mirror that can't stop the reactor :D
[09:24:06] <_joe_> yeah be my guest to fix the code if you have time, that's a consequence of using deferToThread
[09:24:12] <_joe_> it's not very easy to fix
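The exchange above describes two separate problems: the compare-and-swap write that etcdmirror uses to resume replication safely, and the unguarded `None + 1` that only garbles the error logging. The sketch below is a minimal illustration of both ideas, assuming the etcd v2 HTTP API on the default client port; the host, key and helper names are placeholders, not etcdmirror's actual code.

```
import requests

# Assumed endpoint of the destination (codfw) cluster; the production
# setup may use TLS and a different port.
DST = "http://conf2002.codfw.wmnet:2379"
KEY = "/conftool/v1/pools/eqiad/api_appserver/apache2/mw1208.eqiad.wmnet"


def cas_put(key, expected, new):
    """Write `new` only if the key currently holds `expected` (etcd v2 CAS)."""
    r = requests.put(
        DST + "/v2/keys" + key,
        params={"prevValue": expected},
        data={"value": new},
    )
    if r.status_code == 412:
        # Compare failed: the destination no longer matches what the mirror
        # expects, so it refuses to write and replication stays stopped.
        raise RuntimeError("CAS conflict on %s: %s" % (key, r.json().get("message")))
    r.raise_for_status()
    return r.json()


def next_replication_index(last_index):
    """Advance the replication index, guarding the case that broke logging."""
    if last_index is None:
        # Without this guard, `last_index + 1` raises
        # "unsupported operand type(s) for +: 'NoneType' and 'int'".
        raise RuntimeError("no replication index recorded; refusing to guess")
    return last_index + 1
```

The 412 branch is the state described above: the value in codfw had diverged from what the mirror expected, so replication could only resume once the key was fixed by hand.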
[09:25:19] ok, and my last question is... why it happens only during the UTC night between Sat. and Sun.?
[09:25:28] last time too :D
[09:25:32] <_joe_> I have no idea
[09:25:38] !log cp4024 is down, powercycled
[09:25:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:26:19] <_joe_> there is something going on with the codfw etcd cluster, I have a ticket open but since it causes little to no practical issue, and I didn't find any easy outlier at the time, it fell back in my queue of things to do
[09:26:48] <_joe_> the ticket is https://phabricator.wikimedia.org/T162013
[09:27:11] RECOVERY - IPsec on cp1062 is OK: Strongswan OK - 68 ESP OK
[09:27:11] RECOVERY - Host cp4024 is UP: PING OK - Packet loss = 0%, RTA = 78.50 ms
[09:27:12] RECOVERY - IPsec on cp2022 is OK: Strongswan OK - 82 ESP OK
[09:27:17] _joe_: ack
[09:27:21] RECOVERY - IPsec on cp1073 is OK: Strongswan OK - 68 ESP OK
[09:27:21] RECOVERY - IPsec on cp1071 is OK: Strongswan OK - 68 ESP OK
[09:27:31] RECOVERY - IPsec on cp2020 is OK: Strongswan OK - 82 ESP OK
[09:27:32] RECOVERY - IPsec on cp1049 is OK: Strongswan OK - 68 ESP OK
[09:27:32] RECOVERY - IPsec on cp2005 is OK: Strongswan OK - 82 ESP OK
[09:27:32] RECOVERY - IPsec on cp1050 is OK: Strongswan OK - 68 ESP OK
[09:27:32] RECOVERY - IPsec on cp2024 is OK: Strongswan OK - 82 ESP OK
[09:27:32] RECOVERY - IPsec on cp2002 is OK: Strongswan OK - 82 ESP OK
[09:27:41] RECOVERY - IPsec on cp1099 is OK: Strongswan OK - 68 ESP OK
[09:27:42] RECOVERY - IPsec on cp1072 is OK: Strongswan OK - 68 ESP OK
[09:27:51] RECOVERY - IPsec on cp1074 is OK: Strongswan OK - 68 ESP OK
[09:27:51] RECOVERY - IPsec on cp2026 is OK: Strongswan OK - 82 ESP OK
[09:27:52] RECOVERY - IPsec on cp1048 is OK: Strongswan OK - 68 ESP OK
[09:28:01] RECOVERY - IPsec on cp1063 is OK: Strongswan OK - 68 ESP OK
[09:28:01] RECOVERY - IPsec on cp2011 is OK: Strongswan OK - 82 ESP OK
[09:28:01] RECOVERY - IPsec on cp2008 is OK: Strongswan OK - 82 ESP OK
[09:28:02] RECOVERY - IPsec on cp2014 is OK: Strongswan OK - 82 ESP OK
[09:28:02] RECOVERY - IPsec on cp2017 is OK: Strongswan OK - 82 ESP OK
[09:28:02] RECOVERY - IPsec on cp1064 is OK: Strongswan OK - 68 ESP OK
[09:28:42] <_joe_> regarding why it happens on sundays, usually during our nights, I have no idea
[09:30:01] PROBLEM - Freshness of zerofetch successful run file on cp4024 is CRITICAL: CRITICAL: File /var/netmapper/.update-success is more than 86400 secs old!
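T162013 mentioned above is about raft consensus trouble in the codfw cluster, and the earlier "existing term: 17261, conflicting term: 17268" message points the same way: the raft term only advances when a leader election happens. One way to watch for that churn is to poll each member's stats. The sketch below assumes three codfw conf hosts and plain HTTP on the default client port, both of which may differ from the real endpoints.

```
import time
import requests

# Assumed member list and port; the production cluster may use TLS and
# different hostnames.
MEMBERS = [
    "http://conf2001.codfw.wmnet:2379",
    "http://conf2002.codfw.wmnet:2379",
    "http://conf2003.codfw.wmnet:2379",
]


def snapshot(url):
    """Return (state, leader id, raft term) as reported by one member."""
    stats = requests.get(url + "/v2/stats/self", timeout=5).json()
    # Any v2 key-space response carries the current raft term in its headers.
    head = requests.get(url + "/v2/keys/", timeout=5)
    return (
        stats.get("state"),
        stats.get("leaderInfo", {}).get("leader"),
        head.headers.get("X-Raft-Term"),
    )


if __name__ == "__main__":
    for _ in range(3):
        for url in MEMBERS:
            try:
                state, leader, term = snapshot(url)
                print("%s state=%s leader=%s raft_term=%s" % (url, state, leader, term))
            except requests.RequestException as exc:
                print("%s unreachable: %s" % (url, exc))
        time.sleep(30)
```

A term that keeps climbing between polls means the members are re-electing a leader, which would fit the write timeouts seen during the incident window.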
[09:32:00] I was wondering if it might be related to any weekly/monthly operation
[09:32:18] _joe_: it seems to me that this and last time were at 1:07 UTC on the first sunday of the month
[09:32:26] the pages
[09:32:39] so the event around 3 minutes earlier
[09:32:59] but this applies only for the last 2 events
[09:33:40] sorry 3 times, july too
[09:38:11] RECOVERY - Freshness of zerofetch successful run file on cp4024 is OK: OK
[09:41:15] !log ema@neodymium conftool action : set/pooled=yes; selector: name=cp4024.ulsfo.wmnet,service=nginx
[09:41:15] !log ema@neodymium conftool action : set/pooled=yes; selector: name=cp4024.ulsfo.wmnet,service=varnish-fe
[09:41:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:41:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:44:57] _joe_: I have a good candidate for the root cause
[09:45:33] mdadm weekly crontab, monthly does additional stuff and from dmesg the times fit, let me update the task
[09:48:53] sorry I meant it actually runs only monthly
[09:50:28] <_joe_> yes, I was looking at monthly crons and indeed it can be related
[09:51:01] the strange thing is that md0 was checked at that time, md2 only later
[09:51:13] and conf2002-vg is build on top of md2
[09:51:17] <_joe_> they share the same physical disks though
[09:51:54] <_joe_> also
[09:52:07] <_joe_> it can be a problem with either of the three nodes I think
[09:52:36] <_joe_> well, not really
[09:53:09] <_joe_> I'd say we can try to force running what cron does during the workweek
[09:53:18] and monitor
[09:53:19] <_joe_> and see if we can reproduce the issue
[09:53:19] yeah
[09:58:34] 10Operations, 10User-Joe: etcd cluster in codfw has raft consensus issues - https://phabricator.wikimedia.org/T162013#3149754 (10Volans) I might have a good candidate for what is causing it: **mdadm checkarray** The `mdadm` crontab is: ``` 57 0 * * 0 root if [ -x /usr/share/mdadm/checkarray ] && [ $(date +\%d...
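The correlation above rests on the gating in that crontab: as far as the stock Debian job goes, the weekly mdadm cron only runs the full checkarray when the day of the month is low enough, which means the first Sunday (this reading of the truncated crontab line is an assumption). A few lines of Python make the pattern easy to check against the page times; the incident dates are inferred from the tasks referenced in this log, not stated in it.

```
from datetime import datetime


def is_first_sunday(dt):
    # A Sunday whose day-of-month is <= 7 is the first Sunday of the month,
    # which is when the date guard in the mdadm crontab lets the full check
    # run (assumption based on the stock Debian cron job).
    return dt.weekday() == 6 and dt.day <= 7


# Page times at 01:07 UTC; dates inferred from the incidents discussed above.
pages = [
    datetime(2017, 9, 3, 1, 7),   # this incident
    datetime(2017, 8, 6, 1, 7),   # "last time too"
    datetime(2017, 7, 2, 1, 7),   # "sorry 3 times, july too"
]

for page in pages:
    print(page.isoformat(), "first Sunday of the month:", is_first_sunday(page))
```

All three dates come out as first Sundays, and the crontab fires at 00:57 UTC, a few minutes before the 01:04 event the pages trailed.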
[10:28:16] 10Operations, 10Commons: Error: 500, Internal Server Error on many Commons categories. - https://phabricator.wikimedia.org/T174868#3575867 (10Multichill) @zhuyifei1999 you closed this on as a duplicate, but the other bug is scoped to non-English. You might want to update that one if the scope has gotten bigger.
[12:50:20] (03Draft2) 10Felipe L. Ewald: Removed dead FTP link from mirror list. [puppet] - 10https://gerrit.wikimedia.org/r/375614
[14:44:55] (03PS1) 10Fomafix: Rename language codes sr-ec and sr-el to sr-cyrl and sr-latn [mediawiki-config] - 10https://gerrit.wikimedia.org/r/375616 (https://phabricator.wikimedia.org/T117845)
[15:16:48] 10Operations, 10Commons, 10MediaWiki-extensions-Scribunto, 10Patch-For-Review, 10Wikimedia-log-errors: Some Commons pages transcluding Template:Countries_of_Europe HTTP 500/503 due to OOM in Lua→PHP→Lua calls - https://phabricator.wikimedia.org/T171392#3575999 (10zhuyifei1999)
[15:35:21] PROBLEM - Check Varnish expiry mailbox lag on cp1073 is CRITICAL: CRITICAL: expiry mailbox lag is 2048811
[15:52:42] PROBLEM - MediaWiki memcached error rate on graphite1001 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [5000.0]
[15:54:51] RECOVERY - MediaWiki memcached error rate on graphite1001 is OK: OK: Less than 40.00% above the threshold [1000.0]
[15:57:18] !log restart varnish backend on cp1073 (mailbox lag)
[15:57:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:05:21] RECOVERY - Check Varnish expiry mailbox lag on cp1073 is OK: OK: expiry mailbox lag is 0
[16:31:17] (03PS3) 10Ema: varnish: introduce rate limiting for maps [puppet] - 10https://gerrit.wikimedia.org/r/375354 (https://phabricator.wikimedia.org/T169175)
[17:04:07] (03CR) 10Ottomata: [C: 031] "Can merge tuesday" [puppet] - 10https://gerrit.wikimedia.org/r/375106 (https://phabricator.wikimedia.org/T110903) (owner: 10Krinkle)
[17:08:02] RECOVERY - Check systemd state on chlorine is OK: OK - running: The system is fully operational
[17:11:11] PROBLEM - Check systemd state on chlorine is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[17:28:42] PROBLEM - Varnish HTTP upload-backend - port 3128 on cp4024 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[17:52:49] !log depooled cp4024 (ulsfo upload) due to kernel errors in dmesg
[17:53:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:55:43] a bit weird, never seen this error before
[18:00:39] 10Operations, 10Traffic: cp4024 kernel errors - https://phabricator.wikimedia.org/T174891#3576064 (10elukey)
[18:01:58] seems some sort of lockup, load spiked and disk io went down to zero https://grafana.wikimedia.org/dashboard/file/server-board.json?var-server=cp4024&refresh=1m&orgId=1
[18:02:52] PROBLEM - Check systemd state on cp4024 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[18:03:05] yeah this one is me
[18:06:30] ok so the host is depooled, acked the alarms and opened the task above
[18:07:12] I am inclined not to try anything like reboots etc... until ema/bblack are online tomorrow so they'll be able to better triage the problem
[18:08:29] 10Operations, 10Traffic: cp4024 kernel errors - https://phabricator.wikimedia.org/T174891#3576091 (10elukey)
[18:10:36] cache::upload ulsfo seems good atm
[18:11:03] RECOVERY - Check systemd state on cp4024 is OK: OK - running: The system is fully operational
[18:15:41] 10Operations, 10Traffic: cp4024 kernel errors - https://phabricator.wikimedia.org/T174891#3576097 (10elukey) {F9308416} ``` elukey@cp4024:~$ top top - 18:15:02 up 8:48, 2 users, load average: 771.32, 750.69, 538.32 Tasks: 542 total, 4 running, 538 sleeping, 0 stopped, 0 zombie %Cpu(s): 1.8 us, 1.2...
[18:16:31] * elukey afk
[18:17:41] PROBLEM - puppet last run on analytics1066 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[18:42:12] (03CR) 10Reedy: [C: 031] "Or, more specifically, T174782 for dropping those empty tables :)" [puppet] - 10https://gerrit.wikimedia.org/r/375347 (https://phabricator.wikimedia.org/T113842) (owner: 10Tpt)
[18:46:01] RECOVERY - puppet last run on analytics1066 is OK: OK: Puppet is currently enabled, last run 32 seconds ago with 0 failures
[19:15:42] PROBLEM - puppet last run on mw1211 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[19:21:10] (03PS2) 10Niharika29: Avoid pinging deployers unless there are patches to be deployed [wikimedia/bots/jouncebot] - 10https://gerrit.wikimedia.org/r/375046
[19:21:50] (03CR) 10jerkins-bot: [V: 04-1] Avoid pinging deployers unless there are patches to be deployed [wikimedia/bots/jouncebot] - 10https://gerrit.wikimedia.org/r/375046 (owner: 10Niharika29)
[19:28:33] (03PS3) 10Niharika29: Avoid pinging deployers unless there are patches to be deployed [wikimedia/bots/jouncebot] - 10https://gerrit.wikimedia.org/r/375046
[19:44:02] RECOVERY - puppet last run on mw1211 is OK: OK: Puppet is currently enabled, last run 15 seconds ago with 0 failures
[21:12:21] PROBLEM - HHVM rendering on mw1298 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[21:13:12] RECOVERY - HHVM rendering on mw1298 is OK: HTTP OK: HTTP/1.1 200 OK - 74670 bytes in 0.138 second response time
[21:40:21] PROBLEM - puppet last run on ms-be1028 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[22:09:41] RECOVERY - puppet last run on ms-be1028 is OK: OK: Puppet is currently enabled, last run 33 seconds ago with 0 failures