[00:59:48] !log legoktm@terbium:~$ mwscript extensions/WikimediaMaintenance/filebackend/setZoneAccess.php --wiki=hiwikiversity --backend=local-multiwrite # T174859
[01:00:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:00:06] T174859: Image File is not working on hi.wikiversity - https://phabricator.wikimedia.org/T174859
[01:04:31] PROBLEM - Check health of redis instance on 6479 on rdb2005 is CRITICAL: CRITICAL ERROR - Redis Library - can not ping 127.0.0.1 on port 6479
[01:05:31] RECOVERY - Check health of redis instance on 6479 on rdb2005 is OK: OK: REDIS 2.8.17 on 127.0.0.1:6479 has 1 databases (db0) with 4943133 keys, up 5 minutes 20 seconds - replication_delay is 0
[01:06:42] PROBLEM - Check systemd state on conf2002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[01:07:11] PROBLEM - Etcd replication lag on conf2002 is CRITICAL: connect to address 10.192.32.141 and port 8000: Connection refused
[01:07:22] PROBLEM - etcdmirror-conftool-eqiad-wmnet service on conf2002 is CRITICAL: CRITICAL - Expecting active but unit etcdmirror-conftool-eqiad-wmnet is failed
[01:39:21] PROBLEM - Check whether ferm is active by checking the default input chain on kubernetes1002 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly
[01:41:22] RECOVERY - Check whether ferm is active by checking the default input chain on kubernetes1002 is OK: OK ferm input default policy is set
[02:59:01] PROBLEM - Check whether ferm is active by checking the default input chain on kubernetes2002 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly
[03:01:01] RECOVERY - Check whether ferm is active by checking the default input chain on kubernetes2002 is OK: OK ferm input default policy is set
[03:06:52] PROBLEM - Check whether ferm is active by checking the default input chain on kubernetes2001 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly
[03:07:52] RECOVERY - Check whether ferm is active by checking the default input chain on kubernetes2001 is OK: OK ferm input default policy is set
[03:13:31] PROBLEM - Check whether ferm is active by checking the default input chain on kubernetes2003 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly
[03:14:32] RECOVERY - Check whether ferm is active by checking the default input chain on kubernetes2003 is OK: OK ferm input default policy is set
[03:28:22] PROBLEM - MariaDB Slave Lag: s1 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 706.43 seconds
[03:32:52] PROBLEM - puppet last run on mw2240 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/usr/share/GeoIP/GeoIPCity.dat.gz]
[03:33:02] PROBLEM - puppet last run on mw2164 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/usr/share/GeoIP/GeoIP2-City.mmdb.gz]
[03:33:41] PROBLEM - puppet last run on cp3032 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/usr/share/GeoIP/GeoIPCity.dat.gz]
[03:58:07] (03CR) 10Jayprakash12345: [C: 04-1] "In InitialiseSettings.php, on 2695 line. It will be wikivoyage instead of hiwkivoyage." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/371109 (https://phabricator.wikimedia.org/T173013) (owner: 10MarcoAurelio)
[04:00:42] RECOVERY - puppet last run on mw2164 is OK: OK: Puppet is currently enabled, last run 2 seconds ago with 0 failures
[04:01:12] RECOVERY - puppet last run on cp3032 is OK: OK: Puppet is currently enabled, last run 36 seconds ago with 0 failures
[04:01:41] RECOVERY - puppet last run on mw2240 is OK: OK: Puppet is currently enabled, last run 45 seconds ago with 0 failures
[04:22:45] (03CR) 10Jayprakash12345: [C: 031] Update logo for sr.wikibooks.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/375409 (https://phabricator.wikimedia.org/T172284) (owner: 10Urbanecm)
[04:23:46] (03PS13) 10MarcoAurelio: Initial configuration for hi.wikivoyage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/371109 (https://phabricator.wikimedia.org/T173013)
[04:26:49] (03CR) 10MarcoAurelio: [C: 04-1] "Some weird chars appeared after the last fix. Will fix later among other things." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/371109 (https://phabricator.wikimedia.org/T173013) (owner: 10MarcoAurelio)
[04:38:22] RECOVERY - MariaDB Slave Lag: s1 on dbstore1002 is OK: OK slave_sql_lag Replication lag: 233.42 seconds
[05:07:14] PROBLEM - Etcd replication lag on conf2002 is CRITICAL: connect to address 10.192.32.141 and port 8000: Connection refused
[05:19:34] PROBLEM - Host text-lb.eqiad.wikimedia.org is DOWN: PING CRITICAL - Packet loss = 100%
[05:20:44] RECOVERY - Host text-lb.eqiad.wikimedia.org is UP: PING OK - Packet loss = 0%, RTA = 0.28 ms
[05:37:32] RECOVERY - Check systemd state on chlorine is OK: OK - running: The system is fully operational
[05:40:41] PROBLEM - Check systemd state on chlorine is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[09:07:26] PROBLEM - Etcd replication lag on conf2002 is CRITICAL: connect to address 10.192.32.141 and port 8000: Connection refused
[09:07:54] <_joe_> looking
[09:08:53] _joe_: I'm here too, LMK if I can help
[09:09:30] <_joe_> godog: yeah I think this is pretty serious
[09:09:46] <_joe_> let me test something, but we might need to reload the etcd cluster in codfw
[09:10:01] _joe_: ok!
[09:11:09] * volans|off around too
[09:13:12] RECOVERY - etcdmirror-conftool-eqiad-wmnet service on conf2002 is OK: OK - etcdmirror-conftool-eqiad-wmnet is active
[09:13:31] RECOVERY - Check systemd state on conf2002 is OK: OK - running: The system is fully operational
[09:13:36] RECOVERY - Etcd replication lag on conf2002 is OK: HTTP OK: HTTP/1.1 200 OK - 148 bytes in 0.073 second response time
[09:13:47] <_joe_> !log manually fixed value of /conftool/v1/pools/eqiad/api_appserver/apache2/mw1208.eqiad.wmnet in codfw to allow replication to restart
[09:14:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:14:01] _joe_: I am around too
[09:14:03] <_joe_> so, what happened here:
[09:14:18] <_joe_> - the etcd cluster in codfw farted when writing a key
[09:14:37] <_joe_> - etcdmirror thus failed and was restarted a couple of times, failing to write again
[09:15:29] <_joe_> - the third time, it was able to write the value but not the new replication index reached
[09:16:06] <_joe_> - so when trying to restart replication, since etcdmirror plays it safe and performs a compareAndSwap, it found that the value on the codfw cluster didn't correspond to what it expected to be
[09:16:11] <_joe_> so it failed again
[09:16:49] <_joe_> I also had the possibility of just moving the etcd replication index to +1, but I preferred to let etcdmirror replay that transaction too
[09:17:22] _joe_: it seems to me that we could also add an additional check in the code to avoid the: unsupported operand type(s) for +: 'NoneType' and 'int' error I see in the logs
[09:17:30] <_joe_> yes
[09:17:48] oh I see that cp4024 died yesterday, cleaning up the icinga ipsec spam
[09:17:48] just to have a better error handling :)
[09:17:48] <_joe_> that's a logging error btw
[09:17:58] <_joe_> yup
[09:18:26] _joe_: and the fart is the: found conflict at index 37940750 [existing term: 17261, conflicting term: 17268]?
[09:18:33] I am here too if needed
[09:19:02] <_joe_> volans: let me look, I think that's already a consequence
[09:20:14] _joe_: yeah, that's too late, I'm going up in the logs
[09:20:22] <_joe_> ok, no I got it reversed
[09:20:51] <_joe_> it failed to write the key the first time
[09:21:18] <_joe_> but somehow it managed to write the index
[09:22:09] <_joe_> I'll need to check the code
[09:22:14] <_joe_> but we're ok now
[09:22:21] <_joe_> you can all go on with your lives :P
[09:22:44] yeah it timed out the first time, found it now
[09:23:22] <_joe_> volans: the problematic thing is that the last valid index is the one of the timeout
[09:23:38] also it would be nice to avoid the 5k log lines from etcd-mirror that can't stop the reactor :D
[09:24:06] <_joe_> yeah be my guest to fix the code if you have time, that's a consequence of using deferToThread
[09:24:12] <_joe_> it's not very easy to fix
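The exchange above describes two separate problems: the compare-and-swap write that etcdmirror uses to resume replication safely, and the unguarded `None + 1` that only garbles the error logging. The sketch below is a minimal illustration of both ideas, assuming the etcd v2 HTTP API on the default client port; the host, key and helper names are placeholders, not etcdmirror's actual code.

```
import requests

# Assumed endpoint of the destination (codfw) cluster; the production
# setup may use TLS and a different port.
DST = "http://conf2002.codfw.wmnet:2379"
KEY = "/conftool/v1/pools/eqiad/api_appserver/apache2/mw1208.eqiad.wmnet"


def cas_put(key, expected, new):
    """Write `new` only if the key currently holds `expected` (etcd v2 CAS)."""
    r = requests.put(
        DST + "/v2/keys" + key,
        params={"prevValue": expected},
        data={"value": new},
    )
    if r.status_code == 412:
        # Compare failed: the destination no longer matches what the mirror
        # expects, so it refuses to write and replication stays stopped.
        raise RuntimeError("CAS conflict on %s: %s" % (key, r.json().get("message")))
    r.raise_for_status()
    return r.json()


def next_replication_index(last_index):
    """Advance the replication index, guarding the case that broke logging."""
    if last_index is None:
        # Without this guard, `last_index + 1` raises
        # "unsupported operand type(s) for +: 'NoneType' and 'int'".
        raise RuntimeError("no replication index recorded; refusing to guess")
    return last_index + 1
```

The 412 branch is the state described above: the value in codfw had diverged from what the mirror expected, so replication could only resume once the key was fixed by hand.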
[09:25:19] ok, and my last question is... why it happens only during the UTC night between Sat. and Sun.?
[09:25:28] last time too :D
[09:25:32] <_joe_> I have no idea
[09:25:38] !log cp4024 is down, powercycled
[09:25:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:26:19] <_joe_> there is something going on with the codfw etcd cluster, I have a ticket open but since it causes little to no practical issue, and I didn't find any easy outlier at the time, it fell back in my queue of things to do
[09:26:48] <_joe_> the ticket is https://phabricator.wikimedia.org/T162013
[09:27:11] RECOVERY - IPsec on cp1062 is OK: Strongswan OK - 68 ESP OK
[09:27:11] RECOVERY - Host cp4024 is UP: PING OK - Packet loss = 0%, RTA = 78.50 ms
[09:27:12] RECOVERY - IPsec on cp2022 is OK: Strongswan OK - 82 ESP OK
[09:27:17] _joe_: ack
[09:27:21] RECOVERY - IPsec on cp1073 is OK: Strongswan OK - 68 ESP OK
[09:27:21] RECOVERY - IPsec on cp1071 is OK: Strongswan OK - 68 ESP OK
[09:27:31] RECOVERY - IPsec on cp2020 is OK: Strongswan OK - 82 ESP OK
[09:27:32] RECOVERY - IPsec on cp1049 is OK: Strongswan OK - 68 ESP OK
[09:27:32] RECOVERY - IPsec on cp2005 is OK: Strongswan OK - 82 ESP OK
[09:27:32] RECOVERY - IPsec on cp1050 is OK: Strongswan OK - 68 ESP OK
[09:27:32] RECOVERY - IPsec on cp2024 is OK: Strongswan OK - 82 ESP OK
[09:27:32] RECOVERY - IPsec on cp2002 is OK: Strongswan OK - 82 ESP OK
[09:27:41] RECOVERY - IPsec on cp1099 is OK: Strongswan OK - 68 ESP OK
[09:27:42] RECOVERY - IPsec on cp1072 is OK: Strongswan OK - 68 ESP OK
[09:27:51] RECOVERY - IPsec on cp1074 is OK: Strongswan OK - 68 ESP OK
[09:27:51] RECOVERY - IPsec on cp2026 is OK: Strongswan OK - 82 ESP OK
[09:27:52] RECOVERY - IPsec on cp1048 is OK: Strongswan OK - 68 ESP OK
[09:28:01] RECOVERY - IPsec on cp1063 is OK: Strongswan OK - 68 ESP OK
[09:28:01] RECOVERY - IPsec on cp2011 is OK: Strongswan OK - 82 ESP OK
[09:28:01] RECOVERY - IPsec on cp2008 is OK: Strongswan OK - 82 ESP OK
[09:28:02] RECOVERY - IPsec on cp2014 is OK: Strongswan OK - 82 ESP OK
[09:28:02] RECOVERY - IPsec on cp2017 is OK: Strongswan OK - 82 ESP OK
[09:28:02] RECOVERY - IPsec on cp1064 is OK: Strongswan OK - 68 ESP OK
[09:28:42] <_joe_> regarding why it happens on sundays, usually during our nights, I have no idea
[09:30:01] PROBLEM - Freshness of zerofetch successful run file on cp4024 is CRITICAL: CRITICAL: File /var/netmapper/.update-success is more than 86400 secs old!
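T162013 mentioned above is about raft consensus trouble in the codfw cluster, and the earlier "existing term: 17261, conflicting term: 17268" message points the same way: the raft term only advances when a leader election happens. One way to watch for that churn is to poll each member's stats. The sketch below assumes three codfw conf hosts and plain HTTP on the default client port, both of which may differ from the real endpoints.

```
import time
import requests

# Assumed member list and port; the production cluster may use TLS and
# different hostnames.
MEMBERS = [
    "http://conf2001.codfw.wmnet:2379",
    "http://conf2002.codfw.wmnet:2379",
    "http://conf2003.codfw.wmnet:2379",
]


def snapshot(url):
    """Return (state, leader id, raft term) as reported by one member."""
    stats = requests.get(url + "/v2/stats/self", timeout=5).json()
    # Any v2 key-space response carries the current raft term in its headers.
    head = requests.get(url + "/v2/keys/", timeout=5)
    return (
        stats.get("state"),
        stats.get("leaderInfo", {}).get("leader"),
        head.headers.get("X-Raft-Term"),
    )


if __name__ == "__main__":
    for _ in range(3):
        for url in MEMBERS:
            try:
                state, leader, term = snapshot(url)
                print("%s state=%s leader=%s raft_term=%s" % (url, state, leader, term))
            except requests.RequestException as exc:
                print("%s unreachable: %s" % (url, exc))
        time.sleep(30)
```

A term that keeps climbing between polls means the members are re-electing a leader, which would fit the write timeouts seen during the incident window.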
[09:32:00] I was wondering if it might be related to any weekly/monthly operation
[09:32:18] _joe_: it seems to me that this and last time were at 1:07 UTC on the first sunday of the month
[09:32:26] the pages
[09:32:39] so the event around 3 minutes earlier
[09:32:59] but this applies only for the last 2 events
[09:33:40] sorry 3 times, july too
[09:38:11] RECOVERY - Freshness of zerofetch successful run file on cp4024 is OK: OK
[09:41:15] !log ema@neodymium conftool action : set/pooled=yes; selector: name=cp4024.ulsfo.wmnet,service=nginx
[09:41:15] !log ema@neodymium conftool action : set/pooled=yes; selector: name=cp4024.ulsfo.wmnet,service=varnish-fe
[09:41:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:41:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:44:57] _joe_: I have a good candidate for the root cause
[09:45:33] mdadm weekly crontab, monthly does additional stuff and from dmesg the times fit, let me update the task
[09:48:53] sorry I meant it actually runs only monthly
[09:50:28] <_joe_> yes, I was looking at monthly crons and indeed it can be related
[09:51:01] the strange thing is that md0 was checked at that time, md2 only later
[09:51:13] and conf2002-vg is build on top of md2
[09:51:17] <_joe_> they share the same physical disks though
[09:51:54] <_joe_> also
[09:52:07] <_joe_> it can be a problem with either of the three nodes I think
[09:52:36] <_joe_> well, not really
[09:53:09] <_joe_> I'd say we can try to force running what cron does during the workweek
[09:53:18] and monitor
[09:53:19] <_joe_> and see if we can reproduce the issue
[09:53:19] yeah
[09:58:34] 10Operations, 10User-Joe: etcd cluster in codfw has raft consensus issues - https://phabricator.wikimedia.org/T162013#3149754 (10Volans) I might have a good candidate for what is causing it: **mdadm checkarray** The `mdadm` crontab is: ``` 57 0 * * 0 root if [ -x /usr/share/mdadm/checkarray ] && [ $(date +\%d...
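The correlation above rests on the gating in that crontab: as far as the stock Debian job goes, the weekly mdadm cron only runs the full checkarray when the day of the month is low enough, which means the first Sunday (this reading of the truncated crontab line is an assumption). A few lines of Python make the pattern easy to check against the page times; the incident dates are inferred from the tasks referenced in this log, not stated in it.

```
from datetime import datetime


def is_first_sunday(dt):
    # A Sunday whose day-of-month is <= 7 is the first Sunday of the month,
    # which is when the date guard in the mdadm crontab lets the full check
    # run (assumption based on the stock Debian cron job).
    return dt.weekday() == 6 and dt.day <= 7


# Page times at 01:07 UTC; dates inferred from the incidents discussed above.
pages = [
    datetime(2017, 9, 3, 1, 7),   # this incident
    datetime(2017, 8, 6, 1, 7),   # "last time too"
    datetime(2017, 7, 2, 1, 7),   # "sorry 3 times, july too"
]

for page in pages:
    print(page.isoformat(), "first Sunday of the month:", is_first_sunday(page))
```

All three dates come out as first Sundays, and the crontab fires at 00:57 UTC, a few minutes before the 01:04 event the pages trailed.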
[10:28:16] 10Operations, 10Commons: Error: 500, Internal Server Error on many Commons categories. - https://phabricator.wikimedia.org/T174868#3575867 (10Multichill) @zhuyifei1999 you closed this on as a duplicate, but the other bug is scoped to non-English. You might want to update that one if the scope has gotten bigger.
[12:50:20] (03Draft2) 10Felipe L. Ewald: Removed dead FTP link from mirror list. [puppet] - 10https://gerrit.wikimedia.org/r/375614
[14:44:55] (03PS1) 10Fomafix: Rename language codes sr-ec and sr-el to sr-cyrl and sr-latn [mediawiki-config] - 10https://gerrit.wikimedia.org/r/375616 (https://phabricator.wikimedia.org/T117845)
[15:16:48] 10Operations, 10Commons, 10MediaWiki-extensions-Scribunto, 10Patch-For-Review, 10Wikimedia-log-errors: Some Commons pages transcluding Template:Countries_of_Europe HTTP 500/503 due to OOM in Lua→PHP→Lua calls - https://phabricator.wikimedia.org/T171392#3575999 (10zhuyifei1999)
[15:35:21] PROBLEM - Check Varnish expiry mailbox lag on cp1073 is CRITICAL: CRITICAL: expiry mailbox lag is 2048811
[15:52:42] PROBLEM - MediaWiki memcached error rate on graphite1001 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [5000.0]
[15:54:51] RECOVERY - MediaWiki memcached error rate on graphite1001 is OK: OK: Less than 40.00% above the threshold [1000.0]
[15:57:18] !log restart varnish backend on cp1073 (mailbox lag)
[15:57:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:05:21] RECOVERY - Check Varnish expiry mailbox lag on cp1073 is OK: OK: expiry mailbox lag is 0
[16:31:17] (03PS3) 10Ema: varnish: introduce rate limiting for maps [puppet] - 10https://gerrit.wikimedia.org/r/375354 (https://phabricator.wikimedia.org/T169175)
[17:04:07] (03CR) 10Ottomata: [C: 031] "Can merge tuesday" [puppet] - 10https://gerrit.wikimedia.org/r/375106 (https://phabricator.wikimedia.org/T110903) (owner: 10Krinkle)
[17:08:02] RECOVERY - Check systemd state on chlorine is OK: OK - running: The system is fully operational
[17:11:11] PROBLEM - Check systemd state on chlorine is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[17:28:42] PROBLEM - Varnish HTTP upload-backend - port 3128 on cp4024 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[17:52:49] !log depooled cp4024 (ulsfo upload) due to kernel errors in dmesg
[17:53:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:55:43] a bit weird, never seen this error before
[18:00:39] 10Operations, 10Traffic: cp4024 kernel errors - https://phabricator.wikimedia.org/T174891#3576064 (10elukey)
[18:01:58] seems some sort of lockup, load spiked and disk io went down to zero https://grafana.wikimedia.org/dashboard/file/server-board.json?var-server=cp4024&refresh=1m&orgId=1
[18:02:52] PROBLEM - Check systemd state on cp4024 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[18:03:05] yeah this one is me
[18:06:30] ok so the host is depooled, acked the alarms and opened the task above
[18:07:12] I am inclined not to try anything like reboots etc... until ema/bblack are online tomorrow so they'll be able to better triage the problem
[18:08:29] 10Operations, 10Traffic: cp4024 kernel errors - https://phabricator.wikimedia.org/T174891#3576091 (10elukey)
[18:10:36] cache::upload ulsfo seems good atm
[18:11:03] RECOVERY - Check systemd state on cp4024 is OK: OK - running: The system is fully operational
[18:15:41] 10Operations, 10Traffic: cp4024 kernel errors - https://phabricator.wikimedia.org/T174891#3576097 (10elukey) {F9308416} ``` elukey@cp4024:~$ top top - 18:15:02 up 8:48, 2 users, load average: 771.32, 750.69, 538.32 Tasks: 542 total, 4 running, 538 sleeping, 0 stopped, 0 zombie %Cpu(s): 1.8 us, 1.2...
[18:16:31] * elukey afk
[18:17:41] PROBLEM - puppet last run on analytics1066 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[18:42:12] (03CR) 10Reedy: [C: 031] "Or, more specifically, T174782 for dropping those empty tables :)" [puppet] - 10https://gerrit.wikimedia.org/r/375347 (https://phabricator.wikimedia.org/T113842) (owner: 10Tpt)
[18:46:01] RECOVERY - puppet last run on analytics1066 is OK: OK: Puppet is currently enabled, last run 32 seconds ago with 0 failures
[19:15:42] PROBLEM - puppet last run on mw1211 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[19:21:10] (03PS2) 10Niharika29: Avoid pinging deployers unless there are patches to be deployed [wikimedia/bots/jouncebot] - 10https://gerrit.wikimedia.org/r/375046
[19:21:50] (03CR) 10jerkins-bot: [V: 04-1] Avoid pinging deployers unless there are patches to be deployed [wikimedia/bots/jouncebot] - 10https://gerrit.wikimedia.org/r/375046 (owner: 10Niharika29)
[19:28:33] (03PS3) 10Niharika29: Avoid pinging deployers unless there are patches to be deployed [wikimedia/bots/jouncebot] - 10https://gerrit.wikimedia.org/r/375046
[19:44:02] RECOVERY - puppet last run on mw1211 is OK: OK: Puppet is currently enabled, last run 15 seconds ago with 0 failures
[21:12:21] PROBLEM - HHVM rendering on mw1298 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[21:13:12] RECOVERY - HHVM rendering on mw1298 is OK: HTTP OK: HTTP/1.1 200 OK - 74670 bytes in 0.138 second response time
[21:40:21] PROBLEM - puppet last run on ms-be1028 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[22:09:41] RECOVERY - puppet last run on ms-be1028 is OK: OK: Puppet is currently enabled, last run 33 seconds ago with 0 failures