[00:21:39] PROBLEM - puppet last run on lvs1004 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [00:23:35] PROBLEM - puppet last run on serpens is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [00:24:33] (03PS1) 10Dereckson: Respawn ptwikimedia configuration [mediawiki-config] - 10https://gerrit.wikimedia.org/r/314792 [00:27:10] (03PS2) 10Dereckson: Respawn ptwikimedia configuration [mediawiki-config] - 10https://gerrit.wikimedia.org/r/314792 (https://phabricator.wikimedia.org/T126832) [00:34:46] (03CR) 10Dereckson: [C: 031] Allow Commons 'crats to manage accountcreator group [mediawiki-config] - 10https://gerrit.wikimedia.org/r/309912 (https://phabricator.wikimedia.org/T144689) (owner: 10Odder) [00:36:46] (03CR) 10Dereckson: "This change is ready for deployment." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/309912 (https://phabricator.wikimedia.org/T144689) (owner: 10Odder) [00:48:08] RECOVERY - puppet last run on lvs1004 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [00:49:51] RECOVERY - puppet last run on serpens is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [00:50:54] (03PS1) 10Dereckson: Raise abuse filter emergency threshold for es.wikibooks [mediawiki-config] - 10https://gerrit.wikimedia.org/r/314797 (https://phabricator.wikimedia.org/T145765) [01:33:08] (03CR) 10Krinkle: [C: 031] contint: add phpdbg for code coverage [puppet] - 10https://gerrit.wikimedia.org/r/314563 (owner: 10Hashar) [01:33:18] (03CR) 10Krinkle: "What went wrong?" [puppet] - 10https://gerrit.wikimedia.org/r/314563 (owner: 10Hashar) [01:36:19] (03CR) 10Aaron Schulz: [C: 031] robots.php: Use WikiPage instead of Article class [mediawiki-config] - 10https://gerrit.wikimedia.org/r/314790 (owner: 10Krinkle) [01:37:10] ostriches: around? [02:05:56] PROBLEM - puppet last run on mw1259 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [02:14:44] PROBLEM - puppet last run on restbase1015 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [02:22:46] PROBLEM - puppet last run on labvirt1009 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [02:24:55] !log mwdeploy@tin scap sync-l10n completed (1.28.0-wmf.21) (duration: 09m 09s) [02:25:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [02:29:25] RECOVERY - puppet last run on mw1259 is OK: OK: Puppet is currently enabled, last run 46 seconds ago with 0 failures [02:31:15] !log l10nupdate@tin ResourceLoader cache refresh completed at Sat Oct 8 02:31:15 UTC 2016 (duration 6m 20s) [02:31:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [02:35:54] RECOVERY - puppet last run on restbase1015 is OK: OK: Puppet is currently enabled, last run 12 seconds ago with 0 failures [02:46:38] RECOVERY - puppet last run on labvirt1009 is OK: OK: Puppet is currently enabled, last run 49 seconds ago with 0 failures [06:58:18] PROBLEM - puppet last run on labcontrol1002 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [07:21:33] PROBLEM - puppet last run on mw1300 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [07:22:15] RECOVERY - puppet last run on labcontrol1002 is OK: OK: Puppet is currently enabled, last run 51 seconds ago with 0 failures [07:44:25] PROBLEM - MediaWiki exceptions and fatals per minute on graphite1001 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [50.0] [07:45:34] RECOVERY - puppet last run on mw1300 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [08:06:02] RECOVERY - MediaWiki exceptions and fatals per minute on graphite1001 is OK: OK: Less than 1.00% above the threshold [25.0] [08:35:53] PROBLEM - Kafka MirrorMaker main-eqiad_to_analytics on kafka1012 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, regex args kafka.tools.MirrorMaker.+/etc/kafka/mirror/main-eqiad_to_analytics/producer\.properties [08:39:14] PROBLEM - Kafka MirrorMaker main-eqiad_to_analytics on kafka1013 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, regex args kafka.tools.MirrorMaker.+/etc/kafka/mirror/main-eqiad_to_analytics/producer\.properties [08:39:34] PROBLEM - jmxtrans on kafka1018 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [08:39:58] PROBLEM - salt-minion processes on kafka1018 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [08:40:23] PROBLEM - Kafka MirrorMaker main-eqiad_to_analytics on kafka1018 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [08:40:58] PROBLEM - dhclient process on kafka1018 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [08:41:52] PROBLEM - Kafka Broker Server on kafka1018 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [08:42:15] RECOVERY - jmxtrans on kafka1018 is OK: PROCS OK: 1 process with command name java, regex args -jar.+jmxtrans-all.jar [08:42:36] RECOVERY - salt-minion processes on kafka1018 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [08:42:53] RECOVERY - Kafka MirrorMaker main-eqiad_to_analytics on kafka1018 is OK: PROCS OK: 1 process with command name java, regex args kafka.tools.MirrorMaker.+/etc/kafka/mirror/main-eqiad_to_analytics/producer\.properties [08:43:37] RECOVERY - dhclient process on kafka1018 is OK: PROCS OK: 0 processes with command name dhclient [08:44:31] looks like a broken disk [08:45:18] * apergos groggily reaches for phone to see which one was the page [08:45:27] apergos: kafka1018 [08:45:47] ah ha [08:45:52] /dev/sdi [08:46:35] how did it recover? [08:47:02] it didn't [08:47:04] icinga still red [08:47:12] missing process processes with command name 'java', args 'Kafka /etc/kafka/server.properties' [08:47:35] * volans looking at kafka on wikitech [08:47:52] oh, 1018 [08:47:54] right [08:50:03] RECOVERY - Kafka MirrorMaker main-eqiad_to_analytics on kafka1013 is OK: PROCS OK: 1 process with command name java, regex args kafka.tools.MirrorMaker.+/etc/kafka/mirror/main-eqiad_to_analytics/producer\.properties [08:52:09] looks like 1018 too is recovering [08:52:54] how is that possible with a broken disk? [08:53:17] maybe it detects is broken and just use the others [08:53:26] kafka uses JBOD [08:54:14] reovering as in the process seems running now [08:54:18] huh [08:54:43] mmmh not anymore... maybe is trying to start and failing [08:54:45] more likely [08:54:53] PROBLEM - Kafka Broker Under Replicated Partitions on kafka1020 is CRITICAL: CRITICAL: 72.41% of data above the critical threshold [10.0] [08:55:05] ugh [08:55:26] oh. 
no, that might be an appropriate warning [08:55:40] hm [08:57:26] PROBLEM - Kafka Broker Under Replicated Partitions on kafka1022 is CRITICAL: CRITICAL: 78.57% of data above the critical threshold [10.0] [08:57:51] I'm looking at https://wikitech.wikimedia.org/wiki/Analytics/Cluster/Kafka/Administration [08:58:01] but commands fail due to required --zookeeper option [08:58:04] PROBLEM - Kafka Broker Under Replicated Partitions on kafka1014 is CRITICAL: CRITICAL: 79.31% of data above the critical threshold [10.0] [08:58:04] heh me too [08:58:05] do you know the urls [08:58:06] ? [08:58:19] but I did not try running any commands (not even on the box yet) [08:59:53] RECOVERY - Kafka MirrorMaker main-eqiad_to_analytics on kafka1012 is OK: PROCS OK: 1 process with command name java, regex args kafka.tools.MirrorMaker.+/etc/kafka/mirror/main-eqiad_to_analytics/producer\.properties [08:59:56] no I sure don't [09:00:54] ah ok it's just it doesn't work there [09:00:58] working on other machines [09:03:16] elukey: around? [09:03:18] you could maybe edit out the disk from log.dirs in /etc/kafka/server.properties [09:03:25] with puppet disabled for now on the host [09:03:29] it might come back that way [09:03:35] PROBLEM - Kafka Broker Under Replicated Partitions on kafka1012 is CRITICAL: CRITICAL: 75.86% of data above the critical threshold [10.0] [09:03:36] PROBLEM - Kafka Broker Under Replicated Partitions on kafka1013 is CRITICAL: CRITICAL: 75.86% of data above the critical threshold [10.0] [09:03:59] they are replicated so I guess the cluster is replicating stuff, I'd like to check the status first [09:04:07] checking [09:04:21] just got the page [09:04:29] heh [09:04:38] elukey: broken disk on kafka1018 [09:04:40] dev/sdi [09:04:54] at least looks like this to me [09:05:14] [11111407.165305] EXT4-fs warning (device sdi1): htree_dirblock_to_tree:959: inode #2: lblock 0: comm ls: error -5 reading directory block [09:05:14] I was trying to check the status of the cluster and reading the docs ;) [09:05:26] looks pretty broken to me [09:05:29] we can disable puppet and leave it down [09:05:38] syslog is full of errors on that disk [09:05:45] so I was thinking we could disable puppet, and remove the disk from log.dirs [09:05:48] and restart [09:05:56] at least let it do its work on the rest [09:05:57] but how to check "make sure that any topics for which the target broker is the leader also has In Sync Replicas" [09:06:00] ? [09:06:22] in the meanwhile I'll open the task for the broken disk [09:07:01] volans: kafka topics --describe [09:07:23] elukey: yes but says that is missing the zookeeper param because I guess is broken there [09:07:27] apergos: we can just leave the broker down, it won't be a big issue for the cluster [09:07:37] well that's even easier I guess [09:07:58] volans: it works for me [09:07:59] mmm [09:08:15] now working [09:08:18] before was not [09:08:29] worked for me as well [09:08:34] I am disabling puppet with a reason [09:08:41] kafka-topics --describe [09:08:42] Missing required argument "[zookeeper]" [09:08:45] this was before [09:08:56] you have a hypher [09:09:09] maybe without the hyphen it is different behavior? 
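[annotation, not part of the log: a sketch of what is likely happening here, for context. The stock kafka-topics.sh tool always has to be told where ZooKeeper is, while the site "kafka" wrapper appears to fill that in from the environment, which would explain the same command working in one shell and not another; compare the ZOOKEEPER_URL finding at 09:47 further down. Passing the connection string explicitly should sidestep the environment entirely, e.g.:

  kafka-topics --describe --zookeeper 'conf1001.eqiad.wmnet,conf1002.eqiad.wmnet,conf1003.eqiad.wmnet/kafka/eqiad'

The host list and the /kafka/eqiad chroot are copied from the 09:47 message, not independently verified, and the default ZooKeeper port (2181) may need to be appended to each host.]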
[09:09:25] maybe it picks up config settings from someplace [09:09:38] that's the error message, my command didn't had an hyphen ;) [09:09:43] oh :-D [09:09:49] then I have no idea :-D [09:10:19] !log puppet disabled on kafka1018, leave broker down, bad disk /dev/sdi (see dmesg for sample errors) [09:10:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [09:10:31] apergos: I already did it :) [09:10:33] oh [09:10:35] hahaha [09:10:39] well it's done twice then [09:10:44] ahhahahah [09:10:53] I hope you didn't log it too, oh well [09:10:58] nonono [09:11:01] :-D [09:11:03] I was about to do it [09:11:05] thanks :) [09:11:07] ah [09:11:24] so kafka-topics invokes a different thing [09:11:30] that requires zookeeper [09:11:43] meanwhile kafka topic --describe automagically add it [09:12:05] maybe I didn't explain myself [09:12:06] # kafka topics --describe [09:12:06] kafka-topics --describe [09:12:06] Missing required argument "[zookeeper]" [09:12:09] ah hah, I had done it before you, it shows your message in the disable :-P [09:12:11] that is what I run [09:12:16] and what it output... ;0 [09:12:19] ;) [09:12:34] uh [09:12:35] hm [09:12:39] let me see someting [09:13:02] yes! [09:13:04] su - [09:13:06] after you are root [09:13:09] then it works :-P [09:13:18] need the root environ [09:13:30] volans: did you read what I wrote? :D [09:13:46] oh [09:13:47] wait [09:13:51] trailing s? [09:14:17] no. not trailing s [09:14:24] it's literally the su - that makes it work [09:14:29] nono kafka topics vs kafka-topics [09:15:23] no [09:15:24] elukey [09:15:28] I type "kafka topics --describe" [09:15:35] if I have done sudo -s [09:15:42] and type that command I get the zookeeper fail [09:15:51] if I su - after and cpy-paste that same command [09:15:52] elukey: yes but I type kafka topics [09:15:53] I get success [09:15:56] *I typed [09:16:30] so sudo or su requested [09:16:38] you can run it from your username [09:16:52] maybe I was root at that time [09:17:10] anyhow [09:17:14] on a different topic, is normal that in JBOD mode megacli don't see any error on the disks? [09:17:20] all I can say is that after sudo -s there is that fail and it's the same exact command, so.... [09:17:27] kafka1018 is not a partition leader anymore and this is the good thing [09:17:33] <_joe_> jbod means disks are not managed [09:17:37] but it is listed as replica in some partitions [09:17:40] yes I know [09:17:52] but the I/O still pass through megaraid_sas [09:18:00] yep [09:18:02] <_joe_> and that means megacli doesn't really get any info, I guess [09:18:20] it even tried to reset it [09:18:21] sd 0:0:8:0: [sdi] tag#6 megasas: RESET cmd=2a retries=1 [09:18:32] <_joe_> (sorry, I was around but saw you guys were managing it already ) [09:18:36] I assume the resets failed [09:18:45] 06Operations, 10ops-eqiad: Broken disk on kafka1018 - https://phabricator.wikimedia.org/T147707#2701301 (10Volans) [09:19:01] for the logs, see ^^^ [09:19:06] yep [09:19:07] it failed [09:19:18] <_joe_> shuldn't kafka detect a failed node and recover by itself? [09:19:25] <_joe_> do I remember incorrectly? [09:19:28] the cluster should do that yes [09:19:31] _joe_ yes [09:19:53] we have just disabled puppet and stopped the kafka broker daemon to be sure [09:20:01] elukey: did you stop kafka? systemd will retry to start it? 
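[annotation, not part of the log: the worry in the last question is that a plain stop gets undone, either by a Restart= policy or by the next puppet run, which is why the broker ends up masked further down (09:27 and 09:39). A minimal sketch of that sequence, assuming the unit is simply called "kafka" as the later !log entry suggests:

  systemctl stop kafka     # stop the broker now
  systemctl mask kafka     # symlink the unit to /dev/null so neither systemd nor puppet can start it
  puppet agent --enable    # puppet is safe to re-enable once the unit is masked

Masking rather than merely disabling matters because puppet would otherwise ensure the service back to running on its next agent run.]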
[09:20:04] ok [09:20:34] the only thing that needs to be done is make sure that the replicas are ok [09:21:27] 1013, 1014, 1020, 1022 alarm is off for underreplicated [09:21:42] any easy way to check they are replicating over other nodes? [09:21:59] 1012 too [09:22:36] do we need to do kafka preferred-replica-election or no? [09:22:58] apergos: already done it just in case but 18 was already removed [09:23:03] ah ok [09:23:10] so kafka topics --describe will show two things [09:23:26] 1) Partition leaders [09:23:41] 2) Replicas and their In Sync Replicas (ISR) [09:23:46] ok, I am seeing that leader solumn [09:24:00] the isr column was a mystery to me (except I guessed it should always have three entries :-P) [09:24:08] :D [09:24:22] it basically tells you how many replicas are in sync with the leader [09:24:26] so when they have only 2 is underreplicated? [09:24:47] like Topic: webrequest_uploadPartition: 20Leader: 14Replicas: 14,13,18Isr: 13,14 [09:24:54] should be on 18 too but 18 is broken [09:24:58] is that right? [09:25:08] yes since ReplicationFactor = 3 [09:26:09] but do we have any easy check that gives us a % or progress of the fact that the other nodes are re-replicating the underreplicated stuff? [09:26:54] * volans can check the icinga alert, looks like it has one :) [09:27:57] <_joe_> can I suggest we mask the systemd unit for the broker and let puppet run, if that's the only reason for disabling it? [09:28:20] better yes :) [09:28:34] great idea [09:28:41] it's a graphite metric [09:28:52] ${group_prefix}kafka.${graphite_broker_key}.kafka.server.ReplicaManager.UnderReplicatedPartitions.Value [09:29:16] so I count 209 entries without three replicas (using good old mawk to grep pipeline) [09:29:50] of course there would be an easier way, heh (graphite) [09:30:52] IIRC there is a special command to remove the broker from the cluster and force the replicas to re-balance [09:31:13] but I have never used it and I don't think there is a huge risk in here to test it [09:31:51] right [09:32:04] volans: https://grafana.wikimedia.org/dashboard/db/kafka [09:32:17] where? I've already opened that [09:32:24] this works [09:32:32] kafka.cluster.$cluster.kafka.$kafka_brokers.kafka.server.ReplicaManager.UnderReplicatedPartitions.Value [09:32:47] editing one graph of them just to have the variable populated [09:33:12] there is a under replicated partitions graph [09:33:37] oh found now :D [09:33:41] same thing [09:33:56] they look stable, not going down [09:34:31] yeah I see them staying quite fixed [09:34:49] I wonder how long it takes for even one to catch up [09:35:04] yeah I think that to shuffle things around I'd need to remove 1018 from the cluster manually [09:35:13] but even two ISR out of three is not bad [09:35:33] 600gb to be copied around [09:35:34] do you prefer to check on tuesday if there is a spare disk in the datacenter already? [09:35:55] and leave it like this until the disk replacement? [09:36:03] but if one partition is one 200/th of that... 
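[annotation, not part of the log: one way to reproduce the "209 entries without three replicas" count from 09:29, assuming the describe output quoted at 09:24 (one line per partition, ending in an "Isr:" field, with ReplicationFactor 3):

  kafka topics --describe | awk -F'Isr: ' '/Isr:/ { if (split($2, isr, ",") < 3) n++ } END { print n+0 }'

For watching the catch-up over time, the graphite metric named at 09:28 (kafka.server.ReplicaManager.UnderReplicatedPartitions.Value) is the easier route, since it is already broken out per broker and plotted on the Kafka Grafana dashboard mentioned at 09:32.]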
[09:36:07] yes I think it would be the best choice [09:36:53] * volans have no idea on how much the re-shuffle loads the cluster [09:37:06] ok and of course it will need to shuffle back when 1018 will be back [09:37:31] well even if we lose another host over the weekend (hope not) we'd still be ok [09:37:48] so waiting and seeing is a good call [09:38:03] volans: yes 1018 will need a bit of time to catch up before beging a leader again [09:38:17] yeah [09:38:30] but without running kafka-preferred-replica-election it won't be put into the leader pool [09:38:43] so we can put it back online when the disk will be swapped [09:38:54] wait for its replicas to catch up [09:39:06] and then rebalance the partition leaders [09:39:15] ack [09:39:23] masking kafka and re-enabling puppet [09:39:57] elukey: a couple of things on icinga [09:40:19] 1) the ack for 1018 is on the host only, the failed procs is still not acked [09:40:36] 2) you probably want to ack the under replicate alarm too on the others [09:40:46] you can use T147707 for the message [09:40:47] T147707: Broken disk on kafka1018 - https://phabricator.wikimedia.org/T147707 [09:40:54] !log masked the kafka systemd unit on kafka1018 and re-enabling puppet [09:40:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [09:41:15] volans: yes I am going to do these things [09:41:26] and also add a summary of things done in the task [09:42:23] sure, or open a different one if you need a separate one [09:44:33] nono thanks a lot for creating one :) [09:44:52] de nada ;) [09:45:09] 06Operations, 10ops-eqiad, 06DC-Ops: Broken disk on kafka1018 - https://phabricator.wikimedia.org/T147707#2701344 (10elukey) [09:47:08] and in the meantime I foudn out the difference between sudo -s and being you or su - [09:47:16] it's the missing ZOOKEEPER_URL=conf1001.eqiad.wmnet,conf1002.eqiad.wmnet,conf1003.eqiad.wmnet/kafka/eqiad in the environ :-D [09:47:33] 06Operations, 10ops-eqiad, 06DC-Ops: Broken disk on kafka1018 - https://phabricator.wikimedia.org/T147707#2701345 (10elukey) The kafka1018's kafka systemd unit has been masked and the service is stopped. The cluster is fine at the moment but a lot of topic partitions are under-replicated since kafka1018 is d... [09:48:05] 06Operations, 10ops-eqiad, 06DC-Ops: Broken disk on kafka1018 - https://phabricator.wikimedia.org/T147707#2701346 (10elukey) p:05Triage>03Normal [09:48:30] 06Operations, 10ops-eqiad, 06DC-Ops: Broken disk on kafka1018 - https://phabricator.wikimedia.org/T147707#2701301 (10elukey) a:03Cmjohnson [09:49:57] thanks for bailing us out, elukey [09:50:36] apergos: thank to you and volans for the help! [09:51:56] ah snap tons of cron-spam again [09:51:57] sigh [09:52:14] oh noes [09:53:11] * volans bbiab [09:53:48] safe to wander off now I suppose [09:53:56] I'll peek in from time to time [09:58:29] o/ [09:58:30] me too [10:23:37] PROBLEM - puppet last run on lvs3004 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [10:50:21] RECOVERY - puppet last run on lvs3004 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [12:11:38] (03PS1) 10BBlack: Revert "upload storage: avoid cron restarts while rebooting" [puppet] - 10https://gerrit.wikimedia.org/r/314815 [12:11:50] (03CR) 10BBlack: [C: 032 V: 032] Revert "upload storage: avoid cron restarts while rebooting" [puppet] - 10https://gerrit.wikimedia.org/r/314815 (owner: 10BBlack) [12:13:05] 06Operations, 07Puppet, 07RfC: RFC: New puppet code organization paradigm/coding standards - https://phabricator.wikimedia.org/T147718#2701526 (10Joe) [12:13:59] 06Operations, 07Puppet, 07RfC: RFC: New puppet code organization paradigm/coding standards - https://phabricator.wikimedia.org/T147718#2701538 (10Joe) [12:31:42] PROBLEM - puppet last run on cp3007 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [12:58:10] RECOVERY - puppet last run on cp3007 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [13:49:22] (kafka metrics look stable) [14:13:30] PROBLEM - puppet last run on elastic1019 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [14:39:53] RECOVERY - puppet last run on elastic1019 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [14:57:48] oh good, I was just wandering by to have a look [15:00:04] hm yeah looks routine except for those partitions still missing full replication (which we expected) [15:16:53] PROBLEM - MediaWiki exceptions and fatals per minute on graphite1001 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [50.0] [15:22:03] PROBLEM - MediaWiki exceptions and fatals per minute on graphite1001 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [50.0] [15:30:03] PROBLEM - MediaWiki exceptions and fatals per minute on graphite1001 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [50.0] [15:38:03] PROBLEM - MediaWiki exceptions and fatals per minute on graphite1001 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [50.0] [15:38:52] (03CR) 10Andrew Bogott: Remove wikitech references from ldapconfig [puppet] - 10https://gerrit.wikimedia.org/r/309705 (owner: 10Alex Monk) [15:45:53] PROBLEM - MediaWiki exceptions and fatals per minute on graphite1001 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [50.0] [16:04:54] 06Operations, 07Puppet, 07RfC: RFC: New puppet code organization paradigm/coding standards - https://phabricator.wikimedia.org/T147718#2701698 (10Joe) [16:05:11] (03Draft1) 10Giuseppe Lavagetto: swift: refactor to role/profile pattern, part 1 [puppet] - 10https://gerrit.wikimedia.org/r/314829 (https://phabricator.wikimedia.org/T147718) [16:07:03] PROBLEM - MediaWiki exceptions and fatals per minute on graphite1001 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [50.0] [16:23:37] (03PS1) 10Giuseppe Lavagetto: Add common::swift private data to allow testing of 314829 [labs/private] - 10https://gerrit.wikimedia.org/r/314830 [16:24:06] (03CR) 10Giuseppe Lavagetto: [C: 032 V: 032] Add common::swift private data to allow testing of 314829 [labs/private] - 10https://gerrit.wikimedia.org/r/314830 (owner: 10Giuseppe Lavagetto) [16:34:08] morning (my time) petan [16:39:53] PROBLEM - puppet last run on ms-be1023 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [16:41:43] that sounds good... [16:41:52] I'm geting 503's on Phabricator when uploading images... [16:41:58] https://usercontent.irccloud-cdn.com/file/2dvjP549/IMG_4110.PNG [16:42:07] Well we just got an error [16:42:07] PROBLEM - puppet last run on ms-be1023 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [16:42:52] Josve05a have you been able to upload images via mobile before? [16:43:32] I uploaded some last night fine [16:43:42] Josve05a: Works for me [16:43:50] Zppix: yes [16:44:10] I'm at WIkiCon on the library's wifi.... [16:44:18] Hmm, im not the best at phab... im assuming its probably your connection and/or your end only... [16:46:08] I was pressing the upload button in the text field and then browsed the images, I didn't drag anything... [16:46:23] PROBLEM - MediaWiki exceptions and fatals per minute on graphite1001 is CRITICAL: CRITICAL: 60.00% of data above the critical threshold [50.0] [16:46:38] ^^ lovely [16:48:26] Josve05a, hmm i have no clue then, I'm not a wizard at phab atm. [16:59:32] PROBLEM - MediaWiki exceptions and fatals per minute on graphite1001 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [50.0] [17:06:22] RECOVERY - puppet last run on ms-be1023 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [17:12:44] PROBLEM - MediaWiki exceptions and fatals per minute on graphite1001 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [50.0] [17:18:02] RECOVERY - MediaWiki exceptions and fatals per minute on graphite1001 is OK: OK: Less than 1.00% above the threshold [25.0] [17:33:55] PROBLEM - MediaWiki exceptions and fatals per minute on graphite1001 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [50.0] [17:47:05] RECOVERY - MediaWiki exceptions and fatals per minute on graphite1001 is OK: OK: Less than 1.00% above the threshold [25.0] [17:47:43] ^^ is mostly redis timeouts and related, completely filling up the fatallog [17:47:49] such a pain in the ass [17:48:40] this is interesting though: started at ~16:30 UTC today: https://logstash.wikimedia.org/goto/cda75730013d93f2babe93f1403b8c30 [17:48:55] Warning: API call had warnings trying to login: warnings={"login":{"*":"Fetching a token via action=login is deprecated. Use action=query\u0026meta=tokens\u0026type=login instead."}}, query={"action":"login","lgname":"Zerowiki@banners","lgpassword":"***"} [Called from JsonConfig\JCUtils::warn in /srv/mediawiki/php-1.28.0-wmf.21/extensions/JsonConfig/includes/JCUtils.php at line 52] in /srv/med [17:49:01] iawiki/php-1.28.0-wmf.21/includes/debug/MWDebug.php on line 311 [17:49:02] and yuri just left [17:49:31] Hello. [17:49:34] greg-g: that's 3-4 weeks this error appears [18:10:25] PROBLEM - puppet last run on xenon is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [18:34:06] RECOVERY - puppet last run on xenon is OK: OK: Puppet is currently enabled, last run 36 seconds ago with 0 failures [18:35:37] (03CR) 10Odder: "Unfortunaly due to travel and other engagements I will not be present during any of the SWAT windows over the coming week, so if anyone wi" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/309912 (https://phabricator.wikimedia.org/T144689) (owner: 10Odder) [18:36:09] (03CR) 10Odder: "Unfortunaly due to travel and other engagements I will not be present during any of the SWAT windows over the coming week, so if anyone wi" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/309911 (https://phabricator.wikimedia.org/T145010) (owner: 10Odder) [18:53:06] PROBLEM - MediaWiki exceptions and fatals per minute on graphite1001 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [50.0] [18:55:45] RECOVERY - MediaWiki exceptions and fatals per minute on graphite1001 is OK: OK: Less than 1.00% above the threshold [25.0] [19:02:35] (03PS1) 10Paladox: Revert "gerrit: workaround a CSS bug with Microsoft Edge" [puppet] - 10https://gerrit.wikimedia.org/r/314835 [19:02:43] (03PS2) 10Paladox: Revert "gerrit: workaround a CSS bug with Microsoft Edge" [puppet] - 10https://gerrit.wikimedia.org/r/314835 [19:22:17] PROBLEM - Host cp2008 is DOWN: PING CRITICAL - Packet loss = 100% [19:22:33] rip [19:30:21] PROBLEM - Esams HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 10.00% of data above the critical threshold [1000.0] [19:35:50] PROBLEM - IPsec on cp4014 is CRITICAL: Strongswan CRITICAL - ok: 52 connecting: cp2008_v4, cp2008_v6 [19:35:50] PROBLEM - IPsec on cp4013 is CRITICAL: Strongswan CRITICAL - ok: 52 connecting: cp2008_v4, cp2008_v6 [19:35:51] PROBLEM - IPsec on cp3038 is CRITICAL: Strongswan CRITICAL - ok: 52 connecting: cp2008_v4, cp2008_v6 [19:35:51] PROBLEM - IPsec on cp3036 is CRITICAL: Strongswan CRITICAL - ok: 52 connecting: cp2008_v4, cp2008_v6 [19:35:51] PROBLEM - IPsec on cp4007 is CRITICAL: Strongswan CRITICAL - ok: 52 connecting: cp2008_v4, cp2008_v6 [19:35:51] PROBLEM - IPsec on cp3039 is CRITICAL: Strongswan CRITICAL - ok: 52 connecting: cp2008_v4, cp2008_v6 [19:36:09] PROBLEM - IPsec on kafka1013 is CRITICAL: Strongswan CRITICAL - ok: 146 connecting: cp2008_v4, cp2008_v6 [19:36:09] PROBLEM - IPsec on cp1063 is CRITICAL: Strongswan CRITICAL - ok: 54 not-conn: cp2008_v4, cp2008_v6 [19:36:09] PROBLEM - IPsec on cp1050 is CRITICAL: Strongswan CRITICAL - ok: 54 not-conn: cp2008_v4, cp2008_v6 [19:36:20] PROBLEM - IPsec on cp4006 is CRITICAL: Strongswan CRITICAL - ok: 52 connecting: cp2008_v4, cp2008_v6 [19:36:20] PROBLEM - IPsec on cp3034 is CRITICAL: Strongswan CRITICAL - ok: 52 connecting: cp2008_v4, cp2008_v6 [19:36:21] PROBLEM - IPsec on cp4005 is CRITICAL: Strongswan CRITICAL - ok: 52 connecting: cp2008_v4, cp2008_v6 [19:36:21] PROBLEM - IPsec on cp4015 is CRITICAL: Strongswan CRITICAL - ok: 52 connecting: cp2008_v4, cp2008_v6 [19:36:31] PROBLEM - IPsec on cp3049 is CRITICAL: Strongswan CRITICAL - ok: 52 connecting: cp2008_v4, cp2008_v6 [19:36:42] PROBLEM - IPsec on cp1071 is CRITICAL: Strongswan CRITICAL - ok: 54 not-conn: cp2008_v4, cp2008_v6 [19:36:44] PROBLEM - IPsec on cp3037 is CRITICAL: Strongswan CRITICAL - ok: 52 connecting: cp2008_v4, cp2008_v6 [19:36:44] PROBLEM - IPsec on kafka1020 is CRITICAL: Strongswan CRITICAL - ok: 146 connecting: cp2008_v4, cp2008_v6 [19:36:44] PROBLEM - IPsec on cp3047 is CRITICAL: Strongswan CRITICAL - ok: 52 connecting: 
cp2008_v4, cp2008_v6 [19:36:44] PROBLEM - IPsec on cp3044 is CRITICAL: Strongswan CRITICAL - ok: 52 connecting: cp2008_v4, cp2008_v6 [19:36:44] PROBLEM - IPsec on cp3035 is CRITICAL: Strongswan CRITICAL - ok: 52 connecting: cp2008_v4, cp2008_v6 [19:36:44] PROBLEM - IPsec on cp3048 is CRITICAL: Strongswan CRITICAL - ok: 52 connecting: cp2008_v4, cp2008_v6 [19:36:45] PROBLEM - IPsec on cp1062 is CRITICAL: Strongswan CRITICAL - ok: 54 not-conn: cp2008_v4, cp2008_v6 [19:36:50] .... [19:36:57] whelp i think the op team just died [19:36:59] PROBLEM - IPsec on kafka1012 is CRITICAL: Strongswan CRITICAL - ok: 146 connecting: cp2008_v4, cp2008_v6 [19:37:00] PROBLEM - IPsec on cp1072 is CRITICAL: Strongswan CRITICAL - ok: 54 not-conn: cp2008_v4, cp2008_v6 [19:37:00] PROBLEM - IPsec on kafka1022 is CRITICAL: Strongswan CRITICAL - ok: 146 connecting: cp2008_v4, cp2008_v6 [19:37:00] PROBLEM - IPsec on cp1049 is CRITICAL: Strongswan CRITICAL - ok: 54 not-conn: cp2008_v4, cp2008_v6 [19:37:00] PROBLEM - IPsec on kafka1014 is CRITICAL: Strongswan CRITICAL - ok: 146 connecting: cp2008_v4, cp2008_v6 [19:37:00] PROBLEM - IPsec on cp1099 is CRITICAL: Strongswan CRITICAL - ok: 54 not-conn: cp2008_v4, cp2008_v6 [19:37:10] PROBLEM - IPsec on cp3046 is CRITICAL: Strongswan CRITICAL - ok: 52 connecting: cp2008_v4, cp2008_v6 [19:37:10] PROBLEM - IPsec on cp3045 is CRITICAL: Strongswan CRITICAL - ok: 52 connecting: cp2008_v4, cp2008_v6 [19:37:10] PROBLEM - IPsec on cp1048 is CRITICAL: Strongswan CRITICAL - ok: 54 not-conn: cp2008_v4, cp2008_v6 [19:37:10] PROBLEM - IPsec on cp1074 is CRITICAL: Strongswan CRITICAL - ok: 54 not-conn: cp2008_v4, cp2008_v6 [19:37:42] PROBLEM - IPsec on cp1073 is CRITICAL: Strongswan CRITICAL - ok: 54 not-conn: cp2008_v4, cp2008_v6 [19:37:46] PROBLEM - IPsec on cp1064 is CRITICAL: Strongswan CRITICAL - ok: 54 not-conn: cp2008_v4, cp2008_v6 [19:47:11] ah the awesomeness of the mesh encryption failures. [19:47:37] bd808 sorry i couldnt resist the red buttons [19:48:28] !log cp2008 Strongswan failures for both ipv4 and ipv6 across a larg number (all?) hosts [19:48:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [20:22:04] RECOVERY - Redis status tcp_6479 on rdb2006 is OK: OK: REDIS on 10.192.48.44:6479 has 1 databases (db0) with 3039560 keys - replication_delay is 0 [21:10:39] PROBLEM - MediaWiki exceptions and fatals per minute on graphite1001 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [50.0] [21:11:19] PROBLEM - puppet last run on elastic1022 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [21:12:28] PROBLEM - Redis status tcp_6479 on rdb2006 is CRITICAL: CRITICAL: replication_delay is 738 600 - REDIS on 10.192.48.44:6479 has 1 databases (db0) with 3042033 keys - replication_delay is 738 [21:15:59] RECOVERY - MediaWiki exceptions and fatals per minute on graphite1001 is OK: OK: Less than 1.00% above the threshold [25.0] [21:25:32] PROBLEM - Redis status tcp_6479 on rdb2006 is CRITICAL: CRITICAL: replication_delay is 714 600 - REDIS on 10.192.48.44:6479 has 1 databases (db0) with 3039644 keys - replication_delay is 714 [21:30:49] RECOVERY - Redis status tcp_6479 on rdb2006 is OK: OK: REDIS on 10.192.48.44:6479 has 1 databases (db0) with 3039956 keys - replication_delay is 0 [21:37:49] RECOVERY - puppet last run on elastic1022 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [21:41:05] hm, what's that? 
https://de.wikipedia.org/wiki/Spezial:Beitr%C3%A4ge/Schlapfm?uselang=en why is there a block shown? [21:48:19] shutdown, https://phabricator.wikimedia.org/T147642 [21:48:37] yay, restricted [21:50:52] ^ same [22:04:32] PROBLEM - puppet last run on mw1259 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [22:30:46] RECOVERY - puppet last run on mw1259 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [22:35:22] PROBLEM - puppet last run on ms-be1005 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [22:51:48] (03PS8) 10Paladox: Add support for searching gerrit using bug:T1 [puppet] - 10https://gerrit.wikimedia.org/r/308753 (https://phabricator.wikimedia.org/T85002) [22:53:37] RECOVERY - OCG health on ocg1001 is OK: OK: ocg_job_status 799317 msg: ocg_render_job_queue 0 msg [22:54:08] RECOVERY - OCG health on ocg1002 is OK: OK: ocg_job_status 799416 msg: ocg_render_job_queue 0 msg [22:55:17] RECOVERY - OCG health on ocg1003 is OK: OK: ocg_job_status 799606 msg: ocg_render_job_queue 0 msg [22:58:59] RECOVERY - puppet last run on ms-be1005 is OK: OK: Puppet is currently enabled, last run 44 seconds ago with 0 failures [23:26:45] (03CR) 10Dereckson: "Sure, I'm scheduling it for Monday evening SWAT." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/309912 (https://phabricator.wikimedia.org/T144689) (owner: 10Odder) [23:31:26] RECOVERY - Esams HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [23:36:31] (03CR) 10Alex Monk: [C: 031] Remove spurious transcoding-labs.org usage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/314753 (owner: 10Reedy) [23:37:24] (03CR) 10Alex Monk: [C: 031] Remove wikimania2013wiki specific translate config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/314742 (owner: 10Reedy) [23:44:22] (03CR) 10Alex Monk: "relevant commit?" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/303383 (owner: 10Reedy) [23:44:58] (03CR) 10Reedy: "https://gerrit.wikimedia.org/r/#/c/303382/" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/303383 (owner: 10Reedy)