[00:21:39] PROBLEM - puppet last run on lvs1004 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [00:23:35] PROBLEM - puppet last run on serpens is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [00:24:33] (03PS1) 10Dereckson: Respawn ptwikimedia configuration [mediawiki-config] - 10https://gerrit.wikimedia.org/r/314792 [00:27:10] (03PS2) 10Dereckson: Respawn ptwikimedia configuration [mediawiki-config] - 10https://gerrit.wikimedia.org/r/314792 (https://phabricator.wikimedia.org/T126832) [00:34:46] (03CR) 10Dereckson: [C: 031] Allow Commons 'crats to manage accountcreator group [mediawiki-config] - 10https://gerrit.wikimedia.org/r/309912 (https://phabricator.wikimedia.org/T144689) (owner: 10Odder) [00:36:46] (03CR) 10Dereckson: "This change is ready for deployment." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/309912 (https://phabricator.wikimedia.org/T144689) (owner: 10Odder) [00:48:08] RECOVERY - puppet last run on lvs1004 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [00:49:51] RECOVERY - puppet last run on serpens is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [00:50:54] (03PS1) 10Dereckson: Raise abuse filter emergency threshold for es.wikibooks [mediawiki-config] - 10https://gerrit.wikimedia.org/r/314797 (https://phabricator.wikimedia.org/T145765) [01:33:08] (03CR) 10Krinkle: [C: 031] contint: add phpdbg for code coverage [puppet] - 10https://gerrit.wikimedia.org/r/314563 (owner: 10Hashar) [01:33:18] (03CR) 10Krinkle: "What went wrong?" [puppet] - 10https://gerrit.wikimedia.org/r/314563 (owner: 10Hashar) [01:36:19] (03CR) 10Aaron Schulz: [C: 031] robots.php: Use WikiPage instead of Article class [mediawiki-config] - 10https://gerrit.wikimedia.org/r/314790 (owner: 10Krinkle) [01:37:10] ostriches: around? [02:05:56] PROBLEM - puppet last run on mw1259 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [02:14:44] PROBLEM - puppet last run on restbase1015 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [02:22:46] PROBLEM - puppet last run on labvirt1009 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [02:24:55] !log mwdeploy@tin scap sync-l10n completed (1.28.0-wmf.21) (duration: 09m 09s) [02:25:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [02:29:25] RECOVERY - puppet last run on mw1259 is OK: OK: Puppet is currently enabled, last run 46 seconds ago with 0 failures [02:31:15] !log l10nupdate@tin ResourceLoader cache refresh completed at Sat Oct 8 02:31:15 UTC 2016 (duration 6m 20s) [02:31:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [02:35:54] RECOVERY - puppet last run on restbase1015 is OK: OK: Puppet is currently enabled, last run 12 seconds ago with 0 failures [02:46:38] RECOVERY - puppet last run on labvirt1009 is OK: OK: Puppet is currently enabled, last run 49 seconds ago with 0 failures [06:58:18] PROBLEM - puppet last run on labcontrol1002 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [07:21:33] PROBLEM - puppet last run on mw1300 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [07:22:15] RECOVERY - puppet last run on labcontrol1002 is OK: OK: Puppet is currently enabled, last run 51 seconds ago with 0 failures [07:44:25] PROBLEM - MediaWiki exceptions and fatals per minute on graphite1001 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [50.0] [07:45:34] RECOVERY - puppet last run on mw1300 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [08:06:02] RECOVERY - MediaWiki exceptions and fatals per minute on graphite1001 is OK: OK: Less than 1.00% above the threshold [25.0] [08:35:53] PROBLEM - Kafka MirrorMaker main-eqiad_to_analytics on kafka1012 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, regex args kafka.tools.MirrorMaker.+/etc/kafka/mirror/main-eqiad_to_analytics/producer\.properties [08:39:14] PROBLEM - Kafka MirrorMaker main-eqiad_to_analytics on kafka1013 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, regex args kafka.tools.MirrorMaker.+/etc/kafka/mirror/main-eqiad_to_analytics/producer\.properties [08:39:34] PROBLEM - jmxtrans on kafka1018 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [08:39:58] PROBLEM - salt-minion processes on kafka1018 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [08:40:23] PROBLEM - Kafka MirrorMaker main-eqiad_to_analytics on kafka1018 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [08:40:58] PROBLEM - dhclient process on kafka1018 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [08:41:52] PROBLEM - Kafka Broker Server on kafka1018 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [08:42:15] RECOVERY - jmxtrans on kafka1018 is OK: PROCS OK: 1 process with command name java, regex args -jar.+jmxtrans-all.jar [08:42:36] RECOVERY - salt-minion processes on kafka1018 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [08:42:53] RECOVERY - Kafka MirrorMaker main-eqiad_to_analytics on kafka1018 is OK: PROCS OK: 1 process with command name java, regex args kafka.tools.MirrorMaker.+/etc/kafka/mirror/main-eqiad_to_analytics/producer\.properties [08:43:37] RECOVERY - dhclient process on kafka1018 is OK: PROCS OK: 0 processes with command name dhclient [08:44:31] looks like a broken disk [08:45:18] * apergos groggily reaches for phone to see which one was the page [08:45:27] apergos: kafka1018 [08:45:47] ah ha [08:45:52] /dev/sdi [08:46:35] how did it recover? [08:47:02] it didn't [08:47:04] icinga still red [08:47:12] missing process processes with command name 'java', args 'Kafka /etc/kafka/server.properties' [08:47:35] * volans looking at kafka on wikitech [08:47:52] oh, 1018 [08:47:54] right [08:50:03] RECOVERY - Kafka MirrorMaker main-eqiad_to_analytics on kafka1013 is OK: PROCS OK: 1 process with command name java, regex args kafka.tools.MirrorMaker.+/etc/kafka/mirror/main-eqiad_to_analytics/producer\.properties [08:52:09] looks like 1018 too is recovering [08:52:54] how is that possible with a broken disk? [08:53:17] maybe it detects is broken and just use the others [08:53:26] kafka uses JBOD [08:54:14] reovering as in the process seems running now [08:54:18] huh [08:54:43] mmmh not anymore... maybe is trying to start and failing [08:54:45] more likely [08:54:53] PROBLEM - Kafka Broker Under Replicated Partitions on kafka1020 is CRITICAL: CRITICAL: 72.41% of data above the critical threshold [10.0] [08:55:05] ugh [08:55:26] oh. 
no, that might be an appropriate warning [08:55:40] hm [08:57:26] PROBLEM - Kafka Broker Under Replicated Partitions on kafka1022 is CRITICAL: CRITICAL: 78.57% of data above the critical threshold [10.0] [08:57:51] I'm looking at https://wikitech.wikimedia.org/wiki/Analytics/Cluster/Kafka/Administration [08:58:01] but commands fail due to required --zookeeper option [08:58:04] PROBLEM - Kafka Broker Under Replicated Partitions on kafka1014 is CRITICAL: CRITICAL: 79.31% of data above the critical threshold [10.0] [08:58:04] heh me too [08:58:05] do you know the urls [08:58:06] ? [08:58:19] but I did not try running any commands (not even on the box yet) [08:59:53] RECOVERY - Kafka MirrorMaker main-eqiad_to_analytics on kafka1012 is OK: PROCS OK: 1 process with command name java, regex args kafka.tools.MirrorMaker.+/etc/kafka/mirror/main-eqiad_to_analytics/producer\.properties [08:59:56] no I sure don't [09:00:54] ah ok it's just it doesn't work there [09:00:58] working on other machines [09:03:16] elukey: around? [09:03:18] you could maybe edit out the disk from log.dirs in /etc/kafka/server.properties [09:03:25] with puppet disabled for now on the host [09:03:29] it might come back that way [09:03:35] PROBLEM - Kafka Broker Under Replicated Partitions on kafka1012 is CRITICAL: CRITICAL: 75.86% of data above the critical threshold [10.0] [09:03:36] PROBLEM - Kafka Broker Under Replicated Partitions on kafka1013 is CRITICAL: CRITICAL: 75.86% of data above the critical threshold [10.0] [09:03:59] they are replicated so I guess the cluster is replicating stuff, I'd like to check the status first [09:04:07] checking [09:04:21] just got the page [09:04:29] heh [09:04:38] elukey: broken disk on kafka1018 [09:04:40] dev/sdi [09:04:54] at least looks like this to me [09:05:14] [11111407.165305] EXT4-fs warning (device sdi1): htree_dirblock_to_tree:959: inode #2: lblock 0: comm ls: error -5 reading directory block [09:05:14] I was trying to check the status of the cluster and reading the docs ;) [09:05:26] looks pretty broken to me [09:05:29] we can disable puppet and leave it down [09:05:38] syslog is full of errors on that disk [09:05:45] so I was thinking we could disable puppet, and remove the disk from log.dirs [09:05:48] and restart [09:05:56] at least let it do its work on the rest [09:05:57] but how to check "make sure that any topics for which the target broker is the leader also has In Sync Replicas" [09:06:00] ? [09:06:22] in the meanwhile I'll open the task for the broken disk [09:07:01] volans: kafka topics --describe [09:07:23] elukey: yes but says that is missing the zookeeper param because I guess is broken there [09:07:27] apergos: we can just leave the broker down, it won't be a big issue for the cluster [09:07:37] well that's even easier I guess [09:07:58] volans: it works for me [09:07:59] mmm [09:08:15] now working [09:08:18] before was not [09:08:29] worked for me as well [09:08:34] I am disabling puppet with a reason [09:08:41] kafka-topics --describe [09:08:42] Missing required argument "[zookeeper]" [09:08:45] this was before [09:08:56] you have a hypher [09:09:09] maybe without the hyphen it is different behavior? 
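[annotation, not part of the log: a sketch of what is likely happening here, for context. The stock kafka-topics.sh tool always has to be told where ZooKeeper is, while the site "kafka" wrapper appears to fill that in from the environment, which would explain the same command working in one shell and not another; compare the ZOOKEEPER_URL finding at 09:47 further down. Passing the connection string explicitly should sidestep the environment entirely, e.g.:

  kafka-topics --describe --zookeeper 'conf1001.eqiad.wmnet,conf1002.eqiad.wmnet,conf1003.eqiad.wmnet/kafka/eqiad'

The host list and the /kafka/eqiad chroot are copied from the 09:47 message, not independently verified, and the default ZooKeeper port (2181) may need to be appended to each host.]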
[09:09:25] maybe it picks up config settings from someplace [09:09:38] that's the error message, my command didn't had an hyphen ;) [09:09:43] oh :-D [09:09:49] then I have no idea :-D [09:10:19] !log puppet disabled on kafka1018, leave broker down, bad disk /dev/sdi (see dmesg for sample errors) [09:10:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [09:10:31] apergos: I already did it :) [09:10:33] oh [09:10:35] hahaha [09:10:39] well it's done twice then [09:10:44] ahhahahah [09:10:53] I hope you didn't log it too, oh well [09:10:58] nonono [09:11:01] :-D [09:11:03] I was about to do it [09:11:05] thanks :) [09:11:07] ah [09:11:24] so kafka-topics invokes a different thing [09:11:30] that requires zookeeper [09:11:43] meanwhile kafka topic --describe automagically add it [09:12:05] maybe I didn't explain myself [09:12:06] # kafka topics --describe [09:12:06] kafka-topics --describe [09:12:06] Missing required argument "[zookeeper]" [09:12:09] ah hah, I had done it before you, it shows your message in the disable :-P [09:12:11] that is what I run [09:12:16] and what it output... ;0 [09:12:19] ;) [09:12:34] uh [09:12:35] hm [09:12:39] let me see someting [09:13:02] yes! [09:13:04] su - [09:13:06] after you are root [09:13:09] then it works :-P [09:13:18] need the root environ [09:13:30] volans: did you read what I wrote? :D [09:13:46] oh [09:13:47] wait [09:13:51] trailing s? [09:14:17] no. not trailing s [09:14:24] it's literally the su - that makes it work [09:14:29] nono kafka topics vs kafka-topics [09:15:23] no [09:15:24] elukey [09:15:28] I type "kafka topics --describe" [09:15:35] if I have done sudo -s [09:15:42] and type that command I get the zookeeper fail [09:15:51] if I su - after and cpy-paste that same command [09:15:52] elukey: yes but I type kafka topics [09:15:53] I get success [09:15:56] *I typed [09:16:30] so sudo or su requested [09:16:38] you can run it from your username [09:16:52] maybe I was root at that time [09:17:10] anyhow [09:17:14] on a different topic, is normal that in JBOD mode megacli don't see any error on the disks? [09:17:20] all I can say is that after sudo -s there is that fail and it's the same exact command, so.... [09:17:27] kafka1018 is not a partition leader anymore and this is the good thing [09:17:33] <_joe_> jbod means disks are not managed [09:17:37] but it is listed as replica in some partitions [09:17:40] yes I know [09:17:52] but the I/O still pass through megaraid_sas [09:18:00] yep [09:18:02] <_joe_> and that means megacli doesn't really get any info, I guess [09:18:20] it even tried to reset it [09:18:21] sd 0:0:8:0: [sdi] tag#6 megasas: RESET cmd=2a retries=1 [09:18:32] <_joe_> (sorry, I was around but saw you guys were managing it already ) [09:18:36] I assume the resets failed [09:18:45] 06Operations, 10ops-eqiad: Broken disk on kafka1018 - https://phabricator.wikimedia.org/T147707#2701301 (10Volans) [09:19:01] for the logs, see ^^^ [09:19:06] yep [09:19:07] it failed [09:19:18] <_joe_> shuldn't kafka detect a failed node and recover by itself? [09:19:25] <_joe_> do I remember incorrectly? [09:19:28] the cluster should do that yes [09:19:31] _joe_ yes [09:19:53] we have just disabled puppet and stopped the kafka broker daemon to be sure [09:20:01] elukey: did you stop kafka? systemd will retry to start it? 
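[annotation, not part of the log: the worry in the last question is that a plain stop gets undone, either by a Restart= policy or by the next puppet run, which is why the broker ends up masked further down (09:27 and 09:39). A minimal sketch of that sequence, assuming the unit is simply called "kafka" as the later !log entry suggests:

  systemctl stop kafka     # stop the broker now
  systemctl mask kafka     # symlink the unit to /dev/null so neither systemd nor puppet can start it
  puppet agent --enable    # puppet is safe to re-enable once the unit is masked

Masking rather than merely disabling matters because puppet would otherwise ensure the service back to running on its next agent run.]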
[09:20:04] ok [09:20:34] the only thing that needs to be done is make sure that the replicas are ok [09:21:27] 1013, 1014, 1020, 1022 alarm is off for underreplicated [09:21:42] any easy way to check they are replicating over other nodes? [09:21:59] 1012 too [09:22:36] do we need to do kafka preferred-replica-election or no? [09:22:58] apergos: already done it just in case but 18 was already removed [09:23:03] ah ok [09:23:10] so kafka topics --describe will show two things [09:23:26] 1) Partition leaders [09:23:41] 2) Replicas and their In Sync Replicas (ISR) [09:23:46] ok, I am seeing that leader solumn [09:24:00] the isr column was a mystery to me (except I guessed it should always have three entries :-P) [09:24:08] :D [09:24:22] it basically tells you how many replicas are in sync with the leader [09:24:26] so when they have only 2 is underreplicated? [09:24:47] like Topic: webrequest_uploadPartition: 20Leader: 14Replicas: 14,13,18Isr: 13,14 [09:24:54] should be on 18 too but 18 is broken [09:24:58] is that right? [09:25:08] yes since ReplicationFactor = 3 [09:26:09] but do we have any easy check that gives us a % or progress of the fact that the other nodes are re-replicating the underreplicated stuff? [09:26:54] * volans can check the icinga alert, looks like it has one :) [09:27:57] <_joe_> can I suggest we mask the systemd unit for the broker and let puppet run, if that's the only reason for disabling it? [09:28:20] better yes :) [09:28:34] great idea [09:28:41] it's a graphite metric [09:28:52] ${group_prefix}kafka.${graphite_broker_key}.kafka.server.ReplicaManager.UnderReplicatedPartitions.Value [09:29:16] so I count 209 entries without three replicas (using good old mawk to grep pipeline) [09:29:50] of course there would be an easier way, heh (graphite) [09:30:52] IIRC there is a special command to remove the broker from the cluster and force the replicas to re-balance [09:31:13] but I have never used it and I don't think there is a huge risk in here to test it [09:31:51] right [09:32:04] volans: https://grafana.wikimedia.org/dashboard/db/kafka [09:32:17] where? I've already opened that [09:32:24] this works [09:32:32] kafka.cluster.$cluster.kafka.$kafka_brokers.kafka.server.ReplicaManager.UnderReplicatedPartitions.Value [09:32:47] editing one graph of them just to have the variable populated [09:33:12] there is a under replicated partitions graph [09:33:37] oh found now :D [09:33:41] same thing [09:33:56] they look stable, not going down [09:34:31] yeah I see them staying quite fixed [09:34:49] I wonder how long it takes for even one to catch up [09:35:04] yeah I think that to shuffle things around I'd need to remove 1018 from the cluster manually [09:35:13] but even two ISR out of three is not bad [09:35:33] 600gb to be copied around [09:35:34] do you prefer to check on tuesday if there is a spare disk in the datacenter already? [09:35:55] and leave it like this until the disk replacement? [09:36:03] but if one partition is one 200/th of that... 
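[annotation, not part of the log: one way to reproduce the "209 entries without three replicas" count from 09:29, assuming the describe output quoted at 09:24 (one line per partition, ending in an "Isr:" field, with ReplicationFactor 3):

  kafka topics --describe | awk -F'Isr: ' '/Isr:/ { if (split($2, isr, ",") < 3) n++ } END { print n+0 }'

For watching the catch-up over time, the graphite metric named at 09:28 (kafka.server.ReplicaManager.UnderReplicatedPartitions.Value) is the easier route, since it is already broken out per broker and plotted on the Kafka Grafana dashboard mentioned at 09:32.]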
[09:36:07] yes I think it would be the best choice [09:36:53] * volans have no idea on how much the re-shuffle loads the cluster [09:37:06] ok and of course it will need to shuffle back when 1018 will be back [09:37:31] well even if we lose another host over the weekend (hope not) we'd still be ok [09:37:48] so waiting and seeing is a good call [09:38:03] volans: yes 1018 will need a bit of time to catch up before beging a leader again [09:38:17] yeah [09:38:30] but without running kafka-preferred-replica-election it won't be put into the leader pool [09:38:43] so we can put it back online when the disk will be swapped [09:38:54] wait for its replicas to catch up [09:39:06] and then rebalance the partition leaders [09:39:15] ack [09:39:23] masking kafka and re-enabling puppet [09:39:57] elukey: a couple of things on icinga [09:40:19] 1) the ack for 1018 is on the host only, the failed procs is still not acked [09:40:36] 2) you probably want to ack the under replicate alarm too on the others [09:40:46] you can use T147707 for the message [09:40:47] T147707: Broken disk on kafka1018 - https://phabricator.wikimedia.org/T147707 [09:40:54] !log masked the kafka systemd unit on kafka1018 and re-enabling puppet [09:40:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [09:41:15] volans: yes I am going to do these things [09:41:26] and also add a summary of things done in the task [09:42:23] sure, or open a different one if you need a separate one [09:44:33] nono thanks a lot for creating one :) [09:44:52] de nada ;) [09:45:09] 06Operations, 10ops-eqiad, 06DC-Ops: Broken disk on kafka1018 - https://phabricator.wikimedia.org/T147707#2701344 (10elukey) [09:47:08] and in the meantime I foudn out the difference between sudo -s and being you or su - [09:47:16] it's the missing ZOOKEEPER_URL=conf1001.eqiad.wmnet,conf1002.eqiad.wmnet,conf1003.eqiad.wmnet/kafka/eqiad in the environ :-D [09:47:33] 06Operations, 10ops-eqiad, 06DC-Ops: Broken disk on kafka1018 - https://phabricator.wikimedia.org/T147707#2701345 (10elukey) The kafka1018's kafka systemd unit has been masked and the service is stopped. The cluster is fine at the moment but a lot of topic partitions are under-replicated since kafka1018 is d... [09:48:05] 06Operations, 10ops-eqiad, 06DC-Ops: Broken disk on kafka1018 - https://phabricator.wikimedia.org/T147707#2701346 (10elukey) p:05Triage>03Normal [09:48:30] 06Operations, 10ops-eqiad, 06DC-Ops: Broken disk on kafka1018 - https://phabricator.wikimedia.org/T147707#2701301 (10elukey) a:03Cmjohnson [09:49:57] thanks for bailing us out, elukey [09:50:36] apergos: thank to you and volans for the help! [09:51:56] ah snap tons of cron-spam again [09:51:57] sigh [09:52:14] oh noes [09:53:11] * volans bbiab [09:53:48] safe to wander off now I suppose [09:53:56] I'll peek in from time to time [09:58:29] o/ [09:58:30] me too [10:23:37] PROBLEM - puppet last run on lvs3004 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [10:50:21] RECOVERY - puppet last run on lvs3004 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [12:11:38] (03PS1) 10BBlack: Revert "upload storage: avoid cron restarts while rebooting" [puppet] - 10https://gerrit.wikimedia.org/r/314815 [12:11:50] (03CR) 10BBlack: [C: 032 V: 032] Revert "upload storage: avoid cron restarts while rebooting" [puppet] - 10https://gerrit.wikimedia.org/r/314815 (owner: 10BBlack) [12:13:05] 06Operations, 07Puppet, 07RfC: RFC: New puppet code organization paradigm/coding standards - https://phabricator.wikimedia.org/T147718#2701526 (10Joe) [12:13:59] 06Operations, 07Puppet, 07RfC: RFC: New puppet code organization paradigm/coding standards - https://phabricator.wikimedia.org/T147718#2701538 (10Joe) [12:31:42] PROBLEM - puppet last run on cp3007 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [12:58:10] RECOVERY - puppet last run on cp3007 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [13:49:22] (kafka metrics look stable) [14:13:30] PROBLEM - puppet last run on elastic1019 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [14:39:53] RECOVERY - puppet last run on elastic1019 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [14:57:48] oh good, I was just wandering by to have a look [15:00:04] hm yeah looks routine except for those partitions still missing full replication (which we expected) [15:16:53] PROBLEM - MediaWiki exceptions and fatals per minute on graphite1001 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [50.0] [15:22:03] PROBLEM - MediaWiki exceptions and fatals per minute on graphite1001 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [50.0] [15:30:03] PROBLEM - MediaWiki exceptions and fatals per minute on graphite1001 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [50.0] [15:38:03] PROBLEM - MediaWiki exceptions and fatals per minute on graphite1001 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [50.0] [15:38:52] (03CR) 10Andrew Bogott: Remove wikitech references from ldapconfig [puppet] - 10https://gerrit.wikimedia.org/r/309705 (owner: 10Alex Monk) [15:45:53] PROBLEM - MediaWiki exceptions and fatals per minute on graphite1001 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [50.0] [16:04:54] 06Operations, 07Puppet, 07RfC: RFC: New puppet code organization paradigm/coding standards - https://phabricator.wikimedia.org/T147718#2701698 (10Joe) [16:05:11] (03Draft1) 10Giuseppe Lavagetto: swift: refactor to role/profile pattern, part 1 [puppet] - 10https://gerrit.wikimedia.org/r/314829 (https://phabricator.wikimedia.org/T147718) [16:07:03] PROBLEM - MediaWiki exceptions and fatals per minute on graphite1001 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [50.0] [16:23:37] (03PS1) 10Giuseppe Lavagetto: Add common::swift private data to allow testing of 314829 [labs/private] - 10https://gerrit.wikimedia.org/r/314830 [16:24:06] (03CR) 10Giuseppe Lavagetto: [C: 032 V: 032] Add common::swift private data to allow testing of 314829 [labs/private] - 10https://gerrit.wikimedia.org/r/314830 (owner: 10Giuseppe Lavagetto) [16:34:08] morning (my time) petan [16:39:53] PROBLEM - puppet last run on ms-be1023 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [16:41:43] that sounds good... [16:41:52] I'm geting 503's on Phabricator when uploading images... [16:41:58] https://usercontent.irccloud-cdn.com/file/2dvjP549/IMG_4110.PNG [16:42:07] Well we just got an error [16:42:07] PROBLEM - puppet last run on ms-be1023 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [16:42:52] Josve05a have you been able to upload images via mobile before? [16:43:32] I uploaded some last night fine [16:43:42] Josve05a: Works for me [16:43:50] Zppix: yes [16:44:10] I'm at WIkiCon on the library's wifi.... [16:44:18] Hmm, im not the best at phab... im assuming its probably your connection and/or your end only... [16:46:08] I was pressing the upload button in the text field and then browsed the images, I didn't drag anything... [16:46:23] PROBLEM - MediaWiki exceptions and fatals per minute on graphite1001 is CRITICAL: CRITICAL: 60.00% of data above the critical threshold [50.0] [16:46:38] ^^ lovely [16:48:26] Josve05a, hmm i have no clue then, I'm not a wizard at phab atm. [16:59:32] PROBLEM - MediaWiki exceptions and fatals per minute on graphite1001 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [50.0] [17:06:22] RECOVERY - puppet last run on ms-be1023 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [17:12:44] PROBLEM - MediaWiki exceptions and fatals per minute on graphite1001 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [50.0] [17:18:02] RECOVERY - MediaWiki exceptions and fatals per minute on graphite1001 is OK: OK: Less than 1.00% above the threshold [25.0] [17:33:55] PROBLEM - MediaWiki exceptions and fatals per minute on graphite1001 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [50.0] [17:47:05] RECOVERY - MediaWiki exceptions and fatals per minute on graphite1001 is OK: OK: Less than 1.00% above the threshold [25.0] [17:47:43] ^^ is mostly redis timeouts and related, completely filling up the fatallog [17:47:49] such a pain in the ass [17:48:40] this is interesting though: started at ~16:30 UTC today: https://logstash.wikimedia.org/goto/cda75730013d93f2babe93f1403b8c30 [17:48:55] Warning: API call had warnings trying to login: warnings={"login":{"*":"Fetching a token via action=login is deprecated. Use action=query\u0026meta=tokens\u0026type=login instead."}}, query={"action":"login","lgname":"Zerowiki@banners","lgpassword":"***"} [Called from JsonConfig\JCUtils::warn in /srv/mediawiki/php-1.28.0-wmf.21/extensions/JsonConfig/includes/JCUtils.php at line 52] in /srv/med [17:49:01] iawiki/php-1.28.0-wmf.21/includes/debug/MWDebug.php on line 311 [17:49:02] and yuri just left [17:49:31] Hello. [17:49:34] greg-g: that's 3-4 weeks this error appears [18:10:25] PROBLEM - puppet last run on xenon is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [18:34:06] RECOVERY - puppet last run on xenon is OK: OK: Puppet is currently enabled, last run 36 seconds ago with 0 failures [18:35:37] (03CR) 10Odder: "Unfortunaly due to travel and other engagements I will not be present during any of the SWAT windows over the coming week, so if anyone wi" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/309912 (https://phabricator.wikimedia.org/T144689) (owner: 10Odder) [18:36:09] (03CR) 10Odder: "Unfortunaly due to travel and other engagements I will not be present during any of the SWAT windows over the coming week, so if anyone wi" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/309911 (https://phabricator.wikimedia.org/T145010) (owner: 10Odder) [18:53:06] PROBLEM - MediaWiki exceptions and fatals per minute on graphite1001 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [50.0] [18:55:45] RECOVERY - MediaWiki exceptions and fatals per minute on graphite1001 is OK: OK: Less than 1.00% above the threshold [25.0] [19:02:35] (03PS1) 10Paladox: Revert "gerrit: workaround a CSS bug with Microsoft Edge" [puppet] - 10https://gerrit.wikimedia.org/r/314835 [19:02:43] (03PS2) 10Paladox: Revert "gerrit: workaround a CSS bug with Microsoft Edge" [puppet] - 10https://gerrit.wikimedia.org/r/314835 [19:22:17] PROBLEM - Host cp2008 is DOWN: PING CRITICAL - Packet loss = 100% [19:22:33] rip [19:30:21] PROBLEM - Esams HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 10.00% of data above the critical threshold [1000.0] [19:35:50] PROBLEM - IPsec on cp4014 is CRITICAL: Strongswan CRITICAL - ok: 52 connecting: cp2008_v4, cp2008_v6 [19:35:50] PROBLEM - IPsec on cp4013 is CRITICAL: Strongswan CRITICAL - ok: 52 connecting: cp2008_v4, cp2008_v6 [19:35:51] PROBLEM - IPsec on cp3038 is CRITICAL: Strongswan CRITICAL - ok: 52 connecting: cp2008_v4, cp2008_v6 [19:35:51] PROBLEM - IPsec on cp3036 is CRITICAL: Strongswan CRITICAL - ok: 52 connecting: cp2008_v4, cp2008_v6 [19:35:51] PROBLEM - IPsec on cp4007 is CRITICAL: Strongswan CRITICAL - ok: 52 connecting: cp2008_v4, cp2008_v6 [19:35:51] PROBLEM - IPsec on cp3039 is CRITICAL: Strongswan CRITICAL - ok: 52 connecting: cp2008_v4, cp2008_v6 [19:36:09] PROBLEM - IPsec on kafka1013 is CRITICAL: Strongswan CRITICAL - ok: 146 connecting: cp2008_v4, cp2008_v6 [19:36:09] PROBLEM - IPsec on cp1063 is CRITICAL: Strongswan CRITICAL - ok: 54 not-conn: cp2008_v4, cp2008_v6 [19:36:09] PROBLEM - IPsec on cp1050 is CRITICAL: Strongswan CRITICAL - ok: 54 not-conn: cp2008_v4, cp2008_v6 [19:36:20] PROBLEM - IPsec on cp4006 is CRITICAL: Strongswan CRITICAL - ok: 52 connecting: cp2008_v4, cp2008_v6 [19:36:20] PROBLEM - IPsec on cp3034 is CRITICAL: Strongswan CRITICAL - ok: 52 connecting: cp2008_v4, cp2008_v6 [19:36:21] PROBLEM - IPsec on cp4005 is CRITICAL: Strongswan CRITICAL - ok: 52 connecting: cp2008_v4, cp2008_v6 [19:36:21] PROBLEM - IPsec on cp4015 is CRITICAL: Strongswan CRITICAL - ok: 52 connecting: cp2008_v4, cp2008_v6 [19:36:31] PROBLEM - IPsec on cp3049 is CRITICAL: Strongswan CRITICAL - ok: 52 connecting: cp2008_v4, cp2008_v6 [19:36:42] PROBLEM - IPsec on cp1071 is CRITICAL: Strongswan CRITICAL - ok: 54 not-conn: cp2008_v4, cp2008_v6 [19:36:44] PROBLEM - IPsec on cp3037 is CRITICAL: Strongswan CRITICAL - ok: 52 connecting: cp2008_v4, cp2008_v6 [19:36:44] PROBLEM - IPsec on kafka1020 is CRITICAL: Strongswan CRITICAL - ok: 146 connecting: cp2008_v4, cp2008_v6 [19:36:44] PROBLEM - IPsec on cp3047 is CRITICAL: Strongswan CRITICAL - ok: 52 connecting: 
cp2008_v4, cp2008_v6 [19:36:44] PROBLEM - IPsec on cp3044 is CRITICAL: Strongswan CRITICAL - ok: 52 connecting: cp2008_v4, cp2008_v6 [19:36:44] PROBLEM - IPsec on cp3035 is CRITICAL: Strongswan CRITICAL - ok: 52 connecting: cp2008_v4, cp2008_v6 [19:36:44] PROBLEM - IPsec on cp3048 is CRITICAL: Strongswan CRITICAL - ok: 52 connecting: cp2008_v4, cp2008_v6 [19:36:45] PROBLEM - IPsec on cp1062 is CRITICAL: Strongswan CRITICAL - ok: 54 not-conn: cp2008_v4, cp2008_v6 [19:36:50] .... [19:36:57] whelp i think the op team just died [19:36:59] PROBLEM - IPsec on kafka1012 is CRITICAL: Strongswan CRITICAL - ok: 146 connecting: cp2008_v4, cp2008_v6 [19:37:00] PROBLEM - IPsec on cp1072 is CRITICAL: Strongswan CRITICAL - ok: 54 not-conn: cp2008_v4, cp2008_v6 [19:37:00] PROBLEM - IPsec on kafka1022 is CRITICAL: Strongswan CRITICAL - ok: 146 connecting: cp2008_v4, cp2008_v6 [19:37:00] PROBLEM - IPsec on cp1049 is CRITICAL: Strongswan CRITICAL - ok: 54 not-conn: cp2008_v4, cp2008_v6 [19:37:00] PROBLEM - IPsec on kafka1014 is CRITICAL: Strongswan CRITICAL - ok: 146 connecting: cp2008_v4, cp2008_v6 [19:37:00] PROBLEM - IPsec on cp1099 is CRITICAL: Strongswan CRITICAL - ok: 54 not-conn: cp2008_v4, cp2008_v6 [19:37:10] PROBLEM - IPsec on cp3046 is CRITICAL: Strongswan CRITICAL - ok: 52 connecting: cp2008_v4, cp2008_v6 [19:37:10] PROBLEM - IPsec on cp3045 is CRITICAL: Strongswan CRITICAL - ok: 52 connecting: cp2008_v4, cp2008_v6 [19:37:10] PROBLEM - IPsec on cp1048 is CRITICAL: Strongswan CRITICAL - ok: 54 not-conn: cp2008_v4, cp2008_v6 [19:37:10] PROBLEM - IPsec on cp1074 is CRITICAL: Strongswan CRITICAL - ok: 54 not-conn: cp2008_v4, cp2008_v6 [19:37:42] PROBLEM - IPsec on cp1073 is CRITICAL: Strongswan CRITICAL - ok: 54 not-conn: cp2008_v4, cp2008_v6 [19:37:46] PROBLEM - IPsec on cp1064 is CRITICAL: Strongswan CRITICAL - ok: 54 not-conn: cp2008_v4, cp2008_v6 [19:47:11] ah the awesomeness of the mesh encryption failures. [19:47:37] bd808 sorry i couldnt resist the red buttons [19:48:28] !log cp2008 Strongswan failures for both ipv4 and ipv6 across a larg number (all?) hosts [19:48:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [20:22:04] RECOVERY - Redis status tcp_6479 on rdb2006 is OK: OK: REDIS on 10.192.48.44:6479 has 1 databases (db0) with 3039560 keys - replication_delay is 0 [21:10:39] PROBLEM - MediaWiki exceptions and fatals per minute on graphite1001 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [50.0] [21:11:19] PROBLEM - puppet last run on elastic1022 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [21:12:28] PROBLEM - Redis status tcp_6479 on rdb2006 is CRITICAL: CRITICAL: replication_delay is 738 600 - REDIS on 10.192.48.44:6479 has 1 databases (db0) with 3042033 keys - replication_delay is 738 [21:15:59] RECOVERY - MediaWiki exceptions and fatals per minute on graphite1001 is OK: OK: Less than 1.00% above the threshold [25.0] [21:25:32] PROBLEM - Redis status tcp_6479 on rdb2006 is CRITICAL: CRITICAL: replication_delay is 714 600 - REDIS on 10.192.48.44:6479 has 1 databases (db0) with 3039644 keys - replication_delay is 714 [21:30:49] RECOVERY - Redis status tcp_6479 on rdb2006 is OK: OK: REDIS on 10.192.48.44:6479 has 1 databases (db0) with 3039956 keys - replication_delay is 0 [21:37:49] RECOVERY - puppet last run on elastic1022 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [21:41:05] hm, what's that? 
https://de.wikipedia.org/wiki/Spezial:Beitr%C3%A4ge/Schlapfm?uselang=en why is there a block shown? [21:48:19] shutdown, https://phabricator.wikimedia.org/T147642 [21:48:37] yay, restricted [21:50:52] ^ same [22:04:32] PROBLEM - puppet last run on mw1259 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [22:30:46] RECOVERY - puppet last run on mw1259 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [22:35:22] PROBLEM - puppet last run on ms-be1005 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [22:51:48] (03PS8) 10Paladox: Add support for searching gerrit using bug:T1 [puppet] - 10https://gerrit.wikimedia.org/r/308753 (https://phabricator.wikimedia.org/T85002) [22:53:37] RECOVERY - OCG health on ocg1001 is OK: OK: ocg_job_status 799317 msg: ocg_render_job_queue 0 msg [22:54:08] RECOVERY - OCG health on ocg1002 is OK: OK: ocg_job_status 799416 msg: ocg_render_job_queue 0 msg [22:55:17] RECOVERY - OCG health on ocg1003 is OK: OK: ocg_job_status 799606 msg: ocg_render_job_queue 0 msg [22:58:59] RECOVERY - puppet last run on ms-be1005 is OK: OK: Puppet is currently enabled, last run 44 seconds ago with 0 failures [23:26:45] (03CR) 10Dereckson: "Sure, I'm scheduling it for Monday evening SWAT." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/309912 (https://phabricator.wikimedia.org/T144689) (owner: 10Odder) [23:31:26] RECOVERY - Esams HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [23:36:31] (03CR) 10Alex Monk: [C: 031] Remove spurious transcoding-labs.org usage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/314753 (owner: 10Reedy) [23:37:24] (03CR) 10Alex Monk: [C: 031] Remove wikimania2013wiki specific translate config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/314742 (owner: 10Reedy) [23:44:22] (03CR) 10Alex Monk: "relevant commit?" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/303383 (owner: 10Reedy) [23:44:58] (03CR) 10Reedy: "https://gerrit.wikimedia.org/r/#/c/303382/" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/303383 (owner: 10Reedy)