[00:00:00] aude: any idea why there are lots of SiteList memcached gets again?
[00:01:47] (CR) Nuria: "Dan: in order for the javascript caching strategy to work I will need to do some changes to the build of the dashboard. Will do those in a" [puppet/wikimetrics] - https://gerrit.wikimedia.org/r/158819 (owner: Nuria)
[00:02:33] (CR) Dzahn: [C: -1] "pretty sure you can't use check_interval in a command definition, instead it needs to go in a service or host definition. ("check_interva" [puppet] - https://gerrit.wikimedia.org/r/158773 (owner: MaxSem)
[00:12:42] (PS2) Dzahn: Check DSH groups once in 60 mins [puppet] - https://gerrit.wikimedia.org/r/158773 (owner: MaxSem)
[00:16:34] (CR) Dzahn: "PS2: should be normal_check_interval on monitor_service instead. see f.e. also on the service definition for "wikidata.org dispatch lag"" [puppet] - https://gerrit.wikimedia.org/r/158773 (owner: MaxSem)
[00:16:55] it can take up to... uh, 30 minutes? an hour?
[00:17:09] er, that was for kaldari, but he's not here anyways
[00:17:14] (CR) Dzahn: [C: 2] Check DSH groups once in 60 mins [puppet] - https://gerrit.wikimedia.org/r/158773 (owner: MaxSem)
[00:22:10] (CR) Dzahn: [C: 2] "this is a test that will be reverted" [puppet] - https://gerrit.wikimedia.org/r/158813 (owner: Dzahn)
[00:23:24] (PS1) Dzahn: Revert "remove fenari from mw-installation dsh group" [puppet] - https://gerrit.wikimedia.org/r/158835
[00:26:42] (CR) Dzahn: "the files ircecho (icinga-wm) wants to read are currently /var/log/icinga/irc.log and /var/log/icinga/irc-wikidata.log (and icinga needs t" [puppet] - https://gerrit.wikimedia.org/r/158633 (owner: JanZerebecki)
[00:27:57] PROBLEM - mediawiki-installation DSH group on fenari is CRITICAL: Host fenari is not in mediawiki-installation dsh group
[00:29:03] (CR) Dzahn: "test passed:" [puppet] - https://gerrit.wikimedia.org/r/158813 (owner: Dzahn)
[00:29:30] (CR) Dzahn: [C: 2] Revert "remove fenari from mw-installation dsh group" [puppet] - https://gerrit.wikimedia.org/r/158835 (owner: Dzahn)
[00:33:53] RECOVERY - mediawiki-installation DSH group on fenari is OK: OK
[00:38:22] PROBLEM - Puppet freshness on mw1053 is CRITICAL: Last successful Puppet run was Thu 04 Sep 2014 00:21:29 UTC
[00:40:38] (CR) Physikerwelt: "Yes. It does not trust cached content. That's what CSteipp recommended. But the load is balanced now Iae9693d1948bec6dd08473bce3cb704f2433" [mediawiki-config] - https://gerrit.wikimedia.org/r/158559 (https://bugzilla.wikimedia.org/49169) (owner: Physikerwelt)
[00:56:48] (CR) Dzahn: Puppetize icinga log file permission fix. (2 comments) [puppet] - https://gerrit.wikimedia.org/r/158633 (owner: JanZerebecki)
[00:57:42] (CR) Dzahn: "17:35 <+icinga-wm> RECOVERY - mediawiki-installation DSH group on fenari is OK: OK" [puppet] - https://gerrit.wikimedia.org/r/158835 (owner: Dzahn)
[01:03:14] (CR) Dzahn: "the reason these contacts are private is actually just that they have phone numbers in some cases (not sure how much we'd care about publi" [puppet] - https://gerrit.wikimedia.org/r/158355 (owner: JanZerebecki)
[01:08:13] (PS1) Dzahn: add ca.wikimedia.org ServerAlias [apache-config] - https://gerrit.wikimedia.org/r/158843
[01:09:13] (CR) Dzahn: "wait, i'm in operations/apache-config when i locally edited in ./puppet/modules/mediawiki/files/apache/config/ ?" [apache-config] - https://gerrit.wikimedia.org/r/158843 (owner: Dzahn)
[01:10:12] (CR) Dzahn: "i guess duplicate of Change-Id: I1da08288761a then? i thought it was in the wrong old repo, now i'm not sure" [apache-config] - https://gerrit.wikimedia.org/r/158843 (owner: Dzahn)
[01:13:35] (CR) Dzahn: "or also see https://gerrit.wikimedia.org/r/#/c/158843/ and it's the same thing because ?" [apache-config] - https://gerrit.wikimedia.org/r/158808 (owner: Reedy)
[01:14:11] (CR) Dzahn: [C: 1] "i claim Reedy's -1 is invalid :)" [mediawiki-config] - https://gerrit.wikimedia.org/r/158284 (owner: Dzahn)
[01:20:12] (CR) Dzahn: Add ca.wikimedia.org to wikimedia-chapter apache site [apache-config] - https://gerrit.wikimedia.org/r/158808 (owner: Reedy)
[01:27:19] (CR) John F. Lewis: [C: 1] "Looks good on a mailman perspective." [puppet] - https://gerrit.wikimedia.org/r/145500 (https://bugzilla.wikimedia.org/38516) (owner: Dzahn)
[01:32:45] (PS11) Dzahn: public_html directory service, see RT #6862 [puppet] - https://gerrit.wikimedia.org/r/149890 (owner: ArielGlenn)
[01:34:08] (CR) Dzahn: [C: 1] "PS11 - removes just the part that touches the existing config on noc. would like to be able to merge this new puppet code, then have peopl" [puppet] - https://gerrit.wikimedia.org/r/149890 (owner: ArielGlenn)
[01:35:54] (PS12) Dzahn: public_html directory service, see RT #6862 [puppet] - https://gerrit.wikimedia.org/r/149890 (owner: ArielGlenn)
[01:36:24] (CR) Dzahn: "PS12 - applies the new role on terbium, before nothing would have happened" [puppet] - https://gerrit.wikimedia.org/r/149890 (owner: ArielGlenn)
[01:37:39] (CR) Dzahn: [C: 1] public_html directory service, see RT #6862 [puppet] - https://gerrit.wikimedia.org/r/149890 (owner: ArielGlenn)
[01:37:49] /away cya
[02:10:59] PROBLEM - Disk space on virt0 is CRITICAL: DISK CRITICAL - free space: /a 3612 MB (3% inode=99%):
[02:25:38] !log LocalisationUpdate completed (1.24wmf15) at 2014-09-06 02:24:35+00:00
[02:25:46] Logged the message, Master
[02:38:44] !log LocalisationUpdate completed (1.24wmf19) at 2014-09-06 02:37:41+00:00
[02:38:49] PROBLEM - Puppet freshness on mw1053 is CRITICAL: Last successful Puppet run was Thu 04 Sep 2014 00:21:29 UTC
[02:38:50] Logged the message, Master
[02:51:38] !log LocalisationUpdate completed (1.24wmf20) at 2014-09-06 02:50:35+00:00
[02:51:44] Logged the message, Master
[03:00:59] RECOVERY - Disk space on virt0 is OK: DISK OK
[03:27:57] (CR) Chmarkine: webserver - use ssl_ciphersuite in generic_vhost (1 comment) [puppet] - https://gerrit.wikimedia.org/r/153971 (owner: Dzahn)
[03:43:28] !log LocalisationUpdate ResourceLoader cache refresh completed at Sat Sep 6 03:42:22 UTC 2014 (duration 42m 21s)
[03:43:34] Logged the message, Master
[03:57:10] PROBLEM - HTTP 5xx req/min on tungsten is CRITICAL: CRITICAL: 13.33% of data above the critical threshold [500.0]
[03:57:49] PROBLEM - puppet last run on cp4016 is CRITICAL: CRITICAL: Epic puppet fail
[03:58:00] PROBLEM - HTTP 5xx req/min on labmon1001 is CRITICAL: CRITICAL: 13.33% of data above the critical threshold [500.0]
[04:11:10] RECOVERY - HTTP 5xx req/min on tungsten is OK: OK: Less than 1.00% above the threshold [250.0]
[04:12:00] RECOVERY - HTTP 5xx req/min on labmon1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[04:17:49] RECOVERY - puppet last run on cp4016 is OK: OK: Puppet is currently enabled, last run 36 seconds ago with 0 failures
[04:39:49] PROBLEM - Puppet freshness on mw1053 is CRITICAL: Last successful Puppet run was Thu 04 Sep 2014 00:21:29 UTC
[06:27:49] PROBLEM - puppet last run on mw1039 is CRITICAL: CRITICAL: Epic puppet fail
[06:28:00] PROBLEM - puppet last run on mw1172 is CRITICAL: CRITICAL: Epic puppet fail
[06:28:30] PROBLEM - puppet last run on mw1123 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:28:30] PROBLEM - puppet last run on db1015 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:29:00] PROBLEM - puppet last run on mw1166 is CRITICAL: CRITICAL: Puppet has 2 failures
[06:29:10] PROBLEM - puppet last run on mw1118 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:29:19] PROBLEM - puppet last run on mw1170 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:29:49] PROBLEM - puppet last run on cp4003 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:40:49] PROBLEM - Puppet freshness on mw1053 is CRITICAL: Last successful Puppet run was Thu 04 Sep 2014 00:21:29 UTC
[06:45:59] RECOVERY - puppet last run on mw1166 is OK: OK: Puppet is currently enabled, last run 49 seconds ago with 0 failures
[06:46:10] RECOVERY - puppet last run on mw1118 is OK: OK: Puppet is currently enabled, last run 8 seconds ago with 0 failures
[06:46:19] RECOVERY - puppet last run on mw1170 is OK: OK: Puppet is currently enabled, last run 46 seconds ago with 0 failures
[06:46:30] RECOVERY - puppet last run on mw1123 is OK: OK: Puppet is currently enabled, last run 18 seconds ago with 0 failures
[06:46:30] RECOVERY - puppet last run on db1015 is OK: OK: Puppet is currently enabled, last run 49 seconds ago with 0 failures
[06:46:49] RECOVERY - puppet last run on cp4003 is OK: OK: Puppet is currently enabled, last run 48 seconds ago with 0 failures
[06:46:49] RECOVERY - puppet last run on mw1039 is OK: OK: Puppet is currently enabled, last run 21 seconds ago with 0 failures
[06:47:00] RECOVERY - puppet last run on mw1172 is OK: OK: Puppet is currently enabled, last run 17 seconds ago with 0 failures
[06:57:49] PROBLEM - puppet last run on db1035 is CRITICAL: CRITICAL: Puppet has 1 failures
[07:15:49] RECOVERY - puppet last run on db1035 is OK: OK: Puppet is currently enabled, last run 6 seconds ago with 0 failures
[07:17:19] PROBLEM - HTTP error ratio anomaly detection on tungsten is CRITICAL: CRITICAL: Anomaly detected: 10 data above and 2 below the confidence bounds
[07:17:20] PROBLEM - HTTP error ratio anomaly detection on labmon1001 is CRITICAL: CRITICAL: Anomaly detected: 10 data above and 2 below the confidence bounds
[08:41:49] PROBLEM - Puppet freshness on mw1053 is CRITICAL: Last successful Puppet run was Thu 04 Sep 2014 00:21:29 UTC
[08:43:19] PROBLEM - HTTP error ratio anomaly detection on tungsten is CRITICAL: CRITICAL: Anomaly detected: 14 data above and 9 below the confidence bounds
[08:43:19] PROBLEM - HTTP error ratio anomaly detection on labmon1001 is CRITICAL: CRITICAL: Anomaly detected: 14 data above and 9 below the confidence bounds
[09:19:19] RECOVERY - HTTP error ratio anomaly detection on tungsten is OK: OK: No anomaly detected
[09:19:20] RECOVERY - HTTP error ratio anomaly detection on labmon1001 is OK: OK: No anomaly detected
[09:37:40] PROBLEM - puppet last run on cp4002 is CRITICAL: CRITICAL: Puppet has 1 failures
[09:54:40] RECOVERY - puppet last run on cp4002 is OK: OK: Puppet is currently enabled, last run 2 seconds ago with 0 failures
[09:56:55] greg-g: btw, note that the icinga check for the disk space is also out of whack. I found out that our check_graphite doesn't really support wildcard metrics properly, so I'll have to add that. I'm on vacation till Wednesday, though, so will get to it then.
[10:03:10] PROBLEM - HTTP 5xx req/min on tungsten is CRITICAL: CRITICAL: 6.67% of data above the critical threshold [500.0]
[10:04:00] PROBLEM - HTTP 5xx req/min on labmon1001 is CRITICAL: CRITICAL: 6.67% of data above the critical threshold [500.0]
[10:17:10] RECOVERY - HTTP 5xx req/min on tungsten is OK: OK: Less than 1.00% above the threshold [250.0]
[10:18:00] RECOVERY - HTTP 5xx req/min on labmon1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[10:42:49] PROBLEM - Puppet freshness on mw1053 is CRITICAL: Last successful Puppet run was Thu 04 Sep 2014 00:21:29 UTC
[12:42:40] PROBLEM - puppet last run on amssq43 is CRITICAL: CRITICAL: Epic puppet fail
[12:43:49] PROBLEM - Puppet freshness on mw1053 is CRITICAL: Last successful Puppet run was Thu 04 Sep 2014 00:21:29 UTC
[13:02:40] RECOVERY - puppet last run on amssq43 is OK: OK: Puppet is currently enabled, last run 54 seconds ago with 0 failures
[13:30:18] (PS2) Rush: add python-phabricator package to phab module [puppet] - https://gerrit.wikimedia.org/r/158448
[13:30:26] (CR) Rush: [C: 2 V: 2] add python-phabricator package to phab module [puppet] - https://gerrit.wikimedia.org/r/158448 (owner: Rush)
[14:20:37] (PS1) Rush: update security macros for phab [puppet] - https://gerrit.wikimedia.org/r/158871
[14:30:08] (PS1) Rush: manage phabricator extensions [puppet] - https://gerrit.wikimedia.org/r/158874
[14:30:29] (CR) Rush: [C: 2] update security macros for phab [puppet] - https://gerrit.wikimedia.org/r/158871 (owner: Rush)
[14:31:32] (CR) Rush: [C: 2] manage phabricator extensions [puppet] - https://gerrit.wikimedia.org/r/158874 (owner: Rush)
[14:44:49] PROBLEM - Puppet freshness on mw1053 is CRITICAL: Last successful Puppet run was Thu 04 Sep 2014 00:21:29 UTC
[16:45:49] PROBLEM - Puppet freshness on mw1053 is CRITICAL: Last successful Puppet run was Thu 04 Sep 2014 00:21:29 UTC
[17:28:21] (PS25) 01tonythomas: Added the bouncehandler router to catch in all bounce emails [puppet] - https://gerrit.wikimedia.org/r/155753
[17:29:08] (CR) jenkins-bot: [V: -1] Added the bouncehandler router to catch in all bounce emails [puppet] - https://gerrit.wikimedia.org/r/155753 (owner: 01tonythomas)
[17:37:05] (PS26) 01tonythomas: Added the bouncehandler router to catch in all bounce emails [puppet] - https://gerrit.wikimedia.org/r/155753
[17:37:45] (CR) jenkins-bot: [V: -1] Added the bouncehandler router to catch in all bounce emails [puppet] - https://gerrit.wikimedia.org/r/155753 (owner: 01tonythomas)
[17:44:00] can someone look into my puppet-exim patch https://gerrit.wikimedia.org/r/#/c/155753 and tell why the realm switch fails ?
[17:55:08] (PS27) 01tonythomas: Added the bouncehandler router to catch in all bounce emails [puppet] - https://gerrit.wikimedia.org/r/155753
[17:55:52] (CR) jenkins-bot: [V: -1] Added the bouncehandler router to catch in all bounce emails [puppet] - https://gerrit.wikimedia.org/r/155753 (owner: 01tonythomas)
[18:00:09] (PS28) 01tonythomas: Added the bouncehandler router to catch in all bounce emails [puppet] - https://gerrit.wikimedia.org/r/155753
[18:01:28] * tonythomas bites jenkins. phew ! at last
[18:15:59] YuviPanda|zzzz: no worries man, thanks for your work. If you can, update the bug with some of this info next time (just for easy tracking if someone else wants to help). Enjoy vacation and don't worry about doing anything until after!
[18:21:19] (PS29) 01tonythomas: Added the bouncehandler router to catch in all bounce emails [puppet] - https://gerrit.wikimedia.org/r/155753
[18:29:59] PROBLEM - puppet last run on amssq56 is CRITICAL: CRITICAL: Epic puppet fail
[18:46:49] PROBLEM - Puppet freshness on mw1053 is CRITICAL: Last successful Puppet run was Thu 04 Sep 2014 00:21:29 UTC
[18:48:59] RECOVERY - puppet last run on amssq56 is OK: OK: Puppet is currently enabled, last run 1 seconds ago with 0 failures
[19:43:24] (CR) JanZerebecki: [C: -1] Puppetize icinga log file permission fix. (3 comments) [puppet] - https://gerrit.wikimedia.org/r/158633 (owner: JanZerebecki)
[19:56:02] (CR) JanZerebecki: [C: -1] "See Chmarkine their comment." [puppet] - https://gerrit.wikimedia.org/r/153971 (owner: Dzahn)
[20:10:41] (CR) JanZerebecki: "(We shouldn't publish the email addresses either.) As long as we wouldn't need to touch labs/private.git for every new contact only insert" [puppet] - https://gerrit.wikimedia.org/r/158355 (owner: JanZerebecki)
[20:39:19] PROBLEM - Kafka Broker Messages In on analytics1021 is CRITICAL: kafka.server.BrokerTopicMetrics.AllTopicsMessagesInPerSec.FifteenMinuteRate CRITICAL: 823.441119382
[20:47:49] PROBLEM - Puppet freshness on mw1053 is CRITICAL: Last successful Puppet run was Thu 04 Sep 2014 00:21:29 UTC
[22:47:28] Don't suppose an op with sodium access is willing to do a bit of Saturday work with a quick query? :)
[22:48:49] PROBLEM - Puppet freshness on mw1053 is CRITICAL: Last successful Puppet run was Thu 04 Sep 2014 00:21:29 UTC