[00:00:04] RoanKattouw, ^d: Respected human, time to deploy Evening SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20150127T0000). Please do the needful.
[00:00:12] hardware for logstash?
[00:00:29] mark___: yeah
[00:00:45] mark________
[00:01:12] that dropping my mood? ;)
[00:01:40] mark___: at least https://phabricator.wikimedia.org/T87460 which is machines from spare pool
[00:01:56] mark___: if you hurry, you can /ns register mark_______ :p
[00:02:45] YuviPanda: https://phabricator.wikimedia.org/T87460 says "This request is approved by Mark"
[00:03:28] but mark___ and jgage are better sources of authority
[00:04:36] urghhh
[00:05:19] 3hardware-requests, operations: Allocate temporary Elasticsearch nodes from spares pool for Logstash - https://phabricator.wikimedia.org/T87460#995495 (10RobH)
[00:07:28] 3hardware-requests, operations: Allocate temporary Elasticsearch nodes from spares pool for Logstash - https://phabricator.wikimedia.org/T87460#995506 (10mark) So we need to do this AND buy new hardware for it and move things around? Any way we can avoid that?
[00:11:05] 3hardware-requests, operations: Allocate temporary Elasticsearch nodes from spares pool for Logstash - https://phabricator.wikimedia.org/T87460#995507 (10bd808) >>! In T87460#995506, @mark wrote: > So we need to do this AND buy new hardware for it and move things around? Any way we can avoid that? If this hardw...
[00:12:24] 3hardware-requests, operations: Allocate temporary Elasticsearch nodes from spares pool for Logstash - https://phabricator.wikimedia.org/T87460#995508 (10bd808) >>! In T87460#995507, @bd808 wrote: > If this hardware is within warranty that makes Ops happy I think it would be all we need for the near/mid term. T...
[00:12:38] PROBLEM - puppet last run on lvs2006 is CRITICAL: CRITICAL: Puppet has 1 failures
[00:13:13] 3Wikimedia-Logstash, operations, hardware-requests: Allocate temporary Elasticsearch nodes from spares pool for Logstash - https://phabricator.wikimedia.org/T87460#995510 (10bd808)
[00:14:41] 3Wikimedia-Logstash, operations, ops-core: Production hardware for Logstash service - https://phabricator.wikimedia.org/T84958#934590 (10bd808)
[00:14:42] 3Wikimedia-Logstash, operations, ops-core: Upgrade RAM for logstash100[123] to 64G - https://phabricator.wikimedia.org/T87078#995511 (10bd808) 5Open>3declined a:3bd808 See {T87460} and/or {T84958} for a better plan
[00:15:40] 3Wikimedia-Logstash, operations, hardware-requests: Allocate temporary Elasticsearch nodes from spares pool for Logstash - https://phabricator.wikimedia.org/T87460#995519 (10bd808)
[00:15:43] 3Wikimedia-Logstash, operations, ops-core: Production hardware for Logstash service - https://phabricator.wikimedia.org/T84958#934590 (10bd808)
[00:15:49] (03CR) 10Tim Landscheidt: "I find the name "$has_exim_sender" a bit confusing (has the instance its own exim sender or should it include the standard one?). How abo" [puppet] - 10https://gerrit.wikimedia.org/r/186891 (https://phabricator.wikimedia.org/T86575) (owner: 10Yuvipanda)
[00:15:59] (03PS1) 1001tonythomas: Deploy BounceHandler in beta wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/186907 (https://phabricator.wikimedia.org/T87624)
[00:18:48] ottomata: ping?
[00:19:38] PROBLEM - Apache HTTP on mw1018 is CRITICAL: Connection refused
[00:19:43] 3Analytics, operations: Hadoop logs on logstash are being really spammy - https://phabricator.wikimedia.org/T87206#995529 (10Gage) Where does this 3x number come from? Logstash reports that over the last 7 days Mediawiki has logged 3x as many events as Hadoop: mediawiki (241677764) MWException (86008559) Hadoo...
[00:20:17] PROBLEM - HHVM rendering on mw1018 is CRITICAL: Connection refused
[00:20:18] PROBLEM - HHVM processes on mw1018 is CRITICAL: PROCS CRITICAL: 0 processes with command name hhvm
[00:23:54] paravoid pOOOONng
[00:24:17] PROBLEM - RAID on stat1002 is CRITICAL: Connection refused by host
[00:24:18] PROBLEM - configured eth on stat1002 is CRITICAL: Connection refused by host
[00:24:25] wha!?
[00:24:27] hm
[00:24:28] PROBLEM - dhclient process on stat1002 is CRITICAL: Connection refused by host
[00:24:28] PROBLEM - puppet last run on stat1002 is CRITICAL: Connection refused by host
[00:24:34] ottomata: that's me, kinda
[00:24:36] oh ok
[00:24:39] wassup?
[00:24:42] your /mnt/hdfs fuse mount breaks nrpe
[00:24:47] PROBLEM - DPKG on stat1002 is CRITICAL: Connection refused by host
[00:24:49] hmm
[00:24:51] oh yeah?
[00:24:57] PROBLEM - salt-minion processes on stat1002 is CRITICAL: Connection refused by host
[00:25:05] oh is it not able to do whatever disk checking on that mount?
[00:25:35] (03Abandoned) 10Jgreen: update cmcmahon's key, and add him to contint-admins [puppet] - 10https://gerrit.wikimedia.org/r/185567 (owner: 10Jgreen)
[00:25:57] not just that
[00:26:05] the check just hangs on a D state
[00:26:07] RECOVERY - HHVM rendering on mw1018 is OK: HTTP OK: HTTP/1.1 200 OK - 64798 bytes in 0.320 second response time
[00:26:08] RECOVERY - HHVM processes on mw1018 is OK: PROCS OK: 1 process with command name hhvm
[00:26:19] D state?
[00:26:41] hrm why is stat1002's dmesg full of diamond exiting 1
[00:27:02] jgage, paravoid is telling me that now? something to do with that pesky/useful /mnt/hdfs
[00:27:17] same thing porbably
[00:27:18] hmm dstate...
[00:27:43] well something weird with that fuse mount
[00:27:46] does that even work?
[00:27:54] I tried sudoing as "nagios" and my shell hanged as well
[00:28:00] when I tried ls /mnt/hdfs
[00:28:14] it is ok as my user...
[00:28:22] huh!
[00:28:30] it's not as my user
[00:28:31] sudo -u nagios ls /mnt/hdfs is hanging for me too
[00:28:33] interesting.
[00:28:34] it works as root
[00:28:38] but not as "faidon"
[00:28:49] bleh fuse
[00:29:03] well now it's hanging as root as well
[00:29:08] RECOVERY - puppet last run on lvs2006 is OK: OK: Puppet is currently enabled, last run 36 seconds ago with 0 failures
[00:29:10] jgage: yeah my feelings exactly
[00:29:18] indeed.
[00:29:37] PROBLEM - HHVM rendering on mw1018 is CRITICAL: Connection refused
[00:29:53] /mnt/hdfs is finicky sometimes, but mostly works, if it works
[00:29:57] 3Wikimedia-Site-requests, operations: Anonymous users can't pick language on WMF wikis ($wgULSAnonCanChangeLanguage is set to false) - https://phabricator.wikimedia.org/T58464#995540 (10Nemo_bis)
[00:30:02] by that i mean, whne it works it works!
[00:30:02] :)
[00:30:08] RECOVERY - configured eth on stat1002 is OK: NRPE: Unable to read output
[00:30:17] RECOVERY - dhclient process on stat1002 is OK: PROCS OK: 0 processes with command name dhclient
[00:30:18] RECOVERY - puppet last run on stat1002 is OK: OK: Puppet is currently enabled, last run 5 seconds ago with 0 failures
[00:30:24] ottomata you were able to recover without reboot last time by force unmounting right?
[00:30:25] huh now it is hanging for me
[00:30:27] yes
[00:30:37] RECOVERY - DPKG on stat1002 is OK: All packages OK
[00:30:43] (03CR) 10Hoo man: "Where (in which change) did it get removed?" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/186907 (https://phabricator.wikimedia.org/T87624) (owner: 1001tonythomas)
[00:30:45] but that was because we made some change...or restarted namenode? don't remember
[00:30:47] RECOVERY - salt-minion processes on stat1002 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion
[00:30:55] "last time"?
[00:31:08] yeah, hdfs fuse has freaked once before
[00:31:15] at least once
[00:31:17] RECOVERY - RAID on stat1002 is OK: OK: optimal, 1 logical, 12 physical
[00:31:18] RECOVERY - Apache HTTP on mw1018 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 3.278 second response time
[00:32:45] paravoid, did something break on its own, or was it when you did something on stat1002?
[00:33:07] RECOVERY - HHVM rendering on mw1018 is OK: HTTP OK: HTTP/1.1 200 OK - 64994 bytes in 0.158 second response time
[00:33:10] (03CR) 10Reedy: [C: 04-1] "Adding it to extension-list-labs will have no effect" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/186907 (https://phabricator.wikimedia.org/T87624) (owner: 1001tonythomas)
[00:35:20] (03CR) 1001tonythomas: "https://github.com/wikimedia/operations-mediawiki-config/commit/ba288f70064dee6e4cf40fc46e77d66d245be772 removes it from labs and installs" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/186907 (https://phabricator.wikimedia.org/T87624) (owner: 1001tonythomas)
[00:36:41] 3operations: Decommission svn.wikimedia.org server (import SVN into Phabricator) - https://phabricator.wikimedia.org/T86655#995547 (10MZMcBride)
[00:37:57] (03PS1) 10Reedy: Tidyup extension-list [mediawiki-config] - 10https://gerrit.wikimedia.org/r/186910
[00:38:09] Why were those skins only in -labs? :/
[00:38:12] ottomata: it was breaking on its own
[00:38:43] i wonder what will happen if we kill the fuse process - 2411
[00:38:56] (03CR) 1001tonythomas: "@Reedy: In that case, where should I be adding that one ?" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/186907 (https://phabricator.wikimedia.org/T87624) (owner: 1001tonythomas)
[00:39:54] (03CR) 10Hoo man: "Labs' extensions list is the production extension list + the labs one, so you don't need to list stuff twice." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/186907 (https://phabricator.wikimedia.org/T87624) (owner: 1001tonythomas)
[00:40:58] going to try to force unmount and see what's up
[00:41:13] (03CR) 10Reedy: "extension-list is only used for localisation cache stuff" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/186907 (https://phabricator.wikimedia.org/T87624) (owner: 1001tonythomas)
[00:41:17] (03PS1) 10Jgreen: add cmcmahon to contint-admins and update pub key [puppet] - 10https://gerrit.wikimedia.org/r/186911
[00:42:17] jgage, where do you see that process?
[00:43:31] ah, it went away when i umounted
[00:43:53] hm, weird, it is fine on an27
[00:45:42] (03PS2) 10Jgreen: add cmcmahon to contint-admins and update pub key, typo fixed [puppet] - 10https://gerrit.wikimedia.org/r/186911
[00:45:52] (03PS2) 1001tonythomas: Deploy BounceHandler in beta wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/186907 (https://phabricator.wikimedia.org/T87624)
[00:46:20] ah, i couldn't umount because of i think faidon's hanging ls
[00:46:22] killed that.
[00:46:39] k, is back and fine now.
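For reference, a rough shell sketch of the recovery sequence ottomata and paravoid describe above: find what is wedged on the dead fuse mount, kill the hung readers, lazy-unmount, and remount. The mount point /mnt/hdfs comes from the log; the fstab-based remount at the end is an assumption, not the exact command used on stat1002.

```
# Processes stuck in uninterruptible sleep ("D state"), e.g. the hung NRPE checks and ls:
ps -eo pid,stat,user,wchan:30,cmd | awk '$2 ~ /^D/'

# Which processes still hold the mount point open:
sudo fuser -vm /mnt/hdfs

# Kill the hung reader(s), then lazily unmount the dead fuse filesystem:
sudo kill -9 <pid-of-hung-ls>
sudo umount -l /mnt/hdfs

# Remount (assumption: /mnt/hdfs has an fstab entry; otherwise use the site's usual hadoop-fuse-dfs invocation):
sudo mount /mnt/hdfs
```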
[00:46:42] 3operations: Requesting access to gallium for cmcmahon - https://phabricator.wikimedia.org/T86685#995558 (10Jgreen) I ended up redoing the gerrit commit because the original one accidentally depended on other cruft in my git checkout. Merging instead https://gerrit.wikimedia.org/r/#/c/186911 [00:46:44] not sure what made it angry [00:46:47] RECOVERY - Disk space on stat1002 is OK: DISK OK [00:47:18] (03PS3) 1001tonythomas: Deploy BounceHandler in beta wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/186907 (https://phabricator.wikimedia.org/T87624) [00:48:03] (03CR) 10Jgreen: [C: 032 V: 031] add cmcmahon to contint-admins and update pub key, typo fixed [puppet] - 10https://gerrit.wikimedia.org/r/186911 (owner: 10Jgreen) [00:49:41] (03CR) 10Reedy: [C: 032] Deploy BounceHandler in beta wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/186907 (https://phabricator.wikimedia.org/T87624) (owner: 1001tonythomas) [00:49:45] (03Merged) 10jenkins-bot: Deploy BounceHandler in beta wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/186907 (https://phabricator.wikimedia.org/T87624) (owner: 1001tonythomas) [00:50:04] !log reedy Synchronized wmf-config/: Noop for bouncehandler for beta (duration: 00m 06s) [00:50:13] Logged the message, Master [00:51:20] 3Ops-Access-Requests, Continuous-Integration: Make sure relevant RelEng people have access to gallium (Chris M, Dan, Mukunda, Zeljko) - https://phabricator.wikimedia.org/T85936#995585 (10Jgreen) [00:51:23] 3operations: Requesting access to gallium for cmcmahon - https://phabricator.wikimedia.org/T86685#995584 (10Jgreen) 5Open>3Resolved [00:55:24] (03PS1) 10Jforrester: Beta Labs: Actually invoke the Citoid extension [mediawiki-config] - 10https://gerrit.wikimedia.org/r/186912 [00:56:25] thanks ottomata, sorry i had to afk for phone call [00:57:23] (03PS2) 10Reedy: Beta Labs: Actually invoke the Citoid extension [mediawiki-config] - 10https://gerrit.wikimedia.org/r/186912 (owner: 10Jforrester) [00:57:32] (03CR) 10Reedy: [C: 032] Beta Labs: Actually invoke the Citoid extension [mediawiki-config] - 10https://gerrit.wikimedia.org/r/186912 (owner: 10Jforrester) [00:57:36] (03Merged) 10jenkins-bot: Beta Labs: Actually invoke the Citoid extension [mediawiki-config] - 10https://gerrit.wikimedia.org/r/186912 (owner: 10Jforrester) [00:58:12] !log reedy Synchronized wmf-config/: Noop for citoid for beta (duration: 00m 05s) [00:58:22] Logged the message, Master [01:26:32] (03CR) 10Rillke: [C: 031] "As I read it, this wouldn't be harmful, given there are thousands of edits in the "observed period" (thus 5% would be a lot more than 25 e" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/186743 (https://phabricator.wikimedia.org/T87431) (owner: 10Glaisher) [02:10:45] !log l10nupdate Synchronized php-1.25wmf14/cache/l10n: (no message) (duration: 00m 02s) [02:10:49] !log LocalisationUpdate completed (1.25wmf14) at 2015-01-27 02:10:49+00:00 [02:10:59] Logged the message, Master [02:11:06] Logged the message, Master [02:12:20] (03PS1) 10Nemo bis: Allow a full text search button on Commons whenever possible [mediawiki-config] - 10https://gerrit.wikimedia.org/r/186916 (https://phabricator.wikimedia.org/T19471) [02:12:26] (03CR) 10jenkins-bot: [V: 04-1] Allow a full text search button on Commons whenever possible [mediawiki-config] - 10https://gerrit.wikimedia.org/r/186916 (https://phabricator.wikimedia.org/T19471) (owner: 10Nemo bis) [02:19:08] !log l10nupdate Synchronized php-1.25wmf15/cache/l10n: (no message) (duration: 00m 03s) 
[02:19:12] !log LocalisationUpdate completed (1.25wmf15) at 2015-01-27 02:19:11+00:00 [02:19:15] Logged the message, Master [02:19:29] Logged the message, Master [02:30:26] 3operations: Decommission svn.wikimedia.org server (import SVN into Phabricator) - https://phabricator.wikimedia.org/T86655#995625 (10Chad) >>! In T86655#995479, @RobH wrote: > @Chad, > > Would you be the person to import this in, or should an ops person take point on this? I'm working on it, as soon as we ver... [03:36:28] PROBLEM - HTTP 5xx req/min on tungsten is CRITICAL: CRITICAL: 13.33% of data above the critical threshold [500.0] [03:49:28] RECOVERY - HTTP 5xx req/min on tungsten is OK: OK: Less than 1.00% above the threshold [250.0] [04:08:16] !log LocalisationUpdate ResourceLoader cache refresh completed at Tue Jan 27 04:08:16 UTC 2015 (duration 8m 15s) [04:08:24] Logged the message, Master [05:02:16] (03PS1) 10Glaisher: Add $wgLogo for wikimania2016wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/186919 [05:03:40] (03PS2) 10Glaisher: Add $wgLogo for wikimania2016wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/186919 [06:28:28] PROBLEM - puppet last run on mw1042 is CRITICAL: CRITICAL: Puppet has 1 failures [06:29:08] PROBLEM - puppet last run on elastic1027 is CRITICAL: CRITICAL: Puppet has 1 failures [06:29:44] 3Triagers, Phabricator, operations, Project-Creators: Broaden the group of users that can create projects in Phabricator - https://phabricator.wikimedia.org/T706#995762 (10Qgil) Done! @Jdlrobson, please remember to follow https://www.mediawiki.org/wiki/Phabricator/Creating_and_renaming_projects when creating new... [06:29:48] PROBLEM - puppet last run on mw1092 is CRITICAL: CRITICAL: Puppet has 1 failures [06:30:08] PROBLEM - puppet last run on mw1119 is CRITICAL: CRITICAL: Puppet has 2 failures [06:31:17] PROBLEM - puppet last run on mw1118 is CRITICAL: CRITICAL: Puppet has 1 failures [06:31:38] PROBLEM - puppet last run on mw1065 is CRITICAL: CRITICAL: Puppet has 1 failures [06:31:38] PROBLEM - puppet last run on cp4003 is CRITICAL: CRITICAL: Puppet has 1 failures [06:40:25] (03PS1) 1001tonythomas: Made the deployment-mx talk back to deployment wiki on receiving a bounce [puppet] - 10https://gerrit.wikimedia.org/r/186938 [06:44:19] * tonythomas wants some ops-help [06:45:37] RECOVERY - puppet last run on elastic1027 is OK: OK: Puppet is currently enabled, last run 21 seconds ago with 0 failures [06:46:17] RECOVERY - puppet last run on mw1092 is OK: OK: Puppet is currently enabled, last run 32 seconds ago with 0 failures [06:46:28] RECOVERY - puppet last run on mw1118 is OK: OK: Puppet is currently enabled, last run 16 seconds ago with 0 failures [06:46:37] RECOVERY - puppet last run on mw1119 is OK: OK: Puppet is currently enabled, last run 35 seconds ago with 0 failures [06:46:57] RECOVERY - puppet last run on mw1065 is OK: OK: Puppet is currently enabled, last run 52 seconds ago with 0 failures [06:46:58] RECOVERY - puppet last run on cp4003 is OK: OK: Puppet is currently enabled, last run 37 seconds ago with 0 failures [06:47:17] RECOVERY - puppet last run on mw1042 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [06:54:10] 3operations, MediaWiki-Vagrant: Provisioning MediaWiki-Vagrant fails with "Could not set 'file' on ensure: Is a directory - /etc/hhvm/php.ini" - https://phabricator.wikimedia.org/T87478#995772 (10Tgr) Adding ops, as I think they are in charge of packaging HHVM; if the package uses the wrong path, that should be... 
[07:00:55] 3operations, MediaWiki-Vagrant: Provisioning MediaWiki-Vagrant fails with "Could not set 'file' on ensure: Is a directory - /etc/hhvm/php.ini" - https://phabricator.wikimedia.org/T87478#995801 (10bd808) >>! In T87478#995772, @Tgr wrote: > Adding ops, as I think they are in charge of packaging HHVM; if the packag... [07:01:25] 3Scrum-of-Scrums, operations, Deployment-Systems: Update wikitech wiki with deployment train - https://phabricator.wikimedia.org/T70751#995808 (10Dzahn) [07:11:42] 3WMF-NDA-Requests, operations, WMF-Legal: Add multichill to WMF-NDA group - https://phabricator.wikimedia.org/T87097#995822 (10Qgil) @multichill has forwarded to @LuisV_WMF and me the NDA he signed a year a half ago. Looks good to me. I forwarded it to @aklapper. [07:14:56] 3WMF-NDA-Requests, operations, WMF-Legal: Grant WMF-NDA access to Stas in Phabricator - https://phabricator.wikimedia.org/T85170#995824 (10Qgil) Honestly, I think this task should be just resolved. [07:16:39] 3WMF-NDA-Requests, operations: Grant Nikerabbit access to WMF-NDA group - https://phabricator.wikimedia.org/T86632#995825 (10Qgil) a:3LuisV_WMF [09:55:17] PROBLEM - check_mysql on db1008 is CRITICAL: SLOW_SLAVE CRITICAL: Slave IO: Yes Slave SQL: Yes Seconds Behind Master: 651 [10:00:17] RECOVERY - check_mysql on db1008 is OK: Uptime: 930596 Threads: 2 Questions: 1706095 Slow queries: 6185 Opens: 14489 Flush tables: 2 Open tables: 64 Queries per second avg: 1.833 Slave IO: Yes Slave SQL: Yes Seconds Behind Master: 0 [10:45:27] PROBLEM - puppet last run on amssq53 is CRITICAL: CRITICAL: puppet fail [11:04:17] RECOVERY - puppet last run on amssq53 is OK: OK: Puppet is currently enabled, last run 1 second ago with 0 failures [12:58:11] (03PS2) 10Nemo bis: Allow a full text search button on Commons whenever possible [mediawiki-config] - 10https://gerrit.wikimedia.org/r/186916 (https://phabricator.wikimedia.org/T19471) [12:59:55] (03PS3) 10Nemo bis: Allow a full text search button on Commons whenever possible [mediawiki-config] - 10https://gerrit.wikimedia.org/r/186916 (https://phabricator.wikimedia.org/T19471) [13:38:24] (03CR) 10JanZerebecki: [C: 031] Exempt Item and Property namespaces from ConfirmEdit [mediawiki-config] - 10https://gerrit.wikimedia.org/r/186828 (https://phabricator.wikimedia.org/T86453) (owner: 10Hoo man) [14:05:35] (03CR) 10Steinsplitter: [C: 031] Allow a full text search button on Commons whenever possible [mediawiki-config] - 10https://gerrit.wikimedia.org/r/186916 (https://phabricator.wikimedia.org/T19471) (owner: 10Nemo bis) [14:42:27] (03CR) 10dschwen: [C: 031] "So the $wmg* variables are created according to the default values in the 'wmg*' dictionary entries for the current site? This is hard to " [mediawiki-config] - 10https://gerrit.wikimedia.org/r/186916 (https://phabricator.wikimedia.org/T19471) (owner: 10Nemo bis) [14:57:10] bblack: do you have any idea how it.m.wikipedia.org can still sometimes require an action=purge to show the correct editing permissions for unregistered users? It was supposedly fixed in November. :( https://meta.wikimedia.org/wiki/It.m.wikipedia.org#Unregistered_users [15:26:58] PROBLEM - puppet last run on capella is CRITICAL: CRITICAL: puppet fail [15:44:38] RECOVERY - puppet last run on capella is OK: OK: Puppet is currently enabled, last run 12 seconds ago with 0 failures [16:00:04] manybubbles, anomie, ^d, marktraceur: Dear anthropoid, the time has come. 
Please deploy Morning SWAT(Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20150127T1600). [16:02:15] anthropoid, manybubbles, anomie, ^d, marktraceur will you be able to deploy https://wikitech.wikimedia.org/wiki/Deployments#Monday.2C.C2.A0January.C2.A026 today? as far as i understand it, those items didn't deploy yesterday. [16:06:04] (03PS1) 10devunt: Rename portal namespace in kowiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/186958 [16:06:53] (03PS2) 10devunt: Rename portal namespace in kowiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/186958 (https://phabricator.wikimedia.org/T87528) [16:11:20] (03PS3) 10devunt: Rename portal namespace in kowiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/186958 (https://phabricator.wikimedia.org/T87528) [16:14:17] (03PS4) 10devunt: Rename portal namespace in kowiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/186958 (https://phabricator.wikimedia.org/T87528) [16:15:06] anthropoid, manybubbles, anomie, ^d, marktraceur: will you be able to deploy https://wikitech.wikimedia.org/wiki/Deployments#Monday.2C.C2.A0January.C2.A026 today? [16:15:27] anthropoid, manybubbles, anomie, ^d, marktraceur: any ideas on why those items didn’t deploy yesterday? [16:16:58] Krenair: do you happen to know why https://wikitech.wikimedia.org/wiki/Deployments#Monday.2C.C2.A0January.C2.A026 didn’t deploy yesterday? [16:32:17] dan-nl: Probably because we're all busy at the dev summit [16:33:04] thanks marktraceur. any chance it might go out today? [16:34:42] OK, I'll do it now [16:34:49] marktraceur, is it morning swat time now? [16:34:54] hi [16:34:55] Yeah [16:35:13] thanks marktraceur! that would be great [16:35:55] I'd just added task to deployments list [16:35:59] https://wikitech.wikimedia.org/w/index.php?title=Deployments&diff=142072&oldid=141823 [16:36:14] devunt: I see that, I'll do that too [16:36:27] thank you [16:36:56] But first dan-nl for being so patient with us [16:40:03] (03CR) 10KartikMistry: "Should be ready to go now." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/186358 (owner: 10KartikMistry) [16:40:46] yoohoo :o [16:40:50] (03CR) 10KartikMistry: "Should be okay to go!" [puppet] - 10https://gerrit.wikimedia.org/r/186522 (owner: 10KartikMistry) [16:40:53] * Revi watches his bug [16:41:16] Revi: What's that now? [16:41:24] Oh, you're watching the GWT bug? [16:41:50] devunt's one [16:41:57] nop he is referencing my issue :p [16:42:07] Ah. [16:42:08] I filed the bug, he submitted the patch [16:45:31] and was there GWT related patch? [16:45:35] * Revi sees deployments [16:46:43] Wellp [16:46:49] There's something weird [16:47:32] I have a NavigationTiming change that shows up on git log HEAD..origin/wmf/1.25wmf14 [16:49:55] I guess I can just revert [16:53:22] hmm, does that prevent you from deploying today? [16:53:45] No [16:53:52] Just need to talk to ori first [16:54:15] !log marktraceur Synchronized php-1.25wmf15/extensions/GWToolset/includes/: [SWAT] [wmf15] GWToolset HHVM fixes (duration: 00m 06s) [16:54:24] Logged the message, Master [16:54:44] That's basically a no-op because commons is on wmf14 [16:55:08] I need ori before I can do wmf14 [16:55:35] ah okay, hopefully he'll be able to sort it out. 
thanks for eorking on it [16:56:18] PROBLEM - puppet last run on db2030 is CRITICAL: CRITICAL: puppet fail [17:00:34] 3Phabricator, operations: Add Addshore to phabricator WMF-NDA group - https://phabricator.wikimedia.org/T87651#996136 (10Addshore) 3NEW [17:04:32] addshore: I thought you can just say T(bugnum) instead of full URL :P [17:04:44] if it was intentional, ignore me :p [17:05:01] meh, was doing it on my tablet, copy and paste.. [17:05:19] meh :pp [17:05:43] OK so [17:05:48] config patch [17:05:50] Should be fast [17:05:52] yay. [17:05:54] 3WMF-NDA-Requests, operations, WMF-Legal: Add Addshore to phabricator WMF-NDA group - https://phabricator.wikimedia.org/T87651#996145 (10Qgil) [17:06:00] Revi, devunt, ready to test? [17:06:08] yes I am [17:06:16] uhm, I'm mobile, so leaving to devunt [17:06:49] well this is 2:00 am now in here [17:06:50] (03CR) 10MarkTraceur: [C: 032] "Seems fine, ready to go" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/186958 (https://phabricator.wikimedia.org/T87528) (owner: 10devunt) [17:06:55] devunt: Sorry about that :( [17:06:55] (03Merged) 10jenkins-bot: Rename portal namespace in kowiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/186958 (https://phabricator.wikimedia.org/T87528) (owner: 10devunt) [17:07:10] 2AM here either :p [17:07:20] 3WMF-NDA-Requests, operations, WMF-Legal: Add Addshore to phabricator WMF-NDA group - https://phabricator.wikimedia.org/T87651#996136 (10Qgil) @addshore, are you saying that you already signed a volunteer NDA? If so, do you mind forwarding it to @aklapper,@LuisV_WMF, and myself? [17:07:35] whatever I usually leave patches to others because I'm sleeping when SWAT happens :p [17:07:57] (I don't know why whatever was inserted...heh) [17:08:08] marktraceur: you can merge it [17:08:39] ori: You mean deploy? [17:08:39] !log marktraceur Synchronized wmf-config/InitialiseSettings.php: [SWAT] [config] Rename portal namespace in kowiki (duration: 00m 05s) [17:08:47] Logged the message, Master [17:08:52] You merged the NavigationTiming patch then didn't deploy it [17:09:07] marktraceur: btw, ori says deploy [17:09:12] Oh, 'kay [17:10:06] greg-g, ori, I assume you want me to do the submodule update on NT too [17:10:15] Revi, devunt, test it please :) [17:10:31] marktraceur, it's working [17:10:32] thank you [17:10:36] Great! [17:11:09] good too [17:11:25] thanks :D [17:11:27] and goodnight [17:11:30] !log marktraceur Synchronized php-1.25wmf14/extensions/GWToolset/includes/: [SWAT] [wmf14] GWToolset HHVM fixes (duration: 00m 07s) [17:11:32] * ^d finds broken record about doing swat on days like today [17:11:35] * ^d plays record [17:11:36] Logged the message, Master [17:12:21] s'alright :) [17:13:08] PROBLEM - puppet last run on mw1212 is CRITICAL: CRITICAL: Puppet has 1 failures [17:13:18] PROBLEM - DPKG on labmon1001 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [17:13:58] RECOVERY - puppet last run on db2030 is OK: OK: Puppet is currently enabled, last run 42 seconds ago with 0 failures [17:14:09] greg-g: Am I correct in thinking I'm actually deploying NT, or did I just merge and do my thing without regard for the NT patch? [17:14:15] 3WMF-NDA-Requests, operations, WMF-Legal: Add Addshore to phabricator WMF-NDA group - https://phabricator.wikimedia.org/T87651#996158 (10Addshore) [17:14:28] RECOVERY - DPKG on labmon1001 is OK: All packages OK [17:14:38] marktraceur: if it did the update... 
I'm not positive [17:15:01] 3WMF-NDA-Requests, operations, WMF-Legal: Add Addshore to phabricator WMF-NDA group - https://phabricator.wikimedia.org/T87651#996136 (10Addshore) Yes I have already signed an NDA Forwarded to you @Qgil , Please forward on to @Aklapper and @LuisV_WMF ! [17:15:05] should i see a change on https://commons.wikimedia.org/wiki/Special:Version for the gwtoolset commit version? [17:15:37] greg-g: No update was made, I think [17:15:56] ori: Am I syncing this change of yours? [17:18:03] 3WMF-NDA-Requests, operations, WMF-Legal: Add Addshore to phabricator WMF-NDA group - https://phabricator.wikimedia.org/T87651#996179 (10Qgil) Done. This NDA is signed by @Afdshore and Erik Moeller. It looks good to me, and therefore I think we can add him to WMF-NDA. [17:18:15] marktraceur: as far as i can tell the gwt patch for wmf14 hasn't deployed. i'm basing that off of the commit hash on https://commons.wikimedia.org/wiki/Special:Version [17:18:22] Uhhhhhh [17:19:15] That might be cached somehow [17:19:23] dan-nl: Have you tried repro'ing the bug? [17:19:54] k, i can try but do you know how to kill a job in the job runner if it goes wrong? [17:20:08] Hm, no [17:20:10] 3operations, MediaWiki-Vagrant: Provisioning MediaWiki-Vagrant fails with "Could not set 'file' on ensure: Is a directory - /etc/hhvm/php.ini" - https://phabricator.wikimedia.org/T87478#996185 (10bd808) Workaround until we get the package fixed: ``` $ vagrant ssh $ sudo rm -rf /etc/hhvm/php.ini $ exit $ vagrant... [17:20:10] i was hoping to confirm via that version page that the fix deployed ... [17:20:23] k, then i'm okay with waiting on testing it [17:20:23] Wellp [17:20:35] as far as you're concerned it deployed? [17:20:39] I believe so [17:20:46] I can re-sync it I guess [17:21:04] !log marktraceur Synchronized php-1.25wmf14/extensions/GWToolset/: [SWAT] [wmf14] GWToolset HHVM fixes (duration: 00m 06s) [17:21:12] Logged the message, Master [17:21:29] k, i'll wait it out … need to get going anyway … will test it out tomorrow … would i log in here to get help stopping a job runner if needed? [17:21:45] dan-nl: Yeah, I think this is the best place [17:21:53] Though people may be slightly distracted [17:22:04] cool, thanks for your help marktraceur! [17:22:16] logging out for now. will test it tomorrow [17:22:35] !log rebooting sca1001, sca1002, chromium, oxygen [17:22:39] Logged the message, Master [17:22:58] PROBLEM - DPKG on chromium is CRITICAL: DPKG CRITICAL dpkg reports broken packages [17:23:10] !log I am consciously leaving NavigationTiming unsynced because nobody seems that concerned about it, and nobody is here to shepherd the patch. If you *are* concerned about it, then contact ori. 
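A minimal sketch of the two checks being juggled in this exchange: on the deployment side, whether a merged change is still sitting unpulled on the local wmf branch checkout, and from the outside, which commit of an extension a wiki is actually serving. The /srv/mediawiki-staging path and the NavigationTiming submodule path are assumptions, not taken from the log.

```
# On the deploy host, in the branch checkout (path assumed):
cd /srv/mediawiki-staging/php-1.25wmf14
git fetch origin
# Merged upstream but not in the local checkout yet -- the range marktraceur mentions above:
git log --oneline HEAD..origin/wmf/1.25wmf14
# Commit a submodule such as NavigationTiming is currently pinned to:
git submodule status extensions/NavigationTiming

# From anywhere: rough check of what Special:Version reports for an extension,
# as dan-nl was doing for GWToolset on Commons:
curl -s https://commons.wikimedia.org/wiki/Special:Version | grep -o 'GWToolset[^<]*' | head
```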
[17:23:17] Logged the message, Master [17:25:18] RECOVERY - DPKG on chromium is OK: All packages OK [17:27:48] PROBLEM - Host chromium is DOWN: PING CRITICAL - Packet loss = 100% [17:27:58] PROBLEM - DPKG on oxygen is CRITICAL: DPKG CRITICAL dpkg reports broken packages [17:28:37] RECOVERY - Host chromium is UP: PING OK - Packet loss = 0%, RTA = 0.38 ms [17:29:28] PROBLEM - DPKG on bast4001 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [17:29:48] PROBLEM - DPKG on hooft is CRITICAL: DPKG CRITICAL dpkg reports broken packages [17:29:48] PROBLEM - DPKG on mw1153 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [17:29:48] PROBLEM - Apache HTTP on mw1154 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:29:57] PROBLEM - Apache HTTP on mw1155 is CRITICAL: Connection timed out [17:30:07] PROBLEM - Apache HTTP on mw1160 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:30:18] PROBLEM - puppetmaster backend https on strontium is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:30:27] PROBLEM - Apache HTTP on mw1156 is CRITICAL: Connection timed out [17:30:28] PROBLEM - Apache HTTP on mw1157 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:30:38] RECOVERY - DPKG on bast4001 is OK: All packages OK [17:30:57] RECOVERY - Apache HTTP on mw1154 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 4.508 second response time [17:30:58] RECOVERY - Apache HTTP on mw1155 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 0.086 second response time [17:30:58] PROBLEM - puppet last run on amssq55 is CRITICAL: CRITICAL: Puppet has 1 failures [17:31:07] RECOVERY - Apache HTTP on mw1160 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 0.183 second response time [17:31:18] RECOVERY - puppetmaster backend https on strontium is OK: HTTP OK: Status line output matched 400 - 335 bytes in 0.896 second response time [17:31:28] RECOVERY - DPKG on oxygen is OK: All packages OK [17:31:28] RECOVERY - Apache HTTP on mw1156 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 0.108 second response time [17:31:28] RECOVERY - Apache HTTP on mw1157 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 0.092 second response time [17:31:37] PROBLEM - puppet last run on elastic1015 is CRITICAL: CRITICAL: puppet fail [17:31:49] PROBLEM - HTTP 5xx req/min on tungsten is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [500.0] [17:31:57] PROBLEM - puppet last run on mc1014 is CRITICAL: CRITICAL: puppet fail [17:31:58] RECOVERY - puppet last run on mw1212 is OK: OK: Puppet is currently enabled, last run 7 seconds ago with 0 failures [17:32:02] <^d> poor puppetz. 
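bd808's workaround for T87478, quoted earlier in this log, is cut off; the idea is simply to remove the directory wrongly created at /etc/hhvm/php.ini so provisioning can write it back as a file. The final provisioning step is an assumption, since the log truncates after `vagrant`:

```
# From the MediaWiki-Vagrant checkout on the host:
vagrant ssh
sudo rm -rf /etc/hhvm/php.ini    # it exists as a directory, which blocks Puppet's file resource
exit
vagrant provision                # assumption: re-run Puppet so php.ini is recreated as a file
```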
[17:32:08] PROBLEM - puppet last run on amssq48 is CRITICAL: CRITICAL: Puppet has 2 failures [17:32:48] PROBLEM - puppet last run on db2007 is CRITICAL: CRITICAL: Puppet has 1 failures [17:33:08] PROBLEM - puppet last run on db1043 is CRITICAL: CRITICAL: Puppet has 3 failures [17:33:17] PROBLEM - puppet last run on analytics1016 is CRITICAL: CRITICAL: Puppet has 1 failures [17:33:27] PROBLEM - puppet last run on db1026 is CRITICAL: CRITICAL: Puppet has 1 failures [17:33:28] PROBLEM - puppet last run on wtp1012 is CRITICAL: CRITICAL: Puppet has 1 failures [17:33:38] PROBLEM - puppet last run on db1060 is CRITICAL: CRITICAL: Puppet has 1 failures [17:33:38] PROBLEM - puppet last run on labsdb1006 is CRITICAL: CRITICAL: Puppet has 1 failures [17:33:38] PROBLEM - puppet last run on analytics1026 is CRITICAL: CRITICAL: Puppet has 1 failures [17:33:47] PROBLEM - puppet last run on db2037 is CRITICAL: CRITICAL: Puppet has 1 failures [17:33:47] PROBLEM - puppet last run on stat1003 is CRITICAL: CRITICAL: Puppet has 1 failures [17:33:48] PROBLEM - puppet last run on es2004 is CRITICAL: CRITICAL: Puppet has 1 failures [17:33:57] PROBLEM - puppet last run on virt1003 is CRITICAL: CRITICAL: Puppet has 1 failures [17:33:57] PROBLEM - puppet last run on gadolinium is CRITICAL: CRITICAL: Puppet has 1 failures [17:33:58] PROBLEM - puppet last run on mw1125 is CRITICAL: CRITICAL: Puppet has 1 failures [17:34:08] PROBLEM - puppet last run on dataset1001 is CRITICAL: CRITICAL: Puppet has 1 failures [17:34:08] PROBLEM - puppet last run on install2001 is CRITICAL: CRITICAL: Puppet has 2 failures [17:34:08] PROBLEM - puppet last run on db1048 is CRITICAL: CRITICAL: Puppet has 1 failures [17:34:08] PROBLEM - puppet last run on ms-be2008 is CRITICAL: CRITICAL: Puppet has 1 failures [17:34:09] PROBLEM - puppet last run on mw1149 is CRITICAL: CRITICAL: Puppet has 1 failures [17:34:09] PROBLEM - puppet last run on cp4019 is CRITICAL: CRITICAL: Puppet has 1 failures [17:34:09] PROBLEM - Host bast1001 is DOWN: PING CRITICAL - Packet loss = 100% [17:34:09] PROBLEM - puppet last run on mw1129 is CRITICAL: CRITICAL: Puppet has 2 failures [17:34:17] PROBLEM - puppet last run on mw1249 is CRITICAL: CRITICAL: Puppet has 1 failures [17:34:18] PROBLEM - puppet last run on mw1051 is CRITICAL: CRITICAL: Puppet has 2 failures [17:34:18] PROBLEM - puppet last run on cp1046 is CRITICAL: CRITICAL: Puppet has 1 failures [17:34:29] PROBLEM - puppet last run on analytics1013 is CRITICAL: CRITICAL: Puppet has 1 failures [17:34:29] PROBLEM - puppet last run on wtp1005 is CRITICAL: CRITICAL: Puppet has 1 failures [17:34:29] PROBLEM - puppet last run on ms-be3001 is CRITICAL: CRITICAL: Puppet has 1 failures [17:34:29] PROBLEM - puppet last run on amssq34 is CRITICAL: CRITICAL: Puppet has 1 failures [17:34:29] PROBLEM - puppet last run on ms-fe3002 is CRITICAL: CRITICAL: Puppet has 1 failures [17:34:30] PROBLEM - puppet last run on cp3010 is CRITICAL: CRITICAL: Puppet has 1 failures [17:34:38] PROBLEM - DPKG on strontium is CRITICAL: DPKG CRITICAL dpkg reports broken packages [17:34:38] PROBLEM - Host mw1153 is DOWN: PING CRITICAL - Packet loss = 100% [17:34:48] PROBLEM - puppet last run on snapshot1001 is CRITICAL: CRITICAL: Puppet has 1 failures [17:34:48] PROBLEM - puppet last run on mw1195 is CRITICAL: CRITICAL: Puppet has 2 failures [17:34:59] RECOVERY - Host bast1001 is UP: PING OK - Packet loss = 0%, RTA = 0.51 ms [17:35:07] PROBLEM - puppet last run on mw1044 is CRITICAL: CRITICAL: Puppet has 2 failures [17:35:08] PROBLEM - puppet 
last run on mw1168 is CRITICAL: CRITICAL: Puppet has 1 failures [17:35:08] PROBLEM - puppet last run on mw1237 is CRITICAL: CRITICAL: Puppet has 3 failures [17:35:09] PROBLEM - puppet last run on mw1055 is CRITICAL: CRITICAL: Puppet has 1 failures [17:35:18] PROBLEM - puppet last run on cp4005 is CRITICAL: CRITICAL: Puppet has 2 failures [17:35:28] RECOVERY - Host mw1153 is UP: PING OK - Packet loss = 0%, RTA = 0.69 ms [17:35:38] RECOVERY - DPKG on mw1153 is OK: All packages OK [17:36:27] PROBLEM - puppet last run on mw1204 is CRITICAL: CRITICAL: puppet fail [17:36:29] PROBLEM - puppet last run on mw1090 is CRITICAL: CRITICAL: puppet fail [17:36:29] PROBLEM - Disk space on hooft is CRITICAL: Connection refused by host [17:36:31] PROBLEM - configured eth on hooft is CRITICAL: Connection refused by host [17:36:31] PROBLEM - salt-minion processes on hooft is CRITICAL: Connection refused by host [17:36:31] PROBLEM - dhclient process on hooft is CRITICAL: Connection refused by host [17:36:38] PROBLEM - puppet last run on lvs3003 is CRITICAL: CRITICAL: puppet fail [17:36:48] PROBLEM - puppet last run on hooft is CRITICAL: Connection refused by host [17:36:48] PROBLEM - RAID on hooft is CRITICAL: Connection refused by host [17:37:00] RECOVERY - DPKG on strontium is OK: All packages OK [17:37:00] PROBLEM - SSH on hooft is CRITICAL: Connection refused [17:37:00] PROBLEM - puppet last run on cp1040 is CRITICAL: CRITICAL: puppet fail [17:37:00] PROBLEM - puppet last run on mw1107 is CRITICAL: CRITICAL: puppet fail [17:37:18] PROBLEM - Host antimony is DOWN: CRITICAL - Host Unreachable (208.80.154.7) [17:37:19] PROBLEM - puppet last run on mw1241 is CRITICAL: CRITICAL: puppet fail [17:37:49] RECOVERY - Host antimony is UP: PING OK - Packet loss = 0%, RTA = 5.02 ms [17:37:58] PROBLEM - puppet last run on amssq33 is CRITICAL: CRITICAL: Puppet has 1 failures [17:39:08] RECOVERY - puppet last run on hooft is OK: OK: Puppet is currently enabled, last run 28 minutes ago with 0 failures [17:39:08] RECOVERY - DPKG on hooft is OK: All packages OK [17:39:08] RECOVERY - RAID on hooft is OK: OK: Active: 4, Working: 4, Failed: 0, Spare: 0 [17:39:18] RECOVERY - SSH on hooft is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1.4 (protocol 2.0) [17:39:28] PROBLEM - puppet last run on bast1001 is CRITICAL: CRITICAL: Puppet has 59 failures [17:40:09] RECOVERY - Disk space on hooft is OK: DISK OK [17:40:09] RECOVERY - salt-minion processes on hooft is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [17:40:09] RECOVERY - configured eth on hooft is OK: NRPE: Unable to read output [17:40:10] RECOVERY - dhclient process on hooft is OK: PROCS OK: 0 processes with command name dhclient [17:40:27] PROBLEM - Host ytterbium is DOWN: PING CRITICAL - Packet loss = 100% [17:40:47] PROBLEM - Host ms1004 is DOWN: PING CRITICAL - Packet loss = 100% [17:40:58] PROBLEM - git.wikimedia.org on antimony is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Temporarily Unavailable - 516 bytes in 0.003 second response time [17:41:18] PROBLEM - Host strontium is DOWN: PING CRITICAL - Packet loss = 100% [17:41:48] RECOVERY - Host ytterbium is UP: PING OK - Packet loss = 0%, RTA = 2.61 ms [17:42:28] PROBLEM - Host zirconium is DOWN: CRITICAL - Host Unreachable (208.80.154.41) [17:42:48] RECOVERY - Host ms1004 is UP: PING OK - Packet loss = 0%, RTA = 1.85 ms [17:42:48] RECOVERY - Host strontium is UP: PING OK - Packet loss = 0%, RTA = 1.03 ms [17:43:08] PROBLEM - Host palladium is DOWN: PING CRITICAL - Packet loss = 100% 
[17:43:38] PROBLEM - puppet last run on logstash1003 is CRITICAL: CRITICAL: puppet fail [17:43:38] PROBLEM - puppet last run on ms-be2002 is CRITICAL: CRITICAL: puppet fail [17:43:48] PROBLEM - puppet last run on elastic1031 is CRITICAL: CRITICAL: puppet fail [17:43:48] PROBLEM - puppet last run on mw1218 is CRITICAL: CRITICAL: Puppet has 18 failures [17:43:48] PROBLEM - puppet last run on mw1038 is CRITICAL: CRITICAL: puppet fail [17:43:48] PROBLEM - puppet last run on mw1233 is CRITICAL: CRITICAL: Puppet has 116 failures [17:43:48] PROBLEM - puppet last run on mw1240 is CRITICAL: CRITICAL: puppet fail [17:43:49] PROBLEM - puppet last run on mw1089 is CRITICAL: CRITICAL: puppet fail [17:43:49] PROBLEM - puppet last run on ms-be1001 is CRITICAL: CRITICAL: puppet fail [17:43:50] PROBLEM - puppet last run on mw1080 is CRITICAL: CRITICAL: puppet fail [17:43:50] PROBLEM - puppet last run on mc1016 is CRITICAL: CRITICAL: puppet fail [17:43:57] PROBLEM - puppet last run on mw1178 is CRITICAL: CRITICAL: Puppet has 116 failures [17:43:57] PROBLEM - puppet last run on cp1059 is CRITICAL: CRITICAL: Puppet has 43 failures [17:43:58] PROBLEM - puppet last run on mw1067 is CRITICAL: CRITICAL: puppet fail [17:43:58] PROBLEM - puppet last run on mw1048 is CRITICAL: CRITICAL: puppet fail [17:43:58] RECOVERY - Host zirconium is UP: PING OK - Packet loss = 0%, RTA = 0.39 ms [17:43:58] PROBLEM - puppet last run on amssq43 is CRITICAL: CRITICAL: puppet fail [17:43:58] PROBLEM - puppet last run on cp3022 is CRITICAL: CRITICAL: Puppet has 46 failures [17:43:59] PROBLEM - puppet last run on amssq59 is CRITICAL: CRITICAL: puppet fail [17:43:59] PROBLEM - puppet last run on cp3015 is CRITICAL: CRITICAL: puppet fail [17:43:59] PROBLEM - puppet last run on cp3006 is CRITICAL: CRITICAL: puppet fail [17:44:00] PROBLEM - puppet last run on uranium is CRITICAL: CRITICAL: puppet fail [17:44:08] PROBLEM - puppet last run on mc1004 is CRITICAL: CRITICAL: puppet fail [17:44:08] PROBLEM - puppet last run on mw1134 is CRITICAL: CRITICAL: puppet fail [17:44:08] PROBLEM - puppet last run on mw1109 is CRITICAL: CRITICAL: Puppet has 10 failures [17:44:08] PROBLEM - puppet last run on mw1115 is CRITICAL: CRITICAL: puppet fail [17:44:08] PROBLEM - puppet last run on rdb1004 is CRITICAL: CRITICAL: puppet fail [17:44:17] PROBLEM - puppet last run on pc1001 is CRITICAL: CRITICAL: Puppet has 6 failures [17:44:17] PROBLEM - puppet last run on wtp1017 is CRITICAL: CRITICAL: puppet fail [17:44:27] PROBLEM - puppet last run on dbstore1001 is CRITICAL: CRITICAL: Puppet has 24 failures [17:44:27] PROBLEM - puppet last run on mw1132 is CRITICAL: CRITICAL: Puppet has 24 failures [17:44:27] PROBLEM - puppet last run on mw1124 is CRITICAL: CRITICAL: Puppet has 67 failures [17:44:28] PROBLEM - puppet last run on cerium is CRITICAL: CRITICAL: puppet fail [17:44:28] PROBLEM - puppet last run on mw1005 is CRITICAL: CRITICAL: puppet fail [17:44:28] PROBLEM - puppet last run on mw1028 is CRITICAL: CRITICAL: puppet fail [17:44:28] PROBLEM - puppet last run on mw1161 is CRITICAL: CRITICAL: Puppet has 8 failures [17:44:28] PROBLEM - puppet last run on mw1106 is CRITICAL: CRITICAL: puppet fail [17:44:29] PROBLEM - puppet last run on mw1072 is CRITICAL: CRITICAL: puppet fail [17:44:37] PROBLEM - puppet last run on cp1070 is CRITICAL: CRITICAL: puppet fail [17:44:38] PROBLEM - puppet last run on mw1040 is CRITICAL: CRITICAL: Puppet has 72 failures [17:44:38] PROBLEM - puppet last run on mw1200 is CRITICAL: CRITICAL: puppet fail [17:44:38] PROBLEM - puppet 
last run on mw1256 is CRITICAL: CRITICAL: puppet fail [17:44:38] PROBLEM - puppet last run on es2003 is CRITICAL: CRITICAL: Puppet has 3 failures [17:44:49] PROBLEM - puppet last run on ms-be2003 is CRITICAL: CRITICAL: puppet fail [17:44:49] PROBLEM - puppet last run on analytics1033 is CRITICAL: CRITICAL: puppet fail [17:44:49] RECOVERY - Host palladium is UP: PING OK - Packet loss = 0%, RTA = 0.31 ms [17:44:49] PROBLEM - puppet last run on cp4006 is CRITICAL: CRITICAL: puppet fail [17:44:50] PROBLEM - puppet last run on caesium is CRITICAL: CRITICAL: puppet fail [17:44:57] PROBLEM - puppet last run on magnesium is CRITICAL: CRITICAL: puppet fail [17:44:58] PROBLEM - puppet last run on ms-be1004 is CRITICAL: CRITICAL: puppet fail [17:44:58] PROBLEM - puppet last run on mw1141 is CRITICAL: CRITICAL: puppet fail [17:44:58] PROBLEM - puppet last run on db1029 is CRITICAL: CRITICAL: puppet fail [17:44:58] PROBLEM - Host amslvs1 is DOWN: PING CRITICAL - Packet loss = 100% [17:44:58] PROBLEM - Host amslvs4 is DOWN: PING CRITICAL - Packet loss = 100% [17:45:08] PROBLEM - puppet last run on cp3020 is CRITICAL: CRITICAL: puppet fail [17:45:08] PROBLEM - puppet last run on amssq54 is CRITICAL: CRITICAL: puppet fail [17:45:08] PROBLEM - puppet last run on amssq49 is CRITICAL: CRITICAL: puppet fail [17:45:08] PROBLEM - puppet last run on cp3008 is CRITICAL: CRITICAL: puppet fail [17:45:08] PROBLEM - puppet last run on cp3013 is CRITICAL: CRITICAL: Puppet has 55 failures [17:45:09] PROBLEM - DPKG on mw1154 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [17:45:09] PROBLEM - puppet last run on ytterbium is CRITICAL: CRITICAL: puppet fail [17:45:18] PROBLEM - puppet last run on netmon1001 is CRITICAL: CRITICAL: puppet fail [17:45:27] PROBLEM - puppet last run on db1031 is CRITICAL: CRITICAL: puppet fail [17:45:27] PROBLEM - puppet last run on cp1055 is CRITICAL: CRITICAL: puppet fail [17:45:28] RECOVERY - Host amslvs1 is UP: PING OK - Packet loss = 0%, RTA = 93.57 ms [17:45:28] PROBLEM - puppet last run on cp1039 is CRITICAL: CRITICAL: puppet fail [17:45:37] PROBLEM - puppet last run on ms-fe2004 is CRITICAL: CRITICAL: puppet fail [17:45:38] PROBLEM - puppet last run on cp1047 is CRITICAL: CRITICAL: puppet fail [17:45:38] PROBLEM - puppet last run on ms-be1006 is CRITICAL: CRITICAL: puppet fail [17:45:38] PROBLEM - puppet last run on mw1140 is CRITICAL: CRITICAL: puppet fail [17:45:41] puppet >.> [17:45:47] PROBLEM - puppet last run on mw1145 is CRITICAL: CRITICAL: puppet fail [17:45:48] PROBLEM - puppet last run on lvs1002 is CRITICAL: CRITICAL: puppet fail [17:45:48] PROBLEM - puppet last run on snapshot1003 is CRITICAL: CRITICAL: puppet fail [17:45:58] PROBLEM - puppet last run on ms-be2006 is CRITICAL: CRITICAL: puppet fail [17:45:58] PROBLEM - puppet last run on mw1117 is CRITICAL: CRITICAL: puppet fail [17:46:07] PROBLEM - puppet last run on ms-be1003 is CRITICAL: CRITICAL: puppet fail [17:46:07] PROBLEM - puppet last run on mw1082 is CRITICAL: CRITICAL: puppet fail [17:46:07] RECOVERY - Host amslvs4 is UP: PING OK - Packet loss = 0%, RTA = 96.75 ms [17:46:08] PROBLEM - puppet last run on cp1049 is CRITICAL: CRITICAL: puppet fail [17:46:08] PROBLEM - puppet last run on wtp1009 is CRITICAL: CRITICAL: puppet fail [17:46:18] PROBLEM - puppet last run on amssq53 is CRITICAL: CRITICAL: puppet fail [17:46:18] PROBLEM - puppet last run on amssq32 is CRITICAL: CRITICAL: puppet fail [17:46:18] PROBLEM - puppet last run on amssq61 is CRITICAL: CRITICAL: puppet fail [17:46:28] PROBLEM - puppet last 
run on potassium is CRITICAL: CRITICAL: puppet fail [17:46:28] PROBLEM - puppet last run on mw1217 is CRITICAL: CRITICAL: puppet fail [17:46:28] PROBLEM - puppet last run on mw1160 is CRITICAL: CRITICAL: puppet fail [17:46:28] PROBLEM - puppet last run on lvs1005 is CRITICAL: CRITICAL: puppet fail [17:46:38] PROBLEM - puppet last run on db1018 is CRITICAL: CRITICAL: puppet fail [17:46:38] PROBLEM - puppet last run on wtp1016 is CRITICAL: CRITICAL: puppet fail [17:46:38] PROBLEM - puppet last run on zirconium is CRITICAL: CRITICAL: puppet fail [17:46:39] PROBLEM - puppet last run on mc1006 is CRITICAL: CRITICAL: puppet fail [17:46:39] PROBLEM - puppet last run on analytics1035 is CRITICAL: CRITICAL: puppet fail [17:46:47] PROBLEM - puppet last run on db1053 is CRITICAL: CRITICAL: puppet fail [17:46:47] PROBLEM - puppet last run on mw1150 is CRITICAL: CRITICAL: puppet fail [17:46:48] PROBLEM - puppet last run on mw1120 is CRITICAL: CRITICAL: puppet fail [17:46:48] PROBLEM - puppet last run on analytics1017 is CRITICAL: CRITICAL: puppet fail [17:46:48] PROBLEM - puppet last run on elastic1018 is CRITICAL: CRITICAL: puppet fail [17:46:48] PROBLEM - puppet last run on mw1242 is CRITICAL: CRITICAL: puppet fail [17:46:48] PROBLEM - puppet last run on mw1099 is CRITICAL: CRITICAL: puppet fail [17:46:49] PROBLEM - puppet last run on capella is CRITICAL: CRITICAL: puppet fail [17:46:49] PROBLEM - puppet last run on es2009 is CRITICAL: CRITICAL: puppet fail [17:46:49] PROBLEM - puppet last run on es2008 is CRITICAL: CRITICAL: puppet fail [17:46:50] PROBLEM - puppet last run on ms-be2013 is CRITICAL: CRITICAL: Puppet has 39 failures [17:46:51] PROBLEM - puppet last run on labstore2001 is CRITICAL: CRITICAL: puppet fail [17:47:02] PROBLEM - puppet last run on mw1069 is CRITICAL: CRITICAL: puppet fail [17:47:02] PROBLEM - puppet last run on logstash1001 is CRITICAL: CRITICAL: puppet fail [17:47:02] PROBLEM - puppet last run on mw1006 is CRITICAL: CRITICAL: puppet fail [17:47:02] PROBLEM - puppet last run on db1065 is CRITICAL: CRITICAL: puppet fail [17:47:05] !log rebooting various LVSes... 
[17:47:07] PROBLEM - puppet last run on db1046 is CRITICAL: CRITICAL: puppet fail [17:47:07] PROBLEM - puppet last run on db2002 is CRITICAL: CRITICAL: puppet fail [17:47:07] PROBLEM - puppet last run on es2005 is CRITICAL: CRITICAL: puppet fail [17:47:07] RECOVERY - puppet last run on db2007 is OK: OK: Puppet is currently enabled, last run 10 seconds ago with 0 failures [17:47:08] PROBLEM - puppet last run on es2002 is CRITICAL: CRITICAL: puppet fail [17:47:08] PROBLEM - puppet last run on es2001 is CRITICAL: CRITICAL: puppet fail [17:47:08] PROBLEM - puppet last run on mw1031 is CRITICAL: CRITICAL: Puppet has 57 failures [17:47:08] PROBLEM - puppet last run on mc1003 is CRITICAL: CRITICAL: puppet fail [17:47:09] PROBLEM - puppet last run on ms1004 is CRITICAL: CRITICAL: puppet fail [17:47:10] PROBLEM - puppet last run on mw1176 is CRITICAL: CRITICAL: puppet fail [17:47:10] PROBLEM - puppet last run on lvs4002 is CRITICAL: CRITICAL: puppet fail [17:47:11] PROBLEM - puppet last run on cp4007 is CRITICAL: CRITICAL: Puppet has 50 failures [17:47:14] Logged the message, Master [17:47:27] PROBLEM - puppet last run on mw1007 is CRITICAL: CRITICAL: puppet fail [17:47:27] PROBLEM - puppet last run on db1050 is CRITICAL: CRITICAL: puppet fail [17:47:27] PROBLEM - puppet last run on mw1052 is CRITICAL: CRITICAL: puppet fail [17:47:27] PROBLEM - puppet last run on mw1235 is CRITICAL: CRITICAL: puppet fail [17:47:27] PROBLEM - puppet last run on mw1026 is CRITICAL: CRITICAL: puppet fail [17:47:28] PROBLEM - puppet last run on rbf1002 is CRITICAL: CRITICAL: puppet fail [17:47:28] RECOVERY - puppet last run on wtp1005 is OK: OK: Puppet is currently enabled, last run 1 second ago with 0 failures [17:47:29] PROBLEM - puppet last run on mw1088 is CRITICAL: CRITICAL: puppet fail [17:47:29] PROBLEM - puppet last run on mw1012 is CRITICAL: CRITICAL: puppet fail [17:47:30] PROBLEM - puppet last run on lvs3001 is CRITICAL: CRITICAL: puppet fail [17:47:30] PROBLEM - puppet last run on amssq35 is CRITICAL: CRITICAL: puppet fail [17:47:30] RECOVERY - puppet last run on amssq48 is OK: OK: Puppet is currently enabled, last run 10 seconds ago with 0 failures [17:47:48] PROBLEM - puppet last run on ms-fe1004 is CRITICAL: CRITICAL: puppet fail [17:47:48] PROBLEM - puppet last run on es1008 is CRITICAL: CRITICAL: puppet fail [17:47:48] PROBLEM - puppet last run on mw1228 is CRITICAL: CRITICAL: puppet fail [17:47:48] PROBLEM - puppet last run on mw1003 is CRITICAL: CRITICAL: puppet fail [17:47:49] RECOVERY - puppet last run on mw1195 is OK: OK: Puppet is currently enabled, last run 19 seconds ago with 0 failures [17:47:58] PROBLEM - puppet last run on vanadium is CRITICAL: CRITICAL: Puppet has 24 failures [17:47:59] RECOVERY - puppet last run on virt1003 is OK: OK: Puppet is currently enabled, last run 44 seconds ago with 0 failures [17:48:07] RECOVERY - puppet last run on mw1237 is OK: OK: Puppet is currently enabled, last run 21 seconds ago with 0 failures [17:48:08] RECOVERY - puppet last run on mw1055 is OK: OK: Puppet is currently enabled, last run 13 seconds ago with 0 failures [17:48:18] RECOVERY - puppet last run on db1048 is OK: OK: Puppet is currently enabled, last run 31 seconds ago with 0 failures [17:48:18] RECOVERY - puppet last run on dataset1001 is OK: OK: Puppet is currently enabled, last run 43 seconds ago with 0 failures [17:48:18] RECOVERY - puppet last run on install2001 is OK: OK: Puppet is currently enabled, last run 33 seconds ago with 0 failures [17:48:27] RECOVERY - puppet last run on mc1014 is 
OK: OK: Puppet is currently enabled, last run 23 seconds ago with 0 failures [17:48:27] RECOVERY - puppet last run on mw1249 is OK: OK: Puppet is currently enabled, last run 36 seconds ago with 0 failures [17:48:28] RECOVERY - puppet last run on db1043 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [17:48:28] RECOVERY - puppet last run on cp1046 is OK: OK: Puppet is currently enabled, last run 16 seconds ago with 0 failures [17:48:28] RECOVERY - puppet last run on mw1051 is OK: OK: Puppet is currently enabled, last run 9 seconds ago with 0 failures [17:48:37] RECOVERY - puppet last run on analytics1016 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [17:48:38] PROBLEM - puppet last run on hooft is CRITICAL: CRITICAL: Puppet has 36 failures [17:48:38] RECOVERY - puppet last run on amssq55 is OK: OK: Puppet is currently enabled, last run 52 seconds ago with 0 failures [17:48:38] RECOVERY - puppet last run on ms-fe3002 is OK: OK: Puppet is currently enabled, last run 4 seconds ago with 0 failures [17:48:38] RECOVERY - puppet last run on db1026 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [17:48:58] RECOVERY - puppet last run on db1060 is OK: OK: Puppet is currently enabled, last run 43 seconds ago with 0 failures [17:48:58] RECOVERY - puppet last run on snapshot1001 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [17:48:58] RECOVERY - puppet last run on analytics1026 is OK: OK: Puppet is currently enabled, last run 42 seconds ago with 0 failures [17:48:58] !log reboot all swift machines in eqiad, in turn [17:49:07] RECOVERY - puppet last run on stat1003 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [17:49:07] Logged the message, Master [17:49:08] PROBLEM - Host lvs1004 is DOWN: PING CRITICAL - Packet loss = 100% [17:49:08] PROBLEM - Host lvs1006 is DOWN: PING CRITICAL - Packet loss = 100% [17:49:08] PROBLEM - Host lvs1005 is DOWN: PING CRITICAL - Packet loss = 100% [17:49:17] RECOVERY - puppet last run on es2004 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [17:49:17] RECOVERY - puppet last run on mw1168 is OK: OK: Puppet is currently enabled, last run 35 seconds ago with 0 failures [17:49:18] RECOVERY - puppet last run on mw1044 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [17:49:18] RECOVERY - puppet last run on elastic1015 is OK: OK: Puppet is currently enabled, last run 16 seconds ago with 0 failures [17:49:18] RECOVERY - puppet last run on gadolinium is OK: OK: Puppet is currently enabled, last run 46 seconds ago with 0 failures [17:49:18] RECOVERY - puppet last run on mw1125 is OK: OK: Puppet is currently enabled, last run 4 seconds ago with 0 failures [17:49:28] RECOVERY - puppet last run on mw1149 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [17:49:28] RECOVERY - puppet last run on cp4005 is OK: OK: Puppet is currently enabled, last run 17 seconds ago with 0 failures [17:49:28] RECOVERY - puppet last run on cp4019 is OK: OK: Puppet is currently enabled, last run 41 seconds ago with 0 failures [17:49:48] RECOVERY - puppet last run on analytics1013 is OK: OK: Puppet is currently enabled, last run 59 seconds ago with 0 failures [17:49:48] PROBLEM - Host ms-be1001 is DOWN: PING CRITICAL - Packet loss = 100% [17:49:48] RECOVERY - puppet last run on amssq34 is OK: OK: Puppet is currently enabled, last run 58 seconds ago with 0 failures 
[17:49:48] RECOVERY - puppet last run on ms-be3001 is OK: OK: Puppet is currently enabled, last run 52 seconds ago with 0 failures [17:49:48] RECOVERY - puppet last run on cp3010 is OK: OK: Puppet is currently enabled, last run 57 seconds ago with 0 failures [17:49:58] RECOVERY - DPKG on mw1154 is OK: All packages OK [17:50:08] RECOVERY - puppet last run on labsdb1006 is OK: OK: Puppet is currently enabled, last run 48 seconds ago with 0 failures [17:50:08] PROBLEM - Host cxserver.svc.eqiad.wmnet is DOWN: CRITICAL - Network Unreachable (10.2.2.18) [17:50:18] RECOVERY - puppet last run on db2037 is OK: OK: Puppet is currently enabled, last run 41 seconds ago with 0 failures [17:50:37] PROBLEM - DPKG on mw1155 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [17:50:38] RECOVERY - puppet last run on ms-be2008 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [17:50:38] PROBLEM - NTP on bast4001 is CRITICAL: NTP CRITICAL: Offset unknown [17:50:38] PROBLEM - Host lvs2006 is DOWN: PING CRITICAL - Packet loss = 100% [17:50:47] PROBLEM - NTP on hooft is CRITICAL: NTP CRITICAL: Offset unknown [17:50:48] RECOVERY - Host lvs1006 is UP: PING OK - Packet loss = 0%, RTA = 2.10 ms [17:50:48] RECOVERY - Host lvs1004 is UP: PING OK - Packet loss = 0%, RTA = 3.13 ms [17:50:58] RECOVERY - Host lvs1005 is UP: PING OK - Packet loss = 0%, RTA = 0.76 ms [17:51:07] RECOVERY - puppet last run on hooft is OK: OK: Puppet is currently enabled, last run 27 seconds ago with 0 failures [17:51:18] RECOVERY - Host cxserver.svc.eqiad.wmnet is UP: PING OK - Packet loss = 0%, RTA = 1.07 ms [17:51:24] RECOVERY - Host ms-be1001 is UP: PING OK - Packet loss = 0%, RTA = 2.65 ms [17:51:48] PROBLEM - NTP on bast1001 is CRITICAL: NTP CRITICAL: Offset unknown [17:51:48] PROBLEM - NTP on oxygen is CRITICAL: NTP CRITICAL: Offset unknown [17:52:08] PROBLEM - NTP on mw1153 is CRITICAL: NTP CRITICAL: Offset unknown [17:52:08] RECOVERY - puppet last run on ms-be1001 is OK: OK: Puppet is currently enabled, last run 37 seconds ago with 0 failures [17:52:39] RECOVERY - Host lvs2006 is UP: PING OK - Packet loss = 0%, RTA = 43.80 ms [17:52:57] PROBLEM - Host ms-be1002 is DOWN: PING CRITICAL - Packet loss = 100% [17:53:07] RECOVERY - DPKG on mw1155 is OK: All packages OK [17:53:08] RECOVERY - NTP on bast4001 is OK: NTP OK: Offset 0.001741528511 secs [17:53:08] RECOVERY - NTP on hooft is OK: NTP OK: Offset 0.001508712769 secs [17:53:17] PROBLEM - Host mw1154 is DOWN: PING CRITICAL - Packet loss = 100% [17:53:47] RECOVERY - Host mw1154 is UP: PING OK - Packet loss = 0%, RTA = 1.32 ms [17:54:08] RECOVERY - NTP on bast1001 is OK: NTP OK: Offset -0.0004782676697 secs [17:54:17] RECOVERY - NTP on oxygen is OK: NTP OK: Offset 0.001764178276 secs [17:54:37] RECOVERY - NTP on mw1153 is OK: NTP OK: Offset 0.001346826553 secs [17:54:37] RECOVERY - puppet last run on lvs3003 is OK: OK: Puppet is currently enabled, last run 36 seconds ago with 0 failures [17:54:48] RECOVERY - puppet last run on mw1107 is OK: OK: Puppet is currently enabled, last run 23 seconds ago with 0 failures [17:54:57] RECOVERY - Host ms-be1002 is UP: PING OK - Packet loss = 0%, RTA = 2.42 ms [17:54:58] PROBLEM - Host ms-be1008 is DOWN: PING CRITICAL - Packet loss = 100% [17:55:08] RECOVERY - puppet last run on mw1241 is OK: OK: Puppet is currently enabled, last run 55 seconds ago with 0 failures [17:55:19] PROBLEM - Host mw1155 is DOWN: PING CRITICAL - Packet loss = 100% [17:55:19] RECOVERY - puppet last run on mw1204 is OK: OK: Puppet is currently 
enabled, last run 1 minute ago with 0 failures [17:55:28] RECOVERY - puppet last run on mw1090 is OK: OK: Puppet is currently enabled, last run 26 seconds ago with 0 failures [17:58:00] rebooted neon [17:58:43] mutante: so risky :p [17:59:26] it's back though [17:59:48] logmsgbot is, icinga-wm is not.. hrmm [17:59:54] !log rebooting labmon1001 [18:00:00] Logged the message, Master [18:00:10] PROBLEM - Host ms-be1012 is DOWN: PING CRITICAL - Packet loss = 100% [18:00:11] icinga-wm is meh anyway, just a constant spam to our lives [18:00:14] there we go [18:00:20] speak of the devil, my point :p [18:00:25] RECOVERY - puppet last run on zirconium is OK: OK: Puppet is currently enabled, last run 36 seconds ago with 0 failures [18:00:25] RECOVERY - Host ms-be1004 is UP: PING OK - Packet loss = 0%, RTA = 2.02 ms [18:00:35] PROBLEM - Host baham is DOWN: PING CRITICAL - Packet loss = 100% [18:01:06] RECOVERY - puppet last run on mw1109 is OK: OK: Puppet is currently enabled, last run 11 seconds ago with 0 failures [18:01:06] PROBLEM - DPKG on mw1156 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [18:01:06] PROBLEM - Host payments1001 is DOWN: PING CRITICAL - Packet loss = 100% [18:01:12] RECOVERY - puppet last run on mw1218 is OK: OK: Puppet is currently enabled, last run 37 seconds ago with 0 failures [18:01:12] PROBLEM - NTP on ms-be1004 is CRITICAL: NTP CRITICAL: Offset unknown [18:01:12] RECOVERY - puppet last run on mw1040 is OK: OK: Puppet is currently enabled, last run 28 seconds ago with 0 failures [18:01:13] RECOVERY - puppet last run on ytterbium is OK: OK: Puppet is currently enabled, last run 36 seconds ago with 0 failures [18:01:15] PROBLEM - Host barium is DOWN: PING CRITICAL - Packet loss = 100% [18:01:25] RECOVERY - puppet last run on mw1233 is OK: OK: Puppet is currently enabled, last run 2 seconds ago with 0 failures [18:01:25] RECOVERY - puppet last run on pc1001 is OK: OK: Puppet is currently enabled, last run 14 seconds ago with 0 failures [18:01:35] RECOVERY - Host ms-be1012 is UP: PING OK - Packet loss = 0%, RTA = 1.30 ms [18:01:35] RECOVERY - puppet last run on mw1132 is OK: OK: Puppet is currently enabled, last run 36 seconds ago with 0 failures [18:01:35] RECOVERY - puppet last run on uranium is OK: OK: Puppet is currently enabled, last run 5 seconds ago with 0 failures [18:01:35] PROBLEM - Host ns1-v4 is DOWN: PING CRITICAL - Packet loss = 100% [18:01:36] RECOVERY - puppet last run on mw1115 is OK: OK: Puppet is currently enabled, last run 3 seconds ago with 0 failures [18:01:45] RECOVERY - puppet last run on dbstore1001 is OK: OK: Puppet is currently enabled, last run 9 seconds ago with 0 failures [18:01:45] RECOVERY - puppet last run on mw1028 is OK: OK: Puppet is currently enabled, last run 17 seconds ago with 0 failures [18:01:46] RECOVERY - puppet last run on wtp1017 is OK: OK: Puppet is currently enabled, last run 20 seconds ago with 0 failures [18:01:46] RECOVERY - puppet last run on cp3022 is OK: OK: Puppet is currently enabled, last run 15 seconds ago with 0 failures [18:01:55] RECOVERY - puppet last run on mw1124 is OK: OK: Puppet is currently enabled, last run 52 seconds ago with 0 failures [18:01:55] RECOVERY - puppet last run on rdb1004 is OK: OK: Puppet is currently enabled, last run 49 seconds ago with 0 failures [18:01:56] RECOVERY - puppet last run on vanadium is OK: OK: Puppet is currently enabled, last run 23 seconds ago with 0 failures [18:01:56] RECOVERY - puppet last run on mw1048 is OK: OK: Puppet is currently enabled, last run 26 seconds 
ago with 0 failures [18:02:05] PROBLEM - Graphite Carbon on labmon1001 is CRITICAL: Connection refused by host [18:02:05] RECOVERY - puppet last run on mc1004 is OK: OK: Puppet is currently enabled, last run 58 seconds ago with 0 failures [18:02:05] PROBLEM - Host lvs1001 is DOWN: CRITICAL - Host Unreachable (208.80.154.55) [18:02:06] RECOVERY - puppet last run on cp1059 is OK: OK: Puppet is currently enabled, last run 36 seconds ago with 0 failures [18:02:06] RECOVERY - puppet last run on es2003 is OK: OK: Puppet is currently enabled, last run 41 seconds ago with 0 failures [18:02:15] RECOVERY - puppet last run on wtp1009 is OK: OK: Puppet is currently enabled, last run 30 seconds ago with 0 failures [18:02:15] PROBLEM - uWSGI web apps on labmon1001 is CRITICAL: Connection refused by host [18:02:15] RECOVERY - puppet last run on ms-be2013 is OK: OK: Puppet is currently enabled, last run 56 seconds ago with 0 failures [18:02:15] RECOVERY - puppet last run on cp4007 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [18:02:16] RECOVERY - puppet last run on cp3013 is OK: OK: Puppet is currently enabled, last run 17 seconds ago with 0 failures [18:02:16] RECOVERY - puppet last run on db1065 is OK: OK: Puppet is currently enabled, last run 6 seconds ago with 0 failures [18:02:16] RECOVERY - puppet last run on elastic1031 is OK: OK: Puppet is currently enabled, last run 38 seconds ago with 0 failures [18:02:16] PROBLEM - RAID on labmon1001 is CRITICAL: Connection refused by host [18:02:26] RECOVERY - puppet last run on mc1016 is OK: OK: Puppet is currently enabled, last run 33 seconds ago with 0 failures [18:02:26] RECOVERY - puppet last run on db1029 is OK: OK: Puppet is currently enabled, last run 53 seconds ago with 0 failures [18:02:26] RECOVERY - puppet last run on es1001 is OK: OK: Puppet is currently enabled, last run 19 seconds ago with 0 failures [18:02:26] PROBLEM - configured eth on labmon1001 is CRITICAL: Connection refused by host [18:02:26] RECOVERY - puppet last run on logstash1003 is OK: OK: Puppet is currently enabled, last run 53 seconds ago with 0 failures [18:02:26] RECOVERY - puppet last run on mw1038 is OK: OK: Puppet is currently enabled, last run 47 seconds ago with 0 failures [18:02:26] RECOVERY - puppet last run on mw1067 is OK: OK: Puppet is currently enabled, last run 40 seconds ago with 0 failures [18:02:28] RECOVERY - puppet last run on ms-be1004 is OK: OK: Puppet is currently enabled, last run 15 seconds ago with 0 failures [18:02:28] RECOVERY - puppet last run on mw1080 is OK: OK: Puppet is currently enabled, last run 40 seconds ago with 0 failures [18:02:35] RECOVERY - Host barium is UP: PING WARNING - Packet loss = 93%, RTA = 0.68 ms [18:02:41] RECOVERY - puppet last run on mw1059 is OK: OK: Puppet is currently enabled, last run 3 seconds ago with 0 failures [18:02:41] RECOVERY - puppet last run on mw1072 is OK: OK: Puppet is currently enabled, last run 42 seconds ago with 0 failures [18:02:41] RECOVERY - puppet last run on mw1256 is OK: OK: Puppet is currently enabled, last run 46 seconds ago with 0 failures [18:02:41] RECOVERY - puppet last run on mw1240 is OK: OK: Puppet is currently enabled, last run 40 seconds ago with 0 failures [18:02:41] RECOVERY - puppet last run on ms-be2003 is OK: OK: Puppet is currently enabled, last run 10 seconds ago with 0 failures [18:02:42] RECOVERY - puppet last run on cerium is OK: OK: Puppet is currently enabled, last run 26 seconds ago with 0 failures [18:02:42] RECOVERY - puppet last run on 
mw1134 is OK: OK: Puppet is currently enabled, last run 59 seconds ago with 0 failures [18:02:42] RECOVERY - puppet last run on mw1178 is OK: OK: Puppet is currently enabled, last run 56 seconds ago with 0 failures [18:02:45] RECOVERY - Host baham is UP: PING OK - Packet loss = 0%, RTA = 43.17 ms [18:02:45] PROBLEM - dhclient process on labmon1001 is CRITICAL: Connection refused by host [18:02:45] PROBLEM - NTP on ms-be1012 is CRITICAL: NTP CRITICAL: Offset unknown [18:02:45] RECOVERY - puppet last run on mw1006 is OK: OK: Puppet is currently enabled, last run 13 seconds ago with 0 failures [18:02:47] hm, certain hosts seem to be cycling within icinga a fair bit [18:02:55] RECOVERY - puppet last run on mw1089 is OK: OK: Puppet is currently enabled, last run 49 seconds ago with 0 failures [18:02:55] RECOVERY - puppet last run on db1053 is OK: OK: Puppet is currently enabled, last run 58 seconds ago with 0 failures [18:02:55] PROBLEM - DPKG on labmon1001 is CRITICAL: Connection refused by host [18:02:55] RECOVERY - puppet last run on es2005 is OK: OK: Puppet is currently enabled, last run 3 seconds ago with 0 failures [18:02:55] RECOVERY - puppet last run on amssq43 is OK: OK: Puppet is currently enabled, last run 28 seconds ago with 0 failures [18:02:56] RECOVERY - puppet last run on mw1045 is OK: OK: Puppet is currently enabled, last run 9 seconds ago with 0 failures [18:02:56] RECOVERY - puppet last run on analytics1033 is OK: OK: Puppet is currently enabled, last run 37 seconds ago with 0 failures [18:02:56] RECOVERY - puppet last run on mw1200 is OK: OK: Puppet is currently enabled, last run 39 seconds ago with 0 failures [18:02:57] PROBLEM - graphite.wikimedia.org on labmon1001 is CRITICAL: Connection refused [18:02:57] RECOVERY - puppet last run on labstore2001 is OK: OK: Puppet is currently enabled, last run 13 seconds ago with 0 failures [18:02:58] RECOVERY - puppet last run on analytics1017 is OK: OK: Puppet is currently enabled, last run 54 seconds ago with 0 failures [18:02:58] RECOVERY - puppet last run on amssq44 is OK: OK: Puppet is currently enabled, last run 15 seconds ago with 0 failures [18:03:05] RECOVERY - puppet last run on mw1012 is OK: OK: Puppet is currently enabled, last run 22 seconds ago with 0 failures [18:03:05] RECOVERY - puppet last run on mw1141 is OK: OK: Puppet is currently enabled, last run 18 seconds ago with 0 failures [18:03:05] PROBLEM - Disk space on labmon1001 is CRITICAL: Connection refused by host [18:03:05] RECOVERY - puppet last run on es1008 is OK: OK: Puppet is currently enabled, last run 29 seconds ago with 0 failures [18:03:05] RECOVERY - puppet last run on mw1041 is OK: OK: Puppet is currently enabled, last run 20 seconds ago with 0 failures [18:03:06] PROBLEM - Host mw1156 is DOWN: PING CRITICAL - Packet loss = 100% [18:03:06] RECOVERY - puppet last run on netmon1001 is OK: OK: Puppet is currently enabled, last run 18 seconds ago with 0 failures [18:03:07] PROBLEM - puppet last run on labmon1001 is CRITICAL: Connection refused by host [18:03:26] RECOVERY - puppet last run on mw1140 is OK: OK: Puppet is currently enabled, last run 3 seconds ago with 0 failures [18:03:26] RECOVERY - puppet last run on carbon is OK: OK: Puppet is currently enabled, last run 46 seconds ago with 0 failures [18:03:26] RECOVERY - puppet last run on cp3006 is OK: OK: Puppet is currently enabled, last run 37 seconds ago with 0 failures [18:03:26] RECOVERY - Host lvs1001 is UP: PING OK - Packet loss = 0%, RTA = 0.95 ms [18:03:26] RECOVERY - Host ns1-v4 is UP: PING OK 
- Packet loss = 0%, RTA = 44.56 ms [18:03:26] RECOVERY - puppet last run on cp3020 is OK: OK: Puppet is currently enabled, last run 5 seconds ago with 0 failures [18:03:26] RECOVERY - puppet last run on mc1006 is OK: OK: Puppet is currently enabled, last run 37 seconds ago with 0 failures [18:03:27] icinga's just going to be a mess for a while, it's a lot of churn going on all over our infrastructure [18:03:27] RECOVERY - puppet last run on mw1161 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [18:03:28] RECOVERY - puppet last run on caesium is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [18:03:29] RECOVERY - puppet last run on db1031 is OK: OK: Puppet is currently enabled, last run 9 seconds ago with 0 failures [18:03:36] RECOVERY - puppet last run on amssq59 is OK: OK: Puppet is currently enabled, last run 38 seconds ago with 0 failures [18:03:36] RECOVERY - puppet last run on lvs4002 is OK: OK: Puppet is currently enabled, last run 50 seconds ago with 0 failures [18:03:36] RECOVERY - puppet last run on es2002 is OK: OK: Puppet is currently enabled, last run 33 seconds ago with 0 failures [18:03:36] RECOVERY - puppet last run on es2008 is OK: OK: Puppet is currently enabled, last run 28 seconds ago with 0 failures [18:03:36] RECOVERY - puppet last run on rbf1002 is OK: OK: Puppet is currently enabled, last run 28 seconds ago with 0 failures [18:03:36] PROBLEM - Host lvs1002 is DOWN: CRITICAL - Host Unreachable (208.80.154.56) [18:03:36] RECOVERY - puppet last run on mw1106 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [18:03:37] RECOVERY - puppet last run on mw1197 is OK: OK: Puppet is currently enabled, last run 39 seconds ago with 0 failures [18:03:38] RECOVERY - puppet last run on cp3008 is OK: OK: Puppet is currently enabled, last run 23 seconds ago with 0 failures [18:03:45] RECOVERY - puppet last run on potassium is OK: OK: Puppet is currently enabled, last run 20 seconds ago with 0 failures [18:03:46] RECOVERY - puppet last run on analytics1025 is OK: OK: Puppet is currently enabled, last run 13 seconds ago with 0 failures [18:03:46] RECOVERY - puppet last run on cp1070 is OK: OK: Puppet is currently enabled, last run 58 seconds ago with 0 failures [18:03:46] RECOVERY - puppet last run on ms-fe2004 is OK: OK: Puppet is currently enabled, last run 26 seconds ago with 0 failures [18:03:55] RECOVERY - puppet last run on mw1145 is OK: OK: Puppet is currently enabled, last run 47 seconds ago with 0 failures [18:03:55] RECOVERY - puppet last run on ms-be2006 is OK: OK: Puppet is currently enabled, last run 26 seconds ago with 0 failures [18:04:05] RECOVERY - puppet last run on cp1055 is OK: OK: Puppet is currently enabled, last run 21 seconds ago with 0 failures [18:04:05] RECOVERY - puppet last run on ms-be1006 is OK: OK: Puppet is currently enabled, last run 55 seconds ago with 0 failures [18:04:05] PROBLEM - Host lvs2002 is DOWN: PING CRITICAL - Packet loss = 100% [18:04:05] PROBLEM - Host lvs2001 is DOWN: PING CRITICAL - Packet loss = 100% [18:04:06] RECOVERY - puppet last run on ms-fe1001 is OK: OK: Puppet is currently enabled, last run 0 seconds ago with 0 failures [18:04:06] RECOVERY - puppet last run on amssq54 is OK: OK: Puppet is currently enabled, last run 3 seconds ago with 0 failures [18:04:06] PROBLEM - DPKG on mw1157 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [18:04:15] RECOVERY - puppet last run on mw1082 is OK: OK: Puppet is currently enabled, last run 16 seconds 
ago with 0 failures [18:04:15] RECOVERY - puppet last run on es2001 is OK: OK: Puppet is currently enabled, last run 22 seconds ago with 0 failures [18:04:16] RECOVERY - puppet last run on db1022 is OK: OK: Puppet is currently enabled, last run 3 seconds ago with 0 failures [18:04:16] RECOVERY - puppet last run on lead is OK: OK: Puppet is currently enabled, last run 24 seconds ago with 0 failures [18:04:16] RECOVERY - puppet last run on snapshot1003 is OK: OK: Puppet is currently enabled, last run 32 seconds ago with 0 failures [18:04:25] RECOVERY - DPKG on mw1156 is OK: All packages OK [18:04:25] RECOVERY - Host lvs1002 is UP: PING OK - Packet loss = 0%, RTA = 0.86 ms [18:04:25] RECOVERY - Host mw1156 is UP: PING OK - Packet loss = 0%, RTA = 2.92 ms [18:04:26] RECOVERY - puppet last run on analytics1035 is OK: OK: Puppet is currently enabled, last run 24 seconds ago with 0 failures [18:04:26] PROBLEM - SSH on labmon1001 is CRITICAL: Connection refused [18:04:26] PROBLEM - Host lvs1003 is DOWN: CRITICAL - Host Unreachable (208.80.154.57) [18:04:26] RECOVERY - NTP on ms-be1004 is OK: NTP OK: Offset -0.1059185266 secs [18:04:26] RECOVERY - puppet last run on gold is OK: OK: Puppet is currently enabled, last run 48 seconds ago with 0 failures [18:04:35] RECOVERY - puppet last run on amssq49 is OK: OK: Puppet is currently enabled, last run 57 seconds ago with 0 failures [18:04:35] RECOVERY - puppet last run on elastic1018 is OK: OK: Puppet is currently enabled, last run 7 seconds ago with 0 failures [18:04:36] PROBLEM - salt-minion processes on labmon1001 is CRITICAL: Connection refused by host [18:04:36] RECOVERY - puppet last run on mc1003 is OK: OK: Puppet is currently enabled, last run 40 seconds ago with 0 failures [18:04:36] RECOVERY - puppet last run on cp1039 is OK: OK: Puppet is currently enabled, last run 45 seconds ago with 0 failures [18:04:36] RECOVERY - puppet last run on sca1002 is OK: OK: Puppet is currently enabled, last run 10 seconds ago with 0 failures [18:04:36] RECOVERY - puppet last run on mc1002 is OK: OK: Puppet is currently enabled, last run 16 seconds ago with 0 failures [18:04:46] RECOVERY - puppet last run on mw1224 is OK: OK: Puppet is currently enabled, last run 59 seconds ago with 0 failures [18:04:46] RECOVERY - puppet last run on ms1004 is OK: OK: Puppet is currently enabled, last run 11 seconds ago with 0 failures [18:04:46] RECOVERY - puppet last run on wtp1016 is OK: OK: Puppet is currently enabled, last run 22 seconds ago with 0 failures [18:04:46] RECOVERY - puppet last run on mw1205 is OK: OK: Puppet is currently enabled, last run 11 seconds ago with 0 failures [18:04:46] RECOVERY - puppet last run on mw1222 is OK: OK: Puppet is currently enabled, last run 25 seconds ago with 0 failures [18:04:47] RECOVERY - puppet last run on mw1176 is OK: OK: Puppet is currently enabled, last run 31 seconds ago with 0 failures [18:04:47] RECOVERY - puppet last run on mw1173 is OK: OK: Puppet is currently enabled, last run 26 seconds ago with 0 failures [18:04:55] RECOVERY - puppet last run on mw1242 is OK: OK: Puppet is currently enabled, last run 41 seconds ago with 0 failures [18:04:56] RECOVERY - NTP on ms-be1012 is OK: NTP OK: Offset 0.08329701424 secs [18:04:56] RECOVERY - puppet last run on mw1026 is OK: OK: Puppet is currently enabled, last run 57 seconds ago with 0 failures [18:04:56] RECOVERY - puppet last run on mw1046 is OK: OK: Puppet is currently enabled, last run 14 seconds ago with 0 failures [18:04:56] RECOVERY - puppet last run on mw1068 is OK: OK: 
Puppet is currently enabled, last run 14 seconds ago with 0 failures [18:04:56] RECOVERY - puppet last run on cp4006 is OK: OK: Puppet is currently enabled, last run 58 seconds ago with 0 failures [18:04:57] RECOVERY - puppet last run on db1050 is OK: OK: Puppet is currently enabled, last run 37 seconds ago with 0 failures [18:04:57] RECOVERY - puppet last run on mw1160 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [18:04:58] RECOVERY - puppet last run on mw1003 is OK: OK: Puppet is currently enabled, last run 27 seconds ago with 0 failures [18:05:06] RECOVERY - puppet last run on amssq53 is OK: OK: Puppet is currently enabled, last run 16 seconds ago with 0 failures [18:05:06] RECOVERY - puppet last run on ms-be2004 is OK: OK: Puppet is currently enabled, last run 1 second ago with 0 failures [18:05:06] RECOVERY - puppet last run on elastic1021 is OK: OK: Puppet is currently enabled, last run 36 seconds ago with 0 failures [18:05:06] RECOVERY - puppet last run on elastic1008 is OK: OK: Puppet is currently enabled, last run 52 seconds ago with 0 failures [18:05:06] RECOVERY - puppet last run on db2002 is OK: OK: Puppet is currently enabled, last run 59 seconds ago with 0 failures [18:05:06] RECOVERY - puppet last run on mw1009 is OK: OK: Puppet is currently enabled, last run 50 seconds ago with 0 failures [18:05:15] RECOVERY - puppet last run on mw1100 is OK: OK: Puppet is currently enabled, last run 35 seconds ago with 0 failures [18:05:15] RECOVERY - puppet last run on amssq32 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [18:05:15] RECOVERY - puppet last run on amssq61 is OK: OK: Puppet is currently enabled, last run 12 seconds ago with 0 failures [18:05:16] RECOVERY - puppet last run on mw1217 is OK: OK: Puppet is currently enabled, last run 31 seconds ago with 0 failures [18:05:16] RECOVERY - puppet last run on labsdb1003 is OK: OK: Puppet is currently enabled, last run 3 seconds ago with 0 failures [18:05:16] RECOVERY - puppet last run on mw1060 is OK: OK: Puppet is currently enabled, last run 15 seconds ago with 0 failures [18:05:16] RECOVERY - puppet last run on elastic1012 is OK: OK: Puppet is currently enabled, last run 52 seconds ago with 0 failures [18:05:16] RECOVERY - puppet last run on mw1120 is OK: OK: Puppet is currently enabled, last run 44 seconds ago with 0 failures [18:05:17] RECOVERY - puppet last run on mw1150 is OK: OK: Puppet is currently enabled, last run 24 seconds ago with 0 failures [18:05:25] RECOVERY - puppet last run on mw1228 is OK: OK: Puppet is currently enabled, last run 54 seconds ago with 0 failures [18:05:25] RECOVERY - puppet last run on mw1008 is OK: OK: Puppet is currently enabled, last run 53 seconds ago with 0 failures [18:05:25] RECOVERY - puppet last run on capella is OK: OK: Puppet is currently enabled, last run 57 seconds ago with 0 failures [18:05:26] RECOVERY - puppet last run on mw1226 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [18:05:26] PROBLEM - Host lvs2003 is DOWN: PING CRITICAL - Packet loss = 100% [18:05:26] RECOVERY - puppet last run on mw1069 is OK: OK: Puppet is currently enabled, last run 41 seconds ago with 0 failures [18:05:35] RECOVERY - puppet last run on ms-fe1004 is OK: OK: Puppet is currently enabled, last run 5 seconds ago with 0 failures [18:05:36] RECOVERY - puppet last run on mw1099 is OK: OK: Puppet is currently enabled, last run 49 seconds ago with 0 failures [18:05:36] RECOVERY - puppet last run on cp1061 is OK: OK: 
Puppet is currently enabled, last run 33 seconds ago with 0 failures [18:05:36] RECOVERY - puppet last run on es2009 is OK: OK: Puppet is currently enabled, last run 55 seconds ago with 0 failures [18:05:45] RECOVERY - puppet last run on ruthenium is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [18:05:46] RECOVERY - puppet last run on mw1088 is OK: OK: Puppet is currently enabled, last run 56 seconds ago with 0 failures [18:05:46] RECOVERY - puppet last run on db1018 is OK: OK: Puppet is currently enabled, last run 37 seconds ago with 0 failures [18:05:55] RECOVERY - puppet last run on db1046 is OK: OK: Puppet is currently enabled, last run 26 seconds ago with 0 failures [18:05:56] RECOVERY - puppet last run on amssq35 is OK: OK: Puppet is currently enabled, last run 29 seconds ago with 0 failures [18:05:56] RECOVERY - puppet last run on mw1166 is OK: OK: Puppet is currently enabled, last run 43 seconds ago with 0 failures [18:05:56] RECOVERY - Host lvs1003 is UP: PING OK - Packet loss = 0%, RTA = 1.08 ms [18:06:06] RECOVERY - Host lvs2001 is UP: PING OK - Packet loss = 0%, RTA = 43.01 ms [18:06:06] RECOVERY - puppet last run on mw1117 is OK: OK: Puppet is currently enabled, last run 48 seconds ago with 0 failures [18:06:06] RECOVERY - puppet last run on cp3003 is OK: OK: Puppet is currently enabled, last run 5 seconds ago with 0 failures [18:06:15] RECOVERY - puppet last run on cp3016 is OK: OK: Puppet is currently enabled, last run 22 seconds ago with 0 failures [18:06:16] RECOVERY - puppet last run on mw1153 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [18:06:25] RECOVERY - Host payments1001 is UP: PING OK - Packet loss = 0%, RTA = 2.06 ms [18:06:31] RECOVERY - Host lvs2002 is UP: PING OK - Packet loss = 0%, RTA = 43.36 ms [18:06:31] RECOVERY - puppet last run on mw1052 is OK: OK: Puppet is currently enabled, last run 58 seconds ago with 0 failures [18:06:35] PROBLEM - Host lvs3001 is DOWN: PING CRITICAL - Packet loss = 100% [18:06:45] RECOVERY - Host lvs2003 is UP: PING OK - Packet loss = 0%, RTA = 43.29 ms [18:06:55] RECOVERY - puppet last run on mw1025 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [18:06:56] RECOVERY - puppet last run on mw1235 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [18:07:06] PROBLEM - Host ms-be1005 is DOWN: PING CRITICAL - Packet loss = 100% [18:07:15] RECOVERY - puppet last run on lvs1002 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [18:07:16] RECOVERY - Host lvs3001 is UP: PING OK - Packet loss = 0%, RTA = 93.09 ms [18:07:26] RECOVERY - DPKG on mw1157 is OK: All packages OK [18:07:45] PROBLEM - Host ms-be1006 is DOWN: PING CRITICAL - Packet loss = 100% [18:07:56] RECOVERY - Host ms-be1005 is UP: PING OK - Packet loss = 0%, RTA = 0.82 ms [18:07:57] RECOVERY - puppet last run on lvs3001 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [18:08:36] PROBLEM - Host lvs3002 is DOWN: PING CRITICAL - Packet loss = 100% [18:09:05] RECOVERY - Host ms-be1006 is UP: PING OK - Packet loss = 0%, RTA = 1.00 ms [18:09:25] RECOVERY - Host lvs3002 is UP: PING OK - Packet loss = 0%, RTA = 95.12 ms [18:09:36] PROBLEM - NTP on lvs1006 is CRITICAL: NTP CRITICAL: Offset unknown [18:10:06] PROBLEM - NTP on lvs1005 is CRITICAL: NTP CRITICAL: Offset unknown [18:10:26] PROBLEM - Host lvs4001 is DOWN: PING CRITICAL - Packet loss = 100% [18:10:46] PROBLEM - NTP on lvs1004 is CRITICAL: NTP CRITICAL: 
Offset unknown [18:11:16] PROBLEM - Host rubidium is DOWN: CRITICAL - Host Unreachable (208.80.154.40) [18:11:36] RECOVERY - Host lvs4001 is UP: PING OK - Packet loss = 0%, RTA = 80.73 ms [18:11:46] RECOVERY - NTP on lvs1004 is OK: NTP OK: Offset -0.004118800163 secs [18:11:56] PROBLEM - Host mw1157 is DOWN: PING CRITICAL - Packet loss = 100% [18:12:06] PROBLEM - Host ns0-v4 is DOWN: PING CRITICAL - Packet loss = 100% [18:12:16] PROBLEM - Host ns0-v6 is DOWN: PING CRITICAL - Packet loss = 100% [18:12:46] PROBLEM - Host lvs4002 is DOWN: PING CRITICAL - Packet loss = 100% [18:12:55] RECOVERY - Host mw1157 is UP: PING OK - Packet loss = 0%, RTA = 2.09 ms [18:12:56] PROBLEM - NTP on ms-be1001 is CRITICAL: NTP CRITICAL: Offset unknown [18:13:16] PROBLEM - Host ms-be1007 is DOWN: PING CRITICAL - Packet loss = 100% [18:13:36] RECOVERY - Host rubidium is UP: PING OK - Packet loss = 0%, RTA = 1.31 ms [18:13:36] RECOVERY - Host ns0-v4 is UP: PING OK - Packet loss = 0%, RTA = 1.73 ms [18:13:46] RECOVERY - Host lvs4002 is UP: PING OK - Packet loss = 0%, RTA = 81.47 ms [18:13:47] RECOVERY - Host ns0-v6 is UP: PING OK - Packet loss = 0%, RTA = 1.04 ms [18:13:47] PROBLEM - Host cp1066 is DOWN: PING CRITICAL - Packet loss = 100% [18:13:47] RECOVERY - NTP on lvs1006 is OK: NTP OK: Offset -0.002675652504 secs [18:13:55] PROBLEM - NTP on lvs4003 is CRITICAL: NTP CRITICAL: Offset unknown [18:13:56] PROBLEM - HTTP 5xx req/min on tungsten is CRITICAL: CRITICAL: 6.67% of data above the critical threshold [500.0] [18:14:15] RECOVERY - Host ms-be1007 is UP: PING OK - Packet loss = 0%, RTA = 1.80 ms [18:14:25] RECOVERY - NTP on lvs1005 is OK: NTP OK: Offset -0.003965735435 secs [18:14:36] PROBLEM - NTP on lvs3004 is CRITICAL: NTP CRITICAL: Offset unknown [18:14:36] PROBLEM - DPKG on mw1158 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [18:14:56] PROBLEM - NTP on ms-be1003 is CRITICAL: NTP CRITICAL: Offset unknown [18:15:15] RECOVERY - Host cp1066 is UP: PING OK - Packet loss = 0%, RTA = 1.02 ms [18:15:36] PROBLEM - Host labmon1001 is DOWN: PING CRITICAL - Packet loss = 100% [18:15:59] 3WMF-NDA-Requests, operations, WMF-Legal: Add Addshore to phabricator WMF-NDA group - https://phabricator.wikimedia.org/T87651#996269 (10Addshore) It might be worth getting the list of NDA users from LDAP and adding their account on phab to the group also. This would avoid duplicate tickets like this! [18:16:25] RECOVERY - salt-minion processes on labmon1001 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [18:16:25] RECOVERY - configured eth on labmon1001 is OK: NRPE: Unable to read output [18:16:35] RECOVERY - Host labmon1001 is UP: PING OK - Packet loss = 0%, RTA = 0.26 ms [18:16:36] RECOVERY - dhclient process on labmon1001 is OK: PROCS OK: 0 processes with command name dhclient [18:16:45] RECOVERY - DPKG on labmon1001 is OK: All packages OK [18:16:46] RECOVERY - NTP on lvs3004 is OK: NTP OK: Offset 0.000389456749 secs [18:16:56] RECOVERY - Disk space on labmon1001 is OK: DISK OK [18:16:57] RECOVERY - puppet last run on labmon1001 is OK: OK: Puppet is currently enabled, last run 28 minutes ago with 0 failures [18:17:06] RECOVERY - Graphite Carbon on labmon1001 is OK: OK: All defined Carbon jobs are runnning. 
[18:17:16] RECOVERY - SSH on labmon1001 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2 (protocol 2.0) [18:17:25] PROBLEM - Varnish traffic logger on cp1066 is CRITICAL: PROCS CRITICAL: 0 processes with command name varnishncsa [18:17:26] RECOVERY - RAID on labmon1001 is OK: OK: Active: 4, Working: 4, Failed: 0, Spare: 0 [18:17:26] PROBLEM - Varnish HTTP text-frontend on cp1066 is CRITICAL: Connection refused [18:17:55] PROBLEM - Varnishkafka log producer on cp1066 is CRITICAL: PROCS CRITICAL: 0 processes with command name varnishkafka [18:17:55] PROBLEM - Host netmon1001 is DOWN: PING CRITICAL - Packet loss = 100% [18:17:56] PROBLEM - NTP peers on hydrogen is CRITICAL: NTP CRITICAL: Server not synchronized, Offset unknown [18:18:06] PROBLEM - Host 208.80.153.12 is DOWN: PING CRITICAL - Packet loss = 100% [18:18:06] PROBLEM - Host acamar is DOWN: PING CRITICAL - Packet loss = 100% [18:18:15] RECOVERY - NTP on ms-be1003 is OK: NTP OK: Offset -0.004372477531 secs [18:18:16] RECOVERY - NTP on lvs4003 is OK: NTP OK: Offset -0.002934336662 secs [18:18:17] RECOVERY - NTP on ms-be1001 is OK: NTP OK: Offset -0.003832936287 secs [18:18:17] PROBLEM - NTP on lvs1001 is CRITICAL: NTP CRITICAL: Offset unknown [18:18:26] PROBLEM - Host 2620:0:860:1:d6ae:52ff:feac:4dc8 is DOWN: /bin/ping6 -n -U -w 15 -c 5 2620:0:860:1:d6ae:52ff:feac:4dc8 [18:18:36] PROBLEM - Host osm-cp1001 is DOWN: PING CRITICAL - Packet loss = 100% [18:18:37] PROBLEM - Host ms-be1009 is DOWN: PING CRITICAL - Packet loss = 100% [18:18:46] PROBLEM - Host mw1158 is DOWN: PING CRITICAL - Packet loss = 100% [18:18:56] RECOVERY - NTP on ms-be1008 is OK: NTP OK: Offset -0.007517695427 secs [18:18:56] RECOVERY - graphite.wikimedia.org on labmon1001 is OK: HTTP OK: HTTP/1.1 200 OK - 1607 bytes in 0.017 second response time [18:19:00] PROBLEM - Host ms-be1011 is DOWN: PING CRITICAL - Packet loss = 100% [18:19:05] RECOVERY - Host netmon1001 is UP: PING OK - Packet loss = 0%, RTA = 1.23 ms [18:19:05] RECOVERY - Host acamar is UP: PING OK - Packet loss = 0%, RTA = 43.51 ms [18:19:46] PROBLEM - NTP on mw1156 is CRITICAL: NTP CRITICAL: Offset unknown [18:19:55] RECOVERY - Host ms-be1009 is UP: PING OK - Packet loss = 0%, RTA = 0.59 ms [18:19:55] RECOVERY - Host mw1158 is UP: PING OK - Packet loss = 0%, RTA = 3.36 ms [18:20:06] RECOVERY - DPKG on mw1158 is OK: All packages OK [18:20:06] RECOVERY - NTP peers on hydrogen is OK: NTP OK: Offset 0.000402 secs [18:20:06] PROBLEM - NTP on lvs1002 is CRITICAL: NTP CRITICAL: Offset unknown [18:20:26] PROBLEM - Host iodine is DOWN: PING CRITICAL - Packet loss = 100% [18:20:26] RECOVERY - Host ms-be1011 is UP: PING OK - Packet loss = 0%, RTA = 0.63 ms [18:20:26] RECOVERY - Host ms-be1010 is UP: PING OK - Packet loss = 0%, RTA = 0.54 ms [18:20:26] PROBLEM - puppet last run on cp1066 is CRITICAL: CRITICAL: Puppet has 1 failures [18:21:05] PROBLEM - url_downloader on chromium is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:21:16] PROBLEM - NTP peers on acamar is CRITICAL: NTP CRITICAL: Server not synchronized, Offset unknown [18:21:25] PROBLEM - Apache HTTP on mw1156 is CRITICAL: Connection timed out [18:21:37] RECOVERY - url_downloader on chromium is OK: TCP OK - 0.166 second response time on port 8080 [18:21:37] PROBLEM - puppetmaster backend https on strontium is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:21:37] PROBLEM - Apache HTTP on mw1155 is CRITICAL: Connection timed out [18:21:45] PROBLEM - Apache HTTP on mw1154 is CRITICAL: Connection timed out [18:21:46] RECOVERY - Host iodine 
is UP: PING OK - Packet loss = 0%, RTA = 1.22 ms [18:21:46] PROBLEM - Apache HTTP on mw1160 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:21:47] PROBLEM - NTP on lvs1003 is CRITICAL: NTP CRITICAL: Offset unknown [18:21:56] PROBLEM - Apache HTTP on mw1159 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:22:06] PROBLEM - Apache HTTP on mw1153 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:22:06] PROBLEM - LVS HTTP IPv4 on rendering.svc.eqiad.wmnet is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:22:16] PROBLEM - NTP peers on chromium is CRITICAL: NTP CRITICAL: Server not synchronized, Offset unknown [18:22:35] PROBLEM - NTP on lvs3001 is CRITICAL: NTP CRITICAL: Offset unknown [18:22:36] RECOVERY - puppetmaster backend https on strontium is OK: HTTP OK: Status line output matched 400 - 335 bytes in 0.023 second response time [18:22:46] PROBLEM - LVS HTTP IPv4 on ocg.svc.eqiad.wmnet is CRITICAL: Connection refused [18:23:06] PROBLEM - puppet last run on mw1233 is CRITICAL: CRITICAL: puppet fail [18:23:06] PROBLEM - DPKG on rhenium is CRITICAL: DPKG CRITICAL dpkg reports broken packages [18:23:16] PROBLEM - DPKG on mw1159 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [18:23:16] PROBLEM - NTP on ms-be1005 is CRITICAL: NTP CRITICAL: Offset unknown [18:23:24] Hm, noisy in here [18:23:35] RECOVERY - Host 208.80.153.12 is UP: PING OK - Packet loss = 0%, RTA = 43.70 ms [18:23:35] RECOVERY - NTP peers on chromium is OK: NTP OK: Offset 5.5e-05 secs [18:23:35] PROBLEM - puppet last run on acamar is CRITICAL: CRITICAL: puppet fail [18:23:35] PROBLEM - puppet last run on wtp1017 is CRITICAL: CRITICAL: puppet fail [18:23:36] PROBLEM - puppet last run on virt1012 is CRITICAL: CRITICAL: puppet fail [18:23:41] <_joe_> imagescalers down? [18:23:45] PROBLEM - Host achernar is DOWN: PING CRITICAL - Packet loss = 100% [18:23:45] PROBLEM - Host 208.80.153.42 is DOWN: PING CRITICAL - Packet loss = 100% [18:23:45] RECOVERY - git.wikimedia.org on antimony is OK: HTTP OK: HTTP/1.1 200 OK - 59701 bytes in 0.261 second response time [18:23:46] RECOVERY - Host 2620:0:860:1:d6ae:52ff:feac:4dc8 is UP: PING OK - Packet loss = 0%, RTA = 56.27 ms [18:23:46] RECOVERY - Apache HTTP on mw1156 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 5.351 second response time [18:23:47] <_joe_> really? 
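_joe_'s "imagescalers down?" is reacting to the rendering.svc.eqiad.wmnet LVS alert just above: the monitor does a plain HTTP GET against the service address with a 10-second timeout and reports the status line, byte count, and response time. A rough sketch of that kind of probe follows; the URL and message wording are examples, not the exact Icinga command in use.

    #!/usr/bin/env python3
    # Rough sketch of an HTTP service probe in the style of the LVS checks
    # above; the URL and message wording are examples, not the real config.
    import socket
    import sys
    import time
    import urllib.error
    import urllib.request

    URL = "http://rendering.svc.eqiad.wmnet/"  # example service address
    TIMEOUT = 10  # seconds, matching "Socket timeout after 10 seconds"

    def main():
        start = time.time()
        try:
            with urllib.request.urlopen(URL, timeout=TIMEOUT) as resp:
                body = resp.read()
                status, reason = resp.status, resp.reason
        except socket.timeout:
            print("CRITICAL - Socket timeout after %d seconds" % TIMEOUT)
            return 2
        except urllib.error.URLError as exc:
            print("CRITICAL - %s" % exc.reason)
            return 2
        elapsed = time.time() - start
        print("HTTP OK: HTTP/1.1 %s %s - %d bytes in %.3f second response time"
              % (status, reason, len(body), elapsed))
        return 0

    if __name__ == "__main__":
        sys.exit(main())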
[18:23:56] RECOVERY - Apache HTTP on mw1154 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 0.090 second response time [18:23:56] RECOVERY - Host osm-cp1001 is UP: PING OK - Packet loss = 0%, RTA = 1.36 ms [18:23:56] RECOVERY - LVS HTTP IPv4 on ocg.svc.eqiad.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 456 bytes in 0.017 second response time [18:24:05] RECOVERY - Apache HTTP on mw1155 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 8.146 second response time [18:24:05] PROBLEM - Host 2620:0:860:2:d6ae:52ff:fead:5610 is DOWN: PING CRITICAL - Packet loss = 100% [18:24:16] RECOVERY - Apache HTTP on mw1159 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 0.093 second response time [18:24:16] RECOVERY - DPKG on rhenium is OK: All packages OK [18:24:25] PROBLEM - puppet last run on ms-be1005 is CRITICAL: CRITICAL: Puppet has 1 failures [18:24:26] RECOVERY - Host achernar is UP: PING OK - Packet loss = 0%, RTA = 43.44 ms [18:24:26] PROBLEM - puppet last run on uranium is CRITICAL: CRITICAL: puppet fail [18:24:26] RECOVERY - Apache HTTP on mw1153 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 4.137 second response time [18:24:26] PROBLEM - puppet last run on erbium is CRITICAL: CRITICAL: Puppet has 1 failures [18:24:26] PROBLEM - puppet last run on mc1008 is CRITICAL: CRITICAL: Puppet has 1 failures [18:24:35] RECOVERY - LVS HTTP IPv4 on rendering.svc.eqiad.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 64606 bytes in 4.844 second response time [18:24:42] PROBLEM - Host ms-be1013 is DOWN: PING CRITICAL - Packet loss = 100% [18:24:42] PROBLEM - DPKG on tin is CRITICAL: DPKG CRITICAL dpkg reports broken packages [18:24:56] PROBLEM - puppet last run on mw1124 is CRITICAL: CRITICAL: Puppet has 2 failures [18:24:57] PROBLEM - puppet last run on mw1221 is CRITICAL: CRITICAL: Puppet has 1 failures [18:24:57] PROBLEM - puppet last run on mw1196 is CRITICAL: CRITICAL: Puppet has 2 failures [18:25:06] PROBLEM - puppet last run on ms-be1009 is CRITICAL: CRITICAL: Puppet has 1 failures [18:25:06] PROBLEM - puppet last run on mw1096 is CRITICAL: CRITICAL: Puppet has 2 failures [18:25:06] PROBLEM - puppet last run on rubidium is CRITICAL: CRITICAL: Puppet has 1 failures [18:25:06] PROBLEM - puppet last run on cp1059 is CRITICAL: CRITICAL: Puppet has 1 failures [18:25:06] PROBLEM - puppet last run on cp4017 is CRITICAL: CRITICAL: Puppet has 1 failures [18:25:15] PROBLEM - check_mysql on lutetium is CRITICAL: Cant connect to local MySQL server through socket /tmp/mysql.sock (2) [18:25:16] PROBLEM - puppet last run on mw1214 is CRITICAL: CRITICAL: Puppet has 1 failures [18:25:16] RECOVERY - NTP on lvs1001 is OK: NTP OK: Offset -0.001872181892 secs [18:25:16] PROBLEM - puppet last run on ytterbium is CRITICAL: CRITICAL: Puppet has 1 failures [18:25:16] PROBLEM - puppet last run on cp1043 is CRITICAL: CRITICAL: Puppet has 1 failures [18:25:16] PROBLEM - puppet last run on wtp1021 is CRITICAL: CRITICAL: Puppet has 1 failures [18:25:16] PROBLEM - puppet last run on mw1036 is CRITICAL: CRITICAL: Puppet has 1 failures [18:25:16] PROBLEM - puppet last run on db1024 is CRITICAL: CRITICAL: Puppet has 1 failures [18:25:17] PROBLEM - puppet last run on mw1040 is CRITICAL: CRITICAL: Puppet has 1 failures [18:25:17] PROBLEM - NTP on ms-be1006 is CRITICAL: NTP CRITICAL: Offset unknown [18:25:17] PROBLEM - puppet last run on cp1057 is CRITICAL: CRITICAL: Puppet has 1 failures [18:25:25] PROBLEM - puppet last run on cp4007 is CRITICAL: CRITICAL: Puppet has 1 failures [18:25:25] PROBLEM 
- puppet last run on mw1109 is CRITICAL: CRITICAL: Puppet has 2 failures [18:25:25] PROBLEM - puppet last run on amssq45 is CRITICAL: CRITICAL: Puppet has 2 failures [18:25:25] PROBLEM - puppet last run on cp3013 is CRITICAL: CRITICAL: Puppet has 4 failures [18:25:25] PROBLEM - puppet last run on elastic1028 is CRITICAL: CRITICAL: Puppet has 2 failures [18:25:25] PROBLEM - puppet last run on mw1246 is CRITICAL: CRITICAL: Puppet has 1 failures [18:25:26] PROBLEM - puppet last run on mw1038 is CRITICAL: CRITICAL: Puppet has 1 failures [18:25:26] RECOVERY - NTP on mw1156 is OK: NTP OK: Offset -0.00342977047 secs [18:25:26] PROBLEM - puppet last run on mw1067 is CRITICAL: CRITICAL: Puppet has 1 failures [18:25:28] PROBLEM - puppet last run on mw1013 is CRITICAL: CRITICAL: Puppet has 2 failures [18:25:28] PROBLEM - puppet last run on mw1197 is CRITICAL: CRITICAL: Puppet has 1 failures [18:25:45] PROBLEM - puppet last run on db1053 is CRITICAL: CRITICAL: Puppet has 3 failures [18:25:45] PROBLEM - puppet last run on mw1089 is CRITICAL: CRITICAL: Puppet has 1 failures [18:25:46] PROBLEM - puppet last run on cp4013 is CRITICAL: CRITICAL: Puppet has 2 failures [18:25:46] PROBLEM - puppet last run on mw1083 is CRITICAL: CRITICAL: Puppet has 2 failures [18:25:46] PROBLEM - puppet last run on mw1045 is CRITICAL: CRITICAL: Puppet has 1 failures [18:25:46] PROBLEM - Apache HTTP on mw1158 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:25:46] PROBLEM - puppet last run on mw1200 is CRITICAL: CRITICAL: Puppet has 1 failures [18:25:47] PROBLEM - puppet last run on tmh1002 is CRITICAL: CRITICAL: Puppet has 1 failures [18:25:47] RECOVERY - NTP on lvs1002 is OK: NTP OK: Offset -0.001028776169 secs [18:25:55] PROBLEM - puppet last run on amssq43 is CRITICAL: CRITICAL: Puppet has 2 failures [18:25:55] PROBLEM - puppet last run on amssq57 is CRITICAL: CRITICAL: Puppet has 2 failures [18:25:55] PROBLEM - puppet last run on mw1012 is CRITICAL: CRITICAL: Puppet has 1 failures [18:25:56] PROBLEM - puppet last run on mw1245 is CRITICAL: CRITICAL: Puppet has 3 failures [18:25:56] PROBLEM - puppet last run on neptunium is CRITICAL: CRITICAL: Puppet has 4 failures [18:25:57] PROBLEM - puppet last run on mw1041 is CRITICAL: CRITICAL: Puppet has 3 failures [18:25:57] PROBLEM - puppet last run on ocg1001 is CRITICAL: CRITICAL: Puppet has 4 failures [18:26:05] RECOVERY - NTP on lvs3001 is OK: NTP OK: Offset 0.0003789663315 secs [18:26:05] PROBLEM - puppet last run on amssq54 is CRITICAL: CRITICAL: Puppet has 1 failures [18:26:06] PROBLEM - puppet last run on mw1048 is CRITICAL: CRITICAL: Puppet has 1 failures [18:26:06] PROBLEM - Host ms-be1015 is DOWN: PING CRITICAL - Packet loss = 100% [18:26:15] PROBLEM - Recursive DNS on 2620:0:860:1:d6ae:52ff:feac:4dc8 is CRITICAL: CRITICAL - Plugin timed out while executing system call [18:26:15] PROBLEM - puppet last run on mw1063 is CRITICAL: CRITICAL: Puppet has 2 failures [18:26:15] PROBLEM - puppet last run on mw1140 is CRITICAL: CRITICAL: Puppet has 4 failures [18:26:16] RECOVERY - Host ms-be1013 is UP: PING OK - Packet loss = 0%, RTA = 1.38 ms [18:26:16] PROBLEM - puppet last run on cp3015 is CRITICAL: CRITICAL: Puppet has 1 failures [18:26:16] PROBLEM - salt-minion processes on osm-cp1001 is CRITICAL: Connection refused by host [18:26:26] PROBLEM - puppet last run on amssq49 is CRITICAL: CRITICAL: Puppet has 2 failures [18:26:26] RECOVERY - NTP on lvs1003 is OK: NTP OK: Offset -0.0009608268738 secs [18:26:26] PROBLEM - Host rhenium is DOWN: CRITICAL - Host 
Unreachable (208.80.154.52) [18:26:26] PROBLEM - Disk space on osm-cp1001 is CRITICAL: Connection refused by host [18:26:35] PROBLEM - puppet last run on mw1059 is CRITICAL: CRITICAL: Puppet has 2 failures [18:26:36] PROBLEM - puppet last run on osm-cp1001 is CRITICAL: Connection refused by host [18:26:36] PROBLEM - puppet last run on lvs4002 is CRITICAL: CRITICAL: Puppet has 1 failures [18:26:36] RECOVERY - Host 208.80.153.42 is UP: PING OK - Packet loss = 0%, RTA = 44.50 ms [18:26:36] PROBLEM - Recursive DNS on 208.80.153.12 is CRITICAL: CRITICAL - Plugin timed out while executing system call [18:26:36] PROBLEM - puppet last run on cp3008 is CRITICAL: CRITICAL: Puppet has 1 failures [18:26:36] RECOVERY - DPKG on mw1159 is OK: All packages OK [18:26:36] RECOVERY - NTP on ms-be1005 is OK: NTP OK: Offset -0.01520657539 secs [18:26:46] PROBLEM - DPKG on osm-cp1001 is CRITICAL: Connection refused by host [18:26:46] RECOVERY - Apache HTTP on mw1158 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 0.076 second response time [18:26:46] RECOVERY - Host ms-be1014 is UP: PING OK - Packet loss = 0%, RTA = 1.74 ms [18:26:46] RECOVERY - Host ms-be1015 is UP: PING OK - Packet loss = 0%, RTA = 1.46 ms [18:26:46] RECOVERY - Host 2620:0:860:2:d6ae:52ff:fead:5610 is UP: PING OK - Packet loss = 0%, RTA = 44.08 ms [18:26:47] PROBLEM - puppet last run on mw1006 is CRITICAL: CRITICAL: Puppet has 3 failures [18:26:47] RECOVERY - DPKG on tin is OK: All packages OK [18:26:56] PROBLEM - configured eth on osm-cp1001 is CRITICAL: Connection refused by host [18:26:57] PROBLEM - RAID on osm-cp1001 is CRITICAL: Connection refused by host [18:27:17] PROBLEM - NTP on lvs4001 is CRITICAL: NTP CRITICAL: Offset unknown [18:27:17] PROBLEM - dhclient process on osm-cp1001 is CRITICAL: Connection refused by host [18:27:17] RECOVERY - Apache HTTP on mw1160 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 0.078 second response time [18:27:22] !log reboot swift in esams [18:27:26] RECOVERY - Host rhenium is UP: PING OK - Packet loss = 0%, RTA = 0.48 ms [18:27:29] Logged the message, Master [18:27:46] PROBLEM - puppet last run on ms-be2003 is CRITICAL: CRITICAL: Puppet has 1 failures [18:28:16] PROBLEM - NTP on mw1157 is CRITICAL: NTP CRITICAL: Offset unknown [18:28:26] PROBLEM - DPKG on mw1160 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [18:28:46] PROBLEM - Host tin is DOWN: PING CRITICAL - Packet loss = 100% [18:29:06] PROBLEM - Host ms-fe3001 is DOWN: PING CRITICAL - Packet loss = 100% [18:29:15] PROBLEM - NTP on rubidium is CRITICAL: NTP CRITICAL: Offset unknown [18:29:16] PROBLEM - Host ms-be3004 is DOWN: PING CRITICAL - Packet loss = 100% [18:29:17] PROBLEM - Host ms-fe3002 is DOWN: PING CRITICAL - Packet loss = 100% [18:29:17] PROBLEM - Host ms-be3001 is DOWN: PING CRITICAL - Packet loss = 100% [18:29:17] PROBLEM - Host ms-be3003 is DOWN: PING CRITICAL - Packet loss = 100% [18:29:27] godog: am rebooting tungsten. that ok? 
[18:29:36] PROBLEM - Host mw1159 is DOWN: PING CRITICAL - Packet loss = 100% [18:29:36] PROBLEM - Host ms-be3002 is DOWN: PING CRITICAL - Packet loss = 100% [18:29:41] !rebooting nembus (secondary ldap server) [18:29:45] PROBLEM - puppet last run on mw1205 is CRITICAL: CRITICAL: Puppet has 1 failures [18:29:45] PROBLEM - puppet last run on mw1164 is CRITICAL: CRITICAL: Puppet has 1 failures [18:29:46] PROBLEM - puppet last run on mw1011 is CRITICAL: CRITICAL: Puppet has 2 failures [18:29:56] RECOVERY - Host ms-fe3001 is UP: PING OK - Packet loss = 0%, RTA = 93.62 ms [18:29:56] RECOVERY - Host ms-fe3002 is UP: PING OK - Packet loss = 0%, RTA = 95.29 ms [18:29:56] PROBLEM - puppet last run on mw1172 is CRITICAL: CRITICAL: Puppet has 1 failures [18:30:15] RECOVERY - check_mysql on lutetium is OK: Uptime: 120 Threads: 1 Questions: 2210 Slow queries: 0 Opens: 27 Flush tables: 2 Open tables: 36 Queries per second avg: 18.416 Slave IO: Yes Slave SQL: Yes Seconds Behind Master: 0 [18:30:15] PROBLEM - puppet last run on mw1251 is CRITICAL: CRITICAL: Puppet has 1 failures [18:30:16] RECOVERY - puppet last run on ms-be1009 is OK: OK: Puppet is currently enabled, last run 9 seconds ago with 0 failures [18:30:16] RECOVERY - Recursive DNS on 2620:0:860:1:d6ae:52ff:feac:4dc8 is OK: DNS OK: 5.098 seconds response time. www.wikipedia.org returns 208.80.154.224 [18:30:17] RECOVERY - puppet last run on rubidium is OK: OK: Puppet is currently enabled, last run 58 seconds ago with 0 failures [18:30:25] PROBLEM - puppet last run on mw1175 is CRITICAL: CRITICAL: Puppet has 1 failures [18:30:26] RECOVERY - NTP on lvs4001 is OK: NTP OK: Offset -6.878376007e-05 secs [18:30:26] YuviPanda: ye go for it [18:30:35] RECOVERY - Host mw1159 is UP: PING OK - Packet loss = 0%, RTA = 2.34 ms [18:30:35] PROBLEM - puppet last run on mw1044 is CRITICAL: CRITICAL: Puppet has 1 failures [18:30:36] RECOVERY - Recursive DNS on 208.80.153.12 is OK: DNS OK: 0.436 seconds response time. 
www.wikipedia.org returns 208.80.154.224 [18:30:36] PROBLEM - Host osm-cp1001 is DOWN: PING CRITICAL - Packet loss = 100% [18:30:46] PROBLEM - puppet last run on snapshot1001 is CRITICAL: CRITICAL: Puppet has 1 failures [18:30:55] PROBLEM - NTP on cp1066 is CRITICAL: NTP CRITICAL: Offset unknown [18:30:56] RECOVERY - Host tin is UP: PING OK - Packet loss = 0%, RTA = 0.72 ms [18:30:56] PROBLEM - puppet last run on mw1055 is CRITICAL: CRITICAL: Puppet has 1 failures [18:30:56] PROBLEM - puppet last run on mw1206 is CRITICAL: CRITICAL: Puppet has 1 failures [18:31:00] !log rebooting tungsten [18:31:03] Logged the message, Master [18:31:05] PROBLEM - puppet last run on snapshot1002 is CRITICAL: CRITICAL: Puppet has 1 failures [18:31:06] RECOVERY - puppet last run on acamar is OK: OK: Puppet is currently enabled, last run 34 seconds ago with 0 failures [18:31:26] RECOVERY - NTP on mw1157 is OK: NTP OK: Offset -0.01120829582 secs [18:31:26] PROBLEM - puppet last run on mw1051 is CRITICAL: CRITICAL: Puppet has 1 failures [18:31:35] PROBLEM - puppet last run on mw1213 is CRITICAL: CRITICAL: Puppet has 1 failures [18:31:36] RECOVERY - DPKG on mw1160 is OK: All packages OK [18:31:36] PROBLEM - puppet last run on mw1156 is CRITICAL: CRITICAL: Puppet has 1 failures [18:31:45] PROBLEM - puppet last run on mw1014 is CRITICAL: CRITICAL: Puppet has 2 failures [18:31:45] PROBLEM - puppet last run on mw1030 is CRITICAL: CRITICAL: Puppet has 1 failures [18:31:46] PROBLEM - puppet last run on mw1144 is CRITICAL: CRITICAL: Puppet has 1 failures [18:31:46] PROBLEM - DPKG on dysprosium is CRITICAL: DPKG CRITICAL dpkg reports broken packages [18:31:46] PROBLEM - puppet last run on mw1049 is CRITICAL: CRITICAL: Puppet has 1 failures [18:31:56] PROBLEM - DPKG on labsdb1005 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [18:32:05] PROBLEM - puppet last run on mw1202 is CRITICAL: CRITICAL: Puppet has 1 failures [18:32:05] PROBLEM - puppet last run on mw1057 is CRITICAL: CRITICAL: Puppet has 1 failures [18:32:06] PROBLEM - puppet last run on mw1258 is CRITICAL: CRITICAL: Puppet has 1 failures [18:32:06] PROBLEM - puppet last run on mw1238 is CRITICAL: CRITICAL: Puppet has 1 failures [18:32:06] PROBLEM - puppet last run on mw1208 is CRITICAL: CRITICAL: Puppet has 1 failures [18:32:06] PROBLEM - puppet last run on osmium is CRITICAL: CRITICAL: Puppet has 1 failures [18:32:07] PROBLEM - puppet last run on mw1098 is CRITICAL: CRITICAL: Puppet has 1 failures [18:32:15] PROBLEM - puppet last run on mw1181 is CRITICAL: CRITICAL: Puppet has 1 failures [18:32:15] PROBLEM - puppet last run on mw1111 is CRITICAL: CRITICAL: Puppet has 1 failures [18:32:16] RECOVERY - Host ms-be3001 is UP: PING OK - Packet loss = 0%, RTA = 95.77 ms [18:32:16] PROBLEM - puppet last run on mw1079 is CRITICAL: CRITICAL: Puppet has 1 failures [18:32:16] PROBLEM - puppet last run on mw1183 is CRITICAL: CRITICAL: Puppet has 1 failures [18:32:16] PROBLEM - puppet last run on mw1247 is CRITICAL: CRITICAL: Puppet has 1 failures [18:32:16] PROBLEM - puppet last run on mw1237 is CRITICAL: CRITICAL: Puppet has 1 failures [18:32:17] PROBLEM - Swift HTTP backend on ms-fe3002 is CRITICAL: Connection refused [18:32:25] RECOVERY - Host ms-be3002 is UP: PING OK - Packet loss = 0%, RTA = 94.36 ms [18:32:25] RECOVERY - Host ms-be3004 is UP: PING OK - Packet loss = 0%, RTA = 93.45 ms [18:32:26] PROBLEM - puppet last run on mw1133 is CRITICAL: CRITICAL: Puppet has 1 failures [18:32:26] PROBLEM - puppet last run on mw1190 is CRITICAL: CRITICAL: Puppet has 1
failures [18:32:26] PROBLEM - puppet last run on mw1165 is CRITICAL: CRITICAL: Puppet has 1 failures [18:32:26] PROBLEM - puppet last run on mw1050 is CRITICAL: CRITICAL: Puppet has 1 failures [18:32:26] PROBLEM - puppet last run on mw1034 is CRITICAL: CRITICAL: Puppet has 1 failures [18:32:27] PROBLEM - Swift HTTP frontend on ms-fe3002 is CRITICAL: Connection refused [18:32:35] PROBLEM - puppet last run on mw1076 is CRITICAL: CRITICAL: Puppet has 1 failures [18:32:36] PROBLEM - puppet last run on mw1168 is CRITICAL: CRITICAL: Puppet has 1 failures [18:32:36] PROBLEM - puppet last run on mw1180 is CRITICAL: CRITICAL: Puppet has 1 failures [18:32:36] PROBLEM - puppet last run on mw1084 is CRITICAL: CRITICAL: Puppet has 1 failures [18:32:36] PROBLEM - puppet last run on snapshot1004 is CRITICAL: CRITICAL: Puppet has 1 failures [18:32:36] PROBLEM - puppet last run on mw1004 is CRITICAL: CRITICAL: Puppet has 2 failures [18:32:45] PROBLEM - Host ms-fe1001 is DOWN: PING CRITICAL - Packet loss = 100% [18:32:45] PROBLEM - puppet last run on mw1125 is CRITICAL: CRITICAL: Puppet has 1 failures [18:32:45] PROBLEM - puppet last run on mw1054 is CRITICAL: CRITICAL: Puppet has 1 failures [18:32:46] PROBLEM - puppet last run on mw1195 is CRITICAL: CRITICAL: Puppet has 1 failures [18:32:46] PROBLEM - puppet last run on mw1056 is CRITICAL: CRITICAL: Puppet has 1 failures [18:32:56] PROBLEM - puppet last run on mw1146 is CRITICAL: CRITICAL: Puppet has 1 failures [18:32:56] PROBLEM - puppet last run on mw1248 is CRITICAL: CRITICAL: Puppet has 1 failures [18:32:56] PROBLEM - puppet last run on mw1149 is CRITICAL: CRITICAL: Puppet has 1 failures [18:32:56] PROBLEM - puppet last run on mw1210 is CRITICAL: CRITICAL: Puppet has 1 failures [18:32:56] PROBLEM - puppet last run on mw1227 is CRITICAL: CRITICAL: Puppet has 1 failures [18:32:57] PROBLEM - puppet last run on mw1074 is CRITICAL: CRITICAL: Puppet has 1 failures [18:32:57] PROBLEM - puppet last run on mw1151 is CRITICAL: CRITICAL: Puppet has 1 failures [18:32:58] PROBLEM - MediaWiki profile collector on tungsten is CRITICAL: Connection refused by host [18:33:06] PROBLEM - puppet last run on mw1081 is CRITICAL: CRITICAL: Puppet has 1 failures [18:33:06] PROBLEM - puppet last run on mw1198 is CRITICAL: CRITICAL: Puppet has 1 failures [18:33:06] PROBLEM - puppet last run on mw1249 is CRITICAL: CRITICAL: Puppet has 1 failures [18:33:16] PROBLEM - puppet last run on mw1162 is CRITICAL: CRITICAL: Puppet has 1 failures [18:33:25] PROBLEM - gdash.wikimedia.org on tungsten is CRITICAL: Connection refused [18:33:26] PROBLEM - SSH on tungsten is CRITICAL: Connection refused [18:33:26] PROBLEM - Graphite Carbon on tungsten is CRITICAL: Connection refused by host [18:33:36] PROBLEM - Host mw1160 is DOWN: PING CRITICAL - Packet loss = 100% [18:33:36] PROBLEM - Keyholder SSH agent on tin is CRITICAL: CRITICAL: Keyholder is not armed. Run keyholder arm to arm it. 
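The tin alert above ("Keyholder is not armed. Run keyholder arm to arm it.") is the usual after-effect of rebooting the deployment host: its shared SSH agent comes back with no keys loaded until someone re-arms it by hand. Below is a minimal sketch of such an "armed?" probe, assuming the agent socket lives at /run/keyholder/agent.sock (the path and wording are assumptions, not the exact Wikimedia check); it simply asks the agent for its identities with ssh-add -l and maps the exit code to a Nagios state.

    #!/usr/bin/env python3
    # Minimal sketch of a "keyholder armed?" probe. The socket path and
    # messages are assumptions; ssh-add -l exits 0 with identities loaded,
    # 1 when the agent is empty, and 2 when the agent is unreachable.
    import os
    import subprocess
    import sys

    AGENT_SOCK = "/run/keyholder/agent.sock"  # assumed socket path

    def main():
        env = dict(os.environ, SSH_AUTH_SOCK=AGENT_SOCK)
        result = subprocess.run(["ssh-add", "-l"], env=env,
                                capture_output=True, text=True)
        if result.returncode == 0:
            print("OK: Keyholder is armed:\n%s" % result.stdout.strip())
            return 0
        if result.returncode == 1:
            print("CRITICAL: Keyholder is not armed. Run keyholder arm to arm it.")
            return 2
        print("CRITICAL: cannot reach agent at %s: %s"
              % (AGENT_SOCK, result.stderr.strip()))
        return 2

    if __name__ == "__main__":
        sys.exit(main())

Arming is a manual step precisely so that the key material never sits unencrypted on disk across reboots; the check only verifies that someone has done it.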
[18:33:36] PROBLEM - puppet last run on mw1163 is CRITICAL: CRITICAL: Puppet has 1 failures [18:33:37] PROBLEM - puppet last run on mw1097 is CRITICAL: CRITICAL: Puppet has 1 failures [18:33:37] RECOVERY - Host ms-fe1001 is UP: PING OK - Packet loss = 0%, RTA = 1.44 ms [18:33:37] PROBLEM - uWSGI web apps on tungsten is CRITICAL: Connection refused by host [18:33:37] PROBLEM - puppet last run on tungsten is CRITICAL: Connection refused by host [18:33:45] PROBLEM - puppet last run on mw1023 is CRITICAL: CRITICAL: Puppet has 1 failures [18:33:46] PROBLEM - DPKG on tungsten is CRITICAL: Connection refused by host [18:33:55] PROBLEM - graphite.wikimedia.org on tungsten is CRITICAL: Connection refused [18:33:55] PROBLEM - salt-minion processes on tungsten is CRITICAL: Connection refused by host [18:34:06] PROBLEM - Disk space on tungsten is CRITICAL: Connection refused by host [18:34:06] PROBLEM - configured eth on tungsten is CRITICAL: Connection refused by host [18:34:06] RECOVERY - DPKG on labsdb1005 is OK: All packages OK [18:34:15] PROBLEM - DPKG on calcium is CRITICAL: DPKG CRITICAL dpkg reports broken packages [18:34:16] PROBLEM - dhclient process on tungsten is CRITICAL: Connection refused by host [18:34:25] PROBLEM - RAID on tungsten is CRITICAL: Connection refused by host [18:34:25] RECOVERY - Host mw1160 is UP: PING OK - Packet loss = 0%, RTA = 0.75 ms [18:34:26] RECOVERY - Host ms-be3003 is UP: PING OK - Packet loss = 0%, RTA = 95.56 ms [18:34:36] RECOVERY - Host osm-cp1001 is UP: PING OK - Packet loss = 0%, RTA = 2.92 ms [18:34:46] RECOVERY - NTP on ms-be1006 is OK: NTP OK: Offset -0.004553556442 secs [18:34:55] PROBLEM - NTP on netmon1001 is CRITICAL: NTP CRITICAL: Offset unknown [18:35:06] RECOVERY - NTP on cp1066 is OK: NTP OK: Offset -0.002333402634 secs [18:35:15] PROBLEM - check_puppetrun on db1025 is CRITICAL: CRITICAL: Puppet has 2 failures [18:35:55] PROBLEM - NTP on ms-be1009 is CRITICAL: NTP CRITICAL: Offset unknown [18:36:07] PROBLEM - Host ms-fe1002 is DOWN: PING CRITICAL - Packet loss = 100% [18:36:16] PROBLEM - NTP on ms-be1010 is CRITICAL: NTP CRITICAL: Offset unknown [18:36:35] PROBLEM - SSH on ms-be3003 is CRITICAL: Connection refused [18:36:35] PROBLEM - swift-account-replicator on ms-be3003 is CRITICAL: Connection refused by host [18:36:35] PROBLEM - swift-container-auditor on ms-be3003 is CRITICAL: Connection refused by host [18:36:36] PROBLEM - swift-object-auditor on ms-be3003 is CRITICAL: Connection refused by host [18:36:46] PROBLEM - swift-account-server on ms-be3003 is CRITICAL: Connection refused by host [18:36:46] PROBLEM - very high load average likely xfs on ms-be3003 is CRITICAL: Connection refused by host [18:36:46] PROBLEM - swift-object-replicator on ms-be3003 is CRITICAL: Connection refused by host [18:36:47] PROBLEM - salt-minion processes on ms-be3003 is CRITICAL: Connection refused by host [18:36:47] PROBLEM - configured eth on ms-be3003 is CRITICAL: Connection refused by host [18:36:56] PROBLEM - swift-object-server on ms-be3003 is CRITICAL: Connection refused by host [18:36:56] PROBLEM - swift-container-replicator on ms-be3003 is CRITICAL: Connection refused by host [18:36:57] PROBLEM - Host db1008 is DOWN: PING CRITICAL - Packet loss = 100% [18:37:05] PROBLEM - DPKG on ms-be3003 is CRITICAL: Connection refused by host [18:37:16] PROBLEM - swift-object-updater on ms-be3003 is CRITICAL: Connection refused by host [18:37:16] PROBLEM - swift-account-auditor on ms-be3003 is CRITICAL: Connection refused by host [18:37:25] PROBLEM - 
swift-container-server on ms-be3003 is CRITICAL: Connection refused by host [18:37:26] RECOVERY - Host ms-fe1002 is UP: PING OK - Packet loss = 0%, RTA = 1.57 ms [18:37:36] PROBLEM - swift-account-reaper on ms-be3003 is CRITICAL: Connection refused by host [18:37:36] PROBLEM - dhclient process on ms-be3003 is CRITICAL: Connection refused by host [18:37:36] PROBLEM - NTP on iodine is CRITICAL: NTP CRITICAL: Offset unknown [18:37:37] PROBLEM - swift-container-updater on ms-be3003 is CRITICAL: Connection refused by host [18:37:37] PROBLEM - RAID on ms-be3003 is CRITICAL: Connection refused by host [18:37:57] PROBLEM - Host labsdb1007 is DOWN: PING CRITICAL - Packet loss = 100% [18:38:16] RECOVERY - puppet last run on ms-be1005 is OK: OK: Puppet is currently enabled, last run 31 seconds ago with 0 failures [18:38:16] PROBLEM - Host labsdb1005 is DOWN: PING CRITICAL - Packet loss = 100% [18:38:25] RECOVERY - puppet last run on erbium is OK: OK: Puppet is currently enabled, last run 32 seconds ago with 0 failures [18:38:36] PROBLEM - ElasticSearch health check for shards on logstash1003 is CRITICAL: CRITICAL - elasticsearch inactive shards 43 threshold =0.1% breach: {ustatus: uyellow, unumber_of_nodes: 2, uunassigned_shards: 43, utimed_out: False, uactive_primary_shards: 43, ucluster_name: uproduction-logstash-eqiad, urelocating_shards: 0, uactive_shards: 86, uinitializing_shards: 0, unumber_of_data_nodes: 2} [18:38:56] PROBLEM - ElasticSearch health check for shards on logstash1002 is CRITICAL: CRITICAL - elasticsearch inactive shards 35 threshold =0.1% breach: {ustatus: uyellow, unumber_of_nodes: 3, uunassigned_shards: 33, utimed_out: False, uactive_primary_shards: 44, ucluster_name: uproduction-logstash-eqiad, urelocating_shards: 0, uactive_shards: 97, uinitializing_shards: 2, unumber_of_data_nodes: 3} [18:38:56] RECOVERY - puppet last run on mw1214 is OK: OK: Puppet is currently enabled, last run 41 seconds ago with 0 failures [18:39:06] RECOVERY - Host labsdb1005 is UP: PING OK - Packet loss = 0%, RTA = 0.69 ms [18:39:06] RECOVERY - puppet last run on db1024 is OK: OK: Puppet is currently enabled, last run 19 seconds ago with 0 failures [18:39:06] RECOVERY - puppet last run on cp1057 is OK: OK: Puppet is currently enabled, last run 34 seconds ago with 0 failures [18:39:15] RECOVERY - NTP on ms-be1009 is OK: NTP OK: Offset -0.005220890045 secs [18:39:15] RECOVERY - NTP on netmon1001 is OK: NTP OK: Offset -0.006482720375 secs [18:39:16] RECOVERY - puppet last run on db1041 is OK: OK: Puppet is currently enabled, last run 17 seconds ago with 0 failures [18:39:26] RECOVERY - puppet last run on mw1136 is OK: OK: Puppet is currently enabled, last run 49 seconds ago with 0 failures [18:39:26] RECOVERY - puppet last run on zirconium is OK: OK: Puppet is currently enabled, last run 14 seconds ago with 0 failures [18:39:27] RECOVERY - puppet last run on mc1008 is OK: OK: Puppet is currently enabled, last run 1 second ago with 0 failures [18:39:35] PROBLEM - Host ms-fe1003 is DOWN: PING CRITICAL - Packet loss = 100% [18:39:36] RECOVERY - puppet last run on mw1083 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [18:39:46] RECOVERY - ElasticSearch health check for shards on logstash1003 is OK: OK - elasticsearch status production-logstash-eqiad: status: yellow, number_of_nodes: 3, unassigned_shards: 1, timed_out: False, active_primary_shards: 44, cluster_name: production-logstash-eqiad, relocating_shards: 0, active_shards: 129, initializing_shards: 2, number_of_data_nodes: 
3 [18:39:46] PROBLEM - Host osm-cp1001 is DOWN: PING CRITICAL - Packet loss = 100% [18:39:46] RECOVERY - puppet last run on tmh1002 is OK: OK: Puppet is currently enabled, last run 55 seconds ago with 0 failures [18:39:46] PROBLEM - NTP on baham is CRITICAL: NTP CRITICAL: Offset unknown [18:39:55] RECOVERY - Host labsdb1007 is UP: PING OK - Packet loss = 0%, RTA = 1.72 ms [18:39:56] RECOVERY - puppet last run on mw1196 is OK: OK: Puppet is currently enabled, last run 23 seconds ago with 0 failures [18:39:56] RECOVERY - NTP peers on acamar is OK: NTP OK: Offset 0.001644 secs [18:40:06] RECOVERY - puppet last run on mw1096 is OK: OK: Puppet is currently enabled, last run 22 seconds ago with 0 failures [18:40:06] RECOVERY - ElasticSearch health check for shards on logstash1002 is OK: OK - elasticsearch status production-logstash-eqiad: status: yellow, number_of_nodes: 3, unassigned_shards: 1, timed_out: False, active_primary_shards: 44, cluster_name: production-logstash-eqiad, relocating_shards: 0, active_shards: 129, initializing_shards: 2, number_of_data_nodes: 3 [18:40:15] PROBLEM - check_puppetrun on db1025 is CRITICAL: CRITICAL: Puppet has 2 failures [18:40:15] PROBLEM - DPKG on tmh1001 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [18:40:16] RECOVERY - puppet last run on cp4017 is OK: OK: Puppet is currently enabled, last run 45 seconds ago with 0 failures [18:40:16] RECOVERY - puppet last run on ytterbium is OK: OK: Puppet is currently enabled, last run 36 seconds ago with 0 failures [18:40:16] RECOVERY - puppet last run on cp1043 is OK: OK: Puppet is currently enabled, last run 23 seconds ago with 0 failures [18:40:16] RECOVERY - puppet last run on wtp1021 is OK: OK: Puppet is currently enabled, last run 18 seconds ago with 0 failures [18:40:16] RECOVERY - puppet last run on mw1036 is OK: OK: Puppet is currently enabled, last run 14 seconds ago with 0 failures [18:40:25] RECOVERY - puppet last run on elastic1028 is OK: OK: Puppet is currently enabled, last run 16 seconds ago with 0 failures [18:40:25] RECOVERY - puppet last run on mw1246 is OK: OK: Puppet is currently enabled, last run 46 seconds ago with 0 failures [18:40:25] RECOVERY - Host db1008 is UP: PING OK - Packet loss = 0%, RTA = 1.66 ms [18:40:31] RECOVERY - Host ms-fe1003 is UP: PING OK - Packet loss = 0%, RTA = 1.60 ms [18:40:31] RECOVERY - puppet last run on amssq45 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [18:40:32] RECOVERY - puppet last run on mw1013 is OK: OK: Puppet is currently enabled, last run 44 seconds ago with 0 failures [18:40:45] RECOVERY - puppet last run on mw1138 is OK: OK: Puppet is currently enabled, last run 49 seconds ago with 0 failures [18:40:46] RECOVERY - puppet last run on amssq58 is OK: OK: Puppet is currently enabled, last run 43 seconds ago with 0 failures [18:40:46] PROBLEM - NTP peers on nescio is CRITICAL: NTP CRITICAL: Server not synchronized, Offset unknown [18:40:56] RECOVERY - puppet last run on cp4013 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [18:40:57] RECOVERY - puppet last run on wtp1017 is OK: OK: Puppet is currently enabled, last run 2 seconds ago with 0 failures [18:40:57] RECOVERY - NTP on baham is OK: NTP OK: Offset 0.02572953701 secs [18:40:57] PROBLEM - Host tungsten is DOWN: PING CRITICAL - Packet loss = 100% [18:40:57] RECOVERY - puppet last run on mw1124 is OK: OK: Puppet is currently enabled, last run 21 seconds ago with 0 failures [18:40:57] RECOVERY - puppet last run on mw1245 is OK: OK: 
Puppet is currently enabled, last run 44 seconds ago with 0 failures [18:56:48] RECOVERY - puppet last run on mw1221 is OK: OK: Puppet is currently enabled, last run 32 seconds ago with 0 failures [18:56:48] RECOVERY - puppet last run on mw1040 is OK: OK: Puppet is currently enabled, last run 32 seconds ago with 0 failures [18:56:48] RECOVERY - puppet last run on mw1233 is OK: OK: Puppet is currently enabled, last run 18 seconds ago with 0 failures [18:56:48] RECOVERY - puppet last run on mw1109 is OK: OK: Puppet is currently enabled, last run 14 seconds ago with 0 failures [18:56:48] PROBLEM - NTP on eeden is CRITICAL: NTP CRITICAL: Offset unknown [18:56:48] RECOVERY - puppet last run on uranium is OK: OK: Puppet is currently enabled, last run 1 second ago with 0 failures [18:56:48] RECOVERY - puppet last run on mw1178 is OK: OK: Puppet is currently enabled, last run 14 seconds ago with 0 failures [18:56:48] RECOVERY - puppet last run on wtp1019 is OK: OK: Puppet is currently enabled, last run 22 seconds ago with 0 failures [18:56:48] RECOVERY - puppet last run on mw1089 is OK: OK: Puppet is currently enabled, last run 5 seconds ago with 0 failures [18:56:48] PROBLEM - NTP on ms-be1013 is CRITICAL: NTP CRITICAL: Offset unknown [18:56:48] RECOVERY - puppet last run on ocg1001 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [18:56:48] PROBLEM - NTP on ms-be1014 is CRITICAL: NTP CRITICAL: Offset unknown [18:56:48] RECOVERY - SSH on tungsten is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1.4 (protocol 2.0) [18:56:48] RECOVERY - puppet last run on mw1048 is OK: OK: Puppet is currently enabled, last run 55 seconds ago with 0 failures [18:56:48] RECOVERY - Graphite Carbon on tungsten is OK: OK: All defined Carbon jobs are runnning. [18:56:48] RECOVERY - puppet last run on cp1059 is OK: OK: Puppet is currently enabled, last run 35 seconds ago with 0 failures [18:56:49] RECOVERY - uWSGI web apps on tungsten is OK: OK: All defined uWSGI apps are runnning. 
[18:56:49] RECOVERY - Host tungsten is UP: PING OK - Packet loss = 0%, RTA = 1.05 ms [18:56:49] RECOVERY - puppet last run on tungsten is OK: OK: Puppet is currently enabled, last run 12 minutes ago with 0 failures [18:56:49] RECOVERY - puppet last run on cp4007 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [18:56:49] RECOVERY - DPKG on tungsten is OK: All packages OK [18:56:49] PROBLEM - NTP on ms-be1015 is CRITICAL: NTP CRITICAL: Offset unknown [18:56:49] RECOVERY - puppet last run on cp3013 is OK: OK: Puppet is currently enabled, last run 36 seconds ago with 0 failures [18:56:49] RECOVERY - puppet last run on mw1038 is OK: OK: Puppet is currently enabled, last run 49 seconds ago with 0 failures [18:56:49] RECOVERY - puppet last run on mw1067 is OK: OK: Puppet is currently enabled, last run 55 seconds ago with 0 failures [18:56:49] RECOVERY - salt-minion processes on tungsten is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [18:56:49] RECOVERY - puppet last run on mw1256 is OK: OK: Puppet is currently enabled, last run 45 seconds ago with 0 failures [18:56:49] RECOVERY - puppet last run on mw1134 is OK: OK: Puppet is currently enabled, last run 29 seconds ago with 0 failures [18:56:49] RECOVERY - puppet last run on cerium is OK: OK: Puppet is currently enabled, last run 4 seconds ago with 0 failures [18:56:49] RECOVERY - Disk space on tungsten is OK: DISK OK [18:56:49] RECOVERY - configured eth on tungsten is OK: NRPE: Unable to read output [18:56:49] RECOVERY - MediaWiki profile collector on tungsten is OK: OK: All defined mwprof jobs are runnning. [18:56:49] RECOVERY - puppet last run on db1053 is OK: OK: Puppet is currently enabled, last run 51 seconds ago with 0 failures [18:56:49] RECOVERY - puppet last run on ms-be2003 is OK: OK: Puppet is currently enabled, last run 4 seconds ago with 0 failures [18:56:49] RECOVERY - NTP peers on nescio is OK: NTP OK: Offset -0.044994 secs [18:56:49] RECOVERY - puppet last run on mw1006 is OK: OK: Puppet is currently enabled, last run 9 seconds ago with 0 failures [18:56:49] RECOVERY - puppet last run on mw1045 is OK: OK: Puppet is currently enabled, last run 35 seconds ago with 0 failures [18:56:49] RECOVERY - dhclient process on tungsten is OK: PROCS OK: 0 processes with command name dhclient [18:56:49] PROBLEM - NTP on rhenium is CRITICAL: NTP CRITICAL: Offset unknown [18:56:49] RECOVERY - puppet last run on mw1012 is OK: OK: Puppet is currently enabled, last run 2 seconds ago with 0 failures [18:56:49] RECOVERY - puppet last run on neptunium is OK: OK: Puppet is currently enabled, last run 45 seconds ago with 0 failures [18:56:49] RECOVERY - puppet last run on amssq43 is OK: OK: Puppet is currently enabled, last run 5 seconds ago with 0 failures [18:56:50] RECOVERY - puppet last run on mw1041 is OK: OK: Puppet is currently enabled, last run 21 seconds ago with 0 failures [18:56:50] RECOVERY - gdash.wikimedia.org on tungsten is OK: HTTP OK: HTTP/1.1 200 OK - 9352 bytes in 0.025 second response time [18:56:50] RECOVERY - RAID on tungsten is OK: OK: optimal, 1 logical, 2 physical [18:56:50] PROBLEM - Host labnet1001 is DOWN: PING CRITICAL - Packet loss = 100% [18:56:51] RECOVERY - puppet last run on mw1063 is OK: OK: Puppet is currently enabled, last run 40 seconds ago with 0 failures [18:56:51] RECOVERY - Host osm-cp1001 is UP: PING OK - Packet loss = 0%, RTA = 0.51 ms [18:56:51] RECOVERY - puppet last run on mw1140 is OK: OK: Puppet is currently enabled, last run 24 seconds ago 
with 0 failures [18:56:51] RECOVERY - puppet last run on cp3015 is OK: OK: Puppet is currently enabled, last run 34 seconds ago with 0 failures [18:56:51] PROBLEM - Host db1025 is DOWN: PING CRITICAL - Packet loss = 100% [18:56:51] RECOVERY - puppet last run on amssq49 is OK: OK: Puppet is currently enabled, last run 35 seconds ago with 0 failures [18:56:51] RECOVERY - puppet last run on mw1197 is OK: OK: Puppet is currently enabled, last run 41 seconds ago with 0 failures [18:56:51] RECOVERY - puppet last run on mw1059 is OK: OK: Puppet is currently enabled, last run 38 seconds ago with 0 failures [18:56:51] RECOVERY - puppet last run on mw1106 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [18:56:51] RECOVERY - puppet last run on lvs4002 is OK: OK: Puppet is currently enabled, last run 59 seconds ago with 0 failures [18:56:51] RECOVERY - puppet last run on amssq59 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [18:56:51] RECOVERY - puppet last run on cp1070 is OK: OK: Puppet is currently enabled, last run 56 seconds ago with 0 failures [18:56:51] RECOVERY - puppet last run on cp3008 is OK: OK: Puppet is currently enabled, last run 4 seconds ago with 0 failures [18:56:51] RECOVERY - Host labnet1001 is UP: PING OK - Packet loss = 0%, RTA = 1.48 ms [18:56:52] RECOVERY - puppet last run on mw1200 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [18:56:52] RECOVERY - puppet last run on amssq54 is OK: OK: Puppet is currently enabled, last run 38 seconds ago with 0 failures [18:56:53] RECOVERY - DPKG on tmh1001 is OK: All packages OK [18:56:53] PROBLEM - Host ms-fe1004 is DOWN: PING CRITICAL - Packet loss = 100% [18:56:53] RECOVERY - NTP on eeden is OK: NTP OK: Offset 0.000516295433 secs [18:56:53] RECOVERY - puppet last run on mw1205 is OK: OK: Puppet is currently enabled, last run 18 seconds ago with 0 failures [18:56:53] RECOVERY - puppet last run on mw1164 is OK: OK: Puppet is currently enabled, last run 26 seconds ago with 0 failures [18:56:54] RECOVERY - check_puppetrun on db1025 is OK: OK: Puppet is currently enabled, last run 29 seconds ago with 0 failures [18:56:54] PROBLEM - NTP on titanium is CRITICAL: NTP CRITICAL: Offset unknown [18:56:54] RECOVERY - NTP on ms-be1013 is OK: NTP OK: Offset -0.0001167058945 secs [18:56:54] RECOVERY - Host db1025 is UP: PING OK - Packet loss = 0%, RTA = 1.57 ms [18:56:54] RECOVERY - Host ms-fe1004 is UP: PING OK - Packet loss = 0%, RTA = 0.26 ms [18:56:54] RECOVERY - NTP on ms-be1014 is OK: NTP OK: Offset -0.006651997566 secs [18:56:55] RECOVERY - NTP on ms-be1015 is OK: NTP OK: Offset -0.01557266712 secs [18:56:55] PROBLEM - NTP on ms-fe3002 is CRITICAL: NTP CRITICAL: Offset unknown [18:56:55] RECOVERY - puppet last run on mw1144 is OK: OK: Puppet is currently enabled, last run 17 seconds ago with 0 failures [18:56:56] PROBLEM - NTP on ms-fe3001 is CRITICAL: NTP CRITICAL: Offset unknown [18:56:56] RECOVERY - DPKG on calcium is OK: All packages OK [18:56:56] RECOVERY - NTP on rhenium is OK: NTP OK: Offset -0.004738926888 secs [18:56:57] RECOVERY - puppet last run on mw1213 is OK: OK: Puppet is currently enabled, last run 31 seconds ago with 0 failures [18:56:57] RECOVERY - puppet last run on mw1054 is OK: OK: Puppet is currently enabled, last run 18 seconds ago with 0 failures [18:56:57] PROBLEM - NTP on tin is CRITICAL: NTP CRITICAL: Offset unknown [18:56:58] RECOVERY - NTP on ms-be1010 is OK: NTP OK: Offset -0.01744437218 secs [18:56:58] RECOVERY - puppet last 
run on mw1172 is OK: OK: Puppet is currently enabled, last run 33 seconds ago with 0 failures [18:56:58] RECOVERY - puppet last run on mw1249 is OK: OK: Puppet is currently enabled, last run 6 seconds ago with 0 failures [18:56:58] RECOVERY - puppet last run on mw1162 is OK: OK: Puppet is currently enabled, last run 9 seconds ago with 0 failures [18:56:58] RECOVERY - NTP on iodine is OK: NTP OK: Offset -0.02253675461 secs [18:56:58] RECOVERY - puppet last run on mw1251 is OK: OK: Puppet is currently enabled, last run 46 seconds ago with 0 failures [18:56:58] RECOVERY - puppet last run on mw1237 is OK: OK: Puppet is currently enabled, last run 28 seconds ago with 0 failures [18:56:58] RECOVERY - puppet last run on mw1175 is OK: OK: Puppet is currently enabled, last run 47 seconds ago with 0 failures [18:56:58] PROBLEM - Host tmh1001 is DOWN: PING CRITICAL - Packet loss = 100% [18:56:59] RECOVERY - puppet last run on mw1044 is OK: OK: Puppet is currently enabled, last run 23 seconds ago with 0 failures [18:56:59] RECOVERY - puppet last run on mw1195 is OK: OK: Puppet is currently enabled, last run 39 seconds ago with 0 failures [18:56:59] RECOVERY - NTP on tin is OK: NTP OK: Offset -0.0001437664032 secs [18:56:59] RECOVERY - puppet last run on snapshot1001 is OK: OK: Puppet is currently enabled, last run 33 seconds ago with 0 failures [18:56:59] RECOVERY - puppet last run on mw1149 is OK: OK: Puppet is currently enabled, last run 46 seconds ago with 0 failures [18:56:59] RECOVERY - puppet last run on mw1011 is OK: OK: Puppet is currently enabled, last run 59 seconds ago with 0 failures [19:00:47] RECOVERY - Host osm-cp1001 is UP: PING OK - Packet loss = 0%, RTA = 0.48 ms [19:00:56] PROBLEM - NTP on ms-fe1004 is CRITICAL: NTP CRITICAL: Offset unknown [19:01:06] RECOVERY - NTP on tungsten is OK: NTP OK: Offset 0.01636552811 secs [19:01:15] PROBLEM - DPKG on magnesium is CRITICAL: DPKG CRITICAL dpkg reports broken packages [19:01:16] PROBLEM - Host caesium is DOWN: PING CRITICAL - Packet loss = 100% [19:01:25] PROBLEM - DPKG on potassium is CRITICAL: DPKG CRITICAL dpkg reports broken packages [19:01:46] RECOVERY - git.wikimedia.org on antimony is OK: HTTP OK: HTTP/1.1 200 OK - 59701 bytes in 0.350 second response time [19:01:58] RECOVERY - Host caesium is UP: PING OK - Packet loss = 0%, RTA = 0.68 ms [19:03:15] (03PS1) 10Filippo Giunchedi: remove search* machines references [puppet] - 10https://gerrit.wikimedia.org/r/186982 (https://phabricator.wikimedia.org/T86149) [19:03:26] RECOVERY - DPKG on potassium is OK: All packages OK [19:03:36] RECOVERY - NTP on labnet1001 is OK: NTP OK: Offset -0.004298210144 secs [19:03:50] (03PS1) 10KartikMistry: cxserver: Remove unused dict package [puppet] - 10https://gerrit.wikimedia.org/r/186983 [19:04:05] RECOVERY - NTP on ms-fe1004 is OK: NTP OK: Offset -0.01555049419 secs [19:05:26] RECOVERY - DPKG on magnesium is OK: All packages OK [19:05:36] PROBLEM - Host osm-cp1001 is DOWN: PING CRITICAL - Packet loss = 100% [19:05:56] PROBLEM - Host protactinium is DOWN: CRITICAL - Host Unreachable (208.80.154.13) [19:06:16] PROBLEM - NTP on tmh1001 is CRITICAL: NTP CRITICAL: Offset unknown [19:06:17] PROBLEM - DPKG on ruthenium is CRITICAL: DPKG CRITICAL dpkg reports broken packages [19:06:46] PROBLEM - puppet last run on argon is CRITICAL: CRITICAL: Puppet has 1 failures [19:06:55] PROBLEM - DPKG on labstore1001 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [19:07:06] RECOVERY - Host protactinium is UP: PING OK - Packet loss = 0%, RTA = 2.20 ms [19:07:16] 
PROBLEM - NTP on labsdb1006 is CRITICAL: NTP CRITICAL: Offset unknown [19:07:26] PROBLEM - NTP on labsdb1004 is CRITICAL: NTP CRITICAL: Offset unknown [19:07:57] PROBLEM - NTP on calcium is CRITICAL: NTP CRITICAL: Offset unknown [19:08:20] (03CR) 10Filippo Giunchedi: [C: 032 V: 032] remove search* machines references [puppet] - 10https://gerrit.wikimedia.org/r/186982 (https://phabricator.wikimedia.org/T86149) (owner: 10Filippo Giunchedi) [19:08:26] RECOVERY - NTP on tmh1001 is OK: NTP OK: Offset -0.006824851036 secs [19:09:47] PROBLEM - NTP on dysprosium is CRITICAL: NTP CRITICAL: Offset unknown [19:09:47] RECOVERY - DPKG on labstore1001 is OK: All packages OK [19:09:47] RECOVERY - Host osm-cp1001 is UP: PING OK - Packet loss = 0%, RTA = 2.20 ms [19:09:47] RECOVERY - NTP on labsdb1004 is OK: NTP OK: Offset -0.002525448799 secs [19:10:15] PROBLEM - salt-minion processes on cp1057 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/salt-minion [19:10:26] RECOVERY - NTP on labsdb1006 is OK: NTP OK: Offset -0.008845567703 secs [19:11:06] RECOVERY - NTP on calcium is OK: NTP OK: Offset -0.00501871109 secs [19:13:05] RECOVERY - NTP on dysprosium is OK: NTP OK: Offset -0.006666064262 secs [19:13:47] PROBLEM - NTP on tmh1002 is CRITICAL: NTP CRITICAL: Offset unknown [19:14:05] RECOVERY - Varnishkafka log producer on cp1066 is OK: PROCS OK: 1 process with command name varnishkafka [19:14:23] !log IRC RC seems broken [19:14:26] PROBLEM - Host dataset1001 is DOWN: CRITICAL - Host Unreachable (208.80.154.11) [19:14:29] Logged the message, Master [19:14:36] RECOVERY - Varnish HTTP text-frontend on cp1066 is OK: HTTP OK: HTTP/1.1 200 OK - 285 bytes in 0.006 second response time [19:14:46] RECOVERY - Host dataset1001 is UP: PING OK - Packet loss = 0%, RTA = 2.36 ms [19:15:02] Krinkle, ^ [19:16:26] Krenair: I'm not ops, *hands in the air* [19:16:26] PROBLEM - Host osm-cp1001 is DOWN: PING CRITICAL - Packet loss = 100% [19:16:27] !log irc.wikimedia.org is down. "Connection refused." 
[19:16:27] Logged the message, Master [19:16:27] Krenair: I'd assume it'll be a knock on effect [19:16:27] mutante: ^ [19:16:27] PROBLEM - Host ruthenium is DOWN: PING CRITICAL - Packet loss = 100% [19:16:46] RECOVERY - Varnish traffic logger on cp1066 is OK: PROCS OK: 2 processes with command name varnishncsa [19:16:46] Krenair: paravoid, checking, argon got rebooted [19:17:15] PROBLEM - Varnishkafka Delivery Errors per minute on cp3015 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [20000.0] [19:17:36] RECOVERY - puppet last run on cp1066 is OK: OK: Puppet is currently enabled, last run 47 seconds ago with 0 failures [19:18:05] RECOVERY - DPKG on ruthenium is OK: All packages OK [19:18:05] RECOVERY - NTP on tmh1002 is OK: NTP OK: Offset 0.012809515 secs [19:18:06] RECOVERY - Host ruthenium is UP: PING OK - Packet loss = 0%, RTA = 1.81 ms [19:18:45] RECOVERY - salt-minion processes on cp1057 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [19:18:56] PROBLEM - NTP on argon is CRITICAL: NTP CRITICAL: Offset unknown [19:18:56] RECOVERY - Host osm-cp1001 is UP: PING OK - Packet loss = 0%, RTA = 1.14 ms [19:19:38] !log run sysctl -w net.netfilter.nf_conntrack_max=131072 on labmon1001 [19:19:40] gah [19:19:43] Logged the message, Master [19:19:43] !log run sysctl -w net.netfilter.nf_conntrack_max=131072 on labnet1001 [19:19:46] Logged the message, Master [19:22:05] (03CR) 10Alexandros Kosiaris: [C: 04-1] admin: Add Niklas to cxserver-admin (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/186981 (owner: 10KartikMistry) [19:22:06] RECOVERY - NTP on argon is OK: NTP OK: Offset -0.003346681595 secs [19:22:25] PROBLEM - NTP on protactinium is CRITICAL: NTP CRITICAL: Offset unknown [19:22:35] PROBLEM - salt-minion processes on cp3021 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/salt-minion [19:22:45] RECOVERY - Varnishkafka Delivery Errors per minute on cp3015 is OK: OK: Less than 1.00% above the threshold [0.0] [19:22:46] PROBLEM - Host gadolinium is DOWN: PING CRITICAL - Packet loss = 100% [19:23:11] !log brought ircd back up on argon [19:23:14] 11:22 -!- Irssi: Connection to irc.wikimedia.org established [19:23:14] Logged the message, Master [19:23:16] Krenair: ^ [19:23:38] this needs fixing, it should come back by itself on reboot of course [19:23:46] hmm [19:23:56] PROBLEM - Host osm-cp1001 is DOWN: PING CRITICAL - Packet loss = 100% [19:24:17] I think it's still broken [19:24:23] is the bot missing? [19:24:36] PROBLEM - Host hafnium is DOWN: PING CRITICAL - Packet loss = 100% [19:24:45] * irc.pmtpa.wikimedia.org :No such channel [19:24:56] when trying to join a channel [19:25:05] ok, trying.. grrmm..the docs are all outdated [19:25:55] RECOVERY - Host hafnium is UP: PING OK - Packet loss = 0%, RTA = 1.11 ms [19:27:56] PROBLEM - DPKG on vanadium is CRITICAL: DPKG CRITICAL dpkg reports broken packages [19:28:00] (03PS2) 10KartikMistry: admin: Add Niklas to cxserver-admin [puppet] - 10https://gerrit.wikimedia.org/r/186981 [19:28:34] YuviPanda: i just restarted all worker services [19:28:35] PROBLEM - Host erbium is DOWN: PING CRITICAL - Packet loss = 100% [19:28:44] erbium?! [19:29:06] RECOVERY - Host osm-cp1001 is UP: PING OK - Packet loss = 0%, RTA = 3.80 ms [19:29:36] PROBLEM - Disk space on stat1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
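The two sysctl calls logged above ("sysctl -w net.netfilter.nf_conntrack_max=131072") only change the running kernel, so the value is lost at the next reboot. A minimal sketch of persisting the same setting, assuming a stock /etc/sysctl.d layout; the file name below is illustrative, not the one the puppet repo actually manages:

    # one-off change, as logged above (running kernel only)
    sudo sysctl -w net.netfilter.nf_conntrack_max=131072

    # hypothetical persistent version: drop a file under /etc/sysctl.d
    echo 'net.netfilter.nf_conntrack_max = 131072' | sudo tee /etc/sysctl.d/60-nf-conntrack-max.conf

    # load that file without rebooting, then verify
    sudo sysctl -p /etc/sysctl.d/60-nf-conntrack-max.conf
    sysctl net.netfilter.nf_conntrack_max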
[19:29:46] RECOVERY - Host erbium is UP: PING OK - Packet loss = 0%, RTA = 1.71 ms [19:29:51] hm [19:30:05] poo also stat1002 /mnt/hdfs isn't working again [19:30:05] PROBLEM - Host virt1000 is DOWN: CRITICAL - Host Unreachable (208.80.154.18) [19:30:06] PROBLEM - NTP on dataset1001 is CRITICAL: NTP CRITICAL: Offset unknown [19:30:07] rats! [19:30:16] RECOVERY - puppet last run on argon is OK: OK: Puppet is currently enabled, last run 58 seconds ago with 0 failures [19:30:27] !log disable shinken for labs atm, restarts being done by andrewbogott [19:30:48] PROBLEM - Host labs-ns0.wikimedia.org is DOWN: CRITICAL - Host Unreachable (208.80.154.19) [19:30:49] YuviPanda: i believe that should reduce the amount of logs by a lot [19:30:52] lemme know [19:30:55] RECOVERY - NTP on protactinium is OK: NTP OK: Offset 0.009307026863 secs [19:30:56] there will still be some [19:31:07] RECOVERY - DPKG on vanadium is OK: All packages OK [19:31:46] PROBLEM - SSH on osm-cp1001 is CRITICAL: Connection refused [19:32:15] ottomata: \o/ cool [19:33:01] ottomata: Not only erbium. Also gadolinium :-( [19:33:13] Log bot is gone and wikitech seems to be down. [19:33:25] RECOVERY - NTP on dataset1001 is OK: NTP OK: Offset -0.00214099884 secs [19:33:37] Oh, not gone, just probably hanging. [19:34:06] PROBLEM - NTP on ruthenium is CRITICAL: NTP CRITICAL: Offset unknown [19:34:09] Connections to wikitech are timing out for me too. [19:34:15] PROBLEM - NTP on potassium is CRITICAL: NTP CRITICAL: Offset unknown [19:34:24] qchris ottomata gadolinium was me rebooting it, I don't see it coming back tho [19:34:27] PROBLEM - HTTP error ratio anomaly detection on tungsten is CRITICAL: CRITICAL: Anomaly detected: 10 data above and 9 below the confidence bounds [19:34:35] (03PS1) 10BryanDavis: logstash: Configure Elasticsearch index.merge.scheduler.max_thread_count [puppet] - 10https://gerrit.wikimedia.org/r/186986 (https://phabricator.wikimedia.org/T87526) [19:34:46] RECOVERY - Host virt1000 is UP: PING OK - Packet loss = 0%, RTA = 4.25 ms [19:35:02] qchris: Fiona ottomata wikitech should be coming back up soon [19:35:19] !log (after the fact) reboot gadolinium, currently not coming back [19:35:32] godog, did you reboot erbium too? [19:35:34] Logged the message, Master [19:35:44] ottomata: I didn't [19:35:44] Back now, thx. [19:35:50] [@erbium:~] $ uptime [19:35:50] 19:35:19 up 6 min, [19:35:56] RECOVERY - Host labs-ns0.wikimedia.org is UP: PING OK - Packet loss = 0%, RTA = 0.99 ms [19:35:57] ottomata: I did [19:35:59] ok phew [19:36:06] why ? [19:36:17] just saw it, always worried when I see alerts about those things [19:36:23] ok [19:36:26] you rebooted for security patch? [19:36:36] PROBLEM - Host cp1070 is DOWN: PING CRITICAL - Packet loss = 100% [19:36:45] RECOVERY - NTP on rubidium is OK: NTP OK: Offset -0.0007619857788 secs [19:36:49] paravoid: I am considering an email to wmfall re GHost.. -- prudent? [19:36:59] nah [19:36:59] ottomata: yup [19:37:02] ok cool [19:37:16] RECOVERY - NTP on ruthenium is OK: NTP OK: Offset -0.002706408501 secs [19:37:34] we have a handful of linux users who may not being doing security updates.. [19:37:36] RECOVERY - salt-minion processes on cp3021 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [19:37:45] godog: what's with gadolinium? 
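The reboots being discussed here are for the glibc "GHOST" security update (CVE-2015-0235); upgrading libc6 alone is not enough, because long-running processes keep the old, now-deleted library mapped until they restart, which is why whole hosts are being bounced. A rough sketch, assuming Debian/Ubuntu hosts with dpkg, of how one could spot processes still running against the pre-upgrade libc:

    # glibc version currently installed according to the package manager
    dpkg -l libc6 | awk '/^ii/ {print $2, $3}'

    # processes that still map a deleted libc have not been restarted
    # since the upgrade and are still running the old copy
    for pid in $(sudo grep -l 'libc-.*(deleted)' /proc/[0-9]*/maps 2>/dev/null | cut -d/ -f3); do
        ps -o pid=,comm= -p "$pid"
    done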
[19:37:46] (03CR) 10BryanDavis: "Long term I'd like to see the large set of config flags for Elasticsearch be moved to hiera, but in the interest of moving this one bug al" [puppet] - 10https://gerrit.wikimedia.org/r/186986 (https://phabricator.wikimedia.org/T87526) (owner: 10BryanDavis) [19:37:55] RECOVERY - Host cp1070 is UP: PING OK - Packet loss = 0%, RTA = 0.93 ms [19:38:26] RECOVERY - NTP on potassium is OK: NTP OK: Offset -0.007779955864 secs [19:38:29] paravoid: not coming back, I thought it was doing fsck but I'm unable to ssh to its mgmt at the moment, looking further [19:39:05] oh boy! [19:39:06] :) [19:39:19] mutante, still no #en.wikipedia on irc.wikimedia.org [19:39:24] gadolinium is the socat relay for other udp2log instances :) [19:39:28] and the webstatscollector [19:40:16] PROBLEM - Varnish HTTP bits on cp1070 is CRITICAL: Connection refused [19:40:37] sigh [19:40:56] PROBLEM - puppet last run on cp1070 is CRITICAL: CRITICAL: Puppet has 1 failures [19:40:59] how much time has it passed? [19:41:08] PROBLEM - HTTPS on cp1070 is CRITICAL: Return code of 255 is out of bounds [19:41:15] we can get remote hands there [19:41:56] Krenair: try now [19:42:10] mutante, working. what'd you do? might be worth documenting it :) [19:42:21] paravoid: 19:20 UTC [19:43:06] Krenair: i'll make a ticket to fix the "comes back on reboot" [19:43:35] PROBLEM - NTP on carbon is CRITICAL: NTP CRITICAL: Offset unknown [19:44:46] PROBLEM - NTP on osm-cp1001 is CRITICAL: NTP CRITICAL: No response from NTP server [19:45:16] PROBLEM - NTP on erbium is CRITICAL: NTP CRITICAL: Offset unknown [19:46:21] (03PS1) 10Springle: depool pc1003 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/186988 [19:46:28] !log rebooting virt1006 [19:46:33] Logged the message, Master [19:46:36] RECOVERY - Varnish HTTP bits on cp1070 is OK: HTTP OK: HTTP/1.1 200 OK - 189 bytes in 0.002 second response time [19:47:42] !log springle Synchronized wmf-config/db-eqiad.php: depool pc1003 (duration: 00m 02s) [19:47:45] Logged the message, Master [19:47:46] RECOVERY - NTP on carbon is OK: NTP OK: Offset -0.003549933434 secs [19:48:26] RECOVERY - NTP on erbium is OK: NTP OK: Offset -0.005519747734 secs [19:48:56] PROBLEM - Host virt1006 is DOWN: PING CRITICAL - Packet loss = 100% [19:49:59] !log springle Synchronized wmf-config/db-eqiad.php: depool pc1003 (duration: 00m 01s) [19:50:03] Logged the message, Master [19:50:15] PROBLEM - Varnishkafka Delivery Errors on cp3010 is CRITICAL: kafka.varnishkafka.kafka_drerr.per_second CRITICAL: 1554.266724 [19:50:16] PROBLEM - Host amssq56 is DOWN: PING CRITICAL - Packet loss = 100% [19:50:49] Krenair, mutante: As far as I know, the channels are supposed to be created as soon as there's activity. [19:51:00] And reboots should work fine... but who knows. [19:51:08] yeah, but there was activity and it wasn't being sent through to ircd [19:51:15] * Fiona nods. [19:51:21] Fiona: the channels are created as soon as the bot yes [19:51:37] ircd didnt come back up via puppet for some reason we should fix [19:51:46] RECOVERY - Host amssq56 is UP: PING OK - Packet loss = 0%, RTA = 96.87 ms [19:51:46] and then the bot needed a restart too [19:52:03] The bot is separate from ircd? 
[19:52:13] yea [19:52:26] bot = service ircecho [19:52:36] RECOVERY - Host virt1006 is UP: PING OK - Packet loss = 0%, RTA = 1.01 ms [19:52:43] server = sudo -u irc /usr/bin/ircd [19:52:57] so if ircd is up but the bot is not, you can connect to ircd but there are 0 channels [19:53:02] puppet is supposed to keep both ircecho and ircd running [19:53:07] Hmm, we're using ircecho for irc.wikimedia.org? I thought that was just for relaying tail -f to IRC. [19:53:26] PROBLEM - NTP on cp1070 is CRITICAL: NTP CRITICAL: Offset unknown [19:54:17] python /usr/local/bin/udpmxircecho.py rc-pmtpa localhost [19:54:26] PROBLEM - Varnish HTTP text-frontend on amssq56 is CRITICAL: Connection refused [19:54:26] PROBLEM - Varnishkafka log producer on amssq56 is CRITICAL: PROCS CRITICAL: 0 processes with command name varnishkafka [19:54:27] PROBLEM - Varnish traffic logger on amssq56 is CRITICAL: PROCS CRITICAL: 0 processes with command name varnishncsa [19:54:38] Oh, different ircecho. [19:55:57] Fiona: well, the docs say /usr/local/bin/start-ircbot but we dont have that anymore [19:56:26] RECOVERY - Varnishkafka Delivery Errors on cp3010 is OK: kafka.varnishkafka.kafka_drerr.per_second OKAY: 0.0 [19:56:27] mutante: the docs also say pmtpa :p [19:56:45] RECOVERY - NTP on cp1070 is OK: NTP OK: Offset 0.009072542191 secs [19:56:47] https://github.com/wikimedia/operations-debs-ircecho/blob/master/ircecho is the other ircecho. [19:57:03] JohnLewis: better yet, "don't check it in to CVS" :) [19:57:09] PROBLEM - puppet last run on amssq56 is CRITICAL: CRITICAL: Puppet has 1 failures [19:57:38] Indeed :p [19:57:46] RECOVERY - Varnish HTTP text-frontend on amssq56 is OK: HTTP OK: HTTP/1.1 200 OK - 286 bytes in 0.198 second response time [19:57:46] RECOVERY - Varnishkafka log producer on amssq56 is OK: PROCS OK: 1 process with command name varnishkafka [19:58:23] hey all, anyone know why i'm unable to do anything ssh-related? [19:58:43] !log springle Synchronized wmf-config/db-eqiad.php: depool pc1003 (duration: 00m 07s) [19:58:49] Logged the message, Master [19:58:58] uh...hm. I can't ssh to tin either dbrant [19:59:17] dbrant: Krenair some services are being restarted [19:59:18] mutante ^ a new point has been raised :p [19:59:20] err [19:59:20] hosts [19:59:30] Krenair: dbrant which host? [19:59:32] tin? [19:59:35] none [19:59:37] I tried tin, terbium, mw1001 [19:59:45] RECOVERY - Keyholder SSH agent on tin is OK: OK: Keyholder is armed with all configured keys. [19:59:51] ok I could get to tin [19:59:52] ^ that ? [20:00:06] but I am using the ops host [20:00:14] Krenair: dbrant which bastion are you using? [20:00:51] don't think I've changed my ssh config... bast1001 [20:00:59] yep, bast1001 [20:01:15] ah ok, that's actually the step it's failing at. bast1001 [20:01:30] yup, [20:02:06] qchris: fyi i am working on bringing up a replacement node for gadolinium [20:02:19] Whoa. Cool. [20:02:36] RECOVERY - puppet last run on cp1070 is OK: OK: Puppet is currently enabled, last run 27 seconds ago with 0 failures [20:02:50] Generating pagecounts-raw through hive is working. So we can backfill from that. Same for the tsvs. [20:03:04] awesome ;) [20:03:10] haha, just in time :) [20:03:18] Totally :-D [20:03:21] !log reboot pc1003 [20:03:22] qchris, /mnt/hdfs is being really finicky lately [20:03:22] i [20:03:25] Logged the message, Master [20:03:29] i'm worried about our reliance on it [20:04:13] * qchris recalls discussions about fuse :-) [20:04:30] But it has been stable for me in the European morning. 
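To pull the ircd thread together: irc.wikimedia.org needs two separate pieces on argon, the IRC daemon itself and the udpmxircecho bot that turns the UDP recent-changes feed into channel traffic, and puppet is expected to keep both alive. A minimal sketch of checking and restarting them by hand, using the names quoted above; the "ircd" init script name is an assumption:

    # is the IRC daemon up and listening on the usual port?
    pgrep -fl '/usr/bin/ircd'
    ss -tlnp 2>/dev/null | grep ':6667 '

    # is the RC bot running? if not, the server answers but has 0 channels
    pgrep -fl 'udpmxircecho.py'

    # restart the pieces ("bot = service ircecho" per the discussion above;
    # the ircd service name is an assumption), then let puppet converge
    sudo service ircd start
    sudo service ircecho restart
    sudo puppet agent --test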
[20:06:41] yeah, something is funky [20:07:40] try bast1001 again [20:07:46] works for me now, after puppet run [20:07:48] as non-root [20:07:58] PROBLEM - NTP on amssq56 is CRITICAL: NTP CRITICAL: Offset unknown [20:07:58] PROBLEM - NTP on virt1006 is CRITICAL: NTP CRITICAL: Offset unknown [20:09:00] or not.. ro filesystem [20:09:34] dbrant, try now? [20:09:45] RECOVERY - Varnish traffic logger on amssq56 is OK: PROCS OK: 2 processes with command name varnishncsa [20:10:09] Krenair: working! [20:10:15] PROBLEM - DPKG on analytics1022 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [20:10:15] RECOVERY - puppet last run on amssq56 is OK: OK: Puppet is currently enabled, last run 17 seconds ago with 0 failures [20:10:16] PROBLEM - Kafka Broker Under Replicated Partitions on analytics1018 is CRITICAL: kafka.server.ReplicaManager.UnderReplicatedPartitions.Value CRITICAL: 12.0 [20:11:06] RECOVERY - NTP on amssq56 is OK: NTP OK: Offset 0.006764292717 secs [20:11:15] RECOVERY - puppet last run on bast1001 is OK: OK: Puppet is currently enabled, last run 30 seconds ago with 0 failures [20:11:15] RECOVERY - DPKG on analytics1022 is OK: All packages OK [20:11:25] PROBLEM - Kafka Broker Under Replicated Partitions on analytics1021 is CRITICAL: kafka.server.ReplicaManager.UnderReplicatedPartitions.Value CRITICAL: 12.0 [20:11:43] hah, jgage^ just curious, [20:11:56] i assume you did election when restarting brokers? so an21 was back as a leader for some topics? [20:12:06] RECOVERY - NTP on virt1006 is OK: NTP OK: Offset 0.02229857445 secs [20:12:15] PROBLEM - Kafka Broker Under Replicated Partitions on analytics1012 is CRITICAL: kafka.server.ReplicaManager.UnderReplicatedPartitions.Value CRITICAL: 12.0 [20:12:28] hmmm [20:12:35] that doesn't sound good [20:14:31] sorry, the downtime i scheduled has expired. rebooting analytics1022 now, then all hadoop kakfa brokers will be done. [20:14:48] ok phew [20:15:36] PROBLEM - Host analytics1022 is DOWN: PING CRITICAL - Packet loss = 100% [20:17:06] RECOVERY - Host analytics1022 is UP: PING WARNING - Packet loss = 66%, RTA = 6.42 ms [20:21:06] qchris: godog, cmjohnson1. protactinium 100% doing what gadolinium used to. :) [20:21:23] yay for easy puppet application! [20:21:25] ottomata: Awesome! [20:25:44] ottomata: sweet! thanks that was quick! [20:26:18] cool...equinix should have it up soon. just got off the phone with them [20:26:30] not that it matters much [20:26:34] RECOVERY - Kafka Broker Under Replicated Partitions on analytics1021 is OK: kafka.server.ReplicaManager.UnderReplicatedPartitions.Value OKAY: 0.0 [20:27:24] RECOVERY - Kafka Broker Under Replicated Partitions on analytics1012 is OK: kafka.server.ReplicaManager.UnderReplicatedPartitions.Value OKAY: 0.0 [20:28:12] ok all analytics kafka brokers are rebooted, ISRs synced, leaders balanced. 
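The UnderReplicatedPartitions alerts above (12, then back to 0) are the expected pattern for rolling broker reboots: followers drop out of the ISR while a broker is down, catch back up, and leadership then has to be moved back to the preferred replicas, which is what the "election" question refers to. A rough sketch with the stock Kafka 0.8-era tools; the ZooKeeper connection string is a placeholder, not the real one:

    ZK='zookeeper-host:2181/kafka'   # placeholder connection string

    # partitions whose ISR is still smaller than the full replica set
    kafka-topics.sh --describe --under-replicated-partitions --zookeeper "$ZK"

    # once ISRs have caught up, hand leadership back to the preferred replicas
    kafka-preferred-replica-election.sh --zookeeper "$ZK"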
[20:28:43] RECOVERY - Kafka Broker Under Replicated Partitions on analytics1018 is OK: kafka.server.ReplicaManager.UnderReplicatedPartitions.Value OKAY: 0.0 [20:32:15] !log springle Synchronized wmf-config/db-eqiad.php: repool pc1003 (duration: 00m 05s) [20:32:21] Logged the message, Master [20:33:54] PROBLEM - NTP on analytics1022 is CRITICAL: NTP CRITICAL: Offset unknown [20:35:14] RECOVERY - HTTPS on cp1070 is OK: SSLXNN OK - 36 OK [20:36:04] RECOVERY - NTP on analytics1022 is OK: NTP OK: Offset 0.004068255424 secs [20:40:09] !log rebooting virt1001 [20:40:14] Logged the message, Master [20:42:13] RECOVERY - Host gadolinium is UP: PING OK - Packet loss = 0%, RTA = 2.76 ms [20:42:21] ottomata: gadolinium is back, do we need to take any action? [20:43:04] PROBLEM - salt-minion processes on virt1001 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/salt-minion [20:43:43] PROBLEM - SSH on virt1001 is CRITICAL: Connection timed out [20:45:15] PROBLEM - Host virt1001 is DOWN: PING CRITICAL - Packet loss = 100% [20:47:24] PROBLEM - puppet last run on analytics1027 is CRITICAL: CRITICAL: Puppet has 2 failures [20:48:24] RECOVERY - Host virt1001 is UP: PING OK - Packet loss = 0%, RTA = 1.42 ms [20:48:24] RECOVERY - salt-minion processes on virt1001 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [20:48:54] RECOVERY - SSH on virt1001 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1.4 (protocol 2.0) [20:49:04] PROBLEM - Host stat1001 is DOWN: PING CRITICAL - Packet loss = 100% [20:49:34] RECOVERY - Host stat1001 is UP: PING OK - Packet loss = 0%, RTA = 1.80 ms [20:51:53] PROBLEM - DPKG on lanthanum is CRITICAL: Timeout while attempting connection [20:52:03] PROBLEM - ElasticSearch health check for shards on logstash1003 is CRITICAL: CRITICAL - elasticsearch inactive shards 23 threshold =0.1% breach: {ustatus: uyellow, unumber_of_nodes: 3, uunassigned_shards: 21, utimed_out: False, uactive_primary_shards: 41, ucluster_name: uproduction-logstash-eqiad, urelocating_shards: 0, uactive_shards: 100, uinitializing_shards: 2, unumber_of_data_nodes: 3} [20:52:13] PROBLEM - ElasticSearch health check for shards on logstash1001 is CRITICAL: CRITICAL - elasticsearch inactive shards 18 threshold =0.1% breach: {ustatus: uyellow, unumber_of_nodes: 3, uunassigned_shards: 16, utimed_out: False, uactive_primary_shards: 41, ucluster_name: uproduction-logstash-eqiad, urelocating_shards: 0, uactive_shards: 105, uinitializing_shards: 2, unumber_of_data_nodes: 3} [20:52:24] that's me, just rebooted logstash1002 [20:53:04] RECOVERY - ElasticSearch health check for shards on logstash1003 is OK: OK - elasticsearch status production-logstash-eqiad: status: yellow, number_of_nodes: 3, unassigned_shards: 2, timed_out: False, active_primary_shards: 41, cluster_name: production-logstash-eqiad, relocating_shards: 0, active_shards: 119, initializing_shards: 2, number_of_data_nodes: 3 [20:53:13] RECOVERY - ElasticSearch health check for shards on logstash1001 is OK: OK - elasticsearch status production-logstash-eqiad: status: yellow, number_of_nodes: 3, unassigned_shards: 2, timed_out: False, active_primary_shards: 41, cluster_name: production-logstash-eqiad, relocating_shards: 0, active_shards: 119, initializing_shards: 2, number_of_data_nodes: 3 [20:53:14] ACKNOWLEDGEMENT - NTP on logstash1002 is CRITICAL: NTP CRITICAL: Offset unknown Jeff Gage libc6 reboot [20:53:14] PROBLEM - Host lanthanum is DOWN: PING CRITICAL - Packet loss = 100% [20:53:54] RECOVERY - Host 
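The logstash Elasticsearch checks above flip to yellow because rebooting logstash1002 leaves shards unassigned until the node rejoins and replicas are re-allocated, then they recover on their own. A quick way to watch the same thing by hand from one of the logstash hosts, assuming Elasticsearch is on its default port 9200:

    # cluster status, node count and unassigned shard count
    curl -s 'http://localhost:9200/_cluster/health?pretty'

    # per-shard view: anything not STARTED is still initializing or unassigned
    curl -s 'http://localhost:9200/_cat/shards' | grep -v STARTED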
lanthanum is UP: PING OK - Packet loss = 0%, RTA = 3.85 ms [20:53:54] RECOVERY - DPKG on lanthanum is OK: All packages OK [20:55:13] PROBLEM - Host gallium is DOWN: PING CRITICAL - Packet loss = 100% [20:55:24] PROBLEM - DPKG on fluorine is CRITICAL: DPKG CRITICAL dpkg reports broken packages [20:57:34] RECOVERY - DPKG on fluorine is OK: All packages OK [20:58:04] PROBLEM - NTP on gadolinium is CRITICAL: NTP CRITICAL: Offset unknown [20:58:06] !log rebooting virt1002 [20:58:12] Logged the message, Master [20:59:56] PROBLEM - salt-minion processes on virt1002 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/salt-minion [23:31:45] PROBLEM - puppet last run on cp1057 is CRITICAL: Connection refused by host [23:31:54] PROBLEM - Disk space on cp1057 is CRITICAL: Connection refused by host [23:32:04] PROBLEM - configured eth on cp1057 is CRITICAL: Connection refused by host [23:32:15] PROBLEM - dhclient process on cp1057 is CRITICAL: Connection refused by host [23:32:15] PROBLEM - DPKG on cp1057 is CRITICAL: Connection refused by host [23:38:44] PROBLEM - NTP on cp3016 is CRITICAL: NTP CRITICAL: No response from NTP server [23:39:36] i have problems on labs with incorrect db data, but now i see that api from production wiki is also returning the same wrong info. so db replication must be broken in production, too [23:40:24] PROBLEM - Host cp1068 is DOWN: PING CRITICAL - Packet loss = 100% [23:41:45] Merlissimo: can you file a bug with more info? [23:42:14] RECOVERY - Host cp1068 is UP: PING OK - Packet loss = 0%, RTA = 1.55 ms [23:42:26] example: no categories altough in source an shown in rendered version: http://de.wikipedia.org/w/api.php?action=query&pageids=1106087&prop=categories [23:44:14] PROBLEM - Varnish HTCP daemon on cp1068 is CRITICAL: Connection refused by host [23:44:14] PROBLEM - salt-minion processes on cp1068 is CRITICAL: Connection refused by host [23:44:23] PROBLEM - HTTPS on cp1068 is CRITICAL: Return code of 255 is out of bounds [23:44:24] PROBLEM - Varnish HTTP text-frontend on cp1068 is CRITICAL: Connection refused [23:44:33] PROBLEM - DPKG on cp1068 is CRITICAL: Connection refused by host [23:44:34] PROBLEM - Varnish HTTP text-backend on cp1068 is CRITICAL: Connection refused [23:44:43] PROBLEM - Varnishkafka log producer on cp1068 is CRITICAL: Connection refused by host [23:44:53] PROBLEM - configured eth on cp1068 is CRITICAL: Connection refused by host [23:44:54] PROBLEM - dhclient process on cp1068 is CRITICAL: Connection refused by host [23:45:04] PROBLEM - RAID on cp1068 is CRITICAL: Connection refused by host [23:45:13] PROBLEM - puppet last run on cp1068 is CRITICAL: Connection refused by host [23:45:13] PROBLEM - Varnish traffic logger on cp1068 is CRITICAL: Connection refused by host [23:45:14] PROBLEM - Disk space on cp1068 is CRITICAL: Connection refused by host [23:53:44] PROBLEM - Host amssq62 is DOWN: PING CRITICAL - Packet loss = 100% [23:54:43] RECOVERY - Host amssq62 is UP: PING OK - Packet loss = 0%, RTA = 94.86 ms [23:56:38] !log cp3016 out of service for now, needs reinstall (precise!) 
[23:56:41] Logged the message, Master [23:56:54] PROBLEM - DPKG on amssq62 is CRITICAL: Connection refused by host [23:57:13] PROBLEM - Disk space on amssq62 is CRITICAL: Connection refused by host [23:57:14] PROBLEM - Varnish traffic logger on amssq62 is CRITICAL: Connection refused by host [23:57:14] PROBLEM - Varnish HTCP daemon on amssq62 is CRITICAL: Connection refused by host [23:57:14] PROBLEM - dhclient process on amssq62 is CRITICAL: Connection refused by host [23:57:14] PROBLEM - HTTPS on amssq62 is CRITICAL: Return code of 255 is out of bounds [23:57:14] PROBLEM - salt-minion processes on amssq62 is CRITICAL: Connection refused by host [23:57:14] PROBLEM - puppet last run on amssq62 is CRITICAL: Connection refused by host [23:57:23] PROBLEM - Varnish HTTP text-frontend on amssq62 is CRITICAL: Connection refused [23:57:23] PROBLEM - Varnish HTTP text-backend on amssq62 is CRITICAL: Connection refused [23:57:34] PROBLEM - Varnishkafka log producer on amssq62 is CRITICAL: Connection refused by host [23:57:34] PROBLEM - RAID on amssq62 is CRITICAL: Connection refused by host [23:57:35] PROBLEM - configured eth on amssq62 is CRITICAL: Connection refused by host [23:57:35] PROBLEM - NTP on cp1068 is CRITICAL: NTP CRITICAL: Offset unknown
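On Merlissimo's stale-data report above (categories missing from the API although visible in the rendered page), one cheap first check is whether the database slaves are simply lagging rather than replication being broken outright; MediaWiki exposes per-slave lag through the same API. A small sketch against the wiki from the example URL:

    # replication lag, in seconds, for every slave serving dewiki
    curl -s 'http://de.wikipedia.org/w/api.php?action=query&meta=siteinfo&siprop=dbrepllag&sishowalldb=1&format=json'

    # re-run the original query, asking the API to refuse if slaves lag badly
    curl -s 'http://de.wikipedia.org/w/api.php?action=query&pageids=1106087&prop=categories&maxlag=5&format=json'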